## Defining NEW Refined Low Confidence (RLC) Regions of the Mtb genome

### Maximillian Marin (mgmarin@g.harvard.edu)

### Goal: To output (in BED format) the NEW refined low confidence regions of the Mtb genome.

We are defining the RLC regions as follows: <br>
   A) Genes and intergenic regions with the highest density of false positives (FPs)

   B) Regions with Empirical Base Pair Recall (EBR) < 0.9
    
   C) Regions whose ground truth was frequently "ambiguously defined" due to duplication or high sequence divergence from H37Rv.

**NOTE: Exclusion of these regions does not necessarily mean that Illumina WGS can never accurately call variants in these regions. Instead, these definitions highlight regions with evidence of difficult and inconsistent analysis.**

This approach is conservative: it attempts to minimize the greatest sources of error and to remove regions where, because of divergence from the reference genome, it is difficult to confidently define a ground truth even with a long-read assembly.

“The refined low confidence regions are defined to account for the largest sources of error and uncertainty in analysis of Illumina WGS. The refined low confidence regions contain: a) all false positives variant hotspots identified (50 kb), b) poorly recalled positions as identified using Empirical Base Pair Recall, and c) regions containing strong sequence divergence or CNV relative to H37Rv.”
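The final RLC set is the union of the three categories above. As a minimal sketch of that union step (not the notebook's actual implementation, which relies on `bedtools`), the helper below merges overlapping or book-ended BED-style intervals. The function name `merge_intervals` and the toy coordinates are illustrative assumptions, not values from the real BED files.

```python
from typing import Iterable, List, Tuple

# BED-style interval: (chrom, start, end), 0-based and half-open
Interval = Tuple[str, int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or book-ended intervals on each chromosome."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of starting a new one
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical toy coordinates for categories A, B and C
fp_hotspots = [("NC_000962.3", 100, 200)]
low_ebr = [("NC_000962.3", 150, 260), ("NC_000962.3", 500, 510)]
ambiguous = [("NC_000962.3", 260, 300)]

rlc = merge_intervals(fp_hotspots + low_ebr + ambiguous)
rlc_bp = sum(end - start for _, start, end in rlc)
print(rlc, rlc_bp)
```

Merging before summing lengths avoids double counting bases that fall into more than one category.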



In [1]:
import numpy as np
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from scipy import stats
# import gffutils

%matplotlib inline

#### Pandas Viewing Settings

In [2]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## Define Directories

In [3]:
PB_Vs_Illumina_DataAnalysis_Dir = "../../../210112_PBvsI_VCeval_AnalysisDir_V7_36CI"



Genmap_Map_AnalysisDir = PB_Vs_Illumina_DataAnalysis_Dir + "/201027_Genmap_Mappability_H37rv_V1"  

FalsePositive_Analysis_V2_Dir = PB_Vs_Illumina_DataAnalysis_Dir + "/210126_FalsePositivesAnalysis_V4"  

PBvIll_EBR_Dir = PB_Vs_Illumina_DataAnalysis_Dir + "/210112_EBR_H37rv_36CI_MM2vsPilon_V7"         



# Parse in BED file annotations of H37Rv genome (PLC groups)

In [4]:
RepoRef_Dir = "../../References"
pLC_ExcludedRegionsScheme_RepoRef_Dir = f"{RepoRef_Dir}/pLowConfideceRegions_CoscollaEtAlScheme_Files"


Mtb_H37rv_pLCRegions_Coscolla_BED_PATH = f"{pLC_ExcludedRegionsScheme_RepoRef_Dir}/201027_Mtb_H37rv_pLC_Regions_CoscollaExcludedGenes.bed"

Mtb_H37rv_pLCRegions_Coscolla_Subset_PEPPEs_BED_PATH = f"{pLC_ExcludedRegionsScheme_RepoRef_Dir}/201027_Mtb_H37rv_pLC_Regions_CoscollaExcludedGenes.PEPPEs.bed"

Mtb_H37rv_pLCRegions_Coscolla_Subset_MGEs_BED_PATH = f"{pLC_ExcludedRegionsScheme_RepoRef_Dir}/201027_Mtb_H37rv_pLC_Regions_CoscollaExcludedGenes.MGEs.bed"

Mtb_H37rv_pLCRegions_Coscolla_Subset_RepetitiveGenes_BED_PATH = f"{pLC_ExcludedRegionsScheme_RepoRef_Dir}/201027_Mtb_H37rv_pLC_Regions_CoscollaExcludedGenes.RepetitiveGenes.bed"

Mtb_H37rv_pLCRegions_Only_PE_PGRS_And_PPE_MPTR_BED_PATH = f"{pLC_ExcludedRegionsScheme_RepoRef_Dir}/201027_Mtb_H37rv_pLC_Regions_Subset_For_PE_PGRS_And_PPE_MPTR_Only_85_genes.bed"

Mtb_H37rv_pLCRegions_Coscolla_BED_MERGED_PATH = f"{pLC_ExcludedRegionsScheme_RepoRef_Dir}/201027_Mtb_H37rv_pLC_Regions_CoscollaExcludedRegion.Merged.bed"

Mtb_H37rv_HighConfidenceRegions_NONCoscollaRegions_BED_PATH = f"{pLC_ExcludedRegionsScheme_RepoRef_Dir}/201027_Mtb_H37rv_HighConfidence_Regions_NONpLC_Regions.bed"


In [5]:
!head $Mtb_H37rv_pLCRegions_Coscolla_BED_PATH

NC_000962.3	33581	33794	Rv0031	Rv0031	InsertionSeqs_And_Phages	None
NC_000962.3	103709	104663	Rv0094c	Rv0094c	InsertionSeqs_And_Phages	None
NC_000962.3	104804	105215	Rv0095c	Rv0095c	InsertionSeqs_And_Phages	None
NC_000962.3	105323	106715	Rv0096	PPE1	PE/PPEs	PPE_SL-2_PPE-PPW
NC_000962.3	131381	132872	Rv0109	PE_PGRS1	PE/PPEs	PE_V_PGRS
NC_000962.3	149532	150996	Rv0124	PE_PGRS2	PE/PPEs	PE_V_PGRS
NC_000962.3	177542	179309	Rv0151c	PE1	PE/PPEs	PE_V_
NC_000962.3	179318	180896	Rv0152c	PE2	PE/PPEs	PE_V_
NC_000962.3	187432	188839	Rv0159c	PE3	PE/PPEs	PE_V_
NC_000962.3	188930	190439	Rv0160c	PE4	PE/PPEs	PE_V_


In [6]:
#HighConfidenceRegions_BED_DF = pd.read_csv(Mtb_H37rv_HighConfidenceRegions_NONCoscollaRegions_BED_PATH, sep = "\t", header = None)
#HighConfidenceRegions_BED_DF.columns = ["Chrom", "Start", "End"]
#HighConfidenceRegions_BED_DF["Length"] = HighConfidenceRegions_BED_DF["End"] - HighConfidenceRegions_BED_DF["Start"]
#HighConfidenceRegions_BED_DF.shape

In [7]:
#HighConfidenceRegions_BED_DF.head()

In [8]:
LowConfidenceRegions_BED_DF = pd.read_csv(Mtb_H37rv_pLCRegions_Coscolla_BED_MERGED_PATH, sep = "\t", header = None)
LowConfidenceRegions_BED_DF.columns = ["Chrom", "Start", "End"]
LowConfidenceRegions_BED_DF["Length"] = LowConfidenceRegions_BED_DF["End"] - LowConfidenceRegions_BED_DF["Start"]
LowConfidenceRegions_BED_DF.shape

(330, 4)

In [9]:
#LowConfidenceRegions_BED_DF.head()

## Read back in the aggregated EBR-36CI array (NPZ)

In [10]:
PB_Vs_Illumina_DataAnalysis_Dir = "../../../210112_PBvsI_VCeval_AnalysisDir_V7_36CI"

# Define directory for EBR analysis data
PBvIll_EBR_Dir = PB_Vs_Illumina_DataAnalysis_Dir + "/210112_EBR_H37rv_36CI_MM2vsPilon_V7"         

# Parse in aggregated EBR-36CI array
EBR_36CI_WGS40X_NPZ_PATH = f"{PBvIll_EBR_Dir}/210112_EBR_V7_36CI.npz"

EBR_36CI_Array_A4 = np.load(EBR_36CI_WGS40X_NPZ_PATH)["arr_0"]



In [11]:
#dictOf_EBR_IndivIsolate_NPYs.keys()

In [12]:
EBR_36CI_Array_A4.shape[0]

4411532

In [13]:
np.nanmean(EBR_36CI_Array_A4)

0.9886257598437438

## Read back in EBR and Pmappability H37Rv gene level analysis

In [14]:
#Repo_DataDir = "../../Data"


FeatureLevelAnalysis_Dir_O2 = PBvIll_EBR_Dir + "/210113_H37Rv_FeatureLevelAnalysis_EBR_Pmap" 


H37Rv_FeatureLevelAnalysis_EBR_Pmap_TSV_O2_Repo = f"{FeatureLevelAnalysis_Dir_O2}/H37Rv_FeatureLevelAnalysis.EBR_And_Pmap.tsv"
H37Rv_GeneLevelAnalysis_EBR_Pmap_TSV_O2_Repo = f"{FeatureLevelAnalysis_Dir_O2}/H37Rv_FeatureLevelAnalysis.EBR_And_Pmap.Genes.tsv"
H37Rv_IntergenicLevelAnalysis_EBR_Pmap_TSV_O2_Repo = f"{FeatureLevelAnalysis_Dir_O2}/H37Rv_FeatureLevelAnalysis.EBR_And_Pmap.IntergenicRegions.tsv"

FLA_DF = pd.read_csv(H37Rv_FeatureLevelAnalysis_EBR_Pmap_TSV_O2_Repo, sep = "\t",)

GLA_DF = pd.read_csv(H37Rv_GeneLevelAnalysis_EBR_Pmap_TSV_O2_Repo, sep = "\t",)

# A) Parsing in Genomic Regions Annotated by False Positives SNSs

In [15]:
All_FPs_SNPs_FiltMQ30_VCF_PATH = f"{FalsePositive_Analysis_V2_Dir}/PMP_36CI.SNPs.All.FPs.FiltMQ30.vcf"
All_FPs_SNPs_VCF_PATH = f"{FalsePositive_Analysis_V2_Dir}/PMP_36CI.SNPs.All.FPs.vcf"


H37rv_AllRegions_AnnoBy_All_FPs_PATH = f"{FalsePositive_Analysis_V2_Dir}/200901_Mtb_H37rv_AllRegions_Info.SNPs.All.FPs.bed" 
H37rv_AllRegions_AnnoBy_All_FPs_FiltMQ30_PATH = f"{FalsePositive_Analysis_V2_Dir}/200901_Mtb_H37rv_AllRegions_Info.SNPs.FPs.FiltMQ30.bed" 


In [16]:
AllRegions_AnnoByFPs_FiltMQ30_DF = pd.read_csv(H37rv_AllRegions_AnnoBy_All_FPs_FiltMQ30_PATH, sep="\t", header=None)

AllRegions_AnnoByFPs_FiltMQ30_DF.columns = ['Chrom', 'Start', 'End',
                                    'Strand', 'H37rv_GeneID', 'Symbol',
                                    'ExcludedGroup_Category', 'PEandPPE_Subfamily',
                                    'Functional_Category', "FP_Count"] 
                                   
AllRegions_AnnoByFPs_FiltMQ30_DF = AllRegions_AnnoByFPs_FiltMQ30_DF.sort_values("Start")
AllRegions_AnnoByFPs_FiltMQ30_DF["Length"] = AllRegions_AnnoByFPs_FiltMQ30_DF["End"] - AllRegions_AnnoByFPs_FiltMQ30_DF["Start"]

# Select for the top 30 regions based on the total # of FPs detected across all 36 isolates
Regions_Top30SourcesOfFPs_FiltMQ30_DF = AllRegions_AnnoByFPs_FiltMQ30_DF.sort_values("FP_Count", ascending=False).head(30)


AllRegions_AnnoByFPs_FiltMQ30_DF.head()

Unnamed: 0,Chrom,Start,End,Strand,H37rv_GeneID,Symbol,ExcludedGroup_Category,PEandPPE_Subfamily,Functional_Category,FP_Count,Length
271,NC_000962.3,0,1524,+,Rv0001,dnaA,NotExcluded,,information pathways,0,1524
272,NC_000962.3,1524,2051,,IntergenicRegion_1_Rv0001-Rv0002,,Intergenic,,Intergenic,0,527
273,NC_000962.3,2051,3260,+,Rv0002,dnaN,NotExcluded,,information pathways,0,1209
274,NC_000962.3,3260,3279,,IntergenicRegion_2_Rv0002-Rv0003,,Intergenic,,Intergenic,0,19
275,NC_000962.3,3279,4437,+,Rv0003,recF,NotExcluded,,information pathways,0,1158


#### Look at top 10 regions (genes + intergenic regions) in terms of # of FPs

In [17]:
Regions_Top30SourcesOfFPs_FiltMQ30_DF.head(10)

Unnamed: 0,Chrom,Start,End,Strand,H37rv_GeneID,Symbol,ExcludedGroup_Category,PEandPPE_Subfamily,Functional_Category,FP_Count,Length
6286,NC_000962.3,3845970,3847164,,IntergenicRegion_2683_Rv3428c-Rv3429,,Intergenic,,Intergenic,142,1194
2840,NC_000962.3,1636003,1638229,-,Rv1452c,PE_PGRS28,PE/PPEs,PE_V_PGRS,PE/PPE,59,2226
3645,NC_000962.3,2162931,2167311,-,Rv1917c,PPE34,PE/PPEs,PPE_SL-5_PPE-MPTR,PE/PPE,35,4380
138,NC_000962.3,1981613,1984775,-,Rv1753c,PPE24,PE/PPEs,PPE_SL-5_PPE-MPTR,PE/PPE,24,3162
6428,NC_000962.3,3941723,3944963,+,Rv3512,PE_PGRS56,PE/PPEs,PE_V_PGRS,PE/PPE,16,3240
5393,NC_000962.3,3232506,3232870,,IntergenicRegion_2285_Rv2920c-Rv2921c,,Intergenic,,Intergenic,16,364
30,NC_000962.3,3931004,3936710,+,Rv3508,PE_PGRS54,PE/PPEs,PE_V_PGRS,PE/PPE,16,5706
5574,NC_000962.3,3379375,3380452,-,Rv3021c,PPE47,PE/PPEs,,PE/PPE,14,1077
186,NC_000962.3,2867123,2867786,+,Rv2544,lppB,Coscolla Repetitive Genes,,cell wall and cell processes,14,663
245,NC_000962.3,3945793,3950263,+,Rv3514,PE_PGRS57,PE/PPEs,PE_V_PGRS,PE/PPE,14,4470


In [18]:
AllRegions_AnnoByFPs_FiltMQ30_DF.sort_values("FP_Count", ascending=False).head(10)["FP_Count"].sum() / AllRegions_AnnoByFPs_FiltMQ30_DF["FP_Count"].sum()   

0.6386861313868614

In [19]:
AllRegions_AnnoByFPs_FiltMQ30_DF.sort_values("FP_Count", ascending=False).head(20)["FP_Count"].sum() / AllRegions_AnnoByFPs_FiltMQ30_DF["FP_Count"].sum()   

0.801094890510949

In [20]:
AllRegions_AnnoByFPs_FiltMQ30_DF.sort_values("FP_Count", ascending=False).head(30)["FP_Count"].sum() / AllRegions_AnnoByFPs_FiltMQ30_DF["FP_Count"].sum()

0.8941605839416058
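The three cells above repeat the same calculation for the top 10, 20 and 30 regions. A small helper can express it once; this is a sketch on toy data, and the name `top_n_share` is a hypothetical helper rather than part of the original analysis.

```python
import pandas as pd


def top_n_share(df: pd.DataFrame, n: int, count_col: str = "FP_Count") -> float:
    """Fraction of all counts contributed by the n highest-count rows."""
    ranked = df.sort_values(count_col, ascending=False)
    return ranked[count_col].head(n).sum() / df[count_col].sum()


# Toy table standing in for AllRegions_AnnoByFPs_FiltMQ30_DF
toy = pd.DataFrame({"Region": list("abcde"), "FP_Count": [50, 30, 10, 5, 5]})

shares = {n: top_n_share(toy, n) for n in (1, 2, 3)}
print(shares)
```

On the real table, the call would presumably be `top_n_share(AllRegions_AnnoByFPs_FiltMQ30_DF, 30)`.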

In [21]:
Regions_Top30SourcesOfFPs_FiltMQ30_DF

Unnamed: 0,Chrom,Start,End,Strand,H37rv_GeneID,Symbol,ExcludedGroup_Category,PEandPPE_Subfamily,Functional_Category,FP_Count,Length
6286,NC_000962.3,3845970,3847164,,IntergenicRegion_2683_Rv3428c-Rv3429,,Intergenic,,Intergenic,142,1194
2840,NC_000962.3,1636003,1638229,-,Rv1452c,PE_PGRS28,PE/PPEs,PE_V_PGRS,PE/PPE,59,2226
3645,NC_000962.3,2162931,2167311,-,Rv1917c,PPE34,PE/PPEs,PPE_SL-5_PPE-MPTR,PE/PPE,35,4380
138,NC_000962.3,1981613,1984775,-,Rv1753c,PPE24,PE/PPEs,PPE_SL-5_PPE-MPTR,PE/PPE,24,3162
6428,NC_000962.3,3941723,3944963,+,Rv3512,PE_PGRS56,PE/PPEs,PE_V_PGRS,PE/PPE,16,3240
5393,NC_000962.3,3232506,3232870,,IntergenicRegion_2285_Rv2920c-Rv2921c,,Intergenic,,Intergenic,16,364
30,NC_000962.3,3931004,3936710,+,Rv3508,PE_PGRS54,PE/PPEs,PE_V_PGRS,PE/PPE,16,5706
5574,NC_000962.3,3379375,3380452,-,Rv3021c,PPE47,PE/PPEs,,PE/PPE,14,1077
186,NC_000962.3,2867123,2867786,+,Rv2544,lppB,Coscolla Repetitive Genes,,cell wall and cell processes,14,663
245,NC_000962.3,3945793,3950263,+,Rv3514,PE_PGRS57,PE/PPEs,PE_V_PGRS,PE/PPE,14,4470


In [22]:
Regions_Top30SourcesOfFPs_FiltMQ30_DF.head(2)

Unnamed: 0,Chrom,Start,End,Strand,H37rv_GeneID,Symbol,ExcludedGroup_Category,PEandPPE_Subfamily,Functional_Category,FP_Count,Length
6286,NC_000962.3,3845970,3847164,,IntergenicRegion_2683_Rv3428c-Rv3429,,Intergenic,,Intergenic,142,1194
2840,NC_000962.3,1636003,1638229,-,Rv1452c,PE_PGRS28,PE/PPEs,PE_V_PGRS,PE/PPE,59,2226


In [23]:
Regions_Top30SourcesOfFPs_FiltMQ30_DF["Length"].sum()

65394

In [24]:
Regions_Top30SourcesOfFPs_FiltMQ30_DF["Length"].sum() / 4411532 * 100

1.4823421886093087

### Output BED file for the top 30 regions ranked by # of FPs

In [25]:
Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH = f"{FalsePositive_Analysis_V2_Dir}/210202_Mtb_H37rv.Top30SourcesOfFPs.FiltMQ30.bed" 
Regions_Top30SourcesOfFPs_FiltMQ30_BED_WithHeader_PATH = f"{FalsePositive_Analysis_V2_Dir}/210202_Mtb_H37rv.Top30SourcesOfFPs.FiltMQ30.WithHeader.bed" 


Regions_Top30SourcesOfFPs_FiltMQ30_DF.to_csv(Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH,
                           sep = "\t",
                           index = False,
                           header = False)


Regions_Top30SourcesOfFPs_FiltMQ30_DF.to_csv(Regions_Top30SourcesOfFPs_FiltMQ30_BED_WithHeader_PATH,
                           sep = "\t",
                           index = False,
                           header = True)



In [26]:
!ls -lah $FalsePositive_Analysis_V2_Dir

total 1008K
drwxrwsr-x  2 mm774 farhat  453 Feb  2 23:37 .
drwxrwsr-x 12 mm774 farhat  574 Mar 29 19:36 ..
-rw-rw-r--  1 mm774 farhat 637K Mar 28 22:17 200901_Mtb_H37rv_AllRegions_Info.SNPs.All.FPs.bed
-rw-rw-r--  1 mm774 farhat 637K Mar 28 22:17 200901_Mtb_H37rv_AllRegions_Info.SNPs.FPs.FiltMQ30.bed
-rw-rw-r--  1 mm774 farhat 2.6K Mar 30 13:12 210202_Mtb_H37rv.Top30SourcesOfFPs.FiltMQ30.bed
-rw-rw-r--  1 mm774 farhat 2.7K Mar 30 13:12 210202_Mtb_H37rv.Top30SourcesOfFPs.FiltMQ30.WithHeader.bed
-rw-rw-r--  1 mm774 farhat 665K Mar 28 22:17 H37Rv_AllRegions_GeneAndIntergenic_NumberOf_FPs.FiltMQ30.tsv
-rw-rw-r--  1 mm774 farhat  92K Mar 28 22:17 PMP_36CI.SNPs.All.FPs.FiltMQ30.vcf
-rw-rw-r--  1 mm774 farhat 112K Mar 28 22:17 PMP_36CI.SNPs.All.FPs.vcf


# B) Regions with EBR scores < 0.9

In [27]:
PB_Vs_Illumina_DataAnalysis_Dir = "../../../210112_PBvsI_VCeval_AnalysisDir_V7_36CI"
PBvIll_EBR_Dir = PB_Vs_Illumina_DataAnalysis_Dir + "/210112_EBR_H37rv_36CI_MM2vsPilon_V7"     
PBvsIll_EBR_IndivSample_NPZs = f"{PBvIll_EBR_Dir}/210112_EBR_H37rv_IndividualSampleRecall_NPZs"

EBR_36CI_BED_Below_09_AllPos_PATH = f"{PBvIll_EBR_Dir}/EBR_V7_36CI.Below_0.9.And.Ambiguous.AllPositions.bed"
EBR_36CI_BED_Below_09_Regions_PATH = f"{PBvIll_EBR_Dir}/EBR_V7_36CI.Below_0.9.And.Ambiguous.Regions.bed"


In [28]:
EBR_36CI_WGS40X_BED_DF_Regions_Below09 = pd.read_csv(EBR_36CI_BED_Below_09_Regions_PATH, sep = '\t', header=None)
EBR_36CI_WGS40X_BED_DF_Regions_Below09.columns = ["Chrom", "Start", "End"]
EBR_36CI_WGS40X_BED_DF_Regions_Below09["Length"] = EBR_36CI_WGS40X_BED_DF_Regions_Below09["End"] - EBR_36CI_WGS40X_BED_DF_Regions_Below09["Start"]
EBR_36CI_WGS40X_BED_DF_Regions_Below09.shape

(884, 4)

In [29]:
EBR_36CI_WGS40X_BED_DF_Regions_Below09.head()

Unnamed: 0,Chrom,Start,End,Length
0,NC_000962.3,24720,24738,18
1,NC_000962.3,39029,39030,1
2,NC_000962.3,55548,55647,99
3,NC_000962.3,71464,71584,120
4,NC_000962.3,79569,83035,3466


In [138]:
EBR_36CI_WGS40X_BED_DF_Regions_Below09["Length"].sum()

141960
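The regions BED read above was produced in an earlier step that is not shown here. As a hedged sketch of how a per-base EBR array could be collapsed into such regions, the function below finds contiguous runs of positions that are below the threshold or NaN. Treating NaN as "ambiguous" is an assumption based on the file name (`Below_0.9.And.Ambiguous`), and `low_ebr_regions` is a hypothetical name.

```python
import numpy as np


def low_ebr_regions(ebr, threshold=0.9):
    """Return 0-based, half-open (start, end) runs where EBR < threshold or NaN."""
    ebr = np.asarray(ebr, dtype=float)
    with np.errstate(invalid="ignore"):
        flagged = np.isnan(ebr) | (ebr < threshold)
    # Pad with False so that runs touching either end of the array are closed
    padded = np.concatenate(([False], flagged, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    # Edges alternate between run starts and run ends (exclusive)
    return [(int(s), int(e)) for s, e in zip(edges[::2], edges[1::2])]


# Toy per-base EBR values (hypothetical)
toy_ebr = [1.0, 0.5, np.nan, 1.0, 1.0, 0.89, 1.0]
regions = low_ebr_regions(toy_ebr)
print(regions)
```

Because the intervals are half-open, `end - start` gives the region length directly, matching the `Length` column computed above.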

# C) Regions defined as ambiguous in more than 25% of isolates (10 or more of the 36 isolates)

In [30]:
36 * 0.25

9.0
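Since 25% of 36 isolates is exactly 9, "more than 25%" means at least 10 isolates. The sketch below illustrates that strict threshold on a toy isolates-by-positions matrix. It assumes ambiguous calls are stored as NaN, which is not confirmed by this notebook, and `frequently_ambiguous` is a hypothetical helper.

```python
import numpy as np


def frequently_ambiguous(per_isolate, min_fraction=0.25):
    """Flag positions that are ambiguous (NaN) in MORE than min_fraction of isolates."""
    per_isolate = np.asarray(per_isolate, dtype=float)
    n_isolates = per_isolate.shape[0]
    amb_counts = np.isnan(per_isolate).sum(axis=0)
    return amb_counts > min_fraction * n_isolates


# Smallest isolate count that is strictly more than 25% of 36
min_isolates_36 = int(np.floor(0.25 * 36)) + 1

# Toy matrix: 4 isolates (rows) x 3 positions (columns)
toy = np.array([
    [1.0, np.nan, 1.0],
    [1.0, np.nan, 1.0],
    [np.nan, 1.0, 1.0],
    [1.0, 1.0, 1.0],
])
flags = frequently_ambiguous(toy)
print(min_isolates_36, flags)
```

Position 0 is ambiguous in exactly 25% of the toy isolates and is therefore not flagged, mirroring the strict "more than" rule.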

In [31]:
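# NOTE: PBvIll_EBR_Ambigous_RegionsDir is only defined in the next cell. Run before that,
# the variable is unset and `ls` lists the current working directory (as in the output below).
!ls -lah $PBvIll_EBR_Ambigous_RegionsDir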

total 568K
drwxrwsr-x  4 mm774 farhat  288 Mar 30 13:12 .
drwxrwsr-x 85 mm774 farhat 4.9K Mar 26 23:43 ..
-rw-rw-r--  1 mm774 farhat 133K Mar 30 13:12 210202_3_A_Define_RefinedLowConfidenceRegions_V1.ipynb
-rw-rw-r--  1 mm774 farhat 388K Mar 30 12:27 210219_3_B_HapPy_VC_PR_Curves_V2_UsingMinimap2GroundTruth_PilonVCing_SNPs_And_INDELs_EBR_V7_WiRLCRemoved.ipynb
drwxrwsr-x  2 mm774 farhat  216 Mar 30 12:25 Illumina_VariantCalling_Eval_Plots
drwxrwsr-x  2 mm774 farhat  222 Mar 26 23:44 .ipynb_checkpoints


In [32]:
# Define directory for analysis

PBvIll_EBR_Ambigous_RegionsDir = f"{PBvIll_EBR_Dir}/EBR_36CI_AmbigousRegions"

EBR_36CI_Ambigous_Regions_BED_PATH = f"{PBvIll_EBR_Ambigous_RegionsDir}/EBR_36CI_AmbigousRegions_V1.bed"
EBR_36CI_Ambigous_Regions_WithHeader_BED_PATH = f"{PBvIll_EBR_Ambigous_RegionsDir}/EBR_36CI_AmbigousRegions_V1.WithHeader.bed.tsv"     


EBR_36CI_BED_DF_AMB_ONLY = pd.read_csv(EBR_36CI_Ambigous_Regions_WithHeader_BED_PATH, sep = "\t")


In [33]:
!head -n 6 $EBR_36CI_Ambigous_Regions_WithHeader_BED_PATH

chrom	chromStart	chromEnd	name	score	Length
NC_000962.3	334641	334653	Region996_Length_12_bp	Ambiguous	12
NC_000962.3	334694	334723	Region998_Length_29_bp	Ambiguous	29
NC_000962.3	888762	889020	Region4252_Length_258_bp	Ambiguous	258
NC_000962.3	889033	890373	Region4254_Length_1340_bp	Ambiguous	1340
NC_000962.3	1093947	1094062	Region5514_Length_115_bp	Ambiguous	115


In [34]:
EBR_36CI_BED_DF_AMB_ONLY.head(5)

Unnamed: 0,chrom,chromStart,chromEnd,name,score,Length
0,NC_000962.3,334641,334653,Region996_Length_12_bp,Ambiguous,12
1,NC_000962.3,334694,334723,Region998_Length_29_bp,Ambiguous,29
2,NC_000962.3,888762,889020,Region4252_Length_258_bp,Ambiguous,258
3,NC_000962.3,889033,890373,Region4254_Length_1340_bp,Ambiguous,1340
4,NC_000962.3,1093947,1094062,Region5514_Length_115_bp,Ambiguous,115


### What % of the genome had high levels of ambiguity in EBR analysis? (> 25% of isolates with Amb at position) 

#### Answer: 0.36% (15,813 bp of the genome, largely in MGEs)

In [35]:
EBR_36CI_BED_DF_AMB_ONLY.shape

(24, 6)

In [36]:
EBR_36CI_BED_DF_AMB_ONLY["Length"].sum()

15813

In [37]:
(EBR_36CI_BED_DF_AMB_ONLY["Length"].sum() / 4411532) * 100

0.35844690687951486

# Z) Define positions with P-map-K50-E4 < 1.0 (Bonus region masking for ultra-conservative approach)

## Read back in pickle of Genmap pileup mappability calculations

In [38]:
Genmap_Map_AnalysisDir = PB_Vs_Illumina_DataAnalysis_Dir + "/201027_Genmap_Mappability_H37rv_V1"  
ParsedAndPickled_GenmapOutput = f"{Genmap_Map_AnalysisDir}/201027_ParsedAndPickled_GenmapOutput"

Pickle_PATH_dictOf_GM_PileupMap_Arrays = ParsedAndPickled_GenmapOutput + "/201027_dictOf_GM_PileupMap_Arrays.pickle"   


In [39]:
with open(Pickle_PATH_dictOf_GM_PileupMap_Arrays, "rb") as f:
    dictOf_GM_PileupMap_Arrays = pickle.load(f)

dictOf_GM_PileupMap_Arrays.keys()

dict_keys(['K50_E0', 'K50_E2', 'K50_E4', 'K75_E0', 'K75_E2', 'K75_E4', 'K100_E0', 'K100_E2', 'K100_E4', 'K125_E0', 'K125_E2', 'K125_E4', 'K150_E0', 'K150_E2', 'K150_E4'])

In [40]:
#!ls -1 $Genmap_Map_AnalysisDir/ 

In [41]:
PMap_K50E4_200730_BED_BELOW_1_PATH = f"{Genmap_Map_AnalysisDir}/201027_PMap_K50E4_BaseLevelInfo_BELOW_1.bed"

PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH = f"{Genmap_Map_AnalysisDir}/201027_PMap_K50E4_Regions_BELOW_1.bed"     


In [42]:
!head -n 3 $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH

NC_000962.3	23173	23238
NC_000962.3	79499	79574
NC_000962.3	80181	80380


In [43]:
PMap_K50E4_BED_Below1_DF = pd.read_csv(PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH, sep = "\t", header = None)
PMap_K50E4_BED_Below1_DF.columns = ["Chrom", "Start", "End"]
PMap_K50E4_BED_Below1_DF["Length"] = PMap_K50E4_BED_Below1_DF["End"] - PMap_K50E4_BED_Below1_DF["Start"]
PMap_K50E4_BED_Below1_DF.shape

(1092, 4)

In [44]:
PMap_K50E4_BED_Below1_DF.head()

Unnamed: 0,Chrom,Start,End,Length
0,NC_000962.3,23173,23238,65
1,NC_000962.3,79499,79574,75
2,NC_000962.3,80181,80380,199
3,NC_000962.3,80459,80518,59
4,NC_000962.3,82166,82238,72


## Evaluate overlap of AMB regions with all 3 PLC subcategories
A total of 262 bp (1.7% of all 36CI-Amb regions) fell outside regions that were already part of the PLC regions.


In [45]:
15813 - 262

15551

In [46]:
15551/15813

0.9834313539492823

In [47]:
!bedtools coverage -a $EBR_36CI_Ambigous_Regions_BED_PATH -b $Mtb_H37rv_pLCRegions_Coscolla_Subset_PEPPEs_BED_PATH $Mtb_H37rv_pLCRegions_Coscolla_Subset_MGEs_BED_PATH $Mtb_H37rv_pLCRegions_Coscolla_Subset_RepetitiveGenes_BED_PATH

NC_000962.3	334641	334653	Region996_Length_12_bp	Ambiguous	12	1	12	12	1.0000000
NC_000962.3	334694	334723	Region998_Length_29_bp	Ambiguous	29	1	29	29	1.0000000
NC_000962.3	888762	889020	Region4252_Length_258_bp	Ambiguous	258	0	0	258	0.0000000
NC_000962.3	889033	890373	Region4254_Length_1340_bp	Ambiguous	1340	2	1262	1340	0.9417911
NC_000962.3	1093947	1094062	Region5514_Length_115_bp	Ambiguous	115	1	115	115	1.0000000
NC_000962.3	1480948	1481670	Region7844_Length_722_bp	Ambiguous	722	1	722	722	1.0000000
NC_000962.3	1541951	1543304	Region8088_Length_1353_bp	Ambiguous	1353	2	1262	1353	0.9327421
NC_000962.3	1637087	1637214	Region8717_Length_127_bp	Ambiguous	127	1	127	127	1.0000000
NC_000962.3	1987702	1989057	Region9675_Length_1355_bp	Ambiguous	1355	2	1262	1355	0.9313653
NC_000962.3	1996100	1997452	Region9810_Length_1352_bp	Ambiguous	1352	3	1297	1352	0.9593195
NC_000962.3	2268721	2268725	Region11088_Length_4_bp	Ambiguous	4	0	0	4	0.0000000
NC_000962.3	2550013	2551366	Region12413_Length_1353_bp

### Evaluate coverage between Amb Regions and Regions w/ P-map-K50-E4 < 1.0

In [48]:
!bedtools coverage -a $EBR_36CI_Ambigous_Regions_BED_PATH -b $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH

NC_000962.3	334641	334653	Region996_Length_12_bp	Ambiguous	12	1	12	12	1.0000000
NC_000962.3	334694	334723	Region998_Length_29_bp	Ambiguous	29	1	29	29	1.0000000
NC_000962.3	888762	889020	Region4252_Length_258_bp	Ambiguous	258	1	7	258	0.0271318
NC_000962.3	889033	890373	Region4254_Length_1340_bp	Ambiguous	1340	1	1340	1340	1.0000000
NC_000962.3	1093947	1094062	Region5514_Length_115_bp	Ambiguous	115	1	115	115	1.0000000
NC_000962.3	1480948	1481670	Region7844_Length_722_bp	Ambiguous	722	5	375	722	0.5193906
NC_000962.3	1541951	1543304	Region8088_Length_1353_bp	Ambiguous	1353	1	1353	1353	1.0000000
NC_000962.3	1637087	1637214	Region8717_Length_127_bp	Ambiguous	127	1	87	127	0.6850393
NC_000962.3	1987702	1989057	Region9675_Length_1355_bp	Ambiguous	1355	1	1355	1355	1.0000000
NC_000962.3	1996100	1997452	Region9810_Length_1352_bp	Ambiguous	1352	1	1352	1352	1.0000000
NC_000962.3	2268721	2268725	Region11088_Length_4_bp	Ambiguous	4	0	0	4	0.0000000
NC_000962.3	2550013	2551366	Region12413_Length_1353_bp	

# Calculating proportion overlap between FP hotspots and EBR < 0.9

In [49]:
!head $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH

NC_000962.3	3845970	3847164		IntergenicRegion_2683_Rv3428c-Rv3429		Intergenic		Intergenic	142	1194
NC_000962.3	1636003	1638229	-	Rv1452c	PE_PGRS28	PE/PPEs	PE_V_PGRS	PE/PPE	59	2226
NC_000962.3	2162931	2167311	-	Rv1917c	PPE34	PE/PPEs	PPE_SL-5_PPE-MPTR	PE/PPE	35	4380
NC_000962.3	1981613	1984775	-	Rv1753c	PPE24	PE/PPEs	PPE_SL-5_PPE-MPTR	PE/PPE	24	3162
NC_000962.3	3941723	3944963	+	Rv3512	PE_PGRS56	PE/PPEs	PE_V_PGRS	PE/PPE	16	3240
NC_000962.3	3232506	3232870		IntergenicRegion_2285_Rv2920c-Rv2921c		Intergenic		Intergenic	16	364
NC_000962.3	3931004	3936710	+	Rv3508	PE_PGRS54	PE/PPEs	PE_V_PGRS	PE/PPE	16	5706
NC_000962.3	3379375	3380452	-	Rv3021c	PPE47	PE/PPEs	None	PE/PPE	14	1077
NC_000962.3	2867123	2867786	+	Rv2544	lppB	Coscolla Repetitive Genes	None	cell wall and cell processes	14	663
NC_000962.3	3945793	3950263	+	Rv3514	PE_PGRS57	PE/PPEs	PE_V_PGRS	PE/PPE	14	4470


In [50]:
!head -n 2 $EBR_36CI_BED_Below_09_Regions_PATH

NC_000962.3	24720	24738
NC_000962.3	39029	39030


In [51]:
!bedtools coverage -a $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH -b $EBR_36CI_BED_Below_09_Regions_PATH | cut -f 13 | grep -o '[[:digit:]]*' | paste -sd+ - | bc

30277


In [52]:
!bedtools coverage -a $EBR_36CI_BED_Below_09_Regions_PATH -b $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH | cut -f 5 | grep -o '[[:digit:]]*' | paste -sd+ - | bc

30239
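The shell pipelines above sum the "bases covered" column that `bedtools coverage` appends to each `-a` record. For readers without `bedtools` at hand, the following is a rough pure-Python sketch of that per-record quantity: the number of bases of one interval covered by at least one interval from another set. The function name `covered_bases` and the toy coordinates are assumptions for illustration only.

```python
def covered_bases(a, b_intervals):
    """Bases of interval `a` covered by at least one interval in `b_intervals`.

    Intervals are (chrom, start, end) tuples, 0-based and half-open.
    """
    chrom, a_start, a_end = a
    # Keep only B intervals that overlap A, clipped to A's boundaries
    clipped = sorted(
        (max(a_start, s), min(a_end, e))
        for c, s, e in b_intervals
        if c == chrom and s < a_end and e > a_start
    )
    total, cursor = 0, a_start
    for s, e in clipped:
        s = max(s, cursor)  # skip bases already counted
        if e > s:
            total += e - s
            cursor = e
    return total


# Hypothetical toy intervals
hotspot = ("NC_000962.3", 100, 200)
low_ebr = [
    ("NC_000962.3", 50, 120),
    ("NC_000962.3", 110, 150),
    ("NC_000962.3", 190, 300),
    ("NC_000962.3", 400, 500),
]
overlap_bp = covered_bases(hotspot, low_ebr)
print(overlap_bp)
```

Summing `covered_bases(a, B)` over every record `a` in the `-a` file gives the same kind of total as the `cut | paste | bc` pipeline.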


In [53]:
!bedtools coverage -a $EBR_36CI_BED_Below_09_Regions_PATH -b $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH | head -n 5

NC_000962.3	24720	24738	0	0	18	0.0000000
NC_000962.3	39029	39030	0	0	1	0.0000000
NC_000962.3	55548	55647	0	0	99	0.0000000
NC_000962.3	71464	71584	0	0	120	0.0000000
NC_000962.3	79569	83035	0	0	3466	0.0000000


In [54]:
!bedtools coverage -a $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH -b $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH | head -n 5

NC_000962.3	3845970	3847164		IntergenicRegion_2683_Rv3428c-Rv3429		Intergenic		Intergenic	142	1194	1	1194	1194	1.0000000
NC_000962.3	1636003	1638229	-	Rv1452c	PE_PGRS28	PE/PPEs	PE_V_PGRS	PE/PPE	59	2226	1	2226	2226	1.0000000
NC_000962.3	2162931	2167311	-	Rv1917c	PPE34	PE/PPEs	PPE_SL-5_PPE-MPTR	PE/PPE	35	4380	1	4380	4380	1.0000000
NC_000962.3	1981613	1984775	-	Rv1753c	PPE24	PE/PPEs	PPE_SL-5_PPE-MPTR	PE/PPE	24	3162	1	3162	3162	1.0000000
NC_000962.3	3941723	3944963	+	Rv3512	PE_PGRS56	PE/PPEs	PE_V_PGRS	PE/PPE	16	3240	2	3240	3240	1.0000000


# Calculating proportion overlap between FP hotspots and non-unique sequence

In [55]:
!head $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH

NC_000962.3	3845970	3847164		IntergenicRegion_2683_Rv3428c-Rv3429		Intergenic		Intergenic	142	1194
NC_000962.3	1636003	1638229	-	Rv1452c	PE_PGRS28	PE/PPEs	PE_V_PGRS	PE/PPE	59	2226
NC_000962.3	2162931	2167311	-	Rv1917c	PPE34	PE/PPEs	PPE_SL-5_PPE-MPTR	PE/PPE	35	4380
NC_000962.3	1981613	1984775	-	Rv1753c	PPE24	PE/PPEs	PPE_SL-5_PPE-MPTR	PE/PPE	24	3162
NC_000962.3	3941723	3944963	+	Rv3512	PE_PGRS56	PE/PPEs	PE_V_PGRS	PE/PPE	16	3240
NC_000962.3	3232506	3232870		IntergenicRegion_2285_Rv2920c-Rv2921c		Intergenic		Intergenic	16	364
NC_000962.3	3931004	3936710	+	Rv3508	PE_PGRS54	PE/PPEs	PE_V_PGRS	PE/PPE	16	5706
NC_000962.3	3379375	3380452	-	Rv3021c	PPE47	PE/PPEs	None	PE/PPE	14	1077
NC_000962.3	2867123	2867786	+	Rv2544	lppB	Coscolla Repetitive Genes	None	cell wall and cell processes	14	663
NC_000962.3	3945793	3950263	+	Rv3514	PE_PGRS57	PE/PPEs	PE_V_PGRS	PE/PPE	14	4470


In [56]:
!head -n 2 $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH

NC_000962.3	23173	23238
NC_000962.3	79499	79574


In [57]:
!bedtools coverage -a $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH -b $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH | cut -f 13 | grep -o '[[:digit:]]*' | paste -sd+ - | bc

28695


In [58]:
!bedtools coverage -a $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH -b $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH | cut -f 5 | grep -o '[[:digit:]]*' | paste -sd+ - | bc

28657


In [59]:
#!bedtools coverage -a $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH -b $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH | cut -f 5

In [60]:
#!bedtools coverage -a $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH -b $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH | cut -f 13

In [61]:
!bedtools coverage -a $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH -b $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH | head -n 5

NC_000962.3	23173	23238	0	0	65	0.0000000
NC_000962.3	79499	79574	0	0	75	0.0000000
NC_000962.3	80181	80380	0	0	199	0.0000000
NC_000962.3	80459	80518	0	0	59	0.0000000
NC_000962.3	82166	82238	0	0	72	0.0000000


# Calculate overlap between SVs and RLC subsets (A, B, C, Z)

### Parse in SVs

In [62]:
Repo_DataDir = "../../Data"

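# NOTE: the variable names below retain a "28CI" prefix, but the paths point to the 36-isolate (36CI) NucDiff SV analysis.
PMP_28CI_NucDiff_SV_Analysis_Dir = Repo_DataDir + "/210126_PMP_36CI_NucDiff_SV_Analysis_Dir"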
PMP_28CI_NucDiff_AllSVs_Detected_TSV = PMP_28CI_NucDiff_SV_Analysis_Dir + "/210126.PMP.36CI.NucDiff_AllSVs_Detected.V2.tsv"
PMP_28CI_NucDiff_AllSVs_Detected_50bp_Filtered_BED = PMP_28CI_NucDiff_SV_Analysis_Dir + "/210126.PMP.36CI.NucDiff_AllSVs_Detected.V2.50bp.bed"



NucDiff_SVs_28CI_DF = pd.read_csv(PMP_28CI_NucDiff_AllSVs_Detected_TSV, sep="\t")


NucDiff_SVs_28CI_50bp_DF = NucDiff_SVs_28CI_DF[ NucDiff_SVs_28CI_DF["SV_Length"] >= 50]


NucDiff_SVs_28CI_50bp_DF.to_csv(PMP_28CI_NucDiff_AllSVs_Detected_50bp_Filtered_BED,
                           sep = "\t",
                           index = False,
                           header = False)




In [132]:
NucDiff_SVs_28CI_50bp_DF.shape

(3620, 7)

In [63]:
!ls -lah $PMP_28CI_NucDiff_SV_Analysis_Dir

total 360K
drwxrwsr-x  2 mm774 farhat  197 Feb 19 21:30 .
drwxrwsr-x 47 mm774 farhat 2.6K Mar 29 19:36 ..
-rw-rw-r--  1 mm774 farhat 218K Mar 30 13:12 210126.PMP.36CI.NucDiff_AllSVs_Detected.V2.50bp.bed
-rw-rw-r--  1 mm774 farhat 327K Feb 19 21:28 210126.PMP.36CI.NucDiff_AllSVs_Detected.V2.bed
-rw-rw-r--  1 mm774 farhat 327K Mar 26 13:42 210126.PMP.36CI.NucDiff_AllSVs_Detected.V2.tsv


In [64]:
!wc -l $PMP_28CI_NucDiff_SV_Analysis_Dir/*

  3620 ../../Data/210126_PMP_36CI_NucDiff_SV_Analysis_Dir/210126.PMP.36CI.NucDiff_AllSVs_Detected.V2.50bp.bed
  5486 ../../Data/210126_PMP_36CI_NucDiff_SV_Analysis_Dir/210126.PMP.36CI.NucDiff_AllSVs_Detected.V2.bed
  5485 ../../Data/210126_PMP_36CI_NucDiff_SV_Analysis_Dir/210126.PMP.36CI.NucDiff_AllSVs_Detected.V2.tsv
 14591 total


In [65]:
!head $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH

NC_000962.3	3845970	3847164		IntergenicRegion_2683_Rv3428c-Rv3429		Intergenic		Intergenic	142	1194
NC_000962.3	1636003	1638229	-	Rv1452c	PE_PGRS28	PE/PPEs	PE_V_PGRS	PE/PPE	59	2226
NC_000962.3	2162931	2167311	-	Rv1917c	PPE34	PE/PPEs	PPE_SL-5_PPE-MPTR	PE/PPE	35	4380
NC_000962.3	1981613	1984775	-	Rv1753c	PPE24	PE/PPEs	PPE_SL-5_PPE-MPTR	PE/PPE	24	3162
NC_000962.3	3941723	3944963	+	Rv3512	PE_PGRS56	PE/PPEs	PE_V_PGRS	PE/PPE	16	3240
NC_000962.3	3232506	3232870		IntergenicRegion_2285_Rv2920c-Rv2921c		Intergenic		Intergenic	16	364
NC_000962.3	3931004	3936710	+	Rv3508	PE_PGRS54	PE/PPEs	PE_V_PGRS	PE/PPE	16	5706
NC_000962.3	3379375	3380452	-	Rv3021c	PPE47	PE/PPEs	None	PE/PPE	14	1077
NC_000962.3	2867123	2867786	+	Rv2544	lppB	Coscolla Repetitive Genes	None	cell wall and cell processes	14	663
NC_000962.3	3945793	3950263	+	Rv3514	PE_PGRS57	PE/PPEs	PE_V_PGRS	PE/PPE	14	4470


## SV overlap with RLC - (A): FP hotspot regions

In [66]:
!bedtools coverage -a $PMP_28CI_NucDiff_AllSVs_Detected_50bp_Filtered_BED -b $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH | cut -f 8 | sort | uniq -c

   2610 0
    956 1
     29 2
     25 4


In [67]:
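# SVs overlapping at least one FP hotspot region (counts from the `uniq -c` output above)
25 + 29 + 956

1010

In [68]:
# All SVs >= 50 bp (should match the 3620 rows of the filtered BED)
25 + 29 + 956 + 2610

3620

In [133]:
round(1010 / 3620, 4)

0.279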

In [70]:
awk_text="'$8 >= 1 '"

In [71]:
!bedtools coverage -a $PMP_28CI_NucDiff_AllSVs_Detected_50bp_Filtered_BED -b $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH  | head -n 30 | awk $awk_text

NC_000962.3	335050	337913	tandem_duplication	2862	M0011368_9	lineage4	1	1260	2863	0.4400978
NC_000962.3	580760	580814	tandem_duplication	53	M0011368_9	lineage4	1	48	54	0.8888889
NC_000962.3	1983037	1983350	tandem_duplication	312	M0011368_9	lineage4	1	313	313	1.0000000
NC_000962.3	2074547	2074633	collapsed_repeat	86	M0011368_9	lineage4	1	86	86	1.0000000
NC_000962.3	2163326	2163530	duplication	203	M0011368_9	lineage4	1	204	204	1.0000000
NC_000962.3	2164023	2164092	collapsed_tandem_repeat	69	M0011368_9	lineage4	1	69	69	1.0000000
NC_000962.3	2166574	2166575	insertion	1355	M0011368_9	lineage4	1	1	1	1.0000000


In [72]:
!bedtools coverage -a $PMP_28CI_NucDiff_AllSVs_Detected_50bp_Filtered_BED -b $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH | awk $awk_text | wc -l 

1010
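The `awk '$8 >= 1' | wc -l` step counts SV records that overlap at least one region in the `-b` file. A minimal pure-Python sketch of the same idea is shown below; `count_overlapping` is a hypothetical helper and the coordinates are toy values, not real SV calls.

```python
def count_overlapping(features, regions):
    """Count features that overlap at least one region by >= 1 bp.

    Both inputs hold (chrom, start, end) tuples, 0-based and half-open.
    """
    def overlaps(f, r):
        return f[0] == r[0] and f[1] < r[2] and r[1] < f[2]

    return sum(any(overlaps(f, r) for r in regions) for f in features)


# Hypothetical toy SVs and hotspot regions
svs = [
    ("NC_000962.3", 10, 60),
    ("NC_000962.3", 100, 180),
    ("NC_000962.3", 500, 560),
]
hotspots = [("NC_000962.3", 50, 120), ("NC_000962.3", 150, 160)]

n_overlapping = count_overlapping(svs, hotspots)
fraction = n_overlapping / len(svs)
print(n_overlapping, round(fraction, 3))
```

Each SV is counted once no matter how many regions it overlaps, which is why this count can be compared directly against the total number of SVs.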


In [134]:
round(1010 / 3620, 4)

0.279

## SV overlap with RLC - (B): EBR < 0.9 regions

In [74]:
EBR_36CI_BED_Below_09_Regions_PATH

'../../../210112_PBvsI_VCeval_AnalysisDir_V7_36CI/210112_EBR_H37rv_36CI_MM2vsPilon_V7/EBR_V7_36CI.Below_0.9.And.Ambiguous.Regions.bed'

In [75]:
!bedtools coverage -a $PMP_28CI_NucDiff_AllSVs_Detected_50bp_Filtered_BED -b $EBR_36CI_BED_Below_09_Regions_PATH | awk $awk_text | wc -l 

2340


In [135]:
2340 / 3620

0.6464088397790055

## SV overlap with RLC - (C): Amb regions

In [76]:
!bedtools coverage -a $PMP_28CI_NucDiff_AllSVs_Detected_50bp_Filtered_BED -b $EBR_36CI_Ambigous_Regions_BED_PATH | awk $awk_text | wc -l 

505


In [136]:
505/3620

0.13950276243093923

## SV overlap with RLC - (Z): Low Pmap regions (< 1)

In [78]:
!wc -l $PMP_28CI_NucDiff_AllSVs_Detected_50bp_Filtered_BED

3620 ../../Data/210126_PMP_36CI_NucDiff_SV_Analysis_Dir/210126.PMP.36CI.NucDiff_AllSVs_Detected.V2.50bp.bed


In [79]:
!bedtools coverage -a $PMP_28CI_NucDiff_AllSVs_Detected_50bp_Filtered_BED -b $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH | awk $awk_text | wc -l 

2541


In [137]:
2541 / 3620

0.7022099447513812

# Q: What proportion of the genome do the RLC regions cover? 


In [81]:
Regions_Top30SourcesOfFPs_FiltMQ30_DF.head(2)

Unnamed: 0,Chrom,Start,End,Strand,H37rv_GeneID,Symbol,ExcludedGroup_Category,PEandPPE_Subfamily,Functional_Category,FP_Count,Length
6286,NC_000962.3,3845970,3847164,,IntergenicRegion_2683_Rv3428c-Rv3429,,Intergenic,,Intergenic,142,1194
2840,NC_000962.3,1636003,1638229,-,Rv1452c,PE_PGRS28,PE/PPEs,PE_V_PGRS,PE/PPE,59,2226


In [82]:
!head -n 2 $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH

NC_000962.3	3845970	3847164		IntergenicRegion_2683_Rv3428c-Rv3429		Intergenic		Intergenic	142	1194
NC_000962.3	1636003	1638229	-	Rv1452c	PE_PGRS28	PE/PPEs	PE_V_PGRS	PE/PPE	59	2226


In [83]:
Regions_Top30SourcesOfFPs_FiltMQ30_DF["Length"].sum()

65394

In [84]:
Regions_Top30SourcesOfFPs_FiltMQ30_DF["Length"].sum()/ 4411532 * 100

1.4823421886093087
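
The same percentage calculation is repeated for each sub-mask below, with the genome length `4411532` typed inline each time. A small sketch of wrapping it in a helper with a named constant follows. The helper name is hypothetical, and it assumes the intervals in the frame do not overlap one another; otherwise bases would be counted twice.

```python
import pandas as pd

# H37Rv genome length in bp, the denominator used throughout this notebook.
H37RV_GENOME_LENGTH = 4411532

def percent_of_genome(bed_df, genome_length=H37RV_GENOME_LENGTH):
    """Percent of the genome covered by BED-style (half-open) intervals.

    Assumes the intervals are non-overlapping (for example, the output of
    `bedtools merge`).
    """
    lengths = bed_df["End"] - bed_df["Start"]
    return lengths.sum() / genome_length * 100

# The two FP hotspot rows shown in the `head` output above.
example = pd.DataFrame({
    "Chrom": ["NC_000962.3", "NC_000962.3"],
    "Start": [3845970, 1636003],
    "End": [3847164, 1638229],
})
example_pct = percent_of_genome(example)
print(example_pct)
```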

## B) Overlap of EBR < 0.9 and the entire H37Rv genome

In [85]:
EBR_36CI_WGS40X_BED_DF_Regions_Below09.head(3)

Unnamed: 0,Chrom,Start,End,Length
0,NC_000962.3,24720,24738,18
1,NC_000962.3,39029,39030,1
2,NC_000962.3,55548,55647,99


In [86]:
!head -n 3 $EBR_36CI_BED_Below_09_Regions_PATH

NC_000962.3	24720	24738
NC_000962.3	39029	39030
NC_000962.3	55548	55647


In [87]:
EBR_36CI_WGS40X_BED_DF_Regions_Below09["Length"].sum()

141960

In [88]:
EBR_36CI_WGS40X_BED_DF_Regions_Below09["Length"].sum() / 4411532 * 100

3.2179297350670923

In [89]:
## Z) Overlap of low Pmap (K50, E4; Pmap < 1) regions and the entire H37Rv genome

In [90]:
PMap_K50E4_BED_Below1_DF.head(2)

Unnamed: 0,Chrom,Start,End,Length
0,NC_000962.3,23173,23238,65
1,NC_000962.3,79499,79574,75


In [91]:
PMap_K50E4_BED_Below1_DF["Length"].sum()

189147

In [92]:
PMap_K50E4_BED_Below1_DF["Length"].sum() / 4411532 * 100

4.287558154400784

In [93]:
PMap_K50E4_BED_Below1_DF.head(2)

Unnamed: 0,Chrom,Start,End,Length
0,NC_000962.3,23173,23238,65
1,NC_000962.3,79499,79574,75


In [94]:
!head -n 2 $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH

NC_000962.3	23173	23238
NC_000962.3	79499	79574


In [95]:
## C) Overlap of Ambiguous regions and the entire H37Rv genome

In [96]:
EBR_36CI_BED_DF_AMB_ONLY.head(2)

Unnamed: 0,chrom,chromStart,chromEnd,name,score,Length
0,NC_000962.3,334641,334653,Region996_Length_12_bp,Ambiguous,12
1,NC_000962.3,334694,334723,Region998_Length_29_bp,Ambiguous,29


In [97]:
!head -n 2 $EBR_36CI_Ambigous_Regions_BED_PATH

NC_000962.3	334641	334653	Region996_Length_12_bp	Ambiguous	12
NC_000962.3	334694	334723	Region998_Length_29_bp	Ambiguous	29


In [98]:
EBR_36CI_BED_DF_AMB_ONLY["Length"].sum()

15813

In [99]:
(EBR_36CI_BED_DF_AMB_ONLY["Length"].sum() / 4411532) * 100

0.35844690687951486

In [100]:
!cat $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH $EBR_36CI_Ambigous_Regions_BED_PATH | cut -f 1-3 | bedtools sort  | bedtools merge | wc -l 

1076


In [101]:
!cat $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH $EBR_36CI_Ambigous_Regions_BED_PATH | cut -f 1-3 | bedtools sort  | bedtools merge | head

NC_000962.3	23173	23238
NC_000962.3	79499	79574
NC_000962.3	80181	80380
NC_000962.3	80459	80518
NC_000962.3	82166	82238
NC_000962.3	82372	82487
NC_000962.3	103702	104785
NC_000962.3	104830	104941
NC_000962.3	104942	105137
NC_000962.3	131785	131844


# Outputting BED files defining RLC regions (+ additional region masking combinations)

## A) Define directory for output of RLC results

In [102]:
Repo_DataDir = "../../Data"
PB_Vs_Illumina_DataAnalysis_Dir = "../../../210112_PBvsI_VCeval_AnalysisDir_V7_36CI"

Defining_RLC_Regions_RepoDataDir = Repo_DataDir + "/210215_Defining_RLC_Regions"
Defining_RLC_Regions_AnalysisDir = PB_Vs_Illumina_DataAnalysis_Dir + "/210215_Defining_RLC_Regions"


!mkdir -p $Defining_RLC_Regions_RepoDataDir
!mkdir -p $Defining_RLC_Regions_AnalysisDir


### Copy all subsets that contribute to RLCs

In [103]:
!cp $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH $Defining_RLC_Regions_RepoDataDir/
!cp $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH $Defining_RLC_Regions_AnalysisDir/

!cp $EBR_36CI_BED_Below_09_Regions_PATH $Defining_RLC_Regions_RepoDataDir/
!cp $EBR_36CI_BED_Below_09_Regions_PATH $Defining_RLC_Regions_AnalysisDir/

!cp $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH $Defining_RLC_Regions_RepoDataDir/
!cp $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH $Defining_RLC_Regions_AnalysisDir/

!cp $EBR_36CI_Ambigous_Regions_BED_PATH $Defining_RLC_Regions_RepoDataDir/
!cp $EBR_36CI_Ambigous_Regions_BED_PATH $Defining_RLC_Regions_AnalysisDir/


## B) Test merging of different sub-masks

In [104]:
!cat $EBR_36CI_BED_Below_09_Regions_PATH $EBR_36CI_Ambigous_Regions_BED_PATH | cut -f 1-3 | bedtools sort  | bedtools merge | wc -l

884


In [105]:
!cat $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH $EBR_36CI_BED_Below_09_Regions_PATH $EBR_36CI_Ambigous_Regions_BED_PATH | cut -f 1-3 | bedtools sort | bedtools merge | wc -l

773


In [106]:
!cat $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH $EBR_36CI_BED_Below_09_Regions_PATH $EBR_36CI_Ambigous_Regions_BED_PATH | cut -f 1-3 | bedtools sort | bedtools merge | head 

NC_000962.3	24720	24738
NC_000962.3	39029	39030
NC_000962.3	55548	55647
NC_000962.3	71464	71584
NC_000962.3	79569	83035
NC_000962.3	86778	86779
NC_000962.3	86780	86782
NC_000962.3	103749	104811
NC_000962.3	104813	104962
NC_000962.3	131928	131979
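
To make the merge step explicit, here is a minimal pure-Python sketch of what `bedtools sort | bedtools merge` does with default settings: intervals on the same chromosome are combined when they overlap or are book-ended (one ends exactly where the next starts). Intervals separated by even 1 bp stay separate, as with `86778-86779` and `86780-86782` in the output above. The function name and toy intervals are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended (chrom, start, end) intervals.

    Mirrors the default behaviour of `bedtools merge` (-d 0) on
    half-open BED coordinates.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        same_chrom = bool(merged) and merged[-1][0] == chrom
        if same_chrom and start <= merged[-1][2]:
            # Overlapping or book-ended: extend the current interval.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]

# Toy intervals for illustration only.
toy = [("chr", 15, 30), ("chr", 10, 20), ("chr", 30, 40), ("chr", 41, 50)]
toy_merged = merge_intervals(toy)
print(toy_merged)
```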


## A) Output RLCs (low EBR, FP Hotspots, and Amb regions)

In [107]:

RLC_BED_PATH_O2Dir = f"{Defining_RLC_Regions_AnalysisDir}/RLC_Regions.H37Rv.bed" 
RLC_BED_PATH_RepoDir = f"{Defining_RLC_Regions_RepoDataDir}/RLC_Regions.H37Rv.bed"


!cat $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH $EBR_36CI_BED_Below_09_Regions_PATH $EBR_36CI_Ambigous_Regions_BED_PATH | cut -f 1-3 | bedtools sort | bedtools merge > $RLC_BED_PATH_O2Dir

!cp $RLC_BED_PATH_O2Dir $RLC_BED_PATH_RepoDir

In [108]:
RefinedLCRs_DF = pd.read_csv(RLC_BED_PATH_RepoDir, sep = "\t", header = None)
RefinedLCRs_DF.columns = ["Chrom", "Start", "End"]
RefinedLCRs_DF["Length"] = RefinedLCRs_DF["End"] - RefinedLCRs_DF["Start"]
RefinedLCRs_DF.shape

(773, 4)

In [109]:
RefinedLCRs_DF["Length"].sum()

177077

In [110]:
(RefinedLCRs_DF["Length"].sum() / 4411532) 

0.0401395705618819

In [111]:
(RefinedLCRs_DF["Length"].sum() / 4411532) * 100

4.0139570561881905

In [112]:
### OLD PLC Regions Length 

In [113]:
LowConfidenceRegions_BED_DF["Length"].sum()

469501

In [114]:
(LowConfidenceRegions_BED_DF["Length"].sum() / 4411532) * 100

10.642584027498838
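
For a side-by-side view of the two masks, the totals reported above can be compared in one place. This is a small sketch using only the numbers printed in this notebook; the dictionary and variable names are hypothetical. Note that the difference is in total masked length only and does not imply that the RLC regions are a subset of the PLC regions.

```python
H37RV_GENOME_LENGTH = 4411532  # bp, as used throughout this notebook

# Total masked lengths (bp) reported in the cells above.
mask_lengths_bp = {"RLC": 177077, "PLC": 469501}

mask_percent = {
    name: bp / H37RV_GENOME_LENGTH * 100
    for name, bp in mask_lengths_bp.items()
}

# Difference in total masked length between the old and new definitions.
bp_difference = mask_lengths_bp["PLC"] - mask_lengths_bp["RLC"]

for name, pct in mask_percent.items():
    print(f"{name}: {mask_lengths_bp[name]} bp ({pct:.2f}% of genome)")
print(f"Difference: {bp_difference} bp")
```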

## B) Output RLCs (low EBR, FP Hotspots, and Amb regions) + Pileup Mappability (K=50bp, E=4)

In [115]:
RLC_PlusLowPmapK50E4_BED_PATH_O2Dir = f"{Defining_RLC_Regions_AnalysisDir}/RLC_Regions.Plus.LowPmapK50E4.H37Rv.bed" 
RLC_PlusLowPmapK50E4_BED_PATH_RepoDir = f"{Defining_RLC_Regions_RepoDataDir}/RLC_Regions.Plus.LowPmapK50E4.H37Rv.bed"


!cat $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH $EBR_36CI_BED_Below_09_Regions_PATH $EBR_36CI_Ambigous_Regions_BED_PATH $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH | cut -f 1-3 | bedtools sort | bedtools merge > $RLC_PlusLowPmapK50E4_BED_PATH_O2Dir   

!cp $RLC_PlusLowPmapK50E4_BED_PATH_O2Dir $RLC_PlusLowPmapK50E4_BED_PATH_RepoDir

In [116]:
RefinedLCRs_WiPmapK50E4_DF = pd.read_csv(RLC_PlusLowPmapK50E4_BED_PATH_RepoDir, sep = "\t", header = None)
RefinedLCRs_WiPmapK50E4_DF.columns = ["Chrom", "Start", "End"]
RefinedLCRs_WiPmapK50E4_DF["Length"] = RefinedLCRs_WiPmapK50E4_DF["End"] - RefinedLCRs_WiPmapK50E4_DF["Start"]
RefinedLCRs_WiPmapK50E4_DF.shape

(1324, 4)

In [117]:
RefinedLCRs_WiPmapK50E4_DF["Length"].sum()

276750

In [118]:
(RefinedLCRs_WiPmapK50E4_DF["Length"].sum() / 4411532) * 100

6.273330897293729

## C) Output RLCs (low EBR, FP Hotspots, and Amb regions) + ALL Annotated Mobile Genetic Elements (MGEs)

In [119]:
RLC_PlusMGEs_BED_PATH_O2Dir = f"{Defining_RLC_Regions_AnalysisDir}/RLC_Regions.Plus.MGEs.H37Rv.bed" 
RLC_PlusMGEs_BED_PATH_RepoDir = f"{Defining_RLC_Regions_RepoDataDir}/RLC_Regions.Plus.MGEs.H37Rv.bed"


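# NOTE: as written, this command merges the low Pmap (K50, E4) regions, not the EBR < 0.9 regions named in the heading; the outputs below reflect this command.
!cat $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH $EBR_36CI_Ambigous_Regions_BED_PATH $Mtb_H37rv_pLCRegions_Coscolla_Subset_MGEs_BED_PATH | cut -f 1-3 | bedtools sort | bedtools merge > $RLC_PlusMGEs_BED_PATH_O2Dir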

!cp $RLC_PlusMGEs_BED_PATH_O2Dir $RLC_PlusMGEs_BED_PATH_RepoDir


In [120]:
RefinedLCRs_WiMGE_DF = pd.read_csv(RLC_PlusMGEs_BED_PATH_RepoDir, sep = "\t", header = None)
RefinedLCRs_WiMGE_DF.columns = ["Chrom", "Start", "End"]
RefinedLCRs_WiMGE_DF["Length"] = RefinedLCRs_WiMGE_DF["End"] - RefinedLCRs_WiMGE_DF["Start"]
RefinedLCRs_WiMGE_DF.shape

(909, 4)

In [121]:
RefinedLCRs_WiMGE_DF["Length"].sum()

277867

In [122]:
(RefinedLCRs_WiMGE_DF["Length"].sum() / 4411532) * 100

6.298650899506113

In [None]:
# Total length of the merged regions above: ~278 kb (277,867 bp)

In [123]:
!cat $Regions_Top30SourcesOfFPs_FiltMQ30_BED_PATH $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH | cut -f 1-3 | bedtools sort | bedtools merge > ./RefinedLCRs_NoAmbRemoved.bed             


In [124]:
RefinedLCRs_NoAmbRemoved_DF = pd.read_csv("./RefinedLCRs_NoAmbRemoved.bed", sep = "\t", header = None)
RefinedLCRs_NoAmbRemoved_DF.columns = ["Chrom", "Start", "End"]
RefinedLCRs_NoAmbRemoved_DF["Length"] = RefinedLCRs_NoAmbRemoved_DF["End"] - RefinedLCRs_NoAmbRemoved_DF["Start"]
RefinedLCRs_NoAmbRemoved_DF.shape

(961, 4)

In [125]:
RefinedLCRs_NoAmbRemoved_DF["Length"].sum()

225846

In [126]:
(RefinedLCRs_NoAmbRemoved_DF["Length"].sum() / 4411532) * 100

5.119446033713458

In [127]:
PMap_K50E4_BED_Below1_DF["Length"].sum()

189147

In [128]:
Regions_Top30SourcesOfFPs_FiltMQ30_DF["Length"].sum()

65394

In [129]:
# Summed lengths of the two sub-masks (low Pmap + top 30 FP regions), before merging
189147 + 65394

254541

In [130]:
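# Summed length minus merged length = bp shared by the two sub-masks (assuming no overlaps within either file)
254541 - 225846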

28695
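
The arithmetic above is inclusion-exclusion: the bases shared by two interval sets equal the sum of their individual lengths minus the length of their merged union. A minimal self-contained sketch of that calculation follows; the function names and toy intervals are hypothetical. Each set is merged before measuring, so overlaps within one set are not double counted.

```python
def merged_length(intervals):
    """Total bp covered by (chrom, start, end) intervals after merging overlaps."""
    total = 0
    current = None  # (chrom, start, end) of the interval being extended
    for chrom, start, end in sorted(intervals):
        if current and current[0] == chrom and start <= current[2]:
            current = (chrom, current[1], max(current[2], end))
        else:
            if current:
                total += current[2] - current[1]
            current = (chrom, start, end)
    if current:
        total += current[2] - current[1]
    return total

def shared_bp(set_a, set_b):
    """Bases covered by both sets: |A| + |B| - |A union B|."""
    return merged_length(set_a) + merged_length(set_b) - merged_length(set_a + set_b)

# Toy intervals for illustration only.
toy_a = [("chr", 0, 100)]
toy_b = [("chr", 50, 150), ("chr", 200, 210)]
toy_shared = shared_bp(toy_a, toy_b)
print(toy_shared)
```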

## Look at final output dir

In [131]:
!ls -lah $Defining_RLC_Regions_AnalysisDir

total 232K
drwxrwsr-x  2 mm774 farhat  379 Mar 29 22:17 .
drwxrwsr-x 12 mm774 farhat  574 Mar 29 19:36 ..
-rw-rw-r--  1 mm774 farhat  30K Mar 30 13:12 201027_PMap_K50E4_Regions_BELOW_1.bed
-rw-rw-r--  1 mm774 farhat 2.6K Mar 30 13:12 210202_Mtb_H37rv.Top30SourcesOfFPs.FiltMQ30.bed
-rw-rw-r--  1 mm774 farhat 1.6K Mar 30 13:12 EBR_36CI_AmbigousRegions_V1.bed
-rw-rw-r--  1 mm774 farhat  24K Mar 30 13:12 EBR_V7_36CI.Below_0.9.And.Ambiguous.Regions.bed
-rw-rw-r--  1 mm774 farhat  21K Mar 30 13:12 RLC_Regions.H37Rv.bed
-rw-rw-r--  1 mm774 farhat  36K Mar 30 13:12 RLC_Regions.Plus.LowPmapK50E4.H37Rv.bed
-rw-rw-r--  1 mm774 farhat  25K Mar 30 13:12 RLC_Regions.Plus.MGEs.H37Rv.bed
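
Before using the output BED files as masks, a quick sanity check can catch malformed intervals. Here is a hedged sketch, assuming the files are loaded into DataFrames with `Chrom`, `Start`, and `End` columns as in the cells above. The function name is hypothetical, and the checks reflect what `bedtools sort | bedtools merge` output should look like: sorted, non-overlapping, non-adjacent intervals within the genome bounds.

```python
import pandas as pd

def check_merged_bed(bed_df, genome_length=4411532):
    """Return a list of problems found in a merged BED DataFrame (empty if none)."""
    problems = []
    if (bed_df["Start"] < 0).any() or (bed_df["End"] > genome_length).any():
        problems.append("interval outside genome bounds")
    if (bed_df["End"] <= bed_df["Start"]).any():
        problems.append("empty or inverted interval")
    for chrom, group in bed_df.groupby("Chrom", sort=False):
        starts = group["Start"].to_numpy()
        ends = group["End"].to_numpy()
        # After sort + merge, each start must lie strictly past the previous end.
        if (starts[1:] <= ends[:-1]).any():
            problems.append(f"unsorted or unmerged intervals on {chrom}")
    return problems

# First three rows of the merged RLC output shown earlier.
good = pd.DataFrame({
    "Chrom": ["NC_000962.3"] * 3,
    "Start": [24720, 39029, 55548],
    "End": [24738, 39030, 55647],
})
# Toy example with two overlapping intervals, for illustration only.
bad = pd.DataFrame({
    "Chrom": ["NC_000962.3"] * 2,
    "Start": [100, 150],
    "End": [200, 300],
})
print(check_merged_bed(good))
print(check_merged_bed(bad))
```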
