# 20/12/11 - Calculating Genmap Mappability across the H37rv reference genome 

### Maximillian Marin
### mgmarin@g.harvard.edu

Goal: Run Genmap (https://github.com/cpockrandt/genmap) on the H37rv reference genome with varying parameters (k-mer size K and number of allowed mismatches E).


In [1]:
import numpy as np
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
# import gffutils

%matplotlib inline

#### Pandas Viewing Settings

In [2]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## Test installation of Genmap

#### NOTE: 19/12/17 it works!

In [3]:
!genmap --version

GenMap version: 1.3.0
SeqAn version: 2.4.1


In [4]:
#!genmap -h

In [5]:
## Let's look at options for running genmap index
!genmap index --help

GenMap index

SYNOPSIS

DESCRIPTION
    GenMap is a tool for fast and exact computation of genome mappability and
    can also be used for multiple genomes, e.g., to search for marker
    sequences.

    Detailed information is available in the wiki:
    <https://github.com/cpockrandt/genmap/wiki>

    Index creation. Only supports DNA and RNA (A, C, G, T/U, N). Other
    characters will be converted to N. Choose between the following index
    construction algorithms (-A / --algorithm): * divsufsort (recommended,
    faster, needs about `6n` space in main memory/RAM, `10n` for sequences
    >2GB), * skew (needs more than `25n` space on secondary memory/disk, i.e.,
    in TMPDIR), where `n` is the total number of bases in your fasta file(s).

OPTIONS
    -h, --help
          Display the help message.
    --version-check BOOL
          Turn this option off to disable version update notifications of the
          application. One of

In [6]:
## Let's look at options for running genmap map
!genmap map --help

GenMap map

SYNOPSIS

DESCRIPTION
    GenMap is a tool for fast and exact computation of genome mappability and
    can also be used for multiple genomes, e.g., to search for marker
    sequences.

    Detailed information is available in the wiki:
    <https://github.com/cpockrandt/genmap/wiki>

    Tool for computing the mappability/frequency on nucleotide sequences. It
    supports multi-fasta files with DNA or RNA alphabets (A, C, G, T/U, N).
    Frequency is the absolute number of occurrences, mappability is the
    inverse, i.e., 1 / frequency-value.

OPTIONS
    -h, --help
          Display the help message.
    --version-check BOOL
          Turn this option off to disable version update notifications of the
          application. One of 1, ON, TRUE, T, YES, 0, OFF, FALSE, F, and NO.
          Default: 1.
    --version

## Define directories

### Make output directory for this analysis

## Step 0) Creating a Genmap Index of H37rv

#### Define path to Genmap index for H37rv reference genome FASTA

In [7]:
Mtb_RefDir="/n/data1/hms/dbmi/farhat/mm774/References"
H37rv_Genmap_IDX_Dir_PATH = f"{Mtb_RefDir}/H37rv_Genmap_IDX_Dir"

H37rv_FA_PATH = f"{Mtb_RefDir}/GCF_000195955.2_ASM19595v2_genomic.fasta"

H37rv_FA_SeqNameTrimmed_PATH = f"{Mtb_RefDir}/GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.fasta"

In [8]:
## https://askubuntu.com/questions/974459/remove-all-letters-after-space-in-a-line-that-start-with-specific-character
!sed '/^>/ s/ .*//' $H37rv_FA_PATH > $H37rv_FA_SeqNameTrimmed_PATH
!samtools faidx $H37rv_FA_SeqNameTrimmed_PATH

samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory


In [9]:
!head -1 $H37rv_FA_PATH
!head -1 $H37rv_FA_SeqNameTrimmed_PATH

>NC_000962.3 Mycobacterium tuberculosis H37Rv, complete genome
>NC_000962.3


### Run genmap index

In [10]:
!genmap index -v -F $H37rv_FA_SeqNameTrimmed_PATH -I $H37rv_Genmap_IDX_Dir_PATH

ERROR: The directory for the index already exists at /n/data1/hms/dbmi/farhat/mm774/References/H37rv_Genmap_IDX_Dir
       Please remove it, or choose a different location.
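The error above appears because `genmap index` will not write into an index directory that already exists. A minimal sketch of a guard that only builds the command when the directory is absent is given here. The helper name `genmap_index_command` is hypothetical, the paths are throwaway placeholders, and only the `-v`, `-F` and `-I` flags already used in this notebook are assumed.

```python
import os
import tempfile

def genmap_index_command(fasta_path, index_dir):
    """Return the `genmap index` argument list, or None if index_dir already exists."""
    if os.path.isdir(index_dir):
        # An index (or at least its directory) is already present: nothing to build
        return None
    return ["genmap", "index", "-v", "-F", fasta_path, "-I", index_dir]

# Demonstration with throwaway paths (not the real reference directory)
existing_dir = tempfile.mkdtemp()
missing_dir = os.path.join(existing_dir, "H37rv_Genmap_IDX_Dir")

skip_cmd = genmap_index_command("genome.fasta", existing_dir)
build_cmd = genmap_index_command("genome.fasta", missing_dir)
```

When the returned value is not `None`, it could be handed to `subprocess.run(cmd, check=True)` so that the index is only built once.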


## Step 1) Run Genmap with varying parameters

In [11]:
Mtb_RefDir="/n/data1/hms/dbmi/farhat/mm774/References"
H37rv_Genmap_IDX_Dir_PATH = f"{Mtb_RefDir}/H37rv_Genmap_IDX_Dir"

In [12]:
PB_Vs_Illumina_DataAnalysis_Dir = "../../../210112_PBvsI_VCeval_AnalysisDir_V7_36CI"

!mkdir $PB_Vs_Illumina_DataAnalysis_Dir

Genmap_Map_AnalysisDir = PB_Vs_Illumina_DataAnalysis_Dir + "/201027_Genmap_Mappability_H37rv_V1"  

!mkdir $Genmap_Map_AnalysisDir

mkdir: cannot create directory ‘../../../210112_PBvsI_VCeval_AnalysisDir_V7_36CI’: File exists
mkdir: cannot create directory ‘../../../210112_PBvsI_VCeval_AnalysisDir_V7_36CI/201027_Genmap_Mappability_H37rv_V1’: File exists
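The `File exists` messages above are harmless, but they recur every time the notebook is re-run. One way to create the directories silently is `os.makedirs(..., exist_ok=True)`; the sketch below uses a temporary directory as a stand-in for the real analysis path.

```python
import os
import tempfile

# Stand-in for PB_Vs_Illumina_DataAnalysis_Dir (a temporary directory, for illustration only)
base_dir = tempfile.mkdtemp()
analysis_dir = os.path.join(base_dir, "201027_Genmap_Mappability_H37rv_V1")

# Creates the directory (and any missing parents); no error if it already exists
os.makedirs(analysis_dir, exist_ok=True)
os.makedirs(analysis_dir, exist_ok=True)  # second call is a no-op
```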


In [13]:
!ls -1 $PB_Vs_Illumina_DataAnalysis_Dir

201016_FP_TP_FN_And_SV_DistributionAcrossH37Rv_Analysis_V3_PB_MM2_GT
201027_Genmap_Mappability_H37rv_V1
210112_EBR_H37rv_36CI_MM2vsPilon_V7
210113_CoverageBiasAnalysis_GC_V2
210126_FalsePositivesAnalysis_V4
Happy_VC_Eval_ResultsDir_36CI


In [14]:
i = 1

for Kmer_Size in np.arange(50, 150 + 25, 25):
    for E_Mismatches in [0, 2, 4]:
            
        print("RunID:", i, "  -  ", "K =", Kmer_Size, " E =", E_Mismatches)
        i += 1
        
        run_Genmap_OutputDir = f"{Genmap_Map_AnalysisDir}/Genmap_Map_K{Kmer_Size}_E{E_Mismatches}_Output"
        !mkdir $run_Genmap_OutputDir
        
        # Run Genmap with given parameters (K and E)
        !genmap map -K $Kmer_Size -E $E_Mismatches -I $H37rv_Genmap_IDX_Dir_PATH -O $run_Genmap_OutputDir -t -w -bg -v  # --reverse-complement

        print("\n")
        

RunID: 1   -   K = 50  E = 0
mkdir: cannot create directory ‘../../../210112_PBvsI_VCeval_AnalysisDir_V7_36CI/201027_Genmap_Mappability_H37rv_V1/Genmap_Map_K50_E0_Output’: File exists
Index was loaded (dna4 alphabet, sampling rate of 10).
- The BWT is represented by 32 bit values.
- The sampled suffix array is represented by pairs of 16 and 32 bit values.
- Index was built on a single fasta file.
Progress: 100.00%
- TXT file written in 4.38 seconds
- WIG file written in 0.04 seconds
- bedgraph file written in 0.03 seconds
Mappability computed in 8.95 seconds


RunID: 2   -   K = 50  E = 2
mkdir: cannot create directory ‘../../../210112_PBvsI_VCeval_AnalysisDir_V7_36CI/201027_Genmap_Mappability_H37rv_V1/Genmap_Map_K50_E2_Output’: File exists
Index was loaded (dna4 alphabet, sampling rate of 10).
- The BWT is represented by 32 bit values.
- The sampled suffix array is represented by pairs of 16 and 32 bit values.
- Index was built on a single fasta file.
Progress: 100.00%
- TXT fil

## Look at main output directory (for all runs)

In [15]:
!ls -1 $Genmap_Map_AnalysisDir

201027_ParsedAndPickled_GenmapOutput
201027_PMap_K100E4_BaseLevelInfo_BELOW_1.bed
201027_PMap_K100E4_Regions_BELOW_1.bed
201027_PMap_K50E4_BaseLevelInfo_BELOW_1.bed
201027_PMap_K50E4_Regions_BELOW_1.bed
Genmap_Map_K100_E0_Output
Genmap_Map_K100_E2_Output
Genmap_Map_K100_E4_Output
Genmap_Map_K125_E0_Output
Genmap_Map_K125_E2_Output
Genmap_Map_K125_E4_Output
Genmap_Map_K150_E0_Output
Genmap_Map_K150_E2_Output
Genmap_Map_K150_E4_Output
Genmap_Map_K50_E0_Output
Genmap_Map_K50_E2_Output
Genmap_Map_K50_E4_Output
Genmap_Map_K75_E0_Output
Genmap_Map_K75_E2_Output
Genmap_Map_K75_E4_Output


### Let's look at a specific output directory

In [16]:
!ls -1 $Genmap_Map_AnalysisDir/Genmap_Map_K100_E2_Output

201027_H37rv_PileupMappability_K100_E2.bed
201027_H37rv_PileupMappability_K100_E2.bedgraph
201027_H37rv_PileupMappability_K100_E2.npy
201027_H37rv_PileupMappability_K100_E2.npy.npz
201027_H37rv_PileupMappability_K100_E2.npz
GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.bedgraph
GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.chrom.sizes
GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.txt
GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.wig


## Step 2) Calculate PILEUP mappability across H37rv 

#### Parse Genmap outputs and calculate pileup mappability across H37rv using varying K and E parameters
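As a reminder of what the per-position values mean: Genmap defines mappability as 1 / frequency of the k-mer starting at each position. The toy sketch below illustrates that definition for exact matches only (E = 0) on the forward strand of a made-up sequence. It is not Genmap's algorithm, and the helper `toy_kmer_mappability` is hypothetical. Positions where a full k-mer no longer fits are set to 0, which is consistent with the trailing zeros in the k-mer arrays printed later in this notebook.

```python
from collections import Counter

def toy_kmer_mappability(sequence, k):
    """Exact-match (E = 0), forward-strand-only k-mer mappability of a toy sequence."""
    # All k-mers, indexed by their start position
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    # Mappability = 1 / number of occurrences; the last k - 1 positions get 0
    return [1.0 / counts[kmer] for kmer in kmers] + [0.0] * (k - 1)

# "ACGT" occurs twice (positions 0 and 4), every other 4-mer occurs once
toy_map = toy_kmer_mappability("ACGTACGTTT", 4)
```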

In [17]:
def Genmap_TXT_OutputParser(input_Genmap_Output_TXT_PATH, window_size):
    """
    Parse the text (.txt) output of Genmap and return two numpy arrays.

    OUTPUT:
    1) H37rv_Kmer_Mappability_Array: per-position k-mer mappability as reported by Genmap
       (the value for the k-mer starting at each position).
    2) H37rv_Pileup_Mappability_Array_PaddedEnds: pileup mappability, i.e. the mean k-mer
       mappability over a window of `window_size` positions, padded with NaN at both ends
       so that it has the same length as the k-mer array.
    """

    with open(input_Genmap_Output_TXT_PATH, "r") as InputFile_Genmap_TXT:
        # First line is the sequence header; the second line holds the space-separated values
        H37rv_Kmer_Mappability_List = InputFile_Genmap_TXT.readlines()[1].split(" ")
        H37rv_Kmer_Mappability_Array = np.array(H37rv_Kmer_Mappability_List).astype(float)

    # For each position i, average the k-mer values at positions i - window_size .. i - 1.
    # NOTE: with window_size == K, the k-mers that overlap position i start at
    # i - K + 1 .. i, so this window is shifted by one position relative to that set.
    H37rv_Pileup_Mappability_Array = np.array( [ np.mean( H37rv_Kmer_Mappability_Array[i - (window_size): i ] ) for i in np.arange(window_size, len(H37rv_Kmer_Mappability_Array) - window_size ) ])

    # Pad both ends with NaN so the output has one value per genome position
    H37rv_Pileup_Mappability_Array_PaddedEnds = np.concatenate([ np.full((window_size,), np.nan),
                                                      H37rv_Pileup_Mappability_Array,
                                                      np.full((window_size,), np.nan) ])

    return H37rv_Kmer_Mappability_Array, H37rv_Pileup_Mappability_Array_PaddedEnds
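The list comprehension in the parser calls `np.mean` once per genome position. A possible vectorized alternative is sketched below with `numpy.lib.stride_tricks.sliding_window_view` (this assumes NumPy 1.20 or newer). The function name `pileup_mappability_vectorized` is hypothetical. It is meant to reproduce the same windowing as the parser above (mean over positions `i - window_size .. i - 1`, NaN padding at both ends), not to change the definition, and it assumes the input has at least `window_size` values.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def pileup_mappability_vectorized(kmer_map, window_size):
    """Windowed mean of k-mer mappability, NaN-padded to the input length."""
    kmer_map = np.asarray(kmer_map, dtype=float)
    n = len(kmer_map)
    # windows[s] is a view of kmer_map[s : s + window_size] (no data is copied)
    windows = sliding_window_view(kmer_map, window_size)
    # The loop version uses windows starting at s = 0 .. n - 2 * window_size - 1
    n_windows = max(n - 2 * window_size, 0)
    window_means = windows[:n_windows].mean(axis=1)
    pad = np.full(window_size, np.nan)
    return np.concatenate([pad, window_means, pad])

# Toy check on a made-up k-mer mappability array
toy_kmer_map = np.array([1, 1, 0.5, 0.5, 1, 1, 1, 0.25, 1, 1], dtype=float)
toy_pileup = pileup_mappability_vectorized(toy_kmer_map, 3)
```

Averaging each window directly, rather than differencing a cumulative sum, keeps windows made up entirely of 1.0 exactly equal to 1, which matters for the `< 1` comparisons used later.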


In [18]:
Genmap_Map_AnalysisDir

'../../../210112_PBvsI_VCeval_AnalysisDir_V7_36CI/201027_Genmap_Mappability_H37rv_V1'

In [19]:
dictOf_GM_PileupMap_Arrays = {}
dictOf_GM_KmerMap_Arrays = {}


i = 1
for Kmer_Size in np.arange(50, 150 + 25, 25):
    for E_Mismatches in [0, 2, 4]:
            
        RunID = f"K{Kmer_Size}_E{E_Mismatches}"
    
        print(i, "RunID:", RunID)
        i += 1
        
        i_GM_OutputDir = f"{Genmap_Map_AnalysisDir}/Genmap_Map_K{Kmer_Size}_E{E_Mismatches}_Output"
        
        i_GM_Map_TXT = f"{i_GM_OutputDir}/GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.txt"       
        
        i_GM_KmerMap_Array, i_GM_PileupMap_Array = Genmap_TXT_OutputParser(i_GM_Map_TXT, Kmer_Size)
        
        dictOf_GM_PileupMap_Arrays[RunID] = i_GM_PileupMap_Array
        dictOf_GM_KmerMap_Arrays[RunID] = i_GM_KmerMap_Array
        
        

1 RunID: K50_E0
2 RunID: K50_E2
3 RunID: K50_E4
4 RunID: K75_E0
5 RunID: K75_E2
6 RunID: K75_E4
7 RunID: K100_E0
8 RunID: K100_E2
9 RunID: K100_E4
10 RunID: K125_E0
11 RunID: K125_E2
12 RunID: K125_E4
13 RunID: K150_E0
14 RunID: K150_E2
15 RunID: K150_E4


In [20]:
dictOf_GM_PileupMap_Arrays.keys()

dict_keys(['K50_E0', 'K50_E2', 'K50_E4', 'K75_E0', 'K75_E2', 'K75_E4', 'K100_E0', 'K100_E2', 'K100_E4', 'K125_E0', 'K125_E2', 'K125_E4', 'K150_E0', 'K150_E2', 'K150_E4'])

In [21]:
dictOf_GM_KmerMap_Arrays.keys()

dict_keys(['K50_E0', 'K50_E2', 'K50_E4', 'K75_E0', 'K75_E2', 'K75_E4', 'K100_E0', 'K100_E2', 'K100_E4', 'K125_E0', 'K125_E2', 'K125_E4', 'K150_E0', 'K150_E2', 'K150_E4'])

In [22]:
dictOf_GM_PileupMap_Arrays["K50_E0"]

array([nan, nan, nan, ..., nan, nan, nan])

In [23]:
dictOf_GM_PileupMap_Arrays["K50_E2"]

array([nan, nan, nan, ..., nan, nan, nan])

In [24]:
dictOf_GM_KmerMap_Arrays["K50_E2"]

array([1., 1., 1., ..., 0., 0., 0.])

# Saving parsed data (as Python Pickle)

In [25]:
import pickle

## Let's save ("pickle") the dictionaries of Genmap mappability arrays on O2

In [26]:
ParsedAndPickled_GenmapOutput = f"{Genmap_Map_AnalysisDir}/201027_ParsedAndPickled_GenmapOutput"

!mkdir $ParsedAndPickled_GenmapOutput

mkdir: cannot create directory ‘../../../210112_PBvsI_VCeval_AnalysisDir_V7_36CI/201027_Genmap_Mappability_H37rv_V1/201027_ParsedAndPickled_GenmapOutput’: File exists


In [27]:
Pickle_PATH_dictOf_GM_KmerMap_Arrays = ParsedAndPickled_GenmapOutput + "/201027_dictOf_GM_KmerMap_Arrays.pickle"   

with open(Pickle_PATH_dictOf_GM_KmerMap_Arrays, 'wb') as outputFile:
    pickle.dump(dictOf_GM_KmerMap_Arrays, outputFile)
    

In [28]:
Pickle_PATH_dictOf_GM_PileupMap_Arrays = ParsedAndPickled_GenmapOutput + "/201027_dictOf_GM_PileupMap_Arrays.pickle"   
with open(Pickle_PATH_dictOf_GM_PileupMap_Arrays, 'wb') as outputFile:
    pickle.dump(dictOf_GM_PileupMap_Arrays, outputFile)
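Since both dictionaries only hold NumPy arrays keyed by run ID, a compressed `.npz` archive is a possible alternative to pickling. The sketch below uses a made-up two-entry dictionary and a temporary file; the names `toy_dict` and `npz_path` are placeholders, not paths used in this analysis.

```python
import os
import tempfile
import numpy as np

# Toy stand-in for dictOf_GM_PileupMap_Arrays (run ID -> per-position array)
toy_dict = {
    "K50_E0": np.array([np.nan, 1.0, 0.5, np.nan]),
    "K50_E2": np.array([np.nan, 0.25, 0.5, np.nan]),
}

npz_path = os.path.join(tempfile.mkdtemp(), "toy_dictOf_GM_PileupMap_Arrays.npz")

# Each dictionary key becomes the name of one array inside the archive
np.savez_compressed(npz_path, **toy_dict)

# Read everything back into a plain dictionary
with np.load(npz_path) as data:
    reloaded = {run_id: data[run_id] for run_id in data.files}
```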

### Look at pickled files:

In [29]:
!ls -lah $ParsedAndPickled_GenmapOutput

total 550M
drwxrwsr-x  2 mm774 farhat  114 Jan 18 13:25 .
drwxrwsr-x 18 mm774 farhat  927 Jan 18 13:26 ..
-rw-rw-r--  1 mm774 farhat 505M Mar 16 14:25 201027_dictOf_GM_KmerMap_Arrays.pickle
-rw-rw-r--  1 mm774 farhat 505M Mar 16 14:25 201027_dictOf_GM_PileupMap_Arrays.pickle


In [30]:
!md5sum $ParsedAndPickled_GenmapOutput/*

3d08e447c9b2d3bd092971db466b6f64  ../../../210112_PBvsI_VCeval_AnalysisDir_V7_36CI/201027_Genmap_Mappability_H37rv_V1/201027_ParsedAndPickled_GenmapOutput/201027_dictOf_GM_KmerMap_Arrays.pickle
6573b503acbd4b90785b1e73e024301f  ../../../210112_PBvsI_VCeval_AnalysisDir_V7_36CI/201027_Genmap_Mappability_H37rv_V1/201027_ParsedAndPickled_GenmapOutput/201027_dictOf_GM_PileupMap_Arrays.pickle


## Read back in pickle of Genmap pileup mappability calculations

In [31]:
PB_Vs_Illumina_DataAnalysis_Dir = "../../../210112_PBvsI_VCeval_AnalysisDir_V7_36CI"

Genmap_Map_AnalysisDir = PB_Vs_Illumina_DataAnalysis_Dir + "/201027_Genmap_Mappability_H37rv_V1"  

ParsedAndPickled_GenmapOutput = f"{Genmap_Map_AnalysisDir}/201027_ParsedAndPickled_GenmapOutput"


In [32]:
!ls -lah $Genmap_Map_AnalysisDir

total 2.2M
drwxrwsr-x 18 mm774 farhat  927 Jan 18 13:26 .
drwxrwsr-x  8 mm774 farhat  339 Mar 15 18:02 ..
drwxrwsr-x  2 mm774 farhat  114 Jan 18 13:25 201027_ParsedAndPickled_GenmapOutput
-rw-rw-r--  1 mm774 farhat 2.5M Mar 16 14:03 201027_PMap_K100E4_BaseLevelInfo_BELOW_1.bed
-rw-rw-r--  1 mm774 farhat 6.0K Mar 16 14:03 201027_PMap_K100E4_Regions_BELOW_1.bed
-rw-rw-r--  1 mm774 farhat 4.3M Mar 16 14:03 201027_PMap_K50E4_BaseLevelInfo_BELOW_1.bed
-rw-rw-r--  1 mm774 farhat  30K Mar 16 14:03 201027_PMap_K50E4_Regions_BELOW_1.bed
drwxrwsr-x  2 mm774 farhat  634 Mar 16 14:03 Genmap_Map_K100_E0_Output
drwxrwsr-x  2 mm774 farhat  634 Mar 16 14:03 Genmap_Map_K100_E2_Output
drwxrwsr-x  2 mm774 farhat  570 Mar 16 14:04 Genmap_Map_K100_E4_Output
drwxrwsr-x  2 mm774 farhat  570 Mar 16 14:04 Genmap_Map_K125_E0_Output
drwxrwsr-x  2 mm774 farhat  570 Mar 16 14:04 Genmap_Map_K125_E2_Output
drwxrwsr-x  2 mm774 farhat  570 Mar 16 14:04 Genmap_Map_K125_E4_Output
drwxrwsr-x  2 mm774 farhat  570 Mar 16 1

In [33]:
### !rm ______ $Genmap_Map_AnalysisDir/200819*

In [34]:


Pickle_PATH_dictOf_GM_PileupMap_Arrays = ParsedAndPickled_GenmapOutput + "/201027_dictOf_GM_PileupMap_Arrays.pickle"   
Pickle_PATH_dictOf_GM_KmerMap_Arrays = ParsedAndPickled_GenmapOutput + "/201027_dictOf_GM_KmerMap_Arrays.pickle"   

with open(Pickle_PATH_dictOf_GM_PileupMap_Arrays, "rb") as f:
    dictOf_GM_PileupMap_Arrays = pickle.load(f)

with open(Pickle_PATH_dictOf_GM_KmerMap_Arrays, "rb") as f:
    dictOf_GM_KmerMap_Arrays = pickle.load(f)

In [35]:
dictOf_GM_PileupMap_Arrays.keys()

dict_keys(['K50_E0', 'K50_E2', 'K50_E4', 'K75_E0', 'K75_E2', 'K75_E4', 'K100_E0', 'K100_E2', 'K100_E4', 'K125_E0', 'K125_E2', 'K125_E4', 'K150_E0', 'K150_E2', 'K150_E4'])

In [36]:
dictOf_GM_KmerMap_Arrays.keys()

dict_keys(['K50_E0', 'K50_E2', 'K50_E4', 'K75_E0', 'K75_E2', 'K75_E4', 'K100_E0', 'K100_E2', 'K100_E4', 'K125_E0', 'K125_E2', 'K125_E4', 'K150_E0', 'K150_E2', 'K150_E4'])

In [37]:
Genmap_ParamsUsed = list( dictOf_GM_PileupMap_Arrays.keys() )
Genmap_ParamsUsed

['K50_E0',
 'K50_E2',
 'K50_E4',
 'K75_E0',
 'K75_E2',
 'K75_E4',
 'K100_E0',
 'K100_E2',
 'K100_E4',
 'K125_E0',
 'K125_E2',
 'K125_E4',
 'K150_E0',
 'K150_E2',
 'K150_E4']

# Output Mappability data into various formats (NPY, TSV, BED, BEDGRAPH etc)

## Define function for output of a BED file from a NP array with values for each basepair position

In [38]:
# BED format specifications: https://useast.ensembl.org/info/website/upload/bed.html

def convert_GenomeNParray_To_BED_DF(input_GenomeNParray, genomeChrom = "NC_000962.3"):
    """
    Collapse a per-base array of scores into runs of consecutive identical values and
    return them as a BED-style DataFrame (0-based start, exclusive end).

    NOTE: NaN != NaN, so each NaN position is emitted as its own 1-bp region, and an
    array that starts with NaN also yields an initial zero-length region.
    """
    last_Score = input_GenomeNParray[0]

    startOfRegion = 0
    listOfBED_Tuples = []
    RegionCounter = 1

    for RefPos_0based in tqdm(np.arange(len(input_GenomeNParray))):

        current_Score = input_GenomeNParray[RefPos_0based]

        if current_Score != last_Score:
            # The score changed: close the region that ends just before this position
            endOfRegion = RefPos_0based
            lengthOfRegion = endOfRegion - startOfRegion

            BED_EntryTuple = (genomeChrom, startOfRegion, endOfRegion, f"Region{RegionCounter}_Length_{lengthOfRegion}_bp", last_Score,)
            listOfBED_Tuples.append(BED_EntryTuple)

            RegionCounter += 1

            # Start a new region at the current position
            startOfRegion = RefPos_0based

        # Remember the current score for the comparison at the next position
        last_Score = current_Score

    # Close the final region, which runs to the end of the array
    endOfRegion = len(input_GenomeNParray)
    lengthOfRegion = endOfRegion - startOfRegion

    BED_EntryTuple = (genomeChrom, startOfRegion, endOfRegion, f"Region{RegionCounter}_Length_{lengthOfRegion}_bp", last_Score)
    listOfBED_Tuples.append(BED_EntryTuple)

    BED_DF = pd.DataFrame(listOfBED_Tuples)

    BED_DF.columns = ["chrom", "chromStart", "chromEnd", "name", "score" ]

    return BED_DF
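Because the loop above compares scores with `!=`, and NaN never equals NaN, the NaN-padded ends of a pileup mappability array are split into one region per base. A possible vectorized run-length encoding that treats a stretch of NaN as a single region is sketched below. The function `runs_to_bed_df` is hypothetical and is not the function used to produce the files in this notebook; it assumes a non-empty one-dimensional array.

```python
import numpy as np
import pandas as pd

def runs_to_bed_df(scores, chrom="NC_000962.3"):
    """Run-length encode a per-base score array into BED-style rows (NaN runs kept together)."""
    scores = np.asarray(scores, dtype=float)
    # Neighbouring positions share a run if their scores are equal or both are NaN
    same_as_prev = (scores[1:] == scores[:-1]) | (np.isnan(scores[1:]) & np.isnan(scores[:-1]))
    # A new run starts at position 0 and wherever the score differs from the previous one
    starts = np.concatenate([[0], np.flatnonzero(~same_as_prev) + 1])
    ends = np.concatenate([starts[1:], [len(scores)]])
    names = [f"Region{i}_Length_{e - s}_bp" for i, (s, e) in enumerate(zip(starts, ends), start=1)]
    return pd.DataFrame({
        "chrom": chrom,
        "chromStart": starts,
        "chromEnd": ends,
        "name": names,
        "score": scores[starts],
    })

# Toy array: two NaN-padded ends and three distinct score runs in between
toy_scores = np.array([np.nan, np.nan, 1.0, 1.0, 0.5, 1.0, np.nan])
toy_bed = runs_to_bed_df(toy_scores)
```

Rows with a NaN score could then be dropped with `toy_bed.dropna(subset=["score"])` before writing a BED or bedgraph file.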

In [39]:
!ls -lah $ParsedAndPickled_GenmapOutput

total 76M
drwxrwsr-x  2 mm774 farhat  114 Jan 18 13:25 .
drwxrwsr-x 18 mm774 farhat  927 Jan 18 13:26 ..
-rw-rw-r--  1 mm774 farhat 505M Mar 16 14:25 201027_dictOf_GM_KmerMap_Arrays.pickle
-rw-rw-r--  1 mm774 farhat 505M Mar 16 14:25 201027_dictOf_GM_PileupMap_Arrays.pickle


In [40]:
list( dictOf_GM_PileupMap_Arrays.keys() )

['K50_E0',
 'K50_E2',
 'K50_E4',
 'K75_E0',
 'K75_E2',
 'K75_E4',
 'K100_E0',
 'K100_E2',
 'K100_E4',
 'K125_E0',
 'K125_E2',
 'K125_E4',
 'K150_E0',
 'K150_E2',
 'K150_E4']

## Output pileup mappability data into various formats (NPZ, BED, BEDGRAPH)

In [41]:
Genmap_ParamsUsed = list( dictOf_GM_PileupMap_Arrays.keys() )
Genmap_ParamsUsed

['K50_E0',
 'K50_E2',
 'K50_E4',
 'K75_E0',
 'K75_E2',
 'K75_E4',
 'K100_E0',
 'K100_E2',
 'K100_E4',
 'K125_E0',
 'K125_E2',
 'K125_E4',
 'K150_E0',
 'K150_E2',
 'K150_E4']

In [43]:
Genmap_ParamsUsed = list( dictOf_GM_PileupMap_Arrays.keys() )

for Genmap_Params in Genmap_ParamsUsed:
    
    run_Genmap_OutputDir = f"{Genmap_Map_AnalysisDir}/Genmap_Map_{Genmap_Params}_Output"
        
    Run_PileupMappability_Array = dictOf_GM_PileupMap_Arrays[Genmap_Params]
    
    PMap_BED_DF = convert_GenomeNParray_To_BED_DF(Run_PileupMappability_Array)

    PMap_BED_DF.columns = ["chrom", "chromStart", "chromEnd", "name", "PileupMap_Score"]
    #PMap_BED_DF["name"] = PMap_BED_DF.index + 1

    PileupMappability_WiParams_200328_BED = f"{run_Genmap_OutputDir}/201027_H37rv_PileupMappability_{Genmap_Params}.bed"
    PileupMappability_WiParams_200328_BEDGRAPH = f"{run_Genmap_OutputDir}/201027_H37rv_PileupMappability_{Genmap_Params}.bedgraph"      

    PileupMappability_WiParams_200328_NPY_PATH = f"{run_Genmap_OutputDir}/201027_H37rv_PileupMappability_{Genmap_Params}.npy"
    PileupMappability_WiParams_200328_NPZ_PATH = f"{run_Genmap_OutputDir}/201027_H37rv_PileupMappability_{Genmap_Params}.npz"

    #np.save(PileupMappability_WiParams_200328_NPY_PATH, Run_PileupMappability_Array )
    np.savez_compressed(PileupMappability_WiParams_200328_NPZ_PATH, Run_PileupMappability_Array )

    PMap_BED_DF.to_csv(PileupMappability_WiParams_200328_BED,
                           sep = "\t",
                           index = False,
                           header = False)
    
    !cut -f 1,2,3,5 $PileupMappability_WiParams_200328_BED > $PileupMappability_WiParams_200328_BEDGRAPH


100%|██████████| 4411532/4411532 [00:02<00:00, 1590078.51it/s]
100%|██████████| 4411532/4411532 [00:02<00:00, 1496988.86it/s]
100%|██████████| 4411532/4411532 [00:02<00:00, 1559522.43it/s]
100%|██████████| 4411532/4411532 [00:02<00:00, 1591890.41it/s]
100%|██████████| 4411532/4411532 [00:02<00:00, 1584139.85it/s]
100%|██████████| 4411532/4411532 [00:02<00:00, 1590157.63it/s]
100%|██████████| 4411532/4411532 [00:02<00:00, 1595023.79it/s]
100%|██████████| 4411532/4411532 [00:02<00:00, 1578940.05it/s]
100%|██████████| 4411532/4411532 [00:02<00:00, 1581794.04it/s]
100%|██████████| 4411532/4411532 [00:02<00:00, 1597365.59it/s]
100%|██████████| 4411532/4411532 [00:02<00:00, 1580191.38it/s]
100%|██████████| 4411532/4411532 [00:02<00:00, 1578406.81it/s]
100%|██████████| 4411532/4411532 [00:02<00:00, 1581286.71it/s]
100%|██████████| 4411532/4411532 [00:02<00:00, 1574716.64it/s]
100%|██████████| 4411532/4411532 [00:02<00:00, 1567559.51it/s]
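The loop above passes each array to `np.savez_compressed` as a positional argument, so NumPy stores it under the default key `arr_0`. A short sketch of reading such a file back, using a made-up array and a temporary path rather than the real output files:

```python
import os
import tempfile
import numpy as np

toy_pileup = np.array([np.nan, 1.0, 0.5, 1.0, np.nan])
toy_npz_path = os.path.join(tempfile.mkdtemp(), "toy_PileupMappability.npz")

# Positional arrays are stored as arr_0, arr_1, ... inside the archive
np.savez_compressed(toy_npz_path, toy_pileup)

with np.load(toy_npz_path) as npz_file:
    stored_keys = list(npz_file.files)
    reloaded_pileup = npz_file["arr_0"]
```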


In [44]:
!ls -1 $Genmap_Map_AnalysisDir

201027_ParsedAndPickled_GenmapOutput
201027_PMap_K100E4_BaseLevelInfo_BELOW_1.bed
201027_PMap_K100E4_Regions_BELOW_1.bed
201027_PMap_K50E4_BaseLevelInfo_BELOW_1.bed
201027_PMap_K50E4_Regions_BELOW_1.bed
Genmap_Map_K100_E0_Output
Genmap_Map_K100_E2_Output
Genmap_Map_K100_E4_Output
Genmap_Map_K125_E0_Output
Genmap_Map_K125_E2_Output
Genmap_Map_K125_E4_Output
Genmap_Map_K150_E0_Output
Genmap_Map_K150_E2_Output
Genmap_Map_K150_E4_Output
Genmap_Map_K50_E0_Output
Genmap_Map_K50_E2_Output
Genmap_Map_K50_E4_Output
Genmap_Map_K75_E0_Output
Genmap_Map_K75_E2_Output
Genmap_Map_K75_E4_Output


In [45]:
!ls -1 $Genmap_Map_AnalysisDir/Genmap_Map_K100_E4_Output

201027_H37rv_PileupMappability_K100_E4.bed
201027_H37rv_PileupMappability_K100_E4.bedgraph
201027_H37rv_PileupMappability_K100_E4.npy
201027_H37rv_PileupMappability_K100_E4.npz
GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.bedgraph
GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.chrom.sizes
GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.txt
GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.wig


In [75]:
!ls -1 $Genmap_Map_AnalysisDir/Genmap_Map_K50_E0_Output

201027_H37rv_PileupMappability_K50_E0.bed
201027_H37rv_PileupMappability_K50_E0.bedgraph
201027_H37rv_PileupMappability_K50_E0.npy
201027_H37rv_PileupMappability_K50_E0.npy.npz
201027_H37rv_PileupMappability_K50_E0.npz
GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.bedgraph
GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.chrom.sizes
GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.txt
GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.wig


In [76]:
!ls -1 $Genmap_Map_AnalysisDir/Genmap_Map_K100_E0_Output

201027_H37rv_PileupMappability_K100_E0.bed
201027_H37rv_PileupMappability_K100_E0.bedgraph
201027_H37rv_PileupMappability_K100_E0.npy
201027_H37rv_PileupMappability_K100_E0.npy.npz
201027_H37rv_PileupMappability_K100_E0.npz
GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.bedgraph
GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.chrom.sizes
GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.txt
GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.wig


## K50_E4 - Output BED files specifying all regions with a Pmap score below 1

In [48]:
K50_E4_PileupMappability_NPA = dictOf_GM_PileupMap_Arrays["K50_E4"]
PMap_BED_DF = convert_GenomeNParray_To_BED_DF(K50_E4_PileupMappability_NPA)

PMap_BED_DF.columns = ["chrom", "chromStart", "chromEnd", "name", "PileupMap_Score"]
#PMap_BED_DF["name"] = PMap_BED_DF.index + 1
PMap_BED_DF_NoLengthColn = PMap_BED_DF.copy()

PMap_BED_DF["Length"] = PMap_BED_DF["chromEnd"] - PMap_BED_DF["chromStart"]

PMap_K50E4_BED_DF_Below1 = PMap_BED_DF[ (PMap_BED_DF["PileupMap_Score"] < 1) & (PMap_BED_DF["PileupMap_Score"] > 0)]  

PMap_K50E4_200730_BED_BELOW_1_PATH = f"{Genmap_Map_AnalysisDir}/201027_PMap_K50E4_BaseLevelInfo_BELOW_1.bed"

PMap_K50E4_BED_DF_Below1.to_csv(PMap_K50E4_200730_BED_BELOW_1_PATH,
                           sep = "\t",
                           index = False,
                           header = False)

PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH = f"{Genmap_Map_AnalysisDir}/201027_PMap_K50E4_Regions_BELOW_1.bed"

# Merge/condense adjacent basepairs that are below the defined threshold
!bedtools merge -i $PMap_K50E4_200730_BED_BELOW_1_PATH > $PMap_K50E4_200730_BED_REGIONS_BELOW_1_PATH


100%|██████████| 4411532/4411532 [00:02<00:00, 1538128.40it/s]
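With default settings, `bedtools merge` combines overlapping and directly adjacent (book-ended) intervals in a sorted BED file. The pure-Python sketch below illustrates that idea for a single chromosome. The helper `merge_adjacent_intervals` is hypothetical and only meant to show what the merge step does to the per-run rows, not to replace bedtools.

```python
def merge_adjacent_intervals(intervals):
    """Merge overlapping or book-ended (start, end) intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Touches or overlaps the previous interval: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # Gap before this interval: start a new merged region
            merged.append([start, end])
    return [tuple(region) for region in merged]

# Three adjacent low-mappability runs collapse into one region; the last stays separate
toy_regions = merge_adjacent_intervals([(10, 11), (11, 12), (12, 15), (20, 21)])
```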


In [49]:
!ls -1 $Genmap_Map_AnalysisDir

201027_ParsedAndPickled_GenmapOutput
201027_PMap_K100E4_BaseLevelInfo_BELOW_1.bed
201027_PMap_K100E4_Regions_BELOW_1.bed
201027_PMap_K50E4_BaseLevelInfo_BELOW_1.bed
201027_PMap_K50E4_Regions_BELOW_1.bed
Genmap_Map_K100_E0_Output
Genmap_Map_K100_E2_Output
Genmap_Map_K100_E4_Output
Genmap_Map_K125_E0_Output
Genmap_Map_K125_E2_Output
Genmap_Map_K125_E4_Output
Genmap_Map_K150_E0_Output
Genmap_Map_K150_E2_Output
Genmap_Map_K150_E4_Output
Genmap_Map_K50_E0_Output
Genmap_Map_K50_E2_Output
Genmap_Map_K50_E4_Output
Genmap_Map_K75_E0_Output
Genmap_Map_K75_E2_Output
Genmap_Map_K75_E4_Output


## K100_E4 - Output BED files specifying all regions with a Pmap score below 1

In [50]:
K100_E4_PileupMappability_NPA = dictOf_GM_PileupMap_Arrays["K100_E4"]
PMap_BED_DF = convert_GenomeNParray_To_BED_DF(K100_E4_PileupMappability_NPA)

PMap_BED_DF.columns = ["chrom", "chromStart", "chromEnd", "name", "PileupMap_Score"]
#PMap_BED_DF["name"] = PMap_BED_DF.index + 1
PMap_BED_DF_NoLengthColn = PMap_BED_DF.copy()

PMap_BED_DF["Length"] = PMap_BED_DF["chromEnd"] - PMap_BED_DF["chromStart"]

PMap_K100E4_BED_DF_Below1 = PMap_BED_DF[ (PMap_BED_DF["PileupMap_Score"] < 1) & (PMap_BED_DF["PileupMap_Score"] > 0)]  

PMap_K100E4_200730_BED_BELOW_1_PATH = f"{Genmap_Map_AnalysisDir}/201027_PMap_K100E4_BaseLevelInfo_BELOW_1.bed"

PMap_K100E4_BED_DF_Below1.to_csv(PMap_K100E4_200730_BED_BELOW_1_PATH,
                           sep = "\t",
                           index = False,
                           header = False)

PMap_K100E4_200730_BED_REGIONS_BELOW_1_PATH = f"{Genmap_Map_AnalysisDir}/201027_PMap_K100E4_Regions_BELOW_1.bed"

# Merge/condense adjacent basepairs that are below the defined threshold
!bedtools merge -i $PMap_K100E4_200730_BED_BELOW_1_PATH > $PMap_K100E4_200730_BED_REGIONS_BELOW_1_PATH

100%|██████████| 4411532/4411532 [00:02<00:00, 1549629.00it/s]


In [51]:
!ls -lah $Genmap_Map_AnalysisDir/Genmap_Map_K50_E4_Output

total 42M
drwxrwsr-x  2 mm774 farhat  566 Mar 16 14:04 .
drwxrwsr-x 18 mm774 farhat  927 Jan 18 13:26 ..
-rw-rw-r--  1 mm774 farhat 4.2M Mar 16 14:25 201027_H37rv_PileupMappability_K50_E4.bed
-rw-rw-r--  1 mm774 farhat 2.6M Mar 16 14:25 201027_H37rv_PileupMappability_K50_E4.bedgraph
-rw-rw-r--  1 mm774 farhat  34M Mar 16 13:48 201027_H37rv_PileupMappability_K50_E4.npy
-rw-rw-r--  1 mm774 farhat 323K Mar 16 14:25 201027_H37rv_PileupMappability_K50_E4.npz
-rw-rw-r--  1 mm774 farhat 235K Mar 16 14:09 GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.bedgraph
-rw-rw-r--  1 mm774 farhat   20 Mar 16 14:09 GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.chrom.sizes
-rw-rw-r--  1 mm774 farhat 8.9M Mar 16 14:09 GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.txt
-rw-rw-r--  1 mm774 farhat 343K Mar 16 14:09 GCF_000195955.2_ASM19595v2_genomic.TrimmedSeqName.genmap.wig


In [52]:
dictOf_GM_PileupMap_Arrays.keys()

dict_keys(['K50_E0', 'K50_E2', 'K50_E4', 'K75_E0', 'K75_E2', 'K75_E4', 'K100_E0', 'K100_E2', 'K100_E4', 'K125_E0', 'K125_E2', 'K125_E4', 'K150_E0', 'K150_E2', 'K150_E4'])

In [53]:
dictOf_GM_PileupMap_Arrays['K50_E0'][-100:]

array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan])

In [54]:
dictOf_GM_PileupMap_Arrays['K50_E0'][:100]

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

## What proportion of the genome has Pileup mappability < 1 ?

In [55]:
Pmap_K50E0_NP = dictOf_GM_PileupMap_Arrays['K50_E0']
Pmap_K75E0_NP = dictOf_GM_PileupMap_Arrays['K75_E0']
Pmap_K100E0_NP = dictOf_GM_PileupMap_Arrays['K100_E0']
Pmap_K150E0_NP = dictOf_GM_PileupMap_Arrays['K150_E0']

In [56]:
((Pmap_K50E0_NP < 1) & (Pmap_K50E0_NP >= 0)).sum()


105136

In [57]:
((Pmap_K75E0_NP < 1) & (Pmap_K75E0_NP >= 0)).sum()


95896

In [58]:
((Pmap_K100E0_NP < 1) & (Pmap_K100E0_NP >= 0)).sum()


91143

In [59]:
((Pmap_K150E0_NP < 1) & (Pmap_K150E0_NP >= 0)).sum()


84199

In [60]:
## What proportion of the genome has Pileup mappability < 1 ?

In [61]:
(105136 / 4411432 ) * 100

2.3832623964281896

In [62]:
(95896 / 4411432 ) * 100

2.173806600668445

In [63]:
(91143 / 4411432 ) * 100

2.066063808758698

In [64]:
(84199 / 4411432 ) * 100

1.908654604672587
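The denominator 4411432 used above is the number of non-NaN positions in the K50 array (4411532 minus the 2 x 50 padded positions). For larger K, more positions are padded with NaN, so the number of defined positions differs slightly between runs. A small sketch that derives the denominator from each array itself is given below. The helper `percent_below_one` is hypothetical, the toy array is made up, and the function assumes at least one defined position.

```python
import numpy as np

def percent_below_one(pileup_map):
    """Percentage of defined (non-NaN) positions with pileup mappability below 1."""
    pileup_map = np.asarray(pileup_map, dtype=float)
    # Positions padded with NaN are excluded from both numerator and denominator
    defined = np.isfinite(pileup_map)
    below_one = pileup_map[defined] < 1
    return 100.0 * below_one.sum() / defined.sum()

# Toy array: 4 defined positions, 2 of them below 1
toy_pmap = np.array([np.nan, np.nan, 1.0, 0.5, 1.0, 0.75, np.nan, np.nan])
toy_percent = percent_below_one(toy_pmap)
```

Applied to the real arrays, this would be called as `percent_below_one(dictOf_GM_PileupMap_Arrays["K75_E0"])`, and so on for each run.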

In [65]:
Pmap_K50E0_NP.shape

(4411532,)

In [66]:
Pmap_K50E0_NP[ (Pmap_K50E0_NP < 1) & (Pmap_K50E0_NP >= 0) ].shape


(105136,)

In [67]:
105136

105136

In [68]:
105136 / 4411432

0.023832623964281895

In [69]:
### 2.38%

In [70]:
105136 / 0.023832623964281895

4411432.0

In [71]:
((Pmap_K50E0_NP == 1) & (Pmap_K50E0_NP >= 0)).sum()


4306296

In [72]:
105136 + 4306296

4411432

In [73]:
((Pmap_K50E0_NP < 1) & (Pmap_K50E0_NP >= 0)).shape


(4411532,)

In [74]:
### OutDir: /n/data1/hms/dbmi/farhat/mm774/Projects/PacBio_Evaluation_Project/200328_PBvsI_VCeval_AnalysisDir