# This notebook was created to collect frameshift indel variants with $10\% \le $ alt AF $< 75\%$ in *mmpR*, *mmpL5*, *mmpS5*, *eis*, *whiB7* and *ahpC* from 31,428 isolates in our sample

In [1]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [10]:
import vcf

%matplotlib inline
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.ticker as ticker
from itertools import compress
from pylab import MaxNLocator
import seaborn as sns; sns.set()
from matplotlib.colors import LogNorm
from matplotlib import gridspec
import ast
import itertools
from sklearn.preprocessing import StandardScaler

import fastcluster
from sklearn import cluster, datasets
import scipy.cluster.hierarchy as hier
from sklearn.cluster import KMeans
import time
import sys
import pickle

import Bio
from Bio.Alphabet import IUPAC
from Bio.Blast.Applications import NcbiblastnCommandline
from Bio.Blast import NCBIXML
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio import pairwise2
from Bio import SeqIO
from Bio.Graphics import GenomeDiagram
from Bio.SeqUtils import GC
from Bio import Phylo

from Bio.Align.Applications import MuscleCommandline
from StringIO import StringIO
from Bio import AlignIO
from Bio.Align import AlignInfo
from Bio.Seq import MutableSeq
import gzip

import networkx as nx
import scipy
from collections import Counter

################################################################################################################################################################################################################

# [1] Load INDEL genotype matrix and Annotation Files

################################################################################################################################################################################################################

In [3]:
#load isolate annotation file (columns of Genotype Matrix)
isolate_annotation_DF = pd.read_pickle('/n/data1/hms/dbmi/farhat/Roger/mmpR_BDQ_mutant_project/rolling_DB_scrape_indels/Genotypes_Filtered_2/genotypes_isolate_annotation.pkl')

#load INDEL annotation file (rows of Genotype Matrix) with gene annotation information
INDEL_annotation_DF = pd.read_pickle('/n/data1/hms/dbmi/farhat/Roger/mmpR_BDQ_mutant_project/rolling_DB_scrape_indels/Genotypes_Filtered_2/genotypes_INDEL_functional_annotation.pkl')
INDEL_annotation_DF.reset_index(inplace = True , drop = False)

#load Genotypes Matrix
genotypes_array =  np.load('/n/data1/hms/dbmi/farhat/Roger/mmpR_BDQ_mutant_project/rolling_DB_scrape_indels/Genotypes_Filtered_2/genotypes_matrix.npy')

In [4]:
isolate_annotation_DF.head()

Unnamed: 0,isolate_ID,lineage_1,lineage_2,lineage_3,lineage_4,lineage_5,lineage_6,lineage_7,lineage_8,lineage_9,lineage_10,lineage_11,lineage_call,group
0,SAMN13051687,2,2,1,1.0,1.0,i3,,,,,,2.2.1.1.1.i3,2
1,SAMN09100245,4,2,1,2.0,1.0,1,i3,2.0,,,,4.2.1.2.1.1.i3.2,4B
2,SAMN08732238,2,2,1,1.0,1.0,,,,,,,2.2.1.1.1,2
3,SAMN07658260,3,1,1,,,,,,,,,3.1.1,3
4,SAMN03648003,2,2,1,1.0,1.0,,,,,,,2.2.1.1.1,2


In [5]:
INDEL_annotation_DF.head()

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos
0,ACCGACGAAG_313_A,313,ACCGACGAAG,A,Essential,dnaA,Rv0001,313.0,del,inframe,105.0
1,TC_1549_T,1549,TC,T,,,Rv0001_Rv0002,,del,frameshift,
2,TAA_1552_T,1552,TAA,T,,,Rv0001_Rv0002,,del,frameshift,
3,TA_1552_T,1552,TA,T,,,Rv0001_Rv0002,,del,frameshift,
4,T_1552_TA,1552,T,TA,,,Rv0001_Rv0002,,ins,frameshift,


In [6]:
genotypes_array

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int8)

In [7]:
np.shape(genotypes_array)

(55268, 31428)

################################################################################################################################################################################################################

# [2] Keep only *mixed* indels ($10\% \le $ alt AF $< 75\%$)

################################################################################################################################################################################################################

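The genotypes matrix stores every high-quality indel call as an alternate allele frequency (%):
- $0$ for calls with $0\%$ alt allele frequency (supports the reference)
- an integer between $10$ and $100$ for calls with $10\%$ to $100\%$ alt allele frequency
- $-9$ for bad-quality calls

Filter the matrix and retain only indels with at least one *mixed* call ($10\% \le $ alt AF $< 75\%$).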
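
The coding scheme above can be illustrated on a handful of made-up calls. This is a minimal sketch: `toy_calls` and the label names (`bad_quality`, `ref`, `mixed`, `alt_AF_ge_75`) are hypothetical and chosen only for this example; the thresholds are the ones stated in this notebook.

```python
import numpy as np

# Toy allele-frequency calls for one indel across six isolates (illustrative values only)
toy_calls = np.array([0, -9, 10, 74, 75, 100], dtype=np.int8)

# Label each call according to the coding scheme (label names are hypothetical)
labels = np.select(
    [
        toy_calls == -9,                       # bad-quality call
        toy_calls == 0,                        # supports the reference allele
        (toy_calls >= 10) & (toy_calls < 75),  # mixed call: 10% <= alt AF < 75%
        toy_calls >= 75,                       # alt AF of 75% or more
    ],
    ['bad_quality', 'ref', 'mixed', 'alt_AF_ge_75'],
    default='unexpected',
)

print(labels.tolist())
```

Note that the lower bound is inclusive and the upper bound is exclusive, so a call of 10 counts as mixed while a call of 75 does not.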

In [None]:
#drop rows (indel variants) with ZERO mixed indels
INDELs_to_keep_filter = []

#iterate through each indel variant (takes ~1 hour)
for row_i in INDEL_annotation_DF.index: 
    
    #number of isolates (out of 31,428) with a mixed call (10% <= AF < 75%) for this indel
    num_mixed_indels_i = sum([(AF_larger_than_10 and AF_smaller_than_75) for AF_larger_than_10, AF_smaller_than_75 in zip(genotypes_array[row_i , :] >= 10, genotypes_array[row_i , :] < 75)])
    
    if num_mixed_indels_i >= 1:
        INDELs_to_keep_filter.append(True)
    elif num_mixed_indels_i == 0:
        INDELs_to_keep_filter.append(False)
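
The same row filter can probably be computed without a Python-level loop. The sketch below uses a small made-up matrix (`toy_genotypes` is hypothetical) and NumPy boolean operations; it has not been run on the full genotypes matrix. For a matrix of 55,268 x 31,428 `int8` calls, each temporary boolean array would occupy roughly 1.7 GB, so processing the rows in chunks may be necessary.

```python
import numpy as np

# Toy genotypes matrix: rows are indels, columns are isolates (illustrative values only)
toy_genotypes = np.array([
    [ 0,  0, 100, -9],   # only an alt AF >= 75% call -> drop
    [ 0, 45,   0,  0],   # one mixed call             -> keep
    [-9,  0,   0,  0],   # no alt support             -> drop
    [10,  0,  74, 75],   # two mixed calls            -> keep
], dtype=np.int8)

# True wherever a call is mixed (10% <= alt AF < 75%)
mixed_mask = (toy_genotypes >= 10) & (toy_genotypes < 75)

# Keep an indel (row) if at least one isolate has a mixed call
toy_keep_filter = mixed_mask.any(axis=1)

print(toy_keep_filter.tolist())
```

A boolean array like `toy_keep_filter` can index both a NumPy matrix and a pandas DataFrame, in the same way the list `INDELs_to_keep_filter` is used in this notebook.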

Drop indels (rows) with no mixed call in any isolate from the genotypes matrix and from the INDEL annotation DataFrame

In [None]:
sum(INDELs_to_keep_filter)

In [None]:
#filter Genotypes Matrix
genotypes_array = genotypes_array[INDELs_to_keep_filter , :]

#filter INDEL annotation file
INDEL_annotation_DF = INDEL_annotation_DF[INDELs_to_keep_filter]
INDEL_annotation_DF.reset_index(drop = True , inplace = True) #re-index new filtered INDEL annotation DF (so new index matches indexing of genotypes matrix rows)

In [None]:
len(INDELs_to_keep_filter) - sum(INDELs_to_keep_filter) #number of INDELs dropped by this filter

In [None]:
np.shape(genotypes_array)

In [None]:
INDEL_annotation_DF.head()

In [None]:
np.shape(INDEL_annotation_DF) #Annotation for Rows of Genotypes Matrix

Save __Genotypes Matrix__

In [None]:
#save Genotypes Matrix
np.save('/n/data1/hms/dbmi/farhat/Roger/mmpR_BDQ_mutant_project/rolling_DB_scrape_indels/Genotypes_Mixed_only/genotypes_matrix' , genotypes_array , allow_pickle = True)

Save __INDEL annotation file__

In [None]:
INDEL_annotation_DF.to_pickle('/n/data1/hms/dbmi/farhat/Roger/mmpR_BDQ_mutant_project/rolling_DB_scrape_indels/Genotypes_Mixed_only/genotypes_INDEL_functional_annotation.pkl')

Save __Isolate annotation file__

In [None]:
isolate_annotation_DF.to_pickle('/n/data1/hms/dbmi/farhat/Roger/mmpR_BDQ_mutant_project/rolling_DB_scrape_indels/Genotypes_Mixed_only/genotypes_isolate_annotation.pkl')

################################################################################################################################################################################################################

# [3] Analyze *frameshift* mixed indels in mmpR, mmpS5, mmpL5, eis, whiB7, ahpC

################################################################################################################################################################################################################

### Load *mixed* INDEL genotype matrix and Annotation Files

In [11]:
#load isolate annotation file (columns of Genotype Matrix)
isolate_annotation_DF = pd.read_pickle('/n/data1/hms/dbmi/farhat/Roger/mmpR_BDQ_mutant_project/rolling_DB_scrape_indels/Genotypes_Mixed_only/genotypes_isolate_annotation.pkl')

#load INDEL annotation file (rows of Genotype Matrix) with gene annotation information
INDEL_annotation_DF = pd.read_pickle('/n/data1/hms/dbmi/farhat/Roger/mmpR_BDQ_mutant_project/rolling_DB_scrape_indels/Genotypes_Mixed_only/genotypes_INDEL_functional_annotation.pkl')

#load Genotypes Matrix
genotypes_array =  np.load('/n/data1/hms/dbmi/farhat/Roger/mmpR_BDQ_mutant_project/rolling_DB_scrape_indels/Genotypes_Mixed_only/genotypes_matrix.npy')

In [12]:
isolate_annotation_DF.head()

Unnamed: 0,isolate_ID,lineage_1,lineage_2,lineage_3,lineage_4,lineage_5,lineage_6,lineage_7,lineage_8,lineage_9,lineage_10,lineage_11,lineage_call,group
0,SAMN13051687,2,2,1,1.0,1.0,i3,,,,,,2.2.1.1.1.i3,2
1,SAMN09100245,4,2,1,2.0,1.0,1,i3,2.0,,,,4.2.1.2.1.1.i3.2,4B
2,SAMN08732238,2,2,1,1.0,1.0,,,,,,,2.2.1.1.1,2
3,SAMN07658260,3,1,1,,,,,,,,,3.1.1,3
4,SAMN03648003,2,2,1,1.0,1.0,,,,,,,2.2.1.1.1,2


In [13]:
np.shape(isolate_annotation_DF)

(31428, 14)

In [14]:
INDEL_annotation_DF.head()

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos
0,TA_1552_T,1552,TA,T,,,Rv0001_Rv0002,,del,frameshift,
1,T_1552_TA,1552,T,TA,,,Rv0001_Rv0002,,ins,frameshift,
2,G_1622_GCGCACAGA,1622,G,GCGCACAGA,,,Rv0001_Rv0002,,ins,frameshift,
3,C_1652_CG,1652,C,CG,,,Rv0001_Rv0002,,ins,frameshift,
4,A_1692_ACCC,1692,A,ACCC,,,Rv0001_Rv0002,,ins,inframe,


In [15]:
np.shape(INDEL_annotation_DF)

(7731, 11)

In [16]:
genotypes_array

array([[ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  0, -9, ...,  0,  0, -9],
       ...,
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0]], dtype=int8)

In [17]:
np.shape(genotypes_array)

(7731, 31428)

## Keep only *frameshift* indels

In [18]:
INDEL_annotation_DF = INDEL_annotation_DF[INDEL_annotation_DF.INDEL_type == 'frameshift']

In [19]:
INDEL_annotation_DF.head()

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos
0,TA_1552_T,1552,TA,T,,,Rv0001_Rv0002,,del,frameshift,
1,T_1552_TA,1552,T,TA,,,Rv0001_Rv0002,,ins,frameshift,
2,G_1622_GCGCACAGA,1622,G,GCGCACAGA,,,Rv0001_Rv0002,,ins,frameshift,
3,C_1652_CG,1652,C,CG,,,Rv0001_Rv0002,,ins,frameshift,
6,T_1779_TA,1779,T,TA,,,Rv0001_Rv0002,,ins,frameshift,


In [20]:
np.shape(INDEL_annotation_DF)

(5925, 11)

### Look for indels in *mmpR*

In [63]:
INDEL_annotation_DF[[('Rv0678' in gene_id) & ('_' not in gene_id) for gene_id in INDEL_annotation_DF.gene_id]].head() #mmpR

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos
1374,C_779181_CG,779181,C,CG,Non-Essential,Rv0678,Rv0678,192,ins,frameshift,64
1375,G_779249_GC,779249,G,GC,Non-Essential,Rv0678,Rv0678,260,ins,frameshift,87


Insertion in *mmpR* is **C_779181_CG**

Deletion in *mmpR* is **CG_779181_C**

In [64]:
mmpR_indels_df = INDEL_annotation_DF[[('Rv0678' in gene_id) & ('_' not in gene_id) for gene_id in INDEL_annotation_DF.gene_id]]

### Look for indels in *mmpS5*

In [65]:
INDEL_annotation_DF[[('Rv0677c' in gene_id) & ('_' not in gene_id) for gene_id in INDEL_annotation_DF.gene_id]].head() #mmpS5

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos
1373,G_778696_GTTGA,778696,G,GTTGA,Non-Essential,mmpS5,Rv0677c,210,ins,frameshift,70


In [66]:
mmpS5_indels_df = INDEL_annotation_DF[[('Rv0677c' in gene_id) & ('_' not in gene_id) for gene_id in INDEL_annotation_DF.gene_id]]

### Look for indels in *mmpL5*

In [67]:
INDEL_annotation_DF[[('Rv0676c' in gene_id) & ('_' not in gene_id) for gene_id in INDEL_annotation_DF.gene_id]].head() #mmpL5

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos
1372,CGATCT_777076_C,777076,CGATCT,C,Non-Essential,mmpL5,Rv0676c,1405,del,frameshift,469


Deletion in *mmpL5* is **AC_777875_A**

In [68]:
mmpL5_indels_df = INDEL_annotation_DF[[('Rv0676c' in gene_id) & ('_' not in gene_id) for gene_id in INDEL_annotation_DF.gene_id]]

### Look for indels in *eis*

In [69]:
INDEL_annotation_DF[[('Rv2416c' in gene_id) & ('_' not in gene_id) for gene_id in INDEL_annotation_DF.gene_id]].head() #eis

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos
4787,GGT_2714526_G,2714526,GGT,G,Non-Essential,eis,Rv2416c,807,del,frameshift,269
4788,G_2714847_GCT,2714847,G,GCT,Non-Essential,eis,Rv2416c,486,ins,frameshift,162


In [70]:
eis_indels_df = INDEL_annotation_DF[[('Rv2416c' in gene_id) & ('_' not in gene_id) for gene_id in INDEL_annotation_DF.gene_id]]

### Look for indels in *whiB7*

In [71]:
INDEL_annotation_DF[[('Rv3197A' in gene_id) & ('_' not in gene_id) for gene_id in INDEL_annotation_DF.gene_id]].head() #whiB7

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos
6195,GC_3568488_G,3568488,GC,G,Non-Essential,whiB7,Rv3197A,192,del,frameshift,64


In [72]:
whiB7_indels_df = INDEL_annotation_DF[[('Rv3197A' in gene_id) & ('_' not in gene_id) for gene_id in INDEL_annotation_DF.gene_id]]

### Look for indels in *ahpC*

In [73]:
INDEL_annotation_DF[[('Rv2428' in gene_id) & ('_' not in gene_id) for gene_id in INDEL_annotation_DF.gene_id]].head() #ahpC

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos
4811,G_2726370_GT,2726370,G,GT,Antibiotic Resistance,ahpC,Rv2428,178,ins,frameshift,60
4812,C_2726405_CA,2726405,C,CA,Antibiotic Resistance,ahpC,Rv2428,213,ins,frameshift,71


In [74]:
ahpC_indels_df = INDEL_annotation_DF[[('Rv2428' in gene_id) & ('_' not in gene_id) for gene_id in INDEL_annotation_DF.gene_id]]
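
The six per-gene selections above repeat one pattern: keep rows whose `gene_id` contains the locus tag and no underscore (the underscore marks intergenic regions such as `Rv0001_Rv0002`). A minimal sketch of that pattern as a reusable function follows. `select_locus_indels`, `locus_tags` and `toy_annotation_df` are hypothetical names introduced for illustration, and the toy table holds only a few rows copied from the outputs above.

```python
import pandas as pd

# Locus tags used in the cells above
locus_tags = {
    'mmpR': 'Rv0678', 'mmpS5': 'Rv0677c', 'mmpL5': 'Rv0676c',
    'eis': 'Rv2416c', 'whiB7': 'Rv3197A', 'ahpC': 'Rv2428',
}

def select_locus_indels(annotation_df, locus_tag):
    """Hypothetical helper: rows whose gene_id contains locus_tag and is not intergenic."""
    in_locus = annotation_df.gene_id.str.contains(locus_tag, regex=False)
    intergenic = annotation_df.gene_id.str.contains('_', regex=False)
    return annotation_df[in_locus & ~intergenic]

# Toy annotation table (a few rows taken from the outputs in this notebook)
toy_annotation_df = pd.DataFrame({
    'key': ['C_779181_CG', 'T_1552_TA', 'GC_3568488_G'],
    'gene_id': ['Rv0678', 'Rv0001_Rv0002', 'Rv3197A'],
})

# One filtered DataFrame per gene of interest
toy_indels_by_gene = {
    gene: select_locus_indels(toy_annotation_df, tag) for gene, tag in locus_tags.items()
}

print(toy_indels_by_gene['mmpR'].key.tolist())
```

Like the list comprehensions above, this is a substring match, so a tag such as `Rv2428` would also match a longer tag that contains it.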

################################################################################################################################################################################################################

# [4] Get genotypes for indels of interest

################################################################################################################################################################################################################

### *Function* to get the genotypes for a specific indel and the attributes of the isolates that support it

In [75]:
def get_genotypes_for_indel(indel_i_key):
    
    #0 supports Ref, 10 <= AF < 75 is a mixed call supporting Alt, -9 indicates a bad quality call
    indel_i_genotypes = genotypes_array[INDEL_annotation_DF[INDEL_annotation_DF.key == indel_i_key].index[0] , :] 

    #count the number of isolates that support this (mixed) indel call
    mixed_indel_bool = [(AF_larger_than_10 and AF_smaller_than_75) for AF_larger_than_10, AF_smaller_than_75 in zip(indel_i_genotypes >= 10, indel_i_genotypes < 75)]
    num_isolates_indel_i = np.sum(mixed_indel_bool)

    #create a boolean filter and extract information for the isolates that support this call
    isolate_annotation_indel_i = isolate_annotation_DF[mixed_indel_bool]

    #find number of sub-lineages that have indel support in at least 1 isolate
    num_sublineages_with_indel_i = len(set(list(isolate_annotation_indel_i.lineage_call.values)))

    #get list of sublineages w/ at least 1 isolate that supports indel call
    #sublineages_with_indel = list(set(isolate_annotation_indel_i.lineage_call.values))
    sublineage_with_indel_count_dict = Counter(isolate_annotation_indel_i.lineage_call.values)
    sublineages_with_indel_list = []
    for sublineage_i in sublineage_with_indel_count_dict.keys():

        sublineage_i_with_indel = sublineage_i + '({0})'.format(str(sublineage_with_indel_count_dict[sublineage_i]))
        sublineages_with_indel_list.append(sublineage_i_with_indel)

    sublineages_with_indel = ' - '.join(sublineages_with_indel_list)

    #get the alternate allele frequencies for the mixed indel calls (these were rounded down to the nearest %)
    indel_i_allele_freqs = indel_i_genotypes[mixed_indel_bool]
    indel_i_allele_freqs_with_isolate_list = [ isolate_ID + '(' + str(alt_AF) + '%)' for isolate_ID, alt_AF in zip(isolate_annotation_indel_i.isolate_ID, indel_i_allele_freqs) ]
    indel_i_allele_freqs_with_isolate_str = ' - '.join(indel_i_allele_freqs_with_isolate_list)
    
    return [num_isolates_indel_i , num_sublineages_with_indel_i , sublineages_with_indel , indel_i_allele_freqs_with_isolate_str]
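
To show what the two summary strings returned by this function look like, here is a small sketch on made-up inputs. The isolate IDs, lineage calls and allele frequencies in `toy_isolates_df` and `toy_indel_genotypes` are hypothetical. Unlike the function above, the sublineages are sorted here so that the output order is deterministic.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Toy isolate annotation and toy genotype row for one indel (illustrative values only)
toy_isolates_df = pd.DataFrame({
    'isolate_ID': ['iso_A', 'iso_B', 'iso_C', 'iso_D'],
    'lineage_call': ['2.2.1.1.1', '2.2.1.1', '2.2.1.1.1', '4.2.1'],
})
toy_indel_genotypes = np.array([67, 59, 70, 0], dtype=np.int8)

# Isolates with a mixed call (10% <= alt AF < 75%)
mixed = (toy_indel_genotypes >= 10) & (toy_indel_genotypes < 75)
supporting = toy_isolates_df[mixed]

# 'sublineage(count)' entries joined by ' - '
counts = Counter(supporting.lineage_call)
sublineages_str = ' - '.join('{0}({1})'.format(lin, counts[lin]) for lin in sorted(counts))

# 'isolate_ID(AF%)' entries joined by ' - '
isolates_str = ' - '.join(
    '{0}({1}%)'.format(iso, af)
    for iso, af in zip(supporting.isolate_ID, toy_indel_genotypes[mixed])
)

print(sublineages_str)
print(isolates_str)
```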

### *Function* to append isolate info to indels DataFrame

In [76]:
def append_isolate_info_to_indel_DF(locus_indels_df):
    
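    #work on a copy so the column assignments below do not write to a slice of INDEL_annotation_DF
    locus_indels_df = locus_indels_df.copy()

    num_isolates_with_indel = []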
    num_sublineages_with_indel = []
    sublineages_with_indel = []
    isolates_with_indel_and_altAF = []

    for indel_i_key in locus_indels_df.key:

        num_isolates_with_indel_i , num_sublineages_with_indel_i , sublineages_with_indel_i , isolates_with_indel_i_and_altAF = get_genotypes_for_indel(indel_i_key)

        num_isolates_with_indel.append(num_isolates_with_indel_i)
        num_sublineages_with_indel.append(num_sublineages_with_indel_i)
        sublineages_with_indel.append(sublineages_with_indel_i)
        isolates_with_indel_and_altAF.append(isolates_with_indel_i_and_altAF)

    locus_indels_df.loc[: , 'num_isolates'] = num_isolates_with_indel
    locus_indels_df.loc[: , 'num_sublineages'] = num_sublineages_with_indel
    locus_indels_df.loc[: , 'sublineages'] = sublineages_with_indel    
    locus_indels_df.loc[: , 'isolateID_and_AF'] = isolates_with_indel_and_altAF

    #drop indels present in 0 isolates
    locus_indels_df = locus_indels_df[locus_indels_df.num_isolates > 0]
    
    return locus_indels_df

### mmpR

In [77]:
mmpR_indels_df.head()

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos
1374,C_779181_CG,779181,C,CG,Non-Essential,Rv0678,Rv0678,192,ins,frameshift,64
1375,G_779249_GC,779249,G,GC,Non-Essential,Rv0678,Rv0678,260,ins,frameshift,87


In [78]:
np.shape(mmpR_indels_df)

(2, 11)

In [79]:
mmpR_indels_df = append_isolate_info_to_indel_DF(mmpR_indels_df)

In [80]:
mmpR_indels_df

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos,num_isolates,num_sublineages,sublineages,isolateID_and_AF
1374,C_779181_CG,779181,C,CG,Non-Essential,Rv0678,Rv0678,192,ins,frameshift,64,2,2,4.2.1.1.2(1) - 4.2.1.1.1.1.2(1),SAMEA104357571(66%) - Peru4498(71%)
1375,G_779249_GC,779249,G,GC,Non-Essential,Rv0678,Rv0678,260,ins,frameshift,87,1,1,4.2.1.2.2.1.1(1),SAMEA2534929(69%)


### mmpS5

In [81]:
mmpS5_indels_df.head()

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos
1373,G_778696_GTTGA,778696,G,GTTGA,Non-Essential,mmpS5,Rv0677c,210,ins,frameshift,70


In [82]:
np.shape(mmpS5_indels_df)

(1, 11)

In [83]:
mmpS5_indels_df = append_isolate_info_to_indel_DF(mmpS5_indels_df)

In [84]:
mmpS5_indels_df

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos,num_isolates,num_sublineages,sublineages,isolateID_and_AF
1373,G_778696_GTTGA,778696,G,GTTGA,Non-Essential,mmpS5,Rv0677c,210,ins,frameshift,70,1,1,1(1),SAMEA3558232(54%)


### mmpL5

In [85]:
mmpL5_indels_df.head()

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos
1372,CGATCT_777076_C,777076,CGATCT,C,Non-Essential,mmpL5,Rv0676c,1405,del,frameshift,469


In [86]:
np.shape(mmpL5_indels_df)

(1, 11)

In [87]:
mmpL5_indels_df = append_isolate_info_to_indel_DF(mmpL5_indels_df)

In [88]:
mmpL5_indels_df

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos,num_isolates,num_sublineages,sublineages,isolateID_and_AF
1372,CGATCT_777076_C,777076,CGATCT,C,Non-Essential,mmpL5,Rv0676c,1405,del,frameshift,469,2,1,2.2.1.1.1.i3(2),SAMN08708254(70%) - SAMN08709186(71%)


### eis

In [89]:
eis_indels_df.head()

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos
4787,GGT_2714526_G,2714526,GGT,G,Non-Essential,eis,Rv2416c,807,del,frameshift,269
4788,G_2714847_GCT,2714847,G,GCT,Non-Essential,eis,Rv2416c,486,ins,frameshift,162


In [90]:
np.shape(eis_indels_df)

(2, 11)

In [91]:
eis_indels_df = append_isolate_info_to_indel_DF(eis_indels_df)

In [92]:
eis_indels_df

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos,num_isolates,num_sublineages,sublineages,isolateID_and_AF
4787,GGT_2714526_G,2714526,GGT,G,Non-Essential,eis,Rv2416c,807,del,frameshift,269,84,2,2.2.1.1(2) - 2.2.1.1.1(82),SAMEA1119916(67%) - SAMN06055959(68%) - SAMN06...
4788,G_2714847_GCT,2714847,G,GCT,Non-Essential,eis,Rv2416c,486,ins,frameshift,162,1,1,4.1.i1.1.1.1(1),Peru2912(69%)


In [93]:
eis_indels_df.loc[4787, :].isolateID_and_AF

'SAMEA1119916(67%) - SAMN06055959(68%) - SAMN06092379(71%) - SAMEA2533580(59%) - SAMN06091746(70%) - SAMN06055864(73%) - SAMN08912870(67%) - SAMN08912867(70%) - SAMN06055832(72%) - SAMEA1569341(65%) - SAMN07660153(60%) - SAMN03648834(73%) - SAMN06092543(71%) - SAMN06055526(72%) - SAMN06092377(72%) - SAMN06092566(71%) - SAMEA2535268(60%) - SAMEA1118160(74%) - SAMN06091821(70%) - SAMN06092673(72%) - SAMN06092504(68%) - SAMN06056018(62%) - SAMN06055667(74%) - SAMN06092617(71%) - SAMN07658614(67%) - IDR1600027139(73%) - SAMN06092586(72%) - SAMN06055935(70%) - SAMN06091901(73%) - SAMN06092205(66%) - SAMN06055602(74%) - SAMN06055439(73%) - IDR1200022433(68%) - SAMEA1903069(62%) - SAMN08912910(63%) - SAMN06055566(73%) - SAMN06055907(61%) - SAMN06092550(72%) - SAMN06055544(73%) - SAMN06092177(71%) - SAMN06055983(66%) - SAMN03648790(73%) - SAMN06055446(69%) - SAMN03648789(63%) - SAMN06091927(70%) - SAMN06055789(68%) - SAMN06091733(67%) - SAMEA2534301(74%) - SAMN06092238(73%) - SAMN06055792(69%)

### whiB7

In [94]:
whiB7_indels_df.head()

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos
6195,GC_3568488_G,3568488,GC,G,Non-Essential,whiB7,Rv3197A,192,del,frameshift,64


In [95]:
np.shape(whiB7_indels_df)

(1, 11)

In [96]:
whiB7_indels_df = append_isolate_info_to_indel_DF(whiB7_indels_df)

In [97]:
whiB7_indels_df

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos,num_isolates,num_sublineages,sublineages,isolateID_and_AF
6195,GC_3568488_G,3568488,GC,G,Non-Essential,whiB7,Rv3197A,192,del,frameshift,64,11,3,1.2.1.1.2(2) - 1.2.1.1(1) - 1.2.1.1.1(8),SAMN04276657(66%) - SAMN09101721(73%) - SAMEA5...


### ahpC

In [98]:
ahpC_indels_df.head()

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos
4811,G_2726370_GT,2726370,G,GT,Antibiotic Resistance,ahpC,Rv2428,178,ins,frameshift,60
4812,C_2726405_CA,2726405,C,CA,Antibiotic Resistance,ahpC,Rv2428,213,ins,frameshift,71


In [99]:
np.shape(ahpC_indels_df)

(2, 11)

In [100]:
ahpC_indels_df = append_isolate_info_to_indel_DF(ahpC_indels_df)

In [101]:
ahpC_indels_df

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos,num_isolates,num_sublineages,sublineages,isolateID_and_AF
4811,G_2726370_GT,2726370,G,GT,Antibiotic Resistance,ahpC,Rv2428,178,ins,frameshift,60,1,1,1.1.2(1),SAMN04396889(68%)
4812,C_2726405_CA,2726405,C,CA,Antibiotic Resistance,ahpC,Rv2428,213,ins,frameshift,71,2,1,2.2.1.1.1(2),SAMN08912870(63%) - SAMN03648834(67%)


Save DataFrame for *mmpR*, *mmpS5*, *mmpL5*, *eis*, *whiB7*, *ahpC* indels

In [102]:
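#concatenate the per-gene DataFrames in the same order as before (pd.concat replaces the nested DataFrame.append calls)
mmpR_mmpS5_mmpL5_eis_whiB7_ahpC_indels_df = pd.concat([mmpR_indels_df , mmpS5_indels_df , mmpL5_indels_df , eis_indels_df , whiB7_indels_df , ahpC_indels_df])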

In [103]:
np.shape(mmpR_mmpS5_mmpL5_eis_whiB7_ahpC_indels_df)

(9, 15)

In [104]:
mmpR_mmpS5_mmpL5_eis_whiB7_ahpC_indels_df

Unnamed: 0,key,pos,ref,alt,gene_category,gene_name,gene_id,gene_pos,ins_del,INDEL_type,codon_pos,num_isolates,num_sublineages,sublineages,isolateID_and_AF
1374,C_779181_CG,779181,C,CG,Non-Essential,Rv0678,Rv0678,192,ins,frameshift,64,2,2,4.2.1.1.2(1) - 4.2.1.1.1.1.2(1),SAMEA104357571(66%) - Peru4498(71%)
1375,G_779249_GC,779249,G,GC,Non-Essential,Rv0678,Rv0678,260,ins,frameshift,87,1,1,4.2.1.2.2.1.1(1),SAMEA2534929(69%)
1373,G_778696_GTTGA,778696,G,GTTGA,Non-Essential,mmpS5,Rv0677c,210,ins,frameshift,70,1,1,1(1),SAMEA3558232(54%)
1372,CGATCT_777076_C,777076,CGATCT,C,Non-Essential,mmpL5,Rv0676c,1405,del,frameshift,469,2,1,2.2.1.1.1.i3(2),SAMN08708254(70%) - SAMN08709186(71%)
4787,GGT_2714526_G,2714526,GGT,G,Non-Essential,eis,Rv2416c,807,del,frameshift,269,84,2,2.2.1.1(2) - 2.2.1.1.1(82),SAMEA1119916(67%) - SAMN06055959(68%) - SAMN06...
4788,G_2714847_GCT,2714847,G,GCT,Non-Essential,eis,Rv2416c,486,ins,frameshift,162,1,1,4.1.i1.1.1.1(1),Peru2912(69%)
6195,GC_3568488_G,3568488,GC,G,Non-Essential,whiB7,Rv3197A,192,del,frameshift,64,11,3,1.2.1.1.2(2) - 1.2.1.1(1) - 1.2.1.1.1(8),SAMN04276657(66%) - SAMN09101721(73%) - SAMEA5...
4811,G_2726370_GT,2726370,G,GT,Antibiotic Resistance,ahpC,Rv2428,178,ins,frameshift,60,1,1,1.1.2(1),SAMN04396889(68%)
4812,C_2726405_CA,2726405,C,CA,Antibiotic Resistance,ahpC,Rv2428,213,ins,frameshift,71,2,1,2.2.1.1.1(2),SAMN08912870(63%) - SAMN03648834(67%)


In [105]:
mmpR_mmpS5_mmpL5_eis_whiB7_ahpC_indels_df.to_csv('/n/data1/hms/dbmi/farhat/Roger/mmpR_BDQ_mutant_project/CSV files/mmpR_mmpS5_mmpL5_eis_whiB7_ahpC_mixed_indels_in_31428_isolates.csv' , sep = ',')
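
The `isolateID_and_AF` column packs isolate IDs and allele frequencies into one string, which is convenient to read but awkward to analyze after loading the CSV. A minimal sketch of unpacking it follows. `parse_isolate_af` is a hypothetical helper, not part of this notebook, and it assumes that isolate IDs contain no whitespace or parentheses.

```python
import re

def parse_isolate_af(packed):
    """Hypothetical helper: split 'ID(AF%) - ID(AF%)' into (isolate_ID, alt_AF) tuples."""
    pairs = re.findall(r'([^\s()]+)\((\d+)%\)', packed)
    return [(isolate_id, int(alt_af)) for isolate_id, alt_af in pairs]

# Example string taken from the mmpR row C_779181_CG above
parsed = parse_isolate_af('SAMEA104357571(66%) - Peru4498(71%)')
print(parsed)
```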