# tb-rnap-compensation

In [11]:
import pandas

pandas.options.display.max_columns=999

Let's load the `MUTATIONS` table and have a look. Importantly, this table also records `NULL`s (where there are no reads at an amino acid, so we have no evidence of what is there) and `FILTER_FAIL`s (where there is some evidence, but not enough to be statistically significant). These need excluding, and the filter below also drops any calls flagged `IS_HET`.

In [14]:
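MUTATIONS = pandas.read_pickle('tables/MUTATIONS.pkl.gz')
# Move the index into ordinary columns
MUTATIONS.reset_index(inplace=True)
# Keep only calls that pass the filter and are neither het nor null
MUTATIONS = MUTATIONS[(MUTATIONS.IS_FILTER_PASS) & (~MUTATIONS.IS_HET) & (~MUTATIONS.IS_NULL)]
MUTATIONS[:4]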

Unnamed: 0,UNIQUEID,GENE,MUTATION,POSITION,AMINO_ACID_NUMBER,GENOME_INDEX,NUCLEOTIDE_NUMBER,REF,ALT,IS_SNP,IS_INDEL,IN_CDS,IN_PROMOTER,IS_SYNONYMOUS,IS_NONSYNONYMOUS,IS_HET,IS_NULL,IS_FILTER_PASS,ELEMENT_TYPE,MUTATION_TYPE,INDEL_LENGTH,INDEL_1,INDEL_2,SITEID,NUMBER_NUCLEOTIDE_CHANGES
0,site.02.subj.0958.lab.22A197.iso.1,rpoB,P45S,45.0,45.0,,,ccg,tcg,True,False,True,False,False,True,False,False,True,GENE,AAM,,,,2,1
1,site.02.subj.0958.lab.22A197.iso.1,rpoB,S450L,450.0,450.0,,,tcg,ttg,True,False,True,False,False,True,False,False,True,GENE,AAM,,,,2,1
2,site.02.subj.0958.lab.22A197.iso.1,rpoB,A1075A,1075.0,1075.0,,,gct,gcc,True,False,True,False,True,False,False,False,True,GENE,AAM,,,,2,1
3,site.02.subj.0958.lab.22A197.iso.1,rpoC,D271E,271.0,271.0,,,gac,gag,True,False,True,False,False,True,False,False,True,GENE,AAM,,,,2,1
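
To make the effect of the three flags concrete, here is a minimal sketch on a toy table. The rows and the `toy`, `keep` and `excluded` names are invented for illustration; only the three `IS_*` column names come from the real table.

```python
import pandas

# Toy stand-in for the MUTATIONS table; all values are invented.
toy = pandas.DataFrame({
    "MUTATION": ["S450L", "D435V", "L430P", "H445D", "H445Y"],
    "IS_FILTER_PASS": [True, True, False, False, True],
    "IS_HET": [False, True, False, False, False],
    "IS_NULL": [False, False, True, False, False],
})

# Same boolean mask as the notebook cell: pass the filter, not het, not null.
keep = toy.IS_FILTER_PASS & ~toy.IS_HET & ~toy.IS_NULL

# How many rows each flag marks (a row can be marked by more than one flag).
excluded = pandas.Series({
    "filter_fail": int((~toy.IS_FILTER_PASS).sum()),
    "het": int(toy.IS_HET.sum()),
    "null": int(toy.IS_NULL.sum()),
})

kept = toy[keep]
print(excluded)
print(kept.MUTATION.tolist())
```

Tallying the flags before filtering like this is a cheap way to check how much of the table each exclusion removes.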


To get a quick feel for how much reversion may be happening, let's cross-tabulate gene against the number of nucleotide changes per codon.

In [29]:
pandas.crosstab(MUTATIONS.GENE, MUTATIONS.NUMBER_NUCLEOTIDE_CHANGES)

GENE \ NUMBER_NUCLEOTIDE_CHANGES,0,1,2,3
rpoA,219,15384,429,41
rpoB,9491,115155,3053,911
rpoC,1175,101868,3631,1108
rpoZ,102,1593,32,1
sigA,343,13279,395,79


Sure enough, there are a good number of observations in `rpoB` (3,053 with two changes, 911 with three) and `rpoC` (3,631 and 1,108) where the codon has two or three bases different from the reference genome.
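
The values in `NUMBER_NUCLEOTIDE_CHANGES` look consistent with the number of positions at which the `REF` and `ALT` codons differ (for example, `tcg` to `ttc` is recorded as 2). A minimal sketch of that count follows, assuming this is how the column is derived; `count_base_changes` is a hypothetical helper and not part of the pipeline that built the table.

```python
def count_base_changes(ref: str, alt: str) -> int:
    """Count positions at which two equal-length codons differ."""
    if len(ref) != len(alt):
        raise ValueError("ref and alt must have the same length")
    # Compare case-insensitively, base by base.
    return sum(r != a for r, a in zip(ref.lower(), alt.lower()))


# REF/ALT pairs taken from the rows shown in this notebook.
examples = [("tcg", "ttg"), ("tcg", "ttc"), ("gac", "ttc"), ("tcg", "atg")]
changes = [count_base_changes(ref, alt) for ref, alt in examples]
print(changes)
```

Applied to the real table, something like `MUTATIONS.apply(lambda row: count_base_changes(row.REF, row.ALT), axis=1)` could be compared against the stored column, but only for rows where both `REF` and `ALT` hold codons.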

Let's look at those where two bases differ, as they fit our hypothesis (three are harder to explain!).

In [31]:
MUTATIONS[MUTATIONS.NUMBER_NUCLEOTIDE_CHANGES == 2][:5]

Unnamed: 0,UNIQUEID,GENE,MUTATION,POSITION,AMINO_ACID_NUMBER,GENOME_INDEX,NUCLEOTIDE_NUMBER,REF,ALT,IS_SNP,IS_INDEL,IN_CDS,IN_PROMOTER,IS_SYNONYMOUS,IS_NONSYNONYMOUS,IS_HET,IS_NULL,IS_FILTER_PASS,ELEMENT_TYPE,MUTATION_TYPE,INDEL_LENGTH,INDEL_1,INDEL_2,SITEID,NUMBER_NUCLEOTIDE_CHANGES
817,site.02.subj.0926.lab.22A161.iso.1,rpoB,S450F,450.0,450.0,,,tcg,ttc,True,False,True,False,False,True,False,False,True,GENE,AAM,,,,2,2
859,site.02.subj.0893.lab.22A127.iso.1,rpoB,D435F,435.0,435.0,,,gac,ttc,True,False,True,False,False,True,False,False,True,GENE,AAM,,,,2,2
1022,site.02.subj.0197.lab.2013221241.iso.1,sigA,A55S,55.0,55.0,,,gcc,tcg,True,False,True,False,False,True,False,False,True,GENE,AAM,,,,2,2
1506,site.02.subj.0074.lab.22A026.iso.1,rpoB,H445C,445.0,445.0,,,cac,tgc,True,False,True,False,False,True,False,False,True,GENE,AAM,,,,2,2
2244,site.05.subj.LR-2335.lab.FN-01418-18.iso.1,rpoB,S450M,450.0,450.0,,,tcg,atg,True,False,True,False,False,True,False,False,True,GENE,AAM,,,,5,2


In [32]:
# Count two-base changes at each rpoB codon, then keep the codons seen more than 50 times
a = MUTATIONS[(MUTATIONS.GENE == 'rpoB') & (MUTATIONS.NUMBER_NUCLEOTIDE_CHANGES == 2)].AMINO_ACID_NUMBER.value_counts()
a[a>50]

450.0    232
445.0    160
656.0    136
640.0    135
662.0    129
243.0    102
641.0     99
549.0     97
435.0     85
681.0     78
670.0     76
545.0     76
443.0     59
544.0     56
431.0     52
Name: AMINO_ACID_NUMBER, dtype: int64
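
A natural next step would be to see which amino acid substitutions make up the count at each of these codons. The sketch below shows one way to do that on a toy table; the rows are invented, and `CODON` is a hypothetical helper column added here so the float `AMINO_ACID_NUMBER` can be grouped as an integer.

```python
import pandas

# Toy stand-in for the filtered MUTATIONS table; counts are invented.
toy = pandas.DataFrame({
    "GENE": ["rpoB", "rpoB", "rpoB", "rpoB", "rpoB", "rpoC"],
    "MUTATION": ["S450F", "S450F", "S450M", "H445C", "S450L", "D271E"],
    "AMINO_ACID_NUMBER": [450.0, 450.0, 450.0, 445.0, 450.0, 271.0],
    "NUMBER_NUCLEOTIDE_CHANGES": [2, 2, 2, 2, 1, 1],
})

# Same selection as the cell above: rpoB rows with exactly two base changes.
two_base = toy[(toy.GENE == "rpoB") & (toy.NUMBER_NUCLEOTIDE_CHANGES == 2)]

# Integer codon number (hypothetical column), then count rows per substitution.
two_base = two_base.assign(CODON=two_base.AMINO_ACID_NUMBER.astype(int))
by_mutation = (
    two_base.groupby(["CODON", "MUTATION"])
    .size()
    .sort_values(ascending=False)
)
print(by_mutation)
```

On the real table the same `groupby` could be restricted to the codons in `a[a>50]` to keep the output short.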