## Data Wrangling 
1. Load the MaxQuant output (proteusAll.tsv) into a dataframe
2. Keep the "Sequence" and "Leading razor protein" columns
3. Duplicate peptide sequences are not helpful at this point, so keep only one of each
4. Find a way to map the UniProt IDs (e.g. Q15452) in the FASTApeptides.tsv file to the gene names in the "Leading razor protein" column (e.g. RPL4)
5. Remove from FASTApeptides.tsv all peptides from proteins that do not appear in the MaxQuant output; what remains is the list of all tryptic peptides that could be in the sample.
6. Using the peptide sequences in the MaxQuant output, assign 1 (detected) or 0 (not detected) to every peptide in FASTApeptides.tsv.
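
Steps 5 and 6 can be sketched as follows. This is only a minimal illustration on toy data, assuming step 4 has already replaced the UniProt IDs with gene names. The `Detected` column name, the `GENE_X` protein and the `PEPTIDEK` sequence are made up for the example and are not part of the original files.

```python
import pandas as pd

# Toy stand-ins for the two tables (illustrative rows only).
fasta = pd.DataFrame({
    "Protein": ["RPL4", "RPL4", "EIF3J", "GENE_X"],
    "Peptide": ["AAAAAAALQAK", "PEPTIDEK", "AAAAAAAGDSDSWDADAFSVEDPVR", "MGLVSSK"],
})
maxquant = pd.DataFrame({
    "Sequence": ["AAAAAAALQAK", "AAAAAAALQAK", "AAAAAAAGDSDSWDADAFSVEDPVR"],
    "Leading razor protein": ["RPL4", "RPL4", "EIF3J"],
})

# Steps 2-3: keep the two columns and one row per peptide sequence.
detected = maxquant[["Sequence", "Leading razor protein"]].drop_duplicates("Sequence")

# Step 5: keep only FASTA peptides whose protein was seen by MaxQuant.
candidates = fasta[fasta["Protein"].isin(detected["Leading razor protein"])].copy()

# Step 6: label each candidate peptide as detected (1) or not detected (0).
candidates["Detected"] = candidates["Peptide"].isin(detected["Sequence"]).astype(int)

print(candidates)
```

Here the `GENE_X` peptide is dropped in step 5, and `PEPTIDEK` is kept with the label 0 because its protein was observed but the peptide itself was not.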

##  IMPORT LIBRARIES

In [4]:
# import the libraries 
import pandas as pd

##  IMPORT  MAXQUANT output and FASTA Peptides

In [5]:
# Load the MaxQuant output (proteus.tsv) into a dataframe and FASTApeptides
MAXQuant = pd.read_csv('proteus.tsv', sep='\t') 
FASTApeptides = pd.read_table('fasta_peptides.tsv', sep='\t')

In [6]:
FASTApeptides.shape

(556449, 3)

In [7]:
MAXQuant.shape

(98579, 19)

In [8]:
print(len(MAXQuant.loc[MAXQuant['Potential contaminant'] == '+'])) # 759 contaminants

759


In [9]:
len(MAXQuant.loc[MAXQuant['Proteins'] == ';']) # checking for PAGS

0

In [10]:
len(MAXQuant[MAXQuant['Proteins'].str.contains(";")])

0

In [11]:
len(MAXQuant[MAXQuant['Modifications'].str.contains("Acetyl")]) / len(MAXQuant) * 100 # percentage of rows that carry an acetyl modification

1.720447559825115

In [12]:
len(MAXQuant[MAXQuant['Modifications'].str.contains("Acetyl")])

1696

In [13]:
len(MAXQuant[MAXQuant['Proteins'].str.contains("CON__", na=False)]) # checking for contaminants

0

In [14]:
MAXQuant

Unnamed: 0,Sequence,Modified sequence,Modifications,Proteins,Leading razor protein,Experiment,Charge,Reverse,Potential contaminant,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,Reporter intensity 9,Reporter intensity 10
0,AAAAAAAATMALAAPSSPTPESPTMLTK,_AAAAAAAATMALAAPSSPTPESPTMLTK_,Unmodified,INCENP,INCENP,1,4,,,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,1300.10,1281.10
1,AAAAAAAGDSDSWDADAFSVEDPVR,_(Acetyl (Protein N-term))AAAAAAAGDSDSWDADAFSV...,Acetyl (Protein N-term),EIF3J,EIF3J,1,3,,,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,2974.00,2880.20
2,AAAAAAAGDSDSWDADAFSVEDPVR,_(Acetyl (Protein N-term))AAAAAAAGDSDSWDADAFSV...,Acetyl (Protein N-term),EIF3J,EIF3J,1,2,,,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,715.44,909.51
3,AAAAAAAGDSDSWDADAFSVEDPVRK,_(Acetyl (Protein N-term))AAAAAAAGDSDSWDADAFSV...,Acetyl (Protein N-term),EIF3J,EIF3J,1,3,,,9892.50,12415.00,10484.00,8091.60,10791.00,7756.90,9398.80,10201.00,3995.00,6233.10
4,AAAAAAAGDSDSWDADAFSVEDPVRK,_(Acetyl (Protein N-term))AAAAAAAGDSDSWDADAFSV...,Acetyl (Protein N-term),EIF3J,EIF3J,1,2,,,3148.40,5433.80,4041.40,2951.90,4652.90,3160.80,4107.00,3887.30,1751.90,2445.20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,_YYVTIIDAPGHR_,Unmodified,EEF1A1P8,EEF1A1P8,1,3,,,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,366.15,809.01
98575,YYVTIIDAPGHR,_YYVTIIDAPGHR_,Unmodified,EEF1A1P8,EEF1A1P8,1,2,,,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,3560.30,4610.70
98576,YYVTIIDAPGHR,_YYVTIIDAPGHR_,Unmodified,EEF1A1P8,EEF1A1P8,1,4,,,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,474.26,639.43
98577,YYVTIIDAPGHR,_YYVTIIDAPGHR_,Unmodified,EEF1A1P8,EEF1A1P8,1,3,,,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,891.62,1267.50


In [15]:
MAXQuantcleaned = MAXQuant.loc[MAXQuant['Potential contaminant'] != '+'] # removed 759 contaminants

In [16]:
MAXQuantcleaned.shape

(97820, 19)
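
The same kind of filter can also drop reverse (decoy) hits in the same pass. The sketch below uses toy data and assumes that both flag columns contain '+' for flagged rows and are empty otherwise, as the table above suggests. The toy sequences are made up.

```python
import pandas as pd

# Toy table with one reverse hit and one contaminant (illustrative rows only).
df = pd.DataFrame({
    "Sequence": ["AAK", "GGK", "CCK", "DDR"],
    "Reverse": [None, "+", None, None],
    "Potential contaminant": [None, None, "+", None],
})

# .eq('+') is False for empty cells, so unflagged rows are kept.
flagged = df["Reverse"].eq("+") | df["Potential contaminant"].eq("+")
cleaned = df.loc[~flagged].copy()

print(cleaned["Sequence"].tolist())
```

In this notebook the `Reverse` column is empty for every remaining row, so the extra condition would not change the result here; it only makes the filter explicit.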

In [17]:
MAXQuantcleaned.isnull().sum() # there are no null values except for reverse and potential contaminants.

Sequence                     0
Modified sequence            0
Modifications                0
Proteins                     0
Leading razor protein        0
Experiment                   0
Charge                       0
Reverse                  97820
Potential contaminant    97820
Reporter intensity 1         0
Reporter intensity 2         0
Reporter intensity 3         0
Reporter intensity 4         0
Reporter intensity 5         0
Reporter intensity 6         0
Reporter intensity 7         0
Reporter intensity 8         0
Reporter intensity 9         0
Reporter intensity 10        0
dtype: int64

In [18]:
MAXQuantcleaned.shape

(97820, 19)

In [19]:
reporter_intensity = MAXQuantcleaned.iloc[:, 9:] # take the 10-plex reporter ion intensities

In [20]:
reporter_intensity.shape

(97820, 10)

In [21]:
MAXQuant = MAXQuantcleaned[['Sequence', 'Leading razor protein']] # keeping leading razor proteins and sequence columns

In [22]:
MAXQuant

Unnamed: 0,Sequence,Leading razor protein
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J
3,AAAAAAAGDSDSWDADAFSVEDPVRK,EIF3J
4,AAAAAAAGDSDSWDADAFSVEDPVRK,EIF3J
...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8
98575,YYVTIIDAPGHR,EEF1A1P8
98576,YYVTIIDAPGHR,EEF1A1P8
98577,YYVTIIDAPGHR,EEF1A1P8


### 3. Concatenate the reporter ion intensities with the Sequence and Leading razor protein columns

In [23]:
MAXQuant = pd.concat([MAXQuant, reporter_intensity], axis=1, join='inner')

In [24]:
MAXQuant # the resulting dataframe after concatenating the MaxQuant columns and all reporter ion intensities

Unnamed: 0,Sequence,Leading razor protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,Reporter intensity 9,Reporter intensity 10
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,1300.10,1281.10
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,2974.00,2880.20
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,715.44,909.51
3,AAAAAAAGDSDSWDADAFSVEDPVRK,EIF3J,9892.50,12415.00,10484.00,8091.60,10791.00,7756.90,9398.80,10201.00,3995.00,6233.10
4,AAAAAAAGDSDSWDADAFSVEDPVRK,EIF3J,3148.40,5433.80,4041.40,2951.90,4652.90,3160.80,4107.00,3887.30,1751.90,2445.20
...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,366.15,809.01
98575,YYVTIIDAPGHR,EEF1A1P8,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,3560.30,4610.70
98576,YYVTIIDAPGHR,EEF1A1P8,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,474.26,639.43
98577,YYVTIIDAPGHR,EEF1A1P8,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,891.62,1267.50


### Count the number of peptides with missed cleavages in the sample

In [25]:
def count_missed_cleavage(string):
    # Count internal K/R residues that are not followed by P.
    # The last residue is skipped: a C-terminal K/R is the expected cleavage site.
    count = 0
    for i in range(len(string) - 1):
        if string[i] in 'KR' and string[i + 1] != 'P':
            count += 1
    return count
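
The same rule can be written as a regular expression, which makes it easy to spot-check on a few sequences. This is a sketch, not a replacement for the function above; the name `count_missed_cleavages_regex` is introduced here for illustration.

```python
import re

# K or R that is followed by at least one more residue, and that residue is not P.
# The lookahead needs a following character, so the C-terminal residue never matches.
_MISSED_SITE = re.compile(r"[KR](?=[^P])")

def count_missed_cleavages_regex(peptide: str) -> int:
    """Count internal K/R sites that are not followed by proline."""
    return len(_MISSED_SITE.findall(peptide))

print(count_missed_cleavages_regex("AAAAAAAGDSDSWDADAFSVEDPVRK"))  # 1 (R followed by K)
print(count_missed_cleavages_regex("AAAAAAALQAK"))                 # 0 (only a C-terminal K)
print(count_missed_cleavages_regex("AKPGR"))                       # 0 (KP is not counted)
```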

In [26]:
MAXQuant['Missedcleavedpepcount'] = MAXQuant['Sequence'].apply(count_missed_cleavage)

In [27]:
MAXQuant

Unnamed: 0,Sequence,Leading razor protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,Reporter intensity 9,Reporter intensity 10,Missedcleavedpepcount
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,1300.10,1281.10,0
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,2974.00,2880.20,0
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,715.44,909.51,0
3,AAAAAAAGDSDSWDADAFSVEDPVRK,EIF3J,9892.50,12415.00,10484.00,8091.60,10791.00,7756.90,9398.80,10201.00,3995.00,6233.10,1
4,AAAAAAAGDSDSWDADAFSVEDPVRK,EIF3J,3148.40,5433.80,4041.40,2951.90,4652.90,3160.80,4107.00,3887.30,1751.90,2445.20,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,366.15,809.01,0
98575,YYVTIIDAPGHR,EEF1A1P8,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,3560.30,4610.70,0
98576,YYVTIIDAPGHR,EEF1A1P8,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,474.26,639.43,0
98577,YYVTIIDAPGHR,EEF1A1P8,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,891.62,1267.50,0


In [28]:
#print((MAXQuant['Missedcleavedpepcount'] != 0)).sum()
MAXQuant.isnull().sum()

Sequence                 0
Leading razor protein    0
Reporter intensity 1     0
Reporter intensity 2     0
Reporter intensity 3     0
Reporter intensity 4     0
Reporter intensity 5     0
Reporter intensity 6     0
Reporter intensity 7     0
Reporter intensity 8     0
Reporter intensity 9     0
Reporter intensity 10    0
Missedcleavedpepcount    0
dtype: int64

In [29]:
countmissed = (MAXQuant['Missedcleavedpepcount'] != 0).sum()

In [30]:
two = (MAXQuant['Missedcleavedpepcount'] == 2)

In [31]:
two

0        False
1        False
2        False
3        False
4        False
         ...  
98574    False
98575    False
98576    False
98577    False
98578    False
Name: Missedcleavedpepcount, Length: 97820, dtype: bool

In [32]:
MAXQuant["Missedcleavedpepcount"].value_counts()

0    87426
1     9790
2      604
Name: Missedcleavedpepcount, dtype: int64

In [33]:
countmissed

10394

In [34]:
missed_analysis = (MAXQuant['Missedcleavedpepcount'] != 0).sum()

In [35]:
missed_analysis

10394

In [36]:
10394/(len(MAXQuant))*100 # 10.63 % of the detected peptides (after removing potential contaminants) contain at least one missed cleavage

10.625638928644449

In [37]:
true_tryptic_peptides_count = (MAXQuant['Missedcleavedpepcount'] == 0).sum()

In [38]:
true_tryptic_peptides_count

87426

In [39]:
87426/ len(MAXQuant)*100 # 89.37 % of the detected peptides are fully tryptic (no missed cleavages)

89.37436107135555

In [40]:
countmissed + true_tryptic_peptides_count

97820
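
The percentages above can be reproduced in one place from the `value_counts()` output. The sketch below uses plain Python and the three counts printed earlier; the variable names are chosen for the example.

```python
# Counts copied from the value_counts() output above.
counts = {0: 87426, 1: 9790, 2: 604}

total = sum(counts.values())
percent = {k: round(v / total * 100, 2) for k, v in counts.items()}

# Peptides with one or two missed cleavages, as a share of all detected peptides.
missed_percent = round((counts[1] + counts[2]) / total * 100, 2)

print(total)           # 97820
print(percent[0])      # 89.37
print(missed_percent)  # 10.63
```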

In [41]:
# some sequences had KP when missed cleavage = 1, KP is not a missed cleavage

### Remove peptides with missed cleavages

In [42]:
MAXQuant = MAXQuant.loc[MAXQuant['Missedcleavedpepcount'] == 0].copy() # remove them here, now that counts have been assigned to every observed peptide; .copy() avoids SettingWithCopyWarning when columns are added later

In [43]:
MAXQuant

Unnamed: 0,Sequence,Leading razor protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,Reporter intensity 9,Reporter intensity 10,Missedcleavedpepcount
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,1300.10,1281.10,0
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,2974.00,2880.20,0
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,715.44,909.51,0
6,AAAAAAALQAK,RPL4,1432.70,2065.60,1541.50,1309.60,1399.40,1293.00,1301.90,1506.40,973.21,933.73,0
7,AAAAAAALQAK,RPL4,6730.20,10094.00,7433.50,6778.40,7854.90,6717.30,7399.50,7257.50,4343.90,4956.40,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,366.15,809.01,0
98575,YYVTIIDAPGHR,EEF1A1P8,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,3560.30,4610.70,0
98576,YYVTIIDAPGHR,EEF1A1P8,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,474.26,639.43,0
98577,YYVTIIDAPGHR,EEF1A1P8,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,891.62,1267.50,0


In [44]:
len(MAXQuant.loc[(MAXQuant['Missedcleavedpepcount'] != 0)]) # 0 peptides with missed cleavages remain

0

### Calculate the total reporter ion intensity for each peptide

In [45]:
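MAXQuant["Total"] = MAXQuant.filter(like="Reporter intensity").sum(axis=1) # total of the ten reporter ion intensities for the peptide; columns are selected by name so that Missedcleavedpepcount is not part of the sum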

In [46]:
MAXQuant

Unnamed: 0,Sequence,Leading razor protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,Reporter intensity 9,Reporter intensity 10,Missedcleavedpepcount,Total
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,1300.10,1281.10,0,16532.40
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,2974.00,2880.20,0,44275.70
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,715.44,909.51,0,10601.01
6,AAAAAAALQAK,RPL4,1432.70,2065.60,1541.50,1309.60,1399.40,1293.00,1301.90,1506.40,973.21,933.73,0,13757.04
7,AAAAAAALQAK,RPL4,6730.20,10094.00,7433.50,6778.40,7854.90,6717.30,7399.50,7257.50,4343.90,4956.40,0,69565.60
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,366.15,809.01,0,9831.12
98575,YYVTIIDAPGHR,EEF1A1P8,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,3560.30,4610.70,0,47814.00
98576,YYVTIIDAPGHR,EEF1A1P8,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,474.26,639.43,0,6020.02
98577,YYVTIIDAPGHR,EEF1A1P8,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,891.62,1267.50,0,14500.12


### Remove peptides with 0 signal across all ten labels

In [47]:
MAXQuant = MAXQuant[MAXQuant['Total'] != 0] # remove rows whose total intensity across all labels is 0

In [48]:
MAXQuant.shape

(85054, 14)

### Remove any peptides in MAXQuant that map to more than one protein

In [49]:
# remove any peptides in MAXQuant that map to more than one protein
MAXQuant = MAXQuant.groupby('Sequence').filter(lambda x: x['Leading razor protein'].nunique() == 1)
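
`groupby(...).filter` calls the lambda once per peptide sequence, which may be slow on a table of this size. A possible equivalent, sketched here on toy data, computes the number of distinct proteins per sequence with `transform` and then applies an ordinary boolean mask. The toy sequences and protein names are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Sequence": ["AAK", "AAK", "GGR", "GGR", "LLK"],
    "Leading razor protein": ["P1", "P1", "P1", "P2", "P3"],
})

# Number of distinct proteins per peptide sequence, aligned to every row.
n_proteins = df.groupby("Sequence")["Leading razor protein"].transform("nunique")

# Keep only sequences that map to exactly one protein.
unique_only = df[n_proteins == 1]

print(unique_only)
```

`GGR` maps to both `P1` and `P2`, so both of its rows are removed, while the repeated `AAK` rows are kept.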

In [50]:
MAXQuant

Unnamed: 0,Sequence,Leading razor protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,Reporter intensity 9,Reporter intensity 10,Missedcleavedpepcount,Total
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,1300.10,1281.10,0,16532.40
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,2974.00,2880.20,0,44275.70
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,715.44,909.51,0,10601.01
6,AAAAAAALQAK,RPL4,1432.70,2065.60,1541.50,1309.60,1399.40,1293.00,1301.90,1506.40,973.21,933.73,0,13757.04
7,AAAAAAALQAK,RPL4,6730.20,10094.00,7433.50,6778.40,7854.90,6717.30,7399.50,7257.50,4343.90,4956.40,0,69565.60
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,366.15,809.01,0,9831.12
98575,YYVTIIDAPGHR,EEF1A1P8,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,3560.30,4610.70,0,47814.00
98576,YYVTIIDAPGHR,EEF1A1P8,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,474.26,639.43,0,6020.02
98577,YYVTIIDAPGHR,EEF1A1P8,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,891.62,1267.50,0,14500.12


In [51]:
sumofintensities = MAXQuant.groupby(['Leading razor protein']).Total.sum().reset_index() # peptide intensities were summed to get the relative protein abundance 

In [52]:
sumofintensities

Unnamed: 0,Leading razor protein,Total
0,AAAS,494706.920
1,AACS,396915.510
2,AAGAB,414553.069
3,AAK1,681542.860
4,AAMP,277031.600
...,...,...
5754,ZWILCH,605847.188
5755,ZWINT,24961.440
5756,ZYG11B,78877.350
5757,ZYX,1916630.743


In [53]:
sumofintensities['Leading razor protein'].nunique()  #NUMBER OF UNIQUE PROTEINS IN THE SAMPLE 

5759

In [54]:
sumofintensities = sumofintensities.rename(columns={"Leading razor protein": "Protein"}) 

In [55]:
MAXQuant = MAXQuant.rename(columns={"Leading razor protein": "Protein"}) # change column Leading Razor Protein to protein

In [56]:
sum_of_intensities_dict = dict(zip(sumofintensities.Protein , sumofintensities.Total)) # dictionary made 

In [57]:
len(sum_of_intensities_dict) # check that the dictionary has one entry per unique protein

5759

In [58]:
MAXQuant['SMT'] = MAXQuant['Protein'].map(sum_of_intensities_dict)
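
The group, dictionary and map steps above can also be collapsed into a single `transform` call. The sketch below shows this on toy data with made-up intensities; it should give the same `SMT` column as the dictionary approach.

```python
import pandas as pd

df = pd.DataFrame({
    "Protein": ["RPL4", "RPL4", "EIF3J"],
    "Total": [10.0, 5.0, 7.0],
})

# Sum of peptide totals per protein, broadcast back to every peptide row.
df["SMT"] = df.groupby("Protein")["Total"].transform("sum")

print(df)
```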

In [59]:
MAXQuant

Unnamed: 0,Sequence,Protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,Reporter intensity 9,Reporter intensity 10,Missedcleavedpepcount,Total,SMT
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,1300.10,1281.10,0,16532.40,56437.60
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,2974.00,2880.20,0,44275.70,2270292.51
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,715.44,909.51,0,10601.01,2270292.51
6,AAAAAAALQAK,RPL4,1432.70,2065.60,1541.50,1309.60,1399.40,1293.00,1301.90,1506.40,973.21,933.73,0,13757.04,1979939.15
7,AAAAAAALQAK,RPL4,6730.20,10094.00,7433.50,6778.40,7854.90,6717.30,7399.50,7257.50,4343.90,4956.40,0,69565.60,1979939.15
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,366.15,809.01,0,9831.12,2048619.85
98575,YYVTIIDAPGHR,EEF1A1P8,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,3560.30,4610.70,0,47814.00,2048619.85
98576,YYVTIIDAPGHR,EEF1A1P8,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,474.26,639.43,0,6020.02,2048619.85
98577,YYVTIIDAPGHR,EEF1A1P8,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,891.62,1267.50,0,14500.12,2048619.85


In [60]:
len(MAXQuant[MAXQuant['Total'] == 0]) # there are 0 peptides with no intensity across all labels. 

0

In [61]:
len(MAXQuant[MAXQuant['SMT'] == 0]) # there are 0 proteins that have 0.0 summed intensity. 

0

## Map UniProt IDs in the FASTApeptides.tsv file to protein names from the UniProt API

In [62]:
#mapping_df =  pd.read_csv('mapping.csv') # previously converted identifiers via uniprotAPI and saved as mapping.csv

In [63]:
#mapping_df

In [64]:
#mapping_df.isnull().sum()

In [65]:
#mapping_df = mapping_df.dropna(axis=0, how='any')

In [66]:
#mapping_df.isnull().sum()

In [67]:
#mapping_dict = mapping_df.set_index('A').to_dict()['B']

In [68]:
#FASTApeptides['Protein'].replace(mapping_dict, inplace = True)

In [69]:
FASTApeptides

Unnamed: 0,Protein,Peptide,Length
0,P51451,MGLVSSK,505
1,P51451,GQWSPLK,505
2,P51451,DAPPLPPLVVFNHLTPPPPDEHLDEDK,505
3,P51451,HFVVALYDYTAMNDR,505
4,P51451,GTGDWWLAR,505
...,...,...,...
556444,Q6P179,ENWTHLLK,960
556445,Q6P179,FDLGSYDIR,960
556446,Q6P179,MIISGTTAHFSSK,960
556447,Q6P179,LFFESLEAQGSHLDIFQTVLETITK,960


#### USE UPDATED FASTApeptides_new.CSV FOR FASTA PEPTIDES WHERE ID HAS BEEN CONVERTED TO GENENAME/PROTEIN NAME 

In [70]:
#FASTApeptides.to_csv('FASTApeptides_new.csv', index = False)

In [71]:
FASTApeptides = pd.read_csv('FASTApeptides_new.csv')

In [72]:
FASTApeptides.isnull().sum()

Protein     0
Sequence    0
Length      0
dtype: int64

In [73]:
FASTApeptides.shape

(556449, 3)

## CALCULATE SRI / LENGTH OF PROTEIN

 1. Make a dictionary of protein and length from FASTApeptides
 2. Map the protein length to the MaxQuant output peptides
 3. Check for any null values in the Length column (if any, impute with the median)
 4. Calculate SRI divided by the length of the protein (SRI/L)
 5. Calculate the sum of SRI/L over all proteins in the sample
 6. Calculate NSMT = (SRI/L) / (sum of SRI/L)
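
Written as a formula, step 6 reads as follows. This assumes that the SRI in the list and the `SMT` column in the code refer to the same per-protein summed intensity. The symbols $L_i$ (length of protein $i$) and $N$ (number of proteins in the sample) are introduced here for illustration.

```latex
\mathrm{NSMT}_i = \frac{\mathrm{SMT}_i / L_i}{\sum_{j=1}^{N} \mathrm{SMT}_j / L_j}
```

By construction, the NSMT values of all proteins in the sample sum to 1.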

In [74]:
length_dict = dict(zip(FASTApeptides.Protein , FASTApeptides.Length)) # dictionary with protein lengths 

In [75]:
MAXQuant["Length"] = MAXQuant['Protein'].map(length_dict) # map lengths to MAXQuant Proteins
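
`dict(zip(...))` silently keeps the last length when a protein name occurs with more than one length, which could happen if several UniProt entries were mapped to the same gene name. A hedged sanity check on toy data is sketched below; `GENE_X` and its two lengths are made up to show the case.

```python
import pandas as pd

fasta = pd.DataFrame({
    "Protein": ["RPL4", "RPL4", "EIF3J", "GENE_X", "GENE_X"],
    "Length": [427, 427, 258, 300, 310],
})

# Names that appear with more than one distinct length are ambiguous.
lengths_per_name = fasta.groupby("Protein")["Length"].nunique()
ambiguous = lengths_per_name[lengths_per_name > 1].index.tolist()

# Build the lookup only from names with a single, consistent length.
length_by_protein = (
    fasta[~fasta["Protein"].isin(ambiguous)]
    .drop_duplicates("Protein")
    .set_index("Protein")["Length"]
    .to_dict()
)

print(ambiguous)
print(length_by_protein)
```

Whether ambiguous names should be dropped, or resolved some other way, depends on how the ID mapping was done; the sketch only makes such cases visible.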

In [76]:
MAXQuant.isnull().sum() # 7498 rows have no matching protein length; solution: impute with the median value

Sequence                    0
Protein                     0
Reporter intensity 1        0
Reporter intensity 2        0
Reporter intensity 3        0
Reporter intensity 4        0
Reporter intensity 5        0
Reporter intensity 6        0
Reporter intensity 7        0
Reporter intensity 8        0
Reporter intensity 9        0
Reporter intensity 10       0
Missedcleavedpepcount       0
Total                       0
SMT                         0
Length                   7498
dtype: int64
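
The null count above is a number of rows, not of proteins. Before imputing, it can help to list which proteins are actually unmatched. The sketch below uses toy data; the two lengths are taken from the table shown further down, and the rest is illustrative.

```python
import pandas as pd

peptides = pd.DataFrame({"Protein": ["RPL4", "RPL4", "EEF1A1P8", "EIF3J"]})
length_by_protein = {"RPL4": 427, "EIF3J": 258}  # no entry for EEF1A1P8

peptides["Length"] = peptides["Protein"].map(length_by_protein)

# Rows without a length, and the distinct proteins behind them.
missing = peptides.loc[peptides["Length"].isna(), "Protein"]
n_rows_missing = int(missing.size)
unmatched_proteins = sorted(missing.unique())

print(n_rows_missing, unmatched_proteins)
```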

In [77]:
MAXQuant

Unnamed: 0,Sequence,Protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,Reporter intensity 9,Reporter intensity 10,Missedcleavedpepcount,Total,SMT,Length
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,1300.10,1281.10,0,16532.40,56437.60,918.0
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,2974.00,2880.20,0,44275.70,2270292.51,258.0
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,715.44,909.51,0,10601.01,2270292.51,258.0
6,AAAAAAALQAK,RPL4,1432.70,2065.60,1541.50,1309.60,1399.40,1293.00,1301.90,1506.40,973.21,933.73,0,13757.04,1979939.15,427.0
7,AAAAAAALQAK,RPL4,6730.20,10094.00,7433.50,6778.40,7854.90,6717.30,7399.50,7257.50,4343.90,4956.40,0,69565.60,1979939.15,427.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,366.15,809.01,0,9831.12,2048619.85,
98575,YYVTIIDAPGHR,EEF1A1P8,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,3560.30,4610.70,0,47814.00,2048619.85,
98576,YYVTIIDAPGHR,EEF1A1P8,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,474.26,639.43,0,6020.02,2048619.85,
98577,YYVTIIDAPGHR,EEF1A1P8,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,891.62,1267.50,0,14500.12,2048619.85,


In [78]:
MAXQuant['Length'] = MAXQuant['Length'].fillna(MAXQuant['Length'].median()) # impute missing protein lengths with the median length; assignment is used because in-place fillna on a selected column is deprecated in recent pandas

In [79]:
MAXQuant.isnull().sum()

Sequence                 0
Protein                  0
Reporter intensity 1     0
Reporter intensity 2     0
Reporter intensity 3     0
Reporter intensity 4     0
Reporter intensity 5     0
Reporter intensity 6     0
Reporter intensity 7     0
Reporter intensity 8     0
Reporter intensity 9     0
Reporter intensity 10    0
Missedcleavedpepcount    0
Total                    0
SMT                      0
Length                   0
dtype: int64

In [80]:
MAXQuant

Unnamed: 0,Sequence,Protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,Reporter intensity 9,Reporter intensity 10,Missedcleavedpepcount,Total,SMT,Length
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,1300.10,1281.10,0,16532.40,56437.60,918.0
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,2974.00,2880.20,0,44275.70,2270292.51,258.0
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,715.44,909.51,0,10601.01,2270292.51,258.0
6,AAAAAAALQAK,RPL4,1432.70,2065.60,1541.50,1309.60,1399.40,1293.00,1301.90,1506.40,973.21,933.73,0,13757.04,1979939.15,427.0
7,AAAAAAALQAK,RPL4,6730.20,10094.00,7433.50,6778.40,7854.90,6717.30,7399.50,7257.50,4343.90,4956.40,0,69565.60,1979939.15,427.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,366.15,809.01,0,9831.12,2048619.85,580.0
98575,YYVTIIDAPGHR,EEF1A1P8,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,3560.30,4610.70,0,47814.00,2048619.85,580.0
98576,YYVTIIDAPGHR,EEF1A1P8,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,474.26,639.43,0,6020.02,2048619.85,580.0
98577,YYVTIIDAPGHR,EEF1A1P8,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,891.62,1267.50,0,14500.12,2048619.85,580.0


In [81]:
MAXQuant["SMT_over_length"] = MAXQuant["SMT"] / MAXQuant["Length"] # calculate SMT divided by length

In [82]:
MAXQuant

Unnamed: 0,Sequence,Protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,Reporter intensity 9,Reporter intensity 10,Missedcleavedpepcount,Total,SMT,Length,SMT_over_length
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,1300.10,1281.10,0,16532.40,56437.60,918.0,61.478867
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,2974.00,2880.20,0,44275.70,2270292.51,258.0,8799.583372
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,715.44,909.51,0,10601.01,2270292.51,258.0,8799.583372
6,AAAAAAALQAK,RPL4,1432.70,2065.60,1541.50,1309.60,1399.40,1293.00,1301.90,1506.40,973.21,933.73,0,13757.04,1979939.15,427.0,4636.859836
7,AAAAAAALQAK,RPL4,6730.20,10094.00,7433.50,6778.40,7854.90,6717.30,7399.50,7257.50,4343.90,4956.40,0,69565.60,1979939.15,427.0,4636.859836
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,366.15,809.01,0,9831.12,2048619.85,580.0,3532.103190
98575,YYVTIIDAPGHR,EEF1A1P8,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,3560.30,4610.70,0,47814.00,2048619.85,580.0,3532.103190
98576,YYVTIIDAPGHR,EEF1A1P8,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,474.26,639.43,0,6020.02,2048619.85,580.0,3532.103190
98577,YYVTIIDAPGHR,EEF1A1P8,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,891.62,1267.50,0,14500.12,2048619.85,580.0,3532.103190


In [83]:
# copy MAXQuant with one row per protein, so that each protein contributes only once to the sum used for NSMT

In [84]:
x = MAXQuant.drop_duplicates(['Protein']).reset_index(drop = True )

In [85]:
x["sum_SMT_over_length"] = x['SMT_over_length'].sum()

In [86]:
x

Unnamed: 0,Sequence,Protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,Reporter intensity 9,Reporter intensity 10,Missedcleavedpepcount,Total,SMT,Length,SMT_over_length,sum_SMT_over_length
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.1,2029.8,1706.1,1627.9,2152.4,1703.4,1626.2,1878.3,1300.10,1281.10,0,16532.40,56437.600,918.0,61.478867,1.282166e+07
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.0,6282.9,5224.4,6393.7,3946.4,3680.9,3678.1,5648.1,2974.00,2880.20,0,44275.70,2270292.510,258.0,8799.583372,1.282166e+07
2,AAAAAAALQAK,RPL4,1432.7,2065.6,1541.5,1309.6,1399.4,1293.0,1301.9,1506.4,973.21,933.73,0,13757.04,1979939.150,427.0,4636.859836,1.282166e+07
3,AAAAAAGAASGLPGPVAQGLK,IPO9,2484.6,2721.5,2206.1,1993.4,2238.3,1884.8,2516.6,2614.4,996.36,1732.10,0,21388.16,3547224.919,1041.0,3407.516733,1.282166e+07
4,AAAAAAVGPGAGGAGSAVPGGAGPCATVSVFPGAR,CARM1,3983.0,5753.4,5197.5,4952.5,5411.9,4310.2,5176.5,4557.8,2695.60,5521.20,0,47559.60,874337.160,608.0,1438.054539,1.282166e+07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5754,YVLEEAEQLEPR,PRPF38A,8801.9,7389.6,8681.3,5452.9,9372.7,6875.3,9129.9,7517.4,4717.80,5896.50,0,73835.30,73835.300,312.0,236.651603,1.282166e+07
5755,YVNPETVAALLSGK,CDC25C,3454.3,3023.7,3369.5,2697.0,3653.8,2733.7,3832.9,3025.3,2214.70,2324.30,0,30329.20,30329.200,473.0,64.120930,1.282166e+07
5756,YWPVIPLK,GEMIN8,9407.1,9597.5,8917.3,9461.9,10562.0,8516.7,11244.0,11012.0,5008.00,5489.20,0,89215.70,89215.700,242.0,368.659917,1.282166e+07
5757,YYETVSDVLNSVK,COG2,4267.6,4600.4,4373.8,3621.2,4406.0,3164.2,5500.0,4586.4,2643.40,2982.80,0,40145.80,40145.800,738.0,54.398103,1.282166e+07


In [87]:
# dictionary for ignoring duplicate proteins in this cumulative sum
sum_of_SMT_over_length = dict(zip(x.Protein , x.sum_SMT_over_length)) 

In [88]:
MAXQuant["sum_of_SMT_over_length"] = MAXQuant['Protein'].map(sum_of_SMT_over_length)

In [89]:
MAXQuant["NSMT"] = MAXQuant["SMT_over_length"]/ MAXQuant["sum_of_SMT_over_length"] # (SMT/PROTEIN LENGTH)/ (SUM_SMT_over_length)
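
Because the sum of `SMT_over_length` is the same number for every protein, it can also be kept as a single scalar instead of a mapped column. The sketch below shows this on toy data with made-up values; each protein is counted once in the sum, as in the cells above.

```python
import pandas as pd

df = pd.DataFrame({
    "Protein": ["A", "A", "B"],
    "SMT_over_length": [2.0, 2.0, 3.0],
})

# One row per protein so that repeated peptide rows do not inflate the sum.
total = df.drop_duplicates("Protein")["SMT_over_length"].sum()

df["NSMT"] = df["SMT_over_length"] / total

print(total)
print(df)
```

With one row per protein, the resulting NSMT values (0.4 for `A`, 0.6 for `B`) add up to 1.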

In [90]:
MAXQuant # NSMT is calculated 

Unnamed: 0,Sequence,Protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,Reporter intensity 9,Reporter intensity 10,Missedcleavedpepcount,Total,SMT,Length,SMT_over_length,sum_of_SMT_over_length,NSMT
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,1300.10,1281.10,0,16532.40,56437.60,918.0,61.478867,1.282166e+07,0.000005
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,2974.00,2880.20,0,44275.70,2270292.51,258.0,8799.583372,1.282166e+07,0.000686
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,715.44,909.51,0,10601.01,2270292.51,258.0,8799.583372,1.282166e+07,0.000686
6,AAAAAAALQAK,RPL4,1432.70,2065.60,1541.50,1309.60,1399.40,1293.00,1301.90,1506.40,973.21,933.73,0,13757.04,1979939.15,427.0,4636.859836,1.282166e+07,0.000362
7,AAAAAAALQAK,RPL4,6730.20,10094.00,7433.50,6778.40,7854.90,6717.30,7399.50,7257.50,4343.90,4956.40,0,69565.60,1979939.15,427.0,4636.859836,1.282166e+07,0.000362
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,366.15,809.01,0,9831.12,2048619.85,580.0,3532.103190,1.282166e+07,0.000275
98575,YYVTIIDAPGHR,EEF1A1P8,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,3560.30,4610.70,0,47814.00,2048619.85,580.0,3532.103190,1.282166e+07,0.000275
98576,YYVTIIDAPGHR,EEF1A1P8,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,474.26,639.43,0,6020.02,2048619.85,580.0,3532.103190,1.282166e+07,0.000275
98577,YYVTIIDAPGHR,EEF1A1P8,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,891.62,1267.50,0,14500.12,2048619.85,580.0,3532.103190,1.282166e+07,0.000275


## CALCULATE SRI / SC for MAXQuant
 1. Get the spectral count (SC) column for the proteins, as SRI has already been calculated
 2. Calculate SRI divided by the spectral count for each protein
 3. Calculate the sum of SRI divided by spectral count over all proteins
 4. Calculate the normalized SRI divided by SC

In [91]:
# assign a count value of '1' for each PSM (peptide spectrum match)
MAXQuant["PSM"] = 1

In [92]:
MAXQuant

Unnamed: 0,Sequence,Protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,Reporter intensity 9,Reporter intensity 10,Missedcleavedpepcount,Total,SMT,Length,SMT_over_length,sum_of_SMT_over_length,NSMT,PSM
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,1300.10,1281.10,0,16532.40,56437.60,918.0,61.478867,1.282166e+07,0.000005,1
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,2974.00,2880.20,0,44275.70,2270292.51,258.0,8799.583372,1.282166e+07,0.000686,1
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,715.44,909.51,0,10601.01,2270292.51,258.0,8799.583372,1.282166e+07,0.000686,1
6,AAAAAAALQAK,RPL4,1432.70,2065.60,1541.50,1309.60,1399.40,1293.00,1301.90,1506.40,973.21,933.73,0,13757.04,1979939.15,427.0,4636.859836,1.282166e+07,0.000362,1
7,AAAAAAALQAK,RPL4,6730.20,10094.00,7433.50,6778.40,7854.90,6717.30,7399.50,7257.50,4343.90,4956.40,0,69565.60,1979939.15,427.0,4636.859836,1.282166e+07,0.000362,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,366.15,809.01,0,9831.12,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,1
98575,YYVTIIDAPGHR,EEF1A1P8,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,3560.30,4610.70,0,47814.00,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,1
98576,YYVTIIDAPGHR,EEF1A1P8,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,474.26,639.43,0,6020.02,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,1
98577,YYVTIIDAPGHR,EEF1A1P8,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,891.62,1267.50,0,14500.12,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,1


In [93]:
# count the number of PSMs per protein (the spectral count)
PSM_per_protein = MAXQuant.groupby("Protein")["PSM"].sum().to_frame()

In [94]:
PSM_per_protein = PSM_per_protein.reset_index(level=0) # reset the index so that Protein becomes a column, which makes it easier to map the values to the right proteins

In [95]:
PSM_per_protein_dict = dict(zip(PSM_per_protein.Protein, PSM_per_protein.PSM))

In [96]:
MAXQuant["SC"]= MAXQuant['Protein'].map(PSM_per_protein_dict)

### Calculate NSAF 

In [97]:
MAXQuant['SAF'] = MAXQuant['SC'] / MAXQuant['Length']

In [98]:
MAXQuant

Unnamed: 0,Sequence,Protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,...,Missedcleavedpepcount,Total,SMT,Length,SMT_over_length,sum_of_SMT_over_length,NSMT,PSM,SC,SAF
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,...,0,16532.40,56437.60,918.0,61.478867,1.282166e+07,0.000005,1,3,0.003268
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,...,0,44275.70,2270292.51,258.0,8799.583372,1.282166e+07,0.000686,1,25,0.096899
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,...,0,10601.01,2270292.51,258.0,8799.583372,1.282166e+07,0.000686,1,25,0.096899
6,AAAAAAALQAK,RPL4,1432.70,2065.60,1541.50,1309.60,1399.40,1293.00,1301.90,1506.40,...,0,13757.04,1979939.15,427.0,4636.859836,1.282166e+07,0.000362,1,16,0.037471
7,AAAAAAALQAK,RPL4,6730.20,10094.00,7433.50,6778.40,7854.90,6717.30,7399.50,7257.50,...,0,69565.60,1979939.15,427.0,4636.859836,1.282166e+07,0.000362,1,16,0.037471
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,...,0,9831.12,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,1,21,0.036207
98575,YYVTIIDAPGHR,EEF1A1P8,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,...,0,47814.00,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,1,21,0.036207
98576,YYVTIIDAPGHR,EEF1A1P8,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,...,0,6020.02,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,1,21,0.036207
98577,YYVTIIDAPGHR,EEF1A1P8,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,...,0,14500.12,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,1,21,0.036207


In [99]:
i = MAXQuant.drop_duplicates(['Protein']).reset_index(drop = True )

In [100]:
i["sum_SAF_over_length"] = i['SAF'].sum()

In [101]:
# dictionary mapping each protein to the total SAF; the total is taken over unique proteins so duplicates are not counted twice
sum_of_SAF_over_length = dict(zip(i.Protein , i.sum_SAF_over_length)) 

In [102]:
MAXQuant["sum_of_SAF_over_length"] = MAXQuant['Protein'].map(sum_of_SAF_over_length)

In [103]:
MAXQuant

Unnamed: 0,Sequence,Protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,...,Total,SMT,Length,SMT_over_length,sum_of_SMT_over_length,NSMT,PSM,SC,SAF,sum_of_SAF_over_length
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,...,16532.40,56437.60,918.0,61.478867,1.282166e+07,0.000005,1,3,0.003268,179.016358
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,...,44275.70,2270292.51,258.0,8799.583372,1.282166e+07,0.000686,1,25,0.096899,179.016358
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,...,10601.01,2270292.51,258.0,8799.583372,1.282166e+07,0.000686,1,25,0.096899,179.016358
6,AAAAAAALQAK,RPL4,1432.70,2065.60,1541.50,1309.60,1399.40,1293.00,1301.90,1506.40,...,13757.04,1979939.15,427.0,4636.859836,1.282166e+07,0.000362,1,16,0.037471,179.016358
7,AAAAAAALQAK,RPL4,6730.20,10094.00,7433.50,6778.40,7854.90,6717.30,7399.50,7257.50,...,69565.60,1979939.15,427.0,4636.859836,1.282166e+07,0.000362,1,16,0.037471,179.016358
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,...,9831.12,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,1,21,0.036207,179.016358
98575,YYVTIIDAPGHR,EEF1A1P8,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,...,47814.00,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,1,21,0.036207,179.016358
98576,YYVTIIDAPGHR,EEF1A1P8,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,...,6020.02,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,1,21,0.036207,179.016358
98577,YYVTIIDAPGHR,EEF1A1P8,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,...,14500.12,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,1,21,0.036207,179.016358


In [104]:
MAXQuant["NSAF"] = MAXQuant["SAF"]/ MAXQuant["sum_of_SAF_over_length"] 
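
For reference, the cells above follow the usual definition of the normalised spectral abundance factor. The symbols are introduced here for illustration: $SC_i$ is the spectral count of protein $i$, $L_i$ its length, and $N$ the number of distinct proteins.

$$
\mathrm{SAF}_i = \frac{SC_i}{L_i}, \qquad \mathrm{NSAF}_i = \frac{\mathrm{SAF}_i}{\sum_{j=1}^{N} \mathrm{SAF}_j}
$$

Despite its name, the column `sum_of_SAF_over_length` holds the denominator $\sum_j \mathrm{SAF}_j$ (179.016358 in this run). For INCENP this gives $0.003268 / 179.016358 \approx 0.000018$.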

In [105]:
MAXQuant

Unnamed: 0,Sequence,Protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,...,SMT,Length,SMT_over_length,sum_of_SMT_over_length,NSMT,PSM,SC,SAF,sum_of_SAF_over_length,NSAF
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,...,56437.60,918.0,61.478867,1.282166e+07,0.000005,1,3,0.003268,179.016358,0.000018
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,...,2270292.51,258.0,8799.583372,1.282166e+07,0.000686,1,25,0.096899,179.016358,0.000541
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,...,2270292.51,258.0,8799.583372,1.282166e+07,0.000686,1,25,0.096899,179.016358,0.000541
6,AAAAAAALQAK,RPL4,1432.70,2065.60,1541.50,1309.60,1399.40,1293.00,1301.90,1506.40,...,1979939.15,427.0,4636.859836,1.282166e+07,0.000362,1,16,0.037471,179.016358,0.000209
7,AAAAAAALQAK,RPL4,6730.20,10094.00,7433.50,6778.40,7854.90,6717.30,7399.50,7257.50,...,1979939.15,427.0,4636.859836,1.282166e+07,0.000362,1,16,0.037471,179.016358,0.000209
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,...,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,1,21,0.036207,179.016358,0.000202
98575,YYVTIIDAPGHR,EEF1A1P8,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,...,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,1,21,0.036207,179.016358,0.000202
98576,YYVTIIDAPGHR,EEF1A1P8,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,...,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,1,21,0.036207,179.016358,0.000202
98577,YYVTIIDAPGHR,EEF1A1P8,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,...,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,1,21,0.036207,179.016358,0.000202


In [106]:
MAXQuant.isnull().sum() # spectral count has been successfully matched, with no null values

Sequence                  0
Protein                   0
Reporter intensity 1      0
Reporter intensity 2      0
Reporter intensity 3      0
Reporter intensity 4      0
Reporter intensity 5      0
Reporter intensity 6      0
Reporter intensity 7      0
Reporter intensity 8      0
Reporter intensity 9      0
Reporter intensity 10     0
Missedcleavedpepcount     0
Total                     0
SMT                       0
Length                    0
SMT_over_length           0
sum_of_SMT_over_length    0
NSMT                      0
PSM                       0
SC                        0
SAF                       0
sum_of_SAF_over_length    0
NSAF                      0
dtype: int64

In [107]:
# calculate SMT over SC this time
MAXQuant["SMT_over_SC"] = MAXQuant["SMT"] / MAXQuant["SC"]

In [108]:
MAXQuant

Unnamed: 0,Sequence,Protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,...,Length,SMT_over_length,sum_of_SMT_over_length,NSMT,PSM,SC,SAF,sum_of_SAF_over_length,NSAF,SMT_over_SC
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,...,918.0,61.478867,1.282166e+07,0.000005,1,3,0.003268,179.016358,0.000018,18812.533333
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,...,258.0,8799.583372,1.282166e+07,0.000686,1,25,0.096899,179.016358,0.000541,90811.700400
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,...,258.0,8799.583372,1.282166e+07,0.000686,1,25,0.096899,179.016358,0.000541,90811.700400
6,AAAAAAALQAK,RPL4,1432.70,2065.60,1541.50,1309.60,1399.40,1293.00,1301.90,1506.40,...,427.0,4636.859836,1.282166e+07,0.000362,1,16,0.037471,179.016358,0.000209,123746.196875
7,AAAAAAALQAK,RPL4,6730.20,10094.00,7433.50,6778.40,7854.90,6717.30,7399.50,7257.50,...,427.0,4636.859836,1.282166e+07,0.000362,1,16,0.037471,179.016358,0.000209,123746.196875
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,...,580.0,3532.103190,1.282166e+07,0.000275,1,21,0.036207,179.016358,0.000202,97553.326190
98575,YYVTIIDAPGHR,EEF1A1P8,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,...,580.0,3532.103190,1.282166e+07,0.000275,1,21,0.036207,179.016358,0.000202,97553.326190
98576,YYVTIIDAPGHR,EEF1A1P8,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,...,580.0,3532.103190,1.282166e+07,0.000275,1,21,0.036207,179.016358,0.000202,97553.326190
98577,YYVTIIDAPGHR,EEF1A1P8,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,...,580.0,3532.103190,1.282166e+07,0.000275,1,21,0.036207,179.016358,0.000202,97553.326190


### Copy MAXQuant but drop duplicate proteins, then calculate the normalised SMT divided by spectral count, counting each protein only once in the sum

In [109]:
y = MAXQuant.drop_duplicates(['Protein']).reset_index(drop = True )

In [110]:
y["sum_SMT_over_SC"] = y['SMT_over_SC'].sum()

In [111]:
y

Unnamed: 0,Sequence,Protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,...,SMT_over_length,sum_of_SMT_over_length,NSMT,PSM,SC,SAF,sum_of_SAF_over_length,NSAF,SMT_over_SC,sum_SMT_over_SC
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.1,2029.8,1706.1,1627.9,2152.4,1703.4,1626.2,1878.3,...,61.478867,1.282166e+07,0.000005,1,3,0.003268,179.016358,0.000018,18812.533333,3.517341e+08
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.0,6282.9,5224.4,6393.7,3946.4,3680.9,3678.1,5648.1,...,8799.583372,1.282166e+07,0.000686,1,25,0.096899,179.016358,0.000541,90811.700400,3.517341e+08
2,AAAAAAALQAK,RPL4,1432.7,2065.6,1541.5,1309.6,1399.4,1293.0,1301.9,1506.4,...,4636.859836,1.282166e+07,0.000362,1,16,0.037471,179.016358,0.000209,123746.196875,3.517341e+08
3,AAAAAAGAASGLPGPVAQGLK,IPO9,2484.6,2721.5,2206.1,1993.4,2238.3,1884.8,2516.6,2614.4,...,3407.516733,1.282166e+07,0.000266,1,55,0.052834,179.016358,0.000295,64494.998527,3.517341e+08
4,AAAAAAVGPGAGGAGSAVPGGAGPCATVSVFPGAR,CARM1,3983.0,5753.4,5197.5,4952.5,5411.9,4310.2,5176.5,4557.8,...,1438.054539,1.282166e+07,0.000112,1,12,0.019737,179.016358,0.000110,72861.430000,3.517341e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5754,YVLEEAEQLEPR,PRPF38A,8801.9,7389.6,8681.3,5452.9,9372.7,6875.3,9129.9,7517.4,...,236.651603,1.282166e+07,0.000018,1,1,0.003205,179.016358,0.000018,73835.300000,3.517341e+08
5755,YVNPETVAALLSGK,CDC25C,3454.3,3023.7,3369.5,2697.0,3653.8,2733.7,3832.9,3025.3,...,64.120930,1.282166e+07,0.000005,1,1,0.002114,179.016358,0.000012,30329.200000,3.517341e+08
5756,YWPVIPLK,GEMIN8,9407.1,9597.5,8917.3,9461.9,10562.0,8516.7,11244.0,11012.0,...,368.659917,1.282166e+07,0.000029,1,1,0.004132,179.016358,0.000023,89215.700000,3.517341e+08
5757,YYETVSDVLNSVK,COG2,4267.6,4600.4,4373.8,3621.2,4406.0,3164.2,5500.0,4586.4,...,54.398103,1.282166e+07,0.000004,1,1,0.001355,179.016358,0.000008,40145.800000,3.517341e+08


In [112]:
# dictionary mapping each protein to the total SMT/SC; the total is taken over unique proteins so duplicates are not counted twice
sum_of_SMT_over_SC = dict(zip(y.Protein , y.sum_SMT_over_SC)) 

In [113]:
MAXQuant["sum_of_SMT_over_SC"] = MAXQuant['Protein'].map(sum_of_SMT_over_SC)

In [114]:
MAXQuant

Unnamed: 0,Sequence,Protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,...,SMT_over_length,sum_of_SMT_over_length,NSMT,PSM,SC,SAF,sum_of_SAF_over_length,NSAF,SMT_over_SC,sum_of_SMT_over_SC
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,...,61.478867,1.282166e+07,0.000005,1,3,0.003268,179.016358,0.000018,18812.533333,3.517341e+08
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,...,8799.583372,1.282166e+07,0.000686,1,25,0.096899,179.016358,0.000541,90811.700400,3.517341e+08
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,...,8799.583372,1.282166e+07,0.000686,1,25,0.096899,179.016358,0.000541,90811.700400,3.517341e+08
6,AAAAAAALQAK,RPL4,1432.70,2065.60,1541.50,1309.60,1399.40,1293.00,1301.90,1506.40,...,4636.859836,1.282166e+07,0.000362,1,16,0.037471,179.016358,0.000209,123746.196875,3.517341e+08
7,AAAAAAALQAK,RPL4,6730.20,10094.00,7433.50,6778.40,7854.90,6717.30,7399.50,7257.50,...,4636.859836,1.282166e+07,0.000362,1,16,0.037471,179.016358,0.000209,123746.196875,3.517341e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,...,3532.103190,1.282166e+07,0.000275,1,21,0.036207,179.016358,0.000202,97553.326190,3.517341e+08
98575,YYVTIIDAPGHR,EEF1A1P8,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,...,3532.103190,1.282166e+07,0.000275,1,21,0.036207,179.016358,0.000202,97553.326190,3.517341e+08
98576,YYVTIIDAPGHR,EEF1A1P8,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,...,3532.103190,1.282166e+07,0.000275,1,21,0.036207,179.016358,0.000202,97553.326190,3.517341e+08
98577,YYVTIIDAPGHR,EEF1A1P8,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,...,3532.103190,1.282166e+07,0.000275,1,21,0.036207,179.016358,0.000202,97553.326190,3.517341e+08


In [115]:
# calculate the normalised SMT over spectral count, abbreviated to NSMT_over_SC
MAXQuant["NSMT_over_SC"] = MAXQuant["SMT_over_SC"]/ MAXQuant["sum_of_SMT_over_SC"]

In [116]:
MAXQuant

Unnamed: 0,Sequence,Protein,Reporter intensity 1,Reporter intensity 2,Reporter intensity 3,Reporter intensity 4,Reporter intensity 5,Reporter intensity 6,Reporter intensity 7,Reporter intensity 8,...,sum_of_SMT_over_length,NSMT,PSM,SC,SAF,sum_of_SAF_over_length,NSAF,SMT_over_SC,sum_of_SMT_over_SC,NSMT_over_SC
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,1227.10,2029.80,1706.10,1627.90,2152.40,1703.40,1626.20,1878.30,...,1.282166e+07,0.000005,1,3,0.003268,179.016358,0.000018,18812.533333,3.517341e+08,0.000053
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,3567.00,6282.90,5224.40,6393.70,3946.40,3680.90,3678.10,5648.10,...,1.282166e+07,0.000686,1,25,0.096899,179.016358,0.000541,90811.700400,3.517341e+08,0.000258
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,1023.60,960.57,945.93,1123.60,1554.30,798.46,1275.30,1294.30,...,1.282166e+07,0.000686,1,25,0.096899,179.016358,0.000541,90811.700400,3.517341e+08,0.000258
6,AAAAAAALQAK,RPL4,1432.70,2065.60,1541.50,1309.60,1399.40,1293.00,1301.90,1506.40,...,1.282166e+07,0.000362,1,16,0.037471,179.016358,0.000209,123746.196875,3.517341e+08,0.000352
7,AAAAAAALQAK,RPL4,6730.20,10094.00,7433.50,6778.40,7854.90,6717.30,7399.50,7257.50,...,1.282166e+07,0.000362,1,16,0.037471,179.016358,0.000209,123746.196875,3.517341e+08,0.000352
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,1247.40,1664.50,1017.60,845.33,1008.30,834.43,1032.80,1005.60,...,1.282166e+07,0.000275,1,21,0.036207,179.016358,0.000202,97553.326190,3.517341e+08,0.000277
98575,YYVTIIDAPGHR,EEF1A1P8,4767.50,6249.20,4322.00,4218.60,4899.10,4623.20,5160.60,5402.80,...,1.282166e+07,0.000275,1,21,0.036207,179.016358,0.000202,97553.326190,3.517341e+08,0.000277
98576,YYVTIIDAPGHR,EEF1A1P8,589.45,671.16,526.84,602.89,713.17,424.66,694.02,684.14,...,1.282166e+07,0.000275,1,21,0.036207,179.016358,0.000202,97553.326190,3.517341e+08,0.000277
98577,YYVTIIDAPGHR,EEF1A1P8,1193.60,2036.40,1479.00,1289.70,1870.30,1505.50,1447.00,1519.50,...,1.282166e+07,0.000275,1,21,0.036207,179.016358,0.000202,97553.326190,3.517341e+08,0.000277


### Fix Layout 

In [117]:
MAXQuant.drop(columns=MAXQuant.columns[2:12], inplace=True) # drop the ten reporter intensity columns, which are no longer needed

In [118]:
MAXQuant.drop('PSM', axis=1, inplace=True) # remove the PSM placeholder column now that the spectral count per protein is available

In [119]:
FASTApeptides = FASTApeptides.rename(columns={"Peptide": "Sequence"})

In [120]:
MAXQuant

Unnamed: 0,Sequence,Protein,Missedcleavedpepcount,Total,SMT,Length,SMT_over_length,sum_of_SMT_over_length,NSMT,SC,SAF,sum_of_SAF_over_length,NSAF,SMT_over_SC,sum_of_SMT_over_SC,NSMT_over_SC
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,0,16532.40,56437.60,918.0,61.478867,1.282166e+07,0.000005,3,0.003268,179.016358,0.000018,18812.533333,3.517341e+08,0.000053
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,0,44275.70,2270292.51,258.0,8799.583372,1.282166e+07,0.000686,25,0.096899,179.016358,0.000541,90811.700400,3.517341e+08,0.000258
2,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,0,10601.01,2270292.51,258.0,8799.583372,1.282166e+07,0.000686,25,0.096899,179.016358,0.000541,90811.700400,3.517341e+08,0.000258
6,AAAAAAALQAK,RPL4,0,13757.04,1979939.15,427.0,4636.859836,1.282166e+07,0.000362,16,0.037471,179.016358,0.000209,123746.196875,3.517341e+08,0.000352
7,AAAAAAALQAK,RPL4,0,69565.60,1979939.15,427.0,4636.859836,1.282166e+07,0.000362,16,0.037471,179.016358,0.000209,123746.196875,3.517341e+08,0.000352
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98574,YYVTIIDAPGHR,EEF1A1P8,0,9831.12,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,21,0.036207,179.016358,0.000202,97553.326190,3.517341e+08,0.000277
98575,YYVTIIDAPGHR,EEF1A1P8,0,47814.00,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,21,0.036207,179.016358,0.000202,97553.326190,3.517341e+08,0.000277
98576,YYVTIIDAPGHR,EEF1A1P8,0,6020.02,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,21,0.036207,179.016358,0.000202,97553.326190,3.517341e+08,0.000277
98577,YYVTIIDAPGHR,EEF1A1P8,0,14500.12,2048619.85,580.0,3532.103190,1.282166e+07,0.000275,21,0.036207,179.016358,0.000202,97553.326190,3.517341e+08,0.000277


In [121]:
# MAXQuant = MAXQuant[['Sequence', 'Protein', 'NSMT', 'NSMT_over_SC']] # optionally keep only the columns needed

In [122]:
MAXQuant.isnull().sum()

Sequence                  0
Protein                   0
Missedcleavedpepcount     0
Total                     0
SMT                       0
Length                    0
SMT_over_length           0
sum_of_SMT_over_length    0
NSMT                      0
SC                        0
SAF                       0
sum_of_SAF_over_length    0
NSAF                      0
SMT_over_SC               0
sum_of_SMT_over_SC        0
NSMT_over_SC              0
dtype: int64

In [123]:
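# check how many protein+peptide pairs in detected_peptides are present in fasta_peptides (should be 42,932)
# note: len() of the boolean mask returns the total number of rows, not the number of matches; use .sum() on the mask to count matching pairs
len(MAXQuant.set_index(['Protein', 'Sequence']).index.isin(FASTApeptides.set_index(['Protein', 'Sequence']).index))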

85054
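
The check above calls `len()` on a boolean mask, which only reports how many rows were tested. A minimal sketch of the difference between the row count and the match count follows; the frames `detected` and `reference` are hypothetical stand-ins for `MAXQuant` and `FASTApeptides`.

```python
import pandas as pd

# Hypothetical stand-ins for the detected and reference peptide tables.
detected = pd.DataFrame({
    "Protein": ["A", "A", "B"],
    "Sequence": ["PEPK", "TIDER", "XXXK"],
})
reference = pd.DataFrame({
    "Protein": ["A", "A", "B"],
    "Sequence": ["PEPK", "TIDER", "YYYR"],
})

# Boolean mask: True where the (Protein, Sequence) pair also occurs in reference.
mask = detected.set_index(["Protein", "Sequence"]).index.isin(
    reference.set_index(["Protein", "Sequence"]).index
)

n_rows = len(mask)           # number of rows tested, regardless of matches
n_matches = int(mask.sum())  # number of rows whose pair is present in reference
```

Here `n_rows` is 3 while `n_matches` is 2, which is why the output of the cell above equals the row count of `MAXQuant` rather than the expected number of matching pairs.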

In [124]:
len(FASTApeptides['Protein'])

556449

### Remove Duplicated Peptide Sequences after MS2 intensities quantitation

In [125]:
# Check and remove duplicated sequences
MAXQuant['Sequence'].value_counts()

LCYVALDFEQEMATAASSSSLEK          122
WGDAGAEYVVESTGVFTTMEK             96
NMITGTSQADCAVLIVAAGVGEFEAGISK     64
HQGVMVGMGQK                       57
FTASAGIQVVGDDLTVTNPK              54
                                ... 
ISLAIPNLGNTSQQEYK                  1
ISIFNEHK                           1
ISIEMHGTLEDQLSHLR                  1
ISICSSDK                           1
LDVTLAK                            1
Name: Sequence, Length: 37437, dtype: int64

In [126]:
MAXQuant.shape

(85054, 16)

In [127]:
MAXQuant = MAXQuant.drop_duplicates(['Sequence','Protein'], keep = 'first').reset_index(drop = True )
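
De-duplicating on `['Sequence', 'Protein']` keeps one row per pair, so a sequence assigned to two different proteins would still appear twice. A minimal sketch of how such shared sequences could be listed is shown here; the frame `psms` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical PSM table: SHAREDK is assigned to two different proteins.
psms = pd.DataFrame({
    "Sequence": ["PEPK", "PEPK", "TIDER", "TIDER", "SHAREDK", "SHAREDK"],
    "Protein": ["A", "A", "A", "A", "A", "B"],
})

# Keep one row per (Sequence, Protein) pair, as in the cell above.
unique_pairs = psms.drop_duplicates(["Sequence", "Protein"], keep="first").reset_index(drop=True)

# Count distinct proteins per sequence and list sequences seen in more than one.
proteins_per_sequence = unique_pairs.groupby("Sequence")["Protein"].nunique()
shared = proteins_per_sequence[proteins_per_sequence > 1].index.tolist()
```

In this notebook the row count after de-duplication equals the number of unique sequences (37437), which suggests that no sequence is shared between proteins.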

In [128]:
len(MAXQuant['Sequence'].unique())

37437

In [129]:
len(MAXQuant['Protein'].unique())

5759

In [130]:
MAXQuant['Sequence'].value_counts() 

AAAAAAAATMALAAPSSPTPESPTMLTK    1
NSVEEWTTEDWTEDLSETK             1
NSSHAGAFVIVTEEAIAK              1
NSSNKPAVTTK                     1
NSSPEDLFDEI                     1
                               ..
GEDIFLDMFEDEYR                  1
GEDLFEDGGIIR                    1
GEDLTEEEDGGIIR                  1
GEDMMHPLK                       1
YYVTIIDAPGHR                    1
Name: Sequence, Length: 37437, dtype: int64

In [131]:
MAXQuant.NSMT.apply(type) # quantifications are stored as floats, so very small values are displayed in scientific notation

0        <class 'float'>
1        <class 'float'>
2        <class 'float'>
3        <class 'float'>
4        <class 'float'>
              ...       
37432    <class 'float'>
37433    <class 'float'>
37434    <class 'float'>
37435    <class 'float'>
37436    <class 'float'>
Name: NSMT, Length: 37437, dtype: object

### In FASTApeptides.tsv, remove all peptides from proteins that are not in the MaxQuant output (keep only peptides from proteins that are)

In [132]:
# remove proteins in fasta_peptides not present in detected_peptides - let this be expected_peptides
expected_peptides = FASTApeptides[FASTApeptides["Protein"].isin(MAXQuant["Protein"])]
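
The filter above can be checked on a small example. This is a sketch with hypothetical frames `fasta` and `detected`; it only illustrates that the kept and dropped rows partition the reference table.

```python
import pandas as pd

# Hypothetical reference digest and detected peptides.
fasta = pd.DataFrame({
    "Protein": ["A", "A", "B", "C"],
    "Sequence": ["PEPK", "TIDER", "YYYR", "ZZZK"],
})
detected = pd.DataFrame({"Protein": ["A", "C"], "Sequence": ["PEPK", "ZZZK"]})

# True for reference peptides whose protein was observed at least once.
keep = fasta["Protein"].isin(detected["Protein"])

expected = fasta[keep].copy()  # peptides that could be in the sample
dropped = fasta[~keep]         # peptides from proteins that were never observed
```

Taking a `.copy()` of the filtered frame is a design choice made here so that later column assignments do not act on a view of the original table.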

In [133]:
expected_peptides # a data frame of tryptic digest peptides that could be in the sample

Unnamed: 0,Protein,Sequence,Length
161,BRD4,MSAESGPGTR,1362
162,BRD4,NLPVMGDGLETSQMSTTQAQAQPQPANAASTNPPPPETSNPNKPK,1362
163,BRD4,QTNQLQYLLR,1362
164,BRD4,HQFAWPFQQPVDAVK,1362
165,BRD4,LNLPDYYK,1362
...,...,...,...
556444,ERAP2,ENWTHLLK,960
556445,ERAP2,FDLGSYDIR,960
556446,ERAP2,MIISGTTAHFSSK,960
556447,ERAP2,LFFESLEAQGSHLDIFQTVLETITK,960


## Get Undetected Peptides

In [134]:
undetected_peptides = pd.merge(expected_peptides, MAXQuant, on=["Sequence"], how ='left', indicator=True)
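
The `_merge` indicator from a left join can also be turned directly into the 1/0 detection label described in the data wrangling steps. A minimal sketch, assuming the detected table has already been de-duplicated by sequence; the frames and the `Detected` column name are hypothetical.

```python
import pandas as pd

# Hypothetical expected and detected peptide tables.
expected = pd.DataFrame({
    "Protein": ["A", "A", "B"],
    "Sequence": ["PEPK", "TIDER", "YYYR"],
})
detected = pd.DataFrame({
    "Sequence": ["PEPK"],
    "NSMT": [0.5],
})

# Left join keeps every expected peptide; _merge records where each row came from.
merged = expected.merge(detected, on="Sequence", how="left", indicator=True)

# 1 = detected (present in both tables), 0 = not detected (left_only).
merged["Detected"] = (merged["_merge"] == "both").astype(int)

# Expected peptides that have no match among the detected peptides.
undetected = merged.loc[merged["_merge"] == "left_only", ["Protein", "Sequence"]]
```

If the detected table contained the same sequence more than once, the left join would repeat the matching expected rows, which is one reason to remove duplicates before merging.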

In [135]:
undetected_peptides

Unnamed: 0,Protein_x,Sequence,Length_x,Protein_y,Missedcleavedpepcount,Total,SMT,Length_y,SMT_over_length,sum_of_SMT_over_length,NSMT,SC,SAF,sum_of_SAF_over_length,NSAF,SMT_over_SC,sum_of_SMT_over_SC,NSMT_over_SC,_merge
0,BRD4,MSAESGPGTR,1362,,,,,,,,,,,,,,,,left_only
1,BRD4,NLPVMGDGLETSQMSTTQAQAQPQPANAASTNPPPPETSNPNKPK,1362,,,,,,,,,,,,,,,,left_only
2,BRD4,QTNQLQYLLR,1362,,,,,,,,,,,,,,,,left_only
3,BRD4,HQFAWPFQQPVDAVK,1362,,,,,,,,,,,,,,,,left_only
4,BRD4,LNLPDYYK,1362,,,,,,,,,,,,,,,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
159890,ERAP2,ENWTHLLK,960,ERAP2,0.0,22632.5,1218369.7,960.0,1269.135104,1.282166e+07,0.000099,13.0,0.013542,179.016358,0.000076,93720.746154,3.517341e+08,0.000266,both
159891,ERAP2,FDLGSYDIR,960,,,,,,,,,,,,,,,,left_only
159892,ERAP2,MIISGTTAHFSSK,960,,,,,,,,,,,,,,,,left_only
159893,ERAP2,LFFESLEAQGSHLDIFQTVLETITK,960,,,,,,,,,,,,,,,,left_only


## FILTER DATA

In [136]:
# count expected peptides that have no match in MAXQuant (left_only)
len(undetected_peptides[undetected_peptides['_merge'].str.contains("left_only", na=False)])

123723

In [137]:
len(undetected_peptides[undetected_peptides['_merge'].str.contains("both", na=False)])

36172

In [138]:
len(expected_peptides[expected_peptides["Sequence"].isin(MAXQuant["Sequence"])])

36172

In [139]:
# find number of undetected using isin
len(expected_peptides[~expected_peptides["Sequence"].isin(MAXQuant["Sequence"])])

123723

In [140]:
# count the rows in MAXQuant whose protein is also present in FASTApeptides
len(MAXQuant[MAXQuant["Protein"].isin(FASTApeptides["Protein"])])

34786

In [141]:
undetected_peptides = expected_peptides[~expected_peptides["Sequence"].isin(MAXQuant["Sequence"])].copy() # copy so that columns added later do not act on a view of expected_peptides
undetected_peptides.shape

(123723, 3)

In [142]:
undetected_peptides.describe(include="all")

Unnamed: 0,Protein,Sequence,Length
count,123723,123723,123723.0
unique,5019,121989,
top,SYNE2,IHTGEKPYK,
freq,402,18,
mean,,,1176.378984
std,,,1062.81095
min,,,58.0
25%,,,519.0
50%,,,858.0
75%,,,1435.0


In [143]:
undetected_peptides.shape

(123723, 3)

In [144]:
undetected_peptides

Unnamed: 0,Protein,Sequence,Length
161,BRD4,MSAESGPGTR,1362
162,BRD4,NLPVMGDGLETSQMSTTQAQAQPQPANAASTNPPPPETSNPNKPK,1362
163,BRD4,QTNQLQYLLR,1362
164,BRD4,HQFAWPFQQPVDAVK,1362
165,BRD4,LNLPDYYK,1362
...,...,...,...
556440,ERAP2,ILYALSTSK,960
556442,ERAP2,TQNLAALLHAIAR,960
556445,ERAP2,FDLGSYDIR,960
556446,ERAP2,MIISGTTAHFSSK,960


In [145]:
# check for any intersection between detected and undetected (should be 0)
print(len(set(MAXQuant["Sequence"]).intersection(set(undetected_peptides["Sequence"]))))

0


In [146]:
# count the rows in FASTApeptides whose protein is not present in MAXQuant
len(FASTApeptides[~FASTApeptides["Protein"].isin(MAXQuant["Protein"])])

396554

## Mapping SRI for Undetected Peptides

1. Make a dictionary of per-protein SRI values to map onto the undetected peptides.
2. Protein lengths are already available from FASTApeptides, so there is no need to map them again.
3. Check for null values; impute with a median value if any are found.
4. Calculate SRI.
5. Calculate SRI / length.
6. Calculate the sum of SRI / length.
7. Calculate NSRI = (SRI / length) / (sum of SRI / length).
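
The steps above can be sketched on toy data. This is a minimal sketch under stated assumptions: the dictionary `smt_by_protein` stands in for the per-protein intensity dictionary used in this notebook, the values are invented, and imputing with the median of the known per-protein values is only one possible reading of step 3.

```python
import pandas as pd

# Hypothetical per-protein summed intensities (protein -> SMT).
smt_by_protein = {"A": 200.0, "B": 50.0}

# Hypothetical undetected peptides; protein C has no intensity on record.
undetected = pd.DataFrame({
    "Protein": ["A", "A", "B", "C"],
    "Sequence": ["PEPK", "TIDER", "YYYR", "ZZZK"],
    "Length": [100, 100, 50, 25],
})

# Steps 1-2: map the protein-level value onto every peptide of that protein.
undetected["SMT"] = undetected["Protein"].map(smt_by_protein)

# Step 3: fill missing values with the median of the known per-protein values
# (an assumption made for this sketch).
undetected["SMT"] = undetected["SMT"].fillna(pd.Series(smt_by_protein).median())

# Step 5: intensity divided by protein length.
undetected["SMT_over_length"] = undetected["SMT"] / undetected["Length"]

# Step 6: sum over proteins, counting each protein once.
total = undetected.drop_duplicates("Protein")["SMT_over_length"].sum()

# Step 7: normalise so that the per-protein values sum to 1.
undetected["NSMT"] = undetected["SMT_over_length"] / total
```

In the notebook itself the null check returns zero for every column, so the imputation step is not triggered.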

In [147]:
MAXQuant['SMT'] = MAXQuant['Protein'].map(sum_of_intensities_dict)

In [148]:
undetected_peptides['SMT'] = undetected_peptides['Protein'].map(sum_of_intensities_dict)

In [149]:
undetected_peptides.isnull().sum() # the mapping is complete: every protein has an SMT value

Protein     0
Sequence    0
Length      0
SMT         0
dtype: int64

In [150]:
undetected_peptides.reset_index(drop=True) # displays a re-indexed copy; undetected_peptides itself keeps its original index

Unnamed: 0,Protein,Sequence,Length,SMT
0,BRD4,MSAESGPGTR,1362,150547.0
1,BRD4,NLPVMGDGLETSQMSTTQAQAQPQPANAASTNPPPPETSNPNKPK,1362,150547.0
2,BRD4,QTNQLQYLLR,1362,150547.0
3,BRD4,HQFAWPFQQPVDAVK,1362,150547.0
4,BRD4,LNLPDYYK,1362,150547.0
...,...,...,...,...
123718,ERAP2,ILYALSTSK,960,1218369.7
123719,ERAP2,TQNLAALLHAIAR,960,1218369.7
123720,ERAP2,FDLGSYDIR,960,1218369.7
123721,ERAP2,MIISGTTAHFSSK,960,1218369.7


In [151]:
print(undetected_peptides.shape)
print(MAXQuant.shape)

(123723, 4)
(37437, 16)


In [152]:
# calculate SMT divided by length
undetected_peptides["SMT_over_length"] = undetected_peptides["SMT"] / undetected_peptides["Length"] 

In [153]:
undetected_peptides.isnull().sum() # checking for null values 

Protein            0
Sequence           0
Length             0
SMT                0
SMT_over_length    0
dtype: int64

In [154]:
undetected_peptides

Unnamed: 0,Protein,Sequence,Length,SMT,SMT_over_length
161,BRD4,MSAESGPGTR,1362,150547.0,110.533774
162,BRD4,NLPVMGDGLETSQMSTTQAQAQPQPANAASTNPPPPETSNPNKPK,1362,150547.0,110.533774
163,BRD4,QTNQLQYLLR,1362,150547.0,110.533774
164,BRD4,HQFAWPFQQPVDAVK,1362,150547.0,110.533774
165,BRD4,LNLPDYYK,1362,150547.0,110.533774
...,...,...,...,...,...
556440,ERAP2,ILYALSTSK,960,1218369.7,1269.135104
556442,ERAP2,TQNLAALLHAIAR,960,1218369.7,1269.135104
556445,ERAP2,FDLGSYDIR,960,1218369.7,1269.135104
556446,ERAP2,MIISGTTAHFSSK,960,1218369.7,1269.135104


In [155]:
z = undetected_peptides.drop_duplicates(['Protein']).reset_index(drop = True )

In [156]:
# z holds one row per protein and is used to calculate the "sum of SMT / length"
z.shape

(5019, 5)

In [157]:
z["sum_SMT_over_length"] = z['SMT_over_length'].sum() # total of SMT / length over all unique proteins, stored in every row

In [158]:
sum_of_SMT_over_length_undetected_peptides = dict(zip(z.Protein , z.sum_SMT_over_length)) 

In [159]:
undetected_peptides["sum_of_SMT_over_length"] = undetected_peptides['Protein'].map(sum_of_SMT_over_length_undetected_peptides)

In [160]:
undetected_peptides

Unnamed: 0,Protein,Sequence,Length,SMT,SMT_over_length,sum_of_SMT_over_length
161,BRD4,MSAESGPGTR,1362,150547.0,110.533774,1.158510e+07
162,BRD4,NLPVMGDGLETSQMSTTQAQAQPQPANAASTNPPPPETSNPNKPK,1362,150547.0,110.533774,1.158510e+07
163,BRD4,QTNQLQYLLR,1362,150547.0,110.533774,1.158510e+07
164,BRD4,HQFAWPFQQPVDAVK,1362,150547.0,110.533774,1.158510e+07
165,BRD4,LNLPDYYK,1362,150547.0,110.533774,1.158510e+07
...,...,...,...,...,...,...
556440,ERAP2,ILYALSTSK,960,1218369.7,1269.135104,1.158510e+07
556442,ERAP2,TQNLAALLHAIAR,960,1218369.7,1269.135104,1.158510e+07
556445,ERAP2,FDLGSYDIR,960,1218369.7,1269.135104,1.158510e+07
556446,ERAP2,MIISGTTAHFSSK,960,1218369.7,1269.135104,1.158510e+07


In [161]:
undetected_peptides["NSMT"] = undetected_peptides["SMT_over_length"]/ undetected_peptides["sum_of_SMT_over_length"]

In [162]:
undetected_peptides.isnull().sum() # no discrepancies 

Protein                   0
Sequence                  0
Length                    0
SMT                       0
SMT_over_length           0
sum_of_SMT_over_length    0
NSMT                      0
dtype: int64

## Calculate Normalised SRI over Spectral Count for Undetected Peptides
1. Create a dictionary with proteins as keys and NSRI_over_SC values.
2. Map the normalised SRI divided by spectral count values onto the undetected peptides.
3. Remove duplicated sequences.
4. Remove peptides that map to more than one protein.
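
The four steps above can be sketched as follows. This is a minimal sketch with hypothetical frames and values; using a Series as the lookup instead of a dictionary is a choice made here, and `unique_only` is a name introduced for illustration.

```python
import pandas as pd

# Hypothetical detected table carrying a protein-level value.
detected = pd.DataFrame({
    "Protein": ["A", "A", "B"],
    "NSMT_over_SC": [0.6, 0.6, 0.4],
})
# Hypothetical undetected peptides; SHAREDK maps to two proteins.
undetected = pd.DataFrame({
    "Protein": ["A", "A", "A", "B"],
    "Sequence": ["PEPK", "PEPK", "SHAREDK", "SHAREDK"],
})

# Step 1: protein -> value lookup with one entry per protein.
lookup = detected.drop_duplicates("Protein").set_index("Protein")["NSMT_over_SC"]

# Step 2: map the protein-level value onto each undetected peptide.
undetected["NSMT_over_SC"] = undetected["Protein"].map(lookup)

# Step 3: remove duplicated (Protein, Sequence) rows.
undetected = undetected.drop_duplicates(["Protein", "Sequence"])

# Step 4: keep only sequences that belong to exactly one protein.
n_proteins = undetected.groupby("Sequence")["Protein"].transform("nunique")
unique_only = undetected[n_proteins == 1].reset_index(drop=True)
```

Removing shared sequences after the exact duplicates means a peptide is only discarded when it is genuinely ambiguous between proteins, not merely repeated within one.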

In [163]:
# check for any intersection between detected and undetected (should be 0)
print(len(set(MAXQuant["Sequence"]).intersection(set(undetected_peptides["Sequence"]))))

0


In [164]:
undetected_peptides

Unnamed: 0,Protein,Sequence,Length,SMT,SMT_over_length,sum_of_SMT_over_length,NSMT
161,BRD4,MSAESGPGTR,1362,150547.0,110.533774,1.158510e+07,0.00001
162,BRD4,NLPVMGDGLETSQMSTTQAQAQPQPANAASTNPPPPETSNPNKPK,1362,150547.0,110.533774,1.158510e+07,0.00001
163,BRD4,QTNQLQYLLR,1362,150547.0,110.533774,1.158510e+07,0.00001
164,BRD4,HQFAWPFQQPVDAVK,1362,150547.0,110.533774,1.158510e+07,0.00001
165,BRD4,LNLPDYYK,1362,150547.0,110.533774,1.158510e+07,0.00001
...,...,...,...,...,...,...,...
556440,ERAP2,ILYALSTSK,960,1218369.7,1269.135104,1.158510e+07,0.00011
556442,ERAP2,TQNLAALLHAIAR,960,1218369.7,1269.135104,1.158510e+07,0.00011
556445,ERAP2,FDLGSYDIR,960,1218369.7,1269.135104,1.158510e+07,0.00011
556446,ERAP2,MIISGTTAHFSSK,960,1218369.7,1269.135104,1.158510e+07,0.00011


In [165]:
MAXQuant

Unnamed: 0,Sequence,Protein,Missedcleavedpepcount,Total,SMT,Length,SMT_over_length,sum_of_SMT_over_length,NSMT,SC,SAF,sum_of_SAF_over_length,NSAF,SMT_over_SC,sum_of_SMT_over_SC,NSMT_over_SC
0,AAAAAAAATMALAAPSSPTPESPTMLTK,INCENP,0,16532.40,5.643760e+04,918.0,61.478867,1.282166e+07,0.000005,3,0.003268,179.016358,0.000018,18812.533333,3.517341e+08,0.000053
1,AAAAAAAGDSDSWDADAFSVEDPVR,EIF3J,0,44275.70,2.270293e+06,258.0,8799.583372,1.282166e+07,0.000686,25,0.096899,179.016358,0.000541,90811.700400,3.517341e+08,0.000258
2,AAAAAAALQAK,RPL4,0,13757.04,1.979939e+06,427.0,4636.859836,1.282166e+07,0.000362,16,0.037471,179.016358,0.000209,123746.196875,3.517341e+08,0.000352
3,AAAAAAGAASGLPGPVAQGLK,IPO9,0,21388.16,3.547225e+06,1041.0,3407.516733,1.282166e+07,0.000266,55,0.052834,179.016358,0.000295,64494.998527,3.517341e+08,0.000183
4,AAAAAAVGPGAGGAGSAVPGGAGPCATVSVFPGAR,CARM1,0,47559.60,8.743372e+05,608.0,1438.054539,1.282166e+07,0.000112,12,0.019737,179.016358,0.000110,72861.430000,3.517341e+08,0.000207
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37432,YYTGNYDQYVK,ABCF2,0,263272.00,1.628988e+06,623.0,2614.747496,1.282166e+07,0.000204,22,0.035313,179.016358,0.000197,74044.895000,3.517341e+08,0.000211
37433,YYTLEEIQK,CYB5A,0,188939.30,2.352099e+05,134.0,1755.297687,1.282166e+07,0.000137,4,0.029851,179.016358,0.000167,58802.472500,3.517341e+08,0.000167
37434,YYTLFGR,EPRS1,0,42473.20,1.353858e+07,1512.0,8954.086630,1.282166e+07,0.000698,160,0.105820,179.016358,0.000591,84616.118650,3.517341e+08,0.000241
37435,YYTSASGDEMVSLK,HSP90AA1,0,45627.10,1.708469e+07,732.0,23339.734952,1.282166e+07,0.001820,226,0.308743,179.016358,0.001725,75595.955686,3.517341e+08,0.000215


In [166]:
NSMT_over_SC_dict = dict(zip(MAXQuant.Protein, MAXQuant.NSMT_over_SC))  # protein -> NSMT_over_SC

In [167]:
undetected_peptides['NSMT_over_SC'] = undetected_peptides['Protein'].map(NSMT_over_SC_dict)

In [168]:
undetected_peptides.isnull().sum() # no null values after mapping NSMT_over_SC

Protein                   0
Sequence                  0
Length                    0
SMT                       0
SMT_over_length           0
sum_of_SMT_over_length    0
NSMT                      0
NSMT_over_SC              0
dtype: int64

In [169]:
undetected_peptides

Unnamed: 0,Protein,Sequence,Length,SMT,SMT_over_length,sum_of_SMT_over_length,NSMT,NSMT_over_SC
161,BRD4,MSAESGPGTR,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214
162,BRD4,NLPVMGDGLETSQMSTTQAQAQPQPANAASTNPPPPETSNPNKPK,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214
163,BRD4,QTNQLQYLLR,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214
164,BRD4,HQFAWPFQQPVDAVK,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214
165,BRD4,LNLPDYYK,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214
...,...,...,...,...,...,...,...,...
556440,ERAP2,ILYALSTSK,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266
556442,ERAP2,TQNLAALLHAIAR,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266
556445,ERAP2,FDLGSYDIR,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266
556446,ERAP2,MIISGTTAHFSSK,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266


## MAP NSAF for Undetected Peptides

In [170]:
NSAF_dict = dict(zip(MAXQuant.Protein, MAXQuant.NSAF))  # protein -> NSAF

In [171]:
undetected_peptides['NSAF'] = undetected_peptides['Protein'].map(NSAF_dict)

In [172]:
undetected_peptides.isnull().sum()

Protein                   0
Sequence                  0
Length                    0
SMT                       0
SMT_over_length           0
sum_of_SMT_over_length    0
NSMT                      0
NSMT_over_SC              0
NSAF                      0
dtype: int64

In [173]:
undetected_peptides

Unnamed: 0,Protein,Sequence,Length,SMT,SMT_over_length,sum_of_SMT_over_length,NSMT,NSMT_over_SC,NSAF
161,BRD4,MSAESGPGTR,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
162,BRD4,NLPVMGDGLETSQMSTTQAQAQPQPANAASTNPPPPETSNPNKPK,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
163,BRD4,QTNQLQYLLR,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
164,BRD4,HQFAWPFQQPVDAVK,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
165,BRD4,LNLPDYYK,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
...,...,...,...,...,...,...,...,...,...
556440,ERAP2,ILYALSTSK,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266,0.000076
556442,ERAP2,TQNLAALLHAIAR,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266,0.000076
556445,ERAP2,FDLGSYDIR,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266,0.000076
556446,ERAP2,MIISGTTAHFSSK,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266,0.000076


In [174]:
# remove any peptides in undetected_peptides that map to more than one different protein
undetected_peptides = undetected_peptides.groupby('Sequence').filter(lambda x: x['Protein'].nunique() == 1)
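The same filter can also be written without a Python lambda, which tends to be faster on frames with many distinct sequences. The sketch below uses an invented toy frame (`peptides`) and is an alternative formulation, not the code that produced the outputs in this notebook.

```python
import pandas as pd

# Toy data: "LNLK" maps to two different proteins, "QTNR" appears twice for one protein.
peptides = pd.DataFrame({
    "Protein": ["BRD4", "BRD4", "ERAP2", "BRD4", "ERAP2"],
    "Sequence": ["LNLK", "QTNR", "LNLK", "QTNR", "FDLR"],
})

# Number of distinct proteins per sequence, broadcast back onto every row.
n_proteins = peptides.groupby("Sequence")["Protein"].transform("nunique")

# Keep only sequences that belong to exactly one protein.
unique_peptides = peptides[n_proteins == 1]

# Then collapse repeated (Sequence, Protein) rows, as in the next cell.
deduplicated = (
    unique_peptides.drop_duplicates(subset=["Sequence", "Protein"], keep="first")
                   .reset_index(drop=True)
)
print(deduplicated)
```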

In [175]:
undetected_peptides

Unnamed: 0,Protein,Sequence,Length,SMT,SMT_over_length,sum_of_SMT_over_length,NSMT,NSMT_over_SC,NSAF
161,BRD4,MSAESGPGTR,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
162,BRD4,NLPVMGDGLETSQMSTTQAQAQPQPANAASTNPPPPETSNPNKPK,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
163,BRD4,QTNQLQYLLR,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
164,BRD4,HQFAWPFQQPVDAVK,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
165,BRD4,LNLPDYYK,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
...,...,...,...,...,...,...,...,...,...
556440,ERAP2,ILYALSTSK,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266,0.000076
556442,ERAP2,TQNLAALLHAIAR,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266,0.000076
556445,ERAP2,FDLGSYDIR,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266,0.000076
556446,ERAP2,MIISGTTAHFSSK,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266,0.000076


In [176]:
# drop repeated (Sequence, Protein) rows, keeping the first occurrence
undetected_peptides = undetected_peptides.drop_duplicates(
    subset=['Sequence', 'Protein'], keep='first'
).reset_index(drop=True)

In [177]:
undetected_peptides.shape # cleaned undetected peptides

(120726, 9)

### Make Data into Similar formats
1. Add a Detectability column: 1 for peptides in MAXQuant (detected) and 0 for undetected peptides
2. Reorder and/or drop columns that are not needed for training

In [178]:
MAXQuant.insert(loc=1, column='Detectability', value=1)
undetected_peptides.insert(loc=1, column='Detectability', value=0)

In [179]:
MAXQuant

Unnamed: 0,Sequence,Detectability,Protein,Missedcleavedpepcount,Total,SMT,Length,SMT_over_length,sum_of_SMT_over_length,NSMT,SC,SAF,sum_of_SAF_over_length,NSAF,SMT_over_SC,sum_of_SMT_over_SC,NSMT_over_SC
0,AAAAAAAATMALAAPSSPTPESPTMLTK,1,INCENP,0,16532.40,5.643760e+04,918.0,61.478867,1.282166e+07,0.000005,3,0.003268,179.016358,0.000018,18812.533333,3.517341e+08,0.000053
1,AAAAAAAGDSDSWDADAFSVEDPVR,1,EIF3J,0,44275.70,2.270293e+06,258.0,8799.583372,1.282166e+07,0.000686,25,0.096899,179.016358,0.000541,90811.700400,3.517341e+08,0.000258
2,AAAAAAALQAK,1,RPL4,0,13757.04,1.979939e+06,427.0,4636.859836,1.282166e+07,0.000362,16,0.037471,179.016358,0.000209,123746.196875,3.517341e+08,0.000352
3,AAAAAAGAASGLPGPVAQGLK,1,IPO9,0,21388.16,3.547225e+06,1041.0,3407.516733,1.282166e+07,0.000266,55,0.052834,179.016358,0.000295,64494.998527,3.517341e+08,0.000183
4,AAAAAAVGPGAGGAGSAVPGGAGPCATVSVFPGAR,1,CARM1,0,47559.60,8.743372e+05,608.0,1438.054539,1.282166e+07,0.000112,12,0.019737,179.016358,0.000110,72861.430000,3.517341e+08,0.000207
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37432,YYTGNYDQYVK,1,ABCF2,0,263272.00,1.628988e+06,623.0,2614.747496,1.282166e+07,0.000204,22,0.035313,179.016358,0.000197,74044.895000,3.517341e+08,0.000211
37433,YYTLEEIQK,1,CYB5A,0,188939.30,2.352099e+05,134.0,1755.297687,1.282166e+07,0.000137,4,0.029851,179.016358,0.000167,58802.472500,3.517341e+08,0.000167
37434,YYTLFGR,1,EPRS1,0,42473.20,1.353858e+07,1512.0,8954.086630,1.282166e+07,0.000698,160,0.105820,179.016358,0.000591,84616.118650,3.517341e+08,0.000241
37435,YYTSASGDEMVSLK,1,HSP90AA1,0,45627.10,1.708469e+07,732.0,23339.734952,1.282166e+07,0.001820,226,0.308743,179.016358,0.001725,75595.955686,3.517341e+08,0.000215


In [180]:
undetected_peptides

Unnamed: 0,Protein,Detectability,Sequence,Length,SMT,SMT_over_length,sum_of_SMT_over_length,NSMT,NSMT_over_SC,NSAF
0,BRD4,0,MSAESGPGTR,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
1,BRD4,0,NLPVMGDGLETSQMSTTQAQAQPQPANAASTNPPPPETSNPNKPK,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
2,BRD4,0,QTNQLQYLLR,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
3,BRD4,0,HQFAWPFQQPVDAVK,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
4,BRD4,0,LNLPDYYK,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
...,...,...,...,...,...,...,...,...,...,...
120721,ERAP2,0,ILYALSTSK,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266,0.000076
120722,ERAP2,0,TQNLAALLHAIAR,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266,0.000076
120723,ERAP2,0,FDLGSYDIR,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266,0.000076
120724,ERAP2,0,MIISGTTAHFSSK,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266,0.000076


In [181]:
# MAXQuant = MAXQuant.drop(['Length', 'Total', 'Missedcleavedpepcount', 'SC', 'sum_of_SMT_over_length', 'SMT_over_length', 'SMT_over_SC', 'sum_of_SMT_over_SC'], axis=1)

In [182]:
# undetected_peptides = undetected_peptides.drop(['Length', 'SMT_over_length', 'sum_of_SMT_over_length'], axis=1)

In [183]:
undetected_peptides

Unnamed: 0,Protein,Detectability,Sequence,Length,SMT,SMT_over_length,sum_of_SMT_over_length,NSMT,NSMT_over_SC,NSAF
0,BRD4,0,MSAESGPGTR,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
1,BRD4,0,NLPVMGDGLETSQMSTTQAQAQPQPANAASTNPPPPETSNPNKPK,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
2,BRD4,0,QTNQLQYLLR,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
3,BRD4,0,HQFAWPFQQPVDAVK,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
4,BRD4,0,LNLPDYYK,1362,150547.0,110.533774,1.158510e+07,0.00001,0.000214,0.000008
...,...,...,...,...,...,...,...,...,...,...
120721,ERAP2,0,ILYALSTSK,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266,0.000076
120722,ERAP2,0,TQNLAALLHAIAR,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266,0.000076
120723,ERAP2,0,FDLGSYDIR,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266,0.000076
120724,ERAP2,0,MIISGTTAHFSSK,960,1218369.7,1269.135104,1.158510e+07,0.00011,0.000266,0.000076


In [184]:
# MAXQuant.drop(['SAF','sum_of_SAF_over_length' ], axis =1)

In [185]:
MAXQuant

Unnamed: 0,Sequence,Detectability,Protein,Missedcleavedpepcount,Total,SMT,Length,SMT_over_length,sum_of_SMT_over_length,NSMT,SC,SAF,sum_of_SAF_over_length,NSAF,SMT_over_SC,sum_of_SMT_over_SC,NSMT_over_SC
0,AAAAAAAATMALAAPSSPTPESPTMLTK,1,INCENP,0,16532.40,5.643760e+04,918.0,61.478867,1.282166e+07,0.000005,3,0.003268,179.016358,0.000018,18812.533333,3.517341e+08,0.000053
1,AAAAAAAGDSDSWDADAFSVEDPVR,1,EIF3J,0,44275.70,2.270293e+06,258.0,8799.583372,1.282166e+07,0.000686,25,0.096899,179.016358,0.000541,90811.700400,3.517341e+08,0.000258
2,AAAAAAALQAK,1,RPL4,0,13757.04,1.979939e+06,427.0,4636.859836,1.282166e+07,0.000362,16,0.037471,179.016358,0.000209,123746.196875,3.517341e+08,0.000352
3,AAAAAAGAASGLPGPVAQGLK,1,IPO9,0,21388.16,3.547225e+06,1041.0,3407.516733,1.282166e+07,0.000266,55,0.052834,179.016358,0.000295,64494.998527,3.517341e+08,0.000183
4,AAAAAAVGPGAGGAGSAVPGGAGPCATVSVFPGAR,1,CARM1,0,47559.60,8.743372e+05,608.0,1438.054539,1.282166e+07,0.000112,12,0.019737,179.016358,0.000110,72861.430000,3.517341e+08,0.000207
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37432,YYTGNYDQYVK,1,ABCF2,0,263272.00,1.628988e+06,623.0,2614.747496,1.282166e+07,0.000204,22,0.035313,179.016358,0.000197,74044.895000,3.517341e+08,0.000211
37433,YYTLEEIQK,1,CYB5A,0,188939.30,2.352099e+05,134.0,1755.297687,1.282166e+07,0.000137,4,0.029851,179.016358,0.000167,58802.472500,3.517341e+08,0.000167
37434,YYTLFGR,1,EPRS1,0,42473.20,1.353858e+07,1512.0,8954.086630,1.282166e+07,0.000698,160,0.105820,179.016358,0.000591,84616.118650,3.517341e+08,0.000241
37435,YYTSASGDEMVSLK,1,HSP90AA1,0,45627.10,1.708469e+07,732.0,23339.734952,1.282166e+07,0.001820,226,0.308743,179.016358,0.001725,75595.955686,3.517341e+08,0.000215


In [186]:
training_data = pd.concat([MAXQuant, undetected_peptides]) # concat the dataframes to get the final training data with quantitation

In [187]:
training_data.isnull().sum() # training data excluding missed-cleaved peptides; nulls appear only in columns that undetected_peptides does not have

Sequence                       0
Detectability                  0
Protein                        0
Missedcleavedpepcount     120726
Total                     120726
SMT                            0
Length                         0
SMT_over_length                0
sum_of_SMT_over_length         0
NSMT                           0
SC                        120726
SAF                       120726
sum_of_SAF_over_length    120726
NSAF                           0
SMT_over_SC               120726
sum_of_SMT_over_SC        120726
NSMT_over_SC                   0
dtype: int64
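The null counts of 120726 equal the number of rows in undetected_peptides: `pd.concat` fills columns that exist in only one frame with NaN. The toy sketch below (invented frames and values) shows this behaviour, and also shows `ignore_index=True`, an optional argument that gives the combined frame a fresh 0..n-1 index instead of repeating the index labels of both inputs.

```python
import pandas as pd

# Toy stand-ins: "SC" exists only in the detected frame.
detected = pd.DataFrame({
    "Sequence": ["AAAK", "LLGR"],
    "Detectability": 1,
    "SC": [3, 25],
    "NSAF": [0.1, 0.2],
})
undetected = pd.DataFrame({
    "Sequence": ["MSAR"],
    "Detectability": 0,
    "NSAF": [0.3],
})

# Columns missing from one frame are filled with NaN for that frame's rows.
combined = pd.concat([detected, undetected], ignore_index=True)
print(combined.isnull().sum())

# Restricting to the shared columns leaves no nulls.
shared = [c for c in detected.columns if c in undetected.columns]
print(combined[shared].isnull().sum().sum())
```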

In [188]:
training_data

Unnamed: 0,Sequence,Detectability,Protein,Missedcleavedpepcount,Total,SMT,Length,SMT_over_length,sum_of_SMT_over_length,NSMT,SC,SAF,sum_of_SAF_over_length,NSAF,SMT_over_SC,sum_of_SMT_over_SC,NSMT_over_SC
0,AAAAAAAATMALAAPSSPTPESPTMLTK,1,INCENP,0.0,16532.40,56437.600,918.0,61.478867,1.282166e+07,0.000005,3.0,0.003268,179.016358,0.000018,18812.533333,3.517341e+08,0.000053
1,AAAAAAAGDSDSWDADAFSVEDPVR,1,EIF3J,0.0,44275.70,2270292.510,258.0,8799.583372,1.282166e+07,0.000686,25.0,0.096899,179.016358,0.000541,90811.700400,3.517341e+08,0.000258
2,AAAAAAALQAK,1,RPL4,0.0,13757.04,1979939.150,427.0,4636.859836,1.282166e+07,0.000362,16.0,0.037471,179.016358,0.000209,123746.196875,3.517341e+08,0.000352
3,AAAAAAGAASGLPGPVAQGLK,1,IPO9,0.0,21388.16,3547224.919,1041.0,3407.516733,1.282166e+07,0.000266,55.0,0.052834,179.016358,0.000295,64494.998527,3.517341e+08,0.000183
4,AAAAAAVGPGAGGAGSAVPGGAGPCATVSVFPGAR,1,CARM1,0.0,47559.60,874337.160,608.0,1438.054539,1.282166e+07,0.000112,12.0,0.019737,179.016358,0.000110,72861.430000,3.517341e+08,0.000207
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
120721,ILYALSTSK,0,ERAP2,,,1218369.700,960.0,1269.135104,1.158510e+07,0.000110,,,,0.000076,,,0.000266
120722,TQNLAALLHAIAR,0,ERAP2,,,1218369.700,960.0,1269.135104,1.158510e+07,0.000110,,,,0.000076,,,0.000266
120723,FDLGSYDIR,0,ERAP2,,,1218369.700,960.0,1269.135104,1.158510e+07,0.000110,,,,0.000076,,,0.000266
120724,MIISGTTAHFSSK,0,ERAP2,,,1218369.700,960.0,1269.135104,1.158510e+07,0.000110,,,,0.000076,,,0.000266


In [189]:
training_data=training_data[['Sequence','Detectability', 'SMT', 'NSMT', 'NSAF','NSMT_over_SC', 'Protein']]

In [190]:
training_data

Unnamed: 0,Sequence,Detectability,SMT,NSMT,NSAF,NSMT_over_SC,Protein
0,AAAAAAAATMALAAPSSPTPESPTMLTK,1,56437.600,0.000005,0.000018,0.000053,INCENP
1,AAAAAAAGDSDSWDADAFSVEDPVR,1,2270292.510,0.000686,0.000541,0.000258,EIF3J
2,AAAAAAALQAK,1,1979939.150,0.000362,0.000209,0.000352,RPL4
3,AAAAAAGAASGLPGPVAQGLK,1,3547224.919,0.000266,0.000295,0.000183,IPO9
4,AAAAAAVGPGAGGAGSAVPGGAGPCATVSVFPGAR,1,874337.160,0.000112,0.000110,0.000207,CARM1
...,...,...,...,...,...,...,...
120721,ILYALSTSK,0,1218369.700,0.000110,0.000076,0.000266,ERAP2
120722,TQNLAALLHAIAR,0,1218369.700,0.000110,0.000076,0.000266,ERAP2
120723,FDLGSYDIR,0,1218369.700,0.000110,0.000076,0.000266,ERAP2
120724,MIISGTTAHFSSK,0,1218369.700,0.000110,0.000076,0.000266,ERAP2


# Exporting Training Data 

In [189]:
training_data.to_csv('training_data.csv', index = False) # exporting the data 
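A quick way to confirm an export like this is to read the file back and compare its layout and label counts. The sketch below is an optional check on a made-up toy frame written to a temporary directory, so it does not touch the real training_data.csv.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a few of the training-data columns (values are invented).
toy = pd.DataFrame({
    "Sequence": ["AAAK", "LLGR", "MSAR"],
    "Detectability": [1, 1, 0],
    "Protein": ["RPL4", "RPL4", "BRD4"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "training_data.csv")
    toy.to_csv(path, index=False)  # same call pattern as the export cell
    reloaded = pd.read_csv(path)

# The reloaded file should have the same shape and column order.
same_layout = reloaded.shape == toy.shape and list(reloaded.columns) == list(toy.columns)

# Count of detected (1) versus undetected (0) peptides.
label_counts = reloaded["Detectability"].value_counts().to_dict()
print(same_layout, label_counts)
```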

###########

In [199]:
undetected_peptides.isnull().sum()

Protein                   0
Detectability             0
Sequence                  0
Length                    0
SMT                       0
SMT_over_length           0
sum_of_SMT_over_length    0
NSMT                      0
NSMT_over_SC              0
NSAF                      0
Missedcleavedpepcount     0
dtype: int64

In [200]:
MAXQuant.isnull().sum()

Sequence                  0
Detectability             0
Protein                   0
Missedcleavedpepcount     0
Total                     0
SMT                       0
Length                    0
SMT_over_length           0
sum_of_SMT_over_length    0
NSMT                      0
SC                        0
SAF                       0
sum_of_SAF_over_length    0
NSAF                      0
SMT_over_SC               0
sum_of_SMT_over_SC        0
NSMT_over_SC              0
dtype: int64