# Sequence database searching

1. Combine the E. coli FASTA database with the YPIC challenge protein.
2. Search the consensus spectra against the combined protein sequence database.

Search settings:

- Trypsin cleavage (default), maximum 2 missed cleavages.
- Open search: 300 Da precursor mass window.
- Report only the top-ranked match for each spectrum and combine target and decoy matches in a single output file.
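
Note that appending to the FASTA file with `>>` adds the record again every time the cell is re-run. The following is a minimal sketch of an idempotent alternative for step 1; the helper name `append_fasta_record` is hypothetical and not part of the original workflow.

```python
def append_fasta_record(path, header, sequence):
    """Append a FASTA record unless a record with this header already exists.

    Returns True if the record was appended, False if it was already present.
    """
    with open(path) as f:
        content = f.read()
    # Compare against the first token of each existing header line.
    for line in content.splitlines():
        if line.startswith('>') and line.split()[0] == f'>{header}':
            return False
    with open(path, 'a') as f:
        # Make sure the new header starts on its own line.
        if content and not content.endswith('\n'):
            f.write('\n')
        f.write(f'>{header}\n{sequence}\n')
    return True
```

Calling `append_fasta_record(fasta_path, 'ypic', ypic_sequence)` with the database path and the challenge protein sequence would then be safe to repeat; `fasta_path` and `ypic_sequence` are placeholder names.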

In [None]:
! echo ">ypic" >> ../data/processed/20180427_lpino_qehf2_ypic_trypsin_crux/uniprot-proteome%3AUP000000625.fasta
! echo "MAGRHAVEYKKEVERWKNDEREDWHATTHEMKSTFKNDAMENTALLIMITATIKNSINLIFEAREISTHEREASTRKCTKRETKRESPECTWHENITCKMESTKWHATYKKCANPRKDKCEINACELLLAAALEHHHHHH" >> ../data/processed/20180427_lpino_qehf2_ypic_trypsin_crux/uniprot-proteome%3AUP000000625.fasta

In [None]:
! ../bin/crux-3.2.Linux.x86_64/bin/crux tide-search \
    --mods-spec 3M+15.994915 \
    --missed-cleavages 2 \
    --precursor-window 300 \
    --precursor-window-type mass \
    --concat T \
    --output-dir ../data/processed/20180427_lpino_qehf2_ypic_trypsin_crux \
    --overwrite T \
    --top-match 1 \
    ../data/processed/20180427_lpino_qehf2_ypic_trypsin_maracluster/MaRaCluster.consensus_p5.filtered.mgf \
    ../data/processed/20180427_lpino_qehf2_ypic_trypsin_crux/uniprot-proteome%3AUP000000625.fasta

# FDR filtering

Use the search-all-assess-subset strategy to filter the YPIC PSMs on FDR (FDR threshold = 1 %).
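
To illustrate the idea: the spectra are searched against the full database, but the FDR is estimated only from the PSMs of interest, here YPIC targets and YPIC decoys. The sketch below is a plain-Python illustration using a simple decoys/targets estimate. It is not the pyteomics implementation used in this notebook, and the function name and the toy PSMs are made up for the example.

```python
def subset_fdr_filter(psms, subset_tag, fdr=0.01, decoy_prefix='decoy_'):
    """Return the target PSMs of a protein subset that pass the FDR threshold."""
    # Keep only PSMs (targets and decoys) that belong to the subset.
    subset = [p for p in psms if subset_tag in p['protein id']]
    subset.sort(key=lambda p: p['xcorr score'], reverse=True)

    # Running FDR estimate at each rank: #decoys / #targets.
    fdrs, n_target, n_decoy = [], 0, 0
    for p in subset:
        if decoy_prefix in p['protein id']:
            n_decoy += 1
        else:
            n_target += 1
        fdrs.append(n_decoy / max(n_target, 1))

    # q-value: the lowest FDR at this rank or any lower-scoring rank.
    q_values = fdrs[:]
    for i in range(len(q_values) - 2, -1, -1):
        q_values[i] = min(q_values[i], q_values[i + 1])

    return [p for p, q in zip(subset, q_values)
            if q <= fdr and decoy_prefix not in p['protein id']]


# Toy PSMs (invented values, for illustration only).
toy_psms = [
    {'protein id': 'ypic', 'xcorr score': 5.0},
    {'protein id': 'ypic', 'xcorr score': 4.0},
    {'protein id': 'ypic', 'xcorr score': 3.0},
    {'protein id': 'decoy_ypic', 'xcorr score': 2.0},
    {'protein id': 'ypic', 'xcorr score': 1.0},
    {'protein id': 'sp|P0A7G6|RECA_ECOLI', 'xcorr score': 6.0},
]
accepted = subset_fdr_filter(toy_psms, 'ypic', fdr=0.01)
print(len(accepted))  # 3
```

In the toy data the E. coli PSM is ignored entirely, and the decoy at rank four caps the accepted list at the three highest-scoring YPIC targets.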

In [1]:
import operator
import os

import pandas as pd
import pyteomics.auxiliary
import pyteomics.mass.unimod

In [2]:
psms = pd.read_csv(os.path.join(
    '../data/processed/20180427_lpino_qehf2_ypic_trypsin_crux',
    'tide-search.txt'), sep='\t')

In [3]:
psms['mass diff'] = psms['spectrum neutral mass'] - psms['peptide mass']

In [4]:
psms_ypic = psms[psms['protein id'].str.contains('ypic')]
psms_ecoli = psms[psms['protein id'].str.contains('ECOLI')]
print(f'Matches to the YPIC protein:  {len(psms_ypic)}')
print(f'Matches to E. coli proteins: {len(psms_ecoli)}')

Matches to the YPIC protein:  52
Matches to E. coli proteins: 327


In [5]:
fdr = 0.01

In [6]:
filter_kwargs = {'fdr': fdr, 'key': 'xcorr score', 'reverse': True,
                 'is_decoy': lambda psm: 'decoy_' in psm['protein id'],
                 'formula': 1, 'correction': 0}
psms_ypic_filtered = pyteomics.auxiliary.filter(psms_ypic, **filter_kwargs)
psms_ecoli_filtered = pyteomics.auxiliary.filter(psms_ecoli, **filter_kwargs)

In [7]:
print(f'Matches to the YPIC protein:  {len(psms_ypic_filtered)} '
      f'({fdr:.0%} FDR)')

Matches to the YPIC protein:  52 (1% FDR)


# Modifications

Cross-reference the mass differences from the open search against Unimod to determine which modifications are likely present.

In [8]:
psms_ypic_modified = (
    psms_ypic_filtered[['original target sequence', 'spectrum neutral mass',
                        'peptide mass', 'mass diff']]
    [psms_ypic_filtered['mass diff'].abs() > 0.1].sort_values('mass diff'))

In [9]:
unimod = pyteomics.mass.unimod.Unimod()

In [10]:
# Maximum deviation between observed mass difference and modification mass.
md_tol_ppm = 20
mod_names, mod_ppms = [], []
for i, psm in psms_ypic_modified.iterrows():
    # Collect all Unimod entries whose mass lies within the tolerance.
    potential_mods = []
    for mod in unimod.mods:
        # Deviation in ppm, relative to the precursor neutral mass.
        md_ppm = abs((psm['mass diff'] - mod.monoisotopic_mass)
                     / psm['spectrum neutral mass'] * 10**6)
        if md_ppm <= md_tol_ppm:
            potential_mods.append((mod.full_name, md_ppm))
    if len(potential_mods) > 0:
        # Report the candidates ordered from best to worst mass agreement.
        psm_mod_names, psm_mod_ppms = [], []
        for mod_name, mod_ppm in sorted(potential_mods,
                                        key=operator.itemgetter(1)):
            psm_mod_names.append(mod_name)
            psm_mod_ppms.append(f'{mod_ppm:.2f}')
        mod_names.append(' / '.join(psm_mod_names))
        mod_ppms.append(' / '.join(psm_mod_ppms))
    else:
        mod_names.append('No direct match found in Unimod')
        mod_ppms.append('')
psms_ypic_modified['mod'] = mod_names
psms_ypic_modified['mod ppm'] = mod_ppms

In [11]:
psms_ypic_modified

| | original target sequence | spectrum neutral mass | peptide mass | mass diff | mod | mod ppm |
|---|---|---|---|---|---|---|
| 303 | ESPECTWHENITCK | 1771.7307 | 1789.7401 | -18.0094 | Dehydration / Pyro-glu from E / Loss of ammoni... | 0.66 / 0.66 / 8.01 / 9.27 / 9.37 / 19.88 |
| 371 | CEINACELLLAAALEHHHHHH | 2494.1407 | 2511.1648 | -17.0241 | Pyro-glu from Q / Loss of ammonia / Met->Asn s... | 0.98 / 0.98 / 10.64 / 13.61 |
| 255 | MESTKWHATYKK | 1508.7467 | 1524.7395 | -15.9928 | reduction / Ser->Ala substitution / Tyr->Phe s... | 1.40 / 1.40 / 1.40 / 10.37 / 13.75 |
| 282 | ESPECTWHENITCK | 1780.7047 | 1789.7401 | -9.0354 | Arg->Phe substitution / His->Gln substitution | 1.52 / 19.69 |
| 327 | RESPECTWHENITCK | 1946.8277 | 1945.8413 | 0.9864 | Deamidation / Asn->Asp substitution / Gln->Glu... | 1.22 / 1.22 / 1.22 / 5.46 / 14.14 / 19.91 |
| 273 | ESPECTWHENITCK | 1790.7268 | 1789.7401 | 0.9867 | Deamidation / Asn->Asp substitution / Gln->Glu... | 1.50 / 1.50 / 1.50 / 5.77 / 15.55 |
| 290 | ESPECTWHENITCK | 1790.7268 | 1789.7401 | 0.9867 | Deamidation / Asn->Asp substitution / Gln->Glu... | 1.50 / 1.50 / 1.50 / 5.77 / 15.55 |
| 300 | ESPECTWHENITCK | 1790.7268 | 1789.7401 | 0.9867 | Deamidation / Asn->Asp substitution / Gln->Glu... | 1.50 / 1.50 / 1.50 / 5.77 / 15.55 |
| 372 | CEINACELLLAAALEHHHHHH | 2512.1517 | 2511.1648 | 0.9869 | Deamidation / Asn->Asp substitution / Gln->Glu... | 1.15 / 1.15 / 1.15 / 4.03 / 11.16 / 15.63 |
| 314 | RESPECTWHENITCK | 1946.8288 | 1945.8413 | 0.9875 | Deamidation / Asn->Asp substitution / Gln->Glu... | 1.79 / 1.79 / 1.79 / 4.90 / 14.71 |
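
As a sanity check, the best-scoring ppm value of a row can be reproduced by hand from the rounded values shown in the table. The sketch below hard-codes the Unimod monoisotopic masses of deamidation (+0.984016 Da) and dehydration (-18.010565 Da) instead of loading them from Unimod; the helper name `ppm_deviation` is hypothetical.

```python
def ppm_deviation(mass_diff, mod_mass, precursor_mass):
    """Deviation between an observed mass difference and a modification mass,
    in ppm relative to the precursor neutral mass."""
    return abs(mass_diff - mod_mass) / precursor_mass * 1e6


# Monoisotopic delta masses (assumed constants, taken from Unimod).
DEAMIDATION = 0.984016
DEHYDRATION = -18.010565

# ESPECTWHENITCK, mass diff 0.9867 Da, spectrum neutral mass 1790.7268 Da.
print(f'{ppm_deviation(0.9867, DEAMIDATION, 1790.7268):.2f}')    # 1.50
# ESPECTWHENITCK, mass diff -18.0094 Da, spectrum neutral mass 1771.7307 Da.
print(f'{ppm_deviation(-18.0094, DEHYDRATION, 1771.7307):.2f}')  # 0.66
```

Both results agree with the first `mod ppm` entry of the corresponding rows.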
