**mzTab**

_General_

mzTab is meant to be a lightweight, tab-delimited file format for proteomics data. Its primary target audience is researchers outside of proteomics. It should be easy to parse and contain only the minimal information required to evaluate the results of a proteomics experiment. The aim of the format is to present those results in a computationally accessible overview, not to provide the detailed evidence for them or to allow the process that led to them to be recreated. Both of these functions are served by links to more detailed representations in other formats, in particular mzIdentML and mzQuantML [Ref1](https://code.google.com/archive/p/mztab/). mzTab can therefore be used on its own or alongside those formats [Ref2](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4189001/).

**Warning**: Although mzTab can be used to report a detailed view of the data, it explicitly does not aim to capture the whole complexity and evidence trail of a proteomics study. Even the most complex mzTab files still include simplifications and assumptions about the experimental results. This is the case, for instance, for identification results (e.g. protein inference/grouping is only supported to a limited extent) and quantification results (e.g. the coordinates for isotope patterns in quantified two-dimensional “features” cannot be fully reported). This missing information can be reported using the existing PSI standard formats mzIdentML and mzQuantML.

_File content_

Sections (identified by the three-letter code that starts each line):
- MTD: metadata. This section was deliberately kept flexible, and most fields are optional, so producers can report different levels of experimental annotation, ranging from basic to complete.
- PRH: protein header
- PRT: protein identifications
- PEH: peptide header
- PEP: peptide identifications. This section is used to report aggregated quantification data based on several PSMs.
- PSH: peptide-spectrum match header
- PSM: peptide-spectrum matches. This section indicates whether the peptides were unambiguously assigned to a given protein.
- SMH: small molecule header
- SML: small molecule identifications
- COM: comments
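
To make the section codes concrete, here is a minimal sketch that groups the lines of an mzTab file by their three-letter prefix using only the standard library. The `SAMPLE` text and the `split_sections` helper are hypothetical illustrations, not part of pyteomics, and the sketch assumes every line is tab-separated with the section code in the first field.

```python
from collections import defaultdict

# Hypothetical in-memory mzTab fragment (real files are much larger).
SAMPLE = "\n".join([
    "COM\tExample file",
    "MTD\tmzTab-version\t1.0.0",
    "MTD\tmzTab-type\tIdentification",
    "PRH\taccession\tdescription",
    "PRT\tA0A068U1Z5_COFCA\tEukaryotic translation initiation factor 3",
    "PSH\tsequence\tPSM_ID\taccession",
    "PSM\tVFGPHQWEILR\t1\tA0A068U1Z5_COFCA",
])


def split_sections(text):
    """Group the tab-separated fields of each line by its section code."""
    sections = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between sections
        prefix, *fields = line.split("\t")
        sections[prefix].append(fields)
    return dict(sections)


sections = split_sections(SAMPLE)

# Metadata lines are key/value pairs, so they map naturally onto a dict.
metadata = {fields[0]: fields[1] for fields in sections["MTD"]}
print(metadata["mzTab-type"])
```

In practice `pyteomics.mztab.MzTab` does this work and returns the tables as DataFrames; the sketch only shows why the prefixes make the format easy to parse.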


<!-- ![Fig1](/home/tiago/documents/lncRNA/Study/Fig1_mztabe_content.jpg) -->


In [2]:
from pyteomics import mztab

### 1. Coffee

In [3]:
coffie = mztab.MzTab("/home/tiago/documents/lncRNA/coffie/generated/2experimentos.pride.mztab")

In [4]:
coffie.keys()

odict_keys(['PRT', 'PEP', 'PSM', 'SML'])

In [5]:
coffie.metadata["mzTab-type"]

'Identification'

In [6]:
coffie["PRT"].head()

accession (index),accession,description,taxid,species,database,database_version,search_engine,best_search_engine_score[1],search_engine_score[1]_ms_run[1],search_engine_score[1]_ms_run[2],...,num_peptides_unique_ms_run[15],num_peptides_unique_ms_run[16],num_peptides_unique_ms_run[17],num_peptides_unique_ms_run[18],num_peptides_unique_ms_run[19],num_peptides_unique_ms_run[20],num_peptides_unique_ms_run[21],ambiguity_members,modifications,protein_coverage
A0A068U1Z5_COFCA,A0A068U1Z5_COFCA,Eukaryotic translation initiation factor 3 sub...,,,uniprot-taxonomy%3Acoffee.fasta,,"[Mascot, X!Tandem, Scaffold]",,,,...,,,,,,,,,,
A0A068TTF0_COFCA,A0A068TTF0_COFCA,"Coffea canephora DH200=94 genomic scaffold, sc...",,,uniprot-taxonomy%3Acoffee.fasta,,"[Mascot, X!Tandem, Scaffold]",,,,...,,,,,,,,,464-UNIMOD:4,
A0A068U9U8_COFCA,A0A068U9U8_COFCA,"Coffea canephora DH200=94 genomic scaffold, sc...",,,uniprot-taxonomy%3Acoffee.fasta,,"[Mascot, X!Tandem, Scaffold]",,,,...,,,,,,,,,,
A0A068URX2_COFCA,A0A068URX2_COFCA,ATP-dependent Clp protease proteolytic subunit...,,,uniprot-taxonomy%3Acoffee.fasta,,"[Mascot, X!Tandem, Scaffold]",,,,...,,,,,,,,,,
A0A068TYH5_COFCA,A0A068TYH5_COFCA,"Coffea canephora DH200=94 genomic scaffold, sc...",,,uniprot-taxonomy%3Acoffee.fasta,,"[Mascot, X!Tandem, Scaffold]",,,,...,,,,,,,,,,


In [7]:
coffie["PSM"]

PSM_ID (index),sequence,PSM_ID,accession,unique,database,database_version,search_engine,search_engine_score[1],search_engine_score[2],search_engine_score[3],...,exp_mass_to_charge,calc_mass_to_charge,spectra_ref,pre,post,start,end,opt_global_mzidentml_original_ID,opt_global_cv_MS:1002217_decoy_peptide,opt_global_cv_PRIDE:0000091_rank
1,VFGPHQWEILR,1,A0A068U1Z5_COFCA,,uniprot-taxonomy%3Acoffee.fasta,,"[Mascot, X!Tandem, Scaffold]",20.95,25.0,0.996025,...,461.25412,461.250733,ms_run[3]:index=262,,,364,374,Spec_5760_VFGPHQWEILR,0,1
2,YLEDKTSVPYEPVYSDEQAR,2,A0A068TTF0_COFCA,,uniprot-taxonomy%3Acoffee.fasta,,"[Mascot, X!Tandem, Scaffold]",35.16,25.0,0.999713,...,797.04852,797.044699,ms_run[8]:index=1954,,,312,331,Spec_57572_YLEDKTSVPYEPVYSDEQAR,0,1
3,YLEDKTSVPYEPVYSDEQAR,3,A0A068TTF0_COFCA,,uniprot-taxonomy%3Acoffee.fasta,,"[Mascot, X!Tandem, Scaffold]",35.26,25.0,0.999544,...,797.04919,797.044699,ms_run[20]:index=1775,,,312,331,Spec_25966_YLEDKTSVPYEPVYSDEQAR,0,1
4,YLEDKTSVPYEPVYSDEQAR,4,A0A068TTF0_COFCA,,uniprot-taxonomy%3Acoffee.fasta,,"[Mascot, X!Tandem, Scaffold]",19.21,25.0,0.999363,...,797.05011,797.044699,ms_run[17]:index=1996,,,312,331,Spec_35544_YLEDKTSVPYEPVYSDEQAR,0,1
5,YLEDKTSVPYEPVYSDEQAR,5,A0A068TTF0_COFCA,,uniprot-taxonomy%3Acoffee.fasta,,"[Mascot, X!Tandem, Scaffold]",23.32,25.0,0.999418,...,797.04919,797.044699,ms_run[15]:index=1870,,,312,331,Spec_44982_YLEDKTSVPYEPVYSDEQAR,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16925,LFILDYHDMLLPFIEGMNSLPGR,16925,A0A068UYE2_COFCA,,uniprot-taxonomy%3Acoffee.fasta,,"[Mascot, X!Tandem, Scaffold]",35.74,25.0,0.999683,...,897.79639,897.794099,ms_run[2]:index=654,,,504,526,Spec_67915_LFILDYHDMLLPFIEGMNSLPGR,0,1
16926,IVNKWNTALIGLMTYFR,16926,A0A068VF06_COFCA,,uniprot-taxonomy%3Acoffee.fasta,,"[Mascot, X!Tandem, Scaffold]",44.30,25.0,0.999500,...,680.71399,680.708199,ms_run[2]:index=308,,,1298,1314,Spec_67515_IVNKWNTALIGLMTYFR,0,1
16927,IVNKWNTALIGLMTYFR,16927,A0A068VF06_COFCA,,uniprot-taxonomy%3Acoffee.fasta,,"[Mascot, X!Tandem, Scaffold]",23.94,25.0,0.997619,...,680.71197,680.708199,ms_run[4]:index=226,,,1298,1314,Spec_64756_IVNKWNTALIGLMTYFR,0,1
16926,IVNKWNTALIGLMTYFR,16926,A0A068V8A1_COFCA,,uniprot-taxonomy%3Acoffee.fasta,,"[Mascot, X!Tandem, Scaffold]",44.30,25.0,0.999500,...,680.71399,680.708199,ms_run[2]:index=308,,,1299,1315,Spec_67515_IVNKWNTALIGLMTYFR,0,1
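
The `spectra_ref` column ties each PSM back to a spectrum in one of the `ms_run` files. As a rough sketch, the values shown above can be split into a run number and a spectrum locator with a regular expression. The `parse_spectra_ref` helper is hypothetical, and it assumes each value holds a single reference of the form `ms_run[N]:locator`, as in the rows displayed here.

```python
import re
from collections import Counter

# Matches values such as "ms_run[3]:index=262".
SPECTRA_REF = re.compile(r"^ms_run\[(\d+)\]:(.+)$")


def parse_spectra_ref(ref):
    """Return (ms_run number, spectrum locator) for one spectra_ref value."""
    match = SPECTRA_REF.match(ref)
    if match is None:
        raise ValueError(f"unexpected spectra_ref: {ref!r}")
    return int(match.group(1)), match.group(2)


# spectra_ref values copied from the PSM rows displayed above.
refs = [
    "ms_run[3]:index=262",
    "ms_run[8]:index=1954",
    "ms_run[20]:index=1775",
    "ms_run[17]:index=1996",
    "ms_run[15]:index=1870",
    "ms_run[2]:index=654",
    "ms_run[2]:index=308",
]

# Count how many of the listed PSMs come from each run.
psms_per_run = Counter(parse_spectra_ref(ref)[0] for ref in refs)
print(psms_per_run.most_common(3))
```

The same helper could be applied to the whole column, for example with `coffie["PSM"]["spectra_ref"].map(parse_spectra_ref)`, provided the single-reference assumption holds for every row.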


In [8]:
import urllib.parse
import urllib.request
import uniprot


In [29]:
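# Map a UniProt entry name to its EMBL identifier through the legacy UniProt
# "uploadlists" ID-mapping endpoint.
url = 'https://www.uniprot.org/uploadlists/'
params = {
    'from': 'ACC+ID',    # source identifier type: UniProtKB accession or ID
    'to': 'EMBL_ID',     # target identifier type: EMBL
    'format': 'tab',     # ask for a tab-separated response
    'query': 'A0A068TTF0_COFCA'
}

# Send the URL-encoded form as a POST request and print the decoded response.
data = urllib.parse.urlencode(params).encode('utf-8')
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as f:
    response = f.read()
print(response.decode('utf-8'))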

From	To
A0A068TTF0_COFCA	HG739087
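
The tab-separated response is easier to reuse once it is turned into a dictionary. The sketch below parses text shaped like the output above; `parse_mapping` is a hypothetical helper, and it assumes a two-column response with a `From`/`To` header row. Each source identifier maps to a list, since one entry could map to several targets.

```python
import csv
import io


def parse_mapping(text):
    """Parse a two-column, tab-separated ID mapping into {source: [targets]}."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader, None)  # skip the "From<TAB>To" header row
    mapping = {}
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        mapping.setdefault(row[0], []).append(row[1])
    return mapping


# Same content as the response printed above.
sample_response = "From\tTo\nA0A068TTF0_COFCA\tHG739087\n\n"
mapping = parse_mapping(sample_response)
print(mapping)
```

With the request cell above, `parse_mapping(response.decode('utf-8'))` would give the same structure for the live response.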



### 2. Arabidopsis

In [25]:
! gzip -df /home/tiagoborelli/Documentos/lncRNA/arabidopsis/generated/*.gz

In [44]:
import glob
arab_mztfiles = glob.glob("/home/tiagoborelli/Documentos/lncRNA/arabidopsis/generated/*.mztab")
arab_mztfiles

['/home/tiagoborelli/Documentos/lncRNA/arabidopsis/generated/20140908_MN_4 (F016258).pride.mztab',
 '/home/tiagoborelli/Documentos/lncRNA/arabidopsis/generated/MN_20130904_ELF4_2 (F016254).pride.mztab',
 '/home/tiagoborelli/Documentos/lncRNA/arabidopsis/generated/20140908_MN_2 (F016259).pride.mztab',
 '/home/tiagoborelli/Documentos/lncRNA/arabidopsis/generated/MN_20130904_Col (F016255).pride.mztab',
 '/home/tiagoborelli/Documentos/lncRNA/arabidopsis/generated/20131008_MN_phy8_9 (F016252).pride.mztab',
 '/home/tiagoborelli/Documentos/lncRNA/arabidopsis/generated/20131008_MN_elf3_2 (F016253).pride.mztab',
 '/home/tiagoborelli/Documentos/lncRNA/arabidopsis/generated/20140908_MN_10 (F016264).pride.mztab',
 '/home/tiagoborelli/Documentos/lncRNA/arabidopsis/generated/20140908_MN_16 (F016260).pride.mztab',
 '/home/tiagoborelli/Documentos/lncRNA/arabidopsis/generated/20140908_MN_9 (F016256).pride.mztab']

In [48]:
# Print the mzTab type declared in the metadata of each file.
for path in arab_mztfiles:
    arab = mztab.MzTab(path)
    print(arab.metadata["mzTab-type"])

Identification
Identification
Identification
Identification
Identification
Identification
Identification
Identification
Identification


In [53]:
table = mztab.MzTab('/home/tiagoborelli/Documentos/lncRNA/arabidopsis/generated/20140908_MN_9 (F016256).pride.mztab')

In [63]:
table["PSM"].columns

Index(['sequence', 'PSM_ID', 'accession', 'unique', 'database',
       'database_version', 'search_engine', 'search_engine_score[1]',
       'search_engine_score[2]', 'search_engine_score[3]', 'modifications',
       'retention_time', 'charge', 'exp_mass_to_charge', 'calc_mass_to_charge',
       'spectra_ref', 'pre', 'post', 'start', 'end',
       'opt_global_mzidentml_original_ID',
       'opt_global_cv_MS:1002217_decoy_peptide',
       'opt_global_cv_PRIDE:0000091_rank'],
      dtype='object')

In [64]:
table["PRT"].columns

Index(['accession', 'description', 'taxid', 'species', 'database',
       'database_version', 'search_engine', 'best_search_engine_score[1]',
       'search_engine_score[1]_ms_run[1]', 'num_psms_ms_run[1]',
       'num_peptides_distinct_ms_run[1]', 'num_peptides_unique_ms_run[1]',
       'ambiguity_members', 'modifications', 'protein_coverage'],
      dtype='object')

### 3. Soybean

In [71]:
! gzip -df /home/tiagoborelli/Documentos/lncRNA/soybean/generated/*.gz

In [70]:
import glob
soy_mztfiles = glob.glob("/home/tiagoborelli/Documentos/lncRNA/soybean/generated/*.mztab")
soy_mztfiles

[]
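
The empty list may simply reflect cell order: the prompt numbers suggest the glob cell (In [70]) ran before the gzip cell (In [71]), so no `.mztab` files existed yet. One way to avoid depending on execution order is to decompress and list in a single step. The sketch below is an assumption about how that could look, using only the standard library; `decompress_and_list` is a hypothetical helper and, unlike `gzip -df`, it keeps the original archives and does not overwrite existing files. The demo runs on a temporary directory rather than the real data folder.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def decompress_and_list(directory, suffix=".mztab"):
    """Decompress every *.gz in `directory`, then return the matching paths."""
    directory = Path(directory)
    for archive in sorted(directory.glob("*.gz")):
        target = archive.with_suffix("")  # drops only the trailing ".gz"
        if not target.exists():
            with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
    return sorted(str(path) for path in directory.glob(f"*{suffix}"))


# Demo on a throwaway directory holding one compressed, made-up file.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "example.pride.mztab.gz"
    with gzip.open(archive, "wb") as handle:
        handle.write(b"MTD\tmzTab-type\tIdentification\n")
    found = decompress_and_list(tmp)
    print(len(found))
```

Calling the helper on the soybean `generated` directory would then return the list that `soy_mztfiles` was meant to hold, assuming the archives are present there.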

In [None]:
# Print the mzTab type declared in the metadata of each soybean file.
for path in soy_mztfiles:
    soy = mztab.MzTab(path)
    print(soy.metadata["mzTab-type"])