# Convert a tabular peptide dataset into its corresponding proteins
In this example, we:
 1. load a tabular dataset of peptide intensities,
 2. aggregate it to protein level as the number of peptides per protein,
 3. aggregate it to protein level as the sum of peptide intensities, and
 4. export both results as Pandas data frames and join them for further processing.

First, load the required libraries.

In [2]:
import pandas as pd
import omicspylib as opl
from omicspylib import PeptidesDataset
print(f'omicspylib version: {opl.__version__}')

omicspylib version: 0.0.7


Then prepare your data as a Pandas data frame. You need to specify the column containing the peptide identifier (`peptide_id` in this example), the column containing the protein identifier used for the group-by operation (`protein_id` in this example), and the column names of every experimental condition, as shown below.

You are expected to perform any cleaning your use case requires beforehand (e.g. removal of reverse hits, contaminants, modified peptides, or peptides shared across proteins).
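
As an illustration of the cleaning step, here is a minimal pandas sketch. It assumes that reverse hits and contaminants are flagged in boolean columns named `is_reverse` and `is_contaminant`, and that a shared peptide appears as one row per protein. Both are hypothetical conventions that are not part of the example dataset, so adapt the names and logic to the output of your search engine.

```python
import pandas as pd

# Toy table; the 'is_reverse' and 'is_contaminant' flag columns are hypothetical.
raw_df = pd.DataFrame({
    'peptide_id': ['pept1', 'pept2', 'pept3', 'pept4', 'pept4'],
    'protein_id': ['prot0', 'prot0', 'prot1', 'prot1', 'prot2'],
    'is_reverse': [False, True, False, False, False],
    'is_contaminant': [False, False, True, False, False],
    'c1_rep1': [10.0, 20.0, 30.0, 40.0, 40.0],
})

# Drop reverse hits and contaminants.
clean_df = raw_df[~raw_df['is_reverse'] & ~raw_df['is_contaminant']]

# Drop peptides that map to more than one protein
# (assumes one row per peptide-protein pair).
n_proteins = clean_df.groupby('peptide_id')['protein_id'].transform('nunique')
clean_df = clean_df[n_proteins == 1].reset_index(drop=True)
```

In this toy table only `pept1` survives: `pept2` is a reverse hit, `pept3` is a contaminant, and `pept4` is shared between two proteins.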

In [3]:
data_df = pd.read_csv('data/peptides_dataset.tsv', sep='\t')

config = {
    'id_col': 'peptide_id',
    'conditions': {
        'c1': ['c1_rep1', 'c1_rep2', 'c1_rep3', 'c1_rep4', 'c1_rep5'],
        'c2': ['c2_rep1', 'c2_rep2', 'c2_rep3', 'c2_rep4', 'c2_rep5'],
        'c3': ['c3_rep1', 'c3_rep2', 'c3_rep3', 'c3_rep4', 'c3_rep5'],
    },
    'protein_id_col': 'protein_id',
}
data_df.head(3)

Unnamed: 0,peptide_id,protein_id,c1_rep1,c1_rep2,c1_rep3,c1_rep4,c1_rep5,c2_rep1,c2_rep2,c2_rep3,c2_rep4,c2_rep5,c3_rep1,c3_rep2,c3_rep3,c3_rep4,c3_rep5
0,pept147,prot0,1740.91246,0.0,1393.260017,4685.874636,513.393605,502.109101,949.462139,0.0,3006.548317,671.891115,4123.628101,11583.385623,3114.88241,2812.034141,2195.55053
1,pept424,prot0,3668.876134,0.0,0.0,303.011791,1314.382432,404.828763,3723.604607,11838.405382,7586.141805,0.0,336.36333,0.0,200.425728,3891.630707,1395.146624
2,pept631,prot0,0.0,3138.459061,3409.906069,1712.639948,987.488051,0.0,8197.162348,0.0,2067.977126,1111.872036,9229.125064,0.0,19303.06527,2427.103374,491.19581
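
Before building the dataset object, it can help to confirm that every column named in the configuration exists in the data frame. The helper below, `missing_columns`, is a hypothetical convenience function written for this example and is not part of omicspylib. Its parameters mirror the keys of the `config` dictionary used above, so it could be called as `missing_columns(data_df, **config)`.

```python
import pandas as pd


def missing_columns(df, id_col, protein_id_col, conditions):
    """Return the configured column names that are absent from df."""
    expected = [id_col, protein_id_col]
    for replicate_cols in conditions.values():
        expected.extend(replicate_cols)
    return [col for col in expected if col not in df.columns]


# Toy data frame with one replicate column deliberately left out.
toy_df = pd.DataFrame({
    'peptide_id': ['pept1'],
    'protein_id': ['prot0'],
    'c1_rep1': [1.0],
})
toy_config = {
    'id_col': 'peptide_id',
    'conditions': {'c1': ['c1_rep1', 'c1_rep2']},
    'protein_id_col': 'protein_id',
}
missing = missing_columns(toy_df, **toy_config)
```

An empty list means the configuration matches the table; here `missing` contains `c1_rep2`.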


Next, create the `PeptidesDataset` object, which wraps the specified experimental conditions and abstracts related operations. For example, you could perform normalization at the peptide level before calculating protein abundance values.

In [4]:
peptides_dataset = PeptidesDataset.from_df(data_df, **config)

Use the `to_proteins` method to aggregate the peptides dataset into a proteins dataset. This method returns a `ProteinsDataset` that you can keep using, e.g. to perform pairwise comparisons between two experimental conditions.

In this example, we calculate the sum of peptide intensities by passing `sum` to the `agg_method` argument, and the number of peptides by passing `counts`. Optionally, you can rename the columns on the fly by adding a prefix, which avoids name conflicts later when you join the data back into one table.

In [5]:
# create ProteinsDataset objects from the PeptidesDataset using different aggregation methods.
proteins_dataset_int = peptides_dataset.to_proteins(agg_method='sum', add_prefix='intensity_')
proteins_dataset_pept_counts = peptides_dataset.to_proteins(agg_method='counts', add_prefix='n_peptides_')

# extract data as Pandas data frames
prot_int = proteins_dataset_int.to_table()
prot_counts = proteins_dataset_pept_counts.to_table()

# merge to one table for further processing
proteins_dataset = prot_counts.merge(prot_int, on='protein_id', how='left')
proteins_dataset.head(3)

protein_id,n_peptides_c1_rep1,n_peptides_c1_rep2,n_peptides_c1_rep3,n_peptides_c1_rep4,n_peptides_c1_rep5,n_peptides_c2_rep1,n_peptides_c2_rep2,n_peptides_c2_rep3,n_peptides_c2_rep4,n_peptides_c2_rep5,...,intensity_c2_rep1,intensity_c2_rep2,intensity_c2_rep3,intensity_c2_rep4,intensity_c2_rep5,intensity_c3_rep1,intensity_c3_rep2,intensity_c3_rep3,intensity_c3_rep4,intensity_c3_rep5
prot0,7,5,8,8,9,8,8,7,7,8,...,10905.868929,29978.048873,39279.995012,16447.245791,15476.546193,22106.072629,18638.301516,41024.311866,69277.036001,10825.451618
prot1,9,7,8,7,5,4,7,9,9,7,...,6798.912758,17346.823885,14392.470307,22301.020074,34467.525125,14886.915695,35368.129215,5180.27657,13585.279883,36810.953282
prot10,10,9,10,6,7,8,7,6,7,7,...,50901.080878,16509.573663,5293.420927,31092.895805,13232.936859,9335.053026,11066.451746,5719.300722,18697.848325,8109.855595
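
To make the two aggregation modes concrete, the sketch below reproduces them with plain pandas on a toy table. It is an illustration only, not omicspylib's implementation. In particular, it assumes that a peptide is counted for a sample when its intensity is greater than zero, and the library may define the count differently. The `toy_` names are introduced here for the example.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'peptide_id': ['pept1', 'pept2', 'pept3'],
    'protein_id': ['prot0', 'prot0', 'prot1'],
    'c1_rep1': [100.0, 0.0, 50.0],
    'c1_rep2': [200.0, 300.0, 0.0],
})
sample_cols = ['c1_rep1', 'c1_rep2']

# Sum of peptide intensities per protein and sample.
toy_int = (
    toy_df.groupby('protein_id')[sample_cols]
    .sum()
    .add_prefix('intensity_')
)

# Number of peptides per protein and sample
# (assumption: a peptide counts when its intensity is > 0).
toy_counts = (
    (toy_df[sample_cols] > 0)
    .groupby(toy_df['protein_id'])
    .sum()
    .add_prefix('n_peptides_')
)

# Both tables are indexed by protein_id, so an index join combines them.
toy_merged = toy_counts.join(toy_int, how='left')
```

Here `prot0` gets an intensity of 500.0 and a count of 2 in `c1_rep2`, while `prot1` gets a count of 0 because its only peptide has zero intensity in that sample. The prefixes keep the two sets of columns distinct after the join.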
