# Candida/M. musculus peptide overlap

_PI: Simon Arthur (CSI)_

## Background

We would like to determine how many peptides from a trypsin digest are shared between mouse and Candida. The mouse strain we normally use is C57BL/6J. The Candida strain is SC5314, which appears to be the most commonly used strain and has been sequenced (http://www.candidagenome.org/).

## Source data

### M. musculus C57BL/6J

The mouse genome build mm10 is derived from C57BL/6J (see https://www.ebi.ac.uk/ena/data/view/GCA_000001635); consequently, the protein sequences from this build are used to form the mouse peptide database. These were downloaded from Ensembl release 94 (22nd Nov 2018), which uses the GRCm38.p6 assembly.

In [6]:
%%bash
mkdir data
cd data
wget -nv ftp://ftp.ensembl.org/pub/release-94/fasta/mus_musculus/pep/Mus_musculus.GRCm38.pep.all.fa.gz
gunzip -v Mus_musculus.GRCm38.pep.all.fa.gz

2018-11-22 15:01:25 URL: ftp://ftp.ensembl.org/pub/release-94/fasta/mus_musculus/pep/Mus_musculus.GRCm38.pep.all.fa.gz [11605732] -> "Mus_musculus.GRCm38.pep.all.fa.gz" [1]
Mus_musculus.GRCm38.pep.all.fa.gz:	   75.7% -- replaced with Mus_musculus.GRCm38.pep.all.fa


### Candida albicans SC5314

C. albicans SC5314 has also been sequenced and is available from Ensembl Genomes release 41 (22nd Nov 2018), which uses the Cand_albi_SC5314_V4 assembly.

In [8]:
%%bash
cd data
wget -nv ftp://ftp.ensemblgenomes.org/pub/fungi/release-41/fasta/fungi_ascomycota2_collection/candida_albicans_sc5314_gca_000784635/pep/Candida_albicans_sc5314_gca_000784635.Cand_albi_SC5314_V4.pep.all.fa.gz
gunzip -v Candida_albicans_sc5314_gca_000784635.Cand_albi_SC5314_V4.pep.all.fa.gz

2018-11-22 15:02:15 URL: ftp://ftp.ensemblgenomes.org/pub/fungi/release-41/fasta/fungi_ascomycota2_collection/candida_albicans_sc5314_gca_000784635/pep/Candida_albicans_sc5314_gca_000784635.Cand_albi_SC5314_V4.pep.all.fa.gz [2077375] -> "Candida_albicans_sc5314_gca_000784635.Cand_albi_SC5314_V4.pep.all.fa.gz" [1]
Candida_albicans_sc5314_gca_000784635.Cand_albi_SC5314_V4.pep.all.fa.gz:	   53.2% -- replaced with Candida_albicans_sc5314_gca_000784635.Cand_albi_SC5314_V4.pep.all.fa


## Trypsin digests

The EMBOSS 6.6.0.0 pepdigest tool can carry out _in-silico_ trypsin digests. Trypsin is selected with the option '-menu 1'; it is also required to specify whether to use monoisotopic weights with the '-mono' option. By default, pepdigest does not cut proteins at unfavoured sites ('K' or 'R' followed by any of 'KRIFLP').

In [9]:
%%bash
cd data
pepdigest -seqall Mus_musculus.GRCm38.pep.all.fa -menu 1 -mono N -outfile Mus_musculus.pepdigest
pepdigest -seqall Candida_albicans_sc5314_gca_000784635.Cand_albi_SC5314_V4.pep.all.fa -menu 1 -mono N -outfile Candida_albicans.pepdigest

Report on protein proteolytic enzyme or reagent cleavage sites
Report on protein proteolytic enzyme or reagent cleavage sites
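
To make the cleavage rule described above concrete, the following is a minimal Python sketch of it: cut after 'K' or 'R' unless the next residue is one of 'KRIFLP'. It is an illustration only, not a re-implementation of pepdigest; the function name `trypsin_digest` and the example sequence are hypothetical, and pepdigest may treat edge cases differently.

```python
def trypsin_digest(sequence, unfavoured="KRIFLP"):
    """Split a protein sequence after K/R, skipping unfavoured sites.

    Simplified illustration of the rule described in the text; this is
    an assumption-laden sketch, not the pepdigest algorithm itself.
    """
    peptides = []
    start = 0
    # The last residue never needs a cut after it, so stop one short.
    for i in range(len(sequence) - 1):
        if sequence[i] in "KR" and sequence[i + 1] not in unfavoured:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Whatever remains after the final cut is the C-terminal peptide.
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


# Hypothetical sequence: 'RP' and the first 'K' of 'KK' are unfavoured sites.
print(trypsin_digest("AAKGGRPLLKKTTR"))
```

In this toy example the 'R' followed by 'P' and the 'K' followed by another 'K' are not cut, so those residues stay inside the middle peptide.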


Pepdigest output contains a separate output section for each input protein, from which the separate peptides need to be parsed. In this case these can be written to separate fasta files.

In [39]:
import os
import re
import hashlib
import pandas as pd

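def parse_pepdigest(digestfile):
    """Parse pepdigest output into a DataFrame with one row per peptide."""

    # Header line naming the protein that a report section belongs to
    new_seq_re = re.compile(r'# Sequence: ([\S]+)')
    # Peptide row: three numeric columns, two single-character columns, then the sequence
    peptide_re = re.compile(r'([\s]+[\d\.]+){3}[\s]+([A-Z\.][\s]+){2}([A-Z]+)')

    prot_id = None
    pep_count = 0
    peptides = []

    # The context manager closes the file once parsing is finished
    with open(digestfile, 'r') as file:
        for line in file:

            match = new_seq_re.match(line)
            if match:
                # New protein section: reset the per-protein peptide counter
                prot_id = match.group(1)
                pep_count = 0

            match = peptide_re.match(line)
            if match:
                pep_count = pep_count + 1
                peptide = match.group(3)
                # The MD5 digest gives a fixed-length key for comparing peptides
                md5sum = hashlib.md5(peptide.encode('utf-8')).hexdigest()
                pep_data = {
                    'protein_id': prot_id,
                    'count': pep_count,
                    'peptide': peptide,
                    'md5sum': md5sum
                }
                peptides.append(pep_data)

    columns = ['protein_id', 'count', 'peptide', 'md5sum']
    df = pd.DataFrame(peptides, columns=columns)

    return df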

organisms=['Candida_albicans','Mus_musculus']

unique_dfs=[]
summary_data=[]

for beast in organisms:
    
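    df=parse_pepdigest('data/%s.pepdigest' % beast)
    # keep=False drops every copy of a repeated peptide, leaving only sequences that occur once in this organism
    unique_df=df.drop_duplicates('md5sum',keep=False)
    prots=df['protein_id'].unique().tolist()
    # Proteins that contribute at least one such unique peptide
    unique_prots=unique_df['protein_id'].unique().tolist()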
    
    protein_count=len(prots)
    peptide_count=(len(df.index))
    unique_protein_count=len(unique_prots)
    unique_peptide_count=(len(unique_df.index))
    
    unique_dfs.append(unique_df)
    
    data={
        'Organism': beast,
        'Peptide count': peptide_count,
        'Protein count': protein_count,
        'Unique peptide count': unique_peptide_count,
        'Unique protein count': unique_protein_count
    }
    summary_data.append(data)
    
merged_df=pd.concat(unique_dfs)
display(merged_df)

merged_prots=merged_df['protein_id'].unique().tolist()
merged_peptide_count=(len(merged_df.index))
merged_protein_count=len(merged_prots)
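# Peptides left after this step occur in exactly one organism; peptides shared between the two are dropped
unique_merged_df=merged_df.drop_duplicates('md5sum',keep=False)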
unique_merged_prots=unique_merged_df['protein_id'].unique().tolist()
unique_merged_peptide_count=(len(unique_merged_df.index))
unique_merged_protein_count=(len(unique_merged_prots))

merged_data={
    'Organism': 'Combined uniques',
    'Peptide count': merged_peptide_count,
    'Protein count': merged_protein_count,
    'Unique peptide count': unique_merged_peptide_count,
    'Unique protein count': unique_merged_protein_count
}
summary_data.append(merged_data)

summary_columns=['Organism','Protein count','Peptide count','Unique protein count', 'Unique peptide count']
summary_df=pd.DataFrame(summary_data,columns=summary_columns)
display(summary_df)

| | protein_id | count | peptide | md5sum |
|---|---|---|---|---|
| 10 | KHC82800 | 2 | SITQSPNGEIQNFHAILDQTVSKLNDAEWCLGVMVEKKKKLDELK | 69c053e4a96abacd27dc067ba9fb22ba |
| 11 | KHC82800 | 3 | AETAPQKFDNFDDFIGFDINGNTNDEDMLSNMDYEDLKLDDK | 4750c9b8773495b8f9d5f6ab503c3332 |
| 26 | KHC82801 | 1 | VIYYTGPLFVFIYSTYSFCIVTCNLINK | 1b237f1504842dd965bf815cbb37781a |
| 27 | KHC82801 | 2 | NNHLRIVVVESTDSSK | cf46046b1b1339c13ebed57a33578795 |
| 28 | KHC82801 | 3 | AVTAQDLTFTSGITR | 441057fb135caf5e181cc6dc5bd608e2 |
| 29 | KHC82801 | 4 | NVVTMFRR | e1de4b397ff1b4f7261709508440d233 |
| 30 | KHC82801 | 5 | NWSKFPPK | d43fd1fcf5d53a1e21829beaadd33878 |
| 31 | KHC82801 | 6 | GVLTVSYTK | 5d3da8a0cc9404062bbcea25e240a84c |
| 32 | KHC82801 | 7 | SIKRLK | fd8c9a985558b6bf8e118a4d57494b4b |
| 33 | KHC82801 | 8 | TFCLR | a21f5ad979f1a82c78de940dbdc45f6b |


| | Organism | Protein count | Peptide count | Unique protein count | Unique peptide count |
|---|---|---|---|---|---|
| 0 | Candida_albicans | 6380 | 208558 | 6197 | 153141 |
| 1 | Mus_musculus | 65830 | 2073163 | 40646 | 275525 |
| 2 | Combined uniques | 46843 | 428666 | 46800 | 425840 |
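
The summary does not list the shared peptides directly. In the "Combined uniques" row each sequence can appear at most once per organism, so the 428666 - 425840 = 2826 rows removed as duplicates would correspond to 1413 sequences present in both digests (assuming no MD5 collisions). The sketch below shows one way such shared sequences could be listed explicitly with a pandas inner merge. The function name `shared_peptides` and the toy tables are hypothetical, and the sketch merges on the peptide string rather than the md5sum column.

```python
import pandas as pd


def shared_peptides(df_a, df_b, key="peptide"):
    """Return peptides that are unique within each table and present in both.

    Hypothetical helper for illustration; column names follow the
    DataFrames built earlier but are otherwise assumptions.
    """
    # keep=False removes every copy of a peptide that is repeated within a table
    unique_a = df_a.drop_duplicates(key, keep=False)
    unique_b = df_b.drop_duplicates(key, keep=False)
    # An inner merge keeps only peptides found in both tables
    return unique_a.merge(unique_b, on=key, suffixes=("_a", "_b"))


# Toy data: 'GGR' is repeated within the mouse table, 'AAK' occurs in both tables
toy_mouse = pd.DataFrame({
    "protein_id": ["M1", "M1", "M2", "M3"],
    "peptide": ["AAK", "GGR", "GGR", "TTK"],
})
toy_candida = pd.DataFrame({
    "protein_id": ["C1", "C2"],
    "peptide": ["AAK", "LLR"],
})

shared = shared_peptides(toy_candida, toy_mouse)
print(shared)
```

Merging on the peptide column keeps the protein identifiers from both organisms side by side, which may be more useful downstream than a bare count.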
