# Candida/M. musculus peptide overlap

_PI: Simon Arthur (CSI)_

## Background

We would like to identify the number of peptides (from a trypsin digest) shared between mouse and Candida. The mouse strain we normally use is C57BL/6J. The Candida strain is SC5314; it seems to be the most commonly used strain and has been sequenced (http://www.candidagenome.org/).

## Source data

### M. musculus C57BL/6J

The mouse genome build mm10 is derived from C57BL/6J (see https://www.ebi.ac.uk/ena/data/view/GCA_000001635); consequently, the protein sequences from this build will be used to form the mouse peptide database. These were downloaded from Ensembl release 94 (22nd Nov 2018), where the assembly used is GRCm38.p6.

In [1]:
cd data
wget ftp://ftp.ensembl.org/pub/release-94/fasta/mus_musculus/pep/Mus_musculus.GRCm38.pep.all.fa.gz
gunzip -v Mus_musculus.GRCm38.pep.all.fa.gz

bash: cd: data: No such file or directory
--2018-11-22 14:52:08--  ftp://ftp.ensembl.org/pub/release-94/fasta/mus_musculus/pep/Mus_musculus.GRCm38.pep.all.fa.gz
           => ‘Mus_musculus.GRCm38.pep.all.fa.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.8
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.8|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-94/fasta/mus_musculus/pep ... done.
==> SIZE Mus_musculus.GRCm38.pep.all.fa.gz ... 11605732
==> PASV ... done.    ==> RETR Mus_musculus.GRCm38.pep.all.fa.gz ... done.
Length: 11605732 (11M) (unauthoritative)


2018-11-22 14:52:14 (2.15 MB/s) - ‘Mus_musculus.GRCm38.pep.all.fa.gz’ saved [11605732]

Mus_musculus.GRCm38.pep.all.fa.gz:	   75.7% -- replaced with Mus_musculus.GRCm38.pep.all.fa


### Candida albicans SC5314

C. albicans SC5314 has also been sequenced and is available from Ensembl Genomes Release 41 (22nd Nov 2018), which uses the Cand_albi_SC5314_V4 assembly.

In [2]:
wget ftp://ftp.ensemblgenomes.org/pub/fungi/release-41/fasta/fungi_ascomycota2_collection/candida_albicans_sc5314_gca_000784635/pep/Candida_albicans_sc5314_gca_000784635.Cand_albi_SC5314_V4.pep.all.fa.gz
gunzip -v Candida_albicans_sc5314_gca_000784635.Cand_albi_SC5314_V4.pep.all.fa.gz

--2018-11-22 14:52:14--  ftp://ftp.ensemblgenomes.org/pub/fungi/release-41/fasta/fungi_ascomycota2_collection/candida_albicans_sc5314_gca_000784635/pep/Candida_albicans_sc5314_gca_000784635.Cand_albi_SC5314_V4.pep.all.fa.gz
           => ‘Candida_albicans_sc5314_gca_000784635.Cand_albi_SC5314_V4.pep.all.fa.gz’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.197.94
Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.197.94|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/fungi/release-41/fasta/fungi_ascomycota2_collection/candida_albicans_sc5314_gca_000784635/pep ... done.
==> SIZE Candida_albicans_sc5314_gca_000784635.Cand_albi_SC5314_V4.pep.all.fa.gz ... 2077375
==> PASV ... done.    ==> RETR Candida_albicans_sc5314_gca_000784635.Cand_albi_SC5314_V4.pep.all.fa.gz ... done.
Length: 2077375 (2.0M) (unauthoritative)


2018-11-22 14:52:15 (2.93 MB/s) - ‘Candida_albic

## Trypsin digests

The EMBOSS 6.6.0.0 pepdigest tool can carry out _in-silico_ trypsin digests. Trypsin is selected using the option '-menu 1'; it is also required to select whether to use monoisotopic weights with the '-mono' option. By default, pepdigest does not cut proteins at unfavoured sites (K or R followed by any of 'KRIFLP').

In [3]:
pepdigest -seqall Mus_musculus.GRCm38.pep.all.fa -menu 1 -mono N -outfile Mus_musculus.pepdigest
pepdigest -seqall Candida_albicans_sc5314_gca_000784635.Cand_albi_SC5314_V4.pep.all.fa -menu 1 -mono N -outfile Candida_albicans.pepdigest

Report on protein proteolytic enzyme or reagent cleavage sites
Report on protein proteolytic enzyme or reagent cleavage sites
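
The cleavage rule described above can be illustrated with a short pure-Python sketch. This is not pepdigest's implementation; it is a simplified approximation that cuts after every K or R unless the following residue is one of 'KRIFLP', as stated in the text. The function name `trypsin_digest` and its `unfavoured` parameter are hypothetical and only serve as a sanity check on small sequences.

```python
def trypsin_digest(sequence, unfavoured="KRIFLP"):
    """Split a protein sequence after K/R, skipping unfavoured sites.

    Assumption: a site is unfavoured when the residue after K/R is in
    `unfavoured`; this mirrors the rule described in the text only.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        if residue not in "KR":
            continue
        # Residue following the candidate cleavage site ('' at the C-terminus)
        following = sequence[i + 1] if i + 1 < len(sequence) else ""
        if following and following in unfavoured:
            continue  # unfavoured site: do not cut here
        peptides.append(sequence[start:i + 1])
        start = i + 1
    # Any residues after the last cleavage site form the final peptide
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


print(trypsin_digest("AAKGGRTT"))   # ['AAK', 'GGR', 'TT']
print(trypsin_digest("MAKRPLKAR"))  # ['MAKRPLK', 'AR']
```

In the second example the K at position 3 is followed by R and the R at position 4 is followed by P, so neither site is cut.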


Pepdigest output contains a separate section for each input protein, from which the individual peptides need to be parsed. In this case these can be written to separate FASTA files.
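
Writing parsed peptides out in FASTA format might look like the sketch below. The function name `write_peptide_fasta`, the toy records and the `<protein_id>_<count>` header scheme are all assumptions made for illustration; the notebook does not specify a naming convention.

```python
import io


def write_peptide_fasta(records, handle):
    """Write (protein_id, count, peptide) tuples as FASTA entries.

    The header format '<protein_id>_<count>' is a hypothetical choice.
    """
    for protein_id, count, peptide in records:
        handle.write(f">{protein_id}_{count}\n{peptide}\n")


# Toy records standing in for parsed pepdigest output
records = [("PROT1", 1, "AAK"), ("PROT1", 2, "GGR")]

buffer = io.StringIO()  # an open file handle would work the same way
write_peptide_fasta(records, buffer)
print(buffer.getvalue())
```

Passing a handle rather than a path keeps the function usable with both real files and in-memory buffers.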

In [2]:
#!/usr/bin/env python

import re
import hashlib
import pandas as pd
from IPython.display import display

def parse_pepdigest(digestfile):
    """Parse pepdigest report output into a DataFrame with one row per peptide."""

    # Header line naming the protein that the following peptides belong to
    new_seq_re = re.compile(r'# Sequence: (\S+)')
    # Peptide row: three numeric columns, two single-character columns,
    # then the peptide sequence itself
    peptide_re = re.compile(r'(\s+[\d.]+){3}\s+([A-Z.]\s+){2}([A-Z]+)')

    prot_id = None
    pep_count = 0
    peptides = []

    with open(digestfile, 'r') as handle:
        for line in handle:

            match = new_seq_re.match(line)
            if match:
                # New protein section: reset the per-protein peptide counter
                prot_id = match.group(1)
                pep_count = 0

            match = peptide_re.match(line)
            if match:
                pep_count += 1
                peptide = match.group(3)
                # md5 hex digest of the peptide sequence
                md5sum = hashlib.md5(peptide.encode('utf-8')).hexdigest()
                peptides.append({
                    'protein_id': prot_id,
                    'count': pep_count,
                    'peptide': peptide,
                    'md5sum': md5sum
                })

    columns = ['protein_id', 'count', 'peptide', 'md5sum']
    return pd.DataFrame(peptides, columns=columns)


organisms=['Candida_albicans','Mus_musculus']

for beast in organisms:
    df=parse_pepdigest('data/%s.pepdigest' % beast)
    display(df)

FileNotFoundError: [Errno 2] No such file or directory: 'data/Candida_albicans.pepdigest'
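
Once both digests are parsed, the number of shared peptides asked for in the Background could be obtained with a set intersection. The following is a minimal sketch on toy data, not a result from the real digests. The function name `shared_peptides` and its `min_length` parameter are hypothetical; the inputs are assumed to be any iterable of peptide strings, such as the `peptide` column of the DataFrames above.

```python
def shared_peptides(peptides_a, peptides_b, min_length=1):
    """Return the set of peptide sequences present in both inputs.

    `min_length` is a hypothetical filter for dropping very short
    peptides before comparison; the default keeps everything.
    """
    set_a = {p for p in peptides_a if len(p) >= min_length}
    set_b = {p for p in peptides_b if len(p) >= min_length}
    return set_a & set_b


# Toy peptide lists (illustrative only, not real digest output)
mouse = ["AAK", "GGR", "AAK", "K"]
candida = ["AAK", "TTK", "K"]

print(len(shared_peptides(mouse, candida)))                   # 2
print(sorted(shared_peptides(mouse, candida, min_length=2)))  # ['AAK']
```

Using sets means a peptide that occurs in several proteins of one organism is counted once.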