# Get random peptides

Following NetMHCpan and MixMHCpred, we will sample random peptides from the human proteome and use them for ranking.

We will generate 100,000 peptides for each peptide length from 8 to 15.

To accomplish this, we follow these steps:

- Parse the human proteome file and read each protein sequence.
- Predict the proteasome cleavage sites in each protein with NetChop.
- Extract the peptides of lengths 8 to 15 that start right after a predicted cleavage site.
- Remove duplicates and randomly sample 100,000 peptides for each length.
- Save the sampled peptides to a file.

The randomly sampled peptides will be saved in the `data/processed/random_proteasome_cleaved_peptides` directory.
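
To make the extraction step concrete, here is a minimal sketch on a toy example. The sequence, the cleavage positions, and the helper name `peptides_after_sites` are hypothetical. The sketch assumes cleavage positions are 1-based, so a peptide that starts right after a site begins at 0-based index `site`.

```python
def peptides_after_sites(sequence, cleavage_sites, lengths):
    """Collect, for each length, the peptides starting right after each site.

    Assumption: sites are 1-based positions of the residue preceding the cut,
    so the peptide starts at 0-based index `site`.
    """
    peptides = {length: [] for length in lengths}
    for site in cleavage_sites:
        for length in lengths:
            # Skip peptides that would run past the end of the sequence
            if site + length <= len(sequence):
                peptides[length].append(sequence[site:site + length])
    return peptides


toy_sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical 22-residue sequence
toy_sites = [3, 12, 15]                   # hypothetical cleavage positions
toy_peptides = peptides_after_sites(toy_sequence, toy_sites, [8, 9])
print(toy_peptides)
```

The site at position 15 contributes nothing here, because a peptide of length 8 or 9 starting there would extend beyond the 22 residues of the toy sequence.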

#### Details

- We downloaded the human proteome from [UniProt](https://www.uniprot.org/proteomes/UP000005640) and saved it in the `data/raw/human_proteome` directory. We downloaded only the reviewed (Swiss-Prot) canonical proteins (20,420).
- We will use NetChop to predict proteasome cleavage sites in the human proteome. We will save the cleaved peptides in the `data/interim/cleaved_human_proteome` directory. We downloaded NetChop from the [official website](https://services.healthtech.dtu.dk/services/NetChop-3.1/). Note that it is not available for commercial use.
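
As a quick sanity check on the download, one could count the records in the FASTA file and compare the result with the expected number of proteins. The sketch below uses only the standard library and a made-up two-entry FASTA written to a temporary file; the helper name `count_fasta_records` and the toy entries are hypothetical.

```python
import os
import tempfile


def count_fasta_records(fasta_path):
    """Count the sequences in a FASTA file by counting header lines ('>')."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Toy FASTA with made-up entries, written to a temporary file for illustration
toy_fasta = (
    ">sp|TOY1|EXAMPLE_1\nMKTAYIAKQR\nQISFVKSHFS\n"
    ">sp|TOY2|EXAMPLE_2\nMSLNEQ\n"
)
with tempfile.NamedTemporaryFile(mode="w", suffix=".fasta", delete=False) as tmp:
    tmp.write(toy_fasta)
    toy_path = tmp.name

try:
    n_records = count_fasta_records(toy_path)
finally:
    # Clean up the temporary file
    os.remove(toy_path)

print(n_records)
```

Run on the real proteome file, the count would be expected to match the 20,420 proteins mentioned above.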

In [1]:
import os
import tempfile
import subprocess
import numpy as np
from tqdm import tqdm
from Bio import SeqIO

In [2]:
DATA_DIR = '../data'
proteome_fasta_file = os.path.join(DATA_DIR, 'raw', 'human_proteome', 'uniprotkb_proteome_UP000005640_AND_revi_2024_06_14.fasta')
cleaved_human_proteome_dir = os.path.join(DATA_DIR, 'interim', 'cleaved_human_proteome')
processed_random_peptides_dir = os.path.join(DATA_DIR, 'processed', 'random_proteasome_cleaved_peptides')
NETCHOP_PREDICTOR = '/home/bsccns/Documents/PhD/software/netchop-3.1/netchop'
N_RND_PEPTIDES = 100000
peptide_length = [8, 9, 10, 11, 12, 13, 14, 15]

In [3]:
def run_netchop(
        protein_sequence: str,
        path_to_netchop: str
) -> list:
    """
    Run NetChop to predict cleavage sites in a protein sequence.
    :param protein_sequence: str
        Amino acid sequence of a single protein
    :param path_to_netchop: str
        Path to the NetChop executable
    :return: list
        Positions predicted by NetChop as cleavage sites. A cleaved peptide
        starts right after its cleavage site.
    """
    # Create a temporary file to write the protein sequence in FASTA format
    with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.fasta') as temp_fasta:
        temp_fasta.write(">protein_sequence\n")
        temp_fasta.write(protein_sequence)
        fasta_file_path = temp_fasta.name
    
    try:
        # Run NetChop
        result = subprocess.run([path_to_netchop, fasta_file_path], capture_output=True, text=True)
        output = result.stdout
        
        # Parse the output
        cleavage_positions = []
        parsing = False  # Initialize the parsing variable
        for line in output.split('\n'):
            if line.startswith('--------------------------------------'):
                # Start parsing after the header
                parsing = True
            elif parsing:
                if line.strip() == '':
                    # Stop parsing at the end of the relevant section
                    break
                parts = line.split()
                if len(parts) >= 4 and parts[2] == 'S':
                    cleavage_positions.append(int(parts[0]))
                    
        return cleavage_positions
    
    finally:
        # Clean up the temporary file
        os.remove(fasta_file_path)
        
def find_cleavable_peptides(
        fasta_file: str, 
        peptide_lengths: list,
        path_to_netchop: str
) -> dict:
    """
    Find the peptides that can be cleaved by the proteasome, according to NetChop, in the given protein sequences.
    :param fasta_file: str
        Path to the FASTA file containing the protein sequences
    :param peptide_lengths: list
        List of peptide lengths to consider
    :param path_to_netchop: str
        Path to the NetChop executable
    :return: dict
        Dictionary containing a list of cleavable peptides for each peptide length
    """
    cleavable_peptides = {length: [] for length in peptide_lengths}
    # Read all records once so that tqdm knows the total without re-parsing the file
    records = list(SeqIO.parse(fasta_file, "fasta"))
    for record in tqdm(records):
        protein_sequence = str(record.seq)
        cleavage_sites = run_netchop(protein_sequence, path_to_netchop)

        for pos in cleavage_sites:
            # Assuming NetChop reports 1-based positions, the residue right
            # after the cleavage site sits at 0-based index `pos`.
            for length in peptide_lengths:
                # Keep the peptide only if it fits within the sequence
                if pos + length <= len(protein_sequence):
                    peptide = protein_sequence[pos:pos+length]
                    cleavable_peptides[length].append(peptide)
    # Duplicates are kept here; they are removed in a later step
    return cleavable_peptides


In [4]:
cleavable_peptides = find_cleavable_peptides(proteome_fasta_file, peptide_length, NETCHOP_PREDICTOR)

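# Make sure the output directory exists before writing
os.makedirs(cleaved_human_proteome_dir, exist_ok=True)
for length, peptides in cleavable_peptides.items():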
    with open(os.path.join(cleaved_human_proteome_dir, f'cleavable_peptides_length_{length}.txt'), 'w') as f:
        for peptide in peptides:
            f.write(f'{peptide}\n')

100%|██████████| 20420/20420 [38:29<00:00,  8.84it/s] 


In [5]:
# Save the data without duplicates
for length in peptide_length:
    with open(os.path.join(cleaved_human_proteome_dir, f'cleavable_peptides_length_{length}.txt'), 'r') as f:
        peptides = f.readlines()
        peptides = [peptide.strip() for peptide in peptides]
        peptides = list(set(peptides))
        
    with open(os.path.join(cleaved_human_proteome_dir, f'cleavable_peptides_length_{length}_no_duplicates.txt'), 'w') as f:
        for peptide in peptides:
            f.write(f'{peptide}\n')

In [6]:
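# Sample random peptides without replacement; np.random.choice raises a
# ValueError if fewer than N_RND_PEPTIDES unique peptides are available
os.makedirs(processed_random_peptides_dir, exist_ok=True)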
for length in peptide_length:
    with open(os.path.join(cleaved_human_proteome_dir, f'cleavable_peptides_length_{length}_no_duplicates.txt'), 'r') as f:
        peptides = f.readlines()
        peptides = [peptide.strip() for peptide in peptides]
        random_peptides = np.random.choice(peptides, N_RND_PEPTIDES, replace=False)
        
    with open(os.path.join(processed_random_peptides_dir, f'random_peptides_length_{length}.txt'), 'w') as f:
        for peptide in random_peptides:
            f.write(f'{peptide}\n')
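
Because no random seed is set, the sampling cell above draws a different set of peptides on every run. If reproducibility matters, a seeded variant along the following lines could be used. This is only a sketch: it swaps in the standard library's `random.Random.sample` for `np.random.choice`, and the helper name `sample_unique_peptides`, the seed value, and the toy peptides are hypothetical.

```python
import random


def sample_unique_peptides(peptides, n_samples, seed):
    """Sample `n_samples` distinct peptides reproducibly."""
    # Sorting the unique peptides makes the result independent of set ordering
    pool = sorted(set(peptides))
    if len(pool) < n_samples:
        raise ValueError(
            f"Only {len(pool)} unique peptides available, {n_samples} requested"
        )
    # A dedicated generator keeps the seed local to this call
    return random.Random(seed).sample(pool, n_samples)


toy_pool = ["AYIAKQRQ", "SFVKSHFS", "AYIAKQRQ", "KSHFSRQA", "QRQISFVK"]
first = sample_unique_peptides(toy_pool, 3, seed=0)
second = sample_unique_peptides(toy_pool, 3, seed=0)
print(first == second)
```

With the same seed and the same input peptides, both calls return the same sample, and the explicit check reports a pool that is too small before any sampling takes place.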