This Colab notebook guides the reader through the preprocessing of antigen and non-antigen proteins. The data obtained from this step will then be used to develop a feedforward neural network model that predicts whether new, unseen proteins are antigens.

- Positive and negative datasets were obtained from IEDB (https://www.iedb.org/) with the following search parameters: 'epitope:any', 'organism:eucaryote(ID:2759)', 'host:human', 'assay:any', 'outcome:positive AND negative', 'MHC restriction:any', 'disease:infectious'.

Remember to:
- Download the data in CSV format
- Clean the data manually (human proteins are also downloaded, which can bias the data).
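
Part of the manual cleaning can be approximated in code. The sketch below drops rows whose organism column mentions *Homo sapiens*. The column index, the label text and the sample rows are assumptions made for illustration; they have to be adapted to the columns of the actual IEDB export.

```python
import csv
import io


def drop_human_rows(rows, organism_col, human_label="Homo sapiens"):
    """Keep only the rows whose organism column does not mention human_label."""
    kept = []
    for row in rows:
        # rows that are too short to have an organism column are kept unchanged
        organism = row[organism_col] if len(row) > organism_col else ""
        if human_label.lower() not in organism.lower():
            kept.append(row)
    return kept


# hypothetical miniature export: an id column and an organism column
sample_csv = io.StringIO(
    "id,organism\n"
    "1,Plasmodium falciparum\n"
    "2,Homo sapiens (human)\n"
    "3,Aspergillus fumigatus\n"
)
rows = list(csv.reader(sample_csv))
cleaned = drop_human_rows(rows, organism_col=1)
print(cleaned)
```

A filter like this only catches rows that are labelled as human, so a visual check of the file is still worthwhile.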

In [None]:
!pip install biopython requests

In [1]:
# import required libraries
import csv
import re

We now have a list of proteins, but we need their sequences to build a machine learning model. Let's inspect the downloaded CSV file and iterate over every row to find the UniProt link of each protein.
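
The link matching can be tried on single strings before running it on the whole file. This is a minimal sketch: the helper name `extract_uniprot_id` and the sample values are hypothetical, and accepting both `http` and `https` links is an assumption about how the export may write them.

```python
import re

# the dots are escaped so that they only match a literal '.'
UNIPROT_LINK = re.compile(r'https?://www\.uniprot\.org/uniprot/(\w+)')


def extract_uniprot_id(cell):
    """Return the UniProt accession found in a CSV cell, or None if there is none."""
    match = UNIPROT_LINK.search(cell)
    return match.group(1) if match else None


# a cell with a link yields the accession, any other cell yields None
print(extract_uniprot_id("http://www.uniprot.org/uniprot/P12345"))
print(extract_uniprot_id("Epitope IRI"))
```

Header cells and rows without a UniProt link return `None`, which is why such rows disappear from the filtered file.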

In [None]:
# open the .csv file in reading mode
with open('negative.csv', 'r', newline='') as csv_file:
    # create a csv reader object to read the file
    csv_reader = csv.reader(csv_file)

    # create an empty list to hold the valid rows
    valid_rows = []

    # iterate over every row of the csv file
    for row in csv_reader:
        # check whether the second column contains a valid UniProt link
        # (header rows do not match, so they are dropped as well)
        uniprot_link = re.findall(r'https?://www\.uniprot\.org/uniprot/(\w+)', row[1]) if len(row) > 1 else []
        if uniprot_link:
            # a valid UniProt link was found: append the accession and keep the row
            row.append(uniprot_link[0])
            valid_rows.append(row)

# once the input file is closed, overwrite it with the valid rows only
with open('negative.csv', 'w', newline='') as new_csv_file:
    csv_writer = csv.writer(new_csv_file)
    csv_writer.writerows(valid_rows)


Do the same with the positive dataset.

In [None]:
# open the .csv file in reading mode
with open('positive.csv', 'r', newline='') as csv_file:
    # create a csv reader object to read the file
    csv_reader = csv.reader(csv_file)

    # create an empty list to hold the valid rows
    valid_rows = []

    # iterate over every row of the csv file
    for row in csv_reader:
        # check whether the second column contains a valid UniProt link
        # (header rows do not match, so they are dropped as well)
        uniprot_link = re.findall(r'https?://www\.uniprot\.org/uniprot/(\w+)', row[1]) if len(row) > 1 else []
        if uniprot_link:
            # a valid UniProt link was found: append the accession and keep the row
            row.append(uniprot_link[0])
            valid_rows.append(row)

# once the input file is closed, overwrite it with the valid rows only
with open('positive.csv', 'w', newline='') as new_csv_file:
    csv_writer = csv.writer(new_csv_file)
    csv_writer.writerows(valid_rows)

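For every protein we fetch the corresponding sequence and save it in a .fasta file. The cell below processes negative.csv; run it again with positive.csv and positive.fasta, because the deduplication step further down needs both FASTA files.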

In [None]:
import requests

# open the csv file in reading mode
with open('negative.csv', 'r', newline='') as csv_file:

    # read the csv file
    csv_reader = csv.reader(csv_file, delimiter=',')

    # no header to skip: the filtering step above kept only rows with a UniProt link

    # open the .fasta file in writing mode
    with open('negative.fasta', 'w') as fasta_file:

        # iterate over the rows
        for row in csv_reader:

            # take the URL of the protein from the second column
            url = row[1]

            # extract the protein identifier from the URL
            id_proteina = url.split('/')[-1]

            # build the URL of the protein's FASTA entry
            url_uniprot = 'https://www.uniprot.org/uniprot/{}.fasta'.format(id_proteina)

            # download the protein sequence, skipping entries that cannot be retrieved
            response = requests.get(url_uniprot, timeout=30)
            if not response.ok:
                continue
            seq = response.text.strip().split('\n')[1:]

            # write the protein sequence to the .fasta file
            fasta_file.write('>{}\n{}\n'.format(id_proteina, ''.join(seq)))

print('Amino acid sequences saved in the FASTA file "negative.fasta"')


Amino acid sequences saved in FASTA format in the file "proteine.fasta"
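
A response that is not valid FASTA (for example an error page returned for an obsolete accession) should not end up in the output file. The following is a minimal sketch of a check that could be applied to the downloaded text before writing it; `parse_fasta_text` is a hypothetical helper, and the example entry is made up.

```python
def parse_fasta_text(text):
    """Split one FASTA entry into (header, sequence).

    Returns None when the text does not start with '>' or has no sequence lines.
    """
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith('>'):
        return None
    header = lines[0][1:]
    # sequence lines are wrapped in FASTA, so join them into one string
    sequence = ''.join(line.strip() for line in lines[1:])
    return (header, sequence) if sequence else None


# made-up entry, used only to show the expected shape of the input
example = ">sp|P00000|EXAMPLE hypothetical entry\nMKTAYIAK\nQRQISFVK\n"
print(parse_fasta_text(example))
print(parse_fasta_text("<html>Not found</html>"))
```

In the download loop, an entry would be written only when the function returns a tuple; accessions that return `None` could be collected in a list and reviewed afterwards.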


Since some proteins appear in both the positive and the negative dataset, it is best to remove them: a sequence that carries both labels makes it harder for the model to predict protein antigenicity correctly.

In [None]:
# import biopython library to handle protein sequences
from Bio import SeqIO

# define the input file names
file1 = "positive.fasta"
file2 = "negative.fasta"

# create two empty sets to store the protein sequences
seqs1 = set()
seqs2 = set()

# open first fasta file and insert sequences into first set 
for record in SeqIO.parse(file1, "fasta"):
    seqs1.add(str(record.seq))

# do the same with second file 
for record in SeqIO.parse(file2, "fasta"):
    seqs2.add(str(record.seq))

# find shared proteins between two files and insert them into a new set 
shared_proteins = seqs1.intersection(seqs2)

# remove shared proteins from original sets
seqs1 = seqs1 - shared_proteins
seqs2 = seqs2 - shared_proteins

# create two new files to contain unique proteins for every class 
with open("unique_proteins_pos.fasta", "w") as f:
    for i, seq in enumerate(seqs1):
        f.write(f">protein_{i}\n{seq}\n")
        
with open("unique_proteins_neg.fasta", "w") as f:
    for i, seq in enumerate(seqs2):
        f.write(f">protein_{i}\n{seq}\n")

print('number of antigens:', len(seqs1))
print('number of NON antigens:', len(seqs2))


no of immunogens: 487
no of non-immunogens: 369
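
The set-based cell above discards the original record identifiers and renames every sequence `protein_<i>`. If the accessions are needed later, the same idea can be applied to dictionaries that map an identifier to its sequence. This is a sketch with made-up identifiers and sequences; `remove_shared` is a hypothetical helper.

```python
def remove_shared(pos, neg):
    """pos and neg map a record ID to its sequence.

    Every entry whose sequence occurs in both classes is dropped from both.
    """
    shared = set(pos.values()) & set(neg.values())
    unique_pos = {rec_id: seq for rec_id, seq in pos.items() if seq not in shared}
    unique_neg = {rec_id: seq for rec_id, seq in neg.items() if seq not in shared}
    return unique_pos, unique_neg


# toy data: the sequence "MKTAYIAK" appears in both classes under different IDs
positive = {"A1": "MKTAYIAK", "A2": "MLLPQRST"}
negative = {"B1": "MKTAYIAK", "B2": "MGGHHWWY"}

unique_pos, unique_neg = remove_shared(positive, negative)
print(unique_pos)
print(unique_neg)
```

Unlike the set version, this variant keeps identical sequences stored under different IDs within one class, so the counts can differ from the ones printed above.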


- BONUS (this approach is not very accurate)

Data mining from NCBI

In [None]:
from Bio import Entrez, SeqIO

# NCBI asks for a contact e-mail address with every Entrez query
Entrez.email = "francesco.patane@live.it"

# search NCBI Protein; change the organism in the query as needed
handle = Entrez.esearch(db="protein", term='("antigen"[Function]) AND Aspergillus[Organism]', retmax=10000)
record = Entrez.read(handle)
handle.close()

# download the sequences of all hits in FASTA format
handle = Entrez.efetch(db="protein", id=record["IdList"], rettype="fasta", retmode="text")
records = list(SeqIO.parse(handle, "fasta"))
handle.close()

# save the sequences to a file
with open("antigens.fasta", "w") as f:
    SeqIO.write(records, f, "fasta")

print("Downloaded seqs:", len(records))


Downloaded proteins: 9999
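
Requesting several thousand records in a single call makes the download hard to resume if it fails halfway. One possible alternative is to fetch the identifiers in batches. The sketch below is library-independent: `fetch_in_batches` and `fake_fetch` are hypothetical names, and the batch size of 200 is an arbitrary choice, not a documented limit.

```python
def fetch_in_batches(ids, fetch, batch_size=200):
    """Call fetch(batch) on consecutive slices of ids and collect all results."""
    results = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        results.extend(fetch(batch))
    return results


# stand-in for the real download, used here so the example runs offline
calls = []

def fake_fetch(batch):
    calls.append(len(batch))
    return [f"record_{i}" for i in batch]


ids = [str(i) for i in range(450)]
records = fetch_in_batches(ids, fake_fetch, batch_size=200)
print(len(records), calls)
```

With Biopython, `fetch` could be a small function that calls `Entrez.efetch(db="protein", id=batch, rettype="fasta", retmode="text")` and returns `list(SeqIO.parse(handle, "fasta"))` for each batch.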
