# Uniprot and Sequence Alignment

This notebook shows how to retrieve sequences from [Uniprot](https://www.uniprot.org/) and perform a sequence alignment using `SeqLike`.

Uniprot is a database of protein sequences (not structures), and SeqLike is a library made by Moderna for working with sequences and doing sequence alignment.
It is based on the Sequence object from Biopython.

In this notebook, I demonstrate using `requests` from Python to query Uniprot.
I pull the Uniprot accession IDs out of the search result and then retrieve the FASTA file for each protein.
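
The two steps described above (build a search URL, then pull the IDs out of the JSON) can be sketched without touching the network. The payload below is a hypothetical, heavily trimmed stand-in for a real response: it only mirrors the `results` and `primaryAccession` keys this notebook relies on, and the accession strings are placeholders. `build_search_url` and `extract_accessions` are illustrative helper names, not part of `requests` or of Uniprot.

```python
from urllib.parse import urlencode

# Same search endpoint the notebook queries further down
SEARCH_ENDPOINT = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(query):
    """Return the search URL with the query safely URL-encoded."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'query': query})}"


def extract_accessions(payload):
    """Collect the primary accession of every entry in a search payload."""
    return [entry["primaryAccession"] for entry in payload.get("results", [])]


# Hypothetical, trimmed payload (assumption: real responses carry many more fields)
sample_payload = {
    "results": [
        {"primaryAccession": "P02144"},
        {"primaryAccession": "P02185"},
    ]
}

url = build_search_url("sperm whale myoglobin")
accessions = extract_accessions(sample_payload)
print(url)
print(accessions)
```

Encoding the query matters as soon as the search term contains spaces or punctuation, which a plain f-string would pass through unescaped.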

We will likely discuss REST APIs and web retrieval in our last lab!

The FASTA files are then concatenated into one large file, and I use SeqLike to load the sequences into a pandas DataFrame and perform an alignment.

This is an activity I'm still developing, but consider it a "bonus" notebook in case you finish the other one early!
Go through the code and see if you can figure out what each line does!

In [None]:
import requests
import os

from seqlike import SeqLike, aaSeqLike
from Bio import SeqIO
import pandas as pd

search_query = "myoglobin"

In [None]:
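# Let requests URL-encode the query, and stop early on an HTTP error
response = requests.get("https://rest.uniprot.org/uniprotkb/search", params={"query": search_query})
response.raise_for_status()
results = response.json()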

In [None]:
results["results"][0]["primaryAccession"]

In [None]:
pids = [entry["primaryAccession"] for entry in results["results"]]
pids

In [None]:
# Retrieve the FASTA file for each accession ID, skipping files already on disk

os.makedirs(f"fasta/{search_query}", exist_ok=True)

files = []
for pid in pids:

    file_name = f"fasta/{search_query}/{pid}.fasta"
    files.append(file_name)

    if not os.path.isfile(file_name):
        fasta = requests.get(f"https://rest.uniprot.org/uniprotkb/{pid}.fasta")
        # Avoid saving an error page as if it were a sequence
        fasta.raise_for_status()

        with open(file_name, "w") as f:
            f.write(fasta.text)


In [None]:
with open(f"fasta/{search_query}/structures.fasta", "w") as f:
    # concatenate files into one
    for file in files:
        with open(file) as structure_file:
            f.write(structure_file.read())
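
Plain concatenation works because a FASTA record is just a header line starting with `>` followed by sequence lines, so files can be joined end to end. Below is a minimal sketch using made-up toy records (the names and sequences are invented for illustration, not Uniprot data) written to a temporary directory. The trailing-newline guard is an extra precaution that the cell above does not take.

```python
import tempfile
from pathlib import Path

# Invented toy records, one per file
toy_records = {
    "toy1.fasta": ">toy1 invented example\nMGLSDGEW\n",
    "toy2.fasta": ">toy2 invented example\nMVLSEGEW",  # note: no trailing newline
}

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    # Write each toy record to its own file
    paths = []
    for name, text in toy_records.items():
        path = tmp_dir / name
        path.write_text(text)
        paths.append(path)

    # Join the files end to end
    combined = tmp_dir / "combined.fasta"
    with open(combined, "w") as out:
        for path in paths:
            text = path.read_text()
            # A missing trailing newline would glue the next header
            # onto the last sequence line, so add one if needed
            out.write(text if text.endswith("\n") else text + "\n")

    combined_text = combined.read_text()

# Each record contributes exactly one header line
n_records = sum(line.startswith(">") for line in combined_text.splitlines())
print(n_records)
```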

In [None]:
# define the standard amino acids
amino_acids = {'A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I', 
               'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V'}
# read in sequences
seqs = list(SeqIO.parse(f"fasta/{search_query}/structures.fasta", "fasta"))

# Flag each sequence that contains only standard amino acids (we only want those)
standard = [not set(sequence.seq).difference(amino_acids) for sequence in seqs]
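
The set-difference test above can be checked on a few toy strings. This is a small sketch with invented sequences. `is_standard` is a hypothetical helper name, the check is written with the equivalent subset operator `<=`, and `X` (unknown residue) and `U` (selenocysteine) stand in for the kind of non-standard letters a sequence record can contain.

```python
# The 20 standard amino acids, as one-letter codes
STANDARD_AMINO_ACIDS = set("ARNDCEQGHILKMFPSTWYV")


def is_standard(sequence):
    """True if every letter in the sequence is a standard amino acid."""
    return set(sequence) <= STANDARD_AMINO_ACIDS


# Invented toy sequences: one clean, one with X, one with U
toy_sequences = ["MGLSDGEW", "MGXLSDGEW", "MUVLSEGEW"]
flags = [is_standard(s) for s in toy_sequences]
print(flags)
```

A list of booleans like `flags` is what gets passed to the DataFrame as a row mask in the next cell.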

In [None]:
df = pd.DataFrame(
    {
        "names": [s.name for s in seqs],
        "seqs": [aaSeqLike(s) for s in seqs],
    }
)

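# Keep only the sequences made of standard amino acids
standard_seqs = df[standard].copy()

# Align the sequences through SeqLike's pandas accessor, then plot the alignment
standard_seqs["aligned"] = standard_seqs["seqs"].seq.align()
standard_seqs["aligned"].seq.plot()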