In [2]:
from Bio import SeqIO
import requests

### SwissProt query
Give the number of human SwissProt proteins that contain the AA sequence of the shortest human SwissProt protein twice in a window of 100 amino acids!

Our SwissProt query must specify the following:

1. The entry must be reviewed (i.e. SwissProt; otherwise the results include entries that were not manually annotated).
2. The species must be _Homo sapiens_.
3. It must be the first entry when the results are sorted by sequence length, shortest first.
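
The three criteria above could be encoded as query parameters along the following lines. This is only a sketch: the endpoint `https://rest.uniprot.org/uniprotkb/search`, the field names `reviewed` and `organism_id`, and the `sort`/`size` parameters are assumptions based on the current UniProt REST API, not the URL used in this notebook, and the helper `build_shortest_protein_url` is hypothetical. Check the names against the UniProt documentation before relying on them.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint of the current UniProt REST API (not taken from this notebook)
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_shortest_protein_url(taxon_id=9606):
    """Build a query URL for the shortest reviewed entry of one organism."""
    params = {
        # Criteria 1 and 2: reviewed entries of the given taxon (9606 = Homo sapiens)
        "query": f"(reviewed:true) AND (organism_id:{taxon_id})",
        # Criterion 3: sort by sequence length, ascending, and keep one result
        "sort": "length asc",
        "size": 1,
        "format": "fasta",
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_shortest_protein_url()
# Decode the query string again to inspect what would be sent
print(parse_qs(urlsplit(url).query))
```

Building the URL with `urlencode` avoids hand-escaping spaces and colons, which is easy to get wrong in a long query string.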

In [3]:
#Get the URL 

response = requests.get(URL)

#Write to .fasta 
open("smallest_protein.fna", "wb").write(response.content)


95713

Extract the AA sequence to perform a query of SwissProt.

In [14]:
#Parse the FASTA file; if it holds several records, the loop keeps the last one
for fasta in SeqIO.parse('smallest_protein.fna', 'fasta'):
    sequence = fasta.seq

print(sequence)

MATGLGEPVYGLSEDEGESRILRVKVVSGIDLAKKDIFGASDPYVKLSLYVADENRELALVQTKTIKKTLNPKWNEEFYFRVNPSNHRLLFEVFDENRLTRDDFLGQVDVPLSHLPTEDPTMERPYTFKDFLLRPRSHKSRVKGFLRLKMAYMPKNGGQDEENSDQRDDMEHGWEVVDSNDSASQHQEELPPPPLPPGWEEKVDNLGRTYYVNHNNRTTQWHRPSLMDVSSESDNNIRQINQEAAHRRFRSRRHISEDLEPEPSEGGDVPEPWETISEEVNIAGDSLGLALPPPPASPGSRTSPQELSEELSRRLQITPDSNGEQFSSLIQREPSSRLRSCSVTDAVAEQGHLPPPSAPAGRARSSTVTGGEEPTPSVAYVHTTPGLPSGWEERKDAKGRTYYVNHNNRTTTWTRPIMQLAEDGASGSATNSNNHLIEPQIRRPRSLSSPTVTLSAPLEGAKDSPVRRAVKDTLSNPQSPQPSPYNSPKPQHKVTQSFLPPGWEMRIAPNGRPFFIDHNTKTTTWEDPRLKFPVHMRSKTSLNPNDLGPLPPGWEERIHLDGRTFYIDHNSKITQWEDPRLQNPAITGPAVPYSREFKQKYDYFRKKLKKPADIPNRFEMKLHRNNIFEESYRRIMSVKRPDVLKARLWIEFESEKGLDYGGVAREWFFLLSKEMFNPYYGLFEYSATDNYTLQINPNSGLCNEDHLSYFTFIGRVAGLAVFHGKLLDGFFIRPFYKMMLGKQITLNDMESVDSEYYNSLKWILENDPTELDLMFCIDEENFGQTYQVDLKPNGSEIMVTNENKREYIDLVIQWRFVNRVQKQMNAFLEGFTELLPIDLIKIFDENELELLMCGLGDVDVNDWRQHSIYKNGYCPNHPVIQWFWKAVLLMDAEKRIRLLQFVTGTSRVPMNGFAELYGSNGPQLFTIEQWGSPEKLPRAHTCFNRLDLPPYETFEDLREKLLMAVENAQGFEGVD


We are looking for all human proteins that contain EI at least __twice__ within a window of 100 AA. To do this, we download the FASTA records of all reviewed human proteins.

In [41]:
#Specify the query URL (reviewed human entries, FASTA format)
URL = 'https://www.uniprot.org/uniprot/?query=*&format=fasta&fil=organism:%22Homo%20sapiens%20(Human)%20[9606]%22%20AND%20reviewed:yes'
response = requests.get(URL)

#Write to .fasta 
open("human_proteome.fna", "wb").write(response.content)


KeyboardInterrupt: 

Open the file to get some basic information and ensure correct data acquisition.

20386
20386


We downloaded the FASTA records for 20386 proteins. This corresponds to the number of human entries on the [SwissProt page](https://www.uniprot.org/uniprot/?query=reviewed:yes). We also verified that each name is unique by adding the names to a set and comparing its size with the record count.
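
The cell that produced the two counts is not shown, so the following is only a sketch of how such a check could look. It uses plain string handling instead of `SeqIO` to stay self-contained; the helper `fasta_ids` and the two example records are hypothetical.

```python
def fasta_ids(fasta_text):
    """Return the identifier (first word after '>') of every FASTA record."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            # A bare '>' header has no identifier; store an empty string
            ids.append(words[0] if words else "")
    return ids


# Two made-up records, only for illustration
example = ">sp|X00001|AAA_HUMAN first\nMKTAYIAK\nQRQISFVK\n>sp|X00002|BBB_HUMAN second\nEI\n"
ids = fasta_ids(example)

print(len(ids))       # number of records
print(len(set(ids)))  # number of unique identifiers
```

If both numbers agree, no identifier occurs more than once.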

We can now perform a sliding window check for each individual protein. 

__Important__: Not all proteins are at least 100 AA long. If a shorter protein contains the pattern twice, we also count it as valid, since both occurrences necessarily lie within 100 AA of each other.
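
The window rule can also be stated in terms of occurrence positions: two occurrences fit into one window exactly when the distance from the start of the first to the end of the second is at most 100. The sketch below uses that formulation as a possible cross-check. The function `has_two_in_window` is hypothetical, and it assumes a pattern that cannot overlap itself, which holds for `EI`.

```python
def has_two_in_window(seq, pattern="EI", window=100):
    """True if two occurrences of `pattern` fit together inside `window` residues."""
    # Collect start positions of non-overlapping occurrences (same as str.count)
    positions = []
    start = seq.find(pattern)
    while start != -1:
        positions.append(start)
        start = seq.find(pattern, start + len(pattern))

    # Consecutive occurrences are the closest pairs, so checking them suffices:
    # the span from the first start to the second end must fit in the window
    return any(
        later + len(pattern) - earlier <= window
        for earlier, later in zip(positions, positions[1:])
    )


print(has_two_in_window("EI" + "A" * 96 + "EI"))  # span of exactly 100 residues
print(has_two_in_window("EI" + "A" * 97 + "EI"))  # span of 101 residues
```

Because it scans each sequence once instead of re-counting in every window, this variant also covers proteins shorter than 100 AA without a separate branch.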

In [39]:
#Define the pattern & output list
p = 'EI'
matches = []

#Define helper function to determine if the pattern occurs at least twice
def valid_protein(seq):
    return seq.count(p) >= 2


#Open file
for fasta in SeqIO.parse('human_proteome.fna', 'fasta'):

    #Extract name and sequence
    name, sequence = fasta.id, str(fasta.seq)

    #Sliding window of 100 AA; proteins of at most 100 AA form a single window
    for i in range(max(1, len(sequence) - 99)):

        #Stop at the first valid window so each protein is counted only once
        if valid_protein(sequence[i:i+100]):
            matches.append(name)
            break


#Conclusion
solution = len(matches)

With all computations performed, we can draw a conclusion.

In [40]:
print(f'Out of {count} human SwissProt proteins, {solution} proteins ({solution/count*100}%) contained the AA-sequence of the smallest human protein at least twice in the span of 100 amino acids.')


Out of 20386 human SwissProt proteins, 5101 proteins (25.022073972333953%) contained the AA-sequence of the smallest human protein at least twice in the span of 100 amino acids.


The answer to the question is 5101.