# Exploring UniProt data with Python


Emma Hatton-Ellis, 28th February 2019.

Bioinformatics Resources for Protein Biology, API session.

### Sources of UniProt data
1. UniProt REST API
    - Provides access to core UniProt data, including search and database mapping services.
     - Link and documentation: https://www.uniprot.org/help/programmatic_access
2. EBI Proteins API
    - In addition to core UniProt data, the Proteins API also allows programmatic access to linked datasets such as reference genome mappings, variant mapping and proteomics peptide mappings.
    - Link and documentation: https://www.ebi.ac.uk/proteins/api/doc/index.html
3. FTP download site
    - The FTP site contains complete datasets for bulk download. If you know that you need to work with very large amounts of data then this may be the best option.
    - Link and documentation: https://www.uniprot.org/downloads

Since this is only a brief introduction, the examples in the rest of this notebook will focus on the Proteins API.

If you would like to learn more about programmatic access to UniProt data, then you may be interested in this webinar:

https://www.ebi.ac.uk/training/online/course/accessing-embl-ebi-resources-and-tools-programmatically/uniprot

Next, on to some practical examples...

### Overview of the Proteins API
The Proteins API provides five services, each with several endpoints whose names are largely self-explanatory:
1. Proteins
2. Proteomes
3. Taxonomy
4. Coordinates
5. UniParc

### Interactive use of the Proteins API
The Proteins API has a nice interactive feature which allows you to experiment with all the different endpoints and parameters, before you even start writing any code.

Click on the following link and change the "Response Content Type" to "text/x-fasta" and then type "P12345" into the "accession" field. Finally, hit the "Try it out" button (you may need to scroll down a bit to find this) and take a look at the response data.

https://www.ebi.ac.uk/proteins/api/doc/index.html#!/proteins/search

You can even find sample code snippets (in bash/curl, Perl, Python, Ruby, Java and R) in the section "Request Sample Code".


### Example 1: Retrieve mouse entries with unique proteomics peptide mappings

First, we import the Python "requests" library, which allows simple access to HTTP endpoints.

You can find out more about the requests library here: http://docs.python-requests.org/en/master/

In [5]:
import requests

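To search for proteomics data, we need to set the URL to the correct query endpoint. The question mark at the end separates the path from the query string. When the parameters are passed through the "params" argument, requests adds this separator itself, so the trailing question mark is harmless but not strictly required.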

In [7]:
url = 'https://www.ebi.ac.uk/proteins/api/proteomics?'

Next, define the headers (used to set the data format, e.g. FASTA, XML or JSON) and the parameters for the search.

The "offset" and "size" parameters control where the returned records start and how many are returned. Here they are set explicitly to their default values of 0 and 100, respectively. A size of 100 is fine for demonstration and testing, but for real analysis a larger value such as 50,000 avoids making lots of consecutive requests to the server.

In [12]:
headers = {'Accept': 'application/json'}

params = {
    'taxid': 10090,
    'unique': 'true',
    'offset': 0,
    'size': 100,
}
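To see how these parameters end up in the request, the query string can be assembled by hand with the standard library. This is only an illustrative sketch: requests performs an equivalent encoding internally when it is given "params", and the name `full_url` is introduced here purely for demonstration.

```python
from urllib.parse import urlencode

url = 'https://www.ebi.ac.uk/proteins/api/proteomics?'
params = {'taxid': 10090, 'unique': 'true', 'offset': 0, 'size': 100}

# Join the endpoint and the encoded parameters with a single "?".
# rstrip('?') guards against doubling the separator.
full_url = f"{url.rstrip('?')}?{urlencode(params)}"
print(full_url)
```

Pasting the printed URL into a browser is a quick way to check a query before scripting it.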

Finally, call the requests "get" method to retrieve the data.

In [10]:
r = requests.get(url, headers=headers, params=params)

One really nice feature of the requests library is that it can easily convert JSON data into native Python objects (dictionaries and lists), which makes it simple to access and filter the data fields in your scripts.

In [17]:
data = r.json()
len(data)

100
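Getting back exactly 100 records suggests that the result was cut off at the page size and that more records are available. A minimal sketch of paging through the full result set follows. It assumes that "offset" counts records rather than pages, which should be checked against the API documentation. The names `iter_records` and `fake_fetch` are hypothetical, and a stand-in function replaces the real request so that the example runs without network access.

```python
def iter_records(fetch_page, size=100):
    """Yield records page by page until a short (or empty) page is returned."""
    offset = 0
    while True:
        page = fetch_page(offset, size)
        yield from page
        if len(page) < size:
            break
        # Assumption: offset is a record index, not a page number.
        offset += size


# Stand-in for the real request. With requests, fetch_page could instead be:
#   lambda offset, size: requests.get(
#       url, headers=headers,
#       params={**params, 'offset': offset, 'size': size}).json()
fake_records = list(range(250))


def fake_fetch(offset, size):
    return fake_records[offset:offset + size]


all_records = list(iter_records(fake_fetch, size=100))
print(len(all_records))
```

Passing the fetching function in as an argument keeps the paging logic separate from the HTTP call, which also makes it easy to test.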

The data structure is actually a list of dictionaries, where each list item represents a single record. Let's take a look at the information available in the first record retrieved:

In [23]:
data[0]

{'accession': 'A0A023T778',
 'entryName': 'A0A023T778_MOUSE',
 'sequence': 'MSMGSDFYLRYYVGHKGKFGHEFLEFEFRPDGKLRYANNSNYKNDVMIRKEAYVHKSVMEELKRIIDDSEITKEDDALWPPPDRVGRQELEIVIGDEHISFTTSKIGSLIDVNQSKDPEGLRVFYYLVQDLKCLVFSLIGLHFKIKPI',
 'sequenceChecksum': '2A47422841D17AF3',
 'taxid': 10090,
 'features': [{'type': 'PROTEOMICS',
   'begin': '2',
   'end': '10',
   'xrefs': [{'name': 'Proteomes',
     'id': 'UP000000589',
     'url': 'https://www.uniprot.org/proteomes/UP000000589'}],
   'evidences': [{'code': 'ECO:0000213',
     'source': {'name': 'PeptideAtlas',
      'id': 'A0A023T778',
      'url': 'ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/proteomics_mapping/README'}},
    {'code': 'ECO:0000213',
     'source': {'name': 'MaxQB',
      'id': 'A0A023T778',
      'url': 'ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/proteomics_mapping/README'}}],
   'peptide': 'SMGSDFYLR',
   'unique': True}]}
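Note that "begin" and "end" are returned as strings. As a quick sanity check, the sketch below converts them to integers and confirms that they locate the peptide within the protein sequence, treating them as 1-based, inclusive positions. The record is a trimmed copy of the one shown above; `mapped` is a name introduced for illustration.

```python
record = {
    'entryName': 'A0A023T778_MOUSE',
    'sequence': 'MSMGSDFYLRYYVGHKGKFGHEFLEFEFRPDGKLRYANNSNYKNDVMIRKEAYVHKSVMEELKRIIDDSEITKEDDALWPPPDRVGRQELEIVIGDEHISFTTSKIGSLIDVNQSKDPEGLRVFYYLVQDLKCLVFSLIGLHFKIKPI',
    'features': [{'begin': '2', 'end': '10', 'peptide': 'SMGSDFYLR'}],
}

for feature in record['features']:
    begin, end = int(feature['begin']), int(feature['end'])
    # 1-based inclusive coordinates -> 0-based half-open Python slice.
    mapped = record['sequence'][begin - 1:end]
    print(mapped, mapped == feature['peptide'])
```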

Finally, we can loop through all the downloaded records and print out some selected details (in this case entry name, plus peptide sequence and position).

In [34]:
for record in data:
    print(record['entryName'])
    for feature in record['features']:
        print(f"\t{feature['peptide']}, {feature['begin']}, {feature['end']}")

A0A023T778_MOUSE
	SMGSDFYLR, 2, 10
A0A067XG46_MOUSE
	AAIIKPTCIK, 111, 120
	AESESLVPDTGAVFTFGK, 40, 57
	FAENIPSK, 60, 67
	LDEVLKEEDSASLLQR, 916, 931
	LFMWGDNSEGQIGLEDK, 199, 215
	LGLPNELLMNHR, 260, 271
	LLDFSPIQK, 552, 560
	LYMFGSNNWGQLGLGSK, 94, 110
	NDIPICLSCGDEHTAIVTGNNK, 72, 93
	QLSAGANTSAALTEDGK, 182, 198
	SPSESMEPLDSDYFEDK, 497, 513
	VIQVACGGGHTVVLTEK, 283, 299
	VLGIPER, 276, 282
	FEDVYEPYISTGSFSINDLSPR, 415, 436
A0A067XG49_MOUSE
	AQQMVEILSDENR, 24, 36
	AQQSASYQPMPADPFAMVSR, 4, 23
	DCSTQTERGPESTK, 451, 464
	DQALSNAQAK, 166, 175
	DTTVISHSPNTSYDTALEAR, 297, 316
	EEEEILMANK, 320, 329
	HFALDAAATVAAQR, 283, 296
	NLRQELDGCYEK, 37, 48
	RCLDMEGR, 330, 337
	VEPVPSTPSPVPPSTPLLSAHSK, 424, 446
	YLEENVMR, 275, 282
	SLMSISNAGSGLLAHSSTLTGAPIMEEK, 377, 404
	TLHAQIIEK, 340, 348
	TPIQILGQEPDAEMVEYLI, 719, 737
A0A067XG51_MOUSE
	ENTEPEEPQLK, 67, 77
A0A067XG53_MOUSE
	ADAGFVYSEAVASHYMR, 81, 97
	AQFEYDPAK, 564, 572
	AQFEYDPAKDDLIPCK, 564, 579
	AVSQVLDSLEEIHALTDCSEK, 314, 334
	CINRETGQQFAVK, 3, 15
	DDHNW
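In the output above the peptides are not listed in order of their position. Because "begin" is a string, sorting on it directly would place '111' before '40', so a small sketch of sorting numerically may be useful. The three features are copied from the A0A067XG46_MOUSE output, on the assumption that their positions are strings as in the first record. `by_position` is a name introduced here.

```python
features = [
    {'peptide': 'AAIIKPTCIK', 'begin': '111', 'end': '120'},
    {'peptide': 'AESESLVPDTGAVFTFGK', 'begin': '40', 'end': '57'},
    {'peptide': 'FAENIPSK', 'begin': '60', 'end': '67'},
]

# Convert to int in the sort key so that 40 sorts before 111.
by_position = sorted(features, key=lambda f: int(f['begin']))

for feature in by_position:
    print(f"{feature['begin']}-{feature['end']}\t{feature['peptide']}")
```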

### Example 2: Mapping protein data to genomic coordinates
UniProt entry [O95822](https://www.uniprot.org/uniprot/O95822) is an enzyme which contains two active sites at amino acid positions 329 and 423. This example shows how these functional features can be precisely mapped to the reference genome.

First we define variables for the UniProt accession and active site position.

In [37]:
accession = 'O95822'
active_site_position = 329

Next we construct the URL for the coordinates location endpoint, inserting the variables defined previously:

In [38]:
url = f'https://www.ebi.ac.uk/proteins/api/coordinates/location/{accession}:{active_site_position}'
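Since the entry has two active sites, the same pattern can be used to build one URL per position. The following is just a sketch; `active_site_positions`, `base` and `urls` are names introduced for illustration, and the second position (423) is taken from the description of the entry above.

```python
accession = 'O95822'
active_site_positions = [329, 423]

base = 'https://www.ebi.ac.uk/proteins/api/coordinates/location'

# One URL per active site, following the accession:position pattern.
urls = [f'{base}/{accession}:{position}' for position in active_site_positions]

for u in urls:
    print(u)
```

Each URL in the list could then be requested in turn in the same way as a single URL.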

The final step is to retrieve the data using requests, similar to the previous example. We can also re-use the headers that we defined earlier.

In [40]:
r = requests.get(url, headers=headers)
data = r.json()

Inspecting the data, we can see that the first active site maps to the codon at genomic coordinates 83914992 - 83914994.

In [41]:
data

{'locations': [{'accession': 'O95822',
   'taxid': 9606,
   'chromosome': '16',
   'ensemblTranslationId': 'ENSP00000262430',
   'proteinStart': 329,
   'geneStart': 83914992,
   'proteinEnd': 329,
   'geneEnd': 83914994}]}

In [43]:
data['locations'][0]['geneStart'], data['locations'][0]['geneEnd']

(83914992, 83914994)
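A simple consistency check is that a single amino acid should map to a three-base codon. The sketch below uses the values shown above. It takes the absolute difference in case genes on the reverse strand report a start greater than the end, which is an assumption rather than something confirmed here. Ranges that cross an exon boundary would not be expected to pass this check.

```python
location = {
    'proteinStart': 329,
    'proteinEnd': 329,
    'geneStart': 83914992,
    'geneEnd': 83914994,
}

# Both ranges are inclusive, hence the +1.
n_residues = location['proteinEnd'] - location['proteinStart'] + 1
# abs() is a precaution for reverse-strand genes (assumption, see above).
n_bases = abs(location['geneEnd'] - location['geneStart']) + 1

print(n_residues, n_bases, n_bases == 3 * n_residues)
```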

### Example 3: Find all variants associated with the disease Usher syndrome, type 1

Type 1 Usher syndrome is in the OMIM database with accession 276900: https://www.omim.org/entry/276900

We can pass the OMIM accession as a parameter to the variation search endpoint. The other steps are much the same as in the previous examples (and hopefully familiar by now).

In [45]:
url = 'https://www.ebi.ac.uk/proteins/api/variation?'

params = {'omim': 276900}

r = requests.get(url, params=params, headers=headers)

data = r.json()

One way to check the number of results from your search is to inspect the headers of the response object. Here we have a relatively small number of records returned. 

In [49]:
r.headers['X-Pagination-TotalRecords']

'52'
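The header value is a string, so it needs converting before it can be used in arithmetic. As a rough sketch, the number of requests needed to fetch everything can be estimated from the total and the page size. A plain dictionary stands in for `r.headers` here so that the example runs on its own; `response_headers`, `page_size` and `n_requests` are illustrative names, and 100 is the default page size mentioned earlier.

```python
import math

# Stand-in for r.headers from the response above.
response_headers = {'X-Pagination-TotalRecords': '52'}

total_records = int(response_headers['X-Pagination-TotalRecords'])
page_size = 100  # default "size" for the API

# Round up: a partly filled final page still costs one request.
n_requests = math.ceil(total_records / page_size)

print(total_records, n_requests)
```

With 52 records and a page size of 100, a single request is enough.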

Loop through the records and print out the entry and gene name, followed by all the variants and the amino acid change.

In [60]:
for record in data[:5]: # just look at the first 5 records here, for brevity
    print(f"{record['entryName']}, {record['geneName']}")
    for feature in record['features']:
        print(f"\t{feature['wildType']} -> {feature['alternativeSequence']}, {feature['begin']}, {feature['end']}")

A0A087WT71_HUMAN, MYO7A
	V -> A, 10, 10
	G -> R, 25, 25
	A -> E, 26, 26
	E -> *, 45, 45
	W -> *, 47, 47
	P -> R, 61, 61
	L -> P, 73, 73
	Y -> F, 95, 95
	T -> M, 96, 96
	V -> G, 105, 105
	R -> S, 120, 120
	I -> T, 127, 127
	H -> N, 133, 133
	H -> Y, 133, 133
	I -> N, 134, 134
	R -> *, 150, 150
	C -> Y, 153, 153
	A -> T, 162, 162
	G -> R, 163, 163
	T -> M, 165, 165
	L -> P, 196, 196
	R -> C, 206, 206
	S -> G, 211, 211
	R -> C, 212, 212
	R -> H, 212, 212
	G -> R, 214, 214
	D -> N, 218, 218
	Q -> *, 234, 234
	R -> C, 241, 241
	R -> H, 241, 241
	R -> H, 302, 302
	Y -> *, 333, 333
	L -> P, 366, 366
	L -> V, 375, 375
	R -> C, 378, 378
	R -> H, 378, 378
	G -> R, 379, 379
	T -> M, 381, 381
	R -> C, 395, 395
	Y -> C, 403, 403
	V -> A, 411, 411
	K -> *, 420, 420
	E -> Q, 450, 450
	A -> V, 457, 457
	H -> R, 468, 468
	I -> T, 499, 499
	K -> Q, 515, 515
	G -> D, 519, 519
	Q -> *, 531, 531
	P -> H, 540, 540
	H -> P, 574, 574
	D -> N, 576, 576
	R -> W, 616, 616
	R -> H, 623, 623
	R -> *, 634, 634
	R -