# Unipressed examples running via MyBinder

There's a Python package, [Unipressed](https://multimeric.github.io/Unipressed/), by Michael Milton ([@multimeric](https://twitter.com/multimeric)) that provides programmatic access to UniProt's new REST API.  

This notebook combines some of my examples from StackOverflow and Biostars so that they work in sessions served by MyBinder.
It is designed so you can easily edit the examples and run your own versions to test things out or collect useful information.  
Be aware, though, that a MyBinder-served session has limited computational resources. You may easily exceed what is possible here, in which case take the ideas and code and move to somewhere with more resources. Additionally, MyBinder blocks FTP ports to prevent abuse, so not all routes for retrieving data work.

If you do make something useful in your session, grab the code and save it on your local machine, or save the current notebook and download it so that you can upload it to a later session and pick up where you left off. The same goes for any data you generate! (You'll need to run the installs at the top every time unless I have set up the environment to include them at session start-up. I HAVE NOT DONE THIS YET.)  

---------

### Prepare environment in session by installing packages needed

In [1]:
%pip install unipressed pandas

Collecting unipressed
  Downloading unipressed-1.4.0-py3-none-any.whl.metadata (21 kB)
Collecting pandas
  Downloading pandas-2.2.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Collecting numpy>=1.23.2 (from pandas)
  Downloading numpy-2.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2024.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading unipressed-1.4.0-py3-none-any.whl (35 kB)
Downloading pandas-2.2.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)
Downloading numpy-2.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.3 MB)
Downloading tzdat

------

## Example 1: Get PDB files from a list of UniProt IDs

This example combines Unipressed, run via MyBinder, with a [Biostars reply by Mensur Dlakic](https://www.biostars.org/p/9602308/#9602314).  
One of the entries in [the original Biostars post](https://www.biostars.org/p/9602308/#9602308), Q8N183, has not been solved experimentally at this time, so the current UniProt query doesn't link it to a PDB entry. I added some code to collect such entries and get the corresponding AlphaFold models, which are currently hosted & available at RCSB if you search the term 'PDB' combined with a UniProt ID.

Paste your list, with each UniProt ID on a separate line, between the `'''` in the cell below. If you don't have a list yet, that is okay: a demo list is already provided there to get you going, and you should start with that:

In [1]:
l='''
O75771
Q8N183
A0A0M3KKX3
'''

In [2]:
# This cell takes the item listing and makes a Python list object out of it
id_list = l.split("\n")
# the next line drops the empty strings left by blank lines (such as the first and last lines of the pasted block)
id_list = [x for x in id_list if x]
id_list

['O75771', 'Q8N183', 'A0A0M3KKX3']
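If a pasted list might carry stray spaces or Windows line endings, a slightly more forgiving parser can help. This is a minimal sketch, not part of the original notebook; `parse_id_block` is a hypothetical helper name.

```python
def parse_id_block(text):
    """Return the non-empty, whitespace-trimmed lines of a pasted block."""
    # splitlines() copes with both "\n" and "\r\n" line endings
    return [line.strip() for line in text.splitlines() if line.strip()]


demo = '''
O75771
Q8N183
A0A0M3KKX3
'''
print(parse_id_block(demo))  # ['O75771', 'Q8N183', 'A0A0M3KKX3']
```

The result can be used anywhere `id_list` is used in the cells that follow.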

In [3]:
from unipressed import IdMappingClient
import time
request = IdMappingClient.submit(
    source="UniProtKB_AC-ID", dest="PDB", ids=id_list
)
time.sleep(3.0)
results_list = list(request.each_result())
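The fixed `time.sleep(3.0)` above is a guess at how long the mapping job takes. One alternative is to poll until the job reports completion. The sketch below assumes the object returned by `IdMappingClient.submit` has a `get_status()` method that returns the string `"FINISHED"` when results are ready; check the Unipressed documentation for your installed version before relying on it. `wait_until_finished` is a hypothetical helper, written against any object with that method.

```python
import time


def wait_until_finished(job, poll_seconds=1.0, timeout_seconds=60.0):
    """Poll job.get_status() until it reports FINISHED or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = job.get_status()  # assumed method, see lead-in
        if status == "FINISHED":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_seconds} s")
        time.sleep(poll_seconds)
```

Under that assumption, `wait_until_finished(request)` would replace the `time.sleep(3.0)` line, with `list(request.each_result())` called afterwards.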

In [4]:
results_list 

[{'from': 'O75771', 'to': '2KZ3'},
 {'from': 'O75771', 'to': '8FAZ'},
 {'from': 'O75771', 'to': '8GBJ'},
 {'from': 'O75771', 'to': '8OUY'},
 {'from': 'O75771', 'to': '8OUZ'},
 {'from': 'A0A0M3KKX3', 'to': '4U7N'},
 {'from': 'A0A0M3KKX3', 'to': '4U7O'},
 {'from': 'A0A0M3KKX3', 'to': '4ZKI'}]

In [5]:
import pandas as pd
results_df = pd.DataFrame(results_list)
results_df

|   | from | to |
|---|------|----|
| 0 | O75771 | 2KZ3 |
| 1 | O75771 | 8FAZ |
| 2 | O75771 | 8GBJ |
| 3 | O75771 | 8OUY |
| 4 | O75771 | 8OUZ |
| 5 | A0A0M3KKX3 | 4U7N |
| 6 | A0A0M3KKX3 | 4U7O |
| 7 | A0A0M3KKX3 | 4ZKI |
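Because one UniProt accession can map to several PDB entries, it is sometimes handy to see the targets grouped per accession. A minimal sketch in plain Python, working on the same list-of-dicts shape as `results_list`; `group_targets` is a hypothetical helper name, and the sample rows are a subset of the results shown above.

```python
from collections import defaultdict


def group_targets(results):
    """Map each 'from' accession to the list of its 'to' identifiers."""
    grouped = defaultdict(list)
    for row in results:
        grouped[row["from"]].append(row["to"])
    return dict(grouped)


sample = [
    {'from': 'O75771', 'to': '2KZ3'},
    {'from': 'O75771', 'to': '8FAZ'},
    {'from': 'A0A0M3KKX3', 'to': '4U7N'},
]
print(group_targets(sample))
# {'O75771': ['2KZ3', '8FAZ'], 'A0A0M3KKX3': ['4U7N']}
```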


Identify those that failed to get experimental PDB matches:

In [6]:
ids_with_experimental_structures = results_df['from'].unique() # Note cannot use `results_df.from.unique()` because `from` is a Python keyword used in imports
ids_with_no_experimental_structure_exists = list(set(id_list) - set(ids_with_experimental_structures))
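The set difference above does not preserve the order in which IDs were submitted. If order matters, an order-preserving variant might look like the sketch below, which works directly on the list of result dictionaries; `ids_without_mapping` is a hypothetical helper name.

```python
def ids_without_mapping(submitted_ids, results):
    """Return submitted IDs that never appear as a 'from' value, in input order."""
    mapped = {row["from"] for row in results}
    return [uid for uid in submitted_ids if uid not in mapped]


submitted = ['O75771', 'Q8N183', 'A0A0M3KKX3']
results = [
    {'from': 'O75771', 'to': '2KZ3'},
    {'from': 'A0A0M3KKX3', 'to': '4U7N'},
]
print(ids_without_mapping(submitted, results))  # ['Q8N183']
```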

Get the structures for the experimental matches:  
First we'll take advantage of Python to do this. Then we'll show the approach that more closely matches the route outlined [by Mensur Dlakic in his Biostars reply](https://www.biostars.org/p/9602308/#9602314), where the contents of a text file are piped into the retrieval command.

First...   
Staying mostly with in-memory Python objects and employing Jupyter conveniences. In other words, mainly without making a separate text file:

In [7]:
pdb_ids = results_df['to'].unique() # limit to unique because no point repeating redundant entries
for an_id in pdb_ids:
    !curl -OL https://files.rcsb.org/download/{an_id}.cif.gz
    !gunzip {an_id}.cif.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  508k  100  508k    0     0   717k      0 --:--:-- --:--:-- --:--:--  717k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  204k  100  204k    0     0   327k      0 --:--:-- --:--:-- --:--:--  327k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  201k  100  201k    0     0   141k      0  0:00:01  0:00:01 --:--:--  141k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  451k  100  451k    0     0   678k      0 --:--:-- --:--:-- --:--:--  678k
  % Total    % Received % Xferd  Average Speed   Tim

Or if you want the traditional `pdb` files:

In [8]:
pdb_ids = results_df['to'].unique() # limit to unique because no point repeating redundant entries
for an_id in pdb_ids:
    !curl -OL https://files.rcsb.org/download/{an_id}.pdb.gz
    !gunzip {an_id}.pdb.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  425k  100  425k    0     0   324k      0  0:00:01  0:00:01 --:--:--  324k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  153k  100  153k    0     0   145k      0  0:00:01  0:00:01 --:--:--  145k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  155k  100  155k    0     0   264k      0 --:--:-- --:--:-- --:--:--  264k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  292k  100  292k    0     0   253k      0  0:00:01  0:00:01 --:--:--  253k
  % Total    % Received % Xferd  Average Speed   Tim
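The `!curl` and `!gunzip` lines rely on Jupyter's shell escape. Outside a notebook, roughly the same thing can be done with the standard library alone. This is a sketch rather than the notebook's own method: the URL pattern is the one used in the cells above, while `rcsb_url` and `download_structure` are hypothetical helper names.

```python
import gzip
import urllib.request

RCSB_DOWNLOAD = "https://files.rcsb.org/download"  # same host as the curl calls above


def rcsb_url(pdb_id, fmt="cif"):
    """Build the gzipped download URL for one PDB entry."""
    if fmt not in ("cif", "pdb"):
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{RCSB_DOWNLOAD}/{pdb_id}.{fmt}.gz"


def download_structure(pdb_id, fmt="cif"):
    """Fetch one entry, decompress it in memory, and write <id>.<fmt>."""
    with urllib.request.urlopen(rcsb_url(pdb_id, fmt)) as response:
        data = gzip.decompress(response.read())
    out_name = f"{pdb_id}.{fmt}"
    with open(out_name, "wb") as handle:
        handle.write(data)
    return out_name
```

Looping `download_structure(an_id)` over `pdb_ids` would mirror the cell above; pass `fmt="pdb"` for the traditional format.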

For those without experimental structures, we can get the **predicted AlphaFold models** in `cif` format:

In [9]:
for uni_id in ids_with_no_experimental_structure_exists:
    !curl -OL  https://alphafold.ebi.ac.uk/files/AF-{uni_id}-F1-model_v4.cif

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  168k    0  168k    0     0   594k      0 --:--:-- --:--:-- --:--:--  596k


Or the **predicted AlphaFold models** in `PDB` format:

In [10]:
for uni_id in ids_with_no_experimental_structure_exists:
    !curl -OL  https://alphafold.ebi.ac.uk/files/AF-{uni_id}-F1-model_v4.pdb

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  116k  100  116k    0     0   526k      0 --:--:-- --:--:-- --:--:--  527k
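Not every UniProt accession necessarily has a model file at that address, and `curl -O` will happily save an error page. A hedged sketch of a download that skips missing models instead: the URL pattern and the `v4` suffix come from the cells above, whereas `alphafold_url`, `try_download`, and the `version` parameter are hypothetical names introduced for illustration.

```python
import urllib.error
import urllib.request


def alphafold_url(uniprot_id, fmt="cif", version=4):
    """Build the model URL using the pattern from the curl calls above."""
    return (
        "https://alphafold.ebi.ac.uk/files/"
        f"AF-{uniprot_id}-F1-model_v{version}.{fmt}"
    )


def try_download(url, out_name):
    """Save url to out_name; return False instead of raising on an HTTP error."""
    try:
        # urlopen runs first, so no output file is created if the request fails
        with urllib.request.urlopen(url) as response, open(out_name, "wb") as handle:
            handle.write(response.read())
    except urllib.error.HTTPError as err:
        print(f"skipped {url}: HTTP {err.code}")
        return False
    return True
```

For example, `try_download(alphafold_url(uni_id), f"AF-{uni_id}.cif")` could be called inside the loop over `ids_with_no_experimental_structure_exists`.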


Keep in mind: **THOSE ARE ONLY PREDICTIONS.**

Second...  
To more closely match the [Biostars reply by Mensur Dlakic](https://www.biostars.org/p/9602308/#9602314), pipe a file of the PDB IDs to the retrieval command. Note, though, that HERE IT IS DONE WITHOUT FTP SINCE FTP IS BLOCKED IN MYBINDER SESSIONS.  
To do that, try the next two cells:

In [11]:
pdb_ids = results_df['to'].unique() # limit to unique because no point repeating redundant entries
# save the ids from `pdb_ids` to a file `pdb_ids.txt` to better match Mensur Dlakic's approach
with open("pdb_ids.txt", 'w') as f:
    f.write("\n".join(pdb_ids))

Note there should now be a file `pdb_ids.txt` in this session if you look in the file navigator pane to the left. If you already had a list of your own PDB entry ids with one on each line and no header, you could replace the content of `pdb_ids.txt` in this session with yours before running the next cell.

Running the next cell will actually do the downloading now that we have set things up:

In [12]:
!cat pdb_ids.txt | xargs -I {} wget -q -o /dev/null https://files.rcsb.org/download/"{}".pdb # `-I {}` replaces the deprecated `-i`; based on https://www.rcsb.org/docs/programmatic-access/file-download-services#file-access-urls-
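To preview what that pipeline will request before running it, the ID file can be turned into a list of URLs in Python. A minimal sketch, assuming a file with one PDB ID per line and no header, as described above; `urls_from_id_file` is a hypothetical helper name.

```python
from pathlib import Path


def urls_from_id_file(path, fmt="pdb"):
    """Read one PDB ID per line and return the matching RCSB download URLs."""
    lines = Path(path).read_text().splitlines()
    ids = [line.strip() for line in lines if line.strip()]
    return [f"https://files.rcsb.org/download/{pdb_id}.{fmt}" for pdb_id in ids]
```

Calling `urls_from_id_file("pdb_ids.txt")` lists the addresses without downloading anything.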

Now feel free to edit the code here and re-run these steps to focus on identifiers related to your interests.  
If you make anything useful, be sure to download & save this notebook to your own local machine, along with any useful files that got downloaded as a result of running the code.

Or continue on to look at the other sections.

--------

## Example 2: Get HGNC (HUGO Gene Nomenclature Committee) IDs from a list of UniProt IDs


Paste your list, with each UniProt ID on a separate line, between the `'''` in the cell below. If you don't have a list yet, that is okay: a demo list is already provided there to get you going, and you should start with that:

In [1]:
l='''
Q5VV41
O14933
Q8NFH8
'''

In [2]:
# This cell takes the item listing and makes a Python list object out of it
id_list = l.split("\n")
# the next line drops the empty strings left by blank lines (such as the first and last lines of the pasted block)
id_list = [x for x in id_list if x]
id_list

['Q5VV41', 'O14933', 'Q8NFH8']

In [3]:
from unipressed import IdMappingClient
import time
request = IdMappingClient.submit(
    source="UniProtKB_AC-ID", dest="HGNC", ids=id_list
)
time.sleep(3.0)
results_list = list(request.each_result())

In [4]:
results_list 

[{'from': 'Q5VV41', 'to': 'HGNC:15515'},
 {'from': 'O14933', 'to': 'HGNC:12490'},
 {'from': 'Q8NFH8', 'to': 'HGNC:9963'}]
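For downstream lookups, the list of `from`/`to` pairs can be collapsed into a dictionary, optionally without the `HGNC:` prefix. A minimal sketch using the results shown above; `hgnc_lookup` is a hypothetical helper name, and it assumes each accession maps to a single HGNC ID (a later duplicate would overwrite an earlier one).

```python
def hgnc_lookup(results, strip_prefix=True):
    """Map each UniProt accession to its HGNC ID, optionally dropping 'HGNC:'."""
    lookup = {}
    for row in results:
        target = row["to"]
        if strip_prefix:
            target = target.split(":", 1)[-1]
        lookup[row["from"]] = target
    return lookup


results = [
    {'from': 'Q5VV41', 'to': 'HGNC:15515'},
    {'from': 'O14933', 'to': 'HGNC:12490'},
    {'from': 'Q8NFH8', 'to': 'HGNC:9963'},
]
print(hgnc_lookup(results))
# {'Q5VV41': '15515', 'O14933': '12490', 'Q8NFH8': '9963'}
```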

You may wish to scroll up to Example 1 above and see some ways to extend this.

Now feel free to edit the code here and re-run these steps to focus on identifiers related to your interests.  
If you make anything useful, be sure to download & save this notebook to your own local machine, along with any useful files that got downloaded as a result of running the code.

Or continue on to look at the other sections.

-------------------

#### Additional Unipressed snippets from answers I had at StackOverflow and Biostars follow:

I'll eventually better detail these but for now the links to the associated StackOverflow and Biostars answers are provided for further investigation.

-------------------

In [None]:
# associated with https://stackoverflow.com/a/73586628/8508004
from unipressed import IdMappingClient
import time  # needed for the sleep below when this cell is run on its own
request = IdMappingClient.submit(
    source="UniProtKB_AC-ID", dest="PDB", ids={"A0A0M3KKX3"}
)
time.sleep(3.0)
results_list = list(request.each_result())

In [None]:
import pandas as pd
results_df = pd.DataFrame(results_list)
results_df

-------------------

In [None]:
# from https://www.biostars.org/p/9560315/#9560336
from unipressed import IdMappingClient
import time
request = IdMappingClient.submit(
    source="UniProtKB_AC-ID", dest="Gene_Name", ids={"P10643","P11717","P00450","Q86VB7","P27169","P01871","P06727","O00299", "Q9UBX5", "B7ZKJ8","A0A0G2JPR0","P09493","P35443","Q9Y4F1","P23141", "Q8WWA0", "P04792", "P26447", "P07237", "P08571", "Q9UPN3", "P14151", "P49908", "P33151", "P26038"}
)
time.sleep(5)
results_list = list(request.each_result())

In [None]:
import pandas as pd
results_df = pd.DataFrame(results_list)
results_df
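If an ID collection grows much larger than the one above, it can be submitted in smaller batches. The sketch below only does the splitting. The default batch size of 500 is an arbitrary illustrative value, not a documented limit, and `batches` is a hypothetical helper name.

```python
def batches(ids, size=500):
    """De-duplicate and sort the IDs, then split them into lists of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    ordered = sorted(set(ids))
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


print(batches({"P10643", "P11717", "P00450"}, size=2))
# [['P00450', 'P10643'], ['P11717']]
```

Each batch could then be passed as the `ids=` argument of a separate `IdMappingClient.submit` call, with the per-batch results concatenated afterwards.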

-------------------

In [None]:
# associated with https://stackoverflow.com/a/73586386/8508004
from unipressed import IdMappingClient
import time
request = IdMappingClient.submit(
    source="GeneCards", dest="UniProtKB", ids={"POTEB3", "SYCE3", "CLRN2"}
)
time.sleep(5.0)
results_list = list(request.each_result())

In [None]:
import pandas as pd
results_df = pd.DataFrame(results_list)
results_df

-------

In [None]:
# associated with https://stackoverflow.com/a/73587249/8508004
from unipressed import UniprotkbClient
UniprotkbClient.fetch_one("P03468")["uniProtKBCrossReferences"]
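The cross-reference list is long, so it usually helps to pick out a single database. A minimal sketch, assuming each cross-reference is a dictionary with `"database"` and `"id"` keys, which is how the entries appear in UniProt's JSON as far as I know. `xrefs_for` is a hypothetical helper name, and the stub entries exist only to exercise the function.

```python
def xrefs_for(cross_references, database):
    """Return the ids of all cross-references that point at one database."""
    return [ref["id"] for ref in cross_references if ref.get("database") == database]


# made-up stub entries, not real UniProt data
stub = [
    {"database": "PDB", "id": "XXXX"},
    {"database": "GO", "id": "GO:0000000"},
    {"database": "PDB", "id": "YYYY"},
]
print(xrefs_for(stub, "PDB"))  # ['XXXX', 'YYYY']
```

Under that assumption, `xrefs_for(UniprotkbClient.fetch_one("P03468")["uniProtKBCrossReferences"], "PDB")` would list the PDB entries for that record.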

-------

In [None]:
# associated with https://www.biostars.org/p/286919/#9537049
from unipressed import UniprotkbClient
for record in UniprotkbClient.search(
    query={
        "or_": [
        {"ec": "3.1.3.9"},
        {"ec": "2.7.1.2"},
        ],
        "and_": [
        {"organism_id": "9606"},
        ]
    },
    #fields=["length", "gene_names"]
).each_record():
    display(record)

Get that as tab-separated values, `.tsv`:

In [None]:
# associated with https://www.biostars.org/p/286919/#9537049
from unipressed import UniprotkbClient
for record in UniprotkbClient.search(
    query={
        "or_": [
        {"ec": "3.1.3.9"},
        {"ec": "2.7.1.2"},
        ],
        "and_": [
        {"organism_id": "9606"},
        ]
    },
    format="tsv",
    fields=["accession","gene_names", "length"]
).each_record():
    display(record)

Saving records as `.tsv` files:

In [None]:
# from https://gist.github.com/fomightez/54e3b38c9ac516e6687924349527873d
from unipressed import UniprotkbClient
import shutil
for i,record in enumerate(UniprotkbClient.search(
    query={
        "or_": [
        {"ec": "3.1.3.9"},
        {"ec": "2.7.1.2"},
        ],
        "and_": [
        {"organism_id": "9606"},
        ]
    },
    format="tsv",
    fields=["accession","gene_names", "length"]
).each_page()):
    with open(f"{i+1}.tsv", "w") as dest:
        shutil.copyfileobj(record, dest)
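Once saved, a `.tsv` page can be read back with the standard library. A minimal sketch, assuming the saved file starts with a header row; `read_tsv` is a hypothetical helper name.

```python
import csv


def read_tsv(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, `read_tsv("1.tsv")` returns one dictionary per row, keyed by whatever column names the header contains.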

You can filter out the isoforms to get the 4 entries seen via direct access by removing any record with a dash in its accession (see [here](https://www.biostars.org/p/286919/#9537049) for what that is about), like so:

In [None]:
# associated with https://www.biostars.org/p/286919/#9537049
from unipressed import UniprotkbClient

collected=[]
for record in UniprotkbClient.search(
    query={
        "or_": [
        {"ec": "3.1.3.9"},
        {"ec": "2.7.1.2"},
        ],
        "and_": [
        {"organism_id": "9606"},
        ]
    },
    fields=["length", "gene_names"]
).each_record():
    collected.append(record)
collected = [x for x in collected if "-" not in x["primaryAccession"]]
collected
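If you would rather keep the isoform records around instead of discarding them, the same dash test can split the list in two. A minimal sketch; `split_isoforms` is a hypothetical helper name, and the accessions in the demo are placeholders rather than results from the query.

```python
def split_isoforms(records):
    """Separate records into (canonical, isoform) lists by dash in the accession."""
    canonical, isoforms = [], []
    for rec in records:
        if "-" in rec["primaryAccession"]:
            isoforms.append(rec)
        else:
            canonical.append(rec)
    return canonical, isoforms


# placeholder accessions for illustration only
demo_records = [
    {"primaryAccession": "X00001"},
    {"primaryAccession": "X00001-2"},
    {"primaryAccession": "X00002"},
]
canonical, isoforms = split_isoforms(demo_records)
print(len(canonical), len(isoforms))  # 2 1
```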

XML Format Example:

The original post asked in particular about downloading the results in XML format, and Unipressed has that built in already. Here, some data stored in the XML record object is accessed & printed to show something human-readable:

In [None]:
# associated with https://www.biostars.org/p/286919/#9537049 , see also https://gist.github.com/fomightez/9d6a04385d143bf7c1de34cefffc0101
from unipressed import UniprotkbClient
for record in UniprotkbClient.search(
    query={
        "or_": [
        {"ec": "3.1.3.9"},
        {"ec": "2.7.1.2"},
        ],
        "and_": [
        {"organism_id": "9606"},
        ]
    },
    format="xml",
).each_record():
    #Show XML object as string by uncommenting out the next two lines & deleting everything after those lines
    #from xml.etree import ElementTree # from https://stackoverflow.com/a/48671499/8508004
    #print(ElementTree.tostring(record, encoding='unicode'))
    #Below based on [Processing XML in Python — ElementTree:A Beginner’s Guide](https://towardsdatascience.com/processing-xml-in-python-elementtree-c8992941efd2)
    # slice `[28:]` added to remove `{http://uniprot.org/uniprot}` from the front of tags
    #[print(elem.tag[28:]) for elem in record.iter()]
    #[print(child.tag, child.attrib) for child in record]
    [print(elem.tag[28:], elem.attrib, elem.text) for elem in record.iter('{http://uniprot.org/uniprot}fullName')]
    [print(elem.tag[28:], elem.attrib, elem.text) for elem in record.iter('{http://uniprot.org/uniprot}ecNumber')]
    [print(elem.tag[28:], elem.attrib) for elem in record.iter('{http://uniprot.org/uniprot}proteinExistence')]
    print("*"*60)
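The `[28:]` slice works because the namespace prefix `{http://uniprot.org/uniprot}` is 28 characters long. An alternative that does not depend on that length is to split on the closing brace and to use ElementTree's namespace mapping for searches. This is a sketch on a tiny made-up XML fragment, not a real UniProt record; `local_name` and `full_names` are hypothetical helper names.

```python
from xml.etree import ElementTree

NS = {"up": "http://uniprot.org/uniprot"}  # namespace used in the cell above


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.split("}", 1)[-1]


def full_names(entry):
    """Return the text of every fullName element below entry."""
    return [elem.text for elem in entry.iterfind(".//up:fullName", NS)]


# made-up fragment for illustration only
sample = ElementTree.fromstring(
    '<entry xmlns="http://uniprot.org/uniprot">'
    '<protein><recommendedName><fullName>Example protein</fullName>'
    '</recommendedName></protein>'
    '</entry>'
)
print(full_names(sample))                         # ['Example protein']
print([local_name(e.tag) for e in sample.iter()])
# ['entry', 'protein', 'recommendedName', 'fullName']
```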

--------

Enjoy!