# Unipressed examples running via MyBinder

There's a Python package, [Unipressed](https://multimeric.github.io/Unipressed/), by Michael Milton ([@multimeric](https://twitter.com/multimeric)) that allows programmatic access to, and querying of, UniProt's new REST API.  

This notebook combines some of my examples on StackOverflow and Biostars to work in sessions served by MyBinder.
This is designed so you can easily edit these and run your own versions to test things out or collect useful information.  
Be aware, though, that this MyBinder-served session has limited computational resources, so you may easily exceed what is possible here and need to move the ideas and code to somewhere with more resources. Additionally, MyBinder blocks FTP ports to prevent abuse, so not all routes for retrieving data will work.

If you do make something useful in your session, grab the code and save it on your local machine, or save the current notebook and download it so you can upload it to a later session and pick up where you left off. The same goes for any data you generate! (You'll need to run the installs at the top every time unless I have set up the environment to include them at session start-up. I HAVE NOT DONE THIS YET.)  

---------

### Prepare environment in session by installing packages needed

In [None]:
%pip install unipressed pandas

------

#### Get PDB files from a list of UniProt IDs

Based on combining Unipressed use via MyBinder with a [Biostars reply by Mensur Dlakic](https://www.biostars.org/p/9602308/#9602314):

The list with each UniProt id on a separate line goes between the `'''` below:

In [None]:
l='''
O75771
Q8N183
Q9UBX5
Q86VB7
B7ZKJ8
P04792
A0A0G2JPR0
'''

In [None]:
l='''
O75771
Q8N183
A0A0M3KKX3
'''

In [None]:
# This cell takes the item listing and makes a Python list object out of it
id_list = l.split("\n")
# the next line uses a list comprehension to drop the empty strings left by the leading and trailing newlines of the paste-friendly block above
id_list = [x for x in id_list if x]
id_list
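
If your pasted list might contain stray spaces or repeated ids, a slightly more defensive parser can help. The following is only a sketch; `parse_id_block` is a hypothetical helper name, not part of Unipressed:

```python
def parse_id_block(text):
    """Return the ids in `text`, one per line, stripped and de-duplicated.

    Blank lines are skipped and the first-seen order is preserved.
    """
    seen = set()
    ids = []
    for line in text.splitlines():
        candidate = line.strip()
        if candidate and candidate not in seen:
            seen.add(candidate)
            ids.append(candidate)
    return ids


# Small demonstration with a repeated id and some extra whitespace.
example = "\nO75771\n  Q8N183  \n\nO75771\n"
print(parse_id_block(example))
```

You could then use `id_list = parse_id_block(l)` in place of the split-and-filter steps above.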

In [None]:
from unipressed import IdMappingClient
import time
request = IdMappingClient.submit(
    source="UniProtKB_AC-ID", dest="PDB", ids=id_list
)
time.sleep(3.0)
results_list = list(request.each_result())
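
The fixed `time.sleep(3.0)` above is a guess at how long the mapping job takes. A polling loop is one alternative. The sketch below is generic: `wait_for_job` is a hypothetical helper that accepts any zero-argument function returning a status string. I believe Unipressed's job object offers `request.get_status()` for this, and that the status values are strings such as `"NEW"`, `"RUNNING"` and `"FINISHED"`, but treat both as assumptions and check the Unipressed documentation.

```python
import time


def wait_for_job(get_status, poll_seconds=1.0, timeout_seconds=60.0):
    """Call `get_status()` repeatedly until it returns "FINISHED".

    Raises RuntimeError for any status other than "NEW"/"RUNNING"/"FINISHED"
    (assumed status names) and TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == "FINISHED":
            return status
        if status not in {"NEW", "RUNNING"}:
            raise RuntimeError(f"job ended with unexpected status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_seconds} s")
        time.sleep(poll_seconds)


# Offline demonstration with a stand-in status function (no network needed).
fake_statuses = iter(["NEW", "RUNNING", "FINISHED"])
print(wait_for_job(lambda: next(fake_statuses), poll_seconds=0))
```

In a live session the idea would be to call `wait_for_job(request.get_status)` in place of the `time.sleep(...)` line, and only then collect `request.each_result()`.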

In [None]:
results_list 

In [None]:
import pandas as pd
results_df = pd.DataFrame(results_list)
results_df

Identify the ids that did not get an experimental PDB match:

In [None]:
ids_with_experimental_structures = results_df['from'].unique() # Note cannot use `results_df.from.unique()` because `from` is a Python keyword used in imports
ids_with_no_experimental_structure_exists = list(set(id_list) - set(ids_with_experimental_structures))
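
The set difference above does not keep the ids in the order you supplied them. If order matters, something like the following sketch works on the plain list of result dictionaries instead of the dataframe. `ids_without_match` is a hypothetical helper name, and the `"XXXX"` value in the demonstration is a placeholder, not a real mapping result:

```python
def ids_without_match(submitted_ids, results):
    """Return the submitted ids that never appear as a 'from' value in `results`.

    `results` is a list of dictionaries with 'from' and 'to' keys, like the
    rows shown in the dataframe above. The input order is preserved.
    """
    matched = {row["from"] for row in results}
    return [an_id for an_id in submitted_ids if an_id not in matched]


# Demonstration with made-up rows; "XXXX" stands in for a PDB id.
demo_ids = ["O75771", "Q8N183", "A0A0M3KKX3"]
demo_results = [{"from": "O75771", "to": "XXXX"}]
print(ids_without_match(demo_ids, demo_results))
```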

Get the structures for the experimental matches:
First we'll take advantage of Python to do this. Then we'll show the approach that more closely matches the route outlined [by Mensur Dlakic in his Biostars reply](https://www.biostars.org/p/9602308/#9602314), where the contents of a text file are piped into the retrieval command.

First...  
Staying mostly with in-memory Python objects and Jupyter conveniences; in other words, without making a separate text file:

In [None]:
pdb_ids = results_df['to'].unique() # limit to unique because no point repeating redundant entries
for an_id in pdb_ids:
    !curl -OL https://files.rcsb.org/download/{an_id}.cif.gz
    !gunzip {an_id}.cif.gz

Or if you want the traditional `pdb` files:

In [None]:
pdb_ids = results_df['to'].unique() # limit to unique because no point repeating redundant entries
for an_id in pdb_ids:
    !curl -OL https://files.rcsb.org/download/{an_id}.pdb.gz
    !gunzip {an_id}.pdb.gz
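
The `!curl` and `!gunzip` lines rely on Jupyter's shell escape. If you later move this code into a plain Python script, a standard-library version is one option. This is a sketch under that assumption: `structure_url` and `fetch_structure` are hypothetical helper names, and only the URL pattern comes from the cells above.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

RCSB_DOWNLOAD = "https://files.rcsb.org/download"  # same base URL as the curl calls


def structure_url(pdb_id, extension="cif"):
    """Build the URL of the gzipped structure file for one PDB id."""
    return f"{RCSB_DOWNLOAD}/{pdb_id}.{extension}.gz"


def fetch_structure(pdb_id, extension="cif", dest_dir="."):
    """Download one gzipped structure file and write it out decompressed."""
    target = Path(dest_dir) / f"{pdb_id}.{extension}"
    with urllib.request.urlopen(structure_url(pdb_id, extension)) as response:
        # Decompress while streaming so no temporary .gz file is needed.
        with gzip.GzipFile(fileobj=response) as unzipped, open(target, "wb") as out:
            shutil.copyfileobj(unzipped, out)
    return target


# Only the URL is built here, so nothing is downloaded; "XXXX" is a placeholder id.
print(structure_url("XXXX", "pdb"))
```

With that in place, `for an_id in pdb_ids: fetch_structure(an_id, "pdb")` would stand in for the loop above.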

Second...  
To more closely match the [Biostars reply by Mensur Dlakic](https://www.biostars.org/p/9602308/#9602314), the file of PDB ids to be retrieved is piped to the retrieval command. Note, though, that HERE IT IS DONE WITHOUT FTP SINCE FTP IS BLOCKED IN MYBINDER SESSIONS.  
To do that, try the next two cells:

In [None]:
pdb_ids = results_df['to'].unique() # limit to unique because no point repeating redundant entries
# save the ids from the Python list to a file `pdb_ids.txt` to better match Mensur Dlakic's approach
with open("pdb_ids.txt", 'w') as f:
    f.write("\n".join(pdb_ids))

Note there should now be a file `pdb_ids.txt` in this session if you look in the file navigator pane to the left. If you already had a list of your own PDB entry ids with one on each line and no header, you could replace the content of `pdb_ids.txt` in this session with yours before running the next cell.

Running the next cell will actually do the downloading, now that we have set things up:

In [None]:
!cat pdb_ids.txt | xargs -i wget -q -o /dev/null https://files.rcsb.org/download/"{}".pdb # based on https://www.rcsb.org/docs/programmatic-access/file-download-services#file-access-urls-

-------------------

#### Additional Unipressed snippets from my answers at StackOverflow and Biostars follow:

I'll eventually describe these in more detail, but for now the links to the associated StackOverflow and Biostars answers are provided for further investigation.

-------------------

In [None]:
# associated with https://stackoverflow.com/a/73586628/8508004
from unipressed import IdMappingClient
import time
request = IdMappingClient.submit(
    source="UniProtKB_AC-ID", dest="PDB", ids={"A0A0M3KKX3"}
)
time.sleep(3.0)
results_list = list(request.each_result())

In [None]:
import pandas as pd
results_df = pd.DataFrame(results_list)
results_df

-------------------

In [None]:
# from https://www.biostars.org/p/9560315/#9560336
from unipressed import IdMappingClient
import time
request = IdMappingClient.submit(
    source="UniProtKB_AC-ID", dest="Gene_Name", ids={"P10643","P11717","P00450","Q86VB7","P27169","P01871","P06727","O00299", "Q9UBX5", "B7ZKJ8","A0A0G2JPR0","P09493","P35443","Q9Y4F1","P23141", "Q8WWA0", "P04792", "P26447", "P07237", "P08571", "Q9UPN3", "P14151", "P49908", "P33151", "P26038"}
)
time.sleep(5)
results_list = list(request.each_result())

In [None]:
import pandas as pd
results_df = pd.DataFrame(results_list)
results_df

-------------------

In [None]:
# associated with https://stackoverflow.com/a/73586386/8508004
from unipressed import IdMappingClient
import time
request = IdMappingClient.submit(
    source="GeneCards", dest="UniProtKB", ids={"POTEB3", "SYCE3", "CLRN2"}
)
time.sleep(5.0)
results_list = list(request.each_result())

In [None]:
import pandas as pd
results_df = pd.DataFrame(results_list)
results_df

-------

In [None]:
# associated with https://stackoverflow.com/a/73587249/8508004
from unipressed import UniprotkbClient
UniprotkbClient.fetch_one("P03468")["uniProtKBCrossReferences"]
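
That cell returns every cross-reference for the entry, which can be a long list. The following sketch pulls out the ids for a single database. It assumes each cross-reference is a dictionary with `"database"` and `"id"` keys, which matches my recollection of UniProt's JSON but should be checked against the output above. `cross_reference_ids` is a hypothetical helper, and the entries in the demonstration are placeholders, not real data.

```python
def cross_reference_ids(cross_references, database):
    """Return the 'id' of every cross-reference whose 'database' matches."""
    return [
        ref["id"]
        for ref in cross_references
        if ref.get("database") == database and "id" in ref
    ]


# Made-up entries shaped like the assumed structure; the ids are placeholders.
demo_refs = [
    {"database": "PDB", "id": "XXXX"},
    {"database": "EMBL", "id": "YYYY"},
    {"database": "PDB", "id": "ZZZZ"},
]
print(cross_reference_ids(demo_refs, "PDB"))
```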

-------

In [None]:
# associated with https://www.biostars.org/p/286919/#9537049
from unipressed import UniprotkbClient
for record in UniprotkbClient.search(
    query={
        "or_": [
        {"ec": "3.1.3.9"},
        {"ec": "2.7.1.2"},
        ],
        "and_": [
        {"organism_id": "9606"},
        ]
    },
    #fields=["length", "gene_names"]
).each_record():
    display(record)

Get that as tab-separated values, `.tsv`:

In [None]:
# associated with https://www.biostars.org/p/286919/#9537049
from unipressed import UniprotkbClient
for record in UniprotkbClient.search(
    query={
        "or_": [
        {"ec": "3.1.3.9"},
        {"ec": "2.7.1.2"},
        ],
        "and_": [
        {"organism_id": "9606"},
        ]
    },
    format="tsv",
    fields=["accession","gene_names", "length"]
).each_record():
    display(record)

Saving records as `.tsv` files:

In [None]:
# from https://gist.github.com/fomightez/54e3b38c9ac516e6687924349527873d
from unipressed import UniprotkbClient
import shutil
for i,record in enumerate(UniprotkbClient.search(
    query={
        "or_": [
        {"ec": "3.1.3.9"},
        {"ec": "2.7.1.2"},
        ],
        "and_": [
        {"organism_id": "9606"},
        ]
    },
    format="tsv",
    fields=["accession","gene_names", "length"]
).each_page()):
    with open(f"{i+1}.tsv", "w") as dest:
        shutil.copyfileobj(record, dest)
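
Because each page is written to its own numbered file, a larger search leaves you with several `.tsv` files. The sketch below shows one way to merge them, assuming every page file starts with the same header row (worth confirming by opening a couple of them). `merge_tsv_pages` is a hypothetical helper name.

```python
import csv


def merge_tsv_pages(page_paths, merged_path):
    """Concatenate TSV files that share a header; return the number of data rows."""
    header = None
    rows_written = 0
    with open(merged_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for path in page_paths:
            with open(path, newline="") as page:
                reader = csv.reader(page, delimiter="\t")
                page_header = next(reader, None)
                if page_header is None:
                    continue  # skip empty files
                if header is None:
                    header = page_header
                    writer.writerow(header)  # write the header only once
                elif page_header != header:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

For the files made above, a call such as `merge_tsv_pages(["1.tsv", "2.tsv"], "merged.tsv")` would apply, listing however many pages were actually saved.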

You can filter out the isoforms to get the 4 seen in the direct access by removing any record with a dash in its accession (see [here](https://www.biostars.org/p/286919/#9537049) for what that is about), like so:

In [None]:
# associated with https://www.biostars.org/p/286919/#9537049
from unipressed import UniprotkbClient

collected=[]
for record in UniprotkbClient.search(
    query={
        "or_": [
        {"ec": "3.1.3.9"},
        {"ec": "2.7.1.2"},
        ],
        "and_": [
        {"organism_id": "9606"},
        ]
    },
    fields=["length", "gene_names"]
).each_record():
    collected.append(record)
collected = [x for x in collected if "-" not in x["primaryAccession"]]
collected
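
To see what the dash filter does without querying UniProt, here is a small offline sketch. The accessions are placeholders chosen only to show the pattern, and `is_canonical` is a hypothetical helper name:

```python
def is_canonical(accession):
    """Treat an accession without a dash as the canonical (non-isoform) entry."""
    return "-" not in accession


# Placeholder records shaped like the search results (only the key used here).
demo_records = [
    {"primaryAccession": "P12345"},
    {"primaryAccession": "P12345-2"},
    {"primaryAccession": "Q99999"},
]
kept = [r for r in demo_records if is_canonical(r["primaryAccession"])]
print([r["primaryAccession"] for r in kept])
```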

XML Format Example:

The original post in particular asked about downloading the results in XML format, and Unipressed has that built in already. Here, some of the data stored in each XML record object is accessed and printed to show something human readable:

In [None]:
# associated with https://www.biostars.org/p/286919/#9537049 , see also https://gist.github.com/fomightez/9d6a04385d143bf7c1de34cefffc0101
from unipressed import UniprotkbClient
for record in UniprotkbClient.search(
    query={
        "or_": [
        {"ec": "3.1.3.9"},
        {"ec": "2.7.1.2"},
        ],
        "and_": [
        {"organism_id": "9606"},
        ]
    },
    format="xml",
).each_record():
    #Show XML object as string by uncommenting the next two lines & deleting everything after those lines
    #from xml.etree import ElementTree # from https://stackoverflow.com/a/48671499/8508004
    #print(ElementTree.tostring(record, encoding='unicode'))
    #Below based on [Processing XML in Python — ElementTree:A Beginner’s Guide](https://towardsdatascience.com/processing-xml-in-python-elementtree-c8992941efd2)
    # slice `[28:]` added to remove `{http://uniprot.org/uniprot}` from the front of tags
    #[print(elem.tag[28:]) for elem in record.iter()]
    #[print(child.tag, child.attrib) for child in record]
    [print(elem.tag[28:], elem.attrib, elem.text) for elem in record.iter('{http://uniprot.org/uniprot}fullName')]
    [print(elem.tag[28:], elem.attrib, elem.text) for elem in record.iter('{http://uniprot.org/uniprot}ecNumber')]
    [print(elem.tag[28:], elem.attrib) for elem in record.iter('{http://uniprot.org/uniprot}proteinExistence')]
    print("*"*60)
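
The `[28:]` slice works because `{http://uniprot.org/uniprot}` is 28 characters long, but it would silently break if the namespace ever changed. The sketch below shows a version that does not hard-code the length. It runs on a tiny hand-written XML snippet (not a real UniProt record), and `local_name` and `texts` are hypothetical helper names:

```python
from xml.etree import ElementTree

UNIPROT_NS = "http://uniprot.org/uniprot"  # namespace used in the cell above


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def texts(record, name):
    """Collect the text of every descendant element called `name` in the UniProt namespace."""
    return [elem.text for elem in record.iter(f"{{{UNIPROT_NS}}}{name}")]


# Hand-written stand-in for a record; the values are made up for illustration.
sample = ElementTree.fromstring(
    '<entry xmlns="http://uniprot.org/uniprot">'
    "<protein><recommendedName>"
    "<fullName>Example protein</fullName>"
    "<ecNumber>0.0.0.0</ecNumber>"
    "</recommendedName></protein>"
    "</entry>"
)
print(local_name(sample.tag), texts(sample, "fullName"), texts(sample, "ecNumber"))
```

Inside the loop above, `texts(record, "fullName")` should return the same names that the first list comprehension prints.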

--------

Enjoy!