## FindIt4Me 

### Overview
FindIt4Me is a tool within a Python notebook that takes a list of gene, protein, or sequence identifiers as input and finds the corresponding identifiers in Uniprot, Ensembl, and NCBI.  Enabling teaching mode turns on extra output, which is intended to help people who are new to Python understand file parsing and data wrangling.

Ensembl provides a REST API, which integrates well with many workflows because it is queried with ordinary HTTP requests.  I've added comments in that section to explain the steps that I take.

The Python API for Uniprot uses Python 2.7.x, so there are two ways to go from here:

  1) update the Python code from 2.7.x to 3.x

  2) use another API and have the Python notebook run it using that API/language

I've used route 1 before, and my only comment is: however much time you think it will take, triple that.
Regarding route 2, I tested the Perl API and it still works: the example code ran without errors when I copied, pasted, and ran it (this is very good news).  The notebook calls a Perl script based on that example to query Uniprot, and the script's output is written to a temporary file in a format that I specify.  

There is no single answer to situations like this.  My recommendation is to first see whether there is an API that uses Python, Perl, or Bash, because it is (fairly) simple for these scripting languages to send and receive arguments, or files as arguments.  This lets you run your workflow from your language of choice (like a Python notebook) and call the API as an external executable command from your notebook or workflow.  You can do the same with a Java, Ruby, or C-style API, but you will have to be very mindful of I/O: when Java writes to a file, for example, the output may not be in the format that you expect or want, which creates extra parsing and data wrangling work.  If an R API is offered, I'd recommend using it as well, with the only caveat being that you will need to shape your workflow to parse tables and SQL-like data objects.
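
The pattern described above, calling an external script from the notebook and reading back what it prints, can be sketched with Python's `subprocess` module.  This is only a minimal sketch, not the notebook's actual code: `run_external` is a hypothetical helper, and the demo command uses the Python interpreter as a stand-in so that the example runs anywhere.  In this notebook the command would instead look something like `['perl', 'fetch_with_perl.pl', inFileName]`.

```python
import subprocess
import sys


def run_external(cmd):
    """Run an external command and return its standard output as a list of lines.

    `cmd` is a list such as ['perl', 'some_script.pl', 'input.txt'] (hypothetical).
    check=True raises CalledProcessError if the command exits with a non-zero status.
    """
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout.splitlines()


# Stand-in command: a one-line Python program that prints two tab-separated
# placeholder values, imitating one line of output from an external API script.
demo_cmd = [sys.executable, '-c', r"print('Identifier1\tMatched1')"]
lines = run_external(demo_cmd)
fields = lines[0].split('\t')
print(fields)
```

Capturing standard output directly avoids a temporary file, but writing to a file (as this notebook does) keeps the raw output around for inspection, so either choice can be reasonable.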

### Setup
Each input file must be a single list of identifiers. 

Example:

        Identifier1
        Identifier2
        Identifier3
        
If you have a header, it **must** have a '#' in front to mark it as such.  

Example:

        #header
        Identifier1
        Identifier2
        Identifier3
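
As an illustration of the input format above, here is a minimal sketch of how such a file could be parsed.  The function name `read_identifiers` is hypothetical, and skipping blank lines is an assumption that the format description does not state.

```python
def read_identifiers(lines):
    """Return (header, identifiers) from an iterable of text lines.

    Lines starting with '#' are treated as headers (the last one seen is kept).
    Blank lines are skipped, which is an assumption of this sketch.
    """
    header = ''
    identifiers = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('#'):
            header = line
            continue
        identifiers.append(line)
    return header, identifiers


# A list of strings stands in for an open file here; an open file object
# can be passed in the same way because it also yields one line at a time.
sample = ['#header\n', 'Identifier1\n', 'Identifier2\n', '\n', 'Identifier3\n']
header, ids = read_identifiers(sample)
print(header, ids)
```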
     

To enable teaching mode, copy and paste this line into the first line of the next cell before running it:

        enableTeachingModeBool = True

Important Note: To turn off teaching mode after you have turned it on, you must relaunch the notebook, or copy and paste *enableTeachingModeBool = False* in any cell after this one, and then run that cell.
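
The toggle works because a variable that was never assigned simply does not exist in the notebook's namespace.  A minimal sketch of that check, assuming a hypothetical helper named `teaching_mode_enabled` (the notebook itself uses a `try`/`except NameError` block for the same purpose):

```python
def teaching_mode_enabled(namespace):
    """Return True only if 'enableTeachingModeBool' exists in namespace and is truthy."""
    return bool(namespace.get('enableTeachingModeBool', False))


# In a notebook, globals() would be passed in; plain dicts stand in for it here.
print(teaching_mode_enabled({}))
print(teaching_mode_enabled({'enableTeachingModeBool': True}))
print(teaching_mode_enabled({'enableTeachingModeBool': False}))
```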

### Instructions
You can run individual sections of the notebook or the entire notebook, depending on your needs.

* Part 1 
  * Part 1 takes as input a list of PDB protein identifiers and the label of the identifier type that you want returned.  The notebook then scans the Uniprot database for a matching identifier.  The default is NCBI RefSeq, but other databases may be searched as well.   
  * The chart below shows what Part 1 of the notebook will retrieve.  Note that attempting to match a protein to a gene identifier may produce unexpected results:

| What you want | What you should enter | What will happen |
|:---|:---:|:---:|
| Match Uniprot to NCBI | `P_REFSEQ_AC` | Scans Uniprot, then returns the RefSeq-Protein-ID |
| Match Uniprot to EMBL | `EMBL_ID`  | Scans Uniprot, then returns the EMBL-ID  |
| Match Uniprot to Ensembl | `ENSEMBL_ID` | Scans Uniprot, then returns the matching Ensembl identifier |
| Match Uniprot to Ensembl-Protein | `ENSEMBL_PRO_ID` | Scans Uniprot, then returns the matching Ensembl protein identifier |
| Match Uniprot to Ensembl-Transcript | `ENSEMBL_TRS_ID` | Scans Uniprot, then returns the matching Ensembl transcript identifier |
| Match Uniprot to GeneName | `GENENAME` | Scans Uniprot, then returns the matching GeneName from Uniprot |


  * There will be two output files.  The file beginning with 'outFile_perl' is the raw output from the Uniprot query; you can check it to verify which database was queried.  The file beginning with 'uniprot_scan' (this is the file you want) has the format: 

          PDB_ID,Matched-ID

  * Where 'Matched-ID' is the identifier that matched to what you provided.

  * A full list of identifiers can be found here: https://www.uniprot.org/help/api_idmapping


* Part 2 
  * Part 2 will take as input a list of gene, transcript, or protein identifiers, and return a list with 3 columns.  Column 1 will be the original identifier, and will never be blank.  Column 2 will be the Ensembl identifier, and will also never be blank.  Column 3 will be the NCBI identifier if one can be found:

        ID,Ensembl
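
The comma-separated output files written by this notebook can be read back for later steps.  The following is a minimal sketch, assuming a two-column file in which a missing match is written as the literal text `None` (as the code cells in this notebook do); `load_matches` is a hypothetical helper name.

```python
import csv
import io


def load_matches(handle):
    """Read 'identifier,match' rows into a dict; the literal text 'None' becomes None."""
    matches = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        key, value = row[0], row[1]
        matches[key] = None if value == 'None' else value
    return matches


# In-memory stand-in for an output file; the identifiers are placeholders.
demo = io.StringIO('Identifier1,Matched1\nIdentifier2,None\n')
matches = load_matches(demo)
print(matches)
```

With a real file, the same function could be called as `load_matches(open(filename, newline=''))` inside a `with` block.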
        

### Part 1: Match Uniprot to NCBI

In the next cell, enter the filename that contains the list of identifiers for the *inFileName* value, then copy and paste the appropriate label (see chart above) for the database that you want to scan.  Finally, run that cell and follow the instructions in the output.

In [95]:
inFileName = 'inFile_perl.txt'






# code
parsed_list = []
headerLine = ''

# create timestamp so files don't overwrite each other
import time
import datetime
t_stamp = time.time()
t_stamp_string = datetime.datetime.fromtimestamp(t_stamp).strftime('%Y%m%d_%H%M%S')
outFileStringName = 'outFile_perl' + t_stamp_string + '.txt'

# read the input file, keeping the header (if any) separate from the identifiers
try:
    with open(inFileName, 'r') as fOpen:
        for i in fOpen:
            i = i.rstrip('\r\n')
            if not i.strip():
                continue  # skip blank lines, which would otherwise raise an IndexError below
            if i.startswith('#'):
                headerLine = i
                continue
            parsed_list.append(i)
except FileNotFoundError:
    print('It does not look like you entered a file for the inFileName value\nor that file cannot be found.  Please check what you entered, and run this cell again.')

# teaching mode (verbose output) toggle
enableTeachingMode = False
try:
    if enableTeachingModeBool:
        enableTeachingMode = enableTeachingModeBool
except NameError:
    enableTeachingMode = False
    
if enableTeachingMode:
    print('\n\nTeaching mode turned on.\n\n')
    
if parsed_list:
    print('\nReady to continue!  Select the next cell and run it.')


Ready to continue!  Select the next cell and run it.


In [113]:
%%bash -s "$inFileName" "$outFileStringName"

perl fetch_with_perl.pl "$1" > "$2"
sleep 2
echo 'Select the next cell and run it'

Select the next cell and run it


In [114]:
l = []
with open(outFileStringName, 'r') as fOpen:
    for i in fOpen:
        i = i.rstrip('\r\n')
        iSplit = i.split('\t')
        if len(iSplit) > 1:
            l.append((iSplit[0], iSplit[1]))
            continue
        l.append(i)

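# delete headers: drop the first line, then any '#' comment lines, then the next two lines
l1 = l[1:] 
l = list(filter(lambda x: x[0][0] != '#', l1))
l = l[2:]
# normalize every entry to an (identifier, match) tuple
res_list = []
for i in l:
    if isinstance(i, tuple):
        # already split on a tab into a 2-tuple (a two-character string is not mistaken for one)
        res_list.append((i[0], i[1]))
    else:
        # space-separated line: split it and discard the empty strings left by repeated spaces
        ln = i.split(' ')
        ln_parsed = list(filter(lambda x: x != '', ln))
        res_list.append((ln_parsed[0], ln_parsed[1]))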

t_stamp = time.time()
t_stamp_string = datetime.datetime.fromtimestamp(t_stamp).strftime('%Y%m%d_%H%M%S')
uniprot_outFile = 'uniprot_scan' + t_stamp_string + '.txt'
uniprot_results = []
# open the output file once and write one 'identifier,match' line per input identifier
with open(uniprot_outFile, 'a') as fWrite:
    for i in parsed_list:
        r = list(filter(lambda x: x[0] == i, res_list))
        if not r:
            # no entry for this identifier
            uniprot_results.append((i, None))
            fWrite.write(i + ',' + 'None' + '\n')
        else:
            # keep the first match that was found
            uniprot_results.append((r[0][0], r[0][1]))
            fWrite.write(r[0][0] + ',' + r[0][1] + '\n')

print('\nUniprot scan complete!  If you do not see any errors, continue to Part 2 to scan Ensembl.')


Uniprot scan complete!  If you do not see any errors, continue to Part 2 to scan Ensembl.
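
The matching loop in Part 1 re-scans `res_list` once for every input identifier.  For long lists, a dictionary can give the same first-match result in a single pass.  This is a sketch of that alternative and not the code the notebook runs; `match_identifiers` is a hypothetical name.

```python
def match_identifiers(queries, result_pairs):
    """Pair each query with its first match from result_pairs, or None if absent.

    result_pairs is a list of (identifier, matched_identifier) tuples.
    """
    first_match = {}
    for identifier, matched in result_pairs:
        # setdefault keeps the first pair seen for each identifier,
        # which mirrors taking r[0] from the filtered list
        first_match.setdefault(identifier, matched)
    return [(q, first_match.get(q)) for q in queries]


# Placeholder identifiers, used only to show the behavior.
pairs = [('A', 'x'), ('A', 'y'), ('C', 'z')]
matched = match_identifiers(['A', 'B', 'C'], pairs)
print(matched)
```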


### Part 2: Match Ensembl to NCBI

In the next cell, enter the filename that contains the list of identifiers for the *inFileName* value, then run that cell and follow the instructions in the output.

In [82]:
inFileName = 'ens_list.txt'
organismName = 'human' # must be the format used by Ensembl, example: human or homo_sapiens for "human"

gName = ''
proceedBool = True
if not inFileName:
    print('Please enter a name for the list of identifiers!')
    proceedBool = False
if not organismName:
    print('Please enter an organism name that is used by Ensembl!')
    proceedBool = False
else:
    gName = organismName

if proceedBool:
    print('\nReady to continue!  Select the next cell and run it.')


Ready to continue!  Select the next cell and run it.


In [85]:
import requests, sys
import re
import time
import random
import argparse
import hashlib
import datetime

t_stamp = time.time()
t_stamp_string = datetime.datetime.fromtimestamp(t_stamp).strftime('%Y%m%d_%H%M%S')

outFileString = 'outFile_ensembl' + t_stamp_string + '.txt'


def lookupEnsembleID(ugeneID, ensQuery, genomeName, verboseBool):
    if verboseBool:
        print("Checking")
        print(ensQuery)
    uniqueID = ugeneID
    # pause for 0.5 to 1.5 seconds so that requests are not sent in rapid succession
    n = random.random() + 0.5
    time.sleep(n)
    server = "https://rest.ensembl.org/"
    lookupIDstring = "xrefs/symbol/"
    # speciesString = "armadillo/"
    speciesString = genomeName + "/"
    idName = ensQuery
    ext1 = "?content-type=application/json"
    q = server+lookupIDstring+speciesString+idName+ext1
    r = requests.get(q, headers={ "Content-Type" : "application/json"})
    if not r.ok:
        if verboseBool:
            print(q)
            print(ensQuery+"\t"+"MATCH ERROR")
        else:
            print("Error")
        return (uniqueID, (ensQuery, 0))
    else:
        decoded = r.json()
        if not decoded:
            if verboseBool:
                print("couldn't find match in Ensembl Lookup")
            return (uniqueID, (ensQuery, 0))
        else:
            res = decoded[0]
            ensembleName = res["id"]
            if verboseBool:
                print("Found match for")
                print(res)
            return (uniqueID, (ensQuery, ensembleName))

ensQuery_list = []
with open(inFileName, 'r') as fOpen:
    for i in fOpen:
        i = i.rstrip('\r\n')
        if not i.strip():
            continue # skip blank lines so that no empty query is sent to Ensembl
        if i[0] == '#':
            continue # just skip the header for this part
        ensQuery_list.append(i)

ensQuery_results = []
for i in ensQuery_list:
    uID = hashlib.md5(i.encode('utf-8')).hexdigest()
    r = lookupEnsembleID(uID, i, gName, False)
    ensQuery_results.append(r[1])

for i in ensQuery_results:
    with open(outFileString, 'a') as fWrite:
        resString = ''
        if not i[1]:
            resString = 'None'
        else:
            resString = str(i[1])
        s = str(i[0]) + ',' + resString + '\n'
        fWrite.write(s)
        
if ensQuery_results:
    print('Ensembl scan complete!  If you do not see any errors, continue to Part 3 to join the lists.')

Ensembl scan complete!  If you do not see any errors, continue to Part 3 to join the lists.
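
The lookup in Part 2 builds its request URL by string concatenation and then reads the `id` field of the first record in the JSON response.  The sketch below isolates those two steps so they can be checked without network access.  The helper names are hypothetical, the percent-encoding of the path components is an addition that the notebook code does not perform, and the sample response is hand-written to resemble a decoded reply rather than fetched from the server.

```python
from urllib.parse import quote

ENSEMBL_SERVER = 'https://rest.ensembl.org'


def build_xrefs_url(species, symbol):
    """Build the xrefs/symbol URL, percent-encoding both path components."""
    return '{}/xrefs/symbol/{}/{}?content-type=application/json'.format(
        ENSEMBL_SERVER, quote(species, safe=''), quote(symbol, safe=''))


def first_ensembl_id(decoded):
    """Return the 'id' of the first record in a decoded JSON list, or None if it is empty."""
    if not decoded:
        return None
    return decoded[0].get('id')


url = build_xrefs_url('human', 'BRCA2')
print(url)

# Hand-written sample of a decoded response (an assumption for illustration).
sample_response = [{'id': 'ENSG00000139618', 'type': 'gene'}]
print(first_ensembl_id(sample_response))
```

Encoding the symbol guards against identifiers that contain characters such as `/` or spaces, which would otherwise change the meaning of the URL.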
