# Parsing UniProt data

In [174]:
import pandas as pd
from pandas import DataFrame

## Functions to generate queries from data

The following functions generate the SQL script from the parsed data.

**toSQLStr** converts a chunk of data (a tuple, a list, or a single string) into the parenthesised form expected by an INSERT query.
The main difficulty is that some names contain single quotes ('). SQL escapes a single quote by doubling it:
    'go'el' => 'go''el'

In [175]:
def toSQLStr(tup):
    # Quote a string for SQL, doubling every embedded single quote.
    def quote(s):
        return "'" + s.replace("'", "''") + "'"

    if isinstance(tup, str):
        # A single string becomes a one-element value list: ('value')
        return "(" + quote(tup) + ")"
    if isinstance(tup, (tuple, list)):
        # Strings are quoted; other items (e.g. numbers) are written as-is.
        items = [quote(item) if isinstance(item, str) else str(item) for item in tup]
        return "(" + ", ".join(items) + ")"
    # Any other type yields an empty value list.
    return "()"
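
As a quick check of the escaping rule, here is a minimal sketch of the same conversion on a few sample values. The names `escape_sql` and `to_sql_tuple` are hypothetical and are not part of the notebook; the sketch is only meant to mirror the behaviour of `toSQLStr` for strings, tuples and lists.

```python
def escape_sql(value):
    """Quote a string for SQL, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def to_sql_tuple(data):
    """Render a string, tuple or list as a parenthesised SQL value list."""
    if isinstance(data, str):
        data = (data,)  # treat a lone string as a one-element row
    rendered = [escape_sql(x) if isinstance(x, str) else str(x) for x in data]
    return "(" + ", ".join(rendered) + ")"


print(to_sql_tuple("go'el"))                  # ('go''el')
print(to_sql_tuple(("P53_HUMAN", "P04637")))  # ('P53_HUMAN', 'P04637')
print(to_sql_tuple(["TP53", 393]))            # ('TP53', 393)
```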

**writeInsertQuery** generates an INSERT query from the lines given as a parameter.
It uses the function above to convert each line of data to a format suitable for SQL.
The query is written to the file specified by `filepath`.

In [176]:
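def writeInsertQuery(table, lines, filepath):
    # Nothing to write without rows, a table name, or a target file.
    if len(lines) == 0 or not table or not filepath:
        return
    with open(filepath, "w") as outf:
        outf.write("INSERT INTO TABLE " + table + " VALUES\n")
        # The row tuples of a multi-row INSERT must be separated by commas.
        outf.write(",\n".join(toSQLStr(line) for line in lines))
        outf.write(";")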
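
To make the shape of the generated script concrete, the sketch below builds the text of one query in memory instead of writing it to a file. `build_insert_query` is a hypothetical helper, not part of the notebook, and the two rows are sample values. The `INSERT INTO TABLE` prefix is copied from the notebook; some SQL dialects expect `INSERT INTO` without the `TABLE` keyword, so adjust it to your database.

```python
def build_insert_query(table, rows):
    """Return the text of one multi-row INSERT statement (hypothetical helper)."""
    def render(row):
        cells = ["'" + c.replace("'", "''") + "'" if isinstance(c, str) else str(c)
                 for c in row]
        return "(" + ", ".join(cells) + ")"

    # Row tuples are separated by commas; the statement ends with a semicolon.
    return ("INSERT INTO TABLE " + table + " VALUES\n"
            + ",\n".join(render(row) for row in rows) + ";")


query = build_insert_query(
    "keyword_unitprot",
    [("P53_HUMAN", "Apoptosis"), ("P53_HUMAN", "Cell cycle")],
)
print(query)
```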

## Useful functions

In [177]:
# str.split(delimiter) also returns empty elements; this function simply removes them.
def split(s, delimiter):
    content = s.split(delimiter)
    return [item for item in content if item]

# Adds a value to a dict that maps each key to a list.
# If the key is not in the dict yet, an empty list is created for it first,
# then the value is appended to that list.
def appendToSubArray(di, key, value):
    if key not in di:
        di[key] = []
    di[key].append(value)
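
The standard library offers the same "create the list on first use" behaviour through `collections.defaultdict`. The sketch below is an alternative shown for comparison, not what the notebook uses; the names stored in it are sample values.

```python
from collections import defaultdict

# Missing keys start out as an empty list, so no existence check is needed.
names = defaultdict(list)
names["Full"].append("Antigen NY-CO-13")
names["Full"].append("Phosphoprotein p53")
names["Short"].append("p53")

print(dict(names))
```

With a plain dict, `di.setdefault(key, []).append(value)` gives the same result in one line.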
        

## Parsing UniProt

The main dish: parsing the UniProt file.

We are only interested in specific line types (`ID`, `AC`, `DE`, `GN`, `KW` and `DR`), so we extract only those. Each line is kept as a tuple `(type of line, content)`. A line read from the file still ends with the '\n' character, so we remove it with `[:-1]`.

In [178]:
cancer_lines = []
with open("../Datasets/UniProtKB/unitprot-cancer/unitprot-cancer.txt") as cf:
    for line in cf:
        content = split(line, ' ')
        if len(content) > 0 and content[0] in ['ID', 'AC', 'DE', 'GN', 'KW', 'DR']:
            cancer_lines.append((content[0], (' '.join(content[1:]))[:-1]))
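
The extraction step can be tried on a few lines held in memory. This is a sketch under assumptions: the sample text is made up in the flat-file layout, and it uses `str.split()` without arguments, which splits on any whitespace and also drops the trailing newline, so no `[:-1]` is needed in this variant.

```python
import io

# Made-up sample; only the layout (two-letter code, then content) matters.
sample = io.StringIO(
    "ID   P53_HUMAN               Reviewed;         393 AA.\n"
    "AC   P04637; Q15086;\n"
    "SQ   SEQUENCE   393 AA;\n"
    "KW   3D-structure; Acetylation;\n"
)

wanted = {"ID", "AC", "DE", "GN", "KW", "DR"}
extracted = []
for raw in sample:
    tokens = raw.split()  # splits on any whitespace, drops empty items
    if tokens and tokens[0] in wanted:
        extracted.append((tokens[0], " ".join(tokens[1:])))

for item in extracted:
    print(item)
```

The `SQ` line is skipped because its code is not in the list of line types we keep.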


In the UniProt file, an entry always begins with an `ID` line, which holds the name of the entry. To gather the data related to an entry, we use the class `DataEntry`, which will contain all of it. Each time we encounter a new `ID` line, we create a new data entry and store it; the lines that follow fill this entry until the next `ID` line.

The other lines of interest to us are:

### Accession numbers (AC)

Items are separated by semicolons. Parsing it is pretty straightforward.

### Keywords (KW)

Items are separated by semicolons. Parsing it is pretty straightforward.
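
A minimal sketch of the semicolon splitting used for both `AC` and `KW` lines. `split_items` is a hypothetical helper; the sample values are taken from the entry printed later in this notebook.

```python
def split_items(content, delimiter=";"):
    """Split on the delimiter, trim whitespace, and drop empty items."""
    return [part.strip() for part in content.split(delimiter) if part.strip()]


print(split_items("P04637; Q15086; Q15087;"))
# ['P04637', 'Q15086', 'Q15087']
print(split_items("Tumor suppressor; Ubl conjugation; Zinc."))
# ['Tumor suppressor', 'Ubl conjugation', 'Zinc.']
```

The last keyword of an entry ends with a period rather than a semicolon, which is why it keeps its trailing dot.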

### Data cross-reference (DR)

This line contains different kinds of data. We are only interested here in the GO (Gene Ontology) numbers. A line that contains a GO number always begins with GO, so we check for that.
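
The check can be sketched as follows. `go_number` is a hypothetical helper, and both sample strings are illustrative `DR` contents written in the usual `database; identifier; ...` layout.

```python
def go_number(dr_content):
    """Return the GO number from the content of a DR line, or None for other databases."""
    fields = [f.strip() for f in dr_content.split(";") if f.strip()]
    if len(fields) > 1 and fields[0] == "GO":
        # fields[1] looks like "GO:0005813"; keep the part after the colon.
        return fields[1].split(":")[1]
    return None


print(go_number("GO; GO:0005813; C:centrosome; IDA:UniProtKB."))  # 0005813
print(go_number("PDB; 1A1U; NMR; -; A/C=324-358."))               # None
```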

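### Gene names (GN)

Items are separated by semicolons, and each item has the form `<category>=<value>[, <value2>, ...]`. Lines that contain only `and` are ignored, and the content between curly brackets is dropped.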
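
A minimal sketch of parsing the content of a `GN` line into a dict of lists. `parse_gene_names` is a hypothetical helper, and the second sample string (including its evidence tag) is made up for illustration.

```python
def parse_gene_names(content):
    """Parse the content of a GN line into {category: [values]}."""
    result = {}
    if content.strip() == "and":  # separator line between two genes
        return result
    for item in content.split(";"):
        key, sep, values = item.partition("=")
        if not sep:
            continue  # no "=", nothing to store
        for value in values.split(","):
            # Drop the evidence tag between curly brackets, if any.
            name = value.split("{")[0].strip()
            if name:
                result.setdefault(key.strip(), []).append(name)
    return result


print(parse_gene_names("Name=TP53; Synonyms=P53;"))
# {'Name': ['TP53'], 'Synonyms': ['P53']}
print(parse_gene_names("Name=GENE1 {ECO:0000303}; ORFNames=ORF1, ORF2;"))
# {'Name': ['GENE1'], 'ORFNames': ['ORF1', 'ORF2']}
```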



### Description (DE)

This is the most complex line to parse here. For our purposes, a line may have one of the following formats:

    1 <Category>: <subcategory>=<value>
    2           <subcategory>=<value>
    3 <Supercategory>:
    4         <Category>: <subcategory>=<value>
    5                  <subcategory>=<value>
    6 Flags: <value>

The code calls `.strip()` in several places to remove surrounding whitespace, and uses `[:-1]` to drop the semicolon at the end of a value.

We remove the content between curly brackets, which we do not use, by splitting the line at '{' and keeping only the left part.

You can see the details on the data these lines contain [here](https://web.expasy.org/docs/userman.html).
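
The state needed to follow these formats can be sketched in a small standalone function. This is a simplified illustration, not the notebook's implementation: `parse_description` is a hypothetical name, results are keyed by `(section, category)` instead of the nested dict used by `DataEntry`, and the `Contains` block and `Flags` line in the sample are made up.

```python
def parse_description(de_lines):
    """Group DE contents as {(section, category): {subcategory: [values]}}."""
    names, flags = {}, []
    section = None   # None, "Contains" or "Includes"
    category = None  # "RecName", "AltName" or "SubName"
    for content in de_lines:
        head, sep, rest = content.partition(":")
        head = head.strip()
        if sep and head in ("Contains", "Includes"):
            section, category = head, None
            continue
        if sep and head == "Flags":
            flags.append(rest.strip().rstrip(";"))
            continue
        if sep and head in ("RecName", "AltName", "SubName"):
            category, content = head, rest
        key, eq, value = content.partition("=")
        if not eq or category is None:
            continue
        # Drop the evidence tag between curly brackets and the final semicolon.
        value = value.split("{")[0].strip().rstrip(";").strip()
        names.setdefault((section, category), {}).setdefault(key.strip(), []).append(value)
    return names, flags


names, flags = parse_description([
    "RecName: Full=Cellular tumor antigen p53;",
    "AltName: Full=Antigen NY-CO-13;",
    "AltName: Full=Phosphoprotein p53;",
    "Contains:",
    "RecName: Full=Example chain {ECO:0000255};",
    "Short=EC1;",
    "Flags: Precursor;",
])
print(names)
print(flags)
```

A line without a category prefix, such as `Short=EC1;`, is attached to the most recent category and section, which is why both have to be remembered between lines.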


In [179]:
class DataEntry:
    def __init__(self, id):
        self.id = id
        self.ac = []
        self.desc = {"AltName" : {}, 
                     "RecName" : {},
                     "SubName" : {},
                     "Contains" : {
                         "RecName" : {}, 
                         "AltName" : {}, 
                         "SubName" : {}
                     }, 
                     "Includes" : {
                         "RecName" : {}, 
                         "AltName" : {}, 
                         "SubName" : {}
                     }
                    }
        self.flags = []
        self.go = []
        self.keywords = []
        self.gn = {}
        
    # Generates triples from the data. Not updated.
    def triples(self):
        triples = []
        
        # Keywords
        for item in self.keywords:
            triples.append((self.id, "keyword", item))
            
        # Accession numbers
        for item in self.ac:
            triples.append((self.id, "acnumber", item))
            
        # Recommended names (desc maps each subcategory to a list of names)
        for names in self.desc['RecName'].values():
            for item in names:
                triples.append((self.id, "recname", item))
        
        # Alternative names
        for names in self.desc['AltName'].values():
            for item in names:
                triples.append((self.id, "altname", item))
            
        return triples
    
    # Translates the data to a printable format.
    def __str__(self):
        desc = ""
        desc = desc + "ID=" + self.id 
        desc = desc +"\nDescription=" + str(self.desc)
        desc = desc + "\nKeywords=" + str(self.keywords)
        desc = desc + "\nACcessionNumbers=" + str(self.ac)
        desc = desc + "\nGeneNames=" + str(self.gn)
        desc = desc + "\nGeneOntology=" + str(self.go)
        return desc

In [180]:
cancerData = []

# State kept outside the loop

# The current data entry
currentEntry = None

# These two correspond to the keys declared previously in the
# desc field of the DataEntry class: category is the first level
# (Contains/Includes) and subCategory is the second level.

# Expected values : [AltName, RecName, SubName]
subCategory = None

# Expected values : [Contains, Includes, None]
category = None
            
for line in cancer_lines:
    
    # If the line is empty, continue on to the next iteration
    if not line[1]:
        continue
    
    # If we encounter an ID line
    if (line[0] == 'ID'):
        # We create a new data entry and store it.
        # All following lines will store their data into this entry, until we meet a new ID line.
        currentEntry = DataEntry(line[1].split(' ')[0])
        cancerData.append(currentEntry)
        # A Contains/Includes section must not carry over to the next entry.
        category = None
        subCategory = None

    elif (line[0] == 'AC'):
        if currentEntry:
            currentEntry.ac.extend(item.strip() for item in split(line[1], ';'))

    elif (line[0] == 'KW'):
        if currentEntry:
            currentEntry.keywords.extend(item.strip() for item in split(line[1], ';'))
                
    elif (line[0] == 'DR'):
        if currentEntry:
                semicolonsplit = split(line[1], ';')
                # check if the line begins with 'GO', in which case we retrieve the GO number in the line
                if len(semicolonsplit) > 1 and semicolonsplit[0] == 'GO':
                    currentEntry.go.append(split(semicolonsplit[1], ':')[1].strip())     
    
    elif (line[0] == 'GN'):
        
        if currentEntry:
            
            # we ignore the lines that only contain 'and'
            if line[1] == 'and':
                continue
                
            # Gene names are separated by semicolons.
            # An item is of the form <category>=<value>[, <value2>, ...].
            for item in split(line[1], ';'):
                
                content = split(item, '=')
                
                if len(content) > 1:
                    # there may be several values, separated by commas
                    commasplit = split(content[1], ',')
                    for item in commasplit:
                        value = split(item, '{')[0].strip()
                        # we store the value in the array of category
                        appendToSubArray(currentEntry.gn, content[0].strip(), value)
                    
    elif (line[0] == 'DE'):
        if currentEntry:
            colonsplit = split(line[1], ':')
            if len(colonsplit) and colonsplit[0] in ['Contains', 'Includes']:
                category = colonsplit[0]
            elif len(colonsplit) and colonsplit[0] in ['AltName', 'RecName', 'SubName']:
                subCategory = colonsplit[0]
                content = split(colonsplit[1][:-1], '=')
                
                # to remove the content between brackets
                value = split(content[1], '{')[0].strip()
                if category:
                    appendToSubArray(currentEntry.desc[category][subCategory], content[0].strip(), value)
                else:
                    appendToSubArray(currentEntry.desc[subCategory], content[0].strip(), value)
            elif colonsplit[0] == 'Flags':
                currentEntry.flags.append(colonsplit[1][:-1].strip())
            else:
                content = split(line[1], '=')
                if len(content) < 2:
                    continue
                # to remove the content between brackets
                value = split(content[1][:-1], '{')[0].strip()
                if category and subCategory:
                    appendToSubArray(currentEntry.desc[category][subCategory], content[0].strip(), value)
                elif subCategory:
                    appendToSubArray(currentEntry.desc[subCategory], content[0].strip(), value)
                
    
            
    


## Content of a data entry

Once we have extracted the UniProt entries, we can preview the data that one such entry holds after parsing.

In [181]:
if len(cancerData):
    print(cancerData[0])

ID=P53_HUMAN
Description={'AltName': {'Full': ['Antigen NY-CO-13', 'Phosphoprotein p53', 'Tumor suppressor p53']}, 'Includes': {'AltName': {}, 'RecName': {}, 'SubName': {}}, 'Contains': {'AltName': {}, 'RecName': {}, 'SubName': {}}, 'RecName': {'Full': ['Cellular tumor antigen p53']}, 'SubName': {}}
Keywords=['3D-structure', 'Acetylation', 'Activator', 'Alternative promoter usage', 'Alternative splicing', 'Apoptosis', 'Biological rhythms', 'Cell cycle', 'Complete proteome', 'Cytoplasm', 'Disease mutation', 'DNA-binding', 'Endoplasmic reticulum', 'Glycoprotein', 'Host-virus interaction', 'Isopeptide bond', 'Li-Fraumeni syndrome', 'Metal-binding', 'Methylation', 'Mitochondrion', 'Necrosis', 'Nucleus', 'Phosphoprotein', 'Polymorphism', 'Reference proteome', 'Repressor', 'Transcription', 'Transcription regulation', 'Tumor suppressor', 'Ubl conjugation', 'Zinc.']
ACcessionNumbers=['P04637', 'Q15086', 'Q15087', 'Q15088', 'Q16535', 'Q16807', 'Q16808', 'Q16809', 'Q16810', 'Q16811', 'Q16848', '

## Generating queries from data entries

Now that we have parsed and categorized the data, we can generate queries to insert it into a database. Here are the tables we have chosen to hold the UniProt data:

### The ID table

This table contains only the id (1) of every entry.

### The AC table

For every accession number of an entry, it contains :
* the entry id (1);
* and an associated accession number (2).

### The KW table

For every keyword of an entry, it contains : 
* the entry id (1) 
* the associated keyword (2).

### The GeneName table

For every gene name of an entry, it contains :
* the entry id (1) 
* the type of name (2) 
* the gene name (3).

The type can take three values : 
* synonym
* orfname
* name

### The Gene Ontology table

For every gene ontology reference of an entry, it contains :
* the entry id (1) 
* the reference number (2).

### The Flag table

For every flag of an entry, it contains :
* the entry id (1) 
* the flag (2).

### The Name table

This table bundles the description data into one table. It contains :
* the entry id
* the name category (AltName, RecName, SubName)
* the name subcategory (Full, Short, EC)
* the actual name
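
To illustrate how nested description data flattens into rows of this table, here is a minimal sketch. The `desc` dict holds sample values in the same `category -> subcategory -> names` shape as the parsed data; `entry_id` and `rows` are hypothetical names.

```python
desc = {
    "RecName": {"Full": ["Cellular tumor antigen p53"]},
    "AltName": {"Full": ["Antigen NY-CO-13", "Phosphoprotein p53"]},
}
entry_id = "P53_HUMAN"

# One row per name: (entry id, category, subcategory, name)
rows = [
    (entry_id, category, subcategory, name)
    for category, by_subcategory in desc.items()
    for subcategory, names in by_subcategory.items()
    for name in names
]
for row in rows:
    print(row)
```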

You can alter the names of the tables in the query below if you so wish.

The code below generates one SQL script for each table.

You can find the database schema [here](../Database/unitprot_database.pdf).

In [182]:
# change your table names here
id_table = "unitprot"
ac_table = "accession_number_unitprot"
name_table = "description_unitprot"
keyword_table = "keyword_unitprot"
genename_table = "gene_name_unitprot"
gene_ontology_table = "go_unitprot"
flag_table = "flag_unitprot"

lines = {
    id_table : [],
    ac_table : [], 
    # if you want to have both altname and recname in the same table, use this one
    name_table: [],
    keyword_table : [], 
    genename_table : [], 
    gene_ontology_table : [],
    flag_table : []}

for entry in cancerData:
    
    # id table
    lines[id_table].append((entry.id))
    
    # ac table
    [lines[ac_table].append((entry.id, item)) for item in entry.ac]
    
    # Name table: alternative names first, then recommended names,
    # each taken from the entry itself and from its Contains and Includes sections
    for name_type in ("AltName", "RecName"):
        for group in (entry.desc, entry.desc["Contains"], entry.desc["Includes"]):
            for k, names in group[name_type].items():
                lines[name_table].extend((entry.id, name_type, k, item) for item in names)
    
    # flag table
    [lines[flag_table].append((entry.id, item)) for item in entry.flags]
    
    # keyword table
    [lines[keyword_table].append((entry.id, item)) for item in entry.keywords]
    
    # gene ontology table
    [lines[gene_ontology_table].append((entry.id, item)) for item in entry.go]
    
    # gene name table; plural categories lose their final "s" (e.g. Synonyms -> Synonym)
    lines[genename_table].extend((entry.id, k[:-1] if k.endswith("s") else k, item) for k, v in entry.gn.items() for item in v)
    
for key, value in lines.items():
    writeInsertQuery(key, value, "queries/" + str(key) + ".sql")

In [183]:
print("Reached the end of the notebook !")

Reached the end of the notebook !
