# Gene Association and Annotations Preprocessing<a id='top'></a>
**Sections:**<br>
[0) Description](#0)<br>
[1) Importing Modules and Packages](#1)<br>
[2) Configuration](#2)<br>
[3) Loading Gene Ontology](#3)<br>
[4) Genes and Annotations (without 'ND' and 'IEA' annotations)](#4)<br>
[5) Genes and Annotations (without 'ND' and with 'IEA' annotations)](#5)<br>

## Description<a id='0'></a>
**Aim:** This Jupyter Notebook preprocesses the gene association file of the species of interest (e.g., _human_ or *yeast*). It produces six files: three for the $-$IEA setting and three for the $+$IEA setting. Within each setting there is one file per Gene Ontology sub-ontology, identified by the file extension: BP for *biological process*, CC for *cellular component*, and MF for *molecular function*.<br>

---
**Output file format:** (space separated)<br>
_accession_id gene_id go_annotation_1 go_annotation_2 ... go_annotation_n_ <br>
A0A1B0GTQ1 CYP2D7 GO:0006082 GO:0006805 GO:0019369 GO:0042738 GO:0055114<br>
E7EML9 PRSS44 GO:0006508 GO:0007281 GO:0007283<br>
... <br>
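
A minimal sketch of how one line in this format could be read back, assuming only the space-separated layout described above. The helper name `parse_annotation_line` is hypothetical and is not part of the notebook.

```python
def parse_annotation_line(line):
    """Split one output line into (accession_id, gene_id, [GO annotations])."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected an accession id, a gene id, and at least one GO annotation")
    accession_id, gene_id, *go_terms = fields
    return accession_id, gene_id, go_terms


# Example line taken from the format description above
example = "E7EML9 PRSS44 GO:0006508 GO:0007281 GO:0007283"
accession_id, gene_id, go_terms = parse_annotation_line(example)
print(accession_id, gene_id, len(go_terms))  # E7EML9 PRSS44 3
```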

For more information regarding GO evidence codes such as *No biological Data available (ND)* and *Inferred from Electronic Annotation (IEA)* refer to:<br>[Guide to GO evidence codes](http://geneontology.org/docs/guide-go-evidence-codes/)<br>

---
Files needed for this preprocessing are:
 * **Gene ontology:** ['go.obo' file](http://current.geneontology.org/ontology/go.obo)<br><br>
 
 * **Association files:** [gene association files ingested from GO Consortium members](http://current.geneontology.org/products/pages/downloads.html)
  * **Human** - [Gene Association file (Homo sapiens)](http://geneontology.org/gene-associations/goa_human.gaf.gz)
  * **Yeast** - [Gene Association file (Saccharomyces cerevisiae)](http://current.geneontology.org/annotations/sgd.gaf.gz)<br>

---
[back to top](#top)<br>


## Importing Modules and Packages<a id='1'></a>
[back to top](#top)<br>

In [None]:
import pandas as pd
import numpy as np
import os
import requests
import easydict
import linecache
import pprint

pp = pprint.PrettyPrinter(indent=4)

## Configuration<a id='2'></a>
[back to top](#top)<br>

In [None]:
species = 'human' # species of interest whose association file is loaded and whose results are saved

if species=='human':
    association_file_name = 'goa_human.gaf.gz' # human
    association_file_url = 'http://geneontology.org/gene-associations/goa_human.gaf.gz'
elif species=='yeast':
    association_file_name = 'sgd.gaf.gz' # yeast
    association_file_url = 'http://current.geneontology.org/annotations/sgd.gaf.gz'
else:
    # fail early instead of hitting a NameError further down
    raise ValueError("Unsupported species '{}': expected 'human' or 'yeast'".format(species))
    
go_file = 'go.obo' # Gene Ontology file
go_url = 'http://current.geneontology.org/ontology/go.obo'
    
args = easydict.EasyDict({
    "go_dir": 'gene_ontology/raw/',    # directory of the Gene Ontology 'go.obo' file
    "association_file_dir": 'species/{}/association_file/raw'.format(species), # directory of the association file of the species of interest
    "result_dir": 'species/{}/association_file/processed'.format(species),     # directory in which the results are saved
    "fully_annotated": True,           # keep only genes annotated in all three sub-ontologies by non-ND, non-IEA evidence; the +IEA setting keeps the same genes and adds their IEA annotations
    "download_gene_ontology": True,    # download the latest version of Gene Ontology into 'go_dir'
    "download_association_file": True  # download the association file of the species of interest into 'association_file_dir'
})

os.makedirs(args.result_dir, exist_ok=True)  # create 'result_dir' folder (if it does not exist already)

subontology_map = {"C":"CC", "P":"BP", "F":"MF"}

## Loading Gene Ontology<a id='3'></a>
[back to top](#top)<br>

In [None]:
if args.download_gene_ontology:
    os.makedirs(args.go_dir, exist_ok=True)  # create the 'go_dir' folder (if it does not exist already)
    print("Downloading the latest version of Gene Ontology into '{}'...".format(args.go_dir))
    r = requests.get(go_url, allow_redirects=True)
    r.raise_for_status()  # stop here if the download failed instead of saving an error page
    with open('{}/{}'.format(args.go_dir, go_file), 'wb') as fw:
        fw.write(r.content)

# the second line of 'go.obo' holds the release tag
print("Gene Ontology {}".format(linecache.getline('{}/{}'.format(args.go_dir, go_file), 2))) # at the time of writing: releases/2021-05-01
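
The cell above reads the release tag from a fixed line number. As a hedged alternative, the header can be scanned for the `data-version:` tag so the lookup does not depend on line order. The helper name `read_obo_data_version` and the sample header lines are illustrative only.

```python
def read_obo_data_version(lines):
    """Return the value of the 'data-version:' header tag, or None if it is absent."""
    for line in lines:
        line = line.strip()
        if line.startswith("[Term]"):
            break  # the header ends where the first stanza starts
        if line.startswith("data-version:"):
            return line.split(":", 1)[1].strip()
    return None


# Illustrative header, shaped like the first lines of an OBO file
header = [
    "format-version: 1.2",
    "data-version: releases/2021-05-01",
    "ontology: go",
    "",
    "[Term]",
]
print(read_obo_data_version(header))  # releases/2021-05-01
```

With a real file, the same function could be called on an open file handle, since it only iterates over lines.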

In [None]:
"""Reading the Gene Ontology file and splitting it into [Term] stanzas"""
with open("{}/{}".format(args.go_dir, go_file)) as f:
    content = f.read()
# drop everything from the first [Typedef] stanza onward, then split into one chunk per term
content = content.split("[Typedef]")[0].split("[Term]")
print("Information of the last GO term in the file:\n~~~~~~~~~~~~~~~~~~~~~~~~~{}".format(content[-1]))

In [None]:
"""Going through every GO term and extracting the information needed ('id', 'alt_id', 'namespace', and 'is_obsolete')"""
go_term_dict = {}  # GO id -> {'alt_id': [...], 'namespace': ..., 'is_obsolete': ...}
alt_id_dict = {}   # alternative GO id -> primary GO id
for c in content:
    go_id = ''
    for line in c.split("\n"):
        # id
        if line.startswith("id: GO:"):
            go_id = line.split("id: ")[1]
            go_term_dict[go_id] = {}
        # alt_id
        if line.startswith("alt_id:"):
            alt_id = line.split("alt_id: ")[1]
            go_term_dict[go_id].setdefault("alt_id", []).append(alt_id)
            alt_id_dict[alt_id] = go_id
        # namespace
        if line.startswith("namespace:"):
            go_term_dict[go_id]["namespace"] = line.split("namespace: ")[1]
        # is_obsolete
        if line.startswith("is_obsolete:"):
            go_term_dict[go_id]["is_obsolete"] = line.split("is_obsolete: ")[1]
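
To make the stanza parsing easier to check by hand, here is a self-contained sketch that applies the same idea to a tiny in-memory sample. The GO identifiers in `SAMPLE_OBO` are invented for illustration, and `parse_terms` is a hypothetical helper, not the notebook's own code.

```python
# Invented sample: two [Term] stanzas followed by a [Typedef] stanza
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: GO:9000001
name: made-up process
namespace: biological_process
alt_id: GO:9000002

[Term]
id: GO:9000003
name: made-up obsolete function
namespace: molecular_function
is_obsolete: true

[Typedef]
id: part_of
"""


def parse_terms(obo_text):
    """Return (terms, alt_ids) built from the [Term] stanzas of an OBO string."""
    terms, alt_ids = {}, {}
    body = obo_text.split("[Typedef]")[0]
    for stanza in body.split("[Term]")[1:]:  # element 0 is the file header
        go_id = None
        for line in stanza.splitlines():
            key, sep, value = line.partition(": ")
            if not sep:
                continue  # blank line or a line without a 'tag: value' shape
            if key == "id":
                go_id = value
                terms[go_id] = {}
            elif go_id is None:
                continue  # ignore tags that appear before the term id
            elif key == "alt_id":
                terms[go_id].setdefault("alt_id", []).append(value)
                alt_ids[value] = go_id
            elif key in ("namespace", "is_obsolete"):
                terms[go_id][key] = value
    return terms, alt_ids


terms, alt_ids = parse_terms(SAMPLE_OBO)
print(terms)
print(alt_ids)
```

Matching on the exact tag name avoids relying on string slicing, at the cost of assuming every relevant line has the `tag: value` shape.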

In [None]:
"""printing how the key:values are organized for every GO term"""
for i in range(15):
    print(list(go_term_dict)[i], end=": ")
    pp.pprint(go_term_dict[list(go_term_dict)[i]])

In [None]:
"""grouping GO terms based on the sub-ontologies they belong to"""
subontology_go_term_dict = {}
for go_id in go_term_dict:
    if not go_term_dict[go_id].get('is_obsolete', False): # or => if 'is_obsolete' not in go_term_dict[go_id]:
        subontology_go_term_dict.setdefault(go_term_dict[go_id]['namespace'].split('_')[1][0].upper(), []).append(go_id)

In [None]:
"""printing how the key:values are organized for different sub-ontologies"""
for subontology in subontology_go_term_dict:
    print("{} ({}):: {} <= {} GO term => {}".format(
        subontology, 
        subontology_map[subontology], 
        " ".join(subontology_go_term_dict[subontology][:3]), 
        len(subontology_go_term_dict[subontology]), 
        " ".join(subontology_go_term_dict[subontology][-3:])))
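
The grouping cell derives the aspect letter from the namespace string by position. A hedged alternative is an explicit lookup table, which fails loudly on an unexpected namespace and can store the terms in sets for fast membership tests. The names `NAMESPACE_TO_ASPECT` and `group_by_aspect` and the sample identifiers are hypothetical.

```python
# Explicit namespace -> aspect letter table (letters as used in the association file)
NAMESPACE_TO_ASPECT = {
    "biological_process": "P",
    "cellular_component": "C",
    "molecular_function": "F",
}


def group_by_aspect(terms):
    """Group non-obsolete GO ids by aspect letter; values are sets."""
    groups = {}
    for go_id, info in terms.items():
        if info.get("is_obsolete") == "true":
            continue
        aspect = NAMESPACE_TO_ASPECT[info["namespace"]]  # KeyError on an unknown namespace
        groups.setdefault(aspect, set()).add(go_id)
    return groups


# Invented identifiers, shaped like the values of 'go_term_dict'
sample_terms = {
    "GO:9000001": {"namespace": "biological_process"},
    "GO:9000003": {"namespace": "molecular_function", "is_obsolete": "true"},
    "GO:9000004": {"namespace": "cellular_component"},
}
print(group_by_aspect(sample_terms))
```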

## Genes and Annotations (without 'ND' and 'IEA' annotations)<a id='4'></a>
[back to top](#top)<br>

In [None]:
if args.download_association_file:
    os.makedirs(args.association_file_dir, exist_ok=True)  # create the 'association_file_dir' folder (if it does not exist already)
    print("Downloading the latest version of association file into '{}'...".format(args.association_file_dir))
    r = requests.get(association_file_url, allow_redirects=True)
    r.raise_for_status()  # stop here if the download failed instead of saving an error page
    with open('{}/{}'.format(args.association_file_dir, association_file_name), 'wb') as fw:
        fw.write(r.content)
print("Done!")

In [None]:
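df = pd.read_csv("{}/{}".format(args.association_file_dir, association_file_name), sep='\t', comment="!", skip_blank_lines=True, header=None, dtype=str)
# zero-based GAF columns: 1 = DB object id, 2 = DB object symbol, 3 = qualifier, 4 = GO id, 6 = evidence code, 8 = aspect
df = df.iloc[:,[1, 2, 3, 4, 6, 8]]
if not df[3].isnull().any():
    # qualifier present on every row: drop negated ("NOT") annotations
    df = df[~df[3].str.contains("NOT")]
else:
    # qualifier is optional in this file: keep only rows without any qualifier
    df = df[df[3].isnull()]
# drop the qualifier column before dropna(); otherwise the rows kept by the else branch (null qualifier) would all be removed
df = df.drop(df.columns[2], axis=1)
df = df.dropna().reset_index(drop=True)
df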
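
The column selection and the `NOT` filter can be illustrated without pandas on a small in-memory sample. This is only a sketch: the rows in `GAF_SAMPLE` are invented and truncated to the first nine tab-separated columns, and `read_positive_annotations` is a hypothetical helper. Unlike the substring test in the cell above, it matches `NOT` as a whole `|`-separated token.

```python
import csv
import io

# Invented rows: DB, object id, symbol, qualifier, GO id, reference, evidence, with/from, aspect
GAF_SAMPLE = (
    "!gaf-version: 2.2\n"
    "UniProtKB\tX00001\tGENE1\tenables\tGO:9000001\tPMID:0\tIDA\t\tF\n"
    "UniProtKB\tX00002\tGENE2\tNOT|enables\tGO:9000001\tPMID:0\tIDA\t\tF\n"
    "UniProtKB\tX00003\tGENE3\tinvolved_in\tGO:9000002\tPMID:0\tIEA\t\tP\n"
)


def read_positive_annotations(text):
    """Return one dict per annotation row whose qualifier is not negated."""
    rows = []
    data_lines = (ln for ln in io.StringIO(text) if not ln.startswith("!"))  # skip comment lines
    for rec in csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(rec) < 9:
            continue  # blank or malformed line
        if "NOT" in rec[3].split("|"):
            continue  # negated annotation
        rows.append({
            "object_id": rec[1],
            "symbol": rec[2],
            "go_id": rec[4],
            "evidence": rec[6],
            "aspect": rec[8],
        })
    return rows


for row in read_positive_annotations(GAF_SAMPLE):
    print(row)
```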

In [None]:
"""keeping track of the gene ids and their mappings"""
protein_gene_id_map = {}  # DB object symbol (column 2) -> DB object id (column 1)
for object_id, object_symbol in zip(df[1], df[2]):
    protein_gene_id_map[object_symbol] = object_id

##### removing 'ND' and 'IEA' annotations

In [None]:
df = df[(df[6]!='ND') & (df[6]!='IEA')]
df

In [None]:
"""protein dictionary to keep track of annotations for proteins (from each sub-ontology)"""
proteins_dict = {}
for index, row in df.iterrows():
    gene_id = row[2]
    go_term_id = row[4]
    subontology = row[8]
    if go_term_id in subontology_go_term_dict[subontology]:
        proteins_dict.setdefault(gene_id, dict()).setdefault(subontology, set()).add(go_term_id)
    elif go_term_id in alt_id_dict: # some genes are annotated with an alternative GO term id; map it to the primary id
        proteins_dict.setdefault(gene_id, dict()).setdefault(subontology, set()).add(alt_id_dict[go_term_id])
        
"""printing how the key:values are organized for every gene/protein"""
for i in range(5):
    print(list(proteins_dict)[i], end=": ")
    pp.pprint(proteins_dict[list(proteins_dict)[i]])
print("\nTotal number of genes/proteins annotated:", len(proteins_dict))
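
The two-step lookup used in the loop above (primary id first, then alternative id) can be isolated in a small function. This is a sketch under the assumption that terms are grouped by aspect letter and that alternative ids map to primary ids, as in the earlier cells. The name `resolve_go_id` and the sample identifiers are hypothetical.

```python
def resolve_go_id(go_id, aspect, terms_by_aspect, alt_id_to_primary):
    """Return the primary GO id for an annotation, or None if it cannot be resolved."""
    if go_id in terms_by_aspect.get(aspect, ()):
        return go_id  # already a current primary id in this sub-ontology
    # fall back to the alternative-id table; unknown or obsolete ids give None
    return alt_id_to_primary.get(go_id)


# Invented identifiers for illustration
terms_by_aspect = {"P": {"GO:9000001"}, "F": {"GO:9000005"}}
alt_id_to_primary = {"GO:9000002": "GO:9000001"}

print(resolve_go_id("GO:9000001", "P", terms_by_aspect, alt_id_to_primary))  # GO:9000001
print(resolve_go_id("GO:9000002", "P", terms_by_aspect, alt_id_to_primary))  # GO:9000001
print(resolve_go_id("GO:9000099", "P", terms_by_aspect, alt_id_to_primary))  # None
```

Annotations that resolve to `None` correspond to the rows the loop above silently skips.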

In [None]:
"""keeping track of fully annotated genes/proteins"""
# a set keeps the later membership tests fast
fully_annotated_proteins_wo_iea = {protein for protein in proteins_dict if len(proteins_dict[protein]) == 3}
print("Out of {} proteins {} are (experimentally or manually) annotated by all three sub-ontologies.".format(len(proteins_dict), len(fully_annotated_proteins_wo_iea)))

In [None]:
"""re-organization of proteins annotations based on sub-ontologies"""
subontologies_proteins = {}
for protein in proteins_dict:
    if args.fully_annotated and protein not in fully_annotated_proteins_wo_iea:
        continue  # skip genes that are not annotated by all three sub-ontologies
    for subontology in proteins_dict[protein]:
        subontologies_proteins.setdefault(subontology, dict())[protein] = proteins_dict[protein][subontology]
            
"""printing how the key:values are organized for different sub-ontologies"""
for subontology in subontologies_proteins:
    print("{} ({}): {} (length: {})".format(subontology, subontology_map[subontology], 
                                            sorted(list(subontologies_proteins[subontology]))[:5],
                                            len(list(subontologies_proteins[subontology]))))
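
The filtering and regrouping steps can be sketched as a single function on toy data. The function name `regroup_by_aspect`, its parameters, and the sample gene names are hypothetical. The check for all three aspects is equivalent to the `len(...) == 3` test only as long as the aspects are limited to `P`, `C`, and `F`.

```python
def regroup_by_aspect(annotations, require_all_aspects=True, aspects=("P", "C", "F")):
    """Turn {gene: {aspect: terms}} into {aspect: {gene: terms}}.

    With require_all_aspects=True, genes missing any of the aspects are dropped.
    """
    regrouped = {}
    for gene, by_aspect in annotations.items():
        if require_all_aspects and not all(a in by_aspect for a in aspects):
            continue
        for aspect, terms in by_aspect.items():
            regrouped.setdefault(aspect, {})[gene] = terms
    return regrouped


# Invented genes and GO ids for illustration
annotations = {
    "GENE1": {"P": {"GO:9000001"}, "C": {"GO:9000004"}, "F": {"GO:9000005"}},
    "GENE2": {"P": {"GO:9000001"}},
}
print(sorted(regroup_by_aspect(annotations)["P"]))                             # ['GENE1']
print(sorted(regroup_by_aspect(annotations, require_all_aspects=False)["P"]))  # ['GENE1', 'GENE2']
```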

In [None]:
"""saving the result for no ND and -IEA"""
for subontology in sorted(list(subontologies_proteins)):
    max_length = 0
    with open("{}/gene_protein_GO_terms_without_IEA.{}".format(args.result_dir, subontology_map[subontology]), "w") as fw:
        for protein in subontologies_proteins[subontology]:
            fw.write("{} {} {}\n".format(protein_gene_id_map[protein], protein, " ".join(sorted(subontologies_proteins[subontology][protein]))))
            if max_length < len(subontologies_proteins[subontology][protein]):
                max_length = len(subontologies_proteins[subontology][protein])
        print("{} ({}): {} genes (maximum annotation length: {})".format(subontology, 
                        subontology_map[subontology], 
                        len(subontologies_proteins[subontology]), 
                        max_length))
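
As a small illustration of the saving step, the sketch below formats records into the space-separated output layout and tracks the longest annotation list, writing to an in-memory buffer instead of a file. The helper names `format_output_line` and `write_annotations` are hypothetical. The example record reuses the line shown in the Description.

```python
import io


def format_output_line(accession_id, gene_id, go_terms):
    """Build one output line: accession id, gene id, then the sorted GO ids."""
    return "{} {} {}".format(accession_id, gene_id, " ".join(sorted(go_terms)))


def write_annotations(handle, records):
    """Write (accession_id, gene_id, go_terms) records; return the longest annotation length."""
    max_length = 0
    for accession_id, gene_id, go_terms in records:
        handle.write(format_output_line(accession_id, gene_id, go_terms) + "\n")
        max_length = max(max_length, len(go_terms))
    return max_length


buffer = io.StringIO()
records = [("E7EML9", "PRSS44", {"GO:0007283", "GO:0006508", "GO:0007281"})]
longest = write_annotations(buffer, records)
print(buffer.getvalue(), end="")
print("maximum annotation length:", longest)
```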

## Genes and Annotations (without 'ND' and with 'IEA' annotations)<a id='5'></a>
[back to top](#top)<br>

In [None]:
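df = pd.read_csv("{}/{}".format(args.association_file_dir, association_file_name), sep='\t', comment="!", skip_blank_lines=True, header=None, dtype=str)
# zero-based GAF columns: 1 = DB object id, 2 = DB object symbol, 3 = qualifier, 4 = GO id, 6 = evidence code, 8 = aspect
df = df.iloc[:,[1, 2, 3, 4, 6, 8]]
if not df[3].isnull().any():
    # qualifier present on every row: drop negated ("NOT") annotations
    df = df[~df[3].str.contains("NOT")]
else:
    # qualifier is optional in this file: keep only rows without any qualifier
    df = df[df[3].isnull()]
# drop the qualifier column before dropna(); otherwise the rows kept by the else branch (null qualifier) would all be removed
df = df.drop(df.columns[2], axis=1)
df = df.dropna().reset_index(drop=True)
df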

##### removing just 'ND' and not 'IEA' annotations

In [None]:
df = df[(df[6]!='ND')]
df
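
The only difference between the two settings at this step is the set of excluded evidence codes. A minimal sketch, assuming annotation rows are plain dicts with an `evidence` key; the function name `filter_by_evidence` and the sample rows are hypothetical.

```python
def filter_by_evidence(rows, include_iea):
    """Drop 'ND' rows always, and 'IEA' rows too when include_iea is False."""
    excluded = {"ND"} if include_iea else {"ND", "IEA"}
    return [row for row in rows if row["evidence"] not in excluded]


# Invented rows for illustration
rows = [
    {"symbol": "GENE1", "evidence": "IDA"},
    {"symbol": "GENE2", "evidence": "IEA"},
    {"symbol": "GENE3", "evidence": "ND"},
]
print([r["symbol"] for r in filter_by_evidence(rows, include_iea=False)])  # ['GENE1']
print([r["symbol"] for r in filter_by_evidence(rows, include_iea=True)])   # ['GENE1', 'GENE2']
```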

In [None]:
"""protein dictionary to keep track of annotations for proteins (from each sub-ontology)"""
proteins_dict = {}
for index, row in df.iterrows():
    gene = row[2]
    go_term_id = row[4]
    subontology = row[8]
    if go_term_id in subontology_go_term_dict[subontology]:
        proteins_dict.setdefault(gene, dict()).setdefault(subontology, set()).add(go_term_id)
    elif go_term_id in alt_id_dict: # some genes are annotated with an alternative GO term id; map it to the primary id
        proteins_dict.setdefault(gene, dict()).setdefault(subontology, set()).add(alt_id_dict[go_term_id])
        
"""printing how the key:values are organized for every gene/protein"""
for i in range(10):
    print(list(proteins_dict)[i], end=": ")
    pp.pprint(proteins_dict[list(proteins_dict)[i]])
print("\nTotal number of genes/proteins annotated:", len(proteins_dict))

In [None]:
"""re-organization of proteins annotations based on sub-ontologies"""
subontologies_proteins = {}
for protein in proteins_dict:
    if args.fully_annotated and protein not in fully_annotated_proteins_wo_iea:
        continue  # skip genes lacking non-IEA annotations in any of the three sub-ontologies
    for subontology in proteins_dict[protein]:
        subontologies_proteins.setdefault(subontology, dict())[protein] = proteins_dict[protein][subontology]
            
"""printing how the key:values are organized for different sub-ontologies"""
for subontology in subontologies_proteins:
    print("{} ({}): {} (length: {})".format(subontology, subontology_map[subontology], 
                                            sorted(list(subontologies_proteins[subontology]))[:5],
                                            len(list(subontologies_proteins[subontology]))))

In [None]:
"""saving the result for no ND and +IEA"""
for subontology in sorted(list(subontologies_proteins)):
    max_length = 0
    with open("{}/gene_protein_GO_terms_with_IEA.{}".format(args.result_dir, subontology_map[subontology]), "w") as fw:
        for protein in subontologies_proteins[subontology]:
            fw.write("{} {} {}\n".format(protein_gene_id_map[protein], protein, " ".join(sorted(subontologies_proteins[subontology][protein]))))
            if max_length < len(subontologies_proteins[subontology][protein]):
                max_length = len(subontologies_proteins[subontology][protein])
        print("{} ({}): {} genes (maximum annotation length: {})".format(subontology, 
                        subontology_map[subontology], 
                        len(subontologies_proteins[subontology]), 
                        max_length))

[back to top](#top)<br>

---