# Creating a Cellular Compartment Reference for OmicsIntegrator2

In [1]:
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline
from collections import defaultdict

def flatten(list_of_lists): return [item for sublist in list_of_lists for item in sublist]

import mygene

In [2]:
genes = pd.read_csv('../../GONN/GO/cellular_component.csv')
genes.head()

Unnamed: 0,GeneSymbol,GO_ID,GO_term,Evidence
0,A1BG,GO:0005576,extracellular region,HDA
1,A1BG,GO:0005576,extracellular region,IDA
2,A1BG,GO:0005576,extracellular region,TAS
3,A1BG,GO:0005615,extracellular space,HDA
4,A1BG,GO:0031093,platelet alpha granule lumen,TAS


## I. Evidence Codes

Copied from http://geneontology.org/page/guide-go-evidence-codes

### Experimental Evidence codes
Use of an experimental evidence code in a GO annotation indicates that the cited paper displayed results from a physical characterization of a gene or gene product that has supported the association of a GO term. The Experimental Evidence codes are:

- Inferred from Experiment (EXP)
- Inferred from Direct Assay (IDA)
- Inferred from Physical Interaction (IPI)
- Inferred from Mutant Phenotype (IMP)
- Inferred from Genetic Interaction (IGI)
- Inferred from Expression Pattern (IEP)

### High Throughput (HTP) evidence codes
High throughput (HTP) evidence codes may be used to make annotations based upon high throughput methodologies. Use of HTP evidence codes should be carefully considered and follow the GOC's guidelines for their use. The High Throughput Evidence Codes are:

- Inferred from High Throughput Experiment (HTP)
- Inferred from High Throughput Direct Assay (HDA)
- Inferred from High Throughput Mutant Phenotype (HMP)
- Inferred from High Throughput Genetic Interaction (HGI)
- Inferred from High Throughput Expression Pattern (HEP)

### Computational Analysis evidence codes
Use of the computational analysis evidence codes indicates that the annotation is based on an in silico analysis of the gene sequence and/or other data as described in the cited reference. The evidence codes in this category also indicate a varying degree of curatorial input. The Computational Analysis evidence codes are:

- Inferred from Sequence or structural Similarity (ISS)
- Inferred from Sequence Orthology (ISO)
- Inferred from Sequence Alignment (ISA)
- Inferred from Sequence Model (ISM)
- Inferred from Genomic Context (IGC)
- Inferred from Biological aspect of Ancestor (IBA)
- Inferred from Biological aspect of Descendant (IBD)
- Inferred from Key Residues (IKR)
- Inferred from Rapid Divergence (IRD)
- Inferred from Reviewed Computational Analysis (RCA)

### Author statement evidence codes
Author statement codes indicate that the annotation was made on the basis of a statement made by the author(s) in the reference cited. The Author Statement evidence codes are:

- Traceable Author Statement (TAS)
- Non-traceable Author Statement (NAS)

### Curator statement evidence codes
Use of the curatorial statement evidence codes indicates an annotation made on the basis of a curatorial judgement that does not fit into one of the other evidence code classifications. The Curatorial Statement codes are:

- Inferred by Curator (IC)
- No biological Data available (ND)

### Electronic Annotation evidence code
All of the above evidence codes are assigned by curators. However, GO also uses one evidence code that is assigned by automated methods, without curatorial judgement. The Automatically-Assigned evidence code is:

- Inferred from Electronic Annotation (IEA)

In [3]:
solid_codes = ['EXP','IDA','IPI','IMP','IGI','IEP','TAS','NAS']
sketchy_codes = ['HTP','HDA','HMP','HGI','HEP','IC']
bad_codes = ['ISS','ISO','ISA','ISM','IGC','IBA','IBD','IKR','IRD','RCA','IEA','ND']
             
# Count the annotations that fall into each evidence tier.
{
    'solid': len(genes[genes['Evidence'].isin(solid_codes)]),
    'sketchy': len(genes[genes['Evidence'].isin(sketchy_codes)]),
    'bad': len(genes[genes['Evidence'].isin(bad_codes)]),
}

{'bad': 39755, 'sketchy': 6061, 'solid': 44651}
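
The three tiers are this notebook's own judgment call, not a GO convention. As a minimal sketch (standard library only, with toy rows copied from `genes.head()` above), each annotation could be assigned to a tier through one lookup table instead of three separate `isin` filters. The names `TIERS`, `code_to_tier`, and `toy_rows` are hypothetical and do not appear elsewhere in the notebook.

```python
from collections import Counter

# Evidence tiers exactly as defined in the cell above.
TIERS = {
    'solid': ['EXP', 'IDA', 'IPI', 'IMP', 'IGI', 'IEP', 'TAS', 'NAS'],
    'sketchy': ['HTP', 'HDA', 'HMP', 'HGI', 'HEP', 'IC'],
    'bad': ['ISS', 'ISO', 'ISA', 'ISM', 'IGC', 'IBA', 'IBD', 'IKR', 'IRD', 'RCA', 'IEA', 'ND'],
}

# Invert the table: evidence code -> tier name.
code_to_tier = {code: tier for tier, codes in TIERS.items() for code in codes}

# Toy rows (gene, GO_ID, evidence code) taken from the first rows of the annotation table.
toy_rows = [
    ('A1BG', 'GO:0005576', 'HDA'),
    ('A1BG', 'GO:0005576', 'IDA'),
    ('A1BG', 'GO:0005576', 'TAS'),
    ('A1BG', 'GO:0005615', 'HDA'),
    ('A1BG', 'GO:0031093', 'TAS'),
]

# Codes missing from the table are counted as 'unknown' rather than silently dropped.
tier_counts = Counter(code_to_tier.get(code, 'unknown') for _, _, code in toy_rows)
print(dict(tier_counts))
```

Counting an explicit `'unknown'` bucket makes it visible when the annotation file contains a code that none of the three lists covers.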

## II. Find a good ontology depth

In [4]:
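# Note: nx.read_gpickle was removed in NetworkX 3.0; on newer versions, load this file with pickle.load instead.
g = nx.read_gpickle('../../GONN/GO/GO_cellular_component.pickle')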
g

<networkx.classes.digraph.DiGraph at 0x108eb5b00>

In [5]:
df = pd.DataFrame.from_dict(dict(g.nodes(data=True))).transpose()
df.head()

Unnamed: 0,depth,name,namespace
GO:0000015,3,phosphopyruvate hydratase complex,cellular_component
GO:0000109,3,nucleotide-excision repair complex,cellular_component
GO:0000110,4,nucleotide-excision repair factor 1 complex,cellular_component
GO:0000111,4,nucleotide-excision repair factor 2 complex,cellular_component
GO:0000112,4,nucleotide-excision repair factor 3 complex,cellular_component


In [6]:
df[df.depth == 0]

Unnamed: 0,depth,name,namespace
GO:0005575,0,cellular_component,cellular_component


In [7]:
df[df.depth == 1]

Unnamed: 0,depth,name,namespace
GO:0005576,1,extracellular region,cellular_component
GO:0005623,1,cell,cellular_component
GO:0009295,1,nucleoid,cellular_component
GO:0016020,1,membrane,cellular_component
GO:0019012,1,virion,cellular_component
GO:0030054,1,cell junction,cellular_component
GO:0031974,1,membrane-enclosed lumen,cellular_component
GO:0032991,1,protein-containing complex,cellular_component
GO:0043226,1,organelle,cellular_component
GO:0044215,1,other organism,cellular_component


In [8]:
df[df.depth == 2]

Unnamed: 0,depth,name,namespace
GO:0000133,2,polarisome,cellular_component
GO:0000313,2,organellar ribosome,cellular_component
GO:0000346,2,transcription export complex,cellular_component
GO:0000347,2,THO complex,cellular_component
GO:0000408,2,EKC/KEOPS complex,cellular_component
GO:0000417,2,HIR complex,cellular_component
GO:0000439,2,core TFIIH complex,cellular_component
GO:0000444,2,MIS12/MIND type complex,cellular_component
GO:0000776,2,kinetochore,cellular_component
GO:0000797,2,condensin core heterodimer,cellular_component


#### Depth 1 seems like a good granularity for general compartments: depth 2 is already down to individual complexes.

## III. Build a mapping from genes to terms via subterms

#### For each depth-1 term we collect all of its subterms. Then we can map every gene annotated to a subterm up to the corresponding depth-1 term.

In [9]:
level1_terms = df[df.depth == 1].index.tolist()

In [10]:
terms_and_subterms = {term: np.unique(flatten(list(nx.dfs_successors(g, term).values()))).tolist() for term in level1_terms}

In [11]:
terms = [item for l in [subterms+[term] for term, subterms in list(terms_and_subterms.items())] for item in l]
len(terms), len(np.unique(terms)), len(df)

(15810, 4190, 4191)
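
The total (15810) is much larger than the number of unique terms (4190) because the ontology is a directed acyclic graph, not a tree: a subterm can sit under several depth-1 terms. The one term missing from the unique set is presumably the root. A minimal sketch of the same bookkeeping on a toy graph, assuming a plain parent-to-children dictionary with placeholder IDs (the helper `descendants` is hypothetical, a stand-in for flattening `nx.dfs_successors`):

```python
# Toy DAG: parent -> children. IDs are placeholders, not real GO terms.
children = {
    'root': ['A', 'B'],
    'A': ['A1', 'shared'],
    'B': ['shared'],
    'A1': [],
    'shared': ['leaf'],
    'leaf': [],
}

def descendants(graph, start):
    """Return every node reachable from start (excluding start) via iterative DFS."""
    seen = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return sorted(seen)

# Same shape as terms_and_subterms: top-level term -> list of all its subterms.
toy_terms_and_subterms = {term: descendants(children, term) for term in children['root']}

# Total entries (with repeats) versus unique terms covered.
total = sum(len(subterms) + 1 for subterms in toy_terms_and_subterms.values())
unique = len({t for term, subterms in toy_terms_and_subterms.items() for t in subterms + [term]})
print(toy_terms_and_subterms, total, unique)
```

Here `shared` and `leaf` are counted once under `A` and once under `B`, which is why `total` exceeds `unique`. The same overlap is what later lets one gene map to several depth-1 terms.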

In [12]:
terms_and_genes = {term: genes[genes.GO_ID.isin(subterms+[term])][['GeneSymbol', 'Evidence']].values.tolist() for term, subterms in terms_and_subterms.items()}

In [13]:
genes_and_terms = flatten([[(gene, term, evidence) for [gene, evidence] in genes] for term, genes in terms_and_genes.items()])

In [14]:
evidence = pd.DataFrame(genes_and_terms, columns=['gene','GO_ID','Evidence']).groupby(['gene', 'GO_ID'])['Evidence'].apply(list).to_frame()
evidence.head()

gene,GO_ID,Evidence
A1BG,GO:0005576,"[HDA, IDA, TAS, HDA, HDA, HDA, HDA]"
A1BG,GO:0005623,"[TAS, TAS, TAS]"
A1BG,GO:0031974,"[TAS, TAS, TAS]"
A1BG,GO:0043226,"[TAS, TAS, HDA, TAS]"
A1BG,GO:0044421,"[HDA, HDA, HDA, HDA]"


#### A gene can map to several terms, so we score the evidence for each (gene, term) pair and keep the best-supported term

In [15]:
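# Score each evidence code by tier: solid = 3, sketchy = 2, bad = 1 (loop variable renamed to avoid shadowing the builtin `type`).
score = {**{code: 3 for code in solid_codes}, **{code: 2 for code in sketchy_codes}, **{code: 1 for code in bad_codes}}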
def evidence_list_to_score_list(evidence_list): return [[score[evidence] for evidence in evidence_list]]

In [16]:
evidence_scores = evidence.apply(lambda row: evidence_list_to_score_list(row['Evidence']), axis=1)
evidence_scores.head()

gene,GO_ID,Evidence
A1BG,GO:0005576,"[2, 3, 3, 2, 2, 2, 2]"
A1BG,GO:0005623,"[3, 3, 3]"
A1BG,GO:0031974,"[3, 3, 3]"
A1BG,GO:0043226,"[3, 3, 2, 3]"
A1BG,GO:0044421,"[2, 2, 2, 2]"


In [17]:
evidence_scores = evidence_scores.apply(lambda row: sum(row['Evidence']), axis=1).to_frame().rename(columns={0:'Evidence'})
evidence_scores.head()

gene,GO_ID,Evidence
A1BG,GO:0005576,16
A1BG,GO:0005623,9
A1BG,GO:0031974,9
A1BG,GO:0043226,11
A1BG,GO:0044421,8
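
The score of a (gene, term) pair is the sum of the per-code scores, so a term backed by many weaker annotations can outrank a term backed by a few strong ones. A minimal standard-library sketch that reproduces three of the A1BG rows above (the names `SCORE_SUBSET`, `toy_evidence`, and `toy_scores` are hypothetical):

```python
# Subset of the score table defined above: solid = 3, sketchy = 2.
SCORE_SUBSET = {'IDA': 3, 'TAS': 3, 'HDA': 2}

# Evidence lists for A1BG, copied from the grouped evidence table.
toy_evidence = {
    ('A1BG', 'GO:0005576'): ['HDA', 'IDA', 'TAS', 'HDA', 'HDA', 'HDA', 'HDA'],
    ('A1BG', 'GO:0005623'): ['TAS', 'TAS', 'TAS'],
    ('A1BG', 'GO:0044421'): ['HDA', 'HDA', 'HDA', 'HDA'],
}

# Sum the per-code scores for each (gene, term) pair.
toy_scores = {
    pair: sum(SCORE_SUBSET[code] for code in codes)
    for pair, codes in toy_evidence.items()
}
print(toy_scores)
```

For `GO:0005576` this is 2 + 3 + 3 + 2 + 2 + 2 + 2 = 16, which matches the table.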


In [18]:
best_evidence = evidence_scores[evidence_scores['Evidence'] == evidence_scores.groupby(['gene'])['Evidence'].transform(max)]
best_evidence.head()

gene,GO_ID,Evidence
A1BG,GO:0005576,16
A1CF,GO:0005623,13
A1CF,GO:0044464,13
A2M,GO:0005576,10
A2ML1,GO:0005576,5


#### Some genes tie between terms (A1CF above scores 13 for two terms), so we need to deal with ties

In [19]:
len(best_evidence), len(best_evidence.reset_index().drop_duplicates('gene'))

(44991, 19338)
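
The next cell resolves ties with `drop_duplicates('gene')`, which keeps the first row for each gene. Because the grouped table is sorted by gene and GO ID, this appears to select the tied term with the smallest GO ID. A minimal sketch of that rule made explicit, using toy scores copied from the tables above (the names `toy_scores`, `best`, and `gene_to_term` are hypothetical):

```python
# (gene, term) -> summed evidence score, copied from the score tables above.
toy_scores = {
    ('A1BG', 'GO:0005576'): 16,
    ('A1BG', 'GO:0005623'): 9,
    ('A1CF', 'GO:0005623'): 13,
    ('A1CF', 'GO:0044464'): 13,
}

# Keep, for each gene, the highest score; break ties by the smallest GO ID.
best = {}
for (gene, term), term_score in toy_scores.items():
    candidate = (-term_score, term)  # tuples compare by score first, then by term ID
    if gene not in best or candidate < best[gene]:
        best[gene] = candidate

gene_to_term = {gene: term for gene, (_, term) in best.items()}
print(gene_to_term)
```

Stating the tie-break rule in code makes the choice reproducible even if the row order of the input changes. It remains an arbitrary rule, though: the GO ID carries no biological meaning.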

In [20]:
best_evidence = best_evidence.reset_index().drop_duplicates('gene').set_index(['gene', 'GO_ID'])
best_evidence.head()

gene,GO_ID,Evidence
A1BG,GO:0005576,16
A1CF,GO:0005623,13
A2M,GO:0005576,10
A2ML1,GO:0005576,5
A3GALT2,GO:0043226,3


In [21]:
gene_to_compartment_term = best_evidence.reset_index()[['gene', 'GO_ID']].set_index('gene')
gene_to_compartment_term.head()

gene,GO_ID
A1BG,GO:0005576
A1CF,GO:0005623
A2M,GO:0005576
A2ML1,GO:0005576
A3GALT2,GO:0043226


In [22]:
compartments = df[df.depth == 1]['name'].to_frame()
compartments

Unnamed: 0,name
GO:0005576,extracellular region
GO:0005623,cell
GO:0009295,nucleoid
GO:0016020,membrane
GO:0019012,virion
GO:0030054,cell junction
GO:0031974,membrane-enclosed lumen
GO:0032991,protein-containing complex
GO:0043226,organelle
GO:0044215,other organism


In [25]:
cellular_compartments = gene_to_compartment_term.merge(compartments, how='left', left_on='GO_ID', right_index=True)[['GO_ID','name']]
cellular_compartments.head()

gene,GO_ID,name
A1BG,GO:0005576,extracellular region
A1CF,GO:0005623,cell
A2M,GO:0005576,extracellular region
A2ML1,GO:0005576,extracellular region
A3GALT2,GO:0043226,organelle


## IV. Add "Specific Compartments" information to each gene

In [26]:
level2_terms = df[df.depth == 2].index.tolist()

In [27]:
terms_and_subterms = {term: np.unique(flatten(list(nx.dfs_successors(g, term).values()))).tolist() for term in level2_terms}

In [28]:
terms = [item for l in [subterms+[term] for term, subterms in list(terms_and_subterms.items())] for item in l]
len(terms), len(np.unique(terms)), len(df)

(19818, 4169, 4191)

In [29]:
terms_and_genes = {term: genes[genes.GO_ID.isin(subterms+[term])][['GeneSymbol', 'Evidence']].values.tolist() for term, subterms in terms_and_subterms.items()}

In [30]:
genes_and_terms = flatten([[(gene, term, evidence) for [gene, evidence] in genes] for term, genes in terms_and_genes.items()])

In [31]:
evidence = pd.DataFrame(genes_and_terms, columns=['gene','GO_ID','Evidence']).groupby(['gene', 'GO_ID'])['Evidence'].apply(list).to_frame()
evidence.head()

gene,GO_ID,Evidence
A1BG,GO:0005615,"[HDA, HDA, HDA]"
A1BG,GO:0005622,"[TAS, TAS, TAS]"
A1BG,GO:0012505,"[TAS, TAS, TAS]"
A1BG,GO:0031012,[HDA]
A1BG,GO:0043227,"[TAS, TAS, HDA, TAS]"


#### We've committed each gene to a single level1 term, so for each gene let's remove all level2 terms that aren't subterms of its previously selected level1 term

In [32]:
predecessors = {term: list(g.predecessors(term)) for term in level2_terms}

In [33]:
predecessors = {term: [parent for parent in parents if parent in level1_terms] for term, parents in predecessors.items()}

In [34]:
predecessors = {term: parents[0] for term, parents in predecessors.items()}

In [35]:
predecessors = pd.Series(predecessors).rename_axis('level2_term').rename('level1_term').to_frame()
predecessors.head()

level2_term,level1_term
GO:0000133,GO:0032991
GO:0000313,GO:0044422
GO:0000346,GO:0032991
GO:0000347,GO:0032991
GO:0000408,GO:0032991


In [36]:
evidence = evidence.reset_index().merge(predecessors, how='left', left_on='GO_ID', right_index=True).set_index(['gene', 'level1_term', 'GO_ID'])
evidence.head()

gene,level1_term,GO_ID,Evidence
A1BG,GO:0044421,GO:0005615,"[HDA, HDA, HDA]"
A1BG,GO:0044464,GO:0005622,"[TAS, TAS, TAS]"
A1BG,GO:0044464,GO:0012505,"[TAS, TAS, TAS]"
A1BG,GO:0044421,GO:0031012,[HDA]
A1BG,GO:0043226,GO:0043227,"[TAS, TAS, HDA, TAS]"


In [37]:
evidence = evidence.reset_index().merge(cellular_compartments['GO_ID'].rename('chosen_level1').to_frame(), how='left', left_on='gene', right_index=True)
evidence = evidence[evidence.level1_term == evidence.chosen_level1]
evidence = evidence.set_index(['gene','GO_ID'])['Evidence'].to_frame()
evidence.head()

gene,GO_ID,Evidence
A3GALT2,GO:0043227,"[IBA, IBA, IEA]"
A3GALT2,GO:0043229,"[IBA, IEA]"
A4GALT,GO:0031090,[NAS]
A4GNT,GO:0031090,[TAS]
AADAC,GO:0005789,"[IBA, IDA, TAS]"
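
A1BG no longer appears in this table: each level2 term is assigned a single level1 parent, and none of A1BG's level2 terms has the chosen parent `GO:0005576`. A minimal sketch of the filter on toy data copied from the tables in this notebook (the names `level2_parent`, `chosen_level1`, `toy_level2_evidence`, and `kept` are hypothetical):

```python
# level2 term -> its single assigned level1 parent (values from the merged evidence table).
level2_parent = {
    'GO:0005615': 'GO:0044421',
    'GO:0043227': 'GO:0043226',
}

# gene -> level1 term chosen in section III.
chosen_level1 = {
    'A1BG': 'GO:0005576',
    'A3GALT2': 'GO:0043226',
}

# (gene, level2 term, evidence codes)
toy_level2_evidence = [
    ('A1BG', 'GO:0005615', ['HDA', 'HDA', 'HDA']),
    ('A1BG', 'GO:0043227', ['TAS', 'TAS', 'HDA', 'TAS']),
    ('A3GALT2', 'GO:0043227', ['IBA', 'IBA', 'IEA']),
]

# Keep a row only if the level2 term's parent is the gene's chosen level1 term.
kept = [
    (gene, term, codes)
    for gene, term, codes in toy_level2_evidence
    if level2_parent.get(term) == chosen_level1.get(gene)
]
print(kept)
```

Because only one parent is recorded per level2 term, a term with several level1 parents can be filtered out even when one of its other parents was the chosen one. This may partly explain why so few genes keep a specific compartment.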


In [38]:
evidence_scores = evidence.apply(lambda row: evidence_list_to_score_list(row['Evidence']), axis=1)
evidence_scores.head()

gene,GO_ID,Evidence
A3GALT2,GO:0043227,"[1, 1, 1]"
A3GALT2,GO:0043229,"[1, 1]"
A4GALT,GO:0031090,[3]
A4GNT,GO:0031090,[3]
AADAC,GO:0005789,"[1, 3, 3]"


In [39]:
evidence_scores = evidence_scores.apply(lambda row: sum(row['Evidence']), axis=1).to_frame().rename(columns={0:'Evidence'})
evidence_scores.head()

gene,GO_ID,Evidence
A3GALT2,GO:0043227,3
A3GALT2,GO:0043229,2
A4GALT,GO:0031090,3
A4GNT,GO:0031090,3
AADAC,GO:0005789,7


In [40]:
best_evidence = evidence_scores[evidence_scores['Evidence'] == evidence_scores.groupby(['gene'])['Evidence'].transform(max)]
best_evidence.head()

gene,GO_ID,Evidence
A3GALT2,GO:0043227,3
A4GALT,GO:0031090,3
A4GNT,GO:0031090,3
AADAC,GO:0005789,7
ABCA4,GO:0031090,3


In [41]:
len(best_evidence), len(best_evidence.reset_index().drop_duplicates('gene'))

(1683, 1422)

In [42]:
best_evidence = best_evidence.reset_index().drop_duplicates('gene').set_index(['gene', 'GO_ID'])
best_evidence.head()

gene,GO_ID,Evidence
A3GALT2,GO:0043227,3
A4GALT,GO:0031090,3
A4GNT,GO:0031090,3
AADAC,GO:0005789,7
ABCA4,GO:0031090,3


In [43]:
gene_to_compartment_term = best_evidence.reset_index()[['gene', 'GO_ID']].set_index('gene')
gene_to_compartment_term.head()

gene,GO_ID
A3GALT2,GO:0043227
A4GALT,GO:0031090
A4GNT,GO:0031090
AADAC,GO:0005789
ABCA4,GO:0031090


In [44]:
compartments = df[df.depth == 2]['name'].to_frame()
compartments

Unnamed: 0,name
GO:0000133,polarisome
GO:0000313,organellar ribosome
GO:0000346,transcription export complex
GO:0000347,THO complex
GO:0000408,EKC/KEOPS complex
GO:0000417,HIR complex
GO:0000439,core TFIIH complex
GO:0000444,MIS12/MIND type complex
GO:0000776,kinetochore
GO:0000797,condensin core heterodimer


In [47]:
specific_cellular_compartments = gene_to_compartment_term.merge(compartments, how='left', left_on='GO_ID', right_index=True)[['GO_ID','name']]
specific_cellular_compartments.head()

gene,GO_ID,name
A3GALT2,GO:0043227,membrane-bounded organelle
A4GALT,GO:0031090,organelle membrane
A4GNT,GO:0031090,organelle membrane
AADAC,GO:0005789,endoplasmic reticulum membrane
ABCA4,GO:0031090,organelle membrane


In [49]:
cellular_compartments = cellular_compartments.rename(columns={'GO_ID':'general_compartment_GO_ID', 'name':'general_compartment'})
cellular_compartments.head()

gene,general_compartment_GO_ID,general_compartment
A1BG,GO:0005576,extracellular region
A1CF,GO:0005623,cell
A2M,GO:0005576,extracellular region
A2ML1,GO:0005576,extracellular region
A3GALT2,GO:0043226,organelle


In [51]:
cellular_compartments = cellular_compartments.merge(specific_cellular_compartments, how='left', left_index=True, right_index=True).rename(columns={'GO_ID':'specific_process_GO_ID', 'name':'specific_process'})
cellular_compartments.head()

gene,general_compartment_GO_ID,general_compartment,specific_process_GO_ID,specific_process
A1BG,GO:0005576,extracellular region,,
A1CF,GO:0005623,cell,,
A2M,GO:0005576,extracellular region,,
A2ML1,GO:0005576,extracellular region,,
A3GALT2,GO:0043226,organelle,GO:0043227,membrane-bounded organelle
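
The left merge keeps every gene from the general table and leaves the specific columns empty for genes with no surviving level2 term. By the counts above, that is the majority: 1422 genes have a specific term, out of 19338 with a general one. A minimal dictionary-based sketch of the same left join, with values copied from the tables above (the names `general`, `specific`, and `merged` are hypothetical):

```python
# gene -> general compartment (left table).
general = {
    'A1BG': 'extracellular region',
    'A3GALT2': 'organelle',
}

# gene -> specific compartment (right table); most genes are absent from it.
specific = {
    'A3GALT2': 'membrane-bounded organelle',
}

# Left join: every gene in `general` is kept; a missing match becomes None.
merged = {gene: (compartment, specific.get(gene)) for gene, compartment in general.items()}
print(merged)
```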


## This subcellular localization annotation is less refined than the one Bryce generated from the COMPARTMENTS database, so we do not save it.

In [46]:
# cellular_compartments.to_pickle('cellular_compartments_gene_annotation.pickle')