## Examining results of ComplEx
Jump to the end for a summary. This notebook primarily serves to show which edges a trained ComplEx model predicts between two nodes of the knowledge graph.

In [1]:
from tranql_stellargraph.build_complex import make_model, predict_edge, k_graph, make_type_predicate_mappings

First, let's generate the ComplEx model that we'll use shortly.

In [2]:
%%capture
model = make_model()

### Looking at the information we have
We are going to be examining two specific nodes in the graph to start. We'll look at what information there is in the graph about them, what obvious connections can be drawn, and then we'll see what connections the model can make between them.

In [3]:
""" Declare the nodes of interest """
mondo = "MONDO:0010940"
hgnc = "HGNC:10610"

# k_graph here is a KnowledgeGraph of mock1.json, and the model is trained with it as the input
# Let's refer to the knowledge graph's networkx instance directly for ease of use
net = k_graph.net

# Display what their names are
print(mondo, ":", net.nodes[mondo]["attr_dict"]["name"])
print(hgnc, ":", net.nodes[hgnc]["attr_dict"]["name"])
print()
# Display what their types are
mondo_type = net.nodes[mondo]["attr_dict"]["type"]
hgnc_type = net.nodes[hgnc]["attr_dict"]["type"]
print(mondo, ":", mondo_type)
print(hgnc, ":", hgnc_type)

MONDO:0010940 : inherited susceptibility to asthma
HGNC:10610 : CCL11

MONDO:0010940 : ['genetic_condition', 'disease']
HGNC:10610 : ['gene']


As we can see, the premise here is simple: we have a gene __CCL11__ and a genetic condition __inherited susceptibility to asthma__. Our goal is to find how these two might be related.

In [4]:
""" Let's see what edges exist between these nodes """
mondo_to_hgnc = net[mondo][hgnc]
hgnc_to_mondo = net[hgnc][mondo]

print(list(mondo_to_hgnc.keys()))
print(list(hgnc_to_mondo.keys()))

['gene_associated_with_condition']
['gene_associated_with_condition']


So, we have two nodes: inherited susceptibility to asthma and the gene CCL11. We know that these two are associated with each other in some way. Since the two edges are the same, let's just look at one of them.

In [5]:
edge = k_graph.net[mondo][hgnc]["gene_associated_with_condition"]
edge.keys()

dict_keys(['type', 'target_id', 'source_id', 'relation', 'edge_source', 'publications', 'id', 'predicate_id', 'source_database', 'ctime', 'relation_label', 'weight', 'reasoner', 'label'])

Of these keys, relation_label is probably going to tell us the most.

In [6]:
edge["relation_label"]

['likely_pathogenic_for_condition']

Okay, so what we know from the graph is that for these two nodes, the gene CCL11 likely causes inherited susceptibility to asthma in some way.

### Now, let's look at what the model can tell us.
First, we need all the possible predicates between gene and genetic_condition/disease.
Let's see which predicates exist between those types.

In [7]:
# This is structured as {source_type: {target_type: [pred1, pred2]}},
# e.g. {"chemical_substance": {"gene": [pred1, pred2]}}
predicate_map = make_type_predicate_mappings(k_graph)
# For example, let's see the predicates between genetic_condition and gene
print(predicate_map["genetic_condition"]["gene"])

['has_phenotype', 'disease_to_gene_association', 'contributes_to', 'biomarker_for', 'gene_associated_with_condition']
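
To make the shape of this mapping concrete, here is a minimal sketch of how a type-to-predicate map of the same structure could be built from plain triples. The helper `build_type_predicate_map` and the toy data are hypothetical and only illustrate the nesting; this is not the actual implementation of `make_type_predicate_mappings`.

```python
from collections import defaultdict


def build_type_predicate_map(nodes, edges):
    """Hypothetical helper: source type -> target type -> list of predicates.

    nodes: dict mapping node id -> list of type names
    edges: iterable of (source_id, target_id, predicate) triples
    """
    mapping = defaultdict(lambda: defaultdict(list))
    for source, target, predicate in edges:
        # A node can carry several types, so register the predicate for each pair
        for source_type in nodes[source]:
            for target_type in nodes[target]:
                predicates = mapping[source_type][target_type]
                if predicate not in predicates:  # keep first-seen order, no repeats
                    predicates.append(predicate)
    return {s: dict(t) for s, t in mapping.items()}


# Toy data based on the two nodes examined in this notebook
toy_nodes = {
    "MONDO:0010940": ["genetic_condition", "disease"],
    "HGNC:10610": ["gene"],
}
toy_edges = [
    ("MONDO:0010940", "HGNC:10610", "gene_associated_with_condition"),
    ("HGNC:10610", "MONDO:0010940", "gene_associated_with_condition"),
]
toy_map = build_type_predicate_map(toy_nodes, toy_edges)
print(toy_map["genetic_condition"]["gene"])
```

Because a node can have more than one type, the same predicate ends up under several source types, which is why the genetic_condition and disease entries overlap.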


Next, let's build some new candidate edges of mondo->gene, one per predicate.

In [12]:
mondo_to_hgnc_predicates = predicate_map["genetic_condition"]["gene"] + predicate_map["disease"]["gene"]
edges = [
    # source, target, pred
    (mondo, hgnc, predicate) for predicate in mondo_to_hgnc_predicates
]
for edge in edges:
    print(f"{edge[0]}-[{edge[2]}]->{edge[1]}")

MONDO:0010940-[has_phenotype]->HGNC:10610
MONDO:0010940-[disease_to_gene_association]->HGNC:10610
MONDO:0010940-[contributes_to]->HGNC:10610
MONDO:0010940-[biomarker_for]->HGNC:10610
MONDO:0010940-[gene_associated_with_condition]->HGNC:10610
MONDO:0010940-[disease_to_gene_association]->HGNC:10610
MONDO:0010940-[literature_co-occurrence]->HGNC:10610
MONDO:0010940-[contributes_to]->HGNC:10610
MONDO:0010940-[has_phenotype]->HGNC:10610
MONDO:0010940-[biomarker_for]->HGNC:10610
MONDO:0010940-[gene_associated_with_condition]->HGNC:10610
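
The list printed above contains repeats, since predicates registered for both genetic_condition and disease appear twice. If duplicates are unwanted, one option is to deduplicate while preserving order. The sketch below uses the predicates exactly as printed above; the variable names are introduced here for illustration only.

```python
# Predicates in the order they were printed above (11 entries, with repeats)
combined = [
    "has_phenotype",
    "disease_to_gene_association",
    "contributes_to",
    "biomarker_for",
    "gene_associated_with_condition",
    "disease_to_gene_association",
    "literature_co-occurrence",
    "contributes_to",
    "has_phenotype",
    "biomarker_for",
    "gene_associated_with_condition",
]

# dict keys are unique and keep insertion order, so this drops later repeats
unique_predicates = list(dict.fromkeys(combined))

print(len(combined), "->", len(unique_predicates))
for predicate in unique_predicates:
    print(predicate)
```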


Now let's see if the model thinks they could exist.

In [13]:
predict_edge(model, edges)

Edge MONDO:0010940-[has_phenotype]->HGNC:10610 predicted (0.45043355226516724) (real=False)
Edge MONDO:0010940-[disease_to_gene_association]->HGNC:10610 predicted (0.21170775592327118) (real=False)
Edge MONDO:0010940-[contributes_to]->HGNC:10610 predicted (0.45043355226516724) (real=False)
Edge MONDO:0010940-[biomarker_for]->HGNC:10610 predicted (0.4502994418144226) (real=False)
Edge MONDO:0010940-[gene_associated_with_condition]->HGNC:10610 predicted (0.21170775592327118) (real=True)


It looks like the model gives a few of the edges we fed in some chance of being real, although none scores very strongly (the highest is about 0.45). Let's look at the gene->mondo edges next.
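
Checking the reverse direction is worthwhile because ComplEx scores are generally not symmetric: swapping source and target can change the score. Below is a minimal sketch of the standard ComplEx scoring function, Re(sum_k r_k * s_k * conj(o_k)), in plain Python. The embeddings are made-up numbers for illustration, not values from the trained model, and how `predict_edge` post-processes scores is not shown in this notebook.

```python
def complex_score(subject, relation, obj):
    """ComplEx score: real part of sum_k relation_k * subject_k * conj(object_k)."""
    total = sum(r * s * o.conjugate() for s, r, o in zip(subject, relation, obj))
    return total.real


# Made-up 2-dimensional complex embeddings (illustrative only)
gene = [1 + 2j, 0.5 - 1j]
condition = [2 - 1j, 1 + 1j]
relation = [0.5 + 1j, -1 + 0.5j]

forward = complex_score(gene, relation, condition)   # gene -> condition
backward = complex_score(condition, relation, gene)  # condition -> gene
print(forward, backward)

# A relation embedding with no imaginary part scores both directions identically
real_relation = [0.5 + 0j, -1 + 0j]
print(complex_score(gene, real_relation, condition),
      complex_score(condition, real_relation, gene))
```

The imaginary part of the relation embedding is what lets the model score the two directions differently.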

In [15]:
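hgnc_to_mondo_predicates = predicate_map["gene"]["genetic_condition"] + predicate_map["gene"]["disease"]
# NOTE: the comprehension below iterates over mondo_to_hgnc_predicates rather than
# hgnc_to_mondo_predicates; the output recorded below was produced that way.
edges = [
    # source, target, pred
    (hgnc, mondo, predicate) for predicate in mondo_to_hgnc_predicates
]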
for edge in edges:
    print(f"{edge[0]}-[{edge[2]}]->{edge[1]}")

HGNC:10610-[has_phenotype]->MONDO:0010940
HGNC:10610-[disease_to_gene_association]->MONDO:0010940
HGNC:10610-[contributes_to]->MONDO:0010940
HGNC:10610-[biomarker_for]->MONDO:0010940
HGNC:10610-[gene_associated_with_condition]->MONDO:0010940
HGNC:10610-[disease_to_gene_association]->MONDO:0010940
HGNC:10610-[literature_co-occurrence]->MONDO:0010940
HGNC:10610-[contributes_to]->MONDO:0010940
HGNC:10610-[has_phenotype]->MONDO:0010940
HGNC:10610-[biomarker_for]->MONDO:0010940
HGNC:10610-[gene_associated_with_condition]->MONDO:0010940


In [21]:
predict_edge(model, edges)

Edge HGNC:10610-[has_phenotype]->MONDO:0010940 predicted (0.39737415313720703) (real=False)
Edge HGNC:10610-[disease_to_gene_association]->MONDO:0010940 predicted (1.193188190460205) (real=False)
Edge HGNC:10610-[contributes_to]->MONDO:0010940 predicted (1.2402820587158203) (real=False)
Edge HGNC:10610-[biomarker_for]->MONDO:0010940 predicted (0.0022675953805446625) (real=False)
Edge HGNC:10610-[gene_associated_with_condition]->MONDO:0010940 predicted (1.193188190460205) (real=True)
Edge HGNC:10610-[disease_to_gene_association]->MONDO:0010940 predicted (0.39737415313720703) (real=False)
Edge HGNC:10610-[literature_co-occurrence]->MONDO:0010940 predicted (1.2402820587158203) (real=False)
Edge HGNC:10610-[contributes_to]->MONDO:0010940 predicted (0.0022675935178995132) (real=False)
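
To compare these scores more easily, the printed lines can be parsed and sorted. This is a minimal sketch that works on the text output copied verbatim from above. It assumes only the line format shown there, and the regular expression and variable names are introduced here for illustration.

```python
import re

# Lines copied verbatim from the predict_edge output above
output = """\
Edge HGNC:10610-[has_phenotype]->MONDO:0010940 predicted (0.39737415313720703) (real=False)
Edge HGNC:10610-[disease_to_gene_association]->MONDO:0010940 predicted (1.193188190460205) (real=False)
Edge HGNC:10610-[contributes_to]->MONDO:0010940 predicted (1.2402820587158203) (real=False)
Edge HGNC:10610-[biomarker_for]->MONDO:0010940 predicted (0.0022675953805446625) (real=False)
Edge HGNC:10610-[gene_associated_with_condition]->MONDO:0010940 predicted (1.193188190460205) (real=True)
Edge HGNC:10610-[disease_to_gene_association]->MONDO:0010940 predicted (0.39737415313720703) (real=False)
Edge HGNC:10610-[literature_co-occurrence]->MONDO:0010940 predicted (1.2402820587158203) (real=False)
Edge HGNC:10610-[contributes_to]->MONDO:0010940 predicted (0.0022675935178995132) (real=False)
"""

pattern = re.compile(
    r"Edge (\S+?)-\[(.+?)\]->(\S+) predicted \(([-\d.e]+)\) \(real=(True|False)\)"
)

rows = []
for line in output.splitlines():
    match = pattern.match(line)
    if match:
        source, predicate, target, score, real = match.groups()
        rows.append((float(score), predicate, real == "True"))

# Highest score first; ties keep their original order
rows.sort(key=lambda row: row[0], reverse=True)
for score, predicate, real in rows:
    print(f"{score:.4f}  {predicate}  real={real}")
```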


There are a few strong ones in here, such as contributes_to, literature_co-occurrence, and disease_to_gene_association.

## Conclusion
Given two nodes, *inherited susceptibility to asthma* and *CCL11*, connected by a single edge type, *gene_associated_with_condition*, the model is able to predict other possible edges between the two, which suggests that it is learning meaningful relationships within the graph.

### Takeaways
There are some predictions I find particularly interesting.
1. disease_to_gene_association is obviously going to be very similar to gene_associated_with_condition, and it looks like the model has made this connection.
2. literature_co-occurrence is a strange type and I'm not completely sure how it really works. However, a quick search readily turns up prior work on CCL11 and inherited susceptibility to asthma. See: https://www.jacionline.org/article/S0091-6749(05)02508-X/fulltext
3. contributes_to again demonstrates that the model is able to find patterns in edge types, and it seems very realistic that if CCL11 is "likely pathogenic for" inherited susceptibility to asthma, then CCL11 also contributes to this condition.