## Examining results of ComplEx
Jump to the end for a summary. This notebook primarily serves to show the process behind this example.

In [1]:
from build_complex import make_model, predict_edge, make_type_predicate_mappings, get_dataset
from tranql_jupyter import KnowledgeGraph

First, let's generate the ComplEx model that we'll use shortly.

In [2]:
%%capture
k_graph = KnowledgeGraph.mock1()
dataset = get_dataset(k_graph)
model = make_model(dataset)

### Looking at the information we have
We are going to be examining two specific nodes in the graph to start. We'll look at what information there is in the graph about them, what obvious connections can be drawn, and then we'll see what connections the model can make between them.

In [3]:
""" Declare the nodes of interest """
mondo = "MONDO:0010940"
hgnc = "HGNC:10610"

# k_graph here is a KnowledgeGraph of mock1.json, and the model is trained with it as the input
# Let's refer to the knowledge graph's networkx instance directly for ease of use
net = k_graph.net

# Display what their names are
print(mondo, ":", net.nodes[mondo]["attr_dict"]["name"])
print(hgnc, ":", net.nodes[hgnc]["attr_dict"]["name"])
print()
# Display what their types are
mondo_type = net.nodes[mondo]["attr_dict"]["type"]
hgnc_type = net.nodes[hgnc]["attr_dict"]["type"]
print(mondo, ":", mondo_type)
print(hgnc, ":", hgnc_type)

MONDO:0010940 : inherited susceptibility to asthma
HGNC:10610 : CCL11

MONDO:0010940 : ['genetic_condition', 'disease']
HGNC:10610 : ['gene']


As we can see, the premise here is simple: we have a gene __CCL11__ and a genetic condition __inherited susceptibility to asthma__. Our goal is to find how these two might be related.

In [4]:
""" Let's see what edges exist between these nodes """
mondo_to_hgnc = net[mondo][hgnc]
hgnc_to_mondo = net[hgnc][mondo]

print(list(mondo_to_hgnc.keys()))
print(list(hgnc_to_mondo.keys()))

['gene_associated_with_condition']
['gene_associated_with_condition']
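
The lookup `net[source][target]` returns a mapping from edge key (here, the predicate) to that edge's attributes, which is why each direction has to be queried separately. A minimal sketch of that shape, using a plain nested dict as a stand-in for `k_graph.net`; the helper `edge_keys` is hypothetical and the edge attributes are left empty:

```python
# Plain-dict stand-in for a directed multigraph's adjacency structure.
# This only mimics the lookup shape of k_graph.net; it is not the real graph.
adjacency = {
    "MONDO:0010940": {
        "HGNC:10610": {"gene_associated_with_condition": {}},  # attributes omitted
    },
    "HGNC:10610": {
        "MONDO:0010940": {"gene_associated_with_condition": {}},  # attributes omitted
    },
}


def edge_keys(adj, source, target):
    """Return the keys of all parallel edges going from source to target."""
    return list(adj.get(source, {}).get(target, {}).keys())


forward = edge_keys(adjacency, "MONDO:0010940", "HGNC:10610")
backward = edge_keys(adjacency, "HGNC:10610", "MONDO:0010940")
missing = edge_keys(adjacency, "HGNC:10610", "HGNC:10610")  # no self-loop stored

print(forward)
print(backward)
print(missing)
```

Unlike the direct `net[mondo][hgnc]` indexing used in the cell, the helper returns an empty list instead of raising when no edge exists in that direction.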


So, we have two nodes: inherited susceptibility to asthma and the gene CCL11. We know that these two are associated with each other in some way. Since the two edges are the same, let's just look at one of them.

In [5]:
edge = k_graph.net[mondo][hgnc]["gene_associated_with_condition"]
edge.keys()

dict_keys(['type', 'target_id', 'source_id', 'relation', 'edge_source', 'publications', 'id', 'predicate_id', 'source_database', 'ctime', 'relation_label', 'weight', 'reasoner', 'label'])

Of these keys, relation_label is probably going to tell us the most.

In [6]:
edge["relation_label"]

['likely_pathogenic_for_condition']

Okay, so what we know from the graph is that for these two nodes, the gene CCL11 likely causes inherited susceptibility to asthma in some way.

### Now, let's look at what the model can tell us.
First, we need to get all the possible predicates between gene and genetic_condition/disease. Since inherited susceptibility to asthma is more of a genetic condition than it is a disease, we're not going to pay as much attention to disease.

Let's see what predicates there are between them first.

In [7]:
# This is structured as {"chemical_substance": {"gene": [pred1, pred2]}}
predicate_map = make_type_predicate_mappings(k_graph)
# For example, let's see the predicates between genetic_condition and gene
print("genetic_condition->gene", predicate_map["genetic_condition"]["gene"], "\n")
print("disease->gene", predicate_map["disease"]["gene"], "\n")

print("gene->genetic_condition", predicate_map["gene"]["genetic_condition"], "\n")
print("gene->disease", predicate_map["gene"]["disease"])

genetic_condition->gene ['has_phenotype', 'disease_to_gene_association', 'contributes_to', 'biomarker_for', 'gene_associated_with_condition'] 

disease->gene ['disease_to_gene_association', 'literature_co-occurrence', 'contributes_to', 'has_phenotype', 'biomarker_for', 'gene_associated_with_condition'] 

gene->genetic_condition ['biomarker_for', 'has_phenotype', 'contributes_to', 'gene_associated_with_condition'] 

gene->disease ['contributes_to', 'biomarker_for', 'has_phenotype', 'gene_associated_with_condition']


#### A couple of things are immediately obvious from above:
- There is a lot of overlap between all of them.
- Predicates can generally go in either direction, although there are a couple of exceptions.
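
To make the nested shape of `predicate_map` concrete, here is a rough sketch of how a source type -> target type -> predicates mapping could be built from typed triples. The internals of `make_type_predicate_mappings` are not shown in this notebook, so `build_type_predicate_map` and its inputs are hypothetical and only illustrate the idea:

```python
from collections import defaultdict


def build_type_predicate_map(node_types, triples):
    """Map source type -> target type -> sorted list of predicates seen."""
    mapping = defaultdict(lambda: defaultdict(set))
    for source, predicate, target in triples:
        # A node can carry several types, so register every type pair.
        for source_type in node_types[source]:
            for target_type in node_types[target]:
                mapping[source_type][target_type].add(predicate)
    return {
        source_type: {t: sorted(preds) for t, preds in targets.items()}
        for source_type, targets in mapping.items()
    }


# Toy input based on the two nodes examined in this notebook.
node_types = {
    "MONDO:0010940": ["genetic_condition", "disease"],
    "HGNC:10610": ["gene"],
}
triples = [
    ("MONDO:0010940", "gene_associated_with_condition", "HGNC:10610"),
    ("HGNC:10610", "gene_associated_with_condition", "MONDO:0010940"),
]

toy_map = build_type_predicate_map(node_types, triples)
print(toy_map["genetic_condition"]["gene"])
print(toy_map["gene"]["disease"])
```

Because a node with two types contributes to two rows of the mapping, heavy overlap between the `genetic_condition` and `disease` rows is what one would expect.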

Now, let's build some new edges of mondo->gene. We'll use all the predicates available, as seen in the first two lists above.

Also, as we just saw, there's a lot of overlap in predicates between genetic_condition and disease, so we'll make sure to remove any duplicate edges to cut down on clutter.

In [8]:
mondo_to_hgnc_predicates = list(set(predicate_map["genetic_condition"]["gene"] + predicate_map["disease"]["gene"]))
edges = [
    # source, target, pred
    (mondo, hgnc, predicate) for predicate in mondo_to_hgnc_predicates
]
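
Deduplicating with `set` works, but the order of the resulting list is arbitrary. If a stable order is preferred, for example to compare output between runs, one alternative is `dict.fromkeys`, which keeps the first occurrence of each item. A small sketch using the two predicate lists printed earlier:

```python
# Predicate lists copied from the output of the predicate_map cell.
genetic_condition_to_gene = [
    "has_phenotype", "disease_to_gene_association", "contributes_to",
    "biomarker_for", "gene_associated_with_condition",
]
disease_to_gene = [
    "disease_to_gene_association", "literature_co-occurrence", "contributes_to",
    "has_phenotype", "biomarker_for", "gene_associated_with_condition",
]

# dict.fromkeys drops duplicates while preserving first-seen order.
unique_predicates = list(dict.fromkeys(genetic_condition_to_gene + disease_to_gene))

# Same (source, target, predicate) tuple layout as the cell above.
candidate_edges = [("MONDO:0010940", "HGNC:10610", p) for p in unique_predicates]

print(unique_predicates)
```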

### What does the model think?
Let's see what the model predicts about the edges we just created between our two nodes.

Note: predict_edge is a bit of a janky prototype, but it should get the job done.

In [9]:
predict_edge(model, dataset, k_graph, edges, show_all=True)

Edge MONDO:0010940-[literature_co-occurrence]->HGNC:10610 predicted (-2.1301167011260986) (real=False)
Edge MONDO:0010940-[has_phenotype]->HGNC:10610 predicted (0.003076719818636775) (real=False)
Edge MONDO:0010940-[gene_associated_with_condition]->HGNC:10610 predicted (0.7244653701782227) (real=True)
Edge MONDO:0010940-[disease_to_gene_association]->HGNC:10610 predicted (-0.40032660961151123) (real=False)
Edge MONDO:0010940-[biomarker_for]->HGNC:10610 predicted (-1.684342861175537) (real=False)
Edge MONDO:0010940-[contributes_to]->HGNC:10610 predicted (-0.25001955032348633) (real=False)
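
The raw scores are easier to compare once sorted. The sketch below copies the scores from the output above and ranks them. Treating a score above zero as "predicted to be real" is an assumption made here for illustration; the threshold that `predict_edge` actually uses is not shown:

```python
# Scores copied from the predict_edge output for MONDO:0010940 -> HGNC:10610.
scores = {
    "literature_co-occurrence": -2.1301167011260986,
    "has_phenotype": 0.003076719818636775,
    "gene_associated_with_condition": 0.7244653701782227,
    "disease_to_gene_association": -0.40032660961151123,
    "biomarker_for": -1.684342861175537,
    "contributes_to": -0.25001955032348633,
}

# Highest score first.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for predicate, score in ranked:
    print(f"{score:+.3f}  {predicate}")

# Assumed cutoff: a positive score counts as a predicted edge.
positive = [predicate for predicate, score in ranked if score > 0]
```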


It looks like, of all the edges we fed in, a few are predicted to be real by the model, although none of the scores are very strong. Now let's look at the edges in the other direction, gene->mondo.

In [10]:
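hgnc_to_mondo_predicates = list(set(predicate_map["gene"]["genetic_condition"] + predicate_map["gene"]["disease"]))
edges = [
    # source, target, pred
    # Note: this iterates over mondo_to_hgnc_predicates rather than hgnc_to_mondo_predicates,
    # so the reversed edges also cover literature_co-occurrence and disease_to_gene_association
    (hgnc, mondo, predicate) for predicate in mondo_to_hgnc_predicates
]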

In [11]:
predict_edge(model, dataset, k_graph, edges, show_all=True)

Edge HGNC:10610-[literature_co-occurrence]->MONDO:0010940 predicted (-3.2013840675354004) (real=False)
Edge HGNC:10610-[has_phenotype]->MONDO:0010940 predicted (-0.009154091589152813) (real=False)
Edge HGNC:10610-[gene_associated_with_condition]->MONDO:0010940 predicted (1.9730486869812012) (real=True)
Edge HGNC:10610-[disease_to_gene_association]->MONDO:0010940 predicted (0.8160892724990845) (real=False)
Edge HGNC:10610-[biomarker_for]->MONDO:0010940 predicted (0.7997798919677734) (real=False)
Edge HGNC:10610-[contributes_to]->MONDO:0010940 predicted (-0.6185520887374878) (real=False)


There are a few strong ones, but the most important is probably disease_to_gene_association. Although we can't say for sure that any of the edges we've created should definitively exist between these two nodes, I think most would agree that, given that the predicate gene_associated_with_condition exists, the predicate disease_to_gene_association is probably also valid between the two nodes.
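
The same predicate gets a different score in each direction because the ComplEx scoring function is not symmetric in general: it takes the real part of a trilinear product in which the object embedding is conjugated. A minimal sketch with made-up two-dimensional embeddings (the vectors below are illustrative values, not taken from the trained model):

```python
def complex_score(subject, relation, obj):
    """ComplEx score: Re(sum_k subject_k * relation_k * conj(obj_k))."""
    total = sum(s * r * o.conjugate() for s, r, o in zip(subject, relation, obj))
    return total.real


# Illustrative embeddings only; a trained model would learn these values.
gene = [complex(0.5, 0.2), complex(-0.3, 0.7)]
condition = [complex(0.1, -0.4), complex(0.6, 0.3)]
relation = [complex(0.8, 0.5), complex(-0.2, 0.9)]

forward = complex_score(gene, relation, condition)
backward = complex_score(condition, relation, gene)
print(forward, backward)

# With a purely real relation embedding, the score becomes symmetric.
real_relation = [complex(0.8, 0.0), complex(-0.2, 0.0)]
sym_forward = complex_score(gene, real_relation, condition)
sym_backward = complex_score(condition, real_relation, gene)
print(sym_forward, sym_backward)
```

The imaginary part of the relation embedding is what lets the model score `gene -> condition` differently from `condition -> gene`.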

## Conclusion
Given two nodes, *inherited susceptibility to asthma* and *CCL11* with a single edge between them, *gene_associated_with_condition*, the model is able to predict other possible edges between the two and demonstrates that it is learning meaningful relationships within the graph.

### Takeaways
If you download this notebook and run it yourself, you'll see that the model does not always predict the same thing, so what follows will likely vary for you to some degree. Additionally, it is important to note that the model is not perfect. When running it on just one specific example, it's unrealistic to expect it to get it right every time.
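
One common source of this kind of run-to-run variation is random initialization of the embeddings. The sketch below illustrates the effect using only the standard library. The function `init_embedding` is hypothetical, and whether `make_model` exposes a seed is not shown in this notebook:

```python
import random


def init_embedding(dim, seed=None):
    """Draw a random complex embedding; a fixed seed makes it repeatable."""
    rng = random.Random(seed)
    return [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(dim)]


# Without a seed, each call starts from a different random state.
unseeded_a = init_embedding(4)
unseeded_b = init_embedding(4)

# With the same seed, the starting point is identical on every run.
seeded_a = init_embedding(4, seed=0)
seeded_b = init_embedding(4, seed=0)

print(unseeded_a == unseeded_b)
print(seeded_a == seeded_b)
```

If the training code allows it, fixing the seed would make the scores reported here reproducible; averaging scores over several seeds is another way to reduce the noise.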

With that said, there are some predictions in particular that I find interesting:
1. **disease_to_gene_association** is obviously going to be very similar to gene_associated_with_condition, and it looks like the model has made this connection. At the very least, the model should be consistently making simple connections like these.

2. **literature_co-occurrence** is a strange type and I'm not completely sure how it really works. Sometimes the model will get this correct and sometimes it won't. However, doing some quick searching, it's quite easy to find prior work done concerning CCL11 and inherited susceptibility to asthma. See: https://www.jacionline.org/article/S0091-6749(05)02508-X/fulltext. It's just not exactly realistic to expect the model to get this predicate correct because of just how unpredictably it behaves.

3. **contributes_to** again demonstrates that the model is able to find patterns in edge types, and it seems very realistic that if CCL11 is "likely pathogenic for" inherited susceptibility to asthma, then CCL11 also contributes to this condition.

4. **biomarker_for** is just like contributes_to. If this gene is likely pathogenic for inherited susceptibility to asthma, then it follows that the gene may also be a biomarker for this condition. This reasoning is substantiated in numerous [biomedical papers](https://pubmed.ncbi.nlm.nih.gov/24796647/). I've only listed one, but simply searching "CCL11 biomarker asthma" should yield more if you're interested.

