# Knowledge Annotation and Visualization

### Summarizing Knowledge using Network Analysis

The **Clinical Knowledge Graph (CKG)** can be used to annotate a list of proteins based on their connections in the Knowledge Graph. CKG generates a comprehensive graph with all the connections to Diseases, Drugs, Protein Complexes, Pathways and Biological processes. 

All the connections extracted from CKG are then summarized into a smaller subgraph containing only the top 15 nodes of each type (Disease, Drug, Complex, Pathway, Biological_process, Publication), ranked by one of several network analysis algorithms (betweenness centrality, closeness centrality or pagerank).

The connections extracted from the graph are:

- Protein-protein interactions
- Protein-disease associations
- Protein-drug associations
- Drug-drug interactions
- Drug-disease indications
- Protein-complex associations
- Protein-publication mentions
- Disease-publication mentions

These connections are extracted with the queries defined in `report_manager/queries/knowledge_annotation.yml`, which can be extended with new queries that follow the same format.

Here, we show several examples of how to extract and visualize knowledge for a list of proteins, either specifying the disease/diseases being studied or letting CKG figure it out.

In [40]:
import pandas as pd
from report_manager import knowledge

## Annotation of Proteins Linked to a Specific Disease

We use the **Open Targets platform** (https://www.targetvalidation.org/) to obtain a list of genes associated with fibromyalgia. Open Targets compiled a list of 57 protein targets associated with fibromyalgia (https://www.targetvalidation.org/disease/EFO_0005687/associations?fcts=datatype:known_drug).

**Fibromyalgia** is a medical condition characterized by chronic widespread pain and a heightened pain response to pressure. Other symptoms include tiredness to a degree that normal activities are affected, sleep problems and troubles with memory (source: https://en.wikipedia.org/wiki/Fibromyalgia).


We feed the list of proteins to CKG to prioritize all the knowledge gathered in the graph to reveal relationships to other possibly related diseases as well as possible treatments and altered biological processes and pathways.


In [41]:
# Path to the Open Targets export with the fibromyalgia-associated targets
opentargets_file = 'tmp/targets_associated_with_fibromyalgia.csv'
opentargets_data = pd.read_csv(opentargets_file, sep=',', header=0)

In [42]:
opentargets_data.head()

Unnamed: 0,target.gene_info.symbol,target.id,association_score.overall,association_score.datatypes.genetic_association,association_score.datatypes.somatic_mutation,association_score.datatypes.known_drug,association_score.datatypes.affected_pathway,association_score.datatypes.rna_expression,association_score.datatypes.literature,association_score.datatypes.animal_model,target.gene_info.name
0,HTR2A,ENSG00000102468,1.0,0.0,0.0,1.0,0.0,0.0,0.043838,0.0,5-hydroxytryptamine receptor 2A
1,OPRM1,ENSG00000112038,1.0,0.0,0.0,1.0,0.0,0.0,0.030729,0.0,opioid receptor mu 1
2,DRD2,ENSG00000149295,1.0,0.0,0.0,1.0,0.0,0.0,0.024,0.0,dopamine receptor D2
3,PTGS2,ENSG00000073756,1.0,0.0,0.0,1.0,0.0,0.0,0.0128,0.0,prostaglandin-endoperoxide synthase 2
4,SLC6A2,ENSG00000103546,1.0,0.0,0.0,1.0,0.0,0.0,0.0104,0.0,solute carrier family 6 member 2


In [43]:
target_list = opentargets_data['target.gene_info.symbol'].tolist()

In [44]:
len(target_list)

57

### Knowledge Object

To annotate the list of proteins, we create an empty object of type Knowledge.

Once we have the object, we can simply call the function `annotate_list()` specifying the list of proteins and in this case the disease (or diseases) and what type of entities we want to annotate (Disease, Drug, Pathway, etc.).

In [45]:
#Create Knowledge object
kn = knowledge.Knowledge(identifier='Fibromyalgia', data=None)

In [46]:
# Annotate the list of proteins using function annotate_list
kn.annotate_list(query_list=target_list, # list of proteins
                 entity_type='protein', # type of items in the list
                 queries_file=None, # Allows YML file with customized queries or the default (None)
                 attribute='name',  # What we provide in the list (name, id)
                 diseases=['fibromyalgia'], # List of diseases
                 entities=None) # what types of annotations (Disease, Drug, Pathway, etc.)

This function runs all the queries in `queries_file` (default: `report_manager/queries/knowledge_annotation.yml`) associated with the `entity_type` (protein) and limits the queried information to relationships involving the proteins in the provided list.

## Summarization and Visualization

The graph contains millions of relationships, and the results from the annotation may be too cumbersome to interpret directly.

To summarize the results and make them easier to understand and navigate, CKG uses network analysis algorithms (betweenness centrality, closeness centrality and pagerank) to prioritize the nodes in the knowledge annotation graph.

The result summarizes the relationships of the top 15 nodes of each entity type according to these algorithms (Disease, Drug, Pathway, Biological_process, Complex, Publication).
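The ranking step described above can be sketched with NetworkX on a toy graph. This is a minimal illustration, not CKG's actual implementation: the helper `top_nodes_per_type` and the node attribute name `type` are assumptions made for the example, and the toy graph is invented.

```python
import networkx as nx

def top_nodes_per_type(graph, method="betweenness", top_n=15):
    """Hypothetical helper: rank nodes by a centrality score and keep
    the top_n nodes of each node type (read from the 'type' attribute)."""
    scorers = {
        "betweenness": nx.betweenness_centrality,
        "closeness": nx.closeness_centrality,
        "pagerank": nx.pagerank,
    }
    scores = scorers[method](graph)

    # Group node identifiers by their type attribute
    by_type = {}
    for node, attrs in graph.nodes(data=True):
        by_type.setdefault(attrs.get("type", "Unknown"), []).append(node)

    # Sort each group by score (highest first) and truncate to top_n
    return {
        node_type: sorted(nodes, key=lambda n: scores[n], reverse=True)[:top_n]
        for node_type, nodes in by_type.items()
    }

# Invented toy graph: three proteins linked to two diseases
toy = nx.Graph()
toy.add_nodes_from(["P1", "P2", "P3"], type="Protein")
toy.add_nodes_from(["D1", "D2"], type="Disease")
toy.add_edges_from([("P1", "D1"), ("P2", "D1"), ("P3", "D1"), ("P3", "D2")])

top = top_nodes_per_type(toy, method="betweenness", top_n=1)
print(top)
```

With `top_n=1`, the helper keeps the single best-connected node of each type; CKG's summary uses 15 per type, as stated above.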

The summarized results can be visualized either as a Sankey plot or as a network.

In [47]:
kn.generate_report(visualizations=['network', 'sankey'], # how to visualize the results (network, sankey) 
                   summarize=True, # Whether or not to summarize the annotation
                   method='betweenness', # Method for summarizing the annotation (betweenness, closeness, pagerank)
                   inplace=True) # If True, the summarized graph is saved; otherwise the full graph is kept

In [48]:
kn.report.visualize_report(environment='notebook')[0]

Cytoscape(data={'elements': [{'data': {'type': 'Protein', 'color': '#756bb1', 'centrality': 18.0, 'id': 'CACNA…

## All the Knowledge is Accessible

All the relationships extracted from the CKG are stored as a dataframe in the class property `data`.

In [49]:
kn.data.shape

(7174, 7)

In [50]:
kn.data.head()

Unnamed: 0,r.source,rel_type,source,source_type,target,target_type,weight
0,Reactome,ANNOTATED_IN_PATHWAY,CACNA2D2,[Protein],Phase 2 - plateau phase,[Pathway],
1,Reactome,ANNOTATED_IN_PATHWAY,CACNA2D2,[Protein],"Adrenaline,noradrenaline inhibits insulin secr...",[Pathway],
2,Reactome,ANNOTATED_IN_PATHWAY,CACNA2D2,[Protein],Phase 0 - rapid depolarisation,[Pathway],
3,Reactome,ANNOTATED_IN_PATHWAY,CACNA2D2,[Protein],Regulation of insulin secretion,[Pathway],
4,Reactome,ANNOTATED_IN_PATHWAY,CACNA2D2,[Protein],Presynaptic depolarization and calcium channel...,[Pathway],


In [51]:
kn.data.tail()

Unnamed: 0,r.source,rel_type,source,source_type,target,target_type,weight
186,,MENTIONED_IN_PUBLICATION,DRD2,[Protein],PMID:27283899,[Publication],
187,,MENTIONED_IN_PUBLICATION,DRD2,[Protein],PMID:30034335,[Publication],
188,,MENTIONED_IN_PUBLICATION,DRD2,[Protein],PMID:26885825,[Publication],
189,,MENTIONED_IN_PUBLICATION,ADRA2C,[Protein],PMID:27616990,[Publication],
190,,MENTIONED_IN_PUBLICATION,ADRA2C,[Protein],PMID:27826897,[Publication],
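Because the relationships are a regular pandas dataframe, they can be explored with the usual pandas operations. The sketch below rebuilds a handful of rows from the outputs above, keeping only a subset of the columns, so that it runs without a CKG installation; in a live session the same operations could presumably be applied to `kn.data` directly.

```python
import pandas as pd

# A few rows copied from the head/tail outputs above (subset of columns)
relationships = pd.DataFrame(
    [
        ("Reactome", "ANNOTATED_IN_PATHWAY", "CACNA2D2", "Phase 2 - plateau phase"),
        ("Reactome", "ANNOTATED_IN_PATHWAY", "CACNA2D2", "Regulation of insulin secretion"),
        (None, "MENTIONED_IN_PUBLICATION", "DRD2", "PMID:27283899"),
        (None, "MENTIONED_IN_PUBLICATION", "DRD2", "PMID:30034335"),
        (None, "MENTIONED_IN_PUBLICATION", "ADRA2C", "PMID:27616990"),
    ],
    columns=["r.source", "rel_type", "source", "target"],
)

# Number of relationships of each type
counts = relationships["rel_type"].value_counts()

# Number of distinct publications mentioning each protein
publications = relationships[relationships["rel_type"] == "MENTIONED_IN_PUBLICATION"]
publications_per_protein = publications.groupby("source")["target"].nunique()

print(counts)
print(publications_per_protein)
```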


The generated knowledge subgraph can also be accessed as a NetworkX Directed graph.

In [52]:
kn.graph

<networkx.classes.digraph.DiGraph at 0x15a9f7da0>
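Since the subgraph is a standard `DiGraph`, the regular NetworkX API applies to it. The following sketch uses a small hand-built graph rather than `kn.graph`; the attribute names `type` (on nodes) and `rel_type` (on edges) are assumptions chosen to mirror the dataframe columns and may differ from what CKG actually stores.

```python
import networkx as nx

# Hand-built stand-in for kn.graph (attribute names are assumptions)
g = nx.DiGraph()
g.add_node("DRD2", type="Protein")
g.add_node("CACNA2D2", type="Protein")
g.add_node("PMID:27283899", type="Publication")
g.add_node("Regulation of insulin secretion", type="Pathway")
g.add_edge("DRD2", "PMID:27283899", rel_type="MENTIONED_IN_PUBLICATION")
g.add_edge("CACNA2D2", "Regulation of insulin secretion", rel_type="ANNOTATED_IN_PATHWAY")

# Select nodes by type
proteins = sorted(n for n, node_type in g.nodes(data="type") if node_type == "Protein")

# Select edges by relationship type
pathway_edges = [
    (u, v) for u, v, rel in g.edges(data="rel_type") if rel == "ANNOTATED_IN_PATHWAY"
]

# Entities directly connected to a given protein
drd2_targets = list(g.successors("DRD2"))

print(g.number_of_nodes(), g.number_of_edges())
print(proteins, pathway_edges, drd2_targets)
```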

And the report can be downloaded to a specified directory. The directory will contain the Sankey visualization in `png` and `svg` formats, the network in `gml` and `json` formats as well as the nodes and edges (relationships) tables in `tsv` format.

In [53]:
kn.report.download_report('tmp/fibromyalgia')

In some cases, we are interested in annotating a list of proteins and identifying which diseases they may be related to.

In the previous example, specifying a disease prioritized the relationships to that disease while also identifying many other diseases associated with the list of proteins. Running the same annotation without specifying fibromyalgia brings up other diseases that may be relevant to investigate (e.g., opiate dependence).

In [54]:
# Annotate the list of proteins using function annotate_list
kn.annotate_list(query_list=target_list, # list of proteins
                 entity_type='protein', # type of items in the list
                 queries_file=None, # Allows YML file with customized queries or the default (None)
                 attribute='name',  # What we provide in the list (name, id)
                 diseases=[], # List of diseases
                 entities=None) # what types of annotations (Disease, Drug, Pathway, etc.)

In [55]:
kn.generate_report(visualizations=['network'], summarize=True, method='betweenness', inplace=True)

In [56]:
kn.report.visualize_report(environment='notebook')[0]

Cytoscape(data={'elements': [{'data': {'type': 'Protein', 'color': '#756bb1', 'centrality': 33.0, 'id': 'CACNA…

## Annotation of a List of Proteins

Providing a list of proteins without specifying any disease also illustrates the validity of the summarization method and the usefulness of the extracted knowledge.

Here, we show another example annotating a list of proteins (n=84) related to Alzheimer's disease from the **Open Targets Platform** (https://www.targetvalidation.org/disease/EFO_0000249/associations?fcts=datatype:affected_pathway).

In [57]:
# Path to the Open Targets export with the Alzheimer's disease-associated targets
opentargets_file = "tmp/targets_associated_with_Alzheimer's_disease.csv"
opentargets_data = pd.read_csv(opentargets_file, sep=',', header=0)
target_list = opentargets_data['target.gene_info.symbol'].tolist()
len(target_list)

84

In [58]:
# Annotate the list of proteins using function annotate_list
kn.annotate_list(query_list=target_list, # list of proteins
                 entity_type='protein', # type of items in the list
                 queries_file=None, # Allows YML file with customized queries or the default (None)
                 attribute='name',  # What we provide in the list (name, id)
                 diseases=[], # List of diseases
                 entities=None) # what types of annotations (Disease, Drug, Pathway, etc.)

In [59]:
kn.generate_report(visualizations=['sankey'], summarize=True, method='betweenness', inplace=False)

In [60]:
kn.report.visualize_report(environment='notebook')

[]

**Betweenness centrality** can be _slow_ depending on the number of relationships in the graph. There are other options for summarizing the knowledge annotation: **closeness centrality** or **pagerank**. 

In [61]:
kn.generate_report(visualizations=['sankey'], summarize=True, method='closeness', inplace=False)

In [62]:
kn.report.visualize_report(environment='notebook')

[]
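Before switching to a faster method, it can be useful to check how much the rankings agree. The sketch below compares the top-ranked nodes under betweenness and closeness centrality on a toy path graph; the helper `top_k` is hypothetical and the graph is unrelated to CKG data. Pagerank could be compared in the same way with `nx.pagerank`.

```python
import networkx as nx

def top_k(scores, k):
    """Hypothetical helper: return the set of the k highest-scoring nodes."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {node for node, _ in ranked[:k]}

# Toy graph: a simple chain 0 - 1 - 2 - 3 - 4
toy = nx.path_graph(5)

betweenness_top = top_k(nx.betweenness_centrality(toy), k=3)
closeness_top = top_k(nx.closeness_centrality(toy), k=3)

# Nodes selected by both methods
overlap = betweenness_top & closeness_top
print(sorted(overlap))
```

On this chain both methods favor the three inner nodes, so the overlap is complete; on real annotation graphs the two rankings may diverge.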

### References

Ochoa, D. et al. (2021). Open Targets Platform: supporting systematic drug–target identification and prioritisation. Nucleic Acids Research. https://academic.oup.com/nar/article/49/D1/D1302/5983621
