# 4. Topic Labelling
In this notebook we will infer a label for each of the topics obtained in the Topic Modelling notebook. The process is divided as follows:
* We will begin by linking each term of every topic to [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) (entity linking).
* After the entity linking step, we will generate a graph with the neighbour nodes of each starting term by traversing their Wikidata information, following a set of properties to expand.
* Once a graph is built for each topic, we will remove the isolated subgraphs to keep only the main connected subgraph of each one.
* We will apply a set of centrality algorithms to find the Wikidata entity that best represents the topic subgraph. This node will be used as the label for the topic.
* Finally, the topic labels will be added to the final model, which will be serialized for further use in the following notebooks.

## Setup
As always, we will begin by loading a set of constants and initializing the logging system. Since we will be using Bokeh in this notebook, we will configure it to output the results in the Jupyter notebook:

In [1]:
%run __init__.py

INFO:root:Starting logger


In [2]:
from bokeh.io import output_notebook

output_notebook()

In [3]:
import pandas as pd

DF_FILE_PATH = os.path.join(NOTEBOOK_2_RESULTS_DIR, 'protocols_dataframe.pkl')

df = pd.read_pickle(DF_FILE_PATH)
protocols = df['full_text_cleaned'].values

## Entity linking

### Using the entity linking class
An entity linking class has been defined in the _entity_linking.py_ module of the _src_ directory. This class links the given words to their Wikidata entity by using the [wbsearchentities](https://www.wikidata.org/w/api.php?action=help&modules=wbsearchentities) module from the MediaWiki API:

In [4]:
from herc_common.entity_linking import WikidataEntityLinker

entity_linker = WikidataEntityLinker()
res = entity_linker.link_entity('agroforestry')
res

('agroforestry', 'http://www.wikidata.org/entity/Q397350')
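
To make the lookup more concrete, the following is a minimal sketch of what a `wbsearchentities` request and its response handling could look like. It is not the code of `WikidataEntityLinker`: the helper names `build_search_url` and `first_match` are hypothetical, and picking the first search hit is an assumption about the linking strategy. The response is a canned example with the same shape as the API output, so no network call is made.

```python
import json
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(term, language="en", limit=1):
    """Build a wbsearchentities request URL for the given term (hypothetical helper)."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def first_match(term, response):
    """Return (term, entity URI) for the first search hit, or (term, None)."""
    hits = response.get("search", [])
    if not hits:
        return term, None
    # Assumption: the first hit is taken as the linked entity.
    return term, hits[0]["concepturi"]


# Canned response mimicking the JSON returned by the API.
sample_response = json.loads(
    '{"search": [{"id": "Q397350", '
    '"concepturi": "http://www.wikidata.org/entity/Q397350", '
    '"label": "agroforestry"}]}'
)

print(build_search_url("agroforestry"))
print(first_match("agroforestry", sample_response))
```

Taking the first hit is simple but ignores context, so ambiguous terms may be linked to an unrelated entity.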

### Linking each topic's term to Wikidata
In the following cells we are going to load the LDA model trained on the Agriculture dataset, obtain the term distribution of each topic, and link each term to Wikidata. We will start by loading both the LDA pipeline and the document-term matrix with the term frequencies:

In [5]:
from herc_common.utils import load_object

lda_agriculture_pipe_filename = "protocols_lda_model.pkl"
dtm_tf_filename = "protocols_dtm_tf.pkl"

lda_pipe = load_object(os.path.join(NOTEBOOK_3_RESULTS_DIR, lda_agriculture_pipe_filename))
dtm_tf = load_object(os.path.join(NOTEBOOK_3_RESULTS_DIR, dtm_tf_filename))

To obtain the list of terms for each topic, we are going to use the _get\_topic\_terms\_by\_relevance_ function, which returns the most relevant terms of each topic (see [Sievert & Shirley](https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf) for more information).
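
As a rough illustration of the relevance measure from Sievert & Shirley, the sketch below ranks terms by `lambda * log(phi) + (1 - lambda) * log(phi / p)`, where `phi` is the probability of a term within a topic and `p` is its overall probability in the corpus. The function names and the toy numbers are made up for this example, and the actual _get\_topic\_terms\_by\_relevance_ implementation may differ.

```python
import numpy as np


def relevance(topic_term_probs, term_probs, lambda_=0.6):
    """Relevance of every term for every topic (Sievert & Shirley, 2014)."""
    phi = np.asarray(topic_term_probs, dtype=float)  # shape: (topics, terms)
    p = np.asarray(term_probs, dtype=float)          # shape: (terms,)
    return lambda_ * np.log(phi) + (1 - lambda_) * np.log(phi / p)


def top_terms(vocab, topic_term_probs, term_probs, n_top_words, lambda_=0.6):
    """Return the n_top_words most relevant terms of each topic."""
    scores = relevance(topic_term_probs, term_probs, lambda_)
    # Sort each row in descending order of relevance and keep the first columns.
    order = np.argsort(scores, axis=1)[:, ::-1][:, :n_top_words]
    return [[vocab[i] for i in row] for row in order]


# Toy data: two topics over a three-term vocabulary.
vocab = ["cell", "sample", "nematode"]
phi = [[0.5, 0.4, 0.1],
       [0.7, 0.1, 0.2]]
p_w = [0.6, 0.25, 0.15]

print(top_terms(vocab, phi, p_w, n_top_words=1, lambda_=1.0))
print(top_terms(vocab, phi, p_w, n_top_words=1, lambda_=0.0))
```

With `lambda_=1.0` terms are ranked purely by their probability within the topic, while lower values favour terms that are distinctive of the topic even if they are less frequent.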

In [6]:
from herc_common.utils import get_topic_terms_by_relevance

def link_topic_terms(entity_linker, model, vectorizer,
                     dtm_tf, n_top_words, lambda_=0.6):
    """Link the most relevant terms of each topic to Wikidata."""
    topic_terms = get_topic_terms_by_relevance(model, vectorizer, dtm_tf,
                                               n_top_words, lambda_)
    # One list per topic, holding a (term, wikidata_uri) tuple for each term
    return [[entity_linker.link_entity(entity) for entity in topic]
            for topic in topic_terms]


Finally, we can make use of the function defined above to link each term to Wikidata. The output of the following cell is a 2D array: the first dimension corresponds to the topics, and the second one consists of tuples containing the pair ('term', 'wikidata_uri') for every term of the topic:

In [7]:
linked_terms = link_topic_terms(entity_linker, lda_pipe.named_steps['model'],
                                lda_pipe.named_steps['vectorizer'], dtm_tf, 
                                n_top_words=10, lambda_=0.8)
linked_terms

[[('nematode', 'http://www.wikidata.org/entity/Q5185'),
  ('specimen', 'http://www.wikidata.org/entity/Q2114846'),
  ('sample', 'http://www.wikidata.org/entity/Q485146'),
  ('activity', 'http://www.wikidata.org/entity/Q1914636'),
  ('tissue', 'http://www.wikidata.org/entity/Q40397'),
  ('sod', 'http://www.wikidata.org/entity/Q525728'),
  ('image', 'http://www.wikidata.org/entity/Q478798'),
  ('section', 'http://www.wikidata.org/entity/Q3181348'),
  ('microscopy', 'http://www.wikidata.org/entity/Q27724680'),
  ('bone', 'http://www.wikidata.org/entity/Q265868')],
 [('cell', 'http://www.wikidata.org/entity/Q7868'),
  ('medium', 'http://www.wikidata.org/entity/Q340169'),
  ('day', 'http://www.wikidata.org/entity/Q573'),
  ('plate', 'http://www.wikidata.org/entity/Q57216'),
  ('wash', 'http://www.wikidata.org/entity/Q1437299'),
  ('dish', 'http://www.wikidata.org/entity/Q57216'),
  ('device', 'http://www.wikidata.org/entity/Q3966'),
  ('pipette', 'http://www.wikidata.org/entity/Q163373'),
 

## Obtaining each topic's graphs
In this phase we are going to explore the neighbourhood of each term linked before, to obtain a graph with their related terms from Wikidata. Each set of terms obtained before will be the seed concepts used to obtain the final graph, and a set of properties from Wikidata will be explored recursively to expand the final graph. 

For more information about the implementation of the graph building process, the class used can be accessed at the _graph.py_ module in the source directory.
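
The general idea of expanding a graph from a set of seed nodes up to a fixed number of hops can be sketched with a breadth-first traversal. The example below is not the `WikidataGraphBuilder` implementation: the function `expand_graph` is hypothetical, and an in-memory dictionary with made-up relations stands in for the Wikidata properties that the real builder queries.

```python
from collections import deque

import networkx as nx


def expand_graph(seeds, neighbours, max_hops=2):
    """Breadth-first expansion from the seeds, up to max_hops from any seed."""
    graph = nx.Graph()
    queue = deque()
    for seed in seeds:
        graph.add_node(seed, depth=0)
        queue.append(seed)
    while queue:
        node = queue.popleft()
        depth = graph.nodes[node]["depth"]
        if depth == max_hops:
            continue  # do not expand nodes at the maximum depth
        for other in neighbours.get(node, []):
            if other not in graph:
                graph.add_node(other, depth=depth + 1)
                queue.append(other)
            graph.add_edge(node, other)
    return graph


# Toy neighbourhood; these relations are illustrative only.
toy_neighbours = {
    "tissue": ["anatomical structure"],
    "bone": ["organ"],
    "organ": ["anatomical structure"],
    "anatomical structure": ["anatomical entity"],
    "anatomical entity": ["entity"],
}

g = expand_graph(["tissue", "bone"], toy_neighbours, max_hops=2)
print(sorted(g.nodes(data="depth")))
print(nx.is_connected(g))
```

In this toy example the two seeds only become connected at a depth of two, and very general nodes such as "entity" stay out of the graph because they are further away.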

In the following cell we configure the graph builder to build a graph with a maximum depth of two from every seed node. Higher depth values might make the resulting topic labels very general, while smaller values risk leaving the seed nodes unconnected:

In [8]:
from herc_common.graph import WikidataGraphBuilder

graph_builder = WikidataGraphBuilder(max_hops=2)
topic_graphs = [graph_builder.build_graph(topic) for topic in linked_terms]

INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.
INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.
INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.
INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.
INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.
INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.
INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.
INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.
INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.
INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.
INFO:herc_common.graph:Started building graph.
INF

Now that we have obtained the neighbourhood graph of each topic, we are going to plot the results using Bokeh. Each node will have a different color depending on its depth with respect to the seed nodes, which will be painted in blue. This will allow us to perform an initial exploration of these graphs:

In [9]:
from bokeh.io import show
from bokeh.layouts import gridplot

from herc_common.bokeh_utils import build_graph_plot


plots = [build_graph_plot(g, f"Topic {idx}") 
         for idx, g in enumerate(topic_graphs)]
grid = gridplot(plots, ncols=2)
show(grid)

Ideally, every seed term would be connected in the final graph. However, there will be some subgraphs which are isolated from the main one. In the following section we will address this issue.

## Getting the main connected subgraph
As we have described before, some of the topic graphs that we have obtained are not fully connected. Small subgraphs which are isolated from the main subgraph will be considered as noise, and removed before the following computations.

In the following cells, we are going to retrieve the largest connected subgraph from each topic's graph, and plot the results to analyse them:

In [10]:
from herc_common.graph import get_largest_connected_subgraph

connected_topic_subgraphs = [get_largest_connected_subgraph(g) 
                             for g in topic_graphs]
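
For reference, extracting the largest connected component can be done in a few lines with NetworkX. This is a minimal sketch that assumes an undirected, non-empty graph; the function name `largest_component` is hypothetical and the code of `get_largest_connected_subgraph` may differ.

```python
import networkx as nx


def largest_component(graph):
    """Return a copy of the largest connected component of an undirected graph."""
    # connected_components yields one set of nodes per component.
    nodes = max(nx.connected_components(graph), key=len)
    return graph.subgraph(nodes).copy()


# Toy graph: a chain of four related nodes plus an isolated pair.
g = nx.Graph()
g.add_edges_from([("gene", "genome"), ("genome", "chromosome"),
                  ("chromosome", "DNA")])
g.add_edge("day", "unit of time")  # isolated from the main subgraph

main = largest_component(g)
print(sorted(main.nodes))
```

Calling `.copy()` returns an independent graph instead of a read-only view of the original one.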

In [11]:
plots = [build_graph_plot(g, f"Largest Connected subgraph for topic {idx}") 
         for idx, g in enumerate(connected_topic_subgraphs)]
grid = gridplot(plots, ncols=2)
show(grid)

Ideally, these graphs should be large and contain as many seed nodes as possible. Graphs with few seed nodes from the original term distribution will tend to be less representative of the original topic.

## Obtaining the main component of each topic
Now that we have the final subgraph for each topic, we will apply several centrality measures to obtain the node that best represents the topic. In the following cell we define an auxiliary function that receives a list of algorithms and prints the results of applying each one to obtain the best _n_ entities that represent each topic:

In [12]:
import networkx.algorithms as nxa

from herc_common.graph import get_centrality_algorithm_results

def try_centrality_algorithms(topic_subgraphs, algorithms, stop_uris, top_n=4):
    """Print the top_n most central nodes of each topic, for every algorithm."""
    for (algorithm, name) in algorithms:
        print(f'Algorithm: {name}')
        results = [get_centrality_algorithm_results(g, algorithm, stop_uris, top_n)
                   for g in topic_subgraphs]
        # Keep only the label of each node and its centrality score
        results_labels = [[(node[0]['label'], node[1]) for node in topic] 
                          for topic in results]
        for idx, result in enumerate(results_labels):
            print(f"Topic {idx}:", result)
            print()
        print()

        
algorithms = [
    (nxa.centrality.information_centrality, "Information centrality"),
    (nxa.centrality.eigenvector_centrality_numpy, "Eigenvector centrality"),
    (nxa.centrality.closeness_centrality, "Closeness centrality"),
    (nxa.centrality.betweenness_centrality, "Betweenness centrality"),
    (nxa.centrality.communicability_betweenness_centrality, "Communicability betweenness centrality")
]

try_centrality_algorithms(connected_topic_subgraphs,
                          algorithms,
                          ['Q4167836', 'Q11862829'])

Algorithm: Information centrality
Topic 0: [('anatomical structure', 0.004469363270219544), ('tissue', 0.004455636242600729), ('bone', 0.004281855299669529), ('organ', 0.004266448287684038)]

Topic 1: [('interaction science', 0.0033606773694677644), ('communication medium', 0.0032228477983422274), ('culture', 0.003176218487945696), ('anthropology', 0.0031371086169037724)]

Topic 2: [('protein', 0.002660649487201953), ('biological process', 0.002559572876380823), ('ribonucleoprotein', 0.0024713402521658588), ('group or class of chemical substances', 0.002462478319774918)]

Topic 3: [('gene', 0.026058631921824105), ('genome', 0.025559105431309903), ('chromosome', 0.024615384615384615), ('DNA', 0.020253164556962026)]

Topic 4: [('brain', 0.0049438776038548534), ('neuroscience', 0.004755781587231344), ('brain stem', 0.004616296866095637), ('rhombencephalon', 0.004559329159101296)]

Topic 5: [('protein', 0.01982480405716921), ('first-order metaclass', 0.015333412575775583), ('Magnesium-prot
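
The ranking step itself can be sketched as follows: compute a centrality score for every node, discard the nodes in a stop list, and keep the best `top_n`. The function `top_central_nodes` is a hypothetical stand-in for `get_centrality_algorithm_results`, whose actual behaviour may differ; closeness centrality and the toy path graph are chosen only to keep the example small.

```python
import networkx as nx


def top_central_nodes(graph, algorithm, stop_nodes=(), top_n=4):
    """Rank nodes by centrality, skipping the ones listed in stop_nodes."""
    scores = algorithm(graph)  # dict mapping node -> centrality score
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(node, score) for node, score in ranked
            if node not in stop_nodes][:top_n]


# Toy graph: a simple path whose middle node is the most central one.
g = nx.Graph()
nx.add_path(g, ["nematode", "animal", "organism", "tissue", "bone"])

print(top_central_nodes(g, nx.closeness_centrality, top_n=1))
print(top_central_nodes(g, nx.closeness_centrality,
                        stop_nodes={"organism"}, top_n=2))
```

A stop list is useful to keep overly generic nodes from being selected as labels.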

## Add labels to LDA model
Finally, we will save the best results to the LDA model that was trained previously. When the model is loaded again and a topic is inferred for a given text, it will also be able to return a representative label for that topic, which will also be linked to Wikidata:

In [13]:
from herc_common.topic import Topic

final_results = [get_centrality_algorithm_results(g,
                                                  nxa.centrality.information_centrality,
                                                  ['Q4167836', 'Q11862829'], top_n=1)
                 for g in connected_topic_subgraphs]

final_results_topics = [Topic.from_node(topic[0], topic[1], "lda") 
                        for result in final_results for topic in result]
lda_model = lda_pipe.named_steps['model']

In [14]:
from tqdm import tqdm

import en_core_sci_lg
import string
import numpy as np

en_core_sci_lg.load()

<spacy.lang.en.English at 0x20ffbbf1f48>

In [15]:
from herc_common.topic import LabelledTopicModel

labelled_topic_model = LabelledTopicModel(lda_model, final_results_topics)

lda_pipe.steps.pop()
lda_pipe.steps.append(('model', labelled_topic_model))

In [16]:
from herc_common.utils import save_object

save_object(lda_pipe, os.path.join(NOTEBOOK_4_RESULTS_DIR, 'lda_pipe_with_labels.pkl'))
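
To illustrate the idea of wrapping a topic model so that it returns labels instead of bare topic indices, here is a minimal sketch. `LabelledModelSketch` and `FakeTopicModel` are hypothetical classes written for this example. The real `LabelledTopicModel` returns `Topic` objects and its implementation may differ.

```python
import numpy as np


class LabelledModelSketch:
    """Hypothetical wrapper that pairs each topic score with its label."""

    def __init__(self, model, labels):
        self.model = model    # object exposing transform(X) -> (docs, topics)
        self.labels = labels  # one label per topic, in topic order

    def transform(self, X):
        distributions = np.asarray(self.model.transform(X))
        results = []
        for row in distributions:
            order = np.argsort(row)[::-1]  # topics from highest to lowest score
            results.append([(self.labels[i], float(row[i])) for i in order])
        return results


class FakeTopicModel:
    """Stand-in for a fitted topic model; returns fixed distributions."""

    def transform(self, X):
        return np.array([[0.2, 0.8], [0.9, 0.1]])[:len(X)]


labelled = LabelledModelSketch(FakeTopicModel(), ["protein", "botany"])
print(labelled.transform(["first document", "second document"]))
```

Because the wrapper exposes the same `transform` method as the wrapped model, it can take its place as the last step of the pipeline.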

## Obtaining the results for every article in the dataset

In [17]:
results = lda_pipe.transform(protocols)





## Saving the results

In [19]:
NEW_COL_NAME = 'topics_from_lda'

df[NEW_COL_NAME] = ['\n'.join([f"{str(topic)}, {topic.score:.5f}" for topic in result])
                        for result in results]
df.head()

results_df = df[['pr_id', 'title', NEW_COL_NAME]]
results_df.head()

| | pr_id | title | topics_from_lda |
|---|---|---|---|
| 0 | e100 | Scratch Wound Healing Assay | interaction science, 0.99548\ninteraction scie... |
| 1 | e1029 | ADCC Assay Protocol | protein, 0.74729\ninteraction science, 0.25011... |
| 2 | e1072 | Catalase Activity Assay in Candida glabrata | botany, 0.62093\nprotein, 0.31150\ninteraction... |
| 3 | e1077 | RNA Isolation and Northern Blot Analysis | protein, 0.42861\nprotein, 0.29078\nanatomical... |
| 4 | e1090 | Flow Cytometric Analysis of Autophagic Activit... | interaction science, 0.99796\ninteraction scie... |


In [20]:
OUTPUT_FILE_NAME = "protocols_df_with_lda_topics.csv"

results_df.to_csv(os.path.join(NOTEBOOK_5_RESULTS_DIR, OUTPUT_FILE_NAME), index=False)