# 5. Topic extraction from NER
In this notebook we are going to perform a selection of topics from the entities recognized in each article. The process will be as follows:
* The Named Entity Recognizer object created in notebook _4_Named_Entity_Recognition_ will be loaded.
* The list of entities for a given text will be retrieved.
* After we have the list of entities, they will be linked to Wikidata.
* From the list of linked entities, we will create a graph with the Wikidata entities obtained by expanding some of their properties.
* Once the graph has been obtained, we will apply centrality algorithms to select the most representative entities from it, which will serve as topics for the text.

## Setup

In [1]:
%run __init__.py

logger.setLevel(logging.INFO)

INFO:root:Starting logger


In [2]:
from bokeh.io import output_notebook

output_notebook()

## Loading the data
We will start by loading the protocols dataframe created in notebook 2. We will then select the last article to demonstrate the main data workflow:

In [3]:
import pandas as pd

DF_FILE_PATH = os.path.join(NOTEBOOK_2_RESULTS_DIR, 'protocols_dataframe.pkl')

df = pd.read_pickle(DF_FILE_PATH)
protocols = df['full_text_cleaned'].values

In [4]:
text = protocols[-1]

## Loading the NER model
The named entity recognition model created in notebook 4 will now be loaded and used to obtain the entities of the article:

In [5]:
from collections import Counter

import en_core_sci_lg

from herc_common.text import NamedEntityRecognizer

ner = NamedEntityRecognizer(en_core_sci_lg, min_entity_counts=1)

In [6]:
nlp = en_core_sci_lg.load()
entities = ner.transform([text])
entities[0][:10]

['supernatant',
 'OMVs',
 'bacteria',
 'E. coli',
 'volume',
 'culture',
 'OMV',
 'Escherichia coli',
 'culture supernatant',
 'LB broth']
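The _min_entity_counts_ argument is not documented in this notebook. The sketch below assumes it is a minimum-frequency threshold and shows how such a filter could be written with _Counter_. The function name _filter_entities_by_count_ and the sample mentions are hypothetical, and this is not the actual _herc_common_ implementation.

```python
from collections import Counter


def filter_entities_by_count(entities, min_count=1):
    """Keep each distinct entity that appears at least min_count times.

    Entities are returned in order of first appearance.
    """
    counts = Counter(entities)
    return [entity for entity in counts if counts[entity] >= min_count]


# Hypothetical list of raw mentions, with repetitions
mentions = ['OMVs', 'bacteria', 'OMVs', 'E. coli', 'bacteria', 'OMVs']

print(filter_entities_by_count(mentions, min_count=2))
# -> ['OMVs', 'bacteria']
```

With a threshold of 1, as used above, every recognized entity would be kept.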

## Entity linking
Next, we will use the WikidataEntityLinker class to obtain the Wikidata URI of each entity recognized in the previous step:

In [7]:
from herc_common.entity_linking import WikidataEntityLinker

linker = WikidataEntityLinker()
linked_entities = linker.fit_transform(entities)
linked_entities[0][:5]


[('supernatant', 'http://www.wikidata.org/entity/Q40307811'),
 ('OMVs', 'http://www.wikidata.org/entity/Q2029890'),
 ('bacteria', 'http://www.wikidata.org/entity/Q10876'),
 ('E. coli', 'http://www.wikidata.org/entity/Q25419'),
 ('volume', 'http://www.wikidata.org/entity/Q39297')]
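The internals of _WikidataEntityLinker_ are not shown here. One common way to map a label to a Wikidata URI is the public _wbsearchentities_ action of the Wikidata API. The sketch below only builds the request URL and parses a hand-written response of the same shape, so it runs offline. The helper names _build_search_url_ and _first_match_uri_ are hypothetical, and taking the first match is an assumption rather than the linker's actual strategy.

```python
from urllib.parse import urlencode

WIKIDATA_API = 'https://www.wikidata.org/w/api.php'


def build_search_url(label, language='en', limit=1):
    """Build a wbsearchentities request URL for the given label."""
    params = {
        'action': 'wbsearchentities',
        'search': label,
        'language': language,
        'limit': limit,
        'format': 'json',
    }
    return f'{WIKIDATA_API}?{urlencode(params)}'


def first_match_uri(response):
    """Return the concept URI of the first search hit, or None if there is none."""
    matches = response.get('search', [])
    return matches[0]['concepturi'] if matches else None


# Hand-written response, trimmed to the fields used above (not a live API call)
sample_response = {
    'search': [
        {'id': 'Q25419',
         'concepturi': 'http://www.wikidata.org/entity/Q25419'}
    ]
}

print(build_search_url('E. coli'))
print(first_match_uri(sample_response))
```

A real linker would also need to handle ambiguous labels, rate limits and entities with no match at all.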

## Building the graph
After each entity has been linked to Wikidata, we will begin exploring their neighbourhood in the knowledge graph to obtain a list of candidates for our final topics:

In [8]:
from herc_common.graph import WikidataGraphBuilder

graph_builder = WikidataGraphBuilder(max_hops=2)
entity_graph = graph_builder.build_graph(linked_entities[0])

INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.
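To illustrate what a hop limit such as _max_hops=2_ means, here is a minimal sketch of a breadth-first expansion over a toy adjacency dictionary. It assumes that the builder stops expanding once a node is _max_hops_ links away from a linked entity. The function _expand_neighbourhood_ and the toy links are made up for the example and do not reproduce _WikidataGraphBuilder_ or real Wikidata statements.

```python
from collections import deque


def expand_neighbourhood(seeds, neighbours, max_hops=2):
    """Return {node: hop distance} for every node within max_hops of a seed."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # hop limit reached, do not expand this node further
        for neighbour in neighbours.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


# Toy links, invented for illustration only
toy_links = {
    'E. coli': ['bacteria'],
    'bacteria': ['organism'],
    'organism': ['entity'],
}

print(expand_neighbourhood(['E. coli'], toy_links, max_hops=2))
# -> {'E. coli': 0, 'bacteria': 1, 'organism': 2}
```

In the toy example 'entity' is three hops away from the seed, so it is left out of the result.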


In [9]:
from bokeh.io import show
from bokeh.layouts import gridplot

from herc_common.bokeh_utils import build_graph_plot

plot = build_graph_plot(entity_graph, "Linked entities graph")
show(plot)

Since the graph above is not fully connected, we will keep only its largest connected subgraph:

In [10]:
from herc_common.graph import get_largest_connected_subgraph

connected_entity_subgraph = get_largest_connected_subgraph(entity_graph)

plot = build_graph_plot(connected_entity_subgraph, "Linked entities graph")
show(plot)
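For reference, the largest connected subgraph can be obtained directly with NetworkX. The sketch below is one possible way to do it and is not necessarily how _get_largest_connected_subgraph_ is implemented. The function name _largest_connected_subgraph_ and the toy graph are hypothetical.

```python
import networkx as nx


def largest_connected_subgraph(graph):
    """Return a copy of the subgraph induced by the largest connected component."""
    if graph.is_directed():
        # Directed graphs have no plain connected components; ignore edge direction
        components = nx.weakly_connected_components(graph)
    else:
        components = nx.connected_components(graph)
    largest = max(components, key=len)
    return graph.subgraph(largest).copy()


# Toy graph with two components: {a, b, c} and {d, e}
toy_graph = nx.Graph()
toy_graph.add_edges_from([('a', 'b'), ('b', 'c'), ('d', 'e')])

toy_subgraph = largest_connected_subgraph(toy_graph)
print(sorted(toy_subgraph.nodes))
# -> ['a', 'b', 'c']
```

Working on a single connected component matters for the next step, since several centrality measures are only well defined on connected graphs.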

Now that we have the Wikidata graph built from the entities of the text, we will try several centrality algorithms and keep the top 9 entities returned by each one. These entities can be seen as potential topics for the publication:

In [11]:
import networkx.algorithms as nxa

from herc_common.graph import get_centrality_algorithm_results

def try_centrality_algorithms(g, algorithms, stop_uris, top_n=9):
    # Print the top_n ranked entities for each (function, name) pair
    for algorithm, name in algorithms:
        print(f'Algorithm: {name}')
        result = get_centrality_algorithm_results(g, algorithm, stop_uris, top_n)
        print('Topics:', result)
        print()
        
algorithms = [
    (nxa.centrality.information_centrality, "Information centrality"),
    (nxa.centrality.eigenvector_centrality_numpy, "Eigenvector centrality"),
    (nxa.centrality.closeness_centrality, "Closeness centrality"),
    (nxa.centrality.betweenness_centrality, "Betweenness centrality"),
    (nxa.centrality.load_centrality, "Load centrality")
]

stop_uris = ['Q4167836', 'Q11862829', 'Q13442814',
             'Q17339814', 'Q24017414', 'Q4671286',
             'Q47154513']
try_centrality_algorithms(connected_entity_subgraph,
                          algorithms,
                          stop_uris)

Algorithm: Information centrality
Topics: [({'qid': 'Q2996394', 'descs': {'en': 'process specifically pertinent to the functioning of integrated living units', 'es': 'conjunto de reacciones químicas que resultan en una transformación.'}, 'labels': {'en': 'biological process', 'es': 'proceso biológico'}, 'label': 'biological process', 'uris': ['https://www.wikidata.org/w/Q2996394', 'https://freebase.toolforge.org//m/0dyc2c', 'https://academic.microsoft.com/v2/detail113219260'], 'n': 2, 'node_color': '#fdae61'}, 0.0007444738078088246), ({'qid': 'Q8054', 'descs': {'en': 'biological molecule consisting of chains of amino acid residues', 'es': 'molécula biológica formada por cadenas lineales de aminoácidos'}, 'labels': {'en': 'protein', 'es': 'proteína'}, 'label': 'protein', 'uris': ['https://www.wikidata.org/w/Q8054', 'https://id.ndl.go.jp/auth/ndlsh/00572676', 'https://freebase.toolforge.org//m/05wvs', 'https://freebase.toolforge.org//m/0qnstdx', 'https://meshb.nlm.nih.gov/record/ui?ui=D0

Topics: [({'qid': 'Q422649', 'descs': {'en': 'polymer produced by a living organism', 'es': 'polímero producido por un organismo vivo'}, 'labels': {'en': 'biopolymer', 'es': 'biopolímero'}, 'label': 'biopolymer', 'uris': ['https://www.wikidata.org/w/Q422649', 'https://id.ndl.go.jp/auth/ndlsh/00570451', 'https://freebase.toolforge.org//m/0199z', 'https://www.jstor.org/topic/biopolymers', 'https://meshb.nlm.nih.gov/record/ui?ui=D001704', 'http://id.nlm.nih.gov/mesh/D05.750.078', 'http://id.nlm.nih.gov/mesh/D25.720.099', 'http://id.nlm.nih.gov/mesh/J01.637.051.720.099', 'https://academic.microsoft.com/v2/detail2778636629'], 'n': 2, 'node_color': '#fdae61'}, 0.21697490092470278), ({'qid': 'Q8054', 'descs': {'en': 'biological molecule consisting of chains of amino acid residues', 'es': 'molécula biológica formada por cadenas lineales de aminoácidos'}, 'labels': {'en': 'protein', 'es': 'proteína'}, 'label': 'protein', 'uris': ['https://www.wikidata.org/w/Q8054', 'https://id.ndl.go.jp/auth/nd

Topics: [({'qid': 'Q7094', 'descs': {'en': 'study of chemical processes in living organisms', 'es': 'ciencia que estudia la composición química de los seres vivos'}, 'labels': {'en': 'biochemistry', 'es': 'bioquímica'}, 'label': 'biochemistry', 'uris': ['https://www.wikidata.org/w/Q7094', 'https://id.ndl.go.jp/auth/ndlsh/00570312', 'https://freebase.toolforge.org//m/0193x', 'https://www.jstor.org/topic/biochemistry', 'https://meshb.nlm.nih.gov/record/ui?ui=D001671', 'http://vocabularies.unesco.org/thesaurus/concept218', 'http://uri.gbv.de/terminology/bk/44.33', 'https://academic.microsoft.com/v2/detail55493867'], 'n': 1, 'node_color': '#abdda4'}, 0.2980018625635739), ({'qid': 'Q422649', 'descs': {'en': 'polymer produced by a living organism', 'es': 'polímero producido por un organismo vivo'}, 'labels': {'en': 'biopolymer', 'es': 'biopolímero'}, 'label': 'biopolymer', 'uris': ['https://www.wikidata.org/w/Q422649', 'https://id.ndl.go.jp/auth/ndlsh/00570451', 'https://freebase.toolforge.o
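The ranking step can be sketched with plain NetworkX. The example below assumes that the scores are computed on the whole graph and that stop nodes are filtered out afterwards, which may differ from what _get_centrality_algorithm_results_ actually does. The function _top_central_nodes_ and the star-shaped toy graph are hypothetical.

```python
import networkx as nx


def top_central_nodes(graph, algorithm, stop_nodes=(), top_n=3):
    """Rank nodes by a centrality function, skipping the ones in stop_nodes."""
    scores = algorithm(graph)  # NetworkX centrality functions return {node: score}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    kept = [(node, score) for node, score in ranked if node not in stop_nodes]
    return kept[:top_n]


# Star graph: node 0 is the hub, nodes 1 to 4 are leaves
toy_star = nx.star_graph(4)

# The hub is at distance 1 from every other node, so its closeness is 1.0
print(top_central_nodes(toy_star, nx.closeness_centrality, top_n=1))

# With the hub in the stop list, only leaves are returned
print(top_central_nodes(toy_star, nx.closeness_centrality, stop_nodes={0}, top_n=2))
```

A stop list of this kind is useful when very generic nodes would otherwise dominate every ranking.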

## Setting up the pipeline
Now that we have seen the main data flow, we will build the final pipeline. This pipeline will receive a list of texts, and return 7 potential topics for each text by executing the steps described above:

In [12]:
from sklearn.pipeline import Pipeline

from herc_common.topic import TopicLabeller


topic_extractor = TopicLabeller(graph_builder, nxa.centrality.closeness_centrality,
                                num_labels_per_topic=7, stop_uris=stop_uris)
topic_pipe = Pipeline([('ner', ner),
                       ('entity_linker', linker),
                       ('topic_extractor', topic_extractor)])
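A scikit-learn _Pipeline_ chains its steps by feeding the output of each transformer into the next one. The toy sketch below shows the mechanism with two made-up transformers, _Lowercaser_ and _Tokenizer_, which stand in for the real steps. It assumes, as the pipeline above suggests, that each step exposes _fit_ and _transform_.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class Lowercaser(BaseEstimator, TransformerMixin):
    """Toy step: lowercase every text."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return [text.lower() for text in X]


class Tokenizer(BaseEstimator, TransformerMixin):
    """Toy step: split every text on whitespace."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return [text.split() for text in X]


toy_pipe = Pipeline([('lower', Lowercaser()),
                     ('tokens', Tokenizer())])

print(toy_pipe.fit_transform(['Outer Membrane Vesicles']))
# -> [['outer', 'membrane', 'vesicles']]
```

Packaging the steps this way means the whole chain can be fitted, applied and serialized as a single object.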

### Obtaining the topics
Before finishing this notebook, we will obtain the list of inferred topics for each article in the protocols dataset. To do so, we just have to call the _fit_transform_ method of our pipeline:

In [13]:
results = topic_pipe.fit_transform(protocols)
results[:5]

INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.

[[software (Q7397),
  chemistry (Q2329),
  science (Q336),
  research (Q42240),
  chemical substance (Q79529),
  product (Q2424752),
  interaction science (Q97008347)],
 [biological process (Q2996394),
  process (Q3249551),
  carbon dioxide transmembrane transport (Q14905232),
  death (Q4),
  response to carbon dioxide (Q22273540),
  cellular response to carbon dioxide (Q22273541),
  carbon dioxide transport (Q14902500)],
 [protein (Q8054),
  chemistry (Q2329),
  chemical compound (Q11173),
  chemical substance (Q79529),
  group or class of proteins (Q84467700),
  cell type (Q189118),
  biopolymer (Q422649)],
 [biological process (Q2996394),
  process (Q3249551),
  chemical compound (Q11173),
  chemistry (Q2329),
  chemical substance (Q79529),
  science (Q336),
  social organism behavior (Q921513)],
 [process (Q3249551),
  chemical compound (Q11173),
  biological process (Q2996394),
  glucose (Q37525),
  concept (Q151885),
  interaction science (Q97008347),
  intentional human action (

### Saving the results
Now we will merge the results into our protocols dataframe and save them to a CSV file. This file will contain the id and title of each article, together with the topics inferred by the system:

In [15]:
NEW_COL_NAME = 'topics_from_ner'

df[NEW_COL_NAME] = ['\n'.join([f"{str(topic)}, {topic.score:.4f}" for topic in result])
                        for result in results]
df.head()

Unnamed: 0,pr_id,title,abstract,materials,procedure,equipment,background,categories,authors,full_text,full_text_no_abstract,full_text_cleaned,full_text_no_abstract_cleaned,num_chars_text,topics_from_ner
0,e100,Scratch Wound Healing Assay,The scratch wound healing assay has been widel...,"Human MDA-MB-231 cell line (ATCC, catalog numb...",Grow cells in DMEM supplemented with 10% FBS.|...,BD Falcon 24-well tissue culture plate (Fisher...,,Cancer Biology|Invasion & metastasis|Cell biol...,Yanling Chen,Scratch Wound Healing Assay. The scratch wound...,Scratch Wound Healing Assay. Grow cells in DME...,Scratch Wound Healing Assay. The scratch wound...,Scratch Wound Healing Assay. Grow cells in DME...,2583,"software, 0.1962\nchemistry, 0.1951\nscience, ..."
1,e1029,ADCC Assay Protocol,Antibody-dependent cell-mediated cytotoxicity ...,Raji cells (ATCC)|A/California/04/2009 (H1N1) ...,Preperation of Target Cells\n\n\t\t\n\n\n\t\t\...,Round bottomed 96-well plate|Temperature contr...,,Immunology|Immune cell function|Cytotoxicity|C...,Vikram Srivastava|Zheng Yang|Ivan Fan Ngai...,ADCC Assay Protocol. Antibody-dependent cell-m...,ADCC Assay Protocol. Preperation of Target Cel...,ADCC Assay Protocol. Antibody-dependent cell-m...,ADCC Assay Protocol. Preperation of Target Cel...,3824,"biological process, 0.2109\nprocess, 0.1991\nc..."
2,e1072,Catalase Activity Assay in Candida glabrata,Commensal and pathogenic fungi are exposed to ...,Yeast strains \nNote: BG14 was used as the C. ...,Preparation of total soluble extracts\n\t\t\n\...,Orbital incubator shaker|Microfuge tubes|50 ml...,,Microbiology|Microbial biochemistry|Protein|Ac...,Emmanuel Orta-Zavalza|Marcela Briones-Martin...,Catalase Activity Assay in Candida glabrata. C...,Catalase Activity Assay in Candida glabrata. P...,Catalase Activity Assay in Candida glabrata. C...,Catalase Activity Assay in Candida glabrata. P...,4207,"protein, 0.2145\nchemistry, 0.2124\nchemical c..."
3,e1077,RNA Isolation and Northern Blot Analysis,The northern blot is a technique used in molec...,Vero cells (kidney epithelial cells extracted ...,RNA extraction\n\t\t\n\n\t\t\t\tCells were see...,"100 mm cell culture dishes (Corning, catalog n...",,Microbiology|Microbial genetics|RNA|RNA extrac...,Ying Liao|To Sing Fung|Mei Huang|Shouguo Fang|...,RNA Isolation and Northern Blot Analysis. The ...,RNA Isolation and Northern Blot Analysis. RNA ...,RNA Isolation and Northern Blot Analysis. The ...,RNA Isolation and Northern Blot Analysis. RNA ...,6890,"biological process, 0.1942\nprocess, 0.1920\nc..."
4,e1090,Flow Cytometric Analysis of Autophagic Activit...,Flow cytometry allows very sensitive and relia...,"Cells lines of interest (HepG2, HUH7, CMK, K56...",Maintain cells under standard tissue culture c...,"37 °C, 5% CO2 humidified incubator|Centrifuge|...",,Microbiology|Antimicrobial assay|Autophagy ass...,Metodi Stankov|Diana Panayotova-Dimitrova|Ma...,Flow Cytometric Analysis of Autophagic Activit...,Flow Cytometric Analysis of Autophagic Activit...,Flow Cytometric Analysis of Autophagic Activit...,Flow Cytometric Analysis of Autophagic Activit...,5890,"process, 0.1993\nchemical compound, 0.1984\nbi..."


In [16]:
results_df = df[['pr_id', 'title', NEW_COL_NAME]]
results_df.head()

Unnamed: 0,pr_id,title,topics_from_ner
0,e100,Scratch Wound Healing Assay,"software, 0.1962\nchemistry, 0.1951\nscience, ..."
1,e1029,ADCC Assay Protocol,"biological process, 0.2109\nprocess, 0.1991\nc..."
2,e1072,Catalase Activity Assay in Candida glabrata,"protein, 0.2145\nchemistry, 0.2124\nchemical c..."
3,e1077,RNA Isolation and Northern Blot Analysis,"biological process, 0.1942\nprocess, 0.1920\nc..."
4,e1090,Flow Cytometric Analysis of Autophagic Activit...,"process, 0.1993\nchemical compound, 0.1984\nbi..."


In [17]:
OUTPUT_FILE_NAME = "df_with_ner_topics.csv"

results_df.to_csv(os.path.join(NOTEBOOK_5_RESULTS_DIR, OUTPUT_FILE_NAME), index=False)

The pipeline will now be saved for later use in the final system:

In [18]:
from herc_common.utils import save_object

PIPE_OUTPUT_FILE_NAME = "topic_extraction_from_ner_pipe.pkl"

save_object(topic_pipe, os.path.join(NOTEBOOK_5_RESULTS_DIR, PIPE_OUTPUT_FILE_NAME))
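The implementation of _save_object_ is not shown in this notebook. Given the _.pkl_ extension, the sketch below assumes that it relies on Python's _pickle_ module and shows what a save and load round trip would look like in that case. The helpers _save_with_pickle_ and _load_with_pickle_ are hypothetical names, and a plain dictionary stands in for the pipeline.

```python
import os
import pickle
import tempfile


def save_with_pickle(obj, path):
    """Serialize obj to path using pickle."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)


def load_with_pickle(path):
    """Load a previously pickled object from path."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Stand-in object; the real pipeline would be saved the same way under this assumption
example = {'steps': ['ner', 'entity_linker', 'topic_extractor']}

with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, 'example.pkl')
    save_with_pickle(example, example_path)
    restored = load_with_pickle(example_path)

print(restored == example)
# -> True
```

Note that unpickling requires the classes of the saved object to be importable, and pickle files should only be loaded from trusted sources.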