# 6. Topic extraction from NER
In this notebook we will select topics for each article from the entities recognized in it. The process will be as follows:
* The Named Entity Recognizer object created in notebook _4_Named_Entity_Recognition_ will be loaded.
* The list of entities for a given text will be retrieved.
* After we have the list of entities, they will be linked to Wikidata.
* From the list of linked entities, we will create a graph with the Wikidata entities obtained by expanding some of their properties.
* Once the graph has been obtained, we will apply centrality algorithms to select the most representative entities from it, which will serve as topics for the text.

## Setup

In [1]:
%run __init__.py

logger.setLevel(logging.INFO)

In [2]:
from bokeh.io import output_notebook

output_notebook()

## Loading the data
We will start by loading the agriculture dataframe created in notebook 2. After that, we will select the last article to demonstrate the main data workflow:

In [3]:
import pandas as pd

PMC_FILE_PATH = os.path.join(NOTEBOOK_2_RESULTS_DIR, 'pmc_dataframe.pkl')

pmc_df = pd.read_pickle(PMC_FILE_PATH)
publications = pmc_df['text_cleaned'].values

In [4]:
text = publications[-1]

## Loading the NER model
The named entity recognition model created in notebook 4 will now be loaded and used to obtain the entities of the article:

In [5]:
import en_core_sci_lg

from herc_common.utils import load_object
from collections import Counter

ner = load_object(os.path.join(NOTEBOOK_4_RESULTS_DIR, 'ner_system.pkl'))

In [6]:
nlp = en_core_sci_lg.load()
entities = ner.transform([text])
entities[0][:10]

['soil',
 'aestivum',
 'trees',
 'production',
 'DNA',
 'temperature',
 'species',
 'tree',
 'II-9',
 'melanosporum']
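
The list above can contain repeated or closely related mentions (for instance 'trees' and 'tree'). As a minimal sketch, assuming the recognizer returns one list of strings per text as shown, `collections.Counter` can summarize how often each mention appears. The `sample_entities` list is illustrative data, not the real output for this article:

```python
from collections import Counter

# Illustrative entity mentions for a single text (hypothetical data)
sample_entities = ['soil', 'trees', 'soil', 'DNA', 'tree', 'soil', 'DNA']

# Count how many times each mention was recognized
entity_counts = Counter(sample_entities)

# Most frequent mentions first
top_mentions = entity_counts.most_common(2)
print(top_mentions)
```

Frequent mentions are not necessarily the best topics, which is why the rest of the notebook relies on the structure of the knowledge graph instead of raw counts.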

## Entity linking
Now, we will use the WikidataEntityLinker class to obtain the Wikidata URI of each entity recognized before. Entities without a match in Wikidata are linked to None, as happens with 'aestivum' in the output of this cell:

In [7]:
from herc_common.entity_linking import WikidataEntityLinker

linker = WikidataEntityLinker()
linked_entities = linker.fit_transform(entities)
linked_entities[0][:5]

[('soil', 'http://www.wikidata.org/entity/Q36133'),
 ('aestivum', None),
 ('trees', 'http://www.wikidata.org/entity/Q10884'),
 ('production', 'http://www.wikidata.org/entity/Q739302'),
 ('DNA', 'http://www.wikidata.org/entity/Q7430')]
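
Since some mentions are not linked, it can be useful to separate them from the linked ones before working with the graph. The following is a minimal sketch, assuming the linker output is a list of `(mention, uri)` pairs as shown above. The helper name `split_linked_entities` is hypothetical and not part of herc_common:

```python
def split_linked_entities(linked):
    """Separate (mention, uri) pairs into linked QIDs and unlinked mentions."""
    qids = {}
    unlinked = []
    for mention, uri in linked:
        if uri is None:
            unlinked.append(mention)
        else:
            # The QID is the last path segment of the entity URI
            qids[mention] = uri.rsplit('/', 1)[-1]
    return qids, unlinked


sample = [('soil', 'http://www.wikidata.org/entity/Q36133'),
          ('aestivum', None),
          ('DNA', 'http://www.wikidata.org/entity/Q7430')]
qids, unlinked = split_linked_entities(sample)
print(qids)
print(unlinked)
```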

## Building the graph
After each entity has been linked to Wikidata, we will begin exploring their neighbourhood in the knowledge graph to obtain a list of candidates for our final topics:

In [8]:
from herc_common.graph import WikidataGraphBuilder

graph_builder = WikidataGraphBuilder(max_hops=2)
entity_graph = graph_builder.build_graph(linked_entities[0])

INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.
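
To give an intuition of what a two-hop expansion means, here is a minimal sketch of a breadth-first expansion over a toy neighbourhood table. It is not the WikidataGraphBuilder implementation: the `TOY_NEIGHBOURS` data and the `expand_entities` function are hypothetical, and we assume that the `n` attribute of each node records the hop at which it was reached:

```python
import networkx as nx

# Hypothetical adjacency data standing in for Wikidata property lookups
TOY_NEIGHBOURS = {
    'DNA': ['nucleic acid'],
    'nucleic acid': ['biopolymer'],
    'biopolymer': ['polymer'],
    'soil': ['natural resource'],
}


def expand_entities(seeds, neighbours, max_hops=2):
    """Breadth-first expansion of the seed entities up to max_hops."""
    graph = nx.Graph()
    for seed in seeds:
        graph.add_node(seed, n=0)  # seeds are at hop 0
    frontier = list(seeds)
    for hop in range(1, max_hops + 1):
        next_frontier = []
        for node in frontier:
            for neighbour in neighbours.get(node, []):
                if neighbour not in graph:
                    graph.add_node(neighbour, n=hop)
                    next_frontier.append(neighbour)
                graph.add_edge(node, neighbour)
        frontier = next_frontier
    return graph


toy_graph = expand_entities(['DNA', 'soil'], TOY_NEIGHBOURS, max_hops=2)
print(sorted(toy_graph.nodes(data='n')))
```

With `max_hops=2`, 'polymer' is left out because it is three hops away from 'DNA'. Larger values add more candidates at the cost of a bigger graph and more lookups.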


In [9]:
from bokeh.io import show
from bokeh.layouts import gridplot

from herc_common.bokeh_utils import build_graph_plot

plot = build_graph_plot(entity_graph, "Linked entities graph")
show(plot)

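Since the graph above is not fully connected, we will keep only its largest connected subgraph. Some of the centrality measures used later, such as information centrality, are only defined for connected graphs: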

In [10]:
from herc_common.graph import get_largest_connected_subgraph

connected_entity_subgraph = get_largest_connected_subgraph(entity_graph)

plot = build_graph_plot(connected_entity_subgraph, "Linked entities graph")
show(plot)
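
For reference, the same idea can be expressed directly with networkx. This is a sketch under the assumption that the graph is undirected. The helper from herc_common may be implemented differently:

```python
import networkx as nx


def largest_connected_subgraph(graph):
    """Return a copy of the largest connected component of an undirected graph."""
    largest_nodes = max(nx.connected_components(graph), key=len)
    return graph.subgraph(largest_nodes).copy()


# Toy graph with two components: {a, b, c} and {x, y}
toy = nx.Graph()
toy.add_edges_from([('a', 'b'), ('b', 'c'), ('x', 'y')])
core = largest_connected_subgraph(toy)
print(sorted(core.nodes()))
```

Calling `.copy()` returns an independent graph instead of a read-only view, so later modifications do not affect the original graph.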

Now that we have the Wikidata graph built from the entities of the text, we will try several centrality algorithms to obtain the top 9 entities that represent the text. These entities can be seen as potential topics for the publication:

In [11]:
import networkx.algorithms as nxa

from herc_common.graph import get_centrality_algorithm_results

def try_centrality_algorithms(g, algorithms, stop_uris, top_n=9):
    for (algorithm, name) in algorithms:
        print(f'Algorithm: {name}')
        result = get_centrality_algorithm_results(g, algorithm, stop_uris, top_n)
        print("Topics:", result)
        print()
        
algorithms = [
    (nxa.centrality.information_centrality, "Information centrality"),
    (nxa.centrality.eigenvector_centrality_numpy, "Eigenvector centrality"),
    (nxa.centrality.closeness_centrality, "Closeness centrality"),
    (nxa.centrality.betweenness_centrality, "Betweenness centrality"),
    (nxa.centrality.load_centrality, "Load centrality")
]

stop_uris = ['Q4167836', 'Q11862829', 'Q13442814',
             'Q17339814', 'Q24017414', 'Q4671286',
             'Q47154513']
try_centrality_algorithms(connected_entity_subgraph,
                          algorithms,
                          stop_uris)

Algorithm: Information centrality
Topics: [({'qid': 'Q3249551', 'desc': 'series of events which occur over an extended period of time', 'label': 'process', 'n': 1, 'node_color': '#abdda4'}, 0.0005888447734518098), ({'qid': 'Q12483', 'desc': 'study of the collection, organization, analysis, interpretation, and presentation of data', 'label': 'statistics', 'n': 2, 'node_color': '#fdae61'}, 0.000584568895472393), ({'qid': 'Q7430', 'desc': 'molecule that encodes the genetic instructions used in the development and functioning of all known living organisms and many viruses', 'label': 'DNA', 'n': 0, 'node_color': '#2b83ba'}, 0.0005810923210503984), ({'qid': 'Q2996394', 'desc': 'process specifically pertinent to the functioning of integrated living units', 'label': 'biological process', 'n': 2, 'node_color': '#fdae61'}, 0.000579444033311313), ({'qid': 'Q441', 'desc': 'science of plant life', 'label': 'botany', 'n': 2, 'node_color': '#fdae61'}, 0.0005642585583470689), ({'qid': 'Q7239', 'desc':
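
The ranking step can be sketched with plain networkx as follows. This is an illustration under assumptions, not the code of `get_centrality_algorithm_results`: the function `top_central_nodes` is hypothetical, and we assume that the stop list simply removes nodes from the ranking:

```python
import networkx as nx


def top_central_nodes(graph, algorithm, stop_nodes, top_n=3):
    """Rank nodes by a centrality measure, skipping the nodes in stop_nodes."""
    scores = algorithm(graph)  # dict: node -> centrality score
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(node, score) for node, score in ranked
            if node not in stop_nodes][:top_n]


# Toy graph: 'hub' is adjacent to every other node
toy = nx.Graph([('hub', 'a'), ('hub', 'b'), ('hub', 'c'), ('a', 'b')])
top = top_central_nodes(toy, nx.closeness_centrality,
                        stop_nodes={'c'}, top_n=2)
print(top)
```

Any function that maps a graph to a dictionary of node scores can be passed as `algorithm`, which is what makes it easy to compare several centrality measures on the same graph.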

## Setting up the pipeline
Now that we have seen the main data flow, we will build the final pipeline. This pipeline will receive a list of texts, and return 7 potential topics for each text by executing the steps described above:

In [12]:
from sklearn.pipeline import Pipeline

from herc_common.topic import TopicLabeller


topic_extractor = TopicLabeller(graph_builder, nxa.centrality.closeness_centrality,
                                num_labels_per_topic=7, stop_uris=stop_uris)
topic_pipe = Pipeline([('ner', ner),
                       ('entity_linker', linker),
                       ('topic_extractor', topic_extractor)])
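
Each step of a scikit-learn pipeline only needs to expose `fit` and `transform`, and the output of one step is passed as input to the next. The toy steps below are hypothetical stand-ins that illustrate this chaining. They do not reproduce the behaviour of the NER, linking or topic components:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class Lowercaser(BaseEstimator, TransformerMixin):
    """Toy step: lowercases each text."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.lower() for text in X]


class Tokenizer(BaseEstimator, TransformerMixin):
    """Toy step: splits each text into whitespace-separated tokens."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.split() for text in X]


toy_pipe = Pipeline([('lower', Lowercaser()), ('tokens', Tokenizer())])
toy_result = toy_pipe.fit_transform(['Soil DNA', 'Tree Species'])
print(toy_result)
```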

The pipeline will be saved at the end of this notebook for later use in the final system.

### Obtaining the topics
Before finishing with this notebook, we will obtain the list of inferred topics for each article of the agriculture dataset. To do so, we just have to call the _fit_transform_ method of our pipeline:

In [13]:
results = topic_pipe.fit_transform(publications)
results[:5]

INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.

[[Topic(label='chemistry', qid='Q2329', desc='branch of physical science concerned with the composition, structure and properties of matter', score=0.20094382706652458, t_type='ner'),
  Topic(label='breastfeeding', qid='Q174876', desc="feeding of babies and young children with milk from a woman's breast", score=0.19604930937175108, t_type='ner'),
  Topic(label='pharmacology', qid='Q128406', desc='study of the interactions that occur between a living organism and chemicals that affect normal or abnormal biochemical function', score=0.19434628975265017, t_type='ner'),
  Topic(label='sociology', qid='Q21201', desc='scientific study of human society and its origins, development, organizations, and institutions', score=0.1929542464551966, t_type='ner'),
  Topic(label='chemical element', qid='Q11344', desc='a species of atoms having the same number of protons in the atomic nucleus', score=0.191332077112625, t_type='ner'),
  Topic(label='forestry science', qid='Q19924411', desc='academic disc

### Saving the results
Now, we will merge the results into our agriculture dataframe and save them to a CSV file. This file will contain the id and title of each article, together with the topics inferred by the system:

In [14]:
NEW_COL_NAME = 'topics_from_ner'

pmc_df[NEW_COL_NAME] = ['\n'.join([f"{topic.label}, {topic.score:.4f}" for topic in result])
                        for result in results]
pmc_df.head()

Unnamed: 0,id,title,abstract,full_body,authors,references,subjects,text_cleaned,num_chars_text,topics_from_ner
0,E:\hercules\hercules-challenge-publications\da...,Induced Release of a Plant-Defense Volatile ‘D...,Transmission of plant pathogens by insect vect...,Introduction Transmission of plant pathogens b...,Mann Rajinder S.|Ali Jared G.|Hermann Sara L.|...,Insect vector relationships with procaryotic p...,Agriculture|Crops|Pest Control|Biology|Ecology...,Introduction Transmission of plant pathogens b...,54143,"chemistry, 0.2009\nbreastfeeding, 0.1960\nphar..."
1,E:\hercules\hercules-challenge-publications\da...,Carbon and Nitrogen Isotopic Survey of Norther...,The development of isotopic baselines for comp...,Introduction Stable isotope analysis is an imp...,Szpak Paul|White Christine D.|Longstaffe Fred ...,Influence of diet on the distribution of carbo...,Biology|Ecology|Biogeochemistry|Paleontology|P...,Introduction Stable isotope analysis is an imp...,74402,"specialty, 0.2109\nstatistics, 0.2070\narea st..."
2,E:\hercules\hercules-challenge-publications\da...,The effect of ‘Candidatus Liberibacter asiatic...,BackgroundHuanglongbing (HLB) is a highly dest...,Background Citrus Huanglongbing (HLB) or citru...,Nwugo Chika C|Lin Hong|Duan Yongping|Civerolo ...,"Huanglongbing: a destructive, newly-emerging, ...",,Background Citrus Huanglongbing (HLB) or citru...,66099,"biological process, 0.2001\nprocess, 0.1957\ns..."
3,E:\hercules\hercules-challenge-publications\da...,Emissions of CH4 and N2O under Different Tilla...,Understanding greenhouse gases (GHG) emissions...,Introduction With the current rise in global t...,Zhang Hai-Lin|Bai Xiao-Lin|Xue Jian-Fu|Chen Zh...,Simulation of fluxes of greenhouse gases from ...,Agriculture|Agricultural Biotechnology|Agricul...,Introduction With the current rise in global t...,29852,"physics, 0.1876\narea studies, 0.1858\nregiona..."
4,E:\hercules\hercules-challenge-publications\da...,"Physiological, Biochemical, and Molecular Mech...",High temperature (HT) stress is a major enviro...,1. Introduction Among the ever-changing compon...,Hasanuzzaman Mirza|Nahar Kamrun|Alam Md. Mahab...,Climate change 2007–The physical science basis...,,1. Introduction Among the ever-changing compon...,74675,"biological process, 0.2014\nchemistry, 0.1951\..."
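
The formatting of the new column can be checked in isolation. In this sketch the `Topic` namedtuple is a hypothetical stand-in that mirrors the fields visible in the pipeline output, and the descriptions are left empty for brevity:

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the fields shown in the pipeline output
Topic = namedtuple('Topic', ['label', 'qid', 'desc', 'score', 't_type'])


def format_topics(topics):
    """Join topics as 'label, score' lines, with scores shown to 4 decimals."""
    return '\n'.join(f"{topic.label}, {topic.score:.4f}" for topic in topics)


sample_topics = [
    Topic('chemistry', 'Q2329', '', 0.20094382706652458, 'ner'),
    Topic('breastfeeding', 'Q174876', '', 0.19604930937175108, 'ner'),
]
formatted = format_topics(sample_topics)
print(formatted)
```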


In [15]:
results_df = pmc_df[['id', 'title', NEW_COL_NAME]]
results_df.head()

Unnamed: 0,id,title,topics_from_ner
0,E:\hercules\hercules-challenge-publications\da...,Induced Release of a Plant-Defense Volatile ‘D...,"chemistry, 0.2009\nbreastfeeding, 0.1960\nphar..."
1,E:\hercules\hercules-challenge-publications\da...,Carbon and Nitrogen Isotopic Survey of Norther...,"specialty, 0.2109\nstatistics, 0.2070\narea st..."
2,E:\hercules\hercules-challenge-publications\da...,The effect of ‘Candidatus Liberibacter asiatic...,"biological process, 0.2001\nprocess, 0.1957\ns..."
3,E:\hercules\hercules-challenge-publications\da...,Emissions of CH4 and N2O under Different Tilla...,"physics, 0.1876\narea studies, 0.1858\nregiona..."
4,E:\hercules\hercules-challenge-publications\da...,"Physiological, Biochemical, and Molecular Mech...","biological process, 0.2014\nchemistry, 0.1951\..."


In [16]:
OUTPUT_FILE_NAME = "pmc_df_with_ner_topics.csv"

results_df.to_csv(os.path.join(NOTEBOOK_6_RESULTS_DIR, OUTPUT_FILE_NAME), index=False)
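
When the CSV file is read back, each cell of the topics column is a multi-line string. A minimal sketch of how it could be parsed into `(label, score)` pairs is shown here. The `parse_topics` helper and the sample row are hypothetical, and an in-memory buffer is used instead of the real file:

```python
import csv
import io


def parse_topics(cell):
    """Turn a 'label, score' multi-line cell into (label, score) tuples."""
    topics = []
    for line in cell.splitlines():
        # Split on the last ', ' so labels containing commas stay intact
        label, score = line.rsplit(', ', 1)
        topics.append((label, float(score)))
    return topics


# In-memory stand-in for the CSV file (hypothetical id and title)
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['id', 'title', 'topics_from_ner'])
writer.writerow(['article-1', 'Example title',
                 'chemistry, 0.2009\nphysics, 0.1876'])

buffer.seek(0)
rows = list(csv.DictReader(buffer))
parsed = parse_topics(rows[0]['topics_from_ner'])
print(parsed)
```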

In [17]:
from herc_common.utils import save_object

PIPE_OUTPUT_FILE_NAME = "topic_extraction_from_ner_pipe.pkl"

save_object(topic_pipe, os.path.join(NOTEBOOK_6_RESULTS_DIR, PIPE_OUTPUT_FILE_NAME))
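
The saved pipeline can later be restored with `load_object`, in the same way the NER system was loaded at the beginning of this notebook. As an illustration of the round trip, the sketch below assumes that `save_object` and `load_object` behave like thin wrappers around `pickle`, which this notebook does not confirm. The two helper functions and the toy object are hypothetical:

```python
import os
import pickle
import tempfile


def save_with_pickle(obj, path):
    """Hypothetical stand-in for save_object: serialize obj to path."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)


def load_with_pickle(path):
    """Hypothetical stand-in for load_object: deserialize the object at path."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Round trip with a plain dictionary instead of the real pipeline
toy_pipe_config = {'steps': ['ner', 'entity_linker', 'topic_extractor']}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipe.pkl')
    save_with_pickle(toy_pipe_config, path)
    restored = load_with_pickle(path)
print(restored == toy_pipe_config)
```

Objects serialized this way can only be restored in an environment where the classes they reference are importable.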