# 5. Topic extraction from NER
In this notebook we will select topics from the entities recognized in each text. The process is as follows:
* The Named Entity Recognizer object created in notebook _4_Named_Entity_Recognition_ will be loaded.
* The list of entities for a given text will be retrieved.
* After we have the list of entities, they will be linked to Wikidata.
* From the list of linked entities, we will create a graph with the Wikidata entities obtained by expanding some of their properties.
* Once the graph has been obtained, we will apply centrality algorithms to select the most representative entities from it, which will serve as topics for the text.

## Setup

In [1]:
%run __init__.py

logger.setLevel(logging.INFO)

In [2]:
from bokeh.io import output_notebook

output_notebook()

## Loading the data
We will start by loading the Git repositories dataframe stored in the results directory of notebook 1. After that, we will select the last repository to demonstrate the main data workflow:

In [3]:
import pandas as pd

GIT_FILE_PATH = os.path.join(NOTEBOOK_1_RESULTS_DIR, 'git_dataframe.pkl')

git_df = pd.read_pickle(GIT_FILE_PATH)
git_repositories = git_df['full_text_cleaned'].values

In [4]:
git_df.head(n=25)

Unnamed: 0,gh_id,name,description,owner_name,languages,readme_text,issues_text,commits_text,filenames,comments_text,full_text,full_text_cleaned,num_chars_text
0,216602979,LIRICAL,LIkelihood Ratio Interpretation of Clinical Ab...,cmungall,"{'Java': 492423, 'FreeMarker': 13149, 'Python'...",LIRICAL. LIkelihood Ratio Interpretation of C...,,Merge pull request #442 from TheJacksonLaborat...,\nCHANGELOG\nREADME\nhoxc13 output\nlirical to...,note that the Jannovar dependency does not nee...,LIkelihood Ratio Interpretation of Clinical Ab...,LIkelihood Ratio Interpretation of Clinical Ab...,3770
1,199330464,wikidata_ontomatcher,Matches ontology classes against wikidata,cmungall,"{'Prolog': 14691, 'Makefile': 1472, 'Dockerfil...",Match an ontology to Wikidata. This applicatio...,Will help with #1 and with https://github.com/...,Adding skos:altLabel\n\nhttps://github.com/cmu...,\nREADME\ninstall\npack\nwikidata ontomatcher\...,,Matches ontology classes against wikidata. Mat...,Matches ontology classes against wikidata. Mat...,519
2,253207181,ro-crate-ruby,"A Ruby gem for creating, manipulating and read...",markwilkinson,"{'Ruby': 52724, 'HTML': 1319}","ro-crate-ruby. This is a WIP gem for creating,...",,Update LICENSE\nBump version\nTidy up and chec...,\n travis\nGemfile\nREADME\nROCrate\nContact P...,*\n * Expands the tree to the target element a...,"A Ruby gem for creating, manipulating and read...","A Ruby gem for creating, manipulating and read...",2559
3,212556220,Misc_Training_scripts,A place for me to keep various miscellanelous ...,markwilkinson,"{'Shell': 15815, 'Ruby': 9445}",Misc_Training_scripts. A place for me to keep ...,,added new cool 3-federated query\nfinished edi...,README\nSpecies Abundance Pub2015\nSpecies Inf...,,A place for me to keep various miscellanelous ...,A place for me to keep various miscellanelous ...,545
4,155879756,FAIRifier,A tool to make data FAIR,mikel-egana-aranguren,"{'Java': 3514431, 'JavaScript': 967765, 'HTML'...",Dependencies: Java 8. Apache Ant. Building. in...,,Merge pull request #16 from Shamanou/developme...,\norg eclipse core resources\norg eclipse jdt ...,*\n * Main class for Refine server application...,A tool to make data FAIR. Dependencies: Java 8...,A tool to make data FAIR. Dependencies: Java 8...,57859
5,90349931,elda,Epimorphics implementation of the Linked Data API,mikel-egana-aranguren,"{'Java': 1892893, 'JavaScript': 1757647, 'XSLT...","Elda, an implementation of the Linked Data API...",,Proper reference Config\nConfiguracion ELDA de...,\nCONTRIBUTING\nLICENCE\nREADME demo\nREADME\n...,Everything that's part of the resource set is ...,Epimorphics implementation of the Linked Data ...,Epimorphics implementation of the Linked Data ...,15907
6,126633812,music-genre-classification,Recognizing the genre of music files using mac...,HareeshBahuleyan,"{'Jupyter Notebook': 7532041, 'Python': 8296}",Music Genre Classification. \n Overview. Reco...,,Update LICENSE\nUpdate README.md\nUpdate READM...,1 audio retrieval\n2 plot spectrogram\n3 1 vgg...,,Recognizing the genre of music files using mac...,Recognizing the genre of music files using mac...,3078
7,173520377,probabilistic_nlg,Tensorflow Implementation of Stochastic Wasser...,HareeshBahuleyan,{'Python': 303839},Stochastic Wasserstein Autoencoder for Probabi...,Bumps [tensorflow-gpu](https://github.com/tens...,Update LICENSE\nUpdate requirements.txt\nUpdat...,README\n init \ndf movie test\ndf movie trai...,,Tensorflow Implementation of Stochastic Wasser...,Tensorflow Implementation of Stochastic Wasser...,3725
8,103798851,DataStructures-Algorithms-InC,Programs of Data Structures and Algorithms in ...,gauravtheP,"{'C': 117644, 'Makefile': 54504, 'C++': 9409, ...",,,Minor modification is done in chainingInHashin...,dep\n01Knapsack Problem\nFloyd Warshall Algor...,Time Complexity: O(nlogn)\nTime Complexity\n W...,Programs of Data Structures and Algorithms in ...,Programs of Data Structures and Algorithms in ...,1330
9,153249816,Music-Generation-Using-Deep-Learning,A Deep Learning Case Study to Generate Music S...,gauravtheP,{'Jupyter Notebook': 52835},Music Generation Using Deep-Learning. Check ou...,Does this model also include chord generation?...,Blog Link Updated\nUpdate README.md\nUpdate RE...,Generate Music\nMusic Generation Train1\nMusic...,,A Deep Learning Case Study to Generate Music S...,A Deep Learning Case Study to Generate Music S...,3922


In [5]:
text = git_repositories[-1]

## Loading the NER model
The named entity recognition model created in notebook 4 will now be loaded and used to obtain the entities of the selected text:

In [6]:
import en_core_sci_lg

from herc_common.utils import load_object
from collections import Counter

ner = load_object(os.path.join(NOTEBOOK_4_RESULTS_DIR, 'ner_system.pkl'))

In [7]:
nlp = en_core_sci_lg.load()
entities = ner.transform([text])
entities[0][:10]

['Repository',
 'scripts',
 'basketball',
 'Basketball Analytics',
 'repository',
 'scripts',
 'statistics',
 'NBA',
 'basketball',
 'code']

## Entity linking
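Now we will obtain the Wikidata URI of each entity. The first approach links the texts to DBpedia with the DBPediaEntityLinker class and then maps the resulting DBpedia entities to Wikidata with the DBPedia2WikidataMapper class: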

In [None]:
from herc_common.entity_linking import DBPediaEntityLinker

dbpedia_linker = DBPediaEntityLinker()
dbpedia_linked_entities = dbpedia_linker.fit_transform(git_repositories)
dbpedia_linked_entities[0][:5]

In [None]:
from herc_common.entity_linking import DBPedia2WikidataMapper

dbpedia2wdmapper = DBPedia2WikidataMapper()
linked_entities = dbpedia2wdmapper.fit_transform(dbpedia_linked_entities)
linked_entities[0][:5]

### Alternative: Linking with Wikidata

In [8]:
from herc_common.entity_linking import WikidataEntityLinker

linker = WikidataEntityLinker()
linked_entities = linker.fit_transform(entities)
linked_entities[0][:5]

[('Repository', 'http://www.wikidata.org/entity/Q3133368'),
 ('scripts', 'http://www.wikidata.org/entity/Q187432'),
 ('basketball', 'http://www.wikidata.org/entity/Q5372'),
 ('Basketball Analytics', None),
 ('repository', 'http://www.wikidata.org/entity/Q3133368')]
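
The output above shows two things worth handling before any graph work: some mentions are not linked at all (`None`), and different mentions can resolve to the same entity. The following is a minimal sketch of one way to reduce the linker output to a list of unique Wikidata identifiers. The helper name `qids_from_links` is hypothetical and is not part of `herc_common`; whether the graph builder performs a similar step internally is not shown in this notebook.

```python
def qids_from_links(linked_entities):
    """Return the unique Wikidata QIDs of the mentions that were linked.

    `linked_entities` is a list of (mention, uri_or_None) pairs, in the
    same shape as the linker output printed above.
    """
    qids = []
    for _mention, uri in linked_entities:
        if uri is None:  # the linker found no match for this mention
            continue
        # 'http://www.wikidata.org/entity/Q5372' -> 'Q5372'
        qid = uri.rsplit("/", 1)[-1]
        if qid not in qids:  # keep the first occurrence, preserve order
            qids.append(qid)
    return qids


sample = [
    ("Repository", "http://www.wikidata.org/entity/Q3133368"),
    ("scripts", "http://www.wikidata.org/entity/Q187432"),
    ("basketball", "http://www.wikidata.org/entity/Q5372"),
    ("Basketball Analytics", None),
    ("repository", "http://www.wikidata.org/entity/Q3133368"),
]
print(qids_from_links(sample))  # ['Q3133368', 'Q187432', 'Q5372']
```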

## Building the graph
After each entity has been linked to Wikidata, we will begin exploring their neighbourhood in the knowledge graph to obtain a list of candidates for our final topics:

In [9]:
from herc_common.graph import WikidataGraphBuilder

graph_builder = WikidataGraphBuilder(max_hops=2)
entity_graph = graph_builder.build_graph(linked_entities[0])

INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.
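
To make the role of `max_hops` more concrete, here is a small sketch of a hop-limited breadth-first expansion. It is only an illustration: the neighbourhood is a hypothetical in-memory dictionary (`TOY_NEIGHBOURS`) rather than Wikidata, and the function `expand` is an assumption about the general technique, not the actual implementation of `WikidataGraphBuilder`.

```python
from collections import deque

# Hypothetical toy neighbourhood; a real builder would query Wikidata instead.
TOY_NEIGHBOURS = {
    "basketball": ["team sport", "ball game"],
    "team sport": ["sport"],
    "ball game": ["sport"],
    "sport": ["activity"],
    "statistics": ["mathematics"],
    "mathematics": ["formal science"],
}


def expand(seeds, neighbours, max_hops=2):
    """Breadth-first expansion that stays within max_hops edges of a seed."""
    edges = set()
    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue  # nodes at the hop limit are kept but not expanded
        for other in neighbours.get(node, []):
            edges.add((node, other))
            if other not in depth:
                depth[other] = depth[node] + 1
                queue.append(other)
    return set(depth), edges


nodes, edges = expand(["basketball", "statistics"], TOY_NEIGHBOURS, max_hops=2)
print(sorted(nodes))
print(len(edges))
```

With `max_hops=2`, "sport" is reached from "basketball" but "activity" (three hops away) is not. Note that the two seeds end up in separate clusters, so a graph built this way is not necessarily connected.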


In [10]:
from bokeh.io import show
from bokeh.layouts import gridplot

from herc_common.bokeh_utils import build_graph_plot

plot = build_graph_plot(entity_graph, "Linked entities graph")
show(plot)

Since the graph from above is not completely connected, we will be obtaining the largest connected subgraph:

In [11]:
from herc_common.graph import get_largest_connected_subgraph

connected_entity_subgraph = get_largest_connected_subgraph(entity_graph)

plot = build_graph_plot(connected_entity_subgraph, "Linked entities graph")
show(plot)
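
As an illustration of what selecting the largest connected subgraph involves, the sketch below finds the biggest connected component of an undirected graph stored as an adjacency dictionary. The function `largest_component` and the toy graph are hypothetical and are not the code of `get_largest_connected_subgraph`; for an undirected networkx graph `g`, a comparable result can be obtained with `g.subgraph(max(nx.connected_components(g), key=len))`.

```python
def largest_component(adjacency):
    """Return the node set of the largest connected component.

    `adjacency` maps each node to the set of its neighbours and is
    assumed to describe an undirected graph.
    """
    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from `start`.
        component = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for other in adjacency[node]:
                if other not in component:
                    component.add(other)
                    stack.append(other)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


toy = {
    "basketball": {"team sport", "ball game"},
    "team sport": {"basketball", "sport"},
    "ball game": {"basketball", "sport"},
    "sport": {"team sport", "ball game"},
    "statistics": {"mathematics"},
    "mathematics": {"statistics"},
}
print(sorted(largest_component(toy)))
# ['ball game', 'basketball', 'sport', 'team sport']
```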

Now that we have the Wikidata graph obtained from the initial list of entities in the text, we will try several centrality algorithms to obtain the top 9 entities that represent the text. These entities can be seen as potential topics for the repository:

In [12]:
import networkx.algorithms as nxa

from herc_common.graph import get_centrality_algorithm_results

def try_centrality_algorithms(g, algorithms, stop_uris, top_n=9):
    for (algorithm, name) in algorithms:
        print(f'Algorithm: {name}')
        result = get_centrality_algorithm_results(g, algorithm, stop_uris, top_n)
        print("Topics:", [(t[0]['label'], t[1]) for t in result])
        print()
        
algorithms = [
    (nxa.centrality.information_centrality, "Information centrality"),
    (nxa.centrality.eigenvector_centrality_numpy, "Eigenvector centrality"),
    (nxa.centrality.closeness_centrality, "Closeness centrality"),
    (nxa.centrality.betweenness_centrality, "Betweenness centrality"),
    (nxa.centrality.load_centrality, "Load centrality")
]

stop_uris = ['Q4167836', 'Q11862829', 'Q13442814',
             'Q17339814', 'Q24017414', 'Q4671286',
             'Q47154513', 'Q545779', 'Q8789864', 'Q55308649']
try_centrality_algorithms(connected_entity_subgraph,
                          algorithms,
                          stop_uris)

Algorithm: Information centrality
Topics: [('mathematics', 0.00029504273691780664), ('statistics', 0.00029370409507192857), ('mathematical analysis', 0.00029203243687262305), ('science', 0.00028848640355553395), ('visualization', 0.00028751878513157256), ('software', 0.00028435587999180905), ('creative work', 0.00028285134836425387), ('economics', 0.0002815604179522273), ('formal science', 0.0002815394181994335)]

Algorithm: Eigenvector centrality
Topics: [('protein', 0.5410315439750055), ('Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes', 0.3819924753686273), ('hexose transporter', 0.11729962054265053), ('major facilitator superfamily domain-containing protein, putative', 0.10844675646591097), ('major facilitator superfamily domain-containing protein, putative', 0.10844675646591093), ('pantothenate transporter', 0.10844675646591091), ('monocarboxylate transporter, putative', 0.10844675646591088), ('membrane proteins', 0.10789422464
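
To show how a centrality score turns into a ranked list of topics, the following sketch computes closeness centrality by hand on a tiny connected graph and keeps the best-scoring nodes that are not in a stop list. It is a simplified illustration under stated assumptions: the graph is undirected, unweighted and connected, the stop list is applied only when ranking (stop nodes still take part in the distance computation), and the names `closeness` and `top_topics` are hypothetical rather than part of `herc_common`.

```python
from collections import deque


def closeness(adjacency, source):
    """Closeness of `source`: (reachable nodes) / (sum of shortest distances)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search gives shortest hop counts
        node = queue.popleft()
        for other in adjacency[node]:
            if other not in dist:
                dist[other] = dist[node] + 1
                queue.append(other)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


def top_topics(adjacency, stop_nodes, top_n=3):
    """Rank the nodes that are not stop nodes by closeness, best first."""
    scores = {n: closeness(adjacency, n) for n in adjacency if n not in stop_nodes}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


toy = {
    "sport": {"basketball", "football", "entity"},
    "basketball": {"sport", "NBA"},
    "football": {"sport"},
    "NBA": {"basketball"},
    "entity": {"sport"},
}
result = top_topics(toy, stop_nodes={"entity"}, top_n=2)
print(result)  # 'sport' (0.8) ranks first, followed by 'basketball' (about 0.667)
```

Generic nodes such as "entity" tend to sit close to everything in a knowledge graph, which is the motivation for excluding them through a stop list.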

## Setting up the pipeline
Now that we have seen the main data flow, we will build the final pipeline. This pipeline will receive a list of texts, and return 7 potential topics for each text by executing the steps described above:

In [14]:
from sklearn.pipeline import Pipeline

from herc_common.topic import TopicLabeller

topic_extractor = TopicLabeller(graph_builder, nxa.centrality.closeness_centrality,
                                num_labels_per_topic=7, stop_uris=stop_uris)
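# The pipeline links the raw texts to DBpedia and then maps them to Wikidata.
# Alternative (not used here): ('ner', ner) followed by ('entity_linker', linker).
topic_pipe = Pipeline([('entity_linker', dbpedia_linker),
                       ('mapper', dbpedia2wdmapper),
                       ('topic_extractor', topic_extractor)])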

### Obtaining the topics
Before finishing this notebook, we will obtain the list of inferred topics for each of the repositories in the Git dataset. To do so, we just have to call the _fit_transform_ method of our pipeline:

In [15]:
results = topic_pipe.fit_transform(git_repositories)
results[:5]

INFO:herc_common.graph:Started building graph.
INFO:herc_common.graph:Finished building graph.

[[computer science (Q21198),
  software (Q7397),
  artificial intelligence (Q11660),
  interaction science (Q97008347),
  engineering (Q11023),
  automation (Q184199),
  statistics (Q12483)],
 [Wikidata (Q2013),
  online database (Q7094076),
  knowledge base (Q593744),
  semantic wiki (Q638153),
  knowledge graph (Q33002955),
  Linked Open Data cloud diagram (Q43984865),
  Semantic Web (Q54837)],
 [information (Q11028),
  abstract object (Q7184903),
  advertising (Q37038),
  data (Q42848),
  creative work (Q17537576),
  message (Q628523),
  level (Q1046315)],
 [species (Q7432),
  taxonomic rank (Q427626),
  subgenus (Q3238261),
  rank (Q3100180),
  subseries (Q13198444),
  rank (Q13578154),
  biological classification (Q11398)],
 [software (Q7397),
  interaction science (Q97008347),
  social science (Q34749),
  economics (Q8134),
  philosophy (Q5891),
  concept (Q151885),
  science (Q336)]]

### Saving the results
Now we will merge the results into our Git repositories dataframe and save them to a CSV file. This file will contain the id and name of each repository, together with the topics inferred by the system:

In [16]:
NEW_COL_NAME = 'topics_from_ner'

git_df[NEW_COL_NAME] = ['\n'.join([f"{str(topic)}, {topic.score:.4f}" for topic in result])
                        for result in results]
git_df.head()

Unnamed: 0,gh_id,name,description,owner_name,languages,readme_text,issues_text,commits_text,filenames,comments_text,full_text,full_text_cleaned,num_chars_text,topics_from_ner
0,216602979,LIRICAL,LIkelihood Ratio Interpretation of Clinical Ab...,cmungall,"{'Java': 492423, 'FreeMarker': 13149, 'Python'...",LIRICAL. LIkelihood Ratio Interpretation of C...,,Merge pull request #442 from TheJacksonLaborat...,\nCHANGELOG\nREADME\nhoxc13 output\nlirical to...,note that the Jannovar dependency does not nee...,LIkelihood Ratio Interpretation of Clinical Ab...,LIkelihood Ratio Interpretation of Clinical Ab...,3770,"computer science, 0.2190\nsoftware, 0.2179\nar..."
1,199330464,wikidata_ontomatcher,Matches ontology classes against wikidata,cmungall,"{'Prolog': 14691, 'Makefile': 1472, 'Dockerfil...",Match an ontology to Wikidata. This applicatio...,Will help with #1 and with https://github.com/...,Adding skos:altLabel\n\nhttps://github.com/cmu...,\nREADME\ninstall\npack\nwikidata ontomatcher\...,,Matches ontology classes against wikidata. Mat...,Matches ontology classes against wikidata. Mat...,519,"Wikidata, 0.4082\nonline database, 0.3376\nkno..."
2,253207181,ro-crate-ruby,"A Ruby gem for creating, manipulating and read...",markwilkinson,"{'Ruby': 52724, 'HTML': 1319}","ro-crate-ruby. This is a WIP gem for creating,...",,Update LICENSE\nBump version\nTidy up and chec...,\n travis\nGemfile\nREADME\nROCrate\nContact P...,*\n * Expands the tree to the target element a...,"A Ruby gem for creating, manipulating and read...","A Ruby gem for creating, manipulating and read...",2559,"information, 0.2009\nabstract object, 0.1960\n..."
3,212556220,Misc_Training_scripts,A place for me to keep various miscellanelous ...,markwilkinson,"{'Shell': 15815, 'Ruby': 9445}",Misc_Training_scripts. A place for me to keep ...,,added new cool 3-federated query\nfinished edi...,README\nSpecies Abundance Pub2015\nSpecies Inf...,,A place for me to keep various miscellanelous ...,A place for me to keep various miscellanelous ...,545,"species, 0.6176\ntaxonomic rank, 0.5250\nsubge..."
4,155879756,FAIRifier,A tool to make data FAIR,mikel-egana-aranguren,"{'Java': 3514431, 'JavaScript': 967765, 'HTML'...",Dependencies: Java 8. Apache Ant. Building. in...,,Merge pull request #16 from Shamanou/developme...,\norg eclipse core resources\norg eclipse jdt ...,*\n * Main class for Refine server application...,A tool to make data FAIR. Dependencies: Java 8...,A tool to make data FAIR. Dependencies: Java 8...,57859,"software, 0.2076\ninteraction science, 0.2034\..."


In [17]:
results_df = git_df[['gh_id', 'name', NEW_COL_NAME]]
results_df.head()

Unnamed: 0,gh_id,name,topics_from_ner
0,216602979,LIRICAL,"computer science, 0.2190\nsoftware, 0.2179\nar..."
1,199330464,wikidata_ontomatcher,"Wikidata, 0.4082\nonline database, 0.3376\nkno..."
2,253207181,ro-crate-ruby,"information, 0.2009\nabstract object, 0.1960\n..."
3,212556220,Misc_Training_scripts,"species, 0.6176\ntaxonomic rank, 0.5250\nsubge..."
4,155879756,FAIRifier,"software, 0.2076\ninteraction science, 0.2034\..."


In [18]:
OUTPUT_FILE_NAME = "git_df_with_ner_topics.csv"

results_df.to_csv(os.path.join(NOTEBOOK_5_RESULTS_DIR, OUTPUT_FILE_NAME), index=False)
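
Since each cell of the `topics_from_ner` column stores one `label, score` pair per line, a consumer of the CSV file has to split it again. The sketch below shows one way to do that; `parse_topics` is a hypothetical helper, and the third line of the sample cell is made up to illustrate a label that itself contains a comma, which is why the split is done from the right.

```python
def parse_topics(cell):
    """Split a newline-separated 'label, score' string into (label, float) pairs."""
    topics = []
    for line in cell.split("\n"):
        if not line.strip():
            continue  # ignore empty lines
        # Split on the last ', ' only, because labels may contain commas.
        label, score = line.rsplit(", ", 1)
        topics.append((label, float(score)))
    return topics


cell = "Wikidata, 0.4082\nonline database, 0.3376\ntransporter, putative, 0.1084"
print(parse_topics(cell))
# [('Wikidata', 0.4082), ('online database', 0.3376), ('transporter, putative', 0.1084)]
```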

The pipeline will now be saved for later use in the final system:

In [19]:
from herc_common.utils import save_object

PIPE_OUTPUT_FILE_NAME = "topic_extraction_from_ner_pipe.pkl"

save_object(topic_pipe, os.path.join(NOTEBOOK_5_RESULTS_DIR, PIPE_OUTPUT_FILE_NAME))