<a href="https://colab.research.google.com/github/cbadenes/phd-thesis/blob/master/notebooks/soa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook supports the state-of-the-art content of the thesis: *Semantically-enabled Browsing of Large Multilingual Document Collections, Badenes-Olmedo, C. 2021*

# 2.- Techniques for Document Retrieval

The analysis of human-readable documents is a well-known problem in Artificial Intelligence (AI) in general, and in the Information Retrieval (IR) and Natural Language Processing (NLP) fields in particular. As an academic field of study, information retrieval can be defined as finding documents of an unstructured nature, usually text, that satisfy an information need from within large collections (Manning et al., 2008). Defined in this way, hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email. Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style searching, where identifiers are needed to obtain results.

There are two major categories of IR technology and research: semantic and statistical. Semantic approaches attempt to implement some degree of syntactic and semantic analysis. They try to reproduce to some degree the understanding of the natural language text that a human user would provide. In statistical approaches, the documents that are retrieved or that are highly ranked are those that match the query most closely in terms of some statistical measure. The work presented in this thesis follows this second approach.

## 2.1.- Load Corpus

An illustrative example may help to better understand IR techniques, so the publications listed in Section 1.1 are used as a sample collection for applying each of them.

In [1]:
import requests
import json
import pandas as pd

#increase the max column length
pd.set_option('display.max_colwidth', 200)

corpus_df = pd.read_csv('https://www.dropbox.com/s/pag5jseq2e9wcvb/corpus.csv?raw=1',usecols=['title','text'])
corpus_df

Unnamed: 0,title,text
0,Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation,Synopsis of the Refinements and Extensions Compared to the Publication in the Conference Proceedings This submission is a refined and extended paper based on the ICTERI 2017 PhD Symposium paper...
1,Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph,"Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph Ahmet Soylu1, Oscar Corcho2, Brian Elvesæter1, Carlos Badenes-Olmedo2, Francisc..."
2,Drugs4Covid: Making drug information available from scientific publications,"Drugs4Covid: Making drug information available from scientific publications Carlos Badenes-Olmedo1, David Chaves-Fraga1, Mar´ıa Poveda-Villal´on1, Ana Iglesias-Molina1, Pablo Calleja1, Socorro Ber..."
3,Distributing Text Mining tasks with librAIry,"Distributing Text Mining tasks with librAIry Carlos Badenes-Olmedo cbadenes@f.upm.es Universidad Polit´ecnica de Madrid Ontology Engineering Group Boadilla del Monte, Spain Jos´e Luis Redondo-Garc..."
4,Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms,"Semantic Web 0 (0) 1 1 IOS Press Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms Editor(s): Tomi Kauppinen, Aalto University, Finland; Daniel Garijo,..."
5,An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts,"An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Carlos Badenes-Olmedo1, Jos´e Luis Redondo-Garc´ıa2, and Oscar Corcho1 1 Universi..."
6,Efficient Clustering from Distributions over Topics,"Efficient Clustering from Distributions over Topics Carlos Badenes-Olmedo cbadenes@￿.upm.es Ontology Engineering Group Universidad Polit´ecnica de Madrid Boadilla del Monte, Spain Jos´e Luis Redon..."
7,Legal Documents Retrieval Across Languages: Topic Hierarchies based on synsets,Cross-lingual annotations of legislative texts enable us to explore major themes covered in multi- lingual legal data and are a key facilitator of semantic similarity when searching for similar do...
8,Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies,"Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies Carlos Badenes-Olmedo cbadenes@fi.upm.es Ontology Engineering Group, Universidad Politécnica de Madrid Boad..."
9,Potentially inappropriate medications in older adults living with HIV,"Potentially inappropriate medications in older adults living with HIV B L�opez-Centeno,1,* C Badenes-Olmedo,2 A Mataix-Sanjuan,1 JM Bell�on,3 L P�erez-Latorre,3 JC L�opez,3 J Bened�ı,4,* S Khoo,5 ..."


## 2.2. Text Pre-Processing

Documents must be pre-processed to transform their texts into terms. These terms are the population that is counted and measured statistically. Most commonly, the terms are words (or combinations of adjacent words or characters) that occur in a given query or collection of documents, and they usually need to be normalized first.

### 2.2.1. Methods to transform texts into terms

Words are reduced to a common base form either by stemming, a heuristic process that strips affixes, or by lemmatization, which returns the dictionary form of the word, its lemma (Porter, 1997). The objective is to eliminate the variation that arises from different grammatical forms of the same word, e.g., "program", "programming", "programs", and "programmed" should all be recognized as forms of the same word, "program".

Another common form of pre-processing is the elimination of common words that have little power to discriminate relevant from non-relevant documents, e.g., "the", "a", "it". Hence, IR engines are usually provided with a stop-list of such noise words. Note that both stemming/lemmatization and stop-word lists are language-dependent.
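The two steps described above can be illustrated with a deliberately naive sketch. The suffix list, the stop-list and the `naive_stem` helper are hypothetical and only for illustration: this is not the Porter algorithm, and it is not the spaCy-based pipeline used in the rest of this notebook.

```python
# Toy illustration only: a naive suffix stripper, NOT the Porter algorithm.
STOP_WORDS = {"the", "a", "it"}   # tiny hypothetical stop-list
SUFFIXES = ("ing", "ed", "s")     # hypothetical list of suffixes to strip

def naive_stem(word):
    stem = word.lower()
    for suffix in SUFFIXES:
        # Strip at most one suffix, keeping at least three characters
        if stem.endswith(suffix) and len(stem) - len(suffix) >= 3:
            stem = stem[: -len(suffix)]
            break
    # Collapse a doubled final consonant ("programm" -> "program")
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
        stem = stem[:-1]
    return stem

def remove_stop_words(words):
    return [w for w in words if w.lower() not in STOP_WORDS]

stems = [naive_stem(w) for w in ["program", "programming", "programs", "programmed"]]
kept = remove_stop_words(["the", "program", "it", "runs"])
print(stems)
print(kept)
```

A real stemmer handles many more cases (irregular plurals, derivational suffixes, exceptions); the point here is only that all four surface forms collapse to one term and that the noise words disappear.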

In [2]:
import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = spacy.load("en_core_web_sm")

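def tokenize(text):
  # Run the spaCy pipeline; the resulting Doc iterates over Token objects
  tokens = nlp(text)
  return tokens

def is_valid(token):
  # Discard single-character tokens and stop words
  return len(token.text) > 1 and not token.is_stop

def lemma(token):
  # Dictionary (base) form of the token
  return token.lemma_

def preprocess(text):
  # Turn raw text into a list of lemmas, skipping invalid tokens
  tokens = []
  for token in tokenize(text):
    if is_valid(token):
      tokens.append(lemma(token))
  return tokens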

print("methods created successfully")

methods created successfully


The following sentence, taken from one of the documents, can be used to see each of the steps: *"Probabilistic Topic Models reduce that feature space by annotating documents with thematic information"*.

In [3]:
tokens = preprocess("Probabilistic Topic Models reduce that feature space by annotating documents with thematic information")
print(tokens)

['Probabilistic', 'Topic', 'Models', 'reduce', 'feature', 'space', 'annotate', 'document', 'thematic', 'information']


At this step, 'annotating' was transformed to 'annotate' and 'documents' was reduced to 'document'. However, 'Models' remains unchanged: because it starts with a capital letter, it is treated as a proper noun. Finally, words that appear in the stop-word list are removed (e.g., 'that', 'by' and 'with'). Each text is thus transformed into a normalized list of terms.

### 2.2.2. Count words

In [15]:
def count_words(text):
  return len(text.split(" "))

corpus_df['#words'] = corpus_df['text'].apply(count_words)
corpus_df.head(3)

Unnamed: 0,title,text,#words,tokens,#tokens,#uni_tokens,boolean,tf,tf_idf
0,Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation,Synopsis of the Refinements and Extensions Compared to the Publication in the Conference Proceedings This submission is a refined and extended paper based on the ICTERI 2017 PhD Symposium paper...,12954,"[synopsis, Refinements, Extensions, compare, publication, Conference, Proceedings, submission, refined, extended, paper, base, ICTERI, 2017, phd, symposium, paper, Kosa, et, al, fact, submission, ...",6495,1688,"{'synopsis': True, 'Refinements': True, 'Extensions': True, 'compare': True, 'publication': True, 'Conference': True, 'Proceedings': True, 'submission': True, 'refined': True, 'extended': True, 'p...","{'synopsis': 1, 'Refinements': 1, 'Extensions': 1, 'compare': 21, 'publication': 3, 'Conference': 1, 'Proceedings': 1, 'submission': 2, 'refined': 2, 'extended': 1, 'paper': 44, 'base': 43, 'ICTER...","{'synopsis': 0.00036919095809058824, 'Refinements': 0.00036919095809058824, 'Extensions': 0.00036919095809058824, 'compare': 0.0003081622441710275, 'publication': 0.00027997034806942977, 'Conferen..."
1,Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph,"Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph Ahmet Soylu1, Oscar Corcho2, Brian Elvesæter1, Carlos Badenes-Olmedo2, Francisc...",5827,"[enhance, Public, Procurement, European, Union, Constructing, exploit, Integrated, Knowledge, Graph, Ahmet, Soylu1, Oscar, Corcho2, Brian, Elvesæter1, Carlos, Badenes, Olmedo2, Francisco, Yedro2, ...",3511,1355,"{'enhance': True, 'Public': True, 'Procurement': True, 'European': True, 'Union': True, 'Constructing': True, 'exploit': True, 'Integrated': True, 'Knowledge': True, 'Graph': True, 'Ahmet': True, ...","{'enhance': 9, 'Public': 9, 'Procurement': 9, 'European': 5, 'Union': 3, 'Constructing': 1, 'exploit': 1, 'Integrated': 8, 'Knowledge': 11, 'Graph': 11, 'Ahmet': 1, 'Soylu1': 1, 'Oscar': 1, 'Corch...","{'enhance': 0.004369903967572152, 'Public': 0.006146698221357259, 'Procurement': 0.006146698221357259, 'European': 0.00144061650765947, 'Union': 0.0011101819858703454, 'Constructing': 0.0006829664..."
2,Drugs4Covid: Making drug information available from scientific publications,"Drugs4Covid: Making drug information available from scientific publications Carlos Badenes-Olmedo1, David Chaves-Fraga1, Mar´ıa Poveda-Villal´on1, Ana Iglesias-Molina1, Pablo Calleja1, Socorro Ber...",5417,"[Drugs4Covid, make, drug, information, available, scientific, publication, Carlos, Badenes, Olmedo1, David, Chaves, Fraga1, Mar´ıa, Poveda, Villal´on1, Ana, Iglesias, Molina1, Pablo, Calleja1, Soc...",3347,1413,"{'Drugs4Covid': True, 'make': True, 'drug': True, 'information': True, 'available': True, 'scientific': True, 'publication': True, 'Carlos': True, 'Badenes': True, 'Olmedo1': True, 'David': True, ...","{'Drugs4Covid': 24, 'make': 2, 'drug': 71, 'information': 11, 'available': 12, 'scientific': 14, 'publication': 9, 'Carlos': 1, 'Badenes': 10, 'Olmedo1': 1, 'David': 1, 'Chaves': 1, 'Fraga1': 1, '...","{'Drugs4Covid': 0.017194349132704182, 'make': 0.00019029204130178342, 'drug': 0.03616286661157102, 'information': 0.0, 'available': 0.00034171561328111716, 'scientific': 0.004231375190767469, 'pub..."


### 2.2.3. Tokenize Corpus

In [16]:
corpus_df['tokens'] = corpus_df['text'].apply(preprocess)

corpus_df.head(3)

Unnamed: 0,title,text,#words,tokens,#tokens,#uni_tokens,boolean,tf,tf_idf
0,Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation,Synopsis of the Refinements and Extensions Compared to the Publication in the Conference Proceedings This submission is a refined and extended paper based on the ICTERI 2017 PhD Symposium paper...,12954,"[synopsis, Refinements, Extensions, compare, publication, Conference, Proceedings, submission, refined, extended, paper, base, ICTERI, 2017, phd, symposium, paper, Kosa, et, al, fact, submission, ...",6495,1688,"{'synopsis': True, 'Refinements': True, 'Extensions': True, 'compare': True, 'publication': True, 'Conference': True, 'Proceedings': True, 'submission': True, 'refined': True, 'extended': True, 'p...","{'synopsis': 1, 'Refinements': 1, 'Extensions': 1, 'compare': 21, 'publication': 3, 'Conference': 1, 'Proceedings': 1, 'submission': 2, 'refined': 2, 'extended': 1, 'paper': 44, 'base': 43, 'ICTER...","{'synopsis': 0.00036919095809058824, 'Refinements': 0.00036919095809058824, 'Extensions': 0.00036919095809058824, 'compare': 0.0003081622441710275, 'publication': 0.00027997034806942977, 'Conferen..."
1,Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph,"Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph Ahmet Soylu1, Oscar Corcho2, Brian Elvesæter1, Carlos Badenes-Olmedo2, Francisc...",5827,"[enhance, Public, Procurement, European, Union, Constructing, exploit, Integrated, Knowledge, Graph, Ahmet, Soylu1, Oscar, Corcho2, Brian, Elvesæter1, Carlos, Badenes, Olmedo2, Francisco, Yedro2, ...",3511,1355,"{'enhance': True, 'Public': True, 'Procurement': True, 'European': True, 'Union': True, 'Constructing': True, 'exploit': True, 'Integrated': True, 'Knowledge': True, 'Graph': True, 'Ahmet': True, ...","{'enhance': 9, 'Public': 9, 'Procurement': 9, 'European': 5, 'Union': 3, 'Constructing': 1, 'exploit': 1, 'Integrated': 8, 'Knowledge': 11, 'Graph': 11, 'Ahmet': 1, 'Soylu1': 1, 'Oscar': 1, 'Corch...","{'enhance': 0.004369903967572152, 'Public': 0.006146698221357259, 'Procurement': 0.006146698221357259, 'European': 0.00144061650765947, 'Union': 0.0011101819858703454, 'Constructing': 0.0006829664..."
2,Drugs4Covid: Making drug information available from scientific publications,"Drugs4Covid: Making drug information available from scientific publications Carlos Badenes-Olmedo1, David Chaves-Fraga1, Mar´ıa Poveda-Villal´on1, Ana Iglesias-Molina1, Pablo Calleja1, Socorro Ber...",5417,"[Drugs4Covid, make, drug, information, available, scientific, publication, Carlos, Badenes, Olmedo1, David, Chaves, Fraga1, Mar´ıa, Poveda, Villal´on1, Ana, Iglesias, Molina1, Pablo, Calleja1, Soc...",3347,1413,"{'Drugs4Covid': True, 'make': True, 'drug': True, 'information': True, 'available': True, 'scientific': True, 'publication': True, 'Carlos': True, 'Badenes': True, 'Olmedo1': True, 'David': True, ...","{'Drugs4Covid': 24, 'make': 2, 'drug': 71, 'information': 11, 'available': 12, 'scientific': 14, 'publication': 9, 'Carlos': 1, 'Badenes': 10, 'Olmedo1': 1, 'David': 1, 'Chaves': 1, 'Fraga1': 1, '...","{'Drugs4Covid': 0.017194349132704182, 'make': 0.00019029204130178342, 'drug': 0.03616286661157102, 'information': 0.0, 'available': 0.00034171561328111716, 'scientific': 0.004231375190767469, 'pub..."


### 2.2.4. Count tokens

In [6]:
def count_tokens(tokens):
  return len(tokens)

corpus_df['#tokens'] = corpus_df['tokens'].apply(count_tokens)

corpus_df.head(3)

Unnamed: 0,title,text,#words,tokens,#tokens
0,Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation,Synopsis of the Refinements and Extensions Compared to the Publication in the Conference Proceedings This submission is a refined and extended paper based on the ICTERI 2017 PhD Symposium paper...,12954,"[synopsis, Refinements, Extensions, compare, publication, Conference, Proceedings, submission, refined, extended, paper, base, ICTERI, 2017, phd, symposium, paper, Kosa, et, al, fact, submission, ...",6495
1,Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph,"Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph Ahmet Soylu1, Oscar Corcho2, Brian Elvesæter1, Carlos Badenes-Olmedo2, Francisc...",5827,"[enhance, Public, Procurement, European, Union, Constructing, exploit, Integrated, Knowledge, Graph, Ahmet, Soylu1, Oscar, Corcho2, Brian, Elvesæter1, Carlos, Badenes, Olmedo2, Francisco, Yedro2, ...",3511
2,Drugs4Covid: Making drug information available from scientific publications,"Drugs4Covid: Making drug information available from scientific publications Carlos Badenes-Olmedo1, David Chaves-Fraga1, Mar´ıa Poveda-Villal´on1, Ana Iglesias-Molina1, Pablo Calleja1, Socorro Ber...",5417,"[Drugs4Covid, make, drug, information, available, scientific, publication, Carlos, Badenes, Olmedo1, David, Chaves, Fraga1, Mar´ıa, Poveda, Villal´on1, Ana, Iglesias, Molina1, Pablo, Calleja1, Soc...",3347
3,Distributing Text Mining tasks with librAIry,"Distributing Text Mining tasks with librAIry Carlos Badenes-Olmedo cbadenes@f.upm.es Universidad Polit´ecnica de Madrid Ontology Engineering Group Boadilla del Monte, Spain Jos´e Luis Redondo-Garc...",2448,"[distribute, text, mining, task, librAIry, Carlos, Badenes, Olmedo, cbadenes@f.upm.es, Universidad, Polit´ecnica, de, Madrid, Ontology, Engineering, Group, Boadilla, del, Monte, Spain, jos´e, Luis...",1484
4,Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms,"Semantic Web 0 (0) 1 1 IOS Press Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms Editor(s): Tomi Kauppinen, Aalto University, Finland; Daniel Garijo,...",9041,"[semantic, web, IOS, Press, large, scale, semantic, Exploration, Scientific, Literature, Topic, base, Hashing, Algorithms, Editor(s, Tomi, Kauppinen, Aalto, University, Finland, Daniel, Garijo, Un...",5825
5,An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts,"An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Carlos Badenes-Olmedo1, Jos´e Luis Redondo-Garc´ıa2, and Oscar Corcho1 1 Universi...",2641,"[initial, Analysis, Topic, base, Similarity, scientific, document, base, Rhetorical, Discourse, Parts, Carlos, Badenes, Olmedo1, Jos´e, Luis, Redondo, Garc´ıa2, Oscar, Corcho1, Universidad, Polit´...",1438
6,Efficient Clustering from Distributions over Topics,"Efficient Clustering from Distributions over Topics Carlos Badenes-Olmedo cbadenes@￿.upm.es Ontology Engineering Group Universidad Polit´ecnica de Madrid Boadilla del Monte, Spain Jos´e Luis Redon...",5346,"[efficient, clustering, distribution, Topics, Carlos, Badenes, Olmedo, cbadenes@￿.upm.es, Ontology, Engineering, Group, Universidad, Polit´ecnica, de, Madrid, Boadilla, del, Monte, Spain, jos´e, L...",3083
7,Legal Documents Retrieval Across Languages: Topic Hierarchies based on synsets,Cross-lingual annotations of legislative texts enable us to explore major themes covered in multi- lingual legal data and are a key facilitator of semantic similarity when searching for similar do...,1445,"[cross, lingual, annotation, legislative, text, enable, explore, major, theme, cover, multi-, lingual, legal, datum, key, facilitator, semantic, similarity, search, similar, document, multilingual...",790
8,Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies,"Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies Carlos Badenes-Olmedo cbadenes@fi.upm.es Ontology Engineering Group, Universidad Politécnica de Madrid Boad...",4602,"[Scalable, Cross, lingual, document, Similarity, language, specific, Concept, Hierarchies, Carlos, Badenes, Olmedo, cbadenes@fi.upm.es, Ontology, Engineering, Group, Universidad, Politécnica, de, ...",3027
9,Potentially inappropriate medications in older adults living with HIV,"Potentially inappropriate medications in older adults living with HIV B L�opez-Centeno,1,* C Badenes-Olmedo,2 A Mataix-Sanjuan,1 JM Bell�on,3 L P�erez-Latorre,3 JC L�opez,3 J Bened�ı,4,* S Khoo,5 ...",3087,"[potentially, inappropriate, medication, old, adult, live, HIV, opez, Centeno,1, Badenes, Olmedo,2, Mataix, Sanjuan,1, JM, Bell, on,3, erez, Latorre,3, JC, opez,3, Bened, ı,4, khoo,5, Marzolini,6,...",2163


### 2.2.5. Some statistics

In [7]:
# Number of distinct tokens per document
corpus_df['#uni_tokens'] = corpus_df['tokens'].apply(lambda tokens: len(set(tokens)))
corpus_df.head(3)

Unnamed: 0,title,text,#words,tokens,#tokens,#uni_tokens
0,Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation,Synopsis of the Refinements and Extensions Compared to the Publication in the Conference Proceedings This submission is a refined and extended paper based on the ICTERI 2017 PhD Symposium paper...,12954,"[synopsis, Refinements, Extensions, compare, publication, Conference, Proceedings, submission, refined, extended, paper, base, ICTERI, 2017, phd, symposium, paper, Kosa, et, al, fact, submission, ...",6495,1688
1,Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph,"Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph Ahmet Soylu1, Oscar Corcho2, Brian Elvesæter1, Carlos Badenes-Olmedo2, Francisc...",5827,"[enhance, Public, Procurement, European, Union, Constructing, exploit, Integrated, Knowledge, Graph, Ahmet, Soylu1, Oscar, Corcho2, Brian, Elvesæter1, Carlos, Badenes, Olmedo2, Francisco, Yedro2, ...",3511,1355
2,Drugs4Covid: Making drug information available from scientific publications,"Drugs4Covid: Making drug information available from scientific publications Carlos Badenes-Olmedo1, David Chaves-Fraga1, Mar´ıa Poveda-Villal´on1, Ana Iglesias-Molina1, Pablo Calleja1, Socorro Ber...",5417,"[Drugs4Covid, make, drug, information, available, scientific, publication, Carlos, Badenes, Olmedo1, David, Chaves, Fraga1, Mar´ıa, Poveda, Villal´on1, Ana, Iglesias, Molina1, Pablo, Calleja1, Soc...",3347,1413
3,Distributing Text Mining tasks with librAIry,"Distributing Text Mining tasks with librAIry Carlos Badenes-Olmedo cbadenes@f.upm.es Universidad Polit´ecnica de Madrid Ontology Engineering Group Boadilla del Monte, Spain Jos´e Luis Redondo-Garc...",2448,"[distribute, text, mining, task, librAIry, Carlos, Badenes, Olmedo, cbadenes@f.upm.es, Universidad, Polit´ecnica, de, Madrid, Ontology, Engineering, Group, Boadilla, del, Monte, Spain, jos´e, Luis...",1484,742
4,Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms,"Semantic Web 0 (0) 1 1 IOS Press Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms Editor(s): Tomi Kauppinen, Aalto University, Finland; Daniel Garijo,...",9041,"[semantic, web, IOS, Press, large, scale, semantic, Exploration, Scientific, Literature, Topic, base, Hashing, Algorithms, Editor(s, Tomi, Kauppinen, Aalto, University, Finland, Daniel, Garijo, Un...",5825,1839
5,An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts,"An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Carlos Badenes-Olmedo1, Jos´e Luis Redondo-Garc´ıa2, and Oscar Corcho1 1 Universi...",2641,"[initial, Analysis, Topic, base, Similarity, scientific, document, base, Rhetorical, Discourse, Parts, Carlos, Badenes, Olmedo1, Jos´e, Luis, Redondo, Garc´ıa2, Oscar, Corcho1, Universidad, Polit´...",1438,590
6,Efficient Clustering from Distributions over Topics,"Efficient Clustering from Distributions over Topics Carlos Badenes-Olmedo cbadenes@￿.upm.es Ontology Engineering Group Universidad Polit´ecnica de Madrid Boadilla del Monte, Spain Jos´e Luis Redon...",5346,"[efficient, clustering, distribution, Topics, Carlos, Badenes, Olmedo, cbadenes@￿.upm.es, Ontology, Engineering, Group, Universidad, Polit´ecnica, de, Madrid, Boadilla, del, Monte, Spain, jos´e, L...",3083,1013
7,Legal Documents Retrieval Across Languages: Topic Hierarchies based on synsets,Cross-lingual annotations of legislative texts enable us to explore major themes covered in multi- lingual legal data and are a key facilitator of semantic similarity when searching for similar do...,1445,"[cross, lingual, annotation, legislative, text, enable, explore, major, theme, cover, multi-, lingual, legal, datum, key, facilitator, semantic, similarity, search, similar, document, multilingual...",790,364
8,Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies,"Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies Carlos Badenes-Olmedo cbadenes@fi.upm.es Ontology Engineering Group, Universidad Politécnica de Madrid Boad...",4602,"[Scalable, Cross, lingual, document, Similarity, language, specific, Concept, Hierarchies, Carlos, Badenes, Olmedo, cbadenes@fi.upm.es, Ontology, Engineering, Group, Universidad, Politécnica, de, ...",3027,1100
9,Potentially inappropriate medications in older adults living with HIV,"Potentially inappropriate medications in older adults living with HIV B L�opez-Centeno,1,* C Badenes-Olmedo,2 A Mataix-Sanjuan,1 JM Bell�on,3 L P�erez-Latorre,3 JC L�opez,3 J Bened�ı,4,* S Khoo,5 ...",3087,"[potentially, inappropriate, medication, old, adult, live, HIV, opez, Centeno,1, Badenes, Olmedo,2, Mataix, Sanjuan,1, JM, Bell, on,3, erez, Latorre,3, JC, opez,3, Bened, ı,4, khoo,5, Marzolini,6,...",2163,1056


## 2.3. Text Vectorization

Once all terms have been pre-processed, a numerical weight is assigned to each of them. The same term may have a different weight in each distinct document in which it occurs. The weight is usually a measure of how effective the given term is likely to be in distinguishing the given document from other documents in the given collection, and is often normalized to be a fraction between zero and one. Statistical approaches fall into the following categories: boolean, vector space and probabilistic.

In [8]:
all_tokens = []
for tokens in corpus_df['tokens']:
  all_tokens.extend(tokens)

vocabulary = list(set(all_tokens))
print("Vocabulary size:",len(vocabulary)," unique words(tokens)")
print("Vocabulary words:",vocabulary[1:10],"...")

Vocabulary size: 6400  unique words(tokens)
Vocabulary words: ['t0', 'Fern´andez-', 'Recent', 'perceive', 'hand', 'rdf14', 'Rhetoric', 'Entity', 'bioinformatic'] ...


To encode our documents, we’ll create a vectorize function that builds a dictionary whose keys are the tokens in the document and whose values depend on the approach we use.

The `defaultdict` object allows us to specify what the dictionary will return for a key that hasn’t been assigned yet. By using `defaultdict(int)` we specify that 0 should be returned, thus creating a simple counting dictionary. We can map this function over every item in the corpus, creating an iterable of vectorized documents.
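As a minimal sketch of the idea just described, the following snippet builds a counting dictionary with `defaultdict(int)` and maps it over a toy corpus. The `vectorize` name and the `toy_corpus` contents are assumptions made for illustration; the notebook's own vectorizers are defined in the following sections.

```python
from collections import defaultdict

def vectorize(tokens):
    # Missing keys default to int() == 0, so += 1 works without initialisation
    features = defaultdict(int)
    for token in tokens:
        features[token] += 1
    return features

# Hypothetical, already pre-processed documents
toy_corpus = [["topic", "model", "topic"], ["document", "retrieval"]]

# map() yields one vectorized document per item in the corpus
vectors = list(map(vectorize, toy_corpus))

print(dict(vectors[0]))
print(vectors[1]["topic"])   # a term absent from the second document
```

Note that reading a missing key from a `defaultdict` also inserts it; use `.get(key, 0)` when the dictionary should stay unchanged.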

### 2.3.1. Boolean Approach

The Boolean representation sets true or false for each vocabulary word depending on whether or not it appears in the document.

In [14]:
from collections import defaultdict

def boolean_vectorize(tokens):
    features = defaultdict(bool)
    for token in tokens:
        features[token] = True
    return features

corpus_df['boolean'] = corpus_df['tokens'].apply(boolean_vectorize)
corpus_df.head(3)

Unnamed: 0,title,text,#words,tokens,#tokens,#uni_tokens,boolean,tf,tf_idf
0,Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation,Synopsis of the Refinements and Extensions Compared to the Publication in the Conference Proceedings This submission is a refined and extended paper based on the ICTERI 2017 PhD Symposium paper...,12954,"[synopsis, Refinements, Extensions, compare, publication, Conference, Proceedings, submission, refined, extended, paper, base, ICTERI, 2017, phd, symposium, paper, Kosa, et, al, fact, submission, ...",6495,1688,"{'synopsis': True, 'Refinements': True, 'Extensions': True, 'compare': True, 'publication': True, 'Conference': True, 'Proceedings': True, 'submission': True, 'refined': True, 'extended': True, 'p...","{'synopsis': 1, 'Refinements': 1, 'Extensions': 1, 'compare': 21, 'publication': 3, 'Conference': 1, 'Proceedings': 1, 'submission': 2, 'refined': 2, 'extended': 1, 'paper': 44, 'base': 43, 'ICTER...","{'synopsis': 0.00036919095809058824, 'Refinements': 0.00036919095809058824, 'Extensions': 0.00036919095809058824, 'compare': 0.0003081622441710275, 'publication': 0.00027997034806942977, 'Conferen..."
1,Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph,"Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph Ahmet Soylu1, Oscar Corcho2, Brian Elvesæter1, Carlos Badenes-Olmedo2, Francisc...",5827,"[enhance, Public, Procurement, European, Union, Constructing, exploit, Integrated, Knowledge, Graph, Ahmet, Soylu1, Oscar, Corcho2, Brian, Elvesæter1, Carlos, Badenes, Olmedo2, Francisco, Yedro2, ...",3511,1355,"{'enhance': True, 'Public': True, 'Procurement': True, 'European': True, 'Union': True, 'Constructing': True, 'exploit': True, 'Integrated': True, 'Knowledge': True, 'Graph': True, 'Ahmet': True, ...","{'enhance': 9, 'Public': 9, 'Procurement': 9, 'European': 5, 'Union': 3, 'Constructing': 1, 'exploit': 1, 'Integrated': 8, 'Knowledge': 11, 'Graph': 11, 'Ahmet': 1, 'Soylu1': 1, 'Oscar': 1, 'Corch...","{'enhance': 0.004369903967572152, 'Public': 0.006146698221357259, 'Procurement': 0.006146698221357259, 'European': 0.00144061650765947, 'Union': 0.0011101819858703454, 'Constructing': 0.0006829664..."
2,Drugs4Covid: Making drug information available from scientific publications,"Drugs4Covid: Making drug information available from scientific publications Carlos Badenes-Olmedo1, David Chaves-Fraga1, Mar´ıa Poveda-Villal´on1, Ana Iglesias-Molina1, Pablo Calleja1, Socorro Ber...",5417,"[Drugs4Covid, make, drug, information, available, scientific, publication, Carlos, Badenes, Olmedo1, David, Chaves, Fraga1, Mar´ıa, Poveda, Villal´on1, Ana, Iglesias, Molina1, Pablo, Calleja1, Soc...",3347,1413,"{'Drugs4Covid': True, 'make': True, 'drug': True, 'information': True, 'available': True, 'scientific': True, 'publication': True, 'Carlos': True, 'Badenes': True, 'Olmedo1': True, 'David': True, ...","{'Drugs4Covid': 24, 'make': 2, 'drug': 71, 'information': 11, 'available': 12, 'scientific': 14, 'publication': 9, 'Carlos': 1, 'Badenes': 10, 'Olmedo1': 1, 'David': 1, 'Chaves': 1, 'Fraga1': 1, '...","{'Drugs4Covid': 0.017194349132704182, 'make': 0.00019029204130178342, 'drug': 0.03616286661157102, 'information': 0.0, 'available': 0.00034171561328111716, 'scientific': 0.004231375190767469, 'pub..."


In the boolean approach, the query is formulated as a boolean combination of terms. A conventional boolean query uses the classical operators AND, OR, and NOT. The query "t1 AND t2" is satisfied by a given document D1 if and only if D1 contains both terms t1 and t2. Similarly, the query "t1 OR t2" is satisfied by D1 if and only if it contains t1 or t2 or both. The query "t1 AND NOT t2" is satisfied by D1 if and only if it contains t1 and does not contain t2. More complex boolean queries can be built up out of these operators and evaluated according to the classical rules of boolean algebra. Such a boolean query is either true or false. Correspondingly, a document either satisfies such a query, i.e. is relevant, or does not satisfy it, i.e. is non-relevant. **No ranking is possible**, which is a significant limitation of this approach (Harmon, 1996).

For example, we can search for documents about topic hierarchies and multilinguality.

In [26]:
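def relevant(doc):
  # Boolean AND query. Tokens are case-sensitive, and .get() avoids
  # inserting missing terms into the defaultdict as a side effect.
  # Alternative queries:
  #   return doc.get('HIV', False)
  #   return doc.get('multilingual', False) and doc.get('procurement', False)
  return (doc.get('multilingual', False)
          and doc.get('topic', False)
          and doc.get('hierarchy', False))

# Collect the titles of the documents that satisfy the query
result = [title
          for title, vector in zip(corpus_df['title'], corpus_df['boolean'])
          if relevant(vector)]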

for paper in result:
  print("-",paper)

- Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph
- Legal Documents Retrieval Across Languages: Topic Hierarchies based on synsets
- Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies
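
The evaluation rules for AND, OR and NOT described above can be sketched without the corpus. This is a minimal illustration, assuming a hypothetical toy collection where each document is reduced to its set of terms; the tuple-based query format and the names `matches` and `search` are assumptions made for the example.

```python
# Toy collection (hypothetical): each document is reduced to its set of terms.
docs = {
    "D1": {"topic", "model", "multilingual"},
    "D2": {"topic", "hierarchy"},
    "D3": {"drug", "multilingual"},
}

def matches(query, terms):
    """Evaluate a nested boolean query against a set of terms.

    A query is either a term (str) or a tuple:
    ("AND", q1, q2), ("OR", q1, q2) or ("NOT", q).
    """
    if isinstance(query, str):
        return query in terms
    op = query[0]
    if op == "AND":
        return matches(query[1], terms) and matches(query[2], terms)
    if op == "OR":
        return matches(query[1], terms) or matches(query[2], terms)
    if op == "NOT":
        return not matches(query[1], terms)
    raise ValueError(f"unknown operator: {op}")

def search(query):
    # A document is either relevant or not: the result is an unranked list of ids.
    return sorted(doc_id for doc_id, terms in docs.items() if matches(query, terms))

both = search(("AND", "topic", "multilingual"))
either = search(("OR", "topic", "multilingual"))
without = search(("AND", "topic", ("NOT", "multilingual")))
print(both, either, without)
```

The result of each query is a plain list of matching documents, with no score attached, which is why no ranking can be derived from it.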


### 2.3.2 Vector space models

Vector space models (VSM) (Salton and McGill, 1983) were proposed to represent texts as vectors where each entry corresponds to a different term and the number at that entry corresponds to how many times that term is present in the text. The objective was twofold: on the one hand, making document collections manageable since we move from having lots of terms for each text to only one vector per document with a defined dimension; on the other hand, having representations based on metric spaces where calculations can be made, for example comparisons by measuring vector distances.

#### 2.3.2.1 Term-Frequency (TF)

The definition and number of dimensions of each vector are key aspects of a VSM. Traditional document retrieval tasks over collections of textual documents rely heavily on individual features such as term frequencies (TF) (Hearst and Hall, 1999). A representational space is created in which each term in the vocabulary is mapped to a separate, orthogonal dimension.

Vectors are created with the frequency of each word as it appears in the document. In this encoding scheme, each document is represented as the multiset of the tokens that compose it, and the value at each word position in the vector is its count. This representation can be either a straight count (integer) encoding or a normalized encoding where each count is divided by the total number of words in the document.

In [23]:
from collections import defaultdict

def tf_vectorize(tokens):
    features = defaultdict(int)
    for token in tokens:
        features[token] += 1
    return features

corpus_df['tf'] = corpus_df['tokens'].apply(tf_vectorize)
corpus_df.head(3)

Unnamed: 0,title,text,#words,tokens,#tokens,#uni_tokens,boolean,tf,tf_idf
0,Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation,Synopsis of the Refinements and Extensions Compared to the Publication in the Conference Proceedings This submission is a refined and extended paper based on the ICTERI 2017 PhD Symposium paper...,12954,"[synopsis, Refinements, Extensions, compare, publication, Conference, Proceedings, submission, refined, extended, paper, base, ICTERI, 2017, phd, symposium, paper, Kosa, et, al, fact, submission, ...",6495,1688,"{'synopsis': True, 'Refinements': True, 'Extensions': True, 'compare': True, 'publication': True, 'Conference': True, 'Proceedings': True, 'submission': True, 'refined': True, 'extended': True, 'p...","{'synopsis': 1, 'Refinements': 1, 'Extensions': 1, 'compare': 21, 'publication': 3, 'Conference': 1, 'Proceedings': 1, 'submission': 2, 'refined': 2, 'extended': 1, 'paper': 44, 'base': 43, 'ICTER...","{'synopsis': 0.00036919095809058824, 'Refinements': 0.00036919095809058824, 'Extensions': 0.00036919095809058824, 'compare': 0.0003081622441710275, 'publication': 0.00027997034806942977, 'Conferen..."
1,Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph,"Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph Ahmet Soylu1, Oscar Corcho2, Brian Elvesæter1, Carlos Badenes-Olmedo2, Francisc...",5827,"[enhance, Public, Procurement, European, Union, Constructing, exploit, Integrated, Knowledge, Graph, Ahmet, Soylu1, Oscar, Corcho2, Brian, Elvesæter1, Carlos, Badenes, Olmedo2, Francisco, Yedro2, ...",3511,1355,"{'enhance': True, 'Public': True, 'Procurement': True, 'European': True, 'Union': True, 'Constructing': True, 'exploit': True, 'Integrated': True, 'Knowledge': True, 'Graph': True, 'Ahmet': True, ...","{'enhance': 9, 'Public': 9, 'Procurement': 9, 'European': 5, 'Union': 3, 'Constructing': 1, 'exploit': 1, 'Integrated': 8, 'Knowledge': 11, 'Graph': 11, 'Ahmet': 1, 'Soylu1': 1, 'Oscar': 1, 'Corch...","{'enhance': 0.004369903967572152, 'Public': 0.006146698221357259, 'Procurement': 0.006146698221357259, 'European': 0.00144061650765947, 'Union': 0.0011101819858703454, 'Constructing': 0.0006829664..."
2,Drugs4Covid: Making drug information available from scientific publications,"Drugs4Covid: Making drug information available from scientific publications Carlos Badenes-Olmedo1, David Chaves-Fraga1, Mar´ıa Poveda-Villal´on1, Ana Iglesias-Molina1, Pablo Calleja1, Socorro Ber...",5417,"[Drugs4Covid, make, drug, information, available, scientific, publication, Carlos, Badenes, Olmedo1, David, Chaves, Fraga1, Mar´ıa, Poveda, Villal´on1, Ana, Iglesias, Molina1, Pablo, Calleja1, Soc...",3347,1413,"{'Drugs4Covid': True, 'make': True, 'drug': True, 'information': True, 'available': True, 'scientific': True, 'publication': True, 'Carlos': True, 'Badenes': True, 'Olmedo1': True, 'David': True, ...","{'Drugs4Covid': 24, 'make': 2, 'drug': 71, 'information': 11, 'available': 12, 'scientific': 14, 'publication': 9, 'Carlos': 1, 'Badenes': 10, 'Olmedo1': 1, 'David': 1, 'Chaves': 1, 'Fraga1': 1, '...","{'Drugs4Covid': 0.017194349132704182, 'make': 0.00019029204130178342, 'drug': 0.03616286661157102, 'information': 0.0, 'available': 0.00034171561328111716, 'scientific': 0.004231375190767469, 'pub..."
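
The `tf` column above holds straight counts. The normalized encoding mentioned earlier could be obtained as in the following sketch; the function name `tf_normalized` and the sample token list are hypothetical and only serve the illustration.

```python
from collections import Counter

def tf_normalized(tokens):
    """Return term frequencies divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# Hypothetical token list, only for illustration
sample_tokens = ["topic", "model", "topic", "document"]
sample_tf = tf_normalized(sample_tokens)
print(sample_tf)
```

With this encoding the values of a document add up to 1, so long and short documents become comparable.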


The results can now be ranked according to the frequency of the query terms in each document. However, **all terms in a document are treated as equally descriptive**.

In [30]:
def relevant(doc):
  # the score adds up the frequencies of the query terms
  # (for OR queries, the max value should be returned instead)
  score = 0
  score += doc['multilingual']
  score += doc['topic']
  score += doc['hierarchy']
  return score

# score every document in the corpus
result = []
for title, vector in zip(corpus_df['title'], corpus_df['tf']):
  result.append({'title': title, 'score': relevant(vector)})

def sort_by_score(element):
  return element['score']

result.sort(reverse=True, key=sort_by_score)

for paper in result:
  print(paper)

{'title': 'Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms', 'score': 197}
{'title': 'Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies', 'score': 134}
{'title': 'Efficient Clustering from Distributions over Topics', 'score': 85}
{'title': 'Legal Documents Retrieval Across Languages: Topic Hierarchies based on synsets', 'score': 36}
{'title': 'An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts', 'score': 27}
{'title': 'Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph', 'score': 14}
{'title': 'Drugs4Covid: Making drug information available from scientific publications', 'score': 14}
{'title': 'Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation', 'score': 4}
{'title': 'Distributing Text Mining tasks with librAIry', 'score': 2}
{'title': 'Potent

#### 2.3.2.2 Term-Frequency Inverse-Document-Frequency (TF/IDF)

To overcome this limitation, Term-Frequency Inverse-Document-Frequency (TF-IDF) (Lee, 1995) weights the relevance of each term with respect to the entire corpus. TF-IDF calculates the importance of a term for a document based on the number of times the term appears in the document itself (term frequency, TF) and the number of documents in the corpus that contain the term (document frequency, DF). The more documents contain the term, the lower its weight.

In [35]:
# https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html
from collections import defaultdict
from nltk.text import TextCollection

texts  = TextCollection(corpus_df['tokens'])

vectors = []

for doc in corpus_df['tokens']:
  features = defaultdict(int)
  for term in doc:
    features[term]=texts.tf_idf(term, doc)
  vectors.append(features)

corpus_df['tf_idf'] = vectors
corpus_df.head(3)

Unnamed: 0,title,text,#words,tokens,#tokens,#uni_tokens,boolean,tf,tf_idf
0,Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation,Synopsis of the Refinements and Extensions Compared to the Publication in the Conference Proceedings This submission is a refined and extended paper based on the ICTERI 2017 PhD Symposium paper...,12954,"[synopsis, Refinements, Extensions, compare, publication, Conference, Proceedings, submission, refined, extended, paper, base, ICTERI, 2017, phd, symposium, paper, Kosa, et, al, fact, submission, ...",6495,1688,"{'synopsis': True, 'Refinements': True, 'Extensions': True, 'compare': True, 'publication': True, 'Conference': True, 'Proceedings': True, 'submission': True, 'refined': True, 'extended': True, 'p...","{'synopsis': 1, 'Refinements': 1, 'Extensions': 1, 'compare': 21, 'publication': 3, 'Conference': 1, 'Proceedings': 1, 'submission': 2, 'refined': 2, 'extended': 1, 'paper': 44, 'base': 43, 'ICTER...","{'synopsis': 0.00036919095809058824, 'Refinements': 0.00036919095809058824, 'Extensions': 0.00036919095809058824, 'compare': 0.0003081622441710275, 'publication': 0.00027997034806942977, 'Conferen..."
1,Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph,"Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph Ahmet Soylu1, Oscar Corcho2, Brian Elvesæter1, Carlos Badenes-Olmedo2, Francisc...",5827,"[enhance, Public, Procurement, European, Union, Constructing, exploit, Integrated, Knowledge, Graph, Ahmet, Soylu1, Oscar, Corcho2, Brian, Elvesæter1, Carlos, Badenes, Olmedo2, Francisco, Yedro2, ...",3511,1355,"{'enhance': True, 'Public': True, 'Procurement': True, 'European': True, 'Union': True, 'Constructing': True, 'exploit': True, 'Integrated': True, 'Knowledge': True, 'Graph': True, 'Ahmet': True, ...","{'enhance': 9, 'Public': 9, 'Procurement': 9, 'European': 5, 'Union': 3, 'Constructing': 1, 'exploit': 1, 'Integrated': 8, 'Knowledge': 11, 'Graph': 11, 'Ahmet': 1, 'Soylu1': 1, 'Oscar': 1, 'Corch...","{'enhance': 0.004369903967572152, 'Public': 0.006146698221357259, 'Procurement': 0.006146698221357259, 'European': 0.00144061650765947, 'Union': 0.0011101819858703454, 'Constructing': 0.0006829664..."
2,Drugs4Covid: Making drug information available from scientific publications,"Drugs4Covid: Making drug information available from scientific publications Carlos Badenes-Olmedo1, David Chaves-Fraga1, Mar´ıa Poveda-Villal´on1, Ana Iglesias-Molina1, Pablo Calleja1, Socorro Ber...",5417,"[Drugs4Covid, make, drug, information, available, scientific, publication, Carlos, Badenes, Olmedo1, David, Chaves, Fraga1, Mar´ıa, Poveda, Villal´on1, Ana, Iglesias, Molina1, Pablo, Calleja1, Soc...",3347,1413,"{'Drugs4Covid': True, 'make': True, 'drug': True, 'information': True, 'available': True, 'scientific': True, 'publication': True, 'Carlos': True, 'Badenes': True, 'Olmedo1': True, 'David': True, ...","{'Drugs4Covid': 24, 'make': 2, 'drug': 71, 'information': 11, 'available': 12, 'scientific': 14, 'publication': 9, 'Carlos': 1, 'Badenes': 10, 'Olmedo1': 1, 'David': 1, 'Chaves': 1, 'Fraga1': 1, '...","{'Drugs4Covid': 0.017194349132704182, 'make': 0.00019029204130178342, 'drug': 0.03616286661157102, 'information': 0.0, 'available': 0.00034171561328111716, 'scientific': 0.004231375190767469, 'pub..."
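
The weighting can also be computed by hand. The following is a minimal sketch, assuming the scheme applied by NLTK's `TextCollection` (term count divided by document length, multiplied by the natural logarithm of the number of documents over the document frequency); the function name `tf_idf` and the toy corpus are hypothetical.

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF of `term` in `doc`, where `corpus` is a list of token lists."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for other in corpus if term in other)   # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# Hypothetical toy corpus
toy_corpus = [
    ["topic", "model", "document"],
    ["drug", "document"],
    ["topic", "hierarchy", "document", "topic"],
]

common = tf_idf("document", toy_corpus[0], toy_corpus)     # term present in every document
specific = tf_idf("hierarchy", toy_corpus[2], toy_corpus)  # term present in one document
print(common, specific)
```

A term that occurs in every document gets a weight of zero however often it appears, which is consistent with the `0.0` shown for `information` in the third row of the table above.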


Relevance now depends not only on the document, but also on the corpus. The most relevant documents are those in which the query terms are frequent while being rare in the rest of the collection.

In [36]:
result = []
pos = 0
for vector in corpus_df['tf_idf']:
  result.append({ 'title': corpus_df['title'][pos],
                 'score' : relevant(vector)})  
  pos+=1 

result.sort(reverse=True, key=sort_by_score)

for paper in result:
  print(paper)

{'title': 'Legal Documents Retrieval Across Languages: Topic Hierarchies based on synsets', 'score': 0.016292567800316772}
{'title': 'Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies', 'score': 0.01377950453669197}
{'title': 'Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms', 'score': 0.009309320794608889}
{'title': 'Efficient Clustering from Distributions over Topics', 'score': 0.005723255199216663}
{'title': 'An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts', 'score': 0.0037678086074256494}
{'title': 'Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph', 'score': 0.0012804866676275552}
{'title': 'Drugs4Covid: Making drug information available from scientific publications', 'score': 0.0008393754814670205}
{'title': 'Distributing Text Mining tasks with librAIry', 'score': 

However, the **absence of semantic information and the high number of dimensions** are the main drawbacks of these approaches, and they led to the emergence of other techniques. New ways of characterizing documents appeared, based on the automatic generation of models that discover the main themes covered in the corpus.

#### 2.3.2.3 Text embedding

Among them, text embedding proposes transforming texts into low-dimensional vectors by prediction methods based on (i) word sequences or (ii) bag-of-words.

##### 2.3.2.3.1 Word sequences

This approach assumes words with similar meanings tend to occur in similar contexts. It considers that word order is relevant, and is based on Neural Models (NM) that learn word vectors from pairs of target and context words. Context words are taken as words observed to surround a target word. 

Document vectors are usually created by combining the vectors of the words they contain, or by treating the documents themselves as target and context items. Skip-gram with negative sampling (Word2Vec) (Mikolov et al., 2013) and Global Vectors (GloVe) (Pennington et al., 2014) are the most popular methods to learn word embeddings due to their training efficiency and robustness (Levy et al., 2015).


The Doc2Vec algorithm is an extension of Word2Vec. It proposes the paragraph vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length documents. It takes into consideration the ordering of words within a narrow context, similar to an n-gram model. The resulting representation generalizes, has a lower dimensionality and is still of fixed length, so it can be used in common machine learning algorithms.

In [46]:
from gensim.models.doc2vec import TaggedDocument, Doc2Vec

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus_df['tokens'])]
model = Doc2Vec(documents, vector_size=5, min_count=0, window=2, workers=4)

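# one vector per document, in corpus order
# (in gensim >= 4.0 this attribute is named `model.dv`; `docvecs` is the older name)
docvecs = []
for pos in range(len(corpus_df['tokens'])):
  docvecs.append(model.docvecs[pos])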

corpus_df['d2v'] = docvecs
corpus_df.head(3)


Unnamed: 0,title,text,#words,tokens,#tokens,#uni_tokens,boolean,tf,tf_idf,d2v
0,Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation,Synopsis of the Refinements and Extensions Compared to the Publication in the Conference Proceedings This submission is a refined and extended paper based on the ICTERI 2017 PhD Symposium paper...,12954,"[synopsis, Refinements, Extensions, compare, publication, Conference, Proceedings, submission, refined, extended, paper, base, ICTERI, 2017, phd, symposium, paper, Kosa, et, al, fact, submission, ...",6495,1688,"{'synopsis': True, 'Refinements': True, 'Extensions': True, 'compare': True, 'publication': True, 'Conference': True, 'Proceedings': True, 'submission': True, 'refined': True, 'extended': True, 'p...","{'synopsis': 1, 'Refinements': 1, 'Extensions': 1, 'compare': 21, 'publication': 3, 'Conference': 1, 'Proceedings': 1, 'submission': 2, 'refined': 2, 'extended': 1, 'paper': 44, 'base': 43, 'ICTER...","{'synopsis': 0.00036919095809058824, 'Refinements': 0.00036919095809058824, 'Extensions': 0.00036919095809058824, 'compare': 0.0003081622441710275, 'publication': 0.00027997034806942977, 'Conferen...","[-15.740464, 4.3430924, -8.222287, -0.07739721, -9.146837]"
1,Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph,"Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph Ahmet Soylu1, Oscar Corcho2, Brian Elvesæter1, Carlos Badenes-Olmedo2, Francisc...",5827,"[enhance, Public, Procurement, European, Union, Constructing, exploit, Integrated, Knowledge, Graph, Ahmet, Soylu1, Oscar, Corcho2, Brian, Elvesæter1, Carlos, Badenes, Olmedo2, Francisco, Yedro2, ...",3511,1355,"{'enhance': True, 'Public': True, 'Procurement': True, 'European': True, 'Union': True, 'Constructing': True, 'exploit': True, 'Integrated': True, 'Knowledge': True, 'Graph': True, 'Ahmet': True, ...","{'enhance': 9, 'Public': 9, 'Procurement': 9, 'European': 5, 'Union': 3, 'Constructing': 1, 'exploit': 1, 'Integrated': 8, 'Knowledge': 11, 'Graph': 11, 'Ahmet': 1, 'Soylu1': 1, 'Oscar': 1, 'Corch...","{'enhance': 0.004369903967572152, 'Public': 0.006146698221357259, 'Procurement': 0.006146698221357259, 'European': 0.00144061650765947, 'Union': 0.0011101819858703454, 'Constructing': 0.0006829664...","[-6.1888537, 1.4815748, -11.119544, -5.824677, -7.810485]"
2,Drugs4Covid: Making drug information available from scientific publications,"Drugs4Covid: Making drug information available from scientific publications Carlos Badenes-Olmedo1, David Chaves-Fraga1, Mar´ıa Poveda-Villal´on1, Ana Iglesias-Molina1, Pablo Calleja1, Socorro Ber...",5417,"[Drugs4Covid, make, drug, information, available, scientific, publication, Carlos, Badenes, Olmedo1, David, Chaves, Fraga1, Mar´ıa, Poveda, Villal´on1, Ana, Iglesias, Molina1, Pablo, Calleja1, Soc...",3347,1413,"{'Drugs4Covid': True, 'make': True, 'drug': True, 'information': True, 'available': True, 'scientific': True, 'publication': True, 'Carlos': True, 'Badenes': True, 'Olmedo1': True, 'David': True, ...","{'Drugs4Covid': 24, 'make': 2, 'drug': 71, 'information': 11, 'available': 12, 'scientific': 14, 'publication': 9, 'Carlos': 1, 'Badenes': 10, 'Olmedo1': 1, 'David': 1, 'Chaves': 1, 'Fraga1': 1, '...","{'Drugs4Covid': 0.017194349132704182, 'make': 0.00019029204130178342, 'drug': 0.03616286661157102, 'information': 0.0, 'available': 0.00034171561328111716, 'scientific': 0.004231375190767469, 'pub...","[-8.049958, 3.3679776, -10.898428, -6.2801666, -8.267454]"


Now a text or a reference document can be used as the query, by measuring its similarity to every document in the corpus.

In [55]:
from scipy import spatial

query_paper = 4
query_vector = corpus_df['d2v'][query_paper]

def relevant(vector):
  distance = spatial.distance.cosine(query_vector, vector)
  similarity = 1 - distance
  return similarity

pos = 0
result = []
for v1 in corpus_df['d2v']:
  result.append({ 'title': corpus_df['title'][pos],
                 'score' : relevant(v1)})    
  pos+=1

result.sort(reverse=True, key=sort_by_score)

print(corpus_df['title'][query_paper],":")
for paper in result:
  print(paper)

Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms :
{'title': 'Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms', 'score': 1.0}
{'title': 'Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph', 'score': 0.9779927134513855}
{'title': 'Efficient Clustering from Distributions over Topics', 'score': 0.9757887125015259}
{'title': 'Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies', 'score': 0.9753336310386658}
{'title': 'Legal Documents Retrieval Across Languages: Topic Hierarchies based on synsets', 'score': 0.9661298394203186}
{'title': 'Distributing Text Mining tasks with librAIry', 'score': 0.952582061290741}
{'title': 'Drugs4Covid: Making drug information available from scientific publications', 'score': 0.9510497450828552}
{'title': 'An initial Analysis of Topic-based Similarity among Scientifi
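
The score used in the cell above is the cosine similarity, i.e. one minus the cosine distance returned by `scipy.spatial.distance.cosine`. It can be sketched with the standard library alone; the function name `cosine_similarity` and the small vectors below are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional document vectors
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

The measure depends only on the direction of the vectors, not on their length: vectors pointing the same way score close to 1, and orthogonal vectors score 0.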

##### 2.3.2.3.2 Bag-of-words

This approach does not consider the order of the words to be relevant, only their frequency. It assumes that words with similar meanings will occur in similar documents, although a recent proposal uses an embedding-based approach to model the topics (Dieng et al., 2020). Topic models (Blei et al., 2003; Deerwester et al., 1990; Hofmann, 2001) are the main methods based on this approach. This second approach is used in our work since we are interested not only in representing words and documents, but also in finding structures that capture knowledge about the collection.

In [69]:
import gensim

# Create Dictionary
dictionary = gensim.corpora.Dictionary(corpus_df['tokens'])

# Create bag-of-words
bows = [dictionary.doc2bow(text) for text in corpus_df['tokens']]

print("->",corpus_df['title'][0],":")
# each entry is a (token_id, frequency) pair
for token_id, freq in bows[0][200:210]:
  print(dictionary[token_id],freq)

-> Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation :
Ananiadou 1
Approaches 1
Arenas 1
Astrakhantsev 1
Automated 2
Automation 2
Average 1
B. 2
B1 3
B12 7
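
What `Dictionary` and `doc2bow` do can be approximated with the standard library. This is a simplified sketch, not gensim's implementation: the helper names `build_vocabulary` and `to_bow` are hypothetical, and ids are assigned in order of first appearance, which may differ from gensim's numbering.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map every distinct token to an integer id (order of first appearance)."""
    vocabulary = {}
    for tokens in documents:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

def to_bow(tokens, vocabulary):
    """Return a sorted list of (token_id, frequency) pairs; word order is lost."""
    counts = Counter(vocabulary[token] for token in tokens if token in vocabulary)
    return sorted(counts.items())

# Hypothetical toy documents
toy_docs = [["topic", "model", "topic"], ["model", "document"]]
vocab = build_vocabulary(toy_docs)
toy_bows = [to_bow(doc, vocab) for doc in toy_docs]
print(toy_bows)
```

Only the terms that actually occur in a document are stored, so the representation stays sparse even when the vocabulary is large.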


### 2.3.3 Probabilistic Topic Models

Probabilistic Topic Models (PTM) (Blei et al., 2003; Hofmann, 2001) are statistical methods based on bag-of-words that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, or how they change over time. 

PTM do not require any prior annotations or labeling of the documents. The topics emerge, as hidden structures, from the analysis of the original texts. These structures are topic distributions, per-document topic distributions or per-document per-word topic assignments.

In turn, a topic is a distribution over terms that is biased towards the words associated with a single theme. This interpretable hidden structure annotates each document in the collection, and these annotations can be used to perform deeper analysis of the relationships between documents.

Topic-based representations offer considerable potential when applied to different IR tasks, as evidenced by recent works in domains such as scholarly literature (Gatti et al., 2015), health (Hsin-Min et al., 2016; Nzali et al., 2017), legal (Greene and Cross, 2016; O’Neill et al., 2017), news (He et al., 2017) and social networks (Cheng et al., 2014).

Topic modeling provides an algorithmic solution to organize and annotate large collections of textual documents according to their topics.

#### 2.3.3.1 LDA

The simplest generative topic model proposed in the state of the art is Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Along with Latent Semantic Analysis (LSA) (Deerwester et al., 1990) and Probabilistic Latent Semantic Analysis (pLSA) (Hofmann, 2001), it is part of the field known as topic modeling. They are well-known latent variable models for high-dimensional data, such as the bag-of-words representation for textual data or any other count-based data representation. They try to capture the intuition that documents can exhibit multiple themes.

In [75]:
from pprint import pprint

lda_model = gensim.models.ldamodel.LdaModel(corpus=bows,
                                           id2word=dictionary,
                                           num_topics=2, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

pprint(lda_model.print_topics())
#doc_lda = lda_model[corpus]

[(0,
  '0.021*"topic" + 0.016*"document" + 0.013*"base" + 0.007*"model" + '
  '0.006*"distribution" + 0.006*"similarity" + 0.005*"algorithm" + '
  '0.004*"create" + 0.004*"set" + 0.004*"datum"'),
 (1,
  '0.016*"term" + 0.008*"document" + 0.008*"collection" + 0.007*"datum" + '
  '0.007*"  " + 0.007*"value" + 0.007*"saturation" + 0.006*"paper" + '
  '0.005*"extract" + 0.005*"result"')]


All the documents in a collection share the same set of topics, but each document exhibits these topics in a different proportion. Each word in each document is drawn from one of the topics, where the topic is chosen from the per-document distribution over topics. Texts are described as vectors of counts with W components, where W is the number of words in the vocabulary. Each document in the corpus is modeled as a mixture over K topics, and each topic k is a distribution over the vocabulary of W words.
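
This generative story can be simulated with the standard library. The sketch below is only an illustration under stated assumptions: the sizes `K` and `W`, the Dirichlet parameters and the helper names are hypothetical, and the topics are themselves sampled from a Dirichlet simply to obtain some example distributions over the vocabulary.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw a probability vector from a Dirichlet distribution via Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate `length` word ids following the generative story of LDA."""
    vocabulary = list(range(len(topics[0])))   # word ids 0..W-1
    theta = sample_dirichlet(alpha, rng)       # per-document topic proportions
    words = []
    for _ in range(length):
        k = rng.choices(range(len(topics)), weights=theta)[0]  # topic assignment
        w = rng.choices(vocabulary, weights=topics[k])[0]      # word drawn from topic k
        words.append(w)
    return theta, words

rng = random.Random(0)
K, W = 2, 5                                    # hypothetical number of topics and words
topics = [sample_dirichlet([0.5] * W, rng) for _ in range(K)]
theta, words = generate_document(topics, alpha=[0.5] * K, length=20, rng=rng)
print(theta, words)
```

Training a topic model reverses this process: given only the observed words, it infers the topics and the per-document proportions that most plausibly generated them.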

In [84]:
corpus_doc_lda = lda_model[bows]

topic_vectors = []

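for doc_lda in corpus_doc_lda:
  # with per_word_topics=True each item is a tuple; its first element is the
  # list of (topic_id, probability) pairs of the document
  topic_vectors.append(doc_lda[0])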

corpus_df['topics'] = topic_vectors
corpus_df.head(3)


Unnamed: 0,title,text,#words,tokens,#tokens,#uni_tokens,boolean,tf,tf_idf,d2v,topics
0,Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation,Synopsis of the Refinements and Extensions Compared to the Publication in the Conference Proceedings This submission is a refined and extended paper based on the ICTERI 2017 PhD Symposium paper...,12954,"[synopsis, Refinements, Extensions, compare, publication, Conference, Proceedings, submission, refined, extended, paper, base, ICTERI, 2017, phd, symposium, paper, Kosa, et, al, fact, submission, ...",6495,1688,"{'synopsis': True, 'Refinements': True, 'Extensions': True, 'compare': True, 'publication': True, 'Conference': True, 'Proceedings': True, 'submission': True, 'refined': True, 'extended': True, 'p...","{'synopsis': 1, 'Refinements': 1, 'Extensions': 1, 'compare': 21, 'publication': 3, 'Conference': 1, 'Proceedings': 1, 'submission': 2, 'refined': 2, 'extended': 1, 'paper': 44, 'base': 43, 'ICTER...","{'synopsis': 0.00036919095809058824, 'Refinements': 0.00036919095809058824, 'Extensions': 0.00036919095809058824, 'compare': 0.0003081622441710275, 'publication': 0.00027997034806942977, 'Conferen...","[-15.740464, 4.3430924, -8.222287, -0.07739721, -9.146837]","[(1, 0.99977225)]"
1,Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph,"Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph Ahmet Soylu1, Oscar Corcho2, Brian Elvesæter1, Carlos Badenes-Olmedo2, Francisc...",5827,"[enhance, Public, Procurement, European, Union, Constructing, exploit, Integrated, Knowledge, Graph, Ahmet, Soylu1, Oscar, Corcho2, Brian, Elvesæter1, Carlos, Badenes, Olmedo2, Francisco, Yedro2, ...",3511,1355,"{'enhance': True, 'Public': True, 'Procurement': True, 'European': True, 'Union': True, 'Constructing': True, 'exploit': True, 'Integrated': True, 'Knowledge': True, 'Graph': True, 'Ahmet': True, ...","{'enhance': 9, 'Public': 9, 'Procurement': 9, 'European': 5, 'Union': 3, 'Constructing': 1, 'exploit': 1, 'Integrated': 8, 'Knowledge': 11, 'Graph': 11, 'Ahmet': 1, 'Soylu1': 1, 'Oscar': 1, 'Corch...","{'enhance': 0.004369903967572152, 'Public': 0.006146698221357259, 'Procurement': 0.006146698221357259, 'European': 0.00144061650765947, 'Union': 0.0011101819858703454, 'Constructing': 0.0006829664...","[-6.1888537, 1.4815748, -11.119544, -5.824677, -7.810485]","[(0, 0.015344715), (1, 0.98465526)]"
2,Drugs4Covid: Making drug information available from scientific publications,"Drugs4Covid: Making drug information available from scientific publications Carlos Badenes-Olmedo1, David Chaves-Fraga1, Mar´ıa Poveda-Villal´on1, Ana Iglesias-Molina1, Pablo Calleja1, Socorro Ber...",5417,"[Drugs4Covid, make, drug, information, available, scientific, publication, Carlos, Badenes, Olmedo1, David, Chaves, Fraga1, Mar´ıa, Poveda, Villal´on1, Ana, Iglesias, Molina1, Pablo, Calleja1, Soc...",3347,1413,"{'Drugs4Covid': True, 'make': True, 'drug': True, 'information': True, 'available': True, 'scientific': True, 'publication': True, 'Carlos': True, 'Badenes': True, 'Olmedo1': True, 'David': True, ...","{'Drugs4Covid': 24, 'make': 2, 'drug': 71, 'information': 11, 'available': 12, 'scientific': 14, 'publication': 9, 'Carlos': 1, 'Badenes': 10, 'Olmedo1': 1, 'David': 1, 'Chaves': 1, 'Fraga1': 1, '...","{'Drugs4Covid': 0.017194349132704182, 'make': 0.00019029204130178342, 'drug': 0.03616286661157102, 'information': 0.0, 'available': 0.00034171561328111716, 'scientific': 0.004231375190767469, 'pub...","[-8.049958, 3.3679776, -10.898428, -6.2801666, -8.267454]","[(0, 0.9985377)]"


In [77]:
from gensim.models import CoherenceModel

# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(bows))  # per-word likelihood bound (higher is better); the perplexity itself is 2**(-bound), lower is better

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=corpus_df['tokens'], dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -7.7004409604324

Coherence Score:  0.5219798593210265
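
The value printed as `Perplexity` is negative because gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself. Assuming the base-2 convention described in gensim's documentation (perplexity = 2^(-bound)), the conversion could be sketched as follows; the function name `perplexity_from_bound` is hypothetical.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (base 2) into a perplexity estimate."""
    return 2 ** (-per_word_bound)

# Bound reported in the output above
perplexity = perplexity_from_bound(-7.7004409604324)
print(perplexity)
```

A higher (less negative) bound therefore corresponds to a lower perplexity.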


# References



* [Text Vectorization and Transformation Pipelines](https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html)

