<a href="https://colab.research.google.com/github/cbadenes/phd-thesis/blob/master/notebooks/soa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook supports the state-of-the-art content of the thesis: *Semantically-enabled Browsing of Large Multilingual Document Collections, Badenes-Olmedo, C. 2021*

# 2.- Techniques for Document Retrieval

The analysis of human-readable documents is a well-known problem in Artificial Intelligence (AI) in general, and in the Information Retrieval (IR) and Natural Language Processing (NLP) fields in particular. As an academic field of study, information retrieval can be defined as finding documents of an unstructured nature, usually text, that satisfy an information need from within large collections (Manning et al., 2008). By this definition, hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email. Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style searching, where an identifier is needed to obtain results.

There are two major categories of IR technology and research: semantic and statistical. Semantic approaches attempt to implement some degree of syntactic and semantic analysis. They try to reproduce to some degree the understanding of the natural language text that a human user would provide. In statistical approaches, the documents that are retrieved or that are highly ranked are those that match the query most closely in terms of some statistical measure. The work presented in this thesis follows this second approach.

## 2.1.- Load Corpus

An illustrative example may help to better understand IR techniques, so the publications listed in Section 1.1 are used as a sample collection for applying each of them.

In [1]:
import requests
import json
import pandas as pd

#increase the max column length
pd.set_option('display.max_colwidth', 200)

corpus_df = pd.read_csv('https://www.dropbox.com/s/pag5jseq2e9wcvb/corpus.csv?raw=1',usecols=['title','text'])
corpus_df

Unnamed: 0,title,text
0,Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation,Synopsis of the Refinements and Extensions Compared to the Publication in the Conference Proceedings This submission is a refined and extended paper based on the ICTERI 2017 PhD Symposium paper...
1,Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph,"Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph Ahmet Soylu1, Oscar Corcho2, Brian Elvesæter1, Carlos Badenes-Olmedo2, Francisc..."
2,Drugs4Covid: Making drug information available from scientific publications,"Drugs4Covid: Making drug information available from scientific publications Carlos Badenes-Olmedo1, David Chaves-Fraga1, Mar´ıa Poveda-Villal´on1, Ana Iglesias-Molina1, Pablo Calleja1, Socorro Ber..."
3,Distributing Text Mining tasks with librAIry,"Distributing Text Mining tasks with librAIry Carlos Badenes-Olmedo cbadenes@f.upm.es Universidad Polit´ecnica de Madrid Ontology Engineering Group Boadilla del Monte, Spain Jos´e Luis Redondo-Garc..."
4,Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms,"Semantic Web 0 (0) 1 1 IOS Press Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms Editor(s): Tomi Kauppinen, Aalto University, Finland; Daniel Garijo,..."
5,An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts,"An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Carlos Badenes-Olmedo1, Jos´e Luis Redondo-Garc´ıa2, and Oscar Corcho1 1 Universi..."
6,Efficient Clustering from Distributions over Topics,"Efficient Clustering from Distributions over Topics Carlos Badenes-Olmedo cbadenes@￿.upm.es Ontology Engineering Group Universidad Polit´ecnica de Madrid Boadilla del Monte, Spain Jos´e Luis Redon..."
7,Legal Documents Retrieval Across Languages: Topic Hierarchies based on synsets,Cross-lingual annotations of legislative texts enable us to explore major themes covered in multi- lingual legal data and are a key facilitator of semantic similarity when searching for similar do...
8,Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies,"Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies Carlos Badenes-Olmedo cbadenes@fi.upm.es Ontology Engineering Group, Universidad Politécnica de Madrid Boad..."
9,Potentially inappropriate medications in older adults living with HIV,"Potentially inappropriate medications in older adults living with HIV B L�opez-Centeno,1,* C Badenes-Olmedo,2 A Mataix-Sanjuan,1 JM Bell�on,3 L P�erez-Latorre,3 JC L�opez,3 J Bened�ı,4,* S Khoo,5 ..."


## 2.2. Text Pre-Processing

Documents must be pre-processed to transform their texts into terms. These terms are the population that is counted and measured statistically. Most commonly, the terms are words (or combinations of adjacent words or characters) that occur in a given query or collection of documents, and they usually need to be normalized before they can be counted.

### 2.2.1. Methods to transform texts into terms

Words are reduced to a common base form either by stemming, a heuristic process that removes affixes, or by lemmatization, which returns the dictionary form (lemma) of the word (Porter, 1997). The objective is to eliminate the variation that arises from different grammatical forms of the same word, e.g., "program", "programming", "programs", and "programmed" should all be recognized as forms of the same word, "program".

Another common form of pre-processing is the elimination of common words that have little power to discriminate relevant from non-relevant documents, e.g., "the", "a", "it". Hence, IR engines are usually provided with a stop-list of such noise words. Note that both stemming/lemmatization and stop-word lists are language-dependent.
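
As a rough illustration of both steps, the sketch below applies a deliberately naive suffix-stripping rule and a tiny stop-list. The names `TOY_STOPWORDS`, `toy_stem` and `toy_preprocess` are hypothetical, and the rule is far cruder than a real stemmer or the spaCy lemmatizer used in this notebook; it only shows the idea of collapsing grammatical variants and discarding noise words.

```python
# Hypothetical, deliberately tiny stop-list (real lists have hundreds of entries).
TOY_STOPWORDS = {"the", "a", "it"}

def toy_stem(word):
    """Strip one common English suffix; a toy rule, not the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling left behind ("programm" -> "program").
            if suffix != "s" and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

def toy_preprocess(text):
    """Drop stop words, then stem whatever is left."""
    return [toy_stem(w) for w in text.split() if w.lower() not in TOY_STOPWORDS]

forms = ["program", "programming", "programs", "programmed"]
stems = [toy_stem(w) for w in forms]
print(stems)
print(toy_preprocess("The programs it programmed"))
```

A rule this simple over-stems many words (for example, it would cut "class" to "clas"), which is one reason dictionary-based lemmatization is often preferred when a language model is available.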

In [12]:
import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

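# If the model is missing, install it with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def tokenize(text):
  # run the spaCy pipeline; the resulting Doc iterates over tokens
  tokens = nlp(text)
  return tokens

def is_valid(token):
  # keep tokens longer than one character that are not stop words
  return len(token.text) > 1 and not token.is_stop

def lemma(token):
  # dictionary form assigned by the spaCy lemmatizer
  return token.lemma_

def preprocess(text):
  # tokenize, filter and lemmatize a text into a list of terms
  tokens = []
  for token in tokenize(text):
    if is_valid(token):
      tokens.append(lemma(token))
  return tokens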

print("methods created successfully")

methods created successfully


The following sentence, taken from one of the documents, can be used to see each of the steps: *"Probabilistic Topic Models reduce that feature space by annotating documents with thematic information"*.

In [13]:
tokens = preprocess("Probabilistic Topic Models reduce that feature space by annotating documents with thematic information")
print(tokens)

['Probabilistic', 'Topic', 'Models', 'reduce', 'feature', 'space', 'annotate', 'document', 'thematic', 'information']


At this step 'annotating' was transformed to 'annotate' and 'documents' was reduced to 'document'. However, 'Models' remains unchanged: because it starts with a capital letter, it is treated as a proper noun. In addition, words that appear in the stop-word list are removed (e.g. 'that', 'by' and 'with'). Each text is thus transformed into a normalized list of terms.

### 2.2.2. Count words

In [3]:
def count_words(text):
  return len(text.split(" "))

corpus_df['#words'] = corpus_df['text'].apply(count_words)
corpus_df

Unnamed: 0,title,text,#words
0,Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation,Synopsis of the Refinements and Extensions Compared to the Publication in the Conference Proceedings This submission is a refined and extended paper based on the ICTERI 2017 PhD Symposium paper...,12954
1,Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph,"Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph Ahmet Soylu1, Oscar Corcho2, Brian Elvesæter1, Carlos Badenes-Olmedo2, Francisc...",5827
2,Drugs4Covid: Making drug information available from scientific publications,"Drugs4Covid: Making drug information available from scientific publications Carlos Badenes-Olmedo1, David Chaves-Fraga1, Mar´ıa Poveda-Villal´on1, Ana Iglesias-Molina1, Pablo Calleja1, Socorro Ber...",5417
3,Distributing Text Mining tasks with librAIry,"Distributing Text Mining tasks with librAIry Carlos Badenes-Olmedo cbadenes@f.upm.es Universidad Polit´ecnica de Madrid Ontology Engineering Group Boadilla del Monte, Spain Jos´e Luis Redondo-Garc...",2448
4,Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms,"Semantic Web 0 (0) 1 1 IOS Press Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms Editor(s): Tomi Kauppinen, Aalto University, Finland; Daniel Garijo,...",9041
5,An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts,"An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Carlos Badenes-Olmedo1, Jos´e Luis Redondo-Garc´ıa2, and Oscar Corcho1 1 Universi...",2641
6,Efficient Clustering from Distributions over Topics,"Efficient Clustering from Distributions over Topics Carlos Badenes-Olmedo cbadenes@￿.upm.es Ontology Engineering Group Universidad Polit´ecnica de Madrid Boadilla del Monte, Spain Jos´e Luis Redon...",5346
7,Legal Documents Retrieval Across Languages: Topic Hierarchies based on synsets,Cross-lingual annotations of legislative texts enable us to explore major themes covered in multi- lingual legal data and are a key facilitator of semantic similarity when searching for similar do...,1445
8,Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies,"Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies Carlos Badenes-Olmedo cbadenes@fi.upm.es Ontology Engineering Group, Universidad Politécnica de Madrid Boad...",4602
9,Potentially inappropriate medications in older adults living with HIV,"Potentially inappropriate medications in older adults living with HIV B L�opez-Centeno,1,* C Badenes-Olmedo,2 A Mataix-Sanjuan,1 JM Bell�on,3 L P�erez-Latorre,3 JC L�opez,3 J Bened�ı,4,* S Khoo,5 ...",3087
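
The word count above splits on a single literal space, so consecutive spaces produce empty strings that are counted as words, and words separated only by a newline are counted as one. The sketch below contrasts that rule with whitespace-based splitting; the function names and sample strings are illustrative assumptions, and the figures in the table were produced with the single-space rule.

```python
def count_words_single_space(text):
    # Same rule as count_words above: split on one literal space.
    return len(text.split(" "))

def count_words_whitespace(text):
    # str.split() without arguments splits on any run of whitespace
    # and never returns empty strings.
    return len(text.split())

double_spaced = "Topic  models  reduce"
line_broken = "feature\nspace"

print(count_words_single_space(double_spaced), count_words_whitespace(double_spaced))
print(count_words_single_space(line_broken), count_words_whitespace(line_broken))
```

For text extracted from PDF files, where spacing is irregular, the two rules can give noticeably different totals.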


### 2.2.3. Tokenize Corpus

In [4]:
corpus_df['tokens'] = corpus_df['text'].apply(preprocess)

corpus_df

Unnamed: 0,title,text,#words,tokens
0,Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation,Synopsis of the Refinements and Extensions Compared to the Publication in the Conference Proceedings This submission is a refined and extended paper based on the ICTERI 2017 PhD Symposium paper...,12954,"[synopsis, Refinements, Extensions, compare, publication, Conference, Proceedings, submission, refined, extended, paper, base, ICTERI, 2017, phd, symposium, paper, Kosa, et, al, fact, submission, ..."
1,Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph,"Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph Ahmet Soylu1, Oscar Corcho2, Brian Elvesæter1, Carlos Badenes-Olmedo2, Francisc...",5827,"[enhance, Public, Procurement, European, Union, Constructing, exploit, Integrated, Knowledge, Graph, Ahmet, Soylu1, Oscar, Corcho2, Brian, Elvesæter1, Carlos, Badenes, Olmedo2, Francisco, Yedro2, ..."
2,Drugs4Covid: Making drug information available from scientific publications,"Drugs4Covid: Making drug information available from scientific publications Carlos Badenes-Olmedo1, David Chaves-Fraga1, Mar´ıa Poveda-Villal´on1, Ana Iglesias-Molina1, Pablo Calleja1, Socorro Ber...",5417,"[Drugs4Covid, make, drug, information, available, scientific, publication, Carlos, Badenes, Olmedo1, David, Chaves, Fraga1, Mar´ıa, Poveda, Villal´on1, Ana, Iglesias, Molina1, Pablo, Calleja1, Soc..."
3,Distributing Text Mining tasks with librAIry,"Distributing Text Mining tasks with librAIry Carlos Badenes-Olmedo cbadenes@f.upm.es Universidad Polit´ecnica de Madrid Ontology Engineering Group Boadilla del Monte, Spain Jos´e Luis Redondo-Garc...",2448,"[distribute, text, mining, task, librAIry, Carlos, Badenes, Olmedo, cbadenes@f.upm.es, Universidad, Polit´ecnica, de, Madrid, Ontology, Engineering, Group, Boadilla, del, Monte, Spain, jos´e, Luis..."
4,Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms,"Semantic Web 0 (0) 1 1 IOS Press Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms Editor(s): Tomi Kauppinen, Aalto University, Finland; Daniel Garijo,...",9041,"[semantic, web, IOS, Press, large, scale, semantic, Exploration, Scientific, Literature, Topic, base, Hashing, Algorithms, Editor(s, Tomi, Kauppinen, Aalto, University, Finland, Daniel, Garijo, Un..."
5,An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts,"An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Carlos Badenes-Olmedo1, Jos´e Luis Redondo-Garc´ıa2, and Oscar Corcho1 1 Universi...",2641,"[initial, Analysis, Topic, base, Similarity, scientific, document, base, Rhetorical, Discourse, Parts, Carlos, Badenes, Olmedo1, Jos´e, Luis, Redondo, Garc´ıa2, Oscar, Corcho1, Universidad, Polit´..."
6,Efficient Clustering from Distributions over Topics,"Efficient Clustering from Distributions over Topics Carlos Badenes-Olmedo cbadenes@￿.upm.es Ontology Engineering Group Universidad Polit´ecnica de Madrid Boadilla del Monte, Spain Jos´e Luis Redon...",5346,"[efficient, clustering, distribution, Topics, Carlos, Badenes, Olmedo, cbadenes@￿.upm.es, Ontology, Engineering, Group, Universidad, Polit´ecnica, de, Madrid, Boadilla, del, Monte, Spain, jos´e, L..."
7,Legal Documents Retrieval Across Languages: Topic Hierarchies based on synsets,Cross-lingual annotations of legislative texts enable us to explore major themes covered in multi- lingual legal data and are a key facilitator of semantic similarity when searching for similar do...,1445,"[cross, lingual, annotation, legislative, text, enable, explore, major, theme, cover, multi-, lingual, legal, datum, key, facilitator, semantic, similarity, search, similar, document, multilingual..."
8,Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies,"Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies Carlos Badenes-Olmedo cbadenes@fi.upm.es Ontology Engineering Group, Universidad Politécnica de Madrid Boad...",4602,"[Scalable, Cross, lingual, document, Similarity, language, specific, Concept, Hierarchies, Carlos, Badenes, Olmedo, cbadenes@fi.upm.es, Ontology, Engineering, Group, Universidad, Politécnica, de, ..."
9,Potentially inappropriate medications in older adults living with HIV,"Potentially inappropriate medications in older adults living with HIV B L�opez-Centeno,1,* C Badenes-Olmedo,2 A Mataix-Sanjuan,1 JM Bell�on,3 L P�erez-Latorre,3 JC L�opez,3 J Bened�ı,4,* S Khoo,5 ...",3087,"[potentially, inappropriate, medication, old, adult, live, HIV, opez, Centeno,1, Badenes, Olmedo,2, Mataix, Sanjuan,1, JM, Bell, on,3, erez, Latorre,3, JC, opez,3, Bened, ı,4, khoo,5, Marzolini,6,..."


### 2.2.4. Count tokens

In [5]:
def count_tokens(tokens):
  return len(tokens)

corpus_df['#tokens'] = corpus_df['tokens'].apply(count_tokens)

corpus_df

Unnamed: 0,title,text,#words,tokens,#tokens
0,Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation,Synopsis of the Refinements and Extensions Compared to the Publication in the Conference Proceedings This submission is a refined and extended paper based on the ICTERI 2017 PhD Symposium paper...,12954,"[synopsis, Refinements, Extensions, compare, publication, Conference, Proceedings, submission, refined, extended, paper, base, ICTERI, 2017, phd, symposium, paper, Kosa, et, al, fact, submission, ...",6495
1,Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph,"Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph Ahmet Soylu1, Oscar Corcho2, Brian Elvesæter1, Carlos Badenes-Olmedo2, Francisc...",5827,"[enhance, Public, Procurement, European, Union, Constructing, exploit, Integrated, Knowledge, Graph, Ahmet, Soylu1, Oscar, Corcho2, Brian, Elvesæter1, Carlos, Badenes, Olmedo2, Francisco, Yedro2, ...",3511
2,Drugs4Covid: Making drug information available from scientific publications,"Drugs4Covid: Making drug information available from scientific publications Carlos Badenes-Olmedo1, David Chaves-Fraga1, Mar´ıa Poveda-Villal´on1, Ana Iglesias-Molina1, Pablo Calleja1, Socorro Ber...",5417,"[Drugs4Covid, make, drug, information, available, scientific, publication, Carlos, Badenes, Olmedo1, David, Chaves, Fraga1, Mar´ıa, Poveda, Villal´on1, Ana, Iglesias, Molina1, Pablo, Calleja1, Soc...",3347
3,Distributing Text Mining tasks with librAIry,"Distributing Text Mining tasks with librAIry Carlos Badenes-Olmedo cbadenes@f.upm.es Universidad Polit´ecnica de Madrid Ontology Engineering Group Boadilla del Monte, Spain Jos´e Luis Redondo-Garc...",2448,"[distribute, text, mining, task, librAIry, Carlos, Badenes, Olmedo, cbadenes@f.upm.es, Universidad, Polit´ecnica, de, Madrid, Ontology, Engineering, Group, Boadilla, del, Monte, Spain, jos´e, Luis...",1484
4,Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms,"Semantic Web 0 (0) 1 1 IOS Press Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms Editor(s): Tomi Kauppinen, Aalto University, Finland; Daniel Garijo,...",9041,"[semantic, web, IOS, Press, large, scale, semantic, Exploration, Scientific, Literature, Topic, base, Hashing, Algorithms, Editor(s, Tomi, Kauppinen, Aalto, University, Finland, Daniel, Garijo, Un...",5825
5,An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts,"An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Carlos Badenes-Olmedo1, Jos´e Luis Redondo-Garc´ıa2, and Oscar Corcho1 1 Universi...",2641,"[initial, Analysis, Topic, base, Similarity, scientific, document, base, Rhetorical, Discourse, Parts, Carlos, Badenes, Olmedo1, Jos´e, Luis, Redondo, Garc´ıa2, Oscar, Corcho1, Universidad, Polit´...",1438
6,Efficient Clustering from Distributions over Topics,"Efficient Clustering from Distributions over Topics Carlos Badenes-Olmedo cbadenes@￿.upm.es Ontology Engineering Group Universidad Polit´ecnica de Madrid Boadilla del Monte, Spain Jos´e Luis Redon...",5346,"[efficient, clustering, distribution, Topics, Carlos, Badenes, Olmedo, cbadenes@￿.upm.es, Ontology, Engineering, Group, Universidad, Polit´ecnica, de, Madrid, Boadilla, del, Monte, Spain, jos´e, L...",3083
7,Legal Documents Retrieval Across Languages: Topic Hierarchies based on synsets,Cross-lingual annotations of legislative texts enable us to explore major themes covered in multi- lingual legal data and are a key facilitator of semantic similarity when searching for similar do...,1445,"[cross, lingual, annotation, legislative, text, enable, explore, major, theme, cover, multi-, lingual, legal, datum, key, facilitator, semantic, similarity, search, similar, document, multilingual...",790
8,Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies,"Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies Carlos Badenes-Olmedo cbadenes@fi.upm.es Ontology Engineering Group, Universidad Politécnica de Madrid Boad...",4602,"[Scalable, Cross, lingual, document, Similarity, language, specific, Concept, Hierarchies, Carlos, Badenes, Olmedo, cbadenes@fi.upm.es, Ontology, Engineering, Group, Universidad, Politécnica, de, ...",3027
9,Potentially inappropriate medications in older adults living with HIV,"Potentially inappropriate medications in older adults living with HIV B L�opez-Centeno,1,* C Badenes-Olmedo,2 A Mataix-Sanjuan,1 JM Bell�on,3 L P�erez-Latorre,3 JC L�opez,3 J Bened�ı,4,* S Khoo,5 ...",3087,"[potentially, inappropriate, medication, old, adult, live, HIV, opez, Centeno,1, Badenes, Olmedo,2, Mataix, Sanjuan,1, JM, Bell, on,3, erez, Latorre,3, JC, opez,3, Bened, ı,4, khoo,5, Marzolini,6,...",2163


### 2.2.5. Some statistics

In [6]:
reduction_by_tokens = []
unique_tokens = []
unique_tokens_ratio = []

for pos in range(len(corpus_df.index)):
  num_words = corpus_df['#words'][pos]
  num_tokens = corpus_df['#tokens'][pos]
  num_unique_tokens = len(set(corpus_df['tokens'][pos]))
  
  # percentage of words dropped by pre-processing
  reduction_ratio = 100-((num_tokens * 100)/num_words)
  # percentage of tokens that repeat an already seen token
  unique_ratio= 100-((num_unique_tokens * 100)/num_tokens)

  reduction_by_tokens.append(reduction_ratio)
  unique_tokens.append(num_unique_tokens)
  unique_tokens_ratio.append(unique_ratio)

# column names as they appear in the output table
corpus_df['%reduction']=reduction_by_tokens
corpus_df['#unique']=unique_tokens
corpus_df['%unique']=unique_tokens_ratio
corpus_df

Unnamed: 0,title,text,#words,tokens,#tokens,%reduction,#unique,%unique
0,Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation,Synopsis of the Refinements and Extensions Compared to the Publication in the Conference Proceedings This submission is a refined and extended paper based on the ICTERI 2017 PhD Symposium paper...,12954,"[synopsis, Refinements, Extensions, compare, publication, Conference, Proceedings, submission, refined, extended, paper, base, ICTERI, 2017, phd, symposium, paper, Kosa, et, al, fact, submission, ...",6495,49.861047,1688,74.010778
1,Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph,"Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph Ahmet Soylu1, Oscar Corcho2, Brian Elvesæter1, Carlos Badenes-Olmedo2, Francisc...",5827,"[enhance, Public, Procurement, European, Union, Constructing, exploit, Integrated, Knowledge, Graph, Ahmet, Soylu1, Oscar, Corcho2, Brian, Elvesæter1, Carlos, Badenes, Olmedo2, Francisco, Yedro2, ...",3511,39.74601,1355,61.407007
2,Drugs4Covid: Making drug information available from scientific publications,"Drugs4Covid: Making drug information available from scientific publications Carlos Badenes-Olmedo1, David Chaves-Fraga1, Mar´ıa Poveda-Villal´on1, Ana Iglesias-Molina1, Pablo Calleja1, Socorro Ber...",5417,"[Drugs4Covid, make, drug, information, available, scientific, publication, Carlos, Badenes, Olmedo1, David, Chaves, Fraga1, Mar´ıa, Poveda, Villal´on1, Ana, Iglesias, Molina1, Pablo, Calleja1, Soc...",3347,38.213033,1413,57.783089
3,Distributing Text Mining tasks with librAIry,"Distributing Text Mining tasks with librAIry Carlos Badenes-Olmedo cbadenes@f.upm.es Universidad Polit´ecnica de Madrid Ontology Engineering Group Boadilla del Monte, Spain Jos´e Luis Redondo-Garc...",2448,"[distribute, text, mining, task, librAIry, Carlos, Badenes, Olmedo, cbadenes@f.upm.es, Universidad, Polit´ecnica, de, Madrid, Ontology, Engineering, Group, Boadilla, del, Monte, Spain, jos´e, Luis...",1484,39.379085,742,50.0
4,Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms,"Semantic Web 0 (0) 1 1 IOS Press Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms Editor(s): Tomi Kauppinen, Aalto University, Finland; Daniel Garijo,...",9041,"[semantic, web, IOS, Press, large, scale, semantic, Exploration, Scientific, Literature, Topic, base, Hashing, Algorithms, Editor(s, Tomi, Kauppinen, Aalto, University, Finland, Daniel, Garijo, Un...",5825,35.571286,1839,68.429185
5,An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts,"An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Carlos Badenes-Olmedo1, Jos´e Luis Redondo-Garc´ıa2, and Oscar Corcho1 1 Universi...",2641,"[initial, Analysis, Topic, base, Similarity, scientific, document, base, Rhetorical, Discourse, Parts, Carlos, Badenes, Olmedo1, Jos´e, Luis, Redondo, Garc´ıa2, Oscar, Corcho1, Universidad, Polit´...",1438,45.550928,590,58.970793
6,Efficient Clustering from Distributions over Topics,"Efficient Clustering from Distributions over Topics Carlos Badenes-Olmedo cbadenes@￿.upm.es Ontology Engineering Group Universidad Polit´ecnica de Madrid Boadilla del Monte, Spain Jos´e Luis Redon...",5346,"[efficient, clustering, distribution, Topics, Carlos, Badenes, Olmedo, cbadenes@￿.upm.es, Ontology, Engineering, Group, Universidad, Polit´ecnica, de, Madrid, Boadilla, del, Monte, Spain, jos´e, L...",3083,42.330715,1013,67.142394
7,Legal Documents Retrieval Across Languages: Topic Hierarchies based on synsets,Cross-lingual annotations of legislative texts enable us to explore major themes covered in multi- lingual legal data and are a key facilitator of semantic similarity when searching for similar do...,1445,"[cross, lingual, annotation, legislative, text, enable, explore, major, theme, cover, multi-, lingual, legal, datum, key, facilitator, semantic, similarity, search, similar, document, multilingual...",790,45.32872,364,53.924051
8,Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies,"Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies Carlos Badenes-Olmedo cbadenes@fi.upm.es Ontology Engineering Group, Universidad Politécnica de Madrid Boad...",4602,"[Scalable, Cross, lingual, document, Similarity, language, specific, Concept, Hierarchies, Carlos, Badenes, Olmedo, cbadenes@fi.upm.es, Ontology, Engineering, Group, Universidad, Politécnica, de, ...",3027,34.22425,1100,63.66039
9,Potentially inappropriate medications in older adults living with HIV,"Potentially inappropriate medications in older adults living with HIV B L�opez-Centeno,1,* C Badenes-Olmedo,2 A Mataix-Sanjuan,1 JM Bell�on,3 L P�erez-Latorre,3 JC L�opez,3 J Bened�ı,4,* S Khoo,5 ...",3087,"[potentially, inappropriate, medication, old, adult, live, HIV, opez, Centeno,1, Badenes, Olmedo,2, Mataix, Sanjuan,1, JM, Bell, on,3, erez, Latorre,3, JC, opez,3, Bened, ı,4, khoo,5, Marzolini,6,...",2163,29.931973,1056,51.178918
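
As a worked check of the two percentages, the sketch below recomputes them for document 3 from the counts in the table (2448 words, 1484 tokens, 742 distinct tokens). The function names are hypothetical. Note that, as computed, `%unique` is 100 minus the share of distinct tokens, so a higher value means more repetition within the document.

```python
def reduction_ratio(num_words, num_tokens):
    """Percentage of words removed by pre-processing."""
    return 100 - (num_tokens * 100) / num_words

def repetition_ratio(num_tokens, num_unique_tokens):
    """Percentage of tokens that repeat a token already seen."""
    return 100 - (num_unique_tokens * 100) / num_tokens

# Counts for document 3, copied from the table.
num_words, num_tokens, num_unique = 2448, 1484, 742

print(round(reduction_ratio(num_words, num_tokens), 6))
print(repetition_ratio(num_tokens, num_unique))
```

Both values agree with the `%reduction` and `%unique` entries reported for that row.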


## 2.3. Text Vectorization

Once all terms have been pre-processed, a numerical weight is assigned to each of them. The same term may have a different weight in each distinct document in which it occurs. The weight is usually a measure of how effective the given term is likely to be in distinguishing the given document from other documents in the collection, and is often normalized to a fraction between zero and one. Statistical approaches fall into the following categories: boolean, vector space and probabilistic.

In [17]:
all_tokens = []
for tokens in corpus_df['tokens']:
  all_tokens.extend(tokens)

vocabulary = list(set(all_tokens))
print("Vocabulary size:",len(vocabulary)," unique words(tokens)")
print("Vocabulary words:",vocabulary[1:10],"...")

Vocabulary size: 6400  unique words(tokens)
Vocabulary words: ['Model', 'Exploration', 'topic3@en', 'Covid', 'R.R.V.', 'negligible', 't335', 'exactly', 'helpful'] ...


To encode our documents, we’ll create a vectorize function that creates a dictionary whose keys are the tokens in the document and whose values will depend on the approach we use.



The `defaultdict` object allows us to specify what the dictionary returns for a key that has not been assigned yet. Setting `defaultdict(int)` specifies that 0 is returned, which gives a simple counting dictionary. We can map this function over every item in the corpus to create an iterable of vectorized documents.
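
A minimal sketch of such a counting dictionary follows. The name `frequency_vectorize` and the sample token list are illustrative assumptions, not the notebook's own code.

```python
from collections import defaultdict

def frequency_vectorize(tokens):
    """Map each token to the number of times it occurs in the document."""
    features = defaultdict(int)
    for token in tokens:
        # A missing key starts at int() == 0, so no existence check is needed.
        features[token] += 1
    return features

tokens = ["topic", "model", "reduce", "feature", "space", "topic"]
vector = frequency_vectorize(tokens)

print(dict(vector))
# Reading a missing key returns 0 (and, as a side effect, stores that key).
print(vector["unseen"])
```

With a pandas column of token lists, the same function could be applied row by row through `Series.apply`, as the other cells in this notebook do.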

### 2.3.1. Boolean Approach

The Boolean representation sets true or false for each vocabulary word depending on whether or not it appears in the document.
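
To make the idea concrete on a toy scale, the sketch below builds a fixed-length true/false vector over a small vocabulary and answers a Boolean AND query. The vocabulary, documents and function names are hypothetical and chosen only for illustration.

```python
def to_boolean_vector(tokens, vocabulary):
    """One True/False entry per vocabulary word, in vocabulary order."""
    present = set(tokens)
    return [word in present for word in vocabulary]

def matches_all(query_terms, tokens):
    """Boolean AND query: every query term must occur in the document."""
    return set(query_terms) <= set(tokens)

toy_vocabulary = ["document", "feature", "model", "topic"]
doc_a = ["topic", "model", "topic"]
doc_b = ["document", "feature"]

print(to_boolean_vector(doc_a, toy_vocabulary))
print(matches_all(["topic", "model"], doc_a))
print(matches_all(["topic"], doc_b))
```

Because only presence is recorded, the repeated "topic" in `doc_a` carries no extra weight; the model cannot rank matching documents against each other.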

In [8]:
from collections import defaultdict

def boolean_vectorize(tokens):
    features = defaultdict(bool)
    for token in tokens:
        features[token] = True
    return features

corpus_df['boolean'] = corpus_df['tokens'].apply(boolean_vectorize)
corpus_df

Unnamed: 0,title,text,#words,tokens,#tokens,%reduction,#unique,%unique,boolean
0,Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation,Synopsis of the Refinements and Extensions Compared to the Publication in the Conference Proceedings This submission is a refined and extended paper based on the ICTERI 2017 PhD Symposium paper...,12954,"[synopsis, Refinements, Extensions, compare, publication, Conference, Proceedings, submission, refined, extended, paper, base, ICTERI, 2017, phd, symposium, paper, Kosa, et, al, fact, submission, ...",6495,49.861047,1688,74.010778,"{'synopsis': True, 'Refinements': True, 'Extensions': True, 'compare': True, 'publication': True, 'Conference': True, 'Proceedings': True, 'submission': True, 'refined': True, 'extended': True, 'p..."
1,Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph,"Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph Ahmet Soylu1, Oscar Corcho2, Brian Elvesæter1, Carlos Badenes-Olmedo2, Francisc...",5827,"[enhance, Public, Procurement, European, Union, Constructing, exploit, Integrated, Knowledge, Graph, Ahmet, Soylu1, Oscar, Corcho2, Brian, Elvesæter1, Carlos, Badenes, Olmedo2, Francisco, Yedro2, ...",3511,39.74601,1355,61.407007,"{'enhance': True, 'Public': True, 'Procurement': True, 'European': True, 'Union': True, 'Constructing': True, 'exploit': True, 'Integrated': True, 'Knowledge': True, 'Graph': True, 'Ahmet': True, ..."
2,Drugs4Covid: Making drug information available from scientific publications,"Drugs4Covid: Making drug information available from scientific publications Carlos Badenes-Olmedo1, David Chaves-Fraga1, Mar´ıa Poveda-Villal´on1, Ana Iglesias-Molina1, Pablo Calleja1, Socorro Ber...",5417,"[Drugs4Covid, make, drug, information, available, scientific, publication, Carlos, Badenes, Olmedo1, David, Chaves, Fraga1, Mar´ıa, Poveda, Villal´on1, Ana, Iglesias, Molina1, Pablo, Calleja1, Soc...",3347,38.213033,1413,57.783089,"{'Drugs4Covid': True, 'make': True, 'drug': True, 'information': True, 'available': True, 'scientific': True, 'publication': True, 'Carlos': True, 'Badenes': True, 'Olmedo1': True, 'David': True, ..."
3,Distributing Text Mining tasks with librAIry,"Distributing Text Mining tasks with librAIry Carlos Badenes-Olmedo cbadenes@f.upm.es Universidad Polit´ecnica de Madrid Ontology Engineering Group Boadilla del Monte, Spain Jos´e Luis Redondo-Garc...",2448,"[distribute, text, mining, task, librAIry, Carlos, Badenes, Olmedo, cbadenes@f.upm.es, Universidad, Polit´ecnica, de, Madrid, Ontology, Engineering, Group, Boadilla, del, Monte, Spain, jos´e, Luis...",1484,39.379085,742,50.0,"{'distribute': True, 'text': True, 'mining': True, 'task': True, 'librAIry': True, 'Carlos': True, 'Badenes': True, 'Olmedo': True, 'cbadenes@f.upm.es': True, 'Universidad': True, 'Polit´ecnica': ..."
4,Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms,"Semantic Web 0 (0) 1 1 IOS Press Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms Editor(s): Tomi Kauppinen, Aalto University, Finland; Daniel Garijo,...",9041,"[semantic, web, IOS, Press, large, scale, semantic, Exploration, Scientific, Literature, Topic, base, Hashing, Algorithms, Editor(s, Tomi, Kauppinen, Aalto, University, Finland, Daniel, Garijo, Un...",5825,35.571286,1839,68.429185,"{'semantic': True, 'web': True, 'IOS': True, 'Press': True, 'large': True, 'scale': True, 'Exploration': True, 'Scientific': True, 'Literature': True, 'Topic': True, 'base': True, 'Hashing': True,..."
5,An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts,"An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Carlos Badenes-Olmedo1, Jos´e Luis Redondo-Garc´ıa2, and Oscar Corcho1 1 Universi...",2641,"[initial, Analysis, Topic, base, Similarity, scientific, document, base, Rhetorical, Discourse, Parts, Carlos, Badenes, Olmedo1, Jos´e, Luis, Redondo, Garc´ıa2, Oscar, Corcho1, Universidad, Polit´...",1438,45.550928,590,58.970793,"{'initial': True, 'Analysis': True, 'Topic': True, 'base': True, 'Similarity': True, 'scientific': True, 'document': True, 'Rhetorical': True, 'Discourse': True, 'Parts': True, 'Carlos': True, 'Ba..."
6,Efficient Clustering from Distributions over Topics,"Efficient Clustering from Distributions over Topics Carlos Badenes-Olmedo cbadenes@￿.upm.es Ontology Engineering Group Universidad Polit´ecnica de Madrid Boadilla del Monte, Spain Jos´e Luis Redon...",5346,"[efficient, clustering, distribution, Topics, Carlos, Badenes, Olmedo, cbadenes@￿.upm.es, Ontology, Engineering, Group, Universidad, Polit´ecnica, de, Madrid, Boadilla, del, Monte, Spain, jos´e, L...",3083,42.330715,1013,67.142394,"{'efficient': True, 'clustering': True, 'distribution': True, 'Topics': True, 'Carlos': True, 'Badenes': True, 'Olmedo': True, 'cbadenes@￿.upm.es': True, 'Ontology': True, 'Engineering': True, 'Gr..."
7,Legal Documents Retrieval Across Languages: Topic Hierarchies based on synsets,Cross-lingual annotations of legislative texts enable us to explore major themes covered in multi- lingual legal data and are a key facilitator of semantic similarity when searching for similar do...,1445,"[cross, lingual, annotation, legislative, text, enable, explore, major, theme, cover, multi-, lingual, legal, datum, key, facilitator, semantic, similarity, search, similar, document, multilingual...",790,45.32872,364,53.924051,"{'cross': True, 'lingual': True, 'annotation': True, 'legislative': True, 'text': True, 'enable': True, 'explore': True, 'major': True, 'theme': True, 'cover': True, 'multi-': True, 'legal': True,..."
8,Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies,"Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies Carlos Badenes-Olmedo cbadenes@fi.upm.es Ontology Engineering Group, Universidad Politécnica de Madrid Boad...",4602,"[Scalable, Cross, lingual, document, Similarity, language, specific, Concept, Hierarchies, Carlos, Badenes, Olmedo, cbadenes@fi.upm.es, Ontology, Engineering, Group, Universidad, Politécnica, de, ...",3027,34.22425,1100,63.66039,"{'Scalable': True, 'Cross': True, 'lingual': True, 'document': True, 'Similarity': True, 'language': True, 'specific': True, 'Concept': True, 'Hierarchies': True, 'Carlos': True, 'Badenes': True, ..."
9,Potentially inappropriate medications in older adults living with HIV,"Potentially inappropriate medications in older adults living with HIV B L�opez-Centeno,1,* C Badenes-Olmedo,2 A Mataix-Sanjuan,1 JM Bell�on,3 L P�erez-Latorre,3 JC L�opez,3 J Bened�ı,4,* S Khoo,5 ...",3087,"[potentially, inappropriate, medication, old, adult, live, HIV, opez, Centeno,1, Badenes, Olmedo,2, Mataix, Sanjuan,1, JM, Bell, on,3, erez, Latorre,3, JC, opez,3, Bened, ı,4, khoo,5, Marzolini,6,...",2163,29.931973,1056,51.178918,"{'potentially': True, 'inappropriate': True, 'medication': True, 'old': True, 'adult': True, 'live': True, 'HIV': True, 'opez': True, 'Centeno,1': True, 'Badenes': True, 'Olmedo,2': True, 'Mataix'..."


In the boolean approach, the query is formulated as a boolean combination of terms. A conventional boolean query uses the classical operators AND, OR, and NOT. The query "t1 AND t2" is satisfied by a given document D1 if and only if D1 contains both terms t1 and t2. Similarly, the query "t1 OR t2" is satisfied by D1 if and only if it contains t1 or t2 or both. The query "t1 AND NOT t2" is satisfied by D1 if and only if it contains t1 and does not contain t2. More complex boolean queries can be built from these operators and evaluated according to the classical rules of boolean algebra. For a given document, such a query is either true or false: the document either satisfies the query (is relevant) or does not (is non-relevant). **No ranking is possible**, which is a significant limitation of this approach (Harmon, 1996).

For example, we can retrieve the documents about multilingual public procurement data, that is, those containing both the terms *multilingual* and *procurement*.

In [9]:
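def is_relevant(doc):
  # Alternative single-term queries:
  #return doc.get('HIV', False)
  #return doc.get('multilingual', False)
  # Boolean query: multilingual AND procurement.
  # .get() treats a term missing from the document as False
  # instead of raising a KeyError on a plain dict.
  return doc.get('multilingual', False) and doc.get('procurement', False)

# Print the title of every document that satisfies the query
for title, vector in zip(corpus_df['title'], corpus_df['boolean']):
  if is_relevant(vector):
    print("-", title)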

- Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph
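
The cell above only exercises AND. As a minimal sketch of the other two operators, the snippet below evaluates AND, OR, and AND NOT queries over three toy documents. The documents, the `has` helper, and the `matches` helper are hypothetical and not part of the notebook corpus.

```python
# Toy boolean vectors (hypothetical documents, not the notebook corpus)
docs = {
    "D1": {"multilingual": True, "procurement": True},
    "D2": {"multilingual": True, "HIV": True},
    "D3": {"procurement": True},
}

def has(doc, term):
    # A term absent from the vector is treated as False
    return doc.get(term, False)

def matches(query):
    # Return the sorted ids of the documents that satisfy the query predicate
    return sorted(doc_id for doc_id, doc in docs.items() if query(doc))

both = matches(lambda d: has(d, "multilingual") and has(d, "procurement"))
either = matches(lambda d: has(d, "multilingual") or has(d, "procurement"))
only_first = matches(lambda d: has(d, "multilingual") and not has(d, "procurement"))

print(both)        # ['D1']
print(either)      # ['D1', 'D2', 'D3']
print(only_first)  # ['D2']
```

Each query returns an unordered set of matching documents; nothing in the result says which match is the most relevant.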


### 2.3.2 Vector space models

Vector space models (VSM) (Salton and McGill, 1983) represent texts as vectors in which each entry corresponds to a different term and the value at that entry is the number of times that term occurs in the text. The objective was twofold: on the one hand, to make document collections manageable by replacing the many terms of each text with a single vector of fixed dimension per document; on the other hand, to obtain representations in a metric space where calculations can be made, for example comparing documents by measuring the distance between their vectors.
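
The comparison by vector distance mentioned above can be sketched as follows. This is an illustration under stated assumptions: the two term-count dictionaries are made up, and Euclidean distance and cosine similarity are just two common choices of measure, not necessarily the ones used later in the thesis.

```python
import math

# Hypothetical term counts for two short texts
doc_a = {"topic": 2, "model": 1}
doc_b = {"topic": 1, "hashing": 3}

# One dimension per vocabulary term, in a fixed order
vocabulary = sorted(set(doc_a) | set(doc_b))  # ['hashing', 'model', 'topic']

def to_vector(counts):
    # Terms that do not occur in the text get a count of 0
    return [counts.get(term, 0) for term in vocabulary]

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

vec_a, vec_b = to_vector(doc_a), to_vector(doc_b)
print(vec_a, vec_b)  # [0, 1, 2] [3, 0, 1]
print(euclidean(vec_a, vec_b), cosine(vec_a, vec_b))
```

Both vectors have the same length because they share one vocabulary, which is what makes the element-wise comparison possible.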

The definition and number of dimensions of each vector are key aspects of a VSM. With this type of model, traditional document retrieval tasks over collections of textual documents rely heavily on individual features such as term frequencies (TF) (Hearst and Hall, 1999). A representational space is created in which each term in the vocabulary is projected onto a separate, orthogonal dimension. All terms in a document are treated as equally descriptive.

In [10]:
from collections import defaultdict

def tf_vectorize(tokens):
    features = defaultdict(int)
    for token in tokens:
        features[token] += 1
    return features

corpus_df['tf'] = corpus_df['tokens'].apply(tf_vectorize)
corpus_df

Unnamed: 0,title,text,#words,tokens,#tokens,%reduction,#unique,%unique,boolean,tf
0,Cross-Evaluation of Term Extraction Tools by Measuring Terminological Saturation,Synopsis of the Refinements and Extensions Compared to the Publication in the Conference Proceedings This submission is a refined and extended paper based on the ICTERI 2017 PhD Symposium paper...,12954,"[synopsis, Refinements, Extensions, compare, publication, Conference, Proceedings, submission, refined, extended, paper, base, ICTERI, 2017, phd, symposium, paper, Kosa, et, al, fact, submission, ...",6495,49.861047,1688,74.010778,"{'synopsis': True, 'Refinements': True, 'Extensions': True, 'compare': True, 'publication': True, 'Conference': True, 'Proceedings': True, 'submission': True, 'refined': True, 'extended': True, 'p...","{'synopsis': 1, 'Refinements': 1, 'Extensions': 1, 'compare': 21, 'publication': 3, 'Conference': 1, 'Proceedings': 1, 'submission': 2, 'refined': 2, 'extended': 1, 'paper': 44, 'base': 43, 'ICTER..."
1,Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph,"Enhancing Public Procurement in the European Union through Constructing and Exploiting an Integrated Knowledge Graph Ahmet Soylu1, Oscar Corcho2, Brian Elvesæter1, Carlos Badenes-Olmedo2, Francisc...",5827,"[enhance, Public, Procurement, European, Union, Constructing, exploit, Integrated, Knowledge, Graph, Ahmet, Soylu1, Oscar, Corcho2, Brian, Elvesæter1, Carlos, Badenes, Olmedo2, Francisco, Yedro2, ...",3511,39.74601,1355,61.407007,"{'enhance': True, 'Public': True, 'Procurement': True, 'European': True, 'Union': True, 'Constructing': True, 'exploit': True, 'Integrated': True, 'Knowledge': True, 'Graph': True, 'Ahmet': True, ...","{'enhance': 9, 'Public': 9, 'Procurement': 9, 'European': 5, 'Union': 3, 'Constructing': 1, 'exploit': 1, 'Integrated': 8, 'Knowledge': 11, 'Graph': 11, 'Ahmet': 1, 'Soylu1': 1, 'Oscar': 1, 'Corch..."
2,Drugs4Covid: Making drug information available from scientific publications,"Drugs4Covid: Making drug information available from scientific publications Carlos Badenes-Olmedo1, David Chaves-Fraga1, Mar´ıa Poveda-Villal´on1, Ana Iglesias-Molina1, Pablo Calleja1, Socorro Ber...",5417,"[Drugs4Covid, make, drug, information, available, scientific, publication, Carlos, Badenes, Olmedo1, David, Chaves, Fraga1, Mar´ıa, Poveda, Villal´on1, Ana, Iglesias, Molina1, Pablo, Calleja1, Soc...",3347,38.213033,1413,57.783089,"{'Drugs4Covid': True, 'make': True, 'drug': True, 'information': True, 'available': True, 'scientific': True, 'publication': True, 'Carlos': True, 'Badenes': True, 'Olmedo1': True, 'David': True, ...","{'Drugs4Covid': 24, 'make': 2, 'drug': 71, 'information': 11, 'available': 12, 'scientific': 14, 'publication': 9, 'Carlos': 1, 'Badenes': 10, 'Olmedo1': 1, 'David': 1, 'Chaves': 1, 'Fraga1': 1, '..."
3,Distributing Text Mining tasks with librAIry,"Distributing Text Mining tasks with librAIry Carlos Badenes-Olmedo cbadenes@f.upm.es Universidad Polit´ecnica de Madrid Ontology Engineering Group Boadilla del Monte, Spain Jos´e Luis Redondo-Garc...",2448,"[distribute, text, mining, task, librAIry, Carlos, Badenes, Olmedo, cbadenes@f.upm.es, Universidad, Polit´ecnica, de, Madrid, Ontology, Engineering, Group, Boadilla, del, Monte, Spain, jos´e, Luis...",1484,39.379085,742,50.0,"{'distribute': True, 'text': True, 'mining': True, 'task': True, 'librAIry': True, 'Carlos': True, 'Badenes': True, 'Olmedo': True, 'cbadenes@f.upm.es': True, 'Universidad': True, 'Polit´ecnica': ...","{'distribute': 8, 'text': 13, 'mining': 4, 'task': 6, 'librAIry': 6, 'Carlos': 1, 'Badenes': 2, 'Olmedo': 2, 'cbadenes@f.upm.es': 1, 'Universidad': 4, 'Polit´ecnica': 4, 'de': 4, 'Madrid': 4, 'Ont..."
4,Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms,"Semantic Web 0 (0) 1 1 IOS Press Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms Editor(s): Tomi Kauppinen, Aalto University, Finland; Daniel Garijo,...",9041,"[semantic, web, IOS, Press, large, scale, semantic, Exploration, Scientific, Literature, Topic, base, Hashing, Algorithms, Editor(s, Tomi, Kauppinen, Aalto, University, Finland, Daniel, Garijo, Un...",5825,35.571286,1839,68.429185,"{'semantic': True, 'web': True, 'IOS': True, 'Press': True, 'large': True, 'scale': True, 'Exploration': True, 'Scientific': True, 'Literature': True, 'Topic': True, 'base': True, 'Hashing': True,...","{'semantic': 13, 'web': 4, 'IOS': 2, 'Press': 8, 'large': 12, 'scale': 7, 'Exploration': 1, 'Scientific': 1, 'Literature': 1, 'Topic': 26, 'base': 130, 'Hashing': 26, 'Algorithms': 6, 'Editor(s': ..."
5,An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts,"An initial Analysis of Topic-based Similarity among Scientific Documents based on their Rhetorical Discourse Parts Carlos Badenes-Olmedo1, Jos´e Luis Redondo-Garc´ıa2, and Oscar Corcho1 1 Universi...",2641,"[initial, Analysis, Topic, base, Similarity, scientific, document, base, Rhetorical, Discourse, Parts, Carlos, Badenes, Olmedo1, Jos´e, Luis, Redondo, Garc´ıa2, Oscar, Corcho1, Universidad, Polit´...",1438,45.550928,590,58.970793,"{'initial': True, 'Analysis': True, 'Topic': True, 'base': True, 'Similarity': True, 'scientific': True, 'document': True, 'Rhetorical': True, 'Discourse': True, 'Parts': True, 'Carlos': True, 'Ba...","{'initial': 2, 'Analysis': 2, 'Topic': 4, 'base': 22, 'Similarity': 1, 'scientific': 10, 'document': 14, 'Rhetorical': 2, 'Discourse': 2, 'Parts': 2, 'Carlos': 1, 'Badenes': 1, 'Olmedo1': 1, 'Jos´..."
6,Efficient Clustering from Distributions over Topics,"Efficient Clustering from Distributions over Topics Carlos Badenes-Olmedo cbadenes@￿.upm.es Ontology Engineering Group Universidad Polit´ecnica de Madrid Boadilla del Monte, Spain Jos´e Luis Redon...",5346,"[efficient, clustering, distribution, Topics, Carlos, Badenes, Olmedo, cbadenes@￿.upm.es, Ontology, Engineering, Group, Universidad, Polit´ecnica, de, Madrid, Boadilla, del, Monte, Spain, jos´e, L...",3083,42.330715,1013,67.142394,"{'efficient': True, 'clustering': True, 'distribution': True, 'Topics': True, 'Carlos': True, 'Badenes': True, 'Olmedo': True, 'cbadenes@￿.upm.es': True, 'Ontology': True, 'Engineering': True, 'Gr...","{'efficient': 1, 'clustering': 6, 'distribution': 61, 'Topics': 4, 'Carlos': 1, 'Badenes': 4, 'Olmedo': 4, 'cbadenes@￿.upm.es': 1, 'Ontology': 2, 'Engineering': 2, 'Group': 2, 'Universidad': 2, 'P..."
7,Legal Documents Retrieval Across Languages: Topic Hierarchies based on synsets,Cross-lingual annotations of legislative texts enable us to explore major themes covered in multi- lingual legal data and are a key facilitator of semantic similarity when searching for similar do...,1445,"[cross, lingual, annotation, legislative, text, enable, explore, major, theme, cover, multi-, lingual, legal, datum, key, facilitator, semantic, similarity, search, similar, document, multilingual...",790,45.32872,364,53.924051,"{'cross': True, 'lingual': True, 'annotation': True, 'legislative': True, 'text': True, 'enable': True, 'explore': True, 'major': True, 'theme': True, 'cover': True, 'multi-': True, 'legal': True,...","{'cross': 6, 'lingual': 16, 'annotation': 6, 'legislative': 2, 'text': 6, 'enable': 1, 'explore': 1, 'major': 1, 'theme': 5, 'cover': 1, 'multi-': 2, 'legal': 2, 'datum': 5, 'key': 1, 'facilitator..."
8,Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies,"Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies Carlos Badenes-Olmedo cbadenes@fi.upm.es Ontology Engineering Group, Universidad Politécnica de Madrid Boad...",4602,"[Scalable, Cross, lingual, document, Similarity, language, specific, Concept, Hierarchies, Carlos, Badenes, Olmedo, cbadenes@fi.upm.es, Ontology, Engineering, Group, Universidad, Politécnica, de, ...",3027,34.22425,1100,63.66039,"{'Scalable': True, 'Cross': True, 'lingual': True, 'document': True, 'Similarity': True, 'language': True, 'specific': True, 'Concept': True, 'Hierarchies': True, 'Carlos': True, 'Badenes': True, ...","{'Scalable': 4, 'Cross': 7, 'lingual': 50, 'document': 86, 'Similarity': 4, 'language': 43, 'specific': 9, 'Concept': 3, 'Hierarchies': 3, 'Carlos': 2, 'Badenes': 5, 'Olmedo': 5, 'cbadenes@fi.upm...."
9,Potentially inappropriate medications in older adults living with HIV,"Potentially inappropriate medications in older adults living with HIV B L�opez-Centeno,1,* C Badenes-Olmedo,2 A Mataix-Sanjuan,1 JM Bell�on,3 L P�erez-Latorre,3 JC L�opez,3 J Bened�ı,4,* S Khoo,5 ...",3087,"[potentially, inappropriate, medication, old, adult, live, HIV, opez, Centeno,1, Badenes, Olmedo,2, Mataix, Sanjuan,1, JM, Bell, on,3, erez, Latorre,3, JC, opez,3, Bened, ı,4, khoo,5, Marzolini,6,...",2163,29.931973,1056,51.178918,"{'potentially': True, 'inappropriate': True, 'medication': True, 'old': True, 'adult': True, 'live': True, 'HIV': True, 'opez': True, 'Centeno,1': True, 'Badenes': True, 'Olmedo,2': True, 'Mataix'...","{'potentially': 10, 'inappropriate': 20, 'medication': 35, 'old': 42, 'adult': 10, 'live': 5, 'HIV': 23, 'opez': 2, 'Centeno,1': 1, 'Badenes': 2, 'Olmedo,2': 1, 'Mataix': 2, 'Sanjuan,1': 1, 'JM': ..."


In [11]:
import gensim

id2word = gensim.corpora.Dictionary(corpus_df['tokens'])
vectors = [
    id2word.doc2bow(doc) for doc in corpus_df['tokens']
]
print(vectors[0])

[(0, 101), (1, 19), (2, 8), (3, 10), (4, 5), (5, 3), (6, 3), (7, 4), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 3), (14, 2), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 3), (26, 1), (27, 2), (28, 1), (29, 1), (30, 3), (31, 2), (32, 2), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 2), (41, 1), (42, 1), (43, 2), (44, 2), (45, 1), (46, 1), (47, 2), (48, 1), (49, 1), (50, 2), (51, 1), (52, 2), (53, 2), (54, 1), (55, 3), (56, 2), (57, 1), (58, 1), (59, 1), (60, 7), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 6), (68, 2), (69, 1), (70, 1), (71, 6), (72, 2), (73, 2), (74, 1), (75, 5), (76, 2), (77, 1), (78, 1), (79, 3), (80, 1), (81, 4), (82, 4), (83, 1), (84, 4), (85, 3), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 4), (94, 5), (95, 1), (96, 1), (97, 3), (98, 2), (99, 3), (100, 2), (101, 1), (102, 1), (103, 2), (104, 4), (105, 21), (106, 1), (107, 2), (108, 2), (109, 1), (110
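
Sparse `(term id, count)` pairs like the ones printed above can be compared directly, without expanding them to vocabulary-length vectors. The sketch below ranks a few documents against a query by dot product. The vectors, the query, and the `sparse_dot` helper are hypothetical stand-ins and do not use the notebook's `vectors` or any gensim call.

```python
# Hypothetical sparse vectors in the same (term id, count) form as above
sparse_docs = [
    [(0, 3), (1, 1)],          # document 0
    [(1, 2), (2, 4)],          # document 1
    [(0, 1), (2, 1), (3, 5)],  # document 2
]
query = [(0, 1), (2, 1)]  # a query containing term ids 0 and 2

def sparse_dot(u, v):
    # Only term ids present in both vectors contribute to the product
    weights = dict(v)
    return sum(count * weights.get(term_id, 0) for term_id, count in u)

# Score every document, then sort by descending score (ties by document id)
scores = [(sparse_dot(doc, query), doc_id) for doc_id, doc in enumerate(sparse_docs)]
ranking = [doc_id for score, doc_id in sorted(scores, key=lambda s: (-s[0], s[1]))]
print(ranking)  # [1, 0, 2]
```

Unlike the boolean model, this produces an ordering of the documents. A raw dot product favours long documents, so in practice the scores are usually length-normalised, for example with cosine similarity.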

# References



* [Text Vectorization and Transformation Pipelines](https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html)

