# Exploring Terms in the Encyclopaedia Britannica

## Similar articles within an edition - Gensim - Doc2Vec

In this notebook we explore the dataframe obtained with either the posprocess_eb.py script or the Merging_EB_Terms.ipynb notebook. Both methods produce the same dataframe.

We have selected the first edition for these explorations, but the notebook can be run with any of the other editions.

**Remark**: Edition 1 has 3 volumes, and it was printed twice, in 1771 and 1773.

These are the explorations that we are going to do:
- Create a new dataframe with just the first 100 elements of the first volume of 1771. This is the dataframe that we use for the rest of this notebook.
- Create a training corpus from that dataframe.
- Create a doc2vec model with the training corpus.
- Save the model to disk.
- Test the model using the term ALCOHOL.



### Loading the necessary libraries

In [1]:
import yaml
import matplotlib.pyplot as plt
import numpy as np
import collections
import matplotlib as mpl

In [2]:
import networkx as nx
import matplotlib.pyplot as plt

In [3]:
import pandas as pd
from yaml import safe_load
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0

In [4]:
import gensim
from gensim.models.doc2vec import Doc2Vec

In [5]:
from doc2vec_prep import stem_text, clean_text, generate_documents_df

In [6]:
from tqdm import tqdm
import os


#### Hyperparameters

In [7]:
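# Hyperparameters passed to the Doc2Vec constructor
hyperparams = {
    'dm': 1,               # 1 = distributed memory (PV-DM); 0 = distributed bag of words (PV-DBOW)
    'vector_size': 300,    # dimensionality of the document vectors
    'window': 5,           # maximum distance between the current and predicted word
    'alpha': 0.025,        # initial learning rate
    'min_alpha': 0.00025,  # learning rate floor
    'min_count': 2,        # ignore words that occur fewer than 2 times
    'workers': 8           # number of worker threads
}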

### Functions

In [8]:
def get_document(df, index):
    print("INDEX IS %s" %index)
    term = df.loc[index]["term"]
    definition = df.loc[index]["definition"]
    return term, definition

In [9]:
def most_similar(model, text, clean_func=clean_text, topn=None):
    # Infer a vector for the cleaned text, then rank the trained document vectors against it
    vector = model.infer_vector(clean_func(text), epochs=100, alpha=model.alpha, min_alpha=model.min_alpha)
    simdocs = model.docvecs.most_similar(positive=[vector], topn=topn)
    return simdocs
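
The similarity scores returned by `most_similar` are cosine similarities between the inferred vector and each trained document vector. The following is a minimal pure-Python sketch of that kind of ranking, not gensim's implementation. The helper names (`cosine_similarity`, `rank_by_similarity`) and the toy vectors are made up for illustration, and `topn=None` here simply returns the full ranking, which is a simplification.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, doc_vectors, topn=None):
    """Return (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if topn is None else scored[:topn]

# Toy 3-dimensional vectors (hypothetical, not taken from the trained model)
toy_vectors = {0: [1.0, 0.0, 0.0], 1: [0.9, 0.1, 0.0], 2: [0.0, 1.0, 0.0]}
ranking = rank_by_similarity([1.0, 0.0, 0.0], toy_vectors, topn=2)
print(ranking)
```

A document compared with a vector pointing in the same direction scores 1.0, and orthogonal vectors score 0.0, which is why the scores printed later in this notebook are all below 1.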

In [10]:
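MODEL_PATH = "../../results_NLS/"  # directory where the model is saved in section 3.3

def load_model(filename):
    # Return the saved Doc2Vec model, or None if it cannot be loaded
    try:
        return Doc2Vec.load(os.path.join(MODEL_PATH, filename), mmap='r')
    except Exception:
        return None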


## The dataframe has these columns

- definition:           Definition of a term
- editionNum:           Edition number (1 to 8)
- editionTitle:         Title of the edition
- header:               Header of the page on which the term appears
- place:                Place where the volume was edited (e.g. Edinburgh)
- relatedTerms:         Related terms (see X article)
- altoXML:              Path of the XML file that contains the term
- term:                 Term name
- positionPage:         Position of the term on the page
- startsAt:             Page number on which the term definition starts
- endsAt:               Page number on which the term definition ends
- volumeTitle:          Title of the volume
- typeTerm:             Type of term [Topic | Articles]
- year:                 Year of the edition
- volumeNum:            Volume number (e.g. 1)
- letters:              Letters covered by the volume (e.g. A-B)
- part:                 Part of the volume (e.g. 1)
- supplementTitle:      Title of the supplement
- supplementsTo:        Editions that it supplements [1, 2, 3, ...]
- numberOfWords:        Number of words per term definition
- numberOfTerms:        Number of terms per page
- numberOfPages:        Number of pages per volume

### 1. Load dataframe from JSON file

In [11]:
df = pd.read_json('../../results_NLS/results_eb_1_edition_dataframe', orient="index") 
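
With `orient="index"`, pandas expects the JSON file to be an object whose keys are row labels and whose values map column names to values. The sketch below uses only the standard library to show that shape on a hypothetical two-row sample. The sample is an assumption built from values displayed in this notebook, and the real file may use different row labels and many more columns.

```python
import json

# Hypothetical two-row sample in the "index" orientation:
# outer keys are row labels, inner objects map column -> value.
sample = '''
{
  "0": {"term": "AABAM", "year": 1771, "volumeNum": 1},
  "1": {"term": "ABILITY", "year": 1771, "volumeNum": 1}
}
'''

records = json.loads(sample)

# Flatten into one dict per row, keeping the row label as "index"
rows = [dict(index=int(label), **columns) for label, columns in records.items()]
for row in rows:
    print(row)
```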

Now we are going to order the columns of our dataframe and visualise it.

In [12]:
df = df[["term", "definition", "relatedTerms", "header", "startsAt", "endsAt", "numberOfTerms", "numberOfWords", "numberOfPages",
         "positionPage", "typeTerm", "editionTitle", "editionNum", "supplementTitle", "supplementsTo",
         "year", "place", "volumeTitle", "volumeNum", "letters", "part", "altoXML"]].reset_index(drop=True)

df


Unnamed: 0,term,definition,relatedTerms,header,startsAt,endsAt,numberOfTerms,numberOfWords,numberOfPages,positionPage,...,editionNum,supplementTitle,supplementsTo,year,place,volumeTitle,volumeNum,letters,part,altoXML
0,OR,"A NEW A D I C T I A A, the name of several riv...",[],EncyclopaediaBritannica,15,15,22,54,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
1,AABAM,"a term, among alchemifts, for lead,",[],EncyclopaediaBritannica,15,15,22,6,832,1,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
2,ASTRONOMY,Of the divijion of time.,[],EncyclopaediaBritannica,15,15,22,5,832,10,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
3,ABILITY,"a term in law, denoting a power of doing certa...",[],ABLABR,19,19,37,17,832,3,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082956.34.xml
4,ALCMANIAN,"in ancient lyric poetry, a kind of verse consi...",[],ALcALC,109,109,21,17,832,15,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188084129.34.xml
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27420,OPACITY,"in philosophy, a quality of bodies which rende...",[],ONOOPA,480,480,24,16,872,22,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",3,M-Z,0,144133903/alto/144810307.34.xml
27421,OPAL,"in natural history, a species of gems. The opa...",[],ONOOPA,480,481,24,251,872,23,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",3,M-Z,0,144133903/alto/144810307.34.xml
27422,OP,"ALIA, in antiquity, feasts celebrated at Rome ...",[],OPHOPI,481,481,17,63,872,1,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",3,M-Z,0,144133903/alto/144810319.34.xml
27423,OPERA,"a dramatic composition set to music, and sung ...",[],OPHOPI,481,481,17,21,872,2,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",3,M-Z,0,144133903/alto/144810319.34.xml


### 2. Selecting just the first 100 elements of the first volume of 1771

In [14]:
df_1771_small = df[(df['year'] == 1771) & (df['volumeNum'] == 1) ]
df_1771_small = df_1771_small.head(100).reset_index(drop=True)


In [15]:
df_1771_small

Unnamed: 0,term,definition,relatedTerms,header,startsAt,endsAt,numberOfTerms,numberOfWords,numberOfPages,positionPage,...,editionNum,supplementTitle,supplementsTo,year,place,volumeTitle,volumeNum,letters,part,altoXML
0,OR,"A NEW A D I C T I A A, the name of several riv...",[],EncyclopaediaBritannica,15,15,22,54,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
1,AABAM,"a term, among alchemifts, for lead,",[],EncyclopaediaBritannica,15,15,22,6,832,1,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
2,ASTRONOMY,Of the divijion of time.,[],EncyclopaediaBritannica,15,15,22,5,832,10,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
3,ABILITY,"a term in law, denoting a power of doing certa...",[],ABLABR,19,19,37,17,832,3,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082956.34.xml
4,ALCMANIAN,"in ancient lyric poetry, a kind of verse consi...",[],ALcALC,109,109,21,17,832,15,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188084129.34.xml
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,ALHAGI,"in botany, the trivial name of a species of he...",[],ALGBRA,150,150,30,12,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188084662.34.xml
96,ALHAMA,"a small town of Granada in Spain, surrounded w...",[],ALGBRA,150,150,30,21,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188084662.34.xml
97,ALHANDAL,"among Arabian physicians, a name used for colo...",[],ALGBRA,150,150,30,10,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188084662.34.xml
98,ALHEAL,"in botany. See Galeopsis, Stachys. ! B R A. el...",[],ALGBRA,150,150,30,47,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188084662.34.xml


### 2.1 Counting the number of terms

**Remember**: A term can appear more than once per edition.

In [16]:
len(df_1771_small)

100
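
To see which terms are repeated, one option is `collections.Counter`. The sketch below runs on a short excerpt of the terms shown in the term list further down in this notebook. The variable names are illustrative, and in the notebook itself the same idea could be applied to `df_1771_small["term"]`.

```python
from collections import Counter

# Excerpt of the terms that appear in the term list later in this notebook
terms = ['ABJURATION', 'ALGEBRA', 'ABJURATION', 'BALGEBRA', 'F',
         'ALGEBRA', 'J', 'ALGEBRA', 'A', 'ALGEBRA']

counts = Counter(terms)

# Keep only the terms that occur more than once
repeated = {term: n for term, n in counts.items() if n > 1}
print(repeated)
```

This is why `len(df_1771_small)` counts rows rather than distinct terms.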

### 3. Creating Document embeddings


Document embeddings (doc2vec) are an unsupervised method for learning vector representations of variable-length pieces of text, such as sentences and documents. We use the gensim Python library to create a document embedding model from the term and definition information of **df_1771_small**, which has only 100 elements.



#### 3.1 Train Corpus

First we create a training corpus (train_documents) from the elements of **df_1771_small**. We use only the **term** and **definition** text to build the "text" for each row of the dataframe. We also clean this text by applying a series of transformations (removing stop words, normalising, tokenising, etc.), and we keep only the texts that have a minimum length of 5 words.

In [17]:
train_documents = list(tqdm(generate_documents_df(df_1771_small, clean_text, min_words=5)))

11it [00:00, 49.32it/s]

Preprocessing function: clean_text
Minimum document length: 5 words


87it [00:05, 16.99it/s]

Generated 87 description terms
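
The `doc2vec_prep` helpers are not shown in this notebook, so the following is only a sketch of the preprocessing described in section 3.1, under stated assumptions. `simple_clean`, `make_documents` and the tiny stop-word list are hypothetical stand-ins for `clean_text` and `generate_documents_df`, and the `namedtuple` mimics gensim's `TaggedDocument`, which has the same two fields. Whether the real helper applies the minimum length before or after cleaning is not known; here it is applied after.

```python
import re
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument (same two fields)
TaggedDocument = namedtuple('TaggedDocument', 'words tags')

# Tiny illustrative stop-word list (an assumption, not the list used by doc2vec_prep)
STOP_WORDS = {'a', 'an', 'the', 'of', 'in', 'for', 'and', 'or', 'is', 'it', 'to'}

def simple_clean(text):
    """Lower-case the text, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r'[a-z]+', text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

def make_documents(rows, clean_func=simple_clean, min_words=5):
    """Yield one tagged document per (index, term, definition) row that is long enough."""
    for index, term, definition in rows:
        words = clean_func(f'{term} {definition}')
        if len(words) >= min_words:
            # The row index is used as the tag, so a tag maps back to a dataframe row
            yield TaggedDocument(words=words, tags=[index])

# Two rows taken from the dataframe shown above
rows = [
    (1, 'AABAM', 'a term, among alchemifts, for lead,'),
    (2, 'ASTRONOMY', 'Of the divijion of time.'),
]
docs = list(make_documents(rows))
print(docs)
```

In this sketch the second row is dropped because only three tokens survive cleaning. A filter of this kind would explain why 100 rows produce 87 documents.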





#### 3.2 Creating the model

Once we have created our training corpus, we create the model. In this step we learn one document embedding per element of the training corpus.

In [18]:
print(f'Created {len(train_documents)} tagged documents.')
model = Doc2Vec(**hyperparams)
print('Build vocabulary')
model.build_vocab(train_documents)
for epoch in range(100):
    print(f'Train model: epoch={epoch}')
    model.train(train_documents, total_examples=model.corpus_count, epochs=1)
    model.alpha -= 0.0002
    model.min_alpha = model.alpha

Created 87 tagged documents.
Build vocabulary
Train model: epoch=0
Train model: epoch=1
Train model: epoch=2
Train model: epoch=3
Train model: epoch=4
Train model: epoch=5
Train model: epoch=6
Train model: epoch=7
Train model: epoch=8
Train model: epoch=9
Train model: epoch=10
Train model: epoch=11
Train model: epoch=12
Train model: epoch=13
Train model: epoch=14
Train model: epoch=15
Train model: epoch=16
Train model: epoch=17
Train model: epoch=18
Train model: epoch=19
Train model: epoch=20
Train model: epoch=21
Train model: epoch=22
Train model: epoch=23
Train model: epoch=24
Train model: epoch=25
Train model: epoch=26
Train model: epoch=27
Train model: epoch=28
Train model: epoch=29
Train model: epoch=30
Train model: epoch=31
Train model: epoch=32
Train model: epoch=33
Train model: epoch=34
Train model: epoch=35
Train model: epoch=36
Train model: epoch=37
Train model: epoch=38
Train model: epoch=39
Train model: epoch=40
Train model: epoch=41
Train model: epoch=42
Train model: epoch
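
The training loop lowers `model.alpha` by a fixed step after every epoch. The sketch below replays that arithmetic in plain Python to check that the learning rate stays positive over the 100 epochs. The variable names are illustrative. With these values the rate would reach zero after 125 epochs (0.025 / 0.0002), so a much longer loop would need a smaller step.

```python
initial_alpha = 0.025   # 'alpha' from the hyperparameters
step = 0.0002           # decrement applied after every epoch
epochs = 100

alpha = initial_alpha
schedule = []
for epoch in range(epochs):
    schedule.append(alpha)   # value of model.alpha when train() is called
    alpha -= step

print(f'first epoch: {schedule[0]:.4f}')   # 0.0250
print(f'last epoch:  {schedule[-1]:.4f}')  # 0.0052
print(f'after loop:  {alpha:.4f}')         # 0.0050
```

An alternative is a single `model.train(...)` call with `epochs=100`, which leaves the decay from `alpha` to `min_alpha` to gensim.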

#### 3.3 Saving the model to disk

In [19]:
# Save the model
model_path = os.path.join("../../results_NLS/", 'doc2vec_df_1771_small.model')
model.save(model_path)
print(f'Saved model to {model_path}')

Saved model to ../../results_NLS/doc2vec_df_1771_small.model


#### 3.5 Obtaining the list of terms in df_1771_small

In [21]:
list_of_terms=df_1771_small["term"].to_list()
list_of_terms

['OR',
 'AABAM',
 'ASTRONOMY',
 'ABILITY',
 'ALCMANIAN',
 'ALCOA',
 'ALCOBACO',
 'ALCOHOL',
 'ALCOLA',
 'ALCORAN',
 'ALCORASTS',
 'ALCOST',
 'ALCOVE',
 'ALCOYTIN',
 'ABINGDON',
 'ALCYON',
 'ALCYONIUM',
 'ALDARU',
 'ALDBOROUGH',
 'ALDEA',
 'ALDEBAC',
 'ALDEBARAN',
 'ALDEGO',
 'ALDENAER',
 'ALDENBURG',
 'ABINTESTATE',
 'ALDERMAN',
 'ALDERNEY',
 'ALDII',
 'ALDROVANDA',
 'ALE',
 'ALEA',
 'ALEATORIUM',
 'ALEC',
 'ALECTORIA',
 'ALECTORICARDITES',
 'ABISHERING',
 'ALECTOROMANTIA',
 'ALEAGAR',
 'ALEGRETTE',
 'ALEIPHA',
 'ALEMBROTH',
 'ALENGNER',
 'ALENON',
 'ALENTEJO',
 'ALENZON',
 'ALEORE',
 'ABIT',
 'ALEPPO',
 'ALERION',
 'ALESSANO',
 'ALESSIO',
 'ALET',
 'ALETRIS',
 'ALEUROMANCY',
 'ALEXANDRETTA',
 'ALEXANDRIA',
 'ALEXANDRIAN',
 'ABJURATION',
 'ALEXANDRINUM',
 'ALEXICACUS',
 'ALEXITERIAL',
 'ALFAQUES',
 'ALFELD',
 'ALFET',
 'ALGA',
 'ALGAROT',
 'ALGARVA',
 'ALGEBRA',
 'ABJURATION',
 'BALGEBRA',
 'F',
 'ALGEBRA',
 'J',
 'ALGEBRA',
 'A',
 'ALOIIA',
 'EBRAALGX',
 'IBA',
 'ALGEBRA',
 'ABLAC',
 

#### 3.4 Testing the model - Similar Terms

Selecting the definition of the term ALCOHOL, which is at position 7 in **df_1771_small**.

In [33]:
text=df_1771_small.loc[7]["definition"]
text

'or Alkool, in ehemiftry, spirit of wine highly redtified. It ij also used for any highly redtified spirit.—Alcohol is extremely light and inflammable : It is a strong antifeptic, and therefore employee! to preserve animal ftlbftances. For tlie other qualities of alcohol, see Chemistry. Alcohol is also used for any fine impalpable powder. ALCOHOLIZATION, among chemists, the process of redtifying any spirit. It is also used for pulveriza-'

In [34]:
# model = load_model('doc2vec_df_1771_small.model')
cleaned_text = clean_text(text)
# Take just the 10 most similar documents, so topn=10
simdocs = most_similar(model, text, topn=10)

In [35]:
print("#### TEST 1 -- Doc2Vec -- Printing the details of the 10 most similar documents using Doc2Vec ")
for doc_id , rank in simdocs:
    term, definition = get_document(df_1771_small, doc_id)
    print("!! Using DocVec --- Document_id: %s - Rank %s - Details: Term %s, Definition: %s" %(doc_id, rank, term, definition))
    print("---")

#### TEST 1 -- Doc2Vec -- Printing the details of the 10 most similar documents using Doc2Vec 
INDEX IS 7
!! Using DocVec --- Document_id: 7 - Rank 0.9468585252761841 - Details: Term ALCOHOL, Definition: or Alkool, in ehemiftry, spirit of wine highly redtified. It ij also used for any highly redtified spirit.—Alcohol is extremely light and inflammable : It is a strong antifeptic, and therefore employee! to preserve animal ftlbftances. For tlie other qualities of alcohol, see Chemistry. Alcohol is also used for any fine impalpable powder. ALCOHOLIZATION, among chemists, the process of redtifying any spirit. It is also used for pulveriza-
---
INDEX IS 40
!! Using DocVec --- Document_id: 40 - Rank 0.9130399227142334 - Details: Term ALEIPHA, Definition: among ancient physicians, the name of animal or. vegetable oils, when used as unguents. ALEMBIC, in chemistry.. See Chemistry.
---
INDEX IS 64
!! Using DocVec --- Document_id: 64 - Rank 0.861436128616333 - Details: Term ALFET, Definition: i