# Exploring Terms in the Encyclopaedia Britannica

## Similar terms within an edition - Gensim - Doc2Vec

In this notebook we find similar articles/terms using the dataframe obtained with either the posprocess_eb.py script or the Merging_EB_Terms.ipynb notebook. Both methods produce the same dataframe.

We have selected the first edition for these explorations, but the notebook can be run with any of the other editions.

**Remark**: Edition 1 has 3 volumes and was printed twice, in 1771 and 1773.

These are the explorations that we are going to do:
- Create a new dataframe selecting just the first 1000 elements of the first volume of 1771. This is the dataframe that we use for the rest of this notebook.
- Create a training corpus from that dataframe
- Create a doc2vec model with the training corpus
- Save the model to disk
- Test the model using the term ABACUS.



### Loading the necessary libraries

In [1]:
import yaml
import matplotlib.pyplot as plt
import numpy as np
import collections
import matplotlib as mpl

In [2]:
import networkx as nx
import matplotlib.pyplot as plt

In [3]:
import pandas as pd
from yaml import safe_load
from pandas.io.json import json_normalize

In [4]:
import gensim
from gensim.models.doc2vec import Doc2Vec

In [5]:
from doc2vec_prep import stem_text, clean_text, generate_documents_df


In [6]:
from tqdm import tqdm
import os


#### Hyperparameters

In [7]:
# Init the Doc2Vec model
hyperparams  = {
    'dm': 1,
    'vector_size': 300,
    'window': 5,
    'alpha': 0.025,
    'min_alpha': 0.00025,
    'min_count': 2,
    'workers': 8
}

### Functions

In [8]:
def get_document(df, index):
    print("INDEX IS %s" %index)
    term = df.loc[index]["term"]
    definition = df.loc[index]["definition"]
    return term, definition

In [9]:
def most_similar(model, text, clean_func=clean_text, topn=None):
    vector = model.infer_vector(clean_func(text), epochs=100, alpha=model.alpha, min_alpha=model.min_alpha)
    simdocs = model.docvecs.most_similar(positive=[vector], topn=topn)
    return simdocs
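
The ranking step inside `most_similar` can be illustrated without gensim. The following is a minimal sketch, assuming the stored document vectors are compared with the inferred vector by cosine similarity, which is the measure gensim's `most_similar` reports. The helper name `rank_by_cosine` and the toy vectors are hypothetical and only serve the illustration.

```python
import numpy as np

def rank_by_cosine(query, doc_vectors, topn=5):
    """Return (row_index, cosine_similarity) pairs, best match first."""
    doc_vectors = np.asarray(doc_vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    # Normalise to unit length so that a dot product equals the cosine.
    doc_unit = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    sims = doc_unit @ query_unit
    # Sort by decreasing similarity and keep the first `topn` rows.
    order = np.argsort(-sims)[:topn]
    return [(int(i), float(sims[i])) for i in order]

# Toy 2-dimensional "document vectors" (illustrative values only).
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranking = rank_by_cosine([1.0, 0.0], toy_vectors, topn=2)
print(ranking)
```

In the real model the query vector comes from `infer_vector`, which is a fresh optimisation and not a lookup, so even a document that was part of the training corpus will usually score below 1.0 against its own stored vector.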

In [10]:
def load_model(filename):
    # `filename` is the full path to a saved model; return None if loading fails.
    try:
        return Doc2Vec.load(filename, mmap='r')
    except Exception:
        return None


## We have a dataframe with these columns

- definition:           Definition of a term
- editionNum:           1,2,3,4,5,6,7,8
- editionTitle:         Title of the edition
- header:               Header of the page on which the term appears
- place:                Place where the volume was edited (e.g. Edinburgh)
- relatedTerms:         Related terms (see X article)
- altoXML:              File path of the XML file to which the term belongs
- term:                 Term name
- positionPage:         Position of the term in the page
- startsAt:             Page number on which the term definition starts
- endsAt:               Page number on which the term definition ends
- volumeTitle:          Title of the volume
- typeTerm:             Type of term [Topic | Article]
- year:                 Year of the edition
- volumeNum:            Volume number (e.g. 1)
- letters:              Letters of the volume (e.g. A-B)
- part:                 Part of the volume (e.g. 1)
- supplementTitle:      Supplement's title
- supplementsTo:        Editions that it supplements [1, 2, 3, ...]
- numberOfWords:        Number of words per term definition
- numberOfTerms:        Number of terms per page
- numberOfPages:        Number of pages per volume

### 1. Load dataframe from JSON file

In [11]:
df = pd.read_json('../../results_NLS/results_eb_1_edition_dataframe', orient="index") 
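
As a small illustration of what `orient="index"` expects, the sketch below reads a two-record JSON string in which the outer keys become the row index and the inner keys become the columns. The JSON string is a hypothetical stand-in for the results file, whose exact layout is not shown in this notebook.

```python
import io
import pandas as pd

# Hypothetical two-record file: outer keys -> row index, inner keys -> columns.
raw = '{"0": {"term": "AABAM", "year": 1771}, "1": {"term": "AACH", "year": 1771}}'
toy_df = pd.read_json(io.StringIO(raw), orient="index")
print(toy_df)
```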

Now we are going to order the columns of our dataframe and visualise it.

In [12]:
df = df[["term", "definition", "relatedTerms", "header", "startsAt", "endsAt", "numberOfTerms","numberOfWords", "numberOfPages", \
             "positionPage", "typeTerm", "editionTitle", "editionNum", "supplementTitle", "supplementsTo",\
             "year", "place", "volumeTitle", "volumeNum", "letters", "part", "altoXML"]].reset_index(drop=True)

df


Unnamed: 0,term,definition,relatedTerms,header,startsAt,endsAt,numberOfTerms,numberOfWords,numberOfPages,positionPage,...,editionNum,supplementTitle,supplementsTo,year,place,volumeTitle,volumeNum,letters,part,altoXML
0,OR,"A NEW A D I C T I A A, the name of several riv...",[],EBAA,15,15,22,54,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
1,AABAM,"a term, among alchemifts, for lead,",[],EBAA,15,15,22,6,832,1,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
2,AACH,the name of a town and river in Swabia. It is ...,[],EBAA,15,15,22,17,832,2,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
3,AADE,"the name of two rivers, one in the country of ...",[],EBAA,15,15,22,19,832,3,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
4,AAHUS,a small town and diftrift in Weftphalia.,[],EBAA,15,15,22,7,832,4,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18112,ZAFFER,"01 Zafre, in chemistry, the name of a blue fuf...",[],YOAYUC,855,855,14,57,864,10,...,1,,[],1773,London,"Encyclopaedia Britannica: or, A dictionary of ...",3,M-Z,0,144850368/alto/188374994.34.xml
18113,ZAMORA,"antyof Spain in the province of Leon, fituater...",[],YOAYUC,855,855,14,31,864,11,...,1,,[],1773,London,"Encyclopaedia Britannica: or, A dictionary of ...",3,M-Z,0,144850368/alto/188374994.34.xml
18114,ZANGUEB,"AR a country on the east coast of Africa, situ...",[],YOAYUC,855,855,14,54,864,12,...,1,,[],1773,London,"Encyclopaedia Britannica: or, A dictionary of ...",3,M-Z,0,144850368/alto/188374994.34.xml
18115,ZANNICHELLIA,"in botany, a genus of the monoe ia monandria c...",[],YOAYUC,855,855,14,56,864,13,...,1,,[],1773,London,"Encyclopaedia Britannica: or, A dictionary of ...",3,M-Z,0,144850368/alto/188374994.34.xml


### 2. Selecting just the first 1000 elements of the first volume of 1771

In [13]:
df_1771_small = df[(df['year'] == 1771)].reset_index(drop=True)
df_1771_small = df_1771_small.head(1000).reset_index(drop=True)


In [14]:
#df_1771_small

### 2.1 Counting the number of terms

**Remember**: A term can appear more than once per edition.

In [15]:
len(df_1771_small)

1000
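
Because a term can occur more than once, the row count and the number of distinct terms can differ. The sketch below shows one way to check this on a toy dataframe that stands in for `df_1771_small`; the values are illustrative only.

```python
import pandas as pd

# Toy frame standing in for df_1771_small (values are illustrative).
toy = pd.DataFrame({"term": ["ABACUS", "ABACUS", "AACH", "AADE"]})

# Count how often each term occurs and keep the repeated ones.
counts = toy["term"].value_counts()
repeated = counts[counts > 1]

print(f"rows: {len(toy)}, distinct terms: {toy['term'].nunique()}")
print(repeated)
```

The same two lines applied to `df_1771_small["term"]` would list the terms that have several entries in the selection.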

### 3. Creating Document embeddings


Document embeddings (doc2vec) are an unsupervised method for learning vector representations of variable-length pieces of text, such as sentences and documents. We use the gensim Python library to create a document embedding model from the term and definition information of **df_1771_small**, which has only 1000 elements.



#### 3.1 Train Corpus

First we have to create a training corpus (train_documents) from the elements of **df_1771_small**. We use only the **term** and **definition** text to build the "text" for each row of this dataframe. Furthermore, we clean this dataset by applying a series of transformations (removing stop words, normalising, tokenising, etc.). We keep only the texts that have a minimum length of 5 words.

In [16]:
train_documents = list(tqdm(generate_documents_df(df_1771_small, clean_text, min_words=5)))
#train_documents = list(tqdm(generate_documents_df(df_1771_small, stem_text, min_words=5)))

Preprocessing function: clean_text
Minimum document length: 5 words


866it [01:39,  8.71it/s] 

Generated 866 description terms
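
The preparation step described in 3.1 can be sketched with the standard library alone. This is not the code of `generate_documents_df` or `clean_text`, which live in `doc2vec_prep` and are not shown here. The names `simple_clean`, `build_corpus` and `STOP_WORDS` are hypothetical, and the `TaggedDocument` namedtuple is a stand-in with the same two fields as gensim's class. Tagging each document with its row position is an assumption, suggested by the integer document ids returned later in the notebook.

```python
import re
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, which has the same two fields.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Tiny illustrative stop-word list; the real clean_text may use another.
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "it", "and", "for", "to"}

def simple_clean(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_corpus(rows, min_words=5):
    """Yield one tagged document per row that is long enough."""
    for position, (term, definition) in enumerate(rows):
        words = simple_clean(f"{term} {definition}")
        if len(words) >= min_words:
            # Tag with the row position so results map back to the rows.
            yield TaggedDocument(words=words, tags=[position])

# Illustrative rows, not taken verbatim from the dataframe.
rows = [
    ("AABAM", "a term for lead"),
    ("ABACUS", "an ancient instrument for facilitating operations in arithmetic"),
]
corpus = list(build_corpus(rows))
print(corpus)
```

The first toy row is dropped because only three words survive cleaning. A length filter of this kind would explain why the 1000 rows of `df_1771_small` yield 866 documents.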





#### 3.2 Creating the model

Once we have created our training corpus, we create the model. In this step we learn one document embedding per element of the training corpus.

In [17]:
print(f'Created {len(train_documents)} tagged documents.')
model = Doc2Vec(**hyperparams)
print('Build vocabulary')
model.build_vocab(train_documents)
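# Train for 100 passes, lowering the learning rate by hand after each pass.
for epoch in range(100):
    #print(f'Train model: epoch={epoch}')
    model.train(train_documents, total_examples=model.corpus_count, epochs=1)
    model.alpha -= 0.0002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay within a pass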

Created 866 tagged documents.
Build vocabulary
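
The learning-rate arithmetic of the training loop can be replayed without gensim. The sketch below only tracks the starting rate of each pass, using the `alpha` value from the hyperparameters and the decrement applied in the loop.

```python
# Replay the learning-rate arithmetic of the training loop, without gensim.
alpha = 0.025          # 'alpha' from hyperparams
step = 0.0002          # decrement applied after every pass
schedule = []
for epoch in range(100):
    schedule.append(alpha)   # starting rate for this pass
    alpha -= step

print(f"first pass: {schedule[0]:.4f}")
print(f"last pass:  {schedule[-1]:.4f}")
print(f"after loop: {alpha:.4f}")
```

Two consequences follow from this arithmetic. The `min_alpha` of 0.00025 set in the hyperparameters only applies to the first pass, since it is overwritten at the end of every iteration. After the loop both `model.alpha` and `model.min_alpha` are about 0.005, so the later call to `infer_vector` with these values runs at a constant rate. A single `model.train(..., epochs=100)` call, which lets gensim decay the rate from `alpha` to `min_alpha` itself, is the more common pattern, although it would not reproduce exactly the results shown here.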


#### 3.3 Saving the model to disk

In [18]:
# Save the model
model_path = os.path.join("../../results_NLS/", 'doc2vec_df_1771_small.model')
model.save(model_path)
print(f'Saved model to {model_path}')

Saved model to ../../results_NLS/doc2vec_df_1771_small.model


#### 3.4 Obtaining the list of terms in df_1771_small

In [19]:
list_of_terms=df_1771_small["term"].to_list()
#list_of_terms

#### 3.5 Testing the model - Similar Terms

Selecting the definition of the term ABACUS, which is at position 22 inside **df_1771_small**.

In [20]:
term=df_1771_small.loc[22]["term"]
term

'ABACUS'
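
Instead of hard-coding a position, the positions of a term can be looked up in the list of terms. The helper `positions_of` below is a hypothetical sketch, and the toy list only stands in for `list_of_terms`.

```python
def positions_of(term, terms):
    """Return every position at which `term` occurs in `terms`."""
    return [i for i, t in enumerate(terms) if t == term]

# Toy list standing in for list_of_terms (illustrative values only).
toy_terms = ["AABAM", "ABACUS", "AACH", "ABACUS"]
print(positions_of("ABACUS", toy_terms))
```

Since a term can have several entries, the lookup returns a list, and each position can then be passed to `df_1771_small.loc[...]` in turn.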

In [21]:
text=df_1771_small.loc[22]["definition"]
text

'is also the name of an ancient instrument for facilitating operations in arithmetic. It is vadoully contrived. That chiefly used in Europe is made by drawing any number of parallel lines at the di(lance of two diameters of one of the counters used in the calculation. A counter placed on.the lowed line, signifies r; on the sd, 10; on the 3d, 100; on the 4th, 1000, &c. In the intermediate spaces, the same counters are eflimated at one Jialf of the value of the line immediately superior, viz. between the id and 2d, 5; between the 2d and 3d, 50, &c. See plate I. fig. 2. A B, where the same number, 1768 for example, is represented under both by different dispositions of the counters.'

In [22]:
#model=load_model('../../results_NLS/doc2vec_df_1771_small.model')
# Not used below: most_similar applies clean_text to the raw text itself.
cleaned_text = clean_text(term+text)
#cleaned_text = stem_text(text)
# Just going to take the first 5 -- so topn=5
simdocs=most_similar(model, text, topn=5)

In [23]:
print("#### TEST 1 -- Doc2Vec -- Printing the details of the 5 most similar documents using Doc2Vec ")
for doc_id , rank in simdocs:
    term, definition = get_document(df_1771_small, doc_id)
    print("!! Using DocVec --- Document_id: %s - Rank %s - Details: Term %s, Definition: %s" %(doc_id, rank, term, definition))
    print("---")

#### TEST 1 -- Doc2Vec -- Printing the details of the 5 most similar documents using Doc2Vec 
INDEX IS 22
!! Using DocVec --- Document_id: 22 - Rank 0.9339599609375 - Details: Term ABACUS, Definition: is also the name of an ancient instrument for facilitating operations in arithmetic. It is vadoully contrived. That chiefly used in Europe is made by drawing any number of parallel lines at the di(lance of two diameters of one of the counters used in the calculation. A counter placed on.the lowed line, signifies r; on the sd, 10; on the 3d, 100; on the 4th, 1000, &c. In the intermediate spaces, the same counters are eflimated at one Jialf of the value of the line immediately superior, viz. between the id and 2d, 5; between the 2d and 3d, 50, &c. See plate I. fig. 2. A B, where the same number, 1768 for example, is represented under both by different dispositions of the counters.
---
INDEX IS 832
!! Using DocVec --- Document_id: 832 - Rank 0.5845749974250793 - Details: Term ANCESTORS, Defi