# Exploring Terms in the Encyclopaedia Britannica

## Summarize text - Gensim - Doc2Vec

In this notebook we summarize the text of term definitions, using the dataframe obtained with either the posprocess_eb.py script or the Merging_EB_Terms.ipynb notebook. Both methods produce the same dataframe.

We have selected the first edition for these explorations, but this notebook can be run with any of the other editions.

**Remark**: Edition 1 has 3 volumes and was printed twice, in 1771 and 1773.

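These are the explorations that we are going to do:

- Load the dataframe from a JSON file
- Select the first Topic elements of the first volume of 1771
- Count the number of terms
- Summarize the text of a term definition
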
### Loading the necessary libraries

In [1]:
import yaml
import matplotlib.pyplot as plt
import numpy as np
import collections
import matplotlib as mpl

In [2]:
import networkx as nx

In [3]:
import pandas as pd
from yaml import safe_load
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0

In [4]:
import gensim
# gensim.summarization was removed in gensim 4.0, so this import needs gensim 3.x
from gensim.summarization.summarizer import summarize

### Functions

In [5]:
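def get_document(df, index):
    """Return the (term, definition) pair stored at row label `index`."""
    print("INDEX IS %s" % index)
    # Select row and column in a single .loc call instead of chained indexing
    term = df.loc[index, "term"]
    definition = df.loc[index, "definition"]
    return term, definition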
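
As a quick check of how `get_document` behaves, the sketch below calls it on a tiny hand-built dataframe. The dataframe `toy_df` is hypothetical (its two rows reuse short definitions shown later in this notebook), the function is repeated so that the block runs on its own, and pandas is assumed to be installed.

```python
import pandas as pd


def get_document(df, index):
    """Return the (term, definition) pair stored at row label `index`."""
    print("INDEX IS %s" % index)
    return df.loc[index, "term"], df.loc[index, "definition"]


# Hypothetical two-row dataframe with the same column names as the real one.
toy_df = pd.DataFrame(
    {
        "term": ["AABAM", "ASTRONOMY"],
        "definition": [
            "a term, among alchemifts, for lead,",
            "Of the divijion of time.",
        ],
    }
)

# The index is the default RangeIndex, so label 0 is the first row.
term, definition = get_document(toy_df, 0)
print(term, "->", definition)
```

Because the function uses `.loc`, `index` is a row label and not a position; after `reset_index(drop=True)` the two coincide.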

## The dataframe has these columns

- definition:           Definition of a term
- editionNum:           1,2,3,4,5,6,7,8
- editionTitle:         Title of the edition
- header:               Header of the page on which the term appears
- place:                Place where the volume was edited (e.g. Edinburgh)
- relatedTerms:         Related terms (see X article)
- altoXML:              File path of the XML file to which the term belongs
- term:                 Term name
- positionPage:         Position of the term on the page
- startsAt:             Page number on which the term definition starts
- endsAt:               Page number on which the term definition ends
- volumeTitle:          Title of the Volume
- typeTerm:             Type of term [Topic| Articles]                                       
- year:                 Year of the edition
- volumeNum:            Volume number (e.g. 1)
- letters:              Letters of the volume (e.g. A-B)
- part:                 Part of the volume (e.g. 1)
- supplementTitle:      Supplement's title
- supplementsTo:        Editions that it supplements (e.g. [1, 2, 3])
- numberOfWords:        Number of words per term definition
- numberOfTerms:        Number of terms per page
- numberOfPages:        Number of pages per volume

### 1. Load dataframe from JSON file

In [6]:
df = pd.read_json('../../results_NLS/results_eb_1_edition_dataframe', orient="index") 
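
To make the `orient="index"` argument concrete, here is a minimal sketch of the JSON layout that it expects: each top-level key is a row label and each value maps column names to cell values. The JSON string below is a hypothetical miniature, not the real results file, and pandas is assumed to be installed.

```python
import io

import pandas as pd

# Hypothetical miniature of an index-oriented JSON file:
# {row label: {column name: value, ...}, ...}
raw_json = """
{
  "0": {"term": "AABAM", "year": 1771, "volumeNum": 1},
  "1": {"term": "OPAL",  "year": 1771, "volumeNum": 3}
}
"""

# Wrapping the string in StringIO lets read_json treat it like a file.
toy_df = pd.read_json(io.StringIO(raw_json), orient="index")
print(toy_df)
```

With this orientation the outer keys become rows, which is why the real dataframe has one row per term.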

Now we are going to order the columns of our dataframe and visualise it.

In [7]:
df = df[["term", "definition", "relatedTerms", "header", "startsAt", "endsAt", "numberOfTerms", "numberOfWords", "numberOfPages",
         "positionPage", "typeTerm", "editionTitle", "editionNum", "supplementTitle", "supplementsTo",
         "year", "place", "volumeTitle", "volumeNum", "letters", "part", "altoXML"]].reset_index(drop=True)

df


Unnamed: 0,term,definition,relatedTerms,header,startsAt,endsAt,numberOfTerms,numberOfWords,numberOfPages,positionPage,...,editionNum,supplementTitle,supplementsTo,year,place,volumeTitle,volumeNum,letters,part,altoXML
0,OR,"A NEW A D I C T I A A, the name of several riv...",[],EncyclopaediaBritannica,15,15,22,54,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
1,AABAM,"a term, among alchemifts, for lead,",[],EncyclopaediaBritannica,15,15,22,6,832,1,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
2,ASTRONOMY,Of the divijion of time.,[],EncyclopaediaBritannica,15,15,22,5,832,10,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
3,ABILITY,"a term in law, denoting a power of doing certa...",[],ABLABR,19,19,37,17,832,3,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082956.34.xml
4,ALCMANIAN,"in ancient lyric poetry, a kind of verse consi...",[],ALcALC,109,109,21,17,832,15,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188084129.34.xml
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27422,OOSTERGO,"the north division of West Friefland, one of t...",[],ONOOPA,480,480,24,11,872,21,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",3,M-Z,0,144133903/alto/144810307.34.xml
27423,OPACITY,"in philosophy, a quality of bodies which rende...",[],ONOOPA,480,480,24,16,872,22,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",3,M-Z,0,144133903/alto/144810307.34.xml
27424,OPAL,"in natural history, a species of gems. The opa...",[],ONOOPA,480,481,24,251,872,23,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",3,M-Z,0,144133903/alto/144810307.34.xml
27425,OP,"ALIA, in antiquity, feasts celebrated at Rome ...",[],OPHOPI,481,481,17,63,872,1,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",3,M-Z,0,144133903/alto/144810319.34.xml


### 2. Selecting the first 100 Topic elements of the first volume of 1771

In [8]:
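# Keep only "Topic" entries from volume 1 of the 1771 printing
df_1771_small = df[(df['year'] == 1771) & (df['volumeNum'] == 1) & (df['typeTerm'] == "Topic")]
# head(100) returns at most 100 rows; fewer are returned if fewer match the filter
df_1771_small = df_1771_small.head(100).reset_index(drop=True)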
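
The selection above combines three boolean masks and then truncates the result. The sketch below reproduces that pattern on a hypothetical four-row dataframe (the values, including the `"Article"` label, are made up for illustration) to show that `head(100)` does not pad the result when fewer rows match. pandas is assumed to be installed.

```python
import pandas as pd

# Hypothetical rows; only the three columns used by the filter are included.
toy_df = pd.DataFrame(
    {
        "year": [1771, 1771, 1771, 1773],
        "volumeNum": [1, 1, 3, 1],
        "typeTerm": ["Topic", "Article", "Topic", "Topic"],
    }
)

# Each comparison yields a boolean Series; `&` combines them element-wise.
# The parentheses are needed because `&` binds more tightly than `==`.
mask = (
    (toy_df["year"] == 1771)
    & (toy_df["volumeNum"] == 1)
    & (toy_df["typeTerm"] == "Topic")
)

# head(100) keeps at most 100 rows, so a short list of matches stays short.
subset = toy_df[mask].head(100).reset_index(drop=True)
print(len(subset))
```

Only the first row satisfies all three conditions, so `subset` has a single row even though 100 were requested.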


In [9]:
df_1771_small

Unnamed: 0,term,definition,relatedTerms,header,startsAt,endsAt,numberOfTerms,numberOfWords,numberOfPages,positionPage,...,editionNum,supplementTitle,supplementsTo,year,place,volumeTitle,volumeNum,letters,part,altoXML
0,BALGEBRA,been imagined for representing their affefitio...,[],BoALGEBRA,112,113,1,1315,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188084168.34.xml
1,ALGEBRA,"82 A L G E R A. root, to the right hand, a fig...",[],BALGERA,114,149,1,4643,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188084194.34.xml
2,ALGEBRA,"ALGEBRA.ence (viz. unit), and that in the last...",[],ALGEBRA,118,129,1,11084,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188084246.34.xml
3,ALOIIA,But though x and y are not quadratic furds or ...,[],ooALOIIA,132,132,1,912,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188084428.34.xml
4,EBRAALGX,R A; 102 A L G whence it is produced. The squa...,[],EBRAALGX,134,134,1,873,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188084454.34.xml
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,AT,'Jf>'6 V m s m,[],teAT,772,772,1,5,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188092748.34.xml
59,BREWING,hollow bags. Whett feeds are thus sufficiently...,[],BREWING,798,805,1,9474,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188093086.34.xml
60,TING,"T7 I N G. B R E Pale Balls, Are made tn the sa...",[],TING,806,806,1,1180,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188093190.34.xml
61,BREWING,"Ji-ole for a night together, that the (leant o...","[CAPPARIS, ANCON, I>AUPHINY, FRANCE, BRIAR]",BREWING,807,808,1,1677,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188093203.34.xml


### 2.1 Counting the number of terms

**Remember**: A term can appear more than once per edition.

In [10]:
len(df_1771_small)

63
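
Since a term can be repeated, the row count and the number of distinct terms can differ. The sketch below counts repetitions with the standard library, using only the terms visible in the truncated dataframe display above (not all 63 rows); the variable names are illustrative.

```python
from collections import Counter

# Terms copied from the rows visible in the dataframe display
# (the display is truncated, so this is not the full selection).
visible_terms = [
    "BALGEBRA", "ALGEBRA", "ALGEBRA", "ALOIIA", "EBRAALGX",
    "AT", "BREWING", "TING", "BREWING",
]

counts = Counter(visible_terms)

# Keep only the terms that occur more than once.
repeated = {term: n for term, n in counts.items() if n > 1}
print(repeated)
print(len(visible_terms), "rows,", len(counts), "distinct terms")
```

On the real dataframe, `df_1771_small["term"].value_counts()` would give the same kind of tally.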

### 3. Summarize Text



In [11]:
text = df_1771_small.loc[62, "definition"]
text

'The outer coat is extremely thin, and full of pores ; but may be easily separated from the inner one, (which is much thicker), after the bean has been boiled, or lain • a few days in the soil. At the thick end of the bean, there is a small hole vilible to the naked eye, immediately over the radicle or future root, that it may have a free passage into the soil. Plate IV. fig. i. A. When these coats are taken off, the body of the feed appears, which is divided into two smooth portions or lobes. The smoothness of tire lobes is owing to a thin\' film or cuticle with which they are covered. At the bafrs of the bean is placed the radicle or future root, Plate IV. fig. 3. A. The trunk of the radicle, just as it enters into the body of the feed, divides into two capital branches, one of which is inferred into each lobe, and sends off smaller ones in all direftions through the whole substance of the lobes, Plate IV. fig. 7. A A. These ramifications become so extremely minute towards the edges 

In [12]:
summarized_text = summarize(text)
summarized_text

'At the thick end of the bean, there is a small hole vilible to the naked eye, immediately over the radicle or future root, that it may have a free passage into the soil.\nTowards the extremity of the radicle, it is one entire trunk ; but higher up, it divides into three branches.; the middle one runs direcftly up to the plume, and the other two pass into the lobes on each side, .and spread out into a great variety of small branches through the whole body of the lobes, .Plate IV.\n7. This substance is very properly termed the feminal root: for when the feed is sown, the moisture is first absorbed by the outer coats, which are every where furniflied with sap and air-vessels; from these it is conveyed to the cuticle; ’ from the cuticle it proceeds to the pulpy part of the lobes ; when it has got thus far, it is taken up by the mouths of the small branches of the feminal root, and passes from one branch into another, till it is all collefteft into the main trunk, which communicates both w
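
gensim documents `summarize` as an extractive summarizer based on a variation of TextRank: it returns sentences taken verbatim from the input and does not generate new text. To illustrate the idea only, here is a simplified, standard-library sketch of TextRank over sentences. It is not gensim's implementation, which differs in preprocessing and in its similarity function; the function names, the naive sentence splitter and the sample text are all assumptions made for this example.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(words_a, words_b):
    """Word overlap, normalised by the log of the two vocabulary sizes."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0
    overlap = len(words_a & words_b)
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_summary(text, n_sentences=2, damping=0.85, iterations=50):
    """Return the n highest-ranked sentences, in their original order."""
    sentences = split_sentences(text)
    bags = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    n = len(sentences)
    # Weighted graph: one node per sentence, edge weight = similarity.
    weights = [
        [0.0 if i == j else similarity(bags[i], bags[j]) for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    # Weighted PageRank update, repeated a fixed number of times.
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


# Hypothetical sample text: three related sentences and one unrelated one.
sample = (
    "The bean has an outer coat and an inner coat. "
    "The outer coat of the bean is thin and full of pores. "
    "The inner coat of the bean is much thicker. "
    "Zebras gallop quickly."
)
summary = textrank_summary(sample, n_sentences=2)
print(" ".join(summary))
```

The unrelated sentence shares no words with the others, so it receives only the baseline score and is never selected. This is the behaviour that makes graph-based ranking favour sentences that are central to the text.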

In [13]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest

In [14]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

In [15]:
len(list(doc.sents))

1635

In [16]:
keyword = []
stopwords = list(STOP_WORDS)
# Keep only content-bearing parts of speech
pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
for token in doc:
    # Skip stop words and punctuation tokens
    if token.text in stopwords or token.text in punctuation:
        continue
    if token.pos_ in pos_tag:
        keyword.append(token.text)
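
The earlier imports of `Counter` and `nlargest` suggest that the collected keywords feed a frequency-based ranking. The sketch below shows one common way to do that, as an assumption and not necessarily the approach this notebook takes: count the keywords, normalise by the highest count, and score each sentence by the summed weight of its words. The keyword list and sentences are hypothetical stand-ins, and whitespace splitting replaces spaCy tokenisation so that the block runs without a language model.

```python
from collections import Counter
from heapq import nlargest

# Hypothetical stand-in for the `keyword` list built by the loop above.
keyword = ["bean", "root", "lobes", "bean", "radicle", "root", "bean"]

freq = Counter(keyword)
max_count = max(freq.values())

# Normalise so that the most frequent keyword has weight 1.0.
weights = {word: count / max_count for word, count in freq.items()}

# Hypothetical sentences, each scored by the summed weight of its words.
sentences = [
    "the bean has a radicle",
    "the root enters the lobes",
    "the plate shows a figure",
]
scores = {s: sum(weights.get(w, 0.0) for w in s.split()) for s in sentences}

# Pick the single highest-scoring sentence.
best = nlargest(1, scores, key=scores.get)
print(best)
```

Summing weights favours long sentences, so dividing each score by the sentence length is a common variation.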