# Exploring Terms in the Encyclopaedia Britannica

## Summarize text - Gensim - Doc2Vec

In this notebook we summarize the text of term definitions, using the dataframe obtained with either the posprocess_eb.py script or the Merging_EB_Terms.ipynb notebook. Both methods produce the same dataframe.

We have selected the first edition for this exploration, but the notebook can be run with any of the other editions.

**Remark**: Edition 1 has 3 volumes, and it was printed twice, in 1771 and 1773.

Text summarization methods fall into two broad categories:

- **Extractive Summarization**: These methods extract several parts, such as phrases and sentences, from a piece of text and stack them together to create a summary. Identifying the right sentences is therefore of utmost importance in an extractive method.

- **Abstractive Summarization**: These methods use advanced NLP techniques to generate an entirely new summary. Some parts of this summary may not even appear in the original text.
 
Here we focus on **extractive summarization**, using two libraries:
- Gensim
- spaCy


### Loading the necessary libraries

In [1]:
import yaml
import matplotlib.pyplot as plt
import numpy as np
import collections
import matplotlib as mpl

In [2]:
import networkx as nx
import matplotlib.pyplot as plt

In [3]:
import pandas as pd
from yaml import safe_load
from pandas.io.json import json_normalize

In [4]:
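import gensim
# Note: the gensim.summarization module was removed in Gensim 4.0,
# so this import requires a Gensim 3.x installation.
from gensim.summarization.summarizer import summarize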

### Functions

In [5]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest


In [6]:
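def create_document(text):
    """Parse text with spaCy's small English model and return the Doc."""
    # The model is loaded on every call; load it once if calling repeatedly.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    return doc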

## The dataframe has these columns

- definition:           Definition of a term
- editionNum:           1,2,3,4,5,6,7,8
- editionTitle:         Title of the edition
- header:               Header of the page on which the term appears
- place:                Place where the volume was edited (e.g. Edinburgh)
- relatedTerms:         Related terms (see X article)
- altoXML:              File path of the XML file to which the term belongs
- term:                 Term name
- positionPage:         Position of the term in the page
- startsAt:             Page number on which the term definition starts
- endsAt:               Page number on which the term definition ends
- volumeTitle:          Title of the volume
- typeTerm:             Type of term [Topic| Articles]
- year:                 Year of the edition
- volumeNum:            Volume number (e.g. 1)
- letters:              Letters of the volume (A-B)
- part:                 Part of the volume (e.g. 1)
- supplementTitle:      Supplement's title
- supplementsTo:        Editions that it supplements [1, 2, 3....]
- numberOfWords:        Number of words per term definition
- numberOfTerms:        Number of terms per page
- numberOfPages:        Number of pages per volume

### 1. Load dataframe from JSON file

In [7]:
df = pd.read_json('../../results_NLS/results_eb_1_edition_dataframe', orient="index") 

Now we are going to order the columns of our dataframe and visualise it.

In [8]:
df = df[["term", "definition", "relatedTerms", "header", "startsAt", "endsAt", "numberOfTerms","numberOfWords", "numberOfPages", \
             "positionPage", "typeTerm", "editionTitle", "editionNum", "supplementTitle", "supplementsTo",\
             "year", "place", "volumeTitle", "volumeNum", "letters", "part", "altoXML"]].reset_index(drop=True)

df


Unnamed: 0,term,definition,relatedTerms,header,startsAt,endsAt,numberOfTerms,numberOfWords,numberOfPages,positionPage,...,editionNum,supplementTitle,supplementsTo,year,place,volumeTitle,volumeNum,letters,part,altoXML
0,OR,"A NEW A D I C T I A A, the name of several riv...",[],EncyclopaediaBritannica,15,15,22,54,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
1,AABAM,"a term, among alchemifts, for lead,",[],EncyclopaediaBritannica,15,15,22,6,832,1,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
2,AACH,the name of a town and river in Swabia. It is ...,[],EncyclopaediaBritannica,15,15,22,17,832,2,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
3,AADE,"the name of two rivers, one in the country of ...",[],EncyclopaediaBritannica,15,15,22,19,832,3,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
4,AAHUS,a small town and diftrift in Weftphalia.,[],EncyclopaediaBritannica,15,15,22,7,832,4,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27087,ZUYDERSEE,"a great bay of the German ocean, which lies in...",[],ZoDZYG,857,857,26,66,864,22,...,1,,[],1773,London,"Encyclopaedia Britannica: or, A dictionary of ...",3,M-Z,0,144850368/alto/188375020.34.xml
27088,ZWEIBRUGGEN,"a county of the palatinate of the Rhine, in Ge...",[SQALVS],ZoDZYG,857,857,26,23,864,23,...,1,,[],1773,London,"Encyclopaedia Britannica: or, A dictionary of ...",3,M-Z,0,144850368/alto/188375020.34.xml
27089,ZYGOMA,in anatomy. See Anatomy p. 152.,[],ZoDZYG,857,857,26,6,864,24,...,1,,[],1773,London,"Encyclopaedia Britannica: or, A dictionary of ...",3,M-Z,0,144850368/alto/188375020.34.xml
27090,ZYGOMATICUS,"in anatomy,. See Anatomy, p. 306,",[ANATOMY],ZoDZYG,857,857,26,6,864,25,...,1,,[],1773,London,"Encyclopaedia Britannica: or, A dictionary of ...",3,M-Z,0,144850368/alto/188375020.34.xml


### 2. Selecting the first 100 topics of the first volume of 1771

In [9]:
df_1771_small = df[(df['year'] == 1771) & (df['volumeNum'] == 1) & (df['typeTerm']=="Topic") ]
df_1771_small = df_1771_small.head(100).reset_index(drop=True)


In [10]:
df_1771_small

Unnamed: 0,term,definition,relatedTerms,header,startsAt,endsAt,numberOfTerms,numberOfWords,numberOfPages,positionPage,...,editionNum,supplementTitle,supplementsTo,year,place,volumeTitle,volumeNum,letters,part,altoXML
0,ABRABR,"(6 ABRASA, in surgery, ulcers, where the Ikin ...","[RENUNCIATION, ABRIDGEMENT, BUT, BUT, BUT, MIR...",ABRABR,20,21,1,1451,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082969.34.xml
1,AGRICULTURE,"The outer coat is extremely thin, and full of ...","[AQUIFOLIUM, AGRIMONIA, AGRIMONY, BRITAIN, AGR...",AGRICULTURE,61,100,1,56260,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188083505.34.xml
2,ALGEBRA,been imagined for representing their affefitio...,[],BoALGEBRA,112,149,1,53567,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188084168.34.xml
3,ALLIGATION,and almonds at 6 d. how many pounds of almonds...,"[LACERTA, ALLIGATOR-/><?RF>-, ALUM]",ALLIGATION,154,156,1,1743,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188084714.34.xml
4,AMPHICTETSTA,f /. AivlARXi.Li-S' or Ci-imfcm orient a! Xill...,[],AMPHICTiETSTAf,177,177,1,9,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188085013.34.xml
5,ANA,"ANAQUITO, a country of Peru, in South America,...","[ANTIRRHINUM, ANARRHOPIA, ANAS, AMERICA, RAY, ...",ANAi,179,180,1,1807,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188085039.34.xml
6,ANATOMY,a nd slide easily upon the bones. 2. To keep i...,"[BEFORE, BEHIND, PARTI, OCCIPITIS, BEFORE, BEH...",ANATOMY,186,360,1,224090,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188085130.34.xml
7,ANNUITIES,"Rule. By the preceding problem, find the prese...","[COLUBER, ANNULET, PROB, RULE, PROB, PROB, EXA...",ANNUITIES,369,373,1,6671,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188087509.34.xml
8,AIA,least apt to crack of any cement easily-to-be ...,[],AIAi,386,387,1,2515,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188087730.34.xml
9,ARCHITECTURE,dows ought always to be proportioned to that o...,"[FOR, ARCHITECTURE, COTURN, BASE, SHAFT, CAPIT...",ARCHITECTURE,401,443,1,25978,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188087925.34.xml


### 2.1 Counting the number of terms

**Remember**: A term can appear more than once per edition.

In [11]:
len(df_1771_small)

25

### 3. Summarize Text


### 3.1 Testing with Gensim

In [12]:
df_1771_small.loc[24, "term"]

'BREWING'

In [13]:
text = df_1771_small.loc[24, "definition"]
text

'hollow bags. Whett feeds are thus sufficiently malted, they mult be dried in malt-kilns, the fuel of which ftiould smoke as little as poflible.—The hulks mull now be broke open by malt-milns, and then infused or riialh’ed nn warm water, in order to extradt the faccharine substance; the heat applied Ihould be very slow and gradual. Thus the malt is diflblved, and lies till the liquor be sufficiently tindiured. When the malt is too long diffused, so that an acetous fermentation begins to take place, it Is called blinking, or foxing, by brewers. This tindlure obtained from the infusion of grinded malt, is commonly known by the name ok onort. We lhall now give an account of this process in the language and manner of the adhial brewer, which will probably be more acceptable than treating it in a philosophical manner. Of making Malt. The barley mull be put into a leaden or tiled cistern, that holds five, teh, or more quarters, and covered with Water four or six inches above the barley, to a

In [14]:
summarized_text=summarize(text)
summarized_text

"Whett feeds are thus sufficiently malted, they mult be dried in malt-kilns, the fuel of which ftiould smoke as little as poflible.—The hulks mull now be broke open by malt-milns, and then infused or riialh’ed nn warm water, in order to extradt the faccharine substance; the heat applied Ihould be very slow and gradual.\nThus it may lie and be worked on the floor in several parallels, two or three feet thick, ten or more feet broad, and fourteen or more in length, to chip or spire, but not too much nor too fast; and when it is come enough, it is to be turned twelve or sixteen times in twenty four hours, if the season is warm, as in March, April, or May; and when it is fixed, and the root begins to be dead then it must be thickened again, and carefully kept often turned and worked, that the growing of the root may not revive, and this is better done with the Sloes off than on : And here the Workman’s art and diligence ia particular is tried, in keeping the floor clear, and turning the ma

In [15]:
len(summarized_text)

49995

In [16]:
len(text)

111191

### 3.2 Testing with Spacy

This section follows the approach described in https://medium.com/analytics-vidhya/text-summarization-using-spacy-ca4867c6b744

In [17]:
doc=create_document(text)

In [18]:
len(list(doc.sents))

514

#### 3.2.1 Filtering tokens

In [19]:
keyword = []
stopwords = list(STOP_WORDS)
# Keep only content-bearing parts of speech.
pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
for token in doc:
    # Skip stop words and punctuation.
    if token.text in stopwords or token.text in punctuation:
        continue
    if token.pos_ in pos_tag:
        keyword.append(token.text)

In [20]:
freq_word = Counter(keyword)
freq_word.most_common(5)

[('malt', 137), ('beer', 112), ('water', 86), ('drink', 85), ('time', 82)]
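
The filter above compares `token.text` with the stop-word list as written, so a capitalised stop word that is tagged as one of the kept parts of speech (for example a sentence-initial "Other") may slip through. The following is a minimal sketch of a case-insensitive variant, not the notebook's method. It assumes tokens are available as `(text, pos)` pairs; `STOP_WORDS_SAMPLE`, `filter_keywords` and the hand-tagged sample tokens are hypothetical stand-ins, not spaCy data. Note that it also lower-cases the kept keywords, which merges "Malt" and "malt".

```python
from collections import Counter
from string import punctuation

# Hypothetical stand-in for spaCy's STOP_WORDS (illustration only).
STOP_WORDS_SAMPLE = {"the", "other", "is", "in", "be"}
KEPT_POS = {"PROPN", "ADJ", "NOUN", "VERB"}


def filter_keywords(tagged_tokens):
    """Keep content words, comparing stop words case-insensitively."""
    keywords = []
    for text, pos in tagged_tokens:
        lowered = text.lower()
        # Skip stop words (any capitalisation) and punctuation marks.
        if lowered in STOP_WORDS_SAMPLE or text in punctuation:
            continue
        if pos in KEPT_POS:
            keywords.append(lowered)
    return keywords


# Hand-tagged sample tokens (assumed tags, not spaCy output).
sample_tokens = [
    ("Other", "ADJ"), ("malt", "NOUN"), ("is", "AUX"), ("dried", "VERB"),
    (",", "PUNCT"), ("Malt", "NOUN"), ("kilns", "NOUN"), (".", "PUNCT"),
]
keyword_counts = Counter(filter_keywords(sample_tokens))
print(keyword_counts.most_common(2))
```

With real spaCy tokens, the `token.is_stop` and `token.is_punct` attributes offer a similar check without a hand-built list.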

#### 3.2.2 Normalization 

In [21]:
max_freq = Counter(keyword).most_common(1)[0][1]
for word in freq_word.keys():
    freq_word[word] = (freq_word[word]/max_freq)
freq_word.most_common(5)

[('malt', 1.0),
 ('beer', 0.8175182481751825),
 ('water', 0.6277372262773723),
 ('drink', 0.6204379562043796),
 ('time', 0.5985401459854015)]
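
The normalization step divides every count by the largest one, so the most frequent keyword gets weight 1.0 and all others fall between 0 and 1. A minimal sketch using the five counts reported above; `normalize_counts` is a hypothetical helper name, and unlike the cell above it returns a new dictionary instead of overwriting the `Counter` in place.

```python
from collections import Counter


def normalize_counts(counts):
    """Scale counts so that the most frequent key maps to 1.0."""
    if not counts:
        return {}
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}


# The five most common keywords reported for the BREWING article.
top_counts = Counter(
    {"malt": 137, "beer": 112, "water": 86, "drink": 85, "time": 82}
)
weights = normalize_counts(top_counts)
print(weights["malt"], weights["beer"])  # 137 / 137 and 112 / 137
```

Keeping the raw counts separate from the weights means the original frequencies stay available for later inspection.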

#### 3.2.3 Weighing sentences

In [22]:
sent_strength={}
for sent in doc.sents:
    for word in sent:
        if word.text in freq_word.keys():
            if sent in sent_strength.keys():
                sent_strength[sent]+=freq_word[word.text]
            else:
                sent_strength[sent] = freq_word[word.text]
#print(sent_strength)
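
Because each sentence score is a sum over its words, long sentences tend to score highest, which may help explain why the three selected sentences later add up to 4335 characters. The sketch below contrasts the summed score with a length-averaged one. The averaging is an alternative introduced here for illustration and is not part of the notebook; `score_sentences`, the toy weights and the toy sentences are all hypothetical, and sentences are assumed to be plain lists of word strings.

```python
def score_sentences(sentences, weights, average=False):
    """Score each sentence by the summed (or averaged) weight of its words.

    `sentences` is a list of token lists; words missing from `weights`
    contribute nothing to the sum.
    """
    scores = []
    for tokens in sentences:
        total = sum(weights.get(token, 0.0) for token in tokens)
        if average and tokens:
            total /= len(tokens)
        scores.append(total)
    return scores


# Toy weights and sentences (hypothetical values, for illustration only).
toy_weights = {"malt": 1.0, "beer": 0.8, "water": 0.6}
toy_sentences = [
    ["malt", "and", "water"],
    ["the", "beer", "and", "the", "water", "and", "the", "malt", "are", "mixed"],
]
summed = score_sentences(toy_sentences, toy_weights)
averaged = score_sentences(toy_sentences, toy_weights, average=True)
print(summed, averaged)
```

In this toy case the summed score ranks the longer sentence first, while the averaged score ranks the shorter one first. Unlike the cell above, the sketch also assigns a score of 0 to sentences that contain no keyword.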

#### 3.2.4 Summarizing the string

The nlargest function returns a list containing the 3 highest-scoring sentences, which are stored as summarized_sentences.

In [23]:
summarized_sentences = nlargest(3, sent_strength, key=sent_strength.get)
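
`nlargest` returns the selected items ordered by score, not by their position in the text, so a summary built from them may not follow the original reading order. A minimal sketch of restoring document order, assuming each sentence is tracked by its index; the sentences and scores are hypothetical.

```python
from heapq import nlargest

# Hypothetical sentences and scores, listed in document order.
sentences = ["First step.", "Second step.", "Third step.", "Fourth step."]
scores = [0.4, 0.7, 0.1, 0.9]

# Pick the indices of the two highest-scoring sentences (best first).
top_indices = nlargest(2, range(len(sentences)), key=lambda i: scores[i])

# Sort the chosen indices to restore document order before joining.
ordered_summary = " ".join(sentences[i] for i in sorted(top_indices))
print(ordered_summary)
```

With spaCy sentence spans, sorting the selected spans by their `start` attribute should have the same effect.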


In [24]:
final_sentences = [w.text for w in summarized_sentences]
summary =' '.join(final_sentences)
print(summary)

In boiling, both time and the curdling or breaking of the wort should be consulted ; for if a person was to boil the wort an hour, and then take it out of the copper before it was rightly broke, it would be wrong management, and the drink would not be fine and wholesome; and if it Ihould boil an hour and a half, or two hours, without regarding when its particles are in a right order, then it may be too thick; so that due care must be had to the two extremes, to obtain it in its due order; therefore, in Odtober and keeping beers, an hour and a quarter’s good boiling is commonly sufficient to have a thorough cured drink; for generally in that time it will break and boil enough ; because in this there is a double security by length of boiling, and a quantity of hops shifted ; but in the new way there is only a single one, and that is by a double or treble allowance of fresh hops boiled only half an hour in the wort; and for this practice a reason is afligned, that the hops, being endowed 

In [25]:
len(summary)

4335
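
To compare the two summaries, the lengths reported in this notebook can be turned into compression ratios. A small sketch; `compression_ratio` is a hypothetical helper, and the character counts are the values printed by the cells above.

```python
def compression_ratio(summary_length, original_length):
    """Return the summary length as a fraction of the original length."""
    if original_length <= 0:
        raise ValueError("original_length must be positive")
    return summary_length / original_length


ORIGINAL_CHARS = 111191       # len(text)
GENSIM_SUMMARY_CHARS = 49995  # len(summarized_text)
SPACY_SUMMARY_CHARS = 4335    # len(summary)

gensim_ratio = compression_ratio(GENSIM_SUMMARY_CHARS, ORIGINAL_CHARS)
spacy_ratio = compression_ratio(SPACY_SUMMARY_CHARS, ORIGINAL_CHARS)
print(f"Gensim: {gensim_ratio:.1%}, spaCy: {spacy_ratio:.1%}")
```

On these numbers the Gensim summary keeps roughly 45% of the characters of the BREWING definition, while the three-sentence spaCy summary keeps about 4%.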