# Exploring Terms in the Encyclopaedia Britannica

## Summarize text - Gensim - Doc2Vec

In this notebook we summarize the text of term definitions, using the dataframe obtained with either the posprocess_eb.py script or the Merging_EB_Terms.ipynb notebook. Both methods produce the same dataframe.

We have selected the first edition for this exploration, but the notebook can be run with any of the other editions.

**Remark**: Edition 1 has 3 volumes and was printed twice, in 1771 and 1773.

Text summarization methods fall into two broad categories:

- **Extractive Summarization**: These methods rely on extracting several parts, such as phrases and sentences, from a piece of text and stacking them together to create a summary. Therefore, identifying the right sentences for summarization is of utmost importance in an extractive method.

- **Abstractive Summarization**: These methods use advanced NLP techniques to generate an entirely new summary. Some parts of this summary may not even appear in the original text.
 
Here we focus on the **extractive summarization** technique, using two libraries:
- Gensim
- spaCy


### Loading the necessary libraries

In [1]:
import yaml
import matplotlib.pyplot as plt
import numpy as np
import collections
import matplotlib as mpl

In [2]:
import networkx as nx
import matplotlib.pyplot as plt

In [3]:
import pandas as pd
from yaml import safe_load
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0

In [4]:
import gensim
from gensim.summarization.summarizer import summarize
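
The `gensim.summarization` module was removed in Gensim 4.0, so the import above only works with a Gensim 3.x installation. A minimal sketch of a guarded import follows, assuming you prefer a descriptive error over a bare `ImportError`. The `require_summarize` helper is a hypothetical name introduced for illustration; it is not part of Gensim.

```python
# Guarded import: gensim.summarization exists in Gensim 3.x only.
try:
    from gensim.summarization.summarizer import summarize
except ImportError:
    summarize = None  # Gensim is missing, or it is a release without this module


def require_summarize():
    """Return Gensim's summarize function, or raise a descriptive error.

    Hypothetical helper for this notebook; it is not a Gensim API.
    """
    if summarize is None:
        raise RuntimeError(
            "gensim.summarization is unavailable; install a Gensim 3.x "
            "release to run the Gensim section of this notebook."
        )
    return summarize
```

Calling `require_summarize()` at the start of section 3.1 would then surface the version problem before any text is processed.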

### Functions

In [5]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest


In [6]:
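def create_document(text):
    """Parse text with spaCy's small English pipeline and return the Doc."""
    # Note: the model is loaded again on every call.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    return doc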
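
Because `create_document` calls `spacy.load` each time it runs, summarizing many definitions would reload the model repeatedly. One possible variation, sketched below, caches the pipeline with `functools.lru_cache`. The names `get_pipeline` and `create_document_cached` are hypothetical and are not used elsewhere in this notebook.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline once per model name and reuse it afterwards."""
    import spacy  # imported lazily so that defining this helper stays cheap

    return spacy.load(model_name)


def create_document_cached(text, model_name="en_core_web_sm"):
    """Return the same kind of Doc as create_document, without reloading the model."""
    return get_pipeline(model_name)(text)
```

The first call pays the loading cost; later calls with the same model name reuse the cached pipeline.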

## We have a dataframe with these columns

- definition:           Definition of a term
- editionNum:           Edition number (1 to 8)
- editionTitle:         Title of the edition
- header:               Header of the page containing the term
- place:                Place where the volume was edited (e.g. Edinburgh)
- relatedTerms:         Related terms (see X article)
- altoXML:              File path of the XML file to which the term belongs
- term:                 Term name
- positionPage:         Position of the term on the page
- startsAt:             Page number on which the term definition starts
- endsAt:               Page number on which the term definition ends
- volumeTitle:          Title of the volume
- typeTerm:             Type of term [Topic | Articles]
- year:                 Year of the edition
- volumeNum:            Volume number (e.g. 1)
- letters:              Letters covered by the volume (e.g. A-B)
- part:                 Part of the volume (e.g. 1)
- supplementTitle:      Supplement's title
- supplementsTo:        Editions that the supplement belongs to [1, 2, 3, ...]
- numberOfWords:        Number of words per term definition
- numberOfTerms:        Number of terms per page
- numberOfPages:        Number of pages per volume

### 1. Load dataframe from JSON file

In [7]:
df = pd.read_json('../../results_NLS/results_eb_1_edition_dataframe', orient="index") 
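
With `orient="index"`, pandas expects a JSON object whose keys are row labels and whose values are objects mapping column names to values. The sketch below illustrates that layout using only the standard library. The two records are abbreviated from rows displayed in this notebook, the row labels `"0"` and `"1"` are illustrative assumptions, and the real file contains many more fields.

```python
import json

# Layout that orient="index" expects: {row_label: {column: value, ...}, ...}
raw = """
{
  "0": {"term": "AABAM", "year": 1771, "volumeNum": 1},
  "1": {"term": "ZYGOMA", "year": 1773, "volumeNum": 3}
}
"""

records = json.loads(raw)

# Each top-level key becomes a row label, each inner object becomes one row.
rows = [dict(index=label, **fields) for label, fields in records.items()]
for row in rows:
    print(row)
```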

Now we are going to order the columns of our dataframe and visualise it.

In [8]:
df = df[["term", "definition", "relatedTerms", "header", "startsAt", "endsAt", "numberOfTerms", "numberOfWords", "numberOfPages",
         "positionPage", "typeTerm", "editionTitle", "editionNum", "supplementTitle", "supplementsTo",
         "year", "place", "volumeTitle", "volumeNum", "letters", "part", "altoXML"]].reset_index(drop=True)

df


Unnamed: 0,term,definition,relatedTerms,header,startsAt,endsAt,numberOfTerms,numberOfWords,numberOfPages,positionPage,...,editionNum,supplementTitle,supplementsTo,year,place,volumeTitle,volumeNum,letters,part,altoXML
0,OR,"A NEW A D I C T I A A, the name of several riv...",[],EncyclopaediaBritannica,15,15,22,54,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
1,AABAM,"a term, among alchemifts, for lead,",[],EncyclopaediaBritannica,15,15,22,6,832,1,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
2,AACH,the name of a town and river in Swabia. It is ...,[],EncyclopaediaBritannica,15,15,22,17,832,2,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
3,AADE,"the name of two rivers, one in the country of ...",[],EncyclopaediaBritannica,15,15,22,19,832,3,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
4,AAHUS,a small town and diftrift in Weftphalia.,[],EncyclopaediaBritannica,15,15,22,7,832,4,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082904.34.xml
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27422,ZUYDERSEE,"a great bay of the German ocean, which lies in...",[],ZoDZYG,857,857,27,66,864,22,...,1,,[],1773,London,"Encyclopaedia Britannica: or, A dictionary of ...",3,M-Z,0,144850368/alto/188375020.34.xml
27423,ZWEIBRUGGEN,"a county of the palatinate of the Rhine, in Ge...",[SQALVS],ZoDZYG,857,857,27,23,864,23,...,1,,[],1773,London,"Encyclopaedia Britannica: or, A dictionary of ...",3,M-Z,0,144850368/alto/188375020.34.xml
27424,ZYGOMA,in anatomy. See Anatomy p. 152.,[],ZoDZYG,857,857,27,6,864,24,...,1,,[],1773,London,"Encyclopaedia Britannica: or, A dictionary of ...",3,M-Z,0,144850368/alto/188375020.34.xml
27425,ZYGOMATICUS,"in anatomy,. See Anatomy, p. 306,",[ANATOMY],ZoDZYG,857,857,27,6,864,25,...,1,,[],1773,London,"Encyclopaedia Britannica: or, A dictionary of ...",3,M-Z,0,144850368/alto/188375020.34.xml


### 2. Selecting up to the first 100 topics of the first volume of 1771

In [9]:
df_1771_small = df[(df['year'] == 1771) & (df['volumeNum'] == 1) & (df['typeTerm']=="Topic") ]
df_1771_small = df_1771_small.head(100).reset_index(drop=True)


In [10]:
df_1771_small

Unnamed: 0,term,definition,relatedTerms,header,startsAt,endsAt,numberOfTerms,numberOfWords,numberOfPages,positionPage,...,editionNum,supplementTitle,supplementsTo,year,place,volumeTitle,volumeNum,letters,part,altoXML
0,ABRABR,"(6 ABRASA, in surgery, ulcers, where the Ikin ...","[RENUNCIATION, ABRIDGEMENT, BUT, BUT, BUT, MIR...",ABRABR,20,21,1,1451,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188082969.34.xml
1,AGRICULTURE,"The outer coat is extremely thin, and full of ...","[AQUIFOLIUM, AGRIMONIA, AGRIMONY, BRITAIN, AGR...",AGRICULTURE,61,100,1,31721,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188083505.34.xml
2,BALGEBRA,been imagined for representing their affefitio...,[],BoALGEBRA,112,113,1,1315,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188084168.34.xml
3,ALGEBRA,"82 A L G E R A. root, to the right hand, a fig...",[],BALGERA,114,149,1,4643,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188084194.34.xml
4,ALGEBRA,"ALGEBRA.ence (viz. unit), and that in the last...",[],ALGEBRA,118,129,1,11084,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188084246.34.xml
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,BOTANY,"mud first eat, before any certain conclusion c...","[LIU, MONOCLINIA, DIFFINITAS, INDIFFERENTIJFMU...",BOTANY,748,777,1,25193,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188092436.34.xml
59,AT,'Jf>'6 V m s m,[],teAT,772,772,1,5,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188092748.34.xml
60,BREWING,hollow bags. Whett feeds are thus sufficiently...,[],BREWING,798,805,1,9474,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188093086.34.xml
61,TING,"T7 I N G. B R E Pale Balls, Are made tn the sa...",[],TING,806,806,1,1180,832,0,...,1,,[],1771,Edinburgh,"Encyclopaedia Britannica; or, A dictionary of ...",1,A-B,0,144133901/alto/188093190.34.xml


### 2.1 Counting the number of terms

**Remember**: A term can appear more than once per edition.

In [11]:
len(df_1771_small)

63

### 3. Summarize Text


### 3.1 Testing with Gensim

In [26]:
df_1771_small.loc[62]["term"]

'BREWING'

In [27]:
text=df_1771_small.loc[62]["definition"]
text

'Ji-ole for a night together, that the (leant of the boiling water or wort may penetrate into the wood ; this way is such a furious fearcher, that unless the cade is new-hopped just before, it will be apt to fall to pieces. Another Way. Take a pottle, or more, of stone-lime, and put it into the calk^ on this pour some water, and stop it up diredtly, (baking it well about. Another Way. Take along linen rag, and dip it in melted brimstone; light it at the end, and let it hang pendant with the upper part of the rag fastened to the wooden bung; this is a most quick and sure\' way, and will not only (weeten, but help to fine the drink. Another. Or, to make your calk more, pleasant, you may use the vintners way thus: Take four ounces of (lone brimstone, one ounce of burnt allum, and two ounces of brandy; melt all thele in an earthen pan over hot coals, and dip therein a piece of new canvas, and instantly sprinkle thereon the powders of nutmegs, cloves, coriander, and anife feeds : this canva

In [28]:
summarized_text=summarize(text)
summarized_text

"Ji-ole for a night together, that the (leant of the boiling water or wort may penetrate into the wood ; this way is such a furious fearcher, that unless the cade is new-hopped just before, it will be apt to fall to pieces.\nOr, to make your calk more, pleasant, you may use the vintners way thus: Take four ounces of (lone brimstone, one ounce of burnt allum, and two ounces of brandy; melt all thele in an earthen pan over hot coals, and dip therein a piece of new canvas, and instantly sprinkle thereon the powders of nutmegs, cloves, coriander, and anife feeds : this canvas set on fire, and let it burn hanging in the calk fastened at the end with the wooden bung, so that no smoke comes out.\nBoil some pepper in water, and fill the calk with it fealding hot.\nBREYNIA, in botany, a fynonime of the capparis.\nTo prevent this inconvenience, when your brewing is over, put up' Lome water fealding hot, and let it run through the grains; then boil it and fill upthe calk, stop it well, and let it

In [29]:
len(summarized_text)

3045

In [30]:
len(text)

8909
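
Gensim 3.x documents `summarize` as a variation of the TextRank algorithm: sentences become nodes in a graph, edges are weighted by sentence similarity, and a PageRank-style iteration ranks the sentences. The sketch below illustrates that general idea with a simple word-overlap similarity. It is not Gensim's implementation, which uses different preprocessing and a different similarity function, and all function names here are hypothetical.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(words_a, words_b):
    """Shared-word count, normalised by the log of both sentence lengths."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # log(1) == 0 would make the denominator vanish
    overlap = len(set(words_a) & set(words_b))
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_summary(text, top_n=3, damping=0.85, iterations=50):
    """Return the top_n ranked sentences, kept in their original order."""
    sentences = split_sentences(text)
    tokens = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    n = len(sentences)
    if n == 0:
        return []
    # Symmetric similarity matrix with a zero diagonal.
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Weighted PageRank update; isolated sentences keep the base score.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]


example = ("Brewing needs clean water. Clean casks keep the brewing water sweet. "
           "A cat sat there. Water for brewing should be boiled in clean casks.")
print(textrank_summary(example, top_n=2))
```

A sentence that shares no words with the others receives only the base score, so it is the first to be dropped from the summary.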

### 3.2 Testing with Spacy

This section follows the approach described in https://medium.com/analytics-vidhya/text-summarization-using-spacy-ca4867c6b744

In [31]:
doc=create_document(text)

In [32]:
len(list(doc.sents))

55

#### 3.2.1 Filtering tokens

In [33]:
keyword = []
stopwords = list(STOP_WORDS)
# Keep only content words: proper nouns, adjectives, nouns and verbs.
pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
for token in doc:
    # Skip stop words and punctuation.
    if (token.text in stopwords or token.text in punctuation):
        continue
    if (token.pos_ in pos_tag):
        keyword.append(token.text)

In [34]:
freq_word = Counter(keyword)
freq_word.most_common(5)

[('bricks', 31), ('calk', 9), ('burnt', 9), ('water', 8), ('wood', 8)]

#### 3.2.2 Normalization 

In [35]:
max_freq = Counter(keyword).most_common(1)[0][1]
for word in freq_word.keys():
    freq_word[word] = (freq_word[word]/max_freq)
freq_word.most_common(5)

[('bricks', 1.0),
 ('calk', 0.2903225806451613),
 ('burnt', 0.2903225806451613),
 ('water', 0.25806451612903225),
 ('wood', 0.25806451612903225)]
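
The normalization divides every count by the count of the most frequent keyword, so the top word scores 1.0 and all others fall between 0 and 1. As a worked example, the sketch below reproduces the values above from the raw counts reported in 3.2.1. Unlike the cell above, this hypothetical `normalise` helper returns a new dictionary instead of updating the counter in place.

```python
from collections import Counter


def normalise(freq):
    """Divide every count by the largest count so scores fall in (0, 1]."""
    max_freq = max(freq.values())
    return {word: count / max_freq for word, count in freq.items()}


# Counts taken from the most_common(5) output in section 3.2.1.
counts = Counter({"bricks": 31, "calk": 9, "burnt": 9, "water": 8, "wood": 8})
scores = normalise(counts)
print(scores["calk"])   # 9 / 31
print(scores["water"])  # 8 / 31
```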

#### 3.2.3 Weighing sentences

In [36]:
sent_strength={}
for sent in doc.sents:
    for word in sent:
        # Add the normalized frequency of every keyword found in the sentence.
        if word.text in freq_word.keys():
            if sent in sent_strength.keys():
                sent_strength[sent]+=freq_word[word.text]
            else:
                sent_strength[sent] = freq_word[word.text]
print(sent_strength)

{Ji-ole for a night together, that the (leant of the boiling water or wort may penetrate into the wood ; this way is such a furious fearcher, that unless the cade is new-hopped just before, it will be apt to fall to pieces.: 1.290322580645161, Another Way.: 0.06451612903225806, Take a pottle, or more, of stone-lime, and put it into the calk^ on this pour some water, and stop it up diredtly, (baking it well about.: 0.7096774193548386, Another Way.: 0.06451612903225806, Take along linen rag, and dip it in melted brimstone; light it at the end, and let it hang pendant with the upper part of the rag fastened to the wooden bung; this is a most quick and sure' way, and will not only (weeten, but help to fine the drink.: 1.516129032258064, Or, to make your calk more, pleasant, you may use the vintners way thus: Take four ounces of (lone brimstone, one ounce of burnt allum, and two ounces of brandy; melt all thele in an earthen pan over hot coals, and dip therein a piece of new canvas, and ins
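
Because each sentence's strength is a sum over its keywords, long sentences tend to accumulate higher scores than short ones. The sketch below, written over plain token lists rather than spaCy objects, shows the summed score next to a length-averaged alternative. The averaging is a variation that this notebook does not use, and `sentence_scores` is a hypothetical name. Unlike the cell above, it also returns a score of 0 for sentences without any keyword.

```python
def sentence_scores(sentences, word_scores, average=False):
    """Score each tokenised sentence by summing the scores of its words.

    With average=True the sum is divided by the sentence length, which
    reduces the advantage of very long sentences.
    """
    scores = []
    for tokens in sentences:
        total = sum(word_scores.get(token, 0.0) for token in tokens)
        if average and tokens:
            total /= len(tokens)
        scores.append(total)
    return scores


# Toy data, made up for illustration only.
word_scores = {"bricks": 1.0, "water": 0.25}
sentences = [
    ["bricks", "and", "bricks", "and", "more", "bricks", "with", "water"],
    ["bricks", "water"],
]
print(sentence_scores(sentences, word_scores))                # → [3.25, 1.25]
print(sentence_scores(sentences, word_scores, average=True))  # → [0.40625, 0.625]
```

In the toy data the long sentence wins on the summed score, while the short one wins once scores are averaged.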

#### 3.2.4 Summarizing the string

The nlargest function returns a list containing the 3 highest-scoring sentences, which we store as summarized_sentences.

In [37]:
summarized_sentences = nlargest(3, sent_strength, key=sent_strength.get)


In [38]:
final_sentences = [w.text for w in summarized_sentences]
summary =' '.join(final_sentences)
print(summary)

the principal of which are, Compass-bricks, of a circular form, used infleyning of walls : Concave, or hollow bricks, on one side flat like a common brick, on the other hollowed, and used for conveyance of water : Feather-edged bricks, which are like common llatute bricks, only thinner on one edge than the other, and used for penning up the brick pannels in timber buildings: Cogging bricks are used for making the indented works under the caping of walls built with great bricks : Caping bricks, formed on purpose for caping of walls: Dutch or Flemilh bricks, used to pave yards, stables, and for soap-bo Hers vaults and cifterns: Clinkers, such bricks as are glazed by the heat of the fire in making : Sandel or famel-bricks, are such as lie outmoll in a kiln, or clamp, and consequently are soft and useless, as not being thoroughly burnt: Great bricks are those twelve inches long, six broad, and three thick, used to build fence-walls: _Plai%r or'buttress bricks, have a notch at one end, half
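
Note that `nlargest` returns the sentences ordered by score, not by their position in the text, so the joined summary may not follow the original reading order. A minimal sketch of restoring document order is shown below, working on sentence indices. The `top_sentences_in_order` name is hypothetical, and the scores are rounded from the first five entries of the `sent_strength` output in 3.2.3.

```python
from heapq import nlargest


def top_sentences_in_order(scores, n=3):
    """Return indices of the n best-scoring sentences, in document order."""
    best = nlargest(n, range(len(scores)), key=scores.__getitem__)
    return sorted(best)


scores = [1.29, 0.06, 0.71, 0.06, 1.52]
print(top_sentences_in_order(scores))  # → [0, 2, 4]
```

With spaCy `Span` objects, sorting the selected sentences by their `start` attribute would achieve the same effect.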

In [39]:
len(summary)

2374
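
To compare the two summaries, the character counts reported above can be turned into compression ratios relative to the 8909-character definition. The `compression_ratio` helper below is a hypothetical name introduced for this comparison.

```python
def compression_ratio(summary_length, source_length):
    """Fraction of the source length (in characters) kept by the summary."""
    if source_length <= 0:
        raise ValueError("source_length must be positive")
    return summary_length / source_length


source_chars = 8909  # len(text)
gensim_chars = 3045  # len(summarized_text), from section 3.1
spacy_chars = 2374   # len(summary), from section 3.2

print(round(compression_ratio(gensim_chars, source_chars), 3))  # → 0.342
print(round(compression_ratio(spacy_chars, source_chars), 3))   # → 0.266
```

Under these counts, the Gensim summary keeps roughly a third of the original text, while the three-sentence spaCy summary keeps roughly a quarter.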