# Exploring Terms in the Encyclopaedia Britannica

## Summarize text - Gensim - BERT - GPT2

In this notebook we summarize the text of term definitions, using the dataframe obtained with either the posprocess_eb.py script or the Merging_EB_Terms.ipynb notebook. Both methods produce the same dataframe.

We have selected the first edition for this exploration, but the notebook can be run with any of the other editions.

**Remark**: Edition 1 has 3 volumes and was printed twice, in 1771 and 1773.

Text summarization methods fall into two broad categories:

- **Extractive Summarization**: These methods rely on extracting several parts, such as phrases and sentences, from a piece of text and stacking them together to create a summary. Therefore, identifying the right sentences for summarization is of utmost importance in an extractive method.

- **Abstractive Summarization**: These methods use advanced NLP techniques to generate an entirely new summary. Some parts of this summary may not even appear in the original text.
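
To make the extractive idea concrete, here is a minimal sketch that uses only the Python standard library. It is not the method used later in this notebook: the regex sentence splitter, the tiny `STOP_WORDS` set and the helper names `split_sentences` and `extractive_summary` are all assumptions made for illustration.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "on", "the", "to"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, n_sentences=2):
    """Return the n highest-scoring sentences, copied verbatim from the text."""
    sentences = split_sentences(text)
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    freq = Counter(words)
    # Score each sentence by the summed frequency of its words;
    # stop words were never counted, so they contribute 0.
    scores = {
        sent: sum(freq[w] for w in re.findall(r"[a-z']+", sent.lower()))
        for sent in sentences
    }
    return nlargest(n_sentences, scores, key=scores.get)


sample = (
    "Cats sleep for most of the day. "
    "Cats hunt mice at night. "
    "The weather was mild."
)
summary = extractive_summary(sample, n_sentences=2)
print(" ".join(summary))
```

Every sentence in the result appears unchanged in the input, which is what distinguishes an extractive summary from an abstractive one.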
 
Here we focus on **extractive summarization**, using the following libraries:
- Gensim
- spaCy
- bert-extractive-summarizer (the `summarizer` package, used below for BERT, GPT2 and XLNet)


### Loading the necessary libraries

In [1]:
import yaml
import matplotlib.pyplot as plt
import numpy as np
import collections
import matplotlib as mpl

In [2]:
import networkx as nx
import matplotlib.pyplot as plt

In [3]:
import pandas as pd
from yaml import safe_load
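# json_normalize is exposed at the top level since pandas 1.0;
# importing it from pandas.io.json is deprecated.
from pandas import json_normalize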

In [4]:
import gensim

In [5]:
from summarizer import Summarizer, TransformerSummarizer

### Functions

In [6]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest

In [7]:
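# Load the spaCy English model once; reloading it on every call is slow.
nlp = spacy.load("en_core_web_sm")

def create_document(text):
    """Parse raw text into a spaCy Doc (tokens, POS tags, sentences)."""
    return nlp(text)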

## The dataframe has these columns

- definition:           Definition of a term
- editionNum:           1,2,3,4,5,6,7,8
- editionTitle:         Title of the edition
- header:               Header of the term's page
- place:                Place where the volume was edited (e.g. Edinburgh)
- relatedTerms:         Related terms (see X article)
- altoXML:              File path of the XML file that contains the term
- term:                 Term name
- positionPage:         Position of the term in the page
- startsAt:             Page number on which the term definition starts
- endsAt:               Page number on which the term definition ends
- volumeTitle:          Title of the Volume
- typeTerm:             Type of term [Topic| Articles]                                       
- year:                 Year of the edition
- volumeNum:            Volume number (e.g. 1)
- letters:              Letters of the volume (A-B)
- part:                 Part of the volume (e.g. 1)
- supplement:           Supplement's title
- supplementsTo:        Editions that it supplements [1, 2, 3....]
- numberOfWords:        Number of words per term definition
- numberOfTerms:        Number of terms per page
- numberOfPages:        Number of pages per volume

### 1. Load dataframe from JSON file

### 3.2 Testing with spaCy

Following https://medium.com/analytics-vidhya/text-summarization-using-spacy-ca4867c6b744 

In [8]:
text = "Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics."
doc=create_document(text)

In [9]:
len(list(doc.sents))

7

#### 3.2.1 Filtering tokens

In [10]:
keyword =[]
stopwords = list(STOP_WORDS)
pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
for token in doc:
    if (token.text in stopwords or token.text in punctuation):
        continue
    if (token.pos_ in pos_tag):
        keyword.append(token.text)

In [11]:
freq_word = Counter(keyword)
freq_word.most_common(5)

[('learning', 8), ('Machine', 4), ('study', 3), ('algorithms', 3), ('task', 3)]

#### 3.2.2 Normalization 

In [20]:
# Always start from the raw counts, so that re-running this cell
# cannot divide already-normalized values a second time.
raw_freq = Counter(keyword)
max_freq = raw_freq.most_common(1)[0][1]
print(raw_freq.keys())
freq_word = Counter({word: count / max_freq for word, count in raw_freq.items()})
freq_word.most_common(5)

dict_keys(['Machine', 'learning', 'ML', 'scientific', 'study', 'algorithms', 'statistical', 'models', 'computer', 'systems', 'use', 'improve', 'performance', 'specific', 'task', 'build', 'mathematical', 'model', 'sample', 'data', 'known', 'training', 'order', 'predictions', 'decisions', 'programmed', 'perform', 'applications', 'email', 'filtering', 'detection', 'network', 'intruders', 'vision', 'infeasible', 'develop', 'algorithm', 'instructions', 'performing', 'related', 'computational', 'statistics', 'focuses', 'making', 'computers', 'optimization', 'delivers', 'methods', 'theory', 'application', 'domains', 'field', 'machine', 'Data', 'mining', 'exploratory', 'analysis', 'unsupervised', 'business', 'problems', 'referred', 'predictive', 'analytics'])


[('learning', 1.0),
 ('Machine', 0.5),
 ('study', 0.375),
 ('algorithms', 0.375),
 ('task', 0.375)]
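
As a quick check of the arithmetic, dividing every count by the largest count should map the most frequent word to 1.0. The sketch below uses the five counts reported in section 3.2.1 and a hypothetical helper, `normalize`, that returns a new `Counter` instead of modifying its input. Dividing already-normalized values by the maximum count a second time would shrink them again (1.0 / 8 = 0.125), which is why the helper always starts from the raw counts.

```python
from collections import Counter


def normalize(raw_counts):
    """Return a new Counter with every count divided by the largest count."""
    max_freq = max(raw_counts.values())
    return Counter({word: count / max_freq for word, count in raw_counts.items()})


# The five most common counts reported in section 3.2.1.
raw = Counter({"learning": 8, "Machine": 4, "study": 3, "algorithms": 3, "task": 3})

once = normalize(raw)
twice = normalize(raw)  # starts again from the raw counts, so nothing changes

print(once.most_common(2))  # → [('learning', 1.0), ('Machine', 0.5)]
```

Because `raw` is left untouched, calling `normalize(raw)` any number of times gives the same weights.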

#### 3.2.3 Weighing sentences

In [13]:
sent_strength={}
for sent in doc.sents:
    for word in sent:
        if word.text in freq_word.keys():
            if sent in sent_strength.keys():
                sent_strength[sent]+=freq_word[word.text]
            else:
                sent_strength[sent] = freq_word[word.text]
#print(sent_strength)

#### 3.2.4 Summarizing the string

The `nlargest` function returns a list containing the three highest-scoring sentences, which are stored in `summarized_sentences`.

In [14]:
summarized_sentences = nlargest(3, sent_strength, key=sent_strength.get)


In [15]:
final_sentences = [w.text for w in summarized_sentences]
summary =' '.join(final_sentences)
print(summary)

Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learning.
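
Note that `nlargest` returns items ordered by score, not by their position in the text. If the summary should always read in the original order, one option is to select by score and then re-sort by position. The sketch below uses plain strings and made-up scores rather than the spaCy spans above; with spaCy spans, the `start` attribute could serve as the sort key.

```python
from heapq import nlargest

# Hypothetical sentences and scores, listed in document order.
sentences = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]
scores = [0.4, 0.7, 0.1, 0.9]

# Pick the indices of the two best-scoring sentences ...
top_indices = nlargest(2, range(len(sentences)), key=lambda i: scores[i])
# ... then sort the indices so the summary follows the original order.
ordered_summary = " ".join(sentences[i] for i in sorted(top_indices))

print(top_indices)      # → [3, 1]
print(ordered_summary)  # → Second sentence. Fourth sentence.
```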


In [16]:
len(summary)

550

### 3.3 BERT

In [17]:
model = Summarizer()
result = model(text, min_length=60)
summary = "".join(result)
print(summary)

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task.




### 3.4 GPT2

In [18]:

GPT2_model = TransformerSummarizer(transformer_type="GPT2",transformer_model_key="gpt2-medium")
full = ''.join(GPT2_model(text, min_length=60))
print(full)

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learning.




### 3.5 XLNet

In [19]:
model = TransformerSummarizer(transformer_type="XLNet",transformer_model_key="xlnet-base-cased")
full = ''.join(model(text, min_length=60))
print(full)

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetModel: ['lm_loss.bias', 'lm_loss.weight']
- This IS expected if you are initializing XLNetModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.




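**Remark**: The Gensim summarizer could not be run in this environment: `ModuleNotFoundError: No module named 'gensim.summarization'`. The `gensim.summarization` module was removed in Gensim 4.0, so this step requires an earlier 3.x release.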