<a id='top'></a><a name='top'></a>
# Chapter 9: Text Summarization

**Blueprints for Text Analysis Using Python**

* [Introduction](#introduction)
* [9.0 Imports and Setup](#9.0)
* [9.1 Text Summarization](#9.1)
    - [9.1.1 Extractive Methods](#9.1.1)
    - [9.1.2 Data Preprocessing](#9.1.2)
* [9.2 Blueprint: Summarizing Text Using Topic Representation](#9.2)
    - [9.2.1 Identifying Important Words with TF-IDF Values](#9.2.1)
    - [9.2.2 LSA Algorithm](#9.2.2)
* [9.3 Blueprint: Summarizing Text Using an Indicator Representation](#9.3)
* [9.4 Measuring the Performance of Text Summarization Methods](#9.4)
* [9.5 Blueprint: Summarizing Text Using Machine Learning](#9.5)
    - [9.5.1 Step 1: Creating Target Labels](#9.5.1)
    - [9.5.2 Step 2: Adding Features to Assist Model Prediction](#9.5.2)
    - [9.5.3 Step 3: Build a Machine Learning Model](#9.5.3)

---
<a name='introduction'></a><a id='introduction'></a>
# Introduction
<a href="#top">[back to top]</a>

### Dataset

* acl2017.tex: [script](#acl2017.tex), [source](https://raw.githubusercontent.com/blueprints-for-text-analytics-python/blueprints-text/master/ch09/acl2017.tex)
* predicting-the-next-u-s-recession-idUSKCN1V31JE: [script](#predicting-the-next-u-s-recession-idUSKCN1V31JE), [source](https://www.reuters.com/article/us-usa-economy-watchlist-graphic/predicting-the-next-u-s-recession-idUSKCN1V31JE)
* travel_threads.csv.gz : [script](#travel_threads.csv.gz), [source](https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master/data/travel-forum-threads/travel_threads.csv.gz)


### Explore

* Analyzing different types of text data.
* Examine specific text data characteristics useful in determining the choice of summarization method.

---
<a name='9.0'></a><a id='9.0'></a>
# 9.0 Imports and Setup
<a href="#top">[back to top]</a>

In [1]:
# Start with clean project
!rm -f *.gz
!rm -f *.py
!rm -f *.txt
!rm -fr articles

In [2]:
!mkdir articles

In [3]:
req_file = "requirements_09.txt"

In [4]:
%%writefile {req_file}
isort
rouge-score
scikit-learn-intelex
spacy
sumy
textacy
textdistance
tqdm
watermark
Wikipedia-API

Writing requirements_09.txt


In [5]:
import sys

IS_COLAB = 'google.colab' in sys.modules
if IS_COLAB:
    print("Installing packages")
    !pip install --upgrade --quiet -r {req_file}
else:
    print("Running locally.")

Installing packages
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m91.2/91.2 KB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m83.0/83.0 KB[0m [31m5.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m97.3/97.3 KB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m208.4/208.4 KB[0m [31m9.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.7/12.7 MB[0m [31m32.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m85.5/85.5 MB[0m [31m7.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.0/4.0 MB[0m [31m36.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.1/10.1 MB[0m

In [6]:
%%writefile imports.py
# Place at top to patch scikit-learn algorithms
from sklearnex import patch_sklearn # isort:skip
patch_sklearn() # isort:skip

import html
import locale
import os.path
import pprint
import random
import re
import reprlib
import warnings

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import requests
import rouge_score
import seaborn as sns
import spacy
import sumy
import textacy
import textdistance
import wikipediaapi
from bs4 import BeautifulSoup
from dateutil import parser
from nltk import tokenize
from rouge_score import rouge_scorer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GroupShuffleSplit
from spacy.tokenizer import Tokenizer as spacy_Tokenizer
from spacy.util import (compile_infix_regex, compile_prefix_regex,
                        compile_suffix_regex)
from sumy.nlp.stemmers import Stemmer
from sumy.nlp.tokenizers import Tokenizer as sumy_Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.utils import get_stop_words
from textacy.preprocessing import replace
from tqdm import tqdm
from watermark import watermark

Writing imports.py


In [7]:
!isort imports.py
!cat imports.py

# Place at top to patch scikit-learn algorithms
from sklearnex import patch_sklearn # isort:skip
patch_sklearn() # isort:skip

import html
import locale
import os.path
import pprint
import random
import re
import reprlib

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import requests
import rouge_score
import seaborn as sns
import spacy
import sumy
import textacy
import textdistance
import wikipediaapi
from bs4 import BeautifulSoup
from dateutil import parser
from nltk import tokenize
from rouge_score import rouge_scorer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GroupShuffleSplit
from spacy.tokenizer import Tokenizer as spacy_Tokenizer
from spacy.util import (compile_infix_regex, compile_prefix_regex,
                        compile_suffix_regex)
from sumy.nlp.stemmers import Stemmer
from sumy.nlp.tokenizers import Tokenizer as sumy_Tokenizer
fro

In [8]:
# Place at top to patch scikit-learn algorithms
from sklearnex import patch_sklearn # isort:skip
patch_sklearn() # isort:skip

import html
import locale
import os.path
import pprint
import random
import re
import reprlib
import warnings

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import requests
import rouge_score
import seaborn as sns
import spacy
import sumy
import textacy
import textdistance
import wikipediaapi
from bs4 import BeautifulSoup
from dateutil import parser
from nltk import tokenize
from rouge_score import rouge_scorer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GroupShuffleSplit
from spacy.tokenizer import Tokenizer as spacy_Tokenizer
from spacy.util import (compile_infix_regex, compile_prefix_regex,
                        compile_suffix_regex)
from sumy.nlp.stemmers import Stemmer
from sumy.nlp.tokenizers import Tokenizer as sumy_Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.utils import get_stop_words
from textacy.preprocessing import replace
from tqdm import tqdm
from watermark import watermark

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [9]:
def HR():
    print("-"*40)
    
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding
warnings.filterwarnings('ignore')
BASE_DIR = '.'
sns.set_style("darkgrid")
tqdm.pandas(desc="progress-bar")
pp = pprint.PrettyPrinter(indent=4)
LANGUAGE = "english"

print(watermark(iversions=True, globals_=globals(),python=True, machine=True))

Python implementation: CPython
Python version       : 3.9.16
IPython version      : 7.9.0

Compiler    : GCC 9.4.0
OS          : Linux
Release     : 5.10.147+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit

re          : 2.2.1
spacy       : 3.5.1
sys         : 3.9.16 (main, Dec  7 2022, 01:11:51) 
[GCC 9.4.0]
textdistance: 4.5.0
pandas      : 1.4.4
wikipediaapi: (0, 5, 8)
textacy     : 0.12.0
numpy       : 1.22.4
matplotlib  : 3.7.1
sumy        : 0.11.0
seaborn     : 0.12.2
dateutil    : 2.8.2
nltk        : 3.8.1
rouge_score : 0.1.2
requests    : 2.27.1



In [10]:
# Downloads
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [11]:
def regex_clean(text):
    # convert html escapes like &amp; to characters.
    text = html.unescape(text) 
    # tags like <tab>
    text = re.sub(r'<[^<>]*>', ' ', text)
    # markdown URLs like [Some text](https://....)
    text = re.sub(r'\[([^\[\]]*)\]\([^\(\)]*\)', r'\1', text)
    # text or code in brackets like [0]
    text = re.sub(r'\[[^\[\]]*\]', ' ', text)
    # standalone sequences of specials, matches &# but not #cool
    text = re.sub(r'(?:^|\s)[&#<>{}\[\]+|\\:-]{1,}(?:\s|$)', ' ', text)
    # standalone sequences of hyphens like --- or ==
    text = re.sub(r'(?:^|\s)[\-=\+]{2,}(?:\s|$)', ' ', text)
    # sequences of white spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

In [12]:
def download_article(url):
    # check if article already there
    filename = url.split("/")[-1] + ".html"
    if not os.path.isfile(filename):
        r = requests.get(url)
        with open(filename, "w+") as f:
            f.write(r.text)
    return filename

**Setting up parse_article for Beautiful Soup**

* Right click on article element for 'Inspect Accessibility Properties'
* Copy entry for DOMNode

In [13]:
def parse_article(article_file):
    print(f"ARTICLE_FILE: {article_file}")
    HR()
    with open(article_file, "r") as f:
        html = f.read()
    r = {}
    soup = BeautifulSoup(html, 'html.parser')
    
    # r['id'] = soup.select_one("div.StandardArticle_inner-container")['id']
    r['url'] = soup.find("link", {'rel': 'canonical'})['href']
    r['headline'] = soup.h1.text
    
    #r['section'] = soup.select_one("div.ArticleHeader_channel a").text
    
    r['text'] = soup.select_one("p.Paragraph-paragraph-2Bgue.ArticleBody-para-TD_9x").text
    # r['text'] = soup.select_one("div.StandardArticleBody_body").text

    r['authors'] = [a.text 
                    for a in soup.select("div.BylineBar_first-container.ArticleHeader_byline-bar\
                                          div.BylineBar_byline span")]
    r['time'] = soup.find("meta", { 'property': "og:article:published_time"})['content']
    
    return r

<a name='9.1'></a><a id='9.1'></a>
# 9.1 Text Summarization
<a href="#top">[back to top]</a>

<a name='9.1.1'></a><a id='9.1.1'></a>
## 9.1.1 Extractive Methods
<a href="#top">[back to top]</a>

No source code.

<a name='9.1.2'></a><a id='9.1.2'></a>
## 9.1.2 Data Preprocessing
<a href="#top">[back to top]</a>

<a id='news-sitemap'></a><a name='news-sitemap'></a>
### Dataset: news-sitemap
<a href="#top">[back to top]</a>

In [14]:
r = reprlib.Repr()
r.maxstring = 800
article_dir = 'articles'

article_name1 = "what-is-5g-and-who-are-the-major-players-idUSKCN1GR1IN"
!wget -P {article_dir} -nc -q https://www.reuters.com/article/us-qualcomm-m-a-broadcom-5g/what-is-5g-and-who-are-the-major-players-idUSKCN1GR1IN
!ls -l articles/{article_name1}

-rw-r--r-- 1 root root 365332 Mar 22 06:44 articles/what-is-5g-and-who-are-the-major-players-idUSKCN1GR1IN


In [15]:
article1 = parse_article(f"{article_dir}/{article_name1}")

print ('Article Published on', r.repr(article1['time']))
print (r.repr(article1['text']))

ARTICLE_FILE: articles/what-is-5g-and-who-are-the-major-players-idUSKCN1GR1IN
----------------------------------------
Article Published on '2018-03-15T11:37:01Z'
"LONDON/SAN FRANCISCO (Reuters) - U.S. President Donald Trump has blocked microchip maker Broadcom Ltd's AVGO.O $117 billion takeover of rival Qualcomm QCOM.O amid concerns that it would give China the upper hand in the next generation of mobile communications, or 5G."


---
<a name='9.2'></a><a id='9.2'></a>
# 9.2 Blueprint: Summarizing Text Using Topic Representation
<a href="#top">[back to top]</a>

<a name='9.2.1'></a><a id='9.2.1'></a>
## 9.2.1 Identifying Important Words with TF-IDF Values
<a href="#top">[back to top]</a>

The simplest approach for summarizing text is to identify important sentences based on an aggregate of the TF-IDF values of the words in the sentence. 

Here, we apply the TF-IDF vectorization and then aggregate the values to a sentence level. We can generate a score for each sentence as a sum of the TF-IDF values for each word in that sentence. This means a sentence with a high score contains many important words, relative to other sentences in the article.

In [16]:
sentences = tokenize.sent_tokenize(article1['text'])
tfidfVectorizer = TfidfVectorizer()
words_tfidf = tfidfVectorizer.fit_transform(sentences)

Here, there are approximately 20 sentences in the article. We create a condensed summary that is only 10% of the size of the original article. We sum up the TF-IDF values for each sentence, and use ng.argsort to sort them. 

In [17]:
# Parameter to specify number of summary sentences required
num_summary_sentence = 3

# Sort the sentences in descending order by the sum of TF-IDF values
sent_sum = words_tfidf.sum(axis=1)
important_sent = np.argsort(sent_sum, axis=0)[::-1]

# Print three most important sentences in the order they appear in the article
for i in range(0, len(sentences)):
    if i in important_sent[:num_summary_sentence]:
        print (sentences[i])

LONDON/SAN FRANCISCO (Reuters) - U.S. President Donald Trump has blocked microchip maker Broadcom Ltd's AVGO.O $117 billion takeover of rival Qualcomm QCOM.O amid concerns that it would give China the upper hand in the next generation of mobile communications, or 5G.


In [18]:
def tfidf_summary(text, num_summary_sentence):
    summary_sentence = []
    sentences = tokenize.sent_tokenize(text)
    tfidfVectorizer = TfidfVectorizer()
    words_tfidf = tfidfVectorizer.fit_transform(sentences)
    sentence_sum = words_tfidf.sum(axis=1)
    important_sentences = np.argsort(sentence_sum, axis=0)[::-1]
    for i in range(0, len(sentences)):
        if i in important_sentences[:num_summary_sentence]:
            summary_sentence.append(sentences[i])
    return summary_sentence

<a name='9.2.2'></a><a id='9.2.2'></a>
## 9.2.2 LSA Algorithm
<a href="#top">[back to top]</a>

LSA is a general-purpose method used for topic modeling, document similarity, and other tasks. LSA assumes that words close in meaning will occur in the same documents. 

https://github.com/miso-belica/sumy

In [19]:
LANGUAGE = "english"
stemmer = Stemmer(LANGUAGE)
parser = PlaintextParser.from_string(article1['text'], sumy_Tokenizer(LANGUAGE))
summarizer = LsaSummarizer(stemmer)
summarizer.stop_words = get_stop_words(LANGUAGE)

for sentence in summarizer(parser.document, num_summary_sentence):
    print (str(sentence))

LONDON/SAN FRANCISCO (Reuters) - U.S. President Donald Trump has blocked microchip maker Broadcom Ltd's AVGO.O $117 billion takeover of rival Qualcomm QCOM.O amid concerns that it would give China the upper hand in the next generation of mobile communications, or 5G.


In [20]:
def lsa_summary(text, num_summary_sentence):
    summary_sentence = []
    LANGUAGE = "english"
    stemmer = Stemmer(LANGUAGE)
    parser = PlaintextParser.from_string(text, sumy_Tokenizer(LANGUAGE))
    summarizer = LsaSummarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)
    for sentence in summarizer(parser.document, num_summary_sentence):
        summary_sentence.append(str(sentence))
    return summary_sentence

<a id='predicting-the-next-u-s-recession-idUSKCN1V31JE'></a><a name='predicting-the-next-u-s-recession-idUSKCN1V31JE'></a>
### Dataset: predicting-the-next-u-s-recession-idUSKCN1V31JE
<a href="#top">[back to top]</a>

In [21]:
article_name2 = 'predicting-the-next-u-s-recession-idUSKCN1V31JE'
!wget -P {article_dir} -nc -q "https://www.reuters.com/article/us-usa-economy-watchlist-graphic/predicting-the-next-u-s-recession-idUSKCN1V31JE"
!ls -l {article_dir}/{article_name2}

-rw-r--r-- 1 root root 373066 Mar 22 06:44 articles/predicting-the-next-u-s-recession-idUSKCN1V31JE


In [22]:
r.maxstring = 800
article2 = parse_article(f"{article_dir}/{article_name2}")
print ('Article Published', r.repr(article1['time']))
HR()
print (r.repr(article2['text']))

ARTICLE_FILE: articles/predicting-the-next-u-s-recession-idUSKCN1V31JE
----------------------------------------
Article Published '2018-03-15T11:37:01Z'
----------------------------------------
'NEW YORK A protracted trade war between China and the United States, the world’s largest economies, and a deteriorating global growth outlook has left investors apprehensive about the end to the longest expansion in American history.'


In [23]:
article2['text']

'NEW YORK A protracted trade war between China and the United States, the world’s largest economies, and a deteriorating global growth outlook has left investors apprehensive about the end to the longest expansion in American history.'

In [24]:
summary_sentence = tfidf_summary(article2['text'], num_summary_sentence)

for sentence in summary_sentence:
    print (sentence)

NEW YORK A protracted trade war between China and the United States, the world’s largest economies, and a deteriorating global growth outlook has left investors apprehensive about the end to the longest expansion in American history.


In [25]:
summary_sentence = lsa_summary(article2['text'], num_summary_sentence)

for sentence in summary_sentence:
    print (sentence)

NEW YORK A protracted trade war between China and the United States, the world’s largest economies, and a deteriorating global growth outlook has left investors apprehensive about the end to the longest expansion in American history.


---
<a name='9.3'></a><a id='9.3'></a>
# 9.3 Blueprint: Summarizing Text Using an Indicator Representation
<a href="#top">[back to top]</a>

In [26]:
parser = PlaintextParser.from_string(article2['text'], sumy_Tokenizer(LANGUAGE))
summarizer = TextRankSummarizer(stemmer)
summarizer.stop_words = get_stop_words(LANGUAGE)

for sentence in summarizer(parser.document, num_summary_sentence):
    print (str(sentence))

NEW YORK A protracted trade war between China and the United States, the world’s largest economies, and a deteriorating global growth outlook has left investors apprehensive about the end to the longest expansion in American history.


In [27]:
def textrank_summary(text, num_summary_sentence):
    summary_sentence = []
    LANGUAGE = "english"
    stemmer = Stemmer(LANGUAGE)
    parser = PlaintextParser.from_string(text, sumy_Tokenizer(LANGUAGE))
    summarizer = TextRankSummarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)
    for sentence in summarizer(parser.document, num_summary_sentence):
        summary_sentence.append(str(sentence))
    return summary_sentence

In [28]:
parser = PlaintextParser.from_string(article1['text'], sumy_Tokenizer(LANGUAGE))
summarizer = TextRankSummarizer(stemmer)
summarizer.stop_words = get_stop_words(LANGUAGE)

for sentence in summarizer(parser.document, num_summary_sentence):
    print (str(sentence))

LONDON/SAN FRANCISCO (Reuters) - U.S. President Donald Trump has blocked microchip maker Broadcom Ltd's AVGO.O $117 billion takeover of rival Qualcomm QCOM.O amid concerns that it would give China the upper hand in the next generation of mobile communications, or 5G.


In [29]:
wiki_wiki = wikipediaapi.Wikipedia(
        language='en',
        extract_format=wikipediaapi.ExtractFormat.WIKI
)

In [30]:
r.maxstring = 500

In [31]:
p_wiki = wiki_wiki.page('Mongol_invasion_of_Europe')
print (r.repr(p_wiki.text))

'From the 1220s into the 1240s, the Mongols conquered the Turkic states of Volga Bulgaria, Cumania, Alania, and the Kievan Rus\' federation. Following this, they began their invasion into heartland Europe by launching a two-pronged invasion of then...Citations\nSources\nSverdrup, Carl (2010). "Numbers in Mongol Warfare". Journal of Medieval Military History. Boydell Press. 8: 109–17 [p. 115]. ISBN 978-1-84383-596-7.\n\nFurther reading\nExternal links\nThe Islamic World to 1600: The Golden Horde'


In [32]:
r.maxstring = 200

num_summary_sentence = 10

summary_sentence = textrank_summary(p_wiki.text, num_summary_sentence)

for sentence in summary_sentence:
    print (sentence)

Warring European princes realized they had to cooperate in the face of a Mongol invasion, so local wars and conflicts were suspended in parts of central Europe, only to be resumed after the Mongols had withdrawn.
Under Wenceslaus' leadership during the Mongol invasion, Bohemia remained one of a few eastern European kingdoms that was never pillaged by the Mongols even though most kingdoms around it such as Poland and Moravia were ravaged.
Saint Margaret (January 27, 1242 – January 18, 1271), a daughter of Béla IV and Maria Laskarina, was born in Klis Fortress during the Mongol invasion of Hungary-Croatia in 1242.Historians estimate that up to half of Hungary's two million population at that time were victims of the Mongol invasion of Europe.
European tactics against Mongols The traditional European method of warfare of melee combat between knights ended in catastrophe when it was deployed against the Mongol forces as the Mongols were able to keep a distance and advance with superior num

<a id='acl2017.tex'></a><a name='acl2017.tex'></a>
### Dataset: acl2017.tex
<a href="#top">[back to top]</a>

In [33]:
filename = 'acl2017.tex'
!wget -P {article_dir} -nc -q https://raw.githubusercontent.com/blueprints-for-text-analytics-python/blueprints-text/master/ch09/acl2017.tex
!ls -l {article_dir}/{filename}

-rw-r--r-- 1 root root 56337 Mar 22 06:44 articles/acl2017.tex


In [34]:
parser = PlaintextParser.from_file(f"{article_dir}/{filename}", sumy.nlp.tokenizers.Tokenizer(LANGUAGE))
summarizer = TextRankSummarizer(stemmer)
summarizer.stop_words = get_stop_words(LANGUAGE)

for sentence in summarizer(parser.document, 5):
    print (str(sentence))

In more detail, our contributions are as follows: \begin{itemize}[noitemsep] \item{We introduce a new dataset for summarisation of scientific publications consisting of over 10k documents} \item{Following the approach of \cite{hermann2015teaching} in the news domain, we introduce a method, \textit{HighlightROUGE}, which can be used to automatically extend this dataset %extractive summarisation datasets% and show empirically that this improves summarisation performance} \item{Taking inspiration from previous work in summarising scientific literature \citep{kupiec1995trainable, papers_citationSaggion2016}, we introduce a %further metric we use as a feature, \textit{AbstractROUGE}, which can be used to extract summaries by exploiting the abstract of a paper} \item{We benchmark several neural as well traditional summarisation methods on the dataset and use simple features to model the global context of a summary statement, which contribute most to the overall score} \item{We compare our be

---
<a name='9.4'></a><a id='9.4'></a>
# 9.4 Measuring the Performance of Text Summarization Methods
<a href="#top">[back to top]</a>

In [35]:
def print_rouge_score(rouge_score):
    for k,v in rouge_score.items():
        print (k, 'Precision:', "{:.2f}".format(v.precision), 'Recall:', "{:.2f}".format(v.recall), 'fmeasure:', "{:.2f}".format(v.fmeasure))

In [36]:
num_summary_sentence = 3
gold_standard = article2['headline']
summary = ""

summary = ''.join(textrank_summary(article2['text'], num_summary_sentence))
scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
scores = scorer.score(gold_standard, summary)
print_rouge_score(scores)

rouge1 Precision: 0.05 Recall: 0.33 fmeasure: 0.09


In [37]:
summary = ''.join(lsa_summary(article2['text'], num_summary_sentence))
scores = scorer.score(gold_standard, summary)
print_rouge_score(scores)

rouge1 Precision: 0.05 Recall: 0.33 fmeasure: 0.09


In [38]:
num_summary_sentence = 10 ##
gold_standard = p_wiki.summary

summary = ''.join(textrank_summary(p_wiki.text, num_summary_sentence))

scorer = rouge_scorer.RougeScorer(['rouge2','rougeL'], use_stemmer=True)
scores = scorer.score(gold_standard, summary)
print_rouge_score(scores)

rouge2 Precision: 0.10 Recall: 0.28 fmeasure: 0.15
rougeL Precision: 0.11 Recall: 0.30 fmeasure: 0.16


In [39]:
summary = ''.join(lsa_summary(p_wiki.text, num_summary_sentence))

scorer = rouge_scorer.RougeScorer(['rouge2','rougeL'], use_stemmer=True)
scores = scorer.score(gold_standard, summary)
print_rouge_score(scores)

rouge2 Precision: 0.04 Recall: 0.09 fmeasure: 0.05
rougeL Precision: 0.12 Recall: 0.26 fmeasure: 0.16


---
<a name='9.5'></a><a id='9.5'></a>
# 9.5 Blueprint: Summarizing Text Using Machine Learning
<a href="#top">[back to top]</a>

<a name='9.5.1'></a><a id='9.5.1'></a>
## 9.5.1 Step 1: Creating target labels
<a href="#top">[back to top]</a>

<a id='travel_threads.csv.gz'></a><a name='travel_threads.csv.gz'></a>
### Dataset: travel_threads.csv.gz
<a href="#top">[back to top]</a>

In [40]:
file = "travel_threads.csv.gz"
!wget -nc -q https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master/data/travel-forum-threads/travel_threads.csv.gz
!ls -l {file}

-rw-r--r-- 1 root root 1825254 Mar 22 06:44 travel_threads.csv.gz


In [41]:
df = pd.read_csv(file, sep='|', dtype={'ThreadID': 'object'})
df[df['ThreadID']=='60763_5_3122150'].head(1).T

# You can view the actual post here ###
# URL - https://www.tripadvisor.com/ShowTopic-g60763-i5-k3122150-Which_attractions_need_to_be_pre_booked-New_York_City_New_York.html ###

Unnamed: 0,850
Filename,60763_5_3122150
ThreadID,60763_5_3122150
Title,which attractions need to be pre booked?
userID,musicqueenLon...
Date,"29 September 2009, 1:41"
postNum,1
text,Hi I am coming to NY in Oct! So excited&quo...
summary,A woman was planning to travel NYC in October ...


In [42]:
# Re-using the blueprint from Chapter 4 but adapting to add additional steps specific to this dataset

def custom_tokenizer(nlp):
    # use default patterns except the ones matched by re.search
    prefixes = [pattern for pattern in nlp.Defaults.prefixes 
                if pattern not in ['-', '_', '#']]
    suffixes = [pattern for pattern in nlp.Defaults.suffixes
                if pattern not in ['_']]
    infixes  = [pattern for pattern in nlp.Defaults.infixes
                if not re.search(pattern, 'xx-xx')]

    return spacy_Tokenizer(
        vocab          = nlp.vocab, 
        rules          = nlp.Defaults.tokenizer_exceptions,
        prefix_search  = compile_prefix_regex(prefixes).search,
        suffix_search  = compile_suffix_regex(suffixes).search,
        infix_finditer = compile_infix_regex(infixes).finditer,
        token_match    = nlp.Defaults.token_match
    )


nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = custom_tokenizer(nlp)
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [43]:
def extract_lemmas(doc, **kwargs):
    return [t.lemma_ for t in textacy.extract.words(doc, **kwargs)]

def extract_noun_chunks(doc, include_pos=['NOUN'], sep='_'):
    chunks = []
    for noun_chunk in doc.noun_chunks:
        chunk = [token.lemma_ for token in noun_chunk
                 if token.pos_ in include_pos]
        if len(chunk) >= 2:
            chunks.append(sep.join(chunk))
    return chunks

def extract_entities(doc, include_types=None, sep='_'):

    ents = textacy.extract.entities(doc, 
             include_types=include_types, 
             exclude_types=None, 
             drop_determiners=True, 
             min_freq=1)
    
    return [re.sub('\s+', sep, e.lemma_)+'/'+e.label_ for e in ents]

def clean(text):
    # Replace URLs
    text = replace.urls(text)
    
    # Replace semi-colons (relevant in Java code ending)
    text = text.replace(';','')
    
    # Replace character tabs (present as literal in description field)
    text = text.replace('\t','')
    
    # Find and remove any stack traces - doesn't fix all code fragments but removes many exceptions
    start_loc = text.find("Stack trace:")
    text = text[:start_loc]
    
    # Remove Hex Code
    text = re.sub(r'(\w+)0x\w+', '', text)
    
    # Initialize Spacy
    doc = nlp(text)
    
    # From Blueprint function
    lemmas = extract_lemmas(
        doc, 
        exclude_pos = ['PART', 'PUNCT', 'DET', 'PRON', 'SYM', 'SPACE', 'NUM'],
        filter_stops = True,
        filter_nums = True,
        filter_punct = True
    )

    return lemmas

In [44]:
# Applying regex based cleaning function
df['text'] = df['text'].progress_apply(regex_clean)

# Extracting lemmas using spacy pipeline
df['lemmas'] = df['text'].progress_apply(clean)

progress-bar: 100%|██████████| 7357/7357 [00:01<00:00, 3875.83it/s]
progress-bar: 100%|██████████| 7357/7357 [02:30<00:00, 48.88it/s]


In [45]:
gss = GroupShuffleSplit(
    n_splits=1, 
    test_size=0.2, 
    random_state=42
)

train_split, test_split = next(
    gss.split(
        df, 
        groups=df['ThreadID']
    )
)

In [46]:
train_df = df.iloc[train_split]
test_df = df.iloc[test_split]

print ('Number of threads for Training ', train_df['ThreadID'].nunique())
print ('Number of threads for Testing ', test_df['ThreadID'].nunique())

Number of threads for Training  559
Number of threads for Testing  140


In [47]:
compression_factor = 0.3

train_df['similarity'] = train_df.progress_apply(
    lambda x: textdistance.jaro_winkler(x.text, x.summary), axis=1)

train_df["rank"] = train_df.groupby("ThreadID")["similarity"].rank(
    "max", ascending=False)

topN = lambda x: x <= np.ceil(compression_factor * x.max())
train_df['summaryPost'] = train_df.groupby('ThreadID')['rank'].progress_apply(topN)

progress-bar: 100%|██████████| 5858/5858 [00:01<00:00, 3192.10it/s]
progress-bar: 100%|██████████| 559/559 [00:00<00:00, 4752.15it/s]


In [48]:
train_df[['text','summaryPost']][train_df['ThreadID']=='60763_5_3122150'].head(3)

Unnamed: 0,text,summaryPost
850,"Hi I am coming to NY in Oct! So excited"" Have ...",True
851,I wouldnt bother doing the ESB if I was you TO...,False
852,"The Statue of Liberty, if you plan on going to...",True


<a name='9.5.2'></a><a id='9.5.2'></a>
## 9.5.2 Step 2: Adding Features to Assist Model Prediction
<a href="#top">[back to top]</a>

In [49]:
train_df['titleSimilarity'] = train_df.progress_apply(
    lambda x: textdistance.jaro_winkler(x.text, x.Title), axis=1)

progress-bar: 100%|██████████| 5858/5858 [00:00<00:00, 20053.86it/s]


In [50]:
## Adding post length as a feature
train_df['textLength'] = train_df['text'].str.len()

In [51]:
train_df.loc[train_df['textLength'] <= 20, 'summaryPost'] = False

In [52]:
feature_cols = ['titleSimilarity','textLength','postNum']

In [53]:
train_df['combined'] = [
    ' '.join(map(str, l)) for l in train_df['lemmas'] if l is not '']

tfidf = TfidfVectorizer(min_df=10, ngram_range=(1, 2), stop_words="english")
tfidf_result = tfidf.fit_transform(train_df['combined']).toarray()

tfidf_df = pd.DataFrame(tfidf_result, columns=tfidf.get_feature_names_out())

tfidf_df.columns = ["word_" + str(x) for x in tfidf_df.columns]
tfidf_df.index = train_df.index
train_df_tf = pd.concat([train_df[feature_cols], tfidf_df], axis=1)

In [54]:
test_df['similarity'] = test_df.progress_apply(lambda x: textdistance.jaro_winkler(x.text, x.summary), axis=1)
test_df["rank"] = test_df.groupby("ThreadID")["similarity"].rank("max", ascending=False)

topN = lambda x: x <= np.ceil(compression_factor * x.max())
test_df['summaryPost'] = test_df.groupby('ThreadID')['rank'].progress_apply(topN)

test_df['titleSimilarity'] = test_df.progress_apply(lambda x: textdistance.jaro_winkler(x.text, x.Title), axis=1)

test_df['textLength'] = test_df['text'].str.len()

test_df.loc[test_df['textLength'] <= 20, 'summaryPost'] = False

test_df['combined'] = [' '.join(map(str, l)) for l in test_df['lemmas'] if l is not '']

tfidf_result = tfidf.transform(test_df['combined']).toarray()

tfidf_df = pd.DataFrame(tfidf_result, columns = tfidf.get_feature_names_out())
tfidf_df.columns = ["word_" + str(x) for x in tfidf_df.columns]
tfidf_df.index = test_df.index
test_df_tf = pd.concat([test_df[feature_cols], tfidf_df], axis=1)

progress-bar: 100%|██████████| 1499/1499 [00:00<00:00, 2928.39it/s]
progress-bar: 100%|██████████| 140/140 [00:00<00:00, 4802.74it/s]
progress-bar: 100%|██████████| 1499/1499 [00:00<00:00, 19654.45it/s]


<a name='9.5.3'></a><a id='9.5.3'></a>
## 9.5.3 Step 3: Build a Machine Learning Model
<a href="#top">[back to top]</a>

### API Notes

[`sklearn.ensemble.RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

A random forest classifier.

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.

In [55]:
%%time

model1 = RandomForestClassifier(
    random_state=20,
    verbose=1
)

CPU times: user 164 µs, sys: 0 ns, total: 164 µs
Wall time: 170 µs


In [56]:
%%time

# This takes a lot of time to run

model1.fit(
    train_df_tf, 
    train_df['summaryPost']
)

CPU times: user 4min 40s, sys: 396 ms, total: 4min 40s
Wall time: 4min 48s


In [57]:
# Function to calculate rouge_score for each thread
def calculate_rouge_score(x, column_name):
    # Get the original summary - only first value since they are repeated
    ref_summary = x['summary'].values[0]
    
    # Join all posts that have been predicted as summary
    predicted_summary = ''.join(x['text'][x[column_name]])
    
    # Return the rouge score for each ThreadID
    scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
    scores = scorer.score(ref_summary, predicted_summary)
    return scores['rouge1'].fmeasure

In [58]:
%%time

test_df['predictedSummaryPost'] = model1.predict(test_df_tf)
print('Mean ROUGE-1 Score for test threads',
      test_df.groupby('ThreadID')[['summary','text','predictedSummaryPost']] \
      .progress_apply(calculate_rouge_score, column_name='predictedSummaryPost').mean())

progress-bar: 100%|██████████| 140/140 [00:01<00:00, 72.72it/s]

Mean ROUGE-1 Score for test threads 0.3468526208505262
CPU times: user 2.21 s, sys: 99.8 ms, total: 2.31 s
Wall time: 2.37 s





In [59]:
%%time 

random.seed(2)
random.sample(test_df['ThreadID'].unique().tolist(), 1)

CPU times: user 554 µs, sys: 1 ms, total: 1.56 ms
Wall time: 1.3 ms


['60763_5_3139646']

In [60]:
%%time

example_df = test_df[test_df['ThreadID'] == '60974_588_2180141']
print('Total number of posts', example_df['postNum'].max())
print('Number of summary posts',
      example_df[example_df['predictedSummaryPost']].count().values[0])
print('Title: ', example_df['Title'].values[0])
example_df[['postNum', 'text']][example_df['predictedSummaryPost']]

Total number of posts 9
Number of summary posts 2
Title:  What's fun for kids?
CPU times: user 3.99 ms, sys: 2.98 ms, total: 6.97 ms
Wall time: 6.5 ms


Unnamed: 0,postNum,text
813,4,"Well, you're really in luck, because there's a..."
814,5,"Depending on your time frame, a quick trip to ..."
