# Text Summarization

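A summary can be produced by abstraction (generating new sentences that paraphrase the source) or by extraction (selecting the most important sentences from the source).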

**Extractive Method**

Three main steps:
1. Create an intermediate representation of the text.
2. Score the sentences/phrases based on the chosen representation.
3. Rank and choose sentences to create a summary of the text.
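
The three steps can be sketched in plain Python. This is a minimal illustration, not the method used later in this notebook: it assumes word frequency as the intermediate representation, and the function name `simple_extractive_summary` and the sample text are made up for the example.

```python
import re
from collections import Counter


def simple_extractive_summary(text, num_sentences=2):
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Step 1: intermediate representation = word frequencies over the whole text.
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    word_freq = Counter(word for words in tokenized for word in words)

    # Step 2: score each sentence by the average frequency of its words.
    scores = [
        sum(word_freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]

    # Step 3: rank by score, keep the top sentences, restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:num_sentences])]


text = (
    "5G networks promise faster data. "
    "Chip makers compete for 5G patents. "
    "The weather was mild."
)
print(simple_extractive_summary(text))
```

The blueprints below follow the same outline and differ mainly in the representation used in step 1.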


**Data Preprocessing**

In [1]:
import reprlib

r = reprlib.Repr()
r.maxstring = 800

In [2]:
# !pip install beautifulsoup4

In [3]:
import os

import requests
from bs4 import BeautifulSoup


def download_article(url):
    # Download the article only if it is not already cached on disk.
    filename = url.split("/")[-1] + ".html"
    if not os.path.isfile(filename):
        response = requests.get(url)
        response.raise_for_status()  # Fail early on HTTP errors.
        with open(filename, "w", encoding="utf-8") as f:
            f.write(response.text)
    return filename


def parse_article(article_file):
    with open(article_file, "r", encoding="utf-8") as f:
        html = f.read()

    r = {}
    soup = BeautifulSoup(html, "html.parser")
    r["url"] = soup.find("link", {"rel": "canonical"})["href"]
    r["headline"] = soup.h1.text
    children = list(soup.select_one("article").children)[0]
    paragraphs = [child.text for child in children if child.name != "div"]
    r["text"] = "\n".join(paragraphs)
    r["authors"] = [
        a.text for a in soup.select(".ArticleBody-byline-container-3H6dy a")
    ]
    r["time"] = soup.find("meta", {"property": "og:article:published_time"})["content"]
    return r

In [4]:
url1 = "https://www.reuters.com/article/us-qualcomm-m-a-broadcom-5g/what-is-5g-and-who-are-the-major-players-idUSKCN1GR1IN"

article_name1 = download_article(url1)
article1 = parse_article(article_name1)
print("Article published on", r.repr(article1["time"]))
print(r.repr(article1["text"]))

Article published on '2018-03-15T11:37:01Z'
"LONDON/SAN FRANCISCO (Reuters) - U.S. President Donald Trump has blocked microchip maker Broadcom Ltd's AVGO.O $117 billion takeover of rival Qualcomm QCOM.O amid concerns that it would give China the upper hand in the next generation of mobile communications, or 5G.\nA 5G sign is seen at the Mobile World Congress in Barcelona, Spain February 28, 2018. REUTERS/Yves Herman\nBelow are some facts ... and telecommunications gear makers will have to pay it licensing fees. It dominated standards setting in 3G and 4G wireless and looks set to top the list of patent holders heading into the 5G cycle.\nHuawei, Nokia, Ericsson and others are also vying to amass 5G patents, which has helped spur complex cross-licensing agreements like the deal struck late last year Nokia and Huawei around handsets."


### Blueprint: Summarizing Text using Topic Representation

**Identifying Important Words with TF-IDF values**

In [30]:
import numpy as np
from nltk import tokenize
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_summary(article, num_summary_sentences=3):
    sentences = tokenize.sent_tokenize(article)
    tfidf_vectorizer = TfidfVectorizer()
    words_tfidf = tfidf_vectorizer.fit_transform(sentences)

    # Sum the TF-IDF values of each sentence and flatten the result to a 1-D array.
    sent_sum = np.asarray(words_tfidf.sum(axis=1)).ravel()
    # Indices of the highest-scoring sentences, best first.
    important_sent = np.argsort(sent_sum)[::-1][:num_summary_sentences]

    # Return the selected sentences in the order they appear in the article.
    return [sentences[i] for i in sorted(important_sent)]

In [32]:
for sentence in tfidf_summary(article1["text"]):
    print(sentence)
    print()

LONDON/SAN FRANCISCO (Reuters) - U.S. President Donald Trump has blocked microchip maker Broadcom Ltd's AVGO.O $117 billion takeover of rival Qualcomm QCOM.O amid concerns that it would give China the upper hand in the next generation of mobile communications, or 5G.

5G networks, now in the final testing stage, will rely on denser arrays of small antennas and the cloud to offer data speeds up to 50 or 100 times faster than current 4G networks and serve as critical infrastructure for a range of industries.

Most other baseband chips come from Asia: MediaTek 2454.TW of Taiwan holds about one quarter of the market, while Samsung Electronics 005930.KS and Huawei [HWT.UL] - two big smartphone makers - develop chips for their own devices.
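
To see what the sentence score in `tfidf_summary` measures, the sketch below recomputes it by hand on toy sentences. It assumes the documented `TfidfVectorizer` defaults (lowercasing, tokens of two or more word characters, smoothed IDF, L2-normalized rows). The helper name `tfidf_sentence_scores` and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tfidf_sentence_scores(sentences):
    # Each sentence is treated as one "document".
    tokenized = [re.findall(r"\b\w\w+\b", s.lower()) for s in sentences]
    n = len(sentences)

    # Document frequency: number of sentences that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = [count * idf[term] for term, count in tf.items()]
        norm = math.sqrt(sum(w * w for w in weights))
        # L2-normalize the row, then sum it (the per-sentence row sum).
        scores.append(sum(weights) / norm if norm else 0.0)
    return scores


sentences = ["the cat sat", "the cat sat on the mat today", "dogs bark"]
scores = tfidf_sentence_scores(sentences)
print(scores)
```

Because each row is L2-normalized, the sum for a sentence lies between 1 and the square root of its number of distinct terms, so sentences with more distinct terms tend to rank higher.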



**LSA Algorithm**

In [26]:
# !pip install sumy

In [49]:
from sumy.nlp.stemmers import Stemmer
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.lsa import LsaSummarizer
from sumy.utils import get_stop_words

LANGUAGE = "english"


def lsa_summary(article, num_summary_sentences=3):
    stemmer = Stemmer(LANGUAGE)
    parser = PlaintextParser.from_string(article, Tokenizer(LANGUAGE))
    summarizer = LsaSummarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)

    return [
        str(sentence) for sentence in summarizer(parser.document, num_summary_sentences)
    ]


for sentence in lsa_summary(article1["text"]):
    print(sentence)
    print()

LONDON/SAN FRANCISCO (Reuters) - U.S. President Donald Trump has blocked microchip maker Broadcom Ltd's AVGO.O $117 billion takeover of rival Qualcomm QCOM.O amid concerns that it would give China the upper hand in the next generation of mobile communications, or 5G.

Moving to new networks promises to enable new mobile services and even whole new business models, but could pose challenges for countries and industries unprepared to invest in the transition.

The concern is that a takeover by Singapore-based Broadcom could see the firm cut research and development spending by Qualcomm or hive off strategically important parts of the company to other buyers, including in China, U.S. officials and analysts have said.



### Blueprint: Summarizing Text using an Indicator Representation

In [43]:
from sumy.summarizers.text_rank import TextRankSummarizer


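def textrank_summary(article, num_summary_sentences=3):
    # Create the stemmer here; the one inside lsa_summary is local to that function.
    stemmer = Stemmer(LANGUAGE)
    parser = PlaintextParser.from_string(article, Tokenizer(LANGUAGE))
    summarizer = TextRankSummarizer(stemmer)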
    summarizer.stop_words = get_stop_words(LANGUAGE)
    return [
        str(sentence) for sentence in summarizer(parser.document, num_summary_sentences)
    ]


for sentence in textrank_summary(article1["text"]):
    print(sentence)
    print()

Acquiring Qualcomm would represent the jewel in the crown of Broadcom’s portfolio of communications chips, which supply wi-fi, power management, video and other features in smartphones alongside Qualcomm’s core baseband chips - radio modems that wirelessly connect phones to networks.

Qualcomm QCOM.O is the dominant player in smartphone communications chips, making half of all core baseband radio chips in smartphones.

The standards are set by a global body to ensure all phones work across different mobile networks, and whoever’s essential patents end up making it into the standard stands to reap huge royalty licensing revenue streams.
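
The idea behind TextRank can be sketched in plain Python: sentences are nodes, word overlap gives edge weights, and a PageRank-style iteration scores each node. This is a simplified illustration and not sumy's implementation. The similarity function (with `+ 1` inside the logarithms to avoid division by zero), the damping factor, the iteration count, and the sample sentences are all assumptions.

```python
import math
import re


def textrank_scores(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    if n == 0:
        return []
    tokens = [set(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]

    def similarity(a, b):
        # Shared words, normalized by the log-scaled sentence lengths.
        denom = math.log(len(a) + 1) + math.log(len(b) + 1)
        return len(a & b) / denom if denom else 0.0

    # Symmetric weight matrix with no self-links.
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]

    # Power iteration: each sentence receives score from its neighbors in
    # proportion to the weight of the connecting edge.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping
            * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


sentences = [
    "Qualcomm makes baseband chips for phones.",
    "Broadcom wants to buy Qualcomm for its chips.",
    "Phones use baseband chips from Qualcomm.",
    "The weather was mild in Barcelona.",
]
scores = textrank_scores(sentences)
ranking = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The last sentence shares no words with the others, so it only receives the baseline `(1 - damping) / n` score and ranks last.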



### Measuring the Performance of Text Summarization Methods

In [39]:
# !pip install rouge_score

In [53]:
from rouge_score import rouge_scorer

num_summary_sentences = 3
gold_standard = article1["headline"]
summary = "".join(textrank_summary(article1["text"], num_summary_sentences))
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(gold_standard, summary)
scores

{'rouge1': Score(precision=0.05, recall=0.5555555555555556, fmeasure=0.09174311926605504),
 'rouge2': Score(precision=0.0, recall=0.0, fmeasure=0.0),
 'rougeL': Score(precision=0.03, recall=0.3333333333333333, fmeasure=0.055045871559633024)}

In [54]:
summary = "".join(lsa_summary(article1["text"], num_summary_sentences))
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(gold_standard, summary)
scores

{'rouge1': Score(precision=0.03389830508474576, recall=0.4444444444444444, fmeasure=0.06299212598425197),
 'rouge2': Score(precision=0.0, recall=0.0, fmeasure=0.0),
 'rougeL': Score(precision=0.025423728813559324, recall=0.3333333333333333, fmeasure=0.04724409448818898)}
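
How ROUGE-1 arrives at these numbers can be sketched in a few lines. This is a simplified illustration that assumes lowercase alphanumeric tokens and no stemming, so it will not reproduce `rouge_score` exactly. The function name `rouge1` and the sample strings are made up for the example.

```python
import re
from collections import Counter


def rouge1(reference, candidate):
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))

    # Clipped unigram matches: a word counts at most as often as it
    # appears in the other text.
    overlap = sum((ref & cand).values())

    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


reference = "What is 5G and who are the major players"
candidate = "Qualcomm is the dominant player in 5G chips and the major patent holder"
prec, rec, f1 = rouge1(reference, candidate)
print(prec, rec, f1)
```

Precision divides by the length of the generated summary and recall by the length of the reference, which is why a three-sentence summary scored against a one-line headline gets a low precision. The TextRank figures above (recall 0.556, precision 0.05) would be consistent with 5 matching unigrams against a 9-token headline and a 100-token summary.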

### Blueprint: Summarizing Text using Machine Learning

**Step 1: Creating Target Labels**

In [61]:
import numpy as np
import pandas as pd

df = pd.read_csv("data/travel_threads.csv.gz", sep="|", dtype={"ThreadID": "object"})
df[df["ThreadID"] == "60763_5_3122150"].head(1).T

| | 850 |
|---|---|
| Filename | 60763_5_3122150 |
| ThreadID | 60763_5_3122150 |
| Title | which attractions need to be pre booked? |
| userID | musicqueenLon... |
| Date | 29 September 2009, 1:41 |
| postNum | 1 |
| text | Hi I am coming to NY in Oct! So excited&quo... |
| summary | A woman was planning to travel NYC in October ... |


In [65]:
# Reuse the blueprint from Chapter 4, adapted with extra steps specific to this dataset.
import re

import spacy
import textacy

# Note: this rebinds the name Tokenizer, which until now referred to sumy's Tokenizer.
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex, compile_prefix_regex, compile_suffix_regex
from textacy.preprocessing.replace import urls as replace_urls


def custom_tokenizer(nlp):

    # use default patterns except the ones matched by re.search
    prefixes = [
        pattern for pattern in nlp.Defaults.prefixes if pattern not in ["-", "_", "#"]
    ]
    suffixes = [pattern for pattern in nlp.Defaults.suffixes if pattern not in ["_"]]
    infixes = [
        pattern for pattern in nlp.Defaults.infixes if not re.search(pattern, "xx-xx")
    ]

    return Tokenizer(
        vocab=nlp.vocab,
        rules=nlp.Defaults.tokenizer_exceptions,
        prefix_search=compile_prefix_regex(prefixes).search,
        suffix_search=compile_suffix_regex(suffixes).search,
        infix_finditer=compile_infix_regex(infixes).finditer,
        token_match=nlp.Defaults.token_match,
    )


nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)


def extract_lemmas(doc, **kwargs):
    return [t.lemma_ for t in textacy.extract.words(doc, **kwargs)]


def extract_noun_chunks(doc, include_pos=["NOUN"], sep="_"):
    chunks = []
    for noun_chunk in doc.noun_chunks:
        chunk = [token.lemma_ for token in noun_chunk if token.pos_ in include_pos]
        if len(chunk) >= 2:
            chunks.append(sep.join(chunk))
    return chunks


def extract_entities(doc, include_types=None, sep="_"):

    ents = textacy.extract.entities(
        doc,
        include_types=include_types,
        exclude_types=None,
        drop_determiners=True,
        min_freq=1,
    )

    return [re.sub(r"\s+", sep, e.lemma_) + "/" + e.label_ for e in ents]


def spacy_clean(text):
    # Replace URLs
    text = replace_urls(text)

    # Replace semi-colons (relevant in Java code ending)
    text = text.replace(";", "")

    # Replace character tabs (present as literal in description field)
    text = text.replace("\t", "")

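    # Find and remove any stack traces - doesn't fix all code fragments but removes many exceptions.
    # str.find returns -1 when there is no match, so truncate only if a trace is present.
    start_loc = text.find("Stack trace:")
    if start_loc != -1:
        text = text[:start_loc]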

    # Remove Hex Code
    text = re.sub(r"(\w+)0x\w+", "", text)

    # Initialize Spacy
    doc = nlp(text)

    # From Blueprint function
    lemmas = extract_lemmas(
        doc,
        exclude_pos=["PART", "PUNCT", "DET", "PRON", "SYM", "SPACE", "NUM"],
        filter_stops=True,
        filter_nums=True,
        filter_punct=True,
    )

    return lemmas
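
The infix filter in `custom_tokenizer` keeps only those patterns that do not match inside `"xx-xx"`, which stops the tokenizer from splitting hyphenated words. The sketch below shows the effect with the standard `re` module. The three patterns are hypothetical stand-ins, not spaCy's actual defaults.

```python
import re

# Hypothetical stand-ins for a tokenizer's default infix patterns.
infix_patterns = [
    r"(?<=[a-zA-Z])[-~](?=[a-zA-Z])",  # hyphen or tilde between letters
    r"\.\.+",  # ellipsis
    r"(?<=[0-9])[+*^](?=[0-9])",  # arithmetic operator between digits
]

# Same filter as in custom_tokenizer: drop every pattern that would split "xx-xx".
kept = [pattern for pattern in infix_patterns if not re.search(pattern, "xx-xx")]
infix_re = re.compile("|".join(kept))

print(len(kept))
print(infix_re.search("too-close-too-home"))  # no infix found, so no split
print(infix_re.search("wait...what"))  # the ellipsis pattern still applies
```

Only the hyphen pattern is dropped; the remaining infix rules keep working as before.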

In [67]:
%run preprocess.py

In [68]:
df["text"] = df["text"].apply(clean)
df["lemmas"] = df["text"].apply(spacy_clean)

In [69]:
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=1, test_size=0.2)
train_split, test_split = next(gss.split(df, groups=df["ThreadID"]))
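# Copy the slices so that adding columns later does not trigger SettingWithCopyWarning.
train_df = df.iloc[train_split].copy()
test_df = df.iloc[test_split].copy()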

print("Number of threads for Training", train_df["ThreadID"].nunique())
print("Number of threads for Testing", test_df["ThreadID"].nunique())

Number of threads for Training 559
Number of threads for Testing 140
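
Splitting by group rather than by row keeps all posts of a thread on the same side of the split, so the model is never tested on a thread it has partly seen. The idea can be sketched in plain Python. This is not scikit-learn's implementation, and the function name `group_shuffle_split`, the seed, and the toy thread IDs are assumptions.

```python
import math
import random


def group_shuffle_split(groups, test_size=0.2, seed=0):
    # Shuffle the distinct groups, not the individual rows.
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)

    # Reserve a share of the groups for testing.
    n_test = math.ceil(test_size * len(unique_groups))
    test_groups = set(unique_groups[:n_test])

    # Every row follows its group to one side of the split.
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


groups = ["t1", "t1", "t2", "t3", "t3", "t3", "t4", "t5", "t5"]
train_idx, test_idx = group_shuffle_split(groups)
train_groups = {groups[i] for i in train_idx}
test_groups = {groups[i] for i in test_idx}
print(sorted(train_groups), sorted(test_groups))
```

No thread ID appears in both sets, which mirrors the 559/140 thread split reported above.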


In [71]:
# !pip install textdistance

In [72]:
import textdistance

compression_factor = 0.3

train_df["similarity"] = train_df.apply(
    lambda x: textdistance.jaro_winkler(x.text, x.summary), axis=1
)
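train_df["rank"] = train_df.groupby("ThreadID")["similarity"].rank(
    method="max", ascending=False
)

# A post is a summary post if its rank falls within the top compression_factor
# share of its thread. With method="max" and descending order, the largest rank
# in a thread equals the number of posts in that thread.
max_rank = train_df.groupby("ThreadID")["rank"].transform("max")
train_df["summary_post"] = train_df["rank"] <= np.ceil(compression_factor * max_rank)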
train_df[["text", "summary_post"]][train_df["ThreadID"] == "60763_5_3122150"].head(3)


| | text | summary_post |
|---|---|---|
| 850 | Hi I am coming to NY in Oct! So excited" Have ... | True |
| 851 | I wouldnt bother doing the ESB if I was you TO... | False |
| 852 | The Statue of Liberty, if you plan on going to... | True |
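
The labeling rule can be checked on a single toy thread. The sketch below mirrors the ranking logic in plain Python under the assumption that similarities are plain floats. The function name `label_summary_posts` and the similarity values are made up.

```python
import math


def label_summary_posts(similarities, compression_factor=0.3):
    if not similarities:
        return []
    # Descending rank with ties resolved as "max": the rank of a post is the
    # number of posts whose similarity is at least as high.
    ranks = [sum(other >= s for other in similarities) for s in similarities]
    # Keep ranks up to ceil(compression_factor * number of posts).
    cutoff = math.ceil(compression_factor * max(ranks))
    return [rank <= cutoff for rank in ranks]


similarities = [0.71, 0.52, 0.69, 0.40, 0.55]
labels = label_summary_posts(similarities)
print(labels)
```

With five posts and a compression factor of 0.3, the cutoff is `ceil(1.5) = 2`, so the two posts most similar to the reference summary are labeled `True`.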


In [77]:
train_df.loc[train_df["text"].str.len() <= 20, "summary_post"] = False



**Step 2: Adding Features to Assist Model Prediction**

In [99]:
def preprocess(df):
    df = df.copy(deep=True)
    df["title_similarity"] = df.apply(
        lambda x: textdistance.jaro_winkler(x.text, x.Title), axis=1
    )
    df["text_length"] = df["text"].str.len()
    # Join each post's lemmas into one string for the TF-IDF vectorizer.
    df["combined"] = [" ".join(map(str, lemmas)) for lemmas in df["lemmas"]]
    return df


def vectorizer(df, tfidf):
    feature_cols = ["title_similarity", "text_length", "postNum"]
    tfidf_result = tfidf.transform(df["combined"]).toarray()

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
    tfidf_df = pd.DataFrame(tfidf_result, columns=tfidf.get_feature_names_out())
    tfidf_df.columns = ["word_" + str(x) for x in tfidf_df.columns]
    tfidf_df.index = df.index
    df_tf = pd.concat([df[feature_cols], tfidf_df], axis=1)
    return df_tf

In [78]:
train_df["title_similarity"] = train_df.apply(
    lambda x: textdistance.jaro_winkler(x.text, x.Title), axis=1
)


In [79]:
# Adding post length as a feature.
train_df["text_length"] = train_df["text"].str.len()



In [100]:
feature_cols = ["title_similarity", "text_length", "postNum"]
train_df["combined"] = [" ".join(map(str, lemmas)) for lemmas in train_df["lemmas"]]
tfidf = TfidfVectorizer(min_df=10, ngram_range=(1, 2), stop_words="english")
tfidf_result = tfidf.fit_transform(train_df["combined"]).toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
tfidf_df = pd.DataFrame(tfidf_result, columns=tfidf.get_feature_names_out())
tfidf_df.columns = ["word_" + str(x) for x in tfidf_df.columns]
tfidf_df.index = train_df.index
train_df_tf = pd.concat([train_df[feature_cols], tfidf_df], axis=1)



**Step 3: Build a Machine Learning Model**

In [85]:
from sklearn.ensemble import RandomForestClassifier

model1 = RandomForestClassifier()
model1.fit(train_df_tf, train_df["summary_post"])

RandomForestClassifier()

In [101]:
# Function to calculate rouge score for each thread.
def calculate_rouge_score(x, column_name):
    # Get the original summary - only first value since they are repeated.
    ref_summary = x["summary"].values[0]

    # Join all posts that have been predicted as summary.
    predicted_summary = "".join(x["text"][x[column_name]])

    # Return the rouge score for each ThreadID.
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    scores = scorer.score(ref_summary, predicted_summary)
    return scores["rouge1"].fmeasure


test_df = preprocess(test_df)
test_df_tf = vectorizer(test_df, tfidf)

test_df["predicted_summary_post"] = model1.predict(test_df_tf)
print(
    "Mean ROUGE-1 Score for test threads",
    test_df.groupby("ThreadID")[["summary", "text", "predicted_summary_post"]]
    .apply(calculate_rouge_score, column_name="predicted_summary_post")
    .mean(),
)

Mean ROUGE-1 Score for test threads 0.3463420408926299


In [112]:
import random

random.seed(2)
thread_id = random.sample(test_df["ThreadID"].unique().tolist(), 1)[0]
thread_id

'60763_5_3144153'

In [113]:
example_df = test_df[test_df["ThreadID"] == thread_id]
print("Total number of posts", example_df["postNum"].max())
print(
    "Number of summary posts",
    example_df[example_df["predicted_summary_post"]].count().values[0],
)
print("Title: ", example_df["Title"].values[0])
example_df[["postNum", "text"]][example_df["predicted_summary_post"]]

Total number of posts 11
Number of summary posts 1
Title:  The Information Deficit


| | postNum | text |
|---|---|---|
| 397 | 1 | From today's TImes: Afzal Hossain is the New Y... |


In [114]:
example_df

Unnamed: 0,Filename,ThreadID,Title,userID,Date,postNum,text,summary,lemmas,title_similarity,text_length,combined,predicted_summary_post
397,60763_5_3144153,60763_5_3144153,The Information Deficit,J-A-W-P,"09 October 2009, 20:55",1,From today's TImes: Afzal Hossain is the New Y...,A person initiated discussins about new york c...,"[today, TImes, Afzal, Hossain, New, York, City...",0.520931,1432,today TImes Afzal Hossain New York City subway...,True
398,60763_5_3144153,60763_5_3144153,The Information Deficit,Outoftheinkwe...,"09 October 2009, 21:22",2,"""For a fee, Ugarte.""",A person initiated discussins about new york c...,"[fee, Ugarte]",0.46562,20,fee Ugarte,False
399,60763_5_3144153,60763_5_3144153,The Information Deficit,Outoftheinkwe...,"09 October 2009, 21:26",3,Oops! I think I misquoted my quote... should b...,A person initiated discussins about new york c...,"[oops, think, misquote, quote, price, Ugarte]",0.436232,70,oops think misquote quote price Ugarte,False
400,60763_5_3144153,60763_5_3144153,The Information Deficit,livetotravel,"09 October 2009, 21:45",4,there's a couple of great iPhone apps CityTran...,A person initiated discussins about new york c...,"[couple, great, iPhone, app, CityTransit, map,...",0.527734,176,couple great iPhone app CityTransit map line s...,False
401,60763_5_3144153,60763_5_3144153,The Information Deficit,queensbouleva...,"10 October 2009, 1:32",5,These young ladies were already ahead of the c...,A person initiated discussins about new york c...,"[young, lady, ahead, curve, speak, April, adn,...",0.518157,300,young lady ahead curve speak April adn MTA law...,False
402,60763_5_3144153,60763_5_3144153,The Information Deficit,queensbouleva...,"10 October 2009, 1:35",6,"More on the Subway Service Specialists, from t...",A person initiated discussins about new york c...,"[Subway, Service, Specialists, too-close-too-h...",0.497337,117,Subway Service Specialists too-close-too-home ...,False
403,60763_5_3144153,60763_5_3144153,The Information Deficit,Lotuspath,"10 October 2009, 1:38",7,Hey .... if we're all gonna do this 'thing' .....,A person initiated discussins about new york c...,"[hey, gon, thing, willing, expect, actually, a...",0.505086,209,hey gon thing willing expect actually answer t...,False
404,60763_5_3144153,60763_5_3144153,The Information Deficit,Crans,"10 October 2009, 1:44",8,"LOL, QB! I don't even like to sit on those woo...",A person initiated discussins about new york c...,"[lol, QB, like, sit, wooden, platform, bench, ...",0.536962,202,lol QB like sit wooden platform bench jean lon...,False
405,60763_5_3144153,60763_5_3144153,The Information Deficit,BrooklynMel,"10 October 2009, 1:53",9,"QB, that's awesome. I'll bet that the no-fun M...",A person initiated discussins about new york c...,"[QB, awesome, bet, no-fun, MTA, shut, love, sn...",0.465761,115,QB awesome bet no-fun MTA shut love snappy uni...,False
406,60763_5_3144153,60763_5_3144153,The Information Deficit,Hankshanker,"10 October 2009, 7:21",10,"Nothing personal, but I think 1,000 percent ri...",A person initiated discussins about new york c...,"[personal, think, percent, ridership, system, ...",0.522549,335,personal think percent ridership system person...,False
