# Extractive Text Summarization

- Split the document into sentences (sentence tokenization)
- Assign a score to each sentence
- Pick the top N sentences


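- Score = mean of the non-zero TF-IDF values of the words in the sentence. Unimportant words have smaller TF-IDF values and pull the score down; important words that appear several times in the sentence get an even higher value. Taking the mean avoids a bias towards longer sentences. Only the non-zero entries are averaged because a TF-IDF row is very sparse: with the zeros included, the score would mostly reflect how many distinct words the sentence contains.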

- TextRank score
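
As a formula, the first scoring rule above can be written as follows, where $W_s$ (a symbol introduced here for illustration) denotes the set of vocabulary words with a non-zero TF-IDF value in sentence $s$:

$$
\text{score}(s) = \frac{1}{|W_s|} \sum_{w \in W_s} \text{tfidf}(w, s)
$$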


In [27]:
import pandas as pd
import numpy as np
import textwrap
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

In [28]:
nltk.download("punkt")
nltk.download('punkt_tab')
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /home/amarov/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/amarov/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /home/amarov/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [29]:
df = pd.read_csv("https://github.com/febse/data/raw/refs/heads/main/ta/BBC%20News%20Train.csv.zip")

In [30]:
doc = df.iloc[0]

print(textwrap.fill(doc["Text"], replace_whitespace=False, fix_sentence_endings=True))

worldcom ex-boss launches defence lawyers defending former worldcom
chief bernie ebbers against a battery of fraud charges have called a
company whistleblower as their first witness.  cynthia cooper
worldcom s ex-head of internal accounting  alerted directors to
irregular accounting practices at the us telecoms giant in 2002. her
$11bn (£5.7bn) accounting fraud.  mr ebbers has pleaded not guilty to
charges of fraud and conspiracy.  prosecution lawyers have argued that
mr ebbers orchestrated a series of accounting tricks at worldcom
ordering employees to hide expenses and inflate revenues to meet wall
street earnings estimates.  but ms cooper  who now runs her own
consulting business  told a jury in new york on wednesday that
external auditors arthur andersen had approved worldcom s accounting
in early 2001 and 2002. she said andersen had given a  green light  to
the procedures and practices used by worldcom.  mr ebber s lawyers
have said he was unaware of the fraud  arguing that audito

In [31]:
type(doc["Text"])

str

In [32]:
from nltk.tokenize import sent_tokenize

sents = sent_tokenize(doc["Text"])
sents

['worldcom ex-boss launches defence lawyers defending former worldcom chief bernie ebbers against a battery of fraud charges have called a company whistleblower as their first witness.',
 'mr ebbers has pleaded not guilty to charges of fraud and conspiracy.',
 'prosecution lawyers have argued that mr ebbers orchestrated a series of accounting tricks at worldcom  ordering employees to hide expenses and inflate revenues to meet wall street earnings estimates.',
 'but ms cooper  who now runs her own consulting business  told a jury in new york on wednesday that external auditors arthur andersen had approved worldcom s accounting in early 2001 and 2002. she said andersen had given a  green light  to the procedures and practices used by worldcom.',
 'mr ebber s lawyers have said he was unaware of the fraud  arguing that auditors did not alert him to any problems.',
 'ms cooper also said that during shareholder meetings mr ebbers often passed over technical questions to the company s finance

In [33]:
vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))

X = vectorizer.fit_transform(sents)

In [34]:
X.shape

(12, 137)

In [38]:
X = X.toarray()
scores = np.array([np.mean(row[row != 0]) for row in X])
print("Row-wise Mean of Non-zero Elements:\n", scores)

Row-wise Mean of Non-zero Elements:
 [0.2269258  0.1987996  0.36673461 0.21409173 0.18696719 0.30947282
 0.23131543 0.22443098 0.25194776 0.29554142 0.371702   0.31593632]


In [39]:
sort_idx = np.argsort(-scores)

In [40]:
print("Top sentences:\n")

for i in sort_idx[:5]:
  print(f"{scores[i]:.2f}: {sents[i]}")

Top sentences:

0.37: worldcom emerged from bankruptcy protection in 2004  and is now known as mci.
0.37: mr ebbers has pleaded not guilty to charges of fraud and conspiracy.
0.32: last week  mci agreed to a buyout by verizon communications in a deal valued at $6.75bn.
0.31: mr ebber s lawyers have said he was unaware of the fraud  arguing that auditors did not alert him to any problems.
0.30: mr ebbers could face a jail sentence of 85 years if convicted of all the charges he is facing.
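
The loop above prints the sentences by descending score. For a readable summary one might instead keep the top N sentences in their original document order. A minimal sketch, assuming a hypothetical helper `top_n_in_order` and toy inputs that are not taken from the dataset:

```python
import numpy as np

def top_n_in_order(sentences, scores, n=3):
    """Return the n highest-scoring sentences, kept in document order."""
    # Indices of the n largest scores (descending by score)
    top = np.argsort(-np.asarray(scores))[:n]
    # Sort the selected indices so the summary follows the original text
    return [sentences[i] for i in sorted(top)]

# Toy data for illustration only
toy_sents = ["first", "second", "third", "fourth"]
toy_scores = [0.1, 0.9, 0.3, 0.8]
summary = top_n_in_order(toy_sents, toy_scores, n=2)
print(summary)
```

Restoring the document order is a design choice: it tends to keep the extracted sentences coherent, at the cost of hiding which one scored highest.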


## TextRank

TextRank is an unsupervised keyword and sentence extraction algorithm that is based on PageRank. Before we can see how to apply TextRank to text summarization, we need to understand how PageRank works.

PageRank is a link analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set.


```{mermaid}
graph TD
    A[Page A] -->|Link| B[Page B]
    A -->|Link| C[Page C]
    B -->|Link| D[Page D]
    C -->|Link| D
    D -->|Link| A
    D -->|Link| E[Page E]
    E -->|Link| B
```


Let's take a walk through the web. We start at a random webpage and follow a random link on that page. On the next page, we again follow a randomly chosen link. We keep doing this for a long time. We can then ask: what is the probability that we end up on a certain page?

We can view this walk as a Markov chain. Say there are $n$ pages in total, and let's assume that we can reach any page from any other page (though not with the same probability).

Let $s_t$ be the page we are on at time $t$. The probability of moving from page $i$ to page $j$ is given by

$$
P_{ij} = p(s_{t+1} = j | s_t = i)
$$

The probability of being on page $i$ at time $t+1$ is given by

$$
p(s_{t+1} = i) = \sum_{j=1}^n p(s_{t+1} = i | s_t = j) p(s_t = j)
$$

or in matrix form

$$
p_{t+1} = p_t P
$$

where $p_t$ is a row vector with the probability of being on each page at time $t$.

The matrix $P$ is called the transition matrix. It is a square matrix with $n$ rows and $n$ columns. Each row sums to 1, because from any page we move to some page with probability 1. What happens when we walk for a long time? Is there a unique distribution over the pages we end up on?

The answer is given by the Perron-Frobenius theorem, which implies that if the Markov chain is ergodic (you can reach any page from any other page, and the chain is aperiodic), then there is a unique stationary distribution over the pages:

$$
p_{\infty} = p_{\infty} P
$$

The last equation also tells us how to calculate the stationary distribution: it is an eigenvector equation, so $p_{\infty}$ is a left eigenvector of $P$ for the eigenvalue 1.


In [46]:
P = np.array([[0.3, 0.7], [0.1, 0.9]])

p0 = np.array([0.5, 0.5])

for i in range(14):
    p0 = p0.dot(P)
    print(p0)


[0.2 0.8]
[0.14 0.86]
[0.128 0.872]
[0.1256 0.8744]
[0.12512 0.87488]
[0.125024 0.874976]
[0.1250048 0.8749952]
[0.12500096 0.87499904]
[0.12500019 0.87499981]
[0.12500004 0.87499996]
[0.12500001 0.87499999]
[0.125 0.875]
[0.125 0.875]
[0.125 0.875]


In [67]:
# The left eigenvectors of P are the (right) eigenvectors of P.T, so we decompose P.T
eigenvalues, eigenvectors = np.linalg.eig(P.T)

print("Eigenvalues:", eigenvalues)
eigenvectors


Eigenvalues: [0.2 1. ]


array([[-0.70710678, -0.14142136],
       [ 0.70710678, -0.98994949]])

In [71]:
# Eigenvectors are unique only up to a scalar multiple, so we rescale
# the one for eigenvalue 1 (index 1 above) to sum to 1

eigenvectors.T[1] / eigenvectors.T[1].sum()

array([0.125, 0.875])
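
The same result can be checked by hand, as a small worked example. Writing the stationary distribution as $p_{\infty} = (\pi_1, \pi_2)$ (notation introduced here), the first component of $p_{\infty} = p_{\infty} P$ together with $\pi_1 + \pi_2 = 1$ gives

$$
\pi_1 = 0.3 \pi_1 + 0.1 \pi_2 \implies \pi_2 = 7 \pi_1 \implies \pi_1 = \frac{1}{8} = 0.125, \quad \pi_2 = \frac{7}{8} = 0.875
$$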

The PageRank algorithm applies smoothing to the transition matrix, because on the real web not every page links to, or can even be reached from, every other page.

The smoothing is done by mixing the transition matrix with a uniform matrix, weighted by a damping factor $\alpha$

$$
P = \alpha T + (1-\alpha) E, \quad \alpha \in [0, 1]
$$

where $T$ is the original transition matrix and $E$ is a matrix with all elements equal to $1/n$.
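
To see the smoothing at work, the sketch below builds $T$ for the five-page graph from the diagram above and applies the damping formula. The value `alpha = 0.85`, the number of iterations, and all variable names are assumptions chosen for illustration. Without damping, the walk on this particular graph is periodic (it cycles through the groups {A, E}, {B, C}, {D}), so plain power iteration from a uniform start would not settle down; the damped matrix has only positive entries, so it does.

```python
import numpy as np

# Link graph from the diagram: page -> pages it links to
links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["A", "E"],
    "E": ["B"],
}
pages = sorted(links)
n = len(pages)
index = {page: k for k, page in enumerate(pages)}

# T[i, j]: probability of following a uniformly chosen link from page i to page j
T = np.zeros((n, n))
for page, targets in links.items():
    for target in targets:
        T[index[page], index[target]] = 1.0 / len(targets)

alpha = 0.85  # damping factor, assumed value for this example
E = np.full((n, n), 1.0 / n)
P_damped = alpha * T + (1 - alpha) * E

# Power iteration: start from the uniform distribution and apply P repeatedly
p = np.full(n, 1.0 / n)
for _ in range(200):
    p = p @ P_damped

for page, prob in zip(pages, p):
    print(f"{page}: {prob:.3f}")
```

Page D, which receives links from both B and C, ends up with the largest stationary probability in this example.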

The TextRank algorithm scores sentences based on the stationary distribution of a Markov chain. Instead of webpages we have sentences. There are no real transition probabilities between sentences, but we can use the cosine similarity between the sentence representations in the TF-IDF space as a proxy.

Let's implement it as an exercise.

- Compute the TF-IDF matrix of the sentences
- Compute the cosine similarity matrix
- Normalize the cosine similarity matrix to get the transition matrix
- Smooth the transition matrix
- Compute the stationary distribution
- Rank the sentences based on the stationary distribution
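
One possible way to put the steps above together is sketched below. It is not the only way to solve the exercise: the function name `textrank_scores`, the defaults `alpha=0.85` and `n_iter=100`, the removal of self-similarity, the uniform fallback for sentences with no similar neighbours, and the toy sentences are all assumptions made for this example. It uses scikit-learn's built-in English stop word list instead of the NLTK one to stay self-contained.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_scores(sentences, alpha=0.85, n_iter=100):
    """Score sentences by the stationary distribution of a similarity-based chain."""
    # 1. TF-IDF matrix of the sentences
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # 2. Cosine similarity matrix (self-similarity removed, an assumption here)
    S = cosine_similarity(tfidf)
    np.fill_diagonal(S, 0.0)
    n = S.shape[0]
    # 3. Normalize rows to get a transition matrix; rows with no similar
    #    sentence fall back to a uniform distribution
    row_sums = S.sum(axis=1, keepdims=True)
    T = np.divide(S, row_sums, out=np.full_like(S, 1.0 / n), where=row_sums > 0)
    # 4. Smooth the transition matrix
    P = alpha * T + (1 - alpha) * np.full((n, n), 1.0 / n)
    # 5. Stationary distribution by power iteration
    p = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        p = p @ P
    return p

# Toy sentences for illustration only
toy = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Stock markets fell sharply today.",
    "The mouse hid under the mat.",
]
tr_scores = textrank_scores(toy)
# 6. Rank the sentences by their stationary probability
for i in np.argsort(-tr_scores):
    print(f"{tr_scores[i]:.3f}: {toy[i]}")
```

Power iteration is used here rather than an eigendecomposition because it mirrors the random-walk explanation; either approach should give the same ranking.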


In [72]:
# sents


['\nUK retail sales fell in December, failing to meet expectations and making it by some counts the worst Christmas since 1981.',
 'Retail sales dropped by 1% on the month in December, after a 0.6% rise in November, the Office for National Statistics (ONS) said.',
 'The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%.',
 'A number of retailers have already reported poor figures for December.',
 'Clothing retailers and non-specialist stores were the worst hit with only internet retailers showing any significant growth, according to the ONS.',
 'The last time retailers endured a tougher Christmas was 23 years previously, when sales plunged 1.7%.',
 'The ONS echoed an earlier caution from Bank of England governor Mervyn King not to read too much into the poor December figures.',
 'Some analysts put a positive gloss on the figures, pointing out that the non-seasonally-adjusted figures showed a performance comparable with 2003.',
 'The November-De