# Sentence scores ranking for text summarization
This approach looks for the sentences that are most similar to the other sentences in the text. Using sentence embeddings, we compute the cosine similarity between sentences. For each sentence, we sum up its similarities to the sentences in the text, and that sum becomes the sentence's `similarity score`. We assume that the higher the similarity score, the more likely the sentence is to be a key sentence of the text. We keep the `TOP 5` results as a rough summary of the text.

Dino: Good job, Nan, summarizing what you are going to do in the notebook!

## Preprocessing data
First, we select the article `ON TACTICS AGAINST JAPANESE IMPERIALISM`, written by Mao. Because it is a political article, its key sentences should be relatively easy to find and to summarize.

In [1]:
import pandas as pd

In [2]:
text = ''

with open('data/ON_TACTICS_AGAINST_JAPANESE_IMPERIALISM.txt', 'r') as reader:
    text = reader.read()
    
    
len(text)

44958

In [3]:
import re
# remove reference mark, e.g. '[1]'
text = re.sub(r'\[\d{1,3}\]' ,'', text)
len(text)

44819
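To illustrate what the pattern in the cell above removes, here is a small check on a made-up string (the sample text is hypothetical, not taken from the article). Only bracketed runs of one to three digits are stripped; other bracketed text is left alone.

```python
import re

# Hypothetical sample; the real article text is read from the file above.
sample = "The enemy is strong.[1] We must unite.[23] See note [a]."

# Same pattern as the cell above: a bracketed run of 1 to 3 digits.
cleaned = re.sub(r'\[\d{1,3}\]', '', sample)
print(cleaned)
```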

In [4]:
# import nltk
# nltk.download('punkt')

Dino: Good idea to use `sent_tokenize` API

In [5]:
from nltk.tokenize import sent_tokenize

# split into sentences

sentences = sent_tokenize(text)

len(sentences)

332

## Sentence Transformers

Dino: What is the difference between the `bert-base-nli-mean-tokens` model, and the `stsb-roberta-large` one?

In [6]:
## get embedding

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')

sentence_embeddings = model.encode(sentences)

sentence_embeddings

array([[ 0.44056144,  0.5195048 ,  2.5450442 , ...,  0.5283022 ,
        -0.46262807,  0.19722113],
       [-0.24993764, -0.7121983 ,  1.2375451 , ..., -1.0851908 ,
        -0.77441585,  0.10610564],
       [-0.331048  , -0.12393019,  1.3653917 , ..., -0.72380173,
        -0.92983496,  0.26965213],
       ...,
       [ 0.19709875, -0.39470232,  3.0218735 , ..., -0.4574634 ,
        -0.34476504,  0.0278206 ],
       [ 0.52104473, -0.29295123,  1.9758993 , ...,  0.1811067 ,
        -0.92158616, -0.10381993],
       [-0.7256513 ,  0.35026634,  0.09495329, ..., -0.47348344,
        -0.92776227, -0.3470151 ]], dtype=float32)

Dino: Why do you normalize?

In [7]:
from sklearn.preprocessing import normalize
import numpy as np

# Normalize each embedding to unit L2 length, so that the dot product
# of two rows equals their cosine similarity
norm_data = normalize(sentence_embeddings, norm='l2')


def get_sum_scores(sentence_idx):
    """Sum the cosine similarities between one sentence and every
    sentence in the text (including itself).
    @param: sentence_idx, index of sentence
    @return score(float)
    """
    scores = np.dot(norm_data, norm_data[sentence_idx].T)
    return np.sum(scores)
    
    
scores_list = []
for i in range(len(sentences)):
    scores_list.append(get_sum_scores(i))


Dino: Algebraically, the dot product of two vectors is the sum of the products of the corresponding entries of the two sequences of numbers. Geometrically, it is the product of the Euclidean magnitudes of the two vectors *and* the cosine of the angle between them. It is *not* the cosine, unless you divide by the magnitude afterwards...

However, since you normalized the vectors beforehand, it actually is :-)

You may want to explain this in a markdown cell and motivate the reader beforehand. Otherwise, it is very painful for the reader to understand what you are doing. Your job, when you write a notebook, is to make it easy ***for the reader***, not for you. It's essentially like designing a UI. Do you mean the UI to be used by ***you***, or the rest of the world?
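The point about normalization can be checked numerically. The sketch below uses two arbitrary, hypothetical vectors in place of real sentence embeddings and compares the textbook cosine formula with the plain dot product of the L2-normalized vectors.

```python
import numpy as np

# Two arbitrary, hypothetical vectors standing in for sentence embeddings.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity by definition: dot product divided by both magnitudes.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# L2-normalize first, then take the plain dot product.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = np.dot(a_unit, b_unit)

print(cosine, dot_of_units)  # both are 24 / 25 = 0.96, up to rounding
```

This is why the scoring cell can use `np.dot` directly once the embeddings have been normalized.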

In [8]:
# find out the max scores sentence
max_score = 0
max_idx = -1
for idx, sc in enumerate(scores_list):
    if(sc > max_score):
        max_score = sc
        max_idx = idx

print(max_score, max_idx)
print("max score sentence is: "+ sentences[max_idx])

168.27069 195
max scores sentence is: For all that, China's revolutionary war will remain a protracted one; this follows from the strength-of imperialism and the uneven development of the revolution.


Dino: The calculus seems a bit careless. You want to find the sentence that is the closest match to every other sentence in the text. In other words, you are attempting to do a clustering and to find the centroid of the cluster, and then the point closest to that centroid. But you have not convinced me that this is the way to do it.

To start, you need to compare *every* sentence to *every other* sentence at least (this is $N^2$ complexity), and to me it looks like your computations above are linear in complexity.
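One way to make the all-pairs comparison explicit is to build the full $N \times N$ similarity matrix and sum its rows. The sketch below uses small random vectors as a hypothetical stand-in for the embeddings. It also suggests that calling a per-sentence dot product against all rows, once for each of the $N$ sentences, yields the same row sums, so the loop over sentences already performs $N^2$ comparisons in total.

```python
import numpy as np

# Hypothetical stand-in for the embeddings: 4 "sentences", 3 dimensions.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 3))

# L2-normalize each row so that dot products are cosine similarities.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# N x N matrix: entry (i, j) is the cosine similarity of sentences i and j.
sim = unit @ unit.T

# Row sums give one score per sentence; each includes the self-similarity 1.0.
row_scores = sim.sum(axis=1)

# The same scores computed one sentence at a time, as in a per-sentence loop.
loop_scores = np.array([np.sum(unit @ unit[i]) for i in range(len(unit))])

print(np.allclose(row_scores, loop_scores))
```

Every row sum includes the same constant 1.0 for the sentence's similarity to itself, so subtracting it would not change the ranking.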

In [9]:
# find out the top 5(min heap algorithm)
import heapq

minHeap = []

for idx, sc in enumerate(scores_list):
    if len(minHeap) == 5:
        if(sc > minHeap[0][0]):
            heapq.heapreplace(minHeap, (sc, idx))
    else:
        heapq.heappush(minHeap, (sc, idx))

ans = []
while len(minHeap) > 0 :
    ans.append(heapq.heappop(minHeap))
    
ans    

[(167.46082, 152),
 (167.6716, 257),
 (167.67487, 87),
 (168.06612, 198),
 (168.27069, 195)]
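The manual min-heap above can probably be shortened with `heapq.nlargest` from the standard library. The sketch below uses a hypothetical list of toy scores rather than the notebook's `scores_list`, and reverses the result to match the ascending order produced by popping the heap.

```python
import heapq

# Hypothetical scores; in the notebook these would come from scores_list.
toy_scores = [0.2, 0.9, 0.5, 0.7, 0.1, 0.8, 0.3]

# Pair each score with its index and keep the 5 largest pairs.
top5_desc = heapq.nlargest(5, ((sc, idx) for idx, sc in enumerate(toy_scores)))

# nlargest returns highest first; reverse to get the ascending order used above.
top5_asc = top5_desc[::-1]
print(top5_asc)
```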

In [10]:
# print out the top 5 sentences
for sc, idx in ans:
    print(sentences[idx])

Whoever questions our ability to lead the revolutionary war will fall into the morass of opportunism.
Not only are the Communist Party and the Red Army serving as the initiator of a national united front against Japan today, but in the future too they will inevitably become the powerful mainstay of China's anti-Japanese government and army, capable of preventing the Japanese imperialists and Chiang Kai-shek from carrying through their policy of disrupting this united front.
Therefore, we emphatically assert that when the national crisis reaches a crucial point, splits will occur in the Kuomintang camp.
But we must also say that imperialism is still a force to be earnestly reckoned with, that the unevenness in the development of the revolutionary forces is a serious weakness, and that to defeat our enemies we must be prepared to fight a protracted war; this is another characteristic of the present revolutionary situation.
For all that, China's revolutionary war will remain a protracted 

Dino: It does not make sense to me that there be only one main point in the corpus, and that all other sentences are a variation of that point. So I don't think this approach will work. Even top-5 won't work, because it's the top-5 closest to ***one*** centroid. You want different centroids.
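A rough sketch of the "different centroids" idea follows. It is not part of the notebook: it swaps in a basic k-means written with NumPy only, clusters the vectors into `k` groups, and returns the row closest to each centroid. The function name `pick_representatives`, the farthest-point initialization, and the toy 2-D data are all assumptions made for illustration. With real sentence embeddings, `k` would have to be chosen and the result inspected by hand.

```python
import numpy as np

def pick_representatives(vectors, k, n_iter=20):
    """Cluster row vectors with a basic k-means and return, for each cluster,
    the index of the row closest to its centroid."""
    # Deterministic farthest-point initialization: start from row 0, then
    # repeatedly add the row farthest from the rows chosen so far.
    chosen = [0]
    for _ in range(1, k):
        d = np.linalg.norm(vectors[:, None, :] - vectors[chosen][None, :, :], axis=2)
        chosen.append(int(np.argmax(d.min(axis=1))))
    centroids = vectors[chosen].copy()

    for _ in range(n_iter):
        # Assign every row to its nearest centroid.
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = vectors[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)

    # For each centroid, pick the nearest actual row as the representative.
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return sorted(int(i) for i in dists.argmin(axis=0))

# Hypothetical 2-D data with two obvious groups of three points each.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
reps = pick_representatives(toy, k=2)
print(reps)
```

On the toy data this picks the middle point of each group, so the selected rows come from different regions of the space instead of all sitting near a single global centroid.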

## Using pretrained Summarization model
We would like to compare our `Top 5` result with a state-of-the-art text summarization model. Here we select the pretrained `T5` model from Google.

In [11]:
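# using Google's T5 model
# AutoModelForSeq2SeqLM replaces the deprecated AutoModelWithLMHead
# for encoder-decoder models such as T5
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")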




In [12]:
# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)

In [13]:
# generate summary
outputs = model.generate(inputs, max_length=180, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True)

In [14]:
# decode the summary
print(tokenizer.decode(outputs[0]))

<pad> a great change has now taken place in the political situation in china. it's main characteristic is that Japanese imperialism wants to turn China into a colony. the workers and the peasants are all demanding resistance.</s>


## Pretrained summarization model for the top-5 ranked sentences
What if we use the summarization model to summarize ONLY the top-5 sentences? Will it give us a more concise result?

In [11]:
top5 = ''.join([sentences[idx] for sc, idx in ans])

top5

"Whoever questions our ability to lead the revolutionary war will fall into the morass of opportunism.Not only are the Communist Party and the Red Army serving as the initiator of a national united front against Japan today, but in the future too they will inevitably become the powerful mainstay of China's anti-Japanese government and army, capable of preventing the Japanese imperialists and Chiang Kai-shek from carrying through their policy of disrupting this united front.Therefore, we emphatically assert that when the national crisis reaches a crucial point, splits will occur in the Kuomintang camp.But we must also say that imperialism is still a force to be earnestly reckoned with, that the unevenness in the development of the revolutionary forces is a serious weakness, and that to defeat our enemies we must be prepared to fight a protracted war; this is another characteristic of the present revolutionary situation.For all that, China's revolutionary war will remain a protracted one

In [13]:
# using T5 model
# AutoModelForSeq2SeqLM replaces the deprecated AutoModelWithLMHead
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_top5 = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer_top5 = AutoTokenizer.from_pretrained("t5-base")
inputs_top5 = tokenizer_top5.encode("summarize: " + top5, return_tensors="pt", max_length=512, truncation=True)
outputs_top5 = model_top5.generate(inputs_top5, max_length=180, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True)



In [14]:
## summary of the top-5 sentences
print(tokenizer_top5.decode(outputs_top5[0]))

<pad> china's revolutionary war will remain a protracted one. uneven development of revolutionary forces is a serious weakness. china's imperialism is still a force to be earnestly reckoned with.</s>


## Summarization model test -- A failure example
The pretrained model is powerful, but it does not always work. Below is an example: we took Mao's Wikipedia article, whose text is in chronological order, and the model could hardly capture the overall picture when summarizing it.
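One likely contributor is that the input is truncated to 512 tokens before generation, so the model only ever sees the beginning of a long article. A possible workaround, sketched below, is to pack consecutive sentences into chunks that fit the budget and summarize each chunk separately. The helper `chunk_sentences` is hypothetical, the word count is only a crude proxy for the tokenizer's token count, and the default of 350 words is an arbitrary guess, not a tuned value.

```python
def chunk_sentences(sentences, max_words=350):
    """Greedily pack consecutive sentences into chunks of at most max_words
    whitespace-separated words (a crude proxy for the tokenizer's count)."""
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Hypothetical example with a tiny budget so the splitting is visible.
demo = ["one two three.", "four five.", "six seven eight nine.", "ten."]
demo_chunks = chunk_sentences(demo, max_words=5)
print(demo_chunks)
```

Each chunk could then be prefixed with `"summarize: "` and passed through the same `encode` and `generate` calls used elsewhere in this notebook. Whether the per-chunk summaries combine into a good overall summary would need to be checked.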

In [1]:
text_wiki = ''

with open('data/mao_wiki.txt', 'r') as reader:
    text_wiki = reader.read()
    
    
len(text_wiki)

31284

In [2]:
import re
# remove reference mark, e.g. '[1]'
text_wiki = re.sub(r'\[\d{1,4}\]' ,'', text_wiki)
len(text_wiki)

30645

In [3]:
# using T5 model to summarize
# AutoModelForSeq2SeqLM replaces the deprecated AutoModelWithLMHead
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_wiki = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer_wiki = AutoTokenizer.from_pretrained("t5-base")
inputs_wiki = tokenizer_wiki.encode("summarize: " + text_wiki, return_tensors="pt", max_length=512, truncation=True)
outputs_wiki = model_wiki.generate(inputs_wiki, max_length=180, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True)



In [4]:
print(tokenizer_wiki.decode(outputs_wiki[0]))

<pad> 1917–19 Mao moved to Beijing, where his mentor Yang Changji took a job at peking university. he was snubbed by other students due to his rural Hunanese accent and lowly position. he joined Li's Study Group and "developed rapidly toward Marxism" during the winter of 1919.</s>


## Using Top-5 model for wiki
We would also like to test our `Top 5` model on the Wikipedia article and compare the result with that of the pretrained model.

In [5]:
from nltk.tokenize import sent_tokenize

# split into sentences

sentences = sent_tokenize(text_wiki)

len(sentences)

198

In [6]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')

sentence_embeddings = model.encode(sentences)

sentence_embeddings

array([[ 0.10726409,  1.0987113 , -0.09853894, ...,  0.21398848,
        -0.18605344,  0.4299076 ],
       [-1.0969067 ,  0.87336105,  0.07322763, ...,  0.22656934,
        -0.00875739, -0.26058328],
       [-0.59271616,  0.1364042 ,  0.17228283, ..., -0.5263024 ,
        -0.15850013,  0.5666151 ],
       ...,
       [ 0.25065625,  0.23724607, -0.04016513, ...,  0.4672328 ,
         1.4903196 , -0.24578984],
       [-0.7199149 ,  0.5134689 , -0.61556673, ..., -0.08170696,
         0.10948969,  0.47148797],
       [-0.34347   ,  1.011436  , -0.07794081, ...,  0.09482689,
         0.4004276 , -0.04142505]], dtype=float32)

In [7]:
from sklearn.preprocessing import normalize
import numpy as np

# Normalize the data
norm_data = normalize(sentence_embeddings, norm='l2')


def get_sum_scores(sentence_idx):
    """input sentence index, get the sum of scores
    @param: sentence_idx, index of sentence
    @return scores(float)
    """
    scores = np.dot(norm_data, norm_data[sentence_idx].T)
    return np.sum(scores)
    
    
scores_list = []
for i in range(len(sentences)):
    scores_list.append(get_sum_scores(i))

In [8]:
# top 5
import heapq

minHeap = []

for idx, sc in enumerate(scores_list):
    if len(minHeap) == 5:
        if(sc > minHeap[0][0]):
            heapq.heapreplace(minHeap, (sc, idx))
    else:
        heapq.heappush(minHeap, (sc, idx))

ans = []
while len(minHeap) > 0 :
    ans.append(heapq.heappop(minHeap))
    
for sc, idx in ans:
    print(sentences[idx])

These demonstrations ignited the nationwide May Fourth Movement and fueled the New Culture Movement which blamed China's diplomatic defeats on social and cultural backwardness.
In the winter of 1925, Mao fled to Guangzhou after his revolutionary activities attracted the attention of Zhao's regional authorities.
Although Chiang intended to ignore Mao's message and continue the civil war, he was arrested by one of his own generals, Zhang Xueliang, in Xi'an, leading to the Xi'an Incident; Zhang forced Chiang to discuss the issue with the Communists, resulting in the formation of a United Front with concessions on both sides on December 25, 1937.
In December 1919, Mao helped organise a general strike in Hunan, securing some concessions, but Mao and other student leaders felt threatened by Zhang, and Mao returned to Beijing, visiting the terminally ill Yang Changji.
Civil War
Main articles: Chinese Civil War and Chinese Communist Revolution
The Nanchang and Autumn Harvest Uprisings: 1927


## Text Summarization Model

State of the art model: Pegasus, see: https://ai.googleblog.com/2020/06/pegasus-state-of-art-model-for.html

At Google I/O 2021, Google announced LaMDA, a language model for dialogue applications.

Dino: I think we need a better understanding of the SOTA of text summarization. I doubt that there's only one supervised approach.