## Summarization: NLP mini-projects

### <span style="color:coral">Text summarization (with NLTK)</span>

Another great library for working with text is [NLTK](http://www.nltk.org/), the Natural Language Toolkit. We have now collected a large set of news articles on a given topic and would like to pull out the most informative parts: from each text, we want to extract the single most informative sentence. This type of summarization is called **extractive**.

Take a look at this video: [Hillary Clinton's concession speech](https://www.vox.com/2016/11/9/13570328/hillary-clinton-concession-speech-full-transcript-2016-presidential-election). The webpage already provides the full transcript.



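We will approach this task as follows: split the text into sentences, tokenize each sentence into words, count how often each non-stop-word token occurs, normalize and filter these frequencies, score each sentence by summing the frequencies of its tokens, and pick the top-scoring sentence.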

In [18]:
# load the transcript, lowercase it, and replace non-breaking spaces

with open('data/trainhillary.txt') as f:
    text = f.read().lower().replace('\xa0', ' ')
print('corpus length:', len(text))
text[0:300]

corpus length: 6390


'thank you. thank you all very much. thank you so much.\n\nvery rowdy group. thank you, my friends. thank you. thank you. thank you so very much for being here. i love you all, too.\n\nlast night i congratulated donald trump and offered to work with him on behalf of our country.\n\ni hope that he will be a'

In [19]:
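import nltk
from nltk.corpus import stopwords
from collections import defaultdict
# If the NLTK data is missing, download it once with
# nltk.download('punkt') and nltk.download('stopwords')
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt').
stopWords = set(stopwords.words("english"))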

In [20]:
sentences = nltk.sent_tokenize(text)
len(sentences)
sentences[0:10]

['thank you.',
 'thank you all very much.',
 'thank you so much.',
 'very rowdy group.',
 'thank you, my friends.',
 'thank you.',
 'thank you.',
 'thank you so very much for being here.',
 'i love you all, too.',
 'last night i congratulated donald trump and offered to work with him on behalf of our country.']

In [21]:
word_sent = [nltk.word_tokenize(s.lower()) for s in sentences]

In [22]:
word_sent[0:5]

[['thank', 'you', '.'],
 ['thank', 'you', 'all', 'very', 'much', '.'],
 ['thank', 'you', 'so', 'much', '.'],
 ['very', 'rowdy', 'group', '.'],
 ['thank', 'you', ',', 'my', 'friends', '.']]

In [23]:
# compute frequencies of all tokens that are not stop words
freq = defaultdict(int)
for sentence in word_sent:
    for word in sentence:
        if word not in stopWords:
            freq[word] += 1
len(freq)

353
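
Note that `nltk.word_tokenize` emits punctuation as separate tokens (see the `'.'` and `','` entries in `word_sent` above), and the English stop-word list contains only words, so punctuation is counted here along with the words. If that is not wanted, punctuation tokens can be skipped while counting. The following is a minimal sketch on toy data; `count_content_words`, `toy_word_sent` and `toy_stop_words` are hypothetical names introduced for illustration, and `string.punctuation` covers ASCII punctuation only (typographic characters such as `’` or `—` would need to be added by hand).

```python
from collections import Counter
import string

# Toy tokenized sentences in the same shape as `word_sent` (illustrative data only).
toy_word_sent = [
    ['thank', 'you', '.'],
    ['thank', 'you', ',', 'my', 'friends', '.'],
]
# Tiny stand-in for the NLTK stop-word set (assumption for this sketch).
toy_stop_words = {'you', 'my'}

def count_content_words(tokenized_sentences, stop_words):
    """Count tokens that are neither stop words nor pure ASCII punctuation."""
    counts = Counter()
    for sentence in tokenized_sentences:
        for token in sentence:
            is_punct = all(ch in string.punctuation for ch in token)
            if token not in stop_words and not is_punct:
                counts[token] += 1
    return counts

toy_freq = count_content_words(toy_word_sent, toy_stop_words)
print(toy_freq.most_common())
```

The rest of this section keeps punctuation in `freq`, so the outputs shown here correspond to the unfiltered counts.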

In [24]:
m = float(max(freq.values()))
m

77.0

In [25]:
for word in freq.keys():
    freq[word] = freq[word]/m

In [26]:
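min_cut = 0.2
max_cut = 0.8
freq_new = defaultdict(int)
for word in freq.keys():
    # `not` binds tighter than `or`, so this condition reads
    # (freq[word] <= max_cut) or (freq[word] < min_cut): only words above
    # max_cut are dropped, and min_cut has no effect.
    if not freq[word] > max_cut or freq[word] < min_cut:
        freq_new[word] = freq[word]
freq = freq_new
del freq_new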

In [27]:
len(freq)

351
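
Only two words were removed because, in Python, `not` binds more tightly than `or`: the filter condition keeps every word at or below `max_cut`, and `min_cut` never takes effect. If the intent is to drop both very frequent and very rare words, the test can be written as a range check. The sketch below uses made-up frequencies, and `filter_by_frequency` is a hypothetical helper, not part of the notebook; applying it to the speech would be a much stricter filter and would change the outputs shown in this section.

```python
def filter_by_frequency(freq, min_cut, max_cut):
    """Keep words whose normalized frequency lies in [min_cut, max_cut]."""
    return {word: f for word, f in freq.items() if min_cut <= f <= max_cut}

# Toy normalized frequencies (illustrative values, not taken from the speech).
toy_freq = {'.': 1.0, 'thank': 0.5, 'country': 0.25, 'rowdy': 0.0625}

# Range check: both cut-offs apply.
kept = filter_by_frequency(toy_freq, min_cut=0.2, max_cut=0.8)

# Condition as written in the cell above: only the upper cut-off applies.
as_written = {w: f for w, f in toy_freq.items() if not f > 0.8 or f < 0.2}

print(sorted(kept))
print(sorted(as_written))
```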

In [28]:
# score each sentence by summing the frequencies of its retained words
ranking = defaultdict(int)
for i, sentence in enumerate(word_sent):
    for word in sentence:
        if word in freq:
            ranking[i] += freq[word]

In [29]:
from heapq import nlargest
sentences_index = nlargest(1, ranking, key=ranking.get)
print(sentences_index)
sentences[sentences_index[0]]

[35]


'we’ve spent a year and a half bringing together millions of people from every corner of our country to say with one voice that we believe that the american dream is big enough for everyone—for people of all races, and religions, for men and women, for immigrants, for lgbt people, and people with disabilities.'

In [30]:
## All of it in one function
from heapq import nlargest
import nltk
from nltk.corpus import stopwords
from collections import defaultdict

stopWords = set(stopwords.words("english"))

In [31]:
def summarize_text(text, stopWords, min_cut, max_cut, ntop=1):
    """Return the ntop highest-scoring sentences of text as a list."""
    sentences = nltk.sent_tokenize(text)
    
    word_sent = [nltk.word_tokenize(s.lower()) for s in sentences]
    
    # compute frequencies 
    freq = defaultdict(int)
    for sentence in word_sent:
        for word in sentence:
            if word not in stopWords:
                freq[word] +=1

    # normalize frequencies
    m = float(max(freq.values()))
    for word in freq.keys():
        freq[word] = freq[word]/m
 
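    # cut off too frequent words; because `not` binds tighter than `or`,
    # this keeps every word with freq < max_cut, and min_cut has no effect
    # (as long as min_cut < max_cut)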
    freq_new = defaultdict(int)
    for word in freq.keys():
        if not freq[word] >= max_cut or freq[word] <= min_cut:
            freq_new[word] = freq[word]
    freq = freq_new
    del freq_new
    
    # rank sentences
    ranking = defaultdict(int)
    for i, sentence in enumerate(word_sent):
        for word in sentence:
            if word in freq:
                ranking[i] +=freq[word]
                
    sentences_index = nlargest(ntop, ranking, key=ranking.get)
    summary = [sentences[i] for i in sentences_index]
    return summary

In [32]:
min_cut = 0.2
max_cut = 0.8
ntop = 1
summary = summarize_text(text, stopWords, min_cut, max_cut, ntop)
print('SUMMARIZING SENTENCE : \n' + str(summary[0]))

SUMMARIZING SENTENCE : 
we’ve spent a year and a half bringing together millions of people from every corner of our country to say with one voice that we believe that the american dream is big enough for everyone—for people of all races, and religions, for men and women, for immigrants, for lgbt people, and people with disabilities.
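
With `ntop` greater than 1, `nlargest` returns sentence indices ordered by score, so a multi-sentence summary comes out in ranking order rather than in the order of the speech. If reading order matters, the selected indices can be sorted before the sentences are looked up. This is a minimal sketch on toy data; `top_sentences_in_order`, `toy_sentences` and `toy_ranking` are hypothetical names used only for illustration.

```python
from heapq import nlargest

def top_sentences_in_order(sentences, ranking, ntop):
    """Pick the ntop highest-scoring sentences, returned in document order."""
    top_indices = nlargest(ntop, ranking, key=ranking.get)
    return [sentences[i] for i in sorted(top_indices)]

# Toy sentences and scores (illustrative values only).
toy_sentences = ['first sentence.', 'second sentence.', 'third sentence.']
toy_ranking = {0: 0.4, 1: 0.1, 2: 0.9}

print(top_sentences_in_order(toy_sentences, toy_ranking, ntop=2))
```

Here the two best-scoring sentences are the third and the first, and they are returned with the first sentence leading.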
