## This notebook collects the summarization techniques we tried, ranging from keyphrase extraction (using frequency, collocation, chunking, and WordNet) to key sentence extraction (sentences with the most frequent words, the Gensim summarizer, and classification combined with the Gensim summarizer). The final approach we used was key sentence extraction using classification and Gensim summarization, which uses the TextRank algorithm.

In [None]:
# Importing all the necessary libraries
import nltk, re, numpy as np
import urllib
from bs4 import BeautifulSoup
from nltk import word_tokenize,sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import string
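# Note: the gensim.summarization module was removed in Gensim 4.0, so this import needs gensim < 4.0
from gensim.summarization import summarize, keywords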
from nltk.probability import FreqDist

## Reading in the files
The following sections read in the text from a file.

In [None]:
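# Load the raw text of a document from disk
def loadText(path):
    # Decode as UTF-8 while reading; the context manager closes the file
    with open(path, encoding='utf8') as f:
        return f.read()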

tos_text = loadText("static/data/Google.txt")

In [None]:
text = tos_text

# Keyphrase Extraction Approach

Each of the following approaches (frequency, collocation, and chunking) follows two steps: 1. identifying candidate keyphrases, and 2. selecting keyphrases from those candidates.

## Approach 1 - Frequency

The following segment tries Approach 1, frequent terms:
- Frequent unigrams (with or without stemming, stop word removal, and other normalization)
- Frequent bigrams (with or without stemming, stop word removal, and other normalization)
- Other variations on frequent n-grams
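
The frequent n-gram variants listed above share one counting step. Below is a minimal plain-Python sketch of that step, assuming the text is already tokenized into a list of words. The helper name `most_common_ngrams` and the sample sentence are illustrative assumptions, not part of the notebook, which uses `nltk.FreqDist` instead.

```python
from collections import Counter

def most_common_ngrams(tokens, n, k):
    """Return the k most frequent n-grams (as tuples) with their counts."""
    # zip over n shifted views of the token list yields consecutive n-grams
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(k)

# Hypothetical sample text, for illustration only
sample_tokens = "we may change the terms and we may update the terms".split()

top_bigrams = most_common_ngrams(sample_tokens, n=2, k=2)
top_trigrams = most_common_ngrams(sample_tokens, n=3, k=1)
print(top_bigrams)
print(top_trigrams)
```

Setting `n=1` gives unigram counts, so the same helper covers all three variants in the list.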

### 1. Candidates identification

In [None]:
#Using FreqDist
# Raw string so the regex escapes (\w, \$, \d, \S) are not treated as string escapes
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
text_tokens = tokenizer.tokenize(text)

fdist = nltk.FreqDist(text_tokens)
print("20 Most frequent tokens: ")
print(fdist.most_common(20))

#### Normalizing
Normalizing: convert to lower case, remove punctuation and stop words, and stem the words.
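
The normalization steps named above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `SAMPLE_STOPWORDS` is a deliberately tiny, hypothetical stop word list (the notebook uses NLTK's English list), and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Snowball.

```python
import string

# Hypothetical, deliberately tiny stop word list (assumption for this sketch)
SAMPLE_STOPWORDS = {"the", "of", "and", "to", "a", "in", "is", "you", "our"}

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer, not Snowball."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least 3 characters of the word remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def normalize(tokens, stopwords=SAMPLE_STOPWORDS):
    """Lowercase, drop punctuation-only tokens, drop stop words, then stem."""
    lowered = (t.lower() for t in tokens)
    no_punct = (t for t in lowered
                if not all(ch in string.punctuation for ch in t))
    return [naive_stem(t) for t in no_punct if t not in stopwords]

sample = ["You", "agree", "to", "the", "Terms", ",", "including", "our", "policies", "."]
normalized = normalize(sample)
print(normalized)
```

Stop words are removed before stemming so that the stop word list can be matched against unmodified word forms.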

### 1. Candidate Identification

In [None]:
from nltk.collocations import *

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

### 2. Candidate Selection - Bigrams

In [None]:
# Uses `norm`, the normalized token list (lowercased, punctuation and stop words removed)

finder = BigramCollocationFinder.from_words(norm)

# Keep only bigrams that appear at least 3 times
finder.apply_freq_filter(3)

# returns the 10 bigrams with the highest PMI
print("\nTop 10 bigrams using PMI:")
print(finder.nbest(bigram_measures.pmi, 10))
#finder.score_ngrams(bigram_measures.pmi)

# Finds top 10 bigrams using the Pearson's Chi-squared test
print("\nTop 10 bigrams using Pearson's Chi-Squared Test:")
print(finder.nbest(bigram_measures.chi_sq, 10))
#finder.score_ngrams(bigram_measures.chi_sq)


In [None]:
# Finds top 10 bigrams using the likelihood ratio

print("\nTop 10 bigrams using Likelihood Ratio:")
print(finder.nbest(bigram_measures.likelihood_ratio, 10))
#finder.score_ngrams(bigram_measures.likelihood_ratio)

I also tried collocations on text that was not normalized, but that did not yield meaningful results: the stop words were included and ended up among the most frequent pairs.
E.g.: ('of', 'the'), ('in', 'the'), ('the', 'jungle'), ('he', 'was'), etc.
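
For intuition about the PMI score used above, here is a plain-Python sketch of the textbook definition, PMI(x, y) = log2(P(x, y) / (P(x) P(y))), estimated from raw counts. It is not NLTK's implementation; `bigram_pmi`, `min_freq`, and the sample tokens are assumptions made for illustration.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_freq=1):
    """Score adjacent word pairs by pointwise mutual information (log base 2)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_freq:
            continue  # skip rare pairs, whose PMI estimates are unreliable
        # log2( P(w1, w2) / (P(w1) * P(w2)) ) using count-based estimates
        scores[(w1, w2)] = math.log2(
            count * n / (unigram_counts[w1] * unigram_counts[w2])
        )
    return scores

# Hypothetical sample text, for illustration only
sample_tokens = ("terms of service apply to google services "
                 "and terms of service may change").split()

pmi_scores = bigram_pmi(sample_tokens, min_freq=2)
for pair, score in sorted(pmi_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 3))
```

The `min_freq` cutoff plays the same role as `apply_freq_filter(3)` in the cells above: PMI tends to favor rare pairs, so very infrequent bigrams are usually filtered out before ranking.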

### 2. Candidate Selection - Trigrams

In [None]:
# Uses `norm`, the normalized token list (lowercased, punctuation and stop words removed)

finder = TrigramCollocationFinder.from_words(norm)

# Keep only trigrams that appear at least 3 times
finder.apply_freq_filter(3)

# return the 10 trigrams with the highest PMI
print("\nTop 10 trigrams using PMI:")
print(finder.nbest(trigram_measures.pmi, 10))
#finder.score_ngrams(trigram_measures.pmi)

# Finds top 10 trigrams using the Pearson's Chi-squared test
print("\nTop 10 trigrams using Pearson's Chi-Squared Test:")
print(finder.nbest(trigram_measures.chi_sq, 10))
#finder.score_ngrams(trigram_measures.chi_sq)


In [None]:
# Finds top 10 trigrams using the likelihood ratio
print("\nTop 10 trigrams using Likelihood Ratio:")
print(finder.nbest(trigram_measures.likelihood_ratio, 10))
#finder.score_ngrams(trigram_measures.likelihood_ratio)

## Approach 3 - Chunking

### 1. Candidate Identification

In [None]:
def tokenize_text(corpus):
    # Split the text into sentences, then each sentence into word tokens
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sents = sent_tokenizer.tokenize(corpus)
    return [nltk.word_tokenize(sent) for sent in raw_sents]

def create_chunker(grammar):
    return nltk.RegexpParser(grammar)

def run_chunker(ch, sentences):
    return [ch.parse(sent) for sent in sentences]

# Defining the grammar for the chunker
grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+}  # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
"""
tagged_sentences = nltk.pos_tag_sents(tokenize_text(text))

clause_chunker = create_chunker(grammar)

# Collecting the clauses and the first word of every NP chunk in a single pass
clauses = []
proper_nouns = []  # NOTE: holds the first word of each NP chunk, not only proper nouns (NNP)
for sent in tagged_sentences:
    tree = clause_chunker.parse(sent)
    for st in tree.subtrees():
        if st.label() == 'CLAUSE':
            clauses.append(st)
        elif st.label() == 'NP':
            proper_nouns.append(st.leaves()[0][0])

#print(clauses)
#print(proper_nouns)

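# Keep each clause that contains an NP with at least one collected word
candidates = []
for clause in clauses:
    for s in clause.subtrees():
        if s.label() == 'NP':
            # Each leaf is a (word, tag) pair; compare only the word
            if any(word in proper_nouns for word, _tag in s.leaves()):
                candidates.append(clause)
                break  # one match is enough; move on to the next clause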

candidate_sentences = []
for candidate in candidates:
    candidate_sentences.append(' '.join([l[0] for l in candidate.leaves()]))

### 2. Candidate Selection

In [None]:
punct = set(string.punctuation)
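# Punctuation tokens to drop, plus quote-attached punctuation and the possessive 's found in this text.
# A set gives exact token matching; a plain string would do substring matching instead.
all_punct = set(string.punctuation) | {"--", '."', ',"', '?"', "'s", '!"', '"'}
# Drop punctuation tokens and convert the rest to lower case
text_nopunct = [x.lower() for x in text_tokens if x not in all_punct]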
#print("Total number of words: ", len(text_nopunct),"\n")

norm_fdist = nltk.FreqDist(text_nopunct)
unique_incl_stop = len(norm_fdist.keys())
#print("Total number of unique words - without removing stop words: ", unique_incl_stop)

stop = set(stopwords.words('english'))
norm = [i for i in text_nopunct if i not in stop]    #Remove stopwords
norm_fdist = nltk.FreqDist(norm)


### 2. Keyphrase selection

In [None]:
print("\nNormalized Text:")
#print("Total number of unique words post normalizing: ",len(norm_fdist.keys()))
print("\n20 Most frequent words: ")
print(norm_fdist.most_common(20))

#### Stemming
Using stemming on unigrams, post normalizing 

In [None]:
from nltk.stem.snowball import SnowballStemmer

#Results post stemming - Using Snowball stemmer
stemmer = SnowballStemmer("english")
text_stem = [stemmer.stem(i) for i in norm]
stem_fdist = nltk.FreqDist(text_stem)
print("\nUsing stemmed words: \n")
print("20 Most frequent words: ")
print(stem_fdist.most_common(20))


The output above is not very useful. Unigram frequency considers every word, and even after normalizing, the results are not very meaningful.

Repeating the above, but using bigrams and trigrams.

In [None]:
from nltk.util import ngrams
from nltk import bigrams, trigrams

# bigrams() and trigrams() return generators; materialize them as lists
bi_list = list(bigrams(norm))
tri_list = list(trigrams(norm))

bi_fdist = nltk.FreqDist(bi_list)
print("\n20 Most frequent bigrams: \n")
print(bi_fdist.most_common(20))
print("\n\n20 Most frequent trigrams: \n")
tri_fdist = nltk.FreqDist(tri_list)
print(tri_fdist.most_common(20))

## Approach 2 - Collocation (Words)

Collocation can be done either with words or with parts of speech. Here, I have attempted it with words.

In [None]:
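# Greedily collect candidate sentences that fit within a 2000-character budget
MAX_CHARS = 2000
lchar = 0
finlist = []

# dict.fromkeys de-duplicates while keeping document order (a plain set has arbitrary order)
for sentence in dict.fromkeys(candidate_sentences):
    ls = len(sentence)
    if lchar + ls < MAX_CHARS:
        finlist.append(sentence)
        lchar += ls

print("\n".join(finlist))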

## Approach 4 - WordNet