# This notebook contains implementations for summarizing a block of text

## Implementation 1: Basic Implementation Based on Most Important Words
---

**This implementation uses the frequency distribution of words to score each sentence, then selects the top-scoring sentences and returns them in their original order.**
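
As a rough illustration of this ranking idea, the sketch below scores three toy sentences. It is not the notebook's implementation: the regex tokenizer, the tiny `STOPWORDS` set, and the helper names `tokenize` and `score_sentences` are assumptions made so the example runs without NLTK.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's English list.
STOPWORDS = {"the", "a", "is", "and", "of", "to", "in", "on"}

def tokenize(text):
    """Lowercase the text and return its word tokens (crude stand-in for word_tokenize)."""
    return re.findall(r"[a-z']+", text.lower())

def score_sentences(sentences):
    """Return (sentence, score) pairs. A sentence's score is the sum of the
    relative frequencies of its non-stopword tokens across all sentences."""
    words = [w for s in sentences for w in tokenize(s) if w not in STOPWORDS]
    counts = Counter(words)
    total = float(sum(counts.values())) or 1.0  # avoid dividing by zero on empty input
    freq = {w: c / total for w, c in counts.items()}
    # Stopwords are absent from freq, so they contribute 0 to the score.
    return [(s, sum(freq.get(w, 0.0) for w in tokenize(s))) for s in sentences]

sentences = ["The cat sat on the mat.", "The cat slept.", "Dogs bark."]
scored = score_sentences(sentences)
best = max(scored, key=lambda pair: pair[1])[0]
```

Here the first sentence wins with a score of 4/7, because it contains the most frequent content word (`cat`) plus two other counted words. Since scores are sums, longer sentences tend to rank higher under this scheme.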

### Improvements (To Do)
1. Use POS tagging to give extra weight to specific parts of speech, such as nouns.
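
One possible way to realise this improvement is sketched below. The weighting scheme, the `noun_boost` parameter, and the helper name `pos_weighted_ranks` are hypothetical. The input is hand-tagged so the example runs without NLTK data; in the notebook, `nltk.pos_tag(word_tokenize(text))` would supply (word, tag) pairs of this shape.

```python
from collections import Counter

# Penn Treebank noun tags, as produced by taggers such as nltk.pos_tag.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def pos_weighted_ranks(tagged_tokens, noun_boost=2.0):
    """Map each lowercased word to its relative frequency, counting every
    noun occurrence `noun_boost` times (hypothetical weighting scheme)."""
    weights = Counter()
    for word, tag in tagged_tokens:
        weights[word.lower()] += noun_boost if tag in NOUN_TAGS else 1.0
    total = float(len(tagged_tokens)) or 1.0  # avoid dividing by zero on empty input
    return {word: weight / total for word, weight in weights.items()}

# Hand-tagged tokens (assumption) standing in for real tagger output.
tagged = [("Holmes", "NNP"), ("was", "VBD"), ("cold", "JJ"), ("Holmes", "NNP")]
ranks = pos_weighted_ranks(tagged)
```

A dictionary like `ranks` could then be passed to `score_sent` in place of the plain frequency ranks, so sentences rich in nouns score higher.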

In [56]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
import string

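class Summarizer:
    stopwords = stopwords.words("english")
    # Tokens ignored when counting words: stopwords plus punctuation.
    # Built once as a set so each membership test is fast.
    ignored_tokens = set(stopwords + list(string.punctuation))

    def __init__(self):
        pass

    def get_useful_bag_of_words(self,text):
        # Keep only the tokens that carry content.
        return [w for w in word_tokenize(text) if w not in self.ignored_tokens]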
    
    def score_sent(self,sent,word_rank_list):
        # A sentence's score is the sum of the ranks of the words it contains.
        score = 0
        for w in word_tokenize(sent.lower()):
            if w in word_rank_list:
                score += word_rank_list[w]
        return score
    
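    def reorder_sents(self,sents,text):
        # Restore document order by sorting on each sentence's position in the text.
        # A key function works in Python 2 and 3; a positional cmp function fails in Python 3.
        sents.sort(key=text.find)
        return sents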
    
    def get_summary(self,text,sent_num):
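        # Decode byte strings to text; already-decoded text passes through unchanged.
        sane_text = text.decode("utf-8") if isinstance(text, bytes) else text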
        
        word_freq_dist = FreqDist(self.get_useful_bag_of_words(sane_text.lower()))
        
        word_ranks = {}
        for word in word_freq_dist:
            word_ranks[word] = word_freq_dist.freq(word)
            
        sents = sent_tokenize(sane_text)
        
        if sent_num > len(sents):
            print("text already summarized !!!")
            return
        
        full_output = []
        for sent in sents:
            temp = {}
            temp['sent'] = sent
            temp['score'] = self.score_sent(sent,word_ranks)
            full_output.append(temp)
            
        sorted_full_output = sorted(full_output, key=lambda k: k['score'], reverse=True)
        top_output_sents = [s['sent'] for s in sorted_full_output[:sent_num]]
        reordered_sents = self.reorder_sents(top_output_sents,sane_text)
        # Separate the selected sentences with a space so they do not run together.
        return " ".join(reordered_sents)


In [57]:
test_summarizer = Summarizer()

test_text = """To Sherlock Holmes she is always the woman. I have
seldom heard him mention her under any other name. In his eyes she
eclipses and predominates the whole of her sex. It was not that he
felt any emotion akin to love for Irene Adler. All emotions, and that
one particularly, were abhorrent to his cold, precise but admirably
balanced mind. He was, I take it, the most perfect reasoning and
observing machine that the world has seen, but as a lover he would
have placed himself in a false position. He never spoke of the softer
passions, save with a gibe and a sneer. They were admirable things for
the observer-excellent for drawing the veil from men’s motives and
actions. But for the trained reasoner to admit such intrusions into
his own delicate and finely adjusted temperament was to introduce a
distracting factor which might throw a doubt upon all his mental
results. Grit in a sensitive instrument, or a crack in one of his own
high-power lenses, would not be more disturbing than a strong emotion
in a nature such as his. And yet there was but one woman to him, and
that woman was the late Irene Adler, of dubious and questionable
memory.
"""

test_text = ' '.join(test_text.strip().split('\n'))

summary = test_summarizer.get_summary(test_text,5)

print(summary)

All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position. But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results. Grit in a sensitive instrument, or a crack in one of his own high-power lenses, would not be more disturbing than a strong emotion in a nature such as his. And yet there was but one woman to him, and that woman was the late Irene Adler, of dubious and questionable memory.


## Implementation 2: Using the TextRank Algorithm
---

**This implementation builds a fully connected weighted graph whose nodes are the sentences and whose edge weights are the cosine similarities between them. It then scores the sentences with the built-in PageRank function of the `networkx` module and uses the top-scoring sentences as the summary.**
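
The two steps, weighting edges by similarity and running PageRank over them, can be sketched in plain Python. This is a simplified stand-in, not the `networkx` routine: the helper names `cosine_similarity` and `rank_sentences`, the fixed iteration count, and the treatment of isolated sentences are all assumptions made for illustration.

```python
import math
from collections import Counter

def cosine_similarity(words1, words2):
    """Cosine similarity between two bag-of-words token lists."""
    v1, v2 = Counter(words1), Counter(words2)
    dot = sum(v1[w] * v2[w] for w in set(v1) & set(v2))
    norm = (math.sqrt(sum(c * c for c in v1.values())) *
            math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

def rank_sentences(token_lists, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over the fully connected sentence graph."""
    n = len(token_lists)
    if n == 0:
        return []
    # Edge weights: pairwise cosine similarity, with no self-loops.
    sim = [[cosine_similarity(token_lists[i], token_lists[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the edge weight.
            # Simplification: a sentence with no similar neighbour passes on nothing,
            # whereas networkx.pagerank redistributes the rank of such nodes.
            incoming = sum(scores[j] * sim[j][i] / out_weight[j]
                           for j in range(n) if out_weight[j] > 0)
            new_scores.append((1.0 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Pre-tokenized toy sentences (assumption): the first two share two words.
docs = [["holmes", "cold", "mind"], ["holmes", "mind", "machine"], ["irene", "adler"]]
scores = rank_sentences(docs)
```

The first two toy sentences reinforce each other and end up with equal, higher scores, while the third shares no words with them and keeps only the base score `(1 - damping) / n`.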

In [58]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import itertools
import math
import networkx
from collections import Counter

class TextRankSummarizer:
    stopwords = stopwords.words("english")
    
    def __init__(self):
        # Each instance owns its graph; a class-level graph would be shared by every instance.
        self.graph = networkx.Graph()
    
    def get_vector(self,text):
        words = [w for w in word_tokenize(text.lower()) if w not in self.stopwords]
        return dict(Counter(words))
        
    def calc_cosine_sim(self,vector1,vector2):
        
        intersection_set = set(vector1.keys()) & set(vector2.keys())
        dot_product = sum([vector1[x] * vector2[x] for x in intersection_set])

        vector1_mag = math.sqrt( sum([vector1[x]**2 for x in vector1.keys()]) )
        vector2_mag = math.sqrt( sum([vector2[x]**2 for x in vector2.keys()]) )
        
        denominator = vector1_mag*vector2_mag

        if not denominator:
            return 0.0
        else:
            return float(dot_product) / denominator
    
    def add_to_graph(self,n):
        self.graph.add_nodes_from(n)
        return
    
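    def reorder_sents(self,sents,text):
        # Restore document order by sorting on each sentence's position in the text.
        # A key function works in Python 2 and 3; a positional cmp function fails in Python 3.
        sents.sort(key=text.find)
        return sents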
    
    def get_summary(self,text,num_sent):
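        # Decode byte strings to text; already-decoded text passes through unchanged.
        sane_text = text.decode("utf-8") if isinstance(text, bytes) else text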
        sents = sent_tokenize(sane_text)
        
        if num_sent > len(sents):
            print("text already summarized !!!")
            return
        
        # Start from an empty graph so sentences from earlier calls do not leak in.
        self.graph.clear()
        self.add_to_graph(sents)
        
        sent_pairs = list(itertools.combinations(sents, 2))
        
        for pair in sent_pairs:
            sent1 = pair[0]
            sent2 = pair[1]
            vec1 = self.get_vector(sent1)
            vec2 = self.get_vector(sent2)
            
            cosine_sim = self.calc_cosine_sim(vec1, vec2)
            self.graph.add_edge(sent1, sent2, weight=cosine_sim)
            
        output_sents_score = networkx.pagerank(self.graph, weight='weight')
        
        output_sorted = sorted(output_sents_score, key=output_sents_score.get, reverse=True)
        
        reordered_sents = self.reorder_sents(output_sorted[:num_sent],sane_text)
        
        # Separate the selected sentences with a space so they do not run together.
        return " ".join(reordered_sents)

In [59]:
test_summarizer = TextRankSummarizer()

test_text = """To Sherlock Holmes she is always the woman. I have
seldom heard him mention her under any other name. In his eyes she
eclipses and predominates the whole of her sex. It was not that he
felt any emotion akin to love for Irene Adler. All emotions, and that
one particularly, were abhorrent to his cold, precise but admirably
balanced mind. He was, I take it, the most perfect reasoning and
observing machine that the world has seen, but as a lover he would
have placed himself in a false position. He never spoke of the softer
passions, save with a gibe and a sneer. They were admirable things for
the observer-excellent for drawing the veil from men’s motives and
actions. But for the trained reasoner to admit such intrusions into
his own delicate and finely adjusted temperament was to introduce a
distracting factor which might throw a doubt upon all his mental
results. Grit in a sensitive instrument, or a crack in one of his own
high-power lenses, would not be more disturbing than a strong emotion
in a nature such as his. And yet there was but one woman to him, and
that woman was the late Irene Adler, of dubious and questionable
memory.
"""

test_text = ' '.join(test_text.strip().split('\n'))

summary = test_summarizer.get_summary(test_text,5)

print(summary)

All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position. He never spoke of the softer passions, save with a gibe and a sneer. Grit in a sensitive instrument, or a crack in one of his own high-power lenses, would not be more disturbing than a strong emotion in a nature such as his. And yet there was but one woman to him, and that woman was the late Irene Adler, of dubious and questionable memory.
