# Text Summarizer

## Overview
This is a text summarizer based loosely on [SMMRY](http://smmry.com/). It returns the most important sentences in a document. It works by assigning a score to each sentence in one of 2 ways:

1. summing the tf-idf values of its constituent words, or

2. averaging the distance from the sentence's word-count vector to the vectors of all the other sentences.

<!-- TEASER_END -->

Sentences are then ranked by score, and the top-scoring ones are returned in the order in which they appear in the original text.
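
The select-then-reorder step can be sketched on its own, without any of the scoring machinery. This is a minimal illustration, not the code used below: the helper name `top_sentences_in_order`, the toy sentences, and their scores are all made up for the example.

```python
def top_sentences_in_order(sentences, scores, k):
    """Return the k highest-scoring sentences in their original order."""
    # pair each score with its sentence index, then sort by score, best first
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    # keep the k best indices and sort them back into document order
    keep = sorted(index for index, _ in ranked[:k])
    return [sentences[index] for index in keep]


# hypothetical toy data: four sentences with made-up scores
sentences = ["A wolf met her.", "She walked on.", "The wolf ran ahead.", "It was sunny."]
scores = [2.5, 0.7, 3.1, 0.2]

print(top_sentences_in_order(sentences, scores, 2))
# ['A wolf met her.', 'The wolf ran ahead.']
```

The third sentence has the highest score, but it still comes second in the output because the kept indices are sorted back into document order.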

In [1]:
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from scipy.spatial.distance import pdist, squareform
import numpy as np

def summarize(text, p, scoring='sum'):
    '''
    Summarizes a text by returning only the most important sentences
    
    Inputs:
        text - str - the text to be summarized
        p - int or float - the number of sentences to return or, if p < 1, the fraction of sentences to return
        scoring - str - scoring method. To sum tf-idf values use 'sum'. To measure sentence-vector distance,
                        use one of the following metrics:
                            'braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation',
                            'cosine', 'dice', 'euclidean', 'hamming', 'jaccard', 'kulsinski',
                            'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao',
                            'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'

    Output: TextBlob - the summarized text
    '''
    
    # create blob, determine number of sentences to return
    blob = TextBlob(text)
    if p < 1:
        p = int(round(p*len(blob.sentences)))
    
    
    scores = []
    if scoring == 'sum':

        # create tf-idf matrix
        vec = TfidfVectorizer(stop_words='english')
        dtm = vec.fit_transform((str(x) for x in blob.sentences))
        
        # calculate and store scores
        for i, c in enumerate(dtm):
            scores.append((i, c.sum()))
    
    else:
        # create count matrix for distance method
        vec = CountVectorizer(stop_words='english')
        dtm = vec.fit_transform((str(x) for x in blob.sentences))
        
        # calculate pair-wise distances between each sentence
        dtm = dtm.toarray()
        y = pdist(dtm, scoring)
        
        # expand the condensed distance vector into a square distance matrix
        y = squareform(y)
        
        # calculate and store scores (mean distance of each sentence from the others)
        for i, c in enumerate(y):
            scores.append((i, np.nanmean(c)))

    # sort sentences by score, select the top p sentences, then re-order them chronologically
    top_sentences = sorted(scores, key=lambda x: x[1], reverse=True)
    summary = top_sentences[:p]
    summary = sorted(summary)
    
    result = TextBlob('')
    for i in summary:
        result += blob.sentences[i[0]]
    return result
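
To see what the distance branch computes, here is a dependency-free sketch of the same idea: score each row of a count matrix by its mean distance to every row (the row itself contributes a distance of 0, as it does on the diagonal of the `squareform` matrix). The helper names and the count vectors are hypothetical and chosen only to show that the choice of metric can change the ranking.

```python
import math


def euclidean(u, v):
    # straight-line distance: accumulates differences over every word
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def chebyshev(u, v):
    # largest difference in any single word count
    return max(abs(a - b) for a, b in zip(u, v))


def mean_distance_scores(vectors, metric):
    """Score each row by its mean distance to all rows (self included, at distance 0)."""
    n = len(vectors)
    return [
        sum(metric(vectors[i], vectors[j]) for j in range(n)) / float(n)
        for i in range(n)
    ]


# hypothetical word counts for three sentences over a five-word vocabulary
counts = [
    [0, 0, 0, 0, 0],
    [3, 0, 0, 0, 0],
    [2, 2, 2, 2, 0],
]

euclid_scores = mean_distance_scores(counts, euclidean)
cheb_scores = mean_distance_scores(counts, chebyshev)
print(euclid_scores)
print(cheb_scores)
```

With these made-up counts the third sentence gets the highest Euclidean score but the lowest Chebyshev score, because Chebyshev only looks at the single largest difference while Euclidean adds up differences across all words.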

## Example

In [2]:
with open('../data/little_red_riding_hood.txt') as f:
    story = f.read()
    
def percent_of_story(summary):
    return len(summary)/float(len(story))

In [3]:
tfidf = summarize(story, .4)
percent_of_story(tfidf)

0.6689223697650664

In [4]:
euclid = summarize(story, .4, scoring='euclidean')
percent_of_story(euclid)

0.666879468845761

In [5]:
cheb = summarize(story, .7, scoring='chebyshev')
percent_of_story(cheb)

0.6363636363636364

## Sample Output

In [6]:
for s in cheb.sentences:
    print(s)



Once upon a time there was a dear little girl who was loved by everyone who looked at her, but most of all by her grandmother, and there was nothing that she would not have given to the child.Once she gave her a little riding hood of red velvet, which suited her so well that she would never wear anything else; so she was always called 'Little Red Riding Hood.
'Set out before it gets hot, and when you are going, walk nicely and quietly and do not run off the path, or you may fall and break the bottle, and then your grandmother will get nothing; and when you go into her room, don't forget to say, "Good morning", and don't peep into every corner before you do it.
'The grandmother lived out in the wood, half a league from the village, and just as Little Red Riding Hood entered the wood, a wolf met her.Red Riding Hood did not know what a wicked creature he was, and was not at all afraid of him.
'Thank you kindly, wolf.
''To my grandmother's.
''What have you got in your apron?
''Cake and w

## Results and thoughts

Even with about one third removed, the story is fairly easy to follow. The summarizer dealt best with exposition and narration, and all the major plot points are covered. It doesn't do so well with dialogue, however, which makes sense, since deleting any portion of a dialogue interrupts the flow of the conversation.

The tf-idf sum method is heavily biased towards long sentences. I tried to temper this by assigning the *average* tf-idf as the score instead of the sum, but this led to a summary composed almost entirely of short bits of disconnected dialogue. The summaries produced by the distance methods are a bit shorter, but not by much. Different distance metrics return very different summaries, so the choice of metric is one way to tune the method to specific types of documents.
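
A rough way to see why the sum favours long sentences and the average favours short ones is the sketch below. It assumes that each sentence's tf-idf row is scaled to unit L2 length (the default `norm='l2'` in `TfidfVectorizer`) and, as a simplification, that every word in a sentence has the same weight. The helper `unit_row` is hypothetical and stands in for a real tf-idf row.

```python
import math


def unit_row(n_terms):
    """Equal weights for n_terms distinct words, scaled to unit L2 length."""
    return [1.0 / math.sqrt(n_terms)] * n_terms


# compare a short, a medium, and a long sentence
for n_terms in (2, 8, 32):
    row = unit_row(n_terms)
    total = sum(row)            # score under the 'sum' method
    mean = total / len(row)     # score under the averaging variant
    print(n_terms, round(total, 3), round(mean, 3))
```

Under these assumptions the sum grows like the square root of the number of distinct words, while the average shrinks at the same rate, so the two scoring rules pull in opposite directions with respect to sentence length.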

SMMRY mentions that it removes unnecessary clauses. Incorporating something like this could also substantially shorten the final summary. One final idea for further improvement is to choose automatically how much material is returned. This seems like it might be quite difficult, since a human is needed to evaluate the quality of the summary.