# Film Script Analyzer

## Summarization

Summarization extracts the most meaningful information from a larger source. For a script, that means picking out the sentences that capture its essence and could uniquely identify it.

To do this, a statistical method is used. Word frequencies are calculated and normalized by the count of the most frequent word. Very high-frequency words usually carry little information (much like stopwords), while very low-frequency words may not convey the spirit of the script, since they may well be isolated instances. Words at both extremes are discarded.
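
To make the cutoff idea concrete, here is a minimal sketch on a toy token list that uses only the standard library. The data and the names `tokens`, `relative`, and `kept` are illustrative assumptions, not part of the notebook; the thresholds 0.1 and 0.9 mirror the defaults of the notebook's `frequencies` function.

```python
from collections import Counter

# Toy tokens (hypothetical data, chosen so the numbers are easy to follow).
tokens = ["bali"] * 10 + ["italy"] * 5 + ["guru"] * 4 + ["pizza"] * 1

counts = Counter(tokens)
top = max(counts.values())

# Normalize every count by the count of the most frequent word.
relative = {w: c / top for w, c in counts.items()}

# Keep only the words strictly between the two cutoffs.
min_cut, max_cut = 0.1, 0.9
kept = {w: f for w, f in relative.items() if min_cut < f < max_cut}

print(relative)  # {'bali': 1.0, 'italy': 0.5, 'guru': 0.4, 'pizza': 0.1}
print(kept)      # {'italy': 0.5, 'guru': 0.4}
```

With this normalization, the most frequent remaining word always scores 1.0, so it is always discarded whenever `max_cut` is below 1.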

In [8]:
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus   import stopwords
from collections   import defaultdict
from string        import punctuation
from heapq         import nlargest

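def frequencies(word_sent,min_cut=0.1,max_cut=0.9):
    # Ignore English stopwords and punctuation tokens
    stopw = set(stopwords.words('english')+list(punctuation))
    freq  = defaultdict(int)
    
    # Count every remaining word across all tokenized sentences
    for s in word_sent:
        for word in s:
            if word not in stopw:
                freq[word] += 1
    if not freq:
        return freq
    m = float(max(freq.values()))
    
    # Normalize by the highest count and drop words outside the cutoffs;
    # iterate over a copy of the keys because entries are deleted in the loop
    for w in list(freq.keys()):
        freq[w] = freq[w]/m
        if freq[w] >= max_cut or freq[w] <= min_cut:
            del freq[w]
            
    return freq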

Then, to summarize, score each sentence by summing the normalized frequencies of its words that survived the minimum and maximum cutoffs, and choose the top-scoring sentences.
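
The scoring step can be sketched on toy data as follows. The frequency table and the tokenized sentences are hypothetical stand-ins for what `frequencies` and `word_tokenize` would produce; only `defaultdict` and `heapq.nlargest` from the standard library are used.

```python
from collections import defaultdict
from heapq import nlargest

# Hypothetical filtered frequencies (word -> normalized frequency).
freq = {"italy": 0.5, "guru": 0.4}

# Hypothetical tokenized sentences.
word_sent = [
    ["i", "am", "going", "to", "italy"],
    ["the", "guru", "lives", "near", "italy"],
    ["nothing", "relevant", "here"],
]

# A sentence's score is the sum of the frequencies of its kept words.
scores = defaultdict(float)
for i, sent in enumerate(word_sent):
    for w in sent:
        if w in freq:
            scores[i] += freq[w]

# Indices of the two best-scoring sentences, highest score first.
best = nlargest(2, scores, key=scores.get)
print(best)  # [1, 0]
```

A sentence with no kept words never receives a score, so it cannot be selected, and fewer than `n` sentences may come back when only a few sentences contain kept words.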

In [9]:
def rank(ranking,n):
    return nlargest(n,ranking,key=ranking.get)

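def summarize(text,n=5):
    sents      = sent_tokenize(text)
    assert n  <= len(sents)
    # Lowercase before tokenizing so counts are case-insensitive
    word_sent  = [word_tokenize(s.lower()) for s in sents]
    freq       = frequencies(word_sent)
    ranking    = defaultdict(int)
    
    # Score each sentence as the sum of its words' normalized frequencies
    for i,sent in enumerate(word_sent):
        for w in sent:
            if w in freq:
                ranking[i] += freq[w]
                
    # Indices of the n best-scoring sentences, highest score first
    sents_idx = rank(ranking,n)
    
    return [sents[j] for j in sents_idx]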
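
`summarize` returns the selected sentences ordered by score, highest first. If the summary should instead follow the order in which the sentences occur in the script, one option is to sort the selected indices before looking up the sentences. The helper below is a hypothetical addition, not part of the original notebook, and it is shown on made-up data.

```python
def in_document_order(sents, sents_idx):
    # Hypothetical helper: sort the chosen sentence indices so the
    # summary reads in the same order as the source text.
    return [sents[j] for j in sorted(sents_idx)]

# Made-up sentences and a made-up ranking (best score first).
sents     = ["First.", "Second.", "Third.", "Fourth."]
sents_idx = [2, 0, 3]

ordered = in_document_order(sents, sents_idx)
print(ordered)  # ['First.', 'Third.', 'Fourth.']
```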

Open a script file and try it on the text.

In [13]:
# Read the third line of the file, closing the file handle afterwards
with open('scripts/Eat Pray Love 2010.txt','r') as f:
    text = f.readlines()[2]

summarize(text,10)

["I'm going to Italy and then l'm going to David's guru's ashram in lndia... ...and l'm going to end the year in Bali.",
 "It's long, it's tedious, I can't keep up... ...and l get these insane anxieties about everything in my life... ...and l've lost my place.",
 'And it was such a foreign concept to me, that l swear l almost began with: "l\'m a big fan of your work."',
 "-No, I don't even have my-- l-- You don't have your-- You don't-- You're so naked.",
 "And l-- You know, I don't-- I don't know.",
 "Do not tell me what lessons l have and haven't learned in the last year... ...and don't tell me how balanced and wise you are... ...and how I can't express myself.",
 "If it wasn't for you, I wouldn't have come back to Bali... ...and l wouldn't have come back to myself.",
 "l'm sorry l didn't call sooner.",
 "I don't know why we can't accept... ...we don't wanna live in unhappiness anymore.",
 "You know, it's been a rough day, and if no one takes it personally... ...l'm going to take my 