## This notebook contains our summarization techniques that we tried ranging from Keyphrase Extraction (using frequency, collocation, chunking and wordnet) to Key sentences extraction ( sentences with most frequent words, Gensim summarizer, Classification and Gensim summarizer). The final approach that we used was Key sentences extraction using Classification and Gensim Summarization that uses TextRank algorithm.

In [25]:
# Importing all the necessary libraries
import nltk, re, numpy as np
import urllib
from bs4 import BeautifulSoup
from nltk import word_tokenize,sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import string
from gensim.summarization import summarize, keywords
from nltk.probability import FreqDist

## Reading in the files
The following sections read in the text from a file.

In [31]:
#loadText
def loadText(text):
    f = open(text)
    raw = f.read().decode('utf8');
    return raw

tos_text = loadText("static/data/Google.txt")

In [33]:
text = tos_text

# Keyphrase Extraction Approach

In the following approaches - frequency, collocation and chunking - for each, the steps followed are: 1. Identifying candidates for keyphrases and 2. Keyphrase selection

## Approach 1 - Frequency

The following segment attempts using Approach 1 - Frequent Terms
-- Frequent Unigrams (with or without stemming, stopwords, and other normalization)
-- Frequent Bigram frequencies (with or without stemming, stopwords, and other normalization)
-- Other variations on frequent n-gram

### 1. Candidates identification

In [82]:
#Using FreqDist
tokenizer = RegexpTokenizer('\w+|\$[\d\.]+|\S+')
text_tokens = tokenizer.tokenize(text)

fdist = nltk.FreqDist(text_tokens)
print("20 Most frequent tokens: ")
print(fdist.most_common(20))

20 Most frequent tokens: 
[(',', 242), ('.', 188), ('to', 152), ('and', 124), ('you', 120), ('your', 114), ('Google', 113), ('information', 102), ('or', 71), ('our', 65), ('the', 62), ('with', 60), ('services', 59), ('we', 59), ('of', 57), ('a', 45), ('more', 45), ('may', 42), ('that', 40), ('in', 39)]


#### Normalizing
Normalizing - convert to lower case, remove punctuation and stop words, stem words

In [83]:
punct = set(string.punctuation)
#To remove specific punctuation and possessive found in this book
all_punct=string.punctuation + "--" + ".\"" + ",\"" + "?\"" + "\'s" + "!\""+"\""
text_nopunct = [x.lower() for x in text_tokens if x not in all_punct]      #Convert to lower & punctuation
#print("Total number of words: ", len(text_nopunct),"\n")

norm_fdist = nltk.FreqDist(text_nopunct)
unique_incl_stop = len(norm_fdist.keys())
#print("Total number of unique words - without removing stop words: ", unique_incl_stop)

stop = set(stopwords.words('english'))
norm = [i for i in text_nopunct if i not in stop]    #Remove stopwords
norm_fdist = nltk.FreqDist(norm)


### 2. Keyphrase selection

In [84]:
print("\nNormalized Text:")
#print("Total number of unique words post normalizing: ",len(norm_fdist.keys()))
print("\n20 Most frequent words: ")
print(norm_fdist.most_common(20))


Normalized Text:

20 Most frequent words: 
[('google', 113), ('information', 110), ('services', 59), ('may', 42), ('example', 38), ('account', 35), ('use', 35), ('learn', 34), ('privacy', 30), ('personal', 24), ('like', 19), ('policy', 18), ('collect', 18), ('device', 16), ('ads', 16), ('cookies', 16), ('access', 16), ('including', 16), ('share', 16), ('people', 13)]


#### Stemming
Using stemming on unigrams, post normalizing 

In [85]:
from nltk.stem.snowball import SnowballStemmer

#Results post stemming - Using Snowball stemmer
stemmer = SnowballStemmer("english")
text_stem = [stemmer.stem(i) for i in norm]
stem_fdist = nltk.FreqDist(text_stem)
print("\nUsing stemmed words: \n")
print("20 Most frequent words: ")
print(stem_fdist.most_common(20))



Using stemmed words: 

20 Most frequent words: 
[('googl', 113), ('inform', 111), ('servic', 67), ('use', 60), ('may', 42), ('exampl', 39), ('account', 35), ('learn', 34), ('privaci', 30), ('person', 27), ('share', 25), ('includ', 24), ('polici', 20), ('like', 20), ('cooki', 20), ('collect', 20), ('devic', 19), ('ad', 19), ('user', 18), ('access', 18)]


The above output is not very useful. It basically used all the words and even post normalizing did not provide a very meaningful output.

Repeating the above, but using bigrams and trigrams

In [86]:
from nltk.util import ngrams
from nltk import bigrams
from nltk import trigrams

bi = bigrams(norm)
bi_list = [bigram for bigram in bi]
tri = trigrams(norm)
tri_list = [trigram for trigram in tri]

bi_fdist = nltk.FreqDist(bi_list)
print("\n20 Most frequent bigrams: \n")
print(bi_fdist.most_common(20))
print("\n\n20 Most frequent trigrams: \n")
tri_fdist = nltk.FreqDist(tri_list)
print(tri_fdist.most_common(20))


20 Most frequent bigrams: 

[(('personal', 'information'), 23), (('google', 'account'), 23), (('privacy', 'policy'), 17), (('google', 'analytics'), 10), (('use', 'services'), 8), (('advertising', 'services'), 7), (('associated', 'google'), 7), (('information', 'collect'), 7), (('share', 'information'), 7), (('example', 'google'), 6), (('search', 'results'), 6), (('use', 'information'), 6), (('google', 'services'), 6), (('services', 'may'), 5), (('information', 'google'), 5), (('many', 'services'), 5), (('services', 'google'), 4), (('cookies', 'similar'), 4), (('use', 'google'), 4), (('information', 'publicly'), 4)]


20 Most frequent trigrams: 

[(('associated', 'google', 'account'), 7), (('cookies', 'similar', 'technologies'), 4), (('personal', 'information', 'companies'), 3), (('organizations', 'individuals', 'outside'), 3), (('companies', 'organizations', 'individuals'), 3), (('share', 'personal', 'information'), 3), (('relevant', 'search', 'results'), 3), (('process', 'enforceable

## Approach 2 - Collocation (Words)

Collocation can be done either with words or with parts of speech. Here, I have attempted with words. 

### 1. Candidate Identification

In [87]:
from nltk.collocations import *

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

### 2. Candidate Selection - Bigrams

In [88]:
#Post normalizing

finder = BigramCollocationFinder.from_words(norm)

# only bigrams that appear atleast 3 times
finder.apply_freq_filter(3)

# returns the 10 bigrams with the highest PMI
print("\nTop 10 bigrams using PMI:")
print(finder.nbest(bigram_measures.pmi, 10))
#finder.score_ngrams(bigram_measures.pmi)

# Finds top 10 bigrams using the Pearson's Chi-squared test
print("\nTop 10 bigrams using Pearson's Chi-Squared Test:")
print(finder.nbest(bigram_measures.chi_sq, 10))
#finder.score_ngrams(bigram_measures.chi_sq)



Top 10 bigrams using PMI:
[('cell', 'towers'), ('enforceable', 'governmental'), ('pixel', 'tags'), ('credit', 'card'), ('domain', 'administrator'), ('individuals', 'outside'), ('organizations', 'individuals'), ('governmental', 'request'), ('companies', 'organizations'), ('ip', 'addresses')]

Top 10 bigrams using Pearson's Chi-Squared Test:
[('cell', 'towers'), ('enforceable', 'governmental'), ('pixel', 'tags'), ('domain', 'administrator'), ('credit', 'card'), ('companies', 'organizations'), ('individuals', 'outside'), ('organizations', 'individuals'), ('privacy', 'policy'), ('governmental', 'request')]


In [89]:
# Finds top 10 bigrams using the likelihood ratio

print("\nTop 10 bigrams using Maximum Likelihood Ratio:")
print(finder.nbest(bigram_measures.likelihood_ratio, 10))
#finder.score_ngrams(bigram_measures.likelihood_ratio)


Top 10 bigrams using Maximum Likelihood Ratio:
[('privacy', 'policy'), ('personal', 'information'), ('google', 'account'), ('search', 'results'), ('domain', 'administrator'), ('companies', 'organizations'), ('google', 'analytics'), ('cell', 'towers'), ('enforceable', 'governmental'), ('pixel', 'tags')]


I also used collocations on text that was not normalized but that did not yield in meaningful results since the stop words were included and ended up being among the most frequent.
Eg: ('of', 'the'), ('in', 'the'), ('the', 'jungle'), ('he', 'was'), etc. 

### 2. Candidate Selection - Trigrams

In [90]:
# Post normalizing

finder = TrigramCollocationFinder.from_words(norm)

# only trigrams that appear atleast 3 times
finder.apply_freq_filter(3)

# return the 10 trigrams with the highest PMI
print("\nTop 10 trigrams using PMI:")
print(finder.nbest(trigram_measures.pmi, 10))
#finder.score_ngrams(trigram_measures.pmi)

# Finds top 10 trigrams using the Pearson's Chi-squared test
print("\nTop 10 trigrams using Pearson's Chi-Squared Test:")
print(finder.nbest(trigram_measures.chi_sq, 10))
#finder.score_ngrams(trigram_measures.chi_sq)



Top 10 trigrams using PMI:
[('enforceable', 'governmental', 'request'), ('organizations', 'individuals', 'outside'), ('process', 'enforceable', 'governmental'), ('companies', 'organizations', 'individuals'), ('relevant', 'search', 'results'), ('cookies', 'similar', 'technologies'), ('individuals', 'outside', 'google'), ('information', 'companies', 'organizations'), ('collect', 'store', 'information'), ('personal', 'information', 'companies')]

Top 10 trigrams using Pearson's Chi-Squared Test:
[('enforceable', 'governmental', 'request'), ('organizations', 'individuals', 'outside'), ('process', 'enforceable', 'governmental'), ('companies', 'organizations', 'individuals'), ('cookies', 'similar', 'technologies'), ('relevant', 'search', 'results'), ('individuals', 'outside', 'google'), ('information', 'companies', 'organizations'), ('associated', 'google', 'account'), ('collect', 'store', 'information')]


In [91]:
# Finds top 10 trigrams using the likelihood ratio
print("\nTop 10 trigrams using Maximum Likelihood Ratio:")
print(finder.nbest(trigram_measures.likelihood_ratio, 10))
#finder.score_ngrams(trigram_measures.likelihood_ratio)


Top 10 trigrams using Maximum Likelihood Ratio:
[('personal', 'information', 'companies'), ('associated', 'google', 'account'), ('share', 'personal', 'information'), ('personal', 'information', 'google'), ('google', 'account', 'learn'), ('relevant', 'search', 'results'), ('companies', 'organizations', 'individuals'), ('enforceable', 'governmental', 'request'), ('cookies', 'similar', 'technologies'), ('process', 'enforceable', 'governmental')]


## Aproach 3 - Chunking

### 1. Candidate Identification

In [92]:
def tokenize_text(corpus):
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sents = sent_tokenizer.tokenize(corpus) # Split text into sentences  
    return [nltk.word_tokenize(word) for word in raw_sents]

def create_chunker(grammar):
    return nltk.RegexpParser(grammar)

def run_chunker(ch, sentences):
    return [ch.parse(sent) for sent in sentences]

# Defining the grammar for the chunker
grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+}  # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
"""
tagged_sentences = nltk.pos_tag_sents(tokenize_text(text))

clause_chunker = create_chunker(grammar)

# Collecting the clauses
clauses = []
for sent in tagged_sentences:
    tree = clause_chunker.parse(sent)
    for st in tree.subtrees():
        if st.label() == 'CLAUSE': clauses.append(st)

#print(clauses)

#Collecting the proper nouns
proper_nouns = []
for sent in tagged_sentences:
    tree = clause_chunker.parse(sent)
    for st in tree.subtrees():
        if st.label() == 'NP': proper_nouns.append(st.leaves()[0][0])
#print(proper_nouns)

candidates = []
for clause in clauses:
    for s in clause.subtrees():
        if(s.label() == 'NP'):
            for l in s.leaves():
                for word in l:
                    if(word in proper_nouns):
                        candidates.append(clause)
                        next

candidate_sentences = []
for candidate in candidates:
    candidate_sentences.append(' '.join([l[0] for l in candidate.leaves()]))

### 2. Candidate Selection

In [93]:
lchar = 0
finlist = []

for sentence in set(candidate_sentences):
    if lchar<2000:
        ls = len(sentence)
        if (lchar + ls)<2000:
            finlist.append(sentence)
            lchar +=ls
        
print("\n".join(ph for ph in finlist))

legal team reviews each
the ads delivered by Google
we’re using information
partners – like publishers
security related materials
Data generated through Google Analytics
Google Analytics product helps businesses
The hyperlinked examples
don’t follow the correct process
SMS routing information
choice People have different privacy concerns
services appear in the language
services using SSL
partners use various technologies
Specific product practices The following notices explain specific privacy practices with respect
device using mechanisms such as browser web storage
a partner uses Google Analytics in conjunction
Google uses cookies
other Google­hosted content
This includes information
Unique application numbers Certain services include a unique application number
Google processes personal information
view archived versions
view content provided by Google
site owners analyze the traffic
services offered on other sites
Profile photo appear in shared endorsements
services offered by othe

## Aproach 4 - Wordnet 

In [94]:
def tokenize_text_semantic(corpus):
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sents = sent_tokenizer.tokenize(corpus) # Split text into sentences    
    return [nltk.word_tokenize(word) for word in raw_sents]

In [95]:
with open('google_privacy.txt','r') as fs:
        summaries_str = fs.read()
        
sents = tokenize_text_semantic(summaries_str)
tagged_POS_sents = [nltk.pos_tag(sent) for sent in sents]

In [99]:
def freq_normed_unigrams(tagged_sents, num):
    wnl = WordNetLemmatizer() # to get word stems      
    normed_tagged_words = [wnl.lemmatize(word[0].lower()) for sent in tagged_sents
                           for word in sent 
                           if word[0].lower() not in stopwords.words('English')
                           and word[0] not in string.punctuation # remove punctuation
                           and not re.search(r'''^[\.,;"'?!():\-_`–—]+$''', word[0])
                           and word[1].startswith('N')]  # include only nouns
    top_normed_unigrams = [word for (word, count) in nltk.FreqDist(normed_tagged_words).most_common(num)]
    return top_normed_unigrams

In [103]:
from nltk.stem import WordNetLemmatizer
from collections import defaultdict
from nltk.corpus import wordnet as wn
import re

def categories_from_hypernyms(tagged_sents, num=10):
    termlist = freq_normed_unigrams(tagged_sents, num) # get top unigrams
    hypterms = []
    key_hyponyms = []
    hyp_list = []
    hypterms_dict = defaultdict(list)
    for term in termlist:                  # for each term
        s = wn.synsets(term.lower(), 'n')  # get its nominal synsets
        for syn in s:                      # for each lemma synset
            for hyp in syn.hypernyms():    # It has a list of hypernyms
                hypterms = hypterms + [hyp.name]      # Extract the hypernym name and add to list
                hypterms_dict[hyp.name].append(term)  # Extract examples and add them to dict
    hypfd = nltk.FreqDist(hypterms)             # After going through all the nouns, print out the hypernyms 
    for (name, count) in hypfd.most_common(10):  # that have accumulated the most counts (have seen the most descendents)
        key_hyponyms.append(hypterms_dict[name])
        # here I eliminated her category listing, opting instead to flattern out the listing and show synonyms
    top_hyponyms = [item for sublist in key_hyponyms for item in sublist]
    fd_hyp = nltk.FreqDist(top_hyponyms).most_common()
    for (each, count) in fd_hyp:
        hyp_list.append(each)
    #print(hyp_list)
    return hyp_list
       


In [104]:
key_topics = categories_from_hypernyms(tagged_POS_sents, 15)

In [110]:
output_key_topics = ['KEY TOPICS:'] + [each for each in categories_from_hypernyms(tagged_POS_sents, 15)]
print(output_key_topics)

['KEY TOPICS:', 'account', 'share', 'access', 'user', 'information', 'device']


## Reflection on using Keyphrase for ToS summarization

We used four approaches to identify keyphrases - Using frequencies, collocations, chunking and wordnet - on the Terms of Service, of Google - as an example. 

#### Frequencies
Using frequencies was a very simple and straightforward approach. However, it did not give very much information. Even post normalizing the text by removing punctuation, converting to lower case and stemming, there's not much improvement in the quality of words recognized.
For Eg: 
20 Most frequent words: 
[('google', 113), ('information', 110), ('services', 59), ('may', 42), ('example', 38), ('account', 35), ('use', 35), ('learn', 34), ('privacy', 30), ('personal', 24), ('like', 19), ('policy', 18), ('collect', 18), ('device', 16), ('ads', 16), ('cookies', 16), ('access', 16), ('including', 16), ('share', 16), ('people', 13)]

Applying frequencies to bigrams and trigrams resulted in some of the common phrases being bundled together. However, the quality of this improvement was not very much. While using the text in its actual form would make for more meaningful bigrams and trigrams (with prepositions and articles), these occur very frequently and hence shadow the others. Hence, we used the normalized text for bigrams and trigrams.
For Eg:
[(('personal', 'information'), 23), (('google', 'account'), 23), (('privacy', 'policy'), 17), (('google', 'analytics'), 10), (('use', 'services'), 8), (('advertising', 'services'), 7), (('associated', 'google'), 7), (('information', 'collect'), 7), (('share', 'information'), 7), (('example', 'google'), 6), (('search', 'results'), 6), (('use', 'information'), 6), (('google', 'services'), 6), (('services', 'may'), 5), (('information', 'google'), 5), (('many', 'services'), 5), (('services', 'google'), 4), (('cookies', 'similar'), 4), (('use', 'google'), 4), (('information', 'publicly'), 4)]

As seen above, while these are key terms (personal information, etc.), they do not tell us anything about the context of use and hence are not very useful.


#### Collocations
We implemented collocations using bigrams and trigrams of the words, using the statistical measures of Pearson Chi Square and PMI. However, both these did not provide very meaningful information across both bigrams and trigrams. We then implemented maximum likelihood ratio and this worked relatively well compared to the others for bigrams. The result for bigrams was better than that for trigrams.
Eg:Bigrams: Top 10 bigrams using Maximum Likelihood Ratio:
[('privacy', 'policy'), ('personal', 'information'), ('google', 'account'), ('search', 'results'), ('domain', 'administrator'), ('companies', 'organizations'), ('google', 'analytics'), ('cell', 'towers'), ('enforceable', 'governmental'), ('pixel', 'tags')]

However, without context, this does not make sense too.

#### Chunking
The chunks here are defined as noun phrases, verb phrases, prepositional phrases and clauses. Of these, clauses are a combination of noun phrase and verb phrase. As we parse, we split the sentences into chunks defined by the rules above and select those that have proper nouns in them. This is a good approach but the caveat is that the if the tags are incorrectly labeled, then chunks can be misleading.

Eg: 
legal team reviews each
the ads delivered by Google
we’re using information
partners – like publishers
security related materials
Data generated through Google Analytics
Google Analytics product helps businesses
The hyperlinked examples
don’t follow the correct process
SMS routing information
choice People have different privacy concerns

Even in this case, while the results are better than frequency and collocation, the keyphrase in itself does not make much sense. 

#### Wordnet
Using Wordnet to find the key topics in the Terms of Service, resulted in the following:
['KEY TOPICS:', 'account', 'share', 'access', 'user', 'information', 'device']
While these may be key topics, they give us no further information about the terms of service and the user is not better off having seen these topics. Wordnet does not perform very well in this case.

#### Conclusion
Our objective was that we need to summarize Terms of Service so that users can read the key points of the text and be better informed. None of the techniques used here to extract keyphrases facilitate this since the context of the phrase is missing and that is important to understand the sentence in its entirety.

#### Learnings and Way Forward
Rather than extract key phrases, our plan is to look at complete sentences so that the context is clear and the sentence is better comprehendable.

In other techniques, we will focus on extracting complete sentences.

# Key Sentences Extraction Approach

## Approach 1: Summarizing by picking sentences that contain most frequent words.

In [5]:
#Code taken from Author: Tristan Havelick <tristan@havelick.com> URL: <https://github.com/thavelick/summarize/>

class SimpleSummarizer:
    def reorder_sentences(self, output_sentences, input ):
        output_sentences.sort( lambda s1, s2:
            input.find(s1) - input.find(s2) )
        return output_sentences

    def get_summarized(self, input, num_sentences ):
        tokenizer = RegexpTokenizer('\w+')

        # get the frequency of each word in the input
        base_words = [word.lower() for word in tokenizer.tokenize(input)]
        words = [word for word in base_words if word not in stopwords.words()]
        word_frequencies = FreqDist(words)
        
        # now create a set of the most frequent words
        most_frequent_words = [pair[0] for pair in
            word_frequencies.items()[:100]]

        # break the input up into sentences.  working_sentences is used
        # for the analysis, but actual_sentences is used in the results
        # so capitalization will be correct.
        
        sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
        actual_sentences = sent_detector.tokenize(input)
        working_sentences = [sentence.lower() for sentence in actual_sentences]

        # iterate over the most frequent words, and add the first sentence
        # that inclues each word to the result.
        output_sentences = []

        for word in most_frequent_words:
            for i in range(0, len(working_sentences)):
                if (word in working_sentences[i]
                 and actual_sentences[i] not in output_sentences):
                    output_sentences.append(actual_sentences[i])
                    break
                if len(output_sentences) >= num_sentences: break
            if len(output_sentences) >= num_sentences: break
                
        # sort the output sentences back to their original order
        return self.reorder_sentences(output_sentences, input)
    
    def summarize_text(self, input, num_sentences):
        return self.get_summarized(input, num_sentences)

In [11]:
print 'Summary:'

ss=SimpleSummarizer()
summary=ss.summarize_text(tos_text,10)
for sent in summary[1:]:
    print sent
    print 

Summary:
Using our Services

You must follow any policies made available to you within the Services.

In connection with your use of the Services, we may send you service announcements, administrative messages, and other information.

Do not use such Services in a way that distracts you and prevents you from obeying traffic or safety laws.

The rights you grant in this license are for the limited purpose of operating, promoting, and improving our Services, and to develop new ones.

Our automated systems analyze your content (including emails) to provide you personally relevant product features, such as customized search results, tailored advertising, and spam and malware detection.

If you have a Google Account, we may display your Profile name, Profile photo, and actions you take on Google or on third-party applications connected to your Google Account (such as +1’s, reviews you write and comments you post) in our Services, including displaying in ads and other commercial contexts.

Y

This approach pulled out nice sentences, but with Terms of Service, we cannot just rely on the most frequent terms.

## Approach 2: Gensim summarization which works on TextRank Algorithm
https://rare-technologies.com/text-summarization-with-gensim/

In [13]:
print 'Summary:'
tos_text = loadText("google.txt")
for sent in summarize(tos_text, split=True, ratio=.05):
    print sent
    print 

Summary:
When you upload, submit, store, send or receive content to or through our Services, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content.

You can find more information about how Google uses and stores content in the privacy policy or additional terms for particular Services.

This license is for the sole purpose of enabling you to use and enjoy the benefit of the Services as provided by Google, in the manner permitted by these terms.

TO THE EXTENT PERMITTED BY LAW, THE TOTAL LIABILITY OF GOOGLE, AND ITS SUPPLIERS AND DISTRIBUTORS, FOR ANY CLAIMS UNDER THESE TERMS, INCLUDING FOR ANY IMPLIED WARRANTIES, IS LIMITED TO THE AMOUNT YOU PAID US TO USE THE SERVICES (OR, IF WE CHOOSE, TO SUPPLYING

This gives a nice and quite informative summary. We are getting the most important information that people would care about using this approach.

## Approach 3: Classification and Summarization
Classifying the text using words that are commonly used in the context of labels like copyright, privacy and termination and summarizing the text under each label using gensim summarization.

In [24]:
def summarizeAlgo(_text): 

    tos_text_paras = _text.split("\n")
    
    # Classifying using common words used under labels 
    copyright = ['collective work',\
    'compilation',\
    'compulsory license',\
    'copyright',\
    'copyright holder/copyright owner',\
    'copyright notice',\
    'derivative work',\
    'exclusive right',\
    'expression',\
    'fair use',\
    'first sale doctrine',\
    'fixation',\
    'idea',\
    'infringement',\
    'intellectual property',\
    'license',\
    'master use license',\
    'mechanical license',\
    'medium',\
    'moral rights',\
    'musical composition',\
    'parody',\
    'patent',\
    'performing rights',\
    'permission',\
    'public domain',\
    'publication/publish',\
    'right of publicity',\
    'royalty',\
    'service mark',\
    'sound recording',\
    'statutory damages',\
    'synchronization license',\
    'tangible form of expression',\
    'term',\
    'title',\
    'trademark',\
    'trade secret',\
    'work for hire']

    privacy = ['access',\
    'account',\
    'activity',\
    'advertising',\
    'confidentiality',\
    'content',\
    'cookie',\
    'legal',\
    'preferences',\
    'privacy',\
    'protect',\
    'religion',\
    'security',\
    'settings']

    termination = ['cease',\
    'terminate',\
    'remove',\
    'inactive',\
    'suspend',\
    'account',\
    'discontinue',\
    'revoke',\
    'retain']

    copyright_all,privacy_all,termination_all = [],[],[]


    for para in tos_text_paras:
        check = 0
        for word in para.split(" "):
            word = word.lower()
            if word in copyright:
                copyright_all.append(para)
                check = 1
            if word in privacy:
                privacy_all.append(para)
                check = 1
            if word in termination:
                termination_all.append(para)
                check = 1
            if check != 0:
                break
    
    #Removing sentences that contain less than 5 words under all the labels
    
    copyright_all = [sent for sent in copyright_all if len(word_tokenize(sent)) > 5]

    privacy_all = [sent for sent in privacy_all if len(word_tokenize(sent)) > 5]

    termination_all = [sent for sent in termination_all if len(word_tokenize(sent)) > 5]

    categoryDict = {}
    
    #Summarizing each labelled text

    if (len(copyright_all) != 0):
        if (len(copyright_all) != 1):
            copyright_text = ' '.join(copyright_all)
            copyright_all = summarize(copyright_text, split=True, ratio=.2)
        categoryDict["Copyright"] = copyright_all


    if (len(privacy_all) != 0):
        if (len(privacy_all) != 1):
            privacy_text = ' '.join(privacy_all)
            privacy_all = summarize(privacy_text, split=True, ratio=.1)
        categoryDict["Privacy"] = privacy_all


    if (len(termination_all) != 0):
        if (len(termination_all) != 1):
            termination_text = ' '.join(termination_all)
            termination_all = summarize(termination_text, split=True, ratio=.2)
        categoryDict["Termination"] = termination_all
                              
    return categoryDict


summary = summarizeAlgo(tos_text)
for key in summary:
    print key
    print summary[key]
    print
    


Termination
[u'If you are using a Google Account assigned to you by an administrator, different or additional terms may apply and your administrator may be able to access or disable your account.', u'We will respect the choices you make to limit sharing or visibility settings in your Google Account.']

Copyright
[u'Google gives you a personal, worldwide, royalty-free, non-assignable and non-exclusive license to use the software provided to you by Google as part of the Services.']

Privacy
[u'If you are using a Google Account assigned to you by an administrator, different or additional terms may apply and your administrator may be able to access or disable your account.', u'When you upload, submit, store, send or receive content to or through our Services, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works b

## Reflection on using Sentence Selection for ToS summarization

Approach #3 (Classification and Summarization) gave us the best results as we were able to divide the text into certain categories and within each category, we were able to pull out the most interesting and important sentences. As a user, we would want to know only the crux or the most important pieces of information in Terms of Services that would matter to us the most under Copyright, Privacy and Termination. We have picked this as our final algorithm. Hopefully, the surveys would help us in further evaluation of this approach.

Going forward, we would like to rephrase the text to make the sentences shorter and easy to comprehend.