# Text Summarization Drill

Modify the keyword extraction code to extract two-word phrases (digrams, also called bigrams) rather than single words. Then try it with trigrams. You will probably want to broaden the window that defines "neighbors."

In [1]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import networkx as nx
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
%matplotlib inline

In [16]:
# Importing the text the lazy way.
gatsby="In my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since. \"Whenever you feel like criticizing any one,\" he told me, \"just remember that all the people in this world haven't had the advantages that you've had.\" He didn't say any more but we've always been unusually communicative in a reserved way, and I understood that he meant a great deal more than that. In consequence I'm inclined to reserve all judgments, a habit that has opened up many curious natures to me and also made me the victim of not a few veteran bores. The abnormal mind is quick to detect and attach itself to this quality when it appears in a normal person, and so it came about that in college I was unjustly accused of being a politician, because I was privy to the secret griefs of wild, unknown men. Most of the confidences were unsought--frequently I have feigned sleep, preoccupation, or a hostile levity when I realized by some unmistakable sign that an intimate revelation was quivering on the horizon--for the intimate revelations of young men or at least the terms in which they express them are usually plagiaristic and marred by obvious suppressions. Reserving judgments is a matter of infinite hope. I am still a little afraid of missing something if I forget that, as my father snobbishly suggested, and I snobbishly repeat a sense of the fundamental decencies is parcelled out unequally at birth. And, after boasting this way of my tolerance, I come to the admission that it has a limit. Conduct may be founded on the hard rock or the wet marshes but after a certain point I don't care what it's founded on. When I came back from the East last autumn I felt that I wanted the world to be in uniform and at a sort of moral attention forever; I wanted no more riotous excursions with privileged glimpses into the human heart. 
Only Gatsby, the man who gives his name to this book, was exempt from my reaction--Gatsby who represented everything for which I have an unaffected scorn. If personality is an unbroken series of successful gestures, then there was something gorgeous about him, some heightened sensitivity to the promises of life, as if he were related to one of those intricate machines that register earthquakes ten thousand miles away. This responsiveness had nothing to do with that flabby impressionability which is dignified under the name of the \"creative temperament\"--it was an extraordinary gift for hope, a romantic readiness such as I have never found in any other person and which it is not likely I shall ever find again. No--Gatsby turned out all right at the end; it is what preyed on Gatsby, what foul dust floated in the wake of his dreams that temporarily closed out my interest in the abortive sorrows and short-winded elations of men."

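# We want to use the standard English-language model.
# spaCy 3 removed the 'en' shortcut, so the model is loaded by its package name
# (assumed here to be the small English model the shortcut usually pointed to).
parser = spacy.load('en_core_web_sm')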

# Parsing Gatsby.
gatsby = parser(gatsby)

# Dividing the text into sentences and storing them as a list of strings.
sentences=[]
for span in gatsby.sents:
    # span.text is the sentence's original text; strip() drops surrounding whitespace.
    # (Token.string, used in older spaCy code, was removed in spaCy 3.)
    sent = span.text.strip()
    sentences.append(sent)

# Creating the tf-idf matrix.
counter = TfidfVectorizer(lowercase=False, 
                          stop_words=None,
                          ngram_range=(1, 1), 
                          analyzer=u'word', 
                          max_df=.5, 
                          min_df=1,
                          max_features=None, 
                          vocabulary=None, 
                          binary=False)

#Applying the vectorizer
data_counts=counter.fit_transform(sentences)

In [17]:
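# Calculating similarity: the dot product of every pair of tf-idf sentence vectors.
similarity = data_counts * data_counts.T

# Identifying the sentence with the highest rank.
# (networkx 3.0 removed from_scipy_sparse_matrix; from_scipy_sparse_array replaces it.)
nx_graph = nx.from_scipy_sparse_array(similarity)
ranks=nx.pagerank(nx_graph, alpha=.85, tol=.00000001)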

ranked = sorted(((ranks[i],s) for i,s in enumerate(sentences)),
                reverse=True)
print(ranked[0])

(0.07458830063813308, 'This responsiveness had nothing to do with that flabby impressionability which is dignified under the name of the "creative temperament"--it was an extraordinary gift for hope, a romantic readiness such as I have never found in any other person and which it is not likely I shall ever find again.')
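
The product of the tf-idf matrix with its transpose works as a similarity measure because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the dot product of two rows is their cosine similarity. A minimal sketch on three made-up sentences (the toy sentences and variable names are illustrative, not part of the Gatsby data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy sentences, invented for illustration only.
toy_sentences = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]

# Default settings: each row of the tf-idf matrix has unit L2 norm.
toy_tfidf = TfidfVectorizer().fit_transform(toy_sentences)

# Dot products of every pair of rows, as in the notebook's similarity step.
dot_products = (toy_tfidf @ toy_tfidf.T).toarray()

# Cosine similarity computed independently for comparison.
cosines = cosine_similarity(toy_tfidf)
rows_match = np.allclose(dot_products, cosines)
```

Sentences that share no vocabulary get a similarity of zero, so they contribute no edge weight to the graph that PageRank runs on.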


In [18]:
# Keeping only nouns and adjectives that are not stop words (this also drops punctuation),
# then collecting the distinct tokens. A list gives the rows and columns a fixed order.
gatsby_filt = [word for word in gatsby if not word.is_stop and word.pos_ in ('NOUN', 'ADJ')]
words=list(set(gatsby_filt))

# Creating a grid that counts how often each word is followed by another word within 4 places
adjacency=pd.DataFrame(columns=words,index=words,data=0)

#Iterating through each token in the text and recording which of the filtered words follow it
for i,word in enumerate(gatsby):
    # Only filtered words (non-stop nouns and adjectives) act as targets
    if word in gatsby_filt:
        # The window covers the next four tokens; slicing stops at the end of the text
        # on its own when fewer than four tokens remain.
        end=i+5
        # The potential neighbors.
        nextwords=gatsby[i+1:end]
        # Filtering the neighbors to select only those in the word list
        neighbors=[x for x in nextwords if x in gatsby_filt]
        # Adding 1 to the adjacency matrix for neighbors of the target word
        if neighbors:
            adjacency.loc[word,neighbors]=adjacency.loc[word,neighbors]+1

print('done!')

done!


In [19]:
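# Running TextRank
# (DataFrame.as_matrix() and nx.from_numpy_matrix() were removed from recent pandas and networkx releases.)
nx_words = nx.from_numpy_array(adjacency.values)
ranks=nx.pagerank(nx_words, alpha=.85, tol=.00000001)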

# Identifying the five most highly ranked keywords
ranked = sorted(((ranks[i],s) for i,s in enumerate(words)),
                reverse=True)
print(ranked[:5])

  


[(0.013842411165984558, hope), (0.012538179113556777, promises), (0.012538179113556777, exempt), (0.012455008769377494, glimpses), (0.012201713657423653, intimate)]
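
The windowed co-occurrence idea behind these keyword scores can also be sketched without pandas. The version below is a simplified, undirected variant that works on plain strings, so repeated words share one node; `cooccurrence_graph` and the toy token list are hypothetical names and data for illustration, not the notebook's own method:

```python
import networkx as nx

def cooccurrence_graph(tokens, window=4):
    """Link each token to the tokens that follow it within `window` positions."""
    graph = nx.Graph()
    graph.add_nodes_from(tokens)
    for i, word in enumerate(tokens):
        for neighbor in tokens[i + 1:i + 1 + window]:
            if neighbor == word:
                continue  # skip self-links
            # Increment the edge weight each time the pair co-occurs.
            weight = graph.get_edge_data(word, neighbor, default={}).get("weight", 0)
            graph.add_edge(word, neighbor, weight=weight + 1)
    return graph

# Toy input standing in for the filtered nouns and adjectives.
toy_tokens = ["hope", "gift", "hope", "dreams", "dust", "hope"]
toy_graph = cooccurrence_graph(toy_tokens, window=2)

# PageRank over the weighted graph; the best-connected word ranks highest.
toy_ranks = nx.pagerank(toy_graph, alpha=0.85)
top_word = max(toy_ranks, key=toy_ranks.get)
```

In the notebook this would be called on something like `[str(t) for t in gatsby_filt]`, which measures distance in filtered words rather than in raw tokens.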


## Digrams (Two-Word Phrases)

In [45]:
#Create digram phrases from consecutive words in the filtered list
digrams = [str(gatsby_filt[i]) + ' ' + str(gatsby_filt[i+1]) for i in range(0,len(gatsby_filt)-1)]
words_di = list(set(digrams))

#Creating a grid that counts how often each digram is followed by another digram within the window
adjacency_di=pd.DataFrame(columns=words_di,index=words_di,data=0)

#Iterating through each pair of adjacent tokens in the raw text and recording which digrams follow it.
#Note: the digrams pair neighbors in the filtered list, while gatsby_pairs pairs neighbors in the raw text,
#so only digrams whose two words sit next to each other in the original text can match here.
gatsby_pairs = [str(gatsby[i]) + ' ' + str(gatsby[i+1]) for i in range(0,len(gatsby)-1)]
for i,pair in enumerate(gatsby_pairs):
    # Only pairs that are also in the digram list act as targets
    if pair in digrams:
        # The window is broadened to the next nine pairs; slicing stops at the end of the list on its own.
        end=i+10
        # The potential neighbors.
        nextwords=gatsby_pairs[i+1:end]
        # Filtering the neighbors to select only those in the digram list
        neighbors=[x for x in nextwords if x in digrams]
        # Adding 1 to the adjacency matrix for neighbors of the target digram
        if neighbors:
            adjacency_di.loc[pair,neighbors]=adjacency_di.loc[pair,neighbors]+1

print('done!')

done!


In [46]:
adjacency_di

Unnamed: 0,human heart,birth way,wake dreams,privy secret,little afraid,riotous excursions,end foul,miles responsiveness,sleep preoccupation,intimate revelation,...,dreams interest,hope romantic,terms plagiaristic,interest abortive,successful gestures,advice mind,sign intimate,fundamental decencies,limit Conduct,horizon intimate
human heart,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
birth way,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
wake dreams,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
privy secret,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
little afraid,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
riotous excursions,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
end foul,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
miles responsiveness,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sleep preoccupation,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
intimate revelation,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [49]:
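# Running TextRank
# (DataFrame.as_matrix() and nx.from_numpy_matrix() were removed from recent pandas and networkx releases.)
nx_pairs = nx.from_numpy_array(adjacency_di.values)
di_ranks=nx.pagerank(nx_pairs, alpha=.85, tol=.00000001)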

# Identifying the five most highly ranked keywords
di_ranked = sorted(((di_ranks[i],s) for i,s in enumerate(words_di)),
                reverse=True)
print(di_ranked[:5])

  


[(0.03690423819762779, 'riotous excursions'), (0.0367160584215534, 'unmistakable sign'), (0.0367160584215534, 'unbroken series'), (0.0367160584215534, 'infinite hope'), (0.02515723270136997, 'young men')]
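
One way to keep the candidate phrases and the scanned sequence in step is to build both from the same filtered word list, so every phrase can find its neighbors. The sketch below is one possible approach, not the notebook's method; `ngrams`, `phrase_graph`, and the hand-picked word list are hypothetical:

```python
import networkx as nx

def ngrams(tokens, n):
    """Return the n-word phrases formed by consecutive items of `tokens`."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def phrase_graph(phrases, window):
    """Link each phrase to the phrases that follow it within `window` positions."""
    graph = nx.Graph()
    graph.add_nodes_from(phrases)
    for i, phrase in enumerate(phrases):
        for neighbor in phrases[i + 1:i + 1 + window]:
            if neighbor != phrase:
                graph.add_edge(phrase, neighbor)
    return graph

# A short hand-picked sample standing in for [str(t) for t in gatsby_filt].
filtered = ["riotous", "excursions", "privileged", "glimpses", "human", "heart"]

toy_digrams = ngrams(filtered, 2)    # two-word phrases
toy_trigrams = ngrams(filtered, 3)   # three-word phrases
digram_graph = phrase_graph(toy_digrams, window=2)
```

Because the same `ngrams` helper takes `n` as a parameter, switching from digrams to trigrams only changes one argument, and the window can be widened independently.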


## Trigrams (Three-Word Phrases)

In [53]:
#Create trigram phrases from consecutive words in the filtered list
trigrams = [str(gatsby_filt[i]) + ' ' + str(gatsby_filt[i+1]) + ' ' + str(gatsby_filt[i+2]) for i in range(0,len(gatsby_filt)-2)]
words_tri = list(set(trigrams))

#Creating a grid that counts how often each trigram is followed by another trigram within the window
adjacency_tri=pd.DataFrame(columns=words_tri,index=words_tri,data=0)

#Iterating through each run of three adjacent tokens in the raw text and recording which trigrams follow it.
#Note: the trigrams join neighbors in the filtered list, while gatsby_triplets joins neighbors in the raw text,
#so only trigrams whose three words are consecutive in the original text can match here.
gatsby_triplets = [str(gatsby[i]) + ' ' + str(gatsby[i+1]) + ' ' + str(gatsby[i+2]) for i in range(0,len(gatsby)-2)]
for i,triplet in enumerate(gatsby_triplets):
    # Only triplets that are also in the trigram list act as targets
    if triplet in trigrams:
        # The window is broadened to the next fourteen triplets; slicing stops at the end of the list on its own.
        end=i+15
        # The potential neighbors.
        nextwords=gatsby_triplets[i+1:end]
        # Filtering the neighbors to select only those in the trigram list
        neighbors=[x for x in nextwords if x in trigrams]
        # Adding 1 to the adjacency matrix for neighbors of the target trigram
        if neighbors:
            adjacency_tri.loc[triplet,neighbors]=adjacency_tri.loc[triplet,neighbors]+1

print('done!')

done!


In [54]:
adjacency_tri

Unnamed: 0,tolerance admission limit,habit curious natures,abortive sorrows winded,preoccupation hostile levity,series successful gestures,intimate revelation horizon,hostile levity unmistakable,plagiaristic obvious suppressions,interest abortive sorrows,man book exempt,...,sensitivity promises life,horizon intimate revelations,dignified creative extraordinary,politician privy secret,Conduct hard rock,reserved way great,person college politician,sort moral attention,hard rock wet,natures victim veteran
tolerance admission limit,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
habit curious natures,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abortive sorrows winded,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
preoccupation hostile levity,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
series successful gestures,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
intimate revelation horizon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
hostile levity unmistakable,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
plagiaristic obvious suppressions,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
interest abortive sorrows,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
man book exempt,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [55]:
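# Running TextRank
# (DataFrame.as_matrix() and nx.from_numpy_matrix() were removed from recent pandas and networkx releases.)
nx_triplets = nx.from_numpy_array(adjacency_tri.values)
tri_ranks=nx.pagerank(nx_triplets, alpha=.85, tol=.00000001)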

# Identifying the five most highly ranked keywords
tri_ranked = sorted(((tri_ranks[i],s) for i,s in enumerate(words_tri)),
                reverse=True)
print(tri_ranked[:5])

  


[(0.0078125, 'younger vulnerable years'), (0.0078125, 'young men terms'), (0.0078125, 'years father advice'), (0.0078125, 'world uniform sort'), (0.0078125, 'world advantages communicative')]
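
All five trigram scores above equal 0.0078125, which is exactly 1/128, and the ties are listed in reverse alphabetical order. That pattern is what PageRank produces when every node receives the same score, for instance on a graph with 128 nodes and no edges, so this ranking probably reflects the sort order of the strings rather than anything about the text. A quick check of that reading, using a synthetic edgeless graph as a stand-in (the node count of 128 is inferred from the score, not taken from the notebook):

```python
import networkx as nx

# A graph with 128 nodes and no edges, standing in for a trigram graph
# in which no phrase ever found a neighbor.
edgeless = nx.empty_graph(128)

# Same PageRank settings as the notebook.
uniform_ranks = nx.pagerank(edgeless, alpha=0.85, tol=1e-8)

# With no edges, every node ends up with the same share of the total rank.
distinct_scores = {round(score, 10) for score in uniform_ranks.values()}
```

If `adjacency_tri.values.sum()` turns out to be zero in the notebook, the trigram graph is in the same situation, and a wider window or phrases built and scanned over the same word list would be needed before the scores carry information.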
