# Abstractive Summarization

In [1]:
% matplotlib inline
# warning: the below makes unicode the default
from __future__ import unicode_literals 

from __future__ import division

import math
import numpy as np 
import pandas as pd 

import nltk
import inspect

from textacy.vsm import Vectorizer
import textacy.vsm

import spacy

from scipy.spatial.distance import cosine

from tqdm import *

import re

import os
import kenlm

Loading the data

In [2]:
cowts_tweets = pd.read_pickle('cowts_tweets.pkl')
term_matrix = np.load('term_matrix.npy')
vocab_to_idx = np.load('vocab_to_idx.npy').item()
tweet_indices = np.load('tweet_indices.npy')

content_vocab = list(np.load('content_vocab.npy'))
tfidf_dict = np.load('tfidf_dict.npy').item()

Tokenizing it using SpaCy

In [3]:
nlp = spacy.load('en')

In [4]:
spacy_tweets = []

for doc in nlp.pipe(cowts_tweets[0], n_threads = -1):
    spacy_tweets.append(doc)

Given these 59 tweets, which consist of a 1000 word summary of the dataset, I want to generate a paragraph summary from the words. I am going to do this in two steps: 

1. First, I am going to generate a bigram of the words. This will map all of the words in all of the tweets to all the other words they are connected to. 

2. Then, I am going to use the bigram to find an optimal 'text path', which I will solve using Integer Linear Programming. 

Some of the tweets are single word tweets (eg. '`Awful`'). These will not contribute anything to the word graph, so I remove them from this list of chosen tweets. 

In [5]:
spacy_tweets = [tweet for tweet in spacy_tweets if len(tweet) > 1]

## Making a word graph

I want to make a bigram; each node of a bigram is two adjacent words in a tweet. I can then generate sentences by traversing paths (which lead to the same word). 

In [6]:
from nltk.util import bigrams

So for a single tweet, a bigram would look like this: 

In [7]:
(list(bigrams(spacy_tweets[0])))

[(:, LATEST),
 (LATEST, Nepal),
 (Nepal, 's),
 ('s, Kantipur),
 (Kantipur, TV),
 (TV, shows),
 (shows, at),
 (at, least),
 (least, 21),
 (21, bodies),
 (bodies, lined),
 (lined, up),
 (up, on),
 (on, ground),
 (ground, after),
 (after, 7.9),
 (7.9, earthquake),
 (earthquake, 
   )]

I can now construct this for all the tweets, simply by adding their respective bigrams together

In [8]:
all_bigrams = [list(bigrams([token.lemma_ for token in tweets])) for tweets in spacy_tweets]

I take the the starting and end nodes of the bigrams, so that I can generate word paths. 

In [9]:
starting_nodes = [single_bigram[0] for single_bigram in all_bigrams]
end_nodes = [single_bigram[-1] for single_bigram in all_bigrams]

In [10]:
all_bigrams = [node for single_bigram in all_bigrams for node in single_bigram]

In [11]:
all_bigrams = list(set(all_bigrams))

These bigrams themselves are not super useful; what I want to do is compile a list of all the word paths through the bigram. I can do this by using the starts of the tweets as the 'beginnings', and the ends as 'ends'. 

In order to limit the number of word paths, I will limit the path length to between 10 and 15 paths. I'm going to implement a breadth first search to find these word paths.

The first step is to take my bigram list and turn it into a dictionary; this will make it easier to find the paths. 

In [12]:
def make_bigram_graph(all_bigrams, start_node):
    bigram = all_bigrams[:]
    
    '''
    Given a bigram, with a defined start node and defined end nodes, this method
    returns a dictionary which serves as a graph for that bigram 
    '''
    def find_children(bigram, node):
        '''
        Given a node, this method finds all its children 
        '''
        second_word = node[1]
        
        children = [node for node in bigram if node[0] == second_word]
        
        return children
   
    bigram_graph = {}
    # start by adding the start node
    bigram_graph[start_node] = find_children(all_bigrams, start_node)
    bigram.remove(start_node)
    
    nodes_to_check = []
    for i in find_children(bigram, start_node):
        nodes_to_check.append(i)
        
    while nodes_to_check: 
        node = nodes_to_check.pop()
        if node in bigram: 
            bigram_graph[node] = find_children(bigram, node)
            bigram.remove(node)
            for i in find_children(bigram, node):
                nodes_to_check.append(i)
    return bigram_graph


In [13]:
bigram_graph = make_bigram_graph(all_bigrams, starting_nodes[1])

In [14]:
len(bigram_graph)

819

Now that I have this dictionary, I am going to implement a breadth first search to find all possible paths between a start node and an end node. 

In [15]:
def breadth_first_search(bigram_graph, start_node, end_node):
    '''
    This method takes as input a graph, a start node and an end node
    and returns all paths which have a length between 10 and 16
    between the two nodes.
    '''
    graph_to_manipulate = dict(bigram_graph)
    
    queue = []
    paths_to_return = []
    queue.append([start_node])
    
    while queue:
        # get the first path from the queue
        path = queue.pop(0)
        # get the last node from the path
        node = path[-1]
        # path found
        if node == end_node:
            if (len(path) < 16) and (len(path) > 10): #limit path length 
                paths_to_return.append(path)
        # enumerate all adjacent nodes, construct a new path and push it into the queue
        for adjacent in graph_to_manipulate.get(node, []):
            new_path = list(path)
            new_path.append(adjacent)
            queue.append(new_path)
        if node in graph_to_manipulate: 
            del graph_to_manipulate[node] # prevents circular references

    return paths_to_return

In [16]:
path = breadth_first_search(bigram_graph, starting_nodes[1], end_nodes[2])

Lets see the first path this produces

In [17]:
path

[[(u'prayer', u'for'),
  (u'for', u'the'),
  (u'the', u'parts'),
  (u'parts', u'of'),
  (u'of', u'lamjung'),
  (u'lamjung', u','),
  (u',', u'2'),
  (u'2', u'in'),
  (u'in', u'solu'),
  (u'solu', u'dist'),
  (u'dist', u'accord'),
  (u'accord', u'\x89\xfb'),
  (u'\x89\xfb', u'_')]]

Nonsensical, but this is a word path. Success! 

Now, I just need to repeat this exercise for every starting node, mapping to every end node, to collect a 'total word path' document. 

In [18]:
bigram_paths = []

for single_start_node in tqdm(starting_nodes): 
    bigram_graph = make_bigram_graph(all_bigrams, single_start_node)
    for single_end_node in end_nodes:
        possible_paths = breadth_first_search(bigram_graph, single_start_node, single_end_node)
        for path in possible_paths: 
            bigram_paths.append(path)

100%|██████████| 60/60 [00:26<00:00,  2.55it/s]


In [19]:
len(bigram_paths)

4032

Adding the original tweets to the possible word paths

In [20]:
for tweet in spacy_tweets: 
    bigram_paths.append(list(bigrams([token.lemma_ for token in tweets])))

In [21]:
bigram_paths[4033]

[(u':', u'.@su4ita'),
 (u'.@su4ita', u'for'),
 (u'for', u'blood'),
 (u'blood', u'requirement'),
 (u'requirement', u'in'),
 (u'in', u'kathmandu'),
 (u'kathmandu', u'contact'),
 (u'contact', u'mr.'),
 (u'mr.', u'adhikari'),
 (u'adhikari', u'00977'),
 (u'00977', u'-'),
 (u'-', u'9862005225')]

Finally, I want to turn the tweets from bigrams to actual sentences (or at least, lists of unicode). 

In [22]:
def make_list(bigram_path):
    '''
    This method takes a bigram path (eg. [(u'hello', u'world'), (u'world', u'!')]) and returns 
    a list of unicode (eg [u'hello', u'world', u'!')
    '''
    unicode_list = []
    unicode_list.append(bigram_path[0][0])
    unicode_list.append(bigram_path[0][1])
    
    for bigram in bigram_path[1:]:
        unicode_list.append(bigram[1])
    
    return unicode_list

In [23]:
make_list(bigram_paths[4033])

[u':',
 u'.@su4ita',
 u'for',
 u'blood',
 u'requirement',
 u'in',
 u'kathmandu',
 u'contact',
 u'mr.',
 u'adhikari',
 u'00977',
 u'-',
 u'9862005225']

In [24]:
word_paths = []
for path in tqdm(bigram_paths): 
    word_paths.append(make_list(path))

100%|██████████| 4092/4092 [00:00<00:00, 158138.54it/s]


Given all these paths, I want to find the best ones. 

## COntent Words Based ABstractive Summarization (COWABS) 

As per [Rudra et al](http://dl.acm.org/citation.cfm?id=2914600) (different paper from the last notebook), I want to maximize
\begin{equation}
\sum_{i=1}^{n} LQ(i)\cdot I(i) \cdot x_{i} + \sum_{j=1}^{m} y_{j}
\end{equation}
where x are the paths chosen, y are the content words chosen, I(i) describes the Informativeness of some word path i and LQ(i) describes the Linguistic Quality Score of some word path i. 

I will need to define the Informativeness and Linguistic Quality Scores quantitatively: 

**Informativeness** is the cosine distance between each word-path and the mean tf-idf vector. 

**Linguistic Quality Score** assigns probabilities to the occurences of words, with more probable words getting higher scores, such that 
\begin{equation}
LQ(s_{i}) = \frac{1}{(1 - ll(w_1, w_2, ... , w_q)}
\end{equation}
where
\begin{equation}
ll(w_{1}, w_{2}, ... , w_{q}) = \frac{1}{L} log_{2} \prod_{t=3}^q P(w_{t}|w_{t-1}w_{t-2})
\end{equation}
The point of this constraint is to weight more probably sequences of words more highly, therefore favouring more 'realistic' sentences. 

As before, the word paths will be subject to the same constraints as the tweets; all content words in a word path must be selected, and if a content word is selected, so must a word path containing it. 

I'm going to start by defining some methods, which will make all these equations easier to define. 

Starting with informativeness, which is quite a bit easier to define: 

In [25]:
def informativeness(word_path):
    '''
    This method returns the cosine difference between
    a tweet path and the mean of the tf-idf term matrix
    
    Input = word path (as a unicode list)
    Ouptut = cosine difference (scalar value)
    '''
    tfidf_mean = np.mean(term_matrix, axis = 0)
    
    # First, I need to construct the tf-idf vector
    tfidf_path = np.zeros(len(tfidf_mean))
    
    for word in word_path: 
        word_idx = vocab_to_idx[word]
        tfidf_path[word_idx] = np.max(term_matrix[:,word_idx])
   
    cosine_difference = cosine(tfidf_mean, tfidf_path)
    return cosine_difference

In [26]:
informativeness(word_paths[1000])

0.8106635475960231

For the linguistic quality score, I am going to be using [kenlm](http://kheafield.com/code/kenlm/) (specifically its python implementation), which actually calculates this quality for me. 

Like SpaCy, the model must be defined: 

In [27]:
kenlm_model = kenlm.Model('coca.arpa')

Note: the kenlm model takes as input a string, not a list of unicode, so I need to turn the word path into a string sentence before I can pass it to the kenlm model to get its score. 

Since this method depends on the summary length, I will define the summary length `L` here. 

In [28]:
L = 150

In [29]:
def linguistic_quality(word_path):
    '''
    This method takes a word path, and returns a linguistic quality score 
    '''
    path_string = str(" ").join([token.encode('ascii', 'ignore') for token in word_path])
    
    ll_score = math.log(10**kenlm_model.score(path_string, bos = True, eos = True), 2)/L
    
    return (1/(1-ll_score))

I may be picking the best of a bad bunch here. 

The constraints are actually the same as for the COWTS model:

1. 
\begin{equation}
\sum_{i=1}^{n} x_{i} \cdot Length(i) \leq L
\end{equation}
I want the total length of all the selected word paths to be less than some value L, which will be the length of my summary, L. I can vary L depending on how long I want my summary to be. 

2. 
\begin{equation}
\sum_{i \in T_{j}} x_{i} \geq y_{j}, j = [1,...,m]
\end{equation}
If I pick some content word $y_{j}$ (out of my $m$ possible content words) , then I want to have at least one path from the set of word paths which contain that content word, $T_{j}$. 

3. 
\begin{equation}
\sum_{j \in C_{i}} y_{j} \leq |C_{i}| \times x_{i}, i = [1,...,n]
\end{equation}
If I pick some path i (out of my $n$ possible paths) , then all the content words in that path $C_{i}$ are also selected. 

Let's begin the ILP step, once again using  [PyMathProg](http://pymprog.sourceforge.net/index.html).

In [30]:
from pymprog import *

In [31]:
begin('COWABS')

model('COWABS') is the default model.

In [32]:
# Defining my first variable, x 
# This defines whether or not a word path is selected
x = var(str('x'), len(word_paths), bool)

In [33]:
# Also defining the second variable, which defines
# whether or not a content word is chosen
y = var(str('y'), len(content_vocab), bool)

Now that I have defined my variables, I can define my equation to maximize: 
\begin{equation}
\sum_{i=1}^{n} LQ(i)\cdot I(i) \cdot x_{i} + \sum_{j=1}^{m} y_{j}
\end{equation}

In [34]:
maximize(sum([linguistic_quality(word_paths[i])*informativeness(word_paths[i])*x[i] for i in range(len(x))]) + 
         sum(y));

Now, I can define my constraints. First, 
\begin{equation}
\sum_{i=1}^{n} x_{i} \cdot Length(i) \leq L
\end{equation}

In [35]:
# hiding the output of this line since its a very long sum 
sum([x[i]*len(word_paths[i]) for i in range(len(x))]) <= L;

As for COWTS, I define two helper methods for the next two constrains. 

Since I don't have a term matrix, they need to be slighly rewritten. 

In [36]:
def content_words(i):
    '''Given a word path index i (for x[i]), this method will return the indices of the words in the 
    content_vocab[] array
    Note: these indices are the same as for the y variable
    '''
    path = word_paths[i]
    content_indices = []
    
    for word in path:
        if word in content_vocab:
            content_indices.append(content_vocab.index(word))
    return content_indices

In [37]:
def paths_with_content_words(j):
    '''Given the index j of some content word (for content_vocab[j] or y[j])
    this method will return the indices of all tweets which contain this content word
    '''
    content_word = content_vocab[j]
    
    indices = []
    
    for i in range(len(word_paths)):
        if content_word in word_paths[i]:
            indices.append(i)
    
    return indices

I can now define the second constraint: 
\begin{equation}
\sum_{i \in T_{j}} x_{i} \geq y_{j}, j = [1,...,m]
\end{equation}

In [38]:
for j in range(len(y)):
    sum([x[i] for i in paths_with_content_words(j)])>= y[j]

And the third constraint:
\begin{equation}
\sum_{j \in C_{i}} y_{j} \leq |C_{i}| \times x_{i}, i = [1,...,n]
\end{equation}

In [39]:
for i in range(len(x)):
    sum(y[j] for j in content_words(i)) >= len(content_words(i))*x[i]

In [40]:
solve()

'The LP problem instance has been successfully solved. (This code\ndoes {\\it not} necessarily mean that the solver has found optimal\nsolution. It only means that the solution process was successful.) \nThe MIP problem instance has been successfully solved. (This code\ndoes {\\it not} necessarily mean that the solver has found optimal\nsolution. It only means that the solution process was successful.)'

In [41]:
result_x =  [value.primal for value in x]
result_y = [value.primal for value in y]

In [42]:
end()

model('COWABS') is not the default model.

In [43]:
chosen_paths = np.nonzero(result_x)
chosen_words = np.nonzero(result_y)

In [44]:
for i in chosen_paths[0]:
    print ('--------------')
    print str(" ").join([token.encode('ascii', 'ignore') for token in word_paths[i]])

--------------
avalanche sweeps everest base camp , 34 minute of major earthquake  
--------------
: mea control room no for nepal 25/04/2015 07:13 utc , april 25,nepalquake kathmanduquake
--------------
magnitude-7.9 quake hits nepal nepalquake nepalearthquake  high alert after 7.9 magnitude earthquake perso _
--------------
earthquake m7.5 strike 89 km nw of 4.5 + 91 11 2301 7905
--------------
thr r safe . apr 25 14:14 at 7.7 richter scale , via
--------------
sad day for the last 1 hour(s ) .   associatedpress associated press news
--------------
: whole himalayan region be up and lalitpur make kathmandu 's 19th century nine - witness
--------------
: 09771 4261945/ 4261790 emergency helpline number in 80 year - typical indian
--------------
: patan durbar square   afganistan bhutan emb  pm  9779851135141
--------------
building collapse , 400 people kill in kathmandu-+977 98511 07021 , 9851135141
--------------
historic dharara tower  nepal n north east . kathmandu contact m