# Content Words

In [75]:
% matplotlib inline

from __future__ import division

import numpy as np 
import pandas as pd 

import spacy
from spacy.tokens.doc import Doc
import inspect

from textacy.vsm import Vectorizer
import textacy.vsm

import scipy.sparse as sp

from tqdm import *

import re

Loading the data

In [2]:
tweets = pd.read_csv('tweet_ids/2015_Nepal_Earthquake_en/stripped_filled_tweets.csv', encoding = 'ISO-8859-1')

In [3]:
tweets.head()

| Unnamed: 0.1 | Unnamed: 0 | label | tweet_id | tweet_texts |
|---|---|---|---|---|
| 0 | 1 | infrastructure_and_utilities_damage | '591902695822331904' | RT @DailySabah: #LATEST #Nepal's Kantipur TV s... |
| 1 | 2 | injured_or_dead_people | '591902695943843840' | RT @iamsrk: May Allah look after all. Here r t... |
| 2 | 3 | missing_trapped_or_found_people | '591902696371724288' | RT @RT_com: LATEST: 108 killed in 7.9-magnitud... |
| 3 | 4 | sympathy_and_emotional_support | '591902696375877632' | RT @Edourdoo: Shocking picture of the earthqua... |
| 4 | 5 | sympathy_and_emotional_support | '591902696895950848' | Indian Air Force is ready to help the people o... |


In [4]:
tweets = tweets.dropna()

## Preprocessing

For my tweets to be informative, there are a few terms I can immediately remove. For instance, URLs won't be useful to the rescue teams. Likewise, any '@...' mentions just reference another Twitter handle and add no information.

In [5]:
# removing URLS
tweets.tweet_texts = tweets.tweet_texts.apply(lambda x: re.sub(u'http\S+', u'', x))   

# removing @... 
tweets.tweet_texts = tweets.tweet_texts.apply(lambda x: re.sub(u'(\s)@\w+', u'', x))

# removing hashtags
tweets.tweet_texts = tweets.tweet_texts.apply(lambda x: re.sub(u'#', u'', x))
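
To see what these three substitutions do, here is a minimal sketch that applies them to two made-up tweets (the sample strings and the function name `strip_twitter_markup` are assumptions for illustration, not taken from the dataset). It also shows that the `(\s)@\w+` pattern only matches handles preceded by whitespace, so a handle at the very start of a tweet survives.

```python
import re

def strip_twitter_markup(text):
    """Apply the same three substitutions as the cell above."""
    text = re.sub(r'http\S+', '', text)    # remove URLs
    text = re.sub(r'(\s)@\w+', '', text)   # remove handles preceded by whitespace
    text = re.sub(r'#', '', text)          # keep the hashtag word, drop the '#'
    return text

# Hypothetical examples, invented for illustration
sample = "RT @SomeHandle: #LATEST quake in #Nepal http://example.com/abc"
leading = "@SomeHandle is this near Kathmandu?"

print(repr(strip_twitter_markup(sample)))    # 'RT: LATEST quake in Nepal '
print(repr(strip_twitter_markup(leading)))   # handle at position 0 is untouched
```

If leading handles should go too, a pattern such as `@\w+` (without the whitespace group) would catch them. That is a variation, not what this notebook ran.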

In [6]:
tweets.tweet_texts.head()

0    RT: LATEST Nepal's Kantipur TV shows at least ...
1    RT: May Allah look after all. Here r the emerg...
2    RT: LATEST: 108 killed in 7.9-magnitude Nepal ...
3    RT: Shocking picture of the earthquake in Nepa...
4    Indian Air Force is ready to help the people o...
Name: tweet_texts, dtype: object

There are a lot of `u'RT'` terms in the tweet texts. Since 'RT' isn't a word, SpaCy doesn't know how to handle it. These markers add nothing to the content of a tweet, so I'm just going to get rid of them. 

In [7]:
tweets.tweet_texts = tweets.tweet_texts.apply(lambda x: x.replace(u'RT', u''))
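
One caveat: `str.replace` removes every occurrence of the two characters 'RT', including inside upper-case words. A possible alternative, sketched below on an invented string, is a word-boundary regex that only removes 'RT' when it stands alone. The helper name `drop_retweet_marker` is hypothetical, and this is not what produced the outputs in this notebook.

```python
import re

def drop_retweet_marker(text):
    """Remove 'RT' only when it stands alone as a word."""
    return re.sub(r'\bRT\b', '', text)

sample = "RT: EARTHQUAKE ALERT for Nepal"   # invented example

print(sample.replace('RT', ''))      # ': EAHQUAKE ALE for Nepal'
print(drop_retweet_marker(sample))   # ': EARTHQUAKE ALERT for Nepal'
```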

Tokenizing with SpaCy

In [8]:
nlp = spacy.load('en')

In [9]:
spacy_tweets = []

for doc in nlp.pipe(tweets.tweet_texts, n_threads = -1):
    spacy_tweets.append(doc)

In [10]:
spacy_tweets[200]

MEA opens 24 hour Control Room for queries regarding the Nepal Earthquake.åÊ
Numbers:
+91 11 2301 2113
+91 11 2301 4104
+91 11 2301 7905

Not all tweets are equally useful. Some just contain prayers, such as

`Hope it doesn't rain. #Nepal`

whereas others are dense with useful information: 

`2 Dead, 100 Injured in Bangladesh From Nepal Quake`

How do I decide which parts of these tweets are most useful? One way to do it is to measure the term frequency-inverse document frequency (tf-idf) of each of the words in the corpus of tweets. This metric measures how important a word is to a particular tweet, relative to how common it is across the whole corpus: a word scores highly if it appears in that tweet but in few of the others. 
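
As a rough illustration, here is the textbook version of tf-idf on three toy tweets (the tweets are invented). This sketch assumes the plain formula, raw count times `log(N / df)`; a library may apply a smoothed or normalised variant, so the absolute numbers will not match library output.

```python
import math
from collections import Counter

# Three toy "tweets", already tokenized (invented for illustration)
docs = [
    ["nepal", "earthquake", "killed", "114"],
    ["pray", "for", "nepal"],
    ["nepal", "earthquake", "helpline", "number"],
]

n_docs = len(docs)
# Document frequency: in how many tweets does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(term, doc):
    """Textbook tf-idf: raw count in the tweet times log(N / df)."""
    tf = doc.count(term)
    idf = math.log(float(n_docs) / doc_freq[term])
    return tf * idf

for term in ["nepal", "earthquake", "114"]:
    print(term, round(tfidf(term, docs[0]), 3))
```

'nepal' appears in every tweet, so its score is zero, while '114' appears in only one tweet and scores highest. This is the behaviour I want: words that distinguish a tweet are weighted above words that every tweet shares.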

## Getting the tf-idf values of content words. 

I can do a preliminary 'cleanup' by keeping only 'content words'. These are defined as numerals, nouns and verbs. Conveniently, SpaCy has already tagged these for us. 

In [11]:
spacy_tweets[90]

: BREAKING: At least 114 killed in Nepal earthquake: home ministry - Read  

I care most about tokens which are named entities or numbers. The other tokens carry too much noise, so let's focus on these two, plus a short list of hand-picked disaster words:

In [12]:
main_words = [u'earthquake', u'killed', u'injured', u'stranded', u'wounded', u'hurt', u'helpless', u'wrecked', u'nepal']

In [13]:
useful_entities = [u'NORP', u'FACILITY', u'ORG', u'GPE', u'LOC', u'EVENT', u'DATE', u'TIME']

In [14]:
content_tweets = []
for single_tweet in tqdm(spacy_tweets):
    single_tweet_content = []
    for token in single_tweet: 
        if ((token.ent_type_ in useful_entities)  
            or (token.pos_ == u'NUM') 
            or (token.lower_ in main_words)):
            single_tweet_content.append(token)
    content_tweets.append(single_tweet_content)

100%|██████████| 2337/2337 [00:00<00:00, 24086.83it/s]
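
The filter only reads three attributes of each token, so it can be tried out without loading a SpaCy model. Below is a minimal sketch using a stand-in token type. The `Tok` namedtuple, the shortened word and entity lists, and the hand-written tags are all assumptions for illustration; real tags come from the SpaCy pipeline.

```python
from collections import namedtuple

# Stand-in for a SpaCy token, carrying only the attributes the filter reads
Tok = namedtuple('Tok', ['text', 'pos_', 'ent_type_'])

MAIN_WORDS = {'earthquake', 'killed', 'injured', 'nepal'}
USEFUL_ENTITIES = {'ORG', 'GPE', 'LOC', 'EVENT', 'DATE', 'TIME'}

def is_content_token(tok):
    """Keep useful entities, numerals, and hand-picked disaster words."""
    return (tok.ent_type_ in USEFUL_ENTITIES
            or tok.pos_ == 'NUM'
            or tok.text.lower() in MAIN_WORDS)

# Hand-annotated example, not real pipeline output
tweet = [
    Tok('At', 'ADP', ''), Tok('least', 'ADJ', ''), Tok('114', 'NUM', 'CARDINAL'),
    Tok('killed', 'VERB', ''), Tok('in', 'ADP', ''), Tok('Nepal', 'PROPN', 'GPE'),
    Tok('earthquake', 'NOUN', ''),
]

kept = [tok.text for tok in tweet if is_content_token(tok)]
print(kept)   # ['114', 'killed', 'Nepal', 'earthquake']
```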


In [15]:
tweet_num = 200
print ("original_tweet \n" + str(spacy_tweets[tweet_num]) 
       + "\n\noriginal_tweet\n" + str([str(x) for x in spacy_tweets[tweet_num]])
       + "\n\ncontent_tweet\n" + str(content_tweets[tweet_num])
      )

original_tweet 
MEA opens 24 hour Control Room for queries regarding the Nepal Earthquake.åÊNumbers:+91 11 2301 2113+91 11 2301 4104+91 11 2301 7905

original_tweet
['MEA', 'opens', '24', 'hour', 'Control', 'Room', 'for', 'queries', 'regarding', 'the', 'Nepal', 'Earthquake.\xc3\xa5\xc3\x8a', '\r', 'Numbers', ':', '\r', '+', '91', '11', '2301', '2113', '\r', '+', '91', '11', '2301', '4104', '\r', '+', '91', '11', '2301', '7905']

content_tweet
[MEA, 24, hour, Control, Room, Nepal, 91, 11, 2301, 2113, 91, 11, 2301, 4104, 91, 11, 2301, 7905]


So this has already gone some way to (crudely) isolating the interesting parts of a tweet. 

Unfortunately, SpaCy doesn't calculate tf-idf score automatically. There IS a library which can do this: [textacy](https://textacy.readthedocs.io/en/latest/index.html). Note: textacy is built on SpaCy.

I care about the tf-idf scores of the entire tweet, so will find the tf-idf score across the entire corpus of original tweets. 

In [16]:
vectorizer = Vectorizer(weighting = 'tfidf')

To calculate the tf-idf score of all the tokens in the tweets, I can use `fit_transform()`. 

Note: I am using the `lemma_` attribute of each token, because each token object is tied to the document it came from. This means that 'Nepal' in the 100th tweet will be a different **token** from 'Nepal' in the 200th tweet, but will have the same `lemma_` attribute. This is what I want to compare - I don't want hundreds of 'Nepal' columns in my term matrix. 

In [17]:
term_matrix = vectorizer.fit_transform([tok.lemma_ for tok in doc] for doc in spacy_tweets)

This matrix is a document-term matrix: each row is a document (a tweet), each column is a word, and the entries are tf-idf values. 

If the tweet in row `i` contains the word in column `j`, then the element `matrix[i, j]` will contain that word's tf-idf value for that tweet. If the tweet *doesn't* contain the word, the matrix value will be zero. 

In [18]:
np_matrix = term_matrix.todense()

In [19]:
np_matrix.shape

(2337, 2184)

My ultimate goal is to create a dictionary, which maps from the tokens in the content tweets to some tf-idf score. To do this, I need to find out which tokens are at what columns in the term matrix. 

The vectorizer object has a `vocabulary` dictionary, which maps each `token.lemma_` to its column index. 

In [20]:
for key in sorted(vectorizer.vocabulary)[1000:1015]:
    print key, vectorizer.vocabulary[key]

himalayan 722
himaû 629
hindan 1536
hindu 1486
hindus 1808
historic 506
historical 910
history 628
hit 94
hits 1244
hn 1283
hold 1144
hom 803
home 264
hop 1522


Each column (word) holds that word's tf-idf values across all tweets. To get a single score per word, I take the maximum value in its column.

I can therefore map the value of the content tokens to their tf-idf, using the `vectorizer.vocabulary` dictionary. 

In [21]:
for token in content_tweets[500]:
    print (token.lemma_, vectorizer.vocabulary[token.lemma_], np.max(np_matrix[:,vectorizer.vocabulary[token.lemma_]]))

(u'earthquake', 17, 2.9568443994226628)
(u'indian', 48, 4.2797143275538065)
(u'kathmandu', 96, 5.2647355022756184)
(u'977', 255, 9.9007773045233876)
(u'98511', 256, 11.522637736956046)
(u'07021', 884, 6.0490009409298038)
(u'977', 255, 9.9007773045233876)
(u'98511', 256, 11.522637736956046)
(u'35141"\x89\xfb', 885, 7.3707567809121235)


In [22]:
tfidf_dict = {}
content_vocab = []
for tweet in content_tweets: 
    for token in tweet: 
        if token.lemma_ not in tfidf_dict: 
            content_vocab.append(token.lemma_)
            tfidf_dict[token.lemma_] = np.max(np_matrix[:,vectorizer.vocabulary[token.lemma_]])
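
The same word can have a different tf-idf value in each tweet, so the loop above collapses every column to a single number by taking its maximum. Here is a minimal sketch of that reduction on a toy matrix (the vocabulary and values are invented, and plain lists stand in for the dense matrix):

```python
# Toy document-term matrix: rows are tweets, columns are words (values invented)
vocabulary = {'nepal': 0, 'earthquake': 1, 'helpline': 2}
matrix = [
    [0.4, 1.1, 0.0],
    [0.4, 0.0, 0.0],
    [0.8, 1.1, 2.3],
]

def word_score(word):
    """Score a word by its largest tf-idf value in any tweet."""
    col = vocabulary[word]
    return max(row[col] for row in matrix)

scores = {word: word_score(word) for word in vocabulary}
print(scores)
```

Other reductions, such as the column mean, would also give one score per word. The maximum is simply the choice made in the cell above.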

In [24]:
for key in sorted(tfidf_dict)[500:505]:
    print ("WORD:" + str(key) + " -- tf-idf SCORE:" +  str(tfidf_dict[key]))

WORD:property -- tf-idf SCORE:5.98446241979
WORD:pulse -- tf-idf SCORE:8.06390396147
WORD:purvanchal -- tf-idf SCORE:7.65843885336
WORD:quake -- tf-idf SCORE:6.40818311422
WORD:r -- tf-idf SCORE:9.22783283128


Success! 

## COntent Word-based Tweet Summarization (COWTS) 
As per [Rudra et al](http://dl.acm.org/citation.cfm?id=2806485). 

I'll be using [PyMathProg](http://pymprog.sourceforge.net/index.html) as my Integer Linear Programming solver. This is a Python interface for [GLPK](https://www.gnu.org/software/glpk/).

In [150]:
from pymprog import *

I want to maximize 
\begin{equation}
\sum_{i=1}^n x_{i} + \sum_{j = 1}^{m} Score(j) \cdot y_{j}
\end{equation}
Where $x_{i}$ is 1 if I include tweet $i$ and 0 if I don't, $y_{j}$ is 1 if content word $j$ is included and 0 otherwise, and $Score(j)$ is that word's tf-idf score. 

I'm going to subject this equation to the following constraints: 

1. 
\begin{equation}
\sum_{i=1}^{n} x_{i} \cdot Length(i) \leq L
\end{equation}
I want the total length of all the selected tweets to be at most some value $L$, the maximum length of my summary. I can vary $L$ depending on how long I want my summary to be. 

2. 
\begin{equation}
\sum_{i \in T_{j}} x_{i} \geq y_{j}, j = [1,...,m]
\end{equation}
If I pick some content word $y_{j}$ (out of my $m$ possible content words) , then I want to have at least one tweet from the set of tweets which contain that content word, $T_{j}$. 

3. 
\begin{equation}
\sum_{j \in C_{i}} y_{j} \geq |C_{i}| \times x_{i}, i = [1,...,n]
\end{equation}
If I pick some tweet i (out of my $n$ possible tweets) , then all the content words in that tweet $C_{i}$ are also selected. 
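
To make the objective and the three constraints concrete, here is a brute-force sketch on a toy instance (three tweets, three content words, all values invented). It enumerates every choice of tweets instead of calling an ILP solver, which only works at this tiny scale. It relies on one observation: taken together, the second and third constraints mean a word is selected exactly when at least one chosen tweet contains it.

```python
from itertools import product

# Toy instance (all values invented): 3 tweets, 3 content words
lengths = [5, 4, 6]                      # Length(i), in tokens
scores = [2.0, 1.0, 3.0]                 # Score(j), the tf-idf score of each word
words_in_tweet = [{0, 1}, {1}, {1, 2}]   # C_i: content words in tweet i
L = 10                                   # length budget

def best_summary():
    best_value, best_x = -1.0, None
    for x in product([0, 1], repeat=len(lengths)):
        # Constraint 1: total length of chosen tweets within the budget
        if sum(xi * li for xi, li in zip(x, lengths)) > L:
            continue
        # Constraints 2 and 3: a word is selected exactly when a chosen tweet contains it
        covered = set()
        for i, xi in enumerate(x):
            if xi:
                covered |= words_in_tweet[i]
        # Objective: number of tweets plus total score of the selected words
        value = sum(x) + sum(scores[j] for j in covered)
        if value > best_value:
            best_value, best_x = value, x
    return best_x, best_value

print(best_summary())   # ((0, 1, 1), 6.0)
```

Tweets 1 and 2 together use the full budget of 10 tokens and cover the two highest-scoring combinations of words available within it, so they win over any selection containing tweet 0.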

In [151]:
begin('COWTS')

model('COWTS') is the default model.

In [152]:
# Defining my first variable, x 
# This defines whether or not a tweet is selected
x = var('x', len(spacy_tweets), bool)

# Check this worked
x[1000]

0 <= x[1000] <= 1 binary

In [153]:
# Also defining the second variable, which defines
# whether or not a content word is chosen
y = var('y', len(content_vocab), bool)

In [154]:
len(y), y[0]

(611, 0 <= y[0] <= 1 binary)

Now that I have defined my variables, I can define the equation I am maximizing. 

In [155]:
maximize(sum(x) + sum([tfidf_dict[content_vocab[j]]*y[j] for j in range(len(y))]));

Now, I can define my constraints. First, 
\begin{equation}
\sum_{i=1}^{n} x_{i} \cdot Length(i) \leq L
\end{equation}

In [156]:
## Maximum length of the entire tweet summary

# Was 150 for the tweet summary, 
# But generated a 1000 word summary for CONABS
L = 1000

# hiding the output of this line since it's a very long sum 
sum([x[i]*len(spacy_tweets[i]) for i in range(len(x))]) <= L;

These next two constraints are slightly more tricky, as I need a way to define which content words are in which tweets. 

However, the term matrix I defined using the vectorizer has all of this information. 

I'll begin by defining two helper functions.

In [157]:
def content_words(i):
    '''Given a tweet index i (for x[i]), this method will return the indices of the words in the 
    content_vocab[] array
    Note: these indices are the same as for the y variable
    '''
    tweet = spacy_tweets[i]
    content_indices = []
    
    for token in tweet:
        if token.lemma_ in content_vocab:
            content_indices.append(content_vocab.index(token.lemma_))
    return content_indices

In [158]:
def tweets_with_content_words(j):
    '''Given the index j of some content word (for content_vocab[j] or y[j])
    this method will return the indices of all tweets which contain this content word
    '''
    content_word = content_vocab[j]
    
    index_in_term_matrix = vectorizer.vocabulary[content_word]
    
    matrix_column = np_matrix[:, index_in_term_matrix]
    
    return np.nonzero(matrix_column)[0]
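
Both helpers get called once per variable, and `content_vocab.index(...)` scans the list on every call. If this becomes slow on a larger corpus, one option is to build the lookups once up front. A minimal sketch with stand-in data follows (the lemma lists and the names `vocab_index`, `content_word_indices` and `tweets_with_word` are assumptions for illustration):

```python
# Stand-ins for the notebook's objects (values invented for illustration)
content_vocab = ['nepal', 'earthquake', '114', 'kathmandu']
tweet_lemmas = [
    ['114', 'kill', 'nepal', 'earthquake'],
    ['pray', 'for', 'nepal'],
]

# Build the word -> index lookup once instead of calling list.index repeatedly
vocab_index = {word: j for j, word in enumerate(content_vocab)}

def content_word_indices(lemmas):
    """Indices into content_vocab for every content word in one tweet."""
    return [vocab_index[lemma] for lemma in lemmas if lemma in vocab_index]

# Invert the mapping: for each word index j, which tweets contain it?
tweets_with_word = {j: [] for j in range(len(content_vocab))}
for i, lemmas in enumerate(tweet_lemmas):
    for j in set(content_word_indices(lemmas)):
        tweets_with_word[j].append(i)

print(content_word_indices(tweet_lemmas[0]))   # [2, 0, 1]
print(tweets_with_word[0])                     # [0, 1]
```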

I can now define the second constraint: 
\begin{equation}
\sum_{i \in T_{j}} x_{i} \geq y_{j}, j = [1,...,m]
\end{equation}

In [159]:
for j in range(len(y)):
    sum([x[i] for i in tweets_with_content_words(j)])>= y[j]

And the third constraint:
\begin{equation}
\sum_{j \in C_{i}} y_{j} \geq |C_{i}| \times x_{i}, i = [1,...,n]
\end{equation}

In [160]:
for i in range(len(x)):
    sum(y[j] for j in content_words(i)) >= len(content_words(i))*x[i]

In [161]:
solve()

'The LP problem instance has been successfully solved. (This code\ndoes {\\it not} necessarily mean that the solver has found optimal\nsolution. It only means that the solution process was successful.) \nThe MIP problem instance has been successfully solved. (This code\ndoes {\\it not} necessarily mean that the solver has found optimal\nsolution. It only means that the solution process was successful.)'

In [162]:
result_x =  [value.primal for value in x]
result_y = [value.primal for value in y]

In [163]:
end()

model('COWTS') is not the default model.

In [164]:
chosen_tweets = np.nonzero(result_x)
chosen_words = np.nonzero(result_y)

In [165]:
len(chosen_tweets[0]), len(chosen_words[0])

(60, 314)

Let's take a look at the results! 

In [65]:
for i in chosen_tweets[0]:
    print ('--------------')
    print spacy_tweets[i]

--------------
MEA opens 24 hour Control Room in Delhi for queries regarding the Nepal Earthquake. 011 2301 2113011 2301 4104011 2301 7905
--------------
: USGS reports a M5 earthquake 31km NNW of Nagarkot, Nepal on 4/25/15 @ 9:30:29 UTC  quake
--------------
TV: 2 dead, 100 injured in Bangladesh from Nepal quake: DHAKA, Bangladesh (AP) ÛÓ A TV r...  
--------------
Avalanche Sweeps Everest in Nepal; 30 Injured 
--------------
: Earthquake helpline at the Indian Embassy in Kathmandu-+977 98511 07021, +977 98511 35141
--------------
earthquickinnepal shocing news earthquick in nepal may god all safe
--------------
: Whole Himalayan region is becoming non stable. Two yrs back Uttrakhand, then Kashmir now Nepal n north east. Even Tibet isÛ_
--------------
WellingtonHere Nepal's Home Ministry Says at Least 71 People Killed in the Earthquake: Nepal'...  WellingtonHere
--------------
Years of major earthquake-s in Nepal:125514081681181018331934 
--------------
Historic Dharahara

There is definitely noise amongst these tweets, but these tweets do successfully provide a good overview of the situation in Nepal.

I am going to compare this to a sample of randomly chosen tweets, to make sure it does perform better than random selection. 

In [401]:
random_tweets = np.random.choice(spacy_tweets, size=11)

In [402]:
for i in random_tweets:
    print ('--------')
    print i

--------
: Images from Everest Base Camp by UTM student Azim Afif following the quake #Kathmandu UTM camp is safe. 
--------
: Sad 2 see this image of extensive damage due to the #earthquake in Nepal.My prayers with the victims &amp; their families 
--------
From HN: Strong earthquake rocks Nepal, damages Kathmandu 
--------
: OMG ! 7.9 magnitude earthquake in Nepal &amp; parts of Northern &amp; Eastern India. I pray to God that everyone is safe _Ùª _Ùª #Û_
--------
: MEA opens 24 hour Control Room for queries regarding the Nepal Earthquake. Numbers:+91 11 2301 2113+91 11 2301 4104+91 11Û_
--------
: More than 100 killed in powerful Nepal #earthquake, say government officials and police  
--------
@EyeshaBee I just checked out Thamel, thinking it was a distance away from Kathmandu, but it's a suburb!
--------
: UPDATE: Humanitarian crisis in Nepal. 100+ dead, toll may rises to 1000's. Prayers &amp; thought for Nepalese friends. 
--------
A powerful earthquake has rocked Nepal, 

A brief comparison does indicate that this method is far better than random choice at providing a situational overview. 

It's worth noting that even a random sample will contain a fair amount of information, because of the selective way in which the tweets were isolated; this dataset is already a subsample which contains a higher proportion of relevant information. 

In [171]:
cowts_tweets = []
for i in chosen_tweets[0]:
    cowts_tweets.append(spacy_tweets[i])

Let's take a look at the first few tweets

In [172]:
for tweet in cowts_tweets[:10]:
    print ('--------')
    print tweet

--------
: LATEST Nepal's Kantipur TV shows at least 21 bodies lined up on ground after 7.9 earthquake 
--------
Prayers for the affected people across SouthAsia by the horrible earthquake!India Bangladesh Pakistan  Afganistan Bhutan Nepal
--------
: Due to bulding collaps 12 People died in Eastern Part of Nepal 5 in sunsari,5 In okhaldhunga and 2 in Solu dist according Û_
--------
: M7.9  - 29km ESE of Lamjung, Nepal  20 00 29 26.  Highly informative read on seismic history of HimaÛ_
--------
Earthquake: 2015-04-25 17:30HKT M5.0 [28.0N,85.4E] in Nepal 
--------
: USGS reports a M5 earthquake 31km NNW of Nagarkot, Nepal on 4/25/15 @ 9:30:29 UTC  quake
--------
[AP] Key facts about Nepal, site of magnitude-7.9 quake 
--------
NepalEarthquake PM spoke with Nepal prez, PMCM's BIHAR MP WB SIKKIMUP,MP sitamarhiBhutan embhigh level meetingNDRF dispatchedwow!
--------
The BBC put the death toll even higher 
--------
: 556 tourists to Nepal from Maharashtra safe. Indian embassy hel

This notebook is getting long, so I'm going to save these tweets (which I will continue using) and start a fresh notebook for the next steps. 

## Saving everything for a fresh notebook 

In [167]:
cowts_unicode = [x.text for x in cowts_tweets]

In [168]:
cowts_dataframe = pd.DataFrame(cowts_unicode)

In [169]:
cowts_dataframe.head()

| Unnamed: 0 | 0 |
|---|---|
| 0 | : LATEST Nepal's Kantipur TV shows at least 21... |
| 1 | Prayers for the affected people across SouthAs... |
| 2 | : Due to bulding collaps 12 People died in Eas... |
| 3 | : M7.9 - 29km ESE of Lamjung, Nepal 20 00 29... |
| 4 | Earthquake: 2015-04-25 17:30HKT M5.0 [28.0N,85... |


Saving it to a pickle: 

In [170]:
cowts_dataframe.to_pickle('cowts_tweets.pkl')

In [72]:
np.save('term_matrix.npy', np_matrix)

In [73]:
np.save('tweet_indices.npy', chosen_tweets)

In [74]:
np.save('vocab_to_idx.npy', vectorizer.vocabulary)

In [173]:
np.save('content_vocab.npy', content_vocab)

In [176]:
np.save('tfidf_dict.npy', tfidf_dict)
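
A note for the next notebook: `np.save` stores a Python dict as a pickled object array, so reading it back needs `np.load(..., allow_pickle=True).item()` on recent NumPy versions. An alternative for the dictionaries is to pickle them directly. Here is a minimal sketch of that round trip (the dictionary values are invented, and a temporary directory is used so nothing in the working folder is overwritten):

```python
import os
import pickle
import tempfile

# Stand-in for the notebook's tfidf_dict (values invented)
tfidf_dict = {'quake': 6.41, 'pulse': 8.06}

path = os.path.join(tempfile.mkdtemp(), 'tfidf_dict.pkl')

# Write the dictionary to disk
with open(path, 'wb') as handle:
    pickle.dump(tfidf_dict, handle)

# Read it back and confirm nothing was lost
with open(path, 'rb') as handle:
    restored = pickle.load(handle)

print(restored == tfidf_dict)   # True
```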