# Content Words

In [1]:
% matplotlib inline

from __future__ import division

import numpy as np 
import pandas as pd 

import nltk
from nltk.tokenize import TweetTokenizer
from nltk import pos_tag, ne_chunk
from nltk.tree import Tree
from nltk.tag.stanford import StanfordNERTagger

import inspect

from textacy.vsm import Vectorizer
import textacy.vsm

import scipy.sparse as sp

from tqdm import tqdm

import re

Loading the data

In [2]:
tweets = pd.read_csv('tweet_ids/2015_Nepal_Earthquake_en/stripped_filled_tweets.csv', encoding = 'ISO-8859-1')

In [3]:
tweets.head()

Unnamed: 0.1,Unnamed: 0,label,tweet_id,tweet_texts
0,1,infrastructure_and_utilities_damage,'591902695822331904',RT @DailySabah: #LATEST #Nepal's Kantipur TV s...
1,2,injured_or_dead_people,'591902695943843840',RT @iamsrk: May Allah look after all. Here r t...
2,3,missing_trapped_or_found_people,'591902696371724288',RT @RT_com: LATEST: 108 killed in 7.9-magnitud...
3,4,sympathy_and_emotional_support,'591902696375877632',RT @Edourdoo: Shocking picture of the earthqua...
4,5,sympathy_and_emotional_support,'591902696895950848',Indian Air Force is ready to help the people o...


In [4]:
tweets = tweets.dropna()

## Preprocessing

For the tweets to be informative, there are a few terms I can remove immediately. For instance, URLs won't be useful to rescue teams. Likewise, any '@...' mentions just reference another Twitter handle, so they are not useful either. I also strip the '#' symbol, keeping the hashtag text itself.

In [5]:
# removing URLS
tweets.tweet_texts = tweets.tweet_texts.apply(lambda x: re.sub(u'http\S+', u'', x))   

# removing @... 
tweets.tweet_texts = tweets.tweet_texts.apply(lambda x: re.sub(u'(\s)@\w+', u'', x))

# removing the '#' symbol (the hashtag text itself is kept)
tweets.tweet_texts = tweets.tweet_texts.apply(lambda x: re.sub(u'#', u'', x))

In [6]:
tweets.tweet_texts.head()

0    RT: LATEST Nepal's Kantipur TV shows at least ...
1    RT: May Allah look after all. Here r the emerg...
2    RT: LATEST: 108 killed in 7.9-magnitude Nepal ...
3    RT: Shocking picture of the earthquake in Nepa...
4    Indian Air Force is ready to help the people o...
Name: tweet_texts, dtype: object

There are a lot of `u'RT'` (retweet) markers in the tweet texts. Since these add nothing to the content of a tweet, I'm just going to get rid of them.

In [7]:
tweets.tweet_texts = tweets.tweet_texts.apply(lambda x: x.replace(u'RT', u''))
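One caveat with the plain `replace` call: it removes the letters `RT` wherever they occur, including inside uppercase words such as `ALERT`. Below is a minimal sketch of a stricter variant, assuming a word-boundary regex is acceptable here. The function `clean_tweet` and the sample text are hypothetical and not part of the original notebook.

```python
import re

def clean_tweet(text):
    """Hypothetical helper: the same cleaning steps, with a stricter RT rule."""
    text = re.sub(r'http\S+', '', text)    # remove URLs
    text = re.sub(r'(\s)@\w+', '', text)   # remove @handles (same pattern as the cell above)
    text = re.sub(r'#', '', text)          # remove the '#' symbol, keep the hashtag text
    text = re.sub(r'\bRT\b', '', text)     # remove 'RT' only when it is a whole word
    return text

sample = "RT @DailySabah: #LATEST ALERT in #Nepal http://t.co/abc"  # made-up example
cleaned = clean_tweet(sample)
print(repr(cleaned))
```

The outputs shown in the rest of this notebook were produced with the plain `replace`, so switching to this variant could change a few tokens.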

Extracting information from tokens using NLTK is a little trickier than with SpaCy (but not much, and NLTK has the advantage of a Twitter-specific tokenizer). 

To extract the part-of-speech (POS) tags, I use the `pos_tag` function. 

Tokenizing with NLTK

In [8]:
tokenizer = TweetTokenizer()

In [9]:
nltk_tweets = []

for tweet in tweets.tweet_texts:
    nltk_tweets.append(tokenizer.tokenize(tweet))

In [10]:
nltk_tweets[100]

[u':',
 u'Over',
 u'110',
 u'killed',
 u'in',
 u'earthquake',
 u':',
 u'Nepal',
 u'Home',
 u'Ministry',
 u'(',
 u'PTI',
 u')']

Now, to get the part of speech tags for all of these tokenized tweets, I'll use `nltk.pos_tag`. This method requires a particular file, `'taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle'`, which I can download using `nltk.download()`. 

Note: this opens a graphical interface, which can then be used to download a host of additional packages that complement NLTK. 

In [11]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

Having downloaded the averaged perceptron tagger, I can now tag each token in the tweets with its part of speech. 

In [12]:
pos_tag(nltk_tweets[100])

[(u':', ':'),
 (u'Over', 'IN'),
 (u'110', 'CD'),
 (u'killed', 'VBN'),
 (u'in', 'IN'),
 (u'earthquake', 'NN'),
 (u':', ':'),
 (u'Nepal', 'NNP'),
 (u'Home', 'NNP'),
 (u'Ministry', 'NNP'),
 (u'(', '('),
 (u'PTI', 'NNP'),
 (u')', ')')]

In [13]:
nltk_pos = []

for tweet in nltk_tweets:
    nltk_pos.append(pos_tag(tweet))

Now I just need the named entities in each tweet. NLTK offers `ne_chunk` for this, but I am going to use Stanford's Named Entity Recognizer (NER) instead, through NLTK's `StanfordNERTagger` wrapper. 

I begin by initializing the NER, using files I downloaded from [Stanford's website](https://nlp.stanford.edu/software/CRF-NER.shtml). 

In [279]:
st = StanfordNERTagger('NLTK_resources/stanford-ner-2017-06-09/classifiers/english.all.3class.distsim.crf.ser.gz',
           'NLTK_resources/stanford-ner-2017-06-09/stanford-ner.jar')

Note: since the Stanford NER tagger takes about two hours to run on this corpus, I've commented out the code block and instead saved its output (the entity-tagged tokens of each tweet) as a `.npy` file. 

In [14]:
nltk_ents = np.load('ner_content.npy')

In [298]:
# nltk_ents = []

# for tweet in tqdm(nltk_tweets): 
#     entity_tagged_tweet = st.tag(tweet)
#     nltk_ents.append([tag for tag in entity_tagged_tweet if (tag[1] != u'O')])

100%|██████████| 2337/2337 [1:58:44<00:00,  3.28s/it]  


Not all tweets are equally useful. Some just contain prayers, such as

`Hope it doesn't rain. #Nepal`

whereas others are dense with useful information: 

`2 Dead, 100 Injured in Bangladesh From Nepal Quake`

How do I decide which parts of these tweets are most useful? One way is to measure the term frequency-inverse document frequency (tf-idf) of each word in the corpus of tweets. This metric scores how important a word is to a tweet relative to the rest of the corpus: it grows with how often the word occurs in a tweet and shrinks with the number of tweets that contain the word. 
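To make the idea concrete, here is a minimal sketch of the textbook tf-idf variant, `tf * log(N / df)`, on a made-up three-tweet corpus. This is an assumption for illustration only: the library used later in this notebook applies its own weighting, so its numbers will not match this formula exactly.

```python
import math

# Toy corpus (made up): each tweet is a list of lowercase tokens
docs = [
    ["2", "dead", "100", "injured", "nepal", "quake"],
    ["hope", "it", "does", "not", "rain", "nepal"],
    ["nepal", "quake", "kills", "100"],
]

def tfidf(term, doc, corpus):
    """Textbook tf-idf: raw term count times log(N / document frequency)."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0  # term never seen in the corpus
    return tf * math.log(len(corpus) / float(df))

for term in ["nepal", "quake", "injured"]:
    print(term, tfidf(term, docs[0], docs))
```

A word that appears in every tweet ("nepal") scores zero, while a word that appears in only one tweet ("injured") scores highest.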

## Getting the tf-idf values of content words

I can do a preliminary 'cleanup' by keeping only 'content words'. These are defined as numerals, nouns and verbs. Conveniently, I have just tagged the tokens using NLTK's methods. 

I care most about tokens which are named entities or numbers. The other tokens carry too much noise, so let's focus on these two. 

In addition, entities which are people tend to appear in personal wishes (non-situational), so I remove those from my content tweets as well. 

In [17]:
content_tweets = []
for pos_tweet, content_words in tqdm(zip(nltk_pos, nltk_ents)):
    # we'll start by definitely appending all of the entities
    single_tweet_content = [word[0] for word in content_words if word[1] != u'PERSON']
    
    # next, add the token if it is a number
    for token in pos_tweet: 
        if token[1] == u'CD': # CD = cardinal number
            single_tweet_content.append(token[0])
    content_tweets.append(single_tweet_content)

100%|██████████| 2337/2337 [00:00<00:00, 96838.49it/s]
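The filtering rule in the cell above can be sketched on a single hand-tagged tweet. The tokens, POS tags and entity labels below are made up for illustration (the entity labels follow the `PERSON` / `LOCATION` / `ORGANIZATION` scheme of the 3-class Stanford model), and `extract_content` is a hypothetical name.

```python
def extract_content(pos_tweet, entities):
    """Keep non-PERSON entities, then append every cardinal number (CD)."""
    content = [word for word, label in entities if label != 'PERSON']
    content += [word for word, tag in pos_tweet if tag == 'CD']
    return content

# Hand-written tags for a made-up tweet: "Modi sends 10 teams to Nepal"
pos_tweet = [('Modi', 'NNP'), ('sends', 'VBZ'), ('10', 'CD'),
             ('teams', 'NNS'), ('to', 'TO'), ('Nepal', 'NNP')]
entities = [('Modi', 'PERSON'), ('Nepal', 'LOCATION')]

print(extract_content(pos_tweet, entities))
```

Note that entities come first and numbers second, so the content list does not preserve the original token order.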


In [18]:
tweet_num = 200
print ("original_tweet \n" + str(nltk_tweets[tweet_num]) 
       + "\n\ncontent_tweet\n" + str(content_tweets[tweet_num])
      )

original_tweet 
[u'MEA', u'opens', u'24', u'hour', u'Control', u'Room', u'for', u'queries', u'regarding', u'the', u'Nepal', u'Earthquake', u'.', u'\xe5\xca', u'Numbers', u':', u'+', u'91', u'11', u'2301', u'2113', u'+', u'91', u'11', u'2301', u'4104', u'+', u'91', u'11', u'2301', u'7905']

content_tweet
[u'Nepal', u'24', u'91', u'11', u'2301', u'2113', u'91', u'11', u'2301', u'4104', u'91', u'11', u'2301', u'7905']


So this has already gone some way to (crudely) isolating the interesting parts of a tweet. 

Unfortunately, NLTK doesn't calculate tf-idf score automatically. There IS a library which can do this: [textacy](https://textacy.readthedocs.io/en/latest/index.html). Note: textacy is built on SpaCy.

I care about each word's tf-idf score in the context of whole tweets, so I will compute tf-idf across the entire corpus of original (tokenized) tweets rather than over the content words alone. 

In [19]:
vectorizer = Vectorizer(weighting = 'tfidf')

To calculate the tf-idf score of all the tokens in the tweets, I can use `fit_transform()`. 

In [20]:
term_matrix = vectorizer.fit_transform(nltk_tweets)

This matrix is a document-term matrix: each row is a document (a tweet) and each column is a word. 

If the tweet in row `i` contains the word in column `j`, then the element `matrix[i][j]` holds that word's tf-idf value for that tweet. If the tweet *doesn't* contain the word, the value is zero. 
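The row/column layout can be sketched in plain Python. This toy version stores raw counts instead of tf-idf weights and uses made-up tweets; the names `vocabulary` and `matrix` are illustrative and not the vectorizer's internals.

```python
# Two made-up tokenized tweets
docs = [["nepal", "quake"], ["nepal", "prayers"]]

# Assign each new token the next free column index
vocabulary = {}
for doc in docs:
    for token in doc:
        vocabulary.setdefault(token, len(vocabulary))

# One row per tweet, one column per word, zero where the word is absent
matrix = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for token in doc:
        matrix[i][vocabulary[token]] += 1

print(vocabulary)
print(matrix)
```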

In [21]:
np_matrix = term_matrix.todense()

In [22]:
np_matrix.shape

(2337, 2847)

My ultimate goal is to create a dictionary, which maps from the tokens in the content tweets to some tf-idf score. To do this, I need to find out which tokens are at what columns in the term matrix. 

The vectorizer object has a dictionary, which maps each token to its column. 

In [23]:
for key in sorted(vectorizer.vocabulary)[700:715]:
    print key, vectorizer.vocabulary[key]

Indo-Nepal 1201
Initial 2176
Injured 97
Injuries 1607
Instructed 1942
Int'l 1691
Intensity 530
Interior 2281
International 2712
Ireland 1922
Is 1613
Islamabad 1407
It 393
It's 1147
Italian 1972


Each column (word) holds one tf-idf value per tweet that contains it. To get a single score per word, I take the maximum value in its column.

I can therefore map the value of the content tokens to their tf-idf, using the `vectorizer.vocabulary` dictionary. 

In [24]:
for token in content_tweets[500]:
    print (token, vectorizer.vocabulary[token], np.max(np_matrix[:,vectorizer.vocabulary[token]]))

(u'Kathmandu', 93, 8.3727132087249654)
(u'977 9851', 284, 11.522637736956046)
(u'1', 285, 15.873945717696863)
(u'07021', 1112, 6.0490009409298038)
(u'977 9851', 284, 11.522637736956046)
(u'1', 285, 15.873945717696863)
(u'35141', 287, 5.761318868478023)


In [25]:
tfidf_dict = {}
content_vocab = []
for tweet in content_tweets: 
    for token in tweet: 
        if token not in tfidf_dict: 
            if token in vectorizer.vocabulary:
                content_vocab.append(token)
                tfidf_dict[token] = np.max(np_matrix[:,vectorizer.vocabulary[token]])
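The same lookup can be sketched without NumPy on a tiny hand-made matrix. All values and names below are made up for illustration. A token that is missing from the vocabulary is simply skipped, mirroring the `if token in vectorizer.vocabulary` check above.

```python
# Toy stand-ins for vectorizer.vocabulary and the dense tf-idf matrix
vocabulary = {"Nepal": 0, "110": 1, "killed": 2}
matrix = [
    [0.0, 2.3, 1.1],
    [0.4, 0.0, 0.0],
    [0.4, 0.0, 1.7],
]
content_tweets = [["Nepal", "110"], ["Nepal", "Kathmandu"]]

scores = {}
for tweet in content_tweets:
    for token in tweet:
        if token in vocabulary and token not in scores:
            col = vocabulary[token]
            # a word's score is the largest value in its column
            scores[token] = max(row[col] for row in matrix)

print(scores)
```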

In [26]:
for key in sorted(tfidf_dict)[205:210]:
    print ("WORD:" + str(key) + " -- tf-idf SCORE:" +  str(tfidf_dict[key]))

WORD:AssociatedPress -- tf-idf SCORE:8.06390396147
WORD:Avalanche -- tf-idf SCORE:5.01938152375
WORD:BBC -- tf-idf SCORE:12.3842035691
WORD:BIHAR -- tf-idf SCORE:6.9652916728
WORD:Bachchan -- tf-idf SCORE:8.06390396147


Success! 

## COntent Word-based Tweet Summarization (COWTS) 
As per [Rudra et al.](http://dl.acm.org/citation.cfm?id=2806485). 

I'll be using [PyMathProg](http://pymprog.sourceforge.net/index.html) as my Integer Linear Programming solver. This is a Python interface for [GLPK](https://www.gnu.org/software/glpk/).

In [27]:
from pymprog import *

I want to maximize 
\begin{equation}
\sum_{i=1}^n x_{i} + \sum_{j = 1}^{m} Score(j) \cdot y_{j}
\end{equation}
where $x_{i}$ is 1 if I include tweet $i$ and 0 if I don't, $y_{j}$ is 1 if content word $j$ is included and 0 otherwise, and $Score(j)$ is that word's tf-idf score. 

I'm going to subject this equation to the following constraints: 

1. 
\begin{equation}
\sum_{i=1}^{n} x_{i} \cdot Length(i) \leq L
\end{equation}
I want the total length of all the selected tweets (measured in tokens) to be at most some value L, the maximum length of my summary. I can vary L depending on how long I want my summary to be. 

2. 
\begin{equation}
\sum_{i \in T_{j}} x_{i} \geq y_{j}, j = [1,...,m]
\end{equation}
If I pick some content word $y_{j}$ (out of my $m$ possible content words), then I want to have at least one tweet from the set of tweets which contain that content word, $T_{j}$. 

3. 
\begin{equation}
\sum_{j \in C_{i}} y_{j} \geq |C_{i}| \times x_{i}, i = [1,...,n]
\end{equation}
If I pick some tweet $i$ (out of my $n$ possible tweets), then all the content words in that tweet, $C_{i}$, are also selected. 
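As a sanity check on the formulation, here is a brute-force sketch on a made-up three-tweet instance. It enumerates every 0/1 assignment instead of calling an ILP solver, which is only feasible for toy sizes. The third constraint is written as "the sum of $y_{j}$ over $C_{i}$ is at least $|C_{i}| \cdot x_{i}$", which is what "all content words of a selected tweet are selected" requires. All data and names below are hypothetical.

```python
from itertools import product

# Toy instance (made up): tokenized tweets, content words and their scores
tweets = [["nepal", "quake", "100"], ["prayers"], ["quake", "kathmandu"]]
words = ["nepal", "quake", "100", "kathmandu"]
score = {"nepal": 1.0, "quake": 1.5, "100": 2.0, "kathmandu": 2.5}
L = 3  # summary length budget, in tokens

n, m = len(tweets), len(words)
length = [len(t) for t in tweets]
C = [[j for j in range(m) if words[j] in tweets[i]] for i in range(n)]  # words in tweet i
T = [[i for i in range(n) if words[j] in tweets[i]] for j in range(m)]  # tweets with word j

def feasible(x, y):
    within_budget = sum(x[i] * length[i] for i in range(n)) <= L
    words_covered = all(sum(x[i] for i in T[j]) >= y[j] for j in range(m))
    tweets_complete = all(sum(y[j] for j in C[i]) >= len(C[i]) * x[i]
                          for i in range(n))
    return within_budget and words_covered and tweets_complete

def objective(x, y):
    return sum(x) + sum(score[words[j]] * y[j] for j in range(m))

candidates = [(x, y)
              for x in product((0, 1), repeat=n)
              for y in product((0, 1), repeat=m)
              if feasible(x, y)]
best_x, best_y = max(candidates, key=lambda xy: objective(*xy))
print(best_x, best_y, objective(best_x, best_y))
```

On this instance the optimum picks the second and third tweets. The one-token "prayers" tweet is included even though it has no content words, because every selected tweet adds 1 to the objective and it fits in the remaining budget.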

In [28]:
begin('COWTS')

model('COWTS') is the default model.

In [29]:
# Defining my first variable, x 
# This defines whether or not a tweet is selected
x = var('x', len(nltk_tweets), bool)

# Check this worked
x[1000]

0 <= x[1000] <= 1 binary

In [30]:
# Also defining the second variable, which defines
# whether or not a content word is chosen
y = var('y', len(content_vocab), bool)

In [31]:
len(y), y[0]

(407, 0 <= y[0] <= 1 binary)

Now that I have defined my variables, I can define the equation I am maximizing. 

In [32]:
maximize(sum(x) + sum([tfidf_dict[content_vocab[j]]*y[j] for j in range(len(y))]));

Now, I can define my constraints. First, 
\begin{equation}
\sum_{i=1}^{n} x_{i} \cdot Length(i) \leq L
\end{equation}

In [33]:
## Maximum length of the entire tweet summary

# Was 150 for the tweet summary, 
# But generated a 1000 word summary for CONABS
L = 150

# hiding the output of this line since it's a very long sum 
sum([x[i]*len(nltk_tweets[i]) for i in range(len(x))]) <= L;

These next two constraints are slightly more tricky, as I need a way to define which content words are in which tweets. 

However, the term matrix I defined using the vectorizer has all of this information. 

I'll begin by defining two helper functions:

In [34]:
def content_words(i):
    '''Given a tweet index i (for x[i]), this method will return the indices of the words in the 
    content_vocab[] array
    Note: these indices are the same as for the y variable
    '''
    tweet = nltk_tweets[i]
    content_indices = []
    
    for token in tweet:
        if token in content_vocab:
            content_indices.append(content_vocab.index(token))
    return content_indices

In [35]:
def tweets_with_content_words(j):
    '''Given the index j of some content word (for content_vocab[j] or y[j])
    this method will return the indices of all tweets which contain this content word
    '''
    content_word = content_vocab[j]
    
    index_in_term_matrix = vectorizer.vocabulary[content_word]
    
    matrix_column = np_matrix[:, index_in_term_matrix]
    
    return np.nonzero(matrix_column)[0]
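Both helpers do a fresh search on every call (`content_vocab.index` scans the list each time). A possible alternative, sketched here on made-up data, is to precompute both mappings once with dictionaries. The names `vocab_index`, `words_in_tweet` and `tweets_with_word` are hypothetical and not part of the original notebook.

```python
from collections import defaultdict

# Toy stand-ins for content_vocab and nltk_tweets
content_vocab = ["Nepal", "110", "Kathmandu"]
tweets = [["Over", "110", "killed", "in", "Nepal"],
          ["Kathmandu", "Nepal"],
          ["prayers"]]

# word -> index in content_vocab (the same index used for y[j])
vocab_index = {word: j for j, word in enumerate(content_vocab)}

# tweet index i -> indices of the content words it contains
words_in_tweet = [[vocab_index[t] for t in tweet if t in vocab_index]
                  for tweet in tweets]

# content word index j -> indices of the tweets that contain it
tweets_with_word = defaultdict(list)
for i, indices in enumerate(words_in_tweet):
    for j in set(indices):
        tweets_with_word[j].append(i)

print(words_in_tweet)
print(dict(tweets_with_word))
```

One difference from the matrix-based helper: this inverted index is built from exact token matches, so it only agrees with the term matrix if the vectorizer keeps tokens unchanged.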

I can now define the second constraint: 
\begin{equation}
\sum_{i \in T_{j}} x_{i} \geq y_{j}, j = [1,...,m]
\end{equation}

In [36]:
for j in range(len(y)):
    sum([x[i] for i in tweets_with_content_words(j)])>= y[j]

And the third constraint:
\begin{equation}
\sum_{j \in C_{i}} y_{j} \geq |C_{i}| \times x_{i}, i = [1,...,n]
\end{equation}

In [37]:
for i in range(len(x)):
    sum(y[j] for j in content_words(i)) >= len(content_words(i))*x[i]

In [38]:
solve()

'The LP problem instance has been successfully solved. (This code\ndoes {\\it not} necessarily mean that the solver has found optimal\nsolution. It only means that the solution process was successful.) \nThe MIP problem instance has been successfully solved. (This code\ndoes {\\it not} necessarily mean that the solver has found optimal\nsolution. It only means that the solution process was successful.)'

In [39]:
result_x =  [value.primal for value in x]
result_y = [value.primal for value in y]

In [40]:
end()

model('COWTS') is not the default model.

In [41]:
chosen_tweets = np.nonzero(result_x)
chosen_words = np.nonzero(result_y)

In [42]:
len(chosen_tweets[0]), len(chosen_words[0])

(11, 67)

Let's take a look at the results! 

In [48]:
for i in chosen_tweets[0]:
    print ('--------------')
    print " ".join(nltk_tweets[i])

--------------
MEA opens 24 hour Control Room in Delhi for queries regarding the Nepal Earthquake . 011 2301 2113 011 2301 4104 011 2301 7905
--------------
: At least 150 people killed in Kathmandu , 1 in China , 10 in Pokhara , 2 at Mount Everest Base Camp and 11 in India frm eart  Û_
--------------
: USGS reports a M5 earthquake 31km NNW of Nagarkot , Nepal on 4/25 / 15 @ 9:30 : 29 UTC quake
--------------
Chitttt
--------------
Nepal's Home Ministry Says at Least 71 People Killed in the Earthquake
--------------
Raw : Powerful Earthquake Rocks Nepal AssociatedPress Associated Press news
--------------
Strong Earthquake Strikes Nepal Near Its Capital , Katmandu - New York Times
--------------
: .. Indian Embassy Helpline in Nepal + 9779851107 021
--------------
: Patan Durbar Square after earthquake
--------------
: 6 NDRF teams leave for Nepal and 5 NDRF teams leave for North Bihar . NepalEarthquake
--------------
Strong earthquake strikes Nepal - BBC News


Because the NLTK tokens are harder to sort than the SpaCy ones (for instance, I can't isolate the kind of entities I am interested in), this method is actually less successful than the SpaCy version, even though the tokenizer itself is much better. 

In [49]:
random_tweets = np.random.choice(nltk_tweets, size=11)

In [50]:
for i in random_tweets:
    print ('--------')
    print " ".join(i)

--------
: 1934 :: Earthquake in Bihar and Nepal . With 8.0 Magnitude , epicenter was located in Eastern Nepal Near Mount Everest  Û_
--------
: More than 100 killed in powerful Nepal earthquake , say government officials and police
--------
: Strong Earthquake Strikes Nepal Near Its Capital , Katmandu
--------
: prayers go out to all those affected by the earthquake ... _Ù ÷ Ó earthquake Nepal
--------
:  ÛÏ @googleindia : We  Ûªve just launched a Person Finder instance to help track missing persons for the Nepal earthquake  ÛÓ >
--------
: BREAKING : More than 100 killed in Nepal earthquake , says interior ministry -
--------
: Here are the emergency contact numbers for Nepal , share , help . Our prayers are with all in Nepal .
--------
: We've just launched a Person Finder instance to help track missing persons for the Nepal earthquake -->
--------
: People are praying to temples in Nepal , even as they rock with aftershocks according to Americans in NepalEarthquake
--------
: M

A brief comparison suggests that this method is better than random choice at providing a situational overview, although eyeballing 11 tweets is not a rigorous evaluation. 

It's worth noting that even a random sample will contain a fair amount of information, because of the selective way in which the tweets were isolated: this is already a subsample with a higher percentage of relevant information. 

This notebook is getting long, so I'm going to save these tweets (which I will continue using) and start a fresh notebook for the next steps. 

## Saving everything for a fresh notebook 

In [237]:
np.save('term_matrix_nltk.npy', np_matrix)

In [238]:
np.save('vocab_to_idx_nltk.npy', vectorizer.vocabulary)

In [239]:
np.save('content_vocab_nltk.npy', content_vocab)

In [299]:
np.save('ner_content.npy', nltk_ents)