Looking for words or phrases that occur disproportionately often in the summaries of sustained (or not sustained) protests.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
data = pd.read_csv('protests.csv')

In [3]:
data.shape

(2298, 13)

In [4]:
pd.crosstab(~data.summary.isnull(), data.outcome)

summary present,Denied,Dismissed,Sustained,Withdrawn
False,52,1394,28,399
True,350,15,60,0


In [5]:
_.sum().sum()  # Just checking that things add up...

2298

I'm not happy to be missing over 30% of the sustained cases' summaries (28 of 88), and over 12% of the denied cases' summaries (52 of 402). Only about 1% of dismissed cases (15 of 1,409) have summaries, and none of the withdrawn cases do, but that might be expected.
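
The percentages above can be reproduced directly from the crosstab counts. The following is a minimal sketch: the `toy` frame and the placeholder summary text are made up for illustration, and only the counts come from the output above. The same `groupby` line should work on the real `data` frame.

```python
import pandas as pd

# Counts copied from the crosstab above: (outcome, has_summary) -> number of cases.
counts = {
    ("Denied", False): 52, ("Denied", True): 350,
    ("Dismissed", False): 1394, ("Dismissed", True): 15,
    ("Sustained", False): 28, ("Sustained", True): 60,
    ("Withdrawn", False): 399, ("Withdrawn", True): 0,
}

# Rebuild one row per case so the example has the same shape as `data`.
# "text" is a placeholder, not a real summary.
rows = [
    {"outcome": outcome, "summary": "text" if has_summary else None}
    for (outcome, has_summary), n in counts.items()
    for _ in range(n)
]
toy = pd.DataFrame(rows)

# Share of cases with a missing summary, per outcome.
missing_rate = toy["summary"].isnull().groupby(toy["outcome"]).mean()
print(missing_rate.round(3))
```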

Is something going wrong?

Anyway, let's try with the data we have, trying to find what differentiates "Sustained" summaries from "Denied" or "Dismissed" ones.

In [6]:
data = data[data.outcome.isin(['Denied', 'Sustained', 'Dismissed']) & ~data.summary.isnull()]

In [7]:
data.shape

(425, 13)

In [8]:
def hot_terms(texts, sustained, vectorizer):
    # Binary document-term matrix: True where a term occurs in a summary.
    # get_feature_names_out() requires scikit-learn >= 1.0.
    X = vectorizer.fit_transform(texts)
    X = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
    X = 0 < X
    # Re-index the boolean mask to match X without mutating the caller's Series.
    sustained = pd.Series(sustained.to_numpy(), index=X.index)
    # Number and share of summaries in each group that contain each term.
    sustained_n = X[sustained].sum()
    not_sustained_n = X[~sustained].sum()
    sustained_prop = X[sustained].mean()
    not_sustained_prop = X[~sustained].mean()
    # Ratio of shares; the 1/group-size term keeps the denominator above zero.
    more_sustained = sustained_prop / (not_sustained_prop + 1.0 / sustained.sum())
    more_not_sustained = not_sustained_prop / (sustained_prop + 1.0 / (~sustained).sum())
    # Columns appear in insertion order.
    results = pd.DataFrame({
        'more_sustained': more_sustained,
        'sustained_n (of {})'.format(sustained.sum()): sustained_n,
        'sustained_prop': sustained_prop,
        'not_sustained_prop': not_sustained_prop,
        'not_sustained_n (of {})'.format((~sustained).sum()): not_sustained_n,
        'more_not_sustained': more_not_sustained,
    })
    return results

In [9]:
words = hot_terms(data.summary, data.outcome == 'Sustained', CountVectorizer())

In [10]:
words.sort_values('more_sustained', ascending=False).head(6)

term,more_sustained,sustained_n (of 60),sustained_prop,not_sustained_prop,not_sustained_n (of 365),more_not_sustained
sustain,60.0,60,1.0,0.0,0,0.0
acquire,7.525773,10,0.166667,0.005479,2,0.032345
tenica,7.0,7,0.116667,0.0,0,0.0
tasa,7.0,7,0.116667,0.0,0,0.0
tek,7.0,7,0.116667,0.0,0,0.0
metis,7.0,7,0.116667,0.0,0,0.0
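
To see where these scores come from, here is a small worked example of the smoothed ratio computed in `hot_terms`, using counts from the table above. The helper name `smoothed_ratio` is hypothetical and exists only for this illustration. One consequence worth noticing: when a term never occurs in the other group, the score reduces to the raw count in this group, which is why "tenica" scores exactly 7.0 and "sustain" exactly 60.0.

```python
def smoothed_ratio(n_in, total_in, n_out, total_out):
    """Share of summaries containing a term in one group, divided by the
    smoothed share in the other group (hypothetical helper, mirrors hot_terms)."""
    prop_in = n_in / total_in
    prop_out = n_out / total_out
    # Adding 1/total_in keeps the denominator above zero when n_out == 0.
    return prop_in / (prop_out + 1.0 / total_in)

# Counts taken from the more_sustained table: 60 sustained, 365 not sustained.
acquire = smoothed_ratio(10, 60, 2, 365)
sustain = smoothed_ratio(60, 60, 0, 365)
tenica = smoothed_ratio(7, 60, 0, 365)

print(round(acquire, 6), round(sustain, 6), round(tenica, 6))
```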


In [11]:
words.sort_values('more_not_sustained', ascending=False).head(6)

term,more_sustained,sustained_n (of 60),sustained_prop,not_sustained_prop,not_sustained_n (of 365),more_not_sustained
dismiss,0,0,0,0.115068,42,42
national,0,0,0,0.084932,31,31
resulting,0,0,0,0.076712,28,28
contracting,0,0,0,0.073973,27,27
dla,0,0,0,0.071233,26,26
pursuant,0,0,0,0.071233,26,26


Is it alarming that none of the 31 summaries containing the word "national" were sustained? I don't know. The above is not very interesting to me.
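
One rough way to put a number on that question is to ask how likely it would be, if outcomes were unrelated to the word, that none of the 31 "national" summaries landed among the 60 sustained ones. The sketch below computes that hypergeometric probability with the standard library only. It is an assumption-laden check, not part of the original analysis: it treats summaries as exchangeable, and because thousands of terms are being compared at once, a few terms with small probabilities are expected by chance alone.

```python
from fractions import Fraction
from math import comb

total = 425      # summaries kept after filtering
sustained = 60   # sustained summaries
with_term = 31   # summaries containing "national", all not sustained

# Probability that a random choice of 60 "sustained" summaries out of 425
# avoids all 31 summaries that contain the term.
p_zero = Fraction(comb(total - with_term, sustained), comb(total, sustained))
print(float(p_zero))
```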

In [12]:
pairs = hot_terms(data.summary, data.outcome == 'Sustained', CountVectorizer(ngram_range=(2, 2)))

In [13]:
pairs.sort_values('more_sustained', ascending=False).head(6)

term,more_sustained,sustained_n (of 60),sustained_prop,not_sustained_prop,not_sustained_n (of 365),more_not_sustained
we sustain,58.0,58,0.966667,0.0,0,0.0
sustain the,58.0,58,0.966667,0.0,0,0.0
and deny,10.256198,17,0.283333,0.010959,4,0.038308
tennessee and,9.0,9,0.15,0.0,0,0.0
deny them,9.0,9,0.15,0.0,0,0.0
misevaluated proposals,7.525773,10,0.166667,0.005479,2,0.032345


In [14]:
pairs.sort_values('more_not_sustained', ascending=False).head(6)

term,more_sustained,sustained_n (of 60),sustained_prop,not_sustained_prop,not_sustained_n (of 365),more_not_sustained
the defense,0.0,0,0.0,0.09589,35,35.0
of proposals,0.0,0,0.0,0.090411,33,33.0
to request,0.0,0,0.0,0.082192,30,30.0
defense logistics,0.0,0,0.0,0.073973,27,27.0
and dismiss,0.0,0,0.0,0.073973,27,27.0
we deny,0.034655,2,0.033333,0.945205,345,26.202532


Maybe that's interesting?

In [15]:
triples = hot_terms(data.summary, data.outcome == 'Sustained', CountVectorizer(ngram_range=(3, 3)))

In [16]:
triples.sort_values('more_sustained', ascending=False).head(6)

term,more_sustained,sustained_n (of 60),sustained_prop,not_sustained_prop,not_sustained_n (of 365),more_not_sustained
we sustain the,58.0,58,0.966667,0.0,0,0.0
sustain the protest,37.0,37,0.616667,0.0,0,0.0
sustain the protests,21.0,21,0.35,0.0,0,0.0
decision we sustain,14.0,14,0.233333,0.0,0,0.0
part and deny,10.256198,17,0.283333,0.010959,4,0.038308
and deny them,9.0,9,0.15,0.0,0,0.0


In [17]:
triples.sort_values('more_not_sustained', ascending=False).head(6)

term,more_sustained,sustained_n (of 60),sustained_prop,not_sustained_prop,not_sustained_n (of 365),more_not_sustained
deny the protests,0,0,0,0.136986,50,50
decision we deny,0,0,0,0.117808,43,43
evaluation of proposals,0,0,0,0.087671,32,32
to request for,0,0,0,0.082192,30,30
part and dismiss,0,0,0,0.073973,27,27
defense logistics agency,0,0,0,0.071233,26,26


I haven't looked through these much; maybe there's something interesting to be seen?

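My main observation is that, beyond the phrases specifically about sustaining or denying, the terms that appear "disproportionately" occur in only a small share of an already small number of summaries: 7 to 10 of the 60 sustained summaries, or roughly 7% to 12% of the 365 that were not sustained.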
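
To make that observation concrete, the sketch below takes a few rows from the single-word tables and computes how many of the 425 summaries each term appears in. The column names `total_n` and `share_of_all` and the 5% cutoff are arbitrary choices for illustration, not something the analysis above used.

```python
import pandas as pd

# Counts copied from the single-word tables above.
rows = pd.DataFrame(
    {
        "sustained_n": [10, 7, 0, 0],
        "not_sustained_n": [2, 0, 42, 31],
    },
    index=["acquire", "tenica", "dismiss", "national"],
)
n_summaries = 425  # 60 sustained + 365 not sustained

# How many summaries contain each term overall, and what share that is.
rows["total_n"] = rows["sustained_n"] + rows["not_sustained_n"]
rows["share_of_all"] = rows["total_n"] / n_summaries

# Hypothetical cutoff: keep only terms found in at least 5% of all summaries.
min_share = 0.05
kept = rows[rows["share_of_all"] >= min_share]
print(kept)
```

With these four terms, only "dismiss" and "national" clear the cutoff; the terms that favour sustained protests each appear in under 3% of summaries.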