# Classifying Human Rights Tweets with Vowpal Wabbit

I helped out on a project in which researchers wanted to map the sources of tweets about human rights in the United States. The training data, which was provided to me, came from posts made by human rights activists and organizations on Twitter. I used the training/test sets and the unlabeled tweets as provided, without altering or supplementing them. The following is an attempt to get at the problem using [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki) (VW) version 8.1.1. I had a few issues installing the [Python package](https://pypi.python.org/pypi/vowpalwabbit) via `pip`, so I installed the command-line version from the default Ubuntu repos and used this as an opportunity to become more familiar with Bash scripting.

In a previous attempt, I used Facebook's [fastText](https://fasttext.cc/) to classify the tweets. Since the default training/test data format that fastText accepts is similar to VW's, I will lightly modify that data into VW's preferred format. The added benefit is that the fastText data has already been preprocessed: all words are lower-cased, and all URLs and punctuation are removed.

## Cleaning the Data

In [1]:
labeled=("train" "valid" "test")
for type in "${labeled[@]}";
do
    echo Number of $type examples: $(wc -l < fast$type.txt)
    echo Number of human rights tweets in $type set: $(grep -c "__label__hr" fast$type.txt)
done

Number of train examples: 99035
Number of human rights tweets in train set: 34893
Number of valid examples: 24808
Number of human rights tweets in valid set: 8771
Number of test examples: 31114
Number of human rights tweets in test set: 11018


As can be seen, I was provided with roughly 155,000 labeled examples, with non-human rights tweets outnumbering human rights tweets by a little under 2:1.
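
To make the class balance concrete, the counts printed above can be turned into shares and ratios. This is a small sketch that uses only the numbers from the output; the `report` function name is just for illustration.

```bash
# Share of human rights tweets per split, from the counts printed above.
report() {
    # $1 = split name, $2 = total examples, $3 = human rights examples
    awk -v name="$1" -v total="$2" -v hr="$3" 'BEGIN {
        printf "%s: %.1f%% human rights, %.2f non-hr per hr tweet\n",
               name, 100 * hr / total, (total - hr) / hr
    }'
}

report train 99035 34893
report valid 24808 8771
report test  31114 11018
```

The three splits should come out with very similar proportions, which suggests the split itself did not skew the class balance.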

This is what the data looks like (warning, some tweets may be explicit...)

In [2]:
grep  '^__label__hr.*' fasttrain.txt | shuf -n 15
grep  '^__label__nonhr.*' fasttrain.txt | shuf -n 15

__label__hr get the real facts on medical malpractice 2015 update   
__label__hr meet flavia shes on a mission to restore peoples right to clean water  
__label__hr usa senates resistance to reupping patriotact shows momentum against mass surveillance   
__label__hr were celebrating 10 years of promoting amp protecting human rights in afghanistan learn more about our work  
__label__hr majorityspeaks youve been quoted in our storify mymadre link roundup raising our voices and creating solutions  
__label__hr in palestine dignity and violence madre friend noam chomsky on the continuing crisis  
__label__hr rt richardengel un over 191000 killed in syria conflict 
__label__hr rt bbcbreaking nelson mandela south africas first black president dies aged 95  
__label__hr china opens fire on peaceful tibetan protesters injuring 10 sad un navi pillays parting speech ignored china  
__label__hr the 2014 international labor rights defenders askgeorge cwaunion cgt at ilrf awards what an amazing ni

Note that the first part of every line is the label in "\_\_label\_\_" format followed by the actual text, which is the default format fastText accepts. VW's default format is similar. The first part of every line is "+1 |" for a positive example and "-1 |" for a negative example. Let's change the format to one that VW accepts and make some additional changes.

Also notice that some of these posts are not necessarily related to human rights, even though they are labeled as such. For example, the tweet "majorityspeaks youve been quoted in our storify mymadre link roundup raising our voices and creating solutions " by itself would probably not be thought of as being about human rights.

In [3]:
cat vw/vw_bestparams

PRF: 0.96789
weight: 2
--passes 10 --loss_function logistic --ngram w1 --skips w2 --learning_rate .80000000000000000000


Prior to running this notebook, I ran random search over a set of parameters, including label weights, that I will describe in more detail later. I bring it up now because the version of VW I use does not accept the label weight as a command-line option; the weight must instead be included in the training data itself.

In [4]:
while read line; do
    if [[ $line == weight* ]]; then
        weight=$(echo $line | grep -o -P '(?<=weight: )[0-9]+';)
    fi
done < vw/vw_bestparams

cat fasttrain.txt | awk '{sub("__label__hr", "+1 '$weight'|w"); sub("__label__nonhr", "-1 |w"); print $0}' | shuf > vw/traintmpW
cat vw/traintmpW | sed 's:.*|w::'| awk '$0="|l len:"length($0)' > vw/traintmpL
paste vw/traintmpW vw/traintmpL > vw/vwtrain.txt
rm vw/traintmp*

head vw/vwtrain.txt

+1 2|w ff cleanclothes stitchistas usleap 10campaign globalexchange 	|l len:62
-1 |w sexualglf whats a boyfriend if hes not your bestfriend 	|l len:56
+1 2|w check out the awesome things our partners are up to this week un  unpfii 	|l len:74
+1 2|w ff data2x which is launching in a big way today 	|l len:49
-1 |w please dont leave your sense in september god bless you as you comply 	|l len:71
+1 2|w with 1200 deaths and no end in sight please help our partners provide aid and critical care to civilians in gaza  	|l len:115
-1 |w alexxxisann yeah too bad kanyes a douchebag  	|l len:46
-1 |w same old sgit just a different day 	|l len:36
+1 2|w rt ajam after four years of syrias war no end in sight  	|l len:57
-1 |w korinichole15 last name nigga first name thug im just a g like that 	|l len:69


We first read in the "best" weight from our random search. VW allows users to assign importance weights to individual training examples. This is generally useful when the training set is unbalanced between classes, which ours is. I simply applied a uniform weight (greater than 1) to all examples of human rights tweets, since they are the smaller class. The weight of 2 was found via random search, meaning that tweets labeled as human rights-related are considered 2 times as important (since we keep our non-human rights tweets at weight 1).

We then use awk to replace the fastText labels with labels that VW understands, e.g. "\_\_label\_\_nonhr" becomes "-1 |w ". In the VW label format for binary classification, positive labels are "+1" and negative labels are "-1", with a pipe | separating the label from the actual features. We add a "w" after the pipe to tell VW that the actual text belongs to the "w" namespace. In VW, features can be split into multiple namespaces, and different actions can be applied to different namespaces. We will add another namespace in the next step.

In the other namespace, which we call "l", we add the length of the tweet (after it has been preprocessed to remove links and punctuation) as another feature for our classifier. The guess would be that perhaps human rights tweets are less likely to be very short. To get the length, we first use sed to delete the label and then use awk to count the length of the post, saving the output into a separate file. We then paste our "w" and "l" namespaces together, deleting the temporary files we took them from.
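
As an aside, the same two namespaces could probably be produced in a single awk pass, without temporary files. The following is only a sketch under that assumption: the `to_vw` function is hypothetical, the input lines are toy examples rather than real data, and the output is not shuffled.

```bash
# Convert fastText-style lines to VW format in one awk pass (sketch).
# Assumes every line starts with __label__hr or __label__nonhr.
to_vw() {
    # $1 = importance weight for positive examples ("" for none)
    awk -v w="$1" '{
        label = ($1 == "__label__hr") ? ("+1 " w) : "-1 "
        text = $0
        sub(/^__label__[a-z]+/, "", text)   # keeps the leading space, like the sed step
        printf "%s|w%s\t|l len:%d\n", label, text, length(text)
    }'
}

# Toy input, not taken from the real data set.
printf '%s\n' '__label__hr protect human rights' '__label__nonhr lol ok' | to_vw 2
```

Passing an empty weight (`to_vw ""`) yields plain "+1 |w" labels, which is the shape used for the validation and test sets. Piping the result through `shuf` would restore the shuffling.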

The data is shuffled because VW learns in an online fashion, one example at a time (the version I'm using does not seem to accept the minibatch option). The data was originally organized with all the positive examples first and the negative examples afterwards, which would have hampered online learning.

We format the validation and test sets together, since they do not need a weight.

In [5]:
files=("valid" "test")
for data in "${files[@]}";
do
    cat fast${data}.txt | awk '{sub("__label__hr", "+1 |w"); sub("__label__nonhr", "-1 |w"); print $0}'| shuf > vw/${data}tmpW
    cat vw/${data}tmpW | sed 's:.*|w::'| awk '$0="|l len:"length($0)' > vw/${data}tmpL
    paste vw/${data}tmpW vw/${data}tmpL > vw/vw${data}.txt
    rm vw/${data}tmp*
done

## Training

I ran random search using the script `vw_rsearch.sh`:

In [None]:
./vw_rsearch.sh 100 false false vw/vwmodel vw/vwresults

This tells the script to run random search for 100 iterations, not to force L1 regularization, not to supplement the labeled data (more on this later), what to name the model file, and what to name the random search results file. It's not run in the notebook because, while VW is fast, it would still take a while.

The `bestresult.sh` script goes through the output of random search (`vw/vwresults`) and finds the entry that maximizes the metric of interest (`PRF`, or the F1 score), and writes it out to a file (`vw/vw_bestparams`). Metrics were found with the [perf](http://osmot.cs.cornell.edu/kddcup/software.html) software.
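
The selection step might look roughly like the sketch below. It is not the actual `bestresult.sh`: it assumes (hypothetically) that the results file stores three-line records shaped like `vw/vw_bestparams`, and it ignores the script's second argument. The `best_record` name and the toy records are made up for illustration.

```bash
# Pick the record with the highest value of a metric (sketch).
# ASSUMPTION: the results file holds three-line records shaped like
# vw/vw_bestparams: "<metric>: <score>", "weight: <n>", "<vw options>".
best_record() {
    # $1 = metric name (e.g. PRF), $2 = results file
    awk -v metric="$1" '
        $1 == (metric ":") {
            score = $2 + 0
            header = $0
            if ((getline wline) <= 0 || (getline pline) <= 0) exit
            if (!found || score > best) {
                found = 1
                best = score
                record = header "\n" wline "\n" pline
            }
        }
        END { if (found) print record }
    ' "$2"
}

# Toy results file with two made-up records.
results=$(mktemp)
printf '%s\n' \
    'PRF: 0.91' 'weight: 4' '--passes 5 --loss_function hinge' \
    'PRF: 0.95' 'weight: 2' '--passes 10 --loss_function logistic' > "$results"
best_record PRF "$results"
rm -f "$results"
```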

In [6]:
./bestresult.sh PRF 1 vw/vwresults vw/vw_bestparams
cat vw/vw_bestparams

PRF: 0.96789
weight: 2
--passes 10 --loss_function logistic --ngram w1 --skips w2 --learning_rate .80000000000000000000


The `--passes` option is the number of epochs, i.e. how many times VW will go over the data. While it's currently set to 10, VW engages in early stopping by default, so it may not actually go through all 10 passes. Since this is a binary classification problem, I conducted random search with either logistic or hinge loss, with the "best results" coming from logistic loss. Random search also determined that it was optimal to use unigrams rather than bigrams, trigrams, and so on. It's possible that for the "average" human rights tweet, surrounding word context doesn't actually matter; what determines human rights content is the presence of just a few keywords. When using n-grams, we can also have the n-grams skip over a number of words rather than only including adjacent words. However, since I am using unigrams, this option does nothing. Although none made it into the "best parameters", random search also covered specifications with interactions between the "w" (the words) and "l" (length of tweet) namespaces as features.

We see that the model with the highest F1 score is actually quite similar to VW's defaults. The only things that change are the number of passes through the data, the loss function, and the learning rate. There is no regularization, no higher-order n-grams, and no use of interactions.
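
To illustrate what higher-order n-grams would have added, here is a toy sketch that prints the n-grams of a line of text. VW builds such features internally when `--ngram` is set, so this is purely an illustration; the `ngrams` function and the example sentence are made up.

```bash
# Print the n-grams of each input line, one per line (illustration only).
ngrams() {
    # $1 = n (number of adjacent words per feature)
    awk -v n="$1" '{
        for (i = 1; i + n - 1 <= NF; i++) {
            gram = $i
            for (j = 1; j < n; j++) gram = gram " " $(i + j)
            print gram
        }
    }'
}

echo "torture is a human rights violation" | ngrams 2
```

With `ngrams 1` the output is just the individual words, which is all the selected model uses.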

In [7]:
while read line; do
    if [[ $line == --* ]]; then
        params=$line
    fi
done < vw/vw_bestparams
echo $params
vw --binary vw/vwtrain.txt -c -k -f vw/vw.model -b 24 $params

--passes 10 --loss_function logistic --ngram w1 --skips w2 --learning_rate .80000000000000000000
Generating 1-grams for w namespaces.
Generating 2-skips for w namespaces.
final_regressor = vw/vw.model
Num weight bits = 24
learning rate = 0.8
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = vw/vwtrain.txt.cache
Reading datafile = vw/vwtrain.txt
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0   1.0000  -1.0000        8
1.000000 1.000000            2            2.0  -1.0000   1.0000       11
0.500000 0.000000            4            4.0   1.0000   1.0000       12
0.500000 0.500000            8            8.0  -1.0000  -1.0000        9
0.500000 0.500000           16           16.0  -1.0000   1.0000        7
0.375000 0.250000           32           32.0   1.0000  -1.0000       24
0.265625 0.156250           6

We took the "best parameters" and trained a new model using those parameters. It would have been more efficient to combine the performance evaluation with random search so that the "best" VW model was saved directly, without having to retrain a model with the best parameters found. However, since VW is so fast, I just did it this way.

By default, VW will generate its own holdout set from the training data you give it and evaluate on that. We see that this classifier achieved around 97% accuracy on that set. What about on labeled data it hasn't seen before?

## Evaluation

In [8]:
vw --binary -t -i vw/vw.model -r vw/vwvalid_rawpred.txt vw/vwvalid.txt

Generating 1-grams for w namespaces.
Generating 2-skips for w namespaces.
only testing
raw predictions = vw/vwvalid_rawpred.txt
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = vw/vwvalid.txt
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0   1.0000   1.0000       16
0.000000 0.000000            2            2.0   1.0000   1.0000       13
0.000000 0.000000            4            4.0  -1.0000  -1.0000        4
0.000000 0.000000            8            8.0  -1.0000  -1.0000        4
0.000000 0.000000           16           16.0   1.0000   1.0000       13
0.000000 0.000000           32           32.0  -1.0000  -1.0000        3
0.015625 0.031250           64           64.0  -1.0000  -1.0000       11
0.023438 0.031250          128          128.0  -1.0000   1.0000       21
0.023438

It actually does slightly better on the unseen data, though not by much, still getting around 97% accuracy. What about other measures?

In [9]:
cut -d' ' -f1 vw/vwvalid.txt | paste - vw/vwvalid_rawpred.txt | perf.linux/perf -t 0 -PRE -REC -PRF -ACC

ACC    0.97714   pred_thresh  0.000000
PRE    0.97731   pred_thresh  0.000000
REC    0.95759   pred_thresh  0.000000
PRF    0.96735   pred_thresh  0.000000


Here we use the [perf](http://osmot.cs.cornell.edu/kddcup/software.html) software to calculate our metrics. The precision (PRE) is high and the recall (REC) is slightly lower. 
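
For readers without perf installed, the same four metrics can be approximated with awk. This sketch assumes the input is "label raw score" pairs and that a raw score above the threshold counts as a positive prediction, which is my reading of the `-t 0` call above rather than a description of perf's internals. The `metrics` function and the toy scores are hypothetical.

```bash
# Compute accuracy, precision, recall and F1 from "label raw_score" pairs,
# predicting positive when the raw score is above the threshold (sketch).
metrics() {
    # $1 = threshold; stdin = lines of "<+1|-1> <raw score>"
    awk -v t="$1" '{
        pred  = ($2 > t) ? 1 : -1
        truth = ($1 > 0) ? 1 : -1
        if (pred == 1 && truth == 1) tp++
        else if (pred == 1) fp++
        else if (truth == 1) fn++
        else tn++
    }
    END {
        if (NR == 0) exit
        pre = (tp + fp) ? tp / (tp + fp) : 0
        rec = (tp + fn) ? tp / (tp + fn) : 0
        f1  = (pre + rec) ? 2 * pre * rec / (pre + rec) : 0
        printf "ACC %.5f\nPRE %.5f\nREC %.5f\nPRF %.5f\n", (tp + tn) / NR, pre, rec, f1
    }'
}

# Toy labels and raw scores, not real model output.
printf '%s\n' '+1 2.3' '+1 -0.4' '-1 -1.7' '-1 0.2' '+1 1.1' | metrics 0
```

On the real files, the `cut ... | paste ...` pipeline from the cell above could be piped into `metrics 0` instead of perf.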

The unlabeled text was provided to me as a .csv file with other tweet metadata, so I need to extract just the text and apply the same lower-casing, URL removal, and punctuation removal steps to it.

In [10]:
cat first200k.csv | awk -F "\"*,\"*" '{print $12}' | sed 's/http[^ ]*//g' | tr -d '[:punct:]' | awk '{print "|w "tolower($0)}' | awk '{print($0)" |l len:" length($0)-2}' > vw/vwfirst200k.txt
vw --binary -t -i vw/vw.model -p vw/vwunlabeled_pred.txt vw/vwfirst200k.txt
paste vw/vwunlabeled_pred.txt vw/vwfirst200k.txt > vw/vwlabeled200k.txt

Generating 1-grams for w namespaces.
Generating 2-skips for w namespaces.
only testing
predictions = vw/vwunlabeled_pred.txt
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = vw/vwfirst200k.txt
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0  unknown  -1.0000        3
0.000000 0.000000            2            2.0  unknown  -1.0000        9
0.000000 0.000000            4            4.0  unknown  -1.0000        7
0.000000 0.000000            8            8.0  unknown  -1.0000       13
0.000000 0.000000           16           16.0  unknown  -1.0000       14
0.000000 0.000000           32           32.0  unknown  -1.0000        5
0.000000 0.000000           64           64.0  unknown  -1.0000       21
0.000000 0.000000          128          128.0  unknown  -1.0000        5
0.00000

Let's look through some of the predictions to see what the classifier did.

In [11]:
echo Number of predicted human rights tweets: $(grep '^1.*' vw/vwlabeled200k.txt | wc -l)
echo Number of total tweets: $(wc -l < vw/vwlabeled200k.txt)
printf "\n"
echo Predicted as human rights
grep '^1.*' vw/vwlabeled200k.txt | shuf -n 20
printf "\n"
echo Predicted as not human rights
grep '^-1.*' vw/vwlabeled200k.txt | shuf -n 20

Number of predicted human rights tweets: 1617
Number of total tweets: 211180

Predicted as human rights
1	|w new faces of country thanks 93q  stafford centre  |l len:50
1	|w lopezemmanuel47 rollingfermin aguilascibaenas aguilapica luichysanchez no es en ninguna manera un insulto  barbarazo es baño de pueblo |l len:135
1	|w cc allahpundit rt neiltyson cfchabris thanks sure |l len:50
1	|w facts only rt lolofbabyy “uptownraised selling drugs gt going to college”😒 |l len:75
1	|w happy national podcast day nowgolistentomyadvertisers |l len:54
1	|w gun confiscated at bwi airport a pennsylvania man was arrested by maryland transportation authority police monday…  |l len:116
1	|w justicemercyinternational annual gala learning about what this great org does for people living in…  |l len:101
1	|w we are not for sale coreybbrooks on support for jimoberweis brucerauner  |l len:73
1	|w khamotabanerjee bibekdebroy tathagata2 dds ans to sikkito peepool over momwaste bengal will send 10 sputniks to pu

Looking through this small sample, it seems to generate a lot of false positives. After reading other tweets labeled as positive, it seems unlikely that the validation set precision was achieved here. It predicts that around 0.77% of the tweets (1,617 of 211,180) are about human rights, which sounds plausible. What features did the classifier pick up on?

In [12]:
vw/vw-varinfo2 vw/vwtrain.txt -f vw/vwfeat.model -b 24 --loss_function logistic --learning_rate .8 > vw/features

This uses the Python script [vw-varinfo2](https://github.com/arielf/weight-loss/blob/master/vw-varinfo2) to map the VW features back to interpretable words. I write out the parameters instead of feeding it a variable because it seems to have some trouble interpreting the `--passes` argument, so I just left it out.

In [13]:
head -50 vw/features
tail -50 vw/features

FeatureName     	   HashVal	MinVal	MaxVal	Weight	RelScore
Constant        	  11650396	1.00	1.00	-2.65	 -78.74
w^shit          	  13240385	1.00	1.00	-2.62	 -77.82
w^lol           	  11494464	1.00	1.00	-2.58	 -76.55
w^fuck          	  14898728	1.00	1.00	-2.44	 -72.50
w^fucking       	  15120391	1.00	1.00	-2.07	 -61.51
w^lmao          	  11217848	1.00	1.00	-2.02	 -59.84
w^idk           	  11175912	1.00	1.00	-1.90	 -56.45
w^ass           	   7056324	1.00	1.00	-1.84	 -54.59
w^funny         	   2739481	1.00	1.00	-1.83	 -54.34
w^damn          	  16768282	1.00	1.00	-1.79	 -53.14
w^bae           	   2060971	1.00	1.00	-1.78	 -52.96
w^bro           	   1718365	1.00	1.00	-1.78	 -52.84
w^game          	   3665051	1.00	1.00	-1.78	 -52.82
w^im            	  12491456	1.00	1.00	-1.76	 -52.10
w^oomf          	   8191242	1.00	1.00	-1.75	 -51.93
w^donniewahlberg	   5160006	1.00	1.00	-1.74	 -51.79
w^royals        	  15825115	1.00	1.00	-1.74	 -51.72
w^annoying      	   4399400	1.00	1.00	-1.74	 -51.50
w^rn  

Unsurprisingly, words with the most negative weights are often either swear words or Internet slang. For positive words, many make sense, including the actual word humanrights (a hashtag originally), justice, torture, etc. The classifier also seems to have picked up on some countries where there are perhaps more human rights violations, such as Afghanistan, Burma, Iran, etc. Retweets (rt) and sharing stories via an app (via) seem to be much more associated with human rights tweets than non-human rights tweets. Some of the positive word weights seem a bit off, like wisdomwednesday.

In [14]:
grep '.*wisdomwednesday.*' vw/vwtrain.txt | head -5

+1 2|w wisdomwednesday rt thewjp democracy freedom amp justice dont just happen we must strive for them through action  chen guangcheng 	|l len:130
+1 2|w go into the world and do well but more importantly go into the world and do good  minor myers jr wisdomwednesday 	|l len:114
+1 2|w wisdomwednesday rt half a society that fails to protect the rights of women is not a free society  laura bush  	|l len:112
+1 2|w wisdomwednesday  	|l len:18
+1 2|w wisdomwednesday there never will be complete equality until women themselves help to make laws and elect lawmakers  susan b anthony 	|l len:133
grep: write error: Broken pipe


It turns out that one of the human rights accounts used for the training set really enjoyed the hashtag #wisdomwednesday. Looking through the tweets, there seems to be a bigger issue with false positives than with false negatives. There are too many tweets predicted as non-human rights to look through, though. Let's instead look at some of the labeled validation tweets that the classifier got wrong.

To be honest, I kind of got sick of using Bash/awk/sed/etc. stuff for everything by this point, so I just wrote a short Python script to do this.

In [15]:
./compare.py -h

usage: compare.py [-h] [-r RAWPRED] [-t TEST] [-o OUTPUT]

This script takes a file with raw predictions and a labeled set in VW format
and outputs a file that contains the misclassified posts. The format is [pred]
| [true label] | [rest of post]

optional arguments:
  -h, --help            show this help message and exit
  -r RAWPRED, --rawpred RAWPRED
                        File with the raw predictions
  -t TEST, --test TEST  Labeled test set file
  -o OUTPUT, --output OUTPUT
                        Where to save mislabeled tweets


In [16]:
./compare.py -r vw/vwvalid_rawpred.txt -t vw/vwvalid.txt -o vw/vwvalid_mislabeled.txt

In [17]:
echo "Number of false positives: $(grep '^+1.*' vw/vwvalid_mislabeled.txt | wc -l)"
echo "Number of false negatives: $(grep '^-1.*' vw/vwvalid_mislabeled.txt | wc -l)"
printf "\n"
echo "False positives:"
grep "^+1 |" vw/vwvalid_mislabeled.txt | shuf -n 15

Number of false positives: 195
Number of false negatives: 372

False positives:
+1 | -1 |w after 16 hours qantas a380 from sydney is on the ground  dallasfort worth international airport  	|l len:98
+1 | -1 |w notre dame students there is still time to apply for a fall break trip to guatemala or el salvador with handsorg  	|l len:115
+1 | -1 |w gilcedillocd1 hi cm cedillo would u consider writing a support letter for ethnic studies as grad req in la schools  	|l len:117
+1 | -1 |w mfw wandows support hasnt replied to any of my tweets in months  	|l len:66
+1 | -1 |w did you hear about our new holidayhours we are now open 6 days a week tuessat 10a6p and now sun 12p6p lodi newhours awesomesauce 	|l len:130
+1 | -1 |w thanks for covering the windows 10 event at our space for usa today nansanfran  would love to have you back anytime 	|l len:117
+1 | -1 |w to achieve victory we must mass our forces at the hub of all power amp movement the enemys center of gravity  carl von clausewitz 	|l le

The above are tweets mistakenly predicted as being related to human rights when their true label was not related to human rights. One pattern the classifier picks up on is tweets that thank other people.

In [4]:
grep "^+1 .*thank.*" vw/vwtrain.txt | shuf -n 15

+1 2|w conniedineen thanks so much for your support connie 	|l len:53
+1 2|w rt unrightswire thank you for following dgd2014  the discussion will resume this afternoon at 3pm cet watch live on  	|l len:118
+1 2|w thank you senrandpaul for supporting meaningful limits to the isil force authorization consistent with our natl security amp rule of law 	|l len:138
+1 2|w frenchplums thanks so much for your support paula 	|l len:51
+1 2|w check out this great idlonews video  thanks for including our morocco footage a2j womensrights 	|l len:96
+1 2|w rt lionshalom thank you hillelneuer you are a legend  they could not hide from the facts   	|l len:92
+1 2|w fhassan15 thanks for your support 	|l len:35
+1 2|w rt ijdh thank you to theccr for supporting ijdhs cholera case appeal great article on that here  	|l len:98
+1 2|w rt atomicalandy thanks 530000 people sign to demand naturalfruit drop criminal cases against me  ilrf tucglobal walkfree laboursta 	|l len:132
+1 2|w thanks for the rts dwatc

As can be seen, one of the artifacts of the way the human rights tweets were collected is that many of the posts now labeled as human rights essentially just say "thanks for your support" and don't really have any other words that would indicate being about human rights.

In [6]:
grep "^-1 |" vw/vwvalid_mislabeled.txt | shuf -n 15

-1 | +1 |w its impossible to not have fun playing in a river riversuniteus  	|l len:66
-1 | +1 |w martinpradel please follow us 	|l len:31
-1 | +1 |w rt spmizner mgmudel has it right  beautiful moving and true  nomodernasylum   	|l len:79
-1 | +1 |w class challenges uber fees at lax   	|l len:37
-1 | +1 |w  	|l len:2
-1 | +1 |w when i looked around me i had lost my friends my family my dignity my freedom amp my religion marinanemat cff 	|l len:111
-1 | +1 |w the average score for the congressreportcard is 15 is that like an f fail   	|l len:77
-1 | +1 |w overuse safety questions cloud advairs ascent to asthma blockbuster  	|l len:70
-1 | +1 |w  	|l len:2
-1 | +1 |w jadaliyya link is broken 	|l len:26
-1 | +1 |w ferrarogiuliano on est daccord 	|l len:32
-1 | +1 |w nypalin16 we do see  	|l len:22
-1 | +1 |w man sues walmart over gascan blast   	|l len:38
-1 | +1 |w a few days b4 campaore was ousted ericachenoweth predicted as much based on her research of nonviolent movements  	|l len:11

These are the "true" human rights tweets that were "mistakenly" predicted as not being about human rights. Again, some of these labels are questionable. Many of these are clearly labeled human rights only because they came from a human rights account. However, as we can see, human rights accounts don't necessarily give every tweet human rights content.

## Trying Again

Two simple text preprocessing steps and one change to the evaluation metric might improve the classifier.

1. __Remove tweets that say thanks__. These being labeled as human rights related are clearly an artifact of the way the labeled data was collected (getting posts from certain accounts associated with human rights). The text itself does not merit the posts' label of being human-rights related. Therefore I will try removing posts that say thanks.

2. __Remove the term rt__. Something about the labeled data made it so that the term "rt" was more likely to be associated with human rights tweets. Judging from some of the mislabeled tweets, this is generating some of the false positives. I will try removing all instances of the term "rt" from the labeled data. 

3. __Focus on precision__. Looking through some of the mislabeled tweets, false positives look like a greater problem than false negatives. In other words, it may be beneficial to weight precision more heavily than recall, since some of the poor recall may simply come from not predicting "human rights tweets" that say things like "jadaliyya link is broken".
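
One standard way to weight precision over recall is the $F_\beta$ score, $F_\beta = (1+\beta^2)\,\frac{P \cdot R}{\beta^2 P + R}$, where $P$ is precision, $R$ is recall, and values of $\beta$ below 1 favor precision. As a small sketch, the hypothetical `fbeta` helper below evaluates this formula for the validation precision and recall reported earlier.

```bash
# F-beta score from precision and recall (sketch).
fbeta() {
    # $1 = beta, $2 = precision, $3 = recall
    awk -v b="$1" -v p="$2" -v r="$3" 'BEGIN {
        printf "%.5f\n", (1 + b * b) * p * r / (b * b * p + r)
    }'
}

# Validation-set precision and recall reported earlier.
for beta in 1 0.5 0.25; do
    printf 'F_%s = %s\n' "$beta" "$(fbeta "$beta" 0.97731 0.95759)"
done
```

Because precision is higher than recall for this model, the score moves toward the precision value as $\beta$ shrinks.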

Let's first revisit the parameter sets tried in the original random search and pick the one that maximizes the $F_{0.5}$ score.

In [13]:
./bestresult.sh F 0.5 vw/vwresults vw/vw_bestparams_proc
cat vw/vw_bestparams_proc

F: .97439686820112401149
weight: 2
--passes 10 --loss_function logistic --ngram w1 --skips w2 --learning_rate .80000000000000000000


In [15]:
cat vw/vw_bestparams

PRF: 0.96789
weight: 2
--passes 10 --loss_function logistic --ngram w1 --skips w2 --learning_rate .80000000000000000000


Actually, the model that maximized the balanced $F_1$ score maximizes the $F_{0.5}$ score as well. Let's try weighting precision even higher.

In [16]:
./bestresult.sh F 0.25 vw/vwresults vw/vw_bestparams_proc
cat vw/vw_bestparams_proc

F: .97876761676640748976
weight: 8
--passes 18 --loss_function hinge --ngram w1 --skips w2 --learning_rate .60000000000000000000 --l1 .00001000000000000000


With the $F_{0.25}$ score, we now get something different. This model weights human rights examples even higher, may go through more passes over the data, uses hinge loss instead of logistic loss, has a slightly lower learning rate, and now uses L1 regularization.

Let's do our extra text processing steps (remove tweets with the word "thank*" and take out the term "rt") while using our new weight of 8 instead of 2.

In [20]:
labeled=("train" "valid" "test")

for labeledtype in "${labeled[@]}";
do
    cat fast${labeledtype}.txt | sed -e 's/\<rt\>//g' > fast${labeledtype}_proc.txt
    sed -i '/thank/d' fast${labeledtype}_proc.txt
    echo Original number of tweets:$(wc -l < fast${labeledtype}.txt)
    echo New number of tweets:$(wc -l < fast${labeledtype}_proc.txt)
done

files=("valid" "test")
for data in "${files[@]}";
do
    cat fast${data}_proc.txt | awk '{sub("__label__hr", "+1 |w"); sub("__label__nonhr", "-1 |w"); print $0}'| shuf > vw/${data}tmpW
    cat vw/${data}tmpW | sed 's:.*|w::'| awk '$0="|l len:"length($0)' > vw/${data}tmpL
    paste vw/${data}tmpW vw/${data}tmpL > vw/vw${data}_proc.txt
    rm vw/${data}tmp*
done

while read line; do
    if [[ $line == weight* ]]; then
        weight=$(echo $line | grep -o -P '(?<=weight: )[0-9]+';)
    fi
    if [[ $line == --* ]]; then
        params=$line
    fi
done < vw/vw_bestparams_proc

cat fasttrain_proc.txt | awk '{sub("__label__hr", "+1 '$weight'|w"); sub("__label__nonhr", "-1 |w"); print $0}' | shuf > vw/traintmpW
cat vw/traintmpW | sed 's:.*|w::'| awk '$0="|l len:"length($0)' > vw/traintmpL
paste vw/traintmpW vw/traintmpL > vw/vwtrain_proc.txt
rm vw/traintmp*

Original number of tweets:99035
New number of tweets:96657
Original number of tweets:24808
New number of tweets:24256
Original number of tweets:31114
New number of tweets:30379


Removing tweets with the word "thank*" results in the loss of a few thousand tweets.
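
Two details of the filtering are worth noting. The pattern `/thank/` matches anywhere in a line, so words such as "thanksgiving" also cause a tweet to be dropped, while `\<rt\>` (a GNU sed extension) removes "rt" only when it is a whole word. The sketch below reproduces both behaviors on toy lines with a portable awk variant; the `clean_tweets` function is hypothetical and, unlike the sed version, it collapses runs of whitespace.

```bash
# Drop lines containing "thank" and remove whole-word "rt" tokens (sketch).
clean_tweets() {
    awk '!/thank/ {
        out = $1                               # keep the label as-is
        for (i = 2; i <= NF; i++)
            if ($i != "rt") out = out " " $i   # "start" and "art" are kept
        print out
    }'
}

# Toy lines, not taken from the real data set.
printf '%s\n' \
    '__label__hr rt ijdh thank you for supporting' \
    '__label__hr rt richardengel un report on syria' \
    '__label__nonhr happy thanksgiving everyone' \
    '__label__nonhr start the art project' | clean_tweets
```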

In [21]:
echo $params
vw --binary vw/vwtrain_proc.txt -c -k -f vw/vw_proc.model -b 24 $params --quiet

--passes 18 --loss_function hinge --ngram w1 --skips w2 --learning_rate .60000000000000000000 --l1 .00001000000000000000


Let's see how it performs on the unlabeled data.

In [22]:
vw --binary -t -i vw/vw_proc.model -p vw/vwunlabeled_pred_proc.txt vw/vwfirst200k.txt --quiet
paste vw/vwunlabeled_pred_proc.txt vw/vwfirst200k.txt > vw/vwlabeled200k_proc.txt

In [23]:
echo Number of predicted human rights tweets: $(grep '^1.*' vw/vwlabeled200k_proc.txt | wc -l)
echo Number of total tweets: $(wc -l < vw/vwlabeled200k_proc.txt)
printf "\n"
echo Predicted as human rights
grep '^1.*' vw/vwlabeled200k_proc.txt | shuf -n 20
printf "\n"
echo Predicted as not human rights
grep '^-1.*' vw/vwlabeled200k_proc.txt | shuf -n 20

Number of predicted human rights tweets: 1337
Number of total tweets: 211180

Predicted as human rights
1	|w forgot how miserable it is to live in a swing state during election season dont complain about your vote not counting dc ondemandonly |l len:134
1	|w isaacmizrahi have an idea gt headscarfsgt charity women wearing in support of kurdish women warriors you  |l len:106
1	|w nigerianewsdesk boko haram lagos court delivers secret judgment on suspected terrorists  via todayngr |l len:102
1	|w because heaven forbid we focus on womens safety for once |l len:57
1	|w the chinese are smarter simply bc they can fit more knowledge into less material a 1000 page book in english would be 500 pages in chinese |l len:139
1	|w markhor14 what an xprnc gr8 speakers proud of youth of pakistan who made this a success full of energy determination jazba passion n luv |l len:137
1	|w update we are arriving at temple university hospital where pa state police will update media on trooper killed in shootin

Looking through the above and other tweets, this round seems to have done better than the previous one, though it's still not perfect.

Just for kicks, let's see how this model nominally performs on our test set.

In [24]:
vw --binary -t -i vw/vw_proc.model -r vw/vwtest_rawpred_proc.txt vw/vwtest_proc.txt 
cut -d' ' -f1 vw/vwtest_proc.txt | paste - vw/vwtest_rawpred_proc.txt | perf.linux/perf -t 0 -PRE -REC -PRF -PRB -ACC

Generating 1-grams for w namespaces.
Generating 2-skips for w namespaces.
only testing
raw predictions = vw/vwtest_rawpred_proc.txt
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = vw/vwtest_proc.txt
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0  -1.0000  -1.0000        9
0.000000 0.000000            2            2.0   1.0000   1.0000       17
0.000000 0.000000            4            4.0  -1.0000  -1.0000        8
0.000000 0.000000            8            8.0  -1.0000  -1.0000        9
0.062500 0.125000           16           16.0   1.0000   1.0000       23
0.031250 0.000000           32           32.0  -1.0000  -1.0000       11
0.046875 0.062500           64           64.0  -1.0000  -1.0000       12
0.046875 0.046875          128          128.0  -1.0000  -1.0000        7


The model nominally does well. It achieves an accuracy of over 97% and a precision of nearly 0.98, though recall is lower, at around 0.94.
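As a rough cross-check on numbers like these, the same metrics can be recomputed with plain `awk` from the label/raw-prediction pairs produced by the `cut | paste` pipeline above. This is only a sketch: the function name `binary_metrics` is made up for illustration, and it assumes that labels are `1`/`-1` and that a raw score above the threshold counts as a positive prediction (my reading of the `-t 0` option passed to `perf`).

```bash
#!/usr/bin/env bash
# Hypothetical helper: accuracy, precision, and recall from "label score" pairs.
# Assumes labels are 1 / -1 and that score > threshold means "predicted positive".
binary_metrics() {
    awk -v thr="${1:-0}" '
        NF < 2 { next }                          # skip blank or malformed lines
        {
            actual    = ($1 > 0)
            predicted = ($2 > thr)
            if (actual && predicted)        n_tp++
            else if (!actual && predicted)  n_fp++
            else if (actual && !predicted)  n_fn++
            else                            n_tn++
        }
        END {
            total = n_tp + n_fp + n_fn + n_tn
            printf "ACC %.4f\n", (total       ? (n_tp + n_tn) / total  : 0)
            printf "PRE %.4f\n", (n_tp + n_fp ? n_tp / (n_tp + n_fp)   : 0)
            printf "REC %.4f\n", (n_tp + n_fn ? n_tp / (n_tp + n_fn)   : 0)
        }'
}

# Toy input with made-up scores: three positive examples and one negative.
printf '%s\n' '1 0.8' '1 0.4' '1 -0.2' '-1 -1.1' | binary_metrics 0
```

On real data, the pairs would come from something like `cut -d' ' -f1 vw/vwtest_proc.txt | paste - vw/vwtest_rawpred_proc.txt | binary_metrics 0`.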

## Potential Next Steps

- __More text preprocessing__. Some steps I normally employ but left out while playing around with VW are removing stopwords (e.g. the, a, ...) and stemming. These steps might have improved things.

- __Manually clean some of the tweets__. To recap, the positive-labeled tweets (tweets considered to be about human rights) were collected from accounts that tweet often about human rights and from posts that used certain hashtags. This strategy is nice because it's presumably low effort and generates a lot of "labeled" data rapidly. The downside of this vacuum strategy seems to be that it gets a lot of tweets that would generally be considered unrelated to human rights (see those "thank you" tweets). The labeled data could therefore benefit from some manual cleanup and labeling.

- __Supplement the training data__. There are 150,000 labeled tweets, with nearly 100,000 of them going to the training set. Given the huge volume of tweets that are produced, this is a very small amount. It's quite likely that the classifier is not being exposed to enough examples of "regular" tweets, which can contribute to a feature weighting outcome where many unrelated tweets are labeled as human rights tweets (low precision).

- __Force greater regularization__. Models that optimized accuracy or the balanced $F_1$ score tended not to have regularization. However, given what we know about the trustworthiness of the data, optimizing these numbers may not be the way forward, and we definitely do not want to overfit this training data. We could therefore run random search (or some other type of search) where all models must have regularization (my current implementation only uses regularization in about half of the random search iterations).

- __Take advantage of other VW options__. Some of the options that VW has that I didn't use include boosting and using a multi-layer perceptron with a single hidden layer (with sigmoidal activations I believe). I could try incorporating these, though some initial tests I ran seemed to indicate that they didn't do too much to improve things.

- __Try different architectures__. A lot of recent advancements in NLP have come through the usage of neural networks. Shallow ones can be used to generate word/sentence/document embeddings. Convolutional neural networks and recurrent neural networks (e.g. LSTMs) can also be used to incorporate aspects of the sentence sequence/structure into learning and predictions. 

- __Information retrieval__. Rather than treating this as a machine learning problem, we could treat it as an information retrieval problem. This would require us to come up with a set of search terms, and I was asked to approach the problem by training a classifier. Nevertheless, I tried running this data through [Elasticsearch](https://www.elastic.co/products/elasticsearch) with some terms I came up with, and it seemed to do okay.
![elasticsearch](elasticsearch.png)
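To make the first bullet above a bit more concrete, here is a minimal sketch of stopword removal applied directly to VW-formatted lines. The function name `remove_stopwords` and the tiny default stopword list are placeholders I made up for illustration; a real run would use a proper list. It assumes the layout seen in the examples above (`label |w words... |l len:N`) and only touches tokens inside the `w` namespace.

```bash
#!/usr/bin/env bash
# Hypothetical helper: drop stopwords from the |w namespace of VW-format lines.
# The default stopword list is a toy placeholder, not a recommended list.
remove_stopwords() {
    awk -v stop="${1:-the a an and of to in is it}" '
        BEGIN {
            n = split(stop, words, " ")
            for (i = 1; i <= n; i++) is_stop[words[i]] = 1
        }
        {
            out = ""; in_w = 0
            for (i = 1; i <= NF; i++) {
                if (substr($i, 1, 1) == "|") {
                    in_w = ($i == "|w")          # track the current namespace
                } else if (in_w && ($i in is_stop)) {
                    continue                     # skip stopwords inside |w only
                }
                out = out (out == "" ? "" : " ") $i
            }
            print out
        }'
}

echo '1 |w the right to education is a human right |l len:42' | remove_stopwords
```

Two caveats: the rewrite collapses any run of whitespace into a single space, and the `len` feature in the `l` namespace is left untouched, so it would still reflect the original tweet length.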

## General Takeaways

- Data quality is important... an obvious one. The way the human rights tweets were collected meant that there are potentially a lot of incorrectly applied labels, which is not fantastic for training...

- Can't blindly just optimize numbers. I believe the original goal of this project was to use a trained classifier to label tweets as being about human rights or not, and then to map where the human rights tweets were coming from, geographically. If we had been less discerning, we could have observed the high accuracy, $F_1$ score, precision, etc. on our test set, assumed our classifier was doing great, and then directly mapped the predictions it generated. As can be seen from our first attempt, and to a somewhat lesser extent our second attempt, this would not have been good, since we'd be mapping a lot of incorrect tweets.

- Should interrogate the model. We picked up some insights by taking a look through the tweets that were mislabeled and by looking at the features with the highest weights so that the classifier itself would not be a black box. Using a tool like [lime](https://github.com/marcotcr/lime) could have been useful too.

- Vowpal Wabbit is fast. I had heard of Facebook's fastText (some code using it is elsewhere in this repo), and I came across VW while reading discussions about fastText. Both tools are nice and speedy, but one advantage VW had over \[a Python implementation of\] fastText was its handling of ngrams. With fastText, using more than bigrams brought my laptop (4GB RAM) to a crawl, but VW seemed to handle up to 5-grams quite well.

- The command line seems kind of unwieldy for these kinds of tasks. The caveat is that I'm more experienced in tools like Python and R relative to command-line things like Bash, awk, sed, etc. I'm glad that I went through this exercise and got a better handle on command-line tools though, since I generally didn't use the command line for much besides moving and renaming files.
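On the "interrogate the model" point above, one way to rank features by weight is to sort a human-readable model dump, such as the file VW writes when given `--invert_hash`. The sketch below is hedged accordingly: `top_features` is a made-up name, and it assumes each feature line has the form `name:hash:weight` (what I have seen in such dumps, though the exact layout may vary by VW version and options). Header lines that do not match that shape are skipped.

```bash
#!/usr/bin/env bash
# Hypothetical helper: show the N features with the largest absolute weight.
# Assumes feature lines look like "name:hash:weight"; other lines are ignored.
top_features() {
    awk -F: '
        NF == 3 && $3 ~ /^-?[0-9.]+(e[-+]?[0-9]+)?$/ {
            w = $3 + 0
            # Emit |w| first so the lines can be sorted on it, then name and weight.
            printf "%.6f\t%s\t%.6f\n", (w < 0 ? -w : w), $1, w
        }' |
        LC_ALL=C sort -t $'\t' -k1,1nr |
        cut -f2,3 |
        head -n "${1:-10}"
}

# Toy dump with made-up weights; the first two lines mimic header lines.
printf '%s\n' 'Version 8.1.1' 'bits:24' \
    'w^rights:101:0.9' 'w^thank:202:-1.5' 'w^the:303:0.01' 'Constant:404:-0.2' |
    top_features 2
```

Sorting on the absolute value surfaces strongly negative features (here the toy `w^thank`) alongside strongly positive ones, which is useful when looking for terms that push tweets away from the human rights label.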