# Classifying 🤬ffensive Language on Twitter
### Using Recurrent Neural Networks To Classify Hate Speech 
Project mentor: Carlos Aguirre

Aditya Yedetore <ayedeto1@jh.edu>, Karl Mulligan <kmullig3@jh.edu>

https://github.com/adityayedetore/hate-speech-and-offensive-language

*** Warning: though we censor what we can, in some instances we display offensive language and hate speech. Please proceed with caution. ***

# Outline and Deliverables

### Uncompleted Deliverables
1. "Must Accomplish #1": Engineer syntactic features (part of speech tagging, parse tree). 
  * Reason we didn't complete this one: Twitter hate speech is unlike most language used to train part of speech taggers and parsers in that it is informal, and includes misspellings and slang. Thus most automatic methods of syntactic feature extraction would likely either be riddled with inaccuracies or fail to generate features altogether. Considering this, we decided to move to different ways of augmenting the data.
    * Note that another group did decide to augment the data with parse trees, and they found that this extra data did not improve their model. We take this as evidence that standard parsers fail on the tweet data. 
2. "Would Like to Accomplish #3": GRUs. 
  * We realized that GRUs probably wouldn't perform qualitatively differently from LSTMs, since they are structurally very similar but less powerful, so we decided not to implement them. In addition, our code was LSTM specific, so modifying it to use GRUs would be difficult, and not worth the effort. 
3. "Would Like to Accomplish #1": Compare performance of LSTM and GRU models. 
  * This follows from point 2 above. 



### Completed Deliverables
1. "Must Accomplish #2": Augment data with additional hand crafted examples. We discuss data augmentation [in "Dataset" below](#scrollTo=zFq-_D0khnhh&line=10&uniqifier=1).
2. "Must complete #3": Train LSTM/GRU language model on twitter speech. We discuss training our logistic regression baseline [in "Baselines" below](#scrollTo=oMyqHUa0jUw7&line=5&uniqifier=1).
3. "Expect to Accomplish #1": Perform hyperparameter search. We discuss the hyperparameter search [in "Methods" below](#scrollTo=PqB48IF9kMBf&line=4&uniqifier=1).
4. "Expect to Accomplish #2": Evaluate performance of LSTM/GRU, discussed [in "Results" below](#scrollTo=_Zdp4_H-kx8H). 
5. "Expect to Accomplish #3": Compare performance with and without data augmentation, discussed [in "Results" below](#scrollTo=_Zdp4_H-kx8H).
6. "Expect to Accomplish #4": Compare performance with and without feature engineering, discussed [in "Results" below](#scrollTo=_Zdp4_H-kx8H).

# Preliminaries

## What problem were you trying to solve or understand?

*What are the real-world implications of this data and task?*
* In recent years, hate speech has increased on platforms like Twitter, harming the mental health of users and, in some cases, even violating laws. However, distinguishing hate speech from otherwise innocuous speech that contains profanity, or from highly negative but not hateful speech, is a non-trivial problem. Deciding whether a tweet qualifies as hate speech usually requires human judgment, which struggles to scale to the volume of hate speech on Twitter today. Human annotators also often find reading hate speech exhausting, which makes annotation expensive. 

*How is this problem similar to others we’ve seen in lectures, breakouts, and homeworks?*
* This problem is similar to others we've seen in the homework in that it is a supervised classification problem over text data, specifically Programming Homework 4. 



*What makes this problem unique?*
* This problem is unique in that a bag of words representation will likely not work well:
  * Ex. “I think I’d call that bad man a \*\*\*” vs “I think calling that man a \*\*\* is bad”. 
    * In the above case, if '\*\*\*' indicates a slur, the former intuitively will have a much greater chance of being hate speech than the latter, despite the fact that both sentences have similar bag of words representations. 
  * A Twitter based example: tweet A is hate speech, and a user retweets A, commenting "This is hate speech". Theoretically, the model should be able to differentiate tweet A and the retweet, though they likely have similar bag of words representations. 
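To make the bag-of-words point concrete, here is a minimal sketch that counts the tokens of the two example sentences. The whitespace tokenizer and the placeholder token `SLUR` (standing in for the censored word) are assumptions made for illustration, not part of our pipeline.

```python
from collections import Counter

def bag_of_words(sentence):
    """Lowercase, split on whitespace, and count tokens (word order is discarded)."""
    return Counter(sentence.lower().split())

# "SLUR" is a placeholder standing in for the censored word.
a = bag_of_words("I think I'd call that bad man a SLUR")
b = bag_of_words("I think calling that man a SLUR is bad")

shared = sum((a & b).values())  # tokens the two bags have in common
total = sum((a | b).values())   # size of the union of the two bags
print(f"shared tokens: {shared} of {total}")
print("only in first: ", sorted((a - b).elements()))
print("only in second:", sorted((b - a).elements()))
```

Most tokens are shared between the two bags, even though the sentences mean very different things. A model that reads the words in order has a chance of telling them apart.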

*What ethical implications does this problem have?*
* A system that can detect hate speech will likely be used to inform decisions about who or what to censor, so building one raises ethical and legal issues. Moreover, because tweet patterns may vary across groups, a machine-learning-based hate speech detector may be more accurate for some (racial, cultural, etc.) groups than for others. Furthermore, each tweet in this dataset was labeled by as few as three annotators, so the labels may be biased. Thus we face problems of fairness as well as of censorship.


## Dataset(s)

*Describe the dataset you used. How were they collected?*

We base our dataset on the Hate Speech and Offensive Language Dataset, which consists of roughly 24,000 tweets, each labeled by at least 3 crowdsourced human annotators [1].

Examples:
* Hate speech: @JuanYeez shut yo beaner ass up sp\*c and hop your f\*ggot ass back across the border little n\*gga.

* Offensive speech: @bitterchick dat means get the f\*ck out h\*e i be thinkin. 

* Neither: He’s a damn good actor. As a gay man it’s awesome to see an openly queer actor given the lead role for a major film
 


*Why did you choose them?*

* We were interested in the possibility of using RNNs to help classify tweets, and in whether an LSTM, which could potentially leverage the order and interdependencies of the words when creating embeddings, would perform better than other methods. Also, automatically detecting hate speech is a practical task that extends to many similar domains and platforms across the internet, so it is useful to determine how effective the various solutions are. 

*How many examples in each?*

Total 24,783 tweets (the sum of the "all" row below), average ~14 tokens/tweet. 

Number of examples per split. 

|       | hate speech  | offensive language | neither |
|-------|--------------|--------------------|---------|
| train | 1196         | 15262              | 3368    |
| valid | 121          | 1996               | 362     |
| test  | 113          | 1932               | 433     |
| all   | 1430         | 19190              | 4163    |


```python
# Load your data and print 2-3 examples
# WARNING: these examples are offensive, proceed with caution.
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/adityayedetore/hate-speech-and-offensive-language/master/data/labeled_data.csv")
for tweet in list(df['tweet'][-3:]):
    print(tweet)
```


young buck wanna eat!!.. dat nigguh like I aint fuckin dis up again
youu got wild bitches tellin you lies
~~Ruffled | Ntac Eileen Dahlia - Beautiful color combination of pink, orange, yellow &amp; white. A Coll http://t.co/H0dYEBvnZB


## Pre-processing

*What features did you use or choose not to use? Why?*
* For the baseline model, we used all the features present in the twitter data. 


*If you have categorical labels, were your datasets class-balanced?*
* (Refer to the table above for the specific counts.) The datasets were not class-balanced: there were very few examples of hate speech. This imbalance may cause the model to make more mistakes when classifying hate speech, but hate speech is intuitively also infrequent in regular Twitter data. A model that misclassifies even a small percentage of the offensive language as hate speech would be of little use, since those false positives would then have to be sorted by hand, defeating the purpose of automatic hate speech detection. For this reason, we tried both the unbalanced dataset as-is and an augmented version that reduced the imbalance.

*How did you deal with missing data? What about outliers?*
* No missing data, and no clear outliers per se. What exactly would qualify as an outlier for tweet data and hate speech is not easy to define, so we didn't consider them in our analysis. 

*What approach(es) did you use to pre-process your data? Why?*
* Data Preprocessing: We tokenized the text, lowercased it, and replaced infrequent words with \<unk\> tokens. Our tokenization split punctuation from words. These preprocessing methods are standard in text processing, and capture the fact that "carburetor", "carburetor's", and "Carburetor" should be encoded with the same embedding. 
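The preprocessing steps described above can be sketched as follows. This is a simplified illustration, not our actual pipeline (which relied on torchtext): the regular expression, the `min_count` threshold, and the helper names are all assumptions.

```python
import re
from collections import Counter

# Runs of word characters, or single punctuation marks (assumed tokenization rule).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Lowercase the text and split punctuation off from words."""
    return TOKEN_RE.findall(text.lower())

def build_vocab(tokenized_texts, min_count=2):
    """Keep tokens seen at least `min_count` times (the threshold is an assumption)."""
    counts = Counter(tok for toks in tokenized_texts for tok in toks)
    return {tok for tok, n in counts.items() if n >= min_count}

def replace_rare(tokens, vocab, unk="<unk>"):
    """Map out-of-vocabulary tokens to the <unk> placeholder."""
    return [tok if tok in vocab else unk for tok in tokens]

texts = [
    "My Carburetor broke.",
    "The carburetor's gasket leaked!",
    "New carburetor, new gasket.",
]
tokenized = [tokenize(t) for t in texts]
vocab = build_vocab(tokenized)
for toks in tokenized:
    print(replace_rare(toks, vocab))
```

In all three toy sentences the word surfaces as the same token `carburetor`, so it would share one embedding, while words seen only once collapse to `<unk>`.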

*Are your features continuous or categorical? How do you treat these features differently?*
* The features present in the data are categorical, i.e. the words in the dataset. However, in order to feed them into the LSTM, we used GloVe word embeddings, which represent the vocabulary as continuous, dense vectors. 

# Models and Evaluation

## Experimental Setup

*How did you evaluate your methods? Why is that a reasonable evaluation metric for the task?*
* We evaluated our models on a held-out test set, where we measured the percentage of predictions that matched the target. We also performed a more nuanced evaluation by producing a confusion matrix over the three classes. For instance, we looked at how often the model classified hate speech as hate speech, as offensive language, or as neither, and likewise for the other two classes. This is a reasonable metric, since there is more than one relevant way in which the model can perform well or badly, and the confusion matrix captures most of them. We also manually examined misclassified examples to get a qualitative sense of model behavior.
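As an illustration of the two metrics, the sketch below computes overall accuracy and a three-class confusion matrix in plain Python. The label names and the toy predictions are made up for the example; our actual evaluation code lives in the repository.

```python
LABELS = ["hate", "offensive", "neither"]

def accuracy(targets, predictions):
    """Fraction of predictions that match the target label."""
    correct = sum(t == p for t, p in zip(targets, predictions))
    return correct / len(targets)

def confusion_matrix(targets, predictions, labels=LABELS):
    """matrix[i][j] counts examples with true label labels[i] predicted as labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(targets, predictions):
        matrix[index[t]][index[p]] += 1
    return matrix

# Toy labels, made up for illustration only.
targets     = ["hate", "hate", "offensive", "offensive", "offensive", "neither"]
predictions = ["offensive", "hate", "offensive", "offensive", "neither", "neither"]

print(f"accuracy: {accuracy(targets, predictions):.2f}")
for label, row in zip(LABELS, confusion_matrix(targets, predictions)):
    print(f"{label:>9}: {row}")
```

Each row of the matrix shows where the examples of one true class ended up, which is how a model with high overall accuracy can still be seen to miss most of the hate speech class.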

*What did you use for your loss function to train your models?*
* We used cross entropy loss, as it is the standard loss for classification problems of the sort we were dealing with. 
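For reference, cross entropy for a single example is the negative log of the softmax probability the model assigns to the correct class. The sketch below computes it in plain Python; the logits are made-up numbers, and in practice a library implementation would be used instead.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_index])

logits = [2.0, 0.5, -1.0]  # made-up scores for (hate, offensive, neither)
print(cross_entropy(logits, 0))  # low loss: the top-scoring class is correct
print(cross_entropy(logits, 2))  # high loss: the lowest-scoring class is correct
```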

*How did you split your data into train and test sets? Why?*
* We used an 80% train, 10% valid, and 10% test split. We kept the splits the same for all runs of the models, so that performance could be compared accurately. We also made the splits while trying to keep the relative proportions of each class intact within each set. 
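One way to produce such a split is sketched below: indices are grouped by class, shuffled with a fixed seed, and divided 80/10/10 within each class. The function name, the seed, and the toy label list are assumptions; this is not necessarily how our own splits were generated.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, valid, test) index lists with per-class proportions roughly preserved."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)  # fixed seed so the same split is reused across runs
    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_valid = round(fractions[1] * len(indices))
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]
    return train, valid, test

# Toy label list with the same kind of imbalance as the real data.
labels = ["hate"] * 10 + ["offensive"] * 100 + ["neither"] * 30
train, valid, test = stratified_split(labels)
print(len(train), len(valid), len(test))
```

Fixing the seed is what keeps the three sets identical from run to run, so that differences in accuracy reflect the model rather than the split.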

Code for loss functions, evaluation metrics: [link to Git repo](https://github.com/adityayedetore/hate-speech-and-offensive-language/blob/master/main.py)

## Baselines 

*What baselines did you compare against? Why are these reasonable?*
* We used two baselines: the best-performing model from the paper we based our task on (an SVM), and the base LSTM model without the additional data augmentation. 

*Did you look at related work to contextualize how others' methods or baselines have performed on this dataset/task? If so, how did those methods do?*
* As stated above, we compared our baseline to the baseline of the SVM implemented in [1]. Our baseline performed noticeably worse on classifying the hate speech than the SVM, but had comparable overall accuracy (ours 88.4%; theirs 91%).

## Methods

*What methods did you choose? Why did you choose them?*
* We chose to use an LSTM, since we were interested in seeing how it would perform on this sort of semantic classification. 
* We augmented the training data by simply duplicating tweets according to the number of annotators who classified them as hate speech, which ranged from 0 to 3 for most examples. For instance, a tweet that 3 annotators marked as hate speech was presented to the model 4 times, whereas a tweet that only 1 annotator marked as hate speech was presented 2 times (even if its final class label was something else, such as offensive); tweets that no annotator marked as hate speech were not duplicated. In theory, this makes the model pay more attention to examples where more people agreed on the hate speech classification. It also helps rebalance a severely unbalanced dataset.
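The duplication rule can be sketched in a few lines. The record layout and the field name `hate_votes` (the number of annotators who marked the tweet as hate speech) are hypothetical, chosen only for this illustration.

```python
def augment_by_duplication(examples):
    """Keep each example once, plus one extra copy per annotator who marked it as hate speech."""
    augmented = []
    for example in examples:
        copies = 1 + example["hate_votes"]
        augmented.extend([example] * copies)
    return augmented

# Made-up records; "hate_votes" is a hypothetical field name.
examples = [
    {"text": "tweet A", "label": "hate", "hate_votes": 3},
    {"text": "tweet B", "label": "offensive", "hate_votes": 1},
    {"text": "tweet C", "label": "neither", "hate_votes": 0},
]
augmented = augment_by_duplication(examples)
print([e["text"] for e in augmented])
```

The rule should be applied to the training split only, so that the validation and test sets stay the same as for the unaugmented model.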


*How did you train these methods, and how did you evaluate them? Why?*
* We trained the LSTM with stochastic gradient descent, since there is no closed-form solution. We then performed a small hyperparameter search. For a general estimate of performance, we evaluated each of the models by its overall accuracy on the validation set. When we wanted to look more carefully at the results, we plotted a confusion matrix to see where exactly the model was going wrong. 

*Which methods were easy/difficult to implement and train? Why?*
* The LSTM was reasonably easy to implement, since we were able to find an implementation very similar to what we wanted to achieve. Furthermore, we used MARCC for computing, and the training dataset was fairly small. Thus, training the models took only a few minutes. 

*For each method, what hyperparameters did you evaluate? How sensitive was your model's performance to different hyperparameter settings?*

* For our LSTM model, we experimented with three hyperparameters: learning rate (2e-5, 2e-4, 2e-3), hidden size (128, 256, 512), and embedding length (150, 300, 600). None of the hyperparameters we tried had a qualitative effect on test accuracy. We settled on a learning rate of 2e-5, a hidden size of 256, and an embedding length of 300.
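A search over these three hyperparameters could be organized as below. This is only a sketch under several assumptions: that the learning rates are 2e-5, 2e-4, and 2e-3, that every combination is tried (a full grid), and that a function returning validation accuracy is available. The function `dummy_train_and_evaluate` is a hypothetical stand-in that does no training.

```python
from itertools import product

GRID = {
    "learning_rate": [2e-5, 2e-4, 2e-3],  # assumed values
    "hidden_size": [128, 256, 512],
    "embedding_length": [150, 300, 600],
}

def grid_search(train_and_evaluate, grid=GRID):
    """Try every combination and return (best_score, best_config, all_results)."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        results.append((train_and_evaluate(config), config))
    best_score, best_config = max(results, key=lambda r: r[0])
    return best_score, best_config, results

def dummy_train_and_evaluate(config):
    """Hypothetical stand-in: a real version would train the LSTM and return validation accuracy."""
    return 0.88 - abs(config["hidden_size"] - 256) / 10000

best_score, best_config, results = grid_search(dummy_train_and_evaluate)
print(len(results), best_config)
```

Selecting on validation accuracy rather than test accuracy keeps the test set untouched until the final comparison.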

Code for training models: [link to Git repo](https://github.com/adityayedetore/hate-speech-and-offensive-language/blob/master/models/LSTM.py).

We used models written with PyTorch and torchtext, adapted from code from [prakashpandey9](https://github.com/prakashpandey9/Text-Classification-Pytorch). 

Baseline: Plot of the training and valid accuracy (in percent correct) for an example model. The final test accuracy was 88.42%. Note that changing the hyperparameters did not significantly change these training results. 

![hi](http://adityayedetore.com/data/hate-speech-images/train-valid-lstm.jpg)


Data-Augmentation: Plot of the training and valid accuracy (in percent correct) for an example model. The final test accuracy was 86.98%. Note that changing the hyperparameters did not significantly change these training results. 

![hi](http://adityayedetore.com/data/hate-speech-images/train-valid-dup.jpg)

## Results

![text](http://adityayedetore.com/data/hate-speech-images/svm.png)

![text](http://adityayedetore.com/data/hate-speech-images/baseline.png)

![text](http://adityayedetore.com/data/hate-speech-images/dup-results.png)

*What about these results surprised you? Why?*
* The result that surprised us most was how dramatically the duplication augmentation strategy improved hate speech classification accuracy. While overall accuracy was roughly the same (~2% worse test accuracy), the percentage of hate speech correctly classified as such rose from 35% to 83% on the same test set. We looked for a number of potential bugs (e.g. data “peeking”, mismatched test sets, etc.) but did not find any. Looking at the confusion matrix, we see that performance on “neither” decreased after the augmentation. This corroborates our story about data balancing: tweets classified as neither hate speech nor offensive are much less likely to be judged as hate speech by any annotator, so they are duplicated less often and make up an even smaller percentage of the training data than before (15.3% as opposed to 17%).

*Did your models over- or under-fit? How can you tell? What did you do to address these issues?*
* Our models certainly overfit. We found that the models were getting 100% accuracy on the training set after a few epochs, but the accuracy on the validation set wasn't increasing, as can be seen in the training curve plot above. 

*What does the evaluation of your trained models tell you about your data? How do you expect these models might behave differently on different data?*

* The evaluation of our trained models, and specifically of our naive augmentation strategy, suggests that the difficulty of classifying this dataset is primarily due to class balance. That is, test set performance depends heavily on the distribution of classes during training. On a more realistic sample of Twitter data, where offensive speech is not nearly as plentiful, our model might still tend to classify examples according to the distribution of its training data. It would not be wise to deploy this model at scale; larger datasets are needed to test the applicability of our method to novel data.

# Discussion

## What we've learned

*What concepts from lecture/breakout were most relevant to your project? How so?*
* Vanishing gradient problem: 
  * While coding, we ran into a problem where all the tweets in a batch received the same logits after training. The cause was that we made every tweet the same length by padding its right end with 1's. Ostensibly it wouldn't be difficult for the LSTM to learn to ignore those 1's and keep the same hidden state through them. However, since we had around 300 1's, the vanishing gradient kicked in, and the LSTM wasn't able to learn anything about the words at the beginning of the text. The solution was simple: remove the padding from the tweets. It was interesting to see a concept from class be so directly helpful in the actual coding process. 
* Overfitting (bias-variance tradeoff).
  * We found that during training, our accuracy on the training set quickly reached 100%, but the accuracy on the validation set did not improve. Thanks to discussions in class, we realized that this was due to overfitting. 
* Ethics.
  * Without the discussions of fairness in class, we wouldn't have thought about the possible problems with implementing this sort of model. In some cases, such as the decision not to class balance, our decisions were guided by those discussions. For example, we believe our augmentation approach, which more correctly identifies hate speech at the expense of some "neither" class accuracy, is a better strategy, since the effect of hate speech may be considered more societally damaging than having innocuous speech mislabeled sometimes and then reversed (though this, along with many other ethical issues, is debatable, of course -- what is important is that our method allows us to control this behavior somewhat). 
* Neural networks
  * Of course, the information about neural networks, and LSTMs in particular, was especially useful when creating this project. 


*What aspects of your project did you find most surprising?*
* When we first trained the model, we improperly tokenized the text, so that punctuation was not split from the words. This probably greatly increased the size of the vocabulary. It was surprising that the LSTM model was still able to perform at all on such improperly tokenized text. 
* Another thing that we found surprising was how much our intuitions about which tweets were hate speech differed from the codings. We disagreed with roughly one in five of the tweets that were coded as hate speech.
* We were surprised that the LSTM was able to learn from such a small dataset. We previously assumed that a much bigger dataset was required, but it seems that at least for this task, relatively less data was needed; the GloVe embeddings carried a lot of external knowledge. 

*What lessons did you take from this project that you want to remember for the next ML project you work on? Do you think those lessons would transfer to other datasets and/or models? Why or why not?*
* One lesson we learned is the difficulty of extracting features from natural data. The amount of noise would likely cause off-the-shelf part of speech taggers and parsers to break. This was something that we hadn't considered when coming up with the project proposal.
* Another thing we learned was the difficulty of working with hate speech. One possibility for dealing with the lack of data was to manually find slurs that could be used in similar contexts, but that soon proved difficult, since it required reading a large quantity of this data. This might not apply to other, less hateful datasets, but it is definitely something to keep in mind for the future.

*What was the most helpful feedback you received during your presentation? Why?*
* One of the groups suggested that we think about class-balancing the data. While we didn't end up doing this, it did make us realize one method of data augmentation, which we [discuss in "Methods" above](#scrollTo=PqB48IF9kMBf&line=6&uniqifier=1). 

*If you had two more weeks to work on this project, what would you do next? Why?*
* Currently we are using GloVe embeddings. These may not be ideal, as slang and informal misspellings may not be known to GloVe, and thus those tokens will be replaced by \<unk\> tokens. To address this, we would use an embedder specifically crafted for tweets. 
* Intuitively, one significant difference between hate speech and offensive language is the use of the most egregious slurs. It is likely that those words co-occur very strongly with the hate speech classification. We might want to analyze the data to see if this is the case, and then augment the data in some way to reflect this. One idea would be to use a simpler model, such as creating count-vector inputs to a logistic regression classifier, and then rank the most informative features/words for each class. 
* We would *triple*-check our model for any bugs, to account for the higher performance on the hate speech class. 
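As a rough starting point for the word-ranking idea above, the sketch below ranks words by how much more frequent they are in one class than in the others. Note that it swaps in a smoothed log frequency ratio instead of the logistic regression classifier mentioned above, so it is a simpler stand-in rather than the proposed method. The function name, the smoothing constant, and the toy documents (with `SLUR` as a placeholder token) are assumptions.

```python
import math
from collections import Counter

def rank_words_for_class(docs, labels, target, smoothing=1.0):
    """Rank words by smoothed log frequency ratio: `target` docs versus all other docs."""
    in_class, out_class = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        (in_class if label == target else out_class).update(tokens)
    vocab = set(in_class) | set(out_class)
    n_in = sum(in_class.values()) + smoothing * len(vocab)
    n_out = sum(out_class.values()) + smoothing * len(vocab)
    scores = {
        w: math.log((in_class[w] + smoothing) / n_in)
        - math.log((out_class[w] + smoothing) / n_out)
        for w in vocab
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy tokenized documents; "SLUR" is a placeholder token.
docs = [["you", "SLUR"], ["SLUR", "go", "away"], ["you", "are", "rude"], ["nice", "day"]]
labels = ["hate", "hate", "offensive", "neither"]
ranked = rank_words_for_class(docs, labels, target="hate")
print(ranked[:3])
```

If the most egregious slurs do dominate the top of such a ranking for the hate speech class, that would support the intuition described above.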

# References

[1] Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated hate speech detection and the problem of offensive language. arXiv preprint arXiv:1703.04009.

[2] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.