# Assembling and Inspecting Potential Data for my Hate Speech Detection Model
This is the initial stage of my project, where I will assemble as much hate speech data as possible to fine-tune the pre-trained BERT model so that it can accurately classify hate speech on Twitter.

I have taken care in choosing the data used to train my model. I manually inspected each data source and judged for myself (with the informal help of friends) whether it has been accurately annotated for hate speech according to the definition below.

> <i>Hate Speech is commonly defined as any communication that disparages a person or a group on the basis of some characteristic such as race, colour, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics.</i>


I've been careful to do this because I didn't want to confuse my model by providing it with vastly differing interpretations of what constitutes hate speech.

Accurate and reliable hate speech detection online is very important. Many people have lamented the over-zealous way social media companies curtail free speech and punish users for speech that is offensive but is not hate speech under common law. I have therefore been careful to limit false positive annotations in the data below.

In [2]:
import pandas as pd
import json
import csv
import tweepy
import os
import numpy as np
#from google.colab import auth #if using colab
%pip install gcsfs
pd.set_option('display.max_colwidth', None)  # None disables truncation (pandas >= 1.0; older versions used -1)


Note: you may need to restart the kernel to use updated packages.


I have stored all of my separate data sources in my Google Cloud Storage bucket for accessibility purposes, although I will also provide the data in the GitHub upload.

An option to load datasets from local storage will also be available.

<b>All data in this notebook is labelled (supervised)</b>

In [4]:
#STORAGE_TYPE = 'LOCAL'
STORAGE_TYPE = 'REMOTE'
if STORAGE_TYPE =='LOCAL':
    DATA_DIR = '../Raw_Data/' #Local Storage
else:
    DATA_DIR = 'gs://csc3002/Raw_Data/' #Remote Storage
    
    #Below obtains access to GCS bucket
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"]= "C:/Users/fionn/Downloads/storageCreds.json"
    from google.cloud import storage
    storage_client = storage.Client()
    buckets = list(storage_client.list_buckets())
    print("If access has been granted, below will print [<Bucket: csc3002>] ")
    print(buckets) # Testing if access to GCS has been granted
    

If access has been granted, below will print [<Bucket: csc3002>] 
[<Bucket: csc3002>]


## Loading in HatEval 2019 Data

Below is a description of the initial SemEval training data this project will use. The targets of abuse in this dataset are immigrants and women.




<b>HS - Hate Speech</b>

This column indicates whether the content of the tweet is hate speech or not.

`0` - Not Hate Speech

`1` -  Hate Speech

This is the only column we're interested in for this task, as the other columns denote aggression and who the target of the tweet was (group or individual). Those columns will be removed as they're irrelevant to my task.



<i>Ref: [Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Rangel, Paolo Rosso, and Manuela Sanguinetti. 2019. SemEval-2019 Task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval-2019). Association for Computational Linguistics.](https://www.aclweb.org/anthology/S19-2007.pdf "")</i>

In [305]:
dev = pd.read_csv(os.path.join(DATA_DIR, 'hateval2019/hateval2019_en_dev.csv'), sep=',')
train = pd.read_csv(os.path.join( DATA_DIR, 'hateval2019/hateval2019_en_train.csv'), sep=',')
alldf = pd.concat([train, dev])
alldf.drop(columns = ['TR', 'AG', 'id'], inplace = True)  # only the text and HS columns are needed
alldf.reset_index(drop = True, inplace = True)
count = alldf.HS.value_counts()
print("There are {:d} non hate speech tweets and {:d} hate speech tweets\n".format(count[0], count[1]) )
alldf.head(10)

There are 5790 non hate speech tweets and 4210 hate speech tweets



Unnamed: 0,text,HS
0,"Hurray, saving us $$$ in so many ways @potus @realDonaldTrump #LockThemUp #BuildTheWall #EndDACA #BoycottNFL #BoycottNike",1
1,"Why would young fighting age men be the vast majority of the ones escaping a war &amp; not those who cannot fight like women, children, and the elderly?It's because the majority of the refugees are not actually refugees they are economic migrants trying to get into Europe.... https://t.co/Ks0SHbtYqn",1
2,"@KamalaHarris Illegals Dump their Kids at the border like Road Kill and Refuse to Unite! They Hope they get Amnesty, Free Education and Welfare Illegal #FamilesBelongTogether in their Country not on the Taxpayer Dime Its a SCAM #NoDACA #NoAmnesty #SendThe",1
3,NY Times: 'Nearly All White' States Pose 'an Array of Problems' for Immigrants https://t.co/ACZKLhdMV9 https://t.co/CJAlSXCzR6,0
4,"Orban in Brussels: European leaders are ignoring the will of the people, they do not want migrants https://t.co/NeYFyqvYlX",0
5,@KurtSchlichter LEGAL is. Not illegal. #BuildThatWall,1
6,"@RitaPanahi @826Maureen @RealCandaceO Antifa are just a pack of druggie misfits that no one loves, being the violent thugs they are is their cry for attention and their hit of self importance.#JuvenileDelinquents",0
7,Ex-Teacher Pleads Not guilty To Rape Charges https://t.co/D2mGu3VT5G,0
8,still places on our Bengali (Sylheti) class! it's London's 2nd language! know anyone interested @SBSisters @refugeecouncil @DocsNotCops https://t.co/sOx6shjvMx,0
9,DFID Africa Regional Profile: July 2018 https://t.co/npfZCriW0w,0


The data above seems quite reliably annotated. I'm particularly happy that the authors included tweets containing terms that might trigger a false positive, like `Rape` and `Immigrants`, in a non hate speech context. Some of the data is also quite aggressive towards a group, such as antifa (see the entry at index 6), without being hate speech, which is an important distinction to learn.

### Cleaning and Saving Data

As will be discussed later in the notebook, I have elected to also include an `Offensive` column in my final dataset in case it is useful to my investigation further down the line.


We'll assume that all tweets labelled as hate speech are inherently offensive. For the remaining tweets in this set we'll set the `Offensive` label to `-`, as they're ambiguous. This dilemma will be rare later on.

In [306]:
alldf.rename(columns = {'text': 'Tweet', 'HS' : 'Hate_Speech'}, inplace = True)
alldf['Offensive'] = '-'
alldf.loc[alldf['Hate_Speech'] == 1, 'Offensive'] = 1
alldf.head()

Unnamed: 0,Tweet,Hate_Speech,Offensive
0,"Hurray, saving us $$$ in so many ways @potus @realDonaldTrump #LockThemUp #BuildTheWall #EndDACA #BoycottNFL #BoycottNike",1,1
1,"Why would young fighting age men be the vast majority of the ones escaping a war &amp; not those who cannot fight like women, children, and the elderly?It's because the majority of the refugees are not actually refugees they are economic migrants trying to get into Europe.... https://t.co/Ks0SHbtYqn",1,1
2,"@KamalaHarris Illegals Dump their Kids at the border like Road Kill and Refuse to Unite! They Hope they get Amnesty, Free Education and Welfare Illegal #FamilesBelongTogether in their Country not on the Taxpayer Dime Its a SCAM #NoDACA #NoAmnesty #SendThe",1,1
3,NY Times: 'Nearly All White' States Pose 'an Array of Problems' for Immigrants https://t.co/ACZKLhdMV9 https://t.co/CJAlSXCzR6,0,-
4,"Orban in Brussels: European leaders are ignoring the will of the people, they do not want migrants https://t.co/NeYFyqvYlX",0,-


In [307]:
print("Breakdown for HatEval 2019 Data")
print("\nHate Speech Column Labelling:\n", alldf.Hate_Speech.value_counts())
print("\nOffensive Column Labelling:\n", alldf.Offensive.value_counts())
print("\nWhere 1 is positive, 0 is negative and - is ambiguous")
alldf['Tweet'] = alldf['Tweet'].map(lambda x: x.encode('unicode-escape').decode('utf-8'))
alldf.to_csv(os.path.join(DATA_DIR, 'cleaned/hateval.csv'), sep = ',', encoding='utf-8', \
                 index = False, header = True)

Breakdown for HatEval 2019 Data

Hate Speech Column Labelling:
 0    5790
1    4210
Name: Hate_Speech, dtype: int64

Offensive Column Labelling:
 -    5790
1    4210
Name: Offensive, dtype: int64

Where 1 is positive, 0 is negative and - is ambiguous
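
The `unicode-escape` step applied to the `Tweet` column before saving is worth a short illustration. As I understand it, it turns emoji, newlines and other non-ASCII characters into plain ASCII escape sequences, so every tweet stays on a single line of the CSV. Below is a minimal sketch of the round trip; `escape_tweet` and `unescape_tweet` are hypothetical helper names and the example tweet is invented.

```python
def escape_tweet(text):
    """Replace non-ASCII characters, newlines and backslashes with ASCII escapes."""
    return text.encode('unicode-escape').decode('utf-8')


def unescape_tweet(escaped):
    """Reverse escape_tweet; the escaped text is pure ASCII."""
    return escaped.encode('ascii').decode('unicode-escape')


original = "Go home, you're drunk 👊\nsecond line"  # invented example tweet
escaped = escape_tweet(original)

print(escaped)                               # emoji and newline appear as backslash escapes
print('\n' in escaped)                       # False: the tweet fits on one CSV line
print(unescape_tweet(escaped) == original)   # True
```

Anything that later reads the cleaned CSV files would need to apply the reverse step if the original characters are required.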


## Loading in OffensEval 2019 Data

This data is a bit more tricky but it'll be useful to try out and inspect if the data can be reliably used to fine tune our model. We must ask ourselves if the offensive language in this dataset can potentially equate to hate speech.


The labels used in the annotation are explained as is in the README file of this dataset.

(A) subtask_a: <b>Offensive language identification</b>

- `(NOT) Not Offensive` - This post does not contain offense or profanity.
- `(OFF) Offensive` - This post contains offensive language or a targeted (veiled or direct) offense

In our annotation, we label a post as offensive (OFF) if it contains any form of non-acceptable language (profanity) or a targeted offense, which can be veiled or direct. 

(B) subtask_b: <b>Categorization of offense types</b>

- `(TIN) Targeted Insult and Threats` - A post containing an insult or threat to an individual, a group, or others (see categories in sub-task C).
- `(UNT) Untargeted` - A post containing non-targeted profanity and swearing.

Posts containing general profanity are not targeted, but they contain non-acceptable language.

(C) subtask_c: <b>Offense target identification</b>

- `(IND) Individual` - The target of the offensive post is an individual: a famous person, a named individual or an unnamed person interacting in the conversation.
- `(GRP) Group` - The target of the offensive post is a group of people considered as a unity due to the same ethnicity, gender or sexual orientation, political affiliation, religious belief, or something else.
- `(OTH) Other` – The target of the offensive post does not belong to any of the previous two categories (e.g., an organization, a situation, an event, or an issue)

So, bearing all this in mind, I believe that if we only take data from this set that has the `OFF` flag for level A, a `TIN` flag for level B and a `GRP` or `IND` flag for level C, the data could potentially qualify as hate speech.


A problem lies here, though, in that not all offensive targeting of groups may be considered hate speech. Targeting political affiliation, such as `"All Democrats are a bunch of mongrels #libtards"`, would not be considered hate speech under the definition I'm using.

<i>Ref: [Predicting the Type and Target of Offensive Posts in Social Media, Zampieri, Marcos and Malmasi, Shervin and Nakov, Preslav and Rosenthal, Sara and Farra, Noura and Kumar, Ritesh, Proceedings of NAACL, 2019](https://www.aclweb.org/anthology/S19-2010.pdf "")</i>


In [4]:
offens = pd.read_csv(os.path.join(DATA_DIR, 'OLIDv1.0/olid-training-v1.0.tsv'), sep='\t')
offens.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13240 entries, 0 to 13239
Data columns (total 5 columns):
id           13240 non-null int64
tweet        13240 non-null object
subtask_a    13240 non-null object
subtask_b    4400 non-null object
subtask_c    3876 non-null object
dtypes: int64(1), object(4)
memory usage: 517.3+ KB


Getting all entries in the dataset which are labelled as `Offensive`, `Targeted Insult`, and the target may be a `Group` or `Individual`

In [5]:
offens1 = offens[offens["subtask_a"] == 'OFF']
offens1 = offens1[offens1["subtask_b"] == 'TIN']
offens1 = offens1[offens1["subtask_c"] != 'OTH']
offens1 = offens1.dropna()
offens1.reset_index(drop = True, inplace = True)
offens1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3481 entries, 0 to 3480
Data columns (total 5 columns):
id           3481 non-null int64
tweet        3481 non-null object
subtask_a    3481 non-null object
subtask_b    3481 non-null object
subtask_c    3481 non-null object
dtypes: int64(1), object(4)
memory usage: 136.1+ KB


In [6]:
offens1.head(20)

Unnamed: 0,id,tweet,subtask_a,subtask_b,subtask_c
0,90194,@USER @USER Go home you’re drunk!!! @USER #MAGA #Trump2020 👊🇺🇸👊 URL,OFF,TIN,IND
1,52415,@USER was literally just talking about this lol all mass shootings like that have been set ups. it’s propaganda used to divide us on major issues like gun control and terrorism,OFF,TIN,GRP
2,13384,@USER Canada doesn’t need another CUCK! We already have enough #LooneyLeft #Liberals f**king up our great country! #Qproofs #TrudeauMustGo,OFF,TIN,IND
3,28414,@USER you are a lying corrupt traitor!!! Nobody wants to hear anymore of your lies!!! #DeepStateCorruption URL,OFF,TIN,IND
4,56117,@USER @USER @USER @USER LOL!!! Throwing the BULLSHIT Flag on such nonsense!! #PutUpOrShutUp #Kavanaugh #MAGA #CallTheVoteAlready URL,OFF,TIN,IND
5,12681,@USER @USER Kind of like when conservatives wanna associate everyone to their left as communist antifa members?,OFF,TIN,GRP
6,82904,@USER @USER Da fuck is going on people? There's the men's room and the women's room Pick one and stick w it 🤔,OFF,TIN,GRP
7,77665,@USER Tbh these days i just don't like people in general i just don't connect with people these days just a annoyance..,OFF,TIN,IND
8,12609,The only thing the Democrats have is lying and stalling to stop Trump from being #President. What have they done for you lately. #Trump #Kavanaugh #MAGA #DEMSUCK,OFF,TIN,GRP
9,12108,@USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER You are not very smart are you? Why do you think Gen Flynn’s sentencing keeps being rescheduled? URL,OFF,TIN,IND


These tweets, <i>however tasteless some may be</i>, would not be considered hate speech by most people. The tweets in the HatEval dataset above are much clearer instances of hate speech.

We will therefore exclude the tweets that satisfy these conditions, because we don't want to risk contaminating the hate speech column:
> <i>that the tweet is offensive, is targeted and is directed at an individual or group</i>

### Cleaning and Saving Data

In [275]:
#Drop risky rows
offens.drop(offens[(offens["subtask_a"] == 'OFF') & \
                  (offens["subtask_b"] == 'TIN') & \
                  (offens["subtask_c"] != 'OTH')].index, inplace = True)

offens.drop(columns = ['subtask_b', 'subtask_c', 'id'], inplace = True, axis = 1)
offens.rename(columns = {'subtask_a':'Offensive', 'tweet': 'Tweet'}, inplace = True)
offens['Hate_Speech'] = 0
offens.loc[offens['Offensive'] == 'OFF', 'Offensive'] = 1
offens.loc[offens['Offensive'] == 'NOT', 'Offensive'] = 0
offens.head()

Unnamed: 0,Tweet,Offensive,Hate_Speech
0,@USER She should ask a few native Americans what their take on this is.,1,0
2,Amazon is investigating Chinese employees who are selling internal data to third-party sellers looking for an edge in the competitive marketplace. URL #Amazon #MAGA #KAG #CHINA #TCOT,0,0
3,"@USER Someone should'veTaken"" this piece of shit to a volcano. 😂""",1,0
4,@USER @USER Obama wanted liberals &amp; illegals to move into red states,0,0
5,@USER Liberals are all Kookoo !!!,1,0


In [276]:
offens['Tweet'] = offens['Tweet'].map(lambda x: x.encode('unicode-escape').decode('utf-8'))

offens.to_csv(os.path.join(DATA_DIR, 'cleaned/offens.csv'), sep = ',', encoding='utf-8', \
                 index = False, header = True)

print("Breakdown for Offenseval 2019 Data")
print("\nHate Speech Column Labelling:\n", offens.Hate_Speech.value_counts())
print("\nOffensive Column Labelling:\n", offens.Offensive.value_counts())
print("\nWhere 1 is positive, 0 is negative and - is ambiguous")

Breakdown for Offenseval 2019 Data

Hate Speech Column Labelling:
 0    9759
Name: Hate_Speech, dtype: int64

Offensive Column Labelling:
 0    8840
1    919 
Name: Offensive, dtype: int64

Where 1 is positive, 0 is negative and - is ambiguous
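
The counts above can be cross-checked against the earlier `info()` outputs. Below is a small sketch of the arithmetic, assuming `subtask_b` is only filled in for tweets labelled `OFF` (so its non-null count equals the number of offensive tweets). The variable names are my own.

```python
total_rows = 13240       # rows in olid-training-v1.0.tsv
offensive_rows = 4400    # rows with a subtask_b label, assumed to be the OFF tweets
dropped_rows = 3481      # OFF + TIN + (IND or GRP), removed as "risky"

remaining_rows = total_rows - dropped_rows            # rows saved to offens.csv
remaining_offensive = offensive_rows - dropped_rows   # Offensive == 1 after the drop
remaining_not_offensive = total_rows - offensive_rows # Offensive == 0, untouched by the drop

print(remaining_rows, remaining_offensive, remaining_not_offensive)  # 9759 919 8840
```

All three figures agree with the breakdown printed above.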


## ICWSM 2017 Hate Speech Dataset

This dataset usefully contains benign tweets, offensive language and hate speech. Having offensive tweets that are not hate speech will be very useful to my model, because I can imagine false positives arising when a model tries to distinguish between offensive language and hate speech, both of which are often laced with profanity.


### What the columns mean:

* `count` - Number of CrowdFlower (CF) users who coded each tweet (the minimum is 3; more users coded a tweet when CF determined the judgments to be unreliable).


* `hate_speech` - Number of CF users who judged the tweet to be hate speech.


* `offensive_language` - Number of CF users who judged the tweet to be offensive.


* `neither` - Number of CF users who judged the tweet to be neither hate speech nor offensive.


* `class` - Class label chosen by the majority of CF users:

 0. Hate Speech

 1.  Offensive Language
 
 2.  Neither
 
<i>Ref: [Thomas Davidson, Dana Warmsley, Michael Macy and Ingmar Weber. 2017. Automated Hate Speech Detection and the Problem of Offensive Language. In Proceedings of the Eleventh International AAAI Conference on Web and Social Media (ICWSM 2017), pages 512–515, Montreal.](https://arxiv.org/pdf/1703.04009.pdf "")</i>
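
To make the relationship between the vote columns and `class` concrete, here is a minimal sketch on invented rows. It assumes `class` is simply the position of the vote column with the most votes, which fits the description above but is my assumption rather than something the dataset documentation spells out. The `agreement` column is a hypothetical addition for illustration.

```python
import pandas as pd

# Invented vote counts using the same column names as the dataset
toy = pd.DataFrame({
    'count': [3, 3, 6],
    'hate_speech': [2, 0, 3],
    'offensive_language': [1, 3, 2],
    'neither': [0, 0, 1],
})

vote_cols = ['hate_speech', 'offensive_language', 'neither']

# 0 = hate speech, 1 = offensive language, 2 = neither (ties go to the first column)
toy['class'] = toy[vote_cols].to_numpy().argmax(axis=1)

# Share of annotators who voted for the winning label
toy['agreement'] = toy[vote_cols].max(axis=1) / toy['count']

print(toy)
```

The third toy row shows why the raw vote count matters: it is labelled hate speech even though only half of its six annotators agreed.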

In [5]:
sometweets = os.path.join(DATA_DIR, 'ICVSM_2017.csv')
sometweets = pd.read_csv(sometweets, sep=',', encoding='utf-8')
sometweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24783 entries, 0 to 24782
Data columns (total 7 columns):
Unnamed: 0            24783 non-null int64
count                 24783 non-null int64
hate_speech           24783 non-null int64
offensive_language    24783 non-null int64
neither               24783 non-null int64
class                 24783 non-null int64
tweet                 24783 non-null object
dtypes: int64(6), object(1)
memory usage: 1.3+ MB


Further refining tweet database and inspecting to see if the tweets qualify as hate speech

In [6]:
# Take an explicit copy so the in-place drop does not operate on a slice of sometweets
hatetweets = sometweets[sometweets["class"] == 0].copy()
hatetweets.drop(columns = ['Unnamed: 0', 'neither', 'class'], inplace = True)
hatetweets.reset_index(drop = True, inplace = True)
hatetweets.head(20)


Unnamed: 0,count,hate_speech,offensive_language,tweet
0,3,2,1,"""@Blackman38Tide: @WhaleLookyHere @HowdyDowdy11 queer"" gaywad"
1,3,3,0,"""@CB_Baby24: @white_thunduh alsarabsss"" hes a beaner smh you can tell hes a mexican"
2,3,3,0,"""@DevilGrimz: @VigxRArts you're fucking gay, blacklisted hoe"" Holding out for #TehGodClan anyway http://t.co/xUCcwoetmn"
3,3,3,0,"""@MarkRoundtreeJr: LMFAOOOO I HATE BLACK PEOPLE https://t.co/RNvD2nLCDR"" This is why there's black people and niggers"
4,3,2,1,"""@NoChillPaz: ""At least I'm not a nigger"" http://t.co/RGJa7CfoiT""\n\nLmfao"
5,3,2,1,"""@NotoriousBM95: @_WhitePonyJr_ Ariza is a snake and a coward"" but at least he isn't a cripple like your hero Roach lmaoo"
6,6,3,2,"""@RTNBA: Drakes new shoes that will be released by Nike/Jordan.... Yes, there's glitter on the shoes http://t.co/QCtPLxHEXM"" ....dudes a fag"
7,3,2,1,"""@TheoMaxximus: #GerrysHalloweenParty http://t.co/3ycrSrnjHc"" Halloween was yesterday stupid nigger"
8,3,2,1,"""@ashlingwilde: @ItsNotAdam is bored supposed to be cute, you faggot?"" Sometimes"
9,3,2,1,"""@bigbootybishopp: @white_thunduh lassen cc , the coach wants me"" i feel it youll get more my nig no doubt"


For the most part these tweets could be classified as hate speech, although some again contain offensive language or hate speech terms without being clear cut. These tweets usually have 2 or fewer annotators agreeing that they are hate speech.

As I've said earlier, defining clear-cut hate speech is paramount. So, at the expense of the volume of hate speech data, but with the benefit of a clearer, more accurately defined dataset, I'll exclude tweets that have been annotated as hate speech by fewer than 3 CF users.

Below we'll see whether the tweets that 3 or more CF users agreed were hate speech would qualify.

In [183]:
hatetweets1 = hatetweets[hatetweets['hate_speech'] > 2]
print("There are", len(hatetweets1.index), "tweets in this database where 3 or more users agree it's hate speech\n")
hatetweets1.head(20)

There are 308 tweets in this database where 3 or more users agree it's hate speech



Unnamed: 0,count,hate_speech,offensive_language,tweet
1,3,3,0,"""@CB_Baby24: @white_thunduh alsarabsss"" hes a beaner smh you can tell hes a mexican"
2,3,3,0,"""@DevilGrimz: @VigxRArts you're fucking gay, blacklisted hoe"" Holding out for #TehGodClan anyway http://t.co/xUCcwoetmn"
3,3,3,0,"""@MarkRoundtreeJr: LMFAOOOO I HATE BLACK PEOPLE https://t.co/RNvD2nLCDR"" This is why there's black people and niggers"
6,6,3,2,"""@RTNBA: Drakes new shoes that will be released by Nike/Jordan.... Yes, there's glitter on the shoes http://t.co/QCtPLxHEXM"" ....dudes a fag"
13,3,3,0,"""Hey go look at that video of the man that found the kidnapped girls in Ohio........what a nigger"" - #shitmybosssays"
17,3,3,0,"""Our people"". Now is the time for the Aryan race 2 stand up and say ""no more"". Before the mongerls turn the world into a ghetto slum. 1488"
19,3,3,0,"""We hate niggers, we hate faggots and we hate spics""-kkk rally"
22,3,3,0,"""Why people think gay marriage is okay is beyond me. Sorry I don't want my future son seeing 2 fags walking down the street holding hands an"
30,3,3,0,#California is full of white trash
34,3,3,0,#Dutch people who live outside of #NewYorkCity are all white trash.


These tweets, in my opinion, qualify as hate speech.

### Cleaning and Saving Data

In [12]:
#Load data again
sometweets = pd.read_csv(os.path.join(DATA_DIR, 'ICVSM_2017.csv'), sep=',', encoding='utf-8')

#initialise cols to all 0's
sometweets['Hate_Speech'] = 0
sometweets['Offensive'] = 0

sometweets.rename(columns = {'tweet': 'Tweet'}, inplace = True)

#Class = 0 means that the entry has been classified as hate speech
sometweets['Hate_Speech'] = np.where((sometweets['class'] == 0)\
                                     & (sometweets['hate_speech'] > 2), 1, 0)

#Drop tweets that have been annotated as hate speech by 2 or fewer CF users
sometweets.drop(sometweets[(sometweets["class"] == 0) & \
                     (sometweets["hate_speech"] < 3 )].index, inplace = True)

#Class = 2 means neither hate speech nor offensive; every remaining class 0 or 1 tweet is marked offensive
sometweets['Offensive'] = np.where((sometweets['class'] == 2), 0, 1)

#Drop unnecessary columns
sometweets.drop(columns = ['Unnamed: 0', 'count','offensive_language','neither',\
                           'class','hate_speech'], axis = 1, inplace = True)

sometweets.reset_index(drop = True, inplace = True)
sometweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23661 entries, 0 to 23660
Data columns (total 3 columns):
Tweet          23661 non-null object
Hate_Speech    23661 non-null int32
Offensive      23661 non-null int32
dtypes: int32(2), object(1)
memory usage: 369.8+ KB


In [13]:
sometweets['Tweet'] = sometweets['Tweet'].map(lambda x: x.encode('unicode-escape').decode('utf-8'))

sometweets.to_csv(os.path.join(DATA_DIR, 'cleaned/sometweets.csv'), sep = ',', encoding='utf-8', \
                 index = False, header = True)

print("Breakdown for ICWSM 2017 Data")
print("\nHate Speech Column Labelling:\n", sometweets.Hate_Speech.value_counts())
print("\nOffensive Column Labelling:\n", sometweets.Offensive.value_counts())
print("\nWhere 1 is positive, 0 is negative and - is ambiguous")

Breakdown for ICWSM 2017 Data

Hate Speech Column Labelling:
 0    23353
1    308  
Name: Hate_Speech, dtype: int64

Offensive Column Labelling:
 1    19498
0    4163 
Name: Offensive, dtype: int64

Where 1 is positive, 0 is negative and - is ambiguous


## A New Database - ICWSM 2018 Abusive Behaviour Database

In all, this dataset contains about 100,000 tweets, each labelled with one of four categories: normal, abusive, hateful and spam.

The authors of this dataset clearly state that the label `hateful` is meant to represent hate speech, and they instructed their annotators to label tweets on that basis. They use a definition quite close to the one I am using to guide my report:

><i>Language used to express hatred towards a targeted individual or group, or is intended to be derogatory, to humiliate, or to insult the members of the group, on the basis of attributes such as race, religion, ethnic origin, sexual orientation, disability, or gender.</i>

<i>Ref: [Antigoni-Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. 2018. Large scale crowdsourcing and characterization of Twitter abusive behavior. AAAI](https://arxiv.org/pdf/1802.00393.pdf "")</i>

In [8]:
icvsm = pd.read_csv( os.path.join(DATA_DIR, 'ICVSM_2018_dataset/hatespeech_text_label_vote.csv'), \
                    sep='\t', names =["tweets", "majority label", "votes on majority label"  ], encoding ='utf-8')
icvsm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99996 entries, 0 to 99995
Data columns (total 3 columns):
tweets                     99996 non-null object
majority label             99996 non-null object
votes on majority label    99996 non-null int64
dtypes: int64(1), object(2)
memory usage: 2.3+ MB


In [9]:
icvsm.head()

Unnamed: 0,tweets,majority label,votes on majority label
0,Beats by Dr. Dre urBeats Wired In-Ear Headphones - White https://t.co/9tREpqfyW4 https://t.co/FCaWyWRbpE,spam,4
1,RT @Papapishu: Man it would fucking rule if we had a party that was against perpetual warfare.,abusive,4
2,"It is time to draw close to Him &#128591;&#127995; Father, I draw near to you now and always ❤️ https://t.co/MVRBBX2aqJ",normal,4
3,if you notice me start to act different or distant.. it's bc i peeped something you did or i notice a difference in how you act &amp; ian fw it.,normal,5
4,"Forget unfollowers, I believe in growing. 7 new followers in the last day! Stats via https://t.co/bunPHQNXhj",normal,3


In [10]:
hatefultweets = icvsm[icvsm["majority label"] == "hateful" ]
hatefultweets.head(20)

Unnamed: 0,tweets,majority label,votes on majority label
21,"Bad day for #Kyrgyzstan. Suspect in St. Petersburg metro bombing identified as #Kyrgyz, opposition politician Japarov reportedly kills self.",hateful,2
34,"I'm over the fucking moon we've cleared up the definition of an act of war. Now, about that slap on the wrist we just gave Syria.",hateful,3
37,RT @ynaoivw: nah bitch i hate u https://t.co/fHX8y7esMH,hateful,3
71,RT @WaysThingsWork: I fucking hate people &#128514;&#128514; https://t.co/Qz5gihmcQF,hateful,3
106,@VanJones68 You looked like a complete pathetic idiot tonight. How could you be so stupid Mr. Ghetto. My God you… https://t.co/eEpAih1GAj,hateful,3
147,RT @___DestinyJadai: Never argue with a bitch you can kill by just stepping out on your WORST day &#128514; we both know why that bitch mad sis &#129335;&#127998;‍…,hateful,4
152,RT @Woodparkweirdo: hell yeah I VAPE V: hate A: women P E,hateful,3
194,I talked well to you all along. I do not feel like I'm answering you bad. Damn it!!,hateful,2
221,RT @Jedi_Pite_Bre: Which state allowed 800K illegal aliens to get a license? You guessed it right-the retarded state of California https://…,hateful,3
234,"@domy1337 @MainGame6 this is polish hooker look man, come here and you'll meet them next to some forests on the road lmao",hateful,2


Most of the above tweets, in my opinion and in the opinion of an informal sample of friends and family, do not definitively qualify as hate speech. Possibly some annotators confused hateful intent with hate speech and annotated to that looser definition.

The only tweets labelled `hateful` above that a majority of my informal sample agreed were hate speech were:

* tweet 106 - `@VanJones68 You looked like a complete pathetic idiot tonight. How could you be so stupid Mr. Ghetto. My God you…`


* tweet 221 - `RT @Jedi_Pite_Bre: Which state allowed 800K illegal aliens to get a license? You guessed it right-the retarded state of California`

To mitigate this, I'll explore further and see whether the tweets with high agreement among annotators that they contain hateful content are a better fit for the hate speech definition.

In [193]:
morehatefultweets = hatefultweets[hatefultweets["votes on majority label"] > 7 ]
morehatefultweets.head(10)

Unnamed: 0,tweets,majority label,votes on majority label
10333,"Hanging at UB on Friday, Unsacred headlines the metal showcase at Macrock on Saturday, then heading to Damaged City for Marked Men on Sunday",hateful,16
13919,This was a proportionate response by the United States. It is not designed to overthrow the Assad regime...… https://t.co/PqdEWMQrhy,hateful,18
15481,I had two exams and had to choose 1 or the other to study for and the 1 I didn't study for I passed with an 81 &#127881;,hateful,8
22478,South Korean OF Kim makes Orioles opening day roster #BaltimoreOrioles #OriolesOpeningDay #Orioles #Orioles https://t.co/erAcYNSsUb,hateful,10
29915,"Black people low budget cookouts have quarter legs for the old heads and bullshit hotdogs for the ""kids"" &#128557;&#128557;&#128557;&#128557;&#128557;",hateful,14
33403,&#128196; #WinterEvent2017 On the attraction of two perfectly conducting plat on-the-attraction-of-two-perfectly-conducting-plates.pdf,hateful,22
37228,#DickCavett asks Art Garfunkel who broke up Simon &amp; Garfunkel. AG doesn't remember. I'm guessing suppression @decadesnetwork.,hateful,20
38192,Damn niggas was comparing this season to the farm which is really bad https://t.co/OVCBErCZDC,hateful,10
39529,RT @isabelaseraffim: insomnia ain't a joke bruh I'm really a fucking zombie at this point,hateful,20
43602,RT @iamwilliewill: This what happens when you separate yo self from niggas who don't eat they food cold. You FLOURISH... https://t.co/FzTIA…,hateful,15


This does not seem to be much better. The tweets marked as hateful should be discarded from the final dataset because of the unreliability of annotators.

<b>This will result in losing around 5000 tweets.</b>


These tweets will be useful for the further-pretraining stage of my project, especially as there is an abundance of informal slang and abusive terms, which are hard to come by in tweet datasets online.
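
One way the discard step described above could be expressed is to filter on the majority label before assigning the final columns. This is a minimal sketch on invented rows, not the recorded cleaning cell. The column names mirror the file loaded earlier, and `kept` is a hypothetical variable name.

```python
import pandas as pd

# Toy rows standing in for hatespeech_text_label_vote.csv (invented for illustration)
toy = pd.DataFrame({
    'tweets': ['buy now', 'some rant', 'hello world', 'borderline post'],
    'majority label': ['spam', 'abusive', 'normal', 'hateful'],
    'votes on majority label': [4, 4, 5, 3],
})

# Keep every row whose majority label is not 'hateful'
kept = toy[toy['majority label'] != 'hateful'].reset_index(drop=True)

# No remaining row is treated as hate speech; 'abusive' maps to offensive
kept['Hate_Speech'] = 0
kept['Offensive'] = (kept['majority label'] == 'abusive').astype(int)

print(len(toy) - len(kept), "row(s) discarded")
print(kept[['tweets', 'Hate_Speech', 'Offensive']])
```

Filtering first means an unreliable `hateful` row can never end up in the final data labelled as a clean negative example.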

### Cleaning and Saving Data

In [196]:
icvsm = pd.read_csv(os.path.join(DATA_DIR, 'ICVSM_2018_dataset/hatespeech_text_label_vote.csv'), sep='\t', names = \
                    ["tweets", "majority label", "votes on majority label"  ], encoding ='utf-8')

icvsm['Hate_Speech'] = 0
icvsm['Offensive'] = 0
icvsm.rename(columns = {'tweets': 'Tweet'}, inplace = True)
icvsm.loc[icvsm['majority label'] == 'abusive', 'Offensive'] = 1
icvsm.drop(['majority label', "votes on majority label"], inplace = True, axis = 1)
icvsm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99996 entries, 0 to 99995
Data columns (total 3 columns):
Tweet          99996 non-null object
Hate_Speech    99996 non-null int64
Offensive      99996 non-null int64
dtypes: int64(2), object(1)
memory usage: 2.3+ MB


In [197]:
icvsm['Tweet'] = icvsm['Tweet'].map(lambda x: x.encode('unicode-escape').decode('utf-8'))

icvsm.to_csv(os.path.join(DATA_DIR, 'cleaned/icvsm.csv'), sep = ',', encoding='utf-8', \
                 index = False, header = True)

print("Breakdown for ICWSM 2018 Data")
print("\nHate Speech Column Labelling:\n", icvsm.Hate_Speech.value_counts())
print("\nOffensive Column Labelling:\n", icvsm.Offensive.value_counts())
print("\nWhere 1 is positive, 0 is negative and - is ambiguous")

Breakdown for ICWSM 2018 Data

Hate Speech Column Labelling:
 0    99996
Name: Hate_Speech, dtype: int64

Offensive Column Labelling:
 0    72846
1    27150
Name: Offensive, dtype: int64

Where 1 is positive, 0 is negative and - is ambiguous
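
As a minimal sketch, this is one way the `hateful` rows could be discarded explicitly before the labels are mapped, as the inspection above suggests. The toy DataFrame and its rows are invented for illustration; only the column names and label values come from the cells above.

```python
import pandas as pd

# Toy stand-in for the ICVSM frame (rows invented for illustration)
toy = pd.DataFrame({
    "Tweet": ["tweet a", "tweet b", "tweet c"],
    "majority label": ["abusive", "hateful", "normal"],
})

# Discard rows whose majority label is 'hateful'
toy = toy[toy["majority label"] != "hateful"].copy()

# Everything left is treated as not hate speech; 'abusive' maps to Offensive = 1
toy["Hate_Speech"] = 0
toy["Offensive"] = (toy["majority label"] == "abusive").astype(int)
toy = toy.drop(columns=["majority label"]).reset_index(drop=True)
print(toy)
```

Filtering before the label columns are created keeps the row count and the label counts in step with each other.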


## Using Tweepy API to mine tweets via ID (Waseem and Hovy 2016)

The authors collected tweets based on a set of hate-related terms and users, and manually annotated a subset of their dataset using an outside annotator for reviewing. The final dataset consists of almost 17k tweets, 12% of which are labeled as `racist`, 20% `sexist`, and the rest are considered `normal`.

This dataset is distributed as tweet IDs only, so the tweet text must be collected through the Twitter API.

Ref: [Waseem, Zeerak and Hovy, Dirk, Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter, June 2016](https://www.aclweb.org/anthology/N16-2013/ "")

In [48]:
path = r'https://raw.githubusercontent.com/ZeerakW/hatespeech/master/NAACL_SRW_2016.csv'

ids = pd.read_csv(path, sep=',', names = ['id', 'label']);

ids.head()

Unnamed: 0,id,label
0,572342978255048705,racism
1,572341498827522049,racism
2,572340476503724032,racism
3,572334712804384768,racism
4,572332655397629952,racism


If using anaconda notebook:

In [0]:
#conda install -c conda-forge tweepy

Below is a tweepy method to obtain tweets via ID.


In [0]:
def lookup_tweets(tweet_IDs, api):
    """Fetch full tweet objects for a list of tweet IDs, 100 IDs per request."""
    full_tweets = []
    tweet_count = len(tweet_IDs)

    try:
        # Step through the IDs in batches of 100; the last batch may be smaller
        for start in range(0, tweet_count, 100):
            end_loc = min(start + 100, tweet_count)
            full_tweets.extend(
                api.statuses_lookup(id_=tweet_IDs[start:end_loc])
            )
    except tweepy.TweepError as err:
        print('Something went wrong, quitting...', err)
    # Return whatever was collected, even if a request failed part-way
    return full_tweets
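
The batching idea can be checked in isolation, without touching the API. The sketch below uses `chunk_ids`, a hypothetical helper that is not part of tweepy. Stepping through the list in strides of 100 means no empty final batch is produced when the number of IDs is an exact multiple of 100.

```python
def chunk_ids(tweet_ids, size=100):
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(tweet_ids), size):
        yield tweet_ids[start:start + size]


# 250 made-up IDs split into batches of 100, 100 and 50
batches = list(chunk_ids(list(range(250))))
print([len(batch) for batch in batches])
```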

Open a previously created JSON file with Twitter API credentials and set `wait_on_rate_limit = True` so as not to exceed rate limits, as the Twitter API only lets you look up 100 tweets per request.

In [200]:
# Load twitter API credentials from json file from local file storage
with open('C://Users/fionn/Downloads/twitter_credentials.json', "r") as f:
    creds = json.load(f)


auth = tweepy.OAuthHandler(creds['CONSUMER_KEY'], creds['CONSUMER_SECRET'])
auth.set_access_token(creds['ACCESS_TOKEN'], creds['ACCESS_SECRET'])

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

Run the cell below to load tweets via ID. This may take a while.

In [0]:
good_tweet_ids = list(ids['id']) #Create a list of tweet ids to look up 
results = lookup_tweets(good_tweet_ids, api) #apply function

Then wrangle the data into one dataframe

In [0]:
temp = json.dumps([status._json for status in results]) #create JSON
newdf = pd.read_json(temp, orient='records')
full = ids.merge(newdf,  how='left', on = 'id')
tweetSet = full[['id', 'label', 'text']]
tweetSet = tweetSet.drop_duplicates(subset = 'id')
pd.set_option('display.max_colwidth', -1)
tweetSet.head(10)

Unnamed: 0,id,label,text
0,572342978255048705,racism,So Drasko just said he was impressed the girls cooked half a chicken.. They cooked a whole one #MKR
2,572341498827522049,racism,Drasko they didn't cook half a bird you idiot #mkr
4,572340476503724032,racism,Hopefully someone cooks Drasko in the next ep of #MKR
6,572334712804384768,racism,of course you were born in serbia...you're as fucked as A Serbian Film #MKR
7,572332655397629952,racism,"These girls are the equivalent of the irritating Asian girls a couple years ago. Well done, 7. #MKR"
8,575949086055997440,racism,#MKR Lost the plot - where's the big Texan with the elephant sized steaks that they all have for brekkie ?
10,551659627872415744,racism,
11,551763146877452288,racism,
12,551768543277355009,racism,
13,551769061055811584,racism,


In [0]:
print(tweetSet.info())
dups = len(tweetSet) - tweetSet['text'].count()
print(" \nAmount of tweet IDs not returning tweets - ", dups)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16849 entries, 0 to 16994
Data columns (total 3 columns):
id       16849 non-null int64
label    16849 non-null object
text     10821 non-null object
dtypes: int64(1), object(2)
memory usage: 526.5+ KB
None
 
Amount of tweet IDs not returning tweets -  6028
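
A minimal sketch of why the left merge exposes the missing tweets, using invented toy frames: every requested ID is kept, and IDs with no API result end up with `NaN` text, which `count()` skips.

```python
import pandas as pd

# Toy ID list and toy API results (values invented for illustration)
toy_ids = pd.DataFrame({"id": [1, 2, 3, 4],
                        "label": ["racism", "none", "sexism", "none"]})
toy_found = pd.DataFrame({"id": [1, 3],
                          "text": ["first tweet", "third tweet"]})

# A left merge keeps every requested ID; IDs with no result get NaN text
toy_full = toy_ids.merge(toy_found, how="left", on="id")

# count() ignores NaN, so the difference is the number of IDs with no text
missing = len(toy_full) - toy_full["text"].count()
print(missing)
```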


### Suspended Accounts:

Although the original dataset contains 16,849 unique tweet IDs, our function retrieves only 10,821 tweets at the time of writing. In many cases this is because the Twitter accounts associated with these tweets have been suspended.

As we can see above, some tweet IDs have returned NaN for the tweet text because of this.

Demonstrated below is an attempt to retrieve one of these tweets.

In [201]:
try:
    tweet = api.get_status(551763146877452288) # Index 11 in above tweetSet dataset
except tweepy.TweepError as err:
    if err.api_code == 63:
        print(err.reason)
    else:
        raise
else:
    # Only reached when the lookup succeeded
    print(tweet.text)

[{'code': 63, 'message': 'User has been suspended.'}]


In [0]:
#Dropping rows where the tweet ID failed to return any tweet text
tweetSet = tweetSet.dropna()

#Manually inspecting tweets
tweetSet.head(20)

Unnamed: 0,id,label,text
0,572342978255048705,racism,So Drasko just said he was impressed the girls cooked half a chicken.. They cooked a whole one #MKR
2,572341498827522049,racism,Drasko they didn't cook half a bird you idiot #mkr
4,572340476503724032,racism,Hopefully someone cooks Drasko in the next ep of #MKR
6,572334712804384768,racism,of course you were born in serbia...you're as fucked as A Serbian Film #MKR
7,572332655397629952,racism,"These girls are the equivalent of the irritating Asian girls a couple years ago. Well done, 7. #MKR"
8,575949086055997440,racism,#MKR Lost the plot - where's the big Texan with the elephant sized steaks that they all have for brekkie ?
491,575174115667017728,racism,"RT @PhxKen: SIR WINSTON CHURCHHILL: ""ISLAM IS A DANGEROUS IN A MAN AS RABIES IN A DOG"" http://t.co/kCXgKD70SK"
1266,569294066984202240,racism,RT @TheRightWingM: Giuliani watched his city attacked &amp; people jump to their deaths. He's entitled to say WTF he wants about the guy shield…
1942,446460991396917248,racism,"RT @YesYoureRacist: At least you're only a tiny bit racist RT @AnMo95: I'm not racist, but my dick is!"
1943,489938636956135424,racism,@MisfitInChains @oldgfatherclock @venereveritas13 SANTA JUST *IS* WHITE


Undoubtedly there are some hate speech tweets in this dataset, for example `RT @PhxKen: SIR WINSTON CHURCHHILL: "ISLAM IS A DANGEROUS IN A MAN AS RABIES IN A DOG" http://t.co/kCXgKD70SK`.

However, many of the other tweets depend on the context of the individual being targeted at the time. The tweets about #MKR (My Kitchen Rules, an Australian show) are deemed sexist or racist at times only because of who they are talking about, for example `#katieandnikki stop calling yourselves pretty and hot..you're not and saying it a million times doesn't make you either...STFU #MKR`. Even then, is this truly sexist?

It's difficult to ascertain this without researching the context, and perhaps the text alone isn't a good enough indicator of hate speech.

Some tweets merely mention or discuss racism without being racist in intent, for example `RT @Dreamdefenders: Eric Holder from #ferguson: "I understand that mistrust. I am the Attorney General, but I am also a Black Man"`.

There are deep political complexities in some of these tweets, such as those about Ferguson and 9/11. It's difficult to justify using these annotations, so these tweets will not be classified as hate speech.

## Data from annotator reliability study - uses much of the Waseem & Hovy dataset directly above but has some extra data

The data below is from a study that examined how reliable amateur annotators are compared to experts when annotating hate speech. It reuses a lot of data from the Waseem and Hovy 2016 dataset above (there is an overlap of 2,876 tweets).

I only retrieved the opinions of the <b>expert annotators</b>, as the paper itself claims that the amateur annotators are unreliable. Even so, the tweets here are annotated much like the previous Waseem and Hovy dataset, so the "racism" and "sexism" tweets won't be used.

Ref: [Waseem, Zeerak, Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter, November 2016](http://aclweb.org/anthology/W16-5618 "")

In [0]:
path = r'https://raw.githubusercontent.com/ZeerakW/hatespeech/master/NLP%2BCSS_2016.csv'

ids = pd.read_csv(path, sep='\t', usecols = ["TweetID", "Expert"], index_col=False);

ids.head()

Unnamed: 0,TweetID,Expert
0,597576902212063232,neither
1,565586175864610817,neither
2,563881580209246209,neither
3,595380689534656512,neither
4,563757610327748608,neither


In [0]:
ids.rename(columns = {'TweetID':'id'}, inplace = True)
ids.rename(columns = {'Expert': 'label'}, inplace = True)
ids.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6909 entries, 0 to 6908
Data columns (total 2 columns):
id       6909 non-null int64
label    6909 non-null object
dtypes: int64(1), object(1)
memory usage: 108.1+ KB


In [0]:
good_tweet_ids = list(ids['id']) #Create a list of tweet ids to look up 
results = lookup_tweets(good_tweet_ids, api) #apply function

In [0]:
temp = json.dumps([status._json for status in results]) #create JSON
newdf = pd.read_json(temp, orient='records')
full = ids.merge(newdf,  how='left', on = 'id')

In [0]:
tweetSet1 = full[['id', 'label', 'text']]
tweetSet1 = tweetSet1.drop_duplicates(subset = 'id')
pd.set_option('display.max_colwidth', -1)
dups1 = len(tweetSet1) - tweetSet1['text'].count()
tweetSet1.dropna(inplace=True)
tweetSet1.head(10)

Unnamed: 0,id,label,text
0,597576902212063232,neither,Cisco had to deal with a fat cash payout to the FSF *and* allow an external party to do constant reviews of their FOSS license compliancy.
1,565586175864610817,neither,"@MadamPlumpette I'm decent at editing, no worries ^.^"
2,563881580209246209,neither,@girlziplocked will read. gotta go afk for a bit - still bringing stuff in from car after week long road trip.
3,595380689534656512,neither,guys. show me the data. show me your github. tell me your story. show me something that makes me think you're not a bag of useless opinions.
4,563757610327748608,neither,@tpw_rules nothings broken. I was just driving througg a lot of water.
5,563082741370339330,neither,ur face is classified as a utility by the FCC.
6,596962098845851648,neither,@lysandraws yay! Absolutely. I'm not gone until November :)
7,563874350038675457,neither,"RT @kashiichan: ""It really feels like the @twitter DM can be the hand-on-the-knee of social communication."" http://t.co/7mFseL5zfE #stopwad…"
8,597240424873394176,neither,@SirenSailor rtfm. http://t.co/jaMXHikl3u
10,595306172833353728,neither,@Popehat who wouldn't?


<b>Data content of both datasets after dropping null rows:</b>

In [0]:
print("There are", len(tweetSet.index), "tweets in the first dataframe;",\
      dups, "tweet IDs returned no text")

print("And", len(tweetSet1.index), "tweets in the second dataframe;",\
     dups1, "tweet IDs returned no text")

There are 10821 tweets in the first dataframe; 6028 tweet IDs returned no text
And 6221 tweets in the second dataframe; 688 tweet IDs returned no text


### Combining the two datasets 

Hopefully the merge function will let me reliably combine the two datasets with no duplicates in the final result. As stated above, there is supposed to be an overlap of 2,786 tweets, so the final combined set should contain 16,907 + 6,909 - 2,786 - 5,983 - 669 = 14,378 tweets.
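
The estimate can be written out term by term. The variable names below are my own, and reading 5,983 and 669 as the tweets that could not be retrieved from each set is an assumption based on the surrounding cells, not something the paragraph states.

```python
# Figures quoted in the paragraph above
ids_first = 16907       # tweet IDs in the first dataset
ids_second = 6909       # tweet IDs in the second dataset
overlap = 2786          # tweets shared by both datasets
missing_first = 5983    # assumed: first-set IDs that returned no text
missing_second = 669    # assumed: second-set IDs that returned no text

expected = ids_first + ids_second - overlap - missing_first - missing_second
print(expected)
```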

In [0]:
fullSet = pd.merge(tweetSet, tweetSet1,  how='outer', on = ['id', 'text', 'label'])
fullSet.drop_duplicates(subset= ['id'], keep = 'last', inplace = True)

#Intermittent save of combined data
fullSet.to_csv(os.path.join(DATA_DIR, 'Waseem_Hovy_2016.csv'), sep = ',', \
               encoding='utf-8', index = False, header = True)
fullSet.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13092 entries, 3 to 16966
Data columns (total 3 columns):
id       13092 non-null int64
label    13092 non-null object
text     13092 non-null object
dtypes: int64(1), object(2)
memory usage: 409.1+ KB
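
A small sketch of the combine step on invented toy frames. Because the outer merge joins on `id`, `text` and `label` together, a tweet that carries a different label in each dataset survives as two rows, which is why the `drop_duplicates` on `id` is still needed afterwards.

```python
import pandas as pd

# Toy frames: id 2 appears in both, with a different label in each
first = pd.DataFrame({"id": [1, 2], "label": ["racism", "sexism"],
                      "text": ["a", "b"]})
second = pd.DataFrame({"id": [2, 3], "label": ["neither", "neither"],
                       "text": ["b", "c"]})

# Rows only collapse into one when id, text AND label all match
merged = pd.merge(first, second, how="outer", on=["id", "text", "label"])

# Keep a single row per tweet ID
combined = merged.drop_duplicates(subset=["id"], keep="last")
print(len(merged), len(combined))
```

Which label survives for a shared ID depends on the row order after the merge, so it is worth inspecting a few shared IDs before relying on `keep='last'`.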


Not the exact number of tweets we were expecting, but this is likely because all of those tweets could be retrieved when the dataset was first collected, whereas now many can't be because the accounts they belong to are suspended.

### Cleaning and Saving the Dataset

If a tweet has been labelled as `racism`, `sexism` or `both`, we'll remove it, as these labels have been demonstrated to be unreliably annotated.

The rest of the dataset will be labelled as `not hate speech`. However, we can't be sure that the tweets labelled `none` or `neither` aren't offensive, so we'll set their `Offensive` value to `-`.

In [213]:
fullset = pd.read_csv( os.path.join( DATA_DIR, 'Waseem_Hovy_2016.csv'),sep=',')

fullset.dropna(inplace = True)
print("\nThere are 5 different labels for how the tweets have been annotated:")
print(fullset['label'].unique(),'\n')

fullset['Hate_Speech'] = 0
fullset['Offensive'] = '-'

fullset.rename(columns = {'text': 'Tweet'}, inplace = True)

fullset.drop(fullset[(fullset["label"] == 'racism') | \
                     (fullset["label"] == 'sexism') | \
                  (fullset["label"] == 'both')].index, inplace = True)

fullset.drop(['label', 'id'], inplace = True, axis = 1)
fullset.reset_index(drop = True, inplace = True)


There are 5 different labels for how the tweets have been annotated:
['racism' 'sexism' 'none' 'neither' 'both'] 



In [214]:
fullset['Tweet'] = fullset['Tweet'].map(lambda x: x.encode('unicode-escape').decode('utf-8'))

fullset.to_csv(os.path.join(DATA_DIR, 'cleaned/Waseem_Hovy.csv'), sep = ',', encoding='utf-8', \
                 index = False, header = True)

print("Breakdown for Waseem & Hovy 2016 Data")
print("\nHate Speech Column Labelling:\n", fullset.Hate_Speech.value_counts())
print("\nOffensive Column Labelling:\n", fullset.Offensive.value_counts())
print("\nWhere 1 is positive, 0 is negative and - is ambiguous")

Breakdown for Waseem & Hovy 2016 Data

Hate Speech Column Labelling:
 0    9579
Name: Hate_Speech, dtype: int64

Offensive Column Labelling:
 -    9579
Name: Offensive, dtype: int64

Where 1 is positive, 0 is negative and - is ambiguous
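
The same filtering can be expressed with `isin`, which some readers may find easier to scan than chained `|` conditions. This is only a sketch on an invented toy frame, not a replacement for the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "label": ["racism", "none", "both", "neither", "sexism"],
    "Tweet": ["t1", "t2", "t3", "t4", "t5"],
})

# Labels treated as unreliably annotated in this notebook
unreliable = ["racism", "sexism", "both"]

kept = toy[~toy["label"].isin(unreliable)].copy()
kept["Hate_Speech"] = 0    # treated as not hate speech
kept["Offensive"] = "-"    # offensiveness unknown, marked ambiguous
kept = kept.drop(columns=["label"]).reset_index(drop=True)
print(kept)
```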


## Testing for possible duplicate tweets

<b> Where tweet ID is available, we want to use it at the beginning to root out possible duplicate tweets which may have overlapped over the datasets. </b>
    
Removing duplicates via tweet ID should be more reliable, but afterwards we'll also attempt to remove duplicates via the text content and see whether that is a reliable method.

Only two datasets come with tweet IDs: the Waseem & Hovy dataset and the ICVSM dataset, which stores its IDs and labels in a separate csv file in the directory.

I'll now test whether there is any overlap between the ICVSM 2018 tweets and the Waseem & Hovy tweets.


In [242]:
icvsm1 = pd.read_csv(os.path.join(DATA_DIR, 'ICVSM_2018_dataset/hatespeech_id_label.csv') \
                     , sep=',', names = ['id', 'label'], index_col = False)

print("\nThere are", len(icvsm1.index), "tweets in this dataset")
icvsm1.head()


There are 99996 tweets in this dataset


Unnamed: 0,id,label
0,849667487180259329,abusive
1,850490912954351616,abusive
2,848791766853668864,abusive
3,848306464892604416,abusive
4,850010509969465344,normal


In [243]:
print("There are", len(icvsm1.index), "tweets in the ICVSM dataset and", \
      len(ids.index), "tweets in the Waseem_Hovy dataset")

print("\nThe end merge of datasets should have", len(icvsm1.index) + len(ids.index),\
      "tweets in the dataset assuming no duplicates")


There are 99996 tweets in the ICVSM dataset and 13092 tweets in the Waseem_Hovy dataset

The end merge of datasets should have 113088 tweets in the dataset assuming no duplicates


In [245]:
ids = pd.read_csv(os.path.join( DATA_DIR, 'Waseem_Hovy_2016.csv'), sep=',')
ids.dropna(inplace = True)
#Make sure both id columns are of same type before merge
icvsm1['id'] = icvsm1['id'].astype('int64')
ids['id'] = ids['id'].astype('int64')
ids1 =ids.drop(['text'], axis = 1)

dt = pd.merge(icvsm1, ids1,  how='outer', on = ['id', 'label'])
duplicateRows = dt[dt.duplicated(subset = ['id'], keep = False)]

print("When using the duplicated method there are", \
      len(duplicateRows), "duplicate ids identified.")

print("we'll obviously retain half of these because we'll keep the first instance of the duplicate tweet",\
     "this will result in", int(len(duplicateRows)/2), "tweets being dropped")

duplicateRows.head(10)

When using the duplicated method there are 394 duplicate ids identified.
we'll obviously retain half of these because we'll keep the first instance of the duplicate tweet this will result in 197 tweets being dropped


Unnamed: 0,id,label
10,849087242987593728,abusive
11,849087242987593728,abusive
17,849282894682050564,abusive
18,849282894682050564,abusive
42,849881409284182016,normal
43,849881409284182016,normal
50,848600351381098496,hateful
51,848600351381098496,hateful
55,848975292794318848,normal
56,848975292794318848,normal


As shown above, identifying duplicate tweets via ID is quite reliable, as the IDs above are identical. They often sit at consecutive indexes, which leads me to believe they result from human error in the large ICVSM 2018 database of 100,000 tweets.

This isn't a huge amount of error considering the overall size of the dataframe. However, we do need to weed out these duplicates, because they may give us inaccurate scores if we use cross-validation to evaluate model performance down the line.

In [246]:
dt.drop_duplicates(subset= ['id'], keep = 'first', inplace = True)
print("When using the drop_duplicates method there are",\
(len(icvsm1.index) + len(ids.index) - len(dt.index)), "duplicate tweets\n")

print(dt.info())
dt.head()

When using the drop_duplicates method there are 197 duplicate tweets

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112891 entries, 0 to 113087
Data columns (total 2 columns):
id       112891 non-null int64
label    112891 non-null object
dtypes: int64(1), object(1)
memory usage: 2.6+ MB
None


Unnamed: 0,id,label
0,849667487180259329,abusive
1,850490912954351616,abusive
2,848791766853668864,abusive
3,848306464892604416,abusive
4,850010509969465344,normal


<b> Let's see if dropping duplicate entries via text content is a reliable method </b>

In [255]:
icvsm1 = os.path.join(DATA_DIR, 'ICVSM_2018_dataset/hatespeech_text_label_vote.csv')
icvsm1 = pd.read_csv(icvsm1, sep='\t', names = \
                    ["text", "label" ], index_col = False);

icvsm1.drop(icvsm1.loc[icvsm1['label']=='hateful'].index, inplace=True)

dt = pd.merge(icvsm1, ids,  how='outer', on = ['text', 'label'])

duplicateRows = dt[dt.duplicated(subset = ['text'], keep = False)]

print("When using the duplicated method there are", \
      len(duplicateRows), "duplicate tweets identified.")

print("\nWe'll retain half of these because we'll keep the first instance of the duplicate tweet",\
     "this will result in", int(len(duplicateRows)/2), "tweets being dropped")
duplicateRows.drop(['id'], axis = 1, inplace = True)
duplicateRows.head(30)

When using the duplicated method there are 8809 duplicate tweets identified.

We'll retain half of these because we'll keep the first instance of the duplicate tweet this will result in 4404 tweets being dropped


Unnamed: 0,text,label
1,RT @Papapishu: Man it would fucking rule if we had a party that was against perpetual warfare.,abusive
2,RT @Papapishu: Man it would fucking rule if we had a party that was against perpetual warfare.,abusive
3,RT @Papapishu: Man it would fucking rule if we had a party that was against perpetual warfare.,abusive
7,RT @Vitiligoprince: Hate Being sexually Frustrated Like I wanna Fuck But ion wanna Just fuck anybody,abusive
8,RT @Vitiligoprince: Hate Being sexually Frustrated Like I wanna Fuck But ion wanna Just fuck anybody,abusive
12,"RT @LestuhGang_: If your fucking up &amp; your homies dont tell you that your fucking up, those ain't your homies",abusive
13,"RT @LestuhGang_: If your fucking up &amp; your homies dont tell you that your fucking up, those ain't your homies",abusive
16,RT @ennoia3: That's one way he pulls you in RT@amysreedusxx norman fucking reedus just threw candy at me when will your fav ever https://t.…,abusive
17,RT @ennoia3: That's one way he pulls you in RT@amysreedusxx norman fucking reedus just threw candy at me when will your fav ever https://t.…,abusive
20,RT @EiramAydni: Im a nasty ass freak when I like you..,abusive


There are quite a lot of identical tweets through retweets (`RT`) that aren't picked up when dropping via tweet IDs. These tweets come from different sources and thus have different tweet IDs.

We do NOT want duplicate text entries in our final set, as they would contaminate our validation set when we eventually use cross-validation to evaluate the performance of our model.

Thus, we will drop duplicate tweets by text. This method also seems to have identified duplicates reliably, which is a plus.
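
Halving the `keep=False` count only matches the number of dropped rows when every duplicate appears exactly twice. The output above includes a tweet that appears three times, so the sketch below (toy rows invented here) counts with `keep='first'`, which flags exactly the rows that `drop_duplicates` would remove.

```python
import pandas as pd

toy = pd.DataFrame({"text": ["RT @a: hi", "RT @a: hi", "RT @a: hi",
                             "RT @b: yo", "RT @b: yo", "unique"]})

# keep=False flags every row that belongs to a duplicated group
flagged = int(toy.duplicated(subset=["text"], keep=False).sum())

# keep='first' flags only the rows that drop_duplicates would remove
dropped = int(toy.duplicated(subset=["text"], keep="first").sum())

print(flagged, flagged // 2, dropped)
```

Here the halved count and the true number of dropped rows disagree, because one tweet occurs three times.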

<b> I have a suspicion that most of the duplicate entries come from within the ICVSM database.</b>
    
I'll test this quickly below by counting the duplicates within the ICVSM database and subtracting that from the overlap I <i>believed</i> existed between the two datasets when I assumed there were no duplicates within the ICVSM dataset.

In [261]:
duplicateRows1 = icvsm1[icvsm1.duplicated(subset = ['text'], keep = False)]

difference = len(duplicateRows) - len(duplicateRows1)

print("The actual overlap between ICVSM 2018 and the Waseem & Hovy Database is",\
     difference)

The actual overlap between ICVSM 2018 and the Waseem & Hovy Database is 0


<b> Okay, so all the duplicates were within the ICVSM dataset. </b>
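
A more direct cross-check, sketched on invented toy frames, is to separate the two questions: how many texts repeat inside one dataset, and which texts appear in both. A set intersection answers the second question regardless of how often a text repeats.

```python
import pandas as pd

# Toy stand-ins for the two datasets (rows invented for illustration)
a = pd.DataFrame({"text": ["RT x", "RT x", "hello", "shared"]})
b = pd.DataFrame({"text": ["shared", "other"]})

# Repeats inside a single dataset
within_a = int(a["text"].duplicated(keep="first").sum())

# Texts present in both datasets
overlap = set(a["text"]) & set(b["text"])

print(within_a, overlap)
```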

## Finally combining all of the data 

This combined dataset will certainly be used in further pre-training; however, it remains to be seen whether it will be used for the final hate speech classifier.

This notebook has demonstrated that each hate speech dataset online has its own interpretation of what qualifies as hate speech, depending on who annotated it.

Thus, assessing the performance of an automatic classifier by training and testing on this combined dataset may not be fair, seeing as it's difficult for a human to distinguish a consistent pattern, never mind a machine.

A discernible pattern may be much more identifiable <i>within</i> datasets. Therefore I'll assess the performance of my classifier that way.
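
One way to keep within-dataset evaluation possible after combining is to tag each row with its origin. The `Source` column and the toy frames below are hypothetical and are not part of the saved files in this notebook.

```python
import pandas as pd

# Toy per-dataset frames (rows invented for illustration)
frames = {
    "hateval": pd.DataFrame({"Tweet": ["a", "b"], "Hate_Speech": [1, 0]}),
    "icvsm": pd.DataFrame({"Tweet": ["c", "d", "e"], "Hate_Speech": [0, 0, 0]}),
}

# Tag each row with the dataset it came from before concatenating
tagged = pd.concat(
    [df.assign(Source=name) for name, df in frames.items()],
    ignore_index=True,
)

# Per-dataset views allow training and testing within a single source
per_source = {name: group for name, group in tagged.groupby("Source")}
print({name: len(group) for name, group in per_source.items()})
```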

In [14]:
alldf = pd.read_csv(os.path.join(DATA_DIR, 'cleaned/hateval.csv'), \
                    sep = ',', encoding='ISO-8859-1')

offens = pd.read_csv(os.path.join(DATA_DIR, 'cleaned/offens.csv'), \
                    sep = ',', encoding='ISO-8859-1')

sometweets = pd.read_csv(os.path.join(DATA_DIR, 'cleaned/sometweets.csv'), \
                    sep = ',', encoding='ISO-8859-1')

icvsm = pd.read_csv(os.path.join(DATA_DIR, 'cleaned/icvsm.csv'), \
                    sep = ',', encoding='ISO-8859-1')

fullset = pd.read_csv(os.path.join( DATA_DIR, 'cleaned/Waseem_Hovy.csv'), \
                    sep = ',', encoding='ISO-8859-1')


final_df = pd.concat([offens, sometweets, alldf, icvsm, fullset], sort = True, axis = 0)
final_df.reset_index(drop = True, inplace = True)
print(final_df.info())
final_df.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152995 entries, 0 to 152994
Data columns (total 3 columns):
Hate_Speech    152995 non-null int64
Offensive      152995 non-null object
Tweet          152995 non-null object
dtypes: int64(1), object(2)
memory usage: 3.5+ MB
None


Unnamed: 0,Hate_Speech,Offensive,Tweet
0,0,1,@USER She should ask a few native Americans what their take on this is.
1,0,0,Amazon is investigating Chinese employees who are selling internal data to third-party sellers looking for an edge in the competitive marketplace. URL #Amazon #MAGA #KAG #CHINA #TCOT
2,0,1,"@USER Someone should'veTaken"" this piece of shit to a volcano. \U0001f602"""
3,0,0,@USER @USER Obama wanted liberals &amp; illegals to move into red states
4,0,1,@USER Liberals are all Kookoo !!!


In [15]:
final_df1 = final_df.drop_duplicates(subset= ['Tweet'], keep = 'first')
final_df1.reset_index(drop = True, inplace = True)
diff = len(final_df.index) - len(final_df1.index)

print("The amount of tweets lost in the final dataframe by using the drop duplicates by text entry is", \
      diff, "which is {0:.2}%".format(diff/len(final_df.index) * 100),  "of the original dataset")

The amount of tweets lost in the final dataframe by using the drop duplicates by text entry is 8070 which is 5.3% of the original dataset


Most of these duplicate tweets are likely from the ICVSM 2018 dataset.

In [16]:
print( "There are", len(final_df1.index), "tweets in the final dataset\n")
final_df1.info()

There are 144925 tweets in the final dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144925 entries, 0 to 144924
Data columns (total 3 columns):
Hate_Speech    144925 non-null int64
Offensive      144925 non-null object
Tweet          144925 non-null object
dtypes: int64(1), object(2)
memory usage: 3.3+ MB


In [312]:
#Work on an explicit copy so the assignments below don't trigger a SettingWithCopyWarning
final_df1 = final_df1.copy()
final_df1['Tweet'] = final_df1['Tweet'].map(lambda x: x.encode('unicode-escape').decode('utf-8'))
final_df1['Offensive'] = final_df1['Offensive'].astype(str)
final_df1.to_csv(os.path.join(DATA_DIR, 'final.csv'), sep = ',', encoding='utf-8', \
                 index = False, header = True)

print("Hate Speech Column Labelling:\n", final_df1.Hate_Speech.value_counts())

print("\nOffensive Column Labelling:\n", final_df1.Offensive.value_counts())


Hate Speech Column Labelling:
 0    141529
1    4518  
Name: Hate_Speech, dtype: int64

Offensive Column Labelling:
 0    84228
1    46450
-    15369
Name: Offensive, dtype: int64
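
A short sketch of what the `unicode-escape` step used in the save cells does to a tweet, on an invented example string. Since the cleaned files were already escaped when they were saved, applying the step again to text read back from them may double the backslashes, so it is worth checking a few emoji tweets in `final.csv`.

```python
raw = "so tired \U0001f602"

# One pass turns the emoji into a literal backslash sequence
once = raw.encode("unicode-escape").decode("utf-8")

# A second pass escapes the backslash itself
twice = once.encode("unicode-escape").decode("utf-8")

# A single pass can be reversed to recover the original text
restored = once.encode("utf-8").decode("unicode-escape")

print(once)
print(twice)
print(restored == raw)
```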


## AnalyticsVidhya Dataset - Online Hate Speech Competition

Below I'll inspect the dataset for the AnalyticsVidhya hate speech competition to see whether its hate speech data qualifies and whether it should be added to my combined set.

[Competition Link](https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/# "") 

Leaderboard name: `fionn49`

In [314]:
df = pd.read_csv('gs://csc3002/trial/train_E6oV3lV.csv', sep = ',', encoding='ISO-8859-1')
count = df.label.value_counts()
print("There are {:d} non hate speech tweets and {:d} hate speech tweets\n".format(count[0], count[1]) )
df.head()

There are 29720 non hate speech tweets and 2242 hate speech tweets



Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in urÃ°ÂÂÂ±!!! Ã°ÂÂÂÃ°ÂÂÂÃ°ÂÂÂÃ°ÂÂÂÃ°ÂÂÂ¦Ã°ÂÂÂ¦Ã°ÂÂÂ¦
4,5,0,factsguide: society now #motivation


In [315]:
hatedf = df[df["label"] == 1]
hatedf.drop(columns = ['id'], axis = 1, inplace = True)
hatedf.reset_index(drop = True, inplace = True)
hatedf.head(25)

Unnamed: 0,label,tweet
0,1,@user #cnn calls #michigan middle school 'build the wall' chant '' #tcot
1,1,no comment! in #australia #opkillingbay #seashepherd #helpcovedolphins #thecove #helpcovedolphins
2,1,retweet if you agree!
3,1,@user @user lumpy says i am a . prove it lumpy.
4,1,it's unbelievable that in the 21st century we'd need something like this. again. #neverump #xenophobia
5,1,@user lets fight against #love #peace
6,1,Ã°ÂÂÂ©the white establishment can't have blk folx running around loving themselves and promoting our greatness
7,1,"@user hey, white people: you can call people 'white' by @user #race #identity #medÃ¢ÂÂ¦"
8,1,how the #altright uses &amp; insecurity to lure men into #whitesupremacy
9,1,@user i'm not interested in a #linguistics that doesn't address #race &amp; . racism is about #power. #raciolinguistics bringsÃ¢ÂÂ¦


This seems to be the most inaccurately annotated dataset so far. Not only are these tweets not hate speech, but the sentences are often grammatically incoherent.

Therefore these tweets will not be merged with the combined dataset: they certainly aren't hate speech, and we don't need more benign tweets for our classifier (if we do elect to train and test on the combined set), as the combined set is imbalanced enough already.

However, I'll still train my classifier on this dataset on its own, as it's part of a competition and I'd like to compare my classifier's performance to other people's.

# Summary

* The only two datasets that were found to have reliably annotated hate speech were HatEval 2019 - `4210 hate tweets` and ICVSM 2018 - `308 hate tweets`


* This notebook shows there is an inconsistency between datasets as to what constitutes hate speech. Using hate speech from different datasets to train a model may therefore actually harm performance, as contradictory tweets could confuse the model. Training and testing within datasets is therefore the solution.


* The final combined dataset has ~3.1% tweets annotated as hate speech and ~32% tweets annotated as offensive.
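
The percentages in the last bullet can be reproduced from the label counts printed in the final save cell. The dictionaries below simply copy those printed counts; the variable names are my own.

```python
# Counts copied from the final value_counts() output above
hate_counts = {"0": 141529, "1": 4518}
offensive_counts = {"0": 84228, "1": 46450, "-": 15369}

hate_pct = 100 * hate_counts["1"] / sum(hate_counts.values())
offensive_pct = 100 * offensive_counts["1"] / sum(offensive_counts.values())

print(f"{hate_pct:.1f}% hate speech, {offensive_pct:.1f}% offensive")
```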