In [2]:
import re
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
sns.set_style("dark")
sns.set_context("talk")

In [6]:
path = 'maindatahnh.csv'
df = pd.read_csv(path, encoding='latin1')
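
A note on `encoding='latin1'`: it never raises a decode error, because every byte maps to some character, so it can hide a file that is really UTF-8. One of the later outputs in this notebook shows `Â·`, which is what a UTF-8 middle dot looks like when decoded as Latin-1. The snippet below is a small illustration of that effect on a made-up byte string; whether the CSV itself is UTF-8 is an assumption that would need checking.

```python
# The middle dot (U+00B7) takes two bytes in UTF-8.
raw = "epis · tro · phe".encode("utf-8")

as_latin1 = raw.decode("latin1")  # each byte becomes its own character
as_utf8 = raw.decode("utf-8")     # the two bytes are read as one character

print(as_latin1)
print(as_utf8)
```

If the file does turn out to be UTF-8, passing `encoding='utf-8'` to `pd.read_csv` would avoid the stray `Â` characters.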

In [5]:
df.head()

Unnamed: 0,label,file_id,user_id,subforum_id,num_contexts,data,Unnamed: 6,Unnamed: 7
0,noHate,12834217_1,572066,1346,0,"As of March 13th , 2014 , the booklet had been...",,
1,noHate,12834217_10,572066,1346,0,Thank you in advance. : ) Download the youtube...,,
2,noHate,12834217_2,572066,1346,0,In order to help increase the booklets downloa...,,
3,noHate,12834217_3,572066,1346,0,( Simply copy and paste the following text int...,,
4,hate,12834217_4,572066,1346,0,Click below for a FREE download of a colorfull...,,


Here are the meanings of the column values, per the original authors:

`count` = number of CrowdFlower users who coded each tweet (min is 3, sometimes more users coded a tweet when judgments were determined to be unreliable by CF).

`hate_speech` = number of CF users who judged the tweet to be hate speech.

`offensive_language` = number of CF users who judged the tweet to be offensive.

`neither` = number of CF users who judged the tweet to be neither offensive nor non-offensive.

`class` = class label for majority of CF users: 0 = hate speech, 1 = offensive language, 2 = neither.

`tweet` = the actual text of the tweet
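
These definitions imply two things that are easy to check: the three vote columns should add up to `count`, and `class` should be the category with the most votes. The sketch below tests both on a toy frame whose rows are copied from the sample outputs shown later in this notebook. The variable names are illustrative, and the tie-breaking behaviour (first column wins) is an assumption, not something the original authors document.

```python
import pandas as pd

# Toy rows copied from the sample outputs in this notebook
toy = pd.DataFrame({
    "count": [3, 6, 6, 3],
    "hate_speech": [2, 1, 1, 0],
    "offensive_language": [0, 1, 4, 2],
    "neither": [1, 4, 1, 1],
    "class": [0, 2, 1, 1],
})
vote_cols = ["hate_speech", "offensive_language", "neither"]

# Check 1: the per-category votes add up to the number of coders
votes_sum_to_count = bool((toy[vote_cols].sum(axis=1) == toy["count"]).all())

# Check 2: the class label is the category with the most votes.
# idxmax returns the winning column name; map it to 0/1/2.
name_to_id = {name: i for i, name in enumerate(vote_cols)}
majority = toy[vote_cols].idxmax(axis=1).map(name_to_id)
labels_match = bool((majority == toy["class"]).all())

print(votes_sum_to_count, labels_match)
```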

Let's see how many of each class we have.

In [6]:
plt.figure(figsize=(12, 8))
ax = sns.countplot(x="class", data=df)
plt.title('Distribution of Speech in Dataset')
plt.xlabel('') # Don't print "class"
plt.xticks(np.arange(3), ['Hate speech', 'Offensive language', 'Neither'])

# Print the number above each bar
for p in ax.patches:
    x = p.get_bbox().get_points()[:, 0]
    y = int(p.get_bbox().get_points()[1, 1])
    ax.annotate(y, (x.mean(), y), 
            ha='center', va='bottom')


ValueError: Could not interpret input 'class'

<matplotlib.figure.Figure at 0x7f2a83acf7f0>

Note the unusual distribution of types of text. Because all of the tweets were pulled from HateBase, hate speech and offensive language are significantly over-represented, as we see above.
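
The `ValueError` above is what seaborn raises when the named column is missing from `data`. The `df.head()` output earlier lists a `label` column rather than `class`, so a version that works with either file might look like the sketch below. The helper name `class_counts` and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd

def class_counts(frame, candidates=("class", "label")):
    """Return value counts for the first candidate column found in frame."""
    for col in candidates:
        if col in frame.columns:
            return frame[col].value_counts().sort_index()
    raise KeyError(f"none of {candidates} found in {list(frame.columns)}")

# Toy frame shaped like the df.head() output (label column, no class column)
toy = pd.DataFrame({"label": ["noHate", "noHate", "hate", "noHate"]})
counts = class_counts(toy)
print(counts)
```

The same column name can then be passed as `x=` to `sns.countplot`.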

Let's see how much agreement there was. We'll check if any were called hate speech by at least one person and neither by at least one other.

In [8]:
hate_neither = df[(df['hate_speech'] != 0) & (df['neither'] != 0)]
hate_neither.sample(20)

Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
7164,3,2,0,1,0,@saraelizabethj4 that's what happens when you ...
4702,6,1,1,4,2,@RonanFarrow Yeah we know about Anglo-American...
6319,3,1,0,2,2,@iamkrause \nI ain't never orderin from no col...
13098,6,1,0,5,2,Maybe @barackobama is colored blind with his '...
3576,6,1,4,1,1,@ImASpiderG you a bitch nigga
7562,3,2,0,1,0,"@yawlknow @HoskinsTy96 ""all of these rednecks ..."
3700,3,1,0,2,2,@JamesdaJewison &lt;&lt;BEST NAME ON TWITTER E...
22618,3,1,0,2,2,"This target is ghetto, but whatever"
8029,3,1,0,2,2,Aren't these little border jumpers supposed to...
21963,3,1,0,2,2,The Perfect Gift for the Jihadi on Your Shoppi...


There's a fair amount of disagreement, including some tweets that received at least one vote in every category.

This suggests that Bayes error will be relatively high for this dataset. It would be hard to get a good estimate of that, but we can see how many tweets were unanimously agreed upon. This should give a decent baseline for a classifier.

In [9]:
all_three = df[(df['hate_speech'] != 0) & (df['neither'] != 0) & (df['offensive_language'] != 0)]
hate_offensive = df[(df['hate_speech'] != 0) & (df['offensive_language'] != 0)]
offensive_neither = df[(df['neither'] != 0) & (df['offensive_language'] != 0)]

In [10]:
all_multiple = pd.concat([hate_neither, hate_offensive, offensive_neither]).drop_duplicates()

In [11]:
all_multiple.sample(20)

Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
9506,3,1,2,0,1,Fuck dat bitch
8717,3,1,2,0,1,Citi field bitches http://t.co/gHxXmb3M
8293,3,1,2,0,1,Bitches be like im nt beefing over a niggah smh
19107,3,1,2,0,1,RT @brittanyaflores: you got niggas &amp; I go...
19918,3,1,2,0,1,RT @laceeybugg: @hedge_brandon you're such a q...
2839,6,0,5,1,1,@CapoDaAssHole naw where can I see that hoe at
25003,3,0,2,1,1,that hoe aint there anymore
10690,3,1,2,0,1,I had a date wit Irene nd that bitch stood me ...
3266,6,4,1,1,0,@Flow935 jus wanted to let y&#225;ll know hope...
15940,3,0,1,2,2,RT @I_Be_kOoLz Food be good...except that rice...


Let's look at how many were unanimous.

In [12]:
disputed = len(all_multiple)/len(df)
print("{disputed:.1%} of the samples were disputed. {unanimous:.1%} were unanimous.".format(disputed=disputed, unanimous=1-disputed))

29.5% of the samples were disputed. 70.5% were unanimous.


This gives us a good idea of how well we'll be able to do. Even a "great" classifier likely won't agree with the majority all of the time, as 29.5% of the samples had at least one dissenting opinion. This isn't quite Bayes error because it counts any disagreement, but it suggests that this won't be easy.
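
The unanimous share can also be computed directly, without building the pairwise subsets: a row is unanimous when a single category received all of the votes. This is a sketch on made-up vote counts, and it assumes the three vote columns add up to `count`.

```python
import pandas as pd

# Made-up vote counts in the same layout as the tweet dataset
votes = pd.DataFrame({
    "count": [3, 6, 3, 3],
    "hate_speech": [2, 1, 0, 0],
    "offensive_language": [0, 1, 3, 0],
    "neither": [1, 4, 0, 3],
})
vote_cols = ["hate_speech", "offensive_language", "neither"]

# Unanimous: the largest vote count equals the number of coders
unanimous = votes[vote_cols].max(axis=1) == votes["count"]

# Share of coders who sided with the majority, per row
agreement = votes[vote_cols].max(axis=1) / votes["count"]

unanimous_share = unanimous.mean()
print(f"{unanimous_share:.1%} unanimous")
```

The per-row `agreement` value gives a finer-grained view than the disputed/unanimous split, since a 5-to-1 vote and a 2-to-1 vote are both "disputed" but not equally so.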

### Tweets

Let's look in more detail at the tweets.

In [13]:
# Stop pandas from truncating text
pd.set_option('display.max_colwidth', None)  # None means no limit; -1 is deprecated since pandas 1.0

In [14]:
df['tweet'].sample(20, random_state=0)

4251     @Misplaced_Momma \nHello dare momma. I show glad dat u follow me. U lbe screamin hot in dem pitchers. \nYou eva b wit colored men?                        
15438    RT @DymondMarie1: Shoutout to your main bitch !                                                                                                           
16926    RT @MsKeeKee90s: Throw it up! Like a pizza! Get stirring in that pussy like a feature &#128514;&#128514;&#128514; @t_ndyy @PamiiBabesh                    
4399     @Nien_Nunb what did you search? Gay redneck episode 1 play?                                                                                               
4380     @Ncoleycole u still a mutt tho lol                                                                                                                        
4998     @Taylor_Simonee you have your nips pierced too ?! &#127806;&#128064;&#127806;                                                                             
17363    RT @Qui

## Cleaning the data

We'll clean this up by replacing URLs and mentions with placeholder tokens, but we'll leave hashtags in. It looks like there are also emojis, like this 😭, that have been replaced with text strings like this `&#128557;`. We'll replace those with a placeholder too, using a regex for each pattern.

It looks like there's also an ampersand issue: `&` shows up as `&amp;`. We'll fix that too.
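
Both the `&#128557;` strings and `&amp;` look like HTML character references. As an aside, Python's standard library can decode those back to the original characters with `html.unescape`. This is an alternative to the regex approach used in this notebook, not what the notebook does, and the sample string is made up.

```python
import html

sample = "you got this &amp; more &#128557;"

# Decodes named (&amp;) and numeric (&#128557;) character references
decoded = html.unescape(sample)
print(decoded)
```

Decoding keeps the actual emoji in the text, whereas the placeholder approach collapses every emoji into one token. Which is preferable depends on the features built later.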

In [7]:
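# Raw strings keep backslashes such as \( and \w from being read as string escapes
url_regex = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
             r'[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
mention_regex = r'@[\w\-]+'   # @username
emoji_regex = r'&#\d*;'       # numeric character reference, e.g. &#128557;
amp_regex = r'&amp;'          # HTML-escaped ampersand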

In [9]:
text = df['clean']
print(len(text))

10906


In [10]:
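def clean_text(text):
    """Replace URLs, mentions, and emoji codes with placeholders; decode '&amp;'."""
    text = str(text)  # non-string values (e.g. NaN) are converted to strings
    text = re.sub(url_regex, 'URLHERE', text)
    text = re.sub(mention_regex, 'MENTIONHERE', text)
    text = re.sub(emoji_regex, ' EMOJIHERE ', text)
    text = re.sub(amp_regex, '&', text)
    return text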
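
As a quick check of what the function does, here it is applied to a single string. The patterns and function body are copied from the cells above (written as raw strings) so the block runs on its own; the sample string is made up.

```python
import re

url_regex = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
             r'[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
mention_regex = r'@[\w\-]+'
emoji_regex = r'&#\d*;'
amp_regex = r'&amp;'

def clean_text(text):
    text = str(text)
    text = re.sub(url_regex, 'URLHERE', text)
    text = re.sub(mention_regex, 'MENTIONHERE', text)
    text = re.sub(emoji_regex, ' EMOJIHERE ', text)
    text = re.sub(amp_regex, '&', text)
    return text

sample = "RT @user_1: check https://t.co/abc123 &amp; smile &#128514;"
cleaned = clean_text(sample)
print(cleaned.split())
```

Note that the emoji placeholder is padded with spaces, so the cleaned string can contain doubled or trailing spaces; splitting on whitespace, as above, sidesteps that.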

In [11]:
df['clean'] = df['clean'].apply(clean_text)

In [12]:
df['clean'].sample(20, random_state=0)

8207    I was just debating with myself whether I shou...
7020                                 Try here Amazon.com.
7290    They need to give them mandatory DNA tests and...
9089    We all have unique abilities and knowledge , c...
2643    Multiculturism is a Trojan Horse Multicultural...
4911    hey, im from manhattan. its really rough havin...
1670                                                 TF .
5772    Cant say over the net what I do EVERY day for ...
1085    I actually have a college degree , but choose ...
8582    We have tons of links for materials and free d...
2247    That news is something totally incomprihensible .
9616    Main Entry : epis Â· tro Â· phe Pronunciation ...
7958                  Does anybody else experience this ?
5135    heil sisters and brothers lit see if this make...
7631                  Oh, you 'd better believe they do .
9603    Einstein said nothing can go faster than the s...
1523    - `` Something about black '' and the old favo...
1597          

This looks much better (for hate speech). Now that we've cleaned it, we'll save it and start generating some features for classification.

In [13]:
df.to_csv('cleanhate123.csv')
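
One thing to be aware of when saving: by default `to_csv` writes the index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back (as in the tables above). The sketch below shows the difference on a toy frame, using in-memory buffers instead of real files; the frame contents are made up.

```python
import io
import pandas as pd

toy = pd.DataFrame({"label": ["noHate", "hate"],
                    "clean": ["some text", "other text"]})

# Default: the index is written as an extra, unnamed column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
no_index = io.StringIO()
toy.to_csv(no_index, index=False)
no_index.seek(0)
cols_no_index = pd.read_csv(no_index).columns.tolist()

print(cols_with_index)
print(cols_no_index)
```

Passing `index=False` here, or `index_col=0` when reading, keeps the extra column from accumulating across save/load cycles.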