**Import Libraries**

In [1]:
import pandas as pd
import numpy as np

**Load Data**

In [2]:
cyberbullying_raw_data = pd.read_csv("https://github.com/Voldegin/hate_speech_detection/blob/develop/data/cyberbullying_tweets.csv?raw=true")

In [3]:
cyberbullying_raw_data

Unnamed: 0,tweet_text,cyberbullying_type
0,"In other words #katandandre, your food was cra...",not_cyberbullying
1,Why is #aussietv so white? #MKR #theblock #ImA...,not_cyberbullying
2,@XochitlSuckkks a classy whore? Or more red ve...,not_cyberbullying
3,"@Jason_Gio meh. :P thanks for the heads up, b...",not_cyberbullying
4,@RudhoeEnglish This is an ISIS account pretend...,not_cyberbullying
...,...,...
47687,"Black ppl aren't expected to do anything, depe...",ethnicity
47688,Turner did not withhold his disappointment. Tu...,ethnicity
47689,I swear to God. This dumb nigger bitch. I have...,ethnicity
47690,Yea fuck you RT @therealexel: IF YOURE A NIGGE...,ethnicity


In [4]:
malignant_raw_data = pd.read_csv("https://github.com/Voldegin/hate_speech_detection/blob/develop/data/malignant_comments/train.csv?raw=true")

In [5]:
malignant_raw_data

Unnamed: 0,id,comment_text,malignant,highly_malignant,rude,threat,abuse,loathe
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0
...,...,...,...,...,...,...,...,...
159566,ffe987279560d7ff,""":::::And for the second time of asking, when ...",0,0,0,0,0,0
159567,ffea4adeee384e90,You should be ashamed of yourself \n\nThat is ...,0,0,0,0,0,0
159568,ffee36eab5c267c9,"Spitzer \n\nUmm, theres no actual article for ...",0,0,0,0,0,0
159569,fff125370e4aaaf3,And it looks like it was actually you who put ...,0,0,0,0,0,0


**The two datasets label their output columns differently: one distinguishes by the intensity of bullying, while the other distinguishes by the type of bullying.
The easiest way to bring them together is to create a new column that identifies whether each text is cyberbullying or not.**

In [6]:
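# Convert the malignant data into the tweet format: keep the text and, for each row,
# list the label columns set to 1. The new column 'is_cyberbullying' is derived from this list later.
def convert_malignant_tweets_format(data):
    new_data = pd.DataFrame()
    new_data['tweet_text'] = data['comment_text']
    # Only the last six columns hold labels, so restrict the comparison to them
    label_columns = data.iloc[:, -6:]
    new_data['actual_value'] = label_columns.apply(lambda row: row[row == 1].index.tolist(), axis=1)
    return new_data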
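
To make the row-wise selection in `convert_malignant_tweets_format` concrete, here is a minimal sketch on a made-up three-row frame. The texts and the reduced set of three label columns are invented for illustration; only the `row[row == 1].index.tolist()` idiom comes from the notebook.

```python
import pandas as pd

# Hypothetical miniature of the malignant data (values invented for illustration).
toy = pd.DataFrame({
    "comment_text": ["thanks for the edit", "you are an idiot", "I will find you"],
    "malignant": [0, 1, 1],
    "rude": [0, 1, 0],
    "threat": [0, 0, 1],
})

label_columns = ["malignant", "rude", "threat"]

# For each row, keep the names of the label columns whose flag equals 1.
toy["actual_value"] = toy[label_columns].apply(
    lambda row: row[row == 1].index.tolist(), axis=1
)

print(toy[["comment_text", "actual_value"]])
```

A row with no flags set ends up with an empty list, which is what the later `is_cyberbullying` step relies on.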

**Since the malignant tweets dataset is used to build the test dataset, we sample 4000 rows from each class (malignant and not malignant).**

In [7]:
not_malignant_tweets = malignant_raw_data[malignant_raw_data.iloc[:,-6:].sum(axis=1) == 0]
malignant_tweets = malignant_raw_data[malignant_raw_data.iloc[:,-6:].sum(axis=1) > 0]
len(not_malignant_tweets), len(malignant_tweets)

(143346, 16225)

In [8]:
not_malignant_4000 = convert_malignant_tweets_format(not_malignant_tweets.sample(n=4000, random_state=23))
malignant_4000 = convert_malignant_tweets_format(malignant_tweets.sample(n=4000, random_state=23))
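
The sampling step can be sketched on a small made-up frame. The data and the `n_per_class` name are assumptions for illustration; the point is that drawing the same `n` from each class gives a balanced set, and that a fixed `random_state` makes the draw repeatable.

```python
import pandas as pd

# Invented, imbalanced toy data: seven clean rows and three flagged rows.
toy = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(10)],
    "malignant": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

clean = toy[toy["malignant"] == 0]
flagged = toy[toy["malignant"] == 1]

n_per_class = 2  # the notebook uses 4000
clean_sample = clean.sample(n=n_per_class, random_state=23)
flagged_sample = flagged.sample(n=n_per_class, random_state=23)

# Sampling again with the same seed returns the same rows.
repeat_sample = clean.sample(n=n_per_class, random_state=23)
same_rows = clean_sample.equals(repeat_sample)

balanced = pd.concat([clean_sample, flagged_sample])
print(balanced["malignant"].value_counts())
print(same_rows)
```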

In [9]:
# Combine the two 4000-row samples into the final test dataset
test_data = pd.concat([not_malignant_4000, malignant_4000])
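
By default `pd.concat` keeps the row labels of the frames it joins, so the combined frame carries the labels of the original rows rather than a fresh 0..n-1 range. If a renumbered index is preferred, `ignore_index=True` is one option. The sketch below uses two invented frames to show the difference; it is not a required change to the notebook.

```python
import pandas as pd

# Two invented frames with arbitrary row labels.
first = pd.DataFrame({"tweet_text": ["a", "b"]}, index=[32504, 39965])
second = pd.DataFrame({"tweet_text": ["c", "d"]}, index=[107717, 23020])

kept = pd.concat([first, second])                           # original labels preserved
renumbered = pd.concat([first, second], ignore_index=True)  # labels replaced by 0..n-1

print(kept.index.tolist())
print(renumbered.index.tolist())
```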

**Create the is_cyberbullying column for the test data and the training data**

In [10]:
test_data['is_cyberbullying'] = np.where(test_data['actual_value'].apply(len) == 0, 0, 1)

In [11]:
test_data

Unnamed: 0,tweet_text,actual_value,is_cyberbullying
32504,"Oppose For the sake of this decision, I don't ...",[],0
39965,REDIRECT Talk:Shabab Al-Bireh Institute,[],0
128463,Rutherford was a supporter of the Haultain gov...,[],0
66224,I didn't do it \n\nI didn't add improperly cit...,[],0
65530,"""Hang on a minute, scobey. I'm Irish. I'd just...",[],0
...,...,...,...
107717,You're an asshole. Seriously. I've been her...,"[malignant, rude]",1
23020,For your information: you´re an idiot too!! an...,"[malignant, abuse]",1
125543,FUCK \n\nWhy hasn't anyone included this infor...,"[malignant, rude]",1
105523,I just wanted to say your article sucks. 206.2...,"[malignant, rude, abuse]",1


In [12]:
cyberbullying_raw_data['is_cyberbullying'] = np.where(cyberbullying_raw_data['cyberbullying_type'] == 'not_cyberbullying',0,1)
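
A quick way to confirm that the binary column lines up with the original labels is to cross-tabulate the two. The sketch below does this on an invented frame; `some_other_type` is a placeholder label, not a value taken from the dataset.

```python
import numpy as np
import pandas as pd

# Invented labels; 'some_other_type' is a hypothetical placeholder category.
toy = pd.DataFrame({
    "cyberbullying_type": [
        "not_cyberbullying", "ethnicity", "some_other_type", "not_cyberbullying",
    ],
})

# Same mapping as in the notebook: 0 for 'not_cyberbullying', 1 for everything else.
toy["is_cyberbullying"] = np.where(
    toy["cyberbullying_type"] == "not_cyberbullying", 0, 1
)

# Every original label should fall entirely into one of the two columns.
check = pd.crosstab(toy["cyberbullying_type"], toy["is_cyberbullying"])
print(check)
```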

In [13]:
cyberbullying_raw_data

Unnamed: 0,tweet_text,cyberbullying_type,is_cyberbullying
0,"In other words #katandandre, your food was cra...",not_cyberbullying,0
1,Why is #aussietv so white? #MKR #theblock #ImA...,not_cyberbullying,0
2,@XochitlSuckkks a classy whore? Or more red ve...,not_cyberbullying,0
3,"@Jason_Gio meh. :P thanks for the heads up, b...",not_cyberbullying,0
4,@RudhoeEnglish This is an ISIS account pretend...,not_cyberbullying,0
...,...,...,...
47687,"Black ppl aren't expected to do anything, depe...",ethnicity,1
47688,Turner did not withhold his disappointment. Tu...,ethnicity,1
47689,I swear to God. This dumb nigger bitch. I have...,ethnicity,1
47690,Yea fuck you RT @therealexel: IF YOURE A NIGGE...,ethnicity,1


In [17]:
# cyberbullying_raw_data.to_csv("C:\\Greenwich\\MSc Project\\project_code\\train_data.csv",index=False)

In [18]:
# test_data.to_csv("C:\\Greenwich\\MSc Project\\project_code\\test_data.csv",index=False)
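
One thing to keep in mind when saving the test data: a column of Python lists such as `actual_value` is written to CSV as text, so it comes back as strings when the file is read again. The sketch below shows the round trip on an invented frame and restores the lists with `ast.literal_eval`. The temporary directory is a stand-in for a real output folder and is an assumption of this example.

```python
import ast
import tempfile
from pathlib import Path

import pandas as pd

# Invented stand-in for test_data.
toy_test_data = pd.DataFrame({
    "tweet_text": ["thanks for the edit", "you are an idiot"],
    "actual_value": [[], ["malignant", "rude"]],
    "is_cyberbullying": [0, 1],
})

# Hypothetical output location; a temporary directory keeps the sketch self-contained.
output_dir = Path(tempfile.mkdtemp())
output_path = output_dir / "test_data.csv"
toy_test_data.to_csv(output_path, index=False)

reloaded = pd.read_csv(output_path)

# The list column comes back as text, for example "['malignant', 'rude']".
raw_value = reloaded["actual_value"].iloc[1]
print(type(raw_value))

# Parse the text back into Python lists.
reloaded["actual_value"] = reloaded["actual_value"].apply(ast.literal_eval)
print(reloaded["actual_value"].tolist())
```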