# Data Cleaning and Collection 
Given data from the following sources: 

1. [Dynamically generated hate speech](https://github.com/bvidgen/Dynamically-Generated-Hate-Speech-Dataset)
2. [Depression and suicide reddit content](https://www.kaggle.com/datasets/xavrig/reddit-dataset-rdepression-and-rsuicidewatch)
3. [Text that has been classified by sentiment](https://www.kaggle.com/datasets/amiteshpatel16/sentiment-analysis-dataset3labels)

We extracted each text and its associated label and combined them into one dataset that will be used to train a text classification model.

Class labels are: 

0 - Normal (both neutral and positive sentiment) 

1 - Risk of harming others (violence/hate)

2 - Risk of harming self (self-harm/depression/suicide)



In [None]:
import math
import pandas as pd 
import numpy as np

In [None]:
from google.colab import drive 
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
death_row_fn = 'gdrive/My Drive/COEN140/group-project/data/Last-Statement-of-Death-Row.csv'
suicide_depression_fn = 'gdrive/My Drive/COEN140/group-project/data/reddit_depression_suicidewatch.csv'
hate_fn = 'gdrive/My Drive/COEN140/group-project/data/Dynamically_Generated_Hate_Dataset_v0.2.3.csv'
sentiment_fn = 'gdrive/My Drive/COEN140/group-project/data/sentiment.csv'
output_fn = 'gdrive/My Drive/COEN140/group-project/data/train.csv'
nosamp_output_fn = 'gdrive/My Drive/COEN140/group-project/data/nosample_train.csv'

RAND_STATE = 1

In [None]:
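import re

# match greetings and farewells as whole words only; a plain str.replace
# would also strip 'hi' from inside words such as 'this' or 'think'
GREETING_RE = re.compile(r'\b(?:[Gg]oodbye|[Bb]ye|[Hh]ello|[Hh]i|[Hh]ey)\b')

def remove_hello_goodbye(text):
  return GREETING_RE.sub('', text)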
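
The difference between removing greetings as substrings and removing them as whole words can be checked on a toy string. The following is a minimal sketch, not part of the original notebook: the helper names `remove_substrings` and `remove_whole_words` and the sample sentence are hypothetical, and the whole-word version ignores case as a simplifying assumption.

```python
import re

GREETINGS = ['goodbye', 'bye', 'hello', 'hi', 'hey']

def remove_substrings(text):
    # plain substring removal: also deletes 'hi' inside longer words
    for word in GREETINGS:
        text = text.replace(word, '')
    return text

# \b anchors the match at word boundaries, so only standalone words match
WHOLE_WORD_RE = re.compile(r'\b(?:' + '|'.join(GREETINGS) + r')\b',
                           flags=re.IGNORECASE)

def remove_whole_words(text):
    return WHOLE_WORD_RE.sub('', text)

sample = 'hi, I think this is hard'
naive_result = remove_substrings(sample)        # ', I tnk ts is hard'
whole_word_result = remove_whole_words(sample)  # ', I think this is hard'
print(naive_result)
print(whole_word_result)
```

Substring removal turns "think" into "tnk" and "this" into "ts", while the word-boundary version only drops the standalone greeting.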

Clean the CSV of r/depression and r/SuicideWatch posts.

In [None]:
# extract text and class labels 
suicide_df = pd.read_csv(suicide_depression_fn)

# label all samples as "risk to oneself"
suicide_df['class'] = 2
suicide_df.drop('label', axis=1, inplace=True)

# remove greetings and farewells from the posts
suicide_df['text'] = suicide_df['text'].apply(remove_hello_goodbye)

suicide_df

Unnamed: 0,text,class
0,I recently went through a breakup and she said...,2
1,"I do not know how to navigate these feelings, ...",2
2,"So I have been with my bf for 5 months , and h...",2
3,I am so exhausted of ts. Just when I tnk I can...,2
4,I have been severly bullied since i was 5 till...,2
...,...,...
20358,I took 50mg of seroquel a few hours after I dr...,2
20359,that is what has happened to me last week. And...,2
20360,Ever just feel alone in a house full of people...,2
20361,Politicians. Neighbors. Corporations. Society....,2


Clean the CSV of hate speech data, labelling hateful samples as risk of harming others (1) and non-hateful samples as normal (0).

In [None]:
hate_df = pd.read_csv(hate_fn)

# extract text and class labels 
hate_df = hate_df.loc[:,['text','label']]
hate_df.rename({'label' : 'class'}, axis=1, inplace=True)

# numerically represent class labels 
# label all hateful samples as "risk to others" and all non-hateful samples 
# as "normal"
hate_df['class'] = [0 if cls == "nothate" else 1 for cls in hate_df['class']] 

hate_df['text'] = hate_df['text'].apply(lambda x: remove_hello_goodbye(x))
hate_df

Unnamed: 0,text,class
0,dalits are lowlives,1
1,gay people are a burden to our society,1
2,Arabs are not welcome,1
3,I'm not saying we should actually eliminate he...,1
4,bananas are for black people,1
...,...,...
41139,Indian culture is so backwards,1
41140,"Women don't want equality, t just want to be i...",1
41141,fuck covid,0
41142,Ts computer is garbage,0
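
The label conversion above treats every value other than `"nothate"` as hateful. As an alternative, an explicit mapping makes unexpected label strings visible as missing values. This is a minimal sketch on hypothetical toy rows: `toy_hate_df` and `label_to_class` are illustrative names, and it assumes the only label strings are `'hate'` and `'nothate'`.

```python
import pandas as pd

# hypothetical toy rows standing in for the hate speech CSV
toy_hate_df = pd.DataFrame({
    'text': ['example a', 'example b', 'example c'],
    'label': ['hate', 'nothate', 'hate'],
})

# assumed label strings: 'nothate' -> normal (0), 'hate' -> risk to others (1)
label_to_class = {'nothate': 0, 'hate': 1}
toy_hate_df['class'] = toy_hate_df['label'].map(label_to_class)

# any label missing from the mapping becomes NaN, so it can be counted
unmapped = int(toy_hate_df['class'].isna().sum())
print(toy_hate_df[['text', 'class']])
print(f'unmapped labels: {unmapped}')
```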


Clean the CSV of sentiment-labeled data by extracting only the samples labeled neutral or positive.

In [None]:
sentiment_df = pd.read_csv(sentiment_fn)

# extract text with only neutral (0) or positive (1) labels 
# copy the filtered rows so the in-place edits below act on an independent dataframe
sentiment_df = sentiment_df.loc[sentiment_df['target'].isin([0,1])].copy()
sentiment_df.rename({'target' : 'class'}, axis=1, inplace=True)

# label all samples as normal (0)
sentiment_df['class'] = 0
sentiment_df

Unnamed: 0,text,class
0,An image forming apparatus of the present inve...,0
2,The first aspect of a method for recovering a ...,0
3,"First Aspect of Invention', 'The present inven...",0
4,"As described above, according to the cap, the ...",0
8,"According to the present invention, a method f...",0
...,...,...
149993,According to one aspect of the present inventi...,0
149995,The ultrasonic atomizing device of the present...,0
149996,"According to the present invention, a producti...",0
149997,The present invention can thus prevent a hybri...,0


In [None]:
print(f'num of samples from r/SuicideWatch and r/Depression dataset: {suicide_df.shape[0]}')
print(f'num of samples from sentiment analysis dataset: {sentiment_df.shape[0]}')
print(f'num of samples from dynamically generated hate dataset: {hate_df.shape[0]}')

num of samples from deathrow dataset: 450
num of samples from r/SuicideWatch and r/Depression dataset: 20363
num of samples from sentiment analysis dataset: 100000
num of hateful samples from dynamically generated hate dataset: 41144


In [None]:
# create a new dataframe of the labeled text 
df = pd.concat([suicide_df, hate_df, sentiment_df], ignore_index=True)
df.head()

Unnamed: 0,text,class
0,I recently went through a breakup and she said...,2
1,"I do not know how to navigate these feelings, ...",2
2,"So I have been with my bf for 5 months , and h...",2
3,I am so exhausted of ts. Just when I tnk I can...,2
4,I have been severly bullied since i was 5 till...,2


Resample the dataframe so that the non-normal classes are not overrepresented. Each non-normal class (risk of harming others, risk of harming self) is downsampled to 1/12 the size of the normal class, so about 1/7 of the final data will be not normal (i.e. hateful, depressed, suicidal). The normal class is kept in full, including the non-hateful samples from the dynamically generated hate speech data, to help a classification model have enough samples to distinguish non-hate from hate.

In [None]:
# extract the rows of the same class 
df_none_cls = df[df['class'].eq(0)]
df_others_cls = df[df['class'].eq(1)]
df_self_cls = df[df['class'].eq(2)]

print(f'num of samples classified as normal: {df_none_cls.shape[0]}')
print(f'num of samples classified as risk of harm to others: {df_others_cls.shape[0]}')
print(f'num of samples classified as risk of harm to self: {df_self_cls.shape[0]}')

num of samples classified as normal: 118969
num of samples classified as risk of harm to others: 22175
num of samples classified as risk of harm to self: 20363
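
The same per-class counts can also be read off in one call. A minimal sketch on a hypothetical toy dataframe (`toy_df` is not part of the notebook), using pandas `value_counts`:

```python
import pandas as pd

# hypothetical toy data with the same 'text' / 'class' layout
toy_df = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd', 'e'],
    'class': [0, 0, 0, 1, 2],
})

# absolute counts per class, ordered by class label
class_counts = toy_df['class'].value_counts().sort_index()

# the share of each class in the dataset
class_fractions = toy_df['class'].value_counts(normalize=True).sort_index()

print(class_counts)
print(class_fractions)
```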


In [None]:
# each minority class is sampled to 1/12 the size of the normal class,
# so non-normal text totals 1/6 of the size of the normal data
# all minority classes are equally sampled 
n_samples = math.floor(df_none_cls.shape[0] / 12) 

df = pd.concat([df_none_cls, df_others_cls.sample(n=n_samples,
                                                random_state=RAND_STATE),
                df_self_cls.sample(n=n_samples, random_state=RAND_STATE)], 
               ignore_index=True)

# remove any documents that are only the word "None"
df = df[df['text'] != 'None']

# shuffle the samples in the dataset 
df = df.sample(df.shape[0], random_state=RAND_STATE, ignore_index=True)
df

Unnamed: 0,text,class
0,A first feature of the present invention is a ...,0
1,"According to the present invention, it is poss...",0
2,"According to the invention, it is possible to ...",0
3,"To solve the problems described above, a proce...",0
4,The vehicle seat device comprises front moveme...,0
...,...,...
138792,"Due to the specific steps, the method for prod...",0
138793,According to the present disclosure as describ...,0
138794,"The terminal, method for controlling the termi...",0
138795,how can you tnk it's acceptable to call people...,0
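
The class proportions produced by this sampling step can be checked by hand. The sketch below is not part of the original notebook. It reuses the count of normal samples printed earlier (118969) and the divisor 12 from the sampling cell, and the variable names are illustrative.

```python
import math

n_normal = 118969                            # normal samples, from the printed counts
n_per_minority = math.floor(n_normal / 12)   # samples drawn from each minority class
n_non_normal = 2 * n_per_minority            # two minority classes
n_total = n_normal + n_non_normal            # before dropping any 'None' documents

ratio_to_normal = n_non_normal / n_normal    # roughly 1/6
share_of_total = n_non_normal / n_total      # roughly 1/7

print(n_per_minority, n_total)
print(round(ratio_to_normal, 4), round(share_of_total, 4))
```

The non-normal samples come to about one sixth of the normal class, which is about one seventh of the combined dataset.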


Write the combined and cleaned data to a CSV file that will be used to train a model.

In [None]:
df.to_csv(output_fn, index=False)
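
After writing, it can be worth reading the file back to confirm that the columns and labels survived the round trip. This is a minimal sketch that assumes a temporary directory and a hypothetical toy dataframe (`toy_df`) rather than the real Drive path.

```python
import os
import tempfile

import pandas as pd

# hypothetical toy data with the same 'text' / 'class' layout
toy_df = pd.DataFrame({
    'text': ['a normal sentence', 'a second sentence', 'another one'],
    'class': [0, 1, 2],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_fn = os.path.join(tmp_dir, 'train.csv')
    toy_df.to_csv(toy_fn, index=False)   # same call as above, on toy data
    reloaded = pd.read_csv(toy_fn)

# check the column layout, the labels, and that no text was read back as missing
print(list(reloaded.columns))
print(reloaded['class'].tolist())
n_missing_text = int(reloaded['text'].isna().sum())
print(f'missing text entries: {n_missing_text}')
```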