# Data Cleaning and Collection 
The data comes from the following sources:

1. [Last statements from people on death row in Texas](https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html)
2. [Dynamically generated hate speech](https://github.com/bvidgen/Dynamically-Generated-Hate-Speech-Dataset)
3. [Depression and suicide Reddit content](https://www.kaggle.com/datasets/xavrig/reddit-dataset-rdepression-and-rsuicidewatch)
4. [Text that has been classified by sentiment](https://www.kaggle.com/datasets/amiteshpatel16/sentiment-analysis-dataset3labels)

From each source we extracted the text and its associated label, then combined the results into one dataset that will be used to train a text classification model.

Class labels are: 

0 - Neutral 

1 - Hate

2 - Depression

3 - Suicidal
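
The label scheme above can be kept in a small lookup table so that later cells can refer to classes by name. This is a minimal sketch; the names `CLASS_NAMES` and `CLASS_IDS` are illustrative and are not part of the original notebook.

```python
# Hypothetical lookup tables for the four class labels listed above.
CLASS_NAMES = {0: "Neutral", 1: "Hate", 2: "Depression", 3: "Suicidal"}

# Reverse mapping from class name to numeric label.
CLASS_IDS = {name: idx for idx, name in CLASS_NAMES.items()}

print(CLASS_NAMES[3])
print(CLASS_IDS["Hate"])
```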



In [26]:
import math
import pandas as pd 
import numpy as np

In [27]:
from google.colab import drive 
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [28]:
death_row_fn = 'gdrive/My Drive/COEN140/group-project/data/Last-Statement-of-Death-Row.csv'
suicide_depression_fn = 'gdrive/My Drive/COEN140/group-project/data/reddit_depression_suicidewatch.csv'
hate_fn = 'gdrive/My Drive/COEN140/group-project/data/Dynamically_Generated_Hate_Dataset_v0.2.3.csv'
sentiment_fn = 'gdrive/My Drive/COEN140/group-project/data/sentiment.csv'
output_fn = 'gdrive/My Drive/COEN140/group-project/data/train.csv'

RAND_STATE = 1

Clean the CSV of death row last statements.

In [29]:
# extract and rename last statement feature
deathrow_df = pd.read_csv(death_row_fn)
deathrow_df['text'] = deathrow_df['last_statement']
deathrow_df.drop(deathrow_df.columns.difference(['text']), axis = 1, inplace = True)

# label all samples as suicidal
deathrow_df['class'] = 3

# filter out all of the samples with no statement given 
deathrow_df = deathrow_df[~deathrow_df['text'].isin(["No statement given.", 
                        "This offender declined to make a last statement."])]
deathrow_df

Unnamed: 0,text,class
0,To my friends and family it was a nice journey...,3
1,"Yes Sir, I would like to thank the Shape Commu...",3
2,Yes Sir. Dear Heavenly Father please forgive t...,3
3,I am very thankful for all the hard work the M...,3
5,"Thank you I love you all. Sandra, nice meeting...",3
...,...,...
553,"Heavenly Father, I give thanks for this time, ...",3
554,I pray that my family will rejoice and will fo...,3
555,"When asked if he had a last statement, he repl...",3
556,What is about to transpire in a few moments is...,3
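
The exact-match filter above only removes rows whose text equals one of the two placeholder strings character for character. If the source contains variants (different capitalization, stray whitespace, a missing period, or empty cells), a normalized comparison may catch more of them. The sketch below uses a made-up toy frame, not the real data, and the names `toy_df`, `placeholders`, and `filtered_df` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the death row data; these rows are invented for illustration.
toy_df = pd.DataFrame({"text": [
    "Thank you all.",
    "No statement given.",
    " no statement given ",
    "This offender declined to make a last statement.",
    None,
]})

# Placeholder phrases, written lowercase and without the trailing period.
placeholders = {
    "no statement given",
    "this offender declined to make a last statement",
}

# Normalize before comparing: trim whitespace, drop trailing periods, lowercase.
normalized = toy_df["text"].str.strip().str.rstrip(".").str.lower()

# Keep rows that have text and whose normalized text is not a placeholder.
keep = toy_df["text"].notna() & ~normalized.isin(placeholders)
filtered_df = toy_df[keep]
print(filtered_df)
```

Whether the real CSV contains such variants is not known from this notebook, so it is worth inspecting the dropped rows before adopting a looser match.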


Clean the CSV of Reddit depression and suicide-watch posts.

In [30]:
# extract text and class labels 
suicide_df = pd.read_csv(suicide_depression_fn)

# numerically represent class labels 
suicide_df['class'] = [2 if cls == "depression" else 3 for cls in suicide_df['label']] 
suicide_df.drop('label', axis=1, inplace=True)

suicide_df

Unnamed: 0,text,class
0,I recently went through a breakup and she said...,2
1,"I do not know how to navigate these feelings, ...",2
2,"So I have been with my bf for 5 months , and h...",2
3,I am so exhausted of this. Just when I think I...,3
4,I have been severly bullied since i was 5 till...,2
...,...,...
20358,I took 50mg of seroquel a few hours after I dr...,3
20359,that is what has happened to me last week. And...,2
20360,Ever just feel alone in a house full of people...,2
20361,Politicians. Neighbors. Corporations. Society....,2
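
The list comprehension above assigns class 3 to every row whose label is not exactly `"depression"`, including missing or misspelled labels. An explicit mapping makes unexpected labels visible instead. This is a sketch on toy data: the label spelling `"SuicideWatch"` is an assumption, and the real values should be checked with `suicide_df['label'].unique()` before reading the CSV's labels this way.

```python
import pandas as pd

# Toy stand-in for the Reddit data. The label strings are assumed, not taken
# from the real CSV.
toy_df = pd.DataFrame({
    "text": ["post a", "post b", "post c"],
    "label": ["depression", "SuicideWatch", "typo_label"],
})

# Assumed label spellings mapped to the numeric classes used in this notebook.
label_to_class = {"depression": 2, "SuicideWatch": 3}

# Unlike an if/else, .map leaves unknown labels as NaN so they can be inspected.
toy_df["class"] = toy_df["label"].map(label_to_class)
unmapped = toy_df[toy_df["class"].isna()]
print(unmapped["label"].tolist())
```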


Clean the csv of hateful data, labelling hateful data as hate and non-hateful data as neutral.

In [31]:
hate_df = pd.read_csv(hate_fn)

# extract text and class labels 
hate_df = hate_df.loc[:,['text','label']]
hate_df.rename({'label' : 'class'}, axis=1, inplace=True)

# numerically represent class labels 
hate_df['class'] = [0 if cls == "nothate" else 1 for cls in hate_df['class']] 
hate_df

Unnamed: 0,text,class
0,dalits are lowlives,1
1,gay people are a burden to our society,1
2,Arabs are not welcome,1
3,I'm not saying we should actually eliminate he...,1
4,bananas are for black people,1
...,...,...
41139,Indian culture is so backwards,1
41140,"Women don't want equality, they just want to b...",1
41141,fuck covid,0
41142,This computer is garbage,0


Clean the CSV of sentiment-labeled data by keeping only the rows labeled neutral.

In [32]:
sentiment_df = pd.read_csv(sentiment_fn)

# keep only the rows with a neutral label; copy so the rename below
# does not operate on a slice of the original frame
sentiment_df = sentiment_df[sentiment_df['target'] == 0].copy()
sentiment_df.rename({'target' : 'class'}, axis=1, inplace=True)
sentiment_df

Unnamed: 0,text,class
0,An image forming apparatus of the present inve...,0
3,"First Aspect of Invention', 'The present inven...",0
10,The electronic device according to the present...,0
11,The objects of the present invention can be im...,0
27,"The inventors took note of the fact that, to i...",0
...,...,...
149981,The means for addressing the problem according...,0
149988,"According to the present invention, there is p...",0
149991,"In order to solve the above problem, a microsc...",0
149995,The ultrasonic atomizing device of the present...,0


In [33]:
deathrow_df.shape, suicide_df.shape, sentiment_df.shape, hate_df.shape

((450, 2), (20363, 2), (50000, 2), (41144, 2))

In [34]:
# create a new dataframe of the labeled text 
df = pd.concat([deathrow_df, sentiment_df, suicide_df, hate_df], ignore_index=True)
df.head()

Unnamed: 0,text,class
0,To my friends and family it was a nice journey...,3
1,"Yes Sir, I would like to thank the Shape Commu...",3
2,Yes Sir. Dear Heavenly Father please forgive t...,3
3,I am very thankful for all the hard work the M...,3
4,"Thank you I love you all. Sandra, nice meeting...",3


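Resample the dataframe so that the hate, depression, and suicidal classes are not overrepresented. Each non-neutral class is downsampled to 1/12 the size of the neutral class, so the three non-neutral classes together are 1/4 the size of the neutral data (about 1/5 of the combined dataset).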

In [35]:
# extract the rows of the same class 
df_none_cls = df[df['class'].eq(0)]
df_hate_cls = df[df['class'].eq(1)]
df_dep_cls = df[df['class'].eq(2)]
df_suicide_cls = df[df['class'].eq(3)]
df_none_cls.shape, df_hate_cls.shape, df_dep_cls.shape, df_suicide_cls.shape

((68969, 2), (22175, 2), (10371, 2), (10442, 2))

In [36]:
math.floor(df_none_cls.shape[0] / 12) 

5747
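
The proportions implied by this per-class sample size can be checked with a little arithmetic, using the neutral row count reported in the shape output above. The variable names here are illustrative only.

```python
import math

n_neutral = 68969                           # neutral rows, from the shapes above
n_per_class = math.floor(n_neutral / 12)    # rows sampled per non-neutral class
n_non_neutral = 3 * n_per_class             # hate + depression + suicidal

# Size of the non-neutral data relative to the neutral data (close to 1/4).
ratio_to_neutral = n_non_neutral / n_neutral

# Share of the non-neutral data in the combined dataset (close to 1/5).
share_of_total = n_non_neutral / (n_neutral + n_non_neutral)

print(n_per_class, n_non_neutral)
print(round(ratio_to_neutral, 3), round(share_of_total, 3))
```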

In [37]:
# non-neutral text will be 1/4 the size of the neutral data (1/12 per class)
n_samples = math.floor(df_none_cls.shape[0] / 12) 
df = pd.concat([df_none_cls, df_hate_cls.sample(n=n_samples,
                                                random_state=RAND_STATE),
                df_dep_cls.sample(n=n_samples, random_state=RAND_STATE), 
                df_suicide_cls.sample(n=n_samples, random_state=RAND_STATE)], 
               ignore_index=True)

# remove any documents that are only the word "None"
df = df[df['text'] != 'None']

# shuffle the samples in the dataset 
df = df.sample(df.shape[0], random_state=RAND_STATE, ignore_index=True)
df

Unnamed: 0,text,class
0,"I am such an unnatural freak, I should be dead...",2
1,"In Canadian politics, ""riding"" is a colloquial...",0
2,"A client computer (a client PC, for example) d...",0
3,The inventors of the present invention perform...,0
4,I think many of you here must have a problem w...,3
...,...,...
86204,"According to the embodiment, there is provided...",0
86205,Chinese should be exterminated,1
86206,A black musician is among the six trumpeters d...,0
86207,A valve opening/closing timing control device ...,0
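
A quick way to confirm the class balance after resampling is to count rows per class. The sketch below runs on a toy frame with the same 12:1:1:1 ratio rather than on the real `df`; the same two lines could be applied to `df['class']`.

```python
import pandas as pd

# Toy frame: 12 neutral rows and one row for each non-neutral class.
toy_df = pd.DataFrame({"class": [0] * 12 + [1, 2, 3]})

# Rows per class, ordered by class label.
counts = toy_df["class"].value_counts().sort_index()

# Fraction of the dataset that each class makes up.
shares = counts / counts.sum()

print(counts.to_dict())
print(shares.round(2).to_dict())
```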


Write the combined and cleaned data to a csv file to be used to train a model.

In [38]:
df.to_csv(output_fn)
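
By default `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. If the training code only expects `text` and `class`, passing `index=False` may be preferable. The sketch below demonstrates the difference on a toy frame written to a temporary directory; the path and frame are hypothetical and do not touch `output_fn`.

```python
import os
import tempfile

import pandas as pd

toy_df = pd.DataFrame({"text": ["a", "b"], "class": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.csv")  # hypothetical path for the demo

    # Default behaviour: the index is written as an unnamed first column.
    toy_df.to_csv(path)
    with_index_cols = pd.read_csv(path).columns.tolist()

    # With index=False only the text and class columns are written.
    toy_df.to_csv(path, index=False)
    without_index_cols = pd.read_csv(path).columns.tolist()

print(with_index_cols)
print(without_index_cols)
```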