# Cyberbullying Text Classification
CS6120 Group 10: Pushyanth Damarapati, Sindhya Balasubramanian, Eileen Chang, Priyanka Padinam

### Description
The rise of social media and the COVID-19 lockdowns of the past couple of years have led to a concerning increase in cyberbullying cases. In 2020, UNICEF issued a warning in response to the increase in cyberbullying, which was compounded by social distancing and increased screen time. People who bully others online can hide anonymously behind a screen, but those who are bullied are likely to develop mental-health issues that persist even after the bullying has ceased. Because social media spreads information quickly and anonymously, a single person can easily be targeted by a large number of people across demographics. We aim to create a model that flags harmful tweets and thereby helps protect targets of cyberbullying.

### Dataset
We will be using a Kaggle dataset, Cyberbullying Classification, consisting of more than 47,000 tweets, each labeled with one of 6 classes: Age, Ethnicity, Gender, Religion, Other type of cyberbullying, and Not cyberbullying. Each row of the dataset contains a tweet and its class. The dataset is intended for three uses: building a multiclass model that predicts the type of cyberbullying, building a binary model that flags potentially harmful tweets, and examining the words and patterns associated with each type of cyberbullying.

# Importing Dataset and Libraries

In [None]:
!pip install contractions
!pip install nltk
!pip install autocorrect 
!pip install --upgrade matplotlib

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting contractions
  Downloading contractions-0.1.72-py2.py3-none-any.whl (8.3 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii
  Downloading anyascii-0.3.1-py3-none-any.whl (287 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (106 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.1 contractions-0.1.72 pyahocorasick-1.4.4 textsearch-0.0.24
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting 

In [None]:
import string 
import nltk 
import re # regex
from string import punctuation 
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer 
from nltk.stem import SnowballStemmer 
from contractions import contractions_dict # to solve contractions
from autocorrect import Speller # correct spelling
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')

#Libraries 
import matplotlib.pyplot as plt
import seaborn as sns


#Data preprocessing
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

#Naive Bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
import numpy as np
import pandas as pd

df = pd.read_csv('cyberbullying_tweets.csv')
df['cyberbullying_type'].value_counts()

religion               7998
age                    7992
gender                 7973
ethnicity              7961
not_cyberbullying      7945
other_cyberbullying    7823
Name: cyberbullying_type, dtype: int64

**The cyberbullying types are nearly balanced. other_cyberbullying is a candidate for removal, since the models may confuse it with the other cyberbullying classes. The drop in the next cell is commented out, however, so the class is retained in the outputs that follow.**
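The balance claim can be made concrete with a small calculation. The sketch below is stdlib-only and hard-codes the counts from the `value_counts()` output above; the names `class_counts`, `shares`, and `imbalance_ratio` are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
class_counts = {
    "religion": 7998,
    "age": 7992,
    "gender": 7973,
    "ethnicity": 7961,
    "not_cyberbullying": 7945,
    "other_cyberbullying": 7823,
}

total = sum(class_counts.values())

# Share of the dataset held by each class
shares = {label: count / total for label, count in class_counts.items()}

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for label, share in shares.items():
    print(f"{label:<20} {share:.3f}")
print(f"imbalance ratio: {imbalance_ratio:.3f}")
```

On the DataFrame itself, `df['cyberbullying_type'].value_counts(normalize=True)` should give the same shares directly.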

In [None]:
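# Optional: drop the other_cyberbullying class. The line is commented out, so all six classes are kept.
#df.drop(df[df['cyberbullying_type'] == 'other_cyberbullying'].index, inplace = True)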
df['cyberbullying_type'].value_counts()

religion               7998
age                    7992
gender                 7973
ethnicity              7961
not_cyberbullying      7945
other_cyberbullying    7823
Name: cyberbullying_type, dtype: int64

# 1. Data Preprocessing

In [None]:
# Renaming columns
df = df.rename(columns={'tweet_text': 'text', 'cyberbullying_type': 'sentiment'})

In [None]:
# Checking 10 samples
df.sample(10)

Unnamed: 0,text,sentiment
20818,@congressman_aly I know that the terrorist org...,religion
5880,@punketterocker it just causes people to be bu...,not_cyberbullying
26590,"35,762- you would chose a # like this lol. I'm...",other_cyberbullying
30927,@kyhwana age of ultron.,other_cyberbullying
12097,@Dommoneek: Little stupid as bitch I don't fuc...,gender
21607,Go for Yemeni Egyptian stores or those who wel...,religion
9292,"I feel like the thing people don't get about ""...",gender
13640,Gender is a social construct. Why are we assum...,gender
33665,my pisces mars brother can argue with me all d...,age
17531,"You r not, u r a dhurt shiromani who has never...",religion


In [None]:
df["target_bullying"] = np.where(df['sentiment']!="not_cyberbullying",1,0)
df

Unnamed: 0,text,sentiment,target_bullying
0,"In other words #katandandre, your food was cra...",not_cyberbullying,0
1,Why is #aussietv so white? #MKR #theblock #ImA...,not_cyberbullying,0
2,@XochitlSuckkks a classy whore? Or more red ve...,not_cyberbullying,0
3,"@Jason_Gio meh. :P thanks for the heads up, b...",not_cyberbullying,0
4,@RudhoeEnglish This is an ISIS account pretend...,not_cyberbullying,0
...,...,...,...
47687,"Black ppl aren't expected to do anything, depe...",ethnicity,1
47688,Turner did not withhold his disappointment. Tu...,ethnicity,1
47689,I swear to God. This dumb nigger bitch. I have...,ethnicity,1
47690,Yea fuck you RT @therealexel: IF YOURE A NIGGE...,ethnicity,1


In [None]:
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
df["type_bullying"] = LE.fit_transform(df["sentiment"])
df

Unnamed: 0,text,sentiment,target_bullying,type_bullying
0,"In other words #katandandre, your food was cra...",not_cyberbullying,0,3
1,Why is #aussietv so white? #MKR #theblock #ImA...,not_cyberbullying,0,3
2,@XochitlSuckkks a classy whore? Or more red ve...,not_cyberbullying,0,3
3,"@Jason_Gio meh. :P thanks for the heads up, b...",not_cyberbullying,0,3
4,@RudhoeEnglish This is an ISIS account pretend...,not_cyberbullying,0,3
...,...,...,...,...
47687,"Black ppl aren't expected to do anything, depe...",ethnicity,1,1
47688,Turner did not withhold his disappointment. Tu...,ethnicity,1,1
47689,I swear to God. This dumb nigger bitch. I have...,ethnicity,1,1
47690,Yea fuck you RT @therealexel: IF YOURE A NIGGE...,ethnicity,1,1
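The integer codes in `type_bullying` follow from `LabelEncoder` assigning ids to the class names in sorted order. The sketch below reproduces that mapping with plain Python, as an illustration rather than the encoder's actual implementation; `label_to_id` and `id_to_label` are hypothetical helper names.

```python
# The six class names present in the dataset
labels = [
    "religion", "age", "gender",
    "ethnicity", "not_cyberbullying", "other_cyberbullying",
]

# Assign ids in sorted (alphabetical) order of the class names
label_to_id = {label: idx for idx, label in enumerate(sorted(set(labels)))}

# Reverse mapping, useful for turning predictions back into class names
id_to_label = {idx: label for label, idx in label_to_id.items()}

for idx in sorted(id_to_label):
    print(idx, id_to_label[idx])
```

This agrees with the table above, where `not_cyberbullying` is encoded as 3 and `ethnicity` as 1. With the fitted encoder, `LE.classes_` lists the names in id order and `LE.inverse_transform` converts ids back to names.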


In [None]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
nltk.download('omw-1.4')

sw_eng = set(stopwords.words('english'))  # set for fast membership tests
lemmatizer = WordNetLemmatizer()

def clean_review(review):
    '''
    Input:
        review: a string containing a tweet.
    Output:
        review_cleaned: the processed tweet, as a space-separated string of lemmas.
    '''
    review_in_lowercase = review.lower()
    # Remove URLs before punctuation: stripping punctuation first would delete
    # the "://" that the URL pattern relies on, so the pattern would never match.
    no_url = re.sub(r'https?://\S+', '', review_in_lowercase)
    no_punctuation = no_url.translate(str.maketrans('', '', string.punctuation))
    review_tokens = word_tokenize(no_punctuation)
    no_stopwords_tokens = [token for token in review_tokens if token not in sw_eng]
    # Lemmatize each remaining token and join the results with single spaces
    review_cleaned = ' '.join(lemmatizer.lemmatize(token) for token in no_stopwords_tokens)
    return review_cleaned

df['text'] = df['text'].apply(clean_review)

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
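The order of the cleaning steps matters: a URL pattern such as `https?://` relies on the characters `:` and `/`, which punctuation stripping deletes. The stdlib-only sketch below contrasts the two orders. The function names `strip_then_match` and `match_then_strip`, the pattern `https?://\S+`, and the sample tweet are illustrative assumptions, not part of the notebook.

```python
import re
import string

URL_PATTERN = re.compile(r'https?://\S+')
STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def strip_then_match(text):
    """Strip punctuation first, then try to remove URLs."""
    no_punct = text.lower().translate(STRIP_PUNCT)
    no_url = URL_PATTERN.sub('', no_punct)
    return ' '.join(no_url.split())  # collapse repeated whitespace

def match_then_strip(text):
    """Remove URLs first, then strip punctuation."""
    no_url = URL_PATTERN.sub('', text.lower())
    no_punct = no_url.translate(STRIP_PUNCT)
    return ' '.join(no_punct.split())  # collapse repeated whitespace

sample = "Read this: https://example.com/a-b now!"
print(strip_then_match(sample))  # the URL survives as a single junk token
print(match_then_strip(sample))  # the URL is removed cleanly
```

When punctuation is stripped first, the URL collapses into one long token that then enters the vocabulary as noise, so removing URLs first is generally the safer order.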


In [None]:
df['target_bullying'].value_counts()

1    39747
0     7945
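Unlike the six-class labels, the binary target is heavily skewed toward the bullying class. The sketch below, using the counts printed above, computes the accuracy of a trivial majority-class predictor and the weights given by the common "balanced" heuristic `n_samples / (n_classes * class_count)`. The variable names are illustrative, and whether to reweight or to oversample (for example with the `RandomOverSampler` imported earlier) is a design choice this sketch does not settle.

```python
# Counts copied from the target_bullying value_counts() output above
binary_counts = {1: 39747, 0: 7945}

n_samples = sum(binary_counts.values())
n_classes = len(binary_counts)

# Accuracy obtained by always predicting the majority class
majority_baseline = max(binary_counts.values()) / n_samples

# "Balanced" class weights: rarer classes receive larger weights
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in binary_counts.items()
}

print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"class weights: {class_weights}")
```

Any binary model should therefore be judged against a baseline of roughly 83% accuracy, and metrics such as per-class precision and recall are likely more informative than accuracy alone.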
Name: target_bullying, dtype: int64