In [None]:
import os
from google.colab import drive

drive.mount('/content/drive')
## use an absolute path so the cell can be re-run without errors
os.chdir('/content/drive/My Drive/Colab Notebooks/Data Glacier Internship/Week 7')

Mounted at /content/drive


**Problem Statement:**

Hate speech is any verbal, written or behavioural communication that attacks a person or group, or uses derogatory or discriminatory language against them, on the basis of who they are: their religion, ethnicity, nationality, race, colour, ancestry, sex or another identity factor. In this notebook, we build a hate speech detection model with machine learning in Python.

Hate speech detection is generally treated as a sentiment classification task. A model that flags hate speech in a piece of text can therefore be trained on labelled data of the kind used to classify sentiments. Here, we use a dataset of Twitter tweets to identify the tweets that contain hate speech.


**Business Understanding:**

Detection of hate speech in tweets is an important issue for businesses to 
consider for several reasons.

First, hate speech can be harmful and offensive to individuals and groups, and 
businesses have a social responsibility to address it. In addition, businesses 
may face legal and reputational risks if they fail to address hate speech on their platforms.

Second, businesses that operate social media platforms or engage in social 
media marketing may need to monitor and address hate speech to maintain 
the trust and loyalty of their users and customers. If a business is perceived as tolerating hate speech, it may face backlash from users and negative media 
attention.

Finally, businesses may also have a financial incentive to address hate speech, 
as it can negatively impact the user experience and drive users away from the 
platform.

To detect hate speech in tweets, businesses may use a combination of 
automated tools and human moderation. Automated tools may include 
machine learning algorithms that are trained to identify hate speech based on 
certain characteristics, such as the use of certain words or phrases. Human 
moderation may involve a team of moderators who review tweets and take 
appropriate action, such as deleting the tweet or banning the user.

It's important to note that detecting hate speech can be challenging, as it may 
involve complex issues of context and intent. It is also important for businesses 
to consider the potential for false positives and ensure that their approaches 
to detecting and addressing hate speech are fair and transparent.

Import the necessary libraries.

In [None]:
import numpy as np
import pandas as pd

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet 
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

## ignore warnings
import warnings
## raw string so that `\.` is a regex escape, not an invalid string escape
warnings.filterwarnings("ignore", module=r"matplotlib\..*")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Load the data.

In [None]:
train_data = pd.read_csv('train_E6oV3lV.csv')
test_data = pd.read_csv('test_tweets_anuFYb8.csv')
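
As a quick sanity check after loading, one might confirm that a frame has the columns the rest of the notebook relies on. The sketch below is only an illustration: it reads a tiny made-up CSV from memory instead of the real files, and `check_columns` is a hypothetical helper, not a pandas function. The column names `id`, `label` and `tweet` are the ones listed in this notebook's feature description.

```python
import io

import pandas as pd


def check_columns(df, expected):
    """Return the expected column names that are missing from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]


## made-up sample standing in for the training file
sample_csv = io.StringIO(
    "id,label,tweet\n"
    "1,0,example tweet one\n"
    "2,1,example tweet two\n"
)
sample = pd.read_csv(sample_csv)

missing = check_columns(sample, ["id", "label", "tweet"])
print(f"Missing columns: {missing}")
```

An empty list means every expected column is present; anything else points to a file that differs from what the later cells assume.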

## **Data understanding**

Let's print the first few rows to get an idea of our data.

In [None]:
train_data.head()

   id  label  tweet
0   1      0  @user when a father is dysfunctional and is s...
1   2      0  @user @user thanks for #lyft credit i can't us...
2   3      0  bihday your majesty
3   4      0  #model i love u take with u all the time in ...
4   5      0  factsguide: society now #motivation


Feature description:
  - 'id' : primary key of our dataset, but we will not need it.
  - 'label' : 0 for free speech and 1 for hate speech
  - 'tweet' : contains the tweets we want to classify as free speech or hate speech

In [None]:
## drop the id column
train_data.drop('id', axis=1, inplace=True)

Shape of the datasets.

In [None]:
print(f'Shape of training data: {train_data.shape}')
print(f'Shape of test data: {test_data.shape}')

Shape of training data: (31962, 2)
Shape of test data: (17197, 2)


Let's check the type of each feature, and whether there are any null values or duplicated rows.

In [None]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   31962 non-null  int64 
 1   tweet   31962 non-null  object
dtypes: int64(1), object(1)
memory usage: 499.5+ KB
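
The non-null counts reported by `info()` can also be checked directly. A minimal sketch on a made-up frame (the rows below are illustrative, not taken from the dataset), using `isna()` to count missing values per column:

```python
import numpy as np
import pandas as pd

## made-up frame with one missing tweet
toy = pd.DataFrame({
    "label": [0, 1, 0],
    "tweet": ["first example", np.nan, "third example"],
})

## number of missing values in each column
null_counts = toy.isna().sum()
print(null_counts)
print(f"Total missing values: {int(null_counts.sum())}")
```

On `train_data`, the same call should report zero for both columns, in line with the `info()` output above.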


In [None]:
## check for duplicates
print(f'There are {train_data.duplicated().sum()} duplicated rows.')

There are 2432 duplicated rows.


The feature types are correct and there are no null values. However, there are $2432$ duplicated rows, so removing them leaves $31962 - 2432 = 29530$ rows. Let's remove them.

In [None]:
train_data.drop_duplicates(inplace=True)
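
To illustrate what these calls do, here is a minimal sketch on a made-up frame (not the real tweets). `duplicated()` marks every row that repeats an earlier one, so its sum is the number of rows that `drop_duplicates()` removes. The `reset_index(drop=True)` step at the end is an optional addition for illustration; the notebook itself does not reset the index.

```python
import pandas as pd

## made-up frame: rows 1 and 3 repeat row 0, row 4 repeats row 2
toy = pd.DataFrame({
    "label": [0, 0, 1, 0, 1],
    "tweet": ["good morning", "good morning", "some insult",
              "good morning", "some insult"],
})

## rows that repeat an earlier row (the first occurrence is not counted)
n_duplicates = int(toy.duplicated().sum())

## keep the first occurrence of each distinct row
deduplicated = toy.drop_duplicates()

## the surviving rows keep their original index labels, so gaps remain
print(f"{n_duplicates} duplicated rows, {len(deduplicated)} rows kept")
print(f"Index after dropping: {list(deduplicated.index)}")

## optional: renumber the rows from 0
reindexed = deduplicated.reset_index(drop=True)
print(f"Index after reset: {list(reindexed.index)}")
```

Because the original index labels survive, positional access with `.iloc` and label access with `.loc` can disagree after deduplication unless the index is reset.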

Let's check for imbalanced data.

In [None]:
## check for imbalanced data
train_data['label'].value_counts()

0    27517
1     2013
Name: label, dtype: int64

The data are highly imbalanced: only 2013 of the 29530 remaining tweets (about 6.8%) are labelled as hate speech. We will later apply an oversampling or a downsampling technique to address this issue.
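
Random oversampling is one simple way to rebalance the classes: minority-class rows are resampled with replacement until both classes have the same size. The sketch below is a minimal illustration on a made-up frame, not necessarily the technique this notebook ends up using; the frame `toy` and the seed value are assumptions.

```python
import pandas as pd

## made-up frame: six tweets with label 0, two with label 1
toy = pd.DataFrame({
    "label": [0, 0, 0, 0, 0, 0, 1, 1],
    "tweet": [f"tweet {i}" for i in range(8)],
})

## class proportions before resampling
print(toy["label"].value_counts(normalize=True))

majority = toy[toy["label"] == 0]
minority = toy[toy["label"] == 1]

## draw minority rows with replacement until the classes are the same size
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

## combine both classes, shuffle the rows and renumber them
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

## class counts after resampling
print(balanced["label"].value_counts())
```

In practice, resampling is usually applied to the training split only, after any validation split has been made, so that copies of the same tweet do not end up on both sides.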