DESCRIPTION:

Using NLP and ML, make a model to identify hate speech (racist or sexist tweets) in Twitter.

Problem Statement:  

Twitter is the biggest platform where anybody and everybody can have their views heard. Some of these voices spread hate and negativity. Twitter is wary of its platform being used as a medium  to spread hate. 

You are a data scientist at Twitter, and you will help Twitter in identifying the tweets with hate speech and removing them from the platform. You will use NLP techniques, perform specific cleanup for tweets data, and make a robust model.

Domain: Social Media

Analysis to be done: 
- Clean up tweets and build a classification model by using NLP techniques, 
- cleanup specific for tweets data
- Regularization and hyperparameter tuning using stratified k-fold and cross validation to get the best model.

Content: 

id: identifier number of the tweet , 
Label: 0 (non-hate) /1 (hate) ,
Tweet: the text in the tweet



### Task 1: Load necessary libraries & Load the tweets file using read_csv function from Pandas Package

In [2]:
import pandas as pd
import numpy as np
import os, re

inp_tweets0=pd.read_csv('TwitterHate.csv')
inp_tweets0

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
...,...,...,...
31957,31958,0,ate @user isz that youuu?ðððððð...
31958,31959,0,to see nina turner on the airwaves trying to...
31959,31960,0,listening to sad songs on a monday morning otw...
31960,31961,1,"@user #sikh #temple vandalised in in #calgary,..."


### Task 2: Check the distribution of the label and find out if there is any class imbalance

In [3]:
inp_tweets0.label.value_counts(normalize=True)

0    0.929854
1    0.070146
Name: label, dtype: float64

From the above. we can clearly see that there is high imbalance in the classes.% non hate labels significantly higher than % hate labels. The modelling process will have to account for this.

### Task 3: Get the tweets into a list for easy text cleanup and manipulation

In [7]:
tweets0 = inp_tweets0.tweet.values

### Task 4 :  Cleanup 

- Normalize the case
- Using regular expressions, remove user handles. These begin with '@’
- Using regular expressions, remove URLs
- Using TweetTokenizer from NLTK, tokenize the tweets into individual terms
- Remove stop words
- Remove redundant terms like ‘amp’, ‘rt’, etc
- Remove ‘#’ symbols from the tweet while retaining the term.

In [20]:
# Normalizing the case
tweets0_lower = [twt.lower() for twt in tweets0] 


In [12]:
# Removing user handles
# Testing on a test string

import re
re.sub("@\w+","","@Rahim this course rocks! http://rahimbaig.com/ai")

' this course rocks! http://rahimbaig.com/ai'

In [27]:
# Applying on the data 
tweets0_nouser = [re.sub("@\w+","",twt)for twt in tweets0_lower]


In [28]:
# Removing URLs
# Test string example
re.sub("\w+://\S+","", "@Rahim this course rocks! http://rahimbaig.com/ai") 

'@Rahim this course rocks! '

In [29]:
# Applying on the data
tweets_nourl = [re.sub("\w+://\S+","", twt) for twt in tweets0_nouser] 

In [30]:
# Tokenizing isong TweetTokenizer from NLTK
from nltk.tokenize import TweetTokenizer
tkn = TweetTokenizer()

In [31]:
# Applying the tokenizer on the data using list comprehension
tweet_token = [tkn.tokenize(sent) for sent in tweets_nourl]
print(tweet_token[0])

['when', 'a', 'father', 'is', 'dysfunctional', 'and', 'is', 'so', 'selfish', 'he', 'drags', 'his', 'kids', 'into', 'his', 'dysfunction', '.', '#run']


In [32]:
# Removing punctuation, stop words, redundant terms like 'rt', 'amp' and removing terms with length of 1

from nltk.corpus import stopwords
from string import punctuation

In [35]:
stop_nltk = stopwords.words("english") 

stop_punct = list(punctuation) 

In [36]:
# Adding some specific punctuation from the data
stop_punct.extend(['...','``',"''",".."])

stop_context = ['rt','amp']

In [37]:
# final stop word list including all of these:
stop_final = stop_nltk + stop_punct + stop_context

In [38]:
# Define a function to:
# a. Remove stopwords from a single tokenized sentence
# b. Remove # tags
# c. Remove terms with length = 1

In [39]:
def del_stop(sent):
    return [re.sub("#","",term) for term in sent if ((term not in stop_final) & (len(term)>1))] 

In [40]:
# Applying function on the data 
tweets_clean = [del_stop(tweet) for tweet in tweet_token]

In [41]:
tweets_clean[0]

['father', 'dysfunctional', 'selfish', 'drags', 'kids', 'dysfunction', 'run']

### Task 5 Check out the top terms in the tweets. 

- First, get all the tokenized terms into one large list
- Use counter and find the 10 most common terms. 

In [45]:
# Adding all terms to one huge list
terms_list = []
for tweet in tweets_clean:
    terms_list.extend(tweet)
    
# Using counter to get top terms:
from collections import Counter
res = Counter(terms_list)
res.most_common(10)

[('love', 2748),
 ('day', 2276),
 ('happy', 1684),
 ('time', 1131),
 ('life', 1118),
 ('like', 1047),
 ("i'm", 1018),
 ('today', 1013),
 ('new', 994),
 ('thankful', 946)]

### Task 6  Data formatting for predictive modeling:

- Join the tokens back to form strings. This will be required for the vectorizers
- Assign x and y
- Perform train_test_split using sklearn.

In [46]:
# Join tokens back to form strings. This will be required for the vectorizers.

tweets_clean = ["".join(tweet) for tweet in tweets_clean]

In [47]:
# Assign X and y

X = tweets_clean
y = inp_tweets0.label.values

In [49]:
# Perform train_test_split using scikit learn
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

### Task 7 Use TF-IDF values for the terms as a feature to get into a vector space model

In [50]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [51]:
vectorizer = TfidfVectorizer(max_features=5000)

In [52]:
# fit and apply on the train set
X_train_bow = vectorizer.fit_transform(X_train)

In [53]:
# apply on the test set
X_test_bow = vectorizer.transform(X_test)
X_train_bow.shape,X_test_bow.shape

((22373, 5000), (9589, 5000))

### Task  8 Model building: Ordinary Logistic Regression

- Instantiate Logistic Regression from sklearn with default parameters.
- Fit into  the train data.
- Make predictions for the train and the test set

In [54]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [55]:
# Fitting on the train data
logreg.fit(X_train_bow,y_train)

LogisticRegression()

In [56]:
# Making predictions
y_train_pred = logreg.predict(X_train_bow)
y_test_pred = logreg.predict(X_test_bow)

### Task 9 Model Evaluation: Accuracy, recall, and f1_score 

- Report the accuracy on the train set
- Report the recall on the train set:decent, high, or low? 
- Get the f1_score on the train set. 

In [59]:
from sklearn.metrics import accuracy_score,classification_report

In [61]:
# Checking the accuracy score
accuracy_score(y_train,y_train_pred)

0.9362177624815626

93.6% accuracy may not be good if we are not capturing the 1s well at all

In [62]:
# Now lets look at classification report
print(classification_report(y_train,y_train_pred))

              precision    recall  f1-score   support

           0       0.94      1.00      0.97     20815
           1       1.00      0.08      0.16      1558

    accuracy                           0.94     22373
   macro avg       0.97      0.54      0.56     22373
weighted avg       0.94      0.94      0.91     22373



Recall is just 8% for 1 class which is not good at all

### Task 10 From the above it gets clear that we will need to adjust the class imblance as the model seem,s to focus too much on 0s
  - a) Adjust the appropriate class in the LogisticRegression Model

In [65]:
logreg = LogisticRegression(class_weight='balanced')

### Task 11 Train again with the adjustment and evaluate.

- Train the model on the train set
- Evaluate the predictions on the train set: accuracy, recall, and f_1 score

In [66]:
logreg.fit(X_train_bow,y_train)

LogisticRegression(class_weight='balanced')

In [67]:
# Evaluating the train set
y_train_pred = logreg.predict(X_train_bow)
y_test_pred =  logreg.predict(X_test_bow)

In [68]:
accuracy_score(y_train,y_train_pred)

0.9544540294104501

In [69]:
# Now lets look at classification report
print(classification_report(y_train,y_train_pred))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98     20815
           1       0.88      0.40      0.55      1558

    accuracy                           0.95     22373
   macro avg       0.92      0.70      0.76     22373
weighted avg       0.95      0.95      0.95     22373



This is better on the train set ! A recall of 40% which has improved from 8%. f1_Score has got better at 55%
This is still on the training data. The performance could be lower on the test set

### Task 12 Regularization and Hyperparameter tuning:

- Import GridSearch and StratifiedKFold because of class imbalance
- Provide the parameter grid to choose for ‘C’ and ‘penalty’ parameters
- Use a balanced class weight while instantiating the logistic regression.

In [76]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

In [77]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'C': [0.01,0.1,1,10,100],
    'penalty': ["l1","l2"]
}

In [78]:
classifier_lr = LogisticRegression(class_weight='balanced')

### Task 13 Find the parameters with the best recall in cross validation.

- Choose ‘recall’ as the metric for scoring.
- Choose stratified 4 fold cross validation scheme.
- Fit into  the train set.

In [96]:
grid_search = GridSearchCV(estimator=classifier_lr,param_grid = param_grid,
              cv = StratifiedKFold(4), n_jobs=-1, verbose=1, scoring='recall')

In [97]:
# fitting gridsearch on training data
grid_search.fit(X_train_bow,y_train)

Fitting 4 folds for each of 10 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:    0.4s finished


GridSearchCV(cv=StratifiedKFold(n_splits=4, random_state=None, shuffle=False),
             estimator=LogisticRegression(class_weight='balanced'), n_jobs=-1,
             param_grid={'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']},
             scoring='recall', verbose=1)

### Task 14 What are the best parameters?

In [98]:
grid_search.best_estimator_

LogisticRegression(C=1, class_weight='balanced')

### Task 15 Predict and evaluate using the best estimator
- Use the best estimator from the grid search to make predictions on the test set
- What is the recall on the test set for the toxic comments?
- What is the f_1 score?

In [99]:
y_test_pred = grid_search.best_estimator_.predict(X_test_bow)
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.94      1.00      0.97      8905
           1       0.72      0.15      0.25       684

    accuracy                           0.94      9589
   macro avg       0.83      0.57      0.61      9589
weighted avg       0.92      0.94      0.91      9589



f1-score for class 1 is 0.25 and recall as 0.15
As compared to grid search, LogisticRegression method gave us better recall and f1-score after doing class-weight='balanced'