# Identify hate speech in Twitter tweets

Problem Statement:
    
    Twitter is the biggest platform where anybody and everybody can have their views heard. Some of these voices spread hate and negativity, and Twitter is wary of its platform being used as a medium to spread hate. As a data scientist at Twitter, you will help identify tweets that contain hate speech so they can be removed from the platform. You will use NLP techniques, perform cleanup specific to tweet data, and build a robust model.

Content:

    id: Identifier Number

    label: 
        0 - Not Hate
        1 - Hate


    tweet: text of the tweet

# IMPORTS

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
import os
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from nltk.tokenize import TweetTokenizer
import warnings
warnings.filterwarnings('ignore')

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load and Process Data

Tasks:
    
    Load the tweets file using read_csv function from Pandas package. 

    Get the tweets into a list for easy text cleanup and manipulation.

    To cleanup: 

    Normalize the casing.

    Using regular expressions, remove user handles. These begin with '@'.

    Using regular expressions, remove URLs.

    Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.

    Remove stop words.

    Remove redundant terms like ‘amp’, ‘rt’, etc.

    Remove ‘#’ symbols from the tweet while retaining the term.

    Extra cleanup by removing terms with a length of 1.

    Check out the top terms in the tweets:

    First, get all the tokenized terms into one large list.

    Use the counter and find the 10 most common terms.
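The cleanup steps listed above can be sketched as a single function. This is a minimal, dependency-free illustration rather than the notebook's implementation: it splits on whitespace instead of using NLTK's `TweetTokenizer`, `STOP_WORDS` is a tiny hand-picked stand-in for NLTK's English stop-word list, and the names `clean_tweet`, `HANDLE_OR_URL`, and `PUNCT` are hypothetical.

```python
import re

# Hypothetical, abbreviated stop-word set; the notebook uses NLTK's English list.
STOP_WORDS = {"a", "an", "and", "is", "the", "to", "for", "i", "rt", "amp"}

HANDLE_OR_URL = re.compile(r"@\w+|http\S+")
PUNCT = ".,!?:;&\"'"

def clean_tweet(text):
    """Lowercase, drop handles and URLs, tokenize, and filter tokens."""
    text = HANDLE_OR_URL.sub("", text.lower())
    tokens = []
    for tok in text.split():                 # whitespace split stands in for TweetTokenizer
        tok = tok.strip(PUNCT).lstrip("#")   # trim punctuation, keep the hashtag term
        if len(tok) > 1 and tok not in STOP_WORDS:
            tokens.append(tok)
    return tokens

print(clean_tweet("RT @user Thanks for the #Lyft credit &amp; ride! http://t.co/xyz"))
```

Doing all steps in one pass per tweet avoids repeatedly looping over the whole array, at the cost of being harder to inspect step by step.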

In [2]:
df=pd.read_csv("TwitterHate.csv")

In [3]:
df.head()

   id  label  tweet
0   1      0  @user when a father is dysfunctional and is s...
1   2      0  @user @user thanks for #lyft credit i can't us...
2   3      0  bihday your majesty
3   4      0  #model i love u take with u all the time in ...
4   5      0  factsguide: society now #motivation


In [4]:
patterns =(r"@\w+|http\S+")
tweets = np.array(df["tweet"])
for i,tweet in enumerate(tweets):
    ###Remove User/URL###
    tweet=re.sub(patterns, "",tweet)
    ###lowercase###
    tweets[i]=tweet.lower()
df["tweet"] = pd.Series(tweets)

In [5]:
tknzr = TweetTokenizer()
tweets = np.array(df["tweet"])
labels = df["label"]

In [6]:
for i,tweet in enumerate(tweets):
    tweets[i]=tknzr.tokenize(tweet)

In [7]:
stop_words = stopwords.words("english") + ["rt",
                   "amp",
                   "etc","..."]

In [8]:
for i, tweet in enumerate(tweets):
    tweets[i] = [tok for tok in tweet if tok not in stop_words]

In [9]:
for i,tweet in enumerate(tweets):
    for n,tok in enumerate(tweet):
        if tok[0]=="#":
            tweets[i][n]=tok[1:]

In [10]:
for i, tweet in enumerate(tweets):
    tweets[i] = [tok for tok in tweet if len(tok)>1]

In [11]:
flat_tweets = []
for i,tweet in enumerate(tweets):
    for tok in tweet:
        flat_tweets.append(tok)


In [12]:
flat_tweets = np.array(flat_tweets)

In [13]:
flat_tweets.shape

(253478,)

In [14]:
# Use a separate name so the imported Counter class is not overwritten
term_counts = Counter(flat_tweets)

In [15]:
print(term_counts.most_common(10))

[('love', 2748), ('day', 2276), ('happy', 1684), ('time', 1131), ('life', 1118), ('like', 1047), ("i'm", 1018), ('today', 1013), ('new', 994), ('thankful', 946)]
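The flatten-and-count steps can also be collapsed into one expression with `itertools.chain`. The sketch below uses made-up token lists, not the notebook's data.

```python
from collections import Counter
from itertools import chain

# Made-up tokenized tweets for illustration only.
tokenized = [["love", "day"], ["happy", "day"], ["love", "life", "day"]]

# chain.from_iterable lazily flattens the list of token lists,
# so no intermediate flat list or NumPy array is needed.
term_counts = Counter(chain.from_iterable(tokenized))
print(term_counts.most_common(2))
```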


Data formatting for predictive modeling:

    Join the tokens back to form strings. This will be required for the 
    vectorizers.

    Assign x and y.

    Perform train_test_split using sklearn.

In [16]:
for i,tweet in enumerate(tweets):
    tweets[i]=" ".join(tweet)

In [17]:
X=tweets
y=labels

In [18]:
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=.3,random_state=42)

We’ll use TF-IDF values for the terms as a feature to get into a vector space model.

    Perform TF-IDF  vectorization

    Instantiate with a maximum of 5000 terms in your vocabulary.

    Fit and apply on the train set.

    Apply on the test set.
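To make the "fit on train, apply on test" distinction concrete, here is a small pure-Python sketch of TF-IDF weighting. It follows the smoothed formula that scikit-learn documents for its defaults (`smooth_idf=True`, `norm="l2"`), namely idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation. The whitespace tokenization, the function names `fit_idf` and `transform`, and the sample documents are assumptions of this sketch, and the 5000-term vocabulary cap is not modelled.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn smoothed IDF weights from the training documents only."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

def transform(doc, idf):
    """Map one document to L2-normalised TF-IDF weights; unseen terms are dropped."""
    tf = Counter(t for t in doc.split() if t in idf)
    weights = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

train_docs = ["love day", "happy day", "love life"]   # made-up training tweets
idf = fit_idf(train_docs)                # vocabulary and IDF come from train only
vec = transform("happy new day", idf)    # "new" was never seen in training
print(vec)
```

Because the test document is weighted with statistics learned from the training set, no information about the test set leaks into the features.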

In [19]:
tfidf = TfidfVectorizer(max_features=5000)

In [20]:

x_train = tfidf.fit_transform(x_train)
x_test = tfidf.transform(x_test)

Model building: Ordinary Logistic Regression

    Instantiate Logistic Regression from sklearn with default parameters.

    Fit into  the train data.

    Make predictions for the train and the test set.

In [21]:
model = LogisticRegression()

In [22]:
model.fit(x_train,y_train)

LogisticRegression()

In [23]:
preds = model.predict(x_test)

Model evaluation: accuracy, recall, and F1 score.

    Report the accuracy on the train set.

    Report the recall on the train set: decent, high, or low.

    Get the f1 score on the train set.

In [24]:
print(pd.DataFrame(confusion_matrix(y_test, preds),index=["non-hate","hate"],columns=["non-hate","hate"]))

          non-hate  hate
non-hate      8880    25
hate           468   216


In [25]:
print(f"Accuracy:{accuracy_score(y_test, preds):.2f}")

Accuracy:0.95


In [26]:
print(classification_report(y_test,preds))
print("Recall is low due to class imbalance.\nThe F-1 scores show the model is good at predicting non-hate tweets, but fails to predict hate tweets.")

              precision    recall  f1-score   support

           0       0.95      1.00      0.97      8905
           1       0.90      0.32      0.47       684

    accuracy                           0.95      9589
   macro avg       0.92      0.66      0.72      9589
weighted avg       0.95      0.95      0.94      9589

Recall is low due to class imbalance.
The F-1 scores show the model is good at predicting non-hate tweets, but fails to predict hate tweets.
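The reported scores for the hate class can be reproduced by hand from the confusion matrix, which makes it clear why accuracy stays high while recall is low. The counts below are copied from the matrix printed earlier; the variable names are only for illustration.

```python
# Counts copied from the confusion matrix (rows = actual, columns = predicted).
tn, fp = 8880, 25    # actual non-hate
fn, tp = 468, 216    # actual hate

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the tweets flagged as hate, how many really are
recall = tp / (tp + fn)      # of the real hate tweets, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.95 precision=0.90 recall=0.32 f1=0.47
```

Accuracy is dominated by the 8,880 correctly classified non-hate tweets, whereas recall only looks at the 684 hate tweets, of which 468 were missed.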


Looks like you need to adjust the class imbalance, as the model seems to focus on the 0s.

    Adjust the appropriate class in the LogisticRegression model.
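One way to adjust for the imbalance inside the model is `LogisticRegression(class_weight="balanced")`. scikit-learn documents the "balanced" weights as `n_samples / (n_classes * np.bincount(y))`; the sketch below computes the same quantity in plain Python. The helper name `balanced_class_weights` is hypothetical, and the label counts are taken from the test-set support in the classification report purely as an illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights computed as n_samples / (n_classes * count_per_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}

# Illustrative labels: 8905 non-hate and 684 hate tweets.
labels = [0] * 8905 + [1] * 684
weights = balanced_class_weights(labels)
print({cls: round(w, 2) for cls, w in weights.items()})
```

The rare class receives a much larger weight, so each misclassified hate tweet costs more during training.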

Using oversampling to adjust for class imbalance

In [27]:
oversample = RandomOverSampler(sampling_strategy='minority')

In [28]:
X = tweets.reshape(X.shape[0],1)

In [29]:
X_over, y_over= oversample.fit_resample(X,y)

In [30]:
X_over=X_over.reshape(X_over.shape[0])

In [31]:
x_over_train,x_over_test,y_over_train,y_over_test = train_test_split(X_over,y_over,test_size=.1,random_state=42)
x_over_train = tfidf.fit_transform(x_over_train)
x_over_test = tfidf.transform(x_over_test)

# Retrain with oversampled data

In [32]:
model = LogisticRegression()
model.fit(x_over_train,y_over_train)

LogisticRegression()

In [33]:
preds_over = model.predict(x_over_test)

In [34]:
print(pd.DataFrame(confusion_matrix(y_over_test,preds_over)))

      0     1
0  2794   177
1    49  2924


In [35]:
print(classification_report(y_over_test,preds_over))

              precision    recall  f1-score   support

           0       0.98      0.94      0.96      2971
           1       0.94      0.98      0.96      2973

    accuracy                           0.96      5944
   macro avg       0.96      0.96      0.96      5944
weighted avg       0.96      0.96      0.96      5944



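The model performs well on the oversampled data with default parameters. Because the oversampling was applied before the train/test split, copies of the same hate tweet can land in both sets, so these scores are likely optimistic compared with the first model's.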
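A common way to keep duplicated tweets out of the evaluation is to hold out the test set first and oversample only the training portion. The sketch below is a hand-rolled, standard-library stand-in for that idea, not the notebook's code; with the libraries used here, the analogous approach would be to call `train_test_split` first and then `fit_resample` on the training split only. The function name `split_then_oversample` and the sample labels are assumptions.

```python
import random

def split_then_oversample(labels, test_size=0.3, seed=42):
    """Hold out a test set, then duplicate minority rows in the training part only."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_size)
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    # Group training indices by class and find the largest class size.
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())

    # Sample with replacement until every class reaches the majority size.
    balanced = []
    for members in by_class.values():
        balanced += members + rng.choices(members, k=target - len(members))
    rng.shuffle(balanced)
    return balanced, test_idx

labels = [1 if i % 10 == 0 else 0 for i in range(100)]   # made-up data, 10% minority
train_idx, test_idx = split_then_oversample(labels)
print(len(train_idx), "training rows after oversampling,", len(test_idx), "untouched test rows")
```

The test set keeps the original class balance, so recall measured on it reflects how the model would behave on unseen tweets.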

# Regularization and Hyperparameter tuning

In [36]:
X_over = tfidf.fit_transform(X_over)

In [37]:
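# Note: the default lbfgs solver only supports the l2 penalty, so the l1 and
# elasticnet candidates cannot be fitted and score nan in the log below.
model = LogisticRegression(class_weight="balanced")
params = {
    "penalty": ["l2", "l1", "elasticnet"],
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000],
}
grid = GridSearchCV(model, params, cv=4, verbose=3, scoring="recall")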

In [38]:
grid.fit(X_over,y_over)

Fitting 4 folds for each of 24 candidates, totalling 96 fits
[CV 1/4] END ...............C=0.001, penalty=l2;, score=0.887 total time=   0.0s
[CV 2/4] END ...............C=0.001, penalty=l2;, score=0.882 total time=   0.0s
[CV 3/4] END ...............C=0.001, penalty=l2;, score=0.881 total time=   0.0s
[CV 4/4] END ...............C=0.001, penalty=l2;, score=0.895 total time=   0.0s
[CV 1/4] END .................C=0.001, penalty=l1;, score=nan total time=   0.0s
[CV 2/4] END .................C=0.001, penalty=l1;, score=nan total time=   0.0s
[CV 3/4] END .................C=0.001, penalty=l1;, score=nan total time=   0.0s
[CV 4/4] END .................C=0.001, penalty=l1;, score=nan total time=   0.0s
[CV 1/4] END .........C=0.001, penalty=elasticnet;, score=nan total time=   0.0s
[CV 2/4] END .........C=0.001, penalty=elasticnet;, score=nan total time=   0.0s
[CV 3/4] END .........C=0.001, penalty=elasticnet;, score=nan total time=   0.0s
[CV 4/4] END .........C=0.001, penalty=elasticne

GridSearchCV(cv=4, estimator=LogisticRegression(class_weight='balanced'),
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000],
                         'penalty': ['l2', 'l1', 'elasticnet']},
             scoring='recall', verbose=3)
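The "24 candidates, 96 fits" in the log and the `nan` scores can be accounted for by enumerating the grid. The sketch below does this with the standard library only. The `SUPPORTED` table is written by hand from scikit-learn's documented solver and penalty compatibility and should be treated as an assumption of this sketch, not a library API; `fittable` is a hypothetical helper.

```python
from itertools import product

# Hand-written compatibility table (assumption, see lead-in).
SUPPORTED = {
    "lbfgs": {"l2"},
    "liblinear": {"l1", "l2"},
    "saga": {"l1", "l2", "elasticnet"},
}

penalties = ["l2", "l1", "elasticnet"]
c_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]

# Every (C, penalty) pair is one candidate; each is fitted once per CV fold.
candidates = list(product(c_values, penalties))
print(len(candidates), "candidates,", len(candidates) * 4, "fits with cv=4")

def fittable(solver):
    """Candidates whose penalty the given solver can handle."""
    return [(c, p) for c, p in candidates if p in SUPPORTED[solver]]

for solver in SUPPORTED:
    print(f"{solver}: {len(fittable(solver))} of {len(candidates)} candidates can be fitted")
```

With the default solver only the eight l2 candidates produce a score, so the search effectively tunes `C` alone. To compare penalties, a solver such as `saga` would be needed, and the elasticnet penalty additionally requires an `l1_ratio` value.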

In [39]:

model.set_params(**grid.best_params_)
model.get_params()

{'C': 10000,
 'class_weight': 'balanced',
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [40]:
x_train,x_test,y_train,y_test = train_test_split(X_over,y_over,test_size=.1,random_state=42)

In [41]:
model.fit(x_train,y_train)

LogisticRegression(C=10000, class_weight='balanced')

In [42]:
pred_grid=model.predict(x_test)

In [43]:
print(accuracy_score(y_test,pred_grid))

0.9794751009421265


In [44]:
print(pd.DataFrame(confusion_matrix(y_test,pred_grid),index=["non-hate","hate"],columns=["non-hate","hate"]))

          non-hate  hate
non-hate      2855   116
hate             6  2967


In [45]:
print(classification_report(y_test,pred_grid))

              precision    recall  f1-score   support

           0       1.00      0.96      0.98      2971
           1       0.96      1.00      0.98      2973

    accuracy                           0.98      5944
   macro avg       0.98      0.98      0.98      5944
weighted avg       0.98      0.98      0.98      5944



In [46]:
grid.best_params_

{'C': 10000, 'penalty': 'l2'}