# *TWITTER SENTIMENT ANALYSIS*

**Problem Statement**

The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to distinguish racist or sexist tweets from other tweets.

Formally, given a training sample of tweets and labels, where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist, your objective is to predict the labels on the test dataset.

In [1]:
# For handling data
import pandas as pd

# For numerical computing
import numpy as np

# Library for pattern matching
import re

# For NLP related tasks
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')

<font size=3>**Steps to Follow:**</font>
1. Loading and Exploring Data
2. Text Cleaning
3. Data Preparation
    1. Label Encoding
    2. Split Data
    3. Feature Engineering using TF-IDF
4. Model Building
    1. Naive Bayes
    2. Logistic Regression
    3. Model Building Summary
5. Final Sentiment Analysis Pipeline

### Loading and Exploring Train Data

In [3]:
# read CSV file
df_train = pd.read_csv(r"C:\Users\TheWhiteWolf\NLP\Module1\Projects\Twitter_Sentiment_Analysis\train.csv")

# Shape of the dataframe
print("Shape =>", df_train.shape)

# first five rows
df_train.head()

Shape => (31962, 3)


|  | id | label | tweet |
|---|---|---|---|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 2 | 3 | 0 | bihday your majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in ... |
| 4 | 5 | 0 | factsguide: society now #motivation |


In [4]:
# Some sample tweets
df_train['tweet'].sample(5)

7848      @user so   that @user is presenting âcurves...
29488    not the first or the last. #yyc    the tax bil...
9494     removal of #aap spokesperson #alkalamba showca...
20604    my trust fund check is gonna be so bomb. ðð...
9290     idk what its like to feel loved. #depressed   ...
Name: tweet, dtype: object

In [5]:
# Class distribution
# Where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist
df_train['label'].value_counts()

0    29720
1     2242
Name: label, dtype: int64

In [6]:
# class distribution in percentage
# Where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist
df_train['label'].value_counts(normalize = True)*100

0    92.98542
1     7.01458
Name: label, dtype: float64

### Loading and Exploring Test Data

In [26]:
# read CSV file
df_test = pd.read_csv(r"C:\Users\TheWhiteWolf\NLP\Module1\Projects\Twitter_Sentiment_Analysis\test.csv")

# Shape of the dataframe
print("Shape =>", df_test.shape)

# first five rows
df_test.head()

Shape => (17197, 2)


|  | id | tweet |
|---|---|---|
| 0 | 31963 | #studiolife #aislife #requires #passion #dedic... |
| 1 | 31964 | @user #white #supremacists want everyone to s... |
| 2 | 31965 | safe ways to heal your #acne!! #altwaystohe... |
| 3 | 31966 | is the hp and the cursed child book up for res... |
| 4 | 31967 | 3rd #bihday to my amazing, hilarious #nephew... |


In [27]:
# Some sample tweets
df_test['tweet'].sample(5)

9290    let's make it #today â¤ï¸  #bradpitt #celeb ...
3741    i have some good friends ð   #friends #tram...
7148    wanna go shopping but i have $0.17 in my accou...
6720    very   abt what happened in #orlando we #must ...
2872    new #trending #gif on @user  , euro 2016, euro...
Name: tweet, dtype: object

### Text Cleaning

In [7]:
# Define a function for text cleaning
def text_cleaner(text):
    
    # remove user mentions
    text = re.sub(r'@[a-zA-Z0-9]+', '', text)
    
    # remove links
    text = re.sub(r'http\S+', '', text)
    
    # converting text to lower case
    text = text.lower()
    
    # Fetch only words
    text = re.sub('[^a-z]', ' ', text)
    
    # Removing extra spaces
    text = re.sub(r'\s+', ' ', text)
    
    # Creating doc object
    doc = nlp(text)
    
    # Remove stopwords and lemmatize the text
    tokens = [token.lemma_ for token in doc if not token.is_stop]
    
    # Join tokens by space
    return " ".join(tokens)
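
To see what the regex steps do before spaCy gets involved, the sketch below replays them on a made-up tweet. `regex_clean` and the sample string are hypothetical and only for illustration; the patterns are the ones used in `text_cleaner`. Replacing every non-letter with a space splits contractions (`can't` becomes `can t`) and can leave a leading space, which is consistent with the stray `t` tokens and leading blanks visible in the notebook's cleaned tweets.

```python
import re

def regex_clean(text):
    """Regex-only part of the cleaning (no spaCy); hypothetical helper."""
    text = re.sub(r'@[a-zA-Z0-9]+', '', text)   # drop user mentions
    text = re.sub(r'http\S+', '', text)         # drop links
    text = text.lower()                         # lower-case everything
    text = re.sub(r'[^a-z]', ' ', text)         # every non-letter becomes a space
    text = re.sub(r'\s+', ' ', text)            # collapse runs of whitespace
    return text

# Made-up example tweet (not taken from the dataset)
sample = "@user Thanks for #lyft credit! http://t.co/abc I can't use it"
cleaned = regex_clean(sample)
print(repr(cleaned))
```

Calling `.strip()` on the result would remove the leading space, if that is not wanted.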

#### Perform text cleaning on train data

In [8]:
# Perform text cleaning
df_train['cleaned_tweet'] = df_train['tweet'].apply(text_cleaner)

In [9]:
df_train[['tweet','cleaned_tweet']].head()

|  | tweet | cleaned_tweet |
|---|---|---|
| 0 | @user when a father is dysfunctional and is s... | father dysfunctional selfish drag kid dysfun... |
| 1 | @user @user thanks for #lyft credit i can't us... | thank lyft credit t use cause don t offer wh... |
| 2 | bihday your majesty | bihday majesty |
| 3 | #model i love u take with u all the time in ... | model love u u time ur |
| 4 | factsguide: society now #motivation | factsguide society motivation |


In [10]:
# Save cleaned tweets and labels to variable
tweets = df_train['cleaned_tweet'].values
labels = df_train['label'].values

In [11]:
# Sample cleaned tweet
tweets[:10]

array(['  father dysfunctional selfish drag kid dysfunction run',
       '  thank lyft credit t use cause don t offer wheelchair van pdx disapointe getthanke',
       '  bihday majesty', '  model love u u time ur',
       '  factsguide society motivation',
       '  huge fan fare big talking leave chaos pay dispute allshowandnogo',
       '  camping tomorrow danny',
       'school year year exam t think school exam hate imagine actorslife revolutionschool girl',
       'win love land allin cavs champions cleveland clevelandcavalier',
       '  welcome m s gr'], dtype=object)

In [12]:
# Sample labels
labels[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)

#### Perform text cleaning on test data

In [28]:
# Perform text cleaning
df_test['cleaned_tweet'] = df_test['tweet'].apply(text_cleaner)

In [30]:
df_test[['tweet','cleaned_tweet']].head()

|  | tweet | cleaned_tweet |
|---|---|---|
| 0 | #studiolife #aislife #requires #passion #dedic... | studiolife aislife require passion dedicatio... |
| 1 | @user #white #supremacists want everyone to s... | white supremacist want new bird movie s |
| 2 | safe ways to heal your #acne!! #altwaystohe... | safe way heal acne altwaystoheal healthy healing |
| 3 | is the hp and the cursed child book up for res... | hp cursed child book reservation yes harrypott... |
| 4 | 3rd #bihday to my amazing, hilarious #nephew... | rd bihday amazing hilarious nephew eli ahmir... |


In [31]:
# Save cleaned tweets to variable
tweets_test = df_test['cleaned_tweet'].values

In [32]:
# Sample cleaned tweet
tweets_test[:10]

array(['  studiolife aislife require passion dedication willpower find newmaterial',
       '  white supremacist want new bird movie s',
       'safe way heal acne altwaystoheal healthy healing',
       'hp cursed child book reservation yes harrypotter pottermore favorite',
       '  rd bihday amazing hilarious nephew eli ahmir uncle dave love miss',
       'choose momtip',
       'inside die eye ness smokeyeye tired lonely sof grunge',
       '  finished tattoo ink ink loveit thank aleeee',
       '  understand dad leave young deep inthefeel',
       '  delicious food lovelife capetown mannaepicure resturant'],
      dtype=object)

### Data Preparation

### *Feature Engineering using TF-IDF*

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [14]:
# initialize TFIDF
word_vectorizer = TfidfVectorizer(max_features=1000)

In [15]:
# Fitting Vectorizer on Train set
word_vectorizer.fit(tweets)

TfidfVectorizer(max_features=1000)

In [16]:
# create TF-IDF vectors for Train Set
train_word_features = word_vectorizer.transform(tweets)
train_word_features

<31962x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 131106 stored elements in Compressed Sparse Row format>

In [33]:
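# Fitting Vectorizer on Test set
# Caution: re-fitting here replaces the vocabulary and IDF weights learned from the
# train set, so the test columns need not match the features the models are trained on.
# Calling only transform() on the test set would keep the two matrices aligned.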
word_vectorizer.fit(tweets_test)

TfidfVectorizer(max_features=1000)

In [34]:
# create TF-IDF vectors for Test Set
test_word_features = word_vectorizer.transform(tweets_test)
test_word_features

<17197x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 70469 stored elements in Compressed Sparse Row format>
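
A common pattern is to fit the vectorizer on the training text only and then call `transform` on the test text, so that both matrices share one vocabulary and one column order. The sketch below illustrates this on two tiny made-up corpora (`toy_train` and `toy_test` are hypothetical, not the notebook's data).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpora for illustration only
toy_train = ["love sunny day", "hate rainy day", "love weekend"]
toy_test = ["sunny weekend", "unseen word day"]

vec = TfidfVectorizer(max_features=1000)
X_train = vec.fit_transform(toy_train)   # learn vocabulary and IDF from train only
X_test = vec.transform(toy_test)         # reuse that vocabulary for the test text

# Both matrices have the same columns; terms unseen in train are ignored
print(X_train.shape, X_test.shape)
print(sorted(vec.vocabulary_))
```

With this arrangement, a model trained on `X_train` sees each column of `X_test` as the same term it learned from.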

### Model Building

### *Naive Bayes*

In [17]:
# Importing for modelling
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

In [42]:
# Training the model on the train set
nb_model = MultinomialNB().fit(train_word_features, labels)
nb_model

MultinomialNB()

In [43]:
# Make predictions for train set
train_pred_nb = nb_model.predict(train_word_features)

In [44]:
train_pred_nb

array([0, 0, 0, ..., 0, 1, 0], dtype=int64)

In [45]:
# Evaluating on Training Set
print('F1-score on Train Set:', f1_score(labels, train_pred_nb, average = 'weighted'))

F1-score on Train Set: 0.9366703984293401
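
With roughly 93% of the labels in class 0, a weighted F1 is dominated by the majority class. As a rough reference point, the sketch below computes the weighted F1 of a hypothetical classifier that always predicts 0, using the class counts reported earlier; it comes out at about 0.90. `f1_from_counts` is an illustrative helper, not a scikit-learn function.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true positives, false positives and false negatives (0 if undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

n_neg, n_pos = 29720, 2242          # class counts from the train set
total = n_neg + n_pos

# Always predicting 0: every negative is correct, every positive is missed
f1_class0 = f1_from_counts(tp=n_neg, fp=n_pos, fn=0)
f1_class1 = f1_from_counts(tp=0, fp=0, fn=n_pos)

# Weighted average: per-class F1 weighted by class support
weighted_f1 = (n_neg * f1_class0 + n_pos * f1_class1) / total
print(round(weighted_f1, 4))
```

Reporting `f1_score` with its default `average='binary'` (the F1 of class 1) alongside the weighted score may give a clearer picture of how well the racist/sexist class is detected.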


In [46]:
# Make predictions for test set
test_pred_nb = nb_model.predict(test_word_features)
test_pred_nb

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [54]:
# The test set has no labels, so an F1-score cannot be computed for these predictions
# (comparing the train labels with test_pred_nb would fail: the arrays differ in length)
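
One option for estimating performance on unseen tweets is to hold out part of the labelled train data for validation. A minimal sketch, assuming a stratified 80/20 split; the toy texts, labels, and variable names below are made up for illustration, and in the notebook `tweets` and `labels` would take their place.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy stand-in data (hypothetical): 1 = racist/sexist, 0 = not
toy_texts = ["love day", "nice weekend", "happy friend", "good food",
             "great movie", "sunny morning", "hate group", "hate people",
             "racist hate", "sexist remark"] * 5
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1] * 5

# Stratify so both splits keep the same class ratio
X_tr, X_val, y_tr, y_val = train_test_split(
    toy_texts, toy_labels, test_size=0.2, stratify=toy_labels, random_state=42)

vec = TfidfVectorizer(max_features=1000)
clf = LogisticRegression(solver='liblinear')
clf.fit(vec.fit_transform(X_tr), y_tr)          # fit on the training split only

val_pred = clf.predict(vec.transform(X_val))    # score rows the model has not seen
val_f1 = f1_score(y_val, val_pred, average='weighted')
print('F1-score on Validation Set:', val_f1)
```

The toy data is trivially separable, so the score printed here says nothing about the real tweets; only the split-fit-evaluate pattern carries over.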

### *Logistic Regression*

In [22]:
from sklearn.linear_model import LogisticRegression

In [47]:
# Training model
lr_model = LogisticRegression(solver='liblinear').fit(train_word_features, labels)
lr_model

LogisticRegression(solver='liblinear')

In [48]:
# Make predictions for train set
train_pred_lr = lr_model.predict(train_word_features)
train_pred_lr

array([0, 0, 0, ..., 0, 1, 0], dtype=int64)

In [49]:
# Evaluating on Training Set
print("F1-score on Train Set:",f1_score(labels,train_pred_lr,average="weighted"))

F1-score on Train Set: 0.9433698601551692


In [50]:
test_pred_lr = lr_model.predict(test_word_features)

In [51]:
test_pred_lr

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [73]:
df_test['label'] = test_pred_lr

In [74]:
df_test['label'].value_counts()

0    17059
1      138
Name: label, dtype: int64

In [75]:
df_test['label'].value_counts(normalize = True)*100

0    99.197534
1     0.802466
Name: label, dtype: float64

In [80]:
df_test[['cleaned_tweet','label']]

|  | cleaned_tweet | label |
|---|---|---|
| 0 | studiolife aislife require passion dedicatio... | 0 |
| 1 | white supremacist want new bird movie s | 0 |
| 2 | safe way heal acne altwaystoheal healthy healing | 0 |
| 3 | hp cursed child book reservation yes harrypott... | 0 |
| 4 | rd bihday amazing hilarious nephew eli ahmir... | 0 |
| ... | ... | ... |
| 17192 | think factory leave right polarisation trump u... | 0 |
| 17193 | feel like mermaid hairflip neverready formal w... | 0 |
| 17194 | hillary campaign today ohio omg amp word lik... | 0 |
| 17195 | happy work conference right mindset lead cultu... | 0 |


In [100]:
df_test[['cleaned_tweet','label']][df_test['label'] == 1].head()

|  | cleaned_tweet | label |
|---|---|---|
| 33 | suppo taiji fisherman bully racism tweet taiji... | 1 |
| 189 | happy instagram instagood instagram instapassp... | 1 |
| 294 | wish kpop aist come country win t able perform... | 1 |
| 419 | nice ready come oakland marijuana business boo... | 1 |
| 433 | s sad guy kill warrior team soon | 1 |
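
The outline lists a final pipeline step. One way to sketch it is with scikit-learn's `Pipeline`, which chains the vectorizer and the classifier so that cleaned text goes in and labels come out. The toy data and the name `sentiment_pipeline` are assumptions for illustration; in the notebook, `text_cleaner` would be applied to the tweets before they reach this pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical): 1 = racist/sexist, 0 = not
toy_texts = ["love day", "nice weekend", "happy friend", "good food",
             "hate group", "hate people", "racist hate", "sexist remark"]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Vectorizer and classifier are fitted together on the training text only
sentiment_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000)),
    ('clf', LogisticRegression(solver='liblinear')),
])
sentiment_pipeline.fit(toy_texts, toy_labels)

# New text is transformed with the train vocabulary, then classified
preds = sentiment_pipeline.predict(["happy sunny day", "racist hate"])
print(preds)
```

Because the pipeline only ever calls `transform` on new text, it avoids accidentally re-fitting the vectorizer at prediction time.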


**Motivation**

Hate speech is an unfortunately common occurrence on the Internet. Social media sites like Facebook and Twitter often face the problem of identifying and censoring problematic posts while weighing the right to freedom of speech. The importance of detecting and moderating hate speech is evident from the strong connection between hate speech and actual hate crimes. Early identification of users promoting hate speech could enable outreach programs that attempt to prevent an escalation from speech to action. Sites such as Twitter and Facebook have been seeking to actively combat hate speech. Despite these reasons, NLP research on hate speech has been very limited, primarily due to the lack of a general definition of hate speech, an analysis of its demographic influences, and an investigation of the most effective features.