# Emotion detection in text

We will use Neattext and Scikit-learn to build our model. Neattext is a Python library for preprocessing text. We use it to clean the dataset by removing noise such as Twitter user handles; it can also remove stop words.

source: https://www.section.io/engineering-education/nlp-based-detection-model-using-neattext-and-scikit-learn/

## 1. Exploring the dataset

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("7_emotion.csv")

In [3]:
df.head()

Unnamed: 0,Emotion,Text
0,neutral,Why ?
1,joy,Sage Act upgrade on my to do list for tommorow.
2,sadness,ON THE WAY TO MY HOMEGIRL BABY FUNERAL!!! MAN ...
3,joy,Such an eye ! The true hazel eye-and so brill...
4,joy,@Iluvmiasantos ugh babe.. hugggzzz for u .! b...


In [4]:
df['Emotion'].value_counts()

joy         11045
sadness      6722
fear         5410
anger        4297
surprise     4062
neutral      2254
disgust       856
shame         146
Name: Emotion, dtype: int64

In [5]:
# drop 'shame'
df.drop(df[df.Emotion == 'shame'].index, inplace=True)

In [6]:
df['Emotion'].value_counts()

joy         11045
sadness      6722
fear         5410
anger        4297
surprise     4062
neutral      2254
disgust       856
Name: Emotion, dtype: int64
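
The counts above show a strong class imbalance, and the notebook drops the rarest class before training. As a minimal sketch on a made-up toy frame (the rows are illustrative, not taken from `7_emotion.csv`), the same filtering can be written with a boolean mask, and `value_counts(normalize=True)` reports each class as a share of the total:

```python
import pandas as pd

# Toy data for illustration only; not the real dataset.
toy = pd.DataFrame({
    "Emotion": ["joy", "joy", "sadness", "shame", "joy", "fear"],
    "Text": ["a", "b", "c", "d", "e", "f"],
})

# Keep every row whose label is not 'shame' (a non-inplace alternative to drop()).
filtered = toy[toy["Emotion"] != "shame"].reset_index(drop=True)

# Share of each remaining class instead of raw counts.
shares = filtered["Emotion"].value_counts(normalize=True)
print(shares)
```

The mask version returns a new frame, so the original stays available for comparison.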

## 2. Pre-processing 

In [53]:
# Justin
'''neutral
fear
angry
sad
happy
surprise
disgust'''

'neutral\nfear\nangry\nsad\nhappy\nsurprise\ndisgust'

In [5]:
#!pip install neattext



In [7]:
import neattext.functions as nfx

Before using neattext, we list the functions and attributes that `neattext.functions` exposes for data cleaning.

In [71]:
dir(nfx)

['BTC_ADDRESS_REGEX',
 'CURRENCY_REGEX',
 'CURRENCY_SYMB_REGEX',
 'Counter',
 'DATE_REGEX',
 'EMAIL_REGEX',
 'EMOJI_REGEX',
 'HASTAG_REGEX',
 'MASTERCard_REGEX',
 'MD5_SHA_REGEX',
 'MOST_COMMON_PUNCT_REGEX',
 'NUMBERS_REGEX',
 'PHONE_REGEX',
 'PoBOX_REGEX',
 'SPECIAL_CHARACTERS_REGEX',
 'STOPWORDS',
 'STOPWORDS_de',
 'STOPWORDS_en',
 'STOPWORDS_es',
 'STOPWORDS_fr',
 'STOPWORDS_ru',
 'STOPWORDS_yo',
 'STREET_ADDRESS_REGEX',
 'TextFrame',
 'URL_PATTERN',
 'USER_HANDLES_REGEX',
 'VISACard_REGEX',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__generate_text',
 '__loader__',
 '__name__',
 '__numbers_dict',
 '__package__',
 '__spec__',
 '_lex_richness_herdan',
 '_lex_richness_maas_ttr',
 'clean_text',
 'defaultdict',
 'digit2words',
 'extract_btc_address',
 'extract_currencies',
 'extract_currency_symbols',
 'extract_dates',
 'extract_emails',
 'extract_emojis',
 'extract_hashtags',
 'extract_html_tags',
 'extract_mastercard_addr',
 'extract_md5sha',
 'extract_numbers',
 'extr

We are interested in only two functions from this list: remove_stopwords and remove_userhandles.

The dataset contains the Twitter handles of different users. These handles carry no information about emotion, so we treat them as noise. We pass remove_userhandles to the apply() method to strip them from every row of the Text column, and we save the result in a new column named Clean_Text.

In [8]:
# user handles
df['Clean_Text'] = df['Text'].apply(nfx.remove_userhandles)
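
To make the effect of this step concrete, here is a rough stand-in written with a plain regular expression. The pattern and the `strip_handles` helper are assumptions made for illustration; they are not neattext's own implementation, which may match handles differently.

```python
import re
import pandas as pd

# Assumed pattern: '@' followed by word characters. neattext's own
# pattern may differ; this is only a stand-in for illustration.
HANDLE_PATTERN = re.compile(r"@\w+")

def strip_handles(text):
    """Remove @handles and trim the surrounding whitespace."""
    return HANDLE_PATTERN.sub("", text).strip()

# Toy rows modelled on the tweets shown earlier.
toy = pd.DataFrame({"Text": ["@Iluvmiasantos ugh babe..", "A man robbed me today ."]})
toy["Clean_Text"] = toy["Text"].apply(strip_handles)
cleaned = toy["Clean_Text"].tolist()
print(cleaned)
```

Rows without a handle pass through unchanged, which matches what the Clean_Text column shows for most of the dataset.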

We can apply remove_stopwords in the same way, overwriting the Clean_Text column. In this notebook the call is commented out, so stop words are kept.

In [7]:
# stopwords
#df['Clean_Text'] = df['Clean_Text'].apply(nfx.remove_stopwords)

In [73]:
df

Unnamed: 0,Emotion,Text,Clean_Text
0,neutral,Why ?,Why ?
1,joy,Sage Act upgrade on my to do list for tommorow.,Sage Act upgrade on my to do list for tommorow.
2,sadness,ON THE WAY TO MY HOMEGIRL BABY FUNERAL!!! MAN ...,ON THE WAY TO MY HOMEGIRL BABY FUNERAL!!! MAN ...
3,joy,Such an eye ! The true hazel eye-and so brill...,Such an eye ! The true hazel eye-and so brill...
4,joy,@Iluvmiasantos ugh babe.. hugggzzz for u .! b...,ugh babe.. hugggzzz for u .! babe naamazed ...
...,...,...,...
34787,surprise,@MichelGW have you gift! Hope you like it! It'...,have you gift! Hope you like it! It's hand m...
34788,joy,The world didnt give it to me..so the world MO...,The world didnt give it to me..so the world MO...
34789,anger,A man robbed me today .,A man robbed me today .
34790,fear,"Youu call it JEALOUSY, I call it of #Losing YO...","Youu call it JEALOUSY, I call it of #Losing YO..."


## 3. Importing ML packages

`LogisticRegression` is a linear classification algorithm; despite its name, it predicts class labels rather than continuous values. We import it from Scikit-learn and use it for emotion classification.

Machine learning models cannot work with raw text directly; the text first has to be converted into a matrix of numbers. `CountVectorizer` does this by building a vocabulary from the corpus and counting how often each word occurs in each document. The resulting word counts are the features used as input to the model during training.
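
As a small illustration of that conversion, the sketch below runs `CountVectorizer` with its default settings on two made-up sentences (the sentences are assumptions, not rows from the dataset). Each column of the matrix corresponds to one vocabulary word and each row to one document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents, for illustration only.
docs = ["I am happy", "I am not happy, not at all"]

cv = CountVectorizer()
X = cv.fit_transform(docs)  # sparse matrix of shape (documents, vocabulary words)

# Vocabulary words ordered by their column index.
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
counts = X.toarray()

print(vocab)   # ['all', 'am', 'at', 'happy', 'not']
print(counts)  # [[0 1 0 1 0]
               #  [1 1 1 1 2]]
```

With the default settings the text is lowercased and single-character tokens such as "I" are dropped, which is why "i" does not appear in the vocabulary.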

The `train_test_split` function splits the dataset into two sets, a `train set` and a `test set`. The size of each set depends on the proportion specified by the user.

The `accuracy_score` function computes the fraction of predictions that match the true labels.

In [37]:
# Estimators
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Transformers
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer  
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
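
To show what `accuracy_score` measures, here is a minimal sketch with hand-written labels (the label lists are made up for illustration). It compares the library result with the same fraction computed by hand.

```python
from sklearn.metrics import accuracy_score

# Toy labels, for illustration only.
y_true = ["joy", "joy", "fear", "anger"]
y_pred = ["joy", "fear", "fear", "anger"]

# Fraction of positions where the prediction equals the true label.
acc = accuracy_score(y_true, y_pred)

# The same quantity computed by hand: 3 correct out of 4.
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(acc, manual)  # 0.75 0.75
```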

## 4. Model features and labels

**Features** are the attributes and variables extracted from the dataset. They are the inputs the model learns from during training. Our features are in the Clean_Text column.

**Labels** are the output or the target variable. Our label is the Emotion column, and this is what the model is predicting.

In [38]:
Xfeatures = df['Clean_Text']
ylabels = df['Emotion']

## 5. Data splitting and pipeline

We need to split our dataset into a train set and a test set. The model learns from the train set. We use the test set to evaluate how well the model performs on text it has not seen during training.

We specify test_size=0.3. This splits our dataset so that 70% of the data is used for training and 30% for testing.

In [39]:
# split data
x_train,x_test,y_train,y_test = train_test_split(Xfeatures,ylabels,test_size=0.3,random_state=42)
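
The effect of `test_size` and `random_state` can be checked on a toy list of ten items (the texts and labels below are placeholders, not dataset rows). This sketch shows the 70/30 split sizes and that a fixed `random_state` reproduces the same split on every call.

```python
from sklearn.model_selection import train_test_split

# Ten placeholder samples, for illustration only.
texts = [f"text {i}" for i in range(10)]
labels = ["joy"] * 5 + ["fear"] * 5

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=42
)

# Splitting again with the same random_state gives the same partition.
x_tr2, x_te2, _, _ = train_test_split(
    texts, labels, test_size=0.3, random_state=42
)

print(len(x_tr), len(x_te))  # 7 3
print(x_te == x_te2)         # True
```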

We then build a pipeline with two stages. The first stage is CountVectorizer, which converts the raw text into a matrix of word counts that the model can process.

The second stage is the LogisticRegression classifier. During training, it learns a weight for each word and class from the train set, and it uses these weights to make predictions.

In [40]:
from sklearn.pipeline import Pipeline

In [41]:
# LogisticRegression pipeline
pipe_lr = Pipeline(steps=[('cv',CountVectorizer()),('lr',LogisticRegression())])
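
To illustrate what the pipeline does when it is fitted and used, the sketch below trains the same two stages by hand on a few made-up sentences and checks that both routes give the same probabilities. The toy texts and labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data, for illustration only.
texts = ["so happy today", "happy and glad", "very scared now", "scared and afraid"]
labels = ["joy", "joy", "fear", "fear"]

# Route 1: the pipeline vectorizes and trains in a single fit() call.
pipe = Pipeline(steps=[("cv", CountVectorizer()), ("lr", LogisticRegression())])
pipe.fit(texts, labels)

# Route 2: the same two stages run by hand.
cv = CountVectorizer()
X = cv.fit_transform(texts)
lr = LogisticRegression().fit(X, labels)

new = ["happy happy"]
p_pipe = pipe.predict_proba(new)
p_manual = lr.predict_proba(cv.transform(new))

print(np.allclose(p_pipe, p_manual))
```

The practical benefit of the pipeline is that raw strings can be passed straight to `fit`, `predict`, and `score` without repeating the vectorizing step.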

## 6. Model fitting

After defining the two stages, we fit the pipeline on the train set, which is given by x_train and y_train.

In [42]:
# train and fit data
pipe_lr.fit(x_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Pipeline(steps=[('cv', CountVectorizer()), ('lr', LogisticRegression())])
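
The warning printed during fitting suggests raising `max_iter`. A minimal sketch of that option on toy data follows; the value 1000 is an arbitrary assumption, not a setting tuned for this dataset, and refitting the real pipeline this way may change the accuracy reported in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data, for illustration only.
texts = ["so happy today", "happy and glad", "very scared now", "scared and afraid"]
labels = ["joy", "joy", "fear", "fear"]

# max_iter=1000 is an assumed value that gives the solver more iterations.
pipe = Pipeline(steps=[
    ("cv", CountVectorizer()),
    ("lr", LogisticRegression(max_iter=1000)),
])
pipe.fit(texts, labels)

# n_iter_ reports how many iterations the solver actually used.
iterations_used = pipe.named_steps["lr"].n_iter_
print(iterations_used)
```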

In [43]:
pipe_lr

Pipeline(steps=[('cv', CountVectorizer()), ('lr', LogisticRegression())])

In [44]:
# check accuracy
pipe_lr.score(x_test,y_test)

0.6421012122378296
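
The score above is the overall accuracy on the test set. Because the classes are imbalanced, per-class figures can be more informative. The sketch below uses the `confusion_matrix` and `classification_report` functions imported earlier on hand-written toy labels (assumptions for illustration, not model output from this notebook).

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels, for illustration only.
labels = ["anger", "fear", "joy"]
y_true = ["joy", "joy", "fear", "anger", "fear"]
y_pred = ["joy", "fear", "fear", "anger", "fear"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Per-class precision, recall and F1 as a nested dictionary.
report = classification_report(y_true, y_pred, labels=labels, output_dict=True)

print(cm)
print(report["joy"])
```

On the real data, the same calls would take `y_test` and `pipe_lr.predict(x_test)` as inputs.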

In [32]:
pipe_lr["lr"].classes_

array(['anger', 'disgust', 'fear', 'joy', 'neutral', 'sadness',
       'surprise'], dtype=object)

In [33]:
# make a prediction
sample1 = "This chocolate was very sweet it made me happy"

In [34]:
# actual prediction
pipe_lr.predict([sample1])

array(['joy'], dtype=object)

In [35]:
text_1 = "this is all wrong I shouldn't be up here I should be back in school on the other side of the ocean yet you all come to us young people for Hope how dare you you have stolen my dreams and my childhood with your empty words and yet I'm one of the lucky ones people are suffering people are dying entire ecosystems are collapsing we are in the beginning of a mass extinction and all you can talk about is the money and fairy tales of Eternal economic growth how dare you"


In [49]:
# show output ordered by descending probability, with emotion name and percentage

# pair each emotion class with its predicted probability
lst_1 = [(clss, prob) for clss, prob in zip(pipe_lr["lr"].classes_, pipe_lr.predict_proba([text_1])[0])]
# list to DataFrame
df_1 = pd.DataFrame(lst_1)
# rename columns of df
df_1.rename(columns={0:'emotion', 1: 'percentage'}, inplace = True)
# sort values
df_1.sort_values(by="percentage", inplace = True, ascending=False)
# reset index
df_1.reset_index(drop = True, inplace = True)
# show percentage
df_1['percentage'] = df_1['percentage'].apply(lambda x:round(x, 4)*100)
print(df_1)

    emotion  percentage
0      fear       68.44
1   sadness       17.33
2     anger       13.81
3   disgust        0.32
4       joy        0.09
5  surprise        0.01
6   neutral        0.00


In [52]:
text_2 = "thank you thank you thank you thank you to the academy for this all 6,000 members thank you to the other nominees all these performances were impeccable in my opinion I didn't see a false note anywhere I want to thank valet or director Jennifer garnered with daily there's a few things about three things to my account that I need each day one of them is something to look up to another is something to look forward to in another is someone to chase now first off I want to thank God because that's who I look up to he's great my life with opportunities that I know are not of my hand or any other human hand he is showing me that it's a scientific fact that gratitude reciprocates"


In [53]:
# show output ordered by descending probability, with emotion name and percentage

# pair each emotion class with its predicted probability
lst_2 = [(clss, prob) for clss, prob in zip(pipe_lr["lr"].classes_, pipe_lr.predict_proba([text_2])[0])]
# list to DataFrame
df_2 = pd.DataFrame(lst_2)
# rename columns of df
df_2.rename(columns={0:'emotion', 1: 'percentage'}, inplace = True)
# sort values
df_2.sort_values(by="percentage", inplace = True, ascending=False)
# reset index
df_2.reset_index(drop = True, inplace = True)
# show percentage
df_2['percentage'] = df_2['percentage'].apply(lambda x:round(x, 4)*100)
print(df_2)

    emotion  percentage
0       joy       99.06
1     anger        0.93
2   sadness        0.00
3      fear        0.00
4  surprise        0.00
5   disgust        0.00
6   neutral        0.00


In [54]:
text_3 = "I used to think the whole purpose of life was pursuing happiness everyone said the path to happiness was success so I searched for that ideal job that perfect boyfriend that beautiful apartment but instead of ever feeling fulfilled I felt anxious and the drift and I wasn't alone my friends they struggled with this too"


In [55]:
# show output ordered by descending probability, with emotion name and percentage

# pair each emotion class with its predicted probability
lst_3 = [(clss, prob) for clss, prob in zip(pipe_lr["lr"].classes_, pipe_lr.predict_proba([text_3])[0])]
# list to DataFrame
df_3 = pd.DataFrame(lst_3)
# rename columns of df
df_3.rename(columns={0:'emotion', 1: 'percentage'}, inplace = True)
# sort values
df_3.sort_values(by="percentage", inplace = True, ascending=False)
# reset index
df_3.reset_index(drop = True, inplace = True)
# show percentage
df_3['percentage'] = df_3['percentage'].apply(lambda x:round(x, 4)*100)
print(df_3)

    emotion  percentage
0   sadness       45.09
1       joy       43.31
2      fear       11.52
3   disgust        0.05
4     anger        0.02
5  surprise        0.00
6   neutral        0.00
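
The three cells above repeat the same steps for each text. One way to avoid the repetition is a small helper; the name `emotion_table` and the toy model below are hypothetical and only serve to make the sketch runnable on its own.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def emotion_table(model, text):
    """Return every class with its probability in percent, highest first."""
    probs = model.predict_proba([text])[0]
    table = pd.DataFrame({"emotion": model.classes_, "percentage": probs * 100})
    table = table.sort_values(by="percentage", ascending=False).reset_index(drop=True)
    table["percentage"] = table["percentage"].round(2)
    return table

# Toy model trained on made-up sentences, for illustration only.
texts = ["so happy today", "happy and glad", "very scared now",
         "scared and afraid", "so angry today", "angry and furious"]
labels = ["joy", "joy", "fear", "fear", "anger", "anger"]
toy_model = Pipeline(steps=[("cv", CountVectorizer()), ("lr", LogisticRegression())])
toy_model.fit(texts, labels)

table = emotion_table(toy_model, "happy and glad today")
print(table)
```

With the notebook's own model, the call would be `emotion_table(pipe_lr, text_1)`.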


## Saving the model

In [57]:
# save the model to disk
import pickle
filename = 'emo_text_model_cv.sav'
# the context manager closes the file once the model is written
with open(filename, 'wb') as f:
    pickle.dump(pipe_lr, f)

In [58]:
# load the model for predictions
import pickle
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.predict([text_1])[0]
print(result)

fear


In [59]:
# load the model for predictions
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.predict([text_2])[0]
print(result)

joy


In [60]:
# load the model for predictions
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.predict([text_3])[0]
print(result)

sadness
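
As a self-contained check of the save and load steps, the sketch below pickles a toy pipeline to a temporary file and confirms that the restored model gives the same predictions. The toy data and the file name `toy_model.sav` are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy model trained on made-up sentences, for illustration only.
texts = ["so happy today", "happy and glad", "very scared now", "scared and afraid"]
labels = ["joy", "joy", "fear", "fear"]
pipe = Pipeline(steps=[("cv", CountVectorizer()), ("lr", LogisticRegression())])
pipe.fit(texts, labels)

# Write the fitted pipeline to a temporary location.
path = os.path.join(tempfile.mkdtemp(), "toy_model.sav")
with open(path, "wb") as f:
    pickle.dump(pipe, f)

# Read it back and compare predictions on new text.
with open(path, "rb") as f:
    restored = pickle.load(f)

samples = ["happy today", "afraid now"]
same = bool((pipe.predict(samples) == restored.predict(samples)).all())
print(same)
```

Pickle files can execute arbitrary code when loaded, so a saved model should only be loaded from a trusted source, ideally with the same Scikit-learn version that wrote it.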
