## MultiClass Classification - Emotions Detection from Text Message

## Table of Contents
- Data Source
- Data Description
- Goal
- Importing Packages and Loading Data
- EDA and Feature Engineering
- Text Preprocessing
- Multi-Classification Models
    - Splitting the data: train and test
    - Models
- Comparison of model performance
- Model Evaluation
    - Precision, Recall, F1-Score
    - Confusion Matrix
- Predictions
- Pipeline


### Source of Dataset

> We will be using the [Emotions dataset for NLP](https://www.kaggle.com/praveengovi/emotions-dataset-for-nlp) by Praveen.

### Format of Dataset

> | text         | emotion |
> |--------------|---------|
> |i didnt feel humiliated | sadness |
> |i can go from feeling so hopeless to so damned hopeful just from being around... | sadness |
> |im grabbing a minute to post i feel greedy wrong | anger |
> |i am ever feeling nostalgic about the fireplace i will know that it is still... | love |

> **Note:** ***text*** and ***emotion*** are separated by a semi-colon ***';'***.
<br>
<pre>
i didnt feel humiliated;sadness
i am feeling grouchy;anger
...
</pre>
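As a quick illustration of this format, here is a minimal sketch that parses semicolon-separated lines with Python's standard `csv` module. The helper name `parse_emotion_lines` and the two sample lines are only for illustration; the notebook itself loads the file with pandas.

```python
import csv
import io

# Two sample lines in the "text;emotion" format described above.
sample = "i didnt feel humiliated;sadness\ni am feeling grouchy;anger\n"

def parse_emotion_lines(handle):
    """Return a list of (text, emotion) tuples from a semicolon-separated stream."""
    # QUOTE_NONE: treat quote characters as ordinary text instead of field delimiters.
    reader = csv.reader(handle, delimiter=";", quoting=csv.QUOTE_NONE)
    return [(text, emotion) for text, emotion in reader]

rows = parse_emotion_lines(io.StringIO(sample))
for text, emotion in rows:
    print(f"{emotion:>8} <- {text}")
```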

### Data Description:
The data is a collection of tweets annotated with the emotions behind them. We have two columns:
- `emotion`
- `text`  

In `text`, we have the raw text message. In `emotion`, we have the emotion behind the message.

### Goal:
Emotion detection from text is a challenging problem in NLP: humans express a wide variety of emotions, and it is difficult to collect enough records for each one. Here we have a labeled dataset for emotion detection, and the objective is to build an efficient model that detects the emotion behind a message.

### Importing Packages and Loading the Dataset

In [1]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
import opendatasets as od
pd.set_option('display.max_colwidth', None)  # show full text; recent pandas versions no longer accept -1

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [None]:
url = "https://www.kaggle.com/datasets/pashupatigupta/emotion-detection-from-text"
# od.download(url)

In [2]:
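# train.txt has no header row, so the column names are supplied explicitly
df = pd.read_csv('dataset/train.txt', delimiter=";", names=['text', 'emotion'])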
df.head()

Unnamed: 0,text,emotion
0,i didnt feel humiliated,sadness
1,i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake,sadness
2,im grabbing a minute to post i feel greedy wrong,anger
3,i am ever feeling nostalgic about the fireplace i will know that it is still on the property,love
4,i am feeling grouchy,anger


In [4]:
# !cat dataset/test.txt

In [None]:
df.head()

In [None]:
df.shape
# total 40000 rows and 2 columns

In [None]:
# Descriptive view of the categorical column
df.emotion.describe()

In [None]:
# all the unique values of emotion along with their count
df.emotion.value_counts()

In [None]:
# NOTE: run the next cell first; it defines text_preprocessing
df['clean_text'] = df.text.apply(text_preprocessing)

In [None]:
import re
import string
import contractions
from cleantext import clean
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
wn = WordNetLemmatizer()

def text_preprocessing(mystr):
    mystr = mystr.lower()                                               # case folding
    mystr = re.sub(r'\w*\d\w*', '', mystr)                              # remove words containing digits
    mystr = re.sub(r'\n', ' ', mystr)                                   # replace newline characters with a space
    mystr = re.sub(r'<.*?>', '', mystr)                                 # remove HTML tags
    mystr = re.sub(r'\[.*?\]', '', mystr)                               # remove text in square brackets
    mystr = re.sub(r'https?://\S+|www\.\S+', '', mystr)                 # remove URLs
    mystr = clean(mystr, no_emoji=True)                                 # remove emojis
    # expand contractions while apostrophes are still present, then lowercase again
    mystr = ' '.join([contractions.fix(word) for word in mystr.split()]).lower()
    mystr = re.sub(r'[‘’“”…]', '', mystr)                               # remove curly quotes and ellipses
    mystr = ''.join([c for c in mystr if c not in string.punctuation])  # remove punctuation

    tokens = word_tokenize(mystr)                                       # tokenize the string
    tokens = [token for token in tokens if token not in stop_words]     # remove stopwords
#   tokens = [ps.stem(token) for token in tokens]                       # stemming
    tokens = [wn.lemmatize(token) for token in tokens]                  # lemmatization
    new_str = ' '.join(tokens)
    return new_str

### EDA and Feature Engineering

In [None]:
df.isna().sum()
# There are no missing values in any column

In [None]:
df.duplicated().sum()
# There are 91 duplicated rows in the dataset

In [None]:
# drop all the duplicate values
df.drop_duplicates(ignore_index=True, inplace=True)

In [None]:
df.duplicated().sum()
# After dropping them, no duplicated rows remain

In [None]:
# character count of a sample tweet (the new column is added a few cells below)
print(df.text[0])
len(df.text[0])

In [None]:
print(df.text[22])
len(df.text[22])

In [None]:
print(df.text[201])
len(df.text[201])

In [None]:
df['char_length'] = df.text.apply(lambda  x : len(x))
df.head()

In [None]:
# token/word count of a sample tweet (the new column is added a few cells below)
len(df.text[0].split(" "))

In [None]:
len(df.text[22].split(" "))

In [None]:
len(df.text[45].split(" "))

In [None]:
len(df.text[2985].split(" "))

In [None]:
df['token_length'] = df.text.apply(lambda x: len(x.split(" ")))
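One detail worth checking in the token count: `split(" ")` splits on every single space, so repeated or trailing spaces produce empty strings that are counted as tokens, while `split()` with no argument splits on runs of whitespace. A minimal sketch with a made-up sample string (not taken from the dataset):

```python
# Made-up sample with repeated and trailing spaces.
sample = "i  feel   fine "

tokens_single_space = sample.split(" ")   # keeps empty strings between consecutive spaces
tokens_whitespace = sample.split()        # collapses any run of whitespace

print(tokens_single_space)       # ['i', '', 'feel', '', '', 'fine', '']
print(len(tokens_single_space))  # 7
print(tokens_whitespace)         # ['i', 'feel', 'fine']
print(len(tokens_whitespace))    # 3
```

If the tweets contain only single spaces, both variants give the same count.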

In [None]:
df.head()

#### Emotion values distribution

In [None]:
sns.set_style('darkgrid')
plt.figure(figsize=(8,8))
sns.countplot(y= df.emotion)
# plt.xticks(rotation=90)
plt.yticks(size=12)
plt.xticks(size=13)
plt.title("Emotion-Type vs Values", fontdict={'fontsize':20})
plt.xlabel("Values", fontdict={'fontsize':15})
plt.ylabel("Emotion-Type", fontdict={'fontsize':15})
plt.show()

#### Percentage of each Emotion

In [None]:
sns.set_style('darkgrid')
plt.figure(figsize=(12,12))
plt.pie(df.emotion.value_counts(), labels=df.emotion.value_counts().index, autopct='%.2f')
plt.title("Percentage of each emotion", fontsize=20)
plt.show()
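The percentages shown in the pie chart can also be computed directly. A minimal sketch using `collections.Counter` on a made-up list of labels (the list is hypothetical and does not reflect the real class distribution):

```python
from collections import Counter

# Hypothetical labels, only to illustrate the arithmetic.
labels = ["joy", "sadness", "joy", "anger", "joy", "sadness", "love", "joy"]

counts = Counter(labels)
total = sum(counts.values())

# Share of each emotion in percent, rounded like autopct='%.2f' in the pie chart.
percentages = {emotion: round(100 * n / total, 2) for emotion, n in counts.most_common()}
print(percentages)
```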

#### Distribution of Character Length in each tweet

In [None]:
plt.figure(figsize=(8,8))
sns.histplot(df.char_length, kde=True)  # distplot is deprecated in recent seaborn releases
plt.title("Number of Characters in the tweet", fontsize=20)
plt.show()

#### Distribution of tokens/words in the tweet

In [None]:
plt.figure(figsize=(8,8))
sns.histplot(df.token_length, kde=True)  # distplot is deprecated in recent seaborn releases
plt.title("Number of tokens(words) in the tweet", fontsize=20)
plt.show()

#### Distribution of top 5 emotions character-length wise

In [None]:
# First Method
df1 = df.groupby('emotion')['char_length'].count().sort_values(ascending=False).head(5).reset_index()
df1

In [None]:
# second method
for sentiment in df.emotion.value_counts().sort_values()[-5:].index.tolist():
    print(sentiment)

In [None]:
# for sentiment in df.sentiment.value_counts().sort_values()[-5:].index.tolist():
#     print(df[df['sentiment']==sentiment]['char_length'])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
for emotion in df['emotion'].value_counts().sort_values()[-5:].index.tolist():
    sns.kdeplot(df[df['emotion']==emotion]['char_length'],ax=ax, label=emotion)
ax.legend()
ax.set_title("Distribution of top 5 emotions character-length wise", fontsize=20)
plt.show()

#### Distribution of top 5 emotions token-length wise

In [None]:
# First Method
df2 = df.groupby('emotion')['token_length'].count().sort_values(ascending=False).head(5).reset_index()
df2

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
for emotion in df['emotion'].value_counts().sort_values()[-5:].index.tolist():
    sns.kdeplot(df[df['emotion']==emotion]['token_length'],ax=ax, label=emotion)
ax.legend()
ax.set_title("Distribution of top 5 emotions token-length wise", fontsize=20)
plt.show()

#### Average Length of each Tweet characters and Tokens wise

In [None]:
avg_df = df.groupby('emotion').agg({'char_length':'mean', 'token_length':'mean'}).reset_index()
avg_df

In [None]:
plt.figure(figsize=(8,8))
plt.yticks(size=12)
plt.xticks(size=13)
sns.barplot(y=avg_df.emotion, x=avg_df.char_length)
plt.title("Average Length of each Tweet characters wise", fontdict={'fontsize':20})
plt.xlabel("Char Length", fontdict={'fontsize':15})
plt.ylabel("Emotion-Type", fontdict={'fontsize':15})
plt.show()

In [None]:
plt.figure(figsize=(8,8))
plt.yticks(size=12)
plt.xticks(size=13)
sns.barplot(y=avg_df.emotion, x=avg_df.token_length)
plt.title("Average Length of each Tweet token wise", fontdict={'fontsize':20})
plt.xlabel("Token Length", fontdict={'fontsize':15})
plt.ylabel("Emotion-Type", fontdict={'fontsize':15})
plt.show()

### Text Preprocessing

#### Case Folding and Cleaning Data

In [None]:
import re
def clean_text(text):
    # remove @mentions and #hashtags (the tag word itself is dropped too)
    text = re.sub(r"@\w+|#\w+", "", text)

    # remove any stray '#' characters left in the text
    text = re.sub(r"#", "", text)

    # remove the retweet marker "RT"
    text = re.sub(r"RT[\s]+", "", text)

    # remove hyperlinks
    text = re.sub(r"\w+:\/\/\S+", "", text)

    # replace everything that is not a letter (punctuation, digits) with a space
    text = re.sub(r"[^a-zA-Z]", " ", text)

    # convert to lowercase (str.lower returns a new string, so assign it back)
    text = text.lower()
    return text
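
To see what these cleaning steps do to a concrete tweet, here is a minimal, self-contained sketch. The function name `clean_text_sketch`, the sample tweet, and the final whitespace-collapsing step are assumptions added for illustration; the regular expressions mirror the ones used above.

```python
import re

def clean_text_sketch(text):
    """Illustrative copy of the cleaning steps, plus whitespace collapsing."""
    text = re.sub(r"@\w+|#\w+", "", text)      # drop @mentions and #hashtags
    text = re.sub(r"RT[\s]+", "", text)        # drop the retweet marker
    text = re.sub(r"\w+:\/\/\S+", "", text)    # drop URLs
    text = re.sub(r"[^a-zA-Z]", " ", text)     # replace non-letters with spaces
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace (added step)
    return text.lower()

# Made-up tweet, not taken from the dataset.
sample = "RT @user I am SO happy today!!! #blessed https://example.com/x1"
cleaned = clean_text_sketch(sample)
print(cleaned)  # i am so happy today
```

The order matters: the URL must be removed before the non-letter replacement, otherwise its fragments would survive as separate words.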
    

In [None]:
df.text.apply(clean_text)

In [None]:
# case folding
temp = df['text'].str.lower()
temp

In [None]:
# remove hashtag and mention using regex
import re
temp = temp.apply(lambda x: re.sub(r'@\w+|#\w+', '', x))
temp

In [None]:
# remove url using regex
temp = temp.apply(lambda x: re.sub(r'http\S+', '', x))
temp

In [None]:
# remove punctuation
import string
temp = temp.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
temp

In [None]:
# remove number
temp = temp.apply(lambda x: re.sub(r'\d+', '', x))
temp

In [None]:
# remove whitespace
temp = temp.apply(lambda x: x.strip())
temp

#### Stopwords Removal

In [None]:
stop = stopwords.words('english')
temp = temp.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
temp

#### Lemmatization


In [None]:
lemmatizer = WordNetLemmatizer()
temp = temp.apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))
temp

#### Tokenization


In [None]:
temp = temp.apply(lambda x: word_tokenize(x))
temp

In [None]:
# temp to df content_token
df['content_token'] = temp
df

In [None]:
df.isna().sum()

In [None]:
# remove NaN data in content_token
df = df.dropna(subset=['content_token'])
df

#### Finding and Removing Duplicate Synonyms

In [None]:
## Find synonym of each token.
from nltk.corpus import wordnet
def find_synonym(word):
    synonyms = []
    for syn in wordnet.synsets(word):
        for l in syn.lemmas():
            synonyms.append(l.name())
    return synonyms

df['synonym'] = df['content_token'].apply(lambda x: [find_synonym(word) for word in x])
df.head()

#### Dictionary of word index


In [None]:
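# Map a running index to each unique token, in first-seen order
index_word = {}
seen = set()
for tokens in df['content_token']:
    for word in tokens:
        if word not in seen:          # membership is checked on the words, not on the integer keys
            seen.add(word)
            index_word[len(index_word)] = word
words = list(index_word.values())
words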
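
For reference, a minimal sketch of collecting unique tokens in first-seen order with `dict.fromkeys`, which relies on Python dictionaries preserving insertion order. The token lists below are made up for illustration.

```python
# Made-up token lists standing in for the tokenized tweets.
token_lists = [["feel", "sad"], ["feel", "happy"], ["sad", "day"]]

# dict.fromkeys keeps the first occurrence of each key and drops later repeats.
unique_words = list(dict.fromkeys(word for tokens in token_lists for word in tokens))
index_to_word = dict(enumerate(unique_words))

print(unique_words)   # ['feel', 'sad', 'happy', 'day']
print(index_to_word)  # {0: 'feel', 1: 'sad', 2: 'happy', 3: 'day'}
```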

#### Build the synonym dictionary using the find_synonym function


In [None]:
synonym_dict = {}
for word in words:
    synonym_dict.update({word : tuple([w.lower() for w in find_synonym(word)])})

synonym_dict

#### Remove duplicate synonyms


In [None]:
for key, value in synonym_dict.items():
    synonym_dict[key] = tuple(set(value))

synonym_dict

#### Remove empty entries from synonym_dict


In [None]:
synonym_dict = {k: v for k, v in synonym_dict.items() if v}

synonym_dict

In [None]:
import collections
value_occurrences = collections.Counter(synonym_dict.values())

filtered_synonym = {key: value for key, value in synonym_dict.items() if value_occurrences[value] == 1}

filtered_synonym
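
The `Counter` filter above keeps only those words whose synonym tuple is not shared with any other word. One caveat: `tuple(set(value))` does not guarantee a stable element order, so two words with the same synonyms could in principle end up with differently ordered tuples. A minimal sketch on a toy dictionary (all entries are made up) that sorts the synonyms first so equal sets always compare equal:

```python
from collections import Counter

# Toy synonym dictionary; the entries are hypothetical.
toy_synonyms = {
    "happy": ("glad", "happy", "felicitous"),
    "glad": ("felicitous", "glad", "happy"),   # same synonyms, different order
    "angry": ("angry", "furious"),
}

# Sorting makes the tuple a canonical representation of the synonym set.
normalized = {word: tuple(sorted(set(syns))) for word, syns in toy_synonyms.items()}
occurrences = Counter(normalized.values())

# Keep only words whose synonym set is unique across the dictionary.
unique_only = {word: syns for word, syns in normalized.items() if occurrences[syns] == 1}
print(unique_only)  # {'angry': ('angry', 'furious')}
```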

#### Data Augmentation by replacing words with WordNet synonyms

In [None]:
## Function for augmenting data by replacing words with WordNet synonyms
import re
import random
sr = random.SystemRandom()
split_pattern = re.compile(r'\s+')
def data_augmentation(message, aug_range=1):
    augmented_messages = []
    for _ in range(aug_range):
        # swap each word for a random synonym; words without an entry are kept as-is
        new_words = [sr.choice(filtered_synonym.get(word, [word]))
                     for word in filter(None, split_pattern.split(message))]
        augmented_messages.append(" ".join(new_words))
    return augmented_messages
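
A self-contained sketch of the same idea on a toy synonym table, so the behaviour can be tried without the full dataset. The names `augment` and `toy_synonyms` are hypothetical, and a seeded `random.Random` is used here (instead of `SystemRandom`) only to make the run repeatable.

```python
import random

# Toy synonym table; the entries are made up for illustration.
toy_synonyms = {"happy": ("glad", "joyful"), "sad": ("unhappy",)}

def augment(message, synonyms, n_variants=1, rng=None):
    """Return n_variants copies of message with words swapped for random synonyms."""
    rng = rng or random.Random()
    variants = []
    for _ in range(n_variants):
        # Words without an entry in the table are kept unchanged.
        words = [rng.choice(synonyms.get(word, (word,))) for word in message.split()]
        variants.append(" ".join(words))
    return variants

variants = augment("i feel happy", toy_synonyms, n_variants=3, rng=random.Random(0))
for v in variants:
    print(v)
```

Each variant keeps "i feel" and replaces "happy" with one of its listed synonyms.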

In [None]:
tweet_count = df.emotion.value_counts().to_dict()
tweet_count

In [None]:
# df

In [None]:
## Get max intent count to match other minority classes through data augmentation
import operator
max_intent_count = max(tweet_count.items(), key=operator.itemgetter(1))[1]
max_intent_count

#### Balance Data
Because the classes are heavily imbalanced (for example, `neutral` has 8638 rows while `anger` has only 110), we balance the data so that the models are not biased toward the majority classes. We use oversampling: each minority class is topped up with synthetic, synonym-augmented rows until it matches the size of the largest class.
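
The oversampling arithmetic can be checked by hand with the two counts quoted above. This is only a worked example of the formula used in the balancing loop; the variable names are chosen for illustration.

```python
import math

largest_class = 8638   # rows in the largest class (neutral)
minority_class = 110   # rows in a minority class (anger)

# Rows that must be added so the minority class matches the largest one.
count_diff = largest_class - minority_class

# Augmented copies to generate per existing message (rounded up so we never fall short).
copies_per_message = math.ceil(count_diff / minority_class)

# Total augmented rows generated before random down-sampling to count_diff.
generated = minority_class * copies_per_message

print(count_diff)          # 8528
print(copies_per_message)  # 78
print(generated)           # 8580
```

Because 8580 augmented rows are generated but only 8528 are needed, a random subset is kept so the class ends up with exactly 8638 rows.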

In [None]:
import math
import tqdm
# tqdm draws a progress bar for the loop below.
newdf = pd.DataFrame()
for intent, count in tweet_count.items():
    count_diff = max_intent_count - count                 # rows needed to reach the largest class
    multiplication_count = math.ceil(count_diff / count)  # augmented copies to generate per message
    # existing rows of this class; `df` stores them as text/emotion, while the
    # balanced frame uses the content/sentiment column names from here on
    old_message_df = df.loc[df["emotion"] == intent, ["text", "emotion"]].rename(
        columns={"text": "content", "emotion": "sentiment"})
    if multiplication_count:
        new_messages = []
        for message in tqdm.tqdm(old_message_df["content"].values):
            # create augmented variants of each existing message
            new_messages.extend(data_augmentation(message, multiplication_count))
        new_message_df = pd.DataFrame({"content": new_messages, "sentiment": intent})

        # keep a random subset so the class ends up with exactly max_intent_count rows
        new_message_df = new_message_df.take(np.random.permutation(len(new_message_df))[:count_diff])

        # merge existing and augmented rows
        newdf = pd.concat([newdf, old_message_df, new_message_df])
    else:
        # the largest class needs no augmentation
        newdf = pd.concat([newdf, old_message_df])

In [None]:
newdf.head(2)

In [None]:
newdf.sentiment.value_counts()

In [None]:
## Save newdf to csv file
newdf.to_csv('dataset/augmented_data.csv', index=False)
clean_df = pd.read_csv('dataset/augmented_data.csv')
clean_df.head(2)

In [None]:
clean_df['sentiment'].value_counts().plot(kind='bar')

In [None]:
# cleaning the tweets using the clean_text function
clean_df['clean_tweet'] = clean_df['content'].apply(lambda x: clean_text(x))

# lower casing clean_tweet column
clean_df['clean_tweet'] = clean_df['clean_tweet'].apply(lambda x: x.lower())

# function to remove stop words from clean_tweet column
def remove_stopwords(text):
    text = [word for word in text.split() if word not in stop]
    return " ".join(text)

# stopword removal
clean_df['clean_tweet'] = clean_df['clean_tweet'].apply(lambda x: remove_stopwords(x))

clean_df.head(1)

In [None]:
# function to lemmatize the clean_tweet column
def lemmatization(text):
    text = [lemmatizer.lemmatize(word) for word in text.split()]
    return " ".join(text)
clean_df['clean_tweet'] = clean_df['clean_tweet'].apply(lambda x: lemmatization(x))

# tokenization using word_tokenize
clean_df['clean_tweet_token'] = clean_df['clean_tweet'].apply(lambda x: word_tokenize(x))

clean_df.head(1)

### Multi-Classification Models

In [None]:
from sklearn.model_selection import train_test_split

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(clean_df['clean_tweet_token'], clean_df['sentiment'], test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

### TF-IDF

In [None]:
# vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train.astype('U'))
X_test = vectorizer.transform(X_test.astype('U'))
X_train.shape, X_test.shape
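
To make the vectorization step less of a black box, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The toy documents and the helper `tfidf_vector` are assumptions for illustration, not a replacement for the library.

```python
import math
from collections import Counter

# Toy tokenized documents (made up).
docs = [["feel", "sad"], ["feel", "happy"], ["feel", "happy", "happy"]]

vocab = sorted({word for doc in docs for word in doc})        # ['feel', 'happy', 'sad']
n_docs = len(docs)
doc_freq = Counter(word for doc in docs for word in set(doc))  # documents containing each word

# Smoothed inverse document frequency.
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

def tfidf_vector(doc):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(doc)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw] if norm else raw

vectors = [tfidf_vector(doc) for doc in docs]
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vec])
```

A word that appears in every document ("feel") gets the lowest IDF, so rarer words such as "sad" dominate the vector of the documents that contain them.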

### Model Making

#### Model Multinomial Naive Bayes

In [None]:
# model training

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

mnb = MultinomialNB()

# Fit model
mnb.fit(X_train, y_train)

y_pred_train = mnb.predict(X_train)

acc_train = accuracy_score(y_train, y_pred_train)

y_pred_test = mnb.predict(X_test)

acc_test = accuracy_score(y_test, y_pred_test)

print(f'The Results of the calculation of the accuracy of the Train Data : {acc_train}')
print(f'The Results of the calculation of the accuracy of the Test Data : {acc_test}')

### Model Evaluation

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred_test))
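
The numbers in the classification report follow from simple counts of true positives, false positives, and false negatives per class. A minimal sketch on made-up labels (the helper `per_class_scores` is hypothetical):

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # predicted label, and correct
    fp = sum(t != label and p == label for t, p in pairs)  # predicted label, but wrong
    fn = sum(t == label and p != label for t, p in pairs)  # missed the label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for illustration.
y_true_toy = ["joy", "joy", "anger", "sadness", "anger", "joy"]
y_pred_toy = ["joy", "anger", "anger", "sadness", "joy", "joy"]

precision, recall, f1 = per_class_scores(y_true_toy, y_pred_toy, "joy")
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```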


In [None]:
cm = confusion_matrix(y_test, y_pred_test)
print(confusion_matrix(y_test, y_pred_test))
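
When reading the matrix, rows correspond to the true labels and columns to the predicted labels, both in sorted label order. A minimal pure-Python sketch that builds the same kind of table for made-up labels (the helper `confusion_counts` is hypothetical):

```python
def confusion_counts(y_true, y_pred):
    """Return (labels, matrix) with matrix[i][j] = count of true labels[i] predicted as labels[j]."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return labels, matrix

# Made-up labels for illustration.
y_true_toy = ["joy", "joy", "anger", "sadness", "anger", "joy"]
y_pred_toy = ["joy", "anger", "anger", "sadness", "joy", "joy"]

labels_toy, matrix_toy = confusion_counts(y_true_toy, y_pred_toy)
print(labels_toy)  # ['anger', 'joy', 'sadness']
for label, row in zip(labels_toy, matrix_toy):
    print(f"{label:>8}: {row}")
```

The diagonal holds the correct predictions; every off-diagonal entry is a confusion between two emotions.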

In [None]:
fig, ax = plt.subplots(figsize=(13,13))
from sklearn.metrics import ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=mnb.classes_)
# ax.tick_params()
disp.plot(ax=ax);

#### Model Linear SVC

In [None]:
# Model using Linear SVC

from sklearn.svm import LinearSVC

# Initialize LinearSVC

lsvc = LinearSVC()

# Fit model
lsvc.fit(X_train, y_train)

y_pred_train = lsvc.predict(X_train)

acc_train = accuracy_score(y_train, y_pred_train)

y_pred_test = lsvc.predict(X_test)

acc_test = accuracy_score(y_test, y_pred_test)


print(f'The Results of the calculation of the accuracy of the Train Data : {acc_train}')
print(f'The Results of the calculation of the accuracy of the Test Data : {acc_test}')


#### Model Evaluation


In [None]:

print(classification_report(y_test, y_pred_test))

In [None]:
cm = confusion_matrix(y_test, y_pred_test)
print(confusion_matrix(y_test, y_pred_test))

In [None]:
fig, ax = plt.subplots(figsize=(13,13))
from sklearn.metrics import ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=lsvc.classes_)
# ax.tick_params()
disp.plot(ax=ax);

#### Model using Logistic Regression

In [None]:
# Model using Logistic Regression

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X_train, y_train)

y_pred_train = lr.predict(X_train)

acc_train = accuracy_score(y_train, y_pred_train)

y_pred_test = lr.predict(X_test)

acc_test = accuracy_score(y_test, y_pred_test)

print(f'The Results of the calculation of the accuracy of the Train Data : {acc_train}')
print(f'The Results of the calculation of the accuracy of the Test Data : {acc_test}')

#### Model Evaluation

In [None]:
print(classification_report(y_test, y_pred_test))

In [None]:
cm = confusion_matrix(y_test, y_pred_test)
print(confusion_matrix(y_test, y_pred_test))

In [None]:
fig, ax = plt.subplots(figsize=(13,13))
from sklearn.metrics import ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=lr.classes_)
# ax.tick_params()
disp.plot(ax=ax);

In [None]:
new_tweet = "It is very important for us to work hard to achieve goals in life"