### **About Data: Emotion Detection**

Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp


- This data consists of two columns.
        - Comment
        - Emotion
- Comments are statements or messages about a particular event or situation.

- The Emotion feature tells whether the given comment expresses fear 😨, anger 😡, or joy 😂.

- As there are 3 classes, this is a **multi-class classification** problem.

In [None]:
#import pandas library
import pandas as pd
#read the dataset with name "Emotion_classify_Data.csv" and store it in a variable df
df = pd.read_csv('Emotion_classify_Data.csv')
#print the shape of dataframe
df.shape
#show 10 random rows
df.sample(10)

Unnamed: 0,Comment,Emotion
5741,i travel i feel like men expect me to be neuro...,fear
4004,i realise that although i originally started t...,fear
3565,i really feel bothered about this specific iss...,anger
1204,i also feel as though this assumption is rude ...,anger
4729,i liked it all the same this one will take a f...,anger
4029,i wear this shirt i feel artistic you are arti...,joy
1408,i feel like i want to stop i think of my wimpy...,fear
4632,i alight in front of the hotel i can feel the ...,joy
2391,i feel equally wronged,anger
4562,i remember feeling so calmed and at ease becau...,joy


In [None]:
#check the distribution of Emotion
df['Emotion'].value_counts()

Emotion
anger    2000
joy      2000
fear     1937
Name: count, dtype: int64

In [None]:
#Add the new column "Emotion_num" which gives a unique number to each of these Emotions
df['Emotion_num'] = df['Emotion'].map({'joy' : 0 , 'fear' : 1 , 'anger' : 2})
#check the results by showing 10 random rows
df.sample(10)

Unnamed: 0,Comment,Emotion,Emotion_num
4035,i do enjoy large bold prints and i suppose its...,fear,1
4885,i was able to be myself and not feel pressured...,fear,1
199,i should have left this movie feeling frighten...,fear,1
1155,ive been feeling like im on shaky quilting wat...,fear,1
5818,i got a feeling give it up i got a feeling get...,anger,2
1891,i hate getting behind because then i feel pres...,fear,1
1803,im not going to gush too much about the relati...,joy,0
2561,im one of girl who feel insecure about herself...,fear,1
4108,im feeling angry i think i strop about rufflin...,anger,2
1582,im also feeling cranky about it because the ma...,anger,2
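
Since the models predict the integers stored in `Emotion_num`, it can help to keep the mapping in one place and derive its inverse for decoding predictions back into labels. A minimal sketch on a small made-up frame rather than the real dataset; `EMOTION_TO_NUM`, `NUM_TO_EMOTION`, and `toy` are illustrative names that do not appear in the original notebook:

```python
import pandas as pd

# same mapping as above, kept in one named constant (hypothetical name)
EMOTION_TO_NUM = {'joy': 0, 'fear': 1, 'anger': 2}
# inverse mapping, useful for turning predictions back into labels
NUM_TO_EMOTION = {num: name for name, num in EMOTION_TO_NUM.items()}

# toy stand-in for the real dataframe
toy = pd.DataFrame({'Emotion': ['joy', 'fear', 'anger', 'fear']})
toy['Emotion_num'] = toy['Emotion'].map(EMOTION_TO_NUM)

# .map() yields NaN for any label missing from the dict, so count them
unmapped = int(toy['Emotion_num'].isna().sum())

# decode the numeric labels back to emotion names
decoded = toy['Emotion_num'].map(NUM_TO_EMOTION).tolist()
print(unmapped, decoded)
```

Checking `unmapped` guards against a silently misspelled label in the mapping dictionary.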


### **Modelling without Pre-processing Text data**

In [None]:
from sklearn.model_selection import train_test_split
X_train , X_test , y_train , y_test = train_test_split(df['Comment'] , df['Emotion_num'] , random_state=2022 , test_size=0.2,stratify=df['Emotion_num'])

In [None]:
print(X_train.shape , X_test.shape)

(4749,) (1188,)
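
The `stratify` argument asks `train_test_split` to keep the class proportions roughly equal in the train and test sets. The sketch below illustrates this on synthetic labels, not on the emotion data; `toy_texts` and `toy_labels` are made-up names used only for this example:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# synthetic, deliberately imbalanced labels: 50% / 30% / 20%
toy_labels = [0] * 50 + [1] * 30 + [2] * 20
toy_texts = [f"comment {i}" for i in range(len(toy_labels))]

tr_X, te_X, tr_y, te_y = train_test_split(
    toy_texts, toy_labels,
    test_size=0.2, random_state=2022, stratify=toy_labels,
)

# with stratification the 20-item test set mirrors the 50/30/20 ratio
train_counts = Counter(tr_y)
test_counts = Counter(te_y)
print(train_counts, test_counts)
```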


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

In [None]:
#1. create a pipeline object
Random_pipeline = Pipeline([
    ('CountVectorizer' ,CountVectorizer(ngram_range=(3,3))),
    ('Random Forest' , RandomForestClassifier())
])
#2. fit with X_train and y_train
Random_pipeline.fit(X_train,y_train)
#3. get the predictions for X_test and store it in y_pred
y_pred = Random_pipeline.predict(X_test)
#4. print the classification report
print(classification_report(y_test , y_pred))

              precision    recall  f1-score   support

           0       0.58      0.28      0.37       400
           1       0.37      0.81      0.51       388
           2       0.56      0.22      0.32       400

    accuracy                           0.43      1188
   macro avg       0.50      0.44      0.40      1188
weighted avg       0.50      0.43      0.40      1188
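
One plausible reason for the weak score with `ngram_range=(3,3)` is sparsity: trigram-only features rarely repeat across comments, so a test comment may share few features with the training data. This is an interpretation, not something the notebook verifies. The sketch below uses two made-up sentences to show how the vocabulary changes with `ngram_range`; note that the default tokenizer drops one-character tokens such as "i".

```python
from sklearn.feature_extraction.text import CountVectorizer

# two made-up comments that differ in a single word
toy_docs = ["i feel so happy today", "i feel so angry today"]

def vocab_for(ngram_range):
    """Return the set of n-grams CountVectorizer extracts from toy_docs."""
    vec = CountVectorizer(ngram_range=ngram_range)
    vec.fit(toy_docs)
    return set(vec.vocabulary_)

unigrams = vocab_for((1, 1))
uni_and_bigrams = vocab_for((1, 2))
trigrams = vocab_for((3, 3))

print(sorted(unigrams))
print(sorted(trigrams))
```

In this toy case every trigram occurs in only one of the two sentences, while the unigrams "feel", "so", and "today" are shared by both.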



In [None]:
from sklearn.naive_bayes import MultinomialNB
#1. create a pipeline object
NB_pipeline = Pipeline([
    ('CountVectorizer' ,CountVectorizer(ngram_range=(1,2))),
    ('NB' , MultinomialNB())
])
#2. fit with X_train and y_train
NB_pipeline.fit(X_train,y_train)
#3. get the predictions for X_test and store it in y_pred
y_pred = NB_pipeline.predict(X_test)
#4. print the classification report
print(classification_report(y_test , y_pred))

              precision    recall  f1-score   support

           0       0.87      0.86      0.87       400
           1       0.87      0.83      0.85       388
           2       0.83      0.88      0.85       400

    accuracy                           0.86      1188
   macro avg       0.86      0.86      0.86      1188
weighted avg       0.86      0.86      0.86      1188



In [None]:
#1. create a pipeline object
Random_pipeline = Pipeline([
    ('CountVectorizer' ,CountVectorizer(ngram_range=(1,2))),
    ('Random Forest' , RandomForestClassifier())
])
#2. fit with X_train and y_train
Random_pipeline.fit(X_train,y_train)
#3. get the predictions for X_test and store it in y_pred
y_pred = Random_pipeline.predict(X_test)
#4. print the classification report
print(classification_report(y_test , y_pred))

              precision    recall  f1-score   support

           0       0.83      0.95      0.89       400
           1       0.93      0.87      0.90       388
           2       0.93      0.85      0.89       400

    accuracy                           0.89      1188
   macro avg       0.90      0.89      0.89      1188
weighted avg       0.90      0.89      0.89      1188




**Attempt 4**:

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use the **TF-IDF vectorizer** for pre-processing the text.
- use **RandomForest** as the classifier.
- print the classification report.
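
To give a feel for what TF-IDF changes compared with raw counts, the sketch below fits `TfidfVectorizer` on three made-up sentences and compares the learned inverse document frequency of a word that appears everywhere with one that appears once. The sentences and the name `idf` are illustrative only; with scikit-learn's default smoothing, the weight is `ln((1 + n) / (1 + df)) + 1`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up corpus: "feel" occurs in every document, the other words in one each
toy_docs = ["feel happy", "feel angry", "feel scared"]

vec = TfidfVectorizer()
vec.fit(toy_docs)

# map each term to its learned IDF weight
idf = {term: float(vec.idf_[idx]) for term, idx in vec.vocabulary_.items()}

# the ubiquitous word gets the minimum weight, rarer words get more
print(idf["feel"], idf["happy"])
```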


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
#1. create a pipeline object
Random_pipeline = Pipeline([
    ('TfidfVectorizer' ,TfidfVectorizer()),
    ('Random Forest' , RandomForestClassifier())
])
#2. fit with X_train and y_train
Random_pipeline.fit(X_train,y_train)
#3. get the predictions for X_test and store it in y_pred
y_pred = Random_pipeline.predict(X_test)
#4. print the classification report
print(classification_report(y_test , y_pred))

              precision    recall  f1-score   support

           0       0.87      0.94      0.91       400
           1       0.92      0.90      0.91       388
           2       0.93      0.86      0.89       400

    accuracy                           0.90      1188
   macro avg       0.90      0.90      0.90      1188
weighted avg       0.90      0.90      0.90      1188



<h3>Use text pre-processing to remove stop words and punctuation, and apply lemmatization</h3>

In [None]:
import spacy
# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm") 
#use this utility function to get the preprocessed text data
def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    return " ".join(filtered_tokens) 

In [None]:
# create a new column "text_preprocessing" and use the utility function above to get the clean data
df['text_preprocessing'] = df['Comment'].apply(preprocess)

**Build a model with pre-processed text**

In [None]:
#split into train and test sets: 20% test size, random state 2022, stratified by label
from sklearn.model_selection import train_test_split
X_train , X_test , y_train , y_test = train_test_split(df['text_preprocessing'] , df['Emotion_num'] , random_state=2022 , test_size=0.2,stratify=df['Emotion_num'])

In [None]:
#1. create a pipeline object
Random_pipeline = Pipeline([
    ('CountVectorizer' ,CountVectorizer(ngram_range=(1,2))),
    ('Random Forest' , RandomForestClassifier())
])
#2. fit with X_train and y_train
Random_pipeline.fit(X_train,y_train)
#3. get the predictions for X_test and store it in y_pred
y_pred = Random_pipeline.predict(X_test)
#4. print the classification report
print(classification_report(y_test , y_pred))

              precision    recall  f1-score   support

           0       0.94      0.95      0.95       400
           1       0.94      0.91      0.93       388
           2       0.92      0.94      0.93       400

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188



In [None]:
#1. create a pipeline object
Random_pipeline = Pipeline([
    ('TfidfVectorizer' ,TfidfVectorizer()),
    ('Random Forest' , RandomForestClassifier())
])
#2. fit with X_train and y_train
Random_pipeline.fit(X_train,y_train)
#3. get the predictions for X_test and store it in y_pred
y_pred = Random_pipeline.predict(X_test)
#4. print the classification report
print(classification_report(y_test , y_pred))

              precision    recall  f1-score   support

           0       0.92      0.95      0.93       400
           1       0.93      0.93      0.93       388
           2       0.94      0.91      0.93       400

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188
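
The fit, predict, and report steps are repeated for every attempt in this notebook, so they could be wrapped in a small helper and run over several candidate pipelines at once. The sketch below is one possible way to do that, shown on a tiny made-up dataset with `MultinomialNB` so that it runs quickly and deterministically; `evaluate`, `candidates`, and the toy sentences are hypothetical and not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def evaluate(pipeline, X_train, y_train, X_test, y_test):
    """Fit the pipeline, predict on the test set, return the report as a dict."""
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    return classification_report(y_test, y_pred, output_dict=True, zero_division=0)

# tiny made-up data: 0 = joy, 1 = fear, 2 = anger
toy_train = ["i feel happy", "i feel joyful", "i feel scared",
             "i feel afraid", "i feel angry", "i feel furious"]
toy_train_labels = [0, 0, 1, 1, 2, 2]
toy_test = ["so happy", "so scared", "so angry"]
toy_test_labels = [0, 1, 2]

# candidate pipelines to compare (names are illustrative)
candidates = {
    "count + NB": Pipeline([("vec", CountVectorizer()), ("clf", MultinomialNB())]),
    "tfidf + NB": Pipeline([("vec", TfidfVectorizer()), ("clf", MultinomialNB())]),
}

# collect the accuracy of each candidate
results = {
    name: evaluate(pipe, toy_train, toy_train_labels, toy_test, toy_test_labels)["accuracy"]
    for name, pipe in candidates.items()
}
print(results)
```

On the real data, the same helper could be called with `X_train`, `y_train`, `X_test`, and `y_test` from the split above, with `RandomForestClassifier` pipelines as the candidates.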

