### **TF-IDF**

- Humans 👦 show different emotions depending on the situation and communicate them through facial expressions or words.

- On social media platforms such as Twitter and Instagram, many people comment on particular events, and these comments can express feelings such as sadness, happiness, joy, sarcasm, fear, and many others.

- For a given comment, we will use classical NLP techniques to classify which emotion it expresses.

- We will represent the text with techniques such as Bag of Words, n-grams, and TF-IDF, and then apply different classification algorithms.

### **About Data: Emotion Detection**

Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp


- This data consists of two columns.
    - Text
    - Label
- Text holds the statements or messages about a particular event or situation.

- Label gives the emotion of the comment: joy 😂, sadness, anger 😡, fear 😨, love, or surprise.

- As there are six classes, this is a **Multi-Class Classification** problem.

In [None]:
#import pandas library
import pandas as pd

#read the dataset "Emotions_dataset.csv" (semicolon-separated) and store it in a variable df
df=pd.read_csv("Emotions_dataset.csv", sep=';')

#print the shape of dataframe
print(df.shape)

#print top 5 rows
df.head()

(20000, 2)


                                                Text    Label
0                            i didnt feel humiliated  sadness
1  i can go from feeling so hopeless to so damned...  sadness
2   im grabbing a minute to post i feel greedy wrong    anger
3  i am ever feeling nostalgic about the fireplac...     love
4                               i am feeling grouchy    anger


In [None]:
#check the distribution of Emotion
df.Label.value_counts()

Label     count
joy        6752
sadness    5828
anger      2709
fear       2385
love       1622
surprise    704


In [None]:
#Add the new column "Emotion_num", which gives a unique number to each of these emotions
#joy --> 0, sadness --> 1, anger --> 2, fear --> 3, love --> 4, surprise --> 5
df["Emotion_num"]=df.Label.map({
    "joy":0,
    "sadness":1,
    "anger":2,
    "fear":3,
    "love":4,
    "surprise":5
})

#checking the results by printing top 5 rows
print(df.head())

                                                Text    Label  Emotion_num
0                            i didnt feel humiliated  sadness            1
1  i can go from feeling so hopeless to so damned...  sadness            1
2   im grabbing a minute to post i feel greedy wrong    anger            2
3  i am ever feeling nostalgic about the fireplac...     love            4
4                               i am feeling grouchy    anger            2
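Because the classifiers will predict these integer codes, an inverse lookup helps when reading predictions. The following is a minimal sketch: `label_to_num` repeats the mapping used above, while `num_to_label` and `sample_predictions` are illustrative names, and the prediction values are made up.

```python
import pandas as pd

# Same mapping as used for the "Emotion_num" column
label_to_num = {
    "joy": 0,
    "sadness": 1,
    "anger": 2,
    "fear": 3,
    "love": 4,
    "surprise": 5,
}

# Invert the dictionary so that numbers map back to emotion names
num_to_label = {num: label for label, num in label_to_num.items()}

# Hypothetical model output, standing in for clf.predict(...)
sample_predictions = pd.Series([1, 2, 4])
decoded = sample_predictions.map(num_to_label)
print(decoded.tolist())
```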


### **Modelling without Pre-processing Text data**

In [None]:
#import train-test split
from sklearn.model_selection import train_test_split

#Do the 'train-test' splitting with test size of 20%
#Note: Give Random state 2022 and also do the stratify sampling
X=df.Text
y=df.Emotion_num

X_train,X_test,y_train,y_test=train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=2022,
    stratify=df.Emotion_num
)


In [None]:
#print the shapes of X_train and X_test
print("X_train ",X_train.shape)
print("X_test ",X_test.shape)

X_train  (16000,)
X_test  (4000,)
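The `stratify` argument keeps the class proportions the same in the train and test sets. The sketch below checks this on a small made-up frame; `toy` and its 80/20 label split are assumptions for illustration, not the emotion dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 rows of class 0 and 20 rows of class 1
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(100)],
    "label": [0] * 80 + [1] * 20,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.text,
    toy.label,
    test_size=0.2,
    random_state=2022,
    stratify=toy.label,
)

# Both splits should show the same 80% / 20% class ratio
print(y_tr.value_counts(normalize=True).sort_index())
print(y_te.value_counts(normalize=True).sort_index())
```

Without `stratify`, a rare class such as surprise could end up under-represented in the test set purely by chance.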



**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with only trigrams.
- use **RandomForest** as the classifier.
- print the classification report.


In [None]:
#import CountVectorizer, RandomForest, pipeline, classification_report from sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

#1. create a pipeline object
clf=Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range=(3,3))),
    ('rf', RandomForestClassifier())
])


#2. fit with X_train and y_train
clf.fit(X_train,y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred=clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.55      0.68      0.61      1350
           1       0.65      0.46      0.54      1166
           2       0.61      0.39      0.47       542
           3       0.27      0.58      0.37       477
           4       0.80      0.21      0.34       324
           5       0.87      0.28      0.42       141

    accuracy                           0.51      4000
   macro avg       0.63      0.43      0.46      4000
weighted avg       0.59      0.51      0.51      4000




**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with both unigrams and bigrams.
- use **Multinomial Naive Bayes** as the classifier.
- print the classification report.


In [None]:
#import MultinomialNB from sklearn
from sklearn.naive_bayes import MultinomialNB


#1. create a pipeline object
clf=Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range=(1,2))),
    ('Multi NB', MultinomialNB())
])


#2. fit with X_train and y_train
clf.fit(X_train,y_train)



#3. get the predictions for X_test and store it in y_pred
y_pred=clf.predict(X_test)

#4. print the classfication report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.66      0.97      0.79      1350
           1       0.72      0.92      0.80      1166
           2       0.97      0.44      0.61       542
           3       0.91      0.37      0.53       477
           4       0.98      0.17      0.28       324
           5       1.00      0.09      0.16       141

    accuracy                           0.72      4000
   macro avg       0.87      0.49      0.53      4000
weighted avg       0.79      0.72      0.67      4000




**Attempt 3** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with both unigrams and bigrams.
- use **RandomForest** as the classifier.
- print the classification report.


In [None]:
#1. create a pipeline object
clf=Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range=(1,2))),
    ('rf', RandomForestClassifier())
])



#2. fit with X_train and y_train
clf.fit(X_train,y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred=clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.83      0.96      0.89      1350
           1       0.91      0.90      0.91      1166
           2       0.92      0.82      0.87       542
           3       0.86      0.82      0.84       477
           4       0.91      0.67      0.77       324
           5       0.92      0.68      0.78       141

    accuracy                           0.88      4000
   macro avg       0.89      0.81      0.84      4000
weighted avg       0.88      0.88      0.87      4000




**Attempt 4** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use the **TF-IDF vectorizer** to turn the text into features.
- use **RandomForest** as the classifier.
- print the classification report.
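To show what the TF-IDF vectorizer computes, the sketch below rebuilds its output by hand on a toy corpus and compares it with `TfidfVectorizer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm="l2"`, raw term counts), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`. The corpus and the names `counts`, `idf`, and `manual` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus: hypothetical sentences, not rows from the dataset
corpus = ["feel happy", "feel sad", "feel happy happy"]

# Term frequency: raw counts per document
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each term
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale every row to unit L2 length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with the library result
sk = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, sk))
```

The word "feel" occurs in every document, so its idf is the minimum value of 1, while "sad" occurs in only one document and gets the largest weight.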


In [None]:
#import TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

#1. create a pipeline object
clf=Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('rf', RandomForestClassifier())
])


#2. fit with X_train and y_train
clf.fit(X_train,y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred=clf.predict(X_test)

#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.84      0.94      0.89      1350
           1       0.91      0.89      0.90      1166
           2       0.89      0.83      0.86       542
           3       0.85      0.84      0.85       477
           4       0.88      0.70      0.78       324
           5       0.87      0.72      0.79       141

    accuracy                           0.87      4000
   macro avg       0.87      0.82      0.84      4000
weighted avg       0.87      0.87      0.87      4000



### **Use text pre-processing to remove stop words and punctuation and apply lemmatization**

In [None]:
import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm")


#use this utility function to get the preprocessed text data
def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)

    return " ".join(filtered_tokens)

In [None]:
# create a new column "preprocessed_comment" and use the utility function above to get the clean data
# this will take some time, please be patient
df["preprocessed_comment"]=df.Text.apply(preprocess)


In [None]:
df.head()

                                                Text    Label  Emotion_num                     preprocessed_comment
0                            i didnt feel humiliated  sadness            1                       not feel humiliate
1  i can go from feeling so hopeless to so damned...  sadness            1  feel hopeless damned hopeful care awake
2   im grabbing a minute to post i feel greedy wrong    anger            2     m grab minute post feel greedy wrong
3  i am ever feeling nostalgic about the fireplac...     love            4   feel nostalgic fireplace know property
4                               i am feeling grouchy    anger            2                             feel grouchy


**Build a model with pre-processed text**

In [None]:
#Do the 'train-test' splitting with test size of 20%, random state of 2022, and stratified sampling
#Note: Use the preprocessed_comment column
X=df.preprocessed_comment
y=df.Emotion_num

X_train,X_test,y_train,y_test=train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=2022,
    stratify=df.Emotion_num
)


**Let's check the scores with our best model so far**
- Random Forest

**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with both unigrams and bigrams.
- use **RandomForest** as the classifier.
- print the classification report.


In [None]:
#1. create a pipeline object
clf=Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range=(1,2))),
    ('rf', RandomForestClassifier())
])



#2. fit with X_train and y_train
clf.fit(X_train,y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred=clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test,y_pred))



              precision    recall  f1-score   support

           0       0.90      0.91      0.91      1350
           1       0.89      0.94      0.91      1166
           2       0.89      0.87      0.88       542
           3       0.88      0.85      0.87       477
           4       0.80      0.77      0.78       324
           5       0.88      0.72      0.79       141

    accuracy                           0.89      4000
   macro avg       0.88      0.84      0.86      4000
weighted avg       0.89      0.89      0.89      4000




**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use the **TF-IDF vectorizer** to turn the pre-processed text into features.
- use **RandomForest** as the classifier.
- print the classification report.


In [None]:
#1. create a pipeline object
clf=Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),
    ('rf', RandomForestClassifier())
])


#2. fit with X_train and y_train
clf.fit(X_train,y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred=clf.predict(X_test)

#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.86      0.93      0.89      1350
           1       0.92      0.89      0.90      1166
           2       0.87      0.85      0.86       542
           3       0.84      0.87      0.85       477
           4       0.83      0.69      0.76       324
           5       0.82      0.70      0.76       141

    accuracy                           0.87      4000
   macro avg       0.86      0.82      0.84      4000
weighted avg       0.87      0.87      0.87      4000

