# TF-IDF: Exercises

* Humans 👦 show different emotions depending on the situation and communicate them through facial expressions or words.

* On social media platforms such as Twitter and Instagram, many people comment on events, and those comments can express feelings such as sadness, happiness, joy, sarcasm, and fear.

* Given a comment, we will use classical NLP techniques to classify which emotion it expresses.

* We will represent the text with techniques such as Bag of Words, n-grams, and TF-IDF, and then apply different classification algorithms.
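
To make the count-based representations concrete, here is a minimal sketch on a made-up two-sentence corpus (the sentences and variable names are assumptions for illustration, not part of the dataset). It shows how `CountVectorizer` builds a unigram bag of words and how `ngram_range=(1, 2)` adds bigrams to the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-sentence corpus, used only for illustration.
corpus = ["i feel happy today", "i feel scared today"]

# Unigrams only: the classic bag-of-words representation.
bow = CountVectorizer(ngram_range=(1, 1))
bow_matrix = bow.fit_transform(corpus)
bow_vocab = sorted(bow.vocabulary_)

# Unigrams plus bigrams: a bag of n-grams.
bon = CountVectorizer(ngram_range=(1, 2))
bon_matrix = bon.fit_transform(corpus)
bon_vocab = sorted(bon.vocabulary_)

print(bow_vocab)             # unigram vocabulary
print(bow_matrix.toarray())  # one row of counts per sentence
print(bon_vocab)             # unigrams and bigrams together
```

The default token pattern ignores single-character tokens, so `i` never enters either vocabulary.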

## About Data: Emotion Detection
Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp

* The data consists of two columns: `Comment` and `Emotion`.

* `Comment` holds statements or messages about a particular event or situation.

* `Emotion` labels each comment as fear 😨, anger 😡, or joy 😂.

* With three classes, this is a multi-class classification problem.

In [2]:
#import pandas library
import pandas as pd

#read the dataset with name "Emotion_classify_Data.csv" and store it in a variable df
df = pd.read_csv("Emotion_classify_Data.csv")

#print the shape of dataframe
print(df.shape)

#print top 5 rows
df.head(5)

(5937, 2)


,Comment,Emotion
0,i seriously hate one subject to death but now ...,fear
1,im so full of life i feel appalled,anger
2,i sit here to write i start to dig out my feel...,fear
3,ive been really angry with r and i feel like a...,joy
4,i feel suspicious if there is no one outside l...,fear


In [3]:
#check the distribution of Emotion
df.Emotion.value_counts()

Emotion,count
anger,2000
joy,2000
fear,1937


In [4]:
#Add the new column "Emotion_num" which gives a unique number to each of these Emotions
#joy --> 0, fear --> 1, anger --> 2
df["Emotion_num"] = df["Emotion"].map({
    'joy': 0,
    'fear': 1,
    'anger': 2
})

#checking the results by printing top 5 rows
df.head()

,Comment,Emotion,Emotion_num
0,i seriously hate one subject to death but now ...,fear,1
1,im so full of life i feel appalled,anger,2
2,i sit here to write i start to dig out my feel...,fear,1
3,ive been really angry with r and i feel like a...,joy,0
4,i feel suspicious if there is no one outside l...,fear,1
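
`Series.map` with a dictionary returns `NaN` for any value missing from the dictionary, so a quick check for unmapped labels can catch typos in the mapping. The sketch below uses a made-up mini-frame (the `toy` frame and the extra `surprise` label are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical mini-frame; "surprise" is deliberately absent from the mapping.
toy = pd.DataFrame({"Emotion": ["joy", "fear", "anger", "surprise"]})
label_map = {"joy": 0, "fear": 1, "anger": 2}

toy["Emotion_num"] = toy["Emotion"].map(label_map)

# Labels that the dictionary did not cover end up as NaN.
unmapped = toy.loc[toy["Emotion_num"].isna(), "Emotion"].unique().tolist()
print(unmapped)
```

An empty list here would mean every label received a number.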


## Modelling without Pre-processing the Text Data

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.Comment,
    df.Emotion_num,
    test_size=0.2,
    random_state=2022,
    stratify=df.Emotion_num
)

In [7]:
print(X_train.shape)
print(X_test.shape)

(4749,)
(1188,)
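
The `stratify` argument keeps the class proportions the same in the train and test splits. A minimal sketch on made-up, deliberately imbalanced labels (the `labels` and `texts` series are assumptions for illustration) shows the effect:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 of class 0, 30 of class 1, 10 of class 2.
labels = pd.Series([0] * 60 + [1] * 30 + [2] * 10)
texts = pd.Series([f"comment {i}" for i in range(len(labels))])

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=2022, stratify=labels
)

# Per-class counts in each split, ordered by class label.
train_counts = y_tr.value_counts().sort_index().tolist()
test_counts = y_te.value_counts().sort_index().tolist()
print(train_counts, test_counts)
```

Both splits keep the 60/30/10 ratio of the full label series.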


### Attempt 1 :

1. Using the sklearn `Pipeline` class, create a classification pipeline to classify the data.

Note:

* Use CountVectorizer with only trigrams.
* Use RandomForest as the classifier.
* Print the classification report.

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer

clf = Pipeline([
    ('vectorizer_3', CountVectorizer(ngram_range=(3, 3))),
    ('rf', RandomForestClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.58      0.27      0.37       400
           1       0.37      0.80      0.51       388
           2       0.52      0.20      0.29       400

    accuracy                           0.42      1188
   macro avg       0.49      0.43      0.39      1188
weighted avg       0.49      0.42      0.39      1188
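
One plausible contributor to these weak scores is sparsity: a trigram seen at test time often never appeared in training, so the classifier receives few nonzero features. The sketch below illustrates the idea on made-up sentences (the documents and the `overlap` helper are assumptions for illustration, not a measurement on this dataset).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and test sentences, for illustration only.
train_docs = ["i feel so happy today", "i feel so scared today"]
test_docs = ["today i feel so angry"]


def overlap(ngram_range):
    """Return (train vocabulary size, number of nonzero test features)."""
    vec = CountVectorizer(ngram_range=ngram_range)
    vec.fit(train_docs)
    test_matrix = vec.transform(test_docs)
    return len(vec.vocabulary_), int(test_matrix.nnz)


uni_bi = overlap((1, 2))  # unigrams and bigrams
tri = overlap((3, 3))     # trigrams only
print(uni_bi, tri)
```

With unigrams and bigrams the test sentence still shares several features with the training vocabulary; with trigrams only it shares none.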



### Attempt 2 :

1. Using the sklearn `Pipeline` class, create a classification pipeline to classify the data.

Note:

* Use CountVectorizer with both unigrams and bigrams.
* Use Multinomial Naive Bayes as the classifier.
* Print the classification report.

In [10]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer

clf = Pipeline([
    ('vectorizer_1_2', CountVectorizer(ngram_range=(1, 2))),
    ('nb', MultinomialNB())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.86      0.87       400
           1       0.87      0.83      0.85       388
           2       0.83      0.88      0.85       400

    accuracy                           0.86      1188
   macro avg       0.86      0.86      0.86      1188
weighted avg       0.86      0.86      0.86      1188



### Attempt 3 :

1. Using the sklearn `Pipeline` class, create a classification pipeline to classify the data.

Note:

* Use CountVectorizer with both unigrams and bigrams.
* Use RandomForest as the classifier.
* Print the classification report.

In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer

clf = Pipeline([
    ('vectorizer_1_2', CountVectorizer(ngram_range=(1, 2))),
    ('rf', RandomForestClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.97      0.90       400
           1       0.95      0.88      0.91       388
           2       0.94      0.86      0.90       400

    accuracy                           0.90      1188
   macro avg       0.91      0.90      0.90      1188
weighted avg       0.91      0.90      0.90      1188



### Attempt 4 :

1. Using the sklearn `Pipeline` class, create a classification pipeline to classify the data.

Note:

* Use the TF-IDF vectorizer to vectorize the text.
* Use RandomForest as the classifier.
* Print the classification report.

In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

clf = Pipeline([
    ('vectorizer_tf_idf', TfidfVectorizer()),
    ('rf', RandomForestClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.94      0.90       400
           1       0.91      0.90      0.91       388
           2       0.93      0.86      0.89       400

    accuracy                           0.90      1188
   macro avg       0.90      0.90      0.90      1188
weighted avg       0.90      0.90      0.90      1188



## Use text pre-processing to remove stop words and punctuation, and apply lemmatization

In [13]:
import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm")


#use this utility function to get the preprocessed text data
def preprocess(text):
    # drop stop words and punctuation, and keep the lemma of every other token
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)

    return " ".join(filtered_tokens)

In [14]:
df['preprocessed_txt'] = df['Comment'].apply(preprocess)

In [15]:
X_train, X_test, y_train, y_test = train_test_split(
    df.preprocessed_txt,
    df.Emotion_num,
    test_size=0.2,
    random_state=2022,
    stratify=df.Emotion_num
)

### Attempt 1 :

1. Using the sklearn `Pipeline` class, create a classification pipeline to classify the data.

Note:

* Use CountVectorizer with both unigrams and bigrams.
* Use RandomForest as the classifier.
* Print the classification report.

In [16]:
clf = Pipeline([
    ('vectorizer_1_2', CountVectorizer(ngram_range=(1, 2))),
    ('rf', RandomForestClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.95      0.94       400
           1       0.94      0.90      0.92       388
           2       0.91      0.94      0.92       400

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188



### Attempt 2 :

1. Using the sklearn `Pipeline` class, create a classification pipeline to classify the data.

Note:

* Use the TF-IDF vectorizer to vectorize the text.
* Use RandomForest as the classifier.
* Print the classification report.

In [17]:
clf = Pipeline([
    ('vectorizer_tf_idf', TfidfVectorizer()),
    ('rf', RandomForestClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.91      0.95      0.93       400
           1       0.93      0.93      0.93       388
           2       0.93      0.91      0.92       400

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188



### Observations:

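* Moving to higher-order n-grams alone hurts performance drastically: with RandomForest, trigrams only reached 0.42 accuracy, against 0.90 with unigrams and bigrams combined.

* Pre-processing the text improves the results: RandomForest accuracy rose from 0.90 to 0.93 with both the unigram-plus-bigram CountVectorizer and the TF-IDF vectorizer.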