#### TF-IDF: Exercises
- Humans 👦 show different emotions depending on the situation and communicate them through facial expressions or in words.

- On social media platforms such as Twitter and Instagram, many people express their views on a particular event or scenario through comments, and these comments may convey feelings such as sadness, happiness, joy, sarcasm, fear, and many others.

- For a given comment/text, we are going to use classical NLP techniques to classify which emotion that comment expresses.

- We are going to use techniques like Bag of Words, n-grams, and TF-IDF for text representation and apply different classification algorithms.
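
To make the Bag of Words and n-gram ideas concrete, here is a minimal pure-Python sketch. The helper names `ngrams` and `bag_of_ngrams` and the example sentence are illustrative assumptions, not part of the notebook, and plain whitespace splitting is a simplification of what a real vectorizer does.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, n_min=1, n_max=1):
    """Count every n-gram of length n_min..n_max in a lowercased text."""
    tokens = text.lower().split()  # simplification: split on whitespace only
    counts = Counter()
    for n in range(n_min, n_max + 1):
        counts.update(ngrams(tokens, n))
    return counts


example = "i feel happy i feel calm"  # hypothetical comment
unigrams = bag_of_ngrams(example)       # Bag of Words (unigrams only)
uni_bi = bag_of_ngrams(example, 1, 2)   # unigrams plus bigrams

print(unigrams)
print(uni_bi)
```

Word order is lost in the unigram counts, while the bigram counts keep some local context such as `i feel`.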

#### About Data: Emotion Detection
- Credits: https://www.kaggle.com/datasets/abdallahwagih/emotion-dataset

- This data consists of two columns:
  - Comment
  - Emotion

- Comments are statements or messages about a particular event/situation.

- The Emotion column tells whether the given comment expresses fear 😨, anger 😡, or joy 😂.

- As there are 3 classes, this is a multi-class classification problem.


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('Dataset/Emotion_classify_Data.csv')

In [3]:
df.shape

(5937, 2)

In [4]:
df.head()

| | Comment | Emotion |
|---|---|---|
| 0 | i seriously hate one subject to death but now ... | fear |
| 1 | im so full of life i feel appalled | anger |
| 2 | i sit here to write i start to dig out my feel... | fear |
| 3 | ive been really angry with r and i feel like a... | joy |
| 4 | i feel suspicious if there is no one outside l... | fear |


In [5]:
df.Emotion.value_counts()

Emotion
anger    2000
joy      2000
fear     1937
Name: count, dtype: int64

In [6]:
df['label_num'] = df.Emotion.map({'anger' : 0, 'joy': 1, 'fear': 2})

In [7]:
df.head()

| | Comment | Emotion | label_num |
|---|---|---|---|
| 0 | i seriously hate one subject to death but now ... | fear | 2 |
| 1 | im so full of life i feel appalled | anger | 0 |
| 2 | i sit here to write i start to dig out my feel... | fear | 2 |
| 3 | ive been really angry with r and i feel like a... | joy | 1 |
| 4 | i feel suspicious if there is no one outside l... | fear | 2 |


In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(df.Comment, df.label_num, test_size = 0.2, random_state = 2022, stratify = df.label_num)
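
The `stratify` argument asks `train_test_split` to keep the class proportions the same in the train and test sets. The sketch below illustrates this on a small made-up label list (the toy texts and the 2:1:1 class ratio are assumptions for illustration, not the Kaggle data).

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 40 comments with a 2:1:1 class ratio
texts = [f"comment {i}" for i in range(40)]
labels = [0] * 20 + [1] * 10 + [2] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=2022, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)

# Both splits keep the original 2:1:1 ratio between the classes
print("train:", sorted(train_counts.items()))
print("test: ", sorted(test_counts.items()))
```

The same effect is visible in this notebook's classification reports, where the test-set support is 400, 400, and 388 for the three classes.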

**Attempt 1:**
- Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with only trigrams.
- Use RandomForest as the classifier.
- Print the classification report.

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

In [11]:
clf1 = Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range = (3, 3))),
    ('rf', RandomForestClassifier())

])

In [12]:
clf1.fit(X_train, y_train)

In [13]:
y_pred = clf1.predict(X_test)

In [14]:
from sklearn.metrics import classification_report

In [15]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.41      0.68      0.51       400
           1       0.60      0.27      0.37       400
           2       0.55      0.49      0.52       388

    accuracy                           0.48      1188
   macro avg       0.52      0.48      0.47      1188
weighted avg       0.52      0.48      0.47      1188
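
The report rows are labelled 0, 1, and 2, which correspond to anger, joy, and fear in the `label_num` mapping above. As a small sketch, `classification_report` accepts a `target_names` argument to print the emotion names instead; the true and predicted labels below are made up for illustration.

```python
from sklearn.metrics import classification_report

# Position in this list corresponds to label_num: 0 = anger, 1 = joy, 2 = fear
target_names = ["anger", "joy", "fear"]

# Hypothetical true and predicted labels, not taken from the notebook
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]

report = classification_report(y_true, y_hat, target_names=target_names)
print(report)

# The same numbers are available as a nested dict for programmatic use
report_dict = classification_report(
    y_true, y_hat, target_names=target_names, output_dict=True
)
```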



**Attempt 2:**
- Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**

- Use CountVectorizer with both unigrams and bigrams.
- Use Multinomial Naive Bayes as the classifier.
- Print the classification report.

In [16]:
from sklearn.naive_bayes import MultinomialNB

In [17]:
clf2 = Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range = (1, 2))),
    ('mnb', MultinomialNB())

])

In [18]:
clf2.fit(X_train, y_train)

In [19]:
y_pred = clf2.predict(X_test)

In [20]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.89      0.89       400
           1       0.88      0.88      0.88       400
           2       0.86      0.86      0.86       388

    accuracy                           0.88      1188
   macro avg       0.88      0.88      0.88      1188
weighted avg       0.88      0.88      0.88      1188



**Attempt 3:**
- Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**

- Use the TF-IDF vectorizer to represent the text.
- Use RandomForest as the classifier.
- Print the classification report.

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [22]:
clf3 = Pipeline([
    ('vectorizer_TF_IDF', TfidfVectorizer()),
    ('random_forest', RandomForestClassifier())
])
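
To show what the TF-IDF weights look like, here is a pure-Python sketch on a toy corpus. It assumes the smoothed formula that scikit-learn documents for its default settings (`smooth_idf=True`, `norm='l2'`), namely idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each document vector. The corpus and the helper name `tfidf_vector` are illustrative assumptions, and whitespace splitting stands in for the real tokenizer.

```python
import math
from collections import Counter

docs = ["feel happy", "feel angry", "happy happy joy"]  # hypothetical toy corpus
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {
    term: math.log((1 + n_docs) / (1 + count)) + 1
    for term, count in doc_freq.items()
}


def tfidf_vector(tokens):
    """Return L2-normalized tf * idf weights for one tokenized document."""
    term_freq = Counter(tokens)
    raw = {term: count * idf[term] for term, count in term_freq.items()}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


weights = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vector in zip(docs, weights):
    print(doc, {term: round(value, 3) for term, value in vector.items()})
```

In the second document, the rarer term `angry` receives a larger weight than `feel`, which appears in two of the three documents.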

In [23]:
clf3.fit(X_train, y_train)

In [24]:
y_pred = clf3.predict(X_test)

In [25]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.87      0.90       400
           1       0.89      0.94      0.91       400
           2       0.90      0.91      0.91       388

    accuracy                           0.91      1188
   macro avg       0.91      0.91      0.91      1188
weighted avg       0.91      0.91      0.91      1188



**Use text pre-processing to remove stop words and punctuation, and apply lemmatization**

In [26]:
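import spacy

# Load the small English pipeline (the model must be installed beforehand)
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    # Drop stop words and punctuation, then keep the lemma of each remaining token
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)

    return " ".join(filtered_tokens)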


In [27]:
df['preprocessed_comment'] = df['Comment'].apply(preprocess)

In [28]:
df.head()

| | Comment | Emotion | label_num | preprocessed_comment |
|---|---|---|---|---|
| 0 | i seriously hate one subject to death but now ... | fear | 2 | seriously hate subject death feel reluctant drop |
| 1 | im so full of life i feel appalled | anger | 0 | m life feel appalled |
| 2 | i sit here to write i start to dig out my feel... | fear | 2 | sit write start dig feeling think afraid accep... |
| 3 | ive been really angry with r and i feel like a... | joy | 1 | ve angry r feel like idiot trust place |
| 4 | i feel suspicious if there is no one outside l... | fear | 2 | feel suspicious outside like rapture happen |


In [29]:
X_train, X_test, y_train, y_test = train_test_split(df.preprocessed_comment, df.label_num, test_size = 0.2, random_state = 2022, stratify = df.label_num)

**Attempt 1:**

- Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**

- Use CountVectorizer with both unigrams and bigrams.
- Use RandomForest as the classifier.
- Print the classification report.

In [30]:
clf4 = Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range = (1, 2))),
    ('rf', RandomForestClassifier())

])

In [31]:
clf4.fit(X_train, y_train)

In [32]:
y_pred = clf4.predict(X_test)

In [33]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.91      0.93      0.92       400
           1       0.92      0.95      0.94       400
           2       0.95      0.90      0.93       388

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188



**Attempt 2:**

- Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**

- Use the TF-IDF vectorizer to represent the pre-processed text.
- Use RandomForest as the classifier.
- Print the classification report.

In [34]:
clf5 = Pipeline([
    ('vectorizer_TF_IDF', TfidfVectorizer()),
    ('random_forest', RandomForestClassifier())
])

In [35]:
clf5.fit(X_train, y_train)

In [36]:
y_pred = clf5.predict(X_test)

In [37]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.91      0.92       400
           1       0.93      0.95      0.94       400
           2       0.91      0.92      0.92       388

    accuracy                           0.92      1188
   macro avg       0.92      0.92      0.92      1188
weighted avg       0.92      0.92      0.92      1188
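
Every attempt in this notebook repeats the same fit, predict, and report steps. One possible way to reduce that repetition is a small helper that scores several pipelines in a loop. The helper name `evaluate`, the candidate names, and the tiny toy dataset below are assumptions for illustration; the scores it prints say nothing about the real dataset. Because `RandomForestClassifier` is randomized, the sketch also fixes `random_state`, which the notebook cells above do not do, so their numbers can vary slightly between runs.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def evaluate(pipeline, X_train, X_test, y_train, y_test):
    """Fit a pipeline on the training split and return its test accuracy."""
    pipeline.fit(X_train, y_train)
    return accuracy_score(y_test, pipeline.predict(X_test))


# Hypothetical toy data: 0 = anger, 1 = joy, 2 = fear
train_texts = [
    "feel angry furious", "so angry mad",
    "feel happy joyful", "so happy glad",
    "feel scared afraid", "so scared terrified",
]
train_labels = [0, 0, 1, 1, 2, 2]
test_texts = ["angry mad", "happy glad", "scared afraid"]
test_labels = [0, 1, 2]

candidates = {
    "bow_1_2 + mnb": Pipeline([
        ("vectorizer_bow", CountVectorizer(ngram_range=(1, 2))),
        ("mnb", MultinomialNB()),
    ]),
    "tfidf + rf": Pipeline([
        ("vectorizer_tf_idf", TfidfVectorizer()),
        ("random_forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]),
}

scores = {
    name: evaluate(pipeline, train_texts, test_texts, train_labels, test_labels)
    for name, pipeline in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: accuracy = {score:.2f}")
```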

