
### **TF-IDF**

- Humans 👦 show different emotions depending on the situation and communicate them through facial expressions or words.

- On social media platforms such as Twitter and Instagram, many people comment on a particular event or scenario, and these comments can convey feelings such as sadness, happiness, joy, sarcasm, fear, and many others.

- Given a comment, we will use classical NLP techniques to classify which emotion it expresses.

- We will use techniques such as Bag of Words, n-grams, and TF-IDF for text representation and apply different classification algorithms.

### **About Data: Emotion Detection**

Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp


- The data consists of two columns:
    - Comment
    - Emotion
- Each comment is a statement or message about a particular event or situation.

- The Emotion column gives the emotion expressed by the comment, such as fear 😨, anger 😡, or joy 😂.

- The file loaded below contains six emotion classes (joy, sadness, anger, fear, love, and surprise), so this is a **multi-class classification** problem.

In [2]:
# Import pandas library
import pandas as pd

# Read "/content/train.txt" with the default read_csv settings and store it in a variable df
df = pd.read_csv("/content/train.txt")

# Print the shape of the dataframe
print("Shape of the dataframe:", df.shape)

# Print the top 5 rows of the dataframe
print("Top 5 rows of the dataframe:")
print(df.head())


Shape of the dataframe: (15999, 1)
Top 5 rows of the dataframe:
                     i didnt feel humiliated;sadness
0  i can go from feeling so hopeless to so damned...
1  im grabbing a minute to post i feel greedy wro...
2  i am ever feeling nostalgic about the fireplac...
3                         i am feeling grouchy;anger
4  ive been feeling a little burdened lately wasn...


In [4]:
print(df.columns)
print(df.head())


Index(['i didnt feel humiliated;sadness'], dtype='object')
                     i didnt feel humiliated;sadness
0  i can go from feeling so hopeless to so damned...
1  im grabbing a minute to post i feel greedy wro...
2  i am ever feeling nostalgic about the fireplac...
3                         i am feeling grouchy;anger
4  ive been feeling a little burdened lately wasn...
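
The single column and the missing first record in the output above come from the defaults of `pd.read_csv`: it splits on commas and treats the first line as a header, while this file uses `;` and has no header row. A minimal sketch of the difference, using two sample lines in the same layout instead of the real file:

```python
import io
import pandas as pd

# Two sample lines in the same "text;label" layout as the file
raw = "i didnt feel humiliated;sadness\ni am feeling grouchy;anger\n"

# Default settings: comma separator, first line consumed as the header
wrong = pd.read_csv(io.StringIO(raw))
print(wrong.shape)

# Explicit separator, no header row, column names supplied by hand
right = pd.read_csv(io.StringIO(raw), sep=";", header=None, names=["Text", "Emotion"])
print(right.shape)
print(right)
```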


In [6]:
import pandas as pd

# Load the dataset
df = pd.read_csv('train.txt', sep=';', header=None, names=['Text', 'Emotion'])

# Check the shape
print("Shape of the dataset:", df.shape)

# Display the top 5 rows
print(df.head())



Shape of the dataset: (16000, 2)
                                                Text  Emotion
0                            i didnt feel humiliated  sadness
1  i can go from feeling so hopeless to so damned...  sadness
2   im grabbing a minute to post i feel greedy wrong    anger
3  i am ever feeling nostalgic about the fireplac...     love
4                               i am feeling grouchy    anger


In [7]:
# Check the distribution of emotions
print("Distribution of Emotion:")
print(df['Emotion'].value_counts())



Distribution of Emotion:
Emotion
joy         5362
sadness     4666
anger       2159
fear        1937
love        1304
surprise     572
Name: count, dtype: int64


In [8]:
from imblearn.over_sampling import RandomOverSampler

# Randomly duplicate minority-class rows until every class matches the largest one
oversample = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversample.fit_resample(df[['Text']], df['Emotion'])

# Combine back into a DataFrame (later cells continue with df, not resampled_df)
resampled_df = pd.DataFrame({'Text': X_resampled['Text'], 'Emotion': y_resampled})
print(resampled_df['Emotion'].value_counts())


Emotion
sadness     5362
anger       5362
love        5362
surprise    5362
fear        5362
joy         5362
Name: count, dtype: int64
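
One caveat if a balanced set like this is used for training: oversampling before the train/test split can place copies of the same comment in both sets and inflate the test scores. The sketch below splits first and then oversamples only the training part. It uses plain pandas rather than `RandomOverSampler`, and the toy data is invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data invented for illustration
toy = pd.DataFrame({
    "Text": [f"comment {i}" for i in range(12)],
    "Emotion": ["joy"] * 8 + ["fear"] * 4,
})

# Split first so duplicated rows can never end up in the test set
train, test = train_test_split(toy, test_size=0.25, random_state=42, stratify=toy["Emotion"])

# Top up each class with random duplicates until it matches the largest class
target = train["Emotion"].value_counts().max()
parts = []
for _, group in train.groupby("Emotion"):
    extra = group.sample(n=target - len(group), replace=True, random_state=42)
    parts.append(pd.concat([group, extra]))
balanced_train = pd.concat(parts, ignore_index=True)

print(balanced_train["Emotion"].value_counts())
```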


In [10]:
# Create a mapping for emotions
emotion_mapping = {
    'joy': 0,
    'fear': 1,
    'anger': 2,
    'sadness': 3,
    'love': 4,
    'surprise': 5
}

# Add a new column based on the mapping
df['Emotion_num'] = df['Emotion'].map(emotion_mapping)

# Print the top 5 rows to verify
print(df.head())


                                                Text  Emotion  Emotion_num
0                            i didnt feel humiliated  sadness            3
1  i can go from feeling so hopeless to so damned...  sadness            3
2   im grabbing a minute to post i feel greedy wrong    anger            2
3  i am ever feeling nostalgic about the fireplac...     love            4
4                               i am feeling grouchy    anger            2


<h3>Use text pre-processing to remove stop words and punctuation, and apply lemmatization</h3>

In [13]:
import spacy

# Load the English language model and create an nlp object from it
nlp = spacy.load("en_core_web_sm")


# Utility function that returns the preprocessed text
def preprocess(text):
    # Drop stop words and punctuation, then lemmatize the remaining tokens
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)

    return " ".join(filtered_tokens)

In [15]:
print(df.columns)


Index(['Text', 'Emotion', 'Emotion_num'], dtype='object')


In [16]:
# Apply the preprocessing function to the 'Text' column
df['preprocessed_comment'] = df['Text'].apply(preprocess)

# Check the updated DataFrame
print(df.head())



                                                Text  Emotion  Emotion_num  \
0                            i didnt feel humiliated  sadness            3   
1  i can go from feeling so hopeless to so damned...  sadness            3   
2   im grabbing a minute to post i feel greedy wrong    anger            2   
3  i am ever feeling nostalgic about the fireplac...     love            4   
4                               i am feeling grouchy    anger            2   

                      preprocessed_comment  
0                       not feel humiliate  
1  feel hopeless damned hopeful care awake  
2     m grab minute post feel greedy wrong  
3   feel nostalgic fireplace know property  
4                             feel grouchy  


**Build a model with preprocessed text**

In [17]:
from sklearn.model_selection import train_test_split

# Use preprocessed_comment as the features and Emotion_num as the target (also used for stratification)
X = df['preprocessed_comment']
y = df['Emotion_num']

# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2022, stratify=y)

# Print the shapes of the train and test sets to confirm the split
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")


Shape of X_train: (12800,)
Shape of X_test: (3200,)
Shape of y_train: (12800,)
Shape of y_test: (3200,)
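
The `stratify` argument keeps the class proportions the same in both splits, which matters here because the classes are imbalanced. A small check on invented labels with a 70/20/10 class mix:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with a 70/20/10 class mix, invented for illustration
labels = pd.Series([0] * 70 + [1] * 20 + [2] * 10)
features = pd.Series([f"text {i}" for i in range(len(labels))])

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=2022, stratify=labels
)

# Both splits should show the same class proportions
print(y_tr.value_counts(normalize=True).sort_index())
print(y_te.value_counts(normalize=True).sort_index())
```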


**Let's check the scores with our best model so far**
- Random Forest

**Attempt 1**:

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with both unigrams and bigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Create the classification pipeline
model_rf_ngram = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),  # Use both unigrams and bigrams
    RandomForestClassifier(n_estimators=50, criterion='entropy', random_state=2022)  # Random Forest Classifier
)

# Train-test split (use the preprocessed comments as input features)
X_train, X_test, y_train, y_test = train_test_split(df['preprocessed_comment'], df['Emotion_num'], test_size=0.2, random_state=2022, stratify=df['Emotion_num'])

# Train the model
model_rf_ngram.fit(X_train, y_train)

# Predict on the test set
y_pred = model_rf_ngram.predict(X_test)

# Print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.89      0.88      1072
           1       0.84      0.83      0.84       387
           2       0.87      0.83      0.85       432
           3       0.87      0.90      0.88       933
           4       0.77      0.69      0.73       261
           5       0.86      0.64      0.74       115

    accuracy                           0.85      3200
   macro avg       0.84      0.80      0.82      3200
weighted avg       0.85      0.85      0.85      3200
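
The rows in the report are labelled with the numeric codes from `emotion_mapping`. As a sketch, the emotion names can be shown instead by passing `target_names` to `classification_report`. The label vectors below are made up only to keep the example runnable.

```python
from sklearn.metrics import classification_report

# Same mapping as defined earlier in the notebook
emotion_mapping = {'joy': 0, 'fear': 1, 'anger': 2, 'sadness': 3, 'love': 4, 'surprise': 5}

# Order the names by numeric code so they line up with the labels argument
labels = sorted(emotion_mapping.values())
names = sorted(emotion_mapping, key=emotion_mapping.get)

# Tiny made-up label vectors, not results from the model
y_true_demo = [0, 1, 2, 3, 4, 5, 0, 3]
y_pred_demo = [0, 1, 2, 3, 4, 0, 0, 3]

report = classification_report(
    y_true_demo, y_pred_demo, labels=labels, target_names=names, zero_division=0
)
print(report)
```

With the notebook's variables, the same call would take `y_test` and `y_pred` in place of the demo vectors.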




**Attempt 2**:

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use the **TF-IDF vectorizer** to turn the text into features.
- Use **RandomForest** as the classifier.
- Print the classification report.
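
For reference, with default settings scikit-learn's `TfidfVectorizer` multiplies each raw term count by idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, and then scales each row to unit L2 norm. The sketch below reproduces that by hand on an invented toy corpus and compares it with the library output.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus invented for illustration
corpus = ["feel happy today", "feel sad today", "feel scared"]

# Raw term counts per document
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (the scikit-learn default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then normalize each row to unit L2 length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
```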


In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Create the classification pipeline with TF-IDF and Random Forest
model_rf_tfidf = make_pipeline(
    TfidfVectorizer(),  # TF-IDF features with default settings (unigrams only)
    RandomForestClassifier(n_estimators=50, criterion='entropy', random_state=2022)  # RandomForest as classifier
)

# Train-test split (use the preprocessed comments as input features)
X_train, X_test, y_train, y_test = train_test_split(df['preprocessed_comment'], df['Emotion_num'], test_size=0.2, random_state=2022, stratify=df['Emotion_num'])

# Train the model
model_rf_tfidf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model_rf_tfidf.predict(X_test)

# Print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.79      0.90      0.84      1072
           1       0.79      0.82      0.80       387
           2       0.84      0.83      0.83       432
           3       0.89      0.81      0.85       933
           4       0.76      0.64      0.69       261
           5       0.81      0.57      0.67       115

    accuracy                           0.82      3200
   macro avg       0.81      0.76      0.78      3200
weighted avg       0.82      0.82      0.82      3200
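
The two attempts differ in more than the weighting scheme: Attempt 1 uses unigrams and bigrams, while Attempt 2 uses the `TfidfVectorizer` defaults. A like-for-like comparison would hold the n-gram range fixed. The helper below, `compare_vectorizers`, is a hypothetical sketch of such a comparison and is run here on invented toy data, so its scores say nothing about the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def compare_vectorizers(texts, labels, ngram_range=(1, 2), seed=2022):
    """Hypothetical helper: macro-F1 for count vs. TF-IDF features, same settings otherwise."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    vectorizers = {
        "count": CountVectorizer(ngram_range=ngram_range),
        "tfidf": TfidfVectorizer(ngram_range=ngram_range),
    }
    scores = {}
    for name, vectorizer in vectorizers.items():
        model = make_pipeline(
            vectorizer,
            RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=seed),
        )
        model.fit(X_tr, y_tr)
        scores[name] = f1_score(y_te, model.predict(X_te), average="macro")
    return scores


# Toy data invented for illustration
toy_texts = ["feel happy and glad"] * 10 + ["feel scared and afraid"] * 10
toy_labels = [0] * 10 + [1] * 10
scores = compare_vectorizers(toy_texts, toy_labels)
print(scores)
```

With the notebook's data, the call would be `compare_vectorizers(df['preprocessed_comment'], df['Emotion_num'])`.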

