## Tweet Emotion Classification with scikit-learn

### Project Overview:

This project classifies emotions in tweets using scikit-learn. I used TF-IDF to convert the tweets into numerical features the model could understand, trained a Multinomial Naive Bayes baseline, and then tested a logistic regression approach with feature scaling and a different solver to boost accuracy. The goal was to explore how different preprocessing and modeling choices affect performance on real-world text data.

### Project Objective:

My goal with this project was to get hands-on with natural language processing and practice building a model that could recognize emotions in text. Along the way, I experimented with preprocessing steps and model parameters to see what worked best. More than just optimizing for high accuracy, I wanted this to be a meaningful learning experience and something others could reference when working on similar text classification tasks.

## Project Walkthrough:

### 1. Importing Libraries and Loading Data:

In [16]:
# Importing libraries:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

# Loading Twitter dataset:
data = pd.read_parquet('train-00000-of-00001.parquet')

data.head()

                                                text  label
0  i feel awful about it too because it s my job ...      0
1                              im alone i feel awful      0
2  ive probably mentioned this before but i reall...      1
3           i was feeling a little low few days back      0
4  i beleive that i am much more sensitive to oth...      2


***Explanation:***

The Twitter dataset is stored in **Parquet** format, which efficiently handles large datasets. I loaded it using Pandas and took a quick look at the first few rows to better understand the overall structure.

### 2. Exploring Data:

In [20]:
# Checking the structure of the data more deeply
data.info()  # shows the columns, data types, and non-null counts (info() prints directly, so print() is not needed)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 416809 entries, 0 to 416808
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    416809 non-null  object
 1   label   416809 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 6.4+ MB


In [22]:
# Checking the distribution of the target labels (in this case, emotions)
print(data['label'].value_counts())  # counts how many tweets carry each emotion label

label
1    141067
0    121187
3     57317
4     47712
2     34554
5     14972
Name: count, dtype: int64
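
The raw counts are easier to compare as percentages. The snippet below is a minimal sketch that re-enters the counts printed above by hand (so it runs without the dataset) and computes each class's share and the ratio between the largest and smallest class. The variable names `label_counts`, `label_share`, and `imbalance_ratio` are my own and are not part of the original notebook.

```python
import pandas as pd

# Label counts copied from the value_counts() output above.
label_counts = pd.Series(
    {1: 141067, 0: 121187, 3: 57317, 4: 47712, 2: 34554, 5: 14972},
    name="count",
)

# Share of each class as a percentage of all rows.
label_share = (label_counts / label_counts.sum() * 100).round(1)

# Ratio between the most frequent and the least frequent class.
imbalance_ratio = label_counts.max() / label_counts.min()

print(label_share)
print(f"Largest class is {imbalance_ratio:.1f}x the size of the smallest")
```

On the real DataFrame, `data['label'].value_counts(normalize=True)` should give the same proportions directly. The imbalance matters later, because the rarest labels are the ones a model is most likely to miss.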


### 3. Data Preprocessing and Transformation:

In [24]:
# Mapping labels to emotion names for readability (this step is optional, but I prefer to do it this way)
emotion_map = {0: 'sadness', 1: 'joy', 2: 'love', 3: 'anger', 4: 'fear', 5: 'surprise'}
data['emotion'] = data['label'].map(emotion_map)

# Previewing the updated DataFrame:
print(data[['text', 'emotion']].head())

                                                text  emotion
0  i feel awful about it too because it s my job ...  sadness
1                              im alone i feel awful  sadness
2  ive probably mentioned this before but i reall...      joy
3           i was feeling a little low few days back  sadness
4  i beleive that i am much more sensitive to oth...     love


***Explanation:***

Here, I'm mapping the numeric labels to emotion names that are easy for humans to read. This makes the analysis and the results easier to interpret.

### 4. Feature Engineering with TF-IDF:

***Here, the goal is to transform the raw text into information our machine learning model can use. In other words, I'm preparing the features and the target needed to train the model.***

In [31]:
# Defining features and target:
X = data['text']  # text data
y = data['label']  #numeric labels

# Splitting data into training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Converting text to numerical data using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

***Explanation:***

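***TF-IDF Vectorizer:*** Converts text data into a numerical format that models can process. Setting max_features=5000 limits the vocabulary to the 5,000 most frequent terms in the training data, which trades a little detail for a smaller feature matrix and faster training. The vectorizer is fitted on the training set only and then reused to transform the test set, so no information from the test tweets leaks into the features.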

***Data Splitting:*** Here I'm using an 80/20 train-test split to evaluate model performance.

### 5. Model Training and Evaluation:

In [34]:
# Training a Multinomial Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)

# To make predictions on the test data
y_pred = model.predict(X_test_vectorized)

# Evaluating the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(classification_report(y_test, y_pred, target_names=emotion_map.values()))

Accuracy: 0.84
              precision    recall  f1-score   support

     sadness       0.84      0.95      0.89     24504
         joy       0.78      0.97      0.87     28247
        love       0.96      0.43      0.59      6853
       anger       0.94      0.77      0.84     11339
        fear       0.90      0.69      0.78      9376
    surprise       0.98      0.24      0.38      3043

    accuracy                           0.84     83362
   macro avg       0.90      0.68      0.73     83362
weighted avg       0.85      0.84      0.82     83362



***Explanation:***

***Model Choice:*** Naive Bayes 

Multinomial Naive Bayes trains quickly and handles high-dimensional, sparse features such as TF-IDF text vectors well, which makes it a popular baseline for text classification.

***Evaluation:*** At the evaluation step, I calculate the accuracy and print a detailed classification report, which shows precision, recall, and F1-score for each emotion. Precision is the share of predictions for a class that are correct, recall is the share of actual instances of a class that the model finds, and the F1-score is the harmonic mean of the two. A high F1-score means both precision and recall are high, while a low F1-score means at least one of them is low.
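
As a quick illustration of how these three metrics relate, the sketch below computes them from raw counts. The helper `precision_recall_f1` and the counts passed to it are hypothetical and chosen only to show the pattern of high precision with low recall. They are not taken from the model's actual confusion matrix.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: the class is predicted 25 times (24 correctly),
# but it actually occurs 100 times, so 76 instances are missed.
p, r, f1 = precision_recall_f1(tp=24, fp=1, fn=76)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so very high precision cannot compensate for low recall.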

***Output Interpretation:*** 


Overall, the model achieves an accuracy of ***84%***. It performs best on joy and sadness, with recall of 0.97 and 0.95 and F1-scores of 0.87 and 0.89, suggesting that it identifies these two emotions reliably. These are also the two largest classes in the dataset, so the overall accuracy mostly reflects how well the model handles them.

For love, anger, and fear, the model shows high precision (0.90 or above) but lower recall (0.43, 0.77, and 0.69), which means that its predictions for these emotions are usually correct, but it misses many occurrences. The surprise category shows very high precision (0.98) but very low recall (0.24), indicating that the model rarely predicts surprise, but when it does it is usually correct.


The macro and weighted averages show that the model's performance varies across emotions: macro-average recall is only 0.68, while weighted-average recall is 0.84, because the weighted average is dominated by the large joy and sadness classes. This leaves room for improvement, especially in recalling less frequent emotions like surprise and love.
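
The difference between the two averages can be reproduced from the report itself. The sketch below re-enters the per-class recall and support values from the Naive Bayes report above and averages them both ways. The array names are my own. Since the report rounds each value to two decimals, the weighted result here may differ from the reported 0.84 in the second decimal place.

```python
import numpy as np

# Per-class recall and support copied from the Naive Bayes report above
# (order: sadness, joy, love, anger, fear, surprise).
recall = np.array([0.95, 0.97, 0.43, 0.77, 0.69, 0.24])
support = np.array([24504, 28247, 6853, 11339, 9376, 3043])

# Macro average: every class counts equally, regardless of size.
macro_recall = recall.mean()

# Weighted average: each class counts in proportion to its support.
weighted_recall = np.average(recall, weights=support)

print(f"macro recall:    {macro_recall:.3f}")
print(f"weighted recall: {weighted_recall:.3f}")
```

The macro average is lower because love and surprise count as much as joy and sadness, even though together they make up a small share of the test set.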

### 6. Additional Enhancements and Model Comparison:

***Adding Logistic Regression Model for Comparison with Naive Bayes Model***

So far, we have a working model. However, one of my main goals with this project was to maximize accuracy and make the model useful, so I decided to add another training and evaluation section using a different model type. I will be using a Logistic Regression model, which is a common alternative for text classification, and I will compare its performance with Naive Bayes to see which one performs better.

In [37]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MaxAbsScaler

# Scaling the TF-IDF features
scaler = MaxAbsScaler()
X_train_scaled = scaler.fit_transform(X_train_vectorized)
X_test_scaled = scaler.transform(X_test_vectorized)

# Initializing the Logistic Regression model with increased max_iter and an alternative solver
lr_model = LogisticRegression(max_iter=500, solver='saga', random_state=42)

# Training the model on the scaled training data
lr_model.fit(X_train_scaled, y_train)

# To make predictions on the scaled test data
y_pred_lr = lr_model.predict(X_test_scaled)

# Evaluating our new model's accuracy
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print(f"Logistic Regression Accuracy (with scaling and solver='saga'): {accuracy_lr:.2f}")
print(classification_report(y_test, y_pred_lr, target_names=emotion_map.values()))

Logistic Regression Accuracy (with scaling and solver='saga'): 0.90
              precision    recall  f1-score   support

     sadness       0.94      0.94      0.94     24504
         joy       0.92      0.93      0.92     28247
        love       0.80      0.77      0.78      6853
       anger       0.89      0.90      0.90     11339
        fear       0.85      0.84      0.85      9376
    surprise       0.78      0.71      0.74      3043

    accuracy                           0.90     83362
   macro avg       0.86      0.85      0.86     83362
weighted avg       0.90      0.90      0.90     83362



***Explanation:***

***New Model Choice:*** Logistic Regression. 


For text classification, logistic regression often outperforms Naive Bayes because it is a discriminative model: it learns a weight for each term directly from the labels instead of assuming that words are conditionally independent given the class. Correlated words are therefore less likely to be double-counted, which tends to give more accurate predictions. Logistic regression also produces class probability estimates that are usually better calibrated than those of Naive Bayes, and it supports regularization (scikit-learn applies an L2 penalty by default) to limit overfitting. Overall, it is a robust choice for sparse, high-dimensional text features.
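
The probability estimates and the regularization setting can be sketched on a tiny made-up corpus. Everything here (the sentences, the names `toy_texts`, `toy_labels`, and `clf`) is hypothetical, and the pipeline uses scikit-learn's default solver rather than the `saga` setup from the cell above, purely to keep the example short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical examples, not from the dataset).
toy_texts = [
    "i feel awful and alone",
    "i feel so sad today",
    "i feel happy and excited",
    "i am so happy today",
    "i am angry about this",
    "this makes me so angry",
]
toy_labels = ["sadness", "sadness", "joy", "joy", "anger", "anger"]

# C is the inverse regularization strength: smaller C means a stronger penalty.
clf = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(C=1.0, max_iter=1000),
)
clf.fit(toy_texts, toy_labels)

# predict_proba returns one probability per class, in the order of clf.classes_.
proba = clf.predict_proba(["i feel happy"])
for emotion, p in zip(clf.classes_, proba[0]):
    print(f"{emotion}: {p:.2f}")
```

Wrapping the vectorizer and the classifier in one pipeline also means raw text can be passed straight to `fit` and `predict_proba`, which avoids accidentally transforming train and test data with different vocabularies.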

***Output Interpretation:*** 


For the logistic regression model, I scaled the TF-IDF features with MaxAbsScaler and used the 'saga' solver with max_iter=500. Compared with the Naive Bayes model, overall accuracy increased from ***84%*** to ***90%***. The results shown here don't isolate how much of that gain comes from the scaling and solver settings versus the model type itself.

The class-wise analysis also shows big improvements, especially for the 'surprise' and 'love' emotions, with F1-scores improving from ***0.38*** to ***0.74*** and from ***0.59*** to ***0.78***, respectively. The other four emotions also gained between 0.05 and 0.07 in F1-score.

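Moreover, the weighted averages of precision, recall, and F1-score all improved, as did macro-average recall (0.68 to 0.85) and macro-average F1 (0.73 to 0.86). Macro-average precision dropped slightly, from 0.90 to 0.86, because logistic regression predicts the rarer emotions more often and gives up some precision in exchange for much higher recall. Overall, this indicates a more accurate and balanced classification across all categories of emotions.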
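
To see the per-class change at a glance, the F1-scores from the two reports above can be compared directly. This sketch re-enters the reported values by hand, and the dictionary names `f1_nb`, `f1_lr`, and `f1_gain` are my own.

```python
# Per-class F1-scores copied from the two classification reports above.
f1_nb = {"sadness": 0.89, "joy": 0.87, "love": 0.59,
         "anger": 0.84, "fear": 0.78, "surprise": 0.38}
f1_lr = {"sadness": 0.94, "joy": 0.92, "love": 0.78,
         "anger": 0.90, "fear": 0.85, "surprise": 0.74}

# Gain in F1 when moving from Naive Bayes to logistic regression.
f1_gain = {emotion: round(f1_lr[emotion] - f1_nb[emotion], 2) for emotion in f1_nb}

# Print the emotions from largest to smallest gain.
for emotion, gain in sorted(f1_gain.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{emotion:<9} {f1_nb[emotion]:.2f} -> {f1_lr[emotion]:.2f}  (+{gain:.2f})")
```

The largest gains fall on the two least frequent emotions, which is where Naive Bayes had the lowest recall.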

## Conclusion: 

This project began as an opportunity to deepen my understanding of scikit-learn and apply it to a real-world NLP task. Along the way, I explored key text classification techniques, compared a Naive Bayes baseline with logistic regression, and examined the impact of preprocessing choices on performance. By building on foundational knowledge in data science and insights from Kaggle resources, I created a scalable and interpretable workflow for multi-class emotion detection. I hope this project provides both practical insights and inspiration for anyone working with text data or looking to strengthen their machine learning skill set.

## Dataset Info:

***Source:*** "Twitter Emotion Classification Dataset" on Kaggle.com (labeled tweets for multi-class emotion detection).