<h2>CS 3780/5780 Creative Project: </h2>
<h3>Emotion Classification of Natural Language</h3>

Names and NetIDs for your group members:

<h3>Introduction:</h3>

<p> The creative project is about conducting a real-world machine learning project on your own, with everything that is involved. Unlike in the programming projects 1-5, where we gave you all the scaffolding and you just filled in the blanks, you now start from scratch. The past programming projects provide templates for how to do this (and you can reuse part of your code if you wish), and the lectures provide some of the methods you can use. So, this creative project brings realism to how you will use machine learning in the real world.  </p>

<p>The task you will work on is classifying texts by the human emotions they express. Through words, we express feelings, articulate thoughts, and communicate our deepest needs and desires. Language helps us interpret the nuances of joy, sadness, anger, and love, allowing us to connect with others on a deeper level. Can you train an ML model that recognizes the human emotions expressed in a piece of text? <b>Please read the project description PDF file carefully and follow the instructions there. Also make sure you write your code and answers to all the questions in this Jupyter Notebook.</b> </p>
<p>


<h2>Part 0: Basics</h2><p>

<h3>0.1 Import:</h3><p>
Please import the packages you need. Note that learning and using these packages is recommended but not required for this project. Official tutorials for the suggested packages include:
    
https://scikit-learn.org/stable/tutorial/basic/tutorial.html
    
https://pytorch.org/tutorials/
    
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
<p>

In [1]:
import os
import pandas as pd
import numpy as np
import re
import torch

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder
import nltk
from nltk.corpus import stopwords

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


<h3>0.2 Accuracy and Mean Squared Error:</h3><p>
To measure your performance in the Kaggle Competition, we use accuracy. As a recap, accuracy is the fraction of labels you predict correctly. To compute it, you can use library functions from sklearn. A simple example is shown below. 
<p>

In [3]:
from sklearn.metrics import accuracy_score
y_pred = [3, 2, 1, 0, 1, 2, 3]
y_true = [0, 1, 2, 3, 1, 2, 3]
accuracy_score(y_true, y_pred)

0.42857142857142855
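
As a sanity check, the same number can be reproduced by hand. The sketch below is a minimal pure-Python version (the helper name `manual_accuracy` is our own and is not part of sklearn) that counts the positions where prediction and true label agree and divides by the number of labels.

```python
def manual_accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


y_pred = [3, 2, 1, 0, 1, 2, 3]
y_true = [0, 1, 2, 3, 1, 2, 3]
acc = manual_accuracy(y_true, y_pred)
print(acc)  # 3 of the 7 labels match
```

Only the last three positions agree, so the accuracy is 3/7, which is the value sklearn reports above.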

<h2>Part 1: Basic</h2><p>
Note that your code should be commented well and in part 1.4 you can refer to your comments.

<h3>1.1 Load and preprocess the dataset:</h3><p>
The cell below shows how to load the data in a Kaggle Notebook.
<p>

In [2]:
train = pd.read_csv("train.csv")
train_text = train["text"]
train_label = train["label"]

test = pd.read_csv("test.csv")
test_id = test["id"]
test_text = test["text"]

In [4]:
# Make sure you comment your code clearly and you may refer to these comments in the part 1.4
# TODO

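# 2. Initialize tools for preprocessing
# (stopwords.words needs the NLTK data: run nltk.download('stopwords') once)
# Note: preprocess_text below does not use these; TfidfVectorizer removes stop words itself.
from nltk.stem import WordNetLemmatizer  # needed for the lemmatizer below
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()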

# 3. Define a text preprocessing function
def preprocess_text(text):
    """
    Cleans and preprocesses the input text.
    """
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)  # Remove URLs
    return text

train_text = train["text"].apply(preprocess_text)
test_text = test["text"].apply(preprocess_text)
train_label = train["label"]

# 5. Encode labels
label_encoder = LabelEncoder()
train_label_encoded = label_encoder.fit_transform(train_label)

# 6. Split the training data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    train_text, train_label_encoded, test_size=0.2, random_state=42
)

# 7. TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(
    max_features=20000,  # Use top 20,000 features for richer representation
    ngram_range=(1, 3),  # Include unigrams, bigrams, and trigrams
    stop_words="english",  # Remove common stop words
    min_df=5,  # Ignore terms appearing in fewer than 5 documents
    max_df=0.8,  # Ignore terms appearing in more than 80% of documents
)

# Fit and transform the train data, transform validation and test data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_val_tfidf = tfidf_vectorizer.transform(X_val)
X_test_tfidf = tfidf_vectorizer.transform(test_text)

# 8. Display preprocessing summary
print("Preprocessing Complete!")
print(f"Train TF-IDF shape: {X_train_tfidf.shape}")
print(f"Validation TF-IDF shape: {X_val_tfidf.shape}")
print(f"Test TF-IDF shape: {X_test_tfidf.shape}")

# 9. Display sample features from the TF-IDF vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()
print(f"Number of features: {len(feature_names)}")
print("Sample Feature Names:")
print(feature_names[:30])  # Display the first 30 feature names

# Save the processed datasets if needed
X_train.to_csv("X_train_cleaned.csv", index=False)
X_val.to_csv("X_val_cleaned.csv", index=False)
test_text.to_csv("test_cleaned.csv", index=False)


Preprocessing Complete!
Train TF-IDF shape: (10000, 3087)
Test TF-IDF shape: (15000, 3087)
Number of features: 3087
Sample Feature Names:
['10' '20' '30' 'ability' 'able' 'able help' 'absolute' 'absolutely'
 'abuse' 'abused' 'accept' 'acceptable' 'accepted' 'accepting' 'access'
 'accident' 'accidentally' 'accomplished' 'accomplishment' 'account'
 'accounts' 'ache' 'aching' 'act' 'acted' 'acting' 'action' 'active'
 'activities' 'actor']
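
The vocabulary ends up much smaller than `max_features=20000` because `min_df` and `max_df` prune n-grams by document frequency before the cap applies. The sketch below illustrates that pruning on a toy corpus in pure Python. It is a simplified illustration, not sklearn's implementation: the helper names, the whitespace tokenizer, and the toy sentences are our own assumptions, and stop-word removal is left out.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams (joined by spaces) with n_min <= n <= n_max."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def prune_vocabulary(docs, min_df=2, max_df=0.8):
    """Keep n-grams found in at least min_df documents and at most max_df (fraction) of them."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(ngrams(doc.split())))  # count each term once per document
    max_count = max_df * len(docs)
    return sorted(term for term, df in doc_freq.items() if min_df <= df <= max_count)


toy_docs = [
    "i feel so happy today",
    "i feel so sad today",
    "i feel angry",
    "so happy for you",
    "i am happy",
]
vocab = prune_vocabulary(toy_docs, min_df=2, max_df=0.8)
print(vocab)
```

Rare terms such as "sad" or "angry" occur in a single toy document and are dropped by `min_df=2`, while bigrams such as "i feel" survive because they recur across documents. Lowering `max_df` to 0.7 would also drop "i", which appears in four of the five documents.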


<h3>1.2 Use At Least Two Training Algorithms from class:</h3><p>
You need to use at least two training algorithms from class. You can use your code from previous projects or any packages you imported in part 0.1.

In [6]:
# Make sure you comment your code clearly and you may refer to these comments in the part 1.4
# TODO

# Model 1: Logistic Regression

# Baseline logistic regression: this cell only reports its training accuracy.
# We want to use the entire training set for k-fold cross-validation,
# so this line restores the full training set (train + validation split).
X_train, y_train = train_text, train_label_encoded
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(test_text)
print(f"Train TF-IDF shape: {X_train_tfidf.shape}")
print(f"Test TF-IDF shape: {X_test_tfidf.shape}")

# 2. Train Logistic Regression model
logreg = LogisticRegression(max_iter=1000, random_state=42)  # Increase max_iter if needed
logreg.fit(X_train_tfidf, y_train)

# 3. Make predictions
y_train_pred = logreg.predict(X_train_tfidf)

# 4. Evaluate the model
train_accuracy = accuracy_score(y_train, y_train_pred)

print(f"Training Accuracy: {train_accuracy:.4f}")
# Fine tuning scikit
#Hyperparameter tuning and cross validation:


In [11]:
# Model 2: SVM with an RBF kernel
from sklearn.svm import SVC  # SVC is not imported in part 0.1

svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train_tfidf, y_train)
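
To make the `kernel='rbf'` and `gamma='scale'` settings concrete, here is a small pure-Python sketch of the quantities involved. It assumes the documented sklearn rule that `gamma='scale'` means 1 / (n_features * X.var()); the function names and the toy matrix are our own, and the real SVC computes this on the sparse TF-IDF matrix rather than on nested lists.

```python
import math


def gamma_scale(X):
    """gamma = 1 / (n_features * variance of all entries of X)."""
    values = [v for row in X for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    n_features = len(X[0])
    return 1.0 / (n_features * variance)


def rbf_kernel(x, z, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


X_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gamma = gamma_scale(X_toy)
print(gamma)
print(rbf_kernel(X_toy[0], X_toy[0], gamma))  # identical points give similarity 1
print(rbf_kernel(X_toy[0], X_toy[1], gamma))  # distant points give a value near 0
```

A larger `gamma` makes the similarity fall off faster with distance, so each training example influences a smaller neighborhood.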

<h3>1.3 Training, Validation and Model Selection:</h3><p>
You need to split your data into a training set and a validation set, or perform cross-validation, for model selection.

In [None]:
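from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Grid search over the SVM regularization parameter C (RBF kernel only)
svm_params = {
    'kernel': ['rbf'],
    'C': np.arange(1, 5)
}

grid = GridSearchCV(estimator=SVC(), param_grid=svm_params, scoring='accuracy')
grid.fit(X_train_tfidf, y_train)

print(grid.best_params_)
print(grid.best_score_)

# SVM validation check
# Re-vectorize the validation texts with the currently fitted vectorizer.
# Caveat: svm was fit on the full training set, which includes these
# validation examples, so this accuracy is optimistic; grid.best_score_
# is the more reliable estimate.
X_val_tfidf = tfidf_vectorizer.transform(X_val)
y_val_pred = svm.predict(X_val_tfidf)

# Evaluate the model
val_accuracy = accuracy_score(y_val, y_val_pred)

print(f"Validation Accuracy: {val_accuracy:.4f}")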
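
To see how much work a grid search involves, the parameter grid can be expanded by hand. The sketch below is a pure-Python illustration (the helper `expand_grid` is hypothetical, not a sklearn function), assuming GridSearchCV's default of 5 folds when `cv` is not given.

```python
from itertools import product


def expand_grid(param_grid):
    """List every combination of the values in a {name: [values]} grid."""
    names = sorted(param_grid)
    return [
        dict(zip(names, combo))
        for combo in product(*(param_grid[name] for name in names))
    ]


svm_grid = {"kernel": ["rbf"], "C": [1, 2, 3, 4]}  # same values as np.arange(1, 5)
candidates = expand_grid(svm_grid)
n_folds = 5  # assumed default number of folds
total_fits = len(candidates) * n_folds

for candidate in candidates:
    print(candidate)
print(f"{len(candidates)} candidates x {n_folds} folds = {total_fits} fits")
```

With the default `refit=True`, one more fit on the full training data follows the search, using the best candidate.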


In [5]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

#Want to use Entire Training Set for K-fold cross validation!! This line restores the entire training set 
X_train, y_train = train_text, train_label_encoded
X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(test_text)
print(f"Train TF-IDF shape: {X_train_tfidf.shape}")
print(f"Test TF-IDF shape: {X_test_tfidf.shape}")

# Define the hyperparameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  # Inverse regularization strength (smaller C = stronger regularization)
    'solver': ['liblinear', 'saga'],  # Solvers to compare; both support the default L2 penalty
}

# Set up StratifiedKFold for cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Define Logistic Regression model
logreg = LogisticRegression(random_state=42)

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=logreg,
    param_grid=param_grid,
    scoring='accuracy',
    cv=cv,  # Use StratifiedKFold
    n_jobs=-1,  # Use all available cores
    verbose=2   # Show detailed output
)

# Fit GridSearchCV to the training data
grid_search.fit(X_train_tfidf, y_train)

# Output the best parameters and accuracy
print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_:.4f}")

# Evaluate the best model on the training set
best_model = grid_search.best_estimator_
y_train_pred = best_model.predict(X_train_tfidf)
train_accuracy = accuracy_score(train_label_encoded, y_train_pred)
print(f"Training Accuracy with Best Model: {train_accuracy:.4f}")

# Use the best model to make predictions
test_predictions = best_model.predict(X_test_tfidf)

# Save predictions to a CSV file
output = pd.DataFrame({"id": test_id, "label": test_predictions})
output.to_csv("submission_gridsearch3.csv", index=False)
print("Predictions saved to submission_gridsearch3.csv")


Fitting 5 folds for each of 10 candidates, totalling 50 fits





Best Parameters: {'C': 10, 'solver': 'liblinear'}
Best Cross-Validation Accuracy: 0.7156
Training Accuracy with Best Model: 0.9438
Predictions saved to submission_gridsearch3.csv
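
The cross-validation accuracy (0.7156) is the number to use for model selection; the much higher training accuracy (0.9438) is measured on data the model has already seen. The sketch below shows the idea behind stratified k-fold splitting in pure Python. It is a simplified stand-in for `StratifiedKFold`, not sklearn's algorithm: the function name, the round-robin dealing, and the toy labels are our own assumptions.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits=5, seed=42):
    """Split indices into n_splits folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


toy_labels = [0] * 10 + [1] * 5  # imbalanced toy labels
folds = stratified_folds(toy_labels, n_splits=5)
for k, val_idx in enumerate(folds):
    held_out = set(val_idx)
    train_idx = [i for i in range(len(toy_labels)) if i not in held_out]
    counts = Counter(toy_labels[i] for i in val_idx)
    print(f"fold {k}: train={len(train_idx)} val={len(val_idx)} class counts={dict(counts)}")
```

Every fold holds out two examples of class 0 and one of class 1, so each validation fold mirrors the 2:1 class ratio of the full toy set.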


<h3>1.4 Explanation in Words:</h3><p>
    You need to answer the following questions in the markdown cell after this cell:

1.4.1 How did you formulate the learning problem?

1.4.2 Which two learning methods from class did you choose and why did you make these choices?

1.4.3 How did you do the model selection?

1.4.4 Does the test performance reach the first baseline "Tiny Piney"? (Please include a screenshot of Kaggle Submission)

<h2>Part 2: Be creative!</h2><p>

<h3>2.1 Open-ended Code:</h3><p>
You may follow the steps in part 1 again while making innovative changes, such as using new training algorithms. Make sure you explain everything clearly in part 2.2. Note that beating "Zero Hero" is only a small portion of this part. Any creative ideas will receive most points as long as they are reasonable and clearly explained.

In [None]:
#METHOD: BERT Transformer

In [3]:
pip install transformers



Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install transformers[torch]


Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install accelerate==0.26.0



Note: you may need to restart the kernel to use updated packages.


In [6]:
pip install 'transformers[torch]'



Note: you may need to restart the kernel to use updated packages.


In [8]:
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, Trainer, TrainingArguments
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
data = pd.read_csv('train.csv')
train_texts, val_texts, train_labels, val_labels = train_test_split(data['text'], data['label'], test_size=0.1)

# Tokenization
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_texts.tolist(), truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(val_texts.tolist(), truncation=True, padding=True, max_length=128)

# Dataset class
class EmotionDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = EmotionDataset(train_encodings, train_labels.tolist())
val_dataset = EmotionDataset(val_encodings, val_labels.tolist())

# Model initialization
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=28)

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

# Define metric computation
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc}

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()

# Evaluate and print accuracy for both training and validation datasets
train_results = trainer.evaluate(train_dataset)
val_results = trainer.evaluate(val_dataset)
print(f"Training Accuracy: {train_results['eval_accuracy']:.4f}")
print(f"Validation Accuracy: {val_results['eval_accuracy']:.4f}")

# Load the test dataset for final predictions
test_data = pd.read_csv('test.csv')
test_texts = test_data['text'].tolist()
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=128)
test_dataset = EmotionDataset(test_encodings, [0]*len(test_texts))  # Dummy labels

# Predict on test data
predictions = trainer.predict(test_dataset)
predicted_labels = predictions.predictions.argmax(-1)

# Saving predictions to CSV for submission (use the ids from test.csv)
submission = pd.DataFrame({'id': test_data['id'], 'label': predicted_labels})
submission.to_csv('submission.csv', index=False)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
10,3.3352
20,3.3362
30,3.3198
40,3.2948
50,3.2709
60,3.2092
70,3.1587
80,3.0568
90,2.8627
100,2.6653


Training Accuracy: 0.8621
Validation Accuracy: 0.8010
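
The logged training loss starts at about 3.33, which is what cross-entropy gives when a model spreads its probability evenly over 28 classes. The sketch below checks that arithmetic in pure Python, assuming the Trainer's loss here is standard cross-entropy on softmax outputs; the helper names are our own.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_class):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[true_class])


num_labels = 28
uniform_logits = [0.0] * num_labels  # an untrained head that prefers no class
initial_loss = cross_entropy(uniform_logits, true_class=0)
print(round(initial_loss, 4))  # ln(28), about 3.3322
```

A first logged loss close to ln(28) suggests the classification head starts out roughly uniform, and the drop to 2.67 by step 100 shows it beginning to learn.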


<h3>2.2 Explanation in Words:</h3><p>
You need to answer the following questions in a markdown cell after this cell:

2.2.1 How much did you manage to improve performance on the test set? Did you beat "Zero Hero" in Kaggle? (Please include a screenshot of Kaggle Submission)

2.2.2 Please explain in detail how you achieved this and what you did specifically and why you tried this.

2.2.1: We improved performance on the test set by 0.6% and beat "Zero Hero" on Kaggle (see the screenshot in the PDF). 
    
2.2.2: 

To improve performance on the test set, we moved from our baseline logistic regression model, which gave us an accuracy of 70%, to a BERT-based approach. The main reason for the switch is that BERT represents each word in the context of the surrounding words, which is essential in emotion classification tasks.

Implementation Details and Strategy:
Our Part 1 preprocessing routine (lowercasing and removing punctuation and URLs with regular expressions) was built for the TF-IDF models. The DistilBERT cell above reads the text directly from train.csv and test.csv and passes it to the tokenizer without that routine; the uncased tokenizer lowercases the text itself. Both models are trained on the same source data, so their accuracies can still be compared.

We chose DistilBERT for our model due to its balance between efficiency and effectiveness. DistilBERT is a streamlined version of the more cumbersome BERT model that retains most of the original model's strengths but is more resource-efficient. This choice was ideal given our constraints on training time and computational resources.

For tokenization, we used DistilBertTokenizerFast, which converts the text into the input format DistilBERT expects, adding the necessary special tokens and managing sequence length through padding and truncation (here to at most 128 tokens). The model was fine-tuned with Hugging Face's Trainer API, which simplified the training process and made it easy to set training arguments such as the number of epochs, batch size, warmup steps, and weight decay.
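
To illustrate what padding and truncation do, here is a toy pure-Python sketch. It is not the Hugging Face tokenizer: the function name and the token ids are made up, the special-token ids are illustrative defaults, and it always pads to `max_length`, whereas `padding=True` in the real tokenizer pads to the longest sequence in the batch.

```python
def encode_fixed_length(token_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Wrap ids in [CLS] ... [SEP], truncate to max_length, then pad with pad_id."""
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    body = token_ids[: max_length - 2]          # truncation
    input_ids = [cls_id] + body + [sep_id]      # special tokens
    attention_mask = [1] * len(input_ids)       # 1 = real token
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,   # padding
        "attention_mask": attention_mask + [0] * n_pad,  # 0 = padding, ignored
    }


short = encode_fixed_length([7, 8, 9], max_length=8)
long = encode_fixed_length(list(range(1000, 1020)), max_length=8)
print(short)
print(long)
```

The attention mask is what lets the model ignore the padded positions, so short and long texts can share one batch.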

Evaluation and Results:
The compute_metrics function calculates the accuracy of the model so that we could measure its performance against our logistic regression baseline. After training, we evaluated the model on both the training set and the held-out validation set (10% of the training data), obtaining 0.8621 training accuracy and 0.8010 validation accuracy, to check that the model learns effectively and generalizes beyond the training data.
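
The accuracy computation itself is short: take the highest-scoring class for each example and compare it with the label. The sketch below does this in pure Python on made-up logits; the function names and toy values are our own and only mirror the idea, not the Trainer's internals.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Predict the top-scoring class per row and report the fraction correct."""
    preds = [argmax(row) for row in logits]
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return {"accuracy": correct / len(labels)}


toy_logits = [
    [2.0, 0.1, -1.0],  # predicts class 0
    [0.3, 0.2, 1.5],   # predicts class 2
    [0.0, 0.9, 0.4],   # predicts class 1
    [1.2, 1.1, 0.0],   # predicts class 0
]
toy_labels = [0, 2, 1, 1]
metrics = accuracy_from_logits(toy_logits, toy_labels)
print(metrics)
```

Three of the four toy predictions match their labels, giving an accuracy of 0.75.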

The final step was making predictions on the test dataset, which was tokenized in the same way as the training data. These predictions were formatted according to Kaggle's submission requirements and uploaded to the platform. Our submission achieved a 0.6% improvement over the baseline model and outperformed the "Zero Hero" benchmark on Kaggle. We attribute this improvement to BERT's ability to model the context in which words appear.

Conclusion:
This shift to a neural network model underscores the importance of contextual understanding in language processing tasks. By fine-tuning DistilBERT, a pretrained transformer, we improved our model's accuracy over the TF-IDF logistic regression baseline. The results on Kaggle supported our approach and show transfer learning applied successfully to emotion classification.

<h2>Part 3: Kaggle Submission</h2><p>
You need to generate a prediction CSV from your trained model using the following cell and submit the direct output of your code to Kaggle. The results should be presented in two columns in CSV format: the first column is the data id (0-14999) and the second column contains the predictions for the test set. The first column must be named id and the second column must be named label (otherwise your submission will fail). A sample prediction file can be downloaded from Kaggle for each problem. 
The cell below shows how to save a CSV file if you are running the notebook on Kaggle.

In [None]:
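# Placeholder example: replace `prediction` with your model's predicted labels
ids = range(15000)  # named `ids` to avoid shadowing the builtin `id`
prediction = range(15000)
submission = pd.DataFrame({'id': ids, 'label': prediction})
submission.to_csv('/kaggle/working/submission.csv', index=False)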
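
Before uploading, it can help to check that the file has the required format. The sketch below is a minimal pure-Python check using only the standard library; the helper names are hypothetical, and it writes to an in-memory buffer instead of `/kaggle/working/submission.csv` so that it runs anywhere.

```python
import csv
import io


def write_submission(ids, labels, handle):
    """Write the two required columns, id and label, as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["id", "label"])
    writer.writerows(zip(ids, labels))


def check_submission(handle, expected_rows):
    """Raise ValueError unless the header and the id column are as required."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != ["id", "label"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    ids = [int(row["id"]) for row in reader]
    if ids != list(range(expected_rows)):
        raise ValueError("ids must run from 0 to expected_rows - 1 in order")
    return True


buffer = io.StringIO()
write_submission(range(5), [3, 1, 0, 2, 2], buffer)
buffer.seek(0)
ok = check_submission(buffer, expected_rows=5)
print(ok)
```

For the real submission, `expected_rows` would be 15000 and the handle would be the opened CSV file.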

In [1]:
# We created the submission CSVs in our code above and have submitted them.

<h2>Part 4: Resources and Literature Used</h2><p>

Please cite the papers and open resources you used.