# 2nd

This part is worth 30% of your grade. Participate in the in-class Kaggle competition on emotion recognition on Twitter via this link: https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework.

![ranking](./pics/Kaggle_Competition_Ranking.png)

# 3rd

A report of your work developing the model for the competition (you can use code and comment on it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention different things you tried and insights you gained.

This competition aims to predict one of eight emotions (i.e. joy, anticipation, trust, sadness, disgust, fear, surprise, and anger) from the given text data.

First, I browsed through all the files we have and extracted some information that I think is useful for the prediction.

1. tweets_DM.json: 'hashtags', 'text', 'tweet_id'
2. data_identification.csv
3. emotion.csv

I selected the information above and merged it by 'tweet_id'. Based on data_identification.csv, I split the merged data into test and train sets.
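The extraction of 'hashtags', 'text', and 'tweet_id' from tweets_DM.json can be sketched as follows. This is a minimal sketch, assuming the file stores one JSON object per line with the tweet nested under `_source` and then `tweet`. That layout, the helper name `flatten_tweets`, and the sample rows are assumptions for illustration, so the keys should be adjusted to the real file.

```python
import io
import json

import pandas as pd


def flatten_tweets(lines):
    """Turn JSON lines into a DataFrame with tweet_id, hashtags, and text.

    Assumed (hypothetical) layout per line: {"_source": {"tweet": {...}}}.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)["_source"]["tweet"]
        rows.append({
            "tweet_id": tweet["tweet_id"],
            # Join the hashtag list into one space-separated string
            "hashtags": " ".join(tweet.get("hashtags", [])),
            "text": tweet["text"],
        })
    return pd.DataFrame(rows, columns=["tweet_id", "hashtags", "text"])


# Tiny in-memory sample standing in for tweets_DM.json
sample = io.StringIO(
    '{"_source": {"tweet": {"tweet_id": "0x1", "hashtags": ["happy"], "text": "Great day #happy"}}}\n'
    '{"_source": {"tweet": {"tweet_id": "0x2", "hashtags": [], "text": "Nothing special"}}}\n'
)
tweets_df = flatten_tweets(sample)
print(tweets_df)
```

For the real file, the same function could be called with an open file handle instead of the in-memory sample.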

## 1. Preprocessing

In the competition, I tried the different techniques we learned in class. First, I used BOW, TFIDF, and Word2Vec to process the raw data. After viewing the given data once again, I found that although the hashtags data is sparse, some of its contents are explicitly related to sentiment. As a result, I merged the 'hashtags' with the 'text' and generated two new files. With the new data with hashtags, I also tried combining the processed results with different models (they are illustrated later).

(Since I tried many combinations, I list only some of them below.)


To separate the training and testing data, I split them based on the label in data_identification.csv.

In [None]:
import pandas as pd

df1 = pd.read_csv('/mnt/sda/catherine/kaggle/emotion.csv')
df2 = pd.read_csv('/mnt/sda/catherine/kaggle/data_identification.csv')

merged_df = df1.merge(df2, on='tweet_id', how='outer')

merged_df.to_csv('merged_file.csv', index=False)

In [None]:
import pandas as pd

df_test = merged_df[merged_df['identification'] == 'test']
df_train = merged_df[merged_df['identification'] == 'train']

df_test.to_csv('test.csv', index=False)
df_train.to_csv('train.csv', index=False)

Originally, I wanted to match the 'hashtags' and 'text' in tweets_DM.json with the two .csv files (i.e. train.csv and test.csv) in one step, but the operation failed because the dataset is too large. Therefore, I split the data into many batches and processed each batch separately to match the data properly.
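The batch-wise matching can also be done while streaming the file, without writing intermediate chunk files. The following is a minimal sketch on toy data, assuming the tweets file has the columns 'tweet_id', 'hashtags', and 'text'. The in-memory CSV, the chunk size of 2, and the variable names are hypothetical.

```python
import io

import pandas as pd

# Toy stand-ins for organized_tweets.csv and train.csv (hypothetical contents)
tweets_csv = io.StringIO(
    "tweet_id,hashtags,text\n"
    "0x1,happy,Great day #happy\n"
    "0x2,,Nothing special\n"
    "0x3,sad,Bad news #sad\n"
    "0x4,,Just a test\n"
)
target_df = pd.DataFrame({"tweet_id": ["0x1", "0x3"], "emotion": ["joy", "sadness"]})

wanted_ids = set(target_df["tweet_id"])
matched_parts = []

# Read two rows at a time and keep only the rows whose tweet_id is needed
for chunk in pd.read_csv(tweets_csv, chunksize=2):
    matched_parts.append(chunk[chunk["tweet_id"].isin(wanted_ids)])

matched_df = pd.concat(matched_parts, ignore_index=True)

# One left merge attaches hashtags and text to every target row
train_with_text = target_df.merge(matched_df, on="tweet_id", how="left")
print(train_with_text)
```

Filtering each chunk before concatenating keeps only the rows that are needed in memory, and the single merge at the end avoids repeated joins.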

In [None]:
import os
import pandas as pd

# organized_tweets.csv is the .csv of tweets_DM.json with only three columns - 'tweet_id', 'hashtags', and 'text'
file_path = '/mnt/sda/catherine/kaggle/organized_tweets.csv'
output_dir = '/mnt/sda/catherine/kaggle/split_files/'

# Ensure the output directory exists
os.makedirs(output_dir, exist_ok=True)

# Define the chunk size
chunk_size = 10000

# Read the file in chunks and save each chunk as a separate CSV
for i, chunk in enumerate(pd.read_csv(file_path, chunksize=chunk_size, engine='python')):
    chunk.to_csv(f'{output_dir}chunk_{i}.csv', index=False)

In [None]:
# Train data
import os
import pandas as pd

target_df = pd.read_csv('/mnt/sda/catherine/kaggle/train.csv')
folder_path = '/mnt/sda/catherine/kaggle/split_files'

dataframes = []

for filename in os.listdir(folder_path):
    if filename.endswith('.csv'):
        file_path = os.path.join(folder_path, filename)
        chunk_df = pd.read_csv(file_path)

        dataframes.append(chunk_df)

# Stack the chunks and join once; merging chunk by chunk would create clashing '_new' columns
tweets_df = pd.concat(dataframes, ignore_index=True)
merged_train_df = target_df.merge(tweets_df[['tweet_id', 'hashtags', 'text']], on='tweet_id', how='left')

# Keep 'emotion' so the label is available when training the models
merged_train_df = merged_train_df[['tweet_id', 'hashtags', 'text', 'emotion']]
merged_train_df.to_csv('train_merged_output.csv', index=False)

In [None]:
# Test data
import os
import pandas as pd

target_df = pd.read_csv('/mnt/sda/catherine/kaggle/test.csv')
folder_path = '/mnt/sda/catherine/kaggle/split_files'

dataframes = []

for filename in os.listdir(folder_path):
    if filename.endswith('.csv'):
        file_path = os.path.join(folder_path, filename)
        chunk_df = pd.read_csv(file_path)

        dataframes.append(chunk_df)

# Stack the chunks and join once; merging chunk by chunk would create clashing '_new' columns
tweets_df = pd.concat(dataframes, ignore_index=True)
merged_test_df = target_df.merge(tweets_df[['tweet_id', 'hashtags', 'text']], on='tweet_id', how='left')

# The test rows have no label, so only the id and the text columns are kept
merged_test_df = merged_test_df[['tweet_id', 'hashtags', 'text']]
merged_test_df.to_csv('test_merged_output.csv', index=False)

Operations for merging the 'hashtags' with 'text'

In [None]:
import pandas as pd
import re

train_data = pd.read_csv('train_merged_output.csv')
test_data = pd.read_csv('test_merged_output.csv')

# Extract hashtags
train_data['hashtags'] = train_data['text'].apply(lambda x: re.findall(r'#\w+', x))
test_data['hashtags'] = test_data['text'].apply(lambda x: re.findall(r'#\w+', x))

# Fill "no_hashtags" into the row without hashtags
train_data['hashtags'] = train_data['hashtags'].apply(lambda x: x if len(x) > 0 else ['no_hashtags'])
test_data['hashtags'] = test_data['hashtags'].apply(lambda x: x if len(x) > 0 else ['no_hashtags'])

In [None]:
# Add 'hashtags' into 'text', except for rows marked with the ['no_hashtags'] placeholder
train_data['text_with_hashtags'] = train_data.apply(
    lambda row: row['text'] + ' ' + ' '.join(row['hashtags']) if row['hashtags'] != ['no_hashtags'] else row['text'], axis=1
)

test_data['text_with_hashtags'] = test_data.apply(
    lambda row: row['text'] + ' ' + ' '.join(row['hashtags']) if row['hashtags'] != ['no_hashtags'] else row['text'], axis=1
)
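The hashtag merging step can be checked on a couple of toy rows. This is a minimal sketch, not the exact notebook code: the toy DataFrame and the helper `append_hashtags` are hypothetical, and rows without hashtags are detected with an empty list instead of a placeholder.

```python
import re

import pandas as pd

toy = pd.DataFrame({"text": ["Great day #happy #sun", "Nothing special"]})

# Find every '#word' token in the text
toy["hashtags"] = toy["text"].apply(lambda x: re.findall(r"#\w+", x))


def append_hashtags(row):
    # Rows without hashtags keep their original text
    if not row["hashtags"]:
        return row["text"]
    return row["text"] + " " + " ".join(row["hashtags"])


toy["text_with_hashtags"] = toy.apply(append_hashtags, axis=1)
print(toy["text_with_hashtags"].tolist())
```

Because the hashtags are extracted from the text itself, appending them repeats tokens that are already present, which raises their counts in BOW and TFIDF features.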

Although I ran every trial on both text and text_with_hashtags, I only show the code for text_with_hashtags below.

## 2. Model

For the models, I tried decision tree, random forest, neural network, transformer, and RoBERTa.

When applying the different models, I ran into a serious problem: training was very slow, so it was hard for me to try more complex or bigger models. They made my computer crash 😢.

In my view, based on what I have learned in class, these models are suitable for the task. Only RoBERTa is a new model that I thought was worth trying.
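One common way to cope with slow training is to experiment on a smaller, stratified sample before running on the full data. The following is a minimal sketch of that idea, not something the experiments above used. The toy DataFrame, the 20% sample size, and the variable names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labelled data standing in for train_data (hypothetical contents)
toy_train = pd.DataFrame({
    "text_with_hashtags": [f"tweet {i}" for i in range(100)],
    "emotion": ["joy"] * 60 + ["sadness"] * 30 + ["anger"] * 10,
})

# Keep 20% of the rows while preserving the emotion proportions
subset, _ = train_test_split(
    toy_train,
    train_size=0.2,
    stratify=toy_train["emotion"],
    random_state=42,
)
print(subset["emotion"].value_counts())
```

Stratifying on the label keeps rare emotions represented in the sample, so a quick experiment still sees every class.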


### BOW + Decision Tree

In [None]:
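# Note: the CSV files do not store 'text_with_hashtags'; rebuild it with the
# hashtag-merging cells above after reloading
train_data = pd.read_csv('train_merged_output.csv')
test_data = pd.read_csv('test_merged_output.csv')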

- BOW

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
BOW_vectorizer = CountVectorizer()

# Learn a vocabulary dictionary of all tokens in the raw documents.
BOW_vectorizer.fit(train_data['text_with_hashtags'])

# Transform documents to document-term matrix.
train_data_BOW_features = BOW_vectorizer.transform(train_data['text_with_hashtags'])
test_data_BOW_features = BOW_vectorizer.transform(test_data['text_with_hashtags'])

Install the necessary libraries and modules

In [None]:
!pip install nltk

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')      
# build analyzers (bag-of-words)
BOW_500 = CountVectorizer(max_features=500, tokenizer=nltk.word_tokenize) 

# apply analyzer to training data
BOW_500.fit(train_data['text_with_hashtags'])

train_data_BOW_features_500 = BOW_500.transform(train_data['text_with_hashtags'])

- Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Prepare the training and testing data
X_train = BOW_500.transform(train_data['text_with_hashtags'])
y_train = train_data['emotion']
X_test = BOW_500.transform(test_data['text_with_hashtags'])

# Construct the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Train the model
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

test_data['emotion'] = y_pred

# Make the data fit the submission pattern
submission = test_data[['tweet_id', 'emotion']]
submission = submission.rename(columns={'tweet_id': 'id'})

# Save the result
submission.to_csv('submission.csv', index=False)

### BOW + Neural Network

In [None]:
from sklearn.neural_network import MLPClassifier

X_train = BOW_500.transform(train_data['text_with_hashtags'])
y_train = train_data['emotion']

X_test = BOW_500.transform(test_data['text_with_hashtags'])

# Construct the Neural Network Classifier
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42)

# Train the model
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

test_data['emotion'] = y_pred

# Make the data fit the submission pattern
final_result = test_data[['tweet_id', 'emotion']]
final_result = final_result.rename(columns={'tweet_id': 'id'})

# Save the result
final_result.to_csv('NN_test_data_with_hashtags.csv', index=False)

### TFIDF + Decision Tree

- TFIDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import nltk

# Generate an embedding using the TF-IDF vectorizer with 1000 features
TFIDF_vectorizer = TfidfVectorizer(max_features=1000, tokenizer=nltk.word_tokenize)

# Load the merged training and testing data
train_data = pd.read_csv('/mnt/sda/catherine/kaggle/train_merged_output.csv')
test_data = pd.read_csv('/mnt/sda/catherine/kaggle/test_merged_output.csv')

# Learn a vocabulary dictionary of all tokens in the column that is transformed below.
TFIDF_vectorizer.fit(train_data['text_with_hashtags'])

# Transform documents to document-term matrix.
train_data_TFIDF_features = TFIDF_vectorizer.transform(train_data['text_with_hashtags'])
test_data_TFIDF_features = TFIDF_vectorizer.transform(test_data['text_with_hashtags'])

- Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

X_train = TFIDF_vectorizer.transform(train_data['text_with_hashtags'])
y_train = train_data['emotion']

X_test = TFIDF_vectorizer.transform(test_data['text_with_hashtags']) 

# Construct the DecisionTree model
DT_model = DecisionTreeClassifier(random_state=1)

# Train the model
DT_model = DT_model.fit(X_train, y_train)
y_train_pred = DT_model.predict(X_train)

y_test_pred = DT_model.predict(X_test)

test_data['emotion'] = y_test_pred

# Make the data fit the submission pattern without renaming test_data itself,
# because later cells still expect the 'tweet_id' column
test_data_split = test_data.rename(columns={'tweet_id': 'id'})[['id', 'emotion']]

# Save the result
test_data_split.to_csv('/mnt/sda/catherine/kaggle/TFIDF_DT_submission.csv', index=False)

Since TFIDF + Decision Tree gave a better result than BOW + Decision Tree, I tried TFIDF + Random Forest in the hope of an even better result.
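A comparison like this can also be run locally on a hold-out split, without spending a Kaggle submission. The following is a minimal sketch under stated assumptions: the toy texts and labels are hypothetical, and accuracy is used only as a placeholder metric that should be swapped for the competition's own.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy labelled tweets (hypothetical); replace with the real training columns
texts = ["so happy today", "what a happy surprise", "feeling sad", "sad and alone",
         "happy happy joy", "very sad news", "happy times", "sad story"] * 5
labels = ["joy", "joy", "sadness", "sadness", "joy", "sadness", "joy", "sadness"] * 5

X_fit, X_val, y_fit, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

scores = {}
for name, vectorizer in [("BOW", CountVectorizer()), ("TFIDF", TfidfVectorizer())]:
    # The pipeline fits the vectorizer on the training part only
    model = make_pipeline(vectorizer, DecisionTreeClassifier(random_state=42))
    model.fit(X_fit, y_fit)
    scores[name] = accuracy_score(y_val, model.predict(X_val))

print(scores)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the validation rows out of the vocabulary fitting step.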

### TFIDF + Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Construct the Random Forest model
clf = RandomForestClassifier(n_jobs=-1, random_state=1)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

test_data['emotion'] = y_pred

# Make the data fit the submission pattern without renaming test_data itself,
# because later cells still expect the 'tweet_id' column
test_data_split = test_data.rename(columns={'tweet_id': 'id'})[['id', 'emotion']]

# Save the result
test_data_split.to_csv('/mnt/sda/catherine/kaggle/TFIDF_RF_submission.csv', index=False)

### TFIDF + NN

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from sklearn.neural_network import MLPClassifier

TFIDF_vectorizer = TfidfVectorizer(max_features=1000, tokenizer=nltk.word_tokenize)
TFIDF_vectorizer.fit(train_data['text_with_hashtags'])


train_data_TFIDF_features = TFIDF_vectorizer.transform(train_data['text_with_hashtags']) # training text
y_train = train_data['emotion']

test_data_TFIDF_features = TFIDF_vectorizer.transform(test_data['text_with_hashtags']) # testing text

# Construct the NN model
clf = MLPClassifier(hidden_layer_sizes=(200,), max_iter=300, random_state=42)

# Train the model
clf.fit(train_data_TFIDF_features, y_train)
y_pred = clf.predict(test_data_TFIDF_features)


test_data['emotion'] = y_pred

# Make the data fit the submission pattern
final_result = test_data[['tweet_id', 'emotion']]
final_result = final_result.rename(columns={'tweet_id': 'id'})

# Save the result
final_result.to_csv('TFIDF_NN_test_.csv', index=False)

Since TFIDF + NN worked better than TFIDF + Random Forest, I tried to use Word2Vec + NN in the hope of better performance.

### Word2Vec + NN

In [None]:
import pandas as pd
import nltk
from gensim.models import Word2Vec
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder
from nltk.tokenize import word_tokenize
import numpy as np

nltk.download('punkt')

def tokenize(text):
    return word_tokenize(text)

def build_word2vec(sentences, vector_size=100, min_count=1):
    tokenized_sentences = [tokenize(sentence) for sentence in sentences]
    model = Word2Vec(sentences=tokenized_sentences, vector_size=vector_size, min_count=min_count, window=5, sg=1, workers=4)
    return model

def sentence_to_avg_vector(sentence, model, vector_size):
    words = tokenize(sentence)
    word_vectors = [model.wv[word] for word in words if word in model.wv]
    if word_vectors: 
        return np.mean(word_vectors, axis=0)
    else:
        return np.zeros(vector_size)

# Construct the W2V model
sen = pd.concat([train_data['text_with_hashtags'], test_data['text_with_hashtags']])
word2vec_model = build_word2vec(sen, vector_size=100)

# Training Data (pass each sentence, not the whole `sen` Series)
X_train = np.array([sentence_to_avg_vector(sentence, word2vec_model, vector_size=100) for sentence in train_data['text_with_hashtags']])

# Transform the emotion label into number
y_train = LabelEncoder().fit_transform(train_data['emotion'])

# Testing Data (pass each sentence, not the whole `sen` Series)
X_test = np.array([sentence_to_avg_vector(sentence, word2vec_model, vector_size=100) for sentence in test_data['text_with_hashtags']])

# Construct the NN model
clf = MLPClassifier(hidden_layer_sizes=(200,), max_iter=300, random_state=42)

# Train the model
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Transform the result into the original label
emotion_labels = LabelEncoder().fit(train_data['emotion']).classes_
test_data['emotion'] = [emotion_labels[pred] for pred in y_pred]

# Make the data fit the submission pattern
final_result = test_data[['tweet_id', 'emotion']]
final_result = final_result.rename(columns={'tweet_id': 'id'})

# Save the result
final_result.to_csv('W2V_NN_test.csv', index=False)
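The sentence representation used here is the average of the word vectors in a tweet. The following is a minimal sketch of that averaging with NumPy only. The dictionary `toy_vectors` is a hypothetical stand-in for the trained Word2Vec vectors, and `average_vector` is an illustrative helper.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for word2vec_model.wv
toy_vectors = {
    "happy": np.array([1.0, 0.0, 0.0]),
    "day": np.array([0.0, 1.0, 0.0]),
}


def average_vector(tokens, vectors, vector_size):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if known:
        return np.mean(known, axis=0)
    return np.zeros(vector_size)


with_known = average_vector(["happy", "day", "unknownword"], toy_vectors, 3)
without_known = average_vector(["unknownword"], toy_vectors, 3)
print(with_known)     # mean of the 'happy' and 'day' vectors
print(without_known)  # all zeros
```

Averaging discards word order and weights every token equally, so two tweets with the same words in a different order get the same vector.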

### RoBERTa

Since the model ran very slowly, I used a server with more cores to run the training process. To keep track of which step the program was running, I set up logging.

In [None]:
import pandas as pd
import re
from transformers import RobertaTokenizer
from transformers import RobertaForSequenceClassification
from datasets import Dataset
from transformers import Trainer, TrainingArguments
import torch
import logging

""" Set the logging """
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')
logger = logging.getLogger(__name__)

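# Load the tokenizer once at import time; loading it inside tokenize_function
# would repeat the work for every batch passed in by Dataset.map
logger.info(" Load the RoBERTa tokenizer ")
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)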

def main():
    
    logger.info("Read the data")
    train_data = pd.read_csv('train_merged_output.csv')
    test_data = pd.read_csv('test_merged_output.csv')

    """ Extract 'hashtags' """
    logger.info(" Extract the hashtags ")
    train_data['hashtags'] = train_data['text'].apply(lambda x: re.findall(r'#\w+', x))
    test_data['hashtags'] = test_data['text'].apply(lambda x: re.findall(r'#\w+', x))

    """ Deal with the data without hashtag """
    train_data['hashtags'] = train_data['hashtags'].apply(lambda x: x if len(x) > 0 else ['no_hashtags'])
    test_data['hashtags'] = test_data['hashtags'].apply(lambda x: x if len(x) > 0 else ['no_hashtags'])

    """ Merge the hashtags with 'text' """
    train_data['text_with_hashtags'] = train_data.apply(
        lambda row: row['text'] + ' ' + ' '.join(row['hashtags']) if row['hashtags'] != ['no_hashtags'] else row['text'], axis=1
    )
    test_data['text_with_hashtags'] = test_data.apply(
        lambda row: row['text'] + ' ' + ' '.join(row['hashtags']) if row['hashtags'] != ['no_hashtags'] else row['text'], axis=1
    )

    logger.info(" Training Data ")
    # Use the merged column so the hashtags built above reach the model
    texts = train_data['text_with_hashtags'].tolist()
    labels = train_data['emotion'].tolist()

    """ Load the Roberta Model """
    logger.info(" Load the Roberta Model ")
    model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=8)

    """ Construct the Dataset """
    logger.info(" Construct the Dataset ")
    dataset = Dataset.from_dict({"text": texts, "label": labels})


    label_map = {
        "sadness": 0,
        "disgust": 1,
        "anticipation": 2,
        "joy": 3,
        "trust": 4,
        "anger": 5,
        "fear": 6,
        "surprise": 7
    }
    dataset = dataset.map(lambda examples: {"label": label_map[examples["label"]]})


    logger.info(" Tokenize the dataset ")
    tokenized_dataset = dataset.map(tokenize_function, batched=True)

    """ Training """
    logger.info("Setting the training parameters ")
    training_args = TrainingArguments(
        output_dir="./results",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-5,
        weight_decay=0.01
    )

    logger.info(" Define the trainer ")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset
    )

    logger.info(" Start training ")
    trainer.train()
    logger.info("finish")

if __name__ == '__main__':
    main()
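A rough estimate of the workload helps explain why this run is slow. The following is a minimal sketch that counts the optimizer updates and micro-batch passes implied by the batch size, gradient accumulation, and epoch settings above. The training set size of 1,000,000 is a hypothetical number, not the real size of the data, and the helper `optimizer_steps` gives only an approximation of what the `Trainer` reports.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, accumulation_steps, epochs, num_devices=1):
    """Estimate the number of optimizer updates for a training run."""
    effective_batch = per_device_batch_size * accumulation_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs


# num_examples is a hypothetical size, not the real training set size
steps = optimizer_steps(num_examples=1_000_000, per_device_batch_size=2,
                        accumulation_steps=8, epochs=3)

# One forward/backward pass is needed per micro-batch of 2 examples
forward_passes = math.ceil(1_000_000 / 2) * 3
print(steps, forward_passes)
```

Because `padding="max_length"` pads every tweet to the tokenizer's maximum length, each pass processes far more tokens than a tweet contains. Passing a smaller `max_length` to the tokenizer call is one common way to reduce that cost.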

## 3. Reflection

Throughout the competition, there are several things that I would like to share:

1. As I mentioned above, the training process took a lot of time. Even after I switched to connecting to the server from my machine, I still had to wait a long time.
2. I am not familiar with running RoBERTa or with how to make it faster (maybe I did something wrong while building the model), and the training was predicted to run for many days. After trying several modifications based on online resources, I eventually gave up on applying this model. Although I didn't successfully run the model and generate predictions, the process of learning to apply it was still valuable.
3. Since the training dataset is too large, I split the data into many batches and processed each batch separately.
4. As I tried the preprocessing techniques in sequence, the results got better. However, when I combined Word2Vec with a neural network, the result was not better than TFIDF with a neural network, although I had expected it to be. I think the reasons behind this are worth investigating.
5. Because of the time limit, one technique that I had no time to try is classification. I think that with pre-defined labels, it is likely to give a good prediction result.