### Student Information
Name: 林宗翰

Student ID: 109062304

GitHub ID: freezexpert

Kaggle name: freezexpert

Kaggle private scoreboard snapshot:

![Snapshot](img/pic0.png)

---

### Instructions

1. First: __This part is worth 30% of your grade.__ Do the **take home** exercises in the DM2023-Lab2-master. You may need to copy some cells from the Lab notebook to this notebook. 


2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/t/09b1d0f3f8584d06848252277cb535f2) on Emotion Recognition on Twitter. Scoring is based on your place in the Private Leaderboard ranking: 
    - **Bottom 40%**: Get 20% of the 30% available for this section.

    - **Top 41% - 100%**: Get (60-x)/6 + 20 points, where x is your ranking in the leaderboard (e.g., if you rank 3rd, your score will be (60-3)/6 + 20 = 29.5% out of 30%).   
    Submit your last submission __BEFORE the deadline (Dec. 27th 11:59 pm, Wednesday)__. Make sure to take a screenshot of your position at the end of the competition, store it as `pic0.png` under the **img** folder of this repository, and rerun the cell **Student Information**.
    

3. Third: __This part is worth 30% of your grade.__ A report of your work developing the model for the competition (you can use code and comment it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention different things you tried and insights you gained. 


4. Fourth: __This part is worth 10% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook** and **add minimal comments where needed**.


Upload your files to your repository then submit the link to it on the corresponding e-learn assignment.

Make sure to commit and save your changes to your repository __BEFORE the deadline (Dec. 31st 11:59 pm, Sunday)__. 
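
As a quick check of the ranking formula in item 2, the arithmetic can be sketched in a few lines. The function name `leaderboard_points` is hypothetical, and the sketch only evaluates the stated formula; it does not model the percentage bands.

```python
def leaderboard_points(rank):
    """Points out of 30 from the stated formula (60 - x) / 6 + 20, where x is the rank."""
    return (60 - rank) / 6 + 20


# Rank 3 reproduces the example given in the instructions
print(leaderboard_points(3))  # 29.5
```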

In [None]:
### Begin Assignment Here
## separate testing dataset from training dataset
import json
import pandas as pd
json_filepath = './tweets_DM.json'
df_ident = pd.read_csv('./data_identification.csv')
tweets_list = []
with open(json_filepath, 'r') as file:
    for line in file:
        data = json.loads(line)
        tweets_list.append({
                'tweet_id': data['_source']['tweet']['tweet_id'],
                'text': data['_source']['tweet']['text']
            })

df_tweets = pd.DataFrame(tweets_list)
# print(df_tweets)
# print(df_ident)
# Split the identification table into test and train tweet IDs
df_test = df_ident[df_ident['identification'] == 'test']
df_train = df_ident[df_ident['identification'] == 'train']
df_emo = pd.read_csv('./emotion.csv')
# Inner joins keep only training tweets that have an emotion label
df_train_with_emotion = (
    df_tweets.merge(df_train, on='tweet_id', how='inner')
             .merge(df_emo, on='tweet_id', how='inner')
)
print(df_train_with_emotion)


In [None]:
def create_sample(df, n):
  df0 = df[df['emotion']=='anger'].sample(n=n)
  df1 = df[df['emotion']=='anticipation'].sample(n=n)
  df2 = df[df['emotion']=='disgust'].sample(n=n)
  df3 = df[df['emotion']=='fear'].sample(n=n)
  df4 = df[df['emotion']=='joy'].sample(n=n)
  df5 = df[df['emotion']=='sadness'].sample(n=n)
  df6 = df[df['emotion']=='surprise'].sample(n=n)
  df7 = df[df['emotion']=='trust'].sample(n=n)
  df_train_sample = pd.concat([df0, df1, df2, df3, df4, df5, df6, df7])
  return df_train_sample
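
The per-class sampling in `create_sample` could also be written with `groupby`, which avoids listing each emotion by hand. The sketch below is one possible variant, not the code used for the submission: the name `create_balanced_sample`, the `seed` parameter, and the guard for classes with fewer than `n` rows are all assumptions added for illustration.

```python
import pandas as pd


def create_balanced_sample(df, n, label_col="emotion", seed=42):
    """Draw up to n rows per class and return them shuffled (hypothetical helper)."""
    parts = [
        # min() guards against classes that have fewer than n rows
        group.sample(n=min(n, len(group)), random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)
```

Fixing the seed makes the sample reproducible between runs, which helps when comparing validation scores across experiments.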

In [None]:
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification, get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from tqdm import tqdm

In [None]:
label_encoder = LabelEncoder()
df_train_with_emotion['label'] = label_encoder.fit_transform(df_train_with_emotion['emotion'])
df_train_sample = create_sample(df_train_with_emotion, 30000)
num_classes = len(label_encoder.classes_)
print(num_classes)
# print(df_train_label)
X_train, X_val, y_train, y_val = train_test_split(df_train_sample, df_train_sample['label'], test_size=0.2, random_state=42)
# print(X_train, X_val, y_train, y_val)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# print(X_train_sample['text'].values)
# truncation=True keeps every sequence at max_length so the batch can be stacked into one tensor
train_encodings = tokenizer.batch_encode_plus(X_train['text'].tolist(), add_special_tokens=True, padding='max_length', truncation=True, max_length=256, return_tensors='pt')
val_encodings = tokenizer.batch_encode_plus(X_val['text'].tolist(), add_special_tokens=True, padding='max_length', truncation=True, max_length=256, return_tensors='pt')


In [None]:
train_input_ids = train_encodings['input_ids']
train_attention_mask = train_encodings['attention_mask']
train_labels = torch.tensor(X_train['label'].values)
val_input_ids = val_encodings['input_ids']
val_attention_mask = val_encodings['attention_mask']
val_labels = torch.tensor(X_val['label'].values)
# print(y_train)
train_dataset = TensorDataset(train_input_ids, train_attention_mask, train_labels)
val_dataset = TensorDataset(val_input_ids, val_attention_mask, val_labels)

In [None]:
torch.cuda.empty_cache()
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
epoch = 3
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_classes)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=len(train_loader)*epoch)
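
Because the scheduler is built with `num_training_steps=len(train_loader)*epoch`, it expects one `scheduler.step()` per batch. The sketch below reproduces the linear warmup and decay multiplier in plain Python, as I understand that schedule, to show how little the learning rate changes if the scheduler is stepped only once per epoch. The function name `linear_decay_multiplier` and the batch counts are hypothetical values chosen for illustration.

```python
def linear_decay_multiplier(step, total_steps, warmup_steps=0):
    """Learning-rate multiplier: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Hypothetical sizes: 1,000 batches per epoch for 3 epochs
batches_per_epoch = 1000
epochs = 3
total_steps = batches_per_epoch * epochs

# One scheduler step per batch: the multiplier decays to zero by the end of training
per_batch_final = linear_decay_multiplier(total_steps, total_steps)

# One scheduler step per epoch: only 3 steps are taken, so the rate barely moves
per_epoch_final = linear_decay_multiplier(epochs, total_steps)

print(per_batch_final)
print(per_epoch_final)
```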

In [None]:
for i in range(epoch):
    model.train()
    total_loss = 0.0
    with tqdm(total=len(train_loader), desc=f"Epoch {i + 1}/{epoch}") as pbar:
        for batch in train_loader:
            optimizer.zero_grad()
            input_ids = batch[0].to(device)
            attention_mask = batch[1].to(device)
            # Labels are already a tensor; cast to long for the classification loss
            labels = batch[2].long().to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            # Accumulate a Python float so the autograd graph is not retained
            total_loss += loss.item()
            loss.backward()
            optimizer.step()
            # The schedule spans len(train_loader) * epoch steps, so step once per batch
            scheduler.step()
            pbar.update(1)
            pbar.set_postfix(loss=f'{loss.item():.4f}')
    print(f"Epoch {i + 1} average loss: {total_loss / len(train_loader):.4f}")
torch.cuda.empty_cache()

In [None]:
model.eval()
predictions = []
true_labels = []
# Disable gradient tracking during evaluation to save memory and time
with torch.no_grad():
    with tqdm(total=len(val_loader)) as pbar:
        for batch in val_loader:
            input_ids = batch[0].to(device)
            attention_mask = batch[1].to(device)
            labels = batch[2].long().to(device)
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            predictions.extend(torch.argmax(logits, dim=1).cpu().numpy())
            true_labels.extend(labels.cpu().numpy())
            pbar.update(1)
print(accuracy_score(true_labels, predictions))
print(classification_report(true_labels, predictions))

In [None]:
print(predictions)
print(true_labels)

In [None]:
df_test_label = pd.merge(df_tweets, df_test, on='tweet_id', how='left')
df_test_label = df_test_label.dropna()

In [None]:

test_encodings = tokenizer(list(df_test_label['text']), truncation=True, padding=True, max_length=128, return_tensors='pt')
test_input_ids = test_encodings['input_ids']
test_attention_mask = test_encodings['attention_mask']
test_dataset = TensorDataset(test_input_ids, test_attention_mask)


In [None]:
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)
predicted_emotion = []
model.eval()
with torch.no_grad():
    with tqdm(total=len(test_loader)) as pbar:
        for batch in test_loader:
            input_ids = batch[0].to(device)
            attention_mask = batch[1].to(device)
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            predicted_emotion.extend(torch.argmax(logits, dim=1).cpu().numpy())
            pbar.update(1)
# print(predicted_emotion)

In [None]:
print(label_encoder.inverse_transform(predicted_emotion))
df_test_label['emotion'] = label_encoder.inverse_transform(predicted_emotion)
submission = pd.concat([df_test_label['tweet_id'], df_test_label['emotion']], axis = 1)
submission.columns = ['id', 'emotion']
submission.to_csv('submission.csv', index=False)