# Third Part Report 

### Introduction
For this competition I used [Hugging Face](https://huggingface.co/) Transformers for text classification. The **environment installation** follows [this guide](https://huggingface.co/docs/transformers/main/en/installation). The report is split into three parts: **data preprocessing, training, and evaluation (creating the submission CSV)**.

#### What is Hugging Face Transformer?
Hugging Face Transformers provides APIs and tools to **easily download and train state-of-the-art pretrained models**. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. These models support common tasks in different modalities.




### First Part: Data Preprocessing

Load each tweet record from tweets_DM.json (one JSON object per line) into a list called **data**.

In [1]:
import json

# tweets_DM.json holds one JSON object per line
data = []
# the with-block closes the file once all lines are read
with open('data/tweets_DM.json', 'r', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))
print(data[0])

{'_score': 391, '_index': 'hashtag_tweets', '_source': {'tweet': {'hashtags': ['Snapchat'], 'tweet_id': '0x376b20', 'text': 'People who post "add me on #Snapchat" must be dehydrated. Cuz man.... that\'s <LH>'}}, '_crawldate': '2015-05-23 11:42:47', '_type': 'tweets'}
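
As a sketch of the loading step, the records can also be parsed lazily, one line at a time, keeping only the two fields used later. The helper name `iter_tweets` and the two sample records are made up for illustration; only the `_source` / `tweet` nesting is taken from the record printed above.

```python
import io
import json


def iter_tweets(lines):
    """Yield (tweet_id, text) pairs from an iterable of JSON Lines records."""
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines instead of failing inside json.loads
            continue
        tweet = json.loads(line)["_source"]["tweet"]
        yield tweet["tweet_id"], tweet["text"]


# Two invented records with the same nesting as the real file.
sample = io.StringIO(
    '{"_source": {"tweet": {"tweet_id": "0x000001", "text": "hello <LH>", "hashtags": []}}}\n'
    "\n"
    '{"_source": {"tweet": {"tweet_id": "0x000002", "text": "world", "hashtags": ["x"]}}}\n'
)
rows = list(iter_tweets(sample))
print(rows)  # [('0x000001', 'hello <LH>'), ('0x000002', 'world')]
```

With the real file, the same generator could be fed the open file handle instead of the `StringIO` object.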


Take the **tweet_id** and **text** fields of each record to build a dataframe.

In [2]:
import pandas as pd
dic = {
    'tweet_id':[],
    'text':[]
}
for d in data:
    for key in dic.keys():
        dic[key].append(d['_source']['tweet'][key])

tweet_df = pd.DataFrame(dic)
tweet_df.head()

Unnamed: 0,tweet_id,text
0,0x376b20,"People who post ""add me on #Snapchat"" must be ..."
1,0x2d5350,"@brianklaas As we see, Trump is dangerous to #..."
2,0x28b412,"Confident of your obedience, I write to you, k..."
3,0x1cd5b0,Now ISSA is stalking Tasha 😂😂😂 <LH>
4,0x2de201,"""Trust is not the same as faith. A friend is s..."


Read the emotion labels from emotion.csv.

In [3]:
emo_df = pd.read_csv('data/emotion.csv')
emo_df.head()

Unnamed: 0,tweet_id,emotion
0,0x3140b1,sadness
1,0x368b73,disgust
2,0x296183,anticipation
3,0x2bd6e1,joy
4,0x2ee1dd,anticipation


Read the data identification file, which marks each tweet as train or test.

In [4]:
data_id_df = pd.read_csv('data/data_identification.csv')
data_id_df.head()

Unnamed: 0,tweet_id,identification
0,0x28cc61,test
1,0x29e452,train
2,0x2b3819,train
3,0x2db41f,test
4,0x2a2acc,train


Merge tweet_df and data_id_df on the tweet_id column.

In [5]:
merge_df = pd.merge(tweet_df, data_id_df, on = 'tweet_id')
merge_df.head()

Unnamed: 0,tweet_id,text,identification
0,0x376b20,"People who post ""add me on #Snapchat"" must be ...",train
1,0x2d5350,"@brianklaas As we see, Trump is dangerous to #...",train
2,0x28b412,"Confident of your obedience, I write to you, k...",test
3,0x1cd5b0,Now ISSA is stalking Tasha 😂😂😂 <LH>,train
4,0x2de201,"""Trust is not the same as faith. A friend is s...",test


Split the merged data into train and test sets.

In [6]:
# .copy() so that later column assignments do not act on a slice of merge_df
train_df = merge_df.loc[merge_df['identification'] == 'train'].copy()
test_df = merge_df.loc[merge_df['identification'] == 'test'].copy()
train_df.shape, test_df.shape

((1455563, 3), (411972, 3))
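
A minimal sketch of how the merge and split could be sanity-checked, using toy frames rather than the competition data. The `validate` and `indicator` arguments of `pd.merge` are used here as an optional check; the notebook itself uses a plain inner merge, and the toy ids below are invented.

```python
import pandas as pd

# Toy stand-ins for tweet_df and data_id_df (invented ids).
tweets = pd.DataFrame({"tweet_id": ["a", "b", "c"], "text": ["t1", "t2", "t3"]})
ident = pd.DataFrame(
    {"tweet_id": ["a", "b", "d"], "identification": ["train", "test", "train"]}
)

# validate raises if a tweet_id is duplicated on either side;
# indicator adds a _merge column telling where each row came from.
checked = pd.merge(
    tweets, ident, on="tweet_id", how="outer",
    validate="one_to_one", indicator=True,
)
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["tweet_id", "_merge"]])

# Keep only rows present in both frames, then split as in the notebook.
merged = checked[checked["_merge"] == "both"].drop(columns="_merge")
train = merged[merged["identification"] == "train"].copy()
test = merged[merged["identification"] == "test"].copy()
print(train.shape, test.shape)
```

If `unmatched` is empty on the real data, the inner merge used above drops nothing.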

Attach the emotion labels to the training data and encode them as integers.

In [7]:
from sklearn.preprocessing import LabelEncoder
train_df = pd.merge(train_df, emo_df, on = 'tweet_id')
emo_label_encoder = LabelEncoder()
train_df['emotion'] = emo_label_encoder.fit_transform(train_df['emotion'])
train_df.head()

Unnamed: 0,tweet_id,text,identification,emotion
0,0x376b20,"People who post ""add me on #Snapchat"" must be ...",train,1
1,0x2d5350,"@brianklaas As we see, Trump is dangerous to #...",train,5
2,0x1cd5b0,Now ISSA is stalking Tasha 😂😂😂 <LH>,train,3
3,0x1d755c,@RISKshow @TheKevinAllison Thx for the BEST TI...,train,4
4,0x2c91a8,Still waiting on those supplies Liscus. <LH>,train,1
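
The integer codes above depend on the order of the class names: `LabelEncoder` stores its classes in sorted order, so the code of a label is its position in the alphabetically sorted list. The plain-Python sketch below mirrors that mapping on the five labels shown in the emotion.csv preview; it is an illustration, not the encoder's implementation.

```python
# Labels taken from the emotion.csv preview above.
labels = ["sadness", "disgust", "anticipation", "joy", "anticipation"]

# Sorted unique class names; the index in this list is the integer code.
classes = sorted(set(labels))
label_to_id = {name: i for i, name in enumerate(classes)}
id_to_label = dict(enumerate(classes))

encoded = [label_to_id[name] for name in labels]
decoded = [id_to_label[i] for i in encoded]

print(classes)  # ['anticipation', 'disgust', 'joy', 'sadness']
print(encoded)  # [3, 1, 0, 2, 0]
```

In the notebook itself, `emo_label_encoder.classes_` holds the fitted class order and `emo_label_encoder.inverse_transform` maps codes back to names.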


In [9]:
!pip install emot



In [20]:
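import re
import pickle
from emot.emo_unicode import EMOJI_UNICODE  # not used below; the mapping comes from Emoji_Dict.p
from tqdm import tqdm
tqdm.pandas()

# Emoji_Dict.p maps description -> emoji; invert it to emoji -> description
with open('Emoji_Dict.p', 'rb') as fp:
    Emoji_Dict = pickle.load(fp)
Emoji_Dict = {v: k for k, v in Emoji_Dict.items()}

def convert_emojis(text):
    """Replace every emoji in text with its plain-word description."""
    for emot in Emoji_Dict:
        name = "_".join(Emoji_Dict[emot].replace(",", " ").replace(":", " ").split())
        # re.escape guards against emoji that contain regex metacharacters
        text = re.sub(re.escape(emot), name, text)
        # note: this also turns underscores already present in the tweet into spaces
        text = text.replace('_', " ")
    return text

train_df['text'] = train_df['text'].progress_apply(convert_emojis)
test_df['text'] = test_df['text'].progress_apply(convert_emojis)
train_df.head()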

Save the data to pickle files so they can be loaded in the training process.

In [None]:
import pickle
test_df.to_csv('data/test_df.csv')
test_df.head()
train_texts = train_df['text'].to_list()
test_texts = test_df['text'].to_list()
train_labels = train_df['emotion'].tolist()

with open('data/train_texts.pickle', 'wb') as f:
    pickle.dump(train_texts, f)
with open('data/test_texts.pickle', 'wb') as f:
    pickle.dump(test_texts, f)
with open('data/train_labels.pickle', 'wb') as f:
    pickle.dump(train_labels, f)

In [21]:
print(convert_emojis("Now ISSA is stalking Tasha 😂😂😂"))

Now ISSA is stalking Tasha face with tears of joyface with tears of joyface with tears of joy
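
The conversion above runs one substitution per emoji per tweet and, as the printed example shows, consecutive emoji end up fused ("joyface"). The sketch below is a possible variant, not the preprocessing used for the submitted model: it compiles all emoji into one pattern and pads each replacement with spaces. The two-entry mapping and its names are hypothetical stand-ins for the contents of Emoji_Dict.p. It also collapses all whitespace, including newlines, which the original does not.

```python
import re

# Hypothetical emoji -> description mapping (the real one is loaded from Emoji_Dict.p).
EMOJI_TO_NAME = {
    "😂": ":face_with_tears_of_joy:",
    "*\ufe0f\u20e3": ":keycap_asterisk:",  # contains '*', a regex metacharacter
}


def clean_name(name):
    """Turn ':face_with_tears_of_joy:' into 'face with tears of joy'."""
    return " ".join(
        name.replace(",", " ").replace(":", " ").replace("_", " ").split()
    )


# One alternation over all emoji, longest first so multi-codepoint emoji
# are matched before any shorter emoji that is a prefix of them.
PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOJI_TO_NAME, key=len, reverse=True))
)


def convert_emojis_spaced(text):
    # Pad each description with spaces, then collapse runs of whitespace.
    replaced = PATTERN.sub(
        lambda m: " " + clean_name(EMOJI_TO_NAME[m.group(0)]) + " ", text
    )
    return " ".join(replaced.split())


print(convert_emojis_spaced("Now ISSA is stalking Tasha 😂😂😂"))
# Now ISSA is stalking Tasha face with tears of joy face with tears of joy face with tears of joy
```

A model trained on text from the original `convert_emojis` should be evaluated with that same function, so a variant like this would only make sense if applied to both the train and test text before training.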


### Second Part: Training 



Define a new dataset class based on the torch Dataset type, and define **PATH** (if PATH is a checkpoint path, training resumes from that checkpoint's state).

In [None]:
import pickle
from transformers import AutoTokenizer
import torch
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

PATH = None
class TweetDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

Load the data saved in the preprocessing part.

In [10]:
with open('data/train_texts.pickle', 'rb') as f:
    train_texts = pickle.load(f)
with open('data/test_texts.pickle', 'rb') as f:
    test_texts = pickle.load(f)
with open('data/train_labels.pickle', 'rb') as f:
    train_labels = pickle.load(f)


I chose [RoBERTa](https://huggingface.co/roberta-base?text=The+goal+of+life+is+%3Cmask%3E.) as the transformer architecture and used a [pretrained model](https://huggingface.co/j-hartmann/emotion-english-distilroberta-base?text=Oh+wow.+I+didn%27t+know+that.) (a DistilRoBERTa checkpoint) published by Jochen Hartmann. The model was pretrained on **six English emotion text datasets** (MELD, Crowdflower, GoEmotions...). I load the tokenizer from the pretrained model checkpoint, encode the training text, and define the training dataset. Next, I load the pretrained model. Because it was trained on 7 labels while our task has 8, the argument **num_labels** is set to 8 and **ignore_mismatched_sizes** is set to True, so the 7-way classification head is replaced by a newly initialized 8-way head.


In [None]:
# Start from the pretrained checkpoint unless PATH points to a saved checkpoint
if PATH is None:
    tokenizer = AutoTokenizer.from_pretrained("j-hartmann/emotion-english-distilroberta-base")
else:
    tokenizer = AutoTokenizer.from_pretrained(PATH)
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
# note: the validation set is just the first 100 training examples, not held-out data
val_encoding = tokenizer(train_texts[:100], truncation=True, padding=True)
train_dataset = TweetDataset(train_encodings, train_labels)
val_dataset = TweetDataset(val_encoding, train_labels[:100])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
if PATH is None:
    # num_labels=8 with ignore_mismatched_sizes=True swaps in a fresh 8-way head
    model = AutoModelForSequenceClassification.from_pretrained("j-hartmann/emotion-english-distilroberta-base", num_labels=8, ignore_mismatched_sizes=True).to("cuda")
else:
    model = AutoModelForSequenceClassification.from_pretrained(PATH).to("cuda")

Define the training arguments and the trainer, setting the learning rate, batch size, weight decay, and so on. With **gradient_accumulation_steps** set to 16, gradients are accumulated over 16 batches before each weight update instead of updating after every batch. I set this argument because my GPU memory is too small for a larger batch size; accumulating 16 batches of 16 examples gives an effective batch size of 256.

In [None]:
training_args = TrainingArguments(
    output_dir="./results_v3",
    learning_rate=2.0e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=20,
    weight_decay=0.01,
    save_total_limit = 20,
    save_strategy = "steps",
    save_steps = 2500,
    gradient_accumulation_steps = 16
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
if PATH is None:
    trainer.train()
else:
    trainer.train(PATH)
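
To make the effect of these settings concrete, here is a rough back-of-the-envelope calculation. It assumes a single GPU and that all 1,455,563 training rows reported earlier carry a label; the exact step count the Trainer reports may differ slightly depending on how the last partial batch is handled.

```python
import math

num_train_examples = 1_455_563  # rows in train_df, from the shape printed earlier
per_device_batch = 16           # per_device_train_batch_size
grad_accum_steps = 16           # gradient_accumulation_steps
num_devices = 1                 # assumption: one GPU
save_steps = 2500
checkpoint_step = 85_000        # the checkpoint used later for evaluation

# Examples contributing to each weight update.
effective_batch = per_device_batch * grad_accum_steps * num_devices

# Approximate number of weight updates needed to see the training set once.
steps_per_epoch = math.ceil(num_train_examples / effective_batch)

# Examples processed between two saved checkpoints.
examples_per_checkpoint = save_steps * effective_batch

# Roughly how many epochs the evaluated checkpoint has been trained for.
epochs_at_checkpoint = checkpoint_step / steps_per_epoch

print(effective_batch)          # 256
print(steps_per_epoch)          # 5686
print(examples_per_checkpoint)  # 640000
print(round(epochs_at_checkpoint, 2))  # 14.95
```

Under these assumptions, checkpoint-85000 corresponds to a little under 15 of the 20 configured epochs.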


### Third Part: Evaluation

Read the test dataframe saved in the preprocessing part.

In [13]:
test_df = pd.read_csv("data/test_df.csv", lineterminator='\n')
test_df.head()

Unnamed: 0.1,Unnamed: 0,tweet_id,text,identification
0,2,0x28b412,"Confident of your obedience, I write to you, k...",test
1,4,0x2de201,"""Trust is not the same as faith. A friend is s...",test
2,9,0x218443,When do you have enough ? When are you satisfi...,test
3,30,0x2939d5,"God woke you up, now chase the day #GodsPlan #...",test
4,33,0x26289a,"In these tough times, who do YOU turn to as yo...",test


Define the checkpoint path saved during training, encode the text data, define the test dataset with dummy labels, and load the model and tokenizer from the checkpoint path.

In [None]:
CKPT_PATH = "results_v3/checkpoint-85000"
size = test_df.shape[0]
dummy_labels = size * [0]
test_texts = test_df['text'].to_list()
test_df['emotion'] = ""
emo_labels = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness',
       'surprise', 'trust']
model = AutoModelForSequenceClassification.from_pretrained(CKPT_PATH).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(CKPT_PATH)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
test_dataset = TweetDataset(test_encodings, dummy_labels)

Set the prediction arguments and use the trainer to predict the emotion labels.

In [None]:
import numpy as np
test_args = TrainingArguments(
    output_dir = './other',
    do_train = False,
    do_predict = True,
    per_device_eval_batch_size = 50,   
    dataloader_drop_last = False    
)

# init trainer
trainer = Trainer(
              model = model, 
              args = test_args)

test_results = trainer.predict(test_dataset)
test_results_list = list(np.argmax(test_results.predictions, axis=-1))
pred = list(map(lambda x : emo_labels[x], test_results_list))
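
The decoding step picks, for each tweet, the index of the largest of the eight logits and looks that index up in `emo_labels`. The plain-Python sketch below shows the same idea on two invented logit rows; the helper name `decode` is hypothetical. The list order must match the integer codes assigned during preprocessing (alphabetical).

```python
emo_labels = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness',
              'surprise', 'trust']


def decode(logit_rows):
    """Map each row of logits to the label with the highest score."""
    preds = []
    for row in logit_rows:
        if len(row) != len(emo_labels):
            raise ValueError("expected one logit per emotion label")
        best = max(range(len(row)), key=row.__getitem__)  # argmax
        preds.append(emo_labels[best])
    return preds


# Invented logits for two tweets.
fake_logits = [
    [0.1, 2.3, -1.0, 0.0, 0.5, 0.2, -0.3, 0.4],
    [-0.5, 0.0, 0.1, 0.2, 3.1, 0.3, 0.0, 1.0],
]
print(decode(fake_logits))  # ['anticipation', 'joy']
```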

Save the submission CSV.

In [None]:
ids = test_df["tweet_id"].tolist()
dic = {
    "id": ids,
    "emotion": pred
}
df = pd.DataFrame(dic)
df.to_csv("submission20221115_85000iter.csv", index=False)