##### ----- Large Language Model

這一部分我是想試試看LLM的效果會不會比較好，

那我使用的是比較小的model EleutherAI/pythia-70m-deduped，參數量只有70 million而已，

但發現數據量實在太大，訓練資料就有多達一百四十幾萬筆，

於實作過程發現跑一個epoch就要一百多分鐘，由於時間問題，我並沒有給予模型完整的訓練。

最後只有訓練兩個epoch(七個小時左右)，但發現loss偏高(三點多)，

所以我從train當中隨機抽幾個來驗證，發現錯誤率很高，

因此我並沒有把整個test跑完並上傳至kaggle。

以下程式碼內有較詳細的註解!

In [27]:
import os

import numpy as np
from tqdm import tqdm, trange
from torch.optim import AdamW

from torch.utils.data import DataLoader
import torch
import torch.optim as optim
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, get_linear_schedule_with_warmup
import re
import random
import matplotlib.pyplot as plt
from torch.nn import functional as F
from torch.utils.data import Dataset

##### ----- 讀取'train'和'test'內容

詳細流程

1. 從'train_df.pkl'和'test_df.pkl'讀取data

2. 考慮到使用datasets.load_dataset的方便性，把'train_df.pkl'和'test_df.pkl'存成'train_df.csv'和'test_df.csv'

3. 使用load_data載入train的data

4. 引入LLM，並添加special token

5. 使用aicup官方提供的function來將'text'與'label'及special token合併成固定形式(請看後續備註，控制輸出結果，方便後處理)，並搭配該model所提供的tokenizer並文本分詞並移除stop words

6. 訓練模型，隨後隨機選幾個train data觀察結果

備註：

<|endoftext|>

'文本內容'

\n\n####\n\n

'label'

<|END|>

In [1]:
import pandas as pd
import numpy as np

## load a pickle file
train_df = pd.read_pickle("train_df.pkl")

train_df.drop('identification', axis=1, inplace=True)
train_df

Unnamed: 0,tweet_id,text,emotion
0,0x376b20,"People who post ""add me on #Snapchat"" must be ...",anticipation
1,0x2d5350,"@brianklaas As we see, Trump is dangerous to #...",sadness
2,0x1cd5b0,Now ISSA is stalking Tasha 😂😂😂 <LH>,fear
3,0x1d755c,@RISKshow @TheKevinAllison Thx for the BEST TI...,joy
4,0x2c91a8,Still waiting on those supplies Liscus. <LH>,anticipation
...,...,...,...
1455558,0x321566,I'm SO HAPPY!!! #NoWonder the name of this sho...,joy
1455559,0x38959e,In every circumtance I'd like to be thankful t...,joy
1455560,0x2cbca6,there's currently two girls walking around the...,joy
1455561,0x24faed,"Ah, corporate life, where you can date <LH> us...",joy


In [2]:
test_df = pd.read_pickle("test_df.pkl")

test_df.drop('identification', axis=1, inplace=True)
test_df

Unnamed: 0,tweet_id,text
2,0x28b412,"Confident of your obedience, I write to you, k..."
4,0x2de201,"""Trust is not the same as faith. A friend is s..."
9,0x218443,When do you have enough ? When are you satisfi...
30,0x2939d5,"God woke you up, now chase the day #GodsPlan #..."
33,0x26289a,"In these tough times, who do YOU turn to as yo..."
...,...,...
1867525,0x2913b4,"""For this is the message that ye heard from th..."
1867529,0x2a980e,"""There is a lad here, which hath five barley l..."
1867530,0x316b80,When you buy the last 2 tickets remaining for ...
1867531,0x29d0cb,I swear all this hard work gone pay off one da...


In [3]:
train_df.to_csv("train_df.csv", index=False)
test_df.to_csv("test_df.csv", index=False)

In [16]:
from datasets import load_dataset, Features, Value

# dataset = load_dataset("csv", data_files = "train_df.csv")
dataset = load_dataset("csv",
                       data_files = "train_df.csv",
                       column_names = ['idx', 'content', 'label'],
                       skiprows=1)
dataset

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['idx', 'content', 'label'],
        num_rows: 1455563
    })
})

In [20]:
dataset['train']

Dataset({
    features: ['idx', 'content', 'label'],
    num_rows: 1455563
})

In [36]:
dataset['train'][10]

{'idx': '0x37a0a9',
 'content': 'You know you research butterflies when predictive text autocorrects "but" to "butterfly" #justgradstudentthings #ecology <LH>',
 'label': 'joy'}

In [22]:
from transformers import AutoTokenizer, AutoModelForCausalLM

plm = "EleutherAI/pythia-70m-deduped" #"EleutherAI/pythi a-70m-deduped"

bos = '<|endoftext|>'
eos = '<|END|>'
pad = '<p>'
sep ='\n\n####\n\n'

special_tokens_dict = {'eos_token': eos, 'bos_token': bos, 'pad_token': pad, 'sep_token': sep}

tokenizer = AutoTokenizer.from_pretrained(plm, revision="step3000")
tokenizer.padding_side = 'left'
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
print(f"{tokenizer.pad_token}: {tokenizer.pad_token_id}")

<p>: 50278


In [23]:
from torch.utils.data import DataLoader
from islab.aicup import collate_batch_with_prompt_template

train_data = list(dataset['train'])
train_dataloader = DataLoader(train_data, batch_size=4, shuffle=False, collate_fn=lambda batch: collate_batch_with_prompt_template(batch, tokenizer))
titer = iter(train_dataloader)
tks, labels, masks= next(titer)
print(tks.shape)
next(iter(titer))

'''
這部分將 train_dataloader 轉換為迭代器（titer），並使用 next(titer) 提取下一個資料批次。
批次包含三個部分：tks、labels 和 masks。tks 似乎是經過標記的輸入資料，而 labels 可能是相應的標籤。
print(tks.shape) 印出 tks 的形狀。
'''

torch.Size([4, 45])


'\n這部分將 train_dataloader 轉換為迭代器（titer），並使用 next(titer) 提取下一個資料批次。\n批次包含三個部分：tks、labels 和 masks。tks 似乎是經過標記的輸入資料，而 labels 可能是相應的標籤。\nprint(tks.shape) 印出 tks 的形狀。\n'

In [25]:
results = tokenizer(
    [f"People who post  must be dehydrated. Cuz man.... that\'s <LH> {sep} anticipation",
     f"This is a sentence {sep} PHI: NULL"],
    padding=True
)
print(results['attention_mask'][0])

print("---" * 30)
print(tokenizer.decode(results['input_ids'][0]))
print("---" * 30)
print(tokenizer.decode(results['input_ids'][1]))

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
------------------------------------------------------------------------------------------
People who post  must be dehydrated. Cuz man.... that's <LH> 

####

 anticipation
------------------------------------------------------------------------------------------
<p><p><p><p><p><p><p><p><p><p><p>This is a sentence 

####

 PHI: NULL


In [26]:
from islab.aicup import OpenDeidBatchSampler

BATCH_SIZE = 12

bucket_train_dataloader = DataLoader(train_data,
                                     batch_sampler=OpenDeidBatchSampler(train_data, BATCH_SIZE),
                                     collate_fn=lambda batch: collate_batch_with_prompt_template(batch, tokenizer),
                                     pin_memory=True)

In [28]:
from transformers import AutoConfig
# the model config to which we add the special tokens
config = AutoConfig.from_pretrained(plm,
                                    bos_token_id=tokenizer.bos_token_id,
                                    eos_token_id=tokenizer.eos_token_id,
                                    pad_token_id=tokenizer.pad_token_id,
                                    sep_token_id=tokenizer.sep_token_id,
                                    output_hidden_states=False)

model = AutoModelForCausalLM.from_pretrained(plm, revision="step3000", config=config)
model

config.json:   0%|          | 0.00/567 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/166M [00:00<?, ?B/s]

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 512)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-5): 6 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=512, out_features=1536, bias=True)
          (dense): Linear(in_features=512, out_features=512, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=512, out_features=2048, bias=True)
          (dense_4h_to_h): Linear(in_features=2048, out_features=512, bias=True)
          (a

In [29]:
from transformers import get_linear_schedule_with_warmup
from torch.optim import AdamW
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

EPOCHS = 10 # CHANGE TO THE NUMBER OF EPOCHS YOU WANT
optimizer = AdamW(model.parameters(),lr=3e-5) # YOU CAN ADJUST LEARNING RATE
# optimizer = AdamW(model.parameters(),lr=3e-5)

model.resize_token_embeddings(len(tokenizer))
model.to(device)

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50280, 512)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-5): 6 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=512, out_features=1536, bias=True)
          (dense): Linear(in_features=512, out_features=512, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=512, out_features=2048, bias=True)
          (dense_4h_to_h): Linear(in_features=2048, out_features=512, bias=True)
          (a

In [30]:
def set_torch_seed(seed = 0):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benckmark = False
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
set_torch_seed()

def read_file(path):
    with open(path , 'r' , encoding = 'utf-8-sig') as fr:
        return fr.readlines()

In [31]:
from tqdm import tqdm, trange

# 模型儲存資料夾名稱
model_name = "LLM_model"
# 模型儲存路徑
model_dir = f"{model_name}"

if not os.path.isdir(model_dir):
    os.mkdir(model_dir)
min_loss = 9999

global_step = 0
total_loss = 0

model.train()
for _ in trange(EPOCHS, desc="Epoch"):
    torch.cuda.empty_cache()
    
    model.train()
    total_loss = 0

    # Training loop
    predictions , true_labels = [], []

    for step, (seqs, labels, masks) in enumerate(bucket_train_dataloader):
        
        seqs = seqs.to(device)
        labels = labels.to(device)
        masks = masks.to(device)
        
        model.zero_grad()
        
        outputs = model(seqs, labels=labels, attention_mask=masks)
        
        logits = outputs.logits
        loss = outputs.loss
        loss = loss.mean()
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        
    avg_train_loss = total_loss / len(bucket_train_dataloader)
    print("Average train loss: {}".format(avg_train_loss))
    torch.save(model.state_dict(), os.path.join(model_dir , 'GPT_Finial.pt'))
    
    if avg_train_loss < min_loss:
        min_loss = avg_train_loss
        torch.save(model.state_dict(), os.path.join(model_dir , 'GPT_best.pt'))

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Average train loss: 3.2636140085880796


Epoch:  10%|█         | 1/10 [1:41:47<15:16:03, 6107.08s/it]

Average train loss: 3.0920540481265024


Epoch:  20%|██        | 2/10 [7:10:49<28:43:19, 12924.89s/it]


KeyboardInterrupt: 

In [37]:
model.load_state_dict(torch.load(os.path.join(model_name , 'GPT_best.pt')))
model = model.to(device)

def sample_text(model, tokenizer, text, n_words=20):
    model.eval()
    text = tokenizer.encode(text)
    inputs, past_key_values = torch.tensor([text]).to(device), None

    with torch.no_grad():
        for _ in range(n_words):
            out = model(inputs, past_key_values=past_key_values)
            logits = out.logits
            past_key_values = out.past_key_values
            log_probs = F.softmax(logits[:, -1], dim=-1)
            inputs = torch.multinomial(log_probs, 1)
            text.append(inputs.item())
            if tokenizer.decode(inputs.item()) == eos:
                break

    return tokenizer.decode(text)

text = "You know you research butterflies when predictive text autocorrects \"but\" to \"butterfly\" #justgradstudentthings #ecology <LH>"
print(sample_text(model, tokenizer, text=text , n_words=20))

You know you research butterflies when predictive text autocorrects "but" to "butterfly" #justgradstudentthings #ecology <LH>

####

trust <|END|>
