# **Homework 7 - BERT (Question Answering)**

If you have any questions, feel free to email us at ntu-ml-2023spring-ta@googlegroups.com



Slide:    [Link](https://docs.google.com/presentation/d/15lGUmT8NpLGtoxRllRWCJyQEjhR1Idcei63YHsDckPE/edit#slide=id.g21fff4e9af6_0_13)　Kaggle: [Link](https://www.kaggle.com/competitions/ml2023spring-hw7/host/sandbox-submissions)　Data: [Link](https://drive.google.com/file/d/1YU9KZFhQqW92Lw9nNtuUPg0-8uyxluZ7/view?usp=sharing)




# Prerequisites

## Install packages

Documentation for the toolkit: 
*   https://huggingface.co/transformers/
*   https://huggingface.co/docs/accelerate/index



In [None]:
# You are allowed to change the version of transformers or use other toolkits
!pip install transformers==4.26.1
!pip install accelerate==0.16.0



# Kaggle (Fine-tuning)

## Task description
- Chinese Extractive Question Answering
  - Input: Paragraph + Question
  - Output: Answer

- Objective: Learn how to fine-tune a pretrained model on a downstream task using transformers

- Todo
    - Fine-tune a pretrained Chinese BERT model
    - Change hyperparameters (e.g. doc_stride)
    - Apply linear learning rate decay
    - Try other pretrained models
    - Improve preprocessing
    - Improve postprocessing
- Training tips
    - Automatic mixed precision
    - Gradient accumulation
    - Ensemble
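
As an illustration of the `doc_stride` hyperparameter from the Todo list, here is a minimal sketch of splitting a long paragraph into overlapping windows. It assumes that `doc_stride` means the distance between the start positions of two consecutive windows; the function name `split_into_windows` and the toy paragraph are hypothetical and not part of the homework code.

```python
def split_into_windows(tokens, max_len, doc_stride):
    """Split a token list into overlapping windows of at most max_len tokens.

    Consecutive windows start doc_stride tokens apart, so they overlap by
    max_len - doc_stride tokens whenever doc_stride < max_len.
    """
    if max_len <= 0 or doc_stride <= 0:
        raise ValueError("max_len and doc_stride must be positive")
    windows = []
    for start in range(0, len(tokens), doc_stride):
        windows.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the paragraph
    return windows


paragraph = list("ABCDEFGHIJ")  # stand-in for a tokenized paragraph
windows = split_into_windows(paragraph, max_len=4, doc_stride=2)
for start, window in windows:
    print(start, "".join(window))
```

A smaller `doc_stride` produces more windows and more overlap, which makes it less likely that an answer is cut at a window boundary, at the cost of more inference time.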

- Estimated training time (Tesla T4 with automatic mixed precision enabled)
    - Simple: 8 mins
    - Medium: 8 mins
    - Strong: 25 mins
    - Boss: 2 hrs
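
The "linear learning rate decay" item in the Todo list can be sketched in plain Python as a multiplier on the base learning rate. The shape below (linear warmup followed by linear decay to zero) is intended to follow what `get_linear_schedule_with_warmup` computes, but it is only an illustration: the function name `linear_lr_multiplier` and the step counts are assumptions.

```python
def linear_lr_multiplier(step, total_steps, warmup_steps=0):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear warmup from 0 to 1
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0 at total_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5       # illustrative base learning rate
total_steps = 1000   # hypothetical number of optimizer steps
warmup_steps = 10    # 1% of total_steps

for step in (0, 5, 10, 505, 1000):
    print(step, base_lr * linear_lr_multiplier(step, total_steps, warmup_steps))
```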
  

## Import Packages

In [None]:
import json
import numpy as np
import random
import torch
from torch.utils.data import DataLoader, Dataset 
from transformers import AdamW, BertForQuestionAnswering, BertTokenizerFast, get_linear_schedule_with_warmup

from tqdm.auto import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"

# Fix random seed for reproducibility
def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
same_seeds(42)

## Install Fengshenbang-LM




 

In [None]:
#REF: https://github.com/IDEA-CCNL/Fengshenbang-LM

In [None]:
!git clone https://github.com/IDEA-CCNL/Fengshenbang-LM.git
%cd Fengshenbang-LM
!pip install --editable ./

# Replace modeling_ubert.py with my modified version

In [None]:
# Replace modeling_ubert.py with your own version
# e.g. !cp your_code_file /kaggle/working/Fengshenbang-LM/fengshen/models/ubert/modeling_ubert.py
!cp /kaggle/input/ubert-for-t4x2/modeling_ubert.py /kaggle/working/Fengshenbang-LM/fengshen/models/ubert/modeling_ubert.py

## Read Data

- Training set: 26918 QA pairs
- Dev set: 2863 QA pairs
- Test set: 3524 QA pairs

- {train/dev/test}_questions:	
  - List of dicts with the following keys:
   - id (int)
   - paragraph_id (int)
   - question_text (string)
   - answer_text (string)
   - answer_start (int)
   - answer_end (int)
- {train/dev/test}_paragraphs: 
  - List of strings
  - paragraph_ids in questions correspond to indices in paragraphs
  - A paragraph may be used by several questions 
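
To make the relationship between questions and paragraphs concrete, the sketch below looks up the paragraph of each question and checks that the answer span matches `answer_text`. The toy data is made up, and the check assumes that `answer_end` is an inclusive character index; verify that against the real files before relying on it.

```python
# Toy data in the same shape as the homework JSON (contents are made up).
paragraphs = [
    "Taipei is the capital of Taiwan.",
    "BERT is a pretrained language model.",
]
questions = [
    {"id": 0, "paragraph_id": 0, "question_text": "What is the capital of Taiwan?",
     "answer_text": "Taipei", "answer_start": 0, "answer_end": 5},
    {"id": 1, "paragraph_id": 1, "question_text": "What is a pretrained language model?",
     "answer_text": "BERT", "answer_start": 0, "answer_end": 3},
]


def answer_span_matches(question, paragraphs):
    """Check that the character span in the paragraph equals answer_text."""
    paragraph = paragraphs[question["paragraph_id"]]
    # Assumption: answer_end is inclusive, hence the + 1 in the slice.
    span = paragraph[question["answer_start"]:question["answer_end"] + 1]
    return span == question["answer_text"]


mismatches = [q["id"] for q in questions if not answer_span_matches(q, paragraphs)]
print(mismatches)
```

Running the same check over `train_questions` and `train_paragraphs` is a cheap way to catch off-by-one errors before building model inputs.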

In [None]:
def read_data(file):
    with open(file, 'r', encoding="utf-8") as reader:
        data = json.load(reader)
    return data["questions"], data["paragraphs"]

# Change the path of the dataset
train_questions, train_paragraphs = read_data("/kaggle/input/2023-ml-hw7-question-answering/hw7_train.json")
dev_questions, dev_paragraphs = read_data("/kaggle/input/2023-ml-hw7-question-answering/hw7_dev.json")
test_questions, test_paragraphs = read_data("/kaggle/input/2023-ml-hw7-question-answering/hw7_test.json")

In [None]:
# Keep only the dev questions whose text does not also appear in the training set
train_question_texts = {q["question_text"] for q in train_questions}
dev_questions = [q for q in dev_questions if q["question_text"] not in train_question_texts]

## Train_data

In [None]:
train_data = []
for i in range(len(train_questions)):
    data0 = {}
    entity0 = {}
    data0["task_type"] = "抽取任务"
    data0["subtask_type"] = "抽取式阅读理解"
    data0["text"] = train_paragraphs[train_questions[i]['paragraph_id']]
    entity0["entity_type"] = train_questions[i]['question_text']
    entity0["label"] = 0
    entity0["entity_list"] = [{
        "entity_name": train_questions[i]['answer_text'],
        "entity_idx": [
            [train_questions[i]['answer_start'], train_questions[i]['answer_end']]
        ]
    }]
    data0["choices"] = [entity0]
    data0["id"] = i
    train_data.append(data0)

## Dev_data

In [None]:
dev_data = []
for i in range(len(dev_questions)):
    data0 = {}
    entity0 = {}
    data0["task_type"] = "抽取任务"
    data0["subtask_type"] = "抽取式阅读理解"
    data0["text"] = dev_paragraphs[dev_questions[i]['paragraph_id']]
    entity0["entity_type"] = dev_questions[i]['question_text']
    entity0["label"] = 0
    entity0["entity_list"] = [{
        "entity_name": dev_questions[i]['answer_text'],
        "entity_idx": [
            [dev_questions[i]['answer_start'], dev_questions[i]['answer_end']]
        ]
    }]
    data0["choices"] = [entity0]
    data0["id"] = i
    dev_data.append(data0)

## Test_data

In [None]:
test_data = []
for i in range(len(test_questions)):
    data0 = {}
    entity0 = {}
    data0["task_type"] = "抽取任务"
    data0["subtask_type"] = "抽取式阅读理解"
    data0["text"] = test_paragraphs[test_questions[i]['paragraph_id']]
    entity0["entity_type"] = test_questions[i]['question_text']
    entity0["label"] = 0
    entity0["entity_list"] = []
    data0["choices"] = [entity0]
    data0["id"] = i
    test_data.append(data0)
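
The three conversion loops above differ only in whether the answer span is attached. A minimal sketch of a shared helper follows; `build_ubert_sample` and `build_ubert_dataset` are hypothetical names, and the dict layout is copied from the cells above rather than from the Fengshenbang-LM documentation.

```python
def build_ubert_sample(index, question, paragraphs, with_answer=True):
    """Build one sample in the dict layout used by the conversion loops."""
    choice = {
        "entity_type": question["question_text"],
        "label": 0,
        "entity_list": [],
    }
    if with_answer:
        # Train/dev samples carry the gold answer span; test samples do not.
        choice["entity_list"].append({
            "entity_name": question["answer_text"],
            "entity_idx": [[question["answer_start"], question["answer_end"]]],
        })
    return {
        "task_type": "抽取任务",           # "extraction task"
        "subtask_type": "抽取式阅读理解",  # "extractive reading comprehension"
        "text": paragraphs[question["paragraph_id"]],
        "choices": [choice],
        "id": index,
    }


def build_ubert_dataset(questions, paragraphs, with_answer=True):
    """Convert a list of questions into a list of samples."""
    return [
        build_ubert_sample(i, q, paragraphs, with_answer)
        for i, q in enumerate(questions)
    ]


# Toy example (made-up data)
toy_paragraphs = ["Taipei is the capital of Taiwan."]
toy_questions = [{
    "id": 7, "paragraph_id": 0, "question_text": "What is the capital of Taiwan?",
    "answer_text": "Taipei", "answer_start": 0, "answer_end": 5,
}]
train_like = build_ubert_dataset(toy_questions, toy_paragraphs)
test_like = build_ubert_dataset(toy_questions, toy_paragraphs, with_answer=False)
print(train_like[0]["choices"][0]["entity_list"])
print(test_like[0]["choices"][0]["entity_list"])
```

With such a helper, the test set would be built with `with_answer=False` and the train and dev sets with the default.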

## Main

In [None]:
!pip install pytorch-lightning==1.9.0
import pytorch_lightning as pl
print(pl.__version__)

In [None]:
#REF: https://github.com/IDEA-CCNL/Fengshenbang-LM

In [None]:
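class args:
    pretrained_model_path = 'IDEA-CCNL/Erlangshen-Ubert-330M-Chinese'  # path of the pretrained model (the default)
    load_checkpoints_path = ""       # checkpoint to load; set this after fine-tuning to load the model for prediction
    batchsize = 1                    # batch size, default 8
    monitor = "train_span_acc"       # metric monitored when saving checkpoints, e.g. val_span_acc
    checkpoint_path = "./checkpoint" # where checkpoints are saved, default ./checkpoint
    save_top_k = 3                   # maximum number of checkpoints to keep, default 3
    every_n_train_steps = 100        # save a checkpoint every n training steps, default 100
    learning_rate = 2e-5             # learning rate, default 2e-5
    weight_decay = 0.1
    warmup = 0.01                    # warmup ratio, default 0.01
    default_root_dir = "/kaggle/working/"  # default output path for model logs
    gradient_clip_val = 0.25         # gradient clipping, default 0.25
    accelerator = 'gpu'
    devices = 1                      # number of GPUs
    check_val_every_n_epoch = 1      # run validation every n epochs, default 100
    max_epochs = 2                   # number of epochs, default 5
    max_length = 512                 # maximum sequence length, default 512
    num_labels = 10                  # max labels per training sample; beyond that, negative samples are drawn at random; default 10
    mode = "min"                     # "min" keeps checkpoints with the lowest monitored value; "max" is the usual choice for an accuracy metric
    save_weights_only = True
    filename = 'model-{epoch:02d}-{train_loss:.4f}'
    threshold = 0
    precision = 16
    accumulate_grad_batches = 8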
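
With `batchsize = 1` and `accumulate_grad_batches = 8`, gradients from 8 samples are accumulated before each optimizer step. The arithmetic below is a rough sketch of the resulting effective batch size and step count. It assumes one training sample per QA pair and that a final partial group still triggers an optimizer step, so treat the step count as an estimate.

```python
batchsize = 1
accumulate_grad_batches = 8
devices = 1
max_epochs = 2
train_examples = 26918  # training set size stated in "Read Data"

# Number of samples that contribute to one optimizer step
effective_batch_size = batchsize * accumulate_grad_batches * devices

# Ceiling division: a final, smaller group is counted as one more step
steps_per_epoch = -(-train_examples // effective_batch_size)
total_optimizer_steps = steps_per_epoch * max_epochs

print(effective_batch_size, steps_per_epoch, total_optimizer_steps)
```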

In [None]:
import argparse
from fengshen import UbertPipelines

model = UbertPipelines(args)

## Testing

In [None]:
result = model.predict(test_data)

# Keep the first predicted span and its score for every test question
result_post = []
for item in result:
    top_entity = item["choices"][0]["entity_list"][0]
    result_post.append([top_entity["entity_name"], top_entity["score"]])
print(len(result_post))
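
Indexing `entity_list[0]` raises an `IndexError` if the model returns no span for a question. The sketch below is a more defensive variant, run on mock predictions. It assumes only the nested layout used in the cell above (`choices`, `entity_list`, `entity_name`, `score`); the helper name `best_answer` and the empty-string fallback are hypothetical choices.

```python
def best_answer(prediction, default=""):
    """Return (entity_name, score) of the highest-scoring span.

    Falls back to (default, 0.0) when no span was predicted.
    """
    entities = prediction["choices"][0]["entity_list"]
    if not entities:
        return default, 0.0
    top = max(entities, key=lambda entity: entity["score"])
    return top["entity_name"], top["score"]


# Mock predictions in the layout read by the cell above (values are made up)
mock_result = [
    {"choices": [{"entity_list": [{"entity_name": "A", "score": 0.4},
                                  {"entity_name": "B", "score": 0.9}]}]},
    {"choices": [{"entity_list": []}]},
]
answers = [best_answer(prediction) for prediction in mock_result]
print(answers)
```

Taking the maximum by `score` instead of the first element does not depend on the order in which spans are returned.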

# Download result_model1.csv for later processing

In [None]:
result_file = "result_model1.csv"
%cd /kaggle/working/
with open(result_file, 'w', encoding="utf-8") as f:
    f.write("ID,Answer,Score\n")
    for i, test_question in enumerate(test_questions):
        # Remove commas from answers, since the CSV is comma-separated
        # (answers on Kaggle are processed in the same way)
        f.write(f"{test_question['id']},{result_post[i][0].replace(',','')},{result_post[i][1]}\n")

print(f"Completed! Raw result (with scores) is in {result_file}")
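
One possible use of the per-answer scores is a simple ensemble, as listed under the training tips: for each question, keep the answer from whichever model is most confident. The sketch below is an assumption about that later processing, not the author's actual procedure. The function names and the in-memory CSV contents are hypothetical, and scores from different models are not necessarily comparable.

```python
import csv
import io


def read_scored_answers(handle):
    """Map ID -> (answer, score) from a CSV with the header ID,Answer,Score."""
    return {
        row["ID"]: (row["Answer"], float(row["Score"]))
        for row in csv.DictReader(handle)
    }


def ensemble_by_score(tables):
    """For every ID, keep the answer with the highest score across all tables."""
    merged = {}
    for table in tables:
        for question_id, (answer, score) in table.items():
            if question_id not in merged or score > merged[question_id][1]:
                merged[question_id] = (answer, score)
    return merged


# In-memory stand-ins for two result files (made-up values)
model1 = io.StringIO("ID,Answer,Score\n0,A1,0.9\n1,B1,0.2\n")
model2 = io.StringIO("ID,Answer,Score\n0,A2,0.5\n1,B2,0.7\n")

merged = ensemble_by_score([read_scored_answers(model1), read_scored_answers(model2)])
for question_id, (answer, score) in merged.items():
    print(question_id, answer, score)
```

With real files, each `io.StringIO` would be replaced by an opened CSV such as `result_model1.csv`.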

In [None]:
result_file = "result.csv"
%cd /kaggle/working/
with open(result_file, 'w', encoding="utf-8") as f:
    f.write("ID,Answer\n")
    for i, test_question in enumerate(test_questions):
        # Remove commas from answers, since the CSV is comma-separated
        # (answers on Kaggle are processed in the same way)
        f.write(f"{test_question['id']},{result_post[i][0].replace(',','')}\n")

print(f"Completed! Result is in {result_file}")