# **Homework 7 - BERT (Question Answering)**

If you have any questions, feel free to email us at ntu-ml-2021spring-ta@googlegroups.com



Slide:    [Link](https://docs.google.com/presentation/d/1aQoWogAQo_xVJvMQMrGaYiWzuyfO0QyLLAhiMwFyS2w)　Kaggle: [Link](https://www.kaggle.com/c/ml2021-spring-hw7)　Data: [Link](https://drive.google.com/uc?id=1znKmX08v9Fygp-dgwo7BKiLIf2qL1FH1)




## Task description
- Chinese Extractive Question Answering
  - Input: Paragraph + Question
  - Output: Answer

- Objective: Learn how to fine-tune a pretrained model on a downstream task using transformers

- Todo
    - Fine-tune a pretrained Chinese BERT model
    - Change hyperparameters (e.g. doc_stride)
    - Apply linear learning rate decay
    - Try other pretrained models
    - Improve preprocessing
    - Improve postprocessing
- Training tips
    - Automatic mixed precision
    - Gradient accumulation
    - Ensemble

- Estimated training time (Tesla T4 with automatic mixed precision enabled)
    - Simple: 8mins
    - Medium: 8mins
    - Strong: 25mins
    - Boss: 2hrs
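Of the training tips listed above, gradient accumulation is the easiest to get subtly wrong. The sketch below is a toy, framework-free illustration, not the notebook's training loop: for a one-parameter least-squares model it sums the gradients of `accum_iter` micro-batches, each scaled by `1 / accum_iter`, and checks that the result matches the gradient of one large batch. The names `grad_mse` and `accumulated_grad` and the data are made up for this example, and equal-sized micro-batches are assumed.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, accum_iter):
    """Accumulate gradients over `accum_iter` equal-sized micro-batches.

    Each micro-batch gradient is divided by `accum_iter`, which mirrors
    dividing the loss by `accum_iter` before calling backward().
    """
    assert len(xs) % accum_iter == 0, "toy example assumes equal-sized micro-batches"
    micro = len(xs) // accum_iter
    total = 0.0
    for k in range(accum_iter):
        chunk = slice(k * micro, (k + 1) * micro)
        total += grad_mse(w, xs[chunk], ys[chunk]) / accum_iter
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

full = grad_mse(w, xs, ys)
accum = accumulated_grad(w, xs, ys, accum_iter=4)
# True when the accumulated gradient matches the full-batch gradient
print(abs(full - accum) < 1e-9)
```

In a real training loop the same idea means calling the optimizer step only once every `accum_iter` batches, so a batch size of 16 with `accum_iter = 4` behaves roughly like a batch size of 64.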
  

## Download Dataset

In [None]:
import torch
try:
    # Get GPU name, check if it's K80
    GPU_name = torch.cuda.get_device_name()
    if GPU_name[-3:] == "K80":
        print("Get K80! :'( RESTART!")
        exit()  # Restart the session
    else:
        print("Your GPU is {}!".format(GPU_name))
        print("Great! Keep going~")
except RuntimeError as e:
    if e.args == ("No CUDA GPUs are available",):
        print("You are training with CPU! "
              "Please restart!")
        exit()  # Restart the session
    else:
        print("What's wrong here?")
        print("Error message: \n", e)

Your GPU is Tesla T4!
Great! Keep going~


In [None]:
# Download link 1
!gdown --id '1znKmX08v9Fygp-dgwo7BKiLIf2qL1FH1' --output hw7_data.zip

# Download Link 2 (if the above link fails) 
# !gdown --id '1pOu3FdPdvzielUZyggeD7KDnVy9iW1uC' --output hw7_data.zip

!unzip -o hw7_data.zip

# For this HW, K80 < P4 < T4 < P100 <= T4(fp16) < V100
!nvidia-smi

Downloading...
From: https://drive.google.com/uc?id=1znKmX08v9Fygp-dgwo7BKiLIf2qL1FH1
To: /content/hw7_data.zip
0.00B [00:00, ?B/s]7.71MB [00:00, 122MB/s]
Archive:  hw7_data.zip
  inflating: hw7_dev.json            
  inflating: hw7_test.json           
  inflating: hw7_train.json          
Sat May 22 07:45:08 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   71C    P0    33W /  70W |  13192MiB / 15109MiB |      0%      Default |
|                               | 

## Install transformers

Documentation for the toolkit:　https://huggingface.co/transformers/

In [None]:
# You are allowed to change version of transformers or use other toolkits
!pip install transformers==4.5.0



## Import Packages

In [None]:
import json
import numpy as np
import random
import torch
from torch.utils.data import DataLoader, Dataset 
from transformers import AdamW, BertForQuestionAnswering, BertTokenizerFast

from tqdm.auto import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"

# Fix random seed for reproducibility
def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
same_seeds(0)

In [None]:
# Change "fp16_training" to True to enable automatic mixed precision training (fp16)
fp16_training = True

if fp16_training:
    !pip install accelerate==0.2.0
    from accelerate import Accelerator
    accelerator = Accelerator(fp16=True)
    device = accelerator.device

# Documentation for the toolkit:  https://huggingface.co/docs/accelerate/



## Load Model and Tokenizer




 

In [None]:
### Baseline model: 'bert-base-chinese' ###
# model = BertForQuestionAnswering.from_pretrained("bert-base-chinese").to(device)
# tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")


### Try different model ###
# model_name = 'hfl/chinese-roberta-wwm-ext'
model_name = 'hfl/chinese-roberta-wwm-ext-large'
model = BertForQuestionAnswering.from_pretrained(model_name).to(device)
tokenizer = BertTokenizerFast.from_pretrained(model_name)


# You can safely ignore the warning message (it pops up because new prediction heads for QA are initialized randomly)

Some weights of the model checkpoint at hfl/chinese-roberta-wwm-ext-large were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at hfl

## Read Data

- Training set: 26935 QA pairs
- Dev set: 3523 QA pairs
- Test set: 3492 QA pairs

- {train/dev/test}_questions:
  - List of dicts with the following keys:
    - id (int)
    - paragraph_id (int)
    - question_text (string)
    - answer_text (string)
    - answer_start (int)
    - answer_end (int)
- {train/dev/test}_paragraphs:
  - List of strings
  - Each paragraph_id in questions is an index into paragraphs
  - A paragraph may be used by several questions

In [None]:
def read_data(file):
    with open(file, 'r', encoding="utf-8") as reader:
        data = json.load(reader)
    return data["questions"], data["paragraphs"]

train_questions, train_paragraphs = read_data("hw7_train.json")
dev_questions, dev_paragraphs = read_data("hw7_dev.json")
test_questions, test_paragraphs = read_data("hw7_test.json")

## Tokenize Data

In [None]:
# Tokenize questions and paragraphs separately
# 「add_special_tokens」 is set to False since special tokens will be added when tokenized questions and paragraphs are combined in the dataset's __getitem__

train_questions_tokenized = tokenizer([train_question["question_text"] for train_question in train_questions], add_special_tokens=False)
dev_questions_tokenized = tokenizer([dev_question["question_text"] for dev_question in dev_questions], add_special_tokens=False)
test_questions_tokenized = tokenizer([test_question["question_text"] for test_question in test_questions], add_special_tokens=False) 

train_paragraphs_tokenized = tokenizer(train_paragraphs, add_special_tokens=False)
dev_paragraphs_tokenized = tokenizer(dev_paragraphs, add_special_tokens=False)
test_paragraphs_tokenized = tokenizer(test_paragraphs, add_special_tokens=False)

# You can safely ignore the warning message, as tokenized sequences will be further processed in the dataset's __getitem__ before being passed to the model

In [None]:
for test_question in test_questions:
  print(test_question)
  break

{'id': 0, 'paragraph_id': 792, 'question_text': '士官長的頭盔上會有何裝飾物?', 'answer_text': None, 'answer_start': None, 'answer_end': None}


## Dataset and Dataloader

In [None]:
import random
class QA_Dataset(Dataset):
    def __init__(self, split, questions, tokenized_questions, tokenized_paragraphs):
        self.split = split
        self.questions = questions
        self.tokenized_questions = tokenized_questions
        self.tokenized_paragraphs = tokenized_paragraphs
        self.max_question_len = 40
        self.max_paragraph_len = 150
        
        ##### TODO: Change value of doc_stride #####
        self.doc_stride = 75   # original setting: 150

        # Input sequence length = [CLS] + question + [SEP] + paragraph + [SEP]
        self.max_seq_len = 1 + self.max_question_len + 1 + self.max_paragraph_len + 1

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        question = self.questions[idx]
        tokenized_question = self.tokenized_questions[idx]
        tokenized_paragraph = self.tokenized_paragraphs[question["paragraph_id"]]

        ##### TODO: Preprocessing #####
        # Hint: How to prevent model from learning something it should not learn

        if self.split == "train":
            # Convert answer's start/end positions in paragraph_text to start/end positions in tokenized_paragraph  
            answer_start_token = tokenized_paragraph.char_to_token(question["answer_start"])
            answer_end_token = tokenized_paragraph.char_to_token(question["answer_end"])

            # A single window is obtained by slicing the portion of paragraph containing the answer
            
            ### Preprocess method 1:
            # mid = answer_start_token
            # randm_int = int(torch.randint(0, 200, (1,1))[0]) 
            # paragraph_start = max(0, min(mid - randm_int // 2, len(tokenized_paragraph) - randm_int))
            
            ### Preprocess method 2:
            mid = (answer_start_token + answer_end_token) // 2
            mid += int(random.uniform(-0.5, 0.5)*self.max_paragraph_len)
            paragraph_start = max(0, min(mid - self.max_paragraph_len // 2, len(tokenized_paragraph) - self.max_paragraph_len))


            paragraph_end = paragraph_start + self.max_paragraph_len
            
            # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
            input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102] 
            input_ids_paragraph = tokenized_paragraph.ids[paragraph_start : paragraph_end] + [102]		
            
            # Convert answer's start/end positions in tokenized_paragraph to start/end positions in the window  
            answer_start_token += len(input_ids_question) - paragraph_start
            answer_end_token += len(input_ids_question) - paragraph_start
            
            # Pad sequence and obtain inputs to model 
            input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)
            
            return torch.tensor(input_ids), torch.tensor(token_type_ids), torch.tensor(attention_mask), answer_start_token, answer_end_token

        # Validation/Testing
        else:
            input_ids_list, token_type_ids_list, attention_mask_list = [], [], []
            
            # Paragraph is split into several windows, each with start positions separated by step "doc_stride"
            for i in range(0, len(tokenized_paragraph), self.doc_stride):
                
                # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
                input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102]
                input_ids_paragraph = tokenized_paragraph.ids[i : i + self.max_paragraph_len] + [102]
                
                # Pad sequence and obtain inputs to model
                input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)
                
                input_ids_list.append(input_ids)
                token_type_ids_list.append(token_type_ids)
                attention_mask_list.append(attention_mask)
            
            return torch.tensor(input_ids_list), torch.tensor(token_type_ids_list), torch.tensor(attention_mask_list)

    def padding(self, input_ids_question, input_ids_paragraph):
        # Pad zeros if sequence length is shorter than max_seq_len
        padding_len = self.max_seq_len - len(input_ids_question) - len(input_ids_paragraph)
        # Indices of input sequence tokens in the vocabulary
        input_ids = input_ids_question + input_ids_paragraph + [0] * padding_len
        # Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]
        token_type_ids = [0] * len(input_ids_question) + [1] * len(input_ids_paragraph) + [0] * padding_len
        # Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
        attention_mask = [1] * (len(input_ids_question) + len(input_ids_paragraph)) + [0] * padding_len
        
        return input_ids, token_type_ids, attention_mask

train_set = QA_Dataset("train", train_questions, train_questions_tokenized, train_paragraphs_tokenized)
dev_set = QA_Dataset("dev", dev_questions, dev_questions_tokenized, dev_paragraphs_tokenized)
test_set = QA_Dataset("test", test_questions, test_questions_tokenized, test_paragraphs_tokenized)

train_batch_size = 16

# Note: Do NOT change batch size of dev_loader / test_loader !
# Although batch size=1, it is actually a batch consisting of several windows from the same QA pair
train_loader = DataLoader(train_set, batch_size=train_batch_size, shuffle=True, pin_memory=True)
dev_loader = DataLoader(dev_set, batch_size=1, shuffle=False, pin_memory=True)
test_loader = DataLoader(test_set, batch_size=1, shuffle=False, pin_memory=True)
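To see how `doc_stride` and `max_paragraph_len` interact at validation/test time, the following sketch reproduces only the window slicing from `__getitem__`, applied to a list of integers. `split_windows` is a hypothetical helper for illustration and is not part of the notebook.

```python
def split_windows(token_ids, max_paragraph_len, doc_stride):
    """Return (start, window) pairs, with starts spaced `doc_stride` apart."""
    return [
        (i, token_ids[i : i + max_paragraph_len])
        for i in range(0, len(token_ids), doc_stride)
    ]


tokens = list(range(10))  # stand-in for tokenized_paragraph.ids

# Stride equal to the window length: windows do not overlap
for start, window in split_windows(tokens, max_paragraph_len=4, doc_stride=4):
    print(start, window)

# Stride of half the window length: neighbouring windows share two tokens
for start, window in split_windows(tokens, max_paragraph_len=4, doc_stride=2):
    print(start, window)
```

With non-overlapping windows, a short answer that straddles a window boundary is never fully contained in any single window. A smaller stride produces overlapping windows that can cover such answers, at the cost of more forward passes per question.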

## Function for Evaluation

In [None]:
def evaluate(data, output, postprocess):
    ##### TODO: Postprocessing #####
    # There is a bug and room for improvement in postprocessing 
    # Hint: Open your prediction file to see what is wrong 
    
    answer = ''
    max_prob = float('-inf')
    num_of_windows = data[0].shape[1]

    for k in range(num_of_windows):
        # Obtain answer by choosing the most probable start position / end position
        start_prob, start_index = torch.max(output.start_logits[k], dim=0)
        end_prob, end_index = torch.max(output.end_logits[k], dim=0)
        
        # Score of an answer is the sum of its start and end logits (unnormalized, so not a true probability)
        prob = start_prob + end_prob
        
        # Replace answer if calculated probability is larger than previous windows
        if prob > max_prob:
            max_prob = prob
            # Convert tokens to chars (e.g. [1920, 7032] --> "大 金")
            answer = tokenizer.decode(data[0][0][k][start_index : end_index + 1])
            ans_id = data[0][0][k][start_index : end_index + 1]

    if postprocess == 'false':
        # Remove spaces in answer (e.g. "大 金" --> "大金")
        return answer.replace(' ','')
    elif postprocess == 'true':
        # Return the raw decoded answer and its token ids so the caller can handle [UNK]
        return answer, ans_id
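One way to approach the postprocessing TODO is to score candidate start/end pairs jointly and skip pairs where the end precedes the start or the span is implausibly long. The sketch below works on plain Python lists of scores. `best_span` and `max_answer_len` are illustrative names and values, not part of the homework template, and this is only one possible fix.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (score, start, end) of the highest-scoring valid span.

    A span is valid when start <= end and it covers at most
    `max_answer_len` tokens. Returns None if the inputs are empty.
    """
    best = None
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best


# Independent argmax would pick start=3 and end=1 here, i.e. an empty slice
start_logits = [0.1, 0.2, 2.0, 5.0]
end_logits = [0.0, 4.0, 0.3, 1.0]
print(best_span(start_logits, end_logits))
```

In the notebook each window also contains the question and special tokens before the paragraph, so a fuller version would additionally restrict both indices to the paragraph segment.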

## Training

In [None]:
from transformers import get_linear_schedule_with_warmup

num_epoch = 1
validation = True
logging_step = 100
learning_rate = 1e-4
num_training_steps = len(train_loader) * num_epoch
optimizer = AdamW(model.parameters(), lr=learning_rate)

# Linear learning rate decay with no warmup (num_warmup_steps = 0)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

# accum_iter, batch_idx = 4, 0

if fp16_training:
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader) 

model.train()

print("Start Training ...")

for epoch in range(num_epoch):
    step = 1
    train_loss = train_acc = 0

    for data in tqdm(train_loader):
        # Load all data into GPU
        data = [i.to(device) for i in data]
        
        # Model inputs: input_ids, token_type_ids, attention_mask, start_positions, end_positions (Note: only "input_ids" is mandatory)
        # Model outputs: start_logits, end_logits, loss (return when start_positions/end_positions are provided)  
        output = model(input_ids=data[0], token_type_ids=data[1], attention_mask=data[2], start_positions=data[3], end_positions=data[4])

        # Choose the most probable start position / end position
        start_index = torch.argmax(output.start_logits, dim=1)
        end_index = torch.argmax(output.end_logits, dim=1)
        
        # Prediction is correct only if both start_index and end_index are correct
        train_acc += ((start_index == data[3]) & (end_index == data[4])).float().mean()

        # normalize loss to account for batch accumulation
        # output.loss = output.loss / accum_iter

        train_loss += output.loss
        
         

        if fp16_training:
            accelerator.backward(output.loss)
        else:
            output.loss.backward()
        
        ### batch accumulation ###
        # if ((batch_idx + 1) % accum_iter == 0) or (batch_idx + 1 == len(train_loader)): 
        #     optimizer.step()
        #     scheduler.step()
        #     optimizer.zero_grad()
        # batch_idx += 1

        
        ##### TODO: Apply linear learning rate decay #####
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        
        step += 1

        ##### TODO: Apply linear learning rate decay #####
        # optimizer.param_groups[0]["lr"] -= learning_rate / num_training_steps 


        # Print training loss and accuracy over past logging step
        if step % logging_step == 0:
            print(f"Epoch {epoch + 1} | Step {step} | loss = {train_loss.item() / logging_step:.3f}, acc = {train_acc / logging_step:.3f}")
            train_loss = train_acc = 0

    if validation:
        print("Evaluating Dev Set ...")
        model.eval()
        with torch.no_grad():
            dev_acc = 0
            for i, data in enumerate(tqdm(dev_loader)):
                output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                       attention_mask=data[2].squeeze(dim=0).to(device))
                # prediction is correct only if answer text exactly matches
                dev_acc += evaluate(data, output, 'false') == dev_questions[i]["answer_text"]
            print(f"Validation | Epoch {epoch + 1} | acc = {dev_acc / len(dev_loader):.3f}")
        model.train()

# Save a model and its configuration file to the directory 「saved_model」 
# i.e. there are two files under the directory 「saved_model」: 「pytorch_model.bin」 and 「config.json」
# Saved model can be re-loaded using 「model = BertForQuestionAnswering.from_pretrained("saved_model")」
print('current learning rate:', optimizer.param_groups[0]["lr"])
print("Saving Model ...")
model_save_dir = "saved_model" 
model.save_pretrained(model_save_dir)

Start Training ...


HBox(children=(FloatProgress(value=0.0, max=1684.0), HTML(value='')))

Epoch 1 | Step 100 | loss = 1.481, acc = 0.494
Epoch 1 | Step 200 | loss = 0.898, acc = 0.661
Epoch 1 | Step 300 | loss = 0.812, acc = 0.683
Epoch 1 | Step 400 | loss = 0.721, acc = 0.690
Epoch 1 | Step 500 | loss = 0.726, acc = 0.711
Epoch 1 | Step 600 | loss = 0.642, acc = 0.737
Epoch 1 | Step 700 | loss = 0.679, acc = 0.731
Epoch 1 | Step 800 | loss = 0.711, acc = 0.705
Epoch 1 | Step 900 | loss = 0.635, acc = 0.728
Epoch 1 | Step 1000 | loss = 0.587, acc = 0.754
Epoch 1 | Step 1100 | loss = 0.605, acc = 0.759
Epoch 1 | Step 1200 | loss = 0.541, acc = 0.786
Epoch 1 | Step 1300 | loss = 0.573, acc = 0.758
Epoch 1 | Step 1400 | loss = 0.519, acc = 0.787
Epoch 1 | Step 1500 | loss = 0.494, acc = 0.791
Epoch 1 | Step 1600 | loss = 0.494, acc = 0.789

Evaluating Dev Set ...


HBox(children=(FloatProgress(value=0.0, max=3524.0), HTML(value='')))


Validation | Epoch 1 | acc = 0.809
current learning rate: 0.0
Saving Model ...
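The final learning rate of 0.0 printed above is what a linear decay to zero should produce. As a sketch of the schedule's shape, the function below computes the multiplier applied to the base learning rate at each step: a linear ramp up during warmup, then a linear ramp down to zero. It is written from the documented behaviour of `get_linear_schedule_with_warmup` rather than copied from the library, so treat it as an approximation. `linear_lr_factor` is a name chosen for this example.

```python
def linear_lr_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier for the base learning rate after `step` optimizer updates."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-4
total_steps = 1684  # number of training batches shown in the progress bar above

for step in (0, 421, 842, 1263, 1684):
    print(step, base_lr * linear_lr_factor(step, total_steps))
```

With `num_warmup_steps=0`, as in the training cell, the learning rate starts at its full value and reaches half of it at the midpoint of training.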


## Postprocessing

In [None]:
### Postprocess method: trace back what [UNK] is in paragraph
def look_back_paragraph(idx, ans,ans_id):
  for i,v in enumerate(test_questions):
    if i == idx:
      paragraph = test_paragraphs[v['paragraph_id']]
      ans = tokenizer.decode(ans_id)
      print(ans)
      ans2 = ans.split(' ')

      check_unk_num = 0
      unk_idx_list = []
      for j,v in enumerate(ans2):
        if v == '[UNK]':
          check_unk_num +=1
          unk_idx_list.append(j)
      # More than one [UNK]
      if len(unk_idx_list) > 1:
        new_word = []
        for unk_idx in unk_idx_list:
          if unk_idx == len(ans2)-1:
            word_near_unk = ans2[-2]
          elif unk_idx == 0:
            word_near_unk = ans2[1]
          else:
            word_near_unk = ans2[unk_idx-1]
        
          for i,val in enumerate(paragraph):
            if val == word_near_unk:
              if unk_idx == 0:
                new_word.append(paragraph[i-1])
              elif unk_idx == len(ans2)-1:
                new_word.append(paragraph[i+1])
              else:
                new_word.append(paragraph[i+1])
        cnt = 0
        for key in range(len(ans2)):
          if key in unk_idx_list:
            ans2[key] = new_word[cnt]
            cnt+=1          
              
        print(ans2)
        return ''.join( _ for _ in ans2)
      
      # Exactly one [UNK]
      else:
        unk_idx = unk_idx_list[0]
        word_near_unk = ''
        if unk_idx == len(ans2)-1:
          word_near_unk = ans2[-2]
        elif unk_idx == 0:
          word_near_unk = ans2[1]
        else:
          word_near_unk = ans2[unk_idx-1]
          #print(word_near_unk)

        for i,val in enumerate(paragraph):
          if val == word_near_unk:
            if unk_idx == 0:
              s = ''.join( _ for _ in ans2[1:])
              new = paragraph[i-1]
              new_ans = new + s
              print(new_ans)
            elif unk_idx == len(ans2)-1:
              s = ''.join( _ for _ in ans2[:-1])
              new = paragraph[i+1]
              new_ans = s + new
              print(new_ans)
            else:
              s = ''.join( _ for _ in ans2[:unk_idx])
              new = paragraph[i+1]
              new_ans = s + new
              last = ''.join( _ for _ in ans2[unk_idx+1:])
              new_ans += last
              print(new_ans)
            
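            return new_ans

        # Fall back to the decoded answer if the neighbouring character is not found in the paragraph
        return ans.replace(' ', '')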
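An alternative to searching the paragraph for a neighbouring character is to map the predicted tokens back to character positions and slice the original text, which sidesteps `[UNK]` entirely. The sketch below assumes per-token `(start, end)` character offsets are available (fast tokenizers can return them, for example with `return_offsets_mapping=True`). `span_to_text` and the toy offsets are hypothetical, and in the notebook the window start and the question prefix length would first have to be subtracted from the predicted indices.

```python
def span_to_text(paragraph, offsets, start_token, end_token):
    """Slice the original paragraph using character offsets of the tokens.

    `offsets[i]` is the (start, end) character range of paragraph token i,
    with `end` exclusive. `start_token` and `end_token` are inclusive
    indices into the paragraph tokens.
    """
    if start_token > end_token:
        return ""
    char_start = offsets[start_token][0]
    char_end = offsets[end_token][1]
    return paragraph[char_start:char_end]


# Toy paragraph; assume the tokenizer's vocabulary lacks the second character,
# so decoding token 1 would give [UNK]
paragraph = "白堊紀滅絕事件"
# One token per character in this toy case; a real tokenizer supplies these
offsets = [(i, i + 1) for i in range(len(paragraph))]

print(span_to_text(paragraph, offsets, 0, 6))
```

Because the answer is cut directly from the paragraph, it also keeps the original punctuation and spacing, so no space removal is needed afterwards.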

In [None]:
### Postprocess method: complete unmatched brackets
def add_symbols(result):
    cnt = 0
    tmp = []
    for i in range(len(result)):
        ans = result[i]
        if '《' in ans and '》' not in ans:
            modify = ans + '》'
            print(i,ans,modify)
            tmp.append(modify)
        elif '《' not in ans and '》' in ans:
            modify = '《' + ans
            print(i,ans,modify)
            tmp.append(modify)
        elif '「' not in ans and '」' in ans:
            modify = '「' + ans
            print(i,ans,modify)
            tmp.append(modify)
        elif '「' in ans and '」' not in ans:
            modify = ans + '」'
            print(i,ans,modify)
            tmp.append(modify)
        else:
            tmp.append(ans)
    return tmp

## Testing

In [None]:
print("Evaluating Test Set ...")

result = []
model.eval()
with torch.no_grad():
    for data in tqdm(test_loader):
        output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                       attention_mask=data[2].squeeze(dim=0).to(device))
        ans = evaluate(data, output, 'false')
        result.append(ans)

Evaluating Test Set ...


HBox(children=(FloatProgress(value=0.0, max=3493.0), HTML(value='')))




In [None]:
### Example of a problem in the prediction file:
# ans_id = torch.tensor([4635,  100, 5145, 3994, 5179,  752,  816])
# print(len(test_paragraphs),len(test_questions),len(result))
# look_back_paragraph(991,'白 [UNK] 紀 滅 絕 事 件', ans_id)

In [None]:
### Testing (add postprocess method) ###
print("Evaluating Test Set ...")

postprocess_result = []
idx = 0
model.eval()
with torch.no_grad():
    for data in tqdm(test_loader):
        output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                       attention_mask=data[2].squeeze(dim=0).to(device))
        ans, ans_id = evaluate(data, output, 'true')
        
        if '[UNK]' in ans:
          print(idx)
          new_ans = look_back_paragraph(idx,ans,ans_id)
          postprocess_result.append(new_ans)
        else:
          ans = ans.replace(' ','')
          postprocess_result.append(ans)
        idx += 1

In [None]:
new_postprocess_result = add_symbols(postprocess_result)

539 天」的至高無上 「天」的至高無上
884 「1910年革命 「1910年革命」
1088 「破四舊 「破四舊」
1536 結束戰爭」。但德軍的最高統帥部卻仍死心不息，意圖利用剩餘的海軍艦隻與英國海軍進行最後決戰 「結束戰爭」。但德軍的最高統帥部卻仍死心不息，意圖利用剩餘的海軍艦隻與英國海軍進行最後決戰
1625 記錄」號 「記錄」號
1973 「斷罰出征，並全征燒埋銀 「斷罰出征，並全征燒埋銀」
2174 永興」 「永興」
2349 《韓國憲法 《韓國憲法》
2610 「印度國民軍士兵大審判 「印度國民軍士兵大審判」
2771 「11月祭 「11月祭」
2970 一個小時說一個字」 「一個小時說一個字」
3058 中國共產黨紀律檢查機關，是中國共產黨對黨員進行紀律考核的部門。在《中國共產黨章程》內稱為「黨的紀律檢查機關 中國共產黨紀律檢查機關，是中國共產黨對黨員進行紀律考核的部門。在《中國共產黨章程》內稱為「黨的紀律檢查機關」
3425 建立「1月6日獨裁 建立「1月6日獨裁」
3437 歐」型恆星 「歐」型恆星


In [None]:
def write_file(result, result_file):
  with open(result_file, 'w', encoding="utf-8") as f:
    f.write("ID,Answer\n")
    for i, test_question in enumerate(test_questions):
      if result_file == 'ensemble_result.csv':
        # Ensemble answers were read back from CSV files whose commas were already removed;
        # an empty answer may be read back as NaN, which has no .replace method
        f.write(f"{test_question['id']},{result[i]}\n")
      else:
        # Replace commas in answers with empty strings (since csv is separated by comma)
        # Answers in kaggle are processed in the same way
        f.write(f"{test_question['id']},{result[i].replace(',','')}\n")
      
  print(f"Completed! Result is in {result_file}")

In [None]:
print(len(result),len(new_postprocess_result))
write_file(result, 'result.csv')
write_file(new_postprocess_result, 'postprocess_result.csv')

3493 3493
Completed! Result is in result.csv
Completed! Result is in postprocess_result.csv


## Ensemble

In [None]:
from google.colab import files
import io
import pandas as pd

uploaded1 = files.upload()
uploaded2 = files.upload()
uploaded3 = files.upload()

df1 = pd.read_csv(io.BytesIO(uploaded1['postprocess_result.csv']))      # 0.78947
df2 = pd.read_csv(io.BytesIO(uploaded2['result (1).csv']))         # 0.78661
df3 = pd.read_csv(io.BytesIO(uploaded3['postprocess_result2.csv']))     # 0.77860

Saving postprocess_result.csv to postprocess_result.csv


Saving result (1).csv to result (1).csv


Saving postprocess_result2.csv to postprocess_result2.csv


In [None]:
# Ensemble by majority vote over the three prediction files
import collections
tmp_ans = []
for i in range(len(df1)):
    votes = [df1.iloc[i, 1], df2.iloc[i, 1], df3.iloc[i, 1]]
    answer, count = collections.Counter(votes).most_common(1)[0]
    # Use the majority answer when at least two files agree; otherwise fall back to df1
    tmp_ans.append(answer if count >= 2 else df1.iloc[i, 1])

write_file(tmp_ans,'ensemble_result.csv')

Completed! Result is in ensemble_result.csv
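The three-file vote above generalises to any number of prediction lists. The following sketch uses plain lists instead of DataFrames. `majority_vote` is a hypothetical helper, the sample answers are made up, and the fallback rule (keep the answer from the first list, assumed here to be the strongest model) is a design choice carried over from the cell above.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine several equally long prediction lists by per-row majority.

    `predictions[0]` is treated as the most trusted model: its answer is
    kept whenever no answer is given by more than one model.
    """
    merged = []
    for row in zip(*predictions):
        answer, count = Counter(row).most_common(1)[0]
        merged.append(answer if count > 1 else row[0])
    return merged


model_a = ["1910年", "永興", "大金"]
model_b = ["1911年", "永興", "金"]
model_c = ["1911年", "永興島", "大"]
print(majority_vote([model_a, model_b, model_c]))
```

Voting on exact answer strings only helps when the models disagree on a minority of rows. If all models make the same mistake, the ensemble repeats it.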


## Results summary

In [None]:
# roberta-large, batch_size = 16, doc_stride = 75, preprocessing method, valid: 0.795, test: 0.78661 (result) --> after postprocess: 0.78947 (postprocess_result)
# roberta-large, batch_size = 128, doc_stride = 75, preprocessing method, valid: 0.788, test: 0.78661 (result1) --> after postprocess: 0.79061 (postprocess_result3)
# roberta-large, batch_size = 128, doc_stride = 50, preprocessing method, valid: 0.798, test: 0.77631 (result2) --> after postprocess: 0.77860 (postprocess_result2)

# roberta-large, batch_size = 16, doc_stride = 75, preprocessing method2, valid: 0.791, test: 0.77432

# roberta-large, batch_size = 16, doc_stride = 75, preprocessing method (200), valid: 0.810, test: 0.79405 (result10)
# roberta-large, batch_size = 16, doc_stride = 75, preprocessing method (±x*150), valid: 0.810, test: 0.80434 --> after recovering [UNK] answers: test: 0.81064 (result11) --> after completing brackets: test: 0.81350 (postprocess_result6)

# roberta-large, batch_size = 16, doc_stride = 75, preprocessing method, valid: 0.791, test: 0.77459 --> after recovering [UNK] answers: test: 0.78089 --> after completing brackets: test: 0.78432
# roberta-large, batch_size = 16, doc_stride = 75, preprocessing method, valid: 0.791, test: 0.77459 --> after recovering [UNK] answers: test: 0.78089 --> after removing brackets: test: 0.78203

### Ensemble:
# ensemble: postprocess_result, result1, postprocess_result2: test: 0.80263
# ensemble2: postprocess_result, postprocess_result3, postprocess_result2: test: 0.79061