# **Homework 7 - BERT (Question Answering)**

If you have any questions, feel free to email us at mlta-2022-spring@googlegroups.com



Slide:    [Link](https://docs.google.com/presentation/d/1H5ZONrb2LMOCixLY7D5_5-7LkIaXO6AGEaV2mRdTOMY/edit?usp=sharing)　Kaggle: [Link](https://www.kaggle.com/c/ml2022spring-hw7)　Data: [Link](https://drive.google.com/uc?id=1AVgZvy3VFeg0fX-6WQJMHPVrx3A-M1kb)




## Task description
- Chinese Extractive Question Answering
  - Input: Paragraph + Question
  - Output: Answer

- Objective: Learn how to fine-tune a pretrained model on a downstream task using transformers

- Todo
    - Fine-tune a pretrained Chinese BERT model
    - Change hyperparameters (e.g. doc_stride)
    - Apply linear learning rate decay
    - Try other pretrained models
    - Improve preprocessing
    - Improve postprocessing
- Training tips
    - Automatic mixed precision
    - Gradient accumulation
    - Ensemble

- Estimated training time (Tesla T4 with automatic mixed precision enabled)
    - Simple: 8mins
    - Medium: 8mins
    - Strong: 25mins
    - Boss: 2.5hrs
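
The "linear learning rate decay" item in the Todo list can be pictured with a small, framework-free sketch. It computes a learning-rate multiplier that ramps up linearly during warmup and then decays linearly to zero, which is the same shape that `get_linear_schedule_with_warmup` produces in the training cell. The function name `linear_warmup_decay` and the values `warmup_steps=10` and `total_steps=100` are illustrative assumptions, not values from this homework.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Return the learning-rate multiplier (between 0 and 1) at an optimizer step."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to 1
        return step / max(1, warmup_steps)
    # Decay: fall linearly from 1 down to 0 at total_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 5e-5  # same base learning rate as the training cell
# Hypothetical schedule: 10 warmup steps out of 100 optimizer steps
schedule = [base_lr * linear_warmup_decay(s, warmup_steps=10, total_steps=100)
            for s in range(101)]

print(schedule[0], schedule[10], schedule[55], schedule[100])
```

The peak is reached exactly at the end of warmup, and the rate is halfway down at the midpoint of the decay phase.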
  

## Download Dataset

In [None]:
from google.colab import drive
import os
drive.mount('/content/drive')

## Grant access when the authorization prompt appears

os.chdir('/content/drive/My Drive/ML2/HW7/') # switch to this directory
os.listdir() # check the directory contents

# Download link 1
!gdown --id '1AVgZvy3VFeg0fX-6WQJMHPVrx3A-M1kb' --output hw7_data.zip

# Download Link 2 (if the above link fails) 
# !gdown --id '1qwjbRjq481lHsnTrrF4OjKQnxzgoLEFR' --output hw7_data.zip

# Download Link 3 (if the above link fails) 
# !gdown --id '1QXuWjNRZH6DscSd6QcRER0cnxmpZvijn' --output hw7_data.zip

!unzip -o hw7_data.zip

# For this HW, K80 < P4 < T4 < P100 <= T4(fp16) < V100
!nvidia-smi

Mounted at /content/drive
Downloading...
From: https://drive.google.com/uc?id=1AVgZvy3VFeg0fX-6WQJMHPVrx3A-M1kb
To: /content/drive/My Drive/ML2/HW7/hw7_data.zip
100% 9.57M/9.57M [00:00<00:00, 255MB/s]
Archive:  hw7_data.zip
  inflating: hw7_dev.json            
  inflating: hw7_test.json           
  inflating: hw7_train.json          
Wed Apr 27 02:27:27 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |      0MiB / 16280MiB |      0%      

## Install transformers

Documentation for the toolkit:　https://huggingface.co/transformers/

In [None]:
# You are allowed to change version of transformers or use other toolkits
!pip install transformers==4.5.0

Collecting transformers==4.5.0
  Downloading transformers-4.5.0-py3-none-any.whl (2.1 MB)
     |████████████████████████████████| 2.1 MB 4.2 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
     |████████████████████████████████| 895 kB 55.4 MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
     |████████████████████████████████| 3.3 MB 52.1 MB/s 
Installing collected packages: tokenizers, sacremoses, transformers
Successfully installed sacremoses-0.0.49 tokenizers-0.10.3 transformers-4.5.0


## Import Packages

In [None]:
import json
import numpy as np
import random
import torch
from torch.utils.data import DataLoader, Dataset 
from transformers import AdamW, BertForQuestionAnswering, BertTokenizerFast, BertConfig

from tqdm.auto import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"

# Fix random seed for reproducibility
def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
same_seeds(101)

In [None]:
# Change "fp16_training" to True to support automatic mixed precision training (fp16)	
fp16_training = False

if fp16_training:
    !pip install accelerate==0.2.0
    from accelerate import Accelerator
    accelerator = Accelerator(fp16=True)
    device = accelerator.device

# Documentation for the toolkit:  https://huggingface.co/docs/accelerate/

## Load Model and Tokenizer




 

In [None]:
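# Note: the training cell below expects a single `model`. Uncomment one of the `model = ...` lines in this cell before training.
# model1-model3 load checkpoint directories named in the format written by the training cell's `save_pretrained` call.
# config = BertConfig.from_pretrained("hfl/chinese-macbert-large", hidden_dropout_prob=0.2)
# model = BertForQuestionAnswering.from_pretrained("hfl/chinese-macbert-large").to(device)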
model1 = BertForQuestionAnswering.from_pretrained("saved_model_roberta_devless_less_0.1_0.1_400_3_0.842").to(device)
model2 = BertForQuestionAnswering.from_pretrained("saved_model_roberta_devless_less_0.1_0.1_400_3_0.836").to(device)
model3 = BertForQuestionAnswering.from_pretrained("saved_model_roberta_devless_less_0.1_0.1_400_3_0.831").to(device)
# model4 = BertForQuestionAnswering.from_pretrained("saved_model_macbert_devless0.1_0.1_440_3_0.883").to(device)
# model5 = BertForQuestionAnswering.from_pretrained("saved_model_macbert_devless0.1_0.1_384_3_0.841").to(device)
# model6 = BertForQuestionAnswering.from_pretrained("saved_model_macbert_devless0.1_0.1_384_4_0.850").to(device)

tokenizer = BertTokenizerFast.from_pretrained("hfl/chinese-macbert-large")
##-------------------------------------------------------------------------------------------------------------------
# config = BertConfig.from_pretrained("hfl/chinese-macbert-large", hidden_dropout_prob=0.4)
# model = BertForQuestionAnswering.from_pretrained("hfl/chinese-macbert-large", config=config).to(device)
# tokenizer = BertTokenizerFast.from_pretrained("hfl/chinese-macbert-large")
##-------------------------------------------------------------------------------------------------------------------
# from transformers import BertTokenizer, T5ForConditionalGeneration, Text2TextGenerationPipeline
# tokenizer = BertTokenizer.from_pretrained("uer/t5-small-chinese-cluecorpussmall")
# model = T5ForConditionalGeneration.from_pretrained("uer/t5-small-chinese-cluecorpussmall")
# You can safely ignore the warning message (it pops up because new prediction heads for QA are initialized randomly)
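
This cell loads three fine-tuned checkpoints, which matches the "Ensemble" training tip. How their predictions are combined is not shown in this section, so the sketch below is only one common option, stated as an assumption: average the start (or end) logits position by position across models, then take the argmax of the average. The helper `average_logits` and the toy logit values are hypothetical.

```python
def average_logits(per_model_logits):
    """Average logits position by position across models.

    per_model_logits: list of equal-length lists, one list per model.
    """
    num_models = len(per_model_logits)
    return [sum(values) / num_models for values in zip(*per_model_logits)]


# Toy start logits over three token positions from three hypothetical models
start_logits_by_model = [
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 2.5],  # this model alone would pick position 2
    [0.0, 2.0, 1.0],
]

avg_start = average_logits(start_logits_by_model)
best_start = max(range(len(avg_start)), key=avg_start.__getitem__)
print(avg_start, best_start)
```

Averaging lets two confident models outvote one that disagrees. With real model outputs the same idea applies to the `start_logits` and `end_logits` tensors.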

## Read Data

- Training set: 31690 QA pairs
- Dev set: 4131 QA pairs
- Test set: 4957 QA pairs

- {train/dev/test}_questions:	
  - List of dicts with the following keys:
   - id (int)
   - paragraph_id (int)
   - question_text (string)
   - answer_text (string)
   - answer_start (int)
   - answer_end (int)
- {train/dev/test}_paragraphs: 
  - List of strings
  - paragraph_ids in questions correspond to indices in paragraphs
  - A paragraph may be used by several questions 
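
To make the layout above concrete, here is a tiny made-up record in the same shape. The paragraph and question text are invented for illustration and are not taken from the dataset. The sketch also assumes that `answer_end` is an inclusive character index, which is consistent with how the dataset class later passes it to `char_to_token`.

```python
# Toy data in the same shape as the hw7_*.json files (contents are invented)
paragraph = "BERT was released by Google in 2018."
answer = "Google"
answer_start = paragraph.index(answer)
answer_end = answer_start + len(answer) - 1  # assumed inclusive end index

data = {
    "questions": [
        {
            "id": 0,
            "paragraph_id": 0,
            "question_text": "Who released BERT?",
            "answer_text": answer,
            "answer_start": answer_start,
            "answer_end": answer_end,
        }
    ],
    "paragraphs": [paragraph],
}

# Look up the paragraph for a question, then recover the answer span from it
question = data["questions"][0]
context = data["paragraphs"][question["paragraph_id"]]
recovered = context[question["answer_start"]: question["answer_end"] + 1]
print(recovered)
```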

In [None]:
def read_data(file):
    with open(file, 'r', encoding="utf-8") as reader:
        data = json.load(reader)
    return data["questions"], data["paragraphs"]


train_questions, train_paragraphs = read_data("hw7_train.json")
dev_questions, dev_paragraphs = read_data("hw7_dev.json")
dev_questions_final, dev_paragraphs_final = read_data("hw7_dev.json")

print("Train questions before :", len(train_questions))
print("Train paragraphs before :", len(train_paragraphs))
print("Dev questions before :", len(dev_questions))
print("Dev paragraphs before :", len(dev_paragraphs))
# counter = 0
for i in range(len(dev_questions)):
  if int(dev_questions[i]["paragraph_id"]) < 1400:
    train_questions += [dev_questions[i]]
    dev_questions_final.remove(dev_questions[i])
    # counter+=1
  
train_paragraphs += dev_paragraphs[:1400]
dev_paragraphs_final = dev_paragraphs[1400:]
print("Train questions after :", len(train_questions))
print("Train paragraphs after :", len(train_paragraphs))
print("Dev questions  after :", len(dev_questions_final))
print("Dev paragraphs after :", len(dev_paragraphs_final))

# print(counter) #3137

# Dev questions were appended starting at index 31690 (the original number of train questions)
for number in range(31690, 35644):
  train_questions[number]["id"] = int(number)
  train_questions[number]["paragraph_id"] = 10524 + int(train_questions[number]["paragraph_id"])


for number in range(177):
  dev_questions_final[number]['id'] = int(number)
  dev_questions_final[number]['paragraph_id'] = int(dev_questions_final[number]['paragraph_id']) - 1400

# for i in range(31691, 34827):
  
# print(train_questions[34111])
# print(train_paragraphs[11268])



test_questions, test_paragraphs = read_data("hw7_test.json")

Train questions before : 31690
Train paragraphs before : 10524
Dev questions before : 4131
Dev paragraphs before : 1490
Train questions after : 35644
Train paragraphs after : 11924
Dev questions  after : 177
Dev paragraphs after : 90
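
The cell above moves part of the dev set into the training set and then re-indexes `id` and `paragraph_id` with hard-coded offsets. Hard-coded offsets are easy to get off by one, so the sketch below derives every offset from list lengths instead. It is a minimal illustration on toy data; the function name `move_dev_to_train` and its parameter `num_moved` are hypothetical and not part of the original notebook.

```python
def move_dev_to_train(train_questions, train_paragraphs,
                      dev_questions, dev_paragraphs, num_moved):
    """Move the first `num_moved` dev paragraphs (and their questions) into train."""
    paragraph_offset = len(train_paragraphs)  # where moved paragraphs will land
    merged_questions = [dict(q) for q in train_questions]
    remaining_questions = []
    for question in dev_questions:
        question = dict(question)  # copy so the inputs are not mutated
        if question["paragraph_id"] < num_moved:
            question["id"] = len(merged_questions)
            question["paragraph_id"] += paragraph_offset
            merged_questions.append(question)
        else:
            question["id"] = len(remaining_questions)
            question["paragraph_id"] -= num_moved
            remaining_questions.append(question)
    merged_paragraphs = train_paragraphs + dev_paragraphs[:num_moved]
    remaining_paragraphs = dev_paragraphs[num_moved:]
    return merged_questions, merged_paragraphs, remaining_questions, remaining_paragraphs


# Toy data: 2 train questions, 3 dev questions, move 1 dev paragraph into train
toy_train_q = [{"id": 0, "paragraph_id": 0}, {"id": 1, "paragraph_id": 1}]
toy_train_p = ["train paragraph A", "train paragraph B"]
toy_dev_q = [{"id": 0, "paragraph_id": 0}, {"id": 1, "paragraph_id": 1},
             {"id": 2, "paragraph_id": 0}]
toy_dev_p = ["dev paragraph A", "dev paragraph B"]

train_q, train_p, dev_q, dev_p = move_dev_to_train(
    toy_train_q, toy_train_p, toy_dev_q, toy_dev_p, num_moved=1)
print(len(train_q), len(train_p), len(dev_q), len(dev_p))
```

A quick sanity check after any such merge is that every question's `paragraph_id` still points at the paragraph it was written for.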


## Tokenize Data

In [None]:
# Tokenize questions and paragraphs separately
# "add_special_tokens" is set to False since special tokens will be added when tokenized questions and paragraphs are combined in dataset __getitem__

train_questions_tokenized = tokenizer([train_question["question_text"] for train_question in train_questions], add_special_tokens=False)
dev_questions_tokenized = tokenizer([dev_question["question_text"] for dev_question in dev_questions_final], add_special_tokens=False)
test_questions_tokenized = tokenizer([test_question["question_text"] for test_question in test_questions], add_special_tokens=False) 

train_paragraphs_tokenized = tokenizer(train_paragraphs, add_special_tokens=False)
dev_paragraphs_tokenized = tokenizer(dev_paragraphs_final, add_special_tokens=False)
test_paragraphs_tokenized = tokenizer(test_paragraphs, add_special_tokens=False)

# You can safely ignore the warning message as tokenized sequences will be further processed in dataset __getitem__ before being passed to the model
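
The dataset class below converts character positions to token positions with `char_to_token`. The sketch here is a simplified stand-in, not the tokenizers library implementation: it assumes each token carries a `(start, end)` character span and returns `None` when a character belongs to no token. Whitespace is one case where that can happen, which is a plausible (but unverified) reason for the `None` values printed during training.

```python
def char_to_token(offsets, char_index):
    """Return the index of the token whose character span covers char_index.

    offsets: list of (start, end) character spans, end exclusive, one per token.
    Returns None if no token covers the character (e.g. whitespace).
    """
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


text = "BERT 2018"
# Hypothetical tokenization into two tokens: "BERT" and "2018"
offsets = [(0, 4), (5, 9)]

print(char_to_token(offsets, 0))  # first character of "BERT"
print(char_to_token(offsets, 6))  # a character inside "2018"
print(char_to_token(offsets, 4))  # the space between the two tokens
```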

## Dataset and Dataloader

In [None]:
from random import randint
class QA_Dataset(Dataset):
    def __init__(self, split, questions, tokenized_questions, tokenized_paragraphs):
        self.split = split
        self.questions = questions
        self.tokenized_questions = tokenized_questions
        self.tokenized_paragraphs = tokenized_paragraphs
        self.max_question_len = 60
        self.max_paragraph_len = 400
        
        ##### TODO: Change value of doc_stride #####
        self.doc_stride = 32

        # Input sequence length = [CLS] + question + [SEP] + paragraph + [SEP]
        self.max_seq_len = 1 + self.max_question_len + 1 + self.max_paragraph_len + 1

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        question = self.questions[idx]
        tokenized_question = self.tokenized_questions[idx]
        tokenized_paragraph = self.tokenized_paragraphs[question["paragraph_id"]]
        
        ##### TODO: Preprocessing #####
        # Hint: How to prevent model from learning something it should not learn

        if self.split == "train":
            
            # Convert answer's start/end positions in paragraph_text to start/end positions in tokenized_paragraph  
            answer_start_token = tokenized_paragraph.char_to_token(question["answer_start"])
            answer_end_token = tokenized_paragraph.char_to_token(question["answer_end"])

            if answer_start_token is None or answer_end_token is None:
              print(question["answer_start"])
              print(question["answer_end"])
              print(answer_start_token)
              print(answer_end_token) 
              print("--------------")
              answer_start_token = 0
              answer_end_token = 1
            # A single window is obtained by slicing the portion of paragraph containing the answer
            shift_num = randint(-50, +50)
            shift_num_right = randint(0, +50)
            shift_num_left = randint(-50, 0)
            mid = (answer_start_token + answer_end_token) // 2
            paragraph_start = max(0, min(mid - self.max_paragraph_len // 2, len(tokenized_paragraph) - self.max_paragraph_len))
            

            
            if paragraph_start >= abs(shift_num) and (len(tokenized_paragraph) - self.max_paragraph_len)>= abs(shift_num):
              paragraph_start += shift_num
              paragraph_end = paragraph_start + self.max_paragraph_len
            
            else:
              paragraph_end = paragraph_start + self.max_paragraph_len

            # paragraph_end = paragraph_start + self.max_paragraph_len #sample code
            
            # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
            input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102] 
            input_ids_paragraph = tokenized_paragraph.ids[paragraph_start : paragraph_end] + [102]		
            
            # Convert answer's start/end positions in tokenized_paragraph to start/end positions in the window  
            answer_start_token += len(input_ids_question) - paragraph_start
            answer_end_token += len(input_ids_question) - paragraph_start
            
            # Pad sequence and obtain inputs to model 
            input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)
            return torch.tensor(input_ids), torch.tensor(token_type_ids), torch.tensor(attention_mask), answer_start_token, answer_end_token
            
        # Validation/Testing
        else:
            input_ids_list, token_type_ids_list, attention_mask_list = [], [], []
            
            # Paragraph is split into several windows, each with start positions separated by step "doc_stride"
            for i in range(0, len(tokenized_paragraph), self.doc_stride):
                
                # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
                input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102]
                input_ids_paragraph = tokenized_paragraph.ids[i : i + self.max_paragraph_len] + [102]
                
                # Pad sequence and obtain inputs to model
                input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)
                
                input_ids_list.append(input_ids)
                token_type_ids_list.append(token_type_ids)
                attention_mask_list.append(attention_mask)
            
            return torch.tensor(input_ids_list), torch.tensor(token_type_ids_list), torch.tensor(attention_mask_list), question["id"]

    def padding(self, input_ids_question, input_ids_paragraph):
        # Pad zeros if sequence length is shorter than max_seq_len
        padding_len = self.max_seq_len - len(input_ids_question) - len(input_ids_paragraph)
        # Indices of input sequence tokens in the vocabulary
        input_ids = input_ids_question + input_ids_paragraph + [0] * padding_len
        # Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]
        token_type_ids = [0] * len(input_ids_question) + [1] * len(input_ids_paragraph) + [0] * padding_len
        # Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
        attention_mask = [1] * (len(input_ids_question) + len(input_ids_paragraph)) + [0] * padding_len
        
        return input_ids, token_type_ids, attention_mask

train_set = QA_Dataset("train", train_questions, train_questions_tokenized, train_paragraphs_tokenized)
dev_set = QA_Dataset("dev", dev_questions_final, dev_questions_tokenized, dev_paragraphs_tokenized)
test_set = QA_Dataset("test", test_questions, test_questions_tokenized, test_paragraphs_tokenized)

train_batch_size = 4

# Note: Do NOT change batch size of dev_loader / test_loader !
# Although batch size=1, it is actually a batch consisting of several windows from the same QA pair
train_loader = DataLoader(train_set, batch_size=train_batch_size, shuffle=True, pin_memory=True)
dev_loader = DataLoader(dev_set, batch_size=1, shuffle=True, pin_memory=True)
test_loader = DataLoader(test_set, batch_size=1, shuffle=False, pin_memory=True)
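
For validation and testing, the dataset class splits each paragraph into windows whose start positions are `doc_stride` tokens apart. The sketch below reproduces only that slicing on a toy list of token ids, using small illustrative values instead of the class's `max_paragraph_len = 400` and `doc_stride = 32`. The helper name `make_windows` is hypothetical.

```python
def make_windows(paragraph_ids, max_paragraph_len, doc_stride):
    """Slice a tokenized paragraph into windows that start every doc_stride tokens."""
    return [paragraph_ids[i: i + max_paragraph_len]
            for i in range(0, len(paragraph_ids), doc_stride)]


toy_paragraph_ids = list(range(10))  # stand-in for real token ids

windows = make_windows(toy_paragraph_ids, max_paragraph_len=4, doc_stride=2)
for window in windows:
    print(window)
```

A smaller `doc_stride` produces more windows with more overlap, so an answer is less likely to be cut at a window boundary, at the cost of more forward passes per question.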

## Function for Evaluation

In [None]:
def evaluate(data, output):
    ##### TODO: Postprocessing #####
    # There is a bug and room for improvement in postprocessing 
    # Hint: Open your prediction file to see what is wrong 
    
    answer = ''
    max_prob = float('-inf')
    num_of_windows = data[0].shape[1]
    # print(data[0].shape)
    for k in range(num_of_windows):
        # Obtain answer by choosing the most probable start position / end position
        start_prob, start_index = torch.max(output.start_logits[k], dim=0)
        end_prob, end_index = torch.max(output.end_logits[k], dim=0)
        # print("start_index", start_index)
        # print("end_index", end_index)
        # Probability of answer is calculated as sum of start_prob and end_prob
        prob = start_prob + end_prob
        
        # Replace answer if calculated probability is larger than previous windows
        if prob > max_prob and (end_index - start_index) < 30 and end_index >= start_index:  # ">=" keeps single-token answers
            max_prob = prob
            # Convert tokens to chars (e.g. [1920, 7032] --> "大 金")
            answer = tokenizer.decode(data[0][0][k][start_index : end_index + 1])
    

    # Remove spaces in answer (e.g. "大 金" --> "大金")
    return answer.replace(' ','')
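
`evaluate` picks the best start and the best end independently in each window and then discards the window if the pair is invalid. An alternative, shown here as a sketch and not as the method used in this notebook, is to search jointly over all pairs with `start <= end` and a bounded length, so a window still contributes its best valid span. The function `best_span` and the toy logits are hypothetical. In real use the search would also be restricted to paragraph tokens.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (score, start, end) of the highest-scoring span with start <= end."""
    best = (float("-inf"), None, None)
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_score + end_logits[end]
            if score > best[0]:
                best = (score, start, end)
    return best


# Toy logits where the independent argmaxes give start=3, end=0 (an invalid span)
toy_start_logits = [0.1, 2.0, 0.3, 5.0]
toy_end_logits = [4.0, 0.2, 3.0, 0.5]

score, start, end = best_span(toy_start_logits, toy_end_logits)
print(score, start, end)
```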

## Training
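
The training cell uses gradient accumulation (`accumulation_steps = 8`). The idea can be checked without any deep learning framework: if each mini-batch gradient is divided by the number of accumulation steps before being summed, the result equals the gradient of one large batch (for equally sized mini-batches). The sketch below shows this for a one-parameter linear model with mean squared error. The data points and the helper `grad_mse` are made up for illustration.

```python
def grad_mse(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


toy_data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5
accumulation_steps = 2
micro_batches = [toy_data[0:2], toy_data[2:4]]

# Accumulate scaled gradients, as loss / accumulation_steps would do in backward()
accumulated_grad = 0.0
for micro_batch in micro_batches:
    accumulated_grad += grad_mse(w, micro_batch) / accumulation_steps

full_batch_grad = grad_mse(w, toy_data)
print(accumulated_grad, full_batch_grad)
```

Without the division, the accumulated gradient would be `accumulation_steps` times larger than the full-batch gradient.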

In [None]:
from transformers import get_linear_schedule_with_warmup
num_epoch = 5
validation = True
logging_step = 100
accumulation_steps = 8
learning_rate = 5e-5
optimizer = AdamW(model.parameters(), lr=learning_rate)

total_steps = len(train_loader) * num_epoch
warm_up_ratio = 0.01
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = (warm_up_ratio * total_steps)//accumulation_steps, num_training_steps = total_steps//accumulation_steps) 

if fp16_training:
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader) 

model.train()

print("Start Training ...")

step = 1
for epoch in range(num_epoch):
    
    train_loss = train_acc = 0
    
    for data in tqdm(train_loader):	
      
      # Load all data into GPU
      data = [i.to(device) for i in data]
      
      # Model inputs: input_ids, token_type_ids, attention_mask, start_positions, end_positions (Note: only "input_ids" is mandatory)
      # Model outputs: start_logits, end_logits, loss (return when start_positions/end_positions are provided)  
      output = model(input_ids=data[0], token_type_ids=data[1], attention_mask=data[2], start_positions=data[3], end_positions=data[4])

      # Choose the most probable start position / end position
      start_index = torch.argmax(output.start_logits, dim=1)
      end_index = torch.argmax(output.end_logits, dim=1)
      
      # Prediction is correct only if both start_index and end_index are correct
      train_acc += ((start_index == data[3]) & (end_index == data[4])).float().mean()
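      # Scale the loss so gradients accumulated over `accumulation_steps` mini-batches are averaged, not summed
      loss = output.loss / accumulation_steps
      # Detach before logging so the running total does not keep each step's autograd graph alive
      train_loss += loss.detach()
      
      if fp16_training:
          accelerator.backward(loss)
      else:
          loss.backward()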
      
      if (step) % accumulation_steps == 0:             # Wait for several backward steps
        optimizer.step()                         # Now we can do an optimizer step
        model.zero_grad()                         # Reset gradients tensors
        scheduler.step()
      
      step += 1

      ##### TODO: Apply linear learning rate decay #####
      # optimizer.param_groups[0]['lr'] -= learning_rate / total_steps
      
      # Print training loss and accuracy over past logging step
      if step % logging_step == 0:
          print(f"Learning Rate {optimizer.param_groups[0]['lr']:.8f} | Epoch {epoch + 1} | Step {step} | loss = {train_loss.item() / logging_step:.3f}, acc = {train_acc / logging_step:.3f}")
          train_loss = train_acc = 0

      if validation and step % 870 == 0 :
          print("Evaluating Dev Set ...")
          model.eval()
          with torch.no_grad():
              dev_acc = 0
              for i, data in enumerate(tqdm(dev_loader)):
                  output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                        attention_mask=data[2].squeeze(dim=0).to(device))
                  # prediction is correct only if answer text exactly matches
                  dev_acc += evaluate(data, output) == dev_questions_final[data[3]]["answer_text"]
              print(f"Validation | Epoch {epoch + 1} | acc = {dev_acc / len(dev_loader):.3f}")
              print("Saving Model ...")
              model_save_dir = f"saved_model_roberta_devless_less_0.1_0.1_400_{epoch+1}_{dev_acc / len(dev_loader):.3f}" 
              model.save_pretrained(model_save_dir)
          model.train()
      

# Save a model and its configuration file to the directory "saved_model"
# i.e. there are two files under the directory "saved_model": "pytorch_model.bin" and "config.json"
# Saved model can be re-loaded using `model = BertForQuestionAnswering.from_pretrained("saved_model")`
# print("Saving Model ...")
# model_save_dir = "saved_model" 
# model.save_pretrained(model_save_dir)

Start Training ...


  0%|          | 0/8911 [00:00<?, ?it/s]

Learning Rate 0.00001091 | Epoch 1 | Step 100 | loss = 0.747, acc = 0.000
Learning Rate 0.00002182 | Epoch 1 | Step 200 | loss = 0.592, acc = 0.050
Learning Rate 0.00003364 | Epoch 1 | Step 300 | loss = 0.274, acc = 0.325
Learning Rate 0.00004455 | Epoch 1 | Step 400 | loss = 0.172, acc = 0.502
Learning Rate 0.00004994 | Epoch 1 | Step 500 | loss = 0.161, acc = 0.590
Learning Rate 0.00004983 | Epoch 1 | Step 600 | loss = 0.128, acc = 0.630
Learning Rate 0.00004971 | Epoch 1 | Step 700 | loss = 0.126, acc = 0.645
Learning Rate 0.00004960 | Epoch 1 | Step 800 | loss = 0.108, acc = 0.683
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 1 | acc = 0.757
Saving Model ...
Learning Rate 0.00004948 | Epoch 1 | Step 900 | loss = 0.123, acc = 0.632
Learning Rate 0.00004937 | Epoch 1 | Step 1000 | loss = 0.125, acc = 0.650
Learning Rate 0.00004926 | Epoch 1 | Step 1100 | loss = 0.116, acc = 0.662
Learning Rate 0.00004915 | Epoch 1 | Step 1200 | loss = 0.089, acc = 0.745
Learning Rate 0.00004903 | Epoch 1 | Step 1300 | loss = 0.112, acc = 0.667
Learning Rate 0.00004892 | Epoch 1 | Step 1400 | loss = 0.097, acc = 0.715
Learning Rate 0.00004880 | Epoch 1 | Step 1500 | loss = 0.102, acc = 0.715
Learning Rate 0.00004869 | Epoch 1 | Step 1600 | loss = 0.109, acc = 0.680
Learning Rate 0.00004858 | Epoch 1 | Step 1700 | loss = 0.091, acc = 0.712
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 1 | acc = 0.740
Saving Model ...
Learning Rate 0.00004847 | Epoch 1 | Step 1800 | loss = 0.108, acc = 0.680
Learning Rate 0.00004835 | Epoch 1 | Step 1900 | loss = 0.099, acc = 0.705
343
346
None
None
--------------
Learning Rate 0.00004824 | Epoch 1 | Step 2000 | loss = 0.091, acc = 0.717
Learning Rate 0.00004812 | Epoch 1 | Step 2100 | loss = 0.086, acc = 0.760
Learning Rate 0.00004801 | Epoch 1 | Step 2200 | loss = 0.102, acc = 0.717
Learning Rate 0.00004790 | Epoch 1 | Step 2300 | loss = 0.094, acc = 0.717
Learning Rate 0.00004779 | Epoch 1 | Step 2400 | loss = 0.082, acc = 0.730
Learning Rate 0.00004767 | Epoch 1 | Step 2500 | loss = 0.086, acc = 0.707
Learning Rate 0.00004756 | Epoch 1 | Step 2600 | loss = 0.083, acc = 0.712
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 1 | acc = 0.763
Saving Model ...
Learning Rate 0.00004744 | Epoch 1 | Step 2700 | loss = 0.095, acc = 0.720
Learning Rate 0.00004733 | Epoch 1 | Step 2800 | loss = 0.088, acc = 0.740
Learning Rate 0.00004722 | Epoch 1 | Step 2900 | loss = 0.094, acc = 0.730
Learning Rate 0.00004711 | Epoch 1 | Step 3000 | loss = 0.095, acc = 0.700
Learning Rate 0.00004699 | Epoch 1 | Step 3100 | loss = 0.079, acc = 0.748
Learning Rate 0.00004688 | Epoch 1 | Step 3200 | loss = 0.087, acc = 0.767
Learning Rate 0.00004676 | Epoch 1 | Step 3300 | loss = 0.092, acc = 0.732
Learning Rate 0.00004665 | Epoch 1 | Step 3400 | loss = 0.087, acc = 0.735
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 1 | acc = 0.780
Saving Model ...
Learning Rate 0.00004654 | Epoch 1 | Step 3500 | loss = 0.064, acc = 0.790
Learning Rate 0.00004643 | Epoch 1 | Step 3600 | loss = 0.076, acc = 0.787
Learning Rate 0.00004631 | Epoch 1 | Step 3700 | loss = 0.070, acc = 0.782
Learning Rate 0.00004620 | Epoch 1 | Step 3800 | loss = 0.071, acc = 0.782
Learning Rate 0.00004608 | Epoch 1 | Step 3900 | loss = 0.088, acc = 0.720
Learning Rate 0.00004597 | Epoch 1 | Step 4000 | loss = 0.080, acc = 0.762
Learning Rate 0.00004586 | Epoch 1 | Step 4100 | loss = 0.092, acc = 0.743
Learning Rate 0.00004575 | Epoch 1 | Step 4200 | loss = 0.072, acc = 0.770
Learning Rate 0.00004563 | Epoch 1 | Step 4300 | loss = 0.068, acc = 0.785
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 1 | acc = 0.791
Saving Model ...
Learning Rate 0.00004552 | Epoch 1 | Step 4400 | loss = 0.092, acc = 0.745
Learning Rate 0.00004540 | Epoch 1 | Step 4500 | loss = 0.068, acc = 0.800
Learning Rate 0.00004529 | Epoch 1 | Step 4600 | loss = 0.075, acc = 0.770
Learning Rate 0.00004518 | Epoch 1 | Step 4700 | loss = 0.090, acc = 0.732
Learning Rate 0.00004507 | Epoch 1 | Step 4800 | loss = 0.078, acc = 0.767
Learning Rate 0.00004495 | Epoch 1 | Step 4900 | loss = 0.079, acc = 0.803
Learning Rate 0.00004484 | Epoch 1 | Step 5000 | loss = 0.072, acc = 0.772
Learning Rate 0.00004472 | Epoch 1 | Step 5100 | loss = 0.076, acc = 0.745
Learning Rate 0.00004461 | Epoch 1 | Step 5200 | loss = 0.087, acc = 0.757
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 1 | acc = 0.751
Saving Model ...
Learning Rate 0.00004450 | Epoch 1 | Step 5300 | loss = 0.086, acc = 0.738
Learning Rate 0.00004439 | Epoch 1 | Step 5400 | loss = 0.061, acc = 0.780
Learning Rate 0.00004427 | Epoch 1 | Step 5500 | loss = 0.073, acc = 0.800
Learning Rate 0.00004416 | Epoch 1 | Step 5600 | loss = 0.079, acc = 0.775
Learning Rate 0.00004404 | Epoch 1 | Step 5700 | loss = 0.079, acc = 0.740
Learning Rate 0.00004393 | Epoch 1 | Step 5800 | loss = 0.063, acc = 0.790
Learning Rate 0.00004382 | Epoch 1 | Step 5900 | loss = 0.079, acc = 0.775
Learning Rate 0.00004371 | Epoch 1 | Step 6000 | loss = 0.064, acc = 0.803
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 1 | acc = 0.780
Saving Model ...
Learning Rate 0.00004359 | Epoch 1 | Step 6100 | loss = 0.075, acc = 0.787
Learning Rate 0.00004348 | Epoch 1 | Step 6200 | loss = 0.072, acc = 0.748
Learning Rate 0.00004336 | Epoch 1 | Step 6300 | loss = 0.070, acc = 0.772
Learning Rate 0.00004325 | Epoch 1 | Step 6400 | loss = 0.066, acc = 0.780
Learning Rate 0.00004314 | Epoch 1 | Step 6500 | loss = 0.062, acc = 0.797
Learning Rate 0.00004303 | Epoch 1 | Step 6600 | loss = 0.075, acc = 0.787
Learning Rate 0.00004291 | Epoch 1 | Step 6700 | loss = 0.082, acc = 0.755
Learning Rate 0.00004280 | Epoch 1 | Step 6800 | loss = 0.069, acc = 0.795
Learning Rate 0.00004268 | Epoch 1 | Step 6900 | loss = 0.066, acc = 0.780
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 1 | acc = 0.797
Saving Model ...
Learning Rate 0.00004257 | Epoch 1 | Step 7000 | loss = 0.081, acc = 0.730
Learning Rate 0.00004246 | Epoch 1 | Step 7100 | loss = 0.078, acc = 0.743
Learning Rate 0.00004235 | Epoch 1 | Step 7200 | loss = 0.080, acc = 0.750
Learning Rate 0.00004223 | Epoch 1 | Step 7300 | loss = 0.070, acc = 0.765
Learning Rate 0.00004212 | Epoch 1 | Step 7400 | loss = 0.053, acc = 0.835
Learning Rate 0.00004200 | Epoch 1 | Step 7500 | loss = 0.081, acc = 0.755
Learning Rate 0.00004189 | Epoch 1 | Step 7600 | loss = 0.080, acc = 0.750
Learning Rate 0.00004178 | Epoch 1 | Step 7700 | loss = 0.066, acc = 0.780
Learning Rate 0.00004167 | Epoch 1 | Step 7800 | loss = 0.071, acc = 0.815
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 1 | acc = 0.808
Saving Model ...
Learning Rate 0.00004155 | Epoch 1 | Step 7900 | loss = 0.064, acc = 0.820
Learning Rate 0.00004144 | Epoch 1 | Step 8000 | loss = 0.074, acc = 0.797
Learning Rate 0.00004132 | Epoch 1 | Step 8100 | loss = 0.071, acc = 0.760
Learning Rate 0.00004121 | Epoch 1 | Step 8200 | loss = 0.066, acc = 0.800
Learning Rate 0.00004110 | Epoch 1 | Step 8300 | loss = 0.070, acc = 0.775
Learning Rate 0.00004099 | Epoch 1 | Step 8400 | loss = 0.069, acc = 0.787
Learning Rate 0.00004087 | Epoch 1 | Step 8500 | loss = 0.066, acc = 0.785
Learning Rate 0.00004076 | Epoch 1 | Step 8600 | loss = 0.069, acc = 0.795
Learning Rate 0.00004064 | Epoch 1 | Step 8700 | loss = 0.052, acc = 0.830
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 1 | acc = 0.785
Saving Model ...
Learning Rate 0.00004053 | Epoch 1 | Step 8800 | loss = 0.058, acc = 0.803
Learning Rate 0.00004042 | Epoch 1 | Step 8900 | loss = 0.065, acc = 0.797


  0%|          | 0/8911 [00:00<?, ?it/s]

Learning Rate 0.00004031 | Epoch 2 | Step 9000 | loss = 0.040, acc = 0.750
Learning Rate 0.00004019 | Epoch 2 | Step 9100 | loss = 0.028, acc = 0.897
Learning Rate 0.00004008 | Epoch 2 | Step 9200 | loss = 0.038, acc = 0.840
Learning Rate 0.00003996 | Epoch 2 | Step 9300 | loss = 0.046, acc = 0.845
Learning Rate 0.00003985 | Epoch 2 | Step 9400 | loss = 0.040, acc = 0.835
Learning Rate 0.00003974 | Epoch 2 | Step 9500 | loss = 0.039, acc = 0.857
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 2 | acc = 0.751
Saving Model ...
Learning Rate 0.00003963 | Epoch 2 | Step 9600 | loss = 0.045, acc = 0.845
Learning Rate 0.00003951 | Epoch 2 | Step 9700 | loss = 0.043, acc = 0.865
Learning Rate 0.00003940 | Epoch 2 | Step 9800 | loss = 0.028, acc = 0.897
Learning Rate 0.00003928 | Epoch 2 | Step 9900 | loss = 0.048, acc = 0.817
Learning Rate 0.00003917 | Epoch 2 | Step 10000 | loss = 0.045, acc = 0.870
Learning Rate 0.00003906 | Epoch 2 | Step 10100 | loss = 0.045, acc = 0.830
Learning Rate 0.00003895 | Epoch 2 | Step 10200 | loss = 0.043, acc = 0.842
Learning Rate 0.00003883 | Epoch 2 | Step 10300 | loss = 0.040, acc = 0.860
Learning Rate 0.00003872 | Epoch 2 | Step 10400 | loss = 0.043, acc = 0.873
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 2 | acc = 0.780
Saving Model ...
Learning Rate 0.00003860 | Epoch 2 | Step 10500 | loss = 0.039, acc = 0.880
Learning Rate 0.00003849 | Epoch 2 | Step 10600 | loss = 0.044, acc = 0.845
Learning Rate 0.00003838 | Epoch 2 | Step 10700 | loss = 0.041, acc = 0.873
Learning Rate 0.00003827 | Epoch 2 | Step 10800 | loss = 0.033, acc = 0.875
Learning Rate 0.00003815 | Epoch 2 | Step 10900 | loss = 0.047, acc = 0.873
Learning Rate 0.00003804 | Epoch 2 | Step 11000 | loss = 0.034, acc = 0.877
Learning Rate 0.00003792 | Epoch 2 | Step 11100 | loss = 0.050, acc = 0.855
Learning Rate 0.00003781 | Epoch 2 | Step 11200 | loss = 0.043, acc = 0.840
Learning Rate 0.00003769 | Epoch 2 | Step 11300 | loss = 0.031, acc = 0.912
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 2 | acc = 0.785
Saving Model ...
Learning Rate 0.00003759 | Epoch 2 | Step 11400 | loss = 0.045, acc = 0.845
Learning Rate 0.00003747 | Epoch 2 | Step 11500 | loss = 0.043, acc = 0.852
Learning Rate 0.00003736 | Epoch 2 | Step 11600 | loss = 0.045, acc = 0.850
Learning Rate 0.00003724 | Epoch 2 | Step 11700 | loss = 0.042, acc = 0.857
Learning Rate 0.00003713 | Epoch 2 | Step 11800 | loss = 0.048, acc = 0.845
Learning Rate 0.00003701 | Epoch 2 | Step 11900 | loss = 0.036, acc = 0.870
Learning Rate 0.00003691 | Epoch 2 | Step 12000 | loss = 0.032, acc = 0.890
Learning Rate 0.00003679 | Epoch 2 | Step 12100 | loss = 0.028, acc = 0.880
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 2 | acc = 0.808
Saving Model ...
Learning Rate 0.00003668 | Epoch 2 | Step 12200 | loss = 0.037, acc = 0.880
Learning Rate 0.00003656 | Epoch 2 | Step 12300 | loss = 0.042, acc = 0.835
Learning Rate 0.00003645 | Epoch 2 | Step 12400 | loss = 0.041, acc = 0.875
Learning Rate 0.00003633 | Epoch 2 | Step 12500 | loss = 0.034, acc = 0.882
Learning Rate 0.00003623 | Epoch 2 | Step 12600 | loss = 0.037, acc = 0.873
Learning Rate 0.00003611 | Epoch 2 | Step 12700 | loss = 0.053, acc = 0.835
Learning Rate 0.00003600 | Epoch 2 | Step 12800 | loss = 0.030, acc = 0.895
Learning Rate 0.00003588 | Epoch 2 | Step 12900 | loss = 0.042, acc = 0.857
Learning Rate 0.00003577 | Epoch 2 | Step 13000 | loss = 0.038, acc = 0.877
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 2 | acc = 0.831
Saving Model ...
Learning Rate 0.00003565 | Epoch 2 | Step 13100 | loss = 0.042, acc = 0.870
Learning Rate 0.00003555 | Epoch 2 | Step 13200 | loss = 0.045, acc = 0.860
Learning Rate 0.00003543 | Epoch 2 | Step 13300 | loss = 0.042, acc = 0.837
Learning Rate 0.00003532 | Epoch 2 | Step 13400 | loss = 0.031, acc = 0.877
Learning Rate 0.00003520 | Epoch 2 | Step 13500 | loss = 0.039, acc = 0.885
Learning Rate 0.00003509 | Epoch 2 | Step 13600 | loss = 0.042, acc = 0.850
Learning Rate 0.00003497 | Epoch 2 | Step 13700 | loss = 0.039, acc = 0.870
Learning Rate 0.00003487 | Epoch 2 | Step 13800 | loss = 0.035, acc = 0.865
343
346
None
None
--------------
Learning Rate 0.00003475 | Epoch 2 | Step 13900 | loss = 0.051, acc = 0.845
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 2 | acc = 0.842
Saving Model ...
Learning Rate 0.00003464 | Epoch 2 | Step 14000 | loss = 0.043, acc = 0.857
Learning Rate 0.00003452 | Epoch 2 | Step 14100 | loss = 0.042, acc = 0.847
Learning Rate 0.00003441 | Epoch 2 | Step 14200 | loss = 0.046, acc = 0.847
Learning Rate 0.00003429 | Epoch 2 | Step 14300 | loss = 0.046, acc = 0.852
Learning Rate 0.00003419 | Epoch 2 | Step 14400 | loss = 0.040, acc = 0.837
Learning Rate 0.00003407 | Epoch 2 | Step 14500 | loss = 0.035, acc = 0.873
Learning Rate 0.00003396 | Epoch 2 | Step 14600 | loss = 0.036, acc = 0.880
Learning Rate 0.00003384 | Epoch 2 | Step 14700 | loss = 0.040, acc = 0.873
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 2 | acc = 0.831
Saving Model ...
Learning Rate 0.00003373 | Epoch 2 | Step 14800 | loss = 0.036, acc = 0.875
Learning Rate 0.00003361 | Epoch 2 | Step 14900 | loss = 0.043, acc = 0.850
Learning Rate 0.00003351 | Epoch 2 | Step 15000 | loss = 0.047, acc = 0.832
Learning Rate 0.00003339 | Epoch 2 | Step 15100 | loss = 0.037, acc = 0.882
Learning Rate 0.00003328 | Epoch 2 | Step 15200 | loss = 0.040, acc = 0.857
Learning Rate 0.00003316 | Epoch 2 | Step 15300 | loss = 0.032, acc = 0.880
Learning Rate 0.00003305 | Epoch 2 | Step 15400 | loss = 0.048, acc = 0.845
Learning Rate 0.00003293 | Epoch 2 | Step 15500 | loss = 0.042, acc = 0.860
Learning Rate 0.00003283 | Epoch 2 | Step 15600 | loss = 0.035, acc = 0.868
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 2 | acc = 0.825
Saving Model ...
Learning Rate 0.00003271 | Epoch 2 | Step 15700 | loss = 0.029, acc = 0.885
Learning Rate 0.00003260 | Epoch 2 | Step 15800 | loss = 0.046, acc = 0.862
Learning Rate 0.00003248 | Epoch 2 | Step 15900 | loss = 0.047, acc = 0.847
Learning Rate 0.00003237 | Epoch 2 | Step 16000 | loss = 0.043, acc = 0.860
Learning Rate 0.00003225 | Epoch 2 | Step 16100 | loss = 0.034, acc = 0.875
Learning Rate 0.00003215 | Epoch 2 | Step 16200 | loss = 0.032, acc = 0.907
Learning Rate 0.00003203 | Epoch 2 | Step 16300 | loss = 0.027, acc = 0.905
Learning Rate 0.00003192 | Epoch 2 | Step 16400 | loss = 0.042, acc = 0.862
Learning Rate 0.00003180 | Epoch 2 | Step 16500 | loss = 0.040, acc = 0.847
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 2 | acc = 0.825
Saving Model ...
Learning Rate 0.00003169 | Epoch 2 | Step 16600 | loss = 0.037, acc = 0.852
Learning Rate 0.00003157 | Epoch 2 | Step 16700 | loss = 0.044, acc = 0.847
Learning Rate 0.00003147 | Epoch 2 | Step 16800 | loss = 0.042, acc = 0.882
Learning Rate 0.00003135 | Epoch 2 | Step 16900 | loss = 0.042, acc = 0.875
Learning Rate 0.00003124 | Epoch 2 | Step 17000 | loss = 0.038, acc = 0.865
Learning Rate 0.00003112 | Epoch 2 | Step 17100 | loss = 0.038, acc = 0.842
Learning Rate 0.00003101 | Epoch 2 | Step 17200 | loss = 0.037, acc = 0.880
Learning Rate 0.00003089 | Epoch 2 | Step 17300 | loss = 0.036, acc = 0.862
Learning Rate 0.00003079 | Epoch 2 | Step 17400 | loss = 0.039, acc = 0.865
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 2 | acc = 0.819
Saving Model ...
Learning Rate 0.00003067 | Epoch 2 | Step 17500 | loss = 0.031, acc = 0.890
Learning Rate 0.00003056 | Epoch 2 | Step 17600 | loss = 0.028, acc = 0.890
Learning Rate 0.00003044 | Epoch 2 | Step 17700 | loss = 0.034, acc = 0.885
Learning Rate 0.00003033 | Epoch 2 | Step 17800 | loss = 0.039, acc = 0.885


  0%|          | 0/8911 [00:00<?, ?it/s]

Learning Rate 0.00003021 | Epoch 3 | Step 17900 | loss = 0.016, acc = 0.707
Learning Rate 0.00003011 | Epoch 3 | Step 18000 | loss = 0.023, acc = 0.938
Learning Rate 0.00002999 | Epoch 3 | Step 18100 | loss = 0.020, acc = 0.920
Learning Rate 0.00002988 | Epoch 3 | Step 18200 | loss = 0.022, acc = 0.927
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 3 | acc = 0.836
Saving Model ...
Learning Rate 0.00002976 | Epoch 3 | Step 18300 | loss = 0.023, acc = 0.925
Learning Rate 0.00002965 | Epoch 3 | Step 18400 | loss = 0.014, acc = 0.950
Learning Rate 0.00002953 | Epoch 3 | Step 18500 | loss = 0.014, acc = 0.950
Learning Rate 0.00002943 | Epoch 3 | Step 18600 | loss = 0.021, acc = 0.933
Learning Rate 0.00002931 | Epoch 3 | Step 18700 | loss = 0.024, acc = 0.920
Learning Rate 0.00002920 | Epoch 3 | Step 18800 | loss = 0.028, acc = 0.915
Learning Rate 0.00002908 | Epoch 3 | Step 18900 | loss = 0.017, acc = 0.940
Learning Rate 0.00002897 | Epoch 3 | Step 19000 | loss = 0.017, acc = 0.938
Learning Rate 0.00002885 | Epoch 3 | Step 19100 | loss = 0.032, acc = 0.900
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 3 | acc = 0.819
Saving Model ...
Learning Rate 0.00002875 | Epoch 3 | Step 19200 | loss = 0.018, acc = 0.945
Learning Rate 0.00002863 | Epoch 3 | Step 19300 | loss = 0.023, acc = 0.927
Learning Rate 0.00002852 | Epoch 3 | Step 19400 | loss = 0.017, acc = 0.935
Learning Rate 0.00002840 | Epoch 3 | Step 19500 | loss = 0.017, acc = 0.938
Learning Rate 0.00002829 | Epoch 3 | Step 19600 | loss = 0.023, acc = 0.925
Learning Rate 0.00002817 | Epoch 3 | Step 19700 | loss = 0.016, acc = 0.938
Learning Rate 0.00002806 | Epoch 3 | Step 19800 | loss = 0.024, acc = 0.915
Learning Rate 0.00002795 | Epoch 3 | Step 19900 | loss = 0.013, acc = 0.957
Learning Rate 0.00002784 | Epoch 3 | Step 20000 | loss = 0.021, acc = 0.912
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 3 | acc = 0.808
Saving Model ...
Learning Rate 0.00002772 | Epoch 3 | Step 20100 | loss = 0.021, acc = 0.915
Learning Rate 0.00002761 | Epoch 3 | Step 20200 | loss = 0.013, acc = 0.967
Learning Rate 0.00002749 | Epoch 3 | Step 20300 | loss = 0.018, acc = 0.920
Learning Rate 0.00002738 | Epoch 3 | Step 20400 | loss = 0.024, acc = 0.922
Learning Rate 0.00002727 | Epoch 3 | Step 20500 | loss = 0.016, acc = 0.930
Learning Rate 0.00002716 | Epoch 3 | Step 20600 | loss = 0.018, acc = 0.940
Learning Rate 0.00002704 | Epoch 3 | Step 20700 | loss = 0.021, acc = 0.915
Learning Rate 0.00002693 | Epoch 3 | Step 20800 | loss = 0.019, acc = 0.925
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 3 | acc = 0.819
Saving Model ...
Learning Rate 0.00002681 | Epoch 3 | Step 20900 | loss = 0.019, acc = 0.938
Learning Rate 0.00002670 | Epoch 3 | Step 21000 | loss = 0.022, acc = 0.920
Learning Rate 0.00002659 | Epoch 3 | Step 21100 | loss = 0.025, acc = 0.920
Learning Rate 0.00002648 | Epoch 3 | Step 21200 | loss = 0.020, acc = 0.927
343
346
None
None
--------------
Learning Rate 0.00002636 | Epoch 3 | Step 21300 | loss = 0.024, acc = 0.915
Learning Rate 0.00002625 | Epoch 3 | Step 21400 | loss = 0.025, acc = 0.900
Learning Rate 0.00002613 | Epoch 3 | Step 21500 | loss = 0.015, acc = 0.927
Learning Rate 0.00002602 | Epoch 3 | Step 21600 | loss = 0.018, acc = 0.933
Learning Rate 0.00002591 | Epoch 3 | Step 21700 | loss = 0.014, acc = 0.942
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 3 | acc = 0.836
Saving Model ...
Learning Rate 0.00002580 | Epoch 3 | Step 21800 | loss = 0.016, acc = 0.952
Learning Rate 0.00002568 | Epoch 3 | Step 21900 | loss = 0.026, acc = 0.902
Learning Rate 0.00002557 | Epoch 3 | Step 22000 | loss = 0.027, acc = 0.920
Learning Rate 0.00002545 | Epoch 3 | Step 22100 | loss = 0.024, acc = 0.910
Learning Rate 0.00002534 | Epoch 3 | Step 22200 | loss = 0.021, acc = 0.925
Learning Rate 0.00002523 | Epoch 3 | Step 22300 | loss = 0.020, acc = 0.935
Learning Rate 0.00002512 | Epoch 3 | Step 22400 | loss = 0.018, acc = 0.938
Learning Rate 0.00002500 | Epoch 3 | Step 22500 | loss = 0.025, acc = 0.905
Learning Rate 0.00002489 | Epoch 3 | Step 22600 | loss = 0.030, acc = 0.897
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 3 | acc = 0.785
Saving Model ...
Learning Rate 0.00002477 | Epoch 3 | Step 22700 | loss = 0.022, acc = 0.935
Learning Rate 0.00002466 | Epoch 3 | Step 22800 | loss = 0.021, acc = 0.922
Learning Rate 0.00002455 | Epoch 3 | Step 22900 | loss = 0.021, acc = 0.922
Learning Rate 0.00002444 | Epoch 3 | Step 23000 | loss = 0.017, acc = 0.925
Learning Rate 0.00002432 | Epoch 3 | Step 23100 | loss = 0.025, acc = 0.900
Learning Rate 0.00002421 | Epoch 3 | Step 23200 | loss = 0.021, acc = 0.930
Learning Rate 0.00002409 | Epoch 3 | Step 23300 | loss = 0.028, acc = 0.920
Learning Rate 0.00002398 | Epoch 3 | Step 23400 | loss = 0.022, acc = 0.922
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 3 | acc = 0.842
Saving Model ...
Learning Rate 0.00002387 | Epoch 3 | Step 23500 | loss = 0.018, acc = 0.945
Learning Rate 0.00002376 | Epoch 3 | Step 23600 | loss = 0.015, acc = 0.930
Learning Rate 0.00002364 | Epoch 3 | Step 23700 | loss = 0.022, acc = 0.935
Learning Rate 0.00002353 | Epoch 3 | Step 23800 | loss = 0.015, acc = 0.942
Learning Rate 0.00002341 | Epoch 3 | Step 23900 | loss = 0.019, acc = 0.925
Learning Rate 0.00002330 | Epoch 3 | Step 24000 | loss = 0.019, acc = 0.927
Learning Rate 0.00002319 | Epoch 3 | Step 24100 | loss = 0.026, acc = 0.917
Learning Rate 0.00002308 | Epoch 3 | Step 24200 | loss = 0.014, acc = 0.945
Learning Rate 0.00002296 | Epoch 3 | Step 24300 | loss = 0.018, acc = 0.925
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 3 | acc = 0.825
Saving Model ...
Learning Rate 0.00002285 | Epoch 3 | Step 24400 | loss = 0.018, acc = 0.938
Learning Rate 0.00002273 | Epoch 3 | Step 24500 | loss = 0.024, acc = 0.942
Learning Rate 0.00002262 | Epoch 3 | Step 24600 | loss = 0.015, acc = 0.947
Learning Rate 0.00002251 | Epoch 3 | Step 24700 | loss = 0.018, acc = 0.915
Learning Rate 0.00002240 | Epoch 3 | Step 24800 | loss = 0.013, acc = 0.947
Learning Rate 0.00002228 | Epoch 3 | Step 24900 | loss = 0.011, acc = 0.960
Learning Rate 0.00002217 | Epoch 3 | Step 25000 | loss = 0.018, acc = 0.942
Learning Rate 0.00002205 | Epoch 3 | Step 25100 | loss = 0.015, acc = 0.950
Learning Rate 0.00002194 | Epoch 3 | Step 25200 | loss = 0.014, acc = 0.942
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 3 | acc = 0.831
Saving Model ...
Learning Rate 0.00002183 | Epoch 3 | Step 25300 | loss = 0.016, acc = 0.935
Learning Rate 0.00002172 | Epoch 3 | Step 25400 | loss = 0.020, acc = 0.933
Learning Rate 0.00002160 | Epoch 3 | Step 25500 | loss = 0.018, acc = 0.925
Learning Rate 0.00002149 | Epoch 3 | Step 25600 | loss = 0.019, acc = 0.927
Learning Rate 0.00002137 | Epoch 3 | Step 25700 | loss = 0.017, acc = 0.938
Learning Rate 0.00002126 | Epoch 3 | Step 25800 | loss = 0.016, acc = 0.935
Learning Rate 0.00002115 | Epoch 3 | Step 25900 | loss = 0.025, acc = 0.930
Learning Rate 0.00002104 | Epoch 3 | Step 26000 | loss = 0.013, acc = 0.950
Learning Rate 0.00002092 | Epoch 3 | Step 26100 | loss = 0.016, acc = 0.947
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 3 | acc = 0.842
Saving Model ...
Learning Rate 0.00002081 | Epoch 3 | Step 26200 | loss = 0.018, acc = 0.942
Learning Rate 0.00002069 | Epoch 3 | Step 26300 | loss = 0.016, acc = 0.940
Learning Rate 0.00002058 | Epoch 3 | Step 26400 | loss = 0.022, acc = 0.922
Learning Rate 0.00002047 | Epoch 3 | Step 26500 | loss = 0.018, acc = 0.938
Learning Rate 0.00002036 | Epoch 3 | Step 26600 | loss = 0.026, acc = 0.920
Learning Rate 0.00002024 | Epoch 3 | Step 26700 | loss = 0.019, acc = 0.938


  0%|          | 0/8911 [00:00<?, ?it/s]

Learning Rate 0.00002013 | Epoch 4 | Step 26800 | loss = 0.007, acc = 0.632
Learning Rate 0.00002001 | Epoch 4 | Step 26900 | loss = 0.010, acc = 0.945
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 4 | acc = 0.842
Saving Model ...
Learning Rate 0.00001990 | Epoch 4 | Step 27000 | loss = 0.009, acc = 0.970
Learning Rate 0.00001979 | Epoch 4 | Step 27100 | loss = 0.005, acc = 0.980
Learning Rate 0.00001968 | Epoch 4 | Step 27200 | loss = 0.005, acc = 0.982
Learning Rate 0.00001956 | Epoch 4 | Step 27300 | loss = 0.013, acc = 0.950
Learning Rate 0.00001945 | Epoch 4 | Step 27400 | loss = 0.007, acc = 0.967
Learning Rate 0.00001933 | Epoch 4 | Step 27500 | loss = 0.016, acc = 0.945
Learning Rate 0.00001922 | Epoch 4 | Step 27600 | loss = 0.012, acc = 0.980
Learning Rate 0.00001911 | Epoch 4 | Step 27700 | loss = 0.012, acc = 0.955
343
346
None
None
--------------
Learning Rate 0.00001900 | Epoch 4 | Step 27800 | loss = 0.013, acc = 0.972
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 4 | acc = 0.847
Saving Model ...
Learning Rate 0.00001888 | Epoch 4 | Step 27900 | loss = 0.010, acc = 0.972
Learning Rate 0.00001877 | Epoch 4 | Step 28000 | loss = 0.011, acc = 0.972
Learning Rate 0.00001865 | Epoch 4 | Step 28100 | loss = 0.007, acc = 0.985
Learning Rate 0.00001854 | Epoch 4 | Step 28200 | loss = 0.013, acc = 0.955
Learning Rate 0.00001843 | Epoch 4 | Step 28300 | loss = 0.010, acc = 0.962
Learning Rate 0.00001832 | Epoch 4 | Step 28400 | loss = 0.011, acc = 0.970
Learning Rate 0.00001820 | Epoch 4 | Step 28500 | loss = 0.008, acc = 0.975
Learning Rate 0.00001809 | Epoch 4 | Step 28600 | loss = 0.010, acc = 0.962
Learning Rate 0.00001797 | Epoch 4 | Step 28700 | loss = 0.008, acc = 0.972
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 4 | acc = 0.847
Saving Model ...
Learning Rate 0.00001786 | Epoch 4 | Step 28800 | loss = 0.004, acc = 0.972
Learning Rate 0.00001775 | Epoch 4 | Step 28900 | loss = 0.010, acc = 0.970
Learning Rate 0.00001764 | Epoch 4 | Step 29000 | loss = 0.010, acc = 0.962
Learning Rate 0.00001752 | Epoch 4 | Step 29100 | loss = 0.011, acc = 0.960
Learning Rate 0.00001741 | Epoch 4 | Step 29200 | loss = 0.010, acc = 0.962
Learning Rate 0.00001729 | Epoch 4 | Step 29300 | loss = 0.010, acc = 0.975
Learning Rate 0.00001718 | Epoch 4 | Step 29400 | loss = 0.012, acc = 0.967
Learning Rate 0.00001707 | Epoch 4 | Step 29500 | loss = 0.011, acc = 0.965
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 4 | acc = 0.847
Saving Model ...
Learning Rate 0.00001696 | Epoch 4 | Step 29600 | loss = 0.007, acc = 0.970
Learning Rate 0.00001684 | Epoch 4 | Step 29700 | loss = 0.008, acc = 0.962
Learning Rate 0.00001673 | Epoch 4 | Step 29800 | loss = 0.006, acc = 0.972
Learning Rate 0.00001661 | Epoch 4 | Step 29900 | loss = 0.009, acc = 0.967
Learning Rate 0.00001650 | Epoch 4 | Step 30000 | loss = 0.016, acc = 0.960
Learning Rate 0.00001639 | Epoch 4 | Step 30100 | loss = 0.007, acc = 0.967
Learning Rate 0.00001628 | Epoch 4 | Step 30200 | loss = 0.011, acc = 0.960
Learning Rate 0.00001616 | Epoch 4 | Step 30300 | loss = 0.006, acc = 0.972
Learning Rate 0.00001605 | Epoch 4 | Step 30400 | loss = 0.010, acc = 0.967
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 4 | acc = 0.876
Saving Model ...
Learning Rate 0.00001593 | Epoch 4 | Step 30500 | loss = 0.009, acc = 0.955
Learning Rate 0.00001582 | Epoch 4 | Step 30600 | loss = 0.019, acc = 0.945
Learning Rate 0.00001571 | Epoch 4 | Step 30700 | loss = 0.007, acc = 0.975
Learning Rate 0.00001560 | Epoch 4 | Step 30800 | loss = 0.008, acc = 0.975
Learning Rate 0.00001548 | Epoch 4 | Step 30900 | loss = 0.007, acc = 0.967
Learning Rate 0.00001537 | Epoch 4 | Step 31000 | loss = 0.015, acc = 0.962
Learning Rate 0.00001525 | Epoch 4 | Step 31100 | loss = 0.007, acc = 0.977
Learning Rate 0.00001514 | Epoch 4 | Step 31200 | loss = 0.013, acc = 0.962
Learning Rate 0.00001503 | Epoch 4 | Step 31300 | loss = 0.019, acc = 0.960
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 4 | acc = 0.864
Saving Model ...
Learning Rate 0.00001492 | Epoch 4 | Step 31400 | loss = 0.009, acc = 0.980
Learning Rate 0.00001480 | Epoch 4 | Step 31500 | loss = 0.010, acc = 0.970
Learning Rate 0.00001469 | Epoch 4 | Step 31600 | loss = 0.013, acc = 0.950
Learning Rate 0.00001457 | Epoch 4 | Step 31700 | loss = 0.008, acc = 0.970
Learning Rate 0.00001446 | Epoch 4 | Step 31800 | loss = 0.006, acc = 0.970
Learning Rate 0.00001435 | Epoch 4 | Step 31900 | loss = 0.009, acc = 0.965
Learning Rate 0.00001424 | Epoch 4 | Step 32000 | loss = 0.006, acc = 0.982
Learning Rate 0.00001412 | Epoch 4 | Step 32100 | loss = 0.009, acc = 0.970
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 4 | acc = 0.847
Saving Model ...
Learning Rate 0.00001401 | Epoch 4 | Step 32200 | loss = 0.010, acc = 0.975
Learning Rate 0.00001389 | Epoch 4 | Step 32300 | loss = 0.007, acc = 0.965
Learning Rate 0.00001378 | Epoch 4 | Step 32400 | loss = 0.011, acc = 0.967
Learning Rate 0.00001367 | Epoch 4 | Step 32500 | loss = 0.005, acc = 0.980
Learning Rate 0.00001356 | Epoch 4 | Step 32600 | loss = 0.015, acc = 0.947
Learning Rate 0.00001344 | Epoch 4 | Step 32700 | loss = 0.004, acc = 0.977
Learning Rate 0.00001333 | Epoch 4 | Step 32800 | loss = 0.011, acc = 0.965
Learning Rate 0.00001321 | Epoch 4 | Step 32900 | loss = 0.014, acc = 0.957
Learning Rate 0.00001310 | Epoch 4 | Step 33000 | loss = 0.008, acc = 0.977
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 4 | acc = 0.836
Saving Model ...
Learning Rate 0.00001299 | Epoch 4 | Step 33100 | loss = 0.012, acc = 0.965
Learning Rate 0.00001288 | Epoch 4 | Step 33200 | loss = 0.009, acc = 0.965
Learning Rate 0.00001276 | Epoch 4 | Step 33300 | loss = 0.007, acc = 0.982
Learning Rate 0.00001265 | Epoch 4 | Step 33400 | loss = 0.009, acc = 0.960
Learning Rate 0.00001253 | Epoch 4 | Step 33500 | loss = 0.010, acc = 0.970
Learning Rate 0.00001242 | Epoch 4 | Step 33600 | loss = 0.008, acc = 0.970
Learning Rate 0.00001231 | Epoch 4 | Step 33700 | loss = 0.010, acc = 0.970
Learning Rate 0.00001220 | Epoch 4 | Step 33800 | loss = 0.013, acc = 0.970
Learning Rate 0.00001208 | Epoch 4 | Step 33900 | loss = 0.005, acc = 0.982
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 4 | acc = 0.842
Saving Model ...
Learning Rate 0.00001197 | Epoch 4 | Step 34000 | loss = 0.011, acc = 0.975
Learning Rate 0.00001185 | Epoch 4 | Step 34100 | loss = 0.009, acc = 0.967
Learning Rate 0.00001174 | Epoch 4 | Step 34200 | loss = 0.008, acc = 0.970
Learning Rate 0.00001162 | Epoch 4 | Step 34300 | loss = 0.004, acc = 0.975
Learning Rate 0.00001152 | Epoch 4 | Step 34400 | loss = 0.007, acc = 0.977
Learning Rate 0.00001140 | Epoch 4 | Step 34500 | loss = 0.008, acc = 0.970
Learning Rate 0.00001129 | Epoch 4 | Step 34600 | loss = 0.011, acc = 0.962
Learning Rate 0.00001117 | Epoch 4 | Step 34700 | loss = 0.010, acc = 0.970
Learning Rate 0.00001106 | Epoch 4 | Step 34800 | loss = 0.004, acc = 0.982
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 4 | acc = 0.853
Saving Model ...
Learning Rate 0.00001094 | Epoch 4 | Step 34900 | loss = 0.009, acc = 0.975
Learning Rate 0.00001084 | Epoch 4 | Step 35000 | loss = 0.007, acc = 0.977
Learning Rate 0.00001072 | Epoch 4 | Step 35100 | loss = 0.012, acc = 0.970
Learning Rate 0.00001061 | Epoch 4 | Step 35200 | loss = 0.013, acc = 0.962
Learning Rate 0.00001049 | Epoch 4 | Step 35300 | loss = 0.006, acc = 0.980
Learning Rate 0.00001038 | Epoch 4 | Step 35400 | loss = 0.011, acc = 0.967
Learning Rate 0.00001026 | Epoch 4 | Step 35500 | loss = 0.012, acc = 0.967
Learning Rate 0.00001016 | Epoch 4 | Step 35600 | loss = 0.011, acc = 0.972


  0%|          | 0/8911 [00:00<?, ?it/s]

Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 5 | acc = 0.853
Saving Model ...
Learning Rate 0.00001004 | Epoch 5 | Step 35700 | loss = 0.003, acc = 0.542
Learning Rate 0.00000993 | Epoch 5 | Step 35800 | loss = 0.005, acc = 0.990
Learning Rate 0.00000981 | Epoch 5 | Step 35900 | loss = 0.010, acc = 0.982
Learning Rate 0.00000970 | Epoch 5 | Step 36000 | loss = 0.003, acc = 0.985
Learning Rate 0.00000958 | Epoch 5 | Step 36100 | loss = 0.004, acc = 0.985
343
346
None
None
--------------
Learning Rate 0.00000948 | Epoch 5 | Step 36200 | loss = 0.009, acc = 0.972
Learning Rate 0.00000936 | Epoch 5 | Step 36300 | loss = 0.008, acc = 0.972
Learning Rate 0.00000925 | Epoch 5 | Step 36400 | loss = 0.003, acc = 0.998
Learning Rate 0.00000913 | Epoch 5 | Step 36500 | loss = 0.004, acc = 0.987
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 5 | acc = 0.831
Saving Model ...
Learning Rate 0.00000902 | Epoch 5 | Step 36600 | loss = 0.002, acc = 0.990
Learning Rate 0.00000890 | Epoch 5 | Step 36700 | loss = 0.005, acc = 0.985
Learning Rate 0.00000880 | Epoch 5 | Step 36800 | loss = 0.005, acc = 0.985
Learning Rate 0.00000868 | Epoch 5 | Step 36900 | loss = 0.005, acc = 0.977
Learning Rate 0.00000857 | Epoch 5 | Step 37000 | loss = 0.008, acc = 0.982
Learning Rate 0.00000845 | Epoch 5 | Step 37100 | loss = 0.005, acc = 0.987
Learning Rate 0.00000834 | Epoch 5 | Step 37200 | loss = 0.002, acc = 0.987
Learning Rate 0.00000822 | Epoch 5 | Step 37300 | loss = 0.003, acc = 0.985
Learning Rate 0.00000812 | Epoch 5 | Step 37400 | loss = 0.006, acc = 0.977
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 5 | acc = 0.825
Saving Model ...
Learning Rate 0.00000800 | Epoch 5 | Step 37500 | loss = 0.002, acc = 0.990
Learning Rate 0.00000789 | Epoch 5 | Step 37600 | loss = 0.004, acc = 0.980
Learning Rate 0.00000777 | Epoch 5 | Step 37700 | loss = 0.007, acc = 0.987
Learning Rate 0.00000766 | Epoch 5 | Step 37800 | loss = 0.004, acc = 0.993
Learning Rate 0.00000754 | Epoch 5 | Step 37900 | loss = 0.008, acc = 0.977
Learning Rate 0.00000744 | Epoch 5 | Step 38000 | loss = 0.003, acc = 0.990
Learning Rate 0.00000732 | Epoch 5 | Step 38100 | loss = 0.006, acc = 0.975
Learning Rate 0.00000721 | Epoch 5 | Step 38200 | loss = 0.004, acc = 0.985
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 5 | acc = 0.842
Saving Model ...
Learning Rate 0.00000709 | Epoch 5 | Step 38300 | loss = 0.004, acc = 0.982
Learning Rate 0.00000698 | Epoch 5 | Step 38400 | loss = 0.005, acc = 0.982
Learning Rate 0.00000686 | Epoch 5 | Step 38500 | loss = 0.003, acc = 0.987
Learning Rate 0.00000676 | Epoch 5 | Step 38600 | loss = 0.005, acc = 0.982
Learning Rate 0.00000664 | Epoch 5 | Step 38700 | loss = 0.004, acc = 0.977
Learning Rate 0.00000653 | Epoch 5 | Step 38800 | loss = 0.003, acc = 0.985
Learning Rate 0.00000641 | Epoch 5 | Step 38900 | loss = 0.003, acc = 0.985
Learning Rate 0.00000630 | Epoch 5 | Step 39000 | loss = 0.009, acc = 0.977
Learning Rate 0.00000618 | Epoch 5 | Step 39100 | loss = 0.003, acc = 0.990
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 5 | acc = 0.847
Saving Model ...
Learning Rate 0.00000608 | Epoch 5 | Step 39200 | loss = 0.004, acc = 0.987
Learning Rate 0.00000596 | Epoch 5 | Step 39300 | loss = 0.005, acc = 0.990
Learning Rate 0.00000585 | Epoch 5 | Step 39400 | loss = 0.005, acc = 0.990
Learning Rate 0.00000573 | Epoch 5 | Step 39500 | loss = 0.006, acc = 0.982
Learning Rate 0.00000562 | Epoch 5 | Step 39600 | loss = 0.009, acc = 0.982
Learning Rate 0.00000550 | Epoch 5 | Step 39700 | loss = 0.005, acc = 0.995
Learning Rate 0.00000540 | Epoch 5 | Step 39800 | loss = 0.003, acc = 0.990
Learning Rate 0.00000528 | Epoch 5 | Step 39900 | loss = 0.004, acc = 0.987
Learning Rate 0.00000517 | Epoch 5 | Step 40000 | loss = 0.003, acc = 0.985
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 5 | acc = 0.853
Saving Model ...
Learning Rate 0.00000505 | Epoch 5 | Step 40100 | loss = 0.008, acc = 0.985
Learning Rate 0.00000494 | Epoch 5 | Step 40200 | loss = 0.004, acc = 0.990
Learning Rate 0.00000482 | Epoch 5 | Step 40300 | loss = 0.004, acc = 0.985
Learning Rate 0.00000472 | Epoch 5 | Step 40400 | loss = 0.012, acc = 0.975
Learning Rate 0.00000460 | Epoch 5 | Step 40500 | loss = 0.005, acc = 0.980
Learning Rate 0.00000449 | Epoch 5 | Step 40600 | loss = 0.007, acc = 0.982
Learning Rate 0.00000437 | Epoch 5 | Step 40700 | loss = 0.005, acc = 0.985
Learning Rate 0.00000426 | Epoch 5 | Step 40800 | loss = 0.007, acc = 0.982
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 5 | acc = 0.836
Saving Model ...
Learning Rate 0.00000414 | Epoch 5 | Step 40900 | loss = 0.001, acc = 0.998
Learning Rate 0.00000404 | Epoch 5 | Step 41000 | loss = 0.008, acc = 0.982
Learning Rate 0.00000392 | Epoch 5 | Step 41100 | loss = 0.003, acc = 0.993
Learning Rate 0.00000381 | Epoch 5 | Step 41200 | loss = 0.006, acc = 0.980
Learning Rate 0.00000369 | Epoch 5 | Step 41300 | loss = 0.003, acc = 0.990
Learning Rate 0.00000358 | Epoch 5 | Step 41400 | loss = 0.005, acc = 0.985
Learning Rate 0.00000346 | Epoch 5 | Step 41500 | loss = 0.004, acc = 0.977
Learning Rate 0.00000336 | Epoch 5 | Step 41600 | loss = 0.002, acc = 0.993
Learning Rate 0.00000324 | Epoch 5 | Step 41700 | loss = 0.003, acc = 0.990
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 5 | acc = 0.836
Saving Model ...
Learning Rate 0.00000313 | Epoch 5 | Step 41800 | loss = 0.004, acc = 0.985
Learning Rate 0.00000301 | Epoch 5 | Step 41900 | loss = 0.012, acc = 0.975
Learning Rate 0.00000290 | Epoch 5 | Step 42000 | loss = 0.008, acc = 0.975
Learning Rate 0.00000278 | Epoch 5 | Step 42100 | loss = 0.001, acc = 0.995
Learning Rate 0.00000268 | Epoch 5 | Step 42200 | loss = 0.009, acc = 0.972
Learning Rate 0.00000256 | Epoch 5 | Step 42300 | loss = 0.004, acc = 0.990
Learning Rate 0.00000245 | Epoch 5 | Step 42400 | loss = 0.005, acc = 0.982
Learning Rate 0.00000233 | Epoch 5 | Step 42500 | loss = 0.010, acc = 0.975
Learning Rate 0.00000222 | Epoch 5 | Step 42600 | loss = 0.009, acc = 0.977
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 5 | acc = 0.836
Saving Model ...
Learning Rate 0.00000210 | Epoch 5 | Step 42700 | loss = 0.010, acc = 0.972
Learning Rate 0.00000199 | Epoch 5 | Step 42800 | loss = 0.008, acc = 0.985
Learning Rate 0.00000188 | Epoch 5 | Step 42900 | loss = 0.008, acc = 0.980
Learning Rate 0.00000177 | Epoch 5 | Step 43000 | loss = 0.010, acc = 0.975
Learning Rate 0.00000165 | Epoch 5 | Step 43100 | loss = 0.008, acc = 0.982
Learning Rate 0.00000154 | Epoch 5 | Step 43200 | loss = 0.008, acc = 0.977
Learning Rate 0.00000142 | Epoch 5 | Step 43300 | loss = 0.006, acc = 0.985
Learning Rate 0.00000131 | Epoch 5 | Step 43400 | loss = 0.006, acc = 0.977
Learning Rate 0.00000120 | Epoch 5 | Step 43500 | loss = 0.003, acc = 0.982
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 5 | acc = 0.847
Saving Model ...
Learning Rate 0.00000109 | Epoch 5 | Step 43600 | loss = 0.002, acc = 0.993
Learning Rate 0.00000097 | Epoch 5 | Step 43700 | loss = 0.004, acc = 0.990
Learning Rate 0.00000086 | Epoch 5 | Step 43800 | loss = 0.004, acc = 0.987
Learning Rate 0.00000074 | Epoch 5 | Step 43900 | loss = 0.004, acc = 0.990
Learning Rate 0.00000063 | Epoch 5 | Step 44000 | loss = 0.008, acc = 0.980
Learning Rate 0.00000052 | Epoch 5 | Step 44100 | loss = 0.006, acc = 0.977
Learning Rate 0.00000041 | Epoch 5 | Step 44200 | loss = 0.006, acc = 0.982
Learning Rate 0.00000029 | Epoch 5 | Step 44300 | loss = 0.005, acc = 0.982
Evaluating Dev Set ...


  0%|          | 0/177 [00:00<?, ?it/s]

Validation | Epoch 5 | acc = 0.847
Saving Model ...
Learning Rate 0.00000018 | Epoch 5 | Step 44400 | loss = 0.007, acc = 0.975
Learning Rate 0.00000006 | Epoch 5 | Step 44500 | loss = 0.005, acc = 0.987
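
The logged learning rate falls by roughly 1.1e-7 every 100 steps and is close to zero by step 44500, which is the pattern expected from the linear learning rate decay listed in the task description. Below is a minimal sketch of such a schedule in plain Python. The function name `linear_decay_lr`, the base learning rate of 5e-5, the optional warmup, and the total of 5 × 8911 steps are all assumptions chosen for illustration, not values read from the notebook's configuration.

```python
def linear_decay_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at `step`: optional linear warmup, then linear decay to zero."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase
        return base_lr * step / warmup_steps
    # Decay from base_lr (end of warmup) to 0 (at total_steps); never go negative
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical settings: 5 epochs of 8911 steps, base learning rate 5e-5, no warmup
total_steps = 5 * 8911
for step in (0, 10000, 30000, total_steps):
    lr = linear_decay_lr(step, total_steps, 5e-5)
    print(f"Learning Rate {lr:.8f} | Step {step}")
```

When training with an optimizer, the same shape can be obtained from `transformers.get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)`, calling `scheduler.step()` after each optimizer update.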


## Testing

In [None]:
print("Evaluating Test Set ...")

result1 = []
result2 = []
result3 = []
# result4 = []
model1.eval()
model2.eval()
model3.eval()
# model4.eval()
with torch.no_grad():
    for data in tqdm(test_loader):
        # squeeze(dim=0) drops the leading batch dimension; move each tensor to the device once
        input_ids = data[0].squeeze(dim=0).to(device)
        token_type_ids = data[1].squeeze(dim=0).to(device)
        attention_mask = data[2].squeeze(dim=0).to(device)

        output1 = model1(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
        output2 = model2(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
        output3 = model3(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
        # output4 = model4(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)

        result1.append(evaluate(data, output1))
        result2.append(evaluate(data, output2))
        result3.append(evaluate(data, output3))
        # result4.append(evaluate(data, output4))

        # Disabled majority vote: keep an answer that two models agree on, otherwise fall back to model1
        # if evaluate(data, output1) == evaluate(data, output2) or evaluate(data, output1) == evaluate(data, output3):
        #     result.append(evaluate(data, output1))
        #     continue
        # elif evaluate(data, output2) == evaluate(data, output3):
        #     result.append(evaluate(data, output2))
        #     continue
        # else:
        #     result.append(evaluate(data, output1))

result_file1 = "mac_result_84.2_1.csv"
result_file2 = "mac_result_83.6.csv"
result_file3 = "mac_result_83.1_1.csv"
# result_file4 = "result_88.3_2.csv"

# Write one submission file per model
for result_file, answers in [(result_file1, result1), (result_file2, result2), (result_file3, result3)]:
    with open(result_file, 'w') as f:
        f.write("ID,Answer\n")
        for i, test_question in enumerate(test_questions):
            # Replace commas in answers with empty strings (since csv is separated by comma)
            # Answers in kaggle are processed in the same way
            f.write(f"{test_question['id']},{answers[i].replace(',','')}\n")

print(f"Completed! Result is in {result_file1}")

Evaluating Test Set ...


  0%|          | 0/4957 [00:00<?, ?it/s]

Completed! Result is in mac_result_84.2_1.csv
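
The test cell writes one file per model and leaves its majority-vote branch commented out. As a minimal sketch of how the three answer lists could be combined into a single ensemble prediction, the helpers below pick the answer most models agree on and fall back to the first model when all disagree, mirroring that commented-out logic. The names `majority_vote` and `ensemble` and the toy answer lists are hypothetical; in the notebook the inputs would be `result1`, `result2`, and `result3`.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer; ties go to the earliest answer in the list."""
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


def ensemble(*results):
    """Combine per-model answer lists question by question."""
    return [majority_vote(list(answers)) for answers in zip(*results)]


# Toy answer lists standing in for result1, result2, result3
toy1 = ["Taipei", "1995", "Li Bai"]
toy2 = ["Taipei", "1996", "Du Fu"]
toy3 = ["Kaohsiung", "1996", "Wang Wei"]
print(ensemble(toy1, toy2, toy3))  # ['Taipei', '1996', 'Li Bai']
```

Listing the strongest model first matters here, because it decides every question on which no two models agree.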
