Multiclass Classification 'Target'

In [4]:
!pip install transformers[torch]


In [5]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Dataset
import torch

In [6]:
import pandas as pd
url = 'https://github.com/bvidgen/Dynamically-Generated-Hate-Speech-Dataset/raw/refs/heads/main/Dynamically%20Generated%20Hate%20Dataset%20v0.2.3.csv'
df = pd.read_csv(url)
df.head(5)

Unnamed: 0.1,Unnamed: 0,acl.id,X1,text,label,type,target,level,split,round.base,annotator,round,acl.id.matched
0,1,acl1,1,dalits are lowlives,hate,notgiven,notgiven,original,train,1,GNZuCtwed3,1,
1,2,acl10,2,gay people are a burden to our society,hate,notgiven,notgiven,original,train,1,GNZuCtwed3,1,
2,3,acl100,3,Arabs are not welcome,hate,notgiven,notgiven,original,train,1,vDe7GN0NrL,1,
3,4,acl1000,4,I'm not saying we should actually eliminate he...,hate,notgiven,notgiven,original,train,1,oemYWm1Tjg,1,
4,5,acl10000,5,bananas are for black people,hate,notgiven,notgiven,original,test,1,QiOKkCi7F8,1,


In [7]:
df1 = df[['text','target','split']]
df1.head(10)

Unnamed: 0,text,target,split
0,dalits are lowlives,notgiven,train
1,gay people are a burden to our society,notgiven,train
2,Arabs are not welcome,notgiven,train
3,I'm not saying we should actually eliminate he...,notgiven,train
4,bananas are for black people,notgiven,test
5,asian people are the shit,none,test
6,women can not reproduce kids but men can in al...,notgiven,test
7,i hate wearing black in the summer!,none,test
8,Who cares what Chinese people think?,notgiven,test
9,i try to punch a letter box when i see one,notgiven,test


In [8]:
# Encode each distinct target string as an integer class id.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df1 = df1.assign(target=pd.Categorical(df1['target']).codes)
# Missing targets are coded as -1; shifting by the minimum keeps all ids >= 0.
df1['target'] = df1['target'] - df1['target'].min()
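The integer codes replace the original target strings, so a predicted class id can no longer be read back as a target name. A minimal sketch of keeping a two-way mapping follows. The names `build_label_maps`, `label2id`, and `id2label` are hypothetical, the sorted ordering is chosen to mirror the default category order `pd.Categorical` uses for string values, and the toy strings only stand in for the real `target` column.

```python
def build_label_maps(targets):
    """Return (label2id, id2label) for an iterable of target strings.

    Labels are sorted so the ids are deterministic, mirroring the
    default category order pd.Categorical uses for string values.
    """
    classes = sorted(set(targets))
    label2id = {name: idx for idx, name in enumerate(classes)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label


# Toy stand-ins for values of the 'target' column (not real data).
toy_targets = ["notgiven", "none", "notgiven", "example_group", "none"]
label2id, id2label = build_label_maps(toy_targets)
encoded = [label2id[t] for t in toy_targets]   # strings -> integer ids
decoded = [id2label[i] for i in encoded]       # integer ids -> strings
```

Mappings like these can also be passed as `id2label` and `label2id` when loading the model with `from_pretrained`, so that saved checkpoints carry readable class names.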


In [9]:
print(df1['target'].unique())

[311 309 131 358 260  47 325 379 374 121 316 228 219 171  29  20 251 169
 194 179   8 207 190   3 202 188 321 302  17 310 304  43  28   1 312 206
 148 119 265 288   0  45 301  51  46  70  27 324 227  76 250 176  95 318
 125 262   9  36 168 332   5 167 308 315 406 183 338 259 122  31 314 166
 157 326 328 329 103 102 146 297 141 137 135 256 389 349 266  66  53 160
 200   4 133  74 317 110 191 408 340 313  39 161 350 154 348 248 231 384
 373 255 393 286 233  79 217 268 334 142 247 278 392  59  64  63  73 258
 239 218 216  52  93 257 149  50 177 295 404 232 132  60  15 294  75 173
  38 230  99 215 186 205 298  32 385 401  82  83  92 372 344 407 400 391
 107 371 184 225 150 221 130 269 357 336 229 124  88 155 353  30 339 252
  67  86 145  87 213 396 354 170   6 235 290 127 381 172 327 163 356  69
 101  49 195  81 152  62  12 253 203 335 104  23 337 386 126 274 129 390
  91  61 319 394 330 197 289 159 118 237 399 111 320 116 395  58 189 276
 405 164 296 291 144 403 242 261  25 264 347 199 27

In [10]:
df1[df1['target'] == 0]

Unnamed: 0,text,target,split
2220,I swear I will piss on this gross pikey until ...,0,train
2741,Really? There are a lot of gypsy travellers th...,0,train
2971,alot ogf people make fun of my personality in ...,0,train
4175,"I have worse teeth than a camel humper, my mum...",0,train
9093,Please stop inviting this squaw at the workplace,0,train
9185,A girl I'm working with hails from Egypt and t...,0,train
11812,fucking hell I've just seen a pikey,0,dev
23651,The purpose of the Aryan Hall is not just to h...,0,train
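Label 0 appears in only eight rows, which hints that many classes may be rare. One quick check is to count examples per class and list those below a threshold. The sketch below uses only the standard library; `rare_classes` and `min_count` are hypothetical names, and the labels are toy values rather than the notebook's data.

```python
from collections import Counter


def rare_classes(labels, min_count):
    """Return {label: count} for labels seen fewer than min_count times."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_count}


# Toy labels: class 0 is common, classes 1 and 2 are rare.
toy_labels = [0, 0, 0, 0, 1, 2, 2]
print(rare_classes(toy_labels, min_count=3))
```

On the notebook's data the same function could be called with `df1['target'].tolist()`; classes that fall below the threshold are candidates for merging or for a catch-all class.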


In [11]:
class TextDataset(Dataset):
    """Tokenizes one text per item and pairs it with its integer label."""

    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        # Calling the tokenizer directly replaces the deprecated encode_plus().
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',  # pad every example to max_len tokens
            truncation=True,       # cut off anything longer than max_len
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'text': text,
            # return_tensors='pt' yields shape (1, max_len); flatten to (max_len,)
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Example data
texts = df1['text']
labels = df1['target'] # Categorical labels
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
max_len = 128

dataset = TextDataset(texts, labels, tokenizer, max_len)
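`padding='max_length'` and `truncation=True` make every example exactly `max_len` tokens long, and the attention mask marks which positions hold real tokens. The pure-Python sketch below illustrates that behavior on made-up token ids. `pad_and_mask` is a hypothetical helper, not part of `transformers`, and it assumes a pad id of 0, which is the `[PAD]` id in the `bert-base-uncased` vocabulary. The real tokenizer also keeps the closing `[SEP]` token when it truncates, which this simplified version does not model.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad token_ids to max_len and build the mask.

    Returns (input_ids, attention_mask), both lists of length max_len.
    """
    kept = list(token_ids[:max_len])           # truncation
    n_pad = max_len - len(kept)
    input_ids = kept + [pad_id] * n_pad        # right padding
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Made-up ids: 101 and 102 play the roles of [CLS] and [SEP].
ids, mask = pad_and_mask([101, 7592, 2088, 102], max_len=6)
```

With `max_len = 128` and short texts, most positions in each example are padding. Padding per batch instead, for instance with `DataCollatorWithPadding`, is a common alternative when training time matters.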


In [12]:
train_texts = df1[df1['split'] == 'train']['text'].tolist()
train_labels = df1[df1['split'] == 'train']['target'].tolist()
val_texts = df1[df1['split'] == 'dev']['text'].tolist()
val_labels = df1[df1['split'] == 'dev']['target'].tolist()
test_texts = df1[df1['split'] == 'test']['text'].tolist()
test_labels = df1[df1['split'] == 'test']['target'].tolist()

# Wrap each split in the TextDataset class defined above
train_dataset = TextDataset(train_texts, train_labels, tokenizer, max_len)
val_dataset = TextDataset(val_texts, val_labels, tokenizer, max_len)
test_dataset = TextDataset(test_texts, test_labels, tokenizer, max_len)

In [16]:
num_labels = len(pd.Categorical(df1['target']).categories)
print(num_labels)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)

410


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
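Because the classification head is newly initialized, its first predictions should be close to uniform over the 410 classes, and the cross-entropy of a uniform guess is ln(num_labels). That gives a quick sanity check on the first logged training losses (about 5.95). The sketch below is plain arithmetic; `uniform_cross_entropy` is a hypothetical helper name.

```python
import math


def uniform_cross_entropy(num_classes):
    """Cross-entropy (in nats) of a uniform prediction over num_classes."""
    return -math.log(1.0 / num_classes)


baseline = uniform_cross_entropy(410)
print(f"uniform baseline loss: {baseline:.3f}")
```

The baseline comes out at roughly 6.02, so an initial loss just under 6 is what an untrained 410-way classifier should produce. A much larger starting loss would usually point to a label or `num_labels` mismatch.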


In [17]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()
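With `warmup_steps=500` and the default linear scheduler, the learning rate ramps from 0 to its peak over the first 500 steps and then decays linearly to 0 at the final step. The sketch below reimplements that rule in plain Python so it can be compared with the logged `learning_rate` values. It assumes the `TrainingArguments` default peak of 5e-5 (consistent with the logged value at step 500) and the total of 12348 steps shown in the progress bar; `linear_warmup_lr` is a hypothetical helper name.

```python
def linear_warmup_lr(step, peak_lr=5e-5, warmup_steps=500, total_steps=12348):
    """Learning rate after `step` optimizer steps under linear warmup and decay."""
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay: scale linearly from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (10, 500, 510):
    print(step, linear_warmup_lr(step))
```

The three values agree with the log entries at steps 10, 500, and 510 (about 1e-06, 5e-05, and 4.99578e-05), which confirms the run is using linear warmup followed by linear decay.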

  0%|          | 10/12348 [00:13<4:26:12,  1.29s/it]

{'loss': 5.9533, 'grad_norm': 13.803159713745117, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.0}


  0%|          | 20/12348 [00:26<4:25:41,  1.29s/it]

{'loss': 5.9663, 'grad_norm': 10.184335708618164, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.0}


  0%|          | 30/12348 [00:39<4:27:52,  1.30s/it]

{'loss': 5.8067, 'grad_norm': 10.7892484664917, 'learning_rate': 3e-06, 'epoch': 0.01}


  0%|          | 40/12348 [00:52<4:28:02,  1.31s/it]

{'loss': 5.6921, 'grad_norm': 8.905298233032227, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.01}


  0%|          | 50/12348 [01:06<4:43:35,  1.38s/it]

{'loss': 5.6577, 'grad_norm': 13.346779823303223, 'learning_rate': 5e-06, 'epoch': 0.01}


  0%|          | 60/12348 [01:19<4:33:58,  1.34s/it]

{'loss': 5.5712, 'grad_norm': 8.553717613220215, 'learning_rate': 6e-06, 'epoch': 0.01}


  1%|          | 70/12348 [01:33<4:29:04,  1.31s/it]

{'loss': 5.5304, 'grad_norm': 9.396066665649414, 'learning_rate': 7.000000000000001e-06, 'epoch': 0.02}


  1%|          | 80/12348 [01:49<6:05:37,  1.79s/it]

{'loss': 5.3591, 'grad_norm': 10.920450210571289, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.02}


  1%|          | 90/12348 [02:08<6:17:39,  1.85s/it]

{'loss': 5.1796, 'grad_norm': 9.077898025512695, 'learning_rate': 9e-06, 'epoch': 0.02}


  1%|          | 100/12348 [02:27<6:26:29,  1.89s/it]

{'loss': 4.9776, 'grad_norm': 11.748030662536621, 'learning_rate': 1e-05, 'epoch': 0.02}


  1%|          | 110/12348 [02:46<6:42:23,  1.97s/it]

{'loss': 4.778, 'grad_norm': 10.938104629516602, 'learning_rate': 1.1000000000000001e-05, 'epoch': 0.03}


  1%|          | 120/12348 [03:04<6:05:41,  1.79s/it]

{'loss': 4.3727, 'grad_norm': 10.381353378295898, 'learning_rate': 1.2e-05, 'epoch': 0.03}


  1%|          | 130/12348 [03:20<5:00:34,  1.48s/it]

{'loss': 3.6674, 'grad_norm': 16.284442901611328, 'learning_rate': 1.3000000000000001e-05, 'epoch': 0.03}


  1%|          | 140/12348 [03:37<6:07:42,  1.81s/it]

{'loss': 3.538, 'grad_norm': 8.895026206970215, 'learning_rate': 1.4000000000000001e-05, 'epoch': 0.03}


  1%|          | 150/12348 [03:56<6:33:22,  1.93s/it]

{'loss': 3.2219, 'grad_norm': 11.543594360351562, 'learning_rate': 1.5e-05, 'epoch': 0.04}


  1%|▏         | 160/12348 [04:16<6:38:54,  1.96s/it]

{'loss': 3.1792, 'grad_norm': 7.761049270629883, 'learning_rate': 1.6000000000000003e-05, 'epoch': 0.04}


  1%|▏         | 170/12348 [04:34<5:29:30,  1.62s/it]

{'loss': 3.1415, 'grad_norm': 6.166538238525391, 'learning_rate': 1.7000000000000003e-05, 'epoch': 0.04}


  1%|▏         | 180/12348 [04:47<4:31:12,  1.34s/it]

{'loss': 2.9098, 'grad_norm': 7.322800159454346, 'learning_rate': 1.8e-05, 'epoch': 0.04}


  2%|▏         | 190/12348 [05:00<4:27:33,  1.32s/it]

{'loss': 2.915, 'grad_norm': 5.880950450897217, 'learning_rate': 1.9e-05, 'epoch': 0.05}


  2%|▏         | 200/12348 [05:13<4:27:19,  1.32s/it]

{'loss': 2.8984, 'grad_norm': 7.388019561767578, 'learning_rate': 2e-05, 'epoch': 0.05}


  2%|▏         | 210/12348 [05:27<4:26:32,  1.32s/it]

{'loss': 2.7391, 'grad_norm': 5.699715614318848, 'learning_rate': 2.1e-05, 'epoch': 0.05}


  2%|▏         | 220/12348 [05:40<4:26:26,  1.32s/it]

{'loss': 2.5461, 'grad_norm': 6.6326751708984375, 'learning_rate': 2.2000000000000003e-05, 'epoch': 0.05}


  2%|▏         | 230/12348 [05:53<4:26:38,  1.32s/it]

{'loss': 2.7771, 'grad_norm': 8.295827865600586, 'learning_rate': 2.3000000000000003e-05, 'epoch': 0.06}


  2%|▏         | 240/12348 [06:06<4:26:21,  1.32s/it]

{'loss': 2.3847, 'grad_norm': 8.490886688232422, 'learning_rate': 2.4e-05, 'epoch': 0.06}


  2%|▏         | 250/12348 [06:19<4:26:03,  1.32s/it]

{'loss': 2.5852, 'grad_norm': 10.601662635803223, 'learning_rate': 2.5e-05, 'epoch': 0.06}


  2%|▏         | 260/12348 [06:33<4:26:59,  1.33s/it]

{'loss': 2.5731, 'grad_norm': 5.817739963531494, 'learning_rate': 2.6000000000000002e-05, 'epoch': 0.06}


  2%|▏         | 270/12348 [06:46<4:26:18,  1.32s/it]

{'loss': 2.338, 'grad_norm': 10.537542343139648, 'learning_rate': 2.7000000000000002e-05, 'epoch': 0.07}


  2%|▏         | 280/12348 [06:59<4:25:21,  1.32s/it]

{'loss': 2.4121, 'grad_norm': 7.729766845703125, 'learning_rate': 2.8000000000000003e-05, 'epoch': 0.07}


  2%|▏         | 290/12348 [07:12<4:26:19,  1.33s/it]

{'loss': 2.6022, 'grad_norm': 5.8136420249938965, 'learning_rate': 2.9e-05, 'epoch': 0.07}


  2%|▏         | 300/12348 [07:26<4:26:24,  1.33s/it]

{'loss': 2.3773, 'grad_norm': 8.157944679260254, 'learning_rate': 3e-05, 'epoch': 0.07}


  3%|▎         | 310/12348 [07:39<4:26:35,  1.33s/it]

{'loss': 2.1834, 'grad_norm': 7.02308464050293, 'learning_rate': 3.1e-05, 'epoch': 0.08}


  3%|▎         | 320/12348 [07:52<4:25:20,  1.32s/it]

{'loss': 1.9862, 'grad_norm': 9.884442329406738, 'learning_rate': 3.2000000000000005e-05, 'epoch': 0.08}


  3%|▎         | 330/12348 [08:05<4:25:16,  1.32s/it]

{'loss': 2.0966, 'grad_norm': 5.96561336517334, 'learning_rate': 3.3e-05, 'epoch': 0.08}


  3%|▎         | 340/12348 [08:19<4:25:03,  1.32s/it]

{'loss': 2.8759, 'grad_norm': 6.955765724182129, 'learning_rate': 3.4000000000000007e-05, 'epoch': 0.08}


  3%|▎         | 350/12348 [08:32<4:25:59,  1.33s/it]

{'loss': 2.317, 'grad_norm': 13.069576263427734, 'learning_rate': 3.5e-05, 'epoch': 0.09}


  3%|▎         | 360/12348 [08:45<4:27:43,  1.34s/it]

{'loss': 2.3305, 'grad_norm': 8.481781959533691, 'learning_rate': 3.6e-05, 'epoch': 0.09}


  3%|▎         | 370/12348 [08:59<4:28:44,  1.35s/it]

{'loss': 1.898, 'grad_norm': 6.872279644012451, 'learning_rate': 3.7e-05, 'epoch': 0.09}


  3%|▎         | 380/12348 [09:12<4:26:03,  1.33s/it]

{'loss': 2.1837, 'grad_norm': 8.048517227172852, 'learning_rate': 3.8e-05, 'epoch': 0.09}


  3%|▎         | 390/12348 [09:26<4:25:02,  1.33s/it]

{'loss': 2.0729, 'grad_norm': 11.237578392028809, 'learning_rate': 3.9000000000000006e-05, 'epoch': 0.09}


  3%|▎         | 400/12348 [09:39<4:25:34,  1.33s/it]

{'loss': 2.0525, 'grad_norm': 6.590199947357178, 'learning_rate': 4e-05, 'epoch': 0.1}


  3%|▎         | 410/12348 [09:52<4:25:05,  1.33s/it]

{'loss': 2.1393, 'grad_norm': 12.248394966125488, 'learning_rate': 4.1e-05, 'epoch': 0.1}


  3%|▎         | 420/12348 [10:06<4:24:19,  1.33s/it]

{'loss': 2.0952, 'grad_norm': 6.997021675109863, 'learning_rate': 4.2e-05, 'epoch': 0.1}


  3%|▎         | 430/12348 [10:19<4:24:02,  1.33s/it]

{'loss': 2.6778, 'grad_norm': 8.988022804260254, 'learning_rate': 4.3e-05, 'epoch': 0.1}


  4%|▎         | 440/12348 [10:32<4:23:49,  1.33s/it]

{'loss': 2.2715, 'grad_norm': 8.460938453674316, 'learning_rate': 4.4000000000000006e-05, 'epoch': 0.11}


  4%|▎         | 450/12348 [10:46<4:24:36,  1.33s/it]

{'loss': 2.0737, 'grad_norm': 8.83095645904541, 'learning_rate': 4.5e-05, 'epoch': 0.11}


  4%|▎         | 460/12348 [10:59<4:23:04,  1.33s/it]

{'loss': 1.8105, 'grad_norm': 14.157255172729492, 'learning_rate': 4.600000000000001e-05, 'epoch': 0.11}


  4%|▍         | 470/12348 [11:12<4:24:23,  1.34s/it]

{'loss': 2.1838, 'grad_norm': 15.018067359924316, 'learning_rate': 4.7e-05, 'epoch': 0.11}


  4%|▍         | 480/12348 [11:26<4:23:35,  1.33s/it]

{'loss': 2.0474, 'grad_norm': 6.988025188446045, 'learning_rate': 4.8e-05, 'epoch': 0.12}


  4%|▍         | 490/12348 [11:39<4:23:04,  1.33s/it]

{'loss': 2.0265, 'grad_norm': 5.954584121704102, 'learning_rate': 4.9e-05, 'epoch': 0.12}


  4%|▍         | 500/12348 [11:52<4:22:44,  1.33s/it]

{'loss': 2.3344, 'grad_norm': 7.903738021850586, 'learning_rate': 5e-05, 'epoch': 0.12}


  4%|▍         | 510/12348 [12:08<4:31:46,  1.38s/it]

{'loss': 2.4255, 'grad_norm': 8.13530445098877, 'learning_rate': 4.9957798784605e-05, 'epoch': 0.12}


  4%|▍         | 520/12348 [12:22<4:24:51,  1.34s/it]

{'loss': 2.1388, 'grad_norm': 8.055455207824707, 'learning_rate': 4.9915597569209995e-05, 'epoch': 0.13}


  4%|▍         | 530/12348 [12:35<4:24:39,  1.34s/it]

{'loss': 2.3076, 'grad_norm': 13.306142807006836, 'learning_rate': 4.987339635381499e-05, 'epoch': 0.13}


  4%|▍         | 540/12348 [12:48<4:22:36,  1.33s/it]

{'loss': 2.4024, 'grad_norm': 11.902450561523438, 'learning_rate': 4.983119513841999e-05, 'epoch': 0.13}


  4%|▍         | 550/12348 [13:02<4:23:05,  1.34s/it]

{'loss': 2.1684, 'grad_norm': 6.702873229980469, 'learning_rate': 4.9788993923024984e-05, 'epoch': 0.13}


  5%|▍         | 560/12348 [13:15<4:22:09,  1.33s/it]

{'loss': 1.7941, 'grad_norm': 9.974865913391113, 'learning_rate': 4.974679270762998e-05, 'epoch': 0.14}


  5%|▍         | 570/12348 [13:28<4:22:06,  1.34s/it]

{'loss': 1.3658, 'grad_norm': 10.972859382629395, 'learning_rate': 4.970459149223498e-05, 'epoch': 0.14}


  5%|▍         | 580/12348 [13:42<4:21:43,  1.33s/it]

{'loss': 2.0601, 'grad_norm': 10.687320709228516, 'learning_rate': 4.966239027683998e-05, 'epoch': 0.14}


  5%|▍         | 590/12348 [13:55<4:20:33,  1.33s/it]

{'loss': 1.7319, 'grad_norm': 5.7599568367004395, 'learning_rate': 4.962018906144497e-05, 'epoch': 0.14}


  5%|▍         | 600/12348 [14:09<4:21:53,  1.34s/it]

{'loss': 2.0422, 'grad_norm': 7.797793865203857, 'learning_rate': 4.9577987846049965e-05, 'epoch': 0.15}


  5%|▍         | 610/12348 [14:22<4:20:12,  1.33s/it]

{'loss': 2.0516, 'grad_norm': 7.643914222717285, 'learning_rate': 4.953578663065497e-05, 'epoch': 0.15}


  5%|▌         | 620/12348 [14:35<4:20:15,  1.33s/it]

{'loss': 2.2726, 'grad_norm': 12.7946138381958, 'learning_rate': 4.9493585415259965e-05, 'epoch': 0.15}


  5%|▌         | 630/12348 [14:49<4:20:30,  1.33s/it]

{'loss': 2.2006, 'grad_norm': 8.002691268920898, 'learning_rate': 4.945138419986496e-05, 'epoch': 0.15}


  5%|▌         | 640/12348 [15:02<4:19:48,  1.33s/it]

{'loss': 1.9729, 'grad_norm': 9.01723861694336, 'learning_rate': 4.940918298446996e-05, 'epoch': 0.16}


  5%|▌         | 650/12348 [15:15<4:19:34,  1.33s/it]

{'loss': 2.219, 'grad_norm': 10.840018272399902, 'learning_rate': 4.936698176907495e-05, 'epoch': 0.16}


  5%|▌         | 660/12348 [15:28<4:19:02,  1.33s/it]

{'loss': 1.6966, 'grad_norm': 9.108874320983887, 'learning_rate': 4.932478055367995e-05, 'epoch': 0.16}


  5%|▌         | 670/12348 [15:42<4:20:29,  1.34s/it]

{'loss': 1.687, 'grad_norm': 7.3981404304504395, 'learning_rate': 4.9282579338284946e-05, 'epoch': 0.16}


  6%|▌         | 680/12348 [15:55<4:18:53,  1.33s/it]

{'loss': 2.2121, 'grad_norm': 11.375900268554688, 'learning_rate': 4.924037812288994e-05, 'epoch': 0.17}


  6%|▌         | 690/12348 [16:09<4:18:03,  1.33s/it]

{'loss': 1.5853, 'grad_norm': 9.54638957977295, 'learning_rate': 4.919817690749494e-05, 'epoch': 0.17}


  6%|▌         | 700/12348 [16:22<4:19:01,  1.33s/it]

{'loss': 1.9585, 'grad_norm': 9.082088470458984, 'learning_rate': 4.9155975692099935e-05, 'epoch': 0.17}


  6%|▌         | 710/12348 [16:35<4:18:36,  1.33s/it]

{'loss': 1.9731, 'grad_norm': 18.11715316772461, 'learning_rate': 4.911377447670493e-05, 'epoch': 0.17}


  6%|▌         | 720/12348 [16:49<4:18:14,  1.33s/it]

{'loss': 1.9896, 'grad_norm': 8.348484992980957, 'learning_rate': 4.907157326130993e-05, 'epoch': 0.17}


  6%|▌         | 730/12348 [17:02<4:17:39,  1.33s/it]

{'loss': 1.6507, 'grad_norm': 9.565361022949219, 'learning_rate': 4.9029372045914924e-05, 'epoch': 0.18}


  6%|▌         | 740/12348 [17:15<4:18:03,  1.33s/it]

{'loss': 1.6517, 'grad_norm': 10.586373329162598, 'learning_rate': 4.898717083051993e-05, 'epoch': 0.18}


  6%|▌         | 750/12348 [17:29<4:18:08,  1.34s/it]

{'loss': 1.3739, 'grad_norm': 12.779983520507812, 'learning_rate': 4.8944969615124916e-05, 'epoch': 0.18}


  6%|▌         | 760/12348 [17:42<4:17:14,  1.33s/it]

{'loss': 1.8142, 'grad_norm': 9.981303215026855, 'learning_rate': 4.890276839972991e-05, 'epoch': 0.18}


  6%|▌         | 770/12348 [17:55<4:17:08,  1.33s/it]

{'loss': 1.7021, 'grad_norm': 10.970260620117188, 'learning_rate': 4.886056718433491e-05, 'epoch': 0.19}


  6%|▋         | 780/12348 [18:09<4:16:51,  1.33s/it]

{'loss': 1.8583, 'grad_norm': 8.53620433807373, 'learning_rate': 4.8818365968939905e-05, 'epoch': 0.19}


  6%|▋         | 790/12348 [18:22<4:15:41,  1.33s/it]

{'loss': 2.019, 'grad_norm': 7.128504753112793, 'learning_rate': 4.877616475354491e-05, 'epoch': 0.19}


  6%|▋         | 800/12348 [18:35<4:15:50,  1.33s/it]

{'loss': 2.0332, 'grad_norm': 4.623926162719727, 'learning_rate': 4.8733963538149905e-05, 'epoch': 0.19}


  7%|▋         | 810/12348 [18:49<4:16:16,  1.33s/it]

{'loss': 2.0346, 'grad_norm': 11.000515937805176, 'learning_rate': 4.8691762322754894e-05, 'epoch': 0.2}


  7%|▋         | 820/12348 [19:02<4:18:41,  1.35s/it]

{'loss': 1.958, 'grad_norm': 7.640119552612305, 'learning_rate': 4.864956110735989e-05, 'epoch': 0.2}


  7%|▋         | 830/12348 [19:15<4:15:21,  1.33s/it]

{'loss': 1.6374, 'grad_norm': 7.395923137664795, 'learning_rate': 4.8607359891964893e-05, 'epoch': 0.2}


  7%|▋         | 840/12348 [19:29<4:15:36,  1.33s/it]

{'loss': 1.4211, 'grad_norm': 15.332857131958008, 'learning_rate': 4.856515867656989e-05, 'epoch': 0.2}


  7%|▋         | 850/12348 [19:42<4:16:05,  1.34s/it]

{'loss': 2.0823, 'grad_norm': 5.693451404571533, 'learning_rate': 4.8522957461174886e-05, 'epoch': 0.21}


  7%|▋         | 860/12348 [19:56<4:14:54,  1.33s/it]

{'loss': 1.8776, 'grad_norm': 20.181520462036133, 'learning_rate': 4.848075624577988e-05, 'epoch': 0.21}


  7%|▋         | 870/12348 [20:09<4:14:50,  1.33s/it]

{'loss': 1.6347, 'grad_norm': 8.459050178527832, 'learning_rate': 4.843855503038488e-05, 'epoch': 0.21}


  7%|▋         | 880/12348 [20:22<4:14:13,  1.33s/it]

{'loss': 1.4695, 'grad_norm': 10.481395721435547, 'learning_rate': 4.8396353814989875e-05, 'epoch': 0.21}


  7%|▋         | 890/12348 [20:36<4:14:33,  1.33s/it]

{'loss': 1.8688, 'grad_norm': 13.499162673950195, 'learning_rate': 4.835415259959487e-05, 'epoch': 0.22}


  7%|▋         | 900/12348 [20:49<4:14:47,  1.34s/it]

{'loss': 1.7144, 'grad_norm': 10.858574867248535, 'learning_rate': 4.831195138419987e-05, 'epoch': 0.22}


  7%|▋         | 910/12348 [21:02<4:14:03,  1.33s/it]

{'loss': 1.6253, 'grad_norm': 10.172477722167969, 'learning_rate': 4.8269750168804864e-05, 'epoch': 0.22}


  7%|▋         | 920/12348 [21:16<4:13:11,  1.33s/it]

{'loss': 1.6568, 'grad_norm': 6.398813247680664, 'learning_rate': 4.822754895340986e-05, 'epoch': 0.22}


  8%|▊         | 930/12348 [21:29<4:14:16,  1.34s/it]

{'loss': 1.7461, 'grad_norm': 10.315506935119629, 'learning_rate': 4.8185347738014856e-05, 'epoch': 0.23}


  8%|▊         | 940/12348 [21:42<4:13:38,  1.33s/it]

{'loss': 1.7222, 'grad_norm': 13.552051544189453, 'learning_rate': 4.814314652261985e-05, 'epoch': 0.23}


  8%|▊         | 950/12348 [21:56<4:13:15,  1.33s/it]

{'loss': 1.5846, 'grad_norm': 8.232402801513672, 'learning_rate': 4.810094530722485e-05, 'epoch': 0.23}


  8%|▊         | 960/12348 [22:09<4:12:47,  1.33s/it]

{'loss': 2.6802, 'grad_norm': 17.43431282043457, 'learning_rate': 4.805874409182985e-05, 'epoch': 0.23}


  8%|▊         | 970/12348 [22:22<4:13:18,  1.34s/it]

{'loss': 1.7715, 'grad_norm': 7.544881343841553, 'learning_rate': 4.801654287643484e-05, 'epoch': 0.24}


  8%|▊         | 980/12348 [22:36<4:13:29,  1.34s/it]

{'loss': 2.2255, 'grad_norm': 10.293218612670898, 'learning_rate': 4.797434166103984e-05, 'epoch': 0.24}


  8%|▊         | 990/12348 [22:49<4:14:34,  1.34s/it]

{'loss': 1.6934, 'grad_norm': 5.8371171951293945, 'learning_rate': 4.7932140445644834e-05, 'epoch': 0.24}


  8%|▊         | 1000/12348 [23:03<4:13:23,  1.34s/it]

{'loss': 1.5511, 'grad_norm': 9.564264297485352, 'learning_rate': 4.788993923024984e-05, 'epoch': 0.24}


  8%|▊         | 1010/12348 [23:19<4:21:24,  1.38s/it]

{'loss': 1.9982, 'grad_norm': 16.363994598388672, 'learning_rate': 4.7847738014854833e-05, 'epoch': 0.25}


  8%|▊         | 1020/12348 [23:32<4:13:48,  1.34s/it]

{'loss': 1.8608, 'grad_norm': 7.798534870147705, 'learning_rate': 4.780553679945983e-05, 'epoch': 0.25}


  8%|▊         | 1030/12348 [23:46<4:13:22,  1.34s/it]

{'loss': 2.2883, 'grad_norm': 9.241494178771973, 'learning_rate': 4.776333558406482e-05, 'epoch': 0.25}


  8%|▊         | 1040/12348 [23:59<4:12:31,  1.34s/it]

{'loss': 1.8125, 'grad_norm': 5.297024726867676, 'learning_rate': 4.7721134368669816e-05, 'epoch': 0.25}


  9%|▊         | 1050/12348 [24:13<4:12:00,  1.34s/it]

{'loss': 1.4784, 'grad_norm': 11.138681411743164, 'learning_rate': 4.767893315327482e-05, 'epoch': 0.26}


  9%|▊         | 1060/12348 [24:26<4:12:31,  1.34s/it]

{'loss': 1.5131, 'grad_norm': 10.62488079071045, 'learning_rate': 4.7636731937879815e-05, 'epoch': 0.26}


  9%|▊         | 1070/12348 [24:40<4:11:19,  1.34s/it]

{'loss': 1.4916, 'grad_norm': 7.394345760345459, 'learning_rate': 4.759453072248481e-05, 'epoch': 0.26}


  9%|▊         | 1080/12348 [24:53<4:11:54,  1.34s/it]

{'loss': 1.3918, 'grad_norm': 14.329761505126953, 'learning_rate': 4.755232950708981e-05, 'epoch': 0.26}


  9%|▉         | 1090/12348 [25:06<4:10:53,  1.34s/it]

{'loss': 1.9387, 'grad_norm': 8.411025047302246, 'learning_rate': 4.7510128291694804e-05, 'epoch': 0.26}


  9%|▉         | 1100/12348 [25:20<4:10:26,  1.34s/it]

{'loss': 1.3663, 'grad_norm': 15.754842758178711, 'learning_rate': 4.74679270762998e-05, 'epoch': 0.27}


  9%|▉         | 1110/12348 [25:33<4:10:59,  1.34s/it]

{'loss': 1.5215, 'grad_norm': 15.742085456848145, 'learning_rate': 4.7425725860904796e-05, 'epoch': 0.27}


  9%|▉         | 1120/12348 [25:47<4:10:09,  1.34s/it]

{'loss': 1.6675, 'grad_norm': 9.806358337402344, 'learning_rate': 4.738352464550979e-05, 'epoch': 0.27}


  9%|▉         | 1130/12348 [26:00<4:10:07,  1.34s/it]

{'loss': 1.748, 'grad_norm': 12.392236709594727, 'learning_rate': 4.734132343011479e-05, 'epoch': 0.27}


  9%|▉         | 1140/12348 [26:13<4:10:17,  1.34s/it]

{'loss': 1.3634, 'grad_norm': 13.56355094909668, 'learning_rate': 4.7299122214719785e-05, 'epoch': 0.28}


  9%|▉         | 1150/12348 [26:27<4:09:56,  1.34s/it]

{'loss': 2.1395, 'grad_norm': 9.303557395935059, 'learning_rate': 4.725692099932478e-05, 'epoch': 0.28}


  9%|▉         | 1160/12348 [26:40<4:10:30,  1.34s/it]

{'loss': 1.5172, 'grad_norm': 13.769587516784668, 'learning_rate': 4.721471978392978e-05, 'epoch': 0.28}


  9%|▉         | 1170/12348 [26:54<4:09:59,  1.34s/it]

{'loss': 1.9705, 'grad_norm': 8.348567008972168, 'learning_rate': 4.7172518568534774e-05, 'epoch': 0.28}


 10%|▉         | 1180/12348 [27:07<4:09:19,  1.34s/it]

{'loss': 2.0199, 'grad_norm': 13.973167419433594, 'learning_rate': 4.713031735313978e-05, 'epoch': 0.29}


 10%|▉         | 1190/12348 [27:20<4:09:21,  1.34s/it]

{'loss': 2.2351, 'grad_norm': 8.038141250610352, 'learning_rate': 4.708811613774477e-05, 'epoch': 0.29}


 10%|▉         | 1200/12348 [27:34<4:08:49,  1.34s/it]

{'loss': 1.3842, 'grad_norm': 6.280343055725098, 'learning_rate': 4.704591492234976e-05, 'epoch': 0.29}


 10%|▉         | 1210/12348 [27:47<4:08:49,  1.34s/it]

{'loss': 1.6935, 'grad_norm': 10.78691577911377, 'learning_rate': 4.700371370695476e-05, 'epoch': 0.29}


 10%|▉         | 1220/12348 [28:01<4:07:47,  1.34s/it]

{'loss': 1.9413, 'grad_norm': 9.871354103088379, 'learning_rate': 4.696151249155976e-05, 'epoch': 0.3}


 10%|▉         | 1230/12348 [28:14<4:08:12,  1.34s/it]

{'loss': 1.7074, 'grad_norm': 9.147499084472656, 'learning_rate': 4.691931127616476e-05, 'epoch': 0.3}


 10%|█         | 1240/12348 [28:28<4:07:26,  1.34s/it]

{'loss': 1.4142, 'grad_norm': 9.385181427001953, 'learning_rate': 4.6877110060769755e-05, 'epoch': 0.3}


 10%|█         | 1250/12348 [28:41<4:07:35,  1.34s/it]

{'loss': 1.6734, 'grad_norm': 14.676959037780762, 'learning_rate': 4.6834908845374744e-05, 'epoch': 0.3}


 10%|█         | 1260/12348 [28:54<4:08:14,  1.34s/it]

{'loss': 1.2089, 'grad_norm': 9.869584083557129, 'learning_rate': 4.679270762997974e-05, 'epoch': 0.31}


 10%|█         | 1270/12348 [29:08<4:07:09,  1.34s/it]

{'loss': 1.6341, 'grad_norm': 23.568191528320312, 'learning_rate': 4.6750506414584744e-05, 'epoch': 0.31}


 10%|█         | 1280/12348 [29:21<4:06:31,  1.34s/it]

{'loss': 1.6057, 'grad_norm': 7.4275665283203125, 'learning_rate': 4.670830519918974e-05, 'epoch': 0.31}


 10%|█         | 1290/12348 [29:35<4:06:19,  1.34s/it]

{'loss': 1.8993, 'grad_norm': 9.53905200958252, 'learning_rate': 4.6666103983794736e-05, 'epoch': 0.31}


 11%|█         | 1300/12348 [29:48<4:06:45,  1.34s/it]

{'loss': 1.5832, 'grad_norm': 13.387764930725098, 'learning_rate': 4.662390276839973e-05, 'epoch': 0.32}


 11%|█         | 1310/12348 [30:02<4:07:30,  1.35s/it]

{'loss': 1.3185, 'grad_norm': 10.492445945739746, 'learning_rate': 4.658170155300473e-05, 'epoch': 0.32}


 11%|█         | 1320/12348 [30:15<4:05:47,  1.34s/it]

{'loss': 1.8098, 'grad_norm': 16.63056755065918, 'learning_rate': 4.6539500337609725e-05, 'epoch': 0.32}


 11%|█         | 1330/12348 [30:28<4:05:37,  1.34s/it]

{'loss': 1.6378, 'grad_norm': 10.733214378356934, 'learning_rate': 4.649729912221472e-05, 'epoch': 0.32}


 11%|█         | 1340/12348 [30:42<4:06:55,  1.35s/it]

{'loss': 2.1799, 'grad_norm': 18.82303237915039, 'learning_rate': 4.645509790681972e-05, 'epoch': 0.33}


 11%|█         | 1350/12348 [30:55<4:05:46,  1.34s/it]

{'loss': 1.1954, 'grad_norm': 9.04729175567627, 'learning_rate': 4.641289669142472e-05, 'epoch': 0.33}


 11%|█         | 1360/12348 [31:09<4:05:33,  1.34s/it]

{'loss': 1.6392, 'grad_norm': 12.120966911315918, 'learning_rate': 4.637069547602971e-05, 'epoch': 0.33}


 11%|█         | 1370/12348 [31:22<4:05:16,  1.34s/it]

{'loss': 1.5061, 'grad_norm': 21.213111877441406, 'learning_rate': 4.632849426063471e-05, 'epoch': 0.33}


 11%|█         | 1380/12348 [31:35<4:04:43,  1.34s/it]

{'loss': 1.8155, 'grad_norm': 10.527341842651367, 'learning_rate': 4.62862930452397e-05, 'epoch': 0.34}


 11%|█▏        | 1390/12348 [31:49<4:04:34,  1.34s/it]

{'loss': 1.2845, 'grad_norm': 10.609081268310547, 'learning_rate': 4.62440918298447e-05, 'epoch': 0.34}


 11%|█▏        | 1400/12348 [32:02<4:04:25,  1.34s/it]

{'loss': 1.5625, 'grad_norm': 9.319966316223145, 'learning_rate': 4.62018906144497e-05, 'epoch': 0.34}


 11%|█▏        | 1410/12348 [32:16<4:04:28,  1.34s/it]

{'loss': 1.7252, 'grad_norm': 12.328864097595215, 'learning_rate': 4.61596893990547e-05, 'epoch': 0.34}


 11%|█▏        | 1420/12348 [32:29<4:03:35,  1.34s/it]

{'loss': 1.6331, 'grad_norm': 11.37063980102539, 'learning_rate': 4.611748818365969e-05, 'epoch': 0.34}


 12%|█▏        | 1430/12348 [32:43<4:03:55,  1.34s/it]

{'loss': 1.3509, 'grad_norm': 11.59732723236084, 'learning_rate': 4.6075286968264684e-05, 'epoch': 0.35}


 12%|█▏        | 1440/12348 [32:56<4:04:44,  1.35s/it]

{'loss': 1.486, 'grad_norm': 14.192814826965332, 'learning_rate': 4.603308575286969e-05, 'epoch': 0.35}


 12%|█▏        | 1450/12348 [33:09<4:02:01,  1.33s/it]

{'loss': 1.5644, 'grad_norm': 10.21554946899414, 'learning_rate': 4.5990884537474684e-05, 'epoch': 0.35}


 12%|█▏        | 1460/12348 [33:23<4:02:57,  1.34s/it]

{'loss': 1.5433, 'grad_norm': 14.587592124938965, 'learning_rate': 4.594868332207968e-05, 'epoch': 0.35}


 12%|█▏        | 1470/12348 [33:36<4:03:00,  1.34s/it]

{'loss': 1.7125, 'grad_norm': 8.423285484313965, 'learning_rate': 4.590648210668467e-05, 'epoch': 0.36}


 12%|█▏        | 1480/12348 [33:50<4:03:01,  1.34s/it]

{'loss': 1.7373, 'grad_norm': 5.519014835357666, 'learning_rate': 4.586428089128967e-05, 'epoch': 0.36}


 12%|█▏        | 1490/12348 [34:03<4:02:30,  1.34s/it]

{'loss': 1.5062, 'grad_norm': 13.910361289978027, 'learning_rate': 4.582207967589467e-05, 'epoch': 0.36}


 12%|█▏        | 1500/12348 [34:17<4:01:29,  1.34s/it]

{'loss': 1.656, 'grad_norm': 12.886597633361816, 'learning_rate': 4.5779878460499665e-05, 'epoch': 0.36}


 12%|█▏        | 1510/12348 [34:31<4:06:17,  1.36s/it]

{'loss': 1.5973, 'grad_norm': 10.305940628051758, 'learning_rate': 4.573767724510466e-05, 'epoch': 0.37}


 12%|█▏        | 1520/12348 [34:45<4:10:26,  1.39s/it]

{'loss': 1.4196, 'grad_norm': 13.66312313079834, 'learning_rate': 4.569547602970966e-05, 'epoch': 0.37}


 12%|█▏        | 1530/12348 [34:59<4:08:35,  1.38s/it]

{'loss': 1.6227, 'grad_norm': 11.330550193786621, 'learning_rate': 4.5653274814314654e-05, 'epoch': 0.37}


 12%|█▏        | 1540/12348 [35:13<4:09:07,  1.38s/it]

{'loss': 1.3642, 'grad_norm': 6.896184921264648, 'learning_rate': 4.561107359891965e-05, 'epoch': 0.37}


 13%|█▎        | 1550/12348 [35:27<4:08:14,  1.38s/it]

{'loss': 1.7442, 'grad_norm': 7.074594497680664, 'learning_rate': 4.556887238352465e-05, 'epoch': 0.38}


 13%|█▎        | 1560/12348 [35:40<4:08:17,  1.38s/it]

{'loss': 1.4023, 'grad_norm': 7.950493812561035, 'learning_rate': 4.552667116812964e-05, 'epoch': 0.38}


 13%|█▎        | 1570/12348 [35:54<4:08:20,  1.38s/it]

{'loss': 1.6416, 'grad_norm': 15.533053398132324, 'learning_rate': 4.5484469952734646e-05, 'epoch': 0.38}


 13%|█▎        | 1580/12348 [36:08<4:07:31,  1.38s/it]

{'loss': 1.5094, 'grad_norm': 13.054777145385742, 'learning_rate': 4.5442268737339635e-05, 'epoch': 0.38}


 13%|█▎        | 1590/12348 [36:22<4:07:28,  1.38s/it]

{'loss': 1.3002, 'grad_norm': 9.886272430419922, 'learning_rate': 4.540006752194463e-05, 'epoch': 0.39}


 13%|█▎        | 1600/12348 [36:36<4:08:53,  1.39s/it]

{'loss': 1.3294, 'grad_norm': 13.93025016784668, 'learning_rate': 4.535786630654963e-05, 'epoch': 0.39}


 13%|█▎        | 1610/12348 [36:50<4:08:22,  1.39s/it]

{'loss': 1.4684, 'grad_norm': 12.263294219970703, 'learning_rate': 4.531566509115463e-05, 'epoch': 0.39}


 13%|█▎        | 1620/12348 [37:03<4:08:00,  1.39s/it]

{'loss': 1.4319, 'grad_norm': 8.886406898498535, 'learning_rate': 4.527346387575963e-05, 'epoch': 0.39}


 13%|█▎        | 1630/12348 [37:17<4:07:16,  1.38s/it]

{'loss': 1.3675, 'grad_norm': 8.21798038482666, 'learning_rate': 4.5231262660364624e-05, 'epoch': 0.4}


 13%|█▎        | 1640/12348 [37:31<4:05:41,  1.38s/it]

{'loss': 1.2817, 'grad_norm': 8.716711044311523, 'learning_rate': 4.518906144496961e-05, 'epoch': 0.4}


 13%|█▎        | 1650/12348 [37:45<4:05:18,  1.38s/it]

{'loss': 1.7002, 'grad_norm': 9.19748592376709, 'learning_rate': 4.514686022957461e-05, 'epoch': 0.4}


 13%|█▎        | 1660/12348 [37:59<4:05:09,  1.38s/it]

{'loss': 1.3022, 'grad_norm': 10.602896690368652, 'learning_rate': 4.510465901417961e-05, 'epoch': 0.4}


 14%|█▎        | 1670/12348 [38:13<4:05:34,  1.38s/it]

{'loss': 1.3209, 'grad_norm': 18.39936065673828, 'learning_rate': 4.506245779878461e-05, 'epoch': 0.41}


 14%|█▎        | 1680/12348 [38:26<4:04:34,  1.38s/it]

{'loss': 1.5004, 'grad_norm': 8.956329345703125, 'learning_rate': 4.5020256583389605e-05, 'epoch': 0.41}


 14%|█▎        | 1690/12348 [38:40<4:04:47,  1.38s/it]

{'loss': 1.0083, 'grad_norm': 7.049092769622803, 'learning_rate': 4.49780553679946e-05, 'epoch': 0.41}


 14%|█▍        | 1700/12348 [38:54<4:05:45,  1.38s/it]

{'loss': 1.5111, 'grad_norm': 6.110976219177246, 'learning_rate': 4.49358541525996e-05, 'epoch': 0.41}


 14%|█▍        | 1710/12348 [39:08<4:05:13,  1.38s/it]

{'loss': 1.3763, 'grad_norm': 7.729211330413818, 'learning_rate': 4.4893652937204594e-05, 'epoch': 0.42}


 14%|█▍        | 1720/12348 [39:22<4:04:45,  1.38s/it]

{'loss': 1.5786, 'grad_norm': 11.27183723449707, 'learning_rate': 4.485145172180959e-05, 'epoch': 0.42}


 14%|█▍        | 1730/12348 [39:36<4:05:29,  1.39s/it]

{'loss': 1.6252, 'grad_norm': 12.224289894104004, 'learning_rate': 4.4809250506414587e-05, 'epoch': 0.42}


 14%|█▍        | 1740/12348 [39:49<4:04:21,  1.38s/it]

{'loss': 1.6942, 'grad_norm': 6.857753276824951, 'learning_rate': 4.476704929101958e-05, 'epoch': 0.42}


 14%|█▍        | 1750/12348 [40:03<4:04:07,  1.38s/it]

{'loss': 1.0791, 'grad_norm': 3.8683526515960693, 'learning_rate': 4.472484807562458e-05, 'epoch': 0.43}


 14%|█▍        | 1760/12348 [40:17<4:04:15,  1.38s/it]

{'loss': 1.374, 'grad_norm': 11.121822357177734, 'learning_rate': 4.4682646860229575e-05, 'epoch': 0.43}


 14%|█▍        | 1770/12348 [40:31<4:03:38,  1.38s/it]

{'loss': 1.2997, 'grad_norm': 7.880019664764404, 'learning_rate': 4.464044564483457e-05, 'epoch': 0.43}


 14%|█▍        | 1780/12348 [40:45<4:03:50,  1.38s/it]

{'loss': 1.2706, 'grad_norm': 16.1674747467041, 'learning_rate': 4.459824442943957e-05, 'epoch': 0.43}


 14%|█▍        | 1790/12348 [40:59<4:03:26,  1.38s/it]

{'loss': 1.3702, 'grad_norm': 9.557272911071777, 'learning_rate': 4.455604321404457e-05, 'epoch': 0.43}


 15%|█▍        | 1800/12348 [41:12<4:03:15,  1.38s/it]

{'loss': 1.4566, 'grad_norm': 18.027225494384766, 'learning_rate': 4.451384199864956e-05, 'epoch': 0.44}


 15%|█▍        | 1810/12348 [41:26<4:02:39,  1.38s/it]

{'loss': 0.9653, 'grad_norm': 12.688018798828125, 'learning_rate': 4.447164078325456e-05, 'epoch': 0.44}


 15%|█▍        | 1820/12348 [41:40<4:03:04,  1.39s/it]

{'loss': 1.5998, 'grad_norm': 8.335724830627441, 'learning_rate': 4.442943956785955e-05, 'epoch': 0.44}


 15%|█▍        | 1830/12348 [41:54<4:02:16,  1.38s/it]

{'loss': 1.5049, 'grad_norm': 3.8467211723327637, 'learning_rate': 4.4387238352464556e-05, 'epoch': 0.44}


 15%|█▍        | 1840/12348 [42:08<4:02:28,  1.38s/it]

{'loss': 1.6008, 'grad_norm': 10.316153526306152, 'learning_rate': 4.434503713706955e-05, 'epoch': 0.45}


 15%|█▍        | 1850/12348 [42:22<4:02:17,  1.38s/it]

{'loss': 1.4645, 'grad_norm': 14.988263130187988, 'learning_rate': 4.430283592167455e-05, 'epoch': 0.45}


 15%|█▌        | 1860/12348 [42:36<4:01:59,  1.38s/it]

{'loss': 1.4186, 'grad_norm': 11.141302108764648, 'learning_rate': 4.426063470627954e-05, 'epoch': 0.45}


 15%|█▌        | 1870/12348 [42:50<4:02:00,  1.39s/it]

{'loss': 1.3872, 'grad_norm': 9.648951530456543, 'learning_rate': 4.421843349088454e-05, 'epoch': 0.45}


 15%|█▌        | 1880/12348 [43:03<4:00:27,  1.38s/it]

{'loss': 1.6852, 'grad_norm': 8.29002571105957, 'learning_rate': 4.417623227548954e-05, 'epoch': 0.46}


 15%|█▌        | 1890/12348 [43:17<4:01:48,  1.39s/it]

{'loss': 1.6688, 'grad_norm': 5.005443572998047, 'learning_rate': 4.4134031060094534e-05, 'epoch': 0.46}


 15%|█▌        | 1900/12348 [43:31<4:01:04,  1.38s/it]

{'loss': 1.2886, 'grad_norm': 5.551412105560303, 'learning_rate': 4.409182984469953e-05, 'epoch': 0.46}


 15%|█▌        | 1910/12348 [43:45<3:59:41,  1.38s/it]

{'loss': 1.4828, 'grad_norm': 15.388012886047363, 'learning_rate': 4.4049628629304527e-05, 'epoch': 0.46}


 16%|█▌        | 1920/12348 [43:59<4:00:28,  1.38s/it]

{'loss': 1.8684, 'grad_norm': 7.343705654144287, 'learning_rate': 4.400742741390952e-05, 'epoch': 0.47}


 16%|█▌        | 1930/12348 [44:13<3:59:52,  1.38s/it]

{'loss': 1.0858, 'grad_norm': 7.221334934234619, 'learning_rate': 4.396522619851452e-05, 'epoch': 0.47}


 16%|█▌        | 1940/12348 [44:26<3:59:47,  1.38s/it]

{'loss': 1.528, 'grad_norm': 7.517816066741943, 'learning_rate': 4.3923024983119515e-05, 'epoch': 0.47}


 16%|█▌        | 1950/12348 [44:41<4:05:33,  1.42s/it]

{'loss': 1.6826, 'grad_norm': 10.771784782409668, 'learning_rate': 4.388082376772451e-05, 'epoch': 0.47}


 16%|█▌        | 1960/12348 [44:54<4:00:19,  1.39s/it]

{'loss': 1.4586, 'grad_norm': 11.693782806396484, 'learning_rate': 4.383862255232951e-05, 'epoch': 0.48}


 16%|█▌        | 1970/12348 [45:08<3:59:03,  1.38s/it]

{'loss': 1.5147, 'grad_norm': 5.706155776977539, 'learning_rate': 4.3796421336934504e-05, 'epoch': 0.48}


 16%|█▌        | 1980/12348 [45:22<3:59:39,  1.39s/it]

{'loss': 1.3773, 'grad_norm': 16.70802116394043, 'learning_rate': 4.37542201215395e-05, 'epoch': 0.48}


 16%|█▌        | 1990/12348 [45:36<4:02:39,  1.41s/it]

{'loss': 1.5621, 'grad_norm': 14.775408744812012, 'learning_rate': 4.37120189061445e-05, 'epoch': 0.48}


 16%|█▌        | 2000/12348 [45:50<4:00:10,  1.39s/it]

{'loss': 0.8833, 'grad_norm': 23.205995559692383, 'learning_rate': 4.36698176907495e-05, 'epoch': 0.49}


 16%|█▋        | 2010/12348 [46:05<4:02:31,  1.41s/it]

{'loss': 1.5854, 'grad_norm': 11.671204566955566, 'learning_rate': 4.3627616475354496e-05, 'epoch': 0.49}


 16%|█▋        | 2020/12348 [46:19<3:57:55,  1.38s/it]

{'loss': 1.2023, 'grad_norm': 7.221626281738281, 'learning_rate': 4.3585415259959486e-05, 'epoch': 0.49}


 16%|█▋        | 2030/12348 [46:33<3:58:31,  1.39s/it]

{'loss': 1.3725, 'grad_norm': 5.3074774742126465, 'learning_rate': 4.354321404456448e-05, 'epoch': 0.49}


 17%|█▋        | 2040/12348 [46:47<3:58:08,  1.39s/it]

{'loss': 1.4066, 'grad_norm': 18.425960540771484, 'learning_rate': 4.350101282916948e-05, 'epoch': 0.5}


 17%|█▋        | 2050/12348 [47:01<3:57:38,  1.38s/it]

{'loss': 1.3134, 'grad_norm': 11.017562866210938, 'learning_rate': 4.345881161377448e-05, 'epoch': 0.5}


 17%|█▋        | 2060/12348 [47:15<3:57:17,  1.38s/it]

{'loss': 1.1771, 'grad_norm': 12.448472023010254, 'learning_rate': 4.341661039837948e-05, 'epoch': 0.5}


 17%|█▋        | 2070/12348 [47:29<3:57:30,  1.39s/it]

{'loss': 1.4724, 'grad_norm': 8.476505279541016, 'learning_rate': 4.3374409182984474e-05, 'epoch': 0.5}


 17%|█▋        | 2080/12348 [47:42<3:57:05,  1.39s/it]

{'loss': 1.3306, 'grad_norm': 8.357467651367188, 'learning_rate': 4.3332207967589463e-05, 'epoch': 0.51}


 17%|█▋        | 2090/12348 [47:56<3:56:33,  1.38s/it]

{'loss': 0.9118, 'grad_norm': 6.877830982208252, 'learning_rate': 4.3290006752194467e-05, 'epoch': 0.51}


 17%|█▋        | 2100/12348 [48:10<3:56:52,  1.39s/it]

{'loss': 1.5968, 'grad_norm': 12.014944076538086, 'learning_rate': 4.324780553679946e-05, 'epoch': 0.51}


 17%|█▋        | 2110/12348 [48:24<3:55:48,  1.38s/it]

{'loss': 1.5961, 'grad_norm': 10.48190975189209, 'learning_rate': 4.320560432140446e-05, 'epoch': 0.51}


 17%|█▋        | 2120/12348 [48:38<3:55:39,  1.38s/it]

{'loss': 1.186, 'grad_norm': 10.9207763671875, 'learning_rate': 4.3163403106009455e-05, 'epoch': 0.52}


 17%|█▋        | 2130/12348 [48:52<3:56:05,  1.39s/it]

{'loss': 1.6409, 'grad_norm': 11.133735656738281, 'learning_rate': 4.312120189061445e-05, 'epoch': 0.52}


 17%|█▋        | 2140/12348 [49:06<3:55:44,  1.39s/it]

{'loss': 1.2156, 'grad_norm': 10.848344802856445, 'learning_rate': 4.307900067521945e-05, 'epoch': 0.52}


 17%|█▋        | 2150/12348 [49:19<3:55:15,  1.38s/it]

{'loss': 1.5232, 'grad_norm': 9.730389595031738, 'learning_rate': 4.3036799459824444e-05, 'epoch': 0.52}


 17%|█▋        | 2160/12348 [49:33<3:55:21,  1.39s/it]

{'loss': 1.5856, 'grad_norm': 15.805621147155762, 'learning_rate': 4.299459824442944e-05, 'epoch': 0.52}


 18%|█▊        | 2170/12348 [49:47<3:56:38,  1.40s/it]

{'loss': 1.6451, 'grad_norm': 12.107006072998047, 'learning_rate': 4.295239702903444e-05, 'epoch': 0.53}


 18%|█▊        | 2180/12348 [50:01<3:55:18,  1.39s/it]

{'loss': 1.1961, 'grad_norm': 8.763198852539062, 'learning_rate': 4.291019581363944e-05, 'epoch': 0.53}


 18%|█▊        | 2190/12348 [50:15<3:54:24,  1.38s/it]

{'loss': 1.1742, 'grad_norm': 16.163747787475586, 'learning_rate': 4.286799459824443e-05, 'epoch': 0.53}


 18%|█▊        | 2200/12348 [50:29<3:55:12,  1.39s/it]

{'loss': 1.7772, 'grad_norm': 12.28345775604248, 'learning_rate': 4.2825793382849426e-05, 'epoch': 0.53}


 18%|█▊        | 2210/12348 [50:43<3:55:03,  1.39s/it]

{'loss': 1.2542, 'grad_norm': 6.30566930770874, 'learning_rate': 4.278359216745442e-05, 'epoch': 0.54}


 18%|█▊        | 2220/12348 [50:57<3:54:10,  1.39s/it]

{'loss': 1.6019, 'grad_norm': 5.654677867889404, 'learning_rate': 4.2741390952059425e-05, 'epoch': 0.54}


 18%|█▊        | 2230/12348 [51:11<3:54:19,  1.39s/it]

{'loss': 1.3241, 'grad_norm': 6.528100967407227, 'learning_rate': 4.269918973666442e-05, 'epoch': 0.54}


 18%|█▊        | 2240/12348 [51:25<3:54:15,  1.39s/it]

{'loss': 1.3805, 'grad_norm': 14.41530990600586, 'learning_rate': 4.265698852126941e-05, 'epoch': 0.54}


 18%|█▊        | 2250/12348 [51:39<3:54:35,  1.39s/it]

{'loss': 1.0489, 'grad_norm': 15.211127281188965, 'learning_rate': 4.261478730587441e-05, 'epoch': 0.55}


 18%|█▊        | 2260/12348 [51:52<3:53:30,  1.39s/it]

{'loss': 0.8593, 'grad_norm': 8.538074493408203, 'learning_rate': 4.257258609047941e-05, 'epoch': 0.55}


 18%|█▊        | 2270/12348 [52:06<3:53:03,  1.39s/it]

{'loss': 1.3656, 'grad_norm': 15.010968208312988, 'learning_rate': 4.2530384875084407e-05, 'epoch': 0.55}


 18%|█▊        | 2280/12348 [52:20<3:52:56,  1.39s/it]

{'loss': 1.8436, 'grad_norm': 17.247207641601562, 'learning_rate': 4.24881836596894e-05, 'epoch': 0.55}


 19%|█▊        | 2290/12348 [52:34<3:53:09,  1.39s/it]

{'loss': 1.789, 'grad_norm': 11.080063819885254, 'learning_rate': 4.24459824442944e-05, 'epoch': 0.56}


 19%|█▊        | 2300/12348 [52:48<3:52:56,  1.39s/it]

{'loss': 1.3222, 'grad_norm': 8.85755443572998, 'learning_rate': 4.240378122889939e-05, 'epoch': 0.56}


 19%|█▊        | 2310/12348 [53:02<3:51:38,  1.38s/it]

{'loss': 1.2373, 'grad_norm': 10.775102615356445, 'learning_rate': 4.236158001350439e-05, 'epoch': 0.56}


 19%|█▉        | 2320/12348 [53:16<3:51:36,  1.39s/it]

{'loss': 0.9965, 'grad_norm': 8.996657371520996, 'learning_rate': 4.231937879810939e-05, 'epoch': 0.56}


 19%|█▉        | 2330/12348 [53:30<3:51:05,  1.38s/it]

{'loss': 1.6072, 'grad_norm': 21.797895431518555, 'learning_rate': 4.2277177582714384e-05, 'epoch': 0.57}


 19%|█▉        | 2340/12348 [53:44<3:50:33,  1.38s/it]

{'loss': 1.5701, 'grad_norm': 10.416295051574707, 'learning_rate': 4.223497636731938e-05, 'epoch': 0.57}


 19%|█▉        | 2350/12348 [53:57<3:49:34,  1.38s/it]

{'loss': 1.2016, 'grad_norm': 3.958076238632202, 'learning_rate': 4.219277515192438e-05, 'epoch': 0.57}


 19%|█▉        | 2360/12348 [54:11<3:51:05,  1.39s/it]

{'loss': 1.5836, 'grad_norm': 9.011184692382812, 'learning_rate': 4.215057393652937e-05, 'epoch': 0.57}


 19%|█▉        | 2370/12348 [54:25<3:49:10,  1.38s/it]

{'loss': 1.452, 'grad_norm': 12.788724899291992, 'learning_rate': 4.210837272113437e-05, 'epoch': 0.58}


 19%|█▉        | 2380/12348 [54:39<3:50:11,  1.39s/it]

{'loss': 1.3657, 'grad_norm': 11.176697731018066, 'learning_rate': 4.2066171505739366e-05, 'epoch': 0.58}


 19%|█▉        | 2390/12348 [54:53<3:50:28,  1.39s/it]

{'loss': 1.5472, 'grad_norm': 12.640898704528809, 'learning_rate': 4.202397029034437e-05, 'epoch': 0.58}


 19%|█▉        | 2400/12348 [55:07<3:50:26,  1.39s/it]

{'loss': 1.3193, 'grad_norm': 10.693605422973633, 'learning_rate': 4.1981769074949365e-05, 'epoch': 0.58}


 20%|█▉        | 2410/12348 [55:21<3:50:44,  1.39s/it]

{'loss': 1.1978, 'grad_norm': 11.122668266296387, 'learning_rate': 4.1939567859554355e-05, 'epoch': 0.59}


 20%|█▉        | 2420/12348 [55:35<3:48:18,  1.38s/it]

{'loss': 1.36, 'grad_norm': 14.266548156738281, 'learning_rate': 4.189736664415935e-05, 'epoch': 0.59}


 20%|█▉        | 2430/12348 [55:49<3:49:12,  1.39s/it]

{'loss': 1.5229, 'grad_norm': 10.628795623779297, 'learning_rate': 4.185516542876435e-05, 'epoch': 0.59}


 20%|█▉        | 2440/12348 [56:02<3:48:34,  1.38s/it]

{'loss': 1.4832, 'grad_norm': 5.892218589782715, 'learning_rate': 4.181296421336935e-05, 'epoch': 0.59}


 20%|█▉        | 2450/12348 [56:16<3:49:41,  1.39s/it]

{'loss': 1.5671, 'grad_norm': 12.15674877166748, 'learning_rate': 4.1770762997974347e-05, 'epoch': 0.6}


 20%|█▉        | 2460/12348 [56:30<3:49:05,  1.39s/it]

{'loss': 1.3406, 'grad_norm': 11.511250495910645, 'learning_rate': 4.172856178257934e-05, 'epoch': 0.6}


 20%|██        | 2470/12348 [56:44<3:49:03,  1.39s/it]

{'loss': 1.2287, 'grad_norm': 7.223637104034424, 'learning_rate': 4.168636056718433e-05, 'epoch': 0.6}


 20%|██        | 2480/12348 [56:58<3:48:36,  1.39s/it]

{'loss': 1.0734, 'grad_norm': 8.029622077941895, 'learning_rate': 4.1644159351789335e-05, 'epoch': 0.6}


 20%|██        | 2490/12348 [57:12<3:47:50,  1.39s/it]

{'loss': 1.2386, 'grad_norm': 9.965128898620605, 'learning_rate': 4.160195813639433e-05, 'epoch': 0.6}


 20%|██        | 2500/12348 [57:26<3:48:18,  1.39s/it]

{'loss': 1.3221, 'grad_norm': 13.746521949768066, 'learning_rate': 4.155975692099933e-05, 'epoch': 0.61}


 20%|██        | 2510/12348 [57:41<3:54:09,  1.43s/it]

{'loss': 1.3832, 'grad_norm': 15.318381309509277, 'learning_rate': 4.1517555705604324e-05, 'epoch': 0.61}


 20%|██        | 2520/12348 [57:55<3:48:02,  1.39s/it]

{'loss': 1.3943, 'grad_norm': 12.543161392211914, 'learning_rate': 4.147535449020932e-05, 'epoch': 0.61}


 20%|██        | 2530/12348 [58:09<3:47:23,  1.39s/it]

{'loss': 1.1189, 'grad_norm': 8.522262573242188, 'learning_rate': 4.143315327481432e-05, 'epoch': 0.61}


 21%|██        | 2540/12348 [58:23<3:47:16,  1.39s/it]

{'loss': 1.1772, 'grad_norm': 9.04596996307373, 'learning_rate': 4.139095205941931e-05, 'epoch': 0.62}


 21%|██        | 2550/12348 [58:37<3:47:45,  1.39s/it]

{'loss': 1.1467, 'grad_norm': 9.63711929321289, 'learning_rate': 4.134875084402431e-05, 'epoch': 0.62}


 21%|██        | 2560/12348 [58:51<3:46:30,  1.39s/it]

{'loss': 0.9851, 'grad_norm': 4.990233898162842, 'learning_rate': 4.1306549628629306e-05, 'epoch': 0.62}


 21%|██        | 2570/12348 [59:05<3:46:25,  1.39s/it]

{'loss': 1.2147, 'grad_norm': 24.226543426513672, 'learning_rate': 4.12643484132343e-05, 'epoch': 0.62}


 21%|██        | 2580/12348 [59:19<3:45:36,  1.39s/it]

{'loss': 1.4777, 'grad_norm': 8.170157432556152, 'learning_rate': 4.12221471978393e-05, 'epoch': 0.63}


 21%|██        | 2590/12348 [59:33<3:46:22,  1.39s/it]

{'loss': 1.2164, 'grad_norm': 15.23184871673584, 'learning_rate': 4.1179945982444295e-05, 'epoch': 0.63}


 21%|██        | 2600/12348 [59:46<3:46:13,  1.39s/it]

{'loss': 1.5882, 'grad_norm': 12.107711791992188, 'learning_rate': 4.113774476704929e-05, 'epoch': 0.63}


 21%|██        | 2610/12348 [1:00:00<3:46:00,  1.39s/it]

{'loss': 1.2232, 'grad_norm': 11.082934379577637, 'learning_rate': 4.1095543551654294e-05, 'epoch': 0.63}


 21%|██        | 2620/12348 [1:00:14<3:45:28,  1.39s/it]

{'loss': 1.2818, 'grad_norm': 26.41646957397461, 'learning_rate': 4.105334233625929e-05, 'epoch': 0.64}


 21%|██▏       | 2630/12348 [1:00:28<3:44:52,  1.39s/it]

{'loss': 1.0141, 'grad_norm': 7.471109390258789, 'learning_rate': 4.101114112086428e-05, 'epoch': 0.64}


 21%|██▏       | 2640/12348 [1:00:42<3:44:23,  1.39s/it]

{'loss': 1.1509, 'grad_norm': 8.9610013961792, 'learning_rate': 4.0968939905469276e-05, 'epoch': 0.64}


 21%|██▏       | 2650/12348 [1:00:56<3:45:22,  1.39s/it]

{'loss': 1.0332, 'grad_norm': 10.486507415771484, 'learning_rate': 4.092673869007428e-05, 'epoch': 0.64}


 22%|██▏       | 2660/12348 [1:01:10<3:44:13,  1.39s/it]

{'loss': 0.8031, 'grad_norm': 3.4769287109375, 'learning_rate': 4.0884537474679275e-05, 'epoch': 0.65}


 22%|██▏       | 2670/12348 [1:01:24<3:44:18,  1.39s/it]

{'loss': 1.5228, 'grad_norm': 16.759414672851562, 'learning_rate': 4.084233625928427e-05, 'epoch': 0.65}


 22%|██▏       | 2680/12348 [1:01:38<3:43:20,  1.39s/it]

{'loss': 1.5442, 'grad_norm': 8.812772750854492, 'learning_rate': 4.080013504388927e-05, 'epoch': 0.65}


 22%|██▏       | 2690/12348 [1:01:52<3:44:09,  1.39s/it]

{'loss': 1.6727, 'grad_norm': 10.709794998168945, 'learning_rate': 4.075793382849426e-05, 'epoch': 0.65}


 22%|██▏       | 2700/12348 [1:02:06<3:44:23,  1.40s/it]

{'loss': 1.53, 'grad_norm': 17.28463363647461, 'learning_rate': 4.071573261309926e-05, 'epoch': 0.66}


 22%|██▏       | 2710/12348 [1:02:20<3:42:34,  1.39s/it]

{'loss': 1.5628, 'grad_norm': 6.547827243804932, 'learning_rate': 4.067353139770426e-05, 'epoch': 0.66}


 22%|██▏       | 2720/12348 [1:02:33<3:42:24,  1.39s/it]

{'loss': 1.1934, 'grad_norm': 8.275572776794434, 'learning_rate': 4.063133018230925e-05, 'epoch': 0.66}


 22%|██▏       | 2730/12348 [1:02:47<3:42:23,  1.39s/it]

{'loss': 1.2239, 'grad_norm': 8.415942192077637, 'learning_rate': 4.058912896691425e-05, 'epoch': 0.66}


 22%|██▏       | 2740/12348 [1:03:01<3:42:03,  1.39s/it]

{'loss': 0.8758, 'grad_norm': 10.98595142364502, 'learning_rate': 4.0546927751519246e-05, 'epoch': 0.67}


 22%|██▏       | 2750/12348 [1:03:15<3:41:47,  1.39s/it]

{'loss': 1.1806, 'grad_norm': 9.665556907653809, 'learning_rate': 4.050472653612424e-05, 'epoch': 0.67}


 22%|██▏       | 2760/12348 [1:03:29<3:41:54,  1.39s/it]

{'loss': 1.0717, 'grad_norm': 5.688023567199707, 'learning_rate': 4.046252532072924e-05, 'epoch': 0.67}


 22%|██▏       | 2770/12348 [1:03:43<3:41:40,  1.39s/it]

{'loss': 1.4604, 'grad_norm': 8.750755310058594, 'learning_rate': 4.0420324105334235e-05, 'epoch': 0.67}


 23%|██▎       | 2780/12348 [1:03:57<3:40:37,  1.38s/it]

{'loss': 1.4293, 'grad_norm': 15.371333122253418, 'learning_rate': 4.037812288993923e-05, 'epoch': 0.68}


 23%|██▎       | 2790/12348 [1:04:11<3:40:24,  1.38s/it]

{'loss': 1.2337, 'grad_norm': 25.14463233947754, 'learning_rate': 4.033592167454423e-05, 'epoch': 0.68}


 23%|██▎       | 2800/12348 [1:04:25<3:40:39,  1.39s/it]

{'loss': 1.2774, 'grad_norm': 12.133645057678223, 'learning_rate': 4.0293720459149223e-05, 'epoch': 0.68}


 23%|██▎       | 2810/12348 [1:04:38<3:41:06,  1.39s/it]

{'loss': 1.0257, 'grad_norm': 6.717347145080566, 'learning_rate': 4.025151924375422e-05, 'epoch': 0.68}


 23%|██▎       | 2820/12348 [1:04:52<3:43:50,  1.41s/it]

{'loss': 1.4072, 'grad_norm': 6.848352909088135, 'learning_rate': 4.0209318028359216e-05, 'epoch': 0.69}


 23%|██▎       | 2830/12348 [1:05:06<3:39:57,  1.39s/it]

{'loss': 1.4204, 'grad_norm': 9.633284568786621, 'learning_rate': 4.016711681296422e-05, 'epoch': 0.69}


 23%|██▎       | 2840/12348 [1:05:20<3:39:17,  1.38s/it]

{'loss': 1.2135, 'grad_norm': 11.034366607666016, 'learning_rate': 4.0124915597569215e-05, 'epoch': 0.69}


 23%|██▎       | 2850/12348 [1:05:34<3:39:47,  1.39s/it]

{'loss': 1.334, 'grad_norm': 13.256857872009277, 'learning_rate': 4.0082714382174205e-05, 'epoch': 0.69}


 23%|██▎       | 2860/12348 [1:05:48<3:40:19,  1.39s/it]

{'loss': 1.3206, 'grad_norm': 12.299225807189941, 'learning_rate': 4.00405131667792e-05, 'epoch': 0.69}


 23%|██▎       | 2870/12348 [1:06:02<3:39:16,  1.39s/it]

{'loss': 1.0676, 'grad_norm': 16.907615661621094, 'learning_rate': 3.9998311951384204e-05, 'epoch': 0.7}


 23%|██▎       | 2880/12348 [1:06:16<3:39:25,  1.39s/it]

{'loss': 1.1566, 'grad_norm': 22.174039840698242, 'learning_rate': 3.99561107359892e-05, 'epoch': 0.7}


 23%|██▎       | 2890/12348 [1:06:30<3:38:15,  1.38s/it]

{'loss': 1.2731, 'grad_norm': 23.50771713256836, 'learning_rate': 3.99139095205942e-05, 'epoch': 0.7}


 23%|██▎       | 2900/12348 [1:06:44<3:38:43,  1.39s/it]

{'loss': 1.2323, 'grad_norm': 20.95566177368164, 'learning_rate': 3.987170830519919e-05, 'epoch': 0.7}


 24%|██▎       | 2910/12348 [1:06:58<3:38:43,  1.39s/it]

{'loss': 1.4015, 'grad_norm': 10.872864723205566, 'learning_rate': 3.982950708980418e-05, 'epoch': 0.71}


 24%|██▎       | 2920/12348 [1:07:11<3:38:46,  1.39s/it]

{'loss': 1.0914, 'grad_norm': 20.265409469604492, 'learning_rate': 3.9787305874409186e-05, 'epoch': 0.71}


 24%|██▎       | 2930/12348 [1:07:25<3:39:19,  1.40s/it]

{'loss': 1.1761, 'grad_norm': 18.47481918334961, 'learning_rate': 3.974510465901418e-05, 'epoch': 0.71}


 24%|██▍       | 2940/12348 [1:07:39<3:38:39,  1.39s/it]

{'loss': 1.1879, 'grad_norm': 12.591987609863281, 'learning_rate': 3.970290344361918e-05, 'epoch': 0.71}


 24%|██▍       | 2950/12348 [1:07:53<3:37:37,  1.39s/it]

{'loss': 1.3426, 'grad_norm': 10.001426696777344, 'learning_rate': 3.9660702228224175e-05, 'epoch': 0.72}


 24%|██▍       | 2960/12348 [1:08:07<3:36:41,  1.38s/it]

{'loss': 1.4821, 'grad_norm': 8.370393753051758, 'learning_rate': 3.961850101282917e-05, 'epoch': 0.72}


 24%|██▍       | 2970/12348 [1:08:21<3:35:51,  1.38s/it]

{'loss': 1.0337, 'grad_norm': 6.148647785186768, 'learning_rate': 3.957629979743417e-05, 'epoch': 0.72}


 24%|██▍       | 2980/12348 [1:08:35<3:36:29,  1.39s/it]

{'loss': 1.3311, 'grad_norm': 6.84464693069458, 'learning_rate': 3.953409858203916e-05, 'epoch': 0.72}


 24%|██▍       | 2990/12348 [1:08:49<3:38:36,  1.40s/it]

{'loss': 1.2677, 'grad_norm': 23.796247482299805, 'learning_rate': 3.949189736664416e-05, 'epoch': 0.73}


 24%|██▍       | 3000/12348 [1:09:03<3:36:05,  1.39s/it]

{'loss': 1.2312, 'grad_norm': 10.365814208984375, 'learning_rate': 3.944969615124916e-05, 'epoch': 0.73}


 24%|██▍       | 3010/12348 [1:09:18<3:46:15,  1.45s/it]

{'loss': 1.2998, 'grad_norm': 14.445229530334473, 'learning_rate': 3.940749493585415e-05, 'epoch': 0.73}


 24%|██▍       | 3020/12348 [1:09:33<3:42:21,  1.43s/it]

{'loss': 1.3313, 'grad_norm': 20.254981994628906, 'learning_rate': 3.936529372045915e-05, 'epoch': 0.73}


 25%|██▍       | 3030/12348 [1:09:47<3:35:58,  1.39s/it]

{'loss': 1.1117, 'grad_norm': 7.493418216705322, 'learning_rate': 3.9323092505064145e-05, 'epoch': 0.74}


 25%|██▍       | 3040/12348 [1:10:01<3:35:28,  1.39s/it]

{'loss': 1.4136, 'grad_norm': 7.019394397735596, 'learning_rate': 3.928089128966914e-05, 'epoch': 0.74}


 25%|██▍       | 3050/12348 [1:10:14<3:35:44,  1.39s/it]

{'loss': 1.4958, 'grad_norm': 8.903647422790527, 'learning_rate': 3.9238690074274144e-05, 'epoch': 0.74}


 25%|██▍       | 3060/12348 [1:10:28<3:34:40,  1.39s/it]

{'loss': 1.3392, 'grad_norm': 17.672998428344727, 'learning_rate': 3.919648885887914e-05, 'epoch': 0.74}


 25%|██▍       | 3070/12348 [1:10:42<3:35:01,  1.39s/it]

{'loss': 1.2013, 'grad_norm': 7.842630386352539, 'learning_rate': 3.915428764348413e-05, 'epoch': 0.75}


 25%|██▍       | 3080/12348 [1:10:56<3:35:35,  1.40s/it]

{'loss': 1.2513, 'grad_norm': 13.481382369995117, 'learning_rate': 3.9112086428089126e-05, 'epoch': 0.75}


 25%|██▌       | 3090/12348 [1:11:10<3:34:36,  1.39s/it]

{'loss': 1.563, 'grad_norm': 14.742425918579102, 'learning_rate': 3.906988521269413e-05, 'epoch': 0.75}


 25%|██▌       | 3100/12348 [1:11:24<3:34:55,  1.39s/it]

{'loss': 1.4022, 'grad_norm': 15.591225624084473, 'learning_rate': 3.9027683997299126e-05, 'epoch': 0.75}


 25%|██▌       | 3110/12348 [1:11:38<3:32:30,  1.38s/it]

{'loss': 1.7925, 'grad_norm': 10.80518913269043, 'learning_rate': 3.898548278190412e-05, 'epoch': 0.76}


 25%|██▌       | 3120/12348 [1:11:52<3:33:35,  1.39s/it]

{'loss': 1.0827, 'grad_norm': 9.741644859313965, 'learning_rate': 3.894328156650912e-05, 'epoch': 0.76}


 25%|██▌       | 3130/12348 [1:12:06<3:33:02,  1.39s/it]

{'loss': 1.1663, 'grad_norm': 8.904467582702637, 'learning_rate': 3.8901080351114114e-05, 'epoch': 0.76}


 25%|██▌       | 3140/12348 [1:12:20<3:33:32,  1.39s/it]

{'loss': 1.3428, 'grad_norm': 11.254875183105469, 'learning_rate': 3.885887913571911e-05, 'epoch': 0.76}


 26%|██▌       | 3150/12348 [1:12:33<3:28:26,  1.36s/it]

{'loss': 0.9228, 'grad_norm': 12.776798248291016, 'learning_rate': 3.881667792032411e-05, 'epoch': 0.77}


 26%|██▌       | 3160/12348 [1:12:47<3:26:36,  1.35s/it]

{'loss': 1.2802, 'grad_norm': 12.213338851928711, 'learning_rate': 3.87744767049291e-05, 'epoch': 0.77}


 26%|██▌       | 3170/12348 [1:13:00<3:25:19,  1.34s/it]

{'loss': 0.7958, 'grad_norm': 21.11022186279297, 'learning_rate': 3.87322754895341e-05, 'epoch': 0.77}


 26%|██▌       | 3180/12348 [1:13:14<3:25:10,  1.34s/it]

{'loss': 1.36, 'grad_norm': 10.576478004455566, 'learning_rate': 3.8690074274139096e-05, 'epoch': 0.77}


 26%|██▌       | 3190/12348 [1:13:27<3:24:55,  1.34s/it]

{'loss': 1.1231, 'grad_norm': 17.738588333129883, 'learning_rate': 3.864787305874409e-05, 'epoch': 0.78}


 26%|██▌       | 3200/12348 [1:13:40<3:24:15,  1.34s/it]

{'loss': 0.8952, 'grad_norm': 7.320103168487549, 'learning_rate': 3.860567184334909e-05, 'epoch': 0.78}


 26%|██▌       | 3210/12348 [1:13:54<3:23:57,  1.34s/it]

{'loss': 1.0371, 'grad_norm': 14.66348934173584, 'learning_rate': 3.8563470627954085e-05, 'epoch': 0.78}


 26%|██▌       | 3220/12348 [1:14:07<3:24:18,  1.34s/it]

{'loss': 0.9495, 'grad_norm': 10.71745491027832, 'learning_rate': 3.852126941255909e-05, 'epoch': 0.78}


 26%|██▌       | 3230/12348 [1:14:21<3:23:39,  1.34s/it]

{'loss': 1.0768, 'grad_norm': 7.209287166595459, 'learning_rate': 3.8479068197164084e-05, 'epoch': 0.78}


 26%|██▌       | 3240/12348 [1:14:34<3:23:36,  1.34s/it]

{'loss': 0.8997, 'grad_norm': 7.338598251342773, 'learning_rate': 3.8436866981769074e-05, 'epoch': 0.79}


 26%|██▋       | 3250/12348 [1:14:48<3:23:33,  1.34s/it]

{'loss': 1.2607, 'grad_norm': 12.26395034790039, 'learning_rate': 3.839466576637407e-05, 'epoch': 0.79}


 26%|██▋       | 3260/12348 [1:15:01<3:23:06,  1.34s/it]

{'loss': 1.5083, 'grad_norm': 9.700718879699707, 'learning_rate': 3.835246455097907e-05, 'epoch': 0.79}


 26%|██▋       | 3270/12348 [1:15:15<3:23:12,  1.34s/it]

{'loss': 1.2244, 'grad_norm': 10.63632869720459, 'learning_rate': 3.831026333558407e-05, 'epoch': 0.79}


 27%|██▋       | 3280/12348 [1:15:28<3:22:51,  1.34s/it]

{'loss': 1.0964, 'grad_norm': 10.725419044494629, 'learning_rate': 3.8268062120189066e-05, 'epoch': 0.8}


 27%|██▋       | 3290/12348 [1:15:41<3:22:46,  1.34s/it]

{'loss': 1.3422, 'grad_norm': 10.695486068725586, 'learning_rate': 3.822586090479406e-05, 'epoch': 0.8}


 27%|██▋       | 3300/12348 [1:15:55<3:22:31,  1.34s/it]

{'loss': 1.1266, 'grad_norm': 20.805294036865234, 'learning_rate': 3.818365968939905e-05, 'epoch': 0.8}


 27%|██▋       | 3310/12348 [1:16:08<3:22:35,  1.34s/it]

{'loss': 1.1826, 'grad_norm': 8.437078475952148, 'learning_rate': 3.8141458474004054e-05, 'epoch': 0.8}


 27%|██▋       | 3320/12348 [1:16:22<3:22:10,  1.34s/it]

{'loss': 1.4209, 'grad_norm': 12.832810401916504, 'learning_rate': 3.809925725860905e-05, 'epoch': 0.81}


 27%|██▋       | 3330/12348 [1:16:35<3:21:55,  1.34s/it]

{'loss': 1.5361, 'grad_norm': 10.14058780670166, 'learning_rate': 3.805705604321405e-05, 'epoch': 0.81}


 27%|██▋       | 3340/12348 [1:16:49<3:21:48,  1.34s/it]

{'loss': 1.3596, 'grad_norm': 8.353918075561523, 'learning_rate': 3.801485482781904e-05, 'epoch': 0.81}


 27%|██▋       | 3350/12348 [1:17:02<3:21:38,  1.34s/it]

{'loss': 1.2974, 'grad_norm': 4.985723495483398, 'learning_rate': 3.797265361242404e-05, 'epoch': 0.81}


 27%|██▋       | 3360/12348 [1:17:16<3:20:47,  1.34s/it]

{'loss': 1.2868, 'grad_norm': 8.875734329223633, 'learning_rate': 3.7930452397029036e-05, 'epoch': 0.82}


 27%|██▋       | 3370/12348 [1:17:29<3:19:58,  1.34s/it]

{'loss': 1.4133, 'grad_norm': 14.719619750976562, 'learning_rate': 3.788825118163403e-05, 'epoch': 0.82}


 27%|██▋       | 3380/12348 [1:17:42<3:20:55,  1.34s/it]

{'loss': 1.4166, 'grad_norm': 12.333900451660156, 'learning_rate': 3.784604996623903e-05, 'epoch': 0.82}


 27%|██▋       | 3390/12348 [1:17:56<3:20:36,  1.34s/it]

{'loss': 1.4443, 'grad_norm': 7.748395919799805, 'learning_rate': 3.780384875084403e-05, 'epoch': 0.82}


 28%|██▊       | 3400/12348 [1:18:09<3:20:48,  1.35s/it]

{'loss': 1.1125, 'grad_norm': 14.508622169494629, 'learning_rate': 3.776164753544902e-05, 'epoch': 0.83}


 28%|██▊       | 3410/12348 [1:18:23<3:20:31,  1.35s/it]

{'loss': 0.9346, 'grad_norm': 7.188647270202637, 'learning_rate': 3.771944632005402e-05, 'epoch': 0.83}


 28%|██▊       | 3420/12348 [1:18:36<3:19:54,  1.34s/it]

{'loss': 1.6905, 'grad_norm': 9.655562400817871, 'learning_rate': 3.7677245104659014e-05, 'epoch': 0.83}


 28%|██▊       | 3430/12348 [1:18:50<3:19:43,  1.34s/it]

{'loss': 1.0563, 'grad_norm': 3.1696112155914307, 'learning_rate': 3.763504388926401e-05, 'epoch': 0.83}


 28%|██▊       | 3440/12348 [1:19:03<3:19:32,  1.34s/it]

{'loss': 1.1805, 'grad_norm': 15.487770080566406, 'learning_rate': 3.759284267386901e-05, 'epoch': 0.84}


 28%|██▊       | 3450/12348 [1:19:16<3:19:26,  1.34s/it]

{'loss': 1.2969, 'grad_norm': 8.688976287841797, 'learning_rate': 3.755064145847401e-05, 'epoch': 0.84}


 28%|██▊       | 3460/12348 [1:19:30<3:18:49,  1.34s/it]

{'loss': 1.4463, 'grad_norm': 11.91629409790039, 'learning_rate': 3.7508440243079e-05, 'epoch': 0.84}


 28%|██▊       | 3470/12348 [1:19:43<3:18:40,  1.34s/it]

{'loss': 1.0339, 'grad_norm': 10.415956497192383, 'learning_rate': 3.7466239027683995e-05, 'epoch': 0.84}


 28%|██▊       | 3480/12348 [1:19:57<3:18:54,  1.35s/it]

{'loss': 1.3579, 'grad_norm': 5.998025417327881, 'learning_rate': 3.7424037812289e-05, 'epoch': 0.85}


 28%|██▊       | 3490/12348 [1:20:10<3:18:21,  1.34s/it]

{'loss': 1.1574, 'grad_norm': 11.2514066696167, 'learning_rate': 3.7381836596893994e-05, 'epoch': 0.85}


 28%|██▊       | 3500/12348 [1:20:24<3:18:12,  1.34s/it]

{'loss': 1.4377, 'grad_norm': 9.172218322753906, 'learning_rate': 3.733963538149899e-05, 'epoch': 0.85}


 28%|██▊       | 3510/12348 [1:20:38<3:22:03,  1.37s/it]

{'loss': 0.961, 'grad_norm': 8.664734840393066, 'learning_rate': 3.729743416610399e-05, 'epoch': 0.85}


 29%|██▊       | 3520/12348 [1:20:52<3:18:45,  1.35s/it]

{'loss': 1.2863, 'grad_norm': 11.528997421264648, 'learning_rate': 3.725523295070898e-05, 'epoch': 0.86}


 29%|██▊       | 3530/12348 [1:21:05<3:17:32,  1.34s/it]

{'loss': 1.2061, 'grad_norm': 9.691366195678711, 'learning_rate': 3.721303173531398e-05, 'epoch': 0.86}


 29%|██▊       | 3540/12348 [1:21:19<3:16:40,  1.34s/it]

{'loss': 1.1746, 'grad_norm': 18.499034881591797, 'learning_rate': 3.7170830519918976e-05, 'epoch': 0.86}


 29%|██▊       | 3550/12348 [1:21:32<3:16:27,  1.34s/it]

{'loss': 1.222, 'grad_norm': 10.241195678710938, 'learning_rate': 3.712862930452397e-05, 'epoch': 0.86}


 29%|██▉       | 3560/12348 [1:21:46<3:16:22,  1.34s/it]

{'loss': 1.1767, 'grad_norm': 12.566801071166992, 'learning_rate': 3.708642808912897e-05, 'epoch': 0.86}


 29%|██▉       | 3570/12348 [1:21:59<3:16:15,  1.34s/it]

{'loss': 1.2405, 'grad_norm': 12.85386848449707, 'learning_rate': 3.7044226873733965e-05, 'epoch': 0.87}


 29%|██▉       | 3580/12348 [1:22:13<3:16:24,  1.34s/it]

{'loss': 1.211, 'grad_norm': 10.11059284210205, 'learning_rate': 3.700202565833896e-05, 'epoch': 0.87}


 29%|██▉       | 3590/12348 [1:22:26<3:16:10,  1.34s/it]

{'loss': 1.0807, 'grad_norm': 2.958728551864624, 'learning_rate': 3.695982444294396e-05, 'epoch': 0.87}


 29%|██▉       | 3600/12348 [1:22:39<3:15:54,  1.34s/it]

{'loss': 1.0645, 'grad_norm': 11.680769920349121, 'learning_rate': 3.6917623227548954e-05, 'epoch': 0.87}


 29%|██▉       | 3610/12348 [1:22:53<3:15:09,  1.34s/it]

{'loss': 0.8722, 'grad_norm': 4.7370710372924805, 'learning_rate': 3.687542201215396e-05, 'epoch': 0.88}


 29%|██▉       | 3620/12348 [1:23:06<3:15:25,  1.34s/it]

{'loss': 1.287, 'grad_norm': 14.06783676147461, 'learning_rate': 3.6833220796758946e-05, 'epoch': 0.88}


 29%|██▉       | 3630/12348 [1:23:20<3:15:05,  1.34s/it]

{'loss': 0.8508, 'grad_norm': 6.589122772216797, 'learning_rate': 3.679101958136394e-05, 'epoch': 0.88}


 29%|██▉       | 3640/12348 [1:23:33<3:15:22,  1.35s/it]

{'loss': 0.9704, 'grad_norm': 13.492562294006348, 'learning_rate': 3.674881836596894e-05, 'epoch': 0.88}


 30%|██▉       | 3650/12348 [1:23:47<3:15:08,  1.35s/it]

{'loss': 1.3079, 'grad_norm': 21.2347354888916, 'learning_rate': 3.670661715057394e-05, 'epoch': 0.89}


 30%|██▉       | 3660/12348 [1:24:00<3:13:56,  1.34s/it]

{'loss': 1.1763, 'grad_norm': 10.327705383300781, 'learning_rate': 3.666441593517894e-05, 'epoch': 0.89}


 30%|██▉       | 3670/12348 [1:24:14<3:14:07,  1.34s/it]

{'loss': 1.0249, 'grad_norm': 14.007782936096191, 'learning_rate': 3.6622214719783934e-05, 'epoch': 0.89}


 30%|██▉       | 3680/12348 [1:24:27<3:14:02,  1.34s/it]

{'loss': 1.0457, 'grad_norm': 14.285844802856445, 'learning_rate': 3.6580013504388924e-05, 'epoch': 0.89}


 30%|██▉       | 3690/12348 [1:24:41<3:14:36,  1.35s/it]

{'loss': 1.2365, 'grad_norm': 7.279311180114746, 'learning_rate': 3.653781228899392e-05, 'epoch': 0.9}


 30%|██▉       | 3700/12348 [1:24:54<3:14:35,  1.35s/it]

{'loss': 1.0861, 'grad_norm': 31.103986740112305, 'learning_rate': 3.649561107359892e-05, 'epoch': 0.9}


 30%|███       | 3710/12348 [1:25:08<3:13:50,  1.35s/it]

{'loss': 1.2495, 'grad_norm': 26.420696258544922, 'learning_rate': 3.645340985820392e-05, 'epoch': 0.9}


 30%|███       | 3720/12348 [1:25:21<3:13:50,  1.35s/it]

{'loss': 0.8956, 'grad_norm': 10.458259582519531, 'learning_rate': 3.6411208642808916e-05, 'epoch': 0.9}


 30%|███       | 3730/12348 [1:25:35<3:13:25,  1.35s/it]

{'loss': 1.2153, 'grad_norm': 7.78428840637207, 'learning_rate': 3.636900742741391e-05, 'epoch': 0.91}


 30%|███       | 3740/12348 [1:25:48<3:12:39,  1.34s/it]

{'loss': 0.9853, 'grad_norm': 10.252971649169922, 'learning_rate': 3.632680621201891e-05, 'epoch': 0.91}


 30%|███       | 3750/12348 [1:26:01<3:13:02,  1.35s/it]

{'loss': 0.9945, 'grad_norm': 6.904855251312256, 'learning_rate': 3.6284604996623905e-05, 'epoch': 0.91}


 30%|███       | 3760/12348 [1:26:15<3:12:15,  1.34s/it]

{'loss': 1.1954, 'grad_norm': 16.52489471435547, 'learning_rate': 3.62424037812289e-05, 'epoch': 0.91}


 31%|███       | 3770/12348 [1:26:28<3:12:07,  1.34s/it]

{'loss': 1.099, 'grad_norm': 14.103767395019531, 'learning_rate': 3.62002025658339e-05, 'epoch': 0.92}


 31%|███       | 3780/12348 [1:26:42<3:12:10,  1.35s/it]

{'loss': 0.8264, 'grad_norm': 13.590079307556152, 'learning_rate': 3.61580013504389e-05, 'epoch': 0.92}


 31%|███       | 3790/12348 [1:26:55<3:12:03,  1.35s/it]

{'loss': 1.5873, 'grad_norm': 16.208738327026367, 'learning_rate': 3.611580013504389e-05, 'epoch': 0.92}


 31%|███       | 3800/12348 [1:27:09<3:12:33,  1.35s/it]

{'loss': 1.3149, 'grad_norm': 21.441818237304688, 'learning_rate': 3.6073598919648886e-05, 'epoch': 0.92}


 31%|███       | 3810/12348 [1:27:22<3:11:26,  1.35s/it]

{'loss': 1.3986, 'grad_norm': 7.936312198638916, 'learning_rate': 3.603139770425388e-05, 'epoch': 0.93}


 31%|███       | 3820/12348 [1:27:36<3:11:11,  1.35s/it]

{'loss': 1.1796, 'grad_norm': 5.151904582977295, 'learning_rate': 3.598919648885888e-05, 'epoch': 0.93}


 31%|███       | 3830/12348 [1:27:49<3:11:04,  1.35s/it]

{'loss': 1.1706, 'grad_norm': 8.248492240905762, 'learning_rate': 3.594699527346388e-05, 'epoch': 0.93}


 31%|███       | 3840/12348 [1:28:03<3:10:54,  1.35s/it]

{'loss': 1.2705, 'grad_norm': 4.095322608947754, 'learning_rate': 3.590479405806887e-05, 'epoch': 0.93}


 31%|███       | 3850/12348 [1:28:16<3:10:18,  1.34s/it]

{'loss': 1.4594, 'grad_norm': 15.265340805053711, 'learning_rate': 3.586259284267387e-05, 'epoch': 0.94}


 31%|███▏      | 3860/12348 [1:28:30<3:10:49,  1.35s/it]

{'loss': 0.9056, 'grad_norm': 11.409424781799316, 'learning_rate': 3.5820391627278864e-05, 'epoch': 0.94}


 31%|███▏      | 3870/12348 [1:28:43<3:10:38,  1.35s/it]

{'loss': 0.9591, 'grad_norm': 12.496162414550781, 'learning_rate': 3.577819041188387e-05, 'epoch': 0.94}


 31%|███▏      | 3880/12348 [1:28:57<3:09:45,  1.34s/it]

{'loss': 0.752, 'grad_norm': 8.7913179397583, 'learning_rate': 3.573598919648886e-05, 'epoch': 0.94}


 32%|███▏      | 3890/12348 [1:29:10<3:09:21,  1.34s/it]

{'loss': 1.0285, 'grad_norm': 7.294956207275391, 'learning_rate': 3.569378798109386e-05, 'epoch': 0.95}


 32%|███▏      | 3900/12348 [1:29:24<3:09:05,  1.34s/it]

{'loss': 1.2957, 'grad_norm': 4.703986167907715, 'learning_rate': 3.565158676569885e-05, 'epoch': 0.95}


 32%|███▏      | 3910/12348 [1:29:37<3:09:35,  1.35s/it]

{'loss': 1.0178, 'grad_norm': 5.459143161773682, 'learning_rate': 3.560938555030385e-05, 'epoch': 0.95}


 32%|███▏      | 3920/12348 [1:29:51<3:08:36,  1.34s/it]

{'loss': 0.776, 'grad_norm': 5.756947040557861, 'learning_rate': 3.556718433490885e-05, 'epoch': 0.95}


 32%|███▏      | 3930/12348 [1:30:04<3:08:29,  1.34s/it]

{'loss': 1.1366, 'grad_norm': 11.632330894470215, 'learning_rate': 3.5524983119513845e-05, 'epoch': 0.95}


 32%|███▏      | 3940/12348 [1:30:18<3:09:01,  1.35s/it]

{'loss': 1.2324, 'grad_norm': 8.341232299804688, 'learning_rate': 3.548278190411884e-05, 'epoch': 0.96}


 32%|███▏      | 3950/12348 [1:30:31<3:08:31,  1.35s/it]

{'loss': 1.1461, 'grad_norm': 19.728975296020508, 'learning_rate': 3.544058068872384e-05, 'epoch': 0.96}


 32%|███▏      | 3960/12348 [1:30:44<3:07:53,  1.34s/it]

{'loss': 1.0675, 'grad_norm': 13.180700302124023, 'learning_rate': 3.5398379473328834e-05, 'epoch': 0.96}


 32%|███▏      | 3970/12348 [1:30:58<3:07:52,  1.35s/it]

{'loss': 1.1284, 'grad_norm': 4.4801859855651855, 'learning_rate': 3.535617825793383e-05, 'epoch': 0.96}


 32%|███▏      | 3980/12348 [1:31:11<3:07:30,  1.34s/it]

{'loss': 1.256, 'grad_norm': 11.959029197692871, 'learning_rate': 3.5313977042538826e-05, 'epoch': 0.97}


 32%|███▏      | 3990/12348 [1:31:25<3:07:40,  1.35s/it]

{'loss': 1.426, 'grad_norm': 9.954089164733887, 'learning_rate': 3.527177582714382e-05, 'epoch': 0.97}


 32%|███▏      | 4000/12348 [1:31:38<3:06:48,  1.34s/it]

{'loss': 1.2823, 'grad_norm': 18.151084899902344, 'learning_rate': 3.5229574611748826e-05, 'epoch': 0.97}


 32%|███▏      | 4010/12348 [1:31:53<3:08:53,  1.36s/it]

{'loss': 1.6286, 'grad_norm': 10.669901847839355, 'learning_rate': 3.5187373396353815e-05, 'epoch': 0.97}


 33%|███▎      | 4020/12348 [1:32:06<3:06:55,  1.35s/it]

{'loss': 1.218, 'grad_norm': 9.071090698242188, 'learning_rate': 3.514517218095881e-05, 'epoch': 0.98}


 33%|███▎      | 4030/12348 [1:32:20<3:06:22,  1.34s/it]

{'loss': 0.9587, 'grad_norm': 9.06705379486084, 'learning_rate': 3.510297096556381e-05, 'epoch': 0.98}


 33%|███▎      | 4040/12348 [1:32:33<3:06:14,  1.34s/it]

{'loss': 1.0402, 'grad_norm': 14.36350154876709, 'learning_rate': 3.506076975016881e-05, 'epoch': 0.98}


 33%|███▎      | 4050/12348 [1:32:47<3:06:02,  1.35s/it]

{'loss': 1.5038, 'grad_norm': 12.293768882751465, 'learning_rate': 3.501856853477381e-05, 'epoch': 0.98}


 33%|███▎      | 4060/12348 [1:33:00<3:05:13,  1.34s/it]

{'loss': 1.0665, 'grad_norm': 10.008777618408203, 'learning_rate': 3.49763673193788e-05, 'epoch': 0.99}


 33%|███▎      | 4070/12348 [1:33:14<3:06:07,  1.35s/it]

{'loss': 0.8358, 'grad_norm': 10.398530006408691, 'learning_rate': 3.493416610398379e-05, 'epoch': 0.99}


 33%|███▎      | 4080/12348 [1:33:27<3:06:05,  1.35s/it]

{'loss': 1.0892, 'grad_norm': 8.087690353393555, 'learning_rate': 3.489196488858879e-05, 'epoch': 0.99}


 33%|███▎      | 4090/12348 [1:33:41<3:05:03,  1.34s/it]

{'loss': 1.0217, 'grad_norm': 15.128432273864746, 'learning_rate': 3.484976367319379e-05, 'epoch': 0.99}


 33%|███▎      | 4100/12348 [1:33:54<3:04:45,  1.34s/it]

{'loss': 0.8838, 'grad_norm': 10.127570152282715, 'learning_rate': 3.480756245779879e-05, 'epoch': 1.0}


 33%|███▎      | 4110/12348 [1:34:08<3:04:34,  1.34s/it]

{'loss': 1.3296, 'grad_norm': 10.946629524230957, 'learning_rate': 3.4765361242403785e-05, 'epoch': 1.0}


 33%|███▎      | 4120/12348 [1:34:21<2:59:37,  1.31s/it]

{'loss': 1.0367, 'grad_norm': 11.564139366149902, 'learning_rate': 3.4723160027008774e-05, 'epoch': 1.0}


 33%|███▎      | 4130/12348 [1:34:34<3:05:58,  1.36s/it]

{'loss': 1.0181, 'grad_norm': 6.515241622924805, 'learning_rate': 3.468095881161378e-05, 'epoch': 1.0}


 34%|███▎      | 4140/12348 [1:34:48<3:05:03,  1.35s/it]

{'loss': 0.859, 'grad_norm': 5.82297945022583, 'learning_rate': 3.4638757596218774e-05, 'epoch': 1.01}


 34%|███▎      | 4150/12348 [1:35:01<3:04:40,  1.35s/it]

{'loss': 0.765, 'grad_norm': 22.268070220947266, 'learning_rate': 3.459655638082377e-05, 'epoch': 1.01}


 34%|███▎      | 4160/12348 [1:35:15<3:03:50,  1.35s/it]

{'loss': 0.7051, 'grad_norm': 12.503342628479004, 'learning_rate': 3.4554355165428766e-05, 'epoch': 1.01}


 34%|███▍      | 4170/12348 [1:35:28<3:03:18,  1.34s/it]

{'loss': 0.9771, 'grad_norm': 4.19439172744751, 'learning_rate': 3.451215395003376e-05, 'epoch': 1.01}


 34%|███▍      | 4180/12348 [1:35:42<3:03:06,  1.35s/it]

{'loss': 1.0712, 'grad_norm': 7.2152862548828125, 'learning_rate': 3.446995273463876e-05, 'epoch': 1.02}


 34%|███▍      | 4190/12348 [1:35:55<3:02:52,  1.35s/it]

{'loss': 0.738, 'grad_norm': 7.287938594818115, 'learning_rate': 3.4427751519243755e-05, 'epoch': 1.02}


 34%|███▍      | 4200/12348 [1:36:08<3:01:52,  1.34s/it]

{'loss': 1.0053, 'grad_norm': 8.4620361328125, 'learning_rate': 3.438555030384875e-05, 'epoch': 1.02}


 34%|███▍      | 4210/12348 [1:36:22<3:01:59,  1.34s/it]

{'loss': 1.2456, 'grad_norm': 12.795830726623535, 'learning_rate': 3.434334908845375e-05, 'epoch': 1.02}


 34%|███▍      | 4220/12348 [1:36:35<3:02:05,  1.34s/it]

{'loss': 0.6576, 'grad_norm': 8.698575019836426, 'learning_rate': 3.430114787305875e-05, 'epoch': 1.03}


 34%|███▍      | 4230/12348 [1:36:49<3:02:05,  1.35s/it]

{'loss': 0.4668, 'grad_norm': 9.244546890258789, 'learning_rate': 3.425894665766374e-05, 'epoch': 1.03}


 34%|███▍      | 4240/12348 [1:37:02<3:01:51,  1.35s/it]

{'loss': 0.8497, 'grad_norm': 18.980899810791016, 'learning_rate': 3.4216745442268736e-05, 'epoch': 1.03}


 34%|███▍      | 4250/12348 [1:37:16<3:00:57,  1.34s/it]

{'loss': 1.0659, 'grad_norm': 16.879404067993164, 'learning_rate': 3.417454422687373e-05, 'epoch': 1.03}


 34%|███▍      | 4260/12348 [1:37:29<3:01:05,  1.34s/it]

{'loss': 0.7376, 'grad_norm': 10.902250289916992, 'learning_rate': 3.4132343011478736e-05, 'epoch': 1.03}


 35%|███▍      | 4270/12348 [1:37:43<3:01:40,  1.35s/it]

{'loss': 1.0121, 'grad_norm': 8.138484001159668, 'learning_rate': 3.409014179608373e-05, 'epoch': 1.04}


 35%|███▍      | 4280/12348 [1:37:56<3:00:50,  1.34s/it]

{'loss': 1.0379, 'grad_norm': 9.461983680725098, 'learning_rate': 3.404794058068873e-05, 'epoch': 1.04}


 35%|███▍      | 4290/12348 [1:38:10<3:01:29,  1.35s/it]

{'loss': 0.849, 'grad_norm': 14.646530151367188, 'learning_rate': 3.400573936529372e-05, 'epoch': 1.04}


 35%|███▍      | 4300/12348 [1:38:23<3:00:34,  1.35s/it]

{'loss': 0.7892, 'grad_norm': 9.182937622070312, 'learning_rate': 3.3963538149898714e-05, 'epoch': 1.04}


 35%|███▍      | 4310/12348 [1:38:37<3:01:29,  1.35s/it]

{'loss': 1.1903, 'grad_norm': 14.53205680847168, 'learning_rate': 3.392133693450372e-05, 'epoch': 1.05}


 35%|███▍      | 4320/12348 [1:38:50<3:00:28,  1.35s/it]

{'loss': 0.9372, 'grad_norm': 16.5977725982666, 'learning_rate': 3.3879135719108714e-05, 'epoch': 1.05}


 35%|███▌      | 4330/12348 [1:39:04<2:59:35,  1.34s/it]

{'loss': 0.9949, 'grad_norm': 16.618854522705078, 'learning_rate': 3.383693450371371e-05, 'epoch': 1.05}


 35%|███▌      | 4340/12348 [1:39:17<2:59:01,  1.34s/it]

{'loss': 0.5169, 'grad_norm': 6.472561836242676, 'learning_rate': 3.3794733288318706e-05, 'epoch': 1.05}


 35%|███▌      | 4350/12348 [1:39:31<3:00:22,  1.35s/it]

{'loss': 1.2208, 'grad_norm': 12.133040428161621, 'learning_rate': 3.37525320729237e-05, 'epoch': 1.06}


 35%|███▌      | 4360/12348 [1:39:44<2:59:15,  1.35s/it]

{'loss': 1.259, 'grad_norm': 13.762012481689453, 'learning_rate': 3.37103308575287e-05, 'epoch': 1.06}


 35%|███▌      | 4370/12348 [1:39:58<2:59:44,  1.35s/it]

{'loss': 1.0942, 'grad_norm': 16.962608337402344, 'learning_rate': 3.3668129642133695e-05, 'epoch': 1.06}


 35%|███▌      | 4380/12348 [1:40:11<2:58:21,  1.34s/it]

{'loss': 0.7325, 'grad_norm': 19.632930755615234, 'learning_rate': 3.362592842673869e-05, 'epoch': 1.06}


 36%|███▌      | 4390/12348 [1:40:25<2:58:59,  1.35s/it]

{'loss': 0.8613, 'grad_norm': 10.722367286682129, 'learning_rate': 3.358372721134369e-05, 'epoch': 1.07}


 36%|███▌      | 4400/12348 [1:40:38<2:58:15,  1.35s/it]

{'loss': 0.8193, 'grad_norm': 16.474090576171875, 'learning_rate': 3.3541525995948684e-05, 'epoch': 1.07}


 36%|███▌      | 4410/12348 [1:40:52<2:58:39,  1.35s/it]

{'loss': 0.8915, 'grad_norm': 27.885997772216797, 'learning_rate': 3.349932478055368e-05, 'epoch': 1.07}


 36%|███▌      | 4420/12348 [1:41:05<2:57:44,  1.35s/it]

{'loss': 0.6944, 'grad_norm': 7.781244277954102, 'learning_rate': 3.3457123565158676e-05, 'epoch': 1.07}


 36%|███▌      | 4430/12348 [1:41:18<2:57:06,  1.34s/it]

{'loss': 0.6713, 'grad_norm': 7.298092365264893, 'learning_rate': 3.341492234976367e-05, 'epoch': 1.08}


 36%|███▌      | 4440/12348 [1:41:32<2:57:22,  1.35s/it]

{'loss': 0.9008, 'grad_norm': 10.42914867401123, 'learning_rate': 3.3372721134368676e-05, 'epoch': 1.08}


 36%|███▌      | 4450/12348 [1:41:45<2:57:49,  1.35s/it]

{'loss': 0.7442, 'grad_norm': 19.66181755065918, 'learning_rate': 3.3330519918973665e-05, 'epoch': 1.08}


 36%|███▌      | 4460/12348 [1:41:59<2:56:45,  1.34s/it]

{'loss': 1.3482, 'grad_norm': 10.71931266784668, 'learning_rate': 3.328831870357866e-05, 'epoch': 1.08}


 36%|███▌      | 4470/12348 [1:42:12<2:56:29,  1.34s/it]

{'loss': 1.0022, 'grad_norm': 9.20903491973877, 'learning_rate': 3.324611748818366e-05, 'epoch': 1.09}


 36%|███▋      | 4480/12348 [1:42:26<2:57:13,  1.35s/it]

{'loss': 1.0326, 'grad_norm': 22.105497360229492, 'learning_rate': 3.320391627278866e-05, 'epoch': 1.09}


 36%|███▋      | 4490/12348 [1:42:40<2:57:32,  1.36s/it]

{'loss': 0.9991, 'grad_norm': 9.258273124694824, 'learning_rate': 3.316171505739366e-05, 'epoch': 1.09}


 36%|███▋      | 4500/12348 [1:42:53<2:56:42,  1.35s/it]

{'loss': 1.1181, 'grad_norm': 5.015225887298584, 'learning_rate': 3.3119513841998654e-05, 'epoch': 1.09}


 37%|███▋      | 4510/12348 [1:43:08<2:59:08,  1.37s/it]

{'loss': 0.9699, 'grad_norm': 8.295215606689453, 'learning_rate': 3.307731262660364e-05, 'epoch': 1.1}


 37%|███▋      | 4520/12348 [1:43:21<2:56:07,  1.35s/it]

{'loss': 0.9033, 'grad_norm': 15.294395446777344, 'learning_rate': 3.3035111411208646e-05, 'epoch': 1.1}


 37%|███▋      | 4530/12348 [1:43:35<2:55:50,  1.35s/it]

{'loss': 1.1765, 'grad_norm': 6.616250038146973, 'learning_rate': 3.299291019581364e-05, 'epoch': 1.1}


 37%|███▋      | 4540/12348 [1:43:48<2:55:50,  1.35s/it]

{'loss': 0.8435, 'grad_norm': 14.108840942382812, 'learning_rate': 3.295070898041864e-05, 'epoch': 1.1}


 37%|███▋      | 4550/12348 [1:44:02<2:55:24,  1.35s/it]

{'loss': 0.9311, 'grad_norm': 10.595747947692871, 'learning_rate': 3.2908507765023635e-05, 'epoch': 1.11}


 37%|███▋      | 4560/12348 [1:44:15<2:55:24,  1.35s/it]

{'loss': 1.1681, 'grad_norm': 11.66101360321045, 'learning_rate': 3.286630654962863e-05, 'epoch': 1.11}


 37%|███▋      | 4570/12348 [1:44:29<2:55:32,  1.35s/it]

{'loss': 0.7788, 'grad_norm': 8.031567573547363, 'learning_rate': 3.282410533423363e-05, 'epoch': 1.11}


 37%|███▋      | 4580/12348 [1:44:42<2:54:43,  1.35s/it]

{'loss': 0.6405, 'grad_norm': 6.57482385635376, 'learning_rate': 3.2781904118838624e-05, 'epoch': 1.11}


 37%|███▋      | 4590/12348 [1:44:56<2:55:30,  1.36s/it]

{'loss': 0.7933, 'grad_norm': 12.152660369873047, 'learning_rate': 3.273970290344362e-05, 'epoch': 1.12}


 37%|███▋      | 4600/12348 [1:45:09<2:54:23,  1.35s/it]

{'loss': 0.9262, 'grad_norm': 18.004709243774414, 'learning_rate': 3.2697501688048616e-05, 'epoch': 1.12}


 37%|███▋      | 4610/12348 [1:45:23<2:53:30,  1.35s/it]

{'loss': 0.7918, 'grad_norm': 11.841161727905273, 'learning_rate': 3.265530047265361e-05, 'epoch': 1.12}


 37%|███▋      | 4620/12348 [1:45:36<2:52:57,  1.34s/it]

{'loss': 1.3486, 'grad_norm': 6.989008903503418, 'learning_rate': 3.261309925725861e-05, 'epoch': 1.12}


 37%|███▋      | 4630/12348 [1:45:50<2:53:06,  1.35s/it]

{'loss': 0.9293, 'grad_norm': 8.755552291870117, 'learning_rate': 3.2570898041863605e-05, 'epoch': 1.12}


 38%|███▊      | 4640/12348 [1:46:03<2:52:51,  1.35s/it]

{'loss': 0.9453, 'grad_norm': 18.160585403442383, 'learning_rate': 3.25286968264686e-05, 'epoch': 1.13}


 38%|███▊      | 4650/12348 [1:46:17<2:52:30,  1.34s/it]

{'loss': 0.8529, 'grad_norm': 8.78465747833252, 'learning_rate': 3.2486495611073605e-05, 'epoch': 1.13}


 38%|███▊      | 4660/12348 [1:46:30<2:53:01,  1.35s/it]

{'loss': 0.8744, 'grad_norm': 14.263411521911621, 'learning_rate': 3.24442943956786e-05, 'epoch': 1.13}


 38%|███▊      | 4670/12348 [1:46:44<2:52:36,  1.35s/it]

{'loss': 0.9652, 'grad_norm': 27.745710372924805, 'learning_rate': 3.240209318028359e-05, 'epoch': 1.13}


 38%|███▊      | 4680/12348 [1:46:57<2:52:51,  1.35s/it]

{'loss': 0.785, 'grad_norm': 17.274738311767578, 'learning_rate': 3.235989196488859e-05, 'epoch': 1.14}


 38%|███▊      | 4690/12348 [1:47:11<2:52:42,  1.35s/it]

{'loss': 1.1719, 'grad_norm': 5.143543243408203, 'learning_rate': 3.231769074949358e-05, 'epoch': 1.14}


 38%|███▊      | 4700/12348 [1:47:24<2:52:04,  1.35s/it]

{'loss': 0.9796, 'grad_norm': 14.543496131896973, 'learning_rate': 3.2275489534098586e-05, 'epoch': 1.14}


 38%|███▊      | 4710/12348 [1:47:38<2:52:07,  1.35s/it]

{'loss': 1.1376, 'grad_norm': 15.867918014526367, 'learning_rate': 3.223328831870358e-05, 'epoch': 1.14}


 38%|███▊      | 4720/12348 [1:47:51<2:51:41,  1.35s/it]

{'loss': 0.8238, 'grad_norm': 6.2117204666137695, 'learning_rate': 3.219108710330858e-05, 'epoch': 1.15}


 38%|███▊      | 4730/12348 [1:48:05<2:51:02,  1.35s/it]

{'loss': 0.7903, 'grad_norm': 34.112911224365234, 'learning_rate': 3.214888588791357e-05, 'epoch': 1.15}


 38%|███▊      | 4740/12348 [1:48:18<2:51:15,  1.35s/it]

{'loss': 1.1176, 'grad_norm': 15.515567779541016, 'learning_rate': 3.210668467251857e-05, 'epoch': 1.15}


 38%|███▊      | 4750/12348 [1:48:32<2:51:25,  1.35s/it]

{'loss': 0.9304, 'grad_norm': 13.181065559387207, 'learning_rate': 3.206448345712357e-05, 'epoch': 1.15}


 39%|███▊      | 4760/12348 [1:48:46<2:50:50,  1.35s/it]

{'loss': 0.9832, 'grad_norm': 19.807836532592773, 'learning_rate': 3.2022282241728564e-05, 'epoch': 1.16}


 39%|███▊      | 4770/12348 [1:48:59<2:50:25,  1.35s/it]

{'loss': 1.1102, 'grad_norm': 14.4014310836792, 'learning_rate': 3.198008102633356e-05, 'epoch': 1.16}


 39%|███▊      | 4780/12348 [1:49:13<2:49:52,  1.35s/it]

{'loss': 0.7738, 'grad_norm': 12.198324203491211, 'learning_rate': 3.1937879810938556e-05, 'epoch': 1.16}


 39%|███▉      | 4790/12348 [1:49:26<2:50:25,  1.35s/it]

{'loss': 0.8113, 'grad_norm': 16.666669845581055, 'learning_rate': 3.189567859554355e-05, 'epoch': 1.16}


 39%|███▉      | 4800/12348 [1:49:40<2:49:34,  1.35s/it]

{'loss': 0.9205, 'grad_norm': 7.889746189117432, 'learning_rate': 3.185347738014855e-05, 'epoch': 1.17}


 39%|███▉      | 4810/12348 [1:49:53<2:49:21,  1.35s/it]

{'loss': 0.693, 'grad_norm': 42.0567741394043, 'learning_rate': 3.1811276164753545e-05, 'epoch': 1.17}


 39%|███▉      | 4820/12348 [1:50:07<2:48:39,  1.34s/it]

{'loss': 1.0702, 'grad_norm': 14.682048797607422, 'learning_rate': 3.176907494935854e-05, 'epoch': 1.17}


 39%|███▉      | 4830/12348 [1:50:20<2:48:21,  1.34s/it]

{'loss': 0.975, 'grad_norm': 15.07065486907959, 'learning_rate': 3.1726873733963545e-05, 'epoch': 1.17}


 39%|███▉      | 4840/12348 [1:50:34<2:48:00,  1.34s/it]

{'loss': 1.0728, 'grad_norm': 16.252904891967773, 'learning_rate': 3.1684672518568534e-05, 'epoch': 1.18}


 39%|███▉      | 4850/12348 [1:50:47<2:47:56,  1.34s/it]

{'loss': 0.9574, 'grad_norm': 10.639796257019043, 'learning_rate': 3.164247130317353e-05, 'epoch': 1.18}


 39%|███▉      | 4860/12348 [1:51:01<2:48:20,  1.35s/it]

{'loss': 0.7844, 'grad_norm': 6.040040016174316, 'learning_rate': 3.160027008777853e-05, 'epoch': 1.18}


 39%|███▉      | 4870/12348 [1:51:14<2:48:14,  1.35s/it]

{'loss': 1.2241, 'grad_norm': 7.581918239593506, 'learning_rate': 3.155806887238353e-05, 'epoch': 1.18}


 40%|███▉      | 4880/12348 [1:51:28<2:47:57,  1.35s/it]

{'loss': 0.7142, 'grad_norm': 15.918221473693848, 'learning_rate': 3.1515867656988526e-05, 'epoch': 1.19}


 40%|███▉      | 4890/12348 [1:51:41<2:47:44,  1.35s/it]

{'loss': 0.8568, 'grad_norm': 10.34267520904541, 'learning_rate': 3.1473666441593516e-05, 'epoch': 1.19}


 40%|███▉      | 4900/12348 [1:51:55<2:48:06,  1.35s/it]

{'loss': 1.0328, 'grad_norm': 8.961493492126465, 'learning_rate': 3.143146522619851e-05, 'epoch': 1.19}


 40%|███▉      | 4910/12348 [1:52:08<2:47:26,  1.35s/it]

{'loss': 1.0761, 'grad_norm': 11.402952194213867, 'learning_rate': 3.1389264010803515e-05, 'epoch': 1.19}


 40%|███▉      | 4920/12348 [1:52:22<2:47:28,  1.35s/it]

{'loss': 0.8647, 'grad_norm': 10.074902534484863, 'learning_rate': 3.134706279540851e-05, 'epoch': 1.2}


 40%|███▉      | 4930/12348 [1:52:35<2:47:32,  1.36s/it]

{'loss': 0.9953, 'grad_norm': 9.192719459533691, 'learning_rate': 3.130486158001351e-05, 'epoch': 1.2}


 40%|████      | 4940/12348 [1:52:49<2:47:00,  1.35s/it]

{'loss': 1.2145, 'grad_norm': 8.831969261169434, 'learning_rate': 3.1262660364618504e-05, 'epoch': 1.2}


 40%|████      | 4950/12348 [1:53:02<2:46:02,  1.35s/it]

{'loss': 1.0885, 'grad_norm': 11.7861909866333, 'learning_rate': 3.122045914922349e-05, 'epoch': 1.2}


 40%|████      | 4960/12348 [1:53:16<2:45:52,  1.35s/it]

{'loss': 0.6791, 'grad_norm': 16.442138671875, 'learning_rate': 3.1178257933828496e-05, 'epoch': 1.21}


 40%|████      | 4970/12348 [1:53:29<2:45:58,  1.35s/it]

{'loss': 0.825, 'grad_norm': 8.479351043701172, 'learning_rate': 3.113605671843349e-05, 'epoch': 1.21}


 40%|████      | 4980/12348 [1:53:43<2:45:49,  1.35s/it]

{'loss': 1.0116, 'grad_norm': 1.895908236503601, 'learning_rate': 3.109385550303849e-05, 'epoch': 1.21}


 40%|████      | 4990/12348 [1:53:56<2:45:56,  1.35s/it]

{'loss': 0.6624, 'grad_norm': 24.98076629638672, 'learning_rate': 3.1051654287643485e-05, 'epoch': 1.21}


 40%|████      | 5000/12348 [1:54:10<2:44:40,  1.34s/it]

{'loss': 0.703, 'grad_norm': 10.661664962768555, 'learning_rate': 3.100945307224848e-05, 'epoch': 1.21}


 41%|████      | 5010/12348 [1:54:24<2:48:58,  1.38s/it]

{'loss': 0.6607, 'grad_norm': 9.092447280883789, 'learning_rate': 3.096725185685348e-05, 'epoch': 1.22}


 41%|████      | 5020/12348 [1:54:38<2:44:36,  1.35s/it]

{'loss': 1.0126, 'grad_norm': 5.088666915893555, 'learning_rate': 3.0925050641458474e-05, 'epoch': 1.22}


 41%|████      | 5030/12348 [1:54:52<2:44:42,  1.35s/it]

{'loss': 0.7908, 'grad_norm': 8.992178916931152, 'learning_rate': 3.088284942606347e-05, 'epoch': 1.22}


 41%|████      | 5040/12348 [1:55:05<2:43:49,  1.35s/it]

{'loss': 1.1194, 'grad_norm': 4.250816345214844, 'learning_rate': 3.0840648210668473e-05, 'epoch': 1.22}


 41%|████      | 5050/12348 [1:55:19<2:44:38,  1.35s/it]

{'loss': 0.8147, 'grad_norm': 12.354120254516602, 'learning_rate': 3.079844699527347e-05, 'epoch': 1.23}


 41%|████      | 5060/12348 [1:55:32<2:44:13,  1.35s/it]

{'loss': 0.6168, 'grad_norm': 7.57598352432251, 'learning_rate': 3.075624577987846e-05, 'epoch': 1.23}


 41%|████      | 5070/12348 [1:55:46<2:43:56,  1.35s/it]

{'loss': 0.891, 'grad_norm': 12.246764183044434, 'learning_rate': 3.0714044564483456e-05, 'epoch': 1.23}


 41%|████      | 5080/12348 [1:55:59<2:43:40,  1.35s/it]

{'loss': 0.9398, 'grad_norm': 4.797857761383057, 'learning_rate': 3.067184334908845e-05, 'epoch': 1.23}


 41%|████      | 5090/12348 [1:56:13<2:43:10,  1.35s/it]

{'loss': 0.7707, 'grad_norm': 22.401355743408203, 'learning_rate': 3.0629642133693455e-05, 'epoch': 1.24}


 41%|████▏     | 5100/12348 [1:56:26<2:43:06,  1.35s/it]

{'loss': 1.047, 'grad_norm': 8.913973808288574, 'learning_rate': 3.058744091829845e-05, 'epoch': 1.24}


 41%|████▏     | 5110/12348 [1:56:40<2:44:01,  1.36s/it]

{'loss': 0.6347, 'grad_norm': 25.913633346557617, 'learning_rate': 3.054523970290345e-05, 'epoch': 1.24}


 41%|████▏     | 5120/12348 [1:56:53<2:42:49,  1.35s/it]

{'loss': 0.6387, 'grad_norm': 12.76533031463623, 'learning_rate': 3.050303848750844e-05, 'epoch': 1.24}


 42%|████▏     | 5130/12348 [1:57:07<2:42:09,  1.35s/it]

{'loss': 1.0663, 'grad_norm': 8.856313705444336, 'learning_rate': 3.046083727211344e-05, 'epoch': 1.25}


 42%|████▏     | 5140/12348 [1:57:20<2:41:55,  1.35s/it]

{'loss': 0.9125, 'grad_norm': 15.567537307739258, 'learning_rate': 3.0418636056718436e-05, 'epoch': 1.25}


 42%|████▏     | 5150/12348 [1:57:34<2:41:42,  1.35s/it]

{'loss': 0.9477, 'grad_norm': 14.427535057067871, 'learning_rate': 3.0376434841323433e-05, 'epoch': 1.25}


 42%|████▏     | 5160/12348 [1:57:47<2:41:43,  1.35s/it]

{'loss': 0.8891, 'grad_norm': 17.88591194152832, 'learning_rate': 3.0334233625928426e-05, 'epoch': 1.25}


 42%|████▏     | 5170/12348 [1:58:01<2:41:35,  1.35s/it]

{'loss': 0.8721, 'grad_norm': 7.9190874099731445, 'learning_rate': 3.029203241053343e-05, 'epoch': 1.26}


 42%|████▏     | 5180/12348 [1:58:14<2:41:09,  1.35s/it]

{'loss': 0.8369, 'grad_norm': 7.805469989776611, 'learning_rate': 3.024983119513842e-05, 'epoch': 1.26}


 42%|████▏     | 5190/12348 [1:58:28<2:40:20,  1.34s/it]

{'loss': 0.9163, 'grad_norm': 25.89236831665039, 'learning_rate': 3.0207629979743418e-05, 'epoch': 1.26}


 42%|████▏     | 5200/12348 [1:58:41<2:40:26,  1.35s/it]

{'loss': 0.7337, 'grad_norm': 12.731062889099121, 'learning_rate': 3.0165428764348414e-05, 'epoch': 1.26}


 42%|████▏     | 5210/12348 [1:58:55<2:40:20,  1.35s/it]

{'loss': 0.5952, 'grad_norm': 9.210966110229492, 'learning_rate': 3.012322754895341e-05, 'epoch': 1.27}


 42%|████▏     | 5220/12348 [1:59:08<2:40:08,  1.35s/it]

{'loss': 1.1439, 'grad_norm': 14.946334838867188, 'learning_rate': 3.008102633355841e-05, 'epoch': 1.27}


 42%|████▏     | 5230/12348 [1:59:22<2:40:15,  1.35s/it]

{'loss': 0.7238, 'grad_norm': 30.14358901977539, 'learning_rate': 3.0038825118163406e-05, 'epoch': 1.27}


 42%|████▏     | 5240/12348 [1:59:35<2:40:04,  1.35s/it]

{'loss': 0.8593, 'grad_norm': 33.4475212097168, 'learning_rate': 2.99966239027684e-05, 'epoch': 1.27}


 43%|████▎     | 5250/12348 [1:59:49<2:39:10,  1.35s/it]

{'loss': 0.6092, 'grad_norm': 10.239006042480469, 'learning_rate': 2.9954422687373396e-05, 'epoch': 1.28}


 43%|████▎     | 5260/12348 [2:00:02<2:39:02,  1.35s/it]

{'loss': 0.8231, 'grad_norm': 11.016648292541504, 'learning_rate': 2.9912221471978395e-05, 'epoch': 1.28}


 43%|████▎     | 5270/12348 [2:00:16<2:38:50,  1.35s/it]

{'loss': 0.8047, 'grad_norm': 4.993918418884277, 'learning_rate': 2.987002025658339e-05, 'epoch': 1.28}


 43%|████▎     | 5280/12348 [2:00:29<2:38:28,  1.35s/it]

{'loss': 0.8649, 'grad_norm': 15.418204307556152, 'learning_rate': 2.9827819041188388e-05, 'epoch': 1.28}


 43%|████▎     | 5290/12348 [2:00:42<2:38:32,  1.35s/it]

{'loss': 1.0767, 'grad_norm': 20.82951545715332, 'learning_rate': 2.9785617825793384e-05, 'epoch': 1.29}


 43%|████▎     | 5300/12348 [2:00:56<2:38:22,  1.35s/it]

{'loss': 1.2094, 'grad_norm': 32.235538482666016, 'learning_rate': 2.9743416610398384e-05, 'epoch': 1.29}


 43%|████▎     | 5310/12348 [2:01:09<2:37:59,  1.35s/it]

{'loss': 0.937, 'grad_norm': 13.243202209472656, 'learning_rate': 2.970121539500338e-05, 'epoch': 1.29}


 43%|████▎     | 5320/12348 [2:01:23<2:37:50,  1.35s/it]

{'loss': 0.8988, 'grad_norm': 27.274911880493164, 'learning_rate': 2.9659014179608373e-05, 'epoch': 1.29}


 43%|████▎     | 5330/12348 [2:01:36<2:37:37,  1.35s/it]

{'loss': 0.5854, 'grad_norm': 19.100223541259766, 'learning_rate': 2.961681296421337e-05, 'epoch': 1.29}


 43%|████▎     | 5340/12348 [2:01:50<2:38:05,  1.35s/it]

{'loss': 0.6728, 'grad_norm': 10.290969848632812, 'learning_rate': 2.9574611748818366e-05, 'epoch': 1.3}


 43%|████▎     | 5350/12348 [2:02:04<2:38:00,  1.35s/it]

{'loss': 1.1122, 'grad_norm': 11.901979446411133, 'learning_rate': 2.9532410533423365e-05, 'epoch': 1.3}


 43%|████▎     | 5360/12348 [2:02:17<2:37:12,  1.35s/it]

{'loss': 0.859, 'grad_norm': 8.410956382751465, 'learning_rate': 2.949020931802836e-05, 'epoch': 1.3}


 43%|████▎     | 5370/12348 [2:02:31<2:37:15,  1.35s/it]

{'loss': 0.8084, 'grad_norm': 5.899892330169678, 'learning_rate': 2.9448008102633358e-05, 'epoch': 1.3}


 44%|████▎     | 5380/12348 [2:02:44<2:36:51,  1.35s/it]

{'loss': 0.8208, 'grad_norm': 11.111425399780273, 'learning_rate': 2.940580688723835e-05, 'epoch': 1.31}


 44%|████▎     | 5390/12348 [2:02:58<2:37:19,  1.36s/it]

{'loss': 1.422, 'grad_norm': 17.94756317138672, 'learning_rate': 2.9363605671843354e-05, 'epoch': 1.31}


 44%|████▎     | 5400/12348 [2:03:11<2:36:45,  1.35s/it]

{'loss': 0.708, 'grad_norm': 9.471521377563477, 'learning_rate': 2.9321404456448347e-05, 'epoch': 1.31}


 44%|████▍     | 5410/12348 [2:03:25<2:36:12,  1.35s/it]

{'loss': 0.6038, 'grad_norm': 13.47903823852539, 'learning_rate': 2.9279203241053343e-05, 'epoch': 1.31}


 44%|████▍     | 5420/12348 [2:03:38<2:36:12,  1.35s/it]

{'loss': 0.8181, 'grad_norm': 2.3035671710968018, 'learning_rate': 2.923700202565834e-05, 'epoch': 1.32}


 44%|████▍     | 5430/12348 [2:03:52<2:35:50,  1.35s/it]

{'loss': 0.8519, 'grad_norm': 15.547467231750488, 'learning_rate': 2.919480081026334e-05, 'epoch': 1.32}


 44%|████▍     | 5440/12348 [2:04:05<2:35:44,  1.35s/it]

{'loss': 1.0405, 'grad_norm': 13.633901596069336, 'learning_rate': 2.9152599594868335e-05, 'epoch': 1.32}


 44%|████▍     | 5450/12348 [2:04:19<2:35:02,  1.35s/it]

{'loss': 0.8307, 'grad_norm': 8.523821830749512, 'learning_rate': 2.911039837947333e-05, 'epoch': 1.32}


 44%|████▍     | 5460/12348 [2:04:32<2:34:52,  1.35s/it]

{'loss': 0.8839, 'grad_norm': 17.469524383544922, 'learning_rate': 2.9068197164078324e-05, 'epoch': 1.33}


 44%|████▍     | 5470/12348 [2:04:46<2:35:31,  1.36s/it]

{'loss': 1.0717, 'grad_norm': 15.03116512298584, 'learning_rate': 2.902599594868332e-05, 'epoch': 1.33}


 44%|████▍     | 5480/12348 [2:04:59<2:34:30,  1.35s/it]

{'loss': 1.1843, 'grad_norm': 11.031503677368164, 'learning_rate': 2.898379473328832e-05, 'epoch': 1.33}


 44%|████▍     | 5490/12348 [2:05:13<2:34:48,  1.35s/it]

{'loss': 0.9829, 'grad_norm': 18.37306785583496, 'learning_rate': 2.8941593517893317e-05, 'epoch': 1.33}


 45%|████▍     | 5500/12348 [2:05:27<2:34:42,  1.36s/it]

{'loss': 0.6458, 'grad_norm': 25.588661193847656, 'learning_rate': 2.8899392302498313e-05, 'epoch': 1.34}


 45%|████▍     | 5510/12348 [2:05:41<2:36:41,  1.37s/it]

{'loss': 0.7797, 'grad_norm': 24.290897369384766, 'learning_rate': 2.885719108710331e-05, 'epoch': 1.34}


 45%|████▍     | 5520/12348 [2:05:55<2:34:01,  1.35s/it]

{'loss': 0.8802, 'grad_norm': 27.565248489379883, 'learning_rate': 2.881498987170831e-05, 'epoch': 1.34}


 45%|████▍     | 5530/12348 [2:06:08<2:33:46,  1.35s/it]

{'loss': 0.5663, 'grad_norm': 16.967750549316406, 'learning_rate': 2.8772788656313305e-05, 'epoch': 1.34}


 45%|████▍     | 5540/12348 [2:06:22<2:32:48,  1.35s/it]

{'loss': 0.9938, 'grad_norm': 15.15822982788086, 'learning_rate': 2.8730587440918298e-05, 'epoch': 1.35}


 45%|████▍     | 5550/12348 [2:06:35<2:32:52,  1.35s/it]

{'loss': 0.7317, 'grad_norm': 12.347661018371582, 'learning_rate': 2.8688386225523294e-05, 'epoch': 1.35}


 45%|████▌     | 5560/12348 [2:06:49<2:32:57,  1.35s/it]

{'loss': 0.7247, 'grad_norm': 13.250691413879395, 'learning_rate': 2.8646185010128297e-05, 'epoch': 1.35}


 45%|████▌     | 5570/12348 [2:07:02<2:32:52,  1.35s/it]

{'loss': 0.728, 'grad_norm': 33.86111068725586, 'learning_rate': 2.860398379473329e-05, 'epoch': 1.35}


 45%|████▌     | 5580/12348 [2:07:16<2:32:42,  1.35s/it]

{'loss': 1.0073, 'grad_norm': 23.312870025634766, 'learning_rate': 2.8561782579338287e-05, 'epoch': 1.36}


 45%|████▌     | 5590/12348 [2:07:29<2:32:11,  1.35s/it]

{'loss': 1.0119, 'grad_norm': 17.00289535522461, 'learning_rate': 2.8519581363943283e-05, 'epoch': 1.36}


 45%|████▌     | 5600/12348 [2:07:43<2:31:47,  1.35s/it]

{'loss': 0.8544, 'grad_norm': 18.280187606811523, 'learning_rate': 2.8477380148548276e-05, 'epoch': 1.36}


 45%|████▌     | 5610/12348 [2:07:57<2:31:46,  1.35s/it]

{'loss': 1.0339, 'grad_norm': 9.1737642288208, 'learning_rate': 2.843517893315328e-05, 'epoch': 1.36}


 46%|████▌     | 5620/12348 [2:08:10<2:31:59,  1.36s/it]

{'loss': 1.0519, 'grad_norm': 22.323028564453125, 'learning_rate': 2.8392977717758272e-05, 'epoch': 1.37}


 46%|████▌     | 5630/12348 [2:08:24<2:31:44,  1.36s/it]

{'loss': 0.929, 'grad_norm': 7.1376566886901855, 'learning_rate': 2.8350776502363268e-05, 'epoch': 1.37}


 46%|████▌     | 5640/12348 [2:08:37<2:31:13,  1.35s/it]

{'loss': 0.7904, 'grad_norm': 8.352127075195312, 'learning_rate': 2.8308575286968264e-05, 'epoch': 1.37}


 46%|████▌     | 5650/12348 [2:08:51<2:30:18,  1.35s/it]

{'loss': 0.7017, 'grad_norm': 16.054471969604492, 'learning_rate': 2.8266374071573264e-05, 'epoch': 1.37}


 46%|████▌     | 5660/12348 [2:09:04<2:30:44,  1.35s/it]

{'loss': 0.5932, 'grad_norm': 3.501049041748047, 'learning_rate': 2.822417285617826e-05, 'epoch': 1.38}


 46%|████▌     | 5670/12348 [2:09:18<2:30:13,  1.35s/it]

{'loss': 0.8009, 'grad_norm': 7.925566673278809, 'learning_rate': 2.8181971640783257e-05, 'epoch': 1.38}


 46%|████▌     | 5680/12348 [2:09:31<2:30:16,  1.35s/it]

{'loss': 1.0392, 'grad_norm': 26.561695098876953, 'learning_rate': 2.813977042538825e-05, 'epoch': 1.38}


 46%|████▌     | 5690/12348 [2:09:45<2:30:02,  1.35s/it]

{'loss': 0.8003, 'grad_norm': 11.239730834960938, 'learning_rate': 2.8097569209993246e-05, 'epoch': 1.38}


 46%|████▌     | 5700/12348 [2:09:58<2:29:41,  1.35s/it]

{'loss': 1.0869, 'grad_norm': 13.043599128723145, 'learning_rate': 2.805536799459825e-05, 'epoch': 1.38}


 46%|████▌     | 5710/12348 [2:10:12<2:29:25,  1.35s/it]

{'loss': 1.1302, 'grad_norm': 8.997475624084473, 'learning_rate': 2.8013166779203242e-05, 'epoch': 1.39}


 46%|████▋     | 5720/12348 [2:10:25<2:29:26,  1.35s/it]

{'loss': 0.8017, 'grad_norm': 2.6352577209472656, 'learning_rate': 2.7970965563808238e-05, 'epoch': 1.39}


 46%|████▋     | 5730/12348 [2:10:39<2:29:12,  1.35s/it]

{'loss': 1.0297, 'grad_norm': 18.314010620117188, 'learning_rate': 2.7928764348413234e-05, 'epoch': 1.39}


 46%|████▋     | 5740/12348 [2:10:52<2:28:56,  1.35s/it]

{'loss': 0.7945, 'grad_norm': 9.255284309387207, 'learning_rate': 2.7886563133018234e-05, 'epoch': 1.39}


 47%|████▋     | 5750/12348 [2:11:06<2:27:59,  1.35s/it]

{'loss': 0.7494, 'grad_norm': 5.896427631378174, 'learning_rate': 2.784436191762323e-05, 'epoch': 1.4}


 47%|████▋     | 5760/12348 [2:11:19<2:28:26,  1.35s/it]

{'loss': 1.0639, 'grad_norm': 9.282094955444336, 'learning_rate': 2.7802160702228223e-05, 'epoch': 1.4}


 47%|████▋     | 5770/12348 [2:11:33<2:28:01,  1.35s/it]

{'loss': 0.8137, 'grad_norm': 6.678783893585205, 'learning_rate': 2.775995948683322e-05, 'epoch': 1.4}


 47%|████▋     | 5780/12348 [2:11:46<2:27:47,  1.35s/it]

{'loss': 0.9834, 'grad_norm': 6.6193413734436035, 'learning_rate': 2.7717758271438223e-05, 'epoch': 1.4}


 47%|████▋     | 5790/12348 [2:12:00<2:27:52,  1.35s/it]

{'loss': 1.0717, 'grad_norm': 12.7752103805542, 'learning_rate': 2.7675557056043215e-05, 'epoch': 1.41}


 47%|████▋     | 5800/12348 [2:12:13<2:27:27,  1.35s/it]

{'loss': 0.6757, 'grad_norm': 15.986654281616211, 'learning_rate': 2.7633355840648212e-05, 'epoch': 1.41}


 47%|████▋     | 5810/12348 [2:12:27<2:27:49,  1.36s/it]

{'loss': 0.9343, 'grad_norm': 16.981847763061523, 'learning_rate': 2.7591154625253208e-05, 'epoch': 1.41}


 47%|████▋     | 5820/12348 [2:12:41<2:27:04,  1.35s/it]

{'loss': 0.9973, 'grad_norm': 13.848825454711914, 'learning_rate': 2.75489534098582e-05, 'epoch': 1.41}


 47%|████▋     | 5830/12348 [2:12:54<2:26:46,  1.35s/it]

{'loss': 0.7605, 'grad_norm': 16.010469436645508, 'learning_rate': 2.7506752194463204e-05, 'epoch': 1.42}


 47%|████▋     | 5840/12348 [2:13:08<2:26:27,  1.35s/it]

{'loss': 0.9721, 'grad_norm': 21.614837646484375, 'learning_rate': 2.74645509790682e-05, 'epoch': 1.42}


 47%|████▋     | 5850/12348 [2:13:21<2:26:05,  1.35s/it]

{'loss': 0.8099, 'grad_norm': 16.46520233154297, 'learning_rate': 2.7422349763673193e-05, 'epoch': 1.42}


 47%|████▋     | 5860/12348 [2:13:35<2:25:28,  1.35s/it]

{'loss': 0.667, 'grad_norm': 32.91876983642578, 'learning_rate': 2.738014854827819e-05, 'epoch': 1.42}


 48%|████▊     | 5870/12348 [2:13:48<2:25:50,  1.35s/it]

{'loss': 0.8852, 'grad_norm': 21.504344940185547, 'learning_rate': 2.733794733288319e-05, 'epoch': 1.43}


 48%|████▊     | 5880/12348 [2:14:02<2:25:11,  1.35s/it]

{'loss': 0.6565, 'grad_norm': 13.755488395690918, 'learning_rate': 2.7295746117488185e-05, 'epoch': 1.43}


 48%|████▊     | 5890/12348 [2:14:15<2:25:07,  1.35s/it]

{'loss': 1.0457, 'grad_norm': 16.906694412231445, 'learning_rate': 2.7253544902093182e-05, 'epoch': 1.43}


 48%|████▊     | 5900/12348 [2:14:29<2:25:00,  1.35s/it]

{'loss': 0.8383, 'grad_norm': 9.56906795501709, 'learning_rate': 2.7211343686698178e-05, 'epoch': 1.43}


 48%|████▊     | 5910/12348 [2:14:42<2:24:31,  1.35s/it]

{'loss': 1.1552, 'grad_norm': 6.316782474517822, 'learning_rate': 2.7169142471303178e-05, 'epoch': 1.44}


 48%|████▊     | 5920/12348 [2:14:56<2:24:33,  1.35s/it]

{'loss': 0.856, 'grad_norm': 12.609240531921387, 'learning_rate': 2.7126941255908174e-05, 'epoch': 1.44}


 48%|████▊     | 5930/12348 [2:15:09<2:23:57,  1.35s/it]

{'loss': 1.1912, 'grad_norm': 12.469667434692383, 'learning_rate': 2.7084740040513167e-05, 'epoch': 1.44}


 48%|████▊     | 5940/12348 [2:15:23<2:24:14,  1.35s/it]

{'loss': 0.666, 'grad_norm': 9.231675148010254, 'learning_rate': 2.7042538825118163e-05, 'epoch': 1.44}


 48%|████▊     | 5950/12348 [2:15:36<2:23:46,  1.35s/it]

{'loss': 0.6679, 'grad_norm': 6.333774566650391, 'learning_rate': 2.700033760972316e-05, 'epoch': 1.45}


 48%|████▊     | 5960/12348 [2:15:50<2:23:59,  1.35s/it]

{'loss': 1.2058, 'grad_norm': 21.941991806030273, 'learning_rate': 2.695813639432816e-05, 'epoch': 1.45}


 48%|████▊     | 5970/12348 [2:16:03<2:23:07,  1.35s/it]

{'loss': 0.9013, 'grad_norm': 19.045372009277344, 'learning_rate': 2.6915935178933155e-05, 'epoch': 1.45}


 48%|████▊     | 5980/12348 [2:16:17<2:23:14,  1.35s/it]

{'loss': 0.9002, 'grad_norm': 10.056026458740234, 'learning_rate': 2.6873733963538152e-05, 'epoch': 1.45}


 49%|████▊     | 5990/12348 [2:16:30<2:22:40,  1.35s/it]

{'loss': 0.9721, 'grad_norm': 15.969696044921875, 'learning_rate': 2.6831532748143145e-05, 'epoch': 1.46}


 49%|████▊     | 6000/12348 [2:16:44<2:22:31,  1.35s/it]

{'loss': 1.1463, 'grad_norm': 18.853347778320312, 'learning_rate': 2.6789331532748148e-05, 'epoch': 1.46}


 49%|████▊     | 6010/12348 [2:16:58<2:25:42,  1.38s/it]

{'loss': 1.0759, 'grad_norm': 15.761801719665527, 'learning_rate': 2.674713031735314e-05, 'epoch': 1.46}


 49%|████▉     | 6020/12348 [2:17:12<2:22:18,  1.35s/it]

{'loss': 0.6279, 'grad_norm': 16.688133239746094, 'learning_rate': 2.6704929101958137e-05, 'epoch': 1.46}


 49%|████▉     | 6030/12348 [2:17:25<2:21:48,  1.35s/it]

{'loss': 0.6165, 'grad_norm': 4.710174560546875, 'learning_rate': 2.6662727886563133e-05, 'epoch': 1.47}


 49%|████▉     | 6040/12348 [2:17:39<2:21:46,  1.35s/it]

{'loss': 0.7042, 'grad_norm': 17.463361740112305, 'learning_rate': 2.6620526671168133e-05, 'epoch': 1.47}


 49%|████▉     | 6050/12348 [2:17:52<2:21:25,  1.35s/it]

{'loss': 0.7619, 'grad_norm': 4.375425815582275, 'learning_rate': 2.657832545577313e-05, 'epoch': 1.47}


 49%|████▉     | 6060/12348 [2:18:06<2:20:56,  1.34s/it]

{'loss': 1.0368, 'grad_norm': 8.530348777770996, 'learning_rate': 2.6536124240378125e-05, 'epoch': 1.47}


 49%|████▉     | 6070/12348 [2:18:19<2:21:41,  1.35s/it]

{'loss': 1.0375, 'grad_norm': 4.3959455490112305, 'learning_rate': 2.649392302498312e-05, 'epoch': 1.47}


 49%|████▉     | 6080/12348 [2:18:33<2:20:51,  1.35s/it]

{'loss': 0.8562, 'grad_norm': 29.064350128173828, 'learning_rate': 2.6451721809588115e-05, 'epoch': 1.48}


 49%|████▉     | 6090/12348 [2:18:46<2:20:32,  1.35s/it]

{'loss': 0.8468, 'grad_norm': 4.700024127960205, 'learning_rate': 2.6409520594193114e-05, 'epoch': 1.48}


 49%|████▉     | 6100/12348 [2:19:00<2:20:43,  1.35s/it]

{'loss': 0.8308, 'grad_norm': 13.78597354888916, 'learning_rate': 2.636731937879811e-05, 'epoch': 1.48}


 49%|████▉     | 6110/12348 [2:19:13<2:20:59,  1.36s/it]

{'loss': 0.5895, 'grad_norm': 13.491803169250488, 'learning_rate': 2.6325118163403107e-05, 'epoch': 1.48}


 50%|████▉     | 6120/12348 [2:19:27<2:20:48,  1.36s/it]

{'loss': 0.8106, 'grad_norm': 7.440253734588623, 'learning_rate': 2.6282916948008103e-05, 'epoch': 1.49}


 50%|████▉     | 6130/12348 [2:19:40<2:19:46,  1.35s/it]

{'loss': 0.9025, 'grad_norm': 14.732638359069824, 'learning_rate': 2.6240715732613103e-05, 'epoch': 1.49}


 50%|████▉     | 6140/12348 [2:19:54<2:19:49,  1.35s/it]

{'loss': 0.7183, 'grad_norm': 8.306254386901855, 'learning_rate': 2.61985145172181e-05, 'epoch': 1.49}


 50%|████▉     | 6150/12348 [2:20:07<2:19:39,  1.35s/it]

{'loss': 0.9375, 'grad_norm': 14.167644500732422, 'learning_rate': 2.6156313301823092e-05, 'epoch': 1.49}


 50%|████▉     | 6160/12348 [2:20:21<2:18:53,  1.35s/it]

{'loss': 0.9287, 'grad_norm': 10.569555282592773, 'learning_rate': 2.611411208642809e-05, 'epoch': 1.5}


 50%|████▉     | 6170/12348 [2:20:34<2:18:54,  1.35s/it]

{'loss': 0.8053, 'grad_norm': 5.986770153045654, 'learning_rate': 2.6071910871033088e-05, 'epoch': 1.5}


 50%|█████     | 6180/12348 [2:20:48<2:18:33,  1.35s/it]

{'loss': 0.8566, 'grad_norm': 8.119325637817383, 'learning_rate': 2.6029709655638084e-05, 'epoch': 1.5}


 50%|█████     | 6190/12348 [2:21:02<2:18:50,  1.35s/it]

{'loss': 1.2042, 'grad_norm': 29.79290199279785, 'learning_rate': 2.598750844024308e-05, 'epoch': 1.5}


 50%|█████     | 6200/12348 [2:21:15<2:18:32,  1.35s/it]

{'loss': 0.9361, 'grad_norm': 3.104417085647583, 'learning_rate': 2.5945307224848077e-05, 'epoch': 1.51}


 50%|█████     | 6210/12348 [2:21:29<2:18:35,  1.35s/it]

{'loss': 1.4715, 'grad_norm': 14.740224838256836, 'learning_rate': 2.590310600945307e-05, 'epoch': 1.51}


 50%|█████     | 6220/12348 [2:21:42<2:18:22,  1.35s/it]

{'loss': 1.4397, 'grad_norm': 11.960949897766113, 'learning_rate': 2.5860904794058073e-05, 'epoch': 1.51}


 50%|█████     | 6230/12348 [2:21:56<2:17:48,  1.35s/it]

{'loss': 0.9555, 'grad_norm': 21.766437530517578, 'learning_rate': 2.5818703578663066e-05, 'epoch': 1.51}


 51%|█████     | 6240/12348 [2:22:09<2:17:50,  1.35s/it]

{'loss': 1.0914, 'grad_norm': 15.125680923461914, 'learning_rate': 2.5776502363268062e-05, 'epoch': 1.52}


 51%|█████     | 6250/12348 [2:22:23<2:17:29,  1.35s/it]

{'loss': 0.6372, 'grad_norm': 7.101709842681885, 'learning_rate': 2.573430114787306e-05, 'epoch': 1.52}


 51%|█████     | 6260/12348 [2:22:36<2:16:56,  1.35s/it]

{'loss': 0.7632, 'grad_norm': 8.66113567352295, 'learning_rate': 2.5692099932478058e-05, 'epoch': 1.52}


 51%|█████     | 6270/12348 [2:22:50<2:16:46,  1.35s/it]

{'loss': 0.6748, 'grad_norm': 8.579340934753418, 'learning_rate': 2.5649898717083054e-05, 'epoch': 1.52}


 51%|█████     | 6280/12348 [2:23:03<2:16:51,  1.35s/it]

{'loss': 0.8696, 'grad_norm': 11.537736892700195, 'learning_rate': 2.560769750168805e-05, 'epoch': 1.53}


 51%|█████     | 6290/12348 [2:23:17<2:16:26,  1.35s/it]

{'loss': 0.7865, 'grad_norm': 15.651762008666992, 'learning_rate': 2.5565496286293043e-05, 'epoch': 1.53}


 51%|█████     | 6300/12348 [2:23:30<2:16:26,  1.35s/it]

{'loss': 0.7376, 'grad_norm': 20.599414825439453, 'learning_rate': 2.5523295070898047e-05, 'epoch': 1.53}


 51%|█████     | 6310/12348 [2:23:44<2:16:07,  1.35s/it]

{'loss': 0.8927, 'grad_norm': 16.41969108581543, 'learning_rate': 2.548109385550304e-05, 'epoch': 1.53}


 51%|█████     | 6320/12348 [2:23:57<2:15:43,  1.35s/it]

{'loss': 0.9116, 'grad_norm': 5.909139633178711, 'learning_rate': 2.5438892640108036e-05, 'epoch': 1.54}


 51%|█████▏    | 6330/12348 [2:24:11<2:15:34,  1.35s/it]

{'loss': 0.81, 'grad_norm': 36.53042984008789, 'learning_rate': 2.5396691424713032e-05, 'epoch': 1.54}


 51%|█████▏    | 6340/12348 [2:24:25<2:15:50,  1.36s/it]

{'loss': 0.897, 'grad_norm': 12.976602554321289, 'learning_rate': 2.5354490209318028e-05, 'epoch': 1.54}


 51%|█████▏    | 6350/12348 [2:24:38<2:15:08,  1.35s/it]

{'loss': 1.0887, 'grad_norm': 12.638731956481934, 'learning_rate': 2.5312288993923028e-05, 'epoch': 1.54}


 52%|█████▏    | 6360/12348 [2:24:52<2:14:44,  1.35s/it]

{'loss': 0.6383, 'grad_norm': 9.486698150634766, 'learning_rate': 2.5270087778528024e-05, 'epoch': 1.55}


 52%|█████▏    | 6370/12348 [2:25:05<2:14:28,  1.35s/it]

{'loss': 0.9787, 'grad_norm': 11.843282699584961, 'learning_rate': 2.5227886563133017e-05, 'epoch': 1.55}


 52%|█████▏    | 6380/12348 [2:25:19<2:14:04,  1.35s/it]

{'loss': 0.6849, 'grad_norm': 17.739112854003906, 'learning_rate': 2.5185685347738013e-05, 'epoch': 1.55}


 52%|█████▏    | 6390/12348 [2:25:32<2:13:58,  1.35s/it]

{'loss': 0.9501, 'grad_norm': 12.08103084564209, 'learning_rate': 2.5143484132343013e-05, 'epoch': 1.55}


 52%|█████▏    | 6400/12348 [2:25:46<2:13:54,  1.35s/it]

{'loss': 0.7116, 'grad_norm': 8.329568862915039, 'learning_rate': 2.510128291694801e-05, 'epoch': 1.55}


 52%|█████▏    | 6410/12348 [2:25:59<2:13:31,  1.35s/it]

{'loss': 0.9373, 'grad_norm': 19.480016708374023, 'learning_rate': 2.5059081701553006e-05, 'epoch': 1.56}


 52%|█████▏    | 6420/12348 [2:26:13<2:13:30,  1.35s/it]

{'loss': 0.8363, 'grad_norm': 18.33068084716797, 'learning_rate': 2.5016880486158002e-05, 'epoch': 1.56}


 52%|█████▏    | 6430/12348 [2:26:26<2:13:14,  1.35s/it]

{'loss': 0.7037, 'grad_norm': 13.782565116882324, 'learning_rate': 2.4974679270762998e-05, 'epoch': 1.56}


 52%|█████▏    | 6440/12348 [2:26:40<2:12:51,  1.35s/it]

{'loss': 1.06, 'grad_norm': 26.358922958374023, 'learning_rate': 2.4932478055367998e-05, 'epoch': 1.56}


 52%|█████▏    | 6450/12348 [2:26:53<2:12:36,  1.35s/it]

{'loss': 0.7758, 'grad_norm': 10.902746200561523, 'learning_rate': 2.489027683997299e-05, 'epoch': 1.57}


 52%|█████▏    | 6460/12348 [2:27:07<2:12:06,  1.35s/it]

{'loss': 0.4753, 'grad_norm': 9.55760383605957, 'learning_rate': 2.4848075624577987e-05, 'epoch': 1.57}


 52%|█████▏    | 6470/12348 [2:27:20<2:11:59,  1.35s/it]

{'loss': 0.5425, 'grad_norm': 10.831913948059082, 'learning_rate': 2.4805874409182987e-05, 'epoch': 1.57}


 52%|█████▏    | 6480/12348 [2:27:34<2:11:51,  1.35s/it]

{'loss': 0.8714, 'grad_norm': 15.651382446289062, 'learning_rate': 2.476367319378798e-05, 'epoch': 1.57}


 53%|█████▎    | 6490/12348 [2:27:47<2:12:05,  1.35s/it]

{'loss': 0.7367, 'grad_norm': 13.939422607421875, 'learning_rate': 2.472147197839298e-05, 'epoch': 1.58}


 53%|█████▎    | 6500/12348 [2:28:01<2:11:28,  1.35s/it]

{'loss': 1.3775, 'grad_norm': 15.160284042358398, 'learning_rate': 2.4679270762997976e-05, 'epoch': 1.58}


 53%|█████▎    | 6510/12348 [2:28:15<2:12:29,  1.36s/it]

{'loss': 1.2168, 'grad_norm': 21.554332733154297, 'learning_rate': 2.4637069547602972e-05, 'epoch': 1.58}


 53%|█████▎    | 6520/12348 [2:28:29<2:10:57,  1.35s/it]

{'loss': 1.3949, 'grad_norm': 18.678144454956055, 'learning_rate': 2.4594868332207968e-05, 'epoch': 1.58}


 53%|█████▎    | 6530/12348 [2:28:42<2:10:11,  1.34s/it]

{'loss': 0.5259, 'grad_norm': 12.668229103088379, 'learning_rate': 2.4552667116812968e-05, 'epoch': 1.59}


 53%|█████▎    | 6540/12348 [2:28:56<2:10:23,  1.35s/it]

{'loss': 0.5898, 'grad_norm': 11.576933860778809, 'learning_rate': 2.451046590141796e-05, 'epoch': 1.59}


 53%|█████▎    | 6550/12348 [2:29:09<2:09:54,  1.34s/it]

{'loss': 0.8324, 'grad_norm': 9.526627540588379, 'learning_rate': 2.446826468602296e-05, 'epoch': 1.59}


 53%|█████▎    | 6560/12348 [2:29:23<2:09:45,  1.35s/it]

{'loss': 0.7416, 'grad_norm': 8.245567321777344, 'learning_rate': 2.4426063470627953e-05, 'epoch': 1.59}


 53%|█████▎    | 6570/12348 [2:29:36<2:09:40,  1.35s/it]

{'loss': 0.7302, 'grad_norm': 16.604103088378906, 'learning_rate': 2.4383862255232953e-05, 'epoch': 1.6}


 53%|█████▎    | 6580/12348 [2:29:50<2:09:25,  1.35s/it]

{'loss': 1.1411, 'grad_norm': 22.263736724853516, 'learning_rate': 2.434166103983795e-05, 'epoch': 1.6}


 53%|█████▎    | 6590/12348 [2:30:03<2:09:31,  1.35s/it]

{'loss': 0.667, 'grad_norm': 12.833884239196777, 'learning_rate': 2.4299459824442942e-05, 'epoch': 1.6}


 53%|█████▎    | 6600/12348 [2:30:17<2:09:52,  1.36s/it]

{'loss': 1.0118, 'grad_norm': 20.472627639770508, 'learning_rate': 2.4257258609047942e-05, 'epoch': 1.6}


 54%|█████▎    | 6610/12348 [2:30:30<2:09:16,  1.35s/it]

{'loss': 0.8258, 'grad_norm': 8.849925994873047, 'learning_rate': 2.4215057393652938e-05, 'epoch': 1.61}


 54%|█████▎    | 6620/12348 [2:30:44<2:09:09,  1.35s/it]

{'loss': 1.0296, 'grad_norm': 17.04418182373047, 'learning_rate': 2.4172856178257935e-05, 'epoch': 1.61}


 54%|█████▎    | 6630/12348 [2:30:57<2:08:55,  1.35s/it]

{'loss': 0.7781, 'grad_norm': 2.9985921382904053, 'learning_rate': 2.413065496286293e-05, 'epoch': 1.61}


 54%|█████▍    | 6640/12348 [2:31:11<2:08:27,  1.35s/it]

{'loss': 0.6824, 'grad_norm': 8.447574615478516, 'learning_rate': 2.408845374746793e-05, 'epoch': 1.61}


 54%|█████▍    | 6650/12348 [2:31:24<2:08:24,  1.35s/it]

{'loss': 1.2843, 'grad_norm': 9.4629487991333, 'learning_rate': 2.4046252532072923e-05, 'epoch': 1.62}


 54%|█████▍    | 6660/12348 [2:31:38<2:08:18,  1.35s/it]

{'loss': 0.977, 'grad_norm': 9.201711654663086, 'learning_rate': 2.4004051316677923e-05, 'epoch': 1.62}


 54%|█████▍    | 6670/12348 [2:31:51<2:07:55,  1.35s/it]

{'loss': 0.5267, 'grad_norm': 2.253690242767334, 'learning_rate': 2.396185010128292e-05, 'epoch': 1.62}


 54%|█████▍    | 6680/12348 [2:32:05<2:07:41,  1.35s/it]

{'loss': 0.8158, 'grad_norm': 11.279520034790039, 'learning_rate': 2.3919648885887916e-05, 'epoch': 1.62}


 54%|█████▍    | 6690/12348 [2:32:19<2:07:25,  1.35s/it]

{'loss': 0.8307, 'grad_norm': 6.600910186767578, 'learning_rate': 2.3877447670492912e-05, 'epoch': 1.63}


 54%|█████▍    | 6700/12348 [2:32:32<2:08:04,  1.36s/it]

{'loss': 0.5579, 'grad_norm': 4.422999858856201, 'learning_rate': 2.3835246455097908e-05, 'epoch': 1.63}


 54%|█████▍    | 6710/12348 [2:32:46<2:06:09,  1.34s/it]

{'loss': 0.78, 'grad_norm': 12.136879920959473, 'learning_rate': 2.3793045239702905e-05, 'epoch': 1.63}


 54%|█████▍    | 6720/12348 [2:32:59<2:06:07,  1.34s/it]

{'loss': 1.0182, 'grad_norm': 11.678579330444336, 'learning_rate': 2.37508440243079e-05, 'epoch': 1.63}


 55%|█████▍    | 6730/12348 [2:33:13<2:06:40,  1.35s/it]

{'loss': 1.1284, 'grad_norm': 18.916868209838867, 'learning_rate': 2.3708642808912897e-05, 'epoch': 1.64}


 55%|█████▍    | 6740/12348 [2:33:26<2:06:52,  1.36s/it]

{'loss': 0.8278, 'grad_norm': 14.997638702392578, 'learning_rate': 2.3666441593517893e-05, 'epoch': 1.64}


 55%|█████▍    | 6750/12348 [2:33:40<2:06:24,  1.35s/it]

{'loss': 0.8211, 'grad_norm': 21.327817916870117, 'learning_rate': 2.3624240378122893e-05, 'epoch': 1.64}


 55%|█████▍    | 6760/12348 [2:33:53<2:05:27,  1.35s/it]

{'loss': 0.501, 'grad_norm': 23.829627990722656, 'learning_rate': 2.3582039162727886e-05, 'epoch': 1.64}


 55%|█████▍    | 6770/12348 [2:34:07<2:05:37,  1.35s/it]

{'loss': 0.75, 'grad_norm': 18.108898162841797, 'learning_rate': 2.3539837947332886e-05, 'epoch': 1.64}


 55%|█████▍    | 6780/12348 [2:34:20<2:05:56,  1.36s/it]

{'loss': 0.6868, 'grad_norm': 16.773908615112305, 'learning_rate': 2.3497636731937882e-05, 'epoch': 1.65}


 55%|█████▍    | 6790/12348 [2:34:34<2:05:02,  1.35s/it]

{'loss': 0.866, 'grad_norm': 22.973941802978516, 'learning_rate': 2.3455435516542878e-05, 'epoch': 1.65}


 55%|█████▌    | 6800/12348 [2:34:47<2:04:58,  1.35s/it]

{'loss': 0.9985, 'grad_norm': 3.5516045093536377, 'learning_rate': 2.3413234301147875e-05, 'epoch': 1.65}


 55%|█████▌    | 6810/12348 [2:35:01<2:04:34,  1.35s/it]

{'loss': 1.0756, 'grad_norm': 3.480633497238159, 'learning_rate': 2.337103308575287e-05, 'epoch': 1.65}


 55%|█████▌    | 6820/12348 [2:35:14<2:04:06,  1.35s/it]

{'loss': 0.9087, 'grad_norm': 3.686488628387451, 'learning_rate': 2.3328831870357867e-05, 'epoch': 1.66}


 55%|█████▌    | 6830/12348 [2:35:28<2:04:26,  1.35s/it]

{'loss': 1.0747, 'grad_norm': 19.448684692382812, 'learning_rate': 2.3286630654962863e-05, 'epoch': 1.66}


 55%|█████▌    | 6840/12348 [2:35:41<2:04:09,  1.35s/it]

{'loss': 1.0612, 'grad_norm': 26.296728134155273, 'learning_rate': 2.324442943956786e-05, 'epoch': 1.66}


 55%|█████▌    | 6850/12348 [2:35:55<2:03:47,  1.35s/it]

{'loss': 0.8696, 'grad_norm': 23.16974449157715, 'learning_rate': 2.3202228224172856e-05, 'epoch': 1.66}


 56%|█████▌    | 6860/12348 [2:36:09<2:03:11,  1.35s/it]

{'loss': 0.8002, 'grad_norm': 10.32725715637207, 'learning_rate': 2.3160027008777856e-05, 'epoch': 1.67}


 56%|█████▌    | 6870/12348 [2:36:22<2:03:28,  1.35s/it]

{'loss': 1.1134, 'grad_norm': 32.975494384765625, 'learning_rate': 2.311782579338285e-05, 'epoch': 1.67}


 56%|█████▌    | 6880/12348 [2:36:36<2:03:18,  1.35s/it]

{'loss': 0.9768, 'grad_norm': 8.900736808776855, 'learning_rate': 2.3075624577987848e-05, 'epoch': 1.67}


 56%|█████▌    | 6890/12348 [2:36:49<2:03:20,  1.36s/it]

{'loss': 0.7677, 'grad_norm': 4.686188220977783, 'learning_rate': 2.3033423362592845e-05, 'epoch': 1.67}


 56%|█████▌    | 6900/12348 [2:37:03<2:02:28,  1.35s/it]

{'loss': 0.8161, 'grad_norm': 16.81064796447754, 'learning_rate': 2.299122214719784e-05, 'epoch': 1.68}


 56%|█████▌    | 6910/12348 [2:37:16<2:02:15,  1.35s/it]

{'loss': 0.7643, 'grad_norm': 21.260896682739258, 'learning_rate': 2.2949020931802837e-05, 'epoch': 1.68}


 56%|█████▌    | 6920/12348 [2:37:30<2:02:04,  1.35s/it]

{'loss': 0.8991, 'grad_norm': 21.531503677368164, 'learning_rate': 2.2906819716407833e-05, 'epoch': 1.68}


 56%|█████▌    | 6930/12348 [2:37:43<2:01:46,  1.35s/it]

{'loss': 0.7781, 'grad_norm': 9.231986999511719, 'learning_rate': 2.286461850101283e-05, 'epoch': 1.68}


 56%|█████▌    | 6940/12348 [2:37:57<2:01:38,  1.35s/it]

{'loss': 0.986, 'grad_norm': 9.572646141052246, 'learning_rate': 2.282241728561783e-05, 'epoch': 1.69}


 56%|█████▋    | 6950/12348 [2:38:10<2:01:11,  1.35s/it]

{'loss': 1.1292, 'grad_norm': 8.779298782348633, 'learning_rate': 2.2780216070222822e-05, 'epoch': 1.69}


 56%|█████▋    | 6960/12348 [2:38:24<2:01:15,  1.35s/it]

{'loss': 0.5858, 'grad_norm': 6.927555561065674, 'learning_rate': 2.273801485482782e-05, 'epoch': 1.69}


 56%|█████▋    | 6970/12348 [2:38:37<2:01:03,  1.35s/it]

{'loss': 0.4744, 'grad_norm': 9.385906219482422, 'learning_rate': 2.2695813639432818e-05, 'epoch': 1.69}


 57%|█████▋    | 6980/12348 [2:38:51<2:01:24,  1.36s/it]

{'loss': 0.8196, 'grad_norm': 14.065520286560059, 'learning_rate': 2.265361242403781e-05, 'epoch': 1.7}


 57%|█████▋    | 6990/12348 [2:39:04<2:00:18,  1.35s/it]

{'loss': 1.0112, 'grad_norm': 17.3995418548584, 'learning_rate': 2.261141120864281e-05, 'epoch': 1.7}


 57%|█████▋    | 7000/12348 [2:39:18<2:00:39,  1.35s/it]

{'loss': 0.7844, 'grad_norm': 10.395831108093262, 'learning_rate': 2.2569209993247807e-05, 'epoch': 1.7}


 57%|█████▋    | 7010/12348 [2:39:33<2:02:14,  1.37s/it]

{'loss': 1.1569, 'grad_norm': 19.191686630249023, 'learning_rate': 2.2527008777852803e-05, 'epoch': 1.7}


 57%|█████▋    | 7020/12348 [2:39:46<2:00:13,  1.35s/it]

{'loss': 0.8919, 'grad_norm': 3.1225531101226807, 'learning_rate': 2.24848075624578e-05, 'epoch': 1.71}


 57%|█████▋    | 7030/12348 [2:40:00<1:59:48,  1.35s/it]

{'loss': 0.7282, 'grad_norm': 11.157397270202637, 'learning_rate': 2.2442606347062796e-05, 'epoch': 1.71}


 57%|█████▋    | 7040/12348 [2:40:13<1:59:33,  1.35s/it]

{'loss': 0.7968, 'grad_norm': 27.565948486328125, 'learning_rate': 2.2400405131667792e-05, 'epoch': 1.71}


 57%|█████▋    | 7050/12348 [2:40:27<1:59:09,  1.35s/it]

{'loss': 0.8532, 'grad_norm': 9.37297534942627, 'learning_rate': 2.2358203916272792e-05, 'epoch': 1.71}


 57%|█████▋    | 7060/12348 [2:40:40<1:58:52,  1.35s/it]

{'loss': 0.525, 'grad_norm': 5.529254913330078, 'learning_rate': 2.2316002700877785e-05, 'epoch': 1.72}


 57%|█████▋    | 7070/12348 [2:40:54<1:59:16,  1.36s/it]

{'loss': 0.9, 'grad_norm': 24.78221321105957, 'learning_rate': 2.2273801485482785e-05, 'epoch': 1.72}


 57%|█████▋    | 7080/12348 [2:41:07<1:58:48,  1.35s/it]

{'loss': 1.1312, 'grad_norm': 19.0609073638916, 'learning_rate': 2.223160027008778e-05, 'epoch': 1.72}


 57%|█████▋    | 7090/12348 [2:41:21<1:58:21,  1.35s/it]

{'loss': 0.6694, 'grad_norm': 4.819210052490234, 'learning_rate': 2.2189399054692774e-05, 'epoch': 1.72}


 57%|█████▋    | 7100/12348 [2:41:34<1:58:17,  1.35s/it]

{'loss': 0.7974, 'grad_norm': 22.652679443359375, 'learning_rate': 2.2147197839297773e-05, 'epoch': 1.72}


 58%|█████▊    | 7110/12348 [2:41:48<1:58:28,  1.36s/it]

{'loss': 0.8248, 'grad_norm': 46.138790130615234, 'learning_rate': 2.210499662390277e-05, 'epoch': 1.73}


 58%|█████▊    | 7120/12348 [2:42:02<1:59:47,  1.37s/it]

{'loss': 0.9637, 'grad_norm': 4.141680717468262, 'learning_rate': 2.2062795408507766e-05, 'epoch': 1.73}


 58%|█████▊    | 7130/12348 [2:42:15<1:57:45,  1.35s/it]

{'loss': 0.7207, 'grad_norm': 25.733753204345703, 'learning_rate': 2.2020594193112762e-05, 'epoch': 1.73}


 58%|█████▊    | 7140/12348 [2:42:29<1:57:24,  1.35s/it]

{'loss': 0.9044, 'grad_norm': 8.705419540405273, 'learning_rate': 2.197839297771776e-05, 'epoch': 1.73}


 58%|█████▊    | 7150/12348 [2:42:42<1:56:30,  1.34s/it]

{'loss': 0.7672, 'grad_norm': 21.770078659057617, 'learning_rate': 2.1936191762322755e-05, 'epoch': 1.74}


 58%|█████▊    | 7160/12348 [2:42:56<1:56:18,  1.35s/it]

{'loss': 0.8058, 'grad_norm': 10.726367950439453, 'learning_rate': 2.1893990546927754e-05, 'epoch': 1.74}


 58%|█████▊    | 7170/12348 [2:43:09<1:56:52,  1.35s/it]

{'loss': 0.8801, 'grad_norm': 7.5338873863220215, 'learning_rate': 2.1851789331532747e-05, 'epoch': 1.74}


 58%|█████▊    | 7180/12348 [2:43:23<1:56:18,  1.35s/it]

{'loss': 0.8843, 'grad_norm': 7.434098243713379, 'learning_rate': 2.1809588116137747e-05, 'epoch': 1.74}


 58%|█████▊    | 7190/12348 [2:43:36<1:56:37,  1.36s/it]

{'loss': 0.9079, 'grad_norm': 17.247209548950195, 'learning_rate': 2.1767386900742743e-05, 'epoch': 1.75}


 58%|█████▊    | 7200/12348 [2:43:50<1:55:51,  1.35s/it]

{'loss': 0.7668, 'grad_norm': 16.990989685058594, 'learning_rate': 2.172518568534774e-05, 'epoch': 1.75}


 58%|█████▊    | 7210/12348 [2:44:04<1:55:42,  1.35s/it]

{'loss': 0.5728, 'grad_norm': 17.651342391967773, 'learning_rate': 2.1682984469952736e-05, 'epoch': 1.75}


 58%|█████▊    | 7220/12348 [2:44:17<1:55:29,  1.35s/it]

{'loss': 0.5835, 'grad_norm': 8.664691925048828, 'learning_rate': 2.1640783254557732e-05, 'epoch': 1.75}


 59%|█████▊    | 7230/12348 [2:44:31<1:55:25,  1.35s/it]

{'loss': 1.1858, 'grad_norm': 21.32024383544922, 'learning_rate': 2.159858203916273e-05, 'epoch': 1.76}


 59%|█████▊    | 7240/12348 [2:44:44<1:55:13,  1.35s/it]

{'loss': 0.5792, 'grad_norm': 5.725470066070557, 'learning_rate': 2.1556380823767725e-05, 'epoch': 1.76}


 59%|█████▊    | 7250/12348 [2:44:58<1:55:29,  1.36s/it]

{'loss': 0.8357, 'grad_norm': 9.859268188476562, 'learning_rate': 2.151417960837272e-05, 'epoch': 1.76}


 59%|█████▉    | 7260/12348 [2:45:11<1:54:38,  1.35s/it]

{'loss': 1.1725, 'grad_norm': 26.492450714111328, 'learning_rate': 2.1471978392977717e-05, 'epoch': 1.76}


 59%|█████▉    | 7270/12348 [2:45:25<1:54:34,  1.35s/it]

{'loss': 0.5427, 'grad_norm': 8.358819007873535, 'learning_rate': 2.1429777177582717e-05, 'epoch': 1.77}


 59%|█████▉    | 7280/12348 [2:45:38<1:54:21,  1.35s/it]

{'loss': 0.6467, 'grad_norm': 2.217139482498169, 'learning_rate': 2.138757596218771e-05, 'epoch': 1.77}


 59%|█████▉    | 7290/12348 [2:45:52<1:54:00,  1.35s/it]

{'loss': 0.5611, 'grad_norm': 8.859563827514648, 'learning_rate': 2.134537474679271e-05, 'epoch': 1.77}


 59%|█████▉    | 7300/12348 [2:46:05<1:53:42,  1.35s/it]

{'loss': 0.9205, 'grad_norm': 11.740228652954102, 'learning_rate': 2.1303173531397706e-05, 'epoch': 1.77}


 59%|█████▉    | 7310/12348 [2:46:19<1:53:30,  1.35s/it]

{'loss': 1.0503, 'grad_norm': 17.884578704833984, 'learning_rate': 2.1260972316002702e-05, 'epoch': 1.78}


 59%|█████▉    | 7320/12348 [2:46:33<1:53:14,  1.35s/it]

{'loss': 1.266, 'grad_norm': 21.781694412231445, 'learning_rate': 2.12187711006077e-05, 'epoch': 1.78}


 59%|█████▉    | 7330/12348 [2:46:46<1:53:09,  1.35s/it]

{'loss': 0.6884, 'grad_norm': 13.057308197021484, 'learning_rate': 2.1176569885212695e-05, 'epoch': 1.78}


 59%|█████▉    | 7340/12348 [2:47:00<1:52:43,  1.35s/it]

{'loss': 0.8499, 'grad_norm': 12.64869213104248, 'learning_rate': 2.113436866981769e-05, 'epoch': 1.78}


 60%|█████▉    | 7350/12348 [2:47:13<1:52:23,  1.35s/it]

{'loss': 0.5306, 'grad_norm': 4.508280277252197, 'learning_rate': 2.1092167454422687e-05, 'epoch': 1.79}


 60%|█████▉    | 7360/12348 [2:47:27<1:52:15,  1.35s/it]

{'loss': 0.6599, 'grad_norm': 9.845698356628418, 'learning_rate': 2.1049966239027684e-05, 'epoch': 1.79}


 60%|█████▉    | 7370/12348 [2:47:40<1:52:21,  1.35s/it]

{'loss': 0.5561, 'grad_norm': 18.47942543029785, 'learning_rate': 2.100776502363268e-05, 'epoch': 1.79}


 60%|█████▉    | 7380/12348 [2:47:54<1:52:03,  1.35s/it]

{'loss': 1.0942, 'grad_norm': 7.169036865234375, 'learning_rate': 2.096556380823768e-05, 'epoch': 1.79}


 60%|█████▉    | 7390/12348 [2:48:07<1:51:52,  1.35s/it]

{'loss': 0.8375, 'grad_norm': 10.330402374267578, 'learning_rate': 2.0923362592842673e-05, 'epoch': 1.8}


 60%|█████▉    | 7400/12348 [2:48:21<1:51:41,  1.35s/it]

{'loss': 0.6879, 'grad_norm': 16.378904342651367, 'learning_rate': 2.0881161377447672e-05, 'epoch': 1.8}


 60%|██████    | 7410/12348 [2:48:34<1:51:08,  1.35s/it]

{'loss': 0.6847, 'grad_norm': 7.466867446899414, 'learning_rate': 2.083896016205267e-05, 'epoch': 1.8}


 60%|██████    | 7420/12348 [2:48:48<1:51:04,  1.35s/it]

{'loss': 0.7525, 'grad_norm': 8.100869178771973, 'learning_rate': 2.0796758946657665e-05, 'epoch': 1.8}


 60%|██████    | 7430/12348 [2:49:01<1:50:37,  1.35s/it]

{'loss': 0.8511, 'grad_norm': 10.422220230102539, 'learning_rate': 2.075455773126266e-05, 'epoch': 1.81}


 60%|██████    | 7440/12348 [2:49:15<1:50:27,  1.35s/it]

{'loss': 1.0074, 'grad_norm': 4.0165534019470215, 'learning_rate': 2.071235651586766e-05, 'epoch': 1.81}


 60%|██████    | 7450/12348 [2:49:28<1:50:20,  1.35s/it]

{'loss': 1.0861, 'grad_norm': 19.093189239501953, 'learning_rate': 2.0670155300472654e-05, 'epoch': 1.81}


 60%|██████    | 7460/12348 [2:49:42<1:50:17,  1.35s/it]

{'loss': 0.8687, 'grad_norm': 12.701135635375977, 'learning_rate': 2.0627954085077653e-05, 'epoch': 1.81}


 60%|██████    | 7470/12348 [2:49:56<1:49:52,  1.35s/it]

{'loss': 1.0315, 'grad_norm': 12.49218463897705, 'learning_rate': 2.058575286968265e-05, 'epoch': 1.81}


 61%|██████    | 7480/12348 [2:50:09<1:49:38,  1.35s/it]

{'loss': 0.9432, 'grad_norm': 9.344742774963379, 'learning_rate': 2.0543551654287643e-05, 'epoch': 1.82}


 61%|██████    | 7490/12348 [2:50:23<1:49:38,  1.35s/it]

{'loss': 0.9372, 'grad_norm': 17.157129287719727, 'learning_rate': 2.0501350438892642e-05, 'epoch': 1.82}


 61%|██████    | 7500/12348 [2:50:36<1:49:15,  1.35s/it]

{'loss': 1.0326, 'grad_norm': 18.162139892578125, 'learning_rate': 2.0459149223497635e-05, 'epoch': 1.82}


 61%|██████    | 7510/12348 [2:50:51<1:51:07,  1.38s/it]

{'loss': 0.7978, 'grad_norm': 13.308183670043945, 'learning_rate': 2.0416948008102635e-05, 'epoch': 1.82}


 61%|██████    | 7520/12348 [2:51:04<1:49:24,  1.36s/it]

{'loss': 0.8204, 'grad_norm': 7.087520599365234, 'learning_rate': 2.037474679270763e-05, 'epoch': 1.83}


 61%|██████    | 7530/12348 [2:51:18<1:48:21,  1.35s/it]

{'loss': 0.783, 'grad_norm': 13.204638481140137, 'learning_rate': 2.0332545577312627e-05, 'epoch': 1.83}


 61%|██████    | 7540/12348 [2:51:31<1:48:20,  1.35s/it]

{'loss': 0.8422, 'grad_norm': 10.985734939575195, 'learning_rate': 2.0290344361917624e-05, 'epoch': 1.83}


 61%|██████    | 7550/12348 [2:51:45<1:48:11,  1.35s/it]

{'loss': 1.0174, 'grad_norm': 7.346226215362549, 'learning_rate': 2.0248143146522623e-05, 'epoch': 1.83}


 61%|██████    | 7560/12348 [2:51:59<1:47:34,  1.35s/it]

{'loss': 0.7168, 'grad_norm': 6.937911510467529, 'learning_rate': 2.0205941931127616e-05, 'epoch': 1.84}


 61%|██████▏   | 7570/12348 [2:52:12<1:47:07,  1.35s/it]

{'loss': 0.7208, 'grad_norm': 10.24710464477539, 'learning_rate': 2.0163740715732616e-05, 'epoch': 1.84}


 61%|██████▏   | 7580/12348 [2:52:25<1:46:58,  1.35s/it]

{'loss': 0.9266, 'grad_norm': 13.863224983215332, 'learning_rate': 2.0121539500337612e-05, 'epoch': 1.84}


 61%|██████▏   | 7590/12348 [2:52:39<1:47:13,  1.35s/it]

{'loss': 0.9921, 'grad_norm': 19.93137550354004, 'learning_rate': 2.0079338284942605e-05, 'epoch': 1.84}


 62%|██████▏   | 7600/12348 [2:52:53<1:46:49,  1.35s/it]

{'loss': 0.5208, 'grad_norm': 12.51224422454834, 'learning_rate': 2.0037137069547605e-05, 'epoch': 1.85}


 62%|██████▏   | 7610/12348 [2:53:06<1:46:41,  1.35s/it]

{'loss': 0.9092, 'grad_norm': 22.306032180786133, 'learning_rate': 1.99949358541526e-05, 'epoch': 1.85}


 62%|██████▏   | 7620/12348 [2:53:19<1:45:56,  1.34s/it]

{'loss': 0.6829, 'grad_norm': 16.031509399414062, 'learning_rate': 1.9952734638757597e-05, 'epoch': 1.85}


 62%|██████▏   | 7630/12348 [2:53:33<1:45:53,  1.35s/it]

{'loss': 0.7826, 'grad_norm': 14.260581016540527, 'learning_rate': 1.9910533423362594e-05, 'epoch': 1.85}


 62%|██████▏   | 7640/12348 [2:53:46<1:45:31,  1.34s/it]

{'loss': 0.5773, 'grad_norm': 15.4940767288208, 'learning_rate': 1.986833220796759e-05, 'epoch': 1.86}


 62%|██████▏   | 7650/12348 [2:54:00<1:45:53,  1.35s/it]

{'loss': 0.9593, 'grad_norm': 14.895143508911133, 'learning_rate': 1.9826130992572586e-05, 'epoch': 1.86}


 62%|██████▏   | 7660/12348 [2:54:14<1:45:58,  1.36s/it]

{'loss': 0.821, 'grad_norm': 6.963369369506836, 'learning_rate': 1.9783929777177586e-05, 'epoch': 1.86}


 62%|██████▏   | 7670/12348 [2:54:27<1:45:17,  1.35s/it]

{'loss': 1.0888, 'grad_norm': 14.438512802124023, 'learning_rate': 1.974172856178258e-05, 'epoch': 1.86}


 62%|██████▏   | 7680/12348 [2:54:41<1:45:40,  1.36s/it]

{'loss': 1.1111, 'grad_norm': 29.157052993774414, 'learning_rate': 1.969952734638758e-05, 'epoch': 1.87}


 62%|██████▏   | 7690/12348 [2:54:54<1:44:53,  1.35s/it]

{'loss': 0.5223, 'grad_norm': 8.079636573791504, 'learning_rate': 1.9657326130992575e-05, 'epoch': 1.87}


 62%|██████▏   | 7700/12348 [2:55:08<1:44:55,  1.35s/it]

{'loss': 0.8093, 'grad_norm': 7.561102390289307, 'learning_rate': 1.961512491559757e-05, 'epoch': 1.87}


 62%|██████▏   | 7710/12348 [2:55:22<1:46:55,  1.38s/it]

{'loss': 1.1865, 'grad_norm': 11.962729454040527, 'learning_rate': 1.9572923700202567e-05, 'epoch': 1.87}


 63%|██████▎   | 7720/12348 [2:55:35<1:46:25,  1.38s/it]

{'loss': 0.9058, 'grad_norm': 11.762080192565918, 'learning_rate': 1.9530722484807564e-05, 'epoch': 1.88}


 63%|██████▎   | 7730/12348 [2:55:49<1:46:24,  1.38s/it]

{'loss': 0.5497, 'grad_norm': 19.89323616027832, 'learning_rate': 1.948852126941256e-05, 'epoch': 1.88}


 63%|██████▎   | 7740/12348 [2:56:03<1:46:24,  1.39s/it]

{'loss': 0.9638, 'grad_norm': 15.440937042236328, 'learning_rate': 1.9446320054017556e-05, 'epoch': 1.88}


 63%|██████▎   | 7750/12348 [2:56:17<1:46:03,  1.38s/it]

{'loss': 0.735, 'grad_norm': 18.994657516479492, 'learning_rate': 1.9404118838622552e-05, 'epoch': 1.88}


 63%|██████▎   | 7760/12348 [2:56:31<1:45:22,  1.38s/it]

{'loss': 0.9603, 'grad_norm': 10.888838768005371, 'learning_rate': 1.936191762322755e-05, 'epoch': 1.89}


 63%|██████▎   | 7770/12348 [2:56:45<1:45:21,  1.38s/it]

{'loss': 0.7622, 'grad_norm': 9.434183120727539, 'learning_rate': 1.931971640783255e-05, 'epoch': 1.89}


 63%|██████▎   | 7780/12348 [2:56:58<1:45:16,  1.38s/it]

{'loss': 0.5815, 'grad_norm': 17.87385368347168, 'learning_rate': 1.927751519243754e-05, 'epoch': 1.89}


 63%|██████▎   | 7790/12348 [2:57:12<1:45:31,  1.39s/it]

{'loss': 0.9709, 'grad_norm': 19.198261260986328, 'learning_rate': 1.923531397704254e-05, 'epoch': 1.89}


 63%|██████▎   | 7800/12348 [2:57:26<1:45:07,  1.39s/it]

{'loss': 0.9909, 'grad_norm': 18.912260055541992, 'learning_rate': 1.9193112761647537e-05, 'epoch': 1.9}


 63%|██████▎   | 7810/12348 [2:57:40<1:45:16,  1.39s/it]

{'loss': 0.7187, 'grad_norm': 11.471015930175781, 'learning_rate': 1.9150911546252534e-05, 'epoch': 1.9}


 63%|██████▎   | 7820/12348 [2:57:54<1:43:55,  1.38s/it]

{'loss': 0.6706, 'grad_norm': 18.28453254699707, 'learning_rate': 1.910871033085753e-05, 'epoch': 1.9}


 63%|██████▎   | 7830/12348 [2:58:08<1:44:18,  1.39s/it]

{'loss': 0.8112, 'grad_norm': 22.97032356262207, 'learning_rate': 1.9066509115462526e-05, 'epoch': 1.9}


 63%|██████▎   | 7840/12348 [2:58:22<1:43:39,  1.38s/it]

{'loss': 0.9526, 'grad_norm': 12.836382865905762, 'learning_rate': 1.9024307900067522e-05, 'epoch': 1.9}


 64%|██████▎   | 7850/12348 [2:58:35<1:43:16,  1.38s/it]

{'loss': 0.7843, 'grad_norm': 9.81551456451416, 'learning_rate': 1.898210668467252e-05, 'epoch': 1.91}


 64%|██████▎   | 7860/12348 [2:58:49<1:43:18,  1.38s/it]

{'loss': 0.8223, 'grad_norm': 18.86881446838379, 'learning_rate': 1.8939905469277515e-05, 'epoch': 1.91}


 64%|██████▎   | 7870/12348 [2:59:03<1:42:58,  1.38s/it]

{'loss': 1.1008, 'grad_norm': 50.096595764160156, 'learning_rate': 1.889770425388251e-05, 'epoch': 1.91}


 64%|██████▍   | 7880/12348 [2:59:17<1:42:26,  1.38s/it]

{'loss': 0.9049, 'grad_norm': 6.921515464782715, 'learning_rate': 1.885550303848751e-05, 'epoch': 1.91}


 64%|██████▍   | 7890/12348 [2:59:31<1:42:15,  1.38s/it]

{'loss': 1.0655, 'grad_norm': 16.41046714782715, 'learning_rate': 1.8813301823092504e-05, 'epoch': 1.92}


 64%|██████▍   | 7900/12348 [2:59:44<1:42:25,  1.38s/it]

{'loss': 0.8086, 'grad_norm': 11.186607360839844, 'learning_rate': 1.8771100607697504e-05, 'epoch': 1.92}


 64%|██████▍   | 7910/12348 [2:59:58<1:41:43,  1.38s/it]

{'loss': 0.7558, 'grad_norm': 18.205970764160156, 'learning_rate': 1.87288993923025e-05, 'epoch': 1.92}


 64%|██████▍   | 7920/12348 [3:00:12<1:41:50,  1.38s/it]

{'loss': 0.7694, 'grad_norm': 2.6408936977386475, 'learning_rate': 1.8686698176907496e-05, 'epoch': 1.92}


 64%|██████▍   | 7930/12348 [3:00:26<1:41:43,  1.38s/it]

{'loss': 0.8142, 'grad_norm': 18.6475830078125, 'learning_rate': 1.8644496961512492e-05, 'epoch': 1.93}


 64%|██████▍   | 7940/12348 [3:00:40<1:41:26,  1.38s/it]

{'loss': 0.8485, 'grad_norm': 13.689752578735352, 'learning_rate': 1.860229574611749e-05, 'epoch': 1.93}


 64%|██████▍   | 7950/12348 [3:00:54<1:41:09,  1.38s/it]

{'loss': 0.9029, 'grad_norm': 23.788616180419922, 'learning_rate': 1.8560094530722485e-05, 'epoch': 1.93}


 64%|██████▍   | 7960/12348 [3:01:07<1:40:43,  1.38s/it]

{'loss': 0.7121, 'grad_norm': 2.657315731048584, 'learning_rate': 1.8517893315327485e-05, 'epoch': 1.93}


 65%|██████▍   | 7970/12348 [3:01:21<1:40:49,  1.38s/it]

{'loss': 0.5625, 'grad_norm': 6.414037704467773, 'learning_rate': 1.8475692099932478e-05, 'epoch': 1.94}


 65%|██████▍   | 7980/12348 [3:01:35<1:40:10,  1.38s/it]

{'loss': 0.9048, 'grad_norm': 6.802720069885254, 'learning_rate': 1.8433490884537474e-05, 'epoch': 1.94}


 65%|██████▍   | 7990/12348 [3:01:49<1:43:08,  1.42s/it]

{'loss': 0.9268, 'grad_norm': 12.284735679626465, 'learning_rate': 1.8391289669142474e-05, 'epoch': 1.94}


 65%|██████▍   | 8000/12348 [3:02:03<1:39:35,  1.37s/it]

{'loss': 1.0547, 'grad_norm': 5.5714921951293945, 'learning_rate': 1.8349088453747466e-05, 'epoch': 1.94}


 65%|██████▍   | 8010/12348 [3:02:18<1:42:20,  1.42s/it]

{'loss': 0.8044, 'grad_norm': 8.077537536621094, 'learning_rate': 1.8306887238352466e-05, 'epoch': 1.95}


 65%|██████▍   | 8020/12348 [3:02:32<1:40:14,  1.39s/it]

{'loss': 0.5932, 'grad_norm': 10.06467056274414, 'learning_rate': 1.8264686022957462e-05, 'epoch': 1.95}


 65%|██████▌   | 8030/12348 [3:02:46<1:40:14,  1.39s/it]

{'loss': 1.0652, 'grad_norm': 17.326019287109375, 'learning_rate': 1.822248480756246e-05, 'epoch': 1.95}


 65%|██████▌   | 8040/12348 [3:03:00<1:39:42,  1.39s/it]

{'loss': 1.0097, 'grad_norm': 33.490081787109375, 'learning_rate': 1.8180283592167455e-05, 'epoch': 1.95}


 65%|██████▌   | 8050/12348 [3:03:13<1:39:32,  1.39s/it]

{'loss': 0.7172, 'grad_norm': 20.364990234375, 'learning_rate': 1.813808237677245e-05, 'epoch': 1.96}


 65%|██████▌   | 8060/12348 [3:03:27<1:39:22,  1.39s/it]

{'loss': 0.6381, 'grad_norm': 15.00008487701416, 'learning_rate': 1.8095881161377448e-05, 'epoch': 1.96}


 65%|██████▌   | 8070/12348 [3:03:41<1:39:08,  1.39s/it]

{'loss': 0.6712, 'grad_norm': 17.74252700805664, 'learning_rate': 1.8053679945982447e-05, 'epoch': 1.96}


 65%|██████▌   | 8080/12348 [3:03:55<1:38:33,  1.39s/it]

{'loss': 0.7945, 'grad_norm': 3.439890146255493, 'learning_rate': 1.801147873058744e-05, 'epoch': 1.96}


 66%|██████▌   | 8090/12348 [3:04:09<1:38:37,  1.39s/it]

{'loss': 0.5938, 'grad_norm': 9.361634254455566, 'learning_rate': 1.796927751519244e-05, 'epoch': 1.97}


 66%|██████▌   | 8100/12348 [3:04:23<1:38:03,  1.38s/it]

{'loss': 0.6358, 'grad_norm': 12.12320613861084, 'learning_rate': 1.7927076299797436e-05, 'epoch': 1.97}


 66%|██████▌   | 8110/12348 [3:04:37<1:37:20,  1.38s/it]

{'loss': 1.0378, 'grad_norm': 17.95981216430664, 'learning_rate': 1.788487508440243e-05, 'epoch': 1.97}


 66%|██████▌   | 8120/12348 [3:04:51<1:37:22,  1.38s/it]

{'loss': 0.8073, 'grad_norm': 9.144828796386719, 'learning_rate': 1.784267386900743e-05, 'epoch': 1.97}


 66%|██████▌   | 8130/12348 [3:05:04<1:37:02,  1.38s/it]

{'loss': 0.7629, 'grad_norm': 9.919093132019043, 'learning_rate': 1.7800472653612425e-05, 'epoch': 1.98}


 66%|██████▌   | 8140/12348 [3:05:18<1:36:25,  1.37s/it]

{'loss': 0.9613, 'grad_norm': 11.196549415588379, 'learning_rate': 1.775827143821742e-05, 'epoch': 1.98}


 66%|██████▌   | 8150/12348 [3:05:32<1:36:22,  1.38s/it]

{'loss': 1.0598, 'grad_norm': 10.32431697845459, 'learning_rate': 1.7716070222822418e-05, 'epoch': 1.98}


 66%|██████▌   | 8160/12348 [3:05:46<1:36:17,  1.38s/it]

{'loss': 1.2771, 'grad_norm': 12.255349159240723, 'learning_rate': 1.7673869007427414e-05, 'epoch': 1.98}


 66%|██████▌   | 8170/12348 [3:06:00<1:36:17,  1.38s/it]

{'loss': 1.0279, 'grad_norm': 15.210427284240723, 'learning_rate': 1.763166779203241e-05, 'epoch': 1.98}


 66%|██████▌   | 8180/12348 [3:06:13<1:35:39,  1.38s/it]

{'loss': 0.7867, 'grad_norm': 18.37870216369629, 'learning_rate': 1.758946657663741e-05, 'epoch': 1.99}


 66%|██████▋   | 8190/12348 [3:06:27<1:35:57,  1.38s/it]

{'loss': 0.8355, 'grad_norm': 13.092122077941895, 'learning_rate': 1.7547265361242403e-05, 'epoch': 1.99}


 66%|██████▋   | 8200/12348 [3:06:41<1:35:37,  1.38s/it]

{'loss': 0.7088, 'grad_norm': 13.643369674682617, 'learning_rate': 1.7505064145847402e-05, 'epoch': 1.99}


 66%|██████▋   | 8210/12348 [3:06:55<1:35:11,  1.38s/it]

{'loss': 0.4799, 'grad_norm': 13.315354347229004, 'learning_rate': 1.74628629304524e-05, 'epoch': 1.99}


 67%|██████▋   | 8220/12348 [3:07:09<1:34:45,  1.38s/it]

{'loss': 0.6536, 'grad_norm': 9.459232330322266, 'learning_rate': 1.7420661715057395e-05, 'epoch': 2.0}


 67%|██████▋   | 8230/12348 [3:07:22<1:34:59,  1.38s/it]

{'loss': 0.8253, 'grad_norm': 7.035528659820557, 'learning_rate': 1.737846049966239e-05, 'epoch': 2.0}


 67%|██████▋   | 8240/12348 [3:07:36<1:33:49,  1.37s/it]

{'loss': 0.6561, 'grad_norm': 9.24023723602295, 'learning_rate': 1.7336259284267388e-05, 'epoch': 2.0}


 67%|██████▋   | 8250/12348 [3:07:50<1:34:03,  1.38s/it]

{'loss': 0.5642, 'grad_norm': 3.87595534324646, 'learning_rate': 1.7294058068872384e-05, 'epoch': 2.0}


 67%|██████▋   | 8260/12348 [3:08:03<1:34:15,  1.38s/it]

{'loss': 0.5167, 'grad_norm': 7.0372138023376465, 'learning_rate': 1.725185685347738e-05, 'epoch': 2.01}


 67%|██████▋   | 8270/12348 [3:08:17<1:33:37,  1.38s/it]

{'loss': 0.7616, 'grad_norm': 10.02840805053711, 'learning_rate': 1.7209655638082376e-05, 'epoch': 2.01}


 67%|██████▋   | 8280/12348 [3:08:31<1:33:43,  1.38s/it]

{'loss': 0.7245, 'grad_norm': 13.115612030029297, 'learning_rate': 1.7167454422687373e-05, 'epoch': 2.01}


 67%|██████▋   | 8290/12348 [3:08:45<1:33:27,  1.38s/it]

{'loss': 0.5716, 'grad_norm': 18.703901290893555, 'learning_rate': 1.7125253207292372e-05, 'epoch': 2.01}


 67%|██████▋   | 8300/12348 [3:08:59<1:33:07,  1.38s/it]

{'loss': 0.5077, 'grad_norm': 9.851821899414062, 'learning_rate': 1.7083051991897365e-05, 'epoch': 2.02}


 67%|██████▋   | 8310/12348 [3:09:13<1:32:58,  1.38s/it]

{'loss': 0.5451, 'grad_norm': 13.866158485412598, 'learning_rate': 1.7040850776502365e-05, 'epoch': 2.02}


 67%|██████▋   | 8320/12348 [3:09:26<1:32:45,  1.38s/it]

{'loss': 0.765, 'grad_norm': 23.991300582885742, 'learning_rate': 1.699864956110736e-05, 'epoch': 2.02}


 67%|██████▋   | 8330/12348 [3:09:40<1:32:49,  1.39s/it]

{'loss': 0.7239, 'grad_norm': 7.058870792388916, 'learning_rate': 1.6956448345712358e-05, 'epoch': 2.02}


 68%|██████▊   | 8340/12348 [3:09:54<1:32:03,  1.38s/it]

{'loss': 0.4288, 'grad_norm': 9.141676902770996, 'learning_rate': 1.6914247130317354e-05, 'epoch': 2.03}


 68%|██████▊   | 8350/12348 [3:10:08<1:31:44,  1.38s/it]

{'loss': 0.7308, 'grad_norm': 11.548788070678711, 'learning_rate': 1.687204591492235e-05, 'epoch': 2.03}


 68%|██████▊   | 8360/12348 [3:10:22<1:31:44,  1.38s/it]

{'loss': 0.56, 'grad_norm': 7.660162448883057, 'learning_rate': 1.6829844699527346e-05, 'epoch': 2.03}


 68%|██████▊   | 8370/12348 [3:10:35<1:31:31,  1.38s/it]

{'loss': 0.766, 'grad_norm': 8.798091888427734, 'learning_rate': 1.6787643484132343e-05, 'epoch': 2.03}


 68%|██████▊   | 8380/12348 [3:10:49<1:31:22,  1.38s/it]

{'loss': 0.5087, 'grad_norm': 13.643552780151367, 'learning_rate': 1.6745442268737342e-05, 'epoch': 2.04}


 68%|██████▊   | 8390/12348 [3:11:03<1:31:03,  1.38s/it]

{'loss': 0.3731, 'grad_norm': 6.064436435699463, 'learning_rate': 1.6703241053342335e-05, 'epoch': 2.04}


 68%|██████▊   | 8400/12348 [3:11:17<1:30:40,  1.38s/it]

{'loss': 0.4364, 'grad_norm': 8.441529273986816, 'learning_rate': 1.6661039837947335e-05, 'epoch': 2.04}


 68%|██████▊   | 8410/12348 [3:11:31<1:30:32,  1.38s/it]

{'loss': 0.3835, 'grad_norm': 10.414318084716797, 'learning_rate': 1.661883862255233e-05, 'epoch': 2.04}


 68%|██████▊   | 8420/12348 [3:11:45<1:30:05,  1.38s/it]

{'loss': 0.7998, 'grad_norm': 16.87904167175293, 'learning_rate': 1.6576637407157328e-05, 'epoch': 2.05}


 68%|██████▊   | 8430/12348 [3:11:58<1:30:17,  1.38s/it]

{'loss': 0.5246, 'grad_norm': 10.893668174743652, 'learning_rate': 1.6534436191762324e-05, 'epoch': 2.05}


 68%|██████▊   | 8440/12348 [3:12:12<1:29:59,  1.38s/it]

{'loss': 0.6644, 'grad_norm': 45.93880844116211, 'learning_rate': 1.649223497636732e-05, 'epoch': 2.05}


 68%|██████▊   | 8450/12348 [3:12:26<1:29:51,  1.38s/it]

{'loss': 1.0728, 'grad_norm': 19.096168518066406, 'learning_rate': 1.6450033760972316e-05, 'epoch': 2.05}


 69%|██████▊   | 8460/12348 [3:12:40<1:29:31,  1.38s/it]

{'loss': 0.7365, 'grad_norm': 23.8417911529541, 'learning_rate': 1.6407832545577316e-05, 'epoch': 2.06}


 69%|██████▊   | 8470/12348 [3:12:54<1:29:00,  1.38s/it]

{'loss': 0.4993, 'grad_norm': 4.44570779800415, 'learning_rate': 1.636563133018231e-05, 'epoch': 2.06}


 69%|██████▊   | 8480/12348 [3:13:08<1:29:07,  1.38s/it]

{'loss': 0.5302, 'grad_norm': 11.468033790588379, 'learning_rate': 1.6323430114787305e-05, 'epoch': 2.06}


 69%|██████▉   | 8490/12348 [3:13:21<1:28:34,  1.38s/it]

{'loss': 0.4164, 'grad_norm': 2.7438836097717285, 'learning_rate': 1.6281228899392305e-05, 'epoch': 2.06}


 69%|██████▉   | 8500/12348 [3:13:35<1:28:25,  1.38s/it]

{'loss': 0.5249, 'grad_norm': 10.14736270904541, 'learning_rate': 1.6239027683997298e-05, 'epoch': 2.07}


 69%|██████▉   | 8510/12348 [3:13:50<1:30:01,  1.41s/it]

{'loss': 0.679, 'grad_norm': 8.521796226501465, 'learning_rate': 1.6196826468602298e-05, 'epoch': 2.07}


 69%|██████▉   | 8520/12348 [3:14:04<1:28:24,  1.39s/it]

{'loss': 0.502, 'grad_norm': 15.868871688842773, 'learning_rate': 1.6154625253207294e-05, 'epoch': 2.07}


 69%|██████▉   | 8530/12348 [3:14:18<1:28:13,  1.39s/it]

{'loss': 0.6493, 'grad_norm': 29.808218002319336, 'learning_rate': 1.611242403781229e-05, 'epoch': 2.07}


 69%|██████▉   | 8540/12348 [3:14:32<1:27:51,  1.38s/it]

{'loss': 0.6484, 'grad_norm': 16.313053131103516, 'learning_rate': 1.6070222822417286e-05, 'epoch': 2.07}


 69%|██████▉   | 8550/12348 [3:14:46<1:27:16,  1.38s/it]

{'loss': 0.5931, 'grad_norm': 7.3596978187561035, 'learning_rate': 1.6028021607022283e-05, 'epoch': 2.08}


 69%|██████▉   | 8560/12348 [3:14:59<1:27:14,  1.38s/it]

{'loss': 0.6663, 'grad_norm': 18.564836502075195, 'learning_rate': 1.598582039162728e-05, 'epoch': 2.08}


 69%|██████▉   | 8570/12348 [3:15:13<1:26:55,  1.38s/it]

{'loss': 0.5349, 'grad_norm': 11.144649505615234, 'learning_rate': 1.594361917623228e-05, 'epoch': 2.08}


 69%|██████▉   | 8580/12348 [3:15:27<1:26:38,  1.38s/it]

{'loss': 0.3599, 'grad_norm': 7.83633279800415, 'learning_rate': 1.590141796083727e-05, 'epoch': 2.08}


 70%|██████▉   | 8590/12348 [3:15:41<1:26:21,  1.38s/it]

{'loss': 0.2911, 'grad_norm': 5.306938648223877, 'learning_rate': 1.585921674544227e-05, 'epoch': 2.09}


 70%|██████▉   | 8600/12348 [3:15:55<1:26:21,  1.38s/it]

{'loss': 0.8281, 'grad_norm': 16.492568969726562, 'learning_rate': 1.5817015530047268e-05, 'epoch': 2.09}


 70%|██████▉   | 8610/12348 [3:16:09<1:26:06,  1.38s/it]

{'loss': 0.5214, 'grad_norm': 3.5350492000579834, 'learning_rate': 1.577481431465226e-05, 'epoch': 2.09}


 70%|██████▉   | 8620/12348 [3:16:22<1:26:02,  1.38s/it]

{'loss': 0.5285, 'grad_norm': 7.5037150382995605, 'learning_rate': 1.573261309925726e-05, 'epoch': 2.09}


 70%|██████▉   | 8630/12348 [3:16:36<1:25:15,  1.38s/it]

{'loss': 0.6285, 'grad_norm': 27.537412643432617, 'learning_rate': 1.5690411883862256e-05, 'epoch': 2.1}


 70%|██████▉   | 8640/12348 [3:16:50<1:25:12,  1.38s/it]

{'loss': 0.373, 'grad_norm': 11.75238037109375, 'learning_rate': 1.5648210668467253e-05, 'epoch': 2.1}


 70%|███████   | 8650/12348 [3:17:04<1:24:47,  1.38s/it]

{'loss': 0.764, 'grad_norm': 12.735769271850586, 'learning_rate': 1.560600945307225e-05, 'epoch': 2.1}


 70%|███████   | 8660/12348 [3:17:18<1:25:00,  1.38s/it]

{'loss': 0.5131, 'grad_norm': 28.72891616821289, 'learning_rate': 1.5563808237677245e-05, 'epoch': 2.1}


 70%|███████   | 8670/12348 [3:17:31<1:24:27,  1.38s/it]

{'loss': 0.6792, 'grad_norm': 9.771638870239258, 'learning_rate': 1.552160702228224e-05, 'epoch': 2.11}


 70%|███████   | 8680/12348 [3:17:45<1:24:25,  1.38s/it]

{'loss': 0.7063, 'grad_norm': 6.530681610107422, 'learning_rate': 1.547940580688724e-05, 'epoch': 2.11}


 70%|███████   | 8690/12348 [3:17:59<1:24:14,  1.38s/it]

{'loss': 0.4505, 'grad_norm': 25.39303207397461, 'learning_rate': 1.5437204591492234e-05, 'epoch': 2.11}


 70%|███████   | 8700/12348 [3:18:13<1:23:55,  1.38s/it]

{'loss': 0.4939, 'grad_norm': 11.61579704284668, 'learning_rate': 1.5395003376097234e-05, 'epoch': 2.11}


 71%|███████   | 8710/12348 [3:18:27<1:23:40,  1.38s/it]

{'loss': 0.4973, 'grad_norm': 32.80937957763672, 'learning_rate': 1.535280216070223e-05, 'epoch': 2.12}


 71%|███████   | 8720/12348 [3:18:40<1:23:39,  1.38s/it]

{'loss': 0.3513, 'grad_norm': 4.221514701843262, 'learning_rate': 1.5310600945307226e-05, 'epoch': 2.12}


 71%|███████   | 8730/12348 [3:18:54<1:22:52,  1.37s/it]

{'loss': 0.3283, 'grad_norm': 10.98125171661377, 'learning_rate': 1.5268399729912223e-05, 'epoch': 2.12}


 71%|███████   | 8740/12348 [3:19:08<1:23:16,  1.38s/it]

{'loss': 0.4356, 'grad_norm': 9.60623836517334, 'learning_rate': 1.5226198514517217e-05, 'epoch': 2.12}


 71%|███████   | 8750/12348 [3:19:22<1:22:38,  1.38s/it]

{'loss': 0.5051, 'grad_norm': 14.756047248840332, 'learning_rate': 1.5183997299122215e-05, 'epoch': 2.13}


 71%|███████   | 8760/12348 [3:19:36<1:22:35,  1.38s/it]

{'loss': 0.4655, 'grad_norm': 0.14668215811252594, 'learning_rate': 1.5141796083727212e-05, 'epoch': 2.13}


 71%|███████   | 8770/12348 [3:19:50<1:22:10,  1.38s/it]

{'loss': 0.8374, 'grad_norm': 36.81044006347656, 'learning_rate': 1.509959486833221e-05, 'epoch': 2.13}


 71%|███████   | 8780/12348 [3:20:03<1:21:59,  1.38s/it]

{'loss': 0.4531, 'grad_norm': 3.194908380508423, 'learning_rate': 1.5057393652937204e-05, 'epoch': 2.13}


 71%|███████   | 8790/12348 [3:20:17<1:21:58,  1.38s/it]

{'loss': 0.6522, 'grad_norm': 13.025309562683105, 'learning_rate': 1.5015192437542202e-05, 'epoch': 2.14}


 71%|███████▏  | 8800/12348 [3:20:31<1:21:43,  1.38s/it]

{'loss': 0.7388, 'grad_norm': 18.893165588378906, 'learning_rate': 1.4972991222147198e-05, 'epoch': 2.14}


 71%|███████▏  | 8810/12348 [3:20:45<1:21:22,  1.38s/it]

{'loss': 1.0011, 'grad_norm': 17.444538116455078, 'learning_rate': 1.4930790006752196e-05, 'epoch': 2.14}


 71%|███████▏  | 8820/12348 [3:20:59<1:21:06,  1.38s/it]

{'loss': 0.8126, 'grad_norm': 28.984500885009766, 'learning_rate': 1.4888588791357191e-05, 'epoch': 2.14}


 72%|███████▏  | 8830/12348 [3:21:13<1:21:05,  1.38s/it]

{'loss': 0.3829, 'grad_norm': 15.976840019226074, 'learning_rate': 1.4846387575962189e-05, 'epoch': 2.15}


 72%|███████▏  | 8840/12348 [3:21:26<1:20:35,  1.38s/it]

{'loss': 0.5365, 'grad_norm': 1.2238391637802124, 'learning_rate': 1.4804186360567185e-05, 'epoch': 2.15}


 72%|███████▏  | 8850/12348 [3:21:40<1:20:35,  1.38s/it]

{'loss': 0.4387, 'grad_norm': 11.67246150970459, 'learning_rate': 1.4761985145172183e-05, 'epoch': 2.15}


 72%|███████▏  | 8860/12348 [3:21:54<1:20:05,  1.38s/it]

{'loss': 0.7213, 'grad_norm': 19.395063400268555, 'learning_rate': 1.4719783929777178e-05, 'epoch': 2.15}


 72%|███████▏  | 8870/12348 [3:22:08<1:19:53,  1.38s/it]

{'loss': 0.5726, 'grad_norm': 4.188187599182129, 'learning_rate': 1.4677582714382174e-05, 'epoch': 2.16}


 72%|███████▏  | 8880/12348 [3:22:22<1:19:53,  1.38s/it]

{'loss': 0.443, 'grad_norm': 12.55532455444336, 'learning_rate': 1.4635381498987172e-05, 'epoch': 2.16}


 72%|███████▏  | 8890/12348 [3:22:35<1:19:32,  1.38s/it]

{'loss': 0.3917, 'grad_norm': 0.8417633771896362, 'learning_rate': 1.4593180283592167e-05, 'epoch': 2.16}


 72%|███████▏  | 8900/12348 [3:22:49<1:19:27,  1.38s/it]

{'loss': 0.681, 'grad_norm': 3.3061203956604004, 'learning_rate': 1.4550979068197165e-05, 'epoch': 2.16}


 72%|███████▏  | 8910/12348 [3:23:03<1:19:04,  1.38s/it]

{'loss': 0.6065, 'grad_norm': 5.669334411621094, 'learning_rate': 1.4508777852802161e-05, 'epoch': 2.16}


 72%|███████▏  | 8920/12348 [3:23:17<1:18:45,  1.38s/it]

{'loss': 0.3717, 'grad_norm': 6.859986305236816, 'learning_rate': 1.4466576637407159e-05, 'epoch': 2.17}


 72%|███████▏  | 8930/12348 [3:23:31<1:18:37,  1.38s/it]

{'loss': 0.3452, 'grad_norm': 17.95920753479004, 'learning_rate': 1.4424375422012154e-05, 'epoch': 2.17}


 72%|███████▏  | 8940/12348 [3:23:45<1:18:18,  1.38s/it]

{'loss': 0.3683, 'grad_norm': 2.7483925819396973, 'learning_rate': 1.4382174206617152e-05, 'epoch': 2.17}


 72%|███████▏  | 8950/12348 [3:23:58<1:18:06,  1.38s/it]

{'loss': 0.4549, 'grad_norm': 19.29351806640625, 'learning_rate': 1.4339972991222148e-05, 'epoch': 2.17}


 73%|███████▎  | 8960/12348 [3:24:12<1:17:49,  1.38s/it]

{'loss': 0.4607, 'grad_norm': 20.80133628845215, 'learning_rate': 1.4297771775827146e-05, 'epoch': 2.18}


 73%|███████▎  | 8970/12348 [3:24:26<1:17:58,  1.38s/it]

{'loss': 0.458, 'grad_norm': 11.909981727600098, 'learning_rate': 1.425557056043214e-05, 'epoch': 2.18}


 73%|███████▎  | 8980/12348 [3:24:40<1:17:36,  1.38s/it]

{'loss': 0.7211, 'grad_norm': 13.865236282348633, 'learning_rate': 1.4213369345037138e-05, 'epoch': 2.18}


 73%|███████▎  | 8990/12348 [3:24:54<1:17:20,  1.38s/it]

{'loss': 0.3312, 'grad_norm': 22.75741195678711, 'learning_rate': 1.4171168129642135e-05, 'epoch': 2.18}


 73%|███████▎  | 9000/12348 [3:25:07<1:16:54,  1.38s/it]

{'loss': 0.495, 'grad_norm': 0.41918593645095825, 'learning_rate': 1.412896691424713e-05, 'epoch': 2.19}


 73%|███████▎  | 9010/12348 [3:25:22<1:18:11,  1.41s/it]

{'loss': 0.6642, 'grad_norm': 21.8140926361084, 'learning_rate': 1.4086765698852127e-05, 'epoch': 2.19}


 73%|███████▎  | 9020/12348 [3:25:36<1:17:07,  1.39s/it]

{'loss': 0.7369, 'grad_norm': 3.421870708465576, 'learning_rate': 1.4044564483457124e-05, 'epoch': 2.19}


 73%|███████▎  | 9030/12348 [3:25:50<1:16:45,  1.39s/it]

{'loss': 0.792, 'grad_norm': 20.216087341308594, 'learning_rate': 1.4002363268062122e-05, 'epoch': 2.19}


 73%|███████▎  | 9040/12348 [3:26:04<1:16:28,  1.39s/it]

{'loss': 0.3454, 'grad_norm': 37.59352111816406, 'learning_rate': 1.3960162052667116e-05, 'epoch': 2.2}


 73%|███████▎  | 9050/12348 [3:26:18<1:15:54,  1.38s/it]

{'loss': 0.511, 'grad_norm': 9.98868179321289, 'learning_rate': 1.3917960837272114e-05, 'epoch': 2.2}


 73%|███████▎  | 9060/12348 [3:26:32<1:15:38,  1.38s/it]

{'loss': 0.523, 'grad_norm': 13.18276596069336, 'learning_rate': 1.387575962187711e-05, 'epoch': 2.2}


 73%|███████▎  | 9070/12348 [3:26:46<1:15:23,  1.38s/it]

{'loss': 0.6465, 'grad_norm': 0.3888101875782013, 'learning_rate': 1.3833558406482108e-05, 'epoch': 2.2}


 74%|███████▎  | 9080/12348 [3:26:59<1:15:23,  1.38s/it]

{'loss': 0.3352, 'grad_norm': 16.635581970214844, 'learning_rate': 1.3791357191087103e-05, 'epoch': 2.21}


 74%|███████▎  | 9090/12348 [3:27:13<1:14:55,  1.38s/it]

{'loss': 0.3829, 'grad_norm': 13.304790496826172, 'learning_rate': 1.3749155975692101e-05, 'epoch': 2.21}


 74%|███████▎  | 9100/12348 [3:27:27<1:14:59,  1.39s/it]

{'loss': 0.4452, 'grad_norm': 15.593376159667969, 'learning_rate': 1.3706954760297097e-05, 'epoch': 2.21}


 74%|███████▍  | 9110/12348 [3:27:41<1:14:42,  1.38s/it]

{'loss': 0.579, 'grad_norm': 20.787513732910156, 'learning_rate': 1.3664753544902092e-05, 'epoch': 2.21}


 74%|███████▍  | 9120/12348 [3:27:55<1:14:37,  1.39s/it]

{'loss': 0.5625, 'grad_norm': 21.17177391052246, 'learning_rate': 1.362255232950709e-05, 'epoch': 2.22}


 74%|███████▍  | 9130/12348 [3:28:09<1:14:03,  1.38s/it]

{'loss': 0.486, 'grad_norm': 20.00232696533203, 'learning_rate': 1.3580351114112086e-05, 'epoch': 2.22}


 74%|███████▍  | 9140/12348 [3:28:22<1:13:55,  1.38s/it]

{'loss': 0.6218, 'grad_norm': 6.312755107879639, 'learning_rate': 1.3538149898717084e-05, 'epoch': 2.22}


 74%|███████▍  | 9150/12348 [3:28:36<1:13:21,  1.38s/it]

{'loss': 0.537, 'grad_norm': 2.557157516479492, 'learning_rate': 1.3495948683322079e-05, 'epoch': 2.22}


 74%|███████▍  | 9160/12348 [3:28:50<1:13:27,  1.38s/it]

{'loss': 0.6553, 'grad_norm': 15.198925971984863, 'learning_rate': 1.3453747467927077e-05, 'epoch': 2.23}


 74%|███████▍  | 9170/12348 [3:29:04<1:12:59,  1.38s/it]

{'loss': 0.6366, 'grad_norm': 18.955780029296875, 'learning_rate': 1.3411546252532073e-05, 'epoch': 2.23}


 74%|███████▍  | 9180/12348 [3:29:18<1:13:04,  1.38s/it]

{'loss': 0.5203, 'grad_norm': 4.775759696960449, 'learning_rate': 1.3369345037137071e-05, 'epoch': 2.23}


 74%|███████▍  | 9190/12348 [3:29:32<1:12:39,  1.38s/it]

{'loss': 0.6595, 'grad_norm': 3.804102897644043, 'learning_rate': 1.3327143821742066e-05, 'epoch': 2.23}


 75%|███████▍  | 9200/12348 [3:29:45<1:12:31,  1.38s/it]

{'loss': 0.6855, 'grad_norm': 5.138683319091797, 'learning_rate': 1.3284942606347065e-05, 'epoch': 2.24}


 75%|███████▍  | 9210/12348 [3:29:59<1:12:17,  1.38s/it]

{'loss': 0.9627, 'grad_norm': 16.66240692138672, 'learning_rate': 1.324274139095206e-05, 'epoch': 2.24}


 75%|███████▍  | 9220/12348 [3:30:13<1:12:02,  1.38s/it]

{'loss': 0.3635, 'grad_norm': 22.684782028198242, 'learning_rate': 1.3200540175557058e-05, 'epoch': 2.24}


 75%|███████▍  | 9230/12348 [3:30:27<1:11:45,  1.38s/it]

{'loss': 0.4134, 'grad_norm': 17.428627014160156, 'learning_rate': 1.3158338960162054e-05, 'epoch': 2.24}


 75%|███████▍  | 9240/12348 [3:30:41<1:11:45,  1.39s/it]

{'loss': 0.4397, 'grad_norm': 10.164565086364746, 'learning_rate': 1.3116137744767049e-05, 'epoch': 2.24}


 75%|███████▍  | 9250/12348 [3:30:55<1:11:20,  1.38s/it]

{'loss': 0.8628, 'grad_norm': 29.662734985351562, 'learning_rate': 1.3073936529372047e-05, 'epoch': 2.25}


 75%|███████▍  | 9260/12348 [3:31:08<1:10:53,  1.38s/it]

{'loss': 0.5319, 'grad_norm': 23.550222396850586, 'learning_rate': 1.3031735313977041e-05, 'epoch': 2.25}


 75%|███████▌  | 9270/12348 [3:31:22<1:10:43,  1.38s/it]

{'loss': 0.4003, 'grad_norm': 10.947608947753906, 'learning_rate': 1.2989534098582041e-05, 'epoch': 2.25}


 75%|███████▌  | 9280/12348 [3:31:36<1:10:42,  1.38s/it]

{'loss': 0.809, 'grad_norm': 19.537755966186523, 'learning_rate': 1.2947332883187036e-05, 'epoch': 2.25}


 75%|███████▌  | 9290/12348 [3:31:50<1:10:26,  1.38s/it]

{'loss': 0.4007, 'grad_norm': 14.696080207824707, 'learning_rate': 1.2905131667792034e-05, 'epoch': 2.26}


 75%|███████▌  | 9300/12348 [3:32:04<1:10:11,  1.38s/it]

{'loss': 0.519, 'grad_norm': 19.884540557861328, 'learning_rate': 1.286293045239703e-05, 'epoch': 2.26}


 75%|███████▌  | 9310/12348 [3:32:18<1:09:45,  1.38s/it]

{'loss': 0.6465, 'grad_norm': 15.629982948303223, 'learning_rate': 1.2820729237002028e-05, 'epoch': 2.26}


 75%|███████▌  | 9320/12348 [3:32:31<1:09:45,  1.38s/it]

{'loss': 0.6069, 'grad_norm': 12.482458114624023, 'learning_rate': 1.2778528021607022e-05, 'epoch': 2.26}


 76%|███████▌  | 9330/12348 [3:32:45<1:09:22,  1.38s/it]

{'loss': 0.4047, 'grad_norm': 8.776737213134766, 'learning_rate': 1.273632680621202e-05, 'epoch': 2.27}


 76%|███████▌  | 9340/12348 [3:32:59<1:09:21,  1.38s/it]

{'loss': 0.5698, 'grad_norm': 14.929300308227539, 'learning_rate': 1.2694125590817017e-05, 'epoch': 2.27}


 76%|███████▌  | 9350/12348 [3:33:13<1:08:52,  1.38s/it]

{'loss': 0.6638, 'grad_norm': 10.326534271240234, 'learning_rate': 1.2651924375422015e-05, 'epoch': 2.27}


 76%|███████▌  | 9360/12348 [3:33:27<1:09:01,  1.39s/it]

{'loss': 0.4802, 'grad_norm': 6.223794937133789, 'learning_rate': 1.260972316002701e-05, 'epoch': 2.27}


 76%|███████▌  | 9370/12348 [3:33:40<1:08:37,  1.38s/it]

{'loss': 0.3798, 'grad_norm': 16.0166072845459, 'learning_rate': 1.2567521944632006e-05, 'epoch': 2.28}


 76%|███████▌  | 9380/12348 [3:33:54<1:08:20,  1.38s/it]

{'loss': 0.5904, 'grad_norm': 54.49198913574219, 'learning_rate': 1.2525320729237004e-05, 'epoch': 2.28}


 76%|███████▌  | 9390/12348 [3:34:08<1:07:58,  1.38s/it]

{'loss': 0.3648, 'grad_norm': 7.2394208908081055, 'learning_rate': 1.2483119513842e-05, 'epoch': 2.28}


 76%|███████▌  | 9400/12348 [3:34:22<1:07:47,  1.38s/it]

{'loss': 0.6032, 'grad_norm': 30.84479331970215, 'learning_rate': 1.2440918298446996e-05, 'epoch': 2.28}


 76%|███████▌  | 9410/12348 [3:34:36<1:07:35,  1.38s/it]

{'loss': 0.6412, 'grad_norm': 8.263395309448242, 'learning_rate': 1.2398717083051992e-05, 'epoch': 2.29}


 76%|███████▋  | 9420/12348 [3:34:50<1:07:23,  1.38s/it]

{'loss': 0.5288, 'grad_norm': 9.033743858337402, 'learning_rate': 1.235651586765699e-05, 'epoch': 2.29}


 76%|███████▋  | 9430/12348 [3:35:03<1:07:03,  1.38s/it]

{'loss': 0.5493, 'grad_norm': 8.768134117126465, 'learning_rate': 1.2314314652261985e-05, 'epoch': 2.29}


 76%|███████▋  | 9440/12348 [3:35:17<1:06:53,  1.38s/it]

{'loss': 0.4498, 'grad_norm': 0.7966355681419373, 'learning_rate': 1.2272113436866981e-05, 'epoch': 2.29}


 77%|███████▋  | 9450/12348 [3:35:31<1:06:50,  1.38s/it]

{'loss': 0.3467, 'grad_norm': 12.661548614501953, 'learning_rate': 1.222991222147198e-05, 'epoch': 2.3}


 77%|███████▋  | 9460/12348 [3:35:45<1:06:26,  1.38s/it]

{'loss': 0.6568, 'grad_norm': 12.057004928588867, 'learning_rate': 1.2187711006076976e-05, 'epoch': 2.3}


 77%|███████▋  | 9470/12348 [3:35:59<1:06:20,  1.38s/it]

{'loss': 0.5292, 'grad_norm': 10.42369556427002, 'learning_rate': 1.2145509790681972e-05, 'epoch': 2.3}


 77%|███████▋  | 9480/12348 [3:36:13<1:06:04,  1.38s/it]

{'loss': 0.6944, 'grad_norm': 9.932692527770996, 'learning_rate': 1.2103308575286968e-05, 'epoch': 2.3}


 77%|███████▋  | 9490/12348 [3:36:26<1:05:58,  1.39s/it]

{'loss': 0.4932, 'grad_norm': 7.083014965057373, 'learning_rate': 1.2061107359891966e-05, 'epoch': 2.31}


 77%|███████▋  | 9500/12348 [3:36:40<1:05:37,  1.38s/it]

{'loss': 0.4847, 'grad_norm': 59.28655242919922, 'learning_rate': 1.2018906144496962e-05, 'epoch': 2.31}


 77%|███████▋  | 9510/12348 [3:36:55<1:07:15,  1.42s/it]

{'loss': 0.8948, 'grad_norm': 6.512999057769775, 'learning_rate': 1.1976704929101959e-05, 'epoch': 2.31}


 77%|███████▋  | 9520/12348 [3:37:09<1:06:12,  1.40s/it]

{'loss': 0.5697, 'grad_norm': 11.677608489990234, 'learning_rate': 1.1934503713706955e-05, 'epoch': 2.31}


 77%|███████▋  | 9530/12348 [3:37:23<1:05:52,  1.40s/it]

{'loss': 0.5302, 'grad_norm': 13.89017105102539, 'learning_rate': 1.1892302498311953e-05, 'epoch': 2.32}


 77%|███████▋  | 9540/12348 [3:37:37<1:05:17,  1.40s/it]

{'loss': 0.5863, 'grad_norm': 36.35702133178711, 'learning_rate': 1.185010128291695e-05, 'epoch': 2.32}


 77%|███████▋  | 9550/12348 [3:37:51<1:04:44,  1.39s/it]

{'loss': 0.3688, 'grad_norm': 0.9281421899795532, 'learning_rate': 1.1807900067521945e-05, 'epoch': 2.32}


 77%|███████▋  | 9560/12348 [3:38:05<1:04:21,  1.39s/it]

{'loss': 0.6548, 'grad_norm': 29.7369441986084, 'learning_rate': 1.1765698852126942e-05, 'epoch': 2.32}


 78%|███████▊  | 9570/12348 [3:38:19<1:04:11,  1.39s/it]

{'loss': 0.5249, 'grad_norm': 5.7036542892456055, 'learning_rate': 1.1723497636731938e-05, 'epoch': 2.33}


 78%|███████▊  | 9580/12348 [3:38:33<1:03:44,  1.38s/it]

{'loss': 0.4342, 'grad_norm': 13.101618766784668, 'learning_rate': 1.1681296421336934e-05, 'epoch': 2.33}


 78%|███████▊  | 9590/12348 [3:38:47<1:03:43,  1.39s/it]

{'loss': 0.7422, 'grad_norm': 6.01476526260376, 'learning_rate': 1.163909520594193e-05, 'epoch': 2.33}


 78%|███████▊  | 9600/12348 [3:39:01<1:03:11,  1.38s/it]

{'loss': 0.3735, 'grad_norm': 0.03729875758290291, 'learning_rate': 1.1596893990546929e-05, 'epoch': 2.33}


 78%|███████▊  | 9610/12348 [3:39:14<1:03:08,  1.38s/it]

{'loss': 0.6332, 'grad_norm': 9.729293823242188, 'learning_rate': 1.1554692775151925e-05, 'epoch': 2.33}


 78%|███████▊  | 9620/12348 [3:39:28<1:02:42,  1.38s/it]

{'loss': 0.5372, 'grad_norm': 15.36502742767334, 'learning_rate': 1.1512491559756921e-05, 'epoch': 2.34}


 78%|███████▊  | 9630/12348 [3:39:42<1:02:43,  1.38s/it]

{'loss': 0.3435, 'grad_norm': 12.755036354064941, 'learning_rate': 1.1470290344361918e-05, 'epoch': 2.34}


 78%|███████▊  | 9640/12348 [3:39:56<1:02:17,  1.38s/it]

{'loss': 0.4923, 'grad_norm': 16.2878475189209, 'learning_rate': 1.1428089128966915e-05, 'epoch': 2.34}


 78%|███████▊  | 9650/12348 [3:40:10<1:02:17,  1.39s/it]

{'loss': 0.4012, 'grad_norm': 5.806301593780518, 'learning_rate': 1.1385887913571912e-05, 'epoch': 2.34}


 78%|███████▊  | 9660/12348 [3:40:23<1:01:44,  1.38s/it]

{'loss': 0.4935, 'grad_norm': 2.380713939666748, 'learning_rate': 1.1343686698176908e-05, 'epoch': 2.35}


 78%|███████▊  | 9670/12348 [3:40:37<1:01:39,  1.38s/it]

{'loss': 0.8157, 'grad_norm': 18.644020080566406, 'learning_rate': 1.1301485482781906e-05, 'epoch': 2.35}


 78%|███████▊  | 9680/12348 [3:40:51<1:01:25,  1.38s/it]

{'loss': 0.7255, 'grad_norm': 28.83805274963379, 'learning_rate': 1.12592842673869e-05, 'epoch': 2.35}


 78%|███████▊  | 9690/12348 [3:41:05<1:01:10,  1.38s/it]

{'loss': 0.7315, 'grad_norm': 12.046892166137695, 'learning_rate': 1.1217083051991897e-05, 'epoch': 2.35}


 79%|███████▊  | 9700/12348 [3:41:19<1:00:50,  1.38s/it]

{'loss': 0.5873, 'grad_norm': 16.37349510192871, 'learning_rate': 1.1174881836596895e-05, 'epoch': 2.36}


 79%|███████▊  | 9710/12348 [3:41:33<1:00:44,  1.38s/it]

{'loss': 0.6686, 'grad_norm': 24.113685607910156, 'learning_rate': 1.1132680621201891e-05, 'epoch': 2.36}


 79%|███████▊  | 9720/12348 [3:41:46<1:00:33,  1.38s/it]

{'loss': 0.7391, 'grad_norm': 26.248516082763672, 'learning_rate': 1.1090479405806887e-05, 'epoch': 2.36}


 79%|███████▉  | 9730/12348 [3:42:00<1:00:24,  1.38s/it]

{'loss': 0.4024, 'grad_norm': 29.63797950744629, 'learning_rate': 1.1048278190411884e-05, 'epoch': 2.36}


 79%|███████▉  | 9740/12348 [3:42:14<1:00:04,  1.38s/it]

{'loss': 0.7691, 'grad_norm': 28.379528045654297, 'learning_rate': 1.1006076975016882e-05, 'epoch': 2.37}


 79%|███████▉  | 9750/12348 [3:42:28<59:52,  1.38s/it]

{'loss': 0.6211, 'grad_norm': 1.1480767726898193, 'learning_rate': 1.0963875759621878e-05, 'epoch': 2.37}


 79%|███████▉  | 9760/12348 [3:42:42<59:43,  1.38s/it]

{'loss': 0.6165, 'grad_norm': 4.759340286254883, 'learning_rate': 1.0921674544226874e-05, 'epoch': 2.37}


 79%|███████▉  | 9770/12348 [3:42:56<59:20,  1.38s/it]

{'loss': 0.6478, 'grad_norm': 17.44548988342285, 'learning_rate': 1.087947332883187e-05, 'epoch': 2.37}


 79%|███████▉  | 9780/12348 [3:43:09<58:59,  1.38s/it]

{'loss': 0.498, 'grad_norm': 7.196582794189453, 'learning_rate': 1.0837272113436869e-05, 'epoch': 2.38}


 79%|███████▉  | 9790/12348 [3:43:23<58:52,  1.38s/it]

{'loss': 0.3388, 'grad_norm': 12.555231094360352, 'learning_rate': 1.0795070898041865e-05, 'epoch': 2.38}


 79%|███████▉  | 9800/12348 [3:43:37<58:49,  1.39s/it]

{'loss': 0.7507, 'grad_norm': 26.53321647644043, 'learning_rate': 1.0752869682646861e-05, 'epoch': 2.38}


 79%|███████▉  | 9810/12348 [3:43:51<58:24,  1.38s/it]

{'loss': 0.7471, 'grad_norm': 17.131057739257812, 'learning_rate': 1.0710668467251857e-05, 'epoch': 2.38}


 80%|███████▉  | 9820/12348 [3:44:05<58:09,  1.38s/it]

{'loss': 0.4232, 'grad_norm': 1.7967166900634766, 'learning_rate': 1.0668467251856854e-05, 'epoch': 2.39}


 80%|███████▉  | 9830/12348 [3:44:19<57:58,  1.38s/it]

{'loss': 0.6273, 'grad_norm': 14.389677047729492, 'learning_rate': 1.062626603646185e-05, 'epoch': 2.39}


 80%|███████▉  | 9840/12348 [3:44:32<57:41,  1.38s/it]

{'loss': 0.5035, 'grad_norm': 3.9233345985412598, 'learning_rate': 1.0584064821066846e-05, 'epoch': 2.39}


 80%|███████▉  | 9850/12348 [3:44:46<57:33,  1.38s/it]

{'loss': 0.2729, 'grad_norm': 22.975168228149414, 'learning_rate': 1.0541863605671844e-05, 'epoch': 2.39}


 80%|███████▉  | 9860/12348 [3:45:00<57:10,  1.38s/it]

{'loss': 0.6237, 'grad_norm': 13.252409934997559, 'learning_rate': 1.049966239027684e-05, 'epoch': 2.4}


 80%|███████▉  | 9870/12348 [3:45:14<57:13,  1.39s/it]

{'loss': 0.4273, 'grad_norm': 8.996405601501465, 'learning_rate': 1.0457461174881837e-05, 'epoch': 2.4}


 80%|████████  | 9880/12348 [3:45:28<56:50,  1.38s/it]

{'loss': 0.5099, 'grad_norm': 2.930406093597412, 'learning_rate': 1.0415259959486833e-05, 'epoch': 2.4}


 80%|████████  | 9890/12348 [3:45:42<56:48,  1.39s/it]

{'loss': 0.6843, 'grad_norm': 11.455147743225098, 'learning_rate': 1.0373058744091831e-05, 'epoch': 2.4}


 80%|████████  | 9900/12348 [3:45:56<56:39,  1.39s/it]

{'loss': 0.6881, 'grad_norm': 7.597421169281006, 'learning_rate': 1.0330857528696827e-05, 'epoch': 2.41}


 80%|████████  | 9910/12348 [3:46:09<55:55,  1.38s/it]

{'loss': 0.3031, 'grad_norm': 21.094501495361328, 'learning_rate': 1.0288656313301824e-05, 'epoch': 2.41}


 80%|████████  | 9920/12348 [3:46:23<55:54,  1.38s/it]

{'loss': 0.4571, 'grad_norm': 7.001164436340332, 'learning_rate': 1.024645509790682e-05, 'epoch': 2.41}


 80%|████████  | 9930/12348 [3:46:37<55:35,  1.38s/it]

{'loss': 0.4473, 'grad_norm': 24.218402862548828, 'learning_rate': 1.0204253882511818e-05, 'epoch': 2.41}


 80%|████████  | 9940/12348 [3:46:51<55:24,  1.38s/it]

{'loss': 0.4942, 'grad_norm': 31.77629280090332, 'learning_rate': 1.0162052667116813e-05, 'epoch': 2.41}


 81%|████████  | 9950/12348 [3:47:05<55:16,  1.38s/it]

{'loss': 0.8245, 'grad_norm': 18.723325729370117, 'learning_rate': 1.0119851451721809e-05, 'epoch': 2.42}


 81%|████████  | 9960/12348 [3:47:18<54:58,  1.38s/it]

{'loss': 0.5549, 'grad_norm': 16.80594825744629, 'learning_rate': 1.0077650236326807e-05, 'epoch': 2.42}


 81%|████████  | 9970/12348 [3:47:32<54:46,  1.38s/it]

{'loss': 0.424, 'grad_norm': 2.1181821823120117, 'learning_rate': 1.0035449020931803e-05, 'epoch': 2.42}


 81%|████████  | 9980/12348 [3:47:46<54:27,  1.38s/it]

{'loss': 0.3355, 'grad_norm': 1.7189834117889404, 'learning_rate': 9.9932478055368e-06, 'epoch': 2.42}


 81%|████████  | 9990/12348 [3:48:00<54:25,  1.38s/it]

{'loss': 0.774, 'grad_norm': 35.43455505371094, 'learning_rate': 9.951046590141796e-06, 'epoch': 2.43}


 81%|████████  | 10000/12348 [3:48:14<54:04,  1.38s/it]

{'loss': 0.4164, 'grad_norm': 0.9361504316329956, 'learning_rate': 9.908845374746794e-06, 'epoch': 2.43}


 81%|████████  | 10010/12348 [3:48:29<54:57,  1.41s/it]

{'loss': 0.6441, 'grad_norm': 10.73544979095459, 'learning_rate': 9.86664415935179e-06, 'epoch': 2.43}


 81%|████████  | 10020/12348 [3:48:43<53:46,  1.39s/it]

{'loss': 0.5283, 'grad_norm': 0.45568135380744934, 'learning_rate': 9.824442943956786e-06, 'epoch': 2.43}


 81%|████████  | 10030/12348 [3:48:57<53:31,  1.39s/it]

{'loss': 0.3287, 'grad_norm': 1.3122599124908447, 'learning_rate': 9.782241728561783e-06, 'epoch': 2.44}


 81%|████████▏ | 10040/12348 [3:49:10<53:03,  1.38s/it]

{'loss': 0.9036, 'grad_norm': 25.53919219970703, 'learning_rate': 9.74004051316678e-06, 'epoch': 2.44}


 81%|████████▏ | 10050/12348 [3:49:24<52:53,  1.38s/it]

{'loss': 0.262, 'grad_norm': 9.523512840270996, 'learning_rate': 9.697839297771777e-06, 'epoch': 2.44}


 81%|████████▏ | 10060/12348 [3:49:38<52:35,  1.38s/it]

{'loss': 0.556, 'grad_norm': 12.9423828125, 'learning_rate': 9.655638082376771e-06, 'epoch': 2.44}


 82%|████████▏ | 10070/12348 [3:49:52<52:30,  1.38s/it]

{'loss': 0.6716, 'grad_norm': 17.158451080322266, 'learning_rate': 9.61343686698177e-06, 'epoch': 2.45}


 82%|████████▏ | 10080/12348 [3:50:06<52:09,  1.38s/it]

{'loss': 0.3491, 'grad_norm': 1.292806625366211, 'learning_rate': 9.571235651586766e-06, 'epoch': 2.45}


 82%|████████▏ | 10090/12348 [3:50:20<51:56,  1.38s/it]

{'loss': 0.4554, 'grad_norm': 6.7701735496521, 'learning_rate': 9.529034436191762e-06, 'epoch': 2.45}


 82%|████████▏ | 10100/12348 [3:50:33<51:55,  1.39s/it]

{'loss': 0.4928, 'grad_norm': 21.520769119262695, 'learning_rate': 9.486833220796758e-06, 'epoch': 2.45}


 82%|████████▏ | 10110/12348 [3:50:47<51:25,  1.38s/it]

{'loss': 0.5391, 'grad_norm': 17.507444381713867, 'learning_rate': 9.444632005401756e-06, 'epoch': 2.46}


 82%|████████▏ | 10120/12348 [3:51:01<51:11,  1.38s/it]

{'loss': 0.2244, 'grad_norm': 9.30534553527832, 'learning_rate': 9.402430790006753e-06, 'epoch': 2.46}


 82%|████████▏ | 10130/12348 [3:51:15<51:03,  1.38s/it]

{'loss': 0.7938, 'grad_norm': 4.419005393981934, 'learning_rate': 9.360229574611749e-06, 'epoch': 2.46}


 82%|████████▏ | 10140/12348 [3:51:29<51:02,  1.39s/it]

{'loss': 0.773, 'grad_norm': 4.262432098388672, 'learning_rate': 9.318028359216747e-06, 'epoch': 2.46}


 82%|████████▏ | 10150/12348 [3:51:43<50:37,  1.38s/it]

{'loss': 0.7467, 'grad_norm': 4.253844261169434, 'learning_rate': 9.275827143821743e-06, 'epoch': 2.47}


 82%|████████▏ | 10160/12348 [3:51:56<50:22,  1.38s/it]

{'loss': 0.385, 'grad_norm': 12.192298889160156, 'learning_rate': 9.23362592842674e-06, 'epoch': 2.47}


 82%|████████▏ | 10170/12348 [3:52:10<50:08,  1.38s/it]

{'loss': 0.4734, 'grad_norm': 12.892061233520508, 'learning_rate': 9.191424713031736e-06, 'epoch': 2.47}


 82%|████████▏ | 10180/12348 [3:52:24<49:46,  1.38s/it]

{'loss': 0.4397, 'grad_norm': 10.097731590270996, 'learning_rate': 9.149223497636734e-06, 'epoch': 2.47}


 83%|████████▎ | 10190/12348 [3:52:38<49:38,  1.38s/it]

{'loss': 0.7047, 'grad_norm': 5.344496250152588, 'learning_rate': 9.107022282241728e-06, 'epoch': 2.48}


 83%|████████▎ | 10200/12348 [3:52:52<49:25,  1.38s/it]

{'loss': 0.5744, 'grad_norm': 5.900732517242432, 'learning_rate': 9.064821066846725e-06, 'epoch': 2.48}


 83%|████████▎ | 10210/12348 [3:53:06<49:12,  1.38s/it]

{'loss': 0.4073, 'grad_norm': 2.849740505218506, 'learning_rate': 9.022619851451723e-06, 'epoch': 2.48}


 83%|████████▎ | 10220/12348 [3:53:19<48:54,  1.38s/it]

{'loss': 0.404, 'grad_norm': 11.79879379272461, 'learning_rate': 8.980418636056719e-06, 'epoch': 2.48}


 83%|████████▎ | 10230/12348 [3:53:33<48:51,  1.38s/it]

{'loss': 0.7847, 'grad_norm': 17.23260498046875, 'learning_rate': 8.938217420661715e-06, 'epoch': 2.49}


 83%|████████▎ | 10240/12348 [3:53:47<48:39,  1.38s/it]

{'loss': 0.4485, 'grad_norm': 8.163534164428711, 'learning_rate': 8.896016205266711e-06, 'epoch': 2.49}


 83%|████████▎ | 10250/12348 [3:54:01<48:19,  1.38s/it]

{'loss': 0.5524, 'grad_norm': 14.94781494140625, 'learning_rate': 8.85381498987171e-06, 'epoch': 2.49}


 83%|████████▎ | 10260/12348 [3:54:15<47:57,  1.38s/it]

{'loss': 0.7092, 'grad_norm': 15.765273094177246, 'learning_rate': 8.811613774476706e-06, 'epoch': 2.49}


 83%|████████▎ | 10270/12348 [3:54:29<47:58,  1.39s/it]

{'loss': 0.6713, 'grad_norm': 1.7190883159637451, 'learning_rate': 8.769412559081702e-06, 'epoch': 2.5}


 83%|████████▎ | 10280/12348 [3:54:42<47:36,  1.38s/it]

{'loss': 0.4398, 'grad_norm': 3.3027594089508057, 'learning_rate': 8.727211343686698e-06, 'epoch': 2.5}


 83%|████████▎ | 10290/12348 [3:54:56<47:24,  1.38s/it]

{'loss': 0.5254, 'grad_norm': 5.8885416984558105, 'learning_rate': 8.685010128291696e-06, 'epoch': 2.5}


 83%|████████▎ | 10300/12348 [3:55:10<47:12,  1.38s/it]

{'loss': 0.6464, 'grad_norm': 7.116899490356445, 'learning_rate': 8.642808912896693e-06, 'epoch': 2.5}


 83%|████████▎ | 10310/12348 [3:55:24<46:56,  1.38s/it]

{'loss': 0.6619, 'grad_norm': 3.947460412979126, 'learning_rate': 8.600607697501689e-06, 'epoch': 2.5}


 84%|████████▎ | 10320/12348 [3:55:38<46:40,  1.38s/it]

{'loss': 0.6344, 'grad_norm': 11.940305709838867, 'learning_rate': 8.558406482106685e-06, 'epoch': 2.51}


 84%|████████▎ | 10330/12348 [3:55:52<46:27,  1.38s/it]

{'loss': 0.693, 'grad_norm': 15.409649848937988, 'learning_rate': 8.516205266711681e-06, 'epoch': 2.51}


 84%|████████▎ | 10340/12348 [3:56:05<46:18,  1.38s/it]

{'loss': 0.5835, 'grad_norm': 15.524317741394043, 'learning_rate': 8.474004051316678e-06, 'epoch': 2.51}


 84%|████████▍ | 10350/12348 [3:56:19<45:59,  1.38s/it]

{'loss': 0.3553, 'grad_norm': 3.453512191772461, 'learning_rate': 8.431802835921674e-06, 'epoch': 2.51}


 84%|████████▍ | 10360/12348 [3:56:33<45:49,  1.38s/it]

{'loss': 0.6356, 'grad_norm': 4.074822425842285, 'learning_rate': 8.389601620526672e-06, 'epoch': 2.52}


 84%|████████▍ | 10370/12348 [3:56:47<45:33,  1.38s/it]

{'loss': 0.9495, 'grad_norm': 8.792967796325684, 'learning_rate': 8.347400405131668e-06, 'epoch': 2.52}


 84%|████████▍ | 10380/12348 [3:57:01<45:27,  1.39s/it]

{'loss': 0.5608, 'grad_norm': 19.00848960876465, 'learning_rate': 8.305199189736665e-06, 'epoch': 2.52}


 84%|████████▍ | 10390/12348 [3:57:15<45:09,  1.38s/it]

{'loss': 0.3987, 'grad_norm': 2.646467447280884, 'learning_rate': 8.262997974341661e-06, 'epoch': 2.52}


 84%|████████▍ | 10400/12348 [3:57:28<44:50,  1.38s/it]

{'loss': 0.3418, 'grad_norm': 4.715478420257568, 'learning_rate': 8.220796758946659e-06, 'epoch': 2.53}


 84%|████████▍ | 10410/12348 [3:57:42<44:53,  1.39s/it]

{'loss': 0.6732, 'grad_norm': 10.196520805358887, 'learning_rate': 8.178595543551655e-06, 'epoch': 2.53}


 84%|████████▍ | 10420/12348 [3:57:56<44:32,  1.39s/it]

{'loss': 0.4819, 'grad_norm': 18.90713119506836, 'learning_rate': 8.136394328156651e-06, 'epoch': 2.53}


 84%|████████▍ | 10430/12348 [3:58:10<44:17,  1.39s/it]

{'loss': 0.4149, 'grad_norm': 37.39377975463867, 'learning_rate': 8.094193112761648e-06, 'epoch': 2.53}


 85%|████████▍ | 10440/12348 [3:58:24<44:00,  1.38s/it]

{'loss': 0.3083, 'grad_norm': 6.509977340698242, 'learning_rate': 8.051991897366644e-06, 'epoch': 2.54}


 85%|████████▍ | 10450/12348 [3:58:38<43:42,  1.38s/it]

{'loss': 0.6558, 'grad_norm': 36.3701057434082, 'learning_rate': 8.00979068197164e-06, 'epoch': 2.54}


 85%|████████▍ | 10460/12348 [3:58:52<43:36,  1.39s/it]

{'loss': 0.5897, 'grad_norm': 11.315788269042969, 'learning_rate': 7.967589466576637e-06, 'epoch': 2.54}


 85%|████████▍ | 10470/12348 [3:59:05<43:19,  1.38s/it]

{'loss': 0.6002, 'grad_norm': 13.06952953338623, 'learning_rate': 7.925388251181635e-06, 'epoch': 2.54}


 85%|████████▍ | 10480/12348 [3:59:19<43:02,  1.38s/it]

{'loss': 0.753, 'grad_norm': 14.290257453918457, 'learning_rate': 7.883187035786631e-06, 'epoch': 2.55}


 85%|████████▍ | 10490/12348 [3:59:33<42:53,  1.39s/it]

{'loss': 0.783, 'grad_norm': 11.5159273147583, 'learning_rate': 7.840985820391627e-06, 'epoch': 2.55}


 85%|████████▌ | 10500/12348 [3:59:47<42:38,  1.38s/it]

{'loss': 0.4571, 'grad_norm': 9.023869514465332, 'learning_rate': 7.798784604996623e-06, 'epoch': 2.55}


 85%|████████▌ | 10510/12348 [4:00:02<43:43,  1.43s/it]

{'loss': 0.4788, 'grad_norm': 20.419116973876953, 'learning_rate': 7.756583389601621e-06, 'epoch': 2.55}


 85%|████████▌ | 10520/12348 [4:00:16<42:39,  1.40s/it]

{'loss': 0.4158, 'grad_norm': 15.922649383544922, 'learning_rate': 7.714382174206618e-06, 'epoch': 2.56}


 85%|████████▌ | 10530/12348 [4:00:30<42:12,  1.39s/it]

{'loss': 0.5351, 'grad_norm': 11.7909574508667, 'learning_rate': 7.672180958811614e-06, 'epoch': 2.56}


 85%|████████▌ | 10540/12348 [4:00:44<41:42,  1.38s/it]

{'loss': 0.6293, 'grad_norm': 9.396920204162598, 'learning_rate': 7.629979743416612e-06, 'epoch': 2.56}


 85%|████████▌ | 10550/12348 [4:00:58<41:25,  1.38s/it]

{'loss': 0.4546, 'grad_norm': 13.739740371704102, 'learning_rate': 7.587778528021608e-06, 'epoch': 2.56}


 86%|████████▌ | 10560/12348 [4:01:12<41:05,  1.38s/it]

{'loss': 0.5636, 'grad_norm': 4.262838840484619, 'learning_rate': 7.5455773126266046e-06, 'epoch': 2.57}


 86%|████████▌ | 10570/12348 [4:01:25<40:54,  1.38s/it]

{'loss': 0.5729, 'grad_norm': 8.462865829467773, 'learning_rate': 7.5033760972316e-06, 'epoch': 2.57}


 86%|████████▌ | 10580/12348 [4:01:39<40:46,  1.38s/it]

{'loss': 0.8369, 'grad_norm': 20.933412551879883, 'learning_rate': 7.461174881836597e-06, 'epoch': 2.57}


 86%|████████▌ | 10590/12348 [4:01:53<40:18,  1.38s/it]

{'loss': 0.7579, 'grad_norm': 5.849923610687256, 'learning_rate': 7.4189736664415934e-06, 'epoch': 2.57}


 86%|████████▌ | 10600/12348 [4:02:07<40:16,  1.38s/it]

{'loss': 0.7377, 'grad_norm': 21.66123390197754, 'learning_rate': 7.3767724510465906e-06, 'epoch': 2.58}


 86%|████████▌ | 10610/12348 [4:02:21<40:04,  1.38s/it]

{'loss': 0.7684, 'grad_norm': 13.54538631439209, 'learning_rate': 7.334571235651587e-06, 'epoch': 2.58}


 86%|████████▌ | 10620/12348 [4:02:35<39:43,  1.38s/it]

{'loss': 0.4919, 'grad_norm': 12.553388595581055, 'learning_rate': 7.292370020256584e-06, 'epoch': 2.58}


 86%|████████▌ | 10630/12348 [4:02:48<39:41,  1.39s/it]

{'loss': 0.4825, 'grad_norm': 22.099506378173828, 'learning_rate': 7.25016880486158e-06, 'epoch': 2.58}


 86%|████████▌ | 10640/12348 [4:03:02<39:29,  1.39s/it]

{'loss': 0.6652, 'grad_norm': 20.546295166015625, 'learning_rate': 7.2079675894665774e-06, 'epoch': 2.59}


 86%|████████▌ | 10650/12348 [4:03:16<39:16,  1.39s/it]

{'loss': 0.6798, 'grad_norm': 16.671634674072266, 'learning_rate': 7.165766374071574e-06, 'epoch': 2.59}


 86%|████████▋ | 10660/12348 [4:03:30<38:59,  1.39s/it]

{'loss': 0.4096, 'grad_norm': 23.606359481811523, 'learning_rate': 7.123565158676571e-06, 'epoch': 2.59}


 86%|████████▋ | 10670/12348 [4:03:44<38:51,  1.39s/it]

{'loss': 0.5796, 'grad_norm': 13.73953628540039, 'learning_rate': 7.081363943281567e-06, 'epoch': 2.59}


 86%|████████▋ | 10680/12348 [4:03:58<38:24,  1.38s/it]

{'loss': 0.2368, 'grad_norm': 9.082436561584473, 'learning_rate': 7.039162727886564e-06, 'epoch': 2.59}


 87%|████████▋ | 10690/12348 [4:04:12<38:07,  1.38s/it]

{'loss': 0.4931, 'grad_norm': 30.942731857299805, 'learning_rate': 6.99696151249156e-06, 'epoch': 2.6}


 87%|████████▋ | 10700/12348 [4:04:26<38:07,  1.39s/it]

{'loss': 0.5618, 'grad_norm': 17.225406646728516, 'learning_rate': 6.954760297096556e-06, 'epoch': 2.6}


 87%|████████▋ | 10710/12348 [4:04:39<37:49,  1.39s/it]

{'loss': 0.4886, 'grad_norm': 18.453323364257812, 'learning_rate': 6.912559081701553e-06, 'epoch': 2.6}


 87%|████████▋ | 10720/12348 [4:04:53<37:30,  1.38s/it]

{'loss': 0.5324, 'grad_norm': 10.99764347076416, 'learning_rate': 6.8703578663065494e-06, 'epoch': 2.6}


 87%|████████▋ | 10730/12348 [4:05:07<37:15,  1.38s/it]

{'loss': 0.6292, 'grad_norm': 15.128966331481934, 'learning_rate': 6.8281566509115466e-06, 'epoch': 2.61}


 87%|████████▋ | 10740/12348 [4:05:21<37:07,  1.39s/it]

{'loss': 0.3735, 'grad_norm': 11.263696670532227, 'learning_rate': 6.785955435516543e-06, 'epoch': 2.61}


 87%|████████▋ | 10750/12348 [4:05:35<36:49,  1.38s/it]

{'loss': 0.6647, 'grad_norm': 5.555226802825928, 'learning_rate': 6.74375422012154e-06, 'epoch': 2.61}


 87%|████████▋ | 10760/12348 [4:05:49<36:43,  1.39s/it]

{'loss': 0.5081, 'grad_norm': 29.055801391601562, 'learning_rate': 6.701553004726536e-06, 'epoch': 2.61}


 87%|████████▋ | 10770/12348 [4:06:03<36:24,  1.38s/it]

{'loss': 0.6301, 'grad_norm': 9.520781517028809, 'learning_rate': 6.659351789331533e-06, 'epoch': 2.62}


 87%|████████▋ | 10780/12348 [4:06:16<36:08,  1.38s/it]

{'loss': 0.4191, 'grad_norm': 19.447710037231445, 'learning_rate': 6.61715057393653e-06, 'epoch': 2.62}


 87%|████████▋ | 10790/12348 [4:06:30<36:03,  1.39s/it]

{'loss': 0.5941, 'grad_norm': 2.230489492416382, 'learning_rate': 6.574949358541527e-06, 'epoch': 2.62}


 87%|████████▋ | 10800/12348 [4:06:44<35:43,  1.38s/it]

{'loss': 0.5136, 'grad_norm': 0.5400857329368591, 'learning_rate': 6.532748143146523e-06, 'epoch': 2.62}


 88%|████████▊ | 10810/12348 [4:06:58<35:31,  1.39s/it]

{'loss': 0.6939, 'grad_norm': 15.442452430725098, 'learning_rate': 6.49054692775152e-06, 'epoch': 2.63}


 88%|████████▊ | 10820/12348 [4:07:12<35:19,  1.39s/it]

{'loss': 0.3092, 'grad_norm': 17.579545974731445, 'learning_rate': 6.448345712356516e-06, 'epoch': 2.63}


 88%|████████▊ | 10830/12348 [4:07:26<34:57,  1.38s/it]

{'loss': 0.5727, 'grad_norm': 10.735791206359863, 'learning_rate': 6.406144496961512e-06, 'epoch': 2.63}


 88%|████████▊ | 10840/12348 [4:07:40<34:47,  1.38s/it]

{'loss': 0.4222, 'grad_norm': 2.4499831199645996, 'learning_rate': 6.363943281566509e-06, 'epoch': 2.63}


 88%|████████▊ | 10850/12348 [4:07:53<34:36,  1.39s/it]

{'loss': 0.398, 'grad_norm': 23.980384826660156, 'learning_rate': 6.321742066171505e-06, 'epoch': 2.64}


 88%|████████▊ | 10860/12348 [4:08:07<34:17,  1.38s/it]

{'loss': 0.573, 'grad_norm': 25.638887405395508, 'learning_rate': 6.2795408507765026e-06, 'epoch': 2.64}


 88%|████████▊ | 10870/12348 [4:08:21<34:02,  1.38s/it]

{'loss': 0.7906, 'grad_norm': 23.96787452697754, 'learning_rate': 6.237339635381499e-06, 'epoch': 2.64}


 88%|████████▊ | 10880/12348 [4:08:35<33:49,  1.38s/it]

{'loss': 0.6029, 'grad_norm': 20.351669311523438, 'learning_rate': 6.195138419986496e-06, 'epoch': 2.64}


 88%|████████▊ | 10890/12348 [4:08:49<33:45,  1.39s/it]

{'loss': 0.3162, 'grad_norm': 9.16977310180664, 'learning_rate': 6.152937204591492e-06, 'epoch': 2.65}


 88%|████████▊ | 10900/12348 [4:09:03<33:27,  1.39s/it]

{'loss': 0.3876, 'grad_norm': 15.123270034790039, 'learning_rate': 6.110735989196489e-06, 'epoch': 2.65}


 88%|████████▊ | 10910/12348 [4:09:16<33:16,  1.39s/it]

{'loss': 0.5106, 'grad_norm': 0.9096508026123047, 'learning_rate': 6.068534773801486e-06, 'epoch': 2.65}


 88%|████████▊ | 10920/12348 [4:09:30<32:57,  1.38s/it]

{'loss': 0.3834, 'grad_norm': 7.316196918487549, 'learning_rate': 6.026333558406482e-06, 'epoch': 2.65}


 89%|████████▊ | 10930/12348 [4:09:44<32:44,  1.39s/it]

{'loss': 0.442, 'grad_norm': 0.7595571279525757, 'learning_rate': 5.984132343011479e-06, 'epoch': 2.66}


 89%|████████▊ | 10940/12348 [4:09:58<32:31,  1.39s/it]

{'loss': 0.4432, 'grad_norm': 0.3301320970058441, 'learning_rate': 5.941931127616475e-06, 'epoch': 2.66}


 89%|████████▊ | 10950/12348 [4:10:12<32:13,  1.38s/it]

{'loss': 0.6378, 'grad_norm': 5.2556915283203125, 'learning_rate': 5.8997299122214725e-06, 'epoch': 2.66}


 89%|████████▉ | 10960/12348 [4:10:26<32:04,  1.39s/it]

{'loss': 0.4925, 'grad_norm': 13.956068992614746, 'learning_rate': 5.857528696826469e-06, 'epoch': 2.66}


 89%|████████▉ | 10970/12348 [4:10:40<31:45,  1.38s/it]

{'loss': 0.5466, 'grad_norm': 6.990555286407471, 'learning_rate': 5.815327481431466e-06, 'epoch': 2.67}


 89%|████████▉ | 10980/12348 [4:10:54<31:35,  1.39s/it]

{'loss': 0.5807, 'grad_norm': 22.01103973388672, 'learning_rate': 5.773126266036462e-06, 'epoch': 2.67}


 89%|████████▉ | 10990/12348 [4:11:07<31:09,  1.38s/it]

{'loss': 0.4401, 'grad_norm': 19.604093551635742, 'learning_rate': 5.7309250506414586e-06, 'epoch': 2.67}


 89%|████████▉ | 11000/12348 [4:11:21<31:05,  1.38s/it]

{'loss': 0.7281, 'grad_norm': 5.190398693084717, 'learning_rate': 5.688723835246456e-06, 'epoch': 2.67}


 89%|████████▉ | 11010/12348 [4:11:36<31:40,  1.42s/it]

{'loss': 0.4799, 'grad_norm': 34.31359100341797, 'learning_rate': 5.646522619851452e-06, 'epoch': 2.67}


 89%|████████▉ | 11020/12348 [4:11:51<30:55,  1.40s/it]

{'loss': 0.563, 'grad_norm': 37.496986389160156, 'learning_rate': 5.604321404456449e-06, 'epoch': 2.68}


 89%|████████▉ | 11030/12348 [4:12:04<30:40,  1.40s/it]

{'loss': 0.3847, 'grad_norm': 28.703369140625, 'learning_rate': 5.562120189061445e-06, 'epoch': 2.68}


 89%|████████▉ | 11040/12348 [4:12:18<30:19,  1.39s/it]

{'loss': 0.358, 'grad_norm': 10.951852798461914, 'learning_rate': 5.519918973666442e-06, 'epoch': 2.68}


 89%|████████▉ | 11050/12348 [4:12:32<29:54,  1.38s/it]

{'loss': 0.6908, 'grad_norm': 12.745321273803711, 'learning_rate': 5.477717758271438e-06, 'epoch': 2.68}


 90%|████████▉ | 11060/12348 [4:12:46<29:46,  1.39s/it]

{'loss': 0.5765, 'grad_norm': 24.228391647338867, 'learning_rate': 5.435516542876435e-06, 'epoch': 2.69}


 90%|████████▉ | 11070/12348 [4:13:00<29:22,  1.38s/it]

{'loss': 0.4497, 'grad_norm': 15.04577922821045, 'learning_rate': 5.393315327481431e-06, 'epoch': 2.69}


 90%|████████▉ | 11080/12348 [4:13:14<29:10,  1.38s/it]

{'loss': 0.775, 'grad_norm': 32.07213592529297, 'learning_rate': 5.3511141120864285e-06, 'epoch': 2.69}


 90%|████████▉ | 11090/12348 [4:13:28<29:05,  1.39s/it]

{'loss': 0.6822, 'grad_norm': 18.1137752532959, 'learning_rate': 5.308912896691425e-06, 'epoch': 2.69}


 90%|████████▉ | 11100/12348 [4:13:42<28:44,  1.38s/it]

{'loss': 0.3098, 'grad_norm': 5.455410957336426, 'learning_rate': 5.266711681296422e-06, 'epoch': 2.7}


 90%|████████▉ | 11110/12348 [4:13:55<28:31,  1.38s/it]

{'loss': 0.8055, 'grad_norm': 30.25009536743164, 'learning_rate': 5.224510465901418e-06, 'epoch': 2.7}


 90%|█████████ | 11120/12348 [4:14:09<28:13,  1.38s/it]

{'loss': 0.4901, 'grad_norm': 11.210193634033203, 'learning_rate': 5.1823092505064145e-06, 'epoch': 2.7}


 90%|█████████ | 11130/12348 [4:14:23<27:58,  1.38s/it]

{'loss': 0.6423, 'grad_norm': 4.031783580780029, 'learning_rate': 5.140108035111412e-06, 'epoch': 2.7}


 90%|█████████ | 11140/12348 [4:14:37<27:48,  1.38s/it]

{'loss': 0.3678, 'grad_norm': 8.711564064025879, 'learning_rate': 5.097906819716408e-06, 'epoch': 2.71}


 90%|█████████ | 11150/12348 [4:14:51<27:30,  1.38s/it]

{'loss': 0.4434, 'grad_norm': 0.4257371723651886, 'learning_rate': 5.055705604321405e-06, 'epoch': 2.71}


 90%|█████████ | 11160/12348 [4:15:04<27:23,  1.38s/it]

{'loss': 0.3074, 'grad_norm': 10.242094993591309, 'learning_rate': 5.013504388926401e-06, 'epoch': 2.71}


 90%|█████████ | 11170/12348 [4:15:18<27:11,  1.38s/it]

{'loss': 0.4467, 'grad_norm': 14.318483352661133, 'learning_rate': 4.971303173531398e-06, 'epoch': 2.71}


 91%|█████████ | 11180/12348 [4:15:32<26:58,  1.39s/it]

{'loss': 0.4494, 'grad_norm': 17.54971694946289, 'learning_rate': 4.929101958136394e-06, 'epoch': 2.72}


 91%|█████████ | 11190/12348 [4:15:46<26:43,  1.38s/it]

{'loss': 0.7529, 'grad_norm': 12.737356185913086, 'learning_rate': 4.886900742741391e-06, 'epoch': 2.72}


 91%|█████████ | 11200/12348 [4:16:00<26:28,  1.38s/it]

{'loss': 0.3602, 'grad_norm': 27.843769073486328, 'learning_rate': 4.844699527346388e-06, 'epoch': 2.72}


 91%|█████████ | 11210/12348 [4:16:14<26:11,  1.38s/it]

{'loss': 0.6971, 'grad_norm': 11.330381393432617, 'learning_rate': 4.8024983119513845e-06, 'epoch': 2.72}


 91%|█████████ | 11220/12348 [4:16:28<26:00,  1.38s/it]

{'loss': 0.5655, 'grad_norm': 8.3824462890625, 'learning_rate': 4.760297096556382e-06, 'epoch': 2.73}


 91%|█████████ | 11230/12348 [4:16:41<25:43,  1.38s/it]

{'loss': 0.744, 'grad_norm': 1.1634941101074219, 'learning_rate': 4.718095881161377e-06, 'epoch': 2.73}


 91%|█████████ | 11240/12348 [4:16:55<25:32,  1.38s/it]

{'loss': 0.3394, 'grad_norm': 4.5779290199279785, 'learning_rate': 4.675894665766374e-06, 'epoch': 2.73}


 91%|█████████ | 11250/12348 [4:17:09<25:15,  1.38s/it]

{'loss': 0.4178, 'grad_norm': 12.848552703857422, 'learning_rate': 4.6336934503713705e-06, 'epoch': 2.73}


 91%|█████████ | 11260/12348 [4:17:23<25:05,  1.38s/it]

{'loss': 0.602, 'grad_norm': 35.87683868408203, 'learning_rate': 4.591492234976368e-06, 'epoch': 2.74}


 91%|█████████▏| 11270/12348 [4:17:37<24:49,  1.38s/it]

{'loss': 0.5391, 'grad_norm': 29.475175857543945, 'learning_rate': 4.549291019581364e-06, 'epoch': 2.74}


 91%|█████████▏| 11280/12348 [4:17:51<24:35,  1.38s/it]

{'loss': 0.5589, 'grad_norm': 16.002567291259766, 'learning_rate': 4.507089804186361e-06, 'epoch': 2.74}


 91%|█████████▏| 11290/12348 [4:18:04<24:21,  1.38s/it]

{'loss': 0.5768, 'grad_norm': 54.678836822509766, 'learning_rate': 4.464888588791357e-06, 'epoch': 2.74}


 92%|█████████▏| 11300/12348 [4:18:18<24:08,  1.38s/it]

{'loss': 0.5272, 'grad_norm': 5.713119983673096, 'learning_rate': 4.422687373396354e-06, 'epoch': 2.75}


 92%|█████████▏| 11310/12348 [4:18:32<23:55,  1.38s/it]

{'loss': 0.4554, 'grad_norm': 39.88937759399414, 'learning_rate': 4.380486158001351e-06, 'epoch': 2.75}


 92%|█████████▏| 11320/12348 [4:18:46<23:44,  1.39s/it]

{'loss': 0.2452, 'grad_norm': 22.464073181152344, 'learning_rate': 4.338284942606347e-06, 'epoch': 2.75}


 92%|█████████▏| 11330/12348 [4:19:00<23:27,  1.38s/it]

{'loss': 0.4838, 'grad_norm': 10.00041389465332, 'learning_rate': 4.296083727211344e-06, 'epoch': 2.75}


 92%|█████████▏| 11340/12348 [4:19:14<23:11,  1.38s/it]

{'loss': 0.4516, 'grad_norm': 18.791217803955078, 'learning_rate': 4.2538825118163405e-06, 'epoch': 2.76}


 92%|█████████▏| 11350/12348 [4:19:28<22:59,  1.38s/it]

{'loss': 0.9916, 'grad_norm': 13.359635353088379, 'learning_rate': 4.211681296421338e-06, 'epoch': 2.76}


 92%|█████████▏| 11360/12348 [4:19:41<22:46,  1.38s/it]

{'loss': 0.5908, 'grad_norm': 7.772082805633545, 'learning_rate': 4.169480081026333e-06, 'epoch': 2.76}


 92%|█████████▏| 11370/12348 [4:19:55<22:30,  1.38s/it]

{'loss': 0.5013, 'grad_norm': 7.996153354644775, 'learning_rate': 4.12727886563133e-06, 'epoch': 2.76}


 92%|█████████▏| 11380/12348 [4:20:09<22:17,  1.38s/it]

{'loss': 0.4096, 'grad_norm': 19.246353149414062, 'learning_rate': 4.0850776502363265e-06, 'epoch': 2.76}


 92%|█████████▏| 11390/12348 [4:20:23<22:07,  1.39s/it]

{'loss': 0.4505, 'grad_norm': 19.587121963500977, 'learning_rate': 4.042876434841324e-06, 'epoch': 2.77}


 92%|█████████▏| 11400/12348 [4:20:37<21:49,  1.38s/it]

{'loss': 0.4473, 'grad_norm': 5.946681499481201, 'learning_rate': 4.000675219446321e-06, 'epoch': 2.77}


 92%|█████████▏| 11410/12348 [4:20:51<21:39,  1.39s/it]

{'loss': 0.3572, 'grad_norm': 33.2835578918457, 'learning_rate': 3.958474004051317e-06, 'epoch': 2.77}


 92%|█████████▏| 11420/12348 [4:21:04<21:21,  1.38s/it]

{'loss': 0.7179, 'grad_norm': 3.8515141010284424, 'learning_rate': 3.916272788656313e-06, 'epoch': 2.77}


 93%|█████████▎| 11430/12348 [4:21:18<21:08,  1.38s/it]

{'loss': 0.2963, 'grad_norm': 3.0166895389556885, 'learning_rate': 3.87407157326131e-06, 'epoch': 2.78}


 93%|█████████▎| 11440/12348 [4:21:32<20:54,  1.38s/it]

{'loss': 0.5426, 'grad_norm': 34.77705764770508, 'learning_rate': 3.831870357866307e-06, 'epoch': 2.78}


 93%|█████████▎| 11450/12348 [4:21:46<20:42,  1.38s/it]

{'loss': 0.5615, 'grad_norm': 26.542593002319336, 'learning_rate': 3.7896691424713035e-06, 'epoch': 2.78}


 93%|█████████▎| 11460/12348 [4:22:00<20:26,  1.38s/it]

{'loss': 0.4983, 'grad_norm': 2.770261287689209, 'learning_rate': 3.7474679270763002e-06, 'epoch': 2.78}


 93%|█████████▎| 11470/12348 [4:22:14<20:13,  1.38s/it]

{'loss': 0.5032, 'grad_norm': 1.000687837600708, 'learning_rate': 3.705266711681297e-06, 'epoch': 2.79}


 93%|█████████▎| 11480/12348 [4:22:28<20:00,  1.38s/it]

{'loss': 0.7866, 'grad_norm': 24.432100296020508, 'learning_rate': 3.6630654962862937e-06, 'epoch': 2.79}


 93%|█████████▎| 11490/12348 [4:22:41<19:47,  1.38s/it]

{'loss': 0.5498, 'grad_norm': 4.382292747497559, 'learning_rate': 3.6208642808912895e-06, 'epoch': 2.79}


 93%|█████████▎| 11500/12348 [4:22:55<19:31,  1.38s/it]

{'loss': 0.2476, 'grad_norm': 7.246513366699219, 'learning_rate': 3.5786630654962862e-06, 'epoch': 2.79}


 93%|█████████▎| 11510/12348 [4:23:10<19:39,  1.41s/it]

{'loss': 0.5148, 'grad_norm': 20.506582260131836, 'learning_rate': 3.536461850101283e-06, 'epoch': 2.8}


 93%|█████████▎| 11520/12348 [4:23:24<19:08,  1.39s/it]

{'loss': 0.6789, 'grad_norm': 26.849475860595703, 'learning_rate': 3.4942606347062797e-06, 'epoch': 2.8}


 93%|█████████▎| 11530/12348 [4:23:38<18:51,  1.38s/it]

{'loss': 0.7646, 'grad_norm': 2.0715253353118896, 'learning_rate': 3.4520594193112764e-06, 'epoch': 2.8}


 93%|█████████▎| 11540/12348 [4:23:52<18:38,  1.38s/it]

{'loss': 0.6712, 'grad_norm': 54.780433654785156, 'learning_rate': 3.409858203916273e-06, 'epoch': 2.8}


 94%|█████████▎| 11550/12348 [4:24:06<18:22,  1.38s/it]

{'loss': 0.3696, 'grad_norm': 3.0226597785949707, 'learning_rate': 3.3676569885212694e-06, 'epoch': 2.81}


 94%|█████████▎| 11560/12348 [4:24:20<18:13,  1.39s/it]

{'loss': 0.6652, 'grad_norm': 18.596420288085938, 'learning_rate': 3.325455773126266e-06, 'epoch': 2.81}


 94%|█████████▎| 11570/12348 [4:24:34<17:55,  1.38s/it]

{'loss': 0.6723, 'grad_norm': 8.172816276550293, 'learning_rate': 3.283254557731263e-06, 'epoch': 2.81}


 94%|█████████▍| 11580/12348 [4:24:47<17:43,  1.38s/it]

{'loss': 0.7453, 'grad_norm': 1.0472952127456665, 'learning_rate': 3.2410533423362595e-06, 'epoch': 2.81}


 94%|█████████▍| 11590/12348 [4:25:01<17:25,  1.38s/it]

{'loss': 0.5902, 'grad_norm': 28.9901065826416, 'learning_rate': 3.1988521269412562e-06, 'epoch': 2.82}


 94%|█████████▍| 11600/12348 [4:25:15<17:13,  1.38s/it]

{'loss': 0.1775, 'grad_norm': 0.42056137323379517, 'learning_rate': 3.156650911546253e-06, 'epoch': 2.82}


 94%|█████████▍| 11610/12348 [4:25:29<16:59,  1.38s/it]

{'loss': 0.426, 'grad_norm': 9.688702583312988, 'learning_rate': 3.1144496961512492e-06, 'epoch': 2.82}


 94%|█████████▍| 11620/12348 [4:25:43<16:44,  1.38s/it]

{'loss': 0.7662, 'grad_norm': 18.211397171020508, 'learning_rate': 3.072248480756246e-06, 'epoch': 2.82}


 94%|█████████▍| 11630/12348 [4:25:57<16:30,  1.38s/it]

{'loss': 0.5158, 'grad_norm': 15.495262145996094, 'learning_rate': 3.0300472653612426e-06, 'epoch': 2.83}


 94%|█████████▍| 11640/12348 [4:26:10<16:18,  1.38s/it]

{'loss': 0.555, 'grad_norm': 0.34472358226776123, 'learning_rate': 2.9878460499662394e-06, 'epoch': 2.83}


 94%|█████████▍| 11650/12348 [4:26:24<16:03,  1.38s/it]

{'loss': 0.8258, 'grad_norm': 19.32613754272461, 'learning_rate': 2.945644834571236e-06, 'epoch': 2.83}


 94%|█████████▍| 11660/12348 [4:26:38<15:53,  1.39s/it]

{'loss': 0.3691, 'grad_norm': 21.272109985351562, 'learning_rate': 2.9034436191762324e-06, 'epoch': 2.83}


 95%|█████████▍| 11670/12348 [4:26:52<15:39,  1.39s/it]

{'loss': 0.5418, 'grad_norm': 16.761011123657227, 'learning_rate': 2.861242403781229e-06, 'epoch': 2.84}


 95%|█████████▍| 11680/12348 [4:27:06<15:24,  1.38s/it]

{'loss': 0.3952, 'grad_norm': 6.955835342407227, 'learning_rate': 2.8190411883862258e-06, 'epoch': 2.84}


 95%|█████████▍| 11690/12348 [4:27:20<15:09,  1.38s/it]

{'loss': 0.5493, 'grad_norm': 5.034339427947998, 'learning_rate': 2.776839972991222e-06, 'epoch': 2.84}


 95%|█████████▍| 11700/12348 [4:27:34<14:57,  1.38s/it]

{'loss': 0.6685, 'grad_norm': 23.36344337463379, 'learning_rate': 2.7346387575962188e-06, 'epoch': 2.84}


 95%|█████████▍| 11710/12348 [4:27:47<14:41,  1.38s/it]

{'loss': 0.5874, 'grad_norm': 6.3909711837768555, 'learning_rate': 2.6924375422012155e-06, 'epoch': 2.84}


 95%|█████████▍| 11720/12348 [4:28:01<14:26,  1.38s/it]

{'loss': 0.6136, 'grad_norm': 20.120319366455078, 'learning_rate': 2.650236326806212e-06, 'epoch': 2.85}


 95%|█████████▍| 11730/12348 [4:28:15<14:10,  1.38s/it]

{'loss': 0.5086, 'grad_norm': 22.648422241210938, 'learning_rate': 2.608035111411209e-06, 'epoch': 2.85}


 95%|█████████▌| 11740/12348 [4:28:29<13:58,  1.38s/it]

{'loss': 0.5409, 'grad_norm': 30.918405532836914, 'learning_rate': 2.5658338960162056e-06, 'epoch': 2.85}


 95%|█████████▌| 11750/12348 [4:28:43<13:47,  1.38s/it]

{'loss': 0.793, 'grad_norm': 22.553407669067383, 'learning_rate': 2.523632680621202e-06, 'epoch': 2.85}


 95%|█████████▌| 11760/12348 [4:28:57<13:33,  1.38s/it]

{'loss': 0.4459, 'grad_norm': 7.834748268127441, 'learning_rate': 2.4814314652261986e-06, 'epoch': 2.86}


 95%|█████████▌| 11770/12348 [4:29:10<13:17,  1.38s/it]

{'loss': 0.6529, 'grad_norm': 1.952571153640747, 'learning_rate': 2.4392302498311954e-06, 'epoch': 2.86}


 95%|█████████▌| 11780/12348 [4:29:24<13:06,  1.38s/it]

{'loss': 0.4589, 'grad_norm': 0.26523730158805847, 'learning_rate': 2.397029034436192e-06, 'epoch': 2.86}


 95%|█████████▌| 11790/12348 [4:29:38<12:49,  1.38s/it]

{'loss': 0.2671, 'grad_norm': 21.188554763793945, 'learning_rate': 2.3548278190411884e-06, 'epoch': 2.86}


 96%|█████████▌| 11800/12348 [4:29:52<12:38,  1.38s/it]

{'loss': 0.5275, 'grad_norm': 18.291181564331055, 'learning_rate': 2.312626603646185e-06, 'epoch': 2.87}


 96%|█████████▌| 11810/12348 [4:30:06<12:22,  1.38s/it]

{'loss': 0.4164, 'grad_norm': 10.172674179077148, 'learning_rate': 2.2704253882511818e-06, 'epoch': 2.87}


 96%|█████████▌| 11820/12348 [4:30:20<12:12,  1.39s/it]

{'loss': 0.4992, 'grad_norm': 26.3275146484375, 'learning_rate': 2.228224172856178e-06, 'epoch': 2.87}


 96%|█████████▌| 11830/12348 [4:30:34<11:57,  1.39s/it]

{'loss': 0.6935, 'grad_norm': 16.821924209594727, 'learning_rate': 2.186022957461175e-06, 'epoch': 2.87}


 96%|█████████▌| 11840/12348 [4:30:47<11:43,  1.38s/it]

{'loss': 0.5712, 'grad_norm': 25.64488410949707, 'learning_rate': 2.143821742066172e-06, 'epoch': 2.88}


 96%|█████████▌| 11850/12348 [4:31:01<11:29,  1.38s/it]

{'loss': 0.4053, 'grad_norm': 14.392219543457031, 'learning_rate': 2.101620526671168e-06, 'epoch': 2.88}


 96%|█████████▌| 11860/12348 [4:31:15<11:14,  1.38s/it]

{'loss': 0.6382, 'grad_norm': 3.2123031616210938, 'learning_rate': 2.059419311276165e-06, 'epoch': 2.88}


 96%|█████████▌| 11870/12348 [4:31:29<11:00,  1.38s/it]

{'loss': 0.3805, 'grad_norm': 7.043694019317627, 'learning_rate': 2.0172180958811616e-06, 'epoch': 2.88}


 96%|█████████▌| 11880/12348 [4:31:43<10:47,  1.38s/it]

{'loss': 0.4558, 'grad_norm': 15.233898162841797, 'learning_rate': 1.975016880486158e-06, 'epoch': 2.89}


 96%|█████████▋| 11890/12348 [4:31:57<10:36,  1.39s/it]

{'loss': 0.7062, 'grad_norm': 22.974018096923828, 'learning_rate': 1.9328156650911546e-06, 'epoch': 2.89}


 96%|█████████▋| 11900/12348 [4:32:11<10:19,  1.38s/it]

{'loss': 0.7475, 'grad_norm': 15.227452278137207, 'learning_rate': 1.8906144496961513e-06, 'epoch': 2.89}


 96%|█████████▋| 11910/12348 [4:32:24<10:05,  1.38s/it]

{'loss': 0.458, 'grad_norm': 10.784180641174316, 'learning_rate': 1.8484132343011478e-06, 'epoch': 2.89}


 97%|█████████▋| 11920/12348 [4:32:38<09:52,  1.39s/it]

{'loss': 0.5659, 'grad_norm': 14.324983596801758, 'learning_rate': 1.8062120189061446e-06, 'epoch': 2.9}


 97%|█████████▋| 11930/12348 [4:32:52<09:37,  1.38s/it]

{'loss': 0.3366, 'grad_norm': 17.29718017578125, 'learning_rate': 1.7640108035111413e-06, 'epoch': 2.9}


 97%|█████████▋| 11940/12348 [4:33:06<09:25,  1.39s/it]

{'loss': 0.546, 'grad_norm': 6.741286754608154, 'learning_rate': 1.7218095881161376e-06, 'epoch': 2.9}


 97%|█████████▋| 11950/12348 [4:33:20<09:10,  1.38s/it]

{'loss': 0.5898, 'grad_norm': 8.203944206237793, 'learning_rate': 1.6796083727211345e-06, 'epoch': 2.9}


 97%|█████████▋| 11960/12348 [4:33:34<08:58,  1.39s/it]

{'loss': 0.3951, 'grad_norm': 6.2999467849731445, 'learning_rate': 1.6374071573261312e-06, 'epoch': 2.91}


 97%|█████████▋| 11970/12348 [4:33:48<08:42,  1.38s/it]

{'loss': 0.673, 'grad_norm': 15.875116348266602, 'learning_rate': 1.595205941931128e-06, 'epoch': 2.91}


 97%|█████████▋| 11980/12348 [4:34:01<08:27,  1.38s/it]

{'loss': 0.4395, 'grad_norm': 12.31072998046875, 'learning_rate': 1.5530047265361242e-06, 'epoch': 2.91}


 97%|█████████▋| 11990/12348 [4:34:15<08:16,  1.39s/it]

{'loss': 0.3497, 'grad_norm': 10.528377532958984, 'learning_rate': 1.510803511141121e-06, 'epoch': 2.91}


 97%|█████████▋| 12000/12348 [4:34:29<08:00,  1.38s/it]

{'loss': 0.4671, 'grad_norm': 6.256296157836914, 'learning_rate': 1.4686022957461176e-06, 'epoch': 2.92}


 97%|█████████▋| 12010/12348 [4:34:44<07:49,  1.39s/it]

{'loss': 0.3544, 'grad_norm': 6.563342571258545, 'learning_rate': 1.4264010803511143e-06, 'epoch': 2.92}


 97%|█████████▋| 12020/12348 [4:34:58<07:25,  1.36s/it]

{'loss': 0.6646, 'grad_norm': 24.737171173095703, 'learning_rate': 1.3841998649561108e-06, 'epoch': 2.92}


 97%|█████████▋| 12030/12348 [4:35:11<07:13,  1.36s/it]

{'loss': 0.7915, 'grad_norm': 44.66778564453125, 'learning_rate': 1.3419986495611073e-06, 'epoch': 2.92}


 98%|█████████▊| 12040/12348 [4:35:25<06:58,  1.36s/it]

{'loss': 0.4814, 'grad_norm': 15.131022453308105, 'learning_rate': 1.299797434166104e-06, 'epoch': 2.93}


 98%|█████████▊| 12050/12348 [4:35:39<06:44,  1.36s/it]

{'loss': 0.4584, 'grad_norm': 3.1876680850982666, 'learning_rate': 1.2575962187711008e-06, 'epoch': 2.93}


 98%|█████████▊| 12060/12348 [4:35:52<06:31,  1.36s/it]

{'loss': 0.9367, 'grad_norm': 25.29318618774414, 'learning_rate': 1.2153950033760973e-06, 'epoch': 2.93}


 98%|█████████▊| 12070/12348 [4:36:06<06:17,  1.36s/it]

{'loss': 0.6599, 'grad_norm': 22.316978454589844, 'learning_rate': 1.173193787981094e-06, 'epoch': 2.93}


 98%|█████████▊| 12080/12348 [4:36:19<06:03,  1.36s/it]

{'loss': 0.439, 'grad_norm': 8.770845413208008, 'learning_rate': 1.1309925725860905e-06, 'epoch': 2.93}


 98%|█████████▊| 12090/12348 [4:36:33<05:50,  1.36s/it]

{'loss': 0.6455, 'grad_norm': 11.852821350097656, 'learning_rate': 1.0887913571910872e-06, 'epoch': 2.94}


 98%|█████████▊| 12100/12348 [4:36:47<05:36,  1.36s/it]

{'loss': 0.3626, 'grad_norm': 2.551205635070801, 'learning_rate': 1.046590141796084e-06, 'epoch': 2.94}


 98%|█████████▊| 12110/12348 [4:37:00<05:22,  1.35s/it]

{'loss': 0.5373, 'grad_norm': 7.199706554412842, 'learning_rate': 1.0043889264010804e-06, 'epoch': 2.94}


 98%|█████████▊| 12120/12348 [4:37:14<05:09,  1.36s/it]

{'loss': 0.7007, 'grad_norm': 7.313873291015625, 'learning_rate': 9.621877110060771e-07, 'epoch': 2.94}


 98%|█████████▊| 12130/12348 [4:37:27<04:54,  1.35s/it]

{'loss': 0.5961, 'grad_norm': 19.13006019592285, 'learning_rate': 9.199864956110736e-07, 'epoch': 2.95}


 98%|█████████▊| 12140/12348 [4:37:41<04:41,  1.35s/it]

{'loss': 0.5344, 'grad_norm': 25.774473190307617, 'learning_rate': 8.777852802160702e-07, 'epoch': 2.95}


 98%|█████████▊| 12150/12348 [4:37:54<04:28,  1.35s/it]

{'loss': 0.5762, 'grad_norm': 8.113850593566895, 'learning_rate': 8.355840648210669e-07, 'epoch': 2.95}


 98%|█████████▊| 12160/12348 [4:38:08<04:15,  1.36s/it]

{'loss': 0.5959, 'grad_norm': 4.607651710510254, 'learning_rate': 7.933828494260635e-07, 'epoch': 2.95}


 99%|█████████▊| 12170/12348 [4:38:22<04:01,  1.36s/it]

{'loss': 0.4075, 'grad_norm': 1.293554663658142, 'learning_rate': 7.511816340310601e-07, 'epoch': 2.96}


 99%|█████████▊| 12180/12348 [4:38:35<03:47,  1.35s/it]

{'loss': 0.4794, 'grad_norm': 10.588540077209473, 'learning_rate': 7.089804186360568e-07, 'epoch': 2.96}


 99%|█████████▊| 12190/12348 [4:38:49<03:34,  1.36s/it]

{'loss': 0.6631, 'grad_norm': 6.166029453277588, 'learning_rate': 6.667792032410534e-07, 'epoch': 2.96}


 99%|█████████▉| 12200/12348 [4:39:02<03:21,  1.36s/it]

{'loss': 0.5162, 'grad_norm': 12.77153491973877, 'learning_rate': 6.2457798784605e-07, 'epoch': 2.96}


 99%|█████████▉| 12210/12348 [4:39:16<03:06,  1.35s/it]

{'loss': 0.5618, 'grad_norm': 19.490022659301758, 'learning_rate': 5.823767724510467e-07, 'epoch': 2.97}


 99%|█████████▉| 12220/12348 [4:39:29<02:54,  1.36s/it]

{'loss': 0.5548, 'grad_norm': 18.197908401489258, 'learning_rate': 5.401755570560433e-07, 'epoch': 2.97}


 99%|█████████▉| 12230/12348 [4:39:43<02:40,  1.36s/it]

{'loss': 0.6327, 'grad_norm': 24.51283073425293, 'learning_rate': 4.979743416610398e-07, 'epoch': 2.97}


 99%|█████████▉| 12240/12348 [4:39:57<02:27,  1.36s/it]

{'loss': 0.5266, 'grad_norm': 9.121406555175781, 'learning_rate': 4.557731262660365e-07, 'epoch': 2.97}


 99%|█████████▉| 12250/12348 [4:40:10<02:13,  1.36s/it]

{'loss': 0.5406, 'grad_norm': 9.526514053344727, 'learning_rate': 4.135719108710331e-07, 'epoch': 2.98}


 99%|█████████▉| 12260/12348 [4:40:24<02:00,  1.37s/it]

{'loss': 0.5598, 'grad_norm': 21.241418838500977, 'learning_rate': 3.713706954760297e-07, 'epoch': 2.98}


 99%|█████████▉| 12270/12348 [4:40:38<01:46,  1.36s/it]

{'loss': 0.293, 'grad_norm': 3.032750129699707, 'learning_rate': 3.2916948008102637e-07, 'epoch': 2.98}


 99%|█████████▉| 12280/12348 [4:40:51<01:32,  1.36s/it]

{'loss': 0.4236, 'grad_norm': 13.57091999053955, 'learning_rate': 2.86968264686023e-07, 'epoch': 2.98}


100%|█████████▉| 12290/12348 [4:41:05<01:18,  1.35s/it]

{'loss': 0.7686, 'grad_norm': 21.221397399902344, 'learning_rate': 2.447670492910196e-07, 'epoch': 2.99}


100%|█████████▉| 12300/12348 [4:41:18<01:05,  1.36s/it]

{'loss': 0.7344, 'grad_norm': 17.65997886657715, 'learning_rate': 2.0256583389601622e-07, 'epoch': 2.99}


100%|█████████▉| 12310/12348 [4:41:32<00:51,  1.35s/it]

{'loss': 0.3868, 'grad_norm': 10.57867431640625, 'learning_rate': 1.6036461850101285e-07, 'epoch': 2.99}


100%|█████████▉| 12320/12348 [4:41:46<00:37,  1.35s/it]

{'loss': 0.3667, 'grad_norm': 29.943178176879883, 'learning_rate': 1.1816340310600946e-07, 'epoch': 2.99}


100%|█████████▉| 12330/12348 [4:41:59<00:24,  1.35s/it]

{'loss': 0.3069, 'grad_norm': 25.53411293029785, 'learning_rate': 7.596218771100608e-08, 'epoch': 3.0}


100%|█████████▉| 12340/12348 [4:42:13<00:10,  1.35s/it]

{'loss': 0.4496, 'grad_norm': 20.017152786254883, 'learning_rate': 3.3760972316002705e-08, 'epoch': 3.0}


100%|██████████| 12348/12348 [4:42:24<00:00,  1.37s/it]

{'train_runtime': 16944.5661, 'train_samples_per_second': 5.829, 'train_steps_per_second': 0.729, 'train_loss': 1.019124189800223, 'epoch': 3.0}





TrainOutput(global_step=12348, training_loss=1.019124189800223, metrics={'train_runtime': 16944.5661, 'train_samples_per_second': 5.829, 'train_steps_per_second': 0.729, 'total_flos': 6520801542838272.0, 'train_loss': 1.019124189800223, 'epoch': 3.0})

In [30]:
print(df1['target'].unique())

[310 308 130 357 259  46 324 378 373 120 315 227 218 170  28  19 250 168
 193 178   7 206 189   2 201 187 320 301  16 309 303  42  27   0 311 205
 147 118 264 287 409  44 300  50  45  69  26 323 226  75 249 175  94 317
 124 261   8  35 167 331   4 166 307 314 405 182 337 258 121  30 313 165
 156 325 327 328 102 101 145 296 140 136 134 255 388 348 265  65  52 159
 199   3 132  73 316 109 190 407 339 312  38 160 349 153 347 247 230 383
 372 254 392 285 232  78 216 267 333 141 246 277 391  58  63  62  72 257
 238 217 215  51  92 256 148  49 176 294 403 231 131  59  14 293  74 172
  37 229  98 214 185 204 297  31 384 400  81  82  91 371 343 406 399 390
 106 370 183 224 149 220 129 268 356 335 228 123  87 154 352  29 338 251
  66  85 144  86 212 395 353 169   5 234 289 126 380 171 326 162 355  68
 100  48 194  80 151  61  11 252 202 334 103  22 336 385 125 273 128 389
  90  60 318 393 329 196 288 158 117 236 398 110 319 115 394  57 188 275
 404 163 295 290 143 402 241 260  24 263 346 198 27

Replaced -1 with 409

In [26]:
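# pd.Categorical(...).codes uses -1 for missing values; give them their own class id, 409.
# df1 was created as a column slice of df, so pandas emits SettingWithCopyWarning here;
# building it with df[['text', 'target', 'split']].copy() avoids the warning.
df1['target'] = df1['target'].replace(-1, 409)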

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['target'] = df1['target'].replace(-1, 409)


In [33]:
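# Bug: the mask is built from df (original string labels), not df1 (integer codes),
# so no row matches and the result below is empty. The next cell uses df1 instead.
df1[df['target'] == 409]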

Unnamed: 0,text,target,split


In [34]:
invalid_rows = df1[df1['target'] == 409]
print(invalid_rows)

                                                    text  target  split
2220   I swear I will piss on this gross pikey until ...     409  train
2741   Really? There are a lot of gypsy travellers th...     409  train
2971   alot ogf people make fun of my personality in ...     409  train
4175   I have worse teeth than a camel humper, my mum...     409  train
9093    Please stop inviting this squaw at the workplace     409  train
9185   A girl I'm working with hails from Egypt and t...     409  train
11812                fucking hell I've just seen a pikey     409    dev
23651  The purpose of the Aryan Hall is not just to h...     409  train


In [18]:
model.save_pretrained('./results')

In [19]:
tokenizer.save_pretrained('./results')

('./results\\tokenizer_config.json',
 './results\\special_tokens_map.json',
 './results\\vocab.txt',
 './results\\added_tokens.json')

In [20]:
from transformers import BertForSequenceClassification, BertTokenizer

model = BertForSequenceClassification.from_pretrained('./results')
tokenizer = BertTokenizer.from_pretrained('./results')

In [31]:
from transformers import BertForSequenceClassification, BertTokenizer
import torch

# Load the model and tokenizer
model = BertForSequenceClassification.from_pretrained('./results')
tokenizer = BertTokenizer.from_pretrained('./results')

# Prepare your input text
text = "fucking hell"
inputs = tokenizer(text, return_tensors='pt')

# Make predictions
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Convert logits to probabilities
probabilities = torch.nn.functional.softmax(logits, dim=-1)
category_mapping = dict(enumerate(pd.Categorical(df['target']).categories))

# Get the predicted class
predicted_class = torch.argmax(probabilities, dim=1).item()

print(f"Predicted class: {predicted_class} \n Targeted group: {category_mapping[predicted_class]}")

Predicted class: 309 
 Targeted group: notargetrecorded


In [14]:
import torch
from torch.utils.data import DataLoader
from sklearn.metrics import f1_score
from transformers import BertForSequenceClassification

# Load the trained model
model = BertForSequenceClassification.from_pretrained('./results')
model.eval()

# Create DataLoader for the test dataset
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

def evaluate_model(model, dataloader):
    all_preds = []
    all_labels = []
    
    with torch.no_grad():
        for batch in dataloader:
            inputs = batch['input_ids']
            attention_mask = batch['attention_mask']
            labels = batch['labels']
            
            outputs = model(input_ids=inputs, attention_mask=attention_mask)
            preds = torch.argmax(outputs.logits, dim=1)
            
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    
    return all_preds, all_labels

def test_model():
    preds, labels = evaluate_model(model, test_loader)
    macro_f1 = f1_score(labels, preds, average='macro')
    print(f'Macro F1 Score: {macro_f1}')
    
test_model()

Macro F1 Score: 0.135716318197409


### Results

In this section, we present the performance of our models and compare them with previous work where available. The primary evaluation metric is the **macro F1-score**: the unweighted mean of the per-class F1-scores, where each class's F1 is the harmonic mean of its precision and recall. Because every class contributes equally regardless of its frequency, the metric is well suited to datasets with class imbalance.
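
To make the behaviour of the metric concrete, the following is a minimal pure-Python sketch of macro F1 on hypothetical toy labels (not drawn from the dataset). The helper names `per_class_f1`, `macro_f1`, and `accuracy` are illustrative only; the notebook itself relies on `sklearn.metrics.f1_score(..., average='macro')`.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label that appears in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy data: class 0 dominates, classes 1 and 2 are rare.
y_true = [0] * 8 + [1, 2]
y_pred = [0] * 10  # a model that always predicts the majority class

print(f"accuracy: {accuracy(y_true, y_pred):.3f}")  # 0.800
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")  # 0.296
```

A classifier that ignores the two rare classes still reaches 80% accuracy here, while its macro F1 falls below 0.3, which is why macro F1 is the more informative number when many classes are rare.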

#### Binary Classification

For the binary classification task of detecting hate speech ('hate' vs. 'nothate'), we compared our fine-tuned BERT model with the RoBERTa model reported in the DHate paper. The results are as follows:

##### Table 1: Binary Classification Results (Macro F1-score)

| **Model**              | **Method**   | **Macro F1-Score** |
|------------------------|--------------|--------------------|
| RoBERTa (DHate paper)  | Fine-tuned   |       0.7538       |
| **Our BERT Model**     | Fine-tuned   |     **0.7640**     |

As shown in Table 1, our fine-tuned BERT model achieved a macro F1-score of **0.7640**, outperforming the RoBERTa model from the DHate paper, which achieved **0.7538**, by roughly 0.01. The margin is small, so the result is best read as showing that our model is competitive with, and slightly ahead of, the published result in the binary classification setting.

#### Multiclass Classification

For the multiclass classification tasks, we focused on two aspects:

1. **Target Classification**: Identifying the target group of the hate speech.
2. **Type Classification**: Determining the type of hate speech expressed.

To the best of our knowledge, there is no prior work that has performed multiclass classification in the domain of hate speech detection using these specific categories. Therefore, we present our results as a baseline for future research.

##### Table 2: Multiclass Classification Results (Macro F1-score)

| **Task**                | **Our BERT Model (Macro F1-Score)** |
|-------------------------|-------------------------------------|
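| Target Classification   |               0.136                 |
| Type Classification     |               0.475                 |

In Table 2, the macro F1-score for **target classification** is **0.136** (0.1357 before rounding), indicating that the model has difficulty distinguishing between different target groups. The encoded target labels range from 0 to 409, so the model must separate several hundred classes, many of which are likely to have only a few examples. Because macro F1 weights every class equally, poor performance on rare classes pulls the score down sharply. Subtle differences between closely related target groups may also be hard for the model to learn.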

For **type classification**, the model achieved a macro F1-score of **0.475**, showing better performance compared to the target classification task. This suggests that the model is more effective at identifying the type of hate speech than pinpointing the specific target group.

#### Analysis

The superior performance of our BERT model in binary classification compared to the RoBERTa model may be attributed to several factors:

- **Data Preprocessing**: Our data cleaning and preprocessing steps may have improved the quality of the input data.
- **Fine-Tuning Strategies**: The hyperparameters and training setup we used may have led to better model generalization.
- **Pretraining Differences**: BERT and RoBERTa share essentially the same Transformer encoder architecture but differ in pretraining (RoBERTa drops the next-sentence-prediction objective, uses dynamic masking, and is trained on more data) and in tokenization (WordPiece versus byte-level BPE), all of which can affect downstream performance.

For the multiclass tasks, the low macro F1-score in target classification highlights how hard it is to detect specific target groups within hate speech. This opens avenues for future work, such as handling class imbalance with a class-weighted loss or resampling, merging rare target labels into coarser groups, or incorporating external knowledge sources.
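
As one possible way of handling class imbalance, a class-weighted loss gives rare classes more influence during training. The sketch below computes inverse-frequency weights in plain Python; the function name `inverse_frequency_weights` and the toy labels are hypothetical, and this weighting was not used in the experiments reported here.

```python
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """Return one weight per class id in range(num_classes).

    A class seen n_c times among N labels gets weight N / (num_classes * n_c),
    so rare classes receive larger weights. Classes that never occur get 0.0;
    they contribute nothing to the loss either way.
    """
    counts = Counter(labels)
    total = len(labels)
    return [
        total / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]

# Hypothetical toy labels: class 0 is common, class 2 is rare, class 3 is unseen.
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
weights = inverse_frequency_weights(toy_labels, num_classes=4)
print([round(w, 3) for w in weights])  # [0.417, 0.833, 2.5, 0.0]
```

In a PyTorch setup, such a list could be passed as `torch.nn.CrossEntropyLoss(weight=torch.tensor(weights))`. Wiring that loss into the Hugging Face `Trainer` would require a custom loss computation, which is not shown here.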

### Conclusion

Our experiments demonstrate that the fine-tuned BERT model performs effectively in binary hate speech detection, surpassing the RoBERTa model from the DHate paper. In multiclass classification tasks, our results provide a baseline for target and type classification within hate speech detection. Future research can build upon these findings to enhance model performance, especially in distinguishing between various target groups.