# Training a BERT Model for Medical Diagnoses Classification
This notebook demonstrates how to fine-tune a BERT model that maps free-text medical diagnosis descriptions to CIE-10 codes (CIE-10 is the Spanish-language abbreviation for ICD-10). It covers loading data, preprocessing, training, evaluation, and querying the model.

In [8]:
!pip install pandas numpy scikit-learn torch transformers datasets matplotlib



## Import Required Libraries
We need several libraries for data manipulation, model training, and evaluation.
- `pandas` and `numpy` for data manipulation.
- `scikit-learn` for data splitting.
- `torch` for PyTorch, the deep learning framework.
- `transformers` for BERT model and tokenizer.
- `datasets` for handling datasets.
- `matplotlib` for plotting graphs.

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import torch
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from transformers import DataCollatorWithPadding
from datasets import Dataset
import matplotlib.pyplot as plt
import os

## Load the Training Dataset
We load the training dataset containing medical diagnoses descriptions and their corresponding CIE-10 codes.
- `pd.read_csv` is used to read the CSV file into a DataFrame.
- We select only the relevant columns (`description` and `code`).
- We rename the columns to `text` and `label` for consistency.

In [10]:
# Load the training dataset
df_train = pd.read_csv('../csv_import_scrips/cie10-es-diagnoses.csv')
df_train = df_train[['description', 'code']]
df_train = df_train.rename(columns={'description': 'text', 'code': 'label'})
df_train.head()

| | text | label |
|---|---|---|
| 0 | Clamidia psittaci infecciones | A70 |
| 1 | Tracoma | A71 |
| 2 | Etapa inicial de tracoma | A71.0 |
| 3 | Fase activa de tracoma | A71.1 |
| 4 | Tracoma, no especificado | A71.9 |
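The select-and-rename pattern above can be wrapped in a small helper that also fails early when an expected column is missing. The following is a minimal sketch: `load_text_label` is a hypothetical helper, and the in-memory CSV is made-up data standing in for the real file.

```python
import io
import pandas as pd

def load_text_label(source, text_col, label_col):
    """Read a CSV and return a two-column frame named `text` and `label`."""
    df = pd.read_csv(source)
    missing = [c for c in (text_col, label_col) if c not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    df = df[[text_col, label_col]].rename(columns={text_col: 'text', label_col: 'label'})
    # Drop rows with no description or no code; they cannot be used for training.
    return df.dropna(subset=['text', 'label']).reset_index(drop=True)

# Tiny in-memory stand-in for the real CSV file (illustrative data only).
sample_csv = io.StringIO(
    "description,code\nTracoma,A71\nEtapa inicial de tracoma,A71.0\n,A71.9\n"
)
df_sample = load_text_label(sample_csv, 'description', 'code')
print(df_sample)
```

The same helper could be called with `'Diagnóstico'` and `'CIE-10'` for a file that uses those column names.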


## Load the Evaluation Dataset
We load a separate evaluation dataset to validate the model's performance.
- The steps mirror those for the training set: select the `Diagnóstico` and `CIE-10` columns and rename them to `text` and `label`.

In [11]:
# Load the evaluation dataset
df_eval = pd.read_csv('../generated-diagnoses/diagnosticos_medicos_10000.csv')
df_eval = df_eval[['Diagnóstico', 'CIE-10']]
df_eval = df_eval.rename(columns={'Diagnóstico': 'text', 'CIE-10': 'label'})
df_eval.head()

| | text | label |
|---|---|---|
| 0 | Mujer de 32 años con disuria y urgencia miccio... | N30.0 |
| 1 | Paciente masculino de 60 años con pérdida prog... | M48.0 |
| 2 | Niño de 8 años con fiebre persistente de 39°C,... | B05.9 |
| 3 | Mujer de 29 años con antecedentes de ansiedad,... | F41.0 |
| 4 | Paciente masculino de 65 años con tos crónica ... | J44.9 |
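Because the training and evaluation files come from different sources, it may be worth confirming that every CIE-10 code in the evaluation set also occurs in the training set, since a classifier cannot predict a class it never saw. A minimal sketch with made-up code lists follows; `unseen_labels` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def unseen_labels(train_labels, eval_labels):
    """Return the sorted evaluation codes that never occur in the training labels."""
    return sorted(set(eval_labels) - set(train_labels))

# Illustrative codes only; in the notebook these would be the raw `label` columns.
train_codes = pd.Series(['A70', 'A71', 'A71.0', 'N30.0'])
eval_codes = pd.Series(['N30.0', 'M48.0', 'A71'])

missing_codes = unseen_labels(train_codes, eval_codes)
print(missing_codes)
```

The check has to run on the raw string codes, before they are converted to integers.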


## Preprocess Data
We convert the CIE-10 codes to categorical labels and create a mapping from label indices back to CIE-10 codes.
- Convert the `label` column to a categorical type.
- Create a dictionary to map label indices to CIE-10 codes.
- Convert the categorical labels to numerical codes.

In [12]:
# Preprocess data
df_train['label'] = df_train['label'].astype('category')
df_eval['label'] = df_eval['label'].astype('category')
label_to_code = dict(enumerate(df_train['label'].cat.categories))
df_train['label'] = df_train['label'].cat.codes
df_eval['label'] = df_eval['label'].cat.codes
df_train.head(), df_eval.head()

(                            text  label
 0  Clamidia psittaci infecciones    557
 1                        Tracoma    558
 2       Etapa inicial de tracoma    559
 3         Fase activa de tracoma    560
 4       Tracoma, no especificado    561,
                                                 text  label
 0  Mujer de 32 años con disuria y urgencia miccio...     16
 1  Paciente masculino de 60 años con pérdida prog...     14
 2  Niño de 8 años con fiebre persistente de 39°C,...      1
 3  Mujer de 29 años con antecedentes de ansiedad,...      6
 4  Paciente masculino de 65 años con tos crónica ...     11)
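Note that `astype('category')` is applied to each frame separately, so the integer assigned to a code depends on which codes are present in that frame. In the output above, the evaluation labels fall in a much smaller range than the training labels, which suggests the two encodings do not line up. One way to keep the integers comparable is to encode both frames against the training categories. The sketch below uses toy data; codes absent from the training set come out as `-1`, which is how pandas encodes values outside the given categories, and such rows would need to be filtered out before training or evaluation.

```python
import pandas as pd

# Toy frames standing in for df_train and df_eval (illustrative data only).
train = pd.DataFrame({'text': ['a', 'b', 'c'], 'label': ['A70', 'A71', 'N30.0']})
evald = pd.DataFrame({'text': ['x', 'y'], 'label': ['N30.0', 'M48.0']})

# Fix the category order once, from the training labels.
categories = pd.Categorical(train['label']).categories
label_to_code = dict(enumerate(categories))

# Encode both frames against the same categories so the integers agree.
train['label_id'] = pd.Categorical(train['label'], categories=categories).codes
evald['label_id'] = pd.Categorical(evald['label'], categories=categories).codes

print(label_to_code)
print(evald)
```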

## Tokenize and Encode Data
We use the BERT tokenizer to tokenize and encode the text data.
- Load the BERT tokenizer.
- Define a function to tokenize the text data.
- Convert the DataFrame to a Dataset object.
- Apply the tokenizer to the dataset.
- Use `DataCollatorWithPadding` to handle padding.

In [13]:
# Tokenize and encode data
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True)

train_dataset = Dataset.from_pandas(df_train)
eval_dataset = Dataset.from_pandas(df_eval)

tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_eval_dataset = eval_dataset.map(tokenize_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
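Because `tokenize_function` truncates but does not pad, the tokenized sequences have different lengths, and padding is deferred to batch time. The pure-Python sketch below illustrates the idea of padding each batch only to its own longest sequence. `pad_batch` is a hypothetical helper written for illustration and is not how `DataCollatorWithPadding` is implemented internally. The pad id `0` matches the `[PAD]` token id of `bert-base-uncased`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Made-up token ids; 101 and 102 are the [CLS] and [SEP] ids in bert-base-uncased.
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch['input_ids'])
print(batch['attention_mask'])
```

Padding per batch rather than to a global maximum keeps short batches short, which tends to matter when most descriptions are only a few words long.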


## Load or Train the Model
We check if a trained model already exists. If it does, we load it. Otherwise, we train a new model.
- Check if the model directory exists.
- If it exists, load the model and tokenizer from the saved files.
- If it doesn't exist, load a pre-trained BERT model and fine-tune it on our dataset.
- Save the trained model and tokenizer.

In [7]:
# Check if the model is already trained and saved
model_path = './trained_model'
model_exists = os.path.exists(model_path)

if model_exists:
    # Load the saved model and tokenizer
    model = BertForSequenceClassification.from_pretrained(model_path)
    tokenizer = BertTokenizer.from_pretrained(model_path)
    print('Model loaded from saved files.')
else:
    # Load pre-trained BERT with a new classification head, one output per label
    model = BertForSequenceClassification.from_pretrained(
        'bert-base-uncased', num_labels=df_train['label'].nunique()
    )

# Define the training arguments and trainer once; they serve both fine-tuning and evaluation
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

if not model_exists:
    # Fine-tune BERT model
    trainer.train()

    # Save the trained model and tokenizer
    model.save_pretrained(model_path)
    tokenizer.save_pretrained(model_path)
    print('Model trained and saved.')
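The `Trainer` above is built without a `compute_metrics` argument, so evaluation reports only the loss. A minimal sketch of an accuracy metric follows, assuming the evaluation output unpacks into a `(logits, labels)` pair, as `Trainer` provides. `compute_accuracy` is a hypothetical name and the arrays are made up.

```python
import numpy as np

def compute_accuracy(eval_pred):
    """Return top-1 accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {'accuracy': float((predictions == labels).mean())}

# Made-up logits for four examples and three classes.
logits = np.array([[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 3.0], [0.9, 0.8, 0.1]])
labels = np.array([1, 0, 2, 1])
result = compute_accuracy((logits, labels))
print(result)
```

It could be passed to the trainer as `compute_metrics=compute_accuracy`.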

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 14/18984 [00:00<18:37, 16.98it/s] 

{'loss': 11.5452, 'grad_norm': 12.671746253967285, 'learning_rate': 1.9989464812473665e-05, 'epoch': 0.0}


  0%|          | 24/18984 [00:01<17:25, 18.13it/s]

{'loss': 11.6206, 'grad_norm': 4.710226535797119, 'learning_rate': 1.9978929624947324e-05, 'epoch': 0.0}


  0%|          | 32/18984 [00:01<17:23, 18.16it/s]

{'loss': 11.5829, 'grad_norm': 4.356561660766602, 'learning_rate': 1.9968394437420987e-05, 'epoch': 0.0}


  0%|          | 44/18984 [00:02<17:09, 18.39it/s]

{'loss': 11.5635, 'grad_norm': 3.892437219619751, 'learning_rate': 1.995785924989465e-05, 'epoch': 0.01}


  0%|          | 54/18984 [00:03<17:31, 18.01it/s]

{'loss': 11.5404, 'grad_norm': 3.0347814559936523, 'learning_rate': 1.994732406236831e-05, 'epoch': 0.01}


  0%|          | 64/18984 [00:03<17:37, 17.89it/s]

{'loss': 11.5461, 'grad_norm': 2.9070229530334473, 'learning_rate': 1.9936788874841973e-05, 'epoch': 0.01}


  0%|          | 74/18984 [00:04<17:27, 18.05it/s]

{'loss': 11.543, 'grad_norm': 2.609387159347534, 'learning_rate': 1.9926253687315636e-05, 'epoch': 0.01}


  0%|          | 84/18984 [00:04<17:27, 18.05it/s]

{'loss': 11.5216, 'grad_norm': 2.9632389545440674, 'learning_rate': 1.99157184997893e-05, 'epoch': 0.01}


  0%|          | 94/18984 [00:05<17:23, 18.11it/s]

{'loss': 11.5519, 'grad_norm': 3.1478731632232666, 'learning_rate': 1.990518331226296e-05, 'epoch': 0.01}


  1%|          | 104/18984 [00:05<17:41, 17.78it/s]

{'loss': 11.5341, 'grad_norm': 2.647794008255005, 'learning_rate': 1.9894648124736622e-05, 'epoch': 0.02}


  1%|          | 114/18984 [00:06<17:28, 18.00it/s]

{'loss': 11.534, 'grad_norm': 2.4727954864501953, 'learning_rate': 1.9884112937210285e-05, 'epoch': 0.02}


  1%|          | 124/18984 [00:07<17:28, 17.99it/s]

{'loss': 11.5624, 'grad_norm': 2.856041193008423, 'learning_rate': 1.9873577749683945e-05, 'epoch': 0.02}


  1%|          | 134/18984 [00:07<17:27, 18.00it/s]

{'loss': 11.5361, 'grad_norm': 2.5055410861968994, 'learning_rate': 1.9863042562157608e-05, 'epoch': 0.02}


  1%|          | 144/18984 [00:08<17:27, 17.99it/s]

{'loss': 11.5294, 'grad_norm': 2.3354811668395996, 'learning_rate': 1.985250737463127e-05, 'epoch': 0.02}


  1%|          | 154/18984 [00:08<17:19, 18.12it/s]

{'loss': 11.542, 'grad_norm': 2.6855008602142334, 'learning_rate': 1.984197218710493e-05, 'epoch': 0.02}


  1%|          | 164/18984 [00:09<16:57, 18.50it/s]

{'loss': 11.5168, 'grad_norm': 3.058065891265869, 'learning_rate': 1.9831436999578593e-05, 'epoch': 0.03}


  1%|          | 174/18984 [00:09<17:14, 18.19it/s]

{'loss': 11.5451, 'grad_norm': 3.3445963859558105, 'learning_rate': 1.9820901812052256e-05, 'epoch': 0.03}


  1%|          | 182/18984 [00:10<17:31, 17.88it/s]

{'loss': 11.5339, 'grad_norm': 3.4085898399353027, 'learning_rate': 1.981036662452592e-05, 'epoch': 0.03}


  1%|          | 194/18984 [00:10<17:37, 17.77it/s]

{'loss': 11.5415, 'grad_norm': 3.1032683849334717, 'learning_rate': 1.979983143699958e-05, 'epoch': 0.03}


  1%|          | 202/18984 [00:11<17:27, 17.93it/s]

{'loss': 11.5463, 'grad_norm': 2.5287177562713623, 'learning_rate': 1.9789296249473242e-05, 'epoch': 0.03}


  1%|          | 214/18984 [00:12<17:09, 18.24it/s]

{'loss': 11.5657, 'grad_norm': 2.389235496520996, 'learning_rate': 1.9778761061946905e-05, 'epoch': 0.03}


  1%|          | 224/18984 [00:12<17:29, 17.88it/s]

{'loss': 11.5517, 'grad_norm': 2.2332284450531006, 'learning_rate': 1.9768225874420565e-05, 'epoch': 0.03}


  1%|          | 234/18984 [00:13<17:08, 18.23it/s]

{'loss': 11.5386, 'grad_norm': 2.1896166801452637, 'learning_rate': 1.9757690686894228e-05, 'epoch': 0.04}


  1%|▏         | 244/18984 [00:13<17:30, 17.84it/s]

{'loss': 11.5368, 'grad_norm': 3.0951645374298096, 'learning_rate': 1.974715549936789e-05, 'epoch': 0.04}


  1%|▏         | 254/18984 [00:14<17:25, 17.91it/s]

{'loss': 11.548, 'grad_norm': 3.093583822250366, 'learning_rate': 1.973662031184155e-05, 'epoch': 0.04}


  1%|▏         | 264/18984 [00:14<17:09, 18.18it/s]

{'loss': 11.5482, 'grad_norm': 2.785658121109009, 'learning_rate': 1.9726085124315214e-05, 'epoch': 0.04}


  1%|▏         | 274/18984 [00:15<17:08, 18.19it/s]

{'loss': 11.5435, 'grad_norm': 2.4545445442199707, 'learning_rate': 1.9715549936788877e-05, 'epoch': 0.04}


  1%|▏         | 284/18984 [00:15<17:20, 17.97it/s]

{'loss': 11.5417, 'grad_norm': 2.168226718902588, 'learning_rate': 1.970501474926254e-05, 'epoch': 0.04}


  2%|▏         | 294/18984 [00:16<17:13, 18.08it/s]

{'loss': 11.5468, 'grad_norm': 2.728064775466919, 'learning_rate': 1.96944795617362e-05, 'epoch': 0.05}


  2%|▏         | 304/18984 [00:17<17:15, 18.03it/s]

{'loss': 11.5363, 'grad_norm': 2.52807354927063, 'learning_rate': 1.9683944374209863e-05, 'epoch': 0.05}


  2%|▏         | 314/18984 [00:17<17:07, 18.16it/s]

{'loss': 11.5371, 'grad_norm': 2.701465368270874, 'learning_rate': 1.9673409186683526e-05, 'epoch': 0.05}


  2%|▏         | 324/18984 [00:18<17:07, 18.17it/s]

{'loss': 11.5343, 'grad_norm': 2.410048246383667, 'learning_rate': 1.9662873999157185e-05, 'epoch': 0.05}


  2%|▏         | 334/18984 [00:18<17:09, 18.12it/s]

{'loss': 11.5206, 'grad_norm': 2.4462029933929443, 'learning_rate': 1.9652338811630848e-05, 'epoch': 0.05}


  2%|▏         | 344/18984 [00:19<17:26, 17.82it/s]

{'loss': 11.5417, 'grad_norm': 2.2668275833129883, 'learning_rate': 1.9641803624104508e-05, 'epoch': 0.05}


  2%|▏         | 354/18984 [00:19<17:07, 18.14it/s]

{'loss': 11.5296, 'grad_norm': 2.57421875, 'learning_rate': 1.963126843657817e-05, 'epoch': 0.06}


  2%|▏         | 364/18984 [00:20<17:14, 17.99it/s]

{'loss': 11.5565, 'grad_norm': 2.3696887493133545, 'learning_rate': 1.9620733249051834e-05, 'epoch': 0.06}


  2%|▏         | 374/18984 [00:20<17:08, 18.09it/s]

{'loss': 11.5542, 'grad_norm': 2.4023544788360596, 'learning_rate': 1.9610198061525497e-05, 'epoch': 0.06}


  2%|▏         | 384/18984 [00:21<17:02, 18.19it/s]

{'loss': 11.5486, 'grad_norm': 2.352658271789551, 'learning_rate': 1.959966287399916e-05, 'epoch': 0.06}


  2%|▏         | 394/18984 [00:22<17:10, 18.04it/s]

{'loss': 11.5372, 'grad_norm': 2.5283799171447754, 'learning_rate': 1.9589127686472823e-05, 'epoch': 0.06}


  2%|▏         | 404/18984 [00:22<17:22, 17.82it/s]

{'loss': 11.5396, 'grad_norm': 2.60941481590271, 'learning_rate': 1.9578592498946483e-05, 'epoch': 0.06}


  2%|▏         | 414/18984 [00:23<17:15, 17.93it/s]

{'loss': 11.5416, 'grad_norm': 2.4773597717285156, 'learning_rate': 1.9568057311420146e-05, 'epoch': 0.06}


  2%|▏         | 424/18984 [00:23<17:05, 18.09it/s]

{'loss': 11.5338, 'grad_norm': 2.53262996673584, 'learning_rate': 1.9557522123893806e-05, 'epoch': 0.07}


  2%|▏         | 434/18984 [00:24<16:52, 18.32it/s]

{'loss': 11.5417, 'grad_norm': 3.4553563594818115, 'learning_rate': 1.954698693636747e-05, 'epoch': 0.07}


  2%|▏         | 444/18984 [00:24<17:05, 18.09it/s]

{'loss': 11.5318, 'grad_norm': 3.217681407928467, 'learning_rate': 1.953645174884113e-05, 'epoch': 0.07}


  2%|▏         | 454/18984 [00:25<17:17, 17.87it/s]

{'loss': 11.5293, 'grad_norm': 2.65987491607666, 'learning_rate': 1.952591656131479e-05, 'epoch': 0.07}


  2%|▏         | 464/18984 [00:25<17:15, 17.88it/s]

{'loss': 11.5308, 'grad_norm': 2.7887792587280273, 'learning_rate': 1.9515381373788454e-05, 'epoch': 0.07}


  2%|▏         | 474/18984 [00:26<17:04, 18.07it/s]

{'loss': 11.5327, 'grad_norm': 2.9700427055358887, 'learning_rate': 1.9504846186262117e-05, 'epoch': 0.07}


  3%|▎         | 484/18984 [00:27<17:20, 17.79it/s]

{'loss': 11.5394, 'grad_norm': 3.097884178161621, 'learning_rate': 1.949431099873578e-05, 'epoch': 0.08}


  3%|▎         | 494/18984 [00:27<17:15, 17.85it/s]

{'loss': 11.5273, 'grad_norm': 3.0715255737304688, 'learning_rate': 1.9483775811209443e-05, 'epoch': 0.08}


  3%|▎         | 500/18984 [00:27<17:12, 17.90it/s]

{'loss': 11.5323, 'grad_norm': 2.8076088428497314, 'learning_rate': 1.9473240623683103e-05, 'epoch': 0.08}


  3%|▎         | 514/18984 [00:31<33:40,  9.14it/s]  

{'loss': 11.5217, 'grad_norm': 2.6257596015930176, 'learning_rate': 1.9462705436156766e-05, 'epoch': 0.08}


  3%|▎         | 524/18984 [00:32<19:55, 15.45it/s]

{'loss': 11.549, 'grad_norm': 2.5945024490356445, 'learning_rate': 1.9452170248630426e-05, 'epoch': 0.08}


  3%|▎         | 534/18984 [00:32<17:35, 17.48it/s]

{'loss': 11.5525, 'grad_norm': 3.883265495300293, 'learning_rate': 1.944163506110409e-05, 'epoch': 0.08}


  3%|▎         | 544/18984 [00:33<16:59, 18.10it/s]

{'loss': 11.5658, 'grad_norm': 3.2854092121124268, 'learning_rate': 1.9431099873577752e-05, 'epoch': 0.09}


  3%|▎         | 554/18984 [00:33<16:49, 18.25it/s]

{'loss': 11.5285, 'grad_norm': 2.985541820526123, 'learning_rate': 1.942056468605141e-05, 'epoch': 0.09}


  3%|▎         | 564/18984 [00:34<17:02, 18.02it/s]

{'loss': 11.5564, 'grad_norm': 2.658684253692627, 'learning_rate': 1.9410029498525075e-05, 'epoch': 0.09}


  3%|▎         | 574/18984 [00:35<17:04, 17.97it/s]

{'loss': 11.5389, 'grad_norm': 2.335466146469116, 'learning_rate': 1.9399494310998738e-05, 'epoch': 0.09}


  3%|▎         | 584/18984 [00:35<17:06, 17.93it/s]

{'loss': 11.5409, 'grad_norm': 2.674593448638916, 'learning_rate': 1.93889591234724e-05, 'epoch': 0.09}


  3%|▎         | 594/18984 [00:36<17:05, 17.93it/s]

{'loss': 11.5386, 'grad_norm': 2.9766993522644043, 'learning_rate': 1.9378423935946064e-05, 'epoch': 0.09}


  3%|▎         | 604/18984 [00:36<16:49, 18.20it/s]

{'loss': 11.5464, 'grad_norm': 2.86518931388855, 'learning_rate': 1.9367888748419723e-05, 'epoch': 0.09}


  3%|▎         | 614/18984 [00:37<16:51, 18.16it/s]

{'loss': 11.535, 'grad_norm': 2.2582333087921143, 'learning_rate': 1.9357353560893386e-05, 'epoch': 0.1}


  3%|▎         | 624/18984 [00:37<16:46, 18.25it/s]

{'loss': 11.5464, 'grad_norm': 2.8226752281188965, 'learning_rate': 1.9346818373367046e-05, 'epoch': 0.1}


  3%|▎         | 634/18984 [00:38<16:41, 18.33it/s]

{'loss': 11.5344, 'grad_norm': 2.906437873840332, 'learning_rate': 1.933628318584071e-05, 'epoch': 0.1}


  3%|▎         | 644/18984 [00:38<17:02, 17.93it/s]

{'loss': 11.5329, 'grad_norm': 2.951902389526367, 'learning_rate': 1.9325747998314372e-05, 'epoch': 0.1}


  3%|▎         | 654/18984 [00:39<16:51, 18.11it/s]

{'loss': 11.5513, 'grad_norm': 3.048130989074707, 'learning_rate': 1.9315212810788032e-05, 'epoch': 0.1}


  3%|▎         | 664/18984 [00:40<16:57, 18.01it/s]

{'loss': 11.535, 'grad_norm': 2.9418859481811523, 'learning_rate': 1.9304677623261695e-05, 'epoch': 0.1}


  4%|▎         | 674/18984 [00:40<17:13, 17.71it/s]

{'loss': 11.5327, 'grad_norm': 3.108011245727539, 'learning_rate': 1.9294142435735358e-05, 'epoch': 0.11}


  4%|▎         | 684/18984 [00:41<17:02, 17.91it/s]

{'loss': 11.5057, 'grad_norm': 3.6572349071502686, 'learning_rate': 1.928360724820902e-05, 'epoch': 0.11}


  4%|▎         | 694/18984 [00:41<16:46, 18.17it/s]

{'loss': 11.5344, 'grad_norm': 4.200192451477051, 'learning_rate': 1.9273072060682684e-05, 'epoch': 0.11}


  4%|▎         | 704/18984 [00:42<17:04, 17.84it/s]

{'loss': 11.5652, 'grad_norm': 4.087437629699707, 'learning_rate': 1.9262536873156344e-05, 'epoch': 0.11}


  4%|▍         | 714/18984 [00:42<16:48, 18.12it/s]

{'loss': 11.5511, 'grad_norm': 3.763753652572632, 'learning_rate': 1.9252001685630007e-05, 'epoch': 0.11}


  4%|▍         | 724/18984 [00:43<16:58, 17.93it/s]

{'loss': 11.569, 'grad_norm': 3.2249350547790527, 'learning_rate': 1.9241466498103666e-05, 'epoch': 0.11}


  4%|▍         | 734/18984 [00:43<16:54, 17.99it/s]

{'loss': 11.5372, 'grad_norm': 2.732282876968384, 'learning_rate': 1.923093131057733e-05, 'epoch': 0.12}


  4%|▍         | 744/18984 [00:44<16:51, 18.04it/s]

{'loss': 11.5419, 'grad_norm': 2.1281304359436035, 'learning_rate': 1.9220396123050992e-05, 'epoch': 0.12}


  4%|▍         | 752/18984 [00:44<17:09, 17.72it/s]

{'loss': 11.5466, 'grad_norm': 1.8543239831924438, 'learning_rate': 1.9209860935524652e-05, 'epoch': 0.12}


  4%|▍         | 764/18984 [00:45<16:41, 18.19it/s]

{'loss': 11.5271, 'grad_norm': 2.347259283065796, 'learning_rate': 1.9199325747998315e-05, 'epoch': 0.12}


  4%|▍         | 774/18984 [00:46<16:30, 18.39it/s]

{'loss': 11.5466, 'grad_norm': 3.008169412612915, 'learning_rate': 1.9188790560471978e-05, 'epoch': 0.12}


  4%|▍         | 782/18984 [00:46<17:13, 17.61it/s]

{'loss': 11.5226, 'grad_norm': 3.1842429637908936, 'learning_rate': 1.917825537294564e-05, 'epoch': 0.12}


  4%|▍         | 794/18984 [00:47<16:52, 17.96it/s]

{'loss': 11.544, 'grad_norm': 3.215588331222534, 'learning_rate': 1.9167720185419304e-05, 'epoch': 0.12}


  4%|▍         | 804/18984 [00:47<16:30, 18.35it/s]

{'loss': 11.5339, 'grad_norm': 3.3907434940338135, 'learning_rate': 1.9157184997892964e-05, 'epoch': 0.13}


  4%|▍         | 814/18984 [00:48<16:21, 18.52it/s]

{'loss': 11.549, 'grad_norm': 3.6108970642089844, 'learning_rate': 1.9146649810366627e-05, 'epoch': 0.13}


  4%|▍         | 824/18984 [00:48<16:28, 18.36it/s]

{'loss': 11.5477, 'grad_norm': 3.3740124702453613, 'learning_rate': 1.9136114622840287e-05, 'epoch': 0.13}


  4%|▍         | 834/18984 [00:49<16:36, 18.22it/s]

{'loss': 11.5372, 'grad_norm': 3.485590934753418, 'learning_rate': 1.912557943531395e-05, 'epoch': 0.13}


  4%|▍         | 844/18984 [00:50<16:53, 17.89it/s]

{'loss': 11.564, 'grad_norm': 3.0111019611358643, 'learning_rate': 1.9115044247787613e-05, 'epoch': 0.13}


  4%|▍         | 854/18984 [00:50<16:36, 18.20it/s]

{'loss': 11.5256, 'grad_norm': 3.337409019470215, 'learning_rate': 1.9104509060261272e-05, 'epoch': 0.13}


  5%|▍         | 864/18984 [00:51<16:26, 18.36it/s]

{'loss': 11.5193, 'grad_norm': 4.430864334106445, 'learning_rate': 1.9093973872734935e-05, 'epoch': 0.14}


  5%|▍         | 874/18984 [00:51<16:17, 18.52it/s]

{'loss': 11.5114, 'grad_norm': 4.994176864624023, 'learning_rate': 1.90834386852086e-05, 'epoch': 0.14}


  5%|▍         | 882/18984 [00:52<16:35, 18.19it/s]

{'loss': 11.569, 'grad_norm': 4.458938121795654, 'learning_rate': 1.907290349768226e-05, 'epoch': 0.14}


  5%|▍         | 894/18984 [00:52<16:46, 17.97it/s]

{'loss': 11.5075, 'grad_norm': 4.63010311126709, 'learning_rate': 1.9062368310155925e-05, 'epoch': 0.14}


  5%|▍         | 904/18984 [00:53<16:14, 18.54it/s]

{'loss': 11.5404, 'grad_norm': 5.106715679168701, 'learning_rate': 1.9051833122629584e-05, 'epoch': 0.14}


  5%|▍         | 914/18984 [00:53<16:22, 18.39it/s]

{'loss': 11.5515, 'grad_norm': 5.506734371185303, 'learning_rate': 1.9041297935103247e-05, 'epoch': 0.14}


  5%|▍         | 924/18984 [00:54<16:44, 17.99it/s]

{'loss': 11.5762, 'grad_norm': 5.695484638214111, 'learning_rate': 1.9030762747576907e-05, 'epoch': 0.15}


  5%|▍         | 934/18984 [00:54<16:52, 17.82it/s]

{'loss': 11.5601, 'grad_norm': 4.85464334487915, 'learning_rate': 1.902022756005057e-05, 'epoch': 0.15}


  5%|▍         | 944/18984 [00:55<16:51, 17.84it/s]

{'loss': 11.5452, 'grad_norm': 4.321030616760254, 'learning_rate': 1.9009692372524233e-05, 'epoch': 0.15}


  5%|▌         | 954/18984 [00:56<16:52, 17.80it/s]

{'loss': 11.5499, 'grad_norm': 3.8843281269073486, 'learning_rate': 1.8999157184997893e-05, 'epoch': 0.15}


  5%|▌         | 964/18984 [00:56<16:10, 18.57it/s]

{'loss': 11.5336, 'grad_norm': 2.5567233562469482, 'learning_rate': 1.8988621997471556e-05, 'epoch': 0.15}


  5%|▌         | 974/18984 [00:57<16:14, 18.48it/s]

{'loss': 11.5449, 'grad_norm': 3.174865245819092, 'learning_rate': 1.897808680994522e-05, 'epoch': 0.15}


  5%|▌         | 984/18984 [00:57<16:41, 17.97it/s]

{'loss': 11.5188, 'grad_norm': 3.4343557357788086, 'learning_rate': 1.8967551622418882e-05, 'epoch': 0.15}


  5%|▌         | 994/18984 [00:58<16:30, 18.16it/s]

{'loss': 11.5069, 'grad_norm': 4.1035332679748535, 'learning_rate': 1.8957016434892545e-05, 'epoch': 0.16}


  5%|▌         | 1000/18984 [00:58<16:27, 18.22it/s]

{'loss': 11.5289, 'grad_norm': 4.174877166748047, 'learning_rate': 1.8946481247366205e-05, 'epoch': 0.16}


  5%|▌         | 1012/18984 [01:02<39:39,  7.55it/s]  

{'loss': 11.5431, 'grad_norm': 3.8422532081604004, 'learning_rate': 1.8935946059839868e-05, 'epoch': 0.16}


  5%|▌         | 1024/18984 [01:02<19:15, 15.55it/s]

{'loss': 11.5369, 'grad_norm': 3.7181577682495117, 'learning_rate': 1.8925410872313527e-05, 'epoch': 0.16}


  5%|▌         | 1034/18984 [01:03<16:39, 17.96it/s]

{'loss': 11.5476, 'grad_norm': 4.011626720428467, 'learning_rate': 1.891487568478719e-05, 'epoch': 0.16}


  5%|▌         | 1044/18984 [01:04<16:52, 17.71it/s]

{'loss': 11.5451, 'grad_norm': 4.929717063903809, 'learning_rate': 1.8904340497260853e-05, 'epoch': 0.16}


  6%|▌         | 1054/18984 [01:04<16:28, 18.14it/s]

{'loss': 11.5361, 'grad_norm': 5.1986308097839355, 'learning_rate': 1.8893805309734513e-05, 'epoch': 0.17}


  6%|▌         | 1064/18984 [01:05<16:26, 18.17it/s]

{'loss': 11.5637, 'grad_norm': 4.965215682983398, 'learning_rate': 1.8883270122208176e-05, 'epoch': 0.17}


  6%|▌         | 1074/18984 [01:05<16:41, 17.87it/s]

{'loss': 11.5242, 'grad_norm': 4.979870796203613, 'learning_rate': 1.887273493468184e-05, 'epoch': 0.17}


  6%|▌         | 1082/18984 [01:06<17:04, 17.47it/s]

{'loss': 11.511, 'grad_norm': 5.809108734130859, 'learning_rate': 1.8862199747155502e-05, 'epoch': 0.17}


  6%|▌         | 1094/18984 [01:06<16:29, 18.07it/s]

{'loss': 11.5143, 'grad_norm': 6.14996337890625, 'learning_rate': 1.8851664559629165e-05, 'epoch': 0.17}


  6%|▌         | 1104/18984 [01:07<16:38, 17.91it/s]

{'loss': 11.5276, 'grad_norm': 5.99971342086792, 'learning_rate': 1.8841129372102825e-05, 'epoch': 0.17}


  6%|▌         | 1114/18984 [01:07<16:45, 17.77it/s]

{'loss': 11.5371, 'grad_norm': 4.315360069274902, 'learning_rate': 1.8830594184576488e-05, 'epoch': 0.18}


  6%|▌         | 1124/18984 [01:08<16:53, 17.62it/s]

{'loss': 11.5405, 'grad_norm': 4.58412504196167, 'learning_rate': 1.8820058997050148e-05, 'epoch': 0.18}


  6%|▌         | 1134/18984 [01:09<16:27, 18.08it/s]

{'loss': 11.5384, 'grad_norm': 4.208969593048096, 'learning_rate': 1.880952380952381e-05, 'epoch': 0.18}


  6%|▌         | 1144/18984 [01:09<16:29, 18.03it/s]

{'loss': 11.528, 'grad_norm': 4.093333721160889, 'learning_rate': 1.8798988621997474e-05, 'epoch': 0.18}


  6%|▌         | 1154/18984 [01:10<16:36, 17.89it/s]

{'loss': 11.5574, 'grad_norm': 4.359792709350586, 'learning_rate': 1.8788453434471133e-05, 'epoch': 0.18}


  6%|▌         | 1164/18984 [01:10<16:35, 17.90it/s]

{'loss': 11.561, 'grad_norm': 3.2470080852508545, 'learning_rate': 1.8777918246944796e-05, 'epoch': 0.18}


  6%|▌         | 1174/18984 [01:11<16:28, 18.01it/s]

{'loss': 11.5324, 'grad_norm': 2.750810146331787, 'learning_rate': 1.876738305941846e-05, 'epoch': 0.18}


  6%|▌         | 1182/18984 [01:11<16:50, 17.62it/s]

{'loss': 11.5322, 'grad_norm': 3.600754976272583, 'learning_rate': 1.8756847871892122e-05, 'epoch': 0.19}


  6%|▋         | 1194/18984 [01:12<15:59, 18.55it/s]

{'loss': 11.5089, 'grad_norm': 4.11431360244751, 'learning_rate': 1.8746312684365786e-05, 'epoch': 0.19}


  6%|▋         | 1204/18984 [01:12<16:05, 18.42it/s]

{'loss': 11.5353, 'grad_norm': 4.438050746917725, 'learning_rate': 1.8735777496839445e-05, 'epoch': 0.19}


  6%|▋         | 1214/18984 [01:13<16:01, 18.48it/s]

{'loss': 11.5277, 'grad_norm': 4.125344276428223, 'learning_rate': 1.8725242309313108e-05, 'epoch': 0.19}


  6%|▋         | 1224/18984 [01:14<16:09, 18.31it/s]

{'loss': 11.5527, 'grad_norm': 3.71366548538208, 'learning_rate': 1.871470712178677e-05, 'epoch': 0.19}


  7%|▋         | 1234/18984 [01:14<16:36, 17.81it/s]

{'loss': 11.548, 'grad_norm': 3.5066707134246826, 'learning_rate': 1.870417193426043e-05, 'epoch': 0.19}


  7%|▋         | 1244/18984 [01:15<16:24, 18.02it/s]

{'loss': 11.5618, 'grad_norm': 2.5573792457580566, 'learning_rate': 1.8693636746734094e-05, 'epoch': 0.2}


  7%|▋         | 1254/18984 [01:15<16:24, 18.00it/s]

{'loss': 11.5485, 'grad_norm': 3.493520975112915, 'learning_rate': 1.8683101559207754e-05, 'epoch': 0.2}


  7%|▋         | 1264/18984 [01:16<16:27, 17.94it/s]

{'loss': 11.5362, 'grad_norm': 4.421123027801514, 'learning_rate': 1.8672566371681417e-05, 'epoch': 0.2}


  7%|▋         | 1274/18984 [01:16<16:46, 17.60it/s]

{'loss': 11.5119, 'grad_norm': 4.834132194519043, 'learning_rate': 1.866203118415508e-05, 'epoch': 0.2}


  7%|▋         | 1284/18984 [01:17<16:38, 17.72it/s]

{'loss': 11.5215, 'grad_norm': 5.160983562469482, 'learning_rate': 1.8651495996628743e-05, 'epoch': 0.2}


  7%|▋         | 1294/18984 [01:17<16:26, 17.93it/s]

{'loss': 11.566, 'grad_norm': 4.735618591308594, 'learning_rate': 1.8640960809102406e-05, 'epoch': 0.2}


  7%|▋         | 1304/18984 [01:18<16:41, 17.66it/s]

{'loss': 11.558, 'grad_norm': 4.009241580963135, 'learning_rate': 1.8630425621576065e-05, 'epoch': 0.21}


  7%|▋         | 1314/18984 [01:19<16:32, 17.80it/s]

{'loss': 11.5422, 'grad_norm': 3.106506824493408, 'learning_rate': 1.861989043404973e-05, 'epoch': 0.21}


  7%|▋         | 1324/18984 [01:19<16:20, 18.02it/s]

{'loss': 11.5261, 'grad_norm': 2.8746988773345947, 'learning_rate': 1.860935524652339e-05, 'epoch': 0.21}


  7%|▋         | 1334/18984 [01:20<16:36, 17.72it/s]

{'loss': 11.4756, 'grad_norm': 5.062015533447266, 'learning_rate': 1.859882005899705e-05, 'epoch': 0.21}


  7%|▋         | 1344/18984 [01:20<16:21, 17.96it/s]

{'loss': 11.4918, 'grad_norm': 6.267241954803467, 'learning_rate': 1.8588284871470714e-05, 'epoch': 0.21}


  7%|▋         | 1354/18984 [01:21<16:17, 18.04it/s]

{'loss': 11.5053, 'grad_norm': 6.7213287353515625, 'learning_rate': 1.8577749683944374e-05, 'epoch': 0.21}


  7%|▋         | 1364/18984 [01:21<16:25, 17.88it/s]

{'loss': 11.5205, 'grad_norm': 6.051482677459717, 'learning_rate': 1.8567214496418037e-05, 'epoch': 0.21}


  7%|▋         | 1374/18984 [01:22<16:22, 17.93it/s]

{'loss': 11.5743, 'grad_norm': 6.239078998565674, 'learning_rate': 1.85566793088917e-05, 'epoch': 0.22}


  7%|▋         | 1384/18984 [01:23<16:04, 18.24it/s]

{'loss': 11.5983, 'grad_norm': 5.697982311248779, 'learning_rate': 1.8546144121365363e-05, 'epoch': 0.22}


  7%|▋         | 1394/18984 [01:23<16:16, 18.01it/s]

{'loss': 11.5608, 'grad_norm': 6.063037872314453, 'learning_rate': 1.8535608933839023e-05, 'epoch': 0.22}


  7%|▋         | 1404/18984 [01:24<16:02, 18.27it/s]

{'loss': 11.563, 'grad_norm': 5.568971633911133, 'learning_rate': 1.8525073746312686e-05, 'epoch': 0.22}


  7%|▋         | 1414/18984 [01:24<16:18, 17.96it/s]

{'loss': 11.5698, 'grad_norm': 4.47915506362915, 'learning_rate': 1.851453855878635e-05, 'epoch': 0.22}


  8%|▊         | 1424/18984 [01:25<15:51, 18.46it/s]

{'loss': 11.5151, 'grad_norm': 5.731061935424805, 'learning_rate': 1.8504003371260012e-05, 'epoch': 0.22}


  8%|▊         | 1434/18984 [01:25<15:50, 18.47it/s]

{'loss': 11.4797, 'grad_norm': 5.806625843048096, 'learning_rate': 1.849346818373367e-05, 'epoch': 0.23}


  8%|▊         | 1444/18984 [01:26<16:00, 18.26it/s]

{'loss': 11.5029, 'grad_norm': 5.774048328399658, 'learning_rate': 1.8482932996207335e-05, 'epoch': 0.23}


  8%|▊         | 1454/18984 [01:26<16:04, 18.17it/s]

{'loss': 11.5195, 'grad_norm': 5.8285980224609375, 'learning_rate': 1.8472397808680994e-05, 'epoch': 0.23}


  8%|▊         | 1464/18984 [01:27<15:59, 18.26it/s]

{'loss': 11.5096, 'grad_norm': 5.874873161315918, 'learning_rate': 1.8461862621154657e-05, 'epoch': 0.23}


  8%|▊         | 1474/18984 [01:27<16:09, 18.07it/s]

{'loss': 11.5592, 'grad_norm': 5.878820419311523, 'learning_rate': 1.845132743362832e-05, 'epoch': 0.23}


  8%|▊         | 1484/18984 [01:28<16:29, 17.69it/s]

{'loss': 11.6199, 'grad_norm': 5.757099151611328, 'learning_rate': 1.8440792246101983e-05, 'epoch': 0.23}


  8%|▊         | 1494/18984 [01:29<16:16, 17.91it/s]

{'loss': 11.6198, 'grad_norm': 5.7671613693237305, 'learning_rate': 1.8430257058575643e-05, 'epoch': 0.24}


  8%|▊         | 1500/18984 [01:29<16:09, 18.03it/s]

{'loss': 11.6077, 'grad_norm': 5.558716297149658, 'learning_rate': 1.8419721871049306e-05, 'epoch': 0.24}


  8%|▊         | 1512/18984 [01:33<38:38,  7.54it/s]  

{'loss': 11.6521, 'grad_norm': 5.303976058959961, 'learning_rate': 1.840918668352297e-05, 'epoch': 0.24}


  8%|▊         | 1524/18984 [01:33<18:43, 15.53it/s]

{'loss': 11.5918, 'grad_norm': 4.602643013000488, 'learning_rate': 1.8398651495996632e-05, 'epoch': 0.24}


  8%|▊         | 1534/18984 [01:34<16:43, 17.39it/s]

{'loss': 11.5155, 'grad_norm': 5.4378275871276855, 'learning_rate': 1.8388116308470292e-05, 'epoch': 0.24}


  8%|▊         | 1542/18984 [01:34<16:21, 17.77it/s]

{'loss': 11.4667, 'grad_norm': 5.3979620933532715, 'learning_rate': 1.8377581120943955e-05, 'epoch': 0.24}


  8%|▊         | 1552/18984 [01:35<16:38, 17.46it/s]

{'loss': 11.5035, 'grad_norm': 5.649708271026611, 'learning_rate': 1.8367045933417614e-05, 'epoch': 0.24}


  8%|▊         | 1564/18984 [01:36<16:12, 17.90it/s]

{'loss': 11.5058, 'grad_norm': 5.546151638031006, 'learning_rate': 1.8356510745891278e-05, 'epoch': 0.25}


  8%|▊         | 1574/18984 [01:36<16:10, 17.95it/s]

{'loss': 11.5056, 'grad_norm': 5.649816989898682, 'learning_rate': 1.834597555836494e-05, 'epoch': 0.25}


  8%|▊         | 1584/18984 [01:37<16:20, 17.75it/s]

{'loss': 11.5518, 'grad_norm': 5.5787529945373535, 'learning_rate': 1.8335440370838604e-05, 'epoch': 0.25}


  8%|▊         | 1594/18984 [01:37<16:16, 17.80it/s]

{'loss': 11.569, 'grad_norm': 5.472720623016357, 'learning_rate': 1.8324905183312263e-05, 'epoch': 0.25}


  8%|▊         | 1604/18984 [01:38<16:09, 17.93it/s]

{'loss': 11.548, 'grad_norm': 5.303387641906738, 'learning_rate': 1.8314369995785926e-05, 'epoch': 0.25}


  9%|▊         | 1614/18984 [01:38<16:17, 17.77it/s]

{'loss': 11.5726, 'grad_norm': 5.263978004455566, 'learning_rate': 1.830383480825959e-05, 'epoch': 0.25}


  9%|▊         | 1624/18984 [01:39<15:55, 18.17it/s]

{'loss': 11.5871, 'grad_norm': 5.613766670227051, 'learning_rate': 1.8293299620733252e-05, 'epoch': 0.26}


  9%|▊         | 1634/18984 [01:40<16:14, 17.81it/s]

{'loss': 11.5675, 'grad_norm': 5.312638282775879, 'learning_rate': 1.8282764433206912e-05, 'epoch': 0.26}


  9%|▊         | 1642/18984 [01:40<16:22, 17.65it/s]

{'loss': 11.5356, 'grad_norm': 5.449750900268555, 'learning_rate': 1.8272229245680575e-05, 'epoch': 0.26}


  9%|▊         | 1654/18984 [01:41<16:09, 17.87it/s]

{'loss': 11.5101, 'grad_norm': 6.105261325836182, 'learning_rate': 1.8261694058154235e-05, 'epoch': 0.26}


  9%|▉         | 1664/18984 [01:41<16:14, 17.77it/s]

{'loss': 11.5032, 'grad_norm': 5.8312883377075195, 'learning_rate': 1.8251158870627898e-05, 'epoch': 0.26}


  9%|▉         | 1674/18984 [01:42<16:14, 17.77it/s]

{'loss': 11.5035, 'grad_norm': 5.743258953094482, 'learning_rate': 1.824062368310156e-05, 'epoch': 0.26}


  9%|▉         | 1684/18984 [01:42<16:09, 17.85it/s]

{'loss': 11.5323, 'grad_norm': 5.379996299743652, 'learning_rate': 1.823008849557522e-05, 'epoch': 0.27}


  9%|▉         | 1694/18984 [01:43<15:55, 18.10it/s]

{'loss': 11.5194, 'grad_norm': 4.733675479888916, 'learning_rate': 1.8219553308048884e-05, 'epoch': 0.27}


  9%|▉         | 1704/18984 [01:43<16:01, 17.98it/s]

{'loss': 11.5366, 'grad_norm': 4.471098899841309, 'learning_rate': 1.8209018120522547e-05, 'epoch': 0.27}


  9%|▉         | 1714/18984 [01:44<15:44, 18.28it/s]

{'loss': 11.5284, 'grad_norm': 4.123257637023926, 'learning_rate': 1.819848293299621e-05, 'epoch': 0.27}


  9%|▉         | 1724/18984 [01:45<16:00, 17.96it/s]

{'loss': 11.5428, 'grad_norm': 4.366350173950195, 'learning_rate': 1.8187947745469873e-05, 'epoch': 0.27}


  9%|▉         | 1734/18984 [01:45<15:56, 18.04it/s]

{'loss': 11.5303, 'grad_norm': 5.316423416137695, 'learning_rate': 1.8177412557943532e-05, 'epoch': 0.27}


  9%|▉         | 1744/18984 [01:46<16:18, 17.61it/s]

{'loss': 11.569, 'grad_norm': 4.176026344299316, 'learning_rate': 1.8166877370417195e-05, 'epoch': 0.27}


  9%|▉         | 1754/18984 [01:46<16:18, 17.62it/s]

{'loss': 11.5146, 'grad_norm': 4.000420570373535, 'learning_rate': 1.8156342182890855e-05, 'epoch': 0.28}


  9%|▉         | 1764/18984 [01:47<16:05, 17.84it/s]

{'loss': 11.5007, 'grad_norm': 5.16377592086792, 'learning_rate': 1.8145806995364518e-05, 'epoch': 0.28}


  9%|▉         | 1774/18984 [01:47<15:54, 18.04it/s]

{'loss': 11.4996, 'grad_norm': 5.578585624694824, 'learning_rate': 1.813527180783818e-05, 'epoch': 0.28}


  9%|▉         | 1784/18984 [01:48<15:43, 18.22it/s]

{'loss': 11.5252, 'grad_norm': 5.3336181640625, 'learning_rate': 1.812473662031184e-05, 'epoch': 0.28}


  9%|▉         | 1792/18984 [01:48<15:47, 18.14it/s]

{'loss': 11.5334, 'grad_norm': 5.060458660125732, 'learning_rate': 1.8114201432785504e-05, 'epoch': 0.28}


 10%|▉         | 1804/18984 [01:49<15:53, 18.01it/s]

{'loss': 11.5596, 'grad_norm': 6.325238227844238, 'learning_rate': 1.8103666245259167e-05, 'epoch': 0.28}


 10%|▉         | 1814/18984 [01:50<16:05, 17.78it/s]

{'loss': 11.5458, 'grad_norm': 4.945870399475098, 'learning_rate': 1.809313105773283e-05, 'epoch': 0.29}


 10%|▉         | 1824/18984 [01:50<15:42, 18.21it/s]

{'loss': 11.5339, 'grad_norm': 3.9823265075683594, 'learning_rate': 1.8082595870206493e-05, 'epoch': 0.29}


 10%|▉         | 1834/18984 [01:51<16:06, 17.74it/s]

{'loss': 11.538, 'grad_norm': 6.312694072723389, 'learning_rate': 1.8072060682680153e-05, 'epoch': 0.29}


 10%|▉         | 1844/18984 [01:51<15:51, 18.01it/s]

{'loss': 11.5454, 'grad_norm': 4.782395362854004, 'learning_rate': 1.8061525495153816e-05, 'epoch': 0.29}


 10%|▉         | 1854/18984 [01:52<15:58, 17.87it/s]

{'loss': 11.5459, 'grad_norm': 4.599812030792236, 'learning_rate': 1.8050990307627475e-05, 'epoch': 0.29}


 10%|▉         | 1864/18984 [01:52<15:35, 18.31it/s]

{'loss': 11.5444, 'grad_norm': 4.845687389373779, 'learning_rate': 1.804045512010114e-05, 'epoch': 0.29}


 10%|▉         | 1874/18984 [01:53<15:50, 18.00it/s]

{'loss': 11.5475, 'grad_norm': 4.961956977844238, 'learning_rate': 1.80299199325748e-05, 'epoch': 0.3}


 10%|▉         | 1884/18984 [01:53<15:43, 18.13it/s]

{'loss': 11.5792, 'grad_norm': 4.7294840812683105, 'learning_rate': 1.801938474504846e-05, 'epoch': 0.3}


 10%|▉         | 1894/18984 [01:54<15:41, 18.16it/s]

{'loss': 11.5816, 'grad_norm': 4.257230758666992, 'learning_rate': 1.8008849557522124e-05, 'epoch': 0.3}


 10%|█         | 1904/18984 [01:54<15:39, 18.19it/s]

{'loss': 11.5653, 'grad_norm': 3.553257465362549, 'learning_rate': 1.7998314369995787e-05, 'epoch': 0.3}


 10%|█         | 1914/18984 [01:55<16:12, 17.56it/s]

{'loss': 11.4743, 'grad_norm': 5.753499984741211, 'learning_rate': 1.798777918246945e-05, 'epoch': 0.3}


 10%|█         | 1922/18984 [01:56<15:59, 17.79it/s]

{'loss': 11.5101, 'grad_norm': 5.6514458656311035, 'learning_rate': 1.7977243994943113e-05, 'epoch': 0.3}


 10%|█         | 1934/18984 [01:56<15:42, 18.10it/s]

{'loss': 11.5226, 'grad_norm': 5.704028129577637, 'learning_rate': 1.7966708807416773e-05, 'epoch': 0.3}


 10%|█         | 1944/18984 [01:57<15:29, 18.33it/s]

{'loss': 11.5112, 'grad_norm': 5.614576816558838, 'learning_rate': 1.7956173619890436e-05, 'epoch': 0.31}


 10%|█         | 1954/18984 [01:57<15:39, 18.13it/s]

{'loss': 11.5288, 'grad_norm': 5.877979755401611, 'learning_rate': 1.79456384323641e-05, 'epoch': 0.31}


 10%|█         | 1964/18984 [01:58<15:50, 17.91it/s]

{'loss': 11.5219, 'grad_norm': 5.901684284210205, 'learning_rate': 1.793510324483776e-05, 'epoch': 0.31}


 10%|█         | 1974/18984 [01:58<15:30, 18.29it/s]

{'loss': 11.5897, 'grad_norm': 5.635468482971191, 'learning_rate': 1.7924568057311422e-05, 'epoch': 0.31}


 10%|█         | 1984/18984 [01:59<15:43, 18.02it/s]

{'loss': 11.5916, 'grad_norm': 5.3811869621276855, 'learning_rate': 1.791403286978508e-05, 'epoch': 0.31}


 11%|█         | 1994/18984 [02:00<15:43, 18.00it/s]

{'loss': 11.5952, 'grad_norm': 5.159571647644043, 'learning_rate': 1.7903497682258744e-05, 'epoch': 0.31}


 11%|█         | 2000/18984 [02:00<15:21, 18.43it/s]

{'loss': 11.5272, 'grad_norm': 5.768825531005859, 'learning_rate': 1.7892962494732408e-05, 'epoch': 0.32}


 11%|█         | 2014/18984 [02:04<30:36,  9.24it/s]  

{'loss': 11.4867, 'grad_norm': 5.819807529449463, 'learning_rate': 1.788242730720607e-05, 'epoch': 0.32}


 11%|█         | 2024/18984 [02:04<18:17, 15.45it/s]

{'loss': 11.5278, 'grad_norm': 5.980379581451416, 'learning_rate': 1.7871892119679734e-05, 'epoch': 0.32}


 11%|█         | 2032/18984 [02:05<16:32, 17.08it/s]

{'loss': 11.5338, 'grad_norm': 5.745467185974121, 'learning_rate': 1.7861356932153393e-05, 'epoch': 0.32}


 11%|█         | 2044/18984 [02:05<16:04, 17.57it/s]

{'loss': 11.5661, 'grad_norm': 5.434061527252197, 'learning_rate': 1.7850821744627056e-05, 'epoch': 0.32}


 11%|█         | 2052/18984 [02:06<16:24, 17.21it/s]

{'loss': 11.5771, 'grad_norm': 5.120912075042725, 'learning_rate': 1.784028655710072e-05, 'epoch': 0.32}


 11%|█         | 2064/18984 [02:06<15:51, 17.78it/s]

{'loss': 11.5947, 'grad_norm': 3.818863868713379, 'learning_rate': 1.782975136957438e-05, 'epoch': 0.33}


 11%|█         | 2074/18984 [02:07<15:54, 17.72it/s]

{'loss': 11.5612, 'grad_norm': 2.628882884979248, 'learning_rate': 1.7819216182048042e-05, 'epoch': 0.33}


 11%|█         | 2084/18984 [02:07<15:32, 18.13it/s]

{'loss': 11.5542, 'grad_norm': 2.608367919921875, 'learning_rate': 1.7808680994521702e-05, 'epoch': 0.33}


 11%|█         | 2092/18984 [02:08<15:47, 17.83it/s]

{'loss': 11.5285, 'grad_norm': 3.0555613040924072, 'learning_rate': 1.7798145806995365e-05, 'epoch': 0.33}


 11%|█         | 2104/18984 [02:09<15:43, 17.90it/s]

{'loss': 11.5076, 'grad_norm': 3.781522750854492, 'learning_rate': 1.7787610619469028e-05, 'epoch': 0.33}


 11%|█         | 2114/18984 [02:09<15:49, 17.76it/s]

{'loss': 11.4793, 'grad_norm': 4.4594645500183105, 'learning_rate': 1.777707543194269e-05, 'epoch': 0.33}


 11%|█         | 2124/18984 [02:10<15:41, 17.90it/s]

{'loss': 11.5228, 'grad_norm': 5.182709693908691, 'learning_rate': 1.7766540244416354e-05, 'epoch': 0.34}


 11%|█         | 2134/18984 [02:10<15:15, 18.40it/s]

{'loss': 11.5104, 'grad_norm': 22.569828033447266, 'learning_rate': 1.7756005056890014e-05, 'epoch': 0.34}


 11%|█▏        | 2144/18984 [02:11<15:24, 18.21it/s]

{'loss': 11.526, 'grad_norm': 5.617788791656494, 'learning_rate': 1.7745469869363677e-05, 'epoch': 0.34}


 11%|█▏        | 2154/18984 [02:11<15:23, 18.23it/s]

{'loss': 11.5604, 'grad_norm': 5.490856170654297, 'learning_rate': 1.773493468183734e-05, 'epoch': 0.34}


 11%|█▏        | 2164/18984 [02:12<15:15, 18.37it/s]

{'loss': 11.6054, 'grad_norm': 5.029422283172607, 'learning_rate': 1.7724399494311e-05, 'epoch': 0.34}


 11%|█▏        | 2174/18984 [02:12<15:18, 18.30it/s]

{'loss': 11.5533, 'grad_norm': 5.132821559906006, 'learning_rate': 1.7713864306784662e-05, 'epoch': 0.34}


 12%|█▏        | 2184/18984 [02:13<15:27, 18.11it/s]

{'loss': 11.5364, 'grad_norm': 5.846770763397217, 'learning_rate': 1.7703329119258322e-05, 'epoch': 0.34}


 12%|█▏        | 2194/18984 [02:14<15:26, 18.12it/s]

{'loss': 11.5123, 'grad_norm': 5.160998344421387, 'learning_rate': 1.7692793931731985e-05, 'epoch': 0.35}


 12%|█▏        | 2204/18984 [02:14<15:40, 17.85it/s]

{'loss': 11.5165, 'grad_norm': 5.850002765655518, 'learning_rate': 1.7682258744205648e-05, 'epoch': 0.35}


 12%|█▏        | 2214/18984 [02:15<15:34, 17.94it/s]

{'loss': 11.5201, 'grad_norm': 5.703981876373291, 'learning_rate': 1.767172355667931e-05, 'epoch': 0.35}


 12%|█▏        | 2224/18984 [02:15<15:25, 18.11it/s]

{'loss': 11.546, 'grad_norm': 5.602800369262695, 'learning_rate': 1.7661188369152974e-05, 'epoch': 0.35}


 12%|█▏        | 2234/18984 [02:16<15:42, 17.78it/s]

{'loss': 11.5225, 'grad_norm': 5.541199207305908, 'learning_rate': 1.7650653181626634e-05, 'epoch': 0.35}


 12%|█▏        | 2244/18984 [02:16<15:39, 17.82it/s]

{'loss': 11.5393, 'grad_norm': 5.58787727355957, 'learning_rate': 1.7640117994100297e-05, 'epoch': 0.35}


 12%|█▏        | 2254/18984 [02:17<14:46, 18.87it/s]

{'loss': 11.5512, 'grad_norm': 5.3390374183654785, 'learning_rate': 1.762958280657396e-05, 'epoch': 0.36}


 12%|█▏        | 2264/18984 [02:17<14:54, 18.69it/s]

{'loss': 11.5495, 'grad_norm': 5.706515312194824, 'learning_rate': 1.761904761904762e-05, 'epoch': 0.36}


 12%|█▏        | 2274/18984 [02:18<14:50, 18.76it/s]

{'loss': 11.5297, 'grad_norm': 5.581253528594971, 'learning_rate': 1.7608512431521283e-05, 'epoch': 0.36}


 12%|█▏        | 2284/18984 [02:18<14:40, 18.98it/s]

{'loss': 11.5767, 'grad_norm': 5.7056403160095215, 'learning_rate': 1.7597977243994942e-05, 'epoch': 0.36}


 12%|█▏        | 2294/18984 [02:19<14:51, 18.72it/s]

{'loss': 11.5834, 'grad_norm': 5.505011081695557, 'learning_rate': 1.7587442056468605e-05, 'epoch': 0.36}


 12%|█▏        | 2304/18984 [02:20<14:54, 18.64it/s]

{'loss': 11.5973, 'grad_norm': 5.527444839477539, 'learning_rate': 1.757690686894227e-05, 'epoch': 0.36}


 12%|█▏        | 2314/18984 [02:20<14:42, 18.90it/s]

{'loss': 11.5615, 'grad_norm': 5.354640960693359, 'learning_rate': 1.756637168141593e-05, 'epoch': 0.37}


 12%|█▏        | 2324/18984 [02:21<14:45, 18.81it/s]

{'loss': 11.478, 'grad_norm': 6.056477069854736, 'learning_rate': 1.7555836493889594e-05, 'epoch': 0.37}


 12%|█▏        | 2334/18984 [02:21<15:11, 18.27it/s]

{'loss': 11.4774, 'grad_norm': 6.148768424987793, 'learning_rate': 1.7545301306363254e-05, 'epoch': 0.37}


 12%|█▏        | 2344/18984 [02:22<15:06, 18.36it/s]

{'loss': 11.5, 'grad_norm': 6.693237781524658, 'learning_rate': 1.7534766118836917e-05, 'epoch': 0.37}


 12%|█▏        | 2352/18984 [02:22<14:47, 18.74it/s]

{'loss': 11.5359, 'grad_norm': 6.555806636810303, 'learning_rate': 1.752423093131058e-05, 'epoch': 0.37}


 12%|█▏        | 2363/18984 [02:23<14:57, 18.52it/s]

{'loss': 11.5948, 'grad_norm': 6.058091163635254, 'learning_rate': 1.751369574378424e-05, 'epoch': 0.37}


 12%|█▎        | 2373/18984 [02:23<14:49, 18.66it/s]

{'loss': 11.598, 'grad_norm': 5.592937469482422, 'learning_rate': 1.7503160556257903e-05, 'epoch': 0.37}


 13%|█▎        | 2383/18984 [02:24<14:49, 18.67it/s]

{'loss': 11.5294, 'grad_norm': 4.665388107299805, 'learning_rate': 1.7492625368731563e-05, 'epoch': 0.38}


 13%|█▎        | 2393/18984 [02:24<14:52, 18.59it/s]

{'loss': 11.4729, 'grad_norm': 5.534492492675781, 'learning_rate': 1.7482090181205226e-05, 'epoch': 0.38}


 13%|█▎        | 2403/18984 [02:25<14:40, 18.84it/s]

{'loss': 11.49, 'grad_norm': 5.909528732299805, 'learning_rate': 1.747155499367889e-05, 'epoch': 0.38}


 13%|█▎        | 2413/18984 [02:25<14:30, 19.05it/s]

{'loss': 11.5153, 'grad_norm': 5.559305191040039, 'learning_rate': 1.7461019806152552e-05, 'epoch': 0.38}


 13%|█▎        | 2423/18984 [02:26<14:48, 18.65it/s]

{'loss': 11.5341, 'grad_norm': 4.240507125854492, 'learning_rate': 1.7450484618626215e-05, 'epoch': 0.38}


 13%|█▎        | 2433/18984 [02:26<14:52, 18.54it/s]

{'loss': 11.513, 'grad_norm': 4.890506267547607, 'learning_rate': 1.7439949431099874e-05, 'epoch': 0.38}


 13%|█▎        | 2443/18984 [02:27<14:49, 18.59it/s]

{'loss': 11.5095, 'grad_norm': 5.838067531585693, 'learning_rate': 1.7429414243573537e-05, 'epoch': 0.39}


 13%|█▎        | 2453/18984 [02:27<14:48, 18.61it/s]

{'loss': 11.5279, 'grad_norm': 6.308444499969482, 'learning_rate': 1.74188790560472e-05, 'epoch': 0.39}


 13%|█▎        | 2463/18984 [02:28<15:12, 18.11it/s]

{'loss': 11.5564, 'grad_norm': 5.47650146484375, 'learning_rate': 1.740834386852086e-05, 'epoch': 0.39}


 13%|█▎        | 2473/18984 [02:29<14:37, 18.81it/s]

{'loss': 11.5638, 'grad_norm': 6.091769695281982, 'learning_rate': 1.7397808680994523e-05, 'epoch': 0.39}


 13%|█▎        | 2483/18984 [02:29<14:35, 18.85it/s]

{'loss': 11.5168, 'grad_norm': 5.7074174880981445, 'learning_rate': 1.7387273493468183e-05, 'epoch': 0.39}


 13%|█▎        | 2493/18984 [02:30<14:42, 18.68it/s]

{'loss': 11.5133, 'grad_norm': 5.335904121398926, 'learning_rate': 1.7376738305941846e-05, 'epoch': 0.39}


 13%|█▎        | 2500/18984 [02:30<14:30, 18.93it/s]

{'loss': 11.5497, 'grad_norm': 4.87552547454834, 'learning_rate': 1.736620311841551e-05, 'epoch': 0.4}


 13%|█▎        | 2513/18984 [02:34<28:45,  9.54it/s]  

{'loss': 11.5413, 'grad_norm': 4.2732462882995605, 'learning_rate': 1.7355667930889172e-05, 'epoch': 0.4}


 13%|█▎        | 2523/18984 [02:34<17:49, 15.39it/s]

{'loss': 11.5673, 'grad_norm': 3.8834004402160645, 'learning_rate': 1.7345132743362835e-05, 'epoch': 0.4}


 13%|█▎        | 2533/18984 [02:35<15:11, 18.06it/s]

{'loss': 11.5773, 'grad_norm': 3.329164981842041, 'learning_rate': 1.7334597555836495e-05, 'epoch': 0.4}


 13%|█▎        | 2543/18984 [02:35<14:48, 18.50it/s]

{'loss': 11.5724, 'grad_norm': 3.1632354259490967, 'learning_rate': 1.7324062368310158e-05, 'epoch': 0.4}


 13%|█▎        | 2553/18984 [02:36<14:36, 18.74it/s]

{'loss': 11.4813, 'grad_norm': 5.67664909362793, 'learning_rate': 1.731352718078382e-05, 'epoch': 0.4}


 14%|█▎        | 2563/18984 [02:36<14:50, 18.44it/s]

{'loss': 11.4823, 'grad_norm': 5.746628284454346, 'learning_rate': 1.730299199325748e-05, 'epoch': 0.4}


 14%|█▎        | 2573/18984 [02:37<14:37, 18.71it/s]

{'loss': 11.5109, 'grad_norm': 5.576230525970459, 'learning_rate': 1.7292456805731144e-05, 'epoch': 0.41}


 14%|█▎        | 2583/18984 [02:37<14:44, 18.54it/s]

{'loss': 11.5324, 'grad_norm': 5.46654748916626, 'learning_rate': 1.7281921618204803e-05, 'epoch': 0.41}


 14%|█▎        | 2593/18984 [02:38<14:51, 18.39it/s]

{'loss': 11.5599, 'grad_norm': 5.054519176483154, 'learning_rate': 1.7271386430678466e-05, 'epoch': 0.41}


 14%|█▎        | 2603/18984 [02:38<15:45, 17.32it/s]

{'loss': 11.5783, 'grad_norm': 4.1219658851623535, 'learning_rate': 1.726085124315213e-05, 'epoch': 0.41}


 14%|█▍        | 2613/18984 [02:39<15:34, 17.51it/s]

{'loss': 11.5778, 'grad_norm': 3.3271312713623047, 'learning_rate': 1.7250316055625792e-05, 'epoch': 0.41}


 14%|█▍        | 2623/18984 [02:40<14:57, 18.23it/s]

{'loss': 11.551, 'grad_norm': 3.3139562606811523, 'learning_rate': 1.7239780868099455e-05, 'epoch': 0.41}


 14%|█▍        | 2634/18984 [02:40<14:27, 18.85it/s]

{'loss': 11.5567, 'grad_norm': 3.4125900268554688, 'learning_rate': 1.7229245680573115e-05, 'epoch': 0.42}


 14%|█▍        | 2644/18984 [02:41<14:34, 18.69it/s]

{'loss': 11.5165, 'grad_norm': 4.571602821350098, 'learning_rate': 1.7218710493046778e-05, 'epoch': 0.42}


 14%|█▍        | 2654/18984 [02:41<14:35, 18.66it/s]

{'loss': 11.4601, 'grad_norm': 6.031376838684082, 'learning_rate': 1.720817530552044e-05, 'epoch': 0.42}


 14%|█▍        | 2664/18984 [02:42<14:34, 18.66it/s]

{'loss': 11.497, 'grad_norm': 6.561659336090088, 'learning_rate': 1.71976401179941e-05, 'epoch': 0.42}


 14%|█▍        | 2674/18984 [02:42<14:37, 18.59it/s]

{'loss': 11.533, 'grad_norm': 6.514002323150635, 'learning_rate': 1.7187104930467764e-05, 'epoch': 0.42}


 14%|█▍        | 2684/18984 [02:43<15:01, 18.08it/s]

{'loss': 11.5915, 'grad_norm': 6.541847229003906, 'learning_rate': 1.7176569742941427e-05, 'epoch': 0.42}


 14%|█▍        | 2692/18984 [02:43<15:21, 17.68it/s]

{'loss': 11.6144, 'grad_norm': 6.368044853210449, 'learning_rate': 1.7166034555415087e-05, 'epoch': 0.43}


 14%|█▍        | 2704/18984 [02:44<14:47, 18.34it/s]

{'loss': 11.6193, 'grad_norm': 4.786823272705078, 'learning_rate': 1.715549936788875e-05, 'epoch': 0.43}


 14%|█▍        | 2714/18984 [02:44<14:27, 18.75it/s]

{'loss': 11.5971, 'grad_norm': 4.194796085357666, 'learning_rate': 1.7144964180362413e-05, 'epoch': 0.43}


 14%|█▍        | 2724/18984 [02:45<14:18, 18.93it/s]

{'loss': 11.5831, 'grad_norm': 4.075684547424316, 'learning_rate': 1.7134428992836076e-05, 'epoch': 0.43}


 14%|█▍        | 2734/18984 [02:46<14:33, 18.61it/s]

{'loss': 11.5671, 'grad_norm': 4.229310989379883, 'learning_rate': 1.7123893805309735e-05, 'epoch': 0.43}


 14%|█▍        | 2742/18984 [02:46<14:20, 18.88it/s]

{'loss': 11.4524, 'grad_norm': 5.763555526733398, 'learning_rate': 1.71133586177834e-05, 'epoch': 0.43}


 15%|█▍        | 2753/18984 [02:47<14:14, 18.99it/s]

{'loss': 11.4588, 'grad_norm': 6.036934852600098, 'learning_rate': 1.710282343025706e-05, 'epoch': 0.43}


 15%|█▍        | 2763/18984 [02:47<14:29, 18.65it/s]

{'loss': 11.4343, 'grad_norm': 6.189779281616211, 'learning_rate': 1.709228824273072e-05, 'epoch': 0.44}


 15%|█▍        | 2773/18984 [02:48<14:21, 18.81it/s]

{'loss': 11.482, 'grad_norm': 6.102536678314209, 'learning_rate': 1.7081753055204384e-05, 'epoch': 0.44}


 15%|█▍        | 2784/18984 [02:48<13:48, 19.56it/s]

{'loss': 11.5322, 'grad_norm': 6.128588676452637, 'learning_rate': 1.7071217867678047e-05, 'epoch': 0.44}


 15%|█▍        | 2794/18984 [02:49<14:15, 18.92it/s]

{'loss': 11.5531, 'grad_norm': 5.835873603820801, 'learning_rate': 1.7060682680151707e-05, 'epoch': 0.44}


 15%|█▍        | 2804/18984 [02:49<14:31, 18.57it/s]

{'loss': 11.5649, 'grad_norm': 4.694636821746826, 'learning_rate': 1.705014749262537e-05, 'epoch': 0.44}


 15%|█▍        | 2814/18984 [02:50<14:16, 18.88it/s]

{'loss': 11.549, 'grad_norm': 5.7497029304504395, 'learning_rate': 1.7039612305099033e-05, 'epoch': 0.44}


 15%|█▍        | 2824/18984 [02:50<14:18, 18.82it/s]

{'loss': 11.5211, 'grad_norm': 6.209151268005371, 'learning_rate': 1.7029077117572696e-05, 'epoch': 0.45}


 15%|█▍        | 2834/18984 [02:51<14:33, 18.48it/s]

{'loss': 11.511, 'grad_norm': 6.083618640899658, 'learning_rate': 1.7018541930046356e-05, 'epoch': 0.45}


 15%|█▍        | 2844/18984 [02:51<14:55, 18.03it/s]

{'loss': 11.5319, 'grad_norm': 5.920045375823975, 'learning_rate': 1.700800674252002e-05, 'epoch': 0.45}


 15%|█▌        | 2854/18984 [02:52<14:21, 18.72it/s]

{'loss': 11.5137, 'grad_norm': 5.8188042640686035, 'learning_rate': 1.6997471554993682e-05, 'epoch': 0.45}


 15%|█▌        | 2864/18984 [02:52<13:59, 19.20it/s]

{'loss': 11.5486, 'grad_norm': 6.009525775909424, 'learning_rate': 1.698693636746734e-05, 'epoch': 0.45}


 15%|█▌        | 2872/18984 [02:53<14:13, 18.87it/s]

{'loss': 11.4954, 'grad_norm': 6.2978515625, 'learning_rate': 1.6976401179941004e-05, 'epoch': 0.45}


 15%|█▌        | 2883/18984 [02:53<14:47, 18.15it/s]

{'loss': 11.4865, 'grad_norm': 6.51642370223999, 'learning_rate': 1.6965865992414667e-05, 'epoch': 0.46}


 15%|█▌        | 2893/18984 [02:54<14:38, 18.31it/s]

{'loss': 11.4872, 'grad_norm': 6.5612077713012695, 'learning_rate': 1.6955330804888327e-05, 'epoch': 0.46}


 15%|█▌        | 2903/18984 [02:55<14:34, 18.38it/s]

{'loss': 11.5471, 'grad_norm': 6.462802886962891, 'learning_rate': 1.694479561736199e-05, 'epoch': 0.46}


 15%|█▌        | 2913/18984 [02:55<14:36, 18.33it/s]

{'loss': 11.5899, 'grad_norm': 5.3004865646362305, 'learning_rate': 1.6934260429835653e-05, 'epoch': 0.46}


 15%|█▌        | 2924/18984 [02:56<14:27, 18.51it/s]

{'loss': 11.548, 'grad_norm': 3.568199872970581, 'learning_rate': 1.6923725242309316e-05, 'epoch': 0.46}


 15%|█▌        | 2934/18984 [02:56<14:41, 18.20it/s]

{'loss': 11.4926, 'grad_norm': 4.865268230438232, 'learning_rate': 1.6913190054782976e-05, 'epoch': 0.46}


 16%|█▌        | 2944/18984 [02:57<14:19, 18.65it/s]

{'loss': 11.4697, 'grad_norm': 5.212942600250244, 'learning_rate': 1.690265486725664e-05, 'epoch': 0.46}


 16%|█▌        | 2954/18984 [02:57<14:32, 18.38it/s]

{'loss': 11.5123, 'grad_norm': 5.550991535186768, 'learning_rate': 1.6892119679730302e-05, 'epoch': 0.47}


 16%|█▌        | 2964/18984 [02:58<14:30, 18.41it/s]

{'loss': 11.4988, 'grad_norm': 5.976451396942139, 'learning_rate': 1.688158449220396e-05, 'epoch': 0.47}


 16%|█▌        | 2972/18984 [02:58<14:26, 18.47it/s]

{'loss': 11.5357, 'grad_norm': 5.650786876678467, 'learning_rate': 1.6871049304677625e-05, 'epoch': 0.47}


 16%|█▌        | 2983/18984 [02:59<14:08, 18.86it/s]

{'loss': 11.584, 'grad_norm': 5.532675266265869, 'learning_rate': 1.6860514117151288e-05, 'epoch': 0.47}


 16%|█▌        | 2993/18984 [02:59<13:58, 19.07it/s]

{'loss': 11.5964, 'grad_norm': 5.346095085144043, 'learning_rate': 1.6849978929624947e-05, 'epoch': 0.47}


 16%|█▌        | 3000/18984 [03:00<14:10, 18.79it/s]

{'loss': 11.5918, 'grad_norm': 5.093421936035156, 'learning_rate': 1.683944374209861e-05, 'epoch': 0.47}


 16%|█▌        | 3013/18984 [03:03<27:55,  9.53it/s]  

{'loss': 11.4748, 'grad_norm': 5.831890106201172, 'learning_rate': 1.6828908554572274e-05, 'epoch': 0.48}


 16%|█▌        | 3023/18984 [03:04<16:59, 15.66it/s]

{'loss': 11.4567, 'grad_norm': 6.201482772827148, 'learning_rate': 1.6818373367045933e-05, 'epoch': 0.48}


 16%|█▌        | 3033/18984 [03:04<15:18, 17.36it/s]

{'loss': 11.5043, 'grad_norm': 6.4142303466796875, 'learning_rate': 1.6807838179519596e-05, 'epoch': 0.48}


 16%|█▌        | 3043/18984 [03:05<14:26, 18.39it/s]

{'loss': 11.518, 'grad_norm': 6.360816478729248, 'learning_rate': 1.679730299199326e-05, 'epoch': 0.48}


 16%|█▌        | 3053/18984 [03:06<15:25, 17.21it/s]

{'loss': 11.5757, 'grad_norm': 6.6568098068237305, 'learning_rate': 1.6786767804466922e-05, 'epoch': 0.48}


 16%|█▌        | 3063/18984 [03:06<15:05, 17.58it/s]

{'loss': 11.6122, 'grad_norm': 6.680734157562256, 'learning_rate': 1.6776232616940582e-05, 'epoch': 0.48}


 16%|█▌        | 3073/18984 [03:07<14:40, 18.07it/s]

{'loss': 11.6715, 'grad_norm': 6.577372074127197, 'learning_rate': 1.6765697429414245e-05, 'epoch': 0.49}


 16%|█▌        | 3084/18984 [03:07<14:21, 18.45it/s]

{'loss': 11.7174, 'grad_norm': 5.941869258880615, 'learning_rate': 1.6755162241887908e-05, 'epoch': 0.49}


 16%|█▋        | 3094/18984 [03:08<14:10, 18.67it/s]

{'loss': 11.6469, 'grad_norm': 6.568626880645752, 'learning_rate': 1.6744627054361568e-05, 'epoch': 0.49}


 16%|█▋        | 3104/18984 [03:08<13:52, 19.06it/s]

{'loss': 11.4467, 'grad_norm': 6.295217037200928, 'learning_rate': 1.673409186683523e-05, 'epoch': 0.49}


 16%|█▋        | 3114/18984 [03:09<13:46, 19.20it/s]

{'loss': 11.4089, 'grad_norm': 6.25247859954834, 'learning_rate': 1.6723556679308894e-05, 'epoch': 0.49}


 16%|█▋        | 3124/18984 [03:09<14:10, 18.65it/s]

{'loss': 11.4258, 'grad_norm': 6.54896879196167, 'learning_rate': 1.6713021491782553e-05, 'epoch': 0.49}


 17%|█▋        | 3134/18984 [03:10<14:04, 18.77it/s]

{'loss': 11.4824, 'grad_norm': 6.365739345550537, 'learning_rate': 1.6702486304256216e-05, 'epoch': 0.49}


 17%|█▋        | 3143/18984 [03:10<13:48, 19.12it/s]

{'loss': 11.5059, 'grad_norm': 6.013182640075684, 'learning_rate': 1.669195111672988e-05, 'epoch': 0.5}


 17%|█▋        | 3153/18984 [03:11<13:45, 19.17it/s]

{'loss': 11.5323, 'grad_norm': 5.456315517425537, 'learning_rate': 1.6681415929203543e-05, 'epoch': 0.5}


 17%|█▋        | 3163/18984 [03:12<14:02, 18.78it/s]

{'loss': 11.5561, 'grad_norm': 4.922454833984375, 'learning_rate': 1.6670880741677202e-05, 'epoch': 0.5}


 17%|█▋        | 3173/18984 [03:12<13:59, 18.83it/s]

{'loss': 11.513, 'grad_norm': 5.514484882354736, 'learning_rate': 1.6660345554150865e-05, 'epoch': 0.5}


 17%|█▋        | 3183/18984 [03:13<14:07, 18.64it/s]

{'loss': 11.5299, 'grad_norm': 5.603752613067627, 'learning_rate': 1.664981036662453e-05, 'epoch': 0.5}


 17%|█▋        | 3193/18984 [03:13<13:48, 19.06it/s]

{'loss': 11.5569, 'grad_norm': 5.181178569793701, 'learning_rate': 1.6639275179098188e-05, 'epoch': 0.5}


 17%|█▋        | 3203/18984 [03:14<14:14, 18.46it/s]

{'loss': 11.5536, 'grad_norm': 4.877265453338623, 'learning_rate': 1.662873999157185e-05, 'epoch': 0.51}


 17%|█▋        | 3213/18984 [03:14<14:00, 18.75it/s]

{'loss': 11.5764, 'grad_norm': 4.409954071044922, 'learning_rate': 1.6618204804045514e-05, 'epoch': 0.51}


 17%|█▋        | 3223/18984 [03:15<13:57, 18.81it/s]

{'loss': 11.5841, 'grad_norm': 4.753412246704102, 'learning_rate': 1.6607669616519174e-05, 'epoch': 0.51}


 17%|█▋        | 3234/18984 [03:15<13:44, 19.09it/s]

{'loss': 11.5409, 'grad_norm': 4.466461658477783, 'learning_rate': 1.6597134428992837e-05, 'epoch': 0.51}


 17%|█▋        | 3244/18984 [03:16<13:45, 19.06it/s]

{'loss': 11.4599, 'grad_norm': 6.866162300109863, 'learning_rate': 1.65865992414665e-05, 'epoch': 0.51}


 17%|█▋        | 3252/18984 [03:16<13:37, 19.24it/s]

{'loss': 11.4513, 'grad_norm': 6.8803019523620605, 'learning_rate': 1.6576064053940163e-05, 'epoch': 0.51}


 17%|█▋        | 3263/18984 [03:17<13:59, 18.74it/s]

{'loss': 11.5455, 'grad_norm': 6.768038272857666, 'learning_rate': 1.6565528866413823e-05, 'epoch': 0.52}


 17%|█▋        | 3273/18984 [03:17<13:44, 19.05it/s]

{'loss': 11.5903, 'grad_norm': 6.702284812927246, 'learning_rate': 1.6554993678887486e-05, 'epoch': 0.52}


 17%|█▋        | 3283/18984 [03:18<13:57, 18.75it/s]

{'loss': 11.6478, 'grad_norm': 6.4452643394470215, 'learning_rate': 1.654445849136115e-05, 'epoch': 0.52}


 17%|█▋        | 3293/18984 [03:18<13:50, 18.89it/s]

{'loss': 11.5524, 'grad_norm': 4.632907390594482, 'learning_rate': 1.6533923303834808e-05, 'epoch': 0.52}


 17%|█▋        | 3303/18984 [03:19<13:54, 18.78it/s]

{'loss': 11.4251, 'grad_norm': 5.989138126373291, 'learning_rate': 1.652338811630847e-05, 'epoch': 0.52}


 17%|█▋        | 3313/18984 [03:19<14:16, 18.29it/s]

{'loss': 11.4304, 'grad_norm': 6.513682842254639, 'learning_rate': 1.651285292878213e-05, 'epoch': 0.52}


 18%|█▊        | 3323/18984 [03:20<14:06, 18.50it/s]

{'loss': 11.4764, 'grad_norm': 6.65117883682251, 'learning_rate': 1.6502317741255794e-05, 'epoch': 0.52}


 18%|█▊        | 3333/18984 [03:21<14:07, 18.46it/s]

{'loss': 11.5255, 'grad_norm': 6.663359642028809, 'learning_rate': 1.6491782553729457e-05, 'epoch': 0.53}


 18%|█▊        | 3343/18984 [03:21<13:54, 18.73it/s]

{'loss': 11.5844, 'grad_norm': 6.632983684539795, 'learning_rate': 1.648124736620312e-05, 'epoch': 0.53}


 18%|█▊        | 3353/18984 [03:22<13:53, 18.75it/s]

{'loss': 11.6079, 'grad_norm': 6.3988728523254395, 'learning_rate': 1.6470712178676783e-05, 'epoch': 0.53}


 18%|█▊        | 3364/18984 [03:22<13:56, 18.67it/s]

{'loss': 11.5595, 'grad_norm': 6.3622941970825195, 'learning_rate': 1.6460176991150443e-05, 'epoch': 0.53}


 18%|█▊        | 3374/18984 [03:23<14:15, 18.26it/s]

{'loss': 11.4746, 'grad_norm': 6.570395469665527, 'learning_rate': 1.6449641803624106e-05, 'epoch': 0.53}


 18%|█▊        | 3384/18984 [03:23<13:41, 19.00it/s]

{'loss': 11.4667, 'grad_norm': 6.660427570343018, 'learning_rate': 1.643910661609777e-05, 'epoch': 0.53}


 18%|█▊        | 3394/18984 [03:24<14:02, 18.51it/s]

{'loss': 11.5115, 'grad_norm': 6.607781410217285, 'learning_rate': 1.642857142857143e-05, 'epoch': 0.54}


 18%|█▊        | 3404/18984 [03:24<13:56, 18.63it/s]

{'loss': 11.5724, 'grad_norm': 6.183868885040283, 'learning_rate': 1.641803624104509e-05, 'epoch': 0.54}


 18%|█▊        | 3414/18984 [03:25<13:39, 18.99it/s]

{'loss': 11.6042, 'grad_norm': 5.3532915115356445, 'learning_rate': 1.640750105351875e-05, 'epoch': 0.54}


 18%|█▊        | 3424/18984 [03:25<13:49, 18.75it/s]

{'loss': 11.5936, 'grad_norm': 4.562872886657715, 'learning_rate': 1.6396965865992414e-05, 'epoch': 0.54}


 18%|█▊        | 3434/18984 [03:26<14:06, 18.38it/s]

{'loss': 11.5516, 'grad_norm': 4.32092809677124, 'learning_rate': 1.6386430678466077e-05, 'epoch': 0.54}


 18%|█▊        | 3444/18984 [03:27<13:58, 18.54it/s]

{'loss': 11.5795, 'grad_norm': 4.0873332023620605, 'learning_rate': 1.637589549093974e-05, 'epoch': 0.54}


 18%|█▊        | 3454/18984 [03:27<13:39, 18.96it/s]

{'loss': 11.5143, 'grad_norm': 4.22154426574707, 'learning_rate': 1.6365360303413403e-05, 'epoch': 0.55}


 18%|█▊        | 3464/18984 [03:28<13:30, 19.14it/s]

{'loss': 11.4585, 'grad_norm': 5.970331192016602, 'learning_rate': 1.6354825115887067e-05, 'epoch': 0.55}


 18%|█▊        | 3474/18984 [03:28<14:05, 18.34it/s]

{'loss': 11.4693, 'grad_norm': 5.9828033447265625, 'learning_rate': 1.6344289928360726e-05, 'epoch': 0.55}


 18%|█▊        | 3483/18984 [03:29<13:38, 18.93it/s]

{'loss': 11.5241, 'grad_norm': 5.933938503265381, 'learning_rate': 1.633375474083439e-05, 'epoch': 0.55}


 18%|█▊        | 3493/18984 [03:29<13:49, 18.66it/s]

{'loss': 11.5543, 'grad_norm': 5.823047161102295, 'learning_rate': 1.632321955330805e-05, 'epoch': 0.55}


 18%|█▊        | 3500/18984 [03:30<13:44, 18.77it/s]

{'loss': 11.6064, 'grad_norm': 5.436305522918701, 'learning_rate': 1.6312684365781712e-05, 'epoch': 0.55}


 19%|█▊        | 3513/18984 [03:33<27:35,  9.35it/s]  

{'loss': 11.5935, 'grad_norm': 4.690250396728516, 'learning_rate': 1.6302149178255375e-05, 'epoch': 0.55}


 19%|█▊        | 3523/18984 [03:34<15:49, 16.28it/s]

{'loss': 11.5396, 'grad_norm': 4.913275718688965, 'learning_rate': 1.6291613990729035e-05, 'epoch': 0.56}


 19%|█▊        | 3533/18984 [03:34<13:55, 18.49it/s]

{'loss': 11.4788, 'grad_norm': 5.842208385467529, 'learning_rate': 1.6281078803202698e-05, 'epoch': 0.56}


 19%|█▊        | 3543/18984 [03:35<13:38, 18.87it/s]

{'loss': 11.4686, 'grad_norm': 5.903475284576416, 'learning_rate': 1.627054361567636e-05, 'epoch': 0.56}


 19%|█▊        | 3553/18984 [03:35<13:35, 18.91it/s]

{'loss': 11.4495, 'grad_norm': 6.350930213928223, 'learning_rate': 1.6260008428150024e-05, 'epoch': 0.56}


 19%|█▉        | 3563/18984 [03:36<13:41, 18.76it/s]

{'loss': 11.4759, 'grad_norm': 6.48610782623291, 'learning_rate': 1.6249473240623687e-05, 'epoch': 0.56}


 19%|█▉        | 3573/18984 [03:36<13:32, 18.96it/s]

{'loss': 11.5446, 'grad_norm': 4.322790145874023, 'learning_rate': 1.6238938053097346e-05, 'epoch': 0.56}


 19%|█▉        | 3583/18984 [03:37<13:36, 18.86it/s]

{'loss': 11.5555, 'grad_norm': 4.289000988006592, 'learning_rate': 1.622840286557101e-05, 'epoch': 0.57}


 19%|█▉        | 3594/18984 [03:37<13:27, 19.05it/s]

{'loss': 11.5108, 'grad_norm': 4.910270690917969, 'learning_rate': 1.621786767804467e-05, 'epoch': 0.57}


 19%|█▉        | 3604/18984 [03:38<13:22, 19.17it/s]

{'loss': 11.526, 'grad_norm': 5.120728969573975, 'learning_rate': 1.6207332490518332e-05, 'epoch': 0.57}


 19%|█▉        | 3614/18984 [03:38<13:18, 19.25it/s]

{'loss': 11.534, 'grad_norm': 5.129228591918945, 'learning_rate': 1.6196797302991995e-05, 'epoch': 0.57}


 19%|█▉        | 3624/18984 [03:39<13:37, 18.78it/s]

{'loss': 11.5582, 'grad_norm': 4.775640964508057, 'learning_rate': 1.6186262115465655e-05, 'epoch': 0.57}


 19%|█▉        | 3634/18984 [03:40<13:39, 18.72it/s]

{'loss': 11.5573, 'grad_norm': 4.171814918518066, 'learning_rate': 1.6175726927939318e-05, 'epoch': 0.57}


 19%|█▉        | 3643/18984 [03:40<13:45, 18.58it/s]

{'loss': 11.4937, 'grad_norm': 5.9394450187683105, 'learning_rate': 1.616519174041298e-05, 'epoch': 0.58}


 19%|█▉        | 3653/18984 [03:41<13:37, 18.75it/s]

{'loss': 11.5553, 'grad_norm': 4.708479404449463, 'learning_rate': 1.6154656552886644e-05, 'epoch': 0.58}


 19%|█▉        | 3663/18984 [03:41<14:15, 17.90it/s]

{'loss': 11.5744, 'grad_norm': 4.850809574127197, 'learning_rate': 1.6144121365360307e-05, 'epoch': 0.58}


 19%|█▉        | 3673/18984 [03:42<13:59, 18.25it/s]

{'loss': 11.5295, 'grad_norm': 5.224264144897461, 'learning_rate': 1.6133586177833967e-05, 'epoch': 0.58}


 19%|█▉        | 3683/18984 [03:42<14:12, 17.95it/s]

{'loss': 11.4835, 'grad_norm': 5.26925802230835, 'learning_rate': 1.612305099030763e-05, 'epoch': 0.58}


 19%|█▉        | 3693/18984 [03:43<13:46, 18.50it/s]

{'loss': 11.5492, 'grad_norm': 5.500238418579102, 'learning_rate': 1.611251580278129e-05, 'epoch': 0.58}


 20%|█▉        | 3703/18984 [03:43<13:24, 18.99it/s]

{'loss': 11.5859, 'grad_norm': 5.361262321472168, 'learning_rate': 1.6101980615254953e-05, 'epoch': 0.58}


 20%|█▉        | 3713/18984 [03:44<13:47, 18.45it/s]

{'loss': 11.5782, 'grad_norm': 5.42003059387207, 'learning_rate': 1.6091445427728616e-05, 'epoch': 0.59}


 20%|█▉        | 3723/18984 [03:44<13:43, 18.54it/s]

{'loss': 11.6449, 'grad_norm': 5.345873832702637, 'learning_rate': 1.6080910240202275e-05, 'epoch': 0.59}


 20%|█▉        | 3733/18984 [03:45<13:40, 18.59it/s]

{'loss': 11.6245, 'grad_norm': 4.390584468841553, 'learning_rate': 1.6070375052675938e-05, 'epoch': 0.59}


 20%|█▉        | 3744/18984 [03:45<13:09, 19.29it/s]

{'loss': 11.4934, 'grad_norm': 5.355727672576904, 'learning_rate': 1.60598398651496e-05, 'epoch': 0.59}


 20%|█▉        | 3754/18984 [03:46<13:08, 19.31it/s]

{'loss': 11.4955, 'grad_norm': 5.698140621185303, 'learning_rate': 1.6049304677623264e-05, 'epoch': 0.59}


 20%|█▉        | 3764/18984 [03:47<13:20, 19.02it/s]

{'loss': 11.5004, 'grad_norm': 5.609325408935547, 'learning_rate': 1.6038769490096927e-05, 'epoch': 0.59}


 20%|█▉        | 3774/18984 [03:47<13:14, 19.14it/s]

{'loss': 11.5287, 'grad_norm': 5.721851825714111, 'learning_rate': 1.6028234302570587e-05, 'epoch': 0.6}


 20%|█▉        | 3784/18984 [03:48<13:34, 18.65it/s]

{'loss': 11.5869, 'grad_norm': 5.565725326538086, 'learning_rate': 1.601769911504425e-05, 'epoch': 0.6}


 20%|█▉        | 3794/18984 [03:48<13:22, 18.93it/s]

{'loss': 11.577, 'grad_norm': 5.1128973960876465, 'learning_rate': 1.600716392751791e-05, 'epoch': 0.6}


 20%|██        | 3803/18984 [03:49<13:29, 18.76it/s]

{'loss': 11.5863, 'grad_norm': 4.803117752075195, 'learning_rate': 1.5996628739991573e-05, 'epoch': 0.6}


 20%|██        | 3813/18984 [03:49<13:08, 19.24it/s]

{'loss': 11.5766, 'grad_norm': 4.433225631713867, 'learning_rate': 1.5986093552465236e-05, 'epoch': 0.6}


 20%|██        | 3823/18984 [03:50<13:01, 19.40it/s]

{'loss': 11.5397, 'grad_norm': 4.342889308929443, 'learning_rate': 1.5975558364938896e-05, 'epoch': 0.6}


 20%|██        | 3833/18984 [03:50<13:22, 18.87it/s]

{'loss': 11.4129, 'grad_norm': 6.437836647033691, 'learning_rate': 1.596502317741256e-05, 'epoch': 0.61}


 20%|██        | 3844/18984 [03:51<13:14, 19.06it/s]

{'loss': 11.4331, 'grad_norm': 6.649377346038818, 'learning_rate': 1.595448798988622e-05, 'epoch': 0.61}


 20%|██        | 3854/18984 [03:51<13:19, 18.91it/s]

{'loss': 11.4798, 'grad_norm': 6.425291061401367, 'learning_rate': 1.5943952802359885e-05, 'epoch': 0.61}


 20%|██        | 3864/18984 [03:52<13:31, 18.63it/s]

{'loss': 11.4931, 'grad_norm': 6.370870113372803, 'learning_rate': 1.5933417614833548e-05, 'epoch': 0.61}


 20%|██        | 3874/18984 [03:52<13:31, 18.63it/s]

{'loss': 11.5655, 'grad_norm': 6.292261123657227, 'learning_rate': 1.5922882427307207e-05, 'epoch': 0.61}


 20%|██        | 3883/18984 [03:53<13:07, 19.18it/s]

{'loss': 11.6097, 'grad_norm': 5.705821990966797, 'learning_rate': 1.591234723978087e-05, 'epoch': 0.61}


 21%|██        | 3893/18984 [03:53<13:05, 19.21it/s]

{'loss': 11.5574, 'grad_norm': 5.166132926940918, 'learning_rate': 1.590181205225453e-05, 'epoch': 0.61}


 21%|██        | 3903/18984 [03:54<13:04, 19.22it/s]

{'loss': 11.4477, 'grad_norm': 6.105010986328125, 'learning_rate': 1.5891276864728193e-05, 'epoch': 0.62}


 21%|██        | 3913/18984 [03:54<13:19, 18.85it/s]

{'loss': 11.5153, 'grad_norm': 6.116929531097412, 'learning_rate': 1.5880741677201856e-05, 'epoch': 0.62}


 21%|██        | 3923/18984 [03:55<13:14, 18.96it/s]

{'loss': 11.5549, 'grad_norm': 5.785914421081543, 'learning_rate': 1.5870206489675516e-05, 'epoch': 0.62}


 21%|██        | 3933/18984 [03:55<13:20, 18.80it/s]

{'loss': 11.5623, 'grad_norm': 5.519758224487305, 'learning_rate': 1.585967130214918e-05, 'epoch': 0.62}


 21%|██        | 3944/18984 [03:56<13:05, 19.14it/s]

{'loss': 11.562, 'grad_norm': 5.327033519744873, 'learning_rate': 1.5849136114622842e-05, 'epoch': 0.62}


 21%|██        | 3954/18984 [03:57<13:06, 19.11it/s]

{'loss': 11.5249, 'grad_norm': 5.804076671600342, 'learning_rate': 1.5838600927096505e-05, 'epoch': 0.62}


 21%|██        | 3964/18984 [03:57<13:14, 18.91it/s]

{'loss': 11.4969, 'grad_norm': 5.7069854736328125, 'learning_rate': 1.5828065739570168e-05, 'epoch': 0.63}


 21%|██        | 3974/18984 [03:58<13:07, 19.05it/s]

{'loss': 11.5174, 'grad_norm': 5.486670970916748, 'learning_rate': 1.5817530552043828e-05, 'epoch': 0.63}


 21%|██        | 3984/18984 [03:58<13:13, 18.91it/s]

{'loss': 11.5438, 'grad_norm': 5.507224082946777, 'learning_rate': 1.580699536451749e-05, 'epoch': 0.63}


 21%|██        | 3994/18984 [03:59<13:11, 18.94it/s]

{'loss': 11.5646, 'grad_norm': 5.1612982749938965, 'learning_rate': 1.579646017699115e-05, 'epoch': 0.63}


 21%|██        | 4000/18984 [03:59<13:40, 18.26it/s]

{'loss': 11.5888, 'grad_norm': 5.203604698181152, 'learning_rate': 1.5785924989464813e-05, 'epoch': 0.63}


 21%|██        | 4014/18984 [04:03<25:49,  9.66it/s]

{'loss': 11.5799, 'grad_norm': 5.3233771324157715, 'learning_rate': 1.5775389801938476e-05, 'epoch': 0.63}


 21%|██        | 4024/18984 [04:03<15:17, 16.31it/s]

{'loss': 11.537, 'grad_norm': 5.291522979736328, 'learning_rate': 1.5764854614412136e-05, 'epoch': 0.64}


 21%|██        | 4034/18984 [04:04<13:25, 18.55it/s]

{'loss': 11.5313, 'grad_norm': 5.220381736755371, 'learning_rate': 1.57543194268858e-05, 'epoch': 0.64}


 21%|██▏       | 4043/18984 [04:04<13:06, 19.00it/s]

{'loss': 11.4825, 'grad_norm': 5.67849063873291, 'learning_rate': 1.5743784239359462e-05, 'epoch': 0.64}


 21%|██▏       | 4053/18984 [04:05<13:03, 19.06it/s]

{'loss': 11.483, 'grad_norm': 5.748435020446777, 'learning_rate': 1.5733249051833125e-05, 'epoch': 0.64}


 21%|██▏       | 4063/18984 [04:05<13:06, 18.98it/s]

{'loss': 11.5371, 'grad_norm': 5.1359758377075195, 'learning_rate': 1.5722713864306788e-05, 'epoch': 0.64}


 21%|██▏       | 4073/18984 [04:06<13:05, 18.99it/s]

{'loss': 11.5441, 'grad_norm': 4.925698757171631, 'learning_rate': 1.5712178676780448e-05, 'epoch': 0.64}


 22%|██▏       | 4084/18984 [04:06<12:50, 19.34it/s]

{'loss': 11.5398, 'grad_norm': 5.296452045440674, 'learning_rate': 1.570164348925411e-05, 'epoch': 0.64}


 22%|██▏       | 4094/18984 [04:07<13:27, 18.44it/s]

{'loss': 11.569, 'grad_norm': 5.556230545043945, 'learning_rate': 1.569110830172777e-05, 'epoch': 0.65}


 22%|██▏       | 4104/18984 [04:07<13:10, 18.83it/s]

{'loss': 11.6057, 'grad_norm': 5.552095890045166, 'learning_rate': 1.5680573114201434e-05, 'epoch': 0.65}


 22%|██▏       | 4114/18984 [04:08<13:02, 19.00it/s]

{'loss': 11.6322, 'grad_norm': 5.549143314361572, 'learning_rate': 1.5670037926675097e-05, 'epoch': 0.65}


 22%|██▏       | 4124/18984 [04:08<13:30, 18.33it/s]

{'loss': 11.6145, 'grad_norm': 5.801091194152832, 'learning_rate': 1.5659502739148756e-05, 'epoch': 0.65}


 22%|██▏       | 4134/18984 [04:09<13:08, 18.83it/s]

{'loss': 11.5915, 'grad_norm': 5.38149356842041, 'learning_rate': 1.564896755162242e-05, 'epoch': 0.65}


 22%|██▏       | 4144/18984 [04:09<13:03, 18.95it/s]

{'loss': 11.475, 'grad_norm': 4.6694746017456055, 'learning_rate': 1.5638432364096082e-05, 'epoch': 0.65}


 22%|██▏       | 4154/18984 [04:10<13:28, 18.35it/s]

{'loss': 11.4548, 'grad_norm': 4.987158298492432, 'learning_rate': 1.5627897176569746e-05, 'epoch': 0.66}


 22%|██▏       | 4163/18984 [04:10<12:59, 19.01it/s]

{'loss': 11.4605, 'grad_norm': 5.485151767730713, 'learning_rate': 1.561736198904341e-05, 'epoch': 0.66}


 22%|██▏       | 4173/18984 [04:11<13:28, 18.31it/s]

{'loss': 11.4532, 'grad_norm': 5.999688148498535, 'learning_rate': 1.5606826801517068e-05, 'epoch': 0.66}


 22%|██▏       | 4183/18984 [04:12<13:20, 18.49it/s]

{'loss': 11.497, 'grad_norm': 5.8286566734313965, 'learning_rate': 1.559629161399073e-05, 'epoch': 0.66}


 22%|██▏       | 4193/18984 [04:12<12:52, 19.15it/s]

{'loss': 11.5083, 'grad_norm': 6.340852737426758, 'learning_rate': 1.558575642646439e-05, 'epoch': 0.66}


 22%|██▏       | 4203/18984 [04:13<13:06, 18.79it/s]

{'loss': 11.5661, 'grad_norm': 6.5089874267578125, 'learning_rate': 1.5575221238938054e-05, 'epoch': 0.66}


 22%|██▏       | 4213/18984 [04:13<13:11, 18.66it/s]

{'loss': 11.6194, 'grad_norm': 5.279449939727783, 'learning_rate': 1.5564686051411717e-05, 'epoch': 0.67}


 22%|██▏       | 4223/18984 [04:14<12:46, 19.25it/s]

{'loss': 11.5266, 'grad_norm': 5.145142078399658, 'learning_rate': 1.5554150863885377e-05, 'epoch': 0.67}


 22%|██▏       | 4233/18984 [04:14<13:07, 18.73it/s]

{'loss': 11.4874, 'grad_norm': 5.589077472686768, 'learning_rate': 1.554361567635904e-05, 'epoch': 0.67}


 22%|██▏       | 4243/18984 [04:15<13:15, 18.54it/s]

{'loss': 11.4855, 'grad_norm': 5.8759565353393555, 'learning_rate': 1.5533080488832703e-05, 'epoch': 0.67}


 22%|██▏       | 4253/18984 [04:15<13:29, 18.19it/s]

{'loss': 11.5288, 'grad_norm': 5.62993049621582, 'learning_rate': 1.5522545301306366e-05, 'epoch': 0.67}


 22%|██▏       | 4264/18984 [04:16<12:52, 19.06it/s]

{'loss': 11.5713, 'grad_norm': 5.5438761711120605, 'learning_rate': 1.551201011378003e-05, 'epoch': 0.67}


 23%|██▎       | 4274/18984 [04:16<13:04, 18.76it/s]

{'loss': 11.5602, 'grad_norm': 4.37656831741333, 'learning_rate': 1.550147492625369e-05, 'epoch': 0.67}


 23%|██▎       | 4284/18984 [04:17<12:56, 18.93it/s]

{'loss': 11.4914, 'grad_norm': 5.35875940322876, 'learning_rate': 1.549093973872735e-05, 'epoch': 0.68}


 23%|██▎       | 4294/18984 [04:17<13:13, 18.52it/s]

{'loss': 11.5488, 'grad_norm': 5.433990001678467, 'learning_rate': 1.5480404551201015e-05, 'epoch': 0.68}


 23%|██▎       | 4304/18984 [04:18<12:54, 18.95it/s]

{'loss': 11.5449, 'grad_norm': 5.524067401885986, 'learning_rate': 1.5469869363674674e-05, 'epoch': 0.68}


 23%|██▎       | 4314/18984 [04:19<13:16, 18.42it/s]

{'loss': 11.5914, 'grad_norm': 5.350133419036865, 'learning_rate': 1.5459334176148337e-05, 'epoch': 0.68}


 23%|██▎       | 4324/18984 [04:19<13:19, 18.33it/s]

{'loss': 11.6291, 'grad_norm': 5.524967670440674, 'learning_rate': 1.5448798988621997e-05, 'epoch': 0.68}


 23%|██▎       | 4334/18984 [04:20<13:15, 18.41it/s]

{'loss': 11.6168, 'grad_norm': 5.324610233306885, 'learning_rate': 1.543826380109566e-05, 'epoch': 0.68}


 23%|██▎       | 4344/18984 [04:20<13:11, 18.49it/s]

{'loss': 11.5772, 'grad_norm': 5.099666118621826, 'learning_rate': 1.5427728613569323e-05, 'epoch': 0.69}


 23%|██▎       | 4354/18984 [04:21<13:21, 18.26it/s]

{'loss': 11.501, 'grad_norm': 5.270315170288086, 'learning_rate': 1.5417193426042986e-05, 'epoch': 0.69}


 23%|██▎       | 4363/18984 [04:21<13:15, 18.38it/s]

{'loss': 11.4413, 'grad_norm': 5.766093730926514, 'learning_rate': 1.5406658238516646e-05, 'epoch': 0.69}


 23%|██▎       | 4373/18984 [04:22<13:26, 18.12it/s]

{'loss': 11.4747, 'grad_norm': 5.758742809295654, 'learning_rate': 1.539612305099031e-05, 'epoch': 0.69}


 23%|██▎       | 4383/18984 [04:22<13:50, 17.57it/s]

{'loss': 11.5137, 'grad_norm': 5.810871601104736, 'learning_rate': 1.5385587863463972e-05, 'epoch': 0.69}


 23%|██▎       | 4393/18984 [04:23<13:09, 18.48it/s]

{'loss': 11.5621, 'grad_norm': 5.6401214599609375, 'learning_rate': 1.5375052675937635e-05, 'epoch': 0.69}


 23%|██▎       | 4403/18984 [04:23<13:04, 18.58it/s]

{'loss': 11.5586, 'grad_norm': 5.342923164367676, 'learning_rate': 1.5364517488411295e-05, 'epoch': 0.7}


 23%|██▎       | 4414/18984 [04:24<12:48, 18.96it/s]

{'loss': 11.5934, 'grad_norm': 4.242066383361816, 'learning_rate': 1.5353982300884958e-05, 'epoch': 0.7}


 23%|██▎       | 4424/18984 [04:24<12:33, 19.33it/s]

{'loss': 11.5778, 'grad_norm': 3.251598596572876, 'learning_rate': 1.5343447113358617e-05, 'epoch': 0.7}


 23%|██▎       | 4434/18984 [04:25<12:50, 18.89it/s]

{'loss': 11.5231, 'grad_norm': 5.040935039520264, 'learning_rate': 1.533291192583228e-05, 'epoch': 0.7}


 23%|██▎       | 4444/18984 [04:26<13:02, 18.58it/s]

{'loss': 11.4289, 'grad_norm': 6.119016170501709, 'learning_rate': 1.5322376738305943e-05, 'epoch': 0.7}


 23%|██▎       | 4454/18984 [04:26<12:55, 18.74it/s]

{'loss': 11.4413, 'grad_norm': 6.3642659187316895, 'learning_rate': 1.5311841550779606e-05, 'epoch': 0.7}


 24%|██▎       | 4464/18984 [04:27<12:58, 18.66it/s]

{'loss': 11.5021, 'grad_norm': 6.539431095123291, 'learning_rate': 1.5301306363253266e-05, 'epoch': 0.7}


 24%|██▎       | 4474/18984 [04:27<13:08, 18.40it/s]

{'loss': 11.5685, 'grad_norm': 5.748349189758301, 'learning_rate': 1.529077117572693e-05, 'epoch': 0.71}


 24%|██▎       | 4484/18984 [04:28<13:23, 18.04it/s]

{'loss': 11.5181, 'grad_norm': 6.349479675292969, 'learning_rate': 1.5280235988200592e-05, 'epoch': 0.71}


 24%|██▎       | 4494/18984 [04:28<13:15, 18.21it/s]

{'loss': 11.5076, 'grad_norm': 6.349828243255615, 'learning_rate': 1.5269700800674255e-05, 'epoch': 0.71}


 24%|██▎       | 4500/18984 [04:29<13:07, 18.40it/s]

{'loss': 11.565, 'grad_norm': 6.364684104919434, 'learning_rate': 1.5259165613147915e-05, 'epoch': 0.71}


 24%|██▍       | 4514/18984 [04:32<25:04,  9.62it/s]

{'loss': 11.6413, 'grad_norm': 6.287039279937744, 'learning_rate': 1.5248630425621578e-05, 'epoch': 0.71}


 24%|██▍       | 4524/18984 [04:33<15:23, 15.66it/s]

{'loss': 11.6575, 'grad_norm': 5.7437615394592285, 'learning_rate': 1.523809523809524e-05, 'epoch': 0.71}


 24%|██▍       | 4534/18984 [04:33<13:32, 17.78it/s]

{'loss': 11.6791, 'grad_norm': 4.916903495788574, 'learning_rate': 1.52275600505689e-05, 'epoch': 0.72}


 24%|██▍       | 4544/18984 [04:34<13:29, 17.84it/s]

{'loss': 11.6154, 'grad_norm': 4.490325927734375, 'learning_rate': 1.5217024863042564e-05, 'epoch': 0.72}


 24%|██▍       | 4553/18984 [04:34<13:17, 18.10it/s]

{'loss': 11.4818, 'grad_norm': 5.342253684997559, 'learning_rate': 1.5206489675516225e-05, 'epoch': 0.72}


 24%|██▍       | 4563/18984 [04:35<12:54, 18.63it/s]

{'loss': 11.4148, 'grad_norm': 6.500797748565674, 'learning_rate': 1.5195954487989888e-05, 'epoch': 0.72}


 24%|██▍       | 4573/18984 [04:35<12:47, 18.77it/s]

{'loss': 11.4036, 'grad_norm': 6.97475528717041, 'learning_rate': 1.518541930046355e-05, 'epoch': 0.72}


 24%|██▍       | 4583/18984 [04:36<12:35, 19.07it/s]

{'loss': 11.4437, 'grad_norm': 6.379415035247803, 'learning_rate': 1.517488411293721e-05, 'epoch': 0.72}


 24%|██▍       | 4593/18984 [04:36<12:53, 18.60it/s]

{'loss': 11.5108, 'grad_norm': 6.802947044372559, 'learning_rate': 1.5164348925410874e-05, 'epoch': 0.73}


 24%|██▍       | 4604/18984 [04:37<12:25, 19.29it/s]

{'loss': 11.5623, 'grad_norm': 6.893250465393066, 'learning_rate': 1.5153813737884535e-05, 'epoch': 0.73}


 24%|██▍       | 4614/18984 [04:38<12:49, 18.67it/s]

{'loss': 11.6591, 'grad_norm': 6.828449726104736, 'learning_rate': 1.5143278550358198e-05, 'epoch': 0.73}


 24%|██▍       | 4624/18984 [04:38<12:44, 18.78it/s]

{'loss': 11.6803, 'grad_norm': 6.285799980163574, 'learning_rate': 1.513274336283186e-05, 'epoch': 0.73}


 24%|██▍       | 4632/18984 [04:39<13:26, 17.79it/s]

{'loss': 11.6609, 'grad_norm': 6.397904396057129, 'learning_rate': 1.5122208175305521e-05, 'epoch': 0.73}


 24%|██▍       | 4644/18984 [04:39<13:15, 18.04it/s]

{'loss': 11.6116, 'grad_norm': 6.092503547668457, 'learning_rate': 1.5111672987779184e-05, 'epoch': 0.73}


 25%|██▍       | 4654/18984 [04:40<12:54, 18.49it/s]

{'loss': 11.4977, 'grad_norm': 6.585930347442627, 'learning_rate': 1.5101137800252845e-05, 'epoch': 0.73}


 25%|██▍       | 4664/18984 [04:40<12:36, 18.92it/s]

{'loss': 11.4463, 'grad_norm': 6.698732376098633, 'learning_rate': 1.5090602612726508e-05, 'epoch': 0.74}


 25%|██▍       | 4674/18984 [04:41<13:22, 17.84it/s]

{'loss': 11.3846, 'grad_norm': 5.966575622558594, 'learning_rate': 1.508006742520017e-05, 'epoch': 0.74}


 25%|██▍       | 4684/18984 [04:41<12:32, 19.01it/s]

{'loss': 11.4521, 'grad_norm': 6.1910295486450195, 'learning_rate': 1.5069532237673831e-05, 'epoch': 0.74}


 25%|██▍       | 4694/18984 [04:42<13:14, 17.98it/s]

{'loss': 11.5334, 'grad_norm': 6.050600528717041, 'learning_rate': 1.5058997050147494e-05, 'epoch': 0.74}


 25%|██▍       | 4704/18984 [04:42<12:39, 18.81it/s]

{'loss': 11.5578, 'grad_norm': 5.623876094818115, 'learning_rate': 1.5048461862621155e-05, 'epoch': 0.74}


 25%|██▍       | 4714/18984 [04:43<12:50, 18.52it/s]

{'loss': 11.5471, 'grad_norm': 4.892614841461182, 'learning_rate': 1.5037926675094818e-05, 'epoch': 0.74}


 25%|██▍       | 4724/18984 [04:44<12:47, 18.59it/s]

{'loss': 11.4907, 'grad_norm': 5.563862323760986, 'learning_rate': 1.5027391487568478e-05, 'epoch': 0.75}


 25%|██▍       | 4734/18984 [04:44<12:41, 18.72it/s]

{'loss': 11.4868, 'grad_norm': 6.065645217895508, 'learning_rate': 1.5016856300042141e-05, 'epoch': 0.75}


 25%|██▍       | 4744/18984 [04:45<12:41, 18.71it/s]

{'loss': 11.5492, 'grad_norm': 6.180010795593262, 'learning_rate': 1.5006321112515804e-05, 'epoch': 0.75}


 25%|██▌       | 4753/18984 [04:45<12:31, 18.94it/s]

{'loss': 11.5662, 'grad_norm': 6.570683479309082, 'learning_rate': 1.4995785924989466e-05, 'epoch': 0.75}


 25%|██▌       | 4763/18984 [04:46<12:59, 18.24it/s]

{'loss': 11.5187, 'grad_norm': 6.299622535705566, 'learning_rate': 1.4985250737463129e-05, 'epoch': 0.75}


 25%|██▌       | 4774/18984 [04:46<12:39, 18.72it/s]

{'loss': 11.5197, 'grad_norm': 6.503856658935547, 'learning_rate': 1.4974715549936788e-05, 'epoch': 0.75}


 25%|██▌       | 4784/18984 [04:47<12:30, 18.93it/s]

{'loss': 11.5509, 'grad_norm': 6.137994289398193, 'learning_rate': 1.4964180362410451e-05, 'epoch': 0.76}


 25%|██▌       | 4794/18984 [04:47<12:43, 18.60it/s]

{'loss': 11.5901, 'grad_norm': 5.846000671386719, 'learning_rate': 1.4953645174884114e-05, 'epoch': 0.76}


 25%|██▌       | 4804/18984 [04:48<12:44, 18.55it/s]

{'loss': 11.5415, 'grad_norm': 6.484961032867432, 'learning_rate': 1.4943109987357776e-05, 'epoch': 0.76}


 25%|██▌       | 4814/18984 [04:48<12:56, 18.26it/s]

{'loss': 11.4688, 'grad_norm': 6.684693336486816, 'learning_rate': 1.4932574799831439e-05, 'epoch': 0.76}


 25%|██▌       | 4824/18984 [04:49<12:41, 18.60it/s]

{'loss': 11.5071, 'grad_norm': 6.469518184661865, 'learning_rate': 1.4922039612305098e-05, 'epoch': 0.76}


 25%|██▌       | 4834/18984 [04:49<12:45, 18.50it/s]

{'loss': 11.5422, 'grad_norm': 6.128827095031738, 'learning_rate': 1.4911504424778761e-05, 'epoch': 0.76}


 26%|██▌       | 4844/18984 [04:50<12:48, 18.41it/s]

{'loss': 11.5027, 'grad_norm': 5.812535762786865, 'learning_rate': 1.4900969237252425e-05, 'epoch': 0.76}


 26%|██▌       | 4854/18984 [04:51<12:36, 18.69it/s]

{'loss': 11.5272, 'grad_norm': 5.930678367614746, 'learning_rate': 1.4890434049726086e-05, 'epoch': 0.77}


 26%|██▌       | 4864/18984 [04:51<12:49, 18.36it/s]

{'loss': 11.4937, 'grad_norm': 6.000586986541748, 'learning_rate': 1.4879898862199749e-05, 'epoch': 0.77}


 26%|██▌       | 4874/18984 [04:52<12:50, 18.31it/s]

{'loss': 11.5678, 'grad_norm': 5.6796393394470215, 'learning_rate': 1.4869363674673409e-05, 'epoch': 0.77}


 26%|██▌       | 4884/18984 [04:52<12:39, 18.56it/s]

{'loss': 11.5647, 'grad_norm': 6.134646415710449, 'learning_rate': 1.4858828487147072e-05, 'epoch': 0.77}


 26%|██▌       | 4894/18984 [04:53<12:30, 18.78it/s]

{'loss': 11.5706, 'grad_norm': 4.596354961395264, 'learning_rate': 1.4848293299620735e-05, 'epoch': 0.77}


 26%|██▌       | 4903/18984 [04:53<12:22, 18.97it/s]

{'loss': 11.4996, 'grad_norm': 6.153382301330566, 'learning_rate': 1.4837758112094396e-05, 'epoch': 0.77}


 26%|██▌       | 4913/18984 [04:54<12:23, 18.93it/s]

{'loss': 11.513, 'grad_norm': 6.7660651206970215, 'learning_rate': 1.4827222924568059e-05, 'epoch': 0.78}


 26%|██▌       | 4923/18984 [04:54<12:24, 18.90it/s]

{'loss': 11.5561, 'grad_norm': 5.673181533813477, 'learning_rate': 1.4816687737041719e-05, 'epoch': 0.78}


 26%|██▌       | 4933/18984 [04:55<12:03, 19.41it/s]

{'loss': 11.5913, 'grad_norm': 4.37532901763916, 'learning_rate': 1.4806152549515382e-05, 'epoch': 0.78}


 26%|██▌       | 4943/18984 [04:55<12:36, 18.55it/s]

{'loss': 11.5644, 'grad_norm': 3.3067924976348877, 'learning_rate': 1.4795617361989045e-05, 'epoch': 0.78}


 26%|██▌       | 4953/18984 [04:56<12:21, 18.92it/s]

{'loss': 11.5173, 'grad_norm': 3.921475410461426, 'learning_rate': 1.4785082174462706e-05, 'epoch': 0.78}


 26%|██▌       | 4963/18984 [04:56<12:07, 19.28it/s]

{'loss': 11.5203, 'grad_norm': 5.3842973709106445, 'learning_rate': 1.477454698693637e-05, 'epoch': 0.78}


 26%|██▌       | 4973/18984 [04:57<12:07, 19.25it/s]

{'loss': 11.5202, 'grad_norm': 5.922453880310059, 'learning_rate': 1.4764011799410029e-05, 'epoch': 0.79}


 26%|██▌       | 4983/18984 [04:57<12:24, 18.81it/s]

{'loss': 11.6268, 'grad_norm': 5.803462505340576, 'learning_rate': 1.4753476611883692e-05, 'epoch': 0.79}


 26%|██▋       | 4993/18984 [04:58<12:19, 18.91it/s]

{'loss': 11.6343, 'grad_norm': 4.555659770965576, 'learning_rate': 1.4742941424357355e-05, 'epoch': 0.79}


 26%|██▋       | 5000/18984 [04:58<12:12, 19.10it/s]

{'loss': 11.6268, 'grad_norm': 3.928400993347168, 'learning_rate': 1.4732406236831016e-05, 'epoch': 0.79}


 26%|██▋       | 5013/18984 [05:02<23:41,  9.83it/s]

{'loss': 11.6266, 'grad_norm': 3.555518627166748, 'learning_rate': 1.472187104930468e-05, 'epoch': 0.79}


 26%|██▋       | 5023/18984 [05:02<14:32, 16.00it/s]

{'loss': 11.6215, 'grad_norm': 3.628790855407715, 'learning_rate': 1.4711335861778342e-05, 'epoch': 0.79}


 27%|██▋       | 5033/18984 [05:03<13:04, 17.78it/s]

{'loss': 11.6121, 'grad_norm': 3.692584991455078, 'learning_rate': 1.4700800674252002e-05, 'epoch': 0.79}


 27%|██▋       | 5043/18984 [05:03<12:28, 18.63it/s]

{'loss': 11.6017, 'grad_norm': 4.124512672424316, 'learning_rate': 1.4690265486725665e-05, 'epoch': 0.8}


 27%|██▋       | 5053/18984 [05:04<12:27, 18.64it/s]

{'loss': 11.5158, 'grad_norm': 4.952431678771973, 'learning_rate': 1.4679730299199326e-05, 'epoch': 0.8}


 27%|██▋       | 5063/18984 [05:04<12:17, 18.89it/s]

{'loss': 11.3696, 'grad_norm': 6.794622421264648, 'learning_rate': 1.466919511167299e-05, 'epoch': 0.8}


 27%|██▋       | 5073/18984 [05:05<12:20, 18.78it/s]

{'loss': 11.3424, 'grad_norm': 6.860923767089844, 'learning_rate': 1.4658659924146653e-05, 'epoch': 0.8}


 27%|██▋       | 5083/18984 [05:06<12:45, 18.17it/s]

{'loss': 11.4195, 'grad_norm': 6.926253795623779, 'learning_rate': 1.4648124736620312e-05, 'epoch': 0.8}


 27%|██▋       | 5094/18984 [05:06<12:31, 18.47it/s]

{'loss': 11.46, 'grad_norm': 6.67568302154541, 'learning_rate': 1.4637589549093975e-05, 'epoch': 0.8}


 27%|██▋       | 5104/18984 [05:07<12:22, 18.70it/s]

{'loss': 11.5393, 'grad_norm': 6.6842546463012695, 'learning_rate': 1.4627054361567637e-05, 'epoch': 0.81}


 27%|██▋       | 5114/18984 [05:07<12:03, 19.18it/s]

{'loss': 11.6117, 'grad_norm': 6.725583076477051, 'learning_rate': 1.46165191740413e-05, 'epoch': 0.81}


 27%|██▋       | 5124/18984 [05:08<12:28, 18.51it/s]

{'loss': 11.6774, 'grad_norm': 6.6916608810424805, 'learning_rate': 1.4605983986514963e-05, 'epoch': 0.81}


 27%|██▋       | 5134/18984 [05:08<12:14, 18.85it/s]

{'loss': 11.7413, 'grad_norm': 6.654147624969482, 'learning_rate': 1.4595448798988622e-05, 'epoch': 0.81}


 27%|██▋       | 5144/18984 [05:09<12:40, 18.19it/s]

{'loss': 11.7812, 'grad_norm': 6.635855197906494, 'learning_rate': 1.4584913611462285e-05, 'epoch': 0.81}


 27%|██▋       | 5154/18984 [05:09<12:19, 18.70it/s]

{'loss': 11.8213, 'grad_norm': 6.315138339996338, 'learning_rate': 1.4574378423935947e-05, 'epoch': 0.81}


 27%|██▋       | 5164/18984 [05:10<12:24, 18.57it/s]

{'loss': 11.5872, 'grad_norm': 5.405404567718506, 'learning_rate': 1.456384323640961e-05, 'epoch': 0.82}


 27%|██▋       | 5174/18984 [05:10<12:28, 18.45it/s]

{'loss': 11.3026, 'grad_norm': 6.249885559082031, 'learning_rate': 1.4553308048883273e-05, 'epoch': 0.82}


 27%|██▋       | 5184/18984 [05:11<12:45, 18.03it/s]

{'loss': 11.3323, 'grad_norm': 6.265578269958496, 'learning_rate': 1.4542772861356933e-05, 'epoch': 0.82}


 27%|██▋       | 5194/18984 [05:12<12:13, 18.79it/s]

{'loss': 11.3978, 'grad_norm': 6.414100646972656, 'learning_rate': 1.4532237673830596e-05, 'epoch': 0.82}


 27%|██▋       | 5204/18984 [05:12<12:10, 18.87it/s]

{'loss': 11.4601, 'grad_norm': 6.485194206237793, 'learning_rate': 1.4521702486304257e-05, 'epoch': 0.82}


 27%|██▋       | 5214/18984 [05:13<12:17, 18.67it/s]

{'loss': 11.544, 'grad_norm': 6.5023193359375, 'learning_rate': 1.451116729877792e-05, 'epoch': 0.82}


 28%|██▊       | 5223/18984 [05:13<12:14, 18.74it/s]

{'loss': 11.6164, 'grad_norm': 6.557341575622559, 'learning_rate': 1.4500632111251583e-05, 'epoch': 0.82}


 28%|██▊       | 5233/18984 [05:14<12:25, 18.43it/s]

{'loss': 11.6557, 'grad_norm': 6.499024391174316, 'learning_rate': 1.4490096923725243e-05, 'epoch': 0.83}


 28%|██▊       | 5243/18984 [05:14<12:24, 18.47it/s]

{'loss': 11.6952, 'grad_norm': 6.149465560913086, 'learning_rate': 1.4479561736198906e-05, 'epoch': 0.83}


 28%|██▊       | 5254/18984 [05:15<11:56, 19.15it/s]

{'loss': 11.6404, 'grad_norm': 5.280609130859375, 'learning_rate': 1.4469026548672567e-05, 'epoch': 0.83}


 28%|██▊       | 5264/18984 [05:15<12:16, 18.63it/s]

{'loss': 11.4654, 'grad_norm': 5.792374134063721, 'learning_rate': 1.445849136114623e-05, 'epoch': 0.83}


 28%|██▊       | 5274/18984 [05:16<12:27, 18.35it/s]

{'loss': 11.3862, 'grad_norm': 5.818470001220703, 'learning_rate': 1.4447956173619893e-05, 'epoch': 0.83}


 28%|██▊       | 5284/18984 [05:16<12:15, 18.63it/s]

{'loss': 11.3877, 'grad_norm': 6.159913539886475, 'learning_rate': 1.4437420986093553e-05, 'epoch': 0.83}


 28%|██▊       | 5294/18984 [05:17<12:16, 18.59it/s]

{'loss': 11.4546, 'grad_norm': 6.352745056152344, 'learning_rate': 1.4426885798567216e-05, 'epoch': 0.84}


 28%|██▊       | 5304/18984 [05:17<12:14, 18.62it/s]

{'loss': 11.5658, 'grad_norm': 6.402485370635986, 'learning_rate': 1.4416350611040877e-05, 'epoch': 0.84}


 28%|██▊       | 5314/18984 [05:18<11:55, 19.11it/s]

{'loss': 11.6072, 'grad_norm': 6.34507417678833, 'learning_rate': 1.440581542351454e-05, 'epoch': 0.84}


 28%|██▊       | 5324/18984 [05:18<12:03, 18.88it/s]

{'loss': 11.6613, 'grad_norm': 6.196537494659424, 'learning_rate': 1.4395280235988203e-05, 'epoch': 0.84}


 28%|██▊       | 5334/18984 [05:19<12:19, 18.45it/s]

{'loss': 11.7015, 'grad_norm': 5.976818561553955, 'learning_rate': 1.4384745048461863e-05, 'epoch': 0.84}


 28%|██▊       | 5344/18984 [05:20<12:02, 18.88it/s]

{'loss': 11.7468, 'grad_norm': 5.802163124084473, 'learning_rate': 1.4374209860935526e-05, 'epoch': 0.84}


 28%|██▊       | 5354/18984 [05:20<12:12, 18.60it/s]

{'loss': 11.7219, 'grad_norm': 4.63489294052124, 'learning_rate': 1.4363674673409187e-05, 'epoch': 0.85}


 28%|██▊       | 5364/18984 [05:21<12:07, 18.73it/s]

{'loss': 11.3719, 'grad_norm': 6.086096286773682, 'learning_rate': 1.435313948588285e-05, 'epoch': 0.85}


 28%|██▊       | 5374/18984 [05:21<12:00, 18.88it/s]

{'loss': 11.4229, 'grad_norm': 6.408741474151611, 'learning_rate': 1.4342604298356513e-05, 'epoch': 0.85}


 28%|██▊       | 5384/18984 [05:22<12:04, 18.77it/s]

{'loss': 11.3965, 'grad_norm': 6.450619220733643, 'learning_rate': 1.4332069110830173e-05, 'epoch': 0.85}


 28%|██▊       | 5394/18984 [05:22<12:02, 18.82it/s]

{'loss': 11.4155, 'grad_norm': 6.651066303253174, 'learning_rate': 1.4321533923303836e-05, 'epoch': 0.85}


 28%|██▊       | 5404/18984 [05:23<11:45, 19.24it/s]

{'loss': 11.5095, 'grad_norm': 6.646234512329102, 'learning_rate': 1.4310998735777498e-05, 'epoch': 0.85}


 29%|██▊       | 5414/18984 [05:23<11:55, 18.96it/s]

{'loss': 11.6048, 'grad_norm': 6.7142205238342285, 'learning_rate': 1.430046354825116e-05, 'epoch': 0.85}


 29%|██▊       | 5424/18984 [05:24<12:16, 18.40it/s]

{'loss': 11.6462, 'grad_norm': 6.663517475128174, 'learning_rate': 1.4289928360724824e-05, 'epoch': 0.86}


 29%|██▊       | 5434/18984 [05:24<12:09, 18.59it/s]

{'loss': 11.7149, 'grad_norm': 6.770091533660889, 'learning_rate': 1.4279393173198483e-05, 'epoch': 0.86}


 29%|██▊       | 5444/18984 [05:25<12:10, 18.53it/s]

{'loss': 11.6322, 'grad_norm': 4.923784255981445, 'learning_rate': 1.4268857985672146e-05, 'epoch': 0.86}


 29%|██▊       | 5452/18984 [05:25<11:54, 18.93it/s]

{'loss': 11.5965, 'grad_norm': 4.787644863128662, 'learning_rate': 1.4258322798145808e-05, 'epoch': 0.86}


 29%|██▉       | 5463/18984 [05:26<11:57, 18.84it/s]

{'loss': 11.5602, 'grad_norm': 5.403185844421387, 'learning_rate': 1.424778761061947e-05, 'epoch': 0.86}


 29%|██▉       | 5473/18984 [05:26<11:54, 18.92it/s]

{'loss': 11.5324, 'grad_norm': 16.57608413696289, 'learning_rate': 1.4237252423093134e-05, 'epoch': 0.86}


 29%|██▉       | 5483/18984 [05:27<11:40, 19.27it/s]

{'loss': 11.4503, 'grad_norm': 5.552616596221924, 'learning_rate': 1.4226717235566793e-05, 'epoch': 0.87}


 29%|██▉       | 5493/18984 [05:27<11:43, 19.18it/s]

{'loss': 11.4302, 'grad_norm': 5.932099342346191, 'learning_rate': 1.4216182048040456e-05, 'epoch': 0.87}


 29%|██▉       | 5500/18984 [05:28<12:12, 18.40it/s]

{'loss': 11.4772, 'grad_norm': 5.824156761169434, 'learning_rate': 1.4205646860514118e-05, 'epoch': 0.87}


 29%|██▉       | 5513/18984 [05:31<23:28,  9.56it/s]

{'loss': 11.5296, 'grad_norm': 5.866775035858154, 'learning_rate': 1.4195111672987781e-05, 'epoch': 0.87}


 29%|██▉       | 5523/18984 [05:32<14:09, 15.85it/s]

{'loss': 11.5788, 'grad_norm': 5.603229522705078, 'learning_rate': 1.4184576485461444e-05, 'epoch': 0.87}


 29%|██▉       | 5533/18984 [05:32<12:00, 18.66it/s]

{'loss': 11.6051, 'grad_norm': 5.948534965515137, 'learning_rate': 1.4174041297935104e-05, 'epoch': 0.87}


 29%|██▉       | 5543/18984 [05:33<11:56, 18.77it/s]

{'loss': 11.7145, 'grad_norm': 5.61048698425293, 'learning_rate': 1.4163506110408767e-05, 'epoch': 0.88}


 29%|██▉       | 5553/18984 [05:34<11:54, 18.81it/s]

{'loss': 11.7437, 'grad_norm': 4.336771488189697, 'learning_rate': 1.4152970922882428e-05, 'epoch': 0.88}


 29%|██▉       | 5563/18984 [05:34<11:59, 18.67it/s]

{'loss': 11.4381, 'grad_norm': 5.62666130065918, 'learning_rate': 1.4142435735356091e-05, 'epoch': 0.88}


 29%|██▉       | 5573/18984 [05:35<12:01, 18.60it/s]

{'loss': 11.3514, 'grad_norm': 6.396836757659912, 'learning_rate': 1.4131900547829754e-05, 'epoch': 0.88}


 29%|██▉       | 5583/18984 [05:35<11:57, 18.69it/s]

{'loss': 11.371, 'grad_norm': 6.595626354217529, 'learning_rate': 1.4121365360303414e-05, 'epoch': 0.88}


 29%|██▉       | 5593/18984 [05:36<11:45, 18.97it/s]

{'loss': 11.4599, 'grad_norm': 6.574589252471924, 'learning_rate': 1.4110830172777077e-05, 'epoch': 0.88}


 30%|██▉       | 5603/18984 [05:36<11:39, 19.14it/s]

{'loss': 11.539, 'grad_norm': 6.419673919677734, 'learning_rate': 1.4100294985250738e-05, 'epoch': 0.88}


 30%|██▉       | 5613/18984 [05:37<11:47, 18.90it/s]

{'loss': 11.5534, 'grad_norm': 6.512951374053955, 'learning_rate': 1.4089759797724401e-05, 'epoch': 0.89}


 30%|██▉       | 5623/18984 [05:37<11:47, 18.90it/s]

{'loss': 11.6209, 'grad_norm': 6.361390590667725, 'learning_rate': 1.4079224610198064e-05, 'epoch': 0.89}


 30%|██▉       | 5633/18984 [05:38<11:55, 18.65it/s]

{'loss': 11.6412, 'grad_norm': 5.8090715408325195, 'learning_rate': 1.4068689422671724e-05, 'epoch': 0.89}


 30%|██▉       | 5643/18984 [05:38<11:37, 19.14it/s]

{'loss': 11.5671, 'grad_norm': 6.245147228240967, 'learning_rate': 1.4058154235145387e-05, 'epoch': 0.89}


 30%|██▉       | 5653/18984 [05:39<11:48, 18.81it/s]

{'loss': 11.4668, 'grad_norm': 6.00823450088501, 'learning_rate': 1.4047619047619048e-05, 'epoch': 0.89}


 30%|██▉       | 5663/18984 [05:39<11:50, 18.74it/s]

{'loss': 11.4418, 'grad_norm': 6.6149067878723145, 'learning_rate': 1.4037083860092711e-05, 'epoch': 0.89}


 30%|██▉       | 5673/18984 [05:40<11:37, 19.09it/s]

{'loss': 11.4929, 'grad_norm': 6.520812034606934, 'learning_rate': 1.4026548672566374e-05, 'epoch': 0.9}


 30%|██▉       | 5683/18984 [05:40<11:37, 19.06it/s]

{'loss': 11.5356, 'grad_norm': 6.704962253570557, 'learning_rate': 1.4016013485040034e-05, 'epoch': 0.9}


 30%|██▉       | 5693/18984 [05:41<11:55, 18.58it/s]

{'loss': 11.5794, 'grad_norm': 6.378116607666016, 'learning_rate': 1.4005478297513697e-05, 'epoch': 0.9}


 30%|███       | 5703/18984 [05:42<11:59, 18.47it/s]

{'loss': 11.6303, 'grad_norm': 4.8805389404296875, 'learning_rate': 1.3994943109987358e-05, 'epoch': 0.9}


 30%|███       | 5713/18984 [05:42<11:33, 19.14it/s]

{'loss': 11.6538, 'grad_norm': 3.6064627170562744, 'learning_rate': 1.3984407922461021e-05, 'epoch': 0.9}


 30%|███       | 5723/18984 [05:43<11:31, 19.18it/s]

{'loss': 11.5536, 'grad_norm': 5.605377674102783, 'learning_rate': 1.3973872734934684e-05, 'epoch': 0.9}


 30%|███       | 5733/18984 [05:43<11:39, 18.94it/s]

{'loss': 11.422, 'grad_norm': 6.231663227081299, 'learning_rate': 1.3963337547408344e-05, 'epoch': 0.91}


 30%|███       | 5743/18984 [05:44<11:39, 18.92it/s]

{'loss': 11.4257, 'grad_norm': 6.329931735992432, 'learning_rate': 1.3952802359882007e-05, 'epoch': 0.91}


 30%|███       | 5754/18984 [05:44<11:32, 19.11it/s]

{'loss': 11.4837, 'grad_norm': 5.929773330688477, 'learning_rate': 1.394226717235567e-05, 'epoch': 0.91}


 30%|███       | 5763/18984 [05:45<11:30, 19.16it/s]

{'loss': 11.5772, 'grad_norm': 6.587661266326904, 'learning_rate': 1.3931731984829332e-05, 'epoch': 0.91}


 30%|███       | 5773/18984 [05:45<11:39, 18.87it/s]

{'loss': 11.6096, 'grad_norm': 6.027715682983398, 'learning_rate': 1.3921196797302993e-05, 'epoch': 0.91}


 30%|███       | 5783/18984 [05:46<11:32, 19.06it/s]

{'loss': 11.521, 'grad_norm': 6.369609355926514, 'learning_rate': 1.3910661609776654e-05, 'epoch': 0.91}


 31%|███       | 5793/18984 [05:46<11:21, 19.37it/s]

{'loss': 11.4997, 'grad_norm': 6.4528069496154785, 'learning_rate': 1.3900126422250317e-05, 'epoch': 0.91}


 31%|███       | 5803/18984 [05:47<11:30, 19.10it/s]

{'loss': 11.5355, 'grad_norm': 5.854005813598633, 'learning_rate': 1.388959123472398e-05, 'epoch': 0.92}


 31%|███       | 5813/18984 [05:47<11:29, 19.11it/s]

{'loss': 11.567, 'grad_norm': 6.238162517547607, 'learning_rate': 1.3879056047197642e-05, 'epoch': 0.92}


 31%|███       | 5823/18984 [05:48<11:32, 19.01it/s]

{'loss': 11.5393, 'grad_norm': 5.363241195678711, 'learning_rate': 1.3868520859671303e-05, 'epoch': 0.92}


 31%|███       | 5833/18984 [05:48<11:22, 19.26it/s]

{'loss': 11.5188, 'grad_norm': 5.3094353675842285, 'learning_rate': 1.3857985672144964e-05, 'epoch': 0.92}


 31%|███       | 5844/18984 [05:49<11:22, 19.24it/s]

{'loss': 11.5331, 'grad_norm': 5.290342807769775, 'learning_rate': 1.3847450484618627e-05, 'epoch': 0.92}


 31%|███       | 5854/18984 [05:49<11:29, 19.05it/s]

{'loss': 11.5753, 'grad_norm': 6.2001872062683105, 'learning_rate': 1.383691529709229e-05, 'epoch': 0.92}


 31%|███       | 5864/18984 [05:50<11:38, 18.78it/s]

{'loss': 11.5699, 'grad_norm': 5.966038703918457, 'learning_rate': 1.3826380109565952e-05, 'epoch': 0.93}


 31%|███       | 5874/18984 [05:50<11:22, 19.22it/s]

{'loss': 11.5706, 'grad_norm': 6.22617244720459, 'learning_rate': 1.3815844922039613e-05, 'epoch': 0.93}


 31%|███       | 5884/18984 [05:51<11:20, 19.24it/s]

{'loss': 11.6412, 'grad_norm': 4.804070472717285, 'learning_rate': 1.3805309734513275e-05, 'epoch': 0.93}


 31%|███       | 5894/18984 [05:51<11:27, 19.03it/s]

{'loss': 11.5443, 'grad_norm': 3.6672909259796143, 'learning_rate': 1.3794774546986938e-05, 'epoch': 0.93}


 31%|███       | 5904/18984 [05:52<11:33, 18.87it/s]

{'loss': 11.5355, 'grad_norm': 3.8743910789489746, 'learning_rate': 1.37842393594606e-05, 'epoch': 0.93}


 31%|███       | 5914/18984 [05:53<11:23, 19.12it/s]

{'loss': 11.5255, 'grad_norm': 4.2465667724609375, 'learning_rate': 1.3773704171934262e-05, 'epoch': 0.93}


 31%|███       | 5922/18984 [05:53<11:22, 19.15it/s]

{'loss': 11.5735, 'grad_norm': 4.44203519821167, 'learning_rate': 1.3763168984407923e-05, 'epoch': 0.94}


 31%|███▏      | 5933/18984 [05:54<11:17, 19.27it/s]

{'loss': 11.5935, 'grad_norm': 4.359008312225342, 'learning_rate': 1.3752633796881585e-05, 'epoch': 0.94}


 31%|███▏      | 5943/18984 [05:54<11:09, 19.47it/s]

{'loss': 11.6025, 'grad_norm': 4.308990955352783, 'learning_rate': 1.3742098609355248e-05, 'epoch': 0.94}


 31%|███▏      | 5953/18984 [05:55<11:37, 18.68it/s]

{'loss': 11.647, 'grad_norm': 4.087258815765381, 'learning_rate': 1.373156342182891e-05, 'epoch': 0.94}


 31%|███▏      | 5963/18984 [05:55<11:41, 18.57it/s]

{'loss': 11.6433, 'grad_norm': 3.795011281967163, 'learning_rate': 1.3721028234302572e-05, 'epoch': 0.94}


 31%|███▏      | 5973/18984 [05:56<11:30, 18.84it/s]

{'loss': 11.6552, 'grad_norm': 3.231215476989746, 'learning_rate': 1.3710493046776234e-05, 'epoch': 0.94}


 32%|███▏      | 5983/18984 [05:56<11:24, 18.99it/s]

{'loss': 11.5079, 'grad_norm': 5.171086311340332, 'learning_rate': 1.3699957859249895e-05, 'epoch': 0.95}


 32%|███▏      | 5993/18984 [05:57<11:39, 18.58it/s]

{'loss': 11.414, 'grad_norm': 5.65041446685791, 'learning_rate': 1.3689422671723558e-05, 'epoch': 0.95}


 32%|███▏      | 6000/18984 [05:57<11:19, 19.11it/s]

{'loss': 11.4353, 'grad_norm': 5.924592018127441, 'learning_rate': 1.3678887484197221e-05, 'epoch': 0.95}


 32%|███▏      | 6013/18984 [06:01<22:59,  9.40it/s]

{'loss': 11.5066, 'grad_norm': 6.049632549285889, 'learning_rate': 1.3668352296670882e-05, 'epoch': 0.95}


 32%|███▏      | 6023/18984 [06:01<13:30, 16.00it/s]

{'loss': 11.5444, 'grad_norm': 6.021719932556152, 'learning_rate': 1.3657817109144544e-05, 'epoch': 0.95}


 32%|███▏      | 6033/18984 [06:02<11:42, 18.42it/s]

{'loss': 11.5434, 'grad_norm': 4.837954044342041, 'learning_rate': 1.3647281921618205e-05, 'epoch': 0.95}


 32%|███▏      | 6043/18984 [06:02<11:48, 18.26it/s]

{'loss': 11.5419, 'grad_norm': 5.238865852355957, 'learning_rate': 1.3636746734091868e-05, 'epoch': 0.95}


 32%|███▏      | 6053/18984 [06:03<11:49, 18.23it/s]

{'loss': 11.5318, 'grad_norm': 5.273008823394775, 'learning_rate': 1.3626211546565531e-05, 'epoch': 0.96}


 32%|███▏      | 6064/18984 [06:03<11:11, 19.25it/s]

{'loss': 11.524, 'grad_norm': 4.895862102508545, 'learning_rate': 1.361567635903919e-05, 'epoch': 0.96}


 32%|███▏      | 6074/18984 [06:04<11:19, 18.99it/s]

{'loss': 11.5565, 'grad_norm': 4.66545295715332, 'learning_rate': 1.3605141171512854e-05, 'epoch': 0.96}


 32%|███▏      | 6084/18984 [06:04<11:09, 19.27it/s]

{'loss': 11.5836, 'grad_norm': 5.04995059967041, 'learning_rate': 1.3594605983986515e-05, 'epoch': 0.96}


 32%|███▏      | 6094/18984 [06:05<11:13, 19.14it/s]

{'loss': 11.6141, 'grad_norm': 4.880496501922607, 'learning_rate': 1.3584070796460178e-05, 'epoch': 0.96}


 32%|███▏      | 6103/18984 [06:05<11:05, 19.36it/s]

{'loss': 11.6419, 'grad_norm': 4.438222885131836, 'learning_rate': 1.3573535608933841e-05, 'epoch': 0.96}


 32%|███▏      | 6113/18984 [06:06<11:52, 18.07it/s]

{'loss': 11.6669, 'grad_norm': 3.3565754890441895, 'learning_rate': 1.3563000421407501e-05, 'epoch': 0.97}


 32%|███▏      | 6123/18984 [06:07<12:07, 17.68it/s]

{'loss': 11.5541, 'grad_norm': 5.679083824157715, 'learning_rate': 1.3552465233881164e-05, 'epoch': 0.97}


 32%|███▏      | 6133/18984 [06:07<11:27, 18.70it/s]

{'loss': 11.4464, 'grad_norm': 5.518311977386475, 'learning_rate': 1.3541930046354825e-05, 'epoch': 0.97}


 32%|███▏      | 6143/18984 [06:08<11:26, 18.71it/s]

{'loss': 11.4524, 'grad_norm': 5.750025272369385, 'learning_rate': 1.3531394858828488e-05, 'epoch': 0.97}


 32%|███▏      | 6153/18984 [06:08<11:09, 19.18it/s]

{'loss': 11.4906, 'grad_norm': 5.858418941497803, 'learning_rate': 1.3520859671302151e-05, 'epoch': 0.97}


 32%|███▏      | 6163/18984 [06:09<11:24, 18.73it/s]

{'loss': 11.5531, 'grad_norm': 6.008096218109131, 'learning_rate': 1.3510324483775811e-05, 'epoch': 0.97}


 33%|███▎      | 6173/18984 [06:09<11:16, 18.94it/s]

{'loss': 11.6055, 'grad_norm': 6.2851481437683105, 'learning_rate': 1.3499789296249474e-05, 'epoch': 0.98}


 33%|███▎      | 6183/18984 [06:10<11:29, 18.57it/s]

{'loss': 11.6628, 'grad_norm': 6.286808013916016, 'learning_rate': 1.3489254108723135e-05, 'epoch': 0.98}


 33%|███▎      | 6193/18984 [06:10<11:46, 18.10it/s]

{'loss': 11.6049, 'grad_norm': 6.489506244659424, 'learning_rate': 1.3478718921196799e-05, 'epoch': 0.98}


 33%|███▎      | 6203/18984 [06:11<11:37, 18.33it/s]

{'loss': 11.5038, 'grad_norm': 5.80601167678833, 'learning_rate': 1.3468183733670462e-05, 'epoch': 0.98}


 33%|███▎      | 6213/18984 [06:11<11:18, 18.82it/s]

{'loss': 11.4411, 'grad_norm': 5.839954376220703, 'learning_rate': 1.3457648546144121e-05, 'epoch': 0.98}


 33%|███▎      | 6223/18984 [06:12<11:06, 19.14it/s]

{'loss': 11.4445, 'grad_norm': 6.1992082595825195, 'learning_rate': 1.3447113358617784e-05, 'epoch': 0.98}


 33%|███▎      | 6233/18984 [06:12<11:17, 18.82it/s]

{'loss': 11.5285, 'grad_norm': 6.352978706359863, 'learning_rate': 1.3436578171091446e-05, 'epoch': 0.98}


 33%|███▎      | 6243/18984 [06:13<11:12, 18.93it/s]

{'loss': 11.5456, 'grad_norm': 6.151437282562256, 'learning_rate': 1.3426042983565109e-05, 'epoch': 0.99}


 33%|███▎      | 6253/18984 [06:13<11:19, 18.74it/s]

{'loss': 11.6468, 'grad_norm': 5.727299213409424, 'learning_rate': 1.3415507796038772e-05, 'epoch': 0.99}


 33%|███▎      | 6263/18984 [06:14<11:31, 18.40it/s]

{'loss': 11.6529, 'grad_norm': 5.696998596191406, 'learning_rate': 1.3404972608512431e-05, 'epoch': 0.99}


 33%|███▎      | 6273/18984 [06:15<11:29, 18.44it/s]

{'loss': 11.6208, 'grad_norm': 5.512545585632324, 'learning_rate': 1.3394437420986094e-05, 'epoch': 0.99}


 33%|███▎      | 6283/18984 [06:15<11:05, 19.09it/s]

{'loss': 11.4993, 'grad_norm': 5.815402984619141, 'learning_rate': 1.3383902233459756e-05, 'epoch': 0.99}


 33%|███▎      | 6293/18984 [06:16<11:13, 18.84it/s]

{'loss': 11.4544, 'grad_norm': 5.428296089172363, 'learning_rate': 1.3373367045933419e-05, 'epoch': 0.99}


 33%|███▎      | 6302/18984 [06:16<11:00, 19.19it/s]

{'loss': 11.4662, 'grad_norm': 5.4965972900390625, 'learning_rate': 1.3362831858407082e-05, 'epoch': 1.0}


 33%|███▎      | 6313/18984 [06:17<10:50, 19.48it/s]

{'loss': 11.5031, 'grad_norm': 5.431319236755371, 'learning_rate': 1.3352296670880742e-05, 'epoch': 1.0}


 33%|███▎      | 6323/18984 [06:17<11:08, 18.94it/s]

{'loss': 11.5863, 'grad_norm': 5.496974468231201, 'learning_rate': 1.3341761483354405e-05, 'epoch': 1.0}


 33%|███▎      | 6331/18984 [06:24<2:37:26,  1.34it/s]

{'eval_loss': 11.557514190673828, 'eval_runtime': 6.696, 'eval_samples_per_second': 1493.439, 'eval_steps_per_second': 93.34, 'epoch': 1.0}
{'loss': 11.5892, 'grad_norm': 5.524685859680176, 'learning_rate': 1.3331226295828066e-05, 'epoch': 1.0}


 33%|███▎      | 6343/18984 [06:25<29:07,  7.23it/s]

{'loss': 11.5465, 'grad_norm': 5.501379013061523, 'learning_rate': 1.3320691108301729e-05, 'epoch': 1.0}


 33%|███▎      | 6353/18984 [06:26<14:44, 14.28it/s]

{'loss': 11.5505, 'grad_norm': 5.655401229858398, 'learning_rate': 1.3310155920775392e-05, 'epoch': 1.0}


 34%|███▎      | 6363/18984 [06:26<12:21, 17.03it/s]

{'loss': 11.5346, 'grad_norm': 5.587723255157471, 'learning_rate': 1.3299620733249052e-05, 'epoch': 1.01}


 34%|███▎      | 6373/18984 [06:27<11:53, 17.68it/s]

{'loss': 11.557, 'grad_norm': 5.47481107711792, 'learning_rate': 1.3289085545722715e-05, 'epoch': 1.01}


 34%|███▎      | 6383/18984 [06:27<11:51, 17.72it/s]

{'loss': 11.5213, 'grad_norm': 5.502654075622559, 'learning_rate': 1.3278550358196376e-05, 'epoch': 1.01}


 34%|███▎      | 6393/18984 [06:28<11:44, 17.87it/s]

{'loss': 11.5605, 'grad_norm': 5.316521644592285, 'learning_rate': 1.3268015170670039e-05, 'epoch': 1.01}


 34%|███▎      | 6403/18984 [06:28<11:29, 18.24it/s]

{'loss': 11.5314, 'grad_norm': 5.2917327880859375, 'learning_rate': 1.3257479983143702e-05, 'epoch': 1.01}


 34%|███▍      | 6413/18984 [06:29<11:27, 18.29it/s]

{'loss': 11.5603, 'grad_norm': 5.5481038093566895, 'learning_rate': 1.3246944795617362e-05, 'epoch': 1.01}


 34%|███▍      | 6423/18984 [06:29<11:33, 18.12it/s]

{'loss': 11.541, 'grad_norm': 5.928160190582275, 'learning_rate': 1.3236409608091025e-05, 'epoch': 1.01}


 34%|███▍      | 6433/18984 [06:30<11:45, 17.78it/s]

{'loss': 11.5602, 'grad_norm': 5.867589473724365, 'learning_rate': 1.3225874420564686e-05, 'epoch': 1.02}


 34%|███▍      | 6443/18984 [06:31<11:42, 17.84it/s]

{'loss': 11.5742, 'grad_norm': 5.642910957336426, 'learning_rate': 1.321533923303835e-05, 'epoch': 1.02}


 34%|███▍      | 6453/18984 [06:31<11:41, 17.87it/s]

{'loss': 11.5802, 'grad_norm': 5.625364780426025, 'learning_rate': 1.3204804045512012e-05, 'epoch': 1.02}


 34%|███▍      | 6463/18984 [06:32<11:56, 17.49it/s]

{'loss': 11.5207, 'grad_norm': 5.701879978179932, 'learning_rate': 1.3194268857985672e-05, 'epoch': 1.02}


 34%|███▍      | 6473/18984 [06:32<11:29, 18.13it/s]

{'loss': 11.5787, 'grad_norm': 5.684577941894531, 'learning_rate': 1.3183733670459335e-05, 'epoch': 1.02}


 34%|███▍      | 6483/18984 [06:33<11:34, 18.01it/s]

{'loss': 11.5314, 'grad_norm': 5.468348979949951, 'learning_rate': 1.3173198482932996e-05, 'epoch': 1.02}


 34%|███▍      | 6493/18984 [06:33<11:33, 18.02it/s]

{'loss': 11.5183, 'grad_norm': 5.449819564819336, 'learning_rate': 1.316266329540666e-05, 'epoch': 1.03}


 34%|███▍      | 6500/18984 [06:34<11:33, 18.00it/s]

{'loss': 11.5471, 'grad_norm': 5.069084644317627, 'learning_rate': 1.3152128107880322e-05, 'epoch': 1.03}


 34%|███▍      | 6513/18984 [06:37<20:55,  9.94it/s]

{'loss': 11.5394, 'grad_norm': 5.28896427154541, 'learning_rate': 1.3141592920353982e-05, 'epoch': 1.03}


 34%|███▍      | 6523/18984 [06:38<13:10, 15.76it/s]

{'loss': 11.5707, 'grad_norm': 5.140254974365234, 'learning_rate': 1.3131057732827645e-05, 'epoch': 1.03}


 34%|███▍      | 6533/18984 [06:38<11:46, 17.62it/s]

{'loss': 11.5704, 'grad_norm': 5.098391056060791, 'learning_rate': 1.3120522545301308e-05, 'epoch': 1.03}


 34%|███▍      | 6543/18984 [06:39<11:27, 18.09it/s]

{'loss': 11.5432, 'grad_norm': 5.097888946533203, 'learning_rate': 1.310998735777497e-05, 'epoch': 1.03}


 35%|███▍      | 6553/18984 [06:39<11:27, 18.07it/s]

{'loss': 11.5661, 'grad_norm': 4.889869213104248, 'learning_rate': 1.3099452170248633e-05, 'epoch': 1.04}


 35%|███▍      | 6563/18984 [06:40<11:26, 18.08it/s]

{'loss': 11.5623, 'grad_norm': 4.8515496253967285, 'learning_rate': 1.3088916982722292e-05, 'epoch': 1.04}


 35%|███▍      | 6573/18984 [06:40<11:34, 17.87it/s]

{'loss': 11.5503, 'grad_norm': 4.930224418640137, 'learning_rate': 1.3078381795195955e-05, 'epoch': 1.04}


 35%|███▍      | 6583/18984 [06:41<11:28, 18.01it/s]

{'loss': 11.5522, 'grad_norm': 5.281823635101318, 'learning_rate': 1.3067846607669618e-05, 'epoch': 1.04}


 35%|███▍      | 6593/18984 [06:41<11:13, 18.41it/s]

{'loss': 11.5253, 'grad_norm': 5.4870171546936035, 'learning_rate': 1.305731142014328e-05, 'epoch': 1.04}


 35%|███▍      | 6603/18984 [06:42<11:24, 18.08it/s]

{'loss': 11.5249, 'grad_norm': 5.72181510925293, 'learning_rate': 1.3046776232616943e-05, 'epoch': 1.04}


 35%|███▍      | 6613/18984 [06:43<11:20, 18.18it/s]

{'loss': 11.539, 'grad_norm': 5.7351603507995605, 'learning_rate': 1.3036241045090602e-05, 'epoch': 1.04}


 35%|███▍      | 6623/18984 [06:43<11:17, 18.24it/s]

{'loss': 11.5289, 'grad_norm': 5.848960876464844, 'learning_rate': 1.3025705857564265e-05, 'epoch': 1.05}


 35%|███▍      | 6633/18984 [06:44<11:22, 18.09it/s]

{'loss': 11.5402, 'grad_norm': 5.989912986755371, 'learning_rate': 1.3015170670037928e-05, 'epoch': 1.05}


 35%|███▍      | 6643/18984 [06:44<11:17, 18.20it/s]

{'loss': 11.5442, 'grad_norm': 5.641470909118652, 'learning_rate': 1.300463548251159e-05, 'epoch': 1.05}


 35%|███▌      | 6653/18984 [06:45<11:36, 17.71it/s]

{'loss': 11.5331, 'grad_norm': 5.6028337478637695, 'learning_rate': 1.2994100294985253e-05, 'epoch': 1.05}


 35%|███▌      | 6663/18984 [06:45<11:23, 18.03it/s]

{'loss': 11.535, 'grad_norm': 5.894370079040527, 'learning_rate': 1.2983565107458913e-05, 'epoch': 1.05}


 35%|███▌      | 6673/18984 [06:46<11:52, 17.29it/s]

{'loss': 11.5227, 'grad_norm': 5.655032634735107, 'learning_rate': 1.2973029919932576e-05, 'epoch': 1.05}


 35%|███▌      | 6683/18984 [06:46<11:15, 18.21it/s]

{'loss': 11.5393, 'grad_norm': 5.856973648071289, 'learning_rate': 1.2962494732406239e-05, 'epoch': 1.06}


 35%|███▌      | 6693/18984 [06:47<11:34, 17.69it/s]

{'loss': 11.5216, 'grad_norm': 5.802043914794922, 'learning_rate': 1.29519595448799e-05, 'epoch': 1.06}


 35%|███▌      | 6703/18984 [06:48<11:28, 17.84it/s]

{'loss': 11.5429, 'grad_norm': 6.372490406036377, 'learning_rate': 1.2941424357353563e-05, 'epoch': 1.06}


 35%|███▌      | 6713/18984 [06:48<11:14, 18.19it/s]

{'loss': 11.5477, 'grad_norm': 5.705162048339844, 'learning_rate': 1.2930889169827223e-05, 'epoch': 1.06}


 35%|███▌      | 6723/18984 [06:49<11:14, 18.18it/s]

{'loss': 11.5457, 'grad_norm': 5.713461399078369, 'learning_rate': 1.2920353982300886e-05, 'epoch': 1.06}


 35%|███▌      | 6733/18984 [06:49<11:15, 18.13it/s]

{'loss': 11.5125, 'grad_norm': 5.788963794708252, 'learning_rate': 1.2909818794774549e-05, 'epoch': 1.06}


 36%|███▌      | 6743/18984 [06:50<11:16, 18.09it/s]

{'loss': 11.5607, 'grad_norm': 5.724465847015381, 'learning_rate': 1.289928360724821e-05, 'epoch': 1.07}


 36%|███▌      | 6753/18984 [06:50<11:36, 17.55it/s]

{'loss': 11.573, 'grad_norm': 5.323657512664795, 'learning_rate': 1.2888748419721873e-05, 'epoch': 1.07}


 36%|███▌      | 6763/18984 [06:51<11:19, 17.99it/s]

{'loss': 11.5452, 'grad_norm': 5.528728485107422, 'learning_rate': 1.2878213232195533e-05, 'epoch': 1.07}


 36%|███▌      | 6773/18984 [06:51<11:06, 18.32it/s]

{'loss': 11.5455, 'grad_norm': 4.924102783203125, 'learning_rate': 1.2867678044669196e-05, 'epoch': 1.07}


 36%|███▌      | 6783/18984 [06:52<11:06, 18.32it/s]

{'loss': 11.5336, 'grad_norm': 4.756637096405029, 'learning_rate': 1.2857142857142859e-05, 'epoch': 1.07}


 36%|███▌      | 6793/18984 [06:53<11:31, 17.63it/s]

{'loss': 11.5599, 'grad_norm': 4.85258674621582, 'learning_rate': 1.284660766961652e-05, 'epoch': 1.07}


 36%|███▌      | 6803/18984 [06:53<11:13, 18.09it/s]

{'loss': 11.5497, 'grad_norm': 4.531111717224121, 'learning_rate': 1.2836072482090183e-05, 'epoch': 1.07}


 36%|███▌      | 6813/18984 [06:54<11:17, 17.96it/s]

{'loss': 11.5656, 'grad_norm': 3.787109375, 'learning_rate': 1.2825537294563843e-05, 'epoch': 1.08}


 36%|███▌      | 6823/18984 [06:54<11:20, 17.87it/s]

{'loss': 11.5434, 'grad_norm': 3.956843137741089, 'learning_rate': 1.2815002107037506e-05, 'epoch': 1.08}


 36%|███▌      | 6833/18984 [06:55<11:08, 18.17it/s]

{'loss': 11.5419, 'grad_norm': 4.213074207305908, 'learning_rate': 1.2804466919511169e-05, 'epoch': 1.08}


 36%|███▌      | 6843/18984 [06:55<11:30, 17.60it/s]

{'loss': 11.5453, 'grad_norm': 4.969977378845215, 'learning_rate': 1.279393173198483e-05, 'epoch': 1.08}


 36%|███▌      | 6853/18984 [06:56<11:08, 18.16it/s]

{'loss': 11.5457, 'grad_norm': 5.4790496826171875, 'learning_rate': 1.2783396544458493e-05, 'epoch': 1.08}


 36%|███▌      | 6863/18984 [06:56<10:56, 18.46it/s]

{'loss': 11.542, 'grad_norm': 4.130350112915039, 'learning_rate': 1.2772861356932153e-05, 'epoch': 1.08}


 36%|███▌      | 6873/18984 [06:57<11:06, 18.18it/s]

{'loss': 11.5311, 'grad_norm': 4.6271138191223145, 'learning_rate': 1.2762326169405816e-05, 'epoch': 1.09}


 36%|███▋      | 6883/18984 [06:58<11:12, 18.00it/s]

{'loss': 11.5422, 'grad_norm': 5.691083908081055, 'learning_rate': 1.275179098187948e-05, 'epoch': 1.09}


 36%|███▋      | 6893/18984 [06:58<11:23, 17.69it/s]

{'loss': 11.5243, 'grad_norm': 5.933627605438232, 'learning_rate': 1.274125579435314e-05, 'epoch': 1.09}


 36%|███▋      | 6903/18984 [06:59<11:35, 17.36it/s]

{'loss': 11.5633, 'grad_norm': 6.582406520843506, 'learning_rate': 1.2730720606826804e-05, 'epoch': 1.09}


 36%|███▋      | 6913/18984 [06:59<11:11, 17.99it/s]

{'loss': 11.5446, 'grad_norm': 6.595830917358398, 'learning_rate': 1.2720185419300463e-05, 'epoch': 1.09}


 36%|███▋      | 6923/18984 [07:00<11:14, 17.89it/s]

{'loss': 11.5427, 'grad_norm': 6.585029125213623, 'learning_rate': 1.2709650231774126e-05, 'epoch': 1.09}


 37%|███▋      | 6933/18984 [07:00<11:11, 17.96it/s]

{'loss': 11.5344, 'grad_norm': 6.4644455909729, 'learning_rate': 1.269911504424779e-05, 'epoch': 1.1}


 37%|███▋      | 6943/18984 [07:01<11:12, 17.90it/s]

{'loss': 11.5491, 'grad_norm': 6.380259037017822, 'learning_rate': 1.268857985672145e-05, 'epoch': 1.1}


 37%|███▋      | 6953/18984 [07:01<11:02, 18.16it/s]

{'loss': 11.5498, 'grad_norm': 5.609585285186768, 'learning_rate': 1.2678044669195114e-05, 'epoch': 1.1}


 37%|███▋      | 6963/18984 [07:02<11:00, 18.21it/s]

{'loss': 11.556, 'grad_norm': 5.10457706451416, 'learning_rate': 1.2667509481668773e-05, 'epoch': 1.1}


 37%|███▋      | 6973/18984 [07:03<11:03, 18.11it/s]

{'loss': 11.5405, 'grad_norm': 4.795846462249756, 'learning_rate': 1.2656974294142436e-05, 'epoch': 1.1}


 37%|███▋      | 6983/18984 [07:03<11:15, 17.78it/s]

{'loss': 11.5562, 'grad_norm': 4.614699363708496, 'learning_rate': 1.26464391066161e-05, 'epoch': 1.1}


 37%|███▋      | 6993/18984 [07:04<11:21, 17.60it/s]

{'loss': 11.5322, 'grad_norm': 5.424022674560547, 'learning_rate': 1.2635903919089761e-05, 'epoch': 1.1}


 37%|███▋      | 7000/18984 [07:04<11:22, 17.57it/s]

{'loss': 11.541, 'grad_norm': 5.456178665161133, 'learning_rate': 1.2625368731563424e-05, 'epoch': 1.11}


 37%|███▋      | 7013/18984 [07:07<20:19,  9.82it/s]

{'loss': 11.56, 'grad_norm': 4.952658176422119, 'learning_rate': 1.2614833544037084e-05, 'epoch': 1.11}


 37%|███▋      | 7023/18984 [07:08<12:44, 15.65it/s]

{'loss': 11.5454, 'grad_norm': 4.82200813293457, 'learning_rate': 1.2604298356510747e-05, 'epoch': 1.11}


 37%|███▋      | 7033/18984 [07:09<11:30, 17.32it/s]

{'loss': 11.5538, 'grad_norm': 4.765251636505127, 'learning_rate': 1.259376316898441e-05, 'epoch': 1.11}


 37%|███▋      | 7043/18984 [07:09<10:53, 18.26it/s]

{'loss': 11.5389, 'grad_norm': 4.737929821014404, 'learning_rate': 1.2583227981458071e-05, 'epoch': 1.11}


 37%|███▋      | 7053/18984 [07:10<10:45, 18.47it/s]

{'loss': 11.5557, 'grad_norm': 4.791202545166016, 'learning_rate': 1.2572692793931734e-05, 'epoch': 1.11}


 37%|███▋      | 7063/18984 [07:10<11:20, 17.52it/s]

{'loss': 11.5534, 'grad_norm': 4.765791893005371, 'learning_rate': 1.2562157606405394e-05, 'epoch': 1.12}


 37%|███▋      | 7073/18984 [07:11<11:16, 17.61it/s]

{'loss': 11.5636, 'grad_norm': 5.037947654724121, 'learning_rate': 1.2551622418879057e-05, 'epoch': 1.12}


 37%|███▋      | 7083/18984 [07:11<11:17, 17.56it/s]

{'loss': 11.5387, 'grad_norm': 5.072703838348389, 'learning_rate': 1.254108723135272e-05, 'epoch': 1.12}


 37%|███▋      | 7093/18984 [07:12<11:23, 17.39it/s]

{'loss': 11.553, 'grad_norm': 5.0305681228637695, 'learning_rate': 1.2530552043826381e-05, 'epoch': 1.12}


 37%|███▋      | 7103/18984 [07:12<10:54, 18.14it/s]

{'loss': 11.5744, 'grad_norm': 5.0367865562438965, 'learning_rate': 1.2520016856300044e-05, 'epoch': 1.12}


 37%|███▋      | 7113/18984 [07:13<11:11, 17.69it/s]

{'loss': 11.5499, 'grad_norm': 4.8486552238464355, 'learning_rate': 1.2509481668773704e-05, 'epoch': 1.12}


 38%|███▊      | 7123/18984 [07:14<10:42, 18.46it/s]

{'loss': 11.5348, 'grad_norm': 4.281057834625244, 'learning_rate': 1.2498946481247367e-05, 'epoch': 1.13}


 38%|███▊      | 7133/18984 [07:14<10:30, 18.79it/s]

{'loss': 11.5354, 'grad_norm': 5.924530506134033, 'learning_rate': 1.248841129372103e-05, 'epoch': 1.13}


 38%|███▊      | 7143/18984 [07:15<10:58, 17.98it/s]

{'loss': 11.5701, 'grad_norm': 6.2442779541015625, 'learning_rate': 1.2477876106194691e-05, 'epoch': 1.13}


 38%|███▊      | 7153/18984 [07:15<10:53, 18.11it/s]

{'loss': 11.5639, 'grad_norm': 6.242358207702637, 'learning_rate': 1.2467340918668354e-05, 'epoch': 1.13}


 38%|███▊      | 7163/18984 [07:16<10:48, 18.22it/s]

{'loss': 11.5272, 'grad_norm': 6.088951587677002, 'learning_rate': 1.2456805731142014e-05, 'epoch': 1.13}


 38%|███▊      | 7173/18984 [07:16<10:53, 18.06it/s]

{'loss': 11.5414, 'grad_norm': 6.084969520568848, 'learning_rate': 1.2446270543615677e-05, 'epoch': 1.13}


 38%|███▊      | 7183/18984 [07:17<10:45, 18.28it/s]

{'loss': 11.551, 'grad_norm': 5.840928554534912, 'learning_rate': 1.243573535608934e-05, 'epoch': 1.13}


 38%|███▊      | 7193/18984 [07:17<10:58, 17.92it/s]

{'loss': 11.536, 'grad_norm': 5.7103495597839355, 'learning_rate': 1.2425200168563001e-05, 'epoch': 1.14}


 38%|███▊      | 7203/18984 [07:18<10:47, 18.20it/s]

{'loss': 11.5425, 'grad_norm': 5.721323013305664, 'learning_rate': 1.2414664981036664e-05, 'epoch': 1.14}


 38%|███▊      | 7213/18984 [07:19<10:40, 18.39it/s]

{'loss': 11.5893, 'grad_norm': 5.795228481292725, 'learning_rate': 1.2404129793510324e-05, 'epoch': 1.14}


 38%|███▊      | 7223/18984 [07:19<10:46, 18.20it/s]

{'loss': 11.5337, 'grad_norm': 5.61666202545166, 'learning_rate': 1.2393594605983987e-05, 'epoch': 1.14}


 38%|███▊      | 7233/18984 [07:20<10:44, 18.22it/s]

{'loss': 11.546, 'grad_norm': 5.640875339508057, 'learning_rate': 1.238305941845765e-05, 'epoch': 1.14}


 38%|███▊      | 7243/18984 [07:20<11:08, 17.57it/s]

{'loss': 11.5497, 'grad_norm': 5.11624002456665, 'learning_rate': 1.2372524230931312e-05, 'epoch': 1.14}


 38%|███▊      | 7253/18984 [07:21<10:52, 17.98it/s]

{'loss': 11.5558, 'grad_norm': 4.740294456481934, 'learning_rate': 1.2361989043404975e-05, 'epoch': 1.15}


 38%|███▊      | 7263/18984 [07:21<10:50, 18.03it/s]

{'loss': 11.5452, 'grad_norm': 5.049333095550537, 'learning_rate': 1.2351453855878634e-05, 'epoch': 1.15}


 38%|███▊      | 7273/18984 [07:22<10:45, 18.13it/s]

{'loss': 11.5312, 'grad_norm': 5.008164405822754, 'learning_rate': 1.2340918668352297e-05, 'epoch': 1.15}


 38%|███▊      | 7283/18984 [07:22<10:43, 18.19it/s]

{'loss': 11.5459, 'grad_norm': 5.0213446617126465, 'learning_rate': 1.233038348082596e-05, 'epoch': 1.15}


 38%|███▊      | 7293/18984 [07:23<10:45, 18.11it/s]

{'loss': 11.5567, 'grad_norm': 5.118193626403809, 'learning_rate': 1.2319848293299622e-05, 'epoch': 1.15}


 38%|███▊      | 7303/18984 [07:24<10:37, 18.33it/s]

{'loss': 11.5534, 'grad_norm': 5.216094970703125, 'learning_rate': 1.2309313105773285e-05, 'epoch': 1.15}


 39%|███▊      | 7313/18984 [07:24<10:41, 18.20it/s]

{'loss': 11.5359, 'grad_norm': 5.250338554382324, 'learning_rate': 1.2298777918246946e-05, 'epoch': 1.16}


 39%|███▊      | 7323/18984 [07:25<10:42, 18.14it/s]

{'loss': 11.5423, 'grad_norm': 5.3297929763793945, 'learning_rate': 1.2288242730720607e-05, 'epoch': 1.16}


 39%|███▊      | 7333/18984 [07:25<10:29, 18.52it/s]

{'loss': 11.5561, 'grad_norm': 5.367889881134033, 'learning_rate': 1.227770754319427e-05, 'epoch': 1.16}


 39%|███▊      | 7343/18984 [07:26<10:43, 18.10it/s]

{'loss': 11.5483, 'grad_norm': 5.328464031219482, 'learning_rate': 1.2267172355667932e-05, 'epoch': 1.16}


 39%|███▊      | 7353/18984 [07:26<10:47, 17.98it/s]

{'loss': 11.5594, 'grad_norm': 5.329095840454102, 'learning_rate': 1.2256637168141595e-05, 'epoch': 1.16}


 39%|███▉      | 7363/18984 [07:27<10:48, 17.92it/s]

{'loss': 11.5312, 'grad_norm': 5.439267158508301, 'learning_rate': 1.2246101980615256e-05, 'epoch': 1.16}


 39%|███▉      | 7373/18984 [07:27<10:41, 18.10it/s]

{'loss': 11.5461, 'grad_norm': 5.338561534881592, 'learning_rate': 1.2235566793088918e-05, 'epoch': 1.16}


 39%|███▉      | 7383/18984 [07:28<10:48, 17.89it/s]

{'loss': 11.5531, 'grad_norm': 5.516037464141846, 'learning_rate': 1.222503160556258e-05, 'epoch': 1.17}


 39%|███▉      | 7393/18984 [07:28<10:43, 18.02it/s]

{'loss': 11.5369, 'grad_norm': 5.563336372375488, 'learning_rate': 1.2214496418036242e-05, 'epoch': 1.17}


 39%|███▉      | 7403/18984 [07:29<10:41, 18.05it/s]

{'loss': 11.5543, 'grad_norm': 5.581070899963379, 'learning_rate': 1.2203961230509903e-05, 'epoch': 1.17}


 39%|███▉      | 7413/18984 [07:30<10:30, 18.34it/s]

{'loss': 11.5689, 'grad_norm': 5.610748767852783, 'learning_rate': 1.2193426042983566e-05, 'epoch': 1.17}


 39%|███▉      | 7423/18984 [07:30<10:29, 18.37it/s]

{'loss': 11.5569, 'grad_norm': 5.587394714355469, 'learning_rate': 1.2182890855457228e-05, 'epoch': 1.17}


 39%|███▉      | 7433/18984 [07:31<11:02, 17.44it/s]

{'loss': 11.5447, 'grad_norm': 5.617396831512451, 'learning_rate': 1.217235566793089e-05, 'epoch': 1.17}


 39%|███▉      | 7443/18984 [07:31<10:46, 17.86it/s]

{'loss': 11.561, 'grad_norm': 5.671519756317139, 'learning_rate': 1.2161820480404552e-05, 'epoch': 1.18}


 39%|███▉      | 7453/18984 [07:32<10:47, 17.80it/s]

{'loss': 11.5939, 'grad_norm': 5.556743144989014, 'learning_rate': 1.2151285292878214e-05, 'epoch': 1.18}


 39%|███▉      | 7463/18984 [07:32<10:32, 18.23it/s]

{'loss': 11.5426, 'grad_norm': 5.476258754730225, 'learning_rate': 1.2140750105351877e-05, 'epoch': 1.18}


 39%|███▉      | 7473/18984 [07:33<10:39, 18.00it/s]

{'loss': 11.5697, 'grad_norm': 5.4672017097473145, 'learning_rate': 1.2130214917825538e-05, 'epoch': 1.18}


 39%|███▉      | 7483/18984 [07:33<10:30, 18.25it/s]

{'loss': 11.5958, 'grad_norm': 5.364503860473633, 'learning_rate': 1.2119679730299201e-05, 'epoch': 1.18}


 39%|███▉      | 7493/18984 [07:34<10:26, 18.33it/s]

{'loss': 11.5878, 'grad_norm': 5.3777618408203125, 'learning_rate': 1.2109144542772862e-05, 'epoch': 1.18}


 40%|███▉      | 7500/18984 [07:34<10:31, 18.19it/s]

{'loss': 11.5608, 'grad_norm': 4.952010154724121, 'learning_rate': 1.2098609355246524e-05, 'epoch': 1.19}


 40%|███▉      | 7513/18984 [07:38<19:28,  9.81it/s]

{'loss': 11.5532, 'grad_norm': 4.561861515045166, 'learning_rate': 1.2088074167720187e-05, 'epoch': 1.19}


 40%|███▉      | 7523/18984 [07:38<12:04, 15.83it/s]

{'loss': 11.547, 'grad_norm': 4.628912448883057, 'learning_rate': 1.2077538980193848e-05, 'epoch': 1.19}


 40%|███▉      | 7533/18984 [07:39<10:40, 17.87it/s]

{'loss': 11.5238, 'grad_norm': 5.4622344970703125, 'learning_rate': 1.2067003792667511e-05, 'epoch': 1.19}


 40%|███▉      | 7543/18984 [07:39<10:41, 17.83it/s]

{'loss': 11.5147, 'grad_norm': 5.631198406219482, 'learning_rate': 1.2056468605141172e-05, 'epoch': 1.19}


 40%|███▉      | 7553/18984 [07:40<10:22, 18.37it/s]

{'loss': 11.4993, 'grad_norm': 5.734482288360596, 'learning_rate': 1.2045933417614834e-05, 'epoch': 1.19}


 40%|███▉      | 7563/18984 [07:40<10:19, 18.44it/s]

{'loss': 11.5254, 'grad_norm': 5.794717788696289, 'learning_rate': 1.2035398230088497e-05, 'epoch': 1.19}


 40%|███▉      | 7573/18984 [07:41<10:33, 18.02it/s]

{'loss': 11.5304, 'grad_norm': 5.798405170440674, 'learning_rate': 1.2024863042562158e-05, 'epoch': 1.2}


 40%|███▉      | 7583/18984 [07:42<10:36, 17.90it/s]

{'loss': 11.5163, 'grad_norm': 5.8588643074035645, 'learning_rate': 1.2014327855035821e-05, 'epoch': 1.2}


 40%|███▉      | 7593/18984 [07:42<10:38, 17.85it/s]

{'loss': 11.523, 'grad_norm': 5.857717037200928, 'learning_rate': 1.2003792667509483e-05, 'epoch': 1.2}


 40%|████      | 7603/18984 [07:43<10:37, 17.84it/s]

{'loss': 11.5265, 'grad_norm': 6.035763740539551, 'learning_rate': 1.1993257479983144e-05, 'epoch': 1.2}


 40%|████      | 7613/18984 [07:43<10:24, 18.20it/s]

{'loss': 11.5331, 'grad_norm': 6.2320098876953125, 'learning_rate': 1.1982722292456807e-05, 'epoch': 1.2}


 40%|████      | 7623/18984 [07:44<10:41, 17.72it/s]

{'loss': 11.5385, 'grad_norm': 6.223710060119629, 'learning_rate': 1.1972187104930468e-05, 'epoch': 1.2}


 40%|████      | 7633/18984 [07:44<10:32, 17.95it/s]

{'loss': 11.5305, 'grad_norm': 6.177052974700928, 'learning_rate': 1.1961651917404131e-05, 'epoch': 1.21}


 40%|████      | 7643/18984 [07:45<10:35, 17.85it/s]

{'loss': 11.5258, 'grad_norm': 6.186944484710693, 'learning_rate': 1.1951116729877791e-05, 'epoch': 1.21}


 40%|████      | 7653/18984 [07:45<10:19, 18.30it/s]

{'loss': 11.5422, 'grad_norm': 6.23160982131958, 'learning_rate': 1.1940581542351454e-05, 'epoch': 1.21}


 40%|████      | 7663/18984 [07:46<10:19, 18.27it/s]

{'loss': 11.5295, 'grad_norm': 6.2636895179748535, 'learning_rate': 1.1930046354825117e-05, 'epoch': 1.21}


 40%|████      | 7673/18984 [07:47<10:30, 17.94it/s]

{'loss': 11.5442, 'grad_norm': 6.30283260345459, 'learning_rate': 1.1919511167298779e-05, 'epoch': 1.21}


 40%|████      | 7683/18984 [07:47<10:36, 17.74it/s]

{'loss': 11.5546, 'grad_norm': 6.226770401000977, 'learning_rate': 1.1908975979772442e-05, 'epoch': 1.21}


 41%|████      | 7693/18984 [07:48<10:28, 17.96it/s]

{'loss': 11.5638, 'grad_norm': 6.17182731628418, 'learning_rate': 1.1898440792246101e-05, 'epoch': 1.22}


 41%|████      | 7703/18984 [07:48<10:19, 18.20it/s]

{'loss': 11.5303, 'grad_norm': 6.146017551422119, 'learning_rate': 1.1887905604719764e-05, 'epoch': 1.22}


 41%|████      | 7713/18984 [07:49<10:09, 18.49it/s]

{'loss': 11.551, 'grad_norm': 6.157796859741211, 'learning_rate': 1.1877370417193427e-05, 'epoch': 1.22}


 41%|████      | 7723/18984 [07:49<10:27, 17.96it/s]

{'loss': 11.569, 'grad_norm': 6.2794270515441895, 'learning_rate': 1.1866835229667089e-05, 'epoch': 1.22}


 41%|████      | 7733/18984 [07:50<10:20, 18.13it/s]

{'loss': 11.5658, 'grad_norm': 6.358717441558838, 'learning_rate': 1.1856300042140752e-05, 'epoch': 1.22}


 41%|████      | 7743/18984 [07:50<10:29, 17.85it/s]

{'loss': 11.5947, 'grad_norm': 6.358982086181641, 'learning_rate': 1.1845764854614411e-05, 'epoch': 1.22}


 41%|████      | 7753/18984 [07:51<10:21, 18.08it/s]

{'loss': 11.5657, 'grad_norm': 6.374664783477783, 'learning_rate': 1.1835229667088074e-05, 'epoch': 1.22}


 41%|████      | 7763/18984 [07:52<10:14, 18.27it/s]

{'loss': 11.5831, 'grad_norm': 6.384713172912598, 'learning_rate': 1.1824694479561737e-05, 'epoch': 1.23}


 41%|████      | 7773/18984 [07:52<10:10, 18.37it/s]

{'loss': 11.5868, 'grad_norm': 6.360019683837891, 'learning_rate': 1.1814159292035399e-05, 'epoch': 1.23}


 41%|████      | 7783/18984 [07:53<10:09, 18.37it/s]

{'loss': 11.584, 'grad_norm': 6.272343635559082, 'learning_rate': 1.1803624104509062e-05, 'epoch': 1.23}


 41%|████      | 7793/18984 [07:53<10:35, 17.60it/s]

{'loss': 11.6031, 'grad_norm': 6.18996000289917, 'learning_rate': 1.1793088916982722e-05, 'epoch': 1.23}


 41%|████      | 7803/18984 [07:54<10:33, 17.66it/s]

{'loss': 11.5838, 'grad_norm': 6.125385761260986, 'learning_rate': 1.1782553729456385e-05, 'epoch': 1.23}


 41%|████      | 7813/18984 [07:54<10:40, 17.45it/s]

{'loss': 11.5843, 'grad_norm': 6.059473991394043, 'learning_rate': 1.1772018541930048e-05, 'epoch': 1.23}


 41%|████      | 7823/18984 [07:55<10:31, 17.68it/s]

{'loss': 11.5786, 'grad_norm': 5.975889205932617, 'learning_rate': 1.1761483354403709e-05, 'epoch': 1.24}


 41%|████▏     | 7833/18984 [07:55<10:31, 17.67it/s]

{'loss': 11.602, 'grad_norm': 5.755411148071289, 'learning_rate': 1.1750948166877372e-05, 'epoch': 1.24}


 41%|████▏     | 7843/18984 [07:56<10:22, 17.90it/s]

{'loss': 11.5927, 'grad_norm': 5.557784557342529, 'learning_rate': 1.1740412979351032e-05, 'epoch': 1.24}


 41%|████▏     | 7853/18984 [07:57<10:17, 18.01it/s]

{'loss': 11.5795, 'grad_norm': 5.270325660705566, 'learning_rate': 1.1729877791824695e-05, 'epoch': 1.24}


 41%|████▏     | 7863/18984 [07:57<10:07, 18.29it/s]

{'loss': 11.5602, 'grad_norm': 4.879794597625732, 'learning_rate': 1.1719342604298358e-05, 'epoch': 1.24}


 41%|████▏     | 7873/18984 [07:58<10:21, 17.87it/s]

{'loss': 11.5119, 'grad_norm': 5.286187648773193, 'learning_rate': 1.1708807416772019e-05, 'epoch': 1.24}


 42%|████▏     | 7883/18984 [07:58<10:28, 17.66it/s]

{'loss': 11.4925, 'grad_norm': 5.765294551849365, 'learning_rate': 1.1698272229245682e-05, 'epoch': 1.25}


 42%|████▏     | 7893/18984 [07:59<10:29, 17.62it/s]

{'loss': 11.4879, 'grad_norm': 5.936576843261719, 'learning_rate': 1.1687737041719342e-05, 'epoch': 1.25}


 42%|████▏     | 7903/18984 [07:59<10:27, 17.67it/s]

{'loss': 11.5004, 'grad_norm': 5.9737372398376465, 'learning_rate': 1.1677201854193005e-05, 'epoch': 1.25}


 42%|████▏     | 7913/18984 [08:00<10:32, 17.52it/s]

{'loss': 11.523, 'grad_norm': 6.039190292358398, 'learning_rate': 1.1666666666666668e-05, 'epoch': 1.25}


 42%|████▏     | 7923/18984 [08:01<10:20, 17.82it/s]

{'loss': 11.5293, 'grad_norm': 6.006556510925293, 'learning_rate': 1.165613147914033e-05, 'epoch': 1.25}


 42%|████▏     | 7933/18984 [08:01<10:21, 17.79it/s]

{'loss': 11.5418, 'grad_norm': 6.125289440155029, 'learning_rate': 1.1645596291613992e-05, 'epoch': 1.25}


 42%|████▏     | 7943/18984 [08:02<10:09, 18.11it/s]

{'loss': 11.5379, 'grad_norm': 6.172372817993164, 'learning_rate': 1.1635061104087652e-05, 'epoch': 1.25}


 42%|████▏     | 7953/18984 [08:02<10:13, 17.98it/s]

{'loss': 11.5178, 'grad_norm': 6.29172945022583, 'learning_rate': 1.1624525916561315e-05, 'epoch': 1.26}


 42%|████▏     | 7963/18984 [08:03<10:07, 18.15it/s]

{'loss': 11.5382, 'grad_norm': 6.129814147949219, 'learning_rate': 1.1613990729034978e-05, 'epoch': 1.26}


 42%|████▏     | 7973/18984 [08:03<10:26, 17.59it/s]

{'loss': 11.5409, 'grad_norm': 6.136831760406494, 'learning_rate': 1.160345554150864e-05, 'epoch': 1.26}


 42%|████▏     | 7983/18984 [08:04<10:06, 18.14it/s]

{'loss': 11.5411, 'grad_norm': 6.120018482208252, 'learning_rate': 1.1592920353982302e-05, 'epoch': 1.26}


 42%|████▏     | 7993/18984 [08:04<10:11, 17.99it/s]

{'loss': 11.5395, 'grad_norm': 6.095518112182617, 'learning_rate': 1.1582385166455962e-05, 'epoch': 1.26}


 42%|████▏     | 8000/18984 [08:05<10:08, 18.05it/s]

{'loss': 11.5593, 'grad_norm': 6.11324405670166, 'learning_rate': 1.1571849978929625e-05, 'epoch': 1.26}


 42%|████▏     | 8013/18984 [08:08<18:30,  9.88it/s]

{'loss': 11.5639, 'grad_norm': 5.986815452575684, 'learning_rate': 1.1561314791403288e-05, 'epoch': 1.27}


 42%|████▏     | 8023/18984 [08:09<11:45, 15.55it/s]

{'loss': 11.521, 'grad_norm': 4.9866108894348145, 'learning_rate': 1.155077960387695e-05, 'epoch': 1.27}


 42%|████▏     | 8033/18984 [08:09<10:34, 17.26it/s]

{'loss': 11.5377, 'grad_norm': 5.297416687011719, 'learning_rate': 1.1540244416350613e-05, 'epoch': 1.27}


 42%|████▏     | 8043/18984 [08:10<10:24, 17.51it/s]

{'loss': 11.5278, 'grad_norm': 5.493922233581543, 'learning_rate': 1.1529709228824276e-05, 'epoch': 1.27}


 42%|████▏     | 8053/18984 [08:10<10:00, 18.20it/s]

{'loss': 11.5737, 'grad_norm': 5.638643741607666, 'learning_rate': 1.1519174041297935e-05, 'epoch': 1.27}


 42%|████▏     | 8063/18984 [08:11<10:12, 17.83it/s]

{'loss': 11.5568, 'grad_norm': 5.357171535491943, 'learning_rate': 1.1508638853771598e-05, 'epoch': 1.27}


 43%|████▎     | 8073/18984 [08:11<10:02, 18.11it/s]

{'loss': 11.5752, 'grad_norm': 4.7251877784729, 'learning_rate': 1.149810366624526e-05, 'epoch': 1.28}


 43%|████▎     | 8083/18984 [08:12<09:56, 18.28it/s]

{'loss': 11.5478, 'grad_norm': 4.959751129150391, 'learning_rate': 1.1487568478718923e-05, 'epoch': 1.28}


 43%|████▎     | 8093/18984 [08:13<10:06, 17.95it/s]

{'loss': 11.5553, 'grad_norm': 4.914914608001709, 'learning_rate': 1.1477033291192586e-05, 'epoch': 1.28}


 43%|████▎     | 8103/18984 [08:13<10:08, 17.88it/s]

{'loss': 11.574, 'grad_norm': 5.02999210357666, 'learning_rate': 1.1466498103666245e-05, 'epoch': 1.28}


 43%|████▎     | 8113/18984 [08:14<10:17, 17.62it/s]

{'loss': 11.5703, 'grad_norm': 4.980316162109375, 'learning_rate': 1.1455962916139908e-05, 'epoch': 1.28}


 43%|████▎     | 8123/18984 [08:14<09:57, 18.16it/s]

{'loss': 11.5447, 'grad_norm': 4.2712883949279785, 'learning_rate': 1.144542772861357e-05, 'epoch': 1.28}


 43%|████▎     | 8133/18984 [08:15<10:03, 17.99it/s]

{'loss': 11.5423, 'grad_norm': 3.6270596981048584, 'learning_rate': 1.1434892541087233e-05, 'epoch': 1.28}


 43%|████▎     | 8143/18984 [08:15<10:04, 17.94it/s]

{'loss': 11.545, 'grad_norm': 3.6963419914245605, 'learning_rate': 1.1424357353560896e-05, 'epoch': 1.29}


 43%|████▎     | 8153/18984 [08:16<10:04, 17.93it/s]

{'loss': 11.5362, 'grad_norm': 4.194551467895508, 'learning_rate': 1.1413822166034556e-05, 'epoch': 1.29}


 43%|████▎     | 8163/18984 [08:16<10:12, 17.67it/s]

{'loss': 11.5471, 'grad_norm': 4.779338836669922, 'learning_rate': 1.1403286978508219e-05, 'epoch': 1.29}


 43%|████▎     | 8173/18984 [08:17<10:05, 17.87it/s]

{'loss': 11.534, 'grad_norm': 5.380198955535889, 'learning_rate': 1.139275179098188e-05, 'epoch': 1.29}


 43%|████▎     | 8183/18984 [08:18<09:57, 18.09it/s]

{'loss': 11.5493, 'grad_norm': 5.404332160949707, 'learning_rate': 1.1382216603455543e-05, 'epoch': 1.29}


 43%|████▎     | 8193/18984 [08:18<09:49, 18.32it/s]

{'loss': 11.5262, 'grad_norm': 5.640345573425293, 'learning_rate': 1.1371681415929206e-05, 'epoch': 1.29}


 43%|████▎     | 8203/18984 [08:19<10:01, 17.91it/s]

{'loss': 11.51, 'grad_norm': 5.839875221252441, 'learning_rate': 1.1361146228402866e-05, 'epoch': 1.3}


 43%|████▎     | 8213/18984 [08:19<09:44, 18.44it/s]

{'loss': 11.55, 'grad_norm': 5.92060661315918, 'learning_rate': 1.1350611040876529e-05, 'epoch': 1.3}


 43%|████▎     | 8223/18984 [08:20<09:37, 18.62it/s]

{'loss': 11.5565, 'grad_norm': 5.996257781982422, 'learning_rate': 1.134007585335019e-05, 'epoch': 1.3}


 43%|████▎     | 8233/18984 [08:20<09:42, 18.47it/s]

{'loss': 11.5468, 'grad_norm': 6.081951141357422, 'learning_rate': 1.1329540665823853e-05, 'epoch': 1.3}


 43%|████▎     | 8243/18984 [08:21<09:56, 18.00it/s]

{'loss': 11.5754, 'grad_norm': 6.085567474365234, 'learning_rate': 1.1319005478297516e-05, 'epoch': 1.3}


 43%|████▎     | 8253/18984 [08:21<09:47, 18.26it/s]

{'loss': 11.5411, 'grad_norm': 6.0677666664123535, 'learning_rate': 1.1308470290771176e-05, 'epoch': 1.3}


 44%|████▎     | 8263/18984 [08:22<09:42, 18.42it/s]

{'loss': 11.5458, 'grad_norm': 6.107664108276367, 'learning_rate': 1.1297935103244839e-05, 'epoch': 1.31}


 44%|████▎     | 8273/18984 [08:23<09:54, 18.02it/s]

{'loss': 11.5618, 'grad_norm': 6.12069845199585, 'learning_rate': 1.12873999157185e-05, 'epoch': 1.31}


 44%|████▎     | 8283/18984 [08:23<09:49, 18.14it/s]

{'loss': 11.5761, 'grad_norm': 6.096598148345947, 'learning_rate': 1.1276864728192163e-05, 'epoch': 1.31}


 44%|████▎     | 8293/18984 [08:24<09:52, 18.05it/s]

{'loss': 11.5589, 'grad_norm': 5.956881523132324, 'learning_rate': 1.1266329540665826e-05, 'epoch': 1.31}


 44%|████▎     | 8303/18984 [08:24<10:06, 17.61it/s]

{'loss': 11.5871, 'grad_norm': 5.97127103805542, 'learning_rate': 1.1255794353139486e-05, 'epoch': 1.31}


 44%|████▍     | 8313/18984 [08:25<09:51, 18.04it/s]

{'loss': 11.5792, 'grad_norm': 5.73501443862915, 'learning_rate': 1.1245259165613149e-05, 'epoch': 1.31}


 44%|████▍     | 8323/18984 [08:25<10:04, 17.63it/s]

{'loss': 11.5959, 'grad_norm': 5.631643772125244, 'learning_rate': 1.123472397808681e-05, 'epoch': 1.31}


 44%|████▍     | 8333/18984 [08:26<09:51, 17.99it/s]

{'loss': 11.5867, 'grad_norm': 5.303243637084961, 'learning_rate': 1.1224188790560473e-05, 'epoch': 1.32}


 44%|████▍     | 8343/18984 [08:26<09:48, 18.07it/s]

{'loss': 11.5966, 'grad_norm': 4.8090314865112305, 'learning_rate': 1.1213653603034137e-05, 'epoch': 1.32}


 44%|████▍     | 8353/18984 [08:27<09:52, 17.95it/s]

{'loss': 11.5696, 'grad_norm': 4.41431999206543, 'learning_rate': 1.1203118415507796e-05, 'epoch': 1.32}


 44%|████▍     | 8363/18984 [08:28<09:38, 18.35it/s]

{'loss': 11.5073, 'grad_norm': 5.311333656311035, 'learning_rate': 1.119258322798146e-05, 'epoch': 1.32}


 44%|████▍     | 8373/18984 [08:28<09:47, 18.07it/s]

{'loss': 11.5264, 'grad_norm': 5.545373439788818, 'learning_rate': 1.118204804045512e-05, 'epoch': 1.32}


 44%|████▍     | 8383/18984 [08:29<09:29, 18.60it/s]

{'loss': 11.5075, 'grad_norm': 5.49139404296875, 'learning_rate': 1.1171512852928784e-05, 'epoch': 1.32}


 44%|████▍     | 8393/18984 [08:29<09:34, 18.45it/s]

{'loss': 11.5214, 'grad_norm': 5.565073013305664, 'learning_rate': 1.1160977665402447e-05, 'epoch': 1.33}


 44%|████▍     | 8403/18984 [08:30<09:46, 18.03it/s]

{'loss': 11.5095, 'grad_norm': 5.522054672241211, 'learning_rate': 1.1150442477876106e-05, 'epoch': 1.33}


 44%|████▍     | 8413/18984 [08:30<09:36, 18.32it/s]

{'loss': 11.5421, 'grad_norm': 5.519513130187988, 'learning_rate': 1.113990729034977e-05, 'epoch': 1.33}


 44%|████▍     | 8423/18984 [08:31<10:06, 17.41it/s]

{'loss': 11.5403, 'grad_norm': 5.635406970977783, 'learning_rate': 1.112937210282343e-05, 'epoch': 1.33}


 44%|████▍     | 8433/18984 [08:31<09:44, 18.04it/s]

{'loss': 11.5299, 'grad_norm': 5.630114555358887, 'learning_rate': 1.1118836915297094e-05, 'epoch': 1.33}


 44%|████▍     | 8443/18984 [08:32<09:39, 18.19it/s]

{'loss': 11.5365, 'grad_norm': 5.623133659362793, 'learning_rate': 1.1108301727770757e-05, 'epoch': 1.33}


 45%|████▍     | 8453/18984 [08:33<09:50, 17.83it/s]

{'loss': 11.5193, 'grad_norm': 5.6398749351501465, 'learning_rate': 1.1097766540244416e-05, 'epoch': 1.34}


 45%|████▍     | 8463/18984 [08:33<09:29, 18.47it/s]

{'loss': 11.5266, 'grad_norm': 5.711526870727539, 'learning_rate': 1.108723135271808e-05, 'epoch': 1.34}


 45%|████▍     | 8473/18984 [08:34<09:31, 18.38it/s]

{'loss': 11.522, 'grad_norm': 5.774631977081299, 'learning_rate': 1.1076696165191741e-05, 'epoch': 1.34}


 45%|████▍     | 8483/18984 [08:34<09:55, 17.63it/s]

{'loss': 11.5318, 'grad_norm': 5.7606096267700195, 'learning_rate': 1.1066160977665404e-05, 'epoch': 1.34}


 45%|████▍     | 8493/18984 [08:35<09:28, 18.46it/s]

{'loss': 11.5378, 'grad_norm': 5.793381214141846, 'learning_rate': 1.1055625790139067e-05, 'epoch': 1.34}


 45%|████▍     | 8500/18984 [08:35<09:31, 18.35it/s]

{'loss': 11.5266, 'grad_norm': 5.8247971534729, 'learning_rate': 1.1045090602612727e-05, 'epoch': 1.34}


 45%|████▍     | 8513/18984 [08:38<17:47,  9.81it/s]

{'loss': 11.5366, 'grad_norm': 5.7842254638671875, 'learning_rate': 1.103455541508639e-05, 'epoch': 1.34}


 45%|████▍     | 8523/18984 [08:39<11:03, 15.77it/s]

{'loss': 11.578, 'grad_norm': 5.774494171142578, 'learning_rate': 1.1024020227560051e-05, 'epoch': 1.35}


 45%|████▍     | 8533/18984 [08:40<09:52, 17.65it/s]

{'loss': 11.5611, 'grad_norm': 5.806774139404297, 'learning_rate': 1.1013485040033714e-05, 'epoch': 1.35}


 45%|████▌     | 8543/18984 [08:40<09:44, 17.85it/s]

{'loss': 11.5639, 'grad_norm': 5.826408863067627, 'learning_rate': 1.1002949852507377e-05, 'epoch': 1.35}


 45%|████▌     | 8553/18984 [08:41<09:24, 18.47it/s]

{'loss': 11.5708, 'grad_norm': 5.581186294555664, 'learning_rate': 1.0992414664981037e-05, 'epoch': 1.35}


 45%|████▌     | 8563/18984 [08:41<09:19, 18.62it/s]

{'loss': 11.5619, 'grad_norm': 5.407035827636719, 'learning_rate': 1.09818794774547e-05, 'epoch': 1.35}


 45%|████▌     | 8573/18984 [08:42<09:23, 18.47it/s]

{'loss': 11.5656, 'grad_norm': 5.3111653327941895, 'learning_rate': 1.0971344289928361e-05, 'epoch': 1.35}


 45%|████▌     | 8583/18984 [08:42<09:34, 18.11it/s]

{'loss': 11.5743, 'grad_norm': 5.0161309242248535, 'learning_rate': 1.0960809102402024e-05, 'epoch': 1.36}


 45%|████▌     | 8593/18984 [08:43<09:33, 18.12it/s]

{'loss': 11.5545, 'grad_norm': 4.436641693115234, 'learning_rate': 1.0950273914875687e-05, 'epoch': 1.36}


 45%|████▌     | 8603/18984 [08:43<09:57, 17.38it/s]

{'loss': 11.5517, 'grad_norm': 4.637186527252197, 'learning_rate': 1.0939738727349347e-05, 'epoch': 1.36}


 45%|████▌     | 8613/18984 [08:44<09:50, 17.58it/s]

{'loss': 11.5517, 'grad_norm': 5.2349677085876465, 'learning_rate': 1.092920353982301e-05, 'epoch': 1.36}


 45%|████▌     | 8623/18984 [08:45<09:44, 17.74it/s]

{'loss': 11.5271, 'grad_norm': 5.536937713623047, 'learning_rate': 1.0918668352296671e-05, 'epoch': 1.36}


 45%|████▌     | 8633/18984 [08:45<09:53, 17.45it/s]

{'loss': 11.5163, 'grad_norm': 5.889328956604004, 'learning_rate': 1.0908133164770334e-05, 'epoch': 1.36}


 46%|████▌     | 8643/18984 [08:46<09:28, 18.20it/s]

{'loss': 11.505, 'grad_norm': 6.059207439422607, 'learning_rate': 1.0897597977243997e-05, 'epoch': 1.37}


 46%|████▌     | 8653/18984 [08:46<09:27, 18.22it/s]

{'loss': 11.5622, 'grad_norm': 6.1075758934021, 'learning_rate': 1.0887062789717657e-05, 'epoch': 1.37}


 46%|████▌     | 8663/18984 [08:47<09:40, 17.78it/s]

{'loss': 11.5259, 'grad_norm': 5.990057468414307, 'learning_rate': 1.087652760219132e-05, 'epoch': 1.37}


 46%|████▌     | 8673/18984 [08:47<09:48, 17.53it/s]

{'loss': 11.5279, 'grad_norm': 5.922715187072754, 'learning_rate': 1.0865992414664981e-05, 'epoch': 1.37}


 46%|████▌     | 8683/18984 [08:48<09:45, 17.61it/s]

{'loss': 11.5496, 'grad_norm': 5.957425594329834, 'learning_rate': 1.0855457227138645e-05, 'epoch': 1.37}


 46%|████▌     | 8693/18984 [08:48<09:33, 17.95it/s]

{'loss': 11.5269, 'grad_norm': 6.029913425445557, 'learning_rate': 1.0844922039612308e-05, 'epoch': 1.37}


 46%|████▌     | 8703/18984 [08:49<09:28, 18.07it/s]

{'loss': 11.5517, 'grad_norm': 6.018980026245117, 'learning_rate': 1.0834386852085967e-05, 'epoch': 1.37}


 46%|████▌     | 8713/18984 [08:50<09:40, 17.70it/s]

{'loss': 11.5664, 'grad_norm': 5.864125728607178, 'learning_rate': 1.082385166455963e-05, 'epoch': 1.38}


 46%|████▌     | 8723/18984 [08:50<09:32, 17.93it/s]

{'loss': 11.5465, 'grad_norm': 5.691434860229492, 'learning_rate': 1.0813316477033292e-05, 'epoch': 1.38}


 46%|████▌     | 8733/18984 [08:51<09:27, 18.07it/s]

{'loss': 11.5597, 'grad_norm': 5.280123710632324, 'learning_rate': 1.0802781289506955e-05, 'epoch': 1.38}


 46%|████▌     | 8743/18984 [08:51<09:23, 18.17it/s]

{'loss': 11.5274, 'grad_norm': 5.427305221557617, 'learning_rate': 1.0792246101980616e-05, 'epoch': 1.38}


 46%|████▌     | 8753/18984 [08:52<09:49, 17.35it/s]

{'loss': 11.539, 'grad_norm': 5.584980010986328, 'learning_rate': 1.0781710914454277e-05, 'epoch': 1.38}


 46%|████▌     | 8763/18984 [08:52<09:27, 18.01it/s]

{'loss': 11.5451, 'grad_norm': 5.035991191864014, 'learning_rate': 1.077117572692794e-05, 'epoch': 1.38}


 46%|████▌     | 8773/18984 [08:53<09:21, 18.18it/s]

{'loss': 11.5585, 'grad_norm': 4.64633321762085, 'learning_rate': 1.0760640539401602e-05, 'epoch': 1.39}


 46%|████▋     | 8783/18984 [08:53<09:14, 18.40it/s]

{'loss': 11.5542, 'grad_norm': 4.688498497009277, 'learning_rate': 1.0750105351875265e-05, 'epoch': 1.39}


 46%|████▋     | 8793/18984 [08:54<09:15, 18.35it/s]

{'loss': 11.5601, 'grad_norm': 4.562958717346191, 'learning_rate': 1.0739570164348926e-05, 'epoch': 1.39}


 46%|████▋     | 8803/18984 [08:55<09:11, 18.47it/s]

{'loss': 11.55, 'grad_norm': 4.301084518432617, 'learning_rate': 1.0729034976822588e-05, 'epoch': 1.39}


 46%|████▋     | 8813/18984 [08:55<09:13, 18.36it/s]

{'loss': 11.5701, 'grad_norm': 4.081503391265869, 'learning_rate': 1.071849978929625e-05, 'epoch': 1.39}


 46%|████▋     | 8823/18984 [08:56<09:26, 17.95it/s]

{'loss': 11.5636, 'grad_norm': 3.909741163253784, 'learning_rate': 1.0707964601769914e-05, 'epoch': 1.39}


 47%|████▋     | 8833/18984 [08:56<09:17, 18.19it/s]

{'loss': 11.5317, 'grad_norm': 4.183611869812012, 'learning_rate': 1.0697429414243575e-05, 'epoch': 1.4}


 47%|████▋     | 8843/18984 [08:57<09:12, 18.36it/s]

{'loss': 11.5661, 'grad_norm': 4.6325364112854, 'learning_rate': 1.0686894226717236e-05, 'epoch': 1.4}


 47%|████▋     | 8853/18984 [08:57<09:17, 18.18it/s]

{'loss': 11.5467, 'grad_norm': 4.803328990936279, 'learning_rate': 1.0676359039190898e-05, 'epoch': 1.4}


 47%|████▋     | 8863/18984 [08:58<09:21, 18.04it/s]

{'loss': 11.5422, 'grad_norm': 4.450186252593994, 'learning_rate': 1.066582385166456e-05, 'epoch': 1.4}


 47%|████▋     | 8873/18984 [08:58<09:19, 18.07it/s]

{'loss': 11.5246, 'grad_norm': 4.031712055206299, 'learning_rate': 1.0655288664138224e-05, 'epoch': 1.4}


 47%|████▋     | 8883/18984 [08:59<09:24, 17.88it/s]

{'loss': 11.5413, 'grad_norm': 4.570611953735352, 'learning_rate': 1.0644753476611885e-05, 'epoch': 1.4}


 47%|████▋     | 8893/18984 [09:00<09:33, 17.61it/s]

{'loss': 11.524, 'grad_norm': 4.78211784362793, 'learning_rate': 1.0634218289085546e-05, 'epoch': 1.4}


 47%|████▋     | 8903/18984 [09:00<09:27, 17.77it/s]

{'loss': 11.5279, 'grad_norm': 4.750027656555176, 'learning_rate': 1.0623683101559208e-05, 'epoch': 1.41}


 47%|████▋     | 8913/18984 [09:01<09:13, 18.19it/s]

{'loss': 11.5066, 'grad_norm': 4.9069743156433105, 'learning_rate': 1.061314791403287e-05, 'epoch': 1.41}


 47%|████▋     | 8923/18984 [09:01<09:16, 18.06it/s]

{'loss': 11.5192, 'grad_norm': 4.848837852478027, 'learning_rate': 1.0602612726506534e-05, 'epoch': 1.41}


 47%|████▋     | 8933/18984 [09:02<09:16, 18.05it/s]

{'loss': 11.5263, 'grad_norm': 5.072946548461914, 'learning_rate': 1.0592077538980195e-05, 'epoch': 1.41}


 47%|████▋     | 8943/18984 [09:02<09:08, 18.29it/s]

{'loss': 11.5509, 'grad_norm': 4.946922779083252, 'learning_rate': 1.0581542351453857e-05, 'epoch': 1.41}


 47%|████▋     | 8953/18984 [09:03<09:07, 18.33it/s]

{'loss': 11.5355, 'grad_norm': 4.958719730377197, 'learning_rate': 1.0571007163927518e-05, 'epoch': 1.41}


 47%|████▋     | 8963/18984 [09:03<09:14, 18.08it/s]

{'loss': 11.5413, 'grad_norm': 4.9652419090271, 'learning_rate': 1.0560471976401181e-05, 'epoch': 1.42}


 47%|████▋     | 8973/18984 [09:04<09:17, 17.95it/s]

{'loss': 11.5469, 'grad_norm': 5.266510009765625, 'learning_rate': 1.0549936788874844e-05, 'epoch': 1.42}


 47%|████▋     | 8983/18984 [09:05<09:17, 17.94it/s]

{'loss': 11.5539, 'grad_norm': 5.109410762786865, 'learning_rate': 1.0539401601348505e-05, 'epoch': 1.42}


 47%|████▋     | 8993/18984 [09:05<09:04, 18.35it/s]

{'loss': 11.5657, 'grad_norm': 5.172743797302246, 'learning_rate': 1.0528866413822167e-05, 'epoch': 1.42}


 47%|████▋     | 9000/18984 [09:05<09:03, 18.37it/s]

{'loss': 11.5672, 'grad_norm': 4.689064025878906, 'learning_rate': 1.0518331226295828e-05, 'epoch': 1.42}


 47%|████▋     | 9013/18984 [09:09<16:44,  9.92it/s]

{'loss': 11.5651, 'grad_norm': 4.683358192443848, 'learning_rate': 1.0507796038769491e-05, 'epoch': 1.42}


 48%|████▊     | 9023/18984 [09:09<10:33, 15.73it/s]

{'loss': 11.5718, 'grad_norm': 4.5718207359313965, 'learning_rate': 1.0497260851243154e-05, 'epoch': 1.43}


 48%|████▊     | 9033/18984 [09:10<09:28, 17.49it/s]

{'loss': 11.5498, 'grad_norm': 4.638716220855713, 'learning_rate': 1.0486725663716814e-05, 'epoch': 1.43}


 48%|████▊     | 9043/18984 [09:10<09:06, 18.18it/s]

{'loss': 11.5643, 'grad_norm': 4.733968257904053, 'learning_rate': 1.0476190476190477e-05, 'epoch': 1.43}


 48%|████▊     | 9053/18984 [09:11<09:25, 17.55it/s]

{'loss': 11.5543, 'grad_norm': 4.978190898895264, 'learning_rate': 1.0465655288664138e-05, 'epoch': 1.43}


 48%|████▊     | 9063/18984 [09:12<09:07, 18.12it/s]

{'loss': 11.5636, 'grad_norm': 5.070303440093994, 'learning_rate': 1.0455120101137801e-05, 'epoch': 1.43}


 48%|████▊     | 9073/18984 [09:12<09:10, 17.99it/s]

{'loss': 11.5548, 'grad_norm': 4.960079193115234, 'learning_rate': 1.0444584913611464e-05, 'epoch': 1.43}


 48%|████▊     | 9083/18984 [09:13<08:55, 18.50it/s]

{'loss': 11.5288, 'grad_norm': 4.494243621826172, 'learning_rate': 1.0434049726085124e-05, 'epoch': 1.43}


 48%|████▊     | 9093/18984 [09:13<09:11, 17.93it/s]

{'loss': 11.5495, 'grad_norm': 4.403951168060303, 'learning_rate': 1.0423514538558787e-05, 'epoch': 1.44}


 48%|████▊     | 9103/18984 [09:14<09:08, 18.01it/s]

{'loss': 11.5145, 'grad_norm': 4.4017863273620605, 'learning_rate': 1.0412979351032448e-05, 'epoch': 1.44}


 48%|████▊     | 9113/18984 [09:14<09:22, 17.54it/s]

{'loss': 11.5105, 'grad_norm': 4.825895309448242, 'learning_rate': 1.0402444163506111e-05, 'epoch': 1.44}


 48%|████▊     | 9123/18984 [09:15<09:11, 17.89it/s]

{'loss': 11.5371, 'grad_norm': 4.946898937225342, 'learning_rate': 1.0391908975979774e-05, 'epoch': 1.44}


 48%|████▊     | 9133/18984 [09:15<09:03, 18.12it/s]

{'loss': 11.5168, 'grad_norm': 4.984652042388916, 'learning_rate': 1.0381373788453434e-05, 'epoch': 1.44}


 48%|████▊     | 9143/18984 [09:16<09:09, 17.92it/s]

{'loss': 11.5187, 'grad_norm': 4.984367370605469, 'learning_rate': 1.0370838600927097e-05, 'epoch': 1.44}


 48%|████▊     | 9153/18984 [09:17<08:52, 18.45it/s]

{'loss': 11.5418, 'grad_norm': 4.899346828460693, 'learning_rate': 1.0360303413400759e-05, 'epoch': 1.45}


 48%|████▊     | 9163/18984 [09:17<08:56, 18.31it/s]

{'loss': 11.5167, 'grad_norm': 4.992956638336182, 'learning_rate': 1.0349768225874422e-05, 'epoch': 1.45}


 48%|████▊     | 9173/18984 [09:18<08:58, 18.20it/s]

{'loss': 11.5419, 'grad_norm': 5.072480201721191, 'learning_rate': 1.0339233038348085e-05, 'epoch': 1.45}


 48%|████▊     | 9183/18984 [09:18<09:02, 18.05it/s]

{'loss': 11.5457, 'grad_norm': 5.013040065765381, 'learning_rate': 1.0328697850821744e-05, 'epoch': 1.45}


 48%|████▊     | 9193/18984 [09:19<09:07, 17.87it/s]

{'loss': 11.5416, 'grad_norm': 4.888092994689941, 'learning_rate': 1.0318162663295407e-05, 'epoch': 1.45}


 48%|████▊     | 9203/18984 [09:19<09:15, 17.61it/s]

{'loss': 11.5426, 'grad_norm': 4.814177989959717, 'learning_rate': 1.0307627475769069e-05, 'epoch': 1.45}


 49%|████▊     | 9213/18984 [09:20<09:07, 17.86it/s]

{'loss': 11.5471, 'grad_norm': 4.852218151092529, 'learning_rate': 1.0297092288242732e-05, 'epoch': 1.46}


 49%|████▊     | 9223/18984 [09:20<08:58, 18.14it/s]

{'loss': 11.5673, 'grad_norm': 4.611086368560791, 'learning_rate': 1.0286557100716395e-05, 'epoch': 1.46}


 49%|████▊     | 9233/18984 [09:21<09:14, 17.60it/s]

{'loss': 11.5497, 'grad_norm': 4.851500034332275, 'learning_rate': 1.0276021913190054e-05, 'epoch': 1.46}


 49%|████▊     | 9243/18984 [09:22<09:02, 17.95it/s]

{'loss': 11.5523, 'grad_norm': 4.790828227996826, 'learning_rate': 1.0265486725663717e-05, 'epoch': 1.46}


 49%|████▊     | 9253/18984 [09:22<08:50, 18.34it/s]

{'loss': 11.5417, 'grad_norm': 4.900382995605469, 'learning_rate': 1.0254951538137379e-05, 'epoch': 1.46}


 49%|████▉     | 9263/18984 [09:23<08:47, 18.42it/s]

{'loss': 11.5754, 'grad_norm': 5.0432047843933105, 'learning_rate': 1.0244416350611042e-05, 'epoch': 1.46}


 49%|████▉     | 9273/18984 [09:23<09:02, 17.91it/s]

{'loss': 11.5833, 'grad_norm': 4.550020694732666, 'learning_rate': 1.0233881163084705e-05, 'epoch': 1.46}


 49%|████▉     | 9283/18984 [09:24<09:08, 17.68it/s]

{'loss': 11.5833, 'grad_norm': 4.02275276184082, 'learning_rate': 1.0223345975558365e-05, 'epoch': 1.47}


 49%|████▉     | 9293/18984 [09:24<09:02, 17.86it/s]

{'loss': 11.5428, 'grad_norm': 3.8330910205841064, 'learning_rate': 1.0212810788032028e-05, 'epoch': 1.47}


 49%|████▉     | 9303/18984 [09:25<09:03, 17.81it/s]

{'loss': 11.5451, 'grad_norm': 3.9162473678588867, 'learning_rate': 1.0202275600505689e-05, 'epoch': 1.47}


 49%|████▉     | 9313/18984 [09:25<08:48, 18.28it/s]

{'loss': 11.5537, 'grad_norm': 4.4280524253845215, 'learning_rate': 1.0191740412979352e-05, 'epoch': 1.47}


 49%|████▉     | 9323/18984 [09:26<08:45, 18.38it/s]

{'loss': 11.5487, 'grad_norm': 4.781182289123535, 'learning_rate': 1.0181205225453015e-05, 'epoch': 1.47}


 49%|████▉     | 9333/18984 [09:27<08:43, 18.43it/s]

{'loss': 11.5274, 'grad_norm': 4.947895526885986, 'learning_rate': 1.0170670037926675e-05, 'epoch': 1.47}


 49%|████▉     | 9343/18984 [09:27<08:49, 18.22it/s]

{'loss': 11.5434, 'grad_norm': 5.023636817932129, 'learning_rate': 1.0160134850400338e-05, 'epoch': 1.48}


 49%|████▉     | 9353/18984 [09:28<08:40, 18.49it/s]

{'loss': 11.5653, 'grad_norm': 5.127773284912109, 'learning_rate': 1.0149599662873999e-05, 'epoch': 1.48}


 49%|████▉     | 9363/18984 [09:28<08:54, 18.00it/s]

{'loss': 11.5201, 'grad_norm': 5.23052453994751, 'learning_rate': 1.0139064475347662e-05, 'epoch': 1.48}


 49%|████▉     | 9373/18984 [09:29<08:53, 18.01it/s]

{'loss': 11.5392, 'grad_norm': 5.251288890838623, 'learning_rate': 1.0128529287821325e-05, 'epoch': 1.48}


 49%|████▉     | 9383/18984 [09:29<08:55, 17.94it/s]

{'loss': 11.5266, 'grad_norm': 5.2959303855896, 'learning_rate': 1.0117994100294985e-05, 'epoch': 1.48}


 49%|████▉     | 9393/18984 [09:30<08:49, 18.11it/s]

{'loss': 11.5382, 'grad_norm': 5.365468502044678, 'learning_rate': 1.0107458912768648e-05, 'epoch': 1.48}


 50%|████▉     | 9403/18984 [09:30<09:01, 17.69it/s]

{'loss': 11.5379, 'grad_norm': 5.295530319213867, 'learning_rate': 1.009692372524231e-05, 'epoch': 1.49}


 50%|████▉     | 9413/18984 [09:31<09:00, 17.70it/s]

{'loss': 11.546, 'grad_norm': 5.277470111846924, 'learning_rate': 1.0086388537715972e-05, 'epoch': 1.49}


 50%|████▉     | 9423/18984 [09:32<08:56, 17.82it/s]

{'loss': 11.5615, 'grad_norm': 5.160449981689453, 'learning_rate': 1.0075853350189635e-05, 'epoch': 1.49}


 50%|████▉     | 9433/18984 [09:32<08:58, 17.74it/s]

{'loss': 11.5749, 'grad_norm': 5.044431686401367, 'learning_rate': 1.0065318162663295e-05, 'epoch': 1.49}


 50%|████▉     | 9443/18984 [09:33<09:03, 17.54it/s]

{'loss': 11.5531, 'grad_norm': 4.944745063781738, 'learning_rate': 1.0054782975136958e-05, 'epoch': 1.49}


 50%|████▉     | 9453/18984 [09:33<09:00, 17.65it/s]

{'loss': 11.5766, 'grad_norm': 4.68666410446167, 'learning_rate': 1.004424778761062e-05, 'epoch': 1.49}


 50%|████▉     | 9463/18984 [09:34<08:42, 18.22it/s]

{'loss': 11.5573, 'grad_norm': 4.524714946746826, 'learning_rate': 1.0033712600084282e-05, 'epoch': 1.49}


 50%|████▉     | 9473/18984 [09:34<08:44, 18.12it/s]

{'loss': 11.5484, 'grad_norm': 5.018959045410156, 'learning_rate': 1.0023177412557946e-05, 'epoch': 1.5}


 50%|████▉     | 9483/18984 [09:35<08:49, 17.95it/s]

{'loss': 11.5353, 'grad_norm': 5.236486911773682, 'learning_rate': 1.0012642225031605e-05, 'epoch': 1.5}


 50%|█████     | 9493/18984 [09:35<08:46, 18.04it/s]

{'loss': 11.5364, 'grad_norm': 5.235049247741699, 'learning_rate': 1.0002107037505268e-05, 'epoch': 1.5}


 50%|█████     | 9500/18984 [09:36<08:39, 18.26it/s]

{'loss': 11.5532, 'grad_norm': 5.243793964385986, 'learning_rate': 9.991571849978931e-06, 'epoch': 1.5}


 50%|█████     | 9513/18984 [09:39<15:50,  9.96it/s]

{'loss': 11.5525, 'grad_norm': 5.152530193328857, 'learning_rate': 9.981036662452593e-06, 'epoch': 1.5}


 50%|█████     | 9523/18984 [09:40<10:01, 15.74it/s]

{'loss': 11.526, 'grad_norm': 5.191780090332031, 'learning_rate': 9.970501474926254e-06, 'epoch': 1.5}


 50%|█████     | 9533/18984 [09:40<08:53, 17.73it/s]

{'loss': 11.5157, 'grad_norm': 5.316463470458984, 'learning_rate': 9.959966287399917e-06, 'epoch': 1.51}


 50%|█████     | 9543/18984 [09:41<08:49, 17.84it/s]

{'loss': 11.4912, 'grad_norm': 5.445953845977783, 'learning_rate': 9.949431099873578e-06, 'epoch': 1.51}


 50%|█████     | 9553/18984 [09:41<09:02, 17.38it/s]

{'loss': 11.5105, 'grad_norm': 5.460787296295166, 'learning_rate': 9.938895912347241e-06, 'epoch': 1.51}


 50%|█████     | 9563/18984 [09:42<08:54, 17.63it/s]

{'loss': 11.5144, 'grad_norm': 4.829361915588379, 'learning_rate': 9.928360724820903e-06, 'epoch': 1.51}


 50%|█████     | 9573/18984 [09:43<08:37, 18.20it/s]

{'loss': 11.5356, 'grad_norm': 5.768972396850586, 'learning_rate': 9.917825537294564e-06, 'epoch': 1.51}


 50%|█████     | 9584/18984 [09:43<08:25, 18.60it/s]

{'loss': 11.509, 'grad_norm': 5.600894451141357, 'learning_rate': 9.907290349768227e-06, 'epoch': 1.51}


 51%|█████     | 9594/18984 [09:44<08:40, 18.04it/s]

{'loss': 11.5244, 'grad_norm': 4.900505065917969, 'learning_rate': 9.896755162241889e-06, 'epoch': 1.52}


 51%|█████     | 9604/18984 [09:44<08:46, 17.83it/s]

{'loss': 11.5103, 'grad_norm': 4.211315155029297, 'learning_rate': 9.886219974715552e-06, 'epoch': 1.52}


 51%|█████     | 9614/18984 [09:45<08:50, 17.67it/s]

{'loss': 11.5229, 'grad_norm': 3.557610034942627, 'learning_rate': 9.875684787189213e-06, 'epoch': 1.52}


 51%|█████     | 9624/18984 [09:45<08:50, 17.66it/s]

{'loss': 11.5396, 'grad_norm': 3.052128791809082, 'learning_rate': 9.865149599662874e-06, 'epoch': 1.52}


 51%|█████     | 9634/18984 [09:46<08:51, 17.59it/s]

{'loss': 11.5302, 'grad_norm': 2.7384743690490723, 'learning_rate': 9.854614412136537e-06, 'epoch': 1.52}


 51%|█████     | 9644/18984 [09:47<08:40, 17.95it/s]

{'loss': 11.5141, 'grad_norm': 3.1072850227355957, 'learning_rate': 9.844079224610199e-06, 'epoch': 1.52}


 51%|█████     | 9654/18984 [09:47<08:35, 18.08it/s]

{'loss': 11.5246, 'grad_norm': 3.412757396697998, 'learning_rate': 9.833544037083862e-06, 'epoch': 1.52}


 51%|█████     | 9664/18984 [09:48<08:27, 18.35it/s]

{'loss': 11.5233, 'grad_norm': 3.2432985305786133, 'learning_rate': 9.823008849557523e-06, 'epoch': 1.53}


 51%|█████     | 9674/18984 [09:48<08:40, 17.88it/s]

{'loss': 11.5303, 'grad_norm': 2.943225145339966, 'learning_rate': 9.812473662031184e-06, 'epoch': 1.53}


 51%|█████     | 9684/18984 [09:49<08:35, 18.04it/s]

{'loss': 11.5196, 'grad_norm': 2.860433578491211, 'learning_rate': 9.801938474504847e-06, 'epoch': 1.53}


 51%|█████     | 9694/18984 [09:49<08:37, 17.94it/s]

{'loss': 11.5261, 'grad_norm': 2.697296619415283, 'learning_rate': 9.791403286978509e-06, 'epoch': 1.53}


 51%|█████     | 9704/18984 [09:50<08:32, 18.10it/s]

{'loss': 11.5005, 'grad_norm': 2.7086660861968994, 'learning_rate': 9.78086809945217e-06, 'epoch': 1.53}


 51%|█████     | 9714/18984 [09:50<08:28, 18.23it/s]

{'loss': 11.5159, 'grad_norm': 2.7865405082702637, 'learning_rate': 9.770332911925833e-06, 'epoch': 1.53}


 51%|█████     | 9724/18984 [09:51<08:32, 18.08it/s]

{'loss': 11.5193, 'grad_norm': 3.036858081817627, 'learning_rate': 9.759797724399495e-06, 'epoch': 1.54}


 51%|█████▏    | 9734/18984 [09:52<08:34, 17.96it/s]

{'loss': 11.5244, 'grad_norm': 2.8875679969787598, 'learning_rate': 9.749262536873158e-06, 'epoch': 1.54}


 51%|█████▏    | 9744/18984 [09:52<08:26, 18.25it/s]

{'loss': 11.5383, 'grad_norm': 2.9273970127105713, 'learning_rate': 9.738727349346819e-06, 'epoch': 1.54}


 51%|█████▏    | 9754/18984 [09:53<08:26, 18.23it/s]

{'loss': 11.539, 'grad_norm': 3.0244204998016357, 'learning_rate': 9.72819216182048e-06, 'epoch': 1.54}


 51%|█████▏    | 9764/18984 [09:53<08:26, 18.19it/s]

{'loss': 11.5465, 'grad_norm': 3.130011796951294, 'learning_rate': 9.717656974294143e-06, 'epoch': 1.54}


 51%|█████▏    | 9774/18984 [09:54<08:30, 18.02it/s]

{'loss': 11.5445, 'grad_norm': 3.0037779808044434, 'learning_rate': 9.707121786767806e-06, 'epoch': 1.54}


 52%|█████▏    | 9784/18984 [09:54<08:30, 18.01it/s]

{'loss': 11.5342, 'grad_norm': 3.122734546661377, 'learning_rate': 9.696586599241468e-06, 'epoch': 1.55}


 52%|█████▏    | 9794/18984 [09:55<08:30, 18.01it/s]

{'loss': 11.5209, 'grad_norm': 3.064790725708008, 'learning_rate': 9.686051411715129e-06, 'epoch': 1.55}


 52%|█████▏    | 9804/18984 [09:55<08:10, 18.70it/s]

{'loss': 11.5309, 'grad_norm': 3.176046371459961, 'learning_rate': 9.67551622418879e-06, 'epoch': 1.55}


 52%|█████▏    | 9814/18984 [09:56<08:24, 18.19it/s]

{'loss': 11.5346, 'grad_norm': 3.2868049144744873, 'learning_rate': 9.664981036662453e-06, 'epoch': 1.55}


 52%|█████▏    | 9824/18984 [09:56<08:27, 18.03it/s]

{'loss': 11.5419, 'grad_norm': 3.182225465774536, 'learning_rate': 9.654445849136117e-06, 'epoch': 1.55}


 52%|█████▏    | 9834/18984 [09:57<08:27, 18.04it/s]

{'loss': 11.545, 'grad_norm': 3.1737895011901855, 'learning_rate': 9.643910661609778e-06, 'epoch': 1.55}


 52%|█████▏    | 9844/18984 [09:58<08:17, 18.35it/s]

{'loss': 11.532, 'grad_norm': 3.2407734394073486, 'learning_rate': 9.63337547408344e-06, 'epoch': 1.55}


 52%|█████▏    | 9854/18984 [09:58<08:24, 18.08it/s]

{'loss': 11.5522, 'grad_norm': 3.0905380249023438, 'learning_rate': 9.6228402865571e-06, 'epoch': 1.56}


 52%|█████▏    | 9864/18984 [09:59<08:19, 18.27it/s]

{'loss': 11.5344, 'grad_norm': 3.1267592906951904, 'learning_rate': 9.612305099030764e-06, 'epoch': 1.56}


 52%|█████▏    | 9874/18984 [09:59<08:12, 18.48it/s]

{'loss': 11.5413, 'grad_norm': 3.0652801990509033, 'learning_rate': 9.601769911504427e-06, 'epoch': 1.56}


 52%|█████▏    | 9884/18984 [10:00<08:16, 18.32it/s]

{'loss': 11.5468, 'grad_norm': 2.8402018547058105, 'learning_rate': 9.591234723978088e-06, 'epoch': 1.56}


 52%|█████▏    | 9894/18984 [10:00<08:16, 18.31it/s]

{'loss': 11.5539, 'grad_norm': 2.7881758213043213, 'learning_rate': 9.58069953645175e-06, 'epoch': 1.56}


 52%|█████▏    | 9904/18984 [10:01<08:19, 18.17it/s]

{'loss': 11.5373, 'grad_norm': 2.7199184894561768, 'learning_rate': 9.57016434892541e-06, 'epoch': 1.56}


 52%|█████▏    | 9914/18984 [10:01<08:17, 18.25it/s]

{'loss': 11.5342, 'grad_norm': 2.4800305366516113, 'learning_rate': 9.559629161399074e-06, 'epoch': 1.57}


 52%|█████▏    | 9924/18984 [10:02<08:13, 18.35it/s]

{'loss': 11.5309, 'grad_norm': 2.625746488571167, 'learning_rate': 9.549093973872737e-06, 'epoch': 1.57}


 52%|█████▏    | 9934/18984 [10:02<08:16, 18.24it/s]

{'loss': 11.5351, 'grad_norm': 2.8587379455566406, 'learning_rate': 9.538558786346398e-06, 'epoch': 1.57}


 52%|█████▏    | 9944/18984 [10:03<08:25, 17.88it/s]

{'loss': 11.5358, 'grad_norm': 3.2375168800354004, 'learning_rate': 9.52802359882006e-06, 'epoch': 1.57}


 52%|█████▏    | 9954/18984 [10:04<08:26, 17.83it/s]

{'loss': 11.5241, 'grad_norm': 3.8972272872924805, 'learning_rate': 9.517488411293721e-06, 'epoch': 1.57}


 52%|█████▏    | 9964/18984 [10:04<08:22, 17.96it/s]

{'loss': 11.526, 'grad_norm': 3.9345622062683105, 'learning_rate': 9.506953223767384e-06, 'epoch': 1.57}


 53%|█████▎    | 9974/18984 [10:05<08:13, 18.26it/s]

{'loss': 11.5298, 'grad_norm': 3.3274121284484863, 'learning_rate': 9.496418036241047e-06, 'epoch': 1.58}


 53%|█████▎    | 9984/18984 [10:05<08:08, 18.41it/s]

{'loss': 11.5405, 'grad_norm': 3.195371627807617, 'learning_rate': 9.485882848714708e-06, 'epoch': 1.58}


 53%|█████▎    | 9994/18984 [10:06<08:18, 18.05it/s]

{'loss': 11.5314, 'grad_norm': 3.1781504154205322, 'learning_rate': 9.47534766118837e-06, 'epoch': 1.58}


 53%|█████▎    | 10000/18984 [10:06<08:15, 18.12it/s]

{'loss': 11.533, 'grad_norm': 3.2071714401245117, 'learning_rate': 9.464812473662031e-06, 'epoch': 1.58}


 53%|█████▎    | 10012/18984 [10:09<18:32,  8.07it/s]

{'loss': 11.5334, 'grad_norm': 3.1520659923553467, 'learning_rate': 9.454277286135694e-06, 'epoch': 1.58}


 53%|█████▎    | 10024/18984 [10:10<09:35, 15.57it/s]

{'loss': 11.5493, 'grad_norm': 3.1178133487701416, 'learning_rate': 9.443742098609357e-06, 'epoch': 1.58}


 53%|█████▎    | 10034/18984 [10:11<08:19, 17.92it/s]

{'loss': 11.5359, 'grad_norm': 3.1073899269104004, 'learning_rate': 9.433206911083018e-06, 'epoch': 1.59}


 53%|█████▎    | 10042/18984 [10:11<08:23, 17.77it/s]

{'loss': 11.5338, 'grad_norm': 3.0524938106536865, 'learning_rate': 9.42267172355668e-06, 'epoch': 1.59}


 53%|█████▎    | 10054/18984 [10:12<08:13, 18.09it/s]

{'loss': 11.5411, 'grad_norm': 2.972383499145508, 'learning_rate': 9.412136536030341e-06, 'epoch': 1.59}


 53%|█████▎    | 10064/18984 [10:12<08:11, 18.14it/s]

{'loss': 11.5363, 'grad_norm': 3.0735652446746826, 'learning_rate': 9.401601348504004e-06, 'epoch': 1.59}


 53%|█████▎    | 10074/18984 [10:13<08:08, 18.24it/s]

{'loss': 11.5419, 'grad_norm': 2.7689743041992188, 'learning_rate': 9.391066160977667e-06, 'epoch': 1.59}


 53%|█████▎    | 10084/18984 [10:13<08:09, 18.18it/s]

{'loss': 11.5377, 'grad_norm': 2.860208511352539, 'learning_rate': 9.380530973451329e-06, 'epoch': 1.59}


 53%|█████▎    | 10094/18984 [10:14<07:59, 18.52it/s]

{'loss': 11.5416, 'grad_norm': 2.67891526222229, 'learning_rate': 9.36999578592499e-06, 'epoch': 1.59}


 53%|█████▎    | 10104/18984 [10:14<08:04, 18.32it/s]

{'loss': 11.5406, 'grad_norm': 2.7077434062957764, 'learning_rate': 9.359460598398651e-06, 'epoch': 1.6}


 53%|█████▎    | 10114/18984 [10:15<08:08, 18.15it/s]

{'loss': 11.5446, 'grad_norm': 2.787000894546509, 'learning_rate': 9.348925410872314e-06, 'epoch': 1.6}


 53%|█████▎    | 10124/18984 [10:16<08:10, 18.06it/s]

{'loss': 11.5328, 'grad_norm': 2.8248605728149414, 'learning_rate': 9.338390223345977e-06, 'epoch': 1.6}


 53%|█████▎    | 10134/18984 [10:16<08:10, 18.03it/s]

{'loss': 11.5422, 'grad_norm': 3.0852603912353516, 'learning_rate': 9.327855035819639e-06, 'epoch': 1.6}


 53%|█████▎    | 10144/18984 [10:17<08:11, 17.99it/s]

{'loss': 11.5279, 'grad_norm': 3.328549385070801, 'learning_rate': 9.3173198482933e-06, 'epoch': 1.6}


 53%|█████▎    | 10154/18984 [10:17<08:17, 17.74it/s]

{'loss': 11.5349, 'grad_norm': 3.811983346939087, 'learning_rate': 9.306784660766961e-06, 'epoch': 1.6}


 54%|█████▎    | 10164/18984 [10:18<08:05, 18.15it/s]

{'loss': 11.5404, 'grad_norm': 4.340025901794434, 'learning_rate': 9.296249473240625e-06, 'epoch': 1.61}


 54%|█████▎    | 10174/18984 [10:18<08:07, 18.07it/s]

{'loss': 11.5434, 'grad_norm': 4.229137897491455, 'learning_rate': 9.285714285714288e-06, 'epoch': 1.61}


 54%|█████▎    | 10184/18984 [10:19<08:03, 18.22it/s]

{'loss': 11.5296, 'grad_norm': 4.061923027038574, 'learning_rate': 9.275179098187949e-06, 'epoch': 1.61}


 54%|█████▎    | 10194/18984 [10:19<07:57, 18.39it/s]

{'loss': 11.5377, 'grad_norm': 4.422486305236816, 'learning_rate': 9.26464391066161e-06, 'epoch': 1.61}


 54%|█████▍    | 10204/18984 [10:20<08:00, 18.28it/s]

{'loss': 11.5596, 'grad_norm': 4.528146743774414, 'learning_rate': 9.254108723135272e-06, 'epoch': 1.61}


 54%|█████▍    | 10214/18984 [10:21<07:55, 18.45it/s]

{'loss': 11.5446, 'grad_norm': 4.342454433441162, 'learning_rate': 9.243573535608935e-06, 'epoch': 1.61}


 54%|█████▍    | 10224/18984 [10:21<08:03, 18.13it/s]

{'loss': 11.5549, 'grad_norm': 4.202686786651611, 'learning_rate': 9.233038348082598e-06, 'epoch': 1.62}


 54%|█████▍    | 10234/18984 [10:22<07:53, 18.49it/s]

{'loss': 11.5703, 'grad_norm': 4.0458784103393555, 'learning_rate': 9.222503160556259e-06, 'epoch': 1.62}


 54%|█████▍    | 10244/18984 [10:22<07:48, 18.65it/s]

{'loss': 11.5602, 'grad_norm': 3.9982194900512695, 'learning_rate': 9.21196797302992e-06, 'epoch': 1.62}


 54%|█████▍    | 10254/18984 [10:23<08:03, 18.06it/s]

{'loss': 11.5749, 'grad_norm': 3.8920531272888184, 'learning_rate': 9.201432785503582e-06, 'epoch': 1.62}


 54%|█████▍    | 10262/18984 [10:23<08:13, 17.66it/s]

{'loss': 11.5561, 'grad_norm': 3.6673638820648193, 'learning_rate': 9.190897597977245e-06, 'epoch': 1.62}


 54%|█████▍    | 10274/18984 [10:24<07:57, 18.25it/s]

{'loss': 11.5616, 'grad_norm': 2.938788652420044, 'learning_rate': 9.180362410450908e-06, 'epoch': 1.62}


 54%|█████▍    | 10284/18984 [10:24<08:10, 17.72it/s]

{'loss': 11.5515, 'grad_norm': 2.6473255157470703, 'learning_rate': 9.16982722292457e-06, 'epoch': 1.62}


 54%|█████▍    | 10294/18984 [10:25<08:04, 17.94it/s]

{'loss': 11.5197, 'grad_norm': 3.1492795944213867, 'learning_rate': 9.15929203539823e-06, 'epoch': 1.63}


 54%|█████▍    | 10304/18984 [10:26<08:13, 17.59it/s]

{'loss': 11.5138, 'grad_norm': 3.406400680541992, 'learning_rate': 9.148756847871892e-06, 'epoch': 1.63}


 54%|█████▍    | 10314/18984 [10:26<07:59, 18.08it/s]

{'loss': 11.5151, 'grad_norm': 3.7139182090759277, 'learning_rate': 9.138221660345555e-06, 'epoch': 1.63}


 54%|█████▍    | 10324/18984 [10:27<07:51, 18.37it/s]

{'loss': 11.5219, 'grad_norm': 3.825672149658203, 'learning_rate': 9.127686472819218e-06, 'epoch': 1.63}


 54%|█████▍    | 10334/18984 [10:27<07:58, 18.09it/s]

{'loss': 11.514, 'grad_norm': 3.968616485595703, 'learning_rate': 9.11715128529288e-06, 'epoch': 1.63}


 54%|█████▍    | 10344/18984 [10:28<07:52, 18.29it/s]

{'loss': 11.5229, 'grad_norm': 4.1313276290893555, 'learning_rate': 9.10661609776654e-06, 'epoch': 1.63}


 55%|█████▍    | 10354/18984 [10:28<07:47, 18.47it/s]

{'loss': 11.5233, 'grad_norm': 4.520686626434326, 'learning_rate': 9.096080910240202e-06, 'epoch': 1.64}


 55%|█████▍    | 10364/18984 [10:29<07:53, 18.21it/s]

{'loss': 11.5187, 'grad_norm': 4.8711113929748535, 'learning_rate': 9.085545722713865e-06, 'epoch': 1.64}


 55%|█████▍    | 10374/18984 [10:29<07:51, 18.25it/s]

{'loss': 11.5302, 'grad_norm': 5.120185852050781, 'learning_rate': 9.075010535187526e-06, 'epoch': 1.64}


 55%|█████▍    | 10384/18984 [10:30<08:04, 17.77it/s]

{'loss': 11.5504, 'grad_norm': 5.155350685119629, 'learning_rate': 9.06447534766119e-06, 'epoch': 1.64}


 55%|█████▍    | 10394/18984 [10:31<07:51, 18.23it/s]

{'loss': 11.5632, 'grad_norm': 5.166878700256348, 'learning_rate': 9.053940160134851e-06, 'epoch': 1.64}


 55%|█████▍    | 10404/18984 [10:31<07:50, 18.25it/s]

{'loss': 11.5445, 'grad_norm': 4.831222057342529, 'learning_rate': 9.043404972608512e-06, 'epoch': 1.64}


 55%|█████▍    | 10414/18984 [10:32<07:47, 18.33it/s]

{'loss': 11.5668, 'grad_norm': 4.683725833892822, 'learning_rate': 9.032869785082175e-06, 'epoch': 1.65}


 55%|█████▍    | 10424/18984 [10:32<07:49, 18.24it/s]

{'loss': 11.5523, 'grad_norm': 4.3941969871521, 'learning_rate': 9.022334597555837e-06, 'epoch': 1.65}


 55%|█████▍    | 10434/18984 [10:33<07:57, 17.91it/s]

{'loss': 11.5496, 'grad_norm': 4.306339740753174, 'learning_rate': 9.0117994100295e-06, 'epoch': 1.65}


 55%|█████▌    | 10444/18984 [10:33<07:45, 18.36it/s]

{'loss': 11.5328, 'grad_norm': 3.8295400142669678, 'learning_rate': 9.001264222503161e-06, 'epoch': 1.65}


 55%|█████▌    | 10454/18984 [10:34<07:56, 17.90it/s]

{'loss': 11.5291, 'grad_norm': 4.36838960647583, 'learning_rate': 8.990729034976822e-06, 'epoch': 1.65}


 55%|█████▌    | 10464/18984 [10:34<07:52, 18.03it/s]

{'loss': 11.5202, 'grad_norm': 6.082516193389893, 'learning_rate': 8.980193847450485e-06, 'epoch': 1.65}


 55%|█████▌    | 10474/18984 [10:35<07:55, 17.91it/s]

{'loss': 11.5188, 'grad_norm': 6.399726867675781, 'learning_rate': 8.969658659924147e-06, 'epoch': 1.65}


 55%|█████▌    | 10484/18984 [10:36<07:56, 17.83it/s]

{'loss': 11.5122, 'grad_norm': 6.403289794921875, 'learning_rate': 8.95912347239781e-06, 'epoch': 1.66}


 55%|█████▌    | 10494/18984 [10:36<07:48, 18.13it/s]

{'loss': 11.53, 'grad_norm': 6.4416728019714355, 'learning_rate': 8.948588284871471e-06, 'epoch': 1.66}


 55%|█████▌    | 10500/18984 [10:36<07:39, 18.45it/s]

{'loss': 11.5132, 'grad_norm': 6.418084621429443, 'learning_rate': 8.938053097345133e-06, 'epoch': 1.66}


 55%|█████▌    | 10514/18984 [10:40<14:18,  9.87it/s]

{'loss': 11.5623, 'grad_norm': 5.953082084655762, 'learning_rate': 8.927517909818796e-06, 'epoch': 1.66}


 55%|█████▌    | 10524/18984 [10:40<08:57, 15.74it/s]

{'loss': 11.5293, 'grad_norm': 3.5843849182128906, 'learning_rate': 8.916982722292457e-06, 'epoch': 1.66}


 55%|█████▌    | 10534/18984 [10:41<08:09, 17.26it/s]

{'loss': 11.5248, 'grad_norm': 3.919740676879883, 'learning_rate': 8.90644753476612e-06, 'epoch': 1.66}


 56%|█████▌    | 10544/18984 [10:41<07:53, 17.83it/s]

{'loss': 11.5219, 'grad_norm': 6.060231685638428, 'learning_rate': 8.895912347239781e-06, 'epoch': 1.67}


 56%|█████▌    | 10554/18984 [10:42<07:51, 17.90it/s]

{'loss': 11.5344, 'grad_norm': 5.9988508224487305, 'learning_rate': 8.885377159713444e-06, 'epoch': 1.67}


 56%|█████▌    | 10564/18984 [10:43<07:48, 17.97it/s]

{'loss': 11.528, 'grad_norm': 6.128851890563965, 'learning_rate': 8.874841972187106e-06, 'epoch': 1.67}


 56%|█████▌    | 10574/18984 [10:43<07:36, 18.42it/s]

{'loss': 11.5132, 'grad_norm': 6.147186756134033, 'learning_rate': 8.864306784660767e-06, 'epoch': 1.67}


 56%|█████▌    | 10582/18984 [10:44<07:46, 18.02it/s]

{'loss': 11.5657, 'grad_norm': 5.48877477645874, 'learning_rate': 8.85377159713443e-06, 'epoch': 1.67}


 56%|█████▌    | 10594/18984 [10:44<07:52, 17.75it/s]

{'loss': 11.527, 'grad_norm': 5.365156173706055, 'learning_rate': 8.843236409608091e-06, 'epoch': 1.67}


 56%|█████▌    | 10604/18984 [10:45<07:43, 18.07it/s]

{'loss': 11.5578, 'grad_norm': 5.34183931350708, 'learning_rate': 8.832701222081754e-06, 'epoch': 1.68}


 56%|█████▌    | 10614/18984 [10:45<07:44, 18.02it/s]

{'loss': 11.5641, 'grad_norm': 4.933505535125732, 'learning_rate': 8.822166034555416e-06, 'epoch': 1.68}


 56%|█████▌    | 10624/18984 [10:46<07:42, 18.07it/s]

{'loss': 11.5587, 'grad_norm': 4.705964088439941, 'learning_rate': 8.811630847029077e-06, 'epoch': 1.68}


 56%|█████▌    | 10634/18984 [10:46<07:46, 17.90it/s]

{'loss': 11.5459, 'grad_norm': 4.7802734375, 'learning_rate': 8.80109565950274e-06, 'epoch': 1.68}


 56%|█████▌    | 10644/18984 [10:47<07:34, 18.34it/s]

{'loss': 11.5666, 'grad_norm': 4.228231430053711, 'learning_rate': 8.790560471976402e-06, 'epoch': 1.68}


 56%|█████▌    | 10654/18984 [10:48<07:40, 18.08it/s]

{'loss': 11.5275, 'grad_norm': 3.8753366470336914, 'learning_rate': 8.780025284450065e-06, 'epoch': 1.68}


 56%|█████▌    | 10664/18984 [10:48<07:49, 17.71it/s]

{'loss': 11.5063, 'grad_norm': 4.694995403289795, 'learning_rate': 8.769490096923726e-06, 'epoch': 1.68}


 56%|█████▌    | 10674/18984 [10:49<07:44, 17.90it/s]

{'loss': 11.4717, 'grad_norm': 5.09813928604126, 'learning_rate': 8.758954909397387e-06, 'epoch': 1.69}


 56%|█████▋    | 10684/18984 [10:49<07:37, 18.16it/s]

{'loss': 11.4793, 'grad_norm': 5.340395927429199, 'learning_rate': 8.74841972187105e-06, 'epoch': 1.69}


 56%|█████▋    | 10694/18984 [10:50<07:38, 18.10it/s]

{'loss': 11.4901, 'grad_norm': 5.4658589363098145, 'learning_rate': 8.737884534344712e-06, 'epoch': 1.69}


 56%|█████▋    | 10704/18984 [10:50<07:41, 17.95it/s]

{'loss': 11.5005, 'grad_norm': 5.736547470092773, 'learning_rate': 8.727349346818375e-06, 'epoch': 1.69}


 56%|█████▋    | 10714/18984 [10:51<07:41, 17.91it/s]

{'loss': 11.5043, 'grad_norm': 5.758473873138428, 'learning_rate': 8.716814159292036e-06, 'epoch': 1.69}


 56%|█████▋    | 10724/18984 [10:51<07:30, 18.32it/s]

{'loss': 11.5324, 'grad_norm': 5.607614517211914, 'learning_rate': 8.706278971765697e-06, 'epoch': 1.69}


 57%|█████▋    | 10734/18984 [10:52<07:33, 18.19it/s]

{'loss': 11.5186, 'grad_norm': 5.469541549682617, 'learning_rate': 8.69574378423936e-06, 'epoch': 1.7}


 57%|█████▋    | 10744/18984 [10:53<07:27, 18.41it/s]

{'loss': 11.5378, 'grad_norm': 5.336350917816162, 'learning_rate': 8.685208596713022e-06, 'epoch': 1.7}


 57%|█████▋    | 10754/18984 [10:53<07:43, 17.77it/s]

{'loss': 11.5598, 'grad_norm': 5.163914203643799, 'learning_rate': 8.674673409186685e-06, 'epoch': 1.7}


 57%|█████▋    | 10764/18984 [10:54<07:41, 17.81it/s]

{'loss': 11.5623, 'grad_norm': 5.283160209655762, 'learning_rate': 8.664138221660346e-06, 'epoch': 1.7}


 57%|█████▋    | 10774/18984 [10:54<07:43, 17.71it/s]

{'loss': 11.553, 'grad_norm': 5.363796234130859, 'learning_rate': 8.653603034134008e-06, 'epoch': 1.7}


 57%|█████▋    | 10782/18984 [10:55<07:46, 17.57it/s]

{'loss': 11.5724, 'grad_norm': 5.320961952209473, 'learning_rate': 8.64306784660767e-06, 'epoch': 1.7}


 57%|█████▋    | 10794/18984 [10:55<07:36, 17.96it/s]

{'loss': 11.5719, 'grad_norm': 5.299270153045654, 'learning_rate': 8.632532659081332e-06, 'epoch': 1.71}


 57%|█████▋    | 10804/18984 [10:56<07:36, 17.91it/s]

{'loss': 11.5698, 'grad_norm': 5.517440319061279, 'learning_rate': 8.621997471554995e-06, 'epoch': 1.71}


 57%|█████▋    | 10814/18984 [10:56<07:30, 18.13it/s]

{'loss': 11.5653, 'grad_norm': 5.045990943908691, 'learning_rate': 8.611462284028656e-06, 'epoch': 1.71}


 57%|█████▋    | 10824/18984 [10:57<07:29, 18.16it/s]

{'loss': 11.5947, 'grad_norm': 4.894885063171387, 'learning_rate': 8.600927096502318e-06, 'epoch': 1.71}


 57%|█████▋    | 10834/18984 [10:58<07:39, 17.75it/s]

{'loss': 11.5393, 'grad_norm': 4.206936836242676, 'learning_rate': 8.59039190897598e-06, 'epoch': 1.71}


 57%|█████▋    | 10844/18984 [10:58<07:38, 17.75it/s]

{'loss': 11.4974, 'grad_norm': 5.642159938812256, 'learning_rate': 8.579856721449642e-06, 'epoch': 1.71}


 57%|█████▋    | 10854/18984 [10:59<07:22, 18.38it/s]

{'loss': 11.4746, 'grad_norm': 5.773835182189941, 'learning_rate': 8.569321533923305e-06, 'epoch': 1.71}


 57%|█████▋    | 10864/18984 [10:59<07:29, 18.05it/s]

{'loss': 11.4882, 'grad_norm': 5.813100814819336, 'learning_rate': 8.558786346396967e-06, 'epoch': 1.72}


 57%|█████▋    | 10874/18984 [11:00<07:34, 17.83it/s]

{'loss': 11.5031, 'grad_norm': 5.779758930206299, 'learning_rate': 8.548251158870628e-06, 'epoch': 1.72}


 57%|█████▋    | 10884/18984 [11:00<07:32, 17.90it/s]

{'loss': 11.4952, 'grad_norm': 5.743074893951416, 'learning_rate': 8.537715971344291e-06, 'epoch': 1.72}


 57%|█████▋    | 10894/18984 [11:01<07:34, 17.81it/s]

{'loss': 11.5125, 'grad_norm': 5.727722644805908, 'learning_rate': 8.527180783817952e-06, 'epoch': 1.72}


 57%|█████▋    | 10904/18984 [11:01<07:28, 18.02it/s]

{'loss': 11.509, 'grad_norm': 5.557343006134033, 'learning_rate': 8.516645596291615e-06, 'epoch': 1.72}


 57%|█████▋    | 10914/18984 [11:02<07:24, 18.16it/s]

{'loss': 11.5195, 'grad_norm': 5.601102352142334, 'learning_rate': 8.506110408765277e-06, 'epoch': 1.72}


 58%|█████▊    | 10924/18984 [11:03<07:37, 17.61it/s]

{'loss': 11.5256, 'grad_norm': 5.519636154174805, 'learning_rate': 8.495575221238938e-06, 'epoch': 1.73}


 58%|█████▊    | 10932/18984 [11:03<07:42, 17.42it/s]

{'loss': 11.543, 'grad_norm': 5.343372344970703, 'learning_rate': 8.485040033712601e-06, 'epoch': 1.73}


 58%|█████▊    | 10944/18984 [11:04<07:31, 17.82it/s]

{'loss': 11.5506, 'grad_norm': 5.214298248291016, 'learning_rate': 8.474504846186264e-06, 'epoch': 1.73}


 58%|█████▊    | 10954/18984 [11:04<07:33, 17.72it/s]

{'loss': 11.5542, 'grad_norm': 4.736446380615234, 'learning_rate': 8.463969658659926e-06, 'epoch': 1.73}


 58%|█████▊    | 10964/18984 [11:05<07:28, 17.86it/s]

{'loss': 11.5379, 'grad_norm': 4.421371936798096, 'learning_rate': 8.453434471133587e-06, 'epoch': 1.73}


 58%|█████▊    | 10974/18984 [11:05<07:16, 18.36it/s]

{'loss': 11.5508, 'grad_norm': 4.118631839752197, 'learning_rate': 8.442899283607248e-06, 'epoch': 1.73}


 58%|█████▊    | 10984/18984 [11:06<07:20, 18.16it/s]

{'loss': 11.5171, 'grad_norm': 4.013153076171875, 'learning_rate': 8.432364096080911e-06, 'epoch': 1.74}


 58%|█████▊    | 10994/18984 [11:06<07:19, 18.19it/s]

{'loss': 11.4964, 'grad_norm': 4.2915873527526855, 'learning_rate': 8.421828908554574e-06, 'epoch': 1.74}


 58%|█████▊    | 11000/18984 [11:07<07:13, 18.41it/s]

{'loss': 11.5119, 'grad_norm': 4.863195419311523, 'learning_rate': 8.411293721028236e-06, 'epoch': 1.74}


 58%|█████▊    | 11014/18984 [11:10<13:31,  9.82it/s]

{'loss': 11.5192, 'grad_norm': 5.039172649383545, 'learning_rate': 8.400758533501897e-06, 'epoch': 1.74}


 58%|█████▊    | 11024/18984 [11:11<08:31, 15.57it/s]

{'loss': 11.5012, 'grad_norm': 5.013768672943115, 'learning_rate': 8.390223345975558e-06, 'epoch': 1.74}


 58%|█████▊    | 11034/18984 [11:11<07:29, 17.69it/s]

{'loss': 11.5214, 'grad_norm': 4.8920979499816895, 'learning_rate': 8.379688158449221e-06, 'epoch': 1.74}


 58%|█████▊    | 11044/18984 [11:12<07:21, 17.99it/s]

{'loss': 11.5508, 'grad_norm': 4.7258405685424805, 'learning_rate': 8.369152970922883e-06, 'epoch': 1.74}


 58%|█████▊    | 11054/18984 [11:12<07:25, 17.78it/s]

{'loss': 11.552, 'grad_norm': 4.998566150665283, 'learning_rate': 8.358617783396546e-06, 'epoch': 1.75}


 58%|█████▊    | 11064/18984 [11:13<07:20, 17.98it/s]

{'loss': 11.5491, 'grad_norm': 4.966271877288818, 'learning_rate': 8.348082595870207e-06, 'epoch': 1.75}


 58%|█████▊    | 11074/18984 [11:14<07:16, 18.10it/s]

{'loss': 11.5517, 'grad_norm': 4.213253021240234, 'learning_rate': 8.337547408343869e-06, 'epoch': 1.75}


 58%|█████▊    | 11084/18984 [11:14<07:18, 18.03it/s]

{'loss': 11.5207, 'grad_norm': 4.830935001373291, 'learning_rate': 8.327012220817532e-06, 'epoch': 1.75}


 58%|█████▊    | 11094/18984 [11:15<07:17, 18.03it/s]

{'loss': 11.5312, 'grad_norm': 5.585084438323975, 'learning_rate': 8.316477033291193e-06, 'epoch': 1.75}


 58%|█████▊    | 11104/18984 [11:15<07:12, 18.22it/s]

{'loss': 11.4961, 'grad_norm': 6.083871364593506, 'learning_rate': 8.305941845764856e-06, 'epoch': 1.75}


 59%|█████▊    | 11114/18984 [11:16<07:14, 18.10it/s]

{'loss': 11.526, 'grad_norm': 5.99043607711792, 'learning_rate': 8.295406658238517e-06, 'epoch': 1.76}


 59%|█████▊    | 11124/18984 [11:16<07:12, 18.17it/s]

{'loss': 11.5601, 'grad_norm': 3.958012104034424, 'learning_rate': 8.284871470712179e-06, 'epoch': 1.76}


 59%|█████▊    | 11134/18984 [11:17<07:16, 18.00it/s]

{'loss': 11.5427, 'grad_norm': 3.3342208862304688, 'learning_rate': 8.274336283185842e-06, 'epoch': 1.76}


 59%|█████▊    | 11144/18984 [11:17<07:21, 17.78it/s]

{'loss': 11.5441, 'grad_norm': 3.3980565071105957, 'learning_rate': 8.263801095659503e-06, 'epoch': 1.76}


 59%|█████▊    | 11152/18984 [11:18<07:19, 17.82it/s]

{'loss': 11.5535, 'grad_norm': 3.2235066890716553, 'learning_rate': 8.253265908133166e-06, 'epoch': 1.76}


 59%|█████▉    | 11164/18984 [11:19<07:11, 18.13it/s]

{'loss': 11.5389, 'grad_norm': 3.2903685569763184, 'learning_rate': 8.242730720606827e-06, 'epoch': 1.76}


 59%|█████▉    | 11174/18984 [11:19<06:56, 18.74it/s]

{'loss': 11.5584, 'grad_norm': 3.4866082668304443, 'learning_rate': 8.232195533080489e-06, 'epoch': 1.77}


 59%|█████▉    | 11184/18984 [11:20<07:08, 18.20it/s]

{'loss': 11.5473, 'grad_norm': 3.8112080097198486, 'learning_rate': 8.221660345554152e-06, 'epoch': 1.77}


 59%|█████▉    | 11194/18984 [11:20<07:12, 17.99it/s]

{'loss': 11.5334, 'grad_norm': 4.298910140991211, 'learning_rate': 8.211125158027813e-06, 'epoch': 1.77}


 59%|█████▉    | 11204/18984 [11:21<07:16, 17.84it/s]

{'loss': 11.5306, 'grad_norm': 4.619564056396484, 'learning_rate': 8.200589970501476e-06, 'epoch': 1.77}


 59%|█████▉    | 11214/18984 [11:21<07:10, 18.07it/s]

{'loss': 11.5328, 'grad_norm': 5.085122585296631, 'learning_rate': 8.190054782975138e-06, 'epoch': 1.77}


 59%|█████▉    | 11222/18984 [11:22<07:13, 17.91it/s]

{'loss': 11.5382, 'grad_norm': 5.3075852394104, 'learning_rate': 8.179519595448799e-06, 'epoch': 1.77}


 59%|█████▉    | 11232/18984 [11:22<07:32, 17.14it/s]

{'loss': 11.5259, 'grad_norm': 4.116803169250488, 'learning_rate': 8.168984407922462e-06, 'epoch': 1.77}


 59%|█████▉    | 11244/18984 [11:23<07:15, 17.76it/s]

{'loss': 11.5258, 'grad_norm': 4.343174934387207, 'learning_rate': 8.158449220396123e-06, 'epoch': 1.78}


 59%|█████▉    | 11254/18984 [11:24<07:00, 18.37it/s]

{'loss': 11.5196, 'grad_norm': 4.402010440826416, 'learning_rate': 8.147914032869786e-06, 'epoch': 1.78}


 59%|█████▉    | 11264/18984 [11:24<06:55, 18.57it/s]

{'loss': 11.5138, 'grad_norm': 4.910860061645508, 'learning_rate': 8.137378845343448e-06, 'epoch': 1.78}


 59%|█████▉    | 11274/18984 [11:25<06:48, 18.87it/s]

{'loss': 11.5022, 'grad_norm': 5.082664966583252, 'learning_rate': 8.126843657817109e-06, 'epoch': 1.78}


 59%|█████▉    | 11284/18984 [11:25<06:58, 18.38it/s]

{'loss': 11.5094, 'grad_norm': 5.153046131134033, 'learning_rate': 8.11630847029077e-06, 'epoch': 1.78}


 59%|█████▉    | 11294/18984 [11:26<06:58, 18.39it/s]

{'loss': 11.5302, 'grad_norm': 5.203767776489258, 'learning_rate': 8.105773282764433e-06, 'epoch': 1.78}


 60%|█████▉    | 11304/18984 [11:26<06:59, 18.32it/s]

{'loss': 11.5555, 'grad_norm': 5.132905960083008, 'learning_rate': 8.095238095238097e-06, 'epoch': 1.79}


 60%|█████▉    | 11312/18984 [11:27<07:09, 17.85it/s]

{'loss': 11.5557, 'grad_norm': 5.080596923828125, 'learning_rate': 8.084702907711758e-06, 'epoch': 1.79}


 60%|█████▉    | 11324/18984 [11:27<07:16, 17.57it/s]

{'loss': 11.5588, 'grad_norm': 5.074263095855713, 'learning_rate': 8.07416772018542e-06, 'epoch': 1.79}


 60%|█████▉    | 11334/18984 [11:28<07:08, 17.84it/s]

{'loss': 11.569, 'grad_norm': 5.185979843139648, 'learning_rate': 8.063632532659082e-06, 'epoch': 1.79}


 60%|█████▉    | 11344/18984 [11:28<07:02, 18.07it/s]

{'loss': 11.584, 'grad_norm': 5.1070966720581055, 'learning_rate': 8.053097345132744e-06, 'epoch': 1.79}


 60%|█████▉    | 11354/18984 [11:29<07:04, 17.97it/s]

{'loss': 11.5811, 'grad_norm': 4.930354595184326, 'learning_rate': 8.042562157606407e-06, 'epoch': 1.79}


 60%|█████▉    | 11364/18984 [11:30<06:59, 18.17it/s]

{'loss': 11.5686, 'grad_norm': 4.129580020904541, 'learning_rate': 8.032026970080068e-06, 'epoch': 1.8}


 60%|█████▉    | 11374/18984 [11:30<07:03, 17.97it/s]

{'loss': 11.5453, 'grad_norm': 4.309326171875, 'learning_rate': 8.02149178255373e-06, 'epoch': 1.8}


 60%|█████▉    | 11384/18984 [11:31<07:04, 17.90it/s]

{'loss': 11.546, 'grad_norm': 4.458310604095459, 'learning_rate': 8.010956595027392e-06, 'epoch': 1.8}


 60%|██████    | 11394/18984 [11:31<07:01, 17.99it/s]

{'loss': 11.5046, 'grad_norm': 4.366369247436523, 'learning_rate': 8.000421407501054e-06, 'epoch': 1.8}


 60%|██████    | 11404/18984 [11:32<07:09, 17.66it/s]

{'loss': 11.5113, 'grad_norm': 4.0716376304626465, 'learning_rate': 7.989886219974717e-06, 'epoch': 1.8}


 60%|██████    | 11414/18984 [11:32<07:03, 17.89it/s]

{'loss': 11.5009, 'grad_norm': 4.6850056648254395, 'learning_rate': 7.979351032448378e-06, 'epoch': 1.8}


 60%|██████    | 11424/18984 [11:33<06:52, 18.31it/s]

{'loss': 11.5149, 'grad_norm': 4.832932949066162, 'learning_rate': 7.96881584492204e-06, 'epoch': 1.8}


 60%|██████    | 11434/18984 [11:33<06:53, 18.26it/s]

{'loss': 11.5097, 'grad_norm': 4.9387311935424805, 'learning_rate': 7.958280657395703e-06, 'epoch': 1.81}


 60%|██████    | 11444/18984 [11:34<06:50, 18.35it/s]

{'loss': 11.5389, 'grad_norm': 4.964620113372803, 'learning_rate': 7.947745469869364e-06, 'epoch': 1.81}


 60%|██████    | 11454/18984 [11:35<06:48, 18.46it/s]

{'loss': 11.5438, 'grad_norm': 4.96120023727417, 'learning_rate': 7.937210282343027e-06, 'epoch': 1.81}


 60%|██████    | 11462/18984 [11:35<07:03, 17.78it/s]

{'loss': 11.5121, 'grad_norm': 4.813220024108887, 'learning_rate': 7.926675094816688e-06, 'epoch': 1.81}


 60%|██████    | 11474/18984 [11:36<06:56, 18.04it/s]

{'loss': 11.53, 'grad_norm': 5.08121919631958, 'learning_rate': 7.91613990729035e-06, 'epoch': 1.81}


 60%|██████    | 11484/18984 [11:36<06:51, 18.24it/s]

{'loss': 11.5512, 'grad_norm': 4.97410249710083, 'learning_rate': 7.905604719764013e-06, 'epoch': 1.81}


 61%|██████    | 11494/18984 [11:37<06:48, 18.32it/s]

{'loss': 11.5712, 'grad_norm': 5.276172637939453, 'learning_rate': 7.895069532237674e-06, 'epoch': 1.82}


 61%|██████    | 11500/18984 [11:37<06:47, 18.38it/s]

{'loss': 11.541, 'grad_norm': 4.095147132873535, 'learning_rate': 7.884534344711337e-06, 'epoch': 1.82}


 61%|██████    | 11514/18984 [11:41<12:44,  9.77it/s]

{'loss': 11.5285, 'grad_norm': 3.631824493408203, 'learning_rate': 7.873999157184998e-06, 'epoch': 1.82}


 61%|██████    | 11524/18984 [11:41<07:46, 15.99it/s]

{'loss': 11.5216, 'grad_norm': 3.4860761165618896, 'learning_rate': 7.86346396965866e-06, 'epoch': 1.82}


 61%|██████    | 11534/18984 [11:42<07:05, 17.50it/s]

{'loss': 11.5195, 'grad_norm': 3.6978840827941895, 'learning_rate': 7.852928782132323e-06, 'epoch': 1.82}


 61%|██████    | 11544/18984 [11:42<06:56, 17.87it/s]

{'loss': 11.5241, 'grad_norm': 3.718127965927124, 'learning_rate': 7.842393594605984e-06, 'epoch': 1.82}


 61%|██████    | 11554/18984 [11:43<06:55, 17.90it/s]

{'loss': 11.5387, 'grad_norm': 3.55135178565979, 'learning_rate': 7.831858407079647e-06, 'epoch': 1.83}


 61%|██████    | 11564/18984 [11:43<06:55, 17.87it/s]

{'loss': 11.5309, 'grad_norm': 3.512295961380005, 'learning_rate': 7.821323219553309e-06, 'epoch': 1.83}


 61%|██████    | 11574/18984 [11:44<06:50, 18.06it/s]

{'loss': 11.5521, 'grad_norm': 3.3074839115142822, 'learning_rate': 7.81078803202697e-06, 'epoch': 1.83}


 61%|██████    | 11584/18984 [11:44<06:52, 17.93it/s]

{'loss': 11.5299, 'grad_norm': 3.063143253326416, 'learning_rate': 7.800252844500633e-06, 'epoch': 1.83}


 61%|██████    | 11594/18984 [11:45<06:51, 17.97it/s]

{'loss': 11.5404, 'grad_norm': 2.806535005569458, 'learning_rate': 7.789717656974294e-06, 'epoch': 1.83}


 61%|██████    | 11604/18984 [11:46<06:59, 17.59it/s]

{'loss': 11.5295, 'grad_norm': 3.0657641887664795, 'learning_rate': 7.779182469447957e-06, 'epoch': 1.83}


 61%|██████    | 11614/18984 [11:46<06:49, 17.99it/s]

{'loss': 11.5201, 'grad_norm': 3.679900884628296, 'learning_rate': 7.768647281921619e-06, 'epoch': 1.83}


 61%|██████    | 11624/18984 [11:47<06:46, 18.09it/s]

{'loss': 11.5123, 'grad_norm': 4.314517974853516, 'learning_rate': 7.75811209439528e-06, 'epoch': 1.84}


 61%|██████▏   | 11634/18984 [11:47<06:54, 17.72it/s]

{'loss': 11.5348, 'grad_norm': 4.2604875564575195, 'learning_rate': 7.747576906868943e-06, 'epoch': 1.84}


 61%|██████▏   | 11642/18984 [11:48<07:11, 17.02it/s]

{'loss': 11.5376, 'grad_norm': 4.28140926361084, 'learning_rate': 7.737041719342605e-06, 'epoch': 1.84}


 61%|██████▏   | 11654/18984 [11:48<06:54, 17.70it/s]

{'loss': 11.55, 'grad_norm': 3.6287472248077393, 'learning_rate': 7.726506531816268e-06, 'epoch': 1.84}


 61%|██████▏   | 11664/18984 [11:49<06:47, 17.95it/s]

{'loss': 11.5508, 'grad_norm': 3.25132155418396, 'learning_rate': 7.715971344289929e-06, 'epoch': 1.84}


 61%|██████▏   | 11674/18984 [11:49<06:46, 17.99it/s]

{'loss': 11.5401, 'grad_norm': 3.163444995880127, 'learning_rate': 7.70543615676359e-06, 'epoch': 1.84}


 62%|██████▏   | 11684/18984 [11:50<06:42, 18.13it/s]

{'loss': 11.54, 'grad_norm': 3.1964385509490967, 'learning_rate': 7.694900969237253e-06, 'epoch': 1.85}


 62%|██████▏   | 11692/18984 [11:50<06:45, 17.98it/s]

{'loss': 11.5362, 'grad_norm': 3.2223949432373047, 'learning_rate': 7.684365781710915e-06, 'epoch': 1.85}


 62%|██████▏   | 11704/18984 [11:51<06:45, 17.94it/s]

{'loss': 11.5296, 'grad_norm': 3.4281113147735596, 'learning_rate': 7.673830594184578e-06, 'epoch': 1.85}


 62%|██████▏   | 11714/18984 [11:52<06:41, 18.11it/s]

{'loss': 11.5401, 'grad_norm': 3.576124429702759, 'learning_rate': 7.663295406658239e-06, 'epoch': 1.85}


 62%|██████▏   | 11724/18984 [11:52<06:40, 18.11it/s]

{'loss': 11.512, 'grad_norm': 4.3244805335998535, 'learning_rate': 7.652760219131902e-06, 'epoch': 1.85}


 62%|██████▏   | 11734/18984 [11:53<06:40, 18.08it/s]

{'loss': 11.5191, 'grad_norm': 5.881453990936279, 'learning_rate': 7.642225031605563e-06, 'epoch': 1.85}


 62%|██████▏   | 11744/18984 [11:53<06:42, 17.97it/s]

{'loss': 11.5187, 'grad_norm': 5.837346076965332, 'learning_rate': 7.631689844079225e-06, 'epoch': 1.86}


 62%|██████▏   | 11754/18984 [11:54<06:39, 18.09it/s]

{'loss': 11.5536, 'grad_norm': 5.870468616485596, 'learning_rate': 7.621154656552887e-06, 'epoch': 1.86}


 62%|██████▏   | 11764/18984 [11:54<06:37, 18.17it/s]

{'loss': 11.571, 'grad_norm': 5.587343215942383, 'learning_rate': 7.610619469026549e-06, 'epoch': 1.86}


 62%|██████▏   | 11774/18984 [11:55<06:25, 18.69it/s]

{'loss': 11.5598, 'grad_norm': 4.861900329589844, 'learning_rate': 7.600084281500211e-06, 'epoch': 1.86}


 62%|██████▏   | 11784/18984 [11:56<06:40, 17.99it/s]

{'loss': 11.5559, 'grad_norm': 3.853653907775879, 'learning_rate': 7.589549093973874e-06, 'epoch': 1.86}


 62%|██████▏   | 11794/18984 [11:56<06:42, 17.86it/s]

{'loss': 11.5363, 'grad_norm': 4.101428508758545, 'learning_rate': 7.579013906447536e-06, 'epoch': 1.86}


 62%|██████▏   | 11804/18984 [11:57<06:39, 17.99it/s]

{'loss': 11.5319, 'grad_norm': 4.478420734405518, 'learning_rate': 7.568478718921197e-06, 'epoch': 1.86}


 62%|██████▏   | 11814/18984 [11:57<06:33, 18.23it/s]

{'loss': 11.5319, 'grad_norm': 4.682229042053223, 'learning_rate': 7.557943531394859e-06, 'epoch': 1.87}


 62%|██████▏   | 11822/18984 [11:58<06:48, 17.53it/s]

{'loss': 11.5186, 'grad_norm': 4.774085521697998, 'learning_rate': 7.5474083438685216e-06, 'epoch': 1.87}


 62%|██████▏   | 11834/18984 [11:58<06:35, 18.10it/s]

{'loss': 11.5246, 'grad_norm': 4.9773125648498535, 'learning_rate': 7.536873156342184e-06, 'epoch': 1.87}


 62%|██████▏   | 11844/18984 [11:59<06:40, 17.85it/s]

{'loss': 11.534, 'grad_norm': 4.97785758972168, 'learning_rate': 7.526337968815846e-06, 'epoch': 1.87}


 62%|██████▏   | 11854/18984 [11:59<06:32, 18.16it/s]

{'loss': 11.5386, 'grad_norm': 4.985626697540283, 'learning_rate': 7.515802781289507e-06, 'epoch': 1.87}


 62%|██████▏   | 11864/18984 [12:00<06:32, 18.14it/s]

{'loss': 11.5494, 'grad_norm': 4.782435417175293, 'learning_rate': 7.5052675937631695e-06, 'epoch': 1.87}


 63%|██████▎   | 11872/18984 [12:00<06:48, 17.40it/s]

{'loss': 11.5348, 'grad_norm': 4.777548789978027, 'learning_rate': 7.494732406236832e-06, 'epoch': 1.88}


 63%|██████▎   | 11884/18984 [12:01<06:43, 17.58it/s]

{'loss': 11.5531, 'grad_norm': 4.651680946350098, 'learning_rate': 7.484197218710494e-06, 'epoch': 1.88}


 63%|██████▎   | 11894/18984 [12:02<06:31, 18.10it/s]

{'loss': 11.5253, 'grad_norm': 4.679718494415283, 'learning_rate': 7.473662031184155e-06, 'epoch': 1.88}


 63%|██████▎   | 11904/18984 [12:02<06:31, 18.08it/s]

{'loss': 11.5135, 'grad_norm': 4.837716579437256, 'learning_rate': 7.4631268436578175e-06, 'epoch': 1.88}


 63%|██████▎   | 11912/18984 [12:03<06:44, 17.48it/s]

{'loss': 11.522, 'grad_norm': 5.195469856262207, 'learning_rate': 7.45259165613148e-06, 'epoch': 1.88}


 63%|██████▎   | 11924/18984 [12:03<06:31, 18.04it/s]

{'loss': 11.5154, 'grad_norm': 5.309089183807373, 'learning_rate': 7.442056468605142e-06, 'epoch': 1.88}


 63%|██████▎   | 11934/18984 [12:04<06:26, 18.25it/s]

{'loss': 11.5115, 'grad_norm': 5.460261344909668, 'learning_rate': 7.431521281078804e-06, 'epoch': 1.89}


 63%|██████▎   | 11944/18984 [12:05<06:34, 17.83it/s]

{'loss': 11.5377, 'grad_norm': 5.4531965255737305, 'learning_rate': 7.420986093552465e-06, 'epoch': 1.89}


 63%|██████▎   | 11954/18984 [12:05<06:37, 17.68it/s]

{'loss': 11.5535, 'grad_norm': 5.406893253326416, 'learning_rate': 7.410450906026128e-06, 'epoch': 1.89}


 63%|██████▎   | 11964/18984 [12:06<06:22, 18.34it/s]

{'loss': 11.5643, 'grad_norm': 5.2799553871154785, 'learning_rate': 7.39991571849979e-06, 'epoch': 1.89}


 63%|██████▎   | 11974/18984 [12:06<06:24, 18.23it/s]

{'loss': 11.5562, 'grad_norm': 4.366048336029053, 'learning_rate': 7.389380530973452e-06, 'epoch': 1.89}


 63%|██████▎   | 11984/18984 [12:07<06:29, 17.99it/s]

{'loss': 11.5567, 'grad_norm': 2.62011981010437, 'learning_rate': 7.378845343447114e-06, 'epoch': 1.89}


 63%|██████▎   | 11994/18984 [12:07<06:29, 17.93it/s]

{'loss': 11.5246, 'grad_norm': 3.185195207595825, 'learning_rate': 7.3683101559207756e-06, 'epoch': 1.89}


 63%|██████▎   | 12000/18984 [12:08<06:33, 17.73it/s]

{'loss': 11.5352, 'grad_norm': 3.2727534770965576, 'learning_rate': 7.357774968394438e-06, 'epoch': 1.9}


 63%|██████▎   | 12014/18984 [12:11<11:42,  9.93it/s]

{'loss': 11.5184, 'grad_norm': 3.3173789978027344, 'learning_rate': 7.347239780868099e-06, 'epoch': 1.9}


 63%|██████▎   | 12024/18984 [12:11<07:10, 16.18it/s]

{'loss': 11.5248, 'grad_norm': 3.5394937992095947, 'learning_rate': 7.336704593341762e-06, 'epoch': 1.9}


 63%|██████▎   | 12034/18984 [12:12<06:20, 18.26it/s]

{'loss': 11.5364, 'grad_norm': 3.429522752761841, 'learning_rate': 7.326169405815424e-06, 'epoch': 1.9}


 63%|██████▎   | 12044/18984 [12:13<06:34, 17.60it/s]

{'loss': 11.5415, 'grad_norm': 3.4322454929351807, 'learning_rate': 7.315634218289086e-06, 'epoch': 1.9}


 63%|██████▎   | 12054/18984 [12:13<06:30, 17.74it/s]

{'loss': 11.5649, 'grad_norm': 3.431166410446167, 'learning_rate': 7.305099030762748e-06, 'epoch': 1.9}


 64%|██████▎   | 12064/18984 [12:14<06:30, 17.74it/s]

{'loss': 11.5363, 'grad_norm': 3.3670454025268555, 'learning_rate': 7.294563843236411e-06, 'epoch': 1.91}


 64%|██████▎   | 12074/18984 [12:14<06:32, 17.59it/s]

{'loss': 11.5651, 'grad_norm': 3.9102797508239746, 'learning_rate': 7.284028655710072e-06, 'epoch': 1.91}


 64%|██████▎   | 12084/18984 [12:15<06:18, 18.25it/s]

{'loss': 11.5598, 'grad_norm': 4.816059112548828, 'learning_rate': 7.2734934681837345e-06, 'epoch': 1.91}


 64%|██████▎   | 12094/18984 [12:15<06:27, 17.78it/s]

{'loss': 11.5495, 'grad_norm': 4.717682838439941, 'learning_rate': 7.262958280657396e-06, 'epoch': 1.91}


 64%|██████▎   | 12102/18984 [12:16<06:32, 17.52it/s]

{'loss': 11.5517, 'grad_norm': 4.890909671783447, 'learning_rate': 7.252423093131058e-06, 'epoch': 1.91}


 64%|██████▍   | 12112/18984 [12:16<06:40, 17.14it/s]

{'loss': 11.5488, 'grad_norm': 5.330929279327393, 'learning_rate': 7.241887905604721e-06, 'epoch': 1.91}


 64%|██████▍   | 12124/18984 [12:17<06:25, 17.81it/s]

{'loss': 11.5403, 'grad_norm': 5.017761707305908, 'learning_rate': 7.2313527180783824e-06, 'epoch': 1.92}


 64%|██████▍   | 12134/18984 [12:18<06:30, 17.54it/s]

{'loss': 11.5421, 'grad_norm': 4.797388553619385, 'learning_rate': 7.220817530552045e-06, 'epoch': 1.92}


 64%|██████▍   | 12144/18984 [12:18<06:21, 17.91it/s]

{'loss': 11.5064, 'grad_norm': 5.664155006408691, 'learning_rate': 7.210282343025706e-06, 'epoch': 1.92}


 64%|██████▍   | 12154/18984 [12:19<06:08, 18.53it/s]

{'loss': 11.5265, 'grad_norm': 5.428346633911133, 'learning_rate': 7.199747155499368e-06, 'epoch': 1.92}


 64%|██████▍   | 12164/18984 [12:19<06:15, 18.17it/s]

{'loss': 11.5407, 'grad_norm': 4.937008857727051, 'learning_rate': 7.189211967973031e-06, 'epoch': 1.92}


 64%|██████▍   | 12174/18984 [12:20<06:08, 18.47it/s]

{'loss': 11.5582, 'grad_norm': 4.243667125701904, 'learning_rate': 7.178676780446693e-06, 'epoch': 1.92}


 64%|██████▍   | 12184/18984 [12:20<06:15, 18.11it/s]

{'loss': 11.5432, 'grad_norm': 3.550323486328125, 'learning_rate': 7.168141592920355e-06, 'epoch': 1.92}


 64%|██████▍   | 12194/18984 [12:21<06:19, 17.88it/s]

{'loss': 11.5323, 'grad_norm': 3.3301258087158203, 'learning_rate': 7.157606405394016e-06, 'epoch': 1.93}


 64%|██████▍   | 12204/18984 [12:22<06:19, 17.86it/s]

{'loss': 11.5385, 'grad_norm': 3.3745083808898926, 'learning_rate': 7.147071217867678e-06, 'epoch': 1.93}


 64%|██████▍   | 12214/18984 [12:22<06:27, 17.47it/s]

{'loss': 11.5241, 'grad_norm': 3.5219435691833496, 'learning_rate': 7.136536030341341e-06, 'epoch': 1.93}


 64%|██████▍   | 12224/18984 [12:23<06:27, 17.46it/s]

{'loss': 11.5285, 'grad_norm': 3.726652145385742, 'learning_rate': 7.126000842815003e-06, 'epoch': 1.93}


 64%|██████▍   | 12234/18984 [12:23<06:14, 18.02it/s]

{'loss': 11.5261, 'grad_norm': 3.8061938285827637, 'learning_rate': 7.115465655288665e-06, 'epoch': 1.93}


 64%|██████▍   | 12244/18984 [12:24<06:13, 18.06it/s]

{'loss': 11.5442, 'grad_norm': 3.7994277477264404, 'learning_rate': 7.104930467762326e-06, 'epoch': 1.93}


 65%|██████▍   | 12252/18984 [12:24<06:24, 17.50it/s]

{'loss': 11.5316, 'grad_norm': 3.725733757019043, 'learning_rate': 7.0943952802359885e-06, 'epoch': 1.94}


 65%|██████▍   | 12264/18984 [12:25<06:24, 17.49it/s]

{'loss': 11.5337, 'grad_norm': 3.8421566486358643, 'learning_rate': 7.0838600927096515e-06, 'epoch': 1.94}


 65%|██████▍   | 12274/18984 [12:26<06:18, 17.75it/s]

{'loss': 11.5108, 'grad_norm': 5.417545318603516, 'learning_rate': 7.073324905183313e-06, 'epoch': 1.94}


 65%|██████▍   | 12284/18984 [12:26<06:18, 17.70it/s]

{'loss': 11.5291, 'grad_norm': 5.389244556427002, 'learning_rate': 7.062789717656975e-06, 'epoch': 1.94}


 65%|██████▍   | 12294/18984 [12:27<06:07, 18.19it/s]

{'loss': 11.5341, 'grad_norm': 5.637127876281738, 'learning_rate': 7.0522545301306364e-06, 'epoch': 1.94}


 65%|██████▍   | 12304/18984 [12:27<06:11, 17.99it/s]

{'loss': 11.5094, 'grad_norm': 4.470741271972656, 'learning_rate': 7.041719342604299e-06, 'epoch': 1.94}


 65%|██████▍   | 12312/18984 [12:28<06:20, 17.52it/s]

{'loss': 11.5213, 'grad_norm': 5.644996166229248, 'learning_rate': 7.031184155077962e-06, 'epoch': 1.95}


 65%|██████▍   | 12322/18984 [12:28<06:17, 17.63it/s]

{'loss': 11.566, 'grad_norm': 5.4009480476379395, 'learning_rate': 7.020648967551623e-06, 'epoch': 1.95}


 65%|██████▍   | 12334/18984 [12:29<06:12, 17.86it/s]

{'loss': 11.5528, 'grad_norm': 4.792998313903809, 'learning_rate': 7.010113780025285e-06, 'epoch': 1.95}


 65%|██████▌   | 12342/18984 [12:29<06:15, 17.68it/s]

{'loss': 11.5384, 'grad_norm': 4.413956642150879, 'learning_rate': 6.999578592498947e-06, 'epoch': 1.95}


 65%|██████▌   | 12354/18984 [12:30<06:14, 17.70it/s]

{'loss': 11.549, 'grad_norm': 3.8700733184814453, 'learning_rate': 6.989043404972609e-06, 'epoch': 1.95}


 65%|██████▌   | 12364/18984 [12:31<06:07, 18.03it/s]

{'loss': 11.5077, 'grad_norm': 3.9113969802856445, 'learning_rate': 6.978508217446272e-06, 'epoch': 1.95}


 65%|██████▌   | 12374/18984 [12:31<06:11, 17.80it/s]

{'loss': 11.4973, 'grad_norm': 4.967901229858398, 'learning_rate': 6.967973029919933e-06, 'epoch': 1.95}


 65%|██████▌   | 12384/18984 [12:32<06:12, 17.72it/s]

{'loss': 11.5137, 'grad_norm': 4.439691543579102, 'learning_rate': 6.957437842393595e-06, 'epoch': 1.96}


 65%|██████▌   | 12394/18984 [12:32<06:20, 17.31it/s]

{'loss': 11.5468, 'grad_norm': 3.317457914352417, 'learning_rate': 6.946902654867257e-06, 'epoch': 1.96}


 65%|██████▌   | 12404/18984 [12:33<06:15, 17.50it/s]

{'loss': 11.5501, 'grad_norm': 7.648723125457764, 'learning_rate': 6.936367467340919e-06, 'epoch': 1.96}


 65%|██████▌   | 12414/18984 [12:33<06:06, 17.94it/s]

{'loss': 11.5435, 'grad_norm': 4.6212358474731445, 'learning_rate': 6.925832279814582e-06, 'epoch': 1.96}


 65%|██████▌   | 12424/18984 [12:34<06:05, 17.96it/s]

{'loss': 11.4985, 'grad_norm': 5.676652431488037, 'learning_rate': 6.915297092288243e-06, 'epoch': 1.96}


 65%|██████▌   | 12434/18984 [12:35<06:08, 17.78it/s]

{'loss': 11.5153, 'grad_norm': 3.538207769393921, 'learning_rate': 6.9047619047619055e-06, 'epoch': 1.96}


 66%|██████▌   | 12444/18984 [12:35<06:00, 18.13it/s]

{'loss': 11.5339, 'grad_norm': 3.86369252204895, 'learning_rate': 6.894226717235567e-06, 'epoch': 1.97}


 66%|██████▌   | 12454/18984 [12:36<06:03, 17.96it/s]

{'loss': 11.5344, 'grad_norm': 3.9151771068573, 'learning_rate': 6.88369152970923e-06, 'epoch': 1.97}


 66%|██████▌   | 12464/18984 [12:36<06:04, 17.89it/s]

{'loss': 11.5291, 'grad_norm': 3.1931943893432617, 'learning_rate': 6.873156342182892e-06, 'epoch': 1.97}


 66%|██████▌   | 12474/18984 [12:37<05:59, 18.13it/s]

{'loss': 11.5484, 'grad_norm': 3.3139777183532715, 'learning_rate': 6.8626211546565535e-06, 'epoch': 1.97}


 66%|██████▌   | 12484/18984 [12:37<06:01, 18.00it/s]

{'loss': 11.5277, 'grad_norm': 4.009294509887695, 'learning_rate': 6.852085967130216e-06, 'epoch': 1.97}


 66%|██████▌   | 12494/18984 [12:38<06:03, 17.88it/s]

{'loss': 11.5286, 'grad_norm': 4.798647880554199, 'learning_rate': 6.841550779603877e-06, 'epoch': 1.97}


 66%|██████▌   | 12500/18984 [12:38<06:02, 17.88it/s]

{'loss': 11.5312, 'grad_norm': 4.596057891845703, 'learning_rate': 6.83101559207754e-06, 'epoch': 1.98}


 66%|██████▌   | 12514/18984 [12:42<11:04,  9.73it/s]

{'loss': 11.5364, 'grad_norm': 4.445986747741699, 'learning_rate': 6.820480404551202e-06, 'epoch': 1.98}


 66%|██████▌   | 12522/18984 [12:42<07:27, 14.43it/s]

{'loss': 11.544, 'grad_norm': 3.6542465686798096, 'learning_rate': 6.809945217024864e-06, 'epoch': 1.98}


 66%|██████▌   | 12532/18984 [12:43<06:22, 16.89it/s]

{'loss': 11.5294, 'grad_norm': 5.022709369659424, 'learning_rate': 6.799410029498526e-06, 'epoch': 1.98}


 66%|██████▌   | 12544/18984 [12:43<05:55, 18.11it/s]

{'loss': 11.5063, 'grad_norm': 5.74731969833374, 'learning_rate': 6.788874841972187e-06, 'epoch': 1.98}


 66%|██████▌   | 12554/18984 [12:44<05:51, 18.28it/s]

{'loss': 11.5134, 'grad_norm': 5.460032939910889, 'learning_rate': 6.77833965444585e-06, 'epoch': 1.98}


 66%|██████▌   | 12564/18984 [12:44<05:57, 17.96it/s]

{'loss': 11.5215, 'grad_norm': 5.066954612731934, 'learning_rate': 6.7678044669195116e-06, 'epoch': 1.98}


 66%|██████▌   | 12574/18984 [12:45<05:51, 18.24it/s]

{'loss': 11.5367, 'grad_norm': 4.423928260803223, 'learning_rate': 6.757269279393174e-06, 'epoch': 1.99}


 66%|██████▋   | 12584/18984 [12:46<05:55, 17.98it/s]

{'loss': 11.5506, 'grad_norm': 5.024514198303223, 'learning_rate': 6.746734091866836e-06, 'epoch': 1.99}


 66%|██████▋   | 12594/18984 [12:46<05:56, 17.90it/s]

{'loss': 11.5507, 'grad_norm': 5.323646545410156, 'learning_rate': 6.736198904340497e-06, 'epoch': 1.99}


 66%|██████▋   | 12604/18984 [12:47<05:59, 17.73it/s]

{'loss': 11.5679, 'grad_norm': 5.0322442054748535, 'learning_rate': 6.72566371681416e-06, 'epoch': 1.99}


 66%|██████▋   | 12614/18984 [12:47<05:52, 18.07it/s]

{'loss': 11.5502, 'grad_norm': 3.646014928817749, 'learning_rate': 6.715128529287822e-06, 'epoch': 1.99}


 66%|██████▋   | 12624/18984 [12:48<05:50, 18.15it/s]

{'loss': 11.5372, 'grad_norm': 3.2591960430145264, 'learning_rate': 6.704593341761484e-06, 'epoch': 1.99}


 67%|██████▋   | 12634/18984 [12:48<05:56, 17.79it/s]

{'loss': 11.5288, 'grad_norm': 3.2118616104125977, 'learning_rate': 6.694058154235146e-06, 'epoch': 2.0}


 67%|██████▋   | 12644/18984 [12:49<05:58, 17.70it/s]

{'loss': 11.54, 'grad_norm': 3.182882785797119, 'learning_rate': 6.6835229667088075e-06, 'epoch': 2.0}


 67%|██████▋   | 12654/18984 [12:49<06:01, 17.50it/s]

{'loss': 11.515, 'grad_norm': 3.441662549972534, 'learning_rate': 6.6729877791824705e-06, 'epoch': 2.0}


                                                     
 67%|██████▋   | 12660/18984 [12:57<1:20:34,  1.31it/s]

{'eval_loss': 11.570937156677246, 'eval_runtime': 6.7733, 'eval_samples_per_second': 1476.377, 'eval_steps_per_second': 92.274, 'epoch': 2.0}


 67%|██████▋   | 12664/18984 [12:57<42:27,  2.48it/s]  

{'loss': 11.5341, 'grad_norm': 3.609293222427368, 'learning_rate': 6.662452591656132e-06, 'epoch': 2.0}


 67%|██████▋   | 12674/18984 [12:57<11:56,  8.80it/s]

{'loss': 11.5338, 'grad_norm': 3.7757508754730225, 'learning_rate': 6.651917404129794e-06, 'epoch': 2.0}


 67%|██████▋   | 12684/18984 [12:58<06:55, 15.16it/s]

{'loss': 11.5458, 'grad_norm': 3.5101943016052246, 'learning_rate': 6.641382216603455e-06, 'epoch': 2.0}


 67%|██████▋   | 12694/18984 [12:58<06:02, 17.37it/s]

{'loss': 11.5235, 'grad_norm': 3.3715245723724365, 'learning_rate': 6.630847029077118e-06, 'epoch': 2.01}


 67%|██████▋   | 12702/18984 [12:59<05:52, 17.80it/s]

{'loss': 11.546, 'grad_norm': 3.457831382751465, 'learning_rate': 6.620311841550781e-06, 'epoch': 2.01}


KeyboardInterrupt: 

## Evaluate Model Performance
We evaluate the model on the evaluation dataset.
- Call the `evaluate` method of the `Trainer` class, which runs the model over the evaluation dataset and returns a dictionary of metrics such as `eval_loss`.

In [None]:
# Evaluate model performance
trainer.evaluate()
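
One way to read the resulting `eval_loss` is to compare it with the loss of a classifier that guesses uniformly. For cross-entropy over K labels, that baseline is ln(K). The sketch below is only an illustration: the helper names are hypothetical, and in this notebook the label count would presumably come from `len(label_to_code)`, which is an assumption about how that dictionary was built.

```python
import math

def uniform_baseline_loss(num_labels: int) -> float:
    """Cross-entropy of a classifier that gives every label equal probability."""
    if num_labels < 1:
        raise ValueError("num_labels must be at least 1")
    return math.log(num_labels)

def implied_num_labels(loss: float) -> float:
    """Number of labels for which a uniform guess would produce this loss."""
    return math.exp(loss)

# eval_loss value copied from the training log at epoch 2.0
logged_eval_loss = 11.570937156677246
print(f"Uniform-guess label count implied by the logged loss: "
      f"{implied_num_labels(logged_eval_loss):,.0f}")

# Hypothetical label count, used only to show the comparison
print(f"Baseline loss for 100,000 labels: {uniform_baseline_loss(100_000):.4f}")
```

If the label count in the dataset is close to the implied value, the logged loss would be consistent with a model that has not yet learned to separate the classes, which may be worth checking before interpreting any predictions.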

## Plot Training and Validation Loss
We plot the training and validation loss to see how the model's loss changes over the course of training.
- Extract the training and validation loss entries from `trainer.state.log_history`.
- Plot the losses using `matplotlib`.

In [None]:
# Plot training metrics
training_metrics = trainer.state.log_history
# Training loss is logged every few steps, not once per epoch, so plot it
# against the logged (fractional) epoch instead of a 1..N index
train_logs = [x for x in training_metrics if 'loss' in x]
eval_logs = [x for x in training_metrics if 'eval_loss' in x]
losses = [x['loss'] for x in train_logs]
eval_losses = [x['eval_loss'] for x in eval_logs]

plt.figure(figsize=(10, 5))
plt.plot([x['epoch'] for x in train_logs], losses, label='Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss')
plt.legend()
plt.show()

plt.figure(figsize=(10, 5))
plt.plot([x['epoch'] for x in eval_logs], eval_losses, marker='o', label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Validation Loss')
plt.legend()
plt.show()
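
Because the training loss is logged every few steps, the raw curve can be noisy. A trailing moving average is one simple way to make the trend easier to see. The helper below is a hypothetical addition, not part of the original notebook, and the window size of 50 is an arbitrary choice.

```python
def moving_average(values, window=50):
    """Trailing moving average.

    The first window - 1 outputs average over however many points are
    available so far, so the result has the same length as the input.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the point that just left the window
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed

# Toy data standing in for the notebook's `losses` list
print(moving_average([1, 2, 3, 4], window=2))
```

In the notebook, the smoothed series could be plotted alongside the raw one by passing `moving_average(losses)` to `plt.plot`.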

## Query the Model
We show how to query the model with new diagnosis descriptions and get the predicted CIE-10 codes.
- Define a `predict` function that tokenizes the input text and returns the model's prediction.
- Move the inputs to the same device as the model.
- Map the predicted label index back to the CIE-10 code using the `label_to_code` dictionary.

In [None]:
# Example queries to the model
def predict(text):
    model.eval()  # Disable dropout so predictions are deterministic
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}  # Ensure inputs are on the same device as the model
    with torch.no_grad():  # No gradients are needed for inference
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    return label_to_code[predictions.item()]

# Example queries
examples = [
    'Cambio en cerebro, de dispositivo de drenaje, abordaje externo',
    'Escisión de cerebro, diagnóstico, abordaje abierto'
]

for example in examples:
    label = predict(example)
    print(f'Text: {example}\nPredicted Label: {label}\n')
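
A single predicted code hides how confident the model is. The sketch below shows one possible way to turn a row of logits into the top few codes with their softmax probabilities. It is written in plain Python so it runs without a trained model. The function name `top_k_codes` and the toy logits are hypothetical, and the toy mapping stands in for the notebook's `label_to_code`.

```python
import math

def top_k_codes(logits, label_to_code, k=3):
    """Return the k most probable (code, probability) pairs for one example."""
    # Numerically stable softmax: subtract the largest logit before exponentiating
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank label indices from most to least probable
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(label_to_code[i], probs[i]) for i in ranked[:k]]

# Toy stand-ins: three codes from the training data preview and made-up logits
toy_label_to_code = {0: 'A70', 1: 'A71', 2: 'A71.0'}
toy_logits = [0.1, 2.0, 1.0]

for code, prob in top_k_codes(toy_logits, toy_label_to_code, k=2):
    print(f'{code}: {prob:.3f}')
```

With the real model, the logits for one input could be obtained as `outputs.logits[0].tolist()` inside a function like `predict`. If the top probabilities are all close to one another, the prediction should be treated with caution.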