 <h1 style="font-family:verdana;"> <center>CommonLit Readability:Prompt Tuning BERT</center> </h1>

📌GPT-2 Fine Tuning:https://www.kaggle.com/shreyasajal/pytorch-openai-gpt2-commonlit-readability

<h4 style="font-family:verdana">
    What is Prompt Tuning?<br><br>
Prompt-tuning is a simple yet effective mechanism for learning “soft prompts” to condition frozen language models to perform specific downstream tasks.Soft prompts are learned through backpropagation and can be tuned to incorporate
signal from any number of labeled examples. Finally, we show that conditioning,a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning.<br>
Instead of modeling classification as the probability of an output class given some input, p(y|X),where X is a series of tokens and y is a single class label, we now model it as conditional generation,where Y is a sequence of tokens that represent a class label.<br>
Prompting is the approach of adding extra information for the model to condition on during its generation of Y . Normally, prompting is done by prepending a series of tokens, P, to the input X,such that the model maximizes the likelihood of the
correct Y , pθ(Y |[P; X]), while keeping the model parameters, θ, fixed.<br>
Given a series of n tokens, {x0, x1, . . . , xn}, the first thing is embedding the tokens, forming a matrix Xe ∈ Rn×e where e is the dimension ofthe embedding space. Our soft-prompts are represented as a parameter Pe ∈ Rp×e
, where p is the length of the prompt. Our prompt is then concatenated to the embedded input forming a single matrix [Pe; Xe] ∈ R(p+n)×e
    


📌[The Paper:The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691v1.pdf)

**NOTE:This notebook mainly illustrates the use of prompt embeddings in tuning your model.I didn't implement freezing because it wasn't giving good results in this case,just using the prompts embeddings worked good .Feel free to fork the notebook and experiment freezing or other things with it.**

# Let's start


<p style="color:#159364; font-family:cursive;">INSTALL THE TRANSFORMERS PACKAGE FROM THE HUGGING FACE LIBRARY</center></p>


In [3]:
# !pip install transformers
# !pip install plotly

# <p style="color:#159364; font-family:cursive;">IMPORT THE LIBRARIES</center></p>

In [2]:
import os
import gc
import copy
import datetime
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.utils.data import Dataset, DataLoader
from torch.cuda import amp
import transformers
from transformers import BertTokenizer,BertForSequenceClassification, BertModel, BertConfig
from transformers import AdamW, get_linear_schedule_with_warmup
from tqdm import tqdm
from collections import defaultdict
import plotly.graph_objects as go
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold, KFold
import warnings
warnings.filterwarnings("ignore")


# <p style="color:#159364; font-family:cursive;">DEFINE PROMPT EMBEDDINGS CLASS</center></p>

Reference:https://github.com/kipgparker/

In [3]:
class PROMPTEmbedding(nn.Module):
    def __init__(self, 
                wte: nn.Embedding,
                n_tokens: int = 10, 
                random_range: float = 0.5,
                initialize_from_vocab: bool = True):
        super(PROMPTEmbedding, self).__init__()
        self.wte = wte
        self.n_tokens = n_tokens
        self.learned_embedding = nn.parameter.Parameter(self.initialize_embedding(wte,
                                                                               n_tokens, 
                                                                               random_range, 
                                                                               initialize_from_vocab))
            
    def initialize_embedding(self, 
                             wte: nn.Embedding,
                             n_tokens: int = 10, 
                             random_range: float = 0.5, 
                             initialize_from_vocab: bool = True):
        if initialize_from_vocab:
            return self.wte.weight[:n_tokens].clone().detach()
        return torch.FloatTensor(wte.weight.size(1), n_tokens).uniform_(-random_range, random_range)
            
    def forward(self, tokens):
        print(f"inside prompt embeddings class, original tokens: {tokens}")
        input_embedding = self.wte(tokens[:, self.n_tokens:])
        print(f"intput_embeddings shape: {input_embedding.shape}")
        learned_embedding = self.learned_embedding.repeat(input_embedding.size(0), 1, 1)
        return torch.cat([learned_embedding, input_embedding], 1)

# <p style="color:#159364; font-family:cursive;">LOOK AT THE DATA</center></p>

In [4]:
data_dir = "../../data/commonlitreadabilityprize"
df = pd.read_csv(f"{data_dir}/train.csv")
test_df = pd.read_csv(f"{data_dir}/test.csv",usecols=["id","excerpt"])
print('Number of training sentences: {:,}\n'.format(df.shape[0]))
df.sample(10)

Number of training sentences: 2,834



Unnamed: 0,id,url_legal,license,excerpt,target,standard_error
485,7bcdf0b70,https://simple.wikipedia.org/wiki/Middle_Ages,CC BY-SA 3.0 and GFDL,The Middle Ages are a time period in European ...,-0.929455,0.464225
2582,097311017,,,"After they had eaten all they wanted, they tho...",-0.572768,0.476274
1807,fb13e084e,,,"It is now a well established fact that matter,...",-2.11573,0.509237
392,438d0393b,https://simple.wikipedia.org/wiki/Glucose,CC BY-SA 3.0 and GFDL,"Glucose is a simple carbohydrate, or sugar. It...",0.104885,0.490678
2277,eb57cde1c,,,History in its broadest aspect is a record of ...,-2.186442,0.535444
2088,7c053644e,,,"In another moment down went Alice after it, ne...",-0.274541,0.460605
239,bf1f402ca,,,"""When you want a thing done well, do it yourse...",-0.426813,0.47822
2551,82486c2a2,,,Ceasing his restless walk up and down the room...,-0.158522,0.495309
1659,3974b08a4,,,In compound lenses the matter is complicated b...,-2.014504,0.519803
2054,4625afea0,,,"I had finished eating my dinner, set my pail u...",0.255064,0.49013


# <p style="color:#159364; font-family:cursive;">A BIT OF PREPROCESSING</center></p>

In [5]:
def prep_text(text_df):
    text_df = text_df.str.replace("\n","",regex=False) 
    return text_df.str.replace("\'s",r"s",regex=True).values
df["excerpt"] = prep_text(df["excerpt"])
test_df["excerpt"] = prep_text(test_df["excerpt"])

# <p style="color:#159364; font-family:cursive;">CREATE FOLDS</center></p>

Code taken from:https://www.kaggle.com/abhishek/step-1-create-folds

In [6]:
def create_folds(data, num_splits):
    # we create a new column called kfold and fill it with -1
    data["kfold"] = -1
    
    # the next step is to randomize the rows of the data
    data = data.sample(frac=1).reset_index(drop=True)

    # calculate number of bins by Sturge's rule
    # I take the floor of the value, you can also
    # just round it
    num_bins = int(np.floor(1 + np.log2(len(data))))
    
    # bin targets
    data.loc[:, "bins"] = pd.cut(
        data["target"], bins=num_bins, labels=False
    )
    
    # initiate the kfold class from model_selection module
    kf = StratifiedKFold(n_splits=num_splits)
    
    # fill the new kfold column
    # note that, instead of targets, we use bins!
    for f, (t_, v_) in enumerate(kf.split(X=data, y=data.bins.values)):
        data.loc[v_, 'kfold'] = f
    
    # drop the bins column
    data = data.drop("bins", axis=1)

    # return dataframe with folds
    return data


# create folds
df = create_folds(df, num_splits=5)

In [9]:
df.head(10)

Unnamed: 0,id,url_legal,license,excerpt,target,standard_error,kfold
0,55990b441,https://kids.frontiersin.org/article/10.3389/f...,CC BY 4.0,The previous arthropods may seem pretty harmle...,-1.17918,0.464665,0
1,3efb796e2,,,"But at length, one night, as Hilarion heard th...",-2.086623,0.512389,0
2,b300ba844,,,In many industries there are operations that h...,-1.933358,0.488522,0
3,08aa1ae28,,,"In another instant, however, the girls attenti...",-0.506932,0.480851,0
4,8be3592cf,https://kids.frontiersin.org/article/10.3389/f...,CC BY 4.0,What actually happens when parts of the brain ...,-0.55607,0.525426,0
5,5e854dab8,,,He was a very selfish Giant.The poor children ...,-0.057944,0.504743,0
6,2defec2e6,,,The two lads had come to a halt on the road ab...,-1.075147,0.475292,0
7,7448774f1,,,"Four days on the Platte, and yet no buffalo! L...",-1.059063,0.450921,0
8,3b1faa196,,,The first two weeks at Overton glided by with ...,-1.71718,0.516009,0
9,37567968b,,,"Our first domestic war loan of £6,000 was made...",-1.683823,0.476443,0


# <p style="color:#159364; font-family:cursive;">TRAINING CONFIGURATION</center></p>

In [7]:
class CONFIG:
    gpu_num = 0
    seed = 42
    max_len = 331
    train_batch = 16
    valid_batch = 32
    epochs = 10
    n_tokens=20
    learning_rate = 2e-5
    splits = 5
    scaler = amp.GradScaler()
    model='bert-base-cased'
    tokenizer = BertTokenizer.from_pretrained(model, do_lower_case=True)
    tokenizer.save_pretrained('./tokenizer')
    device = torch.device(f'cuda:{gpu_num}' if torch.cuda.is_available() else 'cpu')

In [11]:
CONFIG.device

device(type='cuda', index=0)

In [12]:
torch.cuda.get_device_name()

'NVIDIA GeForce GTX 1050 Ti with Max-Q Design'

# <p style="color:#159364; font-family:cursive;">REPRODUCIBILITY</center></p>

In [8]:
def set_seed(seed = CONFIG.seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    os.environ['PYTHONHASHSEED'] = str(seed)
set_seed(CONFIG.seed)

# <p style="color:#159364; font-family:cursive;">DEFINE THE DATASET CLASS</center></p>

In [9]:
class BERTDataset(Dataset):
    def __init__(self,df):
        self.text = df['excerpt'].values
        self.target = df['target'].values
        self.max_len = CONFIG.max_len
        self.tokenizer = CONFIG.tokenizer
        self.n_tokens=CONFIG.n_tokens
        
    def __len__(self):
        return len(self.text)
    
    def __getitem__(self, index):
        text = self.text[index]
        text = ' '.join(text.split())
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            truncation=True,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            return_token_type_ids=True
        )
        org_input_ids = inputs['input_ids']
        inputs['input_ids']=torch.cat((torch.full((1,self.n_tokens), 500).resize(CONFIG.n_tokens),torch.tensor(inputs['input_ids'], dtype=torch.long)))
        inputs['attention_mask'] = torch.cat((torch.full((1,self.n_tokens), 1).resize(CONFIG.n_tokens), torch.tensor(inputs['attention_mask'], dtype=torch.long)))

        return {
            'ids': inputs['input_ids'],
            'mask': inputs['attention_mask'],
    
            'target': torch.tensor(self.target[index], dtype=torch.float),
            'org_ids': org_input_ids
        }
    

In [43]:
train_dataset = BERTDataset(df)
train_dataset.__getitem__(2)

{'ids': tensor([  500,   500,   500,   500,   500,   500,   500,   500,   500,   500,
           500,   500,   500,   500,   500,   500,   500,   500,   500,   500,
           101,  1107,  1242,  7519,  1175,  1132,  2500,  1115,  1138,  1106,
          1129,  4892,  1120,  2366, 14662,   117,  1105,   117,  1111,  1142,
          2255,   117,  1103,  2058,  1104,  1126, 16486,  1111,  2368,   170,
          4344,   117,  1136,  1178,  1120,  1103,  2396,  4275,   117,  1133,
          1145,  1120,  4463, 14662,   117,  1110,   170,  2187,  1104,  2199,
           119,  1103,  2304,  1104,  1833,  1142,  1144,  1151, 13785,  1107,
           170,  1304, 12002,  1236,  1118,   182,  1197,   119, 27466,  7580,
          1107,  1103, 11918,  1104,  1103, 16486,   119,  1122,  2923,  1104,
           170,  4705,  2133, 17693,  1110,  2136,  1114,   170,  1326,  1104,
          1353, 18607,   119,  1103,  1493,  1132, 22233,  8360,  1121,  1103,
          1692,  1105, 10621,  1114,  1141,  

In [30]:
torch.cat((torch.full((1,20),500),

torch.Size([20])

# <p style="color:#159364; font-family:cursive;">MODEL:BERT FOR SEQUENCE CLASSIFICATION from 🤗 </center></p>

<p style="color:#159364; font-family:cursive;">With prompt embeddings in the input,and all the layers have requires_grad True,you can try layer freezing as well
</center></p>


In [10]:
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels = 1,
    output_attentions = False,
    output_hidden_states = False, 
)

original_emb = model.get_input_embeddings()
prompt_emb = PROMPTEmbedding(model.get_input_embeddings(), 
                      n_tokens=20, 
                      initialize_from_vocab=True)
model.set_input_embeddings(prompt_emb)
model.cuda()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): PROMPTEmbedding(
        (wte): Embedding(30522, 768, padding_idx=0)
      )
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNor

In [17]:
original_emb

Embedding(30522, 768, padding_idx=0)

In [18]:
prompt_emb.learned_embedding

Parameter containing:
tensor([[-0.0102, -0.0615, -0.0265,  ..., -0.0199, -0.0372, -0.0098],
        [-0.0117, -0.0600, -0.0323,  ..., -0.0168, -0.0401, -0.0107],
        [-0.0198, -0.0627, -0.0326,  ..., -0.0165, -0.0420, -0.0032],
        ...,
        [-0.0210, -0.0524, -0.0289,  ..., -0.0206, -0.0384, -0.0176],
        [-0.0098, -0.0563, -0.0322,  ..., -0.0215, -0.0314, -0.0087],
        [-0.0166, -0.0492, -0.0288,  ..., -0.0235, -0.0364, -0.0148]],
       device='cuda:0', requires_grad=True)

In [22]:
model.bert.embeddings.word_embeddings.learned_embedding.shape

torch.Size([20, 768])

In [17]:
model.bert.embeddings.word_embeddings.learned_embedding.requires_grad

PROMPTEmbedding(
  (wte): Embedding(30522, 768, padding_idx=0)
)

In [15]:
para_dict = model.state_dict()
# para_dict.keys()

odict_keys(['bert.embeddings.position_ids', 'bert.embeddings.word_embeddings.learned_embedding', 'bert.embeddings.word_embeddings.wte.weight', 'bert.embeddings.position_embeddings.weight', 'bert.embeddings.token_type_embeddings.weight', 'bert.embeddings.LayerNorm.weight', 'bert.embeddings.LayerNorm.bias', 'bert.encoder.layer.0.attention.self.query.weight', 'bert.encoder.layer.0.attention.self.query.bias', 'bert.encoder.layer.0.attention.self.key.weight', 'bert.encoder.layer.0.attention.self.key.bias', 'bert.encoder.layer.0.attention.self.value.weight', 'bert.encoder.layer.0.attention.self.value.bias', 'bert.encoder.layer.0.attention.output.dense.weight', 'bert.encoder.layer.0.attention.output.dense.bias', 'bert.encoder.layer.0.attention.output.LayerNorm.weight', 'bert.encoder.layer.0.attention.output.LayerNorm.bias', 'bert.encoder.layer.0.intermediate.dense.weight', 'bert.encoder.layer.0.intermediate.dense.bias', 'bert.encoder.layer.0.output.dense.weight', 'bert.encoder.layer.0.output.

# <p style="color:#159364; font-family:cursive;">GET THE PREPARED DATA</center></p>

In [19]:
def get_data(fold):
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)
    
    train_dataset = BERTDataset(df_train)
    valid_dataset = BERTDataset(df_valid)

    train_loader = DataLoader(train_dataset, batch_size=CONFIG.train_batch, 
                              num_workers=0, shuffle=True, pin_memory=True)
    valid_loader = DataLoader(valid_dataset, batch_size=CONFIG.valid_batch, 
                              num_workers=0, shuffle=False, pin_memory=True)
    
    return train_loader, valid_loader

In [20]:
train_testing = BERTDataset(df[df.kfold != 0].reset_index(drop=True))
train_testing.text

array(['One old woman especially loved the smells that drifted out of the bakery window every morning. This was Ma Shange who slept on a bench in the park every night. A few weeks before, a kind person had given her the money to buy herself a cinnamon bun. She had taken the bun back to the park and ate it very slowly, licking her lips and sharing the last crumbs with the birds. After that, although the old woman didn\'t have enough money to buy breakfast, she longed for the delicious bun again. So, every morning she walked slowly past Mr Shabangus bakery, sniffing the air and smiling blissfully at the mouth-watering smell. Ma Shanges new habit made the baker very angry. As each day went by, he grew angrier and angrier with her. Finally, one winter morning when he was in an especially bad mood, he stormed out of his bakery and grabbed the old woman by the arm. "How dare you steal my smells!" he shouted. "You\'re nothing but a smell thief!" He wiped his hands on an apron, then pulled it 

# <p style="color:#159364; font-family:cursive;">FOLD:0</center></p>

In [21]:
train_dataloader,validation_dataloader=get_data(0)
len(train_dataloader)

142

# <p style="color:#159364; font-family:cursive;">OPTIMIZER</center></p>

In [22]:
param_optimizer = list(model.named_parameters())
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
optimizer_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 
     'weight_decay': 0.0001},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 
     'weight_decay': 0.0}
    ]  

optimizer = AdamW(optimizer_parameters, lr=CONFIG.learning_rate)


# <p style="color:#159364; font-family:cursive;">LEARNING RATE SCHEDULER</center></p>

In [23]:
# Defining LR Scheduler
scheduler = get_linear_schedule_with_warmup(
    optimizer, 
    num_warmup_steps=0, 
    num_training_steps=len(train_dataloader)*CONFIG.epochs
)

lrs = []
for epoch in range(1, CONFIG.epochs + 1):
    if scheduler is not None:
        scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])
layout = go.Layout(template= "plotly_dark",title='Learning_rate')
fig = go.Figure(layout=layout)

fig.add_trace(go.Scatter(x=list(range(CONFIG.epochs)), y=lrs,
                    mode='lines+markers',
                    name='Learning_rate'))
fig.show()

# <p style="color:#159364; font-family:cursive;">DEFINE LOSS AND TIME FUNCTIONS</center></p>

In [24]:
def loss_fn(output,target):
     return torch.sqrt(nn.MSELoss()(output,target))
def format_time(elapsed):
    elapsed_rounded = int(round((elapsed)))
    return str(datetime.timedelta(seconds=elapsed_rounded))

# <p style="color:#159364; font-family:cursive;">DEFINE THE FUNCTION FOR TRAINING,VALIDATION AND RUNNING</center></p>

In [25]:
def run(model,optimizer,scheduler):
    set_seed(40)
    scaler=CONFIG.scaler
    training_stats = []
    total_t0 = time.time()
    best_rmse = np.inf
    epochs=CONFIG.epochs
    for epoch_i in range(0, epochs):
        print("")
        print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
        print('Training...')
        t0 = time.time()
        total_train_loss = 0
        data_size=0
        model.train()
        for step, batch in enumerate(train_dataloader):    
            tr_loss=[]
            b_input_ids = batch['ids'].to(CONFIG.device)
            b_input_mask = batch['mask'].to(CONFIG.device)
            b_labels = batch['target'].to(CONFIG.device)
            batch_size = b_input_ids.size(0)
            model.zero_grad() 
            with amp.autocast(enabled=True):
                output= model(b_input_ids,attention_mask=b_input_mask)          
                output=output["logits"].squeeze(-1)
                loss = loss_fn(output,b_labels)
                # print("")
                # print(f"batch loss is: {loss.item()}")
                tr_loss.append(loss.item()/len(output))
            scheduler.step()
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        avg_train_loss = np.mean(tr_loss)    
        training_time = format_time(time.time() - t0)
        gc.collect()
        print("")
        print("  Average training loss: {0:.2f}".format(avg_train_loss))
        print("  Training epoch took: {:}".format(training_time))
        print("")
        print("Running Validation...")

        t0 = time.time()
        model.eval()
        val_loss = 0
        allpreds = []
        alltargets = []
        for batch in validation_dataloader:
            losses = []
            with torch.no_grad():
                device=CONFIG.device
                ids = batch["ids"].to(device)
                mask = batch["mask"].to(device)
                output = model(ids,mask)
                output = output["logits"].squeeze(-1)
                target = batch["target"].to(device)
                loss = loss_fn(output,target)
                losses.append(loss.item()/len(output))
                allpreds.append(output.detach().cpu().numpy())
                alltargets.append(target.detach().squeeze(-1).cpu().numpy())
                
        allpreds = np.concatenate(allpreds)
        alltargets = np.concatenate(alltargets)
        val_rmse=mean_squared_error(alltargets, allpreds, squared=False)
        losses = np.mean(losses)
        gc.collect() 
        validation_time = format_time(time.time() - t0)
        print("  Validation Loss: {0:.2f}".format(losses))
        print("  Validation took: {:}".format(validation_time))
        
        if val_rmse <= best_rmse:
            print(f"Validation RMSE Improved ({best_rmse} -> {val_rmse})")
            best_rmse = val_rmse
            best_model_wts = copy.deepcopy(model.state_dict())
            PATH = "rmse{:.4f}_epoch{:.0f}.bin".format(best_rmse, epoch_i)
            torch.save(model.state_dict(), PATH)
            print("Model Saved")
            
        training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': losses,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    ) 
    print("")
    print("Training complete!")
    return training_stats  

# <p style="color:#159364; font-family:cursive;">VISUALIZATION FUNCTION </center></p>

In [26]:
def Visualizations(training_stats):
    pd.set_option('precision', 2)
    df_stats = pd.DataFrame(data=training_stats)
    df_stats = df_stats.set_index('epoch')
    layout = go.Layout(template= "plotly_dark")
    fig = go.Figure(layout=layout)
    fig.add_trace(go.Scatter(x=df_stats.index, y=df_stats['Training Loss'],
                    mode='lines+markers',
                    name='Training Loss'))
    fig.add_trace(go.Scatter(x=df_stats.index, y=df_stats['Valid. Loss'],
                    mode='lines+markers',
                    name='Validation Loss'))
    fig.show()

In [27]:
model.train()
for step, batch in enumerate(train_dataloader):    
    tr_loss=[]
    b_org_ids = batch['org_ids']
    b_input_ids = batch['ids'].to(CONFIG.device)
    b_input_mask = batch['mask'].to(CONFIG.device)

    print(f"original ids: {b_org_ids[1]}")
    output = model(b_input_ids,b_input_mask)

    
    break


original ids: tensor([1191, 1103, 6434, 1159, 5871, 1103, 1126, 1208,  170,  172, 1892, 1103,
        1141, 1103, 1103, 1126])
inside prompt embeddings class, original tokens: tensor([[500, 500, 500,  ...,   0,   0,   0],
        [500, 500, 500,  ...,   0,   0,   0],
        [500, 500, 500,  ...,   0,   0,   0],
        ...,
        [500, 500, 500,  ...,   0,   0,   0],
        [500, 500, 500,  ...,   0,   0,   0],
        [500, 500, 500,  ...,   0,   0,   0]], device='cuda:0')
intput_embeddings shape: torch.Size([16, 331, 768])


RuntimeError: CUDA out of memory. Tried to allocate 92.00 MiB (GPU 0; 4.00 GiB total capacity; 2.91 GiB already allocated; 0 bytes free; 3.05 GiB reserved in total by PyTorch)


 <p style="color:#159364; font-family:cursive;">RUN THE MODEL WITH PROMPT EMBEDDINGS ON FOLD 0 </center></p>

In [None]:
# df=run(model,optimizer,scheduler)
# Visualizations(df)


![Upvote!](https://img.shields.io/badge/Upvote-If%20you%20like%20my%20work-07b3c8?style=for-the-badge&logo=kaggle)
