## Finetuning on Alpaca dataset using PyTorch

### Downloading our preprocessed dataset from wandb

In [1]:
!pip install -U bitsandbytes transformers peft accelerate datasets scipy ipywidgets matplotlib huggingface wandb

Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl.metadata (9.9 kB)
Collecting transformers
  Downloading transformers-4.38.0-py3-none-any.whl.metadata (131 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m131.1/131.1 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hCollecting peft
  Downloading peft-0.8.2-py3-none-any.whl.metadata (25 kB)
Collecting accelerate
  Downloading accelerate-0.27.2-py3-none-any.whl.metadata (18 kB)
Collecting datasets
  Downloading datasets-2.17.1-py3-none-any.whl.metadata (20 kB)
Collecting scipy
  Downloading scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.4/60.4 kB[0m [31m12.4 MB/s[0m eta [36m0:00:00[0m
Collecting ipywidgets
  Downloading ipywidgets-8.1.2-py3-none-any.whl.metadata (2.4 kB)
Collecting matplotlib
  Downloading matplotlib-3.8.3-cp310-cp310-manylinux_2_17_x86_64.manyli

In [None]:
import wandb
from pathlib import Path 
import torch
from transformers import AutoModelForCausalLM
from datasets import load_dataset

In [3]:
## Run this when you need to retrieve the dataset. I didn't run it because the files were locally available.
run = wandb.init(project = 'alpaca_finetuning_2-20')

## We're using the artifact we previously stored at this location in WandB
artifact = run.use_artifact('venkatakshay98/alpaca_finetuning_2-20/alpaca_packed:v0', type='dataset')
artifact_dir = Path(artifact.download())

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


[34m[1mwandb[0m: Downloading large artifact alpaca_packed:v0, 129.39MB. 2 files... 
[34m[1mwandb[0m:   2 of 2 files downloaded.  
Done. 0:0:4.3


#### Method 1 to load dataset: Use them as .jsonl files

In [None]:
import json 

def load_jsonl(filename):
    data = []
    with open(filename,'r') as file:
        for line in file:
            data.append(json.loads(line))
    return data

In [None]:
artifact_dir

PosixPath('/Users/venkatakshaychintalapati/Documents/GitHub/learning/LLM_Finetuning/artifacts/alpaca_packed:v0')

In [5]:
## Run this if you need to download the datasets from a WANDB artifact directory
train_ds_packed = load_jsonl(f"{artifact_dir}/train_alpaca_packed.jsonl")
eval_ds_packed =load_jsonl(f"{artifact_dir}/eval_alpaca_packed.jsonl")


In [17]:
## Run this if you have the files available locally for use and don't need to download
train_ds_packed = load_jsonl('train_alpaca_packed.jsonl')
eval_ds_packed = load_jsonl('eval_alpaca_packed.jsonl')

len(train_ds_packed),len(eval_ds_packed)

**The difference between the above where the dataset is a plain json versus below when we load this using load_dataset is that** the latter has many advantages such as fast loading, built-in map/filter methods, etc. Hence we use the latter.

#### Method 2 to load data: Loading them uisng the Huggingface method `load_dataset` which converts them into a format more ideal for creating dataloaders and training models

In [9]:
from datasets import load_dataset

run = wandb.init(project='alpaca_finetuning_2-20')
artifact = run.use_artifact('venkatakshay98/alpaca_finetuning_2-20/alpaca_packed:v0', type='dataset') # Declares the artifact as an input to run
artifact_dir = artifact.download() ## when we call .download() on the artifact, it downloads (gets) the contents locally.

artifact_dir



VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

VBox(children=(Label(value='Waiting for wandb.init()...\r'), FloatProgress(value=0.011167940732816028, max=1.0…

[34m[1mwandb[0m: Downloading large artifact alpaca_packed:v0, 129.39MB. 2 files... 
[34m[1mwandb[0m:   2 of 2 files downloaded.  
Done. 0:0:0.6


In [15]:
## We are using the HF method load_dataset which provides many advantages for dataset loading specific to training models on it.
ds_packed = load_dataset(artifact_dir) 

train_ds_packed = ds_packed['train']
eval_ds_packed = ds_packed['test']

In [16]:
train_ds_packed,eval_ds_packed

Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 11105
})

In [19]:
ds_packed ## using load_dataset method of HF

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 11105
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 222
    })
})

**Retrieving our eval_dataset from WandB that we saved previously**


Converting the jsonl into a list of dictionaries, the way we want to feed it into our validation loop.


In [8]:
import json 

def load_jsonl(filename):
    data = []
    with open(filename,'r') as file:
        for line in file:
            data.append(json.loads(line))
    return data

**Downloading the `eval_dataset` that we created in `LLM_Finetuning/prepare_dataset_for_FT.ipynb`**

This is not a packed dataset; because we want to run eval on it, so we don't want to fill up the sequence to `max_seq_len`

In [6]:
run = wandb.init()
artifact = run.use_artifact('venkatakshay98/alpaca_finetuning_2-20/alpaca_dataset_for_evals:v2', type='dataset')
artifact_dir = artifact.download()

VBox(children=(Label(value='0.007 MB of 0.007 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

VBox(children=(Label(value='Waiting for wandb.init()...\r'), FloatProgress(value=0.011112768782509698, max=1.0…

[34m[1mwandb[0m:   1 of 1 files downloaded.  


In [14]:
eval_dataset = load_jsonl(f'{artifact_dir}/eval_dataset.jsonl')

len(eval_dataset), eval_dataset[:1]

(1000,
 [{'prompt': 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWrite a cover letter applying to become a teacher.\n\n### Input:\nYour resume\n\n### Response:\n',
   'output': "Dear Hiring Manager,\n\nI am writing to apply for the position of a teacher, as advertised on your company's website. I have attached my Resume for your review.\n\nAs an experienced educator, I am passionate about teaching and empowering students to reach their full potential. My background in education includes a Bachelor's degree in Education, and I have had the opportunity to teach diverse student populations in both urban and suburban environments. In my previous role, I developed and taught engaging lesson plans that were tailored to the needs and interests of my students, and my lessons consistently resulted in increased student engagement and achievement.\n\nI pride mysel

#### Creating a WandB table
This will be used in many places, one of them being the `prompt_table` function in the training loop which holds evals after every epoch.

In [None]:
import wandb

# log to wandb
with wandb.init(project="alpaca_finetuning_2-20"):
    at = wandb.Artifact(
        name="alpaca_gpt4", 
        type="dataset",
        description="A GPT4 generated Alpaca like dataset for instruction finetunning",
        metadata={"url":"https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#how-good-is-the-data"},
    )
    at.add_file(dataset_file)

    # log as a table
    table = wandb.Table(columns=list(alpaca[0].keys()))
    for row in alpaca:
        table.add_data(*row.values())
    wandb.log({"alpaca_gpt4_table": table})

### DataLoader

In [15]:
from torch.utils.data import DataLoader
from transformers import default_data_collator ## This method simply collates batches of dict-like objects

In [17]:
help(default_data_collator)

Help on function default_data_collator in module transformers.data.data_collator:

default_data_collator(features: List[transformers.data.data_collator.InputDataClass], return_tensors='pt') -> Dict[str, Any]
    Very simple data collator that simply collates batches of dict-like objects and performs special handling for
    potential keys named:
    
        - `label`: handles a single value (int or float) per object
        - `label_ids`: handles a list of values per object
    
    Does not do any additional preprocessing: property names of the input object will be used as corresponding inputs
    to the model. See glue and ner for example of how it's useful.



In [16]:
batch_size = 8

train_dataloader = DataLoader(
    train_ds_packed,
    batch_size=batch_size,
    collate_fn = default_data_collator
)

eval_dataloader = DataLoader(
    eval_ds_packed,
    batch_size=batch_size,
    collate_fn=default_data_collator,
    shuffle=False
)

In [28]:
train_dataloader

<torch.utils.data.dataloader.DataLoader at 0x29000f810>

In [17]:
batch = next(iter(train_dataloader))

In [18]:
batch['input_ids'].shape

torch.Size([8, 1024])

In [37]:
batch.keys(),batch['input_ids'][0][:25], batch['labels'][0][:25], batch['input_ids'].shape

(dict_keys(['input_ids', 'labels']),
 tensor([    1, 13866,   338,   385, 15278,   393, 16612,   263,  3414, 29892,
          3300,  2859,   411,   385,  1881,   393,  8128,  4340,  3030, 29889,
         14350,   263,  2933,   393,  7128]),
 tensor([13866,   338,   385, 15278,   393, 16612,   263,  3414, 29892,  3300,
          2859,   411,   385,  1881,   393,  8128,  4340,  3030, 29889, 14350,
           263,  2933,   393,  7128,  2486]),
 torch.Size([8, 1024]))

#### Sidenote: Some context on techniques we'll be using in our training

##### Gradient Checkpointing

**How does it work?**
This helps us when the available GPU memory is low, for the trade-off of additional computation time because we're recalculating the gradients that were discarded.


**How does it work?**

*TL;DR*: Gradient checkpointing just means that during forward pass, we store activations of only certain layers (called checkpoints) and discard the activations of other layers in order to save memory. 

This doesn't affect the outcome because, during backward pass, when we need to calculate gradients w.r.t activations of layers which were not stored, we can just calculate these activations again with the checkpoints we stored. 

##### In further detail:

- **During the forward pass**: In gradient checkpointing, not all intermediate activations are stored. For layers designated as "checkpoints," activations are stored, but for others, they are not. When you reach layer 3, for example, you compute its activations based on the inputs from layer 2 as usual.
- **Discarding activations**: After the activations for layer 3 are computed from those of layer 2, the activations of layer 2 can be discarded if layer 2 is not a checkpoint. This is where the memory savings come in. By not storing the activations for every layer, you reduce the overall memory footprint.
- **During the backward pass**: When it's time to compute the gradients with respect to the weights of layer 2, you need the activations of layer 2 again. Since they were discarded, you recompute them by doing a localized forward pass starting from the last checkpoint before layer 2. This might be the inputs to layer 1, or the activations of layer 1 if that was designated as a checkpoint.

##### Automated Mixed Precision

**Why is it used?**
This technique is again used to lower memory requirements and consequently speed up training, 

**How does it work?**
By means of using a mix of precision when performing matmul in forward and backward pass. This means that some of the computations are performed in single-precision (aka 32-bit) and some in half-precision (16-bit). 

AMP can identify which parts of computation can be performed with which precision.

##### Gradient accumulation:

**Why is it used?**
When performing backpropogation, we generally calculate the gradient of loss w.r.t weights and biases for entire batch at once, but this can be computationally expensive. By using gradient accumulation, we can reduce memory requirements as well, albeit with an increase in training time.

**How does it work?**
 Instead of performing gradient calculation for an entire batch at once, we instead do this in parts, where we divide the batch into n parts, and calculate each part's gradient. We do not perform backprop yet, we continue to calculate gradient of the other parts, accumulate them to the existing gradient.

 Once the gradient of all these parts of a batch are complete, then we use that accumulated gradient to perform backpropogation.

### Training Loop

In [19]:
import random
seed = 42
random.seed(seed)

In [20]:
from types import SimpleNamespace

In [22]:
max_seq_len = 1024

In [23]:
len(train_dataloader)

1389

#### Login to Huggingface in order to access the Meta Llama2 model

In [None]:
from huggingface_hub import login
login()

In [24]:
gradient_accumulation_steps = 32 // batch_size

config = SimpleNamespace(
    model_id = 'meta-llama/Llama-2-7b-hf',
    dataset_name = 'alpaca-gpt4',
    precision = 'bf16', # faster than fp16 apparently, also provides better precision and range of numbers
    n_freeze = 24, # Frozen layers = layers not trained. Here, we are freezing 24 of 32 layers in Llama7b
    lr = 2e-4,
    n_eval_samples = 10, # Number of samples to generate on validation
    max_seq_len = max_seq_len, ## The max_seq_len is mentioned in the metadata for our dataset 
    epochs = 3,
    gradient_accumulation_steps = gradient_accumulation_steps,
    batch_size=batch_size,
    log_model = True, # uploading model to WandB
    mom=0.9, #optimizer parameter
    gradient_checkpointing = True,
    freeze_embed=True, #keep the embeddings frozen
    seed=seed
)

config.total_train_steps = config.epochs * len(train_dataloader) // config.gradient_accumulation_steps


In [27]:
len(train_dataloader), config.epochs, config.gradient_accumulation_steps

(1389, 3, 4)

In [25]:
print(f"We will train for {config.total_train_steps} steps in total and evaluate every epoch")

We will train for 1041 steps and evaluate every epoch


**Why are we dividing `config.epochs * len(train_dataloader)` by `config.gradient_accumulation_steps`?**

Lets explain this with what gradient accumulation does. Let's say you're trying to use a batch size of 20 but that would lead to *GPU out of memory* error. We want to avoid this, but still we do not want to use a smaller batch size, because with a smaller batch size, it can create too much noise during training because of its batch size.

Hence, to 
- stay within GPU memory, and 
- still effectively implement a larger batch size,

we use something called **Gradient accumulation**. With this, we do not perform backprop after every single batch. Instead, we perform forward pass for a batch, calculate gradients w.r.t loss, store them, then do the same for more batches (the number depends on the `gradient_accumulation_steps` we choose). Once we accumulate these gradients, we then perform backprop using this to update our parameters.

**Example**: That's why, if there's 100 batches, in the normal case, 1 epoch should have 100 training steps. *But* if we use gradient accumulation (say steps = 4), then we'll have only 25 training steps, because we're performing backprop only once every 4 batches.

In [28]:
from transformers import AutoModelForCausalLM

#### Get a pre-trained model with the configuration params set above

In [33]:
from huggingface_hub import login
login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [44]:
token = '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'

In [34]:
model = AutoModelForCausalLM.from_pretrained(
    config.model_id,
    device_map = 0,
    trust_remote_code = True,
    low_cpu_mem_usage = True,
    torch_dtype = torch.bfloat16,
    use_cache = False,
    token = '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')

config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

In [35]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Lin

In [34]:
## apparently this notation works
type(1_000_000), 1_000_000

(int, 1000000)

**Trainable parameters**: The parameters (weights and biases) in a model which can be updated during training thru backprop.

**Non-trainable parameters**: This can be the parameters that we decide to keep frozen during training, or the layers of the model are kept as-is (when we add new layers on top that we train, in transfer learning).

In [36]:
## Calculate the number of parameters for the model

def param_count(model):
    params = sum([p.numel() for p in model.parameters()])/1_000_000
    trainable_params = sum([p.numel() for p in model.parameters() if p.requires_grad])/1_000_000
    print(f"Total params: {params:.2f}M, Trainable: {trainable_params:.2f}M") ## M refers for million
    return params, trainable_params 

params, trainable_params = param_count(model)

Total params: 6738.42M, Trainable: 6738.42M


#### Freezing the model to save memory

Because training LLMs is expensive and GPUs have memory constraints. So in this specific case, we're freezing 24 of the 32 layers in Llama2.

In [37]:
n_freeze = 24

## Freezing all layers and params
for param in model.parameters(): param.requires_grad = False

## Unfreezing the head of the model aka the final layer which outputs predictions
for param in model.lm_head.parameters(): param.requires_grad = True

## Unfreezing the layers 24 to end because they're the ones we want to train
for param in model.model.layers[n_freeze:].parameters(): param.requires_grad = True

Additional reduction in memory requirements through freezing embeddings

In [38]:
if config.freeze_embed:
    model.model.embed_tokens.weight.requires_grad_(False)

Here we now also use gradient checkpointing

In [39]:
if config.gradient_checkpointing:
    model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant":False})

In [40]:
## Checking again what the number of params are
params, trainable_params = param_count(model)

Total params: 6738.42M, Trainable: 1750.14M


### Optimizer and Scheduler

**Optimizer**: An algorithm that updates the model's weights based on the gradients of the loss function with respect to those weights, with the goal to minimize the loss, and thereby improve the model's performance on the task.

**Scheduler**: A scheduler aka learning rate scheduler is used to determine how to adjust the learning rate of the optimizer through the process of training a neural network.

**Learning rate**: The learning rate is a hyperparameter that controls how much we are adjusting the weights of our network with respect to the loss gradient.

Both together work to train the neural network to perform better on the intended task(s).

In [41]:
from transformers import get_cosine_schedule_with_warmup

In [55]:
optim = torch.optim.Adam(model.parameters(), lr = config.lr, betas = (0.9,0.99),eps=1e-5)
scheduler = get_cosine_schedule_with_warmup(
    optim,
    num_training_steps = config.total_train_steps,
    num_warmup_steps=config.total_train_steps//10
)

def loss_fn(x,y):
    ## Cross entropy loss
    return torch.nn.functional.cross_entropy(x.view(-1, x.shape[-1]), y.view(-1))


The `eps` value, aka epsilon term, is a very small number added to the denominator of the update step of Adam, in order to ensure that we don't end up diving by zero. It is a numerical stability parameter.

The `get_cosine_schedule_with_warmup` means that the LR starts at zero at the beginning of training, updates linearly till it hits the learning_rate prescribed by us when initializing the model, and it achieves it following the `num_warmup_steps` steps we prescribed.

Once it hits the LR value, then it starts decreasing to zero, following a cosine schedule (meaning it gradually decreases following a part of the cosine curve from the max value to zero.)

**Why use `get_cosine_schedule_with_warmup`**?
This combination of initial warmup and the subsequent decrease helps the training with the need for larger LR in the initial steps for faster convergence, followed by the need for smaller LR later in training to perform precise adjustments to the parameters.

In [79]:
k = torch.randn(4,5,6,7)
k.shape[-1], k.view(-1,k.shape[-1]).shape,batch['labels'].shape

(7, torch.Size([120, 7]), torch.Size([8, 1024]))

### Sampling from the model during training

The `generate` function is used for inference to run predictions using the model at different steps of training, to visually see what the model is outputting. We can grab the default values for sampling parameters from the GenerationConfig and pass the corresponding model_id, which will grab the defaults for parameters like temperature, top p, etc.

In [45]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(config.model_id,token=token)

tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

`GenerationConfig` is a class that holds a configuration for a generation task.

In [46]:
from transformers import GenerationConfig
gen_config = GenerationConfig.from_pretrained(config.model_id)

def generate(prompt, max_new_tokens = 100, gen_config = gen_config):
    with torch.inference_mode():
        tokenized_prompt = tokenizer(prompt, return_tensors = 'pt')['input_ids'].cuda()
        output = model.generate(tokenized_prompt,
                                max_new_tokens = max_new_tokens,
                                generation_config = gen_config)
    return tokenizer.decode(output[0][len(tokenized_prompt[0]):],skip_special_tokens=True)

In [47]:
from types import SimpleNamespace


## We are creating a test_config here so as to use it for eval.
test_config = SimpleNamespace(
    max_new_tokens = 256,
    gen_config = gen_config
)

In [48]:
import wandb
import tqdm
from tqdm.auto import tqdm

The `prompt_table` below creates a table so we can view our model's predictions against GPT-4's output and assess how well it has learnt post-finetuning.

In [49]:
def prompt_table(examples, log=False, table_name = 'predictions'):
    table = wandb.table(columns = ['prompt','generation','concat',
                                   'output','max_new_tokens','temperature','top_p'])
    
    for example in tqdm(examples, leave=False):
        prompt,gpt4_output = example['prompt'], example['output']
        out = generate(prompt, test_config.max_new_tokens, test_config.gen_config)
        table.add_data(prompt, out, prompt+out, gpt4_output,test_config.max_new_tokens,
                       test_config.gen_config.temperature, test_config.gen_config.top_p)
    
    if log:
        ## We will use log=True only when running eval, as we set log=False for train time
        wandb.log({table_name:table})
    return table

`to_gpu` is a function to essentially move all the values (tensors) for all the keys of the dict to a GPU, in preparation for training our model on these inputs.


In [50]:
def to_gpu(tensor_dict):
    return {k:v.to('cuda') for k,v in tensor_dict.items()}

In [51]:
class Accuracy:
    "Simple accuracy function compatible with HF models"
    def __init__(self):
        self.count = 0
        self.correct = 0.
    def update(self, logits, labels):
        # calculating the predictions by applying argmax on logits.
        predictions, labels = logits.argmax(dim=-1).view(-1).cpu(), labels.view(-1).cpu()
        correct = (predictions == labels).sum()

        ## here we append the number of items we ran predictions on
        self.count += len(predictions)
        ## here we append how many we got right of these predictions
        self.correct += correct
        ## this will be the updated accuracy
        return correct/len(predictions) 
    # once the update function runs over all batches of our epoch, then we return the accuracy
    # using the compute function
    def compute(self):
        return self.correct / self.count

Using the `Accuracy` function directly from the repo below, since the above one, mine - doesn't work. I get the error 


**Update**: I fixed the error in my code. It was because I was doing `(logits == labels)` which is incorrect.

```python
RuntimeError: The size of tensor a (32000) must match the size of tensor b (8192) at non-singleton dimension 2```

In [59]:
class Accuracy:
    "A simple Accuracy function compatible with HF models"
    def __init__(self):
        self.count = 0
        self.tp = 0.
    def update(self, logits, labels):
        logits, labels = logits.argmax(dim=-1).view(-1).cpu(), labels.view(-1).cpu()
        tp = (logits == labels).sum()
        self.count += len(logits)
        self.tp += tp
        return tp / len(logits)
    def compute(self):
        return self.tp / self.count

#### Setting up the validation step

In [60]:
@torch.no_grad() ##this just indicates we'll not calculate gradients aka not perform backprop
def validate():
    model.eval() # putting the model in eval mode
    eval_acc = Accuracy()
    loss, total_steps = 0., 0
    for step, batch in enumerate(pbar:=tqdm(eval_dataloader, leave=False)):
        pbar.set_description(f"doing validation")
        batch = to_gpu(batch)
        total_steps += 1
        with torch.amp.autocast('cuda',dtype=torch.bfloat16):
            out = model(**batch)
            loss += loss_fn(out.logits, batch['labels']) # you could use out.loss and not shift the dataset
        # after every batch, we update the number of correctly predicted items and number of items into the accuracy function.
        eval_acc.update(out.logits, batch['labels'])

    # because we complete the evaluation over all batches, we now log results at the end
    wandb.log({'eval/loss':loss.item()/total_steps,
               'eval/accuracy':eval_acc.compute()})
    
    prompt_table(eval_dataset[:config.n_eval_samples],log=True)
    model.train()

In [61]:
from pathlib import Path
def save_model(model, model_name, models_folder="models", log=False):
    """Save the model to wandb as an artifact
    Args:
        model (nn.Module): Model to save.
        model_name (str): Name of the model.
        models_folder (str, optional): Folder to save the model. Defaults to "models".
    """
    model_name = f"{wandb.run.id}_{model_name}"
    file_name = Path(f"{models_folder}/{model_name}")
    file_name.parent.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(file_name, safe_serialization=True)
    # save tokenizer for easy inference
    tokenizer = AutoTokenizer.from_pretrained(model.name_or_path)
    tokenizer.save_pretrained(model_name)
    if log:
        at = wandb.Artifact(model_name, type="model")
        at.add_dir(file_name)
        wandb.log_artifact(at)

### Tying it all together: the model loop!

In [62]:
wandb.init(project="alpaca_finetuning_2-20", # the project I am working on
           tags=["baseline","7b"],
           job_type="train",
           config=config) # the Hyperparameters I want to keep track of


# Training
acc = Accuracy()
model.train()
train_step = 0
for epoch in tqdm(range(config.epochs)):
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch = to_gpu(batch)
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            out = model(**batch)
            loss = loss_fn(out.logits, batch["labels"]) / config.gradient_accumulation_steps  # you could use out.loss and not shift the dataset  
            loss.backward()

        # whenever we complete a step which is a multiple of grad_accumln_steps, we
        # perform backprop to update parameters, and also log our values.
        if step%config.gradient_accumulation_steps == 0:
            # we can log the metrics to W&B
            wandb.log({"train/loss": loss.item() * config.gradient_accumulation_steps,
                       "train/accuracy": acc.update(out.logits, batch["labels"]),
                       "train/learning_rate": scheduler.get_last_lr()[0],
                       "train/global_step": train_step})
            optim.step()
            scheduler.step()
            optim.zero_grad(set_to_none=True)
            train_step += 1

    ## At the end of every epoch, we call `validate()` to we run evals on our eval_dataset.
    validate()   

VBox(children=(Label(value='0.011 MB of 0.011 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

VBox(children=(Label(value='Waiting for wandb.init()...\r'), FloatProgress(value=0.011113588801688619, max=1.0…

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/1389 [00:00<?, ?it/s]

  0%|          | 0/28 [00:00<?, ?it/s]

AttributeError: module 'wandb' has no attribute 'table'

In [None]:
# we save the model checkpoint at the end
save_model(model, model_name=config.model_id.replace("/", "_"), models_folder="models/", log=config.log_model)
    
wandb.finish()

### Full Eval Dataset evaluation

Let's log a table with model predictions on the eval_dataset (or at least the 250 first samples)

In [None]:
with wandb.init(project="alpaca_ft", # the project we are working on
           job_type="eval",
           config=config): # the hyperparameters we want to keep track of
    model.eval();
    prompt_table(eval_dataset[:250], log=True, table_name="eval_predictions")