### Imports

In [4]:
import os
import time

import torch
from peft import PeftModel, prepare_model_for_kbit_training
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

### Loading the model

In [2]:
model_id = "tiiuae/falcon-40b-instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Falcon needs trust_remote_code=True because its architecture is not part of transformers yet;
# the modeling code is provided by the model authors in the model repo and is executed locally.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, quantization_config=bnb_config, device_map="auto")

Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]
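
To see why the 4-bit NF4 configuration above matters for a model of this size, a rough estimate of the weight memory is sketched below. It assumes roughly 40 billion parameters (taken from the model name) and ignores quantization constants, activations and the KV cache, so the real footprint will be somewhat higher. `weight_memory_gib` is a hypothetical helper, not part of any library.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: ~40e9 parameters, inferred from the "40b" in the model id.
NUM_PARAMS = 40e9

for label, bits in [("bf16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label:>5}: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

The 4-bit figure is a quarter of the bf16 one, which is what makes loading on a small number of GPUs with `device_map="auto"` plausible.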

In [3]:
# Reuse the EOS token as the padding token for the Falcon tokenizer
tokenizer.pad_token = tokenizer.eos_token

In [7]:
os.listdir("outputs/checkpoint-2089")

['rng_state.pth',
 'README.md',
 'training_args.bin',
 'adapter_config.json',
 'adapter_model.bin',
 'optimizer.pt',
 'trainer_state.json',
 'scheduler.pt']
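
Of the files listed, `adapter_config.json` and `adapter_model.bin` are the ones needed to apply the adapter; the optimizer, scheduler, RNG and trainer state files are only used to resume training. A small sanity check before loading could look like the sketch below. `missing_adapter_files` is a hypothetical helper, and the required file names are taken from the listing above (newer PEFT versions may write `adapter_model.safetensors` instead).

```python
import os

# Assumption: file names as they appear in the checkpoint listing above.
REQUIRED_ADAPTER_FILES = ("adapter_config.json", "adapter_model.bin")


def missing_adapter_files(checkpoint_dir, required=REQUIRED_ADAPTER_FILES):
    """Return the required adapter files that are absent from checkpoint_dir."""
    if not os.path.isdir(checkpoint_dir):
        return list(required)
    present = set(os.listdir(checkpoint_dir))
    return [name for name in required if name not in present]


checkpoint_dir = "outputs/checkpoint-2089"
missing = missing_adapter_files(checkpoint_dir)
if missing:
    print(f"{checkpoint_dir} is missing: {missing}")
else:
    print(f"{checkpoint_dir} contains the adapter files")
```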

### Applying the adapter (delta)

In [19]:
model = PeftModel.from_pretrained(
    model,
    "outputs/checkpoint-2089",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
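
Before attaching an adapter it can help to confirm which base model it was trained against. PEFT records this in `adapter_config.json`; the sketch below reads a few fields from it. It assumes a LoRA adapter (the keys `r`, `lora_alpha` and `target_modules` are LoRA-specific, so they are read with `.get` and may come back as `None`). `summarize_adapter_config` is a hypothetical helper.

```python
import json
import os


def summarize_adapter_config(checkpoint_dir):
    """Read a few fields from the adapter_config.json in checkpoint_dir."""
    path = os.path.join(checkpoint_dir, "adapter_config.json")
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)
    # Assumption: LoRA-style keys; missing keys are reported as None.
    keys = ("base_model_name_or_path", "peft_type", "r", "lora_alpha", "target_modules")
    return {key: config.get(key) for key in keys}


checkpoint_dir = "outputs/checkpoint-2089"
if os.path.isdir(checkpoint_dir):
    print(summarize_adapter_config(checkpoint_dir))
```

If `base_model_name_or_path` does not match the `model_id` loaded earlier, the adapter weights are unlikely to fit the base model.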

### Inference

In [50]:
def generate(model, tokenizer, question, max_new_tokens=128):
    start_time = time.time()
    # Prompt template; keep it in sync with the template used during fine-tuning.
    test_prompt_template = "As a pytorch expert engineer answer the question. Question: \n{question}\n---\nAnswer:\n"
    test_sample = test_prompt_template.format(question=question)
    input_ids = tokenizer(test_sample, return_tensors="pt").input_ids
    # max_length counts the prompt tokens too, so add the prompt length to the new-token budget.
    output_tokens = input_ids.shape[1] + max_new_tokens
    input_ids = input_ids.to("cuda")
    outputs = model.generate(inputs=input_ids, do_sample=True, max_length=output_tokens)
    gen_text = tokenizer.batch_decode(outputs)[0]
    # Keep only the text after the last "Answer:" marker, i.e. the model's reply.
    gen_text = gen_text.split("Answer:")[-1]
    print("<<<<<<<<<<<<<< Time taken for inference: ", time.time() - start_time)
    return gen_text
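
The attention-mask and `pad_token_id` warnings printed under the sample questions arise because `generate` passes only `input_ids`. A variant that also forwards the attention mask, sets `pad_token_id` explicitly and decodes only the newly generated tokens might look like the sketch below. `build_prompt` and `generate_with_mask` are hypothetical names, the prompt layout is an assumption, and `model` and `tokenizer` are assumed to be the objects loaded earlier.

```python
import time

# Assumption: prompt layout used for this fine-tune.
PROMPT_TEMPLATE = (
    "As a pytorch expert engineer answer the question. "
    "Question: \n{question}\n---\nAnswer:\n"
)


def build_prompt(question):
    """Fill the question into the prompt template."""
    return PROMPT_TEMPLATE.format(question=question)


def generate_with_mask(model, tokenizer, question, max_new_tokens=128):
    start_time = time.time()
    # The tokenizer returns both input_ids and attention_mask.
    encoded = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    outputs = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        do_sample=True,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Drop the prompt tokens so only the generated answer is decoded.
    prompt_length = encoded["input_ids"].shape[1]
    gen_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    print("Time taken for inference:", time.time() - start_time)
    return gen_text
```

Slicing off the prompt tokens avoids relying on the string `"Answer:"` not appearing inside the generated text.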
    

### Sample questions

In [42]:
print(generate(model=model, tokenizer=tokenizer, question="How to convert numpy to tensor?"))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.



There are different ways to convert numpy to tensor in PyTorch. The most common and straightforward way is to use `numpy.array` to convert the numpy array into a torch tensor. Here’s an example:

```python
import numpy as np
import torch

# Define a numpy array
a = np.array([1, 2, 3])

# Convert the numpy array to a torch tensor
b = torch.tensor(a)

print(b)  # Output: tensor([1, 2, 3])
```

Another way is to use
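
The generated text stops before naming the alternative. One other option that PyTorch provides is `torch.from_numpy`, which shares memory with the source array, whereas `torch.tensor` copies the data. The sketch below illustrates the difference; it is not necessarily what the model was about to say.

```python
import numpy as np
import torch

a = np.array([1, 2, 3])

shared = torch.from_numpy(a)  # shares memory with the NumPy array
copied = torch.tensor(a)      # makes an independent copy

# Modify the NumPy array in place and compare the two tensors.
a[0] = 10

print(shared)
print(copied)
```

After the in-place change, `shared` reflects the new value while `copied` keeps the original one.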


In [43]:
print(generate(model=model, tokenizer=tokenizer, question="How do i check if pytorch is using GPU?"))

You can check if pytorch is using GPU by setting the `CUDA.available` flag to `True` before importing pytorch. If the flag is `True`, then pytorch is using GPU. Here is an example code snippet:
```
import sys
import os
import shutil

import importlib
importlib.reload(sys)


sys.stdout.write('\n')


if os.name == 'nt':
    os.system("setx GPU_FORCE_64BIT_PTR 0")

os.environ
```

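The generated answer above does not match PyTorch's API: there is no `CUDA.available` flag to set before import. A minimal check using calls that PyTorch does provide (`torch.cuda.is_available`, `torch.cuda.device_count`, `torch.cuda.get_device_name`) is sketched here; the tensor `x` is only an illustration.

```python
import torch

cuda_available = torch.cuda.is_available()
print("CUDA available:", cuda_available)

if cuda_available:
    print("Device count:", torch.cuda.device_count())
    print("Device name:", torch.cuda.get_device_name(0))

# Tensors and modules only use the GPU if they are placed on it explicitly.
device = torch.device("cuda" if cuda_available else "cpu")
x = torch.zeros(2, 2, device=device)
print("Tensor device:", x.device)
print("Tensor on GPU:", x.is_cuda)
```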

In [44]:
print(generate(model=model, tokenizer=tokenizer, question="Steps to debug OOM issues in pytorch?", max_new_tokens=256))

There are several steps you can take to diagnose and debug OOM (Out Of Memory) issues in PyTorch:

1. Increase the size of your GPU. If you have a large model or dataset, it may require more memory than your GPU has available. Consider upgrading to a larger GPU or using a cloud-based service that can handle larger models.

2. Check memory usage during training. PyTorch provides a memory profiling utility that can help you identify where memory is being used during training. This can help you optimize your model or dataset to better utilize available memory.

3. Use a smaller batch size. Training with a smaller batch size can reduce the amount of memory required during training. This can help avoid OOM errors.

4. Use memory-efficient optimizers. If memory usage is a critical factor, consider using an optimizer like SGD or Adam, which are designed to be memory-efficient.

5. Use more advanced debugging tools. If the above steps do not resolve the issue, consider using more advanced deb
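
Step 2 above refers to PyTorch's memory utilities without naming them. A minimal sketch using `torch.cuda.memory_allocated`, `torch.cuda.memory_reserved` and `torch.cuda.max_memory_allocated` follows. `cuda_memory_report` is a hypothetical helper, and it returns `None` on machines without CUDA.

```python
import torch


def cuda_memory_report(device=0):
    """Return current, reserved and peak CUDA memory in MiB, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return {
        "allocated_mib": torch.cuda.memory_allocated(device) / mib,
        "reserved_mib": torch.cuda.memory_reserved(device) / mib,
        "peak_allocated_mib": torch.cuda.max_memory_allocated(device) / mib,
    }


print(cuda_memory_report())
```

Calling the helper before and after a forward or backward pass gives a rough picture of where memory grows.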

In [45]:
print(generate(model=model, tokenizer=tokenizer, question="Steps to debug OOM issues in pytorch?", max_new_tokens=256))

To debug OOM issues in pytorch, one can follow the below steps:

1. Identify the root cause of the issue by profiling the application to identify the memory usage patterns.

2. Use a memory-profiling tool such as tracemalloc to gain insight into Python's memory usage.

3. Use PyTorch's internal memory tracker to identify memory leaks in PyTorch.

4. Optimize memory usage by minimizing the amount of data stored in memory and reducing data copying.

5. Consider using a virtual machine with increased memory limits to run the application.

6. Consider reducing the batch size to reduce memory requirements.

7. Use PyTorch's memory optimization techniques such as memory pinning to avoid frequently swapping data between RAM and slow external devices such as hard drives.

8. Optimize network communication to reduce the amount of data transmitted between devices.

9. Consider scaling up or down the number of devices in the cluster to balance memory usage and optimize memory usage at scale.

1
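
Step 2 above mentions `tracemalloc`. A minimal sketch of how it can be used is shown below; note that it tracks Python-level host allocations only and does not see GPU memory. The list `data` is just a stand-in workload.

```python
import tracemalloc

tracemalloc.start()

# Stand-in workload: allocate some Python objects.
data = [list(range(1000)) for _ in range(100)]

snapshot = tracemalloc.take_snapshot()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current={current / 1024:.1f} KiB, peak={peak / 1024:.1f} KiB")

# Show the source lines responsible for the most allocated memory.
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```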

In [46]:
print(generate(model=model, tokenizer=tokenizer, question="Code to create a custom dataset?", max_new_tokens=256))

To create a custom dataset, you can subclass the `torch.utils.data.Dataset` class and implement the required methods such as `__init__`, `__len__`, `__getitem__`, and `__setitem__`. The `__init__` method takes no arguments and is called when the dataset is initialized. The `__len__` method returns the length of the dataset. The `__getitem__` method takes an index `i` and returns the data point at that index. The `__setitem__` method takes an index `i` and a data point as arguments and sets the data point at that index.

Here is an example implementation of the custom dataset:

```python
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, data_file='data.txt') -> None:
        with open(data_file, 'r') as f:
            lines = f.readlines()
        self.data = [line.strip() for line in lines]

    def __len__(self) -> int:
        return len(self.data)
```

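The generated example is cut off before `__getitem__`. For a map-style dataset, `torch.utils.data.Dataset` expects `__getitem__`, and `__len__` is needed by the default `DataLoader` sampler; `__setitem__` is not part of the interface. A self-contained sketch follows, using an in-memory list instead of the `data.txt` file (an assumption made so that it runs as-is). `LinesDataset` is a hypothetical name.

```python
from torch.utils.data import Dataset, DataLoader


class LinesDataset(Dataset):
    """Map-style dataset over a list of text lines."""

    def __init__(self, lines):
        self.data = [line.strip() for line in lines]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


dataset = LinesDataset(["first line\n", "second line\n", "third line\n"])
loader = DataLoader(dataset, batch_size=2, shuffle=False)

for batch in loader:
    print(batch)
```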

In [47]:
print(generate(model=model, tokenizer=tokenizer, question="What is L1 and L2 regularization?"))

L1 and L2 regularization are two commonly used methods to regularize neural networks and prevent overfitting. L1 regularization, also known as LASSO or Lasso regularization, is a technique that introduces a penalty term on the sum of the absolute values of model parameters to encourage sparsity in the model. This ensures that the solution contains only a few non-zero values, which helps to avoid overfitting and improve the generalization ability of the model. L2 regularization, also known as ridge or Tikhonov regularization, is a technique that adds a penalty term on the sum of squares of the model parameters to
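
The two penalty terms described above can be written out in a few lines. The sketch below computes them for a small list of weights in plain Python; `l1_penalty`, `l2_penalty`, the weights and the strength `lam` are illustrative assumptions.

```python
def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squares."""
    return lam * sum(w * w for w in weights)


weights = [0.5, -1.5, 0.0, 2.0]
lam = 0.01

print("L1 penalty:", l1_penalty(weights, lam))
print("L2 penalty:", l2_penalty(weights, lam))
```

In training, the chosen penalty is added to the data loss before the backward pass.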


In [51]:
print(generate(model=model, tokenizer=tokenizer, question="How to save the model?"))

<<<<<<<<<<<<<< Time taken for inference:  118.64722800254822

As a pytorch expert engineer, I can provide a detailed answer to your question on how to save the model in PyTorch. 

There are three main ways to save a trained model in PyTorch:

1. Using PyTorch’s built-in save function: PyTorch provides the `torch.save()` function to save a model’s weights, state dictionary, and its optimizer. The `model.save()` method uses this function to save the model to a file on disk.

Here is an example code snippet:

```python
import torch
```

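The answer is cut off before its example, and it mentions a `model.save()` method that `torch.nn.Module` does not provide. A commonly used pattern, saving and restoring the `state_dict` with `torch.save` and `load_state_dict`, is sketched below with a tiny stand-in model and a temporary path (both are assumptions for illustration).

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in model; any nn.Module works the same way.
model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_state.pt")

    # Save only the parameters, not the Python object itself.
    torch.save(model.state_dict(), path)

    # Restore into a freshly constructed model of the same architecture.
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path))

print(torch.equal(model.weight, restored.weight))
```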

