### Imports

In [1]:
import os
import time
import torch
from peft import PeftModel, prepare_model_for_kbit_training
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /opt/conda/envs/pytorch/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /opt/conda/envs/pytorch/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/conda/envs/pytorch/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...


Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)


### Loading the model

In [2]:
model_id = "tiiuae/falcon-40b-instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Falcon requires you to allow remote code execution. This is because the model uses a new architecture that is not part of transformers yet.
# The code is provided by the model authors in the repo.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, quantization_config=bnb_config, device_map="auto")

Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]
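
As a rough sanity check on why 4-bit loading is used here, the weight footprint implied by the config above can be estimated with simple arithmetic. This is only a sketch under stated assumptions: the parameter count (about 40e9) is read off the model name, the block sizes (64 weights per first-level block, 256 constants per second-level block) are the values described for NF4 with double quantization and are not read from this notebook, and `estimate_nf4_weight_gb` is a hypothetical helper. Layers that stay unquantized, activations, and the KV cache are not counted, so the real footprint will be higher.

```python
def estimate_nf4_weight_gb(n_params, block_size=64, second_level_block=256):
    """Rough size of NF4 weights with double quantization, in GB (1e9 bytes)."""
    weight_bits = 4.0
    # Double quantization (assumed layout): one 8-bit constant per block of
    # weights, plus one 32-bit constant per group of first-level constants.
    constant_bits = 8 / block_size + 32 / (block_size * second_level_block)
    return n_params * (weight_bits + constant_bits) / 8 / 1e9


# Assumption: ~40e9 parameters, taken from the model name only.
print(f"Estimated weight footprint: {estimate_nf4_weight_gb(40e9):.1f} GB")
```

The same weights in `bfloat16` would need about 2 bytes per parameter, which is the comparison that motivates `load_in_4bit=True`.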

In [3]:
# Falcon's tokenizer defines no pad token, so reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token

In [4]:
os.listdir("outputs/checkpoint-2089")

['rng_state.pth',
 'README.md',
 'training_args.bin',
 'adapter_config.json',
 'adapter_model.bin',
 'optimizer.pt',
 'trainer_state.json',
 'scheduler.pt']

### Applying the PEFT adapter

In [5]:
model = PeftModel.from_pretrained(
    model,
    "outputs/checkpoint-2089",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

### Inference

In [6]:
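def generate(model, tokenizer, question, max_new_tokens=128):
    """Sample an answer to `question` and print the wall-clock time taken."""
    start_time = time.time()
    # The prompt is unchanged: the original "\A" was an invalid escape that Python
    # kept as a literal backslash, so it is written explicitly as "\\A" here.
    test_prompt_template = "As a pytorch expert engineer answer the question. Question: \n{question}\n---\\Answer:\n"
    test_sample = test_prompt_template.format(question=question)
    input_ids = tokenizer(test_sample, return_tensors="pt").input_ids
    # max_length counts the prompt tokens plus the newly generated tokens.
    output_tokens = input_ids.shape[1] + max_new_tokens
    input_ids = input_ids.to("cuda")
    outputs = model.generate(inputs=input_ids, do_sample=True, max_length=output_tokens)
    gen_text = tokenizer.batch_decode(outputs)[0]
    # Keep only the text after the last "Answer:" marker in the decoded output.
    gen_text = gen_text.split("Answer:")[-1]
    print("<<<<<<<<<<<<<< Time taken for inference: ", time.time() - start_time)
    return gen_text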
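
A possible variant of the helper above, sketched under a few assumptions, that avoids the `attention_mask` / `pad_token_id` warning `generate` triggers. It passes the tokenizer's attention mask and the pad token id set in cell 3, uses `max_new_tokens` directly, and decodes only the tokens produced after the prompt instead of splitting on `"Answer:"`. Splitting with `[-1]` keeps the text after the last marker, so if the model emits a second `"Answer:"` the first answer is dropped, which may be why some outputs below contain only a closing remark. The name `generate_with_mask` is hypothetical, and `model.device` is assumed to point at the device that should receive the inputs.

```python
import time


def generate_with_mask(model, tokenizer, question, max_new_tokens=128):
    """Variant of `generate` that passes the attention mask and pad token id."""
    start_time = time.time()
    # Same prompt text as the original cell, including the literal backslash.
    prompt = (
        "As a pytorch expert engineer answer the question. "
        f"Question: \n{question}\n---\\Answer:\n"
    )
    # Calling the tokenizer returns both `input_ids` and `attention_mask`.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        do_sample=True,
        max_new_tokens=max_new_tokens,  # counts generated tokens only
        pad_token_id=tokenizer.pad_token_id,
    )
    # Decode only the tokens produced after the prompt.
    prompt_length = len(inputs["input_ids"][0])
    new_tokens = outputs[0][prompt_length:]
    gen_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
    print("Time taken for inference:", time.time() - start_time)
    return gen_text
```

It would be called the same way as the original, for example `generate_with_mask(model, tokenizer, "How to convert numpy to tensor?")`. Because sampling is enabled, its answers will differ from the recorded outputs below.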
    

### Sample questions

In [7]:
print(generate(model=model, tokenizer=tokenizer, question="How to convert numpy to tensor?"))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


<<<<<<<<<<<<<< Time taken for inference:  122.85856223106384

To convert a numpy tensor to a PyTorch tensor, you can use PyTorch's torch.from_numpy() method. Here's an example:

import numpy as np
import torch

# create a numpy tensor
numpy_tensor = np.random.randn(3, 5)

# convert numpy tensor to PyTorch tensor
tensor = torch.from_numpy(numpy_tensor)

# check the tensor creation with PyTorch
print(tensor.size())  # outputs size torch.Size([3, 5])

print


In [8]:
question = "How do I check if PyTorch is using the GPU"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=512))



<<<<<<<<<<<<<< Time taken for inference:  649.230696439743

Yes, the .is_cuda() method doesn't work on older versions of PyTorch and will raise an error. You can check the PyTorch version that you're using with the `torch.__version__` method. Here's an example:

```
import torch

# check the PyTorch version
print(torch.__version__)

# create a tensor on the CPU
x = torch.


In [10]:
question = "How do I check if PyTorch is using the GPU"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  267.01890087127686

Do you have any other question? Let me know if you need help with anything else. I'm happy to


In [11]:
question = "Code to save the model?"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  265.32049679756165

Here is some code to save the model:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
     nn.Linear(28 * 28, 10), # 28x28 -> 10
     nn.Softmax(dim=0)
)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    
# Training
lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, "min", patience=10) # Reduce learning rate on plateu
num_epochs = 10000
for epoch in range(num_epochs):
    for images, labels in dataloader:
        # Training
        optimizer.zero_grad()
        logits = model(images)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        
        # Validation
        if (epoch + 1) % 1000 ==


In [12]:
question = "Code to save the model?"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  265.315322637558

To save PyTorch model you have to use `torch.save()` function. For saving just the model, you can do:
```python
import torch  # Import torch module

model = YourModel(input_size=..., hidden_size=...)
torch.save(model.state_dict(), 'your_model.pt')
```

Where `YourModel` is your PyTorch model with state_dict() and `your_model.pt` will be the file location where your model will be saved.

For saving both the model and its corresponding optimizer state, you can do:
```python
import torch

model = YourModel(input_size=..., hidden_size=...)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001) # use whatever optimizer you want

torch.save({'model': model.state_dict(),
              'optimizer': optimizer.state_dict(),
              'loss':...}, 
              'your_model.pt')
```

Here `{'model': model.state_dict(),
          'optimizer':


In [13]:
question = "What does .view() do?"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  257.35193061828613

Is there anything else I can help you with? Let me know if you have any more questions or if there's anything else I can do for you. Have a great day!<|endoftext|>


In [14]:
question = "Why do we need to call zero_grad()?"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  267.2173728942871

To train a neural network, we need to apply a stochastic gradient descent algorithm. This algorithm needs to know the gradients of the loss with respect to the model parameters, as it uses these gradients to update the parameter values. By calling `zero_grad()`, we zero-out the gradients of the model parameters. 

Here's some code that should make the concept more clear:

```
import torch
import torch.nn as nn
import torch.optim as optim

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(2, 2)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(2, 1)
        self.loss = nn.MSELoss()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.fc2(x)
        return x

model = Model()
optimizer = optim.Adam(model.parameters(), lr=0.01)



In [15]:
question = "How do I print the model summary?"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  266.07621026039124

There are multiple ways to print the model summary in Pytorch. Here are some options:

1. Method 1: Using the .summary() method:

The first option for printing the model summary is to use the .summary() method. This method prints the model structure along with the number of parameters.
Here is an example:

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, n_inputs, n_hidden, n_outputs):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(n_inputs, n_hidden)        # 1 hidden layer
        self.fc2 = nn.Linear(n_hidden, n_outputs)     # 1 output layer
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

net


In [16]:
question = "What is the purpose of PyTorch's grad_fn attribute in tensors?"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  270.05516624450684

The grad_fn attribute in PyTorch tensors is used to store the gradients (i.e. the derivative of the tensor with respect to its inputs) of a tensor with respect to each of its inputs. This is useful for backpropagation during training, where gradients computed during forward pass are used to update the weights of the model parameters using an optimizer.

Here is an example of how to use the grad_fn attribute:

``` python
import torch

# define a neural network with trainable parameters
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc = torch.nn.Linear(3, 2)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = self.fc(x)
        x = self.relu(x)
        return x

# create parameters and set requires_grad to True
params = [torch.nn.Parameter(torch.randn(3, 2)) for _ in range(2)]
for param in params:
    param.requires_grad =


In [17]:
question = "What is fsdp and how to use it in code?"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  268.02260088920593

Sure! FSDP or Feature Scaling for Deep PyTorch is a pytorch extension that allows to automatically find the most appropriate data augmentation for PyTorch deep learning models. It can be used to scale the feature values of the inputs before they are fed into deep learning models.

To use FSDP, you need to install it first using pip:

```
pip install pytorch-extensions
```

Once installed, you can just import it and add it to your PyTorch model as shown below:

```
import torch
import torch.nn as nn

torch.set_default_tensor_type('torch.cuda.FloatTensor')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(
    nn.Linear(784, 500),
    nn.ReLU(),
    nn.Linear(500, 10)
)

fsdp_model = FsDPSequential(model)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(fsdp_model.


In [18]:
question = "How train the model in multi-node"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  266.0567967891693

To train the PyTorch model in a multi-node scenario, you can follow the below steps:

1. Install PyTorch and torch.distributed on all the nodes in your cluster.

2. Create a configuration file for your PyTorch model and specify the number of nodes and number of GPUs on each node. You can get an example configuration file from this GitHub link

3. Prepare the data for training and distribute the data files across all nodes.

4. Use torch.distributed to run the model training.

5. Checkpointing: To avoid overwriting the checkpoints on a worker node, the code needs to be updated to run with absolute paths or a shared file system.

Here is a sample code snippet for distributed training:

```
import torch
import torch.nn as nn
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
from torch.distributed import ParallelContext

def init_process_group(backend):
    torch.cuda.set_device(device)


In [19]:
question = "How to write a custom dataset"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  265.3009719848633

To answer your question, a custom dataset in PyTorch refers to defining your own data loader by a specific DataLoader subclass. 

Based on your question and answer, the data loader that you implemented is just creating a wrapper for the torch.utils.data.dataset.Dataset. Therefore, it can be considered as a custom dataset.

Actually, a custom dataset is not always required to define a new file system. You can also create a custom iterator from an existing dataset by implementing the __iter__() method like in the example below:

```python
import torch
from torch.utils.data import Dataset, IterableDataset
from torch.utils.data.dataloader import DataLoader


class CustomIterator(IterableDataset):
    def __init__(self, dataset, batch_size: int = 32, shuffle: bool = True):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        return self
    
    def __l

In [20]:
question = "How to write a custom dataset"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  265.28244805336

There are a few ways to write a custom dataset, e.g using DataLoader from torch.utils.data or create your custom data-module extending PyTorch Dataset. In the following I'll show you an example of creating a custom dataset by subclassing PyTorch's DataLoader:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

# custom numpy dataset
class CustomDataset(object):
    def __init__(self, input, target):
        self.input = input
        self.target = target

    def __getitem__(self, index):
        return self.input[index], self.target[index]

    def __len__(self):
        return len(self.input)

# custom data-module extending PyTorch `Dataset`
class CustomDataModule(torch.utils.data.Dataset):
    def __init__(self):
        self.idx_file = 'train.idx'

    def __len__(self):
        return open(self.idx_file, 'r').readline().split()

    def __getitem__(self, idx):
        filename =


In [21]:
question = "How should I use torch.compile properly??"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  266.45243668556213

torch.compile is an optimization method introduced in PyTorch version 0.4.0 that uses pytorch autograd to automatically compute and optimize the gradient of your model as it runs. This allows for more efficient and flexible computation. The idea being that the computation being done is the same, but the actual computation is done using as little python code (where errors occur) as possible.

torch.compile takes three arguments. First, it takes the model you want to optimize (as a module or instance), and second, it takes a list of tensors whose gradients that the model depends on. Third, it takes a list which indicates which tensors in your model do not need to have their gradient computed (e.g., the biases in a neural network). Lastly, it takes an optional target (a subclass of nn.Module) which is used to infer the type of operation that the model is doing (e.g., cross entropy).

Here's an example that I think should convey

In [22]:
question = "Why do we pack the sequences?"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  265.67030143737793

The purpose of packing the sequences is to convert padded sequences back into their original length and format before applying the softmax function. Padding is used to ensure that all sequences have the same length, so that they can be concatenated together into a single tensor. In other words, padding allows us to add extra 'dummy' vectors to the sequences to ensure that all sequences are the same length before we concatenate them together.

When we apply the softmax function, we want to get the probability distribution of each word over the entire vocabulary. To do this, we need to provide our model with a single tensor that contains the entire sequence. The sequences in our training data are padded using a value that is different from any letter in our vocabulary. This padding value serves as a placeholder for any non-letter characters in the sequence, such as spaces or punctuation marks.

By packing the sequences and con

In [23]:
question = "Check the total number of parameters"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  265.2859137058258

To check the total number of parameters in a PyTorch model, you can use its `parameters()` method. Here's an example:

```
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 4)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return x

# create an instance of the model
model = MyModel()

# get the number of trainable parameters in total
total_params = sum(p.numel() for p in model.parameters())
print(f"Number of trainable parameters: {total_params}")
```

In the above code, the `MyModel` class has two fully-connected layers with 5 and 4 trainable parameters each, respectively. The `forward()` method just applies two linear transformations to the given


In [24]:
question = "RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  274.4844238758087

Unfortunately, the provided code cannot be tested as there are no values for `img_size`, `batch_size`,


In [25]:
question = "Connection between loss.backward() and optimizer.step()"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  268.00793170928955

The backward computation is used to calculate and compute gradients of variables that have been added to the graph and optimizer.step() is used to update the variables in forward direction, or more specifically in a way that the optimizer can understand where to make updates along the forward propagation back propagation path. In other words backward propagation is used to compute deltas, while optimizer.step() is used to apply those deltas to the variables. For more information and how-to, you should have a look at PyTorch's official documentation about those two functionalities: 

https://pytorch.org/docs/stable/optim.html
https://pytorch.org/docs/stable/autograd.html

```
import torch

net = torch.nn.Linear(4, 2)
x = torch.rand(4)
net.zero_grad() # clear the buffer

out = net(x)
loss = torch.mean(out) # compute the loss

# call backward() to compute deltas for the loss
loss.backward()

# call optimizer.step() to update th

In [26]:
question = "How to improve model accuracy"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  264.905070066452

There are several things one could try and improve the accuracy of their model in this scenario. Here are a few suggestions:

1. Try increasing the number of layers or the number of filters in each layer of the network. This will help the model to capture more complex relationships between the input features and the target class.

2. Instead of just using mean squared error as the loss function, try experimenting with more complex loss functions. This will help the model to capture more complex relationships between the input features and the target class.

3. If possible, try increasing the amount of training data for the network. This will help the model to learn more complex relationships between the input features and the target class.

4. Finally, consider trying different hyperparameters for your network, such as learning rate or batch size. This will help improve the accuracy and stability of your network.

Overall, the

In [27]:
question = "How to improve the training speed"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  265.29113030433655

To improve the training speed, you can try increasing batch size, reducing dropout, decreasing the learning rate, reducing the number of layers and using less complex models. Here is some code example:

Batch size: 

batch_size=32

DataLoader(train_tensor, batch_size=batch_size, shuffle=True)


Dropout: 

dropout_prob=0.4

model.eval()
drop = nn.Dropout(p=dropout_prob)
output = drop(model(input))


Learning rate:  

optimizer.param_groups[1]['lr'] = lr


Reducing the number of layers:

classifiers:

model = nn.Sequential(
    nn.Linear(512,256), #64
    nn.ReLU(),
    nn.Dropout(dropout_prob),
    nn.Linear(256, 64), #128
    nn.ReLU(),
    nn.Dropout(dropout_prob),
    nn.Linear(64, 10), #final layer
)

optimizer = torch.optim.Adam(model.parameters


In [28]:
question = "How to manage huge input tensor?"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  265.6680574417114

There are several ways in which you can deal with a massive input tensor without running out of GPU memory. Here are a few of them.

1. Sub-batch data: Instead of processing the entire data in a single batch, you can divide the data into two or more sub-batches. This will help you to make better use of GPU memory. You can achieve this by modifying the `batch_size` attribute of the `DataLoader`. For example:

```python
from torch.utils import data
from torch.utils.data.dataloader import default_collate
from torchvision import transforms
import torch

train_set = data.ImageFolder(args.train_dir, transforms.Compose([
   transforms.RandomCrop(224),
   transforms.RandomHorizontalFlip(),
   transforms.ToTensor(),
]))

batch_size = 128  
train_loader = data.DataLoader(train_set, batch_size=batch_size, shuffle=True, drop_last=True)

for images, labels in train_loader:
    print(images.shape, labels.shape)
    # perform operations wit

In [29]:
question = "How to fix RuntimeError 'Expected object of scalar type Float but got scalar type Double for argument'?"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  271.2506945133209

Is there anything else I can assist you further?  Do you have any other questions in mind? If so, please feel free to ask. I'll be happy to help


In [30]:
question = "how to do mixed precision training?"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  265.6795959472656

Yes, you have to use cudnn (CuDNN). It is a deep learning library that can be used to perform mixed-precision training in PyTorch using Tensor Cores on NVIDIA GPUs. Here are the steps to perform MPT in PyTorch and use Cudnn:

1. Check that Cudnn is available:
You can check whether Cudnn is available by using the following code snippet:
```
import torch
torch.backends.cudnn.enabled = True
```
If you get the following error, then Cudnn is not available:
```
RuntimeError: CUDNN_STATUS_INTERNAL_ERROR occurred.
```

2. Load the necessary extensions:
Load the necessary extensions such as TorchScript, and Tensor Cores to enable Cudnn:
```
import torch
import torchvision.transforms as transforms
from torchvision import datasets, models, transforms

from torch.utils.data import DataLoader
from torch.nn.parallel import DistributedDataParallel
from timm.models import get_model, ModelInfo
from timm.data.transform import Compose, Interpol

In [31]:
question = "How to avoid CUDA out of memory?"
print(generate(model=model, tokenizer=tokenizer, question=question, max_new_tokens=256))



<<<<<<<<<<<<<< Time taken for inference:  266.46967005729675

There are several ways to avoid cuda out of memory errors. Here are some tips:

1. Increase your GPU memory limit. You can do this by specifying the CUDA_VISIBLE_DEVICES environment variable:

```
export CUDA_VISIBLE_DEVICES=<device_index>;
```

Replace <device_index> with the index of your desired GPU.

2. Reduce the batch size of your data loader. This will reduce the amount of memory required for training.

3. Increase the GPU memory usage efficiency of your code. This can be done by optimizing your code and reducing the amount of memory your model consumes during training.

Here's an example code snippet to optimize your model memory usage:

```
def forward(self, x):
    x = self.layer1(x)
    x = self.layer2(x)
    return self.layer3(x)
```

Instead of using tensors like this, you can use nn.utils.rnn.pack_padded_sequence and nn.utils.rnn.pad_packed_sequence to convert your sequence of inputs into a fixed
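

The timings printed in the first few cells give a rough idea of generation speed. The sketch below divides each token budget by the measured wall-clock time. It assumes every run generated its full `max_new_tokens` budget, which is not verified here, so the figures are upper bounds on throughput. `tokens_per_second` is a hypothetical helper, and the numbers are copied from the outputs of cells 7, 8 and 10.

```python
# (max_new_tokens, seconds) pairs copied from the outputs of cells 7, 8 and 10.
runs = [
    (128, 122.85856223106384),
    (512, 649.230696439743),
    (256, 267.01890087127686),
]


def tokens_per_second(max_new_tokens, seconds):
    """Upper bound on throughput: assumes the whole token budget was generated."""
    return max_new_tokens / seconds


for budget, seconds in runs:
    rate = tokens_per_second(budget, seconds)
    print(f"{budget:>4} tokens in {seconds:7.1f} s -> {rate:.2f} tokens/s")
```

Under that assumption, every run comes out at roughly one token per second or slower.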


### Additional tests

In [43]:
print(generate(model=model, tokenizer=tokenizer, question="How do i check if pytorch is using GPU?"))




You can check if pytorch is using GPU by setting the `CUDA.available` flag to `True` before importing pytorch. If the flag is `True`, then pytorch is using GPU. Here is an example code snippet:
```
import sys
import os
import shutil

import importlib
importlib.reload(sys)

sys.stdout.write('\n')

if os.name == 'nt':
    os.system("setx GPU_FORCE_64BIT_PTR 0")

os.environ
```
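
For comparison with the generated answer, the usual way to check whether PyTorch can use a GPU is `torch.cuda.is_available()`, together with inspecting the device of a tensor. The sketch below assumes only that `torch` is installed; `describe_device` is a hypothetical helper name.

```python
import torch


def describe_device() -> dict:
    """Report whether PyTorch can see a CUDA device and where a new tensor lands."""
    cuda_available = torch.cuda.is_available()
    device = torch.device("cuda" if cuda_available else "cpu")
    info = {
        "cuda_available": cuda_available,
        "device_count": torch.cuda.device_count(),
        # Allocate a tiny tensor on the chosen device and read back its device type.
        "tensor_device": torch.zeros(1, device=device).device.type,
    }
    if cuda_available:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


print(describe_device())
```

For a loaded model, `next(model.parameters()).device` shows where its first parameter lives.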


In [44]:
print(generate(model=model, tokenizer=tokenizer, question="Steps to debug OOM issues in pytorch?", max_new_tokens=256))



There are several steps you can take to diagnose and debug OOM (Out Of Memory) issues in PyTorch:

1. Increase the size of your GPU. If you have a large model or dataset, it may require more memory than your GPU has available. Consider upgrading to a larger GPU or using a cloud-based service that can handle larger models.

2. Check memory usage during training. PyTorch provides a memory profiling utility that can help you identify where memory is being used during training. This can help you optimize your model or dataset to better utilize available memory.

3. Use a smaller batch size. Training with a smaller batch size can reduce the amount of memory required during training. This can help avoid OOM errors.

4. Use memory-efficient optimizers. If memory usage is a critical factor, consider using an optimizer like SGD or Adam, which are designed to be memory-efficient.

5. Use more advanced debugging tools. If the above steps do not resolve the issue, consider using more advanced deb

In [45]:
print(generate(model=model, tokenizer=tokenizer, question="Steps to debug OOM issues in pytorch?", max_new_tokens=256))




To debug OOM issues in pytorch, one can follow the below steps:

1. Identify the root cause of the issue by profiling the application to identify the memory usage patterns.

2. Use a memory-profiling tool such as tracemalloc to gain insight into Python's memory usage.

3. Use PyTorch's internal memory tracker to identify memory leaks in PyTorch.

4. Optimize memory usage by minimizing the amount of data stored in memory and reducing data copying.

5. Consider using a virtual machine with increased memory limits to run the application.

6. Consider reducing the batch size to reduce memory requirements.

7. Use PyTorch's memory optimization techniques such as memory pinning to avoid frequently swapping data between RAM and slow external devices such as hard drives.

8. Optimize network communication to reduce the amount of data transmitted between devices.

9. Consider scaling up or down the number of devices in the cluster to balance memory usage and optimize memory usage at scale.

1
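
Both answers mention profiling memory without naming a concrete call. As a hedged illustration, PyTorch exposes counters for current, reserved, and peak CUDA memory. The sketch below wraps them in a hypothetical helper, `cuda_memory_report`, which returns `None` on a machine without a visible GPU.

```python
import torch


def cuda_memory_report(device: int = 0):
    """Return current, reserved, and peak CUDA memory in MiB, or None without a GPU."""
    if not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return {
        "allocated_mib": torch.cuda.memory_allocated(device) / mib,
        "reserved_mib": torch.cuda.memory_reserved(device) / mib,
        "peak_allocated_mib": torch.cuda.max_memory_allocated(device) / mib,
    }


print(cuda_memory_report())
```

Calling the helper before and after a forward or backward pass shows which step drives the peak. `torch.cuda.reset_peak_memory_stats()` clears the peak counter between measurements, and `torch.cuda.memory_summary()` prints a more detailed table.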

In [46]:
print(generate(model=model, tokenizer=tokenizer, question="Code to create a custom dataset?", max_new_tokens=256))



To create a custom dataset, you can subclass the `torch.utils.data.Dataset` class and implement the required methods such as `__init__`, `__len__`, `__getitem__`, and `__setitem__`. The `__init__` method takes no arguments and is called when the dataset is initialized. The `__len__` method returns the length of the dataset. The `__getitem__` method takes an index `i` and returns the data point at that index. The `__setitem__` method takes an index `i` and a data point as arguments and sets the data point at that index.

Here is an example implementation of the custom dataset:

```python
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, data_file='data.txt') -> None:
        with open(data_file, 'r') as f:
            lines = f.readlines()
        self.data = [line.strip() for line in lines]

    def __len__(self) -> int:
        return len(self.data)
```
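
The generated class stops before `__getitem__`. As a point of reference, a map-style dataset needs `__getitem__` and `__len__`, and `__setitem__` is not part of the interface. The sketch below is a complete, self-contained variant that reads from an in-memory list instead of a file. `LineDataset` and the sample strings are hypothetical.

```python
from torch.utils.data import DataLoader, Dataset


class LineDataset(Dataset):
    """Map-style dataset over a list of text lines (hypothetical example)."""

    def __init__(self, lines):
        # Strip trailing newlines once, up front.
        self.data = [line.strip() for line in lines]

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index: int) -> str:
        return self.data[index]


dataset = LineDataset(["first line\n", "second line\n", "third line\n"])
loader = DataLoader(dataset, batch_size=2, shuffle=False)
batches = [batch for batch in loader]  # string samples are batched as lists
print(batches)
```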


In [47]:
print(generate(model=model, tokenizer=tokenizer, question="What is L1 and L2 regularization?"))



L1 and L2 regularization are two commonly used methods to regularize neural networks and prevent overfitting. L1 regularization, also known as LASSO or Lasso regularization, is a technique that introduces a penalty term on the sum of the absolute values of model parameters to encourage sparsity in the model. This ensures that the solution contains only a few non-zero values, which helps to avoid overfitting and improve the generalization ability of the model. L2 regularization, also known as ridge or Tikhonov regularization, is a technique that adds a penalty term on the sum of squares of the model parameters to
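
The two penalties described in this answer can be written out in a few lines. This is a minimal sketch in plain Python; the function names, the example weights, and the strength parameter `lam` are illustrative assumptions.

```python
def l1_penalty(weights, lam):
    """L1 (lasso) penalty: lam times the sum of absolute parameter values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 (ridge) penalty: lam times the sum of squared parameter values."""
    return lam * sum(w * w for w in weights)


weights = [0.5, -1.5, 2.0]
print("L1:", l1_penalty(weights, lam=0.1))
print("L2:", l2_penalty(weights, lam=0.1))
```

In training, the chosen penalty is added to the data loss before backpropagation. In PyTorch, the `weight_decay` argument of optimizers such as `torch.optim.SGD` applies an L2-style penalty.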


In [51]:
print(generate(model=model, tokenizer=tokenizer, question="How to save the model?"))


<<<<<<<<<<<<<< Time taken for inference:  118.64722800254822

As a pytorch expert engineer, I can provide a detailed answer to your question on how to save the model in PyTorch. 

There are three main ways to save a trained model in PyTorch:

1. Using PyTorch’s built-in save function: PyTorch provides the `torch.save()` function to save a model’s weights, state dictionary, and its optimizer. The `model.save()` method uses this function to save the model to a file on disk.

Here is an example code snippet:

```python
import torch
```
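
The generated snippet is cut off after the import. For reference, a common pattern is to save a module's `state_dict` with `torch.save` and restore it with `load_state_dict`; plain `nn.Module` objects have no `save()` method. The sketch below uses a small stand-in `nn.Linear` and a temporary path, both hypothetical.

```python
import os
import tempfile

import torch
from torch import nn

tiny_model = nn.Linear(4, 2)  # stand-in model, not the Falcon model from this notebook
path = os.path.join(tempfile.mkdtemp(), "model_state.pt")

# Save only the parameters and buffers, not the Python class itself.
torch.save(tiny_model.state_dict(), path)

# Restore into a freshly constructed module with the same architecture.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path))
restored.eval()  # switch to inference mode before evaluating
```

For the PEFT adapter used in this notebook, `save_pretrained` on the adapter-wrapped model is the more usual route, since it writes only the adapter weights and config.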


