# Fine-tuning Llama 2 7b with AutoTrain

In this notebook, we'll walk through the steps to fine-tune Llama 2 7b on your own dataset.
Follow along by running each cell in order!

### Setup Runtime
For fine-tuning Llama, a GPU instance is essential. Follow the directions below:

1. Go to `Runtime` (located in the top menu bar).
2. Select `Change Runtime Type`.
3. Choose `T4 GPU` (or a comparable option).
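
Once the runtime has restarted, it can help to confirm that a GPU is actually attached before installing anything. The cell below is a minimal sketch of such a check; `gpu_summary` is a hypothetical helper name, and it assumes the runtime exposes NVIDIA's `nvidia-smi` tool on the `PATH`.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the GPU names reported by nvidia-smi, or None if no GPU is visible."""
    # nvidia-smi is only on PATH when an NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    # A non-zero exit code means the driver could not reach a GPU.
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


summary = gpu_summary()
print(summary if summary else "No GPU detected: check the runtime type.")
```

If the cell reports that no GPU was detected, repeat the runtime steps above before continuing.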


### Package Installation

Before we get started, let's ensure we have all the necessary packages installed.

In [None]:
!pip install pandas


In [None]:
!pip install autotrain-advanced


### Setup Autotrain
The step below is required to run AutoTrain in Colab.


In [None]:
!autotrain setup --update-torch


### Connect to Hugging Face for model upload
#### Logging in to Hugging Face

To upload the model so that it can be used for inference, you need to log in to the Hugging Face Hub.

#### Getting a Hugging Face token
Steps: 
1. Navigate to this URL: https://huggingface.co/settings/tokens
2. Create a `write` token and copy it to your clipboard
3. Run the code below and enter your token



In [None]:
from huggingface_hub import notebook_login
notebook_login()

### Upload your dataset
Add your dataset to the root directory of the Colab under the name `train.csv`. The AutoTrain command looks for your data there under that name.

##### Don't have a dataset and want to try fine-tuning on an example dataset?
If you don't have a dataset, you can run the commands below to download an example dataset and save it as `train.csv`.






In [None]:
!git clone https://github.com/joshbickett/finetune-llama-2.git
%cd finetune-llama-2
%mv train.csv ../train.csv
%cd ..

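Whichever way `train.csv` got there, a quick sanity check can catch a missing file or a wrongly named column before a long training run starts. The sketch below uses only the standard library. `check_training_csv` is a hypothetical helper, and the `text` column name is an assumption about what the trainer expects by default, so adjust `text_column` if your data or AutoTrain version differs.

```python
import csv
from pathlib import Path


def check_training_csv(path="train.csv", text_column="text"):
    """Return the number of data rows, raising if the file or column is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"{csv_path} not found in {Path.cwd()}")
    with csv_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        # ASSUMPTION: the training text lives in a column named `text`.
        if text_column not in columns:
            raise ValueError(f"expected a '{text_column}' column, found {columns}")
        # Count the remaining rows (the header has already been consumed).
        return sum(1 for _ in reader)
```

Calling `check_training_csv()` from the Colab root directory returns the row count, or raises an error describing what is missing.
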
## Overview of AutoTrain command

#### Short overview of what the command flags do

- `!autotrain`: The `!` prefix tells the notebook to run the line as a shell command. `autotrain` is the AutoTrain command-line utility.

- `llm`: The sub-command that selects the type of task, here fine-tuning a large language model.

- `--train`: Initiates the training process.

- `--project_name`: Sets the name of the project.

- `--model abhishek/llama-2-7b-hf-small-shards`: Specifies the base model to fine-tune, which is hosted on Hugging Face as `llama-2-7b-hf-small-shards` under the `abhishek` account.

- `--data_path .`: The path to the dataset for training. The "." refers to the current directory. The `train.csv` file needs to be located in this directory. 

- `--use_int4`: Loads the model with INT4 quantization, which reduces its memory footprint so that it fits on a smaller GPU, at the cost of some precision.

- `--learning_rate 2e-4`: Sets the learning rate for training to 0.0002.

- `--train_batch_size 12`: Sets the batch size for training to 12.

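- `--num_train_epochs 3`: The training process will iterate over the dataset 3 times.

- `--use_peft`: Enables parameter-efficient fine-tuning (PEFT), which trains a small set of adapter weights rather than the full model.

- `--trainer sft`: Selects the supervised fine-tuning (SFT) trainer.

- `--push_to_hub`: Uploads the result to the Hugging Face Hub once training is completed.

- `--repo_id`: The Hub repository to upload to, in the form `username/repository`.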
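
To get a feel for how the batch size and epoch count above translate into training work, the arithmetic can be sketched as follows. This is a rough estimate only: it assumes a single GPU and no gradient accumulation, and the 1,000-row dataset size is hypothetical.

```python
import math


def optimizer_steps(num_rows, batch_size=12, epochs=3):
    """Estimate total optimizer steps, assuming one device and no gradient accumulation."""
    # The last batch of an epoch may be smaller, so round up.
    steps_per_epoch = math.ceil(num_rows / batch_size)
    return steps_per_epoch * epochs


# Hypothetical dataset of 1,000 rows: ceil(1000 / 12) = 84 steps per epoch.
print(optimizer_steps(1000))  # 84 * 3 = 252
```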

### Steps needed before running
Go to the `!autotrain` code cell below and update it as follows:

1. After `--project_name`, replace `*enter-a-project-name*` with the name you'd like to give the project.
2. After `--repo_id`, replace `*username*/*repository*`: use your Hugging Face username for `*username*` and the name of the repository you'd like the model uploaded to for `*repository*`. You don't need to create this repository beforehand; it is created automatically and the model is uploaded to it once training is completed.
3. Confirm that `train.csv` is in the root directory of the Colab. The `--data_path .` flag makes AutoTrain look for your data there.
4. Once you've made these changes, you're all set. Run the command below!
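
If you'd rather not edit the long command by hand, the substitutions in the steps above can be sketched in Python. `build_autotrain_command` is a hypothetical helper, not part of AutoTrain; it only assembles the same flags used in this notebook and rejects values that still contain the `*...*` placeholders. The project and repository names in the example are made up.

```python
import shlex


def build_autotrain_command(project_name, repo_id):
    """Assemble the autotrain command from this notebook with placeholders filled in."""
    # Reject values that are empty or still contain the `*...*` placeholders.
    for value in (project_name, repo_id):
        if not value or "*" in value:
            raise ValueError(f"placeholder not replaced: {value!r}")
    # repo_id must look like username/repository.
    username, _, repository = repo_id.partition("/")
    if not username or not repository or "/" in repository:
        raise ValueError(f"repo_id should be username/repository, got {repo_id!r}")
    args = [
        "autotrain", "llm", "--train",
        "--project_name", project_name,
        "--model", "abhishek/llama-2-7b-hf-small-shards",
        "--data_path", ".",
        "--use_peft", "--use_int4",
        "--learning_rate", "2e-4",
        "--train_batch_size", "12",
        "--num_train_epochs", "3",
        "--trainer", "sft",
        "--push_to_hub", "--repo_id", repo_id,
    ]
    # shlex.join quotes any argument that needs it for the shell.
    return shlex.join(args)


# Hypothetical names, for illustration only.
print(build_autotrain_command("my-llama-project", "my-username/my-llama-project"))
```

The printed string can be pasted after a `!` in a notebook cell in place of the hand-edited command.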




In [None]:
!autotrain llm --train --project_name *enter-a-project-name* --model abhishek/llama-2-7b-hf-small-shards --data_path . --use_peft --use_int4 --learning_rate 2e-4 --train_batch_size 12 --num_train_epochs 3 --trainer sft --push_to_hub --repo_id *username*/*repository*

## Completed 🎉
After the command above is completed, your model will be uploaded to Hugging Face.
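
To confirm the upload, you can list the files in the new repository. The sketch below wraps `list_repo_files` from `huggingface_hub`; `list_uploaded_files` is a hypothetical helper name, and it assumes you are still logged in from the earlier `notebook_login()` step so that a private repository is readable.

```python
def list_uploaded_files(repo_id):
    """Return the file names in a Hub model repository."""
    # Imported lazily so the function can be defined before the package is installed.
    from huggingface_hub import list_repo_files

    # Uses the saved login token, if any, to access the repository.
    return list_repo_files(repo_id)
```

Call it with the same `username/repository` value that you passed to `--repo_id`.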

#### Learn more about AutoTrain (optional)
If you want to learn more about the available command-line flags, run the command below.


In [None]:

!autotrain llm -h