# Instruction Tuning LLMs for Text Generation using PyTorch, Hugging Face, and the Intel® Transfer Learning Tool API

This notebook uses the `tlt` library to fine-tune a pretrained large language model (LLM) from [Hugging Face](https://huggingface.co) using a custom dataset.

## 1. Import dependencies and set up parameters

This notebook assumes that you have already followed the instructions to set up a PyTorch environment with all the dependencies required to run the notebook.

In [None]:
import os
import warnings

from tlt.datasets import dataset_factory
from tlt.models import model_factory
from downloader.datasets import DataDownloader

warnings.filterwarnings('ignore')
os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Specify a directory for the dataset to be downloaded
dataset_dir = os.environ.get("DATASET_DIR", os.path.join(os.path.expanduser("~"), "dataset"))

# Specify a directory for output
output_dir = os.environ.get("OUTPUT_DIR", os.path.join(os.path.expanduser("~"), "output"))

print("Dataset directory:", dataset_dir)
print("Output directory:", output_dir)

## 2. Get the model

In this step, we call the Intel Transfer Learning Tool model factory to list supported Hugging Face text generation models. This is a list of pretrained models from Hugging Face that we tested with our API.

In [None]:
# See a list of available text generation models
model_factory.print_supported_models(use_case='text_generation')

Use the Intel Transfer Learning Tool model factory to get one of the models listed in the previous cell. The `get_model` function returns a TLT model object that will later be used for training.

In [None]:
model_name = "gpt-j-6b"
framework = "pytorch"

model = model_factory.get_model(model_name, framework)

## 3. Load a custom dataset

In this example, we download a sample instruction dataset in which each record contains the text fields "instruction", "input", and "output", like the following:
```
{
    "instruction": "Convert this sentence into a question.",
    "input": "He read the book.",
    "output": "Did he read the book?"
}
```
If you are using a custom or downloaded dataset with similarly formatted JSON, you can use the same code as below.
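
Before loading your own file, it can help to confirm that every record carries the expected fields. The following is a minimal sketch and not part of the `tlt` API: the helper name `find_invalid_records` is hypothetical, and it assumes the file holds a JSON array of objects shaped like the sample above.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")


def find_invalid_records(records, required_keys=REQUIRED_KEYS):
    """Return the indices of records that lack a required key or hold a non-string value."""
    invalid = []
    for index, record in enumerate(records):
        if not all(isinstance(record.get(key), str) for key in required_keys):
            invalid.append(index)
    return invalid


# Toy data standing in for the contents of a JSON file
sample_json = """
[
    {"instruction": "Convert this sentence into a question.",
     "input": "He read the book.",
     "output": "Did he read the book?"},
    {"instruction": "Say hello.", "output": "Hello!"}
]
"""
records = json.loads(sample_json)

# The second record has no "input" field, so its index is reported
print(find_invalid_records(records))
```

Records without context would normally carry an empty string for "input" rather than omit the key, which is why a missing key is flagged here.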

In [None]:
# Modify the variables below to use a different json file on your local system.
dataset_url = "https://raw.githubusercontent.com/sahil280114/codealpaca/master/data/code_alpaca_2k.json"
file_name = "code_alpaca_2k.json"

# If we don't already have the json file, download it
if not os.path.exists(os.path.join(dataset_dir, file_name)):
    data_downloader = DataDownloader('code_alpaca_2k', dataset_dir, url=dataset_url)
    data_downloader.download()

In [None]:
dataset = dataset_factory.load_dataset(dataset_dir=dataset_dir, use_case="text_generation",
                                       framework="pytorch", dataset_file=file_name)

print(dataset.info)

In [None]:
# Adjust this dictionary for the keys used in your dataset
dataset_schema = {
    "instruction_key": "instruction", 
    "context_key": "input",
    "response_key": "output"
}

### Map and tokenize the dataset

After describing the schema of your dataset, create formatted prompts out of each example for instruction-tuning. Then preprocess to tokenize the prompts and concatenate them together into longer sequences to speed up fine-tuning.

In [None]:
prompt_dict = {
    "prompt_with_context": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{{{instruction_key}}}\n\n### Context:\n{{{context_key}}}\n\n### Response:\n{{{response_key}}}".format(
        **dataset_schema)
    ),
    "prompt_without_context": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{{{instruction_key}}}\n\n### Response:\n{{{response_key}}}".format(**dataset_schema)
    ),
}
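
The triple braces in the templates above rely on Python's brace escaping: `{{` and `}}` produce literal braces, so the first `.format()` call turns `{{{instruction_key}}}` into the placeholder `{instruction}`, which is filled in later from each record. The sketch below shows the two passes on a shortened template; the names `schema`, `template`, and `record` are illustrative only.

```python
schema = {
    "instruction_key": "instruction",
    "context_key": "input",
    "response_key": "output",
}

# First pass: the schema key names are substituted and the escaped braces
# collapse, leaving placeholders named after the dataset's own fields.
template = "### Instruction:\n{{{instruction_key}}}\n\n### Response:\n{{{response_key}}}".format(
    **schema)
print(repr(template))

# Second pass: the placeholders are filled from a single dataset record.
record = {"instruction": "Say hello.", "input": "", "output": "Hello!"}
prompt = template.format_map(record)
print(prompt)
```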

In [None]:
# Preprocess the dataset
dataset.preprocess(model.hub_name, batch_size=32, prompt_dict=prompt_dict, dataset_schema=dataset_schema,
                   concatenate=True)
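
Concatenation generally means joining the tokenized examples end to end and cutting the result into fixed-length blocks, so that batches are not padded out with short sequences. The sketch below illustrates that general idea on toy token IDs. It is not the `tlt` implementation; the function name `group_into_blocks` and the block size are assumptions chosen for illustration.

```python
from itertools import chain


def group_into_blocks(tokenized_examples, block_size):
    """Concatenate lists of token IDs and split them into equal-sized blocks.

    Tokens left over after the last full block are dropped.
    """
    all_tokens = list(chain.from_iterable(tokenized_examples))
    usable_length = (len(all_tokens) // block_size) * block_size
    return [all_tokens[start:start + block_size]
            for start in range(0, usable_length, block_size)]


toy_examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = group_into_blocks(toy_examples, block_size=4)
print(blocks)
```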

In [None]:
# Shuffle the dataset and create splits for training and validation
dataset.shuffle_split(train_pct=0.75, val_pct=0.25)
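
To make the 75/25 split concrete, here is a minimal sketch of a shuffle split over example indices. It is a conceptual illustration rather than the `tlt` implementation; the function name `shuffle_split_indices` and its `seed` parameter are hypothetical.

```python
import random


def shuffle_split_indices(num_examples, train_pct=0.75, val_pct=0.25, seed=None):
    """Shuffle example indices and split them into train and validation lists."""
    if train_pct + val_pct > 1.0:
        raise ValueError("train_pct and val_pct must not sum to more than 1.0")
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    train_size = int(num_examples * train_pct)
    val_size = int(num_examples * val_pct)
    return indices[:train_size], indices[train_size:train_size + val_size]


train_idx, val_idx = shuffle_split_indices(8, seed=0)
print(len(train_idx), len(val_idx))
```

Because the indices are shuffled once and then sliced, no example can land in both subsets.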

## 4. Preview a text completion from the pretrained model

Use the generate API to look at some output for a sample prompt. Use this sample prompt or write your own.

In [None]:
# For code generation custom dataset
prompt_template = prompt_dict["prompt_with_context"]
test_example = {dataset_schema['instruction_key']: 'Write a Python function that sorts the following list.',
               dataset_schema['context_key']: '[3, 2, 1]',
               dataset_schema['response_key']: ''}

In [None]:
test_prompt = prompt_template.format_map(test_example)
test_prompt

In [None]:
model.generate(test_prompt)

## 5. Transfer Learning (Instruction Tuning)

The Intel Transfer Learning Tool model's train function is called with the dataset that was just prepared, along with an output directory and the number of training epochs. The model's evaluate function returns a list of metrics calculated from the dataset's validation subset.

### Arguments

#### Required
-  **dataset** (TextGenerationDataset, required): Dataset to use when training the model
-  **output_dir** (str): Path to a writeable directory for checkpoint files
-  **epochs** (int): Number of epochs to train the model (default: 1)

#### Optional
-  **initial_checkpoints** (str): Path to checkpoint weights to load. If the path provided is a directory, the latest checkpoint will be used.
-  **lora_rank** (int): LoRA rank parameter (default: 8)
-  **lora_alpha** (int): LoRA alpha parameter (default: 32)
-  **lora_dropout** (float): LoRA dropout parameter (default: 0.05)
-  **enable_auto_mixed_precision** (bool or None): Enable auto mixed precision for training. Mixed precision uses both 16-bit and 32-bit floating point types to make training run faster and use less memory. It is recommended to enable auto mixed precision training when running on platforms that support bfloat16 (Intel third or fourth generation Xeon processors). If it is enabled on a platform that does not support bfloat16, it can be detrimental to the training performance. If `enable_auto_mixed_precision` is set to None, auto mixed precision will be automatically enabled when running with Intel fourth generation Xeon processors, and disabled for other platforms. Defaults to None.

Note: Refer to the release documentation for an up-to-date list of train arguments and their current descriptions.
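
To give a sense of what `lora_rank` and `lora_alpha` control: in LoRA as it is commonly described, a frozen weight matrix of shape d x k receives a trainable low-rank update built from two matrices of shapes d x r and r x k, scaled by alpha / r. The sketch below only does the arithmetic. The 4096 x 4096 layer shape is an illustrative assumption, not a statement about which layers `tlt` adapts, and the helper names are hypothetical.

```python
def lora_trainable_params(d, k, rank):
    """Number of trainable parameters in a rank-`rank` update of a d x k weight matrix."""
    return rank * (d + k)


def lora_scaling(alpha, rank):
    """Scaling factor applied to the low-rank update."""
    return alpha / rank


d = k = 4096            # assumed layer shape, for illustration only
rank, alpha = 8, 32     # default values from the argument list

full = d * k
lora = lora_trainable_params(d, k, rank)
print(f"full matrix: {full:,} parameters")
print(f"LoRA update: {lora:,} trainable parameters ({lora / full:.2%} of the full matrix)")
print(f"scaling factor: {lora_scaling(alpha, rank)}")
```

Raising the rank increases the number of trainable parameters linearly, while raising alpha at a fixed rank only strengthens the contribution of the update.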

In [None]:
history = model.train(dataset, output_dir, epochs=3)

In [None]:
model.evaluate()

## 6. Export the saved model

We can call the model export function to generate a saved model in the Hugging Face format. Each time the model is exported, a new numbered directory is created, which allows identification of the latest model.
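
Since exports go into numbered directories, the newest one can be found by comparing the directory names as integers. The following is a minimal sketch under that assumption. The helper `latest_numbered_dir` is hypothetical, and the exact location of the numbered directories beneath `output_dir` is not shown here, so point it at whichever folder contains them.

```python
import os
import tempfile


def latest_numbered_dir(parent_dir):
    """Return the path of the subdirectory with the highest integer name, or None."""
    numbered = [name for name in os.listdir(parent_dir)
                if name.isdigit() and os.path.isdir(os.path.join(parent_dir, name))]
    if not numbered:
        return None
    return os.path.join(parent_dir, max(numbered, key=int))


# Demonstrate on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1", "2", "10"):
        os.mkdir(os.path.join(tmp, name))
    latest = os.path.basename(latest_numbered_dir(tmp))
    print(latest)
```

Comparing with `key=int` matters: a plain string comparison would rank "2" above "10".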

In [None]:
# Save the model to the output_dir
model.export(output_dir)

## 7. View the text completion from the fine-tuned model

Generate with the test prompt to see if the fine-tuned model gives a better response. You may want to train for at least 3 epochs to see improvement.

### Optional Parameters
-  **temperature** (float): The value used to modulate the next token probabilities (default: 1.0)
-  **top_p** (float): If set to a float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation (default: 0.75)
-  **top_k** (int): The number of highest probability vocabulary tokens to keep for top-k filtering (default: 40)
-  **repetition_penalty** (float): The parameter for repetition penalty. 1.0 means no penalty. (default: 1.0)
-  **num_beams** (int): Number of beams for beam search. 1 means no beam search. (default: 4)
-  **max_new_tokens** (int): The maximum number of new tokens generated (default: 128)
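
The way `temperature`, `top_k`, and `top_p` interact can be seen on a toy next-token distribution. The sketch below is a simplified, pure-Python illustration of the filtering these parameters describe. It is not the code path used by `model.generate`, and the function name `filter_next_token_probs` is hypothetical.

```python
import math


def filter_next_token_probs(logits, temperature=1.0, top_k=40, top_p=0.75):
    """Apply temperature, then top-k and top-p filtering, and renormalize.

    `logits` maps token -> raw score. Returns token -> probability for kept tokens.
    """
    # Temperature rescales the logits before the softmax
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_score = max(scaled.values())
    exp_scores = {tok: math.exp(score - max_score) for tok, score in scaled.items()}
    total = sum(exp_scores.values())
    probs = {tok: value / total for tok, value in exp_scores.items()}

    # top-k: keep only the k most probable tokens, then renormalize
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    k_total = sum(prob for _, prob in ranked)
    ranked = [(tok, prob / k_total) for tok, prob in ranked]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    kept_total = sum(prob for _, prob in kept)
    return {tok: prob / kept_total for tok, prob in kept}


# Toy distribution: logits chosen so the softmax yields 0.5, 0.3, 0.15, 0.05
toy_logits = {"a": math.log(0.5), "b": math.log(0.3),
              "c": math.log(0.15), "d": math.log(0.05)}
filtered = filter_next_token_probs(toy_logits, top_k=3, top_p=0.75)
print(filtered)
```

With these toy values, top-k removes the least likely token, and top-p then keeps only the two most likely tokens, whose probabilities are rescaled to sum to 1.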

In [None]:
model.generate(test_prompt, repetition_penalty=6.0)