# MLX fine-tuning - Chart Metadata Model

> This notebook is based on Microsoft's official [fine-tuning cookbook](https://github.com/microsoft/Phi-3CookBook/blob/main/md/04.Fine-tuning/FineTuning_MLX.md) for Phi-3 using MLX.

## Prerequisites

In [21]:
%%capture
# Install MLX-LM framework for fine-tuning LLMs using Apple Silicon GPU
%pip install mlx-lm

# Install Hugging Face datasets to load the fine-tuning data
%pip install datasets

## Data Preparation

We will use the [clnnn/letyca-chart-metadata](https://huggingface.co/datasets/clnnn/letyca-chart-metadata) dataset from Hugging Face. The dataset follows the [ShareGPT style](https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style), where each data row looks like this:

```json
{
    "conversations": [
        {
            "from": "human",
            "value": "total number of products"
        },
        {
            "from": "gpt",
            "value": "```json{\"chartType\":\"countLabel\",\"title\":\"Total number of products\"}```"
        }
    ]
}
```

In this step we convert the dataset into the `.jsonl` format, with each conversation rendered using the Phi-3 prompt template:

```json
{"text": "<|user|>\ntotal number of products <|end|>\n<|assistant|> ```json{\"chartType\":\"countLabel\",\"title\":\"Total number of products\"}``` <|end|>"}
```

In [None]:
from datasets import load_dataset, DatasetDict

# Phi-3 role tags keyed by ShareGPT role name
ROLE_TAGS = {"system": "<|system|>\n", "human": "<|user|>\n", "gpt": "<|assistant|> "}
END_TAG = "<|end|>"

def formatting_prompts_func(examples):
    """Render each ShareGPT conversation in a batch as one Phi-3 prompt string."""
    texts = []
    for convo in examples["conversations"]:
        # Each turn becomes "<role tag><message> <|end|>"
        turns = [f"{ROLE_TAGS[turn['from']]}{turn['value']} {END_TAG}" for turn in convo]
        # Turns are separated by a newline, matching the template shown above
        texts.append("\n".join(turns))
    return {"text": texts}

# Load the dataset
dataset = load_dataset("clnnn/letyca-chart-metadata", split = "train")

# Apply the formatting function to the dataset
dataset = dataset.map(formatting_prompts_func, batched=True, remove_columns=dataset.column_names)

# Split the dataset into a train set (80%) and a held-out set (20%)
train_test_split = dataset.train_test_split(test_size=0.2)

# Split the held-out set in half: 10% test and 10% validation overall
test_valid_split = train_test_split["test"].train_test_split(test_size=0.5)

# Put all the splits into a DatasetDict
split_datasets = DatasetDict({
    "train": train_test_split["train"],
    "test": test_valid_split["test"],
    "valid": test_valid_split["train"],
})

# Make sure the output directory exists before writing
import os
os.makedirs("data", exist_ok=True)

# Save each split to a .jsonl file in the data directory
for split in split_datasets:
    split_datasets[split].to_json(f"data/{split}.jsonl", orient="records", lines=True)
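
Before starting a training run, it can be worth checking that the exported files have the expected shape. The following is a minimal sketch, not part of the original cookbook: the helper names `check_phi3_record` and `count_invalid` are hypothetical, and the check only looks for the Phi-3 tags described above.

```python
import json

def check_phi3_record(line: str) -> bool:
    """Return True if a JSONL line looks like a Phi-3 training record."""
    record = json.loads(line)
    text = record.get("text", "")
    return (
        text.startswith(("<|user|>", "<|system|>"))  # opens with a role tag
        and "<|assistant|>" in text                   # contains a model turn
        and text.rstrip().endswith("<|end|>")         # closes with the end tag
    )

def count_invalid(path: str) -> int:
    """Count the non-empty lines in a .jsonl file that fail the check."""
    with open(path, encoding="utf-8") as f:
        return sum(not check_phi3_record(line) for line in f if line.strip())
```

For example, `count_invalid("data/train.jsonl")` should return `0` if every row was formatted as intended.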

## Fine-tuning (default & custom configurations)

In [None]:
# Set to False to fine-tune with command-line defaults instead of config.yaml
custom_config = True
if not custom_config:
    !python3 -m mlx_lm.lora --model microsoft/Phi-3-mini-4k-instruct --train --data ./data --iters 1000
else:
    !python3 -m mlx_lm.lora --config config.yaml
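
The contents of `config.yaml` are not shown in this notebook. As a rough sketch only, assuming the file mirrors the command-line flags of the default branch, a starting point could be generated like this. The values are illustrative, `render_flat_yaml` is a hypothetical helper, and the key names should be checked against the example config shipped with your installed `mlx-lm` version.

```python
import json

# Hypothetical settings mirroring the CLI flags of the default branch;
# these are NOT the values from the notebook's own config.yaml.
settings = {
    "model": "microsoft/Phi-3-mini-4k-instruct",
    "train": True,
    "data": "./data",
    "iters": 1000,
    "batch_size": 4,
    "adapter_path": "./adapters",
}

def render_flat_yaml(values: dict) -> str:
    """Render a flat dict of scalars as YAML (JSON scalars are valid YAML)."""
    lines = [f"{key}: {json.dumps(value)}" for key, value in values.items()]
    return "\n".join(lines) + "\n"

config_text = render_flat_yaml(settings)
print(config_text)
```

The printed text can be saved under a name of your choice and passed to `--config`. Keeping the settings in a file makes runs easier to reproduce than a long command line.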

## Inference

In [None]:
!python3 -m mlx_lm.generate --model microsoft/Phi-3-mini-4k-instruct --adapter-path ./adapters --max-tokens 2048 --prompt "total number of birds" --eos-token "<|end|>"
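
In the training data, the assistant's answer is a JSON object wrapped in a `json` code fence, so a consumer of the fine-tuned model has to unwrap it before use. Below is a minimal sketch of that step. The helper name `extract_chart_metadata` is hypothetical, and it assumes the model reproduces the fenced format seen in the dataset.

```python
import json
import re

# Matches a JSON object wrapped in a ``` or ```json fence
FENCE_RE = re.compile(r"```(?:json)?\s*(\{.*?\})\s*```", re.DOTALL)

def extract_chart_metadata(response: str) -> dict:
    """Parse the chart metadata object out of a model response."""
    match = FENCE_RE.search(response)
    # Fall back to the raw text if the model omitted the fence
    payload = match.group(1) if match else response.strip()
    return json.loads(payload)

sample = '```json{"chartType":"countLabel","title":"Total number of products"}```'
metadata = extract_chart_metadata(sample)
print(metadata["chartType"], "|", metadata["title"])
```

If the response contains no valid JSON, `json.loads` raises a `ValueError`, which the caller can treat as a malformed generation.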

## Saving the model

### GGUF format

In [None]:
outtype = "q8_0"  # e.g. f32, f16, bf16, q8_0

# Fuse the LoRA adapters into the base model using MLX-LM
!python3 -m mlx_lm.fuse --model microsoft/Phi-3-mini-4k-instruct --adapter-path ./adapters --save-path ./lora_fused_model --de-quantize

# Convert the fused model to GGUF format using llama.cpp
!git clone https://github.com/ggerganov/llama.cpp.git && cd llama.cpp && python3 convert_hf_to_gguf.py ../lora_fused_model --outfile ../chart-metadata.gguf --outtype {outtype}

print("Conversion to GGUF format is complete.")
!rm -rf llama.cpp
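
The cell above prints its completion message even if the conversion command failed. As a quick sanity check, GGUF files begin with the four ASCII bytes `GGUF`, so the output can be inspected directly. This is a minimal sketch. The helper name `looks_like_gguf` is hypothetical, and it only verifies the file header, not the model contents.

```python
from pathlib import Path

def looks_like_gguf(path: str) -> bool:
    """Return True if the file exists and starts with the GGUF magic bytes."""
    file = Path(path)
    if not file.is_file():
        return False
    with file.open("rb") as f:
        return f.read(4) == b"GGUF"

print(looks_like_gguf("chart-metadata.gguf"))
```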