# Train an LLM with custom data

In this notebook we will prepare a haiku dataset and use it to fine-tune `HuggingFaceH4/zephyr-7b-beta` with Ludwig, so that the model returns haikus as JSON.

You must have `HUGGINGFACE_API_KEY` in a `.env` file for this to work.

First, we'll download the training dataset. This cell only needs to run once.

In [1]:
import requests
import os
from dotenv import load_dotenv

load_dotenv()

dataset = "statworx/haiku"
headers = {"Authorization": f"Bearer {os.environ.get('HUGGINGFACE_API_KEY')}"}
API_URL = f"https://datasets-server.huggingface.co/parquet?dataset={dataset}"

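def query():
    """Ask the datasets server for the dataset's Parquet file listing."""
    response = requests.get(API_URL, headers=headers)
    response.raise_for_status()  # fail early on a bad key or dataset name
    return response.json()

# Get the URL of the first Parquet data file
data = query()
url = data["parquet_files"][0]["url"]

# Download the file into data/, creating the directory if needed
os.makedirs("data", exist_ok=True)
r = requests.get(url, allow_redirects=True)
r.raise_for_status()
with open("data/haikus.parquet", "wb") as file:
    file.write(r.content)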
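
Since the download only needs to happen once, one option is to guard it with a file-existence check. The following is a minimal sketch rather than part of the original notebook: `ensure_file` and its `fetch` argument are hypothetical names, and `fetch` stands in for whatever call returns the file's bytes.

```python
from pathlib import Path
from typing import Callable


def ensure_file(path: str, fetch: Callable[[], bytes]) -> bool:
    """Write fetch() to path unless the file already exists.

    Returns True if the file was written, False if it was already present.
    """
    target = Path(path)
    if target.exists():
        return False
    # Create the parent directory (for example data/) if it is missing.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetch())
    return True
```

In this notebook the call might look like `ensure_file("data/haikus.parquet", lambda: requests.get(url, allow_redirects=True).content)`, so re-running the cell would skip the download.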

## Prepare the data

Next, we'll load the dataset into a pandas dataframe and prepare a training dataset

In [17]:
import json
import pandas as pd

haikus = pd.read_parquet("data/haikus.parquet")

# Let's take just a random sample of 5000 of these.
haikus = haikus.sample(5000)

# print the first haiku
print("Original Haiku format:")
print(haikus["text"].iloc[0])

# Format as JSON
def json_formatter(haiku):
    haiku_list = haiku.split(" / ")
    haiku_json = json.dumps({
        "haiku": haiku_list
    })
    return haiku_json

haikus["text"] = haikus["text"].apply(json_formatter)

print("\nWith new-lines:")
# Look at 3 of them
for i in range(3):
    print("\n" + haikus["text"].iloc[i])

Original Haiku format:
I got pregnant, but. / I don't even wanna think. / About what happened?

With new-lines:

{"haiku": ["I got pregnant, but.", "I don't even wanna think.", "About what happened?"]}

{"haiku": ["Sunrise at the pier.", "Calamari, fishermen.", "Bowing to the sea."]}

{"haiku": ["If she decides to.", "Get back together then you.", "Need a new sister?"]}
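
Before training, it can be worth confirming that every formatted row really is a JSON object with three haiku lines. The check below is a minimal sketch under that assumption; `is_valid_haiku_json` is a hypothetical helper, not something the notebook defines.

```python
import json


def is_valid_haiku_json(record: str) -> bool:
    """Return True if record is a JSON object whose "haiku" key holds three non-empty strings."""
    try:
        parsed = json.loads(record)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    lines = parsed.get("haiku")
    return (
        isinstance(lines, list)
        and len(lines) == 3
        and all(isinstance(line, str) and bool(line.strip()) for line in lines)
    )


sample = '{"haiku": ["Sunrise at the pier.", "Calamari, fishermen.", "Bowing to the sea."]}'
print(is_valid_haiku_json(sample))
```

Applied to the dataframe, something like `haikus[haikus["text"].apply(is_valid_haiku_json)]` would keep only the rows that pass, at the cost of a slightly smaller sample.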


## Good enough, let's train!

Next, we'll train on the sampled haikus: the `keywords` column is the model input and the JSON-formatted `text` column is the target output.
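
The training cell defines a Zephyr-style prompt template with `{keywords}` and `{text}` placeholders. To make the shape of one filled-in example concrete, here is a minimal sketch that expands a shortened version of such a template with plain `str.format`. The helper `render_example` and the keywords `"pier sunrise"` are illustrative assumptions, and this is not necessarily how Ludwig applies templates internally.

```python
# Shortened stand-in for the notebook's template (assumption: same placeholders).
TEMPLATE = (
    "<|system|>\n"
    "You are a haiku writer and always answer with JSON.</s>\n"
    "<|user|>\n"
    "Please write me a haiku about {keywords}. Write this as JSON.</s>\n"
    "<|assistant|>\n"
    "{text}"
)


def render_example(keywords: str, text: str) -> str:
    """Fill the template's placeholders with one row's values."""
    # Only {keywords} and {text} are format fields; braces inside the
    # substituted JSON string are inserted verbatim.
    return TEMPLATE.format(keywords=keywords, text=text)


row_text = '{"haiku": ["Sunrise at the pier.", "Calamari, fishermen.", "Bowing to the sea."]}'
print(render_example("pier sunrise", row_text))
```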

In [18]:
import logging
import os

import yaml

from ludwig.api import LudwigModel

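# NOTE: this template is never referenced by the config below, so training
# uses the raw `keywords` text as the input. To apply it, Ludwig expects the
# template under the config's `prompt.template` key.
prompt_template = """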
<|system|>
You are a haiku writer and respond to all questions with a colorful, poetic haiku
You always output the Haiku as JSON, where there is a key called "haiku" and that is an array of 3 lines of the Haiku.</s>
<|user|>
Please write me a haiku about {keywords}. Write this as JSON.</s>
<|assistant|>
{text}
"""

# Build out the configuration
config = yaml.safe_load(
    """
model_type: llm
base_model: HuggingFaceH4/zephyr-7b-beta

quantization:
  bits: 4

adapter:
  type: lora

input_features:
  - name: keywords
    type: text

output_features:
  - name: text
    type: text

trainer:
  type: finetune
  learning_rate: 0.0003
  batch_size: 2
  gradient_accumulation_steps: 8
  epochs: 5
  learning_rate_scheduler:
    warmup_fraction: 0.01

backend:
  type: local
"""
)

# Define the Ludwig model object that drives model training
model = LudwigModel(config=config, logging_level=logging.INFO)

# Initiate model training
(
    train_stats,  # dictionary containing training statistics
    preprocessed_data,  # tuple of Ludwig Dataset objects of pre-processed training data
    output_directory,  # location of training results stored on disk
) = model.train(
    dataset=haikus
)

# list contents of output directory
print("contents of output directory:", output_directory)
for item in os.listdir(output_directory):
    print("\t", item)

Setting generation max_new_tokens to 16384 to correspond with the max sequence length assigned to the output feature or the global max sequence length. This will ensure that the correct number of tokens are generated at inference time. To override this behavior, set `generation.max_new_tokens` to a different value in your Ludwig config.

╒════════════════════════╕
│ EXPERIMENT DESCRIPTION │
╘════════════════════════╛

╒══════════════════╤══════════════════════════════════════════════════════════════════════════════════════╕
│ Experiment name  │ api_experiment                                                                       │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
│ Model name       │ run                                                                                  │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
│ Output directory │ /home/dave/code/l

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Max length of feature 'keywords': 16 (without start and stop symbols)
Max sequence length is 16 for feature 'keywords'
Loaded HuggingFace implementation of HuggingFaceH4/zephyr-7b-beta tokenizer


Max length of feature 'text': 37 (without start and stop symbols)
Max sequence length is 37 for feature 'text'
Loaded HuggingFace implementation of HuggingFaceH4/zephyr-7b-beta tokenizer


Building dataset: DONE
Writing preprocessed training set cache to /home/dave/code/llama-haiku/5eb9de02b0c511eea60a01fcb3ec9195.training.hdf5
Writing preprocessed validation set cache to /home/dave/code/llama-haiku/5eb9de02b0c511eea60a01fcb3ec9195.validation.hdf5
Writing preprocessed test set cache to /home/dave/code/llama-haiku/5eb9de02b0c511eea60a01fcb3ec9195.test.hdf5
Writing train set metadata to /home/dave/code/llama-haiku/5eb9de02b0c511eea60a01fcb3ec9195.meta.json

Dataset Statistics
╒════════════╤═══════════════╤════════════════════╕
│ Dataset    │   Size (Rows) │ Size (In Memory)   │
╞════════════╪═══════════════╪════════════════════╡
│ Training   │          3500 │ 820.44 Kb          │
├────────────┼───────────────┼────────────────────┤
│ Validation │           500 │ 117.31 Kb          │
├────────────┼───────────────┼────────────────────┤
│ Test       │          1000 │ 234.50 Kb          │
╘════════════╧═══════════════╧════════════════════╛

╒═══════╕
│ MODEL │
╘═══════╛

Loadin

Loading checkpoint shards: 100%|██████████| 8/8 [00:12<00:00,  1.62s/it]


Done.
Loaded HuggingFace implementation of HuggingFaceH4/zephyr-7b-beta tokenizer
Trainable Parameter Summary For Fine-Tuning
Fine-tuning with adapter: lora
trainable params: 3,407,872 || all params: 7,245,139,968 || trainable%: 0.04703666202518836

╒══════════╕
│ TRAINING │
╘══════════╛

Creating fresh model training run.
Training for 8750 step(s), approximately 5 epoch(s).
Early stopping policy: 5 round(s) of evaluation, or 8750 step(s), approximately 5 epoch(s).

Starting with step 0, epoch: 0
Training:  20%|██        | 1750/8750 [04:00<16:23,  7.12it/s, loss=0.231]
Running evaluation for step: 1750, epoch: 1
Evaluation valid: 100%|██████████| 250/250 [00:17<00:00, 14.38it/s]
Input: tonight
Output: < {"haiku": ["I'm like you.", "B to tonight I.", "I's in me."."]}
--------------------
Input: yard wild
Output: < {"haiku": ["Wng man.", "Inildling in the yard.", "Wildflowasparb."]}
--------------------
Input: sapling
Output: {" {" {" {" {"haiku": ["Woonlit.", "A the sapling.",s leaves."
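
The numbers in the training log can be reproduced from the config and the dataset statistics. The sketch below is a worked check, assuming each logged step is one batch of `batch_size` rows (which matches the 1750 steps per epoch shown) and that gradient accumulation multiplies the effective batch size in the usual way.

```python
import math

train_rows = 3500   # "Training" row count from the dataset statistics table
batch_size = 2      # trainer.batch_size
grad_accum = 8      # trainer.gradient_accumulation_steps
epochs = 5          # trainer.epochs

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
effective_batch = batch_size * grad_accum

# LoRA parameter counts as reported in the trainable parameter summary
trainable, total = 3_407_872, 7_245_139_968
trainable_pct = 100 * trainable / total

print(steps_per_epoch, total_steps, effective_batch)  # 1750 8750 16
print(f"{trainable_pct:.4f}%")  # 0.0470%
```

So each epoch is 1750 steps, five epochs give the 8750 steps in the log, and the LoRA adapter trains roughly 0.047% of the model's parameters.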

## Ok, so how'd we do?

In [19]:
df = pd.DataFrame.from_dict({
    "keywords": 
        [
            "icicles",
            "trees",
            "flowers on the sidewalk", 
            "trucks on the highway", 
            "data science gone wrong"
        ]
    })

response = model.predict(df)

response

Loaded HuggingFace implementation of HuggingFaceH4/zephyr-7b-beta tokenizer


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Prediction: 100%|██████████| 1/1 [00:30<00:00, 30.37s/it]
Loaded HuggingFace implementation of HuggingFaceH4/zephyr-7b-beta tokenizer
Finished predicting in: 35.23s.


  return np.sum(np.log(sequence_probabilities))


(                                    text_predictions  \
 0  [, {", ha, iku, ":, [", W, inter, sol, st, ice...   
 1  [, {", ha, iku, ":, [", W, inter, sol, st, ice...   
 2  [, {", ha, iku, ":, [", A, few, flowers, .",, ...   
 3  [, {", ha, iku, ":, [", T, ru, cks, on, the, h...   
 4  [, {", ha, iku, ":, [", Data, science, gone, w...   
 
                                   text_probabilities  \
 0  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   
 1  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   
 2  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   
 3  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   
 4  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   
 
                                        text_response  text_probability  
 0  [{"haiku": ["Winter solstice.", "The icicles o...              -inf  
 1  [{"haiku": ["Winter solstice.", "The trees, st...              -inf  
 2  [{"haiku": ["A few flowers.", "On the sidewalk...              -inf  
 3  [{"haiku": [

In [20]:
answers = response[0]["text_response"]

for a in answers:
    print(a[0] + "\n")

{"haiku": ["Winter solstice.", "The icicles on the roof.", "Grow longer."]}

{"haiku": ["Winter solstice.", "The trees, still standing.", "In the wind."]}

{"haiku": ["A few flowers.", "On the sidewalk.", "In the rain."]}

{"haiku": ["Trucks on the highway.", "Their headlights like stars.", "In the night sky."]}

{"haiku": ["Data science gone wrong.", "I'm not even mad, I'm just.", "Laughing at this shit."]}
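
Because the model answers in JSON, a caller can turn a response back into plain haiku lines. The following is a minimal sketch, assuming each response is a string like the ones printed above; `haiku_lines` is a hypothetical helper that returns `None` when the model produces malformed output, as it did early in training.

```python
import json
from typing import List, Optional


def haiku_lines(response: str) -> Optional[List[str]]:
    """Parse a model response and return the haiku lines, or None if it is not usable."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return None
    lines = parsed.get("haiku") if isinstance(parsed, dict) else None
    return lines if isinstance(lines, list) else None


response_text = '{"haiku": ["A few flowers.", "On the sidewalk.", "In the rain."]}'
lines = haiku_lines(response_text)
if lines is not None:
    print("\n".join(lines))
```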



## Ok, well, we'll leave it at that 

### Now we'll save the model to Hugging Face

In [23]:
!ludwig upload hf_hub -r querri/zephyr-haiku-json -m /home/dave/code/llama-haiku/results/api_experiment_run_5

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Model uploaded to `https://huggingface.co/querri/zephyr-haiku-json/tree/main/` with repository name `querri/zephyr-haiku-json`
