# Train an LLM with custom data

In this notebook we will prepare a training dataset and use it to train our custom model.

You must have `HUGGINGFACE_API_KEY` in a `.env` file for this to work.

First, we'll download a training dataset (this only needs to run the first time).

In [1]:
import requests
import os
from dotenv import load_dotenv

load_dotenv()

dataset = "statworx/haiku"
headers = {"Authorization": f"Bearer {os.environ.get('HUGGINGFACE_API_KEY')}"}
API_URL = f"https://datasets-server.huggingface.co/parquet?dataset={dataset}"

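def query():
    response = requests.get(API_URL, headers=headers)
    response.raise_for_status()  # fail early on an auth or dataset error
    return response.json()

# get the url to the datafile
data = query()
url = data["parquet_files"][0]["url"]

# make sure the target directory exists before writing to it
os.makedirs("data", exist_ok=True)

r = requests.get(url, allow_redirects=True)
r.raise_for_status()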
with open('data/haikus.parquet', 'wb') as file:
    file.write(r.content)
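
Since the download only needs to happen once, the cell could be guarded so that re-running the notebook skips it. The following is a minimal sketch, assuming a hypothetical helper `download_if_missing` (not part of the original notebook) that takes the target path and a zero-argument callable returning the file bytes:

```python
import os
from typing import Callable


def download_if_missing(path: str, fetch: Callable[[], bytes]) -> bool:
    """Write fetch() to path unless the file already exists.

    Returns True if a download happened, False if the cached file was reused.
    Hypothetical helper for illustration; not part of the original notebook.
    """
    if os.path.exists(path):
        return False
    # Create the parent directory (e.g. "data/") if it is not there yet.
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "wb") as file:
        file.write(fetch())
    return True
```

In this notebook it might be called as `download_if_missing("data/haikus.parquet", lambda: requests.get(url, allow_redirects=True).content)`, so the network request is only made when the file is absent.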

## Prepare the data

Next, we'll load the dataset into a pandas DataFrame and prepare a training dataset.

In [2]:
import pandas as pd

haikus = pd.read_parquet("data/haikus.parquet")
haikus

Unnamed: 0,source,text,text_phonemes,keywords,keyword_phonemes,gruen_score,text_punc
0,bfbarry,Delicate savage. / You'll never hold the cinde...,deh|lax|kaxt sae|vaxjh / yuwl neh|ver hhowld d...,cinder,sihn|der,0.639071,
1,bfbarry,A splash and a cry. / Words pulled from the ri...,ax splaesh aend ax kray / werdz puhld frahm dh...,the riverside,dhax rih|ver|sayd,0.563353,
2,bfbarry,"Steamy, mist rising. / Rocks receiving downwar...",stiy|miy mihst ray|zaxng / raaks rax|siy|vaxng...,mist rising,mihst ray|zaxng,0.538326,
3,bfbarry,You were broken glass. / But I touched you eve...,yuw wer brow|kaxn glaes / baht ay tahcht yuw i...,broken glass,brow|kaxn glaes,0.703446,
4,bfbarry,Eyes dance with firelight. / The Moon and I ar...,ayz daens wihdh faxr|layt / dhax muwn aend ay ...,eyes dance,ayz daens,0.830985,
...,...,...,...,...,...,...,...
49019,haiku_data_2,Alpine Lake. / Mybreaststrokesshiningarc. / To...,ael|payn leyk mih|brehst|strow|kehsh|hhax|nihn...,toward sunrise,tax|waord sahn|rayz,0.685355,Alpine Lake. Mybreaststrokesshiningarc. Toward...
49020,haiku_data_2,Spruce Woods. / Fireweed filling. / The vacancy.,spruws wuhdz fay|er|wiyd fih|laxng dhax vey|ka...,woods,wuhdz,0.568974,Spruce Woods. Fireweed filling. The vacancy.
49021,haiku_data_2,Corrugated sun. / Chilies and laundry. / In ro...,kao|rax|gey|taxd sahn chih|liyz aend laon|driy...,sun chilies,sahn chih|liyz,0.551056,Corrugated sun. Chilies and laundry. In roofto...
49022,haiku_data_2,Home from war. / We ease out. / The champagne ...,hhowm frahm waor wiy iyz awt dhax shaxm|peyn k...,home,hhowm,0.697112,Home from war. We ease out. The champagne corks.


In [3]:
# Let's take just a random sample of 100 of these.
haikus_text = haikus["text"].sample(100)

# print the first haiku
print("Original Haiku format:")
print(haikus_text.iloc[0])

# Add newlines
haikus_text = haikus_text.str.replace(" / ", "\n")
print("\nWith new-lines:")
print(haikus_text.iloc[0])

# Look at 4 more of them
for i in range(1, 5):
    print("\n" + haikus_text.iloc[i])

Original Haiku format:
She wants him to change. / He wants her the way she was. / Their love fades away.

With new-lines:
She wants him to change.
He wants her the way she was.
Their love fades away.

The swell.
Before the river splits.
In another name.

Just remembered that.
David's pumpkins exists.
So that's positive.

A growing pressure.
My sphincter is pulsating.
I need to go, Poop.

For anyone that.
Can see through the illusions.
It's easy to see.


## This time we'll keep all the columns

In [4]:
# See if we can get some consistent formatting
haikus["text"] = haikus["text"].str.replace(" / ", "\n")

# NOTE: Take a random sample of 1000 rows so training runs faster
haikus = haikus.sample(1000)

for i in range(3):
    print(haikus.iloc[i]["text"] + "\n")

The heart and vases.
It's as if it was a rule.
Made to be broken.

Thunder rolls over.
You can't quell your beating heart.
Always who dares wins.

Why do teachers feel?
The need to assign so much.
Work before finals.
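
One way to check whether the formatting really is consistent would be to count the lines in each haiku. The sketch below uses plain Python; `to_lines` and `is_three_line_haiku` are hypothetical helpers, and the second sample is deliberately cut short to show a failing case:

```python
def to_lines(haiku: str) -> list[str]:
    """Split a haiku on ' / ' or newlines into stripped, non-empty lines."""
    normalized = haiku.replace(" / ", "\n")
    return [line.strip() for line in normalized.split("\n") if line.strip()]


def is_three_line_haiku(haiku: str) -> bool:
    """True when the haiku splits into exactly three lines."""
    return len(to_lines(haiku)) == 3


samples = [
    "The heart and vases. / It's as if it was a rule. / Made to be broken.",
    "Thunder rolls over. / You can't quell your beating heart.",  # deliberately truncated
]
print([is_three_line_haiku(s) for s in samples])  # [True, False]
```

On the dataframe, the same check could be applied with `haikus["text"].map(is_three_line_haiku)` to find rows that do not fit the three-line shape.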



## Good enough, let's train!

Next, we'll use the "text" and "keywords" columns of this sampled set of haikus to build the training dataset for our model.
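
To see how a keywords/text pair maps onto a chat-style prompt, the template from the training cell below can be filled in with plain `str.format`. This is only a sketch of the string substitution: `render_example` is a hypothetical helper, the sample row is taken from the dataframe preview above, and the config in the training cell does not reference `prompt_template`, so this is not necessarily what Ludwig feeds the model.

```python
# Same template text as in the training cell (Zephyr-style chat markers).
prompt_template = """
<|system|>
You are a haiku writer and respond to all questions with a colorful, poetic haiku</s>
<|user|>
Please write me a haiku about {keywords}</s>
<|assistant|>
{text}
"""


def render_example(keywords: str, text: str) -> str:
    """Fill the template with one row's keywords and haiku text."""
    return prompt_template.format(keywords=keywords, text=text).strip()


example = render_example("woods", "Spruce Woods.\nFireweed filling.\nThe vacancy.")
print(example)
```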

In [5]:
import logging
import os

import yaml

from ludwig.api import LudwigModel

# NOTE: this template is defined here but is not referenced by the config below
prompt_template = """
<|system|>
You are a haiku writer and respond to all questions with a colorful, poetic haiku</s>
<|user|>
Please write me a haiku about {keywords}</s>
<|assistant|>
{text}
"""

# Build out the configuration
config = yaml.safe_load(
    """
model_type: llm
base_model: HuggingFaceH4/zephyr-7b-beta

quantization:
  bits: 4

adapter:
  type: lora

input_features:
  - name: keywords
    type: text

output_features:
  - name: text
    type: text

trainer:
  type: finetune
  learning_rate: 0.0003
  batch_size: 2
  gradient_accumulation_steps: 8
  epochs: 3
  learning_rate_scheduler:
    warmup_fraction: 0.01

backend:
  type: local
"""
)

# Define the Ludwig model object that drives model training
model = LudwigModel(config=config, logging_level=logging.INFO)

# initiate model training
(
    train_stats,  # dictionary containing training statistics
    preprocessed_data,  # tuple of Ludwig Dataset objects of pre-processed training data
    output_directory,  # location of training results stored on disk
) = model.train(
    dataset=haikus
)

# list contents of output directory
print("contents of output directory:", output_directory)
for item in os.listdir(output_directory):
    print("\t", item)

PyTorch version 2.1.2 available.


Setting generation max_new_tokens to 16384 to correspond with the max sequence length assigned to the output feature or the global max sequence length. This will ensure that the correct number of tokens are generated at inference time. To override this behavior, set `generation.max_new_tokens` to a different value in your Ludwig config.

╒════════════════════════╕
│ EXPERIMENT DESCRIPTION │
╘════════════════════════╛

╒══════════════════╤══════════════════════════════════════════════════════════════════════════════════════╕
│ Experiment name  │ api_experiment                                                                       │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
│ Model name       │ run                                                                                  │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
│ Output directory │ /home/dave/code/l

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Max length of feature 'keywords': 9 (without start and stop symbols)
Max sequence length is 9 for feature 'keywords'
Loaded HuggingFace implementation of HuggingFaceH4/zephyr-7b-beta tokenizer


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Max length of feature 'text': 31 (without start and stop symbols)
Max sequence length is 31 for feature 'text'
Loaded HuggingFace implementation of HuggingFaceH4/zephyr-7b-beta tokenizer


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Loaded HuggingFace implementation of HuggingFaceH4/zephyr-7b-beta tokenizer


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Building dataset: DONE
Writing preprocessed training set cache to /home/dave/code/llama-haiku/856785e4b0c111eea60a01fcb3ec9195.training.hdf5
Writing preprocessed validation set cache to /home/dave/code/llama-haiku/856785e4b0c111eea60a01fcb3ec9195.validation.hdf5
Writing preprocessed test set cache to /home/dave/code/llama-haiku/856785e4b0c111eea60a01fcb3ec9195.test.hdf5
Writing train set metadata to /home/dave/code/llama-haiku/856785e4b0c111eea60a01fcb3ec9195.meta.json

Dataset Statistics
╒════════════╤═══════════════╤════════════════════╕
│ Dataset    │   Size (Rows) │ Size (In Memory)   │
╞════════════╪═══════════════╪════════════════════╡
│ Training   │           700 │ 164.19 Kb          │
├────────────┼───────────────┼────────────────────┤
│ Validation │           100 │ 23.56 Kb           │
├────────────┼───────────────┼────────────────────┤
│ Test       │           200 │ 47.00 Kb           │
╘════════════╧═══════════════╧════════════════════╛

╒═══════╕
│ MODEL │
╘═══════╛

Loading checkpoint shards: 100%|██████████| 8/8 [00:04<00:00,  1.98it/s]

Done.





Loaded HuggingFace implementation of HuggingFaceH4/zephyr-7b-beta tokenizer
Trainable Parameter Summary For Fine-Tuning
Fine-tuning with adapter: lora
trainable params: 3,407,872 || all params: 7,245,139,968 || trainable%: 0.04703666202518836

╒══════════╕
│ TRAINING │
╘══════════╛

Creating fresh model training run.
Training for 1050 step(s), approximately 3 epoch(s).
Early stopping policy: 5 round(s) of evaluation, or 1750 step(s), approximately 5 epoch(s).

Starting with step 0, epoch: 0
Training:  33%|███▎      | 350/1050 [00:46<01:39,  7.07it/s, loss=0.279]
Running evaluation for step: 350, epoch: 1
Evaluation valid: 100%|██████████| 50/50 [00:03<00:00, 15.18it/s]
Input: get downed
Output: The The do we a
A get get to when.
Get who get downed.
--------------------
Input: fight
Output: < I give a fight.
With someone old person..re.
Got nothing to lose.
--------------------
Input: nails
Output: nails's nails.
I'ill my
The can of nails.
--------------------
Input: Larry
Output: <. <.
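
The step counts reported in the training log can be reproduced from the dataset sizes and the trainer settings. The following is a small arithmetic sketch, assuming one training step per batch and one evaluation round per epoch (which matches the evaluation at step 350 above); the variable names are illustrative only:

```python
import math

train_rows, val_rows = 700, 100   # from the "Dataset Statistics" table
batch_size, epochs = 2, 3         # from the trainer config
accumulation_steps = 8            # gradient_accumulation_steps in the config
early_stop_rounds = 5             # from the "Early stopping policy" log line

steps_per_epoch = math.ceil(train_rows / batch_size)    # 350
total_steps = steps_per_epoch * epochs                   # 1050
early_stop_steps = steps_per_epoch * early_stop_rounds   # 1750
eval_batches = math.ceil(val_rows / batch_size)          # 50

# Assumption: gradients from 8 consecutive batches are combined per weight update.
rows_per_update = batch_size * accumulation_steps        # 16

print(steps_per_epoch, total_steps, early_stop_steps, eval_batches, rows_per_update)
```

These values line up with "Training for 1050 step(s)", "1750 step(s), approximately 5 epoch(s)", and the 50 evaluation batches shown in the log.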

## Ok, so how'd we do?

In [6]:
df = pd.DataFrame.from_dict({
    "keywords": 
        [
            "icicles",
            "trees",
            "flowers on the sidewalk", 
            "trucks on the highway", 
            "data science gone wrong"
        ]
    })

response = model.predict(df)

response

Loaded HuggingFace implementation of HuggingFaceH4/zephyr-7b-beta tokenizer


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Prediction: 100%|██████████| 1/1 [00:11<00:00, 11.23s/it]
Loaded HuggingFace implementation of HuggingFaceH4/zephyr-7b-beta tokenizer
Finished predicting in: 13.18s.


  return np.sum(np.log(sequence_probabilities))


(                                    text_predictions  \
 0  [, I, c, icles, ., \n, The, sun, ., \n, On, th...   
 1  [, The, trees, are, ., \n, So, tall, ,, I, can...   
 2  [, I, ', m, walking, ., \n, On, the, sidewalk,...   
 3  [, I, ', m, glad, I, ', m, not, ., \n, A, truc...   
 4  [, I, ', m, not, a, data, ., \n, S, ci, ence, ...   
 
                                   text_probabilities  \
 0  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   
 1  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   
 2  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   
 3  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   
 4  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   
 
                                        text_response  text_probability  
 0               [Icicles.\nThe sun.\nOn the window.]              -inf  
 1  [The trees are.\nSo tall, I can't see.\nThe to...              -inf  
 2  [I'm walking.\nOn the sidewalk and I see.\nA f...              -inf  
 3  [I'm glad I'

In [7]:
answers = response[0]["text_response"]

for a in answers:
    print(a[0] + "\n")

Icicles.
The sun.
On the window.

The trees are.
So tall, I can't see.
The top of them.

I'm walking.
On the sidewalk and I see.
A flower on the ground.

I'm glad I'm not.
A truck driver on the highway.
I'm glad I'm not.

I'm not a data.
Science guy, but I'm pretty.
Sure this is going wrong.



## Ok, well, we'll leave it at that

### Now we'll save the model to Hugging Face

In [9]:
!ludwig upload hf_hub -r querri/zephyr-haiku -m /home/dave/code/llama-haiku/results/api_experiment_run_0

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Model uploaded to `https://huggingface.co/querri/zephyr-haiku/tree/main/` with repository name `querri/zephyr-haiku`
