# Train an LLM with custom data

In this notebook we will prepare a training dataset and use it to train our custom model.

You must have `HUGGINGFACE_API_KEY` in a `.env` file for this to work.

First, we'll download a training dataset (this only needs to run once).

In [1]:
import requests
import os
from dotenv import load_dotenv

load_dotenv()

dataset = "statworx/haiku"
headers = {"Authorization": f"Bearer {os.environ.get('HUGGINGFACE_API_KEY')}"}
API_URL = f"https://datasets-server.huggingface.co/parquet?dataset={dataset}"

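def query():
    """Ask the datasets server for the dataset's parquet file listing."""
    response = requests.get(API_URL, headers=headers)
    response.raise_for_status()  # fail loudly on a bad API key or dataset name
    return response.json()

# get the url to the datafile
data = query()
url = data["parquet_files"][0]["url"]

# make sure the target directory exists before writing to it
os.makedirs("data", exist_ok=True)

r = requests.get(url, allow_redirects=True)
r.raise_for_status()
with open('data/haikus.parquet', 'wb') as file:
    file.write(r.content)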

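Since the download only needs to happen once, one option is to skip it when the file is already on disk. This is a minimal sketch under assumptions: `download_once` and `_fetch_bytes` are hypothetical helpers, and the standard library's `urllib` stands in for `requests` so the block has no third-party dependencies.

```python
from pathlib import Path
from urllib.request import urlopen


def _fetch_bytes(url):
    """Hypothetical default fetcher: read the whole response body."""
    with urlopen(url) as response:
        return response.read()


def download_once(url, dest, fetch=_fetch_bytes):
    """Download `url` to `dest` unless `dest` already exists.

    Returns True if a download happened, False if it was skipped.
    """
    dest = Path(dest)
    if dest.exists():
        return False
    # Create the parent directory (e.g. data/) if it is missing.
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(fetch(url))
    return True
```

With the cell above, this could be called as `download_once(url, "data/haikus.parquet")`; on later runs it returns `False` and leaves the existing file untouched.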
## Prepare the data

Next, we'll load the dataset into a pandas DataFrame and prepare a training dataset.

In [2]:
import pandas as pd

haikus = pd.read_parquet("data/haikus.parquet")
haikus

Unnamed: 0,source,text,text_phonemes,keywords,keyword_phonemes,gruen_score,text_punc
0,bfbarry,Delicate savage. / You'll never hold the cinde...,deh|lax|kaxt sae|vaxjh / yuwl neh|ver hhowld d...,cinder,sihn|der,0.639071,
1,bfbarry,A splash and a cry. / Words pulled from the ri...,ax splaesh aend ax kray / werdz puhld frahm dh...,the riverside,dhax rih|ver|sayd,0.563353,
2,bfbarry,"Steamy, mist rising. / Rocks receiving downwar...",stiy|miy mihst ray|zaxng / raaks rax|siy|vaxng...,mist rising,mihst ray|zaxng,0.538326,
3,bfbarry,You were broken glass. / But I touched you eve...,yuw wer brow|kaxn glaes / baht ay tahcht yuw i...,broken glass,brow|kaxn glaes,0.703446,
4,bfbarry,Eyes dance with firelight. / The Moon and I ar...,ayz daens wihdh faxr|layt / dhax muwn aend ay ...,eyes dance,ayz daens,0.830985,
...,...,...,...,...,...,...,...
49019,haiku_data_2,Alpine Lake. / Mybreaststrokesshiningarc. / To...,ael|payn leyk mih|brehst|strow|kehsh|hhax|nihn...,toward sunrise,tax|waord sahn|rayz,0.685355,Alpine Lake. Mybreaststrokesshiningarc. Toward...
49020,haiku_data_2,Spruce Woods. / Fireweed filling. / The vacancy.,spruws wuhdz fay|er|wiyd fih|laxng dhax vey|ka...,woods,wuhdz,0.568974,Spruce Woods. Fireweed filling. The vacancy.
49021,haiku_data_2,Corrugated sun. / Chilies and laundry. / In ro...,kao|rax|gey|taxd sahn chih|liyz aend laon|driy...,sun chilies,sahn chih|liyz,0.551056,Corrugated sun. Chilies and laundry. In roofto...
49022,haiku_data_2,Home from war. / We ease out. / The champagne ...,hhowm frahm waor wiy iyz awt dhax shaxm|peyn k...,home,hhowm,0.697112,Home from war. We ease out. The champagne corks.


In [3]:
# Let's take just a random sample of 100 of these.
haikus_text = haikus["text"].sample(100)

# print the first haiku
print("Original Haiku format:")
print(haikus_text.iloc[0])

# Add newlines
haikus_text = haikus_text.str.replace(" / ", "\n")
print("\nWith new-lines:")
print(haikus_text.iloc[0])

# Look at 4 more of them
for i in range(1, 5):
    print("\n" + haikus_text.iloc[i])

Original Haiku format:
Starry night. / The queue for skates. / Moves slowly.

With new-lines:
Starry night.
The queue for skates.
Moves slowly.

Snowed in.
Fire wraps.
Around a log.

Last rays of sun.
Crows suddenly.
Goldwinged.

Fireside.
A piece of jigsaw.
Slips into place.

We may start seeing.
A lot of those walking boots.
On people soon, lol.


## This time we'll keep them all

In [8]:
# See if we can get some consistent formatting
haikus["text"] = haikus["text"].str.replace(" / ", "\n")

for i in range(3):
    print(haikus.iloc[i]["text"] + "\n")

Delicate savage.
You'll never hold the cinder.
But still you will burn.

A splash and a cry.
Words pulled from the riverside.
Dryed in the hot sun.

Steamy, mist rising.
Rocks receiving downward crash.
As the jungle weeps.

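The cell above only eyeballs three rows to judge the formatting. A broader check, sketched here as an assumption rather than something this notebook does, is to count how many haikus split into exactly three non-empty lines. `to_lines` and `is_three_lines` are hypothetical helpers that work on the raw ` / `-separated format; the first two sample strings come from the outputs above and the third is a made-up counterexample.

```python
def to_lines(raw):
    """Split a raw haiku on the ' / ' separator used in the dataset."""
    return [part.strip() for part in raw.split(" / ")]


def is_three_lines(raw):
    """True when the haiku has exactly three non-empty lines."""
    lines = to_lines(raw)
    return len(lines) == 3 and all(lines)


samples = [
    "Starry night. / The queue for skates. / Moves slowly.",
    "Spruce Woods. / Fireweed filling. / The vacancy.",
    "Only one line here.",  # made-up counterexample, not from the dataset
]

kept = [s for s in samples if is_three_lines(s)]
print(f"{len(kept)} of {len(samples)} samples have three lines")
```

The same predicate could be applied to the DataFrame column with `haikus["text"].map(is_three_lines)` before the separators are replaced.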


## Good enough, let's train!

Next, we'll use the `text` and `keywords` columns as our training dataset: `keywords` is the input feature and `text` is the haiku the model should learn to generate.

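To make the pairing of the two columns concrete, here is a minimal sketch of how one row's `keywords` and `text` values would fill a Zephyr-style chat prompt using plain `str.format`. The template mirrors the one defined in the next cell and the row is the first haiku printed above; `render_example` is a hypothetical helper, and Ludwig does its own prompt handling, so this only illustrates the string substitution.

```python
# Same placeholders as the notebook's template: {keywords} and {text}.
prompt_template = """<|system|>
You are a haiku writer and respond to all questions with a colorful, poetic haiku</s>
<|user|>
Please write me a haiku about {keywords}</s>
<|assistant|>
{text}"""


def render_example(row):
    """Fill the template from one row (a dict with 'keywords' and 'text')."""
    return prompt_template.format(keywords=row["keywords"], text=row["text"])


row = {
    "keywords": "cinder",
    "text": "Delicate savage.\nYou'll never hold the cinder.\nBut still you will burn.",
}

print(render_example(row))
```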
In [9]:
import logging
import os

import yaml

from ludwig.api import LudwigModel

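# NOTE: this template is defined here but is not referenced by the config below,
# so as written it has no effect on training.
prompt_template = """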
<|system|>
You are a haiku writer and respond to all questions with a colorful, poetic haiku</s>
<|user|>
Please write me a haiku about {keywords}</s>
<|assistant|>
{text}
"""

# Build out the configuration
config = yaml.safe_load(
    """
model_type: llm
base_model: HuggingFaceH4/zephyr-7b-beta

quantization:
  bits: 4

adapter:
  type: lora

input_features:
  - name: keywords
    type: text

output_features:
  - name: text
    type: text

trainer:
    type: finetune
    learning_rate: 0.0003
    batch_size: 2
    gradient_accumulation_steps: 8
    epochs: 3
    learning_rate_scheduler:
      warmup_fraction: 0.01

backend:
  type: local
"""
)

# Define the Ludwig model object that drives model training
model = LudwigModel(config=config, logging_level=logging.INFO)

# initiate model training
(
    train_stats,  # dictionary containing training statistics
    preprocessed_data,  # tuple of Ludwig Dataset objects holding the pre-processed data
    output_directory,  # location of training results stored on disk
) = model.train(
    dataset=haikus  # the full DataFrame prepared above
)

# list contents of output directory
print("contents of output directory:", output_directory)
for item in os.listdir(output_directory):
    print("\t", item)

PyTorch version 2.1.2 available.



Setting generation max_new_tokens to 16384 to correspond with the max sequence length assigned to the output feature or the global max sequence length. This will ensure that the correct number of tokens are generated at inference time. To override this behavior, set `generation.max_new_tokens` to a different value in your Ludwig config.

╒════════════════════════╕
│ EXPERIMENT DESCRIPTION │
╘════════════════════════╛

╒══════════════════╤══════════════════════════════════════════════════════════════════════════════════════╕
│ Experiment name  │ api_experiment                                                                       │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
│ Model name       │ run                                                                                  │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
│ Output directory │ /home/dave/code/l

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Max length of feature 'keywords': 21 (without start and stop symbols)
Max sequence length is 21 for feature 'keywords'
Loaded HuggingFace implementation of HuggingFaceH4/zephyr-7b-beta tokenizer



Max length of feature 'text': 46 (without start and stop symbols)
Max sequence length is 46 for feature 'text'
Loaded HuggingFace implementation of HuggingFaceH4/zephyr-7b-beta tokenizer



Building dataset: DONE
Writing preprocessed training set cache to /home/dave/code/llama-haiku/d9b50b24af4c11eea87be55baf3cf3ab.training.hdf5
Writing preprocessed validation set cache to /home/dave/code/llama-haiku/d9b50b24af4c11eea87be55baf3cf3ab.validation.hdf5
Writing preprocessed test set cache to /home/dave/code/llama-haiku/d9b50b24af4c11eea87be55baf3cf3ab.test.hdf5
Writing train set metadata to /home/dave/code/llama-haiku/d9b50b24af4c11eea87be55baf3cf3ab.meta.json

Dataset Statistics
╒════════════╤═══════════════╤════════════════════╕
│ Dataset    │   Size (Rows) │ Size (In Memory)   │
╞════════════╪═══════════════╪════════════════════╡
│ Training   │         34317 │ 7.85 Mb            │
├────────────┼───────────────┼────────────────────┤
│ Validation │          4902 │ 1.12 Mb            │
├────────────┼───────────────┼────────────────────┤
│ Test       │          9805 │ 2.24 Mb            │
╘════════════╧═══════════════╧════════════════════╛

╒═══════╕
│ MODEL │
╘═══════╛

Loadin

Loading checkpoint shards: 100%|██████████| 8/8 [00:06<00:00,  1.26it/s]


Done.
Loaded HuggingFace implementation of HuggingFaceH4/zephyr-7b-beta tokenizer
Trainable Parameter Summary For Fine-Tuning
Fine-tuning with adapter: lora
trainable params: 3,407,872 || all params: 7,245,139,968 || trainable%: 0.04703666202518836

╒══════════╕
│ TRAINING │
╘══════════╛

Creating fresh model training run.
Training for 51477 step(s), approximately 3 epoch(s).
Early stopping policy: 5 round(s) of evaluation, or 85795 step(s), approximately 5 epoch(s).

Starting with step 0, epoch: 0
Training:  33%|███▎      | 17158/51477 [39:14<1:17:05,  7.42it/s, loss=0.205]
Last batch in epoch only has 1 sample and will be dropped.
Training:  33%|███▎      | 17159/51477 [39:15<1:37:00,  5.90it/s, loss=0.225]
Running evaluation for step: 17159, epoch: 1
Evaluation valid: 100%|██████████| 2451/2451 [03:08<00:00, 13.00it/s]
Input: room different
Output: < I' in in.
A room 

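The step counts in the training log can be reproduced from the dataset statistics and the trainer settings. The sketch below is plain arithmetic on numbers copied from the output and config above. That one training step corresponds to one batch (rather than one accumulated gradient update) is an inference from the log, not something taken from Ludwig's documentation.

```python
import math

# Numbers copied from the log output and the config above.
train_rows, valid_rows, test_rows = 34317, 4902, 9805
batch_size = 2
gradient_accumulation_steps = 8
epochs = 3
early_stop_rounds = 5  # "5 round(s) of evaluation" in the log

# Share of rows in each split.
total_rows = train_rows + valid_rows + test_rows
split = [round(n / total_rows, 2) for n in (train_rows, valid_rows, test_rows)]

# One step per batch; the odd row count leaves a final 1-sample batch.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
early_stop_steps = steps_per_epoch * early_stop_rounds
eval_batches = math.ceil(valid_rows / batch_size)

# Samples contributing to each optimizer update under gradient accumulation.
effective_batch_size = batch_size * gradient_accumulation_steps

# Share of weights the LoRA adapter actually trains.
trainable_pct = 100 * 3_407_872 / 7_245_139_968

print(split)                    # [0.7, 0.1, 0.2]
print(steps_per_epoch)          # 17159
print(total_steps)              # 51477
print(early_stop_steps)         # 85795
print(eval_batches)             # 2451
print(effective_batch_size)     # 16
print(round(trainable_pct, 4))  # 0.047
```

These match the log: evaluation runs at step 17159, training is scheduled for 51477 steps, and the validation pass covers 2451 batches.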
## Ok, so how'd we do?

In [13]:
df = pd.DataFrame(
    {
        "keywords": [
            "icicles",
            "trees",
            "flowers on the sidewalk",
            "trucks on the highway",
            "data science gone wrong",
        ]
    }
)

# predict() returns a tuple: (predictions DataFrame, output directory)
response = model.predict(df)

response

Loaded HuggingFace implementation of HuggingFaceH4/zephyr-7b-beta tokenizer


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Prediction: 100%|██████████| 1/1 [00:06<00:00,  6.04s/it]
Loaded HuggingFace implementation of HuggingFaceH4/zephyr-7b-beta tokenizer
Finished predicting in: 7.68s.


  return np.sum(np.log(sequence_probabilities))


(                                    text_predictions  \
 0  [, I, ', m, not, a, flower, ., /, I, ', m, not...   
 1  [, I, ', m, not, gonna, lie, ., /, I, ', m, re...   
 2  [, I, ', m, really, good, at, ., /, Data, anal...   
 
                                   text_probabilities  \
 0  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   
 1  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   
 2  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   
 
                                        text_response  text_probability  
 0  [I'm not a flower. / I'm not a flower, I'm not...              -inf  
 1  [I'm not gonna lie. / I'm really looking forwa...              -inf  
 2  [I'm really good at. / Data analysis, but I'm....              -inf  ,
 'results')

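The `-inf` values in `text_probability` are consistent with the `np.sum(np.log(sequence_probabilities))` line that surfaces in the output: the displayed per-token probabilities are all 0.0, and the log of zero is negative infinity. A pure-Python stand-in shows the effect; it is an illustration only, not Ludwig's code, and `sequence_log_probability` is a hypothetical name.

```python
import math


def sequence_log_probability(probabilities):
    """Sum of log-probabilities, treating log(0) as -inf the way NumPy does."""
    return sum(math.log(p) if p > 0 else -math.inf for p in probabilities)


# Non-zero probabilities give a finite (negative) log-probability.
print(sequence_log_probability([0.5, 0.25]))  # about -2.079

# A single zero is enough to drive the whole sum to -inf.
print(sequence_log_probability([0.5, 0.0]))   # -inf
```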
In [21]:
answers = response[0]["text_response"]

for a in answers:
    print(a[0])

I'm not a flower. / I'm not a flower, I'm not. / A flower, I'm not.
I'm not gonna lie. / I'm really looking forward. / To the trucks tonight.
I'm really good at. / Data analysis, but I'm. / Not good at math.
