# **Jupyter Notebook: Zephyr-7B-Beta LLM Fine-Tuning with Predibase on Gridspace-Stanford Harper Valley (GSHV) Dataset**

This quickstart shows you how to prompt, fine-tune, and deploy LLMs in Predibase. We'll follow a text classification use case: the end result is a fine-tuned Zephyr-7B-Beta model that takes a voice call transcript as input and returns the call's task type as output.

In [None]:
# pprint ships with the Python standard library, so only these two packages need installing.
! pip install -U predibase
! pip install -U tqdm

# **Setup**

You'll first need to initialize your PredibaseClient object and configure your API token.

To get your Predibase API token, sign up for a free trial at [predibase.com](https://predibase.com/free-trial).

In [None]:
from predibase import PredibaseClient
from pprint import pprint

In [None]:
my_api_token: str = "<Your token here>"

In [None]:
pc: PredibaseClient = PredibaseClient(token=my_api_token)

In [None]:
pc.list_llm_deployments(active_only=True, print_as_table=True)

# **Prompt a deployed LLM**

For our classification use case, let's first see how the base Zephyr-7B-Beta model performs out of the box.

The cells below deploy the base model (if it is not already deployed) and then get a handle to the deployment we intend to query. If you are in the Predibase SaaS environment, you also have a few shared LLMs available to you, including Llama 2 7B. If you are in a VPC environment, you'll need to deploy an LLM before you can query it.

In [None]:
from predibase.resource.llm.interface import HuggingFaceLLM, LLMDeployment

In [None]:
from predibase.pql.api import ServerResponseError

In [None]:
pb_model_deployment_name: str = "zephyr-7b-beta"

In [None]:
try:
    llm = pc.LLM(uri="hf://HuggingFaceH4/zephyr-7b-beta")
    base_llm_deployment = llm.deploy(deployment_name=pb_model_deployment_name, engine_template="llm-gpu-small").get()
except ServerResponseError:
    print(f'\n[WARNING] DEPLOYMENT_EXISTS:\n{pb_model_deployment_name}')

In [None]:
base_llm_deployment = pc.LLM(uri=f"pb://deployments/{pb_model_deployment_name}")

In [None]:
assert base_llm_deployment.name == pb_model_deployment_name

In [None]:
base_llm_deployment.wait_for_ready()

In [None]:
print(base_llm_deployment.default_prompt_template)

In [None]:
# Define the template used to prompt the model for each example
# Note the 4-space indentation, which is necessary for the YAML templating.
prompt_template: str = """
    Consider the case of a customer contacting the support center.
    The term "task type" refers to the reason for why the customer contacted support.

    ### The possible task types are: ###
    - replace card
    - transfer money
    - check balance
    - order checks
    - pay bill
    - reset password
    - schedule appointment
    - get branch hours
    - none of the above

    Summarize the issue/question/reason that drove the customer to contact support:

    ### Transcript: {transcript}

    ### Task Type:
"""

In [None]:
target: str = "task_type"

In [None]:
config: dict = llm.get_finetune_templates().default.to_config(prompt_template=prompt_template, target=target)
pprint(config)

In [None]:
test_transcript: str = """
<caller> hello <agent> hello this is [unintelligible] national bank my name is jennifer <agent> how can i help you today <caller> hi my name is james william <caller> i lost my debit card <caller> can you send me a new one <agent> yes <agent> uh which card or would you like to replace <caller> my debit card <agent> okay i've ordered your replacement debit card is there anything else i can help you with today <caller> no that's gonna be all for me today <agent> [noise] <agent> alright thank you for calling have a great day <caller> you too bye <agent> [noise] <agent> [noise]
"""

In [None]:
test_prompt: str = prompt_template.format(transcript=test_transcript)
print(test_prompt)
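
Because the template is filled with `str.format`, every `{placeholder}` in it must correspond to a value you supply (and, for fine-tuning, to a dataset column). The following is a minimal sketch of how one might check this up front using only the standard library. The helper names `template_fields` and `missing_columns`, and the shortened example template, are illustrative assumptions and are not part of the Predibase SDK.

```python
from string import Formatter


def template_fields(template: str) -> set:
    """Return the named placeholders that str.format expects in `template`."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text, so those entries are skipped.
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_columns(template: str, columns: list) -> set:
    """Return placeholders in `template` that have no matching column name."""
    return template_fields(template) - set(columns)


# Shortened stand-in for the notebook's prompt template (assumption for illustration).
example_template = "### Transcript: {transcript}\n\n### Task Type:"

print(template_fields(example_template))
print(missing_columns(example_template, ["transcript", "task_type"]))
```

An empty result from `missing_columns` suggests that the template can be formatted from the given columns without raising a `KeyError`.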

In [None]:
result = base_llm_deployment.prompt(
    data=test_prompt,
    temperature=0.1,
    max_new_tokens=256,
    bypass_system_prompt=False,
)

In [None]:
print(f'\n[GENERATED_TEXT] BASE_MODEL_PREDICTION:\n{result.generated_text} ; TYPE: {str(type(result.generated_text))}')
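
A base model often answers with a full sentence rather than one of the listed labels, which makes its output hard to compare with the `task_type` column. One possible way to handle this is to map the generated text onto the closed label set from the prompt. The helper `normalize_task_type` below is a hypothetical sketch and not part of the Predibase SDK; it simply picks the label that is mentioned earliest in the text.

```python
# The label set is copied from the prompt template above.
TASK_TYPES = [
    "replace card",
    "transfer money",
    "check balance",
    "order checks",
    "pay bill",
    "reset password",
    "schedule appointment",
    "get branch hours",
    "none of the above",
]


def normalize_task_type(generated_text: str) -> str:
    """Map free-form model output to a known task type (hypothetical helper)."""
    text = generated_text.lower()
    # Record the position of every label that occurs in the text.
    hits = [(text.find(label), label) for label in TASK_TYPES if label in text]
    # Choose the earliest mention; fall back to the catch-all label.
    return min(hits)[1] if hits else "none of the above"


print(normalize_task_type("The task type is Replace Card."))
```

In the notebook, this could be applied to `result.generated_text` before comparing against the expected label.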

# **Fine-tune a pretrained LLM**

Next we'll upload a dataset and fine-tune to see if we can get better performance.

The [Gridspace-Stanford Harper Valley (GSHV)](https://github.com/cricketclub/gridspace-stanford-harper-valley) dataset contains transcribed customer service voice calls. Here we use it to fine-tune an LLM to identify the reason for contact ("task type", or "contact reason") from a call transcript. The preprocessed dataset consists of the following columns:

- `transcript` that contains the conversation between the caller and the agent
- the discerned `task_type`


## **Preprocess the Dataset**

This flow assumes that you have copied the dataset into your Google Drive and mounted it under `/content/drive/MyDrive/GridspaceStanfordHarperValley`. The original dataset can be found [here](https://github.com/cricketclub/gridspace-stanford-harper-valley/).

In [None]:
import os
import json

import numpy as np
import pandas as pd

import matplotlib

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
os.listdir('/content/drive/MyDrive/GridspaceStanfordHarperValley')

In [None]:
transcript_files_path: str = "/content/drive/MyDrive/GridspaceStanfordHarperValley/data/transcript"
metadata_files_path: str = "/content/drive/MyDrive/GridspaceStanfordHarperValley/data/metadata"

In [None]:
file_names: list[str] = os.listdir(transcript_files_path)

In [None]:
len(file_names)

In [None]:
import tqdm

In [None]:
all_transcripts = []
all_task_types = []
for file_name in tqdm.tqdm(file_names):
    metadata_file_path = os.path.join(metadata_files_path, file_name)
    try:
        with open(metadata_file_path, 'r') as metadata_file:
            metadata = json.load(metadata_file)
            task_type = metadata["tasks"][0]["task_type"]
            all_task_types.append(task_type)
    except FileNotFoundError:
        continue

    transcript_file_path = os.path.join(transcript_files_path, file_name)
    with open(transcript_file_path, 'r') as transcript_file:
        conversation_turns = json.load(transcript_file)

        # Prefix each turn with its speaker tag, e.g. "<caller> hello".
        transcript_part_list = []
        for turn_item in conversation_turns:
            transcript_part = f'<{turn_item["speaker_role"]}> {turn_item["human_transcript"]}'
            transcript_part_list.append(transcript_part)

        transcript_text = " ".join(transcript_part_list)
        all_transcripts.append(transcript_text)
print(len(all_transcripts), len(all_task_types))
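
To make the flattening step easier to test in isolation, the inner loop can be expressed as a small function. The sketch below assumes each turn is a dict with the `speaker_role` and `human_transcript` keys used in the loop above; the function name `build_transcript` and the two sample turns are made up for illustration.

```python
def build_transcript(conversation_turns: list) -> str:
    """Join conversation turns into one string, tagging each turn with its speaker."""
    return " ".join(
        f'<{turn["speaker_role"]}> {turn["human_transcript"]}'
        for turn in conversation_turns
    )


# Invented sample turns that mimic the structure of a transcript file.
sample_turns = [
    {"speaker_role": "caller", "human_transcript": "hello"},
    {"speaker_role": "agent", "human_transcript": "how can i help you today"},
]

print(build_transcript(sample_turns))
# prints: <caller> hello <agent> how can i help you today
```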

In [None]:
raw_data: dict = {"transcript": all_transcripts, "task_type": all_task_types}

In [None]:
df_dataset_original: pd.DataFrame = pd.DataFrame(data=raw_data)

In [None]:
df_train: pd.DataFrame = df_dataset_original.copy()

In [None]:
df_evaluation: pd.DataFrame = df_train.sample(n=10, random_state=200)
df_train = df_train.drop(df_evaluation.index)

In [None]:
df_test = df_train.sample(n=200, random_state=200)
df_train = df_train.drop(df_test.index)

In [None]:
df_validation = df_train.sample(n=100, random_state=200)
df_train = df_train.drop(df_validation.index)

In [None]:
assert df_train.shape[0] == 700
assert df_test.shape[0] == 200
assert df_validation.shape[0] == 100
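
Note that these assertions hold only when the loaded DataFrame has exactly 1010 rows (10 evaluation + 200 test + 100 validation + 700 training); that row count is implied by the assertions rather than stated elsewhere in this notebook. The sample-then-drop pattern can be sketched without pandas as follows. The function `split_indices` is a hypothetical illustration, and because it uses the standard library's random generator, the specific rows it picks will differ from those chosen by `DataFrame.sample`.

```python
import random


def split_indices(num_rows: int, sizes: dict, seed: int = 200) -> dict:
    """Carve hold-out sets off a pool of row indices; the remainder is the training set."""
    rng = random.Random(seed)
    remaining = list(range(num_rows))
    splits = {}
    for name, size in sizes.items():
        # Sample without replacement from the rows that are still unassigned ...
        chosen = set(rng.sample(remaining, size))
        splits[name] = sorted(chosen)
        # ... and drop them so that later splits cannot overlap with this one.
        remaining = [i for i in remaining if i not in chosen]
    splits["train"] = remaining
    return splits


splits = split_indices(1010, {"evaluation": 10, "test": 200, "validation": 100})
print({name: len(rows) for name, rows in splits.items()})
# prints: {'evaluation': 10, 'test': 200, 'validation': 100, 'train': 700}
```

If your copy of the dataset has a different number of rows, the training set size changes accordingly and the assertion on `df_train` needs to be adjusted.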

In [None]:
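# Split codes used in this notebook: 0 = train, 1 = test, 2 = validation.
# Note: Ludwig (which Predibase builds on) conventionally uses 0 = train, 1 = validation, 2 = test,
# so swap the last two codes if the platform should treat these sets under the names used here.
df_train["split"] = np.zeros(df_train.shape[0])
df_test["split"] = np.ones(df_test.shape[0])
df_validation["split"] = np.full(df_validation.shape[0], 2)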

In [None]:
df_dataset = pd.concat([df_train, df_test, df_validation])

In [None]:
df_dataset["split"] = df_dataset["split"].astype(int)

In [None]:
df_dataset.shape

In [None]:
assert df_dataset[df_dataset["split"] == 0].shape[0] == 700
assert df_dataset[df_dataset["split"] == 1].shape[0] == 200
assert df_dataset[df_dataset["split"] == 2].shape[0] == 100

In [None]:
df_dataset.head(n=10)

In [None]:
# Calculating the length of each cell in each column
df_dataset['num_characters_transcript'] = df_dataset['transcript'].str.len()
df_dataset['num_characters_task_type'] = df_dataset['task_type'].str.len()

# Show Distribution
df_dataset.hist(column=['num_characters_transcript', 'num_characters_task_type'])

# Calculating the average
average_chars_transcript = df_dataset['num_characters_transcript'].mean()
average_chars_task_type = df_dataset['num_characters_task_type'].mean()

# Rough token estimate, assuming about 3 characters per token.
print(f'Approximate average number of tokens in the transcript column: {(average_chars_transcript / 3):.0f}')
print(f'Approximate average number of tokens in the task_type column: {(average_chars_task_type / 3):.0f}')
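
The token counts above are derived from character counts with a rule of thumb of roughly 3 characters per token; an exact count would require the model's tokenizer. The same heuristic can be wrapped in a helper and used to flag prompts that may not fit a context window. Both function names and the 4096-token budget below are assumptions chosen for illustration only.

```python
def estimate_tokens(text: str, chars_per_token: float = 3.0) -> int:
    """Approximate the token count of `text` from its character count."""
    return round(len(text) / chars_per_token)


def fits_budget(text: str, max_tokens: int = 4096, reserved_for_output: int = 256) -> bool:
    """Check whether the estimated prompt size leaves room for the generated tokens."""
    return estimate_tokens(text) + reserved_for_output <= max_tokens


print(estimate_tokens("replace card"))  # 12 characters / 3 -> 4
print(fits_budget("x" * 3000))          # about 1000 tokens + 256 reserved -> True
```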


In [None]:
df_evaluation

## **Now we will perform the following actions to start our fine-tuning job:**
1. Upload the dataset to Predibase for training
2. Create a prompt template to use for fine-tuning
3. Select the LLM we want to fine-tune
4. Kick off the fine-tuning job

The fine-tuning job should take around 35-45 minutes in total. Queueing time depends on how quickly we're able to acquire resources and on what other jobs are ahead in the queue. The training itself should take around 25-30 minutes. As the model trains, you can receive updated metrics in your notebook or terminal. You can also see metrics and visualizations in the Predibase UI.

In [None]:
from predibase.resource.dataset import Dataset

In [None]:
dataset_name: str = "gridspace_stanford_harper_valley"

In [None]:
dataset: Dataset = pc.create_dataset_from_df(df=df_dataset, name=dataset_name)

In [None]:
llm = pc.LLM(uri="hf://HuggingFaceH4/zephyr-7b-beta")

In [None]:
from predibase.resource.model import ModelFuture

In [None]:
# Default argument values are commented out.
job: ModelFuture = llm.finetune(
    prompt_template=prompt_template,
    target=target,
    dataset=dataset,
    # engine=engine,
    # config=None,
    # repo="optional-custom-model-repository-name",
    epochs=5,
    # train_steps=None,
    # learning_rate=None,
)

In [None]:
from predibase.resource.model import Model

In [None]:
# Wait for the job to finish and get training updates and metrics
model: Model = job.get()

# **Download your fine-tuned LLM**

In this quickstart, we're running [adapter-based fine-tuning](https://huggingface.co/docs/peft/conceptual_guides/lora), so the exported model files will contain only the adapter weights, not the full LLM weights.

In [None]:
model.download(name="zephyr_7b_beta_finetuned_gridspace_stanford_harper_valley.zip", location="/Users/myusername/path/to/PredibaseCloud/models")

# **Prompt your fine-tuned LLM**

Predibase supports both real-time and batch inference.

## **Deploy for Real-Time Inference**

There are two ways to serve your fine-tuned LLM.

#### **Real-time inference using _Dynamic Adapter Deployments_** (Recommended)

Dynamic adapter deployments allow you to prompt your fine-tuned LLM without needing to create a new deployment for each model you want to prompt. Predibase automatically loads your fine-tuned weights on top of a shared LLM deployment on demand. While this means that there will be a small amount of additional latency, the benefit is that a single LLM deployment can support many different fine-tuned model versions without requiring additional compute.

Note: Inference using dynamic adapter deployments is available to both SaaS and VPC users. Predibase provides shared base LLM deployments for use in our SaaS environment. VPC users need to deploy their own base model.

In [None]:
# First, we refresh the base model deployment (e.g., if we restarted the kernel).
base_llm_deployment = pc.LLM(uri=f"pb://deployments/{pb_model_deployment_name}")

In [None]:
model_name: str = "zephyr-7b-beta-gridspace_stanford_harper_valley"

In [None]:
model: Model = pc.get_model(name=model_name, version=None, model_id=None)

In [None]:
model.version

In [None]:
# Second, we just specify the adapter to use, which is the model we fine-tuned.
adapter_deployment: LLMDeployment = base_llm_deployment.with_adapter(model=model)

In [None]:
test_prompt

In [None]:
result = adapter_deployment.prompt(
    data=test_prompt,
    temperature=0.1,
    max_new_tokens=256,
    bypass_system_prompt=False,
)

In [None]:
print(f'\n[GENERATED_TEXT] FINE_TUNED_MODEL_PREDICTION: \"{result.generated_text}\"')
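
A single prompt says little about overall quality. The held-out `df_evaluation` rows created earlier could be used to compare the base and fine-tuned models on exact-match accuracy. The sketch below is self-contained and uses a stub predictor with invented rows; the names `accuracy` and `stub_predict` are hypothetical. In the notebook, `rows` might come from `df_evaluation.to_dict("records")`, and `predict` might format the prompt template and call `adapter_deployment.prompt(...)` as shown above.

```python
from typing import Callable


def accuracy(rows: list, predict: Callable[[str], str]) -> float:
    """Fraction of rows whose predicted task type exactly matches the label."""
    if not rows:
        return 0.0
    correct = sum(
        predict(row["transcript"]).strip().lower() == row["task_type"].strip().lower()
        for row in rows
    )
    return correct / len(rows)


def stub_predict(transcript: str) -> str:
    """Stand-in for a model call (assumption): a trivial keyword rule."""
    return "replace card" if "card" in transcript else "check balance"


# Invented rows with the same columns as the preprocessed dataset.
rows = [
    {"transcript": "<caller> i lost my debit card", "task_type": "replace card"},
    {"transcript": "<caller> what is my balance", "task_type": "check balance"},
    {"transcript": "<caller> i need to pay a bill", "task_type": "pay bill"},
]

print(f"{accuracy(rows, stub_predict):.2f}")  # 2 of 3 correct -> 0.67
```

Exact matching is strict, so trimming whitespace and lowercasing (as done here) helps avoid counting formatting differences as errors.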
