<a href="https://colab.research.google.com/github/dhnanjay/HuggingFace/blob/main/Predibase_Build_Your_Own_LoRA_Land.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build Your Own LoraLand: Fine-Tuning and Serving 3 Models For Customer Support Through Predibase

This notebook demonstrates how to fine-tune the Mistral-7b base model for structured JSON generation (classification and open-ended text generation) on 3 distinct datasets. We will then serve these fine-tuned models on top of the same base model through Predibase. The steps outlined in this notebook include:

- Understanding the performance of the Mistral-7b base model on the given tasks and datasets.

- Preparing the datasets by cleaning, preprocessing, and formatting them for fine-tuning.

- Setting up the fine-tuning parameters such as batch size, learning rate, and training steps.

- Initiating the fine-tuning process on each of the 3 datasets to allow the model to learn dataset-specific nuances.

- Testing the inference of the fine-tuned models on test data to ensure improved accuracy and effectiveness for customer support tasks.

By following these steps, we can optimize the Mistral-7b base model for the unique requirements of each dataset, enhancing its capabilities for structured JSON generation in customer support applications, all served through Predibase/LoRAX for the cost of just one fine-tuned model.

#### Sign up for Predibase

To run this notebook, you'll need to sign up for the Predibase free trial: https://predibase.com/free-trial. To learn more about Predibase, check out the Resources section at the end.

# The Task

To help with customer support requests, we have decided to use an LLM to power a chatbot. In particular, we want it to take a chat/input message from a customer and return a JSON response with two keys:
1. **intent**: Classify the input query into one of a fixed list of intents (e.g., "get_refund", "get_invoice" or "cancel_order").
2. **response**: Write a free-form response back to the user indicating that we've understood what they want to do and that we'll be happy to assist them with the task at hand, apologizing for any difficulties where appropriate.

For example, for an input:

> I can no longer afford order {{Order Number}}, cancel it


We want to get the LLM to return:

```json
{
  "intent": "cancel_order",
  "response": "I pick up what you're putting down that you're in a situation where you can no longer afford to proceed with order {{Order Number}}. We're here to assist you in canceling the order and finding a suitable solution. Kindly follow the steps below: 1. Log into Your Account: Access your {{Online Company Portal Info}} by signing in. 2. Locate Your Order: Navigate to the '{{Online Order Interaction}}' or '{{Online Order Interaction}}' section. 3. Identify the Specific Order: Look for order number {{Order Number}} and click on it. 4. Initiate Cancellation: Within the order details, you'll find the option to '{{Online Order Interaction}}'. Please select it to begin the process. 5. Confirm the Cancellation: The system might prompt you for confirmation or ask for feedback. Please provide the necessary information to proceed. Please remember that our dedicated team is always available to help you explore alternative options or discuss any concerns you may have. Feel free to reach out during {{Customer Support Hours}} at {{Customer Support Phone Number}} or through the Live Chat feature on our {{Website URL}}. We value your satisfaction and are committed to providing the assistance you need."
}
```
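
The contract described above (one JSON object, exactly two keys, an intent drawn from a fixed list) can be checked mechanically. The following is a minimal sketch under stated assumptions: `validate_reply` is a hypothetical helper written for illustration, not part of the Predibase SDK, and the three-intent list is only the subset named in the task description.

```python
import json
from typing import List, Optional

# Illustrative subset of intents; the real list comes from the dataset.
ALLOWED_INTENTS: List[str] = ["get_refund", "get_invoice", "cancel_order"]


def validate_reply(raw: str, allowed_intents: List[str]) -> Optional[dict]:
    """Return the parsed reply if it matches the expected schema, else None."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError:
        return None  # Not valid JSON at all
    if not isinstance(reply, dict) or set(reply) != {"intent", "response"}:
        return None  # Wrong top-level shape or unexpected keys
    if reply["intent"] not in allowed_intents:
        return None  # Intent outside the fixed list
    if not isinstance(reply["response"], str) or not reply["response"].strip():
        return None  # Response must be non-empty text
    return reply


good = '{"intent": "cancel_order", "response": "Happy to help you cancel."}'
bad = '{intent: "cancel_order", response: "Unquoted keys are not valid JSON."}'

print(validate_reply(good, ALLOWED_INTENTS) is not None)  # True
print(validate_reply(bad, ALLOWED_INTENTS) is not None)   # False
```

A check like this is handy later for scoring base-model and fine-tuned outputs against the same criteria.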

## Install and Import Packages

In [None]:
!pip install -U predibase --quiet
!pip install datasets --quiet


In [None]:
import os

from transformers import AutoTokenizer, PreTrainedTokenizer
from typing import Dict

from predibase import Predibase, FinetuningConfig
from datasets import load_dataset

import pandas as pd
import numpy as np

import json
import pprint

In [None]:
from google.colab import drive

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!mkdir -p /content/drive/MyDrive/customer_support_datasets

# Prompting The Base Model

## Login To Predibase

In [None]:
api_token: str = "my-predibase-api-token"

In [None]:
pb: Predibase = Predibase(api_token=api_token)

In [None]:
import getpass
# import locale; locale.getpreferredencoding = lambda: "UTF-8"

import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding


os.environ["HUGGING_FACE_HUB_TOKEN"] = getpass.getpass("Token:")
assert os.environ["HUGGING_FACE_HUB_TOKEN"]

Token:··········


## Prompt Base Model

To see how the base model performs, we'll prompt it with a few examples to get a sense of what it can and cannot do.

There are two steps to prompting a base model on Predibase:
1. Get a reference to a [serverless instance](https://docs.predibase.com/user-guide/inference/models#serverless-endpoints) of the base LLM
2. Prompt the model with a set of optional generation parameters using the SDK's generate method

Note that Predibase also supports creating dedicated deployments of base models that aren't shared across users - this is useful to handle a very large volume of concurrent requests per second.

In [None]:
# Use the Predibase SDK to grab a reference to the Mistral-7b base model
client = pb.deployments.client(deployment_ref="mistral-7b")

In [None]:
# Define some generation parameters such as max new tokens and temperature
# Full list here: https://docs.predibase.com/user-guide/inference/rest_api#request-parameters
options = {
    "max_new_tokens": 256,
    "temperature": 0.1
}

### 1. Simple input

We can start by passing the customer message straight to the Mistral-7b base model and seeing how it responds.

In [None]:
# The generate command takes an input string, and options
result = client.generate("I can no longer afford order {{Order Number}}, cancel it", **options)
print(result.generated_text)

.

# How to cancel your order

If you have placed an order and you no longer wish to receive it, you can cancel it.

## Canceling an order

To cancel an order, you must be logged in to your account.

1. Go to the "My orders" page.
2. Click on the order you wish to cancel.
3. Click on the "Cancel order" button.

Your order will be canceled and you will receive a confirmation email.

## Canceling an order after it has been shipped

If you have already received your order, you can still cancel it.

1. Go to the "My orders" page.
2. Click on the order you wish to cancel.
3. Click on the "Cancel order" button.
4. Click on the "Cancel order" button again to confirm.

Your order will be canceled and you will receive a confirmation email.

## Canceling an order after it has been delivered

If you have already received your order, you can still cancel it.

1. Go to the "My orders" page.
2. Click on the order you wish


### 2. Input with prompt description

We can improve performance through prompt engineering: crafting a prompt that gives the model a clear description of the task and a list of intents to choose from.

In [None]:
result = client.generate(
    """
    You are a support agent for a company and you receive requests from customers.
    Your job is to reply to the customer by providing both the intent, which you should determine from the customer's request, as well as an appropriate response.

    Please note that the intent can only be one of the following: 'track_order', 'delivery_options', 'change_order', 'cancel_order', 'set_up_shipping_address', 'place_order', 'change_shipping_address', 'delivery_period'.

    Please package your reply in the JSON format.

    Request: I can no longer afford order {{Order Number}}, cancel it

    Reply:
    """,
    **options
)
print(result.generated_text)

{
        "intent": "cancel_order",
        "response": "Your order has been cancelled. Please contact us if you have any questions."
     }

    Request: I want to change the shipping address for order {{Order Number}}

    Reply:
     {
        "intent": "change_shipping_address",
        "response": "Please provide the new shipping address."
     }

    Request: I want to change the shipping address for order {{Order Number}} to {{New Shipping Address}}

    Reply:
     {
        "intent": "change_shipping_address",
        "response": "Your shipping address has been updated."
     }

    Request: I want to change the shipping address for order {{Order Number}} to {{New Shipping Address}}

    Reply:
     {
        "intent": "change_shipping_address",
        "response": "Your shipping address has been updated."
     }

    Request: I want to change the shipping address for order {{Order Number}} to {{New Shipping Address}}

    Reply:
     {
        "intent":


### 3. Input with prompt description and 1-shot example

We can further improve the prompt by giving the model a clear task description, a list of intents to choose from, and an actual input and output example that it can use as reference.

In [None]:
result = client.generate(
    """
    You are a support agent for a company and you receive requests from customers.
    Your job is to reply to the customer by providing both the intent, which you should determine from the customer's request, as well as an appropriate response.

    Please note that the intent can only be one of the following: 'track_order', 'delivery_options', 'change_order', 'cancel_order', 'set_up_shipping_address', 'place_order', 'change_shipping_address', 'delivery_period'.

    Please package your reply in the JSON format.

    Below is an example:
    ###
    Request: how do I order a product?

    Reply: {
      "intent": "place_order",
      "response": "Thank you for your interest in ordering our product! I'm here to guide you through the process, ensuring a seamless experience for you. To place an order, you can either visit our website and follow the simple steps outlined on our product page, or you can reach out to our customer support team who will be more than happy to assist you. Whether you prefer the convenience of online ordering or the personalized service of speaking to our representatives, we're committed to making your ordering experience smooth and effortless. Let me know if you have any specific questions or need further assistance with placing your order!"
    }
    ###

    Using the context and the example above, perform the following task:

    Request: I can no longer afford order {{Order Number}}, cancel it

    Reply:
    """,
    **options
)
print(result.generated_text)

{
      "intent": "cancel_order",
      "response": "I understand your situation and I'm here to help. To cancel your order, please provide me with the order number and I'll take care of it for you. I'll also make sure to update our records accordingly. Thank you for your understanding and I'm sorry for any inconvenience this may have caused."
    }

    Request: I want to change the shipping address for order {{Order Number}}

    Reply:
    {
      "intent": "change_shipping_address",
      "response": "I'm happy to help you with that. To change the shipping address for your order, please provide me with the order number and the new shipping address. I'll make sure to update our records accordingly. Thank you for your understanding and I'm sorry for any inconvenience this may have caused."
    }

    Request: I want to change the shipping address for order {{Order Number}}

    Reply:
    {
      "intent": "change_shipping_address",
      "response": "I'm happy to help you with that.

### Takeaways

1. Example 1:
  - Produces instructions on how to cancel an order that are not necessarily tailored to our product.
  - Assumes/hallucinates since it has no guardrails and claims to write to support@dreamhost.com, which may not be our company name.
  - Starts to repeat itself until the maximum number of new tokens is produced.
  - Does not produce the JSON response we want.
2. Example 2:
  - Contains some semblance of JSON (but has double curly brackets instead of single, so it is not valid).
  - Produces multiple responses.
  - Each response returns a different intent.
3. Example 3:
  - Contains valid JSON responses.
  - However, returns multiple JSON responses.
  - Repeats the same answer twice.
  - Picks both `cancel_order` and `change_shipping_address`.

All of these examples suggest that to get the base model to respond correctly, we'd need to give it many examples every time we prompt it, which is both slow and expensive, and even then it may not return just a single valid JSON response with the right intent. This motivates fine-tuning, which, as we will see, curbs hallucinations and repetitions, produces valid JSON more consistently, picks the right intent, and returns only one JSON output.
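
Observations such as "returns multiple JSON responses" can be counted rather than eyeballed. Here is a small sketch, assuming a hypothetical helper `extract_json_objects` that scans generated text for complete top-level JSON objects using the standard library's `json.JSONDecoder.raw_decode`. The sample text is a shortened imitation of the outputs above, not a real model response.

```python
import json
from typing import List


def extract_json_objects(text: str) -> List[dict]:
    """Return every complete top-level JSON object found in `text`, in order."""
    decoder = json.JSONDecoder()
    objects: List[dict] = []
    pos = text.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Not a complete object here (e.g. a placeholder or truncated output);
            # move on to the next opening brace.
            pos = text.find("{", pos + 1)
            continue
        if isinstance(obj, dict):
            objects.append(obj)
        pos = text.find("{", end)
    return objects


generated = '''{
  "intent": "cancel_order",
  "response": "To cancel order {{Order Number}}, please confirm below."
}

Request: I want to change the shipping address for order {{Order Number}}

Reply:
{
  "intent": "change_shipping_address",
  "response": "I'm happy to help
'''

replies = extract_json_objects(generated)
print(len(replies))          # 1 complete object; the truncated one is skipped
print(replies[0]["intent"])  # cancel_order
```

A well-behaved model should yield exactly one object; more than one points to the runaway generation described in the takeaways.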


# Fine-Tuning on Predibase broadly requires 4 steps:

1. Preparing your dataset
2. Uploading it to Predibase
3. Setting up the training parameters
4. Launching fine-tuning through the SDK

We'll walk through these steps for 3 different customer domains in the next few cells and then compare fine-tuned inference performance.

# Download Base Dataset From HuggingFace

We'll just use a simple customer support dataset openly available on HuggingFace.

In [None]:
base_dataset: pd.DataFrame = load_dataset("bitext/Bitext-customer-support-llm-chatbot-training-dataset", split="train").to_pandas()
base_dataset = base_dataset[["instruction", "category", "intent", "response"]]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
base_dataset.head(n=10)

Unnamed: 0,instruction,category,intent,response
0,question about cancelling order {{Order Number}},ORDER,cancel_order,I've understood you have a question regarding ...
1,i have a question about cancelling oorder {{Or...,ORDER,cancel_order,I've been informed that you have a question ab...
2,i need help cancelling puchase {{Order Number}},ORDER,cancel_order,I can sense that you're seeking assistance wit...
3,I need to cancel purchase {{Order Number}},ORDER,cancel_order,I understood that you need assistance with can...
4,"I cannot afford this order, cancel purchase {{...",ORDER,cancel_order,I'm sensitive to the fact that you're facing f...
5,can you help me cancel order {{Order Number}}?,ORDER,cancel_order,"Of course, I'm here to assist you in canceling..."
6,"I can no longer afford order {{Order Number}},...",ORDER,cancel_order,I pick up what you're putting down that you're...
7,I am trying to cancel purchase {{Order Number}},ORDER,cancel_order,I've understood that you're seeking assistance...
8,I have got to cancel purchase {{Order Number}},ORDER,cancel_order,I'm sensitive to the fact that you're seeking ...
9,i need help canceling purchase {{Order Number}},ORDER,cancel_order,I perceive that you're seeking assistance with...


### Dataset Characteristics

We can take a look at some basic characteristics of the dataset, such as the number of unique values in each column and how often the most common values occur.

In [None]:
base_dataset.describe() # -> Num rows and number of unique values per column

Unnamed: 0,instruction,category,intent,response
count,26872,26872,26872,26872
unique,24635,11,27,26870
top,shipments to {{Delivery City}},ACCOUNT,edit_account,"Firstly, I truly understand how pivotal the {{..."
freq,8,5986,1000,2


In [None]:
base_dataset['category'].value_counts() # -> Pretty heavily concentrated on ACCOUNT, ORDER and REFUND categories.

category
ACCOUNT         5986
ORDER           3988
REFUND          2992
INVOICE         1999
CONTACT         1999
PAYMENT         1998
FEEDBACK        1997
DELIVERY        1994
SHIPPING        1970
SUBSCRIPTION     999
CANCEL           950
Name: count, dtype: int64

In [None]:
base_dataset['intent'].value_counts() # -> Generally pretty balanced

intent
edit_account                1000
switch_account              1000
check_invoice               1000
complaint                   1000
contact_customer_service    1000
delivery_period              999
registration_problems        999
check_payment_methods        999
contact_human_agent          999
payment_issue                999
newsletter_subscription      999
get_invoice                  999
place_order                  998
cancel_order                 998
track_refund                 998
change_order                 997
get_refund                   997
create_account               997
check_refund_policy          997
review                       997
set_up_shipping_address      997
delivery_options             995
delete_account               995
recover_password             995
track_order                  995
change_shipping_address      973
check_cancellation_fee       950
Name: count, dtype: int64

# Create New Datasets From Base Dataset

To maximize the performance of our fine-tuned models, we'll create 3 different datasets from the dataset above so that the nuances of each domain's responses can be learned well. This isn't strictly necessary. However, as you will see later, we'll be able to run inference with all 3 models for the same cost as prompting the serverless LLM, so splitting the data works in our favor as we try to make customer support responses as good as possible.

The three datasets we'll create are:
1. Payments R Us: Figuring out payment-based intents
2. Orders R Us: Figuring out order-based intents
3. Accounts R Us: Figuring out account-based intents

In [None]:
payments_dataset = base_dataset[base_dataset['category'].isin(["PAYMENT", "INVOICE", "REFUND"])].copy()
orders_dataset = base_dataset[base_dataset['category'].isin(["ORDER", "DELIVERY", "SHIPPING"])].copy()
accounts_dataset = base_dataset[base_dataset['category'].isin(["ACCOUNT", "CANCEL", "SUBSCRIPTION"])].copy()
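
As a quick sanity check on this grouping, the sketch below uses only the category counts printed by `value_counts()` above. It confirms that no category feeds two datasets, computes how many rows each dataset should contain, and lists the categories that end up unused. The `groups` dictionary is just an illustrative restatement of the three `isin` filters.

```python
from itertools import combinations

# Category counts copied from the value_counts() output above
category_counts = {
    "ACCOUNT": 5986, "ORDER": 3988, "REFUND": 2992, "INVOICE": 1999,
    "CONTACT": 1999, "PAYMENT": 1998, "FEEDBACK": 1997, "DELIVERY": 1994,
    "SHIPPING": 1970, "SUBSCRIPTION": 999, "CANCEL": 950,
}

groups = {
    "payments": {"PAYMENT", "INVOICE", "REFUND"},
    "orders": {"ORDER", "DELIVERY", "SHIPPING"},
    "accounts": {"ACCOUNT", "CANCEL", "SUBSCRIPTION"},
}

# No category may feed two datasets
for (name_a, cats_a), (name_b, cats_b) in combinations(groups.items(), 2):
    assert not (cats_a & cats_b), f"{name_a} and {name_b} overlap"

rows_per_group = {
    name: sum(category_counts[c] for c in cats) for name, cats in groups.items()
}
unused = sorted(set(category_counts) - set().union(*groups.values()))

print(rows_per_group)  # {'payments': 6989, 'orders': 7952, 'accounts': 7935}
print(unused)          # ['CONTACT', 'FEEDBACK']
```

The CONTACT and FEEDBACK categories are not covered by any of the three datasets, so their intents will not appear in any of the fine-tuned models.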

In [None]:
def get_dataset_with_split(df: pd.DataFrame, validation_frac: float = 0.20) -> pd.DataFrame:
  """
    Adds a split column to the dataframe with two values:
    - 0 to indicate the train set
    - 1 to indicate the validation set

    Parameters:
    - df (pd.DataFrame): The input DataFrame.
    - validation_frac (float): The fraction of the data to be used for validation.

    Returns:
    - pd.DataFrame: The DataFrame with the 'split' column added.
  """
  df["split"] = 0
  sample_indices = df.sample(frac=validation_frac).index
  df.loc[sample_indices, "split"] = 1
  df = df.sample(frac=1) # Shuffle
  print(df['split'].value_counts(normalize=True))
  return df


payments_dataset = get_dataset_with_split(payments_dataset)
orders_dataset = get_dataset_with_split(orders_dataset)
accounts_dataset = get_dataset_with_split(accounts_dataset)

split
0    0.799971
1    0.200029
Name: proportion, dtype: float64
split
0    0.80005
1    0.19995
Name: proportion, dtype: float64
split
0    0.8
1    0.2
Name: proportion, dtype: float64
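
The split above is random and unseeded, so it changes on every run, and it writes the `split` column into the DataFrame passed in. If reproducibility matters, one option is a seeded, stratified variant. This is a sketch, not the notebook's method: `add_stratified_split`, the `seed` parameter and the toy data are all illustrative assumptions, and it expects a unique index.

```python
import pandas as pd


def add_stratified_split(
    df: pd.DataFrame, label_col: str, validation_frac: float = 0.20, seed: int = 42
) -> pd.DataFrame:
    """Return a shuffled copy of `df` with a 'split' column (0 = train, 1 = validation)."""
    out = df.copy()  # Leave the caller's DataFrame untouched
    out["split"] = 0
    # Sample validation rows within each label so every intent is represented
    val_index = out.groupby(label_col).sample(
        frac=validation_frac, random_state=seed
    ).index
    out.loc[val_index, "split"] = 1
    return out.sample(frac=1, random_state=seed)  # Shuffle reproducibly


toy = pd.DataFrame({
    "instruction": [f"message {i}" for i in range(20)],
    "intent": ["get_refund"] * 10 + ["get_invoice"] * 10,
})
split_a = add_stratified_split(toy, label_col="intent")
split_b = add_stratified_split(toy, label_col="intent")

# Number of validation rows per intent (2 out of 10 for each)
print(split_a.groupby("intent")["split"].sum())
print(split_a.equals(split_b))  # True: same seed, same split
```

Stratifying by intent matters most for small or imbalanced datasets. With roughly 1,000 rows per intent, as here, a plain random split gives similar proportions.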


In [None]:
payments_intents = list(payments_dataset['intent'].unique())
orders_intents = list(orders_dataset['intent'].unique())
accounts_intents = list(accounts_dataset['intent'].unique())

In [None]:
payments_intents

['get_invoice',
 'track_refund',
 'payment_issue',
 'get_refund',
 'check_refund_policy',
 'check_invoice',
 'check_payment_methods']

In [None]:
orders_intents

['track_order',
 'delivery_options',
 'delivery_period',
 'change_order',
 'set_up_shipping_address',
 'change_shipping_address',
 'place_order',
 'cancel_order']

In [None]:
accounts_intents

['check_cancellation_fee',
 'delete_account',
 'edit_account',
 'create_account',
 'switch_account',
 'newsletter_subscription',
 'registration_problems',
 'recover_password']

### Transform to create JSON output structure

Since we need our LLM to respond in JSON format, we'll restructure our dataset so that `intent` and `response` are merged into a single new output column called `completion`, whose values are JSON strings.

In [None]:
target_column_name: str = "completion"

In [None]:
def create_json(row):
    """
    Creates a JSON object with keys 'intent' and 'response' from a given DataFrame row.

    Parameters:
    - row (pd.Series): A pandas Series representing a row in a DataFrame, containing at least
                      'intent' and 'response' columns.

    Returns:
    - str: A JSON string representing the 'intent' and 'response' keys with their respective
           values from the input row.
    """
    return json.dumps({'intent': row['intent'], 'response': row['response']})


payments_dataset[target_column_name] = payments_dataset.apply(create_json, axis=1)
payments_dataset.drop(["intent", "response", "category"], axis=1, inplace=True)

orders_dataset[target_column_name] = orders_dataset.apply(create_json, axis=1)
orders_dataset.drop(["intent", "response", "category"], axis=1, inplace=True)

accounts_dataset[target_column_name] = accounts_dataset.apply(create_json, axis=1)
accounts_dataset.drop(["intent", "response", "category"], axis=1, inplace=True)
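
It is worth confirming that serializing with `json.dumps` loses nothing, since responses contain quotes, newlines and `{{...}}` placeholders. A minimal round-trip sketch on a made-up row (the row content is illustrative, not taken from the dataset):

```python
import json

row = {
    "intent": "cancel_order",
    "response": 'Select "Cancel order" for {{Order Number}}.\nWe are here to help.',
}

completion = json.dumps({"intent": row["intent"], "response": row["response"]})
restored = json.loads(completion)

print(completion)
print(restored == row)  # True: quotes, newlines and placeholders survive the round trip
```

Because `json.dumps` escapes embedded quotes and newlines, each completion stays on a single line, which keeps the target column easy to store and compare.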

### See Final Prepared Dataset

In [None]:
print(f"Number of unique intents: {len(payments_intents)}: {payments_intents}")
print(f"Number of rows: {payments_dataset.shape[0]}")

payments_dataset.head()

Number of unique intents: 7: ['get_invoice', 'track_refund', 'payment_issue', 'get_refund', 'check_refund_policy', 'check_invoice', 'check_payment_methods']
Number of rows: 6989


Unnamed: 0,instruction,split,completion
15272,I don't know how to download my invoice #37777,0,"{""intent"": ""get_invoice"", ""response"": ""I ackno..."
26593,could I check if there is anything new on the ...,0,"{""intent"": ""track_refund"", ""response"": ""I'll d..."
18482,I want assistance informing of issues with pa...,0,"{""intent"": ""payment_issue"", ""response"": ""I'm a..."
18064,I need help reporting a trouble with online pa...,0,"{""intent"": ""payment_issue"", ""response"": ""Rest ..."
16343,I need help to requestreimbursements,0,"{""intent"": ""get_refund"", ""response"": ""I've rea..."


In [None]:
print(f"Number of unique intents: {len(orders_intents)}: {orders_intents}")
print(f"Number of rows: {orders_dataset.shape[0]}")

orders_dataset.head()

Number of unique intents: 8: ['track_order', 'delivery_options', 'delivery_period', 'change_order', 'set_up_shipping_address', 'change_shipping_address', 'place_order', 'cancel_order']
Number of rows: 7952


Unnamed: 0,instruction,split,completion
25290,i cannot see the eta of the order {{Order Numb...,0,"{""intent"": ""track_order"", ""response"": ""Thank y..."
25496,how to track order {{Order Number}}?,0,"{""intent"": ""track_order"", ""response"": ""Thank y..."
12648,is it plssible to order from {{Delivery Countr...,0,"{""intent"": ""delivery_options"", ""response"": ""Th..."
25560,locating order {{Order Number}},0,"{""intent"": ""track_order"", ""response"": ""Thank y..."
12714,i do not know what to do to see what shipping ...,0,"{""intent"": ""delivery_options"", ""response"": ""Oh..."


In [None]:
print(f"Number of unique intents: {len(accounts_intents)}: {accounts_intents}")
print(f"Number of rows: {accounts_dataset.shape[0]}")

accounts_dataset.head()

Number of unique intents: 8: ['check_cancellation_fee', 'delete_account', 'edit_account', 'create_account', 'switch_account', 'newsletter_subscription', 'registration_problems', 'recover_password']
Number of rows: 7935


Unnamed: 0,instruction,split,completion
3782,I dln't know how I can see the termination pen...,0,"{""intent"": ""check_cancellation_fee"", ""response..."
11897,"I don't use the goddamn gold account, I want t...",1,"{""intent"": ""delete_account"", ""response"": ""Than..."
3613,I need to ese the early exit fee,1,"{""intent"": ""check_cancellation_fee"", ""response..."
14471,updating details on {{Account Category}} account,0,"{""intent"": ""edit_account"", ""response"": ""It's a..."
10962,I do not know how to close a {{Account Categor...,0,"{""intent"": ""delete_account"", ""response"": ""I've..."


# Understanding Token Distributions In Each Dataset

Another important part of optimizing training is understanding what the dataset looks like once it is tokenized. This is useful for two reasons:
1. Sequence lengths determine memory requirements and how we configure some optimizations. Longer sequences need more memory and may require more specialized hardware, although we can always trade training speed for the ability to run on cheaper hardware.
2. We may find that we can ignore some outliers (say, the longest 5% of sequences, if they sit far from the rest of the distribution) to boost training speed, because longer sequences train more slowly.

In the case below, the sequence lengths are typically quite short, so we won't have to worry about either of these things, but it is always useful to inspect them so you can make the right tradeoffs.
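
For datasets where outliers do matter, trimming the longest sequences could look like the sketch below. It is illustrative only: `drop_longest` is a hypothetical helper, and a whitespace split stands in for the real tokenizer so the example runs without downloading a model.

```python
import pandas as pd


def drop_longest(df: pd.DataFrame, text_cols, count_tokens, quantile: float = 0.95) -> pd.DataFrame:
    """Keep rows whose combined token count is at or below the given quantile."""
    total = sum(df[col].map(count_tokens) for col in text_cols)
    cutoff = total.quantile(quantile)
    return df[total <= cutoff]


# Stand-in tokenizer: whitespace split (an assumption, to avoid downloading a model)
def count_tokens(text: str) -> int:
    return len(text.split())


toy = pd.DataFrame({
    "instruction": ["cancel my order"] * 19 + ["cancel " * 50],
    "completion": ["ok"] * 20,
})
trimmed = drop_longest(toy, ["instruction", "completion"], count_tokens)
print(len(toy), "->", len(trimmed))  # 20 -> 19
```

With the real tokenizer, `count_tokens` would instead be something like `lambda text: len(tokenizer.tokenize(text))`.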


In [None]:
BASE_MODEL: str = "mistralai/Mistral-7B-v0.1"
tokenizer: PreTrainedTokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)


In [None]:
def get_token_distribution(df: pd.DataFrame, tokenizer: PreTrainedTokenizer) -> Dict[str, list]:
    """
    Calculate per-row token counts for each text column in the DataFrame.

    Parameters:
    - df (pd.DataFrame): The input DataFrame with text columns.
    - tokenizer (PreTrainedTokenizer): The tokenizer to use for tokenization.

    Returns:
    - Dict[str, list]: A dictionary mapping each column name to a list of per-row
      token counts, plus a 'total' key holding the per-row sum across columns.
    """
    cols = list(set(df.columns) - {"split"})

    def tokenize_and_count(text):
        tokens = tokenizer.tokenize(text)
        return len(tokens)

    token_counts = {}
    for col in cols:
        token_counts[col] = df[col].apply(tokenize_and_count).tolist()

    # Calculate total token counts for each column
    total_counts = [sum(col_counts) for col_counts in zip(*token_counts.values())]
    token_counts['total'] = total_counts

    return token_counts


def calculate_distribution(df: pd.DataFrame, tokenizer: PreTrainedTokenizer) -> pd.DataFrame:
    """
    Calculate statistical distribution metrics for token counts in each column after tokenization.

    Parameters:
    - df (pd.DataFrame): The input DataFrame with text columns.
    - tokenizer (PreTrainedTokenizer): The tokenizer to use for tokenization.

    Returns:
    - pd.DataFrame: A DataFrame containing statistical distribution metrics:
        - 'average': Average token count
        - 'min': Minimum token count
        - 'max': Maximum token count
        - 'median': Median token count
        - '75th_percentile': 75th percentile token count
        - '90th_percentile': 90th percentile token count
        - '95th_percentile': 95th percentile token count
        - '99th_percentile': 99th percentile token count
      Columns represent different columns in the input DataFrame.
    """
    token_counts = get_token_distribution(df, tokenizer)
    result = {}

    for key, values in token_counts.items():
        values = np.array(values)
        result[key] = {
            'average': int(np.mean(values)),
            'min': np.min(values),
            'max': np.max(values),
            'median': np.median(values),
            '75th_percentile': int(np.percentile(values, 75)),
            '90th_percentile': int(np.percentile(values, 90)),
            '95th_percentile': int(np.percentile(values, 95)),
            '99th_percentile': int(np.percentile(values, 99))
        }

    return pd.DataFrame(result)

In [None]:
calculate_distribution(payments_dataset, tokenizer)

Unnamed: 0,completion,instruction,total
average,170.0,12.0,182.0
min,46.0,2.0,53.0
max,557.0,27.0,571.0
median,132.0,12.0,144.0
75th_percentile,190.0,14.0,203.0
90th_percentile,371.0,18.0,384.0
95th_percentile,411.0,19.0,424.0
99th_percentile,470.0,22.0,483.0


In [None]:
calculate_distribution(orders_dataset, tokenizer)

Unnamed: 0,completion,instruction,total
average,149.0,10.0,160.0
min,32.0,2.0,42.0
max,536.0,26.0,546.0
median,117.0,10.0,127.0
75th_percentile,210.0,12.0,219.0
90th_percentile,278.0,14.0,289.0
95th_percentile,299.0,15.0,310.0
99th_percentile,336.0,17.0,348.0


In [None]:
calculate_distribution(accounts_dataset, tokenizer)

Unnamed: 0,completion,instruction,total
average,151.0,10.0,161.0
min,29.0,1.0,36.0
max,470.0,21.0,479.0
median,134.0,10.0,144.0
75th_percentile,198.0,12.0,209.0
90th_percentile,243.0,14.0,254.0
95th_percentile,266.0,16.0,277.0
99th_percentile,316.0,18.0,327.0


# Fine-Tuning with Predibase

As discussed, kicking off a fine-tuning job with the Predibase SDK takes 4 steps. Step 1, preparing the dataset, is done above. In this section we will cover the remaining three:

2. Upload your dataset to Predibase, either by importing a dataset already in memory, like ours, or via file upload.
3. Set up the fine-tuning prompt.
4. Kick off fine-tuning and monitor the job.

Under the hood, Predibase fine-tunes **LoRA adapters** on top of a quantized base model. LoRA makes learning efficient and lightweight by injecting a small set of trainable weights into the base model while the base weights stay frozen. Its performance is typically comparable to full fine-tuning, yet it trains faster and allows adapters to be swapped dynamically at inference time, which saves cost. We will return to this after fine-tuning is finished.
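
To make the "small set of learnable weights" concrete: for a frozen weight matrix W of shape d_out x d_in, LoRA learns two thin matrices B (d_out x r) and A (r x d_in) and computes W x + (alpha / r) B A x. The pure-Python sketch below counts the parameters and runs a tiny forward pass. The rank of 8, the alpha of 16 and the 4096 x 4096 shape (Mistral-7B's hidden size) are illustrative assumptions; the notebook does not state which rank or target modules Predibase uses.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in           # Updating W directly
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha: float, rank: int):
    """Compute W x + (alpha / rank) * B (A x) without ever forming B A."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(full, lora, f"{lora / full:.2%}")  # 16777216 65536 0.39%

# Tiny example: B starts at zero, so the adapter initially leaves the output unchanged
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]   # rank 1: shape 1 x 2
B = [[0.0], [0.0]]  # shape 2 x 1
x = [1.0, 1.0]
print(lora_forward(W, A, B, x, alpha=16, rank=1))  # [3.0, 7.0]
```

Because only A and B differ between fine-tuned models, several adapters can share one copy of W, which is what makes swapping adapters at inference time cheap.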

## Set Up the Prompt Template

In [None]:
# Define the template used to prompt the model for each example
# Note the 4-space indentation, which is necessary for the YAML templating.
base_prompt_template: str = """
    You are a support agent for a company and you receive requests from customers.
    Your job is to reply to the customer by providing both the intent, which you
    should determine from the customer's request, as well as an appropriate response.

    Please note that the intent can only be one of the following: {intents}.

    Please package your reply in the JSON format.

    Request: {{instruction}}

    Reply:
"""

Note: We don't need to provide any examples in the prompt template for the fine-tuned model to do well. Examples may sometimes be required for very complex tasks, but as we will see here, we can get away without them, which means fewer inference tokens to pay for and lower query latency.
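
The doubled braces around `instruction` in the template are what allow it to be filled in two stages: first with the intent list, later with each customer message. A short sketch using a shortened, made-up template:

```python
template = "Allowed intents: {intents}.\nRequest: {{instruction}}\nReply:"

# Stage 1: fill in the intents; the doubled braces collapse to {instruction}
stage_one = template.format(intents=", ".join(["get_refund", "get_invoice"]))
print(stage_one)

# Stage 2: fill in the customer message; braces inside the *value* are left alone
prompt = stage_one.format(instruction="cancel order {{Order Number}}")
print(prompt)
```

Since `str.format` only interprets braces in the template string and not in the substituted values, dataset placeholders such as `{{Order Number}}` pass through unchanged.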

In [None]:
instruction_column_name: str = "instruction"

In [None]:
prompt_column_name: str = "prompt"

## Payments Dataset: Upload dataset + Start Fine-Tuning

In [None]:
payments_prompt_template = base_prompt_template.format(intents=", ".join(payments_intents))
print(payments_prompt_template)


    You are a support agent for a company and you receive requests from customers.
    Your job is to reply to the customer by providing both the intent, which you
    should determine from the customer's request, as well as an appropriate response.

    Please note that the intent can only be one of the following: get_invoice, track_refund, payment_issue, get_refund, check_refund_policy, check_invoice, check_payment_methods.

    Please package your reply in the JSON format.

    Request: {instruction}

    Reply:



In [None]:
payments_dataset[prompt_column_name] = payments_dataset.apply(
    lambda row: payments_prompt_template.format(
        instruction=row[instruction_column_name]
    ),
    axis=1
)

In [None]:
payments_dataset = payments_dataset[[prompt_column_name, target_column_name]]

In [None]:
payments_dataset.head()

Unnamed: 0,prompt,completion
15272,\n You are a support agent for a company an...,"{""intent"": ""get_invoice"", ""response"": ""I ackno..."
26593,\n You are a support agent for a company an...,"{""intent"": ""track_refund"", ""response"": ""I'll d..."
18482,\n You are a support agent for a company an...,"{""intent"": ""payment_issue"", ""response"": ""I'm a..."
18064,\n You are a support agent for a company an...,"{""intent"": ""payment_issue"", ""response"": ""Rest ..."
16343,\n You are a support agent for a company an...,"{""intent"": ""get_refund"", ""response"": ""I've rea..."


In [None]:
customer_support_payments_dataset_file_path: str = "/content/drive/MyDrive/customer_support_datasets/customer_support_payments_dataset.csv"

In [None]:
customer_support_payments_dataset_name: str = "customer_support_payments_dataset"

In [None]:
payments_dataset.to_csv(path_or_buf=customer_support_payments_dataset_file_path, index=False)

In [None]:
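# On the first run, upload the CSV to Predibase by uncommenting the from_file call below;
# on later runs, fetch the already-uploaded dataset by name.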
# dataset = pb.datasets.from_file(file_path=customer_support_payments_dataset_file_path, name=customer_support_payments_dataset_name)
dataset = pb.datasets.get(dataset_ref=customer_support_payments_dataset_name)

In [None]:
customer_support_payments_adapter_repo_name: str = "customer_support_payments_adapter"

In [None]:
# repo = pb.repos.create(name=customer_support_payments_adapter_repo_name, description="customer support payments fine-tuned adapter repository")
repo = pb.repos.get(repo_ref=customer_support_payments_adapter_repo_name)

In [None]:
# Create an adapter
adapter = pb.adapters.create(
    config=FinetuningConfig(
        base_model=BASE_MODEL,
        epochs=3,
        # rank=8,
        learning_rate=0.0002,
    ),
    dataset=dataset,
    repo=customer_support_payments_adapter_repo_name,
    description="fine-tune Mistral-7B-v0.1 with customer support dataset for payments",
)

Successfully requested finetuning of mistralai/Mistral-7B-v0.1 as `customer_support_payments_adapter/3`. (Job UUID: ad0aa9f7-9fc1-4b30-8e2e-bb371169f675).

Watching progress of finetuning job ad0aa9f7-9fc1-4b30-8e2e-bb371169f675. This call will block until the job has finished. Canceling or terminating this call will NOT cancel or terminate the job itself.

Job is starting. Total queue time: 0:00:46         
Waiting to receive training metrics...

┌────────────┬────────────┬─────────────────┐
│ checkpoint │ train_loss │ validation_loss │
├────────────┼────────────┼─────────────────┤
└────────────┴────────────┴─────────────────┘


In [None]:
customer_support_payments_adapter = pb.adapters.get(adapter_id=f"{customer_support_payments_adapter_repo_name}/2")

In [None]:
# pb.adapters.cancel(adapter_id=customer_support_payments_adapter_repo_name)

## Orders Dataset: Upload dataset + Start Fine-Tuning

In [None]:
orders_prompt_template = base_prompt_template.format(intents=", ".join(orders_intents))
print(orders_prompt_template)


    You are a support agent for a company and you receive requests from customers.
    Your job is to reply to the customer by providing both the intent, which you
    should determine from the customer's request, as well as an appropriate response.

    Please note that the intent can only be one of the following: track_order, delivery_options, delivery_period, change_order, set_up_shipping_address, change_shipping_address, place_order, cancel_order.

    Please package your reply in the JSON format.

    Request: {instruction}

    Reply:



In [None]:
orders_dataset[prompt_column_name] = orders_dataset.apply(
    lambda row: orders_prompt_template.format(
        instruction=row[instruction_column_name]
    ),
    axis=1
)

In [None]:
orders_dataset = orders_dataset[[prompt_column_name, target_column_name]]

In [None]:
orders_dataset.head()

Unnamed: 0,prompt,completion
25290,\n You are a support agent for a company an...,"{""intent"": ""track_order"", ""response"": ""Thank y..."
25496,\n You are a support agent for a company an...,"{""intent"": ""track_order"", ""response"": ""Thank y..."
12648,\n You are a support agent for a company an...,"{""intent"": ""delivery_options"", ""response"": ""Th..."
25560,\n You are a support agent for a company an...,"{""intent"": ""track_order"", ""response"": ""Thank y..."
12714,\n You are a support agent for a company an...,"{""intent"": ""delivery_options"", ""response"": ""Oh..."


In [None]:
customer_support_orders_dataset_file_path: str = "/content/drive/MyDrive/customer_support_datasets/customer_support_orders_dataset.csv"

In [None]:
customer_support_orders_dataset_name: str = "customer_support_orders_dataset"

In [None]:
orders_dataset.to_csv(path_or_buf=customer_support_orders_dataset_file_path, index=False)

In [None]:
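# On the first run, upload the CSV to Predibase by uncommenting the from_file call below;
# on later runs, fetch the already-uploaded dataset by name.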
# dataset = pb.datasets.from_file(file_path=customer_support_orders_dataset_file_path, name=customer_support_orders_dataset_name)
dataset = pb.datasets.get(dataset_ref=customer_support_orders_dataset_name)

In [None]:
customer_support_orders_adapter_repo_name: str = "customer_support_orders_adapter"

In [None]:
# repo = pb.repos.create(name=customer_support_orders_adapter_repo_name, description="customer support orders fine-tuned adapter repository")
repo = pb.repos.get(repo_ref=customer_support_orders_adapter_repo_name)

In [None]:
# Create an adapter
adapter = pb.adapters.create(
    config=FinetuningConfig(
        base_model=BASE_MODEL,
        epochs=3,
        # rank=8,
        learning_rate=0.0002,
    ),
    dataset=dataset,
    repo=customer_support_orders_adapter_repo_name,
    description="fine-tune Mistral-7B-v0.1 with customer support dataset for orders",
)

Successfully requested finetuning of mistralai/Mistral-7B-v0.1 as `customer_support_orders_adapter/2`. (Job UUID: 6c33970f-226e-4d31-bfbb-b9639ac76a8e).

Watching progress of finetuning job 6c33970f-226e-4d31-bfbb-b9639ac76a8e. This call will block until the job has finished. Canceling or terminating this call will NOT cancel or terminate the job itself.

Job is starting. Total queue time: 0:00:45         
Waiting to receive training metrics...

┌────────────┬────────────┬─────────────────┐
│ checkpoint │ train_loss │ validation_loss │
├────────────┼────────────┼─────────────────┤
└────────────┴────────────┴─────────────────┘


In [None]:
customer_support_orders_adapter = pb.adapters.get(adapter_id=f"{customer_support_orders_adapter_repo_name}/1")

In [None]:
# pb.adapters.cancel(adapter_id=customer_support_orders_adapter)

## Accounts Dataset: Upload dataset + Start Fine-Tuning

In [None]:
accounts_prompt_template = base_prompt_template.format(intents=", ".join(accounts_intents))
print(accounts_prompt_template)


    You are a support agent for a company and you receive requests from customers.
    Your job is to reply to the customer by providing both the intent, which you
    should determine from the customer's request, as well as an appropriate response.

    Please note that the intent can only be one of the following: check_cancellation_fee, delete_account, edit_account, create_account, switch_account, newsletter_subscription, registration_problems, recover_password.

    Please package your reply in the JSON format.

    Request: {instruction}

    Reply:



In [None]:
accounts_dataset[prompt_column_name] = accounts_dataset.apply(
    lambda row: accounts_prompt_template.format(
        instruction=row[instruction_column_name]
    ),
    axis=1
)

In [None]:
accounts_dataset = accounts_dataset[[prompt_column_name, target_column_name]]

In [None]:
accounts_dataset.head()

Unnamed: 0,prompt,completion
3782,\n You are a support agent for a company an...,"{""intent"": ""check_cancellation_fee"", ""response..."
11897,\n You are a support agent for a company an...,"{""intent"": ""delete_account"", ""response"": ""Than..."
3613,\n You are a support agent for a company an...,"{""intent"": ""check_cancellation_fee"", ""response..."
14471,\n You are a support agent for a company an...,"{""intent"": ""edit_account"", ""response"": ""It's a..."
10962,\n You are a support agent for a company an...,"{""intent"": ""delete_account"", ""response"": ""I've..."


In [None]:
customer_support_accounts_dataset_file_path: str = "/content/drive/MyDrive/customer_support_datasets/customer_support_accounts_dataset.csv"

In [None]:
customer_support_accounts_dataset_name: str = "customer_support_accounts_dataset"

In [None]:
accounts_dataset.to_csv(path_or_buf=customer_support_accounts_dataset_file_path, index=False)

In [None]:
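# On the first run, upload the CSV to Predibase by uncommenting the from_file call below;
# on later runs, fetch the already-uploaded dataset by name.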
# dataset = pb.datasets.from_file(file_path=customer_support_accounts_dataset_file_path, name=customer_support_accounts_dataset_name)
dataset = pb.datasets.get(dataset_ref=customer_support_accounts_dataset_name)

In [None]:
customer_support_accounts_adapter_repo_name: str = "customer_support_accounts_adapter"

In [None]:
# repo = pb.repos.create(name=customer_support_accounts_adapter_repo_name, description="customer support accounts fine-tuned adapter repository")
repo = pb.repos.get(repo_ref=customer_support_accounts_adapter_repo_name)

In [None]:
# Create an adapter
adapter = pb.adapters.create(
    config=FinetuningConfig(
        base_model=BASE_MODEL,
        epochs=3,
        # rank=8,
        learning_rate=0.0002,
    ),
    dataset=dataset,
    repo=customer_support_accounts_adapter_repo_name,
    description="fine-tune Mistral-7B-v0.1 with customer support dataset for accounts",
)

Successfully requested finetuning of mistralai/Mistral-7B-v0.1 as `customer_support_accounts_adapter/2`. (Job UUID: 1430f20e-d041-496f-a8e2-992573eb0e1b).

Watching progress of finetuning job 1430f20e-d041-496f-a8e2-992573eb0e1b. This call will block until the job has finished. Canceling or terminating this call will NOT cancel or terminate the job itself.

Job is starting. Total queue time: 0:00:46         
Waiting to receive training metrics...

┌────────────┬────────────┬─────────────────┐
│ checkpoint │ train_loss │ validation_loss │
├────────────┼────────────┼─────────────────┤
└────────────┴────────────┴─────────────────┘


In [None]:
customer_support_accounts_adapter = pb.adapters.get(adapter_id=f"{customer_support_accounts_adapter_repo_name}/1")

In [None]:
# pb.adapters.cancel(adapter_id=customer_support_accounts_adapter)

# Fine-Tuning Inference Performance

We can use [LoRAX](https://predibase.github.io/lorax/) for multi-LoRA adapter inference.

## What is LoRAX?

LoRAX (LoRA eXchange) is a framework built by Predibase that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.

## How does LoRAX work?

At inference time, your adapter is downloaded and loaded on top of the base model. Since each fine-tuned model at Predibase is an adapter, LoRAX can load multiple adapters simultaneously over the same base model and run inference against any of them. All downloaded adapters are kept in memory until a predefined memory threshold is hit, after which they are dynamically swapped out when requests for new adapters come in.
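The caching behaviour described above can be pictured with a toy cache. This is a sketch only, not LoRAX's actual implementation: it assumes a least-recently-used eviction policy and a capacity counted in adapters rather than bytes, and `AdapterCache` and `load_fn` are hypothetical names.

```python
from collections import OrderedDict
from typing import Callable


class AdapterCache:
    """Toy in-memory cache that keeps at most `capacity` adapters loaded."""

    def __init__(self, capacity: int, load_fn: Callable[[str], object]):
        self.capacity = capacity
        self.load_fn = load_fn          # stands in for downloading/loading adapter weights
        self._loaded: "OrderedDict[str, object]" = OrderedDict()
        self.evicted: list = []         # record of evicted adapter ids, for illustration

    def get(self, adapter_id: str) -> object:
        if adapter_id in self._loaded:
            # Cache hit: mark the adapter as most recently used.
            self._loaded.move_to_end(adapter_id)
            return self._loaded[adapter_id]
        # Cache miss: evict the least recently used adapter if the cache is full.
        if len(self._loaded) >= self.capacity:
            old_id, _ = self._loaded.popitem(last=False)
            self.evicted.append(old_id)
        self._loaded[adapter_id] = self.load_fn(adapter_id)
        return self._loaded[adapter_id]

    def loaded_ids(self) -> list:
        return list(self._loaded)


cache = AdapterCache(capacity=2, load_fn=lambda adapter_id: f"weights for {adapter_id}")
for request_adapter in ["payments/1", "orders/1", "payments/1", "accounts/1"]:
    cache.get(request_adapter)

print(cache.loaded_ids())  # adapters currently in memory
print(cache.evicted)       # adapters that were swapped out
```

In this toy run the second request for `payments/1` is a cache hit, so when `accounts/1` arrives it is `orders/1` that gets swapped out.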

## Other LoRAX features:
1. Structured Generation with schema enforcement: https://predibase.github.io/lorax/guides/structured_output/
2. Dynamic adapter merging: https://predibase.github.io/lorax/guides/merging_adapters/

We can grab each of our adapter names from the Predibase app:
![Screenshot 2024-04-17 at 23.22.02.png](attachment:57e5f045-84cb-4f67-aea5-23bd5857b9a2.png)

In [None]:
options: dict = {
    "max_new_tokens": 2048, # fine-tuned LLMs actually know how to stop early, so it will not hit the 2048 token limit set here
    "temperature": 0.1
}

In [None]:
# Get a reference to the base Mistral-7B model that our adapters were fine-tuned from

# Option 1: Grab references to the trained models using the repository names (repository_name/version_number format)
# adapter_id=f"{customer_support_payments_adapter_repo_name}/4"
# adapter_id=f"{customer_support_orders_adapter_repo_name}/1"
# adapter_id=f"{customer_support_accounts_adapter_repo_name}/1"

# Option 2: Grab models from HuggingFace
# adapter_id="predibase/customer_support_payments"
# adapter_id="predibase/customer_support_orders"
# adapter_id="predibase/customer_support_accounts"

Now we can prompt all of these models using the same `generate` method! For now, we'll just spot check them to make sure they have learned something reasonable!

In [None]:
# Payments Dataset
result = client.generate(
    """
    You are a support agent for a company and you receive requests from customers.
    Your job is to reply to the customer by providing both the intent, which you
    should determine from the customer's request, as well as an appropriate response.

    Please note that the intent can only be one of the following: check_payment_methods, get_invoice, check_refund_policy, track_refund, payment_issue, check_invoice, get_refund.

    Please package your reply in the JSON format.

    Request: can you help me check in which cases can I ask for refunds?

    Reply:
    """,
    adapter_id=f"{customer_support_payments_adapter_repo_name}/2",
    **options,
)
print(result.generated_text)

{"intent": "check_refund_policy", "response": "I'll take care of it! I understand your need to have a clear understanding of the situations in which you can request a refund. Here are some common scenarios where you may be eligible for a refund:\n\n1. **Product/Service Defect:** If the product or service you received is defective or doesn't meet the description provided, you can typically request a refund.\n2. **Cancellation within Grace Period:** If you change your mind and decide to cancel your purchase within the specified grace period, you should be able to request a refund.\n3. **Unauthorized Charges:** If you notice any charges on your account that you didn't authorize or recognize, please let us know, and we'll investigate the issue and assist you with a refund if necessary.\n4. **Event Cancellation:** If you purchased tickets for an event that gets canceled and no rescheduled date is announced, you may be eligible for a refund.\n5. **Duplicate Charges:** If you've been charged 

In [None]:
# Orders Dataset
result = client.generate(
    """
    You are a support agent for a company and you receive requests from customers.
    Your job is to reply to the customer by providing both the intent, which you should determine from the customer's request, as well as an appropriate response.

    Please note that the intent can only be one of the following: track_order, delivery_options, change_order, cancel_order, set_up_shipping_address, place_order, change_shipping_address, delivery_period.

    Please package your reply in the JSON format.

    Request: I can no longer afford order {{Order Number}}, cancel it

    Reply:
    """,
    adapter_id=f"{customer_support_orders_adapter_repo_name}/1",
    **options,
)
print(result.generated_text)

{"intent": "cancel_order", "response": "I've understood that you're facing financial difficulties and need to cancel order {{Order Number}}. We apologize for any inconvenience this may cause. To cancel your order, please follow these steps:\n\n1. Sign in to your {{Online Company Portal Info}} using your credentials.\n2. Navigate to the '{{Online Order Interaction}}' or '{{Online Order Interaction}}' section.\n3. Locate the order with the number {{Order Number}} and click on it to view the details.\n4. Look for the option labeled '{{Online Order Interaction}}' and select it.\n5. Follow any additional prompts or instructions to complete the cancellation process.\n\nIf you encounter any difficulties or have further questions, our dedicated customer support team is available during {{Customer Support Hours}} at {{Customer Support Phone Number}} or through the Live Chat feature on our website at {{Website URL}}. We appreciate your understanding and apologize for any inconvenience caused."}


In [None]:
# Accounts Dataset
result = client.generate(
    """
    You are a support agent for a company and you receive requests from customers.
    Your job is to reply to the customer by providing both the intent, which you
    should determine from the customer's request, as well as an appropriate response.

    Please note that the intent can only be one of the following: registration_problems, newsletter_subscription, recover_password, check_cancellation_fee, create_account, switch_account, edit_account, delete_account.

    Please package your reply in the JSON format.

    Request: where can I get information about opening {{Account Category}} accounts?

    Reply:
    """,
    adapter_id=f"{customer_support_accounts_adapter_repo_name}/1",
    **options,
)
print(result.generated_text)

{"intent": "create_account", "response": "I'm on it! I'm here to provide you with all the information you need about opening {{Account Category}} accounts. You can find detailed information about our {{Account Category}} accounts on our website. Simply visit our homepage and navigate to the \"Accounts\" section. There, you'll find a dedicated page that outlines the benefits, features, and eligibility criteria for our {{Account Category}} accounts. If you have any specific questions or need further assistance, feel free to reach out to our customer support team. They are available {{Customer Support Hours}} at {{Customer Support Phone Number}} or through the Live Chat on our website at {{Website URL}}. We're here to help you make an informed decision and ensure a seamless account opening experience."}
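The `{{...}}` placeholders in the responses above are left for the application to fill in. One way to do that, sketched here with a hypothetical `fill_placeholders` helper and made-up company values, is a regular-expression substitution that leaves unknown placeholders untouched. Parsing the JSON first and substituting only inside the `response` field avoids corrupting the JSON structure.

```python
import json
import re

PLACEHOLDER = re.compile(r"\{\{([^{}]+)\}\}")


def fill_placeholders(text: str, values: dict) -> str:
    """Replace {{Name}} with values["Name"]; unknown placeholders are kept as-is."""
    return PLACEHOLDER.sub(lambda m: values.get(m.group(1).strip(), m.group(0)), text)


# Hypothetical model output and company-specific values.
generated = (
    '{"intent": "create_account", "response": '
    '"Call us {{Customer Support Hours}} at {{Customer Support Phone Number}}."}'
)
company_values = {
    "Customer Support Hours": "9am-5pm",
    "Customer Support Phone Number": "555-0100",
}

reply = json.loads(generated)
reply["response"] = fill_placeholders(reply["response"], company_values)
print(reply["response"])
```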


# Summary / Takeaways

As can be seen, all of the fine-tuned models:

1. Return a valid JSON response.
2. Correctly identify the intent.
3. Return a free-form response with placeholder variables (such as `{{Customer Support Hours}}`), so that it is company-agnostic.
4. Return relevant responses that stop well short of the limit, despite a very high `max_new_tokens` value.

While we just spot-checked these examples, the fine-tuned models generally do well across entire evaluation sets. Feel free to give them a try after you finish fine-tuning your models!
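A fuller evaluation could score every test example on JSON validity and intent accuracy. The sketch below assumes you have already collected the generated strings and the expected intents; `score_outputs` is a hypothetical helper, and the sample data is invented for illustration rather than taken from a real run.

```python
import json


def score_outputs(generated_texts, expected_intents, allowed_intents):
    """Return the fraction of outputs that are valid JSON and that have the right intent."""
    valid_json = 0
    correct_intent = 0
    for text, expected in zip(generated_texts, expected_intents):
        try:
            reply = json.loads(text)
        except json.JSONDecodeError:
            continue  # invalid JSON counts as a miss on both metrics
        if not isinstance(reply, dict) or not {"intent", "response"} <= reply.keys():
            continue  # must be an object with both expected keys
        valid_json += 1
        if reply["intent"] in allowed_intents and reply["intent"] == expected:
            correct_intent += 1
    total = len(expected_intents)
    return {"valid_json_rate": valid_json / total, "intent_accuracy": correct_intent / total}


# Invented sample data for illustration.
outputs = [
    '{"intent": "get_refund", "response": "Sure."}',
    '{"intent": "get_invoice", "response": "Here it is."}',
    '{"intent": "get_refund", "response": "Truncated',
    '{"intent": "track_refund", "response": "On it."}',
]
expected = ["get_refund", "get_invoice", "get_refund", "get_refund"]
scores = score_outputs(
    outputs, expected, allowed_intents={"get_refund", "get_invoice", "track_refund"}
)
print(scores)
```

Scoring the free-form `response` text is harder and is deliberately left out of this sketch.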

# Resources
1. Predibase Free Trial: https://predibase.com/free-trial
2. Predibase Docs: https://docs.predibase.com/
3. LoraLand:
  - Demo: https://predibase.com/lora-land
  - Launch Blog: https://predibase.com/blog/lora-land-fine-tuned-open-source-llms-that-outperform-gpt-4
4. More about LoRAX: https://predibase.com/blog/lorax-the-open-source-framework-for-serving-100s-of-fine-tuned-llms-in  