# RAFT

## Synthetic data generation phase

### Load domain specific documents we want to optimize RAG for

TODO

### Generate Q/A/CoT fine-tuning dataset using RAFT from the domain specific documents

In [None]:
doc_path = "../sample_data/UC_Berkeley_short.pdf"
ds_path = "ucb-short"

In [None]:
! python3 ../raft.py \
    --datapath $doc_path \
    --output $ds_path \
    --distractors 3 \
    --doctype pdf \
    --chunk_size 512 \
    --questions 2 \
    --completion_model gpt-4-turbo \
    --embedding_model text-embedding-ada-002
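
To make the `--chunk_size` and `--distractors` flags more concrete, here is a minimal sketch of how a RAFT-style context could be assembled: the document is split into chunks, and each chunk (the "oracle") is mixed with a few other chunks that act as distractors. This is not the code in `raft.py`. The helper names `chunk_text` and `build_raft_contexts` are hypothetical, chunking by character count is a simplifying assumption, and the sketch always keeps the oracle chunk in the context.

```python
import random


def chunk_text(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def build_raft_contexts(chunks, num_distractors, seed=0):
    """Pair every chunk (the oracle) with randomly sampled distractor chunks."""
    rng = random.Random(seed)
    samples = []
    for i, oracle in enumerate(chunks):
        # Every chunk except the oracle is a candidate distractor.
        others = chunks[:i] + chunks[i + 1:]
        distractors = rng.sample(others, min(num_distractors, len(others)))
        context = distractors + [oracle]
        # Shuffle so the oracle position carries no signal.
        rng.shuffle(context)
        samples.append({"oracle": oracle, "context": context})
    return samples


# Synthetic stand-in for the text extracted from a document.
text = "".join(f"Sentence number {n}. " for n in range(40))
chunks = chunk_text(text, chunk_size=64)
samples = build_raft_contexts(chunks, num_distractors=3)
print(len(chunks), "chunks;", len(samples[0]["context"]), "chunks per context")
```

In the real pipeline a question and a chain-of-thought answer would then be generated for each oracle chunk with the completion model; that step is omitted here.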

### Convert generated HuggingFace arrow dataset to JSONL format suitable for fine-tuning

In [None]:
# TODO
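
Until this cell is filled in, the conversion could look roughly like the sketch below, which writes one JSON object per line with `prompt` and `completion` keys. The source field names `instruction` and `cot_answer`, the helper `records_to_jsonl`, and the placeholder rows are all assumptions for illustration. With the `datasets` package installed, the rows could presumably come from `datasets.load_from_disk(ds_path)`, assuming the generation step saved the dataset in that layout.

```python
import json
import os
import tempfile


def records_to_jsonl(records, path, prompt_field="instruction", completion_field="cot_answer"):
    """Write one {"prompt", "completion"} JSON object per line and return the row count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            row = {
                "prompt": record[prompt_field],
                "completion": record[completion_field],
            }
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count


# Hypothetical placeholder rows standing in for the generated dataset.
example_records = [
    {"instruction": "<context 1>\n<question 1>", "cot_answer": "<reasoning 1> <answer 1>"},
    {"instruction": "<context 2>\n<question 2>", "cot_answer": "<reasoning 2> <answer 2>"},
]

out_path = os.path.join(tempfile.mkdtemp(), "raft.jsonl")
written = records_to_jsonl(example_records, out_path)
print(f"wrote {written} rows to {out_path}")
```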

## Fine-tuning phase

### Loading the model to fine-tune

We use the `llama-2-7b` model to show how to fine-tune a model for the text-completion task. If you opened this notebook from a specific model card, remember to replace the model name. If you need to fine-tune a model that is available on HuggingFace but not in the `azureml` system registry, you can [import](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/system/import/import_model_into_registry.ipynb) the model first.

### Outline
* Pick a model to fine-tune.
* Pick and explore training data.
* Configure the fine tuning job.
* Run the fine tuning job.
* Review training metrics.
* Deploy the fine tuned model for real time inference. [TODO]
* Clean up resources.  [TODO]

### 1. Setup pre-requisites
* Install dependencies
* Connect to AzureML Workspace. Learn more at [set up SDK authentication](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?tabs=sdk). Replace  `<WORKSPACE_NAME>`, `<RESOURCE_GROUP>` and `<SUBSCRIPTION_ID>` below.
* Connect to `azureml` system registry
* Set an optional experiment name

Install dependencies by running the cell below. This step is required when running in a new environment.

In [None]:
%pip install azure-storage-file-datalake==12.14.0
%pip install azure-ai-ml
%pip install azure-identity

%pip install mlflow
%pip install azureml-mlflow

Install the dependencies needed to download Hugging Face datasets.

In [None]:
%pip install datasets==2.9.0
%pip install py7zr

In [31]:
from azure.ai.ml import MLClient
from azure.identity import (
    DefaultAzureCredential,
    InteractiveBrowserCredential,
)

try:
    credential = DefaultAzureCredential()
    # Check that the default credential can actually obtain a token
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to an interactive browser login when no default credential works
    credential = InteractiveBrowserCredential()

try:
    # Prefer the workspace details stored in a local config.json
    workspace_ml_client = MLClient.from_config(credential=credential)
    print("Loaded ML Client configuration from config.json")
except Exception:
    print("Loading ML Client configuration directly")
    workspace_ml_client = MLClient(
        credential,
        subscription_id="<SUBSCRIPTION_ID>",
        resource_group_name="<RESOURCE_GROUP>",
        workspace_name="<WORKSPACE_NAME>",
    )

# models, fine-tuning pipelines and environments are available in the AzureML system registries: "azureml", and "azureml-meta" for the Llama models
registry_ml_client = MLClient(credential, registry_name="azureml")
registry_ml_client_meta = MLClient(credential, registry_name="azureml-meta")

DefaultAzureCredential failed to retrieve a token from the included credentials.
Attempted credentials:
	EnvironmentCredential: EnvironmentCredential authentication unavailable. Environment variables are not fully configured.
Visit https://aka.ms/azsdk/python/identity/environmentcredential/troubleshoot to troubleshoot this issue.
	ManagedIdentityCredential: ManagedIdentityCredential authentication unavailable, no response from the IMDS endpoint.
	SharedTokenCacheCredential: SharedTokenCacheCredential authentication unavailable. No accounts were found in the cache.
	AzureCliCredential: Azure CLI not found on path
	AzurePowerShellCredential: PowerShell is not installed
	AzureDeveloperCliCredential: Azure Developer CLI could not be found. Please visit https://aka.ms/azure-dev for installation instructions and then,once installed, authenticate to your Azure account using 'azd auth login'.
To mitigate this issue, please refer to the troubleshooting guidelines here at https://aka.ms/azsdk/py

Loaded ML Client configuration from config.json


### 2. Pick a foundation model to fine tune

Decoder-based LLMs like `llama` perform well on `text-completion` tasks, but we need to fine-tune the model for our specific purpose before using it. You can browse these models in the Model Catalog in AzureML Studio, filtering by the `text-completion` task. In this example, we use the `llama-2-7b` model. If you opened this notebook for a different model, replace the model name and version accordingly.

Note the `id` property of the model. It is passed as input to the fine-tuning job, and it is also available as the `Asset ID` field on the model details page in the AzureML Studio Model Catalog.

In [32]:
model_name = "Llama-2-7b"
foundation_model = registry_ml_client_meta.models.get(model_name, label="latest")
print(f"Using model name: {foundation_model.name}, version: {foundation_model.version}, id: {foundation_model.id} for fine tuning")

Using model name: Llama-2-7b, version: 20, id: azureml://registries/azureml-meta/models/Llama-2-7b/versions/20 for fine tuning


In [33]:
from azure.ai.ml.constants._common import AssetTypes
from azure.ai.ml.entities._inputs_outputs import Input

mlflow_model_llama = Input(type=AssetTypes.MLFLOW_MODEL, path=foundation_model.id)

### 4. Pick the dataset for fine-tuning the model

We use the [samsum](https://huggingface.co/datasets/samsum) dataset. The next few cells show basic data preparation for fine tuning:
* Visualize some data rows
* Preprocess the data into the required format. This is an important step for text completion because we add the required sequences/separators to the data. This is how we repurpose the text-completion task for a specific task such as summarization or translation.
* During fine-tuning, the `text` column is concatenated with the `ground_truth` column to produce the fine-tuning input. Hence, the data should be prepared such that `text + ground_truth` is your actual fine-tuning data.
* The bos and eos tokens are added to the data by the fine-tuning pipeline; you do not need to add them explicitly.
* We want this sample to run quickly, so we save smaller `train`, `validation` and `test` files containing 10% of the original. This means the fine-tuned model will have lower accuracy, so it should not be put to real-world use.

##### Here is an example of how the data should look

Text completion requires the training data to include at least two fields, one for `text` and one for `ground_truth`, as in this example. The examples below are from the samsum dataset.

Original dataset:

| dialogue (text) | summary (ground_truth) |
| :- | :- |
| Eric: MACHINE!\r\nRob: That's so gr8!\r\nEric: I know! And shows how Americans see Russian ;)\r\nRob: And it's really funny!\r\nEric: I know! I especially like the train part!\r\nRob: Hahaha! No one talks to the machine like that!\r\nEric: Is this his only stand-up?\r\nRob: Idk. I'll check.\r\nEric: Sure.\r\nRob: Turns out no! There are some of his stand-ups on youtube.\r\nEric: Gr8! I'll watch them now!\r\nRob: Me too!\r\nEric: MACHINE!\r\nRob: MACHINE!\r\nEric: TTYL?\r\nRob: Sure :) | Eric and Rob are going to watch a stand-up on youtube. | 
| Will: hey babe, what do you want for dinner tonight?\r\nEmma:  gah, don't even worry about it tonight\r\nWill: what do you mean? everything ok?\r\nEmma: not really, but it's ok, don't worry about cooking though, I'm not hungry\r\nWill: Well what time will you be home?\r\nEmma: soon, hopefully\r\nWill: you sure? Maybe you want me to pick you up?\r\nEmma: no no it's alright. I'll be home soon, i'll tell you when I get home. \r\nWill: Alright, love you. \r\nEmma: love you too. | Emma will be home soon and she will let Will know. | 

Formatted dataset the user might pass:

| text (text) | summary (ground_truth) |
| :- | :- |
| Summarize this dialog:\nEric: MACHINE!\r\nRob: That's so gr8!\r\nEric: I know! And shows how Americans see Russian ;)\r\nRob: And it's really funny!\r\nEric: I know! I especially like the train part!\r\nRob: Hahaha! No one talks to the machine like that!\r\nEric: Is this his only stand-up?\r\nRob: Idk. I'll check.\r\nEric: Sure.\r\nRob: Turns out no! There are some of his stand-ups on youtube.\r\nEric: Gr8! I'll watch them now!\r\nRob: Me too!\r\nEric: MACHINE!\r\nRob: MACHINE!\r\nEric: TTYL?\r\nRob: Sure :)\n---\nSummary:\n | Eric and Rob are going to watch a stand-up on youtube. | 
| Summarize this dialog:\nWill: hey babe, what do you want for dinner tonight?\r\nEmma:  gah, don't even worry about it tonight\r\nWill: what do you mean? everything ok?\r\nEmma: not really, but it's ok, don't worry about cooking though, I'm not hungry\r\nWill: Well what time will you be home?\r\nEmma: soon, hopefully\r\nWill: you sure? Maybe you want me to pick you up?\r\nEmma: no no it's alright. I'll be home soon, i'll tell you when I get home. \r\nWill: Alright, love you. \r\nEmma: love you too. \n---\nSummary:\n | Emma will be home soon and she will let Will know. | 
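
The formatting step shown in the two tables can be sketched as a small function. It wraps a dialogue in the same `Summarize this dialog:` / `---` / `Summary:` template used in the formatted rows above and keeps the summary as the ground truth. The helper name `format_for_summarization` and the sample dialogue are hypothetical.

```python
def format_for_summarization(dialogue: str, summary: str) -> dict:
    """Wrap a dialogue in the prompt template and keep the summary as ground truth."""
    return {
        "text": "Summarize this dialog:\n" + dialogue + "\n---\nSummary:\n",
        "ground_truth": summary,
    }


# Hypothetical two-line dialogue, used only to show the shape of a row.
row = format_for_summarization(
    dialogue="A: Lunch at noon?\r\nB: Sure, see you then.",
    summary="A and B will have lunch at noon.",
)

# The string the model is trained on is the concatenation of both fields.
training_string = row["text"] + row["ground_truth"]
print(training_string)
```

Because the prompt ends with `Summary:\n`, the model learns to continue the text with the summary, which is how a completion task is repurposed for summarization.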
 

In [34]:
# load the ./dataset/intents-pc-16k-1000.jsonl file into a pandas dataframe and show the first 5 rows
import pandas as pd

pd.set_option(
    "display.max_colwidth", 0
)  # set the max column width to 0 to display the full text
df = pd.read_json("./dataset/intents-pc-16k-1000.jsonl", lines=True)
df.head()

Unnamed: 0,prompt,completion
0,You are a robot that only outputs JSON. You reply in JSON format with the field 'intent' which can only take values 'support' or 'chat'.\n\nExample question: Do you have any lightweight tents for solo backpacking in stock?\nExample answer:,"{""intent"": ""chat""}STOP"
1,You are a robot that only outputs JSON. You reply in JSON format with the field 'intent' which can only take values 'support' or 'chat'.\n\nExample question: Do you have any lightweight tents in stock that can accommodate two people?\nExample answer:,"{""intent"": ""chat""}STOP"
2,You are a robot that only outputs JSON. You reply in JSON format with the field 'intent' which can only take values 'support' or 'chat'.\n\nExample question: Do you have any lightweight tents in stock?\nExample answer:,"{""intent"": ""chat""}STOP"
3,You are a robot that only outputs JSON. You reply in JSON format with the field 'intent' which can only take values 'support' or 'chat'.\n\nExample question: Do you have any lightweight tents on sale this month?\nExample answer:,"{""intent"": ""chat""}STOP"
4,You are a robot that only outputs JSON. You reply in JSON format with the field 'intent' which can only take values 'support' or 'chat'.\n\nExample question: Do you have any lightweight tents on sale this week?\nExample answer:,"{""intent"": ""chat""}STOP"


In [35]:
# split dataset into 80% train / 20% validation (rows keep their file order, no shuffling)
split_idx = int(0.8 * len(df))
train_df, validation_df = df.iloc[:split_idx], df.iloc[split_idx:]
train_df.to_json("./dataset/train.jsonl", orient="records", lines=True)
validation_df.to_json("./dataset/validation.jsonl", orient="records", lines=True)


In [36]:
test_df = pd.read_json("./dataset/intents-pc-16k-test-1999.jsonl", lines=True)

### 5. Submit the fine tuning job using the model and data as inputs
 
Create the job that uses the `text-generation` pipeline component. [Learn more](https://github.com/Azure/azureml-assets/blob/main/assets/training/finetune_acft_hf_nlp/components/pipeline_components/text_generation/README.md) about all the parameters supported for fine tuning.

Define fine-tuning parameters

Fine-tuning parameters can be grouped into two categories: training parameters and optimization parameters.

Training parameters define the training aspects, such as:
1. the optimizer and scheduler to use
2. the metric to optimize during fine-tuning
3. the number of training steps and the batch size

Optimization parameters help optimize GPU memory and make effective use of compute resources. Below are a few of the parameters in this category. _The optimization parameters differ for each model and are packaged with the model to handle these variations._
1. enable DeepSpeed, ORT and LoRA
2. enable mixed precision training
3. enable multi-node training
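
As a rough worked example of how the batch size, epoch count and device count interact, the sketch below estimates the number of optimizer steps for a run. The helper `optimizer_steps` is hypothetical, the figure of 800 training examples is an assumed value for illustration, and real trainers may round slightly differently.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, num_epochs,
                    num_devices=1, grad_accum_steps=1):
    """Estimate optimizer steps: ceil(examples / effective batch) per epoch."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


# Assumed values: 800 training examples, batch size 1, one epoch, one device.
steps = optimizer_steps(num_examples=800, per_device_batch_size=1, num_epochs=1)
print("single device:", steps)

# The same data spread over 8 devices needs fewer steps per epoch.
print("eight devices:", optimizer_steps(800, 1, 1, num_devices=8))
```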

#### Create data inputs

In [37]:
from azure.ai.ml.entities._inputs_outputs import Input

training_data = Input(type="uri_file", path="./dataset/train.jsonl")
validation_data = Input(type="uri_file", path="./dataset/validation.jsonl")

Create FineTuning job object

In [38]:

from azure.ai.ml.entities._job.finetuning.custom_model_finetuning_job import CustomModelFineTuningJob
import uuid
from azure.ai.ml._restclient.v2024_01_01_preview.models import (
    FineTuningTaskType,
)
from azure.ai.ml.entities._inputs_outputs import Output

guid = uuid.uuid4()
short_guid = str(guid)[:8]

custom_model_finetuning_job = CustomModelFineTuningJob(
    task=FineTuningTaskType.TEXT_COMPLETION,
    training_data=training_data,
    validation_data=validation_data,
    hyperparameters={
        "per_device_train_batch_size": "1",
        "learning_rate": "0.00002",
        "num_train_epochs": "1",
    },
    model=mlflow_model_llama,
    display_name=f"llama-display-name-{short_guid}",
    name=f"llama-{short_guid}",
    experiment_name="llama-finetuning-experiment",
    tags={"agent": "gorilla-raft-notebook"},
    properties={"rocks": True},
    outputs={"registered_model": Output(type="mlflow_model", name=f"llama-finetune-registered-{short_guid}")},
)

Submit FineTuningJob

In [39]:
created_job = workspace_ml_client.jobs.create_or_update(custom_model_finetuning_job)
created_job.studio_url

Exception: 

1) One or more fields are invalid

Details: 

(x) Supported input path value are ARM id, AzureML id, remote uri or local path.
Met <class 'azure.core.exceptions.ServiceRequestError'>:
<urllib3.connection.HTTPSConnection object at 0xfffef412f4c0>: Failed to resolve 'stcviberkele092349915883.blob.core.windows.net' ([Errno -2] Name or service not known)

Resolutions: 
1) Double-check that all specified parameters are of the correct types and formats prescribed by the Job schema.
If using the CLI, you can also check the full log in debug mode for more details by adding --debug to the end of your command

Additional Resources: The easiest way to author a yaml specification file is using IntelliSense and auto-completion Azure ML VS code extension provides: https://code.visualstudio.com/docs/datascience/azure-machine-learning. To set up VS Code, visit https://docs.microsoft.com/azure/machine-learning/how-to-setup-vs-code
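
The underlying failure in the output above is a name resolution error for a storage host, not a schema problem. One quick way to check whether a hostname resolves from the notebook environment is the standard-library sketch below. The helper `can_resolve` is hypothetical, and the two hosts are placeholders to be replaced with the storage host from the error message.

```python
import socket


def can_resolve(hostname: str, port: int = 443) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        # Raised when the name cannot be resolved
        return False


# Placeholder hosts: a numeric address and a reserved, non-existent name.
for host in ["127.0.0.1", "nonexistent.invalid"]:
    print(host, "resolves:", can_resolve(host))
```

If the storage host does not resolve, the cause is more likely network or DNS configuration (for example a private endpoint or a deleted storage account) than the job definition itself.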


### 6. Review training and evaluation metrics
Viewing the job in AzureML studio is the best way to analyze logs, metrics and outputs of jobs. You can create custom charts and compare metrics across different jobs. See https://learn.microsoft.com/en-us/azure/machine-learning/how-to-log-view-metrics?tabs=interactive#view-jobsruns-information-in-the-studio to learn more.

However, we may need to access and review metrics programmatically. For this we use MLflow, the recommended client for logging and querying metrics.

In [None]:
import mlflow, json

mlflow_tracking_uri = workspace_ml_client.workspaces.get(
    workspace_ml_client.workspace_name
).mlflow_tracking_uri
mlflow.set_tracking_uri(mlflow_tracking_uri)
# build the filter from 'tags.mlflow.rootRunId=' and the submitted job's name in single quotes
run_filter = "tags.mlflow.rootRunId='" + created_job.name + "'"
runs = mlflow.search_runs(
    # reuse the experiment name that was set on the fine-tuning job
    experiment_names=[created_job.experiment_name],
    filter_string=run_filter,
    output_format="list",
)
training_run = None
evaluation_run = None
# get the training and evaluation runs.
# using a hacky way till 'Bug 2320997: not able to show eval metrics in FT notebooks - mlflow client now showing display names' is fixed
for run in runs:
    # check if run.data.metrics.epoch exists
    if "epoch" in run.data.metrics:
        training_run = run

In [None]:
if training_run:
    print("Training metrics:\n\n")
    print(json.dumps(training_run.data.metrics, indent=2))
else:
    print("No Training job found")
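
When reading the printed metrics, a cross-entropy loss is often easier to interpret as perplexity, which is the exponential of the loss. The sketch below shows the conversion. The metric names `loss` and `eval_loss` and the values are hypothetical; the actual keys depend on what the fine-tuning pipeline logs.

```python
import math


def perplexity_from_loss(loss: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)


# Hypothetical metric values, standing in for training_run.data.metrics.
metrics = {"loss": 1.25, "eval_loss": 1.40, "epoch": 1.0}

for key in ("loss", "eval_loss"):
    if key in metrics:
        ppl = perplexity_from_loss(metrics[key])
        print(f"{key}: {metrics[key]:.2f} -> perplexity {ppl:.2f}")
```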

### 8. Deploy the fine tuned model to an online endpoint [TODO: Need some work]
Online endpoints give a durable REST API that can be used to integrate with applications that need to use the model.

In [None]:
import time, sys
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    ProbeSettings,
    OnlineRequestSettings,
)

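# Create online endpoint - endpoint names need to be unique in a region, hence using timestamp to create unique endpoint name
timestamp = str(int(time.time()))
online_endpoint_name = "samsum-textgen-" + timestamp
# NOTE: `registered_model` must refer to the fine-tuned model registered in the workspace; it is not defined earlier in this notebook
# create an online endpoint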
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Online endpoint for "
    + registered_model.name
    + ", fine tuned model for samsum textgen",
    auth_mode="key",
)
workspace_ml_client.begin_create_or_update(endpoint).wait()

The SKUs supported for deployment are listed here: [Managed online endpoints SKU list](https://learn.microsoft.com/en-us/azure/machine-learning/reference-managed-online-endpoints-vm-sku-list)

In [None]:
# create a deployment
demo_deployment = ManagedOnlineDeployment(
    name="demo",
    endpoint_name=online_endpoint_name,
    model=registered_model.id,
    instance_type="Standard_E64s_v3",
    instance_count=1,
    liveness_probe=ProbeSettings(initial_delay=600),
    request_settings=OnlineRequestSettings(request_timeout_ms=90000),
)
workspace_ml_client.online_deployments.begin_create_or_update(demo_deployment).wait()
endpoint.traffic = {"demo": 100}
workspace_ml_client.begin_create_or_update(endpoint).result()

### 9. Test the endpoint with sample data

We fetch some sample data from the test dataset and submit it to the online endpoint for inference. We then display the scored labels alongside the ground truth labels.

In [None]:
# read ./samsum-dataset/small_test.jsonl into a pandas dataframe
test_df = pd.read_json("./samsum-dataset/small_test.jsonl", lines=True)
# take 2 random samples
test_df = test_df.sample(n=2)
# rebuild index
test_df.reset_index(drop=True, inplace=True)
# rename the label_string column to ground_truth_label
test_df = test_df.rename(columns={"label_string": "ground_truth_label"})
test_df.head(2)

In [None]:
# create a json object with the key as "input_data" and value as a list of values from the text column of the test dataframe
test_json = {"input_data": {"text": list(test_df["text"])}}
# save the json object to a file named sample_score.json in the ./samsum-dataset folder
with open("./samsum-dataset/sample_score.json", "w") as f:
    json.dump(test_json, f)

In [None]:
# score the sample_score.json file using the online endpoint with the azureml endpoint invoke method
response = workspace_ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name="demo",
    request_file="./samsum-dataset/sample_score.json",
)
print("raw response: \n", response, "\n")
# convert the response to a pandas dataframe and rename the label column as scored_label
response_df = pd.read_json(response)
response_df = response_df.rename(columns={0: "scored_label"})
response_df.head(2)
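
The request and response handling in the cells above can be sketched without the endpoint. The request body has the `{"input_data": {"text": [...]}}` shape used for `sample_score.json`. The response is assumed here to be a JSON array with one entry per input text, which is an assumption about the deployed model's output rather than a documented contract. Both helper names are hypothetical.

```python
import json


def build_request(texts):
    """Build the scoring payload with the input texts under input_data.text."""
    return {"input_data": {"text": list(texts)}}


def parse_response(raw: str):
    """Parse a response that is assumed to be a JSON array, one entry per input."""
    parsed = json.loads(raw)
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON array, got " + type(parsed).__name__)
    return parsed


payload = build_request(["<prompt 1>", "<prompt 2>"])
print(json.dumps(payload))

# Placeholder response string standing in for the endpoint's reply.
scored = parse_response('["<completion 1>", "<completion 2>"]')
for prompt, completion in zip(payload["input_data"]["text"], scored):
    print(prompt, "->", completion)
```

Checking the response type before building a dataframe makes an error payload surface as a clear exception instead of a confusing table.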

In [None]:
# merge the test dataframe and the response dataframe on the index
merged_df = pd.merge(test_df, response_df, left_index=True, right_index=True)
merged_df.head(2)

### 10. Delete the online endpoint
Don't forget to delete the online endpoint. Otherwise the billing meter keeps running for the compute used by the endpoint.

In [None]:
workspace_ml_client.online_endpoints.begin_delete(name=online_endpoint_name).wait()