# [CEPF L300]: Tune Gemini Model by using Supervised Fine-tuning

## Getting Started

### Install Vertex AI SDK and other required packages

In [13]:
!pip3 install --upgrade --user --quiet google-cloud-aiplatform

## Set Project and Location

First, set your `PROJECT_ID`, `LOCATION`, and `BUCKET_NAME`. The cell below reads the project and region from your `gcloud` configuration. You can also use an existing bucket within the project.

In [14]:
project_id_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
PROJECT_ID = project_id_output[0]
REGION = !gcloud compute project-info describe --format="value[](commonInstanceMetadata.items.google-compute-default-region)"
LOCATION = REGION[0]
print(f"REGION={LOCATION}")

BUCKET_NAME = f"{PROJECT_ID}-model-dataset"
BUCKET_URI = f"gs://{BUCKET_NAME}"

REGION=europe-west4


## Import Libraries

In [15]:
import vertexai
from vertexai.generative_models import (
    GenerativeModel,
    Part,
    HarmCategory,
    HarmBlockThreshold,
    GenerationConfig,
)
from vertexai.preview.tuning import sft

vertexai.init(project=PROJECT_ID, location=LOCATION)

from typing import Union
import pandas as pd
from google.cloud import bigquery
from sklearn.model_selection import train_test_split
import datetime
import time

## Generate the training and validation dataset files

To create a tuning job, you need a question-and-answer dataset with context, stored in JSON Lines (JSONL) format.

Supervised fine-tuning adapts a foundation model to a new task by training it on labeled examples. You can create a supervised text model tuning job using the Google Cloud console, the API, or the Vertex AI SDK for Python. For more information, refer to the [documentation page](https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning).

But how do you ensure your data is primed for success with supervised fine-tuning? Here are the critical areas to focus on:

- **Domain Alignment:** Supervised fine-tuning thrives on smaller datasets, but they must be highly relevant to your downstream task. Look for data that closely mirrors the domain you will encounter in real-world use cases.
- **Labeling Accuracy:** Noisy labels sabotage even the best technique. Prioritize accuracy in your annotations and labeling.
- **Noise Reduction:** Outliers, inconsistencies, or irrelevant examples hurt model adaptation. Implement preprocessing, such as removing duplicates, fixing typos, and verifying that data conforms to your task's expectations.
- **Distribution:** A diverse range of examples helps your model generalize better within the confines of your target task. Refrain from overloading the process with excessive variance that strays from your core domain.
- **Balanced Classes:** For classification tasks, try to keep a reasonable balance between classes so that the model does not learn a bias towards a specific class.
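
As a rough illustration of the noise-reduction point above, the sketch below drops empty and duplicate question/answer pairs using only the standard library. The helper name `clean_examples` is hypothetical, and the `input_text` and `output_text` keys are assumed to match the column names used later in this notebook. It is not part of the Vertex AI SDK.

```python
import re


def clean_examples(records):
    """Drop empty and duplicate Q&A pairs (hypothetical helper, not an SDK call)."""
    seen = set()
    cleaned = []
    for record in records:
        question = record.get("input_text") or ""
        answer = record.get("output_text") or ""
        if not question.strip() or not answer.strip():
            continue  # skip examples with a missing question or answer
        # Compare on a normalized key so whitespace or case differences
        # do not hide duplicates; the stored text itself is left unchanged.
        key = (
            re.sub(r"\s+", " ", question).strip().lower(),
            re.sub(r"\s+", " ", answer).strip().lower(),
        )
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


examples = [
    {"input_text": "How do I reverse a list?", "output_text": "Use reversed()."},
    {"input_text": "How do I  reverse a list?", "output_text": "Use reversed()."},
    {"input_text": "Empty answer?", "output_text": "   "},
]
print(len(clean_examples(examples)))  # 1
```

The original text of each kept record is preserved, which matters for answers that contain code, where whitespace is significant.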


### Fetching data from BigQuery

Your model tuning dataset must be in JSONL format, where each line contains a single training example. Make sure that each example includes an instruction.

You will use the [StackOverflow dataset](https://cloud.google.com/blog/topics/public-datasets/google-bigquery-public-datasets-now-include-stack-overflow-q-a) on BigQuery Public Datasets, limited to questions with the `python` tag that have an accepted answer created on or after 2020-01-01.

You use a helper function to read the data from BigQuery and create a Pandas dataframe.

In [16]:
def run_bq_query(sql: str) -> pd.DataFrame:
    """
    Run a BigQuery query and return the result as a DataFrame.
    Args:
        sql: SQL query, as a string, to execute in BigQuery
    Returns:
        df: DataFrame of results from the query (errors surface as exceptions)
    """

    bq_client = bigquery.Client(project=PROJECT_ID)

    # Try dry run before executing query to catch any errors
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    bq_client.query(sql, job_config=job_config)

    # If dry run succeeds without errors, proceed to run query
    job_config = bigquery.QueryJobConfig()
    client_result = bq_client.query(sql, job_config=job_config)

    job_id = client_result.job_id

    # Wait for the query job to finish, then fetch the result as a DataFrame
    df = client_result.result().to_arrow().to_pandas()
    print(f"Finished job_id: {job_id}")

    return df

Next, write the query, limiting the results to 550 examples.

**TODO:** Update the query below to limit the results to 550.

In [17]:
stack_overflow_df = run_bq_query(
    """SELECT
           CONCAT(q.title, q.body) AS input_text,
           a.body AS output_text
       FROM `bigquery-public-data.stackoverflow.posts_questions` q
       JOIN `bigquery-public-data.stackoverflow.posts_answers` a
         ON q.accepted_answer_id = a.id
       WHERE q.accepted_answer_id IS NOT NULL
         AND REGEXP_CONTAINS(q.tags, "python")
         AND a.creation_date >= "2020-01-01"
       LIMIT 550
    """
)

stack_overflow_df.head()

Finished job_id: 9ac4d829-bffd-4b1b-bb24-8653c0440c53


Unnamed: 0,input_text,output_text
0,mark_area resetting the Y-axis domain (Altair)...,<p>Area charts include zero by default. To cha...
1,Not sure how a node is passed as parameter in ...,<p>I will try to explain with a simple example...
2,Loop through XML in Python<p>My data set is as...,<p>Looping can be done in a list comprehension...
3,Runtime error in CSES problem set Missing Numb...,<p>You might want to try:</p>\n<pre><code>n=in...
4,python django logging: One single logger with ...,<p>Best solution I found was to create multipl...


There should be 550 questions and answers.

In [18]:
print(len(stack_overflow_df))

550


### Adding instructions
Fine-tuning language models on a collection of datasets phrased as instructions improves model performance and generalization to unseen tasks [(Google, 2022)](https://arxiv.org/pdf/2210.11416.pdf).

An instruction is a directive that tells the model which task to perform. Instructions can take various forms, such as step-by-step procedures, commands, or rules. Without an instruction, each example is only a question and an answer; the instruction tells the large language model what to do with the question, which in this case is to answer it. Extend the dataset with an instruction.

In [19]:
INSTRUCTION_TEMPLATE = f"""\
You are a helpful Python developer \
You are good at answering Stackoverflow questions \
Your mission is to provide developers with helpful answers that work
"""

Create a new column for the `INSTRUCTION_TEMPLATE`. Add a new column rather than overwriting `input_text`, because you might want to use the original text later.

In [20]:
stack_overflow_df["input_text_instruct"] = INSTRUCTION_TEMPLATE

stack_overflow_df.head(2)

Unnamed: 0,input_text,output_text,input_text_instruct
0,mark_area resetting the Y-axis domain (Altair)...,<p>Area charts include zero by default. To cha...,You are a helpful Python developer You are goo...
1,Not sure how a node is passed as parameter in ...,<p>I will try to explain with a simple example...,You are a helpful Python developer You are goo...


**TODO:**
Next, split the data into training and evaluation sets. For extractive Q&A tasks, 500+ training examples are advised. In this case, you use 440 so that the tuning job runs faster.

Hold out 20% of your dataset for evaluation. The `random_state` parameter controls the shuffling applied to the data before the split. Pass an int for reproducible output across multiple function calls.

In [21]:
# TODO: Update the test_size to select 20% of data for evaluation
train, evaluation = train_test_split(stack_overflow_df, test_size=0.2, random_state=42)

# Warning - Don't change the following print statements. They are used for score tracking.
# Please don't forget to save this notebook script.
print('Total number of records in training dataset:',len(train))
print('Total number of records in validation dataset:',len(evaluation))

Total number of records in training dataset: 440
Total number of records in validation dataset: 110
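
To make the split arithmetic concrete, here is a dependency-free sketch of a seeded shuffle followed by a 20% hold-out. The helper `split_records` is hypothetical and is not how scikit-learn implements `train_test_split`, so the resulting row order differs from the cell above. It only illustrates why a fixed seed gives the same 440/110 split on every run.

```python
import random


def split_records(records, test_size=0.2, seed=42):
    """Shuffle with a fixed seed, then hold out a fraction for evaluation."""
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * test_size)
    # The first n_eval shuffled rows become the evaluation set.
    return shuffled[n_eval:], shuffled[:n_eval]


rows = list(range(550))
train_rows, eval_rows = split_records(rows)
print(len(train_rows), len(eval_rows))  # 440 110
```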


In [10]:
train.head()

Unnamed: 0,input_text,output_text,input_text_instruct
482,How to get time.sleep working in PyCharm?<pre>...,<p>Add <code>import time</code> to the top of ...,You are a helpful Python developer You are goo...
158,discord.py :: How would I make a command to ad...,<p>You can create a command that allows modera...,You are a helpful Python developer You are goo...
15,How to get one to many like result in a querys...,<p>I think this is one possible solution. You ...,You are a helpful Python developer You are goo...
334,Verifying exactly the same variable in python ...,<p>What you are looking for is <code>is</code>...,You are a helpful Python developer You are goo...
39,"Difference between ""alpha"" and ""start_alpha""<p...",<p>The parameter <code>alpha</code> is used in...,You are a helpful Python developer You are goo...


In [11]:
evaluation.describe()

Unnamed: 0,input_text,output_text,input_text_instruct
count,110,110,110
unique,110,110,1
top,elements from a list to dataframe names for ex...,<p>Having a list</p>\n<pre><code>list_of_dfs =...,You are a helpful Python developer You are goo...
freq,1,1,110


Go to the **Task 4. Generate the training and validation dataset files** section of the lab instructions and click **Check my progress** to verify the __Split the dataset for training and evaluation__ objective.

### Generate the JSONL files

Prepare your training data in a JSONL (JSON Lines) file and store it in a Google Cloud Storage (GCS) bucket. This format ensures efficient processing. Each line of the JSONL file must represent a single data instance and follow a well-defined schema:

`{ "systemInstruction": {"role": "system", "parts": [{"text": "instructions"}]},
  "contents": [{"role": "user", "parts": [{"text": "question"}]},{"role": "model", "parts": [{"text": "answer"}]}]}`

This is how the placeholders map to the pandas DataFrame columns:

*   `instructions -> input_text_instruct`
*   `question -> input_text`
*   `answer -> output_text`



In [22]:
tuning_data_filename = "tune_data_stack_overflow_qa.jsonl"
validation_data_filename = "validation_data_stack_overflow_qa.jsonl"

In [23]:
def format_messages(row):
    """Formats a single row into the desired JSONL structure"""
    return {
      "systemInstruction": {
        "role": "system",
        "parts": [
          {
            "text": row["input_text_instruct"]
          }
        ]
      },
      "contents": [
        {
          "role": "user",
          "parts": [
            {
              "text": row["input_text"]
            }
          ]
        },
        {
          "role": "model",
          "parts": [
            {
              "text": row["output_text"]
            }
          ]
        }
      ]
    }

In [24]:
# Apply formatting function to each row, then convert to JSON Lines format
tuning_data = train.apply(format_messages, axis=1).to_json(orient="records", lines=True)

# Save the result to a JSONL file
with open(tuning_data_filename, "w") as f:
    f.write(tuning_data)

Next, check that the number of rows matches your pandas DataFrame.

In [25]:
with open(tuning_data_filename, "r") as f:
    num_rows = sum(1 for line in f)

print("Number of rows in the JSONL file:", num_rows)

Number of rows in the JSONL file: 440
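
Counting lines confirms the row count but not that each line parses as JSON and follows the schema. The sketch below is one possible stricter check. The function `validate_example` is hypothetical, and the rule that `contents` holds exactly one `user` turn followed by one `model` turn is an assumption that fits the single-turn examples in this notebook, not a general requirement.

```python
import json


def validate_example(line):
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err}"]
    if not isinstance(example, dict):
        return ["top-level value is not an object"]

    contents = example.get("contents")
    if not isinstance(contents, list) or not contents:
        return ["missing or empty 'contents'"]

    problems = []
    roles = [turn.get("role") for turn in contents if isinstance(turn, dict)]
    if roles != ["user", "model"]:
        problems.append(f"expected roles ['user', 'model'], got {roles}")
    for turn in contents:
        parts = turn.get("parts") if isinstance(turn, dict) else None
        if not parts or not all(isinstance(p, dict) and p.get("text") for p in parts):
            problems.append("a turn has no non-empty text part")
    return problems


good_line = json.dumps(
    {
        "systemInstruction": {"role": "system", "parts": [{"text": "Be helpful"}]},
        "contents": [
            {"role": "user", "parts": [{"text": "What is a list?"}]},
            {"role": "model", "parts": [{"text": "An ordered collection."}]},
        ],
    }
)
print(validate_example(good_line))  # []
```

To apply it to a file, call the function on each line of `tuning_data_filename` and report the line numbers that return a non-empty list.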


Do the same for the validation dataset.

In [26]:
# Apply formatting function to each row, then convert to JSON Lines format
validation_data = evaluation.apply(format_messages, axis=1).to_json(
    orient="records", lines=True
)

# Save the result to a JSONL file
with open(validation_data_filename, "w") as f:
    f.write(validation_data)

Next, copy the JSONL files into the Google Cloud Storage bucket you specified or created at the beginning of the notebook.

In [27]:
!gsutil cp $tuning_data_filename $validation_data_filename $BUCKET_URI

Copying file://tune_data_stack_overflow_qa.jsonl [Content-Type=application/octet-stream]...
Copying file://validation_data_stack_overflow_qa.jsonl [Content-Type=application/octet-stream]...
/ [2 files][  1.7 MiB/  1.7 MiB]                                                
Operation completed over 2 objects/1.7 MiB.                                      


Check if the files are in the bucket.

In [28]:
!gsutil ls -al $BUCKET_URI

   1428964  2024-12-12T18:10:27Z  gs://qwiklabs-gcp-01-5b3935109de6-model-dataset/tune_data_stack_overflow_qa.jsonl#1734027027202619  metageneration=1
    328751  2024-12-12T18:10:27Z  gs://qwiklabs-gcp-01-5b3935109de6-model-dataset/validation_data_stack_overflow_qa.jsonl#1734027027328621  metageneration=1
TOTAL: 2 objects, 1757715 bytes (1.68 MiB)


Create two variables for the data.

In [29]:
TUNING_DATA_URI = f"{BUCKET_URI}/{tuning_data_filename}"
VALIDATION_DATA_URI = f"{BUCKET_URI}/{validation_data_filename}"

Go to the **Task 4. Generate the training and validation dataset files** section of the lab instructions and click **Check my progress** to verify the __Store the training and validation files in Cloud Storage__ objective.

## Start a supervised tuning job using Gemini
It's time to start your tuning job. Use the `gemini-1.5-pro-002` model.

In [30]:
foundation_model = GenerativeModel("gemini-1.5-pro-002")

**TODO:** Create a supervised fine-tuning job with the following parameters:

* **Tuned model display name:** `StackOverflow Q&A Supervised Tuning Job`
* **Source model:** gemini-1.5-pro-002
* **Training dataset:** tune_data_stack_overflow_qa.jsonl
* **Validation dataset:** validation_data_stack_overflow_qa.jsonl
* **Epochs:** 3
* **Learning Rate Multiplier:** 1.0

In [31]:
#sft_tuning_job = [ TODO - Insert your code ]

sft_tuning_job = sft.train(
    tuned_model_display_name="StackOverflow Q&A Supervised Tuning Job",
    source_model=foundation_model,
    train_dataset=f"{BUCKET_URI}/tune_data_stack_overflow_qa.jsonl",
    # Optional:
    validation_dataset=f"{BUCKET_URI}/validation_data_stack_overflow_qa.jsonl",
    epochs=3, 
    learning_rate_multiplier=1.0
)

Creating SupervisedTuningJob
SupervisedTuningJob created. Resource name: projects/372089863747/locations/europe-west4/tuningJobs/7765162057824993280
To use this SupervisedTuningJob in another session:
tuning_job = sft.SupervisedTuningJob('projects/372089863747/locations/europe-west4/tuningJobs/7765162057824993280')
View Tuning Job:
https://console.cloud.google.com/vertex-ai/generative/language/locations/europe-west4/tuning/tuningJob/7765162057824993280?project=372089863747


projects/372089863747/locations/europe-west4/models/477289201524539392@1
projects/372089863747/locations/europe-west4/endpoints/2273334848426868736
<google.cloud.aiplatform.metadata.experiment_resources.Experiment object at 0x7f8cbae09120>


{'name': 'projects/372089863747/locations/europe-west4/tuningJobs/7765162057824993280',
 'tunedModelDisplayName': 'StackOverflow Q&A Supervised Tuning Job',
 'baseModel': 'gemini-1.5-pro-002',
 'supervisedTuningSpec': {'trainingDatasetUri': 'gs://qwiklabs-gcp-01-5b3935109de6-model-dataset/tune_data_stack_overflow_qa.jsonl',
  'validationDatasetUri': 'gs://qwiklabs-gcp-01-5b3935109de6-model-dataset/validation_data_stack_overflow_qa.jsonl',
  'hyperParameters': {'epochCount': '3',
   'learningRateMultiplier': 1.0,
   'adapterSize': 'ADAPTER_SIZE_FOUR'}},
 'state': 'JOB_STATE_SUCCEEDED',
 'createTime': '2024-12-12T18:19:54.760001Z',
 'startTime': '2024-12-12T18:20:12.791932Z',
 'endTime': '2024-12-12T18:44:54.476552Z',
 'updateTime': '2024-12-12T18:44:54.476552Z',
 'experiment': 'projects/372089863747/locations/europe-west4/metadataStores/default/contexts/tuning-experiment-20241212102354106888',
 'tunedModel': {'model': 'projects/372089863747/locations/europe-west4/models/4772892015245393

Go to the **Task 5. Start a supervised tuning job using Gemini** section of the lab and click **Check my progress** to verify the __Start a supervised tuning job using Gemini__ objective.

Next, retrieve the resource name of the tuning job.

In [32]:
# Get the resource name of the tuning job
sft_tuning_job_name = sft_tuning_job.resource_name
sft_tuning_job_name

'projects/372089863747/locations/europe-west4/tuningJobs/7765162057824993280'
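
The resource name is a path of alternating collection and ID segments. If you need the individual pieces, for example the location or the job ID, a small parser such as the hypothetical `parse_resource_name` below is one way to split them. It is an illustrative helper, not an SDK function.

```python
def parse_resource_name(resource_name):
    """Split 'collection/id/collection/id/...' into a dict (hypothetical helper)."""
    segments = resource_name.strip("/").split("/")
    if len(segments) % 2 != 0:
        raise ValueError(f"expected collection/id pairs, got: {resource_name!r}")
    # Even positions are collection names, odd positions are their IDs.
    return dict(zip(segments[0::2], segments[1::2]))


parts = parse_resource_name(
    "projects/372089863747/locations/europe-west4/tuningJobs/7765162057824993280"
)
print(parts["locations"], parts["tuningJobs"])  # europe-west4 7765162057824993280
```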

Tuning takes approximately 100-120 minutes. Run the next cell and wait until the job has finished before you continue.

In [33]:
%%time
# Wait for job completion
while not sft_tuning_job.refresh().has_ended:
    time.sleep(60)

CPU times: user 29.2 ms, sys: 4.69 ms, total: 33.9 ms
Wall time: 170 ms
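
The loop above polls until the job ends, however long that takes. If you prefer an upper bound on the wait, the sketch below shows one way to add a timeout. The helper `wait_until` and its parameters are hypothetical; it takes any zero-argument callable, so it can be tried out without a real tuning job.

```python
import time


def wait_until(has_ended, interval_s=60.0, timeout_s=3 * 60 * 60, sleep=time.sleep):
    """Poll has_ended() until it returns True or timeout_s elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while not has_ended():
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
    return True


# Demo with a stand-in check that reports completion on the third poll.
polls = {"count": 0}


def fake_has_ended():
    polls["count"] += 1
    return polls["count"] >= 3


finished = wait_until(fake_has_ended, interval_s=0.01, timeout_s=5)
print(finished, polls["count"])  # True 3
```

With the job from this notebook, the call would look like `wait_until(lambda: sft_tuning_job.refresh().has_ended)`, reusing the same expression as the cell above.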


After completing the tuning job, go to the **Task 5. Start a supervised tuning job using Gemini** section of the lab and click **Check my progress** to verify the __Tune Gemini model using supervised fine-tuning__ objective.

In [34]:
# tuned model name
tuned_model_name = sft_tuning_job.tuned_model_name
tuned_model_name

'projects/372089863747/locations/europe-west4/models/477289201524539392@1'

Use the tuning job's `list()` method to retrieve the tuning jobs in your project.

In [35]:
sft_tuning_job.list()

[<vertexai.tuning._supervised_tuning.SupervisedTuningJob object at 0x7f8cbbbe1120> 
 resource name: projects/372089863747/locations/europe-west4/tuningJobs/7765162057824993280]

Your model is automatically deployed as a Vertex AI Endpoint and ready for use!

In [36]:
# tuned model endpoint name
tuned_model_endpoint_name = sft_tuning_job.tuned_model_endpoint_name
tuned_model_endpoint_name

'projects/372089863747/locations/europe-west4/endpoints/2273334848426868736'

## Test the tuned model with a prompt

In [37]:
tuned_model = GenerativeModel(tuned_model_endpoint_name)
print(tuned_model)

<vertexai.generative_models.GenerativeModel object at 0x7f8cbae09d50>


Call the API with a test question:

In [38]:
question = "How do I store a TensorFlow checkpoint on Google Cloud Storage while training?"
response = tuned_model.generate_content(question)

print(response)

candidates {
  content {
    role: "model"
    parts {
      text: "You can store TensorFlow checkpoints on Google Cloud Storage (GCS) during training using the `tf.keras.callbacks.ModelCheckpoint` callback with a GCS path.  Here\'s a breakdown of how to do it:\n\n```python\nimport tensorflow as tf\nfrom google.cloud import storage\n\n# 1. Define your model\nmodel = tf.keras.models.Sequential([\n    # ... your model layers ...\n])\nmodel.compile(optimizer=\'adam\', loss=\'mse\')\n\n# 2. Define the GCS path for your checkpoints\nGCS_BUCKET_NAME = \"your-gcs-bucket-name\"  # Replace with your bucket name\nGCS_CHECKPOINT_PATH = f\"gs://{GCS_BUCKET_NAME}/checkpoints/model_{{epoch:02d}}/\"\n\n\n# 3. Create the ModelCheckpoint callback\ncheckpoint_callback = tf.keras.callbacks.ModelCheckpoint(\n    filepath=GCS_CHECKPOINT_PATH,\n    save_freq=\'epoch\',  # Save every epoch\n    save_weights_only=False, # Saves the entire model (architecture + weights)\n    # Optionally add more arguments lik

In [40]:
# Save the response from the model in the Cloud Storage bucket
from google.cloud import storage

# Write the response text to a local file; the with block closes it automatically
with open("tuned_model_qa_response.txt", "w") as output:
    output.write(response.text)

# Upload the file to the bucket defined at the beginning of the notebook
storage_client = storage.Client()
bucket = storage_client.get_bucket(BUCKET_NAME)
blob = bucket.blob("tuned_model_qa_response.txt")
blob.upload_from_filename("tuned_model_qa_response.txt")

Go to the **Task 6. Test the tuned model with a prompt** section of the lab and click **Check my progress** to verify the __Test the tuned model with a prompt__ objective.