# 💎 Google Gemma: Instruction Tuning & Embedding Analysis

**Objective:** Fine-tune Google's open-weights embedding model **EmbeddingGemma** on synthetic job-search queries, then compare its similarity scores for matching and non-matching job descriptions before and after training.

**Key Concepts:**
* **Model:** `google/embeddinggemma-300m`, loaded through Sentence Transformers.
* **Task:** Fine-tuning on (query, positive, negative) triplets with `MultipleNegativesRankingLoss`.
* **Evaluation:** Comparing the similarity scores of a test query against its positive and negative job descriptions.

In [14]:
# @title Load library

import pandas as pd
from google import genai
import json
from datasets import load_dataset
from google.colab import userdata
from sentence_transformers import SentenceTransformer
import numpy as np
from google.genai import types
import time

# 1. Data Preparation: Synthetic Generation

**Note:** 🛠️ The training dataset was created by using **Gemini Flash Lite** to synthetically generate a search query for each job description.

⚠️ **Recommendation for Reproducibility:**
Because the queries were generated by a lightweight ("lite") model at temperature 0.7, regenerated data will differ from run to run and may vary in quality. To reproduce the results shown in this repo, please **download and use the curated dataset file (`final_df.csv`)** directly instead of regenerating it from scratch.

*You can download the file in section 2, Train.*

In [15]:
client = genai.Client(api_key=userdata.get('google_api_key_2'))

In [16]:
import re

def remove_irrelevant_sections(description):
    """
    Removes irrelevant sections such as "About the Company," "Perks & Benefits,"
    and "Responsibilities" from a job description.

    Args:
        description (str): The job description as a string.

    Returns:
        str: The cleaned job description with irrelevant sections removed.
    """
    # Define regex patterns for sections to remove
    patterns = [
        r"(About the Company:|Our Mission:).*?(?=(Qualifications|Requirements|Skills|Experience|$))",
        r"(Perks & Benefits:|What We Offer:).*?(?=(Qualifications|Requirements|Skills|Experience|$))",
        r"(Responsibilities:).*?(?=(Qualifications|Requirements|Skills|Experience|$))"
    ]

    # Remove each pattern
    for pattern in patterns:
        description = re.sub(pattern, "", description, flags=re.IGNORECASE | re.DOTALL)

    return description.strip()

def extract_qualifications_from_html(description):
    """
    Extracts sections of a job description that begin with keywords like
    "Qualifications," "Requirements," "Skills," or "Experience."

    Args:
        description (str): The job description as a string.

    Returns:
        str: The relevant section containing qualifications, or the original
             description if no match is found.
    """
    # Search for sections starting with relevant keywords
    match = re.search(
        r"(Qualifications|Requirements|Skills|Experience).*",
        description,
        flags=re.IGNORECASE | re.DOTALL,
    )
    if match:
        # Extract the matched section
        relevant_section = match.group(0)
        return relevant_section
    return description

def remove_eoe_notes(description):
    """
    Removes Equal Opportunity Employer (EOE) notes and similar boilerplate text
    from a job description.

    Args:
        description (str): The job description as a string.

    Returns:
        str: The cleaned job description with EOE notes removed.
    """
    # Define regex patterns for common EOE notes
    patterns = [
        r"an equal opportunity employer.*?(?=\n|$)",  # Common phrasing
        r"EOE.*?(?=\n|$)",  # Short form
        r"EEO.*?(?=\n|$)",
        r"equal employment.*?(?=\n|$)",  # Full boilerplate
        r"Equal employment opportunity.*?(?=\n|$)"  # Variations
    ]

    # Remove each pattern from the description
    for pattern in patterns:
        description = re.sub(pattern, "", description, flags=re.IGNORECASE | re.DOTALL)

    return description.strip()
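
The EOE patterns above all follow the same shape: a trigger phrase, a lazy `.*?`, and a lookahead that stops at the end of the line. The sketch below shows what that shape removes on a made-up snippet (the sample text is hypothetical, not taken from the dataset). Note that text *before* the trigger phrase on the same line is left behind.

```python
import re

# Hypothetical sample text, not taken from the dataset.
sample = (
    "Requirements: 3+ years of SQL and Airflow.\n"
    "We are an equal opportunity employer and value diversity.\n"
    "Skills: dbt, Snowflake."
)

# Lazy `.*?` plus the lookahead `(?=\n|$)` consumes text only up to the end
# of the current line, so the neighbouring lines are left untouched.
pattern = r"an equal opportunity employer.*?(?=\n|$)"
cleaned = re.sub(pattern, "", sample, flags=re.IGNORECASE | re.DOTALL)

print(cleaned)
```

In this example the second line is reduced to the fragment `We are `, which suggests that a stricter cleanup would need to match from the start of the sentence rather than from the trigger phrase.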

## Load Data

In [17]:
def prompt_template(job_description):
    """Build the query-generation prompt for a single job description."""
    return f"""Read the following job description and create a concise job search query with at most 3 specialized skills or \
areas of expertise that are distinct to the role. Exclude generic data science or software engineering skills like AI, machine \
learning, and coding languages unless they are explicitly highlighted as unique or advanced. Keep the query short and human-like, \
suitable for typing into a search engine.

Here's the job description: {job_description}"""

In [18]:
# load data from HF hub
ds = load_dataset("datastax/linkedin_job_listings")

# convert to pandas df
df = ds['train'].to_pandas()

# keep only title and description
df = df[['title', 'description']]
df.shape

(123849, 2)

In [19]:
# List of strings to search for
search_terms = ["Data Scientist", "Data Analyst", "Machine Learning Engineer",
                "Data Engineer", "AI Engineer", "Deep Learning"]

# Create a regex pattern to match any of the strings
pattern = '|'.join(search_terms)

# Filter rows that contain any of the search terms
df = df[df['title'].str.contains(pattern, case=False, na=False)]
df.shape

(1179, 2)
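
The title filter joins the search terms with `|` and matches them case-insensitively. A minimal standard-library sketch of the same idea follows; the titles are hypothetical, and `re.escape` is an addition that does not change what these particular terms match but would matter for a term such as `C++`.

```python
import re

search_terms = ["Data Scientist", "Data Analyst", "Machine Learning Engineer",
                "Data Engineer", "AI Engineer", "Deep Learning"]

# Escape each term so regex metacharacters are treated literally.
pattern = re.compile("|".join(re.escape(t) for t in search_terms), re.IGNORECASE)

# Hypothetical titles for illustration; None mimics a missing value.
titles = ["Sr Data Engineer with Kafka", "Staff Accountant",
          "deep learning researcher", None]

# Skip missing values (like na=False) and keep titles containing any term.
matches = [t for t in titles if isinstance(t, str) and pattern.search(t)]
print(matches)
```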

In [20]:
df.head(5)

| | title | description |
|---|---|---|
| 283 | Sr Data Engineer with Kafka | Data Engineer with Kafka (W2 Only)💯% Remote\nM... |
| 360 | Cloud Platform/ Big Data Engineer | About Subaru Research and Development:Do you c... |
| 367 | Data Engineer/ETL | Responsibilities:Develop new features, fix bug... |
| 389 | Data Analyst | Job Title: Data AnalystDuration: ContractLocat... |
| 483 | Senior Data Engineer/Analyst - Full Time | Job Type: Full-Time, Permanent \nResponsibilit... |


In [21]:
# @title Save to file
df.to_csv('job_data.csv')

## Synthetic Queries by Gemini

In [22]:
job_description_list = df['description'].to_list()

In [23]:
# @title Create batch requests
batch_requests = [
    {
        "key": f"request-{i+1}",
        "request": {
            "contents": [
                {
                    "role": "user",
                    "parts": [{"text": prompt_template(job_description)}]
                }
            ],
            "generationConfig": {
                "temperature": 0.7
            }
        }
    }
    for i, job_description in enumerate(job_description_list)
]


In [24]:
# @title Save JSONL file
# Convert to JSONL format (newline-delimited JSON)
batch_jsonl = "\n".join(json.dumps(request) for request in batch_requests)
# Save to a .jsonl file
with open("batch_requests.jsonl", "w") as file:
    file.write(batch_jsonl)
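
Before uploading, it can be worth checking that the JSONL text really has one request per line and that every `key` is unique, since job descriptions contain newlines. The sketch below does this on two hypothetical requests built in the same shape as `batch_requests`; it is a sanity check under those assumptions, not part of the original pipeline.

```python
import json

# Two hypothetical requests in the same shape as batch_requests above.
requests = [
    {"key": f"request-{i + 1}",
     "request": {"contents": [{"role": "user", "parts": [{"text": text}]}],
                 "generationConfig": {"temperature": 0.7}}}
    for i, text in enumerate(["first prompt\nwith a newline", "second prompt"])
]

jsonl = "\n".join(json.dumps(r) for r in requests)

# json.dumps escapes embedded newlines, so each request stays on one line.
lines = jsonl.split("\n")
parsed = [json.loads(line) for line in lines]
keys = [p["key"] for p in parsed]

assert len(lines) == len(requests)
assert len(set(keys)) == len(keys)  # keys must be unique to map results back
print(keys)
```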

In [25]:
batch_input_file = client.files.upload( # file
    file='batch_requests.jsonl',
    config=types.UploadFileConfig(
        display_name='my-batch-requests',
        mime_type='jsonl'
    )
)

print(f"Uploaded file: {batch_input_file.name}")

Uploaded file: files/glbq1n3wcnxu


In [None]:
# @title Create batch job
batch_object = client.batches.create( # batch_job
    model="gemini-2.0-flash-lite",  # 1. Model is defined here, NOT in the JSONL
    src=batch_input_file.name, # 2. 'src' takes the file resource name (files/...)
    config={
        'display_name':"synthetic queries from job descriptions" # 3. Maps to metadata
    }
)

print(f"Created batch job: {batch_object.name}")
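
Batch jobs run asynchronously, so the results are only available once the job reaches a terminal state. The following is a minimal polling sketch. `wait_for_batch` is a hypothetical helper, the state getter is injected so the example runs without network access, and the set of terminal state names is an assumption that should be checked against the current Gemini Batch API documentation.

```python
import time

# Assumed terminal state names; verify against the Batch API documentation.
TERMINAL_STATES = {
    "JOB_STATE_SUCCEEDED",
    "JOB_STATE_FAILED",
    "JOB_STATE_CANCELLED",
    "JOB_STATE_EXPIRED",
}

def wait_for_batch(get_state, poll_seconds=30.0, max_polls=1000, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state name (hypothetical helper)."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("batch job did not finish within the polling budget")

# Stand-in for a real getter such as
# `lambda: client.batches.get(name=batch_object.name).state.name`.
fake_states = iter(["JOB_STATE_PENDING", "JOB_STATE_RUNNING", "JOB_STATE_SUCCEEDED"])
final_state = wait_for_batch(lambda: next(fake_states), poll_seconds=0,
                             sleep=lambda seconds: None)
print(final_state)
```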

## Process batch result

In [None]:
# save to file
batch_jobs = client.batches.list()

# Optional query config:
# batch_jobs = client.batches.list(config={'page_size': 5})

for i, batch in enumerate(batch_jobs):
    if batch.state.name == 'JOB_STATE_SUCCEEDED':
        # If batch job was created with a file
        if batch.dest and batch.dest.file_name:
            # Results are in a file
            result_file_name = batch.dest.file_name
            print(f"Results are in file: {result_file_name}")

            print("Downloading result file content...")
            file_response = client.files.download(file=result_file_name)
            # Process file_content (bytes) as needed
            with open(f"output_{i}.jsonl", 'w') as file:
                file.write(file_response.decode('utf-8'))

Results are in file: files/batch-7xuwsdsn7lrnmsgdlxx1drdwwp6zxcsafhh0
Downloading result file content...
Results are in file: files/batch-sg1vqjfkkvw74pq1z7u5dots5799nen2cips
Downloading result file content...
Results are in file: files/batch-p0mloyh2t0jxkzq3f8cfuwflv0swougdzz7m
Downloading result file content...


In [None]:
# @title Extract synthetic queries and store in list (from batch request)
file_path = 'output_2.jsonl'
query_list = []

with open(file_path, 'r') as file:
    for line in file:
        line = line.strip()
        if not line:
            continue  # skip blank lines, which json.loads cannot parse
        parsed_response = json.loads(line)
        if parsed_response.get('response'):
            for part in parsed_response['response']['candidates'][0]['content']['parts']:
                if part.get('text'):
                    query_list.append(part['text'])
        elif 'error' in parsed_response:
            print(f"Error: {parsed_response['error']}")
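
The extraction above relies on the result lines arriving in request order with no failures; otherwise `query_list` no longer lines up with the job descriptions. A more defensive sketch aligns results by their `key` instead. It assumes each result line echoes the `key` from the request file, and the sample lines below are hypothetical.

```python
import json

# Hypothetical result lines; real ones come from the downloaded output file.
result_lines = [
    '{"key": "request-2", "response": {"candidates": [{"content": {"parts": [{"text": "query B"}]}}]}}',
    '{"key": "request-1", "response": {"candidates": [{"content": {"parts": [{"text": "query A"}]}}]}}',
    '{"key": "request-3", "error": {"message": "hypothetical failure"}}',
]

query_by_key = {}
for line in result_lines:
    record = json.loads(line)
    response = record.get("response")
    if not response:
        continue  # failed requests get no query
    parts = response["candidates"][0]["content"]["parts"]
    query_by_key[record["key"]] = "".join(p.get("text", "") for p in parts)

# Rebuild the list in request order; None marks rows to drop before pairing.
num_requests = 3
aligned = [query_by_key.get(f"request-{i + 1}") for i in range(num_requests)]
print(aligned)
```

Rows whose entry is `None` can then be removed from both the query list and the job-description list before the DataFrame is built.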

In [None]:
# extract JDs
df_jobs = pd.read_csv("job_data.csv")
# df_jobs = df_jobs.drop_duplicates()

# only keep text relevant to job qualifications
df_jobs['description_cleaned'] = df_jobs['description'].apply(remove_irrelevant_sections)
df_jobs['description_cleaned'] = df_jobs['description_cleaned'].apply(extract_qualifications_from_html)
df_jobs['description_cleaned'] = df_jobs['description_cleaned'].apply(remove_eoe_notes)

# store job descriptions in a list
job_description_list = df_jobs['description_cleaned'].to_list()

In [None]:
# @title Create DataFrame with queries and JDs
df = pd.DataFrame({"query" : query_list, "job_description_pos" : job_description_list})

In [None]:
df.head(5)

| | query | job_description_pos |
|---|---|---|
| 0 | Kafka ETL Snowflake | experience neededVery strong experience in Kaf... |
| 1 | "automotive software cockpit systems AWS GCP A... | requirements to determine feasibility of desig... |
| 2 | React B2B SaaS development AWS Lambda | experienceAccountable for code quality, includ... |
| 3 | Data Analyst Queens NY, business insights, dat... | QualificationsAnalytical skills, including the... |
| 4 | Mortgage banking data systems developer | requirements and industry practices for mortga... |


In [None]:
print("Original shape:", df.shape)
df = df.drop_duplicates(subset=['job_description_pos'])
print("Unique JDs:", df.shape)
df = df.drop_duplicates(subset=['query'])
print("Unique queries:",df.shape)

Original shape: (1179, 2)
Unique JDs: (1020, 2)
Unique queries: (1001, 2)


## Create a negative pair for each query

In [None]:
# @title Load the model
model = SentenceTransformer("AITeamVN/Vietnamese_Embedding")

In [None]:
# @title Encode all job descriptions
job_embeddings = model.encode(df['job_description_pos'].to_list())
print(job_embeddings.shape)

(1001, 1024)


In [None]:
similarities = model.similarity(job_embeddings, job_embeddings)
print(similarities.shape)

torch.Size([1001, 1001])


In [None]:
# @title Match the JD least similar to the positive match as the negative match
similarities_argsorted = np.argsort(similarities.numpy(), axis=1)
negative_pair_index_list = []

for i in range(len(similarities)):

    # Start with the smallest similarity index for the current row
    j = 0
    index = int(similarities_argsorted[i][j])

    # Ensure the index is unique
    while index in negative_pair_index_list:
        j += 1  # Move to the next smallest index
        index = int(similarities_argsorted[i][j])  # Fetch next smallest index

    negative_pair_index_list.append(index)
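
The selection loop above can be illustrated on a tiny example. The sketch below applies the same greedy rule (take the least similar index that has not been used yet) to a hypothetical 4x4 similarity matrix, using a set for the membership test. It is an illustration of the rule, not a replacement for the cell above.

```python
# Hypothetical 4x4 cosine-similarity matrix (symmetric, ones on the diagonal).
similarities = [
    [1.0, 0.2, 0.1, 0.6],
    [0.2, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.4],
    [0.6, 0.1, 0.4, 1.0],
]

used = set()  # set membership is O(1), unlike a list
negative_index = []

for row in similarities:
    # Candidate indices ordered from least to most similar.
    order = sorted(range(len(row)), key=row.__getitem__)
    # First candidate not yet assigned as another row's negative.
    choice = next(j for j in order if j not in used)
    used.add(choice)
    negative_index.append(choice)

print(negative_index)
```

Because every index can be used only once, later rows may have to accept a more similar candidate than their true minimum, so the order of the rows influences the result.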

In [None]:
# @title Add negative pairs to df
df['job_description_neg'] = df['job_description_pos'].iloc[negative_pair_index_list].values

In [None]:
df.head()

| | query | job_description_pos | job_description_neg |
|---|---|---|---|
| 0 | Kafka ETL Snowflake | experience neededVery strong experience in Kaf... | experience. You are comfortable with a range o... |
| 1 | "automotive software cockpit systems AWS GCP A... | requirements to determine feasibility of desig... | experience\nStored Procs (AWS Postgres) to API... |
| 2 | React B2B SaaS development AWS Lambda | experienceAccountable for code quality, includ... | Skills:2 intermediate analytics skills (BQ/SQL) |
| 3 | Data Analyst Queens NY, business insights, dat... | QualificationsAnalytical skills, including the... | Actively participates in projects in assigned ... |
| 4 | Mortgage banking data systems developer | requirements and industry practices for mortga... | Skill Sets: SparkPyspark.TableauSQL Query |


In [None]:
# @title Save to csv
df.to_csv("final_df_1.csv", index=False, encoding="utf-8-sig")

# 2. Train

In [28]:
!wget https://github.com/thangnch/MiAI_Finetune_EmbeddingGemma/raw/refs/heads/main/final_df.csv

--2026-02-12 03:12:12--  https://github.com/thangnch/MiAI_Finetune_EmbeddingGemma/raw/refs/heads/main/final_df.csv
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/thangnch/MiAI_Finetune_EmbeddingGemma/refs/heads/main/final_df.csv [following]
--2026-02-12 03:12:13--  https://raw.githubusercontent.com/thangnch/MiAI_Finetune_EmbeddingGemma/refs/heads/main/final_df.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5219938 (5.0M) [text/plain]
Saving to: ‘final_df.csv.1’


2026-02-12 03:12:13 (223 MB/s) - ‘final_df.csv.1’ saved [5219938/5219938]



In [29]:
import pandas as pd

# Shuffle the dataset
df = pd.read_csv("final_df.csv", encoding="utf-8-sig")
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Split into train, validation, and test sets (e.g., 80% train, 10% validation, 10% test)
train_frac = 0.8
valid_frac = 0.1
test_frac = 0.1

# define train and validation size
train_size = int(train_frac * len(df))
valid_size = int(valid_frac * len(df))

# create train, validation, and test datasets
df_train = df[:train_size]
df_valid = df[train_size:train_size + valid_size]
df_test = df[train_size + valid_size:]
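
Because the train and validation sizes are truncated with `int()`, any leftover rows end up in the test split. The sketch below makes that arithmetic explicit. `split_sizes` is a hypothetical helper and the row counts are illustrative values, not read from `final_df.csv`.

```python
def split_sizes(n_rows, train_frac=0.8, valid_frac=0.1):
    """Return (train, valid, test) sizes; the test split takes the remainder."""
    train_size = int(train_frac * n_rows)
    valid_size = int(valid_frac * n_rows)
    return train_size, valid_size, n_rows - train_size - valid_size

# Illustrative row counts (assumptions, not measured from the dataset).
print(split_sizes(1000))
print(split_sizes(1017))
```

For a row count that is not a multiple of 10, the test split is therefore slightly larger than the validation split.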

In [30]:
df_valid

| | query | job_description_pos | job_description_neg |
|---|---|---|---|
| 813 | Data Engineer with AWS Big Data Services, Orac... | experience. Excellent knowledge of database co... | requirements and industry practices.Build high... |
| 814 | Big Data Engineer, Spark, Hadoop, AWS/GCP | Skills • Expertise and hands-on experience on ... | requirements and provide data-driven recommend... |
| 815 | Data scientist time series analysis, condition... | Experience in Production Operations or Well En... | QUALIFICATIONSMust-Have:Bachelor’s Degree in C... |
| 816 | Senior Data Analyst, healthcare data analysis,... | requirements.Reporting and Dashboard Developme... | experience with speech interfaces Lead and eva... |
| 817 | Data analysis for operations, SQL expertise, d... | requirements, determine technical issues, and ... | experiences, beliefs, backgrounds, expertise, ... |
| ... | ... | ... | ... |
| 909 | Computer Vision algorithms, behavioral dynamic... | QualificationsRequirementsPh.D. in Computer Vi... | experience in:\n-Expert level SQL skills.-Very... |
| 910 | Business Data Analyst, KPI analysis, data visu... | requirements and provide data-driven recommend... | Skills - Nice to Havessnowflakebig dataJob Des... |
| 911 | Data pipelines KNIME SharePoint financial serv... | Skills:-SQL, SharePoint, Financial Services, E... | experienced Data Engineer to join our world le... |
| 912 | Data Governance, Financial Services Analytics,... | experience for yourself, and a better working ... | skills : AI/ML models using Google Cloud Platf... |


In [31]:
# @title Convert the pandas DataFrames back to Hugging Face Datasets
from datasets import Dataset, DatasetDict

train_ds = Dataset.from_pandas(df_train)
valid_ds = Dataset.from_pandas(df_valid)
test_ds = Dataset.from_pandas(df_test)

# Combine into a DatasetDict
dataset = DatasetDict({
    'train': train_ds,
    'validation': valid_ds,
    'test': test_ds
})


# Finetune

In [32]:
import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "google/embeddinggemma-300m"
model = SentenceTransformer(model_id, token=userdata.get('HF_TOKEN')).to(device=device)

print(f"Device: {model.device}")
print(model)
print("Total number of parameters in the model:", sum([p.numel() for _, p in model.named_parameters()]))

Device: cuda:0
SentenceTransformer(
  (0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'Gemma3TextModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (4): Normalize()
)
Total number of parameters in the model: 307581696


In [33]:
task_name = "STS"

def get_scores(query, documents):
  # Get embeddings by calling model.encode(); prompt_name selects the
  # model's predefined prompt for the task (prompt= would prepend the
  # literal string "STS" instead)
  query_embeddings = model.encode(query, prompt_name=task_name)
  doc_embeddings = model.encode(documents, prompt_name=task_name)

  # Compute the similarity between the embeddings
  similarities = model.similarity(query_embeddings, doc_embeddings)

  for idx, doc in enumerate(documents):
    print("*"*30,"\n")
    print("📕Document: ", doc, "\n\n🤖 Score: ", similarities.numpy()[0][idx])

query = dataset["test"][0]["query"]
print("🚩Query = {}".format(query))
documents = [dataset["test"][0]["job_description_pos"],dataset["test"][0]["job_description_neg"]]

get_scores([query], documents)


🚩Query = Data Migration Specialist SAP MDG data quality
****************************** 

📕Document:  requirements, collect data, lead cleansing efforts, and load/support data into SAPthe gap between business and IT teams, effectively communicating data models and setting clear expectations of deliverablesand maintain trackers to showcase progress and hurdles to Project Managers and Stakeholders
Qualifications
knowledge of SAP and MDGcommunication skillsto manage multiple high-priority, fast-paced projects with attention to detail and organizationan excellent opportunity to learn an in-demand area of SAP MDGa strong willingness to learn, with unlimited potential for growth and plenty of opportunities to expand skills
This role offers a dynamic environment where you can directly impact IT projects and contribute to the company’s success. You will work alongside a supportive team of professionals, with ample opportunities for personal and professional development. 
If you’re ready to t
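
The score printed by `get_scores` comes from `model.similarity`, which in Sentence Transformers uses cosine similarity unless configured otherwise. As a rough illustration of what that number means, here is a standard-library sketch on hypothetical 3-dimensional vectors (real embeddings from this model have 768 dimensions).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings for illustration only.
query_vec = [1.0, 0.0, 1.0]
pos_vec = [2.0, 0.0, 2.0]   # same direction as the query
neg_vec = [0.0, 3.0, 0.0]   # orthogonal to the query

print(cosine_similarity(query_vec, pos_vec))
print(cosine_similarity(query_vec, neg_vec))
```

Since the model's last module is `Normalize()`, its embeddings have unit length, so the cosine similarity reduces to a plain dot product.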

In [34]:
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import MultipleNegativesRankingLoss
from transformers import TrainerCallback

# Define the loss function for the text-matching task
loss = MultipleNegativesRankingLoss(model)

# Training configuration
training_args = SentenceTransformerTrainingArguments(
    output_dir="custom-embedding-model",      # directory for training outputs
    prompts=model.prompts[task_name],         # prompt taken from the model for training
    num_train_epochs=1,                       # number of epochs
    per_device_train_batch_size=1,            # batch size per device
    learning_rate=2e-5,                       # learning rate
    warmup_steps=81,                          # number of warmup steps
    logging_steps=dataset["train"].num_rows,  # log every num_rows steps (once per epoch at batch size 1)
    report_to="none",                         # disable external logging
)

# Callback that runs the evaluation on each logging event (once per epoch here)
class EvalCallback(TrainerCallback):
    """Callback that evaluates the model during training."""
    def __init__(self, eval_func):
        self.eval_func = eval_func

    def on_log(self, args, state, control, **kwargs):
        # Print the step and call the evaluation function
        print(f"✅ Step {state.global_step} complete. Starting evaluation...")
        self.eval_func()

# Evaluation function (name kept unchanged)
def evaluate():
    get_scores(query, documents)

# Initialize the Trainer and start training
trainer = SentenceTransformerTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    loss=loss,
    callbacks=[EvalCallback(evaluate)]
)

trainer.train()


| Step | Training Loss |
|---|---|
| 813 | 0.165475 |


✅ Step 813 complete. Starting evaluation...
****************************** 

📕Document:  requirements, collect data, lead cleansing efforts, and load/support data into SAPthe gap between business and IT teams, effectively communicating data models and setting clear expectations of deliverablesand maintain trackers to showcase progress and hurdles to Project Managers and Stakeholders
Qualifications
knowledge of SAP and MDGcommunication skillsto manage multiple high-priority, fast-paced projects with attention to detail and organizationan excellent opportunity to learn an in-demand area of SAP MDGa strong willingness to learn, with unlimited potential for growth and plenty of opportunities to expand skills
This role offers a dynamic environment where you can directly impact IT projects and contribute to the company’s success. You will work alongside a supportive team of professionals, with ample opportunities for personal and professional development. 
If you’re ready to take on new challen

TrainOutput(global_step=813, training_loss=0.1654752611674066, metrics={'train_runtime': 903.4438, 'train_samples_per_second': 0.9, 'train_steps_per_second': 0.9, 'total_flos': 0.0, 'train_loss': 0.1654752611674066, 'epoch': 1.0})
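
To give some intuition for the training loss reported above, here is a simplified sketch of the idea behind `MultipleNegativesRankingLoss`: a cross-entropy over scaled similarities in which the positive document is the correct class. The function name `mnr_loss` and the similarity values are hypothetical, and the scale of 20 is assumed to match the library default. With a batch size of 1, the candidates for a query are just its own positive and negative.

```python
import math

def mnr_loss(sim_pos, sim_negs, scale=20.0):
    """Cross-entropy over scaled similarities with the positive as the target."""
    logits = [scale * sim_pos] + [scale * s for s in sim_negs]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]

# Hypothetical cosine similarities for one (query, positive, negative) triplet.
easy = mnr_loss(sim_pos=0.8, sim_negs=[0.1])  # positive clearly ahead
hard = mnr_loss(sim_pos=0.5, sim_negs=[0.5])  # positive and negative tied

print(easy, hard)
```

The loss approaches zero when the positive scores well above the negative and equals `log(2)` when the two are tied, which is the behavior that pushes the query embedding toward its positive job description.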

In [35]:
# @title After finetune
get_scores(query, documents)

****************************** 

📕Document:  requirements, collect data, lead cleansing efforts, and load/support data into SAPthe gap between business and IT teams, effectively communicating data models and setting clear expectations of deliverablesand maintain trackers to showcase progress and hurdles to Project Managers and Stakeholders
Qualifications
knowledge of SAP and MDGcommunication skillsto manage multiple high-priority, fast-paced projects with attention to detail and organizationan excellent opportunity to learn an in-demand area of SAP MDGa strong willingness to learn, with unlimited potential for growth and plenty of opportunities to expand skills
This role offers a dynamic environment where you can directly impact IT projects and contribute to the company’s success. You will work alongside a supportive team of professionals, with ample opportunities for personal and professional development. 
If you’re ready to take on new challenges and grow your career in data analytic

In [36]:
save_path = "saved-embedding-model"
trainer.save_model(save_path)

print(f"📂 Model saved at: {save_path}")

📂 Model saved at: saved-embedding-model


In [37]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("saved-embedding-model")
get_scores(query, documents)

****************************** 

📕Document:  requirements, collect data, lead cleansing efforts, and load/support data into SAPthe gap between business and IT teams, effectively communicating data models and setting clear expectations of deliverablesand maintain trackers to showcase progress and hurdles to Project Managers and Stakeholders
Qualifications
knowledge of SAP and MDGcommunication skillsto manage multiple high-priority, fast-paced projects with attention to detail and organizationan excellent opportunity to learn an in-demand area of SAP MDGa strong willingness to learn, with unlimited potential for growth and plenty of opportunities to expand skills
This role offers a dynamic environment where you can directly impact IT projects and contribute to the company’s success. You will work alongside a supportive team of professionals, with ample opportunities for personal and professional development. 
If you’re ready to take on new challenges and grow your career in data analytic