
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/demo/navigator-fine-tuning-intro-tutorial.ipynb) <br>

<br>

<center><img src="https://gretel-public-website.s3.us-west-2.amazonaws.com/assets/brand/gretel_brand_wordmark.svg" alt="Gretel" width="350"/></center>

<br>

## 👋 Welcome to the **Navigator Fine Tuning** Intro Notebook!

In this notebook, we demonstrate how to use Gretel's SDK to train [**Navigator Fine Tuning**](https://docs.gretel.ai/create-synthetic-data/models/synthetics/gretel-navigator-fine-tuning) to generate high-quality synthetic data. To keep things simple, we focus on basic usage of the model for generating tabular data with _independent_ records.

<br>

## ✅ Set up your Gretel account

To get started, you will need a [free Gretel account](https://console.gretel.ai/).

If this is your first time using the Gretel SDK, we recommend starting with our [Gretel SDK Blueprints](https://docs.gretel.ai/gretel-basics/getting-started/blueprints).


<br>

#### Ready? Let's go 🚀

## 💾 Install `gretel-client` and its dependencies

In [1]:
%%capture
!pip install gretel-client

In [3]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [13]:
pwd

'/content'

In [23]:
import os

# List files in the directory
directory_path = "/content/drive/MyDrive/Home_depo/"
files = os.listdir(directory_path)

# Print the files
print(files)

['test_MASTER.csv', 'labels_MASTER.csv', 'content_data_MASTER.csv', 'processed_labels.csv']
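Before training, it can help to check that the chosen CSV has the columns and completeness you expect. The following is a minimal sketch using pandas. The `summarize_columns` helper and the toy rows are illustrative assumptions (not part of the Gretel SDK), and the in-memory CSV stands in for a real file such as `processed_labels.csv`.

```python
import io

import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with unique-value and missing-value counts."""
    return pd.DataFrame(
        {
            "unique": df.nunique(dropna=True),  # distinct non-missing values
            "missing": df.isna().sum(),         # count of NaN / empty cells
            "dtype": df.dtypes.astype(str),     # pandas dtype as text
        }
    )


# Hypothetical toy rows; the second row has an empty `title` on purpose.
toy_csv = io.StringIO(
    "searchTerm,title,slug,label\n"
    "door chime,How to Install a Doorbell,how-to-install-a-doorbell,RELEVANT\n"
    "paint with primer,,how-to-paint-trim,RELEVANT\n"
)
toy_df = pd.read_csv(toy_csv)
summary = summarize_columns(toy_df)
print(summary)
```

On Colab, you would replace the in-memory CSV with `pd.read_csv(...)` pointed at the file on your mounted Drive.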


## 🛜 Configure your Gretel session

- [The Gretel object](https://docs.gretel.ai/create-synthetic-data/gretel-sdk/the-gretel-object) provides a high-level interface for streamlining interactions with Gretel's APIs.

- Retrieve your Gretel API key [here](https://console.gretel.ai/users/me/key).

In [2]:
from gretel_client import Gretel

gretel = Gretel(
    project_name="navigator-ft-intro",
    api_key="prompt",
    endpoint="https://api.gretel.cloud",
    validate=True,
)

Gretel API Key: ··········
Logged in as ukey.b@northeastern.edu ✅
Project name `navigator-ft-intro` is not unique -> appending your user id to the name.
Project URL: https://console.gretel.ai/proj_2tGPo7oIPO6HgXbiH7aQNHyC4nh


## 📊 Tabular Data

Generating tabular data is the most straightforward application of Navigator Fine Tuning. In this case, the model's [default configuration](https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics/navigator-ft.yml) parameters are an excellent place to start.

In [24]:
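# @title Pick a tabular dataset 👇 { run: "auto" }
# Map each human-readable dataset name to its CSV path on the mounted Drive.
dataset_path_dict = {
    "synthetic data for search": "/content/drive/MyDrive/Home_depo/processed_labels.csv"
}

# Keep the name and the resolved path in separate variables for readability.
dataset_name = "synthetic data for search"
data_source = dataset_path_dict[dataset_name]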


## 🏋️‍♂️ Train a generative model

- The `navigator-ft` base config tells Gretel we want to train with **Navigator Fine Tuning** using its default parameters.

- **Navigator Fine Tuning** is an LLM under the hood. Before training begins, information about how the input data was tokenized and assembled into examples will be logged in the cell output (as well as in Gretel's Console).

- Generation of a dataset for evaluation will begin immediately after the model completes training. The rate at which the model produces valid records will be logged to help assess how well the model is performing.

In [25]:
trained = gretel.submit_train("navigator-ft", data_source=data_source)

Submitting NAVIGATOR FINE TUNING training job...
Model Docs: https://docs.gretel.ai/create-synthetic-data/models/synthetics/gretel-navigator-fine-tuning
Console URL: https://console.gretel.ai/proj_2tGPo7oIPO6HgXbiH7aQNHyC4nh/models/67b5ef651d9c7fff7e61e718/activity
Model ID: 67b5ef651d9c7fff7e61e718
Analyzing input data and checking for auto-params... 
Found 1 auto-params that were set based on input data. num_input_records_to_sample 25000
<< 🧭 Navigator FT >> Preparing for training 
<< 🧭 Navigator FT >> Tokenizing records 
<< 🧭 Navigator FT >> Number of unique train records: 868 
<< 🧭 Navigator FT >> Assembling examples from 2880.2% of the input records 
<< 🧭 Navigator FT >> Training Example Statistics: 

╒════════╤═════════════════════╤══════════════════════╤═══════════════════════╕
│        │   Tokens per record │   Tokens per example │   Records per example │
╞════════╪═════════════════════╪══════════════════════╪═══════════════════════╡
│ min    │                  20 │            

In [26]:
# view the quality scores
trained.report

GretelDataQualityReport(
    synthetic_data_quality_score: 81
    field_correlation_stability: 88
    principal_component_stability: 56
    field_distribution_stability: 100
    privacy_protection_level: 0
    membership_inference_attack_score: 55.7
    attribute_inference_attack_score: 100.0
    data_privacy_score: 77
)
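One simple way to read the summary is to sort the scores so the weakest ones appear first. The sketch below uses values copied by hand from the output above into a plain dictionary; it does not assume any particular method on the report object for exporting them. `privacy_protection_level` is left out because its scale is not stated in this notebook.

```python
# Scores copied by hand from the report summary printed above.
scores = {
    "synthetic_data_quality_score": 81,
    "field_correlation_stability": 88,
    "principal_component_stability": 56,
    "field_distribution_stability": 100,
    "membership_inference_attack_score": 55.7,
    "attribute_inference_attack_score": 100.0,
    "data_privacy_score": 77,
}

# Sort ascending so the lowest scores are listed first.
ranked = sorted(scores.items(), key=lambda item: item[1])

for name, value in ranked:
    print(f"{value:>6.1f}  {name}")
```

For this run, the two lowest entries are `membership_inference_attack_score` and `principal_component_stability`, which makes them natural places to look first in the full report.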

In [27]:
# display the full report within this notebook
trained.report.display_in_notebook()

How to interpret your SQS,Excellent,Good,Moderate,Poor,Very Poor
Suitable for machine learning or statistical analysis,,,,,
Suitable for balancing or augmenting machine learning data sources,,,,,
Suitable for pre-production testing environments,,,,,
Suitable for demo environments or mock data,,,,,
Improve your model using our tips and advice,,,,,
Significant tuning required to improve model,,,,,

Data Sharing Use Case,Excellent,Very Good,Good,Normal,Poor
"Internally, within the same team",,,,,
"Internally, across different teams",,,,,
"Externally, with trusted partners",,,,,
"Externally, public availability",,,,,

Metric,Training Data,Synthetic Data
Row Count,868,868
Column Count,4,4
Training Lines Duplicated,--,173

Default Privacy Protections,Advanced Protections

Field,Unique,Missing,Ave. Length,Type,Distribution Stability
searchTerm,810,0,26.28,Text,
title,465,27,29.24,Text,
slug,479,0,28.77,Other,
label,3,0,9.07,Binary,Excellent


In [28]:
# inspect the synthetic data used to create the report
df_synth_report = trained.fetch_report_synthetic_data()
df_synth_report.head()

,searchTerm,slug,label,title
0,hampton bay door chime,how-to-install-a-doorbell,RELEVANT,How to Install a Doorbell
1,24 in. x 18 in. garage storage shelves,garage-storage-ideas,RELEVANT,Garage Storage Ideas
2,sheridan kitchen cabinet pull,best-kitchen-cabinet-types-for-your-home,RELEVANT,Best Kitchen Cabinet Types for Your Home
3,multi-colored blissful nights mattresses,how-to-pick-the- best-mattress-accessories,RELEVANT,Best Mattress Toppers and Pads for a Restful S...
4,5 gal #hdppgr63u paint with primer,how-to-paint-trim,RELEVANT,How to Paint Trim
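Beyond eyeballing the first rows, you may want to check how many synthetic records are exact copies of training records and how the labels are distributed. The following is a minimal sketch with pandas. The `count_training_duplicates` helper, the toy DataFrames, and the `NOT_RELEVANT` label value are illustrative assumptions, and the count it produces is not guaranteed to match how Gretel's report computes its own duplicate metric.

```python
import pandas as pd


def count_training_duplicates(train_df: pd.DataFrame, synth_df: pd.DataFrame) -> int:
    """Count synthetic rows that exactly match a training row on all shared columns."""
    shared = [col for col in train_df.columns if col in synth_df.columns]
    # De-duplicate the training side so each synthetic row is counted at most once.
    train_unique = train_df[shared].drop_duplicates()
    merged = synth_df[shared].merge(
        train_unique, on=shared, how="left", indicator=True
    )
    return int((merged["_merge"] == "both").sum())


# Hypothetical toy data standing in for the training and synthetic tables.
train = pd.DataFrame(
    {
        "searchTerm": ["door chime", "garage shelves", "cabinet pull"],
        "label": ["RELEVANT", "NOT_RELEVANT", "RELEVANT"],
    }
)
synth = pd.DataFrame(
    {
        "searchTerm": ["door chime", "paint primer", "cabinet pull", "door chime"],
        "label": ["RELEVANT", "RELEVANT", "NOT_RELEVANT", "RELEVANT"],
    }
)

n_duplicates = count_training_duplicates(train, synth)
label_share = synth["label"].value_counts(normalize=True)

print(f"Synthetic rows copied from training data: {n_duplicates}")
print(label_share)
```

With real data, `train` would be the CSV passed to training and `synth` would be `df_synth_report`.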


In [31]:
# save the synthetic data to CSV; index=False keeps the row index out of the file
df_synth_report.to_csv("synthetic_data.csv", index=False)