# 🎛️ NeMo Safe Synthesizer 101: The Basics

> ⚠️ **Warning**: NeMo Safe Synthesizer is in Early Access and not recommended for production use.


In this notebook, we demonstrate how to create a synthetic version of a tabular dataset using the NeMo Microservices Python SDK.

After completing this notebook, you'll be able to:
- Use the NeMo Microservices SDK to interact with Safe Synthesizer
- Create novel synthetic data that follows the statistical properties of your input dataset
- Access an evaluation report on synthetic data quality and privacy


## 💾 Install dependencies

**IMPORTANT** 👉 Ensure you have a NeMo Microservices Platform deployment available. Follow the quickstart or Helm chart instructions in your environment's setup guide. You may need to restart your kernel after installing dependencies.


In [None]:
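import logging

import pandas as pd
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.safe_synthesizer.builder import SafeSynthesizerBuilder

# Keep notebook output readable: show warnings and errors only,
# and silence the per-request INFO logs emitted by the httpx logger.
logging.basicConfig(level=logging.WARNING)
logging.getLogger("httpx").setLevel(logging.WARNING)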

## ⚙️ Initialize the NeMo Safe Synthesizer Client

- The Python SDK provides a wrapper around the NeMo Microservices Platform APIs.
- `http://localhost:8080` is the default `base_url` for a quickstart deployment.
- For a managed or remote deployment, set the base URL and any access tokens to match that environment.
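
When the same notebook has to run against both a local quickstart and a remote deployment, one option is to read the base URL from the environment rather than hard-coding it. The sketch below is a minimal illustration of that idea: `NEMO_BASE_URL` and `resolve_base_url` are names made up for this example and are not defined by the SDK.

```python
import os

# Default taken from the quickstart described above.
DEFAULT_BASE_URL = "http://localhost:8080"


def resolve_base_url(env=None):
    """Return the platform base URL, preferring an environment override.

    NEMO_BASE_URL is a hypothetical variable name chosen for this sketch;
    it is not something the SDK reads on its own.
    """
    env = os.environ if env is None else env
    url = env.get("NEMO_BASE_URL", DEFAULT_BASE_URL).strip()
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"base URL must start with http:// or https://, got {url!r}")
    # Drop any trailing slash so the value has one canonical form.
    return url.rstrip("/")


print(resolve_base_url({}))
```

The result can then be passed to the client as `NeMoMicroservices(base_url=resolve_base_url())`.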


In [None]:
client = NeMoMicroservices(
    base_url="http://localhost:8080",
)

NeMo DataStore is launched as one of the platform services, and we'll use it to manage storage. Configure its endpoint and token as follows:

In [None]:
datastore_config = {
    "endpoint": "http://localhost:3000/v1/hf",
    "token": "",
}

## 📥 Load input data

Safe Synthesizer learns the patterns and correlations in your input dataset to produce synthetic data with similar properties. For this tutorial, we will use a small public sample dataset. Replace it with your own data if desired.

The sample dataset used here is a set of women's clothing reviews, including age, product category, rating, and review text. Some of the reviews contain Personally Identifiable Information (PII), such as height, weight, age, and location.
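
To get a rough sense of how often reviews mention body measurements before synthesizing, a simple regular-expression scan can help. This is only a heuristic sketch for exploring the input data: the patterns and the `flag_measurements` helper are assumptions made for this example, and they are unrelated to how Safe Synthesizer itself detects or replaces PII.

```python
import re

# Rough patterns for self-reported measurements; illustrative, not exhaustive.
HEIGHT_RE = re.compile(r"\b[4-6]\s?(?:'|ft|feet)\s?(?:\d{1,2})?", re.IGNORECASE)
WEIGHT_RE = re.compile(r"\b\d{2,3}\s?(?:lbs?|pounds)\b", re.IGNORECASE)


def flag_measurements(text):
    """Return the kinds of measurement a review text appears to mention."""
    if not isinstance(text, str):  # missing reviews load as NaN in pandas
        return set()
    found = set()
    if HEIGHT_RE.search(text):
        found.add("height")
    if WEIGHT_RE.search(text):
        found.add("weight")
    return found


print(flag_measurements("I'm 5'4\" and 120 lbs, and the medium fits well"))
```

Assuming the review column is named `Review Text`, the helper could be applied to the whole dataset with `df["Review Text"].map(flag_measurements)`.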

In [None]:
%pip install kagglehub || uv pip install kagglehub

In [None]:
import kagglehub
import pandas as pd

# Download the latest version of the dataset; returns the local directory path
path = kagglehub.dataset_download("nicapotato/womens-ecommerce-clothing-reviews")
# index_col=0 uses the first CSV column as the DataFrame index
df = pd.read_csv(f"{path}/Womens Clothing E-Commerce Reviews.csv", index_col=0)
df.head()

## 🏗️ Create a Safe Synthesizer job

The `SafeSynthesizerBuilder` provides a fluent interface to configure and submit jobs.

The following code creates and submits a job:
- `SafeSynthesizerBuilder(client)`: initialize with the NeMo Microservices client.
- `.from_data_source(df)`: set the input data source.
- `.with_datastore(datastore_config)`: configure model artifact storage.
- `.with_replace_pii()`: enable automatic replacement of PII.
- `.synthesize()`: train and generate synthetic data.
- `.create_job()`: submit the job to the platform.
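
For readers unfamiliar with the fluent style, the toy class below shows the general pattern: each configuration method records a setting and returns the builder itself, so calls can be chained, and a final call produces the result. `ToyJobBuilder` and its methods are invented for illustration and say nothing about how `SafeSynthesizerBuilder` is implemented.

```python
class ToyJobBuilder:
    """Illustrative stand-in only; NOT the SDK's SafeSynthesizerBuilder."""

    def __init__(self):
        self._steps = []

    def enable(self, step):
        # Record the step and return self so that calls can be chained.
        self._steps.append(step)
        return self

    def build(self):
        # The terminal call ends the chain and returns the collected plan.
        return list(self._steps)


plan = ToyJobBuilder().enable("replace_pii").enable("synthesize").build()
print(plan)
```

In the real chain, `.create_job()` plays the terminal role by submitting the configured job to the platform.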


In [None]:
job = (
    SafeSynthesizerBuilder(client)
    .from_data_source(df)
    .with_datastore(datastore_config)
    .with_replace_pii()
    .synthesize()
    .create_job()
)

print(f"job_id = {job.job_id}")
job.wait_for_completion()

print(f"Job finished with status {job.fetch_status()}")

In [None]:
# If your notebook shuts down, the job keeps running on the microservices platform.
# To reattach to it, uncomment the following snippet and replace "<job id>"
# with the job ID printed by the previous cell.

# from nemo_microservices.beta.safe_synthesizer.sdk.job import SafeSynthesizerJob
# job = SafeSynthesizerJob(job_id="<job id>", client=client)
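
Because reattaching depends on knowing the job ID, it can be convenient to write the ID to disk as soon as the job is created. The following is a minimal sketch using only the standard library; the file name and the `save_job_id` / `load_job_id` helpers are hypothetical and not part of the SDK.

```python
from pathlib import Path

JOB_ID_FILE = Path("safe_synthesizer_job_id.txt")  # hypothetical file name


def save_job_id(job_id, path=JOB_ID_FILE):
    """Write the job ID to a small text file."""
    path.write_text(str(job_id).strip() + "\n", encoding="utf-8")


def load_job_id(path=JOB_ID_FILE):
    """Return the saved job ID, or None if nothing has been saved yet."""
    if not path.exists():
        return None
    return path.read_text(encoding="utf-8").strip() or None
```

A typical flow would call `save_job_id(job.job_id)` right after submitting the job, then pass `load_job_id()` as the `job_id` argument when reattaching in a fresh kernel.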

## 👀 View synthetic data

After the job completes, fetch the generated synthetic dataset.

In [None]:
# Fetch the synthetic data created by the job
synthetic_df = job.fetch_data()
synthetic_df
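
Before opening the evaluation report, a quick manual check is to compare how a categorical column is distributed in the input and synthetic data. The sketch below uses small made-up rating lists so it runs on its own; `category_shares` and `largest_share_gap` are helper names invented for this example, and the check is far cruder than the built-in evaluation.

```python
from collections import Counter


def category_shares(values):
    """Map each distinct value to its fraction of the total."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: count / total for key, count in counts.items()}


def largest_share_gap(original, synthetic):
    """Return the biggest absolute difference in share across all values."""
    a = category_shares(original)
    b = category_shares(synthetic)
    keys = a.keys() | b.keys()
    return max((abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys), default=0.0)


# Made-up ratings standing in for one column of each dataset.
original_ratings = [5, 5, 4, 3, 5, 4]
synthetic_ratings = [5, 4, 4, 3, 5, 4]
print(largest_share_gap(original_ratings, synthetic_ratings))
```

Assuming both DataFrames have a `Rating` column, the same call works on `df["Rating"].dropna()` and `synthetic_df["Rating"].dropna()`; a gap near zero suggests the category mix was preserved.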


## 📊 View evaluation report

Safe Synthesizer automatically evaluates the synthetic data against the input data. You can:

- **Inspect key scores**: overall synthetic data quality and privacy.
- **Download the full HTML report**: includes charts and detailed metrics.
- **Display the report inline**: useful when viewing in notebook environments.


In [None]:
# Print selected information from the job summary
summary = job.fetch_summary()
print(
    f"Synthetic data quality score (0-10, higher is better): {summary.synthetic_data_quality_score}"
)
print(f"Data privacy score (0-10, higher is better): {summary.data_privacy_score}")
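
If the scores feed into an automated workflow, one simple pattern is to compare them against minimum values you choose for your own use case. The helper below is a sketch under that assumption: `scores_meet_thresholds` is a made-up name, and the `7.0` defaults are placeholders rather than recommended values.

```python
def scores_meet_thresholds(quality, privacy, min_quality=7.0, min_privacy=7.0):
    """Return (ok, problems) for a pair of 0-10 scores.

    The 7.0 defaults are placeholder values for this sketch, not
    recommendations from the product documentation.
    """
    checks = (
        ("quality", quality, min_quality),
        ("privacy", privacy, min_privacy),
    )
    problems = []
    for name, score, minimum in checks:
        if score is None:
            problems.append(f"{name} score is missing")
        elif score < minimum:
            problems.append(f"{name} score {score} is below {minimum}")
    return (not problems, problems)


ok, problems = scores_meet_thresholds(8.5, 6.0)
print(ok, problems)
```

With the summary from the cell above, the call would be `scores_meet_thresholds(summary.synthetic_data_quality_score, summary.data_privacy_score)`.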


In [None]:
# Download the full evaluation report to your local machine
job.save_report("evaluation_report.html")

In [None]:
# Fetch and display the full evaluation report inline
job.display_report_in_notebook()