<a href="https://colab.research.google.com/github/sudarshan-koirala/youtube-stuffs/blob/main/langchain/synthetic_data_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Use case

Synthetic data refers to artificially generated data that imitates the characteristics of real data without containing any information from actual individuals or entities. It is typically created through mathematical models, algorithms, or other data generation techniques. Synthetic data can be used for a variety of purposes, including testing, research, and training machine learning models, while preserving privacy and security.

Benefits of Synthetic Data:

1. **Privacy and Security**: No real personal data at risk of breaches.
2. **Data Augmentation**: Expands datasets for machine learning.
3. **Flexibility**: Create specific or rare scenarios.
4. **Cost-effective**: Often cheaper than real-world data collection.
5. **Regulatory Compliance**: Helps navigate strict data protection laws.
6. **Model Robustness**: Can help AI models generalize better.
7. **Rapid Prototyping**: Enables quick testing without real data.
8. **Controlled Experimentation**: Simulate specific conditions.
9. **Access to Data**: Alternative when real data isn't available.

**Note: Despite the benefits, synthetic data should be used carefully, as it may not always capture real-world complexities.**

## Quickstart

In this notebook, we'll generate synthetic question/occupation pairs, where each record links a user's question to the occupation it relates to.

## Setup

In [1]:
# set environment variables
# https://platform.openai.com/account/api-keys
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

In [2]:
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_openai.chat_models import ChatOpenAI
from langchain.pydantic_v1 import BaseModel
from langchain_experimental.tabular_synthetic_data.openai import create_openai_data_generator
from langchain_experimental.tabular_synthetic_data.prompts import SYNTHETIC_FEW_SHOT_SUFFIX, SYNTHETIC_FEW_SHOT_PREFIX

## 1. Define Your Data Model
- Every dataset has a structure or a "schema".
- The Occupation class below serves as our schema for the synthetic data.
- By defining this, we're informing our synthetic data generator about the shape and nature of data we expect.

In [3]:
class Occupation(BaseModel):
    question: str
    occupation: str


## 2. Sample Data
To guide the synthetic data generator, it's useful to provide it with a few real-world-like examples. These examples serve as a "seed" - they're representative of the kind of data you want, and the generator will use them to create more data that looks similar.

Here are some fictional question/occupation examples:

In [4]:
examples = [
    {"example": """Question: How can I design a website that looks professional?, occupation:Web Developer"""},
    {"example": """Question: What exercises can help improve my fitness?, occupation:Personal Trainer"""},
    {"example": """Question: How can I teach young children to read?, occupation:Early Childhood Educator"""},
    {"example": """Question: What are the best methods for managing a team?, occupation:Manager"""},
    {"example": """Question: How do I fix a leaking faucet?, occupation:Plumber"""},
    {"example": """Question: How can I create a budget for my business?, occupation:Financial Analyst"""},
    {"example": """Question: What techniques can improve my public speaking skills?, occupation:Public Speaking Coach"""},
    {"example": """Question: How do I care for a dog with special needs?, occupation:Veterinary Technician"""},
    {"example": """Question: What strategies can help improve my mental health?, occupation:Therapist"""},
    {"example": """Question: How do I write a compelling novel?, occupation:Author/Writer"""},
]
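Hand-writing each example string makes it easy for punctuation or spacing to drift between records. As a minimal sketch, the seed list could be built from (question, occupation) pairs instead. The helper name `make_example` and the variable names below are hypothetical and not part of LangChain:

```python
# Hypothetical helper (not part of LangChain): format one seed record in the
# same "Question: ..., occupation:..." style used in the examples list.
def make_example(question: str, occupation: str) -> dict:
    return {"example": f"Question: {question}, occupation:{occupation}"}


# Two pairs taken from the seed examples in this notebook.
seed_pairs = [
    ("How can I design a website that looks professional?", "Web Developer"),
    ("How do I fix a leaking faucet?", "Plumber"),
]

# Each entry has the single "example" key that the prompt template expects.
seed_examples = [make_example(q, o) for q, o in seed_pairs]
print(seed_examples[0]["example"])
```

The resulting list has the same shape as `examples`, so it could be passed to the prompt template in the same way.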

## 3. Craft a Prompt Template
The generator doesn't magically know how to create our data; we need to guide it. We do this by creating a prompt template. This template helps instruct the underlying language model on how to produce synthetic data in the desired format.

In [5]:
OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")

prompt_template = FewShotPromptTemplate(
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
    examples=examples,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=OPENAI_TEMPLATE,
)

The `FewShotPromptTemplate` includes:

- `prefix` and `suffix`: These likely contain guiding context or instructions.
- `examples`: The sample data we defined earlier.
- `input_variables`: These variables ("subject", "extra") are placeholders you can dynamically fill later. For instance, "subject" might be filled with "occupation" to guide the model further.
- `example_prompt`: This prompt template is the format we want each example row to take in our prompt.
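To make the roles of these parts concrete, here is a plain-Python approximation of how a few-shot prompt can be assembled. It is not LangChain's implementation: `build_few_shot_prompt` is a hypothetical function, the `PREFIX` and `SUFFIX` strings are invented stand-ins for `SYNTHETIC_FEW_SHOT_PREFIX` and `SYNTHETIC_FEW_SHOT_SUFFIX`, and joining the parts with blank lines is an assumption of this sketch.

```python
# Invented stand-ins; the wording of the real LangChain constants differs.
PREFIX = "Generate synthetic data about {subject}. Examples follow:"
SUFFIX = "Now generate one new record about {subject}. {extra}"


def build_few_shot_prompt(prefix, examples, suffix, example_template, **variables):
    """Render prefix, each example, and suffix, then join them with blank lines."""
    # Each example dict is rendered through the per-example template.
    rendered_examples = [example_template.format(**ex) for ex in examples]
    # Only the prefix and suffix receive the dynamic input variables.
    parts = [prefix.format(**variables), *rendered_examples, suffix.format(**variables)]
    return "\n\n".join(parts)


demo_examples = [
    {"example": "Question: How do I fix a leaking faucet?, occupation:Plumber"},
    {"example": "Question: How do I write a compelling novel?, occupation:Author/Writer"},
]

prompt_text = build_few_shot_prompt(
    PREFIX,
    demo_examples,
    SUFFIX,
    example_template="{example}",
    subject="occupations",
    extra="Make the question unique.",
)
print(prompt_text)
```

The sketch shows why `input_variables` lists only `subject` and `extra`: the examples are fixed when the template is built, while those two values are supplied at generation time.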

## 4. Creating the Data Generator
With the schema and the prompt ready, the next step is to create the data generator. This object knows how to communicate with the underlying language model to get synthetic data.

In [6]:
synthetic_data_generator = create_openai_data_generator(
    output_schema=Occupation,
    llm=ChatOpenAI(temperature=1),
    prompt=prompt_template,
)

## 5. Generate Synthetic Data
Finally, let's get our synthetic data!

In [7]:
synthetic_results = synthetic_data_generator.generate(
    subject="Determine Occupation based on users question",
    extra="Based on the question, determine the related occupation. The question must be unique and more user-like",
    runs=10,
)

This command asks the generator to produce 10 synthetic question/occupation records. The results are stored in `synthetic_results`. The output will be a list of `Occupation` pydantic models.

In [8]:
type(synthetic_results)

list

## 6. Visualize the Generated Synthetic Data

In [9]:
len(synthetic_results)

10

In [10]:
synthetic_results

[Occupation(question='How can I improve my cooking skills?', occupation='Chef'),
 Occupation(question='Where can I find the best ingredients for baking a cake?', occupation='Chef'),
 Occupation(question='How can I prepare healthy meals for my family?', occupation='Nutritionist'),
 Occupation(question='How can I train my cat to use the litter box effectively?', occupation='Animal Behaviorist'),
 Occupation(question='How do I start investing in the stock market?', occupation='Financial Advisor'),
 Occupation(question='How can I improve my photography skills?', occupation='Photographer'),
 Occupation(question='How can I become a better graphic designer?', occupation='Graphic Designer'),
 Occupation(question='How can I learn to play the guitar like a pro?', occupation='Musician'),
 Occupation(question='Where can I find the best ingredients for baking a cake?', occupation='Chef'),
 Occupation(question='How can I start my own food blog?', occupation='Food Blogger')]
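The output above contains the same baking question twice, so it can be worth removing duplicates before using the data. A minimal sketch that drops repeated questions while preserving order follows. It uses a `NamedTuple` stand-in for the `Occupation` model so that it runs without an API call, and `dedupe_by_question` is a hypothetical helper name:

```python
from typing import NamedTuple


class Record(NamedTuple):
    """Stand-in for the Occupation model; only the attribute names matter here."""
    question: str
    occupation: str


def dedupe_by_question(records):
    """Keep the first record for each question, ignoring case and outer whitespace."""
    seen = set()
    unique = []
    for record in records:
        key = record.question.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


sample = [
    Record("Where can I find the best ingredients for baking a cake?", "Chef"),
    Record("How can I start my own food blog?", "Food Blogger"),
    Record("Where can I find the best ingredients for baking a cake?", "Chef"),
]

unique_records = dedupe_by_question(sample)
print(len(unique_records))
```

Because the helper only reads the `question` attribute, the same function should also work on the list of `Occupation` objects in `synthetic_results`.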

## 7. Convert the Synthetic Data into a pandas DataFrame

In [11]:
import pandas as pd

# Create a list of dictionaries from the objects
synthetic_data = []
for item in synthetic_results:
    synthetic_data.append({
        'question': item.question,
        'occupation': item.occupation,
    })

# Create a Pandas DataFrame from the list of dictionaries
synthetic_df = pd.DataFrame(synthetic_data)

# Display the DataFrame
print(type(synthetic_df))
synthetic_df

<class 'pandas.core.frame.DataFrame'>


| | question | occupation |
|---|---|---|
| 0 | How can I improve my cooking skills? | Chef |
| 1 | Where can I find the best ingredients for baki... | Chef |
| 2 | How can I prepare healthy meals for my family? | Nutritionist |
| 3 | How can I train my cat to use the litter box e... | Animal Behaviorist |
| 4 | How do I start investing in the stock market? | Financial Advisor |
| 5 | How can I improve my photography skills? | Photographer |
| 6 | How can I become a better graphic designer? | Graphic Designer |
| 7 | How can I learn to play the guitar like a pro? | Musician |
| 8 | Where can I find the best ingredients for baki... | Chef |
| 9 | How can I start my own food blog? | Food Blogger |


In [12]:
synthetic_df.shape

(10, 2)
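A quick sanity check on a generated dataset is to count how often each label appears, since a small run can cluster around a few occupations. The sketch below uses only the standard library and copies the occupation values from the output shown in this notebook; the variable names are illustrative:

```python
from collections import Counter

# Occupation labels copied from the 10 generated records shown above.
occupations = [
    "Chef", "Chef", "Nutritionist", "Animal Behaviorist", "Financial Advisor",
    "Photographer", "Graphic Designer", "Musician", "Chef", "Food Blogger",
]

# Count the occurrences of each label and list the most frequent ones first.
label_counts = Counter(occupations)
print(label_counts.most_common(3))
```

In this run, "Chef" accounts for 3 of the 10 records, which is the kind of skew to look for before using the data to train a model.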