
## Use case (adapted from a LangChain example notebook)

Synthetic data is artificially generated data, rather than data collected from real-world events. It's used to simulate real data without compromising privacy or encountering real-world limitations.

Benefits of Synthetic Data:

1. **Privacy and Security**: No real personal data at risk of breaches.
2. **Data Augmentation**: Expands datasets for machine learning.
3. **Flexibility**: Create specific or rare scenarios.
4. **Cost-effective**: Often cheaper than real-world data collection.
5. **Regulatory Compliance**: Helps navigate strict data protection laws.
6. **Model Robustness**: Can lead to better generalizing AI models.
7. **Rapid Prototyping**: Enables quick testing without real data.
8. **Controlled Experimentation**: Simulate specific conditions.
9. **Access to Data**: Alternative when real data isn't available.

Note: Despite the benefits, synthetic data should be used carefully, as it may not always capture real-world complexities.



### Setup
First, install the `langchain` library and its dependencies. Because we use the OpenAI generator chain, we also install `openai`. The synthetic data tooling lives in an experimental package, so `langchain_experimental` is included in the install as well. We then import the necessary modules.

In [None]:
!pip install -U langchain langchain_experimental openai
# Set env var OPENAI_API_KEY or load from a .env file:
# import dotenv
# dotenv.load_dotenv()

from langchain.chat_models import ChatOpenAI
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_experimental.tabular_synthetic_data.openai import (
    OPENAI_TEMPLATE,
    create_openai_data_generator,
)
from langchain_experimental.tabular_synthetic_data.prompts import (
    SYNTHETIC_FEW_SHOT_PREFIX,
    SYNTHETIC_FEW_SHOT_SUFFIX,
)

Collecting langchain
  Downloading langchain-0.0.349-py3-none-any.whl (808 kB)
Collecting langchain_experimental
  Downloading langchain_experimental-0.0.46-py3-none-any.whl (162 kB)
Collecting openai
  Downloading openai-1.3.8-py3-none-any.whl (221 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.3-py3-none-any.whl (28 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langchain-community<0.1,>=0.0.1 (from langchain)
  Downloading langchain_community-0.0.1-py3-none-any.whl (1.5 MB)

## 1. Define Your Data Model
Every dataset has a structure, or "schema". The `BankingProductPropensity` class below serves as the schema for our synthetic data. By defining it, we tell the synthetic data generator the shape and type of each field we expect.

In [None]:

class BankingProductPropensity(BaseModel):
    customer_id: int
    customer_name: str
    age: int
    employment_status: str
    annual_income: float
    credit_score: int
    existing_products: list[str]
    product_interest: str  # This could be the banking product the customer is likely interested in
    propensity_score: float  # A score indicating the likelihood of the customer being interested in the product


For instance, every record will have a `customer_id` that's an integer, a `customer_name` that's a string, and so on.

## 2. Sample Data
To guide the synthetic data generator, it's useful to provide it with a few real-world-like examples. These examples serve as a "seed" - they're representative of the kind of data you want, and the generator will use them to create more data that looks similar.

Here are some fictional banking customer records:

In [None]:
examples = [
    {
        "example": """Customer ID: 101234, Customer Name: Alex Johnson, Age: 35, Employment Status: Employed,
        Annual Income: 85000, Credit Score: 720, Existing Products: ['Savings Account', 'Credit Card'],
        Product Interest: 'Mortgage', Propensity Score: 0.75"""
    },
    {
        "example": """Customer ID: 102345, Customer Name: Maria Garcia, Age: 28, Employment Status: Self-Employed,
        Annual Income: 67000, Credit Score: 680, Existing Products: ['Checking Account'],
        Product Interest: 'Personal Loan', Propensity Score: 0.65"""
    },
    {
        "example": """Customer ID: 103456, Customer Name: David Smith, Age: 40, Employment Status: Unemployed,
        Annual Income: 32000, Credit Score: 590, Existing Products: ['Credit Card', 'Auto Loan'],
        Product Interest: 'Credit Card Upgrade', Propensity Score: 0.55"""
    },
]
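
Because the generator imitates the seed examples, a seed that omits a field can lead to inconsistent output. The sketch below is one possible sanity check, not part of LangChain: it assumes the field labels in each example string mirror the schema fields, and the helper name `missing_labels` and the `EXPECTED_LABELS` list are hypothetical.

```python
import re

# Labels as they appear in the seed examples (assumed to mirror the schema fields).
EXPECTED_LABELS = [
    "Customer ID", "Customer Name", "Age", "Employment Status",
    "Annual Income", "Credit Score", "Existing Products",
    "Product Interest", "Propensity Score",
]


def missing_labels(example_text: str) -> list[str]:
    """Return the expected labels that never appear as 'Label:' in the text."""
    return [
        label
        for label in EXPECTED_LABELS
        if not re.search(rf"\b{re.escape(label)}\s*:", example_text)
    ]


# The first seed example from this section.
seed = {
    "example": """Customer ID: 101234, Customer Name: Alex Johnson, Age: 35, Employment Status: Employed,
    Annual Income: 85000, Credit Score: 720, Existing Products: ['Savings Account', 'Credit Card'],
    Product Interest: 'Mortgage', Propensity Score: 0.75"""
}

print(missing_labels(seed["example"]))  # prints [] because every label is present
```

Running the same check over every entry in `examples` before building the prompt catches a mistyped or forgotten field early.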


## 3. Craft a Prompt Template
The generator doesn't magically know how to create our data; we need to guide it. We do this by creating a prompt template. This template helps instruct the underlying language model on how to produce synthetic data in the desired format.

In [None]:
# Rebinds the imported OPENAI_TEMPLATE name: each example string is inserted verbatim.
OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")

prompt_template = FewShotPromptTemplate(
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
    examples=examples,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=OPENAI_TEMPLATE,
)

The `FewShotPromptTemplate` includes:

- `prefix` and `suffix`: Text placed before and after the examples; they carry the guiding context and instructions for the model.
- `examples`: The sample data we defined earlier.
- `input_variables`: These variables ("subject", "extra") are placeholders you fill in later. For instance, "subject" is filled with "BankingProductPropensity" in step 5 to guide the model further.
- `example_prompt`: This prompt template is the format we want each example row to take in our prompt.
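
To make the roles of these pieces concrete, here is a plain-Python approximation of how a few-shot prompt is assembled. It is a sketch, not LangChain's implementation: `render_few_shot` is a hypothetical helper, the `PREFIX` and `SUFFIX` strings are assumed stand-ins whose wording may differ from `SYNTHETIC_FEW_SHOT_PREFIX` and `SYNTHETIC_FEW_SHOT_SUFFIX`, and the blank-line separator is an assumption.

```python
# Assumed stand-ins for the library constants; the real wording may differ.
PREFIX = "This is a test about generating synthetic data about {subject}. Examples below:"
SUFFIX = "Now you generate synthetic data about {subject}. Make sure to {extra}:"
EXAMPLE_TEMPLATE = "{example}"  # same shape as the example_prompt above


def render_few_shot(prefix, examples, suffix, separator="\n\n", **variables):
    """Join prefix, rendered examples, and suffix, then fill the placeholders."""
    rendered = [EXAMPLE_TEMPLATE.format(**ex) for ex in examples]
    pieces = [prefix.format(**variables), *rendered, suffix.format(**variables)]
    return separator.join(pieces)


examples = [
    {"example": "Customer ID: 101234, Customer Name: Alex Johnson, Age: 35"},
    {"example": "Customer ID: 102345, Customer Name: Maria Garcia, Age: 28"},
]

prompt = render_few_shot(
    PREFIX,
    examples,
    SUFFIX,
    subject="BankingProductPropensity",
    extra="keep customer IDs unique",
)
print(prompt)
```

Printing the assembled text is a cheap way to see what the model will actually read before spending API calls on generation.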

## 4. Creating the Data Generator
With the schema and the prompt ready, the next step is to create the data generator. This object knows how to communicate with the underlying language model to get synthetic data.

In [None]:
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('openai_api_key')

In [None]:

synthetic_data_generator = create_openai_data_generator(
    output_schema=BankingProductPropensity,
    llm=ChatOpenAI(temperature=1),  # Swap in your own chat model instance if needed
    prompt=prompt_template,
)

## 5. Generate Synthetic Data
Finally, let's get our synthetic data!

In [None]:
synthetic_results = synthetic_data_generator.generate(
    subject="BankingProductPropensity",
    extra="The products include CASA, Credit Card, Mortgage, Term Loan. CIF is unique.",
    runs=10,
)

Print the synthetic data:

In [None]:
# Print the results generated by the data generator
for result in synthetic_results:
    print(result)

customer_id=101234 customer_name='Alex Johnson' age=35 employment_status='Employed' annual_income=85000.0 credit_score=720 existing_products=['Savings Account', 'Credit Card'] product_interest='Mortgage' propensity_score=0.75
customer_id=102345 customer_name='Maria Garcia' age=28 employment_status='Self-Employed' annual_income=67000.0 credit_score=680 existing_products=['Checking Account'] product_interest='Personal Loan' propensity_score=0.65
customer_id=103456 customer_name='David Smith' age=40 employment_status='Unemployed' annual_income=32000.0 credit_score=590 existing_products=['Credit Card', 'Auto Loan'] product_interest='Credit Card Upgrade' propensity_score=0.55
customer_id=104567 customer_name='Sarah Thompson' age=42 employment_status='Employed' annual_income=92000.0 credit_score=760 existing_products=['Savings Account', 'Credit Card', 'Mortgage'] product_interest='Term Loan' propensity_score=0.85
customer_id=102345 customer_name='Maria Garcia' age=28 employment_status='Self-

The `generate` call above asks the generator to produce 10 synthetic banking customer records (`runs=10`). The results are stored in `synthetic_results`, a list of `BankingProductPropensity` pydantic models.
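
The printed output shows that the generator can echo the seed examples and repeat a `customer_id`, so a post-processing pass is usually worthwhile. The sketch below is one possible filter under stated assumptions: `Record` is a hypothetical stand-in for `BankingProductPropensity` (a dataclass with only three of the fields, to keep the example self-contained), and `keep_valid_unique` is an illustrative helper, not a LangChain function. The same attribute access works on the pydantic models.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for BankingProductPropensity (subset of fields)."""
    customer_id: int
    customer_name: str
    propensity_score: float


def keep_valid_unique(records, seed_ids=frozenset()):
    """Drop seed echoes, repeated customer IDs, and out-of-range scores."""
    seen = set(seed_ids)
    kept = []
    for record in records:
        if record.customer_id in seen:
            continue  # echo of a seed example or a repeated ID
        if not 0.0 <= record.propensity_score <= 1.0:
            continue  # a propensity score is expected to lie in [0, 1]
        seen.add(record.customer_id)
        kept.append(record)
    return kept


generated = [
    Record(101234, "Alex Johnson", 0.75),    # echo of a seed example
    Record(104567, "Sarah Thompson", 0.85),  # new and valid
    Record(104567, "Sarah Thompson", 0.85),  # repeated ID
    Record(105678, "Made-up Name", 1.40),    # score out of range
]
seed_ids = {101234, 102345, 103456}

clean = keep_valid_unique(generated, seed_ids)
print([record.customer_id for record in clean])  # prints [104567]
```

Filtering after generation keeps the prompt simple; if too many rows are discarded, requesting more `runs` is one way to compensate.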