# Synthetic data capability

## Summary

This notebook generates a synthetic training dataset for use with DataRobot models.

It outlines how to create a synthetic training dataset as a CSV file with name, address, phone number, company, account number, credit score, and a good-loan-candidate flag, then upload it to DataRobot, train models on it, and deploy the top performing model.

## Requirements

The `datarobot` package requires an API token and an endpoint to interact with DataRobot. See https://docs.datarobot.com/en/docs/api/api-quickstart/index.html#configure-api-authentication for the available authentication methods and pick the one relevant to you.
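
As a minimal sketch of one such method, the snippet below reads the credentials from environment variables. The variable names `DATAROBOT_API_TOKEN` and `DATAROBOT_ENDPOINT` and the default endpoint URL are assumptions that should be checked against the page linked above, and `load_datarobot_credentials` is a hypothetical helper, not part of the `datarobot` package.

```python
import os


def load_datarobot_credentials(env=None):
    """Return a dict with the API token and endpoint read from environment variables.

    Hypothetical helper for illustration; not part of the datarobot package.
    """
    env = os.environ if env is None else env
    token = env.get("DATAROBOT_API_TOKEN")
    if not token:
        raise RuntimeError("Set DATAROBOT_API_TOKEN before running this notebook.")
    # Assumed default: the managed cloud endpoint. Self-managed installs use their own URL.
    endpoint = env.get("DATAROBOT_ENDPOINT", "https://app.datarobot.com/api/v2")
    return {"token": token, "endpoint": endpoint}
```

With the package imported as `dr`, the result could then be passed to the client, for example `dr.Client(**load_datarobot_credentials())`.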

In [None]:
# These are the packages used in this accelerator.
# DataRobot notebooks install packages with the format below. If you are running this in a DataRobot notebook, uncomment the entries below.

# !pip install datarobot
# !pip install faker
# !pip install pandas

## Setup

### Import libraries

In [None]:
from io import StringIO

import datarobot as dr
from datarobot import Dataset as ds
from faker import Faker
import pandas as pd

## Generate synthetic data

In [None]:
# Create a CSV string with 10,000 rows consisting of these columns:
# row ID (added by include_row_ids=True)
# fake full name
# fake address
# fake phone number
# fake company
# fake account number (BBAN)
# credit score (random integer between 300 and 850)
# good loan candidate (True/False)

Faker.seed(0)
fake = Faker()
fake.set_arguments("credit_score", {"min_value": 300, "max_value": 850})
people_csv = fake.csv(
    header=(
        "Name",
        "Address",
        "Phone_Number",
        "Company",
        "Account_Number",
        "Credit_Score",
        "Good_Loan_Candidate",
    ),
    data_columns=(
        "{{name}}",
        "{{address}}",
        "{{phone_number}}",
        "{{company}}",
        "{{bban}}",
        "{{pyint:credit_score}}",
        "{{boolean}}",
    ),
    num_rows=10000,
    include_row_ids=True,
)
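
Note that `{{boolean}}` is drawn independently of every other column, so `Good_Loan_Candidate` has no relationship to `Credit_Score` and a model trained on it has no signal to learn. That is fine for exercising the workflow end to end. If a learnable target is wanted, one option is to derive the label from the score. The sketch below is an illustration only: `label_from_score`, the cutoff of 670, and the 10% label noise are hypothetical choices, not part of the original notebook.

```python
import random


def label_from_score(score, rng, cutoff=670, noise=0.1):
    """Return True when score >= cutoff, flipping the label with probability `noise`."""
    label = score >= cutoff
    if rng.random() < noise:
        label = not label
    return label


rng = random.Random(0)  # fixed seed so the labels are reproducible
scores = [rng.randint(300, 850) for _ in range(5)]
labels = [label_from_score(s, rng) for s in scores]
print(list(zip(scores, labels)))
```

The noise term keeps the target from being a perfect function of one feature, which would otherwise make every model trivially accurate.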

## Load the data into a DataFrame

In [None]:
# Use StringIO to create a file-like object for pandas to read from
csv_file = StringIO(people_csv)

# Read the CSV into a DataFrame
df = pd.read_csv(csv_file)

# Now 'df' is your DataFrame
print(df)
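
Before uploading, it can help to confirm that the generated text has the expected shape. The sketch below uses only the standard library and checks the column names, the credit score range, and the target values described above. `check_people_csv` is a hypothetical helper, and the `"True"`/`"False"` strings are an assumption about how the `{{boolean}}` token is rendered.

```python
import csv
from io import StringIO

EXPECTED_COLUMNS = [
    "Name", "Address", "Phone_Number", "Company",
    "Account_Number", "Credit_Score", "Good_Loan_Candidate",
]


def check_people_csv(csv_text, min_score=300, max_score=850):
    """Return the number of data rows, raising ValueError on unexpected content."""
    # newline="" lets the csv module handle line breaks inside quoted fields
    reader = csv.DictReader(StringIO(csv_text, newline=""))
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    count = 0
    for row in reader:
        score = int(row["Credit_Score"])
        if not min_score <= score <= max_score:
            raise ValueError(f"Credit score out of range: {score}")
        if row["Good_Loan_Candidate"] not in ("True", "False"):
            raise ValueError(f"Unexpected target value: {row['Good_Loan_Candidate']}")
        count += 1
    return count
```

Calling `check_people_csv(people_csv)` on the string generated earlier should return the row count requested in `num_rows`.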

## Load CSV into AI Catalog

In [None]:
# Write the synthetic data CSV to a file on disk.
# newline="" keeps the CSV's own line endings unchanged on every platform.
with open("people.csv", "w", newline="") as file:
    file.write(people_csv)

# Upload the file to DataRobot
# https://datarobot-public-api-client.readthedocs-hosted.com/en/latest-release/autodoc/api_reference.html#datasets

people_dataset = ds.upload("people.csv")

# Get the dataset ID
people_dataset_id = people_dataset.id

## Load synthetic data into AutoML

In [None]:
project = dr.Project.create_from_dataset(people_dataset_id, project_name="Good_Loan_Candidate")

## Initiate autopilot

In [None]:
project.analyze_and_model(target="Good_Loan_Candidate", mode=dr.AUTOPILOT_MODE.FULL_AUTO)

## Retrieve top performing model

In [None]:
# Wait for Autopilot to finish before retrieving the top model
project.wait_for_autopilot()

model = project.get_top_model()

print(f"The top performing model is {model}")

## Deploy chosen model

In [None]:
# Get the prediction server
prediction_server = dr.PredictionServer.list()[0]

# Create a deployment
deployment = dr.Deployment.create_from_learning_model(
    model.id,
    label="Synthetic data test",
    description="Model trained on synthetic dataset with names, addresses, credit scores, etc.",
    default_prediction_server_id=prediction_server.id,
)

print(deployment.id)