## Navigation
1. [Start Here](hey.ipynb)
1. [Load Data and Clean](/eda.ipynb)
1. [To Clean, or Not To Clean?](eval_v1.ipynb)
1. Generate Datasets
    1. [Faker Naive](faker_naive.ipynb)
    1. [Faker Plus](faker_plus.ipynb)
    1. [SDV Naive](sdv_v1.ipynb)
    1. [SDV More Better](sdv_v2.ipynb)
    1. [SDV TVAE]()
1. Compare and Evaluate Performance
    1. [First impressions](eval_v2.ipynb)
    1. [Loan financial models](eval_v3.ipynb)
    1. [Predicting default risk](eval_v4.ipynb)
    1. [How hackable]()

# Synthetic Data Vault v 1.0
> #### Game mode: Lazy
Let's see how good the synthetic data is when we use only the defaults.

### Generate the metadata for SDV

In [2]:
# Get started by creating a blank SingleTableMetadata object
# (it is replaced by the metadata loaded from JSON in a later cell)
from sdv.metadata import SingleTableMetadata
metadata = SingleTableMetadata()

In [3]:
# Put our clean(er) data into a dataframe
import pandas as pd
# Display all the things
pd.set_option('display.max_columns', 120)
pd.set_option('display.max_rows', 500)

real_data = pd.read_csv('FILEPATH',compression='gzip')

In [4]:
# Load the previously saved metadata from a JSON file
from sdv.metadata import Metadata

metadata = Metadata.load_from_json('FILEPATH')

### No distributions + No Constraints

In [6]:
# Create the synthesizer with the default distribution settings and no constraints
from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(
    metadata,  # required
    enforce_min_max_values=False,
    enforce_rounding=False,
)
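The two flags switched off above are easiest to understand by what they would otherwise do. The sketch below is a plain-Python illustration, not SDV's internal code. It assumes that enforcing min/max amounts to clipping sampled values to the range seen in the real column, and that enforcing rounding amounts to rounding to the real column's number of decimal places. The helper names and sample values are hypothetical.

```python
from decimal import Decimal


def clip_to_real_range(synthetic, real):
    """Clip each synthetic value into [min(real), max(real)]."""
    lo, hi = min(real), max(real)
    return [min(max(v, lo), hi) for v in synthetic]


def decimal_places(values):
    """Largest number of digits after the decimal point among the values."""
    places = 0
    for v in values:
        exponent = Decimal(str(v)).normalize().as_tuple().exponent
        places = max(places, -exponent)
    return places


# Illustrative values only (not taken from the loan data)
real = [10.5, 12.25, 20.0]
synthetic = [8.1234, 15.6789, 25.5]

clipped = clip_to_real_range(synthetic, real)
places = decimal_places(real)
rounded = [round(v, places) for v in clipped]

print(clipped)  # [10.5, 15.6789, 20.0]
print(places)   # 2
print(rounded)  # [10.5, 15.68, 20.0]
```

With both flags set to `False`, as in this notebook, neither kind of post-processing is applied, so sampled values may fall outside the real range and carry more decimal digits than the real data.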

In [7]:
# Pre-process the data using the defaults
synthesizer.auto_assign_transformers(real_data)

In [None]:
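# View the assigned transformers and write them to a text file
# (filename is assumed to be defined earlier in the notebook)
transformers = synthesizer.get_transformers()

with open(filename, 'w') as f:
    f.write(str(transformers))

# Only the last expression in a cell is displayed
transformers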
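Writing `str(...)` of a dictionary puts every column on a single line, which is hard to read for wide tables. A small sketch of an alternative, using the standard library's `pprint.pformat` to get one entry per line. The dictionary below is a hypothetical stand-in for the transformer mapping; its column names and values are placeholders, not real output.

```python
from pprint import pformat

# Hypothetical stand-in for the column -> transformer mapping
transformers = {
    'loan_amount': 'NumericalPlaceholder',
    'term': 'CategoricalPlaceholder',
    'issue_date': 'DatetimePlaceholder',
}

# width is small enough that the dict cannot fit on one line,
# so pformat breaks it into one key/value pair per line
readable = pformat(transformers, width=60, sort_dicts=False)
print(readable)
```

The resulting string can be passed to `f.write(...)` in place of `str(...)`.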

In [9]:
# To learn a machine learning model based on your real data, use the fit method.
synthesizer.fit(real_data)

In [10]:
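# Get the learned distributions
# After fitting, you can access the learned distribution for each column
learned_dist = synthesizer.get_learned_distributions()
# One row per column; keep the index so the column names are written to the CSV
pd.DataFrame.from_dict(learned_dist, orient='index').to_csv(filename, index_label='column')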

In [11]:
# Saving the trained synthesizer as a Python pickle file for future use
synthesizer.save(filepath=filename)

### Sampling data with the synthesizer

In [None]:
import os

# How many rows do we need? Match the real data, sampled in roughly 12 batches
n_rows = len(real_data)
print('Number of rows:', n_rows)
batch = round(n_rows/12)

if os.path.exists(filename):
    os.remove(filename)

synthetic_data = synthesizer.sample(
    num_rows=n_rows,
    batch_size=batch,
    output_file_path=filename
)
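The batch arithmetic above can be checked in isolation. This is a minimal sketch, assuming rows are drawn in chunks of `batch_size` until `num_rows` is reached; `plan_batches` is a hypothetical helper, not part of SDV. It shows two edge cases: rounding down can yield 13 batches instead of 12, and very small tables would round to a batch size of zero without a guard.

```python
import math


def plan_batches(n_rows, target_batches=12):
    """Return (batch_size, actual_batches) for sampling n_rows in chunks."""
    # Same rounding as the cell above, but never let the batch size reach zero
    batch_size = max(1, round(n_rows / target_batches))
    actual_batches = math.ceil(n_rows / batch_size)
    return batch_size, actual_batches


print(plan_batches(1200))  # (100, 12)
print(plan_batches(1000))  # (83, 13)
print(plan_batches(5))     # (1, 5)
```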

### Diagnostics and evaluation

In [None]:
from sdv.evaluation.single_table import run_diagnostic, evaluate_quality
from sdv.evaluation.single_table import get_column_plot

print(metadata)

# 1. perform basic validity checks
diagnostic_report = run_diagnostic(real_data, synthetic_data, metadata)

# Save the diagnostic report (the report is written as a pickle file)
diagnostic_report.save(filepath='FILEPATH')

### Quality check

In [14]:
%matplotlib inline

In [None]:
from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata)

# Save the Column Shapes details to a CSV
cols = quality_report.get_details(property_name='Column Shapes')
cols.to_csv('FILEPATH', index=False)

# Save the quality report to a pkl file
quality_report.save(filepath='FILEPATH')
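For numerical columns, the Column Shapes score is based on the KSComplement metric: one minus the Kolmogorov-Smirnov statistic between the real and synthetic values. The sketch below is a plain-Python reimplementation for intuition only, not the library's code; `ks_complement` and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_complement(real, synthetic):
    """1 minus the largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The largest gap between two step-function CDFs occurs at a data point
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


print(ks_complement([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0  (identical shapes)
print(ks_complement([1, 2, 3, 4], [2, 3, 4, 5]))  # 0.75 (slightly shifted)
print(ks_complement([1, 2, 3, 4], [5, 6, 7, 8]))  # 0.0  (no overlap)
```

A score near 1.0 means the synthetic column's distribution closely tracks the real one; a score near 0.0 means the two barely overlap.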

In [16]:
# Create the column quality visualization and save it as a JPG
fig = quality_report.get_visualization(property_name='Column Shapes')

try:
    # Static image export needs an image engine (for example kaleido) to be installed
    fig.write_image(file=filename, format='jpg', scale=1.5)
except Exception as e:
    print(e)