In this notebook, we'll walk through the basics of preparing your data for use with SDV. This includes loading your dataset and creating a metadata description for your tables. You can download the files used in this tutorial from the following link:
https://drive.google.com/drive/folders/1WRcrmDT_S9xq9CpqzD7WxSpWbLkwadAe?usp=sharing

# 1. Installing the dependencies
We will first install the ***sdv*** library.

In [None]:
!pip install sdv

# 2. Loading the CSV file
Upload the CSV file for which you want to generate synthetic data. In this tutorial, we'll use a sample sales dataset, but feel free to upload your own file instead.

## 2.1 Upload your CSV file

In [None]:
from google.colab import files

uploaded = files.upload()

## 2.2 Loading the data in Python

In [None]:
from sdv.io.local import CSVHandler

connector = CSVHandler()
FOLDER_NAME = '/content/'

data = connector.read(folder_name=FOLDER_NAME)

## 2.3 Inspecting the data

In [None]:
data.keys()

In [None]:
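# Tables are keyed by CSV file name (without the extension); change 'data' if your file is named differently
salesDf = data['data']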
salesDf.head()
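
The key `'data'` used above comes from the file name. As far as we understand `CSVHandler`, each CSV file in the folder becomes one table, named after the file without its `.csv` extension. The sketch below uses only the standard library to predict which keys to expect for a folder; the helper `expected_table_names` and the file names are hypothetical, and the naming rule is an assumption worth checking against `data.keys()`.

```python
import tempfile
from pathlib import Path


def expected_table_names(folder_name):
    """Return the table names a folder of CSV files would likely produce.

    Assumption: each table is named after its CSV file, minus the extension.
    """
    return sorted(path.stem for path in Path(folder_name).glob('*.csv'))


# Demo on a throwaway folder; 'data.csv' and 'stores.csv' are hypothetical file names.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, 'data.csv').write_text('Date,Sales\n01-01-2023,100\n')
    Path(folder, 'stores.csv').write_text('Store,City\n1,Pune\n')
    Path(folder, 'notes.txt').write_text('not a table\n')  # ignored: not a CSV
    table_names = expected_table_names(folder)

print(table_names)
```

In Colab you could call `expected_table_names('/content/')` before reading the folder to see which keys the returned dictionary should contain.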

# 3. Loading the Metadata file
We will now upload the metadata file for the CSV file we uploaded.

## 3.1 Upload your metadata.json file

In [None]:
from google.colab import files

uploaded = files.upload()

## 3.2 Loading the metadata into Python

In [None]:
from sdv.metadata import Metadata
metadata = Metadata.load_from_json('metadata.json')
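
If you are writing `metadata.json` by hand, it helps to know roughly what the file contains. The sketch below builds a small example as a Python dictionary, saves it as JSON, and reads it back. The column names `Date` and `Sales` are the ones used later in this tutorial; the table name `data`, the sdtypes, and the overall layout are our assumptions about the format, so compare them with a file saved by SDV itself before relying on them.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical metadata for a single table called 'data' (layout is an assumption).
example_metadata = {
    'METADATA_SPEC_VERSION': 'V1',
    'tables': {
        'data': {
            'columns': {
                'Date': {'sdtype': 'datetime', 'datetime_format': '%d-%m-%Y'},
                'Sales': {'sdtype': 'numerical'},
            }
        }
    },
    'relationships': [],
}

# Write the dictionary to a temporary metadata.json and load it again.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder, 'metadata.json')
    path.write_text(json.dumps(example_metadata, indent=4))
    loaded = json.loads(path.read_text())

# List each column with its declared sdtype for a quick review.
column_sdtypes = {
    name: spec['sdtype']
    for name, spec in loaded['tables']['data']['columns'].items()
}
print(column_sdtypes)
```

Listing the columns this way is a quick check that every column has the sdtype you intended before handing the file to SDV.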

## 3.3 (Optional) Creating the metadata using SDV
Alternatively, SDV can infer the metadata automatically from the loaded data. Detection is not always accurate or complete, so review the result and correct any discrepancies. Note that running the cell below replaces the `metadata` object loaded in section 3.2.

In [None]:
from sdv.metadata import Metadata

metadata = Metadata.detect_from_dataframes(data)

In [None]:
print('Auto-detected metadata:\n')
metadata.visualize()

## 3.4 Validating the metadata
Let's validate that the metadata format makes sense. If successful, the code below should run without any errors.

In [None]:
metadata.validate()

# 4. Creating Synthetic Data
With these preparatory steps complete, we can now use the metadata and original dataset with SDV. The code below trains a model and generates synthetic data.

In [None]:
from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=salesDf)

You can specify the number of rows you want the synthesizer to generate using the *num_rows* argument.

In [None]:
synthetic_data = synthesizer.sample(num_rows=10000)

## 4.1 Evaluating the synthetic data

In [None]:
from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    salesDf,
    synthetic_data,
    metadata)
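
To give a feel for what the quality score measures: for numerical columns, the Column Shapes part of the report is, as far as we know, based on the KSComplement metric, which is 1 minus the two-sample Kolmogorov-Smirnov statistic. The function below is a plain-Python illustration of that idea, not SDV's implementation, and the sample values are made up.

```python
from bisect import bisect_right


def ks_complement(real_values, synthetic_values):
    """Return 1 minus the largest gap between the two empirical CDFs.

    1.0 means the distributions look identical; 0.0 means they do not overlap.
    """
    real_sorted = sorted(real_values)
    synthetic_sorted = sorted(synthetic_values)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for value in set(real_sorted) | set(synthetic_sorted):
        cdf_real = bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synthetic = bisect_right(synthetic_sorted, value) / len(synthetic_sorted)
        largest_gap = max(largest_gap, abs(cdf_real - cdf_synthetic))
    return 1.0 - largest_gap


# Made-up sales figures for illustration only.
identical = ks_complement([1, 2, 3], [1, 2, 3])
partial = ks_complement([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_complement([1, 2, 3], [10, 11, 12])
print(identical, partial, disjoint)
```

A score near 1 for a column therefore suggests that the synthetic values follow the same distribution as the real ones, while a low score points to a column worth inspecting with a plot.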

## 4.2 Visualizing the synthetic vs. real data distribution

In [None]:
from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
    real_data=salesDf,
    synthetic_data=synthetic_data,
    column_name='Sales',
    metadata=metadata
)

fig.show()

## 4.3 Using Matplotlib to see the Average Monthly Sales for Real & Synthetic Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Ensure 'Date' columns are datetime
salesDf['Date'] = pd.to_datetime(salesDf['Date'], format='%d-%m-%Y')
synthetic_data['Date'] = pd.to_datetime(synthetic_data['Date'], format='%d-%m-%Y')

# Extract 'Month' as year-month string
salesDf['Month'] = salesDf['Date'].dt.to_period('M').astype(str)
synthetic_data['Month'] = synthetic_data['Date'].dt.to_period('M').astype(str)

# Group by 'Month' and calculate average sales
actual_avg_monthly = salesDf.groupby('Month')['Sales'].mean().rename('Actual Average Sales')
synthetic_avg_monthly = synthetic_data.groupby('Month')['Sales'].mean().rename('Synthetic Average Sales')

# Merge the two series into a DataFrame; a month missing from one dataset gets an average of 0
avg_monthly_comparison = pd.concat([actual_avg_monthly, synthetic_avg_monthly], axis=1).fillna(0)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Actual Average Sales'], label='Actual Average Sales', marker='o')
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Synthetic Average Sales'], label='Synthetic Average Sales', marker='o')

plt.title('Average Monthly Sales Comparison: Actual vs Synthetic')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.ylim(bottom=0)  # y-axis starts at 0
plt.tight_layout()
plt.show()
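
One caveat with the `fillna(0)` step above: a month that appears in only one dataset is plotted as an average of 0, which can look like a real drop in sales. The sketch below, with made-up monthly averages, shows the alternative of leaving those months as `NaN`, which Matplotlib draws as a gap in the line instead of a point at zero.

```python
import pandas as pd

# Made-up monthly averages; each series is missing one month the other has.
actual = pd.Series({'2023-01': 100.0, '2023-02': 120.0}, name='Actual Average Sales')
synthetic = pd.Series({'2023-02': 110.0, '2023-03': 90.0}, name='Synthetic Average Sales')

# Outer-join the two series on month and keep missing values as NaN.
comparison = pd.concat([actual, synthetic], axis=1).sort_index()
print(comparison)
```

Whether to fill with 0 or keep the gaps is a presentation choice; keeping `NaN` makes it easier to spot months that the synthesizer never generated.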