# Generating Synthetic Data with Python

Hands-on workshop covering synthetic data generation with Python.

**Agenda**:
- Overview
- Bootstrapping
- SMOTE
- Image Augmentation
- Synthetic Data Vault (SDV)

## Overview

### What is synthetic data?

Data that is not "real" but has been generated to reflect some real-world data or process.

Data access can be challenging:
- Limited resources
- Privacy concerns

Synthetic data is becoming a prominent strategy for improving data access in both of these cases.


### How do we create synthetic data?

Two primary methods:
1. Simulate real data
2. Use an existing model or background knowledge

![Generation Process](images/Synthetic-data-generation.png)


### Key Ideas

- **Utility**: how accurately synthetic data reflects real data or process
    - Depends on use case
- **Anonymization**: ensure protection of sensitive information
- **Reproducibility**: generation process can be replicated
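
As a minimal sketch of the second generation method and of reproducibility, the snippet below draws data from an assumed known process: a linear relationship with Gaussian noise. The function name `simulate_sales`, its parameters, and the process itself are hypothetical and chosen only for illustration. Fixing the seed is what makes the generation replicable.

```python
import numpy as np
import pandas as pd


def simulate_sales(n_rows, seed):
    """Generate a synthetic table from an assumed process (hypothetical example)."""
    rng = np.random.default_rng(seed)  # seeded generator, so output is reproducible
    ad_spend = rng.uniform(0, 100, size=n_rows)
    # Assumed background knowledge: revenue grows linearly with ad spend, plus noise
    revenue = 50 + 2.0 * ad_spend + rng.normal(0, 10, size=n_rows)
    return pd.DataFrame({"ad_spend": ad_spend, "revenue": revenue})


first = simulate_sales(n_rows=500, seed=42)
second = simulate_sales(n_rows=500, seed=42)
print(first.equals(second))  # same seed, same synthetic data
```

Recording the seed alongside the generation code is usually enough for someone else to regenerate the identical dataset.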

In [1]:
# Import Basic Packages (some others will be imported later)
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Classification models/metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Functions from .py file
from src.funcs import plot_hist, plot_churn, model_churn, plot_grid

## Bootstrap

The bootstrap is a Monte Carlo simulation approach for estimating the uncertainty of a statistic or estimator. Any sample taken from a population carries an inherent error from the randomness of the sampling process itself. We use the bootstrap to infer how other samples taken from the same population would have differed due to random error, and thus get some idea of the uncertainty in our original sample.

At a high level, the process is as follows:
1. Take a random sample from a population
1. Treat the sample as if it were the population
1. Repeatedly sample *with replacement* from the original sample to create *bootstrap samples*


In this sense, we can consider each bootstrap sample to be a synthetic set of data points that represents what the original sample *could* have been. A statistic computed across many bootstrap samples approximates its sampling distribution, similar to what the Central Limit Theorem describes for the mean, but using only a single sample!
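
To make the resampling step concrete, here is a minimal sketch of drawing one bootstrap sample with NumPy. The `sample` values are an assumed stand-in (random exponential draws), not the workshop's profit data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an observed sample (assumed values, not the workshop data)
sample = rng.exponential(scale=100, size=250)

# One bootstrap resample: same size as the sample, drawn with replacement
boot = rng.choice(sample, size=sample.size, replace=True)

# Every resampled value comes from the original sample, and some repeat
n_unique = np.unique(boot).size
print(f"unique values in resample: {n_unique} of {boot.size}")
print(f"sample mean: {sample.mean():.2f}, resample mean: {boot.mean():.2f}")
```

Because sampling is done with replacement, some observations appear more than once and others are left out, which is why each resample gives a slightly different statistic.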

Let's look at an example with the dataset below. We will read a pandas Series of profits from a store's sales.

In [None]:
pop_profit = pd.read_csv('data/profits.csv')['Profit']
pop_profit.head()

#### Tasks:
- Take random sample of n=250
- Explore distribution of sample
- Take 10,000 bootstrap resamples
- Explore distribution of 4 bootstrap resamples
- Construct bootstrap sampling distribution
- Calculate 90% confidence interval to estimate population mean
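
One possible way to work through these tasks is sketched below. It assumes a hypothetical gamma-distributed population in place of `data/profits.csv` and uses the percentile method for the interval, which is only one of several bootstrap interval constructions and may differ from the approach taken in the live session.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical skewed population standing in for the profit data
population = rng.gamma(shape=2.0, scale=50.0, size=10_000)

# Take a random sample of n=250 without replacement
n = 250
sample = rng.choice(population, size=n, replace=False)

# Draw 10,000 bootstrap resamples at once: each row of idx is one resample
n_boot = 10_000
idx = rng.integers(0, n, size=(n_boot, n))
boot_means = sample[idx].mean(axis=1)  # bootstrap sampling distribution of the mean

# 90% percentile confidence interval: 5th and 95th percentiles
lower, upper = np.percentile(boot_means, [5, 95])
print(f"sample mean: {sample.mean():.2f}")
print(f"90% CI for the population mean: ({lower:.2f}, {upper:.2f})")
```

Indexing the sample with a 2-D array of random positions avoids a Python loop over the 10,000 resamples.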

## SMOTE

The Synthetic Minority Over-sampling Technique (SMOTE) is a strategy used to address a very common challenge in machine learning: class imbalance. Class imbalance refers to classification problems in which one class in the dataset is much more prevalent than the others.

Examples include:
- Fraud
- Medical scans
- Car crashes

A model trained on an imbalanced dataset will naturally be biased to predict the majority class. A common method to counteract this issue involves generating synthetic observations of the minority class. We can give the model more data to learn from without investing additional resources to collect real data!
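
The core idea behind SMOTE is interpolation: a synthetic point is placed somewhere on the line segment between a minority observation and one of its nearest minority neighbors. The sketch below illustrates that idea in plain NumPy. The function `smote_like_sample` and the toy data are hypothetical, and this is a simplified illustration rather than the implementation used by any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class observations with two features
X_min = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))


def smote_like_sample(X, k, n_new, rng):
    """Create n_new synthetic points by interpolating toward nearest neighbors."""
    # Pairwise Euclidean distances between all minority points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per point

    base = rng.integers(0, len(X), size=n_new)  # pick a random base point
    pick = rng.integers(0, k, size=n_new)       # pick one of its k neighbors
    neigh = neighbors[base, pick]
    gap = rng.random((n_new, 1))                # position along the segment, in [0, 1)
    return X[base] + gap * (X[neigh] - X[base])


X_new = smote_like_sample(X_min, k=5, n_new=40, rng=rng)
print(X_new.shape)
```

Since every synthetic point lies between two real minority points, the new data stays inside the region the minority class already occupies.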

The example below looks at the [Telecom Churn](https://www.kaggle.com/datasets/mnassrib/telecom-churn-datasets?select=churn-bigml-80.csv) dataset. Our goal is to use information on customer behavior to predict whether or not the customer cancels their subscription (churn).

In [None]:
# Read churn csv into DataFrame
df = pd.read_csv('data/churn-bigml-80.csv')
df.head()

#### Tasks:
- Explore distribution of target variable
- Train model on original, imbalanced data
- Generate synthetic samples of minority class
- Train new model and compare results

In [None]:
# Import SMOTE (Synthetic Minority Over-sampling Technique)
from imblearn.over_sampling import SMOTE

## Image Augmentation

> "Data augmentation is a strategy that enables practitioners to significantly increase the diversity of data available for training models, without actually collecting new data." [Berkeley Artificial Intelligence Research](https://bair.berkeley.edu/blog/2019/06/07/data_aug/)

Let's look at a simple example with the MNIST dataset of handwritten digit images. The following example is taken from this [article](https://machinelearningmastery.com/image-augmentation-deep-learning-keras/).

#### Tasks:
- Load MNIST data
- Explore basic images
- Use the [TensorFlow Image Augmentation](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator) tool
- Create full dataset
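
To show what an augmentation step does to a single image, here is a small sketch using `scipy.ndimage`. It is a stand-in for illustration only, not the `ImageDataGenerator` API. The toy image, the `augment` function, and its `max_angle` and `max_shift` parameters are all hypothetical.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

# Hypothetical 28x28 grayscale image: a bright vertical bar on a dark background
image = np.zeros((28, 28), dtype="float32")
image[6:22, 12:16] = 255.0


def augment(img, rng, max_angle=15.0, max_shift=2.0):
    """Return a randomly rotated and shifted copy of a 2-D image."""
    angle = rng.uniform(-max_angle, max_angle)          # rotation in degrees
    shift = rng.uniform(-max_shift, max_shift, size=2)  # (rows, cols) in pixels
    out = ndimage.rotate(img, angle, reshape=False, order=1, mode="constant", cval=0.0)
    out = ndimage.shift(out, shift, order=1, mode="constant", cval=0.0)
    return out


# Generate several synthetic variants of the same image
augmented = np.stack([augment(image, rng) for _ in range(8)])
print(augmented.shape)
```

Small rotations and shifts keep a digit recognizable, whereas transformations such as flips can change its meaning, so the ranges should suit the data.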

In [None]:
# Import MNIST dataset
from tensorflow.keras.datasets import mnist

In [None]:
# Load Data
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Reshape images to (samples, height, width, channels) and convert to float
X_train = X_train.reshape((X_train.shape[0], 28, 28, 1))
X_test = X_test.reshape((X_test.shape[0], 28, 28, 1))
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

In [None]:
# Explore original data
plot_grid(X_train)

In [None]:
# Import Tensorflow Image Augmentation Tool
from tensorflow.keras.preprocessing.image import ImageDataGenerator

## Synthetic Data Vault (SDV)

>"The **Synthetic Data Vault (SDV)** is a **Synthetic Data Generation** ecosystem of libraries that allows users to easily learn single-table, multi-table and timeseries datasets to later on generate new **Synthetic Data** that has the **same format and statistical properties** as the original dataset." [Synthetic Data Vault](https://sdv.dev/SDV/index.html)

Topics to Discuss:
- Fitting model and generating data
- Faker and anonymizing sensitive information
- Exploring distributions
- Constraints and creating your own
- Metrics to measure utility
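
To give a feel for the fit-then-sample workflow and for a basic utility check, the sketch below hand-rolls a Gaussian copula, one of the model families SDV provides, using only NumPy, pandas, and SciPy. It is not SDV's API or implementation. The toy table, the function `fit_sample_copula`, and the choice of the Kolmogorov-Smirnov statistic as a metric are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical "real" table with two correlated numeric columns
age = rng.integers(18, 80, size=1000).astype(float)
income = 20_000 + 800 * age + rng.gamma(2.0, 5_000, size=1000)
real = pd.DataFrame({"age": age, "income": income})


def fit_sample_copula(df, n_rows, rng):
    """Fit a Gaussian copula to numeric columns and sample n_rows new rows."""
    # 1. Map each column to normal scores through its empirical ranks
    u = df.rank(method="average").to_numpy() / (len(df) + 1)
    z = stats.norm.ppf(u)
    # 2. Learn the dependence structure as a correlation matrix
    corr = np.corrcoef(z, rowvar=False)
    # 3. Sample correlated normals and map back through empirical quantiles
    z_new = rng.multivariate_normal(np.zeros(df.shape[1]), corr, size=n_rows)
    u_new = stats.norm.cdf(z_new)
    cols = {c: np.quantile(df[c].to_numpy(), u_new[:, i]) for i, c in enumerate(df.columns)}
    return pd.DataFrame(cols)


synthetic = fit_sample_copula(real, n_rows=1000, rng=rng)

# Simple utility checks: per-column KS statistic and the correlation gap
for col in real.columns:
    ks = stats.ks_2samp(real[col], synthetic[col]).statistic
    print(f"{col}: KS statistic = {ks:.3f}")
print("real corr:     ", round(real.corr().iloc[0, 1], 3))
print("synthetic corr:", round(synthetic.corr().iloc[0, 1], 3))
```

A KS statistic near 0 means a synthetic column's distribution closely matches the real one, and comparing correlations checks whether relationships between columns survived.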