# Generating Simulated Data in Python

Agenda:
- Tools for generating simulated data
- Examples with Python

## Tools for creating simulated data in Python:
- [random](https://docs.python.org/3/library/random.html)
- [numpy](https://numpy.org/doc/1.16/reference/routines.random.html)
- [scipy](https://docs.scipy.org/doc/scipy/reference/stats.html#probability-distributions)
- [scikit-learn](https://scikit-learn.org/stable/datasets/sample_generators.html)

In [None]:
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.datasets import make_blobs
from faker import Faker

### Simulated data with numpy

- np.random.seed
- np.random.randint
- np.linspace
- np.random.choice
- numpy distributions
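
As a rough illustration of the functions listed above, the sketch below calls each one once. The seed, sizes, labels, and probabilities are arbitrary values chosen for the example, not part of any dataset used later.

```python
import numpy as np

# Fix the seed so every run produces the same "random" values
np.random.seed(42)

# 5 random integers in [1, 7): the upper bound is exclusive
dice_rolls = np.random.randint(1, 7, size=5)

# 5 evenly spaced values from 0 to 1, both endpoints included (not random)
grid = np.linspace(0, 1, num=5)

# 10 draws from a list of labels with unequal probabilities
colors = np.random.choice(['red', 'green', 'blue'], size=10, p=[0.5, 0.3, 0.2])

# 1,000 draws from a normal distribution with mean 0 and standard deviation 1
normal_draws = np.random.normal(loc=0, scale=1, size=1000)

print(dice_rolls, grid, colors[:3], normal_draws[:3])
```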

### Simulated data with scipy.stats
- Distributions
- Sampling random variables from a distribution
- Use `random_state` to make samples reproducible
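
A minimal sketch of these three points, assuming an arbitrary normal distribution (mean 50, standard deviation 5) picked only for the example: create a "frozen" distribution, sample from it with `rvs`, and pass `random_state` so the draw can be repeated.

```python
from scipy import stats

# A "frozen" normal distribution with mean 50 and standard deviation 5
dist = stats.norm(loc=50, scale=5)

# Draw 1,000 samples; the same random_state reproduces the same draw
samples = dist.rvs(size=1000, random_state=42)
same_samples = dist.rvs(size=1000, random_state=42)

# Theoretical moments of the distribution vs. moments of the sample
print(dist.mean(), dist.std())
print(samples.mean(), samples.std())
```

Unlike `np.random.seed`, `random_state` does not touch NumPy's global random state, so each call can be made reproducible on its own.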

## Practical Examples with Python

- Simulated dataset with sklearn
- Generate data based purely on background knowledge
- Simulate data based on a previous model
- Generate data from a real dataset
- Anonymize private information

## Example 1

Simulated classification dataset with sklearn `make_blobs`
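
One possible way to approach this example is sketched below. The number of samples, features, and centers are assumptions made for illustration, as are the column names `feature_1`, `feature_2`, and `label`.

```python
import pandas as pd
from sklearn.datasets import make_blobs

# 300 points in 2 dimensions, grouped around 3 cluster centers
X, y = make_blobs(n_samples=300, n_features=2, centers=3,
                  cluster_std=1.0, random_state=42)

# Collect the features and the class labels in one DataFrame
blobs_df = pd.DataFrame(X, columns=['feature_1', 'feature_2'])
blobs_df['label'] = y

print(blobs_df['label'].value_counts())
```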

## Example 2

Generate simulated data based on background knowledge.

We work for a school that is implementing a new grading platform. The school doesn't want to enter any actual grades until it knows the system works and is secure, so it asks us to generate some simulated data for testing.

The school asks us to simulate the average overall grade of 1,000 students.

They give us the following information:
- Student grades follow a normal distribution
- The mean overall grade is 77 with a standard deviation of 9
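
Given these two facts, a simple sketch is to draw 1,000 values from a normal distribution. The clipping step assumes grades are on a 0 to 100 scale, which the school did not state explicitly, and the seed is arbitrary.

```python
import numpy as np

np.random.seed(0)

# 1,000 grades from a normal distribution with mean 77 and standard deviation 9
grades = np.random.normal(loc=77, scale=9, size=1000)

# Assumption: grades live on a 0-100 scale, so clip the rare values outside it
grades = np.clip(grades, 0, 100).round(1)

print(grades.mean(), grades.std())
```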

## Example 3

Let's look at another real-world example. Assume we are analysts working for a small e-commerce store and we want to generate simulated data on the relationship between monthly advertising costs and total monthly sales.

From a previous analysis, we know the relationship between advertising costs and total monthly sales can be described with the following linear equation:

`total_monthly_sales = 92.64 * advertising_costs + 26.57`

Interpretation:
- For every additional dollar spent on advertising, we expect total sales to increase by \$92.64
- If we spend \$0 on advertising, we expect total sales to be \$26.57

We spend between \$0 and \$500 on advertising monthly. Let's use numpy to simulate 100 data points.

In [None]:
# Simulate 100 points between 0 and 500
monthly_costs = None

Utilize numpy's vector operations to estimate `total_monthly_sales` for each possible advertising cost value

In [None]:
# Calculate total_monthly_sales using monthly_costs and model above
total_monthly_sales = None
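
One possible way to fill in the two cells above is sketched here. It assumes evenly spaced costs from `np.linspace`; random costs from `np.random.uniform` would be an equally valid reading of "simulate 100 points".

```python
import numpy as np

# 100 evenly spaced advertising budgets from $0 to $500 (endpoints included)
monthly_costs = np.linspace(0, 500, 100)

# Apply the linear model to every element at once (vectorized, no loop needed)
total_monthly_sales = 92.64 * monthly_costs + 26.57

print(monthly_costs[:3], total_monthly_sales[:3])
```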

Visualize in a line plot

In [None]:
# Replace None with appropriate variables

plt.figure(figsize=(10,6))
plt.plot(None, None)
plt.title('Monthly Advertising Costs vs Estimated Total Sales', fontsize=18)
plt.xlabel('Monthly Advertising Costs ($)', fontsize=15)
plt.ylabel('Total Monthly Sales ($)', fontsize=15);

While this linear equation models the relationship quite well, we know real-world data never fits a model perfectly. To make this simulation more realistic, we can add some *noise* to the data.

Let's assume that, for any given month, the percentage by which `total_monthly_sales` deviates from the model follows a normal distribution with a mean of 1% and a standard deviation of 6%.

In [None]:
# Generate "noise" for all 100 points
np.random.seed(0)
noise = None
plt.hist(noise);

Recalculate monthly sales with `noise`

In [None]:
sales_with_noise = None
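
A minimal sketch of the noise step, assuming the noise is multiplicative (a percentage of the modeled sales) and that costs are evenly spaced between 0 and 500. Both are interpretations of the description above rather than the only possible solution.

```python
import numpy as np

# Model values without noise (assumes evenly spaced costs)
monthly_costs = np.linspace(0, 500, 100)
total_monthly_sales = 92.64 * monthly_costs + 26.57

np.random.seed(0)

# Percentage variation per month: mean 1%, standard deviation 6%
noise = np.random.normal(loc=0.01, scale=0.06, size=100)

# Assumption: noise is a percentage change applied to the modeled sales
sales_with_noise = total_monthly_sales * (1 + noise)

print(noise[:3], sales_with_noise[:3])
```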

Let's visualize the line from above along with the more realistic data points we created.

In [None]:
# Replace None with appropriate variables

plt.figure(figsize=(10,6))

# Line plot of model
plt.plot(None, None, label='Model', lw=2.5)

# Scatter plot of data with noise
plt.scatter(None, None, color='orange', label='Simulated Data')

plt.title('Monthly Advertising Costs vs Estimated Total Sales', fontsize=18)
plt.xlabel('Monthly Advertising Costs ($)', fontsize=15)
plt.ylabel('Total Monthly Sales ($)', fontsize=15)
plt.legend();

The last step is to save the simulated data as a DataFrame.
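
This step could look roughly like the sketch below. The column names and the commented-out file name are illustrative choices, and the arrays are rebuilt here under the same assumptions (evenly spaced costs, multiplicative noise) so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
monthly_costs = np.linspace(0, 500, 100)
noise = np.random.normal(loc=0.01, scale=0.06, size=100)
sales_with_noise = (92.64 * monthly_costs + 26.57) * (1 + noise)

# One row per simulated data point; column names are illustrative
sales_df = pd.DataFrame({
    'advertising_costs': monthly_costs,
    'total_monthly_sales': sales_with_noise,
})

# Optionally write it to disk (hypothetical file name)
# sales_df.to_csv('simulated_sales.csv', index=False)
print(sales_df.head())
```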

## Example 4

Create a simulated dataset based on real-world data. We will generate a full dataset that simulates the profits of a store by product category.

Before we begin:
- Simulated data should have a high level of *utility*
- Maintain multivariate relationships

Note that the "real" data we will be using is itself simulated.

In [None]:
df = pd.read_csv('data/store-profits.csv', encoding='latin-1')
df.head()

Check shape of DataFrame

In [None]:
df.shape

Reformat column names

In [None]:
df.columns = [col.replace(' ', '_').lower() for col in df.columns]
df.head()

Subset `df` to focus only on `sub-category` and `profit`

In [None]:
profits_df = df[['sub-category', 'profit']]
profits_df.head()

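The function below calculates the root mean squared error (RMSE) between two distributions, which we will use as a metric for how well a model fits the data.

In [None]:
def check_dist_fit(real, dist):
    # RMSE: square root of the mean squared difference between the two arrays
    return np.sqrt(np.mean((real - dist) ** 2))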

### Simulating Profit Data - Part 1

Let's begin by modeling the univariate distribution of `sub-category` alone and generating simulated data based on this model. 

Steps:
- Explore `sub-category`
- Model underlying distribution
- Generate 10,000 simulated data points from this distribution
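
The steps above can be sketched as follows. The tiny `sub_category` Series is a made-up stand-in for `profits_df['sub-category']`, and treating the observed category shares as the model is one simple modeling choice among several.

```python
import numpy as np
import pandas as pd

# Stand-in for profits_df['sub-category']; values are made up for illustration
sub_category = pd.Series(['Chairs', 'Phones', 'Binders', 'Binders', 'Phones', 'Binders'])

# Model: the empirical share of rows in each category
category_probs = sub_category.value_counts(normalize=True)

# Generate 10,000 categories with those probabilities
np.random.seed(0)
sim_sub_category = np.random.choice(category_probs.index.to_numpy(),
                                    size=10_000,
                                    p=category_probs.to_numpy())

# Compare real and simulated proportions side by side
sim_probs = pd.Series(sim_sub_category).value_counts(normalize=True)
print(pd.DataFrame({'real': category_probs, 'simulated': sim_probs}))
```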

### Simulating Profit Data - Part 2

The next step will be modeling the univariate distribution of `profit`.

Steps:
- Explore `profit`
- Model underlying distribution
- Generate 10,000 simulated data points from this distribution
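
One way these steps might look is sketched below. The `profit` array is generated stand-in data rather than the real column, the two candidate distributions are arbitrary picks, and the local `rmse` helper plays the role of `check_dist_fit`.

```python
import numpy as np
from scipy import stats

def rmse(real, dist):
    # Root mean squared error between two equally shaped arrays
    return np.sqrt(np.mean((real - dist) ** 2))

# Stand-in for profits_df['profit']; generated here only for illustration
profit = stats.laplace(loc=30, scale=50).rvs(size=2000, random_state=0)

# Empirical density of the data, evaluated at the center of each histogram bin
density, bin_edges = np.histogram(profit, bins=50, density=True)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

# Fit each candidate by maximum likelihood and score its PDF against the histogram
candidates = {'norm': stats.norm, 'laplace': stats.laplace}
scores = {}
for name, candidate in candidates.items():
    params = candidate.fit(profit)
    scores[name] = rmse(density, candidate.pdf(bin_centers, *params))

# Keep the candidate with the lowest error and sample 10,000 points from it
best_name = min(scores, key=scores.get)
best_params = candidates[best_name].fit(profit)
sim_profit = candidates[best_name].rvs(*best_params, size=10_000, random_state=0)

print(scores, best_name)
```

Scoring against a histogram depends on the number of bins, so it is worth checking that the ranking of candidates is stable for a few different bin counts.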

### Explore Simulated Data

Before diving into the final part of this example, let's combine the simulated `sub-category` data with the simulated `profit` and see how it aligns with our original dataset.
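
To see why this check matters, the sketch below uses made-up stand-in data in which profit clearly depends on the category, then simulates each column independently. Resampling the observed profits stands in for whichever univariate model was chosen earlier.

```python
import numpy as np
import pandas as pd

np.random.seed(0)

# Stand-in "real" data where profit depends on sub-category (made-up values)
real_df = pd.DataFrame({
    'sub-category': np.repeat(['Binders', 'Phones'], 500),
    'profit': np.concatenate([np.random.normal(10, 5, 500),
                              np.random.normal(80, 20, 500)]),
})

# Independent simulation: each column is drawn without looking at the other
sim_df = pd.DataFrame({
    'sub-category': np.random.choice(['Binders', 'Phones'], size=10_000),
    'profit': np.random.choice(real_df['profit'].to_numpy(), size=10_000),
})

# Mean profit per category in the real data and in the simulation
real_means = real_df.groupby('sub-category')['profit'].mean()
sim_means = sim_df.groupby('sub-category')['profit'].mean()
print(pd.DataFrame({'real': real_means, 'simulated': sim_means}))
```

Each simulated column matches its real counterpart on its own, but the per-category means collapse toward the overall mean because the relationship between the two variables was never modeled.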

### Simulating Profit Data - Part 3

Finally, let's model `sub-category` and `profit` together. We need to account for both the individual distribution of each variable as well as the relationship between them.

Notes about final simulated sample:
- Generate 10,000 total samples of `profit` and `sub-category`
- Simulated data should have similar distribution for both variables
- Maintain relationship between `profit` and `sub-category`
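
A common way to meet these requirements is conditional sampling: draw the category first, then draw profit from a distribution fitted to that category only. The sketch below assumes made-up stand-in data and a normal distribution per category, which may not be the best fit for the real `profit` column.

```python
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(0)

# Stand-in for profits_df (made-up values)
profits_df = pd.DataFrame({
    'sub-category': np.repeat(['Binders', 'Phones', 'Chairs'], [600, 300, 100]),
    'profit': np.concatenate([np.random.normal(10, 5, 600),
                              np.random.normal(80, 20, 300),
                              np.random.normal(40, 30, 100)]),
})

# Step 1: sample sub-category from its empirical distribution
probs = profits_df['sub-category'].value_counts(normalize=True)
sim_categories = np.random.choice(probs.index.to_numpy(), size=10_000,
                                  p=probs.to_numpy())

# Step 2: fit one profit distribution per sub-category
# (a normal distribution is assumed here purely for illustration)
fitted = {name: stats.norm.fit(group['profit'])
          for name, group in profits_df.groupby('sub-category')}

# Step 3: draw each profit from the distribution of its own sub-category
sim_profit = np.empty(10_000)
for name, (loc, scale) in fitted.items():
    mask = sim_categories == name
    sim_profit[mask] = np.random.normal(loc, scale, size=mask.sum())

sim_df = pd.DataFrame({'sub-category': sim_categories, 'profit': sim_profit})
print(sim_df.groupby('sub-category')['profit'].agg(['count', 'mean', 'std']))
```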

## Example 5

One of the most important aspects of generating simulated data is ensuring that no private information from the real data is disclosed. All sensitive information should be removed or anonymized before the data is released to the public.

In this example we will use the [faker](https://faker.readthedocs.io/en/stable/fakerclass.html) library to help us anonymize customer names from the dataset above.

### Additional tools for generating simulated data:
- [SMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html)
- [Python Package - SDV](https://sdv.dev/SDV/)
- [Python Package - Gretel](https://synthetics.docs.gretel.ai/en/stable/#)
- [Mockaroo](https://mockaroo.com/)