## Example Usage of Data Backends

This notebook shows how to use the data backends to produce mock data of different types and with different statistical properties.

In [None]:
from mock_data.backends import (
    WeightedDiscrete,
    BoundedNumerical,
    BoundedDatetime,
    LoremIpsumText,
)
import matplotlib.pyplot as plt
import pandas as pd

### Normal distribution between 0 and 10

In [None]:
cr = BoundedNumerical(distribution="norm", lower_bound=0, upper_bound=10)
cr_samples = cr.generate_samples(size=5000)

plt.hist(cr_samples, bins=20)

In [None]:
from scipy.stats import kstest, norm, pearsonr
import numpy as np

In [None]:
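# Correlate the sorted samples with sorted draws from a normal distribution.
# A correlation close to 1 suggests normality, similar in spirit to a Shapiro-Wilk test.
# Pearson correlation ignores shifts, so the offset of 43 does not change the result.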
other_normal_samples = np.sort(norm.rvs(size=5000) - 43)
pearsonr(np.sort(cr_samples), other_normal_samples)
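
The sorted-sample correlation above can also be computed against theoretical normal quantiles instead of a second random draw, which removes one source of noise. The following is a minimal sketch using only the standard library; the helper name `normal_qq_correlation` and the plotting positions `(i + 0.5) / n` are assumptions made for illustration and are not part of `mock_data`.

```python
import random
from statistics import NormalDist


def normal_qq_correlation(samples):
    """Pearson correlation between sorted samples and standard normal quantiles."""
    n = len(samples)
    ordered = sorted(samples)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n keep every probability strictly inside (0, 1).
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]

    mean_x = sum(ordered) / n
    mean_q = sum(quantiles) / n
    cov = sum((x - mean_x) * (q - mean_q) for x, q in zip(ordered, quantiles))
    var_x = sum((x - mean_x) ** 2 for x in ordered)
    var_q = sum((q - mean_q) ** 2 for q in quantiles)
    return cov / (var_x * var_q) ** 0.5


rng = random.Random(0)
normal_like = [rng.gauss(5, 1.5) for _ in range(2000)]   # roughly normal data
skewed = [rng.expovariate(1.0) for _ in range(2000)]     # clearly non-normal data

r_normal = normal_qq_correlation(normal_like)
r_skewed = normal_qq_correlation(skewed)
print(r_normal, r_skewed)
```

Normal data should give a correlation very close to 1, while skewed data gives a visibly lower value. This is a heuristic check rather than a formal hypothesis test.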

In [None]:
type(other_normal_samples)

### Chi2 Distribution between -8 and 100 with 3 degrees of freedom

In [None]:
chi_squared = BoundedNumerical(
    distribution="chi2", lower_bound=-8, upper_bound=100, df=3
)
chi_squared_samples = chi_squared.generate_samples(size=5000)

plt.hist(chi_squared_samples, bins=20)

### Sample of Dates between 20200101 and 20201231 following an Arcsine distribution

The arcsine distribution is described in the [SciPy documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.arcsine.html#scipy.stats.arcsine). Its density peaks at both ends of the range, so dates near the start and end of the year are sampled more often than dates in the middle.

In [None]:
date_sampler = BoundedDatetime(
    min_datetime="20200101", max_datetime="20201231", distribution="arcsine"
)
sampled_dates = date_sampler.generate_samples(size=1000)

# convert to a Pandas series and plot the frequency of months
pd.Series(pd.to_datetime(sampled_dates, format="%Y%m%d")).dt.month.hist()
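
To illustrate why the arcsine distribution favours both ends of the range, here is a minimal standard-library sketch that draws arcsine variates by inverting the CDF, $F(x) = \frac{2}{\pi}\arcsin\sqrt{x}$, and maps them onto a date range. The function `sample_arcsine_dates` is hypothetical, and the linear mapping onto days is an assumption. `BoundedDatetime` may work differently internally.

```python
import math
import random
from datetime import date, timedelta


def sample_arcsine_dates(start, end, size, rng):
    """Draw dates in [start, end] whose position follows an arcsine distribution."""
    span_days = (end - start).days
    samples = []
    for _ in range(size):
        u = rng.random()
        # Inverse of F(x) = (2 / pi) * asin(sqrt(x)) on [0, 1].
        x = math.sin(math.pi * u / 2) ** 2
        # Assumed mapping: scale the variate linearly onto the day range.
        samples.append(start + timedelta(days=round(x * span_days)))
    return samples


rng = random.Random(0)
dates = sample_arcsine_dates(date(2020, 1, 1), date(2020, 12, 31), 5000, rng)

edge_months = sum(d.month in (1, 12) for d in dates)    # January and December
middle_months = sum(d.month in (6, 7) for d in dates)   # June and July
print(edge_months, middle_months)
```

The two edge months should receive several times as many samples as the two middle months, matching the U-shaped histogram above.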

### Lorem Ipsum text with length following a uniform distribution and 50% chance of being blank

The resulting length distribution is a mixture: approximately 50% of the samples have length 0, while the remainder have lengths drawn from a uniform distribution between 50 and 500 characters.

In [None]:
lorem = LoremIpsumText(lower_bound=50, upper_bound=500, blank_probability=0.5)
text_samples = lorem.generate_samples(size=1000)

sample_lengths = [len(text_sample) for text_sample in text_samples]

plt.hist(sample_lengths, bins=20)
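
The mixture described above can be sketched without the library to see what the histogram should look like. The function below is a hypothetical stand-in that generates only lengths, not text, and it assumes the blank decision is made independently for each sample. It is not how `LoremIpsumText` is necessarily implemented.

```python
import random


def sample_text_lengths(size, lower, upper, blank_probability, rng):
    """Return lengths that are 0 with probability blank_probability, else uniform."""
    lengths = []
    for _ in range(size):
        if rng.random() < blank_probability:
            lengths.append(0)                            # blank sample
        else:
            lengths.append(rng.randint(lower, upper))    # inclusive uniform integer
    return lengths


rng = random.Random(0)
lengths = sample_text_lengths(10000, 50, 500, 0.5, rng)

blank_share = lengths.count(0) / len(lengths)
nonblank = [n for n in lengths if n > 0]
mean_length = sum(lengths) / len(lengths)
print(blank_share, min(nonblank), max(nonblank), mean_length)
```

Under these assumptions the expected overall mean length is $(1 - 0.5) \times (50 + 500) / 2 = 137.5$ characters, with a spike at 0 and a flat region between 50 and 500.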

### Utilize WeightedDiscrete to sample days of week with particular frequency

Suppose we want Monday to be much more common than the other days. Giving it a larger weight than the others accomplishes this.

In [None]:
weekday_sampler = WeightedDiscrete(
    {
        "Monday": 20,
        "Tuesday": 1,
        "Wednesday": 1,
        "Thursday": 1,
        "Friday": 1,
        "Saturday": 1,
        "Sunday": 1,
    }
)

sampled_days = pd.Series(weekday_sampler.generate_samples(size=1000))

sampled_days.groupby(sampled_days).count().plot(kind="bar")
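
As a rough sanity check on the bar chart, the same weights can be sampled with the standard library's `random.choices`. This sketch assumes the weights are relative frequencies that are normalised to probabilities, which the weights passed to `WeightedDiscrete` above suggest but do not confirm.

```python
import random
from collections import Counter

weights = {
    "Monday": 20,
    "Tuesday": 1,
    "Wednesday": 1,
    "Thursday": 1,
    "Friday": 1,
    "Saturday": 1,
    "Sunday": 1,
}

rng = random.Random(0)
# random.choices treats the weights as relative, so they need not sum to 1.
days = rng.choices(list(weights), weights=list(weights.values()), k=10000)
counts = Counter(days)

expected_monday = weights["Monday"] / sum(weights.values())   # 20 / 26
observed_monday = counts["Monday"] / len(days)
print(expected_monday, observed_monday)
```

With these weights Monday is expected in 20 of every 26 samples, about 77%, and each other day in about 3.8%.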