In [1]:
import pandas as pd

pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 1000)

# More Realism

At some point, you may want more realistic data, for example events whose counts follow a statistical distribution.
This document highlights some functionality that helps with this. Note that mocking more realistic data
comes at the cost of additional complexity in the config.

## Exploding with distribution
One way to add a more realistic distribution is at the explode step. Previously, every row generated an equal number of
subsequent rows. The number of rows generated per source row can now be controlled via `distribution_kwargs`. In the Event
table below, we define two columns (`person_id` and `event_dt`) to create the many-to-many mapping of exploded rows.


In [2]:
from data_fabricator.v1.core.mock_generator import (
    BaseTable,
    PrimaryKey,
    Explode,
    generate_dates,
)


class Person(BaseTable):
    num_rows = 1000
    _metadata_ = {"description": "Person table with <insert detail here>"}
    person_id = PrimaryKey(prefix="person")


class Event(BaseTable):
    person_id = Explode(
        explode_func=generate_dates,
        list_of_values="Person.person_id",
        position=0,
        distribution_kwargs={
            "distribution": "gamma",
            "scale": 1,
            "shape": 3,
            "numpy_seed": 1,
        },
        explode_func_kwargs={
            "start_dt": "2019-01-01",
            "end_dt": "2020-01-01",
            "freq": "M",
        },
    )
    event_dt = Explode(
        explode_func=generate_dates,
        list_of_values="Person.person_id",
        position=1,
        distribution_kwargs={
            "distribution": "gamma",
            "scale": 1,
            "shape": 3,
            "numpy_seed": 1,
        },
        explode_func_kwargs={
            "start_dt": "2019-01-01",
            "end_dt": "2020-01-01",
            "freq": "M",
        },
    )

## Explanation of the explode configuration
We want `person_id` and `event_dt` to represent the many-to-many relationship between a person's ID and an event date. Therefore,
we pass the same configuration to both column definitions above, **apart** from `position`. Each column definition then
generates the same set of exploded data: many `person_id`s, each associated with many `event_dt`s. The `position` argument
dictates which of the generated "exploded columns" is assigned to the column being configured: `0` selects the
`person_id` values and `1` selects the `event_dt` values.
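
The role of `position` can be illustrated with a small standalone sketch. It does not use `data_fabricator` and makes no claim about the package's internals: `build_pairs` and `select_position` are hypothetical helpers, the per-person counts are fixed by hand rather than drawn from a distribution, and only the standard library is used. The point it shows is that both columns are cut from the same list of (person, date) pairs, which is why the two configurations have to match apart from `position`.

```python
import random


def build_pairs(person_ids, dates, counts, seed):
    """Hypothetical helper: pair each person with counts[i] distinct dates."""
    rng = random.Random(seed)
    pairs = []
    for pid, n in zip(person_ids, counts):
        # ISO date strings sort chronologically, so sorted() orders the events.
        chosen = sorted(rng.sample(dates, n))
        pairs.extend((pid, d) for d in chosen)
    return pairs


def select_position(pairs, position):
    """Hypothetical helper: keep one element of every exploded pair."""
    return [pair[position] for pair in pairs]


person_ids = ["person1", "person2", "person3"]
dates = ["2019-01-31", "2019-02-28", "2019-03-31", "2019-04-30"]
counts = [2, 1, 3]  # fixed by hand; the config above draws these from a distribution

# Same inputs and same seed for both columns, only the position differs.
person_id_col = select_position(build_pairs(person_ids, dates, counts, seed=1), 0)
event_dt_col = select_position(build_pairs(person_ids, dates, counts, seed=1), 1)

for row in zip(person_id_col, event_dt_col):
    print(row)
```

Because both calls share the same seed, the two columns line up row by row. If the seeds or other arguments differed, the selected columns would come from different exploded sets and would no longer describe the same events.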


Given the two table classes that we have defined:

In [3]:
print(Person)
print(Event)

<class '__main__.Person'>
<class '__main__.Event'>


Let's generate the data and inspect the distribution:

In [4]:
from data_fabricator.v1.core.mock_generator import MockDataGenerator

# Setting a seed is not recommended for general use; consider carefully when to use one.
mock_generator = MockDataGenerator(tables=[Person, Event], seed=1)
mock_generator.generate_all()

df = mock_generator.tables["Event"].dataframe
print(df.head(10))

  person_id   event_dt
0   person1 2019-02-28
1   person1 2019-05-31
2   person1 2019-07-31
3   person1 2019-08-31
4   person1 2019-09-30
5   person1 2019-11-30
6   person2 2019-06-30
7   person3 2019-04-30
8   person4 2019-03-31
9   person5 2019-01-31


Let's summarize the data by counting how many persons have each number of events:

In [5]:
grouped_df_pd = (
    df.groupby(["person_id"])["event_dt"]
    .size()
    .reset_index()
    .groupby(["event_dt"])["person_id"]
    .size()
    .reset_index()
    .rename(columns={"event_dt": "count", "person_id": "count_of_count"})
    .sort_values(["count"])
)
print(grouped_df_pd.head(20))

    count  count_of_count
0       1             230
1       2             258
2       3             174
3       4             114
4       5              54
5       6              46
6       7              25
7       8              10
8       9               3
9      10               1
10     12              84
11     15               1


Plotting gives:

In [7]:
import plotly.express as px

fig = px.bar(
    data_frame=grouped_df_pd,
    x="count",
    y="count_of_count",
    title="Distribution of Events per Person",
    labels={"count_of_count": "Number of persons", "count": "Events per person"},
)
fig.write_image("data_fabricator/docs/images/distribution.png")

![](images/distribution.png)

Under the hood, the distribution is drawn by `data_fabricator.v1.core.functions.numpy_random`. See its docstring for
more details.
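
To get a feel for what a gamma distribution with `shape=3` and `scale=1` looks like, the sketch below draws 1000 values and tallies them as whole-number counts. It is an illustration only and is not the package's implementation: it uses the standard library's `random.gammavariate` rather than `numpy_random`, and rounding each draw up to the next integer is an assumption made here to obtain counts of at least 1. The numbers it prints will therefore not match the table above.

```python
import math
import random
from collections import Counter

rng = random.Random(1)

# gammavariate(alpha, beta): alpha is the shape, beta is the scale.
draws = [rng.gammavariate(3, 1) for _ in range(1000)]

# Assumption for this sketch: round up so that every person gets at least one event.
counts = [math.ceil(d) for d in draws]

# Tally how many simulated persons end up with each number of events.
tally = Counter(counts)
for events_per_person in sorted(tally):
    print(events_per_person, tally[events_per_person])
```

The tally is right-skewed: most simulated persons have only a few events, and a long tail has many. This is the same qualitative shape as the bar chart above. NumPy's `Generator.gamma` takes `shape` and `scale` parameters with the same meaning.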