In [1]:
import pandas as pd

pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 1000)

# More Realism

At some point, you may want the mock data to look more realistic, for example by giving events a realistic distribution.
This document highlights functionality that helps with that. Note that mocking more realistic data
comes at the cost of additional complexity in the config.

## Exploding with Distribution
One way to get a more realistic distribution is to apply it during explode. In earlier examples, every row generated
an equal number of subsequent rows. That number can now be controlled via `distribution_kwargs`:

In [2]:
import yaml

yaml_string = """
persons:
  num_rows: 1000
  columns:
    person_id:
      type: generate_unique_id
      prefix: person

events:
  columns:
    person_id:
      type: explode
      list_of_values: persons.person_id
      distribution_kwargs:
        distribution: gamma
        scale: 1
        shape: 3
        numpy_seed: 1
      explode_func: generate_dates
      explode_func_kwargs:
        start_dt: 2019-01-01
        end_dt: 2020-01-01
        freq: M
      position: 0
    event_dt:
      type: explode
      list_of_values: persons.person_id
      distribution_kwargs:
        distribution: gamma
        scale: 1
        shape: 3
        numpy_seed: 1
      explode_func: generate_dates
      explode_func_kwargs:
        start_dt: 2019-01-01
        end_dt: 2020-01-01
        freq: M
      position: 1
"""
config = yaml.safe_load(yaml_string)

Given the following config:

In [3]:
print(yaml_string)


persons:
  num_rows: 1000
  columns:
    person_id:
      type: generate_unique_id
      prefix: person

events:
  columns:
    person_id:
      type: explode
      list_of_values: persons.person_id
      distribution_kwargs:
        distribution: gamma
        scale: 1
        shape: 3
        numpy_seed: 1
      explode_func: generate_dates
      explode_func_kwargs:
        start_dt: 2019-01-01
        end_dt: 2020-01-01
        freq: M
      position: 0
    event_dt:
      type: explode
      list_of_values: persons.person_id
      distribution_kwargs:
        distribution: gamma
        scale: 1
        shape: 3
        numpy_seed: 1
      explode_func: generate_dates
      explode_func_kwargs:
        start_dt: 2019-01-01
        end_dt: 2020-01-01
        freq: M
      position: 1



Let's generate the data and inspect the distribution:

In [4]:
from data_fabricator.v0.core.fabricator import MockDataGenerator

# Setting a seed is not recommended for general use; set one only when you need reproducible output
mock_generator = MockDataGenerator(instructions=config, seed=1)
mock_generator.generate_all()

df = mock_generator.all_dataframes["events"]
print(df.head(10))


  person_id   event_dt
0   person1 2019-02-28
1   person1 2019-05-31
2   person1 2019-07-31
3   person1 2019-08-31
4   person1 2019-09-30
5   person1 2019-11-30
6   person2 2019-06-30
7   person3 2019-04-30
8   person4 2019-03-31
9   person5 2019-01-31


Let's summarize the data by counting how many persons have each number of events:

In [5]:
grouped_df_pd = (
    df.groupby(["person_id"])["event_dt"]
    .size()
    .reset_index()
    .groupby(["event_dt"])["person_id"]
    .size()
    .reset_index()
    .rename(columns={"event_dt": "count", "person_id": "count_of_count"})
    .sort_values(["count"])
)
print(grouped_df_pd.head(20))

    count  count_of_count
0       1             230
1       2             258
2       3             174
3       4             114
4       5              54
5       6              46
6       7              25
7       8              10
8       9               3
9      10               1
10     12              84
11     15               1
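
The same count-of-counts summary can also be sketched with `value_counts`. The block below is a minimal illustration on a small, made-up dataframe (`toy_events` is an assumption standing in for the generated `events` table), not the output of the fabricator:

```python
import pandas as pd

# Toy stand-in for the generated `events` table (values are made up).
toy_events = pd.DataFrame(
    {
        "person_id": ["person1", "person1", "person1", "person2", "person3", "person3"],
        "event_dt": pd.to_datetime(
            ["2019-02-28", "2019-05-31", "2019-07-31", "2019-06-30", "2019-04-30", "2019-08-31"]
        ),
    }
)

# Number of events per person.
events_per_person = toy_events.groupby("person_id").size()

# Number of persons sharing each event count, ordered by event count.
count_of_count = events_per_person.value_counts().sort_index()
print(count_of_count.to_dict())
```

Here the index of `count_of_count` plays the role of the `count` column and its values play the role of `count_of_count`.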


Plotting this summary gives:
![Distribution of event counts per person](../images/distribution.png)

Under the hood, the `data_fabricator.fabricator.numpy_random` function is used. See its docstring for
more details.
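
To build intuition for what `distribution: gamma` with `shape: 3` and `scale: 1` means, here is a minimal sketch that draws one event count per person directly with NumPy. It is not the fabricator's implementation: `sample_event_counts` is a hypothetical helper, and rounding the draws up to whole numbers is an assumption, so the counts will differ from the output shown earlier.

```python
import numpy as np


# Hypothetical helper, not part of data_fabricator.
def sample_event_counts(num_persons, shape=3.0, scale=1.0, seed=1):
    """Draw one gamma-distributed event count per person."""
    rng = np.random.default_rng(seed)
    draws = rng.gamma(shape=shape, scale=scale, size=num_persons)
    # Assumption: round up so that every person gets at least one event.
    return np.ceil(draws).astype(int)


counts = sample_event_counts(1000)

# Tabulate how many persons received each event count.
values, frequencies = np.unique(counts, return_counts=True)
for value, frequency in zip(values, frequencies):
    print(value, frequency)
```

Most persons end up with a handful of events while a few get many more, which is the right-skewed shape that a fixed number of rows per person cannot produce.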


## Injecting Real Data

If you have an existing dataframe, you can inject it into the data fabricator
so that its columns are accessible during fabrication. This is useful when
dimension tables are readily available, for example because they are
public data sources.

In [6]:
import pandas as pd
import yaml
from tabulate import tabulate
from data_fabricator.v0.core.fabricator import MockDataGenerator

params_string = """
persons:
  num_rows: 10
  columns:
    person_id:
      type: generate_unique_id
      prefix: person
    class:
      type: row_apply
      list_of_values: injected_df.class  # refers to injected dataframe
      row_func: "lambda x: x"
      resize: True
      seed: 1
"""
config = yaml.safe_load(params_string)

# Setting a seed is not recommended for general use; set one only when you need reproducible output
mock_generator = MockDataGenerator(instructions=config, seed=1)

injected_df = pd.DataFrame(
    [{"class": "engineering"}, {"class": "science"}, {"class": "history"}]
)

# inject dataframe
mock_generator.all_dataframes["injected_df"] = injected_df

mock_generator.generate_all()

generated_table = mock_generator.all_dataframes["persons"]

print(tabulate(generated_table, headers=generated_table.columns, tablefmt="psql"))

Resizing list from 3 to 10


+----+-------------+-------------+
|    | person_id   | class       |
|----+-------------+-------------|
|  0 | person1     | engineering |
|  1 | person2     | history     |
|  2 | person3     | history     |
|  3 | person4     | engineering |
|  4 | person5     | science     |
|  5 | person6     | science     |
|  6 | person7     | science     |
|  7 | person8     | history     |
|  8 | person9     | engineering |
|  9 | person10    | engineering |
+----+-------------+-------------+


As long as the dataframe is injected before calling the `generate_all()` method, it
may be referenced anywhere in the config.
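
The `Resizing list from 3 to 10` message in the output suggests that the three injected values are expanded to match `num_rows`. One way such a resize could work is seeded sampling with replacement. The sketch below assumes exactly that: `resize_values` is a hypothetical helper, not the fabricator's own function, and its output will not necessarily match the table above.

```python
import random


# Hypothetical helper, not part of data_fabricator.
def resize_values(values, target_size, seed=None):
    """Sample with replacement until the list has `target_size` items."""
    rng = random.Random(seed)
    return rng.choices(values, k=target_size)


classes = ["engineering", "science", "history"]
resized = resize_values(classes, target_size=10, seed=1)
print(resized)
```

Passing the same seed returns the same list, which is why a seed is convenient in documentation and tests but usually unwanted for general use.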