In [1]:
import pandas as pd

pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 1000)

# Reverse Engineer

If you have existing data and want to fabricate tables that resemble it,
you can use the `reverse_engineer` functions to generate a skeleton config.

This document walks through the reverse engineering process, i.e. generating a config from existing data.
It works only with Spark dataframes, not pandas dataframes. It also does not support complex dtypes such as arrays and structs.

Let's create sample Spark dataframes to demonstrate the `reverse_engineer` functions.


In [2]:
import yaml
import datetime
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    ArrayType,
    DateType,
    DoubleType,
    FloatType,
    IntegerType,
    LongType,
    StringType,
    StructField,
    StructType,
    TimestampType,
)
import random

np.random.seed(1)
random.seed(1)

spark = (
    SparkSession.builder.config("spark.ui.showConsoleProgress", False)
    .config("spark.sql.shuffle.partitions", 1)
    .getOrCreate()
)
schema = StructType(
    [
        StructField("int_col", IntegerType(), True),
        StructField("long_col", LongType(), True),
        StructField("string_col", StringType(), True),
        StructField("float_col", FloatType(), True),
        StructField("double_col", DoubleType(), True),
        StructField("date_col", DateType(), True),
        StructField("datetime_col", TimestampType(), True),
    ]
)

data = [
    (
        1,
        2,
        "hello world",
        13.01,
        0.89,
        pd.Timestamp("2012-05-01").date(),
        datetime.datetime(2020, 11, 30, 18, 29, 19, 990601),
    ),
    (
        1,
        2,
        "hello world",
        13.01,
        0.89,
        pd.Timestamp("2012-05-01").date(),
        datetime.datetime(2020, 11, 30, 18, 29, 19, 990601),
    ),
    (
        1,
        2,
        "language string",
        10.51,
        6.79,
        pd.Timestamp("2011-05-01").date(),
        datetime.datetime(2018, 11, 30, 18, 19, 19, 990601),
    ),
]

sample_df = spark.createDataFrame(data, schema)

schema_with_complex_dtype = StructType(
    [
        StructField("int_col", IntegerType(), True),
        StructField("long_col", LongType(), True),
        StructField("string_col", StringType(), True),
        StructField("float_col", FloatType(), True),
        StructField("double_col", DoubleType(), True),
        StructField("date_col", DateType(), True),
        StructField("datetime_col", TimestampType(), True),
        StructField("array_int", ArrayType(IntegerType()), True),
    ]
)

data_with_complex_dtype = [
    (
        1,
        2,
        "hello world",
        13.01,
        0.89,
        pd.Timestamp("2012-05-01").date(),
        datetime.datetime(2020, 11, 30, 18, 29, 19, 990601),
        [1, 5, 7],
    ),
    (
        1,
        2,
        "hello world",
        13.01,
        0.89,
        pd.Timestamp("2012-05-01").date(),
        datetime.datetime(2020, 11, 30, 18, 29, 19, 990601),
        [9, 2, 7],
    ),
    (
        1,
        2,
        "language string",
        10.51,
        6.79,
        pd.Timestamp("2011-05-01").date(),
        datetime.datetime(2018, 11, 30, 18, 19, 19, 990601),
        [8, 2, 9],
    ),
]

sample_df_with_complex_dtypes = spark.createDataFrame(
    data_with_complex_dtype, schema_with_complex_dtype
)

data_dict = {
    "int_col": [1, 2, 3],
    "long_col": [2, 3, 4],
    "string_col": ["awesome_string", "hello", "world"],
    "float_col": [10.51, 1.1, 2.2],
    "double_col": [6.79, 9.82, 8.99],
}

sample_pandas_df = pd.DataFrame(data_dict)



Sample Spark dataframe:

In [3]:
sample_df.show(truncate=False)

+-------+--------+---------------+---------+----------+----------+--------------------------+
|int_col|long_col|string_col     |float_col|double_col|date_col  |datetime_col              |
+-------+--------+---------------+---------+----------+----------+--------------------------+
|1      |2       |hello world    |13.01    |0.89      |2012-05-01|2020-11-30 18:29:19.990601|
|1      |2       |hello world    |13.01    |0.89      |2012-05-01|2020-11-30 18:29:19.990601|
|1      |2       |language string|10.51    |6.79      |2011-05-01|2018-11-30 18:19:19.990601|
+-------+--------+---------------+---------+----------+----------+--------------------------+



## Generate Config from a Spark DataFrame:

In [4]:
from data_fabricator.v0.core.reverse_engineer import reverse_engineer_df

table_config = reverse_engineer_df(df=sample_df, num_rows=10)

print(yaml.safe_dump(table_config))

columns:
  date_col:
    sample_values:
    - 2011-05-01
    - 2012-05-01
    type: generate_values
  datetime_col:
    sample_values:
    - 2018-11-30 18:19:19.990601
    - 2020-11-30 18:29:19.990601
    type: generate_values
  double_col:
    sample_values:
    - 0.89
    - 6.79
    type: generate_values
  float_col:
    sample_values:
    - 10.510000228881836
    - 13.010000228881836
    type: generate_values
  int_col:
    sample_values:
    - 1
    type: generate_values
  long_col:
    sample_values:
    - 2
    type: generate_values
  string_col:
    sample_values:
    - hello world
    - language string
    type: generate_values
num_rows: 10



## Generate Config from a Pandas DataFrame:

This is not currently implemented: convert the pandas dataframe to a Spark dataframe first, then proceed as above. There are two ways to convert a pandas dataframe to Spark: in memory (shown in the example below) or via parquet files.
Parquet files are generally recommended because the in-memory conversion infers data types implicitly. For brevity, this example uses the in-memory conversion.
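
For reference, the parquet route could look roughly like the sketch below. The helper name `pandas_to_spark_via_parquet` and the file name `data.parquet` are hypothetical. The sketch assumes Spark runs in local mode so that it can read from the local filesystem, and that a parquet engine such as `pyarrow` is installed for `DataFrame.to_parquet`.

```python
import tempfile
from pathlib import Path


def pandas_to_spark_via_parquet(pandas_df, spark, directory=None):
    """Write a pandas dataframe to parquet and read it back with Spark.

    Hypothetical helper: `pandas_df` is a pandas DataFrame and `spark` is an
    active SparkSession. The directory is not cleaned up, because Spark reads
    the file lazily and still needs it after this function returns.
    """
    directory = Path(directory or tempfile.mkdtemp())
    path = directory / "data.parquet"
    # index=False keeps the pandas index out of the written columns.
    pandas_df.to_parquet(path, index=False)
    return spark.read.parquet(str(path))


# Usage with the objects defined earlier in this notebook:
# spark_df = pandas_to_spark_via_parquet(sample_pandas_df, spark)
# table_config = reverse_engineer_df(df=spark_df, num_rows=10)
```

On a cluster the path would need to point at storage that every executor can reach, which this sketch does not handle.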


In [5]:
from data_fabricator.v0.core.reverse_engineer import reverse_engineer_df

# Template: replace `some_pandas_df` with your own pandas dataframe,
# for example `sample_pandas_df` defined earlier.

# some_pandas_df.head()

# spark_df = spark.createDataFrame(some_pandas_df)

# table_config = reverse_engineer_df(df=spark_df, num_rows=10)

# print(yaml.safe_dump(table_config))

## Generating Config for Spark Dataframes with Complex Data Types:

Sample Spark dataframe with complex dtypes:

In [6]:
sample_df_with_complex_dtypes.show(truncate=False)

+-------+--------+---------------+---------+----------+----------+--------------------------+---------+
|int_col|long_col|string_col     |float_col|double_col|date_col  |datetime_col              |array_int|
+-------+--------+---------------+---------+----------+----------+--------------------------+---------+
|1      |2       |hello world    |13.01    |0.89      |2012-05-01|2020-11-30 18:29:19.990601|[1, 5, 7]|
|1      |2       |hello world    |13.01    |0.89      |2012-05-01|2020-11-30 18:29:19.990601|[9, 2, 7]|
|1      |2       |language string|10.51    |6.79      |2011-05-01|2018-11-30 18:19:19.990601|[8, 2, 9]|
+-------+--------+---------------+---------+----------+----------+--------------------------+---------+



Passing this dataframe to `reverse_engineer_df` raises a `ValueError` because of the `array_int` column:

In [7]:
from data_fabricator.v0.core.reverse_engineer import reverse_engineer_df

try:
    table_config = reverse_engineer_df(df=sample_df_with_complex_dtypes, num_rows=10)
    print(yaml.safe_dump(table_config))

except ValueError as error:
    print("ValueError - ", error)

ValueError -  dtype array<int> not allowed. Kindly drop this column.
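
One way to avoid the error is to find the offending columns up front and drop them. The sketch below works on `(name, dtype)` pairs in the shape that a Spark dataframe's `dtypes` attribute returns. The helper name `complex_columns` is hypothetical, and treating `map<...>` as unsupported alongside arrays and structs is an assumption, since only the array case is demonstrated here.

```python
# Prefixes of Spark type strings treated as complex in this sketch.
# Including "map<" is an assumption; the document only names arrays and structs.
COMPLEX_PREFIXES = ("array<", "map<", "struct<")


def complex_columns(dtypes):
    """Return the names of columns whose dtype string looks complex."""
    return [name for name, dtype in dtypes if dtype.startswith(COMPLEX_PREFIXES)]


# Same shape as the list in `sample_df_with_complex_dtypes.dtypes`.
dtypes = [
    ("int_col", "int"),
    ("long_col", "bigint"),
    ("string_col", "string"),
    ("float_col", "float"),
    ("double_col", "double"),
    ("date_col", "date"),
    ("datetime_col", "timestamp"),
    ("array_int", "array<int>"),
]

to_drop = complex_columns(dtypes)
print(to_drop)  # ['array_int']

# With a live Spark session, the columns could then be dropped before
# reverse engineering:
# cleaned_df = sample_df_with_complex_dtypes.drop(*to_drop)
```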


## Creating Config for Multiple Dataframes in one go:

In [8]:
from data_fabricator.v0.core.reverse_engineer import reverse_engineer_tables

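# Drop the unsupported array column so the dataframe can be reverse engineered.
valid_sample_df = sample_df_with_complex_dtypes.drop("array_int")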
various_tables = {
    "table1": valid_sample_df,
    "table2": valid_sample_df,
    "table3": valid_sample_df,
}

table_config = reverse_engineer_tables(various_tables)
print(yaml.safe_dump(table_config))

table1:
  columns:
    date_col:
      sample_values:
      - 2011-05-01
      - 2012-05-01
      type: generate_values
    datetime_col:
      sample_values:
      - 2018-11-30 18:19:19.990601
      - 2020-11-30 18:29:19.990601
      type: generate_values
    double_col:
      sample_values:
      - 0.89
      - 6.79
      type: generate_values
    float_col:
      sample_values:
      - 10.510000228881836
      - 13.010000228881836
      type: generate_values
    int_col:
      sample_values:
      - 1
      type: generate_values
    long_col:
      sample_values:
      - 2
      type: generate_values
    string_col:
      sample_values:
      - hello world
      - language string
      type: generate_values
  num_rows: 10
table2:
  columns:
    date_col:
      sample_values:
      - 2011-05-01
      - 2012-05-01
      type: generate_values
    datetime_col:
      sample_values:
      - 2018-11-30 18:19:19.990601
      - 2020-11-30 18:29:19.990601
      type: generate_values
    ... (output truncated)

Notice that every column is generated with `type: generate_values`. This is because the function only samples the dataframe and records the unique values it finds in each column.
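
To make that behaviour concrete, here is a minimal pure-Python sketch of the same idea. It is not the library's implementation: the helper name `skeleton_config` and the list-of-dicts input are hypothetical, and it assumes the values in each column are sortable and contain no nulls.

```python
def skeleton_config(rows, num_rows=10):
    """Build a `generate_values` skeleton from a list of row dicts."""
    unique_values = {}
    for row in rows:
        for name, value in row.items():
            # A set keeps each distinct value once per column.
            unique_values.setdefault(name, set()).add(value)
    return {
        "columns": {
            name: {"type": "generate_values", "sample_values": sorted(values)}
            for name, values in unique_values.items()
        },
        "num_rows": num_rows,
    }


rows = [
    {"int_col": 1, "string_col": "hello world"},
    {"int_col": 1, "string_col": "hello world"},
    {"int_col": 1, "string_col": "language string"},
]

config = skeleton_config(rows)
print(config["columns"]["string_col"]["sample_values"])
# ['hello world', 'language string']
```

Duplicated rows collapse to a single sample value, which is why `int_col` ends up with only one entry in the generated config.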

The general workflow is to use these functions to generate a skeleton config, then manually change each column to a more appropriate generator type. Future work might include implementing a smarter profiler.
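
As a small illustration of that manual step, the sketch below edits a skeleton with the same shape as the output above. It only touches keys that appear in that output (`num_rows` and `sample_values`). The helper name `widen_column` is hypothetical, and switching a column to a different `type` is left out because the available type names are not covered in this document.

```python
import copy


def widen_column(config, table, column, extra_values, num_rows=None):
    """Return a copy of `config` with extra sample values for one column."""
    updated = copy.deepcopy(config)  # leave the generated skeleton untouched
    column_spec = updated[table]["columns"][column]
    merged = set(column_spec["sample_values"]) | set(extra_values)
    column_spec["sample_values"] = sorted(merged)
    if num_rows is not None:
        updated[table]["num_rows"] = num_rows
    return updated


skeleton = {
    "table1": {
        "columns": {
            "int_col": {"type": "generate_values", "sample_values": [1]},
        },
        "num_rows": 10,
    }
}

edited = widen_column(skeleton, "table1", "int_col", [2, 3], num_rows=100)
print(edited["table1"]["columns"]["int_col"]["sample_values"])  # [1, 2, 3]
print(edited["table1"]["num_rows"])  # 100
```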


## Now let's verify that the config is valid by passing it back to the `MockDataGenerator`:


In [9]:
from data_fabricator.v0.core.fabricator import MockDataGenerator

# Setting a seed is not recommended for general use; consider carefully when a fixed seed is appropriate.
mock_generator = MockDataGenerator(instructions=table_config, seed=1)
mock_generator.generate_all()

generated_table1_df = mock_generator.all_dataframes["table1"]
print(generated_table1_df)

   int_col  long_col       string_col  float_col  double_col    date_col               datetime_col
0        1         2      hello world      10.51        6.79  2011-05-01 2018-11-30 18:19:19.990601
1        1         2  language string      10.51        6.79  2012-05-01 2020-11-30 18:29:19.990601
2        1         2  language string      10.51        0.89  2012-05-01 2020-11-30 18:29:19.990601
3        1         2      hello world      10.51        0.89  2012-05-01 2020-11-30 18:29:19.990601
4        1         2      hello world      10.51        6.79  2012-05-01 2018-11-30 18:19:19.990601
5        1         2      hello world      10.51        6.79  2012-05-01 2018-11-30 18:19:19.990601
6        1         2      hello world      13.01        6.79  2011-05-01 2020-11-30 18:29:19.990601
7        1         2      hello world      13.01        0.89  2011-05-01 2020-11-30 18:29:19.990601
8        1         2      hello world      13.01        6.79  2012-05-01 2020-11-30 18:29:19.990601
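
Before handing a hand-edited config back to the generator, a cheap structural check can catch obvious mistakes. The sketch below only checks the shape shown in this notebook (a `columns` mapping, an integer `num_rows`, and non-empty `sample_values` for `generate_values` columns). The helper name `check_skeleton` is hypothetical, and these checks are an assumption about what a sensible skeleton looks like, not the validation that `MockDataGenerator` itself performs.

```python
def check_skeleton(config):
    """Return a list of human-readable problems found in a skeleton config."""
    problems = []
    for table, spec in config.items():
        if not isinstance(spec.get("num_rows"), int):
            problems.append(f"{table}: num_rows is missing or not an integer")
        columns = spec.get("columns")
        if not isinstance(columns, dict) or not columns:
            problems.append(f"{table}: columns is missing or empty")
            continue
        for column, column_spec in columns.items():
            column_type = column_spec.get("type")
            if column_type is None:
                problems.append(f"{table}.{column}: type is missing")
            elif column_type == "generate_values" and not column_spec.get("sample_values"):
                problems.append(f"{table}.{column}: sample_values is empty")
    return problems


good = {
    "table1": {
        "columns": {"int_col": {"type": "generate_values", "sample_values": [1]}},
        "num_rows": 10,
    }
}
bad = {
    "table1": {
        "columns": {"int_col": {"type": "generate_values", "sample_values": []}},
    }
}

print(check_skeleton(good))  # []
print(check_skeleton(bad))
```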
