In [1]:
import pandas as pd

pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 1000)

# Claims Data Case Study

This document explains how the data fabricator utility can be used to generate commercial events data and all of its associated dimensions.


## Entity relationship diagram

![ER Diagram](../images/claims_case_study_erd.svg)


## Table definition

Patient: Dimension table containing patient information.

| Column name    | Column logic                 |
|----------------|------------------------------|
| patient_id     | Unique ID prefixed with pt_  |
| patient_gender | One of three values: m, f, u |
| birth_year     | Year                         |


Provider: Dimension table containing provider information.

| Column name     | Column logic                     |
|-----------------|----------------------------------|
| provider_id     | Unique ID prefixed with phys_    |
| first_name      | String value                     |
| last_name       | String value                     |
| state           | String value from a fixed set    |
| zip             | US ZIP code                      |
| speciality_code | Value from a list of specialties |


Diagnosis: Dimension table containing diagnosis details.

| Column name      | Column logic                                         |
|------------------|------------------------------------------------------|
| diagnosis_code   | String of format "xx.xx" containing specific letters |
| icd_version_type | Number: 1, 2, or -1                                  |
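
The code format can be expressed as a regular expression for a quick sanity check. The sketch below assumes the format used by the configuration later in this notebook (`string_format: "?##.##"` with `letters: "CIJFGK"`), where Faker's `pystr_format` replaces `?` with one of the given letters and `#` with a digit. The helper name `is_valid_diagnosis_code` is hypothetical and not part of data fabricator.

```python
import re

# One letter from the configured set, two digits, a dot, two digits.
DIAGNOSIS_CODE_PATTERN = re.compile(r"[CIJFGK][0-9]{2}\.[0-9]{2}")


def is_valid_diagnosis_code(code: str) -> bool:
    """Return True if the whole string matches the assumed '?##.##' format."""
    return DIAGNOSIS_CODE_PATTERN.fullmatch(code) is not None


# Hand-written example values, not generated output.
for candidate in ["C12.34", "A12.34", "C1.234"]:
    print(candidate, is_valid_diagnosis_code(candidate))
```

Only the first candidate uses an allowed letter and the expected digit layout, so only it passes.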


Procedure: Dimension table containing procedure code information.

| Column name         | Column logic                           |
|---------------------|----------------------------------------|
| procedure_code      | Unique number in the range 00100-99999 |
| procedure_code_desc | Static string                          |
| product_group       | Value from list: group1, group2…group5 |


Events: Fact table containing patient claims events.

| Column name          | Column logic                            |
|----------------------|-----------------------------------------|
| claim_id             | Unique ID; range 0-20000; ID length 5   |
| provider_id          | provider_id from the provider table     |
| patient_id           | patient_id from the patient table       |
| procedure_code       | procedure_code from the procedure table |
| diagnosis_code       | diagnosis_code from the diagnosis table |
| event_date           | Date between 2019-01-01 and 2021-01-01  |
| record_creation_date | Date; one day after event_date          |
| copay_amt            | Integer between 0 and 500               |
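
Rules like the ones in the Events table can be checked with pandas once the tables exist. The sketch below is illustrative only: the helper `check_events` and the small hand-written frames are hypothetical, and it covers just two rules from the tables above (foreign keys resolve to a dimension table, and `record_creation_date` is one day after `event_date`).

```python
import pandas as pd


def check_events(events: pd.DataFrame, dimensions: dict) -> list:
    """Return a list of human-readable rule violations (empty if none)."""
    problems = []

    # Every foreign key value must exist in the same-named dimension column.
    for column, dim in dimensions.items():
        missing = ~events[column].isin(dim[column])
        if missing.any():
            problems.append(
                f"{column}: {int(missing.sum())} value(s) not in dimension table"
            )

    # record_creation_date must be exactly one day after event_date.
    expected = events["event_date"] + pd.Timedelta(days=1)
    mismatched = events["record_creation_date"] != expected
    if mismatched.any():
        problems.append(
            f"record_creation_date: {int(mismatched.sum())} row(s) "
            "not one day after event_date"
        )
    return problems


# Hand-written sample rows; the second event breaks both rules on purpose.
patient = pd.DataFrame({"patient_id": ["pt_0000516", "pt_0000965"]})
events = pd.DataFrame(
    {
        "patient_id": ["pt_0000516", "pt_9999999"],
        "event_date": pd.to_datetime(["2019-03-01", "2020-07-15"]),
        "record_creation_date": pd.to_datetime(["2019-03-02", "2020-07-15"]),
    }
)

problems = check_events(events, {"patient_id": patient})
print(problems)
```

Returning a list of messages rather than raising on the first failure makes it easier to see every broken rule in one pass.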


In [2]:
import yaml

yaml_string = """
patient:
  num_rows: 10
  columns:
    patient_id:
      type: generate_unique_id
      prefix: pt_
      id_start_range: 0
      id_end_range: 5000
      id_length: 10
    patient_gender:
      type: generate_values
      sample_values: ["m", "f", "u"]
      seed: 0.5
    birth_year:
      type: faker
      provider: year
      # Setting seed is not recommended for general use, please consider when to use seed
      faker_seed: 1

provider:
  num_rows: 10
  columns:
    provider_id:
      type: generate_unique_id
      prefix: phys_
      id_start_range: 0
      id_end_range: 1000
      id_length: 10
    first_name:
      type: faker
      provider: pystr_format
      provider_args:
        string_format: "?????????"
        letters: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      # Setting seed is not recommended for general use, please consider when to use seed
      faker_seed: 1
    last_name:
      type: faker
      provider: pystr_format
      provider_args:
        string_format: "?????????"
        letters: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      # Setting seed is not recommended for general use, please consider when to use seed
      faker_seed: 1
    state:
      type: generate_values
      sample_values: ["MA", "CA", "NY", "CT", "DC", "IL"]
      seed: 1
    zip:
      type: faker
      provider: postcode
      localisation: en_US
      # Setting seed is not recommended for general use, please consider when to use seed
      faker_seed: 1
    speciality_code:
      type: generate_values
      sample_values:
        - Cardiology
        - Anesthesiology
        - Dermatology
        - Gastroenterology
        - Pulmonology
        - Urology
        - Neurology
        - Immunology
        - Ophthalmology

diagnosis_code:
  num_rows: 10
  columns:
    diagnosis_code:
      type: faker
      provider: pystr_format
      provider_args:
        string_format: "?##.##"
        letters: "CIJFGK"
      # Setting seed is not recommended for general use, please consider when to use seed
      faker_seed: 1
    icd_version_type:
      type: generate_values
      sample_values: [1, 2, -1]

procedure_code:
  num_rows: 10
  columns:
    procedure_code:
      type: generate_unique_id
      id_start_range: 100
      id_end_range: 99999
      id_length: 5
    procedure_code_desc:
      type: generate_values
      sample_values: ["procedure code description"]
    product_group:
      type: generate_values
      sample_values: ["group 1", "group 2", "group 3", "group 4", "group 5"]

events:
  num_rows: 10
  columns:
    claim_id:
      type: generate_unique_id
      prefix: rx
      id_start_range: 0
      id_end_range: 10
      id_length: 7
    # patient_id and patient_gender below share list_of_values and seed,
    # presumably so each event row keeps a consistent id/gender pair.
    patient_id:
      type: row_apply
      list_of_values:
        - patient.patient_id
        - patient.patient_gender
      row_func: "lambda *args: args[0]"
      seed: 1
    patient_gender:
      type: row_apply
      list_of_values:
        - patient.patient_id
        - patient.patient_gender
      row_func: "lambda *args: args[1]"
      seed: 1
    provider_id:
      type: row_apply
      list_of_values: provider.provider_id
      row_func: "lambda x: x"
      resize: True
    procedure_code:
      type: row_apply
      list_of_values: procedure_code.procedure_code
      row_func: "lambda x: x"
    diagnosis_code:
      type: row_apply
      list_of_values: diagnosis_code.diagnosis_code
      row_func: "lambda x: x"
      resize: True
    event_date:
      type: generate_dates
      start_dt: 2018-01-01
      end_dt: 2020-12-31
      freq: D
    record_creation_date:
      type: row_apply
      list_of_values: events.event_date
      row_func: "lambda x: x + datetime.timedelta(days=1)"
    copay_amt:
      type: generate_random_numbers
      start_range: 0
      end_range: 250
      integer: True
"""
config = yaml.safe_load(yaml_string)
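
PyYAML's `safe_load` does not complain when a mapping repeats a key; the later value silently replaces the earlier one. In a long configuration like this one, that can hide a column definition. A minimal sketch of a stricter loader follows. `UniqueKeyLoader` is a hypothetical helper, not part of data fabricator or PyYAML, and it does not handle YAML merge keys (`<<`).

```python
import yaml


class UniqueKeyLoader(yaml.SafeLoader):
    """SafeLoader variant that rejects duplicate keys within one mapping."""

    def construct_mapping(self, node, deep=False):
        seen = set()
        for key_node, _value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            if key in seen:
                line = key_node.start_mark.line + 1
                raise ValueError(f"duplicate key {key!r} at line {line}")
            seen.add(key)
        return super().construct_mapping(node, deep=deep)


# Hypothetical fragment with a repeated column name.
sample = """
events:
  columns:
    patient_id: first
    patient_id: second
"""

# Default behaviour: the second definition wins without any warning.
print(yaml.safe_load(sample))

# Stricter behaviour: the duplicate is reported instead.
try:
    yaml.load(sample, Loader=UniqueKeyLoader)
except ValueError as err:
    print(err)
```

Under this assumption, swapping `yaml.safe_load(yaml_string)` for `yaml.load(yaml_string, Loader=UniqueKeyLoader)` would surface any repeated column name before the configuration reaches the generator.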

The data fabricator configuration looks like this:

In [3]:
print(yaml_string)


patient:
  num_rows: 10
  columns:
    patient_id:
      type: generate_unique_id
      prefix: pt_
      id_start_range: 0
      id_end_range: 5000
      id_length: 10
    patient_gender:
      type: generate_values
      sample_values: ["m", "f", "u"]
      seed: 0.5
    birth_year:
      type: faker
      provider: year
      # Setting seed is not recommended for general use, please consider when to use seed
      faker_seed: 1

provider:
  num_rows: 10
  columns:
    provider_id:
      type: generate_unique_id
      prefix: phys_
      id_start_range: 0
      id_end_range: 1000
      id_length: 10
    first_name:
      type: faker
      provider: pystr_format
      provider_args:
        string_format: "?????????"
        letters: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      # Setting seed is not recommended for general use, please consider when to use seed
      faker_seed: 1
    last_name:
      type: faker
      provider: pystr_format
      provider_args:
        string_format: "????????

The generated data looks like this:

In [4]:
from data_fabricator.v0.core.fabricator import MockDataGenerator
from tabulate import tabulate

# Setting seed is not recommended for general use, please consider when to use seed
mock_generator = MockDataGenerator(instructions=config, seed=1)
mock_generator.generate_all()

for table_name, table in mock_generator.all_dataframes.items():
    print(f"Table: {table_name}")
    print(tabulate(table, headers=table.columns, tablefmt="psql"))
    print("\n")



Table: patient
+----+--------------+------------------+--------------+
|    | patient_id   | patient_gender   |   birth_year |
|----+--------------+------------------+--------------|
|  0 | pt_0000516   | m                |         1979 |
|  1 | pt_0000965   | u                |         2008 |
|  2 | pt_0001100   | f                |         2021 |
|  3 | pt_0001719   | m                |         1974 |
|  4 | pt_0002089   | m                |         1987 |
|  5 | pt_0003109   | m                |         1978 |
|  6 | pt_0003682   | u                |         2003 |
|  7 | pt_0003868   | m                |         2021 |
|  8 | pt_0004058   | m                |         2000 |
|  9 | pt_0004662   | f                |         2002 |
+----+--------------+------------------+--------------+


Table: provider
+----+---------------+--------------+-------------+---------+-------+-------------------+
|    | provider_id   | first_name   | last_name   | state   |   zip | speciality_code   |
|--