# Synth PoC for diabetes dataset

This notebook uses the [Diabetes dataset](https://www.kaggle.com/datasets/mathchi/diabetes-data-set) to explore a very simple single-table mock data use case.

### Requirements

As a **data engineer**, I should be able to specify what is already public to the data scientist and include only that information in the mock dataset.
Let's specify the **public data** for the diabetes dataset:
* Pregnancies: integer, mean 3.8
* Age: integer, between 18 and 100
* Outcome: boolean (0, 1), 35% true (1)
* Other columns should be dropped
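
The public facts above can be written down as plain data before any schema is built. The following is a minimal sketch, not part of synth: `PUBLIC_SPEC` and `build_schema` are hypothetical names, and the helper only emits schema keys that also appear in the manual schema in section a) of this notebook. The 0 to 17 bounds for Pregnancies are an assumption taken from that manual schema; the requirements only state the mean.

```python
import json

# Hypothetical encoding of the public facts listed above.
# "mean" and "true_share" are recorded for later checks only; the
# uniform ranges built below do not enforce them.
PUBLIC_SPEC = {
    "Pregnancies": {"low": 0, "high": 17, "mean": 3.8},  # bounds are an assumption
    "Age": {"low": 18, "high": 100},
    "Outcome": {"low": 0, "high": 1, "true_share": 0.35},
}


def build_schema(spec, length=1):
    """Build a synth-style array schema with one integer range per column."""
    content = {"type": "object"}
    for column, facts in spec.items():
        content[column] = {
            "type": "number",
            "range": {
                "low": facts["low"],
                "high": facts["high"],
                "step": 1,
                "include_high": True,
            },
        }
    return {
        "type": "array",
        "length": {"type": "number", "subtype": "u64", "constant": length},
        "content": content,
    }


schema_from_spec = build_schema(PUBLIC_SPEC)
print(json.dumps(schema_from_spec["content"]["Age"]))
```

Keeping the spec separate from the schema makes it easy to see which public facts (here the mean and the share of positive outcomes) are not yet expressed by the generated ranges.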

### Diabetes data

In [25]:
import pandas as pd

df = pd.read_csv("../datasets/diabetes.csv")
df.head()

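|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |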


In [88]:
df.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


### Get started with synth


In [None]:
# install synth
!curl --proto '=https' --tlsv1.2 -sSL https://getsynth.com/install | sh

### a) Specify schema manually

In [27]:
!rm -rf diabetes-schema
!mkdir diabetes-schema

import json
schema = {
  "type": "array",
  "length": { "type": "number", "subtype": "u64", "constant": 1 },
  "content": {
    "type": "object",
    "Pregnancies": { "type": "number", "range": {"low": 0, "high": 17, "step": 1, "include_high": True }},
    "Age": { "type": "number", "range": { "low": 18, "high": 100, "step": 1, "include_high": True }},
    "Outcome": { "type": "number", "range": { "low": 0, "high": 1, "step": 1, "include_high": True }}
  }
}
with open('diabetes-schema/diabetes.json', 'w') as fp:
    json.dump(schema, fp)

### b) Infer schema by importing a CSV

In [20]:
!rm -rf diabetes-schema-imported
!synth import diabetes-schema-imported  --from csv:../datasets

with open('diabetes-schema-imported/diabetes.json') as f:
    inferred_schema = json.load(f)
inferred_schema['content']

{'type': 'object',
 'Age': {'type': 'number',
  'range': {'low': 21, 'high': 81, 'step': 1},
  'subtype': 'u64'},
 'BMI': {'type': 'number',
  'range': {'low': 0.0, 'high': 67.1, 'step': 1.0},
  'subtype': 'f64'},
 'BloodPressure': {'type': 'number',
  'range': {'low': 0, 'high': 122, 'step': 1},
  'subtype': 'u64'},
 'DiabetesPedigreeFunction': {'type': 'number',
  'range': {'low': 0.078, 'high': 2.42, 'step': 1.0},
  'subtype': 'f64'},
 'Glucose': {'type': 'number',
  'range': {'low': 0, 'high': 199, 'step': 1},
  'subtype': 'u64'},
 'Insulin': {'type': 'number',
  'range': {'low': 0, 'high': 846, 'step': 1},
  'subtype': 'u64'},
 'Outcome': {'type': 'number',
  'range': {'low': 0, 'high': 2, 'step': 1},
  'subtype': 'u64'},
 'Pregnancies': {'type': 'number',
  'range': {'low': 0, 'high': 17, 'step': 1},
  'subtype': 'u64'},
 'SkinThickness': {'type': 'number',
  'range': {'low': 0, 'high': 99, 'step': 1},
  'subtype': 'u64'}}

In [48]:
# keep only the specified columns (plus the object's 'type' key) and drop the rest
c = inferred_schema['content']
inferred_schema['content'] = {key: c[key] for key in ['type', 'Age', 'Pregnancies', 'Outcome']}

with open('diabetes-schema-imported/diabetes.json', 'w') as f:
    json.dump(inferred_schema, f)
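
The dict comprehension above raises a bare `KeyError` if one of the listed columns is missing from the inferred schema. A slightly more defensive variant could look like the sketch below. `keep_columns` is a hypothetical helper, not part of synth, and `sample_content` is a shortened copy of the inferred schema shown earlier, used for illustration only.

```python
def keep_columns(content, columns):
    """Return a copy of a schema 'content' object limited to `columns`.

    The 'type' key is always preserved, since it describes the object itself.
    """
    missing = [name for name in columns if name not in content]
    if missing:
        raise KeyError(f"columns not found in schema: {missing}")
    return {key: content[key] for key in ["type", *columns]}


# Shortened copy of the inferred schema content, for illustration only.
sample_content = {
    "type": "object",
    "Age": {"type": "number", "range": {"low": 21, "high": 81, "step": 1}, "subtype": "u64"},
    "BMI": {"type": "number", "range": {"low": 0.0, "high": 67.1, "step": 1.0}, "subtype": "f64"},
    "Outcome": {"type": "number", "range": {"low": 0, "high": 2, "step": 1}, "subtype": "u64"},
}

filtered = keep_columns(sample_content, ["Age", "Outcome"])
print(sorted(filtered))  # ['Age', 'Outcome', 'type']
```

Listing every missing column in the error message makes a typo in a column name easier to spot than the single key reported by a plain lookup.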

### Generate data from manual and inferred schema

In [49]:
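# NOTE: CSV output does not work, so the data is written as JSON
# make sure the output directory exists before redirecting into it
!mkdir -p generated
!synth generate diabetes-schema --size 10 > generated/diabetes.json
!synth generate diabetes-schema-imported --size 10 > generated/diabetes-inferred.json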



In [50]:
import pandas as pd

# Note: pd.read_json(..., orient='records') fails to convert the nested 'diabetes' object,
# so the files are loaded with json and the 'diabetes' list is unwrapped by hand
with open('generated/diabetes.json') as f:
    generated = json.load(f)['diabetes']
with open('generated/diabetes-inferred.json') as f:
    generated_inferred = json.load(f)['diabetes']


In [51]:
df_gen = pd.DataFrame.from_dict(generated)
df_gen.head()

|   | Age | Outcome | Pregnancies |
|---|---|---|---|
| 0 | 85 | 1 | 2 |
| 1 | 69 | 0 | 17 |
| 2 | 72 | 0 | 12 |
| 3 | 24 | 1 | 6 |
| 4 | 23 | 0 | 5 |


In [52]:
df_inferred = pd.DataFrame.from_dict(generated_inferred)
df_inferred.head()

|   | Age | Outcome | Pregnancies |
|---|---|---|---|
| 0 | 56 | 0 | 16 |
| 1 | 29 | 0 | 9 |
| 2 | 71 | 1 | 8 |
| 3 | 73 | 0 | 11 |
| 4 | 25 | 1 | 11 |
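
The generated rows can be compared against the public facts from the Requirements section. The sketch below is one possible check using only the standard library; `summarize` is a hypothetical helper, and `rows` is copied from the first five rows of `df_gen.head()` above. Five rows are far too few to judge a distribution, so the numbers are only illustrative. A uniform integer range from 0 to 17 has an expected mean of 8.5, and a uniform choice between 0 and 1 an expected share of 0.5, so the plain ranges used in the manual schema would not be expected to reproduce the public mean of 3.8 or the 35% share of positive outcomes.

```python
from statistics import mean

# First five rows of the manually specified output, copied from df_gen.head() above.
rows = [
    {"Age": 85, "Outcome": 1, "Pregnancies": 2},
    {"Age": 69, "Outcome": 0, "Pregnancies": 17},
    {"Age": 72, "Outcome": 0, "Pregnancies": 12},
    {"Age": 24, "Outcome": 1, "Pregnancies": 6},
    {"Age": 23, "Outcome": 0, "Pregnancies": 5},
]


def summarize(rows):
    """Compute the statistics named in the requirements for a list of row dicts."""
    ages = [row["Age"] for row in rows]
    return {
        "age_in_range": all(18 <= age <= 100 for age in ages),
        "pregnancies_mean": mean(row["Pregnancies"] for row in rows),
        "outcome_true_share": mean(row["Outcome"] for row in rows),
    }


summary = summarize(rows)
print(summary)
```

Running the same function on the full `generated` list (rather than the five-row sample) would give a more meaningful comparison with the public figures.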
