
This repo is for Capgemini practicum project for generating medical and pharmacy claims data.


dipendra96/practicum_project


Synthetic Medical Data Generation - Capgemini

This repo is for our Capgemini-sponsored practicum project, containing the codebase for generating synthetic medical and pharmacy claims data.

Report: https://docs.google.com/document/d/1Q1gPblzH8KUkVfYRrjhE1NDhKXckwzmv_UeXkuF2ECA/edit?usp=sharing

Slides: https://drive.google.com/file/d/1DVrvQJAh5VSqXUa65Zi_u_NtCrd1v6Wd/view?usp=sharing

Data Preprocessing

See the code in the Data_Processing folder.

Loading the data

Use load_data.ipynb to convert the zipped Parquet data to CSV files. This notebook should be run for both the medical claims data and the pharmacy claims data.

Merging medical and pharmacy claims data

Use merge_med_pharm_data.ipynb to merge the medical claims data with the pharmacy claims data. Because of memory restrictions, we keep only one pharmacy claim per medical claim per Member Life ID. This limit can easily be changed in the SQL provided in the notebook.
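As a rough illustration of the "one pharmacy claim per medical claim" idea, here is a minimal sketch using Python's built-in sqlite3 module. It is not the SQL from the notebook. The table names (medical_claims, pharmacy_claims), the column names, and the rule of keeping the earliest pharmacy claim by fill date are all assumptions made for the example, as is the use of a LEFT JOIN to keep medical claims that have no pharmacy match.

```python
import sqlite3


def merge_claims(conn):
    """Join each medical claim to at most one pharmacy claim for the same member.

    Assumption: the pharmacy claim kept is the member's earliest one by
    fill_date. Changing ORDER BY / LIMIT changes which claim is kept.
    """
    query = """
        SELECT m.claim_id, m.member_life_id, m.diagnosis, p.rx_claim_id, p.drug
        FROM medical_claims AS m
        LEFT JOIN pharmacy_claims AS p
          ON p.rx_claim_id = (
              SELECT p2.rx_claim_id
              FROM pharmacy_claims AS p2
              WHERE p2.member_life_id = m.member_life_id
              ORDER BY p2.fill_date, p2.rx_claim_id
              LIMIT 1
          )
        ORDER BY m.claim_id
    """
    return conn.execute(query).fetchall()


# Tiny in-memory dataset with hypothetical columns, just to exercise the query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE medical_claims (claim_id INTEGER, member_life_id TEXT, diagnosis TEXT);
    CREATE TABLE pharmacy_claims (rx_claim_id INTEGER, member_life_id TEXT,
                                  fill_date TEXT, drug TEXT);
    INSERT INTO medical_claims VALUES
        (1, 'A', 'E11.9'), (2, 'A', 'I10'), (3, 'B', 'J45.909');
    INSERT INTO pharmacy_claims VALUES
        (10, 'A', '2020-01-05', 'metformin'),
        (11, 'A', '2020-02-01', 'lisinopril');
""")

merged = merge_claims(conn)
for row in merged:
    print(row)
```

Each medical claim yields exactly one output row, so the merged table never grows beyond the size of the medical claims table, which is what keeps memory use bounded.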

Modeling

See the code in the Modeling folder.

TabGen_Medical_Data.ipynb -> Code for generating synthetic medical claims data using a fine-tuned GPT-2 model. Multiple rows are transformed into a unified context to handle time dependence.

TabGen_Merged_Pharmacy_Medical_data.ipynb -> Code for generating synthetic medical and pharmacy claims data using a fine-tuned GPT-2 model. Multiple rows are transformed into a unified context to handle time dependence.
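To make "transforming multiple rows into a unified context" more concrete, the sketch below groups claim rows by member, sorts them by date, and serializes them into one text string per member that a language model could be fine-tuned on. The exact serialization used in the notebooks is not described here, so the "column is value" format, the " ; " separator, and the column names are assumptions for illustration only.

```python
from collections import defaultdict


def row_to_text(row, skip):
    """Serialize one claim row as 'column is value' pairs (hypothetical format)."""
    return ", ".join(f"{key} is {value}" for key, value in row.items() if key not in skip)


def build_member_contexts(rows, id_key="member_life_id", date_key="service_date", sep=" ; "):
    """Collapse all rows of a member into one chronologically ordered string."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[id_key]].append(row)

    contexts = {}
    for member, claims in grouped.items():
        # ISO-formatted date strings sort chronologically as plain text.
        claims.sort(key=lambda claim: claim[date_key])
        contexts[member] = sep.join(row_to_text(claim, skip={id_key}) for claim in claims)
    return contexts


# Hypothetical claim rows; real data would have many more columns.
rows = [
    {"member_life_id": "A", "service_date": "2020-03-01", "diagnosis": "I10", "paid_amount": 80.0},
    {"member_life_id": "A", "service_date": "2020-01-15", "diagnosis": "E11.9", "paid_amount": 120.5},
    {"member_life_id": "B", "service_date": "2020-02-10", "diagnosis": "J45.909", "paid_amount": 45.0},
]

contexts = build_member_contexts(rows)
for member, text in contexts.items():
    print(member, "->", text)
```

Putting a member's claims in a single ordered sequence lets the model see earlier claims when generating later ones, which is one way to preserve time dependence that row-by-row generation would lose.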

Evaluation

See the code in the Evaluation folder.

Evaluation_Script.ipynb can be used to evaluate the data generated by any model and examine how well it represents the original dataset. It evaluates both numerical and categorical variables, performs PCA and t-SNE, and plots distributions.
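The metrics the notebook computes are not listed here, so the following is only a sketch of what a per-column comparison between real and synthetic data might look like. It uses gaps in mean and standard deviation for a numerical column and total variation distance for a categorical column; these metric choices, the function names, and the sample values are all assumptions. PCA, t-SNE, and plotting are left out to keep the example dependency-free.

```python
from collections import Counter
from statistics import mean, pstdev


def numeric_summary_gap(real, synthetic):
    """Absolute gaps in mean and population standard deviation for one column."""
    return {
        "mean_gap": abs(mean(real) - mean(synthetic)),
        "std_gap": abs(pstdev(real) - pstdev(synthetic)),
    }


def total_variation_distance(real, synthetic):
    """Half the summed absolute difference between category frequencies.

    0.0 means identical category distributions, 1.0 means no overlap.
    """
    real_counts = Counter(real)
    synthetic_counts = Counter(synthetic)
    categories = set(real_counts) | set(synthetic_counts)
    return 0.5 * sum(
        abs(real_counts[c] / len(real) - synthetic_counts[c] / len(synthetic))
        for c in categories
    )


# Made-up sample columns purely for demonstration.
real_paid = [100.0, 120.0, 80.0, 100.0]
synthetic_paid = [110.0, 90.0, 100.0, 100.0]
real_dx = ["I10", "I10", "E11.9", "J45.909"]
synthetic_dx = ["I10", "E11.9", "E11.9", "E11.9"]

numeric_gap = numeric_summary_gap(real_paid, synthetic_paid)
categorical_gap = total_variation_distance(real_dx, synthetic_dx)
print(numeric_gap)
print(categorical_gap)
```

In this toy data the two numerical columns share the same mean but differ in spread, which shows why comparing a single summary statistic is not enough on its own.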

evaluation_script_transformed_data.ipynb -> Performs sanity checks, validation, and evaluation on the transformed dataframe output, which is designed to handle time dependence.
