This repo contains the codebase for our Capgemini-sponsored practicum project on generating synthetic medical and pharmacy claims data.
Report: https://docs.google.com/document/d/1Q1gPblzH8KUkVfYRrjhE1NDhKXckwzmv_UeXkuF2ECA/edit?usp=sharing
Slides: https://drive.google.com/file/d/1DVrvQJAh5VSqXUa65Zi_u_NtCrd1v6Wd/view?usp=sharing
See the code in the Data_Processing folder.
Use load_data.ipynb to load data from zipped parquet files into CSV files. This notebook works for both medical claims data and pharmacy claims data.
Use merge_med_pharm_data.ipynb to merge the medical claims data with the pharmacy claims data. Because of memory restrictions, we only take one pharmacy claim per medical claim per Member Life ID; this can easily be changed in the SQL provided in the notebook.
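The one-pharmacy-claim-per-medical-claim rule can be sketched in pandas as a join followed by a group-wise "keep first" step. This is an illustration with hypothetical column names (`member_life_id`, `med_claim_id`, `rx_claim_id`), not the notebook's actual SQL:

```python
import pandas as pd

# Toy stand-ins for the real claims tables.
med = pd.DataFrame({
    "member_life_id": [1, 1, 2],
    "med_claim_id": ["M1", "M2", "M3"],
})
pharm = pd.DataFrame({
    "member_life_id": [1, 1, 2],
    "rx_claim_id": ["P1", "P2", "P3"],
})

# Join pharmacy claims to medical claims on member, then keep only the
# first pharmacy claim per (member, medical claim) pair to bound memory use.
merged = med.merge(pharm, on="member_life_id", how="left")
merged = (merged.sort_values("rx_claim_id")
                .groupby(["member_life_id", "med_claim_id"], as_index=False)
                .first())
```

Relaxing the restriction amounts to dropping the groupby/first step (in SQL, removing the corresponding `ROW_NUMBER`/`LIMIT`-style filter), at the cost of a much larger merged table.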
See the code in the Modeling folder.
TabGen_Medical_Data.ipynb -> Code for generating synthetic medical claims data (transforming multiple rows into a unified context to handle time dependence) using a fine-tuned GPT-2 model.
TabGen_Merged_Pharmacy_Medical_data.ipynb -> Code for generating synthetic medical and pharmacy claims data (transforming multiple rows into a unified context to handle time dependence) using a fine-tuned GPT-2 model.
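The "multiple rows into a unified context" idea can be sketched as follows: all claims belonging to one member are sorted by date and serialized into a single text sequence, so a language model sees a member's whole history in one training example. The schema and the "column is value" serialization below are illustrative assumptions, not the notebooks' exact format:

```python
import pandas as pd

# Hypothetical claim rows for a single member (column names are assumptions).
claims = pd.DataFrame({
    "member_life_id": [1, 1],
    "service_date": ["2021-01-05", "2021-02-10"],
    "diagnosis_code": ["E11.9", "I10"],
    "allowed_amount": [120.5, 80.0],
})

def serialize_member(group: pd.DataFrame) -> str:
    """Turn one member's claims, in date order, into a single text context."""
    rows = []
    for _, r in group.sort_values("service_date").iterrows():
        rows.append(", ".join(f"{c} is {r[c]}" for c in group.columns
                              if c != "member_life_id"))
    # Separator between claims; the actual delimiter/token choice is up to
    # the fine-tuning script.
    return " | ".join(rows)

contexts = claims.groupby("member_life_id").apply(serialize_member)
```

Each serialized string then becomes one fine-tuning example for GPT-2, which is what lets the generator capture dependence between a member's earlier and later claims.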
See the code in the Evaluation folder.
Evaluation_Script.ipynb can be used to evaluate the data generated by any model and examine how well it represents the original dataset. It performs evaluation for both numerical and categorical variables, and also runs PCA and t-SNE and plots the distributions.
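Two common per-column checks of this kind are a two-sample Kolmogorov-Smirnov test for numerical variables and total variation distance between category frequencies for categorical ones. The sketch below illustrates both on toy data; the actual metrics and plots used in the notebook may differ:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Toy stand-ins for a real and a synthetic numerical column.
rng = np.random.default_rng(0)
real_num = rng.normal(0, 1, 2000)
synth_num = rng.normal(0, 1, 2000)

# Numerical check: KS statistic is near 0 when the marginals match.
ks_stat, _ = ks_2samp(real_num, synth_num)

def total_variation_distance(real_cats, synth_cats) -> float:
    """Categorical check: 0 = identical frequencies, 1 = disjoint supports."""
    r = pd.Series(real_cats).value_counts(normalize=True)
    s = pd.Series(synth_cats).value_counts(normalize=True)
    cats = r.index.union(s.index)
    return 0.5 * sum(abs(r.get(c, 0.0) - s.get(c, 0.0)) for c in cats)
```

PCA and t-SNE complement these univariate checks by projecting real and synthetic records into a shared low-dimensional space, where distribution overlap can be inspected visually.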
evaluation_script_transformed_data.ipynb -> Performs sanity checks, validation, and evaluation on the transformed dataframe designed to handle time dependence.