Skip to content

Latest commit

 

History

History
139 lines (93 loc) · 6.14 KB

DEVELOPER.md

File metadata and controls

139 lines (93 loc) · 6.14 KB

Developer Notes for Processed ICD Data

This document details the preparation of the processed ICD datasets for use.

Overview

flowchart LR;

icd["ICD Codes"]
icdmap["ICD Mappings"]
icdbrowser[("ICD Browser<br/>(WHO)")]

wbdraw>"WBD Excel Spreadsheet<br/>(Wilson)"]
wbd>"Cleaned WBD Spreadsheet<br/>(Richard)"]

cmearaw>"CMEA CSV<br>(Rajeev)"]
cmea>"Cleaned CMEA CSV<br>(Richard)"]

wvaraw1>"WHO VA PDF<br>(WHO)"]
wvaraw2>"Processed WHO VA CSVs<br>(Bryan)"]
wva>"Cleaned WHO VA CSVs<br>(Richard)"]

cghrraw1>"CGHR Excel Spreadsheet<br>(Patrycja)"]
cghrraw2>"Preprocessed CGHR CSVs<br>(Bryan)"]
cghr>"Cleaned CGHR CSVs<br>(Richard)"]

research["Processed ICD Data<br>(Richard)"]

icdbrowser --> icd & icdmap --> research
wbdraw --> wbd --> research
cmearaw --> cmea --> research
wvaraw1 --> wvaraw2 --> wva --> research
cghrraw1 --> cghrraw2 --> cghr --> research
Loading

The ICD data was downloaded from the World Health Organization (WHO) through their ICD-11 Browser (under Info select Spreadsheet File for the ICD-11 codes and ICD-10 / ICD-11 mapping Tables for the ICD-10 and ICD-11 mappings).

The WBD data was retrieved from Wilson Suraweera Wilson.Suraweera@unityhealth.to as an Excel Spreadsheet (copy available here), and edited by Richard Wen rrwen.dev@gmail.com manually to be parsed in an R Script here into a cleaned WBD Excel Spreadsheet (copy available here).

The CMEA data was retrieved from Rajeev Kamadod rajeevk@kentropy.com as a CSV file (copy available here), and processed in an R Script here into a cleaned CMEA files cmea10_raw.csv and icd10tocmea10_raw.csv.

The WHO VA data was collected and processed by Bryan Gascon bryan.gascon@mail.utoronto.ca as a CSV file from the WHO VA manuals (copies for WHO VA 2016 and WHO VA 2022 available), and processed in an R Script for 2016 and 2022 codes in cleaned WHO VA files wva2016_raw.csv and wva2022_raw.csv.

The CGHR codes were extracted from the Automated Versus Physician (AVP) study by Patrycja Kolpak Patrycja.Kolpak@unityhealth.to into an Excel Spreadsheet (copy here) and processed by Bryan Gascon bryan.gascon@mail.utoronto.ca into several files (WHO VA to CGHR - adult child neo, WHO VA to WBD - adult child neo, CGHR to WBD - adult child neo). These were then cleaned with an R Script here into files cghr2019_raw.csv, cghr2019towbd10_raw.csv, wva2016tocghr2019_raw.csv, and wva2016towbd10_raw.csv.

These data are then processed and managed by Richard Wen rrwen.dev@gmail.com using scripts in this repository.

For more details on the data, see the following files from the WHO:

Install

  1. Install Python 3.11
  2. Install PostgreSQL 12+
  3. Run bin/install to create a venv environment
  4. Activate the venv environment

In Windows:

bin\install
bin\activate

In Linux/Mac OS:

source bin/install.sh
source bin/activate.sh

Setup

In order to upload the data to a PostgreSQL database (optional) with the Python scripts in the src folder, you will need to login to your database.

To do this, run the bin/login.bat or bin/login.sh files depending on your operating system.

In Windows:

bin\login

In Linux/Mac OS:

source bin/login.sh

You will be prompted to enter database connection details, and if successful, it will display that the login has been saved.

Note: the password prompt is hidden - simply enter your password and press enter to proceed

Preparation

flowchart LR;
src/data --> A;
A(bin/upload) --> B(upload.ipynb);
B --> data/ --> B1>icd_data.csv] & B2>icd_ddict.csv] & B3>.csv];
B --> E[(PostgreSQL)] --> src/database/ --> database/;
database/ --> D1>icd_comments.sql] & D2>icd_views.sql];
Loading

Once the Setup step is successful, the datasets can be prepared by running the following in a command line terminal (depending on your operating system):

In Windows:

bin\run

In Linux/Mac OS:

source bin/run.sh

This uses the Python Jupyter notebook upload.ipynb and update.ipynb to:

  1. Process the raw data files in src/data into cleaned datasets
  2. Save the cleaned datasets in the data folder as .csv files
  3. Create data dictionaries using the config.yml file for all columns in the cleaned datasets and save them in the data folder
  4. Optionally, upload the cleaned datasets into the upload database defined in the Setup step
  5. Optionally, create the following files in src/database from the data uploaded to the database:
    • icd_comments.sql: contains PostgreSQL code for adding comments to the datasets
    • icd_views.sql: contains PostgreSQL code for creating up to date views of the datasets without versioning
    • icd_tables.dump: contains a PostgreSQL dump of the datasets in the database for uploading to another database

Note: This process takes about 5 minutes.

Contact

Richard Wen rrwen.dev@gmail.com