# EHR Data Profiler

Documentation for the library's functions, along with an in-depth tutorial on using `text_search`, is available on the project's GitHub page:
<a href="https://github.com/ctsidev/ehr-data-profiler#function-library">https://github.com/ctsidev/ehr-data-profiler#function-library</a>

### Run the next cell to make all the imports, which include Pandas and the EHR data analysis functions:


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from lib.ehr_dp_lib import *
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 500)

### Run the following block to describe the tables in your Data folder:

In [2]:
describe_tables()

Unnamed: 0,TABLE,ROW_COUNT,COLUMN_COUNT,DESCRIPTION
0,Encounters.csv,11207,14,"This table holds data for encounters for the patient cohort. There can be multiple rows per patient, but only one row per encounter."
1,Encounter_Diagnoses.csv,9126,10,This table holds encounter diagnoses data for the patients in the cohort. There can be multiple rows per patient as well as multiple rows per encounter.
2,Flowsheet_Vitals.csv,101070,5,This table holds flowsheet information for vital signs for the patients in the cohort. There can be multiple rows per patient.
3,Labs.csv,28702,15,This table holds all laboratory result information for the patients in the cohort. There can be multiple rows per patient as well as multiple rows per encounter.
4,Medications.csv,12547,15,This table holds medication information for the patients in the cohort. There can be multiple rows per patient as well as multiple rows per encounter.
5,Patient_Demographics.csv,500,19,This table holds demographic information for the patients in the cohort. There is only one row per patient
6,Procedures.csv,15258,6,This table holds procedure information for the patients in the cohort. There can be multiple rows per patient as well as multiple rows per encounter.
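
The descriptions above state that `Encounters.csv` has only one row per encounter and `Patient_Demographics.csv` only one row per patient. A minimal sketch of how that claim could be checked with plain pandas follows. It assumes `IP_ENC_ID` and `IP_PATIENT_ID` are the identifying columns (both names appear later in this notebook); `duplicate_key_count` is a hypothetical helper written for illustration, not a function from `ehr_dp_lib`.

```python
import pandas as pd


def duplicate_key_count(df: pd.DataFrame, key: str) -> int:
    """Count rows whose key value already appeared in an earlier row."""
    return int(df[key].duplicated().sum())


# Toy frame standing in for Encounters.csv (values are made up).
toy = pd.DataFrame({"IP_ENC_ID": ["E1", "E2", "E2", "E3"]})
dupes = duplicate_key_count(toy, "IP_ENC_ID")
print(dupes)
```

On the real data, `duplicate_key_count(pd.read_csv('Data/Encounters.csv'), 'IP_ENC_ID')` would be expected to return 0 if the one-row-per-encounter description holds.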


### PATIENT_DEMOGRAPHICS

In [3]:
patient_demographics_df = pd.read_csv('Data/Patient_Demographics.csv')
patient_demographics_df

Unnamed: 0,IP_PATIENT_ID,AGE,SEX,RACE,ETHNICITY,VITAL_STATUS,LANGUAGE,MARITAL_STATUS,SEXUAL_ORIENTATION,RELIGION,ADI_NATRANK,ADI_STATERNK,EDUCATION,INCOME,SVI_SOCIO_ECON,SVI_HCOMP,SVI_MINO_LANG,SVI_HTYPE_TRANS,SVI_TOTAL
0,IPPAT_101101099917108,60.0,Male,White or Caucasian,Unknown,Not Known Deceased,English,Single,,Unknown,9.0,4.0,SHRINE|EDU:30-40,SHRINE|INC:100k-150k,0.3697,0.5022,0.7505,0.8394,0.6433
1,IPPAT_101101099942813,101.0,Female,Unknown,Unknown,Not Known Deceased,Unknown,Unknown,,Unknown,,,,,,,,,
2,IPPAT_101101099967579,53.0,Male,Unknown,Unknown,Not Known Deceased,English,Single,,Christian,,,SHRINE|EDU:50-60,SHRINE|INC:100k-150k,,,,,
3,IPPAT_101101099971777,107.0,Female,White or Caucasian,Unknown,Not Known Deceased,English,Widowed,,Methodist,8.0,3.0,SHRINE|EDU:40-50,SHRINE|INC:100k-150k,0.084,0.3945,0.7175,0.3115,0.2707
4,IPPAT_101101099983912,60.0,Female,White or Caucasian,Unknown,Not Known Deceased,English,Single,,Jewish,,,,,,,,,
5,IPPAT_101101099986815,64.0,Male,White or Caucasian,Unknown,Not Known Deceased,English,Married,,Protestant,9.0,4.0,SHRINE|EDU:50-60,SHRINE|INC:75k-100k,0.5095,0.0165,0.8446,0.9413,0.5644
6,IPPAT_101101099989818,60.0,Male,Other,Not Hispanic or Latino,Not Known Deceased,English,Single,Straight (not lesbian or gay),,5.0,1.0,SHRINE|EDU:60-70,SHRINE|INC:100k-150k,0.0758,0.0119,0.1684,0.4847,0.0412
7,IPPAT_101101100000423,54.0,Male,White or Caucasian,Unknown,Not Known Deceased,Unknown,Single,,Unknown,,,SHRINE|EDU:50-60,SHRINE|INC:-,-999.0,0.0011,0.5527,-999.0,-999.0
8,IPPAT_101101100002433,77.0,Female,White or Caucasian,Unknown,Not Known Deceased,Unknown,Married,,Jehovah's Witness,,,SHRINE|EDU:40-50,SHRINE|INC:150k-200k,,,,,
9,IPPAT_101101100017191,36.0,Female,Black or African American,Unknown,Not Known Deceased,English,Single,,Baptist,2.0,1.0,SHRINE|EDU:60-70,SHRINE|INC:100k-150k,0.4355,0.0687,0.7109,0.4703,0.3846
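
Row 7 in the preview above shows `-999.0` in several `SVI_` columns, while the other rows hold scores between 0 and 1. This looks like a missing-data sentinel (the CDC/ATSDR SVI uses -999 that way, though whether this dataset follows the same convention is an assumption). The sketch below shows one way to turn such values into `NaN` before computing statistics. `mask_sentinel` is a hypothetical helper, not part of `ehr_dp_lib`.

```python
import numpy as np
import pandas as pd


def mask_sentinel(df: pd.DataFrame, prefix: str = "SVI_",
                  sentinel: float = -999.0) -> pd.DataFrame:
    """Return a copy with the sentinel replaced by NaN in columns matching prefix."""
    out = df.copy()
    cols = [c for c in out.columns if c.startswith(prefix)]
    if cols:
        out[cols] = out[cols].replace(sentinel, np.nan)
    return out


# Toy frame with the same shape of problem as row 7 (values copied from the preview).
toy = pd.DataFrame({"SVI_TOTAL": [0.6433, -999.0], "AGE": [60.0, 54.0]})
cleaned = mask_sentinel(toy)
print(cleaned)
```

Working on a copy keeps the raw frame intact, so the profiling cells that follow still report the data as delivered.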


In [4]:
missingness(patient_demographics_df)

Unnamed: 0,COLUMN,NULLS,PERCENT
0,IP_PATIENT_ID,0,0.0
1,AGE,5,1.0
2,SEX,0,0.0
3,RACE,47,9.4
4,ETHNICITY,0,0.0
5,VITAL_STATUS,0,0.0
6,LANGUAGE,0,0.0
7,MARITAL_STATUS,0,0.0
8,SEXUAL_ORIENTATION,475,95.0
9,RELIGION,0,0.0
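
The table above reports a null count and a percentage per column (for example, 5 nulls in `AGE` out of 500 rows gives 1.0). A comparable report can be sketched in plain pandas as follows. This is an illustration of the calculation only, not the actual implementation of `missingness`; `missingness_sketch` is a hypothetical name.

```python
import pandas as pd


def missingness_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column null count and percent of rows that are null."""
    nulls = df.isna().sum()
    total = len(df)
    # Avoid dividing by zero on an empty frame.
    percent = (nulls / total * 100).round(1) if total else nulls.astype(float)
    return pd.DataFrame({
        "COLUMN": nulls.index,
        "NULLS": nulls.to_numpy(),
        "PERCENT": percent.to_numpy(),
    })


# Toy frame (values are made up for illustration).
toy = pd.DataFrame({
    "AGE": [60.0, None, 53.0, 107.0],
    "SEX": ["Male", "Female", "Male", "Female"],
})
report = missingness_sketch(toy)
print(report)
```

Note that literal strings such as `Unknown` are not nulls, which is consistent with `LANGUAGE` and `ETHNICITY` showing 0 nulls above despite many `Unknown` entries.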


In [5]:
table_1(patient_demographics_df)


AGE
-----------------
65+ => 36.6
55-64 => 15.4
45-54 => 13.0
35-44 => 13.0
25-34 => 7.8
<18 => 7.4
18-24 => 5.8
Unknown => 1.0

RACE
-----------------
White or Caucasian => 39.3
Unknown => 35.5
Other => 10.8
Black or African American => 4.6
Asian => 4.4
Patient Refused => 3.1
Multiple Races => 1.1
American Indian or Alaska Native => 0.7
Middle Eastern or North African => 0.4

ETHNICITY
-----------------
Unknown => 50.6
Not Hispanic or Latino => 38.2
Hispanic or Latino => 7.6
Choose Not to Answer => 2.8
Hispanic/Spanish origin Other => 0.4
Cuban => 0.2
Mexican, Mexican American, Chicano/a => 0.2

LANGUAGE
-----------------
English => 56.8
Unknown => 36.8
Spanish => 4.8
Armenian => 0.4
Arabic => 0.4
Chinese (Other) => 0.2
Hindi => 0.2
Farsi, Persian => 0.2
Vietnamese => 0.2

EDUCATION
-----------------
SHRINE|EDU:50-60 => 21.9
SHRINE|EDU:60-70 => 16.7
SHRINE|EDU:40-50 => 14.3
SHRINE|EDU:30-40 => 12.8
SHRINE|EDU:20-30 => 12.2
SHRINE|EDU:10-20 => 9.1
SHRINE|EDU:<10 => 7.6
SHRINE|EDU:70-8
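
The `AGE` block in the output above groups ages into bands (`<18`, `18-24`, ..., `65+`, `Unknown`). A minimal sketch of such banding with `pd.cut` follows. The bin edges are inferred from the labels and are an assumption, as is treating null ages as `Unknown`; `age_band_percent` is a hypothetical helper, not the code behind `table_1`.

```python
import numpy as np
import pandas as pd

# Left-closed bins: [0, 18), [18, 25), ..., [65, inf). Edges are assumed.
AGE_EDGES = [0, 18, 25, 35, 45, 55, 65, np.inf]
AGE_LABELS = ["<18", "18-24", "25-34", "35-44", "45-54", "55-64", "65+"]


def age_band_percent(ages: pd.Series) -> pd.Series:
    """Percent of rows per age band; null or out-of-range ages become 'Unknown'."""
    bands = pd.cut(ages, bins=AGE_EDGES, labels=AGE_LABELS, right=False)
    bands = bands.astype(object).where(bands.notna(), "Unknown")
    return (bands.value_counts(normalize=True) * 100).round(1)


# Toy ages (made up for illustration).
result = age_band_percent(pd.Series([10.0, 20.0, 70.0, np.nan]))
print(result)
```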

In [6]:
catbar(patient_demographics_df, 'LANGUAGE', graph=False) ## Set graph=True for Bar graph

Unnamed: 0,LANGUAGE,COUNT,PERCENT
0,English,284,56.8
1,Unknown,184,36.8
2,Spanish,24,4.8
3,Armenian,2,0.4
4,Arabic,2,0.4
5,Chinese (Other),1,0.2
6,Hindi,1,0.2
7,"Farsi, Persian",1,0.2
8,Vietnamese,1,0.2
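
The output above lists each category with its count and its share of all rows (284 of 500 is 56.8). The same kind of table can be sketched with `value_counts`, as below. This only illustrates the calculation and makes no claim about how `catbar` is implemented; `category_table` is a hypothetical name.

```python
import pandas as pd


def category_table(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Count and percent of rows per category, most frequent first."""
    counts = df[column].value_counts(dropna=False)
    out = counts.rename_axis(column).reset_index(name="COUNT")
    out["PERCENT"] = (out["COUNT"] / len(df) * 100).round(1)
    return out


# Toy frame (values are made up for illustration).
toy = pd.DataFrame({"LANGUAGE": ["English", "English", "Spanish", "Unknown"]})
table = category_table(toy, "LANGUAGE")
print(table)
```

Passing `dropna=False` keeps true nulls as their own row, which matters for columns such as `SEXUAL_ORIENTATION`, where 95% of values are null.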


In [None]:
catbar(patient_demographics_df, 'SEX', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(patient_demographics_df, 'MARITAL_STATUS', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(patient_demographics_df, 'ETHNICITY', graph=False) ## Set graph=True for Bar graph

In [None]:
numstats(patient_demographics_df, 'AGE')

In [None]:
catbar(patient_demographics_df, 'RELIGION', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(patient_demographics_df, 'RACE', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(patient_demographics_df, 'SEXUAL_ORIENTATION', graph=False) ## Set graph=True for Bar graph

### ENCOUNTERS

In [None]:
encounters_df = pd.read_csv('Data/Encounters.csv')
encounters_df

In [None]:
missingness(encounters_df)

In [None]:
occurrence_stats(encounters_df, 'IP_ENC_ID')

In [None]:
dateline(encounters_df, 'ENCOUNTER_DATE')

In [None]:
numstats(encounters_df, 'ENCOUNTER_AGE')

In [None]:
catbar(encounters_df, 'EPIC_ENCOUNTER_TYPE', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(encounters_df, 'IP_VISIT_TYPE', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(encounters_df, 'EPIC_DEPARTMENT_NAME', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(encounters_df, 'HOSP_DISCHARGE_DISPOSITION', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(encounters_df, 'ED_DISPOSITION', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(encounters_df, 'DEPARTMENT_SPECIALTY', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(encounters_df, 'LOCATION', graph=False) ## Set graph=True for Bar graph

### ENCOUNTER_DIAGNOSES

In [None]:
encounter_diagnoses_df = pd.read_csv('Data/Encounter_Diagnoses.csv')
encounter_diagnoses_df

In [None]:
missingness(encounter_diagnoses_df)

In [None]:
occurrence_stats(encounter_diagnoses_df, 'IP_ENC_ID')

In [None]:
dateline(encounter_diagnoses_df, 'DIAGNOSIS_DATE')

In [None]:
catbar(encounter_diagnoses_df, 'PRESENT_ON_ADMISSION', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(encounter_diagnoses_df, 'ADMISSION_DIAGNOSIS_FLAG', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(encounter_diagnoses_df, 'HOSPITAL_FINAL_DIAGNOSIS', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(encounter_diagnoses_df, 'PRIMARY_DIAGNOSIS_FLAG', graph=False) ## Set graph=True for Bar graph

### PROCEDURES

In [None]:
procedures_df = pd.read_csv('Data/Procedures.csv')
procedures_df

In [None]:
missingness(procedures_df)

In [None]:
occurrence_stats(procedures_df, 'IP_ENC_ID')

In [None]:
dateline(procedures_df, 'PROCEDURE_DATE')

In [None]:
catbar(procedures_df, 'PROCEDURE_DESCRIPTION', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(procedures_df, 'PROCEDURE_CODE', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(procedures_df, 'PROCEDURE_TYPE', graph=False) ## Set graph=True for Bar graph

### FLOWSHEET_VITALS

In [None]:
flowsheet_vitals_df = pd.read_csv('Data/Flowsheet_Vitals.csv')
flowsheet_vitals_df

In [None]:
missingness(flowsheet_vitals_df)

In [None]:
occurrence_stats(flowsheet_vitals_df, 'IP_ENC_ID')

In [None]:
dateline(flowsheet_vitals_df, 'VITAL_SIGN_TAKEN_TIME')

In [None]:
catbar(flowsheet_vitals_df, 'VITAL_SIGN_TYPE', graph=False) ## Set graph=True for Bar graph

In [None]:
flow_stats(flowsheet_vitals_df)

### LABS

In [None]:
labs_df = pd.read_csv('Data/Labs.csv')
labs_df

In [None]:
missingness(labs_df)

In [None]:
occurrence_stats(labs_df, 'IP_ORDER_PROC_ID')

In [None]:
dateline(labs_df, 'ORDER_TIME')

In [None]:
catbar(labs_df, 'PROCEDURE_CODE', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(labs_df, 'COMPONENT_NAME', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(labs_df, 'PROCEDURE_DESCRIPTION', graph=False) ## Set graph=True for Bar graph

In [None]:
lab_stats(labs_df, top=10)

### MEDICATIONS

In [None]:
medications_df = pd.read_csv('Data/Medications.csv')
medications_df

In [None]:
missingness(medications_df)

In [None]:
occurrence_stats(medications_df, 'IP_ORDER_MED_ID')

In [None]:
dateline(medications_df, 'ORDER_DATE')

In [None]:
catbar(medications_df, 'EPIC_MEDICATION_NAME', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(medications_df, 'MEDISPAN_GENERIC_NAME', graph=False) ## Set graph=True for Bar graph

In [None]:
catbar(medications_df, 'MEDISPAN_CLASS_NAME', graph=False) ## Set graph=True for Bar graph