## Healthcare Project
## MIMIC-III Clinical Database 1.4
## PATIENTS TABLE


**Table Description:**

- row_id: Unique row identifier.
- subject_id: Primary key. Identifies the patient.
- gender: Gender.
- dob: Date of birth.
- dod: Date of death. Null if the patient was alive at least 90 days post hospital discharge.
- dod_hosp: Date of death recorded in the hospital records.
- dod_ssn: Date of death recorded in the social security records.
- expire_flag: Flag indicating that the patient has died.

**Important notes:**

- DOB: Patients who are older than 89 years old at any time in the database have had their date of birth shifted to obscure their age and comply with HIPAA. The shift process was as follows: the patient’s age at their first admission was determined, and the date of birth was then set to exactly 300 years before that first admission.
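The DOB shift matters as soon as an age is computed. The sketch below is a minimal illustration on toy data, not part of the original analysis. The `FIRST_ADMIT` column and the `age_in_years` helper are hypothetical names; the PATIENTS table does not hold admission dates, so in practice they would have to come from another table. It uses calendar arithmetic instead of subtracting timestamps, because a span of about 300 years does not fit in a nanosecond-resolution timedelta.

```python
import pandas as pd

# Toy data (assumption): FIRST_ADMIT is a hypothetical column that would
# have to be joined in from an admissions table.
df = pd.DataFrame({
    "DOB": pd.to_datetime(["2075-03-13", "1850-06-01"]),
    "FIRST_ADMIT": pd.to_datetime(["2140-03-13", "2150-06-01"]),
})

def age_in_years(dob, ref):
    """Whole years between two datetime Series, using calendar arithmetic."""
    had_birthday = (ref.dt.month > dob.dt.month) | (
        (ref.dt.month == dob.dt.month) & (ref.dt.day >= dob.dt.day)
    )
    return ref.dt.year - dob.dt.year - (~had_birthday).astype(int)

df["AGE"] = age_in_years(df["DOB"], df["FIRST_ADMIT"])

# Ages far above 89 can only come from the 300-year shift.
df["AGE_SHIFTED"] = df["AGE"] > 89
```

Rows flagged in `AGE_SHIFTED` belong to patients whose true age is hidden; their computed age should not be read as a real value.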


**Websites:** 

https://physionet.org/content/mimiciii/1.4/

https://mimic.mit.edu/docs/iii/tables/

https://mit-lcp.github.io/mimic-schema-spy/tables/prescriptions.html

https://www.hipaaguide.net/what-is-considered-as-phi-under-hipaa/

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")


In [2]:
patients = pd.read_csv("PATIENTS.csv")
patients

Unnamed: 0,ROW_ID,SUBJECT_ID,GENDER,DOB,DOD,DOD_HOSP,DOD_SSN,EXPIRE_FLAG
0,234,249,F,2075-03-13 00:00:00,,,,0
1,235,250,F,2164-12-27 00:00:00,2188-11-22 00:00:00,2188-11-22 00:00:00,,1
2,236,251,M,2090-03-15 00:00:00,,,,0
3,237,252,M,2078-03-06 00:00:00,,,,0
4,238,253,F,2089-11-26 00:00:00,,,,0
...,...,...,...,...,...,...,...,...
46515,31840,44089,M,2026-05-25 00:00:00,,,,0
46516,31841,44115,F,2124-07-27 00:00:00,,,,0
46517,31842,44123,F,2049-11-26 00:00:00,2135-01-12 00:00:00,2135-01-12 00:00:00,,1
46518,31843,44126,F,2076-07-25 00:00:00,,,,0


In [3]:
#checking the number of rows/columns
print(patients.shape)

(46520, 8)


In [4]:
#checking first rows
patients.head()

Unnamed: 0,ROW_ID,SUBJECT_ID,GENDER,DOB,DOD,DOD_HOSP,DOD_SSN,EXPIRE_FLAG
0,234,249,F,2075-03-13 00:00:00,,,,0
1,235,250,F,2164-12-27 00:00:00,2188-11-22 00:00:00,2188-11-22 00:00:00,,1
2,236,251,M,2090-03-15 00:00:00,,,,0
3,237,252,M,2078-03-06 00:00:00,,,,0
4,238,253,F,2089-11-26 00:00:00,,,,0


In [5]:
#checking general info of the dataset
patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46520 entries, 0 to 46519
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ROW_ID       46520 non-null  int64 
 1   SUBJECT_ID   46520 non-null  int64 
 2   GENDER       46520 non-null  object
 3   DOB          46520 non-null  object
 4   DOD          15759 non-null  object
 5   DOD_HOSP     9974 non-null   object
 6   DOD_SSN      13378 non-null  object
 7   EXPIRE_FLAG  46520 non-null  int64 
dtypes: int64(3), object(5)
memory usage: 2.8+ MB


In [6]:
#checking dataset statistics
patients.describe()

Unnamed: 0,ROW_ID,SUBJECT_ID,EXPIRE_FLAG
count,46520.0,46520.0,46520.0
mean,23260.5,34425.772872,0.338758
std,13429.311598,28330.400343,0.473292
min,1.0,2.0,0.0
25%,11630.75,12286.75,0.0
50%,23260.5,24650.5,0.0
75%,34890.25,55477.5,1.0
max,46520.0,99999.0,1.0


In [7]:
#checking for null values
patients.isnull().sum()

ROW_ID             0
SUBJECT_ID         0
GENDER             0
DOB                0
DOD            30761
DOD_HOSP       36546
DOD_SSN        33142
EXPIRE_FLAG        0
dtype: int64
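The counts above allow a quick cross-check: 46520 - 30761 = 15759 rows have a DOD, and the EXPIRE_FLAG mean of 0.338758 from `describe()` times 46520 rows also gives about 15759. A minimal sketch of a row-level version of that check follows, on toy data. The names `flag_matches_dod`, `dod_has_source` and `mismatches` are illustrative, and the assumption that every DOD is backed by DOD_HOSP or DOD_SSN is a hypothesis to test, not a documented guarantee.

```python
import pandas as pd

# Toy frame with the same columns as the PATIENTS table.
toy = pd.DataFrame({
    "DOD": pd.to_datetime([None, "2188-11-22", None]),
    "DOD_HOSP": pd.to_datetime([None, "2188-11-22", None]),
    "DOD_SSN": pd.to_datetime([None, None, None]),
    "EXPIRE_FLAG": [0, 1, 0],
})

# The flag should be 1 exactly when a date of death is present.
flag_matches_dod = (toy["EXPIRE_FLAG"] == 1) == toy["DOD"].notna()

# Assumption under test: a DOD comes from at least one of the two sources.
dod_has_source = (
    toy["DOD"].isna() | toy["DOD_HOSP"].notna() | toy["DOD_SSN"].notna()
)

# Rows that break either rule and deserve a closer look.
mismatches = toy[~(flag_matches_dod & dod_has_source)]
```

On the real data this check would need to run before EXPIRE_FLAG is relabelled and before the missing dates are filled.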

In [8]:
#checking for unique values
patients.nunique()

ROW_ID         46520
SUBJECT_ID     46520
GENDER             2
DOB            32540
DOD            12911
DOD_HOSP        8747
DOD_SSN        11301
EXPIRE_FLAG        2
dtype: int64

In [9]:
#converting timestamp columns into Date/Time datatype

patients['DOB'] = pd.to_datetime(patients['DOB'])
patients['DOD'] = pd.to_datetime(patients['DOD'])
patients['DOD_HOSP'] = pd.to_datetime(patients['DOD_HOSP'])
patients['DOD_SSN'] = pd.to_datetime(patients['DOD_SSN'])

In [10]:
#casting identifier and flag columns from int64 to object, because they are labels rather than quantities

#ROW_ID is a unique row identifier
#SUBJECT_ID refers to a unique patient
#EXPIRE_FLAG is a 0/1 flag, relabelled in the next step

patients['ROW_ID'] = patients['ROW_ID'].astype('object')
patients['SUBJECT_ID'] = patients['SUBJECT_ID'].astype('object')
patients['EXPIRE_FLAG'] = patients['EXPIRE_FLAG'].astype('object')


In [11]:
#updating information for a better understanding, "No"=patient has not died, "Yes"=patient died
patients["EXPIRE_FLAG"] = patients["EXPIRE_FLAG"].astype(str).replace({'0': "No", '1': "Yes"})
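As an alternative to the cast-then-replace chain, the relabelling can be done in one step with `Series.map`. This is a sketch of one possible approach on a toy Series, not the method used in this notebook. Storing the result as a `category` is an optional extra that keeps the two labels explicit.

```python
import pandas as pd

# Toy stand-in for the integer EXPIRE_FLAG column.
flags = pd.Series([0, 1, 0], name="EXPIRE_FLAG")

# Map the integer codes straight to labels; unmapped values would become NaN.
labels = flags.map({0: "No", 1: "Yes"}).astype("category")
```

Because `map` turns any unexpected code into NaN, a later `isnull().sum()` would expose values other than 0 and 1.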

In [12]:
#filling missing dates with the placeholder '1900-01-01' (it means "no date recorded", not a real date)
patients['DOD'] = patients['DOD'].fillna('1900-01-01')
patients['DOD_HOSP'] = patients['DOD_HOSP'].fillna('1900-01-01')
patients['DOD_SSN'] = patients['DOD_SSN'].fillna('1900-01-01')
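A placeholder date removes the nulls but behaves like a real date in any later calculation. The sketch below, on toy data, shows one way to guard against that by turning the placeholder back into `NaT` before doing date arithmetic. `SENTINEL` and `YEARS_BIRTH_TO_DEATH` are illustrative names, not part of the original notebook.

```python
import pandas as pd

SENTINEL = pd.Timestamp("1900-01-01")

toy = pd.DataFrame({
    "DOB": pd.to_datetime(["2075-03-13", "2164-12-27"]),
    "DOD": pd.to_datetime([None, "2188-11-22"]),
})

# Same fill as in the cell above.
toy["DOD"] = toy["DOD"].fillna(SENTINEL)

# Restore NaT for placeholder rows so they drop out of the calculation.
dod = toy["DOD"].mask(toy["DOD"] == SENTINEL)

# Difference in calendar years; NaN where no death date is recorded.
toy["YEARS_BIRTH_TO_DEATH"] = dod.dt.year - toy["DOB"].dt.year
```

Without the `mask` step the first row would yield 1900 - 2075 = -175, a value that looks numeric but is meaningless.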

In [13]:
#checking datatypes updates
patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46520 entries, 0 to 46519
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   ROW_ID       46520 non-null  object        
 1   SUBJECT_ID   46520 non-null  object        
 2   GENDER       46520 non-null  object        
 3   DOB          46520 non-null  datetime64[ns]
 4   DOD          46520 non-null  datetime64[ns]
 5   DOD_HOSP     46520 non-null  datetime64[ns]
 6   DOD_SSN      46520 non-null  datetime64[ns]
 7   EXPIRE_FLAG  46520 non-null  object        
dtypes: datetime64[ns](4), object(4)
memory usage: 2.8+ MB


In [14]:
#checking for null values after changes
patients.isnull().sum()

ROW_ID         0
SUBJECT_ID     0
GENDER         0
DOB            0
DOD            0
DOD_HOSP       0
DOD_SSN        0
EXPIRE_FLAG    0
dtype: int64

In [15]:
#checking all changes
patients

Unnamed: 0,ROW_ID,SUBJECT_ID,GENDER,DOB,DOD,DOD_HOSP,DOD_SSN,EXPIRE_FLAG
0,234,249,F,2075-03-13,1900-01-01,1900-01-01,1900-01-01,No
1,235,250,F,2164-12-27,2188-11-22,2188-11-22,1900-01-01,Yes
2,236,251,M,2090-03-15,1900-01-01,1900-01-01,1900-01-01,No
3,237,252,M,2078-03-06,1900-01-01,1900-01-01,1900-01-01,No
4,238,253,F,2089-11-26,1900-01-01,1900-01-01,1900-01-01,No
...,...,...,...,...,...,...,...,...
46515,31840,44089,M,2026-05-25,1900-01-01,1900-01-01,1900-01-01,No
46516,31841,44115,F,2124-07-27,1900-01-01,1900-01-01,1900-01-01,No
46517,31842,44123,F,2049-11-26,2135-01-12,2135-01-12,1900-01-01,Yes
46518,31843,44126,F,2076-07-25,1900-01-01,1900-01-01,1900-01-01,No


In [16]:
#saving the clean data into a csv file

csv_path = 'patients_table_clean.csv'
patients.to_csv(csv_path, index=False)
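A CSV file stores no dtypes, so the datetime and object conversions made above are lost on reload unless they are requested again. The sketch below shows one way to read the file back with the same types. It reads from an in-memory string holding the first two rows shown above so that it runs on its own; with the real file, `csv_path` would be passed instead of the `io.StringIO` object.

```python
import io
import pandas as pd

# First two rows of the cleaned table, as shown in the output above.
csv_text = (
    "ROW_ID,SUBJECT_ID,GENDER,DOB,DOD,DOD_HOSP,DOD_SSN,EXPIRE_FLAG\n"
    "234,249,F,2075-03-13,1900-01-01,1900-01-01,1900-01-01,No\n"
    "235,250,F,2164-12-27,2188-11-22,2188-11-22,1900-01-01,Yes\n"
)

date_cols = ["DOB", "DOD", "DOD_HOSP", "DOD_SSN"]

reloaded = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=date_cols,                      # restore datetime columns
    dtype={"ROW_ID": str, "SUBJECT_ID": str},   # keep identifiers as text
)
```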