# Machine Learning Applications for Health (COMP90089_2022_SM2)
# Group Assignment: Digital Phenotype of Diabetes Mellitus.

### Group members
- Mukhammad Karimov - mkarimov@student.unimelb.edu.au - 1019332
- Ching Yin Wan - chingyin@student.unimelb.edu.au
- Youran Zhou - youran@student.unimelb.edu.au
- Kartik Mahendra Jalal - jalalk@student.unimelb.edu.au

This notebook assumes that you have access to MIMIC-IV on Google BigQuery.

### **Goals**

Propose a digital phenotype for **Diabetes Mellitus**, describe how to identify the patient cohort from the diagnosis criteria using MIMIC-IV, apply different machine learning approaches, and compare and contrast their metrics.

### **Definitions**

Disease: Diabetes mellitus.

The term diabetes mellitus describes diseases of abnormal carbohydrate metabolism that are characterized by hyperglycemia. It is associated with a relative or absolute impairment in insulin secretion, along with varying degrees of peripheral resistance to the action of insulin. Every few years, the diabetes community reevaluates the current recommendations for the classification, diagnosis, and screening of diabetes, reflecting new information from research and clinical practice.


Disease criteria source: [Diabetes: Diagnosis of diabetes mellitus or prediabetes in non-pregnant adults.](https://pathways.uptodate.com/pathway/122238?topicRef=1812&source=see_link&dl_node=5d600883a9795a0012a82287&rid=632af335ce0bb723d99f4934)

Disease Identifier: E10-E14 [ICD code.](https://icd.who.int/browse10/2019/en#/E10-E14)

# MIMIC Data

## Load libraries and setup environment

In [39]:
# Import libraries
from datetime import timedelta
import os
from functools import reduce

import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

from sklearn.preprocessing import OrdinalEncoder
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from IPython.display import display, HTML, Image
%matplotlib inline

plt.style.use('ggplot')
plt.rcParams.update({'font.size': 20})

# Access data using Google BigQuery.
from google.colab import auth
from google.cloud import bigquery

We need to authenticate this notebook with Google Cloud Platform (GCP) in order to query MIMIC-IV. When prompted to log in, **make sure you use the Google account granted access to MIMIC-IV via PhysioNet**. [Details on granting your Google account access are described in the online documentation](https://mimic-iv.mit.edu/docs/access/cloud/).

In [2]:
# authenticate
auth.authenticate_user()

Next up, our final piece of configuration. In order to query the data, we need to **set the project ID**. **Please check your project ID in BigQuery and replace it in the code below.**

If you're not sure what your project ID is, or if you haven't made a project on GCP, [you can read about creating and managing google projects here](https://cloud.google.com/resource-manager/docs/creating-managing-projects). Afterwards, change the `project_id` variable below to your Google project.

In [3]:
# Set up environment variables
project_id = 'integral-berm-358804'
if project_id == 'CHANGE-ME':
  raise ValueError('You must change project_id to your GCP project.')
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id

# Read data from BigQuery into pandas dataframes.
def run_query(query, project_id=project_id):
  return pd.io.gbq.read_gbq(
      query,
      project_id=project_id,
      dialect='standard')

# set the dataset
# if you want to use the demo, change this to mimic_demo
dataset = 'mimiciv'


## Patient Cohort

Get all patients diagnosed with diabetes mellitus at discharge. ICD-10 codes E10-E14 are looked up in the `diagnoses_icd` table of the `hosp` module.

**Assumption: any `icd_code` that starts with E10-E14 belongs to the general diabetes mellitus group.**

- **E10** - Type 1 diabetes mellitus
- **E11** - Type 2 diabetes mellitus
- **E12** - Malnutrition-related diabetes mellitus
- **E13** - Other specified diabetes mellitus
- **E14** - Unspecified diabetes mellitus
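
The per-code lookups that follow can also be written as a single query. A minimal sketch, assuming a hypothetical helper `build_cohort_query` (not part of the original notebook) and BigQuery's `STARTS_WITH` function, so that each code is matched only as a prefix:

```python
# Hypothetical helper: builds one cohort query for several ICD-10 prefixes.
DIABETES_PREFIXES = ("E10", "E11", "E12", "E13", "E14")


def build_cohort_query(prefixes=DIABETES_PREFIXES,
                       table="physionet-data.mimiciv_hosp.diagnoses_icd"):
    """Return SQL that selects distinct subject_id values for the given prefixes."""
    conditions = " OR ".join(
        f"STARTS_WITH(UPPER(icd_code), '{prefix}')" for prefix in prefixes
    )
    return (
        "SELECT DISTINCT subject_id\n"
        f"FROM `{table}`\n"
        f"WHERE icd_version = 10 AND ({conditions})"
    )


cohort_query = build_cohort_query()
print(cohort_query)
# icd_cohort = run_query(cohort_query)  # needs BigQuery access, so left commented out
```

Running one query returns the deduplicated cohort directly, at the cost of losing the per-code counts that the separate cells display.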

In [4]:
##Module hosp, table diagnoses_icd. ICD_code = E10*
query = f"""
SELECT DISTINCT subject_id FROM `physionet-data.mimiciv_hosp.diagnoses_icd`
WHERE LOWER(icd_code) LIKE 'e10%' AND icd_version = 10
"""
e10_df = run_query(query)
e10_df

Unnamed: 0,subject_id
0,10030753
1,10075925
2,10123949
3,10160622
4,10253349
...,...
1292,17797252
1293,10566618
1294,11875731
1295,15296299


In [5]:
##Module hosp, table diagnoses_icd. ICD_code = E11*
query = f"""
SELECT DISTINCT subject_id FROM `physionet-data.mimiciv_hosp.diagnoses_icd`
WHERE LOWER(icd_code) LIKE 'e11%' AND icd_version = 10
"""
e11_df = run_query(query)
e11_df

Unnamed: 0,subject_id
0,10007818
1,10011849
2,10032176
3,10040025
4,10047172
...,...
17886,17678798
17887,17817592
17888,17980774
17889,18078692


In [6]:
##Module hosp, table diagnoses_icd. ICD_code = E12*
query = f"""
SELECT DISTINCT subject_id FROM `physionet-data.mimiciv_hosp.diagnoses_icd`
WHERE LOWER(icd_code) LIKE 'e12%' AND icd_version = 10
"""
e12_df = run_query(query)
e12_df

Unnamed: 0,subject_id


In [8]:
##Module hosp, table diagnoses_icd. ICD_code = E13*
query = f"""
SELECT DISTINCT subject_id FROM `physionet-data.mimiciv_hosp.diagnoses_icd`
WHERE LOWER(icd_code) LIKE 'e13%' AND icd_version = 10
"""
e13_df = run_query(query)
e13_df

Unnamed: 0,subject_id
0,15567127
1,15685616
2,18003764
3,19759233
4,10184274
...,...
238,17336284
239,18700699
240,15544960
241,12905199


In [9]:
##Module hosp, table diagnoses_icd. ICD_code = E14*
query = f"""
SELECT DISTINCT subject_id FROM `physionet-data.mimiciv_hosp.diagnoses_icd`
WHERE LOWER(icd_code) LIKE 'e14%' AND icd_version = 10
"""
e14_df = run_query(query)
e14_df

Unnamed: 0,subject_id


In [10]:
icd_cohort = pd.concat([e10_df, e11_df, e12_df, e13_df, e14_df], join='inner', ignore_index=True).drop_duplicates()
icd_cohort

Unnamed: 0,subject_id
0,10030753
1,10075925
2,10123949
3,10160622
4,10253349
...,...
19411,14403200
19414,19918971
19422,17257271
19423,18321691
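
The gaps in the index above (19411, 19414, 19422, ...) come from `drop_duplicates`: a patient coded under more than one category appears in several per-code frames and is kept only once. A small sketch on made-up subject IDs, with an optional `reset_index` to renumber the rows:

```python
import pandas as pd

# Toy stand-ins for the per-code results (subject IDs are made up)
e10_toy = pd.DataFrame({"subject_id": [1, 2, 3]})
e11_toy = pd.DataFrame({"subject_id": [3, 4]})
e13_toy = pd.DataFrame({"subject_id": [2, 5]})

# Stack the frames, then keep the first occurrence of each patient
combined = pd.concat([e10_toy, e11_toy, e13_toy], ignore_index=True)
cohort_toy = combined.drop_duplicates().reset_index(drop=True)

print(len(combined), "rows before,", len(cohort_toy), "unique patients after")
```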


In [11]:
icd_cohort.to_csv("patient_cohort (MK).csv",index = False)

## Feature Engineering

Extract demographic and admission features for the patient cohort.

In [12]:
query = f"""
SELECT admissions.*, patients.gender, patients.anchor_age, patients.anchor_year, patients.anchor_year_group, patients.dod 
FROM `physionet-data.mimiciv_hosp.patients` AS patients
JOIN `physionet-data.mimiciv_hosp.admissions` AS admissions ON patients.subject_id = admissions.subject_id
"""
all_patients_df = run_query(query)
all_patients_df

Unnamed: 0,subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag,gender,anchor_age,anchor_year,anchor_year_group,dod
0,10006053,22942076,2111-11-13 23:39:00,2111-11-15 17:20:00,2111-11-15 17:20:00,URGENT,TRANSFER FROM HOSPITAL,DIED,Medicaid,ENGLISH,,UNKNOWN,NaT,NaT,1,M,52,2111,2014 - 2016,2111-11-15
1,10017531,20668418,2158-01-20 16:52:00,2158-01-30 14:30:00,NaT,URGENT,TRANSFER FROM HOSPITAL,HOME HEALTH CARE,Other,ENGLISH,,WHITE,NaT,NaT,0,M,63,2158,2008 - 2010,NaT
2,10017531,21095812,2159-12-26 20:14:00,2160-02-04 16:00:00,NaT,URGENT,TRANSFER FROM HOSPITAL,REHAB,Other,ENGLISH,,WHITE,NaT,NaT,0,M,63,2158,2008 - 2010,NaT
3,10017531,22580355,2159-09-22 19:30:00,2159-10-24 13:40:00,NaT,URGENT,TRANSFER FROM HOSPITAL,CHRONIC/LONG TERM ACUTE CARE,Other,ENGLISH,,WHITE,NaT,NaT,0,M,63,2158,2008 - 2010,NaT
4,10021312,25020332,2113-08-16 00:32:00,2113-08-18 17:35:00,NaT,URGENT,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,,UNKNOWN,NaT,NaT,0,F,55,2113,2014 - 2016,NaT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
454319,19979081,25032257,2179-02-20 07:15:00,2179-02-27 16:45:00,NaT,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,SKILLED NURSING FACILITY,Medicare,ENGLISH,DIVORCED,ASIAN - CHINESE,NaT,NaT,0,F,80,2179,2011 - 2013,NaT
454320,19991135,28088185,2124-02-17 08:30:00,2124-02-20 08:50:00,NaT,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,DIVORCED,WHITE,NaT,NaT,0,F,57,2124,2008 - 2010,2133-07-19
454321,19995012,29185936,2153-04-11 13:00:00,2153-04-14 13:51:00,NaT,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,SKILLED NURSING FACILITY,Other,ENGLISH,DIVORCED,BLACK/AFRICAN AMERICAN,NaT,NaT,0,F,64,2152,2008 - 2010,NaT
454322,19995790,22970553,2185-02-02 12:00:00,2185-02-06 17:08:00,NaT,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,SKILLED NURSING FACILITY,Medicare,ENGLISH,DIVORCED,WHITE,NaT,NaT,0,M,66,2185,2008 - 2010,NaT


In [13]:
df = all_patients_df[all_patients_df.subject_id.isin(icd_cohort["subject_id"])]
df

Unnamed: 0,subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag,gender,anchor_age,anchor_year,anchor_year_group,dod
12,10063460,26955151,2175-09-03 01:28:00,2175-09-05 15:00:00,NaT,URGENT,PACU,HOME,Other,?,,HISPANIC/LATINO - CUBAN,NaT,NaT,0,M,51,2175,2017 - 2019,NaT
13,10065024,25406025,2160-07-30 17:33:00,2160-08-04 15:39:00,NaT,URGENT,TRANSFER FROM HOSPITAL,SKILLED NURSING FACILITY,Medicare,ENGLISH,,UNKNOWN,NaT,NaT,0,M,77,2160,2017 - 2019,NaT
30,10182648,24152717,2134-04-06 16:59:00,2134-04-13 18:23:00,NaT,URGENT,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,,UNKNOWN,NaT,NaT,0,F,63,2134,2017 - 2019,NaT
39,10212582,22708939,2171-04-07 20:25:00,2171-04-16 14:49:00,NaT,URGENT,TRANSFER FROM HOSPITAL,HOME HEALTH CARE,Other,ENGLISH,,UNKNOWN,NaT,NaT,0,M,55,2171,2017 - 2019,NaT
44,10229302,26194242,2132-09-08 23:35:00,2132-09-20 18:58:00,NaT,URGENT,TRANSFER FROM HOSPITAL,ACUTE HOSPITAL,Medicare,ENGLISH,,NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER,2132-09-08 17:03:00,2132-09-09 00:48:00,0,M,82,2132,2017 - 2019,2132-09-25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
454282,19808487,22178497,2154-10-15 07:15:00,2154-10-19 11:45:00,NaT,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,HOME,Medicare,ENGLISH,DIVORCED,WHITE,NaT,NaT,0,M,62,2154,2008 - 2010,NaT
454288,19836124,25734796,2145-10-14 07:12:00,2145-10-16 18:00:00,NaT,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,DIVORCED,BLACK/AFRICAN AMERICAN,NaT,NaT,0,F,76,2145,2014 - 2016,NaT
454294,19874175,26421016,2122-05-14 07:15:00,2122-05-16 16:15:00,NaT,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,HOME,Other,ENGLISH,DIVORCED,WHITE,NaT,NaT,0,F,61,2122,2014 - 2016,NaT
454320,19991135,28088185,2124-02-17 08:30:00,2124-02-20 08:50:00,NaT,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,DIVORCED,WHITE,NaT,NaT,0,F,57,2124,2008 - 2010,2133-07-19


In [14]:
# Work on an explicit copy so adding columns does not raise SettingWithCopyWarning
df = df.copy()
# Durations in whole hours (floored); rows with missing timestamps stay NaN
df["hospital_stay_hour"] = (df["dischtime"] - df["admittime"]).dt.total_seconds() // 3600
df["ed_hour"] = (df["edouttime"] - df["edregtime"]).dt.total_seconds() // 3600
df


Unnamed: 0,subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admission_location,discharge_location,insurance,language,...,edregtime,edouttime,hospital_expire_flag,gender,anchor_age,anchor_year,anchor_year_group,dod,hospital_stay_hour,ed_hour
12,10063460,26955151,2175-09-03 01:28:00,2175-09-05 15:00:00,NaT,URGENT,PACU,HOME,Other,?,...,NaT,NaT,0,M,51,2175,2017 - 2019,NaT,61.0,
13,10065024,25406025,2160-07-30 17:33:00,2160-08-04 15:39:00,NaT,URGENT,TRANSFER FROM HOSPITAL,SKILLED NURSING FACILITY,Medicare,ENGLISH,...,NaT,NaT,0,M,77,2160,2017 - 2019,NaT,118.0,
30,10182648,24152717,2134-04-06 16:59:00,2134-04-13 18:23:00,NaT,URGENT,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,...,NaT,NaT,0,F,63,2134,2017 - 2019,NaT,169.0,
39,10212582,22708939,2171-04-07 20:25:00,2171-04-16 14:49:00,NaT,URGENT,TRANSFER FROM HOSPITAL,HOME HEALTH CARE,Other,ENGLISH,...,NaT,NaT,0,M,55,2171,2017 - 2019,NaT,210.0,
44,10229302,26194242,2132-09-08 23:35:00,2132-09-20 18:58:00,NaT,URGENT,TRANSFER FROM HOSPITAL,ACUTE HOSPITAL,Medicare,ENGLISH,...,2132-09-08 17:03:00,2132-09-09 00:48:00,0,M,82,2132,2017 - 2019,2132-09-25,283.0,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
454282,19808487,22178497,2154-10-15 07:15:00,2154-10-19 11:45:00,NaT,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,HOME,Medicare,ENGLISH,...,NaT,NaT,0,M,62,2154,2008 - 2010,NaT,100.0,
454288,19836124,25734796,2145-10-14 07:12:00,2145-10-16 18:00:00,NaT,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,...,NaT,NaT,0,F,76,2145,2014 - 2016,NaT,58.0,
454294,19874175,26421016,2122-05-14 07:15:00,2122-05-16 16:15:00,NaT,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,HOME,Other,ENGLISH,...,NaT,NaT,0,F,61,2122,2014 - 2016,NaT,57.0,
454320,19991135,28088185,2124-02-17 08:30:00,2124-02-20 08:50:00,NaT,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,...,NaT,NaT,0,F,57,2124,2008 - 2010,2133-07-19,72.0,


In [30]:
full_df = df.groupby(['subject_id', 'gender']).agg(
    admissions=pd.NamedAgg(column="hadm_id", aggfunc="count"), 
    hospital_stay_mean=pd.NamedAgg(column="hospital_stay_hour", aggfunc="mean"), 
    ed_mean=pd.NamedAgg(column="ed_hour", aggfunc="mean"), 
    admission_type=pd.NamedAgg(column="admission_type", aggfunc="last"),
    admission_location=pd.NamedAgg(column="admission_location", aggfunc="last"),
    insurance=pd.NamedAgg(column="insurance", aggfunc="last"),
    marital_status=pd.NamedAgg(column="marital_status", aggfunc="last"),
    race=pd.NamedAgg(column="race", aggfunc="last"),
    age=pd.NamedAgg(column="anchor_age", aggfunc="max"),
    year_group=pd.NamedAgg(column="anchor_year_group", aggfunc="last"),
    dod=pd.NamedAgg(column="dod", aggfunc="last"),
    ).reset_index()
full_df

Unnamed: 0,subject_id,gender,admissions,hospital_stay_mean,ed_mean,admission_type,admission_location,insurance,marital_status,race,age,year_group,dod
0,10000980,F,7,82.142857,8.000000,OBSERVATION ADMIT,WALK-IN/SELF REFERRAL,Other,MARRIED,BLACK/AFRICAN AMERICAN,73,2008 - 2010,2193-08-26
1,10001180,F,3,88.666667,5.000000,EU OBSERVATION,EMERGENCY ROOM,Other,MARRIED,WHITE,33,2014 - 2016,NaT
2,10001843,M,1,43.000000,,OBSERVATION ADMIT,TRANSFER FROM HOSPITAL,Other,SINGLE,WHITE,73,2017 - 2019,NaT
3,10002013,F,12,78.000000,7.222222,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,Medicare,SINGLE,OTHER,53,2008 - 2010,NaT
4,10002221,F,5,92.000000,11.666667,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,Medicare,SINGLE,WHITE,68,2014 - 2016,NaT
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18932,19996654,M,1,113.000000,26.000000,OBSERVATION ADMIT,TRANSFER FROM HOSPITAL,Medicare,SINGLE,WHITE,56,2017 - 2019,NaT
18933,19996783,M,4,133.000000,9.750000,OBSERVATION ADMIT,PHYSICIAN REFERRAL,Other,MARRIED,ASIAN - CHINESE,89,2017 - 2019,2188-05-21
18934,19997473,F,1,518.000000,3.000000,URGENT,TRANSFER FROM HOSPITAL,Medicare,MARRIED,WHITE,82,2014 - 2016,NaT
18935,19997538,M,2,305.500000,7.000000,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,Other,MARRIED,WHITE,53,2017 - 2019,NaT
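
Note that `aggfunc="last"` returns the last row in whatever order the rows arrive, and a BigQuery result without `ORDER BY` is not guaranteed to be chronological. A minimal sketch on made-up data, assuming the intent is to keep values from the most recent admission:

```python
import pandas as pd

# Toy admissions for one patient, deliberately out of chronological order
toy = pd.DataFrame({
    "subject_id": [1, 1, 1],
    "admittime": pd.to_datetime(["2150-03-01", "2150-01-01", "2150-02-01"]),
    "admission_type": ["URGENT", "ELECTIVE", "OBSERVATION ADMIT"],
})

# Without sorting, "last" simply means the last row as stored
unsorted_last = toy.groupby("subject_id")["admission_type"].last()

# Sorting by admission time first makes "last" mean the most recent admission
sorted_last = (
    toy.sort_values(["subject_id", "admittime"])
       .groupby("subject_id")["admission_type"]
       .last()
)

print(unsorted_last.loc[1], "|", sorted_last.loc[1])
```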


In [31]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18937 entries, 0 to 18936
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   subject_id          18937 non-null  int64         
 1   gender              18937 non-null  object        
 2   admissions          18937 non-null  int64         
 3   hospital_stay_mean  18937 non-null  float64       
 4   ed_mean             16098 non-null  float64       
 5   admission_type      18937 non-null  object        
 6   admission_location  18937 non-null  object        
 7   insurance           18937 non-null  object        
 8   marital_status      18142 non-null  object        
 9   race                18937 non-null  object        
 10  age                 18937 non-null  int64         
 11  year_group          18937 non-null  object        
 12  dod                 4069 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(3), ob

### Transforming categorical columns into numbers
* `get_dummies`: columns with dtype `object` or `category` are converted into indicator columns.

The cell below encodes only `dod` and `gender` as binary values.

In [32]:
# Replace the date of death with a binary flag (1 = date recorded, 0 = none)
full_df['dod'] = full_df['dod'].notna().astype(int)

# Transform the gender column from categorical to binary (M = 1, F = 0)
full_df['gender'] = full_df['gender'].map({'M': 1, 'F': 0}).astype(int)

full_df

Unnamed: 0,subject_id,gender,admissions,hospital_stay_mean,ed_mean,admission_type,admission_location,insurance,marital_status,race,age,year_group,dod
0,10000980,0,7,82.142857,8.000000,OBSERVATION ADMIT,WALK-IN/SELF REFERRAL,Other,MARRIED,BLACK/AFRICAN AMERICAN,73,2008 - 2010,1
1,10001180,0,3,88.666667,5.000000,EU OBSERVATION,EMERGENCY ROOM,Other,MARRIED,WHITE,33,2014 - 2016,0
2,10001843,1,1,43.000000,,OBSERVATION ADMIT,TRANSFER FROM HOSPITAL,Other,SINGLE,WHITE,73,2017 - 2019,0
3,10002013,0,12,78.000000,7.222222,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,Medicare,SINGLE,OTHER,53,2008 - 2010,0
4,10002221,0,5,92.000000,11.666667,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,Medicare,SINGLE,WHITE,68,2014 - 2016,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18932,19996654,1,1,113.000000,26.000000,OBSERVATION ADMIT,TRANSFER FROM HOSPITAL,Medicare,SINGLE,WHITE,56,2017 - 2019,0
18933,19996783,1,4,133.000000,9.750000,OBSERVATION ADMIT,PHYSICIAN REFERRAL,Other,MARRIED,ASIAN - CHINESE,89,2017 - 2019,1
18934,19997473,0,1,518.000000,3.000000,URGENT,TRANSFER FROM HOSPITAL,Medicare,MARRIED,WHITE,82,2014 - 2016,0
18935,19997538,1,2,305.500000,7.000000,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,Other,MARRIED,WHITE,53,2017 - 2019,0
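
The remaining `object` columns (for example `insurance` and `marital_status`) are still text at this point. A minimal sketch of the `get_dummies` approach mentioned above, on a made-up frame; the column choice and `dtype=int` are assumptions for illustration:

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    "insurance": ["Medicare", "Other", "Medicaid"],
    "marital_status": ["MARRIED", None, "SINGLE"],
    "age": [73, 33, 68],
})

# One indicator column per category; a missing value yields all zeros for that column group
encoded = pd.get_dummies(toy, columns=["insurance", "marital_status"], dtype=int)
print(encoded.columns.tolist())
```

Indicator columns avoid imposing an artificial order on categories, though high-cardinality columns such as `race` would add many features.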


## Classification
The target is `dod` (1 if a date of death is recorded, 0 otherwise). Select the features, then split into train and test sets.

In [28]:
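# Keep only the numeric features; the categorical columns are not encoded yet
feature_cols = ['gender', 'admissions', 'hospital_stay_mean', 'ed_mean', 'age']
X = full_df[feature_cols]
col_name = X.columns
Y = full_df['dod']
X.head()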

Unnamed: 0,gender,admissions,hospital_stay_mean,ed_mean,age
0,0,7,82.142857,8.0,73
1,0,3,88.666667,5.0,33
2,1,1,43.0,,73
3,0,12,78.0,7.222222,53
4,0,5,92.0,11.666667,68


In [33]:
#Check the final dtype of each column. Are they properly defined now? 
print(full_df.info(),"\n\n")

#How is the data distributed? Outliers?
print(full_df.describe(), "\n\n")

## How balanced is the data?
print(full_df["dod"].value_counts())
print(full_df["gender"].value_counts())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18937 entries, 0 to 18936
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   subject_id          18937 non-null  int64  
 1   gender              18937 non-null  int64  
 2   admissions          18937 non-null  int64  
 3   hospital_stay_mean  18937 non-null  float64
 4   ed_mean             16098 non-null  float64
 5   admission_type      18937 non-null  object 
 6   admission_location  18937 non-null  object 
 7   insurance           18937 non-null  object 
 8   marital_status      18142 non-null  object 
 9   race                18937 non-null  object 
 10  age                 18937 non-null  int64  
 11  year_group          18937 non-null  object 
 12  dod                 18937 non-null  int64  
dtypes: float64(2), int64(5), object(6)
memory usage: 1.9+ MB
None 


         subject_id        gender    admissions  hospital_stay_mean  \
count  1.893700

In [36]:
## Randomly shuffle the data and create the subsets for training and testing
# We keep 80% of the data to train the model and the remaining 20% for testing.

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print("Check the subsets size: X_train:{}, y_train:{}, X_test:{}, y_test:{}. \n\n".format(X_train.shape,y_train.shape,X_test.shape,y_test.shape))


Check the subsets size: X_train:(15149, 5), y_train:(15149,), X_test:(3788, 5), y_test:(3788,). 
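
The `dod` flag is imbalanced: 4069 of the 18937 patients have a recorded date of death. Passing `stratify` to `train_test_split` keeps that ratio in both subsets. A sketch on synthetic data with a similar imbalance; the array names are illustrative and not from the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels with about 21% positives, similar to dod
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 210 + [0] * 790)

# stratify=y_toy preserves the class ratio in the train and test subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print("positive rate train:", y_tr.mean(), "test:", y_te.mean())
```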




In [40]:
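# Training the Naive Bayes classifier.
# GaussianNB rejects NaN, and ed_mean is missing where no ED times were recorded.
# Assumption: fill the gaps with the training-set median (one of several options).
ed_median = X_train['ed_mean'].median()
X_train = X_train.fillna({'ed_mean': ed_median})
X_test = X_test.fillna({'ed_mean': ed_median})

classifier_NB = GaussianNB()
model_NB = classifier_NB.fit(X_train, y_train)

# Predict the classifier response for the test dataset:
predictions_NB = model_NB.predict(X_test)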
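
`ed_mean` has missing values (16098 non-null of 18937) and `GaussianNB` does not accept NaN input, so some form of imputation is needed before fitting. One option, sketched here on synthetic data rather than the MIMIC cohort, is to wrap a `SimpleImputer` and the classifier in a scikit-learn pipeline so that `cross_val_score` re-fits the imputer on each training fold. The median strategy and the synthetic features are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for X and Y: one informative feature, one noisy feature with gaps
rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=200)
X_toy = np.column_stack([
    rng.normal(loc=y_toy * 2.0, scale=1.0),  # shifted by class, so informative
    rng.normal(size=200),                    # pure noise
])
X_toy[rng.random(200) < 0.15, 1] = np.nan   # inject missing values

# The imputer is fitted inside each training fold, so test-fold statistics do not leak in
nb_pipeline = make_pipeline(SimpleImputer(strategy="median"), GaussianNB())
scores = cross_val_score(nb_pipeline, X_toy, y_toy, cv=5, scoring="accuracy")

print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

The same pipeline object can be passed to `GridSearchCV` or compared against other classifiers without repeating the imputation step by hand.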