# Conestoga College

## Course: Data Analysis Mathematics, Algorithms and Modeling


### Problem Analysis Workshop 1


Group members:

- Jiho Jun - 9080800
- Vishnu Sivaraj - 9025320
- Edwin Lopez - 9055061


## Question:

### Predicting hospital readmission for diabetes patients within 30 days

Machine learning can help predict 30-day hospital readmissions in diabetes patients by analyzing medical and demographic data.

Using datasets like the Diabetes 130-US Hospitals records, models can identify high-risk patients, support early interventions, and reduce healthcare costs.

URL: https://www.kaggle.com/datasets/brandao/diabetes


In [None]:
# Install necessary packages

# !pip install -q kaggle pandas

In [1]:
# import necessary libraries

import os, json, pathlib
from getpass import getpass

In [4]:
# Get Kaggle token from user input
print("Paste the contents of kaggle.json. Example: {\"username\":\"your_name\",\"key\":\"abcd...\"}")

# get token string
token_str = getpass("kaggle.json content: ")

Paste the contents of kaggle.json. Example: {"username":"your_name","key":"abcd..."}


In [5]:
# save to ~/.kaggle/kaggle.json
home = pathlib.Path.home()

# make .kaggle directory
kdir = home/".kaggle"

# create directory if it doesn't exist
kdir.mkdir(exist_ok=True)

# write the token string to kaggle.json file
(kdir/"kaggle.json").write_text(token_str)

# set file permissions to read/write for user only
os.chmod(kdir/"kaggle.json", 0o600)

# confirm saved
print("Saved to", kdir/"kaggle.json")

Saved to C:\Users\Ed\.kaggle\kaggle.json


In [6]:
# import KaggleApi and authenticate
from kaggle.api.kaggle_api_extended import KaggleApi
from pathlib import Path

# authenticate
api = KaggleApi()
api.authenticate()

In [7]:
# path to download directory
out_dir = Path("data/diabetes")

# create output directory if it doesn't exist
out_dir.mkdir(parents=True, exist_ok=True)

# dataset to download
dataset = "brandao/diabetes"

print("Downloading", dataset)

# download and unzip
api.dataset_download_files(dataset, path=str(out_dir), unzip=True)

print("Complete.")

print("Files:")

# list downloaded files
for p in sorted(out_dir.rglob("*")):
    if p.is_file():
        print("-", p.relative_to(out_dir))

Downloading brandao/diabetes
Dataset URL: https://www.kaggle.com/datasets/brandao/diabetes
Complete.
Files:
- description.pdf
- diabetic_data.csv


In [None]:
# import pandas
import pandas as pd

# path to the CSV file
file_path = "data/diabetes/diabetic_data.csv"

# Load the CSV into a DataFrame
df = pd.read_csv(file_path)

# number of rows and columns
print(df.shape)

# data types of each column
print(df.dtypes)

# first few rows
df.head()


(101766, 50)
encounter_id                 int64
patient_nbr                  int64
race                        object
gender                      object
age                         object
weight                      object
admission_type_id            int64
discharge_disposition_id     int64
admission_source_id          int64
time_in_hospital             int64
payer_code                  object
medical_specialty           object
num_lab_procedures           int64
num_procedures               int64
num_medications              int64
number_outpatient            int64
number_emergency             int64
number_inpatient             int64
diag_1                      object
diag_2                      object
diag_3                      object
number_diagnoses             int64
max_glu_serum               object
A1Cresult                   object
metformin                   object
repaglinide                 object
nateglinide                 object
chlorpropamide              object
glimepi

| | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |



# Data Cleansing


This data needs cleansing before analysis:

1. Handle missing and unknown values

   - Many categorical fields use "?" or "Unknown/Invalid" instead of nulls.

2. Simplify target variable

   - The field **readmitted** has three categories:
     - "NO" (not readmitted)
     - ">30" (readmitted after 30 days)
     - "<30" (readmitted within 30 days)
   - Collapse these into a binary target:
     - 1 = readmitted within 30 days ("<30")
     - 0 = not readmitted within 30 days ("NO" or ">30")

3. Encode categorical variables

   - Convert age from intervals (e.g. [70-80)) to numeric midpoints (e.g. 75).

4. Reduce dimensionality of diagnoses

   - The diagnosis fields (diag_1, diag_2, diag_3) contain hundreds of unique codes, which create sparse features.
   - Approach: group the codes into broader disease categories (circulatory, respiratory, digestive, diabetes-related, etc.).

5. Address class imbalance
   - About 11% of cases in this data are readmitted within 30 days, so the positive class is a small minority.
   - Approach: undersample the majority class.
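
Steps 1 and 2 of the list above could be sketched as follows. This is a minimal illustration on a few made-up rows rather than the real file. The helper `clean_missing_and_target` and the new column name `readmitted_30d` are assumptions introduced here, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative rows only, not real records from diabetic_data.csv
sample = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican"],
    "gender": ["Female", "Unknown/Invalid", "Male"],
    "readmitted": ["NO", ">30", "<30"],
})

def clean_missing_and_target(df):
    """Return a copy with placeholder strings as NaN and a binary target.

    `readmitted_30d` is a hypothetical column name chosen for this sketch.
    """
    # Step 1: treat "?" and "Unknown/Invalid" as missing values
    out = df.replace(["?", "Unknown/Invalid"], np.nan)
    # Step 2: 1 if readmitted within 30 days ("<30"), else 0 ("NO" or ">30")
    out["readmitted_30d"] = (out["readmitted"] == "<30").astype(int)
    return out

cleaned = clean_missing_and_target(sample)
print(cleaned)
```

On the full data, the same function could be called as `clean_missing_and_target(df)`. Whether missing values should then be dropped, imputed, or kept as their own category is a separate decision per column.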
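
For step 3, one possible way to turn the age brackets into numeric midpoints is shown below. The function name `age_midpoint` is hypothetical, and the sketch assumes every value has the form `[low-high)` as seen in the first rows of the data.

```python
def age_midpoint(interval):
    """Convert an age bracket such as '[70-80)' to its numeric midpoint."""
    # Drop the surrounding "[" and ")" and split on the dash
    low, high = interval.strip("[)").split("-")
    return (int(low) + int(high)) / 2

# Brackets taken from the preview of the data above
for bracket in ["[0-10)", "[40-50)", "[70-80)"]:
    print(bracket, "->", age_midpoint(bracket))
```

With pandas, this could be applied as `df["age"].map(age_midpoint)`, storing the result in a new column so the original bracket is preserved.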
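
For step 4, the grouping could look like the sketch below. It assumes the diagnosis fields hold ICD-9 codes and uses the ICD-9 chapter ranges for the categories named above. The function name `diagnosis_group` and the labels "Other" and "Missing" are choices made for this example, and the ranges should be checked against description.pdf before use.

```python
def diagnosis_group(code):
    """Map a raw diagnosis code to a broad category (assumes ICD-9 codes)."""
    # Missing values may arrive as "?" or as NaN after cleansing
    if not isinstance(code, str) or not code or code == "?":
        return "Missing"
    # Supplementary codes start with a letter (V or E) and are not numeric
    if code[0] in "VE":
        return "Other"
    number = float(code)
    if 250 <= number < 251:
        return "Diabetes"
    if 390 <= number < 460:
        return "Circulatory"
    if 460 <= number < 520:
        return "Respiratory"
    if 520 <= number < 580:
        return "Digestive"
    return "Other"

# Example codes chosen for illustration
for code in ["250.83", "428", "486", "574", "V57", "?"]:
    print(code, "->", diagnosis_group(code))
```

Applying the function to each of diag_1, diag_2, and diag_3 (for example with `df["diag_1"].map(diagnosis_group)`) would replace hundreds of codes with a handful of categories per column.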
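
For step 5, random undersampling could be sketched as below. The toy frame mimics the roughly 11% positive rate mentioned above. The function `undersample_majority`, the column name `readmitted_30d`, and the seed value are assumptions made for illustration.

```python
import pandas as pd

def undersample_majority(df, target, seed=42):
    """Randomly downsample every class to the size of the smallest class."""
    n_minority = int(df[target].value_counts().min())
    parts = [
        group.sample(n=n_minority, random_state=seed)
        for _, group in df.groupby(target)
    ]
    # Shuffle so the classes are not stored in contiguous blocks
    balanced = pd.concat(parts).sample(frac=1, random_state=seed)
    return balanced.reset_index(drop=True)

# Toy data: 11 positives out of 100 rows (illustrative, not real records)
toy = pd.DataFrame({
    "feature": range(100),
    "readmitted_30d": [1] * 11 + [0] * 89,
})

balanced = undersample_majority(toy, "readmitted_30d")
print(balanced["readmitted_30d"].value_counts())
```

If the data is later split into training and test sets, it is usually safer to undersample only the training portion so that evaluation reflects the original class balance.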
