# Notebook Setup and Data Loading

Import the libraries used throughout the notebook, then define the file paths and the default plotting style.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# --- Configuration ---
OUTPUT_DIR = Path("../output")
DATA_FILE = OUTPUT_DIR / "readmissions_dataset.parquet"

# --- Plotting Style ---
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)
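
Because the load step depends on `DATA_FILE` existing, it can help to fail fast with a clear message when the file is missing. The following is a minimal sketch under that assumption; `require_file` is a hypothetical helper name, not part of the original notebook.

```python
from pathlib import Path


def require_file(path: Path) -> Path:
    """Return `path` unchanged if it points to an existing file, else raise.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if not path.is_file():
        raise FileNotFoundError(f"Expected data file not found: {path.resolve()}")
    return path


# Same path as in the configuration cell above.
DATA_FILE = Path("../output") / "readmissions_dataset.parquet"

try:
    require_file(DATA_FILE)
    print(f"Found {DATA_FILE}")
except FileNotFoundError as err:
    # Report the resolved absolute path so a wrong working directory is easy to spot.
    print(err)
```

Printing the resolved path is a deliberate choice here: relative paths such as `../output` depend on where the notebook kernel was started.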

Load the Parquet file into a pandas DataFrame and display the first few rows to confirm it loaded correctly.

In [2]:
df = pd.read_parquet(DATA_FILE)

print(f"Dataset loaded with {df.shape[0]:,} rows and {df.shape[1]} columns.")
print("\nFirst 5 rows:")
display(df.head())

Dataset loaded with 104,068 rows and 14 columns.

First 5 rows:


| | encounter_id | patient_id | readmitted_within_30_days | length_of_stay | age_at_admission | gender | race | marital_status | admission_reason | admission_reason_detail | prior_admissions_last_year | num_diagnoses | num_procedures | num_medications |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 138fa8d2-c18d-18f6-2d10-94511e19edbc | 79f75722-a60e-c575-a7af-4da0ae591a52 | 0 | 11 | 2 | male | | S | Hospital admission (procedure) | Patient transfer to skilled nursing facility (... | 0 | 0.0 | 20.0 | 0.0 |
| 1 | edbab4b3-26b2-fcd7-4401-e3622cbfee4d | f65c8b52-a710-5ec4-0c26-44b15d1174a5 | 0 | 2 | 38 | male | | W | Drug rehabilitation and detoxification (regime... | Dependent drug abuse (disorder) | 0 | 0.0 | 0.0 | 0.0 |
| 2 | 46939e78-cb1b-df7f-9202-37bef958c60c | f65c8b52-a710-5ec4-0c26-44b15d1174a5 | 0 | 4 | 40 | male | | W | Drug rehabilitation and detoxification (regime... | Dependent drug abuse (disorder) | 0 | 0.0 | 0.0 | 0.0 |
| 3 | dc23e269-a1c9-a52a-9558-28cd57391e6f | f65c8b52-a710-5ec4-0c26-44b15d1174a5 | 0 | 3 | 41 | male | | W | Drug rehabilitation and detoxification (regime... | Dependent drug abuse (disorder) | 0 | 0.0 | 0.0 | 0.0 |
| 4 | 1ab103bf-9d29-cb4f-19c6-d7f2c11391bf | f65c8b52-a710-5ec4-0c26-44b15d1174a5 | 0 | 4 | 47 | male | | W | Drug rehabilitation and detoxification (regime... | Dependent drug abuse (disorder) | 0 | 0.0 | 0.0 | 0.0 |
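
The preview shows the same `patient_id` on several rows, which suggests that each row is an encounter rather than a patient. A quick structural check can confirm this kind of assumption after loading. The sketch below runs on a small made-up frame; `check_encounter_table` is a hypothetical helper, and the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the loaded DataFrame; all values are made up for illustration.
toy = pd.DataFrame(
    {
        "encounter_id": ["e1", "e2", "e3", "e4"],
        "patient_id": ["p1", "p2", "p2", "p2"],
        "readmitted_within_30_days": [0, 0, 0, 1],
    }
)


def check_encounter_table(frame: pd.DataFrame, required: list[str]) -> dict:
    """Hypothetical sanity check: required columns exist, one row per encounter."""
    missing = [col for col in required if col not in frame.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return {
        "n_rows": len(frame),
        "n_patients": int(frame["patient_id"].nunique()),
        # Non-zero here would mean an encounter appears on more than one row.
        "duplicate_encounters": int(frame["encounter_id"].duplicated().sum()),
    }


summary = check_encounter_table(
    toy, ["encounter_id", "patient_id", "readmitted_within_30_days"]
)
print(summary)
```

On the real data, the same call could be made with `df` in place of `toy`. Fewer patients than rows would be consistent with repeated encounters per patient.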


Use `df.info()` to get a concise summary of the DataFrame, including each column's data type and non-null count.
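
As a complement to the `.info()` call in the next cell, the share of missing values per column can be computed directly, which makes fully empty columns (such as `race`, which appears blank in the preview above) easy to list. This is a minimal sketch on a made-up frame, not the notebook's own code.

```python
import pandas as pd

# Toy frame with one fully empty column and one partially empty column.
toy = pd.DataFrame(
    {
        "gender": ["male", "female", "male"],
        "race": [None, None, None],
        "num_diagnoses": [0.0, 2.0, float("nan")],
    }
)

# Fraction of missing values per column, largest first.
null_fraction = toy.isna().mean().sort_values(ascending=False)

# Columns that contain no values at all.
all_null_columns = null_fraction[null_fraction == 1.0].index.tolist()

print(null_fraction)
print("Columns with no values at all:", all_null_columns)
```

Applied to the real data, `toy` would be replaced by `df`.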

In [3]:
print("\nData Types and Null Values:")
df.info()


Data Types and Null Values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104068 entries, 0 to 104067
Data columns (total 14 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   encounter_id                104068 non-null  object 
 1   patient_id                  104068 non-null  object 
 2   readmitted_within_30_days   104068 non-null  int32  
 3   length_of_stay              104068 non-null  int64  
 4   age_at_admission            104068 non-null  int64  
 5   gender                      104068 non-null  object 
 6   race                        0 non-null       object 
 7   marital_status              104068 non-null  object 
 8   admission_reason            104068 non-null  object 
 9   admission_reason_detail     104063 non-null  object 
 10  prior_admissions_last_year  104068 non-null  int64  
 11  num_diagnoses               104068 non-null  float64
 12  num_procedures              104068 non-null