# Q7. Predictive model build
Having chosen to engineer a 'Length of Stay' feature, let's now build a model to predict patient length of stay. The details of this model are as follows:

### Purpose: 
To predict the length of hospital stay for patients upon admission. This will help Ramsay better allocate resources and manage bed availability, particularly for high-cost DRGs identified earlier. By anticipating patient needs, Ramsay can improve operational efficiency and patient care quality.

### Model Choice
Gradient Boosting Machine (GBM)

### Preprocessing Steps:
- Data Cleaning:
  - Handle missing values: impute or remove missing data in relevant columns.
  - Convert categorical variables to numerical form using one-hot encoding (e.g., PrincipalDiagnosis, Sex, UrgencyOfAdmission).
  - Ensure all date columns are converted to datetime objects.

- Feature Engineering:
  - Calculate TotalCharges as discussed (the sum of the individual charge columns). Because charges accrue during the stay, check that this value, or an estimate of it, is actually known at admission before relying on it as a predictor.
  - Extract features from date columns (e.g., admission month, day of week).
  - Aggregate patient history features if available (e.g., past admissions, average stay length).

- Train-Test Split:
  - Split data into training and testing sets (e.g., 80% train, 20% test).

- Model Training:
  - Use a Gradient Boosting algorithm (e.g., XGBoost, LightGBM) to train the model.
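
The patient history features mentioned above are not built in the example further down, so here is a minimal sketch of one way they could be derived. It assumes the data carries a patient identifier; the column name `PatientID` and the toy values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Toy admissions table. `PatientID` is a hypothetical column name used
# only for illustration; the values are made up.
admissions = pd.DataFrame({
    "PatientID": [1, 1, 1, 2, 2],
    "AdmissionDate": pd.to_datetime(
        ["2023-01-05", "2023-03-10", "2023-06-01", "2023-02-01", "2023-02-20"]
    ),
    "LengthOfStay": [3, 5, 10, 2, 4],
})

# Order each patient's admissions chronologically before aggregating.
admissions = admissions.sort_values(["PatientID", "AdmissionDate"])
grouped = admissions.groupby("PatientID")

# Number of earlier admissions for the same patient (0 on the first visit).
admissions["PastAdmissions"] = grouped.cumcount()

# Mean stay over earlier admissions only: shift(1) excludes the current row,
# so the target never leaks into its own feature. First visits get NaN.
admissions["AvgPastStay"] = grouped["LengthOfStay"].transform(
    lambda s: s.shift(1).expanding().mean()
)

print(admissions)
```

First admissions have no history, so `AvgPastStay` is missing for them and would need to be imputed (or left to a model that tolerates missing values) before training.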

### Evaluation Metrics:
- Mean Absolute Error (MAE): Measure the average magnitude of errors in predictions. Useful for understanding overall prediction accuracy.
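- Root Mean Squared Error (RMSE): Squares errors before averaging, so it penalizes large errors more heavily than MAE and is sensitive to outliers.
- R² (Coefficient of Determination): The proportion of variance in the actual values that the predictions explain. A value near 0 means the model does no better than always predicting the mean.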
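
To make the three metrics concrete, the following sketch computes them by hand on four made-up stays. The numbers are illustrative only and do not come from the dataset.

```python
import math

# Made-up actual and predicted lengths of stay, in days.
actual = [2, 4, 6, 8]
predicted = [3, 3, 7, 11]

n = len(actual)
errors = [p - a for a, p in zip(actual, predicted)]  # [1, -1, 1, 3]

# MAE: average absolute error.
mae = sum(abs(e) for e in errors) / n  # (1 + 1 + 1 + 3) / 4 = 1.5

# RMSE: square root of the average squared error; the single 3-day miss
# contributes 9 of the 12 squared-error units.
mse = sum(e * e for e in errors) / n  # 12 / 4 = 3.0
rmse = math.sqrt(mse)

# R²: 1 minus the ratio of residual to total sum of squares.
mean_actual = sum(actual) / n                          # 5.0
ss_res = sum(e * e for e in errors)                    # 12
ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # 20.0
r2 = 1 - ss_res / ss_tot                               # 0.4

print(f"MAE: {mae}, RMSE: {rmse:.3f}, R²: {r2:.2f}")
```

RMSE (about 1.73) exceeds MAE (1.5) here because one prediction misses by three days; the wider the gap between the two, the more the error is concentrated in a few large misses.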

In [7]:
import pandas as pd

# Ingest raw data
input_filepath = "../data/Data Insights - Synthetic Dataset.csv"
df = pd.read_csv(input_filepath)

Example of how to build this GBM model

In [12]:
# Example of the model build

# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Convert charge columns to numeric types
charge_columns = [
    'AccommodationCharge', 'CCU_Charges', 'ICU_Charge', 
    'TheatreCharge', 'PharmacyCharge', 'ProsthesisCharge', 
    'OtherCharges', 'BundledCharges'
]
for col in charge_columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Create TotalCharges column
df['TotalCharges'] = df[charge_columns].sum(axis=1)

# Convert 'AdmissionDate' and 'SeparationDate' to datetime
df['AdmissionDate'] = pd.to_datetime(df['AdmissionDate'], format='%d/%m/%Y')
df['SeparationDate'] = pd.to_datetime(df['SeparationDate'], format='%d/%m/%Y')

# Create LengthOfStay column
df['LengthOfStay'] = (df['SeparationDate'] - df['AdmissionDate']).dt.days

# Extract additional features
df['AdmissionMonth'] = df['AdmissionDate'].dt.month
df['DayOfWeek'] = df['AdmissionDate'].dt.dayofweek

# Split data
X = df[['PrincipalDiagnosis', 'Sex', 'UrgencyOfAdmission', 'TotalCharges', 'AdmissionMonth', 'DayOfWeek']]
y = df['LengthOfStay']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['TotalCharges', 'AdmissionMonth', 'DayOfWeek']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['PrincipalDiagnosis', 'Sex', 'UrgencyOfAdmission'])
    ])

# Define and train model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', GradientBoostingRegressor())
])
model.fit(X_train, y_train)

# Evaluate model
y_pred = model.predict(X_test)
print(f'MAE: {mean_absolute_error(y_test, y_pred)}')
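# Take the square root of the MSE directly; the `squared=False` argument is deprecated in recent scikit-learn releases
print(f'RMSE: {mean_squared_error(y_test, y_pred) ** 0.5}')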
print(f'R²: {r2_score(y_test, y_pred)}')


MAE: 7.492799604059682
RMSE: 8.639798561258903
R²: -0.000232219904684694
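
An R² of roughly zero means the fitted model explains essentially none of the variance in length of stay, which is what a model that always predicts the mean would achieve. One way to check this kind of result is to score a mean-predicting baseline next to the GBM. The sketch below does so on synthetic data in which the target is unrelated to the features by construction; it is an illustration under that assumption, not a rerun on the dataset above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the target is drawn independently of the features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = rng.integers(1, 30, size=500).astype(float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline that always predicts the training-set mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
gbm = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)

# Score both models on the same held-out rows.
scores = {}
for name, estimator in [("baseline", baseline), ("GBM", gbm)]:
    pred = estimator.predict(X_test)
    scores[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "R2": r2_score(y_test, pred),
    }
    print(f"{name}: MAE={scores[name]['MAE']:.3f}, R²={scores[name]['R2']:.3f}")
```

In the notebook, the same comparison could be made by fitting `DummyRegressor` on `X_train` and `y_train` and scoring it on `X_test`. If the GBM does not clearly beat the baseline, the selected features likely carry little signal about length of stay.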


