Hospital Length of Stay Classification 

We will turn the regression problem of predicting the continuous length of stay (`LOS`) into a classification problem, then fit and evaluate two classification models from sklearn: Logistic Regression and Random Forest. To do this, we convert the continuous `LOS` variable into a binary label indicating whether or not the patient stayed in the hospital longer than 6 days (about a week).

A summary of tasks:
1. Download and read cleaned data file. Do an EDA. 

2. Create a binary label `Outcome` indicating whether the patient stayed about a week or more in the hospital: if `LOS > 6` days then y=1 (positive label); if `LOS <= 6` days then y=0 (negative label). Check the distribution of this binary `Outcome` label. Split the dataset into a feature matrix and the outcome variable. Remember to drop the continuous `LOS` variable from the features: it is no longer needed, and keeping it would leak the outcome into the features.

3. Split the dataset into training and test sets. 

4. Fit a Logistic Regression model on the training set and output evaluation metrics (precision, recall and f1-score) for prediction on the test set. Also print the classification report for predictions on the test set.

5. Fit one other classifier (Random Forest Classifier or Support Vector Machine) and repeat step 4. Compare performance of both models.
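
The metrics named in step 4 can be computed by hand from confusion-matrix counts. The sketch below is only an illustration: the helper `binary_metrics` and the counts passed to it are hypothetical and are not taken from the LOS dataset.

```python
def binary_metrics(tp, fp, fn):
    """Precision, recall and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp)  # of the predicted positives, how many were correct
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives
precision, recall, f1 = binary_metrics(tp=30, fp=10, fn=20)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.75 recall=0.60 f1=0.67
```

`classification_report` reports these same three quantities once per class, treating each class in turn as the positive one.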


## Reading data from CSV

In [2]:
import pandas as pd

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
path = "/content/drive/MyDrive/Georgetown/AI_in_HealthCare/los_dataset_cleaned.csv"
data = pd.read_csv(path)
data.head()

Unnamed: 0,LOS,blood,circulatory,congenital,digestive,endocrine,genitourinary,infectious,injury,mental,...,AGE_newborn,AGE_senior,AGE_young_adult,MAR_DIVORCED,MAR_LIFE PARTNER,MAR_MARRIED,MAR_SEPARATED,MAR_SINGLE,MAR_UNKNOWN (DEFAULT),MAR_WIDOWED
0,1.144444,0.0,1.0,0.0,0.0,0.0,0.0,0.0,4.0,1.0,...,0,1,0,0,0,1,0,0,0,0
1,5.496528,0.0,4.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,...,0,1,0,0,0,1,0,0,0,0
2,6.768056,0.0,2.0,0.0,0.0,2.0,0.0,0.0,3.0,0.0,...,0,1,0,0,0,1,0,0,0,0
3,2.856944,0.0,2.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,1,0,0
4,3.534028,0.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0,1,0,0,0,1,0,0,0,0


## Creating binary outcome label





In [5]:
# Vectorized alternative: data["Outcome"] = (data["LOS"] > 6).astype(int)
# The instructions ask for a function used with .apply, so that form is used here.

def classify_los(los):
    return 1 if los > 6 else 0

data["Outcome"] = data["LOS"].apply(classify_los)
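
As a quick sanity check, the `.apply` form and the vectorized form of the labeling rule should agree. The sketch below uses the five `LOS` values from the `data.head()` preview plus one hypothetical stay of exactly 6.0 days (not from the dataset) to show how the boundary is handled.

```python
import pandas as pd

# First five LOS values from the preview, plus a hypothetical
# stay of exactly 6.0 days to illustrate the boundary case.
toy = pd.DataFrame({"LOS": [1.144444, 5.496528, 6.768056, 2.856944, 3.534028, 6.0]})


def classify_los(los):
    return 1 if los > 6 else 0


via_apply = toy["LOS"].apply(classify_los)
via_vectorized = (toy["LOS"] > 6).astype(int)

# Both forms should produce identical labels
print(via_apply.tolist() == via_vectorized.tolist())
print(via_vectorized.tolist())
```

Because the rule uses a strict `>`, a stay of exactly 6.0 days receives label 0.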

In [14]:
data[["LOS", "Outcome"]].head() # Checking if the outcome is labeled correctly

Unnamed: 0,LOS,Outcome
0,1.144444,0
1,5.496528,0
2,6.768056,1
3,2.856944,0
4,3.534028,0


In [9]:
print(data["Outcome"].value_counts(normalize=True))  # Proportion
print(data["Outcome"].value_counts())  # Absolute count

Outcome
1    0.539648
0    0.460352
Name: proportion, dtype: float64
Outcome
1    27542
0    23495
Name: count, dtype: int64


In [10]:
X = data.drop(columns=["LOS", "Outcome"])  # Features: drop LOS (it would leak the label) and the label itself
y = data["Outcome"]  # Target variable

## Splitting the Dataset into Train and Test


In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
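
Passing `stratify=y` asks `train_test_split` to keep the class proportions roughly equal in the training and test sets. A minimal sketch of that behavior, using synthetic labels (an assumption made for illustration, not the real data) with a positive rate close to the one observed above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 54% positive labels
y_demo = np.array([1] * 540 + [0] * 460)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"overall positive rate: {y_demo.mean():.3f}")
print(f"train positive rate:   {y_tr.mean():.3f}")
print(f"test positive rate:    {y_te.mean():.3f}")
```

The same effect is visible in this notebook's own results: the test set in the classification reports contains 5509 positives out of 10208 rows, about 54%, in line with the overall proportion.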


## Training the Logistic Regression Model

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Train Logistic Regression
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)

# Predictions on test set
y_pred = logreg.predict(X_test)

# Print evaluation metrics
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.69      0.65      0.67      4699
           1       0.72      0.75      0.73      5509

    accuracy                           0.71     10208
   macro avg       0.70      0.70      0.70     10208
weighted avg       0.71      0.71      0.71     10208



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
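
The warning above suggests two remedies: scaling the features or raising `max_iter`. The sketch below combines both in a pipeline. It runs on synthetic data from `make_classification` (an assumption, since the LOS file is not bundled here), and `max_iter=1000` is an arbitrary illustrative value, so it does not reproduce or replace the reported results.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one column is inflated to mimic features on very different scales
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo[:, 0] *= 1000.0

# Standardize the features, then fit with a higher iteration cap
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)

# Record any warnings raised during fitting so they can be inspected
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    scaled_logreg.fit(X_demo, y_demo)

convergence_warnings = [w for w in caught if issubclass(w.category, ConvergenceWarning)]
print("convergence warnings:", len(convergence_warnings))
print("training accuracy:", round(scaled_logreg.score(X_demo, y_demo), 3))
```

Keeping the scaler inside the pipeline means it is fitted on the training data only, which avoids leaking test-set statistics into the model.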


## Training Another Classifier: Random Forest

In [13]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42, n_estimators=100)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest Classifier Results:")
print(classification_report(y_test, y_pred_rf))


Random Forest Classifier Results:
              precision    recall  f1-score   support

           0       0.71      0.62      0.66      4699
           1       0.71      0.78      0.74      5509

    accuracy                           0.71     10208
   macro avg       0.71      0.70      0.70     10208
weighted avg       0.71      0.71      0.70     10208



## Conclusion and Comparison of Results


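- Logistic Regression (LR) and Random Forest (RF) reached the same overall accuracy on the test set (71%); they differ mainly in how they trade recall between the two classes.

- RF had higher recall for longer stays (78% vs. 75%), so it identified slightly more of the patients who stayed more than 6 days, at the cost of lower recall for shorter stays.

- LR was more balanced across the two classes, with slightly higher recall for shorter stays (65% vs. 62%). Precision was similar for both models (69% and 72% for LR, 71% for both classes with RF).

- If catching longer stays matters most, RF is slightly preferable. If balanced performance across both classes is needed, LR performs just as well overall. The gaps are about 3 percentage points on a single train/test split, and the LR fit stopped at its iteration limit, so these differences should be read as small.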
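
When recall for longer stays is the priority, another option besides switching models is to lower the decision threshold applied to the predicted probabilities. The sketch below illustrates the idea on synthetic data (an assumption, not the LOS dataset). The threshold of 0.3 is an arbitrary example value, and the default `predict` corresponds approximately to a cutoff of 0.5.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the LOS dataset
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
proba_pos = clf.predict_proba(X_te)[:, 1]  # estimated probability of class 1

results = {}
for threshold in (0.5, 0.3):  # 0.3 is an arbitrary illustrative value
    y_hat = (proba_pos >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, y_hat, zero_division=0),
        recall_score(y_te, y_hat),
    )
    print(
        f"threshold={threshold}: precision={results[threshold][0]:.2f}, "
        f"recall={results[threshold][1]:.2f}"
    )
```

Lowering the threshold can only keep or increase recall for class 1, usually at the cost of precision, so the threshold would need to be chosen on validation data rather than on the test set.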