Akshitha - Week 09 Machine Learning with Scikit-learn

1. Among the different classification models included in the Python notebook, which model had the best overall performance? Support your response by referencing appropriate evidence.

Answer.

Judged by training accuracy alone, the strongest model in the Python notebook is the **Random Forest model without cross-validation** (`RandomForest_noCV`), which reached 99.93% accuracy on the training set. A training accuracy that high shows the model has largely memorized the training data, which is typically a sign of overfitting: a model fitted this closely to its training set tends to perform poorly on new, unseen data.

The testing accuracy of `RandomForest_noCV` confirms the overfitting: it falls to **68.6%**. The gap of roughly 31 percentage points between training and testing accuracy indicates that the model is too complex and has learned noise in the training data rather than patterns that generalize. By contrast, the **logistic regression models** (standard, L1-penalized with varying `C` values, and cross-validated) all achieved testing accuracy between **70.6% and 71.8%**. Their performance is stable across configurations and much closer to their training accuracy, so they generalize better.

The **Random Forest model tuned with grid search** (`RandomForest_CV`) achieved better testing accuracy than the untuned model because its hyperparameters were selected by cross-validation. Its training accuracy remained high, however, so generalization improved but overfitting was not eliminated. A second tuned model, `RandomForest_CV2`, also optimized `max_depth` to limit how closely the individual trees fit the training data. Its testing accuracy matched that of `RandomForest_CV`, which suggests that hyperparameter tuning helps but does not remove overfitting when regularization is insufficient.
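The tuning step described above can be sketched roughly as follows. This is a minimal illustration only, not the notebook's actual code: the data is synthetic (`make_classification`), and the `max_depth` candidates, `n_estimators`, and `cv=3` are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): the notebook's real features are not used here.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid: these depth candidates are for illustration only.
param_grid = {"max_depth": [2, 4, 8, None]}

# Cross-validated search over tree depth; the best model is refit on all training data.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
train_acc = accuracy_score(y_train, search.predict(X_train))
test_acc = accuracy_score(y_test, search.predict(X_test))
print(f"best max_depth={best_depth}, train={train_acc:.3f}, test={test_acc:.3f}")
```

Comparing `train_acc` with `test_acc` after the search is what reveals whether limiting depth actually narrowed the overfitting gap.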

The Random Forest models delivered the highest training accuracy, but on new data they performed similarly to the logistic regression models. The **Logistic Regression model with L1 penalty (`C=10`) reached 73.47% accuracy on the training set and 71.8% on the testing set.** This model offers the best balance between fitting the training data and generalizing to unseen data, which makes it the most reliable choice and the best overall performer on the evidence presented.
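The comparison above comes down to the gap between training and testing accuracy. A small sketch of that arithmetic, using only the accuracies quoted in this answer (the dictionary key `LogisticRegression_L1_C10` is a label made up for this example):

```python
# Accuracies quoted in the answer above, expressed as fractions.
reported = {
    "RandomForest_noCV": {"train": 0.9993, "test": 0.686},
    "LogisticRegression_L1_C10": {"train": 0.7347, "test": 0.718},  # hypothetical label
}

def generalization_gap(scores):
    """Training accuracy minus testing accuracy; a larger gap suggests more overfitting."""
    return scores["train"] - scores["test"]

gaps = {name: generalization_gap(s) for name, s in reported.items()}
for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.4f}")
```

The Random Forest gap is about 31 percentage points, while the L1 logistic regression gap is under 2 percentage points.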

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import time

In [2]:
df_patient = pd.read_csv('./PatientAnalyticFile.csv')
df_patient['mortality'] = np.where(df_patient['DateOfDeath'].isnull(), 0, 1)

df_patient['DateOfBirth'] = pd.to_datetime(df_patient['DateOfBirth'])
df_patient['Age_years'] = ((pd.to_datetime('2015-01-01') - df_patient['DateOfBirth']).dt.days / 365.25)

vars_remove = ['PatientID', 'First_Appointment_Date', 'DateOfBirth', 'Last_Appointment_Date', 'DateOfDeath']
df_patient = df_patient.drop(columns=vars_remove)

df_patient = pd.get_dummies(df_patient, drop_first=True)

In [3]:
# Define predictors (X) and target (y)
X = df_patient.drop('mortality', axis=1)
y = df_patient['mortality']

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Solvers to test
solvers = ['liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga']

# Store results (re-running the loop cell without re-running this cell appends duplicate rows)
results = []

In [5]:
for solver in solvers:
    start_time = time.time()

    # Initialize and fit model
    model = LogisticRegression(solver=solver, max_iter=1000)
    model.fit(X_train, y_train)

    time_taken = time.time() - start_time

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)

    results.append([solver, train_accuracy, test_accuracy, time_taken])

results_df = pd.DataFrame(results, columns=['Solver', 'Training Accuracy', 'Testing Accuracy', 'Time Taken (seconds)'])
print(results_df)

      Solver  Training Accuracy  Testing Accuracy  Time Taken (seconds)
0  liblinear           0.748125           0.73625              0.247304
1      lbfgs           0.747938           0.73600              2.181107
2  newton-cg           0.748062           0.73575              0.588134
3        sag           0.748125           0.73625              6.544846
4       saga           0.748125           0.73600              8.889797
5  liblinear           0.748125           0.73625              0.049444
6      lbfgs           0.747938           0.73600              0.521924
7  newton-cg           0.748062           0.73575              0.370197
8        sag           0.748125           0.73625              6.550987
9       saga           0.748125           0.73600              8.219650


Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?



The solver that delivered the best results is **`liblinear`**. The ranking is based on all three criteria: testing (holdout) accuracy, training accuracy, and execution time. Testing accuracy carries the most weight because it indicates how well the model generalizes to data it has not seen; training accuracy and execution time serve as secondary criteria.

The `liblinear` solver reached a testing accuracy of **0.73625**, tied with `sag` for the highest among the five solvers; `lbfgs` and `saga` followed at 0.73600 and `newton-cg` at 0.73575. Its **training accuracy** of **0.748125** was also tied for the highest (with `sag` and `saga`). The output table lists each solver twice, apparently because the loop cell was run twice without resetting `results`; the accuracies are identical in both runs and only the timings differ. In the second run `liblinear` finished in **0.049444 seconds** (0.247304 seconds in the first), the shortest time of any solver, while `sag` and `saga` needed more than **6.5 seconds** and **8.2 seconds** respectively.

The `lbfgs` and `newton-cg` solvers were also slower than `liblinear` while offering comparable accuracy: `lbfgs` took **0.521924 seconds** and `newton-cg` took **0.370197 seconds**. Because neither solver matched the testing accuracy of `liblinear`, let alone exceeded it, the extra run time is purely a disadvantage.
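Single-run timings can vary noticeably (in the output above, `liblinear` took 0.247304 seconds in one run and 0.049444 seconds in another), so one way to make the timing comparison more stable is to repeat each fit and keep the fastest run. The helper `best_of` below is a hypothetical name and is not part of the notebook; the workload is a stand-in so the sketch runs on its own.

```python
import time

def best_of(fn, repeats=3):
    """Run fn() `repeats` times and return the shortest wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return min(durations)

# Stand-in workload (assumption). In the notebook this could be something like
# lambda: LogisticRegression(solver=solver, max_iter=1000).fit(X_train, y_train)
elapsed = best_of(lambda: sum(i * i for i in range(10_000)))
print(f"fastest of 3 runs: {elapsed:.6f} s")
```

`time.perf_counter` is used here instead of `time.time` because it is intended for measuring short durations.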

The best solver was therefore chosen by weighing three factors: generalization (testing accuracy), fit (training accuracy), and speed (execution time). The accuracy differences between solvers are very small (at most 0.0005 in testing accuracy), so execution time is the main differentiator. `liblinear` is the top choice because it tied for the best accuracy with the shortest execution time, which makes it the most suitable solver for this logistic regression problem.
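The ranking logic described above can be expressed as a short sketch: sort by testing accuracy first and break ties by execution time. The values are copied from rows 5-9 of the printed results; training accuracy is carried along for reference but is not part of the sort key in this simplified version.

```python
# (solver, training accuracy, testing accuracy, seconds) from rows 5-9 of the output.
rows = [
    ("liblinear", 0.748125, 0.73625, 0.049444),
    ("lbfgs", 0.747938, 0.73600, 0.521924),
    ("newton-cg", 0.748062, 0.73575, 0.370197),
    ("sag", 0.748125, 0.73625, 6.550987),
    ("saga", 0.748125, 0.73600, 8.219650),
]

# Highest testing accuracy first; ties go to the faster solver.
ranked = sorted(rows, key=lambda r: (-r[2], r[3]))
ranking = [r[0] for r in ranked]
print(ranking)
# → ['liblinear', 'sag', 'lbfgs', 'saga', 'newton-cg']
```

Under this ordering `liblinear` and `sag` share the top testing accuracy, and the much shorter run time puts `liblinear` first.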