Week 09 - Machine Learning with Scikit-learn

Vaishnavi Tapetla

For this week’s assignment, you are required to investigate the accuracy versus computation time tradeoffs of the different optimization algorithms (solvers) that are available for fitting logistic regression models in Scikit-Learn. Using the code shared via the Python notebook (part of this week’s uploads archive) where the use of logistic regression was demonstrated, complete the following operations:

Question 1



Among the different classification models included in the Python notebook, which model had the best overall performance? Support your response by referencing appropriate evidence.




The get_results(result_scores) function prints a summary of the final results, showing training and test accuracy for each model. The models are ranked primarily on test accuracy, since it measures how well a model generalizes, with training accuracy used to check for overfitting.
The Random Forest model without cross-validation had the highest training accuracy (0.9993) but a test accuracy of only 0.686, a gap of about 0.31 that indicates severe overfitting. The highest test accuracy, 0.718, was shared by the standard Logistic Regression model and the Logistic Regression model with an L1 penalty and C=10. Their training accuracies were 0.7333 and 0.7347 respectively, within about 0.02 of their test accuracy, so neither shows evidence of overfitting. The Random Forest models with cross-validation likely achieved a better balance between training and test performance than the model without it, but their final scores were not visible in the output.
Among the models with visible results, the Logistic Regression model with L1 penalty and C=10 has the best overall performance: it ties for the highest test accuracy (0.718) and has the slightly higher training accuracy (0.7347). It balances fit to the training data against prediction on unseen data, which makes it the most suitable choice for this classification problem.
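The ranking logic described above can be sketched in a few lines. The figures are the three sets of scores quoted in this answer; the `rank_models` helper and the tie-breaking rule (higher training accuracy wins when test accuracy is equal) are illustrative assumptions, not part of the course notebook.

```python
# Reported accuracies quoted in the answer above: (train, test).
reported = {
    "Random Forest (no CV)": (0.9993, 0.686),
    "Logistic Regression": (0.7333, 0.718),
    "Logistic Regression (L1, C=10)": (0.7347, 0.718),
}

def rank_models(scores):
    """Sort by test accuracy (descending); ties go to the higher training accuracy.

    The tie-break is an assumption made for this sketch.
    """
    rows = [
        {"model": name, "train": tr, "test": te, "gap": round(tr - te, 4)}
        for name, (tr, te) in scores.items()
    ]
    return sorted(rows, key=lambda r: (-r["test"], -r["train"]))

ranking = rank_models(reported)
for row in ranking:
    print(f"{row['model']:<32} train={row['train']:.4f} "
          f"test={row['test']:.4f} gap={row['gap']:.4f}")
```

The `gap` column makes the overfitting visible at a glance: roughly 0.31 for the Random Forest without cross-validation versus under 0.02 for both logistic regression models.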

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from patsy import dmatrices
import time
import matplotlib.pyplot as plt

In [2]:
patient_data = pd.read_csv('PatientAnalyticFile.csv')
patient_data['mortality'] = np.where(patient_data['DateOfDeath'].isnull(), 0, 1)

# Convert DateOfBirth to datetime and calculate age
patient_data['DateOfBirth'] = pd.to_datetime(patient_data['DateOfBirth'])
patient_data['Age_years'] = ((pd.to_datetime('2015-01-01') - patient_data['DateOfBirth']).dt.days/365.25)

# Creating formula
vars_remove = ['PatientID', 'First_Appointment_Date', 'DateOfBirth',
               'Last_Appointment_Date', 'DateOfDeath', 'mortality']
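# Sort the names so the formula (and design-matrix column order) is reproducible;
# iterating over a raw set can change order between runs
vars_left = sorted(set(patient_data.columns) - set(vars_remove))
formula = "mortality ~ " + " + ".join(vars_left)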

In [3]:
Y, X = dmatrices(formula, patient_data)

# Split data into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, np.ravel(Y), test_size=0.2, random_state=42)

In [4]:
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

# Dictionary to store results
results = {
    'Solver': [],
    'Training Accuracy': [],
    'Holdout Accuracy': [],
    'Time Taken (seconds)': []
}

In [5]:
for solver in solvers:
    # Start timing (perf_counter is a monotonic, high-resolution clock)
    start_time = time.perf_counter()

    # Create and fit model
    model = LogisticRegression(solver=solver, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)

    # End timing; only the fit itself is measured
    end_time = time.perf_counter()
    time_taken = end_time - start_time

    # Calculate accuracies
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, model.predict(X_test))

    # Store results
    results['Solver'].append(solver)
    results['Training Accuracy'].append(round(train_accuracy, 4))
    results['Holdout Accuracy'].append(round(test_accuracy, 4))
    results['Time Taken (seconds)'].append(round(time_taken, 4))

# Convert results to DataFrame for better display
results_df = pd.DataFrame(results)



In [6]:
print(results_df)

      Solver  Training Accuracy  Holdout Accuracy  Time Taken (seconds)
0  newton-cg             0.7481            0.7362                0.2802
1      lbfgs             0.7483            0.7362                2.2068
2  liblinear             0.7479            0.7362                0.1638
3        sag             0.7481            0.7362               24.2829
4       saga             0.7480            0.7362               10.9742


Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?





The five solvers performed almost identically on accuracy. Training accuracy ranged from 0.7479 (liblinear) to 0.7483 (lbfgs), a spread of 0.0004 that is too small to represent a meaningful difference. All five solvers reached the same holdout accuracy of 0.7362, so they predict unseen data equally well.
Because accuracy does not separate the solvers, execution time becomes the deciding factor, and here the solvers differ substantially. liblinear was the fastest at 0.1638 seconds, followed by newton-cg at 0.2802 seconds and lbfgs at 2.2068 seconds. saga and sag were far slower, at 10.9742 and 24.2829 seconds respectively.
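Each time reported here comes from a single fit, so it includes some run-to-run noise. One way to make the comparison more robust is to repeat each fit and take the median, as in the minimal sketch below. The `median_fit_time` helper is hypothetical and the workload is a stand-in so the example runs on its own; in the notebook the callable would wrap the `LogisticRegression(...).fit(X_train, y_train)` call for a given solver.

```python
import statistics
import time

def median_fit_time(fit_fn, repeats=5):
    """Call fit_fn() `repeats` times and return the median duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit_fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

# Stand-in workload (an assumption for this sketch). In the notebook this could be:
#   lambda: LogisticRegression(solver=solver, max_iter=1000).fit(X_train, y_train)
elapsed = median_fit_time(lambda: sum(i * i for i in range(10_000)), repeats=3)
print(f"median time: {elapsed:.6f} s")
```

The median is used rather than the mean so that one unusually slow run, for example from a background process, does not distort the result.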
Taking all three metrics together, liblinear yielded the best results. It matched the other solvers on holdout accuracy and came within 0.0004 of the best training accuracy, while fitting about 1.7 times faster than newton-cg, 13.5 times faster than lbfgs, 67 times faster than saga, and 148 times faster than sag. That speed advantage makes retraining much cheaper with no loss in predictive performance.
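The speed ratios quoted above follow directly from the timing column of the results table. As a quick check, the sketch below recomputes them; the `fit_times` dictionary simply copies the reported times, and the variable names are illustrative.

```python
# Fit times in seconds, copied from the results table above.
fit_times = {
    "newton-cg": 0.2802,
    "lbfgs": 2.2068,
    "liblinear": 0.1638,
    "sag": 24.2829,
    "saga": 10.9742,
}

# Identify the fastest solver and express every time as a multiple of it.
fastest = min(fit_times, key=fit_times.get)
slowdown = {solver: t / fit_times[fastest] for solver, t in fit_times.items()}

for solver, ratio in sorted(slowdown.items(), key=lambda kv: kv[1]):
    print(f"{solver:<10} {fit_times[solver]:>8.4f} s  {ratio:6.1f}x")
```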