# Lab: Trees and Model Stability

Trees are notorious for being **unstable**: Small changes in the data can lead to noticeable or large changes in the tree. We're going to explore this phenomenon, and a common rebuttal.

In the folder for this lab, there are three datasets that we used in class: Divorce, heart failure, and the AirBnB price dataset.

1. Pick one of the datasets and appropriately clean it.
2. Perform a train-test split for a specific seed (save the seed for reproducibility). Fit a classification/regression tree and a linear model on the training data and evaluate their performance on the test data. Set aside the predictions these models make.
3. Repeat step 2 for three to five different seeds (save the seeds for reproducibility). How different are the trees that you get? Your linear model coefficients?. Set aside the predictions these models make.

Typically, you would see the trees changing what appears to be a non-trivial amount, while the linear model coefficients don't vary nearly as much. Often, the changes appear substantial.

But are they?

4. Instead of focusing on the tree or model coefficients, do three things:
    1. Make scatterplots of the predicted values on the test set from question 2 against the predicted values for the alternative models from part 3, separately for your trees and linear models. Do they appear reasonably similar?
    2. Compute the correlation between your model in part 2 and your alternative models in part 3, separately for your trees and linear models. Are they highly correlated or not?
    3. Run a simple linear regression of the predicted values on the test set from the alternative models on the predicted values from question 2, separately for your trees and linear models. Is the intercept close to zero? Is the slope close to 1? Is the $R^2$ close to 1?

5. Do linear models appear to have similar coefficients and predictions across train/test splits? Do trees?
6. True or false, and explain: "Even if the models end up having a substantially different appearance, the predictions they generate are often very similar."

In [10]:
#part 1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

heart = pd.read_csv('/content/heart_failure_clinical_records_dataset.csv')

print("First 5 rows of the DataFrame:")
print(heart.head())

print("\nDataFrame Information:")
heart.info()

print("\nMissing values per column:")
print(heart.isnull().sum())
#No missing values and no objects in the dataset

First 5 rows of the DataFrame:
    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  \
0  75.0        0                       582         0                 20   
1  55.0        0                      7861         0                 38   
2  65.0        0                       146         0                 20   
3  50.0        1                       111         0                 20   
4  65.0        1                       160         1                 20   

   high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  \
0                    1  265000.00               1.9           130    1   
1                    0  263358.03               1.1           136    1   
2                    0  162000.00               1.3           129    1   
3                    0  210000.00               1.9           137    1   
4                    0  327000.00               2.7           116    0   

   smoking  time  DEATH_EVENT  
0        0     4            1  
1        

In [3]:
X = heart.drop('DEATH_EVENT', axis=1)
y =heart['DEATH_EVENT']

print("Features (X) shape:", X.shape)
print("Target (y) shape:", y.shape)
print("\nFirst 5 rows of features (X):")
print(X.head())
print("\nFirst 5 values of target (y):")
print(y.head())

Features (X) shape: (299, 12)
Target (y) shape: (299,)

First 5 rows of features (X):
    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  \
0  75.0        0                       582         0                 20   
1  55.0        0                      7861         0                 38   
2  65.0        0                       146         0                 20   
3  50.0        1                       111         0                 20   
4  65.0        1                       160         1                 20   

   high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  \
0                    1  265000.00               1.9           130    1   
1                    0  263358.03               1.1           136    1   
2                    0  162000.00               1.3           129    1   
3                    0  210000.00               1.9           137    1   
4                    0  327000.00               2.7           116    0   

   smoking  time  

In [4]:
#train-test split

from sklearn.model_selection import train_test_split

random_seed = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_seed)

print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (239, 12)
y_train shape: (239,)
X_test shape: (60, 12)
y_test shape: (60,)


In [5]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Initialize and train Decision Tree Classifier
dt_model = DecisionTreeClassifier(random_state=random_seed)
dt_model.fit(X_train, y_train)

# Make predictions with Decision Tree
dt_predictions = dt_model.predict(X_test)

# Initialize and train Logistic Regression model
lr_model = LogisticRegression(random_state=random_seed, solver='liblinear') # 'liblinear' solver for small datasets
lr_model.fit(X_train, y_train)

# Make predictions with Logistic Regression
lr_predictions = lr_model.predict(X_test)

print("Decision Tree predictions (first 5):")
print(dt_predictions[:5])
print("\nLogistic Regression predictions (first 5):")
print(lr_predictions[:5])

Decision Tree predictions (first 5):
[1 0 0 1 0]

Logistic Regression predictions (first 5):
[0 0 0 1 0]


In [6]:
from sklearn.metrics import accuracy_score, classification_report

# Evaluate Decision Tree Classifier
dt_accuracy = accuracy_score(y_test, dt_predictions)
print(f"Decision Tree Accuracy: {dt_accuracy:.4f}")
print("\nDecision Tree Classification Report:")
print(classification_report(y_test, dt_predictions))

# Evaluate Logistic Regression Model
lr_accuracy = accuracy_score(y_test, lr_predictions)
print(f"\nLogistic Regression Accuracy: {lr_accuracy:.4f}")
print("\nLogistic Regression Classification Report:")
print(classification_report(y_test, lr_predictions))

Decision Tree Accuracy: 0.6333

Decision Tree Classification Report:
              precision    recall  f1-score   support

           0       0.66      0.77      0.71        35
           1       0.58      0.44      0.50        25

    accuracy                           0.63        60
   macro avg       0.62      0.61      0.61        60
weighted avg       0.63      0.63      0.62        60


Logistic Regression Accuracy: 0.7833

Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       0.75      0.94      0.84        35
           1       0.88      0.56      0.68        25

    accuracy                           0.78        60
   macro avg       0.81      0.75      0.76        60
weighted avg       0.80      0.78      0.77        60



In [7]:
#pt 3

random_seeds = [10, 20, 30, 40, 50]

dt_predictions_per_seed = []
dt_structures_per_seed = []
lr_predictions_per_seed = []
lr_coefficients_per_seed = []

print(f"Random seeds defined: {random_seeds}")
print("Empty lists initialized for storing model results per seed.")

Random seeds defined: [10, 20, 30, 40, 50]
Empty lists initialized for storing model results per seed.


In [14]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

for seed in random_seeds:
    print(f"\nProcessing with random seed: {seed}")

    # 1. Train-test split
    X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X, y, test_size=0.2, random_state=seed)

    # 2. Train Decision Tree Classifier
    dt_model_s = DecisionTreeClassifier(random_state=seed)
    dt_model_s.fit(X_train_s, y_train_s)
    dt_predictions_s = dt_model_s.predict(X_test_s)
    dt_predictions_per_seed.append(dt_predictions_s)

    # Store Decision Tree structure
    dt_structures_per_seed.append({
        'node_count': dt_model_s.tree_.node_count,
        'max_depth': dt_model_s.tree_.max_depth
    })
    print(f"Decision Tree: nodes={dt_model_s.tree_.node_count}, max_depth={dt_model_s.tree_.max_depth}")

    # 3. Train Logistic Regression model
    lr_model_s = LogisticRegression(random_state=seed, solver='liblinear', max_iter=1000)
    lr_model_s.fit(X_train_s, y_train_s)
    lr_predictions_s = lr_model_s.predict(X_test_s)
    lr_predictions_per_seed.append(lr_predictions_s)

    # Store Logistic Regression coefficients
    lr_coefficients_per_seed.append({
        'coefficients': lr_model_s.coef_,
        'intercept': lr_model_s.intercept_
    })

print("\nModel training and prediction storage complete for all seeds.")



Processing with random seed: 10
Decision Tree: nodes=63, max_depth=8

Processing with random seed: 20
Decision Tree: nodes=69, max_depth=9

Processing with random seed: 30
Decision Tree: nodes=69, max_depth=9

Processing with random seed: 40
Decision Tree: nodes=69, max_depth=8

Processing with random seed: 50
Decision Tree: nodes=61, max_depth=9

Model training and prediction storage complete for all seeds.


In [13]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

print("--- Model Accuracies per Seed ---")
for i, seed in enumerate(random_seeds):
    # train-test split
    X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X, y, test_size=0.2, random_state=seed)

    # Get predictions for the current seed
    dt_preds = dt_predictions_per_seed[i]
    lr_preds = lr_predictions_per_seed[i]

    # Calculate accuracies
    dt_acc = accuracy_score(y_test_s, dt_preds)
    lr_acc = accuracy_score(y_test_s, lr_preds)

    print(f"\nSeed {seed}:")
    print(f"  Decision Tree Accuracy: {dt_acc:.4f}")
    print(classification_report(y_test, dt_preds))

    print(f"  Logistic Regression Accuracy: {lr_acc:.4f}")
    print(classification_report(y_test, lr_preds))



--- Model Accuracies per Seed ---

Seed 10:
  Decision Tree Accuracy: 0.7667
              precision    recall  f1-score   support

           0       0.54      0.63      0.58        35
           1       0.32      0.24      0.27        25

    accuracy                           0.47        60
   macro avg       0.43      0.43      0.43        60
weighted avg       0.44      0.47      0.45        60

  Logistic Regression Accuracy: 0.8167
              precision    recall  f1-score   support

           0       0.64      0.80      0.71        35
           1       0.56      0.36      0.44        25

    accuracy                           0.62        60
   macro avg       0.60      0.58      0.57        60
weighted avg       0.61      0.62      0.60        60


Seed 20:
  Decision Tree Accuracy: 0.7167
              precision    recall  f1-score   support

           0       0.63      0.69      0.66        35
           1       0.50      0.44      0.47        25

    accuracy           