<a href="https://colab.research.google.com/github/dh-kt/Statistical_Learning_Resampling-Method/blob/main/notebooks/validation_set_approach.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# For statistical modeling
import statsmodels.api as sm
import statsmodels.formula.api as smf

# For machine learning (we'll use this for train/test split)
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report


In [2]:
# Set random seed for reproducibility
np.random.seed(42)
print(" All libraries imported successfully!")

 All libraries imported successfully!


In [4]:
#Preparing data to load
from google.colab import files
uploaded = files.upload()

Saving Default.csv to Default (1).csv


In [5]:
#Load data
default = pd.read_csv('Default.csv')

#Inspect data
print('Data Info:')
print(default.head())
print('\n')

Data Info:
  default student      balance        income
0      No      No   729.526495  44361.625074
1      No     Yes   817.180407  12106.134700
2      No      No  1073.549164  31767.138947
3      No      No   529.250605  35704.493935
4      No      No   785.655883  38463.495879




In [6]:
# Check for missing values
print('Missing Values:')
print(default.isnull().sum())

Missing Values:
default    0
student    0
balance    0
income     0
dtype: int64


In [7]:
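# Convert 'default' and 'student' from strings ('No'/'Yes') to numeric (0/1).
# Note: the encoded student flag is written to a new column named 'stuent',
# so the original 'student' column remains a string column.
default['default'] = default['default'].map({'No': 0, 'Yes': 1})
default['stuent'] = default['student'].map({'No': 0, 'Yes': 1})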

In [9]:
#Check the structure of the data
print("Data Info:")
print(default.info())
print("\n")

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   default  10000 non-null  int64  
 1   student  10000 non-null  object 
 2   balance  10000 non-null  float64
 3   income   10000 non-null  float64
 4   stuent   10000 non-null  int64  
dtypes: float64(2), int64(2), object(1)
memory usage: 390.8+ KB
None




In [10]:
#Statistics Summary of the data
print("\n Summary Statistics:")
print(default.describe())


 Summary Statistics:
            default       balance        income        stuent
count  10000.000000  10000.000000  10000.000000  10000.000000
mean       0.033300    835.374886  33516.981876      0.294400
std        0.179428    483.714985  13336.639563      0.455795
min        0.000000      0.000000    771.967729      0.000000
25%        0.000000    481.731105  21340.462903      0.000000
50%        0.000000    823.636973  34552.644802      0.000000
75%        0.000000   1166.308386  43807.729272      1.000000
max        1.000000   2654.322576  73554.233495      1.000000


In [11]:
# Let's also check if there are any other categorical columns that might need conversion
print('\n Other categorical columns:')
categorical_cols = default.select_dtypes(include=['object']).columns
print(categorical_cols)


 Other categorical columns:
Index(['student'], dtype='object')


In [12]:
# Keep only the numeric columns for modeling (drops the string 'student' column)
df = default.select_dtypes(include='number')

In [13]:
# Fit logistic regression model using income and balance to predict default
logit_model = smf.logit(formula='default ~ income + balance', data=df)
fitted_model = logit_model.fit()

#Display the model summary
print(fitted_model.summary())

Optimization terminated successfully.
         Current function value: 0.078948
         Iterations 10
                           Logit Regression Results                           
Dep. Variable:                default   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9997
Method:                           MLE   Df Model:                            2
Date:                Thu, 21 Aug 2025   Pseudo R-squ.:                  0.4594
Time:                        18:09:16   Log-Likelihood:                -789.48
converged:                       True   LL-Null:                       -1460.3
Covariance Type:            nonrobust   LLR p-value:                4.541e-292
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -11.5405      0.435    -26.544      0.000     -12.393     -10.688
income      2.081e-05   4.99

In [14]:
# Calculate odds ratios for practical interpretation
odds_ratios = np.exp(fitted_model.params)
print("\n📊 Odds Ratios (more interpretable):")
print(f"Income: 1 unit increase → {odds_ratios['income']:.6f}x odds of default")
print(f"Balance: 1 unit increase → {odds_ratios['balance']:.6f}x odds of default")
print(f"For every $1000 increase in balance: {np.exp(0.0056*1000):.2f}x odds of default")


📊 Odds Ratios (more interpretable):
Income: 1 unit increase → 1.000021x odds of default
Balance: 1 unit increase → 1.005663x odds of default
For every $1000 increase in balance: 270.43x odds of default


Model Results Interpretation
1. Model Convergence & Fit
Optimization terminated successfully after 10 iterations

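Current function value: 0.0789 - the average negative log-likelihood per observation at the optimum (lower is better when comparing models fitted to the same data)

Pseudo R-squared: 0.4594 - McFadden's pseudo R-squared, 1 - (Log-Likelihood / LL-Null). It measures the relative improvement in log-likelihood over an intercept-only model, not the share of variance explained as R-squared does in linear regression. A value of 0.46 is high for a logistic regression model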
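As a quick check, the pseudo R-squared can be reproduced from the two log-likelihoods printed in the model summary. This is a minimal sketch that applies McFadden's definition to the rounded values shown in the output; the variable names are illustrative.

```python
# Log-likelihoods copied from the model summary (rounded as printed)
log_likelihood = -789.48   # fitted model: default ~ income + balance
ll_null = -1460.3          # intercept-only (null) model

# McFadden's pseudo R-squared: relative improvement over the null model
pseudo_r2 = 1 - log_likelihood / ll_null
print(f"McFadden pseudo R-squared: {pseudo_r2:.4f}")
```

The result agrees with the 0.4594 reported in the summary, up to rounding of the inputs.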

2. Statistical Significance
All coefficients are highly statistically significant (p-value < 0.001). The summary table prints each p-value as 0.000 because it rounds to three decimal places:

Income: p-value < 0.001

Balance: p-value < 0.001

Intercept: p-value < 0.001

3. Coefficient Interpretation
Intercept: -11.5405 - The log-odds of default when both income and balance are zero (theoretical baseline)

Income: 2.081e-05 - For each $1 increase in income, the log-odds of default increase by 0.00002081

Balance: 0.0056 - For each $1 increase in credit card balance, the log-odds of default increase by 0.0056
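
To make the log-odds scale concrete, the sketch below converts the balance coefficient into odds ratios for a few increments. It uses the rounded value 0.0056 quoted above, so the results are approximate; the helper name `odds_ratio` is illustrative and not part of statsmodels.

```python
import math

balance_coef = 0.0056  # rounded log-odds change per $1 of balance (from the text above)

def odds_ratio(coef, delta):
    """Multiplicative change in the odds when the predictor rises by `delta`."""
    return math.exp(coef * delta)

for delta in (1, 100, 1000):
    ratio = odds_ratio(balance_coef, delta)
    print(f"+${delta} balance -> odds multiplied by {ratio:.2f}")
```

Note that these factors multiply the odds of default, not the probability; the change in probability depends on the starting point.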
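
4. Important Note: Quasi-Separation Warning
The warning about "quasi-separation" indicates that:

For about 14% of observations the fitted probability is numerically indistinguishable from 0 or 1, so those outcomes are predicted (almost) perfectly

This is a diagnostic, not a measure of model quality: under true separation some coefficients cannot be identified, and their estimates and standard errors become unreliable

Here the optimizer converged in 10 iterations with finite coefficients and standard errors, so the estimates appear usable, but the warning is worth rechecking whenever predictors are added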

5. Conclusion
The logistic regression model successfully identifies both income and balance as highly significant predictors of credit card default. Credit card balance appears to be the much stronger predictor in practical terms.

#b.	Using the validation set approach, estimate the test error of this model
The goal is to:

Split our data into training set (to build the model)

And validation set (to test the model's performance)

This gives us an estimate of the test error (how well the model generalizes)

In [15]:
# Split the Data into Training and Validation Sets
X = default[['income', 'balance']] #Features
y = default['default']              # Target variable

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print('Data Split Results:')
print(f'Original data shape: {X.shape}')
print(f'Training set shape: {X_train.shape} (70%)')
print(f'Validation set shape: {X_val.shape} (30%)')
print(f'Training default rate: {y_train.mean():.3%}')
print(f'Validation default rate: {y_val.mean():.3%}')

Data Split Results:
Original data shape: (10000, 2)
Training set shape: (7000, 2) (70%)
Validation set shape: (3000, 2) (30%)
Training default rate: 3.329%
Validation default rate: 3.333%


In [16]:
# Fit the model on the training data only
X_train_sm = sm.add_constant(X_train)
train_model = sm.Logit(y_train, X_train_sm).fit()

print('Model fitted on training!')
print(train_model.summary())

Optimization terminated successfully.
         Current function value: 0.078461
         Iterations 10
Model fitted on training!
                           Logit Regression Results                           
Dep. Variable:                default   No. Observations:                 7000
Model:                          Logit   Df Residuals:                     6997
Method:                           MLE   Df Model:                            2
Date:                Thu, 21 Aug 2025   Pseudo R-squ.:                  0.4625
Time:                        18:09:28   Log-Likelihood:                -549.22
converged:                       True   LL-Null:                       -1021.9
Covariance Type:            nonrobust   LLR p-value:                5.290e-206
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -11.5977      0.523    -22.157      0.000     -12.624     -10.572
in

In [17]:
# Make predictions on the validation set
X_val_sm = sm.add_constant(X_val)

# Predicted probabilities of default from the training-set model
val_predictions_probs = train_model.predict(X_val_sm)

# Convert probabilities to class predictions (using a 0.5 threshold)
val_predictions = (val_predictions_probs > 0.5).astype(int)

print('First 10 prediction probabilities: ', val_predictions_probs[:10].round(3))
print('First 10 class predictions: ', val_predictions[:10])
print('First 10 actual values: ', y_val.values[:10])

First 10 prediction probabilities:  7288    0.011
6310    0.002
3912    0.000
6386    0.528
5633    0.004
7212    0.000
8705    0.698
8622    0.002
542     0.011
8083    0.015
dtype: float64
First 10 class predictions:  7288    0
6310    0
3912    0
6386    1
5633    0
7212    0
8705    1
8622    0
542     0
8083    0
dtype: int64
First 10 actual values:  [0 0 0 0 0 0 1 0 0 0]


In [18]:
#Calculate Validation Error Rate
#Misclassification error
error_rate = np.mean(val_predictions != y_val)
accuracy = 1- error_rate

#Confusion Matrix
cm = confusion_matrix(y_val, val_predictions)
tn, fp, fn, tp = cm.ravel()

print(f'Validation Error Rate: {error_rate:.3%}')
print(f'Validation Accuracy: {accuracy:.3%}')
print(f'\nConfusion Matrix:')
print(cm)
print(f'\nTrue Negative: {tn} | False Positive: {fp}')
print(f'False Negative: {fn} | True Positive: {tp}')

Validation Error Rate: 2.867%
Validation Accuracy: 97.133%

Confusion Matrix:
[[2881   19]
 [  67   33]]

True Negative: 2881 | False Positive: 19
False Negative: 67 | True Positive: 33


In [19]:
# Additional metrics
if tp + fp > 0:  # Avoid division by zero when no defaults are predicted
  precision = tp / (tp+fp)
  recall = tp / (tp+fn)
  f1_score = 2 * (precision * recall) / (precision + recall)

  print(f'\nAdditional Metrics:')
  print(f'Precision: {precision:.3%}')  # Of predicted defaults, how many were correct?
  print(f'Recall: {recall:.3%}')  # Of actual defaults, how many did we catch?
  print(f'F1-Score: {f1_score:.3%}')


Additional Metrics:
Precision: 63.462%
Recall: 33.000%
F1-Score: 43.421%


In [20]:
#Training predictions
train_predictions_probs = train_model.predict(X_train_sm)
train_predictions = (train_predictions_probs > 0.5).astype(int)
train_error = np.mean(train_predictions != y_train)

print(f'Training Error Rate: {train_error:.3%}')
print(f'Validation Error Rate: {error_rate:.3%}')
print(f'Difference: {abs(train_error - error_rate):.3%}')

if abs(train_error - error_rate) < 0.01:
  print('Good generalization: Similar training and validation error')
else:
  print('Potential overfitting: Large gap between training and validation error')

Training Error Rate: 2.571%
Validation Error Rate: 2.867%
Difference: 0.295%
Good generalization: Similar training and validation error


## **Validation Set Approach Results**

### **Key Findings:**
- **Validation Error Rate:** 2.867%
- **Validation Accuracy:** 97.133%
- **The model generalizes well:** training error (2.571%) and validation error (2.867%) differ by only 0.295 percentage points

### **Model Performance Breakdown:**

**Overall Performance:**
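- **97.133% Accuracy:** The model correctly predicts default status for the vast majority of customers
- **2.867% Error Rate:** Low in absolute terms, but only modestly below the 3.333% error of a rule that always predicts "no default", because just 100 of the 3000 validation accounts default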

**Default Prediction Capability (Class 1):**
- **Precision: 63.462%** - When the model predicts "default", it's correct about 63% of the time
- **Recall: 33.000%** - The model detects only 33% of actual defaults
- **F1-Score: 43.421%** - Balanced measure of precision and recall

**Non-Default Prediction Capability (Class 0):**
- **Excellent performance** with 2881 true negatives and only 19 false positives
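
The figures in this breakdown all follow from the four confusion-matrix counts. The sketch below recomputes them from the validation counts reported above and adds two quantities not printed by the notebook, specificity and the error of always predicting "no default", as an illustrative baseline; the variable names are assumptions for this example.

```python
# Validation confusion-matrix counts reported above (threshold 0.5)
tn, fp, fn, tp = 2881, 19, 67, 33
n = tn + fp + fn + tp

error_rate = (fp + fn) / n          # share of misclassified accounts
baseline_error = (tp + fn) / n      # error of always predicting "no default"
precision = tp / (tp + fp)          # correct among predicted defaults
recall = tp / (tp + fn)             # detected among actual defaults
specificity = tn / (tn + fp)        # correct among actual non-defaults
f1 = 2 * precision * recall / (precision + recall)

print(f"Error rate:     {error_rate:.3%}")
print(f"Baseline error: {baseline_error:.3%}")
print(f"Precision:      {precision:.3%}")
print(f"Recall:         {recall:.3%}")
print(f"Specificity:    {specificity:.3%}")
print(f"F1-score:       {f1:.3%}")
```

Comparing the error rate with the baseline shows how much of the headline accuracy comes from the class imbalance rather than from the model.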

### **Business Implications:**
- The model is **highly effective at identifying low-risk customers**: 2881 of 2900 non-defaulters (99.3%) are classified correctly
- However, it **misses about 67% of actual defaults** (low recall) - this could be costly for the business
- When it does flag someone as high-risk, it's correct about 63% of the time

### **Recommendations:**
1. **Address class imbalance** - Only 3.3% of accounts default, making them hard to detect
2. **Adjust classification threshold** - Lowering the threshold from 0.5 to 0.3 might catch more defaults, at the cost of more false positives
3. **Collect additional features** - Income and balance alone may not be sufficient to predict default reliably
4. **Use as first screening tool** - The model is excellent for identifying low-risk customers automatically
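
The threshold recommendation can be illustrated with a small sketch. The probabilities and labels below are hypothetical toy values, not taken from the Default data, and `precision_recall` is an illustrative helper; the point is only to show how lowering the cutoff trades precision for recall.

```python
def precision_recall(probs, labels, threshold):
    """Precision and recall for the positive class at a given probability cutoff."""
    preds = [int(p > threshold) for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Hypothetical predicted probabilities and true labels (toy data)
toy_probs = [0.02, 0.10, 0.35, 0.45, 0.55, 0.70, 0.05, 0.32, 0.90, 0.01]
toy_labels = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

for threshold in (0.5, 0.3):
    p, r = precision_recall(toy_probs, toy_labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On the real data the same comparison could be run with `val_predictions_probs` and `y_val` in place of the toy lists; the appropriate cutoff depends on the relative cost of a missed default versus a false alarm.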

### **Next Steps:**
- Experiment with different probability thresholds
- Try techniques for imbalanced data (SMOTE, class weights)
- Implement k-Fold Cross-Validation for more robust error estimation
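
As a rough idea of what k-fold cross-validation involves, the sketch below builds the fold indices by hand. It is a simplified illustration, not the notebook's method: `k_fold_indices` is a hypothetical helper, and it does not stratify by class, which would matter for data as imbalanced as this.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle 0..n_samples-1 and yield (train_idx, val_idx) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

# Each observation lands in exactly one validation fold
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, k=5), 1):
    print(f"Fold {fold}: train={len(train_idx)} obs, validate={len(val_idx)} obs")
```

For each fold, the model would be fitted on `train_idx`, scored on `val_idx`, and the k error rates averaged. In practice, scikit-learn's `StratifiedKFold` is a more suitable choice here because it keeps the default rate similar across folds.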

In [21]:
# =============================================================================
# PART a: Repeat validation set approach with different random splits
# =============================================================================

print(" PART a: Multiple Validation Splits")
print("=" * 50)

# We'll test 3 different random seeds
random_seeds = [42, 123, 789]
results = []

for i, seed in enumerate(random_seeds, 1):
    print(f"\n Split {i} (Random State = {seed})")
    print("-" * 30)

    # Split with different random seed
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )

    # Fit model
    X_train_sm = sm.add_constant(X_train)
    model = sm.Logit(y_train, X_train_sm).fit(disp=0)  # disp=0 suppresses output

    # Predict default probabilities on the validation set (0.5 threshold)
    X_val_sm = sm.add_constant(X_val)
    val_predictions_proba = model.predict(X_val_sm)
    val_predictions = (val_predictions_proba > 0.5).astype(int)

    # Reset both indices so the element-wise comparison aligns by position
    y_val_reset = y_val.reset_index(drop=True)
    val_predictions_reset = pd.Series(val_predictions).reset_index(drop=True)

    # Calculate metrics with reset indices
    error_rate = np.mean(val_predictions_reset != y_val_reset)
    accuracy = 1 - error_rate
    cm = confusion_matrix(y_val_reset, val_predictions_reset)
    tn, fp, fn, tp = cm.ravel()

    if tp + fp > 0:
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        precision = recall = f1 = 0

    # Store results
    results.append({
        'split': i,
        'random_state': seed,
        'error_rate': error_rate,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'tp': tp,
        'fp': fp,
        'fn': fn,
        'tn': tn
    })

    # Print results
    print(f"Error Rate: {error_rate:.3%}")
    print(f"Accuracy: {accuracy:.3%}")
    print(f"Precision: {precision:.3%}")
    print(f"Recall: {recall:.3%}")
    print(f"F1-Score: {f1:.3%}")
    print(f"Confusion Matrix:\n{cm}")

# Create summary DataFrame
results_df = pd.DataFrame(results)
print("\n" + "="*50)
print("📊 SUMMARY OF ALL SPLITS")
print("="*50)
print(results_df.round(4))

# Calculate averages
print(f"\n📈 AVERAGE PERFORMANCE ACROSS ALL SPLITS:")
print(f"Average Error Rate: {results_df['error_rate'].mean():.3%}")
print(f"Average Accuracy: {results_df['accuracy'].mean():.3%}")
print(f"Average Precision: {results_df['precision'].mean():.3%}")
print(f"Average Recall: {results_df['recall'].mean():.3%}")
print(f"Average F1-Score: {results_df['f1_score'].mean():.3%}")

 PART a: Multiple Validation Splits

 Split 1 (Random State = 42)
------------------------------
Error Rate: 2.867%
Accuracy: 97.133%
Precision: 63.462%
Recall: 33.000%
F1-Score: 43.421%
Confusion Matrix:
[[2881   19]
 [  67   33]]

 Split 2 (Random State = 123)
------------------------------
Error Rate: 2.467%
Accuracy: 97.533%
Precision: 80.952%
Recall: 34.000%
F1-Score: 47.887%
Confusion Matrix:
[[2892    8]
 [  66   34]]

 Split 3 (Random State = 789)
------------------------------
Error Rate: 2.833%
Accuracy: 97.167%
Precision: 70.270%
Recall: 26.000%
F1-Score: 37.956%
Confusion Matrix:
[[2889   11]
 [  74   26]]

📊 SUMMARY OF ALL SPLITS
   split  random_state  error_rate  accuracy  precision  recall  f1_score  tp  \
0      1            42      0.0287    0.9713     0.6346    0.33    0.4342  33   
1      2           123      0.0247    0.9753     0.8095    0.34    0.4789  34   
2      3           789      0.0283    0.9717     0.7027    0.26    0.3796  26   

   fp  fn    tn  
0  19 