# 3. We now review k-fold cross-validation.

**(a) Explain how k-fold cross-validation is implemented.**

a. The dataset is randomly split into k groups, or folds, of roughly equal size. The first fold is held out as a validation set, and the model is fit on the remaining k - 1 folds. The mean squared error, $\text{MSE}_1$, is then computed on the observations in the held-out fold. This procedure is repeated k times, each time holding out a different fold, which yields k estimates of the test error, $\text{MSE}_1, \ldots, \text{MSE}_k$. The k-fold cross-validation estimate is the average of these k values.
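The procedure described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the synthetic data, the simple least-squares line used as the model, and the helper names `fit_line` and `k_fold_mse` are all assumptions made for the example.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def k_fold_mse(xs, ys, k=5, seed=0):
    """Return (k-fold CV estimate, list of per-fold MSEs) for the line model."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)        # random assignment to folds
    folds = [indices[i::k] for i in range(k)]   # k folds of roughly equal size
    fold_mses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit on the k - 1 remaining folds
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Evaluate on the held-out fold
        errors = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        fold_mses.append(sum(errors) / len(errors))
    # The CV estimate is the average of the k fold-level MSEs
    return sum(fold_mses) / k, fold_mses


# Synthetic data (an assumption for illustration): y = 2 + 3x + noise
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]

cv_estimate, fold_mses = k_fold_mse(xs, ys, k=5)
print("Per-fold MSEs:", [round(m, 3) for m in fold_mses])
print("5-fold CV estimate:", round(cv_estimate, 3))
```

Changing `seed` changes the fold assignment and therefore the estimate slightly, which is the source of randomness in k-fold cross-validation.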

**(b) What are the advantages and disadvantages of k-fold cross-validation relative to:**

**i. The validation set approach?**

Relative to k-fold cross-validation, the validation set approach has the following advantages and disadvantages:

*Advantages:*

a. It is conceptually simple and straightforward to implement.

b. It gives an intuitive assessment of model performance from a single held-out set, and it needs only one model fit rather than the k fits of k-fold cross-validation.

*Disadvantages:*

a. The validation mean squared error (MSE) can vary considerably depending on which observations fall into the training and validation sets, making it a less reliable estimate of the test error.

b. Only a portion of the available data is used to train the model, so the fitted model tends to perform worse than one trained on all the data, and the validation error tends to overestimate the test error.

**ii. LOOCV?**

*Advantages of LOOCV:*

a. LOOCV has lower bias than the validation set approach when estimating the test error, because each model is trained on n - 1 observations, which is almost the entire dataset.

b. LOOCV gives the same result every time it is run, because there is no randomness in how the data are split: each observation is held out exactly once.

*Disadvantages of LOOCV:*

a. LOOCV can be computationally expensive, because the model must be fit n times, once per observation. This can make it impractical for very large datasets or slow-fitting models, whereas k-fold cross-validation with k = 5 or 10 requires only k fits.
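To make the cost concrete, the sketch below runs LOOCV by brute force (n separate fits) for a simple least-squares line. For least-squares linear regression specifically, there is a known shortcut that recovers the same LOOCV estimate from a single fit using each observation's leverage, and the sketch checks that the two agree. The synthetic data and the helper names are assumptions for illustration; the shortcut does not apply to logistic regression or to models in general.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def loocv_brute_force(xs, ys):
    """LOOCV MSE using n separate fits, each leaving out one observation."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (a + b * xs[i])) ** 2
    return total / n


def loocv_shortcut(xs, ys):
    """LOOCV MSE from a single fit, using the leverage of each observation."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    total = 0.0
    for x, y in zip(xs, ys):
        leverage = 1 / n + (x - mean_x) ** 2 / sxx
        total += ((y - (a + b * x)) / (1 - leverage)) ** 2
    return total / n


# Synthetic data (an assumption for illustration)
rng = random.Random(2)
xs = [rng.uniform(0, 10) for _ in range(50)]
ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]

brute = loocv_brute_force(xs, ys)
shortcut = loocv_shortcut(xs, ys)
print("Brute-force LOOCV (50 fits):", brute)
print("Leverage shortcut (1 fit):  ", shortcut)
```

Running either function twice on the same data returns the same value, which illustrates the absence of randomness in LOOCV.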

**4. Suppose that we use some statistical learning method to make a prediction for the response Y for a particular value of the predictor X.**

**Carefully describe how we might estimate the standard deviation of our prediction.**

The following steps describe how to estimate the standard deviation of a prediction for the response Y at a specific value of the predictor X:

1. Collect a dataset of paired X and Y values.
2. Fit a statistical learning model to this dataset.
3. Use the fitted model to predict Y at the desired X value.
4. Calculate the residuals, the differences between the observed Y values and the model's predictions, for all data points.

To estimate the standard deviation, there are a couple of options:

1. Compute the standard deviation of the residuals, which gives an in-sample estimate of how far observed responses typically fall from the model's predictions.
2. Alternatively, compute a prediction interval for Y at the given X; the width of the interval reflects the standard deviation of the prediction and so serves as a measure of its uncertainty.
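The sketch below illustrates the residual-based option on synthetic data. It also shows a bootstrap estimate of how much the fitted prediction itself varies from sample to sample. The bootstrap is not one of the options listed above; it is included as an assumption-laden alternative, in the same spirit as the resampling used for coefficient standard errors later in this notebook. The data, the simple line model, and the names `fit_line` and `x0` are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Synthetic data (an assumption for illustration): y = 1 + 0.5x + noise
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1 + 0.5 * x + rng.gauss(0, 2) for x in xs]
x0 = 5.0  # the value of X at which we predict

# Residual-based option: spread of observed Y around the fitted values
a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
residual_sd = statistics.stdev(residuals)

# Bootstrap alternative: resample rows with replacement, refit, and
# record the prediction at x0 each time
n = len(xs)
boot_predictions = []
for _ in range(1000):
    idx = [rng.randrange(n) for _ in range(n)]
    a_b, b_b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
    boot_predictions.append(a_b + b_b * x0)
prediction_sd = statistics.stdev(boot_predictions)

print("Residual standard deviation:", round(residual_sd, 3))
print("Bootstrap SD of the prediction at x0:", round(prediction_sd, 3))
```

The two numbers answer different questions: the residual standard deviation describes the scatter of individual responses around the fit, while the bootstrap value describes how much the fitted prediction at `x0` would change if the model were refit on a new sample.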

In [1]:
!pip install ISLP


In [3]:
import numpy as np
import pandas as pd
from ISLP import load_data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

In [4]:
default_df = load_data('Default')
default_df.head()

  default student      balance        income
0      No      No   729.526495  44361.625074
1      No     Yes   817.180407    12106.1347
2      No      No  1073.549164  31767.138947
3      No      No   529.250605  35704.493935
4      No      No   785.655883  38463.495879


In [5]:
# 5. a

# Convert 'default' column into binary labels: 1 for 'Yes' and 0 for 'No'
default_df['default'] = default_df['default'].apply(lambda x: 1 if x == 'Yes' else 0)

# Define feature variables 'X' and target variable 'y'
X = default_df[['balance', 'income']]
y = default_df['default']

# Create and train a Logistic Regression model with a fixed random state of 5
model = LogisticRegression(random_state=5).fit(X, y)

# Make predictions using the trained model on the same dataset
predictions = model.predict(X)

# Print the accuracy score of the model on the training data
accuracy = model.score(X, y)
print("Model Accuracy:", accuracy)

Model Accuracy: 0.9737


In [6]:
# 5.b

# Split the data into a training set and a validation set (80% training, 20% validation)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

# Fit a logistic regression model on the training set
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the validation set (get probabilities)
y_pred_prob = model.predict_proba(X_val)[:, 1]  # Probability of 'default'

# Classify individuals as 'default' if probability is greater than 0.5
y_pred_class = (y_pred_prob > 0.5).astype(int)

# Compute the validation set error (misclassification rate)
validation_error = 1 - accuracy_score(y_val, y_pred_class)

print("Validation Set Error:", validation_error)

Validation Set Error: 0.023499999999999965


In [7]:
# 5.c
# Iteration with Different Splits
for i in range(3):
    print(f"Iteration {i + 1}:")

    # Randomly split the dataset into a training set and a validation set
    X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(X, y, test_size=0.2, random_state=(42 + i))

    # Train a logistic regression model using the training set
    logistic_model = LogisticRegression()
    logistic_model.fit(X_train_split, y_train_split)

    # Generate predictions on the validation set, obtaining probabilities
    y_pred_prob = logistic_model.predict_proba(X_val_split)[:, 1]  # Probability of 'default'

    # Classify individuals as 'default' if their probability is greater than 0.5
    y_pred_class = (y_pred_prob > 0.5).astype(int)

    # Calculate the validation set error, which measures the misclassification rate
    validation_error = 1 - accuracy_score(y_val_split, y_pred_class)

    print("Validation Set Error:", validation_error)

Iteration 1:
Validation Set Error: 0.034499999999999975
Iteration 2:
Validation Set Error: 0.03949999999999998
Iteration 3:
Validation Set Error: 0.039000000000000035


**5.d. Fitting a logistic regression using income, balance and a dummy variable for student**

In [11]:
# Create a dummy variable for 'student' (1 for 'Yes', 0 for 'No')
default_df['student'] = default_df['student'].apply(lambda x: 1 if x == 'Yes' else 0)

# Split the dataset into training and validation sets
X = default_df[['balance', 'income', 'student']]
y = default_df['default']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

# Train a logistic regression model using the training data
model = LogisticRegression()
model.fit(X_train, y_train)

# Generate predictions on the validation set
y_pred = model.predict(X_val)

# Calculate the model's accuracy and validation set error
accuracy = accuracy_score(y_val, y_pred)
validation_error = 1 - accuracy

print("Validation Set Error:", validation_error)

Validation Set Error: 0.026000000000000023


With only 'income' and 'balance' as predictors, the validation set error in 5.b was about 0.0235; with the 'student' dummy variable added, it is 0.026. Taken at face value, adding 'student' does not reduce the validation error. However, the difference (about 0.0025) is small, the two models were evaluated on different random splits, and the three splits in 5.c alone produced errors ranging from 0.0345 to 0.0395. The comparison therefore gives no clear evidence that including 'student' changes the test error rate in either direction.
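A rough way to judge whether a gap of this size is meaningful is to compare it with the sampling uncertainty of a single error rate. The sketch below uses the binomial approximation for the standard error of a misclassification rate. It assumes the validation set holds 2,000 observations (20% of the 10,000 rows in Default) and treats them as independent draws; both are simplifying assumptions, and `binomial_se` is a hypothetical helper.

```python
import math

n_val = 2000                  # assumed validation size: 20% of 10,000 rows
err_without_student = 0.0235  # validation error reported in 5.b
err_with_student = 0.0260     # validation error reported in 5.d


def binomial_se(p, n):
    """Approximate standard error of an error rate p estimated from n cases."""
    return math.sqrt(p * (1 - p) / n)


se = binomial_se(err_without_student, n_val)
difference = err_with_student - err_without_student

print(f"Difference in error rates: {difference:.4f}")
print(f"Approx. SE of one error rate: {se:.4f}")
print("Difference smaller than one SE:", difference < se)
```

Under these assumptions the difference between the two validation errors is smaller than the standard error of either one, so it could easily arise from the random split alone.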

In [12]:

!pip3 install statsmodels



In [13]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from ISLP import load_data

In [14]:
# 6.a

# Set a random seed to ensure reproducibility
np.random.seed(42)

# Load the data into a DataFrame
data = load_data('Default')

# Convert 'default' to binary labels (0 for 'No', 1 for 'Yes')
data['default'] = data['default'].map({'No': 0, 'Yes': 1})

# Define the predictors, which are 'income' and 'balance'
X = data[['income', 'balance']]

# Add an intercept term to the predictors
X = sm.add_constant(X)

# Define the response variable
y = data['default']

# Fit a logistic regression model using Generalized Linear Models
model = sm.GLM(y, X, family=sm.families.Binomial())
results = model.fit()

# Display the summary of the model
results.summary()

Dep. Variable:                default   No. Observations:                10000
Model:                            GLM   Df Residuals:                     9997
Model Family:                Binomial   Df Model:                            2
Link Function:                  Logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -789.48
Date:                Thu, 02 Nov 2023   Deviance:                       1579.0
Time:                        02:39:27   Pearson chi2:                   6950.0
No. Iterations:                     9   Pseudo R-squ. (CS):             0.1256
Covariance Type:            nonrobust
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -11.5405      0.435    -26.544      0.000     -12.393     -10.688
income      2.081e-05   4.99e-06      4.174      0.000     1.1e-05    3.06e-05
balance        0.0056      0.000     24.835      0.000       0.005       0.006


In [15]:
# Obtain the computed standard errors for the coefficients
standard_errors = results.bse

# Print the standard errors
print(standard_errors)

const      0.434772
income     0.000005
balance    0.000227
dtype: float64


In [16]:
# 6.b Bootstrapping Logistic Regression

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Define a function for bootstrapping
def bootstrap_logistic(data, indices):
    # Create a subsample based on the specified indices
    subsampled_data = data.iloc[indices]

    # Specify predictor variables (income and balance)
    X = subsampled_data[['income', 'balance']]

    # Add an intercept term
    X = sm.add_constant(X)

    # Define the response variable
    y = subsampled_data['default']

    # Fit a logistic regression model (disp=0 suppresses the per-fit convergence message)
    model = sm.Logit(y, X)
    results = model.fit(disp=0)

    # Extract coefficient estimates for income and balance
    coefficients = results.params[['income', 'balance']]

    return coefficients

In [18]:
# 6.c  Bootstrap Logistic Regression Coefficients

# Define the number of bootstrap iterations
n_bootstrap = 1000

# Initialize arrays to store coefficient estimates for 'income' and 'balance'
bootstrap_coefficient_estimates = np.zeros((n_bootstrap, 2))  # 2 for income and balance

# Conduct bootstrap sampling and estimate coefficients
for iteration in range(n_bootstrap):
    # Draw row positions with replacement (positions, not index labels,
    # because bootstrap_logistic selects rows with .iloc)
    sampled_indices = np.random.choice(len(data), size=len(data), replace=True)

    # Calculate coefficient estimates for 'income' and 'balance' using the bootstrapping function
    coefficients = bootstrap_logistic(data, sampled_indices)

    # Store the coefficients in the array
    bootstrap_coefficient_estimates[iteration, :] = coefficients

In [20]:
#  6.c: Compute Standard Errors

# Calculate the standard errors from the bootstrap coefficients
standard_errors = np.std(bootstrap_coefficient_estimates, axis=0)

# Display the estimated standard errors for coefficients
print("Estimated Standard Errors of Coefficients:")
print("Income:", standard_errors[0])
print("Balance:", standard_errors[1])

Estimated Standard Errors of Coefficients:
Income: 4.968586838704016e-06
Balance: 0.00023209224235493277


# 6.d
The standard errors from sm.GLM() and from the bootstrap are in close agreement: about 4.99e-06 versus 4.97e-06 for income, and 0.000227 versus 0.000232 for balance. This suggests that the assumptions behind the formula-based standard errors are reasonable for this dataset, and that the reported uncertainty in the income and balance coefficients is trustworthy. Agreement between the two sets of standard errors does not by itself show that the model fits the data well or predicts accurately.
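The size of the agreement can be quantified with a short calculation. The values below are copied from the sm.GLM() summary and the bootstrap output above; the dictionary names are just for illustration.

```python
# Standard errors reported by sm.GLM() (from the model summary)
formula_se = {"income": 4.99e-06, "balance": 0.000227}

# Standard errors estimated by the bootstrap (from the printed output)
bootstrap_se = {
    "income": 4.968586838704016e-06,
    "balance": 0.00023209224235493277,
}

# Relative gap of the bootstrap estimate from the formula-based one
relative_gap = {
    name: (bootstrap_se[name] - formula_se[name]) / formula_se[name]
    for name in formula_se
}

for name, gap in relative_gap.items():
    print(f"{name}: bootstrap SE differs from formula SE by {gap:+.1%}")
```

Both gaps are within a few percent, which is consistent with the Monte Carlo noise expected from 1,000 bootstrap replicates.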

In [21]:
# Set a random seed to ensure consistent results
np.random.seed(42)

# Import your dataset into a DataFrame
dataset = load_data('Weekly')
dataset.head()

   Year    Lag1    Lag2    Lag3    Lag4    Lag5    Volume   Today Direction
0  1990   0.816   1.572  -3.936  -0.229  -3.484  0.154976   -0.27      Down
1  1990   -0.27   0.816   1.572  -3.936  -0.229  0.148574  -2.576      Down
2  1990  -2.576   -0.27   0.816   1.572  -3.936  0.159837   3.514        Up
3  1990   3.514  -2.576   -0.27   0.816   1.572   0.16163   0.712        Up
4  1990   0.712   3.514  -2.576   -0.27   0.816  0.153728   1.178        Up


In [25]:
# Step 7a: Logistic Regression Model

# Import the necessary libraries from statsmodels
import statsmodels.api as sm

# Specify the predictor variables (Lag1 and Lag2) and include an intercept
X = sm.add_constant(dataset[['Lag1', 'Lag2']])

# Define the response variable (Direction) as a binary outcome (0 for 'Down', 1 for 'Up')
y = (dataset['Direction'] == 'Up').astype(int)

# Fit a logistic regression model
model = sm.Logit(y, X)
results = model.fit()

# Display the summary of the logistic regression model
print(results.summary())

Optimization terminated successfully.
         Current function value: 0.683297
         Iterations 4
                           Logit Regression Results                           
Dep. Variable:              Direction   No. Observations:                 1089
Model:                          Logit   Df Residuals:                     1086
Method:                           MLE   Df Model:                            2
Date:                Thu, 02 Nov 2023   Pseudo R-squ.:                0.005335
Time:                        02:58:03   Log-Likelihood:                -744.11
converged:                       True   LL-Null:                       -748.10
Covariance Type:            nonrobust   LLR p-value:                   0.01848
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2212      0.061      3.599      0.000       0.101       0.342
Lag1          -0.0387      0.

In [26]:
# Step 7b: Logistic Regression Model (Excluding First Observation)

# Import the necessary libraries from pandas and statsmodels
import pandas as pd
import statsmodels.api as sm

# Create a modified dataset by excluding the first observation
data_without_first = dataset.iloc[1:]

# Specify the predictor variables (Lag1 and Lag2) and include an intercept
X = sm.add_constant(data_without_first[['Lag1', 'Lag2']])

# Define the response variable (Direction) as a binary outcome (0 for 'Down', 1 for 'Up')
y = (data_without_first['Direction'] == 'Up').astype(int)

# Fit a logistic regression model
model = sm.Logit(y, X)
results = model.fit()

# Display the summary of the logistic regression model
print(results.summary())

Optimization terminated successfully.
         Current function value: 0.683147
         Iterations 4
                           Logit Regression Results                           
Dep. Variable:              Direction   No. Observations:                 1088
Model:                          Logit   Df Residuals:                     1085
Method:                           MLE   Df Model:                            2
Date:                Thu, 02 Nov 2023   Pseudo R-squ.:                0.005387
Time:                        03:00:24   Log-Likelihood:                -743.26
converged:                       True   LL-Null:                       -747.29
Covariance Type:            nonrobust   LLR p-value:                   0.01785
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2232      0.061      3.630      0.000       0.103       0.344
Lag1          -0.0384      0.

In [29]:
# Step 7.c: Predict and Classify the First Observation

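# Build the design row for the first observation of the full dataset.
# (X from 7.b excludes that observation, so X.iloc[0] would be the second one.)
first_obs = sm.add_constant(dataset[['Lag1', 'Lag2']]).iloc[[0]]

# Use the model fitted without the first observation to predict its probability of 'Up'
predicted_probability = results.predict(first_obs).iloc[0]

# Determine the classification of the first observation based on the predicted probability
predicted_direction = "Up" if predicted_probability > 0.5 else "Down"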

# Retrieve the actual direction of the first observation
actual_direction = dataset['Direction'].iloc[0]

# Check if the observation was accurately classified
is_correctly_classified = (predicted_direction == actual_direction)

# Display the predicted and actual directions, and whether the observation was correctly classified
print("Predicted Direction:", predicted_direction)
print("Actual Direction:", actual_direction)
print("Correctly Classified:", is_correctly_classified)

Predicted Direction: Up
Actual Direction: Down
Correctly Classified: False


In [30]:
# Step 7.d: Leave-One-Out Cross-Validation for Logistic Regression

# Import the necessary libraries from pandas, statsmodels, and numpy
import pandas as pd
import statsmodels.api as sm
import numpy as np

# Set a fixed random seed for reproducibility
np.random.seed(42)

# Load the weekly dataset
df_weekly = load_data("Weekly")

# Remove rows with missing data in 'Direction', 'Lag1', and 'Lag2'
df_weekly.dropna(subset=['Direction', 'Lag1', 'Lag2'], inplace=True)

# Initialize an error counter
error = 0

# Get the number of observations
n = df_weekly.shape[0]

# Create an empty array for predicted values
y_pred = np.empty(n)

# Perform Leave-One-Out Cross-Validation
for i in range(n):
    # Boolean mask that keeps every row except the i-th (by position)
    keep = np.arange(n) != i

    # Design matrix (with intercept) and 0/1 response for the reduced dataset
    X_new = sm.add_constant(df_weekly.loc[keep, ["Lag1", "Lag2"]].to_numpy())
    y_new = (df_weekly.loc[keep, "Direction"] == 'Up').astype(int).to_numpy()

    # Fit a logistic regression model on the reduced dataset
    # (disp=0 suppresses the per-fit convergence message)
    mls = sm.Logit(y_new, X_new).fit(disp=0)

    # Row for the held-out observation, including the constant term
    X_i = np.array([[1.0, df_weekly["Lag1"].iloc[i], df_weekly["Lag2"].iloc[i]]])

    # Predict the probability of 'Up' for the i-th observation and classify it
    predicted_prob = mls.predict(X_i)[0]
    y_pred[i] = int(predicted_prob > 0.5)

    # Count an error if the predicted class differs from the actual direction
    actual = int(df_weekly['Direction'].iloc[i] == 'Up')
    if y_pred[i] != actual:
        error += 1

# Calculate the Leave-One-Out Cross-Validation (LOOCV) error rate
loocv_error_rate = error / n

In [31]:
# Step 7d: Display the Leave-One-Out Cross-Validation Error Rate
print("Leave-One-Out Cross-Validation (LOOCV) Error Rate:", loocv_error_rate)

Leave-One-Out Cross-Validation (LOOCV) Error Rate: 0.44995408631772266


# 7.e
The LOOCV error rate is about 0.450: the model misclassifies roughly 45% of the observations when each one is held out in turn and the model is fit on the remaining n - 1.

This is only modestly better than the 50% error rate expected from random guessing, which indicates that Lag1 and Lag2 have limited power to predict the market's direction.