## DSI-06 Homework 3: ANSWERS
From Chapter 4, found on pages 196-197 of ISLP

*This question should be answered using the `Weekly` data set, which is part of the ISLP package. This data is similar in nature to the `Smarket` data from this section's in-class exercises, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.*

In [None]:
# Import standard libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import subplots
import statsmodels.api as sm

# Import specific objects
from ISLP.models import (ModelSpec as MS,
                         summarize)
from ISLP import load_data
from ISLP import confusion_table
from ISLP.models import contrast
from sklearn.discriminant_analysis import \
     (LinearDiscriminantAnalysis as LDA,
      QuadraticDiscriminantAnalysis as QDA)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import LabelEncoder

# Load dataset
Weekly = load_data('Weekly')
Weekly

a) Produce some numerical and graphical summaries of the `Weekly` data. Do there appear to be any patterns?

In [None]:
# Use the describe() function to obtain basic summary statistics for each variable 
print(Weekly.describe())

In [None]:
numeric_columns = Weekly.select_dtypes(include=['float64', 'int64'])

# Calculate the correlation matrix
cor_Weekly = numeric_columns.corr()

# Plot the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(cor_Weekly, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Plot')
plt.show()

We can see one clear pattern: `Year` and `Volume` have a strong positive linear relationship, meaning that trading volume rose over the period. The correlation plot does not show a notable linear relationship between any other pair of variables.
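
To make explicit what each cell of the heatmap measures, here is a minimal sketch of the Pearson correlation written with the standard library only. The numbers are made up for illustration and are not taken from `Weekly`; the helper name `pearson` is hypothetical. The first pair rises steadily with the year, so its correlation is close to 1, while the second pair is built to have no linear trend.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up values (assumption, NOT from Weekly): a series that grows with the year
year = [1990, 1995, 2000, 2005, 2010]
volume = [0.2, 0.4, 1.1, 2.0, 3.5]
# Made-up values with no linear trend in the year (symmetric around 2000)
noise = [0.5, -0.3, 0.1, -0.3, 0.5]

r_trend = pearson(year, volume)
r_noise = pearson(year, noise)
print("trend:", round(r_trend, 3), "no trend:", round(r_noise, 3))
```

`DataFrame.corr()` computes this same quantity for every pair of numeric columns, which is why a single large off-diagonal cell stands out so clearly.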

b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

In [None]:
# Weekly already contains Lag1 through Lag5 (lagged percentage returns),
# so no lag columns need to be created. Shifting Direction would produce
# lagged Up/Down labels rather than the lagged returns the question asks for.

# Select predictors (Lag1 through Lag5 plus Volume) and response variable
allvars = Weekly.columns.drop(['Today', 'Direction', 'Year'])
X = sm.add_constant(Weekly[allvars])
y = (Weekly['Direction'] == 'Up').astype(int)

# Fit logistic regression model
glm = sm.GLM(y, X, family=sm.families.Binomial())
results = glm.fit()

# Print the summary of the logistic regression
print(results.summary())

In [None]:
summarize(results)

The column labelled `P>|z|` gives the p-value associated with each variable. Recall that a small p-value
is evidence against the null hypothesis that there is no association between the response and
that predictor. `Lag2` appears to be statistically significant!
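
As a reminder of where the p-value column comes from, the sketch below computes a two-sided p-value from a coefficient and its standard error using the normal approximation. The coefficient and standard error are hypothetical values chosen for illustration, not numbers from the fitted model, and `two_sided_p_value` is a helper name introduced here.

```python
import math

def two_sided_p_value(coef, std_err):
    """Return the z-statistic and its two-sided p-value under a standard normal."""
    z = coef / std_err
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical estimate and standard error (assumption, not model output)
z, p = two_sided_p_value(coef=0.06, std_err=0.03)
print("z =", round(z, 2), "p =", round(p, 4))
```

A coefficient about two standard errors away from zero lands just under the usual 0.05 cut-off, which is the rule of thumb behind reading the summary table.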

c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

In [None]:
# Get predicted probabilities
y_prob = results.predict()

# Convert probabilities to binary predictions (0 or 1)
y_pred = (y_prob > 0.5).astype(int)

# Compute confusion matrix
cm = confusion_matrix(y, y_pred)

# Extract values from the confusion matrix
tn, fp, fn, tp = cm.ravel()

# Compute overall fraction of correct predictions (accuracy)
accuracy = accuracy_score(y, y_pred)

# Print confusion matrix and accuracy
print("Confusion Matrix:")
print(pd.DataFrame(cm, columns=['Predicted 0', 'Predicted 1'], index=['Actual 0', 'Actual 1']))
print("\nOverall Fraction of Correct Predictions (Accuracy):", accuracy)
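
To describe the types of mistakes, it helps to turn the four counts `tn, fp, fn, tp` into per-class rates. The sketch below does this for hypothetical counts (they are placeholders, not the output of the cell above); `describe_confusion` is a helper name introduced here. Plugging in the real counts shows how often the model is right on `Up` weeks versus `Down` weeks, and how often it predicts `Up` overall.

```python
def describe_confusion(tn, fp, fn, tp):
    """Summarise a 2x2 confusion matrix where the positive class is 'Up'."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        # Share of actual Up weeks that were predicted Up
        "up_recall": tp / (tp + fn),
        # Share of actual Down weeks that were predicted Down
        "down_recall": tn / (tn + fp),
        # How often the model predicts Up at all
        "predicted_up_share": (tp + fp) / total,
    }

# Hypothetical counts for illustration only (assumption)
rates = describe_confusion(tn=10, fp=40, fn=5, tp=45)
for name, value in rates.items():
    print(name, round(value, 3))
```

A large gap between `up_recall` and `down_recall` would indicate that most of the errors are false positives, that is, `Down` weeks predicted as `Up`.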

d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

In [None]:
# Filter data for the training period (1990 to 2008)
train_data = Weekly[(Weekly['Year'] >= 1990) & (Weekly['Year'] <= 2008)]

# Filter data for the test period (2009 and 2010)
test_data = Weekly[(Weekly['Year'] == 2009) | (Weekly['Year'] == 2010)]

# Extract predictors and response variables for training
X_train = train_data[['Lag2']]
y_train = train_data['Direction']

# Extract predictors and response variables for testing
X_test = test_data[['Lag2']]
y_test = test_data['Direction']

# Fit logistic regression model on the training data
# (note: scikit-learn's LogisticRegression applies an L2 penalty by default)
logreg_model = LogisticRegression()
logreg_model.fit(X_train, y_train)

# Get predicted probabilities on the test data
y_prob = logreg_model.predict_proba(X_test)[:, 1]

# Convert probabilities to binary predictions (0 or 1)
y_pred = (y_prob > 0.5).astype(int)

# Convert labels in y_test to numeric values (0 and 1)
y_true_numeric = y_test.map({'Down': 0, 'Up': 1})

# Compute confusion matrix
cm = confusion_matrix(y_true_numeric, y_pred)

# Extract values from the confusion matrix
tn, fp, fn, tp = cm.ravel()

# Compute overall fraction of correct predictions (accuracy)
accuracy = accuracy_score(y_true_numeric, y_pred)

# Print confusion matrix and accuracy
print("Confusion Matrix:")
print(pd.DataFrame(cm, columns=['Predicted Down', 'Predicted Up'], index=['Actual Down', 'Actual Up']))
print("\nOverall Fraction of Correct Predictions (Accuracy):", accuracy)
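
With a single predictor, thresholding the predicted probability at 0.5 is the same as checking on which side of one cut-off value `Lag2` falls. The sketch below illustrates this with hypothetical coefficients (an assumption, not the fitted values from `logreg_model`); `prob_up` is a helper name introduced here.

```python
import math

def prob_up(lag2, intercept, slope):
    """Logistic model: P(Direction = Up | Lag2)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * lag2)))

# Hypothetical coefficients for illustration only (assumption)
intercept, slope = 0.2, 0.06

# The probability equals 0.5 exactly where intercept + slope * lag2 = 0
boundary = -intercept / slope

labels = []
for lag2 in (-5.0, 0.0, 5.0):
    p = prob_up(lag2, intercept, slope)
    labels.append("Up" if p > 0.5 else "Down")
    print(lag2, round(p, 3), labels[-1])
```

With a positive slope, every week whose `Lag2` is above `boundary` is predicted `Up`, so the classifier reduces to a single threshold on `Lag2`.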

e) Repeat (d) using LDA.

In [None]:
# Filter data for the training period (1990 to 2008)
train_data = Weekly[(Weekly['Year'] >= 1990) & (Weekly['Year'] <= 2008)]

# Filter data for the test period (2009 and 2010)
test_data = Weekly[(Weekly['Year'] == 2009) | (Weekly['Year'] == 2010)]

# Extract predictors and response variables for training
X_train = train_data[['Lag2']]
y_train = train_data['Direction']

# Extract predictors and response variables for testing
X_test = test_data[['Lag2']]
y_test = test_data['Direction']

# Fit LDA model on the training data
lda_model = LDA()
lda_model.fit(X_train, y_train)

# Predictions on the test data
y_pred = lda_model.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Extract values from the confusion matrix
tn, fp, fn, tp = cm.ravel()

# Compute overall fraction of correct predictions (accuracy)
accuracy = accuracy_score(y_test, y_pred)

# Print confusion matrix and accuracy
print("Confusion Matrix:")
print(pd.DataFrame(cm, columns=['Predicted Down', 'Predicted Up'], index=['Actual Down', 'Actual Up']))
print("\nOverall Fraction of Correct Predictions (Accuracy):", accuracy)
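
For one predictor, LDA assigns an observation to the class with the largest discriminant score $\delta_k(x) = x\,\mu_k/\sigma^2 - \mu_k^2/(2\sigma^2) + \log \pi_k$, where $\sigma^2$ is the pooled variance. The sketch below implements that rule on a tiny made-up sample (not `Weekly`); the function names are hypothetical and this is not how scikit-learn computes the fit internally.

```python
import math

def fit_lda_1d(x, y):
    """Estimate class priors, class means and the pooled variance for one predictor."""
    classes = sorted(set(y))
    n = len(x)
    params = {}
    pooled_ss = 0.0
    for c in classes:
        xs = [xi for xi, yi in zip(x, y) if yi == c]
        mean = sum(xs) / len(xs)
        pooled_ss += sum((xi - mean) ** 2 for xi in xs)
        params[c] = {"prior": len(xs) / n, "mean": mean}
    # Pooled variance: within-class sum of squares over (n - number of classes)
    variance = pooled_ss / (n - len(classes))
    return params, variance

def predict_lda_1d(x0, params, variance):
    """Return the class with the largest linear discriminant score at x0."""
    def score(p):
        return (x0 * p["mean"] / variance
                - p["mean"] ** 2 / (2 * variance)
                + math.log(p["prior"]))
    return max(params, key=lambda c: score(params[c]))

# Toy data for illustration only (assumption)
x = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
y = ["Down", "Down", "Down", "Up", "Up", "Up"]

params, variance = fit_lda_1d(x, y)
pred_low = predict_lda_1d(-0.5, params, variance)
pred_high = predict_lda_1d(0.5, params, variance)
print(pred_low, pred_high)
```

Because the score is linear in `x`, the decision rule is again a single cut-off on `Lag2`, which is why LDA and logistic regression often give very similar predictions with one predictor.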

f) Repeat (d) using QDA.

In [None]:
# Filter data for the training period (1990 to 2008)
train_data = Weekly[(Weekly['Year'] >= 1990) & (Weekly['Year'] <= 2008)]

# Filter data for the test period (2009 and 2010)
test_data = Weekly[(Weekly['Year'] == 2009) | (Weekly['Year'] == 2010)]

# Extract predictors and response variables for training
X_train = train_data[['Lag2']]
y_train = train_data['Direction']

# Extract predictors and response variables for testing
X_test = test_data[['Lag2']]
y_test = test_data['Direction']

# Fit QDA model on the training data
qda_model = QDA()
qda_model.fit(X_train, y_train)

# Predictions on the test data
y_pred = qda_model.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Extract values from the confusion matrix
tn, fp, fn, tp = cm.ravel()

# Compute overall fraction of correct predictions (accuracy)
accuracy = accuracy_score(y_test, y_pred)

# Print confusion matrix and accuracy
print("Confusion Matrix:")
print(pd.DataFrame(cm, columns=['Predicted Down', 'Predicted Up'], index=['Actual Down', 'Actual Up']))
print("\nOverall Fraction of Correct Predictions (Accuracy):", accuracy)

g) Repeat (d) using KNN with K = 1.

In [None]:
# Filter data for the training period (1990 to 2008)
train_data = Weekly[(Weekly['Year'] >= 1990) & (Weekly['Year'] <= 2008)]

# Filter data for the test period (2009 and 2010)
test_data = Weekly[(Weekly['Year'] == 2009) | (Weekly['Year'] == 2010)]

# Extract predictors and response variables for training
X_train = train_data[['Lag2']]
y_train = train_data['Direction']

# Extract predictors and response variables for testing
X_test = test_data[['Lag2']]
y_test = test_data['Direction']

# Fit KNN model on the training data with K=1
knn_model = KNeighborsClassifier(n_neighbors=1)
knn_model.fit(X_train, y_train)

# Predictions on the test data
y_pred = knn_model.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Extract values from the confusion matrix
tn, fp, fn, tp = cm.ravel()

# Compute overall fraction of correct predictions (accuracy)
accuracy = accuracy_score(y_test, y_pred)

# Print confusion matrix and accuracy
print("Confusion Matrix:")
print(pd.DataFrame(cm, columns=['Predicted Down', 'Predicted Up'], index=['Actual Down', 'Actual Up']))
print("\nOverall Fraction of Correct Predictions (Accuracy):", accuracy)

h) Repeat (d) using naive Bayes.

In [None]:
# Filter data for the training period (1990 to 2008)
train_data = Weekly[(Weekly['Year'] >= 1990) & (Weekly['Year'] <= 2008)]

# Filter data for the test period (2009 and 2010)
test_data = Weekly[(Weekly['Year'] == 2009) | (Weekly['Year'] == 2010)]

# Extract predictors and response variables for training
X_train = train_data[['Lag2']]
y_train = train_data['Direction']

# Extract predictors and response variables for testing
X_test = test_data[['Lag2']]
y_test = test_data['Direction']

# Fit Naive Bayes model on the training data
naive_bayes_model = GaussianNB()
naive_bayes_model.fit(X_train, y_train)

# Predictions on the test data
y_pred = naive_bayes_model.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Extract values from the confusion matrix
tn, fp, fn, tp = cm.ravel()

# Compute overall fraction of correct predictions (accuracy)
accuracy = accuracy_score(y_test, y_pred)

# Print confusion matrix and accuracy
print("Confusion Matrix:")
print(pd.DataFrame(cm, columns=['Predicted Down', 'Predicted Up'], index=['Actual Down', 'Actual Up']))
print("\nOverall Fraction of Correct Predictions (Accuracy):", accuracy)
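
With only `Lag2` as a predictor there is no independence assumption left to make, so Gaussian naive Bayes and the QDA model from (f) both amount to fitting one normal density per class and applying Bayes' rule. The sketch below shows that calculation with hypothetical priors, means and variances (assumptions, not estimates from `Weekly`); small differences in how each implementation estimates the variance can still remain.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x0, class_stats):
    """class_stats maps label -> (prior, mean, variance); returns P(label | x0)."""
    joint = {c: prior * gaussian_pdf(x0, mean, var)
             for c, (prior, mean, var) in class_stats.items()}
    total = sum(joint.values())
    return {c: value / total for c, value in joint.items()}

# Hypothetical class summaries for illustration only (assumption)
stats = {"Down": (0.45, -0.1, 4.0), "Up": (0.55, 0.3, 6.0)}
post = posterior(1.0, stats)
print({label: round(p, 3) for label, p in post.items()})

# Sanity check: symmetric classes give a 50/50 posterior at the midpoint
symmetric = posterior(0.0, {"A": (0.5, -1.0, 1.0), "B": (0.5, 1.0, 1.0)})
```

The predicted class is simply the label with the larger posterior probability.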

i) Which of these methods appears to provide the best results on this data?

*Judged by test accuracy (equivalently, test error rate), logistic regression and Linear Discriminant Analysis performed best, each with an accuracy of 62.5% (test error of 37.5%).*

j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

Example: Logistic regression with interaction

In [None]:
# Convert 'Direction' to numeric format
label_encoder = LabelEncoder()
Weekly['Direction_numeric'] = label_encoder.fit_transform(Weekly['Direction'])

# Split the data into training (1990 to 2008) and held-out test (2009 and 2010) sets, as in (d)
train_data = Weekly[Weekly['Year'] <= 2008]
test_data = Weekly[Weekly['Year'] >= 2009]

# Fit logistic regression model with interaction term on the training data
formula = 'Direction_numeric ~ Lag2 * Lag4'
log_fit_interaction = sm.Logit.from_formula(formula, data=train_data).fit()

# Predict probabilities on the test data
log_probs_interaction = log_fit_interaction.predict(test_data)

# Convert probabilities to binary predictions (0 or 1)
log_pred = (log_probs_interaction > 0.5).astype(int)

# Decode numeric predictions back to original labels
log_pred_labels = label_encoder.inverse_transform(log_pred)

# Create a confusion matrix
cm = confusion_matrix(test_data['Direction'], log_pred_labels)

# Print confusion matrix
print("Confusion Matrix:")
print(pd.DataFrame(cm, columns=['Predicted Down', 'Predicted Up'], index=['Actual Down', 'Actual Up']))
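
In a formula, `Lag2 * Lag4` expands to the two main effects plus their product, `Lag2 + Lag4 + Lag2:Lag4`. The sketch below spells out the resulting linear predictor and shows that the effect of `Lag2` now depends on `Lag4`. The coefficients are hypothetical (an assumption, not the fitted values), and both helper names are introduced here for illustration.

```python
def linear_predictor(lag2, lag4, coefs):
    """coefs = (intercept, b_lag2, b_lag4, b_interaction)."""
    b0, b2, b4, b24 = coefs
    return b0 + b2 * lag2 + b4 * lag4 + b24 * lag2 * lag4

def lag2_slope(lag4, coefs):
    """Change in the log-odds per unit of Lag2, at a given value of Lag4."""
    _, b2, _, b24 = coefs
    return b2 + b24 * lag4

# Hypothetical coefficients for illustration only (assumption)
coefs = (0.2, 0.05, -0.02, 0.01)

eta = linear_predictor(1.0, 2.0, coefs)
slope_low = lag2_slope(-2.0, coefs)
slope_high = lag2_slope(2.0, coefs)
print(round(eta, 3), round(slope_low, 3), round(slope_high, 3))
```

Without the interaction term the two slopes would be identical, so comparing them is a quick way to see what the extra coefficient adds.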

Example: KNN, $K = 10$

In [None]:
# Filter data for the training period (1990 to 2008)
train_data = Weekly[(Weekly['Year'] >= 1990) & (Weekly['Year'] <= 2008)]

# Filter data for the test period (2009 and 2010)
test_data = Weekly[(Weekly['Year'] == 2009) | (Weekly['Year'] == 2010)]

# Extract predictors and response variables for training
X_train = train_data[['Lag2']]
y_train = train_data['Direction']

# Extract predictors and response variables for testing
X_test = test_data[['Lag2']]
y_test = test_data['Direction']

# Fit KNN model on the training data with K=10
knn_model = KNeighborsClassifier(n_neighbors=10)
knn_model.fit(X_train, y_train)

# Predictions on the test data
y_pred = knn_model.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Extract values from the confusion matrix
tn, fp, fn, tp = cm.ravel()

# Compute overall fraction of correct predictions (accuracy)
accuracy = accuracy_score(y_test, y_pred)

# Print confusion matrix and accuracy
print("Confusion Matrix:")
print(pd.DataFrame(cm, columns=['Predicted Down', 'Predicted Up'], index=['Actual Down', 'Actual Up']))
print("\nOverall Fraction of Correct Predictions (Accuracy):", accuracy)
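
Since the question asks us to experiment with values of K, one option is to loop over several candidates and compare held-out accuracy. The sketch below does this with a hand-written one-dimensional KNN on toy data (made up for illustration, not `Weekly`); `knn_predict` and `accuracy_for_k` are hypothetical helpers, and the same loop could wrap `KNeighborsClassifier(n_neighbors=k)` instead.

```python
from collections import Counter

def knn_predict(x_train, y_train, x0, k):
    """Majority vote among the k training points closest to x0."""
    nearest = sorted(zip(x_train, y_train), key=lambda pair: abs(pair[0] - x0))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy_for_k(x_train, y_train, x_test, y_test, k):
    """Fraction of test points classified correctly with k neighbours."""
    hits = sum(knn_predict(x_train, y_train, x0, k) == y0
               for x0, y0 in zip(x_test, y_test))
    return hits / len(y_test)

# Toy data for illustration only (assumption)
x_train = [-3.0, -2.0, -1.0, 0.5, 1.0, 2.0, 3.0]
y_train = ["Down", "Down", "Up", "Down", "Up", "Up", "Up"]
x_test = [-2.5, -0.8, 0.4, 2.5]
y_test = ["Down", "Down", "Up", "Up"]

# Odd values of K avoid tied votes between the two classes
results = {k: accuracy_for_k(x_train, y_train, x_test, y_test, k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)
print(results, "best K:", best_k)
```

Keep in mind that picking K by its accuracy on the held-out data makes that accuracy an optimistic estimate, because the same data were used to choose the model.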