# Exercise 13

This question should be answered using the `Weekly` data set, which
is part of the ISLP package. This data is similar in nature to the
Smarket data from this chapter’s lab, except that it contains 1,089
weekly returns for 21 years, from the beginning of 1990 to the end of
2010.

## Part A

Produce some numerical and graphical summaries of the Weekly
data. Do there appear to be any patterns?

In [1]:
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize)

In [2]:
from ISLP import confusion_table
from ISLP.models import contrast
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis as LDA, QuadraticDiscriminantAnalysis as QDA)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [3]:
Weekly = load_data('Weekly')
Weekly

Unnamed: 0,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
0,1990,0.816,1.572,-3.936,-0.229,-3.484,0.154976,-0.270,Down
1,1990,-0.270,0.816,1.572,-3.936,-0.229,0.148574,-2.576,Down
2,1990,-2.576,-0.270,0.816,1.572,-3.936,0.159837,3.514,Up
3,1990,3.514,-2.576,-0.270,0.816,1.572,0.161630,0.712,Up
4,1990,0.712,3.514,-2.576,-0.270,0.816,0.153728,1.178,Up
...,...,...,...,...,...,...,...,...,...
1084,2010,-0.861,0.043,-2.173,3.599,0.015,3.205160,2.969,Up
1085,2010,2.969,-0.861,0.043,-2.173,3.599,4.242568,1.281,Up
1086,2010,1.281,2.969,-0.861,0.043,-2.173,4.835082,0.283,Up
1087,2010,0.283,1.281,2.969,-0.861,0.043,4.454044,1.034,Up


In [4]:
Weekly.columns
print(len(Weekly))

1089


In [5]:
# Direction holds the strings 'Up'/'Down', so restrict the correlation to numeric columns
Weekly.corr(numeric_only=True)

`Volume` and `Year` appear to be highly correlated: trading volume increases over time.

In [None]:
Weekly.plot(y='Volume')

## Part B

Use the full data set to perform a logistic regression with `Direction` as the response and the five lag variables plus `Volume` as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

In [None]:
allvars = Weekly.columns.drop(['Direction', 'Today', 'Year'])

design = MS(allvars)

X = design.fit_transform(Weekly)
y = Weekly.Direction == 'Up'

glm = sm.GLM(y,
             X,
             family=sm.families.Binomial())

results = glm.fit()
summarize(results)

`Lag2` appears to be statistically significant, with a $\text{p-value}$ of $0.030$. The rest of the predictors do not appear to be statistically significant as they are all $\text{p-value} > 0.05$. 

## Part C

Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

In [None]:
probs = results.predict()

predicted_labels = np.array(['Down']*len(Weekly))
predicted_labels[probs > 0.5] = 'Up'

confusion_table(predicted_labels, Weekly.Direction)

In [None]:
np.mean(predicted_labels == Weekly.Direction)

- Error Rate = $1 - 0.5611 = 43.89\%$
- Correct Prediction Rate = $\frac{54+557}{1089}=56.11\%$
- Sensitivity = $\frac{557}{557+48}=92.07\%$
- Specificity = $\frac{54}{54+430}=11.16\%$
- False Positives = 430
- False Negatives = 48

About 92% of the weeks that actually went 'Up' were predicted correctly, so logistic regression does a good job when the market really did rise. However, the specificity is only 11.16%, meaning the model rarely predicted 'Down' on the weeks that actually went down. The model is heavily biased toward predicting 'Up'.
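The rates listed above follow directly from the four cells of the confusion matrix. As a minimal sketch, the helper below recomputes them from the quoted counts, treating 'Up' as the positive class. The name `rates_from_counts` is hypothetical and not part of ISLP.

```python
def rates_from_counts(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Summary rates for a 2x2 confusion matrix with 'Up' as the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "sensitivity": tp / (tp + fn),  # share of true 'Up' weeks predicted 'Up'
        "specificity": tn / (tn + fp),  # share of true 'Down' weeks predicted 'Down'
    }


# Counts taken from the full-data confusion matrix discussed above.
full_data_rates = rates_from_counts(tn=54, fp=430, fn=48, tp=557)
for name, value in full_data_rates.items():
    print(f"{name}: {value:.2%}")
```

The same helper can be reused for any of the later confusion tables by passing in their four counts.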

## Part D

Now fit the logistic regression model using a training data period from 1990 to 2008, with `Lag2` as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

In [None]:
train = (Weekly.Year <= 2008) # create mask
Weekly_train = Weekly.loc[train] # apply mask to full data set
Weekly_test = Weekly.loc[~train]

D = Weekly.Direction # work with direction column -- responses
L_train, L_test = D.loc[train], D.loc[~train] # Up/Down
y_train, y_test = [M == 'Up' for M in [L_train, L_test]] # boolean

Weekly_train, Weekly_test = [DF[['Lag2', 'Direction']] for DF in [Weekly_train, Weekly_test]]

print(f"Train data shape: {Weekly_train.shape}")
print(f"Test data shape: {Weekly_test.shape}")


In [None]:
# Lag2 is the sole predictor
design = MS(['Lag2'])

X_train = design.fit_transform(Weekly_train)

glm = sm.GLM(y_train,
             X_train,
             family=sm.families.Binomial()) # logistic regression

results = glm.fit()
summarize(results)

In [None]:
def _get_corr_pred(conf_table: pd.DataFrame) -> float:
    return np.diag(conf_table).sum() / conf_table.sum().sum()

In [None]:
# transform test dataframe to modelspec
X_test = design.transform(Weekly_test) # modelspec ensures we only work with Lag2
predicted_values = results.predict(X_test)

# generate prediction labels
predicted_labels = np.array(['Down']*len(X_test))
predicted_labels[predicted_values > 0.5] = 'Up'

log_reg_conf_table = confusion_table(predicted_labels, Weekly_test.Direction)
log_reg_corr_pred = _get_corr_pred(log_reg_conf_table)

display(log_reg_conf_table)
display(log_reg_corr_pred)

- Sensitivity: $\frac{56}{5+56}=91.8\%$
- Specificity: $\frac{9}{9+34}=20.93\%$

High sensitivity (true positive rate) combined with low specificity (true negative rate) means that the model is biased toward predicting 'Up', even in cases where the true value is 'Down'.
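One way to put the held-out accuracy in context is to compare it with a trivial rule that always predicts 'Up'. The sketch below uses only the four counts that appear in the sensitivity and specificity fractions. The baseline comparison is an addition for illustration, not something the exercise asks for.

```python
# Held-out confusion counts implied by the sensitivity and specificity fractions.
true_down_pred_down, true_down_pred_up = 9, 34
true_up_pred_down, true_up_pred_up = 5, 56

n_test = (true_down_pred_down + true_down_pred_up
          + true_up_pred_down + true_up_pred_up)

# Fraction of held-out weeks the fitted model classifies correctly.
model_accuracy = (true_down_pred_down + true_up_pred_up) / n_test

# Fraction a constant "always Up" rule would classify correctly.
always_up_accuracy = (true_up_pred_down + true_up_pred_up) / n_test

print(f"held-out weeks:     {n_test}")
print(f"model accuracy:     {model_accuracy:.2%}")
print(f"always-Up accuracy: {always_up_accuracy:.2%}")
```

Under these counts the model is ahead of the constant rule by only a few percentage points, which is worth keeping in mind when comparing methods in the later parts.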

## Part E

Repeat (d) using LDA.

In [None]:
lda = LDA(store_covariance=True)

In [None]:
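# Drop the constant 'intercept' column added by ModelSpec; it carries no information for LDA
X_train_lda, X_test_lda = [M.drop(columns=['intercept']) for M in [X_train, X_test]]
lda.fit(X_train_lda, L_train)
lda_pred = lda.predict(X_test_lda)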

lda_conf_table = confusion_table(lda_pred, L_test)
lda_corr_pred = _get_corr_pred(lda_conf_table)

display(lda_conf_table)
display(lda_corr_pred)

## Part F

Repeat (d) using QDA.

In [None]:
qda = QDA(store_covariance=True)

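# Drop the constant 'intercept' column added by ModelSpec; a constant feature
# makes the per-class covariance matrices singular
X_train_qda, X_test_qda = [M.drop(columns=['intercept']) for M in [X_train, X_test]]
qda.fit(X_train_qda, L_train)

display(qda.means_)
display(qda.priors_)

qda_pred = qda.predict(X_test_qda)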

qda_conf_table = confusion_table(qda_pred, L_test)
qda_corr_pred = _get_corr_pred(qda_conf_table) 

display(qda_conf_table)
display(qda_corr_pred)

## Part G

Repeat (d) using KNN with $K = 1$.

In [None]:
knn1 = KNeighborsClassifier(n_neighbors=1)
knn1.fit(X_train, L_train)

knn1_pred = knn1.predict(X_test)

knn1_conf_table = confusion_table(knn1_pred, L_test)
knn1_corr_pred = _get_corr_pred(knn1_conf_table)

display(knn1_conf_table)
display(knn1_corr_pred)


## Part H

Repeat (d) using naive Bayes.

In [None]:
NB = GaussianNB()

NB.fit(X_train, L_train)
nb_labels = NB.predict(X_test)

nb_conf_table = confusion_table(nb_labels, L_test)
nb_corr_pred = _get_corr_pred(nb_conf_table)

display(nb_conf_table)
display(nb_corr_pred)

## Part I

Which of these methods appears to provide the best results on the data?

Linear Discriminant Analysis (LDA) and logistic regression both have a correct prediction rate of $62.5\%$ on the held-out data, the highest of the methods tried, so they tie for the best results.

## Part J

Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

In [None]:
def find_best_k(X_train, X_test, L_train, L_test):
    max_corr_pred = 0
    k_val = 0 
    for k in range(1, 100):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, L_train)

        knn_pred = knn.predict(X_test)

        knn_conf_table = confusion_table(knn_pred, L_test)
        knn_corr_pred = _get_corr_pred(knn_conf_table)

        print(f"k = {k}, correct prediction rate = {knn_corr_pred}")
        if knn_corr_pred > max_corr_pred:
            max_corr_pred, k_val = knn_corr_pred, k

    return k_val, max_corr_pred

find_best_k(X_train, X_test, L_train, L_test)

Using `Lag2` as the sole predictor, the best KNN classifier over $k = 1, \dots, 99$ reaches a correct prediction rate of $61.54\%$ on the held-out data, which does not beat the LDA / logistic regression rate of $62.5\%$.
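To see how large the gap between these two rates is in absolute terms, they can be converted back into counts of correctly classified weeks. This is a rough sketch that assumes the held-out set has 104 weeks, which is the total of the four confusion counts reported in Part D.

```python
n_test = 104  # assumption: 9 + 34 + 5 + 56 from the Part D confusion counts

# Convert the reported rates back into numbers of correctly classified weeks.
knn_correct = round(0.6154 * n_test)
lda_logreg_correct = round(0.625 * n_test)

print(f"best KNN correct:   {knn_correct} of {n_test}")
print(f"LDA/logreg correct: {lda_logreg_correct} of {n_test}")
print(f"difference:         {lda_logreg_correct - knn_correct} week(s)")
```

Under that assumption the two approaches differ by a single held-out week, so the ranking between them should probably be read as tentative rather than decisive.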