# 6.3: Classification Exercises

## Getting Started

### Import Libraries 

We import our standard libraries and specific objects/libraries at the top level of our notebook.

In [None]:
# Import libraries and objects
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize)
import warnings 
warnings.filterwarnings('ignore') # mute warning messages
from ISLP import confusion_table
from ISLP.models import contrast
from sklearn.discriminant_analysis import \
     (LinearDiscriminantAnalysis as LDA,
      QuadraticDiscriminantAnalysis as QDA)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

First, load our `Smarket` data.

In [None]:
Smarket = load_data('Smarket')
Smarket

We can view the variable names.

In [None]:
Smarket.columns

### Logistic Regression

We will fit a logistic regression model in order to predict `Direction` using `Lag1` through `Lag5` and `Volume`. The `sm.GLM()` function fits generalized linear models, a class of models that includes logistic regression. Alternatively, the function `sm.Logit()` fits a logistic regression model directly. The syntax of `sm.GLM()` is similar to that of `sm.OLS()`, except that we use the argument `family=sm.families.Binomial()` in order to tell `statsmodels` to run a logistic regression rather than some other type of generalized linear model.

In [None]:
allvars = Smarket.columns.drop(['Today', 'Direction', 'Year'])
design = MS(allvars)
X = design.fit_transform(Smarket)
y = Smarket.Direction == 'Up'
glm = sm.GLM(y,
             X,
             family=sm.families.Binomial())
results = glm.fit()
summarize(results)

The column labelled `Pr(>|z|)` gives the $p$-value associated with each variable. Recall that a small $p$-value
is evidence against the null hypothesis that there is no association between the response and
that predictor variable. **Is there evidence of an association between any of the predictor variables and the response?
If so, which ones?**

The smallest $p$-value here is associated with `Lag1`. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. 

We use the `params` attribute of `results` in order to access just the coefficients for this fitted model.

In [None]:
results.params

Likewise we can use the `pvalues` attribute to access the $p$-values for the coefficients.

In [None]:
results.pvalues

The `predict()` method of `results` can be used to predict the probability that the market will go up, given values of the predictors. This method returns predictions on the probability scale. If no data set is supplied to the `predict()` method, then the probabilities are computed for the training data that was used to fit the logistic regression model. As with linear regression, one can pass an optional `exog` argument consistent with a design matrix if desired. Here we print only the first ten probabilities.

In [None]:
probs = results.predict()
probs[:10]
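To make the probability scale concrete, the following sketch shows how a logistic regression turns a design matrix and a coefficient vector into probabilities. The helper `logistic_probs`, the tiny design matrix, and the coefficient values are all made up for illustration; they are not the `Smarket` estimates.

```python
import numpy as np

def logistic_probs(X, beta):
    """Return P(y = 1 | X) for a logistic regression with coefficients beta."""
    linear_predictor = X @ beta                      # log-odds for each row
    return 1.0 / (1.0 + np.exp(-linear_predictor))   # inverse logit

# Hypothetical design matrix: an intercept column plus one predictor
X_demo = np.array([[1.0, -1.0],
                   [1.0,  0.0],
                   [1.0,  2.0]])
beta_demo = np.array([0.1, -0.5])                    # made-up coefficients

probs_demo = logistic_probs(X_demo, beta_demo)
print(probs_demo)
```

Because the made-up slope is negative, the probability falls as the predictor grows. Applied to the design matrix `X` and `results.params` from the cells above, this calculation should agree with `results.predict()` up to floating-point error, since the binomial family uses the logit link by default.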

In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, `Up` or `Down`. The following two commands create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5.

In [None]:
labels = np.array(['Down']*1250)
labels[probs>0.5] = "Up"
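The same thresholding can also be written in a single step with `np.where`, which does not need the number of observations to be typed in. This is a small sketch on made-up probabilities; `probs_demo` and `labels_demo` are illustrative names, not objects from the notebook.

```python
import numpy as np

# Made-up probabilities standing in for the output of results.predict()
probs_demo = np.array([0.51, 0.48, 0.50, 0.62, 0.39])

# Label 'Up' when the probability exceeds 0.5, otherwise 'Down'
labels_demo = np.where(probs_demo > 0.5, 'Up', 'Down')
print(labels_demo)
```

Note that a probability of exactly 0.5 is labelled `Down` here, matching the strict inequality used above.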

The `confusion_table()` function from the `ISLP` package summarizes these predictions, showing how many observations were correctly or incorrectly classified. Our function, which is adapted from a similar function in the module `sklearn.metrics`, transposes the resulting matrix and includes row and column labels. The `confusion_table()` function takes as first argument the predicted labels, and second argument the true labels.

In [None]:
confusion_table(labels, Smarket.Direction)

The diagonal elements of the confusion matrix indicate correct predictions, while the off-diagonals represent incorrect predictions. Hence our model correctly predicted that the market would go up on 507 days and that it would go down on 145 days, for a total of 507 + 145 = 652 correct predictions. The `np.mean()` function can be used to compute the fraction of days for which the prediction was correct. In this case, logistic regression correctly predicted the movement of the market 52.2% of the time, so the training error rate is 100% - 52.2% = 47.8%.

In [None]:
(507+145)/1250, np.mean(labels == Smarket.Direction)
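To see where the table entries and the accuracy come from, here is a minimal sketch that builds a confusion table by hand with NumPy. The labels are toy values, not the `Smarket` results, and the layout (rows for predicted class, columns for true class) is assumed to mirror what `confusion_table()` displays.

```python
import numpy as np

# Toy predicted and true labels (not the Smarket results)
predicted = np.array(['Up', 'Up', 'Down', 'Up', 'Down', 'Up'])
truth = np.array(['Up', 'Down', 'Down', 'Up', 'Up', 'Up'])

classes = ['Down', 'Up']
# Rows index the predicted class, columns index the true class
table = np.array([[np.sum((predicted == p) & (truth == t)) for t in classes]
                  for p in classes])

accuracy = np.trace(table) / table.sum()   # correct predictions lie on the diagonal
error_rate = 1 - accuracy

print(table)
print(accuracy, error_rate)
```

The diagonal sum divided by the total count gives the same number as `np.mean(predicted == truth)`, which is the shortcut used in the cell above.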

Now we can try predicting the outcomes of the test data. **Try this out yourselves! Find the confusion matrix and test error rate as well.**


**How does the training error rate compare to the test error rate?**


**Is the logistic regression method good at predicting the direction of the market? Why or why not?
Use the training/testing error rate to support your answer.**

### Linear Discriminant Analysis

We begin by performing LDA on the `Smarket` data, using the function `LinearDiscriminantAnalysis()`, which we have abbreviated `LDA()`. We fit the model using only the observations before 2005.

In [None]:
# Previous code to set up for LDA
model = MS(['Lag1', 'Lag2']).fit(Smarket)

X = model.transform(Smarket)
D = Smarket.Direction
train = (Smarket.Year < 2005)
L_train, L_test = D.loc[train], D.loc[~train]

X_train, X_test = X.loc[train], X.loc[~train]
y_train, y_test = y.loc[train], y.loc[~train]

glm_train = sm.GLM(y_train,
                   X_train,
                   family=sm.families.Binomial())
results = glm_train.fit()
probs = results.predict(exog=X_test)

labels = np.array(['Down']*252)
labels[probs>0.5] = 'Up'
confusion_table(labels, L_test)

newdata = pd.DataFrame({'Lag1':[1.2, 1.5],
                        'Lag2':[1.1, -0.8]})
newX = model.transform(newdata)
results.predict(newX)

In [None]:
lda = LDA(store_covariance=True)

X_train, X_test = [M.drop(columns=['intercept'])
                   for M in [X_train, X_test]]
lda.fit(X_train, L_train)

In [None]:
# Extract the means in the two classes 
lda.means_

In [None]:
# Class labels, in the order used by means_ and priors_
lda.classes_

In [None]:
# Estimated prior probabilities for each class
lda.priors_

The LDA output indicates that $\hat\pi_{Down}=0.492$ and
$\hat\pi_{Up}=0.508$.
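When no priors are passed to `LDA()`, the estimated priors are the class proportions in the training labels. The sketch below reproduces that calculation on a handful of toy labels; `labels_demo` is a made-up array, not the `Smarket` training set.

```python
import numpy as np

# Toy training labels (not the Smarket data)
labels_demo = np.array(['Down', 'Up', 'Up', 'Down', 'Up'])

# np.unique returns the sorted class labels and how often each occurs
classes, counts = np.unique(labels_demo, return_counts=True)
priors = counts / counts.sum()

for c, p in zip(classes, priors):
    print(c, p)
```

Running the same two lines on `L_train` should give values matching `lda.priors_`, in the order listed by `lda.classes_`.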

The LDA and logistic regression predictions are almost identical.

In [None]:
lda_pred = lda.predict(X_test)

confusion_table(lda_pred, L_test)

**Try fitting an LDA model using predictor variables of the `Smarket` data of your choice. Discuss
the results.**

### Quadratic Discriminant Analysis

We will now fit a QDA model to the `Smarket` data. QDA is
implemented via `QuadraticDiscriminantAnalysis()`
in the `sklearn` package, which we abbreviate to `QDA()`.
The syntax is very similar to `LDA()`.

In [None]:
qda = QDA(store_covariance=True)
qda.fit(X_train, L_train)

In [None]:
# Compute means and priors
qda.means_, qda.priors_

In [None]:
# Estimated covariance matrix for the first class (Down)
qda.covariance_[0]

In [None]:
# Predict
qda_pred = qda.predict(X_test)
confusion_table(qda_pred, L_test)

The QDA predictions are accurate almost 60% of the time, even though the 2005 data were not used to fit the model. The test error rate of the QDA model is therefore about 40%.

In [None]:
np.mean(qda_pred == L_test)

### Naive Bayes

Next, we fit a naive Bayes model to the `Smarket` data. The syntax is
similar to that of `LDA()` and `QDA()`. By
default, the `GaussianNB()` implementation of the naive Bayes classifier models each
quantitative feature using a Gaussian distribution. However, a kernel
density method can also be used to estimate the distributions.

In [None]:
NB = GaussianNB()
NB.fit(X_train, L_train)

NB.classes_

In [None]:
# Make predictions
nb_labels = NB.predict(X_test)
confusion_table(nb_labels, L_test)

Naive Bayes performs well on these data, with accurate predictions over 59% of the time. This is slightly worse than QDA, but much better than LDA. The test error rate of the naive Bayes model is therefore about 41%.

In [None]:
# Estimate probabilities
NB.predict_proba(X_test)[:5]
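The probabilities returned by `predict_proba()` can be understood through a by-hand version of the Gaussian naive Bayes calculation: within each class, every feature gets its own univariate Gaussian, and the class scores are combined with the priors. The sketch below is an illustration on toy data, with a hypothetical helper `gaussian_nb_posterior`; it omits the small variance-smoothing term that `GaussianNB()` adds, so results may differ very slightly from `sklearn`.

```python
import numpy as np

def gaussian_nb_posterior(X_train, y_train, x_new):
    """Posterior class probabilities for one observation under Gaussian naive Bayes."""
    classes = np.unique(y_train)
    log_scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        prior = Xc.shape[0] / X_train.shape[0]   # class proportion
        mean = Xc.mean(axis=0)                   # per-feature mean within the class
        var = Xc.var(axis=0)                     # per-feature variance within the class
        # Features are treated as independent, so the log-densities add up
        log_density = -0.5 * np.sum(np.log(2 * np.pi * var)
                                    + (x_new - mean) ** 2 / var)
        log_scores.append(np.log(prior) + log_density)
    log_scores = np.array(log_scores)
    # Subtract the maximum before exponentiating for numerical stability
    unnormalized = np.exp(log_scores - log_scores.max())
    return classes, unnormalized / unnormalized.sum()

# Toy data (not the Smarket data)
X_demo = np.array([[-1.0, 0.2], [-0.5, -0.3], [-0.8, 0.1],
                   [0.9, 0.4], [1.2, -0.1], [0.7, 0.3]])
y_demo = np.array(['Down', 'Down', 'Down', 'Up', 'Up', 'Up'])

classes, posterior = gaussian_nb_posterior(X_demo, y_demo, np.array([1.0, 0.0]))
print(dict(zip(classes, posterior)))
```

The columns of `NB.predict_proba(X_test)` follow the order of `NB.classes_`, just as `posterior` here follows `classes`.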

**Try fitting a naive Bayes model using predictor variables of the Smarket data of your choice.
Discuss the results.**

### K-Nearest Neighbors

We will now perform KNN using the `KNeighborsClassifier()` function. This function is similar
to the other model-fitting functions we've used throughout these exercises.

In [None]:
knn1 = KNeighborsClassifier(n_neighbors=1)
X_train, X_test = [np.asarray(X) for X in [X_train, X_test]]
knn1.fit(X_train, L_train)
knn1_pred = knn1.predict(X_test)
confusion_table(knn1_pred, L_test)

The results using $K=1$ are not very good, since only $50\%$ of the
observations are correctly predicted. Of course, it may be that $K=1$
results in an overly flexible fit to the data.

In [None]:
(83+43)/252, np.mean(knn1_pred == L_test)

As we can see, KNN with $K=1$ gives only 50% accuracy, which is no better than random chance.
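To illustrate how the choice of $K$ can be explored, the sketch below implements majority-vote KNN by hand in NumPy and loops over a few values of $K$ on synthetic two-class data. Everything here (the helper `knn_predict`, the simulated data, the split sizes, and the list of $K$ values) is an assumption made for illustration and is unrelated to the `Smarket` results.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Classify each test row by majority vote among its k nearest training rows."""
    # Pairwise Euclidean distances, shape (n_test, n_train)
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]   # indices of the k closest rows
    votes = y_train[nearest]                     # their labels (0 or 1)
    return (votes.mean(axis=1) > 0.5).astype(int)

rng = np.random.default_rng(0)

# Synthetic two-class data: class 0 centred at -1, class 1 centred at +1
n = 200
X_all = np.vstack([rng.normal(loc=-1.0, size=(n, 2)),
                   rng.normal(loc=1.0, size=(n, 2))])
y_all = np.repeat([0, 1], n)

# Random split into 300 training and 100 test observations
idx = rng.permutation(2 * n)
train_idx, test_idx = idx[:300], idx[300:]

accuracies = {}
for k in [1, 3, 5, 11]:                          # odd k avoids tied votes
    pred = knn_predict(X_all[train_idx], y_all[train_idx], X_all[test_idx], k)
    accuracies[k] = np.mean(pred == y_all[test_idx])
    print(k, accuracies[k])
```

In the notebook itself, the same loop can be written with `KNeighborsClassifier(n_neighbors=k)`, fitting on `X_train` and `L_train` and comparing the predictions with `L_test`.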

**Try running
KNN for several values of K and summarize the results for the best model you find.
Out of all the classification methods we tried, which performs best on the Smarket data? Give
some explanation for why that might be.**

*These exercises were adapted from:* James, Gareth, et al. *An Introduction to Statistical Learning: with Applications in Python*. Springer, 2023.