# 6.3: Classification Exercises

## Getting Started

### Import Libraries 

We import our standard libraries and specific objects/libraries at the top level of our notebook.

In [None]:
# Import libraries and objects
import warnings

import numpy as np
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from sklearn.neighbors import KNeighborsClassifier
from ISLP import load_data, confusion_table
from ISLP.models import (ModelSpec as MS,
                         summarize)

warnings.filterwarnings('ignore') # mute warning messages

First, load our `Smarket` data.

In [None]:
Smarket = load_data('Smarket')
Smarket

We can view the variable names.

In [None]:
Smarket.columns

### Logistic Regression

We will fit a logistic regression model in order to predict `Direction` using `Lag1` through `Lag5` and `Volume`. The `sm.GLM()` function fits generalized linear models, a class of models that includes logistic regression. Alternatively, the function `sm.Logit()` fits a logistic regression model directly. The syntax of `sm.GLM()` is similar to that of `sm.OLS()`, except that we use the argument `family=sm.families.Binomial()` in order to tell `statsmodels` to run a logistic regression rather than some other type of generalized linear model.

In [None]:
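# Use every column except Today, Direction, and Year as a predictor
allvars = Smarket.columns.drop(['Today', 'Direction', 'Year'])
design = MS(allvars)
X = design.fit_transform(Smarket)   # design matrix, including an intercept column
y = Smarket.Direction == 'Up'       # boolean response: True when the market went up

# Fit the model
glm = sm.Logit(y, X)

# Or equivalently:
# glm = sm.GLM(y,
#              X,
#              family=sm.families.Binomial())
results = glm.fit()
summarize(results)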

The column labelled `P>|z|` gives the $p$-values associated with each variable. Recall that a small $p$-value
is evidence against the null hypothesis that there is no association between the response and
the predictor variable. **Is there evidence of an association between any of the predictor variables and the response?
If so, which ones?**

The smallest $p$-value here is associated with `Lag1`. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. 

We use the `params` attribute of `results` to access just the coefficients for this fitted model.

In [None]:
results.params

Likewise, we can use the `pvalues` attribute to access the $p$-values for the coefficients.

In [None]:
results.pvalues
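
To make the link between a coefficient and its $p$-value concrete, here is a minimal sketch assuming the usual Wald test, in which the $z$-statistic is the coefficient divided by its standard error and the $p$-value is the two-sided tail probability under a standard normal distribution. The helper `two_sided_p_value` and the numbers `coef` and `std_err` are hypothetical and chosen only for illustration; they are not the fitted `Smarket` values.

```python
import math

def two_sided_p_value(z):
    """Two-sided tail probability P(|Z| > |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical coefficient and standard error (illustration only)
coef, std_err = -0.07, 0.05

z = coef / std_err          # Wald z-statistic
p = two_sided_p_value(z)    # corresponding two-sided p-value
print(round(z, 2), round(p, 3))
```

A $z$-statistic near $\pm 1.96$ corresponds to a $p$-value of about 0.05, which is why that value is a common rule of thumb.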

The `predict()` method of `results` can be used to predict the probability that the market will go up, given values of the predictors. This method returns predictions on the probability scale. If no data set is supplied to `predict()`, the probabilities are computed for the training data that was used to fit the logistic regression model. As with linear regression, one can pass an optional `exog` argument consistent with a design matrix if desired. Here we print only the first ten probabilities.

In [None]:
probs = results.predict()
probs[:10]
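
As a rough illustration of what "the probability scale" means, the sketch below passes a linear predictor $X\beta$ through the logistic function, which is the transformation a logistic regression uses to produce probabilities. The names `logistic`, `X_demo`, and `beta_demo` are hypothetical, and the coefficients are made up rather than taken from the fitted model.

```python
import numpy as np

def logistic(eta):
    """Map a linear predictor eta to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical design matrix: an intercept column plus two predictors
X_demo = np.array([[1.0, 0.5, -1.2],
                   [1.0, -0.3, 0.8],
                   [1.0, 0.0, 0.0]])

# Hypothetical coefficients (not the fitted Smarket values)
beta_demo = np.array([0.1, -0.4, 0.2])

# Probabilities are the logistic function of the linear predictor
probs_demo = logistic(X_demo @ beta_demo)
print(probs_demo)
```

A linear predictor of zero maps to a probability of exactly 0.5, so the sign of $X\beta$ determines which side of 0.5 a prediction falls on.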

In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, `Up` or `Down`. The following two commands create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5.

In [None]:
labels = np.array(['Down']*1250)
labels[probs>0.5] = "Up"
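
The cell above hard-codes the number of observations (1250). As an alternative sketch, `np.where` builds the label vector in one step and adapts to however many probabilities it is given. The array `probs_demo` is a hypothetical stand-in for `probs`.

```python
import numpy as np

# Hypothetical predicted probabilities standing in for `probs`
probs_demo = np.array([0.48, 0.52, 0.50, 0.61])

# 'Up' where the probability exceeds 0.5, 'Down' otherwise
labels_demo = np.where(probs_demo > 0.5, 'Up', 'Down')
print(labels_demo)
```

Note that a probability of exactly 0.5 is labelled `Down` here, matching the strict `>` comparison used above.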

The `confusion_table()` function from the `ISLP` package summarizes these predictions, showing how many observations were correctly or incorrectly classified. Our function, which is adapted from a similar function in the module `sklearn.metrics`, transposes the resulting matrix and includes row and column labels. The `confusion_table()` function takes as first argument the predicted labels, and second argument the true labels.

In [None]:
confusion_table(labels, Smarket.Direction)
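
To see what such a table counts, here is a minimal NumPy sketch that tallies predictions against the truth by hand. It is not the `ISLP` implementation; the function `confusion_counts` and the toy label lists are hypothetical. Rows index the predicted class and columns the true class, following the orientation described above.

```python
import numpy as np

def confusion_counts(predicted, truth, classes=('Down', 'Up')):
    """Count (predicted, true) label pairs; rows = predicted, columns = truth."""
    predicted = np.asarray(predicted)
    truth = np.asarray(truth)
    table = np.zeros((len(classes), len(classes)), dtype=int)
    for i, p in enumerate(classes):
        for j, t in enumerate(classes):
            table[i, j] = np.sum((predicted == p) & (truth == t))
    return table

# Toy labels (illustration only)
pred_demo = ['Up', 'Up', 'Down', 'Up', 'Down', 'Down']
true_demo = ['Up', 'Down', 'Down', 'Up', 'Up', 'Down']

table_demo = confusion_counts(pred_demo, true_demo)
print(table_demo)
# The diagonal holds the correct predictions
print(np.trace(table_demo) / table_demo.sum())
```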

The diagonal elements of the confusion matrix indicate correct predictions, while the off-diagonal elements represent incorrect predictions. Hence our model correctly predicted that the market would go up on 507 days and that it would go down on 145 days, for a total of 507 + 145 = 652 correct predictions out of 1250. The `np.mean()` function can be used to compute the fraction of days for which the prediction was correct. In this case, logistic regression correctly predicted the movement of the market 52.2% of the time, so the training error rate is 100% - 52.2% = 47.8%.

In [None]:
print((507+145)/(145+141+457+507))
# or equally:
print(np.mean(labels == Smarket.Direction))

Now we can try predicting the outcomes of the test data. **Try this out yourselves! Find the confusion matrix and test error rate as well.**


**How does the training error rate compare to the test error rate?**


**Is logistic regression a good method for predicting the direction of the market? Why or why not?**

*These exercises were adapted from:* James, Gareth, et al. *An Introduction to Statistical Learning: with Applications in Python*. Springer, 2023.