# Classification Exercises

## Getting Started

### Import Libraries 

We import our standard libraries and specific objects/libraries at the top level of our notebook.

In [1]:
# Import libraries and objects
import warnings

import numpy as np
from ISLP import confusion_table, load_data
from ISLP.models import ModelSpec as MS
from sklearn.neighbors import KNeighborsClassifier

warnings.filterwarnings('ignore')  # mute warning messages

First, load our `Smarket` data.

In [2]:
Smarket = load_data('Smarket')
Smarket

,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
0,2001,0.381,-0.192,-2.624,-1.055,5.010,1.19130,0.959,Up
1,2001,0.959,0.381,-0.192,-2.624,-1.055,1.29650,1.032,Up
2,2001,1.032,0.959,0.381,-0.192,-2.624,1.41120,-0.623,Down
3,2001,-0.623,1.032,0.959,0.381,-0.192,1.27600,0.614,Up
4,2001,0.614,-0.623,1.032,0.959,0.381,1.20570,0.213,Up
...,...,...,...,...,...,...,...,...,...
1245,2005,0.422,0.252,-0.024,-0.584,-0.285,1.88850,0.043,Up
1246,2005,0.043,0.422,0.252,-0.024,-0.584,1.28581,-0.955,Down
1247,2005,-0.955,0.043,0.422,0.252,-0.024,1.54047,0.130,Up
1248,2005,0.130,-0.955,0.043,0.422,0.252,1.42236,-0.298,Down


We can view the variable names.

In [3]:
Smarket.columns

Index(['Year', 'Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume', 'Today',
       'Direction'],
      dtype='object')

### K-Nearest Neighbors

We will now perform KNN using scikit-learn's `KNeighborsClassifier()`. It follows the same
fit-then-predict pattern as the other model-fitting tools we've used throughout these exercises.
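
To make the idea concrete, here is a minimal from-scratch sketch of what a KNN classifier does at prediction time: it finds the K training points closest to a query point and takes a majority vote over their labels. The helper `knn_predict` and the toy data are hypothetical and exist only for illustration; `KNeighborsClassifier` is far more efficient and may break ties differently.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, X_new, k=3):
    """Label each row of X_new by majority vote among its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    y_train = np.asarray(y_train)
    preds = []
    for x in X_new:
        # Euclidean distance from x to every training point
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # Indices of the k smallest distances
        nearest = np.argsort(dists)[:k]
        # Most common label among those neighbors
        votes = Counter(y_train[nearest].tolist())
        preds.append(votes.most_common(1)[0][0])
    return np.array(preds)


# Toy data: two well-separated clusters (made up for illustration)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [2.0, 2.0], [2.1, 1.9], [1.9, 2.2]])
y_toy = np.array(['Down', 'Down', 'Down', 'Up', 'Up', 'Up'])

toy_pred = knn_predict(X_toy, y_toy, [[0.05, 0.05], [2.0, 2.1]], k=3)
print(toy_pred)
```

Each query point sits inside one of the two clusters, so its three nearest neighbors all share a single label.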

In [4]:
#### STEP 1: DATA PREP
    # Remove columns not required for modeling from main data source
allvars = Smarket.columns.drop(['Today', 'Direction', 'Year'])
'''
MS (short for ModelSpec, imported from ISLP.models) is a helper class that prepares the data
prior to modeling. Without going into too much detail, this is what it does (not in any specific order):

1. Transforms the data into a structured format (think tables).
2. Adds a new constant column with a value of 1 (remember the intercept from linear regression?).
3. Standardizing numerical values is another common prep step (we haven't gone over this concept
   yet). Note that MS does not do this by default; it would have to be done separately.
4. Encodes any categorical variables. This means MS converts categorical variable values
   into True/False indicator columns.
   In Smarket, none of the predictors we keep are categorical (such as high/low). But if you are
   ever working with a data set that has such categories, they would be represented by new
   True/False columns (for example, a High column).

Conclusion: These steps are very typical when preparing data for modeling. MS speeds up the
process so you don't have to write Python code for the transformations it handles.

'''
# Create the MS model specification for the columns you chose to keep for modeling
design = MS(allvars)

# Train on datapoints that are prior to 2005
train = (Smarket.Year < 2005)

############## NOT USED IN MODELING BELOW. CAN BE REMOVED
    # Filter original data for rows that are 'True'. This leaves you with data points that are before 2005. 
    # Store the filtered data in a new variable to use for training in a structured tabular format.
Smarket_train = Smarket.loc[train]
    # Filter original data for rows that are 'False'. This leaves you with data points that are not
    # before 2005. Store the filtered data in a new variable to use for testing in a structured
    # tabular format.
Smarket_test = Smarket.loc[~train]
 # This is to see the shape of the table you created as testing data.
Smarket_test.shape

(252, 9)
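
The data prep ideas in the cell above (a constant intercept column, True/False indicator columns for a categorical variable, and a boolean mask for the train/test split) can be sketched by hand with plain pandas. This is a stand-in for illustration, not ISLP's `ModelSpec`: the toy frame, the `Level` column, and all variable names are made up, and `ModelSpec`'s actual encoding may differ (for example, in which indicator columns it keeps).

```python
import pandas as pd

# Toy frame standing in for Smarket, plus a made-up categorical column `Level`
toy = pd.DataFrame({
    'Year':  [2003, 2004, 2004, 2005, 2005],
    'Lag1':  [0.4, -0.6, 1.0, 0.1, -1.0],
    'Level': ['High', 'Low', 'High', 'Low', 'High'],
})

# Design matrix by hand: a constant intercept column, the numeric predictor,
# and True/False indicator columns for the categorical variable
X_design = pd.concat(
    [pd.Series(1.0, index=toy.index, name='intercept'),
     toy[['Lag1']],
     pd.get_dummies(toy['Level'], prefix='Level', dtype=bool)],
    axis=1,
)

# Boolean mask: True for rows before 2005 (training), False otherwise (test)
train_mask = toy['Year'] < 2005
X_design_train, X_design_test = X_design.loc[train_mask], X_design.loc[~train_mask]

print(X_design.columns.tolist())
print(X_design_train.shape, X_design_test.shape)
```

The `~` operator flips the mask, so every row lands in exactly one of the two subsets.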

In [38]:
# Build the design matrix (the table of explanatory variables) from the model specification.
# Here fit_transform() is a method of ModelSpec: it learns the column layout from the data and
# returns the transformed table (the intercept plus the chosen predictors).
# It does not rank or select the most important variables.
    ## Note: FeatureSelector.fit_transform, documented at
    # (https://islp.readthedocs.io/en/latest/api/generated/ISLP.models.generic_selector.html#ISLP.models.generic_selector.FeatureSelector.fit_transform)
    # belongs to a different class and is not what is called here.
X = design.fit_transform(Smarket)

#### STEP 2: SPLIT DATA INTO TRAIN & TEST
# We split the data into testing and training for both X and Y
X_train, X_test = X.loc[train], X.loc[~train]

# Define the variable you are trying to predict.
D = Smarket.Direction

# L holds the labels (the response, often called y)
L_train, L_test = D.loc[train], D.loc[~train]

#### STEP 3: TRAIN THE MODEL
# Create an instance of the model with K = 7 neighbors.
# API reference: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
knn1 = KNeighborsClassifier(n_neighbors=7)

# Convert the data frames into NumPy arrays for the model to process
X_train, X_test = [np.asarray(M) for M in [X_train, X_test]]

# Fit the model using X & y training data. 
knn1.fit(X_train, L_train)

#### STEP 4: PREDICT USING MODEL
# Predict results using x test data.
knn1_pred = knn1.predict(X_test)

# Compare predicted results against the actual results stored in the y test data.
confusion_table(knn1_pred, L_test)

Predicted,Truth: Down,Truth: Up
Down,56,59
Up,55,82


The results using $K=7$ are not very good, since only about $55\%$ of the
test observations are correctly predicted. Of course, a different value of $K$
may suit these data better: smaller values give a more flexible fit, and larger values a smoother one.

In [39]:
np.mean(knn1_pred == L_test)

0.5476190476190477
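
As a quick check, the same accuracy can be recomputed by hand from the confusion table, since the correct predictions sit on its diagonal. The sketch below hard-codes the four counts from the table; the variable names are illustrative only. It also computes the accuracy of a naive rule that always predicts `Up`, which is a useful baseline to compare against.

```python
# Counts copied from the confusion table above (rows = predicted, columns = truth)
pred_down_true_down, pred_down_true_up = 56, 59
pred_up_true_down, pred_up_true_up = 55, 82

n_test = (pred_down_true_down + pred_down_true_up
          + pred_up_true_down + pred_up_true_up)
correct = pred_down_true_down + pred_up_true_up  # diagonal entries

accuracy = correct / n_test
# Share of test days that were actually 'Up' = accuracy of always predicting 'Up'
always_up = (pred_down_true_up + pred_up_true_up) / n_test

print(n_test, round(accuracy, 4), round(always_up, 4))  # 252 0.5476 0.5595
```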

As we can see, KNN with $K=7$ gives only about 55% test accuracy, which is barely better than random guessing.

**Try running
KNN for several values of K and summarize the results for the best model you find.
Out of all the classification methods we tried, which performs best on the Smarket data? Give
some explanation for why that might be.**
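
One possible way to organize such a comparison is a loop that fits one model per value of K and records its test accuracy. The sketch below runs on synthetic data rather than `Smarket`; the data, the list of K values, and all variable names are assumptions made for illustration, so the numbers it prints say nothing about the stock market data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic two-class data (made up): the label depends on the first feature plus noise
X_syn = rng.normal(size=(400, 2))
y_syn = np.where(X_syn[:, 0] + 0.5 * rng.normal(size=400) > 0, 'Up', 'Down')

# Hold out the last 100 rows for testing
X_syn_train, X_syn_test = X_syn[:300], X_syn[300:]
y_syn_train, y_syn_test = y_syn[:300], y_syn[300:]

# Fit one model per K and record its test accuracy
k_values = [1, 3, 5, 7, 9, 15]
accuracy_by_k = {}
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_syn_train, y_syn_train)
    accuracy_by_k[k] = float(np.mean(knn.predict(X_syn_test) == y_syn_test))

# K with the highest recorded test accuracy
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(accuracy_by_k, best_k)
```

Keep in mind that picking K by its test accuracy uses the test set for tuning, so the best score found this way is an optimistic estimate.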

*These exercises were adapted from:* James, Gareth, et al. *An Introduction to Statistical Learning: with Applications in Python*, Springer, 2023.