# Classification Methods

## The Stock Market Data

The Smarket data set contains the daily percentage returns for the S&P 500 stock index between 2001 and 2005.
It contains 1250 observations on the following 9 variables. With this many variables we are working in a higher-dimensional feature space, so don't be disappointed that there are no nice 2D plots with decision boundaries. This time we have to rely on the metrics we learned in class to decide which model is best.

Features:

- Year: The year that the observation was recorded

- Lag1: Percentage return for previous day

- Lag2: Percentage return for 2 days previous

- Lag3: Percentage return for 3 days previous

- Lag4: Percentage return for 4 days previous

- Lag5: Percentage return for 5 days previous

- Volume: Volume of shares traded (number of daily shares traded in billions)

- Today: Percentage return for today

Response:

- Direction: A factor with levels Down and Up indicating whether the market had a negative or positive return on a given day.

Since we only want to predict whether the market goes up or down, this is a classification problem.
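
The relationship between Today and Direction can be sketched on a toy frame. The return values below are made up for illustration (they are not read from Smarket.csv), and the rule "positive return means Up" is an assumption based on the variable description above.

```python
import pandas as pd

# Toy returns (made-up values, not taken from Smarket.csv)
toy = pd.DataFrame({"Today": [0.959, -0.623, 1.032, -0.192, 0.614]})

# Assumed rule: a positive return is labelled "Up", anything else "Down"
toy["Direction"] = toy["Today"].apply(lambda r: "Up" if r > 0 else "Down")

# Binary encoding used for modelling: 1 for "Up", 0 for "Down"
toy["Up"] = (toy["Direction"] == "Up").astype(int)
print(toy)
```

Because Direction is determined by Today, Today must not be used as a predictor; the models below use only the lag variables and Volume.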

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 
import math
from patsy import dmatrices
import statsmodels.discrete.discrete_model as sm
import statsmodels.formula.api as smf
import statsmodels.api as sma
from statsmodels.graphics.regressionplots import *
from sklearn import datasets, linear_model
from sklearn.metrics import confusion_matrix
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.naive_bayes import GaussianNB as NB
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn import preprocessing

In [None]:
Smarket = pd.read_csv('data/Smarket.csv', header=0)

In [None]:
Smarket.head()

In [None]:
Smarket.columns

In [None]:
Smarket.shape

In [None]:
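# a pandas DataFrame has a corr method that computes pairwise correlations between numerical variables
# numeric_only=True (pandas 1.5 or later) skips the string column Direction, which recent pandas versions would otherwise reject
Smarket.corr(numeric_only=True)
# as one would expect, the correlations between the lag variables and today's returns are close to zero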

In [None]:
# take a look at the Volume column (selected by name, so the plot does not depend on the column order)
plt.plot(Smarket['Volume'])
plt.show()

## Logistic Regression

In [None]:
y, X = dmatrices('Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume', Smarket, return_type = 'dataframe')
print(y)

In [None]:
# since we are more interested in the market going up, we take the second column of y (Direction[Up]) as our response variable
# and build a model to predict whether the direction will be Up
logit = sm.Logit(y.iloc[:,1], X)
logit.fit().summary()

In [None]:
# to extract the parameters directly
logit.fit().params

In [None]:
# to extract the probability of the market going up for the first 10 instances
logit.fit().predict()[0:10] 

In [None]:
# in order to predict whether the market will go up or down on a particular day,
# we must convert these predicted probabilities into class labels, Up (1) or Down (0).
# we do this by comparing each probability against a predefined threshold
threshold = 0.5
predict_label = pd.DataFrame(np.zeros(shape=(len(y),1)), columns = ['label'])
predict_label.iloc[logit.fit().predict()>threshold] = 1
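
The thresholding step can also be written as a single vectorized comparison. The sketch below uses made-up probabilities (not model output) and is only meant to show the mechanics.

```python
import numpy as np

# Made-up predicted probabilities of "Up" for five days
probs = np.array([0.507, 0.481, 0.481, 0.515, 0.511])
threshold = 0.5

# 1 where the probability exceeds the threshold, 0 otherwise
labels = (probs > threshold).astype(int)

# The same decision expressed with class names
named = np.where(probs > threshold, "Up", "Down")
print(labels, named)
```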

In [None]:
# we can evaluate the TRAINING result by constructing a confusion matrix
confusion_matrix(y.iloc[:,1], predict_label.iloc[:,0])

In [None]:
# the diagonal elements of the confusion matrix indicate correct predictions, while the off-diagonals represent incorrect predictions. 
# in this case, logistic regression correctly predicted the movement of the market 52.2% of the time.
print(np.mean(y.iloc[:,1] == predict_label.iloc[:,0]))
# or use the confusion matrix to compute the accuracy
cm = confusion_matrix(y.iloc[:,1], predict_label.iloc[:,0])
print(cm.diagonal().sum() / cm.sum())
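
Accuracy is only one of the quantities that can be read off a confusion matrix. The sketch below uses small made-up label vectors (not the Smarket predictions) to show how sklearn lays out the matrix (rows are true classes, columns are predicted classes) and how precision and recall follow from it.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (1 = Up, 0 = Down)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0, 1, 1])

cm = confusion_matrix(y_true, y_pred)
# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all days predicted correctly
precision = tp / (tp + fp)        # of the days predicted Up, how many were Up
recall = tp / (tp + fn)           # of the days that were Up, how many we caught
print(cm)
print(accuracy, precision, recall)
```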

### Train-Validation Split

In [None]:
# in order to better assess the accuracy of the logistic regression model in this setting, 
# we can fit the model using part of the data, and then examine how well it predicts the held-out data. 
# this will yield a more realistic error rate, in the sense that in practice we will be interested in our 
# model’s performance not on the data that we used to fit the model, but rather on days in the future for which the market’s movements are unknown.
Smarket_2005 = Smarket.query('Year >= 2005')
Smarket_train = Smarket.query('Year < 2005')
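
The split above is chronological rather than random: the model is fit on earlier years and evaluated on a later one, so no future information reaches the training set. A minimal sketch on a toy frame (made-up values, hypothetical variable names) shows the same idea with boolean masks.

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "Year": [2003, 2003, 2004, 2004, 2005, 2005],
    "Lag1": [0.1, -0.2, 0.3, -0.4, 0.5, -0.6],
})

# Everything before 2005 is used for fitting, 2005 is held out
train = toy[toy["Year"] < 2005]
holdout = toy[toy["Year"] >= 2005]
print(len(train), len(holdout))
```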

In [None]:
# we will use the training dataset to build the logistic regression model 
y_train, X_train = dmatrices('Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume', Smarket_train, return_type = 'dataframe')
y_test, X_test = dmatrices('Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume', Smarket_2005, return_type = 'dataframe')

In [None]:
logit = sm.Logit(y_train.iloc[:,1], X_train)
print(logit.fit().summary())

In [None]:
preds = logit.fit().predict(X_test)
predict_label = pd.DataFrame(np.zeros(shape=(X_test.shape[0],1)), columns = ['label'])
threshold = 0.5
mark = (preds > threshold).reset_index(drop=True)
predict_label.loc[mark] = 1
confusion_matrix(y_test.iloc[:,1], predict_label.iloc[:,0])

In [None]:
# to get accuracy
np.mean(y_test.iloc[:,1].reset_index(drop=True)==predict_label.iloc[:,0].reset_index(drop=True)) 

# note: we have trained and tested our model on two completely separate data sets: 
# training was performed using only the dates before 2005, and testing was performed 
# using only the dates in 2005. Finally, we compute the predictions for 2005 and compare 
# them to the actual movements of the market over that time period. The results are rather 
# disappointing: the test error rate is 1 - 48% = 52%, which is worse than random guessing
# on balanced data. Of course this result is not all that surprising, given that one
# would not generally expect to be able to use previous days’ returns to predict future market performance.
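
A useful reference point for any accuracy figure is the majority-class baseline: always predict the class that was most frequent in the training data. The sketch below uses made-up labels, so the number it prints says nothing about the Smarket data itself.

```python
import numpy as np

# Made-up labels (1 = Up, 0 = Down)
y_train_toy = np.array([1, 1, 0, 1, 0, 1, 1, 0])
y_test_toy = np.array([1, 0, 1, 1, 0, 1])

# Most frequent class in the training labels
majority = np.bincount(y_train_toy).argmax()

# Accuracy obtained by predicting that class for every test day
baseline_acc = np.mean(y_test_toy == majority)
print(majority, baseline_acc)
```

A classifier is only worth keeping if it beats this baseline on held-out data.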

In [None]:
# retraining the model with only Lag1 and Lag2 follows the same steps as before (kept brief here)
y_train, X_train = dmatrices('Direction~Lag1+Lag2', Smarket_train, return_type = 'dataframe')
y_test, X_test = dmatrices('Direction~Lag1+Lag2', Smarket_2005, return_type = 'dataframe')
logit = sm.Logit(y_train.iloc[:,1], X_train)
preds = logit.fit().predict(X_test)
predict_label = pd.DataFrame(np.zeros(shape=(X_test.shape[0],1)), columns = ['label'])
threshold = 0.5
# mark the days whose predicted probability of Up exceeds the threshold (this step was missing)
predict_label.loc[(preds > threshold).reset_index(drop=True)] = 1
print(confusion_matrix(y_test.iloc[:,1], predict_label.iloc[:,0]))
np.mean(y_test.iloc[:,1].reset_index(drop=True)==predict_label.iloc[:,0].reset_index(drop=True)) # accuracy on the validation set

In [None]:
# another way to tune a logistic regression classifier is to move the threshold away from 0.5.
# below is an example with a threshold of 0.45.
preds = logit.fit().predict(X_test)
predict_label = pd.DataFrame(np.zeros(shape=(X_test.shape[0],1)), columns = ['label'])
threshold = 0.45
predict_label.loc[(preds > threshold).reset_index(drop=True)] = 1
print(confusion_matrix(y_test.iloc[:,1], predict_label.iloc[:,0]))

# accuracy on the validation set: we did see an improvement of the accuracy from 0.48 to 0.56
np.mean(y_test.iloc[:,1].reset_index(drop=True)==predict_label.iloc[:,0].reset_index(drop=True)) 
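
Rather than trying one alternative threshold by hand, one can sweep a grid of thresholds and record the accuracy at each. The sketch below does this on synthetic probabilities (generated here for illustration, not produced by the model above).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labels and weakly informative "probabilities"
y_true = rng.integers(0, 2, size=200)
probs = np.clip(0.5 + 0.1 * (y_true - 0.5) + rng.normal(0, 0.05, size=200), 0, 1)

# Evaluate accuracy on a grid of thresholds between 0.40 and 0.60
thresholds = np.linspace(0.40, 0.60, 21)
accs = [np.mean((probs > t).astype(int) == y_true) for t in thresholds]

best = thresholds[int(np.argmax(accs))]
print(best, max(accs))
```

A threshold chosen on the same data that is used to report accuracy gives an optimistic estimate; ideally it is selected on a separate validation set.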

## Linear Discriminant Analysis

In [None]:
# we will use sklearn's implementation of LDA
# from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

In [None]:
y_train.iloc[:,1].unique()

In [None]:
# the training process (columns 1:3 of X_train are Lag1 and Lag2; column 0 is the intercept)
sklearn_lda = LDA(n_components=1) # creating an LDA object
lda = sklearn_lda.fit(X_train.iloc[:,1:3], y_train.iloc[:,1]) # learning the projection matrix
X_lda = lda.transform(X_train.iloc[:,1:3]) # using the model to project X onto the discriminant direction
X_labels = lda.predict(X_train.iloc[:,1:3]) # the predicted label for each sample
X_prob = lda.predict_proba(X_train.iloc[:,1:3]) # the probability that each sample belongs to each class
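
To make the fitted quantities concrete, the sketch below fits sklearn's LDA on synthetic two-class Gaussian data (generated here for illustration, not the Smarket data) and inspects the estimated priors and class means. It also checks that `predict` agrees with picking the class with the larger posterior probability.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Two synthetic classes with shifted means and the same spread
X0 = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[1.0, 0.0], scale=1.0, size=(100, 2))
X_toy = np.vstack([X0, X1])
y_toy = np.array([0] * 100 + [1] * 100)

lda_toy = LinearDiscriminantAnalysis().fit(X_toy, y_toy)
print(lda_toy.priors_)  # estimated class proportions
print(lda_toy.means_)   # one row of feature means per class

# With two classes there is at most one discriminant direction
projected = lda_toy.transform(X_toy)

# predict() should match the class with the largest posterior probability
proba = lda_toy.predict_proba(X_toy)
labels_from_proba = lda_toy.classes_[np.argmax(proba, axis=1)]
agree = np.array_equal(labels_from_proba, lda_toy.predict(X_toy))
print(projected.shape, agree)
```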

In [None]:
# testing step 
X_test_labels =lda.predict(X_test.iloc[:,1:3])
X_test_prob = lda.predict_proba(X_test.iloc[:,1:3]) 
print(X_test_prob[0:5,:])

In [None]:
# get the accuracy of the test set using default threshold
np.mean(y_test.iloc[:,1]==X_test_labels) 

In [None]:
# let's change the threshold a bit to see whether we can improve the accuracy.
# the 2nd column of X_test_prob is the probability of belonging to the Up group.
# the default threshold is 0.5; let us first check that.
threshold = 0.5 
np.mean(y_test.iloc[:,1]==(X_test_prob[:,1]>=threshold))

In [None]:
threshold = 0.48
np.mean(y_test.iloc[:,1]==(X_test_prob[:,1]>=threshold))

## Quadratic Discriminant Analysis

In [None]:
# it is a little annoying that QDA and LDA differ slightly in their parameter
# set-up and function names. 
# from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA

In [None]:
sklearn_qda = QDA(priors=None,store_covariance=True) # creating a QDA object
qda = sklearn_qda.fit(X_train.iloc[:,1:3], y_train.iloc[:,1]) # estimating a mean and a covariance matrix for each class
X_labels = qda.predict(X_train.iloc[:,1:3]) # the predicted label for each sample
X_prob = qda.predict_proba(X_train.iloc[:,1:3]) # the probability that each sample belongs to each class

X_test_labels=qda.predict(X_test.iloc[:,1:3])
X_test_prob = qda.predict_proba(X_test.iloc[:,1:3]) 

print(np.mean(y_test.iloc[:,1]==X_test_labels) )

In [None]:
# again, use dir() to explore all the information stored in lda and qda.
#dir(qda)

In [None]:
print(qda.means_)
print(qda.covariance_)
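
Unlike LDA, QDA estimates a separate covariance matrix for each class, which is what `covariance_` holds when `store_covariance=True`. The sketch below illustrates this on synthetic data (generated here, not the Smarket data) where the two classes have deliberately different covariances.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)

# Two synthetic classes with different covariance structure
cov0 = [[1.0, 0.0], [0.0, 1.0]]
cov1 = [[2.0, 0.8], [0.8, 1.0]]
X0 = rng.multivariate_normal([-1.0, 0.0], cov0, size=150)
X1 = rng.multivariate_normal([1.0, 0.0], cov1, size=150)
X_toy = np.vstack([X0, X1])
y_toy = np.array([0] * 150 + [1] * 150)

qda_toy = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X_toy, y_toy)

# One mean vector and one covariance matrix per class
print(qda_toy.means_)
for cls, cov in zip(qda_toy.classes_, qda_toy.covariance_):
    print(cls, cov)
```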

## Naive Bayes

In [None]:
# from sklearn.naive_bayes import GaussianNB as NB

In [None]:
NB_class = NB()
NB_class.fit(X_train.iloc[:,1:3], y_train.iloc[:,1])
X_test_labels=NB_class.predict(X_test.iloc[:,1:3])
X_test_prob = NB_class.predict_proba(X_test.iloc[:,1:3]) 
print(np.mean(y_test.iloc[:,1]==X_test_labels))

#dir(NB_class) # use dir command to check what Naive Bayes classifier has
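
As a quick look at what the fitted Gaussian Naive Bayes object stores, the sketch below fits it on synthetic data (generated here, not the Smarket data) and prints the class priors and per-class feature means.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Two synthetic classes with shifted means
X0 = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[1.0, 0.0], scale=1.0, size=(100, 2))
X_toy = np.vstack([X0, X1])
y_toy = np.array([0] * 100 + [1] * 100)

nb_toy = GaussianNB().fit(X_toy, y_toy)
print(nb_toy.class_prior_)  # estimated class proportions
print(nb_toy.theta_)        # per-class mean of each feature

# Each row of predict_proba is a posterior distribution over the classes
proba = nb_toy.predict_proba(X_toy)
print(proba[:3])
```

Gaussian Naive Bayes models each feature independently within a class, so it stores per-feature means and variances rather than a full covariance matrix.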

## K-Nearest Neighbors

In [None]:
# from sklearn.neighbors import KNeighborsClassifier as KNN

In [None]:
neigh = KNN(n_neighbors= 4) # change n_neighbors to tune the performance of KNN
KNN_fit = neigh.fit(X_train.iloc[:,1:3], y_train.iloc[:,1]) # KNN simply stores the training samples
X_test_labels=KNN_fit.predict(X_test.iloc[:,1:3])
X_test_prob = KNN_fit.predict_proba(X_test.iloc[:,1:3]) 
print(np.mean(y_test.iloc[:,1]==X_test_labels))

#dir(neigh) # use dir command to check what KNN offers
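
Since the number of neighbors is the main tuning knob, a natural next step is to try several values and compare held-out accuracy. The sketch below does this on synthetic data (generated here, not the Smarket data) and standardizes the features first, because KNN relies on distances and is sensitive to feature scale. The pipeline is one possible arrangement, not the only one.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: the label depends noisily on the first feature
X_toy = rng.normal(size=(300, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)
X_tr, y_tr = X_toy[:200], y_toy[:200]
X_te, y_te = X_toy[200:], y_toy[200:]

scores = {}
for k in (1, 3, 5, 11, 21):
    # The scaler is fit on the training part only, then applied to the test part
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_tr, y_tr)
    scores[k] = model.score(X_te, y_te)  # held-out accuracy for this k
print(scores)
```

Odd values of k avoid tied votes in a two-class problem.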