# Linear Models For Classification

In this notebook we look at how to implement linear models for classification in Python.

The following datasets are used for analysis:

1. [Bikeshare](https://drive.google.com/file/d/1mzUgrPg3Dndy-DFy8rf6Dqh6-jX1FaSe/view?usp=sharing)
2. [Stock-Market](https://drive.google.com/file/d/1bFNQ0DzvFAbNKa5G8PA-aLRo35xSYSBC/view?usp=sharing)


As before, we use the following libraries:

1. `numpy` for working with numerical data.
2. `pandas` for manipulating datasets through the DataFrame object.
3. `matplotlib` for plotting figures.
4. `sklearn` for fitting the classifiers (logistic regression and Gaussian naive Bayes).


In [1]:
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/master/"
SMARKET_URL = DOWNLOAD_ROOT + "Notebooks/Data/Smarket.csv"

def fetch_data(data_url=SMARKET_URL):
    # Download the Smarket CSV to the local working directory.
    urllib.request.urlretrieve(data_url, "/content/Smarket.csv")

fetch_data()

In [2]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline


First, let's look at the stock market data. The aim is to predict the `Direction` variable using the `Lag1` to `Lag5` and `Volume` variables.

Before any analysis can be done, we need to prepare the data:

1. Encode the response `Direction` as 1 for `Up` and 0 for `Down`.
2. Remove `Year`, `Today` and `Direction` to get the independent variables.
3. Note that the sign of `Today` essentially dictates the Up/Down direction, which is why `Today` must not be used as a predictor.
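
The preparation steps above can be sketched on the first five rows of the dataset. This is a minimal illustration, assuming an inline `sample` DataFrame (a hypothetical stand-in for the downloaded CSV) that holds those five rows:

```python
import pandas as pd

# Hypothetical stand-in: the first five rows of Smarket.csv, typed inline.
sample = pd.DataFrame(
    {
        "Year": [2001] * 5,
        "Lag1": [0.381, 0.959, 1.032, -0.623, 0.614],
        "Lag2": [-0.192, 0.381, 0.959, 1.032, -0.623],
        "Lag3": [-2.624, -0.192, 0.381, 0.959, 1.032],
        "Lag4": [-1.055, -2.624, -0.192, 0.381, 0.959],
        "Lag5": [5.010, -1.055, -2.624, -0.192, 0.381],
        "Volume": [1.1913, 1.2965, 1.4112, 1.2760, 1.2057],
        "Today": [0.959, 1.032, -0.623, 0.614, 0.213],
        "Direction": ["Up", "Up", "Down", "Up", "Up"],
    },
    index=[1, 2, 3, 4, 5],
)

# Step 1: encode the response, Up -> 1 and Down -> 0.
y = sample["Direction"].map({"Up": 1, "Down": 0})

# Step 2: drop the columns that are not predictors.
X = sample.drop(columns=["Year", "Today", "Direction"])

# Step 3: on these rows, the sign of Today reproduces the encoded Direction.
sign_matches = ((sample["Today"] > 0).astype(int) == y).all()

print(list(X.columns))
print(y.tolist())
print(sign_matches)
```

Using `map` with an explicit dictionary keeps the encoding in one place and yields an integer column directly.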

In [3]:
data = pd.read_csv("/content/Smarket.csv", index_col=0)
print(data.head())

   Year   Lag1   Lag2   Lag3   Lag4   Lag5  Volume  Today Direction
1  2001  0.381 -0.192 -2.624 -1.055  5.010  1.1913  0.959        Up
2  2001  0.959  0.381 -0.192 -2.624 -1.055  1.2965  1.032        Up
3  2001  1.032  0.959  0.381 -0.192 -2.624  1.4112 -0.623      Down
4  2001 -0.623  1.032  0.959  0.381 -0.192  1.2760  0.614        Up
5  2001  0.614 -0.623  1.032  0.959  0.381  1.2057  0.213        Up


Question 1: Write a function that takes a pandas DataFrame and a column name and returns the mean, standard deviation, 25th percentile, median and 75th percentile of that column.

**IMPORTANT NOTE**: Your function must pass the test. Only submissions that pass will be considered for grading. An example is shown below.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def prepare_data_det(data1):
    # Work on a copy and encode Direction as Up -> 1, Down -> 0.
    data = data1.copy()
    data['Direction'] = data['Direction'].map({'Up': 1, 'Down': 0}).astype('int')
    return data

def column_stats(data, col_name):
  ### CODE HERE
  ### Fill this function such that, given a pandas DataFrame data
  ### and a column name col_name, it passes the test below.

  data = prepare_data_det(data)
  data_array = data[col_name]

  mean = data_array.mean()
  stdev = data_array.std()  # sample standard deviation (ddof=1)
  Q1 = data_array.quantile(0.25)
  median = data_array.median()
  Q3 = data_array.quantile(0.75)

  print("mean  : ", mean)
  print("stdev : ", stdev)
  print("Q1    : ", Q1)
  print("median: ", median)
  print("Q3    : ", Q3)
  return mean, stdev, Q1, median, Q3
  

def test_column_stats():
  """Check column_stats against known values for the Lag1 column."""
  data = pd.read_csv("/content/Smarket.csv", index_col=0)
  mean, stdev, Q1, median, Q3 = column_stats(data, 'Lag1')
  assert np.abs(mean - 0.003834) < 1e-4
  assert np.abs(stdev-1.136299) < 1e-4
  assert np.abs(Q1 - -0.639500) < 1e-4
  assert np.abs(median-0.039000) < 1e-4
  assert np.abs(Q3-0.596750) < 1e-4
test_column_stats()


mean  :  0.003834400000000011
stdev :  1.1362988437142851
Q1    :  -0.6395
median:  0.039
Q3    :  0.59675
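
As a cross-check, pandas' built-in `Series.describe()` reports the same five quantities (along with the count, minimum and maximum). A minimal sketch, assuming a small hypothetical series `values` in place of the `Lag1` column:

```python
import pandas as pd

# Hypothetical values; in the notebook this would be data['Lag1'].
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

summary = values.describe()

# describe() labels the quartiles '25%', '50%' and '75%'.
mean, stdev = summary["mean"], summary["std"]
q1, median, q3 = summary["25%"], summary["50%"], summary["75%"]

print(mean, stdev, q1, median, q3)
```

Both `describe()` and `Series.std()` use the sample standard deviation (`ddof=1`), so the two approaches agree.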


Question 2: Split the dataset into train and test sets of sizes 998 and 252, respectively. Then fit a logistic regression model on the training set.

Report the following:

1. Train Accuracy
2. Test Accuracy
3. `coef_` attribute
4. `intercept_` attribute.

Make sure the output is just the above quantities and nothing else.

**IMPORTANT NOTE 1:** Do not shuffle the dataset while splitting. The first 998 rows should be taken as train and remaining 252 should be taken as test.

**IMPORTANT NOTE 2:** Do not use the `Year` and `Today` for prediction.

**IMPORTANT NOTE 3:** Consider `Up` to be class 1 and `Down` to be class 0.

**Grading is based on whether the `test_LR_classifier` function passes.**
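
An unshuffled split like the one required here can be made either by slicing with `iloc` or with `train_test_split(..., shuffle=False)`. A minimal sketch, assuming a hypothetical 1250-row frame `df` in place of the Smarket data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same number of rows as the 998 + 252 split.
df = pd.DataFrame({"x": np.arange(1250), "y": np.arange(1250) % 2})

# Option 1: positional slicing keeps the original row order.
train_a, test_a = df.iloc[:998], df.iloc[998:]

# Option 2: train_test_split with shuffling disabled gives the same split.
train_b, test_b = train_test_split(df, test_size=252, shuffle=False)

print(train_a.shape, test_a.shape)
print(train_a.equals(train_b), test_a.equals(test_b))
```

Keeping the rows in order matters for time-ordered data such as daily returns, since the test set then consists of the most recent observations.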


In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

def prepare_data_set_Q2(data):
    cols  = ['Lag1','Lag2','Lag3','Lag4','Lag5','Volume','Direction']
    # Select the wanted columns (Year and Today are excluded).
    df_data_set = data[cols].copy()
    # Encode the response: Up -> 1, Down -> 0.
    df_data_set['Direction'] = df_data_set['Direction'].map({'Up': 1, 'Down': 0}).astype('int')
    return df_data_set


def get_LR_classifier():
  data_set = pd.read_csv("/content/Smarket.csv", index_col=0)
  df_data_set = prepare_data_set_Q2(data_set)

  # Split without shuffling: the first 998 rows are train, the remaining 252 are test.
  # Columns 0-5 are the predictors (Lag1..Lag5, Volume); column 6 is Direction.
  XTrain = df_data_set.iloc[ 0 : 998, [0, 1, 2, 3, 4, 5] ]
  yTrain = df_data_set.iloc[ 0 : 998, [6] ]

  XTest = df_data_set.iloc[998 : , [0, 1, 2, 3, 4, 5] ]
  yTest = df_data_set.iloc[998 : , [6] ]

  clf = LogisticRegression()
  clf.fit(XTrain, yTrain.values.ravel())
  yPred = clf.predict(XTest)

  print(' 1. Train Accuracy      : ',clf.score(XTrain, yTrain))
  print(' 2. Test Accuracy       : ', clf.score(XTest, yTest))
  print(' 3. coef_ attribute     : ', clf.coef_)
  print(' 4. intercept_ attribute: ' , clf.intercept_)
  print(' \n\n ')

  return clf, XTrain, yTrain, XTest, yTest


def test_LR_classifier():
  clf, XTrain, yTrain, XTest, yTest = get_LR_classifier()
  assert np.abs(clf.score(XTrain, yTrain) - 0.5250501002004008) < 1e-4
  assert np.abs(clf.score(XTest, yTest) - 0.48412698412698413) < 1e-4
  arr = np.array([[-0.05410202, -0.04559333,  0.00727805,  0.00653897, -0.00415829, -0.10995391]])
  assert np.sum(np.abs(np.array(clf.coef_) - arr)) <6*1e-4
  assert np.abs(clf.intercept_ - 0.18259423) < 1e-4

test_LR_classifier()

 1. Train Accuracy      :  0.5250501002004008
 2. Test Accuracy       :  0.48412698412698413
 3. coef_ attribute     :  [[-0.05410202 -0.04559333  0.00727805  0.00653897 -0.00415829 -0.10995391]]
 4. intercept_ attribute:  [0.18259423]
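
To see what `coef_` and `intercept_` represent: for a binary logistic regression, the predicted probability of class 1 is the sigmoid of `X @ coef_.T + intercept_`. A minimal sketch on synthetic data (the random features and labels are assumptions for illustration, not the Smarket data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Linear score, then the logistic (sigmoid) function.
z = X @ clf.coef_.T + clf.intercept_  # shape (200, 1)
p_manual = (1.0 / (1.0 + np.exp(-z))).ravel()

# predict_proba returns one column per class; column 1 is P(class 1).
p_sklearn = clf.predict_proba(X)[:, 1]

print(np.allclose(p_manual, p_sklearn))
```

A negative coefficient, such as the one reported for `Lag1` above, therefore lowers the predicted probability of `Up` as that feature increases.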
 

 


Question 3: Using `GaussianNB` from `sklearn.naive_bayes`, fit a naive Bayes classifier. Use only the features `Lag2` and `Lag3` for this purpose.

Report the following quantities from the attributes of the classifier:

1. `class_prior_`
2. `theta_`
3. `var_`
4. Confusion Matrix for the test data.

**IMPORTANT NOTE:** The above quantities should be printed with correct labels.


In [7]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

def prepare_data_set_Q3(data):
    cols  = ['Lag2','Lag3','Direction']
    # Select the wanted columns.
    df_data_set = data[cols].copy()
    # Encode the response: Up -> 1, Down -> 0.
    df_data_set['Direction'] = df_data_set['Direction'].map({'Up': 1, 'Down': 0}).astype('int')

    return df_data_set

def get_NB_classifier():
  data_set = pd.read_csv("/content/Smarket.csv", index_col=0)
  df_data_set = prepare_data_set_Q3(data_set)

  # Split without shuffling: the first 998 rows are train, the remaining 252 are test.
  # Columns 0-1 are the predictors (Lag2, Lag3); column 2 is Direction.
  Xtrain = df_data_set.iloc[ 0 : 998, [0, 1] ]
  yTrain = df_data_set.iloc[ 0 : 998, [2] ]

  XTest = df_data_set.iloc[998 : , [0, 1] ]
  yTest = df_data_set.iloc[998 : , [2] ]

  clf = GaussianNB()
  clf.fit(Xtrain, yTrain.values.ravel())
  yPred = clf.predict(XTest)
  Confusion_matrix = confusion_matrix(yTest, yPred)

  print("1. class_prior_: ",clf.class_prior_)
  print("2. theta_      : ",clf.theta_)
  print("3. var_        : ",clf.var_)
  print("4. Confusion Matrix for the test data: ",Confusion_matrix)
  print(' \n\n')

  return clf, Xtrain, yTrain, XTest, yTest

def test_NB_classifier():
  clf, Xtrain, yTrain, XTest, yTest = get_NB_classifier()
  arr1 = np.array([0.49198397, 0.50801603])
  assert np.sum(np.abs(np.array(clf.class_prior_) - arr1)) < 2*1e-4

  arr2 = np.array([[ 0.03389409, -0.00980652],
                    [-0.03132544,  0.00583432]])
  assert np.sum(np.abs(np.array(clf.theta_) - arr2)) < 4*1e-4


  # arr3 holds per-class standard deviations, so compare it with sqrt(var_).
  arr3 = np.array([[1.23792871, 1.23412176],
                   [1.21956089, 1.22963]])
  assert np.sum(np.abs(np.array(np.sqrt(clf.var_)) - arr3)) <6*1e-4

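  arr4 = np.array([[  9, 102],
                   [  7, 134]])
  # Compare the expected confusion matrix with the one computed on the test set.
  cm = confusion_matrix(yTest, clf.predict(XTest))
  assert np.sum(np.abs(cm - arr4)) < 4*1e-4

test_NB_classifier()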

1. class_prior_:  [0.49198397 0.50801603]
2. theta_      :  [[ 0.03389409 -0.00980652]
 [-0.03132544  0.00583432]]
3. var_        :  [[1.53246749 1.52305652]
 [1.48732877 1.51198994]]
4. Confusion Matrix for the test data:  [[  9 102]
 [  7 134]]
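
In scikit-learn's `confusion_matrix`, rows correspond to true classes and columns to predicted classes, so the diagonal counts the correct predictions. A minimal sketch that first checks this convention on a tiny hypothetical example and then reads the test accuracy off the matrix printed above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Tiny hypothetical example: rows are true labels, columns are predictions.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))

# Confusion matrix reported for the naive Bayes classifier on the test data.
cm = np.array([[9, 102],
               [7, 134]])

# Accuracy is the share of predictions that fall on the diagonal.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```

Read this way, most of the errors are `Down` days predicted as `Up` (102 of the 111 true `Down` days), so the classifier predicts `Up` for the large majority of test days.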
 


