## About the Dataset

The data file contains grayscale images of hand-drawn digits, from zero through nine.

Each image is 28 pixels high and 28 pixels wide, for a total of 784 pixels. Each pixel has a single pixel value that indicates its lightness or darkness, with higher numbers meaning darker. This pixel value is an integer between 0 and 255, inclusive.

The dataset (`data1.csv`) has 785 columns. The first column, called `label`, is the digit that was drawn by the user. The remaining 784 columns contain the pixel values of the associated image.
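
As a quick illustration of this layout, the sketch below reshapes one flattened row of 784 pixel values back into a 28 x 28 grid. It assumes the row-major ordering described on the Kaggle data page (pixel `x` sits at row `x // 28`, column `x % 28`). The random `row` array is a hypothetical stand-in for one row of `data1.csv` with its label removed.

```python
import numpy as np

# Hypothetical stand-in for one image row (784 pixel values, label removed)
rng = np.random.default_rng(0)
row = rng.integers(0, 256, size=784)

# Row-major reshape: pixel x lands at (x // 28, x % 28)
image = row.reshape(28, 28)

# Check the index mapping for one arbitrary pixel
x = 300
i, j = divmod(x, 28)
print(image.shape, (i, j), image[i, j] == row[x])
```

With a real row in place of the random one, `plt.imshow(image, cmap="gray")` should display the digit.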

You can read more about the dataset [here](https://www.kaggle.com/c/digit-recognizer/data).

## Loading the dataset

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

In [26]:
df = pd.read_csv('data1.csv')

In [27]:
df.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
df.shape

(42000, 785)

In [31]:
df["label"].value_counts()

1    4684
7    4401
3    4351
9    4188
2    4177
6    4137
0    4132
4    4072
8    4063
5    3795
Name: label, dtype: int64

<img src='images/icon/ppt-icons.png' alt='Mini-Challenge' style="width: 100px;float:left; margin-right:15px"/><br/>

## Mini-Challenge - 1
***
### Instructions

- Store all the features (independent variables) in a variable called `X`.
- Store the target variable `label` (dependent variable) in a variable called `y`.
- Split `X` and `y` into `X_train`, `X_test`, `y_train` and `y_test` in a 70:30 ratio with `random_state=42`.
- Further split the test data into `X_train1`, `X_test1`, `y_train1` and `y_test1` in a 70:30 ratio with `stratify=y_test` and
  `random_state=101`.
- Fit a base logistic regression model with `random_state=101` on `X_train1` and `y_train1`, and calculate the `score` on the
  newly split test data (`X_test1`, `y_test1`).

In [7]:
%%time
from sklearn.linear_model import LogisticRegression

# Separate the target (first column, "label") from the pixel features
y = df.iloc[:, 0]
X = df.iloc[:, 1:]

# First split: 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Second split: divide the 30% test portion again (70:30), preserving class balance
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_test, y_test, test_size=0.3, stratify=y_test, random_state=101)

# Baseline model: fit on the smaller training subset, score on its held-out part
lr = LogisticRegression(random_state=101)
lr.fit(X_train1, y_train1)
print("Accuracy on test data:", lr.score(X_test1, y_test1))

Accuracy on test data: 0.8436507936507937
Wall time: 4min 18s


### Observation: Logistic regression without any feature selection gives an accuracy of 0.84.
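
One consequence of the two-stage split is that the model is trained on 70% of the 30% test portion, not on the full 70% training portion. The sketch below makes the resulting sizes explicit. It assumes 42,000 rows, as reported by `df.shape`, and uses placeholder labels (`idx % 10`) in place of the real `label` column, so only the sizes are meaningful.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 42000                 # row count reported by df.shape
idx = np.arange(n_rows)        # row indices stand in for the pixel features
labels = idx % 10              # placeholder labels, not the real digits

# First split: 70% train, 30% test
idx_train, idx_test, lab_train, lab_test = train_test_split(
    idx, labels, test_size=0.3, random_state=42
)

# Second split: applied to the 30% test portion only
idx_train1, idx_test1, lab_train1, lab_test1 = train_test_split(
    idx_test, lab_test, test_size=0.3, stratify=lab_test, random_state=101
)

print("first split :", len(idx_train), len(idx_test))
print("second split:", len(idx_train1), len(idx_test1))
```

Under these assumptions the model is fit on 8,820 rows and scored on 3,780, which keeps training time down at the cost of using only part of the data.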

<img src='images/icon/ppt-icons.png' alt='Mini-Challenge' style="width: 100px;float:left; margin-right:15px"/><br/>

## Mini-Challenge - 2
***
**Remove Correlated Features**

As we learned earlier, logistic regression assumes that the independent features are not highly correlated with each other (i.e. no multicollinearity).

### Instructions

* Find the features that have a correlation higher than 0.8 with another feature and remove them, so that this assumption of the logistic regression model is better satisfied.

In [8]:
# Absolute pairwise correlations between the pixel features
corr_matrix = df.drop("label", axis=1).corr().abs()

# Select upper triangle of correlation matrix (k=1 excludes the diagonal)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find feature columns with correlation greater than 0.8 to an earlier column
to_drop = [column for column in upper.columns if any(upper[column] > 0.8)]
print("Columns to be dropped: ")
print(to_drop)

Columns to be dropped: 
['pixel13', 'pixel15', 'pixel33', 'pixel35', 'pixel86', 'pixel121', 'pixel122', 'pixel123', 'pixel124', 'pixel125', 'pixel126', 'pixel127', 'pixel128', 'pixel148', 'pixel149', 'pixel150', 'pixel151', 'pixel152', 'pixel153', 'pixel154', 'pixel155', 'pixel156', 'pixel157', 'pixel158', 'pixel159', 'pixel176', 'pixel177', 'pixel178', 'pixel179', 'pixel180', 'pixel181', 'pixel182', 'pixel183', 'pixel184', 'pixel244', 'pixel245', 'pixel271', 'pixel272', 'pixel273', 'pixel274', 'pixel280', 'pixel286', 'pixel287', 'pixel288', 'pixel289', 'pixel298', 'pixel299', 'pixel300', 'pixel301', 'pixel302', 'pixel314', 'pixel315', 'pixel316', 'pixel317', 'pixel326', 'pixel327', 'pixel328', 'pixel329', 'pixel330', 'pixel331', 'pixel336', 'pixel341', 'pixel342', 'pixel343', 'pixel344', 'pixel345', 'pixel346', 'pixel351', 'pixel354', 'pixel355', 'pixel356', 'pixel357', 'pixel358', 'pixel359', 'pixel368', 'pixel369', 'pixel370', 'pixel371', 'pixel372', 'pixel373', 'pixel374', 'pixel37

In [9]:
df.drop(to_drop,axis=1,inplace=True)
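
To see what the upper-triangle filter does on a small scale, here is a minimal sketch on a toy dataframe. The columns `a`, `b` and `c` are hypothetical: `b` is built as a near-copy of `a`, while `c` is independent noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),  # near-duplicate of a
    "c": rng.normal(size=200),                  # independent of a and b
})

# Absolute correlations, keeping only entries above the diagonal
corr = toy.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it correlates strongly with any earlier column
to_drop_toy = [col for col in upper.columns if (upper[col] > 0.8).any()]
print(to_drop_toy)
```

Because only the upper triangle is inspected, the later column of each correlated pair (`b` here) is dropped and the earlier one (`a`) is kept.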

### Store the updated dataframe df after removing correlated features in variable df1 

In [11]:
df1 = df.copy()
print(df1.shape)
df1.head()

(42000, 581)


Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel773,pixel774,pixel775,pixel776,pixel777,pixel778,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<img src='images/icon/ppt-icons.png' alt='Mini-Challenge' style="width: 100px;float:left; margin-right:15px"/><br/>

## Mini-Challenge - 3
***
### Instructions
- Apply a logistic regression model to the newly created dataframe `df1`.
- Store all the features (independent variables) in a variable called `X`.
- Store the target variable `label` (dependent variable) in a variable called `y`.
- Split `X` and `y` into `X_train`, `X_test`, `y_train` and `y_test` in a 70:30 ratio with `stratify=y` and `random_state=42`.
- Further split the test data into `X_train1`, `X_test1`, `y_train1` and `y_test1` in a 70:30 ratio with `stratify=y_test` and
  `random_state=101`.
- Fit a base logistic regression model with `random_state=101` on `X_train1` and `y_train1`, and calculate the `score` on the
  newly split test data (`X_test1`, `y_test1`).

In [20]:
%%time
from sklearn.linear_model import LogisticRegression
y = df1.iloc[:,0]
X = df1.iloc[:,1:]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,stratify=y, random_state = 42)
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_test, y_test, test_size=0.3,stratify=y_test, random_state = 101)
lr = LogisticRegression(random_state=101)
lr.fit(X_train1,y_train1)
print("Accuracy on test data: ", lr.score(X_test1,y_test1))

Accuracy on test data:  0.8481481481481481
Wall time: 4min 36s


### Observation: After removing highly correlated features, there is not much change in the score (0.844 before, 0.848 after). Let's apply another feature selection technique (the chi-squared test) to see whether we can increase our score.

<img src='images/icon/ppt-icons.png' alt='Mini-Challenge' style="width: 100px;float:left; margin-right:15px"/><br/>

## Mini-Challenge - 4
***
**Chi-Square test**

In this task we will try to identify the optimum number of features to use.

### Instructions

* Store all the features (independent variables) in a variable called `X`.
* Store the target variable `label` (dependent variable) in a variable called `y`.
* Three variables `nof_list`, `high_score` and `nof` are already defined for you (feel free to change the number of features in `nof_list`).
* Run a `for` loop over the elements of `nof_list`, using `n` as the loop variable.
* Inside the loop, initialise a `SelectKBest()` with the parameters `score_func=chi2` and `k=n`, and save it to a variable called `test`.
* Split `X` and `y` into `X_train`, `X_test`, `y_train` and `y_test` using the `train_test_split()` function. Use `test_size=0.3`, `stratify=y` and `random_state=42`.
* Further split the test data into `X_train1`, `X_test1`, `y_train1` and `y_test1` in a 70:30 ratio with `stratify=y_test` and `random_state=101`.
* Fit `test` on the training data `X_train1` and `y_train1` using the `fit_transform()` method. Store the result back into `X_train1`.
* Transform `X_test1` using the `transform()` method of `test`. Store the result back into `X_test1`.
* Initialise a logistic regression model with `LogisticRegression(random_state=101)` and save it to a variable called `model`.
* Fit the model on the training data `X_train1` and `y_train1` using the `fit()` method.
* Write a condition to keep the highest accuracy score over all `n`. Store the highest score in `high_score` and the
  `n` associated with it in `nof`.

In [23]:
%%time
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

# Code starts here

y = df1.iloc[:,0]
X = df1.iloc[:,1:]

nof_list=[100,300]

high_score=0

nof=0

for n in nof_list:
    test = SelectKBest(score_func=chi2 , k= n )
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,stratify=y, random_state = 42)
    X_train1, X_test1, y_train1, y_test1 = train_test_split(X_test, y_test, test_size=0.3,stratify=y_test, random_state = 101)
    X_train1 = test.fit_transform(X_train1,y_train1)
    X_test1 = test.transform(X_test1)
    
    model = LogisticRegression(random_state=101)
    model.fit(X_train1,y_train1)
    
    # Score once per candidate and keep the best result so far
    score = model.score(X_test1, y_test1)
    if score > high_score:
        high_score = score
        nof = n
print("High Score is:",high_score, "with features=",nof)

High Score is: 0.8597883597883598 with features= 300
Wall time: 4min 57s


### Observation: With the chi-squared test, the score improves by about 1 percentage point (0.848 to 0.860), and the best number of features among those tried is 300.
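
The same selection loop can be tried quickly on a smaller dataset. The sketch below is not the notebook's experiment: it assumes scikit-learn's bundled 8 x 8 digits (`load_digits`) as a stand-in for `data1.csv`, uses a single train/test split, and the candidate counts in `k_list` are hypothetical. The selector is fit on the training part only, so the test part does not influence which features are chosen.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: 8x8 digits bundled with scikit-learn (not data1.csv)
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

k_list = [10, 30, 50]  # hypothetical candidate feature counts
best_score, best_k = 0.0, None

for k in k_list:
    # Fit the selector on training data only, then reuse it on test data
    selector = SelectKBest(score_func=chi2, k=k)
    X_tr_k = selector.fit_transform(X_tr, y_tr)
    X_te_k = selector.transform(X_te)

    clf = LogisticRegression(max_iter=5000, random_state=101)
    clf.fit(X_tr_k, y_tr)

    score = clf.score(X_te_k, y_te)  # mean accuracy
    if score > best_score:
        best_score, best_k = score, k

print("Best accuracy:", best_score, "with k =", best_k)
```

Note that `chi2` requires non-negative feature values, which raw pixel intensities satisfy.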

<img src='images/icon/ppt-icons.png' alt='Mini-Challenge' style="width: 100px;float:left; margin-right:15px"/><br/>

## Mini-Challenge - 5
***
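Analysis of variance (ANOVA) is another way to test how strongly a feature is related to the target: the ANOVA F-test (`f_classif`) compares the variation of a numeric feature between classes with its variation within classes.

### Instructions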

* Store all the features (independent variables) in a variable called `X`.
* Store the target variable `label` (dependent variable) in a variable called `y`.
* Three variables `nof_list`, `high_score` and `nof` are already defined for you (feel free to change the number of features in `nof_list`).
* Run a `for` loop over the elements of `nof_list`, using `n` as the loop variable.
* Inside the loop, initialise a `SelectKBest()` with the parameters `score_func=f_classif` and `k=n`, and save it to a variable called `test`.
* Split `X` and `y` into `X_train`, `X_test`, `y_train` and `y_test` using the `train_test_split()` function. Use `test_size=0.3`, `stratify=y` and `random_state=42`.
* Further split the test data into `X_train1`, `X_test1`, `y_train1` and `y_test1` in a 70:30 ratio with `stratify=y_test` and `random_state=101`.
* Fit `test` on the training data `X_train1` and `y_train1` using the `fit_transform()` method. Store the result back into `X_train1`.
* Transform `X_test1` using the `transform()` method of `test`. Store the result back into `X_test1`.
* Initialise a logistic regression model with `LogisticRegression(random_state=101)` and save it to a variable called `model`.
* Fit the model on the training data `X_train1` and `y_train1` using the `fit()` method.
* Write a condition to keep the highest accuracy score over all `n`. Store the highest score in `high_score` and the
  `n` associated with it in `nof`.

In [24]:
%%time
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import SelectKBest

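# Use df1, the dataframe with correlated features removed
# (assumed: `data` is not defined anywhere in this notebook)
y = df1.iloc[:,0]
X = df1.iloc[:,1:]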

nof_list=[50,300]

high_score=0

nof=0


for n in nof_list:
    test = SelectKBest(score_func=f_classif , k= n )
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,stratify=y, random_state = 42)
    X_train1, X_test1, y_train1, y_test1 = train_test_split(X_test, y_test, test_size=0.3,stratify=y_test, random_state = 101)
    X_train1 = test.fit_transform(X_train1,y_train1)
    X_test1 = test.transform(X_test1)
    model = LogisticRegression(random_state=101)
    model.fit(X_train1,y_train1)

    # Score once per candidate and keep the best result so far
    score = model.score(X_test1, y_test1)
    if score > high_score:
        high_score = score
        nof = n
print("High Score is:",high_score, "with features=",nof)

High Score is: 0.8632275132275132 with features= 300
Wall time: 4min 33s


### Observation: With the ANOVA F-test, the score barely changes (0.86), and the best number of features among those tried is again 300.
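
To get a feel for what `f_classif` measures, the sketch below scores two synthetic features against a three-class target. All names and data here are hypothetical: `informative` shifts with the class label, while `noise` is unrelated to it.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1, 2], 50)  # three classes, 50 samples each

# Feature 0: class mean shifts with the label; feature 1: pure noise
informative = y_toy + rng.normal(scale=0.3, size=150)
noise = rng.normal(size=150)
X_toy = np.column_stack([informative, noise])

# One F statistic and one p-value per feature
F, p = f_classif(X_toy, y_toy)
print("F statistics:", F)
print("p-values    :", p)
```

The feature whose class means differ gets a much larger F statistic, which is why `SelectKBest(score_func=f_classif)` would rank it first.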

<img src='images/icon/ppt-icons.png' alt='Mini-Challenge' style="width: 100px;float:left; margin-right:15px"/><br/>

## Mini-Challenge - 6
***
Applying PCA feature reduction technique

### Instructions
* Store all the features (independent variables) from `df1` in a variable called `X`.
* Store the target variable `label` (dependent variable) in a variable called `y`.
* Split `X` and `y` into `X_train`, `X_test`, `y_train` and `y_test` using the `train_test_split()` function. Use `test_size=0.3`, `stratify=y` and `random_state=42`.
* Further split the test data into `X_train1`, `X_test1`, `y_train1` and `y_test1` in a 70:30 ratio with `stratify=y_test` and `random_state=101`.
* Initialise a PCA model with `PCA(.95)` and save it to a variable called `pca`.
* Fit `pca` on the training data `X_train1` using the `fit()` method.
* Print the number of components that `pca` retained.
* Transform `X_train1` and `X_test1` using the `transform()` method of `pca`. Store the results back into `X_train1` and `X_test1`
  respectively.
* Initialise a logistic regression model with `LogisticRegression(solver='lbfgs')` and save it to a variable called `logistic`.
* Fit `logistic` on the training data `X_train1` and `y_train1` using the `fit()` method.
* Predict on `X_test1` for one observation and print the result.
* Calculate the score on the test data.
* You can also compare your predicted values and observed values by printing out the values of `logistic.predict(X_test1)` and `y_test1`.

In [16]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
# Separate target and features, then apply the same two-stage split as before
y = df1.iloc[:,0]
X = df1.iloc[:,1:]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,stratify=y, random_state = 42)
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_test, y_test, test_size=0.3,stratify=y_test, random_state = 101)

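# Keep as many principal components as needed to explain 95% of the variance
pca = PCA(.95)

# Fit PCA on the training data only
pca.fit(X_train1)
print("No of Components used: ",pca.n_components_)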

X_train1 = pca.transform(X_train1)
X_test1 = pca.transform(X_test1)


logistic = LogisticRegression(solver = 'lbfgs')
logistic.fit(X_train1, y_train1)

# Predict for one observation
print("Prediction for one observation: ",logistic.predict(X_test1[0].reshape(1,-1)))

# Predict for ten observations
print("Prediction for 10 observation: ",logistic.predict(X_test1[0:10]))

print("Accuracy after applying PCA: ",logistic.score(X_test1, y_test1))



No of Components used:  106
Prediction for one observation:  [2]
Prediction for 10 observation:  [2 0 6 5 6 8 3 5 0 3]
Accuracy after applying PCA:  0.8751322751322751
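
Passing a fraction such as `.95` to `PCA` asks it to keep the smallest number of components whose cumulative explained variance exceeds that fraction. A minimal sketch of this behaviour, assuming scikit-learn's bundled 8 x 8 digits (`load_digits`) as a stand-in for the notebook's data, so the component count will differ from the 106 reported above:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data: 8x8 digits bundled with scikit-learn (not data1.csv)
X_demo, _ = load_digits(return_X_y=True)

# Keep enough components to explain 95% of the variance
pca_demo = PCA(0.95).fit(X_demo)

# Running total of explained variance over the retained components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

print("Components kept:", pca_demo.n_components_)
print("Variance explained with all kept components:", cumulative[-1])
print("Variance explained with one fewer component:", cumulative[-2])
```

The last retained component is the one that pushes the cumulative total past the threshold; with one fewer component the total stays below it.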


In [23]:
print("Prediction for 10 observation:    ",logistic.predict(X_test1[0:10]))
print("Actual values for 10 observation: ",y_test1[0:10].values)

Prediction for 10 observation:     [2 0 6 5 6 8 3 5 0 3]
Actual values for 10 observation:  [3 0 6 5 6 8 3 5 0 3]
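
The two printed arrays can also be compared programmatically. The sketch below copies the ten predicted and actual values from the output above into NumPy arrays (the names `predicted` and `actual` are introduced here for illustration) and counts the matches.

```python
import numpy as np

# Values copied from the printed output above
predicted = np.array([2, 0, 6, 5, 6, 8, 3, 5, 0, 3])
actual = np.array([3, 0, 6, 5, 6, 8, 3, 5, 0, 3])

# Element-wise comparison gives a boolean array of hits and misses
matches = predicted == actual
print(matches.sum(), "of", len(actual), "correct")
print("Misclassified positions:", np.flatnonzero(~matches))
```

Only the first observation is misclassified (a 3 predicted as a 2), so 9 of these 10 predictions are correct.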


<img src='images/icon/quiz.png' alt='Mini-Challenge' style="width: 100px;float:left; margin-right:15px"/>
<br/>

### Feature Selection & Logistic Regression
***