In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('Purchased Dataset.csv')
data.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [3]:
data.isnull().sum()  # No missing values

User ID            0
Gender             0
Age                0
EstimatedSalary    0
Purchased          0
dtype: int64

In [4]:
# Let's convert the categorical variable Gender into a numerical variable
sex_map = {'Male':0,'Female':1}

In [5]:
data['Gender'] = data['Gender'].map(sex_map)
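
`Series.map` with a dict turns any value that is not one of the keys into `NaN` without raising an error, so a quick check after the mapping is cheap insurance. A small sketch on a hypothetical series (the `'male'` typo and the names `gender_demo`, `mapped`, `n_unmapped` are made up for illustration and are not from the dataset):

```python
import pandas as pd

sex_map = {'Male': 0, 'Female': 1}

# Hypothetical values; 'male' (lowercase) is deliberately not a key of sex_map.
gender_demo = pd.Series(['Male', 'Female', 'male'])
mapped = gender_demo.map(sex_map)

# Count the entries that could not be mapped and became NaN.
n_unmapped = int(mapped.isnull().sum())
print(n_unmapped)  # 1
```

If the count is not zero, the raw column likely contains spellings the dictionary does not cover.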

In [6]:
data.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,0,19,19000,0
1,15810944,0,35,20000,0
2,15668575,1,26,43000,0
3,15603246,1,27,57000,0
4,15804002,0,19,76000,0


In [7]:
X = data[['Gender','Age','EstimatedSalary']]
y = data['Purchased']

In [8]:
from sklearn.preprocessing import StandardScaler

In [9]:
sc = StandardScaler()

In [10]:
X_scaled = sc.fit_transform(X)
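
One caveat with the cell above: `fit_transform` is applied to all of `X` before the split, so the scaler's mean and standard deviation are computed from rows that later land in the test set. A minimal sketch of one alternative fits the scaler on the training rows only by wrapping it in a pipeline. The data here is synthetic (`make_classification`), and the names `X_demo`, `y_demo` and `model` are stand-ins rather than the notebook's variables:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's three features and binary label (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=5)

# The pipeline fits StandardScaler on X_tr only, then reuses those statistics on X_te.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```

On a dataset this small the difference in accuracy is usually minor, but the pipeline form also works unchanged inside `cross_val_score`, where the scaler is then refit on each training fold.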

In [11]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X_scaled,y,test_size = 0.30,random_state = 5)

In [12]:
from sklearn.linear_model import LogisticRegression

In [13]:
regression = LogisticRegression()

In [14]:
regression.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [15]:
y_pred = regression.predict(X_test)
print("Training Data Accuracy:{}".format(regression.score(X_train,y_train)))
print("Test Data Accuracy:{}".format(regression.score(X_test,y_test)))

Training Data Accuracy:0.8464285714285714
Test Data Accuracy:0.8666666666666667


In [16]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)  # signature is (y_true, y_pred)

0.8666666666666667

In [17]:
# Let's see whether the accuracy changes if we change the random state
X_train,X_test,y_train,y_test = train_test_split(X_scaled,y,test_size = 0.30,random_state = 9)
regression = LogisticRegression()
regression.fit(X_train,y_train)
y_pred = regression.predict(X_test)
print("Training Data Accuracy:{}".format(regression.score(X_train,y_train)))
print("Test Data Accuracy:{}".format(regression.score(X_test,y_test)))
accuracy_score(y_test, y_pred)  # signature is (y_true, y_pred)

Training Data Accuracy:0.8464285714285714
Test Data Accuracy:0.8


0.8

Changing only the random state moves the test accuracy from about 0.87 to 0.80, even though the model and the data are the same.
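
To see how much a single hold-out score can move, the split can be repeated over several seeds. The sketch below uses synthetic data from `make_classification` as a stand-in for the notebook's dataset; the 20 seeds and the names `X_demo`, `y_demo` and `scores` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), three features and a binary label.
X_demo, y_demo = make_classification(n_samples=400, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)

scores = []
for seed in range(20):
    # Same model and same data every time; only the split changes.
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                              test_size=0.30, random_state=seed)
    clf = LogisticRegression().fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

scores = np.array(scores)
print("min={:.3f} max={:.3f} mean={:.3f} std={:.3f}".format(
    scores.min(), scores.max(), scores.mean(), scores.std()))
```

The gap between the minimum and maximum is a rough picture of how much any one train-test split can mislead.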

#### Cross-Validation
Suppose you train a model on a dataset with some algorithm, then measure its accuracy on that same training data and get 95% or maybe even 100%. What does this mean? Is your model ready for prediction? Not necessarily. The model has already seen this data, so a high score may only mean that it has memorized the training set. When you predict on a new set of data, it may give much worse accuracy, because it has never seen that data before and fails to generalize to it. This is the problem of overfitting.

Cross-validation helps to detect this problem. It is a resampling technique whose basic idea is to divide the available data into two parts, train and test. On the first part (train) you train the model, and on the second part (test), which the model has not seen, you make predictions and check how well the model works. If the model is accurate on the test data, it has probably not overfitted the training data and its predictions can be trusted more. If it performs badly, the model is not to be trusted and we need to tweak our algorithm.

Let’s see the different approaches of Cross-Validation:

•Hold-Out Method (a train-test split is the usual example)

It is the most basic of the CV techniques. It simply divides the dataset into a training set and a test set. The training set is used to fit the model, and the trained model then makes predictions on the test set. We check the accuracy and assess our model on that basis. This method is used because it is computationally cheap. But an evaluation based on a single hold-out set can have high variance, because it depends heavily on which data points end up in the training set and which in the test set. The evaluation will differ every time this division changes, as the two random states above showed.


•k-fold Cross-Validation
<img src="cv1.png" width="">

To tackle the high variance of the hold-out method, the k-fold method is used. The idea is simple: divide the whole dataset into ‘k’ sets, preferably of equal size. The first set is selected as the test set and the remaining ‘k-1’ sets are used to train the model. The accuracy (or error) is calculated on this test set. Then the steps are repeated: the second set is selected as the test set, the remaining ‘k-1’ sets are used as the training data, and the accuracy is calculated again. This continues until every set has served as the test set once. The CV error is the mean of the ‘k’ errors calculated individually.

<img src="cv2.png" width="">
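Because every observation is used for testing exactly once and the ‘k’ results are averaged, the estimate depends less on one particular split than a single hold-out score does. The disadvantage of k-fold CV is that it is computationally expensive, as the algorithm is trained from scratch ‘k’ times.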
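
The k-fold procedure can also be written out by hand. A minimal sketch using scikit-learn's `KFold` on synthetic data (the data, the choice of 5 folds, and the names `X_demo`, `y_demo`, `fold_scores`, `cv_accuracy` and `cv_error` are assumptions made for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X_demo):
    # Train from scratch on k-1 folds, then score on the held-out fold.
    clf = LogisticRegression()
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(clf.score(X_demo[test_idx], y_demo[test_idx]))

cv_accuracy = np.mean(fold_scores)  # mean accuracy over the k folds
cv_error = 1.0 - cv_accuracy        # corresponding misclassification rate
print(fold_scores, cv_accuracy, cv_error)
```

Note that `cross_val_score` with an integer `cv` and a classifier uses stratified folds, so its scores will generally differ a little from this plain `KFold` loop.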

•Leave One Out Cross Validation (LOOCV)

<img src="cv3.png" width=""> 
 
LOOCV is a special case of k-fold CV where k equals n, the number of observations. So instead of splitting the data into larger subsets, it selects a single observation as the test data and uses the rest of the data as the training data. The error is calculated for this test observation. Next, the second observation is selected as the test data and the rest of the data is used as the training set. Again, the error is calculated for this particular test observation. This process continues ‘n’ times, and the ‘n’ errors are averaged.
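
In scikit-learn, LOOCV can be requested by passing a `LeaveOneOut` splitter as the `cv` argument. A short sketch on a small synthetic dataset (60 rows, chosen only to keep the 60 model fits quick; `X_demo`, `y_demo` and `loo_scores` are illustrative names):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic stand-in dataset (assumption): LOOCV fits one model per row.
X_demo, y_demo = make_classification(n_samples=60, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)

loo = LeaveOneOut()
loo_scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                             cv=loo, scoring='accuracy')

# Each fold tests a single observation, so every score is either 0.0 or 1.0.
print(len(loo_scores), loo_scores.mean())
```

The mean of these 0/1 scores is the LOOCV accuracy estimate.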

### Bias Variance tradeoff for k-fold CV, LOOCV and Holdout Set CV

The ISLR book gives a very good explanation, quoted below:


A k-fold CV with k < n has a computational advantage to LOOCV. But putting computational issues aside,
a less obvious but potentially more important advantage of k-fold CV is that it often gives more accurate estimates of the test error rate than does LOOCV.
The validation set approach can lead to overestimates of the test error rate since in this approach
the training set used to fit the statistical learning method contains only half the observations of the entire data set. Using this logic, it is not hard to see that LOOCV will give approximately unbiased estimates of the test error since each training set contains n − 1 observations, which is almost as many as the number of observations in the full data set. And performing k-fold CV for, say, k = 5 or k = 10 will lead to an intermediate level of bias since each training set contains (k − 1)n/k observations—fewer than
in the LOOCV approach, but substantially more than in the validation set approach. Therefore, from the perspective of bias reduction, it is clear that LOOCV is to be preferred to k-fold CV. However, we know that bias is not the only source for concern in an estimating procedure; we must also consider the procedure’s variance. It turns out that LOOCV has higher variance than does k-fold CV with k < n. Why
is this the case? When we perform LOOCV, we are in effect averaging the outputs of n fitted models, each of which is trained on an almost identical set of observations; therefore, these outputs are highly (positively) correlated with each other. In contrast, when we perform k-fold CV with k < n, we are averaging the outputs of k fitted models that are somewhat less correlated with each other since the overlap between the training sets in each model is smaller. Since the mean of many highly correlated quantities has higher variance than does the mean of many quantities that are not as highly correlated, the test error estimate resulting from LOOCV tends to have higher variance than does the test error estimate resulting from k-fold CV.
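
The last step of that argument can be made explicit. As a simplified model (an assumption for illustration, not part of the ISLR text), suppose the $m$ per-fold error estimates $e_1, \dots, e_m$ each have variance $\sigma^2$ and every pair has the same correlation $\rho$. Then

$$
\operatorname{Var}\!\left(\frac{1}{m}\sum_{i=1}^{m} e_i\right)
= \frac{1}{m^2}\left(m\,\sigma^2 + m(m-1)\,\rho\,\sigma^2\right)
= \frac{\sigma^2}{m} + \frac{m-1}{m}\,\rho\,\sigma^2 .
$$

The first term shrinks as $m$ grows, but the second term approaches $\rho\,\sigma^2$. With $\rho$ close to 1, as for the nearly identical training sets of LOOCV, averaging removes very little variance; with smaller $\rho$, as in 5-fold or 10-fold CV, averaging helps more.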

In [33]:
from sklearn.model_selection import cross_val_score
classifier2 = LogisticRegression()
cross_val_score(classifier2,X_scaled,y,cv = 5, scoring='accuracy')

array([0.7   , 0.95  , 0.9375, 0.8125, 0.7   ])

The overall accuracy is the mean of the fold accuracies.

In [34]:
print(cross_val_score(classifier2,X_scaled,y,cv = 5, scoring='accuracy').mean())

0.82
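
A single mean hides how far the folds disagree (from 0.70 to 0.95 above), so it can help to report the spread as well. One way to sketch this is with `cross_validate`, which can also return the training score of each fold. The data below is synthetic, and the names `X_demo`, `y_demo`, `results`, `test_scores` and `train_scores` are stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)

results = cross_validate(LogisticRegression(), X_demo, y_demo, cv=5,
                         scoring='accuracy', return_train_score=True)

test_scores = results['test_score']
train_scores = results['train_score']
print("test : {:.3f} +/- {:.3f}".format(test_scores.mean(), test_scores.std()))
print("train: {:.3f} +/- {:.3f}".format(train_scores.mean(), train_scores.std()))
```

A training score far above the test score is a hint of overfitting, and a large standard deviation across folds means the mean should be read with care.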


In [35]:
# Let's see the accuracy with a Decision Tree classifier
from sklearn.tree import DecisionTreeClassifier
classifier3 = DecisionTreeClassifier()

In [36]:
cross_val_score(classifier3,X,y,cv = 10, scoring='accuracy')

array([0.85 , 0.75 , 0.925, 0.825, 0.95 , 0.825, 0.775, 0.825, 0.75 ,
       0.825])

In [37]:
cross_val_score(classifier3,X,y,cv = 10, scoring='accuracy').mean()

0.8324999999999999
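
`DecisionTreeClassifier` breaks ties between equally good splits at random, so repeated runs of the cells above can return slightly different scores. Passing a `random_state` makes the result reproducible. A sketch on synthetic data (the seed `0` is arbitrary, and `X_demo`, `y_demo`, `run_a` and `run_b` are illustrative names):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)

# A fixed random_state makes tie-breaking inside the tree deterministic.
tree = DecisionTreeClassifier(random_state=0)
run_a = cross_val_score(tree, X_demo, y_demo, cv=10, scoring='accuracy')
run_b = cross_val_score(tree, X_demo, y_demo, cv=10, scoring='accuracy')

print(np.array_equal(run_a, run_b))
print(run_a.mean())
```

Trees split on thresholds of individual features, so they do not need the scaled features; passing the unscaled `X`, as the cells above do, is fine.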