https://www.kaggle.com/hypnobear/absenteeism-at-work-dataset

https://www.kaggle.com/chetnasureka/absenteeismatwork/kernels

https://www.kaggle.com/shreytiwari/name-na

https://www.kaggle.com/miner16078/zenith-classification-and-clustering

https://www.kaggle.com/tejprash/theaggregatr-assign6

https://www.kaggle.com/kerneler/starter-absenteeism-at-work-7c360987-f

https://www.kaggle.com/dweepa/outliers-assign6


# Evaluate the Performance of Machine Learning Algorithms with Resampling
You need to know how well your algorithms perform on unseen data. The best way to evaluate
the performance of an algorithm would be to make predictions for new data to which you
already know the answers. The second best way is to use clever techniques from statistics called
resampling methods that allow you to make accurate estimates for how well your algorithm will
perform on new data. In this chapter you will discover how you can estimate the accuracy of
your machine learning algorithms using resampling methods in Python and scikit-learn on the
Absenteeism at work dataset. Let's get started.

## Evaluate Machine Learning Algorithms
Why can't you prepare your machine learning algorithm on your training dataset and use
predictions from this same dataset to evaluate performance? The simple answer is overfitting.
Imagine an algorithm that remembers every observation it is shown during training. If you
evaluated your machine learning algorithm on the same dataset used to train the algorithm, then
an algorithm like this would have a perfect score on the training dataset. But the predictions it
made on new data would be terrible. We must evaluate our machine learning algorithms on
data that is not used to train the algorithm.
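
As a minimal sketch of that thought experiment, the hypothetical `Memorizer` class below (not a scikit-learn estimator) stores every training row in a dictionary and falls back to the most common training label for anything it has not seen. The data is synthetic, with labels assigned at random, so there is nothing real to learn.

```python
import random
from collections import Counter


class Memorizer:
    """Hypothetical model that simply remembers every training observation."""

    def fit(self, X, y):
        self.lookup = {tuple(row): label for row, label in zip(X, y)}
        # Fallback for unseen rows: the most common training label.
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.lookup.get(tuple(row), self.default) for row in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


rng = random.Random(7)
# Synthetic data: 300 rows of 5 random features with random binary labels.
X = [[rng.random() for _ in range(5)] for _ in range(300)]
y = [rng.randint(0, 1) for _ in range(300)]

X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

model = Memorizer().fit(X_train, y_train)
train_acc = accuracy(y_train, model.predict(X_train))
test_acc = accuracy(y_test, model.predict(X_test))
print("Train accuracy: %.3f" % train_acc)
print("Test accuracy: %.3f" % test_acc)
```

The training score is perfect by construction, while the test score should land near chance, which is why the estimate has to come from held-out data.
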
A model evaluation is an estimate that we can use to talk about how well we think the
method may actually do in practice. It is not a guarantee of performance. Once we estimate the
performance of our algorithm, we can then re-train the final algorithm on the entire training
dataset and get it ready for operational use. Next up we are going to look at four different
techniques that we can use to split up our training dataset and create useful estimates of
performance for our machine learning algorithms:

- Train and Test Sets.
- k-fold Cross-Validation.
- Leave One Out Cross-Validation.
- Repeated Random Test-Train Splits.


## Split into Train and Test Sets
The simplest method that we can use to evaluate the performance of a machine learning
algorithm is to use different training and testing datasets. We can take our original dataset and
split it into two parts. Train the algorithm on the first part, make predictions on the second
part and evaluate the predictions against the expected results. The size of the split can depend
on the size and specifics of your dataset, although it is common to use 67% of the data for
training and the remaining 33% for testing.
This algorithm evaluation technique is very fast. It is ideal for large datasets (millions of
records) where there is strong evidence that both splits of the data are representative of the
underlying problem. Because of the speed, it is useful to use this approach when the algorithm
you are investigating is slow to train. A downside of this technique is that it can have a high
variance. This means that differences in the training and test dataset can result in meaningful
differences in the estimate of accuracy. In the example below we split the Absenteeism at work
dataset into 67%/33% splits for training and test and evaluate the accuracy of a Logistic
Regression model.

In [1]:
# Evaluate using a train and a test set
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [2]:
data = pd.read_csv('Absenteeism_at_work.csv')

In [3]:
data.head(20)

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Body mass index,Absenteeism time in hours
0,11,26,7,3,1,289,36,13,33,239.554,97,0,1,30,4
1,36,0,7,3,1,118,13,18,50,239.554,97,1,1,31,0
2,3,23,7,4,1,179,51,18,38,239.554,97,0,1,31,2
3,7,7,7,5,1,279,5,14,39,239.554,97,0,1,24,4
4,11,23,7,5,1,289,36,13,33,239.554,97,0,1,30,2
5,3,23,7,6,1,179,51,18,38,239.554,97,0,1,31,2
6,10,22,7,6,1,361,52,3,28,239.554,97,0,1,27,8
7,20,23,7,6,1,260,50,11,36,239.554,97,0,1,23,4
8,14,19,7,2,1,155,12,14,34,239.554,97,0,1,25,40
9,1,22,7,2,1,235,11,14,37,239.554,97,0,3,29,8


In [4]:
print(data.shape)

(740, 15)


In [5]:
array = data.values
array

array([[11., 26.,  7., ...,  1., 30.,  4.],
       [36.,  0.,  7., ...,  1., 31.,  0.],
       [ 3., 23.,  7., ...,  1., 31.,  2.],
       ...,
       [ 4.,  0.,  0., ...,  1., 34.,  0.],
       [ 8.,  0.,  0., ...,  1., 35.,  0.],
       [35.,  0.,  0., ...,  1., 25.,  0.]])

In [6]:
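# Caution: 0:15 keeps column 14 (the target) in X, so the label leaks into the features;
# array[:,0:14] would exclude it (the outputs below were produced with 0:15).
X = array[:,0:15]
X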

array([[11., 26.,  7., ...,  1., 30.,  4.],
       [36.,  0.,  7., ...,  1., 31.,  0.],
       [ 3., 23.,  7., ...,  1., 31.,  2.],
       ...,
       [ 4.,  0.,  0., ...,  1., 34.,  0.],
       [ 8.,  0.,  0., ...,  1., 35.,  0.],
       [35.,  0.,  0., ...,  1., 25.,  0.]])

In [7]:
X.shape

(740, 15)

In [8]:
Y = array[:,14]
Y

array([  4.,   0.,   2.,   4.,   2.,   2.,   8.,   4.,  40.,   8.,   8.,
         8.,   8.,   1.,   4.,   8.,   2.,   8.,   8.,   2.,   8.,   1.,
        40.,   4.,   8.,   7.,   1.,   4.,   8.,   2.,   8.,   8.,   4.,
         8.,   2.,   1.,   8.,   4.,   8.,   4.,   2.,   4.,   4.,   8.,
         2.,   3.,   3.,   4.,   8.,  32.,   0.,   0.,   2.,   2.,   0.,
         0.,   3.,   3.,   0.,   1.,   3.,   4.,   3.,   3.,   0.,   1.,
         3.,   3.,   3.,   2.,   2.,   5.,   8.,   3.,  16.,   8.,   2.,
         8.,   1.,   3.,   1.,   1.,   8.,   8.,   5.,  32.,   8.,  40.,
         1.,   8.,   3.,   8.,   3.,   4.,   1.,   3.,  24.,   3.,   1.,
        64.,   2.,   8.,   2.,   8.,  56.,   8.,   3.,   3.,   2.,   8.,
         2.,   8.,   2.,   1.,   1.,   1.,   8.,   2.,   2.,   2.,   1.,
         2.,   2.,   2.,   2.,   2.,   2.,   2.,   2.,   8.,   8.,   2.,
         2.,   2.,   0.,   1.,   3.,   1.,   8.,   8.,   2.,   8.,   2.,
         8.,   8.,   8.,   2.,   2.,   1.,   8.,   

In [9]:
Y.shape

(740,)
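
The slice `0:15` used for `X` keeps all 15 columns, so `X` still contains column 14, the absenteeism hours that `Y` is taken from. A minimal sketch of a leak-free separation, using plain Python lists and only the first three rows printed by `data.head(20)`, might look like this (the `split_features_target` helper is hypothetical):

```python
# First three rows shown in the data.head(20) output (index column dropped).
rows = [
    [11, 26, 7, 3, 1, 289, 36, 13, 33, 239.554, 97, 0, 1, 30, 4],
    [36, 0, 7, 3, 1, 118, 13, 18, 50, 239.554, 97, 1, 1, 31, 0],
    [3, 23, 7, 4, 1, 179, 51, 18, 38, 239.554, 97, 0, 1, 31, 2],
]


def split_features_target(rows, target_index):
    """Return (features, target) with the target column removed from the features."""
    X = [row[:target_index] + row[target_index + 1:] for row in rows]
    y = [row[target_index] for row in rows]
    return X, y


X_clean, y_clean = split_features_target(rows, target_index=14)
print(len(X_clean[0]))  # number of feature columns left
print(y_clean)
```

With the NumPy array used in this notebook, the equivalent feature slice would be `array[:,0:14]`. The accuracies reported here were computed with the target still included, so they would likely differ after such a change.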

In [11]:
test_size = 0.33
seed = 7

In [12]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size,
random_state=seed)

In [18]:
X_train

array([[ 5., 26.,  8., ...,  1., 38.,  8.],
       [33., 23., 11., ...,  1., 32.,  2.],
       [23., 19.,  4., ...,  1., 21.,  8.],
       ...,
       [11., 24., 11., ...,  1., 30.,  8.],
       [33., 28.,  4., ...,  1., 32.,  8.],
       [28., 11.,  3., ...,  1., 24.,  8.]])

In [19]:
X_test

array([[22., 13.,  6., ...,  3., 19.,  2.],
       [ 3., 27.,  2., ...,  1., 31.,  3.],
       [22., 27.,  5., ...,  3., 19.,  2.],
       ...,
       [30., 28.,  8., ...,  1., 22.,  4.],
       [34., 27.,  1., ...,  1., 28.,  2.],
       [29.,  0.,  9., ...,  1., 24.,  0.]])

In [15]:
print(X_train.shape)
print(X_test.shape)

(495, 15)
(245, 15)


In [20]:
Y_train

array([  8.,   2.,   8.,   3.,   2.,   3.,   3.,   8.,   8.,   8.,   0.,
        24.,   2.,   2.,   4.,   4.,   8.,   8.,   8.,   3.,   1.,   4.,
         1.,   8.,   3.,   8.,   8.,   2.,   8.,   1.,   4.,   8.,   0.,
         3.,   3.,   3.,   8.,   8.,   8.,   8.,   0.,   4.,   2.,   4.,
         2.,   8.,  24.,   2.,   2.,   8.,   3.,   8.,   1.,   3.,   8.,
       120.,   3.,   2.,   4.,   3.,   3., 120.,   5.,   0.,   8.,   3.,
        16.,   8.,   4.,   3.,   4.,   8.,   8.,  40.,   0.,  24.,   8.,
         2.,   2.,   2.,   4.,   4.,   3.,   8.,   4.,   3.,   3.,   1.,
         8.,   3.,   2.,   8.,   8., 120.,   8.,   1.,   8.,   4.,   4.,
         8.,   8.,   8.,   4.,   8.,   5.,   2.,   1.,   3.,   2.,   2.,
         8.,   8.,  24.,   3.,   8.,   1.,   2.,   1., 104.,   0.,   2.,
         1.,   8.,   3.,   8.,   8.,   2.,   4.,   2.,   8.,   3.,  16.,
         3.,   8.,   1.,   8.,   3.,  40.,   8.,   2.,   8.,   4.,   1.,
         3.,   5.,   3.,   2.,   8.,   2.,   8.,   

In [21]:
Y_test

array([  2.,   3.,   2.,   8.,  24.,   3.,   8.,   3.,   8.,   8.,   1.,
         2.,   8.,   1.,   4.,   8.,   3.,   8.,   2.,   2.,   8.,   4.,
         1.,   3.,  16.,   8.,   8.,   8.,   3.,   1.,   4.,   2.,   8.,
         0.,   1.,   2.,   1.,   1.,   2.,   2.,   3.,   0.,  24.,   2.,
         3.,   2.,   4.,   1.,   8.,   1.,   8.,   3.,   8.,   8.,   2.,
         0.,   8.,   3.,   0.,   2.,   2.,   2.,   1.,   1.,   3.,   2.,
         0.,   8.,   5.,   8.,   3.,   3.,   8.,   8.,   1.,   2.,   1.,
         8.,   2.,   3.,   8.,   5.,   3.,   8.,   3.,   1.,   8.,   3.,
         4.,   8.,   2.,   8.,   3.,   4.,   8.,   0.,   8.,   3.,  16.,
         1.,  64.,   2.,   2.,  16.,   2.,  16.,   3.,   2.,   4.,   2.,
         8.,   2.,   2.,   3.,   8.,  64.,   0.,   0.,   8.,   3.,  32.,
         2.,  24.,   3.,   8.,   1.,   1.,   3.,   8.,   3.,   2.,   2.,
         8.,   8.,   4.,   2.,   0.,   2.,   8.,   0.,   1.,   3.,   0.,
         8.,   3.,   1.,   4.,   8.,  40.,   3.,   

In [16]:
print(Y_train.shape)
print(Y_test.shape)

(495,)
(245,)


In [23]:
model = LogisticRegression()
model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [24]:
model.fit(X_train, Y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [25]:
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result*100.0))

Accuracy: 62.857%
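
To make the split arithmetic concrete, here is a rough pure-Python sketch of a shuffled 67%/33% split. The helper name `simple_train_test_split` is hypothetical, and rounding the test size up is an assumption chosen so that the sizes match the 495/245 shapes printed above; it does not reproduce scikit-learn's exact shuffling.

```python
import math
import random


def simple_train_test_split(n_samples, test_size, seed):
    """Shuffle row indices and cut them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)  # assumption: round test size up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = simple_train_test_split(n_samples=740, test_size=0.33, seed=7)
print(len(train_idx), len(test_idx))
```

Changing `seed` changes which rows end up in each part, which is the source of the variance discussed in this section.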


## K-fold Cross-Validation
Cross-validation is an approach that you can use to estimate the performance of a machine
learning algorithm with less variance than a single train-test set split. It works by splitting
the dataset into k-parts (e.g. k = 5 or k = 10). Each split of the data is called a fold. The
algorithm is trained on k - 1 folds with one held back and tested on the held back fold. This is
repeated so that each fold of the dataset is given a chance to be the held back test set. After
running cross-validation you end up with k different performance scores that you can summarize
using a mean and a standard deviation.
The result is a more reliable estimate of the performance of the algorithm on new data. It is
more accurate because the algorithm is trained and evaluated multiple times on different data.
The choice of k must allow the size of each test partition to be large enough to be a reasonable
sample of the problem, whilst allowing enough repetitions of the train-test evaluation of the
algorithm to provide a fair estimate of the algorithm's performance on unseen data. For modest
sized datasets in the thousands or tens of thousands of records, k values of 3, 5 and 10 are
common. In the example below we use 10-fold cross-validation.

In [28]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

In [29]:
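array_k = data.values
# Note: 0:15 keeps column 14 (the target) in X; 0:14 would exclude it.
X = array_k[:,0:15]
Y = array_k[:,14]
# shuffle defaults to False, so random_state has no effect here and is rejected by
# recent scikit-learn releases; it is omitted.
kfold = KFold(n_splits=10)
model = LogisticRegression()
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))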



Accuracy: 61.081% (7.297%)
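
The fold mechanics can be sketched without any library. The hypothetical `kfold_indices` generator below cuts the row indices into k contiguous, unshuffled blocks and yields each block once as the test fold; spreading any remainder over the first folds is an assumption of this sketch.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # first `extra` folds get one more row
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(kfold_indices(n_samples=740, k=10))
for train, test in folds[:2]:
    print(len(train), len(test), test[0], test[-1])
```

With 740 rows and k = 10, every fold holds 74 rows, and each row is used for testing exactly once.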




## Leave One Out Cross-Validation
You can configure cross-validation so that the size of the fold is 1 (k is set to the number of
observations in your dataset). This variation of cross-validation is called leave-one-out
cross-validation. The result is a large number of performance measures that can be summarized
in an effort to give a more reasonable estimate of the accuracy of your model on unseen data.
A downside is that it can be a computationally more expensive procedure than k-fold
cross-validation. In the example below we use leave-one-out cross-validation.

In [30]:
# Evaluate using Leave One Out Cross Validation

from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

In [32]:
array_cross_val = data.values
# Note: 0:15 keeps column 14 (the target) in X; 0:14 would exclude it.
X = array_cross_val[:,0:15]
Y = array_cross_val[:,14]
loocv = LeaveOneOut()
model = LogisticRegression()
results = cross_val_score(model, X, Y, cv=loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 62.027% (48.532%)
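
The large standard deviation reported above is expected: with one observation per fold, each score is either 0 or 1, so the spread of the scores is sqrt(p(1 - p)) for a mean accuracy p. The sketch below checks this arithmetic; the count of 459 correct predictions out of 740 is inferred from the printed 62.027%, not taken from the notebook.

```python
import math
import statistics

n_samples = 740
n_correct = 459  # assumption: inferred from the reported 62.027% accuracy

# Leave-one-out yields one score per observation: 1.0 if correct, 0.0 otherwise.
scores = [1.0] * n_correct + [0.0] * (n_samples - n_correct)

mean_acc = sum(scores) / len(scores)
std_acc = statistics.pstdev(scores)  # population standard deviation, numpy's default for std()
closed_form = math.sqrt(mean_acc * (1.0 - mean_acc))

print("Accuracy: %.3f%% (%.3f%%)" % (mean_acc * 100.0, std_acc * 100.0))
print("sqrt(p * (1 - p)): %.3f%%" % (closed_form * 100.0))
```

The standard deviation here describes the spread of individual 0/1 outcomes rather than the uncertainty of the mean, so it is not directly comparable with the value reported for 10-fold cross-validation.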


## Repeated Random Test-Train Splits
Another variation on k-fold cross-validation is to create a random split of the data like the
train/test split described above, but repeat the process of splitting and evaluation of the
algorithm multiple times, like cross-validation. This has the speed of using a train/test split and
the reduction in variance in the estimated performance of k-fold cross-validation. You can also
repeat the process many more times as needed to improve the accuracy of the estimate. A downside is that
repetitions may include much of the same data in the train or the test split from run to run,
introducing redundancy into the evaluation. The example below splits the data into a 67%/33%
train/test split and repeats the process 10 times.

In [33]:
# Evaluate using Shuffle Split Cross Validation

from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

In [35]:
array_r_r = data.values
# Note: 0:15 keeps column 14 (the target) in X; 0:14 would exclude it.
X = array_r_r[:,0:15]
Y = array_r_r[:,14]
n_splits = 10
test_size = 0.33
seed = 7
shuffle_split = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
model = LogisticRegression()
results = cross_val_score(model, X, Y, cv=shuffle_split)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))



Accuracy: 63.837% (2.739%)
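
A rough sketch of the resampling loop itself, without scikit-learn: the hypothetical `repeated_random_splits` generator reshuffles the indices on every repetition, so the same row can land in the test set several times. Rounding the test size up is an assumption of this sketch.

```python
import math
import random


def repeated_random_splits(n_samples, n_splits, test_size, seed):
    """Yield (train_indices, test_indices) from a fresh shuffle on each repetition."""
    rng = random.Random(seed)
    n_test = math.ceil(n_samples * test_size)  # assumption: round test size up
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


splits = list(repeated_random_splits(n_samples=740, n_splits=10, test_size=0.33, seed=7))

# Count how often each row is used for testing across the 10 repetitions.
test_counts = [0] * 740
for _, test in splits:
    for i in test:
        test_counts[i] += 1
print("Rows tested more than once:", sum(c > 1 for c in test_counts))
print("Rows never tested:", sum(c == 0 for c in test_counts))
```

Unlike k-fold cross-validation, nothing forces every row to be tested exactly once, which is the redundancy this section describes.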




## What Techniques to Use When

This section lists some tips for choosing a resampling technique in different circumstances.

- Generally k-fold cross-validation is the gold standard for evaluating the performance of a machine learning algorithm on unseen data with k set to 3, 5, or 10.

- Using a train/test split is good for speed when using a slow algorithm and produces performance estimates with lower bias when using large datasets.

- Techniques like leave-one-out cross-validation and repeated random splits can be useful intermediates when trying to balance variance in the estimated performance, model training speed and dataset size.


The best advice is to experiment and find a technique for your problem that is fast and
produces reasonable estimates of performance that you can use to make decisions. If in doubt,
use 10-fold cross-validation.