https://www.kaggle.com/hypnobear/absenteeism-at-work-dataset

https://www.kaggle.com/chetnasureka/absenteeismatwork/kernels

https://www.kaggle.com/shreytiwari/name-na

https://www.kaggle.com/miner16078/zenith-classification-and-clustering

https://www.kaggle.com/tejprash/theaggregatr-assign6

https://www.kaggle.com/kerneler/starter-absenteeism-at-work-7c360987-f

https://www.kaggle.com/dweepa/outliers-assign6


# Spot-Check Classification Algorithms

Spot-checking is a way of discovering which algorithms perform well on your machine learning
problem. You cannot know which algorithms are best suited to your problem beforehand. You
must trial a number of methods and focus attention on those that prove themselves the most
promising. In this chapter you will discover six machine learning algorithms that you can use
when spot-checking your classification problem in Python with scikit-learn. After completing
this lesson you will know:
1. How to spot-check machine learning algorithms on a classification problem.
2. How to spot-check two linear classification algorithms.
3. How to spot-check four nonlinear classification algorithms.


## Algorithm Spot-Checking
You cannot know which algorithm will work best on your dataset beforehand. You must use
trial and error to discover a shortlist of algorithms that do well on your problem that you can
then double down on and tune further. I call this process spot-checking.
The question is not: What algorithm should I use on my dataset? Instead it is: What
algorithms should I spot-check on my dataset? You can guess at what algorithms might do
well on your dataset, and this can be a good starting point. I recommend trying a mixture of
algorithms and see what is good at picking out the structure in your data. Below are some
suggestions when spot-checking algorithms on your dataset:
- Try a mixture of algorithm representations (e.g. instances and trees).
- Try a mixture of learning algorithms (e.g. diferent algorithms for learning the same type
of representation).
- Try a mixture of modeling types (e.g. linear and nonlinear functions or parametric and
nonparametric).
Let's get specific. In the next section, we will look at algorithms that you can use to
spot-check on your next classification machine learning project in Python.


## Algorithms Overview
We are going to take a look at six classification algorithms that you can spot-check on your
dataset. Starting with two linear machine learning algorithms:
- Logistic Regression.
- Linear Discriminant Analysis.
Then looking at four nonlinear machine learning algorithms:
- k-Nearest Neighbors.
- Naive Bayes.
- Classification and Regression Trees.
- Support Vector Machines.
Each recipe is demonstrated on the Pima Indians onset of Diabetes dataset. A test harness
using 10-fold cross-validation is used to demonstrate how to spot-check each machine learning
algorithm and mean accuracy measures are used to indicate algorithm performance. The recipes
assume that you know about each machine learning algorithm and how to use them. We will
not go into the API or parameterization of each algorithm.

## Linear Machine Learning Algorithms
This section demonstrates minimal recipes for how to use two linear machine learning algorithms:
logistic regression and linear discriminant analysis.

## Logistic Regression
Logistic regression assumes a Gaussian distribution for the numeric input variables and can
model binary classification problems. You can construct a logistic regression model using the
LogisticRegression class1.

In [1]:
# Logistic Regression Classification
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression


In [2]:
data = pd.read_csv('Absenteeism_at_work.csv')

In [3]:
data.head(20)

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Body mass index,Absenteeism time in hours
0,11,26,7,3,1,289,36,13,33,239.554,97,0,1,30,4
1,36,0,7,3,1,118,13,18,50,239.554,97,1,1,31,0
2,3,23,7,4,1,179,51,18,38,239.554,97,0,1,31,2
3,7,7,7,5,1,279,5,14,39,239.554,97,0,1,24,4
4,11,23,7,5,1,289,36,13,33,239.554,97,0,1,30,2
5,3,23,7,6,1,179,51,18,38,239.554,97,0,1,31,2
6,10,22,7,6,1,361,52,3,28,239.554,97,0,1,27,8
7,20,23,7,6,1,260,50,11,36,239.554,97,0,1,23,4
8,14,19,7,2,1,155,12,14,34,239.554,97,0,1,25,40
9,1,22,7,2,1,235,11,14,37,239.554,97,0,3,29,8


In [4]:
print(data.shape)

(740, 15)


In [5]:
array = data.values
X = array[:,0:15]
Y = array[:,14]
kfold = KFold(n_splits=10, random_state=7)

In [6]:
model = LogisticRegression()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())



0.6108108108108109


## Linear Discriminant Analysis
Linear Discriminant Analysis or LDA is a statistical technique for binary and multiclass
classification. It too assumes a Gaussian distribution for the numerical input variables. You can
construct an LDA model using the LinearDiscriminantAnalysis class2.

In [7]:
# LDA Classification

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [8]:
array_lin = data.values
X_lin = array_lin[:,0:15]
Y_lin = array_lin[:,14]
kfold = KFold(n_splits=10, random_state=7)
model = LinearDiscriminantAnalysis()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.427027027027027


