https://www.kaggle.com/hypnobear/absenteeism-at-work-dataset

https://www.kaggle.com/chetnasureka/absenteeismatwork/kernels

https://www.kaggle.com/shreytiwari/name-na

https://www.kaggle.com/miner16078/zenith-classification-and-clustering

https://www.kaggle.com/tejprash/theaggregatr-assign6

https://www.kaggle.com/kerneler/starter-absenteeism-at-work-7c360987-f

https://www.kaggle.com/dweepa/outliers-assign6


# Feature Selection For Machine Learning
The data features that you use to train your machine learning models have a huge influence on
the performance you can achieve. Irrelevant or partially relevant features can negatively impact
model performance. In this chapter you will discover automatic feature selection techniques
that you can use to prepare your machine learning data in Python with scikit-learn. After
completing this lesson you will know how to use:

1. Univariate Selection.
2. Recursive Feature Elimination.
3. Principal Component Analysis.
4. Feature Importance.


## Feature Selection

Feature selection is a process where you automatically select those features in your data that
contribute most to the prediction variable or output in which you are interested. Having
irrelevant features in your data can decrease the accuracy of many models, especially linear
algorithms like linear and logistic regression. Three benefits of performing feature selection
before modeling your data are:

- Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
- Improves Accuracy: Less misleading data means modeling accuracy improves.
- Reduces Training Time: Less data means that algorithms train faster.

You can learn more about feature selection with scikit-learn in the article Feature selection.
Each feature selection recipe below uses the Absenteeism at work dataset.

## Univariate Selection

Statistical tests can be used to select those features that have the strongest relationship with
the output variable. The scikit-learn library provides the SelectKBest class that can be used
with a suite of different statistical tests to select a specific number of features. The example
below uses the chi-squared (χ²) statistical test for non-negative features on the Absenteeism at
work dataset. Note that, as written, it passes all 15 columns (including the target, Absenteeism
time in hours) and sets k=15, so every column is scored but none is dropped.

In [1]:
# Imports (MinMaxScaler is imported here but not used below)
from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

In [2]:
data = pd.read_csv('Absenteeism_at_work.csv')

In [3]:
data.head(20)

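|  | ID | Reason for absence | Month of absence | Day of the week | Seasons | Transportation expense | Distance from Residence to Work | Service time | Age | Work load Average/day | Hit target | Disciplinary failure | Education | Body mass index | Absenteeism time in hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 26 | 7 | 3 | 1 | 289 | 36 | 13 | 33 | 239.554 | 97 | 0 | 1 | 30 | 4 |
| 1 | 36 | 0 | 7 | 3 | 1 | 118 | 13 | 18 | 50 | 239.554 | 97 | 1 | 1 | 31 | 0 |
| 2 | 3 | 23 | 7 | 4 | 1 | 179 | 51 | 18 | 38 | 239.554 | 97 | 0 | 1 | 31 | 2 |
| 3 | 7 | 7 | 7 | 5 | 1 | 279 | 5 | 14 | 39 | 239.554 | 97 | 0 | 1 | 24 | 4 |
| 4 | 11 | 23 | 7 | 5 | 1 | 289 | 36 | 13 | 33 | 239.554 | 97 | 0 | 1 | 30 | 2 |
| 5 | 3 | 23 | 7 | 6 | 1 | 179 | 51 | 18 | 38 | 239.554 | 97 | 0 | 1 | 31 | 2 |
| 6 | 10 | 22 | 7 | 6 | 1 | 361 | 52 | 3 | 28 | 239.554 | 97 | 0 | 1 | 27 | 8 |
| 7 | 20 | 23 | 7 | 6 | 1 | 260 | 50 | 11 | 36 | 239.554 | 97 | 0 | 1 | 23 | 4 |
| 8 | 14 | 19 | 7 | 2 | 1 | 155 | 12 | 14 | 34 | 239.554 | 97 | 0 | 1 | 25 | 40 |
| 9 | 1 | 22 | 7 | 2 | 1 | 235 | 11 | 14 | 37 | 239.554 | 97 | 0 | 3 | 29 | 8 |
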


In [4]:
print(data.shape)

(740, 15)


In [89]:
# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [91]:
# load data
array_univ_selec = data.values
array_univ_selec

array([[11., 26.,  7., ...,  1., 30.,  4.],
       [36.,  0.,  7., ...,  1., 31.,  0.],
       [ 3., 23.,  7., ...,  1., 31.,  2.],
       ...,
       [ 4.,  0.,  0., ...,  1., 34.,  0.],
       [ 8.,  0.,  0., ...,  1., 35.,  0.],
       [35.,  0.,  0., ...,  1., 25.,  0.]])

In [92]:
array_univ_selec.shape

(740, 15)

In [94]:
X_univ = array_univ_selec[:,0:15]
X_univ

array([[11., 26.,  7., ...,  1., 30.,  4.],
       [36.,  0.,  7., ...,  1., 31.,  0.],
       [ 3., 23.,  7., ...,  1., 31.,  2.],
       ...,
       [ 4.,  0.,  0., ...,  1., 34.,  0.],
       [ 8.,  0.,  0., ...,  1., 35.,  0.],
       [35.,  0.,  0., ...,  1., 25.,  0.]])

In [96]:
Y_univ = array_univ_selec[:,14]
Y_univ

array([  4.,   0.,   2.,   4.,   2.,   2.,   8.,   4.,  40.,   8.,   8.,
         8.,   8.,   1.,   4.,   8.,   2.,   8.,   8.,   2.,   8.,   1.,
        40.,   4.,   8.,   7.,   1.,   4.,   8.,   2.,   8.,   8.,   4.,
         8.,   2.,   1.,   8.,   4.,   8.,   4.,   2.,   4.,   4.,   8.,
         2.,   3.,   3.,   4.,   8.,  32.,   0.,   0.,   2.,   2.,   0.,
         0.,   3.,   3.,   0.,   1.,   3.,   4.,   3.,   3.,   0.,   1.,
         3.,   3.,   3.,   2.,   2.,   5.,   8.,   3.,  16.,   8.,   2.,
         8.,   1.,   3.,   1.,   1.,   8.,   8.,   5.,  32.,   8.,  40.,
         1.,   8.,   3.,   8.,   3.,   4.,   1.,   3.,  24.,   3.,   1.,
        64.,   2.,   8.,   2.,   8.,  56.,   8.,   3.,   3.,   2.,   8.,
         2.,   8.,   2.,   1.,   1.,   1.,   8.,   2.,   2.,   2.,   1.,
         2.,   2.,   2.,   2.,   2.,   2.,   2.,   2.,   8.,   8.,   2.,
         2.,   2.,   0.,   1.,   3.,   1.,   8.,   8.,   2.,   8.,   2.,
         8.,   8.,   8.,   2.,   2.,   1.,   8.,   

In [97]:
Y_univ.shape

(740,)

In [104]:
# feature extraction
test = SelectKBest(score_func=chi2, k=15)
test

SelectKBest(k=15, score_func=<function chi2 at 0x7fae9f39c488>)

In [105]:
fit = test.fit(X_univ, Y_univ)
fit

SelectKBest(k=15, score_func=<function chi2 at 0x7fae9f39c488>)

In [106]:
# summarize scores
set_printoptions(precision=3)
print(fit.scores_)

[1.933e+02 1.316e+03 7.598e+01 1.710e+01 2.178e+01 1.684e+03 2.242e+02
 3.408e+01 3.606e+01 2.522e+02 5.757e+00 6.327e+02 6.053e+00 1.711e+01
 1.897e+04]


In [107]:
features = fit.transform(X_univ)
# summarize selected features
print(features[0:5,:])

[[ 11.     26.      7.      3.      1.    289.     36.     13.     33.
  239.554  97.      0.      1.     30.      4.   ]
 [ 36.      0.      7.      3.      1.    118.     13.     18.     50.
  239.554  97.      1.      1.     31.      0.   ]
 [  3.     23.      7.      4.      1.    179.     51.     18.     38.
  239.554  97.      0.      1.     31.      2.   ]
 [  7.      7.      7.      5.      1.    279.      5.     14.     39.
  239.554  97.      0.      1.     24.      4.   ]
 [ 11.     23.      7.      5.      1.    289.     36.     13.     33.
  239.554  97.      0.      1.     30.      2.   ]]


In [108]:
features.shape

(740, 15)
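
The cells above score all 15 columns, target included, and keep all of them (k=15), so the transformed array has the same shape as the input. As a minimal sketch of how the selection step might look when the target is held out and only the top 4 features are kept, the following uses a small synthetic array (an assumption made so the example runs without the CSV file; the column layout and variable names are hypothetical, not the notebook's):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in data (hypothetical): 6 non-negative integer feature
# columns followed by one class-label column in the last position.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=200)
informative = labels * 5 + rng.integers(0, 3, size=200)  # tracks the label
noise = rng.integers(0, 10, size=(200, 5))               # unrelated to the label
array = np.column_stack([informative, noise, labels])

# Split inputs and target so the target is not among the candidate features.
X = array[:, :-1]
y = array[:, -1]

# Keep the 4 highest-scoring features under the chi-squared test.
selector = SelectKBest(score_func=chi2, k=4)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)                    # one score per input column
print(selector.get_support(indices=True))  # indices of the columns that were kept
print(X_selected.shape)
```

With k smaller than the number of input columns, the transformed array has fewer columns than X, and `get_support` shows which ones survived.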

## Recursive Feature Elimination
Recursive Feature Elimination (or RFE) works by recursively removing attributes and
building a model on those attributes that remain. It uses the model accuracy to identify which
attributes (and combinations of attributes) contribute the most to predicting the target attribute.
## Principal Component Analysis
Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a
compressed form. Generally this is called a data reduction technique. A property of PCA is that
you can choose the number of dimensions or principal components in the transformed result. In
the example below, we use PCA and select 3 principal components. Learn more about the PCA
class in scikit-learn by reviewing the API.

In [109]:
from sklearn.decomposition import PCA

In [110]:
array_pca = data.values

In [112]:
X_pca = array_pca[:,0:15]
X_pca

array([[11., 26.,  7., ...,  1., 30.,  4.],
       [36.,  0.,  7., ...,  1., 31.,  0.],
       [ 3., 23.,  7., ...,  1., 31.,  2.],
       ...,
       [ 4.,  0.,  0., ...,  1., 34.,  0.],
       [ 8.,  0.,  0., ...,  1., 35.,  0.],
       [35.,  0.,  0., ...,  1., 25.,  0.]])

In [113]:
X_pca.shape

(740, 15)

In [115]:
Y_pca = array_pca[:,14]
Y_pca

array([  4.,   0.,   2.,   4.,   2.,   2.,   8.,   4.,  40.,   8.,   8.,
         8.,   8.,   1.,   4.,   8.,   2.,   8.,   8.,   2.,   8.,   1.,
        40.,   4.,   8.,   7.,   1.,   4.,   8.,   2.,   8.,   8.,   4.,
         8.,   2.,   1.,   8.,   4.,   8.,   4.,   2.,   4.,   4.,   8.,
         2.,   3.,   3.,   4.,   8.,  32.,   0.,   0.,   2.,   2.,   0.,
         0.,   3.,   3.,   0.,   1.,   3.,   4.,   3.,   3.,   0.,   1.,
         3.,   3.,   3.,   2.,   2.,   5.,   8.,   3.,  16.,   8.,   2.,
         8.,   1.,   3.,   1.,   1.,   8.,   8.,   5.,  32.,   8.,  40.,
         1.,   8.,   3.,   8.,   3.,   4.,   1.,   3.,  24.,   3.,   1.,
        64.,   2.,   8.,   2.,   8.,  56.,   8.,   3.,   3.,   2.,   8.,
         2.,   8.,   2.,   1.,   1.,   1.,   8.,   2.,   2.,   2.,   1.,
         2.,   2.,   2.,   2.,   2.,   2.,   2.,   2.,   8.,   8.,   2.,
         2.,   2.,   0.,   1.,   3.,   1.,   8.,   8.,   2.,   8.,   2.,
         8.,   8.,   8.,   2.,   2.,   1.,   8.,   

In [116]:
Y_pca.shape

(740,)

In [127]:
# feature extraction
pca = PCA(n_components=3)
pca

PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [128]:
fit = pca.fit(X_pca)
fit

PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [129]:
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

Explained Variance: [0.673 0.228 0.037]
[[-3.853e-02 -1.486e-02  7.006e-03  7.375e-04  5.966e-04  9.967e-01
   6.109e-02 -2.268e-02 -2.234e-02  3.733e-03 -4.534e-03  3.654e-04
  -5.730e-04 -8.596e-03  5.488e-03]
 [ 3.041e-02 -2.848e-02 -1.494e-02  4.840e-04  4.317e-03 -1.204e-03
  -3.298e-02 -6.058e-04 -6.215e-03  9.983e-01 -8.619e-03  1.861e-04
  -1.220e-03 -1.055e-02  1.024e-02]
 [ 4.399e-01 -1.510e-01  1.647e-02 -9.818e-03  5.779e-03  6.322e-02
  -8.398e-01 -8.429e-02  3.281e-02 -4.817e-02  9.506e-04  1.293e-03
   8.099e-03 -6.649e-02  2.429e-01]]
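
In the components printed above, the first two rows load almost entirely on a single column each (index 5 and index 9), which is the pattern one would expect when a few columns have much larger variance than the rest, since PCA is sensitive to feature scale. The sketch below compares PCA on raw and standardized inputs using synthetic data; the use of `StandardScaler` and the data itself are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 5 independent columns, with column 0
# on a much larger scale than the others.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[:, 0] *= 100.0

# PCA on the raw data: the large-scale column dominates the first component.
raw_pca = PCA(n_components=3).fit(X)

# PCA after standardizing each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_pca = PCA(n_components=3).fit(X_scaled)

print("Raw explained variance:   ", raw_pca.explained_variance_ratio_)
print("Scaled explained variance:", scaled_pca.explained_variance_ratio_)
```

Whether to standardize depends on whether the original units are meaningful for the analysis; the comparison only shows how strongly scale can drive the result.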


## Feature Importance
Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance
of features. In the example below we construct an ExtraTreesClassifier for the Absenteeism at
work dataset. You can learn more about the ExtraTreesClassifier class
in the scikit-learn API.

In [130]:
# Feature Importance with Extra Trees Classifier

from sklearn.ensemble import ExtraTreesClassifier

In [131]:
array_feture_impor = data.values
array_feture_impor

array([[11., 26.,  7., ...,  1., 30.,  4.],
       [36.,  0.,  7., ...,  1., 31.,  0.],
       [ 3., 23.,  7., ...,  1., 31.,  2.],
       ...,
       [ 4.,  0.,  0., ...,  1., 34.,  0.],
       [ 8.,  0.,  0., ...,  1., 35.,  0.],
       [35.,  0.,  0., ...,  1., 25.,  0.]])

In [132]:
array_feture_impor.shape

(740, 15)

In [135]:
X_feature = array_feture_impor[:,0:15]
X_feature

array([[11., 26.,  7., ...,  1., 30.,  4.],
       [36.,  0.,  7., ...,  1., 31.,  0.],
       [ 3., 23.,  7., ...,  1., 31.,  2.],
       ...,
       [ 4.,  0.,  0., ...,  1., 34.,  0.],
       [ 8.,  0.,  0., ...,  1., 35.,  0.],
       [35.,  0.,  0., ...,  1., 25.,  0.]])

In [136]:
X_feature.shape

(740, 15)

In [138]:
Y_feature = array_feture_impor[:,14]
Y_feature

array([  4.,   0.,   2.,   4.,   2.,   2.,   8.,   4.,  40.,   8.,   8.,
         8.,   8.,   1.,   4.,   8.,   2.,   8.,   8.,   2.,   8.,   1.,
        40.,   4.,   8.,   7.,   1.,   4.,   8.,   2.,   8.,   8.,   4.,
         8.,   2.,   1.,   8.,   4.,   8.,   4.,   2.,   4.,   4.,   8.,
         2.,   3.,   3.,   4.,   8.,  32.,   0.,   0.,   2.,   2.,   0.,
         0.,   3.,   3.,   0.,   1.,   3.,   4.,   3.,   3.,   0.,   1.,
         3.,   3.,   3.,   2.,   2.,   5.,   8.,   3.,  16.,   8.,   2.,
         8.,   1.,   3.,   1.,   1.,   8.,   8.,   5.,  32.,   8.,  40.,
         1.,   8.,   3.,   8.,   3.,   4.,   1.,   3.,  24.,   3.,   1.,
        64.,   2.,   8.,   2.,   8.,  56.,   8.,   3.,   3.,   2.,   8.,
         2.,   8.,   2.,   1.,   1.,   1.,   8.,   2.,   2.,   2.,   1.,
         2.,   2.,   2.,   2.,   2.,   2.,   2.,   2.,   8.,   8.,   2.,
         2.,   2.,   0.,   1.,   3.,   1.,   8.,   8.,   2.,   8.,   2.,
         8.,   8.,   8.,   2.,   2.,   1.,   8.,   

In [139]:
Y_feature.shape

(740,)

In [141]:
# feature extraction
model = ExtraTreesClassifier()
model

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [142]:
model.fit(X_feature, Y_feature)
print(model.feature_importances_)

[0.035 0.105 0.059 0.061 0.037 0.035 0.022 0.021 0.025 0.064 0.052 0.053
 0.01  0.025 0.397]




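You can see that we are given an importance score for each attribute where the larger the
score, the more important the attribute. Here the largest score by far (0.397) belongs to the
last column, Absenteeism time in hours, which is the target itself and was included in
X_feature. Among the remaining columns, Reason for absence (0.105) scores highest.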
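
Raw importance arrays are easier to read when paired with column names, and the target should be kept out of the inputs so it cannot rank itself. The following is a minimal sketch on a small synthetic DataFrame; the column names, `n_estimators=100`, and `random_state=0` are assumptions chosen for a reproducible illustration, not the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic frame (hypothetical): one column that determines the target
# and two columns of pure noise.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "signal": rng.normal(size=400),
    "noise_a": rng.normal(size=400),
    "noise_b": rng.normal(size=400),
})
df["target"] = (df["signal"] > 0).astype(int)

# Separate inputs from the target before fitting.
X = df.drop(columns="target")
y = df["target"]

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Label each importance with its column name and sort, largest first.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

The importances sum to 1, so each value can be read as a share of the total.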