# Supervised Learning: Regression
## Peer Review Project 2

This project uses the publicly available **IBM HR Analytics Employee Attrition & Performance** data hosted on Kaggle.

[https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset)

---

# Report Summary

This report applies several linear regression models in an attempt to predict employee attrition based on the available performance data.

## Data set details

- The data set contains *1470 observations* with *35 features*. 
- 9 features contain categorical data (object)

    `Attrition, BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus, Over18, OverTime`

- 26 features contain numerical data (int64)

    `Age,  DailyRate, DistanceFromHome, Education,  EmployeeCount, EmployeeNumber, EnvironmentSatisfaction,  HourlyRate, JobInvolvement, JobLevel, JobSatisfaction, MonthlyIncome, MonthlyRate, NumCompaniesWorked, PercentSalaryHike, PerformanceRating, RelationshipSatisfaction, StandardHours, StockOptionLevel, TotalWorkingYears, TrainingTimesLastYear, WorkLifeBalance, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager`

## Objectives

1. To train and test several machine learning models on this data set to predict employee attrition.
2. To select the best hyperparameters and regularization methods for each model based on coefficient scores.
3. To select the ML model most appropriate for this data based on complexity and mean squared error scores.

# Methodology

## Linear regression models and regularization methods used 

This project tests plain linear regression alongside three regularized variants: LASSO (L1 penalty), Ridge (L2 penalty), and Elastic Net (a combination of the L1 and L2 penalties).
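As a rough illustration of how the models named above differ, the sketch below fits each one to synthetic data in which only the first two features carry signal. The data, the `alpha` values, and the `l1_ratio` are assumptions chosen for demonstration and are not the settings used in this project.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# Synthetic data (assumption): 5 features, only the first two matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Illustrative penalty strengths, not tuned values
models = {
    "linear": LinearRegression(),                        # no penalty
    "lasso": Lasso(alpha=0.5),                           # L1 penalty
    "ridge": Ridge(alpha=0.5),                           # L2 penalty
    "elastic_net": ElasticNet(alpha=0.5, l1_ratio=0.5),  # mix of L1 and L2
}

# Fit each model and collect its coefficient vector
coefs = {name: model.fit(X, y).coef_ for name, model in models.items()}

for name, coef in coefs.items():
    print(f"{name:12s}", np.round(coef, 3))
```

In a setup like this, the L1 penalty tends to set the coefficients of uninformative features to exactly zero, while the L2 penalty only shrinks them toward zero.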


## Data preparation and analysis methods 
1. The target feature to predict is `Attrition`.
2. All categorical features are one-hot encoded to make them usable in modeling.
3. Numerical features are tested for normality and transformed where appropriate.
4. Analysis pipelines with scaling, grid search, and 3-fold cross-validation (25% testing data) are run on the selected models.
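
The steps above can be sketched as a single scikit-learn pipeline. This is a minimal sketch under stated assumptions: the data is synthetic, the `alpha` grid is hypothetical, Ridge stands in for any of the selected models, and the 25% figure is read here as a hold-out test split.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded data (assumption): binary 0/1 target
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(float)

# Hold out 25% of the rows for testing (assumed reading of "25% testing data")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scaling happens inside the pipeline so it is refit on each training fold
pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])

# Hypothetical grid of regularization strengths
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0]}

cv = KFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

test_mse = mean_squared_error(y_test, search.predict(X_test))
print("best alpha:", search.best_params_["model__alpha"])
print("test MSE:", round(test_mse, 4))
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation folds into the training folds during the grid search.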

## Results 
1. Most informative features of the dataset 
2. General performance of models
3. Best performing model


## Conclusion 
1. Model selection for this dataset, appropriate hyperparameters and regularization method
2. Flaws in the dataset
3. Flaws in the model/objectives
4. Way forward


---

# Code below this point

The code below implements the analysis summarised in the sections above.

In [None]:
# Import modules

%matplotlib inline
%config InlineBackend.figure_formats = ['retina']

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [None]:
# Importing data
data = pd.read_csv('IBM_employee_data.csv')
data.info()

In [None]:
data.dtypes.value_counts()

In [None]:
data.head()

Review the categorical features to determine which have more than one category, so that we can one-hot encode them.

In [26]:
# Select the object (string) columns
mask = data.dtypes == object

In [27]:
mask

Age                         False
Attrition                    True
BusinessTravel               True
DailyRate                   False
Department                   True
DistanceFromHome            False
Education                   False
EducationField               True
EmployeeCount               False
EmployeeNumber              False
EnvironmentSatisfaction     False
Gender                       True
HourlyRate                  False
JobInvolvement              False
JobLevel                    False
JobRole                      True
JobSatisfaction             False
MaritalStatus                True
MonthlyIncome               False
MonthlyRate                 False
NumCompaniesWorked          False
Over18                       True
OverTime                     True
PercentSalaryHike           False
PerformanceRating           False
RelationshipSatisfaction    False
StandardHours               False
StockOptionLevel            False
TotalWorkingYears           False
TrainingTimesL

In [28]:
mask.drop(mask.index[1])

Age                         False
BusinessTravel               True
DailyRate                   False
Department                   True
DistanceFromHome            False
Education                   False
EducationField               True
EmployeeCount               False
EmployeeNumber              False
EnvironmentSatisfaction     False
Gender                       True
HourlyRate                  False
JobInvolvement              False
JobLevel                    False
JobRole                      True
JobSatisfaction             False
MaritalStatus                True
MonthlyIncome               False
MonthlyRate                 False
NumCompaniesWorked          False
Over18                       True
OverTime                     True
PercentSalaryHike           False
PerformanceRating           False
RelationshipSatisfaction    False
StandardHours               False
StockOptionLevel            False
TotalWorkingYears           False
TrainingTimesLastYear       False
WorkLifeBalanc

In [29]:
categorical_cols = data.columns[mask]

# Determine how many extra columns would be created
num_ohc_cols = (data[categorical_cols]
                .apply(lambda x: x.nunique())
                .sort_values(ascending=False))
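
The unique-value counts computed above can be turned into an explicit column count and a check for single-category features. The sketch below uses a small toy frame; its values are illustrative assumptions and are not drawn from the dataset.

```python
import pandas as pd

# Toy categorical frame (assumption): values are made up for illustration
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "MaritalStatus": ["Single", "Married", "Divorced", "Single"],
    "Over18": ["Y", "Y", "Y", "Y"],
})

# Number of distinct categories per column, largest first
n_unique = toy.nunique().sort_values(ascending=False)

# A column with k categories becomes k indicator columns, a net gain of k - 1
extra_cols = int((n_unique - 1).sum())

# Columns with a single category carry no information for a model
constant_cols = n_unique[n_unique == 1].index.tolist()

print(n_unique)
print("extra columns:", extra_cols)
print("constant columns:", constant_cols)
```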


In [30]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Copy of the data
data_ohc = data.copy()

# The encoders
le = LabelEncoder()
ohc = OneHotEncoder()

for col in num_ohc_cols.index:
    # Integer encode the string categories
    dat = le.fit_transform(data_ohc[col]).astype(int)
    
    # Remove the original column from the dataframe
    data_ohc = data_ohc.drop(col, axis=1)

    # One hot encode the data--this returns a sparse array
    new_dat = ohc.fit_transform(dat.reshape(-1,1))

    # Create unique column names
    n_cols = new_dat.shape[1]
    col_names = ['_'.join([col, str(x)]) for x in range(n_cols)]

    # Create the new dataframe
    new_df = pd.DataFrame(new_dat.toarray(), 
                          index=data_ohc.index, 
                          columns=col_names)

    # Append the new data to the dataframe
    data_ohc = pd.concat([data_ohc, new_df], axis=1)
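
For comparison, a similar encoding can be obtained with `pd.get_dummies`. This is an alternative sketch, not the method used in the cell above: it is shown on a toy frame with made-up values, and it names the new columns after the category labels rather than integer codes.

```python
import pandas as pd

# Toy frame (assumption): values are made up for illustration
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "OverTime": ["Yes", "No", "Yes"],
    "Department": ["Sales", "Research & Development", "Sales"],
})

# Pick the non-numeric columns to encode
categorical = toy.select_dtypes(exclude="number").columns.tolist()

# Replace each categorical column with one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=categorical, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Label-based column names such as `OverTime_Yes` can make model coefficients easier to interpret than index-based names such as `OverTime_1`.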

In [31]:
print(data.shape[1])

# Remove the string columns from the dataframe
data = data.drop(num_ohc_cols.index, axis=1)

print(data.shape[1])

35
26


In [32]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 26 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Age                       1470 non-null   int64
 1   DailyRate                 1470 non-null   int64
 2   DistanceFromHome          1470 non-null   int64
 3   Education                 1470 non-null   int64
 4   EmployeeCount             1470 non-null   int64
 5   EmployeeNumber            1470 non-null   int64
 6   EnvironmentSatisfaction   1470 non-null   int64
 7   HourlyRate                1470 non-null   int64
 8   JobInvolvement            1470 non-null   int64
 9   JobLevel                  1470 non-null   int64
 10  JobSatisfaction           1470 non-null   int64
 11  MonthlyIncome             1470 non-null   int64
 12  MonthlyRate               1470 non-null   int64
 13  NumCompaniesWorked        1470 non-null   int64
 14  PercentSalaryHike         1470 non-null 