# Preliminary Data Exploration: Medium.com employee data

In this notebook we explore human resources data provided by Medium on Kaggle (https://www.kaggle.com/ludobenistant/hr-analytics). The data are simulated records for about 15,000 employees, and the goal is to understand which factors might lead an employee to leave the company prematurely.

In [1]:
%matplotlib notebook

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
from __future__ import print_function

In [4]:
sns.set_context("talk")

## Read in data

In [5]:
#data is comma-delimited csv, with headers in the first line
df = pd.read_csv('HR_comma_sep.csv')

In [6]:
df.columns

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'sales', 'salary'],
      dtype='object')

We have lots of interesting information that might affect an employee's attitude. We have direct measures of her general happiness level (satisfaction_level) and her employer's opinion of her (last_evaluation). Workload can be gauged from the number of projects (number_project) and the average monthly hours (average_montly_hours; the misspelling is part of the column name). Rewards from the employer can be found in salary and in promotions in the last 5 years (promotion_last_5years). Her status in the company could be gleaned from her salary level, department (stored in the column named sales), time spent at the company (time_spend_company), and number of projects.

The data is pretty clean. There is no missing data, and formats are all sensible. Ratings are already normalized to be between 0 and 1. We're ready to start looking for interesting features in the data!
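These claims are quick to check. Below is a minimal sketch, not part of the original analysis: the helper name `check_clean` is made up for illustration, and the tiny `demo` frame stands in for the real file so the snippet runs on its own. On the real data you would call `check_clean(df)`.

```python
import pandas as pd

def check_clean(frame, rating_cols=("satisfaction_level", "last_evaluation")):
    """Return (missing values per column, dtypes, whether ratings lie in [0, 1])."""
    missing = frame.isnull().sum()
    ratings = frame[list(rating_cols)]
    # True only if every rating in every rating column is between 0 and 1
    ratings_in_unit_range = bool(((ratings >= 0) & (ratings <= 1)).all().all())
    return missing, frame.dtypes, ratings_in_unit_range

# Made-up rows with the same column names, for illustration only
demo = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "last_evaluation": [0.53, 0.86, 0.88],
    "salary": ["low", "medium", "medium"],
})
missing, dtypes, ratings_ok = check_clean(demo)
print(missing.sum(), ratings_ok)
```

A total of zero missing values and `ratings_ok` being true would back up the statements above.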

In [7]:
#what are possible values for sales (despite its name, this column holds the department)
set(df.sales)

{'IT',
 'RandD',
 'accounting',
 'hr',
 'management',
 'marketing',
 'product_mng',
 'sales',
 'support',
 'technical'}

In [8]:
#what are possible values for salary
set(df.salary)

{'high', 'low', 'medium'}

## Data Exploration

The seaborn package has some great plotting functions that make it easy to see interactions between features, as well as the distribution of values within a feature. Busy as the plot below is, it shows almost every pairwise comparison you might want among the first five columns in a single figure.

In [9]:
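# pairwise plots of the first five columns, colored by whether the employee left
g = sns.PairGrid(df, hue='left', vars=df.columns[0:5])
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter, alpha=0.3)
g = g.add_legend()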


Some thoughts that come up looking at the plots above:
* Leaving seems to be bimodal: most plots show two peaks for leavers, at high and low satisfaction, at high and low evaluations, and at low and high monthly hours (roughly part-time and overtime).
* There are clear clusters of leavers when monthly hours and satisfaction level are compared.
* No employee who has stayed more than 6 years has left; leavers leave within 6 years.
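
The tenure observation in the last bullet can be checked with a group-by. This is a sketch under assumptions: `leave_rate_by` is a hypothetical helper, and the `demo` rows are made up so the snippet is self-contained. On the real data the call would be `leave_rate_by(df, "time_spend_company")`.

```python
import pandas as pd

def leave_rate_by(frame, column):
    """Head count and fraction of leavers for each value of `column`."""
    grouped = frame.groupby(column)["left"]
    return pd.DataFrame({"employees": grouped.size(),
                         "leave_rate": grouped.mean()})

# Made-up rows, for illustration only
demo = pd.DataFrame({
    "time_spend_company": [3, 3, 5, 5, 7, 8],
    "left":               [1, 0, 1, 1, 0, 0],
})
rates = leave_rate_by(demo, "time_spend_company")
print(rates)
```

If the observation holds, every row with more than 6 years of tenure should show a leave rate of zero.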

There seems to be an interesting story behind the satisfaction vs work hours plots. Do they partition by department, or by number of projects? As seen below, the answers are no and yes, respectively.

In [10]:
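# satisfaction vs work hours for each department (the sales column), split by left
g = sns.FacetGrid(df, row="sales", col="left", margin_titles=True)
g.map(plt.scatter, "average_montly_hours", "satisfaction_level")
# plt.title would label only the last panel; suptitle labels the whole figure
plt.suptitle('satisfaction vs work hours')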


<matplotlib.text.Text at 0xe83c3bfd30>

In [11]:
# satisfaction vs work hours per number of projects, split by left
g = sns.FacetGrid(df, row="number_project", col="left", margin_titles=True)
g.map(plt.scatter, "average_montly_hours", "satisfaction_level")
# plt.title would label only the last panel; suptitle labels the whole figure
plt.suptitle('satisfaction vs work hours')


<matplotlib.text.Text at 0xe83e0df400>
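
The panels above can be backed up with a per-group summary of the leavers. This is only a sketch: `leaver_profile` is a hypothetical helper and the `demo` rows are made up so the snippet runs without the CSV. On the real data you could compare `leaver_profile(df, "number_project")` with `leaver_profile(df, "sales")`.

```python
import pandas as pd

def leaver_profile(frame, by):
    """Mean satisfaction and monthly hours of leavers, per value of `by`."""
    leavers = frame[frame["left"] == 1]
    cols = ["satisfaction_level", "average_montly_hours"]
    return leavers.groupby(by)[cols].mean()

# Made-up rows, for illustration only
demo = pd.DataFrame({
    "number_project": [2, 2, 6, 6, 4],
    "satisfaction_level": [0.40, 0.42, 0.10, 0.12, 0.70],
    "average_montly_hours": [140, 150, 280, 300, 200],
    "left": [1, 1, 1, 1, 0],
})
profile = leaver_profile(demo, "number_project")
print(profile)
```

If the groups really do partition the clusters, the means should differ sharply between groups; if they don't, the rows should look much alike.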

We can also color the same plots by promotions or by salary. I've included them below for curiosity, but no clear patterns emerge from either.

Lucky for us, the data already hints at some useful relationships and some real take-home messages.

Because we have some apparently useful information about each employee, and we know whether the employee left, we can apply supervised machine learning. One benefit of some algorithms is that you don't have to do any preliminary data exploration: the algorithm can decide for itself whether a feature is relevant. As this is a classification problem with two outcomes (will leave or will stay), I will initially apply two classification algorithms: logistic regression and decision trees.
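
As a rough preview of that step, here is a minimal sketch of fitting both classifiers. It makes several assumptions that are not in this notebook: scikit-learn as the library, a hypothetical helper `fit_baselines`, one-hot encoding of the string columns, a 25% hold-out split, and a tree depth of 4. The `demo` frame is synthetic so the snippet runs on its own; on the real data the call would be `fit_baselines(df)`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_baselines(frame, target="left", seed=0):
    """Fit both classifiers and return their held-out accuracy."""
    # One-hot encode string columns such as `sales` and `salary`
    X = pd.get_dummies(frame.drop(columns=[target]))
    y = frame[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=seed),
    }
    return {name: model.fit(X_train, y_train).score(X_test, y_test)
            for name, model in models.items()}

# Synthetic stand-in data (not the Kaggle file): an employee "leaves"
# exactly when satisfaction is below 0.3
rng = np.random.RandomState(0)
n = 400
demo = pd.DataFrame({
    "satisfaction_level": rng.uniform(0, 1, n),
    "average_montly_hours": rng.randint(96, 311, n),
    "salary": rng.choice(["low", "medium", "high"], n),
})
demo["left"] = (demo["satisfaction_level"] < 0.3).astype(int)
scores = fit_baselines(demo)
print(scores)
```

Accuracy on a held-out split is only one possible yardstick; the scores on the synthetic frame say nothing about how either model performs on the real data.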

In [12]:
# same pairwise plots, colored by promotion in the last 5 years
g = sns.PairGrid(df, hue='promotion_last_5years', vars=df.columns[0:5])
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter, alpha=0.3)
g = g.add_legend()


In [13]:
# same pairwise plots, colored by salary level
g = sns.PairGrid(df, hue='salary', vars=df.columns[0:5])
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter, alpha=0.3)
g = g.add_legend()
