# Employee Metrics

**Project Goal**

We intend to clean this dataset and conduct exploratory analysis on it before loading it into Tableau to create an Employee Metrics dashboard.

The dashboard will give stakeholders insight into the workforce, including demographics, salaries, attrition rates, performance indicators, and employee retention, so that decisions about these areas can be grounded in data.

**Questions to Ask**
- What is the distribution of employees by gender and ethnicity?
- What is the average annual salary by department or job title?
- How does the bonus percentage vary across different business units?
- How many employees have exited the company, and what is their distribution by department?
- What is the age distribution of employees by gender?
- What is the geographical distribution of employees by country and city?
- Which employees have the highest and lowest salaries?
- What is the employee attrition rate, and how does it vary by department and business unit?


### Create Workspace

In [2]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import style
import datetime as dt

### Load Dataset

In [3]:
df = pd.read_excel('Employee  Data.xlsx', sheet_name='Data')
df.head()

| | Employee ID | Full Name | Job Title | Department | Business Unit | Gender | Ethnicity | Age | Hire Date | Annual Salary | Bonus % | Country | City | Exit Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | E02002 | Kai Le | Controls Engineer | Engineering | Manufacturing | Male | Asian | 47 | 2022-02-05 | 92368 | 0.0 | United States | Columbus | NaT |
| 1 | E02003 | Robert Patel | Analyst | Sales | Corporate | Male | Asian | 58 | 2013-10-23 | 45703 | 0.0 | United States | Chicago | NaT |
| 2 | E02004 | Cameron Lo | Network Administrator | IT | Research & Development | Male | Asian | 34 | 2019-03-24 | 83576 | 0.0 | China | Shanghai | NaT |
| 3 | E02005 | Harper Castillo | IT Systems Architect | IT | Corporate | Female | Latino | 39 | 2018-04-07 | 98062 | 0.0 | United States | Seattle | NaT |
| 4 | E02006 | Harper Dominguez | Director | Engineering | Corporate | Female | Latino | 42 | 2005-06-18 | 175391 | 0.24 | United States | Austin | NaT |


### Preliminary Exploration of the Data

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Employee ID    1000 non-null   object        
 1   Full Name      1000 non-null   object        
 2   Job Title      1000 non-null   object        
 3   Department     1000 non-null   object        
 4   Business Unit  1000 non-null   object        
 5   Gender         1000 non-null   object        
 6   Ethnicity      1000 non-null   object        
 7   Age            1000 non-null   int64         
 8   Hire Date      1000 non-null   datetime64[ns]
 9   Annual Salary  1000 non-null   int64         
 10  Bonus %        1000 non-null   float64       
 11  Country        1000 non-null   object        
 12  City           1000 non-null   object        
 13  Exit Date      103 non-null    datetime64[ns]
dtypes: datetime64[ns](2), float64(1), int64(2), object(9)
memory usage: 109.5

### Data Cleaning

In [5]:
# Normalize column names to lowercase with underscores, e.g. 'Hire Date' -> 'hire_date'
df.columns = df.columns.str.replace(' ', '_', regex=False).str.lower()
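
The renaming above only replaces spaces, so `Bonus %` becomes `bonus_%`, which cannot be used with attribute access such as `df.bonus_%`. As an optional variant, the sketch below also spells out the percent sign. The helper name `clean_columns` and the `pct` suffix are our own assumptions for illustration, and the toy frame only borrows a few column names from the dataset.

```python
import pandas as pd


def clean_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with snake_case column names (hypothetical helper)."""
    out = frame.copy()
    out.columns = (
        out.columns
        .str.strip()                             # drop stray leading/trailing spaces
        .str.lower()
        .str.replace("%", "pct", regex=False)    # 'bonus %' -> 'bonus pct'
        .str.replace(r"\s+", "_", regex=True)    # runs of whitespace -> single underscore
    )
    return out


# Empty toy frame with a few of the original column names.
toy = pd.DataFrame(columns=["Employee ID", "Hire Date", "Bonus %", "Exit Date"])
cleaned = clean_columns(toy)
print(list(cleaned.columns))
```

If the Tableau workbook expects the original `bonus_%` name, the one-line renaming above is the safer choice.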

### Exploratory Analysis

In [6]:
# Cumulative number of hires by hire date (exits are not subtracted here).
total = (
    df.groupby('hire_date')
    .agg({'employee_id': 'count'})
    .cumsum()
    .reset_index()
    .rename(columns={'employee_id': 'total_employees'})
)

# Number of hires on each hire date.
hires = (
    df.groupby('hire_date')
    .agg({'employee_id': 'count'})
    .reset_index()
    .rename(columns={'employee_id': 'total_hires'})
)

# Number of exits on each exit date (rows without an exit date are skipped).
exits = (
    df.groupby('exit_date')
    .agg({'employee_id': 'count'})
    .reset_index()
    .rename(columns={'employee_id': 'total_exits'})
)

# Combine on date; the outer join keeps dates on which only exits occurred.
attrition = (
    total.merge(hires, how='left', on='hire_date')
    .merge(exits, how='outer', left_on='hire_date', right_on='exit_date')
)

attrition.head()

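| | hire_date | total_employees | total_hires | exit_date | total_exits |
|---|---|---|---|---|---|
| 0 | 1993-04-30 | 1.0 | 1.0 | NaT | NaN |
| 1 | 1993-05-11 | 2.0 | 1.0 | NaT | NaN |
| 2 | 1993-05-13 | 3.0 | 1.0 | NaT | NaN |
| 3 | 1993-05-29 | 4.0 | 1.0 | NaT | NaN |
| 4 | 1993-06-25 | 5.0 | 1.0 | NaT | NaN |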
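
Note that `total_employees` here is a running count of hires, so it never decreases when someone leaves. If a net headcount is wanted instead, one possible approach is to align hires and exits on a single sorted date axis and subtract the cumulative exits. The sketch below is a minimal illustration on made-up rows, not the notebook's data; the names `toy` and `headcount` are hypothetical, and it assumes an employee stops counting on their exit date.

```python
import pandas as pd

# Toy data standing in for the cleaned `df` (values are illustrative only).
toy = pd.DataFrame({
    "employee_id": ["E1", "E2", "E3", "E4"],
    "hire_date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-03-01", "2020-06-01"]),
    "exit_date": pd.to_datetime(["2020-03-01", None, None, "2020-07-01"]),
})

# Count events per calendar date; value_counts skips NaT by default.
hires = toy["hire_date"].value_counts().rename("total_hires")
exits = toy["exit_date"].value_counts().rename("total_exits")

# Align both series on the union of dates, in chronological order.
headcount = (
    pd.concat([hires, exits], axis=1)
    .fillna(0)
    .astype(int)
    .sort_index()
    .rename_axis("date")
    .reset_index()
)

# Net headcount = cumulative hires minus cumulative exits.
headcount["total_employees"] = (
    headcount["total_hires"].cumsum() - headcount["total_exits"].cumsum()
)
print(headcount)
```

Sorting by date before taking the cumulative sums matters: dates that only have exits then fall in their chronological position instead of at the end of the table.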


In [7]:
# Rows that come only from `exits` have no hire_date; use the exit date as the row's date.
attrition['hire_date'] = attrition['hire_date'].fillna(attrition['exit_date'])

attrition.isnull().sum()

hire_date            0
total_employees     85
total_hires         85
exit_date          918
total_exits        918
dtype: int64

In [8]:
# Carry the last cumulative total forward; dates without hires or exits count as 0.
attrition['total_employees'] = attrition['total_employees'].ffill()
attrition['total_hires'] = attrition['total_hires'].fillna(0)
attrition['total_exits'] = attrition['total_exits'].fillna(0)
attrition = attrition.drop(columns='exit_date').rename(columns={'hire_date': 'date'})

attrition.isnull().sum()

date               0
total_employees    0
total_hires        0
total_exits        0
dtype: int64

In [9]:
attrition

| | date | total_employees | total_hires | total_exits |
|---|---|---|---|---|
| 0 | 1993-04-30 | 1.0 | 1.0 | 0.0 |
| 1 | 1993-05-11 | 2.0 | 1.0 | 0.0 |
| 2 | 1993-05-13 | 3.0 | 1.0 | 0.0 |
| 3 | 1993-05-29 | 4.0 | 1.0 | 0.0 |
| 4 | 1993-06-25 | 5.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 1014 | 2023-01-10 | 1000.0 | 0.0 | 1.0 |
| 1015 | 2023-01-25 | 1000.0 | 0.0 | 2.0 |
| 1016 | 2023-02-01 | 1000.0 | 0.0 | 1.0 |
| 1017 | 2023-02-02 | 1000.0 | 0.0 | 1.0 |
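
One of our questions is how attrition varies by department. The `df.info()` output shows 103 of 1000 rows with an exit date, so the overall share of exited employees is 10.3%. A rough way to break this down is sketched below on made-up rows. It assumes a simple definition of attrition rate (employees with an exit date divided by all employees ever recorded in the department); a period-based rate would need a different calculation. The names `toy`, `has_exited`, and `attrition_by_dept` are hypothetical.

```python
import pandas as pd

# Toy data using the cleaned column names (values are illustrative only).
toy = pd.DataFrame({
    "department": ["IT", "IT", "Sales", "Sales", "Sales", "Engineering"],
    "exit_date": pd.to_datetime([None, "2021-05-01", None, None, "2022-02-10", None]),
})

# An employee counts as exited when exit_date is populated.
toy["has_exited"] = toy["exit_date"].notna()

attrition_by_dept = (
    toy.groupby("department")["has_exited"]
    .agg(employees="count", exits="sum")   # headcount and number of exits per department
    .assign(attrition_rate=lambda t: t["exits"] / t["employees"])
    .reset_index()
)
print(attrition_by_dept)
```

The same pattern should work for business units by grouping on `business_unit` instead of `department`.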


### Prepare Dataset for Tableau Dashboarding

In [10]:
# Uncomment to write the attrition table to an Excel file for use in Tableau.
# attrition.to_excel('attrition.xlsx', index=False)