# Unsupervised Learning Project

1. Data Prep & EDA
2. K-Means Clustering
3. PCA for Visualization
4. K-Means Clustering (Round 2)
5. PCA for Visualization (Round 2)
6. EDA on Clusters
7. Make Recommendations

## Goal & Scope

**GOAL**: You are trying to better understand the company’s different segments of employees and how to increase employee retention within each segment.

**SCOPE**: Your task is to use a clustering technique to segment the employees, a dimensionality reduction technique to visualize the segments, and finally explore the clusters to make recommendations to increase retention.

# 1. Data Prep & EDA

## a. Create numeric columns and handle any missing data or problems with data types

In [1]:
import pandas as pd

employee_data = pd.read_csv('../employee_data.csv')
employee_data

,EmployeeID,Age,Gender,DistanceFromHome,JobLevel,Department,MonthlyIncome,PerformanceRating,JobSatisfaction,Attrition
0,1001,41,Female,1,2,Sales,5993,3,4,Yes
1,1002,49,Male,8,2,Research & Development,5130,4,2,No
2,1004,37,Male,2,1,Research & Development,2090,3,3,Yes
3,1005,33,Female,3,1,Research & Development,2909,3,3,No
4,1007,27,Male,2,1,Research & Development,3468,3,2,No
...,...,...,...,...,...,...,...,...,...,...
1465,3061,36,Male,23,2,Research & Development,2571,3,4,No
1466,3062,39,Male,6,3,Research & Development,9991,3,1,No
1467,3064,27,Male,4,2,Research & Development,6142,4,2,No
1468,3065,49,Male,2,2,Sales,5390,3,2,No


In [2]:
employee_data.shape

(1470, 10)

In [5]:
employee_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   EmployeeID         1470 non-null   int64 
 1   Age                1470 non-null   int64 
 2   Gender             1470 non-null   object
 3   DistanceFromHome   1470 non-null   int64 
 4   JobLevel           1470 non-null   int64 
 5   Department         1470 non-null   object
 6   MonthlyIncome      1470 non-null   int64 
 7   PerformanceRating  1470 non-null   int64 
 8   JobSatisfaction    1470 non-null   int64 
 9   Attrition          1470 non-null   object
dtypes: int64(7), object(3)
memory usage: 115.0+ KB
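
`info()` reports 1470 non-null values in every column, so nothing appears to be missing. A minimal sketch of making that check explicit, together with a duplicate-ID check, is shown here. It assumes a small hand-typed sample of the rows displayed earlier in place of the CSV so that it runs on its own; `sample`, `missing_per_column`, and `duplicate_ids` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in for employee_data: a few columns of the first rows shown earlier.
sample = pd.DataFrame({
    "EmployeeID": [1001, 1002, 1004, 1005, 1007],
    "Age": [41, 49, 37, 33, 27],
    "Gender": ["Female", "Male", "Male", "Female", "Male"],
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
})

# Count missing values per column; all zeros means there is nothing to impute.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Count repeated employee IDs; zero means each row is a distinct employee.
duplicate_ids = sample["EmployeeID"].duplicated().sum()
print(duplicate_ids)
```

On the full dataset the same two calls would be applied to `employee_data` directly.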


In [7]:
employee_data.dtypes[employee_data.dtypes == 'int64']

EmployeeID           int64
Age                  int64
DistanceFromHome     int64
JobLevel             int64
MonthlyIncome        int64
PerformanceRating    int64
JobSatisfaction      int64
dtype: object

In [8]:
employee_data.dtypes[employee_data.dtypes != 'int64']

Gender        object
Department    object
Attrition     object
dtype: object

In [10]:
# Work on a copy so the raw employee_data stays unchanged
Data = employee_data.copy()
Data.head()

,EmployeeID,Age,Gender,DistanceFromHome,JobLevel,Department,MonthlyIncome,PerformanceRating,JobSatisfaction,Attrition
0,1001,41,Female,1,2,Sales,5993,3,4,Yes
1,1002,49,Male,8,2,Research & Development,5130,4,2,No
2,1004,37,Male,2,1,Research & Development,2090,3,3,Yes
3,1005,33,Female,3,1,Research & Development,2909,3,3,No
4,1007,27,Male,2,1,Research & Development,3468,3,2,No


In [11]:
Data.Gender.value_counts()

Gender
Male      882
Female    588
Name: count, dtype: int64

In [12]:
import numpy as np

# Encode Gender as a binary flag: 1 = Female, 0 = Male
Data['Gender'] = np.where(Data['Gender'] == 'Female', 1, 0)
Data.head()

,EmployeeID,Age,Gender,DistanceFromHome,JobLevel,Department,MonthlyIncome,PerformanceRating,JobSatisfaction,Attrition
0,1001,41,1,1,2,Sales,5993,3,4,Yes
1,1002,49,0,8,2,Research & Development,5130,4,2,No
2,1004,37,0,2,1,Research & Development,2090,3,3,Yes
3,1005,33,1,3,1,Research & Development,2909,3,3,No
4,1007,27,0,2,1,Research & Development,3468,3,2,No


In [13]:
# Encode Attrition as a binary flag: 1 = Yes (employee left), 0 = No
Data['Attrition'] = np.where(Data['Attrition'] == 'Yes', 1, 0)
Data.head()

,EmployeeID,Age,Gender,DistanceFromHome,JobLevel,Department,MonthlyIncome,PerformanceRating,JobSatisfaction,Attrition
0,1001,41,1,1,2,Sales,5993,3,4,1
1,1002,49,0,8,2,Research & Development,5130,4,2,0
2,1004,37,0,2,1,Research & Development,2090,3,3,1
3,1005,33,1,3,1,Research & Development,2909,3,3,0
4,1007,27,0,2,1,Research & Development,3468,3,2,0
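
`np.where` sends every value that is not an exact match to 0, which is safe here because `value_counts()` shows only `Male` and `Female`. As a hedged alternative, the sketch below uses `Series.map`, which leaves unrecognised labels as `NaN` so they can be inspected. The sample frame, including its deliberately mistyped `female`, is hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample; the last Gender value has a deliberate casing typo.
sample = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "female"],
    "Attrition": ["Yes", "No", "Yes", "No"],
})

# Series.map returns NaN for any value missing from the dictionary,
# so unexpected labels surface instead of silently becoming 0.
gender_encoded = sample["Gender"].map({"Female": 1, "Male": 0})
attrition_encoded = sample["Attrition"].map({"Yes": 1, "No": 0})

# Collect the original labels that could not be mapped.
unmapped_gender = sample.loc[gender_encoded.isna(), "Gender"].tolist()
print(unmapped_gender)
```

If the list of unmapped labels is empty, the mapped column can be assigned back just like the `np.where` result.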


In [15]:
pd.get_dummies(Data['Department']).astype('int64')

,Human Resources,Research & Development,Sales
0,0,0,1
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0
...,...,...,...
1465,0,1,0
1466,0,1,0
1467,0,1,0
1468,0,0,1


In [16]:
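# Append one 0/1 indicator column per department
Data = pd.concat([Data, pd.get_dummies(Data['Department']).astype('int64')], axis=1)
# Drop the original text column now that it is encoded (columns= already implies axis=1)
Data.drop(columns=['Department'], inplace=True)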
Data.head()

,EmployeeID,Age,Gender,DistanceFromHome,JobLevel,MonthlyIncome,PerformanceRating,JobSatisfaction,Attrition,Human Resources,Research & Development,Sales
0,1001,41,1,1,2,5993,3,4,1,0,0,1
1,1002,49,0,8,2,5130,4,2,0,0,1,0
2,1004,37,0,2,1,2090,3,3,1,0,1,0
3,1005,33,1,3,1,2909,3,3,0,0,1,0
4,1007,27,0,2,1,3468,3,2,0,0,1,0
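
The concat-then-drop steps can also be collapsed into a single `pd.get_dummies` call. The sketch below is one possible way to do that, not the notebook's method; it assumes a hypothetical three-row `sample` frame and passes empty `prefix` and `prefix_sep` values so the new columns keep the bare department names.

```python
import pandas as pd

# Hypothetical sample with one employee per department.
sample = pd.DataFrame({
    "EmployeeID": [1001, 1002, 1004],
    "Department": ["Sales", "Research & Development", "Human Resources"],
})

# columns= encodes the named column and removes the original in one call;
# empty prefix/prefix_sep keep the plain department names as column labels.
encoded = pd.get_dummies(
    sample, columns=["Department"], prefix="", prefix_sep="", dtype="int64"
)
print(encoded.columns.tolist())
```

Without the empty prefix arguments, pandas would name the columns `Department_Sales` and so on, which some readers may find clearer.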


In [17]:
Data.shape

(1470, 12)
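
The column count goes from 10 to 12 because `Department` was removed and three indicator columns were added. A short sanity check, sketched here on a hypothetical sample shaped like the encoded data, confirms that every column is numeric and that each row belongs to exactly one department. The names `non_numeric`, `department_cols`, and `one_department_each` are illustrative.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical sample shaped like the encoded Data frame (subset of columns).
sample = pd.DataFrame({
    "EmployeeID": [1001, 1002, 1004],
    "Gender": [1, 0, 0],
    "Attrition": [1, 0, 1],
    "Human Resources": [0, 0, 0],
    "Research & Development": [0, 1, 1],
    "Sales": [1, 0, 0],
})

# Any column listed here would still need encoding before clustering.
non_numeric = [col for col in sample.columns if not is_numeric_dtype(sample[col])]
print(non_numeric)

# Each employee should have exactly one department indicator set to 1.
department_cols = ["Human Resources", "Research & Development", "Sales"]
one_department_each = bool((sample[department_cols].sum(axis=1) == 1).all())
print(one_department_each)
```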