## Self Practice Case Study: PCA, tSNE for dimensionality reduction

- In the LVC, we used PCA and tSNE to visualise the digits dataset in 2D. We created embeddings with both methods and plotted them as scatter plots.


- In this case study, we will work with two different datasets and use PCA and tSNE not for visualisation, but to **reduce the dimensions when the original data has many dimensions, and then use the transformed features for classification**. We will also look at the variance explained by the principal components and generate tSNE embeddings.


- Note: We will later (in the next week) use these embeddings to carry out a classification exercise. 

## Data

#### HR Attrition Analysis
In this dataset, we have employees as data points and different features for them. We have 'Attrition' as the target column. The data dictionary is provided separately.

#### Digits dataset
This dataset comes with sklearn. We use `sklearn.datasets.load_digits` to load it.
The data is stored as a NumPy array. Each row represents one image, and each image has 64 features (an 8x8 image whose pixel values have been flattened into a vector).
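
As a quick check of the layout described above, the following minimal sketch loads the digits data and inspects its shape. The variable names `digits`, `X_digits`, and `y_digits` are illustrative and do not appear elsewhere in this notebook.

```python
from sklearn.datasets import load_digits

# Load the digits dataset bundled with scikit-learn
digits = load_digits()
X_digits = digits.data      # flattened pixel values, one row per image
y_digits = digits.target    # digit label for each image

print(X_digits.shape)        # (number of images, 64)
print(digits.images.shape)   # the same images kept as 8x8 arrays

# Each row of `data` is the corresponding 8x8 image flattened row by row
assert (digits.images[0].ravel() == X_digits[0]).all()
```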


Let us start by importing some basic packages.

## Accessing data and preprocessing

In [None]:
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

Let us load the HR attrition dataset now.

In [None]:
from google.colab import files
uploaded = files.upload()

Saving HR_Employee_Attrition_Dataset.xlsx to HR_Employee_Attrition_Dataset.xlsx


In [None]:
import io
df = pd.read_excel(io.BytesIO(uploaded['HR_Employee_Attrition_Dataset.xlsx']))

In [None]:
# displaying the first 3 rows
df.head(3)

Unnamed: 0,EmployeeNumber,Attrition,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,1,Yes,41,Travel_Rarely,1102,Sales,1,2,Life Sciences,2,Female,94,3,2,Sales Executive,4,Single,5993,19479,8,Y,Yes,11,3,1,80,0,8,0,1,6,4,0,5
1,2,No,49,Travel_Frequently,279,Research & Development,8,1,Life Sciences,3,Male,61,2,2,Research Scientist,2,Married,5130,24907,1,Y,No,23,4,4,80,1,10,3,3,10,7,1,7
2,3,Yes,37,Travel_Rarely,1373,Research & Development,2,2,Other,4,Male,92,2,1,Laboratory Technician,3,Single,2090,2396,6,Y,Yes,15,3,2,80,0,7,3,3,0,0,0,0


In [None]:
# set the EmployeeNumber as the index
df = df.set_index('EmployeeNumber')
df.columns

Index(['Attrition', 'Age', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField',
       'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement',
       'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18',
       'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

In [None]:
# Having looked at the data, we see that some of the categorical variables are nominal. We should encode these variables.
# The modelling algorithms require numerical data, so we convert the categories to numbers by encoding.

to_get_dummies_for = ['BusinessTravel', 'Department','Education', 'EducationField','EnvironmentSatisfaction', 'Gender',  'JobInvolvement','JobLevel', 'JobRole', 'MaritalStatus' ]

df = pd.get_dummies(data = df, columns= to_get_dummies_for, drop_first= True)       
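
To see what `drop_first=True` does, here is a minimal sketch on a made-up frame (the name `toy` and its values are assumptions for illustration only). A column with k categories becomes k-1 indicator columns, and the dropped first category is implied when all indicators are zero.

```python
import pandas as pd

# Toy frame with one nominal column (illustrative values only)
toy = pd.DataFrame({
    "Department": ["Sales", "HR", "R&D", "Sales"],
    "Age": [41, 49, 37, 33],
})

# Three categories (HR, R&D, Sales) produce two indicator columns;
# the first category in sorted order (HR) is dropped
encoded = pd.get_dummies(data=toy, columns=["Department"], drop_first=True)
print(encoded.columns.tolist())
```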

In [None]:
df.shape

(2940, 57)

After encoding, the data has 57 columns: 56 features plus the target column 'Attrition'.
Let us now treat the remaining non-numeric variables.

In [None]:
df.head(3)

EmployeeNumber,Attrition,Age,DailyRate,DistanceFromHome,HourlyRate,JobSatisfaction,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,BusinessTravel_Travel_Frequently,BusinessTravel_Travel_Rarely,Department_Research & Development,Department_Sales,Education_2,Education_3,Education_4,Education_5,EducationField_Life Sciences,EducationField_Marketing,EducationField_Medical,EducationField_Other,EducationField_Technical Degree,EnvironmentSatisfaction_2,EnvironmentSatisfaction_3,EnvironmentSatisfaction_4,Gender_Male,JobInvolvement_2,JobInvolvement_3,JobInvolvement_4,JobLevel_2,JobLevel_3,JobLevel_4,JobLevel_5,JobRole_Human Resources,JobRole_Laboratory Technician,JobRole_Manager,JobRole_Manufacturing Director,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Married,MaritalStatus_Single
1,Yes,41,1102,1,94,4,5993,19479,8,Y,Yes,11,3,1,80,0,8,0,1,6,4,0,5,0,1,0,1,1,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1
2,No,49,279,8,61,2,5130,24907,1,Y,No,23,4,4,80,1,10,3,3,10,7,1,7,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,1,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0
3,Yes,37,1373,2,92,3,2090,2396,6,Y,Yes,15,3,2,80,0,7,3,3,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,1,0,0,0,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1


In [None]:
dict_over18 = {'Y': 1, 'N':0}
dict_OverTime = {'Yes': 1, 'No':0}
dict_attrition = {'Yes': 1, 'No': 0}

df['OverTime'] = df.OverTime.map(dict_OverTime)
df['Over18'] = df.Over18.map(dict_over18)
df['Attrition'] = df.Attrition.map(dict_attrition)

Y_HR = df.Attrition
X_HR = df.drop(columns = ['Attrition'])
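
`Series.map` with a dictionary returns `NaN` for any value that is missing from the dictionary, so it can be worth counting missing values after a mapping like the one above. A minimal sketch on made-up data (the series `s` is an assumption, not taken from the HR dataset):

```python
import pandas as pd

# Made-up values; note the lowercase 'yes', which the dictionary does not cover
s = pd.Series(["Yes", "No", "yes", "No"])
mapped = s.map({"Yes": 1, "No": 0})

# Count the entries that the dictionary failed to map
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

The same check could be applied to `df['OverTime']`, `df['Over18']`, and `df['Attrition']` after mapping.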

In [None]:
df.head(3)

EmployeeNumber,Attrition,Age,DailyRate,DistanceFromHome,HourlyRate,JobSatisfaction,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,BusinessTravel_Travel_Frequently,BusinessTravel_Travel_Rarely,Department_Research & Development,Department_Sales,Education_2,Education_3,Education_4,Education_5,EducationField_Life Sciences,EducationField_Marketing,EducationField_Medical,EducationField_Other,EducationField_Technical Degree,EnvironmentSatisfaction_2,EnvironmentSatisfaction_3,EnvironmentSatisfaction_4,Gender_Male,JobInvolvement_2,JobInvolvement_3,JobInvolvement_4,JobLevel_2,JobLevel_3,JobLevel_4,JobLevel_5,JobRole_Human Resources,JobRole_Laboratory Technician,JobRole_Manager,JobRole_Manufacturing Director,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Married,MaritalStatus_Single
1,1,41,1102,1,94,4,5993,19479,8,1,1,11,3,1,80,0,8,0,1,6,4,0,5,0,1,0,1,1,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1
2,0,49,279,8,61,2,5130,24907,1,1,0,23,4,4,80,1,10,3,3,10,7,1,7,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,1,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0
3,1,37,1373,2,92,3,2090,2396,6,1,1,15,3,2,80,0,7,3,3,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,1,0,0,0,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1


In [None]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Attrition,2940.0,0.161224,0.3678,0.0,0.0,0.0,0.0,1.0
Age,2940.0,36.92381,9.133819,18.0,30.0,36.0,43.0,60.0
DailyRate,2940.0,802.485714,403.440447,102.0,465.0,802.0,1157.0,1499.0
DistanceFromHome,2940.0,9.192517,8.105485,1.0,2.0,7.0,14.0,29.0
HourlyRate,2940.0,65.891156,20.325969,30.0,48.0,66.0,84.0,100.0
JobSatisfaction,2940.0,2.728571,1.102658,1.0,2.0,3.0,4.0,4.0
MonthlyIncome,2940.0,6502.931293,4707.15577,1009.0,2911.0,4919.0,8380.0,19999.0
MonthlyRate,2940.0,14313.103401,7116.575021,2094.0,8045.0,14235.5,20462.0,26999.0
NumCompaniesWorked,2940.0,2.693197,2.497584,0.0,1.0,2.0,4.0,9.0
Over18,2940.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
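
The summary above shows `Over18` with a standard deviation of 0, meaning every row holds the same value. A column like this contributes no variance, so PCA gains nothing from it. One way to list such columns is sketched below on a made-up frame (the name `toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Made-up frame with two constant columns and one varying column
toy = pd.DataFrame({
    "Over18": [1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80],
    "Age": [41, 49, 37, 33],
})

# Columns with a single distinct value carry no variance
constant_cols = toy.columns[toy.nunique() == 1].tolist()
print(constant_cols)
```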


Now that we have treated the categorical columns, let us look at the shape of the data.

In [None]:
df.shape

(2940, 57)

So, we have 57 columns, of which 56 are features and one is the target 'Attrition'. Let us now try to reduce the dimensions.
First, let's start with PCA.

## PCA: HR Attrition

In [None]:
X_HR.head(3)

EmployeeNumber,Age,DailyRate,DistanceFromHome,HourlyRate,JobSatisfaction,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,BusinessTravel_Travel_Frequently,BusinessTravel_Travel_Rarely,Department_Research & Development,Department_Sales,Education_2,Education_3,Education_4,Education_5,EducationField_Life Sciences,EducationField_Marketing,EducationField_Medical,EducationField_Other,EducationField_Technical Degree,EnvironmentSatisfaction_2,EnvironmentSatisfaction_3,EnvironmentSatisfaction_4,Gender_Male,JobInvolvement_2,JobInvolvement_3,JobInvolvement_4,JobLevel_2,JobLevel_3,JobLevel_4,JobLevel_5,JobRole_Human Resources,JobRole_Laboratory Technician,JobRole_Manager,JobRole_Manufacturing Director,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Married,MaritalStatus_Single
1,41,1102,1,94,4,5993,19479,8,1,1,11,3,1,80,0,8,0,1,6,4,0,5,0,1,0,1,1,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1
2,49,279,8,61,2,5130,24907,1,1,0,23,4,4,80,1,10,3,3,10,7,1,7,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,1,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0
3,37,1373,2,92,3,2090,2396,6,1,1,15,3,2,80,0,7,3,3,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,1,0,0,0,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1


In [None]:
# let us get the data to the same scale first
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
X_HR_scaled = mms.fit_transform(X_HR)
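
To make the scaling step concrete, this minimal sketch applies `MinMaxScaler` to a small made-up array and compares one column with the formula (x - min) / (max - min). The array `X_toy` is an assumption for illustration. It also shows that a constant column is mapped to 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values: one varying column and one constant column
X_toy = np.array([[18.0, 80.0],
                  [39.0, 80.0],
                  [60.0, 80.0]])

scaled = MinMaxScaler().fit_transform(X_toy)

# Manual version of the same formula for the first column
col = X_toy[:, 0]
manual = (col - col.min()) / (col.max() - col.min())

print(scaled)
print(np.allclose(scaled[:, 0], manual))
```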


In [None]:
# Now that we have scaled the data, let us now go ahead and fit the PCA. Let us try to achieve 95% variance explained with the least number of principal components possible.

from sklearn.decomposition import PCA

Before generating the PCA embeddings, let us understand how to find the optimum number of principal components.


In [None]:
# To figure out the number of components that explain more than 95% of variance, let us first generate PCs for the original number of dimensions.
n = X_HR.shape[1]
pca = PCA(n_components = n)
pca.fit(X_HR_scaled)

# let us now find variance explained by all these principal components.

exp_var_HR = pca.explained_variance_ratio_

In [None]:
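# find the least number of components that can explain more than 95% variance
# (cum_var is used instead of `sum` to avoid shadowing Python's built-in sum)
cum_var = 0
for ix, var_ratio in enumerate(exp_var_HR):
  cum_var = cum_var + var_ratio
  if cum_var > 0.95:
    print("Number of PCs that explain at least 95% variance: ", ix+1)
    break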

Number of PCs that explain at least 95% variance:  36
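
The same count can be obtained without an explicit loop. The sketch below uses synthetic data (the array `X_toy` is an assumption, not the HR data) and compares a cumulative-sum approach with passing a float to `n_components`, which asks scikit-learn's `PCA` to choose the number of components for that share of variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 10 features with unequal spread
X_toy = rng.normal(size=(200, 10)) * np.arange(1, 11)

# Fit all components and accumulate the explained variance ratios
pca_full = PCA(svd_solver="full").fit(X_toy)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# First position where the cumulative ratio exceeds 0.95, expressed as a count
k_manual = int(np.argmax(cum_var > 0.95)) + 1

# A float in (0, 1) lets PCA pick the number of components itself
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_toy)

print(k_manual, pca_95.n_components_)
```

On the HR data, the equivalent call would be `PCA(n_components=0.95).fit(X_HR_scaled)`, which should agree with the loop above.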


### Comments
- Out of the 56 original features, we reduced the number of features to 36 principal components, which together explain 95% of the original variance.

- That is about a 36% reduction in dimensionality ((56 - 36) / 56) at the cost of roughly 5% of the variance.

- Let us now look at these principal components.

In [None]:
X_HR_transformed_pca = pca.transform(X_HR_scaled)[:, 0:36]
X_HR_transformed_pca.shape

(2940, 36)
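
Keeping the first 36 columns of the full transform should give the same result as fitting PCA with `n_components = 36` directly, because the components are ordered by explained variance. A minimal sketch on synthetic data (`X_toy` and `k` are assumptions for illustration, and `svd_solver="full"` is set explicitly so both fits use the same solver):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10)) * np.arange(1, 11)
k = 4

# Fit all components, then keep the first k columns
sliced = PCA(svd_solver="full").fit(X_toy).transform(X_toy)[:, :k]

# Fit with n_components=k directly
direct = PCA(n_components=k, svd_solver="full").fit_transform(X_toy)

# Compare the two embeddings up to floating-point tolerance
same = np.allclose(sliced, direct)
print(same)
```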

In [None]:
X_HR_transformed_pca

array([[ 1.31122257,  0.80949068, -0.48008978, ...,  0.41985676,
         0.25405038, -0.20750442],
       [-0.4787661 , -0.44882672,  0.44642704, ..., -0.12559866,
         0.232826  ,  0.15420296],
       [-0.68466273,  0.74778442,  0.94197334, ...,  0.02329229,
         0.12641541,  0.22286996],
       ...,
       [-0.3890975 , -0.6046511 ,  0.10290065, ..., -0.08852991,
        -0.09914871, -0.33291044],
       [ 1.13259716, -0.24280923,  0.94187371, ...,  0.25862045,
         0.20479414, -0.05048027],
       [-0.50756099, -0.67427016,  0.55213963, ..., -0.35079405,
        -0.19035914, -0.10034733]])

## tSNE: HR Attrition

Let us now generate 3-dimensional tSNE embeddings, which we will use later for classification.

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components = 3)

X_HR_transformed_tsne = tsne.fit_transform(X_HR_scaled)

In [None]:
X_HR_transformed_tsne.shape

(2940, 3)
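
tSNE starts from a random initialisation, so repeated runs can give different embeddings. The sketch below, on synthetic data (`X_toy` is an assumption standing in for the scaled features), fixes `random_state` to make runs repeatable on the same setup. It also checks that `TSNE` provides `fit_transform` but no separate `transform` method, which means new rows cannot be projected into an existing embedding.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_toy = rng.random(size=(100, 10))   # synthetic stand-in for scaled features

# random_state fixes the random initialisation
tsne_toy = TSNE(n_components=3, random_state=0)
emb = tsne_toy.fit_transform(X_toy)

print(emb.shape)
# TSNE has fit_transform but no transform for unseen rows
print(hasattr(tsne_toy, "transform"))
```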

### Comments
- We have generated the 3D embeddings for the data. We saw in the LVC that the 2D embeddings from tSNE captured the data nicely (better than PCA with 2 components). If the tSNE embeddings here capture the original data in 3 dimensions better than PCA does with 36 components, we can expect them to work better on a classification task. We will check this later in the classification week.
