                                        ~ BHARATH BOMMEESHWAR K ~

### <U>OBJECTIVE:
* To analyse and identify the important factors that increase the churn rate in a company, and to make suggestions for keeping attrition at an optimal level.
* To build a machine learning model that predicts the chance of attrition for an employee.

### <U>FRAMEWORK
   - Data acquisition
   - Exploratory Data Analysis
   - Data pre-processing 
   - Building machine learning models and Validating
   - Model testing
   - Conclusion  

### <font color= BLUE> <b> <u>1. DATA ACQUISITION

####  IMPORTING LIBRARIES AND MODULES

In [3]:
#datastructures and tools
import pandas as pd

#matrices and arrays
import numpy as np

#visualizations
import seaborn as sns
import matplotlib.pyplot as plt

#to hide harmless warnings
import warnings
warnings.filterwarnings('ignore')

#to work with time computation
import time

#Resampling 
from imblearn import under_sampling, over_sampling
from imblearn.over_sampling import SMOTE
from collections import Counter

#scientific computation
from scipy import stats
import statsmodels.formula.api as sm

# scale the data
from sklearn.preprocessing import StandardScaler
# split the data
from sklearn.model_selection import train_test_split
# cross validation - grid search
from sklearn.model_selection import GridSearchCV
#linear regression model
from sklearn.linear_model import LinearRegression
#k-nearest neighbours model
from sklearn.neighbors import KNeighborsClassifier
#decision tree model
from sklearn.tree import DecisionTreeClassifier
#random forest
from sklearn.ensemble import RandomForestClassifier
#adaboost classifier
from sklearn.ensemble import AdaBoostClassifier
#gradient boosting
from sklearn.ensemble import GradientBoostingClassifier
#Bagging model
from sklearn.ensemble import BaggingClassifier
#Logistic regression model
from sklearn.linear_model import LogisticRegression


#for cross validation
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score

#measure accuracy
from sklearn import metrics
#to get complete classification report
from sklearn.metrics import classification_report
#to fill missing values in dataset
from sklearn.impute import KNNImputer

#to check accuracy 
from sklearn import metrics as mat

#### READING THE FILE

In [20]:
#Importing the Train data set
raw_data=pd.read_csv("Acoustic Features.csv")
raw_data.head()

Unnamed: 0,Class,_RMSenergy_Mean,_Lowenergy_Mean,_Fluctuation_Mean,_Tempo_Mean,_MFCC_Mean_1,_MFCC_Mean_2,_MFCC_Mean_3,_MFCC_Mean_4,_MFCC_Mean_5,...,_Chromagram_Mean_9,_Chromagram_Mean_10,_Chromagram_Mean_11,_Chromagram_Mean_12,_HarmonicChangeDetectionFunction_Mean,_HarmonicChangeDetectionFunction_Std,_HarmonicChangeDetectionFunction_Slope,_HarmonicChangeDetectionFunction_PeriodFreq,_HarmonicChangeDetectionFunction_PeriodAmp,_HarmonicChangeDetectionFunction_PeriodEntropy
0,relax,0.052,0.591,9.136,130.043,3.997,0.363,0.887,0.078,0.221,...,0.426,1.0,0.008,0.101,0.316,0.261,0.018,1.035,0.593,0.97
1,relax,0.125,0.439,6.68,142.24,4.058,0.516,0.785,0.397,0.556,...,0.002,1.0,0.0,0.984,0.285,0.211,-0.082,3.364,0.702,0.967
2,relax,0.046,0.639,10.578,188.154,2.775,0.903,0.502,0.329,0.287,...,0.184,0.746,0.016,1.0,0.413,0.299,0.134,1.682,0.692,0.963
3,relax,0.135,0.603,10.442,65.991,2.841,1.552,0.612,0.351,0.011,...,0.038,1.0,0.161,0.757,0.422,0.265,0.042,0.354,0.743,0.968
4,relax,0.066,0.591,9.769,88.89,3.217,0.228,0.814,0.096,0.434,...,0.004,0.404,1.0,0.001,0.345,0.261,0.089,0.748,0.674,0.957


In [21]:
#making copy of the data set
att = raw_data.copy()
att.head()

Unnamed: 0,Class,_RMSenergy_Mean,_Lowenergy_Mean,_Fluctuation_Mean,_Tempo_Mean,_MFCC_Mean_1,_MFCC_Mean_2,_MFCC_Mean_3,_MFCC_Mean_4,_MFCC_Mean_5,...,_Chromagram_Mean_9,_Chromagram_Mean_10,_Chromagram_Mean_11,_Chromagram_Mean_12,_HarmonicChangeDetectionFunction_Mean,_HarmonicChangeDetectionFunction_Std,_HarmonicChangeDetectionFunction_Slope,_HarmonicChangeDetectionFunction_PeriodFreq,_HarmonicChangeDetectionFunction_PeriodAmp,_HarmonicChangeDetectionFunction_PeriodEntropy
0,relax,0.052,0.591,9.136,130.043,3.997,0.363,0.887,0.078,0.221,...,0.426,1.0,0.008,0.101,0.316,0.261,0.018,1.035,0.593,0.97
1,relax,0.125,0.439,6.68,142.24,4.058,0.516,0.785,0.397,0.556,...,0.002,1.0,0.0,0.984,0.285,0.211,-0.082,3.364,0.702,0.967
2,relax,0.046,0.639,10.578,188.154,2.775,0.903,0.502,0.329,0.287,...,0.184,0.746,0.016,1.0,0.413,0.299,0.134,1.682,0.692,0.963
3,relax,0.135,0.603,10.442,65.991,2.841,1.552,0.612,0.351,0.011,...,0.038,1.0,0.161,0.757,0.422,0.265,0.042,0.354,0.743,0.968
4,relax,0.066,0.591,9.769,88.89,3.217,0.228,0.814,0.096,0.434,...,0.004,0.404,1.0,0.001,0.345,0.261,0.089,0.748,0.674,0.957
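Working on a copy keeps the raw data untouched while the analysis frame is modified. The sketch below illustrates the difference between a plain assignment (an alias to the same object) and `DataFrame.copy()`; the toy frame and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate copy semantics
raw = pd.DataFrame({"Class": ["relax", "happy"], "tempo": [130.0, 142.2]})

alias = raw            # plain assignment: both names refer to the same object
snapshot = raw.copy()  # independent copy (deep copy by default)

snapshot.loc[0, "tempo"] = 0.0  # changes only the copy
alias.loc[1, "tempo"] = -1.0    # changes raw as well, because alias is raw

print(raw)
print(snapshot)
```

After running this, `raw` still holds its original first row, while the edit made through `alias` is visible in `raw`.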


#### INFORMATION ABOUT THE DATASET

In [22]:
att.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 51 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   Class                                           400 non-null    object 
 1   _RMSenergy_Mean                                 400 non-null    float64
 2   _Lowenergy_Mean                                 400 non-null    float64
 3   _Fluctuation_Mean                               400 non-null    float64
 4   _Tempo_Mean                                     400 non-null    float64
 5   _MFCC_Mean_1                                    400 non-null    float64
 6   _MFCC_Mean_2                                    400 non-null    float64
 7   _MFCC_Mean_3                                    400 non-null    float64
 8   _MFCC_Mean_4                                    400 non-null    float64
 9   _MFCC_Mean_5                               

###  Target variable :
    ----------------------------

    * Attrition 

### Predictor variables:
    
    Categorical variables
    -----------------------------
    * Business travel 
    * Department              
    * Education field           
    * Marital status 
    * Gender        
    * Over time
    
    Discrete Numerical
    -----------------------------
    * Environment satisfaction 
    * Job involvement      
    * Job level                
    * Job satisfaction          
    * Work life balance
    


    Numerical variables
    -----------------------------
    * Monthly Income
    * Total working years
    * Years at company
    * Years in current role
    * Years since last promotion
    * Years with Current Manager
    * Distance from home
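A grouping like the one above can be approximated programmatically. The following is a minimal sketch on a hypothetical toy frame: non-numeric columns are treated as categorical, and numeric columns with only a few distinct values are treated as discrete. The toy values and the `max_levels` cut-off are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "No", "Yes", "No", "No"],
    "Department": ["Sales", "Sales", "Human Resources",
                   "Sales", "Sales", "Human Resources"],
    "JobLevel": [1, 2, 3, 1, 2, 5],
    "MonthlyIncome": [2911, 4919, 8379, 1009, 19999, 6502],
})

max_levels = 5  # assumed cut-off between discrete and continuous columns

# Non-numeric columns are treated as categorical
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Numeric columns are split by their number of distinct values
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
discrete_cols = [c for c in numeric_cols if toy[c].nunique() <= max_levels]
continuous_cols = [c for c in numeric_cols if c not in discrete_cols]

print("Categorical:", categorical_cols)
print("Discrete numerical:", discrete_cols)
print("Numerical:", continuous_cols)
```

A count-based cut-off is only a heuristic; ordinal ratings with many levels would still need to be assigned by hand.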


### <font color= BLUE> <b> <u>2. EXPLORATORY DATA ANALYSIS

#### Finding null values and duplicates

In [12]:
att.isnull().sum()

Age                        0
Attrition                  0
BusinessTravel             0
Department                 0
EducationField             0
EnvironmentSatisfaction    0
Gender                     0
JobInvolvement             0
JobLevel                   0
JobSatisfaction            0
MaritalStatus              0
MonthlyIncome              0
OverTime                   0
TotalWorkingYears          0
WorkLifeBalance            0
YearsAtCompany             0
YearsInCurrentRole         0
YearsSinceLastPromotion    0
YearsWithCurrManager       0
DistanceFromHome           0
dtype: int64

In [14]:
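# Checking for any duplicate entries
duplicates=att.duplicated()
print("Duplicates count:",duplicates.sum())
#att[duplicates]   # uncomment to inspect the duplicated rows

#Note: if the target is imbalanced, it needs to be balanced by oversampling the minority class (e.g. with SMOTE, which generates synthetic samples rather than exact duplicates)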

Duplicates count: 0
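SMOTE-style oversampling does not add exact duplicate rows: it builds new minority samples by interpolating between a minority point and one of its nearest minority neighbours. The block below is a simplified NumPy sketch of that idea, not the `imblearn` implementation. The function name `smote_like_oversample`, its parameters and the toy points are all hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest neighbours of every sample
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new point
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position on the segment base -> neighbour
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Hypothetical minority-class points with two features
X_minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])
synthetic = smote_like_oversample(X_minority, n_new=6)
print(synthetic.shape)
```

In practice the imported `SMOTE` class from `imblearn.over_sampling` would normally be applied to the training split only, so that synthetic samples do not leak into the test data.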


#### Descriptive statistics

In [15]:
#numerical
att.describe()

Unnamed: 0,Age,Attrition,EnvironmentSatisfaction,JobInvolvement,JobLevel,JobSatisfaction,MonthlyIncome,TotalWorkingYears,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,DistanceFromHome
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,36.92381,0.161224,2.721769,2.729932,2.063946,2.728571,6502.931293,11.279592,2.761224,7.008163,4.229252,2.187755,4.123129,9.192517
std,9.135373,0.367863,1.093082,0.711561,1.10694,1.102846,4707.956783,7.780782,0.706476,6.126525,3.623137,3.22243,3.568136,8.106864
min,18.0,0.0,1.0,1.0,1.0,1.0,1009.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
25%,30.0,0.0,2.0,2.0,1.0,2.0,2911.0,6.0,2.0,3.0,2.0,0.0,2.0,2.0
50%,36.0,0.0,3.0,3.0,2.0,3.0,4919.0,10.0,3.0,5.0,3.0,1.0,3.0,7.0
75%,43.0,0.0,4.0,3.0,3.0,4.0,8379.0,15.0,3.0,9.0,7.0,3.0,7.0,14.0
max,60.0,1.0,4.0,4.0,5.0,4.0,19999.0,40.0,4.0,40.0,18.0,15.0,17.0,29.0


In [18]:
#categorical
att.describe(include="object")

Unnamed: 0,BusinessTravel,Department,EducationField,Gender,MaritalStatus,OverTime
count,1470,1470,1470,1470,1470,1470
unique,3,3,6,2,3,2
top,Travel_Rarely,Research & Development,Life Sciences,Male,Married,No
freq,1043,961,606,882,673,1054


* The initial exploration found no duplicates and no null values in the 1470 observations.
* The target variable has two classes (0 = not attrited, 1 = attrited); its mean of 0.161 shows that about 16% of employees attrited, so the classes are imbalanced.
* Further insights can be found from univariate, bivariate and multivariate analysis.
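The checks summarised above can be collected in one place. The following is a minimal sketch on a hypothetical toy frame (the values are illustrative, not the real data): it counts nulls and duplicates, and reads the class balance from the binary target.

```python
import pandas as pd

# Hypothetical toy frame with a binary target, for illustration only
toy = pd.DataFrame({
    "Attrition": [0, 0, 0, 1, 0],
    "OverTime": ["No", "No", "Yes", "Yes", "No"],
    "MonthlyIncome": [2911, 4919, 8379, 1009, 19999],
})

null_counts = toy.isnull().sum()                # nulls per column
n_duplicates = int(toy.duplicated().sum())      # fully duplicated rows
class_share = toy["Attrition"].value_counts(normalize=True)  # class proportions

print(null_counts)
print("Duplicates count:", n_duplicates)
print(class_share)
```

For a 0/1 target, the share of class 1 equals the column mean, which is why the `mean` row of `describe()` can be read as the attrition rate.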

### UNIVARIATE ANALYSIS