### Import the necessary libraries

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#to scale the data using z-score 
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

#algorithms to use
from sklearn.linear_model import LogisticRegression

#Metrics to evaluate the model
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve

#to ignore warnings
import warnings
warnings.filterwarnings("ignore")


### Read the dataset

In [None]:
#reading the dataset
import pandas as pd
import requests
from io import StringIO  

orig_url="https://drive.google.com/file/d/147Z67u4-bp_ZVlbc18dg6J3h9ORRlCcW/view?usp=sharing"

file_id = orig_url.split('/')[-2]
dwn_url='https://drive.google.com/uc?export=download&id=' + file_id
url = requests.get(dwn_url).text

csv_raw = StringIO(url)
employees = pd.read_csv(csv_raw)


In [None]:
employees.head()


### Printing the info

In [None]:
employees.info()


**Observation:**
- There are 2940 observations and 33 columns.
- All the columns have 2940 non-null values, i.e., there are no missing values in the data.

**Let's check the unique values in each column** 

In [None]:
#checking unique values in each column
employees.nunique()


**Observation:**
- Employee number is an identifier which is unique for each employee and we can drop this column as it would not add any value to our analysis.
- Over18 and StandardHours have only 1 unique value. These columns will not add any value to our model, hence we can drop them.
- On the basis of number of unique values in each column and the data description, we can identify the continuous and categorical columns in the data.

Let's drop the columns mentioned above and define lists for numerical and categorical columns to explore them separately.
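
The check described above can also be done programmatically. The following is a minimal sketch on a hypothetical toy DataFrame (`toy` and its values are made up for illustration, not taken from the employee data): columns with a single unique value carry no information, and columns with as many unique values as rows behave like identifiers.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee table
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 3, 4],
    "Over18": ["Y", "Y", "Y", "Y"],
    "Age": [25, 31, 31, 40],
})

n_unique = toy.nunique()

# Columns with one unique value are constant
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns with one unique value per row look like identifiers
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print("Constant columns:", constant_cols)
print("Identifier-like columns:", id_like_cols)
```

Note that a continuous column in which every value happens to be distinct would also be flagged as identifier-like, so the result should be cross-checked against the data description before dropping anything.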

In [None]:
#dropping the columns 
employees=employees.drop(['EmployeeNumber','Over18','StandardHours'],axis=1)


In [None]:
#Creating numerical columns
num_cols=['DailyRate','Age','DistanceFromHome','MonthlyIncome','MonthlyRate','PercentSalaryHike','TotalWorkingYears',
          'YearsAtCompany','NumCompaniesWorked','HourlyRate',
          'YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager','TrainingTimesLastYear']

#Creating categorical variables 
cat_cols= ['Attrition','OverTime','BusinessTravel', 'Department','Education', 'EducationField','JobSatisfaction','EnvironmentSatisfaction','WorkLifeBalance',
           'StockOptionLevel','Gender', 'PerformanceRating', 'JobInvolvement','JobLevel', 'JobRole', 'MaritalStatus','RelationshipSatisfaction']


### Let's start with univariate analysis of numerical columns

In [None]:
#Checking summary statistics
employees[num_cols].describe().T


- **Average employee age is around 37 years**. It has a high range, from 18 years to 60, indicating good age diversity in the organization.
- **At least 50% of the employees live within a 7 km radius** from the organization. However there are some extreme values, seeing as the maximum value is 29 km.
- **The average monthly income of an employee is USD 6500.** It has a high range of values from 1K-20K, which is to be expected for any organization's income distribution. There is a big difference between the 3rd quartile value (around USD 8400) and the maximum value (nearly USD 20000), showing that the **company's highest earners have a disproportionately large income** in comparison to the rest of the employees. Again, this is fairly common in most organizations.
- **The average salary hike of an employee is around 15%.** At least 50% of employees got a salary hike of 14% or less, with the maximum salary hike being 25%.
- The average number of years an employee has been with the company is 7.
- **On average, the number of years since an employee got a promotion is 2.18**. The majority of employees have been promoted within the last year.

Let's explore these variables in some more depth by observing their distributions

In [None]:
#creating histograms
employees[num_cols].hist(figsize=(14,14))
plt.show()


**Observations:**

- **The age distribution is close to a normal distribution** with the majority of employees between the ages of 25 and 50.

- **The percentage salary hike is skewed to the right**, which means employees are mostly getting lower percentage salary increases.

- **MonthlyIncome and TotalWorkingYears are skewed to the right**, indicating that the majority of workers are in entry / mid-level positions in the organization.

- **DistanceFromHome also has a right skewed distribution**, meaning most employees live close to work but there are a few that live further away.

- **On average, an employee has worked at 2.5 companies.** Most employees have worked at only 1 company.

- **The YearsAtCompany variable distribution shows a good proportion of workers with 10+ years**, indicating a significant number of loyal employees at the organization. 

- **The YearsInCurrentRole distribution has three peaks at 0, 2, and 7.** There are a few employees that have even stayed in the same role for 15 years and more.

- **The YearsSinceLastPromotion variable distribution indicates that some employees have not received a promotion in 10-15 years and are still working in the organization.** These employees are assumed to be high work-experience employees in upper-management roles, such as co-founders, C-suite employees and the like.

- The distributions of DailyRate, HourlyRate and MonthlyRate appear to be uniform and do not provide much information. It could be that daily rate refers to the income earned per extra day worked while hourly rate could refer to the same concept applied for extra hours worked per day. Since these rates tend to be broadly similar for multiple employees in the same department, that explains the uniform distribution they show.
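
Skewness read off a histogram can be backed up with a number. The sketch below uses synthetic data (the column names and distributions are assumptions chosen only to illustrate the idea) and the pandas `skew()` method: values near 0 suggest a roughly symmetric distribution, while clearly positive values indicate a right skew.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns: one roughly normal, one right-skewed
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=37, scale=9, size=5000),
    "right_skewed": rng.exponential(scale=5, size=5000),
})

# Sample skewness per column
skewness = toy.skew()
print(skewness)
```

On the real data, the same call would be `employees[num_cols].skew()`.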

### Univariate analysis for categorical variables

In [None]:
#Printing the % sub categories of each category
for i in cat_cols:
    print(employees[i].value_counts(normalize=True))
    print('*'*40)


**Observations:**

- **The employee attrition rate is 16%.**
- **Around 28% of the employees are working overtime.** This number appears to be on the higher side, and might indicate a stressed employee work-life.
- 71% of the employees have traveled rarely, while around 19% have to travel frequently.
- Around 73% of the employees come from an educational background in the Life Sciences and Medical fields. 
- Over 65% of employees work in the Research & Development department of the organization.
- **Nearly 40% of the employees have low (1) or medium-low (2) job satisfaction** and environment satisfaction in the organization, indicating that the morale of the company appears to be somewhat low.
- **Over 30% of the employees show low (1) to medium-low (2) job involvement.** 
- Over 80% of the employees have either no stock options or very few.
- **In terms of performance ratings, none of the employees have been rated lower than 3 (excellent).** About 85% of employees have a performance rating equal to 3 (excellent), while the remaining have a rating of 4 (outstanding). This could mean that the majority of employees are top performers or, more likely, that the organization is highly lenient in its performance appraisal process.

### Bivariate and Multivariate analysis

**We have analyzed different categorical and numerical variables.** 

**Let's now check how the attrition rate is related to the other categorical variables**

In [None]:
for i in cat_cols:
    if i!='Attrition':
        (pd.crosstab(employees[i],employees['Attrition'],normalize='index')*100).plot(kind='bar',figsize=(8,4),stacked=True)
        plt.ylabel('Percentage Attrition %')


**Observations:**
    
- **Employees working overtime have more than a 30% chance of attrition**, which is very high compared to the 10% chance of attrition for employees who do not work extra hours.
- As seen earlier, the majority of employees work for the R&D department. The chance of attrition there is ~15%.
- **Employees working as sales representatives have an attrition rate of around 40%** while HRs and Technicians have an attrition rate of around 25%. The sales and HR departments have higher attrition rates in comparison to an academic department like Research & Development, an observation that makes intuitive sense keeping in mind the differences in those job profiles. The high-pressure and incentive-based nature of Sales and Marketing roles may be contributing to their higher attrition rates.
- **The lower the employee's job involvement, the higher their attrition chances appear to be, with 1-rated JobInvolvement employees attriting at 35%.** The reason for this could be that employees with lower job involvement might feel left out or less valued and have already started to explore new options, leading to a higher attrition rate.
- **Employees at a lower job level also attrite more,** with 1-rated JobLevel employees showing a nearly 25% chance of attrition. These may be young employees who tend to explore more options in the initial stages of their careers. 
- **A low work-life balance rating is associated with higher attrition**; 30% of those in the 1-rated category show attrition.
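
The percentages quoted above come from row-normalized cross-tabulations. As a minimal sketch on made-up data (the ten rows below are hypothetical), `normalize="index"` makes each row sum to 1, so each cell is the share of that category's employees who did or did not attrite:

```python
import pandas as pd

# Hypothetical toy data: 4 employees work overtime, 6 do not
toy = pd.DataFrame({
    "OverTime": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"],
    "Attrition": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No"],
})

# Row-wise percentages: within each OverTime group, what share attrited?
rates = pd.crosstab(toy["OverTime"], toy["Attrition"], normalize="index") * 100
print(rates.round(1))
```

Reading the table row by row gives the attrition rate per category, which is what the stacked bar charts display.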

**Let's check the relationship between attrition and Numerical variables**

In [None]:
#Mean of numerical variables grouped by attrition
employees.groupby(['Attrition'])[num_cols].mean()


**Observations:**
- **Employees leaving the company have a nearly 30% lower average income and 30% less work experience than those who stay.** These could be employees looking to explore new options and/or increase their salary by switching companies.
- **Employees showing attrition also tend to live 16% further from the office than those who stay.** The longer commute to and from work could mean they have to spend more time/money every day, and this could be leading to job dissatisfaction and a desire to leave the organization.
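
Percentage gaps like the ones above can be derived directly from the grouped means. This is a small sketch on hypothetical numbers (the four rows are invented for illustration), expressing the difference between the two groups relative to the group that stays:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Attrition": ["Yes", "Yes", "No", "No"],
    "MonthlyIncome": [4000, 5000, 6000, 7000],
    "DistanceFromHome": [10, 12, 8, 10],
})

means = toy.groupby("Attrition")[["MonthlyIncome", "DistanceFromHome"]].mean()

# Relative difference of the attrition group versus the non-attrition group, in percent
pct_diff = (means.loc["Yes"] - means.loc["No"]) / means.loc["No"] * 100
print(pct_diff.round(2))
```

A negative value means the attrition group has a lower average for that variable, a positive value means a higher one.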

**We have found out what kind of employees are leaving the company more.**

### Let's check the relationship between different numerical variables

In [None]:
#plotting the correlation between numerical variables
plt.figure(figsize=(15,8))
sns.heatmap(employees[num_cols].corr(),annot=True, fmt='0.2f', cmap='YlGnBu')


**Observations:**

- **Total work experience, monthly income, years at company and years with current manager are highly correlated with each other and with employee age** which is easy to understand as these variables show an increase with age for most employees. 
- Years at company and years in current role are correlated with years since last promotion, which may indicate that the company is not giving promotions at the right time.
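
Strongly correlated pairs can also be listed instead of read off the heatmap. The sketch below runs on synthetic data, and the 0.7 cutoff is an arbitrary assumption rather than a value taken from this analysis:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
age = rng.integers(18, 60, size=500)

# Synthetic data: working years grow with age, daily rate is unrelated
toy = pd.DataFrame({
    "Age": age,
    "TotalWorkingYears": age - 18 + rng.normal(0, 2, size=500),
    "DailyRate": rng.integers(100, 1500, size=500),
})

corr = toy.corr()
cols = corr.columns

# Visit each pair once (upper triangle) and keep those above the cutoff
strong_pairs = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > 0.7
]
print(strong_pairs)
```

Keep in mind that a high correlation between two variables describes a linear association only and does not by itself establish a cause.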

**Now that we have explored our data, let's build the model.**

## Model Building - Approach
1. Prepare data for modeling
2. Partition the data into train and test sets.
3. Build the model on the train data.
4. Tune the model if required.
5. Test the model on the test set.

###  Preparing data for modeling

**Creating dummy variables for categorical Variables**

In [None]:
#creating list of dummy columns
to_get_dummies_for = ['BusinessTravel', 'Department','Education', 'EducationField','EnvironmentSatisfaction', 'Gender',  'JobInvolvement','JobLevel', 'JobRole', 'MaritalStatus' ]

#creating dummy variables
employees = pd.get_dummies(data = employees, columns= to_get_dummies_for, drop_first= True)      

#mapping overtime and attrition
dict_OverTime = {'Yes': 1, 'No':0}
dict_attrition = {'Yes': 1, 'No': 0}


employees['OverTime'] = employees.OverTime.map(dict_OverTime)
employees['Attrition'] = employees.Attrition.map(dict_attrition)
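
To see what `drop_first=True` does, here is a minimal sketch on a hypothetical single-column DataFrame. With three categories, only two dummy columns are created; the dropped category (the first in sorted order) becomes the baseline, represented by zeros in both remaining columns.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({
    "BusinessTravel": ["Non-Travel", "Travel_Rarely", "Travel_Frequently", "Travel_Rarely"]
})

encoded = pd.get_dummies(toy, columns=["BusinessTravel"], drop_first=True)

# 'Non-Travel' sorts first, so it is dropped and acts as the reference category
print(encoded.columns.tolist())
print(encoded)
```

Dropping one level avoids perfectly collinear dummy columns, and it means each remaining coefficient in a linear model is interpreted relative to the baseline category.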


**Separating the independent variables (X) and the dependent variable (Y)**

In [None]:
#Separating target variable and other variables
Y= employees.Attrition
X= employees.drop(columns = ['Attrition'])


### Scaling the data

The independent variables in this dataset have different scales. When features have differing scales from each other, there is a chance that a higher weightage will be given to features which have a higher magnitude, and they will dominate over other features whose magnitude changes may be smaller but whose percentage changes may be just as significant or even larger. This will impact the performance of our machine learning algorithm, and we do not want our algorithm to be biased towards one feature. 

The solution to this issue is **Feature Scaling**, i.e. scaling the dataset so as to give every transformed variable a comparable scale.

In this problem, we will use the **Standard Scaler** method, which centers and scales the dataset using the Z-Score.

It standardizes features by subtracting the mean and scaling it to have unit variance.

The standard score of a sample x is calculated as:

**z = (x - u) / s**

where **u** is the mean of the training samples and **s** is the standard deviation of the training samples. After this transformation, each feature has a mean of zero and a standard deviation of one.
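
The formula can be checked by hand with NumPy. This is a small sketch on made-up values; it uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical feature column, e.g. monthly incomes
x = np.array([1000.0, 2000.0, 3000.0, 6000.0])

u = x.mean()   # mean of the samples
s = x.std()    # population standard deviation (ddof=0)

z = (x - u) / s
print(z)
print("mean:", z.mean(), "std:", z.std())
```

The scaled values keep their relative ordering and spacing; only the location and the unit of measurement change.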

In [None]:
#Scaling the data
sc=StandardScaler()
X_scaled=sc.fit_transform(X)
X_scaled=pd.DataFrame(X_scaled, columns=X.columns)


### Problem 1. Splitting the data into 70% train and 30% test set

Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance there could be several times more negative samples than positive samples. In such cases it is recommended to use the **stratified sampling** technique to ensure that relative class frequencies are approximately preserved in each train and validation fold.
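
The effect of stratification can be seen on synthetic labels. The sketch below is an illustration under assumptions: the 16% positive rate mirrors the attrition rate reported earlier, while the 1000 rows and the 75/25 split are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 160 positives, 840 negatives
y = np.array([1] * 160 + [0] * 840)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

# With stratify=y, the positive rate is preserved in both parts
print("overall:", y.mean(), "train:", y_tr.mean(), "test:", y_te.mean())
```

Without `stratify`, the positive rate in the two parts would vary with the random seed, which matters most when the minority class is small.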

**1.a.** Complete the code to split the data into 70% training and 30% testing data. Use `stratify=Y` to preserve relative class frequencies.

In [None]:
#splitting the data
x_train,x_test,y_train,y_test=------(X_scaled,Y,test_size=----,random_state=1,stratify=Y)


### Problem 2. Model evaluation criterion

#### The model can make two types of wrong predictions:
1. Predicting an employee will attrite when the employee doesn't attrite
2. Predicting an employee will not attrite and the employee actually attrites

**2.a. What are the consequences of predicting that the employee will not attrite when the employee actually attrites?**
-  **-------** 

**2.b.** 
**Why would the company want to maximize the Recall of the model?**
- **-------** 
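
As a reminder of how the two error types map onto the metrics, here is a minimal sketch with hypothetical labels (1 = attrites, 0 = stays). Error type 1 above is a false positive and lowers precision; error type 2 is a false negative and lowers recall.

```python
# Hypothetical labels: 1 = attrites, 0 = stays
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # correctly flagged leavers
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # error type 1
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # error type 2

recall = tp / (tp + fn)      # share of actual leavers the model caught
precision = tp / (tp + fp)   # share of flagged employees who actually left

print("recall:", recall, "precision:", round(precision, 3))
```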

Also, let's create a function to calculate and print the classification report and confusion matrix so that we don't have to rewrite the same code repeatedly for each model.

In [None]:
#creating metric function 
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8,5))
    sns.heatmap(cm, annot=True,  fmt='.2f', xticklabels=['Not Attrite', 'Attrite'], yticklabels=['Not Attrite', 'Attrite'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()


### Problem 3. Logistic Regression Model 

- Logistic Regression is a supervised learning algorithm which is used for **binary classification problems** i.e. where the dependent variable is categorical and has only two possible values. In logistic regression, we use the sigmoid function to calculate the probability of an event y, given some features x as:

                                          P(y) = 1 / (1 + exp(-x))
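
The sigmoid function maps any real number to a value between 0 and 1, equal to `1 / (1 + exp(-x))`. In logistic regression, `x` is the linear combination of the features and the model coefficients. A minimal sketch (the helper name `sigmoid` is chosen here for illustration):

```python
import math

def sigmoid(x: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# A score of 0 corresponds to a probability of exactly 0.5;
# large positive scores approach 1, large negative scores approach 0
for score in (-4, -2, 0, 2, 4):
    print(score, round(sigmoid(score), 3))
```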

**3.a.** Instantiate the logistic regression model as `lg`.      


In [None]:
#fitting logistic regression model
# 3.a. Instantiate the model
lg=-----()


**3.b.** Fit the model to the training data. 

In [None]:
# 3.b. fit the model to the training data
lg.----(----,----)



**3.c. Checking model performance (train)**  
 Check the model performance on the training data. Use the `metrics_score` function built above. 

In [None]:
# 3.c. checking the performance on the training data
y_pred_train = lg.----(----)
metrics_score(----, ----)



**3.d. Checking model performance (test)**     
Check the model performance on the testing data. Use the `metrics_score` function built above. 

In [None]:
# 3.d. checking the performance on the test dataset
y_pred_test = lg.----(----)
metrics_score(----, ----)


**3.e. Share 3 to 4 observations based on the metrics generated above.**

**Observations:**
- **-----**
- **-----**
- **-----**


**Let's check the coefficients and find which variables are leading to attrition and which can help to reduce the attrition**

In [None]:
#printing the coefficients of logistic regression
cols=X.columns

coef_lg=lg.coef_

pd.DataFrame(coef_lg,columns=cols).T.sort_values(by=0,ascending=False)


**Observations:**


**Features which have a strong positive effect on the attrition rate are:**
- OverTime	
- BusinessTravel_Travel_Frequently	
- Department_Research & Development	
- JobRole_Sales Executive	
- MaritalStatus_Single	
- Department_Sales	
- NumCompaniesWorked	
- YearsSinceLastPromotion
- JobLevel_5	
- BusinessTravel_Travel_Rarely
- DistanceFromHome
- YearsAtCompany	
- JobRole_Human Resources	
- JobRole_Sales Representative

**Features which have a strong negative effect on the attrition rate are:**
- MonthlyIncome	
- JobInvolvement_3	
- JobLevel_2	
- EnvironmentSatisfaction_4	
- JobInvolvement_4	
- JobInvolvement_2	
- EnvironmentSatisfaction_3	
- EducationField_Life Sciences	
- EnvironmentSatisfaction_2	
- YearsWithCurrManager	
- JobRole_Research Director	
- TotalWorkingYears	
- JobSatisfaction	

**The features identified as important are similar for both the Tree model and the logistic regression model. Notice that we are able to see a bit more detail in the logistic regression results with the + and - contributions.**

The coefficients of the logistic regression model are expressed on the log-odds scale, which is hard to interpret in the real world. We can convert each coefficient into an odds ratio by taking its exponential.

In [None]:
odds = np.exp(lg.coef_[0]) #finding the odds

# adding the odds to a dataframe and sorting the values
pd.DataFrame(odds, x_train.columns, columns=['odds']).sort_values(by='odds', ascending=False) 


### Problem 4. Meaning of Coefficients
EXAMPLE: The odds of attrition for an employee working overtime are **2.6 times** the odds for one who is not, probably because working overtime is not sustainable for an extended duration for any employee and may lead to burnout and job dissatisfaction.
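
An odds ratio is not the same as a ratio of probabilities. The worked sketch below takes the 2.6 figure from the example above and pairs it with a hypothetical baseline attrition probability of 10% (an assumption for illustration, not a model output) to show how odds convert back to a probability.

```python
import math

odds_ratio = 2.6        # from the OverTime example above
baseline_prob = 0.10    # hypothetical baseline probability of attrition

# The corresponding coefficient on the log-odds scale
coefficient = math.log(odds_ratio)

# Probability -> odds -> scaled odds -> probability
baseline_odds = baseline_prob / (1 - baseline_prob)
new_odds = baseline_odds * odds_ratio
new_prob = new_odds / (1 + new_odds)

print("coefficient:", round(coefficient, 3))
print("baseline odds:", round(baseline_odds, 3))
print("new probability:", round(new_prob, 3))
```

Under this assumed baseline, multiplying the odds by 2.6 raises the probability from 10% to roughly 22%, not to 26%.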
     
**4.a.** What is the impact of frequent travel on the odds that an employee will attrite? 
   
- **-----**    

**4.b.** What is the impact of marital status on the odds that an employee will attrite? 
   
- **-----**   

**Precision-Recall Curve for logistic regression**

In [None]:
y_scores_lg=lg.predict_proba(x_train) #predict_proba gives the probability of each observation belonging to each class


precisions_lg, recalls_lg, thresholds_lg = precision_recall_curve(y_train, y_scores_lg[:,1])

#Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_lg, precisions_lg[:-1], 'b--', label='precision')
plt.plot(thresholds_lg, recalls_lg[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()


**Observation:**
- We can see that precision and recall are balanced at a threshold of about **0.35**.
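
The crossing point can also be located numerically rather than by eye. The following sketch uses synthetic labels and scores (both are assumptions made up for the demo, so the resulting threshold will not match the notebook's value) and picks the threshold where precision and recall are closest.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(42)

# Synthetic labels (about 16% positives) and hypothetical scores
# in which positives tend to score higher than negatives
y_true = rng.binomial(1, 0.16, size=2000)
scores = np.clip(rng.normal(0.25 + 0.3 * y_true, 0.15), 0, 1)

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# precisions and recalls have one more entry than thresholds, so drop the last
gap = np.abs(precisions[:-1] - recalls[:-1])
best = int(np.argmin(gap))
balanced_threshold = thresholds[best]

print("threshold:", round(float(balanced_threshold), 3))
print("precision:", round(float(precisions[best]), 3),
      "recall:", round(float(recalls[best]), 3))
```

With the notebook's variables, the same idea would apply to `precisions_lg`, `recalls_lg` and `thresholds_lg`. Balancing the two metrics is only one possible criterion; if recall matters more, a lower threshold may be preferable.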

**Let's find out the performance of the model at this threshold**

In [None]:
optimal_threshold1=.35
y_pred_train = lg.predict_proba(x_train)
metrics_score(y_train, y_pred_train[:,1]>optimal_threshold1)


- **The model performance has improved. The recall has increased significantly for class 1.**
- Let's check the performance on the test data.

In [None]:
optimal_threshold1=.35
y_pred_test = lg.predict_proba(x_test)
metrics_score(y_test, y_pred_test[:,1]>optimal_threshold1)


### Problem 5. Comparison with Tree Induction

#### Take a little time to look back at our earlier exercise analyzing the employee data using tree induction. 
**5.a.** What pre-processing step was necessary when preparing the data for a logistic regression that was not necessary for the tree induction model? 
- **-----**
     
**5.b.** How did the performances of the logistic and tree induction models compare?
- **-----**
   
**5.c.** Did the models identify the same features as important in predicting attrition? What similarities and differences did you observe in the feature importances? 
- **-----**
    
**5.d.** What other classification models would be appropriate for this data set? (feel free to do a little research) 
- **-----**

