## Background / Motivation

__What motivated you to work on this problem?__

__Mention any background about the problem, if it is required to understand your analysis later on.__

As college juniors and sophomores, we are all on the verge of entering the job market. In addition, unemployment was high during Covid, so as the job market restabilizes, we thought it would be helpful for employers and hiring managers to see the pre-pandemic factors that motivated attrition, to help organize the post-pandemic world as we slowly return to normal. We found a data set detailing attrition and were immediately intrigued. Attrition is the act of leaving a job, whether due to termination or resignation. This data set records many attributes of different employees as well as whether or not they left their job. Because much of a person's life satisfaction in adulthood is tied to their occupation, this data set presented itself as a relevant option for our analysis.

## Problem statement 

__Describe your problem statement. Articulate your objectives using absolutely no jargon. Interpret the problem as inference and/or prediction.__

We are observing which variables are most associated with whether an employee will stay with a company, which factors a company should focus on to decrease attrition, and which personal and professional factors it should be prepared for because they may increase rates of attrition. Our question is a classification problem, because the stakeholders (employers) are concerned with a categorical variable: whether or not an employee will stay with a company. It is also an inference problem, because employers want to see the association between attrition and the factors that influence an employee's experience so they can improve working conditions. Employers are not trying to predict whether or not an employee will stay; they want to improve working conditions so that more employees will stay.

## Data sources
__What data did you use? Provide details about your data. Include links to data if you are using open-access data.__

https://www.kaggle.com/datasets/thedevastator/employee-attrition-and-factors?page=2

We used a Kaggle data source focused on examining the impact of performance, financials, and job roles on employee retention. This dataset includes parameters such as Age, Gender, Business Travel Frequency, Marital Status, and Education Level. It also includes a varied set of parameters related to the job being performed, such as Job Involvement, Job Level, and Total Working Years. Other aspects such as Percent Salary Hike, Performance Rating, Relationship Satisfaction, Number of Companies Worked before, and Retirement Status were also taken into consideration. This dataset provides insight into various aspects of the modern workforce.

## Stakeholders
__Who cares? If you are successful, what difference will it make to them?__

Our stakeholders are employers, especially HR departments and hiring managers. We are targeting these two groups because HR departments influence working conditions, and hiring managers can make hiring decisions based on personal attributes that may make someone more likely to stay. The inferences we make from our model will relate to actionable items that employers can take.

## Data quality check / cleaning / preparation 

__In a tabular form, show the distribution of values of each variable used in the analysis - for both categorical and continuous variables. Distribution of a categorical variable must include the number of missing values, the number of unique values, the frequency of all its levels. If a categorical variable has too many levels, you may just include the counts of the top 3-5 levels.__

__If the tables in this section take too much space, you may put them in the appendix, and just mention any useful insights you obtained from the data quality check that helped you develop the model or helped you realize the necessary data cleaning / preparation.__

__Were there any potentially incorrect values of variables that required cleaning? If yes, how did you clean them?__ 

__Did you do any data wrangling or data preparation before the data was ready to use for model development? Did you create any new predictors from existing predictors? For example, if you have number of transactions and spend in a credit card dataset, you may create spend per transaction for predicting if a customer pays their credit card bill. Mention the steps at a broad level, you may put minor details in the appendix. Only mention the steps that ended up being useful towards developing your final model(s).__

In [27]:
#| echo: false
# import image module
from IPython.display import Image

Image(url="images/Exploratory_data_analysis2.png", width = 1000, height = 1000)

As explained below, this table told us to resample and to delete the columns that have the same value for every observation. We also later used this table to better understand our singular matrix error.

## Exploratory data analysis

Put the relevant EDA here (visualizations, tables, etc.) that helped you figure out useful predictors for developing the model(s). Only put the EDA that ended up being useful towards developing your final model(s). 

List the insights (as bullet points) you got from EDA that ended up being useful towards developing your final model. 

Again, if there are too many plots / tables, you may put them into appendix, and just mention the insights you got from them.

- After creating dummy variables for attrition (yes and no), we visualized the distribution of people who stayed and left; about 84% of our data was those who stayed. Because of this uneven distribution, we decided to oversample so that the model would not favor the majority class (those who stayed).

In [19]:
#| echo: false
Image(url="images/Exploratory_data_analysis1.png", width = 400, height = 400)

- We checked value counts for all independent variables. `EmployeeCount`, `StandardHours`, and `Over18` each had the same value in every row, so we dropped these columns, as they provided no information for our regression model.


- We created correlation matrices and heatmaps to identify highly correlated variables, in order to get a grasp on possible interaction terms and multicollinearity.


In [20]:
#| echo: false
Image(url="images/Exploratory_data_analysis4.png", width = 500, height = 500)

- There was lots of variation in many of our variables, so in order to visualize transformations, we binned continuous variables with more than 10 unique values to get a better idea of what transformations to apply, then applied those transformations on the original variables prior to forward selection.

In [24]:
#| echo: false
Image(url="images/Exploratory_data_analysis5.png", width = 700)

In [25]:
#| echo: false
Image(url="images/Exploratory_data_analysis6.png", width = 700)
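
The correlation screen described in the bullets above could be sketched roughly as follows. This is only an illustration: the function name `high_correlation_pairs`, the 0.7 cutoff, and the synthetic columns are assumptions, not the values or code used in the project.

```python
import numpy as np

def high_correlation_pairs(X, names, threshold=0.7):
    """Return (name_i, name_j, r) for every column pair with |r| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    # Strongest correlations first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Synthetic stand-in data (NOT the real dataset): two tenure-like columns
# that move together, plus one unrelated column.
rng = np.random.default_rng(0)
years_at_company = rng.integers(0, 20, size=200).astype(float)
years_in_role = 0.8 * years_at_company + rng.normal(0, 1.0, size=200)
distance = rng.integers(1, 30, size=200).astype(float)

X = np.column_stack([years_at_company, years_in_role, distance])
names = ["YearsAtCompany", "YearsInCurrentRole", "DistanceFromHome"]
pairs = high_correlation_pairs(X, names)
```

Pairs flagged this way are candidates both for multicollinearity checks and for interaction terms; the heatmap shows the same information visually.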

## Approach

What kind of a model (linear / logistic / other) did you use? What performance metric(s) did you optimize and why?

Is there anything unorthodox / new in your approach? 

What problems did you anticipate? What problems did you encounter? Did the very first model you tried work? 

Did your problem already have solution(s) (posted on Kaggle or elsewhere). If yes, then how did you build upon those solutions, what did you do differently? Is your model better as compared to those solutions in terms of prediction / inference?

**Important: Mention any code repositories (with citations) or other sources that you used, and specifically what changes you made to them for your project.**

We used a logistic model, and our primary goal was to optimize recall. As training new workers is slow and expensive, it is costly for the companies when workers leave their jobs. Therefore, we decided it would be most important that our model detects the positive case of attrition. We planned on measuring this by using confusion matrices as well as optimizing the decision threshold. 
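
As a minimal sketch of how recall and the decision threshold interact, the snippet below computes recall and the false negative rate from predicted probabilities and picks the highest threshold that still meets a recall target. The helper names, the 0.75 target, and the toy probabilities are all hypothetical and are not taken from our model.

```python
import numpy as np

def recall_and_fnr(y_true, y_prob, threshold):
    """Recall = TP / (TP + FN); false negative rate = FN / (TP + FN)."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    positives = tp + fn
    if positives == 0:
        raise ValueError("y_true contains no positive cases")
    return tp / positives, fn / positives

def pick_threshold(y_true, y_prob, target_recall, grid=None):
    """Highest threshold on the grid whose recall meets the target (None if none do)."""
    if grid is None:
        grid = np.round(np.linspace(0.05, 0.95, 19), 2)
    ok = [float(t) for t in grid
          if recall_and_fnr(y_true, y_prob, t)[0] >= target_recall]
    return max(ok) if ok else None

# Toy example: 1 = employee left, 0 = employee stayed (made-up values)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.90, 0.62, 0.33, 0.40, 0.20, 0.10, 0.55, 0.47]

recall_at_half, fnr_at_half = recall_and_fnr(y_true, y_prob, 0.5)
best_threshold = pick_threshold(y_true, y_prob, target_recall=0.75)
```

Lowering the threshold raises recall at the cost of more false positives, which is why the sketch takes the highest threshold that still reaches the target.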

In addition, we wanted to minimize the false negative rate, which is one minus recall, so the two goals coincide. The reasoning is the same: for a company, it is more costly and risky to prepare for workers to stay and have them leave than to prepare for them to leave and have them stay. There may be a slightly higher 'prevention' cost if the company is prepared to hire new people; however, that is ultimately less costly than an employee leaving unexpectedly, as unexpected attrition may decrease the efficiency and operating power of the company.

Our approach is fairly conventional, though it required some creativity to solve our most frequent error: singular matrix.

The first model we used appeared to give us very high recall with low false negative rate, however with further analysis we realized it did not work. Because we were doing interaction terms and transformations after running forward selection, we did not give the algorithm the chance to filter through and choose the new best subset of variables. This also caused us to use much more than the 'optimal' number of variables per our BIC minimization. 

We anticipated some problems with the mix of variable types in our data set. With 35 columns, there is a large mixture of categorical and quantitative variables, with a heavy emphasis on categorical variables, which were all converted into dummy variables. The biggest problem we encountered was the singular matrix error: because we had to create so many dummy variables, forward selection would not run without returning a singular matrix error.

We did not check to see if our problem has solutions on Kaggle, and we could not find much information elsewhere. Initially, we tried to avoid the singular matrix error by running stepwise selection and then manually adding interaction variables. However, we later realized that this would not return the best results, as forward selection may have filtered out some of the interaction terms, so we had to make sure to input all potential variables into the forward selection model. To combat the singular matrix error, we tried dropping the columns we deemed to be its primary causes. These included one dummy variable from each categorical variable and variables with high VIFs. Our very first model did not work at all because of the need to create and interact all the dummy variables.
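
One well-known source of a singular design matrix is keeping every dummy level of a categorical variable alongside an intercept, since the dummies then sum to the intercept column. The sketch below illustrates this on a made-up `JobRole` column (the values and the helper `design_rank` are hypothetical); it is not a diagnosis of every singular matrix error we hit.

```python
import numpy as np
import pandas as pd

# Made-up categorical column standing in for a variable such as JobRole
df = pd.DataFrame(
    {"JobRole": ["Manager", "Sales", "Scientist", "Sales", "Manager", "Scientist"]}
)

# All levels kept: the dummy columns add up to a column of ones
full = pd.get_dummies(df["JobRole"], dtype=float)
# One level dropped: the remaining dummies are measured against that baseline
reduced = pd.get_dummies(df["JobRole"], drop_first=True, dtype=float)

def design_rank(dummies):
    """Return (rank, number of columns) of the design matrix [intercept | dummies]."""
    X = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return int(np.linalg.matrix_rank(X)), X.shape[1]

full_rank, full_cols = design_rank(full)           # rank is below the column count
reduced_rank, reduced_cols = design_rank(reduced)  # rank equals the column count
```

When the rank is smaller than the number of columns, the matrix that the fitting routine needs to invert is singular, which is the error message we kept seeing.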

Our problem had information posted to Kaggle, but not much in terms of a regression that used methods we had already learned about. The analyses on Kaggle were mostly binning and visuals unrelated to the factors we chose to examine, or used methods we did not learn in this class, so we chose not to build upon the Kaggle solutions.

## Developing the model

Explain the steps taken to develop and improve the base model - informative visualizations / addressing modeling assumption violations / variable transformation / interactions / outlier treatment / influential points treatment / addressing over-fitting / addressing multicollinearity / variable selection - stepwise regression, lasso, ridge regression). 

Did you succeed in achieving your goal, or did you fail? Why?

**Put the final model equation**.

**Important: This section should be rigorous and thorough. Present detailed information about decision you made, why you made them, and any evidence/experimentation to back them up.**

As explained above, to create our baseline model we first dropped `EmployeeCount`, `StandardHours`, and `Over18`, then resampled our data frame so we had an even distribution of positive and negative classes, and lastly calculated the VIFs and dropped variables we suspected of multicollinearity. We decided to do this based on our heatmap, which indicated a high likelihood of multicollinearity. Our baseline model returned a recall of 74.1% on the training data and 51.3% on the test data.
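
The first two preparation steps above (dropping constant columns and balancing the classes) might look roughly like the sketch below. The helper names and the ten-row toy frame are assumptions for illustration only. Oversampling is typically applied to the training split alone, so that duplicated rows do not leak into the test set.

```python
import pandas as pd

def drop_constant_columns(df):
    """Drop columns that take a single value in every row."""
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=constant), constant

def oversample_minority(df, target, random_state=0):
    """Resample the minority class with replacement until both classes are equal in size."""
    counts = df[target].value_counts()
    minority = df[df[target] == counts.idxmin()]
    n_extra = int(counts.max() - counts.min())
    extra = minority.sample(n=n_extra, replace=True, random_state=random_state)
    return pd.concat([df, extra], ignore_index=True)

# Toy frame (made-up values): 8 employees stayed, 2 left, one constant column
toy = pd.DataFrame({
    "Attrition_Yes": [0] * 8 + [1] * 2,
    "StandardHours": [80] * 10,
    "Age": list(range(25, 35)),
})

cleaned, dropped = drop_constant_columns(toy)
balanced = oversample_minority(cleaned, "Attrition_Yes")
```

After these two calls, `dropped` lists the removed constant column and `balanced` contains equal numbers of rows for each class.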

We began our efforts of improvement by visualizing our predictors to determine the necessity of transformations. We binned `TotalWorkingYears`, `YearsAtCompany`, `YearsInCurrentRole`, `YearsWithCurrManager`, `Age`, `YearsSinceLastPromotion`, `MonthlyRate`, `DistanceFromHome`, `DailyRate`, and `HourlyRate` to help visualize possible transformations. From this, we determined `YearsInCurrentRole`, `StockOptionLevel`, `YearsSinceLastPromotion`, `NumCompaniesWorked`, `DailyRate`, `TrainingTimesLastYear`, and `HourlyRate` may require quadratic transformations while `PercentSalaryHike`, `Education`, and `WorkLifeBalance` may require logarithmic transformations.
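
A rough sketch of the binning idea: cut a continuous predictor into equal-width bins and look at the attrition rate in each bin. The function name, the number of bins, and the deliberately U-shaped synthetic data are assumptions made for illustration; they do not reproduce our plots.

```python
import numpy as np
import pandas as pd

def attrition_rate_by_bin(df, column, target="Attrition_Yes", bins=5):
    """Mean of the 0/1 target within equal-width bins of a continuous column."""
    binned = pd.cut(df[column], bins=bins)
    return df.groupby(binned, observed=True)[target].mean()

# Synthetic data (NOT the real dataset): attrition is high at both ends of
# the range, which is the kind of pattern that suggests a quadratic term.
years = np.tile(np.arange(20), 10)
left = ((years < 4) | (years >= 16)).astype(int)
toy = pd.DataFrame({"YearsInCurrentRole": years, "Attrition_Yes": left})

rates = attrition_rate_by_bin(toy, "YearsInCurrentRole", bins=5)
```

A rate that falls and then rises (or the reverse) across bins points toward a squared term, while a rate that changes quickly at first and then levels off points toward a logarithmic transformation.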

After this, we ran our first round of forward selection, including our original predictors and those now requiring transformations. Because our data had such a large number of original predictors, we chose forward selection, which is particularly useful when dealing with a large number of predictors or when there is minimal prior knowledge about which predictors are likely to be important. By starting with an empty set, forward selection helps to avoid including irrelevant or redundant features in the model, which could lead to overfitting and decreased generalization performance. We chose BIC as our criterion for how many variables to use, as BIC typically handles large numbers of predictors best. This resulted in a model including 29 variables rather than the original 50.
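
To make the selection procedure concrete, here is a simplified sketch of forward selection scored by BIC for a logistic model, written with NumPy only. It is an assumption-laden illustration, not our project code: it stops as soon as BIC no longer improves, fits the model with a plain Newton iteration, and uses synthetic data with hypothetical column names.

```python
import numpy as np

def fit_logit_loglik(X, y, n_iter=50, ridge=1e-8):
    """Fit logistic regression by Newton's method; return the maximized log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30, 30)
        p = 1.0 / (1.0 + np.exp(-eta))
        grad = X.T @ (y - p)
        # Tiny ridge term keeps the Hessian invertible in this sketch
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    eta = np.clip(X @ beta, -30, 30)
    return float(np.sum(y * eta - np.log1p(np.exp(eta))))

def bic(X, y):
    """BIC = k * ln(n) - 2 * log-likelihood, where k counts fitted coefficients."""
    n, k = X.shape
    return k * np.log(n) - 2.0 * fit_logit_loglik(X, y)

def forward_select_bic(X, y, names):
    """Greedily add the predictor that lowers BIC the most; stop when none does."""
    intercept = np.ones((X.shape[0], 1))
    selected, remaining = [], list(range(X.shape[1]))
    best = bic(intercept, y)
    while remaining:
        scores = [(bic(np.hstack([intercept, X[:, selected + [j]]]), y), j)
                  for j in remaining]
        cand_bic, cand_j = min(scores)
        if cand_bic >= best:
            break
        best = cand_bic
        selected.append(cand_j)
        remaining.remove(cand_j)
    return [names[j] for j in selected], best

# Synthetic data: two informative columns and three pure-noise columns
rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 5))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 2]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(float)
names = ["signal_a", "noise_1", "signal_b", "noise_2", "noise_3"]

null_bic = bic(np.ones((n, 1)), y)
selected, final_bic = forward_select_bic(X, y, names)
```

Because the BIC penalty is `ln(n)` per coefficient, a predictor is only added when it improves the fit by more than that amount, which tends to keep the selected set small.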

Before running the second round of forward selection, we dropped some additional predictors: `JobRole_HumanResources`, `JobRole_LaboratoryTechnician`, `JobRole_Manager`, `JobRole_ManufacturingDirector`, `JobRole_ResearchDirector`, `JobRole_SalesExecutive`, `JobRole_SalesRepresentative`, `EducationField_Marketing`, `EducationField_Medical`, `EducationField_Other`. Including these variables when we ran forward selection with interactions had been returning a singular matrix error. After repeatedly running into this error throughout the project, we analyzed our data and determined that `JobRole` and `EducationField` were the two variables likely causing it. Because these variables had so many categories, their dummies contained many 0s, likely resulting in a determinant of 0 and a singular matrix error. However, we chose to drop these only after the first round of forward selection because, through earlier iterations of the model that employed brute force, we saw that `JobRole_ResearchScientist` and `EducationField_TechnicalDegree` in particular were important to the recall of our model.

Using the remaining predictors, we attempted a second round of variable selection, this time using interaction terms as well. However, when we originally ran forward selection with all 463 of these predictors, it returned a singular matrix error. A singular matrix error results from perfect or near-perfect collinearity among two or more variables. Therefore, we decided to recalculate VIFs and drop variables with problematic VIFs. Due to the large number of predictors, we decided we did not have the time or computational capacity to drop variables one by one based on their VIFs, so we dropped sets of variables at a time. If we could do this project again, we would try to find a way to drop the variables one by one, as it is likely we dropped some important predictors through this method. Left with 156 predictors and VIFs all at 10 or below, we began the second round of forward selection.
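
The one-at-a-time alternative mentioned above could be sketched as follows: compute every VIF, drop only the single worst predictor, and repeat until all VIFs are at or below the limit. This is a hypothetical NumPy illustration on synthetic columns, not the batch procedure we actually ran, and it assumes constant columns have already been removed. The statsmodels function `variance_inflation_factor` computes the same quantity for one column at a time.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
        out[j] = np.inf if r2 >= 1.0 - 1e-12 else 1.0 / (1.0 - r2)
    return out

def drop_high_vif(X, names, limit=10.0):
    """Drop the single highest-VIF column repeatedly until all VIFs are <= limit."""
    names, dropped = list(names), []
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] <= limit:
            break
        dropped.append(names.pop(worst))
        X = np.delete(X, worst, axis=1)
    return X, names, dropped

# Synthetic columns: the third is almost exactly the sum of the first two
rng = np.random.default_rng(0)
x1, x2, x3 = rng.normal(size=(3, 300))
combo = x1 + x2 + 0.01 * rng.normal(size=300)
X = np.column_stack([x1, x2, combo, x3])

X_kept, kept, dropped = drop_high_vif(X, ["x1", "x2", "x1_plus_x2", "x3"])
```

Recomputing the VIFs after each single drop matters because removing one collinear column usually lowers the VIFs of the others, so fewer predictors end up being discarded than when dropping in batches.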

For our final step, we combined the untransformed and transformed variables selected in the first round of forward selection with the interaction terms selected in the second round to create our final model.

Our final model equation is as follows:

    Attrition_Yes~OverTime_Yes+MaritalStatus_Single+JobLevel+JobInvolvement+JobRole_ResearchScientist+
    EducationField_TechnicalDegree+EnvironmentSatisfaction+JobSatisfaction+BusinessTravel_Travel_Frequently+
    YearsInCurrentRole+YearsAtCompany+NumCompaniesWorked+TotalWorkingYears+DistanceFromHome+BusinessTravel_Travel_Rarely+
    MaritalStatus_Married+JobRole_ResearchDirector+YearsWithCurrManager+Gender_Male+YearsSinceLastPromotion+
    JobRole_SalesRepresentative+TrainingTimesLastYear+JobRole_LaboratoryTechnician+JobRole_HumanResources+
    JobRole_SalesExecutive+np.log(WorkLifeBalance)+I(YearsSinceLastPromotion**2)+I(YearsInCurrentRole**2)+
    np.log(PercentSalaryHike)+MaritalStatus_Single*OverTime_Yes+HourlyRate*JobLevel+DistanceFromHome*OverTime_Yes+
    EnvironmentSatisfaction*StockOptionLevel+NumCompaniesWorked*Gender_Male+TotalWorkingYears*Gender_Male+
    StockOptionLevel*OverTime_Yes+RelationshipSatisfaction*JobRole_ResearchScientist+DailyRate*EducationField_TechnicalDegree+
    DailyRate*JobLevel+YearsSinceLastPromotion*MaritalStatus_Married+DistanceFromHome*YearsInCurrentRole+
    DistanceFromHome*BusinessTravel_Travel_Frequently+JobRole_ResearchScientist*MaritalStatus_Single+
    JobSatisfaction*TrainingTimesLastYear+TrainingTimesLastYear*Gender_Male+
    YearsInCurrentRole*BusinessTravel_Travel_Frequently+MonthlyRate*BusinessTravel_Travel_Frequently+
    JobRole_ResearchScientist*OverTime_Yes+EnvironmentSatisfaction*NumCompaniesWorked+DailyRate*NumCompaniesWorked+
    YearsSinceLastPromotion*MaritalStatus_Single+DailyRate*YearsSinceLastPromotion+NumCompaniesWorked*OverTime_Yes+
    EnvironmentSatisfaction*BusinessTravel_Travel_Frequently+NumCompaniesWorked*YearsSinceLastPromotion+
    BusinessTravel_Travel_Frequently*MaritalStatus_Married+StockOptionLevel*BusinessTravel_Travel_Frequently+
    StockOptionLevel*YearsSinceLastPromotion+StockOptionLevel*Gender_Male+StockOptionLevel*BusinessTravel_Travel_Rarely+
    DailyRate*StockOptionLevel+DailyRate*DistanceFromHome+EducationField_TechnicalDegree*JobRole_ResearchScientist+
    TotalWorkingYears*JobRole_ResearchScientist+YearsSinceLastPromotion*JobRole_ResearchScientist+
    JobRole_ResearchScientist*MaritalStatus_Married+StockOptionLevel*TotalWorkingYears+
    StockOptionLevel*TrainingTimesLastYear+JobSatisfaction*Gender_Male+EnvironmentSatisfaction*Gender_Male+
    MonthlyRate*MaritalStatus_Married+TrainingTimesLastYear*MaritalStatus_Married

In this model, our recall improved on the training data from 74.1% to 83.4% and on the test data from 51.3% to 53.8%.  

Though we improved our recall, we would not consider this a successful project overall. As noted above, dropping high-VIF variables in large batches likely caused problems in variable selection. Because of the large gap in recall between the training data (83.4%) and the test data (53.8%), our model is likely overfitted.

## Limitations of the model with regard to inference / prediction

If it is inference, will the inference hold for a certain period of time, for a certain subset of population, and / or for certain conditions.

If it is prediction, then will it be possible / convenient / expensive for the stakeholders to collect the data relating to the predictors in the model. Using your model, how soon will the stakeholder be able to predict the outcome before the outcome occurs. For example, if the model predicts the number of bikes people will rent in Evanston on a certain day, then how many days before that day will your model be able to make the prediction. This will depend on how soon the data that your model uses becomes available. If you are predicting election results, how many days / weeks / months / years before the election can you predict the results. 

When will your model become too obsolete to be useful?

Our model, as an inference model, will hold until the workplace environment has been transformed. Many of the variables in our model are common workplace and personal features, such as overtime, marital status, job satisfaction, and distance from home. These apply to most workplaces, and therefore can continue to be used until they are no longer relevant (for example, if the workplace becomes more remote, a predictor such as distance from home will be less relevant; we are already beginning to see this with more remote jobs). However, we used data from a pre-Covid world because we believe that the workplace is slowly going back to pre-pandemic norms, so even after a world-changing shift, this model could still be helpful: it outlines the basic parameters involved in job attrition, ones unlikely to be 'taken out of the equation' by changes in the workplace. For example, even with a change in job environment, your marital status and overtime are likely to continue to affect your 'risk' of attrition. Either way, this model continues to be useful for HR departments and hiring managers to get a sense of which employee attributes to consider when hiring and which to focus on during employment, regardless of what the work environment looks like (job satisfaction will continue to be relevant even if we move to a remote work world).

## Conclusions and Recommendations to stakeholder(s)

What conclusions do you draw based on your model? If it is inference you may draw conclusions based on the coefficients, statistical significance of predictors / interactions, etc. If it is prediction, you may draw conclusions based on prediction accuracy, or other performance metrics.

How do you use those conclusions to come up with meaningful recommendations for stakeholders? The recommendations must be action-items for stakeholders that they can directly implement without any further analysis. Be as precise as possible. The stakeholder(s) are depending on you to come up with practically implementable recommendations, instead of having to think for themselves.

If your recommendations are not practically implementable by stakeholders, how will they help them? Is there some additional data / analysis / domain expertise you need to do to make the recommendations implementable? 

Do the stakeholder(s) need to be aware about some limitations of your model? Is your model only good for one-time use, or is it possible to update your model at a certain frequency (based on recent data) to keep using it in the future? If it can be used in the future, then for how far into the future?

Recommendations

- To make recommendations, we looked at our model coefficients, focusing on the coefficients with the largest absolute values, as these have the biggest influence on the model and support the strongest inferences.
- Research scientists are highly likely to leave their job, but research directors are highly unlikely to leave their job. Therefore, research employers should provide many leadership and promotional opportunities to research scientists. If they feel they are able to move up in their role, they may be more motivated to stay or more motivated to work hard, making employers less likely to fire them.

- Unmarried workers are likely to leave their job. This may be because they are at a less stable stage in their life and go through transition periods that involve moving or finding a new job to fit a new lifestyle. However, unmarried people who work overtime are less likely to leave, and unmarried people who have gone several years without a promotion are more likely to leave. Unmarried workers are more likely to be younger workers. If employers are hiring entry-level workers, they should be sure that the employee is dedicated and willing to make sacrifices for work (i.e., sacrifice time).

- Additionally, married workers have much lower attrition rates overall. Depending on company policy on in-office relationships, larger businesses could host social events for office workers to meet other people in their job field and with similar interests, possibly resulting in some long-term relationships.

- Employees who travel frequently are fairly likely to leave the job. With the technology available, employers should lean into Zoom calls for out-of-town work, rather than forcing employees to travel.

- There is a high negative coefficient for the log transformation of work life balance. Therefore, a medium amount of work life balance is not nearly as impactful as a high amount of work life balance. If the employer is having a problem with a lot of employees leaving, they could provide a high level of work life balance. However, a medium amount of work life balance will not greatly improve attrition rates for employers, and might, in fact cause them to lose productivity. If employers are unable to give a high amount of work life balance, instead of giving a medium amount they could give a lower amount and focus on other factors that would improve attrition instead.

- Attrition is greatly increased for employees who travel frequently and for those who do not. Attrition also increases as stock option level and total working years increase. Overall, it seems that greater stock options lead to greater rates of attrition. One of the main ways to offset attrition among workers receiving stock options is environment satisfaction. Stock options likely lead to high rates of attrition because they provide a form of income to the employee even after attrition. Therefore, we recommend requiring a certain number of working years tied to the level of stock options, as well as maintaining an environment that workers want to be in. Workers need to enjoy work enough that they do not want to retire and live off of options alone.

- To be most helpful for stakeholders, this model should be updated with post-COVID data to see how employee preferences have changed after COVID.

## GitHub and individual contribution {-}

Put the **Github link** for the project repository.

https://github.com/catherineerickson/Attrition-Project

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 15%;"> 
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Contributed aspects</th>
    <th>Details</th>
    <th>Number of GitHub commits</th>
  </tr>
  <tr>
    <td>Ally Bardas</td>
    <td>Data cleaning, EDA, and Forward Selection </td>
    <td>Cleaned data, addressed model bias, helped fix multicollinearity errors.</td>
    <td>20</td>
  </tr>
  <tr>
    <td>Annabel Skubisz</td>
    <td>EDA, Forward Selection, Model </td>
    <td>Checked and addressed modeling assumptions and identified relevant variable interactions.</td>
    <td>23</td>
  </tr>
    <tr>
    <td>Catherine Erickson</td>
    <td>EDA, VIF, Forward Selection, and Transformations</td>
    <td>Developed visualizations to identify appropriate variable transformations, code for VIF and running through the forward selections.</td>
    <td>13</td>    
  </tr>
    <tr>
    <td>Karrine Denisova</td>
    <td>Interactions and Model</td>
    <td>Created interaction terms and helped develop model equation.</td>
    <td>18</td>    
  </tr>
</table>

List the **challenges** you faced when collaborating with the team on GitHub. Are you comfortable using GitHub? 
Do you feel GitHub made collaboration easier? If not, then why? *(Individual team members can put their opinion separately, if different from the rest of the team)*

When we accidentally committed at the same time, we had issues deciding when to overwrite or stash changes. Otherwise, we were all comfortable using GitHub and think it is better than the alternative of texting. There was a learning curve, but we feel comfortable using it now.

## References {-}

List and number all bibliographical references. When referenced in the text, enclose the citation number in square brackets, for example [1].



## Appendix {-}

You may put additional stuff here as Appendix. You may refer to the Appendix in the main report to support your arguments. However, the appendix section is unlikely to be checked while grading, unless the grader deems it necessary.