# Analyzing Employee Exit Surveys - Part 1

In this project, I am working with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. 

### I am going to analyse:
- Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
- Are younger employees resigning due to some kind of dissatisfaction? What about older employees?

Both questions are to be answered from the combined results of the two surveys.

### In this notebook I will only clean and aggregate the data; the analysis can be found in `exit_surveys_analysis.ipynb`

In [27]:
import pandas as pd
import numpy as np

## Importing data and first inspections

I am going to use the two datasets:
- `dete_survey.csv`
- `tafe_survey.csv`

In [28]:
# load raw data
tafe_survey = pd.read_csv('tafe_survey.csv')
dete_survey = pd.read_csv('dete_survey.csv')
dete_survey.head()

Unnamed: 0,ID,SeparationType,Cease Date,DETE Start Date,Role Start Date,Position,Classification,Region,Business Unit,Employment Status,...,Kept informed,Wellness programs,Health & Safety,Gender,Age,Aboriginal,Torres Strait,South Sea,Disability,NESB
0,1,Ill Health Retirement,08/2012,1984,2004,Public Servant,A01-A04,Central Office,Corporate Strategy and Peformance,Permanent Full-time,...,N,N,N,Male,56-60,,,,,Yes
1,2,Voluntary Early Retirement (VER),08/2012,Not Stated,Not Stated,Public Servant,AO5-AO7,Central Office,Corporate Strategy and Peformance,Permanent Full-time,...,N,N,N,Male,56-60,,,,,
2,3,Voluntary Early Retirement (VER),05/2012,2011,2011,Schools Officer,,Central Office,Education Queensland,Permanent Full-time,...,N,N,N,Male,61 or older,,,,,
3,4,Resignation-Other reasons,05/2012,2005,2006,Teacher,Primary,Central Queensland,,Permanent Full-time,...,A,N,A,Female,36-40,,,,,
4,5,Age Retirement,05/2012,1970,1989,Head of Curriculum/Head of Special Education,,South East,,Permanent Full-time,...,N,A,M,Female,61 or older,,,,,


## Data Cleaning Tasks:
- The `dete_survey` dataframe contains 'Not Stated' values that indicate values are missing, but they aren't represented as NaN.
- Both the `dete_survey` and `tafe_survey` dataframes contain many columns that are not needed to complete the analysis.
- Each dataframe contains many of the same columns, but the column names are different
- There are multiple columns/answers that indicate an employee resigned because they were dissatisfied.

### Solutions:
- Read 'Not Stated' in `dete_survey` as NaN.
- Drop the `tafe_survey` columns [17:66] (I will not include all the column names here as the list is way too long).
- Drop the `dete_survey` columns [28:49], as they are of no interest to the goal of this analysis.
- Make all column names lowercase.
- Remove any leading or trailing whitespace from the column names.
- Replace spaces with underscores ('_').
- Rename the remaining columns where needed.
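
As a minimal sketch of the first solution, the snippet below reads a tiny made-up CSV (held in a string, not the real survey file) and shows how the `na_values` argument of `pd.read_csv` turns 'Not Stated' into NaN at load time. The three sample rows and the name `sample` are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the real file has many more columns.
sample_csv = io.StringIO(
    "ID,DETE Start Date\n"
    "1,1984\n"
    "2,Not Stated\n"
    "3,2011\n"
)

# Treat the placeholder text as a missing value while parsing.
sample = pd.read_csv(sample_csv, na_values='Not Stated')

# The placeholder is now a real missing value, so the column parses as numeric.
n_missing = int(sample['DETE Start Date'].isnull().sum())
print(sample)
print(n_missing)
```

Without `na_values`, the same column would be read as text, because 'Not Stated' cannot be parsed as a number.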

In [3]:
tafe_survey = pd.read_csv('tafe_survey.csv') # load raw data
dete_survey = pd.read_csv('dete_survey.csv', na_values = 'Not Stated') # load raw data, reading 'Not Stated' as NaN

tafe_survey_to_drop = tafe_survey.columns[17:66] # tafe columns to be dropped
dete_survey_to_drop = dete_survey.columns[28:49] # dete columns to be dropped

tafe_survey = tafe_survey.drop(tafe_survey_to_drop, axis = 1) # drop columns 17:66
dete_survey = dete_survey.drop(dete_survey_to_drop, axis = 1) # drop columns 28:49

# cleaning column names manually:
rename_names = {'Record ID': 'id', 
                'CESSATION YEAR': 'cease_date',
                'Reason for ceasing employment': 'separationtype',
                'Gender. What is your Gender?': 'gender', 
                'CurrentAge. Current Age': 'age',
                'Employment Type. Employment Type': 'employment_status',
                'Classification. Classification': 'position',
                'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
                'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}

tafe_survey = tafe_survey.rename(columns = rename_names)
dete_survey = dete_survey.rename(columns = rename_names)

# cleaning column names:
# remove trailing white spaces
# replace remaining white spaces with '_'
# make all columns lower case
tafe_survey.columns = tafe_survey.columns.str.lower().str.strip().str.replace(' ','_')
dete_survey.columns = dete_survey.columns.str.lower().str.strip().str.replace(' ','_')

tafe_survey.head()

Unnamed: 0,id,institute,workarea,cease_date,separationtype,contributing_factors._career_move_-_public_sector,contributing_factors._career_move_-_private_sector,contributing_factors._career_move_-_self-employment,contributing_factors._ill_health,contributing_factors._maternity/family,...,contributing_factors._study,contributing_factors._travel,contributing_factors._other,contributing_factors._none,gender,age,employment_status,position,institute_service,role_service
0,6.34133e+17,Southern Queensland Institute of TAFE,Non-Delivery (corporate),2010.0,Contract Expired,,,,,,...,,,,,Female,26 30,Temporary Full-time,Administration (AO),1-2,1-2
1,6.341337e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Retirement,-,-,-,-,-,...,-,Travel,-,-,,,,,,
2,6.341388e+17,Mount Isa Institute of TAFE,Delivery (teaching),2010.0,Retirement,-,-,-,-,-,...,-,-,-,NONE,,,,,,
3,6.341399e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Resignation,-,-,-,-,-,...,-,Travel,-,-,,,,,,
4,6.341466e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,Career Move - Private Sector,-,-,-,...,-,-,-,-,Male,41 45,Permanent Full-time,Teacher (including LVT),3-4,3-4


In [4]:
tafe_survey.columns

Index(['id', 'institute', 'workarea', 'cease_date', 'separationtype',
       'contributing_factors._career_move_-_public_sector',
       'contributing_factors._career_move_-_private_sector',
       'contributing_factors._career_move_-_self-employment',
       'contributing_factors._ill_health',
       'contributing_factors._maternity/family',
       'contributing_factors._dissatisfaction',
       'contributing_factors._job_dissatisfaction',
       'contributing_factors._interpersonal_conflict',
       'contributing_factors._study', 'contributing_factors._travel',
       'contributing_factors._other', 'contributing_factors._none', 'gender',
       'age', 'employment_status', 'position', 'institute_service',
       'role_service'],
      dtype='object')

## Filtering for Survey Respondents who resigned:

`dete_survey['separationtype']` contains multiple separation types that contain 'Resignation'

In [5]:
dete_survey['separationtype'].value_counts()

Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: separationtype, dtype: int64

`tafe_survey['separationtype']` contains only one type of Resignation

In [6]:
tafe_survey['separationtype'].value_counts()

Resignation                 340
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
Name: separationtype, dtype: int64
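
Since every resignation label in both surveys starts with the word 'Resignation', the selection could also be sketched with a vectorized string test instead of listing each label. This is only an alternative sketch on a made-up Series (`separation` is a hypothetical name), not the selection used in this notebook.

```python
import pandas as pd

# Hypothetical sample of separation types, including a missing value.
separation = pd.Series([
    'Age Retirement',
    'Resignation-Other reasons',
    'Resignation-Other employer',
    None,
    'Resignation',
    'Contract Expired',
])

# na=False treats missing entries as non-matches instead of propagating NaN.
is_resignation = separation.str.startswith('Resignation', na=False)
resignations = separation[is_resignation]
print(resignations)
```

The explicit comparison against the three DETE labels has the advantage that a new, unexpected label is never picked up silently.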

In [7]:
# select resignations
dete_resignations = dete_survey[(dete_survey['separationtype'] == 'Resignation-Other reasons') | \
                               (dete_survey['separationtype'] == 'Resignation-Other employer')| \
                                (dete_survey['separationtype'] == 'Resignation-Move overseas/interstate')].copy()

taf_resignations = tafe_survey[tafe_survey['separationtype'] == 'Resignation'].copy()

taf_resignations.head()

Unnamed: 0,id,institute,workarea,cease_date,separationtype,contributing_factors._career_move_-_public_sector,contributing_factors._career_move_-_private_sector,contributing_factors._career_move_-_self-employment,contributing_factors._ill_health,contributing_factors._maternity/family,...,contributing_factors._study,contributing_factors._travel,contributing_factors._other,contributing_factors._none,gender,age,employment_status,position,institute_service,role_service
3,6.341399e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Resignation,-,-,-,-,-,...,-,Travel,-,-,,,,,,
4,6.341466e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,Career Move - Private Sector,-,-,-,...,-,-,-,-,Male,41 45,Permanent Full-time,Teacher (including LVT),3-4,3-4
5,6.341475e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,-,-,-,-,...,-,-,Other,-,Female,56 or older,Contract/casual,Teacher (including LVT),7-10,7-10
6,6.34152e+17,Barrier Reef Institute of TAFE,Non-Delivery (corporate),2010.0,Resignation,-,Career Move - Private Sector,-,-,Maternity/Family,...,-,-,Other,-,Male,20 or younger,Temporary Full-time,Administration (AO),3-4,3-4
7,6.341537e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,-,-,-,-,...,-,-,Other,-,Male,46 50,Permanent Full-time,Teacher (including LVT),3-4,3-4


# Checking the Data for Faulty Values:

- Since the `cease_date` is the last year of the person's employment and the `dete_start_date` is the person's first year of employment, it wouldn't make sense to have years after the current date.
- Given that most people in this field start working in their 20s, it's also unlikely that the `dete_start_date` was before the year 1940.
- If we have many years higher than the current date or lower than 1940, we wouldn't want to continue with our analysis, because it could mean there's something very wrong with the data. If only a small number of values are unrealistically high or low, we can remove them.
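
The plausibility rule above can be sketched as a range check. The years below are made up (1901 and 2999 stand in for implausible entries), and the bounds `earliest` and `latest` are hypothetical names for the 1940 limit and the current year.

```python
import datetime
import pandas as pd

# Hypothetical start years, including two implausible ones and a missing value.
years = pd.Series([1984.0, 2005.0, 1901.0, 2999.0, None, 2011.0])

earliest = 1940
latest = datetime.date.today().year

# between() is inclusive on both ends; NaN evaluates to False,
# so missing values are excluded from the "suspicious" set explicitly.
plausible = years.between(earliest, latest)
suspicious = years[~plausible & years.notnull()]
print(suspicious)
```

If `suspicious` is short, the affected rows can be dropped; a long list would be a reason to stop and inspect the source data.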

In [8]:
dete_resignations['dete_start_date'].value_counts()

2011.0    24
2008.0    22
2007.0    21
2012.0    21
2010.0    17
2005.0    15
2004.0    14
2009.0    13
2006.0    13
2013.0    10
2000.0     9
1999.0     8
1996.0     6
2002.0     6
1992.0     6
1998.0     6
2003.0     6
1994.0     6
1993.0     5
1990.0     5
1980.0     5
1997.0     5
1991.0     4
1989.0     4
1988.0     4
1995.0     4
2001.0     3
1985.0     3
1986.0     3
1983.0     2
1976.0     2
1974.0     2
1971.0     1
1972.0     1
1984.0     1
1982.0     1
1987.0     1
1975.0     1
1973.0     1
1977.0     1
1963.0     1
Name: dete_start_date, dtype: int64

In [9]:
dete_resignations['cease_date'].value_counts()

2012       126
2013        74
01/2014     22
12/2013     17
06/2013     14
09/2013     11
11/2013      9
07/2013      9
10/2013      6
08/2013      4
05/2012      2
05/2013      2
07/2012      1
07/2006      1
2010         1
09/2010      1
Name: cease_date, dtype: int64


### Conclusion:
- both columns check out
- just need to modify the date formats a little

## Next step: estimating length of employment
- new column in the `dete_resignations` dataframe called `institute_service`
- measures the length of employment (`cease_date` - `dete_start_date`)
- the `taf_resignations` dataframe does not contain a start date of employment, but it already has an `institute_service` column
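
Because `cease_date` mixes the formats 'YYYY' and 'MM/YYYY', one way to get the year is to extract the four-digit group directly rather than parsing full dates. The sketch below does this on a made-up frame (`sample` and its values are assumptions) and then computes the length of service.

```python
import pandas as pd

# Hypothetical rows with both date formats and some missing values.
sample = pd.DataFrame({
    'cease_date': ['2012', '01/2014', '09/2013', None],
    'dete_start_date': [2005.0, 1994.0, None, 2009.0],
})

# Pull the four-digit year out of either 'YYYY' or 'MM/YYYY'.
cease_year = sample['cease_date'].str.extract(r'(\d{4})', expand=False).astype(float)

# Years of service; NaN wherever either year is missing.
sample['institute_service'] = cease_year - sample['dete_start_date']
print(sample)
```

This avoids depending on how the date parser guesses a format when the column contains more than one.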

In [10]:
# select resignations
# use df.copy() method to avoid SettingWithCopyWarning
dete_resignations = dete_survey[(dete_survey['separationtype'] == 'Resignation-Other reasons') | \
                               (dete_survey['separationtype'] == 'Resignation-Other employer')| \
                                (dete_survey['separationtype'] == 'Resignation-Move overseas/interstate')].copy()

taf_resignations = tafe_survey[tafe_survey['separationtype'] == 'Resignation'].copy()

# parse the dates and keep only the year (as float)
dete_resignations['cease_date'] = pd.to_datetime(dete_resignations['cease_date']).dt.strftime('%Y').astype(float)

dete_resignations['dete_start_date'] = pd.to_datetime(dete_resignations['dete_start_date'], \
                                                      format = "%Y").dt.strftime('%Y').astype(float)

# subtract dete_start_date from cease_date

dete_resignations['institute_service'] = dete_resignations['cease_date'].astype(float) \
                                                - dete_resignations['dete_start_date'].astype(float)
dete_resignations.head()

Unnamed: 0,id,separationtype,cease_date,dete_start_date,role_start_date,position,classification,region,business_unit,employment_status,...,workload,none_of_the_above,gender,age,aboriginal,torres_strait,south_sea,disability,nesb,institute_service
3,4,Resignation-Other reasons,2012.0,2005.0,2006.0,Teacher,Primary,Central Queensland,,Permanent Full-time,...,False,False,Female,36-40,,,,,,7.0
5,6,Resignation-Other reasons,2012.0,1994.0,1997.0,Guidance Officer,,Central Office,Education Queensland,Permanent Full-time,...,False,False,Female,41-45,,,,,,18.0
8,9,Resignation-Other reasons,2012.0,2009.0,2009.0,Teacher,Secondary,North Queensland,,Permanent Full-time,...,False,False,Female,31-35,,,,,,3.0
9,10,Resignation-Other employer,2012.0,1997.0,2008.0,Teacher Aide,,,,Permanent Part-time,...,False,False,Female,46-50,,,,,,15.0
11,12,Resignation-Move overseas/interstate,2012.0,2009.0,2009.0,Teacher,Secondary,Far North Queensland,,Permanent Full-time,...,False,False,Male,31-35,,,,,,3.0


## Filter for Employees who resigned due to Dissatisfaction

Now I am going to find out whether or not the employees quit due to dissatisfaction. The approaches for the two dataframes are a bit different.

### 1. TAFE resignations:

In the TAFE dataframe the following two columns are indicators for dissatisfaction (shown with their original names; after the renaming above they are `contributing_factors._dissatisfaction` and `contributing_factors._job_dissatisfaction`):
- `Contributing Factors. Dissatisfaction`
- `Contributing Factors. Job Dissatisfaction`

First I will update the values in them so that each contains only True, False, or NaN values. Thereafter, I'll apply the df.any() method to create a dissatisfied column that summarizes if one of the two columns above is True.

In [11]:
def update_vals(val):
    '''
    Map the values of the columns
    - contributing_factors._job_dissatisfaction
    - contributing_factors._dissatisfaction
    to True, False or NaN, so that they uniformly indicate whether or not
    the employee left because of dissatisfaction.
    '''
    if pd.isnull(val):
        return np.nan
    elif val == '-':
        return False
    else:
        return True
    
# apply update_vals function to dataframe 
taf_resignations['contributing_factors._dissatisfaction'] = \
            taf_resignations['contributing_factors._dissatisfaction'].map(update_vals)

taf_resignations['contributing_factors._job_dissatisfaction'] = \
            taf_resignations['contributing_factors._job_dissatisfaction'].map(update_vals)

# apply any function to the two columns, if any of the two is True, the dissatisfaction column also becomes true
# thus providing an indicator for the employee quitting due to dissatisfaction
taf_resignations['dissatisfaction'] = taf_resignations[['contributing_factors._job_dissatisfaction', \
                                             'contributing_factors._dissatisfaction']].any(axis=1, skipna=False)

taf_resignations['contributing_factors._dissatisfaction'].value_counts()

False    277
True      55
Name: contributing_factors._dissatisfaction, dtype: int64

### 2. Dete Resignations

The following columns are indications for dissatisfaction:

- `job_dissatisfaction`
- `dissatisfaction_with_the_department`
- `physical_work_environment`
- `lack_of_recognition`
- `lack_of_job_security`
- `work_location`
- `employment_conditions`
- `work_life_balance`
- `workload`

These are already uniformly represented by True and False values, so it suffices to apply the df.any() method:

In [12]:
dete_resignations['dissatisfaction'] = dete_resignations[['job_dissatisfaction',\
                                                          'dissatisfaction_with_the_department',\
                                                          'physical_work_environment',\
                                                          'lack_of_recognition',\
                                                          'lack_of_job_security',\
                                                          'work_location',\
                                                          'employment_conditions',\
                                                          'work_life_balance',\
                                                          'workload']].any(axis=1, skipna=False)

dete_resignations['dissatisfaction'].value_counts()

False    162
True     149
Name: dissatisfaction, dtype: int64

# Data Aggregation into one dataframe

First, I'll add a column to each dataframe that will allow me to easily distinguish between the two.
- add a column named `institute` to `dete_resignations_up`. Each row should contain the value DETE.
- add a column named `institute` to `taf_resignations_up`. Each row should contain the value TAFE.

In [13]:
# copy updated dataframes to work on
dete_resignations_up = dete_resignations.copy()
taf_resignations_up = taf_resignations.copy()

# add extra column for identifiers
dete_resignations_up['institute'] = 'DETE'
taf_resignations_up['institute'] = 'TAFE'

# merge the two dataframes - result called combined
combined = pd.concat([dete_resignations_up, taf_resignations_up])

# keep only columns with at least 500 non-NaN values
# .copy() so that later column assignments do not trigger SettingWithCopyWarning
combined_up = combined.dropna(axis = 1, thresh = 500).copy()
combined_up.head()

Unnamed: 0,id,separationtype,cease_date,position,employment_status,gender,age,institute_service,dissatisfaction,institute
3,4.0,Resignation-Other reasons,2012.0,Teacher,Permanent Full-time,Female,36-40,7,False,DETE
5,6.0,Resignation-Other reasons,2012.0,Guidance Officer,Permanent Full-time,Female,41-45,18,True,DETE
8,9.0,Resignation-Other reasons,2012.0,Teacher,Permanent Full-time,Female,31-35,3,False,DETE
9,10.0,Resignation-Other employer,2012.0,Teacher Aide,Permanent Part-time,Female,46-50,15,True,DETE
11,12.0,Resignation-Move overseas/interstate,2012.0,Teacher,Permanent Full-time,Male,31-35,3,False,DETE
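
The `thresh` argument of `dropna` is easy to misread: it is the minimum number of non-missing values a column needs in order to be kept, not a maximum number of NaN values. A small sketch on a made-up frame (`toy` and its column names are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'mostly_present': ['a', 'b', np.nan, 'd'],        # 3 non-missing values
    'mostly_missing': [np.nan, np.nan, np.nan, 'x'],  # 1 non-missing value
})

# thresh counts NON-missing values: keep columns with at least 3 of them.
kept = toy.dropna(axis=1, thresh=3)
print(list(kept.columns))
```

With 651 rows in the combined dataframe, `thresh = 500` therefore keeps the columns that are filled in for at least 500 respondents.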


## Cleaning institute_service column

To analyze the data, I'll convert these numbers into categories. The categories are based on this [article](https://www.businesswire.com/news/home/20171108006002/en/Age-Number-Engage-Employees-Career-Stage).

I'll use the following, slightly modified, labels to classify the employees:

- New: Less than 3 years at a company
- Experienced: 3-6 years at a company
- Established: 7-10 years at a company
- Veteran: 11 or more years at a company

Let's categorize the values in the `institute_service` column using these definitions.
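
As a sketch of how these labels translate into bins, `pd.cut` can assign them in one call. This assumes service is recorded in whole years and is only an alternative illustration on made-up values, not the classification function used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical years of service, including the bin edges and a missing value.
years = pd.Series([0.0, 2.0, 3.0, 6.0, 7.0, 10.0, 11.0, 39.0, np.nan])

# right=False makes each bin include its left edge:
# [0, 3) new, [3, 7) experienced, [7, 11) established, [11, inf) veteran.
service_cat = pd.cut(
    years,
    bins=[0, 3, 7, 11, np.inf],
    labels=['new', 'experienced', 'established', 'veteran'],
    right=False,
)
print(service_cat)
```

Missing values stay missing, so they do not end up in any of the four categories.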

In [14]:
combined_up['institute_service'].value_counts()

Less than 1 year      73
1-2                   64
3-4                   63
5-6                   33
11-20                 26
5.0                   23
1.0                   22
7-10                  21
3.0                   20
0.0                   20
6.0                   17
4.0                   16
2.0                   14
9.0                   14
7.0                   13
More than 20 years    10
8.0                    8
13.0                   8
15.0                   7
20.0                   7
12.0                   6
22.0                   6
17.0                   6
14.0                   6
10.0                   6
18.0                   5
16.0                   5
23.0                   4
11.0                   4
24.0                   4
39.0                   3
32.0                   3
21.0                   3
19.0                   3
26.0                   2
28.0                   2
30.0                   2
25.0                   2
36.0                   2
27.0                   1


### Problem:
- Not all values are numerical in nature
- Some have the format "7-10"
- Some entries are of the type: "More than 20 years" or "Less than 1 year"

### Solution:
- Use the Series.astype() method to change the type to 'str'.
- Use vectorized string methods to extract the years of service from each pattern.
- Double check that no digits were missed during the extraction.
- Use the Series.astype() method to change the type to 'float'.
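
The steps above can be sketched on a handful of made-up values with a single regular expression that grabs the first number in each entry. Note one assumption that differs from the cell below: here 'Less than 1 year' yields 1.0 (the first number found) rather than 0.0; both fall into the same "New" category.

```python
import numpy as np
import pandas as pd

# Hypothetical mix of the formats that occur in the column.
raw = pd.Series(['Less than 1 year', '1-2', '7-10', 5.0, 'More than 20 years', np.nan])

as_text = raw.astype('str')
# First run of digits, optionally followed by a decimal part.
first_number = as_text.str.extract(r'(\d+(?:\.\d+)?)', expand=False)
years = first_number.astype('float')

# Double check: every non-missing raw value should have produced a number.
unparsed = int((years.isnull() & raw.notnull()).sum())
print(years)
print(unparsed)
```

A non-zero `unparsed` count would point to a format that the pattern does not cover.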

In [15]:
# Convert years of service to string
combined_up['institute_service'] = combined_up['institute_service'].astype('str')
# replace the two text labels with numbers:
combined_up['institute_service'] = combined_up['institute_service'].str.replace("Less than 1 year", "0.0", regex = False)
combined_up['institute_service'] = combined_up['institute_service'].str.replace("More than 20 years", "20.0", regex = False)
# keep the first entry of values separated by '-'
combined_up['institute_service'] = combined_up['institute_service'].str.split('-').str[0]
# convert to float
combined_up['institute_service'] = combined_up['institute_service'].astype('float')
combined_up['institute_service'].value_counts(dropna = False)

0.0     93
NaN     88
1.0     86
3.0     83
5.0     56
7.0     34
11.0    30
20.0    17
6.0     17
4.0     16
9.0     14
2.0     14
13.0     8
8.0      8
15.0     7
22.0     6
10.0     6
17.0     6
14.0     6
12.0     6
16.0     5
18.0     5
24.0     4
23.0     4
21.0     3
39.0     3
32.0     3
19.0     3
36.0     2
30.0     2
25.0     2
26.0     2
28.0     2
42.0     1
29.0     1
35.0     1
27.0     1
41.0     1
49.0     1
38.0     1
34.0     1
33.0     1
31.0     1
Name: institute_service, dtype: int64

### Classification of Employees

In [30]:
def classify_employee(val):
    if pd.isnull(val):
        return np.nan
    elif val < 3:
        return 'new'
    elif val <= 6:
        return 'experienced'
    elif val <= 10:
        return 'established'
    else:  # more than 10 years
        return 'veteran'
    
combined_up['service_cat'] = combined_up['institute_service'].map(classify_employee)
combined_up.head()

Unnamed: 0,id,separationtype,cease_date,position,employment_status,gender,age,institute_service,dissatisfaction,institute,service_cat
3,4.0,Resignation-Other reasons,2012,Teacher,Permanent Full-time,Female,36-40,7,False,DETE,established
5,6.0,Resignation-Other reasons,2012,Guidance Officer,Permanent Full-time,Female,41-45,18,True,DETE,veteran
8,9.0,Resignation-Other reasons,2012,Teacher,Permanent Full-time,Female,31-35,3,False,DETE,experienced
9,10.0,Resignation-Other employer,2012,Teacher Aide,Permanent Part-time,Female,46-50,15,True,DETE,veteran
11,12.0,Resignation-Move overseas/interstate,2012,Teacher,Permanent Full-time,Male,31-35,3,False,DETE,experienced


In [32]:
combined_up['service_cat'].value_counts(dropna = False)

new            281
experienced    172
veteran        136
established     62
Name: service_cat, dtype: int64

## Final Cleaning of dissatisfaction Column

In [33]:
combined_up['dissatisfaction'].value_counts(dropna = False)

False    411
True     240
Name: dissatisfaction, dtype: int64

- The `dissatisfaction` column contains 8 NaN values
- In order to analyze the data I'll replace them with the value that occurs most frequently in this column (False)

In [34]:
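# fill NaN values with False in the dissatisfaction column only, so missing values in other columns stay NaN
combined_up['dissatisfaction'] = combined_up['dissatisfaction'].fillna(False)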
combined_up['dissatisfaction'].value_counts(dropna = False)

False    411
True     240
Name: dissatisfaction, dtype: int64
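
The scope of `fillna` matters here. A small sketch on a made-up frame (`toy` is a hypothetical name) compares filling the whole dataframe with filling a single column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'institute_service': [7.0, np.nan, 3.0],
    'dissatisfaction': [True, np.nan, False],
})

# Whole-frame fill: the missing service length is filled with False as well.
filled_all = toy.fillna(False)

# Column-only fill: the missing service length stays NaN.
filled_col = toy.copy()
filled_col['dissatisfaction'] = filled_col['dissatisfaction'].fillna(False)

print(filled_all)
print(filled_col)
```

A service length of False compares as 0 in Python, so a check such as `val < 3` would treat it as a short tenure instead of an unknown one.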

# Save the Cleaned Data to CSV

In [35]:
combined_up.to_csv('aggregated_cleaned_data.csv',index=False)

# Conclusion
In this notebook I have successfully:
- cleaned the relevant data
- aggregated the two different data sources into a comprehensive data set

The final analysis can be found in the next notebook called `exit_surveys_analysis.ipynb`.