# Introduction

This project looks at exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia.

The data source for TAFE is found [here](https://data.gov.au/dataset/ds-qld-89970a3b-182b-41ea-aea2-6f9f17b5907e/details?q=exit%20survey).

The data source for DETE is found [here](https://data.gov.au/dataset/ds-qld-fe96ff30-d157-4a81-851d-215f2a0fe26d/details?q=exit%20survey).

The main focus is answering the following two questions:

* Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
* Are younger employees resigning due to some kind of dissatisfaction? What about older employees?

## Exploring the Data

In [1]:
import pandas as pd
import numpy as np

dete_survey = pd.read_csv('dete_survey.csv')
tafe_survey = pd.read_csv('tafe_survey.csv')

print(dete_survey.info())
print(dete_survey.head())


In [None]:
print(tafe_survey.info())
print(tafe_survey.head())

The dete_survey dataset records missing answers as the string 'Not Stated'; these should be treated as NaN values.

Both datasets also contain many columns that we don't need for our analysis.

Several columns hold the same information in both datasets but under different names.

Finally, each dataset has multiple columns indicating that an employee resigned because of dissatisfaction.

## Correcting Missing Values and Dropping Unnecessary Columns

In [90]:
dete_survey = pd.read_csv('dete_survey.csv', na_values='Not Stated')

dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis=1)

tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis=1)

We re-read the DETE data so that the 'Not Stated' values noted above are parsed as NaN. We also dropped multiple columns that would not have contributed to the analysis.
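As a minimal sketch of what `na_values` does, the snippet below reads a small made-up CSV (not taken from the real survey files) twice, once without and once with the option, and counts the missing values in each result.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a survey file (an assumption for illustration).
csv_text = "id,region\n1,Central Office\n2,Not Stated\n3,North Queensland\n"

raw = pd.read_csv(io.StringIO(csv_text))
clean = pd.read_csv(io.StringIO(csv_text), na_values="Not Stated")

# Without na_values the placeholder stays an ordinary string;
# with it, the same cell is parsed as NaN.
raw_missing = int(raw["region"].isnull().sum())
clean_missing = int(clean["region"].isnull().sum())
print(raw_missing, clean_missing)
```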

In [91]:
dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(' ','_')

tafe_survey_updated.rename(columns = {'Record ID': 'id',
                                      'CESSATION YEAR': 'cease_date',
                                      'Reason for ceasing employment': 'separationtype',
                                      'Gender. What is your Gender?': 'gender',
                                      'CurrentAge. Current Age': 'age',
                                      'Employment Type. Employment Type': 'employment_status',
                                      'Classification. Classification': 'position',
                                      'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
                                      'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'
                                     }, inplace=True)

print(dete_survey_updated.head())
print(tafe_survey_updated.head())

   id                    separationtype cease_date  dete_start_date  \
0   1             Ill Health Retirement    08/2012           1984.0   
1   2  Voluntary Early Retirement (VER)    08/2012              NaN   
2   3  Voluntary Early Retirement (VER)    05/2012           2011.0   
3   4         Resignation-Other reasons    05/2012           2005.0   
4   5                    Age Retirement    05/2012           1970.0   

   role_start_date                                      position  \
0           2004.0                                Public Servant   
1              NaN                                Public Servant   
2           2011.0                               Schools Officer   
3           2006.0                                       Teacher   
4           1989.0  Head of Curriculum/Head of Special Education   

  classification              region                      business_unit  \
0        A01-A04      Central Office  Corporate Strategy and Peformance   
1        AO5-A

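We eventually want to combine the two datasets, so their column names need to be standardized. For DETE we applied general clean-up (lowercase, stripped whitespace, underscores instead of spaces); for TAFE we renamed the relevant columns with a dictionary.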
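The two approaches can be sketched on empty toy frames. The column names below are made up for illustration, and `regex=False` is added here only to make the literal replacement explicit.

```python
import pandas as pd

# General clean-up: lowercase, strip surrounding whitespace, spaces to underscores.
toy_dete = pd.DataFrame(columns=["ID", "Cease Date ", "Role Start Date"])
toy_dete.columns = (
    toy_dete.columns.str.lower().str.strip().str.replace(" ", "_", regex=False)
)
dete_names = list(toy_dete.columns)

# Dictionary replacement: rename only the columns listed in the mapping.
toy_tafe = pd.DataFrame(columns=["Record ID", "CESSATION YEAR", "Institute"])
toy_tafe = toy_tafe.rename(columns={"Record ID": "id", "CESSATION YEAR": "cease_date"})
tafe_names = list(toy_tafe.columns)

print(dete_names)
print(tafe_names)
```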

## Filtering the Data

We are looking specifically at 'separationtype' to answer this question:

* Are employees who have only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been at the job longer?

The question concerns employees who resigned, so we should keep only the rows whose separation type is a resignation and filter out the rest.

In [92]:
print(dete_survey_updated['separationtype'].value_counts())
print(tafe_survey_updated['separationtype'].value_counts())

Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: separationtype, dtype: int64
Resignation                 340
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
Name: separationtype, dtype: int64


In [93]:
dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'].isin([
    'Resignation-Other reasons',
    'Resignation-Other employer',
    'Resignation-Move overseas/interstate'
])].copy()

tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'] == 'Resignation'].copy()
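An alternative way to select every resignation type at once is to match on the shared prefix. This is a sketch on made-up rows, not the filter used above; `na=False` treats missing separation types as non-matches.

```python
import pandas as pd

# Toy rows (an assumption for illustration), including one missing value.
toy = pd.DataFrame({
    "separationtype": [
        "Resignation-Other reasons",
        "Age Retirement",
        "Resignation-Other employer",
        None,
    ]
})

# Keep rows whose separation type starts with "Resignation".
mask = toy["separationtype"].str.startswith("Resignation", na=False)
toy_resignations = toy[mask].copy()
kept_index = list(toy_resignations.index)
print(kept_index)
```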

## Verifying the Data

We should check the cease_date column to make sure the values are plausible. For example, dates later than the time of writing (2021) would not make sense. First, though, we need to standardize the date format.

In [94]:
dete_resignations['cease_date'].value_counts()

2012       126
2013        74
01/2014     22
12/2013     17
06/2013     14
09/2013     11
11/2013      9
07/2013      9
10/2013      6
08/2013      4
05/2013      2
05/2012      2
07/2006      1
2010         1
07/2012      1
09/2010      1
Name: cease_date, dtype: int64

In [95]:
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.extract(r'([2][0-9]{3})',expand=False).astype('float')
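To see what this extraction does, here is a small sketch on made-up date strings in the two formats seen above (`YYYY` and `MM/YYYY`), plus one missing value.

```python
import pandas as pd

# Toy values (an assumption for illustration) mirroring the formats in cease_date.
dates = pd.Series(["2012", "01/2014", "07/2006", None])

# Capture the first four-digit group starting with 2, then convert to float.
years = dates.str.extract(r"([2][0-9]{3})", expand=False).astype("float")
print(years)
```

The month part is discarded because only the year is needed to compute length of service.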

In [96]:
print(dete_resignations['cease_date'].value_counts().sort_index(ascending=False))
print(dete_resignations['dete_start_date'].value_counts().sort_index(ascending=False))

print(tafe_resignations['cease_date'].value_counts().sort_index(ascending=False))

2014.0     22
2013.0    146
2012.0    129
2010.0      2
2006.0      1
Name: cease_date, dtype: int64
2013.0    10
2012.0    21
2011.0    24
2010.0    17
2009.0    13
2008.0    22
2007.0    21
2006.0    13
2005.0    15
2004.0    14
2003.0     6
2002.0     6
2001.0     3
2000.0     9
1999.0     8
1998.0     6
1997.0     5
1996.0     6
1995.0     4
1994.0     6
1993.0     5
1992.0     6
1991.0     4
1990.0     5
1989.0     4
1988.0     4
1987.0     1
1986.0     3
1985.0     3
1984.0     1
1983.0     2
1982.0     1
1980.0     5
1977.0     1
1976.0     2
1975.0     1
1974.0     2
1973.0     1
1972.0     1
1971.0     1
1963.0     1
Name: dete_start_date, dtype: int64
2013.0     55
2012.0     94
2011.0    116
2010.0     68
2009.0      2
Name: cease_date, dtype: int64


There do not seem to be any inconsistencies: no values fall before 1940 or after the current year. However, the two surveys cover different ranges: DETE cease dates run from 2006 to 2014, while TAFE cease dates run from 2009 to 2013.

## Creating a New Column to Analyze

The TAFE survey already has a length-of-service column (institute_service), but the DETE survey does not. We will create one from the start and cease years so that the two datasets can be compared easily.

In [97]:
dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
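A quick sketch on made-up years shows how the subtraction behaves, in particular that a missing start date produces a missing length of service.

```python
import pandas as pd

# Toy years (an assumption for illustration); the second start date is missing.
toy = pd.DataFrame({
    "cease_date": [2012.0, 2013.0, 2014.0],
    "dete_start_date": [2005.0, None, 2014.0],
})

# Element-wise difference; NaN in either operand propagates to the result.
toy["institute_service"] = toy["cease_date"] - toy["dete_start_date"]
service = toy["institute_service"]
print(service)
```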

## Identifying Dissatisfied Employees

We will be looking at the relevant columns to find our dissatisfied employees.

First, we will see what values exist, update if necessary, and use the relevant columns to identify dissatisfaction in any attribute.

In [98]:
print(tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts())
print(tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts())

-                                         277
Contributing Factors. Dissatisfaction      55
Name: Contributing Factors. Dissatisfaction, dtype: int64
-                      270
Job Dissatisfaction     62
Name: Contributing Factors. Job Dissatisfaction, dtype: int64


In [99]:
def update_vals(val):
    if pd.isnull(val):
        return np.nan
    elif val == '-':
        return False
    else:
        return True
    
tafe_resignations[['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']] = tafe_resignations[['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']].applymap(update_vals)

In [100]:
print(tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts())
print(tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts())

False    277
True      55
Name: Contributing Factors. Dissatisfaction, dtype: int64
False    270
True      62
Name: Contributing Factors. Job Dissatisfaction, dtype: int64


In [101]:
tafe_resignations['dissatisfied'] = tafe_resignations[
                                                        ['Contributing Factors. Dissatisfaction', 
                                                         'Contributing Factors. Job Dissatisfaction']
                                                    ].any(axis=1, skipna=False)

dete_resignations['dissatisfied'] = dete_resignations[
                                                        ['job_dissatisfaction',
                                                        'dissatisfaction_with_the_department',
                                                        'physical_work_environment',
                                                        'lack_of_recognition',
                                                        'lack_of_job_security',
                                                        'work_location',
                                                        'employment_conditions',
                                                        'work_life_balance',
                                                        'workload']
                                                     ].any(axis=1, skipna=False)
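As a simplified sketch of the row-wise check, the toy frame below (made-up values, only two factor columns, no missing data) is flagged as dissatisfied whenever at least one factor is True. The handling of missing values via `skipna` is deliberately left out of this sketch.

```python
import pandas as pd

# Toy boolean factor columns (an assumption for illustration).
toy = pd.DataFrame({
    "job_dissatisfaction": [False, True, False],
    "workload": [False, False, True],
})

# A row counts as dissatisfied if any of its factor columns is True.
toy["dissatisfied"] = toy[["job_dissatisfaction", "workload"]].any(axis=1)
flags = toy["dissatisfied"].tolist()
print(flags)
```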

In [102]:
dete_resignations_up = dete_resignations.copy()
tafe_resignations_up = tafe_resignations.copy()

## Combining the Data

We will combine the datasets so that we can analyze them by the institute_service column. First, we will stack the two datasets on top of each other and drop every column that has fewer than 500 non-null values.

Second, we will convert these institute_service values into service categories (i.e. how long they have worked at that institution).

In [103]:
dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'

combined = pd.concat([dete_resignations_up,tafe_resignations_up], axis=0, ignore_index=True).copy()

# Keep columns with at least 500 non-null values; recent pandas versions reject passing both how and thresh
combined_updated = combined.dropna(axis=1, thresh=500).copy()
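The `thresh` argument is the minimum number of non-null values a column needs in order to be kept. A small sketch with made-up data and a threshold of 3:

```python
import numpy as np
import pandas as pd

# Toy frame (an assumption for illustration): 4, 3 and 1 non-null values per column.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [30, np.nan, 41, 52],
    "sparse": [np.nan, np.nan, np.nan, "x"],
})

# Columns with fewer than 3 non-null values are dropped.
kept = toy.dropna(axis=1, thresh=3)
kept_columns = list(kept.columns)
print(kept_columns)
```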

In [104]:
combined_updated['institute_service'] = combined_updated['institute_service'].astype('str').str.extract(r"([\d]+)[^\d]",expand=False).astype('float')
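The sketch below applies the same pattern to made-up values in the kinds of formats this column can hold (floats and text ranges). Note that the pattern requires a non-digit character after the number, so it relies on floats being rendered as text like `7.0`; a bare string such as `'5'` would not match.

```python
import numpy as np
import pandas as pd

# Toy values (an assumption for illustration) mixing floats, ranges and text.
service = pd.Series([7.0, "Less than 1 year", "1-2", "11-20", "More than 20 years", np.nan])

# Convert to text, capture the first run of digits followed by a non-digit, cast to float.
service_years = (
    service.astype("str").str.extract(r"([\d]+)[^\d]", expand=False).astype("float")
)
print(service_years)
```

For ranges, this keeps the lower bound (for example, `11-20` becomes 11).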

In [106]:
def mapping(val):
    if pd.isnull(val):
        return np.nan
    elif val < 3:
        return 'New'
    elif val < 7:
        return 'Experienced'
    elif val < 11:
        return 'Established'
    else:
        return 'Veteran'
    
combined_updated['service_cat'] = combined_updated['institute_service'].apply(mapping)
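The same bins could also be expressed with `pd.cut`, shown here as an alternative sketch on made-up values rather than the approach used above. With `right=False` each bin includes its left edge, matching the `<` comparisons in `mapping`.

```python
import numpy as np
import pandas as pd

# Toy years of service (an assumption for illustration), including bin edges and a NaN.
toy_years = pd.Series([0.0, 2.0, 3.0, 6.0, 7.0, 10.0, 11.0, 25.0, np.nan])

# Bins: [-inf, 3), [3, 7), [7, 11), [11, inf)
toy_cat = pd.cut(
    toy_years,
    bins=[-np.inf, 3, 7, 11, np.inf],
    right=False,
    labels=["New", "Experienced", "Established", "Veteran"],
)
labels = toy_cat.astype(object).tolist()
print(labels)
```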

## Initial Analysis

We will now be looking into the dataset we have created.

In [107]:
combined_updated['dissatisfied'].value_counts(dropna=False)

False    403
True     240
NaN        8
Name: dissatisfied, dtype: int64

In [108]:
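# Treat the 8 missing values as not dissatisfied and store the column as a plain boolean
combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(False).astype(bool)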

In [110]:
pivot = pd.pivot_table(combined_updated, values='dissatisfied', index='service_cat')

Unnamed: 0_level_0,dissatisfied
service_cat,Unnamed: 1_level_1
Established,0.516129
Experienced,0.343023
New,0.295337
Veteran,0.485294
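The values in the table are means of a boolean column, so each one is the share of dissatisfied employees in that category. A minimal sketch on made-up rows shows this, relying on the fact that `pivot_table` aggregates with the mean by default.

```python
import pandas as pd

# Toy rows (an assumption for illustration): 1 of 4 "New" and 2 of 2 "Veteran" are True.
toy = pd.DataFrame({
    "service_cat": ["New", "New", "New", "New", "Veteran", "Veteran"],
    "dissatisfied": [True, False, False, False, True, True],
})

# True counts as 1 and False as 0, so the mean is the fraction of True values.
toy_pivot = pd.pivot_table(toy, values="dissatisfied", index="service_cat")
new_share = float(toy_pivot.loc["New", "dissatisfied"])
veteran_share = float(toy_pivot.loc["Veteran", "dissatisfied"])
print(new_share, veteran_share)
```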


In [None]:
%matplotlib inline

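# service_cat is the index of the pivot table, so it is used as the x-axis automatically
pivot.plot(kind='bar', y='dissatisfied', legend=False)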