## Part 1: Data Retrieval

### 1.1. Retrieve the data

In [131]:
# Import our dependencies
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras import layers

#  Import and read the attrition data
attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')
attrition_df.head()

|   | Age | Attrition | BusinessTravel | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | HourlyRate | JobInvolvement | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | Sales | 1 | 2 | Life Sciences | 2 | 94 | 3 | ... | 3 | 1 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | Research & Development | 8 | 1 | Life Sciences | 3 | 61 | 2 | ... | 4 | 4 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | Research & Development | 2 | 2 | Other | 4 | 92 | 2 | ... | 3 | 2 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | Research & Development | 3 | 4 | Life Sciences | 4 | 56 | 3 | ... | 3 | 3 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | Research & Development | 2 | 1 | Medical | 1 | 40 | 3 | ... | 3 | 4 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


## Part 2: Preprocessing

### 2.1. Perform some Exploratory Data Analysis 

In [132]:
# Determine the number of unique values in each column.
attrition_df.nunique()

Age                         43
Attrition                    2
BusinessTravel               3
Department                   3
DistanceFromHome            29
Education                    5
EducationField               6
EnvironmentSatisfaction      4
HourlyRate                  71
JobInvolvement               4
JobLevel                     5
JobRole                      9
JobSatisfaction              4
MaritalStatus                3
NumCompaniesWorked          10
OverTime                     2
PercentSalaryHike           15
PerformanceRating            2
RelationshipSatisfaction     4
StockOptionLevel             4
TotalWorkingYears           40
TrainingTimesLastYear        7
WorkLifeBalance              4
YearsAtCompany              37
YearsInCurrentRole          19
YearsSinceLastPromotion     16
YearsWithCurrManager        18
dtype: int64

In [133]:
# Determine which features are numerical. Also, determine whether there are any null values 
attrition_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 27 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   Department                1470 non-null   object
 4   DistanceFromHome          1470 non-null   int64 
 5   Education                 1470 non-null   int64 
 6   EducationField            1470 non-null   object
 7   EnvironmentSatisfaction   1470 non-null   int64 
 8   HourlyRate                1470 non-null   int64 
 9   JobInvolvement            1470 non-null   int64 
 10  JobLevel                  1470 non-null   int64 
 11  JobRole                   1470 non-null   object
 12  JobSatisfaction           1470 non-null   int64 
 13  MaritalStatus             1470 non-null   object
 14  NumCompaniesWorked      

None of the columns contain null values.
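A direct count of missing values per column can be easier to read than the `info()` output. The sketch below shows the pattern on a small made-up DataFrame (`toy_df` is hypothetical and deliberately contains one missing value); the same calls should work on `attrition_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names only mimic the attrition dataset
toy_df = pd.DataFrame({
    "Age": [41, 49, np.nan],
    "OverTime": ["Yes", "No", "Yes"],
})

# Count missing values per column
null_counts = toy_df.isnull().sum()
print(null_counts)

# True if any value in the whole DataFrame is missing
has_nulls = bool(toy_df.isnull().values.any())
print(has_nulls)
```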

### 2.2. Create X dataset
**Note:** We will use all columns except `Attrition` and `Department`, which are the target columns.

In [134]:
# Create X_df
X_df = attrition_df.drop(columns=['Attrition', 'Department'])

# Show the data types for X_df
display(X_df.dtypes)

Age                          int64
BusinessTravel              object
DistanceFromHome             int64
Education                    int64
EducationField              object
EnvironmentSatisfaction      int64
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
NumCompaniesWorked           int64
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StockOptionLevel             int64
TotalWorkingYears            int64
TrainingTimesLastYear        int64
WorkLifeBalance              int64
YearsAtCompany               int64
YearsInCurrentRole           int64
YearsSinceLastPromotion      int64
YearsWithCurrManager         int64
dtype: object

#### Preprocess the non-numeric features in the X dataset

##### Determine value counts for each non-numeric feature

In [135]:
# Collect all non-numeric features
non_num_df = X_df.select_dtypes(include=object)

# Determine value counts for each non-numeric feature to determine how to encode them
for column in non_num_df.columns:
    display(non_num_df[column].value_counts())
    print(len(non_num_df[column].value_counts()))

BusinessTravel
Travel_Rarely        1043
Travel_Frequently     277
Non-Travel            150
Name: count, dtype: int64

3


EducationField
Life Sciences       606
Medical             464
Marketing           159
Technical Degree    132
Other                82
Human Resources      27
Name: count, dtype: int64

6


JobRole
Sales Executive              326
Research Scientist           292
Laboratory Technician        259
Manufacturing Director       145
Healthcare Representative    131
Manager                      102
Sales Representative          83
Research Director             80
Human Resources               52
Name: count, dtype: int64

9


MaritalStatus
Married     673
Single      470
Divorced    327
Name: count, dtype: int64

3


OverTime
No     1054
Yes     416
Name: count, dtype: int64

2


None of the non-numeric features have an obvious order to them. Therefore, we can use OneHotEncoder for all of them.

##### Define a function for the encoding
**Note:** We will define the function in such a way that we can also use it to encode the columns of the y dataset.

In [136]:
def encode_feature(feature):
    '''Takes a feature as input and uses OneHotEncoder to encode that feature.
       Returns a DataFrame with encoded columns. If the value count of the feature is 2 (binary) drops
       the first encoded column as it is redundant
       
       Input: Pandas DataFrame with one feature to be encoded
       
       Output: Pandas DataFrame with encoded columns.
    '''

    # Instantiate an instance of OneHotEncoder depending on whether the
    # value count of the feature is 2 or not 
    if len(feature.value_counts()) == 2:
        feature_encoder = OneHotEncoder(sparse_output=False, drop='first')
    else:
        feature_encoder = OneHotEncoder(sparse_output=False)

    # Encode the feature
    feature_encoded = feature_encoder.fit_transform(feature)

    # Retrieve the column names of the encoded feature
    feature_columns = feature_encoder.get_feature_names_out([feature.columns[0]])

    # Return the encoded DataFrame
    return pd.DataFrame(feature_encoded, columns=feature_columns)

##### Encode all non-numeric features in the X dataset

In [137]:
# Copy the X dataset
X_encoded_df = X_df.copy()

# Loop through all non-numeric features
for column in non_num_df.columns:
    # Retrieve the feature
    feature = non_num_df[[column]]

    # Encode the feature and concatenate the X dataset with the encoded columns
    X_encoded_df = pd.concat([X_encoded_df, encode_feature(feature)], axis=1)

    # Drop the unencoded column
    X_encoded_df.drop(columns=column, inplace=True)

# Show the first few rows of the encoded X dataset 
display(X_encoded_df.head())

# Show the data types for X_encoded_df
display(X_encoded_df.dtypes)

|   | Age | DistanceFromHome | Education | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | JobSatisfaction | NumCompaniesWorked | PercentSalaryHike | ... | JobRole_Manager | JobRole_Manufacturing Director | JobRole_Research Director | JobRole_Research Scientist | JobRole_Sales Executive | JobRole_Sales Representative | MaritalStatus_Divorced | MaritalStatus_Married | MaritalStatus_Single | OverTime_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 1 | 2 | 2 | 94 | 3 | 2 | 4 | 8 | 11 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 1 | 49 | 8 | 1 | 3 | 61 | 2 | 2 | 2 | 1 | 23 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 37 | 2 | 2 | 4 | 92 | 2 | 1 | 3 | 6 | 15 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 3 | 33 | 3 | 4 | 4 | 56 | 3 | 1 | 3 | 1 | 11 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 4 | 27 | 2 | 1 | 1 | 40 | 3 | 1 | 2 | 9 | 12 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |


Age                                    int64
DistanceFromHome                       int64
Education                              int64
EnvironmentSatisfaction                int64
HourlyRate                             int64
JobInvolvement                         int64
JobLevel                               int64
JobSatisfaction                        int64
NumCompaniesWorked                     int64
PercentSalaryHike                      int64
PerformanceRating                      int64
RelationshipSatisfaction               int64
StockOptionLevel                       int64
TotalWorkingYears                      int64
TrainingTimesLastYear                  int64
WorkLifeBalance                        int64
YearsAtCompany                         int64
YearsInCurrentRole                     int64
YearsSinceLastPromotion                int64
YearsWithCurrManager                   int64
BusinessTravel_Non-Travel            float64
BusinessTravel_Travel_Frequently     float64
BusinessTr

### 2.3. Create y dataset

In [138]:
# Create y_df with the Attrition and Department columns
y_df = attrition_df[['Attrition', 'Department']]

display(y_df.head())

|   | Attrition | Department |
|---|---|---|
| 0 | Yes | Sales |
| 1 | No | Research & Development |
| 2 | Yes | Research & Development |
| 3 | No | Research & Development |
| 4 | No | Research & Development |


#### Check the elements in each class of the y dataset

In [139]:
display(y_df['Attrition'].value_counts())
display(y_df['Department'].value_counts())

Attrition
No     1233
Yes     237
Name: count, dtype: int64

Department
Research & Development    961
Sales                     446
Human Resources            63
Name: count, dtype: int64

The classes in both targets are heavily imbalanced. This can bias the model toward the majority class and make accuracy look better than the model really is.
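To put numbers on the imbalance, the sketch below takes the `Attrition` counts printed above and computes each class's share, plus the common "balanced" weighting heuristic `n_samples / (n_classes * class_count)`. The variable names are illustrative, and how (or whether) to feed such weights into the two-output model is left open here.

```python
# Class counts taken from the value_counts() output above
attrition_counts = {"No": 1233, "Yes": 237}

n_samples = sum(attrition_counts.values())
n_classes = len(attrition_counts)

# Share of each class in the data
proportions = {
    label: count / n_samples for label, count in attrition_counts.items()
}

# "Balanced" weighting heuristic: rarer classes get larger weights
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in attrition_counts.items()
}

for label in attrition_counts:
    print(f"{label}: share={proportions[label]:.3f}, "
          f"weight={class_weights[label]:.3f}")
```

With these counts, a model that always predicts "No" would already be right for roughly 84% of employees, which is the baseline any attrition accuracy should be compared against.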

#### Encode the y dataset

In [140]:
# Copy the y dataset
y_encoded_df = y_df.copy()

# Loop through all target columns
for column in y_df.columns:
    # Retrieve the target column
    feature = y_df[[column]]

    # Encode the target and concatenate the y dataset with the encoded columns
    y_encoded_df = pd.concat([y_encoded_df, encode_feature(feature)], axis=1)

    # Drop the unencoded column
    y_encoded_df.drop(columns=column, inplace=True)

# Show the first few rows of the encoded y dataset
display(y_encoded_df.head())

# Show the data types for y_encoded_df
display(y_encoded_df.dtypes)

|   | Attrition_Yes | Department_Human Resources | Department_Research & Development | Department_Sales |
|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 |


Attrition_Yes                        float64
Department_Human Resources           float64
Department_Research & Development    float64
Department_Sales                     float64
dtype: object

#### Split y dataset into two separate targets

In [141]:
# Create y dataset for 'Department' target 
y_dept = y_encoded_df[['Department_Human Resources',
                       'Department_Research & Development',
                       'Department_Sales']]
display(y_dept.head())

|   | Department_Human Resources | Department_Research & Development | Department_Sales |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 |


In [142]:
# Create y dataset for 'Attrition' target 
y_att = y_encoded_df[['Attrition_Yes']].rename(columns={'Attrition_Yes': 'Attrition'})
display(y_att.head())

|   | Attrition |
|---|---|
| 0 | 1.0 |
| 1 | 0.0 |
| 2 | 1.0 |
| 3 | 0.0 |
| 4 | 0.0 |


### 2.4. Split the X and y dataset into train and test sets

In [143]:
# Split the data into training and testing sets
# (train_test_split was already imported in Part 1)
X_train, X_test, y_dept_train, y_dept_test, y_att_train, y_att_test = train_test_split(
    X_encoded_df, y_dept, y_att, random_state=1)
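The split above is purely random. Given the class imbalance, one option (not used in this notebook) is to pass `stratify=` so that the class ratio is preserved in both splits. The sketch below shows the idea on made-up data; `X_toy` and `y_toy` are hypothetical, and stratifying on one target says nothing about the balance of the other.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 20% positive class
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=1
)

# Fraction of positives in each split
print(y_tr.mean(), y_te.mean())
```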

### 2.5. Scale the X train and test datasets

In [144]:
# Create a StandardScaler
scaler = StandardScaler()

# Fit the StandardScaler to the training data
scaler.fit(X_train)

# Scale the training and testing data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
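The scaler is fitted on the training data only, so the test data is transformed with the training mean and standard deviation. The NumPy sketch below reproduces that arithmetic by hand on made-up arrays (`train` and `test` are hypothetical stand-ins for `X_train` and `X_test`).

```python
import numpy as np

# Hypothetical toy data standing in for X_train and X_test
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same shift and scale to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 per column
print(test_scaled)
```

Computing the statistics from the training set alone avoids leaking information about the test set into preprocessing.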

## Part 3: Create, Compile, and Train the Model

In [145]:
# Find the number of columns in the X training data
input_nodes = len(X_train.columns)

# Create the input layer
input_layer = layers.Input(shape=(input_nodes,), name='input_layer')

# Create at least two shared layers
shared1 = layers.Dense(64, activation='relu')(input_layer)
shared2 = layers.Dense(128, activation='relu')(shared1)

In [146]:
# Create a branch for Department
# with a hidden layer and an output layer

# Create the hidden layer
dept_hidden = layers.Dense(32, activation='relu')(shared2)

# Create the output layer
dept_output = layers.Dense(3, activation='softmax', name='dept_out')(dept_hidden)

In [147]:
# Create a branch for Attrition
# with a hidden layer and an output layer

# Create the hidden layer
att_hidden = layers.Dense(32, activation='relu')(shared2)

# Create the output layer
att_output = layers.Dense(1, activation='sigmoid', name='att_out')(att_hidden)
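The two branches end in different activations because they answer different kinds of questions. As a minimal NumPy sketch (the raw scores are made up, and this is not how Keras implements the layers internally), softmax turns three scores into probabilities that sum to 1, while sigmoid maps a single score to a probability between 0 and 1.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def sigmoid(logit):
    # Squash one raw score into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-logit))

# Hypothetical raw scores for the three departments
dept_probs = softmax(np.array([0.5, 2.0, 1.0]))

# Hypothetical raw score for attrition
att_prob = sigmoid(0.0)

print(dept_probs, dept_probs.sum())
print(att_prob)
```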

In [148]:
# Create the model
model = Model(inputs=input_layer, outputs=[dept_output, att_output])

# Compile the model
model.compile(optimizer='adam',
              loss={'dept_out': 'categorical_crossentropy', 'att_out': 'binary_crossentropy'},
              metrics={'dept_out': 'accuracy', 'att_out': 'accuracy'})

# Summarize the model
model.summary()
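With two outputs and no `loss_weights`, the single `loss` value Keras reports is the sum of the per-output losses. The NumPy sketch below computes both cross-entropies by hand on made-up labels and predictions; it omits the clipping Keras applies to avoid `log(0)`, so treat it as an illustration of the formulas only.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred):
    # Mean over samples of -sum(true * log(pred)) across classes
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

def binary_crossentropy(y_true, y_pred):
    # Mean over samples of the two-class log loss
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

# Hypothetical one-hot department labels and predicted probabilities
dept_true = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
dept_pred = np.array([[0.1, 0.2, 0.7], [0.2, 0.6, 0.2]])

# Hypothetical attrition labels and predicted probabilities
att_true = np.array([1.0, 0.0])
att_pred = np.array([0.8, 0.3])

dept_loss = categorical_crossentropy(dept_true, dept_pred)
att_loss = binary_crossentropy(att_true, att_pred)
total_loss = dept_loss + att_loss

print(f"dept={dept_loss:.4f} att={att_loss:.4f} total={total_loss:.4f}")
```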

In [149]:
# Train the model
model.fit(X_train_scaled,
          {'dept_out': y_dept_train, 'att_out': y_att_train},
          epochs=100,
          batch_size=32,
          validation_split=0.2)


Epoch 1/100
28/28 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - att_out_accuracy: 0.8294 - dept_out_accuracy: 0.5719 - loss: 1.4086 - val_att_out_accuracy: 0.8281 - val_dept_out_accuracy: 0.8235 - val_loss: 1.0016
Epoch 2/100
28/28 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - att_out_accuracy: 0.8546 - dept_out_accuracy: 0.8854 - loss: 0.7834 - val_att_out_accuracy: 0.8281 - val_dept_out_accuracy: 0.9457 - val_loss: 0.6682
Epoch 3/100
28/28 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - att_out_accuracy: 0.8478 - dept_out_accuracy: 0.9573 - loss: 0.5278 - val_att_out_accuracy: 0.8416 - val_dept_out_accuracy: 0.9774 - val_loss: 0.5557
Epoch 4/100
28/28 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - att_out_accuracy: 0.8800 - dept_out_accuracy: 0.9808 - loss: 0.3857 - val_att_out_accuracy: 0.8462 - val_dept_out_accuracy: 0.9774 - val_loss: 0.5136
Epoch 5/100
28/28 ━━━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x15f51e10df0>

In [150]:
# Evaluate the model with the testing data
test_results = model.evaluate(X_test_scaled, {'dept_out': y_dept_test, 'att_out': y_att_test})
display(test_results)

12/12 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - att_out_accuracy: 0.7829 - dept_out_accuracy: 0.9751 - loss: 2.2341


[1.9615750312805176, 0.8179348111152649, 0.967391312122345]

In [151]:
# Print the accuracy for both department and attrition.
# test_results is ordered [loss, att_out accuracy, dept_out accuracy],
# matching the metric order in the progress bar above.
print(f"Department predictions accuracy: {test_results[2]:.3f}")
print(f"Attrition predictions accuracy: {test_results[1]:.3f}")

Department predictions accuracy: 0.967
Attrition predictions accuracy: 0.818


# Summary

In the provided space below, briefly answer the following questions.

1. Is accuracy the best metric to use on this data? Why or why not?

2. What activation functions did you choose for your output layers, and why?

3. Can you name a few ways that this model might be improved?


1. No. Both targets are heavily imbalanced: 'Attrition' is about 84% 'No' (1,233 of 1,470) and 'Department' is about 65% 'Research & Development' (961 of 1,470). A model that always predicts the majority class would already score well on accuracy, so a metric such as balanced accuracy would be more informative.
2. 'Department' has three mutually exclusive classes, so its output layer needs an activation function that handles multi-class classification. We therefore chose `softmax`, which produces a probability for each department.  
'Attrition' has only two classes, so a single output unit with the `sigmoid` activation function is appropriate.
3. The model appears to be overfitting, especially when predicting 'Attrition'. This is evident from the large gap between the train accuracy (1.0) and the test accuracy (0.82), and from the large test loss.  
To reduce the overfitting, we could start by checking feature significance (p-values) and running a PCA to see whether the number of input features can be reduced. We could also tune the hyperparameters of the model itself.
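To make the point about balanced accuracy concrete, the sketch below compares plain and balanced accuracy for a classifier that always predicts the majority class. The labels are made up, and `balanced_accuracy` is a hand-written helper for illustration; scikit-learn offers `sklearn.metrics.balanced_accuracy_score` for the same purpose.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    # Average of the per-class recall values
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

# Hypothetical labels: 8 negatives, 2 positives
y_true = np.array([0] * 8 + [1] * 2)

# A "model" that always predicts the majority class
y_pred = np.zeros(10, dtype=int)

plain = float(np.mean(y_true == y_pred))
balanced = balanced_accuracy(y_true, y_pred)

print(f"accuracy={plain:.2f} balanced_accuracy={balanced:.2f}")
```

Plain accuracy rewards the majority-class guess, while balanced accuracy falls to chance level because the minority class is never recovered.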