## Part 1: Preprocessing

In [40]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras import layers

#  Import and read the attrition data
attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')
attrition_df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,HourlyRate,JobInvolvement,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,Sales,1,2,Life Sciences,2,94,3,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,Research & Development,8,1,Life Sciences,3,61,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,Research & Development,2,2,Other,4,92,2,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,Research & Development,3,4,Life Sciences,4,56,3,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,Research & Development,2,1,Medical,1,40,3,...,3,4,1,6,3,3,2,2,2,2


In [41]:
# Determine the number of unique values in each column
attrition_df.nunique()

Unnamed: 0,0
Age,43
Attrition,2
BusinessTravel,3
Department,3
DistanceFromHome,29
Education,5
EducationField,6
EnvironmentSatisfaction,4
HourlyRate,71
JobInvolvement,4


In [42]:
# Create y_df with the Attrition and Department columns
y_df = attrition_df[['Attrition', 'Department']]
y_df.head()

Unnamed: 0,Attrition,Department
0,Yes,Sales
1,No,Research & Development
2,Yes,Research & Development
3,No,Research & Development
4,No,Research & Development


In [43]:
# Create a list of at least 10 column names to use as X data
columns = ['Age',  'BusinessTravel', 'Education', 'EducationField', 'HourlyRate', 'OverTime',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction','PercentSalaryHike',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear']

# Create X_df using your selected columns
X_df = attrition_df[columns]

# Preview the first rows of X_df
X_df.head()

Unnamed: 0,Age,BusinessTravel,Education,EducationField,HourlyRate,OverTime,JobInvolvement,JobLevel,JobRole,JobSatisfaction,PercentSalaryHike,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear
0,41,Travel_Rarely,2,Life Sciences,94,Yes,3,2,Sales Executive,4,11,0,8,0
1,49,Travel_Frequently,1,Life Sciences,61,No,2,2,Research Scientist,2,23,1,10,3
2,37,Travel_Rarely,2,Other,92,Yes,2,1,Laboratory Technician,3,15,0,7,3
3,33,Travel_Frequently,4,Life Sciences,56,Yes,3,1,Research Scientist,3,11,0,8,3
4,27,Travel_Rarely,1,Medical,40,No,3,1,Laboratory Technician,2,12,1,6,3


In [44]:
# Convert the X data to numeric data types
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Work on an explicit copy so the column assignment below does not raise SettingWithCopyWarning
X_df = X_df.copy()

# Use LabelEncoder to change the binary "OverTime" column to numeric data
label_encoder = LabelEncoder()
X_df['OverTime_encoded'] = label_encoder.fit_transform(X_df['OverTime'])
X_df = X_df.drop(columns=['OverTime'])

# Use OneHotEncoder to change the remaining text columns ('BusinessTravel', 'EducationField', 'JobRole') to numeric data
categorical_columns = X_df.select_dtypes(include=['object']).columns.tolist()
one_hot_encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = one_hot_encoder.fit_transform(X_df[categorical_columns])
one_hot_df = pd.DataFrame(one_hot_encoded, columns=one_hot_encoder.get_feature_names_out(categorical_columns), index=X_df.index)

# Append the one-hot columns and drop the original text columns
X = pd.concat([X_df, one_hot_df], axis=1).drop(columns=categorical_columns)

X.head()


Unnamed: 0,Age,Education,HourlyRate,JobInvolvement,JobLevel,JobSatisfaction,PercentSalaryHike,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,...,EducationField_Technical Degree,JobRole_Healthcare Representative,JobRole_Human Resources,JobRole_Laboratory Technician,JobRole_Manager,JobRole_Manufacturing Director,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative
0,41,2,94,3,2,4,11,0,8,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,49,1,61,2,2,2,23,1,10,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,37,2,92,2,1,3,15,0,7,3,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,33,4,56,3,1,3,11,0,8,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,27,1,40,3,1,2,12,1,6,3,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [45]:
# Work on an explicit copy so the column assignment below does not raise SettingWithCopyWarning
y_df = y_df.copy()

# Create a LabelEncoder for the Attrition column
label_encoder = LabelEncoder()

# Fit the encoder and add the encoded column (No -> 0, Yes -> 1)
y_df['Attrition_encoded'] = label_encoder.fit_transform(y_df['Attrition'])

# Drop the original text column
y_df = y_df.drop(columns=['Attrition'])


In [46]:
from sklearn.preprocessing import OneHotEncoder

# Create a OneHotEncoder for the Department column
oneHot_encoder = OneHotEncoder(sparse_output=False)

# Fit the encoder to the data
department_encoded = oneHot_encoder.fit_transform(y_df[['Department']])

department_df = pd.DataFrame(department_encoded, columns=oneHot_encoder.get_feature_names_out(['Department']))

y_df = pd.concat([y_df, department_df], axis=1).drop(['Department'], axis=1)

y_df.head()


Unnamed: 0,Attrition_encoded,Department_Human Resources,Department_Research & Development,Department_Sales
0,1,0.0,0.0,1.0
1,0,0.0,1.0,0.0
2,1,0.0,1.0,0.0
3,0,0.0,1.0,0.0
4,0,0.0,1.0,0.0


In [47]:
# Split the data into training and testing sets
y_attrition = y_df['Attrition_encoded']
y_department = y_df[['Department_Human Resources', 'Department_Research & Development', 'Department_Sales']]
(X_train, X_test,
 y_attrition_train, y_attrition_test,
 y_department_train, y_department_test) = train_test_split(X, y_attrition, y_department)


In [48]:
# Create a StandardScaler
scaler = StandardScaler()

# Fit the StandardScaler to the training data
X_scaler = scaler.fit(X_train)

# Scale the training and testing data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

## Part 2: Create, Compile, and Train the Model

In [49]:
# Find the number of columns in the X training data
num = X_train.shape[1]

# Create the input layer
input_layer = layers.Input(shape=(num,), name='input')

# Create at least two shared layers
shared_layer1 = layers.Dense(64, activation='relu', name='shared1')(input_layer)
shared_layer2 = layers.Dense(128, activation='relu', name='shared2')(shared_layer1)

In [50]:
# Create a branch for Department
# with a hidden layer and an output layer

# Create the hidden layer
department_hidden = layers.Dense(32, activation='relu', name='department_hidden')(shared_layer2)

# Create the output layer
department_output = layers.Dense(3, activation='softmax', name='department_output')(department_hidden)


In [51]:
# Create a branch for Attrition
# with a hidden layer and an output layer

# Create the hidden layer
attrition_hidden = layers.Dense(32, activation='relu', name='attrition_hidden')(shared_layer2)


# Create the output layer
attrition_output = layers.Dense(1, activation='sigmoid', name='attrition_output')(attrition_hidden)


In [52]:
# Create the model
model = Model(inputs=input_layer, outputs={'attrition_output': attrition_output, 'department_output': department_output})


# Compile the model
model.compile(optimizer='adam',
              loss={'department_output': 'categorical_crossentropy', 'attrition_output': 'binary_crossentropy'},
              metrics={'department_output': 'accuracy', 'attrition_output': 'accuracy'})



# Summarize the model
model.summary()

In [53]:
# Train the model
hist = model.fit(
    X_train_scaled,
    {'attrition_output': y_attrition_train, 'department_output': y_department_train},
    epochs=35,
    batch_size=32,
    validation_split=0.2
)

Epoch 1/35
28/28 ━━━━━━━━━━━━━━━━━━━━ 3s 28ms/step - attrition_output_accuracy: 0.8170 - attrition_output_loss: 0.5107 - department_output_accuracy: 0.5725 - department_output_loss: 0.8923 - loss: 1.4032 - val_attrition_output_accuracy: 0.8552 - val_attrition_output_loss: 0.4055 - val_department_output_accuracy: 0.8914 - val_department_output_loss: 0.3721 - val_loss: 0.7765
Epoch 2/35
28/28 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - attrition_output_accuracy: 0.8306 - attrition_output_loss: 0.4190 - department_output_accuracy: 0.9205 - department_output_loss: 0.2952 - loss: 0.7146 - val_attrition_output_accuracy: 0.8552 - val_attrition_output_loss: 0.3687 - val_department_output_accuracy: 0.9729 - val_department_output_loss: 0.1245 - val_loss: 0.4924
Epoch 3/35
28/28 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - attrition_output_accuracy: 0.8545 - attrition_output_loss: 0.3442 - department_output_accura

In [21]:
# Evaluate the model with the testing data
test_results = model.evaluate(X_test_scaled, {'attrition_output': y_attrition_test, 'department_output': y_department_test})
test_results

12/12 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - attrition_output_accuracy: 0.8546 - attrition_output_loss: 0.5934 - department_output_accuracy: 0.9809 - department_output_loss: 0.1138 - loss: 0.7003


[0.8433697819709778,
 0.08725724369287491,
 0.8012217879295349,
 0.83152174949646,
 0.9836956262588501]

In [54]:
# Print the accuracy for both department and attrition
# model.evaluate() returns the total loss first, then the per-output losses,
# then the per-output accuracies, so the accuracies are the last two entries
attrition_accuracy = test_results[3]
department_accuracy = test_results[4]

print(f"Attrition Output Accuracy: {attrition_accuracy:.4f}")
print(f"Department Output Accuracy: {department_accuracy:.4f}")


Attrition Output Accuracy: 0.8315
Department Output Accuracy: 0.9837


# Summary

In the provided space below, briefly answer the following questions.

1. Is accuracy the best metric to use on this data? Why or why not?

2. What activation functions did you choose for your output layers, and why?

3. Can you name a few ways that this model might be improved?


1. Accuracy alone is not the best metric for the attrition output if the classes are imbalanced: a model that always predicts the majority class can score high accuracy while detecting none of the employees who actually leave. For attrition, recall, F1-score, or AUC-ROC say more about how well the model finds actual attrition cases, and a confusion matrix shows where the errors fall. For the department output, accuracy is a reasonable metric if the three classes are roughly balanced; per-class precision and recall are a useful check if they are not.

2. The department_output layer uses a softmax activation because it predicts one of three mutually exclusive categories. Softmax produces a probability distribution over those categories, which is what categorical cross-entropy loss expects. The attrition_output layer uses a sigmoid activation because attrition is a binary classification problem trained with binary cross-entropy loss. Sigmoid outputs a single value between 0 and 1, which can be read as the probability that an employee leaves.


3. Regularization techniques such as L1/L2 penalties or dropout could help prevent overfitting. Batch normalization after the dense layers may stabilize training and improve convergence. Experimenting with different hidden layer sizes or activation functions such as LeakyReLU may enhance performance. Tuning the learning rate or trying a different optimizer, such as RMSprop, could also lead to better results.
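
To illustrate answer 1, the sketch below scores a baseline that always predicts the majority class. The labels are synthetic and chosen only for illustration (17 employees who stay, 3 who leave); they are not taken from the attrition dataset. The point is that accuracy looks strong while recall and F1 show that no leaver is detected.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

# Synthetic labels (an assumption for illustration): 0 = stays, 1 = leaves
y_true = [0] * 17 + [1] * 3

# A baseline "model" that always predicts the majority class
y_pred = [0] * 20

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positive class is ever predicted
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall:   {recall:.2f}")
print(f"F1-score: {f1:.2f}")
print(cm)
```

The same functions could be applied to the notebook's test labels after thresholding the sigmoid output, for example at 0.5.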
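
To illustrate answer 2, here is a minimal NumPy sketch of the two output activations. The function names and the input values are made up for this example; Keras applies the equivalent operations internally when `activation='sigmoid'` or `activation='softmax'` is set.

```python
import numpy as np


def sigmoid(z):
    """Map any real value to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    """Map a vector of scores to a probability distribution."""
    # Subtracting the maximum keeps the exponentials numerically stable
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical pre-activation values for one employee
attrition_score = np.array([0.0])
department_scores = np.array([2.0, 1.0, 0.1])

attrition_prob = sigmoid(attrition_score)        # one probability for "leaves"
department_probs = softmax(department_scores)    # three probabilities that sum to 1

print(attrition_prob)
print(department_probs, department_probs.sum())
```

A single sigmoid unit suffices for a yes/no target, while softmax needs one unit per department so that the three probabilities compete with one another.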
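
To illustrate answer 3, the sketch below shows one way the ideas could be combined in Keras. It is not the model trained above: the function name `build_regularized_model`, the L2 factor, the dropout rate, the learning rate, and the feature count of 10 are all assumptions chosen for illustration and would need tuning.

```python
import numpy as np
from tensorflow.keras import layers, regularizers
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop


def build_regularized_model(num_features):
    """Hypothetical two-output model with L2 penalties, batch norm, and dropout."""
    inputs = layers.Input(shape=(num_features,), name='input')

    # Shared layers with an L2 weight penalty, batch normalization, and dropout
    x = layers.Dense(64, activation='relu',
                     kernel_regularizer=regularizers.l2(1e-4), name='shared1')(inputs)
    x = layers.BatchNormalization(name='shared1_bn')(x)
    x = layers.Dropout(0.3, name='shared1_dropout')(x)
    x = layers.Dense(128, activation='relu',
                     kernel_regularizer=regularizers.l2(1e-4), name='shared2')(x)
    x = layers.Dropout(0.3, name='shared2_dropout')(x)

    # Output branches keep the same activations as the original model
    dept = layers.Dense(32, activation='relu', name='department_hidden')(x)
    dept_out = layers.Dense(3, activation='softmax', name='department_output')(dept)
    attr = layers.Dense(32, activation='relu', name='attrition_hidden')(x)
    attr_out = layers.Dense(1, activation='sigmoid', name='attrition_output')(attr)

    model = Model(inputs=inputs,
                  outputs={'attrition_output': attr_out, 'department_output': dept_out})
    # RMSprop with an explicit learning rate instead of the default Adam settings
    model.compile(optimizer=RMSprop(learning_rate=1e-3),
                  loss={'department_output': 'categorical_crossentropy',
                        'attrition_output': 'binary_crossentropy'},
                  metrics={'department_output': 'accuracy',
                           'attrition_output': 'accuracy'})
    return model


# Build the sketch with an assumed feature count and run a dummy forward pass
sketch_model = build_regularized_model(num_features=10)
predictions = sketch_model.predict(np.zeros((2, 10), dtype='float32'), verbose=0)
print({name: values.shape for name, values in predictions.items()})
```

In the notebook, `num_features` would be the number of columns in `X_train`, and the model would be trained with the same `fit` call as before so the validation curves can be compared.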