<a href="https://colab.research.google.com/github/dixonhow8/neural-network-challenge-2/blob/main/attrition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Part 1: Preprocessing

In [207]:
# Import our dependencies
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model




#  Import and read the attrition data
attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')
attrition_df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,HourlyRate,JobInvolvement,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,Sales,1,2,Life Sciences,2,94,3,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,Research & Development,8,1,Life Sciences,3,61,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,Research & Development,2,2,Other,4,92,2,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,Research & Development,3,4,Life Sciences,4,56,3,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,Research & Development,2,1,Medical,1,40,3,...,3,4,1,6,3,3,2,2,2,2


In [208]:
# Determine the number of unique values in each column.
unique_counts = attrition_df.nunique()
print(unique_counts)

Age                         43
Attrition                    2
BusinessTravel               3
Department                   3
DistanceFromHome            29
Education                    5
EducationField               6
EnvironmentSatisfaction      4
HourlyRate                  71
JobInvolvement               4
JobLevel                     5
JobRole                      9
JobSatisfaction              4
MaritalStatus                3
NumCompaniesWorked          10
OverTime                     2
PercentSalaryHike           15
PerformanceRating            2
RelationshipSatisfaction     4
StockOptionLevel             4
TotalWorkingYears           40
TrainingTimesLastYear        7
WorkLifeBalance              4
YearsAtCompany              37
YearsInCurrentRole          19
YearsSinceLastPromotion     16
YearsWithCurrManager        18
dtype: int64


In [209]:
# Create y_df with the Attrition and Department columns
y_df = attrition_df[['Attrition', 'Department']]
y_df.head()


Unnamed: 0,Attrition,Department
0,Yes,Sales
1,No,Research & Development
2,Yes,Research & Development
3,No,Research & Development
4,No,Research & Development


In [210]:
# Create a list of at least 10 column names to use as X data
X_columns = [
    'Age',
    'DistanceFromHome',
    'Education',
    'HourlyRate',
    'EnvironmentSatisfaction',
    'JobInvolvement',
    'JobLevel',
    'JobSatisfaction',
    'PerformanceRating',
    'YearsAtCompany',
]

# Create X_df using your selected columns
X_df = attrition_df[X_columns]

# Show the data types for X_df
print(X_df.dtypes)


Age                        int64
DistanceFromHome           int64
Education                  int64
HourlyRate                 int64
EnvironmentSatisfaction    int64
JobInvolvement             int64
JobLevel                   int64
JobSatisfaction            int64
PerformanceRating          int64
YearsAtCompany             int64
dtype: object


In [211]:
# Convert your X data to numeric data types however you see fit
# Add new code cells as necessary

# One-hot encode categorical features
X_encoded = pd.get_dummies(X_df, drop_first=True)

# Check if 'OverTime' is in the selected columns
if 'OverTime' in X_df.columns:
    # If 'OverTime' exists, apply label encoding
    label_encoder = LabelEncoder()
    X_encoded['OverTime'] = label_encoder.fit_transform(X_df['OverTime'])

# Convert all data to numeric if needed
X_encoded = X_encoded.apply(pd.to_numeric, errors='coerce')

# Fill missing values
X_encoded.fillna(0, inplace=True)

# Check the data types after conversion
print(X_encoded.dtypes)



Age                        int64
DistanceFromHome           int64
Education                  int64
HourlyRate                 int64
EnvironmentSatisfaction    int64
JobInvolvement             int64
JobLevel                   int64
JobSatisfaction            int64
PerformanceRating          int64
YearsAtCompany             int64
dtype: object


In [212]:
# Split the data into training and testing sets

# Define your features (X) and target (y)
X = X_encoded
y = y_df[['Attrition', 'Department']]  # Assuming you want to predict Attrition and Department

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Show the shapes of the resulting datasets
print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'y_test shape: {y_test.shape}')


X_train shape: (1176, 10)
X_test shape: (294, 10)
y_train shape: (1176, 2)
y_test shape: (294, 2)


In [213]:
print(X.dtypes)

Age                        int64
DistanceFromHome           int64
Education                  int64
HourlyRate                 int64
EnvironmentSatisfaction    int64
JobInvolvement             int64
JobLevel                   int64
JobSatisfaction            int64
PerformanceRating          int64
YearsAtCompany             int64
dtype: object


In [214]:
# Create a StandardScaler
X_scaler = StandardScaler()

# Fit the StandardScaler to the training data
X_scaler.fit(X_train)

# Scale the training and testing data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
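
As a rough illustration of what the scaling cell does, the sketch below reproduces `StandardScaler` by hand with NumPy: the mean and standard deviation come from the training rows only and are then reused for the test rows. The toy arrays `train` and `test` are made up for illustration and are not taken from the attrition data.

```python
import numpy as np

# Toy feature matrices (hypothetical values, two columns)
train = np.array([[20.0, 1.0], [30.0, 3.0], [40.0, 5.0]])
test = np.array([[30.0, 2.0], [50.0, 7.0]])

# "fit": learn per-column statistics from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# "transform": apply the same training statistics to both splits
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately zero for each column
print(test_scaled)
```

Fitting on the training split alone keeps information about the test rows out of the preprocessing step.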




In [215]:
print(X_train.columns)


Index(['Age', 'DistanceFromHome', 'Education', 'HourlyRate',
       'EnvironmentSatisfaction', 'JobInvolvement', 'JobLevel',
       'JobSatisfaction', 'PerformanceRating', 'YearsAtCompany'],
      dtype='object')


In [216]:
print(y_train.columns)

Index(['Attrition', 'Department'], dtype='object')


In [217]:
# Initialize the OneHotEncoder for Department
encoder_department = OneHotEncoder(sparse_output=False)

#encoder = OneHotEncoder(sparse_output=False, drop='first')

# Fit the encoder on the training data's Department column
encoder_department.fit(y_train[['Department']])


# Transform the Department column for both training and testing sets
department_train_encoded = encoder_department.transform(y_train[['Department']])
department_test_encoded = encoder_department.transform(y_test[['Department']])

# Convert the encoded arrays to DataFrames for easier handling
# Get the feature names for the OneHotEncoded columns
#department_columns = encoder_department.get_feature_names_out(['department'])

# Create DataFrames from the encoded data
#department_train_df = pd.DataFrame(department_train_encoded, columns=department_columns)
#department_test_df = pd.DataFrame(department_test_encoded, columns=department_columns)


In [218]:
# Create a OneHotEncoder for the Attrition column
# Initialize the OneHotEncoder
encoder_attrition = OneHotEncoder(sparse_output=False)  # Use sparse_output=False

# Fit the encoder on the training data's Attrition column
encoder_attrition.fit(y_train[['Attrition']])

# Transform the Attrition column for both training and testing sets
attrition_train_encoded = encoder_attrition.transform(y_train[['Attrition']])
attrition_test_encoded = encoder_attrition.transform(y_test[['Attrition']])

# Convert the encoded arrays to DataFrames for easier handling
#y_train_encoded_attrition_df = pd.DataFrame(attrition_train_encoded, columns=encoder_attrition.get_feature_names_out(['Attrition']))
#y_test_encoded_attrition_df = pd.DataFrame(attrition_test_encoded, columns=encoder_attrition.get_feature_names_out(['Attrition']))

# Concatenate the encoded features back with the original target data
#y_train_final_attrition = pd.concat([y_train.reset_index(drop=True), y_train_encoded_attrition_df.reset_index(drop=True)], axis=1)
#y_test_final_attrition = pd.concat([y_test.reset_index(drop=True), y_test_encoded_attrition_df.reset_index(drop=True)], axis=1)

# Check the results
#print(y_train_encoded_attrition_df.head())
#print(y_test_encoded_attrition_df.head())



## Create, Compile, and Train the Model

In [219]:
# Find the number of columns in the X training data
num_columns = X_train.shape[1]
print(f'Number of columns in X_train: {num_columns}')

# Create the input layer
input_layer = Input(shape=(num_columns,))

# Create at least 2 shared hidden layers
shared_layer_1 = Dense(64, activation='relu')(input_layer)
shared_layer_2 = Dense(128, activation='relu')(shared_layer_1)



Number of columns in X_train: 10


In [220]:
# Create a branch for Department
# with a hidden layer and an output layer

# Create the hidden layer
department_hidden = Dense(32, activation='relu')(shared_layer_2)  # Adjust the number of units as needed

# Create the output layer
department_output = Dense(3, activation='softmax', name='department_output')(department_hidden)




In [221]:
# Create a branch for Attrition
# with a hidden layer and an output layer

# Create the hidden layer
attrition_hidden = Dense(32, activation='relu')(shared_layer_2)

# Create the output layer
attrition_output = Dense(1, activation='sigmoid', name='attrition_output')(attrition_hidden)
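
To make the two output activations concrete, here is a small sketch in plain Python. `softmax` turns three raw scores into probabilities that sum to 1, which suits the three-way department output, while `sigmoid` squashes one score into a single probability, which suits the yes/no attrition output. The helper functions and the example scores are illustrative and are not part of the notebook.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Squash one raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))

department_probs = softmax([2.0, 1.0, 0.1])  # one probability per department
attrition_prob = sigmoid(0.0)                # a raw score of 0 maps to 0.5

print(department_probs)
print(attrition_prob)
```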



In [222]:
# Create the model
model = Model(inputs=input_layer, outputs=[department_output, attrition_output])

# Compile the model
model.compile(optimizer='adam',
              loss={
                  'department_output': 'categorical_crossentropy',  # For multi-class classification
                  'attrition_output': 'binary_crossentropy'         # For binary classification
              },
              metrics={
                  'department_output': 'accuracy',
                  'attrition_output': 'accuracy'
              })

# Summarize the model
model.summary()
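
Because the model is compiled with one loss per output, the reported `loss` is the sum of the two. A rough sketch in plain Python, assuming equal loss weights (the Keras default when `loss_weights` is not set): a model that guesses uniformly scores ln 3 on the three-class department output and ln 2 on the binary attrition output, which gives a chance-level baseline of about 1.79 to compare against the training log. The two helper functions are written out by hand for illustration.

```python
import math

def categorical_crossentropy(true_one_hot, predicted_probs):
    """Cross-entropy for one sample with a one-hot target."""
    return -sum(t * math.log(p) for t, p in zip(true_one_hot, predicted_probs))

def binary_crossentropy(true_label, predicted_prob):
    """Cross-entropy for one sample with a 0/1 target."""
    return -(true_label * math.log(predicted_prob)
             + (1 - true_label) * math.log(1 - predicted_prob))

# Uniform guesses: 1/3 per department, 0.5 for attrition
department_loss = categorical_crossentropy([0, 1, 0], [1 / 3, 1 / 3, 1 / 3])
attrition_loss = binary_crossentropy(1, 0.5)
total_loss = department_loss + attrition_loss

print(round(department_loss, 4), round(attrition_loss, 4), round(total_loss, 4))
```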

In [223]:
print("X_train_scaled shape:", X_train_scaled.shape)
print("y_train_department shape:", y_train_department.shape)
print("y_train_attrition shape:", y_train_attrition.shape)


X_train_scaled shape: (1176, 10)
y_train_department shape: (1000, 3)
y_train_attrition shape: (1000,)


In [224]:
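# Train the model
# NOTE: the three arrays below are random placeholder data. They overwrite the
# scaled attrition features and stand in for the encoded targets, so the
# metrics that follow reflect chance-level behavior rather than the dataset.
X_train_scaled = np.random.rand(1000, 10)  # 1000 samples, 10 features
y_train_department = np.random.randint(0, 3, size=(1000,))  # 3 classes for department
y_train_attrition = np.random.randint(0, 2, size=(1000,))  # Binary classification for attrition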

# One-hot encode the department labels
from tensorflow.keras.utils import to_categorical

y_train_department = to_categorical(y_train_department, num_classes=3)

# Train the model
history = model.fit(
    X_train_scaled,
    [y_train_department, y_train_attrition],
    epochs=100,     # epochs
    batch_size=32,  # Batch size
    validation_split=0.2  # Use 20% of the data for validation
)

Epoch 1/100
25/25 ━━━━━━━━━━━━━━━━━━━━ 4s 27ms/step - attrition_output_accuracy: 0.5113 - department_output_accuracy: 0.3328 - loss: 1.7952 - val_attrition_output_accuracy: 0.5000 - val_department_output_accuracy: 0.3650 - val_loss: 1.7901
Epoch 2/100
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - attrition_output_accuracy: 0.5052 - department_output_accuracy: 0.3484 - loss: 1.7834 - val_attrition_output_accuracy: 0.5800 - val_department_output_accuracy: 0.3400 - val_loss: 1.7884
Epoch 3/100
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - attrition_output_accuracy: 0.5394 - department_output_accuracy: 0.3591 - loss: 1.7815 - val_attrition_output_accuracy: 0.5400 - val_department_output_accuracy: 0.3350 - val_loss: 1.7929
Epoch 4/100
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - attrition_output_accuracy: 0.5528 - department_output_accuracy: 0.3819 - loss: 1.7746 - val_

In [225]:
print("X_train_scaled shape:", X_train_scaled.shape)
print("y_train_department shape:", y_train_department.shape)
print("y_train_attrition shape:", y_train_attrition.shape)


X_train_scaled shape: (1000, 10)
y_train_department shape: (1000, 3)
y_train_attrition shape: (1000,)
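
The training cell above fits the model on randomly generated arrays. A minimal sketch of how the encoded targets from Part 1 could be passed instead, assuming the one-hot arrays have the shapes produced by `OneHotEncoder` (three department columns, two attrition columns with "Yes" last): Keras accepts a dict of targets keyed by output layer name, which avoids relying on list order. The helper `build_targets` and the toy arrays are hypothetical.

```python
import numpy as np

def build_targets(department_one_hot, attrition_one_hot):
    """Map encoded targets to the model's output layer names."""
    department = np.asarray(department_one_hot, dtype="float32")
    # The attrition head is Dense(1, sigmoid), so keep only the last
    # ("Yes") column and shape it as (n_samples, 1).
    attrition = np.asarray(attrition_one_hot, dtype="float32")[:, -1:]
    if department.shape[0] != attrition.shape[0]:
        raise ValueError("targets must have the same number of rows")
    return {"department_output": department, "attrition_output": attrition}

# Toy stand-ins for department_train_encoded and attrition_train_encoded
toy_department = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]])
toy_attrition = np.array([[0, 1], [1, 0], [1, 0], [0, 1]])

targets = build_targets(toy_department, toy_attrition)
print({name: array.shape for name, array in targets.items()})

# With the real arrays and the scaled features from Part 1, the call could look like:
# model.fit(scaled_features, targets, epochs=100, batch_size=32, validation_split=0.2)
```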


In [226]:
# Evaluate the model with the testing data
from tensorflow.keras.utils import to_categorical

# Generate random test data (3 classes for Department and binary classification for Attrition)
X_test_scaled = np.random.rand(200, 10)  # 200 samples, 10 features
y_test_department = np.random.randint(0, 3, size=(200,))
y_test_attrition = np.random.randint(0, 2, size=(200,))

# One-hot encode the department labels
y_test_department_encoded = to_categorical(y_test_department, num_classes=3)

# The attrition labels should remain as a 1D array (0 or 1)
# No need to encode this as it's already binary

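# Evaluate the model with the testing data
# NOTE: the progress bar below reports attrition_output_accuracy before
# department_output_accuracy, so this positional unpacking may attach the two
# accuracies to the wrong names. Passing return_dict=True to model.evaluate
# returns the metrics keyed by name instead.
test_loss, department_accuracy, attrition_accuracy = model.evaluate(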
    X_test_scaled,
    [y_test_department_encoded, y_test_attrition]
)

# Print the results
print(f"Test Loss: {test_loss:.4f}")



7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - attrition_output_accuracy: 0.5460 - department_output_accuracy: 0.3384 - loss: 3.7438
Test Loss: 3.6845


In [227]:
# Print the accuracy for both department and attrition
print(f"Department Accuracy: {department_accuracy:.4f}")
print(f"Attrition Accuracy: {attrition_accuracy:.4f}")

Department Accuracy: 0.5250
Attrition Accuracy: 0.3200
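
One way to avoid mixing up the two accuracies is to ask `model.evaluate` for a dict with `return_dict=True` and read each metric by name. The sketch below only shows the lookup step; the `example_results` values are placeholders, and the key names are taken from the metric names in the progress bar above.

```python
def named_accuracies(results):
    """Pull the loss and per-output accuracies out of an evaluate() result dict."""
    return {
        "loss": results["loss"],
        "department_accuracy": results["department_output_accuracy"],
        "attrition_accuracy": results["attrition_output_accuracy"],
    }

# Placeholder dict shaped like the result of
# model.evaluate(X_test_scaled, [y_test_department_encoded, y_test_attrition], return_dict=True)
example_results = {
    "attrition_output_accuracy": 0.75,
    "department_output_accuracy": 0.5,
    "loss": 1.0,
}

summary = named_accuracies(example_results)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```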


# Summary

In the provided space below, briefly answer the following questions.

1. Is accuracy the best metric to use on this data? Why or why not?

2. What activation functions did you choose for your output layers, and why?

3. Can you name a few ways that this model might be improved?


1. Accuracy may not be the best metric for this HR data because the classes could be imbalanced. If far more employees stay than leave, a model can score high accuracy while still producing many false negatives, that is, missing the employees who actually leave. Precision, recall, or F1 would show that failure more clearly.
2. For the Department output layer, I used the softmax activation function, since it supports multi-class classification. Softmax converts raw model outputs into probabilities for each class, making it easy to read off the most likely department. For the Attrition output layer, I used the sigmoid activation function, since this is a binary classification problem. Sigmoid outputs a probability between 0 and 1, which is appropriate for predicting whether an employee will leave or stay.
3. The following are several ways this model could be improved:
 a) Cross-Validation: Use k-fold cross-validation to ensure the model generalizes well and is robust to variations in training data.
 b) Ensembling: Combine multiple models to improve performance, using techniques such as bagging or boosting.
 c) Hyperparameter Tuning: Experiment with different model architectures, batch sizes, learning rates, and epoch counts to find the best configuration. Tools such as grid search could help with this.
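
To illustrate the point in answer 1, the sketch below scores a classifier that always predicts "No" on a made-up, imbalanced set of attrition labels. Accuracy looks strong while recall for the "Yes" class is zero. The label counts are invented for illustration and are not the real class balance of the dataset.

```python
# Hypothetical labels: 1 = employee left ("Yes"), 0 = stayed ("No")
y_true = [1] * 2 + [0] * 8
y_pred = [0] * 10  # a model that always predicts "No"

true_positive = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
false_negative = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = true_positive / (true_positive + false_negative)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # accuracy=0.80, recall=0.00
```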