# DAT-430 Module Project 2

## Instructions

This project uses two HR attrition data files, `HRData1.csv` and `HRData2.csv`. Data dictionaries for both files are provided in the instructions. You will merge the two files into one data set, build two predictive models, and then analyze the model outputs in a Power BI dashboard.

### Load Python Libraries

Run the following cell to load the Python libraries.

In [11]:
# import python libraries you will use in this project
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

### Step 1: Merge the two data files

Follow instructions provided for this step and write code below.

In [2]:
# 1-1 Upload first data set
hr1_df = pd.read_csv('HRData1.csv')
hr1_shape = hr1_df.shape
hr1_data_rows, hr1_data_cols = hr1_shape[0], hr1_shape[1]

# 1-2 Upload second data set
hr2_df = pd.read_csv('HRData2.csv')
hr2_shape = hr2_df.shape
hr2_data_rows, hr2_data_cols = hr2_shape[0], hr2_shape[1]

# 1-3 Merge the two data sets
hr_merged_df = pd.merge(hr1_df, hr2_df, how='inner', on='EmployeeNumber')
hr_merged_shape = hr_merged_df.shape
hr_merged_data_rows, hr_merged_data_cols = hr_merged_shape[0], hr_merged_shape[1]


print(f'HR 1 data:  rows={hr1_data_rows}, cols={hr1_data_cols}')
print(f'HR 2 data:  rows={hr2_data_rows}, cols={hr2_data_cols}')
print(f'Merged data:  rows={hr_merged_data_rows}, cols={hr_merged_data_cols}')

HR 1 data:  rows=1470, cols=10
HR 2 data:  rows=1470, cols=21
Merged data:  rows=1470, cols=30
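
The row counts above match (1470 in each file, 1470 after the inner merge), which suggests that each `EmployeeNumber` appears exactly once in both files. A minimal sketch of how that assumption could be checked explicitly, using small made-up frames rather than the project data: `validate='one_to_one'` makes `pd.merge` raise an error if a key is duplicated, and an outer merge with `indicator=True` shows which keys exist in only one file.

```python
import pandas as pd

# Toy stand-ins for HRData1.csv and HRData2.csv (all values are made up)
left = pd.DataFrame({'EmployeeNumber': [1, 2, 3], 'Age': [41, 49, 37]})
right = pd.DataFrame({'EmployeeNumber': [1, 2, 4], 'MonthlyIncome': [5993, 5130, 2090]})

# validate raises pandas.errors.MergeError if a key repeats on either side
inner = pd.merge(left, right, how='inner', on='EmployeeNumber', validate='one_to_one')

# An outer merge with indicator=True adds a '_merge' column: left_only, right_only, or both
outer = pd.merge(left, right, how='outer', on='EmployeeNumber', indicator=True)
unmatched = outer[outer['_merge'] != 'both']
unmatched_keys = sorted(unmatched['EmployeeNumber'].tolist())

print(f'inner rows = {len(inner)}, unmatched keys = {unmatched_keys}')
```

With the project files, an empty list of unmatched keys would confirm that the inner merge dropped no employees.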


### Step 2: Feature Engineering

Follow instructions provided for this step and write code below.

In [4]:
# 2-1 Create High Income feature
hr_merged_df['high_income'] = 0
hr_merged_df.loc[hr_merged_df['MonthlyIncome'] >= 7000, 'high_income'] = 1

# 2-2 Create Tenured feature
hr_merged_df['Tenured'] = 0
hr_merged_df.loc[hr_merged_df['YearsAtCompany'] >= 11, 'Tenured'] = 1

# 2-3 Copy Data Set with a Subset of Features 
data_model_df = hr_merged_df[['Age', 'Attrition', 'Department', 'Education', 'Gender', 
                              'high_income', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 
                              'MonthlyIncome', 'NumCompaniesWorked', 'PerformanceRating', 
                              'StockOptionLevel', 'Tenured', 'TotalWorkingYears', 'TrainingTimesLastYear', 
                              'YearsAtCompany', 'YearsInCurrentRole', 'YearsWithCurrManager']].copy()
data_model_shape = data_model_df.shape
data_rows, data_cols = data_model_shape[0], data_model_shape[1]
print(f'Feature Engineering Data: rows = {data_rows}, cols = {data_cols}')

Feature Engineering Data: rows = 1470, cols = 19
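
Both engineered features are threshold comparisons, so each can also be written as a single vectorized expression. A small sketch on made-up values, assuming the same cutoffs used in this step (7000 for `MonthlyIncome`, 11 for `YearsAtCompany`):

```python
import pandas as pd

# Made-up rows chosen to sit below, on, and above each cutoff
toy = pd.DataFrame({'MonthlyIncome': [2090, 7000, 19094],
                    'YearsAtCompany': [2, 10, 11]})

# The comparison yields booleans; astype(int) converts them to 0/1 flags
toy['high_income'] = (toy['MonthlyIncome'] >= 7000).astype(int)
toy['Tenured'] = (toy['YearsAtCompany'] >= 11).astype(int)

print(toy)
```

Because the comparison is `>=`, a value exactly on the cutoff is flagged as 1.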


### Step 3: One Hot Encode Categorical Features 

Follow instructions provided for this step and write code below.

In [7]:
# 3-1 Create a function to One Hot Encode Categorical variables
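def onehot_encoder(df, column_names):
    """
    One-hot encode the given categorical columns and drop the originals.

    Args:
        df (pd.DataFrame): The input DataFrame.
        column_names (list): A list of column names to encode.

    Returns:
        pd.DataFrame: The DataFrame with specified columns encoded.
    """
    # Note: scikit-learn 1.2 deprecates `sparse` in favor of `sparse_output`
    # (and 1.4 removes it); switch the argument name on newer versions.
    encoder = OneHotEncoder(sparse=False)
    one_hot_encoded = encoder.fit_transform(df[column_names])
    # Reuse df's index so the concat below aligns rows even if df is not indexed 0..n-1
    one_hot_df = pd.DataFrame(one_hot_encoded,
                              columns=encoder.get_feature_names_out(column_names),
                              index=df.index)
    df_encoded = pd.concat([df, one_hot_df], axis=1)
    df_encoded = df_encoded.drop(column_names, axis=1)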
    
    return df_encoded
    
# 3-2 Run the function on Categorical variables in the data set
data_df = onehot_encoder(data_model_df, ['Gender', 'JobRole', 'Department', 'MaritalStatus'])
data_shape = data_df.shape
data_rows, data_cols = data_shape[0], data_shape[1]
print(f'Encoded Data: rows = {data_rows}, cols = {data_cols}')


Encoded Data: rows = 1470, cols = 32
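
The column count follows from the encoding: the 4 categorical columns are removed from the original 19, and one indicator column is added per distinct category, so the 32 columns above imply 17 categories in total (32 - 15). A toy sketch of the same arithmetic, using `pd.get_dummies` as a compact stand-in for the `OneHotEncoder` function above (the data is made up):

```python
import pandas as pd

toy = pd.DataFrame({'Age': [41, 49, 37],
                    'Gender': ['Female', 'Male', 'Male'],
                    'MaritalStatus': ['Single', 'Married', 'Single']})
to_encode = ['Gender', 'MaritalStatus']

# One 0/1 column per distinct category; the original columns are dropped
encoded = pd.get_dummies(toy, columns=to_encode, dtype=int)

# Expected width: original columns - encoded columns + number of distinct categories
expected_cols = toy.shape[1] - len(to_encode) + sum(toy[c].nunique() for c in to_encode)

print(list(encoded.columns))
print(f'cols = {encoded.shape[1]}, expected = {expected_cols}')
```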


### Step 4: Prepare Training and Testing Sets 

Follow instructions provided for this step and write code below.

In [8]:
# 4-1 Split data into Feature and Target vectors
target = 'Attrition'
X = data_df.drop(target, axis = 1)
y = data_df[target]

# 4-2 Create Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 67)

print(f'Training Set:  X = {X_train.shape}, y = {y_train.shape}')
print(f'Testing Set:  X = {X_test.shape}, y = {y_test.shape}')

Training Set:  X = (1029, 31), y = (1029,)
Testing Set:  X = (441, 31), y = (441,)
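
The split above is purely random, so the share of `Attrition` cases can differ slightly between the training and testing sets. If the classes turn out to be imbalanced, one optional variation (not part of the project instructions) is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both sets. A sketch on made-up data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up): 20 rows, 4 of them positive (20%)
X_toy = pd.DataFrame({'feature': range(20)})
y_toy = pd.Series([0] * 16 + [1] * 4, name='Attrition')

# stratify=y_toy preserves the 20% positive share in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=67, stratify=y_toy)

print(f'train positives = {y_tr.mean():.2f}, test positives = {y_te.mean():.2f}')
```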


### Step 5: Random Forest Classification Model 

Follow instructions provided for this step and write code below.

In [12]:
# 5-1 Run Random Forest Classifier to predict Attrition
model = RandomForestClassifier(random_state = 67)

## Hyperparameter Tuning & Fit
param_grid = {"n_estimators": [250, 300],
              "max_depth": [5, 10],
              "bootstrap": [True, False],
              "class_weight": ['balanced']}
rf_grid_search = GridSearchCV(model, param_grid, n_jobs = -1, cv = 2)
rf_grid_search.fit(X_train, y_train)


# 5-2 Predict on the Train and Test Sets
y_rf_train_pred = rf_grid_search.predict(X_train)
y_rf_test_pred = rf_grid_search.predict(X_test)


# 5-3 Report Metrics
# accuracy_score, precision_score, recall_score
train_accuracy = accuracy_score(y_train, y_rf_train_pred)
train_precision = precision_score(y_train, y_rf_train_pred)
train_recall = recall_score(y_train, y_rf_train_pred)
print(f"\ntrain set metrics")
print(f"accuracy = {round(train_accuracy, 2)}")
print(f"precision = {round(train_precision, 2)}")
print(f"recall = {round(train_recall, 2)}")

test_accuracy = accuracy_score(y_test, y_rf_test_pred)
test_precision = precision_score(y_test, y_rf_test_pred)
test_recall = recall_score(y_test, y_rf_test_pred)
print(f"\ntest set metrics")
print(f"accuracy = {round(test_accuracy, 2)}")
print(f"precision = {round(test_precision, 2)}")
print(f"recall = {round(test_recall, 2)}")



train set metrics
accuracy = 0.97
precision = 0.99
recall = 0.92

test set metrics
accuracy = 0.86
precision = 0.82
recall = 0.56
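
Precision and recall in the report above are computed for the positive class. This assumes `Attrition` is coded 0/1, which the default `pos_label=1` of `precision_score` and `recall_score` requires. A short sketch with made-up labels shows the arithmetic behind the three metrics:

```python
# Made-up labels: 4 actual leavers (1) and 6 stayers (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # leavers correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # stayers wrongly flagged
fn = sum(t == 1 and p == 0 for t, p in pairs)  # leavers missed
tn = sum(t == 0 and p == 0 for t, p in pairs)  # stayers correctly passed

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of flagged employees who actually left
recall = tp / (tp + fn)              # share of actual leavers who were flagged

print(f'accuracy = {accuracy:.2f}, precision = {precision:.2f}, recall = {recall:.2f}')
```

Read this way, a test recall of 0.56 means that roughly 44% of the employees who actually left were not flagged by the model, even though test accuracy is 0.86.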


### Step 6: AdaBoost Classification Model

Follow instructions provided for this step and write code below.

In [13]:
# 6-1 Run AdaBoost Classifier to predict Attrition
model = AdaBoostClassifier(random_state = 67)

## Hyperparameter Tuning & Fit
param_grid = {"n_estimators": [250, 300],
              "learning_rate": [0.1, 0.01, 0.001]}

ada_grid_search = GridSearchCV(model, param_grid, n_jobs=-1, cv=2)
ada_grid_search.fit(X_train, y_train)

# 6-2 Predict on the Train and Test Sets
y_ada_train_pred = ada_grid_search.predict(X_train)
y_ada_test_pred = ada_grid_search.predict(X_test)


# 6-3 Report Metrics
# accuracy_score, precision_score, recall_score
train_accuracy = accuracy_score(y_train, y_ada_train_pred)
train_precision = precision_score(y_train, y_ada_train_pred)
train_recall = recall_score(y_train, y_ada_train_pred)
print(f"\ntrain set metrics")
print(f"accuracy = {round(train_accuracy, 2)}")
print(f"precision = {round(train_precision, 2)}")
print(f"recall = {round(train_recall, 2)}")

test_accuracy = accuracy_score(y_test, y_ada_test_pred)
test_precision = precision_score(y_test, y_ada_test_pred)
test_recall = recall_score(y_test, y_ada_test_pred)
print(f"\ntest set metrics")
print(f"accuracy = {round(test_accuracy, 2)}")
print(f"precision = {round(test_precision, 2)}")
print(f"recall = {round(test_recall, 2)}")



train set metrics
accuracy = 0.86
precision = 0.92
recall = 0.56

test set metrics
accuracy = 0.85
precision = 0.82
recall = 0.5


### Step 7: Merge Predictions with Data Sets and Save

Follow instructions provided for this step and write code below.

In [15]:
# 7-1 Merge and Save Outputs
## Merge train set predictions with the original train data
rf_train_pred_df = pd.DataFrame({'RF_Attrition_Pred': y_rf_train_pred}, index=X_train.index)
ada_train_pred_df = pd.DataFrame({'ADA_Attrition_Pred': y_ada_train_pred}, index=X_train.index)

train_actual_df = pd.DataFrame({'Attrition': y_train}, index=X_train.index)
train_merged_df = pd.concat([X_train, train_actual_df, rf_train_pred_df, ada_train_pred_df], axis=1)
train_merged_df.to_csv('predictions_train.csv', index=False)


## Merge test set predictions with the original test data
# Use the same column name as in the train file so both CSVs share one schema
rf_test_pred_df = pd.DataFrame({'RF_Attrition_Pred': y_rf_test_pred}, index=X_test.index)
ada_test_pred_df = pd.DataFrame({'ADA_Attrition_Pred': y_ada_test_pred}, index=X_test.index)

test_actual_df = pd.DataFrame({'Attrition': y_test}, index=X_test.index)
test_merged_df = pd.concat([X_test, test_actual_df, rf_test_pred_df, ada_test_pred_df], axis=1)
test_merged_df.to_csv('predictions_test.csv', index=False)
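
The `index=X_train.index` and `index=X_test.index` arguments above matter because `predict` returns a plain array, while the split leaves the feature rows with shuffled, non-consecutive labels, and `pd.concat(..., axis=1)` aligns on those labels. A toy sketch (made-up values, hypothetical column name `Pred`) of what happens with and without the explicit index:

```python
import io
import pandas as pd

# Shuffled row labels, as left behind by a random train/test split
X_part = pd.DataFrame({'Age': [37, 41, 49]}, index=[7, 2, 5])
preds = [1, 0, 0]  # model output: a plain sequence in the same row order as X_part

# With the explicit index, each prediction lands on its own row
aligned = pd.concat([X_part, pd.DataFrame({'Pred': preds}, index=X_part.index)], axis=1)

# Without it, the predictions get labels 0..2 and concat produces extra rows with NaN
misaligned = pd.concat([X_part, pd.DataFrame({'Pred': preds})], axis=1)

# index=False writes only the columns, not the row labels
buffer = io.StringIO()
aligned.to_csv(buffer, index=False)

print(aligned.shape, misaligned.shape)
print(buffer.getvalue())
```

Since `index=False` drops the row labels and `EmployeeNumber` is not among the selected features, the saved files carry no employee identifier. If the dashboard needs one, it would have to be added as a column before saving.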


### Step 8: Power BI Dashboard

Continue the analysis in Power BI, using the `predictions_train.csv` and `predictions_test.csv` files saved in Step 7.

## End of Project