Let's import the necessary libraries and load our dataset to get started.

In [9]:
import numpy as np
import pandas as pd

import boto3
from sagemaker import get_execution_role

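# ARN of the IAM execution role for this SageMaker notebook
role = get_execution_role()

# Build the S3 URI of the preprocessed dataset
bucket = 'ceo-turnover-data'
data_key = 'pre_processed_v4_CEO.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)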

data = pd.read_csv(data_location)
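
Reading an s3:// URL directly with pd.read_csv relies on the optional s3fs package being installed. If it isn't available, one possible fallback is to fetch the object through a boto3 S3 client and hand the bytes to pandas. The sketch below assumes that approach. read_csv_from_s3 is a hypothetical helper name, not something from the original notebook, and the client is passed in so the function can be tried without AWS credentials.

```python
import io

import pandas as pd


def read_csv_from_s3(s3_client, bucket, key):
    """Hypothetical helper: download one S3 object and parse it as CSV."""
    # get_object returns a dict whose "Body" entry is a file-like stream
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return pd.read_csv(io.BytesIO(response["Body"].read()))


# Usage inside the notebook (needs AWS credentials, so left commented out):
# import boto3
# data = read_csv_from_s3(boto3.client("s3"), "ceo-turnover-data",
#                         "pre_processed_v4_CEO.csv")
```

This reads the whole object into memory before parsing, which should be fine for a single CSV of this size but may not suit much larger files.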

data is a pandas DataFrame storing our preprocessed dataset. Let's print out data.head() to check that it imported correctly and to see how it's structured.

In [11]:
print(data.head())

   Age                                       Company Name   Director Name  \
0   73  COSTCO WHOLESALE CORP (Costco Companies Inc pr...     Jim Sinegal   
1   28            Morris & Garritano Insurance Agency Inc  Brendan Morris   
2   29  Madison Industries Inc (Madison Capital Partne...   Larry Gies Jr   
3   33                              Crowley Maritime Corp  Tom Crowley Jr   
4   33                     Enterprise Solutions Group Inc     Savas Karas   

   Number of Records               Role Name           Seniority  \
0                  1           President/CEO  Executive Director   
1                  1                     CEO  Executive Director   
2                  1  Chairman/President/CEO  Executive Director   
3                  1  Chairman/President/CEO  Executive Director   
4                  1           President/CEO  Executive Director   

   Tenure (Years) Turnover (YES/NO)  Year  
0               7                NO  2000  
1               7                NO  200

Now we're cooking with gas! Let's trim our data to only include predictors and labels.

In [17]:
data = data[['Age', 'Tenure (Years)', 'Turnover (YES/NO)']]

print(data.head())
print("\n We have {} rows of {} columns".format(data.shape[0], data.shape[1]))

   Age  Tenure (Years) Turnover (YES/NO)
0   73               7                NO
1   28               7                NO
2   29               7                NO
3   33               7                NO
4   33               7                NO

 We have 291294 rows of 3 columns


Let's encode our labels as binary values (NO becomes 0, YES becomes 1) and store them in a separate array.

In [27]:
# Encode the labels as integers: 'NO' -> 0, anything else -> 1.
# A vectorized comparison avoids looping over every row in Python.
y = (data[['Turnover (YES/NO)']] != 'NO').to_numpy().astype(int)

print("We have {} turnover events".format(int(y.sum())))

We have 25621 turnover events
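
With 25621 turnover events out of 291294 rows, only about 8.8% of the examples are positive, so the classes are quite imbalanced. As a rough sketch of how one might encode the labels with an explicit mapping and check the class balance at the same time, here is a toy example. The toy dataframe and the names toy, encoded, y_toy and turnover_rate are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the real dataframe (values are made up for illustration)
toy = pd.DataFrame({"Turnover (YES/NO)": ["NO", "YES", "NO", "NO"]})

# Map each label explicitly; any value other than YES/NO becomes NaN,
# which makes unexpected labels easy to catch
encoded = toy["Turnover (YES/NO)"].map({"NO": 0, "YES": 1})
assert encoded.notna().all(), "unexpected label value"

y_toy = encoded.to_numpy(dtype=int)

# The mean of a 0/1 array is the fraction of positive examples
turnover_rate = y_toy.mean()
print("Turnover rate: {:.1%}".format(turnover_rate))  # prints "Turnover rate: 25.0%"
```

Unlike a plain "not NO" comparison, the explicit mapping refuses to silently treat typos or missing values as turnover events.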


And let's also put our predictor data in a separate array.

In [34]:
X = np.array(data[['Age', 'Tenure (Years)']])

print("The average CEO is {} years old".format(int(round(X[:, 0].mean()))))

The average CEO is 58 years old
