# Create S3 Bucket and Upload Objects

Here, we download the raw datasets from Kaggle, resolve the destination bucket with boto3 and the SageMaker SDK, and copy the local files to our S3 bucket with the AWS CLI. To reproduce the download, you can fetch the dataset programmatically with a cURL request or the kagglehub API.

**cURL command**: 
```
!curl -L -o ~/Downloads/employee-attrition-dataset.zip \
  https://www.kaggle.com/api/v1/datasets/download/stealthtechnologies/employee-attrition-dataset
```
**kagglehub snippet**:
```
import kagglehub
path = kagglehub.dataset_download("stealthtechnologies/employee-attrition-dataset")  # returns the local download path
```
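
The cURL command saves a ZIP archive, while the cells further down read `train.csv` and `test.csv` from the working directory. A minimal sketch of the extraction step in between, using only the standard library, might look like this. The helper name `extract_csvs` and its defaults are hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_csvs(zip_path, dest_dir="."):
    """Extract every .csv member of a ZIP archive into dest_dir.

    Hypothetical helper: the name and defaults are illustrative only.
    Returns the sorted list of extracted member names.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Skip anything that is not a CSV file (e.g. a README).
            if member.lower().endswith(".csv"):
                archive.extract(member, path=dest)
                extracted.append(member)
    return sorted(extracted)
```

With the archive from the cURL command, a call such as `extract_csvs(Path.home() / "Downloads" / "employee-attrition-dataset.zip")` would place the CSV files in the current directory.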

## Dependency Setup

We'll use this section to ensure that all of the necessary dependencies for this project are installed.

In [1]:
!python --version

Python 3.10.16


In [2]:
!pip install --disable-pip-version-check -q awswrangler --quiet
!pip install --disable-pip-version-check -q kagglehub --quiet

In [3]:
!pip list

Package                       Version
----------------------------- -------------------
aiohappyeyeballs              2.4.4
aiohttp                       3.11.11
aiosignal                     1.3.2
alabaster                     1.0.0
annotated-types               0.7.0
antlr4-python3-runtime        4.9.3
anyio                         4.8.0
argon2-cffi                   23.1.0
argon2-cffi-bindings          21.2.0
arrow                         1.3.0
astroid                       3.3.8
astropy                       6.1.7
astropy-iers-data             0.2025.1.13.0.34.51
asttokens                     3.0.0
async-lru                     2.0.4
async-timeout                 5.0.1
asyncssh                      2.19.0
atomicwrites                  1.4.1
attrs                         23.2.0
autopep8                      2.0.4
autovizwidget                 0.22.0
awscli                        1.37.6
awswrangler                   3.11.0
babel                         2.16.0
backports.tarfile       

In [4]:
import boto3
from botocore.exceptions import ClientError
import sagemaker
import pandas as pd

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


## Reading in Data from Kaggle

In [5]:
# Display original data 

try:
    train_df = pd.read_csv('train.csv')
    print("Train data:")
    print("Shape:", train_df.shape)
    display(train_df.head())
    
    test_df = pd.read_csv('test.csv')
    print("\nTest data:")
    print("Shape:", test_df.shape)
    display(test_df.head())
    
except FileNotFoundError:
    print("One or both of the CSV files (train.csv, test.csv) were not found.")

Train data:
Shape: (59598, 24)


Unnamed: 0,Employee ID,Age,Gender,Years at Company,Job Role,Monthly Income,Work-Life Balance,Job Satisfaction,Performance Rating,Number of Promotions,...,Number of Dependents,Job Level,Company Size,Company Tenure,Remote Work,Leadership Opportunities,Innovation Opportunities,Company Reputation,Employee Recognition,Attrition
0,8410,31,Male,19,Education,5390,Excellent,Medium,Average,2,...,0,Mid,Medium,89,No,No,No,Excellent,Medium,Stayed
1,64756,59,Female,4,Media,5534,Poor,High,Low,3,...,3,Mid,Medium,21,No,No,No,Fair,Low,Stayed
2,30257,24,Female,10,Healthcare,8159,Good,High,Low,0,...,3,Mid,Medium,74,No,No,No,Poor,Low,Stayed
3,65791,36,Female,7,Education,3989,Good,High,High,1,...,2,Mid,Small,50,Yes,No,No,Good,Medium,Stayed
4,65026,56,Male,41,Education,4821,Fair,Very High,Average,0,...,0,Senior,Medium,68,No,No,No,Fair,Medium,Stayed



Test data:
Shape: (14900, 24)


Unnamed: 0,Employee ID,Age,Gender,Years at Company,Job Role,Monthly Income,Work-Life Balance,Job Satisfaction,Performance Rating,Number of Promotions,...,Number of Dependents,Job Level,Company Size,Company Tenure,Remote Work,Leadership Opportunities,Innovation Opportunities,Company Reputation,Employee Recognition,Attrition
0,52685,36,Male,13,Healthcare,8029,Excellent,High,Average,1,...,1,Mid,Large,22,No,No,No,Poor,Medium,Stayed
1,30585,35,Male,7,Education,4563,Good,High,Average,1,...,4,Entry,Medium,27,No,No,No,Good,High,Left
2,54656,50,Male,7,Education,5583,Fair,High,Average,3,...,2,Senior,Medium,76,No,No,Yes,Good,Low,Stayed
3,33442,58,Male,44,Media,5525,Fair,Very High,High,0,...,4,Entry,Medium,96,No,No,No,Poor,Low,Left
4,15667,39,Male,24,Education,4604,Good,High,Average,0,...,6,Mid,Large,45,Yes,No,No,Good,High,Stayed


## Combining Data 

In [6]:
# Combine train and test into data.csv, since we will split the data differently

try:
    # Ensure train and test have the same columns
    if list(train_df.columns) != list(test_df.columns):
        print("Warning: Train and test datasets have different columns.")

        # Drop columns that exist in train but not in test
        extra_cols = list(set(train_df.columns) - set(test_df.columns))
        if extra_cols:
            train_df = train_df.drop(columns=extra_cols)

        # Drop columns that exist in test but not in train
        extra_cols = list(set(test_df.columns) - set(train_df.columns))
        if extra_cols:
            test_df = test_df.drop(columns=extra_cols)

    # Combine the dataframes
    combined_df = pd.concat([train_df, test_df], ignore_index=True)

    # Save the combined dataframe to data.csv in the current working directory
    combined_df.to_csv('data.csv', index=False)
    print("Combined data saved to data.csv")
    print(f"Combined data has {len(combined_df)} records")

except FileNotFoundError:
    print("One or both of the CSV files (train.csv, test.csv) were not found.")

except Exception as e:
    print(f"An error occurred: {e}")

Combined data saved to data.csv
Combined data has 74498 records
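
The column-alignment step in the cell above can also be expressed through `pd.concat` itself. The sketch below uses two toy frames with made-up values standing in for `train_df` and `test_df`; `join="inner"` keeps only the columns that both frames share. This is an alternative way to get the same effect, not what the notebook ran.

```python
import pandas as pd

# Toy frames standing in for train_df and test_df; the values are made up.
train_toy = pd.DataFrame(
    {"Age": [31, 59], "Attrition": ["Stayed", "Left"], "extra": [1, 2]}
)
test_toy = pd.DataFrame({"Age": [36], "Attrition": ["Stayed"]})

# join="inner" keeps only the columns present in both frames,
# so the "extra" column from train_toy is dropped.
combined_toy = pd.concat([train_toy, test_toy], join="inner", ignore_index=True)

print(list(combined_toy.columns))
print(len(combined_toy))
```

One trade-off is that columns are dropped silently, so the explicit warning in the original cell may still be worth keeping.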


## Feature Engineering

In [7]:
# Reading in Combined CSV
df = pd.read_csv('data.csv')

# Ensuring the shape is correct
df.shape

(74498, 24)

Next, we inspect the data types of the attrition dataset to identify which features are categorical (`object` dtype). These features must be encoded numerically before our machine learning model(s) can interpret them. Common encoding techniques include one-hot encoding, label encoding, and target encoding; the right choice depends on the nature of the categorical variables and the model requirements. Here, we use label encoding.

In [8]:
df.dtypes

Employee ID                  int64
Age                          int64
Gender                      object
Years at Company             int64
Job Role                    object
Monthly Income               int64
Work-Life Balance           object
Job Satisfaction            object
Performance Rating          object
Number of Promotions         int64
Overtime                    object
Distance from Home           int64
Education Level             object
Marital Status              object
Number of Dependents         int64
Job Level                   object
Company Size                object
Company Tenure               int64
Remote Work                 object
Leadership Opportunities    object
Innovation Opportunities    object
Company Reputation          object
Employee Recognition        object
Attrition                   object
dtype: object

In [9]:
# Handle categorical features
label_encoders = {}
for col in df.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le
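
One caveat: `LabelEncoder` assigns integer codes in sorted label order, so an ordered feature such as `Work-Life Balance` is encoded alphabetically (Excellent=0, Fair=1, Good=2, Poor=3) rather than by rank. If rank order matters to a downstream model, an explicit mapping is one alternative. The sketch below is illustrative only, and the ranking `Poor < Fair < Good < Excellent` is our assumption, not something the dataset documents.

```python
import pandas as pd

# Assumed ranking, lowest to highest; not confirmed by the dataset documentation.
WORK_LIFE_ORDER = ["Poor", "Fair", "Good", "Excellent"]

# Toy column with made-up rows covering each category once.
toy = pd.DataFrame({"Work-Life Balance": ["Excellent", "Poor", "Good", "Fair"]})

# An ordered categorical yields integer codes that follow the declared order.
ordered = pd.Categorical(
    toy["Work-Life Balance"], categories=WORK_LIFE_ORDER, ordered=True
)
toy["Work-Life Balance (ordinal)"] = ordered.codes

print(toy)
```

The fitted encoders kept in `label_encoders` can still map the integer codes back to the original labels with `inverse_transform`.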

In [10]:
df.head()

Unnamed: 0,Employee ID,Age,Gender,Years at Company,Job Role,Monthly Income,Work-Life Balance,Job Satisfaction,Performance Rating,Number of Promotions,...,Number of Dependents,Job Level,Company Size,Company Tenure,Remote Work,Leadership Opportunities,Innovation Opportunities,Company Reputation,Employee Recognition,Attrition
0,8410,31,1,19,0,5390,0,2,0,2,...,0,1,1,89,0,0,0,0,2,1
1,64756,59,0,4,3,5534,3,0,3,3,...,3,1,1,21,0,0,0,1,1,1
2,30257,24,0,10,2,8159,2,0,3,0,...,3,1,1,74,0,0,0,3,1,1
3,65791,36,0,7,0,3989,2,0,2,1,...,2,1,2,50,1,0,0,2,2,1
4,65026,56,1,41,0,4821,1,3,0,0,...,0,2,1,68,0,0,0,1,2,1


## Data Splits (with holdout)

Here, we set aside 40% of the entire dataset as a holdout set that remains completely unseen during training and validation. This holdout set will serve as a final benchmark to assess the model’s generalization to new data after all feature engineering, hyperparameter tuning, and validation steps are complete.

By keeping this portion isolated, we ensure that our evaluation is unbiased and reflective of real-world performance, preventing any data leakage or overfitting to the training process.

In [11]:
try:
    # Split data into a holdout set (40%) and remaining data (60%).
    # train_test_split returns (train, test), so test_size=0.6 sends 60% to remaining_data.
    holdout_data, remaining_data = train_test_split(df, test_size=0.6, random_state=42)

    # Save the holdout data to a file
    holdout_data.to_csv('holdout.csv', index=False)
    print("Holdout data saved to holdout.csv")

    # Save the remaining data to a file
    remaining_data.to_csv('remaining_data.csv', index=False)
    print("Remaining data saved to remaining_data.csv")

except FileNotFoundError:
    print("Error: data.csv not found. Please ensure the combined dataset is created first.")
except Exception as e:
    print(f"An error occurred: {e}")

Holdout data saved to holdout.csv
Remaining data saved to remaining_data.csv
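
Because `test_size=0.6` applies to the second returned split, it is easy to misread which side receives 40%. The arithmetic below is a rough check of the expected row counts. It assumes scikit-learn rounds the test share up and gives the remainder to the first split; treat the rounding rule as our assumption rather than a guarantee.

```python
import math

n_samples = 74498  # combined record count reported earlier in this notebook
test_size = 0.6    # value passed to train_test_split

# Assumed rounding: the test share is rounded up and the first
# returned split (holdout_data) receives whatever is left.
n_remaining = math.ceil(test_size * n_samples)
n_holdout = n_samples - n_remaining

print("holdout rows:", n_holdout)
print("remaining rows:", n_remaining)
print("holdout share:", round(n_holdout / n_samples, 3))
```

Comparing these numbers with `holdout_data.shape` and `remaining_data.shape` is a quick way to confirm the split went the intended direction.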


## Setting Object Destination and Copying Data to Bucket

In this section, we are configuring the destination for our data within an Amazon S3 bucket. The bucket name is determined dynamically based on the SageMaker session, ensuring that each user interacts with their own unique bucket. By obtaining the default bucket associated with each user’s session, we ensure that data storage remains consistent and personalized.

We store the data that we will ingest into Athena (`remaining_data.csv`) and the holdout data that we will use for pressure testing (`holdout.csv`) in separate directories to prevent overlap.

In [12]:
# Create a SageMaker session object, which is used to manage interactions with SageMaker resources.
sess = sagemaker.Session()

# Retrieve the default Amazon S3 bucket associated with the SageMaker session.
bucket = sess.default_bucket()

# Get the IAM role associated with the current SageMaker notebook or environment.
role = sagemaker.get_execution_role()

# Get the AWS region name for the current session.
region = boto3.Session().region_name

# Retrieve the AWS account ID of the caller using the Security Token Service (STS) client.
account_id = boto3.client("sts").get_caller_identity().get("Account")

# Create a Boto3 client for the SageMaker service, specifying the AWS region.
sm = boto3.Session().client(service_name="sagemaker", region_name=region)

In [13]:
bucket_path = "s3://{}/aai-540-group-3-final-project/data".format(bucket)
bucket_path_data_combined = "s3://{}/aai-540-group-3-final-project/data/db_source".format(bucket)
bucket_path_splits =  "s3://{}/aai-540-group-3-final-project/data/holdout".format(bucket)
bucket_for_eda = "s3://{}/aai-540-group-3-final-project/data/eda".format(bucket)
print(bucket_path)
print(bucket_path_data_combined)
print(bucket_path_splits)
print(bucket_for_eda)

s3://sagemaker-us-east-1-242201273368/aai-540-group-3-final-project/data
s3://sagemaker-us-east-1-242201273368/aai-540-group-3-final-project/data/db_source
s3://sagemaker-us-east-1-242201273368/aai-540-group-3-final-project/data/holdout
s3://sagemaker-us-east-1-242201273368/aai-540-group-3-final-project/data/eda


In [14]:
%store bucket_path

Stored 'bucket_path' (str)


In [15]:
%store bucket_path_data_combined

Stored 'bucket_path_data_combined' (str)


In [16]:
%store bucket_path_splits

Stored 'bucket_path_splits' (str)


In [17]:
!aws s3 cp "remaining_data.csv" $bucket_path_data_combined/

upload: ./remaining_data.csv to s3://sagemaker-us-east-1-242201273368/aai-540-group-3-final-project/data/db_source/remaining_data.csv


In [18]:
# Adding Holdout data to its own directory
!aws s3 cp "holdout.csv" $bucket_path_splits/

upload: ./holdout.csv to s3://sagemaker-us-east-1-242201273368/aai-540-group-3-final-project/data/holdout/holdout.csv


In [19]:
# Adding EDA/Raw data to its own directory
!aws s3 cp "data.csv" $bucket_for_eda/

upload: ./data.csv to s3://sagemaker-us-east-1-242201273368/aai-540-group-3-final-project/data/eda/data.csv
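
The three uploads above use the AWS CLI. For readers who would rather stay in Python, a rough boto3 equivalent could look like the sketch below. The helper names `split_s3_uri` and `upload_to_prefix` are hypothetical; the only boto3 call assumed is the S3 client's `upload_file(Filename, Bucket, Key)`.

```python
import os
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/prefix' into (bucket, prefix) without surrounding slashes."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.strip("/")


def upload_to_prefix(s3_client, local_path, s3_uri):
    """Upload local_path under s3_uri and return the resulting object URI.

    s3_client is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    Hypothetical helper, not part of the original notebook.
    """
    bucket_name, prefix = split_s3_uri(s3_uri)
    filename = os.path.basename(local_path)
    # Keep the file name and place it under the prefix, like `aws s3 cp file prefix/`.
    key = f"{prefix}/{filename}" if prefix else filename
    s3_client.upload_file(Filename=local_path, Bucket=bucket_name, Key=key)
    return f"s3://{bucket_name}/{key}"
```

In this notebook, a call such as `upload_to_prefix(boto3.client("s3"), "holdout.csv", bucket_path_splits)` would mirror the holdout upload. Passing the client in as an argument keeps the helper easy to test without AWS credentials.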


## Listing Files in our Bucket

In this section, we will programmatically list the files stored in the Amazon S3 bucket associated with this notebook. By dynamically identifying the bucket through the SageMaker session, we ensure that the code remains reproducible for anyone using it, regardless of their account or environment. This approach avoids hardcoding bucket names and guarantees compatibility across different users.


In [20]:
!aws s3 ls $bucket_path/ --recursive

2025-02-23 04:05:11          0 aai-540-group-3-final-project/data/db_source//242201273368/sagemaker/us-east-1/offline-store/employee-attrition-feature-store-1740280320/employee-attrition-feature-store2025-02-23T03:12:00.262Z.txt
2025-02-23 03:18:46      23523 aai-540-group-3-final-project/data/db_source/242201273368/sagemaker/us-east-1/offline-store/employee-attrition-feature-store-1740280320/data/year=2025/month=02/day=23/hour=03/20250223T031159Z_111qhMB3kv2B80lz.parquet
2025-02-23 03:18:46      25001 aai-540-group-3-final-project/data/db_source/242201273368/sagemaker/us-east-1/offline-store/employee-attrition-feature-store-1740280320/data/year=2025/month=02/day=23/hour=03/20250223T031159Z_1QzIxyNc7hGCNhd2.parquet
2025-02-23 03:18:46      25305 aai-540-group-3-final-project/data/db_source/242201273368/sagemaker/us-east-1/offline-store/employee-attrition-feature-store-1740280320/data/year=2025/month=02/day=23/hour=03/20250223T031159Z_1RMnxfkvYNo5PBx0.parquet
2025-02-23 03:18:46      25
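
The same listing can be produced from Python rather than the CLI. The following is a minimal sketch, assuming a boto3 S3 client and its `list_objects_v2` paginator; the helper name `list_keys` is hypothetical.

```python
def list_keys(s3_client, bucket_name, prefix):
    """Return (key, size) pairs for every object stored under prefix.

    s3_client is expected to be a boto3 S3 client. The paginator follows
    continuation tokens, so long listings are returned in full.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    results = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # "Contents" is absent when a page holds no objects.
        for obj in page.get("Contents", []):
            results.append((obj["Key"], obj["Size"]))
    return results
```

For example, `list_keys(boto3.client("s3"), bucket, "aai-540-group-3-final-project/data/")` would cover the same prefix as the `aws s3 ls` command above.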

## Deleting Local Files
This step removes the local CSV files before we push to our repository.

In [21]:
import os

def delete_csv_files(*filenames):
    if len(filenames) < 2:
        print("Please specify at least two CSV files to delete.")
        return
    
    for file in filenames:
        if file.endswith(".csv") and os.path.exists(file):
            try:
                os.remove(file)
                print(f"Deleted: {file}")
            except Exception as e:
                print(f"Error deleting {file}: {e}")
        else:
            print(f"File not found or not a CSV: {file}")

# Deleting CSV Files
delete_csv_files("data.csv", "holdout.csv", "remaining_data.csv")

Deleted: data.csv
Deleted: holdout.csv
Deleted: remaining_data.csv


## Release Resources

In [22]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

In [23]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}

<IPython.core.display.Javascript object>