## Dependency Setup

This section installs the project's dependencies and confirms that they are available.

In [None]:
!python --version

In [2]:
!pip install --disable-pip-version-check -q awswrangler --quiet
!pip install --disable-pip-version-check -q kagglehub --quiet
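
For a quick check after installation, it can be easier to query only the packages this notebook relies on than to scan a full package listing. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` and the package selection are illustrative assumptions, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Hypothetical selection: the packages imported or installed in this notebook
wanted = ["awswrangler", "kagglehub", "boto3", "sagemaker", "pandas"]
for name, ver in installed_versions(wanted).items():
    print(f"{name:<12} {ver or 'NOT INSTALLED'}")
```

A missing package shows up as `NOT INSTALLED` instead of raising, which keeps the check usable before every dependency is in place.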

In [3]:
!pip list

Package                              Version
------------------------------------ --------------
ansicolors                           1.1.8
anyio                                4.4.0
archspec                             0.2.3
argon2-cffi                          23.1.0
argon2-cffi-bindings                 21.2.0
arrow                                1.3.0
asttokens                            2.4.1
async-lru                            2.0.4
attrs                                23.2.0
autovizwidget                        0.20.4
awscli                               1.33.13
awswrangler                          3.11.0
Babel                                2.14.0
beautifulsoup4                       4.12.3
bleach                               6.1.0
boltons                              23.1.1
boto3                                1.34.131
botocore                             1.34.131
Brotli                               1.1.0
cached-property                      1.5.2
certifi                    

In [4]:
import boto3
from botocore.exceptions import ClientError
import sagemaker
import pandas as pd

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


## Create S3 Bucket and Upload Objects

Here, we download the raw datasets from Kaggle and use the AWS CLI to copy the local files to our S3 bucket. To reproduce the download, you can fetch the dataset programmatically with either a cURL request or the kagglehub API.

**cURL command**: 
```
!curl -L -o ~/Downloads/employee-attrition-dataset.zip \
  https://www.kaggle.com/api/v1/datasets/download/stealthtechnologies/employee-attrition-dataset
```
**kagglehub snippet**:
```
import kagglehub
kagglehub.dataset_download("stealthtechnologies/employee-attrition-dataset")
```
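
The cURL command saves a zip archive, so the CSV files have to be unpacked before `pd.read_csv` can read them. Below is a minimal sketch with the standard library's `zipfile`, assuming the archive contains the `train.csv` and `test.csv` files that the notebook reads later; the helper `extract_csvs` and the destination directory are illustrative.

```python
import zipfile
from pathlib import Path


def extract_csvs(zip_path, dest_dir="."):
    """Extract every .csv member of the archive into dest_dir and return their paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.lower().endswith(".csv"):
                archive.extract(member, dest)
                extracted.append(dest / member)
    return sorted(extracted)


# Example usage (path taken from the cURL command above):
# extract_csvs(Path.home() / "Downloads" / "employee-attrition-dataset.zip")
```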

In [5]:
# Load and preview the original train and test files

try:
  train_df = pd.read_csv('train.csv')
  print("Train data:")
  display(train_df.head())

  test_df = pd.read_csv('test.csv')
  print("\nTest data:")
  display(test_df.head())

except FileNotFoundError:
  print("One or both of the CSV files (train.csv, test.csv) were not found.")

Train data:


Unnamed: 0,Employee ID,Age,Gender,Years at Company,Job Role,Monthly Income,Work-Life Balance,Job Satisfaction,Performance Rating,Number of Promotions,...,Number of Dependents,Job Level,Company Size,Company Tenure,Remote Work,Leadership Opportunities,Innovation Opportunities,Company Reputation,Employee Recognition,Attrition
0,8410,31,Male,19,Education,5390,Excellent,Medium,Average,2,...,0,Mid,Medium,89,No,No,No,Excellent,Medium,Stayed
1,64756,59,Female,4,Media,5534,Poor,High,Low,3,...,3,Mid,Medium,21,No,No,No,Fair,Low,Stayed
2,30257,24,Female,10,Healthcare,8159,Good,High,Low,0,...,3,Mid,Medium,74,No,No,No,Poor,Low,Stayed
3,65791,36,Female,7,Education,3989,Good,High,High,1,...,2,Mid,Small,50,Yes,No,No,Good,Medium,Stayed
4,65026,56,Male,41,Education,4821,Fair,Very High,Average,0,...,0,Senior,Medium,68,No,No,No,Fair,Medium,Stayed



Test data:


Unnamed: 0,Employee ID,Age,Gender,Years at Company,Job Role,Monthly Income,Work-Life Balance,Job Satisfaction,Performance Rating,Number of Promotions,...,Number of Dependents,Job Level,Company Size,Company Tenure,Remote Work,Leadership Opportunities,Innovation Opportunities,Company Reputation,Employee Recognition,Attrition
0,52685,36,Male,13,Healthcare,8029,Excellent,High,Average,1,...,1,Mid,Large,22,No,No,No,Poor,Medium,Stayed
1,30585,35,Male,7,Education,4563,Good,High,Average,1,...,4,Entry,Medium,27,No,No,No,Good,High,Left
2,54656,50,Male,7,Education,5583,Fair,High,Average,3,...,2,Senior,Medium,76,No,No,Yes,Good,Low,Stayed
3,33442,58,Male,44,Media,5525,Fair,Very High,High,0,...,4,Entry,Medium,96,No,No,No,Poor,Low,Left
4,15667,39,Male,24,Education,4604,Good,High,Average,0,...,6,Mid,Large,45,Yes,No,No,Good,High,Stayed


In [6]:
# Combine train and test into "data.csv", since the data will be re-split later

try:
  # Ensure train and test have the same columns
  if list(train_df.columns) != list(test_df.columns):
    print("Warning: Train and test datasets have different columns.")

    # Drop columns that exist in train but not in test
    extra_cols = list(set(train_df.columns) - set(test_df.columns))
    if extra_cols:
      train_df = train_df.drop(columns=extra_cols)

    # Drop columns that exist in test but not in train
    extra_cols = list(set(test_df.columns) - set(train_df.columns))
    if extra_cols:
      test_df = test_df.drop(columns=extra_cols)

  # Combine the dataframes (pd.concat aligns columns by name)
  combined_df = pd.concat([train_df, test_df], ignore_index=True)

  # Save the combined dataframe to data.csv in the current working directory
  combined_df.to_csv('data.csv', index=False)
  print("Combined data saved to data.csv")

except NameError:
  # train_df / test_df only exist if the previous cell loaded both CSV files
  print("train_df or test_df is not defined; run the previous cell first.")

except Exception as e:
  print(f"An error occurred: {e}")

Combined data saved to data.csv
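
To make the alignment step concrete, here is a small sketch on toy data. It assumes, as in the cell above, that only columns present in both frames should survive; the frames and the helper `combine_on_shared_columns` are made up for illustration and do not reflect the real dataset.

```python
import pandas as pd


def combine_on_shared_columns(train, test):
    """Keep only columns present in both frames (in train's order), then stack the rows."""
    shared = [col for col in train.columns if col in test.columns]
    return pd.concat([train[shared], test[shared]], ignore_index=True)


# Toy frames: "Bonus" exists only in train, "Notes" only in test
toy_train = pd.DataFrame({"Employee ID": [1, 2], "Age": [31, 59], "Bonus": [100, 200]})
toy_test = pd.DataFrame({"Employee ID": [3], "Notes": ["n/a"], "Age": [24]})

toy_combined = combine_on_shared_columns(toy_train, toy_test)
print(toy_combined)
```

Selecting the shared columns explicitly avoids the `NaN`-filled columns that `pd.concat` would otherwise create for columns found in only one frame.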


In [7]:
# Create a SageMaker session object, which is used to manage interactions with SageMaker resources.
sess = sagemaker.Session()

# Retrieve the default Amazon S3 bucket associated with the SageMaker session.
bucket = sess.default_bucket()

# Get the IAM role associated with the current SageMaker notebook or environment.
role = sagemaker.get_execution_role()

# Get the AWS region name for the current session.
region = boto3.Session().region_name

# Retrieve the AWS account ID of the caller using the Security Token Service (STS) client.
account_id = boto3.client("sts").get_caller_identity().get("Account")

# Create a Boto3 client for the SageMaker service, specifying the AWS region.
sm = boto3.Session().client(service_name="sagemaker", region_name=region)

## Setting Object Destination and Copying Data to Bucket

Here we set the destination for our data in Amazon S3. The bucket name comes from the SageMaker session's default bucket, so each user writes to the bucket tied to their own account and region rather than to a hardcoded name.

In [8]:
bucket_path = f"s3://{bucket}/aai-540-group-3-final-project/data"
bucket_path_data_combined = f"s3://{bucket}/aai-540-group-3-final-project/data/db_source"
print(bucket_path)
print(bucket_path_data_combined)

's3://sagemaker-us-east-1-796598873577/aai-540-group-3-final-project/data'
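
S3 URIs like the one printed above have the form `s3://<bucket>/<key>`. The following is a small sketch of building and splitting them with the standard library; `build_s3_uri` and `split_s3_uri` are hypothetical helpers, not part of boto3 or the SageMaker SDK, and the bucket name is a placeholder.

```python
from urllib.parse import urlparse


def build_s3_uri(bucket, *key_parts):
    """Join a bucket name and key segments into an s3:// URI."""
    key = "/".join(part.strip("/") for part in key_parts if part)
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# "my-example-bucket" is a placeholder for sess.default_bucket()
uri = build_s3_uri("my-example-bucket", "aai-540-group-3-final-project", "data")
print(uri)
print(split_s3_uri(uri))
```

Splitting the URI is useful when a later step needs the bucket and key separately, as boto3 calls do.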

In [9]:
%store bucket_path

Stored 'bucket_path' (str)


In [10]:
!aws s3 cp "train.csv" $bucket_path/

upload: ./train.csv to s3://sagemaker-us-east-1-796598873577/aai-540-group-3-final-project/data/train.csv


In [11]:
!aws s3 cp "test.csv" $bucket_path/

upload: ./test.csv to s3://sagemaker-us-east-1-796598873577/aai-540-group-3-final-project/data/test.csv


In [12]:
# The destination has no trailing slash, so data.csv is stored as an object named "db_source"
!aws s3 cp "data.csv" $bucket_path_data_combined

upload: ./data.csv to s3://sagemaker-us-east-1-796598873577/aai-540-group-3-final-project/data/db_source
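
The uploads above behave differently depending on whether the destination ends in a slash: `train.csv` and `test.csv` keep their file names under `data/`, while `data.csv` becomes an object named `db_source`. The sketch below models that rule as seen in the upload output; `resolve_s3_destination` is a hypothetical helper for illustration and is not how the AWS CLI is implemented internally.

```python
import posixpath


def resolve_s3_destination(local_path, dest_uri):
    """Model the single-file `aws s3 cp` destination rule seen above.

    A destination ending in "/" is treated as a prefix and the local file
    name is appended; otherwise the destination is used as the full object URI.
    """
    if dest_uri.endswith("/"):
        return dest_uri + posixpath.basename(local_path)
    return dest_uri


# Placeholder bucket name; the real one comes from sess.default_bucket()
base = "s3://my-example-bucket/aai-540-group-3-final-project/data"
print(resolve_s3_destination("train.csv", base + "/"))
print(resolve_s3_destination("data.csv", base + "/db_source"))
print(resolve_s3_destination("data.csv", base + "/db_source/"))
```

Under this rule, appending a trailing slash to `$bucket_path_data_combined` would store the file as `db_source/data.csv` instead.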


## Listing Files in our Bucket

Next, we list the objects stored under the project prefix of the S3 bucket. Because the bucket name comes from the SageMaker session rather than a hardcoded string, the same commands work for anyone running the notebook in their own account.


In [14]:
!aws s3 ls $bucket_path/ --recursive

2025-01-27 01:08:06    9550295 aai-540-group-3-final-project/data/data.csv
2025-01-27 01:14:19    9550295 aai-540-group-3-final-project/data/db_source
2025-01-27 01:14:16    1910316 aai-540-group-3-final-project/data/test.csv
2025-01-27 01:14:12    7640348 aai-540-group-3-final-project/data/train.csv


In [15]:
!aws s3 ls $bucket_path/db_source --recursive

2025-01-27 01:14:19    9550295 aai-540-group-3-final-project/data/db_source
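
The same listing can be produced from Python instead of the CLI. Below is a hedged sketch using the boto3 S3 client's `list_objects_v2` paginator; the helper `list_keys` is an assumption of this example, and the client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def list_keys(s3_client, bucket, prefix=""):
    """Return (key, size) pairs for every object under prefix.

    s3_client is expected to behave like boto3.client("s3").
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    results = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when no objects match
        for obj in page.get("Contents", []):
            results.append((obj["Key"], obj["Size"]))
    return results


# Example usage inside the notebook (requires AWS credentials):
# import boto3
# for key, size in list_keys(boto3.client("s3"), bucket, "aai-540-group-3-final-project/data/"):
#     print(f"{size:>10} {key}")
```

Using the paginator rather than a single `list_objects_v2` call matters once a prefix holds more objects than one response page returns.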


## Release Resources

In [19]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    const els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

In [None]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}