# 00 - Setup

This lab has been tested in Amazon SageMaker AI Studio using JupyterLab and the SageMaker Distribution 2.4.1 container image. 

## Introduction

Welcome to this hands-on workshop on building a credit risk assessment model using Amazon SageMaker!


In this first notebook, you'll set up your development environment and prepare a dataset that will serve as the foundation for your machine learning use case.


<div class="alert alert-block alert-info">
<b>Use Case:</b> You are working for Octank Bank and you want to develop a classification model that predicts whether a customer presents a credit risk.
</div> 


**Business Context**:
Credit risk assessment is a critical operation in the financial sector that helps determine the likelihood of borrowers repaying their loans. Accurate risk prediction enables financial institutions to:
- Make sound lending decisions
- Reduce default rates and potential losses
- Optimize interest rates based on risk profiles
- Comply with regulatory requirements for risk management


    
**Your Task**:
As a data scientist at Octank Bank, you're tasked with developing a classification model to predict whether a customer presents a credit risk. Throughout this workshop series, you'll learn how to:
- Set up your SageMaker environment
- Prepare and analyze your data
- Train and deploy a machine learning model using Amazon SageMaker Studio
- Evaluate model performance and interpret results

**Prerequisites:**
- Basic understanding of Python
- Familiarity with machine learning concepts

**What You'll Learn in This Notebook:**
1. Setting up your SageMaker environment
2. Loading and exploring the credit risk dataset
3. Performing initial data preprocessing

Let's get started!


## About the Dataset:  

The [South German Credit](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29#) dataset is an industry-standard benchmark dataset in the field of credit scoring. It contains information about 1,000 loan applicants, including:

- Credit history
- Loan purpose
- Loan amount
- Employment duration
- Personal status and gender
- Other debtors/guarantors
- Property ownership
- Other installment plans
- Housing situation
- Job information

The target variable _credit_risk_ indicates whether a customer is considered a good or bad credit risk.

## 1. Setting Up Your ML Environment



First, let's prepare our development environment and download the dataset.

### 1.1 Install required packages

In Amazon SageMaker AI Studio, the JupyterLab application comes with the [SageMaker Distribution](https://github.com/aws/sagemaker-distribution) image. The distribution image has popular packages, such as PyTorch, TensorFlow, Keras, NumPy, pandas, and scikit-learn, pre-installed.

Let's install some additional packages that are required for this lab and set up the development environment.

In [None]:
!pip install -r requirements.txt

<div class="alert alert-block alert-warning">
<b>Important:</b> Once the previous step is complete, make sure you restart the kernel before proceeding.
</div>
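
After restarting the kernel, a quick check like the one below can confirm that the packages are visible to the new kernel. This is a minimal sketch: the helper name `installed_versions` and the package list are assumptions for illustration, so adjust the list to match what your `requirements.txt` actually contains.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this kernel
    return versions


# Hypothetical package list; replace it with the entries from requirements.txt
for name, version in installed_versions(["sagemaker", "pandas", "scikit-learn"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If a package is reported as not installed, re-running the install cell and restarting the kernel again is usually the first thing to try.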

### 1.2 Import packages

In [None]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import sagemaker
from sagemaker.s3 import S3Downloader


## 2. Download and Prepare Data
In this section, you download the dataset and perform some initial preprocessing.

### 2.1 Download dataset

Download the dataset from the SageMaker sample files repository:

In [None]:

S3Downloader.download(
    "s3://sagemaker-sample-files/datasets/tabular/uci_statlog_german_credit_data/SouthGermanCredit.asc",
    "data",
)
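
Before parsing the file, it can help to confirm that the download succeeded and to look at the first few lines. The sketch below assumes the file landed at `data/SouthGermanCredit.asc` (the path used by the read step later in this notebook); `preview_file` is a hypothetical helper, not part of the SageMaker SDK.

```python
from itertools import islice
from pathlib import Path


def preview_file(path, n_lines=3):
    """Return the first n_lines of a text file, without trailing newlines."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Expected a downloaded file at '{path}'")
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]


# Assumed location of the downloaded dataset
data_file = Path("data/SouthGermanCredit.asc")
if data_file.is_file():
    for line in preview_file(data_file):
        print(line)
else:
    print(f"'{data_file}' not found; run the download cell first.")
```

The first printed line should be the file's own header row, which is why the read step replaces it with more descriptive column names.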

### 2.2 Split the data into training and test data sets
The code below follows standard data preparation steps to set up the data for model training and evaluation. It reads the file, removes rows with missing values, and splits the dataset into training and test sets using an 80-20 split, with a fixed random seed for reproducibility.

In [None]:

# Column names for the data set
credit_columns = [
    "status",
    "duration",
    "credit_history",
    "purpose",
    "amount",
    "savings",
    "employment_duration",
    "installment_rate",
    "personal_status_sex",
    "other_debtors",
    "present_residence",
    "property",
    "age",
    "other_installment_plans",
    "housing",
    "number_credits",
    "job",
    "people_liable",
    "telephone",
    "foreign_worker",
    "credit_risk",
]

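# Read the space-separated data file into a pandas DataFrame.
# header=0 discards the file's own header row in favor of the names above;
# "?" entries are treated as missing and the affected rows are dropped.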
data = pd.read_csv(
    "data/SouthGermanCredit.asc",
    names=credit_columns,
    header=0,
    sep=r" ",
    engine="python",
    na_values="?",
).dropna()

# Assuming the last column is your label/target
# First, let's separate features (X) from the label (y)
X = data.iloc[:, :-1]  # All columns except the last one
y = data.iloc[:, -1]   # Only the last column

# Now, split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # 20% test data, set random_state for reproducibility
)

# To verify the shapes of your resulting datasets
print(f"Original data shape: {data.shape}")
print(f"Training features shape: {X_train.shape}")
print(f"Training labels shape: {y_train.shape}")
print(f"Test features shape: {X_test.shape}")
print(f"Test labels shape: {y_test.shape}")
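
Because the split above is random, the share of good and bad credit risks may differ slightly between the training and test sets. The sketch below is one way to compare them; `class_proportions` is a hypothetical helper and the toy labels are made up for illustration. In the notebook you would pass `y_train` and `y_test` instead.

```python
from collections import Counter


def class_proportions(labels):
    """Return {label: fraction of rows} for any iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Toy labels for illustration only; replace with y_train and y_test
toy_train = [1, 1, 1, 0, 1, 0, 1, 1]
toy_test = [1, 0]
print("train:", class_proportions(toy_train))
print("test: ", class_proportions(toy_test))
```

If the proportions differ noticeably, one option is to pass `stratify=y` to `train_test_split`, which keeps the class ratio similar in both sets. The workshop code does not do this, so treat it as an optional variation.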

### 2.3 Save the data in your JupyterLab space

In [None]:
def create_directory(directory_path):
    try:
        if os.path.exists(directory_path):
            print(f"The '{directory_path}' directory already exists.")
        else:
            os.makedirs(directory_path)
            print(f"Created '{directory_path}' directory successfully.")
    except Exception as e:
        print(f"Error creating directory: {e}")

In [None]:
input_dir = "./input"
create_directory(input_dir)

output_dir = "./output"
create_directory(output_dir)


In [None]:
train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

train_data_path = f"{input_dir}/train_data.csv"
test_data_path = f"{input_dir}/test_data.csv"

train_data.to_csv(train_data_path, index=False)
test_data.to_csv(test_data_path, index=False)

You can also persist these variables with the IPython `%store` magic so that they are available across different notebook sessions. A later notebook can restore them with `%store -r`.

In [None]:
%store train_data_path test_data_path

### 2.4 Upload the data to the default SageMaker S3 bucket for future use
Finally, let's also upload the datasets to Amazon S3 so that they are available for subsequent notebooks.

When using Amazon SageMaker AI Studio, you have a default Amazon S3 bucket allocated for your SageMaker AI session. You can choose the default bucket or any bucket of your choice during development. For this lab, you will use the default bucket.

In the cell below, you:
- create a SageMaker session and obtain the default bucket name.
- define the prefix `sm-ml-models` to organize resources in S3.
- retrieve the IAM execution role that grants SageMaker AI the permissions it needs to read and write data in S3 and to perform SageMaker AI operations (which will be useful later).


In [None]:
# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
bucket_name = sagemaker_session.default_bucket()  # Default SageMaker bucket
model_prefix = "sm-ml-models"

# Retrieve the IAM role for SageMaker
role = sagemaker.get_execution_role()
print(f"Your Amazon SageMaker Execution role is: {role}")

%store bucket_name model_prefix role

Upload the training and test data sets.

In [None]:
# Save the datasets in S3 for future use, reusing the session created above
train_s3_url = sagemaker_session.upload_data(
    path=train_data_path,
    bucket=bucket_name,
    key_prefix=f"{model_prefix}/input"
)
print(f"Uploaded the training dataset to {train_s3_url}")

test_s3_url = sagemaker_session.upload_data(
    path=test_data_path,
    bucket=bucket_name,
    key_prefix=f"{model_prefix}/input"
)
print(f"Uploaded the test dataset to {test_s3_url}")
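
To make the resulting S3 layout explicit, the sketch below builds the URI you would expect for each uploaded file. It assumes that `upload_data` places a single file at `s3://<bucket>/<key_prefix>/<file name>`; `expected_s3_uri` is a hypothetical helper and `example-bucket` is a placeholder, not your actual default bucket.

```python
import posixpath
from pathlib import Path


def expected_s3_uri(bucket, key_prefix, local_path):
    """Build the S3 URI expected for a single uploaded file (assumed layout)."""
    file_name = Path(local_path).name
    key = posixpath.join(key_prefix.strip("/"), file_name)
    return f"s3://{bucket}/{key}"


# Placeholder bucket name; in the notebook you would pass bucket_name instead
print(expected_s3_uri("example-bucket", "sm-ml-models/input", "./input/train_data.csv"))
# → s3://example-bucket/sm-ml-models/input/train_data.csv
```

Comparing this value with the URL printed by the upload cell is a simple way to confirm that the files landed under the prefix that later notebooks will read from.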

### Conclusion and Next Steps:
- Your environment is ready
- The data is prepared and stored
- You're ready for Module 1!

In Module 1, you'll use this foundation to build your first credit risk prediction model using XGBoost.