# AWS model
We chose the XGBoost algorithm for our AWS model because it scored well in the PyCaret comparison and because AWS SageMaker offers it as a built-in algorithm.

## Imports

In [1]:
import os, io, boto3, sagemaker
import pandas as pd
from sagemaker.image_uris import retrieve
from sklearn.model_selection import train_test_split



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


## Load data
We load our data here as well. This time we also load the test data, because we make the predictions in this file and store them in our bucket, where we can retrieve them later when we compare the models. This is the workflow AWS is built around, and retrieving the predictions this way is also quite easy for us.

In [2]:
x_train = pd.read_csv('x_train.csv')
y_train = pd.read_csv('y_train.csv')
x_test = pd.read_csv('x_test.csv')

train_and_validate = pd.concat([y_train, x_train], axis=1)

We also split our training data into a training set and a validation set, stratified on the "survived" column so that both sets keep the same share of survivors.

In [3]:
train, validate = train_test_split(train_and_validate, test_size=0.2, random_state=42, stratify=train_and_validate['survived'])
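
What stratifying does can be illustrated with a small pure-Python sketch. This is not scikit-learn's implementation, only the idea behind it: shuffle each class separately and take the same fraction from every class. The function name `stratified_split` and the toy rows are made up for illustration.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, test_size=0.2, seed=42):
    """Split rows so that each class contributes the same fraction to the validation set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    train, validate = [], []
    for members in by_class.values():
        members = members[:]          # copy so the caller's data is not reordered
        rng.shuffle(members)
        n_val = round(len(members) * test_size)
        validate.extend(members[:n_val])
        train.extend(members[n_val:])
    return train, validate

# Toy data: 40 rows with label 1 and 60 rows with label 0 (label is the first field)
rows = [(1, i) for i in range(40)] + [(0, i) for i in range(60)]
train, validate = stratified_split(rows, label_of=lambda r: r[0])

share_train = sum(r[0] for r in train) / len(train)
share_validate = sum(r[0] for r in validate) / len(validate)
print(len(train), len(validate), share_train, share_validate)
```

Both parts end up with the same share of label 1 as the full dataset, which is the property the `stratify` argument is used for above.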

To let AWS train a model, we upload our training, validation and test data to our bucket.

In [4]:
bucket='titanic-bucket-mj'

prefix='titanic'

train_file='titanic_train.csv'
test_file='titanic_test.csv'
validate_file='titanic_validate.csv'

s3_resource = boto3.Session().resource('s3')

def upload_s3_csv(filename, folder, dataframe):
    csv_buffer = io.StringIO()
    dataframe.to_csv(csv_buffer, header=False, index=False)
    s3_resource.Bucket(bucket).Object(os.path.join(prefix, folder, filename)).put(Body=csv_buffer.getvalue())

upload_s3_csv(train_file, 'train', train)
upload_s3_csv(test_file, 'test', x_test)
upload_s3_csv(validate_file, 'validate', validate)
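
The upload helper writes the files without a header and, for the training and validation data, with the label in the first column. This is the CSV layout the built-in SageMaker XGBoost algorithm expects. The sketch below shows that layout and the resulting S3 key using only the standard library. The helper names `s3_key` and `to_headerless_csv` and the sample rows are hypothetical. It uses `posixpath.join` so the key always contains forward slashes, which `os.path.join` would not guarantee on Windows.

```python
import csv
import io
import posixpath

def s3_key(prefix, folder, filename):
    """Build an S3 object key; S3 keys always use '/' as separator."""
    return posixpath.join(prefix, folder, filename)

def to_headerless_csv(rows):
    """Serialize rows as CSV text without a header line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

key = s3_key("titanic", "train", "titanic_train.csv")

# Made-up rows: the label comes first, followed by the features
body = to_headerless_csv([(1, 3, 1, 28.0), (0, 2, 0, 43.0)])

print(key)
print(body)
```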

# Training model
We use the XGBoost model because PyCaret rated it as one of the better models, after Light Gradient Boosting and KNN.

In [5]:
container = retrieve('xgboost',boto3.Session().region_name,'1.0-1')

We set the parameters necessary for the XGBoost model. "num_round" sets the number of boosting rounds, and "eval_metric" determines how the model is evaluated (auc = Area Under the Curve). "objective" sets the learning objective: with "binary:logistic" the model outputs a value between 0 and 1. Later we apply a threshold to turn this value into 0 or 1, because those are the values that have to be predicted.

In [6]:
hyperparams={"num_round":"40",
             "eval_metric": "auc",
             "objective": "binary:logistic"}

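To make the "auc" metric and the later thresholding step concrete, here is a minimal sketch in plain Python. It computes the AUC as the fraction of (positive, negative) pairs that the scores rank correctly, and converts scores to 0 or 1 with a threshold. The function names, the threshold of 0.5 and the sample values are assumptions for illustration, not values taken from our model.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def to_labels(scores, threshold=0.5):
    """Map scores between 0 and 1 to hard 0/1 predictions."""
    return [1 if s >= threshold else 0 for s in scores]

# Made-up example values
labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]

auc = pairwise_auc(labels, scores)   # 3 of the 4 pairs are ranked correctly
predicted = to_labels(scores)

print(auc, predicted)
```

The AUC only depends on the ranking of the scores, so it can be computed before a threshold is chosen. The threshold matters only once hard 0/1 predictions are needed.
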
We set the output location and create the model.

In [7]:
s3_output_location = "s3://{}/{}/output/".format(bucket, prefix)
xgb_model = sagemaker.estimator.Estimator(container,
                                          sagemaker.get_execution_role(),
                                          instance_count=1,
                                          instance_type='ml.m4.xlarge',
                                          output_path=s3_output_location,
                                          hyperparameters=hyperparams,
                                          sagemaker_session=sagemaker.Session())

We also define the data channels for the training and validation datasets.

In [8]:
# Each channel points at the S3 prefix (folder) that holds the uploaded CSV file
train_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/".format(bucket, prefix),
    content_type='text/csv')

validate_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/".format(bucket, prefix),
    content_type='text/csv')

data_channels = {'train': train_channel, 'validation': validate_channel}

Now we train the model. It is automatically saved to the output folder we set earlier.

In [9]:
xgb_model.fit(inputs=data_channels, logs=False)


2024-11-28 11:36:09 Starting - Starting the training job..
2024-11-28 11:36:23 Starting - Preparing the instances for training......
2024-11-28 11:37:02 Downloading - Downloading input data......
2024-11-28 11:37:37 Downloading - Downloading the training image.........
2024-11-28 11:38:28 Training - Training image download completed. Training in progress......
2024-11-28 11:38:54 Uploading - Uploading generated training model.
2024-11-28 11:39:07 Completed - Training job completed


# Prepare for comparison
First we check that the test data we loaded earlier does not contain the "survived" column; otherwise the predictions won't work.

In [10]:
x_test.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked
0,3,1,28.0,0,0,11.35005,2
1,3,1,27.0,0,0,8.459572,2
2,2,1,17.0,0,0,16.042818,2
3,1,0,43.0,0,0,31.949337,2
4,2,1,0.0,1,1,28.165107,2


We uploaded the test data earlier, so now we only have to tell AWS where to find this file and where it may write the output. Once everything has been predicted, the file with the predictions can be found there. We do it this way because it is easier than loading the model again in another file and making the predictions there. With the other two models, predicting directly is easier, which is why we don't use this approach for them.

In [12]:
batch_output = "s3://{}/{}/batch-out/".format(bucket,prefix)
test_input = "s3://{}/{}/test/titanic_test.csv".format(bucket,prefix)

xgb_transformer = xgb_model.transformer(instance_count=1,
                                       instance_type='ml.m4.xlarge',
                                       strategy='MultiRecord',
                                       assemble_with='Line',
                                       output_path=batch_output)

xgb_transformer.transform(data=test_input,
                         data_type='S3Prefix',
                         content_type='text/csv',
                         split_type='Line')
xgb_transformer.wait()
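
Batch transform stores its result under the output path with the input file name plus the suffix ".out". The sketch below shows how that key can be derived and how the file content could be turned into a list of scores. It assumes one score per record, separated by newlines or commas. The helper names and the sample text are hypothetical, and no real predictions are shown.

```python
import posixpath

def batch_output_key(prefix, input_filename, out_folder="batch-out"):
    """Key of the batch transform result: the input file name plus '.out'."""
    return posixpath.join(prefix, out_folder, input_filename + ".out")

def parse_scores(body):
    """Turn the text of the .out file into floats (newline or comma separated)."""
    return [float(token) for token in body.replace(",", " ").split()]

key = batch_output_key("titanic", "titanic_test.csv")

# In the notebook, the text could be read with boto3, for example:
#   body = s3_resource.Object(bucket, key).get()["Body"].read().decode("utf-8")
sample_body = "0.12\n0.87\n0.45\n"   # made-up values for illustration
scores = parse_scores(sample_body)

print(key, scores)
```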

[2024-11-28:12:13:31:INFO] No GPUs detected (normal if no gpus installed)
[2024-11-28:12:13:31:INFO] No GPUs detected (normal if no gpus installed)
[2024-11-28:12:13:31:INFO] nginx config:
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combin