# SageMaker Deployment

## Using Local Model

Before deploying with SageMaker, let us check that the locally saved model can be used to make inferences.

To keep the process simple, only the XGBoost model will be deployed to a SageMaker endpoint.

In [2]:
# %pip install -r requirements.txt --quiet

Note: you may need to restart the kernel to use updated packages.


In [None]:
# %pip install pandas numpy scikit-learn xgboost joblib --quiet

In [3]:
import joblib
import os
import pandas as pd

In [42]:
data_path = os.path.join(".", "data", "test.csv")
test_df = pd.read_csv(data_path)
test_data = test_df.iloc[:2]
test_data

Unnamed: 0,ID,Levy,Manufacturer,Model,Prod. year,Category,Leather interior,Fuel type,Engine volume,Mileage,Cylinders,Gear box type,Drive wheels,Doors,Wheel,Color,Airbags,Price
0,44020629,-,VOLKSWAGEN,Golf,2012,Hatchback,No,Diesel,2.0 Turbo,0 km,4,Manual,Front,02-Mar,Left wheel,Grey,10,
1,45784798,-,HYUNDAI,Sonata,2012,Sedan,Yes,Petrol,2.4,26000 km,4,Tiptronic,Front,04-May,Left wheel,Grey,10,


In [5]:
rf_path = "./models/RandomForestRegressor_v20240114.joblib"
xgb_path = "./models/XGBRegressor_v20240113.joblib"
if os.path.isfile(rf_path):
    rf = joblib.load(rf_path)

if os.path.isfile(xgb_path):
    xgb = joblib.load(xgb_path)

Let us take a look at the object saved for the random forest.

It has a dictionary structure containing a model, a date and a version. We only need the model itself to make predictions.

In [6]:
rf

{'model': Pipeline(steps=[('columntransformer',
                  ColumnTransformer(transformers=[('functiontransformer-1',
                                                   FunctionTransformer(func=<function preprocess_levy_and_fillna at 0x7f284559cd30>),
                                                   ['Prod. year', 'Levy']),
                                                  ('functiontransformer-2',
                                                   FunctionTransformer(func=<function extract_numeric_features at 0x7f284559c790>,
                                                                       kw_args={'columns_to_extract': ['Mileage',
                                                                                                       'Engine '
                                                                                                       'volu...
                                                                 handle_unknown='ignore',
                              

In [7]:
rf_model = rf["model"]
rf_model
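
The lookup above assumes that the saved artifact is a dictionary with a `model` entry. A minimal sketch of a helper that makes this assumption explicit is shown here. `extract_model` is a hypothetical name, and the `date` and `version` key names in the stand-in artifact are assumed from the description above rather than read from the truncated output.

```python
def extract_model(artifact, model_key="model"):
    """Return the fitted estimator stored under `model_key` in a saved artifact."""
    if not isinstance(artifact, dict):
        # Some artifacts are saved as the bare estimator; pass those through unchanged.
        return artifact
    if model_key not in artifact:
        raise KeyError(f"expected key {model_key!r}, found {sorted(artifact)}")
    return artifact[model_key]


# Stand-in artifact for illustration; the real one comes from joblib.load(rf_path).
example_artifact = {
    "model": "fitted-pipeline",  # placeholder for the scikit-learn Pipeline
    "date": "2024-01-14",        # assumed key name
    "version": "v20240114",      # assumed key name
}
print(extract_model(example_artifact))
```

With the real objects this would be called as `extract_model(rf)`, and a missing key would fail with a message listing the keys that are actually present.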

Predicting with the test data

In [8]:
rf_model.predict(test_data)

array([26821.51119488, 17481.25051534])

It is clear that the model can be used to make inferences. Next, let's deploy a model to an endpoint so that users can access it.

## Using SageMaker

To simplify the process, we will retrain a built-in SageMaker XGBoost model and deploy it to an endpoint.

In [9]:
import sagemaker
import boto3

from sagemaker import image_uris
from sagemaker.session import s3_input, Session

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [10]:
bucket_name = 'car-prices-prediction'
my_region = boto3.session.Session().region_name
prefix = "20240223"

In [11]:
train_format = f's3://{bucket_name}/data/cleaned/{prefix}/train_v{prefix}.csv'
test_format = f's3://{bucket_name}/data/cleaned/{prefix}/val_v{prefix}.csv'
output_path = f's3://{bucket_name}/sagemaker_output/{prefix}'
print(f"Output path: {output_path}")

Output path: s3://car-prices-prediction/sagemaker_output/20240223


In [12]:
train_format

's3://car-prices-prediction/data/cleaned/20240223/train_v20240223.csv'

In [13]:
test_format

's3://car-prices-prediction/data/cleaned/20240223/val_v20240223.csv'
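
The three S3 locations above follow one pattern: bucket, folder, date prefix, file name. As a rough sketch, the pattern can be captured in a small helper so that a stray slash does not end up in a path. `build_s3_uri` is a hypothetical helper, not part of `boto3` or `sagemaker`.

```python
def build_s3_uri(bucket, *parts):
    """Join a bucket name and key parts into an s3:// URI without doubled slashes."""
    cleaned = [str(part).strip("/") for part in parts]
    key = "/".join(part for part in cleaned if part)
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


bucket_name = "car-prices-prediction"
prefix = "20240223"

train_uri = build_s3_uri(bucket_name, "data/cleaned", prefix, f"train_v{prefix}.csv")
val_uri = build_s3_uri(bucket_name, "data/cleaned", prefix, f"val_v{prefix}.csv")
output_uri = build_s3_uri(bucket_name, "sagemaker_output", prefix)

print(train_uri)
print(val_uri)
print(output_uri)
```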

Creating the train and validation data

In [14]:
s3_train_data = sagemaker.inputs.TrainingInput(s3_data=train_format, content_type='csv')
s3_val_data = sagemaker.inputs.TrainingInput(s3_data=test_format, content_type='csv')

## Training Model

In [15]:
container = image_uris.retrieve('xgboost', boto3.Session().region_name, version='latest')
container

'811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest'

In [16]:
xgb_best_params = {
    'colsample_bytree': 0.9976711378207281,
    'min_child_weight': 3.157431071502319,
    'learning_rate': 0.01094008867924328,
    'gamma': 0.7032856276348437,
    'max_depth': 10,
    'n_estimators': 351,
    'subsample': 0.9118162108383395,
    'reg_alpha': 0.43473034704658026,
    'reg_lambda': 0.2305809419908718,
    'num_round': 50,
}
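
Several keys above use the scikit-learn wrapper's names (`learning_rate`, `reg_alpha`, `reg_lambda`, `n_estimators`), while native XGBoost calls the same settings `eta`, `alpha` and `lambda`, and the built-in algorithm takes the number of boosting rounds as `num_round`. This notebook does not verify whether the container version used here honors the wrapper names, so the sketch below shows one possible way to translate them. `SKLEARN_TO_NATIVE` and `to_native_hyperparameters` are hypothetical names.

```python
# scikit-learn wrapper name -> native XGBoost / built-in algorithm name
SKLEARN_TO_NATIVE = {
    "learning_rate": "eta",
    "reg_alpha": "alpha",
    "reg_lambda": "lambda",
    "n_estimators": "num_round",
}


def to_native_hyperparameters(params):
    """Rename scikit-learn style keys; an explicitly given native key wins."""
    native = {}
    for name, value in params.items():
        target = SKLEARN_TO_NATIVE.get(name, name)
        if target != name and target in params:
            # Both spellings were supplied (e.g. n_estimators and num_round):
            # keep the explicit native value and skip the wrapper-style one.
            continue
        native[target] = value
    return native


example_params = {
    "learning_rate": 0.01094008867924328,
    "max_depth": 10,
    "n_estimators": 351,
    "reg_alpha": 0.43473034704658026,
    "reg_lambda": 0.2305809419908718,
    "num_round": 50,
}
native_params = to_native_hyperparameters(example_params)
print(sorted(native_params))
```

Note that the dictionary above sets both `n_estimators` (351) and `num_round` (50). The sketch keeps `num_round`, which is an assumption about intent and should be checked against the tuning run that produced these values.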

In [17]:
# Create a SageMaker Estimator for the training job
estimator = sagemaker.estimator.Estimator(image_uri=container,
                                          hyperparameters=xgb_best_params,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1,
                                          instance_type='ml.m5.2xlarge',
                                          volume_size=5,  # EBS volume size in GB
                                          output_path=output_path,
                                          use_spot_instances=True,
                                          max_run=300,
                                          max_wait=600)
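
With managed spot training, `max_wait` covers both the time spent waiting for spot capacity and the training run itself, so it has to be at least as large as `max_run`. The following is a minimal sketch of a local check that could be run before creating the estimator. `check_spot_settings` is a hypothetical helper and not part of the SageMaker SDK.

```python
def check_spot_settings(use_spot_instances, max_run, max_wait=None):
    """Validate spot time limits and return the seconds allowed beyond max_run."""
    if not use_spot_instances:
        # No spot-specific constraint to check for on-demand training.
        return 0
    if max_wait is None:
        raise ValueError("max_wait must be set when use_spot_instances is True")
    if max_wait < max_run:
        raise ValueError(
            f"max_wait ({max_wait}s) must be >= max_run ({max_run}s) for spot training"
        )
    return max_wait - max_run


# Values taken from the estimator definition above.
slack_seconds = check_spot_settings(use_spot_instances=True, max_run=300, max_wait=600)
print(slack_seconds)
```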

In [18]:
estimator.fit({'train':s3_train_data, 'validation':s3_val_data})

INFO:sagemaker:Creating training-job with name: xgboost-2024-03-04-04-05-26-047


2024-03-04 04:05:26 Starting - Starting the training job...
2024-03-04 04:05:42 Starting - Preparing the instances for training...
2024-03-04 04:06:15 Downloading - Downloading input data...
2024-03-04 04:06:30 Downloading - Downloading the training image...
2024-03-04 04:07:06 Training - Training image download completed. Training in progress..
Arguments: train
[2024-03-04:04:07:24:INFO] Running standalone xgboost training.
[2024-03-04:04:07:24:INFO] File size need to be processed in the node: 2.18mb. Available memory size in the node: 23955.0mb
[2024-03-04:04:07:24:INFO] Determined delimiter of CSV input is ','
[04:07:24] S3DistributionType set as FullyReplicated
[04:07:24] 15389x23 matrix with 353947 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2024-03-04:04:07:24:INFO] Determined delimiter of CSV input is ','
[04:07:24] S3DistributionType set as FullyReplicated
[04:07:

## Deploying The Model To An Endpoint

Next we need to deploy the model to an endpoint. This can be done programmatically.

In [19]:
xgb_predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: xgboost-2024-03-04-04-13-05-983
INFO:sagemaker:Creating endpoint-config with name xgboost-2024-03-04-04-13-05-983
INFO:sagemaker:Creating endpoint with name xgboost-2024-03-04-04-13-05-983


-----!

**Before we make predictions we need to preprocess the test data**

In [23]:
from utils.clean_and_return_dataframe import clean_and_return_dataframe

In [61]:
cleaned_test_df = clean_and_return_dataframe(df=test_df, has_header=True)
cleaned_test_df.head()

Unnamed: 0,Prod. year,Levy,Mileage,Engine volume,Manufacturer,Model,Fuel type,Leather interior_Yes,Gear box type_Automatic,Gear box type_Manual,...,Category_Goods wagon,Category_Hatchback,Category_Jeep,Category_Limousine,Category_Microbus,Category_Minivan,Category_Pickup,Category_Sedan,Category_Universal,Price
0,2012.0,789.0,0.0,2.0,58.0,521.0,1.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,2012.0,789.0,26000.0,2.4,23.0,837.0,5.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,
2,2005.0,1925.0,168000.0,1.5,40.0,885.0,5.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,
3,2012.0,975.0,143000.0,3.2,59.0,970.0,5.0,1.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,1993.0,5220.0,200000.0,1.6,41.0,165.0,5.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,


Next, we need to drop the `Price` column and remove the headers, because the endpoint expects feature values only.

In [62]:
cleaned_test_df = cleaned_test_df.drop("Price", axis=1)

In [63]:
cleaned_test_data = cleaned_test_df.iloc[:2]
# cleaned_test_data.columns = cleaned_test_data.iloc[0]  # Use the first row as headers
# cleaned_test_data = cleaned_test_data[1:]  # Drop the first row since it's now the header
cleaned_test_data

Unnamed: 0,Prod. year,Levy,Mileage,Engine volume,Manufacturer,Model,Fuel type,Leather interior_Yes,Gear box type_Automatic,Gear box type_Manual,...,Category_Coupe,Category_Goods wagon,Category_Hatchback,Category_Jeep,Category_Limousine,Category_Microbus,Category_Minivan,Category_Pickup,Category_Sedan,Category_Universal
0,2012.0,789.0,0.0,2.0,58.0,521.0,1.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2012.0,789.0,26000.0,2.4,23.0,837.0,5.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
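
The endpoint receives plain CSV text, so the request body should contain one line per row with neither a header line nor the target column. A rough sketch of building such a payload with the standard library is given below. `rows_to_csv_payload` is a hypothetical helper, and the sample rows keep only three of the feature columns for brevity.

```python
import csv
import io


def rows_to_csv_payload(rows, drop_columns=("Price",)):
    """Serialize dict rows to header-less CSV text, skipping `drop_columns`."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Only the values are written; the column names never reach the payload.
        writer.writerow(
            [value for name, value in row.items() if name not in drop_columns]
        )
    return buffer.getvalue().rstrip("\n")


# Truncated stand-ins for the first two cleaned rows shown above.
sample_rows = [
    {"Prod. year": 2012.0, "Levy": 789.0, "Mileage": 0.0, "Price": None},
    {"Prod. year": 2012.0, "Levy": 789.0, "Mileage": 26000.0, "Price": None},
]
payload = rows_to_csv_payload(sample_rows)
print(payload)
```

For a pandas DataFrame, `cleaned_test_data.to_csv(header=False, index=False)` produces the same kind of header-less text.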


In [71]:
test_data

Unnamed: 0,ID,Levy,Manufacturer,Model,Prod. year,Category,Leather interior,Fuel type,Engine volume,Mileage,Cylinders,Gear box type,Drive wheels,Doors,Wheel,Color,Airbags,Price
0,44020629,-,VOLKSWAGEN,Golf,2012,Hatchback,No,Diesel,2.0 Turbo,0 km,4,Manual,Front,02-Mar,Left wheel,Grey,10,
1,45784798,-,HYUNDAI,Sonata,2012,Sedan,Yes,Petrol,2.4,26000 km,4,Tiptronic,Front,04-May,Left wheel,Grey,10,


In [67]:
from sagemaker.serializers import CSVSerializer

**Making predictions**

In [21]:
rf_model.predict(test_data)

array([26821.51119488, 17481.25051534])

In [64]:
xgb_predictor.serializer = CSVSerializer()
predictions = xgb_predictor.predict(cleaned_test_data).decode('utf-8')

In [70]:
predictions

'2008.5054931640625,2011.808349609375'
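
The endpoint returns its predictions as one comma-separated string. A small sketch for turning that string into numbers follows. `parse_predictions` is a hypothetical helper, and accepting newline-separated values as well is an assumption, made in case another container version formats its response that way.

```python
def parse_predictions(response_text):
    """Split a comma- or newline-separated prediction string into floats."""
    tokens = response_text.replace("\n", ",").split(",")
    return [float(token) for token in tokens if token.strip()]


raw_response = "2008.5054931640625,2011.808349609375"
values = parse_predictions(raw_response)
print([round(value, 2) for value in values])
```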

## Conclusion

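The two models give very different predictions for the same two cars:
* The XGBoost endpoint predicted `$2008.51` and `$2011.81`.
* The random forest model predicted `$26821.51` and `$17481.25`.

The XGBoost values are close to the `Prod. year` inputs (2012), and the training log reports `label_column=0`, so it is worth checking that `Price`, rather than `Prod. year`, is the first column of the training CSV.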
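
To put a number on the gap, the sketch below divides each random forest prediction by the corresponding XGBoost prediction, using the values printed earlier in the notebook. `prediction_ratios` is a hypothetical helper.

```python
def prediction_ratios(reference, candidate):
    """Return reference[i] / candidate[i] for each pair of predictions."""
    if len(reference) != len(candidate):
        raise ValueError("both prediction lists must have the same length")
    return [ref / cand for ref, cand in zip(reference, candidate)]


rf_predictions = [26821.51119488, 17481.25051534]
xgb_predictions = [2008.5054931640625, 2011.808349609375]

ratios = prediction_ratios(rf_predictions, xgb_predictions)
print([round(ratio, 1) for ratio in ratios])
```

The random forest predictions are roughly an order of magnitude larger, which is too large a gap to come from model choice alone.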


## Spin Down Resources

In [72]:
xgb_predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: xgboost-2024-03-04-04-13-05-983
INFO:sagemaker:Deleting endpoint with name: xgboost-2024-03-04-04-13-05-983
