## Batch Transform

Now we use "today's" features to generate predictions that the business unit will use as input for promotions.

To do this, we deploy the model produced by the best training job of the hyperparameter tuning job and use the resulting endpoint for inference.

In [1]:
import sagemaker
import boto3
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import os 
import time
from sagemaker.predictor import csv_serializer, RealTimePredictor

# take the best training job from notebook #3
best_training_job = 'forecast-vtas-190702-1408-004-46a4c8e2'

In [13]:
df = pd.read_csv('to_predict.csv',header=None)

In [19]:
df.shape

(12347, 69)

In [15]:
id_reseller = pd.read_csv('id_reseller_to_predict.csv',header=None)[0]

In [17]:
id_reseller.shape

(12347,)

In [3]:
model = Estimator.attach(best_training_job)

2019-07-02 14:16:52 Starting - Preparing the instances for training
2019-07-02 14:16:52 Downloading - Downloading input data
2019-07-02 14:16:52 Training - Training image download completed. Training in progress.
2019-07-02 14:16:52 Uploading - Uploading generated training model
2019-07-02 14:16:52 Completed - Training job completed
Arguments: train
[2019-07-02:14:11:42:INFO] Running standalone xgboost training.
[2019-07-02:14:11:42:INFO] Setting up HPO optimized metric to be : mae
[2019-07-02:14:11:42:INFO] File size need to be processed in the node: 283.91mb. Available memory size in the node: 8448.75mb
[2019-07-02:14:11:42:INFO] Determined delimiter of CSV input is ','
[14:11:42] S3DistributionType set as FullyReplicated
[14:11:44] 825465x69 matrix with 56922242 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2019-07-02:14:11:44:INFO] Determined delimiter of CSV input is ','
[

Billable seconds: 364


In [4]:
model_predictor = model.deploy(initial_instance_count=1,
                            instance_type='ml.m4.xlarge')

--------------------------------------------------------------------------------------------------!

In [None]:
#model_predictor = RealTimePredictor('forecast-vtas-190528-2006-003-486b05ad')

In [5]:
model_predictor.content_type = 'text/csv'
model_predictor.serializer = csv_serializer
model_predictor.deserializer = None

In [27]:
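def predict(data, rows=500):
    # Split the data into chunks of roughly `rows` records so each request stays small
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        # Each response is a comma-separated byte string of scores
        predictions = ','.join([predictions, model_predictor.predict(array).decode('utf-8')])

    # Drop the leading comma left by the first join, then parse the scores
    return np.fromstring(predictions[1:], sep=',')

# DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
predictions = predict(df.values)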
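
The batching logic in `predict` can be tried without a live endpoint. The following is a minimal sketch using only the standard library: `split_evenly` mimics how `np.array_split` sizes its chunks, and `fake_endpoint_predict` is a hypothetical stand-in for `model_predictor.predict` that returns the sum of each row as its "score". Both names, and the comma-separated response format, are assumptions made for illustration.

```python
import csv
import io

def split_evenly(rows, n_chunks):
    """Split rows into n_chunks nearly equal parts, like np.array_split."""
    base, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more row
        chunks.append(rows[start:start + size])
        start += size
    return chunks

def to_csv_payload(rows):
    """Serialize rows to CSV text, one record per line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue().rstrip("\n")

def fake_endpoint_predict(payload):
    """Hypothetical endpoint stub: one score per row, as comma-separated bytes."""
    scores = [sum(float(v) for v in line.split(",")) for line in payload.split("\n")]
    return ",".join(str(s) for s in scores).encode("utf-8")

def predict_in_batches(rows, batch_rows=500):
    # Same chunk-count formula as the predict helper above
    n_chunks = int(len(rows) / float(batch_rows) + 1)
    parts = []
    for chunk in split_evenly(rows, n_chunks):
        if not chunk:  # nothing to send for an empty chunk
            continue
        parts.append(fake_endpoint_predict(to_csv_payload(chunk)).decode("utf-8"))
    if not parts:
        return []
    return [float(x) for x in ",".join(parts).split(",")]

rows = [[i, 2 * i] for i in range(1203)]
predictions = predict_in_batches(rows)
chunk_sizes = [len(c) for c in split_evenly(rows, 3)]
```

Note that the formula yields chunks that are usually somewhat smaller than `rows`: the data is divided into equal parts rather than filled up to the limit.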

In [28]:
predictions.shape

(12347,)

In [29]:
df_predictions  = pd.DataFrame({'id_reseller':id_reseller,'prediction':predictions})

In [30]:
df_predictions.head()

| | id_reseller | prediction |
|---|---|---|
| 0 | 499921233 | 202316.296875 |
| 1 | 499921235 | 59766.675781 |
| 2 | 499921241 | 39696.0625 |
| 3 | 499921250 | 14430.231445 |
| 4 | 499921253 | 10196.595703 |
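
Pairing reseller ids with predictions relies on both sequences having the same length and order. A minimal sketch of a guard for that assumption is shown below; `pair_ids_with_predictions` is a hypothetical helper, not part of the notebook, and the sample values are the first two rows of the output above.

```python
def pair_ids_with_predictions(ids, predictions):
    """Pair ids and predictions by position, failing loudly on a length mismatch."""
    if len(ids) != len(predictions):
        raise ValueError(
            "got %d ids but %d predictions" % (len(ids), len(predictions))
        )
    return [
        {"id_reseller": i, "prediction": p}
        for i, p in zip(ids, predictions)
    ]

paired = pair_ids_with_predictions(
    [499921233, 499921235],
    [202316.296875, 59766.675781],
)
```

A check like this would catch a chunk of predictions lost during batching before the results reach the business unit.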


Finally, we upload the predictions to S3.

In [None]:
role = sagemaker.get_execution_role()
bucket = sagemaker.Session().default_bucket()                     
prefix = 'predictions'

In [None]:
df_predictions.to_csv('predictions.csv',index=False)

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'predictions.csv')).upload_file('predictions.csv')
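
S3 object keys use `/` as the separator regardless of the operating system, so the way the key is assembled matters if the notebook is ever run outside a Linux environment. The sketch below, which assumes the same `prefix` and file name as the cells above, compares `posixpath.join` with the Windows-style join from `ntpath`.

```python
import ntpath
import posixpath

prefix = "predictions"        # same prefix as in the cell above
filename = "predictions.csv"

# posixpath always joins with "/", which matches S3 key conventions.
s3_key = posixpath.join(prefix, filename)

# ntpath shows what a Windows-style join would produce, for comparison.
windows_style = ntpath.join(prefix, filename)

print(s3_key)
print(windows_style)
```

On a SageMaker notebook instance, which runs Linux, `os.path.join` produces the same key as `posixpath.join`.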