# Using SageMaker 'script mode'

In this notebook we first load scikit-learn directly inside the notebook and I show you its different parts. Then we use that knowledge to verify that training works locally: in this Jupyter instance, we import sklearn, put the code in notebook cells, run the training, and create a model.

If everything works, we use the code in this notebook as a reference to create a script file (script.py, which you also need to upload to the Jupyter file system). The SageMaker SKLearn container will host our code, so we load that container and pass our script file to it later.

script.py is essentially a converted version of the cells in this notebook that come before the SageMaker part. We create a function for each cell in that script and pass the parameters from this notebook, so that the script can hook into SageMaker and run inside a SageMaker built-in container (in this case, the sklearn container).


## Create a dummy dataset and save it to disk

Although we could get a data file from many different sources, my focus here is to show you how SageMaker script mode works. So I use the datasets module in sklearn to generate some dummy data.

We also use the pickle library to serialize and deserialize the data when we need it.

In [1]:
from sklearn import datasets
import pickle

We use the make_regression function from the datasets module in scikit-learn to create 100 synthetic data points. The noise argument provides the randomness in the target, and the bias argument sets the intercept.

make_regression generates a random regression problem: the input features X are sampled from a **standard normal distribution**, and the target variable y is a linear combination of the input features with some added Gaussian noise.

The arguments to the make_regression function are as follows:

**100** (n_samples): This specifies the number of samples to generate. In this case, we are generating a dataset with 100 samples.

**1** (n_features): This specifies the number of input features to generate for each sample. In this case, we are generating a dataset with one input feature.

**noise=5**: This specifies the standard deviation of the Gaussian noise added to the target variable y. In this case, the noise has a standard deviation of 5.

**bias=0**: This specifies the intercept term in the linear model used to generate the target variable y. In this case, we are setting the intercept term to 0.

In [2]:
X, y = datasets.make_regression(100, 1, noise=5, bias=0)
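Because make_regression draws random samples, each run of the cell above produces different numbers. Here is a minimal sketch, assuming you want reproducible data and want to see the slope that was used to generate y. The random_state and coef arguments are standard make_regression parameters; the seed value 0 and the variable names are my own choices and are not part of the original notebook.

```python
from sklearn import datasets

# Keyword arguments spell out what the positional 100 and 1 mean.
# random_state=0 is an arbitrary seed chosen for this illustration.
X_demo, y_demo, true_coef = datasets.make_regression(
    n_samples=100, n_features=1, noise=5, bias=0, coef=True, random_state=0
)

print(X_demo.shape)  # (100, 1): 100 samples with 1 feature each
print(y_demo.shape)  # (100,): one target value per sample
print(true_coef)     # slope of the underlying linear model, before noise is added
```

A fitted model should recover a slope close to true_coef, which gives you a quick sanity check on the training code.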

In [11]:
X, y

(array([[ 0.1154211 ],
        [-0.53797108],
        [ 1.28411277],
        [ 0.41022989],
        [-0.71581683],
        [-0.11103686],
        [ 0.21688263],
        [-0.96370828],
        [ 1.25859056],
        [-0.92132718],
        [-0.00442608],
        [-0.45492271],
        [ 0.35162384],
        [ 0.71364796],
        [ 0.34503081],
        [ 1.59984851],
        [-0.0839448 ],
        [ 0.31472825],
        [ 0.95771149],
        [ 0.3927157 ],
        [ 0.09097232],
        [ 0.9225186 ],
        [-0.40496473],
        [-1.29520659],
        [-0.34307335],
        [ 0.61456245],
        [ 0.43930244],
        [ 0.55882438],
        [ 0.82420809],
        [ 0.87844747],
        [ 0.20925097],
        [ 0.93236776],
        [ 0.87548411],
        [-0.42784255],
        [-1.34047366],
        [ 1.17498392],
        [-1.42331904],
        [-0.02370424],
        [-0.36027973],
        [-0.23379689],
        [-0.56051278],
        [-0.15642898],
        [-0.00813131],
        [ 0

Then we save those data points into a file called "train.pickle":

In [12]:
# The with-block closes the file once the data is written
with open('./train.pickle', 'wb') as f:
    pickle.dump([X, y], f)

## Create a model from the dataset

In the next step, we import the LinearRegression algorithm from sklearn and fit it to the data generated in the previous section to create a model:

In [13]:
from sklearn.linear_model import LinearRegression

We load the data that we serialized to disk in the previous section:

In [14]:
with open('./train.pickle', 'rb') as f:
    loaded_X, loaded_y = pickle.load(f)
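The save and load steps can be checked with a small round trip. This is only a sketch: it uses plain Python lists as stand-ins for X and y and writes to a temporary folder, both of which are my assumptions so that the example runs anywhere. Keep in mind that pickle can execute code while loading, so only unpickle files you trust.

```python
import os
import pickle
import tempfile

# Stand-in data; in the notebook these would be the X and y arrays.
features = [[0.1], [0.2], [0.3]]
targets = [1.0, 2.0, 3.0]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.pickle")

    # Each with-block closes its file even if pickling raises an error.
    with open(path, "wb") as f:
        pickle.dump([features, targets], f)

    with open(path, "rb") as f:
        loaded_features, loaded_targets = pickle.load(f)

# The loaded objects are equal to what was saved.
print(loaded_features == features and loaded_targets == targets)
```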

We create an instance of the algorithm, call it "model", and then fit it to the loaded data:

In [15]:
model = LinearRegression()

In [16]:
model.fit(loaded_X,loaded_y)

## Test the model 

We now have a model and want to test whether it works. To do that, we pass an array of inputs and check that it returns predictions:

In [17]:
model.predict([[0],[1],[2],[3]])

array([-0.89607839, 26.02072813, 52.93753466, 79.85434118])
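With a single feature, the fitted model is just a straight line, so these four predictions are enough to recover it. The sketch below works only from the printed output above: the prediction at 0 is the intercept, and the difference between consecutive predictions is the slope. On the fitted object, the same values are available as model.intercept_ and model.coef_. The numbers are specific to this run and will differ when the data is regenerated.

```python
# Predictions printed by the notebook for the inputs 0, 1, 2 and 3.
printed = [-0.89607839, 26.02072813, 52.93753466, 79.85434118]

intercept = printed[0]           # prediction at x = 0
slope = printed[1] - printed[0]  # change in prediction per unit of x

# Rebuild all four predictions from intercept + slope * x.
reconstructed = [intercept + slope * x for x in range(4)]

for x, (rebuilt, original) in enumerate(zip(reconstructed, printed)):
    print(x, round(rebuilt, 6), round(original, 6))
```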

So that works. The next step is to save this model to disk and load it again to make sure it still works after loading.

## Save the model to a file

We use the same serialization library that we used to save and load the data, but this time for the model:

In [18]:
# dumps() returns the serialized model as bytes in memory; p is not used later
p = pickle.dumps(model)

In [19]:
with open('./model.pickle', 'wb') as f:
    pickle.dump(model, f)

## Load the model from a file and test it again

In [20]:
with open('./model.pickle', 'rb') as f:
    loaded_model = pickle.load(f)

In [21]:
loaded_model.predict([[0],[1],[2],[3]])

array([-0.89607839, 26.02072813, 52.93753466, 79.85434118])
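The two outputs match, which is what we expect from a pickle round trip. If you prefer to check this in code rather than by eye, here is a minimal sketch. It uses a tiny made-up dataset (y = 2x + 1) instead of the notebook's data, and it goes through pickle.dumps and pickle.loads, the in-memory counterparts of dump and load.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in dataset with made-up values: y = 2x + 1 exactly.
X_small = np.array([[0.0], [1.0], [2.0], [3.0]])
y_small = 2.0 * X_small.ravel() + 1.0

original = LinearRegression().fit(X_small, y_small)

# Round trip through bytes; writing to and reading from a file behaves the same way.
restored = pickle.loads(pickle.dumps(original))

# Compare the predictions of both models numerically.
same = np.allclose(original.predict(X_small), restored.predict(X_small))
print(same)
```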

# SageMaker Training in script mode

From this cell on, the SageMaker part starts. script.py will contain functions for the cells above.

Now that we have code that loads the data and creates, saves, and reloads a model, we want to hand that code to SageMaker and let SageMaker create the model. The Amazon SageMaker Python SDK treats well-known ML libraries such as scikit-learn, PyTorch, TensorFlow, and MXNet as first-class citizens and lets you use them inside SageMaker. You only need to tell the SDK which framework you want to use; the rest is passing arguments and parameters to the container that will run your code.
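The actual script.py is not reproduced in this notebook, so the following is only a hypothetical sketch of what such a script could look like. It assumes the training file is named train.pickle, that the model is saved as model.pickle, and that the helper str2bool is used to read boolean hyperparameters. SM_MODEL_DIR and SM_CHANNEL_TRAIN are the environment variables that SageMaker sets inside the training container, and model_fn is the function the scikit-learn serving container calls to load the model. The normalize argument is parsed but not passed on, because newer scikit-learn releases no longer accept it. A real script.py would end with `if __name__ == "__main__": train(parse_args())`; here a local dry run takes its place so that the sketch can be executed anywhere.

```python
import argparse
import os
import pickle
import tempfile

from sklearn.linear_model import LinearRegression


def str2bool(value):
    """Hypothetical helper: turn 'True'/'False' text into a real boolean."""
    return str(value).strip().lower() in ("true", "1", "yes")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments.
    parser.add_argument("--copy_X", type=str2bool, default=True)
    parser.add_argument("--fit_intercept", type=str2bool, default=True)
    parser.add_argument("--normalize", type=str2bool, default=False)
    # Inside the container these defaults come from SageMaker's environment variables.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "."))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "."))
    return parser.parse_args(argv)


def train(args):
    # Load the pickled [X, y] pair from the training channel folder.
    with open(os.path.join(args.train, "train.pickle"), "rb") as f:
        X, y = pickle.load(f)
    # normalize is intentionally not forwarded (see the note above).
    model = LinearRegression(copy_X=args.copy_X, fit_intercept=args.fit_intercept)
    model.fit(X, y)
    # Everything written to the model folder ends up in model.tar.gz.
    with open(os.path.join(args.model_dir, "model.pickle"), "wb") as f:
        pickle.dump(model, f)
    return model


def model_fn(model_dir):
    """Load the model for inference; called by the serving container."""
    with open(os.path.join(model_dir, "model.pickle"), "rb") as f:
        return pickle.load(f)


# Local dry run: temporary folders stand in for the container's folders.
with tempfile.TemporaryDirectory() as data_dir, tempfile.TemporaryDirectory() as out_dir:
    with open(os.path.join(data_dir, "train.pickle"), "wb") as f:
        pickle.dump([[[0.0], [1.0], [2.0]], [1.0, 3.0, 5.0]], f)  # y = 2x + 1
    demo_args = parse_args(["--train", data_dir, "--model-dir", out_dir])
    train(demo_args)
    demo_prediction = model_fn(out_dir).predict([[3.0]])

print(demo_prediction)
```

Keeping the training logic in a function with explicit arguments is a design choice that lets you test the script locally, as above, before paying for a training instance.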

In the following code, we import the SKLearn estimator from the SageMaker SDK along with the other required libraries:

In [22]:
import sagemaker
from sagemaker.sklearn.estimator import SKLearn
import boto3
import os

Since we are going to use SageMaker, we need to set the role that the training instance will use, as well as the S3 locations of the data files and the model artifact. Below, we create some variables that are used throughout this process:

In [23]:
role = sagemaker.get_execution_role()
sess = sagemaker.Session()
bucket = sess.default_bucket()

s3_prefix = "script-mode"
data_s3_prefix = f"{s3_prefix}/training_data"
data_train_s3_uri = f"s3://{bucket}/{data_s3_prefix}"

# Current working directory, where train.pickle was saved
train_dir = os.getcwd()


Let's see the values of the above variables before we use them:

In [24]:
role

'arn:aws:iam::566462208046:role/LabRole'

In [25]:
s3_prefix

'script-mode'

In [26]:
data_s3_prefix

'script-mode/training_data'

In [27]:
data_train_s3_uri

's3://sagemaker-us-east-1-566462208046/script-mode/training_data'

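The following permission change is needed to run the script successfully. Because source_dir is set to the current directory later on, the SDK packages everything in that directory, including the lost+found folder, which is presumably not readable by the notebook user by default: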

In [28]:
!sudo chmod 777 lost+found

Upload the training data to S3, so it's available for SageMaker training:

In [29]:
s3_resource_bucket = boto3.Session().resource("s3").Bucket(bucket)
# S3 keys always use forward slashes, so build the key with an f-string
s3_resource_bucket.Object(f"{data_s3_prefix}/train.pickle").upload_file(
    os.path.join(train_dir, "train.pickle")
)

After running the cell above, the data file is uploaded to the script-mode/training_data/ prefix in the default SageMaker bucket, so the data is now on S3.

In the code I showed you earlier, before getting into the SageMaker part, I did not set any hyperparameters and just used the default values. This time, I am going to show you how to set them, if you want to, and how these hyperparameters are passed to the SageMaker container later. To understand how these hyperparameters relate to the script we create later, see the following three argument definitions and compare them with the next cell:

    parser.add_argument("--copy_X",        type=bool, default=True)
    parser.add_argument("--fit_intercept", type=bool, default=True)
    parser.add_argument("--normalize",     type=bool, default=False)

In [30]:
hyperparameters = {
    "copy_X": True,
    "fit_intercept": True,
    "normalize": False,
}
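One caveat about the argument definitions above: argparse applies type=bool by calling bool() on the text it receives, and any non-empty string, including "False", evaluates to True. Hyperparameters typically reach the script as text on the command line, so a "normalize": False setting can silently turn into True. The sketch below shows the effect and one possible workaround; the argv list imitates what the container would pass, and str2bool is a hypothetical helper, not something taken from the original script.

```python
import argparse

# Imitation of the command line built from the hyperparameters dictionary.
argv = ["--copy_X", "True", "--fit_intercept", "True", "--normalize", "False"]

# Naive version: type=bool calls bool("False"), which is True.
naive = argparse.ArgumentParser()
naive.add_argument("--normalize", type=bool, default=False)
naive_args, _ = naive.parse_known_args(argv)
print(naive_args.normalize)


def str2bool(value):
    """Hypothetical helper: map 'True'/'False' text to a real boolean."""
    return str(value).strip().lower() in ("true", "1", "yes")


# Safer version: the helper reads the text instead of testing for emptiness.
safe = argparse.ArgumentParser()
safe.add_argument("--normalize", type=str2bool, default=False)
safe_args, _ = safe.parse_known_args(argv)
print(safe_args.normalize)
```

The first print shows True even though "False" was passed, and the second shows False. With the values used in this notebook, only normalize is affected, because the other two hyperparameters are True anyway.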

Then we set a few more variables that we need to pass to the SageMaker Estimator later:

In [31]:
train_instance_type = "ml.m5.large"

inputs = {
    "train": data_train_s3_uri
}
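The key in the inputs dictionary is the channel name, and it decides where the data shows up inside the training container: SageMaker copies each channel to /opt/ml/input/data/<channel> and exposes that folder through an environment variable named SM_CHANNEL_<CHANNEL>. The helper below is a small illustration of that naming rule, not part of the SageMaker SDK, and the bucket name is a placeholder.

```python
def channel_locations(inputs):
    """Map each channel name to the env var and folder used in the container."""
    locations = {}
    for channel in inputs:
        locations[channel] = {
            "env_var": f"SM_CHANNEL_{channel.upper()}",
            "container_path": f"/opt/ml/input/data/{channel}",
        }
    return locations


# Placeholder bucket name, for illustration only.
example_inputs = {"train": "s3://example-bucket/script-mode/training_data"}
train_location = channel_locations(example_inputs)["train"]

print(train_location["env_var"])         # SM_CHANNEL_TRAIN
print(train_location["container_path"])  # /opt/ml/input/data/train
```

This is why a training script usually reads its data folder from os.environ["SM_CHANNEL_TRAIN"] when the channel is called "train".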

Now we create the SageMaker Estimator object, a high-level interface for SageMaker training. This object represents the algorithm, the data, and other configuration.

https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html

The most important parts of the following code are:

1) The entry_point is set to script.py. Understanding that code is crucial: open that file and relate it to what has happened in this notebook.
2) source_dir sets where that script is located. Here it is the current directory (.).
3) We also have to set the sklearn and Python versions (framework_version and py_version).

The rest are like other SageMaker Estimator settings.

In [32]:
estimator_parameters = {
    "entry_point": "script.py",
    "source_dir": ".",
    "framework_version": "0.23-1",
    "py_version": "py3",
    "instance_type": train_instance_type,
    "instance_count": 1,
    "hyperparameters": hyperparameters,
    "role": role,
    "base_job_name": "linearregression-model",
}



Now we create an instance of the SageMaker SKLearn estimator and pass the parameters above to it:

In [33]:
estimator = SKLearn(**estimator_parameters)


Then we call .fit() to have SageMaker spin up a managed instance, transfer the code (script.py) and the data, and start the training inside an sklearn container. All of this happens off the notebook server. We can follow the training in the console and watch the logs in CloudWatch Logs.

In [34]:
estimator.fit(inputs)

INFO:sagemaker:Creating training-job with name: linearregression-model-2023-02-26-03-10-21-266


2023-02-26 03:10:22 Starting - Starting the training job...
2023-02-26 03:10:38 Starting - Preparing the instances for training......
2023-02-26 03:11:24 Downloading - Downloading input data...
2023-02-26 03:11:54 Training - Downloading the training image..
2023-02-26 03:12:34,719 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2023-02-26 03:12:34,722 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-02-26 03:12:34,761 sagemaker_sklearn_container.training INFO     Invoking user training script.
2023-02-26 03:12:34,948 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-02-26 03:12:34,960 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-02-26 03:12:34,971 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-02-26 03:12:34,981 sagemaker-training-to

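In the output above, pay attention to where the training job stores its files in the S3 bucket. You will see a line like this:

SM_MODULE_DIR=s3://sagemaker-us-east-1-722464288670/linearregression-model-2022-11-18-00-07-00-294/source/sourcedir.tar.gz

SM_MODULE_DIR points to the packaged source code (sourcedir.tar.gz). The trained model, model.tar.gz, is stored under the same job prefix, by default in its output/ folder. After training, the exact URI is available as estimator.model_data.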
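Here is a minimal sketch of how the S3 locations of one training job relate to each other. It takes the SM_MODULE_DIR value from the example line above, splits it into bucket and job name, and builds the location where the model artifact is normally written. That last step assumes that no output_path was passed to the estimator, which is the case in this notebook; the authoritative value is always estimator.model_data.

```python
from urllib.parse import urlparse

# Value copied from the example SM_MODULE_DIR line above.
module_dir = (
    "s3://sagemaker-us-east-1-722464288670/"
    "linearregression-model-2022-11-18-00-07-00-294/source/sourcedir.tar.gz"
)

parsed = urlparse(module_dir)
bucket_name = parsed.netloc                       # the part right after s3://
job_name = parsed.path.lstrip("/").split("/")[0]  # first folder is the job name

# Default artifact location, assuming no custom output_path was set.
default_model_uri = f"s3://{bucket_name}/{job_name}/output/model.tar.gz"
print(default_model_uri)
```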

# (Optional) SageMaker Endpoint

Since we have not yet talked about inference with SageMaker, this part is optional. If you want to see what you can do next, please continue reading this notebook.

After .fit() completes, a new model artifact is available in S3.

We can now create a 'predictor' by deploying the estimator and then use it to make new predictions.

(Make sure that the 'endpoint_name' used is not currently running.)
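One way to check for an existing endpoint with the same name is the list_endpoints call of the SageMaker client in boto3. The function below is a sketch that expects such a client as its first argument; for simplicity it only looks at the first page of results. The FakeSageMakerClient class is a made-up stand-in that lets the example run without AWS credentials. In the notebook you would pass boto3.client("sagemaker") instead.

```python
def endpoint_exists(sagemaker_client, endpoint_name):
    """Return True if an endpoint with exactly this name is listed.

    Only the first page of results is checked, which is enough for a sketch.
    """
    response = sagemaker_client.list_endpoints(NameContains=endpoint_name)
    return any(
        endpoint["EndpointName"] == endpoint_name
        for endpoint in response.get("Endpoints", [])
    )


class FakeSageMakerClient:
    """Made-up stand-in so the example runs without AWS credentials."""

    def list_endpoints(self, NameContains):
        names = ["linearregression-endpoint1", "other-endpoint"]
        matches = [name for name in names if NameContains in name]
        return {"Endpoints": [{"EndpointName": name} for name in matches]}


fake_client = FakeSageMakerClient()
name_in_use = endpoint_exists(fake_client, "linearregression-endpoint1")
name_free = endpoint_exists(fake_client, "linearregression-endpoint2")
print(name_in_use, name_free)
```

If the name is already in use, either pick a different endpoint_name or delete the old endpoint before deploying.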

In [None]:
sklearn_predictor = estimator.deploy(initial_instance_count=1,
                                     instance_type='ml.m5.large',
                                     endpoint_name='linearregression-endpoint1')

INFO:sagemaker:Creating model with name: linearregression-model-2023-02-26-03-13-06-081
INFO:sagemaker:Creating endpoint-config with name linearregression-endpoint1
INFO:sagemaker:Creating endpoint with name linearregression-endpoint1


------------------------------------

In [None]:
sklearn_predictor.predict([[0],[1],[2],[3]])

## Clean up

Running this cell will remove the endpoint and configuration:

In [None]:
sklearn_predictor.delete_endpoint(delete_endpoint_config=True)