# Train a Boston house price prediction model with data fetched from Delta Lake and SageMaker Training

<b>This notebook was tested on SageMaker Studio with `Python 3 (Data Science)` Kernel.</b>

In this notebook, we'll show how to run a SageMaker training job that fetches the Boston Housing dataset from the example Delta Sharing Server hosted by Databricks, and then deploy an endpoint and run inference against it.

A few important things to note:
- As a best practice, feature engineering should be done in an ETL or SageMaker Processing job, not inside a training job.
- This example is intended to run on a local PC with SageMaker local mode for easy debugging. It can be moved to a notebook with little effort if needed.
- Consider where best to store the profile file used to access the Delta Sharing Server, from both a security and a versioning point of view.

## General settings

In [None]:
from sagemaker import get_execution_role, Session, image_uris
import sagemaker
import boto3

region = boto3.Session().region_name
role = get_execution_role()
sagemaker_session = Session()

print(region)

In [None]:
bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/delta-lake-scikit-learn-train-demo"

print(bucket)

## Download profile file

We will download a profile file for the Delta Sharing Server that Databricks hosts.

In [None]:
profile_file = "https://raw.githubusercontent.com/delta-io/delta-sharing/main/examples/open-datasets.share"

In [None]:
!wget {profile_file}

Typically this file is managed and secured on the client side. Because our first experiment with Delta Sharing only reads data from the Databricks server, we can stick with the example `profile_file` provided on GitHub and retrieve it via HTTP.

To get a better idea of the content and syntax of that file, let's display it.

In [None]:
!cat open-datasets.share
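
A profile file is a small JSON document. As a rough sketch, it can also be inspected from Python. The field names below (`shareCredentialsVersion`, `endpoint`, `bearerToken`) follow the Delta Sharing profile format; the sample values and the `describe_profile` helper are hypothetical and only serve as an illustration.

```python
import json


def describe_profile(text):
    """Parse a Delta Sharing profile (JSON) and return a copy that is safe to print."""
    profile = json.loads(text)
    # These three fields are expected in a version 1 profile.
    for key in ("shareCredentialsVersion", "endpoint", "bearerToken"):
        if key not in profile:
            raise ValueError(f"profile is missing required field: {key}")
    # Mask the token so it does not end up in notebook output or logs.
    return {**profile, "bearerToken": "***"}


# Hypothetical content, not the real example profile.
sample = json.dumps(
    {
        "shareCredentialsVersion": 1,
        "endpoint": "https://sharing.example.com/delta-sharing/",
        "bearerToken": "not-a-real-token",
    }
)
print(describe_profile(sample))
```

Masking the token before printing is a small habit that matters more once the profile points at a private sharing server rather than the public example.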

## Upload profile file to S3

In [None]:
sample_profile_file_url = sagemaker_session.upload_data(
    "open-datasets.share", bucket=bucket, key_prefix=prefix + "/profile"
)

print(sample_profile_file_url)

## Writing a Script Mode script

The script below contains both training and inference functionality, and can run either on SageMaker Training hardware or locally (desktop, SageMaker notebook, on-premises, etc.). Detailed guidance is available at https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script

In [None]:
!pygmentize ./code/scikit_boston_housing.py

Note the relevant lines in the training script that create a `SharingClient` and load the table as a Pandas DataFrame:

```python
    profile_file = profile_files[0]
    print(f'Found profile file: {profile_file}')

    # Create a SharingClient
    client = delta_sharing.SharingClient(profile_file)
    table_url = profile_file + "#delta_sharing.default.boston-housing"

    # Load the table as a Pandas DataFrame
    print('Loading boston-housing table from Delta Lake')
    train_data = delta_sharing.load_as_pandas(table_url)
    print(f'Train data shape: {train_data.shape}')
```

The lines that follow in the script drop rows with null values. This is for demo purposes only. As a best practice, feature engineering should be done in an ETL or SageMaker Processing job, not inside a training job.
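
The table URL in the excerpt above has the form `<profile-file-path>#<share>.<schema>.<table>`. The following is a minimal sketch of how such a URL can be assembled and taken apart. The helper names `build_table_url` and `split_table_url` are hypothetical, and the profile path is an assumption based on where SageMaker usually mounts a channel named `train`.

```python
def build_table_url(profile_path, share, schema, table):
    """Join a profile path and a table coordinate into a Delta Sharing table URL."""
    return f"{profile_path}#{share}.{schema}.{table}"


def split_table_url(table_url):
    """Split a table URL back into (profile_path, share, schema, table)."""
    # Split on the last '#', so a '#' inside the path does not confuse parsing.
    profile_path, _, coordinate = table_url.rpartition("#")
    share, schema, table = coordinate.split(".")
    return profile_path, share, schema, table


# Assumed local path of the profile file inside the training container.
path = "/opt/ml/input/data/train/open-datasets.share"
url = build_table_url(path, "delta_sharing", "default", "boston-housing")
print(url)
print(split_table_url(url))
```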

## Writing a `requirements.txt` file

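We need to install the `delta-sharing` package so that the training script can import it. SageMaker installs the packages listed in `requirements.txt` inside the training container before running the script.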

In [None]:
!pygmentize ./code/requirements.txt

## SageMaker Training

We will now launch a training job with the Python SDK.

In [None]:
from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "0.23-1"

sklearn_estimator = SKLearn(
    entry_point="scikit_boston_housing.py",
    source_dir='code',
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    framework_version=FRAMEWORK_VERSION
)

In [None]:
sklearn_estimator.fit({"train": sample_profile_file_url})
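
The `train` channel passed to `fit()` is downloaded into the training container, and its local directory is exposed to the script through the `SM_CHANNEL_TRAIN` environment variable. The sketch below shows one way a script could locate the profile file there. It is an assumption about how `profile_files` might be populated, not a copy of the actual script, and `find_profile_files` is a hypothetical helper.

```python
import glob
import os
import tempfile


def find_profile_files(train_dir=None):
    """Return sorted paths of *.share files in the training channel directory."""
    # SageMaker exposes the local path of the "train" channel in SM_CHANNEL_TRAIN.
    if train_dir is None:
        train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
    return sorted(glob.glob(os.path.join(train_dir, "*.share")))


# Local demo: a temporary directory stands in for the channel directory.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "open-datasets.share"), "w").close()
    found = find_profile_files(tmp)
    print([os.path.basename(p) for p in found])
```

Passing the directory explicitly, as in the demo, also makes the lookup easy to exercise outside SageMaker.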

## Deploy to a real-time endpoint

Here we deploy the estimator directly after training with `Estimator.deploy()`. A more extensive alternative is to create a model from the S3 artifacts, which also works for a model that was trained in a different session or even outside of SageMaker.

In [None]:
predictor = sklearn_estimator.deploy(instance_type="ml.m5.large", initial_instance_count=1)
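
For a model trained in another session, a model object can be built from the `model.tar.gz` artifact in S3 instead. The following is a minimal sketch using `SKLearnModel` from the SageMaker Python SDK. The `deploy_from_artifacts` wrapper and its defaults are hypothetical, and it assumes the same entry point, source directory, and framework version that were used for training.

```python
def deploy_from_artifacts(model_data, role, framework_version="0.23-1",
                          instance_type="ml.m5.large"):
    """Create an SKLearnModel from a model.tar.gz in S3 and deploy it."""
    # Imported lazily so the function can be defined without the SDK installed.
    from sagemaker.sklearn.model import SKLearnModel

    model = SKLearnModel(
        model_data=model_data,  # S3 URI of the model.tar.gz artifact
        role=role,
        entry_point="scikit_boston_housing.py",
        source_dir="code",
        framework_version=framework_version,
    )
    # Creates the endpoint and returns a predictor for it.
    return model.deploy(initial_instance_count=1, instance_type=instance_type)
```

In this notebook it could be called as `deploy_from_artifacts(sklearn_estimator.model_data, role)`. Note that this would create a second endpoint, which would also need to be deleted afterwards.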

## Invoke with the Python SDK

In [None]:
test_sample = [[0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98]]
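
The sample above holds 13 feature values. As a reading aid, the sketch below pairs them with the column names of the classic Boston Housing dataset. This ordering is an assumption: the table shared through Delta Sharing may use different names or a different column order, and `label_features` is a hypothetical helper.

```python
# Column names of the classic Boston Housing dataset (target MEDV excluded).
FEATURE_NAMES = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT",
]

# Same sample as in the cell above.
test_sample = [[0.00632, 18.0, 2.31, 0, 0.538, 6.575, 65.2, 4.0900, 1, 296, 15.3, 396.90, 4.98]]


def label_features(row, names=FEATURE_NAMES):
    """Pair each value in a feature row with its assumed column name."""
    if len(row) != len(names):
        raise ValueError(f"expected {len(names)} features, got {len(row)}")
    return dict(zip(names, row))


labeled = label_features(test_sample[0])
print(labeled)
```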

In [None]:
prediction = predictor.predict(test_sample)

In [None]:
prediction

## Don't forget to delete the endpoint!

In [None]:
predictor.delete_endpoint()