## Prepare Athena table

At this point, it is assumed that the S3 bucket `sagemaker-restate-<AWS ACCOUNT ID>` has been created and the raw data has been uploaded to `s3://sagemaker-restate-<AWS ACCOUNT ID>/raw/russia/`.
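
The assumption above can be checked before any Glue resources are created. The helper below is a minimal sketch, not part of the original notebook: `raw_data_exists` is a hypothetical name, and it relies only on the S3 `list_objects_v2` call with the `raw/russia/` prefix mentioned above.

```python
def raw_data_exists(s3_client, bucket, prefix="raw/russia/"):
    """Return True if at least one object exists under s3://<bucket>/<prefix>.

    s3_client is expected to be a boto3 S3 client. If the bucket itself is
    missing, boto3 raises a ClientError instead of returning False.
    """
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    # KeyCount is the number of keys returned for this request (0 or 1 here).
    return response.get("KeyCount", 0) > 0
```

After the next cell has defined `BUCKET_NAME`, a call such as `raw_data_exists(boto3.client("s3"), BUCKET_NAME)` should return `True` if the upload succeeded.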

The step below creates a Glue database and table containing the raw data by running a Glue crawler. 

In [None]:
import boto3

AWS_ACCOUNT = boto3.client("sts").get_caller_identity()["Account"]

BUCKET_NAME = "sagemaker-restate-{AWS_ACCOUNT}".format(AWS_ACCOUNT=AWS_ACCOUNT)

glue_client = boto3.client('glue')

In [None]:
try:
    response = glue_client.create_database(
        DatabaseInput={
            'Name': 'restate'
        }
    )
    print("Successfully created database")
except Exception as e:
    print('Error in creating database: {ERROR}'.format(ERROR=e))

In [None]:
# This assumes the Glue service role name is AWSGlueServiceRole-restate
try:
    response = glue_client.create_crawler(
        Name='restate-russia',
        Role='AWSGlueServiceRole-restate',
        DatabaseName='restate',
        Targets={
            'S3Targets': [
                {
                    'Path': 's3://{BUCKET_NAME}/raw/russia/'.format(BUCKET_NAME=BUCKET_NAME),
                }
            ]
        }
    )
    print("Successfully created crawler")
except Exception as e:
    print('Error in creating crawler: {ERROR}'.format(ERROR=e))

In [None]:
try:
    response = glue_client.start_crawler(
        Name='restate-russia'
    )
    print("Successfully started crawler")
except Exception as e:
    print('Error in starting crawler: {ERROR}'.format(ERROR=e))

Once the crawler has finished, the table `russia` in database `restate` should be visible in Athena. We then filter the data for `region = 3870` and use the result in our pipeline.
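
One way to tell when the crawler has finished is to poll its state. The sketch below assumes the crawler reports `RUNNING` right after `start_crawler` and returns to `READY` when the run is over. `wait_for_crawler`, `poll_seconds`, and `timeout_seconds` are hypothetical names introduced for illustration.

```python
import time


def wait_for_crawler(glue_client, name, poll_seconds=15, timeout_seconds=1800):
    """Poll a Glue crawler until it is READY again; return the last crawl status."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        crawler = glue_client.get_crawler(Name=name)["Crawler"]
        if crawler["State"] == "READY":
            # LastCrawl is absent when the crawler has never run.
            return crawler.get("LastCrawl", {}).get("Status")
        time.sleep(poll_seconds)
    raise TimeoutError(
        "Crawler {} did not finish within {} seconds".format(name, timeout_seconds)
    )
```

Calling `wait_for_crawler(glue_client, 'restate-russia')` in the notebook should return `SUCCEEDED` once the table has been created. Any other status suggests checking the crawler's logs.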

Make sure the Athena query result location setting is updated accordingly before proceeding to the next step.

In [None]:
query = 'CREATE TABLE restate.russia_3870 AS SELECT * FROM restate.russia where "region" = 3870;'
DATABASE = 'restate'
output='s3://{BUCKET_NAME}/athena'.format(BUCKET_NAME=BUCKET_NAME)

athena_client = boto3.client('athena')

try:
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': DATABASE
        },
        ResultConfiguration={
            'OutputLocation': output,
        }
    )
except Exception as e:
    print('Error running the query: {ERROR}'.format(ERROR=e))
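
`start_query_execution` only submits the query, so the `russia_3870` table may not exist yet when the cell returns. A minimal sketch for waiting on the result follows. `wait_for_query` is a hypothetical helper that polls the Athena `get_query_execution` call until the query reaches a terminal state.

```python
import time


def wait_for_query(athena_client, query_execution_id, poll_seconds=5):
    """Poll Athena until the query finishes and return its final state."""
    while True:
        execution = athena_client.get_query_execution(
            QueryExecutionId=query_execution_id
        )
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            if status["State"] != "SUCCEEDED":
                # StateChangeReason usually carries the error message, if any.
                print("Query ended as {}: {}".format(
                    status["State"], status.get("StateChangeReason")))
            return status["State"]
        time.sleep(poll_seconds)
```

In the notebook this could be called as `wait_for_query(athena_client, response["QueryExecutionId"])`, using the `response` returned by the cell above.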


## Prepare Decision Tree custom Docker image

We build a Docker image containing a custom algorithm based on the [Scikit-learn Decision Tree Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor). Note that the Docker image has been modified to support hyperparameter tuning and validation data.



In [None]:
! sudo yum install docker -y

In [None]:
%%sh

# The name of our algorithm
ALGORITHM_NAME=restate-decision-trees

cd container

chmod +x decision_trees/train
chmod +x decision_trees/serve

AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

AWS_REGION=$(aws configure get region)

IMAGE_FULLNAME="${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ALGORITHM_NAME}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${ALGORITHM_NAME}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${ALGORITHM_NAME}" > /dev/null
fi

# Authenticate Docker with the ECR registry using a temporary password
aws ecr get-login-password --region "${AWS_REGION}" | docker login --username AWS --password-stdin "${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com"

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build  -t ${ALGORITHM_NAME} .
docker tag ${ALGORITHM_NAME} ${IMAGE_FULLNAME}
docker push ${IMAGE_FULLNAME}


Once the Docker image is pushed to the ECR repository, we make the image accessible from SageMaker.
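
Before registering the image with SageMaker, it may be worth confirming that the push actually landed in ECR. The following is a minimal sketch using the ECR `describe_images` call. `pushed_image_digest` is a hypothetical helper name, and the repository and tag defaults are taken from the build cell above.

```python
def pushed_image_digest(ecr_client, repository="restate-decision-trees", tag="latest"):
    """Return the digest of repository:tag.

    ecr_client is expected to be a boto3 ECR client; boto3 raises an
    exception if the repository or the tag does not exist.
    """
    response = ecr_client.describe_images(
        repositoryName=repository,
        imageIds=[{"imageTag": tag}],
    )
    return response["imageDetails"][0]["imageDigest"]
```

For example, `pushed_image_digest(boto3.client("ecr"))` should return a `sha256:...` digest if the image is present.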

In [None]:
%%sh

# The name of the SageMaker image
SM_IMAGE_NAME=restate-dtree
AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

# This assumes the role name is AmazonSageMakerServiceCatalogProductsUseRole-restate
ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT}:role/AmazonSageMakerServiceCatalogProductsUseRole-restate"

aws sagemaker create-image \
    --image-name ${SM_IMAGE_NAME} \
    --role-arn ${ROLE_ARN}


In [None]:
%%sh
AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
ALGORITHM_NAME=restate-decision-trees
AWS_REGION=$(aws configure get region)
SM_IMAGE_NAME=restate-dtree
SM_BASE_IMAGE="${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ALGORITHM_NAME}:latest"

aws sagemaker create-image-version \
    --image-name ${SM_IMAGE_NAME} \
    --base-image ${SM_BASE_IMAGE}

Make sure to update the version below with the correct one based on the output of the previous step, i.e. `...image-version/restate-dtree/<SM_BASE_IMAGE_VERSION>`.
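
The version is the last path segment of the image version ARN printed by the previous step. As a small illustration, the hypothetical helper `image_version_from_arn` below extracts it. The region and account ID in the usage example are made up.

```python
def image_version_from_arn(image_version_arn):
    """Return the trailing version number of a SageMaker image-version ARN."""
    # The sixth ARN field is the resource: "image-version/<image-name>/<version>"
    resource = image_version_arn.split(":", 5)[-1]
    kind, _image_name, version = resource.split("/")
    if kind != "image-version":
        raise ValueError("Not an image-version ARN: {}".format(image_version_arn))
    return int(version)


# Example with a made-up region and account ID
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:image-version/restate-dtree/1"
print(image_version_from_arn(example_arn))
```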

In [None]:
%%sh

SM_BASE_IMAGE_VERSION=1
AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
ALGORITHM_NAME=restate-decision-trees
AWS_REGION=$(aws configure get region)
SM_IMAGE_NAME=restate-dtree
SM_BASE_IMAGE="${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ALGORITHM_NAME}:latest"

aws sagemaker describe-image-version \
    --image-name ${SM_IMAGE_NAME} \
    --version ${SM_BASE_IMAGE_VERSION}

## Start the SageMaker pipeline

In [None]:
! pip install sagemaker-pipeline/

In [None]:
! get-pipeline-definition --help

At this point, it is assumed that a SageMaker project named `restate` and a pipeline named `sagemaker-restate` have already been created.
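
Both assumptions can be verified from Python, which also gives the project ID needed in the next cell. This is a minimal sketch built on the SageMaker `describe_project` and `describe_pipeline` calls. `check_project_and_pipeline` is a hypothetical helper name.

```python
def check_project_and_pipeline(sm_client, project_name="restate",
                               pipeline_name="sagemaker-restate"):
    """Look up the project and pipeline; boto3 raises if either is missing."""
    project = sm_client.describe_project(ProjectName=project_name)
    pipeline = sm_client.describe_pipeline(PipelineName=pipeline_name)
    return {
        "project_id": project["ProjectId"],
        "pipeline_arn": pipeline["PipelineArn"],
    }
```

For example, `check_project_and_pipeline(boto3.client("sagemaker"))["project_id"]` should return the value to use for `SAGEMAKER_PROJECT_ID` below.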

In [None]:
%%sh

# This assumes the SageMaker pipeline role name is AmazonSageMakerServiceCatalogProductsUseRole-restate

AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
AWS_REGION=$(aws configure get region)
SAGEMAKER_PROJECT_ARN="arn:aws:sagemaker:${AWS_REGION}:${AWS_ACCOUNT}:project/restate"
SAGEMAKER_PROJECT_NAME=restate
SAGEMAKER_PIPELINE_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT}:role/AmazonSageMakerServiceCatalogProductsUseRole-restate"
SAGEMAKER_PIPELINE_NAME=sagemaker-restate
SAGEMAKER_PROJECT_ID=p-jittopdrswh5
ARTIFACT_BUCKET="sagemaker-project-${SAGEMAKER_PROJECT_ID}"
SAGEMAKER_PROJECT_NAME_ID="${SAGEMAKER_PROJECT_NAME}-${SAGEMAKER_PROJECT_ID}"

run-pipeline --module-name pipelines.restate.pipeline \
  --role-arn $SAGEMAKER_PIPELINE_ROLE_ARN \
  --tags "[{\"Key\":\"sagemaker:project-name\", \"Value\":\"${SAGEMAKER_PROJECT_NAME}\"}, {\"Key\":\"sagemaker:project-id\", \"Value\":\"${SAGEMAKER_PROJECT_ID}\"}]" \
  --kwargs "{\"region\":\"${AWS_REGION}\",\"sagemaker_project_arn\":\"${SAGEMAKER_PROJECT_ARN}\",\"role\":\"${SAGEMAKER_PIPELINE_ROLE_ARN}\",\"default_bucket\":\"${ARTIFACT_BUCKET}\",\"pipeline_name\":\"${SAGEMAKER_PROJECT_NAME_ID}\",\"model_package_group_name\":\"${SAGEMAKER_PROJECT_NAME_ID}\",\"base_job_prefix\":\"${SAGEMAKER_PROJECT_NAME_ID}\"}"


If you inspect the pipeline, you will see that the XGBoost model performs better than Decision Tree. Therefore, the XGBoost model is registered in the registry.
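
To see what was registered, the model package group can be queried directly. The sketch below assumes the group name is the `model_package_group_name` passed to `run-pipeline` above, i.e. `<project name>-<project id>`. `latest_model_package` is a hypothetical helper name.

```python
def latest_model_package(sm_client, group_name):
    """Return the most recently created model package summary, or None."""
    response = sm_client.list_model_packages(
        ModelPackageGroupName=group_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response["ModelPackageSummaryList"]
    return summaries[0] if summaries else None
```

The returned summary includes `ModelPackageArn`, `ModelPackageVersion`, and `ModelApprovalStatus`, which can be used in place of hard-coded values when deploying.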

You can experiment with the data, e.g. use the data for `region=russia_2922`, by changing the Athena query in `pipeline.py`, and check whether XGBoost is still the winning model.

## Deploy the winning model

In [None]:
from sagemaker import get_execution_role, session
import boto3

role = get_execution_role()
sm_client = boto3.client('sagemaker')

MODEL_VERSION="1"
SAGEMAKER_PROJECT_NAME="restate"
SAGEMAKER_PROJECT_ID="p-jittopdrswh5"
AWS_REGION = boto3.Session().region_name
MODEL_PACKAGE_ARN="arn:aws:sagemaker:{AWS_REGION}:{AWS_ACCOUNT}:model-package/{SAGEMAKER_PROJECT_NAME}-{SAGEMAKER_PROJECT_ID}/{MODEL_VERSION}".format(AWS_REGION=AWS_REGION, AWS_ACCOUNT=AWS_ACCOUNT, SAGEMAKER_PROJECT_NAME=SAGEMAKER_PROJECT_NAME,SAGEMAKER_PROJECT_ID=SAGEMAKER_PROJECT_ID,MODEL_VERSION=MODEL_VERSION)
                    

model_package_update_response = sm_client.update_model_package(
    ModelPackageArn=MODEL_PACKAGE_ARN,
    ModelApprovalStatus="Approved"
)


In [None]:
from time import gmtime, strftime

model_name = 'restate-modelregistry-model-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name : {}".format(model_name))
container_list = [{'ModelPackageName': MODEL_PACKAGE_ARN}]

create_model_response = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    Containers = container_list
)
print("Model arn : {}".format(create_model_response["ModelArn"]))

In [None]:
endpoint_config_name = 'restate-modelregistry-EndpointConfig-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_config_name)
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.m5.large',
        'InitialVariantWeight':1,
        'InitialInstanceCount':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])

In [None]:
endpoint_name = 'restate-modelregistry-endpoint-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("EndpointName={}".format(endpoint_name))

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)

print(create_endpoint_response['EndpointArn'])

Wait for the endpoint to be created; its status must be `InService` before it can serve requests.
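
Rather than checking the console, the wait can be done in code with the boto3 waiter `endpoint_in_service`. The wrapper below is a minimal sketch, and `wait_for_endpoint` is a hypothetical helper name.

```python
def wait_for_endpoint(sm_client, endpoint_name):
    """Block until the endpoint is InService and return its final status.

    The waiter raises botocore.exceptions.WaiterError if endpoint creation
    fails or takes longer than the waiter's built-in limit.
    """
    waiter = sm_client.get_waiter("endpoint_in_service")
    waiter.wait(EndpointName=endpoint_name)
    return sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
```

In the notebook, `wait_for_endpoint(sm_client, endpoint_name)` should return `InService` after a few minutes.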

## Inference

Use the following data for inference:

`5,2,57.0,8.0,2021.0,1.0,24.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0`

This is a secondary-market property of the brick building type with 5 levels, 2 rooms, an area of 57 sqm, and a kitchen area of 8 sqm, in region 3870. It was published on Jan 24, 2021, and its market value is 3300000 Russian rubles.

Let's see its predicted value using our generated model.

In [None]:
import json 

sm_runtime = boto3.client('sagemaker-runtime')
line = '5,2,57.0,8.0,2021.0,1.0,24.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0'
response = sm_runtime.invoke_endpoint(EndpointName=endpoint_name,
    ContentType='text/csv',
    Body=line
)
result = json.loads(response['Body'].read().decode())
print(result)


Now you try:

`5,2,43.0,6.0,2021.0,1.0,24.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0`

This is a secondary-market property of the brick building type with 5 levels, 2 rooms, an area of 43 sqm, and a kitchen area of 6 sqm, in region 3870. It was published on Jan 24, 2021, and its market value is 2450000 Russian rubles.


The value returned by the endpoint is the model's predicted real estate value for the given features. If the predicted value is less than the actual market value, the real estate may be overvalued; otherwise, it may be undervalued.
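
The comparison rule above can be written as a small function. This is only an illustration of the rule as stated. `valuation_signal` is a hypothetical name, and the predicted value in the usage example is made up rather than taken from the model.

```python
def valuation_signal(predicted_value, market_value):
    """Apply the rule above: a prediction below the market value suggests overvaluation."""
    if predicted_value < market_value:
        return "may be overvalued"
    return "may be undervalued"


# Hypothetical prediction of 3000000 against the listed market value of 3300000
print(valuation_signal(3000000, 3300000))
```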

## Cleanup

Clean up the Glue database, table, crawler, and S3 buckets used.

Clean up the ECR and SageMaker images created.

Clean up the SageMaker model and endpoint resources.

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)
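
The cell above covers only the SageMaker model and endpoint resources. A possible sketch for the Glue, ECR, and SageMaker image resources follows, using the resource names from the earlier cells. `cleanup_supporting_resources` is a hypothetical helper, and it leaves the S3 buckets alone because emptying them is destructive and best done deliberately.

```python
def cleanup_supporting_resources(glue_client, ecr_client, sm_client):
    """Delete the crawler, Glue database, SageMaker image, and ECR repository."""
    # The crawler must not be running when it is deleted.
    glue_client.delete_crawler(Name="restate-russia")
    # Deleting the database also removes the tables registered in it.
    glue_client.delete_database(Name="restate")
    # Deletes the SageMaker image and all of its versions.
    sm_client.delete_image(ImageName="restate-dtree")
    # force=True deletes the repository even if it still contains images.
    ecr_client.delete_repository(repositoryName="restate-decision-trees", force=True)
```

For example: `cleanup_supporting_resources(glue_client, boto3.client("ecr"), sm_client)`.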