# Set up SageMaker images for training and processing

In this notebook you will build and push the Docker images that are needed to run graph processing and training tasks with GraphStorm on SageMaker.

## Environment Setup

First, load the settings that are shared across all notebooks in this demo into Python variables.

In [None]:
import json
import os

# Read information about the graph from the JSON file you created in notebook 0
with open("task-info.json", "r") as f:
    task_info = json.load(f)


GRAPH_NAME = task_info["GRAPH_NAME"]
BUCKET = task_info["BUCKET"]
GS_HOME = task_info["GS_HOME"]
AWS_REGION = task_info["AWS_REGION"]
GRAPH_ID = task_info["GRAPH_ID"]
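
The cell above raises a bare `KeyError` on the first missing entry. As an optional alternative, the sketch below checks all keys at once and reports every missing one. The helper name `load_task_info` is hypothetical; the key list is taken from the variables read above.

```python
import json

# Keys that notebook 0 is expected to have written to task-info.json
REQUIRED_KEYS = ("GRAPH_NAME", "BUCKET", "GS_HOME", "AWS_REGION", "GRAPH_ID")


def load_task_info(path="task-info.json"):
    """Load the task settings and fail early if an expected key is missing or empty."""
    with open(path, "r") as f:
        info = json.load(f)
    # Collect every missing or empty key so the error names all of them at once
    missing = [key for key in REQUIRED_KEYS if not info.get(key)]
    if missing:
        raise KeyError(f"{path} is missing values for: {', '.join(missing)}")
    return info
```

Calling `task_info = load_task_info()` would then replace the plain `json.load` call.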

## Example GraphStorm-SageMaker architecture

A common model development process is to perform model exploration locally on a subset of your full data, and when you’re satisfied with the results, train the full-scale model. This setup allows for cheaper exploration before training on the full dataset. 

We demonstrate such a setup in the following diagram, where a user can perform model development and initial training on a single EC2 instance, and when they’re ready to train on their full data, hand off the heavy lifting to SageMaker for distributed training. Using SageMaker Pipelines to train models provides several benefits, like reduced costs, auditability, and lineage tracking.

<img src="images/sm-graphstorm-arch.jpg" width="50%">

## Build and Push GraphStorm Docker Images

GraphStorm uses BYOC (Bring Your Own Container) to run SageMaker jobs. First you will build the image that you will use to partition the graph and run training and inference.

### Required IAM Permissions
To build and push the GraphStorm images, your IAM role needs the following permissions:
- Pull images from the SageMaker public ECR registry
- Create a repository and push images to your account's private ECR registry
- For detailed permissions, refer to the [ECR IAM id-based policy examples](https://docs.aws.amazon.com/AmazonECR/latest/userguide/security_iam_id-based-policy-examples.html)

> NOTE: The GraphStorm image builder does not support cross-platform builds (e.g. building a Linux x86_64 image on Apple silicon/aarch64). Make sure you build the image on a `linux/x86_64` host.

In [None]:
# This will create an ECR repository and push an image to
# ${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/graphstorm:sagemaker-cpu
!bash $GS_HOME/docker/build_graphstorm_image.sh --environment sagemaker --device cpu
!bash $GS_HOME/docker/push_graphstorm_image.sh -e sagemaker -d cpu -r $AWS_REGION
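
Later steps need the URI of the pushed image. The sketch below rebuilds it from the account ID and region, following the pattern in the comment above. The helper names `graphstorm_image_uri` and `current_account_id` are hypothetical, and the URI format assumes the standard `aws` partition (other partitions use a different domain suffix).

```python
def graphstorm_image_uri(account_id, region, tag="sagemaker-cpu", repository="graphstorm"):
    """Build the private ECR image URI for the image pushed above.

    Assumes the standard `aws` partition; the repository and tag defaults
    mirror the comment in the build cell.
    """
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def current_account_id(sts_client):
    """Look up the caller's AWS account ID using an STS client."""
    return sts_client.get_caller_identity()["Account"]


# Example usage inside the notebook (requires AWS credentials):
# import boto3
# account_id = current_account_id(boto3.client("sts", region_name=AWS_REGION))
# IMAGE_URI = graphstorm_image_uri(account_id, AWS_REGION)
```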

## Neptune Analytics Graph Notebook Setup (Optional)

This section is optional depending on your setup:

1. If you're already using a [Neptune Analytics notebook](https://docs.aws.amazon.com/neptune-analytics/latest/userguide/notebooks.html) that has access to the graph you created, you can skip this section.
2. If you're using a regular SageMaker notebook instance, you can install the graph-notebook extension:
   ```bash
   pip install graph-notebook
   jupyter nbextension enable --py --sys-prefix graph_notebook_widgets
   ```
3. If you're self-hosting Jupyter, you can follow the instructions in 
   [Hosting a Neptune Analytics graph-notebook on your local machine](https://docs.aws.amazon.com/neptune-analytics/latest/userguide/create-notebook-local.html)

If you choose to create a new Neptune Analytics notebook, follow these steps.

### Required IAM Role
For a demo, you can use a role with broad permissions by attaching the following managed policies to the notebook instance role:
- AWSNeptuneAnalyticsFullAccess
- AmazonSageMakerFullAccess

For detailed permissions and trust policy requirements, see the [Neptune Analytics documentation](https://docs.aws.amazon.com/neptune-analytics/latest/userguide/create-notebook-console.html#create-notebook-iam-role).

In [None]:
import boto3
import base64

sm_client = boto3.client("sagemaker", region_name=AWS_REGION)

# Create a lifecycle config for a Neptune Analytics graph notebook

# TODO: Add a download of the 5th notebook to this LCC script, once the code is public
start_script = r"""#!/bin/bash

sudo -u ec2-user -i <<'EOF'

echo "export GRAPH_NOTEBOOK_AUTH_MODE=IAM" >> ~/.bashrc
echo "export GRAPH_NOTEBOOK_SSL=True" >> ~/.bashrc
echo "export GRAPH_NOTEBOOK_SERVICE=neptune-graph" >> ~/.bashrc
echo "export GRAPH_NOTEBOOK_HOST=GRAPH_ID_PLACEHOLDER.REGION_PLACEHOLDER.neptune-graph.amazonaws.com" >> ~/.bashrc
echo "export GRAPH_NOTEBOOK_PORT=8182" >> ~/.bashrc
echo "export NEPTUNE_LOAD_FROM_S3_ROLE_ARN=" >> ~/.bashrc
echo "export AWS_REGION=REGION_PLACEHOLDER" >> ~/.bashrc

aws s3 cp s3://aws-neptune-notebook-REGION_PLACEHOLDER/graph_notebook.tar.gz /tmp/graph_notebook.tar.gz
rm -rf /tmp/graph_notebook
tar -zxvf /tmp/graph_notebook.tar.gz -C /tmp
chmod +x /tmp/graph_notebook/install_jl4x.sh
/tmp/graph_notebook/install_jl4x.sh

EOF

"""

start_script = (start_script
    .replace("GRAPH_ID_PLACEHOLDER", GRAPH_ID)
    .replace("REGION_PLACEHOLDER", AWS_REGION)
    )

# Encode string to bytes using utf-8, encode bytes to base64, then decode base64 to string
encoded_script = base64.b64encode(start_script.encode()).decode()
lc_config_name = f"{GRAPH_NAME}-{GRAPH_ID}-LC"

response = sm_client.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName=lc_config_name,
    OnCreate=[
        {"Content": encoded_script},
    ],
)
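
The substitution and encoding steps above can also be wrapped in a helper that catches two easy mistakes before calling the API: a placeholder left unreplaced, and a script that is too long. This is only a sketch. The function name `render_lifecycle_script` is hypothetical, and the 16384-character limit is my reading of the length constraint on the `Content` field, so check it against the current API reference.

```python
import base64

# Assumed maximum length of the base64 "Content" string accepted by the API
MAX_LIFECYCLE_CONTENT_CHARS = 16384


def render_lifecycle_script(template, graph_id, region):
    """Fill in the placeholders and return the base64-encoded script."""
    script = (
        template
        .replace("GRAPH_ID_PLACEHOLDER", graph_id)
        .replace("REGION_PLACEHOLDER", region)
    )
    # Any remaining "_PLACEHOLDER" token means a substitution was missed
    if "_PLACEHOLDER" in script:
        raise ValueError("Lifecycle script still contains an unreplaced placeholder")
    encoded = base64.b64encode(script.encode("utf-8")).decode("ascii")
    if len(encoded) > MAX_LIFECYCLE_CONTENT_CHARS:
        raise ValueError(
            f"Encoded script is {len(encoded)} characters, "
            f"above the assumed limit of {MAX_LIFECYCLE_CONTENT_CHARS}"
        )
    return encoded
```

Decoding the result with `base64.b64decode(encoded).decode()` is a quick way to inspect the exact script the instance will run.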

With the lifecycle config available, you will next launch the notebook instance itself.

In [None]:
# Enter the ARN of your Neptune Analytics notebook role here
NEPTUNE_NOTEBOOK_ROLE = "arn:aws:iam::123456789012:role/<Your-NeptuneAnalytics-Notebook-Role>"

response = sm_client.create_notebook_instance(
    NotebookInstanceName=f"{GRAPH_NAME}-{GRAPH_ID}-notebook",
    InstanceType="ml.t3.medium",
    RoleArn=NEPTUNE_NOTEBOOK_ROLE,
    LifecycleConfigName=lc_config_name,
    DirectInternetAccess="Enabled",
    VolumeSizeInGB=50,
    PlatformIdentifier="notebook-al2-v3",  # Platform version that provides JupyterLab 4
    InstanceMetadataServiceConfiguration={"MinimumInstanceMetadataServiceVersion": "2"},
)

While the notebook instance is being set up, move on to the next notebook, `2-Deploy-Execute-Pipeline.ipynb`, to create a SageMaker training pipeline.
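
If you would rather confirm that the instance has reached `InService` before opening it, here is a minimal sketch that uses the boto3 SageMaker waiter `notebook_instance_in_service`. The helper name `wait_for_notebook_instance` is hypothetical, and the call blocks until the waiter succeeds or times out.

```python
def wait_for_notebook_instance(sm_client, notebook_name):
    """Block until the notebook instance is in service, then return its status."""
    waiter = sm_client.get_waiter("notebook_instance_in_service")
    waiter.wait(NotebookInstanceName=notebook_name)
    description = sm_client.describe_notebook_instance(
        NotebookInstanceName=notebook_name
    )
    return description["NotebookInstanceStatus"]


# Example usage inside the notebook (requires AWS credentials):
# status = wait_for_notebook_instance(sm_client, f"{GRAPH_NAME}-{GRAPH_ID}-notebook")
# print(status)
```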