# Step 3: Add a model building CI/CD pipeline

In this step you create an automated CI/CD pipeline for model building using [Amazon SageMaker Projects](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects.html). 

![](img/sagemaker-mlops-project-build.jpg)

You are going to use a [SageMaker-provided MLOps project template for model building and training](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-templates-sm.html#sagemaker-projects-templates-code-commit) to provision a CI/CD workflow automation with [AWS CodePipeline](https://aws.amazon.com/codepipeline/) and an [AWS CodeCommit](https://aws.amazon.com/codecommit/) code repository.

SageMaker project templates offer you the following choice of code repositories, workflow automation tools, and pipeline stages:
- **Code repository**: AWS CodeCommit or third-party Git repositories such as GitHub and Bitbucket
- **CI/CD workflow automation**: AWS CodePipeline or Jenkins
- **Pipeline stages**: Model building and training, model deployment, or both

<div class="alert alert-info"> Make sure you are using the <code>Python 3</code> kernel in JupyterLab for this notebook.</div>

In [None]:
import boto3
import sagemaker 
from time import gmtime, strftime, sleep

## Create an MLOps project
⭐ You can create a project programmatically in this notebook (**Option 1**) or in the Studio UI (**Option 2**).

Option 1 is recommended as it requires no manual input and has no dependency on the UX.<br/>
Option 2 is given to demonstrate the [**Create Project** UI flow](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-create.html).

### Option 1: Create project programmatically
In this section you use `boto3` to create an MLOps project via a SageMaker API.

In [None]:
sm = boto3.client("sagemaker")
sc = boto3.client("servicecatalog")

sc_provider_name = "Amazon SageMaker"
sc_product_name = "MLOps template for model building and training"

In [None]:
p_ids = [p['ProductId'] for p in sc.search_products(
    Filters={
        'FullTextSearch': [sc_product_name]
    },
)['ProductViewSummaries'] if p["Name"]==sc_product_name]

In [None]:
p_ids

In [None]:
# If this code raises an exception, go to Option 2 and create the project in the Studio UI
if not len(p_ids):
    raise Exception("No Amazon SageMaker ML Ops products found!")
elif len(p_ids) > 1:
    raise Exception("Too many matching Amazon SageMaker ML Ops products found!")
else:
    product_id = p_ids[0]
    print(f"ML Ops product id: {product_id}")

In [None]:
provisioning_artifact_id = sorted(
    [i for i in sc.list_provisioning_artifacts(
        ProductId=product_id
    )['ProvisioningArtifactDetails'] if i['Guidance']=='DEFAULT'],
    key=lambda d: d['Name'], reverse=True)[0]['Id']

In [None]:
provisioning_artifact_id
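
The selection above sorts artifact names as strings, which can misorder names such as `v1.10` and `v1.9`. The following is a minimal sketch of an alternative that picks the newest artifact by its `CreatedTime` field, assuming the entries returned by `list_provisioning_artifacts` carry `CreatedTime` and `Active`. The helper `pick_latest_artifact` and the sample data are hypothetical and only illustrate the idea.

```python
from datetime import datetime, timezone


def pick_latest_artifact(artifacts):
    """Return the Id of the newest active artifact with DEFAULT guidance."""
    candidates = [
        a for a in artifacts
        if a.get("Guidance") == "DEFAULT" and a.get("Active", True)
    ]
    if not candidates:
        raise ValueError("No active provisioning artifact with DEFAULT guidance found")
    # Compare creation timestamps instead of name strings
    return max(candidates, key=lambda a: a["CreatedTime"])["Id"]


# Illustrative data only; real entries come from
# sc.list_provisioning_artifacts(ProductId=product_id)["ProvisioningArtifactDetails"]
sample_artifacts = [
    {"Id": "pa-old", "Name": "v1.9", "Guidance": "DEFAULT", "Active": True,
     "CreatedTime": datetime(2023, 1, 1, tzinfo=timezone.utc)},
    {"Id": "pa-new", "Name": "v1.10", "Guidance": "DEFAULT", "Active": True,
     "CreatedTime": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"Id": "pa-deprecated", "Name": "v2.0", "Guidance": "DEPRECATED", "Active": True,
     "CreatedTime": datetime(2024, 6, 1, tzinfo=timezone.utc)},
]
latest_id = pick_latest_artifact(sample_artifacts)
print(latest_id)
```

Whether the name-based or the time-based ordering fits better depends on how the template versions are named in your account.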

In [None]:
project_name = f"model-build-{strftime('%m-%d-%H-%M-%S', gmtime())}"
project_parameters = [] # This SageMaker built-in project template doesn't have any parameters

Finally, create a SageMaker project from the service catalog product template:

In [None]:
# create SageMaker project
r = sm.create_project(
    ProjectName=project_name,
    ProjectDescription="Model build project",
    ServiceCatalogProvisioningDetails={
        'ProductId': product_id,
        'ProvisioningArtifactId': provisioning_artifact_id,
    },
)

print(r)
project_id = r["ProjectId"]

<div class="alert alert-info"> 💡 <strong> Wait until project creation is completed by running the next cell</strong>
</div>




In [None]:
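# Project creation takes about 3-5 min
while (status := sm.describe_project(ProjectName=project_name)['ProjectStatus']) != 'CreateCompleted':
    # Stop waiting if the project creation failed instead of looping forever
    if status == 'CreateFailed':
        raise RuntimeError(f"MLOps project {project_name} creation failed")
    print("Waiting for project creation completion")
    sleep(10)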
    
print(f"MLOps project {project_name} creation completed")
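
A polling loop like the one above can be generalized so that it also stops on a failure status or after a timeout. The sketch below is one possible shape for such a helper. `wait_for_status` and its parameters are hypothetical names, and the status sequence in the demonstration is simulated rather than read from SageMaker.

```python
import time


def wait_for_status(get_status, done, failed, timeout_s=900, interval_s=10, sleep=time.sleep):
    """Poll get_status() until it returns a value contained in `done`.

    Raises RuntimeError for a value in `failed` and TimeoutError once
    roughly timeout_s seconds of waiting have accumulated.
    """
    waited = 0
    while True:
        status = get_status()
        if status in done:
            return status
        if status in failed:
            raise RuntimeError(f"Terminal failure status: {status}")
        if waited >= timeout_s:
            raise TimeoutError(f"Still in status {status} after {waited} s")
        sleep(interval_s)
        waited += interval_s


# Simulated status sequence for demonstration; no AWS call is made here
simulated = iter(["Pending", "CreateInProgress", "CreateCompleted"])
final_status = wait_for_status(
    lambda: next(simulated),
    done={"CreateCompleted"},
    failed={"CreateFailed"},
    sleep=lambda seconds: None,  # skip real waiting in the demonstration
)
print(final_status)
```

In the notebook, the status callable could be `lambda: sm.describe_project(ProjectName=project_name)["ProjectStatus"]`.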



### End of Option 1: Create project programmatically
Now you have provisioned a project template in your SageMaker environment. Navigate to the section **Configure the MLOps project**.

---

### Option 2: Create a project in Studio UI
<div class="alert alert-info"> 💡 <strong> Skip this section if you created a project programmatically </strong></div>

Follow the instructions in the Developer Guide – [Create a MLOps Project using Amazon SageMaker Studio or Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-create.html). Choose the **Studio** option.

For the template, choose **Model building and training**.
In **Project details**, provide a name and an optional project description. This template doesn't have any parameters.

Choose **Create** and wait for the project to appear in the Projects list.

### Resolve issues with project creation

#### Project creation process stuck in pending
If the project creation banner is still displayed after 5 minutes, close the Studio browser window and sign in to Studio again.

![](img/project-creation-pending.png)

#### Error messages
❗ If you see an error message similar to:
```
Your project couldn't be created
Studio encountered an error when creating your project. Try recreating the project again.

CodeBuild is not authorized to perform: sts:AssumeRole on arn:aws:iam::XXXX:role/service-role/AmazonSageMakerServiceCatalogProductsCodeBuildRole (Service: AWSCodeBuild; Status Code: 400; Error Code: InvalidInputException; Request ID: 4cf59a54-0c59-476a-a970-0ac656db4402; Proxy: null)
```

see steps 5-6 of [SageMaker Studio Permissions Required to Use Projects](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-studio-updates.html). Make sure you have all required project roles listed in the **Apps** card under **Projects**. 

Alternatively, you can create the required roles by using the provided CloudFormation template [`cfn-templates/sagemaker-project-templates-roles.yaml`](cfn-templates/sagemaker-project-templates-roles.yaml). 
Run the following command from the directory of the cloned repository, in a terminal where you have the corresponding permissions:

```sh
aws cloudformation deploy \
    --template-file cfn-templates/sagemaker-project-templates-roles.yaml \
    --stack-name sagemaker-project-template-roles \
    --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
    --parameter-overrides \
    CreateCloudFormationRole=YES \
    CreateCodeBuildRole=YES \
    CreateCodePipelineRole=YES \
    CreateEventsRole=YES \
    CreateProductsExecutionRole=YES 
```

### End of Option 2: Create a project in Studio UI
Now that you have created the project, move to the section **Configure the MLOps project**.

---

## Configure the MLOps project
The project runs a provided default model building pipeline automatically as soon as it has been created. This pipeline is a sample placeholder in the project for your own custom pipeline. Ignore the default pipeline for the moment.
The project template deploys the following architecture in your AWS account:

![](img/mlops-model-build-train.png)

The main components are:
1. The project template, made available through SageMaker Projects and an AWS Service Catalog portfolio
2. A CodePipeline pipeline with two stages - `Source` to download the source code from a CodeCommit repository and `Build` to create and execute a SageMaker pipeline
3. A default SageMaker pipeline with a model build, train, and register workflow
4. A seed code repository in CodeCommit with a default version of the placeholder code

This project contains all the required code and infrastructure to implement an automated CI/CD pipeline from a pre-defined template. 
To start using the project with your pipeline, you need to complete the following steps:
1. Clone the project CodeCommit repository to your notebook EBS volume
2. Replace the ML pipeline template sample code with your actual pipeline construction code, as implemented in the step 3 notebook
3. Modify the `codebuild-buildspec.yml` file to reference the correct Python module name and to set project parameters

The next sections guide you through these steps. For detailed instructions and a hands-on example, refer to the Developer Guide topic [SageMaker MLOps Project Walkthrough](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-walkthrough.html).

If you used Option 1 (`boto3`) to create the MLOps project, `project_name` and `project_id` are set automatically. You can run the following code cell to print the values. If you followed the UI instructions to create a project, you must set `project_name` manually.

In [None]:
try:
    print(project_name)
    print(project_id)
except NameError:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("You must set the project_name manually in the following code cell")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

In [None]:
# project_name = "<ENTER THE NAME OF THE CREATED PROJECT>" # Keep commented out if you used option 1 to create a project
region = sm.meta.region_name
r = sm.describe_project(ProjectName=project_name)
project_id = r['ProjectId']
project_arn = r['ProjectArn']
project_folder = f"sagemaker-{project_name}-{project_id}-modelbuild"
project_repo_url = f"codecommit::{region}://sagemaker-{project_name}-{project_id}-modelbuild"

print(f"Project folder: {project_folder}")
print(f"Project repo URL: {project_repo_url}")

### 1. Clone the project seed code to the JupyterLab file system
You need to clone the project code from the CodeCommit repository by using the terminal.

1. Open a new terminal window via **File** > **New** > **Terminal**
2. Install `git-remote-codecommit` helper: ```pip install git-remote-codecommit```
3. Clone the project repository: ```git clone <PROJECT REPO URL>```. Replace the `<PROJECT REPO URL>` with the actual project repo URL from the code cell above.
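
The clone command in step 3 can also be assembled in Python so it can be copied straight into the terminal. This is a small sketch that follows the repository naming pattern used in the code cell above. The helper `codecommit_clone_command` and the sample project name and ID are hypothetical.

```python
def codecommit_clone_command(project_name, project_id, region):
    """Build the git clone command for the project's model build repository."""
    repo = f"sagemaker-{project_name}-{project_id}-modelbuild"
    # git-remote-codecommit URLs take the form codecommit::<region>://<repository>
    return f"git clone codecommit::{region}://{repo}"


# Placeholder values for illustration only
clone_command = codecommit_clone_command("model-build-01", "p-abc123", "us-east-1")
print(clone_command)
```

In the notebook you would pass the actual `project_name`, `project_id`, and `region` variables.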

### 2. Replace pipeline construction code

The following steps customize the project, which contains the template code. The next code cell executes all the required steps, so you don't need to do anything manually. The following list is for your information only.

- The source code is in the folder `sagemaker-<project-name>-<project-id>-modelbuild`.
- The original file `codebuild-buildspec.yml` is renamed to `codebuild-buildspec-original.yml`.
- The folder containing the pipeline code in the project's code repository is renamed from `abalone` to `playerchurn`.
- The original file with the template pipeline `pipeline.py` is renamed to `pipeline-original.py`.
- Copy the `pipeline_steps` Python modules to the `pipelines` folder in the project's code repository folder.
- Copy the `requirements.txt` created in the notebook 3 to the `pipelines` folder in the project's code repository folder.
- Copy SageMaker Python SDK default configuration file `config.yaml` from the notebook 3 to the `pipelines` folder in the project's code repository folder.

In [None]:
# see the workshop folder name
!pwd

In [None]:
# If your local path for the workshop folder is different, set workshop_folder to the correct path relative to your home directory
workshop_folder = "mlops-sagemaker-mlflow"

In [None]:
!mkdir -p ~/{workshop_folder}/pipelines
!mv ~/{project_folder}/codebuild-buildspec.yml ~/{project_folder}/codebuild-buildspec-original.yml
!mv ~/{project_folder}/pipelines/abalone ~/{project_folder}/pipelines/playerchurn
!cp ~/{workshop_folder}/requirements.txt ~/{project_folder}
!cp ~/{workshop_folder}/config.yaml ~/{project_folder}
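
The renames and copies performed by the shell commands above can also be expressed with `pathlib` and `shutil`, which fails with a Python exception when a file is missing. The following is a minimal sketch run against throwaway directories. `customize_project` is a hypothetical helper, and the file contents are placeholders.

```python
import shutil
import tempfile
from pathlib import Path


def customize_project(project_dir, workshop_dir):
    """Apply the same renames and copies as the shell commands."""
    project_dir, workshop_dir = Path(project_dir), Path(workshop_dir)
    # Keep the template buildspec under a new name
    (project_dir / "codebuild-buildspec.yml").rename(
        project_dir / "codebuild-buildspec-original.yml")
    # Rename the sample pipeline package
    (project_dir / "pipelines" / "abalone").rename(
        project_dir / "pipelines" / "playerchurn")
    # Copy the dependency list and SDK configuration into the repository root
    for name in ("requirements.txt", "config.yaml"):
        shutil.copy2(workshop_dir / name, project_dir / name)


# Demonstration on temporary directories with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    workshop = Path(tmp) / "workshop"
    (project / "pipelines" / "abalone").mkdir(parents=True)
    workshop.mkdir()
    (project / "codebuild-buildspec.yml").write_text("version: 0.2\n")
    (workshop / "requirements.txt").write_text("xgboost\n")
    (workshop / "config.yaml").write_text("SchemaVersion: '1.0'\n")
    customize_project(project, workshop)
    layout = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))

print(layout)
```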

Test the pipeline locally before running remotely to see if everything works.

In [None]:
# Variables to be used in the CICD build pipeline.
region="us-east-1"
feature_group_name="" # replace feature group with the one created in the previous lab.
bucket_name="" # replace the S3 bucket name with the one created for this workshop, e.g. sagemaker-[region]-[aws account id]
bucket_prefix="player-churn/xgboost"
experiment_name="player-churn-model-build-pipeline"
train_instance_type="ml.m5.xlarge"
test_score_threshold=0.75
model_package_group_name="player-churn-model-group"
model_approval_status="PendingManualApproval"
mlflow_tracking_server_arn="" # Provide a valid mlflow tracking server ARN. You can find the value in the output from 00-start-here.ipynb
pipeline_name = "Player-Churn-Model-Training-Pipeline" # replace the value with the name of the training pipeline if it is different from the value given in previous lab.

In [None]:
assert len(feature_group_name) > 0
assert len(bucket_name) > 0
assert len(mlflow_tracking_server_arn) > 0
assert len(pipeline_name) > 0
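
A bare `assert` fails without saying which variable is empty. As a sketch of a friendlier check, the hypothetical helpers below collect the names of all unset values and report them together. The sample values are placeholders, not real resource names.

```python
def find_missing(params):
    """Return the names of parameters that are None or blank."""
    return [
        name for name, value in params.items()
        if value is None or str(value).strip() == ""
    ]


def check_required(params):
    """Raise a single error that lists every unset parameter."""
    missing = find_missing(params)
    if missing:
        raise ValueError("Set these variables before continuing: " + ", ".join(missing))


# Placeholder values for illustration only
sample_params = {
    "feature_group_name": "example-feature-group",
    "bucket_name": "",
    "mlflow_tracking_server_arn": None,
    "pipeline_name": "Player-Churn-Model-Training-Pipeline",
}
missing_names = find_missing(sample_params)
print(missing_names)
```

In the notebook, the dictionary would be built from the variables defined in the cell above.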

In [None]:
%store model_package_group_name
%store region
%store bucket_name
%store bucket_prefix

In [None]:
response = sm.start_pipeline_execution(
    PipelineName=pipeline_name,
    PipelineParameters=[
        {
            'Name': 'region',
            'Value': region
        },
        {
            'Name': 'feature_group_name',
            'Value': feature_group_name
        },
        {
            'Name': 'bucket_name',
            'Value': bucket_name
        },
        {
            'Name': 'bucket_prefix',
            'Value': bucket_prefix
        },
        {
            'Name': 'experiment_name',
            'Value': experiment_name
        },
        {
            'Name': 'train_instance_type',
            'Value': train_instance_type
        },
        {
            'Name': 'test_score_threshold',
            'Value': str(test_score_threshold)
        },
        {
            'Name': 'model_package_group_name',
            'Value': model_package_group_name
        },
        {
            'Name': 'model_approval_status',
            'Value': model_approval_status
        },
        {
            'Name': 'mlflow_tracking_server_arn',
            'Value': mlflow_tracking_server_arn
        }
    ],
    
)
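
The `PipelineParameters` list above repeats the same `Name`/`Value` structure for every entry. A short sketch of building it from a dictionary follows. `to_pipeline_parameters` is a hypothetical helper, and every value is converted with `str()` because the API expects string values.

```python
def to_pipeline_parameters(values):
    """Convert a dict into the Name/Value list used by start_pipeline_execution."""
    return [{"Name": name, "Value": str(value)} for name, value in values.items()]


# A subset of the notebook's parameters, for illustration
pipeline_parameters = to_pipeline_parameters({
    "region": "us-east-1",
    "train_instance_type": "ml.m5.xlarge",
    "test_score_threshold": 0.75,
})
print(pipeline_parameters)
```

The result could then be passed as `PipelineParameters=pipeline_parameters` in the call to `sm.start_pipeline_execution`.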

Wait for the pipeline to complete.

In [None]:
pipeline_exec_id = response["PipelineExecutionArn"]
describe_pipeline_response = sm.describe_pipeline_execution(
    PipelineExecutionArn=pipeline_exec_id
)

pipeline_execution_status = describe_pipeline_response["PipelineExecutionStatus"]
while True:
   if pipeline_execution_status in ['Stopped', 'Failed', 'Succeeded']:
       print(f"Pipeline execution completed with status: {pipeline_execution_status}")
       break
   print(f"Pipeline execution status: {pipeline_execution_status}")
   sleep(10)
   describe_pipeline_response = sm.describe_pipeline_execution(PipelineExecutionArn=pipeline_exec_id)
   pipeline_execution_status = describe_pipeline_response["PipelineExecutionStatus"]
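
If the execution ends with the status `Failed`, the step list usually shows where it went wrong. The sketch below filters the entries returned by `list_pipeline_execution_steps` for failed steps. `failed_steps` is a hypothetical helper, and the step names and failure reason in the sample are invented for illustration.

```python
def failed_steps(steps):
    """Return (step name, failure reason) pairs for all failed steps."""
    return [
        (step["StepName"], step.get("FailureReason", ""))
        for step in steps
        if step.get("StepStatus") == "Failed"
    ]


# Illustrative entries; real ones come from
# sm.list_pipeline_execution_steps(PipelineExecutionArn=pipeline_exec_id)["PipelineExecutionSteps"]
sample_steps = [
    {"StepName": "preprocess", "StepStatus": "Succeeded"},
    {"StepName": "train", "StepStatus": "Failed", "FailureReason": "ClientError: example"},
]
failures = failed_steps(sample_steps)
print(failures)
```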

At this point you have tested locally that the pipeline construction code works and creates a pipeline. You can see this pipeline in the Studio **Pipelines** widget. Now you are ready to create a CI/CD pipeline.

#### Attach the model package group to the project
Project-owned resources are automatically tagged with `sagemaker:project-name` and `sagemaker:project-id` tags for cost control, attribute-based security control, and governance. 
Since the model package group already exists in the model registry, you need to tag it to attach it to this project. The following code cell calls the `AddTags` API to set the project tags on the model package group.

In [None]:
model_package_group_arn = sm.describe_model_package_group(ModelPackageGroupName=model_package_group_name).get("ModelPackageGroupArn")

if model_package_group_arn:
    print(f"Adding tags {project_arn.split('/')[-1]} and {project_id} for model package group {model_package_group_arn}")
    r = sm.add_tags(
        ResourceArn=model_package_group_arn,
        Tags=[
            {
                'Key': 'sagemaker:project-name',
                'Value': project_arn.split("/")[-1]
            },
            {
                'Key': 'sagemaker:project-id',
                'Value': project_id
            },
        ]
    )
    print(r)
else:
    print(f"The model package group {model_package_group_name} doesn't exist")
    
sm.list_tags(ResourceArn=model_package_group_arn)["Tags"]
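
To confirm that both project tags are in place, the output of `list_tags` can be compared with the expected tags. The following is a minimal sketch. `project_tags` and `missing_tags` are hypothetical helpers, and the project name and ID are placeholders.

```python
def project_tags(project_name, project_id):
    """Build the two tags that attach a resource to a SageMaker project."""
    return [
        {"Key": "sagemaker:project-name", "Value": project_name},
        {"Key": "sagemaker:project-id", "Value": project_id},
    ]


def missing_tags(existing, expected):
    """Return the expected tags that are absent from the existing tag list."""
    present = {(tag["Key"], tag["Value"]) for tag in existing}
    return [tag for tag in expected if (tag["Key"], tag["Value"]) not in present]


# Placeholder values; real tags come from sm.list_tags(ResourceArn=...)["Tags"]
expected_tags = project_tags("model-build-01", "p-abc123")
existing_tags = [{"Key": "sagemaker:project-name", "Value": "model-build-01"}]
not_yet_set = missing_tags(existing_tags, expected_tags)
print(not_yet_set)
```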

### 3. Modify the build specification file
In the following cell, you modify the `codebuild-buildspec.yml` file in the project folder to reflect the new name of the Python module with your pipeline and to set other project-specific parameters.

You need to pass the following parameters to a pipeline creation script:

- `region` - the AWS Region where the pipeline runs
- `feature_group_name` - name of the feature group
- `bucket_name` - name of the S3 bucket that stores the artifacts created in the pipeline
- `bucket_prefix` - S3 prefix for storing the artifacts
- `experiment_name` - name of the experiment that organizes pipeline runs in the MLflow tracking server
- `train_instance_type` - instance type to use for model training
- `test_score_threshold` - minimum test score required to register the model in the registry
- `model_package_group_name` - name of the model package group for the model version
- `model_approval_status` - default model approval status when the model is registered in the Model Registry
- `mlflow_tracking_server_arn` - a valid MLflow tracking server ARN


In [None]:
code_build_buildspec_template = r"""

version: 0.2

phases:
  install:
    runtime-versions:
      python: 3.10
    commands:
      - pip install --upgrade --force-reinstall . "awscli>1.20.30"
      - pip install mlflow==2.13.2 sagemaker-mlflow s3fs xgboost
    
  build:
    commands:
      - export SAGEMAKER_USER_CONFIG_OVERRIDE="./config.yaml"
      - export PYTHONUNBUFFERED=TRUE
      - export SAGEMAKER_PROJECT_NAME_ID="${SAGEMAKER_PROJECT_NAME}-${SAGEMAKER_PROJECT_ID}"
      - |
        run-pipeline \
          --role-arn $SAGEMAKER_PIPELINE_ROLE_ARN \
          --tags "[{\"Key\":\"sagemaker:project-name\",\"Value\":\"${SAGEMAKER_PROJECT_NAME}\"}, {\"Key\":\"sagemaker:project-id\", \"Value\":\"${SAGEMAKER_PROJECT_ID}\"}]" \
          --pipeline-name "{{PIPELINE_NAME}}" \
          --kwargs "{ \
                \"region\":\"{{REGION}}\", \
                \"feature_group_name\":\"{{FEATURE_GROUP_NAME}}\",\
                \"bucket_name\":\"{{BUCKET_NAME}}\",\
                \"bucket_prefix\":\"{{BUCKET_PREFIX}}\",\
                \"experiment_name\":\"{{EXPERIMENT_NAME}}\", \
                \"train_instance_type\":\"{{TRAIN_INSTANCE_TYPE}}\", \
                \"test_score_threshold\":\"{{TEST_SCORE_THRESHOLD}}\",\
                \"model_package_group_name\":\"{{MODEL_PACKAGE_GROUP_NAME}}\",\
                \"model_approval_status\":\"{{MODEL_APPROVAL_STATUS}}\",\
                \"mlflow_tracking_server_arn\":\"{{MLFLOW_TRACKING_SERVER_ARN}}\"\
                    }"
      - echo "Create/update of the SageMaker Pipeline and a pipeline execution completed."
"""

In [None]:
code_build_buildspec = code_build_buildspec_template.replace("{{REGION}}", region)
code_build_buildspec = code_build_buildspec.replace("{{PIPELINE_NAME}}", pipeline_name)
code_build_buildspec = code_build_buildspec.replace("{{FEATURE_GROUP_NAME}}", feature_group_name)
code_build_buildspec = code_build_buildspec.replace("{{BUCKET_NAME}}", bucket_name)
code_build_buildspec = code_build_buildspec.replace("{{BUCKET_PREFIX}}", bucket_prefix)
code_build_buildspec = code_build_buildspec.replace("{{EXPERIMENT_NAME}}", experiment_name)
code_build_buildspec = code_build_buildspec.replace("{{TRAIN_INSTANCE_TYPE}}", train_instance_type)
code_build_buildspec = code_build_buildspec.replace("{{TEST_SCORE_THRESHOLD}}", str(test_score_threshold))
code_build_buildspec = code_build_buildspec.replace("{{MODEL_PACKAGE_GROUP_NAME}}", model_package_group_name)
code_build_buildspec = code_build_buildspec.replace("{{MODEL_APPROVAL_STATUS}}", model_approval_status)
code_build_buildspec = code_build_buildspec.replace("{{MLFLOW_TRACKING_SERVER_ARN}}", mlflow_tracking_server_arn)

In [None]:
with open("codebuild-buildspec.yml", "w") as f:
    f.write(code_build_buildspec)
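
A chain of `replace` calls silently leaves a `{{PLACEHOLDER}}` in the file if one substitution is forgotten. As a sketch of a safer variant, the hypothetical helper below fills the placeholders from a dictionary and raises an error when any remain. Shell variables such as `${SAGEMAKER_PROJECT_NAME}` use single braces and are left untouched.

```python
import re


def render_template(template, values):
    """Replace {{KEY}} placeholders and fail if any are left over."""
    rendered = template
    for key, value in values.items():
        rendered = rendered.replace("{{" + key + "}}", str(value))
    leftover = sorted(set(re.findall(r"\{\{([A-Z_]+)\}\}", rendered)))
    if leftover:
        raise KeyError(f"Unreplaced placeholders: {leftover}")
    return rendered


# Shortened template line for illustration
sample_template = 'run --region "{{REGION}}" --project "${SAGEMAKER_PROJECT_NAME}" --threshold {{TEST_SCORE_THRESHOLD}}'
rendered_line = render_template(
    sample_template, {"REGION": "us-east-1", "TEST_SCORE_THRESHOLD": 0.75}
)
print(rendered_line)
```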

Copy the `codebuild-buildspec.yml` and `run_pipeline.py` files from the workshop folder to the project's code repository folder:

In [None]:
!cp ~/{workshop_folder}/codebuild-buildspec.yml ~/{project_folder}/codebuild-buildspec.yml
!cp ~/{workshop_folder}/run_pipeline.py ~/{project_folder}/pipelines/run_pipeline.py

### 4. Fix the `setup.py` file
Finally, open the `setup.py` file in the project's code repository folder and replace the line `required_packages = ["sagemaker==2.XX.0"]` with `required_packages = ["sagemaker"]`. Save your changes.
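
The manual edit of `setup.py` can also be scripted. The sketch below applies a regular expression to the file content and reports how many lines were changed. `unpin_sagemaker` is a hypothetical helper, and the version number in the sample text is an arbitrary placeholder, since the actual pinned version depends on the template.

```python
import re

# Matches required_packages = ["sagemaker==<any version>"]
PINNED = re.compile(r'required_packages\s*=\s*\[\s*"sagemaker==[^"]*"\s*\]')


def unpin_sagemaker(setup_py_text):
    """Return the text with the sagemaker pin removed and the replacement count."""
    return PINNED.subn('required_packages = ["sagemaker"]', setup_py_text)


# Placeholder content; the version number is illustrative only
sample_setup = 'name = "pipelines"\nrequired_packages = ["sagemaker==2.0.0"]\n'
patched_text, replacements = unpin_sagemaker(sample_setup)
print(replacements)
print(patched_text)
```

To patch the real file, read `setup.py` from the project folder with `Path.read_text()`, pass the content through the helper, and write the result back only if the replacement count is 1.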

Why make this change? The pinned `sagemaker` library version is a bug and is going to be fixed in future releases of the built-in SageMaker project templates. For now you fix this template file manually. Keep in mind that the built-in project templates are for your convenience only and demonstrate how to use the SageMaker project mechanism to package and provision your own custom MLOps projects.

Now you are ready to launch the CI/CD model building pipeline.

---

## Run the CI/CD for the model building pipeline
To launch the CI/CD for the model building pipeline you need to push the changed code into the project CodeCommit repository.

<div class="alert alert-info">Make sure you are in the folder that contains the repository code in JupyterLab terminal when running git commands. The folder name looks like <code>sagemaker-[project-name]-[project-id]-modelbuild</code>.</div>

The cell below prints the required `cd` command with the correct folder name:

In [None]:
print(f"cd ~/{project_folder}")

Open a system terminal window via the JupyterLab menu **File** > **New** > **Terminal** and enter the following commands. Keep the sample `user.email` and `user.name` values or replace them with your own.
```sh
cd ~/<PROJECT-FOLDER>/<PROJECT-CODE-REPOSITORY-FOLDER>

git config --global user.email "you@example.com"
git config --global user.name "Your Name"
  
git add -A
git commit -am "customize project"
git push
```

After you push your code changes, the project initiates a run of the CodePipeline pipeline that constructs, upserts, and executes the SageMaker model building pipeline. This new pipeline execution creates a new model version in the model package group in the SageMaker model registry.

You can follow the execution of the pipeline in the Studio **Pipelines** widget.

Wait until the pipeline execution finishes. The execution takes about 15 minutes to complete.
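
Besides the Studio widget, the state of the CodePipeline stages can be read through the API. The sketch below condenses a `get_pipeline_state` response into a stage-to-status mapping. `summarize_stages` is a hypothetical helper, `NotStarted` is a label made up here for stages without an execution, and the sample response is illustrative. The CodePipeline pipeline name is assumed to match the repository name `sagemaker-<project-name>-<project-id>-modelbuild`, so verify it in your account.

```python
def summarize_stages(pipeline_state):
    """Map each CodePipeline stage name to the status of its latest execution."""
    return {
        stage["stageName"]: stage.get("latestExecution", {}).get("status", "NotStarted")
        for stage in pipeline_state.get("stageStates", [])
    }


# Illustrative response fragment; a real one comes from
# boto3.client("codepipeline").get_pipeline_state(name=<pipeline name>)
sample_state = {
    "stageStates": [
        {"stageName": "Source", "latestExecution": {"status": "Succeeded"}},
        {"stageName": "Build", "latestExecution": {"status": "InProgress"}},
    ]
}
stage_summary = summarize_stages(sample_state)
print(stage_summary)
```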

## View the details of a new model version
After the pipeline execution finishes, a new model version is registered in the model registry. To see the model version details:

1. In the Studio sidebar, choose the **Models** widget
2. Click on the name of the model package group you created in the previous lab to open the model group
3. In the list of model versions, select the latest version of the model

On the model version tab that opens, you can browse activity, [model version details](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-details.html), and [data lineage](https://docs.aws.amazon.com/sagemaker/latest/dg/lineage-tracking.html). 

![](img/model-version-details.png)

In a real-world project you add various model attributes and additional model version metadata such as [model quality metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-model-quality-metrics.html), [explainability](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-model-explainability.html) and [bias](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-post-training-bias.html) reports, load test data, and [inference recommender](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender.html).

To see the model package version in the Studio UI click on the link constructed by the code cell below. Note that you need to wait until the pipeline execution finishes to see the latest registered version of the model package.
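
Independently of the Studio UI, the latest registered version can be looked up through the SageMaker API. The sketch below does not build a Studio link. It only wraps `list_model_packages`, sorted by creation time, in a hypothetical helper `latest_model_package`. The stand-in client and its ARN exist solely so that the example runs without AWS access.

```python
def latest_model_package(sm_client, group_name):
    """Return the summary of the newest model package in the group, or None."""
    response = sm_client.list_model_packages(
        ModelPackageGroupName=group_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response.get("ModelPackageSummaryList", [])
    return summaries[0] if summaries else None


class StubSageMakerClient:
    """Stand-in for boto3.client("sagemaker"), for offline demonstration only."""

    def list_model_packages(self, **kwargs):
        return {"ModelPackageSummaryList": [{
            "ModelPackageArn": "arn:aws:sagemaker:us-east-1:111122223333:model-package/example/2",
            "ModelPackageVersion": 2,
            "ModelApprovalStatus": "PendingManualApproval",
        }]}


latest = latest_model_package(StubSageMakerClient(), "player-churn-model-group")
print(latest["ModelPackageVersion"], latest["ModelApprovalStatus"])
```

In the notebook, the call would be `latest_model_package(sm, model_package_group_name)`.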

## Summary
In this notebook you implemented a CI/CD pipeline with the following features:
- The model building ML pipeline is under source control in a CodeCommit repository
- Every push into the CodeCommit repository launches a new CodeBuild build which constructs, upserts, and executes the ML pipeline
- The whole end-to-end model development process is now automated, including the model building pipeline
- A SageMaker project is a logical construct in Studio which has the metadata about related ML pipelines, repositories, models, experiments, and inference endpoints

---

## Continue with step 7
Open the step 7 [notebook](07-sagemaker-deploy.ipynb).

## Further development ideas for other projects
- You can use a SageMaker-provided [MLOps template for model building, training, and deployment with third-party Git repositories using Jenkins](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-templates-sm.html#sagemaker-projects-templates-git-jenkins)
- Create a [custom SageMaker project template](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-templates-custom.html) to cover your specific project requirements

## Additional resources
- [Amazon SageMaker Pipelines lab in SageMaker Immersion Day](https://catalog.us-east-1.prod.workshops.aws/workshops/63069e26-921c-4ce1-9cc7-dd882ff62575/en-US/lab6)
- [Enhance your machine learning development by using a modular architecture with Amazon SageMaker projects](https://aws.amazon.com/blogs/machine-learning/enhance-your-machine-learning-development-by-using-a-modular-architecture-with-amazon-sagemaker-projects/)
- [Dive deep into automating MLOps](https://www.youtube.com/watch?v=3_cHnk9VSfQ)
- [SageMaker MLOps Project Walkthrough](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-walkthrough.html)
- [`aws-samples` GitHub repository with custom project templates examples](https://github.com/aws-samples/sagemaker-custom-project-templates)

# Shutdown kernel

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>