## A guide for updating item schemas in Amazon Personalize

The purpose of this notebook is to walk through the process for *updating* your *item schema* when using Amazon Personalize in the context of *User Segmentation* use cases.

This notebook is inspired by the official AWS [Retail Demo Store Workshop](https://github.com/aws-samples/retail-demo-store). The main difference is that the Retail Demo Store workshop focuses mainly on real-time recommendations, whereas this notebook demonstrates how to implement item schema updates for batch segmentation jobs, specifically for solutions that use the item-attribute-affinity recipe. The approach shown here applies to other recipe types as well.


## Notebook overview

### Core content:

Before we can demonstrate schema changes, we need to do some setup. This setup is documented in the first five chapters of this notebook.

After the setup, we walk through the process required to change an item schema.
This is what the remaining chapters go over.

Table of Contents:

------------Set up------------
- **Chapter 1**: Set Up Amazon S3 Bucket, IAM Policies, and IAM Roles (_5 minutes_)
- **Chapter 2**: Fetch and Inspect the Datasets (_5 minutes_)
- **Chapter 3**: Create Schemas and Import Datasets in Amazon Personalize (_15 minutes_)
- **Chapter 4**: Create an e-commerce custom solution in Amazon Personalize (_60 minutes_)
- **Chapter 5**: Run a Batch Segmentation Job on your Solution Version (_20 minutes_)

------------Update Schema------------
- **Chapter 6.0**: Overview of the Steps involved for Changing Item Schemas (_60 minutes_)
- **Chapter 6 Step 1**: Update the schema (via the CreateSchema API)
- **Chapter 6 Step 2**: Update the dataset that corresponds to your updated schema (via the UpdateDataset API)
- **Chapter 6 Step 3**: Import your new data/columns (via the CreateDatasetImportJob API)
- **Chapter 6 Step 4**: Create a new Solution (via the CreateSolution API) using the updated schema
- **Chapter 6 Step 5**: Train a new solution version (via the CreateSolutionVersion API)

------------Analysis------------
- **Chapter 7.0**: Submit the Same Batch Segmentation Job to our New Solution Version (_20 minutes_)
- **Chapter 7 Step 1**: Analysis: Inspect the outputs of our job & compare it to the output from the first solution version from chapter 5
- **Chapter 7 Step 2**: Analysis: Compare the metrics of the two solution versions

------------Clean up------------
- **Chapter 8**: Clean up (_15 minutes_)


#### Relevant Information:

- This notebook was developed and tested in the us-east-1 Region.

- To ensure a reliable run, please don't *concurrently* run multiple copies of this notebook on the same SageMaker Notebook Instance. If you want to run multiple copies concurrently, run each copy in its own environment/instance.

- After you finish running this notebook, please run the code cells in the `Clean up` chapter of this notebook (final chapter). This will prevent incurring additional costs.

- The purpose of this notebook is to demonstrate a high-level implementation of an end-to-end Personalize Workflow for schema updates. As such, the code within this notebook has not been tested for a production environment and for the sake of brevity, not all security best practices may have been implemented. For additional information to secure your Personalize-dependent workloads, refer to the [Security in Amazon Personalize](https://docs.aws.amazon.com/personalize/latest/dg/security.html) section of the Amazon Personalize documentation.

- The notebook uses the Python programming language and the AWS SDK for Python (referred to as boto3). Even if you are not fluent in Python, the code cells should be reasonably intuitive. In practice, you can use any programming language supported by the AWS SDK to complete the same steps from this notebook in your application environment. Visit the [official boto3 documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) for more information about the AWS SDK for Python.


## Introduction to Amazon Personalize

[Amazon Personalize](https://aws.amazon.com/personalize/) makes it easy for customers to develop applications with a wide array of personalization use cases, including real-time product recommendations and customized direct marketing. Amazon Personalize brings the same machine learning technology used by Amazon.com to everyone for use in their applications, with no machine learning experience required. Amazon Personalize customers pay for what they use, with no minimum fees or upfront commitment. You can start using Amazon Personalize with a simple three-step process, which only takes a few clicks in the AWS console or a set of simple API calls. First, point Amazon Personalize to user data, catalog data, and an activity stream of views, clicks, purchases, etc. in Amazon S3, or upload the data using a simple API call. Second, with a single click in the console or an API call, train a custom private recommendation model for your data. Third, retrieve personalized recommendations for any user by creating a recommender, campaign, or batch job.


## Chapter 1: Set Up Amazon S3 Bucket, IAM Policies, and IAM Roles

In this chapter, we focus on setting up our Amazon S3 bucket and initializing the IAM policies and roles required to run this workflow.

This chapter will take about 5 minutes.

### Update dependencies

To get started, we need to perform a bit of setup. First, we need to ensure that a current version of botocore is locally installed. The botocore library is used by boto3, the AWS SDK library for Python. We need a current version to be able to access some of the newer Amazon Personalize features.

The following cell will update pip and install the latest botocore library.

In [1]:
import sys
!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install --upgrade --no-deps --force-reinstall botocore


Collecting botocore
  Using cached botocore-1.31.72-py3-none-any.whl.metadata (6.1 kB)
Using cached botocore-1.31.72-py3-none-any.whl (11.3 MB)
Installing collected packages: botocore
  Attempting uninstall: botocore
    Found existing installation: botocore 1.31.72
    Uninstalling botocore-1.31.72:
      Successfully uninstalled botocore-1.31.72
Successfully installed botocore-1.31.72
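
As a quick sanity check after the upgrade, you can confirm that the installed SDK actually exposes the Personalize operations used later in this notebook. The following is a minimal sketch, not part of the original workflow: the helper name `missing_operations` and the `REQUIRED_OPERATIONS` list are illustrative assumptions, and the list only covers the APIs named in the table of contents.

```python
def missing_operations(client, operation_names):
    """Return the operation names that `client` does not expose as callable methods."""
    return [
        name
        for name in operation_names
        if not callable(getattr(client, name, None))
    ]


# snake_case boto3 method names for the APIs listed in the table of contents.
REQUIRED_OPERATIONS = [
    "create_schema",
    "update_dataset",
    "create_dataset_import_job",
    "create_solution",
    "create_solution_version",
]

# Usage inside the notebook (requires boto3 and AWS credentials):
#   import boto3
#   missing = missing_operations(boto3.client("personalize"), REQUIRED_OPERATIONS)
#   if missing:
#       raise RuntimeError(f"botocore looks too old; missing operations: {missing}")
```

An empty list means every operation is available; a non-empty list suggests the upgrade did not take effect in the running kernel, in which case restarting the kernel is usually the first thing to try.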


### Import dependencies

Next we need to import some dependencies/libraries needed to complete this part of the notebook.
These are not all of the dependencies we'll be using throughout this notebook. But we will import the rest of them as needed.


In [2]:
import boto3
import json
import pandas as pd

import time
import csv  
import os

from io import StringIO
import uuid
from botocore.exceptions import ClientError

import numpy
import botocore
from datetime import datetime

### Create clients

Next we need to create the AWS service clients needed for this demonstration.

- **personalize**: this client is used to create resources in Amazon Personalize
- **s3**: this client is used to access S3 commands and resources


In [3]:
# Setup clients
personalize = boto3.client('personalize')
s3 = boto3.Session().client('s3')


### Set up our Amazon S3 bucket

For simplicity, we will use a single bucket to store our input data, output data, helper scripts, and other files.
In a production environment, however, you may want to store these assets in separate buckets.

To ensure a consistent naming convention throughout this notebook, we derive a short pseudo-random 'token' from the current timestamp below and use it as a suffix for resource names.

In [4]:
# Use an epoch timestamp with millisecond precision to produce a pseudo-random value for the token.
# Alternatively, enter your own *lowercase alphanumeric* string of 5 characters here. The 'token' is used for naming AWS resources.
token = str(round(time.time()*1000))[-5:]
print(f'The value of your token is:"{token}".')

# Bucket name *must* contain the substring 'Personalize' or 'personalize'. 
#  This is to ensure compliance with the execution role of this SageMaker Notebook instance.
bucket_name = 'personalize-dataset-schema-update-example-' + token

# If creating bucket outside of us-east-1, specify the region code within CreateBucketConfiguration.LocationConstraint attribute.
# Reference: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/create_bucket.html
bucket = s3.create_bucket(
    Bucket=bucket_name
)

# The Bucket Policy we need to attach to the bucket in order to allow Amazon Personalize to access it.
bucket_policy = {
    'Version': '2012-10-17',
    'Id': 'PersonalizeS3BucketAccessPolicy',
    'Statement': [
        {
            'Sid': 'PersonalizeS3BucketAccessPolicy',
            'Effect': 'Allow',
            'Principal': {
                'Service': 'personalize.amazonaws.com'
            },
            'Action': [
                's3:GetObject',
                's3:ListBucket',
                's3:PutObject'
            ],
            'Resource': [
                f'arn:aws:s3:::{bucket_name}',
                f'arn:aws:s3:::{bucket_name}/*'
            ]
        }
    ]
}

# Convert the policy to a JSON string and attach it to the bucket
bucket_policy = json.dumps(bucket_policy)
s3.put_bucket_policy(Bucket=bucket_name, Policy=bucket_policy)


# prints out the bucket
print('Bucket: {}'.format(bucket['Location']))

The value of your token is:"85652".
Bucket: /personalize-dataset-schema-update-example-85652
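
The cell above assumes the us-east-1 Region. If you run the notebook elsewhere, the `create_bucket` call needs a `CreateBucketConfiguration`, as the code comment notes. The sketch below shows one way to build the arguments; the helper name `create_bucket_kwargs` is an illustrative assumption, not something the original notebook defines.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for s3.create_bucket for the given Region."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


# Usage in the notebook (assumes the session has a configured Region):
#   region = boto3.Session().region_name
#   bucket = s3.create_bucket(**create_bucket_kwargs(bucket_name, region))
print(create_bucket_kwargs("personalize-example", "eu-west-1"))
```

Keeping the Region logic in a small pure function makes it easy to check without creating any AWS resources.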


### Set up Amazon IAM Permissions for the Personalize Service


In addition to a bucket policy that allows Amazon Personalize access, we also need to explicitly grant the Amazon Personalize service those permissions within an IAM Role. This will enable the Personalize service to fetch and write data to Amazon S3. We use a custom-made customer-managed IAM policy to ensure we are abiding by the [Principle of Least Privilege best security practice](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html).

In [5]:
# Set up IAM for Personalize

iam = boto3.client('iam')

# role_name must begin with the substring 'PersonalizeRole' to ensure compliance with the execution role of this SageMaker Notebook instance.
role_name = 'PersonalizeRole-'+token

print("Creating IAM Role...")
role_arn = iam.create_role(
    RoleName=role_name,
    # Allow Amazon Personalize to assume this role
    AssumeRolePolicyDocument=json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}))
role_arn = role_arn['Role']['Arn']

print("Created IAM Role. IAM Role ARN: " + role_arn)


# Create the IAM policy for Personalize
personalize_policy_doc = {
    'Version': '2012-10-17',
    'Id': 'PersonalizeS3BucketAccessPolicy-'+token,
    'Statement': [
        {
            'Sid': 'PersonalizeS3BucketAccessPolicy',
            'Action': [
                's3:GetObject',
                's3:ListBucket',
                's3:PutObject'
            ],
            'Resource': [
                f'arn:aws:s3:::{bucket_name}',
                f'arn:aws:s3:::{bucket_name}/*'
            ],
            'Effect': 'Allow'
        }
    ]
}
personalize_policy_doc = json.dumps(personalize_policy_doc)

# iam_policy_name must begin with the substring 'PersonalizePolicy' to ensure compliance with the execution role of this SageMaker Notebook instance.
iam_policy_name = 'PersonalizePolicy-'+token

print("Creating IAM Policy...")
policy_response = iam.create_policy(
    PolicyName=iam_policy_name,
    PolicyDocument=personalize_policy_doc,
    Description='Policy to allow Personalize access to our S3 bucket'
)

# get arn of the policy
policy_arn = policy_response['Policy']['Arn']
policy_version = policy_response['Policy']['DefaultVersionId']
print("Created IAM Policy. Policy ARN: " + policy_arn)

# Attach the policy to the role
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn=policy_arn
)
print("Attached policy to Role")

Creating IAM Role...
Created IAM Role. IAM Role ARN: arn:aws:iam::402114309305:role/PersonalizeRole-85652
Creating IAM Policy...
Created IAM Policy. Policy ARN: arn:aws:iam::402114309305:policy/PersonalizePolicy-85652
Attached policy to Role
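
The bucket policy and the IAM policy above grant the same three S3 actions on the same resources; the only structural difference is that the bucket policy (a resource-based policy) names a `Principal`, while the IAM policy (an identity-based policy) does not. A minimal sketch of building both from one definition follows. The helper names `s3_access_statement` and `policy_document` are hypothetical and are not used elsewhere in this notebook.

```python
import json

S3_ACTIONS = ["s3:GetObject", "s3:ListBucket", "s3:PutObject"]


def s3_access_statement(bucket_name, principal_service=None):
    """Return one Allow statement for the bucket and its objects.

    Pass principal_service for a bucket policy; omit it for an IAM policy.
    """
    statement = {
        "Sid": "PersonalizeS3BucketAccessPolicy",
        "Effect": "Allow",
        "Action": list(S3_ACTIONS),
        "Resource": [
            f"arn:aws:s3:::{bucket_name}",
            f"arn:aws:s3:::{bucket_name}/*",
        ],
    }
    if principal_service is not None:
        statement["Principal"] = {"Service": principal_service}
    return statement


def policy_document(statement):
    """Wrap a single statement in a policy document and serialize it to JSON."""
    return json.dumps({"Version": "2012-10-17", "Statement": [statement]})


# Bucket policy (with Principal) and IAM policy (without) from the same source.
bucket_policy_json = policy_document(
    s3_access_statement("my-bucket", "personalize.amazonaws.com")
)
iam_policy_json = policy_document(s3_access_statement("my-bucket"))
```

Deriving both documents from one function keeps the two permission sets from drifting apart if you later narrow the actions.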


## Chapter 2: Fetch and Inspect the Datasets

Amazon Personalize provides predefined recipes, based on common use cases, for training models. A recipe is a machine learning algorithm that you use with settings, or hyperparameters, and the data you provide to train an Amazon Personalize model. The data you provide to train a model is organized into separate datasets by the type of data being provided, and a collection of datasets is organized into a dataset group. The three dataset types supported by Personalize are items, users, and interactions. Different recipes require different combinations of dataset types, but every recipe requires an interactions dataset. Interactions represent how users interact with items, for example viewing a product, watching a video, listening to a recording, or reading an article. In this notebook, we will be using the item-attribute-affinity recipe, which supports all three dataset types.

In this chapter, you will:

- Copy public datasets to your private S3 bucket
- Load the datasets into this notebook environment
- Inspect the datasets so you have an understanding of the data
    
This chapter will take about 5 minutes.

Let's get started!

#### Some context on 'Items' datasets

When training models in Amazon Personalize, we can provide structured and unstructured metadata about our items. This data helps improve the relevance of recommendations and is particularly useful when recommending new/cold items added to your catalog. 

Optional reading: For this notebook we will be creating 'custom solutions' for our use cases. Personalize also offers retail domain recommenders. This construct, which was released at re:Invent 2021, is used for real-time inference. You can read more about it in the [Personalize blog](https://aws.amazon.com/blogs/machine-learning/amazon-personalize-announces-recommenders-optimized-for-retail-and-media-entertainment/).

The retail domain recommenders stipulate some [reserved fields/columns](https://docs.aws.amazon.com/personalize/latest/dg/ECOMMERCE-datasets-and-schemas.html) that we must conform to. For example, some columns that Personalize supports for an `Items` dataset include `ITEM_ID`, `PRICE`, `CATEGORY_L1`, `CATEGORY_L2`, `PRODUCT_DESCRIPTION`, and `GENDER`. Personalize will automatically apply a natural language processing (NLP) machine learning model to the product description column to extract features from the text. The product's unique identifier is required. For items, at least one metadata column (such as price or level-1 category) is also required. In the first part of this notebook, we will create a model that only uses the `ITEM_ID`, `PRICE`, `CATEGORY_L1` columns. In the second part of the notebook, we will update the schema to include the `CATEGORY_L2`, `PRODUCT_DESCRIPTION`, and `GENDER` columns as well.

### Save to CSV and upload to S3 bucket

For this notebook, we will be using publicly available datasets. These datasets are part of the [Retail Demo Store](https://github.com/aws-samples/retail-demo-store) project and are provided as a public download. 

The following cell will copy the CSV datasets from the download URL to the local volume and then upload them to our private S3 bucket.

In [6]:
users_filename, items_filename, interactions_filename = "users.csv", "items.csv", "interactions.csv"

# copy the datasets from the public s3 bucket to our private s3 bucket
for file in [users_filename, items_filename, interactions_filename]:
    !wget https://code.retaildemostore.retail.aws.dev/csvs/{file}
    boto3.Session().resource('s3').Bucket(bucket_name).Object(file).upload_file(file)
    print(f'Finishing copying {file} to {bucket_name}')

Finishing copying users.csv to personalize-dataset-schema-update-example-85652
Finishing copying items.csv to personalize-dataset-schema-update-example-85652
Finishing copying interactions.csv to personalize-dataset-schema-update-example-85652


Now, we will download our datasets from our private S3 bucket into this notebook environment, and load each of them into a [Pandas dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

Finally, we will display the first few rows of each dataset just so we have a sense of its dimensions.

In [7]:
# Users Dataset
get_users_csv_response = s3.get_object(Bucket=bucket_name, Key=users_filename)
users_csv_content = get_users_csv_response['Body'].read().decode('utf-8')

# Create a pandas DataFrame from the CSV content
users_df = pd.read_csv(StringIO(users_csv_content))
print('\nUsers:-----------')
print(users_df.head())  # Inspect the first few rows of the DataFrame


# Items Dataset
get_items_csv_response = s3.get_object(Bucket=bucket_name, Key=items_filename)
items_csv_content = get_items_csv_response['Body'].read().decode('utf-8')

# Create a pandas DataFrame from the CSV content
items_basic_df = pd.read_csv(StringIO(items_csv_content))
print('\nOriginal Items:-----------')
print(items_basic_df.head())  # Inspect the first few rows of the DataFrame


# Interactions Dataset
get_interactions_csv_response = s3.get_object(Bucket=bucket_name, Key=interactions_filename)
interactions_csv_content = get_interactions_csv_response['Body'].read().decode('utf-8')

# Create a pandas DataFrame from the CSV content
interactions_df = pd.read_csv(StringIO(interactions_csv_content))

interactions_df['USER_ID'] = interactions_df.USER_ID.astype(str)
interactions_df['TIMESTAMP'] = interactions_df.TIMESTAMP.astype(int)
print('\nInteractions:-----------')
print(interactions_df.head())  # Inspect the first few rows of the DataFrame


Users:-----------
   USER_ID  AGE GENDER
0        1   31      M
1        2   58      F
2        3   43      M
3        4   38      M
4        5   24      M

Original Items:-----------
                                ITEM_ID   PRICE  CATEGORY_L1 CATEGORY_L2  \
0  6579c22f-be2b-444c-a52b-0116dd82df6c   90.99  accessories    backpack   
1  2e852905-c6f4-47db-802c-654013571922  123.99  accessories    backpack   
2  4ec7ff5c-f70f-4984-b6c4-c7ef37cc0c09   87.99  accessories    backpack   
3  7977f680-2cf7-457d-8f4d-afa0aa168cb9  125.99  accessories    backpack   
4  b5649d7c-4651-458d-a07f-912f253784ce  141.99  accessories    backpack   

                                 PRODUCT_DESCRIPTION GENDER PROMOTED  
0           This tan backpack is nifty for traveling      F        N  
1                       Pale pink backpack for women      F        N  
2  This gainsboro backpack for women is first-rat...      F        N  
3  This gray backpack for women is first-rate for...      F        N  
4  

#### Optional reading: Inspection of our user & interactions input data

Similar to the items dataset, we can provide metadata about our users when training models in Personalize. For this demonstration, we have included each user's age and gender. For more information about requirements for the users dataset, refer to the [AWS documentation](https://docs.aws.amazon.com/personalize/latest/dg/ECOMMERCE-users-dataset.html).


Additionally, take a look at the first few lines of the interactions file. Note the following columns:

- The EVENT_TYPE column can be used to train different Personalize campaigns and custom solutions, and can also be used to filter recommendations. To simulate a real-world site, most of the events are views, whereas a much smaller proportion are add-to-cart, checkout, and purchase events.
- The custom DISCOUNT column is a contextual metadata field that a Personalize user-personalization solution can take into account to predict the best next product based on the user's propensity to interact with discounted products.
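
To see how skewed the event types are, you can compute the share of each `EVENT_TYPE` value. The sketch below uses only the standard library and a made-up sample; the event names and counts in `sample` are illustrative assumptions and are not taken from `interactions.csv`.

```python
from collections import Counter


def event_type_shares(event_types):
    """Return {event_type: fraction of all events}, most frequent first."""
    counts = Counter(event_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {event: count / total for event, count in counts.most_common()}


# Made-up sample for illustration only.
sample = ["View"] * 8 + ["AddToCart", "Purchase"]
print(event_type_shares(sample))

# In the notebook, the same function accepts the DataFrame column directly:
#   event_type_shares(interactions_df["EVENT_TYPE"])
```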

### Trim down the Items dataset
The publicly available `items dataset` is the dataset that we will be using *after* we update the schema. For the *original* schema, we want to have fewer columns (this means that when we later use the out-of-the-box dataset, we'll essentially be simulating the addition of new columns).

Thus, we will trim some non-required columns from the items dataset and create our first solution from this modified dataset.

Specifically, we will:
- remove the columns for product description, gender, category_l2, and promoted
- upload this `basic` items dataset to s3. We will use this csv in the `first` solution we create.


In [8]:

items_basic_filename = 'items_basic.csv'

# Remove the specified columns
columns_to_remove = ["PRODUCT_DESCRIPTION", "GENDER", "CATEGORY_L2", "PROMOTED"]
items_basic_df = items_basic_df.drop(columns=columns_to_remove)

# Save the modified DataFrame to a new CSV file
items_basic_df.to_csv(items_basic_filename, index=False)

# Upload the modified CSV file to an S3 bucket
s3.upload_file(items_basic_filename, bucket_name, items_basic_filename)

print(f"File '{items_basic_filename}' has been uploaded to S3 bucket '{bucket_name}' with key '{items_basic_filename}'.")

print('\n Basic Items dataset preview:-----------')
print(items_basic_df.head())  # Inspect the first few rows of the DataFrame

File 'items_basic.csv' has been uploaded to S3 bucket 'personalize-dataset-schema-update-example-85652' with key 'items_basic.csv'.

 Basic Items dataset preview:-----------
                                ITEM_ID   PRICE  CATEGORY_L1
0  6579c22f-be2b-444c-a52b-0116dd82df6c   90.99  accessories
1  2e852905-c6f4-47db-802c-654013571922  123.99  accessories
2  4ec7ff5c-f70f-4984-b6c4-c7ef37cc0c09   87.99  accessories
3  7977f680-2cf7-457d-8f4d-afa0aa168cb9  125.99  accessories
4  b5649d7c-4651-458d-a07f-912f253784ce  141.99  accessories


Confirm that the trimmed down items dataset only has the following columns: ITEM_ID, PRICE, CATEGORY_L1.
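
Rather than confirming the columns by eye, you can make the check explicit so the notebook stops early if the trimmed file is not what the schema expects. This is a minimal sketch; the helper name `check_columns` and the constant `EXPECTED_BASIC_COLUMNS` are assumptions introduced for illustration.

```python
EXPECTED_BASIC_COLUMNS = ["ITEM_ID", "PRICE", "CATEGORY_L1"]


def check_columns(actual, expected):
    """Raise ValueError if `actual` is missing expected columns or has extra ones."""
    actual, expected = list(actual), list(expected)
    missing = [col for col in expected if col not in actual]
    unexpected = [col for col in actual if col not in expected]
    if missing or unexpected:
        raise ValueError(
            f"missing columns: {missing}; unexpected columns: {unexpected}"
        )


# Passes silently when the columns match.
check_columns(["ITEM_ID", "PRICE", "CATEGORY_L1"], EXPECTED_BASIC_COLUMNS)

# In the notebook:
#   check_columns(items_basic_df.columns, EXPECTED_BASIC_COLUMNS)
```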

## Chapter 2 Summary - What have we accomplished?

In this chapter, we fetched pre-prepared sample datasets for each dataset type (items, users, and interactions) and uploaded them to the Amazon S3 bucket for later use.

We also inspected the three dataset types that will be used to train models and create custom solutions in Amazon Personalize.

In the next chapter we will start creating resources in Amazon Personalize to receive our dataset files.

# Chapter 3: Create Schemas and Import Datasets into Amazon Personalize


In this Chapter we are going to create an Amazon Personalize dataset group and import our three datasets into Amazon Personalize.

## Chapter 3 Objectives

In this chapter we will accomplish the following steps. This chapter should take about 15 minutes to complete.

- Create schema resources in Amazon Personalize that define the layout of our three dataset files (CSVs) created in the prior chapter
- Create a dataset group in Amazon Personalize that will be used to receive our datasets
- Create a dataset in the Personalize dataset group for the three dataset types and schemas
    - Items: information about the products in the Retail Demo Store
    - Users: information about the users in the Retail Demo Store
    - Interactions: user-item interactions representing typical storefront behavior such as viewing products, adding products to a shopping cart, purchasing products, and so on
- Create dataset import jobs to import each of the three datasets into Personalize

Note: We will be using the trimmed version of the items dataset here.

## Configure Amazon Personalize

Now that we've prepared our three datasets and uploaded them to S3 we'll need to configure the Amazon Personalize service to understand our data so that it can be used to train models for generating recommendations.

### Create Schemas for Datasets

Amazon Personalize requires a schema for each dataset so it can map the columns in our CSVs to fields for model training. Each schema is declared in JSON using the [Apache Avro](https://avro.apache.org/) format.

Let's define and create schemas in Personalize for our datasets.

Note that categorical fields include an additional attribute of `"categorical": true`, and a textual field includes an additional attribute of `"textual": true`. Categorical fields are those where one or more values can be specified for the field value (i.e. enumerated values), for example one or more category names/codes for the `CATEGORY_L1` field. A textual field indicates that Personalize should apply a natural language processing (NLP) model to the field's value to extract model features from unstructured text. In this dataset, the product description is the textual field, and you can only have one textual field in the items dataset. Finally, a string field that has neither `categorical` nor `textual` specified, such as `PROMOTED` in the full items dataset, is not included as a feature in the model but can be used for filtering (out of scope of this notebook). The trimmed items schema created in this chapter contains only `ITEM_ID`, `PRICE`, and `CATEGORY_L1`, so it has no textual field and no `PROMOTED` field yet.
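
The rules described above can be checked locally before calling CreateSchema. The following is a rough sketch under stated assumptions: the function name `check_items_schema` is hypothetical, it only covers the rules mentioned in this notebook (a required `ITEM_ID`, at least one metadata field, at most one textual field), and it additionally assumes that `categorical` and `textual` are only meaningful on string-typed fields. Personalize performs its own, more complete validation.

```python
def check_items_schema(schema):
    """Return a list of problems found in an items schema dict (empty if none)."""
    problems = []
    fields = schema.get("fields", [])
    names = [field["name"] for field in fields]

    if "ITEM_ID" not in names:
        problems.append("ITEM_ID field is missing")
    if not [name for name in names if name != "ITEM_ID"]:
        problems.append("at least one metadata field is required")

    textual = [field["name"] for field in fields if field.get("textual")]
    if len(textual) > 1:
        problems.append(f"more than one textual field: {textual}")

    for field in fields:
        # A nullable field declares its type as a list such as ["null", "string"].
        types = field["type"] if isinstance(field["type"], list) else [field["type"]]
        if (field.get("categorical") or field.get("textual")) and "string" not in types:
            problems.append(f"{field['name']}: categorical/textual requires a string type")
    return problems


example_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "ITEM_ID", "type": "string"},
        {"name": "PRICE", "type": "float"},
        {"name": "CATEGORY_L1", "type": "string", "categorical": True},
    ],
    "version": "1.0",
}
print(check_items_schema(example_schema))
```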

Another detail to note is that when we call the [CreateSchema](https://docs.aws.amazon.com/personalize/latest/dg/API_CreateSchema.html) API, we pass an optional `domain` parameter with a value of `ECOMMERCE`. This tells Personalize that we are creating a schema for the Retail/E-commerce domain. We will do this for all three schemas.

#### Users Dataset Schema

In [9]:
users_schema = {
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "AGE",
            "type": "int"
        },
        {
            "name": "GENDER",
            "type": "string",
            "categorical": True,
        }
    ],
    "version": "1.0"
}

try:
    users_schema_name = 'users-schema-'+token
    create_schema_response = personalize.create_schema(
        name = users_schema_name,
        domain = "ECOMMERCE",
        schema = json.dumps(users_schema)
    )
    print(json.dumps(create_schema_response, indent=2))
    users_schema_arn = create_schema_response['schemaArn']
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this schema, seemingly')
    paginator = personalize.get_paginator('list_schemas')
    for paginate_result in paginator.paginate():
        for schema in paginate_result['schemas']:
            if schema['name'] == users_schema_name:
                users_schema_arn = schema['schemaArn']
                print(f"Using existing schema: {users_schema_arn}")
                break

{
  "schemaArn": "arn:aws:personalize:us-east-1:402114309305:schema/users-schema-85652",
  "ResponseMetadata": {
    "RequestId": "e0de7810-0017-4604-9c89-fd9844b63eea",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 18:36:32 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "84",
      "connection": "keep-alive",
      "x-amzn-requestid": "e0de7810-0017-4604-9c89-fd9844b63eea"
    },
    "RetryAttempts": 0
  }
}


#### Items Dataset Schema

In [10]:
# In this chapter, we are going to train our solution on only the following three columns: ITEM_ID, PRICE, CATEGORY_L1.
# A later portion of this notebook will go over how to update the schema to include additional columns 
# such as `CATEGORY_L2`, `PRODUCT_DESCRIPTION`, and `GENDER`. 

items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "PRICE",
            "type": "float"
        },
        {
            "name": "CATEGORY_L1",
            "type": "string",
            "categorical": True,
        },
    ],
    "version": "1.0"
}

try:
    items_schema_name = 'items-schema-'+token
    create_schema_response = personalize.create_schema(
        name = items_schema_name,
        domain = 'ECOMMERCE',
        schema = json.dumps(items_schema)
    )
    items_schema_arn = create_schema_response['schemaArn']
    print(json.dumps(create_schema_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this schema, seemingly')
    paginator = personalize.get_paginator('list_schemas')
    for paginate_result in paginator.paginate():
        for schema in paginate_result['schemas']:
            if schema['name'] == items_schema_name:
                items_schema_arn = schema['schemaArn']
                print(f"Using existing schema: {items_schema_arn}")
                break

{
  "schemaArn": "arn:aws:personalize:us-east-1:402114309305:schema/items-schema-85652",
  "ResponseMetadata": {
    "RequestId": "999ab420-c646-4ad8-a6d0-f030c65e09be",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 18:36:32 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "84",
      "connection": "keep-alive",
      "x-amzn-requestid": "999ab420-c646-4ad8-a6d0-f030c65e09be"
    },
    "RetryAttempts": 0
  }
}


#### Interactions Dataset Schema

In [11]:
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "EVENT_TYPE",  # "View", "Purchase", etc.
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        },
        {
            "name": "DISCOUNT",  # This is the contextual metadata - "Yes" or "No".
            "type": "string"
        },
    ],
    "version": "1.0"
}

try:
    interactions_schema_name = 'interactions-schema-'+token
    create_schema_response = personalize.create_schema(
        name = interactions_schema_name,
        domain = "ECOMMERCE",
        schema = json.dumps(interactions_schema)
    )
    print(json.dumps(create_schema_response, indent=2))
    interactions_schema_arn = create_schema_response['schemaArn']
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this schema, seemingly')
    paginator = personalize.get_paginator('list_schemas')
    for paginate_result in paginator.paginate():
        for schema in paginate_result['schemas']:
            if schema['name'] == interactions_schema_name:
                interactions_schema_arn = schema['schemaArn']
                print(f"Using existing schema: {interactions_schema_arn}")
                break

{
  "schemaArn": "arn:aws:personalize:us-east-1:402114309305:schema/interactions-schema-85652",
  "ResponseMetadata": {
    "RequestId": "8882cc35-1e1f-4874-971d-f7bacbfdf1f7",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 18:36:32 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "91",
      "connection": "keep-alive",
      "x-amzn-requestid": "8882cc35-1e1f-4874-971d-f7bacbfdf1f7"
    },
    "RetryAttempts": 0
  }
}
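
The three schema cells above repeat the same create-or-look-up pattern. As an optional refactoring, the pattern can be factored into one helper. This is a sketch rather than the notebook's own code: the name `create_or_get_schema` is an assumption, while `create_schema`, `get_paginator('list_schemas')`, and `ResourceAlreadyExistsException` are the same calls the cells above already use. Unlike the original loops, the helper returns as soon as it finds a match and re-raises the error if no schema with that name is listed.

```python
import json


def create_or_get_schema(personalize, name, schema, domain="ECOMMERCE"):
    """Create a schema, or return the ARN of an existing schema with the same name."""
    try:
        response = personalize.create_schema(
            name=name,
            domain=domain,
            schema=json.dumps(schema),
        )
        return response["schemaArn"]
    except personalize.exceptions.ResourceAlreadyExistsException:
        # The schema exists already, so look up its ARN by name.
        paginator = personalize.get_paginator("list_schemas")
        for page in paginator.paginate():
            for existing in page["schemas"]:
                if existing["name"] == name:
                    return existing["schemaArn"]
        # Nothing matched: surface the original error instead of continuing silently.
        raise


# Usage in the notebook:
#   users_schema_arn = create_or_get_schema(personalize, 'users-schema-' + token, users_schema)
#   items_schema_arn = create_or_get_schema(personalize, 'items-schema-' + token, items_schema)
```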


### Create and Wait for Dataset Group

Next we need to create the dataset group that will contain our three datasets. This is one of many Personalize operations that are asynchronous. That is, we call an API to create a resource and have to wait for it to become active.

#### Create Dataset Group

Note that we are also passing `ECOMMERCE` for the `domain` parameter here too.

In [12]:
try:
    dataset_group_name = 'retaildemostore-products-DSG-'+token
    create_dataset_group_response = personalize.create_dataset_group(
        name = dataset_group_name,
        domain = 'ECOMMERCE'
    )
    dataset_group_arn = create_dataset_group_response['datasetGroupArn']
    print(json.dumps(create_dataset_group_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset group, seemingly')
    paginator = personalize.get_paginator('list_dataset_groups')
    for paginate_result in paginator.paginate():
        for dataset_group in paginate_result['datasetGroups']:
            if dataset_group['name'] == dataset_group_name:
                dataset_group_arn = dataset_group['datasetGroupArn']
                break
                
print(f'DatasetGroupArn = {dataset_group_arn}')

{
  "datasetGroupArn": "arn:aws:personalize:us-east-1:402114309305:dataset-group/retaildemostore-products-DSG-85652",
  "domain": "ECOMMERCE",
  "ResponseMetadata": {
    "RequestId": "6fc951ef-453d-4195-b2e0-515de29956f6",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 18:36:32 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "134",
      "connection": "keep-alive",
      "x-amzn-requestid": "6fc951ef-453d-4195-b2e0-515de29956f6"
    },
    "RetryAttempts": 0
  }
}
DatasetGroupArn = arn:aws:personalize:us-east-1:402114309305:dataset-group/retaildemostore-products-DSG-85652


#### Wait for Dataset Group to Have ACTIVE Status
This should take about a minute.

In [13]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(15)

DatasetGroup: CREATE PENDING
DatasetGroup: CREATE PENDING
DatasetGroup: ACTIVE
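
The create-then-poll pattern above recurs throughout this notebook. One way to factor it out is sketched below. The helper name `wait_for_status` and its parameters are assumptions made for illustration, not part of boto3. Unlike the loop above, which exits silently when the three hours run out, this sketch raises a `TimeoutError` so that a stalled resource is not mistaken for a finished one.

```python
import time


def wait_for_status(get_status, done_statuses=("ACTIVE", "CREATE FAILED"),
                    poll_seconds=15, timeout_seconds=3 * 60 * 60, sleep=time.sleep):
    """Call get_status() until it returns a terminal status or time runs out."""
    deadline = time.time() + timeout_seconds
    while True:
        status = get_status()
        print(f"Status: {status}")
        if status in done_statuses:
            return status
        if time.time() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Simulated status sequence so the sketch runs without calling AWS.
fake_statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
```

For the dataset group, `get_status` could be `lambda: personalize.describe_dataset_group(datasetGroupArn=dataset_group_arn)["datasetGroup"]["status"]`.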


### Define the three Datasets in Personalize
Next we will create the datasets in Personalize for our three dataset types.

#### Create Users Dataset

In [14]:
try:
    dataset_type = "USERS"
    users_dataset_name = "retaildemostore-products-users-ds-"+token
    create_dataset_response = personalize.create_dataset(
        name = users_dataset_name,
        datasetType = dataset_type,
        datasetGroupArn = dataset_group_arn,
        schemaArn = users_schema_arn
    )

    users_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset, seemingly')
    paginator = personalize.get_paginator('list_datasets')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for dataset in paginate_result['datasets']:
            if dataset['name'] == users_dataset_name:
                users_dataset_arn = dataset['datasetArn']
                break
                
print(f'Users dataset ARN = {users_dataset_arn}')

{
  "datasetArn": "arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85652/USERS",
  "ResponseMetadata": {
    "RequestId": "fddd0bc1-265a-41a7-ad52-0bb1e4e3b402",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 18:37:02 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "108",
      "connection": "keep-alive",
      "x-amzn-requestid": "fddd0bc1-265a-41a7-ad52-0bb1e4e3b402"
    },
    "RetryAttempts": 0
  }
}
Users dataset ARN = arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85652/USERS


#### Create Items Dataset

In [15]:
try:
    dataset_type = "ITEMS"
    items_dataset_name = "retaildemostore-products-items-ds-"+token
    create_dataset_response = personalize.create_dataset(
        name = items_dataset_name,
        datasetType = dataset_type,
        datasetGroupArn = dataset_group_arn,
        schemaArn = items_schema_arn
    )

    items_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset, seemingly')
    paginator = personalize.get_paginator('list_datasets')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for dataset in paginate_result['datasets']:
            if dataset['name'] == items_dataset_name:
                items_dataset_arn = dataset['datasetArn']
                break
                
print(f'Items dataset ARN = {items_dataset_arn}')

{
  "datasetArn": "arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85652/ITEMS",
  "ResponseMetadata": {
    "RequestId": "730a43fb-0ed2-4b6b-9de0-8dda72637055",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 18:37:02 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "108",
      "connection": "keep-alive",
      "x-amzn-requestid": "730a43fb-0ed2-4b6b-9de0-8dda72637055"
    },
    "RetryAttempts": 0
  }
}
Items dataset ARN = arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85652/ITEMS


#### Create Interactions Dataset

In [16]:
try:
    dataset_type = "INTERACTIONS"
    interactions_dataset_name = "retaildemostore-products-interactions-ds-"+token
    create_dataset_response = personalize.create_dataset(
        name = interactions_dataset_name,
        datasetType = dataset_type,
        datasetGroupArn = dataset_group_arn,
        schemaArn = interactions_schema_arn
    )

    interactions_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset, seemingly')
    paginator = personalize.get_paginator('list_datasets')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for dataset in paginate_result['datasets']:
            if dataset['name'] == interactions_dataset_name:
                interactions_dataset_arn = dataset['datasetArn']
                break
                
print(f'Interactions dataset ARN = {interactions_dataset_arn}')

{
  "datasetArn": "arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85652/INTERACTIONS",
  "ResponseMetadata": {
    "RequestId": "88365159-5804-4b11-90bb-d60b6b93e308",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 18:37:02 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "115",
      "connection": "keep-alive",
      "x-amzn-requestid": "88365159-5804-4b11-90bb-d60b6b93e308"
    },
    "RetryAttempts": 0
  }
}
Interactions dataset ARN = arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85652/INTERACTIONS


### Wait for datasets to become active

It can take a minute for the datasets to be created. Let's wait for all three to become active.

In [17]:
%%time

dataset_arns = [ items_dataset_arn, users_dataset_arn, interactions_dataset_arn ]

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    for dataset_arn in reversed(dataset_arns):
        response = personalize.describe_dataset(
            datasetArn = dataset_arn
        )
        status = response["dataset"]["status"]

        if status == "ACTIVE":
            print(f'Dataset {dataset_arn} successfully completed')
            dataset_arns.remove(dataset_arn)
        elif status == "CREATE FAILED":
            print(f'Dataset {dataset_arn} failed')
            if response['dataset'].get('failureReason'):
                print('   Reason: ' + response['dataset']['failureReason'])
            dataset_arns.remove(dataset_arn)

    if len(dataset_arns) > 0:
        print('At least one dataset is still in progress')
        time.sleep(15)
    else:
        print("All datasets have completed")
        break

At least one dataset is still in progress
Dataset arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85652/INTERACTIONS successfully completed
Dataset arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85652/USERS successfully completed
Dataset arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85652/ITEMS successfully completed
All datasets have completed
CPU times: user 20.8 ms, sys: 122 µs, total: 20.9 ms
Wall time: 15.2 s
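
The cell above removes entries from `dataset_arns` while iterating over it, which is only safe because the iteration runs in reverse. A sketch of an alternative that rebuilds the pending list on each pass and returns the final status of every resource follows. The name `wait_for_all` and the example ARNs are hypothetical, and the statuses are simulated so that the code runs without AWS access.

```python
import time


def wait_for_all(arns, get_status, poll_seconds=15, timeout_seconds=3 * 60 * 60,
                 sleep=time.sleep):
    """Poll every ARN until each is ACTIVE or CREATE FAILED.

    Returns a dict mapping each finished ARN to its final status. ARNs that
    are still pending when the timeout expires are left out of the dict.
    """
    pending = list(arns)
    finished = {}
    deadline = time.time() + timeout_seconds
    while pending and time.time() < deadline:
        still_pending = []
        for arn in pending:
            status = get_status(arn)
            if status in ("ACTIVE", "CREATE FAILED"):
                finished[arn] = status
            else:
                still_pending.append(arn)
        pending = still_pending
        if pending:
            sleep(poll_seconds)
    return finished


# Scripted statuses per (made-up) ARN, consumed one poll at a time.
scripted = {
    "arn:example:dataset/USERS": iter(["CREATE PENDING", "ACTIVE"]),
    "arn:example:dataset/ITEMS": iter(["ACTIVE"]),
    "arn:example:dataset/INTERACTIONS": iter(["CREATE PENDING", "CREATE FAILED"]),
}
results = wait_for_all(list(scripted), lambda arn: next(scripted[arn]),
                       sleep=lambda seconds: None)
print(results)
```

With real datasets, `get_status` could be `lambda arn: personalize.describe_dataset(datasetArn=arn)["dataset"]["status"]`. Comparing the returned dict against the input list shows which resources failed or never finished.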


## Import Datasets to Personalize

So far in this chapter we have created schemas in Personalize that define the columns in our CSVs. Then we created a dataset group and defined three datasets in Personalize that will receive our data. In the following steps we will create import jobs that load the data from our S3 bucket into those datasets.


### Create Import Jobs

With the permissions in place to allow Personalize to access our CSV files, let's create three import jobs to import each file into its respective dataset. Each import job can take roughly 10 minutes to complete so we'll create all three import jobs and then wait for them all to complete. This allows them to import in parallel.
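
The three import jobs differ only in their name, dataset ARN, and source file. As a sketch, the request parameters can be built in one place and then passed to `create_dataset_import_job`. The helper name `build_import_job_requests`, the bucket, the ARNs, and the CSV filenames below are placeholders chosen for illustration; in this notebook the real values come from earlier cells.

```python
def build_import_job_requests(bucket, datasets, role_arn, suffix):
    """Return one create_dataset_import_job keyword-argument dict per dataset.

    `datasets` maps a short label to a (dataset ARN, CSV filename) pair.
    """
    return [
        {
            "jobName": f"import-job-{label}-{suffix}",
            "datasetArn": dataset_arn,
            "dataSource": {"dataLocation": f"s3://{bucket}/{filename}"},
            "roleArn": role_arn,
        }
        for label, (dataset_arn, filename) in datasets.items()
    ]


# Placeholder values; in the notebook these come from earlier cells.
example_requests = build_import_job_requests(
    bucket="example-bucket",
    datasets={
        "users": ("arn:example:dataset/USERS", "users.csv"),
        "items": ("arn:example:dataset/ITEMS", "items.csv"),
        "interactions": ("arn:example:dataset/INTERACTIONS", "interactions.csv"),
    },
    role_arn="arn:example:role/personalize-role",
    suffix="abc12345",
)
for request in example_requests:
    print(request["jobName"], "->", request["dataSource"]["dataLocation"])
```

Each dict can then be submitted with `personalize.create_dataset_import_job(**request)`. Because the call returns as soon as the job is accepted, submitting all three before waiting lets them run in parallel.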

#### Create Users Dataset Import Job

In [18]:
import_job_suffix = str(uuid.uuid4())[:8]

users_create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "import-job-users-" + import_job_suffix,
    datasetArn = users_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, users_filename)
    },
    roleArn = role_arn
)

users_dataset_import_job_arn = users_create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(users_create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:402114309305:dataset-import-job/import-job-users-95f739a1",
  "ResponseMetadata": {
    "RequestId": "9d35b449-b3e7-47b9-83a8-c5ee1524f91c",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 18:37:18 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "113",
      "connection": "keep-alive",
      "x-amzn-requestid": "9d35b449-b3e7-47b9-83a8-c5ee1524f91c"
    },
    "RetryAttempts": 0
  }
}


#### Create Items Dataset Import Job

In [19]:

items_create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "import-job-items-" + import_job_suffix,
    datasetArn = items_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, items_basic_filename)
    },
    roleArn = role_arn
)

items_dataset_import_job_arn = items_create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(items_create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:402114309305:dataset-import-job/import-job-items-95f739a1",
  "ResponseMetadata": {
    "RequestId": "25411ad7-ebb1-43e6-9be5-bc7c65d8b712",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 18:37:18 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "113",
      "connection": "keep-alive",
      "x-amzn-requestid": "25411ad7-ebb1-43e6-9be5-bc7c65d8b712"
    },
    "RetryAttempts": 0
  }
}


#### Create Interactions Dataset Import Job

In [20]:
interactions_create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "import-job-interactions-" + import_job_suffix,
    datasetArn = interactions_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, interactions_filename)
    },
    roleArn = role_arn
)

interactions_dataset_import_job_arn = interactions_create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(interactions_create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:402114309305:dataset-import-job/import-job-interactions-95f739a1",
  "ResponseMetadata": {
    "RequestId": "c273d7d8-3bd7-4e1f-ad21-b4e5c87a04c1",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 18:37:18 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "120",
      "connection": "keep-alive",
      "x-amzn-requestid": "c273d7d8-3bd7-4e1f-ad21-b4e5c87a04c1"
    },
    "RetryAttempts": 0
  }
}


### Wait for Import Jobs to Complete

It can take up to 10 minutes for the import jobs to complete. While you're waiting, you can learn more about Datasets and Schemas here: https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html

We will wait for all three import jobs to finish.

#### Wait for All Import Jobs to Complete

In [21]:
%%time

import_job_arns = [ users_dataset_import_job_arn, items_dataset_import_job_arn, interactions_dataset_import_job_arn ]

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    for job_arn in reversed(import_job_arns):
        import_job_response = personalize.describe_dataset_import_job(
            datasetImportJobArn = job_arn
        )
        status = import_job_response["datasetImportJob"]['status']

        if status == "ACTIVE":
            print(f'Import job {job_arn} successfully completed')
            import_job_arns.remove(job_arn)
        elif status == "CREATE FAILED":
            print(f'Import job {job_arn} failed')
            if import_job_response["datasetImportJob"].get('failureReason'):
                print('   Reason: ' + import_job_response["datasetImportJob"]['failureReason'])
            import_job_arns.remove(job_arn)

    if len(import_job_arns) > 0:
        print('At least one dataset import job still in progress')
        time.sleep(60)
    else:
        print("All import jobs have ended")
        break

At least one dataset import job still in progress
At least one dataset import job still in progress
At least one dataset import job still in progress
At least one dataset import job still in progress
Import job arn:aws:personalize:us-east-1:402114309305:dataset-import-job/import-job-interactions-95f739a1 successfully completed
At least one dataset import job still in progress
At least one dataset import job still in progress
Import job arn:aws:personalize:us-east-1:402114309305:dataset-import-job/import-job-items-95f739a1 successfully completed
Import job arn:aws:personalize:us-east-1:402114309305:dataset-import-job/import-job-users-95f739a1 successfully completed
All import jobs have ended
CPU times: user 190 ms, sys: 8.1 ms, total: 198 ms
Wall time: 6min


## Chapter 3 Summary - What have we accomplished?

In this chapter we created schemas in Amazon Personalize that mapped to the dataset CSVs we introduced in chapter 2. We also created a dataset group in Personalize as well as Datasets to represent our CSVs. Finally, we created dataset import jobs in Personalize to load the three datasets into Personalize.

In the next chapter we will create a custom solution and train a solution version. This is where the machine learning model is trained.

# Chapter 4: Create a custom solution in Amazon Personalize

In this chapter we are going to create a Solution in Amazon Personalize. A Solution consists of a Personalize Recipe (an algorithm), parameters, and all of its Solution Versions (i.e., trained models).

## Chapter 4 Objectives

In this chapter we will accomplish the following steps.

- Create custom solution and solution version for the following use case:
    - **Item Attribute Affinity**: user segmentation model that recommends users for item categories/attributes.

This portion should take about 60 minutes to complete. However, most of that time is spent waiting for the model training job to complete. While we are waiting, this notebook will review a way to confirm the columns that are being used for training.

## Create a Custom Solution

With our three datasets imported into our dataset group, we can now turn to creating solutions. 

We simply need to create a solution and solution version using the item-attribute-affinity recipe.

#### Overview of the recipe

[Item-Attribute-Affinity:](https://docs.aws.amazon.com/personalize/latest/dg/item-attribute-affinity-recipe.html)
> The Item-Attribute-Affinity (aws-item-attribute-affinity) recipe is a USER_SEGMENTATION recipe that creates a user segment (group of users) for each item attribute that you specify. Use Item-Attribute-Affinity to learn more about your users and take actions based on their respective user segments.

> For example, you might want to create a marketing campaign for your retail application based on user preferences for shoe types in your catalog. Item-Attribute-Affinity would create a user segment for each shoe type based data in your Interactions and Items datasets. You could use this to promote different shoes to different user segments based on the likelihood that they will take an action (for example, click a shoe or purchase a shoe). Other uses might include promoting different movie genres to different users or identifying prospective job applicant based on job type.


Note: This demonstration uses only one recipe, but many more are available. If you are interested, you can visit the official documentation to read more about all the [predefined recipes](https://docs.aws.amazon.com/personalize/latest/dg/working-with-predefined-recipes.html) Amazon Personalize has to offer.


In [22]:
# This is a User Segmentation Recipe
item_attribute_affinity_recipe_arn = 'arn:aws:personalize:::recipe/aws-item-attribute-affinity'


### Create Custom Solution and Solution Version

With our recipe defined, we can now create our solution and solution version.

The code cell below creates a solution using the item-attribute-affinity recipe and our dataset group that we created in the previous chapter.

In [23]:
solution_version_arn = None
solution_arn = None
solution_name = "my-original-solution-"+token


try:
    create_solution_response = personalize.create_solution(
        name = solution_name,
        datasetGroupArn = dataset_group_arn,
        recipeArn = item_attribute_affinity_recipe_arn
    )

    solution_arn = create_solution_response['solutionArn']
    print(json.dumps(create_solution_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this solution, seemingly')
    paginator = personalize.get_paginator('list_solutions')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for solution in paginate_result['solutions']:
            if solution['name'] == solution_name:
                solution_arn = solution['solutionArn']
                print(f'Item Attribute Affinity solution ARN = {solution_arn}')
                
                response = personalize.list_solution_versions(
                    solutionArn = solution_arn,
                    maxResults = 100
                )
                if len(response['solutionVersions']) > 0:
                    solution_version_arn = response['solutionVersions'][-1]['solutionVersionArn']
                    print(f'Will use most recent solution version for this solution: {solution_version_arn}')
                    
                break

{
  "solutionArn": "arn:aws:personalize:us-east-1:402114309305:solution/my-original-solution-85652",
  "ResponseMetadata": {
    "RequestId": "dbf600ff-feb2-4cef-97ce-a0633bee738e",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 18:51:05 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "96",
      "connection": "keep-alive",
      "x-amzn-requestid": "dbf600ff-feb2-4cef-97ce-a0633bee738e"
    },
    "RetryAttempts": 0
  }
}
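
When the solution already exists, the cell above reuses the last entry in the `solutionVersions` list, which relies on the list being returned oldest first. A sketch that does not depend on the ordering, and that can optionally skip versions that are not `ACTIVE`, is shown below. The helper name `latest_solution_version_arn` and the example summaries are hypothetical; the dictionary keys are assumed to match the summaries returned by `list_solution_versions`.

```python
from datetime import datetime, timezone


def latest_solution_version_arn(summaries, required_status=None):
    """Return the ARN of the newest solution version, or None if there is none.

    `summaries` is a list shaped like the `solutionVersions` entries that
    `list_solution_versions` returns.
    """
    candidates = [
        summary for summary in summaries
        if required_status is None or summary.get("status") == required_status
    ]
    if not candidates:
        return None
    newest = max(candidates, key=lambda summary: summary["creationDateTime"])
    return newest["solutionVersionArn"]


# Made-up summaries, deliberately listed newest first.
example_summaries = [
    {"solutionVersionArn": "arn:example:solution/demo/v3", "status": "CREATE FAILED",
     "creationDateTime": datetime(2023, 10, 29, tzinfo=timezone.utc)},
    {"solutionVersionArn": "arn:example:solution/demo/v2", "status": "ACTIVE",
     "creationDateTime": datetime(2023, 10, 28, tzinfo=timezone.utc)},
    {"solutionVersionArn": "arn:example:solution/demo/v1", "status": "ACTIVE",
     "creationDateTime": datetime(2023, 10, 27, tzinfo=timezone.utc)},
]
newest_any = latest_solution_version_arn(example_summaries)
newest_active = latest_solution_version_arn(example_summaries, required_status="ACTIVE")
print(newest_any)
print(newest_active)
```

Filtering on `ACTIVE` matters for the batch job in the next chapter, which needs a solution version that finished training successfully.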


#### Create Item Attribute Affinity Solution Version
Next we can create a solution version for the solution. This is where the model is trained for this custom solution.

In [24]:
if not solution_version_arn:
    create_solution_version_response = personalize.create_solution_version(
        solutionArn = solution_arn
    )

    solution_version_arn = create_solution_version_response['solutionVersionArn']
    print(json.dumps(create_solution_version_response, indent=2))
else:
    print(f'Solution version {solution_version_arn} already exists; not creating')

{
  "solutionVersionArn": "arn:aws:personalize:us-east-1:402114309305:solution/my-original-solution-85652/ad6aea89",
  "ResponseMetadata": {
    "RequestId": "099c3ef4-a9c0-4025-84ef-60fb333deae9",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 18:51:05 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "112",
      "connection": "keep-alive",
      "x-amzn-requestid": "099c3ef4-a9c0-4025-84ef-60fb333deae9"
    },
    "RetryAttempts": 0
  }
}


### Confirm the training columns

While the solution version is being trained in the background, let's double check that this solution will be trained on the columns of the schema we created earlier. 

The most reliable method of making sure that the intended columns are being used by the solution requires checking via the Amazon Personalize service page within the AWS Management Console. Visit the `Solutions and recipes` menu option under the `Custom Resources` tab. Select the solution that we just created. Click the `Solution Configuration` dropdown option, and you will be able to see the `Columns for training` that are used by your solution. You will notice that the Item dataset columns used for training were: `ITEM_ID`, `PRICE`, and `CATEGORY_L1`.

You may also want to confirm the schema of the Items dataset. The code cell below does that programmatically. Note, however, that the columns in a schema are not necessarily the columns that a solution uses for training. For example, when creating a solution, you can optionally *exclude* columns in the schema from training. Since we are not explicitly excluding any columns, the columns shown in the schema are the ones our Personalize solution will be trained on.


In [25]:
confirm_schema = personalize.describe_schema(schemaArn=items_schema_arn)

print(json.dumps(confirm_schema['schema'], indent=2, default=str))

{
  "name": "items-schema-85652",
  "schemaArn": "arn:aws:personalize:us-east-1:402114309305:schema/items-schema-85652",
  "schema": "{\"type\": \"record\", \"name\": \"Items\", \"namespace\": \"com.amazonaws.personalize.schema\", \"fields\": [{\"name\": \"ITEM_ID\", \"type\": \"string\"}, {\"name\": \"PRICE\", \"type\": \"float\"}, {\"name\": \"CATEGORY_L1\", \"type\": \"string\", \"categorical\": true}], \"version\": \"1.0\"}",
  "creationDateTime": "2023-10-27 18:36:32.456000+00:00",
  "lastUpdatedDateTime": "2023-10-27 18:36:32.456000+00:00",
  "domain": "ECOMMERCE"
}
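
The `schema` value in the output above is itself a JSON string, which makes the column names hard to read. A small sketch that parses it and lists the fields follows. The helper names `schema_field_names` and `training_columns` are hypothetical, and the `excluded` argument is only a stand-in for whatever columns you might choose to exclude when creating a solution. The schema string is rebuilt from the output shown above.

```python
import json


def schema_field_names(schema_json):
    """Return the field names declared in a Personalize schema JSON string."""
    return [field["name"] for field in json.loads(schema_json)["fields"]]


def training_columns(schema_json, excluded=()):
    """Return the schema fields that remain after removing excluded columns."""
    excluded = set(excluded)
    return [name for name in schema_field_names(schema_json) if name not in excluded]


# The Items schema from the describe_schema output, as a JSON string.
items_schema_json = json.dumps({
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "ITEM_ID", "type": "string"},
        {"name": "PRICE", "type": "float"},
        {"name": "CATEGORY_L1", "type": "string", "categorical": True},
    ],
    "version": "1.0",
})
all_columns = schema_field_names(items_schema_json)
without_price = training_columns(items_schema_json, excluded=["PRICE"])
print(all_columns)
print(without_price)
```

In this notebook the input would be `confirm_schema['schema']['schema']`. With nothing excluded, the result should match the `Columns for training` shown in the console.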


## Wait for Solution Versions to Complete

It can take roughly 40 minutes for the solution version to be created. During this process a model is being trained and tested with the data contained within your datasets. The duration of a training job can increase based on the size of the dataset, the training parameters, and the selected recipe. In the cell below we will wait for the solution version to finish.

While you are waiting for this process to complete you can learn more about [custom solutions](https://docs.aws.amazon.com/personalize/latest/dg/training-deploying-solutions.html).

Additionally, though they are out of scope of this notebook, here you can read about [Personalize Recommenders](https://docs.aws.amazon.com/personalize/latest/dg/creating-recommenders.html). Recommenders are required for real-time jobs.


#### Wait for the custom solution version to become active

The following cell waits for the solution version for the item attribute affinity use case to become active. We *need* to make sure it is active before proceeding to the next Chapter.

In [26]:
%%time

solution_version_arn

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    soln_ver_response = personalize.describe_solution_version(
        solutionVersionArn = solution_version_arn
    )
    status = soln_ver_response["solutionVersion"]["status"]

    if status == "ACTIVE":
        print(f'Solution version {solution_version_arn} successfully completed')
        print(json.dumps(soln_ver_response, indent=2, default=str))
        print("Solution version has completed")
        break
    elif status == "CREATE FAILED":
        print(f'Solution version {solution_version_arn} failed')
        if soln_ver_response["solutionVersion"].get('failureReason'):
            print('   Reason: ' + soln_ver_response["solutionVersion"]['failureReason'])
        break
    else:
        print('Solution version is still in progress')
        time.sleep(60)    
     

Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution ver

## Chapter 4 Summary - What have we accomplished?

In this chapter we created a solution version for the item attribute affinity use case. We also confirmed, via the console, that the solution was only using the Items columns: `ITEM_ID`, `PRICE`, and `CATEGORY_L1`.

In the next chapter we will run a batch segmentation job on this solution version.

# Chapter 5: Run A Batch Segmentation Job on your Solution Version

In this chapter, we will prepare and execute a batch job for the solution version that we created previously. The purpose of doing this is to obtain a baseline that we can compare to the output of the solution version trained on the updated schema in the latter portion of this notebook.

The batch job for the item attribute affinity solution will return an audience of users who have an affinity for different product categories.

We will wait for the job to finish executing. Afterwards, we'll inspect the outputs.

This chapter will take about 20 minutes.

### Set the input and output sizes for our job

You can set the size of the job input by changing the value of the variable `x` in the following code cell.

A default value of 5 has been pre-populated for you. With `x` equal to 5, we will randomly select 5 `CATEGORY_L1` values as input for the batch segment job, which produces one user segment per category.

Similarly, you can set the size of the output by changing the value of the variable `y`. A default value of 10 has been pre-populated for you. This means the job will return the top 10 users for each segment.

You can decrease or increase the values of `x` and `y` if you want. Just be aware that larger values mean the batch job will take slightly longer to complete.


In [27]:
# TODO: Fill in the size of the input and output variables
x = 5
y = 10

## Prepare input file for batch segment job

Next we will prepare a user segmentation input file by randomly selecting product categories from the catalog.

First, let's consider the format of the job input file. Below is a sample input file for an item attribute affinity job in a Video On Demand application. It builds three user segments: users interested in both comedies and action movies, users interested in comedies, and users interested in both horror and action movies:

```javascript
{"itemAttributes": "ITEMS.genres = \"Comedy\" AND ITEMS.genres = \"Action\""}
{"itemAttributes": "ITEMS.genres = \"Comedy\""}
{"itemAttributes": "ITEMS.genres = \"Horror\" AND ITEMS.genres = \"Action\""}
```

For our job, we will select a few categories and use each category name as the item attribute we want to use to group users.
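
Each line of the input file must be a complete JSON object, and the double quotes around the attribute value must be escaped. Rather than escaping by hand, a sketch like the one below lets `json.dumps` handle it. The helper name `build_segment_query` is hypothetical, and the example category values are taken from this dataset.

```python
import json


def build_segment_query(field, value):
    """Return one input line asking for users with an affinity for field = value."""
    expression = f'ITEMS.{field} = "{value}"'
    # json.dumps escapes the inner double quotes and keeps the record on one line.
    return json.dumps({"itemAttributes": expression})


example_categories = ["tools", "food service"]
lines = [build_segment_query("CATEGORY_L1", category) for category in example_categories]
print("\n".join(lines))
```

Parsing a line back with `json.loads` is a quick way to confirm that the file is well formed before uploading it to S3.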

#### Here are the steps required to run this job:
- Randomly select x attributes (used as inputs for our batch job)
- Generate the input file
- Upload it to S3
- Create and start a Batch Segment Job using our Solution Version
- Wait for the Segment Job to complete
- Download the Segment job output from S3
- Inspect the Segment Job

The code cell below selects the attributes we want to generate customer segments for, creates the input file for the batch job, and uploads it to S3.

In [28]:
categories = items_basic_df['CATEGORY_L1'].unique()
print("Available categories:")
print(categories)

# Randomly select x categories
sample_categories = numpy.random.choice(categories, x, False)
print("Randomly selected categories:")
print(sample_categories)

# Generate the input file
job_input_filename = 'job_input.json'
with open(job_input_filename, 'w') as json_input:
    for category in sample_categories:
        # Write line that specifies the query for users with an affinity for the CATEGORY_L1 field
        json_input.write(f'{{"itemAttributes": "ITEMS.CATEGORY_L1 = \\"{category}\\""}}\n')

# Confirm the file matches the required format:
# One very important characteristic of the job input file is that the itemAttributes query expression 
# for each segment must be fully defined in a single line.
print("\nPreviewing input file... ")
!head -n 5 $job_input_filename
print('\n')


# Upload job input file to S3
s3_input_key = "batch-job-inputs/item-attribute-affinity-job/" + job_input_filename
s3.upload_file(job_input_filename, bucket_name, s3_input_key)

if s3_input_key in [obj['Key'] for obj in s3.list_objects(Bucket=bucket_name)['Contents']]:
    print('File was uploaded successfully!')
else:
    print('File was not uploaded!')


# We need to define an input location in our s3 bucket where the segmentation job gets its input, 
# and an output location where the segmentation job writes its output.
s3_input_path = "s3://" + bucket_name + "/" + s3_input_key
s3_output_path = "s3://" + bucket_name + "/batch-job-outputs/item-attribute-affinity-job/"



Available categories:
['accessories' 'apparel' 'beauty' 'books' 'electronics' 'floral'
 'footwear' 'furniture' 'groceries' 'homedecor' 'housewares' 'instruments'
 'jewelry' 'outdoors' 'seasonal' 'tools' 'food service' 'cold dispensed'
 'salty snacks' 'hot dispensed']
Randomly selected categories:
['tools' 'housewares' 'floral' 'food service' 'furniture']

Previewing input file... 
{"itemAttributes": "ITEMS.CATEGORY_L1 = \"tools\""}
{"itemAttributes": "ITEMS.CATEGORY_L1 = \"housewares\""}
{"itemAttributes": "ITEMS.CATEGORY_L1 = \"floral\""}
{"itemAttributes": "ITEMS.CATEGORY_L1 = \"food service\""}
{"itemAttributes": "ITEMS.CATEGORY_L1 = \"furniture\""}


File was uploaded successfully!
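
The cell above samples categories without replacement, so `numpy.random.choice` raises a `ValueError` if `x` is larger than the number of distinct categories. A minimal sketch that caps the sample size instead is shown below. It uses the standard library `random` module rather than NumPy, and the helper name `pick_categories` and the example category list are hypothetical.

```python
import random


def pick_categories(categories, x, seed=None):
    """Pick up to x distinct categories without replacement."""
    unique = sorted(set(categories))
    # Never ask for more categories than exist.
    count = min(x, len(unique))
    return random.Random(seed).sample(unique, count)


example_categories = ["tools", "floral", "books", "tools"]
picked = pick_categories(example_categories, 5, seed=0)
print(picked)
```

Passing a `seed` makes the selection repeatable, which helps when comparing this baseline job with the job run after the schema update.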


## Create batch segment job

Finally, we're ready to create a batch segment job. There are several required parameters including a name for the job, the solution version ARN for the item attribute affinity model, the IAM role that Personalize needs to be able to access the job input file and write the output file, and the job input and output locations. These parameters are required inputs for all batch jobs in Amazon Personalize.

We're also optionally specifying that we only want the top `y` users in each user segment.

The user segmentation job can take several minutes to complete. Even though our input file only specifies a few input lines, there is a certain amount of fixed overhead required for Personalize to spin up the compute resources needed to execute the job. This overhead is amortized for larger input files that generate many user segments.


In [29]:
# Create and start a Batch Segment Job using our latest Solution Version
response = personalize.create_batch_segment_job (
    solutionVersionArn = solution_version_arn,
    jobName = "retaildemostore-item-attribute-affinity-job-" + token,
    roleArn = role_arn,
    jobInput = {"s3DataSource": {"path": s3_input_path}},
    jobOutput = {"s3DataDestination":{"path": s3_output_path}},
    numResults = y
)
item_attribute_affinity_job_arn = response['batchSegmentJobArn']
print(json.dumps(response, indent=2, default=str))

{
  "batchSegmentJobArn": "arn:aws:personalize:us-east-1:402114309305:batch-segment-job/retaildemostore-item-attribute-affinity-job-85652",
  "ResponseMetadata": {
    "RequestId": "80b67830-c788-400c-b4ff-2666b8d8a5ea",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 19:32:50 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "135",
      "connection": "keep-alive",
      "x-amzn-requestid": "80b67830-c788-400c-b4ff-2666b8d8a5ea"
    },
    "RetryAttempts": 0
  }
}


### Wait for the Item Attribute Affinity Job to complete and inspect its output

Once the batch segment job has been created, run the cell below. It polls the Item Attribute Affinity job until it finishes, downloads the output file, and displays its first few lines.

In [30]:
%%time
current_time = datetime.now()
print("\nImport Started on: ", current_time.strftime("%I:%M:%S %p"))

max_time = time.time() + 3*60*60 # 3 hours

while time.time() < max_time:
    response = personalize.describe_batch_segment_job(
        batchSegmentJobArn = item_attribute_affinity_job_arn
    )
    status = response["batchSegmentJob"]['status']
    print("DatasetSegmentJob: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)
    
current_time = datetime.now()
print("Import Completed on: ", current_time.strftime("%I:%M:%S %p"))

job_output_file = job_input_filename + ".out"
export_name = 'batch-job-outputs/item-attribute-affinity-job/' + job_output_file
s3.download_file(bucket_name, export_name, job_output_file)

!head -n 5 $job_output_file


Import Started on:  07:32:50 PM
DatasetSegmentJob: CREATE PENDING
DatasetSegmentJob: CREATE IN_PROGRESS
DatasetSegmentJob: CREATE IN_PROGRESS
DatasetSegmentJob: CREATE IN_PROGRESS
DatasetSegmentJob: CREATE IN_PROGRESS
DatasetSegmentJob: CREATE IN_PROGRESS
DatasetSegmentJob: CREATE IN_PROGRESS
DatasetSegmentJob: CREATE IN_PROGRESS
DatasetSegmentJob: ACTIVE
Import Completed on:  07:40:51 PM
{"input": {"itemAttributes": "ITEMS.CATEGORY_L1 = \"tools\""}, "output": {"usersList": ["2131","2504","4680","4712","3965","3867","517","321","1177","2056"]}, "error": null}
{"input": {"itemAttributes": "ITEMS.CATEGORY_L1 = \"housewares\""}, "output": {"usersList": ["3976","1018","4503","673","1656","456","3609","4288","3948","1135"]}, "error": null}
{"input": {"itemAttributes": "ITEMS.CATEGORY_L1 = \"floral\""}, "output": {"usersList": ["1928","68","161","2123","1691","3167","5035","142","4509","3254"]}, "error": null}
{"input": {"itemAttributes": "ITEMS.CATEGORY_L1 = \"food service\""}, "output": {

Notice that the input item attribute query expressions are echoed in the output file, and that each user segment also has `output` and `error` elements. The `output` element has a `usersList` array that contains the user IDs for the segment. If any errors were encountered while generating a segment, the details are included in that segment's `error` element.
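To make the output format concrete, here is a minimal sketch that parses the JSON Lines output into a dictionary keyed by the query expression. `parse_segment_output` is a hypothetical helper name, and the handling of a missing or null `output` element for failed segments is an assumption. The sample line is copied from the output above.

```python
import json

def parse_segment_output(lines):
    """Map each input query expression to (user_ids, error) from JSON Lines output."""
    segments = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        query = record["input"]["itemAttributes"]
        # Assumption: "output" may be missing or null when a segment failed
        users = (record.get("output") or {}).get("usersList", [])
        segments[query] = (users, record.get("error"))
    return segments

# One line copied from the job output shown above
sample = [
    r'{"input": {"itemAttributes": "ITEMS.CATEGORY_L1 = \"tools\""}, "output": {"usersList": ["2131","2504","4680","4712","3965","3867","517","321","1177","2056"]}, "error": null}',
]

for query, (users, error) in parse_segment_output(sample).items():
    print(f"{query}: {len(users)} users, error={error}")
```

In the notebook, the same function could be called on an open file handle for `job_output_file`, since iterating over a file yields one line at a time.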

### Chapter 5 Summary:

In this chapter, we ran a batch segmentation job for the item attribute affinity solution.

We waited for the inference job to finish and then inspected its outputs.

#### Next Steps (out of scope of this notebook)
Now that we have user segments created, what can we do with them? The most obvious choice is to use these outputs in outbound marketing tools. 

For example, you can build a promotion around a particular product category and target the users who are most likely to be interested in it. For instance, you may want to send "football" related marketing communications during Super Bowl week. The item attribute affinity model suits this kind of targeting.


## Chapter 5 complete

Congratulations! You have completed the batch segmentation portion of the demonstration!

## Chapter 6: Overview of Steps Involved for Changing Item Schemas

Suppose we want to start collecting additional data for our items dataset. Fortunately, Amazon Personalize supports updating item schemas to add new columns.

In fact, recently, [Amazon Personalize made it easier to add columns to your existing datasets](https://aws.amazon.com/about-aws/whats-new/2023/07/amazon-personalize-add-columns-existing-datasets/). 

> To add a column to an existing dataset, simply select your dataset in the Personalize console and click “Replace Schema.” Add the new columns and import new data. The new columns will then be available for filtering. To use the new columns for training, simply clone your existing Solution, select the new columns for training, and train a new Solution version.

This chapter will guide you through the steps involved in programmatically updating the schema of your items dataset.
This chapter will take about 60 minutes, but most of that time will be spent waiting for data import jobs and solution version training.

### Here is a list of steps required to change the schema of your items:
- Step 1: Create a new schema that includes the new columns (all new fields must support 'null' data)
- Step 2: Update the Items Dataset in Amazon Personalize
- Step 3: Import the New Data (10 minutes)
- Step 4: Create a new Solution for the Updated Data
- Step 5: Create a new Solution Version (40 minutes)

Once we perform these steps, we will then run the same batch job on the new solution version and see the new output (Chapter 7).
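Most of the time in the steps above is spent polling a resource until it becomes `ACTIVE` or fails. As a sketch of that shared pattern, the helper below polls any status-returning function. `wait_for_status` and its parameters are hypothetical names (not part of boto3), and treating every status that ends in `FAILED` as terminal is an assumption.

```python
import time

def wait_for_status(get_status, timeout_s=3600, poll_s=60, sleep=time.sleep):
    """Poll get_status() until it returns ACTIVE or a *FAILED status.

    Raises TimeoutError if neither happens within timeout_s seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        # Assumption: any status ending in "FAILED" is terminal
        if status == "ACTIVE" or status.endswith("FAILED"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout_s} seconds")
        sleep(poll_s)

# Hypothetical usage with this notebook's client and variables:
# wait_for_status(
#     lambda: personalize.describe_dataset(datasetArn=items_dataset_arn)["dataset"]["status"],
#     poll_s=15,
# )
```

Passing `sleep` as a parameter is a design choice that keeps the helper easy to test without real waiting.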

Step 1: Create a new schema that reflects the schema of the updated dataset


In [31]:
# Note: All new fields must support 'null' data. See more here: https://docs.aws.amazon.com/personalize/latest/dg/updating-dataset-schema.html.

updated_items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "PRICE",
            "type": "float"
        },
        {
            "name": "CATEGORY_L1",
            "type": "string",
            "categorical": True,
        },
        {
            "name": "CATEGORY_L2",
            "type": ["string", "null"],
            "categorical": True,
        },
        {
            "name": "PRODUCT_DESCRIPTION",
            "type": ["string", "null"],
            "textual": True
        },
        {
            "name": "GENDER",
            "type": ["string", "null"],
            "categorical": True,
        },
        {
            "name": "PROMOTED",
            "type": ["string", "null"],
        },
    ],
    "version": "1.1"
}

try:
    updated_items_schema_name = 'import-job-updated-items-schema-'+token
    create_schema_response = personalize.create_schema(
        name = updated_items_schema_name,
        domain = 'ECOMMERCE',
        schema = json.dumps(updated_items_schema)
    )
    updated_items_schema_arn = create_schema_response['schemaArn']
    print(json.dumps(create_schema_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this schema, seemingly')
    paginator = personalize.get_paginator('list_schemas')
    for paginate_result in paginator.paginate():
        for schema in paginate_result['schemas']:
            if schema['name'] == updated_items_schema_name:
                updated_items_schema_arn = schema['schemaArn']
                print(f"Using existing schema: {updated_items_schema_arn}")
                break

{
  "schemaArn": "arn:aws:personalize:us-east-1:402114309305:schema/import-job-updated-items-schema-85652",
  "ResponseMetadata": {
    "RequestId": "b77f4cd4-30aa-4a6e-a589-772691f68614",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 19:40:51 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "103",
      "connection": "keep-alive",
      "x-amzn-requestid": "b77f4cd4-30aa-4a6e-a589-772691f68614"
    },
    "RetryAttempts": 0
  }
}
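Because every added field must accept null values, it can help to check a schema before submitting it. The sketch below is one way to do that. `non_nullable_new_fields` is a hypothetical helper, and the set of original field names is an assumption based on the training columns this notebook reports for the original solution (`ITEM_ID`, `PRICE`, `CATEGORY_L1`).

```python
def non_nullable_new_fields(original_fields, updated_schema):
    """Return the names of added fields whose type does not allow null."""
    problems = []
    for field in updated_schema["fields"]:
        if field["name"] in original_fields:
            continue  # fields that already existed are not checked here
        field_type = field["type"]
        # A nullable Avro field is written as a union such as ["string", "null"]
        allows_null = isinstance(field_type, list) and "null" in field_type
        if not allows_null:
            problems.append(field["name"])
    return problems

# Assumed original columns (see lead-in)
original_fields = {"ITEM_ID", "PRICE", "CATEGORY_L1"}

# Illustrative schema fragment; PROMOTED is deliberately wrong here
example_schema = {"fields": [
    {"name": "ITEM_ID", "type": "string"},
    {"name": "CATEGORY_L2", "type": ["string", "null"], "categorical": True},
    {"name": "PROMOTED", "type": "string"},
]}

print(non_nullable_new_fields(original_fields, example_schema))
```

Applied to `updated_items_schema` from the cell above, the function should return an empty list, since each added field is declared as `["string", "null"]`.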


Step 2: Update the Items Dataset in Amazon Personalize


In [32]:
# https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/personalize/client/update_dataset.html
update_dataset_response = personalize.update_dataset(
    datasetArn = items_dataset_arn, # Pass the ARN of the dataset you want to update
    schemaArn = updated_items_schema_arn # Pass the new schema ARN
)

updated_items_dataset_arn = update_dataset_response['datasetArn']
print(f'ARN of the updated dataset: {updated_items_dataset_arn}')


# Wait for the dataset to be updated. It can take a minute.
current_time = datetime.now()
print("\nImport Started on: ", current_time.strftime("%I:%M:%S %p"))

max_time = time.time() + 1*60*60 # 1 hour

while time.time() < max_time:
    dataset_response = personalize.describe_dataset(datasetArn = updated_items_dataset_arn)
    status = dataset_response['dataset']['status']
    print("DatasetSegmentJob: {}".format(status))
    
    # Stop on success or on any failure status (for example, a failed create or update)
    if status == "ACTIVE" or status.endswith("FAILED"):
        print("Updated dataset:\n")
        # Print the full describe_dataset response, which includes details about the update.
        print(json.dumps(dataset_response, indent=2, default=str))
        break
        
    time.sleep(15)
    
current_time = datetime.now()
print("\nImport Completed on: ", current_time.strftime("%I:%M:%S %p"))


ARN of the updated dataset: arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85652/ITEMS

Import Started on:  07:40:51 PM
DatasetSegmentJob: UPDATE PENDING
DatasetSegmentJob: UPDATE PENDING
DatasetSegmentJob: UPDATE IN_PROGRESS
DatasetSegmentJob: UPDATE IN_PROGRESS
DatasetSegmentJob: UPDATE IN_PROGRESS
DatasetSegmentJob: UPDATE IN_PROGRESS
DatasetSegmentJob: UPDATE IN_PROGRESS
DatasetSegmentJob: UPDATE IN_PROGRESS
DatasetSegmentJob: ACTIVE
Updated dataset:

{
  "dataset": {
    "name": "retaildemostore-products-items-ds-85652",
    "datasetArn": "arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85652/ITEMS",
    "datasetGroupArn": "arn:aws:personalize:us-east-1:402114309305:dataset-group/retaildemostore-products-DSG-85652",
    "datasetType": "ITEMS",
    "schemaArn": "arn:aws:personalize:us-east-1:402114309305:schema/import-job-updated-items-schema-85652",
    "status": "ACTIVE",
    "creationDateTime": "2023-10-27 18:37:02

Step 3: Import the new data

After you associate the new schema with your dataset, you must import the new data.
You can use the CreateDatasetImportJob API call to import the data.

In [33]:
# https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/personalize/client/create_dataset_import_job.html
updated_items_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "updated-items-dataset-import-job-"+token,
    datasetArn = updated_items_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, items_filename) # Pass in the new dataset. See note below
    },
    roleArn = role_arn,
    importMode = "FULL"
)
# Note: we are passing the updated items dataset file (the one that contains the additional columns).
# Once the import completes, the new columns are available to Personalize, including for training new solutions.

updated_items_dataset_import_job_arn = updated_items_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(updated_items_dataset_import_job_response, indent=2))


# Wait for the data to finish importing. It can take up to 10 minutes.
max_time = time.time() + 1*60*60 # 1 hour

while time.time() < max_time:
    import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = updated_items_dataset_import_job_arn
    )
    status = import_job_response["datasetImportJob"]['status']

    if status == "ACTIVE":
        print(f'Import job {updated_items_dataset_import_job_arn} successfully completed')
        break
    elif status == "CREATE FAILED":
        print(f'Import job {updated_items_dataset_import_job_arn} failed')
        if import_job_response["datasetImportJob"].get('failureReason'):
            print('   Reason: ' + import_job_response["datasetImportJob"]['failureReason'])
        break  # Stop polling; otherwise the loop keeps calling the API until max_time
    else: # status == CREATE PENDING or CREATE IN_PROGRESS
        print('Dataset import job:' + status)
        time.sleep(60)

# After importing the data, the new columns are available for filtering.

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:402114309305:dataset-import-job/updated_items_dataset_import_job85652",
  "ResponseMetadata": {
    "RequestId": "bd04e60b-a312-471a-9319-44637973cc91",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 19:42:52 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "125",
      "connection": "keep-alive",
      "x-amzn-requestid": "bd04e60b-a312-471a-9319-44637973cc91"
    },
    "RetryAttempts": 0
  }
}
Dataset import job:CREATE PENDING
Dataset import job:CREATE IN_PROGRESS
Dataset import job:CREATE IN_PROGRESS
Dataset import job:CREATE IN_PROGRESS
Dataset import job:CREATE IN_PROGRESS
Dataset import job:CREATE IN_PROGRESS
Import job arn:aws:personalize:us-east-1:402114309305:dataset-import-job/updated_items_dataset_import_job85652 successfully completed
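An import job can fail after several minutes of waiting if the CSV header does not line up with the schema. As a sketch, the check below compares the header row of a CSV with the schema's field names before an import is submitted. `header_mismatch` is a hypothetical helper, and it assumes the first row of the file is the header.

```python
import csv
import io

def header_mismatch(csv_text, schema):
    """Compare a CSV header row with the schema's field names.

    Returns (missing_from_csv, not_in_schema) as two sorted lists.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    csv_columns = {column.strip() for column in header}
    schema_columns = {field["name"] for field in schema["fields"]}
    return sorted(schema_columns - csv_columns), sorted(csv_columns - schema_columns)

# Hypothetical usage with this notebook's variables (only the first line is read):
# with open(items_filename) as f:
#     missing, extra = header_mismatch(f.readline(), updated_items_schema)
#     print("Missing from CSV:", missing, "| Not in schema:", extra)
```

Two empty lists mean the header and the schema name the same columns.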


Step 4: Create a new Solution for the updated data.

Once the new data has finished importing, you will need to create a new solution.
The CreateSolution request can use the same dataset group ARN as before.



In [34]:

## Create solution

updated_solution_version_arn = None
updated_solution_arn = None
updated_solution_name = "my-updated-solution-"+token


try:
    # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/personalize/client/create_solution.html
    create_solution_response = personalize.create_solution(
        name = updated_solution_name,
        datasetGroupArn = dataset_group_arn,
        recipeArn = item_attribute_affinity_recipe_arn
    )

    updated_solution_arn = create_solution_response['solutionArn']
    print(json.dumps(create_solution_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this solution, seemingly')
    paginator = personalize.get_paginator('list_solutions')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for solution in paginate_result['solutions']:
            if solution['name'] == updated_solution_name:
                updated_solution_arn = solution['solutionArn']
                print(f'Item Attribute Affinity solution ARN = {updated_solution_arn}')
                
                response = personalize.list_solution_versions(
                    solutionArn = updated_solution_arn,
                    maxResults = 100
                )
                if len(response['solutionVersions']) > 0:
                    updated_solution_version_arn = response['solutionVersions'][-1]['solutionVersionArn']
                    print(f'Will use most recent solution version for this solution: {updated_solution_version_arn}')
                break

{
  "solutionArn": "arn:aws:personalize:us-east-1:402114309305:solution/my-updated-solution-85652",
  "ResponseMetadata": {
    "RequestId": "93c56fea-25a5-489f-b56e-3f8055a13ecc",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 19:48:53 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "95",
      "connection": "keep-alive",
      "x-amzn-requestid": "93c56fea-25a5-489f-b56e-3f8055a13ecc"
    },
    "RetryAttempts": 0
  }
}


Step 5: Create a new Solution Version from the new Solution


In [35]:
if not updated_solution_version_arn:
    create_solution_version_response = personalize.create_solution_version(
        solutionArn = updated_solution_arn
    )

    updated_solution_version_arn = create_solution_version_response['solutionVersionArn']
    print(json.dumps(create_solution_version_response, indent=2))
else:
    print(f'Solution version {updated_solution_version_arn} already exists; not creating')

{
  "solutionVersionArn": "arn:aws:personalize:us-east-1:402114309305:solution/my-updated-solution-85652/0a2037c2",
  "ResponseMetadata": {
    "RequestId": "e35aa636-d62e-4b26-98e1-60f4281b73d6",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 19:48:53 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "111",
      "connection": "keep-alive",
      "x-amzn-requestid": "e35aa636-d62e-4b26-98e1-60f4281b73d6"
    },
    "RetryAttempts": 0
  }
}


### Wait for the new solution version to finish training
This can take around 40 minutes.

In [36]:
%%time

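# Reset the deadline: the max_time set in the import cell may be close to expiring
max_time = time.time() + 3*60*60 # 3 hours

while time.time() < max_time: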
    updated_soln_ver_response = personalize.describe_solution_version(
        solutionVersionArn = updated_solution_version_arn
    )
    status = updated_soln_ver_response["solutionVersion"]["status"]

    if status == "ACTIVE":
        print(f'Solution version {updated_solution_version_arn} successfully completed')
        break
    elif status == "CREATE FAILED":
        print(f'Solution version {updated_solution_version_arn} failed')
        if updated_soln_ver_response["solutionVersion"].get('failureReason'):
            print('   Reason: ' + updated_soln_ver_response["solutionVersion"]['failureReason'])
        break
    else:
        print('At least one solution version is still in progress')
        time.sleep(60)


At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version i

### Chapter 7.0: Submit the batch segmentation job to our new Solution Version

In this chapter, we will:
- Step 1: Submit a batch segmentation job to our new solution version, then wait for it to complete. For consistency, we will use the same sample input that we generated previously.
- Step 2: Compare the output of the new solution version to the previous solution version's output. We should expect to see slight differences between the two outputs.

In [37]:
### Step 1) Submit batch segmentation job. 

# Review the sample input file from earlier. We will use this same file to perform a batch inference job using our new solution version.

# Create a copy of the sample input file for the new solution's batch inference job & rename it
job_input_filename_updated = 'job_input_updated.json'
!cp $job_input_filename $job_input_filename_updated

# Preview input file
print("\nPreviewing input file item-attribute-affinity solution... ")
!head -n 5 $job_input_filename_updated
print('\n')


# Create and start a Batch Segmentation Job using our latest Solution Version

s3_input_key_updated = "batch-job-inputs/item-attribute-affinity-job/" + job_input_filename_updated

s3.upload_file(job_input_filename_updated, bucket_name, s3_input_key_updated)

s3_input_path_updated = "s3://" + bucket_name + "/" + s3_input_key_updated

response = personalize.create_batch_segment_job (
    solutionVersionArn = updated_solution_version_arn,
    jobName = "retaildemostore-updated-item-attribute-affinity-job-" + token,
    roleArn = role_arn,
    jobInput = {"s3DataSource": {"path": s3_input_path_updated}},
    jobOutput = {"s3DataDestination":{"path": s3_output_path}},
    numResults = y
)
job_updated_arn = response['batchSegmentJobArn']
print(json.dumps(response, indent=2, default=str))



Previewing input file item-attribute-affinity solution... 
{"itemAttributes": "ITEMS.CATEGORY_L1 = \"tools\""}
{"itemAttributes": "ITEMS.CATEGORY_L1 = \"housewares\""}
{"itemAttributes": "ITEMS.CATEGORY_L1 = \"floral\""}
{"itemAttributes": "ITEMS.CATEGORY_L1 = \"food service\""}
{"itemAttributes": "ITEMS.CATEGORY_L1 = \"furniture\""}


{
  "batchSegmentJobArn": "arn:aws:personalize:us-east-1:402114309305:batch-segment-job/retaildemostore-updated-item-attribute-affinity-job-85652",
  "ResponseMetadata": {
    "RequestId": "a140d94e-096f-4251-9062-bee7961070de",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 20:30:00 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "143",
      "connection": "keep-alive",
      "x-amzn-requestid": "a140d94e-096f-4251-9062-bee7961070de"
    },
    "RetryAttempts": 0
  }
}


In [38]:
%%time

# Wait for the batch jobs to complete. This can take around 10 minutes.

current_time = datetime.now()
print("Import Started on: ", current_time.strftime("%I:%M:%S %p"))

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    resp = personalize.describe_batch_segment_job(batchSegmentJobArn = job_updated_arn)
    status = resp["batchSegmentJob"]['status']
    print("DatasetSegmentJob {arn}: {status}".format(arn=job_updated_arn, status=status))

    if status == "ACTIVE" or status == "CREATE FAILED":
        current_time = datetime.now()
        print("Job has ended on: ", current_time.strftime("%I:%M:%S %p"))
        break
    else:
        print('Job still in progress')
        time.sleep(60)



Import Started on:  08:30:00 PM
DatasetSegmentJob arn:aws:personalize:us-east-1:402114309305:batch-segment-job/retaildemostore-updated-item-attribute-affinity-job-85652: CREATE PENDING
Job still in progress
DatasetSegmentJob arn:aws:personalize:us-east-1:402114309305:batch-segment-job/retaildemostore-updated-item-attribute-affinity-job-85652: CREATE IN_PROGRESS
Job still in progress
DatasetSegmentJob arn:aws:personalize:us-east-1:402114309305:batch-segment-job/retaildemostore-updated-item-attribute-affinity-job-85652: CREATE IN_PROGRESS
Job still in progress
DatasetSegmentJob arn:aws:personalize:us-east-1:402114309305:batch-segment-job/retaildemostore-updated-item-attribute-affinity-job-85652: CREATE IN_PROGRESS
Job still in progress
DatasetSegmentJob arn:aws:personalize:us-east-1:402114309305:batch-segment-job/retaildemostore-updated-item-attribute-affinity-job-85652: CREATE IN_PROGRESS
Job still in progress
DatasetSegmentJob arn:aws:personalize:us-east-1:402114309305:batch-segment-jo

In [39]:
# Download the output file of the segmentation job from S3:

job_output_file_updated = job_input_filename_updated + ".out"
export_name_updated = 'batch-job-outputs/item-attribute-affinity-job/' + job_output_file_updated
s3.download_file(bucket_name, export_name_updated, job_output_file_updated)


#### Step 2) Compare the outputs of the segmentation jobs.

Now let's inspect the output of this job and compare it to the output of the job that we ran on the solution version which used the original items schema.
The code block below prints the two outputs one after the other so they can be compared.


In [40]:
print('Inspecting outputs of original-schema & new-schema solution versions:')
print('Original schema output:')
!head -n 5 $job_output_file

print('\nNew schema output:')
!head -n 5 $job_output_file_updated

Inspecting outputs of original-schema & new-schema solution versions:
Original schema output:
{"input": {"itemAttributes": "ITEMS.CATEGORY_L1 = \"tools\""}, "output": {"usersList": ["2131","2504","4680","4712","3965","3867","517","321","1177","2056"]}, "error": null}
{"input": {"itemAttributes": "ITEMS.CATEGORY_L1 = \"housewares\""}, "output": {"usersList": ["3976","1018","4503","673","1656","456","3609","4288","3948","1135"]}, "error": null}
{"input": {"itemAttributes": "ITEMS.CATEGORY_L1 = \"floral\""}, "output": {"usersList": ["1928","68","161","2123","1691","3167","5035","142","4509","3254"]}, "error": null}
{"input": {"itemAttributes": "ITEMS.CATEGORY_L1 = \"food service\""}, "output": {"usersList": ["5679","5338","5559","5304","5696","5503","5689","5318","5366","5809"]}, "error": null}
{"input": {"itemAttributes": "ITEMS.CATEGORY_L1 = \"furniture\""}, "output": {"usersList": ["3297","1773","97","4110","3660","2930","3557","2217","438","1742"]}, "error": null}

New schema output:


### Compare the outputs:

Look at the outputs in the previous code block.
Do you notice any changes in the outputs?

We should expect to see slight differences in outputs between the before-and-after solution versions.
Any differences you see are consistent with the new solution version having been trained after the item schema update.
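Comparing long lists of user IDs by eye is error-prone, so a per-segment overlap count can help. The following is a sketch only: `users_by_query` and `segment_overlap` are hypothetical helpers that assume both files use the JSON Lines output format shown earlier.

```python
import json

def users_by_query(lines):
    """Map each query expression to the set of user IDs in its segment."""
    result = {}
    for line in lines:
        if line.strip():
            record = json.loads(line)
            users = (record.get("output") or {}).get("usersList", [])
            result[record["input"]["itemAttributes"]] = set(users)
    return result

def segment_overlap(before_lines, after_lines):
    """Return {query: (shared, only_before, only_after)} user counts.

    Only queries that appear in both outputs are reported.
    """
    before = users_by_query(before_lines)
    after = users_by_query(after_lines)
    report = {}
    for query in before.keys() & after.keys():
        b, a = before[query], after[query]
        report[query] = (len(b & a), len(b - a), len(a - b))
    return report

# Hypothetical usage with this notebook's output files:
# with open(job_output_file) as f_old, open(job_output_file_updated) as f_new:
#     for query, counts in segment_overlap(f_old, f_new).items():
#         print(query, counts)
```

A segment whose tuple is `(10, 0, 0)` would contain the same ten users in both runs; any other tuple shows how many users changed.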



#### Step 3: Evaluate the metrics of the two solution versions

Amazon Personalize provides [offline metrics](https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html#working-with-training-metrics-metrics) for solution versions that allow you to evaluate the accuracy of the model before you deploy the model in your application. Metrics can also be used to view the effects of modifying a custom solution's hyperparameters or to compare two solution versions.

Let's retrieve and compare the metrics for the solution versions we just created in this notebook.


In [41]:
# Get metrics for the solution version that used the original schema
get_orig_solution_metrics_response = personalize.get_solution_metrics(
    solutionVersionArn = solution_version_arn
)
print("Metrics of the solution version with the original schema & dataset:")
print(json.dumps(get_orig_solution_metrics_response['metrics'], indent=2))

# Get metrics for the solution version that used the updated schema & dataset
get_updated_solution_metrics_response = personalize.get_solution_metrics(
    solutionVersionArn = updated_solution_version_arn
)
print("\n\nMetrics of the solution version with the updated schema & dataset:")
print(json.dumps(get_updated_solution_metrics_response['metrics'], indent=2))


Metrics of the solution version with the original schema & dataset:
{
  "coverage": 0.5869,
  "hits_at_1_percent": 11.459,
  "recall_at_1_percent": 0.6866
}


Metrics of the solution version with the updated schema & dataset:
{
  "coverage": 0.5826,
  "hits_at_1_percent": 11.67,
  "recall_at_1_percent": 0.6959
}


As you can see, the metrics of the two solution versions are slightly different. This can serve as additional confirmation that the new solution version used a different dataset for training.
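To quantify "slightly different", the two metric dictionaries can be compared key by key. This is a minimal sketch: `metric_deltas` is a hypothetical helper, and the two dictionaries are copied from the output above.

```python
def metric_deltas(before, after):
    """Return {metric: (before, after, after - before)} for metrics in both dicts."""
    return {
        name: (before[name], after[name], round(after[name] - before[name], 4))
        for name in sorted(before.keys() & after.keys())
    }

# Values copied from the get_solution_metrics output above
original = {"coverage": 0.5869, "hits_at_1_percent": 11.459, "recall_at_1_percent": 0.6866}
updated = {"coverage": 0.5826, "hits_at_1_percent": 11.67, "recall_at_1_percent": 0.6959}

for name, (old, new, delta) in metric_deltas(original, updated).items():
    print(f"{name}: {old} -> {new} ({delta:+.4f})")
```

With these values, coverage drops by 0.0043 while hits and recall at 1 percent rise by 0.211 and 0.0093. In the notebook, the same function could be applied to the `metrics` element of each `get_solution_metrics` response.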


However, the most reliable method of making sure that the new columns are being used by this new solution version requires checking via the Amazon Personalize service page within the AWS Management Console. Visit the `Solutions and recipes` menu option under the `Custom Resources` tab. Select the new solution we just created. Click the `Solution Configuration` dropdown option, and you will be able to see the `Columns for training` that are used by your solution. Recall, for the original solution, we previously observed that the Item dataset's columns for training were: `ITEM_ID`, `PRICE`, and `CATEGORY_L1`.

Now look at the updated solution. Notice that the Item dataset columns used for training are: `ITEM_ID`, `PRICE`, `CATEGORY_L1`, `CATEGORY_L2`, `PRODUCT_DESCRIPTION`, and `GENDER`.

This is confirmation that the new solution used these additional new columns for training.


# Summary: 

In this notebook, we walked through the process for updating your item schemas when using Amazon Personalize. Although we used the item-attribute-affinity recipe in our example, this workflow also applies to the other recipe types in Amazon Personalize.

This demonstration required some set up. To recap those steps, we:
1. Set Up Amazon S3 Bucket, IAM Policies, and IAM Roles
2. Fetched and Inspected the Datasets
3. Created Schemas and Imported Datasets in Amazon Personalize
4. Created an e-commerce custom solution in Amazon Personalize
5. Ran a Batch Segmentation Job on our Solution Version to obtain a baseline for comparison.

To recap the schema update, we:
1. Updated the schema (via the CreateSchema API)
2. Updated the dataset that corresponded to the updated schema (via the UpdateDataset API)
3. Imported our new data/columns (via the CreateDatasetImportJob API)
4. Created a new Solution (via the CreateSolution API).
5. Trained a new solution version (via the CreateSolutionVersion API) that uses the updated schema and new dataset.
6. Submitted the batch job to our new solution version.
7. Inspected the output of our job & compared it to the output from the original solution version.
8. Compared the before-and-after metrics of the two solution versions
9. Confirmed the columns for training for each Solution via the AWS Console.


# Chapter 8: Cleanup

This chapter will walk through deleting all of the resources created throughout this notebook.

First, we will delete the S3 and IAM resources. 

Then, we will delete the Amazon Personalize resources.

Amazon Personalize resources have to be deleted in a specific sequence* to avoid dependency errors.
The order in which you should delete resources in Amazon Personalize is: recommenders and campaigns, then solutions, then event trackers, then filters, then datasets and dataset schemas, and finally, the dataset group.

To declutter this notebook, we will use a utility module written in Python that deletes all of the resources in each dataset group in the correct order.

This section should take about 15 minutes, though most of this time will be spent waiting for the Personalize resources to finish deleting.

*: Note, we didn't use some of these resource types (such as recommenders and campaigns) in this notebook. The list is just for your knowledge.
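The ordering rule above can be expressed as a simple sort. This is an illustrative sketch only, not the helper script used later: `DELETE_ORDER`, `sort_for_deletion`, the resource type labels, and the placeholder identifiers are all hypothetical.

```python
# Deletion order as described above; earlier entries are deleted first.
DELETE_ORDER = [
    "recommender", "campaign", "solution", "event-tracker",
    "filter", "dataset", "schema", "dataset-group",
]

def sort_for_deletion(resources):
    """Sort (resource_type, identifier) pairs into a safe deletion order."""
    rank = {resource_type: i for i, resource_type in enumerate(DELETE_ORDER)}
    # sorted() is stable, so resources of the same type keep their relative order
    return sorted(resources, key=lambda resource: rank[resource[0]])

# Placeholder identifiers for illustration
resources = [
    ("dataset-group", "my-dataset-group"),
    ("schema", "my-items-schema"),
    ("solution", "my-solution"),
    ("dataset", "my-items-dataset"),
]
for resource_type, identifier in sort_for_deletion(resources):
    print(resource_type, identifier)
```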

#### Emptying and Deleting the S3 bucket

NOTE: THE FOLLOWING CODE WILL DELETE ALL OF THE OBJECTS IN THE BUCKET, INCLUDING THE CSVs & BATCH JOB FILES.
If you don't want to delete the S3 bucket, DON'T run the code block below.

If you do want to delete the bucket but keep your data, first download the files (e.g. via the Console or CLI).


In [42]:
# List objects in the bucket and delete them
objects = s3.list_objects_v2(Bucket=bucket_name)

if 'Contents' in objects:
    for obj in objects['Contents']:
        print(f'Deleting {obj["Key"]}...')
        s3.delete_object(Bucket=bucket_name, Key=obj['Key'])

# Delete the bucket
s3.delete_bucket(Bucket=bucket_name)


Deleting batch-job-inputs/item-attribute-affinity-job/job_input.json...
Deleting batch-job-inputs/item-attribute-affinity-job/job_input_updated.json...
Deleting batch-job-outputs/item-attribute-affinity-job/_CHECK...
Deleting batch-job-outputs/item-attribute-affinity-job/job_input.json.out...
Deleting batch-job-outputs/item-attribute-affinity-job/job_input_updated.json.out...
Deleting interactions.csv...
Deleting items.csv...
Deleting items_basic.csv...
Deleting users.csv...


{'ResponseMetadata': {'RequestId': 'VEYHS36PKSZWN6DR',
  'HostId': 'uMk1PA58khnv9mONIXEpgSXKyOgXWOrEdsJPZtMp+Ea3N9eWuWNJ8kXBEYHrXyNMgBGnnoSf8nOeiLnApDnFyneYkn3woTCD',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': 'uMk1PA58khnv9mONIXEpgSXKyOgXWOrEdsJPZtMp+Ea3N9eWuWNJ8kXBEYHrXyNMgBGnnoSf8nOeiLnApDnFyneYkn3woTCD',
   'x-amz-request-id': 'VEYHS36PKSZWN6DR',
   'date': 'Fri, 27 Oct 2023 20:38:03 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}

#### Delete the IAM Execution Role and Policy

Now, let's delete the IAM Policy and IAM Role that we created for the Personalize Service.

In [43]:
iam.detach_role_policy(RoleName=role_name, PolicyArn=policy_arn)

print("Deleting: " + role_name)
iam.delete_role(RoleName=role_name)

print("Deleting: " + policy_arn)
iam.delete_policy(PolicyArn=policy_arn)

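# Check whether the IAM role and policy were deleted.
# Paginate so the check also works in accounts with many roles/policies;
# Scope='Local' limits the listing to customer managed policies.
roles = [r['RoleName'] for page in iam.get_paginator('list_roles').paginate() for r in page['Roles']]
policies = [p['Arn'] for page in iam.get_paginator('list_policies').paginate(Scope='Local') for p in page['Policies']]

if role_name in roles:
    print('Role was not deleted!')
if policy_arn in policies:
    print('Policy was not deleted!')
if role_name not in roles and policy_arn not in policies:
    print('Bucket and IAM role and policy deleted successfully!')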

Deleting: PersonalizeRole-85652
Deleting: arn:aws:iam::402114309305:policy/PersonalizePolicy-85652
Bucket and IAM role and policy deleted successfully!


#### Set up deletion script

Next, we will download a helper script that simplifies the cleanup process. If you want to look at the underlying code, the helper script can be found in the [retail-demo-store GitHub repo](https://github.com/aws-samples/retail-demo-store/blob/b80137c6edb2c975c50221fcaba46b6abadd7b99/src/aws-lambda/personalize-pre-create-resources/delete_dataset_groups.py).

Note: Under the hood, this script uses the boto3 `delete_*` operations of the Personalize API.
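
To give a feel for what a `delete_*` call plus waiting looks like, here is a minimal sketch for a single solution. It is not copied from the helper script; the function name `delete_solution_and_wait` and its polling parameters are hypothetical. It assumes a boto3 Personalize client, and relies on `describe_solution` raising `ResourceNotFoundException` once the solution is gone.

```python
import time


def delete_solution_and_wait(personalize_client, solution_arn,
                             poll_seconds=10, timeout_seconds=900):
    """Request deletion of a solution, then poll until it no longer exists.

    Returns True if the solution disappeared before the timeout, else False.
    """
    personalize_client.delete_solution(solutionArn=solution_arn)

    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        try:
            # Succeeds while the solution still exists (deletion in progress).
            personalize_client.describe_solution(solutionArn=solution_arn)
        except personalize_client.exceptions.ResourceNotFoundException:
            return True  # the solution is fully deleted
        time.sleep(poll_seconds)
    return False
```

The same delete-then-poll pattern would apply to the other resource types, each with its own `delete_*` and `describe_*` pair.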

In [44]:
# Download the helper script from the GitHub repo.
!curl -O https://raw.githubusercontent.com/aws-samples/retail-demo-store/b80137c6edb2c975c50221fcaba46b6abadd7b99/src/aws-lambda/personalize-pre-create-resources/delete_dataset_groups.py

# Import the module
import delete_dataset_groups

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18958  100 18958    0     0   119k      0 --:--:-- --:--:-- --:--:--  119k


#### Set up logging for deletion of Personalize resources
The following code cell imports and configures Python's built-in `logging` module. The deletion script logs its progress through this module, so attaching a handler lets us see the deletion status of each Personalize resource.

In [45]:
import logging
import sys

# Send the helper script's log messages to the notebook's standard output
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO)

delete_dataset_groups.logger.setLevel(logging.INFO)
delete_dataset_groups.logger.addHandler(handler)

#### Delete Amazon Personalize Resources

Now we can delete the active dataset groups. This can take up to 10 minutes depending on the resources within your dataset group. The function below will log its progress until finished.

In [46]:
%%time

print(f'Active dataset groups that need to be deleted: {dataset_group_name}\n')

delete_dataset_groups.delete_dataset_groups(
    dataset_group_names = [ dataset_group_name ], 
    wait_for_resources = True
)

Active dataset groups that need to be deleted: retaildemostore-products-DSG-85652

Dataset Group ARN: arn:aws:personalize:us-east-1:402114309305:dataset-group/retaildemostore-products-DSG-85652
All recommenders have been deleted or none exist for dataset group
All campaigns have been deleted or none exist for dataset group
Deleting solution: arn:aws:personalize:us-east-1:402114309305:solution/my-updated-solution-85652
Deleting solution: arn:aws:personalize:us-east-1:402114309305:solution/my-original-solution-85652
Waiting for 2 solution(s) to be deleted
Waiting for 2 solution(s) to be deleted
Waiting for 2 solution(s) to be deleted
Waiting for 1 solution(s) to be deleted
All solutions have been deleted or none exist for dataset group
All event trackers have been deleted or none exist for dataset group
All filters have been deleted or none exist for dataset group
Deleting dataset arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85652/ITEMS
Deleting dataset

In [47]:
# Delete the original item schema
# The external helper script doesn't delete obsolete schemas, so we must do this ourselves
response = personalize.delete_schema(schemaArn = items_schema_arn)


#### Delete local files
If you want to retain these files, don't run the following code cell.

In [48]:
!rm delete_dataset_groups.py

# Delete input & output files from original solution version
!rm job_input.json
!rm job_input.json.out

# Delete input & output files from retrained solution version with new schema & dataset
!rm job_input_updated.json
!rm job_input_updated.json.out

# Delete the basic items dataset csv
!rm {items_basic_filename}
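
If some of these files were never created (for example, because a chapter was skipped), `rm` prints an error for each missing file. A minimal alternative in plain Python, which skips missing files, might look like this. The function name `remove_local_files` is hypothetical.

```python
from pathlib import Path


def remove_local_files(filenames):
    """Delete each file that exists; silently skip the ones that don't.

    Returns the list of filenames that were actually removed.
    """
    removed = []
    for name in filenames:
        path = Path(name)
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed
```

In this notebook it could be called as `remove_local_files(['delete_dataset_groups.py', 'job_input.json', 'job_input.json.out', 'job_input_updated.json', 'job_input_updated.json.out', items_basic_filename])`.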


## Cleanup Complete

All resources created by this Notebook have been deleted.

### Final note:
If you are running this notebook on Amazon SageMaker, don't forget to `stop` or `terminate` your SageMaker instance so that you don't incur additional costs.
Afterwards, feel free to delete the execution role of this SageMaker notebook instance.

#### Congrats on completing this Personalize Demonstration! 
If you would like to learn more about how Amazon Personalize can power your business's ML-based recommendation services, refer to the [AWS Documentation on Amazon Personalize](https://docs.aws.amazon.com/personalize/latest/dg/what-is-personalize.html).
