## Guided Example for Adding New Items and Interactions for User-Personalization Solutions in Amazon Personalize


This notebook explains the process for *updating* your *datasets* when using Amazon Personalize--specifically adding new *items* and *interactions for those new items*. Commentary in this notebook is centered around the *User Personalization* recipe. 

This notebook is inspired by the official AWS [Retail Demo Store Workshop](https://github.com/aws-samples/retail-demo-store). The main difference between this notebook and the Retail Demo Store workshop, is that the latter mainly focuses on real-time recommendations, whereas this notebook demonstrates how updating your datasets affects user-personalization solutions & item recommendations.


## Notebook overview

### Core content:

Before we can demonstrate schema changes, we will need to do some set-up. This set up portion is documented in the first 5 chapters of this notebook.

After we perform the set up, we will then walk through the process of changing your dataset. 
This is what the remaining chapters go over.

Table of Contents:

------------Set up------------
- **Chapter 1**: Set Up Amazon S3 Bucket, IAM Policies, and IAM Roles (_5 minutes_)
- **Chapter 2**: Fetch, Inspect, and Trim the Datasets (_5 minutes_)
- **Chapter 3**: Create Schemas and Import Datasets in Amazon Personalize (_15 minutes_)
- **Chapter 4**: Create an e-commerce custom solution in Amazon Personalize (_60 minutes_)
- **Chapter 5**: Gather Metrics and Run a Batch Inference Job on your Solution Version (_20 minutes_)

------------Updating Datasets------------
- **Chapter 6**: Overview of the Steps involved for your Updating Datasets for User-Personalization Solutions (_100 minutes_)
- **Chapter 6 Step 1**: Fetch, modify, and inspect the updated the Items & Interactions Datasets 
- **Chapter 6 Step 2**: Update the Items & Interactions Datasets (via the the CreateDatasetImportJob API)
- **Chapter 6 Step 3**: Re-Run the Batch Inference Job on your Solution Version to trigger an auto-update
- **Chapter 6 Step 4**: Gather metrics of the auto-updated solution version
- **Chapter 6 Step 5**: Perform A Full Retraining (Create a Solution Version)
- **Chapter 6 Step 6**: Run the Batch Inference Job on the new Solution Version
- **Chapter 6 Step 7**: Gather metrics of the fully-retrained solution version


------------Analysis------------
- **Chapter 7**: Analysis (_5 minutes_)
- **Chapter 7 Step 1**: Compare the outputs of the original, auto-updated, and fully retrained solution versions. 
    - All three should be different.
    
- **Chapter 7 Step 2**: Compare the metrics across the two solution versions (original, auto-updated, fully-trained)
    - Since the solution version was fully retrained with additional data, we should expect the metrics to be slightly different.
    
------------Clean up------------
- **Chapter 8**: Clean up (_15 minutes_)


#### Relevant Information:

- This notebook was developed and tested in the us-east-1 Region.

- To ensure a reliable run, please don't *concurrently* run multiple copies of this notebook on the same Sagemaker Notebook Instance. If you want to concurrently run multiple copies of this notebook, run each notebook in its own environment/instance. 

- Updating datasets for user-segmentation-based solutions is similar. The difference is, for user-segmentation, a new solution version *must* be trained before changes in the datasets have an impact in your recommendations. This is in contrast to user-personalization-based solutions, where recent interactions and new items are factored in your inference jobs. This is useful for "cold"-starting new items. However, to use those new items & interactions beyond cold starts, occasionally training a new solution version is recommended. This will make that new data more impactful in your recommendations for user-personalization solutions.

- After you finish running this notebook, please run the code cells in the `Clean up` chapter of this notebook (final chapter). This will prevent incurring additional costs.

- The purpose of this notebook is to demonstrate a *high-level* implementation of an end-to-end Personalize Workflow for updating datasets. As such, the code within this notebook has not been tested for a production environment and for the sake of brevity, not all security best practices may have been implemented. For additional information to secure your Personalize-dependent workloads, refer to the [Security in Amazon Personalize](https://docs.aws.amazon.com/personalize/latest/dg/security.html) section of the Amazon Personalize documentation.

- This notebook will be using the python programming language and the AWS SDK for python (referred to as boto3). Even if you are not fluent in python, the code cells should be reasonably intuitive. In practice, you can use any programming language supported by the AWS SDK to complete the same steps from this notebook in your application environment. Visit the [official boto3 documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) for more information about the AWS SDK for Python.


## Introduction to Amazon Personalize

[Amazon Personalize](https://aws.amazon.com/personalize/) makes it easy for customers to develop applications with a wide array of personalization use cases, including real time product recommendations and customized direct marketing. Amazon Personalize brings the same machine learning technology used by Amazon.com to everyone for use in their applications – with no machine learning experience required. Amazon Personalize customers pay for what they use, with no minimum fees or upfront commitment. You can start using Amazon Personalize with a simple three step process, which only takes a few clicks in the AWS console, or a set of simple API calls. First, point Amazon Personalize to user data, catalog data, and activity stream of views, clicks, purchases, etc. in Amazon S3 or upload using a simple API call. Second, with a single click in the console or an API call, train a custom private recommendation model for your data. Third, retrieve personalized recommendations for any user by creating a recommender, campaign, or batch job.


## Chapter 1: Set Up Amazon S3 Bucket, IAM Policies, and IAM Roles

In this Chapter, we are going to focus setting up our Amazon S3 bucket, and initializing the proper IAM Policies & Roles required to run this workflow.

This chapter will take about 5 minutes.

### Update dependencies

To get started, we need to perform a bit of setup. First, we need to ensure that a current version of botocore is locally installed. The botocore library is used by boto3, the AWS SDK library for python. We need a current version to be able to access some of the newer Amazon Personalize features.

The following cell will update pip and install the latest botocore library.

In [1]:
import sys
!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install --upgrade --no-deps --force-reinstall botocore


Collecting botocore
  Using cached botocore-1.34.60-py3-none-any.whl.metadata (5.7 kB)
Using cached botocore-1.34.60-py3-none-any.whl (12.0 MB)
Installing collected packages: botocore
  Attempting uninstall: botocore
    Found existing installation: botocore 1.34.60
    Uninstalling botocore-1.34.60:
      Successfully uninstalled botocore-1.34.60
Successfully installed botocore-1.34.60


### Import dependencies

Next we need to import some dependencies/libraries needed to complete this part of the notebook.


In [2]:
import boto3
import json
import pandas as pd
import time
import csv
import os

import uuid
from botocore.exceptions import ClientError
import numpy
from io import StringIO
import botocore
from datetime import datetime


  from pandas.core.computation.check import NUMEXPR_INSTALLED


### Create clients

Next we need to create the AWS service clients needed for this demonstration.

- **personalize**: this client is used to create resources in Amazon Personalize
- **s3**: this client is used to access S3 commands and resources


In [3]:
# Setup clients
personalize = boto3.client('personalize')
s3 = boto3.Session().client('s3')


### Set up our Amazon S3 bucket

For simplicity, we will use this bucket to store our input data, output data, helper scripts, and other files. 
Though, in a production environment, you may want to store these assets seperately/in seperate buckets.

To ensure a consistent naming convention throughout this notebook, we generate a random number for the 'token' variable below.

In [None]:
# Use an epoch timestamp w/ precision to the nearest millisecond to present a pseduo-randomly generated value for token. 
# Alternatively, enter your own *lowercase alphanumeric* string of 5 characters here. The 'token` is used for naming aws resources. 
token = str(round(time.time()*1000))[-5:]
print(f'The value of your token is:"{token}".')

# Bucket name *must* contain the substring 'Personalize' or 'personalize'. 
#  This is to ensure compliance with the execution role of this Sagemaker Notebook instance.
bucket_name = 'personalize-update-items-dataset-example-' + token

# Creates a bucket in us-east-1
# Reference: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/create_bucket.html
bucket = s3.create_bucket(
    Bucket=bucket_name
)

# The Bucket Policy we need to attach to the bucket in order to allow Amazon Personalize to access it.
bucket_policy = {
    'Version': '2012-10-17',
    'Id': 'PersonalizeS3BucketAccessPolicy',
    'Statement': [
        {
            'Sid': 'PersonalizeS3BucketAccessPolicy',
            'Effect': 'Allow',
            'Principal': {
                'Service': 'personalize.amazonaws.com'
            },
            'Action': [
                's3:GetObject',
                's3:ListBucket',
                's3:PutObject'
            ],
            'Resource': [
                f'arn:aws:s3:::{bucket_name}',
                f'arn:aws:s3:::{bucket_name}/*'
            ]
        }
    ]
}

# Convert the policy to a JSON string and attach it to the bucket
bucket_policy = json.dumps(bucket_policy)
s3.put_bucket_policy(Bucket=bucket_name, Policy=bucket_policy)


# prints out the bucket
print('Bucket: {}'.format(bucket['Location']))

### Set up Amazon IAM Permissions for the Personalize Service


In addition to a bucket policy that allows Amazon Personalize access, we also need to explicitly grant the Amazon Personalize service those permissions within an IAM Role. This will enable the Personalize service to fetch and write data to Amazon S3. We use a custom-made customer-managed IAM policy to ensure we are abiding by the [Principle of Least Privilege best security practice](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html).

In [5]:
# Set up IAM for Personalize

iam = boto3.client('iam')

# role_name must begin with the substring 'PersonalizeRole' to ensure compliance with the Execution Role of this Sagemaker Notebook instance.
role_name = 'PersonalizeRole-'+token

print("Creating IAM Role...")
role_arn = iam.create_role(
    RoleName=role_name,
    # Allow Amazon Personalize to assume this role
    AssumeRolePolicyDocument=json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}))
role_arn = role_arn['Role']['Arn']

print("Created IAM Role. IAM Role ARN: " + role_arn)


# Create the IAM policy for Personalize
personalize_policy_doc = {
    'Version': '2012-10-17',
    'Id': 'PersonalizeS3BucketAccessPolicy-'+token,
    'Statement': [
        {
            'Sid': 'PersonalizeS3BucketAccessPolicy',
            'Action': [
                's3:GetObject',
                's3:ListBucket',
                's3:PutObject'
            ],
            'Resource': [
                f'arn:aws:s3:::{bucket_name}',
                f'arn:aws:s3:::{bucket_name}/*'
            ],
            'Effect': 'Allow'
        }
    ]
}
personalize_policy_doc = json.dumps(personalize_policy_doc)

# role_name must begin with the substring 'PersonalizePolicy' to ensure compliance with the Execution Role of this Sagemaker Notebook instance.
iam_policy_name = 'PersonalizePolicy-'+token

print("Creating IAM Policy...")
policy_response = iam.create_policy(
    PolicyName=iam_policy_name,
    PolicyDocument=personalize_policy_doc,
    Description='Policy to allow Personalize access to our S3 bucket'
)

# get arn of the policy
policy_arn = policy_response['Policy']['Arn']
policy_version = policy_response['Policy']['DefaultVersionId']
print("Created IAM Policy. Policy ARN: " + policy_arn)

# Attach the policy to the role
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn=policy_arn
)
print("Attached policy to Role")

Creating IAM Role...
Created IAM Role. IAM Role ARN: arn:aws:iam::402114309305:role/PersonalizeRole-85968
Creating IAM Policy...
Created IAM Policy. Policy ARN: arn:aws:iam::402114309305:policy/PersonalizePolicy-85968
Attached policy to Role


## Chapter 2: Fetch and Inspect the Datasets

Amazon Personalize provides predefined recipes, based on common use cases, for training models. A recipe is a machine learning algorithm that you use with settings, or hyperparameters, and the data you provide to train an Amazon Personalize model. The data you provide to train a model are organized into separate datasets by the type of data being provided. A collection of datasets are organized into a dataset group. The three dataset types supported by Personalize are items, users, and interactions. Depending on the recipe type you choose, a different combination of dataset types are required. For all recipe types, an interactions dataset is required. Interactions represent how users interact with items. For example, viewing a product, watching a video, listening to a recording, or reading an article. In this notebook, we will be using the user personalization recipe, a recipe that can use all three dataset types.

In this chapter, you will:
    
    - copy public datasets to your private S3 bucket,
    - Load the datasets into this notebook environment,
    - Inspect the datasets so you have an understanding of the data
    
This chapter will take about 5 minutes.

Let's get started!

#### Some context on 'Items' datasets

When training models in Amazon Personalize, we can provide structured and unstructured metadata about our items. This data helps improve the relevance of recommendations and is particularly useful when recommending new/cold items added to your catalog. 

Optional reading: For this notebook we will be creating 'custom solutions' for our use cases. Additionally, Personalize also has retail domain recommenders. This construct, which was released at re:Invent 2021 is used for real-time inferences. You can read more about them in the [Personalize blog](https://aws.amazon.com/blogs/machine-learning/amazon-personalize-announces-recommenders-optimized-for-retail-and-media-entertainment/).

The retail domain recommenders stipulate some [reserved fields/columns](https://docs.aws.amazon.com/personalize/latest/dg/ECOMMERCE-datasets-and-schemas.html) that we must conform to. For example, some columns that Personalize supports for an `Items` dataset include `ITEM_ID`, `PRICE`, `CATEGORY_L1`, `CATEGORY_L2`, `PRODUCT_DESCRIPTION`, and `GENDER`. Personalize will automatically apply a natural language processing (NLP) machine learning model to the product description column to extract features from the text. The product's unique identifier is required. For items, at least one metadata column (such as price or level-1 category) is also required. 


### Save to CSV and upload to S3 bucket

For this notebook, we will be using publicly available datasets. These datasets are part of the [Retail Demo Store](https://github.com/aws-samples/retail-demo-store) project and are provided as a public download. 

The following cell will copy the csv datasets from the download URL to the local volume and then upload to our private s3 bucket. 

In [7]:
users_filename, items_filename, interactions_filename = "users.csv", "items.csv", "interactions.csv"

# copy the datasets from the public s3 bucket to our private s3 bucket
for file in [users_filename, items_filename, interactions_filename]:
    !wget https://code.retaildemostore.retail.aws.dev/csvs/{file}
    boto3.Session().resource('s3').Bucket(bucket_name).Object(file).upload_file(file)
    print(f'Finishing copying {file} to {bucket_name}')

--2024-03-12 01:36:06--  https://code.retaildemostore.retail.aws.dev/csvs/users.csv
Resolving code.retaildemostore.retail.aws.dev (code.retaildemostore.retail.aws.dev)... 18.165.83.69, 18.165.83.109, 18.165.83.129, ...
Connecting to code.retaildemostore.retail.aws.dev (code.retaildemostore.retail.aws.dev)|18.165.83.69|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58912 (58K) [text/csv]
Saving to: ‘users.csv’


2024-03-12 01:36:06 (43.4 MB/s) - ‘users.csv’ saved [58912/58912]

Finishing copying users.csv to personalize-update-items-dataset-example-06196
--2024-03-12 01:36:06--  https://code.retaildemostore.retail.aws.dev/csvs/items.csv
Resolving code.retaildemostore.retail.aws.dev (code.retaildemostore.retail.aws.dev)... 18.165.83.120, 18.165.83.129, 18.165.83.109, ...
Connecting to code.retaildemostore.retail.aws.dev (code.retaildemostore.retail.aws.dev)|18.165.83.120|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 789018 (771K) [te

Now, we will download our datasets from our private s3 bucket into this notebook environment, and load them into a [Pandas dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

Finally, we will display the first few rows of each dataset just so we have a sense of its dimensions.

In [7]:
# Users Dataset
get_users_csv_response = s3.get_object(Bucket=bucket_name, Key=users_filename)
users_csv_content = get_users_csv_response['Body'].read().decode('utf-8')

# Create a pandas DataFrame from the CSV content
users_df = pd.read_csv(StringIO(users_csv_content))
print('\nUsers:-----------')
print(users_df.head())  # Inspect the first few rows of the DataFrame


# Items Dataset
get_items_csv_response = s3.get_object(Bucket=bucket_name, Key=items_filename)
items_csv_content = get_items_csv_response['Body'].read().decode('utf-8')

# Create a pandas DataFrame from the CSV content
items_df = pd.read_csv(StringIO(items_csv_content))
print('\nItems:-----------')
print(items_df.head())  # Inspect the first few rows of the DataFrame


# Interactions Dataset
get_interactions_csv_response = s3.get_object(Bucket=bucket_name, Key=interactions_filename)
interactions_csv_content = get_interactions_csv_response['Body'].read().decode('utf-8')

# Create a pandas DataFrame from the CSV content
interactions_df = pd.read_csv(StringIO(interactions_csv_content))

interactions_df['USER_ID'] = interactions_df.USER_ID.astype(str)
interactions_df['TIMESTAMP'] = interactions_df.TIMESTAMP.astype(int)
print('\nInteractions:-----------')
print(interactions_df.head())  # Inspect the first few rows of the DataFrame


Users:-----------
   USER_ID  AGE GENDER
0        1   31      M
1        2   58      F
2        3   43      M
3        4   38      M
4        5   24      M

Items:-----------
                                ITEM_ID   PRICE  CATEGORY_L1 CATEGORY_L2  \
0  6579c22f-be2b-444c-a52b-0116dd82df6c   90.99  accessories    backpack   
1  2e852905-c6f4-47db-802c-654013571922  123.99  accessories    backpack   
2  4ec7ff5c-f70f-4984-b6c4-c7ef37cc0c09   87.99  accessories    backpack   
3  7977f680-2cf7-457d-8f4d-afa0aa168cb9  125.99  accessories    backpack   
4  b5649d7c-4651-458d-a07f-912f253784ce  141.99  accessories    backpack   

                                 PRODUCT_DESCRIPTION GENDER PROMOTED  
0           This tan backpack is nifty for traveling      F        N  
1                       Pale pink backpack for women      F        N  
2  This gainsboro backpack for women is first-rat...      F        N  
3  This gray backpack for women is first-rate for...      F        N  
4           

#### Optional reading: Inspection of our user & interactions input data

Similar to the items dataset, we have provided metadata on our users when training models in Personalize. For this notebook we have included each user's age and gender. For more information about requirements for the users dataset, refer to the [aws documentation](https://docs.aws.amazon.com/personalize/latest/dg/ECOMMERCE-users-dataset.html).


Additionally, take a look at the first few lines of the interactions file. Note: 

- An EVENT_TYPE column which can be used to train different Personalize campaigns & custom solutions, and can also be used to filter on recommendations. To simulate a real-world site, most of the EVENT_TYPE events are views, whereas a much smaller proportion is add to cart, checkout, and purchase events.
- The custom DISCOUNT column which is a contextual metadata field, that a Personalize user personalization solution can take into account to predict on the best next product based the user's propensity to interact with discount products.

### Trim down the Items and Interactions dataset

Now, we will modify the `items dataset` and `interactions dataset` to remove (at random) half of the items, and all interactions associated with those items. Then, we will create a solution version using these trimmed down datasets. In the second part of the notebook, we will simulate updating the datasets by using the full-version of the datasets with all of the items and interactions.

Specifically, we will:
- randomly select half the ItemIDs from the original dataset,
- remove rows associated with ItemIDs from the Items dataset,
- From the interactions dataset, remove all interactions associated with our randomly chosen ItemIDs.
- These writes will be performed on a *copy* of the orginal datasets. These `trimmed` CSVs will be uploaded to s3. We will then use these trimmed CSVs when we create our first solution version.


#### Trim the items dataset, upload it to S3, and preview it

In [8]:
# Trim Items Dataset:

items_trimmed_filename = 'items_trimmed.csv'

# Obtain the unique ITEM_ID values
unique_item_ids = items_df['ITEM_ID'].unique()
print(f"Number of unique ITEM_ID values: {len(unique_item_ids)}.\n")

num_of_ItemIDs_to_remove = int(0.5 * len(unique_item_ids))
print(f"Number of ITEM_ID values to remove: {num_of_ItemIDs_to_remove}.\n")

# Randomly select 50% of the unique "ITEM_ID" values (these represent new ITEM_IDs during the dataset update portion of the notebook)
selected_item_ids = numpy.random.choice(unique_item_ids, num_of_ItemIDs_to_remove, replace=False)
print(f"Removing the following ITEM_ID values... : {unique_item_ids}.\n")


# Create a new DataFrame with rows where "ITEM_ID" is not in the selected_item_ids
items_trimmed_df = items_df[~items_df["ITEM_ID"].isin(selected_item_ids)]

# Write the updated data to items_trimmed_filename
print(f"Copying trimmed Items records to {items_trimmed_filename}...\n")
items_trimmed_df.to_csv(items_trimmed_filename, index=False)

# Upload the modified CSV file to an S3 bucket
s3.upload_file(items_trimmed_filename, bucket_name, items_trimmed_filename)

print(f"File '{items_trimmed_filename}' has been uploaded to S3 bucket '{bucket_name}' with key '{items_trimmed_filename}'.")

print('\n Trimmed Items dataset preview:-----------')
print(items_trimmed_df.head())  # Inspect the first few rows of the DataFrame

# Validate: Number of unique ITEM_ID values in the trimmed csv should be half of what it was before
print(f"\nNumber of unique ITEM_ID values in '{items_trimmed_filename}': {len(items_trimmed_df['ITEM_ID'].unique())}")


Number of unique ITEM_ID values: 2465.

Number of ITEM_ID values to remove: 1232.

Removing the following ITEM_ID values... : ['6579c22f-be2b-444c-a52b-0116dd82df6c'
 '2e852905-c6f4-47db-802c-654013571922'
 '4ec7ff5c-f70f-4984-b6c4-c7ef37cc0c09' ...
 '575c0ac0-5494-4c64-a886-a9c0cf8b779a'
 '7000f6e7-41f7-4957-878a-ccc42a39ca59'
 '9c1a2048-7aac-4565-b836-d8d4f726322c'].

Copying trimmed Items records to items_trimmed.csv...

File 'items_trimmed.csv' has been uploaded to S3 bucket 'personalize-update-items-dataset-example-85968' with key 'items_trimmed.csv'.

 Trimmed Items dataset preview:-----------
                                 ITEM_ID   PRICE  CATEGORY_L1 CATEGORY_L2  \
0   6579c22f-be2b-444c-a52b-0116dd82df6c   90.99  accessories    backpack   
4   b5649d7c-4651-458d-a07f-912f253784ce  141.99  accessories    backpack   
5   296d144e-7f86-464b-9c5a-f545257f1700  144.99  accessories    backpack   
12  0c47dade-1ec0-483a-9ab4-1b87604bdaf8  106.99  accessories    backpack   
13  f995

#### Trim the Interactions dataset, upload it to S3, and preview it

In [9]:
# Trim Interactions dataset:

interactions_trimmed_filename = 'interactions_trimmed.csv'

# Create a new DataFrame without rows where "ITEM_ID" is in selected_item_ids
interactions_trimmed_df = interactions_df[~interactions_df['ITEM_ID'].isin(selected_item_ids)]

# Write the trimmed interactions data to a new CSV file
interactions_trimmed_df.to_csv(interactions_trimmed_filename, index=False)

print(f"Removed rows with 'ITEM_ID' matching selected_item_ids and wrote the filtered data to {interactions_trimmed_filename}.\n")

# Upload the modified CSV file to an S3 bucket
s3.upload_file(interactions_trimmed_filename, bucket_name, interactions_trimmed_filename)

print(f"File '{interactions_trimmed_filename}' has been uploaded to S3 bucket '{bucket_name}' with key '{interactions_trimmed_filename}'.")

# Validate: Compare the number of rows in interactions.csv & interactions_trimmed.csv. 
# The latter should have fewer rows (approximately half) than the original interactions csv.
print(f"\nNumber of rows in {interactions_filename}: {len(interactions_df)} ")
print(f"Number of rows in {interactions_trimmed_filename}: {len(interactions_trimmed_df)} ")

print('\n Trimmed Interactions dataset preview:-----------')
print(interactions_trimmed_df.head())  # Inspect the first few rows of the DataFrame



Removed rows with 'ITEM_ID' matching selected_item_ids and wrote the filtered data to interactions_trimmed.csv.

File 'interactions_trimmed.csv' has been uploaded to S3 bucket 'personalize-update-items-dataset-example-85968' with key 'interactions_trimmed.csv'.

Number of rows in interactions.csv: 675004 
Number of rows in interactions_trimmed.csv: 328442 

 Trimmed Interactions dataset preview:-----------
                                 ITEM_ID USER_ID EVENT_TYPE   TIMESTAMP  \
2   3946f4c8-1b5b-4161-b794-70b33affb671    2122       View  1690552959   
3   3946f4c8-1b5b-4161-b794-70b33affb671    2122       View  1690552969   
4   e9daa7cd-8230-4544-9f07-86fa84d7c3c1    2485       View  1690552979   
5   e9daa7cd-8230-4544-9f07-86fa84d7c3c1    2485       View  1690552994   
12  e7af1dbd-4ab2-4201-b70f-1a52e4ea9250     810       View  1690553072   

   DISCOUNT  
2        No  
3        No  
4        No  
5        No  
12       No  


## Chapter 2 Summary - What have we accomplished?

In this chapter, we fetched pre-prepared sample datasets for each dataset type (items, users, and interactions) and uploaded them to the Amazon S3 bucket for later use.

We then inspected the three dataset types that will be used to train models and create custom solutions in Amazon Personalize.

Finally, we trimmed the items and interactions dataset so simulate a baseline dataset (In the latter half of this notebook, we will use the trimmed out items and interactions as our *new* data).

In the next chapter we will start creating resources in Amazon Personalize to receive our dataset files.

# Chapter 3: Create Schemas and Import Datasets into Amazon Personalize


In this chapter we are going to create an Amazon Personalize dataset group and import our three datasets into Amazon Personalize.

## Chapter 3 Objectives

In this chapter we will accomplish the following steps. This chapter should take about 15 minutes to complete.

- Create schema resources in Amazon Personalize that define the layout of our three dataset files (CSVs) created in the prior chapter
- Create a dataset group in Amazon Personalize that will be used to receive our datasets
- Create a dataset in the Personalize dataset group for the three dataset types and schemas
    - Items: information about the products in the Retail Demo Store
    - Users: information about the users in the Retail Deme Store
    - Interactions: user-item interactions representing typical storefront behavior such as viewing products, adding products to a shopping cart, purchasing products, and so on
- Create dataset import jobs to import each of the three datasets into Personalize

Note: We will be using the trimmed versions of the items and interactions datasets here.

## Configure Amazon Personalize

Now that we've prepared our three datasets and uploaded them to S3 we'll need to configure the Amazon Personalize service to understand our data so that it can be used to train models for generating recommendations.

### Create Schemas for Datasets

Amazon Personalize requires a schema for each dataset so it can map the columns in our CSVs to fields for model training. Each schema is declared in JSON using the [Apache Avro](https://avro.apache.org/) format.

Let's define and create schemas in Personalize for our datasets.

Note that categorical fields include an additional attribute of `"categorical": true` and the textual field has an additional attribute of `"textual": true`. Categorical fields are those where one or more values can be specified for the field value (i.e. enumerated values). For example, one or more category names/codes for the `CATEGORY_L1` field. A textual field indicates that Personalize should apply a natural language processing (NLP) model to the field's value to extract model features from unstructured text. In this case, we're using the product description as the textual field. You can only have one textual field in the items dataset. Finally, you will notice that the `PROMOTED` field does _not_ have `categorical` or `textual` specified. In this case, the `PROMOTED` column will not be included as a feature in the model but can be used for filtering (out of scope of this notebook).

Another detail to note is that when we call the [CreateSchema](https://docs.aws.amazon.com/personalize/latest/dg/API_CreateSchema.html) API, we pass an optional `domain` parameter with a value of `ECOMMERCE`. This tells Personalize that we are creating a schema for Retail/E-commerce domain. We will do this for all three schemas.

#### Users Dataset Schema

In [10]:
users_schema = {
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "AGE",
            "type": "int"
        },
        {
            "name": "GENDER",
            "type": "string",
            "categorical": True,
        }
    ],
    "version": "1.0"
}

try:
    users_schema_name = 'retaildemostore-products-users-schema-'+token
    create_schema_response = personalize.create_schema(
        name = users_schema_name,
        domain = "ECOMMERCE",
        schema = json.dumps(users_schema)
    )
    print(json.dumps(create_schema_response, indent=2))
    users_schema_arn = create_schema_response['schemaArn']
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this schema, seemingly')
    paginator = personalize.get_paginator('list_schemas')
    for paginate_result in paginator.paginate():
        for schema in paginate_result['schemas']:
            if schema['name'] == users_schema_name:
                users_schema_arn = schema['schemaArn']
                print(f"Using existing schema: {users_schema_arn}")
                break

{
  "schemaArn": "arn:aws:personalize:us-east-1:402114309305:schema/retaildemostore-products-users-schema-85968",
  "ResponseMetadata": {
    "RequestId": "816766fd-6e8b-456c-88c8-1659794d109c",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 15:34:53 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "109",
      "connection": "keep-alive",
      "x-amzn-requestid": "816766fd-6e8b-456c-88c8-1659794d109c"
    },
    "RetryAttempts": 0
  }
}


#### Items Datsaset Schema

In [11]:
items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "PRICE",
            "type": "float"
        },
        {
            "name": "CATEGORY_L1",
            "type": "string",
            "categorical": True,
        },
    ],
    "version": "1.0"
}

try:
    items_schema_name = 'retaildemostore-products-items-schema-'+token
    create_schema_response = personalize.create_schema(
        name = items_schema_name,
        domain = 'ECOMMERCE',
        schema = json.dumps(items_schema)
    )
    items_schema_arn = create_schema_response['schemaArn']
    print(json.dumps(create_schema_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this schema, seemingly')
    paginator = personalize.get_paginator('list_schemas')
    for paginate_result in paginator.paginate():
        for schema in paginate_result['schemas']:
            if schema['name'] == items_schema_name:
                items_schema_arn = schema['schemaArn']
                print(f"Using existing schema: {items_schema_arn}")
                break

{
  "schemaArn": "arn:aws:personalize:us-east-1:402114309305:schema/retaildemostore-products-items-schema-85968",
  "ResponseMetadata": {
    "RequestId": "92f02ed4-41d3-4d6c-9970-d402353f0feb",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 15:34:53 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "109",
      "connection": "keep-alive",
      "x-amzn-requestid": "92f02ed4-41d3-4d6c-9970-d402353f0feb"
    },
    "RetryAttempts": 0
  }
}


#### Interactions Dataset Schema

In [12]:
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "EVENT_TYPE",  # "View", "Purchase", etc.
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        },
        {
            "name": "DISCOUNT",  # This is the contextual metadata - "Yes" or "No".
            "type": "string"
        },
    ],
    "version": "1.0"
}

try:
    interactions_schema_name = 'retaildemostore-products-interactions-schema-'+token
    create_schema_response = personalize.create_schema(
        name = interactions_schema_name,
        domain = "ECOMMERCE",
        schema = json.dumps(interactions_schema)
    )
    print(json.dumps(create_schema_response, indent=2))
    interactions_schema_arn = create_schema_response['schemaArn']
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this schema, seemingly')
    paginator = personalize.get_paginator('list_schemas')
    for paginate_result in paginator.paginate():
        for schema in paginate_result['schemas']:
            if schema['name'] == interactions_schema_name:
                interactions_schema_arn = schema['schemaArn']
                print(f"Using existing schema: {interactions_schema_arn}")
                break

{
  "schemaArn": "arn:aws:personalize:us-east-1:402114309305:schema/retaildemostore-products-interactions-schema-85968",
  "ResponseMetadata": {
    "RequestId": "3f22e3fd-704e-4be9-95a7-c4c7e4049a1e",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 15:34:53 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "116",
      "connection": "keep-alive",
      "x-amzn-requestid": "3f22e3fd-704e-4be9-95a7-c4c7e4049a1e"
    },
    "RetryAttempts": 0
  }
}


### Create and Wait for Dataset Group

Next we need to create the dataset group that will contain our three datasets. This is one of many Personalize operations that are asynchronous. That is, we call an API to create a resource and have to wait for it to become active.

#### Create Dataset Group

Note that we are also passing `ECOMMERCE` for the `domain` parameter here too.

In [13]:
try:
    dataset_group_name = 'retaildemostore-products-DSG-'+token
    create_dataset_group_response = personalize.create_dataset_group(
        name = dataset_group_name,
        domain = 'ECOMMERCE'
    )
    dataset_group_arn = create_dataset_group_response['datasetGroupArn']
    print(json.dumps(create_dataset_group_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset group, seemingly')
    paginator = personalize.get_paginator('list_dataset_groups')
    for paginate_result in paginator.paginate():
        for dataset_group in paginate_result['datasetGroups']:
            if dataset_group['name'] == dataset_group_name:
                dataset_group_arn = dataset_group['datasetGroupArn']
                break
                
print(f'DatasetGroupArn = {dataset_group_arn}')

{
  "datasetGroupArn": "arn:aws:personalize:us-east-1:402114309305:dataset-group/retaildemostore-products-DSG-85968",
  "domain": "ECOMMERCE",
  "ResponseMetadata": {
    "RequestId": "6dd03be3-95fb-45a4-995e-84472dd344e8",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 15:34:53 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "134",
      "connection": "keep-alive",
      "x-amzn-requestid": "6dd03be3-95fb-45a4-995e-84472dd344e8"
    },
    "RetryAttempts": 0
  }
}
DatasetGroupArn = arn:aws:personalize:us-east-1:402114309305:dataset-group/retaildemostore-products-DSG-85968


#### Wait for Dataset Group to Have ACTIVE Status
This should take about a minute.

In [14]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(15)

DatasetGroup: CREATE PENDING
DatasetGroup: CREATE PENDING
DatasetGroup: ACTIVE


### Define the three Datasets in Personalize
Next we will create the datasets in Personalize for our three dataset types.

#### Create Users Dataset

In [15]:
try:
    dataset_type = "USERS"
    users_dataset_name = "retaildemostore-products-users-ds-"+token
    create_dataset_response = personalize.create_dataset(
        name = users_dataset_name,
        datasetType = dataset_type,
        datasetGroupArn = dataset_group_arn,
        schemaArn = users_schema_arn
    )

    users_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset, seemingly')
    paginator = personalize.get_paginator('list_datasets')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for dataset in paginate_result['datasets']:
            if dataset['name'] == users_dataset_name:
                users_dataset_arn = dataset['datasetArn']
                break
                
print(f'Users dataset ARN = {users_dataset_arn}')

{
  "datasetArn": "arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85968/USERS",
  "ResponseMetadata": {
    "RequestId": "c43ee3ba-cd86-4120-9001-b7e66d4e69da",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 15:35:23 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "108",
      "connection": "keep-alive",
      "x-amzn-requestid": "c43ee3ba-cd86-4120-9001-b7e66d4e69da"
    },
    "RetryAttempts": 0
  }
}
Users dataset ARN = arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85968/USERS


#### Create Items Dataset

In [16]:
try:
    dataset_type = "ITEMS"
    items_dataset_name = "retaildemostore-products-items-ds-"+token
    create_dataset_response = personalize.create_dataset(
        name = items_dataset_name,
        datasetType = dataset_type,
        datasetGroupArn = dataset_group_arn,
        schemaArn = items_schema_arn
    )

    items_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset, seemingly')
    paginator = personalize.get_paginator('list_datasets')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for dataset in paginate_result['datasets']:
            if dataset['name'] == items_dataset_name:
                items_dataset_arn = dataset['datasetArn']
                break
                
print(f'Items dataset ARN = {items_dataset_arn}')

{
  "datasetArn": "arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85968/ITEMS",
  "ResponseMetadata": {
    "RequestId": "bb3cead9-afd2-447a-a4b5-3424fbf221f8",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 15:35:23 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "108",
      "connection": "keep-alive",
      "x-amzn-requestid": "bb3cead9-afd2-447a-a4b5-3424fbf221f8"
    },
    "RetryAttempts": 0
  }
}
Items dataset ARN = arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85968/ITEMS


#### Create Interactions Dataset

In [17]:
try:
    dataset_type = "INTERACTIONS"
    interactions_dataset_name = "retaildemostore-products-interactions-ds-"+token
    create_dataset_response = personalize.create_dataset(
        name = interactions_dataset_name,
        datasetType = dataset_type,
        datasetGroupArn = dataset_group_arn,
        schemaArn = interactions_schema_arn
    )

    interactions_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset, seemingly')
    paginator = personalize.get_paginator('list_datasets')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for dataset in paginate_result['datasets']:
            if dataset['name'] == interactions_dataset_name:
                interactions_dataset_arn = dataset['datasetArn']
                break
                
print(f'Interactions dataset ARN = {interactions_dataset_arn}')

{
  "datasetArn": "arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85968/INTERACTIONS",
  "ResponseMetadata": {
    "RequestId": "1b23cf9b-508b-406b-8151-547a89892fff",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 15:35:23 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "115",
      "connection": "keep-alive",
      "x-amzn-requestid": "1b23cf9b-508b-406b-8151-547a89892fff"
    },
    "RetryAttempts": 0
  }
}
Interactions dataset ARN = arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85968/INTERACTIONS


### Wait for datasets to become active

It can take a minute for the datasets to be created. Let's wait for all three to become active.

In [18]:
%%time

dataset_arns = [ items_dataset_arn, users_dataset_arn, interactions_dataset_arn ]

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    for dataset_arn in reversed(dataset_arns):
        response = personalize.describe_dataset(
            datasetArn = dataset_arn
        )
        status = response["dataset"]["status"]

        if status == "ACTIVE":
            print(f'Dataset {dataset_arn} successfully completed')
            dataset_arns.remove(dataset_arn)
        elif status == "CREATE FAILED":
            print(f'Dataset {dataset_arn} failed')
            if response['dataset'].get('failureReason'):
                print('   Reason: ' + response['dataset']['failureReason'])
            dataset_arns.remove(dataset_arn)

    if len(dataset_arns) > 0:
        print('At least one dataset is still in progress')
        time.sleep(15)
    else:
        print("All datasets have completed")
        break

At least one dataset is still in progress
At least one dataset is still in progress
Dataset arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85968/INTERACTIONS successfully completed
Dataset arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85968/USERS successfully completed
Dataset arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85968/ITEMS successfully completed
All datasets have completed
CPU times: user 25.3 ms, sys: 1.32 ms, total: 26.6 ms
Wall time: 30.3 s


## Import Datasets to Personalize

So far in this chapter we have created schemas in Personalize that define the columns in our CSVs. Then we created a dataset group and three datasets in Personalize that will receive our data. In the following steps we will create import jobs with Personalize that will import the datasets from our S3 bucket into our dataset group. 


### Create Import Jobs

With the permissions in place to allow Personalize to access our CSV files, let's create three import jobs to import each file into its respective dataset. Each import job can take roughly 10 minutes to complete so we'll start the import jobs and then wait for them all to complete. This allows these import jobs to run in parallel.

#### Create Users Dataset Import Job

In [19]:
import_job_suffix = str(uuid.uuid4())[:8]

users_create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "retaildemostore-products-users-" + import_job_suffix,
    datasetArn = users_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, users_filename)
    },
    roleArn = role_arn
)

users_dataset_import_job_arn = users_create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(users_create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:402114309305:dataset-import-job/retaildemostore-products-users-899a5333",
  "ResponseMetadata": {
    "RequestId": "a2a18d02-7368-491d-ae52-55d4e62c763f",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 15:35:54 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "127",
      "connection": "keep-alive",
      "x-amzn-requestid": "a2a18d02-7368-491d-ae52-55d4e62c763f"
    },
    "RetryAttempts": 0
  }
}


#### Create Items Dataset Import Job

In [20]:

items_create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "retaildemostore-products-items-" + import_job_suffix,
    datasetArn = items_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, items_trimmed_filename)
    },
    roleArn = role_arn
)

items_dataset_import_job_arn = items_create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(items_create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:402114309305:dataset-import-job/retaildemostore-products-items-899a5333",
  "ResponseMetadata": {
    "RequestId": "b04addd2-87ca-449d-adb5-510676e24038",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 15:35:54 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "127",
      "connection": "keep-alive",
      "x-amzn-requestid": "b04addd2-87ca-449d-adb5-510676e24038"
    },
    "RetryAttempts": 0
  }
}


#### Create Interactions Dataset Import Job

In [21]:
interactions_create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "retaildemostore-products-interactions-" + import_job_suffix,
    datasetArn = interactions_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, interactions_trimmed_filename)
    },
    roleArn = role_arn
)

interactions_dataset_import_job_arn = interactions_create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(interactions_create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:402114309305:dataset-import-job/retaildemostore-products-interactions-899a5333",
  "ResponseMetadata": {
    "RequestId": "ab7a2040-9748-4624-a814-7c56657b6ccf",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 15:35:54 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "134",
      "connection": "keep-alive",
      "x-amzn-requestid": "ab7a2040-9748-4624-a814-7c56657b6ccf"
    },
    "RetryAttempts": 0
  }
}


### Wait for Import Jobs to Complete

It can take up to 10 minutes for the import jobs to complete, while you're waiting you can learn more about Datasets and Schemas here: https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html

We will wait for all three import jobs to finish.

#### Wait for Items Import Job to Complete

In [22]:
%%time

import_job_arns = [ users_dataset_import_job_arn, items_dataset_import_job_arn, interactions_dataset_import_job_arn ]

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    for job_arn in reversed(import_job_arns):
        import_job_response = personalize.describe_dataset_import_job(
            datasetImportJobArn = job_arn
        )
        status = import_job_response["datasetImportJob"]['status']

        if status == "ACTIVE":
            print(f'Import job {job_arn} successfully completed')
            import_job_arns.remove(job_arn)
        elif status == "CREATE FAILED":
            print(f'Import job {job_arn} failed')
            if import_job_response["datasetImportJob"].get('failureReason'):
                print('   Reason: ' + import_job_response["datasetImportJob"]['failureReason'])
            import_job_arns.remove(job_arn)

    if len(import_job_arns) > 0:
        print('At least one dataset import job still in progress')
        time.sleep(60)
    else:
        print("All import jobs have ended")
        break

At least one dataset import job still in progress
At least one dataset import job still in progress
At least one dataset import job still in progress
At least one dataset import job still in progress
Import job arn:aws:personalize:us-east-1:402114309305:dataset-import-job/retaildemostore-products-interactions-899a5333 successfully completed
At least one dataset import job still in progress
At least one dataset import job still in progress
Import job arn:aws:personalize:us-east-1:402114309305:dataset-import-job/retaildemostore-products-items-899a5333 successfully completed
Import job arn:aws:personalize:us-east-1:402114309305:dataset-import-job/retaildemostore-products-users-899a5333 successfully completed
All import jobs have ended
CPU times: user 183 ms, sys: 4.76 ms, total: 188 ms
Wall time: 6min


## Chapter 3 Summary - What have we accomplished?

In this chapter we created schemas in Amazon Personalize that mapped to the dataset CSVs we introduced in chapter 2. We also created a dataset group in Personalize as well as Datasets to represent our CSVs. Finally, we created dataset import jobs in Personalize to load the three datasets into Personalize.

In the next chapter we will create the a custom solution and train a solution version. This is where the machine learning models are trained and deployed.

# Chapter 4: Create a custom solution in Amazon Personalize

In this chapter we are going to create a Solution in Amazon Personalize. A Solution consists of a Personalize Recipe (an algorithm), parameters, and all of its Solution Versions (ie: trained models). 

## Chapter 4 Objectives

In this chapter we will accomplish the following steps.

- Create custom solution and solution version for the following use case:
    - **User Personalization**: An item recommendation model that recommends specific items to your users.

This portion should take about 60 minutes to complete. However, most of the time will be waiting for model training job to complete.

## Create a Custom Solution

With our three datasets imported into our dataset group, we can now turn to creating solutions. 

We simply need to create a solution and solution version using the user-personalization recipe.

#### Overview of the recipe

[User-Personalization:](https://docs.aws.amazon.com/personalize/latest/dg/native-recipe-new-item-USER_PERSONALIZATION.html)

> The User-Personalization (aws-user-personalization) recipe is optimized for all personalized recommendation scenarios. It predicts the items that a user will interact with based on Interactions, Items, and Users datasets. When recommending items, it uses automatic item exploration.

> With exploration, recommendations include some items that would be typically less likely to be recommended for the user, such as new items, items with few interactions, or items less relevant for the user based on their previous behavior. This improves item discovery and engagement when you have a fast-changing catalog, or when new items, such as news articles or promotions, are more relevant to users when fresh.



Note: This notebook only uses one recipe, however there many more than that available. If you are interested, you can visit the official documentation to read more about all the [predefined recipes](https://docs.aws.amazon.com/personalize/latest/dg/working-with-predefined-recipes.html) Amazon Personalize has to offer.


In [23]:
# Item Recommendation Recipe:
user_personalization_recipe_arn = 'arn:aws:personalize:::recipe/aws-user-personalization'

### Create Custom Solution and Solution Version

With our recipe defined, we can now create our solution and solution version.

The code cell below creates a solution using the user-personalization recipe and our dataset group that we created in the previous chapter.

In [24]:
user_personalization_solution_name = "retaildemostore-user-personalization-solution-"+token
user_personalization_solution_arn = None
user_personalization_solution_version_arn = None

try:
    create_solution_response = personalize.create_solution(
        name = user_personalization_solution_name,
        datasetGroupArn = dataset_group_arn,
        recipeArn = user_personalization_recipe_arn
    )

    user_personalization_solution_arn = create_solution_response['solutionArn']
    print(json.dumps(create_solution_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You aready created this solution, seemingly')
    paginator = personalize.get_paginator('list_solutions')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for solution in paginate_result['solutions']:
            if solution['name'] == user_personalization_solution_name:
                user_personalization_solution_arn = solution['solutionArn']
                print(f'Ranking solution ARN = {user_personalization_solution_arn}')
                
                response = personalize.list_solution_versions(
                    solutionArn = user_personalization_solution_arn,
                    maxResults = 100
                )
                if len(response['solutionVersions']) > 0:
                    user_personalization_solution_version_arn = response['solutionVersions'][-1]['solutionVersionArn']
                    print(f'Will use most recent solution version for this solution: {user_personalization_solution_version_arn}')
                    
                break

{
  "solutionArn": "arn:aws:personalize:us-east-1:402114309305:solution/retaildemostore-user-personalization-solution-85968",
  "ResponseMetadata": {
    "RequestId": "dcfe7878-eb97-42f7-a350-b1328d15ba56",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 15:41:56 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "121",
      "connection": "keep-alive",
      "x-amzn-requestid": "dcfe7878-eb97-42f7-a350-b1328d15ba56"
    },
    "RetryAttempts": 0
  }
}


#### Create User Personalization Solution Version

Next we can create a solution version for the solution. This is where the model is trained for this custom solution.

In [25]:
if not user_personalization_solution_version_arn:
    create_solution_version_response = personalize.create_solution_version(
        solutionArn = user_personalization_solution_arn
    )

    user_personalization_solution_version_arn = create_solution_version_response['solutionVersionArn']
    print(json.dumps(create_solution_version_response, indent=2))
else:
    print(f'Solution version {user_personalization_solution_version_arn} already exists; not creating')

{
  "solutionVersionArn": "arn:aws:personalize:us-east-1:402114309305:solution/retaildemostore-user-personalization-solution-85968/83cc95fc",
  "ResponseMetadata": {
    "RequestId": "2a3a328d-5b88-49e7-a7dc-27628f323951",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 15:41:56 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "137",
      "connection": "keep-alive",
      "x-amzn-requestid": "2a3a328d-5b88-49e7-a7dc-27628f323951"
    },
    "RetryAttempts": 0
  }
}


## Wait for Solution Versions to Complete

It can take roughly 40 minutes for the solution version to be created. During this process a model is being trained and tested with the data contained within your datasets. The duration of training jobs can increase based on the size of the dataset, training parameters and a selected recipe. In the cell below we will wait for the solution version to finish.

While you are waiting for this process to complete you can learn more about [custom solutions](https://docs.aws.amazon.com/personalize/latest/dg/training-deploying-solutions.html).


#### Wait for the custom solution version to become active

The following cell waits for the solution version for the user personalization use case to become active. We *need* to make sure it is active before proceeding to the next Chapter.

In [26]:
%%time

user_personalization_solution_version_arn

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    soln_ver_response = personalize.describe_solution_version(
        solutionVersionArn = user_personalization_solution_version_arn
    )
    status = soln_ver_response["solutionVersion"]["status"]

    if status == "ACTIVE":
        print(f'Solution version {user_personalization_solution_version_arn} successfully completed')
        print(json.dumps(soln_ver_response, indent=2, default=str))
        print("Solution version has completed")
        break
    elif status == "CREATE FAILED":
        print(f'Solution version {user_personalization_solution_version_arn} failed')
        if soln_ver_response["solutionVersion"].get('failureReason'):
            print('   Reason: ' + soln_ver_response["solutionVersion"]['failureReason'])
        break
    else:
        print('Solution version is still in progress')
        time.sleep(60)    

Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version is still in progress
Solution version arn:aws:personalize:us-east-1:402114309305:solution/retaildemostore-user-personalization-solution-85968/83cc95fc successfully completed
{
  "solutionVersion": {
    "name": "retaildemos

## Chapter 4 Summary - What have we accomplished?

In this chapter we created a solution using the user-personalization recipe. We also created (or trained) a solution version.

In the next chapter we will perform a batch job on the newly-trained solution version.

# Chapter 5: Run A Batch Segmentation Job on your Solution Version

In this chapter, we will prepare and execute a batch job for the solution version that we created previously. The purpose of doing this is to obtain a baseline that we can use to compare the to the output of the solution version that uses the updated datasets later on in this notebook.

The batch job for the user personalization solution will return a group of items for each of the inputted users. These items will represent the items that the user is most likely to purchase.

We will wait for the job to finish executing. Afterwards, we'll inspect the outputs.

This chapter will take about 20 minutes.

Declare some file names that we will use throughout the rest of this notebook.
note: 'orig' is shorthard for 'original', 'au' is short for 'auto-updated', and 'fr' is short for 'fully retrained'.
Since are using the same input for all jobs, the contents of these files will be identical, 
but we need to create seperate copies of the input files due to the API behavior.

In [27]:
orig_job_input_filename = "orig_job_input.json"
au_job_input_filename = "au_job_input.json"
fr_job_input_filename = "fr_job_input.json"

### Set the size input and outputs for our job

If you want, you can set the size of the input job by changing the value of the variable 'x' in the following code cell.

A default value of 5 has been pre-populated for you. If the value of x equal 5, this means we will randomly select 5 `USER_ID` values as input for the user personalization job.

Similarly, you can set the size of the output by changing the value of the variable 'y'. A default value of 10 has been pre-populated for you. This means our model will return 10 `ITEM_ID` recommendations for each `USER_ID` input.

You can decrease or increase the values of 'x' and 'y' if you want. Just be aware that larger values for x and y means the inference job will take slightly longer to complete.


In [28]:
# TODO: Fill in the size of the input and output variables
x = 5
y = 10

## Prepare input file for batch inference job

Next we will prepare an input file for our batch inference job.

The input file for batch jobs against User Personalization Solutions requires a list of userIds. For each user, our solution version will return a list of personalized item recommendations for that user.
Along with each item recommendation, our model will also return a score that represents their liklihood of interacting with the item.

Below is a sample of the input file for a user personalization job that builds item recommendations for 3 users.

```javascript
{"userId": "4"}
{"userId": "5"}
{"userId": "6"}
```

For our job, we will *randomly* select *'x'* UserIDs that we want item recommendations for.

#### Here are the steps required to run this job:
    - Randomly select x user id's (used as inputs for our batch job)
    - Generate the input file
    - Upload it to S3
    - Create and start a Batch Inference Job using our Solution Version
    - Wait for the inference Job to complete
    - Download the inference job output from S3
    - Inspect the job output

The code cell below selects the users, generate the input file for the batch job, and uploads it to S3

In [29]:
# - Randomly select x users
users = users_df['USER_ID'].unique()
sample_users = numpy.random.choice(users, x, False)

print("Randomly selected users :")
print(sample_users)

# - Generate the input file
with open(orig_job_input_filename, 'w') as json_input:
    for user_id in sample_users:
        # Write line that specifies the specific user
        json_input.write(f'{{"userId": "{user_id}"}}\n')

# Confirm the file matches the required format:
print("\nPreviewing input file... ")
!head -n 5 $orig_job_input_filename
print('\n')

# Upload job input file to S3
s3_input_key_orig = "batch-job-input/" + orig_job_input_filename

s3.upload_file(orig_job_input_filename, bucket_name, s3_input_key_orig)
if s3_input_key_orig in [object['Key'] for object in s3.list_objects(Bucket=bucket_name)['Contents']]:
    print('File was uploaded successfully!')
else:
    print('File was not uploaded!')


Randomly selected users :
[5023 5866 4016 2755 2896]

Previewing input file... 
{"userId": "5023"}
{"userId": "5866"}
{"userId": "4016"}
{"userId": "2755"}
{"userId": "2896"}


File was uploaded successfully!


## Submit the batch inference job

Finally, we're ready to submit a batch inference job. There are several required parameters including a name for the job, the solution version ARN for the user-personalization model, the IAM role that Personalize needs to be able to access the job input file and write the output file, and the job input and output locations. These parameters are required inputs for all batch jobs in Amazon Personalize.

We're also optionally specifying that we only want *'y'* item recommendations per user.

The inference job can take several minutes to complete. Even though our input file only specifies a few input lines, there is a certain amount of fixed overhead required for Personalize to spin up the compute resources needed to execute the job. This overhead is amortized for larger input files that generate many item recommendations.


In [30]:
# Create and submit a Batch Segment Job using our latest Solution Version
# https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/personalize/client/create_batch_inference_job.html

# We need to define an input location in our s3 bucket where the segmention job gets its input, 
# and an output location where the segmentation job writes its output.
s3_input_path_orig = "s3://" + bucket_name + "/" + s3_input_key_orig
s3_output_path_orig = "s3://" + bucket_name + "/batch-job-outputs/original/"

response = personalize.create_batch_inference_job (
    solutionVersionArn = user_personalization_solution_version_arn,
    jobName = "retaildemostore-user-personalization-job-orig-" + token,
    roleArn = role_arn,
    jobInput = {"s3DataSource": {"path": s3_input_path_orig }},
    jobOutput = {"s3DataDestination":{"path": s3_output_path_orig }},
    numResults = y
)
user_personalization_job_arn = response['batchInferenceJobArn']
print(json.dumps(response, indent=2, default=str))

{
  "batchInferenceJobArn": "arn:aws:personalize:us-east-1:402114309305:batch-inference-job/retaildemostore-user-personalization-job-orig-85968",
  "ResponseMetadata": {
    "RequestId": "14a32661-ef8f-4fa9-9030-7f0cffae3982",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 16:02:59 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "141",
      "connection": "keep-alive",
      "x-amzn-requestid": "14a32661-ef8f-4fa9-9030-7f0cffae3982"
    },
    "RetryAttempts": 0
  }
}


#### Gather metrics of the solution version

Amazon Personalize provides [offline metrics](https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html#working-with-training-metrics-metrics) that allow you to evaluate the accuracy of the model before you deploy the it in your application. Metrics can also be used to view the effects of modifying a custom solution's hyperparameters or to compare the metrics between solution versions.

While we are waiting for the job to complete, let's quickly grab the metrics of our solution version.

We will use these metrics as a baseline to compare the metrics of the future solution versions to.


In [31]:
get_original_solution_metrics_response = personalize.get_solution_metrics(
    solutionVersionArn = user_personalization_solution_version_arn
)

print(json.dumps(get_original_solution_metrics_response['metrics'], indent=2))

{
  "coverage": 0.9968,
  "mean_reciprocal_rank_at_25": 0.8459,
  "normalized_discounted_cumulative_gain_at_10": 0.7873,
  "normalized_discounted_cumulative_gain_at_25": 0.8089,
  "normalized_discounted_cumulative_gain_at_5": 0.7661,
  "precision_at_10": 0.1194,
  "precision_at_25": 0.0541,
  "precision_at_5": 0.2189
}


### Wait for the Job to complete and inspect its output

Run the cell below. The cell below will wait for the user personalization batch inference job to finish, download the output, and display its first few lines.

In [32]:
%%time
current_time = datetime.now()
print("\nInference Job Started on: ", current_time.strftime("%I:%M:%S %p"))

max_time = time.time() + 3*60*60 # 3 hours

while time.time() < max_time:
    response = personalize.describe_batch_inference_job(
        batchInferenceJobArn = user_personalization_job_arn
    )
    status = response["batchInferenceJob"]['status']
    print("DatasetInferenceJob: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)
    
current_time = datetime.now()
print("Inference Job Completed on: ", current_time.strftime("%I:%M:%S %p"))

# - Download the Inference job output from S3
job_output_file_orig = orig_job_input_filename + ".out"
export_name_orig = 'batch-job-outputs/original/' + job_output_file_orig
s3.download_file(bucket_name, export_name_orig, job_output_file_orig)

# - Inspect the Inference Job
!head -n 5 $job_output_file_orig


Inference Job Started on:  04:02:59 PM
DatasetInferenceJob: CREATE PENDING
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: ACTIVE
Inference Job Completed on:  04:16:01 PM
{"input":{"userId":"4016"},"output":{"recommendedItems":["ccdf737c-c4fd-4c78-abd2-d5ef0428ef20","425cc876-3935-4e87-ad8d-77f42b0b6a75","6be08307-1ec0-44dc-b436-5d489a8010e8","2c1b34d6-0f3d-463d-be76-226cb87bdc6d","89c4eeb4-c146-4434-a9f1-6943b4b552dc","5a94b7d5-b210-44b3-9287-c8b0b5488a15","61b1ad14-4e70-4029-ba55-d17bbf4ab62b","8f8f015a-4166-4e9e-ac0b-6d980614ca5d","eecbe

Notice that the input userId is echoed in the output file but we also have `output` and `error` elements for each inference. The `output` element has a `recommendedItems` array that contains the Item IDs for the inference. If there were any errors enountered while generating an inference, details will be included in the `error` element.

### Chapter 5 Summary:

In this chapter, we ran a batch inference job for our original solution version.

We waited for the inference job to finish and then inspected its outputs. We also gathered the metrics of the solution version for future comparison.

#### Potential Next Steps (out of scope of this notebook)
Now that we have these item recommendations created, what can we do with them? The most obvious choice is to use these outputs in outbound marketing tools! 

For example, you can send marketing emails to users containing information about the items they are most likely to purchase (You would use the user personalization model for this). This might be useful if you are seeking to maximize sales or number of purchases. 


## Chapter 5 complete

Congratulations! You have completed the batch segmentation portion of the notebook!

## Chapter 6: Overview of Steps Involved for Updating Datasets for User-Personalization Solutions

Suppose we want to update the data in our datasets. This could mean adding new items or capturing new interactions. Fortunately, Amazon Personalize supports adding new rows to your datasets, whether its new Items, new interactions, or new users!

In fact, for user-personalization-backed solutions, [Amazon Personalize manages some aspects of dataset updates for you](https://docs.aws.amazon.com/personalize/latest/dg/native-recipe-new-item-USER_PERSONALIZATION.html#automatic-updates)!

>  For batch item recommendations, Amazon Personalize *updates* the solution version you specify in the batch inference job when the solution version is the latest for your solution.

> With each update, Amazon Personalize *updates* the solution version to consider any *new items* through *exploration*. And it uses any new *interactions* data, including impressions data, to determine *what items to include or not include in exploration*. This is *not* a full retraining; you should still train a new solution version weekly with trainingMode set to FULL so the model can learn from your users' behavior and any item metadata.

> There is no cost for automatic updates.

(italics added above for emphasis)

This chapter will guide you through the process of updating your datasets. 

This entire chapter will take about 100 minutes, but most of that time will be spent waiting for data import jobs and solution version training.


Note: For demonstration purposes, this notebook only reviews updates to items dataset and interactions dataset. Adding new users is a supported feature in Amazon Personalize, but isn't shown in this notebook. For more information how adding new users, refer to [this aws documentation](https://docs.aws.amazon.com/personalize/latest/dg/getting-batch-recommendations.html).

### Here is a high-level overview of this chapter.
Due to its length, this chapter is sub-divided into steps. 

- **Chapter 6 Step 1**: First, we will fetch, modify, and inspect the updated the Items & Interactions Datasets 

- **Chapter 6 Step 2**: Next, we will update the Items & Interactions Datasets (via the the CreateDatasetImportJob API)

- **Chapter 6 Step 3**: Then, we will re-run the Batch Inference Job on our Solution Version & inpect the updated output. 
    - When we submit the batch job, our original solution version will automatically update. It will now be able to recommend those new items as 'cold' items. We will inspect the output of the original solution version (Output A) and the output of the auto-updated solution version (Output B). We should expect some slight differences between the outputs.

- **Chapter 6 Step 4**: Gather metrics of the auto-updated solution version. 
    - We will fetch the metrics of the solution version to see if they have changed. Since the underlying solution version was only updated (but not retrained), we should expect the metrics to remain the same.

- **Chapter 6 Step 5**: Fully retrain the solution versoin
    - If we want our new Items and Interactions data to have more influence in our solution version, then we will need to *fully retrain* the solution version. The full retraining will allow our solution version to use the new items and new interactions at part of its *meaningful* recommendations for inference jobs.

- **Chapter 6 Step 6**: Run the Batch Inference Job on the fully re-trained Solution Version & Inspect its output (Output C).

- **Chapter 6 Step 7**: Gather metrics of the solution version. We will use this in the analysis portion of the notebook.

Afterwards, in Chapter 7, we will compare the outputs and metrics of the original, auto-updated, and fully-retrained solution versions.

#### Step 1: First, we will fetch, modify, and inspect the updated the Items & Interactions Datasets

To make the before-and-after comparisons easier for us, we will slightly modify the original items and interactions datasets. All of the *new* ITEM_ID values that we will import into Amazon Personalize will contain the string "new_" as a prefix. This way, you will be able to see whether a specific item recommendation is a *new* item or an *original* item. 

Run the code cell below. Due to the amount of data processing, it should take about a minute to run.

In [33]:
# Helper function: Adds the prefix "new_" to new ITEM_ID values.
# A new ITEM_ID is an ITEM_ID value that was removed from the original dataset.
# Recall how we kept these in the selected_item_ids list.
def add_prefix(item_id):
    if item_id in selected_item_ids:
        return 'new_' + item_id
    else:
        return item_id

    
# Modify & Inspect Items Datasets
items_df['ITEM_ID'] = items_df['ITEM_ID'].apply(add_prefix) # Apply the function to the 'ITEM_ID' column
items_df.to_csv('items.csv', index=False) # Save dataframe as csv
print("Previewing the updated Items dataset:")
print(items_df.head(10)) # Inspect the updated dataset. Notice how some of the ITEM_ID values begin with "new_". 
# These are the new ITEM_ID values that we will import into Amazon Personalize when we update the datasets in the next step.


# Modify & Inspect Interactions Datasets
interactions_df['ITEM_ID'] = interactions_df['ITEM_ID'].apply(add_prefix)
interactions_df.to_csv('interactions.csv', index=False)
print("\n\nPreviewing the updated Interactions dataset:")
print(interactions_df[['ITEM_ID', 'USER_ID', 'EVENT_TYPE','TIMESTAMP']].head(10)) # Display only the most important rows for visual purposes
# !head -n 10 $interactions_filename # If you want to view the file itself

# Upload items.csv & interactions.csv to S3
s3.upload_file(items_filename, bucket_name, items_filename)
s3.upload_file(interactions_filename, bucket_name, interactions_filename)


Previewing the updated Items dataset:
                                    ITEM_ID   PRICE  CATEGORY_L1 CATEGORY_L2  \
0      6579c22f-be2b-444c-a52b-0116dd82df6c   90.99  accessories    backpack   
1  new_2e852905-c6f4-47db-802c-654013571922  123.99  accessories    backpack   
2  new_4ec7ff5c-f70f-4984-b6c4-c7ef37cc0c09   87.99  accessories    backpack   
3  new_7977f680-2cf7-457d-8f4d-afa0aa168cb9  125.99  accessories    backpack   
4      b5649d7c-4651-458d-a07f-912f253784ce  141.99  accessories    backpack   
5      296d144e-7f86-464b-9c5a-f545257f1700  144.99  accessories    backpack   
6  new_7d3e7f5b-8ac8-49a9-a960-8a24773a8280  133.99  accessories    backpack   
7  new_1d3ae532-f790-44ca-a8e8-f55aa9b66526   75.99  accessories    backpack   
8  new_f6cd5dd2-d3ea-4858-844a-04879153e459   95.99  accessories    backpack   
9  new_3491deff-c0fe-4065-abbc-72b507da84b2   80.99  accessories    backpack   

                                 PRODUCT_DESCRIPTION GENDER PROMOTED  
0         

Notice how some of the ITEM_ID values begin with the string "new_".
These represent the new items that we will be adding to our dataset.
The ITEM_ID values that do *not* begin with "new_" are old items from the original datasets.

Likewise, we updated the interactions dataset such that the ITEM_IDs of all interactions associated with new ITEM_ID values also begin with "new_".

#### Step 2: Update the new Items and interactions data

We will simulate an update of the items dataset and interactions dataset up importing the complete version of the datasets. Recall that we stored a copy of these datasets in S3, and that they contain *all* of the items and interactions. 


In [34]:
# Step 2: Update the new Items and Interactions Satasets

# You can use the CreateDatasetImportJob API call to import the data.
# https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/personalize/client/create_dataset_import_job.html

# Import the complete items dataset
updated_items_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "updated_items_dataset_import_job"+token,
    datasetArn = items_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, items_filename) # Pass in the new dataset. See note below
    },
    roleArn = role_arn,
    importMode = "FULL" # importMode='FULL' is used here because we are re-importing the entire dataset. Though if we had imported *just* the new data, we could have set ImportMode to 'INCREMENTAL'.
)
# Note: notice how we are passing the orginal items dataset (the dataset that contains all the items)

updated_items_dataset_import_job_arn = updated_items_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(updated_items_dataset_import_job_response, indent=2))


# Import the complete interactions dataset
updated_interactions_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "updated_interactions_dataset_import_job"+token,
    datasetArn = interactions_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, interactions_filename) # Pass in the csv that contains all of the interactions.
    },
    roleArn = role_arn,
    importMode = "FULL"
)

updated_interactions_dataset_import_job_arn = updated_items_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(updated_interactions_dataset_import_job_response, indent=2))


import_job_arns = [updated_items_dataset_import_job_arn, updated_interactions_dataset_import_job_arn]

# Wait for the data to finish importing. It can take up to 10 minutes.
max_time = time.time() + 1*60*60 # 1 hours

while time.time() < max_time:
    for job_arn in reversed(import_job_arns):
        import_job_response = personalize.describe_dataset_import_job(
            datasetImportJobArn = job_arn
        )
        status = import_job_response["datasetImportJob"]['status']

        if status == "ACTIVE":
            print(f'Import job {job_arn} successfully completed')
            import_job_arns.remove(job_arn)
        elif status == "CREATE FAILED":
            print(f'Import job {job_arn} failed')
            if import_job_response["datasetImportJob"].get('failureReason'):
                print('   Reason: ' + import_job_response["datasetImportJob"]['failureReason'])
            import_job_arns.remove(job_arn)
    if len(import_job_arns) > 0: # CREATE PENDING or CREATE IN_PROGRESS
        print('At least one dataset import job still in progress')
        time.sleep(60)
    else:
        print("All import jobs have ended")
        break


{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:402114309305:dataset-import-job/updated_items_dataset_import_job",
  "ResponseMetadata": {
    "RequestId": "54efcc34-ec16-4e93-acef-d0a63165c831",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 16:16:27 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "120",
      "connection": "keep-alive",
      "x-amzn-requestid": "54efcc34-ec16-4e93-acef-d0a63165c831"
    },
    "RetryAttempts": 0
  }
}
{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:402114309305:dataset-import-job/updated_interactions_dataset_import_job",
  "ResponseMetadata": {
    "RequestId": "9e2b983d-5f86-49c8-b4b8-5c007b48807e",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 16:16:27 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "127",
      "connection": "keep-alive",
      "x-amzn-requestid": "9e2b983d-5f86-49c8-b4b8-5c007

#### Chapter 6 Step 3: Re-submit the Batch Inference Job on the solution version

Now, we will resubmit the same batch job that we submitted in Chapter 5.
Because we are using the User-personalization recipe, when we submit a batch inference job, our solution version will automatically update to consider the items and interactions data.
This new items and interactions data will be used for cold-starting purposes. In chapter 7, when you compare the output of this job to the output of the previous job, you will notice a difference.

According to [the AWS Documentation for the user personalization recipe](https://docs.aws.amazon.com/personalize/latest/dg/native-recipe-new-item-USER_PERSONALIZATION.html),
> Just remember that Amazon Personalize automatically updates only your latest fully trained solution version, so the manually updated solution version won't be automatically updated in the future.


Additionally, automatic update requirements for batch item recommendations include the following:

> * The solution version you specify in the batch inference job must be the latest solution version for your solution.
> * The solution version must be trained with trainingMode set to FULL (this is the default when creating a solution version).
> * You must provide new item or interactions data since the last automatic update.


In [35]:
s3_input_key_au = "batch-job-input/" + au_job_input_filename
s3_input_path_au = "s3://" + bucket_name + "/" + s3_input_key_au
s3_output_path_au = "s3://" + bucket_name + "/batch-job-outputs/auto-update/"

# Upload a copy of the input file to s3.
!cp {orig_job_input_filename} {au_job_input_filename}

s3.upload_file(au_job_input_filename, bucket_name, s3_input_key_au)
if s3_input_key_au in [object['Key'] for object in s3.list_objects(Bucket=bucket_name)['Contents']]:
    print('File was uploaded successfully!')
else:
    print('File was not uploaded!')

response = personalize.create_batch_inference_job(
    solutionVersionArn = user_personalization_solution_version_arn,
    jobName = "retaildemostore-user-personalization-job-au-" + token,
    roleArn = role_arn,
    jobInput = {"s3DataSource": {"path": s3_input_path_au}},
    jobOutput = {"s3DataDestination":{"path": s3_output_path_au}},
    numResults = y
)
user_personalization_autoupdate_job_arn = response['batchInferenceJobArn']
print(json.dumps(response, indent=2, default=str))

File was uploaded successfully!
{
  "batchInferenceJobArn": "arn:aws:personalize:us-east-1:402114309305:batch-inference-job/retaildemostore-user-personalization-job-au-85968",
  "ResponseMetadata": {
    "RequestId": "fe20dec8-8967-46be-a380-cc3fcf9743af",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 16:22:29 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "139",
      "connection": "keep-alive",
      "x-amzn-requestid": "fe20dec8-8967-46be-a380-cc3fcf9743af"
    },
    "RetryAttempts": 0
  }
}


In [None]:
%%time
#### Wait for the batch inference job to complete This can take around 15 minutes.
current_time = datetime.now()
print("Inference Job Started on: ", current_time.strftime("%I:%M:%S %p"))

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    resp = personalize.describe_batch_inference_job(batchInferenceJobArn = user_personalization_autoupdate_job_arn)
    status = resp["batchInferenceJob"]['status']
    print("DatasetInferenceJob {arn}: {status}".format(arn=user_personalization_autoupdate_job_arn, status=status))

    if status == "ACTIVE" or status == "CREATE FAILED":
        print("Inference job has ended")
        break
    else:
        print('Job still in progress')
        time.sleep(60)
    
current_time = datetime.now()
print("Inference Job Completed on: ", current_time.strftime("%I:%M:%S %p"))


Inference Job Started on:  04:22:29 PM
DatasetInferenceJob arn:aws:personalize:us-east-1:402114309305:batch-inference-job/retaildemostore-user-personalization-job-au-85968: CREATE PENDING
Job still in progress
DatasetInferenceJob arn:aws:personalize:us-east-1:402114309305:batch-inference-job/retaildemostore-user-personalization-job-au-85968: CREATE IN_PROGRESS
Job still in progress
DatasetInferenceJob arn:aws:personalize:us-east-1:402114309305:batch-inference-job/retaildemostore-user-personalization-job-au-85968: CREATE IN_PROGRESS
Job still in progress
DatasetInferenceJob arn:aws:personalize:us-east-1:402114309305:batch-inference-job/retaildemostore-user-personalization-job-au-85968: CREATE IN_PROGRESS
Job still in progress
DatasetInferenceJob arn:aws:personalize:us-east-1:402114309305:batch-inference-job/retaildemostore-user-personalization-job-au-85968: CREATE IN_PROGRESS
Job still in progress
DatasetInferenceJob arn:aws:personalize:us-east-1:402114309305:batch-inference-job/retaild

#### Get the output and inspect of the inference job

Download the output of the inference job from our private S3 bucket.
The following code cell does a side-by-side comparison of the outputs from the original solution version and the auto-updated solution version.
You may notice some differences.

In [None]:
# Get the output and inspect of the inference job

# - Download the Inference job output from S3
job_output_file_au = au_job_input_filename + ".out"
export_name_au = 'batch-job-outputs/auto-update/' + job_output_file_au
# print(batch_job_output_file_au)
# print(export_name_au)

s3.download_file(bucket_name, export_name_au, job_output_file_au)

# - Inspect the Inference Job
print("Previewing output of the inference job that was run on the original solution version")
!head -n 5 $job_output_file_orig

# - Inspect the Inference Job
print("\n\nPreviewing output of the inference job that was run on the auto-updated solution version")
!head -n 5 $job_output_file_au


Previewing output of the inference job that was run on the original solution version
{"input":{"userId":"4016"},"output":{"recommendedItems":["ccdf737c-c4fd-4c78-abd2-d5ef0428ef20","425cc876-3935-4e87-ad8d-77f42b0b6a75","6be08307-1ec0-44dc-b436-5d489a8010e8","2c1b34d6-0f3d-463d-be76-226cb87bdc6d","89c4eeb4-c146-4434-a9f1-6943b4b552dc","5a94b7d5-b210-44b3-9287-c8b0b5488a15","61b1ad14-4e70-4029-ba55-d17bbf4ab62b","8f8f015a-4166-4e9e-ac0b-6d980614ca5d","eecbee28-73a3-425d-84e8-516c326e399c","1daacea7-7d46-464a-8326-ed81951fecab"],"scores":[0.8449224,0.0186943,0.0148416,0.0100115,0.008885,0.0055925,0.0055068,0.0053057,0.0040184,0.0038556]},"error":null}
{"input":{"userId":"2755"},"output":{"recommendedItems":["3630053e-3962-4549-bcce-402c3a980557","6d488475-1d67-4076-96b1-8e706709a847","b947ee58-a7e7-40bf-9926-42a445f3480f","90ccfbb9-4538-4951-af8d-4f728578b237","d537d92a-23fe-4673-a697-795652ff10c8","78080d05-b078-441f-b245-54b2a2dec872","61840d6a-6ba2-4ece-a644-6db6a3377b1c","2e95f6fc-6b

Notice how the output from the auto-updated solution version contains new items.
This demonstrates that the auto-update feature for user-personalization solution version does indeed recommend those new items (for cold starts) without requiring a retraining of the underlying model. 



#### Chapter 6 Step 4: View the metrics of the auto-updated solution version

Lets quickly fetch the metrics of post auto-updated solution version. Since the underlying solution version was only *updated* (but not *retrained*), we should expect the metric of the original solution version & the auto-trained solution version to remain the same.

In [None]:
# Display the metrics of the original/pre-updated solution version:
print("Metrics of the original solution version:")
print(json.dumps(get_original_solution_metrics_response['metrics'], indent=2))

# Get metrics for auto-updated solution version
get_autoupdated_solution_metrics_response = personalize.get_solution_metrics(
    solutionVersionArn = user_personalization_solution_version_arn
)
print("\n\nMetrics of the auto-updated solution version:")
print(json.dumps(get_autoupdated_solution_metrics_response['metrics'], indent=2))

Metrics of the original solution version:
{
  "coverage": 0.9968,
  "mean_reciprocal_rank_at_25": 0.8459,
  "normalized_discounted_cumulative_gain_at_10": 0.7873,
  "normalized_discounted_cumulative_gain_at_25": 0.8089,
  "normalized_discounted_cumulative_gain_at_5": 0.7661,
  "precision_at_10": 0.1194,
  "precision_at_25": 0.0541,
  "precision_at_5": 0.2189
}


Metrics of the auto-updated solution version:
{
  "coverage": 0.9968,
  "mean_reciprocal_rank_at_25": 0.8459,
  "normalized_discounted_cumulative_gain_at_10": 0.7873,
  "normalized_discounted_cumulative_gain_at_25": 0.8089,
  "normalized_discounted_cumulative_gain_at_5": 0.7661,
  "precision_at_10": 0.1194,
  "precision_at_25": 0.0541,
  "precision_at_5": 0.2189
}


#### Chapter 6 Step 5: Fully re-train our solution version

If we want our new Items and Interactions data to have more influence in our solution version, then we will need to *fully retrain* the solution version. The full retraining will allow our solution version to use the new items and new interactions at part of its *meaningful*, or *exploitative* (as opposed to just *exploratory*) recommendations for inference jobs. To fully retrain a solution version, use the CreateSolutionVersion command with the ```trainingMode``` flag set to ```FULL```.

Note: Setting ```trainingMode=FULL``` will enable auto-updates to the new solution version.

Some additional information on the trainingMode parameter (from [documentaion on Solution Versions](https://docs.aws.amazon.com/personalize/latest/dg/API_SolutionVersion.html)):

The trainingMode parameter defines:
> The scope of training to be performed when creating the solution version. The FULL option trains the solution version based on the entirety of the input solution's training data, while the UPDATE option processes only the data that has changed in comparison to the input solution. Choose UPDATE when you want to incrementally update your solution version instead of creating an entirely new one.

As a side note:
> The UPDATE option can only be used when you already have an active solution version created from the input solution using the FULL option and the input solution was trained with the User-Personalization recipe or the HRNN-Coldstart recipe.




In [None]:
# Step 5: Fully train a new solution version
retrained_user_personalization_solution_version_arn = None

if not retrained_user_personalization_solution_version_arn:
    create_solution_version_response = personalize.create_solution_version(
        solutionArn = user_personalization_solution_arn,
        trainingMode='FULL'
    )

    retrained_user_personalization_solution_version_arn = create_solution_version_response['solutionVersionArn']
    print(json.dumps(create_solution_version_response, indent=2))
else:
    print(f'Solution version {retrained_user_personalization_solution_version_arn} already exists; not creating')


{
  "solutionVersionArn": "arn:aws:personalize:us-east-1:402114309305:solution/retaildemostore-user-personalization-solution-85968/859a76e2",
  "ResponseMetadata": {
    "RequestId": "35a7a2ad-025f-43a4-b1bf-de8856237ace",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 16:41:31 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "137",
      "connection": "keep-alive",
      "x-amzn-requestid": "35a7a2ad-025f-43a4-b1bf-de8856237ace"
    },
    "RetryAttempts": 0
  }
}


##### Wait for the new solution version to finish training
This can take around 40 minutes.

In [None]:
%%time

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    updated_soln_ver_response = personalize.describe_solution_version(
        solutionVersionArn = retrained_user_personalization_solution_version_arn
    )
    status = updated_soln_ver_response["solutionVersion"]["status"]

    if status == "ACTIVE":
        print(f'Solution version {retrained_user_personalization_solution_version_arn} successfully completed')
        break
    elif status == "CREATE FAILED":
        print(f'Solution version {retrained_user_personalization_solution_version_arn} failed')
        if updated_soln_ver_response["solutionVersion"].get('failureReason'):
            print('   Reason: ' + updated_soln_ver_response["solutionVersion"]['failureReason'])
        break
    else:
        print('At least one solution version is still in progress')
        time.sleep(60)


At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version is still in progress
At least one solution version i

### Chapter 6 Step 6: Submit the batch segmentation job to our new Solution Version

Now that we have fully retrained a new solution version, we will:
- Submit a batch segmentation job to our new *fully retrained* solution version. For consistency, we will use the same sample input that we used previously.
- Wait for the batch job to complete.
- Compare the output (Output C) with Output B.

In chapter 7, we will perform a side-by-side comparison of outputs A, B, and C.

Recall:
- Output A = the output from the original solution version
- Output B = the output from the solution version after it was auto-updated
- Output C = the output from the fully retrained solution version


In [None]:
### Step 6) Submit batch segmentation job. 

s3_input_key_fr = "batch-job-input/" + fr_job_input_filename
s3_input_path_fr = "s3://" + bucket_name + "/" + s3_input_key_fr
s3_output_path_fr = "s3://" + bucket_name + "/batch-job-outputs/fully-retrained/"

# Upload a copy of the input file to s3.
!cp {orig_job_input_filename} {fr_job_input_filename}
s3.upload_file(fr_job_input_filename, bucket_name, s3_input_key_fr)
if s3_input_key_fr in [object['Key'] for object in s3.list_objects(Bucket=bucket_name)['Contents']]:
    print('File was uploaded successfully!')
else:
    print('File was not uploaded!')
    
# Preview input file We will use the same input for batch inference job using our new solution version.
print("\nPreviewing input file... ")
!head -n 5 $fr_job_input_filename
print('\n')

# Create and start a Batch Segmentation Job using our latest Solution Version
s3_input_key_fr = "batch-job-input/" + fr_job_input_filename
s3_output_path_fr = "s3://" + bucket_name + "/batch-job-outputs/fully-retrained/" # Define output location

# https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/personalize/client/create_batch_inference_job.html
response = personalize.create_batch_inference_job (
    solutionVersionArn = retrained_user_personalization_solution_version_arn,
    jobName = "retaildemostore-user-personalization-job-fr-" + token,
    roleArn = role_arn,
    jobInput = {"s3DataSource": {"path": s3_input_path_fr }},
    jobOutput = {"s3DataDestination": {"path": s3_output_path_fr }},
    numResults = y
)

user_personalization_retrained_job_arn = response['batchInferenceJobArn']
print(json.dumps(response, indent=2, default=str))

File was uploaded successfully!

Previewing input file... 
{"userId": "5023"}
{"userId": "5866"}
{"userId": "4016"}
{"userId": "2755"}
{"userId": "2896"}


{
  "batchInferenceJobArn": "arn:aws:personalize:us-east-1:402114309305:batch-inference-job/retaildemostore-user-personalization-job-fr-85968",
  "ResponseMetadata": {
    "RequestId": "bb8bbd39-10b6-4883-a55e-aaaf6f249504",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 27 Oct 2023 17:01:35 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "139",
      "connection": "keep-alive",
      "x-amzn-requestid": "bb8bbd39-10b6-4883-a55e-aaaf6f249504"
    },
    "RetryAttempts": 0
  }
}


Wait for the batch job to complete. This can take around 10 minutes.


In [None]:
%%time
current_time = datetime.now()
print("Inference Job Started on: ", current_time.strftime("%I:%M:%S %p"))

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    
    resp = personalize.describe_batch_inference_job(batchInferenceJobArn = user_personalization_retrained_job_arn)
    status = resp["batchInferenceJob"]['status']
    print("DatasetInferenceJob {arn}: {status}".format(arn=user_personalization_retrained_job_arn, status=status))

    if status == "ACTIVE" or status == "CREATE FAILED":
        current_time = datetime.now()
        print("Inference Job Completed on: ", current_time.strftime("%I:%M:%S %p"))
        break
    else:
        print('At least one batch inference job still in progress')
        time.sleep(60)    


Inference Job Started on:  05:01:35 PM
DatasetInferenceJob arn:aws:personalize:us-east-1:402114309305:batch-inference-job/retaildemostore-user-personalization-job-fr-85968: CREATE PENDING
At least one batch inference job still in progress
DatasetInferenceJob arn:aws:personalize:us-east-1:402114309305:batch-inference-job/retaildemostore-user-personalization-job-fr-85968: CREATE IN_PROGRESS
At least one batch inference job still in progress
DatasetInferenceJob arn:aws:personalize:us-east-1:402114309305:batch-inference-job/retaildemostore-user-personalization-job-fr-85968: CREATE IN_PROGRESS
At least one batch inference job still in progress
DatasetInferenceJob arn:aws:personalize:us-east-1:402114309305:batch-inference-job/retaildemostore-user-personalization-job-fr-85968: CREATE IN_PROGRESS
At least one batch inference job still in progress
DatasetInferenceJob arn:aws:personalize:us-east-1:402114309305:batch-inference-job/retaildemostore-user-personalization-job-fr-85968: CREATE IN_PROGR

The job has completed.
Now, let's download & inspect its output file.


In [None]:
# - Download the Inference job output from S3
job_output_file_fr = fr_job_input_filename + ".out"
export_name_fr = 'batch-job-outputs/fully-retrained/' + job_output_file_fr
s3.download_file(bucket_name, export_name_fr, job_output_file_fr)

# - Inspect the Inference Job
print("Previewing output of the inference job that was run on the fully-trained solution version:")
!head -n 5 $job_output_file_fr

# - Inspect the Inference Job
print("\n\nPreviewing output of the inference job that was run on the auto-updated solution version:")
!head -n 5 $job_output_file_au


Previewing output of the inference job that was run on the fully-trained solution version:
{"input":{"userId":"4016"},"output":{"recommendedItems":["new_72994d99-e815-486e-9b8e-bfccbc230e4b","new_1b2dda7c-7fd7-476a-bdea-87bcd101a022","new_acfba3f9-f7d6-4fff-9cef-35db086d2869","new_bbcda337-3411-47e4-aeec-079663f729df","new_ac46fdbc-2369-4908-b0b1-d405572e4a5c","61b1ad14-4e70-4029-ba55-d17bbf4ab62b","aa4fca9d-d1ef-4529-9169-7bc075733bd5","99c88141-7b7c-403b-b8eb-5dd5e8efd4a4","new_91cfb05c-44cc-4eca-b622-1fa875e0256c","cd78672a-0a5e-4931-ace1-2f2abb90b720"],"scores":[0.1686713,0.0466978,0.0331571,0.0227852,0.0181023,0.0180926,0.017133,0.0126616,0.0100974,0.0100844]},"error":null}
{"input":{"userId":"2755"},"output":{"recommendedItems":["new_af2aba3d-c9f6-46e1-95db-100fc1a73726","3630053e-3962-4549-bcce-402c3a980557","b947ee58-a7e7-40bf-9926-42a445f3480f","a58cc0a2-d54f-4e5a-85e7-e8c530592b76","new_53ec1efb-0deb-48cf-96a3-7a7342b78608","99c88141-7b7c-403b-b8eb-5dd5e8efd4a4","new_1d21572e

Notice how the two outputs are slightly different. Specifically, new items are much more likely to appear in the output from the fully retrained solution version, compared to the output from the auto-updated solution version.

This is beacuse the fully-retrained solution version's output uses the new items and new interactions data to a more significant degree compared to the auto-updated solution version.

Those new items won't *just* be recommended as part cold starts anymore. Now, they'll also be used in more relevant/meaningful recommendations.

#### Step 7: Evaluate the metrics of the two solution versions

Let's compare the metrics of the auto-updated solution version and fully retrained solution version.


In [None]:
print("Metrics of the auto-updated solution version:")
print(json.dumps(get_autoupdated_solution_metrics_response['metrics'], indent=2))

# Get metrics for fully retrained solution version that used the updated datasets
get_retrained_solution_metrics_response = personalize.get_solution_metrics(
    solutionVersionArn = retrained_user_personalization_solution_version_arn
)
print("\n\nMetrics of the solution version with the updated datasets:")
print(json.dumps(get_retrained_solution_metrics_response['metrics'], indent=2))


Metrics of the auto-updated solution version:
{
  "coverage": 0.9968,
  "mean_reciprocal_rank_at_25": 0.8459,
  "normalized_discounted_cumulative_gain_at_10": 0.7873,
  "normalized_discounted_cumulative_gain_at_25": 0.8089,
  "normalized_discounted_cumulative_gain_at_5": 0.7661,
  "precision_at_10": 0.1194,
  "precision_at_25": 0.0541,
  "precision_at_5": 0.2189
}


Metrics of the solution version with the updated datasets:
{
  "coverage": 0.985,
  "mean_reciprocal_rank_at_25": 0.8425,
  "normalized_discounted_cumulative_gain_at_10": 0.7377,
  "normalized_discounted_cumulative_gain_at_25": 0.7709,
  "normalized_discounted_cumulative_gain_at_5": 0.7077,
  "precision_at_10": 0.1454,
  "precision_at_25": 0.0692,
  "precision_at_5": 0.2561
}


As you can see, the metrics of the two solution versions are slightly different. This can serve as additional confirmation that the fully-retrained solution version used different datasets for training.



### Chapter 7: Analysis

##### **Chapter 7 Step 1**: Compare the outputs across the three batch inference jobs

Compare the outputs of the original, auto-updated, and fully retrained solution versions. 
All three should be different.

The original output should will only contain ITEM_ID values that were found in the trimmed-down version of the items dataset.

The auto-updated output will contain ITEM_ID values from the complete version of the items dataset. This means that those new items may show up in the item recommendations for your users. Though keep in mind, that if a new item shows up, it'll be for cold-starting purposes so that you can start collecting interactions data for those new items.

The fully-retrained output will contain ITEM_ID values from the complete version of the items dataset. Recall that the major difference between this output and the auto-updated output is that both are yielded by different solution versions. When we created a new solution version with ```trainingMode='FULL'```, Personalize trained a solution version using the new data, as supposed to just updating its pool of cold-start items (as was the case with the auto-update solution version).   


In [None]:
# - Inspect the outputs of the Inference Jobs

print("Previewing output of the inference job that was run on the original solution version:")
!head -n 10 $job_output_file_orig

print("\n\nPreviewing output of the inference job that was run on the auto-updated solution version:")
!head -n 10 $job_output_file_au

print("\n\nPreviewing output of the inference job that was run on the fully-trained solution version:")
!head -n 10 $job_output_file_fr



Previewing output of the inference job that was run on the original solution version:
{"input":{"userId":"4016"},"output":{"recommendedItems":["ccdf737c-c4fd-4c78-abd2-d5ef0428ef20","425cc876-3935-4e87-ad8d-77f42b0b6a75","6be08307-1ec0-44dc-b436-5d489a8010e8","2c1b34d6-0f3d-463d-be76-226cb87bdc6d","89c4eeb4-c146-4434-a9f1-6943b4b552dc","5a94b7d5-b210-44b3-9287-c8b0b5488a15","61b1ad14-4e70-4029-ba55-d17bbf4ab62b","8f8f015a-4166-4e9e-ac0b-6d980614ca5d","eecbee28-73a3-425d-84e8-516c326e399c","1daacea7-7d46-464a-8326-ed81951fecab"],"scores":[0.8449224,0.0186943,0.0148416,0.0100115,0.008885,0.0055925,0.0055068,0.0053057,0.0040184,0.0038556]},"error":null}
{"input":{"userId":"2755"},"output":{"recommendedItems":["3630053e-3962-4549-bcce-402c3a980557","6d488475-1d67-4076-96b1-8e706709a847","b947ee58-a7e7-40bf-9926-42a445f3480f","90ccfbb9-4538-4951-af8d-4f728578b237","d537d92a-23fe-4673-a697-795652ff10c8","78080d05-b078-441f-b245-54b2a2dec872","61840d6a-6ba2-4ece-a644-6db6a3377b1c","2e95f6fc-6

#### **Chapter 7 Step 2**: Compare the metrics across the two solution versions (original, auto-updated, fully-trained)
Since the solution version was fully retrained with additional data, we should expect the metrics to be slightly different.

We have looked at the metrics for the original solution version, the original solution version after it was auto-updated, and the (new) fully-retrained solution version in previous parts of this notebook. Now, let's display these three sets of metrics side-by-side, so you can easily compare them.


In [None]:
print("Metrics of the original solution version:")
print(json.dumps(get_original_solution_metrics_response['metrics'], indent=2))

print("\n\nMetrics of the auto-updated solution version:")
print(json.dumps(get_autoupdated_solution_metrics_response['metrics'], indent=2))

print("\n\nMetrics of the fully-retrained solution version:")
print(json.dumps(get_retrained_solution_metrics_response['metrics'], indent=2))


Metrics of the original solution version:
{
  "coverage": 0.9968,
  "mean_reciprocal_rank_at_25": 0.8459,
  "normalized_discounted_cumulative_gain_at_10": 0.7873,
  "normalized_discounted_cumulative_gain_at_25": 0.8089,
  "normalized_discounted_cumulative_gain_at_5": 0.7661,
  "precision_at_10": 0.1194,
  "precision_at_25": 0.0541,
  "precision_at_5": 0.2189
}


Metrics of the auto-updated solution version:
{
  "coverage": 0.9968,
  "mean_reciprocal_rank_at_25": 0.8459,
  "normalized_discounted_cumulative_gain_at_10": 0.7873,
  "normalized_discounted_cumulative_gain_at_25": 0.8089,
  "normalized_discounted_cumulative_gain_at_5": 0.7661,
  "precision_at_10": 0.1194,
  "precision_at_25": 0.0541,
  "precision_at_5": 0.2189
}


Metrics of the fully-retrained solution version:
{
  "coverage": 0.985,
  "mean_reciprocal_rank_at_25": 0.8425,
  "normalized_discounted_cumulative_gain_at_10": 0.7377,
  "normalized_discounted_cumulative_gain_at_25": 0.7709,
  "normalized_discounted_cumulative_gain

Notice how the metrics stay the same from the original solution version to the auto-updated solution version. This is because the underlying model isn't retrained with the new data. What does change is that the new data we imported is used for cold-starts.

However, after fully-retraining our solution version (creating a new solution version w/ ```trainingMode='FULL'```), the underlying model *is* retrained. In other words, those new items and new interactions were used for training this solution version, which in turn changes its metrics. 

# Summary: 

In this notebook, we walked through the process for updating your items and interactions datasets in Amazon Personalize in the context of Item Recommendation (specifically user-personalization-recipe-backed) use cases.

The notebook required some set up. To recap those steps, we:
- Set Up Amazon S3 Bucket, IAM Policies, and IAM Roles
- Fetched, Inspected and trimmed Datasets 
- Created Schemas and Imported Datasets in Amazon Personalize
- Created an e-commerce custom solution in Amazon Personalize
- Ran a Batch Inference Job on your Solution Version (Yielded Output A)

After performing some set up and obtaining a baseline for future comparison, we then analyzed how updating items and interactions datasets affects your existing solution versions. We: 
- Imported new items and new interactions into our Amazon Personalize Dataset Group.
- Re-Ran the Inference Job on the Solution Version. Upon submission of the job request, the solution version auto-updated to consider new items and new interactions data. Upon completion of the job, we received Output B and compared it to Output A.
- Performed A Full Retraining (by creating a new Solution Version w/ TrainingMode=FULL). This created a new model which allowed Personalize to give greater weight to the new data. 
- Ran the Batch Inference Job on the new Solution Version (Yielded Output C). We inspected Output C and found that the new items are now more likely to show up in item recommendations.

In the analysis portion of this notebook, we:
- Did a side-by-side comparison of Outputs A, B, and C. All three were slightly different.
- Tracked the changes of the metrics across the solution versions (original solution version, the original solution version after it was auto-updated, and fully-retrained solution version).


# Chapter 8: Cleanup

This chapter will walk through deleting all of the resources created throughout this notebook.

First, we will delete the S3 and IAM resources. 

Then, we will delete the Amazon Personalize resources.

Amazon Personalize resources have to deleted in a specific sequence* to avoid dependency errors.
The order in which you should delete resources in Amazon Personalize are: recommenders and campaigns, then solutions, then event trackers, then filters, then datasets and dataset schemas, and finally, the dataset group. 

To declutter this notebook, we will be leveraging a utility module written in python that provides an orderly delete process for deleting all resources in each dataset group.

This section should take about 15 minutes, though most of this time will be spent waiting for the Personalize resources to finish deleting.

*: Note, we didn't use some of these resource types (such as recommenders and campaigns) in this notebook. The list is just for your knowledge.

#### Emptying and Deleting the S3 bucket

NOTE: THE FOLLOWING CODE WILL DELETE ALL OF THE OBJECTS, INCLUDING THE CSVs & BATCH JOB FILES. 
If you dont want to delete the S3 bucket, DONT run the code block below.

Alternatively, if you want to delete the bucket, consider directly downloading the files (eg: via the Console or CLI) to 
persist your data.


In [None]:
# List objects in the bucket and delete them
objects = s3.list_objects_v2(Bucket=bucket_name)

if 'Contents' in objects:
    for obj in objects['Contents']:
        print(f'Deleting {obj["Key"]}...')
        s3.delete_object(Bucket=bucket_name, Key=obj['Key'])

# Delete the bucket
s3.delete_bucket(Bucket=bucket_name)


Deleting batch-job-input/au_job_input.json...
Deleting batch-job-input/fr_job_input.json...
Deleting batch-job-input/orig_job_input.json...
Deleting batch-job-outputs/auto-update/_CHECK...
Deleting batch-job-outputs/auto-update/au_job_input.json.out...
Deleting batch-job-outputs/fully-retrained/_CHECK...
Deleting batch-job-outputs/fully-retrained/fr_job_input.json.out...
Deleting batch-job-outputs/original/_CHECK...
Deleting batch-job-outputs/original/orig_job_input.json.out...
Deleting interactions.csv...
Deleting interactions_trimmed.csv...
Deleting items.csv...
Deleting items_trimmed.csv...
Deleting users.csv...


{'ResponseMetadata': {'RequestId': 'A6JT9K4G6T33DBVA',
  'HostId': 'qyefAeFHe/HX95f5a9kY/oFHj11Q26uC+hz7szE3byH3rM9XUPvatFtTWgSvEVQHFna6FTBhJDIZJOCUx8bS1Q==',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': 'qyefAeFHe/HX95f5a9kY/oFHj11Q26uC+hz7szE3byH3rM9XUPvatFtTWgSvEVQHFna6FTBhJDIZJOCUx8bS1Q==',
   'x-amz-request-id': 'A6JT9K4G6T33DBVA',
   'date': 'Fri, 27 Oct 2023 17:14:39 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}

#### Delete the IAM Execution Role and Policy

Now, lets delete the IAM Policy and IAM Role that we created for the Personalize Service.

In [None]:
iam.detach_role_policy(RoleName=role_name, PolicyArn=policy_arn)

print("Deleting: " + role_name)
iam.delete_role(RoleName=role_name)

print("Deleting: " + policy_arn)
iam.delete_policy(PolicyArn=policy_arn)

# Check if the IAM role and policy were deleted
if role_name in [role['RoleName'] for role in iam.list_roles()['Roles']]:
    print('Role was not deleted!')
    
if policy_arn in [policy['Arn'] for policy in iam.list_policies()['Policies']]:
    print('Policy was not deleted!')
    
print('Bucket and IAM role and policy deleted successfully!')

Deleting: PersonalizeRole-85968
Deleting: arn:aws:iam::402114309305:policy/PersonalizePolicy-85968
Bucket and IAM role and policy deleted successfully!


#### Set up deletion script

Next we will download a helper script that will simplify the cleanup process. If you want to look at the underlying code, the helper script can be found in the [retail-demo-store github repo](https://github.com/aws-samples/retail-demo-store/blob/b80137c6edb2c975c50221fcaba46b6abadd7b99/src/aws-lambda/personalize-pre-create-resources/delete_dataset_groups.py).

Note: Under the hood, this script uses the 'delete_*' Personalize API boto3 commands. 

In [None]:
# Download the helper script from the github repo.
!curl -O https://raw.githubusercontent.com/aws-samples/retail-demo-store/b80137c6edb2c975c50221fcaba46b6abadd7b99/src/aws-lambda/personalize-pre-create-resources/delete_dataset_groups.py

# Import the module
import delete_dataset_groups

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18958  100 18958    0     0   147k      0 --:--:-- --:--:-- --:--:--  146k


#### Set up logging for deletion of Personalize resources
The following code cell ensures we import and set up the native python logging module. Our resource deletion script requires this so that we can provide information about the deletion status of Personalize resources.

In [None]:
import logging

handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO)

delete_dataset_groups.logger.setLevel(logging.INFO)
delete_dataset_groups.logger.addHandler(handler)

#### Delete Amazon Personalize Resources

Now we can delete the active dataset groups. This can take up to 10 minutes depending on the resources within your dataset group. The function below will log its progress until finished.

In [None]:
%%time

print(f'Active dataset groups that need to be deleted: {dataset_group_name}\n')

delete_dataset_groups.delete_dataset_groups(
    dataset_group_names = [ dataset_group_name ], 
    wait_for_resources = True
)

Active dataset groups that need to be deleted: retaildemostore-products-DSG-85968

Dataset Group ARN: arn:aws:personalize:us-east-1:402114309305:dataset-group/retaildemostore-products-DSG-85968
All recommenders have been deleted or none exist for dataset group
All campaigns have been deleted or none exist for dataset group
Deleting solution: arn:aws:personalize:us-east-1:402114309305:solution/retaildemostore-user-personalization-solution-85968
Waiting for 1 solution(s) to be deleted
Waiting for 1 solution(s) to be deleted
Waiting for 1 solution(s) to be deleted
All solutions have been deleted or none exist for dataset group
All event trackers have been deleted or none exist for dataset group
All filters have been deleted or none exist for dataset group
Deleting dataset arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85968/ITEMS
Deleting dataset arn:aws:personalize:us-east-1:402114309305:dataset/retaildemostore-products-DSG-85968/USERS
Deleting dataset ar

#### Delete local files
If you want to retain these files, dont run the following cell block.

In [None]:
!rm delete_dataset_groups.py

# Delete input & output files from original solution version
!rm {orig_job_input_filename} # "orig_job_input.json"
!rm {job_output_file_orig}    # "orig_job_input.json.out"

# Delete input & output files from the auto-updated original solution version
!rm {au_job_input_filename} # "au_job_input.json"
!rm {job_output_file_au}    # "au_job_input.json.out"

# Delete input & output files from the auto-updated original solution version
!rm {fr_job_input_filename} # "fr_job_input.json"
!rm {job_output_file_fr}    # "fr_job_input.json.out"

# Delete the locally-saved csv files
!rm {items_filename}
!rm {interactions_filename}

!rm {items_trimmed_filename}
!rm {interactions_trimmed_filename}


## Cleanup Complete

All resources created by this Personalize Notebook have been deleted.

### Final note:
If you are running this notebook on Amazon Sagemaker, don't forget to `stop` or `terminate` your sagemaker instance so that you don't incur additional costs.
Afterwards, feel free to delete the execution role of this Sagemaker notebook instance.

#### Congrats on completing this Personalize Demonstration! 
If you are further interested in learning how you can leverage Amazon Personalize to power your business's ML-powered recommendation services, refer to the [AWS Documentation on Amazon Personlize](https://docs.aws.amazon.com/personalize/latest/dg/what-is-personalize.html).
