
## Introduction <a class="anchor" id="intro"></a>
[Back to top](#top)

In Amazon Personalize, you start by creating a dataset group, which is a container for Amazon Personalize components. Your dataset group can be one of the following:

- A Domain dataset group, where you create preconfigured resources for different business domains and use cases, such as getting recommendations for similar videos (VIDEO_ON_DEMAND domain) or best-selling items (ECOMMERCE domain). You choose your business domain, import your data, and create recommenders. You use recommenders in your application to get recommendations.

  Use a [Domain dataset group](https://docs.aws.amazon.com/personalize/latest/dg/domain-dataset-groups.html) if you have a video on demand or e-commerce application and want Amazon Personalize to find the best configurations for your use cases. If you start with a Domain dataset group, you can also add custom resources such as solutions with solution versions trained with recipes for custom use cases.

- A [Custom dataset group](https://docs.aws.amazon.com/personalize/latest/dg/custom-dataset-groups.html), where you create configurable resources for custom use cases and batch recommendation workflows. You choose a recipe, train a solution version (model), and deploy the solution version with a campaign. You use a campaign in your application to get recommendations.

  Use a Custom dataset group if you don't have a video on demand or e-commerce application, want to configure and manage only custom resources, or want to get recommendations in a batch workflow. If you start with a Custom dataset group, you can't associate it with a domain later. Instead, create a new Domain dataset group.

You can create and manage Domain dataset groups and Custom dataset groups with the AWS console, the AWS Command Line Interface (AWS CLI), or programmatically with the AWS SDKs.

## Define your Use Case <a class="anchor" id="usecase"></a>
[Back to top](#top)

There are a few guidelines for scoping a problem suitable for Personalize. We recommend the values below as a starting point, although the [official limits](https://docs.aws.amazon.com/personalize/latest/dg/limits.html) are a little lower.

* Authenticated users
* At least 50 unique users
* At least 100 unique items
* At least 2 dozen interactions for each user 

Most of the time this is easily attainable, and if you are low in one category, you can often make up for it by having a larger number in another category.
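
As a rough illustration, the guidelines above can be checked against a list of interaction records before anything is imported. This is a minimal sketch and not part of Amazon Personalize: the function name `check_minimums`, the tuple layout, and the decision to compare the average number of interactions per user are all assumptions made for this example.

```python
from collections import Counter

# Starting-point thresholds taken from the guidelines above.
MIN_USERS = 50
MIN_ITEMS = 100
MIN_INTERACTIONS_PER_USER = 24  # "2 dozen"


def check_minimums(interactions):
    """Summarize (user_id, item_id, timestamp) tuples against the guidelines.

    Hypothetical helper: it compares the *average* number of interactions
    per user to the threshold, which is one possible reading of the guideline.
    """
    per_user = Counter(user_id for user_id, _, _ in interactions)
    items = {item_id for _, item_id, _ in interactions}
    avg = len(interactions) / len(per_user) if per_user else 0.0
    return {
        "enough_users": len(per_user) >= MIN_USERS,
        "enough_items": len(items) >= MIN_ITEMS,
        "enough_interactions_per_user": avg >= MIN_INTERACTIONS_PER_USER,
    }


# Tiny synthetic example: 60 users, each interacting with 30 of 120 items.
sample = [
    (f"u{u}", f"i{(2 * u + k) % 120}", 1_600_000_000 + k)
    for u in range(60)
    for k in range(30)
]
report = check_minimums(sample)
print(report)
```

Because a shortfall in one category can often be offset by another, a report like this is better read as a hint than as a hard gate.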

The user-item-interaction data is key for getting started with the service. This means we need to look for use cases that generate that kind of data. A few common examples are:

- Video-on-demand applications
- E-commerce platforms

Defining your use-case will inform what data and what type of data you need.

In this example we are going to create:

- An Amazon Personalize custom campaign that returns a personalized ranked list of movies, for instance a shelf, rail, or carousel built around some shared attribute (director, location, superhero franchise, etc.).

Everything will be created within the same dataset group and with the same input data.




## Choose a Dataset or Data Source <a class="anchor" id="source"></a>
[Back to top](#top)

Regardless of the use case, the algorithms all share a base of learning on user-item-interaction data which is defined by 3 core attributes:

1. **UserID** - The user who interacted
2. **ItemID** - The item the user interacted with
3. **Timestamp** - The time at which the interaction occurred

To begin, we are going to use the latest MovieLens dataset. The full dataset has over 25 million interactions and a rich collection of item metadata. There is also a smaller version, which shortens training times while still exercising the same capabilities as the full dataset. In this example we use the smaller version.

Generally speaking your data will not arrive in a perfect form for Personalize, and will take some modification to be structured correctly. This notebook guides you through all of that. 
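
To make that kind of restructuring concrete, the sketch below converts a couple of made-up raw events into rows with the three core attributes. The raw field names (`user`, `movie`, `watched_at`) and the ISO 8601 timestamp format are assumptions for illustration; they are not the MovieLens layout.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical raw events, as an application might log them.
raw_events = [
    {"user": 7, "movie": 42, "watched_at": "2021-03-01T12:00:00+00:00"},
    {"user": 7, "movie": 99, "watched_at": "2021-03-02T08:30:00+00:00"},
]


def to_interaction_row(event):
    """Map one raw event to the USER_ID / ITEM_ID / TIMESTAMP layout."""
    ts = datetime.fromisoformat(event["watched_at"]).astimezone(timezone.utc)
    return {
        "USER_ID": str(event["user"]),
        "ITEM_ID": str(event["movie"]),
        "TIMESTAMP": int(ts.timestamp()),  # Unix epoch time in seconds
    }


rows = [to_interaction_row(e) for e in raw_events]

# Write the rows as CSV text with a header line.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["USER_ID", "ITEM_ID", "TIMESTAMP"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```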



### Install required packages

In [None]:
!pip install requests_auth_aws_sigv4
!pip install requests
!pip install opensearch-py

## Imports

In [None]:
import boto3
import json
import time
import re
import pandas as pd
import numpy as np
from datetime import datetime
import os
from utils import create_s3_bucket, create_iam_role, upload_to_s3

The IPython `autoreload` extension reloads modules automatically before executing code typed at the IPython prompt, so changes to imported modules are picked up without restarting the kernel.

In [None]:
%load_ext autoreload
%autoreload 2

## Setup Region


In [None]:
with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
    data = json.load(notebook_info)
    resource_arn = data['ResourceArn']
    region = resource_arn.split(':')[3]
print('region:', region)

## Create personalize client

In [None]:
# Configure the SDK to Personalize:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')
print("We can communicate with Personalize!")

## Download and Preprocess dataset

### Download data
We will download the dataset from the MovieLens website and unzip it in a new folder using the code below. If you want to use the full version of the dataset, you can download it from here: http://files.grouplens.org/datasets/movielens/ml-25m.zip

In [None]:
data_dir = "poc_data"
root_dir = data_dir + "/ml-latest-small/"

if not os.path.exists(data_dir):
    !mkdir $data_dir

dataset_file = f"{data_dir}/ml-latest-small.zip"


if not os.path.exists(dataset_file):
    !cd $data_dir && wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
    !cd $data_dir && unzip ml-latest-small.zip
    !ls $root_dir
else:
    print(dataset_file + " already exists")

### Create an Amazon S3 bucket
Amazon S3 bucket names are globally unique. To create a unique bucket name, the code below will append the string `personalize-os-ranking` to your AWS account number. Then it creates a bucket with this name in the same region. Amazon Personalize needs to be able to read the contents of your S3 bucket. So we also add a bucket policy which allows that.
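
A bucket policy of the kind mentioned above typically looks like the following sketch. The real policy is applied inside `create_s3_bucket` in `utils.py`, which is not shown here, so treat the function name `personalize_bucket_policy`, the statement ID, the placeholder bucket name, and the exact list of actions as assumptions. The service principal `personalize.amazonaws.com` is the one Amazon Personalize uses.

```python
import json


def personalize_bucket_policy(bucket_name):
    """Build a bucket policy that lets Amazon Personalize read a bucket.

    Hypothetical helper for illustration; adjust the actions to your needs.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }


# Placeholder bucket name: a fake account number plus the notebook's suffix.
policy = personalize_bucket_policy("123456789012-personalize-os-ranking")
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

A JSON string like this could then be passed as the `Policy` argument of the S3 client's `put_bucket_policy` call.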

In [None]:
bucket_name = create_s3_bucket("personalize-os-ranking", region)
bucket_name

### Create an IAM role
By default, the Personalize service does not have permission to access the data we uploaded into the S3 bucket in our account. In order to grant access to the Personalize service to read our CSVs, we need to set a Bucket Policy and create an IAM role that the Amazon Personalize service will assume. Let's set all of that up.

Amazon Personalize needs the ability to assume roles in AWS in order to have the permissions to execute certain tasks. Let's create an IAM role and attach the required policies to it. The code below attaches very permissive policies; please use more restrictive policies for any production application.

The `create_iam_role` function is defined in the `utils.py` file, which also contains other reusable functions.
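
For orientation, the trust relationship that lets Amazon Personalize assume a role usually looks like the sketch below. What `create_iam_role` actually does is defined in `utils.py` and not shown here, so the helper name `personalize_trust_policy` is hypothetical and this only illustrates the general shape of such a document.

```python
import json


def personalize_trust_policy():
    """Return a trust policy that allows Amazon Personalize to assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


trust_policy_json = json.dumps(personalize_trust_policy(), indent=2)
print(trust_policy_json)
```

A document like this is what the IAM client's `create_role` call expects as its `AssumeRolePolicyDocument` argument. Permission policies (for example, read access to the bucket) are attached to the role separately.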

In [None]:
role_arn = create_iam_role("personalize-os-ranking-role", bucket_name)
role_arn

## Create Datasets

In [None]:
resource_name = 'personalize-os-ranking'

### Create Dataset Group

The highest level of isolation and abstraction with Amazon Personalize is a dataset group. Information stored within one of these dataset groups has no impact on any other dataset group or models created from one - they are completely isolated. This allows you to run many experiments and is part of how we keep your models private and fully trained only on your data.

Before importing the data prepared earlier, there needs to be a dataset group and a dataset added to it that handles the interactions.

Dataset groups can house the following types of information:

- User-item-interactions
- Event streams (real-time interactions)
- User metadata
- Item metadata

We need to create the dataset group that will contain our three datasets.


The following cell will create a new dataset group with the name `personalize-os-ranking`.

In [None]:
try:
    create_dataset_group_response = personalize.create_dataset_group(
        name = resource_name
    )
    dataset_group_arn = create_dataset_group_response['datasetGroupArn']
    print('dataset_group_arn: {}'.format(dataset_group_arn))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this datasetgroup.')
    dsgs = personalize.list_dataset_groups(
        maxResults=100
    )
    
    for dsg in dsgs['datasetGroups']:
        #print(dsg)
        if dsg['name'] == resource_name:
            dataset_group_arn = dsg['datasetGroupArn']
            print(f"Using existing dsg: {dataset_group_arn}")

#### Wait for Dataset Group to Have ACTIVE Status
Before we can use the dataset group in any of the steps below, it must be active. This can take a minute or two. Execute the cell below and wait for it to show the ACTIVE status. It checks the status of the dataset group every 20 seconds, up to a maximum of 3 hours.

In [None]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(20)
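
The same wait-until-ACTIVE pattern recurs throughout this notebook, so it can be factored into a small helper. The sketch below is one possible shape and not part of the notebook's `utils.py`: the name `wait_for_status` and its parameters are assumptions, and the demo uses a fake status source so it runs without AWS access.

```python
import time


def wait_for_status(get_status, poll_seconds=20, timeout_seconds=3 * 60 * 60,
                    terminal=("ACTIVE", "CREATE FAILED"), sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or time runs out.

    Returns the last status seen (None if get_status was never called).
    """
    deadline = time.time() + timeout_seconds
    status = None
    while time.time() < deadline:
        status = get_status()
        print("status:", status)
        if status in terminal:
            break
        sleep(poll_seconds)
    return status


# Demo with a fake status source instead of a real Personalize call.
fake_statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
final = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print("final:", final)
```

With the client created earlier, the callable could be `lambda: personalize.describe_dataset_group(datasetGroupArn=dataset_group_arn)["datasetGroup"]["status"]`.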

## Create Schemas

### Create Interaction Schema
Amazon Personalize requires a schema for each dataset, so it can map the columns in our CSVs to fields for model training. Each schema is declared in JSON using the [Apache Avro](https://avro.apache.org/) format.

First, define a schema to tell Amazon Personalize what type of dataset you are uploading. There are several reserved and mandatory keywords required in the schema, based on the type of dataset. More detailed information can be found in the [documentation](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html).

Here, you will create a schema for interactions data. The interactions dataset has three required fields: `USER_ID`, `ITEM_ID`, and `TIMESTAMP`. These must be defined in the same order in the schema as they appear in the dataset.

The `TIMESTAMP` represents when the user interacted with an item and must be expressed in Unix epoch time, in seconds. For this dataset we also have an `EVENT_TYPE` column.

In [None]:
interaction_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        },
        { 
            "name": "EVENT_TYPE",
            "type": "string"
        }
    ],
    "version": "1.0"
}
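
Since the schema fields and the CSV columns have to line up, a quick local check before uploading can save a failed import. The sketch below is an assumption-laden illustration: `header_matches_schema` is a hypothetical helper, the schema is repeated in trimmed form so the block runs on its own, and the CSV rows are made up.

```python
import csv
import io


def header_matches_schema(csv_text, schema):
    """Return True if the CSV header equals the schema field names, in order."""
    header = next(csv.reader(io.StringIO(csv_text)))
    expected = [field["name"] for field in schema["fields"]]
    return header == expected


# Trimmed copy of the interactions schema, repeated so this block runs alone.
example_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
        {"name": "EVENT_TYPE", "type": "string"},
    ],
    "version": "1.0",
}

# Made-up CSV content: one with matching columns, one with swapped columns.
good_csv = "USER_ID,ITEM_ID,TIMESTAMP,EVENT_TYPE\n1,31,964982703,Click\n"
bad_csv = "ITEM_ID,USER_ID,TIMESTAMP,EVENT_TYPE\n31,1,964982703,Click\n"

print(header_matches_schema(good_csv, example_schema))
print(header_matches_schema(bad_csv, example_schema))
```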

In [None]:
try:
    interaction_schema_response = personalize.create_schema(
        name = f"{resource_name}-interaction-schema",
        schema = json.dumps(interaction_schema)
        )
    interaction_schema_arn = interaction_schema_response['schemaArn']
    print('interaction_schema_arn:\n', interaction_schema_arn)
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this schema.')
    schemas = personalize.list_schemas(maxResults=100)['schemas']
    for schema_response in schemas:
        if schema_response['name'] == f"{resource_name}-interaction-schema":
            interaction_schema_arn = schema_response['schemaArn']
            print(f"Using existing schema: {interaction_schema_arn}")


### Create User Schema

Here, you will create a schema for user data, which requires the `USER_ID` field and at least one additional metadata field, in this case `GENDER`. These must be defined in the same order in the schema as they appear in the dataset.

In [None]:
users_schema = {
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "GENDER",
            "type": "string",
            "categorical": True
        }
    ],
    "version": "1.0"
}
    
try:
    create_schema_response = personalize.create_schema(
        name = f"{resource_name}-user-schema",
        schema = json.dumps(users_schema)
    )
    print(json.dumps(create_schema_response, indent=2))
    users_schema_arn = create_schema_response['schemaArn']
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this schema.')
    schemas = personalize.list_schemas(maxResults=100)['schemas']
    for schema_response in schemas:
        if schema_response['name'] == f"{resource_name}-user-schema":
            users_schema_arn = schema_response['schemaArn']
            print(f"Using existing schema: {users_schema_arn}")

### Create Items Schema

Here, you will create a schema for item metadata, defining the `ITEM_ID`, `GENRES`, `YEAR`, and `CREATION_TIMESTAMP` fields. These must be defined in the same order in the schema as they appear in the dataset.

In [None]:
items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "GENRES",
            "type": "string",
            "categorical": True
        },
        {
            "name": "YEAR",
            "type": "int"
        },
        {
            "name": "CREATION_TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}
    
try:
    create_schema_response = personalize.create_schema(
        name = f"{resource_name}-item-schema",
        schema = json.dumps(items_schema)
    )
    items_schema_arn = create_schema_response['schemaArn']
    print(json.dumps(create_schema_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this schema.')
    schemas = personalize.list_schemas(maxResults=100)['schemas']
    for schema_response in schemas:
        if schema_response['name'] == f"{resource_name}-item-schema":
            items_schema_arn = schema_response['schemaArn']
            print(f"Using existing schema: {items_schema_arn}")

### Prepare the Interactions data and Create Interaction dataset file

Since this is a dataset of explicit-feedback movie ratings, each interaction carries a rating from 0.5 to 5 stars. We want to include only movies that were "liked" by the users, and simulate the data that would be gathered by a VOD platform. To do that, we drop all interactions rated 1 or lower and create two event types: "Click" and "Watch". Every movie rated above 1 becomes a "Click" event, and every movie rated above 3 becomes both a "Click" and a "Watch" event.

Note that for a real data set you would actually model based on implicit feedback such as clicks, watches and/or explicit feedback such as ratings, likes etc.


Amazon Personalize has default column names for users, items, and timestamp. These default column names are `USER_ID`, `ITEM_ID`, `TIMESTAMP`. The final modification to the dataset is to replace the existing column headers with the default headers.

Finally, we upload the file to the Amazon S3 bucket.
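
Before running the pandas version, it may help to see the rating-to-event mapping in isolation. The sketch below mirrors the thresholds used in the cell that follows (`rating > 1` for "Click", `rating > 3` for "Watch"). The helper name `events_for_rating` and the sample rows are made up for illustration.

```python
def events_for_rating(rating):
    """Return the simulated event types for one explicit rating."""
    events = []
    if rating > 1:
        events.append("Click")
    if rating > 3:
        events.append("Watch")
    return events


# Made-up (userId, movieId, rating, timestamp) rows for illustration.
ratings = [
    (1, 10, 5.0, 964982703),
    (1, 20, 2.5, 964981247),
    (2, 10, 1.0, 964982224),
]

interactions = [
    {"USER_ID": user, "ITEM_ID": movie, "TIMESTAMP": ts, "EVENT_TYPE": event}
    for user, movie, rating, ts in ratings
    for event in events_for_rating(rating)
]
# Sort by time, as the notebook does before writing the CSV.
interactions.sort(key=lambda row: row["TIMESTAMP"])
for row in interactions:
    print(row)
```

A highly rated movie therefore contributes two rows, one per event type, while a movie rated 1 or lower contributes none.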

In [None]:
original_data = pd.read_csv(root_dir +'/ratings.csv')
original_data.head(5)

arb_time_stamp = original_data.iloc[50]['timestamp']

watched_df = original_data.copy()
watched_df = watched_df[watched_df['rating'] > 3]
watched_df = watched_df[['userId', 'movieId', 'timestamp']]
watched_df['EVENT_TYPE']='Watch'
print(watched_df.head())
print(watched_df.shape)
clicked_df = original_data.copy()
clicked_df = clicked_df[clicked_df['rating'] > 1]
clicked_df = clicked_df[['userId', 'movieId', 'timestamp']]
clicked_df['EVENT_TYPE']='Click'
clicked_df.head()
print(clicked_df.head())
print(clicked_df.shape)
interactions_df = clicked_df.copy()
interactions_df = pd.concat([interactions_df,watched_df], axis = 0)
print(interactions_df.shape)
interactions_df.sort_values("timestamp", axis = 0, ascending = True, 
                 inplace = True, na_position ='last') 
interactions_df.rename(columns = {'userId':'USER_ID', 'movieId':'ITEM_ID', 
                              'timestamp':'TIMESTAMP'}, inplace = True) 
interactions_filename = "interactions.csv"
interactions_df.to_csv((root_dir +"/"+interactions_filename), index=False, float_format='%.0f')


upload_to_s3(root_dir +"/"+interactions_filename, bucket_name, interactions_filename)

### Prepare the Item Metadata and Create Items dataset file

Next, we load the `movies.csv` file and confirm the data is in a good state. This is a fairly small dataset with just the movieId, the title, and the list of genres that apply to each entry. However, there is additional data available in the MovieLens dataset. For instance, the title includes the year of the movie's release, so let's extract that into another metadata column. We then remove rows with null values.

From an item metadata perspective, we only want to include information that is relevant to training a model and/or filtering results, so we will drop the title column, and keep the genre information.
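
The year extraction can be tried out on a few titles in plain Python. Note that this sketch uses a stricter pattern than the notebook cell below (exactly four digits in parentheses at the end of the title), which is an assumption of this example rather than the notebook's own regular expression. The helper name `extract_year` is hypothetical.

```python
import re

# Four digits in parentheses at the end of the title, e.g. "Heat (1995)".
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def extract_year(title):
    """Return the release year as an int, or None if the title has none."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


titles = [
    "Toy Story (1995)",
    "City of Lost Children, The (Cité des enfants perdus, La) (1995)",
    "Untitled Project",
]
for title in titles:
    print(title, "->", extract_year(title))
```

Titles without a recognizable year come back as `None`, which corresponds to the rows that are dropped as nulls.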

Finally, we will add a new dataframe to help us generate a creation timestamp. If you don’t provide the CREATION_TIMESTAMP for an item, the model infers this information from the interaction dataset and uses the timestamp of the item’s earliest interaction as its corresponding release date. If an item doesn’t have an interaction, its release date is set as the timestamp of the latest interaction in the training set, and it is considered a new item.

For the current example we use a fixed date (January 1, 2022) as the creation timestamp because the actual creation timestamp is unknown. In your use case, provide the appropriate creation timestamp for each item, for example the time the item was added to your platform.
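
The fallback rule described above, for items without a `CREATION_TIMESTAMP`, can be written out in a few lines. This is only an illustration of the rule as stated in the text, not the service's implementation; the function name and the sample data are hypothetical.

```python
def infer_creation_timestamps(item_ids, interactions):
    """Mimic the described fallback for items without CREATION_TIMESTAMP.

    interactions: iterable of (item_id, timestamp) pairs.
    """
    interactions = list(interactions)
    earliest = {}
    for item_id, ts in interactions:
        if item_id not in earliest or ts < earliest[item_id]:
            earliest[item_id] = ts
    # Items with no interactions get the latest timestamp in the data.
    latest_overall = max((ts for _, ts in interactions), default=None)
    return {
        item_id: earliest.get(item_id, latest_overall)
        for item_id in item_ids
    }


example = infer_creation_timestamps(
    ["a", "b", "c"],
    [("a", 100), ("a", 50), ("b", 200)],
)
print(example)
```

Here item `c` has no interactions, so it receives the latest timestamp in the data and would be treated as a new item.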

Amazon Personalize has a default column for `ITEM_ID` that will map to our `movieId`. We will flesh out more information by specifying `GENRES` and `YEAR` as well. Finally, we save the file and upload it to Amazon S3.

In [None]:
original_data = pd.read_csv(root_dir +'/movies.csv')
original_data
original_data['year'] = original_data['title'].str.extract(r'.*\((.*)\).*', expand=False)
original_data.head(5)
original_data = original_data.dropna(axis=0)
original_data.isnull().sum()
itemmetadata_df = original_data.copy()
itemmetadata_df = itemmetadata_df[['movieId', 'genres', 'year']]
itemmetadata_df.head()
ts = int(datetime(2022, 1, 1, 0, 0).timestamp())  # portable; strftime('%s') is platform-specific


itemmetadata_df['CREATION_TIMESTAMP'] = ts
itemmetadata_df.rename(columns = {'genres':'GENRES', 'movieId':'ITEM_ID', 'year':'YEAR'}, inplace = True) 
items_filename = "item-meta.csv"
itemmetadata_df.to_csv((root_dir+"/"+items_filename), index=False, float_format='%.0f')

upload_to_s3(root_dir +"/"+items_filename, bucket_name, items_filename)

### Create Users dataset file

This dataset does not contain any user metadata, so we will create a fake metadata field. For this example, we randomly assign a gender to each user, with equal probability of male and female. Finally, we save the file and upload it to Amazon S3.
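
If repeatable results matter, the random assignment can be seeded. The notebook cell below uses `np.random.choice` without a seed. The sketch here instead uses the standard library's `random.Random` with a fixed seed, which is an assumption of this example; the helper name `assign_genders` is hypothetical.

```python
import random


def assign_genders(user_ids, seed=0):
    """Assign 'female' or 'male' to each user with equal probability."""
    rng = random.Random(seed)  # local generator so the result is repeatable
    return {user_id: rng.choice(["female", "male"]) for user_id in user_ids}


genders = assign_genders(range(1, 11), seed=42)
print(genders)
```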

In [None]:
# get all unique user ids from the interaction dataset
user_ids = interactions_df['USER_ID'].unique()
user_data = pd.DataFrame()
user_data["USER_ID"]= user_ids
user_data
possible_genders = ['female', 'male']
random = np.random.choice(possible_genders, len(user_data.index), p=[0.5, 0.5])

user_data["GENDER"] = random
user_data
# Saving the data as a CSV file
users_filename = "users.csv"
user_data.to_csv((root_dir+"/"+users_filename), index=False, float_format='%.0f')

upload_to_s3(root_dir +"/"+users_filename, bucket_name, users_filename)

### Create Interaction dataset
With a schema created, you can create an interactions dataset within the dataset group. Note that this does not load the data yet, but creates a schema of what the data looks like.

In [None]:
try:
    interactions_dataset_response = personalize.create_dataset(
        datasetType = 'INTERACTIONS',
        datasetGroupArn = dataset_group_arn,
        schemaArn = interaction_schema_arn,
        name = resource_name
    )
    interaction_dataset_arn = interactions_dataset_response['datasetArn']
    print('interaction_dataset_arn:\n', interaction_dataset_arn)

except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset.',dataset_group_arn)
    datasets = personalize.list_datasets(
            datasetGroupArn=dataset_group_arn,
            maxResults=100
        )
    
    for dataset in datasets['datasets']:
        
        if dataset['name'] == resource_name and dataset['datasetType'] == "INTERACTIONS":
            interaction_dataset_arn = dataset['datasetArn']
            print(f"Using Interaction dataset: {interaction_dataset_arn}")

### Create Users dataset

With a schema created, you can create a users dataset within the dataset group. Note that this does not load the data yet, but creates a schema of what the data looks like.

In [None]:
try:
    dataset_type = "USERS"
    create_dataset_response = personalize.create_dataset(
        name = resource_name,
        datasetType = dataset_type,
        datasetGroupArn = dataset_group_arn,
        schemaArn = users_schema_arn
    )

    users_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))

except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset.')
    datasets = personalize.list_datasets(
            datasetGroupArn=dataset_group_arn,
            maxResults=100
        )
    for dataset in datasets['datasets']:
        
        if dataset['name'] == resource_name and dataset['datasetType'] == "USERS":
            users_dataset_arn = dataset['datasetArn']
            print(f"Using Users dataset: {users_dataset_arn}")

### Create Items dataset

With a schema created, you can create an items dataset within the dataset group. Note that this does not load the data yet, but creates a schema of what the data looks like.

In [None]:
try:
    dataset_type = "ITEMS"
    create_dataset_response = personalize.create_dataset(
        name = resource_name,
        datasetType = dataset_type,
        datasetGroupArn = dataset_group_arn,
        schemaArn = items_schema_arn
    )

    items_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))
    
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset.')
    datasets = personalize.list_datasets(
            datasetGroupArn=dataset_group_arn,
            maxResults=100
        )
    for dataset in datasets['datasets']:
        if dataset['name'] == resource_name and dataset['datasetType'] == "ITEMS":
            items_dataset_arn = dataset['datasetArn']
            print(f"Using Items dataset: {items_dataset_arn}")

#### Wait for creation

Let's wait until all the datasets have been created.

In [None]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_response = personalize.describe_dataset(
        datasetArn = interaction_dataset_arn
    )
    status = describe_dataset_response["dataset"]["status"] 
    print("{} : {}".format(interaction_dataset_arn, status))
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
    time.sleep(10)
    
while time.time() < max_time:
    describe_dataset_response = personalize.describe_dataset(
        datasetArn = items_dataset_arn
    )
    status =  describe_dataset_response["dataset"]['status']
    print("{} : {}".format(items_dataset_arn, status))
    
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)
    
while time.time() < max_time:
    describe_dataset_response = personalize.describe_dataset(
        datasetArn = users_dataset_arn
    )
    status =  describe_dataset_response["dataset"]['status']
    print("{} : {}".format(users_dataset_arn, status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

### Import Interactions data

Earlier you created the dataset group and the interactions dataset to house your information, so now you will execute an import job that will load the interactions data from the S3 bucket into the Amazon Personalize dataset.

In [None]:
try:
    
    dsImportJobName = resource_name + 'Interactions'

    interactions_dij_response = personalize.create_dataset_import_job(
        jobName =  dsImportJobName,
        datasetArn = interaction_dataset_arn,
        dataSource = {
            "dataLocation": "s3://{}/{}".format(bucket_name, interactions_filename)
        },
        roleArn = role_arn
    )

    interactions_dij_arn = interactions_dij_response['datasetImportJobArn']
    print('interactions_dij_arn: ', interactions_dij_arn)
    
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset import job.')
    dijs = personalize.list_dataset_import_jobs(
                datasetArn=interaction_dataset_arn,
                maxResults=100
            )
    for dij in dijs['datasetImportJobs']:
        if dij['jobName'] == dsImportJobName:
            interactions_dij_arn = dij['datasetImportJobArn']
            print(f"Using Interactions dataset import job: {interactions_dij_arn}")

### Import Items data

Earlier you created the dataset group and the items dataset to house your information. Now you will execute an import job that will load the item data from the S3 bucket into the Amazon Personalize dataset.

In [None]:
try:
    create_dataset_import_job_response = personalize.create_dataset_import_job(
        jobName = f"{resource_name}Items",
        datasetArn = items_dataset_arn,
        dataSource = {
            "dataLocation": "s3://{}/{}".format(bucket_name, items_filename)
        },
        roleArn = role_arn
    )

    items_dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
    print(json.dumps(create_dataset_import_job_response, indent=2))
    
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset import job.')
    dijs = personalize.list_dataset_import_jobs(
                datasetArn=items_dataset_arn,
                maxResults=100
            )
    for dij in dijs['datasetImportJobs']:
        if dij['jobName'] == f"{resource_name}Items":
            items_dataset_import_job_arn = dij['datasetImportJobArn']
            print(f"Using Items dataset import job: {items_dataset_import_job_arn}")

### Import Users dataset

Earlier you created the dataset group and the users dataset to house your information. Now you will execute an import job that will load the user data from the S3 bucket into the Amazon Personalize dataset.

In [None]:
try:
    create_dataset_import_job_response = personalize.create_dataset_import_job(
        jobName = f"{resource_name}Users",
        datasetArn = users_dataset_arn,
        dataSource = {
            "dataLocation": "s3://{}/{}".format(bucket_name, users_filename)
        },
        roleArn = role_arn
    )

    users_dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
    print(json.dumps(create_dataset_import_job_response, indent=2))
    
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset import job.')
    dijs = personalize.list_dataset_import_jobs(
                datasetArn=users_dataset_arn,
                maxResults=100
            )
    for dij in dijs['datasetImportJobs']:
        if dij['jobName'] == f"{resource_name}Users":
            users_dataset_import_job_arn = dij['datasetImportJobArn']
            print(f"Using Users dataset import job: {users_dataset_import_job_arn}")

#### Wait for creation

Before we can use the datasets, the import jobs must be active. Execute the cell below and wait for it to show the ACTIVE status for each job. It polls the interactions import job every 10 seconds and the items and users import jobs every 60 seconds, up to a combined maximum of 3 hours.

Importing the data can take some time, depending on the size of the dataset. In this workshop, the data import jobs should take around 15 minutes. While you're waiting you can learn more about Datasets and Schemas in [the documentation](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html).

In [None]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dij_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = interactions_dij_arn
    )
    dataset_import_job = describe_dij_response["datasetImportJob"]
    if "latestDatasetImportJobRun" not in dataset_import_job:
        status = dataset_import_job["status"]
    else:
        status = dataset_import_job["latestDatasetImportJobRun"]["status"]
    print("{} : {}".format(interactions_dij_arn, status))
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
    time.sleep(10)
    
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = items_dataset_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("{} : {}".format(items_dataset_import_job_arn, status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)
    
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = users_dataset_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("{} : {}".format(users_dataset_import_job_arn, status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

### Create Solutions

Some use cases require a custom implementation. In this example we will use the `aws-personalized-ranking` recipe.

In Amazon Personalize, a specific variation of an algorithm is called a recipe. Different recipes are suitable for different situations. A trained model is called a solution, and each solution can have many versions that relate to a given volume of data when the model was trained.

Personalized Ranking is an interesting application of HRNN. Instead of just recommending what is most probable for the user in question, this algorithm takes in a list of items as well as a user. The items are then returned in order of most probable relevance for the user. This is useful when you want to filter on unique categories for which you have no item metadata to build a filter, or when you have a broad collection that you would like better ordered for a particular user.

For our use case, using the MovieLens data, we could imagine that a Video on Demand application may want to create a shelf of comic book movies, or movies by a specific director. We can generate these lists based on metadata we have. We would use personalized ranking to re-order the list of movies for each user. 
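
The shelf idea can be sketched without calling the service. The example below builds a candidate list from item metadata and then reorders it according to a ranking response. The metadata and the response are made up, and the helper names `build_shelf` and `reorder` are hypothetical. The response only imitates the `personalizedRanking` list of `itemId` entries that the `personalize-runtime` `get_personalized_ranking` call returns.

```python
def build_shelf(items, genre):
    """Pick the item IDs whose metadata contains the given genre."""
    return [item_id for item_id, genres in items.items() if genre in genres]


def reorder(shelf, ranking_response):
    """Order shelf items as ranked; unranked items keep their order at the end."""
    ranked = [entry["itemId"] for entry in ranking_response["personalizedRanking"]]
    position = {item_id: i for i, item_id in enumerate(ranked)}
    return sorted(shelf, key=lambda item_id: position.get(item_id, len(ranked)))


# Made-up metadata and a made-up response, for illustration only.
items = {
    "1": ["Adventure", "Comedy"],
    "2": ["Action"],
    "3": ["Comedy"],
    "4": ["Comedy", "Romance"],
}
fake_response = {"personalizedRanking": [{"itemId": "4"}, {"itemId": "1"}]}

shelf = build_shelf(items, "Comedy")
print(reorder(shelf, fake_response))
```

Once a campaign exists, the real response would come from `get_personalized_ranking`, called with the campaign ARN, the user ID, and the shelf as `inputList`.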

First you create a solution using the recipe. Although you provide the dataset group ARN in this step, the model is not yet trained. Think of the solution as an identifier rather than a trained model.


In [None]:
try:
    recipe_arn = "arn:aws:personalize:::recipe/aws-personalized-ranking"

    create_solution_response = personalize.create_solution(
        name = resource_name,
        recipeArn = recipe_arn,
        datasetGroupArn = dataset_group_arn)
    solution_arn = create_solution_response["solutionArn"]
    print('solution arn:', solution_arn)

except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this Solution.')
    solutions = personalize.list_solutions(
                datasetGroupArn=dataset_group_arn,
                maxResults=100
            )
    for solution in solutions['solutions']:
        if solution['name'] == resource_name:
            solution_arn = solution['solutionArn']
            print(f"Solution Arn: {solution_arn}")
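The fallback lookup above reads only the first page of results. If a dataset group could hold more than `maxResults` solutions, one option is to follow the `nextToken` value returned by the list call. The following is a minimal sketch under that assumption; `find_arn_by_name` and `fake_list_solutions` are hypothetical names, and the fake function only mimics the shape of a `list_solutions` response so the example runs offline.

```python
def find_arn_by_name(list_call, result_key, arn_key, name, **kwargs):
    """Follow nextToken pages until a resource with the given name is found."""
    next_token = None
    while True:
        if next_token:
            kwargs["nextToken"] = next_token
        page = list_call(**kwargs)
        for summary in page.get(result_key, []):
            if summary.get("name") == name:
                return summary[arn_key]
        next_token = page.get("nextToken")
        if not next_token:
            return None  # no more pages and no match


def fake_list_solutions(datasetGroupArn, maxResults=100, nextToken=None):
    """Hypothetical two-page response, shaped like list_solutions output."""
    pages = {
        None: {
            "solutions": [{"name": "other", "solutionArn": "arn:other"}],
            "nextToken": "page-2",
        },
        "page-2": {"solutions": [{"name": "demo", "solutionArn": "arn:demo"}]},
    }
    return pages[nextToken]


print(find_arn_by_name(fake_list_solutions, "solutions", "solutionArn", "demo",
                       datasetGroupArn="arn:dataset-group"))
```

With the real client, the call would look like `find_arn_by_name(personalize.list_solutions, "solutions", "solutionArn", resource_name, datasetGroupArn=dataset_group_arn)`.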

#### Wait for creation

In [None]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_solution_response = personalize.describe_solution(
        solutionArn = solution_arn
    )
    status = describe_solution_response["solution"]["status"] 
    print("{} : {}".format(solution_arn, status))
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
    time.sleep(10)
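The wait loops in this notebook all follow the same pattern: describe the resource, print its status, and stop on `ACTIVE` or `CREATE FAILED`. A possible refactoring is sketched below. The helper name `wait_for_status` and its parameters are assumptions, not part of the Amazon Personalize API, and the `failureReason` lookup is defensive because not every describe response is guaranteed to carry that field. A fake describe function stands in for the real client so the example runs without AWS access.

```python
import time


def wait_for_status(describe, resource_key, timeout_s=3 * 60 * 60, poll_s=10,
                    sleep=time.sleep):
    """Poll describe() until the resource is ACTIVE; raise on failure or timeout."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resource = describe()[resource_key]
        status = resource["status"]
        print(status)
        if status == "ACTIVE":
            return resource
        if status == "CREATE FAILED":
            # Surface the failure instead of silently leaving the loop.
            raise RuntimeError(resource.get("failureReason", "creation failed"))
        sleep(poll_s)
    raise TimeoutError(f"still not ACTIVE after {timeout_s} seconds")


# Hypothetical stand-in that walks through a typical status sequence.
statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])


def fake_describe():
    return {"solution": {"status": next(statuses)}}


result = wait_for_status(fake_describe, "solution", sleep=lambda seconds: None)
```

Against the real service, the call might be `wait_for_status(lambda: personalize.describe_solution(solutionArn=solution_arn), "solution")`. Raising on failure or timeout stops later cells from running against a resource that never became active.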

### Create Solution Version

Once you have a solution, you need to create a solution version to complete the model training. Training can take a while: 25 minutes or more, and about 35 minutes on average for this recipe with our dataset. We use a while loop to poll until the task is completed.

In [None]:
try:
    create_sv_response = personalize.create_solution_version(
            name=f"{resource_name}-solution-version",
            solutionArn = solution_arn,
            trainingMode = 'FULL'
        )
    sv_arn = create_sv_response["solutionVersionArn"]
    print('solutionVersionArn:', sv_arn)

except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this Solution version.')
    # Fall back to the most recently created version of this solution.
    solution_versions = personalize.list_solution_versions(
        solutionArn=solution_arn,
        maxResults=100
    )['solutionVersions']
    latest = max(solution_versions, key=lambda sv: sv['creationDateTime'])
    sv_arn = latest['solutionVersionArn']
    print(f"Solution version Arn: {sv_arn}")


#### Wait for creation

In [None]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_sv_response = personalize.describe_solution_version(
        solutionVersionArn = sv_arn
    )
    status = describe_sv_response["solutionVersion"]["status"] 
    print("{} : {}".format(sv_arn, status))
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
    time.sleep(10)

### Create Campaign

Once a solution version is created, you can get recommendations from it and get a feel for its overall behavior.

For real-time recommendations, after you prepare and import data and create a solution version, you are ready to deploy that solution version to generate recommendations. You deploy a solution version by creating an Amazon Personalize campaign. If you are getting batch recommendations, you don't need to create a campaign. For more information, see [Getting batch recommendations and user segments](https://docs.aws.amazon.com/personalize/latest/dg/recommendations-batch.html).

A campaign is a hosted solution version: an endpoint that you can query for recommendations. Pricing is set by estimating throughput capacity (requests from users for personalization per second). When deploying a campaign, you set a minimum provisioned transactions per second (TPS) value. This service, like many within AWS, scales automatically based on demand, but if latency is critical, you may want to provision ahead for larger demand. For this POC and demo, the minimum provisioned TPS is set to 1. For more information, see the [pricing page](https://aws.amazon.com/personalize/pricing/).

Once we're satisfied with our solution version, we create a campaign for it. When creating a campaign, you specify the minimum transactions per second (`minProvisionedTPS`) that you expect to make against the service for this campaign. Personalize automatically scales the inference endpoint up and down to match demand, but never scales below `minProvisionedTPS`.
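To make the scaling floor concrete, here is a deliberately simplified model of the behavior described above: capacity follows demand but never drops below `minProvisionedTPS`. The function name `effective_tps` and the demand samples are made up for illustration; real scaling is not instantaneous and this sketch says nothing about billing.

```python
def effective_tps(min_provisioned_tps, demand_tps):
    """Simplified model: capacity follows demand but never drops below the floor."""
    return max(min_provisioned_tps, demand_tps)


# Made-up demand samples (requests per second), for illustration only.
for demand in [0, 1, 5]:
    print(demand, "->", effective_tps(1, demand))
```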

Let's create a campaign for our solution version with `minProvisionedTPS` set to 1.

In [None]:
try:
    create_campaign_response = personalize.create_campaign(
        name = resource_name,
        solutionVersionArn = sv_arn,
        minProvisionedTPS = 1)  # lowest capacity floor, enough for this demo
    campaign_arn = create_campaign_response["campaignArn"]
    print('campaign arn:', campaign_arn)

except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this Campaign.')
    campaigns = personalize.list_campaigns(
            solutionArn=solution_arn,
            maxResults=100
        )
    for campaign in campaigns['campaigns']:
        if campaign['name'] == resource_name:
            campaign_arn = campaign['campaignArn']
            print(f"Campaign Arn: {campaign_arn}")

#### Wait for creation

In [None]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_campaign_response = personalize.describe_campaign(
        campaignArn = campaign_arn
    )
    status = describe_campaign_response["campaign"]["status"] 
    print("{} : {}".format(campaign_arn, status))
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
    time.sleep(30)
    

## Storing useful variables
Before exiting this notebook, run the following cell to save the campaign ARN and the other variables for use in the next notebook.

In [None]:
%store campaign_arn
%store role_arn
%store interactions_filename
%store items_filename
%store users_filename
%store root_dir
%store dataset_group_arn
%store bucket_name
%store region