# Prepare Personalize and import data

In this lab we will prepare an Amazon Personalize dataset group and import our three datasets into it.

## Objectives

We will accomplish the following steps.

- Create schema resources in Amazon Personalize that define the layout of our three dataset files (CSVs)
- Create a dataset group in Amazon Personalize that will be used to receive our datasets
- Create a dataset in the Personalize dataset group for each of the three dataset types and schemas
    - Items: information about the products in the Retail Demo Store
    - Users: information about the users in the Retail Demo Store
    - Interactions: user-item interactions representing typical storefront behavior such as viewing products, adding products to a shopping cart, purchasing products, and so on
- Create dataset import jobs to import each of the three datasets into Personalize

## Setup

Just as in the first lab, we have to prepare our environment by importing dependencies and creating clients.

### Import dependencies

The following libraries are needed for this lab. Install the Python dependencies first.

In [None]:
import sys
!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install --upgrade --no-deps --force-reinstall botocore pandas

Modify `REGION_NAME` to match your environment. Set up AWS credentials before executing the following code (see https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-authentication.html).

In [None]:
import boto3
import json
import uuid
import time
from botocore.exceptions import ClientError

REGION_NAME = 'ap-northeast-1'
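
Before creating any resources, it can help to confirm which identity the configured credentials resolve to. The sketch below assumes an STS client created with `boto3.client('sts', region_name=REGION_NAME)`; `describe_caller` is a hypothetical helper name, not part of boto3.

```python
def describe_caller(sts_client):
    """Return a one-line summary of the identity behind the current credentials."""
    # get_caller_identity needs no special IAM permissions to call.
    identity = sts_client.get_caller_identity()
    return f"account={identity['Account']} arn={identity['Arn']}"

# Usage in the notebook (requires valid AWS credentials):
# print(describe_caller(boto3.client('sts', region_name=REGION_NAME)))
```

If the call raises an error, the credentials are probably missing or expired, and the later cells would fail for the same reason.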

### Create clients

We will need the following AWS service clients in this lab.

In [2]:
personalize = boto3.client('personalize', region_name=REGION_NAME)

## Configure Amazon Personalize

Now that we've prepared our three datasets and uploaded them to S3 we'll need to configure the Amazon Personalize service to understand our data so that it can be used to train models for generating recommendations.

### Create Schemas for Datasets

Amazon Personalize requires a schema for each dataset so it can map the columns in our CSVs to fields for model training. Each schema is declared in JSON using the [Apache Avro](https://avro.apache.org/) format.
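
Because the schema fields are matched to CSV columns by name, a quick local check of the CSV header against a schema can catch mismatches before an import job fails. This is a minimal sketch: `missing_columns`, `example_schema`, and `example_csv` are hypothetical names and sample data, not part of the lab files.

```python
import csv
import io

def missing_columns(csv_text, schema):
    """Return schema field names that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = {column.strip() for column in header}
    return [field["name"] for field in schema["fields"] if field["name"] not in present]

# Hypothetical two-field schema and CSV content, for illustration only.
example_schema = {
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "AGE", "type": "int"},
    ],
    "version": "1.0",
}
example_csv = "USER_ID,GENDER\n1,F\n"

print(missing_columns(example_csv, example_schema))  # prints ['AGE']
```

For a real file, the text of the first line could be read with `open(path).readline()` and passed in as `csv_text`.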

Let's define and create schemas in Personalize for our datasets.

#### Items Dataset Schema

In [None]:
items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "PRICE",
            "type": "float"
        },
        {
            "name": "CATEGORY_L1",
            "type": "string",
            "categorical": True,
        },
        {
            "name": "CATEGORY_L2",
            "type": "string",
            "categorical": True,
        },
        {
            "name": "PRODUCT_DESCRIPTION",
            "type": "string",
            "textual": True
        },
        {
            "name": "GENDER",
            "type": "string",
            "categorical": True,
        },
        {
            "name": "PROMOTED",
            "type": "string"
        },
    ],
    "version": "1.0"
}

try:
    create_schema_response = personalize.create_schema(
        name = "retaildemostore-products-items",
        domain = 'ECOMMERCE',
        schema = json.dumps(items_schema)
    )
    items_schema_arn = create_schema_response['schemaArn']
    print(json.dumps(create_schema_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this schema, seemingly')
    paginator = personalize.get_paginator('list_schemas')
    for paginate_result in paginator.paginate():
        for schema in paginate_result['schemas']:
            if schema['name'] == 'retaildemostore-products-items':
                items_schema_arn = schema['schemaArn']
                print(f"Using existing schema: {items_schema_arn}")
                break

#### Users Dataset Schema

In [None]:
users_schema = {
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "AGE",
            "type": "int"
        },
        {
            "name": "GENDER",
            "type": "string",
            "categorical": True,
        }
    ],
    "version": "1.0"
}

try:
    create_schema_response = personalize.create_schema(
        name = "retaildemostore-products-users",
        domain = "ECOMMERCE",
        schema = json.dumps(users_schema)
    )
    print(json.dumps(create_schema_response, indent=2))
    users_schema_arn = create_schema_response['schemaArn']
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this schema, seemingly')
    paginator = personalize.get_paginator('list_schemas')
    for paginate_result in paginator.paginate():
        for schema in paginate_result['schemas']:
            if schema['name'] == 'retaildemostore-products-users':
                users_schema_arn = schema['schemaArn']
                print(f"Using existing schema: {users_schema_arn}")
                break

#### Interactions Dataset Schema

In [None]:
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "EVENT_TYPE",  # "View", "Purchase", etc.
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        },
        {
            "name": "DISCOUNT",  # This is the contextual metadata - "Yes" or "No".
            "type": "string"
        },
    ],
    "version": "1.0"
}

try:
    create_schema_response = personalize.create_schema(
        name = "retaildemostore-products-interactions",
        domain = "ECOMMERCE",
        schema = json.dumps(interactions_schema)
    )
    print(json.dumps(create_schema_response, indent=2))
    interactions_schema_arn = create_schema_response['schemaArn']
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this schema, seemingly')
    paginator = personalize.get_paginator('list_schemas')
    for paginate_result in paginator.paginate():
        for schema in paginate_result['schemas']:
            if schema['name'] == 'retaildemostore-products-interactions':
                interactions_schema_arn = schema['schemaArn']
                print(f"Using existing schema: {interactions_schema_arn}")
                break
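
The three schema cells above share one pattern: try to create the schema, and if it already exists, look up its ARN by name. As a sketch, that pattern could be factored into a single function. `get_or_create_schema` is a hypothetical helper name; the client calls are the same ones used in the cells above.

```python
import json

def get_or_create_schema(personalize_client, name, schema, domain="ECOMMERCE"):
    """Create a schema, or look up its ARN by name if it already exists."""
    try:
        response = personalize_client.create_schema(
            name=name, domain=domain, schema=json.dumps(schema)
        )
        return response["schemaArn"]
    except personalize_client.exceptions.ResourceAlreadyExistsException:
        # Schemas cannot be fetched by name directly, so page through the list.
        paginator = personalize_client.get_paginator("list_schemas")
        for page in paginator.paginate():
            for existing in page["schemas"]:
                if existing["name"] == name:
                    return existing["schemaArn"]
        raise  # reported as a duplicate, but no schema with this name was listed

# Usage in the notebook:
# items_schema_arn = get_or_create_schema(
#     personalize, "retaildemostore-products-items", items_schema)
```

Returning from inside the loop also stops the pagination as soon as the schema is found, whereas the `break` in the cells above only leaves the inner loop.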

### Create and Wait for Dataset Group

Next we need to create the dataset group that will contain our three datasets. This is one of many Personalize operations that are asynchronous. That is, we call an API to create a resource and have to wait for it to become active.

#### Create Dataset Group

Note that we pass `ECOMMERCE` for the `domain` parameter here as well.

In [None]:
try:
    create_dataset_group_response = personalize.create_dataset_group(
        name = 'retaildemostore-products',
        domain = 'ECOMMERCE'
    )
    dataset_group_arn = create_dataset_group_response['datasetGroupArn']
    print(json.dumps(create_dataset_group_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset group, seemingly')
    paginator = personalize.get_paginator('list_dataset_groups')
    for paginate_result in paginator.paginate():
        for dataset_group in paginate_result['datasetGroups']:
            if dataset_group['name'] == 'retaildemostore-products':
                dataset_group_arn = dataset_group['datasetGroupArn']
                break
                
print(f'DatasetGroupArn = {dataset_group_arn}')

#### Wait for Dataset Group to Have ACTIVE Status

In [None]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(15)
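
The same create-then-poll pattern recurs for datasets and import jobs later in this lab. A minimal sketch of a reusable polling helper follows; `wait_for_status` and its parameters are hypothetical names. Unlike the loop above, which simply stops when the time limit passes, this version raises `TimeoutError`.

```python
import time

def wait_for_status(get_status, done=("ACTIVE", "CREATE FAILED"),
                    timeout_seconds=3 * 60 * 60, poll_seconds=15, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or the timeout passes."""
    deadline = time.time() + timeout_seconds
    while True:
        status = get_status()
        if status in done:
            return status
        if time.time() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout_seconds} seconds")
        sleep(poll_seconds)

# Usage in the notebook:
# status = wait_for_status(
#     lambda: personalize.describe_dataset_group(
#         datasetGroupArn=dataset_group_arn)["datasetGroup"]["status"])
```

Passing the status lookup as a function keeps the helper independent of any particular resource type, and the injectable `sleep` makes it easy to exercise without waiting.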

### Create Items Dataset

Next we will create the datasets in Personalize for our three dataset types. Let's start with the items dataset.

In [None]:
try:
    dataset_type = "ITEMS"
    create_dataset_response = personalize.create_dataset(
        name = "retaildemostore-products-items",
        datasetType = dataset_type,
        datasetGroupArn = dataset_group_arn,
        schemaArn = items_schema_arn
    )

    items_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset, seemingly')
    paginator = personalize.get_paginator('list_datasets')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for dataset in paginate_result['datasets']:
            if dataset['name'] == 'retaildemostore-products-items':
                items_dataset_arn = dataset['datasetArn']
                break
                
print(f'Items dataset ARN = {items_dataset_arn}')

### Create Users Dataset

In [None]:
try:
    dataset_type = "USERS"
    create_dataset_response = personalize.create_dataset(
        name = "retaildemostore-products-users",
        datasetType = dataset_type,
        datasetGroupArn = dataset_group_arn,
        schemaArn = users_schema_arn
    )

    users_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset, seemingly')
    paginator = personalize.get_paginator('list_datasets')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for dataset in paginate_result['datasets']:
            if dataset['name'] == 'retaildemostore-products-users':
                users_dataset_arn = dataset['datasetArn']
                break
                
print(f'Users dataset ARN = {users_dataset_arn}')

### Create Interactions Dataset

In [None]:
try:
    dataset_type = "INTERACTIONS"
    create_dataset_response = personalize.create_dataset(
        name = "retaildemostore-products-interactions",
        datasetType = dataset_type,
        datasetGroupArn = dataset_group_arn,
        schemaArn = interactions_schema_arn
    )

    interactions_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this dataset, seemingly')
    paginator = personalize.get_paginator('list_datasets')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for dataset in paginate_result['datasets']:
            if dataset['name'] == 'retaildemostore-products-interactions':
                interactions_dataset_arn = dataset['datasetArn']
                break
                
print(f'Interactions dataset ARN = {interactions_dataset_arn}')

### Wait for datasets to become active

It can take a minute or two for the datasets to be created. Let's wait for all three to become active.

In [None]:
%%time

dataset_arns = [ items_dataset_arn, users_dataset_arn, interactions_dataset_arn ]

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    for dataset_arn in reversed(dataset_arns):
        response = personalize.describe_dataset(
            datasetArn = dataset_arn
        )
        status = response["dataset"]["status"]

        if status == "ACTIVE":
            print(f'Dataset {dataset_arn} successfully completed')
            dataset_arns.remove(dataset_arn)
        elif status == "CREATE FAILED":
            print(f'Dataset {dataset_arn} failed')
            if response['dataset'].get('failureReason'):
                print('   Reason: ' + response['dataset']['failureReason'])
            dataset_arns.remove(dataset_arn)

    if len(dataset_arns) > 0:
        print('At least one dataset is still in progress')
        time.sleep(15)
    else:
        print("All datasets have completed")
        break

## Import Datasets to Personalize

In the following steps we will create import jobs with Personalize that will import the datasets from our S3 bucket into the service.

### Inspect permissions

By default, the Personalize service does not have permission to access the data we uploaded into the S3 bucket in our account. To grant the Personalize service access to read our CSVs, we need to set a bucket policy and create an IAM role that the Amazon Personalize service will assume.

Create a bucket in S3 with the following bucket policy, replacing `<bucket_name>` with the name of your bucket:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::<bucket_name>",
                "arn:aws:s3:::<bucket_name>/*"
            ],
            "Condition": {
                "Bool": {
                    "aws:SecureTransport": "false"
                }
            }
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket_name>",
                "arn:aws:s3:::<bucket_name>/*"
            ]
        }
    ]
}
```
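
Rather than editing the placeholder by hand, the policy document can be generated from the bucket name. The sketch below rebuilds the policy shown above as a Python dictionary; `build_bucket_policy` is a hypothetical helper name and `my-staging-bucket` is a placeholder.

```python
import json

def build_bucket_policy(bucket_name):
    """Build the bucket policy shown above for a given bucket name."""
    resources = [f"arn:aws:s3:::{bucket_name}", f"arn:aws:s3:::{bucket_name}/*"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Reject any request that does not use TLS.
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": resources,
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {   # Let the Personalize service read and write objects.
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": resources,
            },
        ],
    }

# "my-staging-bucket" is a placeholder bucket name.
print(json.dumps(build_bucket_policy("my-staging-bucket"), indent=2))
```

With a boto3 S3 client named `s3`, the result could be applied with `s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(build_bucket_policy(bucket_name)))`.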

Then upload `users.csv`, `items.csv`, and `interactions.csv` to the bucket.

The first cell below uploads the CSVs to the S3 staging bucket. The second cell displays the bucket policy so we can confirm it is in place.

In [None]:
BUCKET = '<bucket_name>'

items_filename = "items.csv"
users_filename = "users.csv"
interactions_filename = "interactions.csv"

s3_res = boto3.Session().resource('s3')
s3_res.Bucket(BUCKET).Object(items_filename).upload_file(items_filename)
s3_res.Bucket(BUCKET).Object(users_filename).upload_file(users_filename)
s3_res.Bucket(BUCKET).Object(interactions_filename).upload_file(interactions_filename)

In [None]:
s3 = boto3.client("s3", region_name=REGION_NAME)


response = s3.get_bucket_policy(Bucket = BUCKET)
print(json.dumps(json.loads(response['Policy']), indent=2))

Next, create an IAM role named `DCEGuidancePersonalizeS3` that Personalize will assume to access the S3 bucket.
Create an inline policy named `BucketAccess` for this role:
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket_name>",
                "arn:aws:s3:::<bucket_name>/*"
            ],
            "Effect": "Allow"
        }
    ]
}
```
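
The role and its inline policy can also be created from code instead of the console. The following is a sketch under a few assumptions: `iam_client` is a boto3 IAM client, the trust policy mirrors the `AssumeRolePolicyDocument` shown in the role output later in this section, and `create_personalize_role` and its two builder functions are hypothetical helper names.

```python
import json

def build_trust_policy():
    """Trust policy that lets the Personalize service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "personalize.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

def build_access_policy(bucket_name):
    """Inline policy granting the S3 permissions listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Action": ["s3:GetObject", "s3:ListBucket", "s3:PutObject"],
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
            "Effect": "Allow",
        }],
    }

def create_personalize_role(iam_client, role_name, bucket_name, policy_name="BucketAccess"):
    """Create the role, attach the inline policy, and return the role ARN."""
    role = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(build_trust_policy()),
    )
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(build_access_policy(bucket_name)),
    )
    return role["Role"]["Arn"]

# Usage in the notebook:
# role_arn = create_personalize_role(boto3.client("iam"), "DCEGuidancePersonalizeS3", BUCKET)
```

Running this requires IAM permissions to create roles, and `create_role` fails if a role with the same name already exists.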

We'll start by inspecting the role itself.

In [None]:
iam = boto3.client("iam")

role_name = "DCEGuidancePersonalizeS3"

response = iam.get_role(RoleName = role_name)
role_arn = response['Role']['Arn']
print(json.dumps(response['Role'], indent=2, default = str))

{
  "Path": "/",
  "RoleName": "retaildemostore-ap-northeast-1-PersonalizeS3",
  "RoleId": "AROATQW6HWVGVZF33GXKS",
  "Arn": "arn:aws:iam::242057983309:role/retaildemostore-ap-northeast-1-PersonalizeS3",
  "CreateDate": "2023-02-07 06:51:38+00:00",
  "AssumeRolePolicyDocument": {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "personalize.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  },
  "Description": "",
  "MaxSessionDuration": 3600,
  "RoleLastUsed": {}
}


Notice that the role has the same service principal as the bucket policy but this time with the `sts:AssumeRole` action. This is required so that Personalize can assume this role.
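
This trust relationship can also be checked programmatically rather than by eye. The sketch below inspects the `Role` dictionary returned by `iam.get_role`; `role_trusts_service` is a hypothetical helper name, and it only handles the simple statement shapes shown above (no conditions).

```python
def role_trusts_service(role, service="personalize.amazonaws.com"):
    """Return True if the role's trust policy lets `service` call sts:AssumeRole."""
    for statement in role["AssumeRolePolicyDocument"]["Statement"]:
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):
            continue  # e.g. a bare "*" principal; not what we are looking for
        services = principal.get("Service", [])
        if isinstance(services, str):
            services = [services]
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if (statement.get("Effect") == "Allow"
                and service in services
                and "sts:AssumeRole" in actions):
            return True
    return False

# Usage in the notebook:
# assert role_trusts_service(iam.get_role(RoleName=role_name)["Role"])
```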

Finally, we'll get the inline policy named `BucketAccess` that has the same S3 permissions as the bucket policy.

In [11]:
response = iam.get_role_policy(RoleName = role_name, PolicyName = 'BucketAccess')
print(json.dumps(response, indent=2))

{
  "RoleName": "retaildemostore-ap-northeast-1-PersonalizeS3",
  "PolicyName": "BucketAccess",
  "PolicyDocument": {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Action": [
          "s3:GetObject",
          "s3:ListBucket"
        ],
        "Resource": [
          "arn:aws:s3:::retail-demo-store-ap-northeast-1",
          "arn:aws:s3:::retail-demo-store-ap-northeast-1/*",
          "arn:aws:s3:::retaildemostore-base-w4qexcd5hhdg-buc-stackbucket-dsli02y8esva",
          "arn:aws:s3:::retaildemostore-base-w4qexcd5hhdg-buc-stackbucket-dsli02y8esva/*"
        ],
        "Effect": "Allow"
      },
      {
        "Action": [
          "s3:PutObject"
        ],
        "Resource": [
          "arn:aws:s3:::retaildemostore-base-w4qexcd5hhdg-buc-stackbucket-dsli02y8esva",
          "arn:aws:s3:::retaildemostore-base-w4qexcd5hhdg-buc-stackbucket-dsli02y8esva/*"
        ],
        "Effect": "Allow"
      }
    ]
  },
  "ResponseMetadata": {
    "RequestId": "c5e08d25-090c

### Create Import Jobs

With the permissions in place to allow Personalize to access our CSV files, let's create three import jobs to import each file into its respective dataset. Each import job can take several minutes to complete so we'll create all three import jobs and then wait for them all to complete. This allows them to import in parallel.

#### Create Items Dataset Import Job

In [23]:
import_job_suffix = str(uuid.uuid4())[:8]

items_create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "retaildemostore-products-items-" + import_job_suffix,
    datasetArn = items_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(BUCKET, items_filename)
    },
    roleArn = role_arn
)

items_dataset_import_job_arn = items_create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(items_create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:ap-northeast-1:242057983309:dataset-import-job/retaildemostore-products-items-f31e19b2",
  "ResponseMetadata": {
    "RequestId": "fb9d20c9-af39-4944-988e-246800739425",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 07 Jul 2023 07:10:16 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "132",
      "connection": "keep-alive",
      "x-amzn-requestid": "fb9d20c9-af39-4944-988e-246800739425"
    },
    "RetryAttempts": 0
  }
}


#### Create Users Dataset Import Job

In [24]:
users_create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "retaildemostore-products-users-" + import_job_suffix,
    datasetArn = users_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(BUCKET, users_filename)
    },
    roleArn = role_arn
)

users_dataset_import_job_arn = users_create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(users_create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:ap-northeast-1:242057983309:dataset-import-job/retaildemostore-products-users-f31e19b2",
  "ResponseMetadata": {
    "RequestId": "04c8ca35-aed8-4226-af49-30ba3f0d1158",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 07 Jul 2023 07:10:19 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "132",
      "connection": "keep-alive",
      "x-amzn-requestid": "04c8ca35-aed8-4226-af49-30ba3f0d1158"
    },
    "RetryAttempts": 0
  }
}


#### Create Interactions Dataset Import Job

In [25]:
interactions_create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "retaildemostore-products-interactions-" + import_job_suffix,
    datasetArn = interactions_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(BUCKET, interactions_filename)
    },
    roleArn = role_arn
)

interactions_dataset_import_job_arn = interactions_create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(interactions_create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:ap-northeast-1:242057983309:dataset-import-job/retaildemostore-products-interactions-f31e19b2",
  "ResponseMetadata": {
    "RequestId": "f7d6ec00-5d2a-425c-9a7c-db54253cce74",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 07 Jul 2023 07:10:22 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "139",
      "connection": "keep-alive",
      "x-amzn-requestid": "f7d6ec00-5d2a-425c-9a7c-db54253cce74"
    },
    "RetryAttempts": 0
  }
}
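
The three import job cells above differ only in the dataset ARN, the file name, and the job name, so they could be driven from a single loop. This is a sketch, with `start_import_jobs` as a hypothetical helper name; it uses the same `create_dataset_import_job` parameters as the cells above.

```python
def start_import_jobs(personalize_client, bucket, role_arn, datasets, suffix):
    """Start one import job per (label, dataset_arn, filename) entry.

    Returns a dict that maps each label to its import job ARN.
    """
    job_arns = {}
    for label, dataset_arn, filename in datasets:
        response = personalize_client.create_dataset_import_job(
            jobName=f"retaildemostore-products-{label}-{suffix}",
            datasetArn=dataset_arn,
            dataSource={"dataLocation": f"s3://{bucket}/{filename}"},
            roleArn=role_arn,
        )
        job_arns[label] = response["datasetImportJobArn"]
    return job_arns

# Usage in the notebook:
# job_arns = start_import_jobs(personalize, BUCKET, role_arn, [
#     ("items", items_dataset_arn, items_filename),
#     ("users", users_dataset_arn, users_filename),
#     ("interactions", interactions_dataset_arn, interactions_filename),
# ], import_job_suffix)
```

Because each call returns as soon as the job is accepted, the jobs still run in parallel on the service side.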


### Wait for Import Jobs to Complete

The import jobs can take 10-15 minutes to complete. While you're waiting, you can learn more about datasets and schemas here: https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html

We will wait for all three jobs to finish.

#### Wait for All Import Jobs to Complete

In [26]:
%%time

import_job_arns = [ items_dataset_import_job_arn, users_dataset_import_job_arn, interactions_dataset_import_job_arn ]

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    for job_arn in reversed(import_job_arns):
        import_job_response = personalize.describe_dataset_import_job(
            datasetImportJobArn = job_arn
        )
        status = import_job_response["datasetImportJob"]['status']

        if status == "ACTIVE":
            print(f'Import job {job_arn} successfully completed')
            import_job_arns.remove(job_arn)
        elif status == "CREATE FAILED":
            print(f'Import job {job_arn} failed')
            if import_job_response["datasetImportJob"].get('failureReason'):
                print('   Reason: ' + import_job_response["datasetImportJob"]['failureReason'])
            import_job_arns.remove(job_arn)

    if len(import_job_arns) > 0:
        print('At least one dataset import job still in progress')
        time.sleep(60)
    else:
        print("All import jobs have ended")
        break

At least one dataset import job still in progress
At least one dataset import job still in progress
At least one dataset import job still in progress
At least one dataset import job still in progress
Import job arn:aws:personalize:ap-northeast-1:242057983309:dataset-import-job/retaildemostore-products-interactions-f31e19b2 successfully completed
At least one dataset import job still in progress
Import job arn:aws:personalize:ap-northeast-1:242057983309:dataset-import-job/retaildemostore-products-users-f31e19b2 successfully completed
Import job arn:aws:personalize:ap-northeast-1:242057983309:dataset-import-job/retaildemostore-products-items-f31e19b2 successfully completed
All import jobs have ended
CPU times: user 206 ms, sys: 67.2 ms, total: 274 ms
Wall time: 5min 3s


## Lab 2 Summary - What have we accomplished?

In this lab we created schemas in Amazon Personalize that map to the dataset CSVs we created in Lab 1. We also created a dataset group in Personalize as well as datasets to receive our CSVs. Since Personalize needs access to the staging S3 bucket where the CSVs were uploaded, we inspected the S3 bucket policy and the IAM role that needs to be passed to Personalize. Finally, we created dataset import jobs to import the three datasets into Personalize.

In the next lab we will create the recommenders and custom solutions and solution versions for our personalization use cases. This is where the machine learning models are trained and deployed.

### Store variables needed in the next lab

We will pass some variables initialized in this lab by storing them in the notebook environment.

In [27]:
%store dataset_group_arn
%store role_arn

Stored 'dataset_group_arn' (str)
Stored 'role_arn' (str)


Continue to [Lab 3](./Lab-3-Create-recommenders-and-custom-solutions.ipynb).