## In this Notebook<a class="anchor" id="top"></a>


## Outline

1. [Introduction](#intro)
1. [Understanding the code](#thecode)
1. [Introduction to Amazon Personalize](#datasets)
1. [Creating Amazon Personalize Resources and Importing data](#import)
1. [Create the Dataset Group](#group_dataset)
1. [Create the Interactions Schema](#interact_schema)
1. [Create the Interactions Dataset](#interact_data)
1. [Create the Items Schema](#items_schema)
1. [Create the Items Dataset](#items_data)
1. [Import the interactions data](#import_interactions)
1. [Import the Item Metadata](#import_items)

## Introduction<a class="anchor" id="intro"></a>
[Back to top](#top)

In the previous notebook, [`01_Introduction_and_Data_Preparation.ipynb`](01_Introduction_and_Data_Preparation.ipynb), you prepared the data we will be using to train a Domain Optimized Recommender and generate personalized emails.

This notebook will walk you through the steps to build a Domain dataset group and upload the data to Amazon Personalize.


## Understanding the code<a class="anchor" id="thecode"></a>
[Back to top](#top)

This notebook can be used in two modalities:

1. Train as you go by executing each cell. Some cells may take a long time to finish executing as they wait for resources to be created.
2. Use this notebook with previously created resources. All or the majority of the resources will already be created and cells will just retrieve the information of these existing resources to use them in following steps.

Because of this, you will find that some cells have `try` and `except` blocks. In particular most of them are handling a `ResourceAlreadyExistsException` exception. 

You can look at the code in the `try` block to get a good idea of how you can create a resource and understand how to use the Amazon Personalize SDK. The `except` block will let you know that the resource already exists and record the corresponding ARN (Amazon Resource Name), which uniquely identifies the resource.

This is an example of the `try` block for creating a dataset group. The code executes without exceptions if the dataset group does not exist and raises an exception if it already exists:

```python
try: 
    create_dataset_group_response = personalize.create_dataset_group(
        name = workshop_dataset_group_name,
        domain='VIDEO_ON_DEMAND'
    )

    workshop_dataset_group_arn = create_dataset_group_response['datasetGroupArn']
    print(json.dumps(create_dataset_group_response, indent=2))
    print ('\nCreating the Dataset Group with dataset_group_arn = {}'.format(workshop_dataset_group_arn))
    
```
and this is the corresponding `except` block that will be executed if an exception is raised because the dataset group already exists. This block saves the ARN of the existing dataset group to use later and lets you know the resource already exists.

```python
except personalize.exceptions.ResourceAlreadyExistsException as e:
    # Rebuild the ARN of the existing dataset group (note the hyphen in 'dataset-group')
    workshop_dataset_group_arn = ('arn:aws:personalize:' + region + ':' + account_id +
        ':dataset-group/' + workshop_dataset_group_name)
    print ('\nThe Dataset Group with dataset_group_arn = {} already exists'.format(
        workshop_dataset_group_arn))
    print ('\nWe will be using the existing Dataset Group dataset_group_arn = {}'.format(
        workshop_dataset_group_arn))
```

Depending on the resource, you may also find that the code sometimes checks a list of resources to find out whether a resource exists and then uses `if` and `else` blocks to either use the existing resource or create it.
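
As a minimal sketch of that list-then-decide pattern, the snippet below searches a list of resource summaries for a matching name. The helper `find_existing_arn` and the sample data are hypothetical and only illustrate the idea; in the notebook the list would come from an Amazon Personalize list call.

```python
from typing import Iterable, Mapping, Optional


def find_existing_arn(
    resources: Iterable[Mapping[str, str]],
    name: str,
    name_key: str = "name",
    arn_key: str = "arn",
) -> Optional[str]:
    """Return the ARN of the first resource whose name matches exactly, else None."""
    for resource in resources:
        if resource.get(name_key) == name:
            return resource.get(arn_key)
    return None


# Made-up stand-in for a list response (hypothetical names and ARNs).
sample_jobs = [
    {"jobName": "interactions-import", "datasetImportJobArn": "arn:example:1"},
    {"jobName": "items-import", "datasetImportJobArn": "arn:example:2"},
]

existing_arn = find_existing_arn(
    sample_jobs, "items-import", name_key="jobName", arn_key="datasetImportJobArn"
)

if existing_arn is not None:
    # Reuse the resource that is already there.
    print("Using existing resource:", existing_arn)
else:
    # This is where the create call would go.
    print("Resource not found; it would be created here.")
```

This sketch matches names exactly, whereas the import job cells later in this notebook use a substring match on the job name.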

In [None]:
# Retrieving stored variables from the previous notebook
%store -r

### Let's build!

Python ships with a broad standard library. We need to import some of its modules, as well as installed packages such as [boto3](https://aws.amazon.com/sdk-for-python/) (the AWS SDK for Python).

In [None]:
import time
from time import sleep
import json
from datetime import datetime
import boto3
import botocore
from botocore.exceptions import ClientError


In [None]:
# Configure the SDK to Personalize:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')

## Introduction to Amazon Personalize <a class="anchor" id="datasets"></a>
[Back to top](#top)

[Amazon Personalize](https://aws.amazon.com/pm/personalize/) is a fully managed Machine Learning (ML) service that uses your data to generate item recommendations for your users. Amazon Personalize makes it easy for developers to build applications with a wide array of personalization use cases and automates many of the complicated steps to build, train, and deploy a ML model.  

Regardless of the use case, the algorithms all share a base of learning on user-item-interaction data which is defined by 3 core attributes:

1. **UserID** - The user who interacted
1. **ItemID** - The item the user interacted with
1. **Timestamp** - The time at which the interaction occurred

Generally speaking your data will not arrive in a perfect form for Personalize, and will take some modification to be structured correctly. This notebook guides you through that process.
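
As a rough illustration of the three core attributes, the sketch below turns a couple of made-up raw events into CSV rows with `USER_ID`, `ITEM_ID`, and `TIMESTAMP` columns, with the timestamp expressed in Unix seconds. The event structure, the IDs, and the helper `to_interaction_row` are assumptions for this example only, not the workshop's actual data preparation.

```python
import csv
import io
from datetime import datetime, timezone

# Made-up raw events; real data would come from your analytics tooling.
raw_events = [
    {"user": "user-1", "item": "tt0000001",
     "when": datetime(2020, 1, 1, 0, 0, tzinfo=timezone.utc)},
    {"user": "user-2", "item": "tt0000002",
     "when": datetime(2020, 1, 1, 0, 1, tzinfo=timezone.utc)},
]


def to_interaction_row(event):
    """Map a raw event to the three core interaction attributes."""
    return {
        "USER_ID": str(event["user"]),
        "ITEM_ID": str(event["item"]),
        # Unix timestamp in whole seconds.
        "TIMESTAMP": int(event["when"].timestamp()),
    }


buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["USER_ID", "ITEM_ID", "TIMESTAMP"], lineterminator="\n"
)
writer.writeheader()
for event in raw_events:
    writer.writerow(to_interaction_row(event))

csv_text = buffer.getvalue()
print(csv_text)
```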

### Items data

The item data consists of information about the content that is being interacted with. It generally comes from a Content Management System (CMS). For the purpose of this workshop we will use the IMDb TT ID to provide a common identifier between the interactions data and the content metadata. MovieLens provides its own identifier as well as the IMDb TT ID (without the leading 'tt') in the 'links.csv' file. This dataset is not mandatory, but providing good item metadata helps you get the best results from your trained models.

### Interactions data

The interaction data consists of information about the interactions the users of the fictional app have with the content. It usually comes from analytics tools or a Customer Data Platform (CDP). The best interaction data for Amazon Personalize includes the sequential order of user behavior: what content was watched or clicked on, and the order in which it was interacted with. To simulate our interaction data, we will be using data from the [MovieLens project](https://grouplens.org/datasets/movielens/). MovieLens offers multiple versions of its dataset; for the purposes of this workshop we will be using the reduced version (approximately 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users).

### User data

The user data is the information you have about your users. It usually comes from a Customer Relationship Management (CRM) or subscriber management system. Since there is no user data included in the MovieLens data, we will be generating a small synthetic dataset to simulate this component of the workshop. This dataset is not mandatory, but providing good user metadata helps you get the best results from your trained models. In this workshop we will not be using user data to train the Recommender.

In this notebook we will be importing interactions and item data into your environment, inspecting it and converting it to a format that will allow you to use it in Amazon Personalize to train models and get personalized recommendations.


## Creating Amazon Personalize Resources and Importing data <a class="anchor" id="import"></a>

### Get the account id and region

In [None]:
account_id = boto3.client('sts').get_caller_identity().get('Account')
print("account id:", account_id)

with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
    data = json.load(notebook_info)
    resource_arn = data['ResourceArn']
    region = resource_arn.split(':')[3]
print("region:", region)

### IAM role

Amazon Personalize needs the ability to assume roles in AWS in order to have the permissions to execute certain tasks. 

We will be using the S3 bucket that you created when you deployed the CloudFormation stack using [personalizeSimpleCFMarketingContentGen.yml](personalizeSimpleCFMarketingContentGen.yml).

The Assume Role Policy document needs to have the following format:

```python
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "personalize.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}
```

The S3 Access Policy document needs to have the following format:

```python
s3_access_policy_document = {
    "Version": "2012-10-17",
    "Statement": {
            "Sid" : "myStatement" ,
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::{}".format(bucket_name),
                "arn:aws:s3:::{}/*".format(bucket_name)
            ],
            "Action": "s3:*"
        }
}

```

Let's get the ARN of the role that the CloudFormation stack created.

In [None]:
# Configure the SDK to SSM:
ssm = boto3.client('ssm')

In [None]:
role_arn_info = ssm.get_parameter(Name='/cloudformation/personalize-iam-role-arn', WithDecryption=False)
role_arn = role_arn_info['Parameter']['Value']

In [None]:
# get the role name (the last segment of the role ARN, which also works if the role has a path)
role_name = role_arn.split('/')[-1]
role_name

### S3 bucket <a class="anchor" id="bucket_role"></a>
[Back to top](#top)

So far, we have downloaded, manipulated, and saved the data onto the Amazon EBS volume attached to the instance running this Jupyter notebook.

By default, the Amazon Personalize service does not have permission to access the data we uploaded into the S3 bucket in our account. In order to grant access to the Amazon Personalize service to read our CSVs, you need to set a Bucket Policy and create an IAM role that the Amazon Personalize service will assume. 

Use the metadata stored on the instance underlying this Amazon SageMaker notebook, to determine the region it is operating in. If you are using a Jupyter notebook outside of Amazon SageMaker, simply define the region as a string below. The Amazon S3 bucket needs to be in the same region as the Amazon Personalize resources we have been creating so far.
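
As a minimal sketch of that region lookup, the hypothetical helper `region_from_arn` below extracts the region from an ARN and falls back to a value you supply, which is one way to handle running outside of Amazon SageMaker. The sample ARN and the fallback region are made up for illustration.

```python
from typing import Optional


def region_from_arn(arn: str, default: Optional[str] = None) -> Optional[str]:
    """Extract the region from an ARN of the form
    arn:partition:service:region:account-id:resource.

    Returns `default` if the string does not look like an ARN
    or the region field is empty.
    """
    parts = arn.split(":")
    if len(parts) >= 6 and parts[0] == "arn" and parts[3]:
        return parts[3]
    return default


# Made-up ARN, similar in shape to the ResourceArn in the notebook metadata.
sample_arn = "arn:aws:sagemaker:us-east-1:123456789012:notebook-instance/example"
example_region = region_from_arn(sample_arn, default="us-east-1")
print("region:", example_region)

# Outside SageMaker there is no metadata file, so the fallback is used.
print("fallback:", region_from_arn("", default="eu-west-1"))
```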

We will be using the S3 bucket that you created when you deployed the CloudFormation stack using [personalizeSimpleCFMarketingContentGen.yml](personalizeSimpleCFMarketingContentGen.yml).

This bucket is created with the policy:

```python
policy = {
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:*Object",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket_name),
                "arn:aws:s3:::{}/*".format(bucket_name)
            ]
        }
    ]
}
```

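This S3 bucket policy allows Amazon Personalize to list the bucket and read its contents. Note that the `s3:*Object` wildcard also matches write and delete actions such as `s3:PutObject` and `s3:DeleteObject`.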
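
If you ever need to produce this policy yourself, for example outside the workshop environment, a sketch along the following lines builds the same document for a given bucket and serializes it to JSON. The function name `build_personalize_bucket_policy` and the bucket name `example-bucket` are hypothetical; in this workshop the policy is already attached to the bucket for you.

```python
import json


def build_personalize_bucket_policy(bucket_name: str) -> dict:
    """Return a bucket policy granting Amazon Personalize access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy",
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                "Action": ["s3:*Object", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::{}".format(bucket_name),      # the bucket itself
                    "arn:aws:s3:::{}/*".format(bucket_name),    # every object in it
                ],
            }
        ],
    }


example_policy = build_personalize_bucket_policy("example-bucket")
policy_json = json.dumps(example_policy, indent=2)
print(policy_json)
```

The resulting JSON string is the form that a call such as `s3.put_bucket_policy(Bucket=bucket_name, Policy=policy_json)` expects.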

In [None]:
# get the name of the bucket created by the Cloud Formation
personalizes3bucket = ssm.get_parameter(Name='/cloudformation/personalize-s3-bucket', WithDecryption=False)
bucket_name = personalizes3bucket['Parameter']['Value']

print('bucket_name:', bucket_name)

Let's have a look at the S3 bucket policy. 

In [None]:
s3 = boto3.client('s3')

try:
    bucket_current_policy = s3.get_bucket_policy(Bucket=bucket_name)['Policy']
    print ("Bucket current policy")
    print (json.loads(bucket_current_policy))
    
except Exception as e: 
    raise(e)


### Upload data to S3

Now let us upload the CSV files of our two datasets (items and interactions).

In [None]:
interactions_file_path = data_dir + "/" + interactions_filename

try:
    s3.get_object(
        Bucket=bucket_name,
        Key=interactions_filename,
    )
    print("{} already exists in the bucket {}".format(interactions_file_path, bucket_name))
except s3.exceptions.NoSuchKey:
    # Uploading the file if it does not already exist
    boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_filename).upload_file(interactions_file_path)
    print("File {} uploaded to bucket {}".format(interactions_filename, bucket_name))

items_file_path = data_dir + "/" + items_filename

try:
    s3.get_object(
        Bucket=bucket_name,
        Key=items_filename,
    )
    print("{} already exists in the bucket {}".format(items_file_path, bucket_name))
except s3.exceptions.NoSuchKey:
    # Uploading the file if it does not already exist     
    boto3.Session().resource('s3').Bucket(bucket_name).Object(items_filename).upload_file(items_file_path)
    print("File {} uploaded to bucket {}".format(items_filename, bucket_name))
    

## Create the Dataset Group <a class="anchor" id="group_dataset"></a>
[Back to top](#top)

The highest level of isolation and abstraction with Amazon Personalize is a *dataset group*. Information stored within one of these dataset groups has no impact on any other dataset group or models created from one - they are completely isolated. This allows you to run many experiments and is part of how we keep your models private and fully trained only on your data. 

Before importing the data prepared earlier, there needs to be a dataset group and a dataset added to it that handles the interactions.

Dataset groups can house the following types of information:

* User-item-interactions
* Event streams (real-time interactions)
* User metadata
* Item metadata

We need to create the dataset group that will contain our two datasets (interactions and items).

Your dataset group can be one of the following types:

* A Domain dataset group, where you create preconfigured resources for different business domains and use cases, such as getting recommendations for similar videos (VIDEO_ON_DEMAND domain) or best selling items (ECOMMERCE domain). You choose your business domain, import your data, and create recommenders. You use recommenders in your application to get recommendations. Use a [Domain dataset group](https://docs.aws.amazon.com/personalize/latest/dg/domain-dataset-groups.html) if you have a video on demand or e-commerce application and want Amazon Personalize to find the best configurations for your use cases. If you start with a Domain dataset group, you can also add custom resources such as solutions with solution versions trained with recipes for custom use cases.

* A [Custom dataset group](https://docs.aws.amazon.com/personalize/latest/dg/custom-dataset-groups.html), where you create configurable resources for custom use cases and batch recommendation workflows. You choose a recipe, train a solution version (model), and deploy the solution version with a campaign. You use a campaign in your application to get recommendations. Use a Custom dataset group if you don't have a video on demand or e-commerce application or want to configure and manage only custom resources, or want to get recommendations in a batch workflow. If you start with a Custom dataset group, you can't associate it with a domain later. Instead, create a new Domain dataset group.

You can create and manage Domain dataset groups and Custom dataset groups with the AWS console, the AWS Command Line Interface (AWS CLI), or programmatically with the AWS SDKs.

In this workshop we will be creating a domain dataset group.
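
When a resource already exists, this notebook rebuilds its ARN from the region, the account ID, and the resource name instead of looking it up. A minimal sketch of that string construction is shown below; the helper `personalize_arn` and the sample values are hypothetical, and the `aws` partition is an assumption that would not hold in other AWS partitions.

```python
def personalize_arn(region: str, account_id: str, resource_type: str, resource_path: str) -> str:
    """Build an Amazon Personalize ARN, assuming the standard 'aws' partition."""
    return "arn:aws:personalize:{}:{}:{}/{}".format(
        region, account_id, resource_type, resource_path
    )


# Made-up region, account ID, and names for illustration.
dataset_group_arn_example = personalize_arn(
    "us-east-1", "123456789012", "dataset-group", "personalize-demo"
)
interactions_dataset_arn_example = personalize_arn(
    "us-east-1", "123456789012", "dataset", "personalize-demo/INTERACTIONS"
)

print(dataset_group_arn_example)
print(interactions_dataset_arn_example)
```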

<div class="alert alert-block alert-warning">
<b>Note:</b> If you run these notebooks in your own account, Amazon SageMaker will create these resources. Deploying them will take approximately 90 minutes.
</div>

The following cell will create a new dataset group with the name `personalize-demo`.

In [None]:
try:     
    # Try to create the dataset group; this block will execute fully if the dataset group does not exist yet
    create_dataset_group_response = personalize.create_dataset_group(
        name = workshop_dataset_group_name,
        domain='VIDEO_ON_DEMAND'
    )

    workshop_dataset_group_arn = create_dataset_group_response['datasetGroupArn']
    print(json.dumps(create_dataset_group_response, indent=2))
    print ('\nCreating the Dataset Group with dataset_group_arn = {}'.format(workshop_dataset_group_arn))

except personalize.exceptions.ResourceAlreadyExistsException as e:
    # if the dataset group already exists, get the unique identifier workshop_dataset_group_arn 
    # from the existing resource
    
    workshop_dataset_group_arn = 'arn:aws:personalize:'+region+':'+account_id+':dataset-group/'+workshop_dataset_group_name 
    print ('\nThe Dataset Group with dataset_group_arn = {} already exists'.format(workshop_dataset_group_arn))
    print ('\nWe will be using the existing Dataset Group dataset_group_arn = {}'.format(workshop_dataset_group_arn))


#### Wait for Dataset Group to have ACTIVE Status 

Before we can use the Dataset Group to create more resources below, it must be active. This can take a minute or two. Execute the cell below and wait for it to show the ACTIVE status. It checks the status of the dataset group every 30 seconds, up to a maximum of 3 hours.

In [None]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = workshop_dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(30)

Now that you have a dataset group, you can create a dataset for the interaction data.
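
The same wait-until-active pattern comes up several times in this notebook. As a sketch only, it could be factored into a small helper that takes any function returning a status string. The name `wait_for_status` and its parameters are hypothetical; the demo uses a fake status source so it runs without AWS access.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    success: str = "ACTIVE",
    failure: str = "CREATE FAILED",
    poll_seconds: float = 30,
    timeout_seconds: float = 3 * 60 * 60,
) -> str:
    """Poll get_status() until it returns the success or failure status.

    Returns the final status, or raises TimeoutError if neither status
    is reached before the timeout.
    """
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = get_status()
        print("Status: {}".format(status))
        if status in (success, failure):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(
        "Resource did not reach {} within {} seconds".format(success, timeout_seconds)
    )


# Demo with a fake status source instead of a real describe call.
fake_statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=0)
print("Final status:", final_status)
```

In this notebook, `get_status` could be a lambda that wraps `personalize.describe_dataset_group(datasetGroupArn=workshop_dataset_group_arn)["datasetGroup"]["status"]`.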

## Create the Interactions Schema <a class="anchor" id="interact_schema"></a>
[Back to top](#top)

Now that we've loaded and prepared our datasets, we'll configure the Amazon Personalize service to understand our data so that it can be used to train models for generating recommendations. Amazon Personalize requires a schema for each dataset, so it can map the columns in our CSVs to fields for model training. Each schema is declared in JSON using the [Apache Avro](https://avro.apache.org/) format.

First, define a schema to tell Amazon Personalize what type of dataset you are uploading. There are several mandatory fields that are required in the schema, depending on the type of dataset. More detailed information can be found in the [documentation](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html).

The interactions dataset has three required columns: `ITEM_ID`, `USER_ID`, and `TIMESTAMP`. The `TIMESTAMP` represents when the user interacted with an item and must be expressed in Unix timestamp format (seconds). For this dataset we also have an `EVENT_TYPE` column. These must be defined in the same order in the schema as they appear in the dataset.
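
A quick sanity check of the ordering requirement described above could look like the sketch below, which compares the header row of a CSV with the field names of a schema. The helper names and the sample CSV text are hypothetical; the sample schema mirrors the field order used in this notebook.

```python
import csv
import io


def schema_field_names(schema: dict) -> list:
    """Return the field names of an Avro-style schema, in declaration order."""
    return [field["name"] for field in schema["fields"]]


def header_matches_schema(csv_text: str, schema: dict) -> bool:
    """True if the CSV header lists exactly the schema fields, in the same order."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return header == schema_field_names(schema)


sample_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "EVENT_TYPE", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}

# Made-up CSV content for illustration.
sample_csv = "USER_ID,ITEM_ID,EVENT_TYPE,TIMESTAMP\nuser-1,tt0000001,Watch,1577836800\n"
swapped_csv = "ITEM_ID,USER_ID,EVENT_TYPE,TIMESTAMP\ntt0000001,user-1,Watch,1577836800\n"

print("matching header:", header_matches_schema(sample_csv, sample_schema))
print("swapped header:", header_matches_schema(swapped_csv, sample_schema))
```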

In [None]:
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "EVENT_TYPE", # "Watch", "Click", etc.
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}

try:
    # Try to create the interactions dataset schema; this block will execute fully
    # if the interactions dataset schema does not exist yet
    create_schema_response = personalize.create_schema(
        name = interactions_schema_name,
        schema = json.dumps(interactions_schema),
        domain='VIDEO_ON_DEMAND'
    )
    print(json.dumps(create_schema_response, indent=2))
    workshop_interactions_schema_arn = create_schema_response['schemaArn']
    print ('\nCreating the Interactions Schema with workshop_interactions_schema_arn = {}'.format(workshop_interactions_schema_arn))
    
except personalize.exceptions.ResourceAlreadyExistsException:
    # if the interactions dataset schema already exists, get the unique identifier workshop_interactions_schema_arn
    # from the existing resource 
    
    workshop_interactions_schema_arn = 'arn:aws:personalize:'+region+':'+account_id+':schema/'+interactions_schema_name 
    print('The schema {} already exists.'.format(workshop_interactions_schema_arn))
    print ('\nWe will be using the existing Interactions Schema with workshop_interactions_schema_arn = {}'.format(workshop_interactions_schema_arn))
 

## Create the Interactions Dataset <a class="anchor" id="interact_data"></a>
[Back to top](#top)

With a schema created, you can create a dataset within the dataset group. Note that this does not load the data yet, but creates a schema of what the data looks like. 

In [None]:
try:
    # Try to create the interactions dataset; this block will execute fully
    # if the interactions dataset does not exist yet
    
    dataset_type = 'INTERACTIONS'
    create_dataset_response = personalize.create_dataset(
        name = interactions_dataset_name,
        datasetType = dataset_type,
        datasetGroupArn = workshop_dataset_group_arn,
        schemaArn = workshop_interactions_schema_arn
    )

    workshop_interactions_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))
    print ('\nCreating the Interactions Dataset with workshop_interactions_dataset_arn = {}'.format(workshop_interactions_dataset_arn))
    
except personalize.exceptions.ResourceAlreadyExistsException:
    # if the interactions dataset already exists, get the unique identifier workshop_interactions_dataset_arn 
    # from the existing resource 
    workshop_interactions_dataset_arn =  'arn:aws:personalize:'+region+':'+account_id+':dataset/'+workshop_dataset_group_name+'/INTERACTIONS'
    print('The Interactions Dataset {} already exists.'.format(workshop_interactions_dataset_arn))
    print ('\nWe will be using the existing Interactions Dataset with workshop_interactions_dataset_arn = {}'.format(workshop_interactions_dataset_arn))
        

## Create the Items (Movies) Schema<a class="anchor" id="items_schema"></a>
[Back to top](#top)

First, we define a schema to tell Amazon Personalize what type of dataset we are uploading. There are several reserved and mandatory keywords required in the schema, based on the type of dataset. More detailed information can be found in the [documentation](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html).

Our item metadata has the following columns: `ITEM_ID`, `TITLE`, `YEAR`, `IMDB_RATING`, `IMDB_NUMBEROFVOTES`, `PLOT`, `US_MATURITY_RATING_STRING`, `US_MATURITY_RATING`, `GENRES`, `CREATION_TIMESTAMP`, and `PROMOTION`. These must be defined in the same order in the schema as they appear in the dataset.
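
The `GENRES` column is declared as a categorical field. As a sketch, and assuming that an item with several genres stores them in one string separated by a vertical bar (the convention Amazon Personalize uses for multi-valued categorical fields), a value for that column could be prepared as follows. The helper `to_categorical` and the sample genres are hypothetical.

```python
def to_categorical(values) -> str:
    """Join several category values into one '|'-separated string.

    Empty entries are dropped and surrounding whitespace is trimmed.
    """
    cleaned = [value.strip() for value in values if value and value.strip()]
    return "|".join(cleaned)


# Made-up genres for a single item.
genres_value = to_categorical(["Action", " Comedy ", ""])
print(genres_value)
```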

In [None]:
items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "TITLE",
            "type": "string"
        },
        {
            "name": "YEAR",
            "type": "int"
        },
        {
            "name": "IMDB_RATING",
            "type": "int"
        },
        {
            "name": "IMDB_NUMBEROFVOTES",
            "type": "int"
        },
        {
            "name": "PLOT",
            "type": "string",
            "textual": True
        },
        {
            "name": "US_MATURITY_RATING_STRING",
            "type": "string"
        },
        {
            "name": "US_MATURITY_RATING",
            "type": "int"
        },
        {
            "name": "GENRES",
            "type": "string",
            "categorical": True
        },
        {
            "name": "CREATION_TIMESTAMP",
            "type": "long"
        },
        {
            "name": "PROMOTION",
            "type": "string"
        }
    ],
    "version": "1.0"
}

try:
    # Try to create the items dataset schema; this block will execute fully
    # if the items dataset schema does not exist yet
    
    create_schema_response = personalize.create_schema(
        name = items_schema_name,
        schema = json.dumps(items_schema),
        domain='VIDEO_ON_DEMAND'
    )
    workshop_items_schema_arn = create_schema_response['schemaArn']
    print(json.dumps(create_schema_response, indent=2))

    print ('\nCreating the Items Schema with workshop_items_schema_arn = {}'.format(workshop_items_schema_arn))
    
except personalize.exceptions.ResourceAlreadyExistsException:
    # if the items dataset schema already exists, get the unique identifier workshop_items_schema_arn 
    # from the existing resource 
    
    workshop_items_schema_arn = 'arn:aws:personalize:'+region+':'+account_id+':schema/'+items_schema_name 
    print('The schema {} already exists.'.format(workshop_items_schema_arn))
    print ('\nWe will be using the existing Items Schema with workshop_items_schema_arn = {}'.format(workshop_items_schema_arn))
 

## Create the Items Dataset<a class="anchor" id="items_data"></a>
[Back to top](#top)


With a schema created, you can create a dataset within the dataset group. Note that this does not load the data yet, but creates a schema of what the data looks like. 

In [None]:
try:
    # Try to create the items dataset; this block will execute fully if the items dataset does not exist yet
    
    dataset_type = "ITEMS"
    create_dataset_response = personalize.create_dataset(
        name = items_dataset_name,
        datasetType = dataset_type,
        datasetGroupArn = workshop_dataset_group_arn,
        schemaArn = workshop_items_schema_arn
    )

    workshop_items_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))

    print ('\nCreating the Items Dataset with workshop_items_dataset_arn = {}'.format(workshop_items_dataset_arn))
    
except personalize.exceptions.ResourceAlreadyExistsException:
    # if the items dataset already exists, get the unique identifier workshop_items_dataset_arn 
    # from the existing resource 
    
    workshop_items_dataset_arn =  'arn:aws:personalize:'+region+':'+account_id+':dataset/'+workshop_dataset_group_name+'/ITEMS'
    print('The Items Dataset {} already exists.'.format(workshop_items_dataset_arn))
    print ('\nWe will be using the existing Items Dataset with workshop_items_dataset_arn = {}'.format(workshop_items_dataset_arn))   

Let's wait until all the datasets have been created.

In [None]:
%%time

max_time = time.time() + 6*60*60 # 6 hours
while time.time() < max_time:
    describe_dataset_response = personalize.describe_dataset(
        datasetArn = workshop_interactions_dataset_arn
    )
    status_interaction_dataset =  describe_dataset_response["dataset"]['status']
    print("Interactions Dataset: {}".format(status_interaction_dataset))
    
    if status_interaction_dataset == "ACTIVE":
        print("Build succeeded for {}".format(workshop_interactions_dataset_arn))
        
    elif status_interaction_dataset == "CREATE FAILED":
        print("Build failed for {}".format(workshop_interactions_dataset_arn))
        break
        
    if not status_interaction_dataset == "ACTIVE":
        print("The interaction dataset creation is still in progress")
    else:
        print("The interaction dataset  is ACTIVE")
        

    describe_dataset_response = personalize.describe_dataset(
        datasetArn = workshop_items_dataset_arn
    )
    status_item_dataset =  describe_dataset_response["dataset"]['status']
    print("Items Dataset: {}".format(status_item_dataset))
    
    if status_item_dataset == "ACTIVE":
        print("Build succeeded for {}".format(workshop_items_dataset_arn))
        
    elif status_item_dataset == "CREATE FAILED":
        print("Build failed for {}".format(workshop_items_dataset_arn))
        break
        
    if not status_item_dataset == "ACTIVE":
        print("The item dataset creation is still in progress")
    else:
        print("The item dataset  is ACTIVE")
        
    if status_interaction_dataset == "ACTIVE" and status_item_dataset == "ACTIVE":
        break
    time.sleep(30)

## Import the interactions data <a class="anchor" id="import_interactions"></a>
[Back to top](#top)

Earlier you created the dataset group and dataset to house your information, so now you will execute an import job that will load the interactions data from the S3 bucket into the Amazon Personalize dataset. 
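
The existence check in this section lists up to 100 import jobs in a single call. If a dataset could have more jobs than that, one option is to follow the `nextToken` value returned by the list call until it is exhausted. The sketch below shows the idea with a hypothetical helper `list_all` and a fake list function standing in for `personalize.list_dataset_import_jobs`, so it runs without AWS access.

```python
def list_all(list_page, result_key, **kwargs):
    """Collect results from a paginated list call that uses 'nextToken'."""
    items = []
    next_token = None
    while True:
        params = dict(kwargs)
        if next_token:
            params["nextToken"] = next_token
        page = list_page(**params)
        items.extend(page.get(result_key, []))
        next_token = page.get("nextToken")
        if not next_token:
            return items


def fake_list_dataset_import_jobs(datasetArn, maxResults=100, nextToken=None):
    """Fake two-page response that mimics the shape of the real list call."""
    pages = {
        None: {"datasetImportJobs": [{"jobName": "job-a"}], "nextToken": "t1"},
        "t1": {"datasetImportJobs": [{"jobName": "job-b"}]},
    }
    return pages[nextToken]


all_jobs = list_all(
    fake_list_dataset_import_jobs,
    "datasetImportJobs",
    datasetArn="arn:example:dataset",  # made-up ARN
    maxResults=100,
)
print([job["jobName"] for job in all_jobs])
```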

In [None]:
# Check if the import job already exists

# List the import jobs
interactions_dataset_import_jobs = personalize.list_dataset_import_jobs(
    datasetArn=workshop_interactions_dataset_arn,
    maxResults=100
)['datasetImportJobs']

#check if there is an existing job with the prefix
job_exists = False  
job_arn = None

for job in interactions_dataset_import_jobs:
    if (interactions_import_job_name in job['jobName']):
        job_exists = True
        job_arn = job['datasetImportJobArn']
    
if (job_exists):
    workshop_interactions_dataset_import_job_arn = job_arn
    print('The Interactions Import Job {} already exists.'.format(workshop_interactions_dataset_import_job_arn))
    print ('\nWe will be using the existing Interactions Import Job with workshop_interactions_dataset_import_job_arn = {}'.format(workshop_interactions_dataset_import_job_arn))
        
else:
    # If there is no import job with the prefix, create it:   
    create_dataset_import_job_response = personalize.create_dataset_import_job(
        jobName = interactions_import_job_name,
        datasetArn = workshop_interactions_dataset_arn,
        dataSource = {
            "dataLocation": "s3://{}/{}".format(bucket_name, interactions_filename)
        },
        roleArn = role_arn
    )
    workshop_interactions_dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
    print(json.dumps(create_dataset_import_job_response, indent=2))
    
    print ('\nImporting the Interactions Data with workshop_interactions_dataset_import_job_arn = {}'.format(workshop_interactions_dataset_import_job_arn))


## Import the Item Metadata <a class="anchor" id="import_items"></a>
[Back to top](#top)

Earlier you created the dataset group and dataset to house your information, now you will execute an import job that will load the item data from the S3 bucket into the Amazon Personalize dataset. 

In [None]:
# Checking if the import job already exists

# List the import jobs
items_dataset_import_jobs = personalize.list_dataset_import_jobs(
    datasetArn=workshop_items_dataset_arn,
    maxResults=100
)['datasetImportJobs']

job_exists = False
job_arn = None

print (items_dataset_import_jobs)

#check if there is an existing job with the prefix
for job in items_dataset_import_jobs:
    if (items_import_job_name in job['jobName']):
        job_exists = True
        job_arn = job['datasetImportJobArn']
    
if (job_exists):
    workshop_items_dataset_import_job_arn =  job_arn
    print('The Items Import Job {} already exists.'.format(workshop_items_dataset_import_job_arn))
    print ('\nWe will be using the existing Items Import Job with workshop_items_dataset_import_job_arn = {}'.format(workshop_items_dataset_import_job_arn))
        
else:
    # If there is no import job with the prefix, create it:    
    create_dataset_import_job_response = personalize.create_dataset_import_job(
        jobName = items_import_job_name,
        datasetArn = workshop_items_dataset_arn,
        dataSource = {
            "dataLocation": "s3://{}/{}".format(bucket_name, items_filename)
        },
        roleArn = role_arn
    )

    workshop_items_dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
    print(json.dumps(create_dataset_import_job_response, indent=2))
    print ('\nImporting the Items Data with workshop_items_dataset_import_job_arn = {}'.format(workshop_items_dataset_import_job_arn))
    
    

Before we can use the dataset, the import job must be active. Execute the cell below and wait for it to show the ACTIVE status. It checks the status of the import jobs every 30 seconds, up to a maximum of 6 hours.

Importing the data can take some time, depending on the size of the dataset. In this workshop, the data import job has already been done for you. If you are not using the workshop environment, this should take around 15 minutes. While you're waiting you can learn more about Datasets and Schemas in [the documentation](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html). 

We need to wait for the data imports to complete.

In [None]:
max_time = time.time() + 6*60*60 # 6 hours
while time.time() < max_time:

    # Interactions dataset import
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = workshop_interactions_dataset_import_job_arn
    )
    status_interactions_import = describe_dataset_import_job_response["datasetImportJob"]['status']
    
    if status_interactions_import == "ACTIVE":
        print("Build succeeded for {}".format(workshop_interactions_dataset_import_job_arn))
        
    elif status_interactions_import == "CREATE FAILED":
        print("Build failed for {}".format(workshop_interactions_dataset_import_job_arn))
        break
        
    if not status_interactions_import == "ACTIVE":
        print("The interactions dataset import is still in progress")
    else:
        print("The interactions dataset import is ACTIVE")

    # Items dataset import
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = workshop_items_dataset_import_job_arn
    )
    status_items_import = describe_dataset_import_job_response["datasetImportJob"]['status']
    
    if status_items_import == "ACTIVE":
        print("Build succeeded for {}".format(workshop_items_dataset_import_job_arn))
        
    elif status_items_import == "CREATE FAILED":
        print("Build failed for {}".format(workshop_items_dataset_import_job_arn))
        break
        
    if not status_items_import == "ACTIVE":
        print("The items dataset import is still in progress")
    else:
        print("The items dataset import is ACTIVE")

    if status_interactions_import == "ACTIVE" and status_items_import == 'ACTIVE':
        break

    print()
    time.sleep(30)

Congratulations! You have now imported data from two of your fictional environments: the Content Management System (items) and the Analytics/Customer Data Platform (interactions). User data from a Customer Relationship Management/Subscriber Management System is not imported in this workshop.

We will use this data in the next notebook. In order to use this data we will store these variables so subsequent notebooks can use this data. 

In [None]:
# %store dataset_dir
%store data_dir
%store interactions_filename
%store items_filename
%store workshop_dataset_group_arn
%store workshop_interactions_dataset_arn
%store workshop_items_dataset_arn
%store workshop_interactions_schema_arn
%store workshop_items_schema_arn
%store recommender_top_picks_for_you_name
%store region
%store account_id
%store role_name
%store role_arn

[Go to the next notebook `03_Train_Personalize_Model_02_Training.ipynb`](03_Train_Personalize_Model_02_Training.ipynb) to continue.