# Welcome to the UnicornPost <a class="anchor" id="top"></a>

Congratulations! You have just been hired by UnicornPost, a publisher of news stories that captivate people the world over! Currently, most of your site does not provide a personalized experience, the stories are presented in a static order for all users in a manner controlled by the editorial team. In order to increase page views you are looking to incorporate a recommendation system which assists users in finding the stories they feel are the most intresting and informative.

You’ve been asked by the founders to:

- Provide personalized story recommendation to vistors of your site
- Filter news articles by genre for the relevant parts of the site
- Emphasize newly written articles for certain sections of the site

Throughout the course of this workshop you will be exploring your datasets, building/training several recommendation models and implementing recommendations with API's.

<div class="alert alert-block alert-warning">
<b>Note:</b> importing and training the datasets will take longer than we have in this workshop. In order to complete this workshop within the time set, we have already created several resources on your behalf.  However, the notebooks are designed in such a way that all the steps are included. If the resources have already been created, the cell will return information about the resources, if the resources have not been created, it will create them for you. 
</div>


## In this notebook
In this notebook, you will choose a dataset and prepare it for use with Amazon Personalize.

1. [How to Use the Notebook](#usenotebook)
1. [Introduction to Amazon Personalize Datasets](#datasets)
1. [Choose a Dataset or Data Source](#source)
1. [Configure an S3 bucket and an IAM role](#bucket_role)
1. [Create dataset group](#group_dataset)
1. [Create the Item Interactions Schema](#interact_schema)
1. [Create the Items Schema](#items_schema)
1. [Create the Users Schema](#users_schema)
1. [Import the Item Interactions Data](#import_interactions)
1. [Import the Items Metadata](#import_items)
1. [Import the User Metadata](#import_users)
1. [Storing Useful Variables](#vars)

## How to Use the Notebook <a class="anchor" id="usenotebook"></a>

### Executing cells

The code is broken up into cells like the one below. There's a triangular **Run** button at the top of this page that you can click to execute each cell and move onto the next, or you can press `Shift` + `Enter` while in the cell to execute it and move onto the next one.

As a cell is executing, you'll notice an `*` in the checkbox beside the cell. When the cell has finished running, the checkbox will contain a number to indicate the order the cell was executed in with respect to all the other cells in the notebook.

Simply follow the instructions below and execute the cells to get started with Amazon Personalize.

### Understanding the code

This notebook can be used in two modalities:

1. Train as you go by executing each cell. Some cells may take a long time to finish executing as they wait for resources to be created.
2. Use this notebook with previously created resources. All or the majority of the resources will already be created, and cells will just retrieve the information of these existing resources to use them in following steps.

Because of this, you will find that some cells have `try` and `except` blocks. In particular, most of them are handling a `ResourceAlreadyExistsException` exception. 

You can look at the code in the `try` block to get a good idea of how you can create a resource and understand how to use the Amazon Personalize SDK. The `except` block will let you know that the resource has been created and record the corresponding ARN, which is the Amazon unique identifier.

This is an example of the `try` block for creating a dataset group, this code will execute without exceptions if the dataset group does not exist and raise an exception if the dataset group does already exist:

```python
try:     
    # Try to create the dataset group, this block with exectute fully if the dataset group does not exist yet
    
    create_dataset_group_response = personalize.create_dataset_group(
        name = workshop_dataset_group_name,
    )
    workshop_dataset_group_arn = create_dataset_group_response['datasetGroupArn']
    print(json.dumps(create_dataset_group_response, indent=2))
    print ('\nCreating the Dataset Group with dataset_group_arn = {}'.format(workshop_dataset_group_arn))
```
and this is the corresponding `except` block that will be executed if an exception is raised because the dataset group already exists. This block saves the ARN for the existing dataset group to use later and lets you know the resource already exists.

```python
except personalize.exceptions.ResourceAlreadyExistsException as e:
    workshop_dataset_group_arn = 'arn:aws:personalize:'+region+':'+account_id+':dataset group/' + 
        workshop_dataset_group_name 
    print ('\nThe the Dataset Group with dataset_group_arn = {} already exists'.format(
        workshop_dataset_group_arn))
    print ('\nWe will be using the existing Dataset Group dataset_group_arn = {}'.format(
        workshop_dataset_group_arn))
```

Depending on the resource, you may also find that sometimes the code will check from a list of resources to find if a resource exists and then use `if` and `else` blocks to either use the existing resource or create it.

### Let's build!

Python ships with a broad collection of libraries and we need to import those as well as the ones installed to help us like [boto3](https://aws.amazon.com/sdk-for-python/) (AWS SDK for python) and [Pandas](https://pandas.pydata.org/)/[Numpy](https://numpy.org/)  which are core data science tools.

In [1]:
!pip install --upgrade pip
!pip install --upgrade --no-deps --force-reinstall botocore
!pip install lxml


import pandas as pd
import json
from datetime import datetime
from lxml import html
from bs4 import BeautifulSoup, MarkupResemblesLocatorWarning
import re
import warnings
import csv
import sys
import os.path
import boto3
import json
import seaborn as sns
import matplotlib.pyplot as plt
import time
from time import sleep
import yaml
import numpy as np

data_dir = "poc_data"
!mkdir $data_dir

[0mCollecting botocore
  Downloading botocore-1.34.83-py3-none-any.whl.metadata (5.7 kB)
Downloading botocore-1.34.83-py3-none-any.whl (12.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.1/12.1 MB[0m [31m90.7 MB/s[0m eta [36m0:00:00[0m:00:01[0m0:01[0m
[?25hInstalling collected packages: botocore
  Attempting uninstall: botocore
    Found existing installation: botocore 1.34.73
    Uninstalling botocore-1.34.73:
      Successfully uninstalled botocore-1.34.73
Successfully installed botocore-1.34.83
[0mmkdir: cannot create directory ‘poc_data’: File exists


In [2]:
# Configure the SDK to Personalize:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')

In [3]:
# Get the account id and region to use later
account_id = boto3.client('sts').get_caller_identity().get('Account')
print("account id:", account_id)

with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
    data = json.load(notebook_info)
    resource_arn = data['ResourceArn']
    region = resource_arn.split(':')[3]
print("region:", region)

account id: 381491864570
region: us-east-1


If this is a workshop and the resources were created for you, we will retrieve the variables of the resources created.

In [4]:
# Opening JSON files
f = open("params.json")
parameters = json.load(f)

In [5]:
workshop_dataset_group_name = parameters['datasetGroup']['serviceConfig']['name']

interactions_schema_name = parameters['datasets']['interactions']['schema']['serviceConfig']['name']
interactions_dataset_name = parameters['datasets']['interactions']['dataset']['serviceConfig']['name']

items_schema_name = parameters['datasets']['items']['schema']['serviceConfig']['name']
items_dataset_name = parameters['datasets']['items']['dataset']['serviceConfig']['name']

#The following job names are the starting Strings of the job names that can be created
interactions_import_job_name = 'dataset_import_interaction'
items_import_job_name = 'dataset_import_item'
        
for solution in parameters['solutions']:
    # This is currently configured assuming only one solution of this type, if there are multiple 
    # solutions of the same type further configuration is needed.
    if (solution['serviceConfig']['recipeArn'] == 'arn:aws:personalize:::recipe/aws-personalized-ranking'):
        workshop_rerank_solution_name = solution['serviceConfig']['name'] 
        # This is currently configured assuming only one campaign, if there are multiple campaigns 
        # further configuration is needed.
        workshop_rerank_campaign_name = solution['campaigns'][0]['serviceConfig']['name']
        
        
    if (solution['serviceConfig']['recipeArn'] == 'arn:aws:personalize:::recipe/aws-user-personalization'):
        workshop_userpersonalization_solution_name = solution['serviceConfig']['name'] 
        # This is currently configured assuming only one campaign, if there are multiple campaigns 
        # further configuration is needed.
        workshop_userpersonalization_campaign_name = solution['campaigns'][0]['serviceConfig']['name'] 

we will make sure we can use the SDK to interact with Amazon Personalize by describing some of the pre-created resources used in the workshop. 

<div class="alert alert-block alert-warning">
<b>Note:</b> If you have not pre-deployed resources and are building them as you go with this notebook, the below cell will raise an exception. You can continue with the notebook and create resources and train models as you go.
</div>

If you have not pre-deployed resources and are building them as you go with this notebook, the below cell will raise an exception. You can continue with the notebook and create resources and train models as you go.

In [6]:
try:
    # Describe a few resources using the SDK
    workshop_rerank_solution_arn = 'arn:aws:personalize:'+region+':'+account_id+':solution/'+workshop_rerank_solution_name
    describe_response = personalize.describe_solution(solutionArn = workshop_rerank_solution_arn)
    print("SDK and resource check SUCCEEDED!")
except:
    print("SDK check FAILED. Proceed to the next cell if you will be uploading data and training models as you go.")
    raise

SDK check FAILED. Proceed to the next cell if you will be uploading data and training models as you go.


ResourceNotFoundException: An error occurred (ResourceNotFoundException) when calling the DescribeSolution operation: The given solution does not exist: arn:aws:personalize:us-east-1:381491864570:solution/immersion_day_personalized_ranking_news

## Introduction to Amazon Personalize Datasets <a class="anchor" id="datasets"></a>
[Back to top](#top)

[Amazon Personalize](https://aws.amazon.com/personalize/) is a fully managed machine learning service that uses your data to generate item recommendations for your users. It can also generate user segments based on users’ affinity for certain items or item metadata.

Regardless of the use case, the algorithms all learn user-item-interaction data, which is defined by 3 core attributes:

1. **UserID** - The user who interacted
1. **ItemID** - The item the user interacted with
1. **Timestamp** - The time at which the interaction occurred

Very often, your data will not arrive in a perfect form for Amazon Personalize from other systems (such as a product catalog, Customer Relationship Management (CRM) System, ...) and you will have to modify it to be structured correctly. This notebook guides you through that process.

### Items data

The Item data consists of information about the articles that users interact with, this data typically comes from some sort of content management platform personalize can process one `textual` field which is recommended to be the article content or a summary of the article content if possible. This dataset can also contain information about the genre or section an article belongs in so that it can be filtered appropriately if needed

### Item-Interactions data

The item-interaction data consists of information about the interactions the readeers of the fictional news site will have with the articles published there. This usually comes from analytics tools or Customer Data Platform's (CDP). The best interaction data for use for Amazon Personalize would include the sequential order of user behavior, what content was clicked on/redand the order it was interacted with. We will be using the CI&T Deskdrop Dataset for both our items (in this case news articles) and our item interactions. Note for a typical news company you may want to filter out some of the interactions if they do not include a sufficient time on page or scroll depth in the article as personalize item-interactions should only respresent a positive interaction between the item and the user

### User data

The user data is what information you have about your users, it typically comes from Customer Relationship Management (CRM) system it is not required for amazon personalize recipies. In this case we will not be using user data.

![Workflow](Images/01_Data_Layer_Resources.jpg)

### Open and Explore the News Interactions Dataset

For this example, we are using public dataset which contains interactions between users and news articles in an internal chat platform. This dataset was initially obtained from https://www.kaggle.com/datasets/gspmoreira/articles-sharing-reading-from-cit-deskdrop

In [7]:
interaction_data = pd.read_csv("https://d2peeor3oplhc6.cloudfront.net/personalize-news-immersion-day/users_interactions.csv")
interaction_data.head(5)

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US
2,1465416190,VIEW,310515487419366995,-1130272294246983140,2631864456530402479,,,
3,1465413895,FOLLOW,310515487419366995,344280948527967603,-3167637573980064150,,,
4,1465412290,VIEW,-7820640624231356730,-445337111692715325,5611481178424124714,,,


Amazon Personalize requires only three datafields in the interactions dataset.
    
1. `User_Id`: A unique identifier for an individual user
1. `Item_Id`: A unique itendifier for the item the user in question chose to interact with
1. `Timestamp`: The time of the interaction

lets take a look at what we have

In [8]:
interaction_data.columns

Index(['timestamp', 'eventType', 'contentId', 'personId', 'sessionId',
       'userAgent', 'userRegion', 'userCountry'],
      dtype='object')

In this case it appears we have all the data we need - we will just need to remap some of the column names such as `contentId` to `Item_Id` and `personId` to `User_Id`, personalize can process event types and even filter on them for training purposes though we will need to make a slight alteration to the column name there as well. Amazon Personalize can also use `sessionId` as well.

In [9]:
interaction_data.rename(columns={'contentId': 'item_id',
                            'eventType': 'event_type',
                            'personId': 'user_id',
                            'sessionId': 'session_id'}, inplace=True)

First lets check what unique field names we have in the eventType field:

In [10]:
interaction_data.event_type.value_counts()

VIEW               61086
LIKE                5745
BOOKMARK            2463
COMMENT CREATED     1611
FOLLOW              1407
Name: event_type, dtype: int64

Amazon Personalize thinks all of the interactions fed to it are representations of postive inclination between the user and the item - so we should discard any event types that do not represent positive inclination, such as thumbs down or a poor review, the above seems to all be positive interactions - so we wont remove any.

Next lets get a bit more of an idea of what is in the other columns

In [11]:
interaction_data.userAgent.value_counts()

Android - Native Mobile App                                                                                                                  6761
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36                               1823
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36                               1146
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36                                    1076
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36                                1059
                                                                                                                                             ... 
Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0                                               

In [12]:
interaction_data.userCountry.value_counts()

BR    51396
US     4657
KR      239
CA      226
JP      144
AU      138
GB       22
DE       19
IE       14
IS       13
SG       11
ZZ       11
AR        7
PT        6
IN        3
ES        3
IT        2
MY        2
CN        1
CL        1
NL        1
CO        1
CH        1
Name: userCountry, dtype: int64

With Amazon Personalize it is smart to group our categories where possible - the exact version number of a users browser likely doesnt matter - however whether they are logging on from a mobile device or a desktop probably does. We will ignore tha users region and country for this demo as we are not certain of how they are generated or whether or not they will be availible at inference.

In [13]:
np.nansum((interaction_data.userAgent.str.contains('Mobile')))

7433

how many users do we not know the agent for

In [14]:
sum(interaction_data.userAgent.isna())

15394

Lets replace these values with a string

In [15]:
interaction_data.userAgent.fillna("UnknownAgent", inplace = True)

The function below will group our user device categories into three families

In [16]:
def device_type(user_device_type_raw: str) -> str:
    if 'Mobile' in str(user_device_type_raw):
        return 'Mobile'
    elif user_device_type_raw == "UnknownAgent":
        return "UnknownAgent"
    else:
        return 'NonMobile'

In [17]:
interaction_data['user_device_type'] = interaction_data.userAgent.apply(device_type)

In [18]:
interaction_data['user_device_type'].value_counts()

NonMobile       49485
UnknownAgent    15394
Mobile           7433
Name: user_device_type, dtype: int64

Lets keep this column - as we likely know it with a degree of certainty when the user logs in and it is also likely to affect the type of content they will wish to consume. We will also keep the `timestamp`, `item_id` and `user_id` columns which are needed for personalize - as well as the `event_type` column and `session id` as well

In [19]:
interaction_data = interaction_data[["timestamp","event_type","item_id","user_id","session_id","user_device_type"]]

What does our distribution of interactions accross our various items look like

In [20]:
interaction_data.item_id.value_counts()

-4029704725707465084    433
-133139342397538859     315
 8657408509986329668    294
-6783772548752091658    294
-6843047699859121724    281
                       ... 
 2824996248683640175      1
 7029834616968294970      1
 7697593937932606048      1
-7108012586837980940      1
 7526977287801930517      1
Name: item_id, Length: 2987, dtype: int64

We have 2987 items represented in our interactions dataset - how many of these items have more than 5 interactions

In [21]:
np.mean(interaction_data.item_id.value_counts() > 5)

0.7643120187479076

How many unique users do we have

In [22]:
interaction_data.user_id.value_counts()

-1032019229384696495    1885
-1443636648652872475    1616
 3609194402293569455    1435
-2626634673110551643    1084
-3596626804281480007     903
                        ... 
-5749772120640063461       1
-8050113247491924905       1
-3410401356987328689       1
 4674800004298965524       1
 3357462296682851629       1
Name: user_id, Length: 1895, dtype: int64

What percentage of users have at least 5 interactions

In [23]:
np.mean(interaction_data.user_id.value_counts() > 5)

0.6437994722955145

64% of users have a history of at least 5 interactions these are very healthy numbers for training a personalize model, personalize can work for cold start users but it is nice that we have a healthy population of users with a decent number of interactions in our dataset.

In [24]:
interaction_data.head(10)

Unnamed: 0,timestamp,event_type,item_id,user_id,session_id,user_device_type
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,UnknownAgent
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,NonMobile
2,1465416190,VIEW,310515487419366995,-1130272294246983140,2631864456530402479,UnknownAgent
3,1465413895,FOLLOW,310515487419366995,344280948527967603,-3167637573980064150,UnknownAgent
4,1465412290,VIEW,-7820640624231356730,-445337111692715325,5611481178424124714,UnknownAgent
5,1465413742,VIEW,310515487419366995,-8763398617720485024,1395789369402380392,NonMobile
6,1465415950,VIEW,-8864073373672512525,3609194402293569455,1143207167886864524,UnknownAgent
7,1465415066,VIEW,-1492913151930215984,4254153380739593270,8743229464706506141,NonMobile
8,1465413762,VIEW,310515487419366995,344280948527967603,-3167637573980064150,UnknownAgent
9,1465413771,VIEW,3064370296170038610,3609194402293569455,1143207167886864524,UnknownAgent


Lets check for the percentage of nulls in each column

In [25]:
interaction_data.isnull().mean()

timestamp           0.0
event_type          0.0
item_id             0.0
user_id             0.0
session_id          0.0
user_device_type    0.0
dtype: float64

Lets check the timestamp to make sure they are formatted properly for Personalize, the following code below will change a Unix Epoch time (in seconds) timestamp (which is what personalize requires) to a human readable timestamp.

In [26]:
def custom_func(numerictime: int) -> str:
    """
    This function takes a numeric field representing epoch time as its argument and returns a human readable timestamp 
    
    :type numerictime: int
    :param numerictime: timestamp in unix epoch time measured in seconds from January 1, 1970
    
    :rtype: string
    :returns: a string representing a human readable timestamp in the format '%Y-%m-%d %H:%M:%S'
    """
    return datetime.utcfromtimestamp(numerictime).strftime('%Y-%m-%d %H:%M:%S')

In [27]:
interaction_data.timestamp.apply(custom_func)

0        2016-06-08 19:10:32
1        2016-06-08 19:02:40
2        2016-06-08 20:03:10
3        2016-06-08 19:24:55
4        2016-06-08 18:58:10
                ...         
72307    2017-01-23 16:53:45
72308    2017-01-23 16:53:45
72309    2017-01-23 16:47:52
72310    2017-01-23 16:53:54
72311    2017-01-23 16:13:08
Name: timestamp, Length: 72312, dtype: object

Looks good - lets save our data the code below is formatting code for pandas that helps personalize process the data correctly.

In [28]:
interactions_file_name = "deskdrop_interactions_automated.csv"

interaction_data.to_csv("poc_data/" + interactions_file_name,
                            index=False, 
                            float_format='%.0f',
                            quoting=csv.QUOTE_NONNUMERIC,
                            doublequote=False,
                            escapechar='\\')

Now lets create a schema for our dataset:

## Create the Interactions Schema <a class="anchor" id="interact_schema"></a>
[Back to top](#top)

Now that we've loaded and prepared our interactions dataset we'll need to configure the Amazon Personalize service to understand our data so that it can be used to train models for generating recommendations. Amazon Personalize requires a schema for each dataset so it can map the columns in our CSVs to fields for model training. Each schema is declared in JSON using the [Apache Avro](https://avro.apache.org/) format. 

First, define a schema to tell Amazon Personalize what type of dataset you are uploading. There are several reserved and mandatory keywords required in the schema, based on the type of dataset. More detailed information can be found in the [documentation](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html).

#### Interactions Dataset Schema

The interactions dataset has three required columns: `ITEM_ID`, `USER_ID`, and `TIMESTAMP`. The `TIMESTAMP` represents when the user interated with an item and must be expressed in Unix timestamp format (seconds). For this dataset we also have an `EVENT_TYPE` column that includes the type of interaction with the article (`VIEW`, `FOLLOW`, `LIKE` ect).

The interactions dataset supports metadata columns. Interaction metadata columns are a way to provide contextual details that are specific to an interaction such as the user's current device type (phone, tablet, desktop, set-top box, etc), the user's current location (city, region, metro code, etc), current weather conditions, and so on. For this dataset, we have a `user_device_type` column that indicates what sort of device the user is logging into the site from, mobile or non-mobile - as this will likely influnce the sort of articles they want to engage with. Note that the `user_device_type` column is annotated as being categorical (`"categorical": true`) this is required if we want personalize to use it as part of its modeling.

If we didn't want personalize to model off of it - we could insert it as merely a string field in which case we could still use it for filtering. Personalize will process the most frequent 1000 categories individually and bundle the remaining categories into a 1001 category. If this isn't ideal for your use case you may want to manually bundle categories yourself.

In [29]:
interactions_schema = {
	"type": "record",  
	"name": "Interactions",  
	"namespace": "com.amazonaws.personalize.schema",  
	"fields": [  
		{  
			"name": "USER_ID",  
			"type": "string"  
		},  
		{  
			"name": "ITEM_ID",  
			"type": "string"  
		},  
		{  
			"name": "TIMESTAMP",  
			"type": "long"  
		}, 
		{  
			"name": "SESSION_ID",  
			"type": "string"  
		},  
		{  
			"name": "EVENT_TYPE",  
			"type": "string"  
		},  
		{  
			"name": "user_device_type",  
			"type": [  
				"null",  
				"string"  
			],  
			"categorical": True  
		}  
	],  
	"version": "1.0"  
}  

In [30]:
try:
    # Try to create the interactions dataset schema, this block with exectute fully 
    # if the interactions dataset schema does not exist yet
    create_schema_response = personalize.create_schema(
        name = interactions_schema_name,
        schema = json.dumps(interactions_schema),
    )
    print(json.dumps(create_schema_response, indent=2))
    workshop_interactions_schema_arn = create_schema_response['schemaArn']
    print ('\nCreating the Interactions Schema with workshop_interactions_schema_arn = {}'.format(workshop_interactions_schema_arn))
    
except personalize.exceptions.ResourceAlreadyExistsException:
    # if the interactions dataset schema already exists, get the unique identifier workshop_interactions_schema_arn
    # from the existing resource 
    
    workshop_interactions_schema_arn = 'arn:aws:personalize:'+region+':'+account_id+':schema/'+interactions_schema_name 
    print('The schema {} already exists.'.format(workshop_interactions_schema_arn))
    print ('\nWe will be using the existing Interactions Schema with workshop_interactions_schema_arn = {}'.format(workshop_interactions_schema_arn))

{
  "schemaArn": "arn:aws:personalize:us-east-1:381491864570:schema/immersion_day_news_interactions_schema",
  "ResponseMetadata": {
    "RequestId": "e1fda654-5f53-4d6f-a6ae-b3854c29b961",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Thu, 11 Apr 2024 22:00:43 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "104",
      "connection": "keep-alive",
      "x-amzn-requestid": "e1fda654-5f53-4d6f-a6ae-b3854c29b961",
      "strict-transport-security": "max-age=47304000; includeSubDomains",
      "x-frame-options": "DENY",
      "cache-control": "no-cache",
      "x-content-type-options": "nosniff"
    },
    "RetryAttempts": 0
  }
}

Creating the Interactions Schema with workshop_interactions_schema_arn = arn:aws:personalize:us-east-1:381491864570:schema/immersion_day_news_interactions_schema


## Preparing our Items (Articles) Metadata

Now that we have created the schema lets take a look at our items data (in this case data on our articles)

In [31]:
articles = pd.read_csv("https://d2peeor3oplhc6.cloudfront.net/personalize-news-immersion-day/shared_articles.csv")

What type of fields do we have:

In [32]:
articles.columns

Index(['timestamp', 'eventType', 'contentId', 'authorPersonId',
       'authorSessionId', 'authorUserAgent', 'authorRegion', 'authorCountry',
       'contentType', 'url', 'title', 'text', 'lang', 'article_summary',
       'article_hook', 'article_trigger', 'article_genre'],
      dtype='object')

### Sample Text

In [33]:
print(articles.text[0])

All of this work is still very early. The first full public version of the Ethereum software was recently released, and the system could face some of the same technical and legal problems that have tarnished Bitcoin. Many Bitcoin advocates say Ethereum will face more security problems than Bitcoin because of the greater complexity of the software. Thus far, Ethereum has faced much less testing, and many fewer attacks, than Bitcoin. The novel design of Ethereum may also invite intense scrutiny by authorities given that potentially fraudulent contracts, like the Ponzi schemes, can be written directly into the Ethereum system. But the sophisticated capabilities of the system have made it fascinating to some executives in corporate America. IBM said last year that it was experimenting with Ethereum as a way to control real world objects in the so-called Internet of things. Microsoft has been working on several projects that make it easier to use Ethereum on its computing cloud, Azure. "Eth

### Check for article duplicates

In [34]:
articles.contentId.value_counts().value_counts()

1    2993
2      63
3       1
Name: contentId, dtype: int64

In [35]:
articles.contentId.value_counts()

-2990485643677949494    3
 9033884391004475493    2
 2293315701090958124    2
 8289800212949675494    2
 5817939718364925129    2
                       ..
 1981046186743381313    1
-8742648016180281673    1
 4634963407423735625    1
-1081723567492738167    1
 7062745737910050042    1
Name: contentId, Length: 3057, dtype: int64

Which article has three interactions

In [36]:
articles[articles.contentId == -2990485643677949494]

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang,article_summary,article_hook,article_trigger,article_genre
652,1461864561,CONTENT SHARED,-2990485643677949494,-3390049372067052505,-7303988376323787105,,,,HTML,https://hbr.org/2016/04/the-secret-history-of-...,The Secret History of Agile Innovation,Create a FREE account to: Get eight free artic...,en,\nThe article discusses the benefits of creati...,Harvard Business Review offers free registrati...,Sign up for a free HBR account to access busin...,non tech
653,1461864719,CONTENT REMOVED,-2990485643677949494,-3390049372067052505,-7303988376323787105,,,,HTML,https://hbr.org/2016/04/the-secret-history-of-...,The Secret History of Agile Innovation,Create a FREE account to: Get eight free artic...,en,\nThe article discusses the benefits of creati...,Harvard Business Review offers free registrati...,Sign up for a free HBR account to access busin...,non tech
654,1461864769,CONTENT REMOVED,-2990485643677949494,-3390049372067052505,-7303988376323787105,,,,HTML,https://hbr.org/2016/04/the-secret-history-of-...,The Secret History of Agile Innovation,Create a FREE account to: Get eight free artic...,en,\nThe article discusses the benefits of creati...,Harvard Business Review offers free registrati...,Sign up for a free HBR account to access busin...,non tech


How many event types do we have:

In [37]:
articles.eventType.value_counts()

CONTENT SHARED     3047
CONTENT REMOVED      75
Name: eventType, dtype: int64

How many articles do we have in each langauge

In [38]:
articles.lang.value_counts()

en    2264
pt     850
la       4
es       2
ja       2
Name: lang, dtype: int64

These are only supposed to be in english and portugese lets take a look at some of the other langauge examples

In [39]:
articles[articles.lang.isin(['la','es','ja'])][['title','text', 'lang', 'contentId']]

Unnamed: 0,title,text,lang,contentId
593,Costa Rica presenta su primer edificio constru...,"San José, 22 abr (EFE).- Autoridades de Costa ...",es,4654349179360175629
1049,La RAE lucha contra los anglicismos con una ca...,La Real Academia Española alerta sobre su abus...,es,4012770927504386801
1645,Request lesson : How and when to use はず(=hazu)...,= Kotoshi no aki made niwa kare ga dekiru hazu...,la,-9216926795620865886
1700,"Within a Decade, Retail Banks will be Dead","A соuрlе of weeks ago, I found myself sitting ...",la,-8854146354650101086
1825,40 Basic Japanese conversations,Japanese conversation using Ninja LINE sticker...,ja,-5527182266336855540
1851,"The Algorithm March, Japan's Strangely Enterta...",Arugorizumu Koushin! アルゴリズムこうしん (Algorithm Mar...,ja,-4434534460030275781
2272,git flow with support,"Lorem ipsum dolor sit amet, consectetur adipis...",la,-8083832514395551465
2273,git flow with support,"Lorem ipsum dolor sit amet, consectetur adipis...",la,-8083832514395551465


In [40]:
articles[articles.lang.isin(['la','es','ja'])].text[2273]

'Lorem ipsum dolor sit amet, consectetur adipisicing elit. Sint, ducimus, qui fuga corporis veritatis doloribus iure nulla optio dolores maiores dolorum ullam alias cum libero obcaecati cupiditate sit illo aperiam possimus voluptatum similique neque explicabo quibusdam aspernatur dolorem. Quod, corrupti magni explicabo nam sequi nesciunt accusamus aliquam dolore!'

This is placeholder used in UI development lets drop all these articles for simplicity sake

In [41]:
contentIds_to_drop = articles[articles.lang.isin(['la','es','ja'])].contentId

In [42]:
articles.drop(articles.loc[articles.contentId.isin(contentIds_to_drop)].index, inplace=True)

Lets also drop the removal tags

In [43]:
articles.drop(articles.loc[articles.eventType == 'CONTENT REMOVED'].index, inplace=True)

How many duplicates do we have left?

In [44]:
articles.contentId.value_counts().value_counts()

1    3040
Name: contentId, dtype: int64

Personalize can only process one textual field and we have several - first which fields are potentially strings?

In [45]:
articles.dtypes

timestamp           int64
eventType          object
contentId           int64
authorPersonId      int64
authorSessionId     int64
authorUserAgent    object
authorRegion       object
authorCountry      object
contentType        object
url                object
title              object
text               object
lang               object
article_summary    object
article_hook       object
article_trigger    object
article_genre      object
dtype: object

All fields labeld as object could be strings - how many distinct values do we have in each field?

In [46]:
articles.nunique()

timestamp          3039
eventType             1
contentId          3040
authorPersonId      251
authorSessionId    2004
authorUserAgent     114
authorRegion         19
authorCountry         5
contentType           3
url                3016
title              2996
text               3006
lang                  2
article_summary    3033
article_hook       3026
article_trigger    2998
article_genre         5
dtype: int64

We will drop all of the information on the author as it shouldn't be relevant. `eventType` is singular after we dropped other articles so we will drop it.  `contentType`, `lang`, and `article_genre` have low cardinality so they are likely categorical - that leaves potential text fields as

1. Title
1. Text
1. article_summary
1. article_hook
1. article_trigger
1. article_genre

Lets examine each one.

In [47]:
articles.title

1       Ethereum, a Virtual Currency, Enables Transact...
2       Bitcoin Future: When GBPcoin of Branson Wins O...
3                            Google Data Center 360° Tour
4       IBM Wants to "Evolve the Internet" With Blockc...
5       IEEE to Talk Blockchain at Cloud Computing Oxf...
                              ...                        
3117    Conheça a Liga IoT, plataforma de inovação abe...
3118    Amazon takes on Skype and GoToMeeting with its...
3119                          Code.org 2016 Annual Report
3120    JPMorgan Software Does in Seconds What Took La...
3121                 The 2017 Acquia Partners of the Year
Name: title, Length: 3040, dtype: object

Some of these are in english and some in portugese as expected

In [48]:
articles.text[1]

'All of this work is still very early. The first full public version of the Ethereum software was recently released, and the system could face some of the same technical and legal problems that have tarnished Bitcoin. Many Bitcoin advocates say Ethereum will face more security problems than Bitcoin because of the greater complexity of the software. Thus far, Ethereum has faced much less testing, and many fewer attacks, than Bitcoin. The novel design of Ethereum may also invite intense scrutiny by authorities given that potentially fraudulent contracts, like the Ponzi schemes, can be written directly into the Ethereum system. But the sophisticated capabilities of the system have made it fascinating to some executives in corporate America. IBM said last year that it was experimenting with Ethereum as a way to control real world objects in the so-called Internet of things. Microsoft has been working on several projects that make it easier to use Ethereum on its computing cloud, Azure. "Et

This is the raw text of the article it looks reasonably long - personalize has a limit of 20000 characters in one individual field - what is the max number of characters in this field

In [49]:
articles.text.apply(len).max()

122568

Thats too many what about the summary is that any better?

In [50]:
articles.article_summary[1]

"\nEthereum is a new blockchain-based platform that allows developers to build decentralized applications on top of it. It was created in 2014 by Vitalik Buterin and has gained a lot of interest from major banks and companies. \n\nUnlike Bitcoin, which was created anonymously, Ethereum was created in a transparent way by Buterin when he was 21 years old. The platform allows developers to program smart contracts that execute automatically when conditions are met. This enables new kinds of blockchain-based applications.\n\nEthereum raised $18 million in a presale of its native cryptocurrency ether in 2014. This helped fund the non-profit Ethereum Foundation which supports the platform's development. There is now a dedicated network of thousands of computers worldwide that support the platform. \n\nThe platform is still new and faces technical and legal challenges. But major banks like JPMorgan have experimented with private Ethereum blockchains. Microsoft and IBM are also working on Ethe

In [51]:
articles.article_summary.apply(len).max()

2474

Much better - its certainly useful to feed the text of the article to personalize and in this case the summary looks like a much better option now lets take a look at the `article_trigger` and the `article_hook`

In [52]:
articles.article_trigger[1]

'Ethereum platform enables decentralized blockchain applications'

In [53]:
articles.article_hook[1]

'The Ethereum platform allows developers to build decentralized blockchain applications through smart contracts. Major banks and tech companies are experimenting with it.'

Typically on a website you would see the hook and the trigger displayed along with an image in someway like this:
    
[Ethereum platform enables decentralized blockchain applications.](https://somewebsite.com) The Ethereum platform allows developers to build decentralized blockchain applications through smart contracts. Major banks and tech companies are experimenting with it.

For the sake of training a personalize model we will need to create one text field. To do these we can either drop the unwanted fields or combine the text fields together. We know that users will first see the `article_hook` and the `article_trigger` and then read the text - in this case the `text` field is too long so we will use `article_summary` in place. The best thing to do is probably to concatenate the text fields as follows:

`article_trigger` + `article_hook` + `article_summary`

In [54]:
articles['training_text'] = articles['article_trigger'] +  '\n\n' + articles['article_hook'] + '\n\n' +  articles['article_summary']

In [55]:
print(articles['training_text'][1])

Ethereum platform enables decentralized blockchain applications

The Ethereum platform allows developers to build decentralized blockchain applications through smart contracts. Major banks and tech companies are experimenting with it.


Ethereum is a new blockchain-based platform that allows developers to build decentralized applications on top of it. It was created in 2014 by Vitalik Buterin and has gained a lot of interest from major banks and companies. 

Unlike Bitcoin, which was created anonymously, Ethereum was created in a transparent way by Buterin when he was 21 years old. The platform allows developers to program smart contracts that execute automatically when conditions are met. This enables new kinds of blockchain-based applications.

Ethereum raised $18 million in a presale of its native cryptocurrency ether in 2014. This helped fund the non-profit Ethereum Foundation which supports the platform's development. There is now a dedicated network of thousands of computers worl

looks good - now lets drop all the fields we do not need we will keep only the following fields:

1. timestamp
1. contentId
1. ARTICLE_GENRE
1. training_text
1. lang

In [56]:
articles_mlfeatures = articles.drop(columns=articles.columns[~articles.columns.isin(['timestamp', 'contentId', 'article_genre', 'training_text', 'lang'])])

Once again Amazon Personalize requires the identifier of an item to be called `item_id` as we discussed before - we will also need to rename the `timestamp` column so it can be used appropriately 

In [57]:
articles_mlfeatures.rename(columns={'timestamp': 'creation_timestamp', 
                                   'contentId': 'item_id'}, inplace=True)

Do we have any missing features for any of our articles?

In [58]:
articles_mlfeatures.isna().mean()

creation_timestamp    0.0
item_id               0.0
lang                  0.0
article_genre         0.0
training_text         0.0
dtype: float64

Lets check the timestamp to make sure they are formatted properly for Personalize

In [59]:
articles_mlfeatures.creation_timestamp.apply(custom_func)

1       2016-03-28 19:39:48
2       2016-03-28 19:42:26
3       2016-03-28 19:47:54
4       2016-03-28 19:48:17
5       2016-03-28 19:48:42
               ...         
3117    2017-02-24 14:30:04
3118    2017-02-24 14:37:47
3119    2017-02-27 19:20:24
3120    2017-02-28 16:51:59
3121    2017-02-28 18:51:11
Name: creation_timestamp, Length: 3040, dtype: object

What is the last date that we see

In [60]:
custom_func(articles_mlfeatures.creation_timestamp.max())

'2017-02-28 18:51:11'

What is the first date that we see

In [61]:
custom_func(articles_mlfeatures.creation_timestamp.min())

'2016-03-28 19:39:48'

Lets clean the article text so it can more easily processed by personalize the helper function below processes and cleans the text so that personalize can process it

In [62]:
def process_text(text: str) -> str:
    """
    This function exists to process text prior to using it to create a personalize model it takes as an input a text string and returns a text string
    """

    if type(text) == str:
        html_text = text.replace("<p>&nbsp;</p>", "")  # remove &nbsp from the text
        html_text = html_text.replace("<p></p>", "")  # remove <p></p> from the text
        html_text = html_text.replace("<p> </p>", "")  # remove <p> </p> from the text

        # remove hyper links form the text

        a_pattern = re.compile("<a.*?>")
        html_text = re.sub(a_pattern, "", html_text)
        html_text = html_text.replace("</a>", "")

        # remove spans from the text
        span_pattern = re.compile("<span.*?>")
        html_text = re.sub(span_pattern, "", html_text)
        html_text = html_text.replace("</span>", "")

        class_pattern = re.compile("<class.*?>")
        html_text = re.sub(class_pattern, "", html_text)
        html_text = html_text.replace("</class>", "")

        # remove <b> and </b> from the text
        html_text = html_text.replace("<b>", "")
        html_text = html_text.replace("</b>", "")

        # remove .  . and \n from the text
        html_text = html_text.replace("\n", " ")

        # remove .  . and @ from the text
        html_text = html_text.replace("@", "")

        # remove .  . and - from the text
        html_text = html_text.replace("-", "")
        
        # add escapes to single quotes
        html_text = html_text.replace("\"", "\\\"")

        # remove HTML
        cleanr = re.compile("^.*?>|<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
        html_text = re.sub(cleanr, "", html_text)

        # remove extra space etc.
        html_text = " ".join(html_text.split())

        # specifics for Amazon dataset
        html_text = html_text.replace(".  .", "")
        warnings.simplefilter("ignore", MarkupResemblesLocatorWarning)
        soup = BeautifulSoup(html_text, "html.parser")
        text = soup.find_all(string=True)

        for index, t in enumerate(text):
            if t[-1] != ".":
                text[index] += "."

        return " ".join(text).strip()

    else:
        return " "
    

In [63]:
articles_mlfeatures.training_text = articles_mlfeatures.training_text.apply(process_text)

One Final Check

In [64]:
articles_mlfeatures.head(5)

Unnamed: 0,creation_timestamp,item_id,lang,article_genre,training_text
1,1459193988,-4110354420726924665,en,crypto currency,Ethereum platform enables decentralized blockc...
2,1459194146,-7292285110016212249,en,crypto currency,AI assistant describes daily life enhanced by ...
3,1459194474,-6151852268067518688,en,cloud provider news,Google releases immersive 360 video tour insid...
4,1459194497,2448026894306402386,en,crypto currency,Linux Foundation and IBM aim to advance blockc...
5,1459194522,-2826566343807132236,en,tech,IEEE conference highlights blockchain technolo...


Looks good lets save and create a schema for this dataset

In [65]:
items_file_name = "deskdrop_articles_automated.csv"

articles_mlfeatures.to_csv("poc_data/" + items_file_name,
                            index=False, 
                            float_format='%.0f',
                            quoting=csv.QUOTE_NONNUMERIC,
                            doublequote=False,
                            escapechar='\\')

## Create the Items (Articles) schema<a class="anchor" id="items_schema"></a>
[Back to top](#top)

The items dataset schema requires an `ITEM_ID` column and at least one metadata column. Up to 100 metadata columns can be added to the items dataset. adding an article `CREATION_TIMESTAMP` is recommended as it will assist personalize with exploration and recommending new articles

For this dataset we have three metadata columns: `LANG`, `ARTICLE_GENRE`, and `TRAINING_TEXT`

Note that `LANG`, `ARTICLE_GENRE` are annotated as being categorical (`"categorical": True`). In this case though we also have a field annotated as textual (`"textual": True`) - personalize can process one text field in its item dataset.

In [66]:
items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "CREATION_TIMESTAMP",
            "type": "long"
        },
        {
            "name": "TRAINING_TEXT",
            "type": ["null", "string"],
            "textual": True,
        },
        {
            "name": "LANG",
            "type": "string",
            "categorical": True
        },
        {
            "name": "ARTICLE_GENRE",
            "type": "string",
            "categorical": True
        }
    ],
    "version": "1.0"
}

In [67]:
try:
    # Try to create the items dataset schema, this block with exectute fully 
    # if the items dataset schema does not exist yet
    
    create_schema_response = personalize.create_schema(
        name = items_schema_name,
        schema = json.dumps(items_schema),
    )
    workshop_items_schema_arn = create_schema_response['schemaArn']
    print(json.dumps(create_schema_response, indent=2))

    print ('\nCreating the Items Schema with workshop_items_schema_arn = {}'.format(workshop_items_schema_arn))
    
except personalize.exceptions.ResourceAlreadyExistsException:
    # if the items dataset schema already exists, get the unique identifier workshop_items_schema_arn 
    # from the existing resource 
    
    workshop_items_schema_arn = 'arn:aws:personalize:'+region+':'+account_id+':schema/'+items_schema_name 
    print('The schema {} already exists.'.format(workshop_items_schema_arn))
    print ('\nWe will be using the existing Items Schema with workshop_items_schema_arn = {}'.format(workshop_items_schema_arn))

{
  "schemaArn": "arn:aws:personalize:us-east-1:381491864570:schema/immersion_day_news_items_schema",
  "ResponseMetadata": {
    "RequestId": "e6f563a0-0325-4393-9cc4-9ffc660897ac",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Thu, 11 Apr 2024 22:01:21 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "97",
      "connection": "keep-alive",
      "x-amzn-requestid": "e6f563a0-0325-4393-9cc4-9ffc660897ac",
      "strict-transport-security": "max-age=47304000; includeSubDomains",
      "x-frame-options": "DENY",
      "cache-control": "no-cache",
      "x-content-type-options": "nosniff"
    },
    "RetryAttempts": 0
  }
}

Creating the Items Schema with workshop_items_schema_arn = arn:aws:personalize:us-east-1:381491864570:schema/immersion_day_news_items_schema


# Creating Dataset Group and Importing data <a class="anchor" id="import"></a>

## Configure an S3 bucket and an IAM  role <a class="anchor" id="bucket_role"></a>

So far, we have downloaded, manipulated, and saved the data onto the Amazon EBS instance attached to the instance running this Jupyter notebook.  

By default, the Personalize service does not have permission to access the data we upload into  S3 buckets in our account. In order to grant access to the Amazon Personalize service to interact with our S3 Buckets, we need to set a Bucket Policy and create an IAM role that the Amazon Personalize service will assume. If you are running this notebook without also running the Pretrained cloud formation template in the root folder then you will need to grant this Notebook substantial permissions in order to be able to run the code correctly.

If you are running this with the pretrained cloud formation template `PersonalizeIDPretrained.yaml` then the notebook will not need that level of permissioning and will obtain the permissions from the bucket created as part of the automation process for demonstration.

Use the metadata stored on the instance underlying this Amazon SageMaker notebook, to determine the region it is operating in. If you are using a Jupyter notebook outside of Amazon SageMaker, simply define the region as a string below. The Amazon S3 bucket needs to be in the same region as the Amazon Personalize resources we have been creating so far.

First, let us get the current notebook region.

In [68]:
with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
    data = json.load(notebook_info)
    resource_arn = data['ResourceArn']
    region = resource_arn.split(':')[3]

# To use a different region use:
# region = <your_region>

print('region:', region)

region: us-east-1


Amazon S3 bucket names are globally unique. To create a unique bucket name, the code below will append the string `personalize-poc-publishing` to your AWS account number. Then it creates a bucket with this name in the region discovered in the previous cell. Note if you have already created a bucket as part of the cloud formation automation then the cell below will return information on that previously created bucket.

In [69]:
# Configure the SDK to SSM:
ssm = boto3.client('ssm')
s3 = boto3.client('s3')
bucket_found = False
try:
    personalizes3bucket = ssm.get_parameter(Name='/cloudformation/personalize-s3-bucket', WithDecryption=False)
    bucket_name = personalizes3bucket['Parameter']['Value']
    print('Bucket created as part of cloud formation template found')
    print('bucket_name:', bucket_name)
    bucket_found=True
except:
    
    account_id = boto3.client('sts').get_caller_identity().get('Account')
    bucket_name = account_id + "-" + region + "-" + "personalize-poc-publishing"

    #getting existing buckets in the account
    response = s3.list_buckets()

    if bucket_name in [x['Name'] for x in response['Buckets']]:
        print("The bucket already exists.")
    else:
        if region == "us-east-1":
            bucket_responese = s3.create_bucket(Bucket=bucket_name)
        else:
            bucket_responese = s3.create_bucket(
                Bucket=bucket_name,
                CreateBucketConfiguration={'LocationConstraint': region}
                )
    print('bucket_name:', bucket_name)

Bucket created as part of cloud formation template found
bucket_name: test-stack-3-mlopsstack-1ksuxvta-personalizebucket-evtxjufeydvb


Amazon Personalize needs to be able to read the contents of your S3 bucket. The policy which enables personalize to access the contents of the S3 bucket is below.

```python
policy = {
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:*Object",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket_name),
                "arn:aws:s3:::{}/*".format(bucket_name)
            ]
        }
    ]
}
```

This S3 bucket policy allows Amazon Personalize to be able to read the contents of your S3 bucket. The code below creates and adds the policy to the bucket it created in the last step. If the bucket was created as part of the automation script then it merely shows the bucket that it created.

In [70]:
if bucket_found:
    bucket_current_policy = s3.get_bucket_policy(Bucket=bucket_name)['Policy']
    print("Policy for bucket created as part of cloud formation template:")
    print(json.loads(bucket_current_policy))
else:
    policy = {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy",
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {
                    "Service": "personalize.amazonaws.com"
                },
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:PutObject"
                ],
                "Resource": [
                    "arn:aws:s3:::{}".format(bucket_name),
                    "arn:aws:s3:::{}/*".format(bucket_name)
                ]
            }
        ]
    }

    bucket_current_policy = None

    try:
        bucket_current_policy = s3.get_bucket_policy(Bucket=bucket_name)['Policy']

    except s3.exceptions.from_code('NoSuchBucketPolicy') as e:    
        print("There is no current Bucket Policy for bucket " + bucket_name)

    except Exception as e: 
        raise(e)

    if (bucket_current_policy and policy == json.loads(bucket_current_policy)):
        print ("The policy is already associated with the S3 Bucket.")
    else:
        print ("Adding the policy to the bucket.")
        print(s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy)))

Policy for bucket created as part of cloud formation template:
{'Version': '2012-10-17', 'Id': 'PersonalizeS3BucketAccessPolicy', 'Statement': [{'Effect': 'Allow', 'Principal': {'Service': 'personalize.amazonaws.com'}, 'Action': ['s3:GetObject', 's3:PutObject', 's3:ListBucket'], 'Resource': ['arn:aws:s3:::test-stack-3-mlopsstack-1ksuxvta-personalizebucket-evtxjufeydvb', 'arn:aws:s3:::test-stack-3-mlopsstack-1ksuxvta-personalizebucket-evtxjufeydvb/*']}]}


### Create an IAM role

Amazon Personalize also needs the ability to assume roles in AWS in order to have the permissions to execute certain tasks. Let's create an IAM role and attach the required policies to it. The code below attaches broad policies. You should use [more restrictive, least-privilege policies for production applications](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html#grant-least-privilege). If you have run the cloud formation template the role will have been automatically created and we will simply obtain that one.

In [71]:
if bucket_found:
    print("Cloud formation template used - skipping creation of IAM role and needed policies for Personalize as they were already created as part of the automation script")
    role_arn_info = ssm.get_parameter(Name='/cloudformation/personalize-iam-role-arn', WithDecryption=False)
    role_arn = role_arn_info['Parameter']['Value']
    role_name = role_arn.split('/')[1]
else:
    iam = boto3.client("iam")

    role_name = account_id+"-PersonalizeS3-Immersion-Day"
    assume_role_policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                "Service": "personalize.amazonaws.com"
              },
                "Action": "sts:AssumeRole"
            }
        ]
    }

    # Create policy

    s3_access_policy_document = {
        "Version": "2012-10-17",
        "Statement": {
                "Sid" : "myStatement" ,
                "Effect": "Allow",
                "Resource": [
                    "arn:aws:s3:::{}".format(bucket_name),
                    "arn:aws:s3:::{}/*".format(bucket_name)
                ],
                "Action": "s3:*"
            }
    }

    try: 

        policy_response = iam.create_policy(
            PolicyName='restrictedS3Access',
            PolicyDocument=json.dumps(s3_access_policy_document),
            Description='Restricts access to only workshop S3 bucket'
        )

        s3_access_policy_arn = policy_response['Policy']['Arn']

        print ("s3_access_policy_arn:{}".format(s3_access_policy_arn))
    except:
        s3_access_policy_arn = 'arn:aws:iam::{}:policy/restrictedS3Access'.format(account_id)
        print ('The policy {} already exists.'.format(s3_access_policy_arn))
        print ('Using the existing policy')


    try:
        create_role_response = iam.create_role(
            RoleName = role_name,
            AssumeRolePolicyDocument = json.dumps(assume_role_policy_document),
        );
        role_arn = create_role_response["Role"]["Arn"]

        print ("10s pause to allow role to be fully consistent.")
        time.sleep(10)

    except iam.exceptions.EntityAlreadyExistsException as e:
        print('Warning: role already exists:', e)
        role_arn = iam.get_role(
            RoleName = role_name
        )["Role"]["Arn"];

    print('IAM Role: {}\n'.format(role_arn))

    # Attach the policy if it is not previously attached:
    if (s3_access_policy_arn in [ x['PolicyArn'] for x in iam.list_attached_role_policies( RoleName = role_name)['AttachedPolicies']]):
        print ('The policy {} is already attached to this role.'.format(s3_access_policy_arn))
    else:
        print ("Attaching the role_policy: {}".format(s3_access_policy_arn))
        attach_response = iam.attach_role_policy(
            RoleName = role_name,
            PolicyArn = s3_access_policy_arn
        );
        print ("30s pause to allow role to be fully consistent.")
        time.sleep(30)
        print('Done.')

Cloud formation template used - skipping creation of IAM role and needed policies for Personalize as they were already created as part of the automation script


### Upload data to S3

Now that your Amazon S3 bucket has been created, upload the CSV files of our 3 datasets (Item, Interaction and User).

<div class="alert alert-block alert-warning">
<b>Note:</b> NOTE: We will cover how to import real-time data in a future notebook..
</div>

In [72]:
interactions_file_path = data_dir + "/" + interactions_file_name

try:
    s3.get_object(
        Bucket=bucket_name,
        Key=interactions_file_path,
    )
    print("{} already exists in the bucket {}".format(interactions_file_path, bucket_name))
except s3.exceptions.NoSuchKey:
    # Uploading the file if it does not already exist
    boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_file_name).upload_file(interactions_file_path)
    print("File {} uploaded to bucket {}".format(interactions_file_name, bucket_name))

items_file_path = data_dir + "/" + items_file_name
try:
    s3.get_object(
        Bucket=bucket_name,
        Key=items_file_name,
    )
    print("{} already exists in the bucket {}".format(items_file_path, bucket_name))
except s3.exceptions.NoSuchKey:
    # Uploading the file if it does not already exist     
    boto3.Session().resource('s3').Bucket(bucket_name).Object(items_file_name).upload_file(items_file_path)
    print("File {} uploaded to bucket {}".format(items_file_name, bucket_name))

except s3.exceptions.NoSuchKey:
    # Uploading the file if it does not already exist
    boto3.Session().resource('s3').Bucket(bucket_name).Object(users_file_name).upload_file(users_file_path)
    print("File {} uploaded to bucket {}".format(users_file_name, bucket_name))

File deskdrop_interactions_automated.csv uploaded to bucket test-stack-3-mlopsstack-1ksuxvta-personalizebucket-evtxjufeydvb
poc_data/deskdrop_articles_automated.csv already exists in the bucket test-stack-3-mlopsstack-1ksuxvta-personalizebucket-evtxjufeydvb


## Create dataset group <a class="anchor" id="group_dataset"></a>
[Back to top](#top)

The highest level of isolation and abstraction with Amazon Personalize is a *dataset group*. Information stored within one of these dataset groups has no impact on any other dataset group or models created from one – they are completely isolated. This allows you to run many experiments and is part of how we keep your models private and fully trained only on your data. 

Before importing the data prepared earlier, there needs to be a dataset group and a dataset added to it that handles the interactions.

Dataset groups can house the following types of information:

* User-item-interactions
* Event streams (real-time interactions)
* User metadata
* Item metadata

We need to create the dataset group that will contain our three datasets.

Your dataset group can be one of the following types:

* A Domain dataset group, where you create preconfigured resources for different business domains and use cases, such as getting recommendations for similar videos (VIDEO_ON_DEMAND domain) or best selling items (ECOMMERCE domain). You choose your business domain, import your data, and create recommenders. You use recommenders in your application to get recommendations. Use a [Domain dataset group](https://docs.aws.amazon.com/personalize/latest/dg/domain-dataset-groups.html) if you have a video on demand or e-commerce application and want Amazon Personalize to find the best configurations for your use cases. If you start with a Domain dataset group, you can also add custom resources such as solutions with solution versions trained with recipes for custom use cases.


* A [Custom dataset group](https://docs.aws.amazon.com/personalize/latest/dg/custom-dataset-groups.html), where you create configurable resources for custom use cases and batch recommendation workflows. You select a recipe, train a solution version (model), and deploy the solution version with a campaign. You use a campaign in your application to get recommendations. Use a Custom dataset group if you don't have a video on demand or e-commerce application or want to configure and manage only custom resources, or want to get recommendations in a batch workflow. If you start with a Custom dataset group, you can't associate it with a domain later. Instead, create a new Domain dataset group.

You can create and manage Domain dataset groups and Custom dataset groups with the AWS console, the AWS Command Line Interface (AWS CLI), or programmatically with the AWS SDKs.

<div class="alert alert-block alert-warning">
<b>Note:</b> If you are running this as part of an AWS workshop, the resources have been created ahead of time, this is to eliminate the time spent waiting for the data to import, models to train and recommenders to deploy. In these notebooks will check to see if the resources exist and use them. You may see “Resource X Already exists” messages, if you run these notebooks in your own account, it will create these resources, which will add approximately 90 minutes to this workshop.
</div>


## Create dataset group for personalized news model <a class="anchor" id="cluster_group_dataset"></a>
[Back to top](#top)

The highest level of isolation and abstraction with Amazon Personalize is a *dataset group*. Information stored within one of these dataset groups has no impact on any other dataset group or models created from one - they are completely isolated. This allows you to run many experiments and is part of how we keep your models private and fully trained only on your data. 

Before importing the data prepared earlier, there needs to be a dataset group and a dataset added to it that handles the interactions.

Dataset groups can house the following types of information:

* User-item-interactions
* Event streams (real-time interactions)
* User metadata (out of scope for this workshop)
* Item metadata
* Action metadata (out of scope for this workshop)
* Action interaction data (out of scope for this workshop)

We need to create the dataset group that will contain our two datasets.

#### Create Dataset Group
The following cell will create a new dataset group with the name `personalize-immersion-day-news`.

In [73]:
try:     
    # Try to create the dataset group, this block with exectute fully if the dataset group does not exist yet
    
    create_dataset_group_response = personalize.create_dataset_group(
        name = 'personalize-immersion-day-news',
    )
    workshop_dataset_group_arn = create_dataset_group_response['datasetGroupArn']
    print(json.dumps(create_dataset_group_response, indent=2))

    print(f'DatasetGroupArn = {workshop_dataset_group_arn}')
    
except personalize.exceptions.ResourceAlreadyExistsException as e:
    # if the dataset group already exists, get the unique identifier workshop_dataset_group_arn 
    # from the existing resource
    
    workshop_dataset_group_arn = 'arn:aws:personalize:'+region+':'+account_id+':dataset-group/'+workshop_dataset_group_name 
    print ('\nThe the Dataset Group with dataset_group_arn = {} already exists'.format(workshop_dataset_group_arn))
    print ('\nWe will be using the existing Dataset Group dataset_group_arn = {}'.format(workshop_dataset_group_arn))


{
  "datasetGroupArn": "arn:aws:personalize:us-east-1:381491864570:dataset-group/personalize-immersion-day-news",
  "ResponseMetadata": {
    "RequestId": "a3df103c-8b7e-4989-9964-7d5e1ad0ddb4",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Thu, 11 Apr 2024 22:01:26 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "109",
      "connection": "keep-alive",
      "x-amzn-requestid": "a3df103c-8b7e-4989-9964-7d5e1ad0ddb4",
      "strict-transport-security": "max-age=47304000; includeSubDomains",
      "x-frame-options": "DENY",
      "cache-control": "no-cache",
      "x-content-type-options": "nosniff"
    },
    "RetryAttempts": 0
  }
}
DatasetGroupArn = arn:aws:personalize:us-east-1:381491864570:dataset-group/personalize-immersion-day-news


#### Wait for Dataset Group to Have ACTIVE Status

Before we can use the dataset group, it must be active. This can take a minute or two. Execute the cell below and wait for it to show the ACTIVE status. It checks the status of the dataset group every 60 seconds, up to a maximum of 3 hours. Note if you are running this notebook as part of an immerion day workshop it is very likely the resources have already been created for you

In [74]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = workshop_dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(15)

DatasetGroup: CREATE PENDING
DatasetGroup: CREATE PENDING
DatasetGroup: ACTIVE


### Create the interactions dataset

With a schema created, you can create a dataset within the dataset group. Note that this does not load the data yet, but creates a resources needed to hold the data, to do this we reference the schema we created earlier. 

In [75]:
try:
    # Try to create the interactions dataset, this block with exectute fully 
    # if the interactions dataset does not exist yet
    
    dataset_type = 'INTERACTIONS'
    create_dataset_response = personalize.create_dataset(
        name = interactions_dataset_name,
        datasetType = dataset_type,
        datasetGroupArn = workshop_dataset_group_arn,
        schemaArn = workshop_interactions_schema_arn
    )

    workshop_interactions_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))
    print ('\nCreating the Interactions Dataset with workshop_interactions_dataset_arn = {}'.format(workshop_interactions_dataset_arn))
    
except personalize.exceptions.ResourceAlreadyExistsException:
    # if the interactions dataset already exists, get the unique identifier workshop_interactions_dataset_arn 
    # from the existing resource 
    workshop_interactions_dataset_arn =  'arn:aws:personalize:'+region+':'+account_id+':dataset/'+workshop_dataset_group_name+'/INTERACTIONS'
    print('The Interactions Dataset {} already exists.'.format(workshop_interactions_dataset_arn))
    print ('\nWe will be using the existing Interactions Dataset with workshop_interactions_dataset_arn = {}'.format(workshop_interactions_dataset_arn))
        

{
  "datasetArn": "arn:aws:personalize:us-east-1:381491864570:dataset/personalize-immersion-day-news/INTERACTIONS",
  "ResponseMetadata": {
    "RequestId": "dcec8b68-5d90-4abf-a17c-817303750d34",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Thu, 11 Apr 2024 22:01:58 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "111",
      "connection": "keep-alive",
      "x-amzn-requestid": "dcec8b68-5d90-4abf-a17c-817303750d34",
      "strict-transport-security": "max-age=47304000; includeSubDomains",
      "x-frame-options": "DENY",
      "cache-control": "no-cache",
      "x-content-type-options": "nosniff"
    },
    "RetryAttempts": 0
  }
}

Creating the Interactions Dataset with workshop_interactions_dataset_arn = arn:aws:personalize:us-east-1:381491864570:dataset/personalize-immersion-day-news/INTERACTIONS


### Create Items Dataset
With a schema created, you can create a dataset within the dataset group. Note that this does not load the data yet, but creates a schema of what the data looks like. 

In [76]:
try:
    # Try to create the items dataset, this block with execute fully if the items dataset does not exist yet
    
    dataset_type = "ITEMS"
    create_dataset_response = personalize.create_dataset(
        name = items_dataset_name,
        datasetType = dataset_type,
        datasetGroupArn = workshop_dataset_group_arn,
        schemaArn = workshop_items_schema_arn
    )

    workshop_items_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))

    print ('\nCreating the Items Dataset with workshop_items_dataset_arn = {}'.format(workshop_items_dataset_arn))
    
except personalize.exceptions.ResourceAlreadyExistsException:
    # if the items dataset already exists, get the unique identifier workshop_items_dataset_arn 
    # from the existing resource 
    
    workshop_items_dataset_arn =  'arn:aws:personalize:'+region+':'+account_id+':dataset/'+workshop_dataset_group_name+'/ITEMS'
    print('The Items Dataset {} already exists.'.format(workshop_items_dataset_arn))
    print ('\nWe will be using the existing Items Dataset with workshop_items_dataset_arn = {}'.format(workshop_items_dataset_arn))   

{
  "datasetArn": "arn:aws:personalize:us-east-1:381491864570:dataset/personalize-immersion-day-news/ITEMS",
  "ResponseMetadata": {
    "RequestId": "0dccfa03-a8f6-4d19-952b-d383f3837173",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Thu, 11 Apr 2024 22:01:58 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "104",
      "connection": "keep-alive",
      "x-amzn-requestid": "0dccfa03-a8f6-4d19-952b-d383f3837173",
      "strict-transport-security": "max-age=47304000; includeSubDomains",
      "x-frame-options": "DENY",
      "cache-control": "no-cache",
      "x-content-type-options": "nosniff"
    },
    "RetryAttempts": 0
  }
}

Creating the Items Dataset with workshop_items_dataset_arn = arn:aws:personalize:us-east-1:381491864570:dataset/personalize-immersion-day-news/ITEMS


Let's wait until all the datasets have been created.

In [77]:
%%time

max_time = time.time() + 6*60*60 # 6 hours
while time.time() < max_time:
    describe_dataset_response = personalize.describe_dataset(
        datasetArn = workshop_interactions_dataset_arn
    )
    status_interaction_dataset =  describe_dataset_response["dataset"]['status']
    print("Interactions Dataset: {}".format(status_interaction_dataset))
    
    if status_interaction_dataset == "ACTIVE":
        print("Build succeeded for {}".format(workshop_interactions_dataset_arn))
        
    elif status_interaction_dataset == "CREATE FAILED":
        print("Build failed for {}".format(workshop_interactions_dataset_arn))
        break
        
    if not status_interaction_dataset == "ACTIVE":
        print("The interaction dataset creation is still in progress")
    else:
        print("The interaction dataset  is ACTIVE")
        

    describe_dataset_response = personalize.describe_dataset(
        datasetArn = workshop_items_dataset_arn
    )
    status_item_dataset =  describe_dataset_response["dataset"]['status']
    print("Items Dataset: {}".format(status_item_dataset))
    
    if status_item_dataset == "ACTIVE":
        print("Build succeeded for {}".format(workshop_items_dataset_arn))
        
    elif status_item_dataset == "CREATE FAILED":
        print("Build failed for {}".format(workshop_items_dataset_arn))
        break
        
    if not status_item_dataset == "ACTIVE":
        print("The item dataset creation is still in progress")
    else:
        print("The item dataset  is ACTIVE")
    
    if status_interaction_dataset == "ACTIVE" and status_item_dataset == "ACTIVE":
        break
        
    time.sleep(30)

Interactions Dataset: CREATE PENDING
The interaction dataset creation is still in progress
Items Dataset: CREATE PENDING
The item dataset creation is still in progress
Interactions Dataset: ACTIVE
Build succeeded for arn:aws:personalize:us-east-1:381491864570:dataset/personalize-immersion-day-news/INTERACTIONS
The interaction dataset  is ACTIVE
Items Dataset: ACTIVE
Build succeeded for arn:aws:personalize:us-east-1:381491864570:dataset/personalize-immersion-day-news/ITEMS
The item dataset  is ACTIVE
CPU times: user 18 ms, sys: 1.04 ms, total: 19 ms
Wall time: 30.1 s


## Import the Interactions <a class="anchor" id="import_interactions"></a>
[Back to top](#top)

Earlier you created the dataset group and dataset to house your information, now you will execute an import job that will load the item data from the S3 bucket into the Amazon Personalize dataset. 

In [78]:
# Check if the import job already exists

# List the import jobs
interactions_dataset_import_jobs = personalize.list_dataset_import_jobs(
    datasetArn=workshop_interactions_dataset_arn,
    maxResults=100
)['datasetImportJobs']

#check if there is an existing job with the prefix
job_exists = False  
job_arn = None

for job in interactions_dataset_import_jobs:
    if (interactions_import_job_name in job['jobName']):
        job_exists = True
        job_arn = job['datasetImportJobArn']
    
if (job_exists):
    workshop_interactions_dataset_import_job_arn = job_arn
    print('The Interactions Import Job {} already exists.'.format(workshop_interactions_dataset_import_job_arn))
    print ('\nWe will be using the existing Interactions Import Job with workshop_interactions_dataset_import_job_arn = {}'.format(workshop_interactions_dataset_import_job_arn))
        
else:
    # If there is no import job with the prefix, create it:   
    create_dataset_import_job_response = personalize.create_dataset_import_job(
        jobName = interactions_import_job_name,
        datasetArn = workshop_interactions_dataset_arn,
        dataSource = {
            "dataLocation": "s3://{}/{}".format(bucket_name, interactions_file_name)
        },
        roleArn = role_arn
    )
    workshop_interactions_dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
    print(json.dumps(create_dataset_import_job_response, indent=2))
    
    print ('\nImporting the Interactions Data with workshop_interactions_dataset_import_job_arn = {}'.format(workshop_interactions_dataset_import_job_arn))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:381491864570:dataset-import-job/dataset_import_interaction",
  "ResponseMetadata": {
    "RequestId": "c2f75300-9201-48fc-af5d-f13b8b2fdda3",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Thu, 11 Apr 2024 22:02:28 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "114",
      "connection": "keep-alive",
      "x-amzn-requestid": "c2f75300-9201-48fc-af5d-f13b8b2fdda3",
      "strict-transport-security": "max-age=47304000; includeSubDomains",
      "x-frame-options": "DENY",
      "cache-control": "no-cache",
      "x-content-type-options": "nosniff"
    },
    "RetryAttempts": 0
  }
}

Importing the Interactions Data with workshop_interactions_dataset_import_job_arn = arn:aws:personalize:us-east-1:381491864570:dataset-import-job/dataset_import_interaction


## Import the Item Metadata <a class="anchor" id="import_items"></a>
[Back to top](#top)

Earlier you created the dataset group and dataset to house your information, now you will execute an import job that will load the item data from the S3 bucket into the Amazon Personalize dataset. 

In [79]:
# Checking if the import job already exists

# List the import jobs
items_dataset_import_jobs = personalize.list_dataset_import_jobs(
    datasetArn=workshop_items_dataset_arn,
    maxResults=100
)['datasetImportJobs']

job_exists = False
job_arn = None

#check if there is an existing job with the prefix
for job in items_dataset_import_jobs:
    if (items_import_job_name in job['jobName']):
        job_exists = True
        job_arn = job['datasetImportJobArn']
    
if (job_exists):
    workshop_items_dataset_import_job_arn =  job_arn
    print('The Items Import Job {} already exists.'.format(workshop_items_dataset_import_job_arn))
    print ('\nWe will be using the existing Items Import Job with workshop_items_dataset_import_job_arn = {}'.format(workshop_items_dataset_import_job_arn))
        
else:
    # If there is no import job with the prefix, create it:    
    create_dataset_import_job_response = personalize.create_dataset_import_job(
        jobName = items_import_job_name,
        datasetArn = workshop_items_dataset_arn,
        dataSource = {
            "dataLocation": "s3://{}/{}".format(bucket_name, items_file_name)
        },
        roleArn = role_arn
    )

    workshop_items_dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
    print(json.dumps(create_dataset_import_job_response, indent=2))
    print ('\nImporting the Items Data with workshop_items_dataset_import_job_arn = {}'.format(workshop_items_dataset_import_job_arn))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:381491864570:dataset-import-job/dataset_import_item",
  "ResponseMetadata": {
    "RequestId": "af4a0b41-530c-4c13-aae5-b0099d62b3a1",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Thu, 11 Apr 2024 22:02:28 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "107",
      "connection": "keep-alive",
      "x-amzn-requestid": "af4a0b41-530c-4c13-aae5-b0099d62b3a1",
      "strict-transport-security": "max-age=47304000; includeSubDomains",
      "x-frame-options": "DENY",
      "cache-control": "no-cache",
      "x-content-type-options": "nosniff"
    },
    "RetryAttempts": 0
  }
}

Importing the Items Data with workshop_items_dataset_import_job_arn = arn:aws:personalize:us-east-1:381491864570:dataset-import-job/dataset_import_item


### Wait for Import Jobs to Complete

Before we can use the dataset, the import job must be active. Execute the cell below and wait for it to show the ACTIVE status. It checks the status of the import job every minute, up to a maximum of 6 hours.

It will take 10-15 minutes for the import jobs to complete. While you're waiting you can learn more about Datasets and Schemas in [the documentation](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html).

We will wait for all three jobs to finish.

In [80]:
max_time = time.time() + 6*60*60 # 10 hours
while time.time() < max_time:

    # Interactions dataset import
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = workshop_interactions_dataset_import_job_arn
    )
    status_interactions_import = describe_dataset_import_job_response["datasetImportJob"]['status']
    
    if status_interactions_import == "ACTIVE":
        print("Build succeeded for {}".format(workshop_interactions_dataset_import_job_arn))
        
    elif status_interactions_import == "CREATE FAILED":
        print("Build failed for {}".format(workshop_interactions_dataset_import_job_arn))
        break
        
    if not status_interactions_import == "ACTIVE":
        print("The interactions dataset import is still in progress")
    else:
        print("The interactions dataset import is ACTIVE")

    # Items dataset import
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = workshop_items_dataset_import_job_arn
    )
    status_items_import = describe_dataset_import_job_response["datasetImportJob"]['status']
    
    if status_items_import == "ACTIVE":
        print("Build succeeded for {}".format(workshop_items_dataset_import_job_arn))
        
    elif status_items_import == "CREATE FAILED":
        print("Build failed for {}".format(workshop_items_dataset_import_job_arn))
        break
        
    if not status_items_import == "ACTIVE":
        print("The items dataset import is still in progress")
    else:
        print("The items dataset import is ACTIVE")

    if status_interactions_import == "ACTIVE" and status_items_import == 'ACTIVE':
        break

    print()
    time.sleep(30)

The interactions dataset import is still in progress
The items dataset import is still in progress

The interactions dataset import is still in progress
The items dataset import is still in progress

The interactions dataset import is still in progress
The items dataset import is still in progress

The interactions dataset import is still in progress
The items dataset import is still in progress

The interactions dataset import is still in progress
The items dataset import is still in progress

The interactions dataset import is still in progress
The items dataset import is still in progress

The interactions dataset import is still in progress
The items dataset import is still in progress

Build succeeded for arn:aws:personalize:us-east-1:381491864570:dataset-import-job/dataset_import_interaction
The interactions dataset import is ACTIVE
The items dataset import is still in progress

Build succeeded for arn:aws:personalize:us-east-1:381491864570:dataset-import-job/dataset_import_inter

# Filters <a class="anchor" id="filters"></a>

## Create Filters <a class="anchor" id="create-filters"></a>
[Back to top](#top)

Personalize can utilize either [static or dynamic filters](https://docs.aws.amazon.com/personalize/latest/dg/filter.html). Static filters are where the filter properties are built into the filter itself, which makes invocation simpler, but gives less flexibility. An example of this would be an accessories category filter, which invokes the get_recommendations_response api with the specific filter of CATEGORY_L1 = accessories. In order to create a recommendation for each filter, that would require 10+ filters. Personalize also supports dynamic filters, where the values can be passed at runtime, allowing for a single filter of CATEGORY_L1, where the actual category is passed at runtime. 

Filters can be created for fields of both Items and Events. 

A few common use cases for dynamic filters in news are:

* Categorical filters based on Item Metadata - Often your item metadata will have information about the item. Filtering on these can provide recommendations within that data. In particular this is quite useful for recommending specifc news articles for specific sections of a website such as politics, sports, tech and lifestyle

* Range based filters based on Item Metadata - Personalize supports range operations in both static and dynamic filters. Filtering based on a range can be used to create recommendations specific to a specific time range. This can be useful for highlighting articles for specific breakingnews sections.

* Events - you may want to filter out certain events and provide results based on those events, such as removing articles a user has already read from their recommendations

#### Create Genre Filter

Below we are going to create a dynamic fitler for specific genres of articles - our dataset is limited to tech articles so our genres are all specific iterations of that in this case. We will also exclude previously read articles.

In [81]:
try:
    create_genre_filter_response = personalize.create_filter(
        name = 'breakingnews-genre-filter-2',
        datasetGroupArn = workshop_dataset_group_arn,
        filterExpression = "INCLUDE ItemID WHERE Items.ARTICLE_GENRE IN ($GENRELIST) | EXCLUDE ItemID WHERE Interactions.EVENT_TYPE IN (\"VIEW\", \"LIKE\",\"BOOKMARK\",\"COMMENT CREATED\", \"FOLLOW\")"
    )
    
    genre_filter_arn = create_genre_filter_response['filterArn']
    print('Creating the Personalize filter with category_filter_arn {}.'.format(genre_filter_arn))
    

except personalize.exceptions.ResourceAlreadyExistsException as e:
    print('The Personalize filter {} already exists.'.format(genre_filter_arn))

Creating the Personalize filter with category_filter_arn arn:aws:personalize:us-east-1:381491864570:filter/breakingnews-genre-filter-2.


#### Note the code below will not work unless you are running this as part of an AWS workshop where all resources have been precreated

In [82]:
sample_user = str(-8845298781299428018)

In [83]:
workshop_userpersonalization_campaign_arn =  'arn:aws:personalize:'+region+':'+account_id+':campaign/'+workshop_userpersonalization_campaign_name

In [84]:
get_recommendations_response = personalize_runtime.get_recommendations(
    campaignArn = workshop_userpersonalization_campaign_arn,
    userId = sample_user,
    numResults = 5,
    filterArn = genre_filter_arn,
    filterValues = {"GENRELIST": "\"tech\""}
)

InvalidInputException: An error occurred (InvalidInputException) when calling the GetRecommendations operation: arn:aws:personalize:us-east-1:381491864570:campaign/immersion_day_user_personalization_news_campaign does not exist or is not active yet.

In [None]:
print(get_recommendations_response)

We will look at some of these recommendations in more detail in `03_Training_Layer`

With all imports now complete you can start training your personalization models. Run the cell below before moving on to store a few values for usage in the next notebooks. After completing that cell open notebook `02_Training_Layer.ipynb` to continue.

## Storing Useful Variables <a class="anchor" id="vars"></a>
[Back to top](#top)

Before exiting this notebook, run the following cells to save the version ARNs for use in the next notebook.

In [85]:
%store data_dir
%store interactions_file_name
%store items_file_name
%store workshop_dataset_group_arn
%store workshop_interactions_dataset_arn
%store workshop_items_dataset_arn
%store workshop_interactions_schema_arn
%store workshop_items_schema_arn
%store genre_filter_arn

%store workshop_rerank_solution_name
%store workshop_rerank_campaign_name

%store workshop_userpersonalization_solution_name
%store workshop_userpersonalization_campaign_name

%store region
%store account_id
%store role_name
%store role_arn

%store bucket_name

%store articles_mlfeatures
%store interaction_data

%store bucket_current_policy

Stored 'data_dir' (str)
Stored 'interactions_file_name' (str)
Stored 'items_file_name' (str)
Stored 'workshop_dataset_group_arn' (str)
Stored 'workshop_interactions_dataset_arn' (str)
Stored 'workshop_items_dataset_arn' (str)
Stored 'workshop_interactions_schema_arn' (str)
Stored 'workshop_items_schema_arn' (str)
Stored 'genre_filter_arn' (str)
Stored 'workshop_rerank_solution_name' (str)
Stored 'workshop_rerank_campaign_name' (str)
Stored 'workshop_userpersonalization_solution_name' (str)
Stored 'workshop_userpersonalization_campaign_name' (str)
Stored 'region' (str)
Stored 'account_id' (str)
Stored 'role_name' (str)
Stored 'role_arn' (str)
Stored 'bucket_name' (str)
Stored 'articles_mlfeatures' (DataFrame)
Stored 'interaction_data' (DataFrame)
Stored 'bucket_current_policy' (str)
