# Amazon Personalize: using item metadata to identify stronger similarities

The relevance of the recommendations you deliver with Amazon Personalize depends on the data available when the recommendations are generated. Amazon Personalize uses your users’ historical interactions, the attributes of your items, and your users’ metadata to learn which items are most relevant for each user. The primary data required by Amazon Personalize is user-item interactions. The interactions users have with items in your catalog, such as clicking on a product, reading an article, watching a video, or purchasing a product, are an important signal of what they have found relevant in the past. Including item and user attributes, also known as metadata, can enhance the relevance of recommendations, especially for new items that are similar to what your users have found relevant. However, structured metadata such as an item’s category, style, or genre may not always be readily available, or may not capture all the information in your narrative descriptions.

Amazon Personalize now allows you to add item metadata such as product descriptions or the genre of a movie. Amazon Personalize hosts, manages, and automatically processes your item metadata attributes and uses them to improve the performance of your Amazon Personalize related items solutions.

This notebook will demonstrate how the new recipe `aws-similar-items` plus item metadata improves the relevance of recommendations compared to the `SIMS` recipe.

Amazon Reviews data from the Amazon Prime Pantry category are used for the interactions and items datasets.

When considering including text in your items dataset, keep the following best practices in mind.

One dataset group will be created that includes the interactions data plus item metadata. This way we can train two separate models and compare their offline and online results.

In [1]:
import pandas as pd
import json
import numpy as np
from datetime import datetime
import boto3
import time
from time import sleep
from lxml import html

## Load and inspect datasets

We'll start by loading the Prime Pantry reviews dataset. You will need to fill out the form for access to the data files:

http://deepyeti.ucsd.edu/jianmo/amazon/index.html

Citation:
> Justifying recommendations using distantly-labeled reviews and fine-grained aspects  
> Jianmo Ni, Jiacheng Li, Julian McAuley  
> Empirical Methods in Natural Language Processing (EMNLP), 2019 [pdf](http://cseweb.ucsd.edu/~jmcauley/pdfs/emnlp19a.pdf)

In [2]:
data_dir = 'raw_data'
!mkdir $data_dir

!cd $data_dir && \
    wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Prime_Pantry.json.gz && \
    wget http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_Prime_Pantry.json.gz

--2021-10-07 15:14:39--  http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Prime_Pantry.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 45435146 (43M) [application/octet-stream]
Saving to: ‘Prime_Pantry.json.gz’


2021-10-07 15:14:42 (17.6 MB/s) - ‘Prime_Pantry.json.gz’ saved [45435146/45435146]

--2021-10-07 15:14:42--  http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_Prime_Pantry.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5281662 (5.0M) [application/octet-stream]
Saving to: ‘meta_Prime_Pantry.json.gz’


2021-10-07 15:14:43 (7.24 MB/s) - ‘meta_Prime_Pantry.json.gz’ saved [5281662/5281662]



### Load and inspect reviews data

We'll start by loading the reviews dataset for the Prime Pantry products and running some commands to see what we have to work with.

In [3]:
pantry_df = pd.read_json(data_dir + '/Prime_Pantry.json.gz', lines=True, compression='infer')
pantry_df.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,image,style
0,5,True,"12 14, 2014",A1NKJW0TNRVS7O,B0000DIWNZ,Tamara M.,Good clinging,Clings well,1418515200,,,
1,4,True,"11 20, 2014",A2L6X37E8TFTCC,B0000DIWNZ,Amazon Customer,Fantastic buy and a good plastic wrap. Even t...,Saran could use more Plus to Cling better.,1416441600,,,
2,4,True,"10 11, 2014",A2WPR4W6V48121,B0000DIWNZ,noname,ok,Four Stars,1412985600,,,
3,3,False,"09 1, 2014",A27EE7X7L29UMU,B0000DIWNZ,ZapNZs,Saran Cling Plus is kind of like most of the C...,"The wrap is fantastic, but the dispensing, cut...",1409529600,4.0,,
4,4,True,"08 10, 2014",A1OWT4YZGB5GV9,B0000DIWNZ,Amy Rogers,This is my go to plastic wrap so there isn't m...,has been doing it's job for years,1407628800,,,


In [4]:
pantry_df.shape

(471614, 12)

What can we learn from this output? There are over 471K reviews and 12 columns of data. The `asin` column is our unique item identifier, `reviewerID` is our unique user identifier, `unixReviewTime` is our timestamp for the review, and `overall` indicates the positivity of the review on a scale of 1-5. We will use this file as the basis for our interactions dataset for Personalize. 

### Build and save interactions dataset

Let's start building our interactions dataset by narrowing down the rows we want to include. The first step is to isolate only the positive reviews. For this we will assume that any review with an overall rating of 4 or higher is positive. Any rating of 3 or below is treated as a mediocre or negative review.

In [5]:
positive_reviews_df = pantry_df[pantry_df['overall'] > 3]
positive_reviews_df.shape

(387692, 12)

We're down to 387K positive reviews. Still plenty for training a model in Personalize.
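
To make "plenty" concrete, here is a minimal sketch of a pre-flight check. It assumes the minimums that the Personalize documentation lists at the time of writing (at least 1,000 interactions, and at least 25 unique users with two or more interactions each); please verify those thresholds against the current service documentation. The function name `meets_minimums` and the toy data are hypothetical and only serve as an illustration.

```python
import pandas as pd

def meets_minimums(df, user_col="reviewerID", min_rows=1000, min_users=25, min_per_user=2):
    """Return True if df satisfies the assumed minimum interaction requirements."""
    per_user = df[user_col].value_counts()
    # Count the users that have at least min_per_user interactions.
    active_users = int((per_user >= min_per_user).sum())
    return len(df) >= min_rows and active_users >= min_users

# Toy data: 30 users with 40 interactions each (1,200 rows in total).
toy_df = pd.DataFrame({
    "reviewerID": [f"user{i % 30}" for i in range(1200)],
    "asin": [f"item{i % 7}" for i in range(1200)],
})
print(meets_minimums(toy_df))
```

In the notebook, the same check could be run as `meets_minimums(positive_reviews_df)` before the columns are renamed.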

Next let's narrow down the dataset to just the columns we need and add an `EVENT_TYPE` column to indicate the type of events we're capturing. Adding an `EVENT_TYPE` column now will make it easier to explore testing real-time events later if you choose to do so (since `eventType` is a required field for the [PutEvents](https://docs.aws.amazon.com/personalize/latest/dg/API_UBS_PutEvents.html) API).

In [6]:
# Copy the slice so that adding a column below does not raise a SettingWithCopyWarning.
positive_reviews_df = positive_reviews_df[['reviewerID', 'asin', 'unixReviewTime', 'overall']].copy()
positive_reviews_df['EVENT_TYPE']='reviewed'

positive_reviews_df.head()

Unnamed: 0,reviewerID,asin,unixReviewTime,overall,EVENT_TYPE
0,A1NKJW0TNRVS7O,B0000DIWNZ,1418515200,5,reviewed
1,A2L6X37E8TFTCC,B0000DIWNZ,1416441600,4,reviewed
2,A2WPR4W6V48121,B0000DIWNZ,1412985600,4,reviewed
4,A1OWT4YZGB5GV9,B0000DIWNZ,1407628800,4,reviewed
5,A1GN2ADKF1IE7K,B0000DIWNZ,1405296000,5,reviewed
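
To illustrate why the `EVENT_TYPE` column is convenient for real-time events, the sketch below maps one row of this dataframe to an entry of the `eventList` parameter of PutEvents. The helper name `to_put_events_entry` is hypothetical, and the sample row is copied from the output above. Actually sending the event would also require an event tracker, a user ID, and a session ID, none of which are created in this notebook.

```python
from datetime import datetime, timezone

def to_put_events_entry(row):
    """Map one interactions row to an entry for the PutEvents eventList parameter."""
    return {
        "eventType": row["EVENT_TYPE"],  # required by PutEvents
        "itemId": row["asin"],
        "eventValue": float(row["overall"]),
        # sentAt is a timestamp; build it from the Unix time in seconds.
        "sentAt": datetime.fromtimestamp(int(row["unixReviewTime"]), tz=timezone.utc),
    }

row = {"reviewerID": "A1NKJW0TNRVS7O", "asin": "B0000DIWNZ",
       "unixReviewTime": 1418515200, "overall": 5, "EVENT_TYPE": "reviewed"}
event = to_put_events_entry(row)
print(event)
```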


One last check is to sanity-check a `unixReviewTime` column value. Since Personalize builds sequence models based on the date and time of each interaction, it's important that the timestamp of each interaction is represented in the expected format (Unix epoch time in seconds) so that it is interpreted correctly.

Let's pick a value for the `unixReviewTime` column and parse it into a human-readable date so we can verify that it's reasonable.

In [7]:
time_stamp = positive_reviews_df.iloc[50]['unixReviewTime']
print(time_stamp)
print(datetime.utcfromtimestamp(time_stamp).strftime('%Y-%m-%d %H:%M:%S'))

1321488000
2011-11-17 00:00:00
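
The spot check above looks at a single row. As a hedged extension, the sketch below checks a whole column: it assumes that valid values are Unix timestamps in seconds between an arbitrary lower bound (1995 here) and the current time. The function name `timestamps_in_range` and the bounds are illustrative assumptions. A column accidentally stored in milliseconds would fail the check.

```python
from datetime import datetime, timezone
import pandas as pd

def timestamps_in_range(ts, earliest=datetime(1995, 1, 1, tzinfo=timezone.utc), latest=None):
    """Return True if every Unix timestamp (in seconds) lies between earliest and latest."""
    latest = latest or datetime.now(timezone.utc)
    values = pd.Series(ts)
    return bool(values.between(earliest.timestamp(), latest.timestamp()).all())

seconds = pd.Series([1418515200, 1321488000])
print(timestamps_in_range(seconds))         # timestamps in seconds
print(timestamps_in_range(seconds * 1000))  # the same values in milliseconds
```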


The timestamp value looks good. Let's get some final summary information for our dataset.

In [8]:
positive_reviews_df.describe(include='all')

Unnamed: 0,reviewerID,asin,unixReviewTime,overall,EVENT_TYPE
count,387692,387692,387692.0,387692.0,387692
unique,202254,10584,,,1
top,A35Q0RBM3YNQNF,B00XA9DADC,,,reviewed
freq,176,5288,,,387692
mean,,,1468847000.0,4.847227,
std,,,43149750.0,0.359769,
min,,,1073693000.0,4.0,
25%,,,1447200000.0,5.0,
50%,,,1474718000.0,5.0,
75%,,,1498435000.0,5.0,


We have 387K reviews for 202K distinct reviewers/users across 10K unique products. This is the basis of our interactions dataset.

Before we can use this as our interactions dataset, though, we need to rename the columns to match those expected by Personalize.

In [9]:
positive_reviews_df.rename(columns = {'reviewerID':'USER_ID', 'asin':'ITEM_ID', 
                              'unixReviewTime':'TIMESTAMP', 'overall': 'EVENT_VALUE'}, inplace = True)
positive_reviews_df

Unnamed: 0,USER_ID,ITEM_ID,TIMESTAMP,EVENT_VALUE,EVENT_TYPE
0,A1NKJW0TNRVS7O,B0000DIWNZ,1418515200,5,reviewed
1,A2L6X37E8TFTCC,B0000DIWNZ,1416441600,4,reviewed
2,A2WPR4W6V48121,B0000DIWNZ,1412985600,4,reviewed
4,A1OWT4YZGB5GV9,B0000DIWNZ,1407628800,4,reviewed
5,A1GN2ADKF1IE7K,B0000DIWNZ,1405296000,5,reviewed
...,...,...,...,...,...
471609,A19GSVHXVT5NNF,B01HI8JVI8,1494892800,5,reviewed
471610,ABSCTKLX9F9IU,B01HI8JVI8,1493769600,5,reviewed
471611,A2R33RCWKDHZ3L,B01HI8JVI8,1492646400,5,reviewed
471612,A2INGHYEXZDHMC,B01HI8JVI8,1492560000,5,reviewed


Finally, let's save our positive reviews dataframe as a CSV. We'll upload this CSV to Personalize later in this notebook.

In [10]:
interactions_filename = "interactions.csv"
positive_reviews_df.to_csv(interactions_filename, index=False, float_format='%.0f')

### Load and inspect item metadata

Now that we have the interactions dataset established, let's turn to the items dataset. This is where we will find the unstructured text value that we will include in the model.

Like the reviews dataset, the Prime Pantry item metadata file is also represented in JSON. Due to the nested nature of this file, this will present some challenges in getting our data formatted the way we need it.

Let's start by loading the metadata file into a dataframe and taking a look at the data.

In [11]:
pantry_meta_df = pd.read_json('raw_data/meta_Prime_Pantry.json.gz', lines=True, compression='infer')
pantry_meta_df

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,details,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
0,[],,[Sink your sweet tooth into MILK DUDS Candya d...,,"HERSHEY'S Milk Duds Candy, 5 Ounce(Halloween C...","[B019KE37WO, B007NQSWEU]",,Milk Duds,[],[],[],"{'ASIN: ': 'B00005BPJO', 'Item model number:':...","<img src=""https://m.media-amazon.com/images/G/...",,NaT,$5.00,B00005BPJO,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
1,[],,[Sink your sweet tooth into MILK DUDS Candya d...,,"HERSHEY'S Milk Duds Candy, 5 Ounce(Halloween C...","[B019KE37WO, B007NQSWEU]",,Milk Duds,[],[],[],"{'ASIN: ': 'B00005BPJO', 'Item model number:':...","<img src=""https://m.media-amazon.com/images/G/...",,NaT,$5.00,B00005BPJO,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
2,[],,[A perfect Lentil soup starts with Goya Lentil...,,"Goya Dry Lentils, 16 oz","[B003SI144W, B000VDRKEK]",,Goya,[],[],"[B074MFVZG7, B079PTH69L, B000VDRKEK, B074M9T81...",{'ASIN: ': 'B0000DIF38'},"<img src=""https://images-na.ssl-images-amazon....",,NaT,,B0000DIF38,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
3,[],,[Saran Premium Wrap is an extra tough yet easy...,,"Saran Premium Plastic Wrap, 100 Sq Ft","[B01MY5FHT6, B000PYF8VM, B000SRMDFA, B07CX6LN8...",,Saran,[],[],"[B077QLSLRQ, B00JPKW1RQ, B000FE2IK6, B00XUJHJ9...",{'Domestic Shipping: ': 'This item can only be...,"<img src=""https://images-na.ssl-images-amazon....",,NaT,,B0000DIWNI,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
4,[],,[200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Eas...,,"Saran Cling Plus Plastic Wrap, 200 Sq Ft",[],,Saran,[],[],[B0014CZ0TE],{'Domestic Shipping: ': 'This item can only be...,"<img src=""https://images-na.ssl-images-amazon....",,NaT,,B0000DIWNZ,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10808,[],,[These bars are where our journey started and ...,,"KIND Bars, Caramel Almond &amp; Sea Salt, Glut...",[],,KIND,[],"26,259 in Grocery & Gourmet Food (","[B00JQQAN60, B00JQQAWSY, B0111K7V54, B0111K8L9...","{'ASIN: ': 'B01HI76312', 'Item model number:':...","<img src=""https://images-na.ssl-images-amazon....",,NaT,$3.98,B01HI76312,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
10809,[],,[These bars are where our journey started and ...,,"KIND Bars, Maple Glazed Pecan &amp; Sea Salt, ...",[],,KIND,[],"16,822 in Grocery & Gourmet Food (","[B0111K97JC, B00JQQAN60, B0111K8L9Y, B01HI7631...",{'ASIN: ': 'B01HI76790'},"<img src=""https://images-na.ssl-images-amazon....",,NaT,$5.81,B01HI76790,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
10810,[],,[These bars are where our journey started and ...,,"KIND Bars, Dark Chocolate Almond &amp; Coconut...",[],,KIND,[],"107,057 in Grocery & Gourmet Food (","[B0111K7V54, B01HI76312, B00JQQAL0S, B0111K97J...",{'ASIN: ': 'B01HI76SA8'},"<img src=""https://images-na.ssl-images-amazon....",,NaT,$4.98,B01HI76SA8,[],[]
10811,[],,[These bars are where our journey started and ...,,"KIND Bars, Honey Roasted Nuts &amp; Sea Salt, ...",[],,KIND,[],"24,648 in Grocery & Gourmet Food (","[B00JQQAN60, B0111K7V54, B01HI76312, B0111K97J...",{'ASIN: ': 'B01HI76XS0'},"<img src=""https://images-na.ssl-images-amazon....",,NaT,$5.81,B01HI76XS0,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...


In [12]:
pantry_meta_df.describe()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,details,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
count,10813,10813.0,10813,10813.0,10813,10813,10813.0,10813,10813,10813,10813,10813,10813,10813.0,0.0,10813.0,10813,10813,10813
unique,1,1.0,9409,1.0,10782,3957,1.0,1960,763,4828,5940,10786,4,1.0,0.0,1482.0,10812,8940,8940
top,[],,[],,"Centrum Adult Flavor Burst (120 Count, Mixed F...",[],,L'Oreal Paris,[],[],[],{},"<img src=""https://images-na.ssl-images-amazon....",,,,B00005BPJO,[],[]
freq,10813,10813.0,98,10813.0,2,6754,10813.0,171,9777,5937,4835,24,10621,10813.0,,4063.0,2,1781,1781


So what can we learn from this information? First, there are over 10K products represented in the metadata file. Most of the columns will be of little value to us for Personalize since they aren't relevant as features (image URLs, `details`, `also_view`, `also_buy`, etc.) or are mostly blank/sparse (`category`, `fit`, `tech1`, etc.). The `asin` column is our unique identifier for each item (although there looks to be one duplicate), and `brand` and `price` look like they may be useful. The `description` column is what we will use for unstructured text.

However, we have to do some cleanup and reformatting of the fields we want to use in our items dataset. For example, the `price` field is a formatted currency value (string) rather than a number, and the `description` field was loaded as an array of strings because of how the values were represented in and parsed from the original JSON file. Lastly, the `description` values also contain HTML markup that needs to be stripped.

Let's start by creating a dataframe with just the columns we need for the items dataset.

In [13]:
items_df = pantry_meta_df.copy()
items_df = items_df[['asin', 'brand', 'price', 'description']]
items_df.head(10)

Unnamed: 0,asin,brand,price,description
0,B00005BPJO,Milk Duds,$5.00,[Sink your sweet tooth into MILK DUDS Candya d...
1,B00005BPJO,Milk Duds,$5.00,[Sink your sweet tooth into MILK DUDS Candya d...
2,B0000DIF38,Goya,,[A perfect Lentil soup starts with Goya Lentil...
3,B0000DIWNI,Saran,,[Saran Premium Wrap is an extra tough yet easy...
4,B0000DIWNZ,Saran,,[200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Eas...
5,B0000GH6UG,Ibarra,,"[Ibarra Chocolate, 19 Oz, , ]"
6,B0000KC2BK,Knorr,$3.09,[Knorr Granulated Chicken Flavor Bouillon is a...
7,B0001E1IN8,Castillo,,[Red chili habanero sauces. They are present t...
8,B00032E8XK,Chicken of the Sea,$1.48,[Chicken of the Sea Solid White Albacore Tuna ...
9,B0005XMTHE,Smucker's,$2.29,"[Helps build muscles with bcaa's amino acids, ..."


Next let's drop duplicate rows based on the `asin` column value. There should only be one duplicate based on the `describe()` output above.

In [14]:
items_df = items_df.drop_duplicates(subset=['asin'], keep='last')
items_df.shape

(10812, 4)

Next let's focus on reformatting and cleaning up the `description` column values. As you can see above, the `description` is currently represented as an array of strings (because that's how it is represented in the JSON file). We need to flatten this array into a single string and strip all HTML markup from each fragment.

We'll start by creating two utility functions that will be used to clean the `description` (and later the `title` column in the original dataset when we want to display titles for recommended products).

In [15]:
# Strips and cleans a value of HTML markup and whitespace.
def clean_markup(value):
    s = str(value).strip()
    if s != '':
        s = str(html.fromstring(s).text_content())
        s = ' '.join(s.split())
                
    return s.strip()

# Cleans and reformats the description column value for a dataframe row.
def clean_and_reformat_description(row):
    s = ''
    for el in row['description']:
        el = clean_markup(el)
        if el != '':
            s += ' ' + el
                
    return s.strip()
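
The `clean_markup` function above relies on `lxml`. If that package is not available, a comparable result can be sketched with the standard library's `html.parser` module. This is a swapped-in alternative, not the method used in this notebook; the names `_TextExtractor` and `strip_markup` are hypothetical, and the parser's handling of malformed markup may differ from `lxml`.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_markup(value):
    """Return value without HTML tags and with runs of whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(str(value))
    parser.close()
    return " ".join("".join(parser.parts).split())

print(strip_markup("<p>Saran <b>Premium</b> Wrap &amp; more</p>"))
```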

In [16]:
items_df['description'] = items_df.apply(clean_and_reformat_description, axis=1)
items_df

Unnamed: 0,asin,brand,price,description
1,B00005BPJO,Milk Duds,$5.00,Sink your sweet tooth into MILK DUDS Candya de...
2,B0000DIF38,Goya,,A perfect Lentil soup starts with Goya Lentils...
3,B0000DIWNI,Saran,,Saran Premium Wrap is an extra tough yet easy ...
4,B0000DIWNZ,Saran,,200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Easy...
5,B0000GH6UG,Ibarra,,"Ibarra Chocolate, 19 Oz"
...,...,...,...,...
10808,B01HI76312,KIND,$3.98,These bars are where our journey started and i...
10809,B01HI76790,KIND,$5.81,These bars are where our journey started and i...
10810,B01HI76SA8,KIND,$4.98,These bars are where our journey started and i...
10811,B01HI76XS0,KIND,$5.81,These bars are where our journey started and i...


Next let's take a look at the `price` column and change its type from a string to a float.

In [17]:
items_df['price'].value_counts()

          4063
$2.99      114
$3.99      113
$4.99      103
$5.99       87
          ... 
$10.40       1
$43.53       1
$20.42       1
$17.74       1
$62.99       1
Name: price, Length: 1482, dtype: int64

The following cell will convert empty/non-numeric prices to `np.nan`; all other values will have the `$` currency symbol removed. This will allow us to coerce the type to a float.

In [18]:
def convert_price(row):
    v = str(row['price']).strip().replace('$', '')
    if v == '' or not v.lstrip('-').replace('.', '').isdigit():
        return np.nan
    return v

items_df['price'] = items_df.apply(convert_price, axis=1)
items_df

Unnamed: 0,asin,brand,price,description
1,B00005BPJO,Milk Duds,5.00,Sink your sweet tooth into MILK DUDS Candya de...
2,B0000DIF38,Goya,,A perfect Lentil soup starts with Goya Lentils...
3,B0000DIWNI,Saran,,Saran Premium Wrap is an extra tough yet easy ...
4,B0000DIWNZ,Saran,,200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Easy...
5,B0000GH6UG,Ibarra,,"Ibarra Chocolate, 19 Oz"
...,...,...,...,...
10808,B01HI76312,KIND,3.98,These bars are where our journey started and i...
10809,B01HI76790,KIND,5.81,These bars are where our journey started and i...
10810,B01HI76SA8,KIND,4.98,These bars are where our journey started and i...
10811,B01HI76XS0,KIND,5.81,These bars are where our journey started and i...


In [19]:
items_df['price'].value_counts()

2.99     114
3.99     113
4.99     103
5.99      87
2.98      76
        ... 
12.66      1
24.82      1
19.92      1
13.97      1
13.33      1
Name: price, Length: 1480, dtype: int64

In [20]:
items_df['price'] = items_df['price'].astype(float)
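
The row-wise conversion above could also be written in vectorized form. The following is a minimal sketch under the assumption that `pd.to_numeric(..., errors="coerce")` is an acceptable stand-in for the explicit digit check; it is not strictly equivalent, since `to_numeric` also accepts forms such as scientific notation. The function name `parse_prices` and the sample values are hypothetical.

```python
import pandas as pd

def parse_prices(prices):
    """Convert formatted price strings such as '$5.00' to floats; anything else becomes NaN."""
    cleaned = prices.astype(str).str.strip().str.replace("$", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

sample = pd.Series(["$5.00", "", "$3.09", "$1,299.00", "n/a"])
parsed = parse_prices(sample)
print(parsed.tolist())
```

As with the original function, values containing a thousands separator are treated as non-numeric and become `NaN`.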

Next we'll rename the columns to match the names and uppercase name format expected by Personalize.

In [21]:
items_df.rename(columns = {'asin':'ITEM_ID', 'brand':'BRAND', 
                              'price':'PRICE', 'description': 'DESCRIPTION'}, inplace = True)
items_df.head(10)

Unnamed: 0,ITEM_ID,BRAND,PRICE,DESCRIPTION
1,B00005BPJO,Milk Duds,5.0,Sink your sweet tooth into MILK DUDS Candya de...
2,B0000DIF38,Goya,,A perfect Lentil soup starts with Goya Lentils...
3,B0000DIWNI,Saran,,Saran Premium Wrap is an extra tough yet easy ...
4,B0000DIWNZ,Saran,,200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Easy...
5,B0000GH6UG,Ibarra,,"Ibarra Chocolate, 19 Oz"
6,B0000KC2BK,Knorr,3.09,Knorr Granulated Chicken Flavor Bouillon is a ...
7,B0001E1IN8,Castillo,,Red chili habanero sauces. They are present to...
8,B00032E8XK,Chicken of the Sea,1.48,Chicken of the Sea Solid White Albacore Tuna i...
9,B0005XMTHE,Smucker's,2.29,"Helps build muscles with bcaa's amino acids, i..."
10,B0005XNE6E,Snapple,1.99,"At Snapple, we believe lifes a peach. Weve bee..."


We'll be creating one items CSV. We'll use this file to train our Personalize models so we can compare the offline metrics and do some online inspection of recommendations.

In [22]:
items_filename = "items-metadata.csv"
items_df.to_csv(items_filename, index=False, float_format='%.2f')

## Create dataset groups and upload datasets

With the datasets that we need built, it's time to upload them to Personalize using dataset import jobs. Before we can upload the CSVs, we need to create a dataset group, create schemas for our datasets, and create the datasets.

We'll start by creating the SDK clients that we'll need to interact with Personalize.

### Create dataset groups

Let's create our dataset group.

In [24]:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')

In [25]:
create_dataset_group_response = personalize.create_dataset_group(
    name = "amazon-pantry-aws-similar-items"
)

dataset_group = create_dataset_group_response['datasetGroupArn']
print(json.dumps(create_dataset_group_response, indent=2))

{
  "datasetGroupArn": "arn:aws:personalize:us-east-1:144386903708:dataset-group/amazon-pantry-aws-similar-items",
  "ResponseMetadata": {
    "RequestId": "4b22f3c9-d00d-49fa-b56f-3049e60b14e2",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 07 Oct 2021 15:17:26 GMT",
      "x-amzn-requestid": "4b22f3c9-d00d-49fa-b56f-3049e60b14e2",
      "content-length": "110",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


Since a dataset group can take a few seconds to be fully created, let's wait until it has a status of ACTIVE.

In [26]:
in_progress_dataset_group_arns = [ dataset_group ]

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    # Iterate over a copy because finished ARNs are removed from the list inside the loop.
    for dataset_group_arn in list(in_progress_dataset_group_arns):
        describe_dataset_group_response = personalize.describe_dataset_group(
            datasetGroupArn = dataset_group_arn
        )
        status = describe_dataset_group_response["datasetGroup"]["status"]
        if status == "ACTIVE":
            print("Dataset group create succeeded for {}".format(dataset_group_arn))
            in_progress_dataset_group_arns.remove(dataset_group_arn)
        elif status == "CREATE FAILED":
            print("Create failed for {}".format(dataset_group_arn))
            in_progress_dataset_group_arns.remove(dataset_group_arn)

    if len(in_progress_dataset_group_arns) <= 0:
        break
    else:
        print("At least one dataset group create is still in progress")
                
    time.sleep(10)

At least one dataset group create is still in progress
At least one dataset group create is still in progress
Dataset group create succeeded for arn:aws:personalize:us-east-1:144386903708:dataset-group/amazon-pantry-aws-similar-items
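
The same polling pattern recurs for other Personalize resources, so it can be factored into a small helper. The sketch below is one possible shape, not part of the original notebook: `wait_for_status` is a hypothetical name, and the describe call is replaced by a stand-in iterator so the example runs without AWS credentials.

```python
import time

def wait_for_status(get_status, done=("ACTIVE",), failed=("CREATE FAILED",),
                    timeout_s=3 * 60 * 60, poll_s=10, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or timeout_s elapses."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = get_status()
        if status in done or status in failed:
            return status
        sleep(poll_s)
    raise TimeoutError("resource did not reach a terminal status in time")

# Stand-in for a describe call; each call yields the next simulated status.
_responses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
final_status = wait_for_status(lambda: next(_responses), sleep=lambda _: None)
print(final_status)
```

With the real client, `get_status` could be `lambda: personalize.describe_dataset_group(datasetGroupArn=dataset_group)["datasetGroup"]["status"]`.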


### Create Interactions dataset schema and datasets

We will be creating a single schema for the interactions dataset type and sharing it across both solution versions. This is possible since schemas are global to your AWS account and not specific to a dataset group.

In [27]:
interactions_schema = schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        },
        {
            "name": "EVENT_VALUE",
            "type": "float"
        },
        {
            "name": "EVENT_TYPE",
            "type": "string"
        }
    ],
    "version": "1.0"
}
            
create_schema_response = personalize.create_schema(
    name = "amazon-pantry-interactions-1",
    schema = json.dumps(interactions_schema)
)

interaction_schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:144386903708:schema/amazon-pantry-interactions-1",
  "ResponseMetadata": {
    "RequestId": "96501057-36b8-4ca7-ad76-911498441b14",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 07 Oct 2021 15:18:31 GMT",
      "x-amzn-requestid": "96501057-36b8-4ca7-ad76-911498441b14",
      "content-length": "94",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}
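
Because the CSV column headers need to line up with the schema's field names, it can be useful to compare the two before starting an import. The following sketch shows one way to do that; `schema_mismatches` is a hypothetical helper, and the shortened example schema is only for illustration.

```python
def schema_mismatches(schema, columns):
    """Return (missing, unexpected) column names relative to the schema's fields."""
    expected = [field["name"] for field in schema["fields"]]
    missing = [name for name in expected if name not in columns]
    unexpected = [name for name in columns if name not in expected]
    return missing, unexpected

example_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}
# A dataframe whose timestamp column was not renamed yet.
result = schema_mismatches(example_schema, ["USER_ID", "ITEM_ID", "unixReviewTime"])
print(result)
```

In the notebook, the call would be `schema_mismatches(interactions_schema, list(positive_reviews_df.columns))`, which should return two empty lists.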


Next we'll create an Interactions dataset in our dataset group specifying the schema we just created.

In [28]:
dataset_type = "INTERACTIONS"
create_dataset_response = personalize.create_dataset(
    name = "amazon-pantry-ints",
    datasetType = dataset_type,
    datasetGroupArn = dataset_group,
    schemaArn = interaction_schema_arn
)

interactions_dataset = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:144386903708:dataset/amazon-pantry-aws-similar-items/INTERACTIONS",
  "ResponseMetadata": {
    "RequestId": "f3aa8827-d8a3-4afa-b731-70ccd4595675",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 07 Oct 2021 15:18:36 GMT",
      "x-amzn-requestid": "f3aa8827-d8a3-4afa-b731-70ccd4595675",
      "content-length": "112",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


### Stage Interactions CSV in S3

Before we can upload the interactions CSV we created earlier into the Personalize dataset that we just created, we need to stage the CSV in an S3 bucket.

Let's create an S3 bucket and copy the interactions CSV file to the bucket.

In [29]:
# Determine the current S3 region where this notebook is being hosted in SageMaker.
with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
    data = json.load(notebook_info)
    resource_arn = data['ResourceArn']
    region = resource_arn.split(':')[3]
print(region)

us-east-1


In [30]:

s3 = boto3.client('s3')
account_id = boto3.client('sts').get_caller_identity().get('Account')
bucket_name = "amazon-pantry-personalize-example"+account_id
print(bucket_name)
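if region == "us-east-1":
    # us-east-1 is the default location and must not be passed as a LocationConstraint.
    s3.create_bucket(Bucket=bucket_name)
else:
    # Create the bucket in the same region as the notebook.
    s3.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={'LocationConstraint': region}
    )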

amazon-pantry-personalize-example144386903708
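
The region-dependent branch exists because S3 treats `us-east-1` as the default location and does not accept it as a `LocationConstraint`. A minimal sketch of the same logic as a pure function, which is easy to test without creating a bucket, is shown below. The helper name `create_bucket_kwargs` and the bucket names are hypothetical.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for s3.create_bucket for the given region."""
    kwargs = {"Bucket": bucket_name}
    if region != "us-east-1":
        # Every region other than us-east-1 needs an explicit location constraint.
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs

print(create_bucket_kwargs("example-bucket", "us-east-1"))
print(create_bucket_kwargs("example-bucket", "eu-west-1"))
```

The result could then be passed on as `s3.create_bucket(**create_bucket_kwargs(bucket_name, region))`.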


#### Upload Interactions CSV to S3

In [31]:
boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_filename).upload_file(interactions_filename)

### Create S3 bucket policy and IAM role

Before we can submit a dataset import job to Personalize, we have to create a bucket policy and IAM role that will give Personalize access to our bucket.

In [32]:
policy = {
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:*Object",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket_name),
                "arn:aws:s3:::{}/*".format(bucket_name)
            ]
        }
    ]
}

s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))

{'ResponseMetadata': {'RequestId': 'H5K1X3F9BCJMVW04',
  'HostId': 'E+jhicBArPArPsseiSMoJyxiiUrhmypt9D5GVSCMY6EgYx1wFDYq4eYClVYInY1E6T5eiZO49EI=',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': 'E+jhicBArPArPsseiSMoJyxiiUrhmypt9D5GVSCMY6EgYx1wFDYq4eYClVYInY1E6T5eiZO49EI=',
   'x-amz-request-id': 'H5K1X3F9BCJMVW04',
   'date': 'Thu, 07 Oct 2021 15:20:29 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}
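
The `s3:*Object` wildcard in the policy above also allows Personalize to write and delete objects. If the bucket is only used as an import source, a narrower policy may be enough. The sketch below assumes that read access (`s3:GetObject` and `s3:ListBucket`) is sufficient for dataset import jobs and that no export jobs write to this bucket; the function name `read_only_bucket_policy` and the bucket name are hypothetical.

```python
import json

def read_only_bucket_policy(bucket_name):
    """Build a bucket policy that lets Personalize list the bucket and read its objects."""
    return {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy",
        "Statement": [{
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {"Service": "personalize.amazonaws.com"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
        }],
    }

example_policy = read_only_bucket_policy("example-personalize-bucket")
print(json.dumps(example_policy, indent=2))
```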

In [33]:
iam = boto3.client("iam")

role_name = "PersonalizeRoleAmazonPantryAwsSimilarItems"
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "personalize.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}

create_role_response = iam.create_role(
    RoleName = role_name,
    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)

# AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes "personalize" or "Personalize" 
# if you would like to use a bucket with a different name, please consider creating and attaching a new policy
# that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess"
iam.attach_role_policy(
    RoleName = role_name,
    PolicyArn = policy_arn
)

# Now add S3 support
iam.attach_role_policy(
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
    RoleName=role_name
)
time.sleep(20) # wait 20 seconds to allow IAM role policy attachment to propagate

role_arn = create_role_response["Role"]["Arn"]
print(role_arn)

arn:aws:iam::144386903708:role/PersonalizeRoleAmazonPantryAwsSimilarItems


### Import Interactions datasets

Now we're ready to import the staged Interactions CSV in our S3 bucket to the Personalize dataset we created.

In [35]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "amazon-pantry-interactions-import",
    datasetArn = interactions_dataset,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, interactions_filename)
    },
    roleArn = role_arn
)

dataset_import_job = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:144386903708:dataset-import-job/amazon-pantry-interactions-import",
  "ResponseMetadata": {
    "RequestId": "b8894f08-58a4-428e-b1f6-5df4e59849b9",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 07 Oct 2021 15:22:04 GMT",
      "x-amzn-requestid": "b8894f08-58a4-428e-b1f6-5df4e59849b9",
      "content-length": "121",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


### Create Items dataset schema and datasets

Next we will repeat the process for the items dataset.

Create a schema that includes the description. Be sure to take note of the `"textual": True` attribute on the `DESCRIPTION` field. This is how you differentiate unstructured text fields from categorical and string fields. Without this attribute, Personalize will not apply natural language processing techniques to extract features from this text.

In [37]:
item_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "BRAND",
            "type": [ "null", "string" ],
            "categorical": True
        },{
            "name": "PRICE",
            "type": [ "null", "float" ],
        },{
            "name": "DESCRIPTION",
            "type": [ "null", "string" ],
            "textual": True
        }
    ],
    "version": "1.0"
}

create_schema_response = personalize.create_schema(
    name = "amazon-pantry-items-schema",
    schema = json.dumps(item_schema)
)

item_schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:144386903708:schema/amazon-pantry-items-schema",
  "ResponseMetadata": {
    "RequestId": "2bf13abf-6115-415d-8f7d-bf1c4482bfb4",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 07 Oct 2021 15:22:21 GMT",
      "x-amzn-requestid": "2bf13abf-6115-415d-8f7d-bf1c4482bfb4",
      "content-length": "92",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}
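
As a quick sanity check before calling `create_schema`, it can help to list which fields are flagged as textual. The sketch below works on a plain schema dictionary; `textual_field_report` is a hypothetical helper, and the rule that a textual field should be string-typed is an assumption based on how the schema above is written.

```python
def textual_field_report(schema):
    """Return (textual_field_names, problems) for a schema dictionary."""
    names, problems = [], []
    for field in schema["fields"]:
        if not field.get("textual"):
            continue
        names.append(field["name"])
        # A field type may be a plain string or a union such as ["null", "string"].
        types = field["type"] if isinstance(field["type"], list) else [field["type"]]
        if "string" not in types:
            # Assumption: textual fields are expected to hold strings.
            problems.append("{} is textual but not a string".format(field["name"]))
    return names, problems

# Trimmed copy of the items schema, used only for illustration.
example_schema = {
    "fields": [
        {"name": "ITEM_ID", "type": "string"},
        {"name": "PRICE", "type": ["null", "float"]},
        {"name": "DESCRIPTION", "type": ["null", "string"], "textual": True},
    ]
}
print(textual_field_report(example_schema))
```

An empty problem list together with `DESCRIPTION` in the names list indicates the field is marked the way the paragraph above describes.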


Next we will create the Personalize items dataset in our dataset group, taking special care to specify the appropriate schema ARN.

In [38]:
dataset_type = "ITEMS"
create_dataset_response = personalize.create_dataset(
    name = "amazon-pantry-items",
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = item_schema_arn
)

items_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:144386903708:dataset/amazon-pantry-aws-similar-items/ITEMS",
  "ResponseMetadata": {
    "RequestId": "f304bdcb-d941-444b-a83e-857c35eb0f2e",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 07 Oct 2021 15:22:29 GMT",
      "x-amzn-requestid": "f304bdcb-d941-444b-a83e-857c35eb0f2e",
      "content-length": "105",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


#### Stage Items CSV in S3

Next we'll copy our items CSV file to the same S3 bucket created above.

In [39]:
boto3.Session().resource('s3').Bucket(bucket_name).Object(items_filename).upload_file(items_filename)

### Import Items dataset

Since the S3 bucket policy and IAM role are already set up, we can just submit a dataset import job to import the Items CSV.

In [40]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "amazon-pantry-items-import-job",
    datasetArn = items_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, items_filename)
    },
    roleArn = role_arn
)

dataset_import_job_items_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:144386903708:dataset-import-job/amazon-pantry-items-import-job",
  "ResponseMetadata": {
    "RequestId": "27cf7ae3-1949-48fe-8884-bf282729cfc4",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 07 Oct 2021 15:22:45 GMT",
      "x-amzn-requestid": "27cf7ae3-1949-48fe-8884-bf282729cfc4",
      "content-length": "118",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


### Wait for the Interactions dataset import job to complete

The following cell will wait for the interactions import job to complete.

In [None]:
%%time

in_progress_import_arns = [ dataset_import_job ]

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    for import_arn in list(in_progress_import_arns):  # iterate over a copy so removal is safe
        describe_dataset_import_job_response = personalize.describe_dataset_import_job(
            datasetImportJobArn = import_arn
        )
        status = describe_dataset_import_job_response["datasetImportJob"]['status']
        if status == "ACTIVE":
            print("Dataset import succeeded for {}".format(import_arn))
            in_progress_import_arns.remove(import_arn)
        elif status == "CREATE FAILED":
            print("Create failed for {}".format(import_arn))
            in_progress_import_arns.remove(import_arn)

    if len(in_progress_import_arns) <= 0:
        break
    else:
        print("At least one dataset import job is still in progress")
                
    time.sleep(60)

### Wait for Items import job to complete

The following logic will wait until the items dataset is fully imported into the dataset group.

In [None]:
%%time

in_progress_import_arns = [ dataset_import_job_items_arn ]

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    for import_arn in list(in_progress_import_arns):  # iterate over a copy so removal is safe
        describe_dataset_import_job_response = personalize.describe_dataset_import_job(
            datasetImportJobArn = import_arn
        )
        status = describe_dataset_import_job_response["datasetImportJob"]['status']
        if status == "ACTIVE":
            print("Dataset import succeeded for {}".format(import_arn))
            in_progress_import_arns.remove(import_arn)
        elif status == "CREATE FAILED":
            print("Create failed for {}".format(import_arn))
            in_progress_import_arns.remove(import_arn)

    if len(in_progress_import_arns) <= 0:
        break
    else:
        print("At least one dataset import job is still in progress")
                
    time.sleep(60)
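
The polling cells above share the same shape: ask for a status, stop on `ACTIVE` or `CREATE FAILED`, otherwise sleep and try again. A minimal sketch of that pattern as one reusable function follows. `wait_for_jobs` is a hypothetical helper; `get_status` stands in for whichever `describe_*` call applies, and `sleep` is injectable so the example can run without waiting.

```python
import time

def wait_for_jobs(arns, get_status, poll_seconds=60, timeout_seconds=3 * 60 * 60,
                  sleep=time.sleep):
    """Poll get_status(arn) for each ARN until all reach a terminal status.

    Returns a dict mapping each ARN to "ACTIVE", "CREATE FAILED", or "TIMED OUT".
    """
    pending = list(arns)
    results = {}
    deadline = time.time() + timeout_seconds
    while pending and time.time() < deadline:
        # Iterate over a snapshot so removing from pending is safe.
        for arn in list(pending):
            status = get_status(arn)
            if status in ("ACTIVE", "CREATE FAILED"):
                results[arn] = status
                pending.remove(arn)
        if pending:
            sleep(poll_seconds)
    for arn in pending:
        results[arn] = "TIMED OUT"
    return results
```

In this notebook, `get_status` could be `lambda arn: personalize.describe_dataset_import_job(datasetImportJobArn=arn)["datasetImportJob"]["status"]`.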

## Create solutions and solution versions

With the interactions and items datasets imported into the dataset group, we will next create solutions and solution versions using the `aws-sims` and `aws-similar-items` recipes.

First, let's list the Personalize recipes available.

In [41]:
personalize.list_recipes()

{'recipes': [{'name': 'aws-hrnn',
   'recipeArn': 'arn:aws:personalize:::recipe/aws-hrnn',
   'status': 'ACTIVE',
   'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),
   'lastUpdatedDateTime': datetime.datetime(2021, 10, 2, 13, 24, 21, 632000, tzinfo=tzlocal())},
  {'name': 'aws-hrnn-coldstart',
   'recipeArn': 'arn:aws:personalize:::recipe/aws-hrnn-coldstart',
   'status': 'ACTIVE',
   'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),
   'lastUpdatedDateTime': datetime.datetime(2021, 10, 2, 13, 24, 21, 632000, tzinfo=tzlocal())},
  {'name': 'aws-hrnn-metadata',
   'recipeArn': 'arn:aws:personalize:::recipe/aws-hrnn-metadata',
   'status': 'ACTIVE',
   'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),
   'lastUpdatedDateTime': datetime.datetime(2021, 10, 2, 13, 24, 21, 632000, tzinfo=tzlocal())},
  {'name': 'aws-personalized-ranking',
   'recipeArn': 'arn:aws:personalize:::recipe/aws-personalized-ranking',
  

We will use the `aws-sims` and `aws-similar-items` recipes to train two solutions in this notebook.

In [42]:
sims_recipe_arn = "arn:aws:personalize:::recipe/aws-sims"
similar_items_recipe_arn = "arn:aws:personalize:::recipe/aws-similar-items"
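
Rather than typing recipe ARNs by hand, they can be looked up by name in a `list_recipes` response. The sketch below does this on a plain dictionary shaped like the output above. `find_recipe_arn` is a hypothetical helper, the sample response is hand-written, and only a single page of results is considered.

```python
def find_recipe_arn(list_recipes_response, recipe_name):
    """Return the ARN of the recipe called recipe_name, or None if it is absent."""
    for recipe in list_recipes_response.get("recipes", []):
        if recipe["name"] == recipe_name:
            return recipe["recipeArn"]
    return None

# Trimmed, hand-written response in the same shape as the list_recipes output.
sample_response = {
    "recipes": [
        {"name": "aws-sims",
         "recipeArn": "arn:aws:personalize:::recipe/aws-sims"},
        {"name": "aws-similar-items",
         "recipeArn": "arn:aws:personalize:::recipe/aws-similar-items"},
    ]
}
print(find_recipe_arn(sample_response, "aws-similar-items"))
```

A `None` result is a useful early signal that a recipe name was misspelled or is not returned in that page of the response.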

First, we will create a solution and solution version for the `aws-sims` recipe.

In [43]:
sims_create_solution_response = personalize.create_solution(
    name = "amazon-pantry-sims-solution-example",
    datasetGroupArn = dataset_group_arn,
    recipeArn = sims_recipe_arn
)

sims_solution_arn = sims_create_solution_response['solutionArn']

In [44]:
print(sims_solution_arn)

arn:aws:personalize:us-east-1:144386903708:solution/amazon-pantry-sims-solution-example


In [45]:
sims_solution_version_response = personalize.create_solution_version(
    solutionArn = sims_solution_arn
)

In [46]:
sims_solution_version_arn = sims_solution_version_response['solutionVersionArn']
print(json.dumps(sims_solution_version_response, indent=2))

{
  "solutionVersionArn": "arn:aws:personalize:us-east-1:144386903708:solution/amazon-pantry-sims-solution-example/0fed38cd",
  "ResponseMetadata": {
    "RequestId": "f3d70620-3170-4700-acaa-a84de14de287",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 07 Oct 2021 15:34:41 GMT",
      "x-amzn-requestid": "f3d70620-3170-4700-acaa-a84de14de287",
      "content-length": "121",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


Next we will create a solution and solution version for the `aws-similar-items` recipe, which makes use of the item descriptions.

In [47]:
similar_items_create_solution_response = personalize.create_solution(
    name = "amazon-pantry-aws-similar-items-solution-example",
    datasetGroupArn = dataset_group_arn,
    recipeArn = similar_items_recipe_arn
)

similar_items_solution_arn = similar_items_create_solution_response['solutionArn']

In [48]:
similar_items_solution_version_response = personalize.create_solution_version(
    solutionArn = similar_items_solution_arn
)

In [49]:
similar_items_solution_version_arn = similar_items_solution_version_response['solutionVersionArn']
print(json.dumps(similar_items_solution_version_response, indent=2))

{
  "solutionVersionArn": "arn:aws:personalize:us-east-1:144386903708:solution/amazon-pantry-aws-similar-items-solution-example/23419275",
  "ResponseMetadata": {
    "RequestId": "7b802cd0-0ee2-4166-a1e6-f78dfac0600d",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 07 Oct 2021 15:34:49 GMT",
      "x-amzn-requestid": "7b802cd0-0ee2-4166-a1e6-f78dfac0600d",
      "content-length": "134",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


### Wait for solution versions to become active

Finally, we'll wait for the solution versions to finish being created. This step is where Personalize trains machine learning models based on the datasets and selected recipe. Personalize will also split the interactions datasets into training and evaluation portions so it can evaluate the quality of recommendations against the trained model using held out data.

You will notice that the `aws-similar-items` solution version, which uses the description data, takes longer to train than the SIMS solution version, which does not.

In [None]:
%%time

in_progress_solution_versions = [
    sims_solution_version_arn,
    similar_items_solution_version_arn
]

max_time = time.time() + 10*60*60 # 10 hours
while time.time() < max_time:
    for solution_version_arn in list(in_progress_solution_versions):  # iterate over a copy so removal is safe
        version_response = personalize.describe_solution_version(
            solutionVersionArn = solution_version_arn
        )
        status = version_response["solutionVersion"]["status"]
        
        if status == "ACTIVE":
            print("Build succeeded for {}".format(solution_version_arn))
            in_progress_solution_versions.remove(solution_version_arn)
        elif status == "CREATE FAILED":
            print("Build failed for {}".format(solution_version_arn))
            in_progress_solution_versions.remove(solution_version_arn)
    
    if len(in_progress_solution_versions) <= 0:
        break
    else:
        print("At least one solution build is still in progress")
        
    time.sleep(60)

At least one solution build is still in progress


Generally speaking, the addition of text-based unstructured metadata will increase training time. In our case, you can see above that the solution version that trained with product descriptions took about 15 minutes longer than the solution version trained without product descriptions. This difference will vary based on the composition and text values of your datasets.

Let's inspect the training hours for each solution version and compare them as well.

In [None]:
response = personalize.describe_solution_version(solutionVersionArn = sims_solution_version_arn)
training_hours_sims = response['solutionVersion']['trainingHours']

response = personalize.describe_solution_version(solutionVersionArn = similar_items_solution_version_arn)
training_hours_similar_items = response['solutionVersion']['trainingHours']
# Relative increase of aws-similar-items training hours over the SIMS baseline
training_diff = (training_hours_similar_items - training_hours_sims) / training_hours_sims

print(f"Training hours sims: {training_hours_sims}")
print(f"Training hours similar items: {training_hours_similar_items}")

print("Difference of {:.2%}".format(training_diff))

The training hours used for cost calculations were about 50% higher for training with the description column.

The wall-clock time and training hours will vary depending on the size of your datasets, but this information can help you assess the trade-off when considering adding unstructured text to your datasets.
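
To make the percentage concrete, the relative increase is the extra training hours divided by the baseline hours. The sketch below works through the arithmetic with made-up numbers; `relative_increase` is a hypothetical helper and the hour values are illustrative only, not results from this notebook.

```python
def relative_increase(baseline, value):
    """Return how much larger value is than baseline, as a fraction of baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (value - baseline) / baseline

# Hypothetical training hours, chosen only to illustrate the arithmetic.
hours_sims = 2.0
hours_similar_items = 3.0
increase = relative_increase(hours_sims, hours_similar_items)
print("{:.0%} more training hours".format(increase))  # prints "50% more training hours"
```

Using the SIMS value as the baseline means a positive result reads directly as "this much more than SIMS".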

In [None]:
sims_solution = {
        "solution_arn": sims_solution_arn,
        "solution_version_arn": sims_solution_version_arn
}
sims_v2_solution = {
        "solution_arn": similar_items_solution_arn,
        "solution_version_arn": similar_items_solution_version_arn
}

In [None]:
def create_campaign(solution,name):
    create_campaign_response = personalize.create_campaign(
        name = "personalize-demo-" + name + "-example",
        solutionVersionArn = solution['solution_version_arn'],
        minProvisionedTPS = 1
    )

    campaign_arn = create_campaign_response['campaignArn']
    print('campaign_arn:' + campaign_arn)
    return campaign_arn

def waitForCampaign(solution):
    max_time = time.time() + 3*60*60 # 3 hours
    while time.time() < max_time:
        describe_campaign_response = personalize.describe_campaign(
            campaignArn = solution['campaign_arn']
        )
        status = describe_campaign_response["campaign"]["status"]
        print("Campaign: {} {}".format(solution['campaign_arn'], status))

        if status == "ACTIVE" or status == "CREATE FAILED":
            break

        time.sleep(60)

#### Create and wait for the 2 Campaigns
Create a campaign for each of the item similarity recipes, keeping all other settings the same, to demonstrate the impact of adding metadata.

In [None]:
sims_solution['campaign_arn'] = create_campaign(sims_solution, 'sims')
sims_v2_solution['campaign_arn'] = create_campaign(sims_v2_solution, 'aws-similar-items')

In [None]:
waitForCampaign(sims_solution)
waitForCampaign(sims_v2_solution)

# Getting Recommendations

First we are going to select three types of items to run inference with:
1. Item with a high number of interactions
1. Item with a low number of interactions
1. Random itemId

We will then look at how each of the models behaves with each of the provided items.
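
The selection described above can be sketched without pandas. The example below counts interactions per item and picks the most-interacted item, the least-interacted item, and a random one. `pick_example_items` is a hypothetical helper, and the toy interaction list is made up for illustration.

```python
import random
from collections import Counter

def pick_example_items(item_ids, seed=0):
    """Return (most_interacted, least_interacted, random_item).

    item_ids holds one entry per interaction, so repeated IDs mean more interactions.
    """
    counts = Counter(item_ids)
    ranked = counts.most_common()  # sorted from most to least interacted
    most_interacted = ranked[0][0]
    least_interacted = ranked[-1][0]
    # Seeded generator so the "random" pick is repeatable between runs.
    random_item = random.Random(seed).choice(sorted(counts))
    return most_interacted, least_interacted, random_item

# Toy data: item "A" has three interactions, "B" two, and "C" one.
interactions = ["A", "A", "A", "B", "B", "C"]
high, low, rand = pick_example_items(interactions)
print(high, low, rand)
```

In the notebook itself, the same counts come from `value_counts()` on the `asin` column, as the next cells show.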


Let's take a look at our interactions dataset and plot the distribution of items that have been interacted with.

In [None]:
items_interacted_df = positive_reviews_df.copy()
# Count interactions per unique ASIN
asin_interaction_count = items_interacted_df['asin'].value_counts()
# Convert the pandas Series to a DataFrame
asin_interaction_count_df = pd.DataFrame({'asin':asin_interaction_count.index, 'count':asin_interaction_count.values})
asin_interaction_count_df

In [None]:
asin_interaction_count_df.describe(include='int')

As we can see above, the most-interacted item has ~5k interactions. Let's plot the distribution across all items. The plot shows that a few items have a high number of interactions while others have very few.

In [None]:
asin_interaction_count_df.plot()

Below is a closer look at items with 100 to 300 interactions. These are going to give us the most variable results when testing.

In [None]:
zoom_interactions = asin_interaction_count_df.copy()
zoom_interactions = zoom_interactions.loc[(zoom_interactions["count"] > 100) & (zoom_interactions["count"] < 300)]
zoom_interactions.plot()

Now let's define some functions to explore the item metadata.

In [None]:
# https://www.geeksforgeeks.org/how-to-select-rows-from-a-dataframe-based-on-column-values/
def get_item_brand(item_id):
    """
    Takes in an ID, returns its brand
    """

    return items_df.query('ITEM_ID=="{}"'.format(item_id))['BRAND'].item()

def get_item_price(item_id):
    """
    Takes in an ID, returns its price
    """

    return items_df.query('ITEM_ID=="{}"'.format(item_id))['PRICE'].item()

def get_item_description(item_id):
    """
    Takes in an ID, returns its description
    """

    return items_df.query('ITEM_ID=="{}"'.format(item_id))['DESCRIPTION'].item()

def get_item_df(item_id):
    """
    Takes in an ID, returns its items row with the interaction count added
    """
    # Copy so that adding a column does not modify a view of items_df
    temp_df = items_df.query('ITEM_ID=="{}"'.format(item_id)).copy()
    temp_df['INTERACTIONS_COUNT'] = get_item_count(item_id)
    return temp_df

def get_item_count(item_id):
    """
    Takes in an ID, returns its number of interactions
    """
    return asin_interaction_count_df.query('asin=="{}"'.format(item_id))['count'].item()
    
def get_recs_df(item_id, campaign):
    response = personalize_runtime.get_recommendations(
        campaignArn=campaign,
        itemId=item_id,
#         numResults=5
    )
    return clean_recs_list(response['itemList'])

def clean_recs_list(rec_list):
    items = []
    for each in rec_list:
        items.append([each['itemId'], get_item_brand(each['itemId']), get_item_price(each['itemId']), get_item_description(each['itemId']), get_item_count(each['itemId'])])
    return pd.DataFrame(items, columns=['ITEM_ID', 'BRAND', 'PRICE', 'DESCRIPTION', 'INTERACTIONS_COUNT'])

Let's get a random item ID and explore each model's recommendations.

In [None]:
asin_interaction_count_df.sample()['asin'].item()

In [None]:
asin = asin_interaction_count_df.sample(1)['asin'].item()
recommended_item_df = get_item_df(asin)
recommendations_sims_df = get_recs_df(asin,sims_solution['campaign_arn'])
recommendations_sims_v2_df = get_recs_df(asin,sims_v2_solution['campaign_arn'])

This is the item we are going to run inference against; the row below shows its brand, price, and number of interactions. The theory is that, for items with few interactions, SIMS will return mostly popular items, while the new `aws-similar-items` recipe will return items more closely related to the current item's metadata.

In [None]:
recommended_item_df

In [None]:
recommended_item_df['DESCRIPTION'].item()

In [None]:
recommendations_sims_df

Cost category and product groupings

In [None]:
recommendations_sims_v2_df

Let's print the descriptions to see if they match our item's theme.

In [None]:
for index, row in recommendations_sims_v2_df.iterrows():
    print('----ITEM----')
    print('Recommendation number {index}, BRAND: {brand}, PRICE: {price}'.format(index=index+1, brand=row['BRAND'], price=row['PRICE']))
    print('----DESCRIPTION----')
    print(row['DESCRIPTION'])


# Targeted examples

Now let's take a look at some more targeted items to see how these two models behave.

## Popular item

In [None]:
asin = 'B00HZ6X8QU'
recommended_item_df = get_item_df(asin)
recommendations_sims_df = get_recs_df(asin,sims_solution['campaign_arn'])
recommendations_sims_v2_df = get_recs_df(asin,sims_v2_solution['campaign_arn'])
recommended_item_df

The theory is that, for items with few interactions, SIMS will return mostly popular items, while the new `aws-similar-items` recipe will return items more closely related to the current item's metadata (price and description).

**Full description:**


In [None]:
recommended_item_df['DESCRIPTION'].item()

#### SIMS

In [None]:
recommendations_sims_df

We can see here that the SIMS model returns popular items, which is not optimal.

#### SIMS V2

In [None]:
recommendations_sims_v2_df

We see very comparable recommendations across both models, which is expected given that the item is one of the most popular in the interactions dataset. Now let's take a look at recommendations for a less popular item.
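
One way to put a number on "very comparable" is the Jaccard overlap between the two recommendation lists: the count of shared items divided by the count of distinct items across both lists. The sketch below is an illustration only; `recommendation_overlap` is a hypothetical helper and the item IDs are made up.

```python
def recommendation_overlap(recs_a, recs_b):
    """Return the Jaccard overlap (0.0 to 1.0) between two lists of item IDs."""
    set_a, set_b = set(recs_a), set(recs_b)
    if not set_a and not set_b:
        return 0.0  # two empty lists share nothing to compare
    return len(set_a & set_b) / len(set_a | set_b)

# Made-up item IDs: three shared items out of five distinct ones.
sims_items = ["A", "B", "C", "D"]
similar_items = ["B", "C", "D", "E"]
print(recommendation_overlap(sims_items, similar_items))  # prints 0.6
```

In the notebook, the two inputs could be `recommendations_sims_df['ITEM_ID']` and `recommendations_sims_v2_df['ITEM_ID']`; a value near 1.0 would match the observation for popular items, and a lower value would indicate the models diverge.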

## Unpopular item

In [None]:
# B017BGLXYC - SODA - low price - low interactions
asin = 'B01GCT22E4'
recommended_item_df = get_item_df(asin)
recommendations_sims_df = get_recs_df(asin,sims_solution['campaign_arn'])
recommendations_sims_v2_df = get_recs_df(asin,sims_v2_solution['campaign_arn'])
recommended_item_df

The theory is that, for items with few interactions, SIMS will return mostly popular items, while the new `aws-similar-items` recipe will return items more closely related to the current item's metadata (price and description).

**Full description:**


In [None]:
recommended_item_df['DESCRIPTION'].item()

#### SIMS

In [None]:
recommendations_sims_df

We can see here that the SIMS model returns popular items, which is not optimal.

#### SIMS V2

In [None]:
recommendations_sims_v2_df

Let's print the descriptions to see if they match our item's theme.

In [None]:
for index, row in recommendations_sims_v2_df.iterrows():
    print('----ITEM----')
    print('Recommendation number {index}, BRAND: {brand}, PRICE: {price}'.format(index=index+1, brand=row['BRAND'], price=row['PRICE']))
    print('----DESCRIPTION----')
    print(row['DESCRIPTION'])
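
To check the theory that SIMS leans toward popular items, one option is to compare the average interaction count of each model's recommendations. The sketch below uses plain dictionaries and made-up counts; `mean_popularity` is a hypothetical helper, and treating unknown items as having zero interactions is an assumption.

```python
def mean_popularity(recommended_ids, interaction_counts):
    """Average interaction count of the recommended items.

    Items missing from interaction_counts are counted as 0 (an assumption).
    """
    if not recommended_ids:
        return 0.0
    values = [interaction_counts.get(item_id, 0) for item_id in recommended_ids]
    return sum(values) / len(values)

# Made-up counts: "A" and "B" are popular, "C" and "D" are niche.
counts = {"A": 5000, "B": 4000, "C": 12, "D": 8}
popular_leaning = mean_popularity(["A", "B"], counts)
niche_leaning = mean_popularity(["C", "D"], counts)
print(popular_leaning, niche_leaning)
```

With the DataFrames built in this notebook, comparing `recommendations_sims_df['INTERACTIONS_COUNT'].mean()` against the same value for `recommendations_sims_v2_df` gives the equivalent measure.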
