# Handling cold or new items with Personalize

In this notebook, we show how we can use Personalize to identify cold or new items and recommend new items to our customers.

There are a few goals when we recommend new items. We typically want users to see more of our new catalog, but at the same time we want to personalize their experience. That is, we want to show each user only the subset of cold/new items that they may be interested in.

This is a hard problem, since we don't have much information about new items.

In [1]:
import tempfile, subprocess, urllib.request, zipfile
import pandas as pd, numpy as np

In [2]:
import io
import scipy.sparse as ss
import json
import time
import os

In [3]:
import boto3

# download dataset

In [4]:
with tempfile.TemporaryDirectory() as tmpdir:
    urllib.request.urlretrieve(
        'http://files.grouplens.org/datasets/movielens/ml-1m.zip',
        tmpdir + '/ml-1m.zip')
    zipfile.ZipFile(tmpdir + '/ml-1m.zip').extractall(tmpdir)
    print(subprocess.check_output(['ls', tmpdir+'/ml-1m']).decode('utf-8'))
    
    df_all = pd.read_csv(
        tmpdir + '/ml-1m/ratings.dat',
        sep='::', engine='python',  # multi-character separators need the python engine
        names=['USER_ID','ITEM_ID','EVENT_VALUE', 'TIMESTAMP'])
    df_all['EVENT_TYPE']='RATING'

    items_all = pd.read_csv(
        tmpdir + '/ml-1m/movies.dat',
        sep='::', engine='python', encoding='latin1',
        names=['ITEM_ID', '_TITLE', 'GENRE'],
    )
    del items_all['_TITLE']

movies.dat
ratings.dat
README
users.dat





In [5]:
pd.set_option('display.max_rows', 5)

In [6]:
items = items_all.copy()
items

Unnamed: 0,ITEM_ID,GENRE
0,1,Animation|Children's|Comedy
1,2,Adventure|Children's|Fantasy
...,...,...
3881,3951,Drama
3882,3952,Drama|Thriller


In [7]:
df = df_all.copy()
df

Unnamed: 0,USER_ID,ITEM_ID,EVENT_VALUE,TIMESTAMP,EVENT_TYPE
0,1,1193,5,978300760,RATING
1,1,661,3,978302109,RATING
...,...,...,...,...,...
1000207,6040,1096,4,956715648,RATING
1000208,6040,1097,4,956715569,RATING


## convert into Personalize format

This dataset doesn't really have cold/new items, so we pick about 50% of the items at random and delete their interactions from the dataset. These items are 'held out' as 'cold/new' items and then used for evaluation at the end.
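The split described above can be made reproducible by seeding the shuffle. The following is a minimal sketch on a toy frame; the function name `split_warm_cold`, the seed value, and the toy data are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def split_warm_cold(item_ids, cold_fraction=0.5, seed=0):
    """Shuffle the unique item ids with a fixed seed and split them into (warm, cold)."""
    rng = np.random.default_rng(seed)  # assumption: the notebook itself shuffles unseeded
    shuffled = rng.permutation(np.unique(item_ids))
    n_cold = int(len(shuffled) * cold_fraction)
    return shuffled[n_cold:], shuffled[:n_cold]

# toy interactions standing in for the MovieLens ratings
toy = pd.DataFrame({
    'USER_ID': [1, 1, 2, 2, 3, 3],
    'ITEM_ID': [10, 20, 20, 30, 30, 40],
})
warm, cold = split_warm_cold(toy['ITEM_ID'])
train = toy[toy['ITEM_ID'].isin(warm)]      # interactions kept for training
held_out = toy[toy['ITEM_ID'].isin(cold)]   # interactions reserved for evaluation
print(len(warm), len(cold), len(train) + len(held_out) == len(toy))
```

With a fixed seed, rerunning the notebook yields the same warm and cold sets, which makes the evaluation numbers at the end comparable across runs.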

In [8]:
unique_items = df['ITEM_ID'].unique()
unique_items = np.random.permutation(unique_items)
len(unique_items)

3706

In [9]:
warm_items = unique_items[len(unique_items)//2:]
cold_items = unique_items[:len(unique_items)//2]

In [10]:
# keep only interactions with warm items (isin is vectorized, unlike apply)
df = df[df['ITEM_ID'].isin(warm_items)]

In [11]:
df

Unnamed: 0,USER_ID,ITEM_ID,EVENT_VALUE,TIMESTAMP,EVENT_TYPE
2,1,914,3,978301968,RATING
6,1,1287,5,978302039,RATING
...,...,...,...,...,...
1000207,6040,1096,4,956715648,RATING
1000208,6040,1097,4,956715569,RATING


In [12]:
df.to_csv('interactions.csv',index=False)

## item metadata

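It is important not to overwhelm the system with items that we aren't actually interested in recommending. So we remove from the item meta-data table every item that has no interactions at all in the original dataset. The held-out cold items stay in the table, since the meta-data is all the model knows about them.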
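One way to double-check the item table before uploading it is to compare id sets. This is a sketch with a hypothetical helper name (`check_item_coverage`) and toy frames; it only assumes the `ITEM_ID` column used throughout this notebook.

```python
import pandas as pd

def check_item_coverage(items, interactions, cold_items):
    """Return (interacted ids missing from the metadata, cold ids missing from the metadata)."""
    known = set(items['ITEM_ID'])
    missing_from_metadata = set(interactions['ITEM_ID']) - known
    cold_without_metadata = set(cold_items) - known
    return missing_from_metadata, cold_without_metadata

# toy frames: item 40 is meant to be cold but has no metadata row
items_toy = pd.DataFrame({'ITEM_ID': [10, 20, 30],
                          'GENRE': ['Drama', 'Comedy', 'Action|Thriller']})
interactions_toy = pd.DataFrame({'USER_ID': [1, 2], 'ITEM_ID': [10, 20]})
missing, cold_missing = check_item_coverage(items_toy, interactions_toy, cold_items=[30, 40])
print(missing, cold_missing)
```

Both sets should be empty for the real data: a cold item without a metadata row gives the model nothing to work with.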

In [13]:
# keep only items that appear in the original interactions (warm and cold)
items = items[items['ITEM_ID'].isin(unique_items)]

In [14]:
items

Unnamed: 0,ITEM_ID,GENRE
0,1,Animation|Children's|Comedy
1,2,Adventure|Children's|Fantasy
...,...,...
3881,3951,Drama
3882,3952,Drama|Thriller


In [15]:
items.to_csv('item_metadata.csv',index=False)

# upload data

In [17]:
os.environ['AWS_DEFAULT_REGION']="us-east-1"
suffix = str(np.random.uniform())[4:9]
bucket = "demo-temporal-holdout-metadata-"+suffix     # replace with the name of your S3 bucket
!aws s3 mb s3://{bucket}

make_bucket: demo-temporal-holdout-metadata-89220


In [18]:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')

In [19]:
interactions_filename = 'interactions.csv'
boto3.Session().resource('s3').Bucket(bucket).Object(interactions_filename).upload_file(interactions_filename)

In [20]:
item_metadata_file = 'item_metadata.csv'
boto3.Session().resource('s3').Bucket(bucket).Object(item_metadata_file).upload_file(item_metadata_file)

## create schemas

We create schemas for our interactions and item meta-data, exactly as in the metadata notebook example.
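Before creating a schema, it can help to confirm that the CSV header and the schema agree on column names. The sketch below uses a hypothetical helper, `csv_header_matches_schema`; it compares names only and does not validate types or anything else that Personalize checks at import time.

```python
def csv_header_matches_schema(header_line, schema):
    """True when the CSV header has exactly the schema's field names (order ignored)."""
    columns = [c.strip() for c in header_line.split(',')]
    fields = [f['name'] for f in schema['fields']]
    return sorted(columns) == sorted(fields)

# same shape as the interactions schema used in this notebook
example_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "EVENT_VALUE", "type": "float"},
        {"name": "TIMESTAMP", "type": "long"},
        {"name": "EVENT_TYPE", "type": "string"},
    ],
    "version": "1.0",
}

header = 'USER_ID,ITEM_ID,EVENT_VALUE,TIMESTAMP,EVENT_TYPE'
print(csv_header_matches_schema(header, example_schema))
```

In the notebook, the header line could be read from the first line of `interactions.csv` before the file is uploaded.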

In [21]:
schema_name="DEMO-temporal-metadata-schema-"+suffix

In [22]:
schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "EVENT_VALUE",
            "type": "float"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        },
        { 
            "name": "EVENT_TYPE",
            "type": "string"
        },
    ],
    "version": "1.0"
}

create_schema_response = personalize.create_schema(
    name = schema_name,
    schema = json.dumps(schema)
)

schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:261294318658:schema/DEMO-temporal-metadata-schema-89220",
  "ResponseMetadata": {
    "RequestId": "3ffc6c6e-5ef6-40a7-9ece-bb6e345c2bc9",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 10 Oct 2019 05:51:58 GMT",
      "x-amzn-requestid": "3ffc6c6e-5ef6-40a7-9ece-bb6e345c2bc9",
      "content-length": "101",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [23]:
metadata_schema_name="DEMO-temporal-metadata-metadataschema-"+suffix

In [24]:
metadata_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
    {
        "name": "ITEM_ID",
        "type": "string"
    },
    {
        "name": "GENRE",
        "type": "string",
        "categorical": True
    }
    ],
    "version": "1.0"
}

create_metadata_schema_response = personalize.create_schema(
    name = metadata_schema_name,
    schema = json.dumps(metadata_schema)
)

metadata_schema_arn = create_metadata_schema_response['schemaArn']
print(json.dumps(create_metadata_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:261294318658:schema/DEMO-temporal-metadata-metadataschema-89220",
  "ResponseMetadata": {
    "RequestId": "a25bb24c-5418-46b0-afc5-89b6d5370cd2",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 10 Oct 2019 05:52:00 GMT",
      "x-amzn-requestid": "a25bb24c-5418-46b0-afc5-89b6d5370cd2",
      "content-length": "109",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


## create a dataset group and datasets we need

In [25]:
dataset_group_name = "DEMO-temporal-metadata-dataset-group-" + suffix

create_dataset_group_response = personalize.create_dataset_group(
    name = dataset_group_name
)

dataset_group_arn = create_dataset_group_response['datasetGroupArn']
print(json.dumps(create_dataset_group_response, indent=2))

{
  "datasetGroupArn": "arn:aws:personalize:us-east-1:261294318658:dataset-group/DEMO-temporal-metadata-dataset-group-89220",
  "ResponseMetadata": {
    "RequestId": "b45d3c02-ef93-4ec6-ae42-8fe612c70689",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 10 Oct 2019 05:52:02 GMT",
      "x-amzn-requestid": "b45d3c02-ef93-4ec6-ae42-8fe612c70689",
      "content-length": "121",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [26]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(20)

DatasetGroup: CREATE PENDING
DatasetGroup: ACTIVE


In [27]:
dataset_type = "INTERACTIONS"
create_dataset_response = personalize.create_dataset(
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = schema_arn,
    name = "DEMO-temporal-metadata-dataset-interactions-" + suffix
)

interactions_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:261294318658:dataset/DEMO-temporal-metadata-dataset-group-89220/INTERACTIONS",
  "ResponseMetadata": {
    "RequestId": "05ae1d23-1791-42da-bcf3-228cd413001d",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 10 Oct 2019 05:52:23 GMT",
      "x-amzn-requestid": "05ae1d23-1791-42da-bcf3-228cd413001d",
      "content-length": "123",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [28]:
dataset_type = "ITEMS"
create_metadata_dataset_response = personalize.create_dataset(
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = metadata_schema_arn,
    name = "DEMO-temporal-metadata-dataset-items-" + suffix
)

metadata_dataset_arn = create_metadata_dataset_response['datasetArn']
print(json.dumps(create_metadata_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:261294318658:dataset/DEMO-temporal-metadata-dataset-group-89220/ITEMS",
  "ResponseMetadata": {
    "RequestId": "2921ca0a-429c-4798-a214-78b95c2548ef",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 10 Oct 2019 05:52:23 GMT",
      "x-amzn-requestid": "2921ca0a-429c-4798-a214-78b95c2548ef",
      "content-length": "116",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


## set appropriate s3 bucket permissions

In [29]:
s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket),
                "arn:aws:s3:::{}/*".format(bucket)
            ]
        }
    ]
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy));

## create a role that has the right permissions

In [30]:
from botocore.exceptions import ClientError

In [31]:
iam = boto3.client("iam")

role_name = "PersonalizeS3Role-"+suffix
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "personalize.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}
try:
    create_role_response = iam.create_role(
        RoleName = role_name,
        AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
    );

    iam.attach_role_policy(
        RoleName = role_name,
        PolicyArn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
    );

    role_arn = create_role_response["Role"]["Arn"]
except ClientError as e:
    if e.response['Error']['Code'] == 'EntityAlreadyExists':
        role_arn = iam.get_role(RoleName=role_name)['Role']['Arn']
    else:
        raise

In [32]:
print(role_arn)

arn:aws:iam::261294318658:role/PersonalizeS3Role-89220


# create the dataset import jobs

In [33]:
# sometimes need to wait a bit for the role to be created
time.sleep(20)

In [34]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "DEMO-temporal-dataset-import-job-"+suffix,
    datasetArn = interactions_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket, 'interactions.csv')
    },
    roleArn = role_arn
)

dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:261294318658:dataset-import-job/DEMO-temporal-dataset-import-job-89220",
  "ResponseMetadata": {
    "RequestId": "d3e3db3d-31fa-4bc1-9f70-2bb4fef8cb06",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 10 Oct 2019 05:52:45 GMT",
      "x-amzn-requestid": "d3e3db3d-31fa-4bc1-9f70-2bb4fef8cb06",
      "content-length": "126",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [35]:
create_metadata_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "DEMO-temporal-metadata-dataset-import-job-"+suffix,
    datasetArn = metadata_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket, 'item_metadata.csv')
    },
    roleArn = role_arn
)

metadata_dataset_import_job_arn = create_metadata_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_metadata_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:261294318658:dataset-import-job/DEMO-temporal-metadata-dataset-import-job-89220",
  "ResponseMetadata": {
    "RequestId": "d3a6c2c6-b0e8-449f-b28a-20895cadd12f",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 10 Oct 2019 05:52:45 GMT",
      "x-amzn-requestid": "d3a6c2c6-b0e8-449f-b28a-20895cadd12f",
      "content-length": "135",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


## import the data into personalize

In [36]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_import_job_arn
    )
    
    dataset_import_job = describe_dataset_import_job_response["datasetImportJob"]
    if "latestDatasetImportJobRun" not in dataset_import_job:
        status = dataset_import_job["status"]
        print("DatasetImportJob: {}".format(status))
    else:
        status = dataset_import_job["latestDatasetImportJobRun"]["status"]
        print("LatestDatasetImportJobRun: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

DatasetImportJob: CREATE PENDING
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: ACTIVE


In [37]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = metadata_dataset_import_job_arn
    )
    
    dataset_import_job = describe_dataset_import_job_response["datasetImportJob"]
    if "latestDatasetImportJobRun" not in dataset_import_job:
        status = dataset_import_job["status"]
        print("DatasetImportJob: {}".format(status))
    else:
        status = dataset_import_job["latestDatasetImportJobRun"]["status"]
        print("LatestDatasetImportJobRun: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

DatasetImportJob: ACTIVE
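
The polling loops in this notebook all share the same shape, so they could be folded into one helper. This is a sketch only: `wait_for_status` is a hypothetical name, and it takes a zero-argument callable so that it can be exercised here with a fake status sequence instead of a real AWS call.

```python
import time

def wait_for_status(describe, done=("ACTIVE", "CREATE FAILED"), poll_seconds=60,
                    timeout_seconds=3 * 60 * 60, sleep=time.sleep):
    """Call describe() until it returns a terminal status or the timeout passes."""
    deadline = time.time() + timeout_seconds
    status = None
    while time.time() < deadline:
        status = describe()
        print("status: {}".format(status))
        if status in done:
            break
        sleep(poll_seconds)
    return status

# fake describe function so the sketch runs without AWS access
fake_statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=0)
```

With a real client, the callable could be, for example, `lambda: personalize.describe_dataset_import_job(datasetImportJobArn=dataset_import_job_arn)["datasetImportJob"]["status"]`.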


## now we create a cold start solution using the cold start recipe

In [38]:
recipe_list = personalize.list_recipes()
for recipe in recipe_list['recipes']:
    print(recipe['recipeArn'])

arn:aws:personalize:::recipe/aws-deepfm
arn:aws:personalize:::recipe/aws-ffnn
arn:aws:personalize:::recipe/aws-hrnn
arn:aws:personalize:::recipe/aws-hrnn-coldstart
arn:aws:personalize:::recipe/aws-hrnn-metadata
arn:aws:personalize:::recipe/aws-personalized-ranking
arn:aws:personalize:::recipe/aws-popularity-count
arn:aws:personalize:::recipe/aws-sims


In [39]:
recipe_arn = "arn:aws:personalize:::recipe/aws-hrnn-coldstart"

In [40]:
create_solution_response = personalize.create_solution(
    name = "DEMO-temporal-metadata-solution-"+suffix,
    datasetGroupArn = dataset_group_arn,
    recipeArn = recipe_arn,
    solutionConfig = {
        "featureTransformationParameters" : {
            'cold_start_max_duration' : '5',
            'cold_start_relative_from' : 'latestItem',
            'cold_start_max_interactions':'15'
        }
    }
    
)

solution_arn = create_solution_response['solutionArn']
print(json.dumps(create_solution_response, indent=2))

{
  "solutionArn": "arn:aws:personalize:us-east-1:261294318658:solution/DEMO-temporal-metadata-solution-89220",
  "ResponseMetadata": {
    "RequestId": "e20097de-a096-4408-bc92-ab61b24c768f",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 10 Oct 2019 06:07:48 GMT",
      "x-amzn-requestid": "e20097de-a096-4408-bc92-ab61b24c768f",
      "content-length": "107",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [41]:
create_solution_version_response = personalize.create_solution_version(
    solutionArn = solution_arn
)

solution_version_arn = create_solution_version_response['solutionVersionArn']
print(json.dumps(create_solution_version_response, indent=2))

{
  "solutionVersionArn": "arn:aws:personalize:us-east-1:261294318658:solution/DEMO-temporal-metadata-solution-89220/e3ecd8cc",
  "ResponseMetadata": {
    "RequestId": "3f6480a0-d44e-448c-947a-be4c70cd4013",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 10 Oct 2019 06:07:48 GMT",
      "x-amzn-requestid": "3f6480a0-d44e-448c-947a-be4c70cd4013",
      "content-length": "123",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [42]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_solution_version_response = personalize.describe_solution_version(
        solutionVersionArn = solution_version_arn
    )
    status = describe_solution_version_response["solutionVersion"]["status"]
    print("SolutionVersion: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

SolutionVersion: CREATE PENDING
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGR

## Metrics are not very meaningful for cold start solutions

In [43]:
get_solution_metrics_response = personalize.get_solution_metrics(
    solutionVersionArn = solution_version_arn
)

print(json.dumps(get_solution_metrics_response, indent=2))


{
  "solutionVersionArn": "arn:aws:personalize:us-east-1:261294318658:solution/DEMO-temporal-metadata-solution-89220/e3ecd8cc",
  "metrics": {
    "coverage": 0.2085,
    "mean_reciprocal_rank_at_25": 0.0001,
    "normalized_discounted_cumulative_gain_at_10": 0.0,
    "normalized_discounted_cumulative_gain_at_25": 0.0004,
    "normalized_discounted_cumulative_gain_at_5": 0.0,
    "precision_at_10": 0.0,
    "precision_at_25": 0.0001,
    "precision_at_5": 0.0
  },
  "ResponseMetadata": {
    "RequestId": "4b7f67f4-2b02-47f6-b99f-8d51ceed73c7",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 10 Oct 2019 07:06:04 GMT",
      "x-amzn-requestid": "4b7f67f4-2b02-47f6-b99f-8d51ceed73c7",
      "content-length": "409",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


What happened here? Since we deleted all cold start items from the training set, the metrics are close to zero. 

An important lesson here is that metrics for cold start are hard to evaluate offline.

Let's look at what happens when we generate predictions.

## Create a campaign 

In [44]:
create_campaign_response = personalize.create_campaign(
    name = "DEMO-coldstart-campaign-"+suffix,
    solutionVersionArn = solution_version_arn,
    minProvisionedTPS = 2,    
)

campaign_arn = create_campaign_response['campaignArn']
print(json.dumps(create_campaign_response, indent=2))

{
  "campaignArn": "arn:aws:personalize:us-east-1:261294318658:campaign/DEMO-coldstart-campaign-89220",
  "ResponseMetadata": {
    "RequestId": "a17d9622-afe4-4a3e-b764-ebb861a9b192",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 10 Oct 2019 07:06:04 GMT",
      "x-amzn-requestid": "a17d9622-afe4-4a3e-b764-ebb861a9b192",
      "content-length": "99",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [45]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_campaign_response = personalize.describe_campaign(
        campaignArn = campaign_arn
    )
    status = describe_campaign_response["campaign"]["status"]
    print("Campaign: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

Campaign: CREATE PENDING
Campaign: CREATE IN_PROGRESS
Campaign: CREATE IN_PROGRESS
Campaign: CREATE IN_PROGRESS
Campaign: CREATE IN_PROGRESS
Campaign: CREATE IN_PROGRESS
Campaign: CREATE IN_PROGRESS
Campaign: CREATE IN_PROGRESS
Campaign: CREATE IN_PROGRESS
Campaign: ACTIVE


In [46]:
# we had saved all the data before deleting the cold items
df = df_all.copy()
# keep only the held-out interactions with cold items
df = df[df['ITEM_ID'].isin(cold_items)]
df

Unnamed: 0,USER_ID,ITEM_ID,EVENT_VALUE,TIMESTAMP,EVENT_TYPE
0,1,1193,5,978300760,RATING
1,1,661,3,978302109,RATING
...,...,...,...,...,...
1000201,6040,1080,4,957717322,RATING
1000203,6040,1090,3,956715518,RATING


In [47]:
from tqdm import tqdm_notebook
import numpy as np
from metrics import mean_reciprocal_rank, ndcg_at_k, precision_at_k

In [48]:
users = df['USER_ID'].unique()

## how often is the deleted/cold/new item in the actual items the user interacted with?

In [50]:
relevance = []
for user_id in  tqdm_notebook(users[:1000]):

    true_items = set(df[df['USER_ID']==user_id]['ITEM_ID'].values)

    rec_response = personalize_runtime.get_recommendations(
            campaignArn = campaign_arn,
            userId = str(user_id)
        )
    rec_items = [int(x['itemId']) for x in rec_response['itemList']]
    relevance.append([int(x in true_items) for x in rec_items])

HBox(children=(IntProgress(value=0, max=1000), HTML(value='')))

In [51]:
print('mean_reciprocal_rank', np.mean([mean_reciprocal_rank(r) for r in relevance]))
print('precision_at_5', np.mean([precision_at_k(r, 5) for r in relevance]))
print('precision_at_10', np.mean([precision_at_k(r, 10) for r in relevance]))
print('precision_at_25', np.mean([precision_at_k(r, 25) for r in relevance]))
print('normalized_discounted_cumulative_gain_at_5', np.mean([ndcg_at_k(r, 5) for r in relevance]))
print('normalized_discounted_cumulative_gain_at_10', np.mean([ndcg_at_k(r, 10) for r in relevance]))
print('normalized_discounted_cumulative_gain_at_25', np.mean([ndcg_at_k(r, 25) for r in relevance]))

mean_reciprocal_rank 0.32269186889418144
precision_at_5 0.1618
precision_at_10 0.15159999999999998
precision_at_25 0.13252
normalized_discounted_cumulative_gain_at_5 0.208342930765236
normalized_discounted_cumulative_gain_at_10 0.260983140687113
normalized_discounted_cumulative_gain_at_25 0.3875125021383548


### A baseline

As a baseline, consider picking from the cold start items uniformly at random. Since the cold start items in our dataset have no interaction history, this is a reasonable baseline: we don't have much information to go on here.
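
For a uniform random pick, the expected precision at any k is simply the share of candidate items that the user actually interacted with. The sketch below checks that claim by simulation on toy data; the function names and the toy item ids are assumptions for illustration.

```python
import numpy as np

def expected_random_precision(true_items, candidate_items):
    """Expected precision@k of a uniform random pick: the share of candidates that are hits."""
    candidates = set(candidate_items)
    return len(set(true_items) & candidates) / len(candidates)

def simulated_random_precision(true_items, candidate_items, k=25, trials=2000, seed=0):
    """Average precision@k over many random draws of k distinct candidates."""
    rng = np.random.default_rng(seed)
    true_items = set(true_items)
    candidates = np.asarray(sorted(set(candidate_items)))
    precisions = []
    for _ in range(trials):
        picks = rng.permutation(candidates)[:k].tolist()
        precisions.append(sum(p in true_items for p in picks) / k)
    return float(np.mean(precisions))

toy_cold_items = range(100)          # 100 hypothetical cold items
toy_true_items = {3, 17, 42, 77}     # the user interacted with 4 of them
expected = expected_random_precision(toy_true_items, toy_cold_items)
simulated = simulated_random_precision(toy_true_items, toy_cold_items)
print(expected, round(simulated, 3))
```

The simulated value should land close to the closed-form one, which gives a quick plausibility check for the baseline numbers measured in this section.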

In [52]:
len(rec_items)

25

In [53]:
relevance = []
for user_id in  tqdm_notebook(users[:1000]):

    true_items = set(df[df['USER_ID']==user_id]['ITEM_ID'].values)
    rec_items = np.random.permutation(cold_items)[:25]
    relevance.append([int(x in true_items) for x in rec_items])

HBox(children=(IntProgress(value=0, max=1000), HTML(value='')))

In [54]:
print('mean_reciprocal_rank', np.mean([mean_reciprocal_rank(r) for r in relevance]))
print('precision_at_5', np.mean([precision_at_k(r, 5) for r in relevance]))
print('precision_at_10', np.mean([precision_at_k(r, 10) for r in relevance]))
print('precision_at_25', np.mean([precision_at_k(r, 25) for r in relevance]))
print('normalized_discounted_cumulative_gain_at_5', np.mean([ndcg_at_k(r, 5) for r in relevance]))
print('normalized_discounted_cumulative_gain_at_10', np.mean([ndcg_at_k(r, 10) for r in relevance]))
print('normalized_discounted_cumulative_gain_at_25', np.mean([ndcg_at_k(r, 25) for r in relevance]))

mean_reciprocal_rank 0.109026552752672
precision_at_5 0.039200000000000006
precision_at_10 0.040100000000000004
precision_at_25 0.04084
normalized_discounted_cumulative_gain_at_5 0.07499378892350762
normalized_discounted_cumulative_gain_at_10 0.11011923221840729
normalized_discounted_cumulative_gain_at_25 0.18745855682250365


We see that the cold start model is able to leverage some information from the meta-data to gain lift over uninformed recommendation of cold/new items.

The lift is only about 3-4x on precision (and 2-3x on MRR and NDCG) because we have many cold items and not very informative meta-data: just the genre of the movie.

Reducing the number of cold items or improving the quality of the meta-data can both help.
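
The lift can be read off by dividing the model's metrics by the baseline's. The sketch below does this with the values printed above, rounded to four decimals; the dictionary layout is only an illustration.

```python
# metric values copied from the two result cells above (rounded to 4 decimals)
model = {
    'mean_reciprocal_rank': 0.3227,
    'precision_at_5': 0.1618,
    'precision_at_10': 0.1516,
    'precision_at_25': 0.1325,
    'normalized_discounted_cumulative_gain_at_5': 0.2083,
    'normalized_discounted_cumulative_gain_at_10': 0.2610,
    'normalized_discounted_cumulative_gain_at_25': 0.3875,
}
baseline = {
    'mean_reciprocal_rank': 0.1090,
    'precision_at_5': 0.0392,
    'precision_at_10': 0.0401,
    'precision_at_25': 0.0408,
    'normalized_discounted_cumulative_gain_at_5': 0.0750,
    'normalized_discounted_cumulative_gain_at_10': 0.1101,
    'normalized_discounted_cumulative_gain_at_25': 0.1875,
}

lift = {name: model[name] / baseline[name] for name in model}
for name, value in lift.items():
    print('{}: {:.2f}x'.format(name, value))
```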

## a quick smell test

In [60]:
# we had saved all the data before deleting the cold items
df = df_all.copy()
# keep only the warm-item interactions, i.e. what the model was trained on
df = df[df['ITEM_ID'].isin(warm_items)]
df = df.sort_values('TIMESTAMP', kind='mergesort').copy()
df

Unnamed: 0,USER_ID,ITEM_ID,EVENT_VALUE,TIMESTAMP,EVENT_TYPE
999873,6040,593,5,956703954,RATING
1000192,6040,2019,5,956703977,RATING
...,...,...,...,...,...
825526,4958,3489,4,1046454320,RATING
825724,4958,3264,4,1046454548,RATING


In [61]:
user_id = users[1]
hist_items = df[df['USER_ID']==user_id]['ITEM_ID'].tail(5).values
items_all.set_index('ITEM_ID').loc[hist_items]

Unnamed: 0_level_0,GENRE
ITEM_ID,Unnamed: 1_level_1
1690,Action|Horror|Sci-Fi
3257,Action|Drama|Romance|Thriller
2002,Action|Comedy|Crime|Drama
292,Action|Drama|Thriller
95,Action|Thriller


In [62]:
rec_response = personalize_runtime.get_recommendations(
            campaignArn = campaign_arn,
            userId = str(user_id)
        )
rec_items = [int(x['itemId']) for x in rec_response['itemList']]
items_all.set_index('ITEM_ID').loc[rec_items[:5]]

Unnamed: 0_level_0,GENRE
ITEM_ID,Unnamed: 1_level_1
3705,Action|Adventure|Romance|Thriller
736,Action|Adventure|Romance|Thriller
10,Action|Adventure|Thriller
990,Action|Adventure|Thriller
733,Action|Adventure|Thriller


We see that this user recently watched a lot of Action and Thriller items, and the model is able to pick this up and recommend Action|Adventure|Thriller items from the cold items.

## another quick smell test

In [63]:
user_id = users[2]
hist_items = df[df['USER_ID']==user_id]['ITEM_ID'].tail(5).values
items_all.set_index('ITEM_ID').loc[hist_items]

Unnamed: 0_level_0,GENRE
ITEM_ID,Unnamed: 1_level_1
1304,Action|Comedy|Western
3619,Comedy
1270,Comedy|Sci-Fi
3552,Comedy
104,Comedy


In [64]:
rec_response = personalize_runtime.get_recommendations(
            campaignArn = campaign_arn,
            userId = str(user_id)
        )
rec_items = [int(x['itemId']) for x in rec_response['itemList']]
items_all.set_index('ITEM_ID').loc[rec_items[:5]]

Unnamed: 0_level_0,GENRE
ITEM_ID,Unnamed: 1_level_1
153,Action|Adventure|Comedy|Crime
1910,Action|Comedy|Crime
3266,Action|Comedy|Crime|Drama
2001,Action|Comedy|Crime|Drama
3184,Action|Comedy|Crime|Drama


We see that this user recently watched a lot of Comedy items (plus some Action), and the model is able to pick this up and recommend Action|Comedy items from the cold items.
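
To make a smell test like this a little less subjective, one could count individual genre tokens in the history and see how many recommendations contain the most frequent one. The sketch below uses the genre strings from the two tables above; the helper name `genre_counts` is hypothetical.

```python
from collections import Counter

def genre_counts(genre_strings):
    """Count individual genre tokens in '|'-separated genre strings."""
    return Counter(token for s in genre_strings for token in s.split('|'))

# genres copied from the history and recommendation tables for this user
history = ['Action|Comedy|Western', 'Comedy', 'Comedy|Sci-Fi', 'Comedy', 'Comedy']
recommended = ['Action|Adventure|Comedy|Crime', 'Action|Comedy|Crime',
               'Action|Comedy|Crime|Drama', 'Action|Comedy|Crime|Drama',
               'Action|Comedy|Crime|Drama']

top_genre, _ = genre_counts(history).most_common(1)[0]
share = sum(top_genre in r.split('|') for r in recommended) / len(recommended)
print(top_genre, share)
```

This is still only a rough check on five items per side, not a substitute for the metrics computed earlier.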

# clean up

In [74]:
personalize.delete_campaign(campaignArn=campaign_arn)
while len(personalize.list_campaigns(solutionArn=solution_arn)['campaigns']):
    time.sleep(5)

personalize.delete_solution(solutionArn=solution_arn)
while len(personalize.list_solutions(datasetGroupArn=dataset_group_arn)['solutions']):
    time.sleep(5)

for dataset in personalize.list_datasets(datasetGroupArn=dataset_group_arn)['datasets']:
    personalize.delete_dataset(datasetArn=dataset['datasetArn'])
while len(personalize.list_datasets(datasetGroupArn=dataset_group_arn)['datasets']):
    time.sleep(5)

personalize.delete_dataset_group(datasetGroupArn=dataset_group_arn)

{'ResponseMetadata': {'RequestId': 'de33b4b7-611b-4f4b-8b54-0396f3afdcd4',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Thu, 10 Oct 2019 07:29:49 GMT',
   'x-amzn-requestid': 'de33b4b7-611b-4f4b-8b54-0396f3afdcd4',
   'content-length': '0',
   'connection': 'keep-alive'},
  'RetryAttempts': 0}}
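
The cell above leaves the two schemas and the IAM role in place. A sketch of removing them as well follows. The function name `delete_leftovers` is hypothetical, and it assumes the datasets that used the schemas are already deleted and that the only policy attached to the role is the one attached earlier in this notebook.

```python
def delete_leftovers(personalize, iam, schema_arns, role_name,
                     policy_arn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"):
    """Delete the schemas and the IAM role that the cleanup cell leaves behind."""
    for arn in schema_arns:
        personalize.delete_schema(schemaArn=arn)
    # a role's managed policies must be detached before the role can be deleted
    iam.detach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    iam.delete_role(RoleName=role_name)

# In this notebook the call would look like:
# delete_leftovers(personalize, iam, [schema_arn, metadata_schema_arn], role_name)
```

Passing the clients in as arguments keeps the helper easy to dry-run against stub objects before pointing it at a real account.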

# execute with caution!

In [75]:
!aws s3 rm s3://{bucket} --recursive

delete: s3://demo-temporal-holdout-metadata-89220/item_metadata.csv
delete: s3://demo-temporal-holdout-metadata-89220/interactions.csv
