# Handling cold or new items with Personalize

In this notebook, we show how we can use Personalize to identify cold or new items and recommend new items to our customers.

There are a few goals when we recommend new items. We typically want users to see more of our new catalog, but at the same time we want to personalize their experience. That is, we want to show only the subset of cold/new items that a user may be interested in.

This is a hard problem, since we don't have much information about new items.

In [1]:
import tempfile, subprocess, urllib.request, zipfile
import pandas as pd, numpy as np

In [2]:
import io
import scipy.sparse as ss
import json
import time
import os

In [3]:
import sagemaker.amazon.common as smac

In [4]:
import boto3

# download dataset

In [100]:
with tempfile.TemporaryDirectory() as tmpdir:
    urllib.request.urlretrieve(
        'http://files.grouplens.org/datasets/movielens/ml-20m.zip',
        tmpdir + '/ml-20m.zip')
    zipfile.ZipFile(tmpdir + '/ml-20m.zip').extractall(tmpdir)
    df = pd.read_csv(tmpdir + '/ml-20m/ratings.csv')
    movies = pd.read_csv(tmpdir + '/ml-20m/movies.csv', index_col='movieId')
    vocab_size = df.movieId.max() + 1

In [101]:
movies2=movies.copy()

In [6]:
vocab_size

131263

In [7]:
test_time_ratio = 0.01
test_user_ratio = 0.2

In [8]:
dfo = df.copy()
df = df[df.timestamp < df.timestamp.max() * (1-test_time_ratio) + df.timestamp.min() * test_time_ratio]
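
The filter above keeps every rating older than a cutoff that sits `test_time_ratio` of the way back from the newest timestamp. A minimal sketch of that arithmetic on made-up timestamps follows; the helper name `time_cutoff` and the toy values are illustrative assumptions, not part of the notebook.

```python
def time_cutoff(ts_min, ts_max, test_time_ratio):
    """Timestamp below which ratings are kept for training."""
    # Same weighted average as the notebook's filter expression.
    return ts_max * (1 - test_time_ratio) + ts_min * test_time_ratio


# Toy timestamps spanning 0..1000 seconds (illustrative values only).
timestamps = [0, 250, 500, 985, 995, 1000]
cutoff = time_cutoff(min(timestamps), max(timestamps), 0.01)

train = [t for t in timestamps if t < cutoff]      # kept for training
held_out = [t for t in timestamps if t >= cutoff]  # most recent 1% of the time span
print(cutoff, train, held_out)
```

With a ratio of 0.01, only events in the last 1% of the covered time span are dropped, regardless of how many events fall there.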

## convert into Personalize format

In [113]:
df.columns = ['USER_ID','ITEM_ID','EVENT_VALUE','TIMESTAMP']
df['EVENT_TYPE']='RATING'

This dataset doesn't really have cold/new items. So we pick about a quarter of the items (6000) at random and delete them from the dataset. These are 'held out' as 'cold/new' items and used for evaluation at the end.

In [10]:
unique_items = df['ITEM_ID'].unique()

In [13]:
unique_items = np.random.permutation(unique_items)

In [14]:
len(unique_items)

25199

In [15]:
warm_items = set(unique_items[6000:])

In [16]:
cold_items = unique_items[:6000]

In [17]:
df['to_keep'] = df['ITEM_ID'].apply(lambda x:x in warm_items)

In [18]:
df=df[df['to_keep']]

In [19]:
del df['to_keep']
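
The permutation above is unseeded, so the held-out set changes from run to run. If reproducibility matters, a seeded split is one option. The sketch below uses only the standard library on toy IDs; `split_cold_warm` and its parameters are hypothetical names, not something the notebook defines.

```python
import random


def split_cold_warm(item_ids, n_cold, seed=0):
    """Shuffle item IDs with a fixed seed and hold out the first n_cold as cold."""
    shuffled = list(item_ids)
    random.Random(seed).shuffle(shuffled)
    cold = shuffled[:n_cold]        # held out, later treated as new items
    warm = set(shuffled[n_cold:])   # set for fast membership tests
    return cold, warm


# Toy catalog of 20 items and two users who each touched every item.
cold_toy, warm_toy = split_cold_warm(range(20), n_cold=5, seed=42)
interactions = [(u, i) for u in ("a", "b") for i in range(20)]

# Keep only interactions with warm items, as the notebook does via 'to_keep'.
kept = [(u, i) for (u, i) in interactions if i in warm_toy]
print(len(cold_toy), len(warm_toy), len(kept))
```

On a pandas DataFrame the same filter can be written as `df[df['ITEM_ID'].isin(warm_items)]`, which avoids the temporary `to_keep` column.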

In [None]:
#for demo we may want to upload a small dataset
#df=df.loc[:10000]

Even though we deleted all occurrences of these items, we want the system to know they exist. So we create a fake user who interacts with each of these items once.

This doesn't affect the model much; it just lets the system know that these items exist.

In [20]:
max_ind = max(df.index)+1

In [21]:

ndf = pd.DataFrame(columns=df.columns,index=range(max_ind,max_ind+len(cold_items)))

In [27]:
ndf['USER_ID']='fake_user'
ndf['ITEM_ID']=cold_items
ndf['EVENT_VALUE']=4.5
ndf['TIMESTAMP']=int(time.time())
ndf['EVENT_TYPE']='RATING'

In [31]:
df=pd.concat((df,ndf))

In [33]:
df.to_csv('interactions.csv',index=False)

## item metadata

In [34]:
movies = movies.reset_index()

del movies['title']

movies.columns=['ITEM_ID','GENRE']

In [35]:
movies.head()

Unnamed: 0,ITEM_ID,GENRE
0,1,Adventure|Animation|Children|Comedy|Fantasy
1,2,Adventure|Children|Fantasy
2,3,Comedy|Romance
3,4,Comedy|Drama|Romance
4,5,Comedy


It is important not to overwhelm the system with items that we aren't actually interested in recommending. So we remove from the item metadata table every item that does not appear in the interactions data.

In [36]:
unique_items=set(unique_items)
movies['to_keep'] = movies['ITEM_ID'].apply(lambda x:x in unique_items)

In [37]:
movies=movies[movies['to_keep']]

In [38]:
del movies['to_keep']

In [39]:
movies.to_csv('item_metadata.csv',index=False)

# upload data

In [40]:
os.environ['AWS_DEFAULT_REGION']="us-east-1"
suffix = str(np.random.uniform())[4:9]
bucket = "demo-temporal-holdout-metadata-"+suffix     # replace with the name of your S3 bucket
!aws s3 mb s3://{bucket}

make_bucket: demo-temporal-holdout-metadata-62826


In [41]:
personalize = boto3.client(service_name='personalize', endpoint_url='https://personalize.us-east-1.amazonaws.com')
personalize_runtime = boto3.client(service_name='personalize-runtime', endpoint_url='https://personalize-runtime.us-east-1.amazonaws.com')

In [42]:
interactions_filename = 'interactions.csv'
boto3.Session().resource('s3').Bucket(bucket).Object(interactions_filename).upload_file(interactions_filename)

In [43]:
item_metadata_file = 'item_metadata.csv'
boto3.Session().resource('s3').Bucket(bucket).Object(item_metadata_file).upload_file(item_metadata_file)

## create schemas

We create schemas for our data, exactly as in the metadata notebook example.

In [44]:
schema_name="DEMO-temporal-metadata-schema-"+suffix

In [45]:
schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "EVENT_VALUE",
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        },
        { 
            "name": "EVENT_TYPE",
            "type": "string"
        },
    ],
    "version": "1.0"
}

create_schema_response = personalize.create_schema(
    name = schema_name,
    schema = json.dumps(schema)
)

schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:261294318658:schema/DEMO-temporal-metadata-schema-62826",
  "ResponseMetadata": {
    "RequestId": "33817256-bc4a-49a4-9d4b-500b20c75038",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Sat, 01 Jun 2019 02:32:32 GMT",
      "x-amzn-requestid": "33817256-bc4a-49a4-9d4b-500b20c75038",
      "content-length": "101",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [46]:
metadata_schema_name="DEMO-temporal-metadata-metadataschema-"+suffix

In [47]:
metadata_schema = {
 "type": "record",
 "name": "Items",
 "namespace": "com.amazonaws.personalize.schema",
 "fields": [
 {
 "name": "ITEM_ID",
 "type": "string"
 },
 {
 "name": "GENRE",
 "type": "string",
 "categorical": True
 }
 ],
 "version": "1.0"
}

create_metadata_schema_response = personalize.create_schema(
    name = metadata_schema_name,
    schema = json.dumps(metadata_schema)
)

metadata_schema_arn = create_metadata_schema_response['schemaArn']
print(json.dumps(create_metadata_schema_response, indent=2))


{
  "schemaArn": "arn:aws:personalize:us-east-1:261294318658:schema/DEMO-temporal-metadata-metadataschema-62826",
  "ResponseMetadata": {
    "RequestId": "08610ffd-6ac9-4377-9749-3ed1625e05ac",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Sat, 01 Jun 2019 02:32:33 GMT",
      "x-amzn-requestid": "08610ffd-6ac9-4377-9749-3ed1625e05ac",
      "content-length": "109",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


## create a dataset group and datasets we need

In [48]:
dataset_group_name = "DEMO-temporal-metadata-dataset-group-" + suffix

create_dataset_group_response = personalize.create_dataset_group(
    name = dataset_group_name
)

dataset_group_arn = create_dataset_group_response['datasetGroupArn']
print(json.dumps(create_dataset_group_response, indent=2))

{
  "datasetGroupArn": "arn:aws:personalize:us-east-1:261294318658:dataset-group/DEMO-temporal-metadata-dataset-group-62826",
  "ResponseMetadata": {
    "RequestId": "80694f9a-ee2b-4e52-b184-0252bf45f3f5",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Sat, 01 Jun 2019 02:32:33 GMT",
      "x-amzn-requestid": "80694f9a-ee2b-4e52-b184-0252bf45f3f5",
      "content-length": "121",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [49]:
# wait for the dataset group created above to become ACTIVE
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(20)

DatasetGroup: CREATE PENDING
DatasetGroup: ACTIVE


In [50]:
dataset_type = "INTERACTIONS"
create_dataset_response = personalize.create_dataset(
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = schema_arn,
    name = "DEMO-temporal-metadata-dataset-interactions-" + suffix
)

interactions_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:261294318658:dataset/DEMO-temporal-metadata-dataset-group-62826/INTERACTIONS",
  "ResponseMetadata": {
    "RequestId": "899cc918-6a68-4d88-98d6-9c2bf9fdf541",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Sat, 01 Jun 2019 02:32:52 GMT",
      "x-amzn-requestid": "899cc918-6a68-4d88-98d6-9c2bf9fdf541",
      "content-length": "123",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [51]:
dataset_type = "ITEMS"
create_metadata_dataset_response = personalize.create_dataset(
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = metadata_schema_arn,
    name = "DEMO-temporal-metadata-dataset-items-" + suffix
)

metadata_dataset_arn = create_metadata_dataset_response['datasetArn']
print(json.dumps(create_metadata_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:261294318658:dataset/DEMO-temporal-metadata-dataset-group-62826/ITEMS",
  "ResponseMetadata": {
    "RequestId": "0ca1e1f0-6045-4e6f-b5c1-4f6186b8c793",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Sat, 01 Jun 2019 02:32:53 GMT",
      "x-amzn-requestid": "0ca1e1f0-6045-4e6f-b5c1-4f6186b8c793",
      "content-length": "116",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


## set appropriate s3 bucket permissions

In [52]:

s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket),
                "arn:aws:s3:::{}/*".format(bucket)
            ]
        }
    ]
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy));


## create a role that has the right permissions

In [53]:
from botocore.exceptions import ClientError

In [54]:
iam = boto3.client("iam")

role_name = "PersonalizeS3Role-"+suffix
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "personalize.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}
try:
    create_role_response = iam.create_role(
        RoleName = role_name,
        AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
    );

    iam.attach_role_policy(
        RoleName = role_name,
        PolicyArn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
    );

    role_arn = create_role_response["Role"]["Arn"]
except ClientError as e:
    if e.response['Error']['Code'] == 'EntityAlreadyExists':
        role_arn = iam.get_role(RoleName=role_name)['Role']['Arn']
    else:
        raise

In [55]:
print(role_arn)

arn:aws:iam::261294318658:role/PersonalizeS3Role-62826


# create the dataset import jobs

In [56]:
# sometimes need to wait a bit for the role to be created
time.sleep(60)

In [57]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "DEMO-temporal-dataset-import-job-"+suffix,
    datasetArn = interactions_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket, 'interactions.csv')
    },
    roleArn = role_arn
)

dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:261294318658:dataset-import-job/DEMO-temporal-dataset-import-job-62826",
  "ResponseMetadata": {
    "RequestId": "59bf65e3-3e81-4db3-bb0f-eb8c0c0a03e9",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Sat, 01 Jun 2019 02:33:55 GMT",
      "x-amzn-requestid": "59bf65e3-3e81-4db3-bb0f-eb8c0c0a03e9",
      "content-length": "126",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [58]:
create_metadata_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "DEMO-temporal-metadata-dataset-import-job-"+suffix,
    datasetArn = metadata_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket, 'item_metadata.csv')
    },
    roleArn = role_arn
)

metadata_dataset_import_job_arn = create_metadata_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_metadata_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:261294318658:dataset-import-job/DEMO-temporal-metadata-dataset-import-job-62826",
  "ResponseMetadata": {
    "RequestId": "800c4420-6c6e-45ac-ae44-c668c62b34bf",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Sat, 01 Jun 2019 02:33:55 GMT",
      "x-amzn-requestid": "800c4420-6c6e-45ac-ae44-c668c62b34bf",
      "content-length": "135",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


## import the data into personalize

In [59]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_import_job_arn
    )
    
    dataset_import_job = describe_dataset_import_job_response["datasetImportJob"]
    if "latestDatasetImportJobRun" not in dataset_import_job:
        status = dataset_import_job["status"]
        print("DatasetImportJob: {}".format(status))
    else:
        status = dataset_import_job["latestDatasetImportJobRun"]["status"]
        print("LatestDatasetImportJobRun: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

DatasetImportJob: CREATE PENDING
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: ACTIVE


In [60]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = metadata_dataset_import_job_arn
    )
    
    dataset_import_job = describe_dataset_import_job_response["datasetImportJob"]
    if "latestDatasetImportJobRun" not in dataset_import_job:
        status = dataset_import_job["status"]
        print("DatasetImportJob: {}".format(status))
    else:
        status = dataset_import_job["latestDatasetImportJobRun"]["status"]
        print("LatestDatasetImportJobRun: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

DatasetImportJob: ACTIVE
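
The same polling pattern recurs for dataset groups, import jobs, solution versions and campaigns. One way to factor it out is a small helper that takes the describe call and a status extractor as callables. This is only a sketch: `wait_for_status` and its parameters are hypothetical names, and the demo at the bottom uses canned responses instead of real Personalize calls.

```python
import time


def wait_for_status(describe, get_status, interval=60, timeout=3 * 60 * 60,
                    sleep=time.sleep):
    """Poll describe() until the extracted status is terminal or the timeout passes."""
    deadline = time.time() + timeout
    status = None
    while time.time() < deadline:
        status = get_status(describe())
        print("status: {}".format(status))
        # Same terminal states the notebook's loops check for.
        if status in ("ACTIVE", "CREATE FAILED"):
            break
        sleep(interval)
    return status


# Demo with canned responses standing in for successive describe_* results.
fake_responses = iter([
    {"job": {"status": "CREATE PENDING"}},
    {"job": {"status": "CREATE IN_PROGRESS"}},
    {"job": {"status": "ACTIVE"}},
])
final_status = wait_for_status(
    describe=lambda: next(fake_responses),
    get_status=lambda response: response["job"]["status"],
    interval=0,
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
```

With the notebook's names, the first argument would wrap a call such as `personalize.describe_dataset_import_job(datasetImportJobArn=dataset_import_job_arn)` and the second would read `response["datasetImportJob"]["status"]`.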


## now we create a cold start solution using the cold start recipe

In [62]:
recipe_list = personalize.list_recipes()
for recipe in recipe_list['recipes']:
    print(recipe['recipeArn'])

arn:aws:personalize:::recipe/aws-hrnn
arn:aws:personalize:::recipe/aws-hrnn-coldstart
arn:aws:personalize:::recipe/aws-hrnn-metadata
arn:aws:personalize:::recipe/aws-personalized-ranking
arn:aws:personalize:::recipe/aws-popularity-count
arn:aws:personalize:::recipe/aws-sims


In [63]:
recipe_arn = "arn:aws:personalize:::recipe/aws-hrnn-coldstart"

In [69]:
create_solution_response = personalize.create_solution(
    name = "DEMO-temporal-metadata-solution-"+suffix,
    datasetGroupArn = dataset_group_arn,
    recipeArn = recipe_arn,
    solutionConfig = {
        "featureTransformationParameters" : {
            'cold_start_max_duration' : '5',
            'cold_start_relative_from' : 'latestItem',
            'cold_start_max_interactions':'15'
        }
    }
    
)

solution_arn = create_solution_response['solutionArn']
print(json.dumps(create_solution_response, indent=2))

{
  "solutionArn": "arn:aws:personalize:us-east-1:261294318658:solution/DEMO-temporal-metadata-solution-62826",
  "ResponseMetadata": {
    "RequestId": "386f95b5-3ac8-45c4-b26d-cdf63a46815b",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Sat, 01 Jun 2019 03:19:50 GMT",
      "x-amzn-requestid": "386f95b5-3ac8-45c4-b26d-cdf63a46815b",
      "content-length": "107",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [73]:
create_solution_version_response = personalize.create_solution_version(
    solutionArn = solution_arn
)

solution_version_arn = create_solution_version_response['solutionVersionArn']
print(json.dumps(create_solution_version_response, indent=2))

{
  "solutionVersionArn": "arn:aws:personalize:us-east-1:261294318658:solution/DEMO-temporal-metadata-solution-62826/a965c800",
  "ResponseMetadata": {
    "RequestId": "b4c9f1ff-d6c4-4f93-ad7b-e3a9047c787f",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Sat, 01 Jun 2019 04:01:24 GMT",
      "x-amzn-requestid": "b4c9f1ff-d6c4-4f93-ad7b-e3a9047c787f",
      "content-length": "123",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [74]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_solution_version_response = personalize.describe_solution_version(
        solutionVersionArn = solution_version_arn
    )
    status = describe_solution_version_response["solutionVersion"]["status"]
    print("SolutionVersion: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

SolutionVersion: CREATE PENDING
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGR

## Metrics are not very meaningful for cold start solutions

In [75]:
get_solution_metrics_response = personalize.get_solution_metrics(
    solutionVersionArn = solution_version_arn
)

print(json.dumps(get_solution_metrics_response, indent=2))


{
  "solutionVersionArn": "arn:aws:personalize:us-east-1:261294318658:solution/DEMO-temporal-metadata-solution-62826/a965c800",
  "metrics": {
    "coverage": 0.0925,
    "mean_reciprocal_rank_at_25": 0.0,
    "normalized_discounted_cumulative_gain_at_10": 0.0,
    "normalized_discounted_cumulative_gain_at_25": 0.0,
    "normalized_discounted_cumulative_gain_at_5": 0.0,
    "precision_at_10": 0.0,
    "precision_at_25": 0.0,
    "precision_at_5": 0.0
  },
  "ResponseMetadata": {
    "RequestId": "ae8e287e-5320-40a4-a375-97f5c83a9186",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Sat, 01 Jun 2019 05:28:04 GMT",
      "x-amzn-requestid": "ae8e287e-5320-40a4-a375-97f5c83a9186",
      "content-length": "400",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


What happened here? Since we deleted all cold start items from the training set, the metrics are zero. 

An important lesson here is that metrics for cold start are hard to evaluate offline.

Let's look at what happens when we generate predictions.

## Create a campaign 

In [76]:
create_campaign_response = personalize.create_campaign(
    name = "DEMO-coldstart-campaign-"+suffix,
    solutionVersionArn = solution_version_arn,
    minProvisionedTPS = 2,    
)

campaign_arn = create_campaign_response['campaignArn']
print(json.dumps(create_campaign_response, indent=2))

{
  "campaignArn": "arn:aws:personalize:us-east-1:261294318658:campaign/DEMO-coldstart-campaign-62826",
  "ResponseMetadata": {
    "RequestId": "38a2dcb1-9667-4e3e-a211-4eee4e1c75ff",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Sat, 01 Jun 2019 05:28:57 GMT",
      "x-amzn-requestid": "38a2dcb1-9667-4e3e-a211-4eee4e1c75ff",
      "content-length": "99",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [77]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_campaign_response = personalize.describe_campaign(
        campaignArn = campaign_arn
    )
    status = describe_campaign_response["campaign"]["status"]
    print("Campaign: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

Campaign: CREATE IN_PROGRESS
Campaign: CREATE IN_PROGRESS
Campaign: CREATE IN_PROGRESS
Campaign: CREATE IN_PROGRESS
Campaign: CREATE IN_PROGRESS
Campaign: CREATE IN_PROGRESS
Campaign: CREATE IN_PROGRESS
Campaign: CREATE IN_PROGRESS
Campaign: CREATE IN_PROGRESS
Campaign: ACTIVE


In [78]:
# we had saved all the data before deleting the cold items
df = dfo.copy()

In [79]:
df.columns = ['USER_ID','ITEM_ID','EVENT_VALUE','TIMESTAMP']
df['EVENT_TYPE']='RATING'

In [80]:
from tqdm import tqdm_notebook
import numpy as np
from metrics import mean_reciprocal_rank, ndcg_at_k, precision_at_k
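
The `metrics` module imported above is a local file that this notebook does not show. For readers without it, the sketch below gives textbook definitions that match how the functions are called here (one binary relevance list per user). It is an assumption about what the module contains, not its actual source, and the real implementation may differ in details such as the NDCG normalization.

```python
import math


def mean_reciprocal_rank(r):
    """Reciprocal rank of the first relevant item in one ranked list (0.0 if none)."""
    for rank, rel in enumerate(r, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def precision_at_k(r, k):
    """Fraction of the top-k positions that hold a relevant item."""
    return sum(1 for rel in r[:k] if rel) / k


def dcg_at_k(r, k):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(r[:k], start=1))


def ndcg_at_k(r, k):
    """DCG divided by the DCG of the same relevance values in ideal order."""
    ideal = dcg_at_k(sorted(r, reverse=True), k)
    return dcg_at_k(r, k) / ideal if ideal > 0 else 0.0


# One user's recommendations: hits at ranks 2 and 4.
example = [0, 1, 0, 1, 0]
print(mean_reciprocal_rank(example), precision_at_k(example, 5), ndcg_at_k(example, 5))
```

Averaging these per-user values with `np.mean`, as the notebook does, gives the reported aggregate metrics.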

In [81]:
users = df['USER_ID'].unique()

## how often is the deleted/cold/new item in the actual items the user interacted with?

In [87]:
relevance = []
for user_id in  tqdm_notebook(users[:1000]):

    true_items = set(df[df['USER_ID']==user_id]['ITEM_ID'].values)

    rec_response = personalize_runtime.get_recommendations(
            campaignArn = campaign_arn,
            userId = str(user_id)
        )
    rec_items = [int(x['itemId']) for x in rec_response['itemList']]
    relevance.append([int(x in true_items) for x in rec_items])

HBox(children=(IntProgress(value=0, max=1000), HTML(value='')))

In [88]:
print('mean_reciprocal_rank', np.mean([mean_reciprocal_rank(r) for r in relevance]))
print('precision_at_5', np.mean([precision_at_k(r, 5) for r in relevance]))
print('precision_at_10', np.mean([precision_at_k(r, 10) for r in relevance]))
print('precision_at_25', np.mean([precision_at_k(r, 25) for r in relevance]))
print('normalized_discounted_cumulative_gain_at_5', np.mean([ndcg_at_k(r, 5) for r in relevance]))
print('normalized_discounted_cumulative_gain_at_10', np.mean([ndcg_at_k(r, 10) for r in relevance]))
print('normalized_discounted_cumulative_gain_at_25', np.mean([ndcg_at_k(r, 25) for r in relevance]))

mean_reciprocal_rank 0.040141421094823014
precision_at_5 0.0218
precision_at_10 0.0221
precision_at_25 0.02068
normalized_discounted_cumulative_gain_at_5 0.0395727651381263
normalized_discounted_cumulative_gain_at_10 0.06171099822996375
normalized_discounted_cumulative_gain_at_25 0.10481562385346689


### A baseline

As a baseline, consider recommending cold-start items picked uniformly at random. Since cold-start items in our dataset have no interaction history, this is a reasonable baseline: we don't have much information to go on.
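
For a uniform random draw from the cold items, the expected precision at any k is simply the fraction of cold items that the user actually interacted with. The sketch below shows this on toy IDs; the function name and the data are hypothetical.

```python
def expected_random_precision(true_items, cold_items):
    """Expected precision@k when recommending cold items uniformly at random."""
    cold = set(cold_items)
    # Each recommended slot is a hit with probability |true ∩ cold| / |cold|.
    return len(cold & set(true_items)) / len(cold)


cold_items_toy = range(100, 200)      # 100 cold items
true_items_toy = {1, 2, 3, 150, 199}  # the user touched 2 of them
baseline_precision = expected_random_precision(true_items_toy, cold_items_toy)
print(baseline_precision)
```

With thousands of cold items and only a handful relevant to each user, this fraction is small, which is why the random baseline scores low.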

In [89]:
len(rec_items)

25

In [91]:
relevance = []
for user_id in  tqdm_notebook(users[:1000]):

    true_items = set(df[df['USER_ID']==user_id]['ITEM_ID'].values)
    rec_items = np.random.permutation(cold_items)[:25]
    relevance.append([int(x in true_items) for x in rec_items])

HBox(children=(IntProgress(value=0, max=1000), HTML(value='')))

In [92]:
print('mean_reciprocal_rank', np.mean([mean_reciprocal_rank(r) for r in relevance]))
print('precision_at_5', np.mean([precision_at_k(r, 5) for r in relevance]))
print('precision_at_10', np.mean([precision_at_k(r, 10) for r in relevance]))
print('precision_at_25', np.mean([precision_at_k(r, 25) for r in relevance]))
print('normalized_discounted_cumulative_gain_at_5', np.mean([ndcg_at_k(r, 5) for r in relevance]))
print('normalized_discounted_cumulative_gain_at_10', np.mean([ndcg_at_k(r, 10) for r in relevance]))
print('normalized_discounted_cumulative_gain_at_25', np.mean([ndcg_at_k(r, 25) for r in relevance]))

mean_reciprocal_rank 0.014632145855508018
precision_at_5 0.0044
precision_at_10 0.0052
precision_at_25 0.00524
normalized_discounted_cumulative_gain_at_5 0.012052301347277762
normalized_discounted_cumulative_gain_at_10 0.02093126177923137
normalized_discounted_cumulative_gain_at_25 0.03604591854461026


We see that the cold start model is able to leverage some information from the metadata to gain lift over uninformed recommendation of cold/new items.

The lift is only about 3x to 5x, depending on the metric, because we have many cold items (6000) and not very informative metadata: just the genre of the movie.

Reducing the number of cold items or improving the quality of the metadata can both help.
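
The lift can be read directly off the two sets of printed metrics. The snippet below copies those numbers from the outputs above and divides them; only the dictionary names are introduced here.

```python
# Metrics printed for the cold start campaign (copied from the output above).
model = {
    "mean_reciprocal_rank": 0.040141421094823014,
    "precision_at_5": 0.0218,
    "precision_at_10": 0.0221,
    "precision_at_25": 0.02068,
    "ndcg_at_5": 0.0395727651381263,
    "ndcg_at_10": 0.06171099822996375,
    "ndcg_at_25": 0.10481562385346689,
}

# Metrics printed for the uniform random baseline.
baseline = {
    "mean_reciprocal_rank": 0.014632145855508018,
    "precision_at_5": 0.0044,
    "precision_at_10": 0.0052,
    "precision_at_25": 0.00524,
    "ndcg_at_5": 0.012052301347277762,
    "ndcg_at_10": 0.02093126177923137,
    "ndcg_at_25": 0.03604591854461026,
}

# Lift = model metric divided by baseline metric.
lift = {name: model[name] / baseline[name] for name in model}
for name, value in lift.items():
    print("{}: {:.1f}x".format(name, value))
```

Both runs cover only the first 1000 users, so these ratios are noisy estimates rather than precise figures.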

## a quick smell test

In [114]:
user_id = users[1]

true_items = set(df[df['USER_ID']==user_id]['ITEM_ID'].values)

In [119]:
# .loc needs a list-like indexer; recent pandas versions reject a set
movies2.loc[list(true_items)]

movieId,title,genres
3,Grumpier Old Men (1995),Comedy|Romance
260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi
2948,From Russia with Love (1963),Action|Adventure|Thriller
2951,"Fistful of Dollars, A (Per un pugno di dollari...",Action|Western
1544,"Lost World: Jurassic Park, The (1997)",Action|Adventure|Sci-Fi|Thriller
1673,Boogie Nights (1997),Drama
266,Legends of the Fall (1994),Drama|Romance|War|Western
908,North by Northwest (1959),Action|Adventure|Mystery|Romance|Thriller
2454,"Fly, The (1958)",Horror|Mystery|Sci-Fi
2455,"Fly, The (1986)",Drama|Horror|Sci-Fi|Thriller


In [116]:
rec_response = personalize_runtime.get_recommendations(
            campaignArn = campaign_arn,
            userId = str(user_id)
        )
rec_items = [int(x['itemId']) for x in rec_response['itemList']]

In [118]:
movies2.loc[rec_items]

movieId,title,genres
5895,Making Contact (a.k.a. Joey) (1985),Fantasy|Horror|Sci-Fi
55751,Decoys (2004),Comedy|Horror|Sci-Fi
63458,Critters 4 (1991),Comedy|Horror|Sci-Fi
66152,TerrorVision (1986),Comedy|Horror|Sci-Fi
7845,Tremors II: Aftershocks (1996),Comedy|Horror|Sci-Fi
44777,Evil Aliens (2005),Comedy|Horror|Sci-Fi
4412,"Thing with Two Heads, The (1972)",Comedy|Horror|Sci-Fi
196,Species (1995),Horror|Sci-Fi
71851,"I, Monster (1971)",Horror|Sci-Fi
4997,"Convent, The (2000)",Horror|Sci-Fi


We see that this user watched a lot of Horror/Sci-Fi movies, and the model is able to pick this up and recommend Horror/Sci-Fi movies from the cold items.