Building a Campaign (Recommender System) with Amazon Personalize in SageMaker Studio

This notebook walks through the steps of building a movie recommendation model with Amazon Personalize, using ratings data from the MovieLens project. The goal is to recommend movies that are relevant to a particular user.

In [1]:
# import the libraries needed to build the campaign
import boto3
import json
import pandas as pd
import numpy as np
import time

In [2]:
# create the SDK clients: 'personalize' manages resources (schemas, datasets, solutions, campaigns),
# 'personalize-runtime' serves recommendations from a deployed campaign
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')

Configure the data

Data is imported into Amazon Personalize through Amazon S3. Below we specify a bucket that we created within AWS for this exercise, download the ratings file from it, and load the file into a DataFrame for inspection.

In [4]:
s3 = boto3.resource('s3')
try:
    s3.Object('eugene-personalize-demo-s3', 'ratings.csv').download_file('ratings.csv')
    print('Successful: Data retrieved')
except Exception as e:
    print('error: ', e)
    
try:
    data = pd.read_csv('ratings.csv')
    print('successful')
except Exception as e:
    print('error: ', e)

Successful: Data retrieved
successful


In [5]:
data.head()

   USER_ID  ITEM_ID  RATING  TIMESTAMP
0        1        1     4.0  964982703
1        1        3     4.0  964981247
2        1        6     4.0  964982224
3        1       47     5.0  964983815
4        1       50     5.0  964982931


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   USER_ID    100836 non-null  int64  
 1   ITEM_ID    100836 non-null  int64  
 2   RATING     100836 non-null  float64
 3   TIMESTAMP  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


Adding a schema

The schema allows Amazon Personalize to parse the training dataset. It is a core part of how Personalize understands your data: the configuration defined below tells the service how to read each column of the CSV file.

In [None]:
# optional cleanup: delete a schema left over from an earlier run so the name can be reused
response = personalize.delete_schema(
    schemaArn="arn:aws:personalize:us-east-2:767498210768:schema/movie_ratings_schema"
)

In [7]:
schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "RATING",
            "type": "float"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}

Create_Schema = personalize.create_schema(
                name = "movie_ratings_schema", schema = json.dumps(schema)) 

Schema_arn = Create_Schema['schemaArn']
print(json.dumps(Create_Schema))

# Remember: the SageMaker execution role needs IAM permissions for Amazon Personalize
# before it can create a schema. That access was already granted for this notebook.

{"schemaArn": "arn:aws:personalize:us-east-2:767498210768:schema/movie_ratings_schema", "ResponseMetadata": {"RequestId": "88652006-1905-4136-97dd-2cda578f0606", "HTTPStatusCode": 200, "HTTPHeaders": {"content-type": "application/x-amz-json-1.1", "date": "Sun, 27 Dec 2020 23:46:14 GMT", "x-amzn-requestid": "88652006-1905-4136-97dd-2cda578f0606", "content-length": "86", "connection": "keep-alive"}, "RetryAttempts": 0}}
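Because the schema tells Personalize how to read the CSV, a quick local check that the CSV header matches the schema field names can catch a mismatch before any import is attempted. The sketch below is only an illustration and is not part of the Personalize API: `check_header_matches_schema` is a hypothetical helper, and the two-line CSV text is made up for the example.

```python
import csv
import io

def check_header_matches_schema(csv_text, schema):
    """Compare a CSV header to the schema's field names; return (missing, extra)."""
    header = next(csv.reader(io.StringIO(csv_text)))
    schema_fields = [field["name"] for field in schema["fields"]]
    missing = [name for name in schema_fields if name not in header]
    extra = [name for name in header if name not in schema_fields]
    return missing, extra

# Same field list as the interactions schema used in this notebook
example_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "RATING", "type": "float"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}

# Made-up CSV text, for illustration only
example_csv = "USER_ID,ITEM_ID,RATING,TIMESTAMP\n1,1,4.0,964982703\n"

missing, extra = check_header_matches_schema(example_csv, example_schema)
print("missing:", missing, "extra:", extra)
```

In the notebook, the same check could be run against the first line of `ratings.csv` before uploading it.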


Creating our dataset group

The largest grouping in Personalize is a dataset group. It isolates our data, event trackers, solutions, and campaigns, grouping together the resources that share a common collection of data.

In [8]:
Create_dataset_group = personalize.create_dataset_group(name = 'ratings_dataset_group')
dataset_group_arn = Create_dataset_group['datasetGroupArn']
print(json.dumps(Create_dataset_group))

{"datasetGroupArn": "arn:aws:personalize:us-east-2:767498210768:dataset-group/ratings_dataset_group", "ResponseMetadata": {"RequestId": "c190aab7-338c-4e96-bba8-72b64456aea8", "HTTPStatusCode": 200, "HTTPHeaders": {"content-type": "application/x-amz-json-1.1", "date": "Sun, 27 Dec 2020 23:46:18 GMT", "x-amzn-requestid": "c190aab7-338c-4e96-bba8-72b64456aea8", "content-length": "100", "connection": "keep-alive"}, "RetryAttempts": 0}}


Here we create the actual dataset inside the dataset group, attaching the schema defined earlier.

In [10]:
dataset_type = "INTERACTIONS"
create_dataset = personalize.create_dataset(
    name = "movie_ratings_dataset",
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = Schema_arn
)

dataset_arn = create_dataset['datasetArn']
print(json.dumps(create_dataset))

{"datasetArn": "arn:aws:personalize:us-east-2:767498210768:dataset/ratings_dataset_group/INTERACTIONS", "ResponseMetadata": {"RequestId": "0e28848a-9cc7-480a-8739-f89bb4bcd3b6", "HTTPStatusCode": 200, "HTTPHeaders": {"content-type": "application/x-amz-json-1.1", "date": "Sun, 27 Dec 2020 23:48:42 GMT", "x-amzn-requestid": "0e28848a-9cc7-480a-8739-f89bb4bcd3b6", "content-length": "102", "connection": "keep-alive"}, "RetryAttempts": 0}}


Import our data from S3 into the newly created movie ratings dataset in Amazon Personalize

In [11]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "ratings-data",
    datasetArn = dataset_arn,
    dataSource = {
        # build the S3 URI from the bucket name and object key
        "dataLocation": "s3://{}/{}".format('eugene-personalize-demo-s3', 'ratings.csv')
    },
    roleArn = 'arn:aws:iam::767498210768:role/Eugene-personalize'
)

dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response))

{"datasetImportJobArn": "arn:aws:personalize:us-east-2:767498210768:dataset-import-job/ratings-data", "ResponseMetadata": {"RequestId": "ce2d916b-ddaf-44f1-9784-6829977cf2fe", "HTTPStatusCode": 200, "HTTPHeaders": {"content-type": "application/x-amz-json-1.1", "date": "Sun, 27 Dec 2020 23:49:29 GMT", "x-amzn-requestid": "ce2d916b-ddaf-44f1-9784-6829977cf2fe", "content-length": "100", "connection": "keep-alive"}, "RetryAttempts": 0}}
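The import job runs asynchronously, and the CreateSolution call later in this notebook fails with InvalidInputException when the import has not completed successfully. A polling loop in the same style as the ones used for the solution version and the campaign can block until the job finishes. The sketch below is one possible way to write it: `wait_for_status` is a hypothetical helper, and the stub at the bottom only simulates a sequence of statuses so the loop can run without AWS.

```python
import time

def wait_for_status(get_status, label, poll_seconds=60, timeout_seconds=3 * 60 * 60):
    """Poll get_status() until it returns ACTIVE or CREATE FAILED, or the timeout passes."""
    deadline = time.time() + timeout_seconds
    status = None
    while time.time() < deadline:
        status = get_status()
        print("{}: {}".format(label, status))
        if status in ("ACTIVE", "CREATE FAILED"):
            break
        time.sleep(poll_seconds)
    return status

# Stand-in for the real describe call, for illustration only
_fake_statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
final_status = wait_for_status(lambda: next(_fake_statuses), "DatasetImportJob", poll_seconds=0)
```

With the real client, the status function would be something like `lambda: personalize.describe_dataset_import_job(datasetImportJobArn=dataset_import_job_arn)["datasetImportJob"]["status"]`, and the solution should only be created once the returned status is ACTIVE.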



Creating our solution and solution version

In Amazon Personalize, a trained model is called a solution version. Each solution can have many versions, each corresponding to the data that was available when that version was trained.

To begin, we list all the supported recipes. A recipe is an algorithm that has not yet been trained on our data. After listing them, we select one and use it to build our model.

In [12]:
#listing out the available recipes
list_recipes_response = personalize.list_recipes()
list_recipes_response

{'recipes': [{'name': 'aws-hrnn',
   'recipeArn': 'arn:aws:personalize:::recipe/aws-hrnn',
   'status': 'ACTIVE',
   'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),
   'lastUpdatedDateTime': datetime.datetime(2020, 12, 18, 0, 38, 30, 252000, tzinfo=tzlocal())},
  {'name': 'aws-hrnn-coldstart',
   'recipeArn': 'arn:aws:personalize:::recipe/aws-hrnn-coldstart',
   'status': 'ACTIVE',
   'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),
   'lastUpdatedDateTime': datetime.datetime(2020, 12, 18, 0, 38, 30, 252000, tzinfo=tzlocal())},
  {'name': 'aws-hrnn-metadata',
   'recipeArn': 'arn:aws:personalize:::recipe/aws-hrnn-metadata',
   'status': 'ACTIVE',
   'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),
   'lastUpdatedDateTime': datetime.datetime(2020, 12, 18, 0, 38, 30, 252000, tzinfo=tzlocal())},
  {'name': 'aws-personalized-ranking',
   'recipeArn': 'arn:aws:personalize:::recipe/aws-personalized-ranking',
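
Once the recipes are listed, the ARN of the chosen recipe can be looked up by name instead of being typed by hand. The following sketch assumes a response shaped like the `list_recipes` output shown here; `find_recipe_arn` is a hypothetical helper, and the sample dictionary is trimmed to the two fields it needs.

```python
def find_recipe_arn(list_recipes_response, recipe_name):
    """Return the ARN of the recipe with the given name, or None if it is not listed."""
    for recipe in list_recipes_response.get("recipes", []):
        if recipe.get("name") == recipe_name:
            return recipe.get("recipeArn")
    return None

# Trimmed sample with the same shape as the list_recipes response
sample_response = {
    "recipes": [
        {"name": "aws-hrnn",
         "recipeArn": "arn:aws:personalize:::recipe/aws-hrnn"},
        {"name": "aws-personalized-ranking",
         "recipeArn": "arn:aws:personalize:::recipe/aws-personalized-ranking"},
    ]
}

print(find_recipe_arn(sample_response, "aws-personalized-ranking"))
```

The sketch only inspects a single response; if the service returns a `nextToken`, additional pages would have to be fetched before concluding that a recipe is missing.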
  

In [13]:
#selecting the recipe to use for our solution
recipe_arn = 'arn:aws:personalize:::recipe/aws-user-personalization'
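# create a solution; this call raises InvalidInputException until the
# dataset import job for the INTERACTIONS dataset has completed successfully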
create_solution_response = personalize.create_solution(
    name = "movie-user-personalization",
    datasetGroupArn = dataset_group_arn,
    recipeArn = recipe_arn
)
solution_arn = create_solution_response['solutionArn']
print(json.dumps(create_solution_response, indent=2))

#creating a version of solution
create_solution_version_response = personalize.create_solution_version(
    solutionArn = solution_arn
)
solution_version_arn = create_solution_version_response['solutionVersionArn']
print(json.dumps(create_solution_version_response, indent=2))

InvalidInputException: An error occurred (InvalidInputException) when calling the CreateSolution operation: Dataset group must contain an INTERACTIONS dataset with successfully completed dataset import job or EVENT_INTERACTIONS

In [None]:
# poll once a minute until the solution version is ACTIVE (or creation fails), for up to 3 hours
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_solution_version_response = personalize.describe_solution_version(
        solutionVersionArn = solution_version_arn
    )
    status = describe_solution_version_response["solutionVersion"]["status"]
    print("SolutionVersion: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS


In [1]:
#getting the metrics for our solution version
get_solution_metrics_response = personalize.get_solution_metrics(
    solutionVersionArn = solution_version_arn
)

print(json.dumps(get_solution_metrics_response))

NameError: name 'personalize' is not defined

In [None]:
#creating a campaign to deploy our solution 
create_campaign_response = personalize.create_campaign(
    name = "movie-campaign",
    solutionVersionArn = solution_version_arn,
    minProvisionedTPS = 1,
    campaignConfig = {
        "itemExplorationConfig": {
            "explorationWeight": "0.5"
        }
    }
)

campaign_arn = create_campaign_response['campaignArn']
print(json.dumps(create_campaign_response, indent=2))

In [None]:
# poll once a minute until the campaign is ACTIVE (i.e., ready to serve recommendations) or creation fails
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_campaign_response = personalize.describe_campaign(
        campaignArn = campaign_arn
    )
    status = describe_campaign_response["campaign"]["status"]
    print("Campaign: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

In [None]:
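# Getting sample recommendations
# After the campaign is active, we are ready to get recommendations. First we select a random user from the dataset.
# Then we create a helper function that looks up movie titles, so recommendations show titles instead of just IDs.

# Getting a random user. The frame has four columns, so unpacking a row into three
# names would fail; select by column name and cast back to int (the mixed-dtype row is float).
sample_row = data.sample().iloc[0]
user_id, item_id = int(sample_row['USER_ID']), int(sample_row['ITEM_ID'])
print("USER: {}".format(user_id))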

#importing data from s3 bucket
s3.Object('eugene-personalize-demo-s3', 'movies.csv').download_file('movies.csv')
# First load items into memory
items = pd.read_csv('movies.csv', index_col='ITEM_ID')

#defining a function that obtains the movie title name
def get_movie_title(movie_id):
    """
    Takes in an ID, returns a title
    """
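    # items is indexed by ITEM_ID, so look up by label; MovieLens IDs are not
    # contiguous, so a positional lookup (iloc) would return the wrong row
    return items.loc[int(movie_id), 'TITLE']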

In [None]:
# call get_recommendations for the user ID selected in the previous cell
get_recommendations_response = personalize_runtime.get_recommendations(
    campaignArn = campaign_arn,
    userId = str(user_id),
)
# Update DF rendering
pd.set_option('display.max_rows', 30)

print("Recommendations for user: ", user_id)

item_list = get_recommendations_response['itemList']

recommendation_list = []

for item in item_list:
    title = get_movie_title(item['itemId'])
    recommendation_list.append(title)
    
recommendations_df = pd.DataFrame(recommendation_list, columns = ['OriginalRecs'])
recommendations_df
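
The loop above turns the `itemList` in the response into movie titles. As a self-contained illustration of that step, the sketch below uses a plain dictionary in place of the `items` DataFrame and a hand-written response in place of a live call. The title mapping, the sample response, and the `titles_from_response` helper are all assumptions made for the example, and unknown IDs are reported rather than raising an error.

```python
def titles_from_response(response, id_to_title):
    """Map each recommended itemId to a title, with a placeholder for unknown IDs."""
    titles = []
    for item in response.get("itemList", []):
        item_id = item["itemId"]  # item IDs come back as strings
        titles.append(id_to_title.get(int(item_id), "unknown item {}".format(item_id)))
    return titles

# Hand-written stand-ins for movies.csv and for a get_recommendations response
id_to_title = {1: "Toy Story (1995)", 47: "Seven (a.k.a. Se7en) (1995)"}
sample_response = {"itemList": [{"itemId": "47"}, {"itemId": "1"}, {"itemId": "999999"}]}

recommended_titles = titles_from_response(sample_response, id_to_title)
for title in recommended_titles:
    print(title)
```

Keying the lookup on the ID itself, rather than on row position, keeps the mapping correct when IDs are sparse.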