# Creating and Evaluating Solutions (Python SDK)<a class="anchor" id="top"></a>

In this notebook, we'll train some models in Amazon Personalize and review their metrics - using [Boto3, the AWS SDK for Python](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html).

> For an **alternative** approach to the same steps through the [Amazon Personalize console UI](https://console.aws.amazon.com/personalize/home) - see Notebook [03a_Creating_and_Evaluating_Solutions_(Console).ipynb](03a_Creating_and_Evaluating_Solutions_(Console).ipynb) instead.

⚠️ You'll need to have already run the previous notebooks in this series to set up your environment and to prepare and import your data.

Before we start, we'll:

- Import the libraries this notebook will use
- Load the variables saved from previous steps
- Connect to the relevant AWS services as we have before for IAM and S3

In [None]:
# Python Built-Ins:
from datetime import datetime
import json

# External Dependencies:
import boto3  # AWS SDK for Python

# Local Dependencies:
import util  # Small tool to print progress spinner

# Reload saved variables:
%store -r

# Connect to AWS services:
personalize = boto3.client("personalize")

## Introduction

As discussed in our [data preparation notebook](01_Preparing_Input_Data.ipynb), different **recipe types** in Amazon Personalize look to solve **different tasks**.

In this notebook, you will train three **"solutions"** (models) for different use-cases:

1. A `User-Personalization` solution for recommending items relevant to a particular user
1. A `SIMS` solution to recommend *similar items* for a given item ID
1. A `Personalized-Ranking` solution which, given a user and a collection of possible items, ranks the items in order of decreasing relevance

The `Popularity-Count` recipe (which just ranks items by popularity) may also be useful as a **baseline** for understanding how the metrics of trained solutions compare against a trivial solution - but we won't specifically cover it here.

Like most objects in AWS, each *recipe* (algorithm) offered by Amazon Personalize is identified by an Amazon Resource Name (ARN). We can list the available recipes through the SDK:

In [None]:
personalize.list_recipes()
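
The raw response is fairly verbose. As a minimal sketch, the helper below (`summarize_recipes` is a hypothetical name, not part of Boto3) reduces a `list_recipes()` response to sorted name/ARN pairs. The example dictionary is hand-written in the shape of the response, not real service output.

```python
def summarize_recipes(list_recipes_response):
    """Return sorted (name, ARN) pairs for the ACTIVE recipes in a response dict."""
    recipes = list_recipes_response.get("recipes", [])
    return sorted(
        (recipe["name"], recipe["recipeArn"])
        for recipe in recipes
        if recipe.get("status", "ACTIVE") == "ACTIVE"
    )


# Hand-written example in the shape of a ListRecipes response (assumption, not real output):
example_response = {
    "recipes": [
        {
            "name": "aws-user-personalization",
            "recipeArn": "arn:aws:personalize:::recipe/aws-user-personalization",
            "status": "ACTIVE",
        },
        {
            "name": "aws-sims",
            "recipeArn": "arn:aws:personalize:::recipe/aws-sims",
            "status": "ACTIVE",
        },
    ]
}

for name, arn in summarize_recipes(example_response):
    print(f"{name:30s} {arn}")
```

In the notebook you could call `summarize_recipes(personalize.list_recipes())`. Note that the response may be paginated, in which case this sketch only covers the first page.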

## Create Solutions

In Amazon Personalize a model is conceptually called a **solution**, and an actual trained model is a **solution version** - reflecting that the model can be re-trained with new updated data.

Through the SDK, we'll need to explicitly create the "solution" **and** then kick off training an actual solution version, to train a model.
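
The two-step flow can be sketched as a single helper. This is only an illustration: `create_and_train` is a hypothetical name, and `client` is assumed to behave like `boto3.client("personalize")`. The cells in the rest of this notebook perform the same two calls one at a time.

```python
def create_and_train(client, name, dataset_group_arn, recipe_arn, **solution_kwargs):
    """Create a solution, then start training its first solution version.

    `client` is assumed to expose create_solution() and create_solution_version()
    like the Boto3 Personalize client. Returns (solution_arn, solution_version_arn).
    """
    # Step 1: create the solution "construct" (no training happens yet)
    solution_resp = client.create_solution(
        name=name,
        datasetGroupArn=dataset_group_arn,
        recipeArn=recipe_arn,
        **solution_kwargs,
    )
    solution_arn = solution_resp["solutionArn"]

    # Step 2: create a solution version, which kicks off training in the background
    version_resp = client.create_solution_version(solutionArn=solution_arn)
    return solution_arn, version_resp["solutionVersionArn"]
```

Extra keyword arguments such as `eventType` or `solutionConfig` are passed straight through to `create_solution`.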

### User Personalization

The User-Personalization (`aws-user-personalization`) recipe is optimized for all user->items recommendation scenarios. When recommending items, it uses automatic item exploration.

With automatic exploration, Amazon Personalize automatically tests different item recommendations, learns from how users interact with these recommended items, and boosts recommendations for items that drive better engagement and conversion. This improves item discovery and engagement when you have a fast-changing catalog, or when new items, such as news articles or promotions, are more relevant to users when fresh.

You can balance how much to explore (where items with less interaction data or lower known relevance are recommended more frequently) against how much to exploit (where recommendations are based on what we already know about relevance). Amazon Personalize automatically adjusts future recommendations based on implicit user feedback.
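
As a rough sketch of where this balance is expressed: to our understanding, exploration is configured when a campaign is created (a later step in this series), through an `itemExplorationConfig` map with the keys `explorationWeight` and `explorationItemAgeCutOff`. Treat those key names as an assumption to verify against the current API reference. The helper name and the default values below are illustrative only.

```python
def item_exploration_config(exploration_weight=0.3, item_age_cut_off_days=30):
    """Build a campaignConfig-style dict for item exploration (illustrative values).

    exploration_weight: 0.0 means pure exploitation, 1.0 means maximum exploration.
    item_age_cut_off_days: how recently an item must have been created to be explored.
    """
    if not 0.0 <= exploration_weight <= 1.0:
        raise ValueError("exploration_weight must be between 0 and 1")
    if item_age_cut_off_days <= 0:
        raise ValueError("item_age_cut_off_days must be positive")
    # The API expects string values in this map (assumption, see lead-in)
    return {
        "itemExplorationConfig": {
            "explorationWeight": str(exploration_weight),
            "explorationItemAgeCutOff": str(item_age_cut_off_days),
        }
    }


print(item_exploration_config(0.5, 7))
```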

First, select the recipe by finding the ARN (see the list of recipes above, or the [Amazon Personalize developer guide](https://docs.aws.amazon.com/personalize/latest/dg/working-with-predefined-recipes.html)):

In [None]:
up_recipe_arn = "arn:aws:personalize:::recipe/aws-user-personalization"

Then, create a solution construct in your **dataset group** using the recipe.

(Again, remember that this creates a solution construct but does not actually start **training a version** of the solution).

In [None]:
up_create_solution_resp = personalize.create_solution(
    name="personalize-poc-userpersonalization",
    datasetGroupArn=dataset_group_arn,
    recipeArn=up_recipe_arn,
    eventType="review",
    solutionConfig={
        "eventValueThreshold": "3",  # (reviews 3 stars or more)
    },
)

up_solution_arn = up_create_solution_resp["solutionArn"]
%store up_solution_arn
print(json.dumps(up_create_solution_resp, indent=2))
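
To make the effect of `eventType` and `eventValueThreshold` concrete, here is a small local illustration of the filtering they describe: only events of the chosen type whose value is at least the threshold are used for training. It assumes the interactions carry `EVENT_TYPE` and `EVENT_VALUE` fields; the sample rows are made up and the function is a hypothetical stand-in, not how Personalize is implemented.

```python
def filter_training_events(interactions, event_type, event_value_threshold):
    """Keep interactions of one event type whose value meets the threshold."""
    return [
        row
        for row in interactions
        if row["EVENT_TYPE"] == event_type
        and float(row["EVENT_VALUE"]) >= event_value_threshold
    ]


# Made-up sample rows (assumption: EVENT_VALUE holds the star rating)
sample = [
    {"USER_ID": "1", "ITEM_ID": "10", "EVENT_TYPE": "review", "EVENT_VALUE": 5},
    {"USER_ID": "1", "ITEM_ID": "11", "EVENT_TYPE": "review", "EVENT_VALUE": 2},
    {"USER_ID": "2", "ITEM_ID": "10", "EVENT_TYPE": "review", "EVENT_VALUE": 3},
]

kept = filter_training_events(sample, "review", 3)
print([row["ITEM_ID"] for row in kept])  # → ['10', '10']
```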

Next, create a **version** of the solution to kick off model training:

In [None]:
up_create_solution_version_resp = personalize.create_solution_version(
    solutionArn=up_solution_arn,
)

up_solution_version_arn = up_create_solution_version_resp["solutionVersionArn"]
%store up_solution_version_arn
print(json.dumps(up_create_solution_version_resp, indent=2))

> ⏰ This training is kicked off *in the background* and can take a while to complete - upwards of 25 minutes and typically around 90 minutes for this recipe on our sample dataset.

Rather than waiting here, we'll start our other solutions training first and then wait for all together:

### SIMS

SIMS is one of the oldest algorithms used within Amazon for recommendation systems. A core use case for it is when you have one item and you want to recommend items that have been interacted with in similar ways over your entire user base. This means the result is not personalized per user. Sometimes this leads to recommending mostly popular items, so there is a hyperparameter that can be tweaked which will reduce the popular items in your results. 

For our use case, using the MovieLens data, let's assume we pick a particular movie. We can then use SIMS to recommend other movies based on the interaction behavior of the entire user base. The results are not personalized per user; instead, they differ depending on the movie we chose as our input.

Just like last time, we start by selecting the recipe.

In [None]:
SIMS_recipe_arn = "arn:aws:personalize:::recipe/aws-sims"

...then creating the *solution*:

In [None]:
sims_create_solution_response = personalize.create_solution(
    name="personalize-poc-sims",
    datasetGroupArn=dataset_group_arn,
    recipeArn=SIMS_recipe_arn,
    eventType="review",
    solutionConfig={
        "eventValueThreshold": "3",  # (reviews 3 stars or more)
    },
)

sims_solution_arn = sims_create_solution_response["solutionArn"]
%store sims_solution_arn
print(json.dumps(sims_create_solution_response, indent=2))

...and finally creating a **solution version** to start the training process:

In [None]:
sims_create_solution_version_response = personalize.create_solution_version(
    solutionArn=sims_solution_arn,
)

sims_solution_version_arn = sims_create_solution_version_response["solutionVersionArn"]
%store sims_solution_version_arn
print(json.dumps(sims_create_solution_version_response, indent=2))

> ⏰ This training is kicked off *in the background* and can take a while to complete - upwards of 25 minutes and typically around 35 minutes for this recipe on our sample dataset.

Rather than waiting here, we'll start our other solutions training first and then wait for all together:

### Personalized Ranking

Personalized Ranking is an interesting application of HRNN. Instead of just recommending what is most probable for the user in question, this algorithm takes in both a user and a list of items. The items are then returned in order of most probable relevance for the user. This is useful for filtering on categories for which you do not have the item metadata needed to create a filter, or when you have a broad collection that you would like better ordered for a particular user.

For our use case, using the MovieLens data, we could imagine that a VOD application may want to create a shelf of comic book movies, or movies by a specific director. We most likely have these lists already, based on the title metadata we have. We would use personalized ranking to re-order the list of movies for each user, based on their previous tagging history.

Just like last time, we start by selecting the recipe.

In [None]:
rerank_recipe_arn = "arn:aws:personalize:::recipe/aws-personalized-ranking"

...then creating the *solution*:

In [None]:
rerank_create_solution_response = personalize.create_solution(
    name="personalize-poc-rerank",
    datasetGroupArn=dataset_group_arn,
    recipeArn=rerank_recipe_arn,
    eventType="review",
    solutionConfig={
        "eventValueThreshold": "3",  # (reviews 3 stars or more)
    },
)

rerank_solution_arn = rerank_create_solution_response["solutionArn"]
%store rerank_solution_arn
print(json.dumps(rerank_create_solution_response, indent=2))

...and finally creating a **solution version** to start the training process:

In [None]:
rerank_create_solution_version_response = personalize.create_solution_version(
    solutionArn=rerank_solution_arn,
)

rerank_solution_version_arn = rerank_create_solution_version_response["solutionVersionArn"]
%store rerank_solution_version_arn
print(json.dumps(rerank_create_solution_version_response, indent=2))

> ⏰ This training is kicked off *in the background* and can take a while to complete - upwards of 25 minutes and typically around 45 minutes for this recipe on our sample dataset.

## Hyperparameter Tuning *(Information Only)*

Personalize offers the option of running hyperparameter tuning when creating a solution. Because of the additional computation required to perform hyperparameter tuning, this feature is turned off by default. Therefore, the solutions we created above will simply use the default values of the hyperparameters for each recipe. For more information about hyperparameter tuning, see the [documentation](https://docs.aws.amazon.com/personalize/latest/dg/customizing-solution-config-hpo.html).

If you have settled on the correct recipe to use, and are ready to run hyperparameter tuning, the following code shows how you would do so, using SIMS as an example.

```python
sims_create_solution_response = personalize.create_solution(
    name="personalize-poc-sims-hpo",
    datasetGroupArn=dataset_group_arn,
    recipeArn=SIMS_recipe_arn,
    performHPO=True,
)

sims_solution_arn = sims_create_solution_response['solutionArn']
print(json.dumps(sims_create_solution_response, indent=2))
```

If you already know the values you want to use for a specific hyperparameter, you can also set this value when you create the solution. The code below shows how you could set the value for the `popularity_discount_factor` for the SIMS recipe.

```python
sims_create_solution_response = personalize.create_solution(
    name="personalize-poc-sims-set-hp",
    datasetGroupArn=dataset_group_arn,
    recipeArn=SIMS_recipe_arn,
    solutionConfig={
        'algorithmHyperParameters': {
            'popularity_discount_factor': '0.7'
        }
    }
)

sims_solution_arn = sims_create_solution_response['solutionArn']
print(json.dumps(sims_create_solution_response, indent=2))
```
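
When tuning is enabled, you may also want to bound how much work the tuning job does. The sketch below builds a `solutionConfig` with an `hpoConfig` section for that purpose. The helper name, the job counts and the search range for `popularity_discount_factor` are illustrative assumptions; check the recipe documentation for the ranges it actually supports before using anything like this.

```python
def sims_hpo_solution_config(max_jobs=10, max_parallel_jobs=2):
    """Build an illustrative solutionConfig that bounds a hyperparameter tuning job."""
    if max_parallel_jobs > max_jobs:
        raise ValueError("max_parallel_jobs cannot exceed max_jobs")
    return {
        "hpoConfig": {
            # Limits on the tuning job itself (passed as strings)
            "hpoResourceConfig": {
                "maxNumberOfTrainingJobs": str(max_jobs),
                "maxParallelTrainingJobs": str(max_parallel_jobs),
            },
            # Search space: the range below is an assumption for illustration
            "algorithmHyperParameterRanges": {
                "continuousHyperParameterRanges": [
                    {
                        "name": "popularity_discount_factor",
                        "minValue": 0.0,
                        "maxValue": 1.0,
                    },
                ],
            },
        },
    }


print(sims_hpo_solution_config(max_jobs=4, max_parallel_jobs=2))
```

The resulting dictionary would be passed as `solutionConfig=...` alongside `performHPO=True` in a `create_solution` call like the ones above.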

## Wait for Training to Complete

You can check the status of training solution versions in the [Amazon Personalize console UI](https://console.aws.amazon.com/personalize/home):

* In another browser tab you should already have the AWS Console up from opening this notebook instance. 
* Switch to that tab and search at the top for the service `Personalize`, then go to that service page. 
* Click `View dataset groups`.
* Click the name of your dataset group, most likely something with POC in the name.
* Click `Solutions and recipes`.
* You will now see a list of all of the solutions you created above, including a column with the status of the solution versions. Once the status is `Active`, your solution is ready to be reviewed and can also be deployed.

...Or simply run the cell below to poll and wait for all solutions to train:

In [None]:
waiting_arns = [
    up_solution_version_arn,
    sims_solution_version_arn,
    rerank_solution_version_arn,
]

def are_solutions_finished(descriptions):
    """Drop newly trained versions from waiting_arns; return True once none remain."""
    for desc in descriptions:
        status = desc["solutionVersion"]["status"]
        arn = desc["solutionVersion"]["solutionVersionArn"]
        if status == "ACTIVE":
            print(f"\nTrained {arn}")
            waiting_arns.remove(arn)
        elif "FAILED" in status:
            raise ValueError(f"Build failed!\n{desc}")
    return len(waiting_arns) == 0

util.progress.polling_spinner(
    # Build a concrete list on each poll: a lazy map() over waiting_arns would skip
    # entries when are_solutions_finished() removes items from it mid-iteration.
    fn_poll_result=lambda: [
        personalize.describe_solution_version(solutionVersionArn=arn)
        for arn in waiting_arns
    ],
    fn_is_finished=are_solutions_finished,
    fn_stringify_result=lambda d: f"{len(waiting_arns)} models in progress",
    poll_secs=60,
    timeout_secs=4*60*60,  # Max 4 hours
)
print("All solutions ready")
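
If you don't have the `util` helper to hand, the same polling idea can be sketched with the standard library alone. This is an illustrative alternative, not the notebook's own implementation: `wait_for_solution_versions` is a hypothetical name, and `describe_fn` is assumed to behave like `personalize.describe_solution_version`.

```python
import time


def wait_for_solution_versions(
    describe_fn, arns, poll_secs=60, timeout_secs=4 * 60 * 60, sleep_fn=time.sleep
):
    """Poll until every solution version is ACTIVE; return the seconds waited.

    Raises RuntimeError if any version reports a FAILED status, and
    TimeoutError if the versions are not all ACTIVE within timeout_secs.
    """
    pending = set(arns)
    waited = 0
    while pending:
        # sorted() makes a copy, so removing from `pending` inside the loop is safe
        for arn in sorted(pending):
            status = describe_fn(solutionVersionArn=arn)["solutionVersion"]["status"]
            if status == "ACTIVE":
                pending.discard(arn)
            elif "FAILED" in status:
                raise RuntimeError(f"Training failed for {arn}: {status}")
        if not pending:
            break
        if waited >= timeout_secs:
            raise TimeoutError(f"Still waiting on {len(pending)} solution version(s)")
        sleep_fn(poll_secs)
        waited += poll_secs
    return waited
```

In the notebook this could be called as `wait_for_solution_versions(personalize.describe_solution_version, [up_solution_version_arn, sims_solution_version_arn, rerank_solution_version_arn])`. The `sleep_fn` parameter exists only so the loop can be exercised without real waiting.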

## Evaluate Solution Versions

It should not take more than ~90 minutes to train all the solutions from this notebook. While training is in progress, we recommend taking the time to read up on the various algorithms (recipes) and their behavior in detail. This is also a good time to consider alternatives to how the data was fed into the system and what kind of results you expect to see.

When the solutions finish creating, the next step is to obtain the evaluation metrics. Personalize calculates these metrics based on a subset of the training data. The image below illustrates how Personalize splits the data. Given 10 users, with 10 interactions each (a circle represents an interaction), the interactions are ordered from oldest to newest based on the timestamp. Personalize uses all of the interaction data from 90% of the users (blue circles) to train the solution version, and the remaining 10% for evaluation. For each of the users in the remaining 10%, 90% of their interaction data (green circles) is used as input for the call to the trained model. The remaining 10% of their data (orange circle) is compared to the output produced by the model and used to calculate the evaluation metrics.

![personalize metrics](static/imgs/personalize_metrics.png)
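
The split described above can be mimicked locally as follows. This is a toy sketch under stated assumptions: the text does not say how the held-out users are chosen, so a seeded random choice is used here, and `split_for_evaluation` is a hypothetical helper rather than anything Personalize exposes.

```python
import random


def split_for_evaluation(
    interactions_by_user, holdout_user_fraction=0.1, holdout_event_fraction=0.1, seed=0
):
    """Split {user: [(timestamp, item_id), ...]} into train / eval-input / eval-truth."""
    users = sorted(interactions_by_user)
    random.Random(seed).shuffle(users)  # assumption: held-out users picked at random
    n_holdout = max(1, round(len(users) * holdout_user_fraction))
    holdout_users = set(users[:n_holdout])

    train, eval_input, eval_truth = {}, {}, {}
    for user, events in interactions_by_user.items():
        ordered = sorted(events, key=lambda event: event[0])  # oldest to newest
        if user not in holdout_users:
            train[user] = ordered  # all interactions of the training users
            continue
        n_truth = max(1, round(len(ordered) * holdout_event_fraction))
        eval_input[user] = ordered[:-n_truth]  # oldest ~90%: fed to the trained model
        eval_truth[user] = ordered[-n_truth:]  # newest ~10%: compared with its output
    return train, eval_input, eval_truth


# 10 users with 10 interactions each, as in the illustration
data = {f"u{i}": [(t, f"item{t}") for t in range(10)] for i in range(10)}
train, eval_input, eval_truth = split_for_evaluation(data)
print(len(train), "training users;", len(eval_input), "evaluation user(s)")
```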

We recommend reading [the documentation](https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html) to understand the metrics, but we have also copied parts of the documentation below for convenience.

You need to understand the following terms regarding evaluation in Personalize:

* *Relevant recommendation* refers to a recommendation that matches a value in the testing data for the particular user.
* *Rank* refers to the position of a recommended item in the list of recommendations. Position 1 (the top of the list) is presumed to be the most relevant to the user.
* *Query* refers to the internal equivalent of a GetRecommendations call.

The metrics produced by Personalize are:
* **coverage**: The proportion of unique recommended items from all queries out of the total number of unique items in the training data (includes both the Items and Interactions datasets).
* **mean_reciprocal_rank_at_25**: The [mean of the reciprocal ranks](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) of the first relevant recommendation out of the top 25 recommendations over all queries. This metric is appropriate if you're interested in the single highest ranked recommendation.
* **normalized_discounted_cumulative_gain_at_K**: Discounted gain assumes that recommendations lower on a list of recommendations are less relevant than higher recommendations. Therefore, each recommendation is discounted (given a lower weight) by a factor dependent on its position. To produce the [cumulative discounted gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) (DCG) at K, each relevant discounted recommendation in the top K recommendations is summed together. The normalized discounted cumulative gain (NDCG) is the DCG divided by the ideal DCG, so that NDCG is between 0 and 1. (The ideal DCG is where the top K recommendations are sorted by relevance.) Amazon Personalize uses a weighting factor of 1/log(1 + position), where the top of the list is position 1. This metric rewards relevant items that appear near the top of the list, because the top of a list usually draws more attention.
* **precision_at_K**: The number of relevant recommendations out of the top K recommendations divided by K. This metric rewards precise recommendation of the relevant items.
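
To make these definitions concrete, here is a small sketch that computes them by hand. It assumes binary relevance (an item is either in the held-out data or not) and uses base-2 logarithms; the base cancels out in the NDCG ratio. The function names and the toy data are illustrative, and this is not the code Personalize runs internally.

```python
import math


def precision_at_k(recommended, relevant, k):
    """Relevant items among the top k recommendations, divided by k."""
    return sum(1 for item in recommended[:k] if item in relevant) / k


def reciprocal_rank_at_k(recommended, relevant, k):
    """1 / rank of the first relevant item in the top k, or 0.0 if there is none."""
    for position, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(recommended, relevant, k):
    """DCG with weights 1/log(1 + position), divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(1 + position)
        for position, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))  # ideal list: all relevant items ranked first
    idcg = sum(1.0 / math.log2(1 + position) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def coverage(recommendation_lists, catalog_items):
    """Share of unique catalog items that appear in at least one recommendation list."""
    catalog = set(catalog_items)
    recommended = set().union(*recommendation_lists)
    return len(recommended & catalog) / len(catalog)


recommended = ["a", "b", "c", "d", "e"]  # one query's ranked recommendations
relevant = {"b", "e"}                    # that user's held-out items

print(precision_at_k(recommended, relevant, 5))        # → 0.4
print(reciprocal_rank_at_k(recommended, relevant, 5))  # → 0.5
print(round(ndcg_at_k(recommended, relevant, 5), 3))   # → 0.624
print(coverage([["a", "b"], ["b", "c"]], "abcdef"))    # → 0.5
```

The mean reciprocal rank is then the average of `reciprocal_rank_at_k` over all queries.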

Let's take a look at the evaluation metrics for each of the solutions produced in this notebook. *Please note, your results might differ from the results described in the text of this notebook, due to the quality of the MovieLens dataset.*

### User Personalization metrics

First, retrieve the evaluation metrics for the User Personalization solution version.

In [None]:
up_solution_metrics_response = personalize.get_solution_metrics(
    solutionVersionArn=up_solution_version_arn,
)

print(json.dumps(up_solution_metrics_response, indent=2))

The normalized discounted cumulative gain at 5 items is below 38% on the full dataset (22% on the small one); loosely, that is the chance that a recommendation appears in the user's held-out interaction history. Coverage is around 13%, meaning the recommendations span about 13% of the unique items in the training data, and precision in the top 5 recommended items is only 14% on the full dataset (7.5% on the small one).

This is clearly not a great model, but keep in mind that we had to use rating data for our interactions because MovieLens is an explicit dataset based on ratings. The timestamps are also from the time each movie was rated, not watched, so their order is not necessarily the order in which a viewer watched the movies.

### SIMS metrics

Now, retrieve the evaluation metrics for the SIMS solution version.

In [15]:
sims_solution_metrics_response = personalize.get_solution_metrics(
    solutionVersionArn=sims_solution_version_arn,
)

print(json.dumps(sims_solution_metrics_response, indent=2))

{
  "solutionVersionArn": "arn:aws:personalize:ap-southeast-1:024103970757:solution/personalize-poc-sims/533343e2",
  "metrics": {
    "coverage": 0.109,
    "mean_reciprocal_rank_at_25": 0.1649,
    "normalized_discounted_cumulative_gain_at_10": 0.1682,
    "normalized_discounted_cumulative_gain_at_25": 0.208,
    "normalized_discounted_cumulative_gain_at_5": 0.1386,
    "precision_at_10": 0.0435,
    "precision_at_25": 0.0301,
    "precision_at_5": 0.0609
  },
  "ResponseMetadata": {
    "RequestId": "92ddbd22-d286-4d07-9d29-4d60b7c227fd",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Fri, 11 Dec 2020 05:40:06 GMT",
      "x-amzn-requestid": "92ddbd22-d286-4d07-9d29-4d60b7c227fd",
      "content-length": "407",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In this example we see a precision at 5 items of a little over 4.5% on the full dataset (6.4% on the small dataset). This figure is probably within the margin of error, but because no effort was made to mask popularity, the model may simply be returning very popular items that a large volume of users have interacted with in some way.

### Personalized ranking metrics

Now, retrieve the evaluation metrics for the personalized ranking solution version.

In [14]:
rerank_solution_metrics_response = personalize.get_solution_metrics(
    solutionVersionArn=rerank_solution_version_arn,
)

print(json.dumps(rerank_solution_metrics_response, indent=2))

{
  "solutionVersionArn": "arn:aws:personalize:ap-southeast-1:024103970757:solution/personalize-poc-rerank/c683d991",
  "metrics": {
    "coverage": 0.0033,
    "mean_reciprocal_rank_at_25": 0.0492,
    "normalized_discounted_cumulative_gain_at_10": 0.0591,
    "normalized_discounted_cumulative_gain_at_25": 0.083,
    "normalized_discounted_cumulative_gain_at_5": 0.0328,
    "precision_at_10": 0.0151,
    "precision_at_25": 0.0113,
    "precision_at_5": 0.0113
  },
  "ResponseMetadata": {
    "RequestId": "8eab9661-d405-4662-83b6-94069a1ccaa4",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Fri, 11 Dec 2020 05:40:03 GMT",
      "x-amzn-requestid": "8eab9661-d405-4662-83b6-94069a1ccaa4",
      "content-length": "410",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


## Using Evaluation Metrics

It's tempting to over-focus on these evaluation metrics, but very important to consider the bigger picture:

* In recommendation problems, there is a **strong feedback loop** that **favours the existing deployed system**: People don't click/purchase items they don't see, so the historical interaction data we train from is influenced by the biases of the previously deployed system(s).
* This 'offline' validation procedure doesn't have any knowledge of **input item lists** you might supply to re-ranking models, or **filter rules** you may apply to models in general, either of which might have a significant impact on real-world deployed performance by filtering out known-irrelevant items (or accidentally removing important items!)
* **Cold starting** of new items is difficult to evaluate using these metrics. The aim of cold-starting strategies is to recommend items which are new to your business. Therefore, these items will not appear in the existing user transaction data which is used to compute the evaluation metrics!

Keeping in mind these factors, the evaluation metrics produced by Personalize are generally useful for two cases:

1. Comparing the performance of solution versions trained on the same recipe, but with different values for the hyperparameters and features (impression data etc)
1. Comparing the performance of solution versions trained on different recipes (except HRNN Coldstart).
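
For either kind of comparison, it helps to line the metrics up side by side. The sketch below does that with plain Python. `metric_table` is a hypothetical helper, and the numbers are copied from the SIMS and personalized ranking outputs shown earlier purely to illustrate the layout; in practice you would pass in the `"metrics"` dictionaries from your own `get_solution_metrics` responses.

```python
def metric_table(metrics_by_solution):
    """Return rows (header first) comparing each metric across solutions."""
    names = list(metrics_by_solution)
    metric_keys = sorted(
        set().union(*(metrics.keys() for metrics in metrics_by_solution.values()))
    )
    rows = [["metric"] + names]
    for key in metric_keys:
        values = [metrics_by_solution[name].get(key, float("nan")) for name in names]
        rows.append([key] + [f"{value:.4f}" for value in values])
    return rows


# Values copied from the sample outputs above (a subset of the metrics)
example_metrics = {
    "sims": {
        "coverage": 0.109,
        "mean_reciprocal_rank_at_25": 0.1649,
        "precision_at_5": 0.0609,
    },
    "rerank": {
        "coverage": 0.0033,
        "mean_reciprocal_rank_at_25": 0.0492,
        "precision_at_5": 0.0113,
    },
}

for row in metric_table(example_metrics):
    print(f"{row[0]:<30}" + "".join(f"{cell:>10}" for cell in row[1:]))
```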

Properly evaluating a recommendation system is always best done through **A/B testing** while measuring **actual business outcomes**. Since recommendations generated by a system usually influence the user behavior which it is based on, it is better to run small experiments and apply A/B testing for **longer periods of time**. Over time, the bias from the existing model will fade.

## All set!

We've now trained models for a range of different recommendation tasks, based on our historical data.

In the next notebook we'll **deploy** these models to enable us to start generating real-time recommendations:

- Follow along in the **AWS Console** with the instructions and screenshots in [04a_Deploying_Campaigns_and_Filters_(Console).ipynb](04a_Deploying_Campaigns_and_Filters_(Console).ipynb), *OR*
- Run the same steps in code with the **AWS SDK for Python (Boto3)** by following [04b_Deploying_Campaigns_and_Filters_(Python_SDK).ipynb](04b_Deploying_Campaigns_and_Filters_(Python_SDK).ipynb)