# Preparing Input Data for Amazon Personalize

In this notebook, we'll work through **selecting and preparing** historical training data for Amazon Personalize, and then uploading the prepared files to Amazon S3.

> ⚠️ We assume you've *already* set up an S3 bucket and IAM execution role for Personalize to access the bucket (either in the AWS console or by running notebook [00_Environment_Setup.ipynb](00_Environment_Setup.ipynb)) and `%store`d this information as we did in the notebook.

## Introduction

Amazon Personalize provides three types of "recipe" (algorithm) for solving different recommendation tasks. Some types include multiple recipes, as follows:

1. [**User Personalization**](https://docs.aws.amazon.com/personalize/latest/dg/user-personalization-recipes.html) - Recommend relevant items for a given user (with some business rule filtering, if required)
    - The [**User-Personalization**](https://docs.aws.amazon.com/personalize/latest/dg/native-recipe-new-item-USER_PERSONALIZATION.html) recipe is the most fully-featured and typically **recommended for all use-cases in this category**
    - The legacy **HRNN-...** recipes in this category should normally be replaced by User-Personalization
    - The [**Popularity-Count**](https://docs.aws.amazon.com/personalize/latest/dg/native-recipe-popularity.html) recipe is a useful **baseline** for contextualizing performance metrics: e.g. "What metrics would I get *just by always recommending the most popular items*?"
2. [**Personalized Ranking**](https://docs.aws.amazon.com/personalize/latest/dg/personalized-ranking-recipes.html) - Prioritize relevant items **from a provided shortlist** for a given user (for supplying lists, rather than applying filter rules)
3. [**Related Items**](https://docs.aws.amazon.com/personalize/latest/dg/related-items-recipes.html) - Recommend relevant items for a given **item** in context (e.g. "customers also bought")

No matter the use case, the algorithms all share a base of learning on **user-item-interaction data** - a set of *events* defined by 3 core required attributes:

1. **User_ID** - The user who interacted
2. **Item_ID** - The item the user interacted with
3. **Timestamp** - The time at which the interaction occurred

Personalize also defines certain optional event attributes you can add:

4. **Event_Type** - Categorical label of an event (browse, purchased, rated, etc).
5. **Event_Value** - A value corresponding to the event type that occurred. As a best practice, we generally aim for normalized values between 0 and 1 to be comparable between event types. For example, if there are three phases to complete a transaction (clicked, added-to-cart, and purchased), then there would be an event_value for each phase as 0.33, 0.66, and 1.0 respectively.

...and users can also add **custom interaction attributes** which some recipes can include in the generated model: E.g. location, device type, etc.
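As a rough sketch of the event value idea described above, the snippet below maps each event type to a normalized value. The event names and weights simply mirror the clicked / added-to-cart / purchased example, and the sample rows are made up for illustration:

```python
import pandas as pd

# Hypothetical funnel weights, following the 0.33 / 0.66 / 1.0 example above
EVENT_VALUES = {"clicked": 0.33, "added-to-cart": 0.66, "purchased": 1.0}

# Made-up events for illustration only (not taken from any real dataset)
events = pd.DataFrame(
    {
        "USER_ID": [1, 1, 2],
        "ITEM_ID": [10, 10, 20],
        "EVENT_TYPE": ["clicked", "purchased", "added-to-cart"],
        "TIMESTAMP": [1600000000, 1600000300, 1600000500],
    }
)

# Look up the value for each row's event type
events["EVENT_VALUE"] = events["EVENT_TYPE"].map(EVENT_VALUES)
print(events)
```

Keeping the weights in a single dictionary makes it easy to adjust them later without touching the rest of the preparation code.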

In addition to this core **'Interactions' dataset**, we may *optionally* provide:

- **Users metadata** - a table mapping each `USER_ID` to additional custom attributes such as gender, age group, etc.
- **Items metadata** - a table mapping each `ITEM_ID` to additional custom attributes such as brand, category, etc.

More detail on the schema and requirements for Amazon Personalize training data is given in the [Datasets and Schemas](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html) and [Preparing and Importing Data](https://docs.aws.amazon.com/personalize/latest/dg/data-prep.html) sections of the Amazon Personalize Developer Guide.
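For orientation, an Interactions schema covering the attributes listed above might look something like the sketch below. The choice of `string` IDs and the inclusion of both optional fields are assumptions for illustration; check the Developer Guide pages linked above for the exact requirements that apply to your dataset:

```python
import json

# Sketch of an Avro-style Interactions schema. Including EVENT_TYPE and
# EVENT_VALUE assumes both optional attributes will be supplied in the CSV.
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
        {"name": "EVENT_TYPE", "type": "string"},
        {"name": "EVENT_VALUE", "type": "float"},
    ],
    "version": "1.0",
}

# The schema is passed around as a JSON string
print(json.dumps(interactions_schema, indent=2))
```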

## Choosing a Data Source

> ⚠️ **Remember:** Only the *Interactions* dataset is mandatory, and all recipes in Amazon Personalize at the time of writing use this as their primary learning source - with secondary input from other metadata.
>
> See the official [Amazon Personalize Cheat Sheet](https://github.com/aws-samples/amazon-personalize-samples/blob/master/PersonalizeCheatSheet2.0.md) for more guidance on **what use-cases Amazon Personalize is a good fit for**, and where you might consider alternative techniques instead.

Many use-cases *do* generate this kind of interaction history data - for example:

1. Video-on-demand applications
1. E-commerce platforms
1. Social media aggregators / platforms

...And we should be reasonably well-placed as long as our use case meets some basic criteria like:

- Authenticated users
- At least 50 unique users
- At least 100 unique items
- Several (and preferably at least 2 dozen) interactions for each user
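One way to compare a candidate dataset against these rough criteria is to count users, items, and interactions per user. The helper name `summarize_interactions` and the toy data below are hypothetical, and the column names assume the data has already been renamed to the Personalize convention:

```python
import pandas as pd


def summarize_interactions(df, user_col="USER_ID", item_col="ITEM_ID"):
    """Return basic counts to compare against the rough criteria above."""
    per_user = df.groupby(user_col).size()  # number of interactions per user
    return {
        "unique_users": int(df[user_col].nunique()),
        "unique_items": int(df[item_col].nunique()),
        "median_interactions_per_user": float(per_user.median()),
    }


# Tiny made-up table, far below the suggested minimums, just to show the output shape
toy = pd.DataFrame({"USER_ID": [1, 1, 2, 3], "ITEM_ID": [10, 11, 10, 12]})
stats = summarize_interactions(toy)
print(stats)
```

The median is used here rather than the mean so that a handful of very active users doesn't hide a long tail of users with only one or two interactions.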

However, your historical data will typically not arrive in a perfect form & compatible schema for Personalize - so our first task will be to procure and **re-structure** the training data.

In this example, we'll use the [**MovieLens dataset**](https://grouplens.org/datasets/movielens/) to explore the service: A set of *movie reviews*.

## Fetching the Data (MovieLens Example)

The full MovieLens dataset includes over 25 million interactions (reviews) and a rich collection of metadata for items (movies).

There's also a smaller published extract of the dataset available, which we'll use by default here to shorten training times while still demonstrating the same capabilities. Set `USE_FULL_MOVIELENS` to `True` below if you'd like to use the full dataset instead:

In [None]:
USE_FULL_MOVIELENS = False

In [None]:
data_dir = "poc_data"
%store data_dir

!mkdir -p $data_dir

# Download and extract the dataset:
if not USE_FULL_MOVIELENS:
    !cd $data_dir && wget -N http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
    !cd $data_dir && unzip -o ml-latest-small.zip
    dataset_dir = data_dir + "/ml-latest-small/"
else:
    !cd $data_dir && wget -N http://files.grouplens.org/datasets/movielens/ml-25m.zip
    !cd $data_dir && unzip -o ml-25m.zip
    dataset_dir = data_dir + "/ml-25m/"
%store dataset_dir

Take a look at the data files you've downloaded:

In [None]:
!ls $dataset_dir

At present not much is known except that we have a few CSVs and a readme. Let's output the readme to learn more!

In [None]:
!pygmentize $dataset_dir/README.txt

From the README, we see there is a file `ratings.csv` that should form the basis of our **interactions** data - after all, rating a film definitely is a form of interacting with it!

The dataset also has some genre information as well as some movie genome data, which looks like a good source of **item metadata**.

As a publicly released dataset, there's not really any useful **user metadata** available here for us to model with - so we'll largely ignore this feature of Personalize in our example... But you can take the item metadata process as a good guide for your own experiments.

## Preparing Interactions Data

To get started exploring our data, we'll first import a few useful libraries:

In [None]:
# Python Built-Ins:
from datetime import datetime

# External Dependencies:
import boto3  # AWS SDK for Python
import pandas as pd  # DataFrame (table) manipulation tools

First we'd like to load our data and check the format, scope, and any gaps - to understand what might need adapting for Amazon Personalize:

In [None]:
original_data = pd.read_csv(dataset_dir + "/ratings.csv")

print(original_data.info())
original_data.head(5)

From the above, you can see the total number of entries in the dataset (25,000,095 for the full dataset, or 100,836 for the small extract) across 4 columns. Every column has been interpreted as int64, except for the rating, which has been loaded as float64.

We see that there are no null entries in any column (the non-null counts all match), but it's also worth checking the range and basic statistics of each column in case there's anything unexpected there:

In [None]:
original_data.describe()

The ranges seem to be generally healthy for our `userId`, `movieId` and `rating` columns - and the inferred int64 & float64 formats seem suitable for these fields.

However, we need to dive deeper to understand the timestamps in the data. Amazon Personalize [requires](https://docs.aws.amazon.com/personalize/latest/dg/data-prep-formatting.html#timestamp-data) timestamps in [Unix Epoch](https://en.wikipedia.org/wiki/Unix_time) format.

Currently, the timestamp values are not human-readable. So let's grab an arbitrary timestamp value and check whether it seems consistent.

In [None]:
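from datetime import timezone

# Select from the column (not the row) so the value stays an integer:
arb_time_stamp = int(original_data["timestamp"].iloc[50])
print(arb_time_stamp)
# Interpret the value as seconds since the Unix epoch, in UTC:
print(datetime.fromtimestamp(arb_time_stamp, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S"))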

Great! This date seems to be within expected range, so we can continue formatting the rest of the data.
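Checking one value is a useful spot check, but the same idea extends to the whole column. The sketch below uses a few made-up epoch values in place of the real `timestamp` column, and the 1995 lower bound is an assumption about when MovieLens ratings start (the README states the exact date range for each extract):

```python
import pandas as pd

# Made-up Unix epoch seconds standing in for the `timestamp` column
timestamps = pd.Series([964982703, 1112486027, 1537799250])

# Convert every value at once and look at the overall extent
as_dates = pd.to_datetime(timestamps, unit="s", utc=True)
print(as_dates.min(), as_dates.max())

# Assumed plausible window: not before 1995, and not in the future
earliest = pd.Timestamp("1995-01-01", tz="UTC")
latest = pd.Timestamp.now(tz="UTC")
in_range = (as_dates >= earliest) & (as_dates <= latest)
print(in_range.all())
```

If the values had been stored in milliseconds rather than seconds, the converted dates would land far in the future and this check would fail.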

The correspondence is pretty clear from `userId` to `USER_ID`, and from `movieId` to `ITEM_ID`... But what about `rating`?

We can define a single `EVENT_TYPE` of "review", and use rating as our `EVENT_VALUE` field: Which will allow us to **threshold filter** our dataset to build models that consider only higher-rating events (e.g. to build a model more likely to recommend movies the user will actually *like*, not just what they're likely to review).

...So the only data preparation we'll need to do in this example is to rename our columns:

In [None]:
interactions_df = original_data.copy()
interactions_df["EVENT_TYPE"] = "review"
interactions_df.rename(
    columns={
        "userId": "USER_ID",
        "movieId": "ITEM_ID",
        "rating": "EVENT_VALUE",
        "timestamp": "TIMESTAMP",
    },
    inplace=True,
)
print(interactions_df.info())
interactions_df.head()
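To illustrate the threshold filtering idea mentioned above, the sketch below keeps only events at or above a minimum rating. The sample rows and the `MIN_RATING` cut-off are hypothetical, and this shows the concept in pandas only, not how Amazon Personalize applies such a threshold during training:

```python
import pandas as pd

# Made-up interactions in the renamed layout (EVENT_VALUE holds the star rating)
sample = pd.DataFrame(
    {
        "USER_ID": [1, 1, 2, 2],
        "ITEM_ID": [10, 11, 10, 12],
        "EVENT_VALUE": [4.0, 2.5, 5.0, 3.0],
        "TIMESTAMP": [964982703, 964981247, 1112486027, 1112484676],
        "EVENT_TYPE": ["review"] * 4,
    }
)

MIN_RATING = 4.0  # hypothetical cut-off for "the user liked this"
liked = sample[sample["EVENT_VALUE"] >= MIN_RATING]
print(f"Kept {len(liked)} of {len(sample)} events")
```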

To import this dataset to Amazon Personalize, we'll need to save it in **CSV format** on **Amazon S3**, so first let's save the file to a CSV here on our notebook:

In [None]:
interactions_filename = "interactions.csv"
interactions_path = f"{data_dir}/{interactions_filename}"
%store interactions_path

interactions_df.to_csv(interactions_path, index=False)

...And then upload this file to the bucket we prepared in [00_Environment_Setup.ipynb](00_Environment_Setup.ipynb).

In [None]:
# Replace the below line with e.g. region = "ap-southeast-1" if you didn't run notebook 0
%store -r region
assert isinstance(region, str), "`region` must be a region name string e.g. 'us-east-1'"

# Replace the below line with e.g. bucket_name = "DOC-EXAMPLE-BUCKET" if you didn't run notebook 0
%store -r bucket_name
assert isinstance(bucket_name, str), "`bucket_name` must be a data bucket name string"

session = boto3.Session(region_name=region)
s3 = session.resource("s3")

In [None]:
# Upload the file to S3:
s3.Bucket(bucket_name).Object(interactions_path).upload_file(interactions_path)
interactions_s3uri = f"s3://{bucket_name}/{interactions_path}"
%store interactions_s3uri
print(f"Uploaded interactions to {interactions_s3uri}")
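A quick sanity check on the saved file can catch a misnamed column before an import job does. The helper `missing_required_columns` below is a hypothetical name, and it reads from an in-memory CSV here so the sketch runs on its own; a local file path could be passed instead:

```python
import io

import pandas as pd

# The three core attributes every Interactions dataset needs
REQUIRED_COLUMNS = {"USER_ID", "ITEM_ID", "TIMESTAMP"}


def missing_required_columns(csv_source):
    """Read only the header row and return any required columns that are absent."""
    header = pd.read_csv(csv_source, nrows=0)
    return REQUIRED_COLUMNS - set(header.columns)


# In-memory stand-in for the interactions CSV (made-up row)
demo_csv = io.StringIO(
    "USER_ID,ITEM_ID,EVENT_VALUE,TIMESTAMP,EVENT_TYPE\n1,1,4.0,964982703,review\n"
)
print(missing_required_columns(demo_csv))
```

An empty set means all three core columns are present in the header.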

Great - now our only *mandatory* dataset is ready to go! But, to try and improve our model, let's see if we can also prepare an item metadata file from the `movies.csv` provided in the source dataset.

## Preparing (Item) Metadata

As before, we'll start out by loading and exploring the source data file:

In [None]:
original_items = pd.read_csv(dataset_dir + "/movies.csv")

print(original_items.info())
original_items.head(5)

In [None]:
original_items.describe()

This time, `movieId` is the only numeric column and so the only one we can generate summary statistics for. We also see `title` (a string title which appears to include the year of the film in brackets at the end) and `genres` - a bar-separated list of multiple genre tags for each film.

Amazon Personalize is already able to consume multi-categorical data in this bar-separated format, so we can use this `GENRES` field as-is.
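Although the field can be used as-is, it can still be helpful to see which individual tags it contains. The sketch below splits a few made-up bar-separated values and counts each tag:

```python
import pandas as pd

# Made-up values in the same bar-separated style as the `genres` column
genres = pd.Series(["Adventure|Animation|Children", "Comedy|Romance", "Comedy"])

# Split each value on the bar, put every tag on its own row, then count
genre_counts = genres.str.split("|").explode().value_counts()
print(genre_counts)
```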

Let's:

- Rename the `movieId` column to `ITEM_ID` in line with Amazon Personalize's required schema
- Rename our other columns to uppercase for consistency
- Try using a [regular expression](https://docs.python.org/3/library/re.html#module-re) to **extract the year** from the title field

with the code below:

In [None]:
items_df = original_items.copy()
items_df.rename(
    columns={
        "movieId": "ITEM_ID",
        "title": "TITLE",
        "genres": "GENRES",
    },
    inplace=True,
)
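# Match a 4-digit year in brackets at the very end of the title, so that
# other numbers in the title (e.g. "2001: A Space Odyssey") are ignored:
items_df["YEAR"] = items_df["TITLE"].str.extract(r"\((\d{4})\)\s*$", expand=False)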
items_df["YEAR"] = pd.to_numeric(items_df["YEAR"]).astype("Int64")

print(items_df.info())
items_df.head()
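To see how a year-extraction pattern behaves on awkward cases, it can help to try it on a few titles in isolation. The titles below are made up for illustration, and the pattern shown (a bracketed 4-digit number anchored to the end of the string) is one possible choice rather than the only reasonable one:

```python
import pandas as pd

# Made-up titles: a typical case, a number in the title, and two with no bracketed year
titles = pd.Series(
    ["Toy Story (1995)", "2001: A Space Odyssey (1968)", "Blade Runner 2049", "Untitled"]
)

# Capture 4 digits inside brackets at the end of the title (trailing spaces allowed)
years = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)

# Nullable integer type keeps missing years as <NA> instead of forcing floats
years = pd.to_numeric(years).astype("Int64")
print(years.tolist())
```

Anchoring to the end of the string means a number that is part of the title itself is not mistaken for the release year.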

This seems to have gone pretty well overall - but we can see from the non-null counts that there is a small proportion of null values in our new `YEAR` column. Let's quickly check whether these gaps make sense:

In [None]:
items_df[items_df["YEAR"].isnull()].sample(5)

These examples seem to make sense as the relevant titles don't obviously list a year.

The proportion of missing values in the `YEAR` field is quite small, and the age of each movie is likely to be usefully descriptive - so we'll keep and use this field.
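The proportion of missing values can be measured directly rather than estimated by eye. A minimal sketch, using a made-up nullable integer column in place of the real `YEAR` field:

```python
import pandas as pd

# Made-up stand-in for the YEAR column, with one missing value
year = pd.Series([1995, None, 2001, 1968], dtype="Int64")

# The mean of a boolean mask is the share of True values
missing_share = year.isnull().mean()
print(f"{missing_share:.1%} of rows have no year")
```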

However, at this time the `TITLE` field is pretty useless to us from a modelling perspective, as today Amazon Personalize can only interpret string fields as category identifiers, not use natural language features mentioned within them.

...So as a final transformation we'll drop the `TITLE` column, and then our item metadata will be ready to load for modelling:

In [None]:
items_df.drop("TITLE", axis=1, inplace=True)
items_df.head()

As with the interactions data, we now just need to save this dataset in CSV format and upload it to our Amazon S3 bucket:

In [None]:
items_filename = "item-meta.csv"
items_path = f"{data_dir}/{items_filename}"
%store items_path

items_df.to_csv(items_path, index=False)

In [None]:
# Upload the file to S3:
s3.Bucket(bucket_name).Object(items_path).upload_file(items_path)
items_s3uri = f"s3://{bucket_name}/{items_path}"
%store items_s3uri
print(f"Uploaded item metadata to {items_s3uri}")

## All set!

We've now prepared interaction and item metadata in a compatible format for Amazon Personalize, and uploaded it to Amazon S3 ready for import.

In the next notebook we'll start actually using the Personalize service, and you have two choices:

- Follow along in the **AWS Console** with the instructions and screenshots in [02a_Importing_Data_(Console).ipynb](02a_Importing_Data_(Console).ipynb), *OR*
- Run the same steps in code with the **AWS SDK for Python (Boto3)** by following [02b_Importing_Data_(Python_SDK).ipynb](02b_Importing_Data_(Python_SDK).ipynb)