# Validating and Importing User-Item-Interaction Data


For the most part, the algorithms in Amazon Personalize each target a different task, as summarized here:

1. HRNN & HRNN-Metadata - Personalization
1. HRNN Coldstart - Personalization that promotes new content
1. Personalized-Ranking - Takes a collection of items and then orders them in probable order of interest using an HRNN-like approach.
1. SIMS (Similar Items) - Given one item, finds the other items that users also interacted with.
1. Popularity-Count - Returns the most popular items. This is also what is returned by default if HRNN or HRNN-Metadata do not have an answer for the user you query.


No matter the use case, the algorithms all share a base of learning on user-item-interaction data which is defined by 3 core attributes:

1. UserID - User who interacted
1. ItemID - Item the user interacted with
1. Timestamp - When did this interaction occur

We also support event types and event values defined by:

1. Event Type - Categorical label of an event (browsed, purchased, rated, etc.).
1. Event Value - A value corresponding to the event type that occurred. Generally speaking, we look to normalize the values between 0 and 1 across the event types. So if there are three phases to complete a transaction (clicked, added-to-cart, and purchased), there would be an event_value for each phase of 0.33, 0.66, and 1.0 respectively.
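
As a rough illustration of the normalization described above, the sketch below spreads event values evenly over an ordered list of phases. The helper name `event_values` and the even spacing are assumptions made for this example, not something Amazon Personalize requires.

```python
def event_values(ordered_event_types):
    """Map each event type to an evenly spaced value in (0, 1].

    The list is assumed to be ordered from the weakest signal to the strongest.
    """
    n = len(ordered_event_types)
    return {name: (i + 1) / n for i, name in enumerate(ordered_event_types)}


# Three-phase funnel taken from the example in the text.
funnel = ["clicked", "added-to-cart", "purchased"]
values = event_values(funnel)
for name in funnel:
    print(f"{name}: {values[name]:.2f}")
```

The values come out as one third, two thirds, and one, which matches the three-phase example up to rounding.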

In this particular exercise we will ignore event_type and event_value. They can come in handy later but are skipped for the initial POC.

## Choosing a dataset or data source

As we mentioned, the user-item-interaction data is key for getting started with the service. This means we need to look for use cases that generate that kind of data. A few common examples are:

1. Video-on-demand applications
1. E-commerce platforms
1. Social media aggregators / platforms

There are a few guidelines for scoping a problem suitable for Personalize. We recommend the values below as a starting point, although the [official limits](https://docs.aws.amazon.com/personalize/latest/dg/limits.html) lie a little lower.

* Authenticated users
* At least 50 unique users
* At least 100 unique items
* At least 2 dozen interactions for each user 

Most of the time this is easily attainable, and if you are low in one category, you can often make up for it by having a larger number in another category.
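
To make the guidelines above concrete, here is a minimal sketch that counts users, items, and per-user interactions in a list of `(user_id, item_id)` pairs. The function name `check_guidelines` and the sample data are made up for illustration, and the thresholds are the starting-point values from the list above rather than official limits.

```python
from collections import Counter


def check_guidelines(interactions, min_users=50, min_items=100, min_per_user=24):
    """Summarize a list of (user_id, item_id) pairs against the guidelines."""
    interactions = list(interactions)
    per_user = Counter(user for user, _ in interactions)
    items = {item for _, item in interactions}
    # Users with fewer than the suggested two dozen interactions.
    sparse_users = sum(1 for count in per_user.values() if count < min_per_user)
    return {
        "unique_users": len(per_user),
        "unique_items": len(items),
        "enough_users": len(per_user) >= min_users,
        "enough_items": len(items) >= min_items,
        "users_below_min_interactions": sparse_users,
    }


# Synthetic sample: 80 users, each interacting with 30 consecutive item IDs.
sample = [(user, item) for user in range(80) for item in range(user, user + 30)]
report = check_guidelines(sample)
print(report)
```

A non-zero `users_below_min_interactions` is not necessarily a blocker, since a shortfall in one category can often be offset by another.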

Generally speaking your data will not arrive in a perfect form for Personalize, and will take some modification to be structured correctly. This notebook looks to guide you through all of that. 

To begin with, we are going to use the [Last.FM](https://grouplens.org/datasets/hetrec-2011/) dataset. These are records of the music listening behavior of its users. The data fits our guidelines with a large number of users, items, and interactions.

First, you will download the dataset and unzip it in a new folder using the code below.

In [None]:
data_dir = "poc_data"
!mkdir $data_dir
!cd $data_dir && wget http://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip
!cd $data_dir && unzip hetrec2011-lastfm-2k.zip

Take a look at the data files you have downloaded.

In [None]:
!ls $data_dir

At present not much is known about the data other than we seem to have many .dat files and a README. Opening the README will tell us about the overall structure of this data. This is a step you probably can skip with custom data unless the data source is coming from an external team.

In [None]:
!pygmentize poc_data/readme.txt

From the README, we can see that there are multiple interaction types in this dataset: users marking each other as friends, users listening to artists, and users assigning tags to artists.

In this case, we are focusing on the users, the artists, and the listening interactions. We have 1892 users, 17632 artists (our items in this case), and 92834 user-listened artist interactions. This is more than enough for us to get started with Personalize.

Continue reading through the README to get to the `Files` section. Most of the files in the dataset are not relevant to us, but the `user_artists.dat` file looks promising. The `Data format` section of the README provides more details on the contents of the file. This is where we encounter our first problem.

| userID | artistID | weight  |
|--------|----------|---------|
| 2      | 51       | 13883   |

Although there is interaction data between users and the artists they are listening to, these interactions are stored as weights instead of timestamps. We need user-item-timestamp interaction data for Amazon Personalize. 

If you take another look at the files in the dataset, you should see that `user_taggedartists-timestamps.dat` does contain timestamp data. So what if we use tagging behavior as our interaction data, instead of listening behavior? Can we assume that a user tagging an artist is an indication of positive sentiment? Normally, you would discuss this with your customer, or someone who has domain knowledge, to understand whether this interaction is suitable for the use case you want to solve. For now, we will assume that tagging behavior is suitable for our needs.

The schema for `user_taggedartists-timestamps.dat` is:

| userID | artistID | tagID | timestamp     |
|--------|----------|-------|---------------|
| 2      | 52       | 13    | 1238536800000 |

If we remove the `tagID` attribute, we have exactly the format we need for Amazon Personalize.

## Preparing your data

The next thing to be done is to load the data and confirm the data is in a good state, then save it to a CSV where it is ready to be used with Amazon Personalize.

To get started, import a collection of Python libraries commonly used in data science.

In [None]:
import boto3
from time import sleep
import subprocess
import pandas as pd
import json
import time
import pprint
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates
from datetime import datetime

Next, open the data file and take a look at the first several rows.

In [None]:
original_data = pd.read_csv(data_dir + '/user_taggedartists-timestamps.dat')
original_data.head(5)

Clearly the data did not load correctly. The default delimiter for CSV (comma-separated value) files is a comma (`,`), but in this case the file was saved with tab (`\t`) delimiters. So let's specify the correct delimiter and try loading the data again.

In [None]:
original_data = pd.read_csv(data_dir + '/user_taggedartists-timestamps.dat', delimiter='\t')
original_data.head(5)

That's better. Now that the data has been successfully loaded into memory, let's extract some additional information. First, calculate some basic statistics from the data.

In [None]:
original_data.describe()

This shows that we have a good range of values for `userID` and `artistID`. Next, it is always a good idea to confirm the data format.

In [None]:
original_data.info()

From this, you can see that there are a total of 186,479 entries in the dataset, with 4 columns, and each cell stored as int64 format.

The int64 format is clearly suitable for `userID` and `artistID`. However, we need to dive deeper to understand the timestamps in the data. To use Amazon Personalize, you need to save timestamps in [Unix Epoch](https://en.wikipedia.org/wiki/Unix_time) format.

Currently, the timestamp values are not human-readable. So let's grab an arbitrary timestamp value and figure out how to interpret it.

In [None]:
arb_time_stamp = original_data.iloc[50]['timestamp']
print(arb_time_stamp)
print(datetime.utcfromtimestamp(arb_time_stamp).strftime('%Y-%m-%d %H:%M:%S'))


Oops! For this particular timestamp value, the code rendered a year of 41,132. That's a bit far into the future for us, so clearly this was not the correct way to parse the data. We need a second attempt.

JavaScript records time in milliseconds and this is a collection of data from a web application, so let's divide the timestamp value by 1000 before applying our code.

In [None]:
arb_time_stamp = arb_time_stamp/1000
print(datetime.utcfromtimestamp(arb_time_stamp).strftime('%Y-%m-%d %H:%M:%S'))

February, 2009 feels much more realistic for our dataset. We don't need human-readable timestamps to use Amazon Personalize, but we do want the dates to be realistic, so now move forward by transforming each timestamp in the dataset away from the JavaScript milliseconds format. 

In [None]:
original_data.timestamp = original_data.timestamp / 1000
original_data.head(5)
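
The division above assumes that every value in the column is in milliseconds. If your own data mixes units, a guard along the following lines may help. This is only a sketch: the cutoff constant and the helper name `to_epoch_seconds` are assumptions chosen for this example, and the heuristic misclassifies millisecond values from before early 1973.

```python
from datetime import datetime, timezone

# Assumed cutoff: 10**11 seconds lies more than 3,000 years after 1970, so any
# value at or above it is far more plausibly a count of milliseconds.
MILLISECOND_CUTOFF = 10**11


def to_epoch_seconds(value):
    """Return the value as whole Unix seconds, converting from ms if needed."""
    value = int(value)
    if value >= MILLISECOND_CUTOFF:
        return value // 1000
    return value


raw = 1238536800000  # sample timestamp from the schema table earlier
seconds = to_epoch_seconds(raw)
print(seconds, datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat())
```

Using integer division also keeps the result an integer, which avoids the float conversion that a plain `/ 1000` introduces.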

Do a quick sanity check on the transformed dataset by picking an arbitrary timestamp and transforming it to a human-readable format.

In [None]:
arb_time_stamp = original_data.iloc[50]['timestamp']
print(arb_time_stamp)
print(datetime.utcfromtimestamp(arb_time_stamp).strftime('%Y-%m-%d %H:%M:%S'))

This date makes sense as a timestamp, so we can continue formatting the rest of the data. Remember, the data we need is user-item-interaction data, which is `userID`, `artistID`, and `timestamp` in this case. Our dataset has an additional column, `tagID`, which can be dropped from the dataset.

In [None]:
interactions_df = original_data.copy()
interactions_df = interactions_df[['userID', 'artistID', 'timestamp']]
interactions_df.head()

After manipulating the data, always check whether the data format has changed.

In [None]:
interactions_df.dtypes

In this case, the timestamp column has changed from int64 to float64. So let's change the format back to int64.

In [None]:
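# astype returns a new DataFrame, so assign the result to keep the change
interactions_df = interactions_df.astype({'timestamp': 'int64'})
interactions_df.dtypes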

Amazon Personalize has default column names for users, items, and timestamp. These default column names are `USER_ID`, `ITEM_ID`, and `TIMESTAMP`. So the final modification to the dataset is to replace the existing column headers with the default headers.

In [None]:
interactions_df.rename(columns = {'userID':'USER_ID', 'artistID':'ITEM_ID', 
                              'timestamp':'TIMESTAMP'}, inplace = True) 


That's it! At this point the data is ready to go, and we just need to save it as a CSV file.

In [None]:
interactions_filename = "interactions.csv"
interactions_df.to_csv((data_dir+"/"+interactions_filename), index=False, float_format='%.0f')
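
Before uploading, it can be worth confirming that every timestamp in the saved file is a plain integer, since a value such as `1238536800.0` would not fit a `long` field. The sketch below works on CSV text held in a string so that it runs on its own. The helper name `find_bad_rows` and the sample rows are made up for illustration.

```python
import csv
import io


def find_bad_rows(csv_text, timestamp_column="TIMESTAMP"):
    """Return (line_number, row) pairs whose timestamp is not a plain integer."""
    bad = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        value = (row.get(timestamp_column) or "").strip()
        if not value.isdigit():
            bad.append((line_number, row))
    return bad


# Hypothetical file contents: the second data row has a float timestamp.
sample = "USER_ID,ITEM_ID,TIMESTAMP\n2,52,1238536800\n2,73,1238536800.0\n"
print(find_bad_rows(sample))
```

To check the real file, you could pass in the result of `open(data_dir + "/" + interactions_filename).read()`. Note that this simple check also flags negative timestamps.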

## Creating Dataset Groups and the Interactions Dataset

The highest level of isolation and abstraction with Amazon Personalize is a Dataset Group. Information stored within one of these has no impact on any other dataset group or models created from one. This allows you to run many experiments and is part of how we keep your models private and fully trained only on your data. 

Before importing the data prepared earlier, there needs to be a dataset group and a dataset added to it that handles the interactions.

Dataset Groups can house the following types of information:

* User-Item-Interactions
* Event Streams ( Real time Interactions )
* User Metadata
* Item Metadata

The cells below will create the dataset group and the dataset for interactions.


First, validate that your environment can communicate successfully with Amazon Personalize. The lines below do just that.

In [None]:
# Configure the SDK to Personalize:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')

### Create the Dataset Group

In [None]:
%%time

create_dataset_group_response = personalize.create_dataset_group(
    name = "personalize-poc-lastfm"
)

dataset_group_arn = create_dataset_group_response['datasetGroupArn']
print(json.dumps(create_dataset_group_response, indent=2))

#### Wait for Dataset Group to Have ACTIVE Status

Before we can use the Dataset Group in any of the steps below, it must be active. Execute the cell below and wait for it to show ACTIVE.

In [None]:
%%time

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)
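
The same polling pattern comes up again later for the import job, so it can be factored into a small helper. The sketch below is one possible shape for it. The name `wait_for_status` and its parameters are assumptions for this example, and a stand-in function replaces the real describe call so that the code runs without AWS access.

```python
import time


def wait_for_status(describe, terminal=("ACTIVE", "CREATE FAILED"),
                    timeout_s=3 * 60 * 60, poll_s=60, sleep=time.sleep):
    """Call describe() until it returns a terminal status or the timeout passes."""
    deadline = time.time() + timeout_s
    while True:
        status = describe()
        if status in terminal:
            return status
        if time.time() >= deadline:
            raise TimeoutError(f"last status was {status!r}")
        sleep(poll_s)


# Stand-in for a describe call: yields a made-up sequence of statuses.
fake_statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
result = wait_for_status(lambda: next(fake_statuses), poll_s=0)
print(result)
```

With the client from this notebook, the callable would be something like `lambda: personalize.describe_dataset_group(datasetGroupArn=dataset_group_arn)["datasetGroup"]["status"]`. Unlike the loop above, this version raises on timeout instead of silently falling through.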

### Create the Dataset

First define a schema for the interactions:

In [None]:
%%time

interactions_schema = schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}

create_schema_response = personalize.create_schema(
    name = "personalize-poc-lastfm-interactions",
    schema = json.dumps(interactions_schema)
)

schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))
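
A mismatch between the CSV header and the schema field names is an easy mistake to make, so a quick local comparison can save a failed import. The sketch below assumes that columns are matched to schema fields by name. The helper `missing_and_extra_columns` is hypothetical, and the schema is repeated here only to keep the example self-contained.

```python
def missing_and_extra_columns(csv_header, schema):
    """Compare CSV column names with the field names declared in the schema."""
    expected = {field["name"] for field in schema["fields"]}
    actual = set(csv_header)
    return sorted(expected - actual), sorted(actual - expected)


# Same fields as the interactions schema defined in the cell above.
example_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}

header = "USER_ID,ITEM_ID,TIMESTAMP".split(",")
missing, extra = missing_and_extra_columns(header, example_schema)
print("missing:", missing, "extra:", extra)
```

Running the same check against the original headers (`userID`, `artistID`, `timestamp`) would report all three schema fields as missing, which is why the rename step earlier matters.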

Now create a dataset with that schema.

In [None]:
%%time

dataset_type = "INTERACTIONS"
create_dataset_response = personalize.create_dataset(
    name = "personalize-poc-lastfm-ints",
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = schema_arn
)

dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

In [None]:
interactions_dataset_arn = dataset_arn

## Configuring S3 and IAM 


Amazon Personalize will need an S3 bucket to act as the source of your data, as well as IAM roles for accessing it. The code below will set all that up.

Now, using the metadata stored on this SageMaker Notebook instance, determine the region we are operating in. If you are using a Jupyter Notebook outside of SageMaker, simply define `region` as the string that indicates the region you would like to use for Personalize and S3.


In [None]:
with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
    data = json.load(notebook_info)
    resource_arn = data['ResourceArn']
    region = resource_arn.split(':')[3]
print(region)
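
The cell above relies on the region being the fourth colon-separated field of an ARN (`arn:partition:service:region:account-id:resource`). The sketch below isolates that step and adds a basic sanity check. The function name `region_from_arn` and the example ARN are made up for illustration.

```python
def region_from_arn(arn):
    """Extract the region from an ARN of the form
    arn:partition:service:region:account-id:resource."""
    parts = arn.split(":", 5)
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return parts[3]


# Made-up ARN used purely as an example input.
example_arn = "arn:aws:sagemaker:us-west-2:123456789012:notebook-instance/example"
print(region_from_arn(example_arn))
```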

In [None]:
session = boto3.Session(region_name=region)

In [None]:
print(region)
s3 = boto3.client('s3')
account_id = boto3.client('sts').get_caller_identity().get('Account')
bucket_name = account_id + "personalizepoc"
print(bucket_name)
if region != "us-east-1":
    s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': region})
else:
    s3.create_bucket(Bucket=bucket_name)
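
The branch in the cell above exists because `us-east-1` is the one region where `create_bucket` is called without a `LocationConstraint`. One way to keep that rule in a single place is to build the keyword arguments separately, as in this sketch. The helper name `create_bucket_kwargs` and the bucket name are assumptions for the example.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for an S3 create_bucket call in the given region."""
    kwargs = {"Bucket": bucket_name}
    if region != "us-east-1":
        # Every region other than us-east-1 needs an explicit location constraint.
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(create_bucket_kwargs("123456789012personalizepoc", "us-east-1"))
print(create_bucket_kwargs("123456789012personalizepoc", "eu-west-1"))
```

In the notebook this would be used as `s3.create_bucket(**create_bucket_kwargs(bucket_name, region))`.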

#### Attach Policy to S3 Bucket
Amazon Personalize needs to be able to read the content of your S3 bucket that you created earlier. The lines below will do that.

In [None]:
s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:*Object",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket_name),
                "arn:aws:s3:::{}/*".format(bucket_name)
            ]
        }
    ]
}

s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))
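
The `s3:*Object` action above also allows writes and deletes. If Personalize only needs to read the interactions file, a narrower policy may be enough. The sketch below is one possible read-only variant. Whether read-only access is sufficient for the later notebooks is an assumption you should verify, and the helper name and bucket name are made up for the example.

```python
import json


def read_only_bucket_policy(bucket_name):
    """Build a bucket policy that lets Amazon Personalize list and read objects."""
    return {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy",
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                # Read-only: no PutObject or DeleteObject.
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }


policy_json = json.dumps(read_only_bucket_policy("123456789012personalizepoc"), indent=2)
print(policy_json)
```

The resulting JSON string could be passed as the `Policy` argument of `s3.put_bucket_policy` in place of the broader policy.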

### Create Personalize Role
Amazon Personalize also needs the ability to assume roles in AWS in order to have the permissions to execute certain tasks. The lines below grant that.

In [None]:
iam = boto3.client("iam")

role_name = "PersonalizeRolePOC"
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "personalize.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}

create_role_response = iam.create_role(
    RoleName = role_name,
    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)

# AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes "personalize" or "Personalize" 
# if you would like to use a bucket with a different name, please consider creating and attaching a new policy
# that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess"
iam.attach_role_policy(
    RoleName = role_name,
    PolicyArn = policy_arn
)

# Now add S3 support
iam.attach_role_policy(
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
    RoleName=role_name
)
time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate

role_arn = create_role_response["Role"]["Arn"]
print(role_arn)

#### Upload to S3

Before Personalize can import the data, it needs to be in S3.

In [None]:
# Upload Interactions File
interactions_file_path = data_dir + "/" + interactions_filename
boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_filename).upload_file(interactions_file_path)
interactions_s3DataPath = "s3://"+bucket_name+"/"+interactions_filename

## Importing the Interactions Data

Earlier you created the DatasetGroup and Dataset to house your information. Now you will execute an import job that loads the data from S3 into Amazon Personalize so it can be used to build your model.

#### Create Dataset Import Job

In [None]:
%%time

create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "personalize-poc-import1",
    datasetArn = interactions_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, interactions_filename)
    },
    roleArn = role_arn
)

dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

#### Wait for Dataset Import Job to Have ACTIVE Status
It can take a while before the import job completes. Please wait until you see that it is ACTIVE below.

In [None]:
%%time

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("DatasetImportJob: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

Now that the dataset import is active, you are ready to start building models with SIMS, Personalized-Ranking, Popularity-Count, and HRNN. Work will continue in other notebooks. Before moving on, run the cell below to store a few values for use in the next notebooks.

In [None]:
%store interactions_dataset_arn
%store dataset_group_arn
%store bucket_name
%store role_arn
%store role_name
%store data_dir