# Validating and Importing Item Metadata <a class="anchor" id="top"></a>

In this notebook, you will pick up where you left off in `01_Validating_and_Importing_User_Item_Interaction_Data.ipynb` to build a working item metadata dataset. This will allow you to work with filters and, later, to support the `User-Personalization` or `HRNN-Metadata` algorithms.


To run this notebook, you need to have run the previous notebook, `01_Validating_and_Importing_User_Item_Interaction_Data`, where you created a dataset and imported interaction data into Amazon Personalize. At the end of that notebook, you saved some of the variable values, which you now need to load into this notebook.

In [1]:
%store -r

## Project constants

In [2]:
dataset_name = "movielense"
project_name = "sbcPersonalizePOC"
dataset_type = "ITEMS"
itemmetadata_filename = "item-meta.csv"

## Prepare your Item metadata <a class="anchor" id="prepare"></a>
[Back to top](#top)

Next, load the data, confirm that it is in a good state, and save it to a CSV file that is ready to be used with Amazon Personalize.

To get started, import a collection of Python libraries commonly used in data science.

In [3]:
import json
import os
import time
from datetime import datetime

import boto3
import pandas as pd

Next, open the data file and take a look at the first several rows.

In [4]:
original_data = pd.read_csv(dataset_dir + '/movies.csv')
original_data.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
original_data.describe()

Unnamed: 0,movieId
count,9742.0
mean,42200.353623
std,52160.494854
min,1.0
25%,3248.25
50%,7300.0
75%,76232.0
max,193609.0


These summary statistics do not tell us much, because `movieId` is an identifier rather than a measurement, so next we look at the raw column information. The preview above also shows that genres are grouped together in a single pipe-separated string, which is fine for us because Amazon Personalize supports this structure.

In [6]:
original_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
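
As a rough illustration of the pipe-separated genre structure noted earlier, the sketch below counts how often each individual genre appears. It uses plain Python on a few hypothetical sample rows rather than the notebook's DataFrame, and `count_genres` is a helper name introduced here for illustration only.

```python
from collections import Counter

# Hypothetical sample rows mirroring the `genres` column previewed earlier.
sample_genres = [
    "Adventure|Animation|Children|Comedy|Fantasy",
    "Adventure|Children|Fantasy",
    "Comedy|Romance",
    "Comedy|Drama|Romance",
    "Comedy",
]


def count_genres(values, separator="|"):
    """Count how often each individual genre appears across all rows."""
    counts = Counter()
    for value in values:
        # Split the grouped string and ignore empty fragments.
        counts.update(part.strip() for part in value.split(separator) if part.strip())
    return counts


genre_counts = count_genres(sample_genres)
print(genre_counts.most_common(3))
```

A count like this can help you judge how many distinct categorical values the `GENRE` field will carry before you import it.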


From this, you can see that the dataset contains 9,742 entries (this is the small MovieLens dataset; the full dataset has more than 62,000) across 3 columns.

This is a pretty minimal dataset of just the movieId, title, and the list of genres that apply to each entry. However, there is additional data available in the MovieLens dataset. For instance, the title includes the year of the movie's release. Let's make that another column of metadata.

In [7]:
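# Raw string so that the backslashes reach the regex engine unchanged.
# The leading greedy `.*` means the last parenthesised group in the title is captured.
original_data['year'] = original_data['title'].str.extract(r'.*\((.*)\).*', expand=False)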
original_data.head(5)

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II (1995),Comedy,1995
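
Because the schema later declares `YEAR` as an integer, it may be worth checking which titles do not yield a usable year. The sketch below is one possible check in plain Python. It assumes a stricter pattern than the notebook's (a four-digit year in parentheses at the end of the title), and both `extract_year` and the sample titles are hypothetical.

```python
import re

# Assumed pattern: a four-digit year in parentheses at the very end of the title.
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def extract_year(title):
    """Return the release year as an int, or None when the title has no year."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


# Hypothetical titles; the last two illustrate edge cases.
titles = [
    "Toy Story (1995)",
    "Jumanji (1995)",
    "Example Film (Alternate Title) (2001)",
    "Untitled Example",
]

years = {title: extract_year(title) for title in titles}
missing = [title for title, year in years.items() if year is None]

print(years)
print("Titles without a year:", missing)
```

Rows that end up in `missing` would need to be dropped or given a default value before import, depending on how you want to treat items without a release year.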


From an item metadata perspective, we only want to include information that is relevant to training a model and/or filtering results, so we will drop the title and retain the genre and year information.

In [8]:
itemmetadata_df = original_data.copy()
itemmetadata_df = itemmetadata_df[['movieId', 'genres', 'year']]
itemmetadata_df.head()

Unnamed: 0,movieId,genres,year
0,1,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Adventure|Children|Fantasy,1995
2,3,Comedy|Romance,1995
3,4,Comedy|Drama|Romance,1995
4,5,Comedy,1995


After manipulating the data, always confirm whether the data types have changed.

In [9]:
itemmetadata_df.dtypes

movieId     int64
genres     object
year       object
dtype: object

Amazon Personalize requires an `ITEM_ID` column, which will map to our `movieId`, and now we can flesh out more information by specifying `GENRE` and `YEAR` as well.

In [10]:
itemmetadata_df.rename(columns = {'genres':'GENRE', 'movieId':'ITEM_ID', 'year':'YEAR'}, inplace = True) 

That's it! At this point the data is ready to go, and we just need to save it as a CSV file.

In [11]:

itemmetadata_df.to_csv((data_dir+"/"+itemmetadata_filename), index=False, float_format='%.0f')
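
Before uploading, it can be useful to re-read the saved file and look for rows that would not fit the intended column types. The following is a minimal sketch using only the standard library; `check_item_csv` is a hypothetical helper, and the in-memory sample stands in for the real CSV file.

```python
import csv
import io


def check_item_csv(handle):
    """Return (row_count, problems) for an item metadata CSV file object."""
    reader = csv.DictReader(handle)
    problems = []
    row_count = 0
    # The header is line 1, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        row_count += 1
        if not row.get("ITEM_ID"):
            problems.append(f"line {line_number}: empty ITEM_ID")
        year = row.get("YEAR") or ""
        if year and not year.isdigit():
            problems.append(f"line {line_number}: non-numeric YEAR {year!r}")
    return row_count, problems


# Hypothetical sample; the last row carries a deliberately bad YEAR value.
sample_csv = io.StringIO(
    "ITEM_ID,GENRE,YEAR\n"
    "1,Adventure|Animation|Children|Comedy|Fantasy,1995\n"
    "2,Adventure|Children|Fantasy,1995\n"
    "3,Comedy|Romance,unknown\n"
)

row_count, problems = check_item_csv(sample_csv)
print(row_count, problems)
```

To run the same check on the real file, you could pass an open file handle for the CSV path instead of the `io.StringIO` sample.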

In [12]:
# Configure the SDK to Personalize:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')

### Create the dataset

First, define a schema to tell Amazon Personalize what type of dataset you are uploading. There are several reserved and mandatory keywords required in the schema, based on the type of dataset. More detailed information can be found in the [documentation](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html).

Here, you will create a schema for item metadata, which needs the `ITEM_ID`, `GENRE`, and `YEAR` fields. These must be defined in the same order in the schema as they appear in the dataset.

In [13]:
itemmetadata_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "GENRE",
            "type": "string",
            "categorical": True
        },
        {
            "name": "YEAR",
            "type": "int"
        }
    ],
    "version": "1.0"
}

create_schema_response = personalize.create_schema(
    name = f"{project_name}-{dataset_name}-item",
    schema = json.dumps(itemmetadata_schema)
)

itemmetadataschema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:726011567823:schema/sbcPersonalizePOC-movielense-item",
  "ResponseMetadata": {
    "RequestId": "3fcca1b9-3b85-45d4-bded-54faf76fe27a",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 19 Dec 2022 13:06:30 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "99",
      "connection": "keep-alive",
      "x-amzn-requestid": "3fcca1b9-3b85-45d4-bded-54faf76fe27a"
    },
    "RetryAttempts": 0
  }
}
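
Since the schema fields need to line up with the CSV columns, a small comparison can catch mismatches before an import job is started. The sketch below copies the field list from the schema in this notebook; `compare_columns` is a hypothetical helper, and the column list passed to it is an assumption about what the CSV header contains.

```python
# Field list copied from the item schema defined in this notebook.
item_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "ITEM_ID", "type": "string"},
        {"name": "GENRE", "type": "string", "categorical": True},
        {"name": "YEAR", "type": "int"},
    ],
    "version": "1.0",
}


def compare_columns(schema, csv_columns):
    """Report differences between schema field names and CSV column names."""
    expected = [field["name"] for field in schema["fields"]]
    return {
        "missing_from_csv": [name for name in expected if name not in csv_columns],
        "not_in_schema": [name for name in csv_columns if name not in expected],
        "same_order": expected == list(csv_columns),
    }


# Assumed CSV header; in the notebook this would be list(itemmetadata_df.columns).
report = compare_columns(item_schema, ["ITEM_ID", "GENRE", "YEAR"])
print(report)
```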


With a schema created, you can create a dataset within the dataset group. Note, this does not load the data yet. This will happen a few steps later.

In [14]:
create_dataset_response = personalize.create_dataset(
    name = f"{project_name}-{dataset_name}-items",
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = itemmetadataschema_arn
)

items_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:726011567823:dataset/sbcPersonalizePOC-movielense/ITEMS",
  "ResponseMetadata": {
    "RequestId": "bdcf3026-d222-43e4-acfd-520ccc57c570",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 19 Dec 2022 13:06:34 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "102",
      "connection": "keep-alive",
      "x-amzn-requestid": "bdcf3026-d222-43e4-acfd-520ccc57c570"
    },
    "RetryAttempts": 0
  }
}


### Upload data to S3

Now that your Amazon S3 bucket has been created, upload the CSV file of your item metadata.

In [15]:
itemmetadata_file_path = data_dir + "/" + itemmetadata_filename
boto3.Session().resource('s3').Bucket(bucket_name).Object(itemmetadata_filename).upload_file(itemmetadata_file_path)
# S3 location of the uploaded item metadata file.
itemmetadata_s3DataPath = "s3://" + bucket_name + "/" + itemmetadata_filename

## Import the item metadata <a class="anchor" id="import"></a>
[Back to top](#top)

Earlier you created the dataset group and dataset to house your information, so now you will execute an import job that will load the data from the S3 bucket into the Amazon Personalize dataset. 

In [17]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = f"{project_name}-item-import",
    datasetArn = items_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, itemmetadata_filename)
    },
    roleArn = role_arn
)

dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:726011567823:dataset-import-job/sbcPersonalizePOC-item-import",
  "ResponseMetadata": {
    "RequestId": "a07a878c-23ef-4ee2-99dd-317094d77753",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 19 Dec 2022 13:08:33 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "117",
      "connection": "keep-alive",
      "x-amzn-requestid": "a07a878c-23ef-4ee2-99dd-317094d77753"
    },
    "RetryAttempts": 0
  }
}


Before we can use the dataset, the import job must be active. Execute the cell below and wait for it to show the ACTIVE status. It checks the status of the import job every 60 seconds, up to a maximum of 6 hours, and stops early if the job reports CREATE FAILED.

Importing the data can take some time, depending on the size of the dataset. In this workshop, the data import job should take around 15 minutes.

In [18]:
%%time

max_time = time.time() + 6*60*60 # 6 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("DatasetImportJob: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: ACTIVE
CPU times: user 69.7 ms, sys: 7.95 ms, total: 77.6 ms
Wall time: 5min
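
The polling loop above can also be wrapped in a reusable function. The sketch below is one way to do that under a few assumptions: `wait_for_status` is a hypothetical helper, and a fake status source stands in for `personalize.describe_dataset_import_job(...)["datasetImportJob"]["status"]` so that the example runs without AWS access.

```python
import time

# Statuses at which the loop in this notebook stops polling.
TERMINAL_STATUSES = {"ACTIVE", "CREATE FAILED"}


def wait_for_status(get_status, poll_seconds=60, timeout_seconds=6 * 60 * 60):
    """Poll get_status() until a terminal status or the timeout; return the last status."""
    deadline = time.monotonic() + timeout_seconds
    status = get_status()
    while status not in TERMINAL_STATUSES and time.monotonic() < deadline:
        time.sleep(poll_seconds)
        status = get_status()
    return status


# Fake status source standing in for the describe_dataset_import_job call.
fake_statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=0)

if final_status != "ACTIVE":
    raise RuntimeError(f"Import job did not become active: {final_status}")
print("Final status:", final_status)
```

Returning the last status lets the caller decide how to react to a failure or a timeout, instead of silently falling out of the loop.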


With this import now complete, you can enable filtering for your recommendations as well as support `HRNN-Metadata`. Before moving on, run the cells below to store a few values for use in the following notebooks. After completing them, open notebook `03_Creating_and_Evaluating_Solutions.ipynb` to continue.

In [20]:
import os
from utils import save_json

os.makedirs("results", exist_ok=True)

data = {
    "items_dataset_arn": items_dataset_arn,
    "itemmetadataschema_arn": itemmetadataschema_arn,
}

save_json("results/02.json", data=data)

json file saved at: results/02.json


In [21]:
%store items_dataset_arn
%store itemmetadataschema_arn

Stored 'items_dataset_arn' (str)
Stored 'itemmetadataschema_arn' (str)


In [22]:
json_data = dict(globals().copy())
all_arns = dict()

for key, val in json_data.items():
    if ("arn" in key):
        print(f"{key}: {val}\n")
        all_arns[key] = val
        
save_json("results/02_all_arns.json", data=all_arns)

interactions_dataset_arn: arn:aws:personalize:us-east-1:726011567823:dataset/sbcPersonalizePOC-movielense/INTERACTIONS

dataset_group_arn: arn:aws:personalize:us-east-1:726011567823:dataset-group/sbcPersonalizePOC-movielense

role_arn: arn:aws:iam::726011567823:role/sbcPersonalizePOCRolePOC

interaction_schema_arn: arn:aws:personalize:us-east-1:726011567823:schema/sbcPersonalizePOC-movielense-interactions

itemmetadataschema_arn: arn:aws:personalize:us-east-1:726011567823:schema/sbcPersonalizePOC-movielense-item

items_dataset_arn: arn:aws:personalize:us-east-1:726011567823:dataset/sbcPersonalizePOC-movielense/ITEMS

dataset_import_job_arn: arn:aws:personalize:us-east-1:726011567823:dataset-import-job/sbcPersonalizePOC-item-import

json file saved at: results/02_all_arns.json
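
The cell above matches any global whose name contains `arn`, which could also pick up unrelated names. As a hedged alternative, the sketch below keeps only string values that look like ARNs under keys ending in `_arn`. `collect_arns` is a hypothetical helper, and the sample namespace uses placeholder ARNs rather than real ones.

```python
def collect_arns(namespace):
    """Return string-valued entries whose key ends with '_arn' and whose value looks like an ARN."""
    return {
        key: value
        for key, value in namespace.items()
        if key.endswith("_arn") and isinstance(value, str) and value.startswith("arn:")
    }


# Placeholder namespace standing in for dict(globals()); values are not real ARNs.
sample_namespace = {
    "items_dataset_arn": "arn:aws:personalize:us-east-1:123456789012:dataset/example/ITEMS",
    "role_arn": "arn:aws:iam::123456789012:role/ExampleRole",
    "all_arns": {"nested": "not a string value"},
    "learning_rate": 0.1,
}

arns = collect_arns(sample_namespace)
print(sorted(arns))
```

In the notebook, you could call `collect_arns(dict(globals()))` and pass the result to `save_json` in place of `all_arns`.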


In [23]:
with open("results/02_all.txt", "w+") as f:
    f.write(str(json_data))