# Introduction and Data Preparation

## Contents

This demo will walk you through how to create personalized marketing content (for instance emails) for each user using [Amazon Personalize](https://aws.amazon.com/personalize/) and [Amazon Bedrock](https://aws.amazon.com/bedrock/).

1. Build a work environment (follow the steps below).
2. Format your data to use with [Amazon Personalize](https://aws.amazon.com/personalize/). 
3. Train an Amazon Personalize 'Top picks for you' Recommender to get personalized recommendations for each user.
4. Generate a prompt that includes the user's preferences, recommendations, and demographics.
5. Generate a custom email for each user with [Amazon Bedrock](https://aws.amazon.com/bedrock/).

## Architecture diagram


<img src="./images/architecture.png" alt="architecture_diagram" style="width:800px;"/><br/>
<center>Fig. 1 Architecture diagram.</center>

## How to Use the Notebook

The code is broken up into cells. There's a triangular `Run` button at the top of this page that you can click to execute each cell and move on to the next, or you can press `Shift` + `Enter` while in the cell to execute it and move on to the next one.

While a cell is executing, the indicator to its left shows an `*`. Once all the code in the cell has finished running, the `*` changes to a number that indicates the order in which the cell completed.

Follow the instructions below and execute the cells to run this sample.

## Preparing the Data

### Items data

The item data consists of information about the content that is being interacted with. It generally comes from a Content Management System (CMS).

In order to provide additional metadata, and to provide a consistent experience for our users, we leverage a subset of the IMDb Essential Metadata for Movies/TV/OTT dataset. IMDb is the world's most popular and authoritative source for information on movies, TV shows, and celebrities and powers entertainment experiences around the world. License IMDb entertainment metadata from over 10 million movies, TV series, and Video Game titles including 12 million cast and crew, 1 billion star ratings, and global box office grosses from Box Office Mojo. All IMDb data products are updated daily and easily accessed through AWS Data Exchange.

The IMDb Essential Metadata for Movies/TV/OTT dataset contains:

- 9+ million titles
- 12+ million names
- Film, TV, music and celebrities
- 1 billion ratings from the world’s largest entertainment fan community

IMDb has [multiple datasets available in the Amazon Data Exchange](https://aws.amazon.com/marketplace/seller-profile?id=0af153a3-339f-48c2-8b42-3b9fa26d3367). <img src="./images/IMDb_Logo_Rectangle.png" alt="IMDb logo" style="width:50px;"/>

For this workshop we have extracted the subset of data we need from the IMDb Essential Metadata for Movies/TV/OTT (Bulk data) dataset and prepared it for use. It includes the following fields:

- TITLE
- YEAR
- IMDB_RATING
- IMDB_NUMBEROFVOTES
- PLOT
- US_MATURITY_RATING_STRING
- US_MATURITY_RATING
- GENRES

In addition, we added two fields that are not derived from the IMDb dataset but will help us with our fictional use case:

- CREATION_TIMESTAMP
- PROMOTION

For the purpose of this workshop we will use the IMDb TT ID to provide a common identifier between the interactions data and the content metadata. MovieLens provides its own identifier as well as the IMDb TT ID (without the leading 'tt') in the 'links.csv' file.


<div class="alert alert-block alert-warning">
<b>Note: </b>Your use of IMDb data is for the sole purpose of completing the AWS workshop and/or tutorial. Any use of IMDb data outside of the AWS workshop and/or tutorial requires a data license from IMDb. To obtain a data license, please contact: imdb-licensing-support@imdb.com. You will not (and will not allow a third party to) (i) use IMDb data, or any derivative works thereof, for any purpose; (ii) copy, sublicense, rent, sell, lease or otherwise transfer or distribute IMDb data or any portion thereof to any person or entity for any purpose not permitted within the workshop and/or tutorial; (iii) decompile, disassemble, or otherwise reverse engineer or attempt to reconstruct or discover any source code or underlying ideas or algorithms of IMDb data by any means whatsoever; or (iv) knowingly remove any product identification, copyright or other notices from IMDb data.</div>

<div class="alert alert-block alert-warning">
<b>Note: </b>This dataset is not required for Amazon Personalize to generate recommendations, but providing good item metadata will ensure the best results in your trained models.
</div>

### Interactions data

* Interactions data: we use the ml-latest-small dataset from the [MovieLens](https://grouplens.org/datasets/movielens/) project as a proxy for user-item interactions.

The interaction data consists of information about the interactions the users of the fictional app will have with the content. This usually comes from analytics tools or Customer Data Platforms (CDPs). The best interaction data for Amazon Personalize includes the sequential order of user behavior: what content was watched or clicked on, and the order in which it was interacted with. To simulate our interaction data, we will be using data from the [MovieLens project](https://grouplens.org/datasets/movielens/). MovieLens offers multiple versions of its dataset. For the purposes of this workshop we will use the reduced version (approximately 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users).

### User data

In this example we will not use user data to train the Amazon Personalize model, because it is not available in the MovieLens dataset we are using. However, we will experiment with different user personas when working on the email preparation and prompt.

<div class="alert alert-block alert-warning">
<b>Note:</b> This dataset is not mandatory for Amazon Personalize to generate recommendations, but providing good item metadata will ensure the best results in your trained models.
</div>



## Set up environment

In [None]:
# Update the installed packages
!pip uninstall -y awscli
!pip install awscli
!pip uninstall -y boto3 botocore
!pip install botocore
!pip install boto3
!pip uninstall -y numexpr
!pip install numexpr

In [None]:
# Import packages
import boto3
import pprint
import time
import pandas as pd
import numpy as np
import re
import json
import random

## Load the variable names
We load the variable names from a file. These are shared with the pretrained models.

In [None]:
# Open the JSON file and load the parameters (the file is closed automatically)
with open('params.json') as f:
    parameters = json.load(f)

In [None]:
workshop_dataset_group_name = parameters['datasetGroup']['serviceConfig']['name']

interactions_schema_name = parameters['datasets']['interactions']['schema']['serviceConfig']['name']
interactions_dataset_name = parameters['datasets']['interactions']['dataset']['serviceConfig']['name']

items_schema_name = parameters['datasets']['items']['schema']['serviceConfig']['name']
items_dataset_name = parameters['datasets']['items']['dataset']['serviceConfig']['name']

# The following are the prefixes of the import job names that can be created
interactions_import_job_name = 'dataset_import_interaction'
items_import_job_name = 'dataset_import_item'
# users_import_job_name = 'dataset_import_user'

for recommender in parameters['recommenders']:
    # This is currently configured assuming only one recommender, if there are multiple
    # recommenders of the same type further configuration is needed.
    if recommender['serviceConfig']['recipeArn'] == 'arn:aws:personalize:::recipe/aws-vod-top-picks':
        recommender_top_picks_for_you_name = recommender['serviceConfig']['name']
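
The lookups above imply a particular nesting inside `params.json`. The sketch below shows a hypothetical minimal structure that is consistent with those lookups. All names such as `example-dataset-group` are placeholders, not the workshop's real values. It also includes a small illustrative helper, `find_recommender_name`, which returns `None` when no recommender uses the requested recipe, so the variable is never left undefined.

```python
import json

# Hypothetical params.json content; only the keys read in this notebook are shown.
example_params = json.loads("""
{
  "datasetGroup": {"serviceConfig": {"name": "example-dataset-group"}},
  "datasets": {
    "interactions": {
      "schema": {"serviceConfig": {"name": "example-interactions-schema"}},
      "dataset": {"serviceConfig": {"name": "example-interactions-dataset"}}
    },
    "items": {
      "schema": {"serviceConfig": {"name": "example-items-schema"}},
      "dataset": {"serviceConfig": {"name": "example-items-dataset"}}
    }
  },
  "recommenders": [
    {"serviceConfig": {"name": "example-top-picks",
                       "recipeArn": "arn:aws:personalize:::recipe/aws-vod-top-picks"}}
  ]
}
""")

TOP_PICKS_RECIPE_ARN = "arn:aws:personalize:::recipe/aws-vod-top-picks"


def find_recommender_name(params, recipe_arn):
    """Return the name of the first recommender that uses recipe_arn, or None."""
    for recommender in params.get("recommenders", []):
        config = recommender["serviceConfig"]
        if config["recipeArn"] == recipe_arn:
            return config["name"]
    return None


example_name = find_recommender_name(example_params, TOP_PICKS_RECIPE_ARN)
print(example_name)
```

Checking the result for `None` before continuing gives a clearer failure than a `NameError` later in the notebook.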

In [None]:
# Make the directories (-p avoids an error if they already exist)
data_dir = 'poc_data'
!mkdir -p $data_dir
imdb_dir = data_dir+'/imdb'
!mkdir -p $imdb_dir
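
The same directories can also be created from Python with `os.makedirs(..., exist_ok=True)`, which is safe to call repeatedly when the notebook is rerun. The sketch below is an alternative, not a replacement for the cell above. The `ensure_dirs` helper is hypothetical, and the example uses a temporary base directory so that it does not touch your working folder.

```python
import os
import tempfile


def ensure_dirs(base_dir, data_dir_name="poc_data", sub_dir_name="imdb"):
    """Create base_dir/poc_data/imdb if it is missing and return the imdb path."""
    path = os.path.join(base_dir, data_dir_name, sub_dir_name)
    os.makedirs(path, exist_ok=True)  # creates parent folders; no error if present
    return path


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_dirs(tmp)
    second = ensure_dirs(tmp)  # calling it again does not raise
    created = os.path.isdir(first) and first == second

print(created)
```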

In [None]:
# variable names
items_filename = "items.csv"
interactions_filename = "interactions.csv"
movielens_dataset_dir = data_dir + "/ml-latest-small/"

## Prepare the Item Metadata <a class="anchor" id="prepare_items"></a>
[Back to top](#top)

In [None]:
# Download IMDB data
!wget -P $imdb_dir https://d2peeor3oplhc6.cloudfront.net/personalize-immersionday-media/imdb/items.csv

In [None]:
# Read the IMDb item data
item_data = pd.read_csv(data_dir + '/imdb/items.csv', sep=',', dtype={'PROMOTION': "string"})
item_data.head(5)

In [None]:
item_data.isnull().sum()

That's it! At this point the item data is ready to go, and we just need to save it as a CSV file.

In [None]:
item_data.to_csv((data_dir+"/"+items_filename), index=True, float_format='%.0f')

## Prepare the Interactions data 

First, you will download the dataset from the [MovieLens project](https://grouplens.org/datasets/movielens/) website and unzip it in a new folder using the code below.

In [None]:
# copy movielens data
!cd $data_dir && wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!cd $data_dir && unzip -o ml-latest-small.zip

Take a look at the README.txt file and its licensing terms. Do not skip over the usage license!

In [None]:
!pygmentize $data_dir/ml-latest-small/README.txt

The primary data we are interested in for a recommendation use case is the actual interactions that the users had with the titles (items).

Open the `ratings.csv` file and take a look at some rows from throughout the dataset.

In [None]:
interaction_data = pd.read_csv(movielens_dataset_dir + '/ratings.csv', sep=',', dtype={'userId': "int64", 'movieId': "str"})
interaction_data.sample(10)

In [None]:
interaction_data.info()

### Convert the Interactions Data

The interaction data is generally acquired from analytics or CDP platforms that can identify individual interactions with content/items within a platform.

We need to do a few things to get this dataset ready to substitute for our service's interaction data.

First off, the movieId is a unique identifier provided by MovieLens for each title. However, as we saw above, IMDb has a much richer set of metadata about the content catalog. In order to use the IMDb data, we need a common identifier between our items and our interactions datasets, which is the IMDb imdbId. To do this, MovieLens provides the 'links.csv' file, which helps convert between the two identifiers.

In [None]:
links = pd.read_csv(movielens_dataset_dir + '/links.csv', sep=',', usecols=[0,1], encoding='latin-1', dtype={'movieId': "str", 'imdbId': "str", 'tmdbId': "str"})
pd.set_option('display.max_rows', 25)
links['imdbId'] = 'tt' + links['imdbId'].astype(object)
links
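
The `dtype` argument matters in the cell above: MovieLens stores the IMDb ID as digits without the `tt` prefix, and some IDs start with zeros. The sketch below uses a small made-up CSV (the rows are illustrative and not taken from `links.csv`) to show that letting pandas infer an integer type loses the leading zeros, while reading the column as a string keeps them.

```python
import io

import pandas as pd

# Made-up rows in the same column layout as links.csv
csv_text = "movieId,imdbId,tmdbId\n1,0000123,11\n2,1234567,22\n"

# Default type inference parses imdbId as an integer
as_int = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1])

# Forcing string dtypes keeps the digits exactly as written
as_str = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1],
                     dtype={"movieId": "str", "imdbId": "str"})

ids_from_int = ("tt" + as_int["imdbId"].astype(str)).tolist()
ids_from_str = ("tt" + as_str["imdbId"]).tolist()

print(ids_from_int)
print(ids_from_str)
```

Only the string version produces IDs that can match the `ITEM_ID` values in the item catalog.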

As you can see, this provides a way to identify the IMDb ID for every title in our interactions dataset. Now we will convert the ratings.csv data to use the IMDb ID.

In [None]:
imdb_data = interaction_data.merge(links, on='movieId')
imdb_data.drop(columns='movieId', inplace = True)

In [None]:
imdb_data

Now we have an interactions dataset that matches our item catalog dataset.

### Simulating an interaction dataset 

We are going to make one more modification to make the MovieLens dataset more like the analytics data that a video streaming service would see in its interactions. MovieLens is an explicit movie rating dataset, which means users are presented with a movie and asked to give it a rating. For recommendation systems and personalization, the industry has moved toward using more implicit data. There are many reasons for this, including the low number of customers who rate titles and the fact that customers' tastes change over time. One benefit of implicit interaction data is that it reflects the actual behavior of all users and changes over time as their viewing behavior changes.

To convert the explicit interaction MovieLens ratings dataset into an implicit dataset we are going to create a synthetic dataset using the ratings in MovieLens. 

- Implicit interactions are inherently positive interactions, so we will drop any rating of 1 star or below.
- Ratings above 1 star and up to 3 stars are neutral to slightly positive. We will use them to create synthetic "Click" events to simulate a viewer clicking on a title in the UnicornFlix app.
- Ratings above 3 stars are clearly positive. We will use them to create synthetic "Watch" and "Click" events to simulate a viewer both clicking on a title and watching at least 80% of it.

<div class="alert alert-block alert-warning">
<b>Note:</b> These interactions will be directionally accurate, but they are not a good substitute for actual temporal interaction data. The order in which viewers rated movies on the MovieLens website is not as informative as the order of interactions in an actual video-on-demand streaming app. For more information about the importance of temporal interaction data, see https://www.amazon.science/publications/temporal-contextual-recommendation-in-real-time.
</div>

In [None]:
watched_df = imdb_data.copy()
watched_df = watched_df[watched_df['rating'] > 3]
watched_df = watched_df[['userId', 'imdbId', 'timestamp']]
watched_df['EVENT_TYPE']='Watch'
watched_df.head()

In [None]:
clicked_df = imdb_data.copy()
clicked_df = clicked_df[clicked_df['rating'] > 1]
clicked_df = clicked_df[['userId', 'imdbId', 'timestamp']]
clicked_df['EVENT_TYPE']='Click'
clicked_df.head()

In [None]:
interactions_df = clicked_df.copy()

interactions_df = pd.concat([interactions_df, watched_df])
interactions_df.sort_values("timestamp", axis = 0, ascending = True,
                 inplace = True, na_position ='last')
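
MovieLens ratings come in half-star steps, so it can help to check how the `> 1` and `> 3` thresholds used in the cells above bucket each value. The following sketch applies the same filters to a made-up frame (the user and item IDs are placeholders) and lists the events generated per rating. Unlike the notebook cells, it keeps the `rating` column so the result is easy to inspect.

```python
import pandas as pd

# Made-up ratings covering the boundary values of both thresholds
toy = pd.DataFrame({
    "userId": [1, 1, 1, 1, 1, 1],
    "imdbId": ["tt1", "tt2", "tt3", "tt4", "tt5", "tt6"],
    "rating": [1.0, 1.5, 3.0, 3.5, 4.0, 5.0],
    "timestamp": [10, 20, 30, 40, 50, 60],
})


def to_events(ratings):
    """Build Click and Watch events with the same thresholds as the notebook."""
    clicked = ratings[ratings["rating"] > 1].assign(EVENT_TYPE="Click")
    watched = ratings[ratings["rating"] > 3].assign(EVENT_TYPE="Watch")
    events = pd.concat([clicked, watched])
    return events.sort_values("timestamp").reset_index(drop=True)


toy_events = to_events(toy)

# Map each rating value to the sorted list of event types it produced
events_per_rating = (
    toy_events.groupby("rating")["EVENT_TYPE"]
    .apply(lambda s: sorted(s))
    .to_dict()
)
print(events_per_rating)
```

Ratings that produce no event at all do not appear in the resulting dictionary.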

Let's look at the new dataset and make sure that it reflects our fictional streaming service's analytics data.

In [None]:
interactions_df

Amazon Personalize has default column names for the [VIDEO_ON_DEMAND domain dataset](https://docs.aws.amazon.com/personalize/latest/dg/VIDEO-ON-DEMAND-datasets-and-schemas.html): `USER_ID`, `ITEM_ID`, `TIMESTAMP`, and `EVENT_TYPE`. The `EVENT_TYPE` column already uses the default name, so the final modification to the dataset is to replace the remaining column headers with the default headers.

In [None]:
interactions_df.rename(columns = {'userId':'USER_ID', 'imdbId':'ITEM_ID',
                              'timestamp':'TIMESTAMP'}, inplace = True)
interactions_df

We'll be using a subset of the IMDb dataset for this workshop that has been cleaned to remove movies that don't have valid values for the metadata we are using in our items dataset (we'll work with this more in the next section). We therefore need to make sure we don't have any interactions with IMDb movie IDs that are not in our subset of the IMDb dataset.



In [None]:
# Read only the first two columns (ITEM_ID and TITLE) of the item catalog
movies = pd.read_csv(data_dir + '/imdb' + '/items.csv', sep=',', usecols=[0,1], encoding='latin-1', dtype={'ITEM_ID': "str", 'TITLE': "str"})
pd.set_option('display.max_rows', 25)
movies

In [None]:
movies.nunique(axis=0)

The number of unique ITEM_IDs is not the same in the IMDb data and the interactions data, so we'll remove the interactions whose ITEM_IDs do not have item metadata.

In [None]:
interactions_df = interactions_df.merge(movies, on='ITEM_ID')
interactions_df.info()
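
Because `merge` defaults to an inner join, interactions whose `ITEM_ID` is missing from the item catalog are dropped without any message. A minimal sketch for checking how many rows are removed, using `isin` on made-up data (the IDs and titles are placeholders):

```python
import pandas as pd

toy_interactions = pd.DataFrame({
    "USER_ID": [1, 1, 2, 3],
    "ITEM_ID": ["tt0000001", "tt0000002", "tt0000003", "tt0000001"],
    "TIMESTAMP": [100, 200, 300, 400],
    "EVENT_TYPE": ["Click", "Watch", "Click", "Click"],
})
toy_movies = pd.DataFrame({
    "ITEM_ID": ["tt0000001", "tt0000002"],
    "TITLE": ["Example A", "Example B"],
})

# True for interactions whose item exists in the catalog
has_metadata = toy_interactions["ITEM_ID"].isin(toy_movies["ITEM_ID"])
dropped_count = int((~has_metadata).sum())

# Same inner merge as in the notebook cell
merged = toy_interactions.merge(toy_movies, on="ITEM_ID")

print(dropped_count, len(merged))
```

Running the same `isin` check on `interactions_df` and `movies` before the merge shows how much of the interaction history the filter removes.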

We will also drop the TITLE column as it is not required in the interactions dataset.

In [None]:
interactions_df = interactions_df.drop(columns=['TITLE'])
interactions_df.info()

That's it! At this point the data is ready to go, and we just need to save it as a CSV file.

In [None]:
interactions_df.to_csv((data_dir+"/"+interactions_filename), index=False, float_format='%.0f')
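
Before uploading, it can be worth confirming that the saved file has the expected header and no missing values. The sketch below is one possible check. The `REQUIRED_COLUMNS` list reflects the columns produced in this notebook, and `check_interactions_csv` is a hypothetical helper, not part of any AWS SDK. It reads from in-memory CSV text so it can run on its own. In the notebook you would pass the path `data_dir + "/" + interactions_filename` instead.

```python
import io

import pandas as pd

# Columns produced by this notebook for the interactions dataset
REQUIRED_COLUMNS = ["USER_ID", "ITEM_ID", "TIMESTAMP", "EVENT_TYPE"]


def check_interactions_csv(source):
    """Return a list of problems found in the CSV; an empty list means none."""
    df = pd.read_csv(source, dtype={"ITEM_ID": "str"})
    problems = []

    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))

    # Count nulls only in the required columns that are actually present
    present = [c for c in REQUIRED_COLUMNS if c in df.columns]
    null_count = int(df[present].isnull().sum().sum())
    if null_count:
        problems.append(f"{null_count} null value(s) in required columns")

    return problems


good_csv = "USER_ID,ITEM_ID,TIMESTAMP,EVENT_TYPE\n1,tt0000001,100,Click\n"
bad_csv = "USER_ID,ITEM_ID,TIMESTAMP\n1,,100\n"

good_problems = check_interactions_csv(io.StringIO(good_csv))
bad_problems = check_interactions_csv(io.StringIO(bad_csv))
print(good_problems)
print(bad_problems)
```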

In [None]:
interactions_df

## Prepare the User Metadata <a class="anchor" id="prepare_users"></a>
[Back to top](#top)

The dataset does not have any user metadata so we will extract the distinct user_ids in our interactions dataset and experiment with different types of users later in this workshop. 

In [None]:
# get all unique user ids from the interaction dataset

user_ids = interactions_df['USER_ID'].unique()
user_data = pd.DataFrame()
user_data["USER_ID"]= user_ids
user_data

Let's see how many users we have:

In [None]:
len(user_ids)

## Personalize Model

To get recommendations, you need to train an Amazon Personalize recommender.

In this case we will use the domain optimized Recommender [Top picks for you](https://docs.aws.amazon.com/personalize/latest/dg/VIDEO_ON_DEMAND-use-cases.html#top-picks-use-case): personalized content recommendations for a user that you specify. With this use case, Amazon Personalize automatically filters videos the user watched based on the userId that you specify and Watch events.

In [None]:
%store interactions_filename
%store items_filename
%store recommender_top_picks_for_you_name
%store workshop_dataset_group_name
%store interactions_schema_name
%store interactions_dataset_name
%store interactions_import_job_name
%store items_schema_name
%store items_dataset_name
%store items_import_job_name
%store data_dir
%store user_ids
%store item_data
%store interactions_df


Go to [02_Train_Personalize_Model_01_Data](02_Train_Personalize_Model_01_Data.ipynb) to continue, and follow the instructions there to upload your data and train your model.
