# Welcome to UnicornFlix!

Congratulations! You have just been hired by UnicornFlix, a new Direct to Consumer Streaming service that has just launched and is jumping into the crowded space of Video on Demand/Free Ad Supported TV (FAST) providers. You have been hired into the Search & Discovery team, which leads efforts around personalization. Currently, most of your app does not provide a personalized experience: the movies are presented in a static order for all users. To prevent customer churn, you are looking to add personalized experiences.

You’ve been asked by the founders to:

- Increase subscriber engagement by tailoring every experience to individual users
- Help users discover newly released content
- Increase the breadth of content offered to them from the UnicornFlix catalog
- Reduce the time to value by creating valuable recommendations in a short time

Throughout this workshop you will explore your datasets, build and train several recommendation models, and implement recommendations with APIs.


## In this Notebook

In this notebook, you will choose a dataset and prepare it for use with Amazon Personalize.

1. [How to Use the Notebook](#usenotebook)
1. [Introduction to Amazon Personalize Datasets](#datasets)
1. [Prepare the Item Metadata](#prepare_items)
1. [Prepare the Interactions Data](#prepare_interactions)
1. [Prepare the User Metadata](#prepare_users)


## How to Use the Notebook <a class="anchor" id="usenotebook"></a>

### Executing cells
The code is broken up into cells like the one below. There's a triangular Run button at the top of this page that you can click to execute each cell and move onto the next, or you can press `Shift` + `Enter` while in the cell to execute it and move onto the next one.

While a cell is executing, the indicator to its left shows an `*`. Once all the code in the cell has finished executing, the `*` is replaced by a number that indicates the order in which the cell completed.

Simply follow the instructions below and execute the cells to get started with Amazon Personalize.

### Understanding the code

This notebook can be used in two modalities:

1. Train as you go by executing each cell. Some cells may take a long time to finish executing as they wait for resources to be created. You will have this experience if you deployed the 'simple'  Amazon CloudFormation template ([personalizeIDSimple.yaml](../personalizeIDSimple.yaml)).

2. Go through the notebook with previously created resources. All or the majority of the resources will already be created, and cells will just retrieve the information of these existing resources to use them in the following steps. You will have this experience if you deployed the 'pretrained' Amazon CloudFormation template ([PersonalizeIDPretrained.yaml](../PersonalizeIDPretrained.yaml)).

<div class="alert alert-block alert-warning">
<b>Note:</b> If you deployed the 'pretrained' template, you must wait at least 3 hours for all resources to be trained before proceeding with the workshop.
</div>

Because of this, you will find that some cells have `try` and `except` blocks. In particular most of them are handling a `ResourceAlreadyExistsException` exception. 

You can look at the code in the `try` block to get a good idea of how you can create a resource and understand how to use the Amazon Personalize SDK. The `except` block will let you know that the resource has been created and record the corresponding ARN, which is the Amazon unique identifier.

This is an example of the `try` block for creating a dataset group. This code executes without exceptions if the dataset group does not exist and raises an exception if the dataset group already exists:

```python
try: 
    create_dataset_group_response = personalize.create_dataset_group(
        name = workshop_dataset_group_name,
        domain='VIDEO_ON_DEMAND'
    )

    workshop_dataset_group_arn = create_dataset_group_response['datasetGroupArn']
    print(json.dumps(create_dataset_group_response, indent=2))
    print ('\nCreating the Dataset Group with dataset_group_arn = {}'.format(workshop_dataset_group_arn))
    
```
This is the corresponding `except` block, which is executed if an exception is raised because the dataset group already exists. This block saves the ARN of the existing dataset group for later use and lets you know the resource already exists.

```python
except personalize.exceptions.ResourceAlreadyExistsException as e:
    # Rebuild the ARN of the existing dataset group from its name
    workshop_dataset_group_arn = ('arn:aws:personalize:' + region + ':' + account_id +
        ':dataset-group/' + workshop_dataset_group_name)
    print ('\nThe Dataset Group with dataset_group_arn = {} already exists'.format(
        workshop_dataset_group_arn))
    print ('\nWe will be using the existing Dataset Group dataset_group_arn = {}'.format(
        workshop_dataset_group_arn))
```

Depending on the resource, you may also find that the code sometimes checks a list of resources to find out whether a resource exists, and then uses `if` and `else` blocks to either use the existing resource or create it.
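
The check-then-create pattern can be sketched as follows. This is a minimal illustration rather than the workshop's own code: `find_arn_by_name` is a hypothetical helper, and the sample list only mimics the shape of the `datasetGroups` entries returned by `personalize.list_dataset_groups()` (the name and ARN are made up).

```python
def find_arn_by_name(summaries, name, arn_key):
    """Return the ARN of the first summary whose 'name' matches, else None."""
    for summary in summaries:
        if summary.get("name") == name:
            return summary[arn_key]
    return None


# Hypothetical data shaped like the 'datasetGroups' list in a
# list_dataset_groups() response; the values are illustrative only.
existing_groups = [
    {
        "name": "example-dataset-group",
        "datasetGroupArn": "arn:aws:personalize:us-east-1:123456789012:dataset-group/example-dataset-group",
    }
]

arn = find_arn_by_name(existing_groups, "example-dataset-group", "datasetGroupArn")
if arn is not None:
    print("Using existing resource:", arn)
else:
    print("Resource not found; this is where the create call would go.")
```

Looking a resource up by name before creating it avoids relying on an exception for control flow, at the cost of one extra list call.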

### Let's build!

Python ships with a broad collection of libraries. We need to import some of those, as well as additional installed packages such as [boto3](https://aws.amazon.com/sdk-for-python/) (the AWS SDK for Python) and [Pandas](https://pandas.pydata.org/)/[NumPy](https://numpy.org/), which are core data science tools.

In [None]:
# Get the latest version of botocore to ensure we have the latest features in the SDK
import sys
!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install --upgrade --no-deps --force-reinstall botocore
!pip uninstall -y awscli
!pip install awscli
!pip uninstall -y numexpr
!pip install numexpr

In [None]:
# Import packages
import time
from time import sleep
import json
from datetime import datetime

import boto3
import pandas as pd
import numpy as np


In [None]:
# Make directories
data_dir = "poc_data"
!mkdir $data_dir
imdb_dir = data_dir+'/imdb'
!mkdir $imdb_dir

In [None]:
# Configure the SDK to Personalize:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')

In [None]:
# Get the account id and region to use later
account_id = boto3.client('sts').get_caller_identity().get('Account')
print("account id:", account_id)

with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
    data = json.load(notebook_info)
    resource_arn = data['ResourceArn']
    region = resource_arn.split(':')[3]
print("region:", region)
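
If you run this notebook outside Amazon SageMaker, the metadata file read above will not exist. The following is a minimal sketch of a fallback, assuming you are happy to supply the region yourself in that case. `detect_region` and its `default_region` parameter are hypothetical names, and `"us-east-1"` is only a placeholder.

```python
import json
import os


def detect_region(metadata_path="/opt/ml/metadata/resource-metadata.json",
                  default_region="us-east-1"):
    """Read the region from the SageMaker metadata file, or fall back to a default."""
    if os.path.exists(metadata_path):
        with open(metadata_path) as notebook_info:
            resource_arn = json.load(notebook_info)["ResourceArn"]
        # ARN layout: arn:partition:service:region:account-id:resource
        return resource_arn.split(":")[3]
    # Assumption: outside SageMaker the caller provides the region explicitly
    return default_region


# The region is the fourth colon-separated field of an ARN (made-up ARN below)
example_arn = "arn:aws:sagemaker:eu-west-1:123456789012:notebook-instance/example"
print(example_arn.split(":")[3])  # eu-west-1
```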

If this is a workshop and the resources were created for you, we will retrieve the variables of the resources created.

In [None]:
# Opening JSON file
f = open('./params.json')
parameters = json.load(f)

In [None]:
workshop_dataset_group_name = parameters['datasetGroup']['serviceConfig']['name']

interactions_schema_name = parameters['datasets']['interactions']['schema']['serviceConfig']['name']
interactions_dataset_name = parameters['datasets']['interactions']['dataset']['serviceConfig']['name']

items_schema_name = parameters['datasets']['items']['schema']['serviceConfig']['name']
items_dataset_name = parameters['datasets']['items']['dataset']['serviceConfig']['name']

users_schema_name = parameters['datasets']['users']['schema']['serviceConfig']['name']
users_dataset_name = parameters['datasets']['users']['dataset']['serviceConfig']['name']

#The following job names are the starting Strings of the job names that can be created
interactions_import_job_name = 'dataset_import_interaction'
items_import_job_name = 'dataset_import_item'
users_import_job_name = 'dataset_import_user'

items_filename = "items.csv"
interactions_filename = "interactions.csv"
users_filename = "users.csv"

for recommender in parameters['recommenders']:
    # This is currently configured assuming only one recommender of each type, if there are multiple 
    # recommenders of the same type further configuration is needed.
    if (recommender['serviceConfig']['recipeArn'] == 'arn:aws:personalize:::recipe/aws-vod-more-like-x'):
        recommender_more_like_x_name =recommender['serviceConfig']['name'] 
    if (recommender['serviceConfig']['recipeArn'] == 'arn:aws:personalize:::recipe/aws-vod-top-picks'):
        recommender_top_picks_for_you_name =recommender['serviceConfig']['name']
        
for solution in parameters['solutions']:
    # This is currently configured assuming only one solution of this type, if there are multiple 
    # solutions of the same type further configuration is needed.
    if (solution['serviceConfig']['recipeArn'] == 'arn:aws:personalize:::recipe/aws-personalized-ranking'):
        workshop_rerank_solution_name = solution['serviceConfig']['name'] 
        # This is currently configured assuming only one campaign, if there are multiple campaigns 
        # further configuration is needed.
        workshop_rerank_campaign_name = solution['campaigns'][0]['serviceConfig']['name'] 
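
The loops above assume exactly one recommender or solution per recipe. A minimal sketch of how that assumption could be checked explicitly is shown below. `names_by_recipe` is a hypothetical helper, and the sample list is a made-up stand-in for `parameters['recommenders']` that copies only the keys used in the cell above.

```python
def names_by_recipe(entries, recipe_arn):
    """Collect the configured names of all entries that use the given recipe."""
    return [
        entry["serviceConfig"]["name"]
        for entry in entries
        if entry["serviceConfig"]["recipeArn"] == recipe_arn
    ]


# Hypothetical stand-in for parameters['recommenders']; the names are made up.
sample_recommenders = [
    {"serviceConfig": {"name": "example-more-like-x",
                       "recipeArn": "arn:aws:personalize:::recipe/aws-vod-more-like-x"}},
    {"serviceConfig": {"name": "example-top-picks",
                       "recipeArn": "arn:aws:personalize:::recipe/aws-vod-top-picks"}},
]

matches = names_by_recipe(sample_recommenders,
                          "arn:aws:personalize:::recipe/aws-vod-top-picks")
if len(matches) != 1:
    raise ValueError("Expected exactly one recommender for this recipe, found {}".format(len(matches)))
print(matches[0])
```

Failing loudly when zero or several entries match makes a misconfigured `params.json` visible immediately instead of surfacing later as an undefined variable.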

Make sure we can use the SDK to interact with Amazon Personalize by describing some of the pre-created resources used in the workshop. If you have not pre-deployed resources and are building them as you go with this notebook, the below cell will raise an exception. You can continue with the notebook and create resources and train models as you go.

In [None]:
try:
    # Describe a few resources using the SDK
    more_like_x_arn = 'arn:aws:personalize:'+region+':'+account_id+':recommender/'+recommender_more_like_x_name 
    describe_response = personalize.describe_recommender(recommenderArn = more_like_x_arn)

    top_picks_arn = 'arn:aws:personalize:'+region+':'+account_id+':recommender/'+recommender_top_picks_for_you_name
    describe_response = personalize.describe_recommender(recommenderArn = top_picks_arn)

    rerank_arn = 'arn:aws:personalize:'+region+':'+account_id+':solution/'+workshop_rerank_solution_name
    describe_response = personalize.describe_solution(solutionArn = rerank_arn)
    
    print("All resources have been successfully pretrained!")
    
except Exception:  # any failure here means the pretrained resources are not available
    print("Some or all pretrained resources not found. Proceed to the next cell if you will be uploading data and training models as you go.")

## Introduction to Amazon Personalize Datasets <a class="anchor" id="datasets"></a>
[Back to top](#top)

Regardless of the use case, the algorithms all share a base of learning on user-item-interaction data which is defined by 3 core attributes:

1. **UserID** - The user who interacted
1. **ItemID** - The item the user interacted with
1. **Timestamp** - The time at which the interaction occurred

In this notebook we will be importing interactions, user and item data into your environment, inspecting it and converting it to a format that will allow you use it in Amazon Personalize to train models to get personalized recommendations.

The following diagram shows the resources that we will create, with the section we are building in this notebook highlighted in blue with a dashed outline.

![Workflow](images/01_Data_Layer_Resources.jpg)

Generally speaking your data will not arrive in a perfect form for Personalize, and will take some modification to be structured correctly. This notebook guides you through that process.

### Items data

The item data consists of information about the content that is being interacted with. It generally comes from Content Management Systems (CMS).

<div class="alert alert-block alert-warning">
<b>Note:</b> This dataset is not mandatory for Amazon Personalize to generate recommendations, but providing good item metadata will ensure the best results in your trained models.
</div>

### Item Interactions data

* Interactions data: we use the ml-latest-small dataset from the [MovieLens](https://grouplens.org/datasets/movielens/) project as a proxy for user-item interactions.

The item interaction data consists of information about the interactions the users of the fictional app have with the content. This usually comes from analytics tools or Customer Data Platforms (CDPs). The best interaction data for Amazon Personalize includes the sequential order of user behavior: what content was watched or clicked on, and the order in which it was interacted with. To simulate our interaction data, we will use data from the [MovieLens project](https://grouplens.org/datasets/movielens/). MovieLens offers multiple versions of its dataset; for this workshop we will use the reduced version (approximately 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users).

This dataset is mandatory for Amazon Personalize to generate recommendations.

### User data

The user data is the information you have about your users. It usually comes from Customer Relationship Management (CRM) or subscriber management systems. Since there is no user data included in the MovieLens data, we will generate a small synthetic dataset to simulate this component of the workshop.

<div class="alert alert-block alert-warning">
<b>Note:</b> This dataset is not mandatory for Amazon Personalize to generate recommendations, but providing good user metadata will ensure the best results in your trained models.
</div>



## Prepare the Item Metadata <a class="anchor" id="prepare_items"></a>
[Back to top](#top)

To provide additional metadata, and to provide a consistent experience for our users, we leverage a subset of the IMDb Essential Metadata for Movies/TV/OTT dataset. IMDb is the world's most popular and authoritative source for information on movies, TV shows, and celebrities and powers entertainment experiences around the world. License IMDb entertainment metadata from over 10 million movies, TV series, and Video Game titles including 12 million cast and crew, 1 billion star ratings, and global box office grosses from Box Office Mojo. All IMDb data products are updated daily and easily accessed through AWS Data Exchange.

The IMDb Essential Metadata for Movies/TV/OTT dataset contains:

- 9+ million titles
- 12+ million names
- Film, TV, music and celebrities
- 1 billion ratings from the world’s largest entertainment fan community

IMDb has [multiple datasets available in the Amazon Data Exchange](https://aws.amazon.com/marketplace/seller-profile?id=0af153a3-339f-48c2-8b42-3b9fa26d3367). <img src="./images/IMDb_Logo_Rectangle.png" alt="IMDb logo" style="width:50px;"/>

For this workshop we have extracted the subset of data we needed and prepared it for use with the following information from the IMDb Essential Metadata for Movies/TV/OTT (Bulk data) dataset.

TITLE                      
YEAR                       
IMDB_RATING                
IMDB_NUMBEROFVOTES         
PLOT                       
US_MATURITY_RATING_STRING  
US_MATURITY_RATING         
GENRES 

In addition, we added two fields that are not derived from the IMDb dataset but will help us with our fictional use case:

CREATION_TIMESTAMP         
PROMOTION

For the purpose of this workshop we will use the IMDb TT ID to provide a common identifier between the interactions data and the content metadata. MovieLens provides its own identifier as well as the IMDb TT ID (without the leading 'tt') in the 'links.csv' file.


<div class="alert alert-block alert-warning">
<b>Note:</b>Your use of IMDb data is for the sole purpose of completing the AWS workshop and/or tutorial. Any use of IMDb data outside of the AWS workshop and/or tutorial requires a data license from IMDb. To obtain a data license, please contact: imdb-licensing-support@imdb.com. You will not (and will not allow a third party to) (i) use IMDb data, or any derivative works thereof, for any purpose; (ii) copy, sublicense, rent, sell, lease or otherwise transfer or distribute IMDb data or any portion thereof to any person or entity for any purpose not permitted within the workshop and/or tutorial; (iii) decompile, disassemble, or otherwise reverse engineer or attempt to reconstruct or discover any source code or underlying ideas or algorithms of IMDb data by any means whatsoever; or (iv) knowingly remove any product identification, copyright or other notices from IMDb data.</div>



Copy the subsection of IMDb item metadata to this repository.

In [None]:
 # copy IMDB data
!aws s3 cp s3://personalize-solution-staging-us-east-1/personalize-immersionday-media/imdb/items.csv $imdb_dir

Next, load the IMDB `items.csv` file and take a look at the first rows. This file has information about the movie.

In [None]:
item_data = pd.read_csv(data_dir + '/imdb/items.csv', sep=',', dtype={'PROMOTION': "string"},index_col=0)
item_data.head(5)

In [None]:
item_data.describe()

This does not really tell us much about the dataset, so we will explore a bit more and look at the raw information. We can see that genres often appear in groups. That is fine for us as Personalize supports this structure.

In [None]:
item_data.info()

Now we have the catalog of titles that our service offers. We also have some movies that the UnicornFlix marketing department would like to promote in our recommendations. Amazon Personalize has a feature that allows you to promote items in recommendations and set the balance of promoted items versus recommended items (we will cover this in detail in `03_Inference_Layer.ipynb`). Since we are in Las Vegas, let's create a promotion for movies about or set in Las Vegas. First, we will find the movies in our catalog that are about or set in Las Vegas and set the `PROMOTION` metadata field to true.

In [None]:
mask = item_data['PLOT'].str.contains('las vegas', case=False, na=False)
item_data.loc[mask, 'PROMOTION'] = 'true'
item_metadata = item_data
item_data[mask]
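
The matching rule used in the cell above (case-insensitive, with missing plots treated as "no match") can be illustrated in plain Python. This is a sketch only: `mentions_phrase` is a hypothetical helper, and the item IDs and plots are made up.

```python
def mentions_phrase(plot, phrase="las vegas"):
    """Case-insensitive substring check that treats missing plots as no match."""
    if not isinstance(plot, str):
        return False
    return phrase.lower() in plot.lower()


# Hypothetical sample plots keyed by made-up item IDs
sample_plots = {
    "tt0000001": "A heist crew targets a casino in Las Vegas.",
    "tt0000002": "Two friends road-trip across Iceland.",
    "tt0000003": None,  # missing plot, handled like na=False in the pandas call
}

promoted = {item_id: mentions_phrase(plot) for item_id, plot in sample_plots.items()}
print(promoted)
```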

Let's confirm that the changes we have made have not introduced any null values:

In [None]:
item_data.isnull().sum()

Looks good, we currently have no null values.

That's it! At this point the item data is ready to go, and we just need to save it as a CSV file.

In [None]:
item_data.to_csv((data_dir+"/"+items_filename), index=True, float_format='%.0f')

## Prepare the Item Interactions Data <a class="anchor" id="prepare_interactions"></a>

First, you will download the dataset from the [MovieLens project](https://grouplens.org/datasets/movielens/) website and unzip it in a new folder using the code below.

In [None]:
!cd $data_dir && wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!cd $data_dir && unzip -o ml-latest-small.zip 
dataset_dir = data_dir + "/ml-latest-small/"

Take a look at the data files you have downloaded.

In [None]:
!ls $dataset_dir

We can look at the README.txt file and the licensing terms. Do not skip over the usage license!

In [None]:
!pygmentize $data_dir/ml-latest-small/README.txt

The primary data we are interested in for a recommendation use case is the actual interactions that the users had with the titles (items).

Open the `ratings.csv` file and take a look at some rows from throughout the dataset.

In [None]:
interaction_data = pd.read_csv(dataset_dir + '/ratings.csv', sep=',', dtype={'userId': "int64", 'movieId': "str"})
interaction_data.sample(10)

To use Amazon Personalize, you need to save timestamps in Unix Epoch format.

Let's validate that the timestamp is actually in Unix Epoch format by converting it into a more easily understood date and time format.

In [None]:
arb_time_stamp = interaction_data.iloc[50]['timestamp']
print('timestamp')
print(arb_time_stamp)
print()
print('Date & Time')
print(datetime.utcfromtimestamp(arb_time_stamp).strftime('%Y-%m-%d %H:%M:%S'))
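
As a side note, `datetime.utcfromtimestamp` is deprecated as of Python 3.12. A minimal sketch of the same conversion with a timezone-aware datetime is shown below; `epoch_to_utc_string` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def epoch_to_utc_string(epoch_seconds):
    """Format a Unix epoch timestamp (in seconds) as a UTC date and time string."""
    moment = datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')


print(epoch_to_utc_string(0))           # 1970-01-01 00:00:00
print(epoch_to_utc_string(1000000000))  # 2001-09-09 01:46:40
```

A value that converts to a plausible date in seconds (rather than a date near 1970 or far in the future) is a good sign that the column holds epoch seconds and not milliseconds.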

We will do some general summarization and inspection of the data to ensure that it will be helpful for Amazon Personalize

In [None]:
interaction_data.isnull().any()

In [None]:
interaction_data.info()

As you can see, the MovieLens dataset contains a userId, a movieId, the rating that the user gave the movie, and the time at which they made this interaction. For the purposes of our fictional service UnicornFlix, this data will stand in for our application's interaction data, which would actually be the clickstream data of the titles that were watched, in the order they were watched.

### Convert the Interactions Data

The interaction data is generally acquired from analytics or CDP platforms that can identify individual interactions with content/items within a platform.

We need to do a few things to get this dataset ready to substitute for our service's interaction data.

First off, the movieId is a unique identifier provided by MovieLens for each title. However, as we saw above, IMDb has a much richer set of metadata about the content catalog. To use the IMDb data we need a common identifier between our items and our interactions datasets, which is the IMDb imdbId. To do this, MovieLens provides the 'links.csv' file, which helps convert between the two identifiers.

In [None]:
links = pd.read_csv(dataset_dir + '/links.csv', sep=',', usecols=[0,1], encoding='latin-1', dtype={'movieId': "str", 'imdbId': "str", 'tmdbId': "str"})
pd.set_option('display.max_rows', 25)
links['imdbId'] = 'tt' + links['imdbId'].astype(object)
links
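
Reading the identifiers as strings matters because the IMDb IDs in `links.csv` are zero-padded. The plain-Python sketch below illustrates this with two sample rows laid out like `links.csv` (treat the values as illustrative); `movie_to_imdb` is a hypothetical name.

```python
import csv
import io

# Sample rows in the layout of links.csv (movieId, imdbId, tmdbId)
sample_links_csv = "movieId,imdbId,tmdbId\n1,0114709,862\n2,0113497,8844\n"

movie_to_imdb = {}
for row in csv.DictReader(io.StringIO(sample_links_csv)):
    # Values stay strings, so the leading zeros in imdbId are preserved
    movie_to_imdb[row["movieId"]] = "tt" + row["imdbId"]

print(movie_to_imdb["1"])           # tt0114709

# Parsing the identifier as an integer would silently drop the zero padding
print("tt" + str(int("0114709")))   # tt114709
```

An identifier that lost its padding would no longer match the `ITEM_ID` values in the item metadata, so the later merge would quietly drop those rows.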

As you can see, this provides a way to identify the IMDb ID for every title in our interactions dataset. Now we will convert the ratings.csv data to use the IMDb ID.

In [None]:
imdb_data = interaction_data.merge(links, on='movieId')
imdb_data.drop(columns='movieId')

Now we have an interactions dataset that matches our item catalog dataset.

### Simulating an interaction dataset

We are going to make one more modification to make the MovieLens dataset more like the analytics data that a video streaming service would see in its interactions. MovieLens is an explicit movie rating dataset, which means users are presented with a movie and asked to give it a rating. For recommendation systems and personalization, the industry has moved on to using more implicit data. There are many reasons for this, including the low number of customers who rate titles and the fact that customers' tastes change over time. One benefit of implicit interaction data is that it reflects the actual behavior of all users and changes over time as their viewing behavior changes.

To convert the explicit interaction MovieLens ratings dataset into our fictional streaming service UnicornFlix's implicit dataset we are going to create a synthetic dataset using the ratings in MovieLens. 

- Implicit interactions are inherently positive interactions, so we will drop any rating of 1 star or below
- Ratings above 1 star and up to 3 stars are neutral to slightly positive. We will create synthetic "Click" events for these to simulate a viewer clicking on a title in the UnicornFlix app
- Ratings above 3 stars are overwhelmingly positive. We will use these to create synthetic "Watch" and "Click" events to simulate a viewer both clicking on a title and watching at least 80% of it in the UnicornFlix app (MovieLens ratings use half-star increments, so this includes ratings of 3.5)

<div class="alert alert-block alert-warning">
<b>Note:</b> This will be directionally accurate, but it is not a good substitute for actual temporal interaction data: the order in which viewers rated movies on the MovieLens website is less informative than the order of interactions in an actual Video On Demand streaming app. For more information about the importance of temporal interaction data, see
https://www.amazon.science/publications/temporal-contextual-recommendation-in-real-time
</div> 
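
The mapping from ratings to synthetic events can be summarized in a few lines of plain Python. This is only a sketch of the rule; it uses the same thresholds as the cells that follow (`rating > 1` for "Click", `rating > 3` for "Watch"), and `events_for_rating` is a hypothetical helper name.

```python
def events_for_rating(rating):
    """Map an explicit star rating to the synthetic implicit events it produces."""
    events = []
    if rating > 1:
        events.append("Click")   # same threshold as clicked_df
    if rating > 3:
        events.append("Watch")   # same threshold as watched_df
    return events


for stars in (1.0, 2.5, 3.0, 3.5, 5.0):
    print(stars, events_for_rating(stars))
```

Because every "Watch" also produces a "Click", highly rated titles contribute two events each, which mirrors a viewer clicking on a title before watching it.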

In [None]:
watched_df = imdb_data.copy()
watched_df = watched_df[watched_df['rating'] > 3]
watched_df = watched_df[['userId', 'imdbId', 'timestamp']]
watched_df['EVENT_TYPE']='Watch'
watched_df.head()

In [None]:
clicked_df = imdb_data.copy()
clicked_df = clicked_df[clicked_df['rating'] > 1]
clicked_df = clicked_df[['userId', 'imdbId', 'timestamp']]
clicked_df['EVENT_TYPE']='Click'
clicked_df.head()

In [None]:
interactions_df = clicked_df.copy()

interactions_df = pd.concat([interactions_df, watched_df])
interactions_df.sort_values("timestamp", axis = 0, ascending = True, 
                 inplace = True, na_position ='last')

Let's look at the new dataset and ensure that the data reflects our fictional streaming service's analytics data.

In [None]:
interactions_df

Amazon Personalize has default column names for users, items, timestamps, and event types. For the [VIDEO_ON_DEMAND domain dataset](https://docs.aws.amazon.com/personalize/latest/dg/VIDEO-ON-DEMAND-datasets-and-schemas.html), these are `USER_ID`, `ITEM_ID`, `TIMESTAMP`, and `EVENT_TYPE`. The final modification to the dataset is to replace the existing column headers with the default headers (the `EVENT_TYPE` column already uses its default name).

In [None]:
interactions_df.rename(columns = {'userId':'USER_ID', 'imdbId':'ITEM_ID', 
                              'timestamp':'TIMESTAMP'}, inplace = True) 
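
After renaming, a quick check that the expected headers are present can catch typos early. The sketch below assumes the four columns this notebook ends up with (`USER_ID`, `ITEM_ID`, `TIMESTAMP`, `EVENT_TYPE`); the linked schema documentation remains the authoritative list. `missing_columns` is a hypothetical helper name.

```python
# Assumption: the columns this notebook's interactions file is expected to contain
REQUIRED_INTERACTION_COLUMNS = ("USER_ID", "ITEM_ID", "TIMESTAMP", "EVENT_TYPE")


def missing_columns(columns, required=REQUIRED_INTERACTION_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


print(missing_columns(["USER_ID", "ITEM_ID", "TIMESTAMP", "EVENT_TYPE"]))  # []
print(missing_columns(["userId", "imdbId", "timestamp", "EVENT_TYPE"]))
```

In the notebook this could be called as `missing_columns(interactions_df.columns)`.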

We'll be using a subset of the IMDb dataset for this workshop that has been cleaned to remove movies that don't have valid values for the metadata we are using in our items dataset (the one we prepared in the previous section). We therefore need to make sure we don't have any interactions whose IMDb movie IDs are missing from our subset of the IMDb dataset.

In [None]:
movies = pd.read_csv(data_dir + '/imdb' + '/items.csv', sep=',', usecols=[0,1], encoding='latin-1', dtype={'movieId': "str", 'imdbId': "str", 'tmdbId': "str"})
pd.set_option('display.max_rows', 25)

Next, let's compare the number of `ITEM_ID` unique keys in the IMDB data to the `ITEM_ID` unique keys in the interactions.  They should be the same.

In [None]:
movies.nunique(axis=0)

The number of unique ITEM_IDs is not the same in the IMDb data and the interactions data, so we'll remove from the interactions dataset the data points whose ITEM_IDs have no item metadata.

In [None]:
interactions_df = interactions_df.merge(movies, on='ITEM_ID')
interactions_df.info()
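
The effect of the inner merge above is to keep only interactions whose `ITEM_ID` exists in the item catalog. The plain-Python sketch below shows the same idea with sets; `keep_known_items` is a hypothetical helper and all sample values are made up.

```python
def keep_known_items(interactions, known_item_ids):
    """Keep only interactions whose ITEM_ID appears in the item catalog."""
    known = set(known_item_ids)
    return [row for row in interactions if row["ITEM_ID"] in known]


# Hypothetical sample data; the identifiers and timestamps are made up
catalog_ids = ["tt0000001", "tt0000002"]
sample_interactions = [
    {"USER_ID": 1, "ITEM_ID": "tt0000001", "TIMESTAMP": 964982703, "EVENT_TYPE": "Click"},
    {"USER_ID": 1, "ITEM_ID": "tt0000009", "TIMESTAMP": 964982931, "EVENT_TYPE": "Click"},
]

# Items referenced by interactions but missing from the catalog
orphans = {row["ITEM_ID"] for row in sample_interactions} - set(catalog_ids)
print("Items without metadata:", sorted(orphans))
print(len(keep_known_items(sample_interactions, catalog_ids)), "interaction(s) kept")
```

In pandas, `interactions_df['ITEM_ID'].nunique()` gives the interactions-side count to compare against the item catalog.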

We will also drop the `TITLE` column as it is not required in the interactions dataset.

In [None]:
interactions_df = interactions_df.drop(columns=['TITLE'])
interactions_df.info()

That's it! At this point the data is ready to go, and we just need to save it as a CSV file.

In [None]:
interactions_df.to_csv((data_dir+"/"+interactions_filename), index=False, float_format='%.0f')

## Prepare the User Metadata <a class="anchor" id="prepare_users"></a>
[Back to top](#top)

The dataset does not have any user metadata, so we will create a synthetic metadata field as an example of the type of user metadata UnicornFlix may have in its CRM/subscriber management system. This data will be used for training the models and can also be used for inference filters, which will be covered in a later notebook.

In [None]:
# get all unique user ids from the interaction dataset

user_ids = interactions_df['USER_ID'].unique()
user_data = pd.DataFrame()
user_data["USER_ID"]= user_ids
user_data

### Adding User Metadata

The current dataset does not contain additional user information. For this example, we'll randomly assign a membership level. For ad-supported models this could indicate premium versus ad-supported.

<div class="alert alert-block alert-warning">
<b>Note:</b> This is a synthetic dataset and, since it is randomly assigned, it will be of little value to our model. In a real-world scenario this data would accurately reflect the users.
</div>

In [None]:
possible_membership_levels = ['silver', 'gold']
random = np.random.choice(possible_membership_levels, len(user_data.index), p=[0.5, 0.5])
user_data["MEMBERLEVEL"] = random
user_data
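
The assignment above produces different membership levels on every run. If you want reproducible results, one option is to use a seeded generator, sketched here with the standard library. `assign_membership` is a hypothetical helper and the seed value `42` is arbitrary; with NumPy, `np.random.default_rng(seed).choice(...)` would be a comparable approach.

```python
import random as py_random


def assign_membership(user_ids, levels=("silver", "gold"), seed=42):
    """Assign each user a membership level uniformly at random, reproducibly."""
    rng = py_random.Random(seed)  # private generator; global random state is untouched
    return {user_id: rng.choice(levels) for user_id in user_ids}


first = assign_membership(range(1, 6))
second = assign_membership(range(1, 6))
print(first == second)  # True: the same seed reproduces the same assignment
```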

That's it! At this point the data is ready to go, and we just need to save it as a CSV file.

In [None]:
# Saving the data as a CSV file
user_data.to_csv((data_dir+"/"+users_filename), index=False, float_format='%.0f')

## S3 bucket <a class="anchor" id="bucket_role"></a>
[Back to top](#top)

So far, we have downloaded, manipulated, and saved the data onto the Amazon EBS volume attached to the instance running this Jupyter notebook.

By default, the Amazon Personalize service does not have permission to access data in the S3 bucket in our account. To grant the Amazon Personalize service access to read our CSVs, you need to set a bucket policy and create an IAM role that the Amazon Personalize service will assume.

Use the metadata stored on the instance underlying this Amazon SageMaker notebook to determine the region it is operating in. If you are using a Jupyter notebook outside of Amazon SageMaker, simply define the region as a string below. The Amazon S3 bucket needs to be in the same region as the Amazon Personalize resources we have been creating so far.

We will be using the S3 bucket that you created when you deployed the CloudFormation template [personalizeSimpleCFMarketingContentGen.yml](personalizeSimpleCFMarketingContentGen.yml).

This bucket is created with the policy:

```python
policy = {
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:*Object",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket_name),
                "arn:aws:s3:::{}/*".format(bucket_name)
            ]
        }
    ]
}
```

This S3 bucket policy allows Amazon Personalize to list the bucket and to read the objects in it (the `s3:*Object` wildcard also covers write and delete actions on objects).
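
The policy shown above refers to `bucket_name`, which is only defined further down. As a minimal sketch, the same document can be built by a small function and serialized to JSON. `build_bucket_policy` is a hypothetical helper and `"example-bucket"` is a placeholder; the workshop bucket already has this policy, so nothing here needs to be applied.

```python
import json


def build_bucket_policy(bucket_name):
    """Build the Personalize access policy for one bucket as a JSON string."""
    policy = {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy",
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                "Action": ["s3:*Object", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::{}".format(bucket_name),
                    "arn:aws:s3:::{}/*".format(bucket_name),
                ],
            }
        ],
    }
    return json.dumps(policy, indent=2)


print(build_bucket_policy("example-bucket"))
```

For a bucket of your own, the resulting string is the kind of value you would pass as `Policy` to the S3 client's `put_bucket_policy` call.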

In [None]:
# Configure the SDK to SSM:
ssm = boto3.client('ssm')

In [None]:
# get the name of the bucket created by the Cloud Formation
personalizes3bucket = ssm.get_parameter(Name='/cloudformation/personalize-s3-bucket', WithDecryption=False)
bucket_name = personalizes3bucket['Parameter']['Value']

print('bucket_name:', bucket_name)

Let's have a look at the S3 bucket policy. 

In [None]:
s3 = boto3.client('s3')

try:
    bucket_current_policy = s3.get_bucket_policy(Bucket=bucket_name)['Policy']
    print ("Bucket current policy")
    print (json.loads(bucket_current_policy))
    
except Exception as e: 
    raise(e)


### Upload data to S3

Now that your Amazon S3 bucket has been created, upload the CSV files of our 3 datasets (Item, Interaction and User).


<div class="alert alert-block alert-warning">
<b>Note:</b> We will cover how to import real-time data in a future notebook.
</div>


In [None]:
interactions_file_path = data_dir + "/" + interactions_filename

try:
    s3.get_object(
        Bucket=bucket_name,
        Key=interactions_filename,
    )
    print("{} already exists in the bucket {}".format(interactions_file_path, bucket_name))
except s3.exceptions.NoSuchKey:
    # Uploading the file if it does not already exist
    boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_filename).upload_file(interactions_file_path)
    print("File {} uploaded to bucket {}".format(interactions_filename, bucket_name))

items_file_path = data_dir + "/" + items_filename
try:
    s3.get_object(
        Bucket=bucket_name,
        Key=items_filename,
    )
    print("{} already exists in the bucket {}".format(items_file_path, bucket_name))
except s3.exceptions.NoSuchKey:
    # Uploading the file if it does not already exist     
    boto3.Session().resource('s3').Bucket(bucket_name).Object(items_filename).upload_file(items_file_path)
    print("File {} uploaded to bucket {}".format(items_filename, bucket_name))

users_file_path = data_dir + "/" + users_filename
try:
    s3.get_object(
        Bucket=bucket_name,
        Key=users_filename,
    )
    print("{} already exists in the bucket {}".format(users_file_path, bucket_name))
except s3.exceptions.NoSuchKey:
    # Uploading the file if it does not already exist
    boto3.Session().resource('s3').Bucket(bucket_name).Object(users_filename).upload_file(users_file_path)
    print("File {} uploaded to bucket {}".format(users_filename, bucket_name))
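
The three upload blocks above follow the same steps, so they could be folded into one helper. The sketch below keeps the notebook's approach (`get_object`, then upload on `NoSuchKey`); `upload_if_missing` is a hypothetical name, and the commented usage lines assume the variables defined earlier in this notebook.

```python
def upload_if_missing(s3_client, bucket, key, local_path):
    """Upload local_path to the bucket under `key` unless the key already exists.

    Returns True when an upload happened, False when the object was already there.
    """
    try:
        s3_client.get_object(Bucket=bucket, Key=key)
        return False
    except s3_client.exceptions.NoSuchKey:
        # The object does not exist yet, so upload the local file
        s3_client.upload_file(local_path, bucket, key)
        return True


# Usage in this notebook (assumes s3, bucket_name and data_dir from earlier cells):
# for filename in (interactions_filename, items_filename, users_filename):
#     uploaded = upload_if_missing(s3, bucket_name, filename, data_dir + "/" + filename)
#     print(filename, "uploaded" if uploaded else "already exists")
```

Passing the client in as a parameter keeps the helper easy to exercise with a stand-in object when no AWS credentials are available.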

We will use this data in the next labs. To make that possible, we store the following variables so that subsequent notebooks can use them.

In [None]:
%store dataset_dir
%store data_dir
%store interactions_filename
%store items_filename
%store users_filename
%store workshop_rerank_solution_name
%store workshop_rerank_campaign_name
%store recommender_more_like_x_name
%store recommender_top_picks_for_you_name
%store region
%store account_id
%store workshop_dataset_group_name
%store interactions_schema_name
%store items_schema_name
%store users_schema_name
%store interactions_dataset_name
%store items_dataset_name
%store users_dataset_name
%store items_import_job_name
%store interactions_import_job_name
%store users_import_job_name
%store bucket_name

[Go to the next notebook `02_Data_Import`](02_Data_Import.ipynb) to continue.