# Importing Data (with the AWS Python SDK)

In this notebook, we'll create our project workspace in Amazon Personalize and import the prepared data - using [Boto3, the AWS SDK for Python](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html).

> For an **alternative** approach to the same steps through the [Amazon Personalize console UI](https://console.aws.amazon.com/personalize/home) - see Notebook [02a_Importing_Data_(Console).ipynb](02a_Importing_Data_(Console).ipynb) instead.

Before we start, we'll:

- Import the libraries this notebook will use
- Load the variables saved from previous steps
- Connect to the relevant AWS services as we have before for IAM and S3

In [None]:
# Python Built-Ins:
import json

# External Dependencies:
import boto3  # AWS SDK for Python

# Local Dependencies:
import util  # Small tool to print progress spinner

# Reload saved variables:
%store -r

# Connect to AWS services:
personalize = boto3.client("personalize")

## Creating a Dataset Group (DSG)

You can think of the **dataset group** as your **project workspace**: It's the container within which your datasets, models and deployments/inferences will be created.

A dataset group can contain multiple solutions (models) and campaigns (deployments), but **only one instance of each dataset type** (interactions, items, and users), so:

- You can experiment with different *algorithms and models* **within** one dataset group - which we'll do in this example... But
- For comparing results with *different datasets/schemas*, you'll usually need to work with **multiple** dataset groups.

Since all these steps can be performed through the SDKs/API too, it's absolutely possible to automate pipelines for setting up multiple dataset groups and experiments within them. We'd recommend referring to the MLOps samples in the [official Amazon Personalize samples repository](https://github.com/aws-samples/amazon-personalize-samples) for examples on how to do this.

Since we'll experiment only with different model configurations, we'll create a single dataset group in this example.

In [None]:
create_dsg_response = personalize.create_dataset_group(
    name="personalize-poc-lab"
)

dataset_group_arn = create_dsg_response["datasetGroupArn"]
%store dataset_group_arn
print(json.dumps(create_dsg_response, indent=2))

Although creating a dataset group is usually quick, the above call is asynchronous and we need to check the dataset group reaches `ACTIVE` status before moving on.

The cell below polls the status to wait until our DSG is ready:

In [None]:
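def is_dsg_ready(desc):
    """Return True once the dataset group is ACTIVE; raise if creation failed."""
    status = desc["datasetGroup"]["status"]
    if status == "ACTIVE":
        return True
    elif "FAILED" in status:
        raise ValueError(f"Failed to create Dataset Group!\n{desc}")
    # Any other status (e.g. still creating) means we should keep polling:
    return False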

util.progress.polling_spinner(
    fn_poll_result=lambda: personalize.describe_dataset_group(datasetGroupArn=dataset_group_arn),
    fn_is_finished=is_dsg_ready,
    fn_stringify_result=lambda d: d["datasetGroup"]["status"],
    poll_secs=20,
    timeout_secs=20*60,  # Max 20 mins
)
print("Dataset Group Ready")
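
The `util.progress.polling_spinner` helper is local to this workshop. For readers who want to see the underlying pattern, here is a minimal sketch of a poll-until-ready loop. The function name `wait_for_status` and the faked status sequence are assumptions for illustration, not part of the `util` module or of Boto3; in real use the `describe` callable would wrap `personalize.describe_dataset_group(...)`.

```python
import time


def wait_for_status(describe, get_status, poll_secs=20, timeout_secs=20 * 60):
    """Poll `describe()` until its status is ACTIVE, raising on failure or timeout."""
    deadline = time.monotonic() + timeout_secs
    while True:
        desc = describe()
        status = get_status(desc)
        if status == "ACTIVE":
            return desc
        if "FAILED" in status:
            raise ValueError(f"Resource entered a failed state:\n{desc}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status} after {timeout_secs}s")
        time.sleep(poll_secs)


# Hypothetical stand-in for the describe call, so the sketch runs offline:
fake_statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
result = wait_for_status(
    describe=lambda: {"datasetGroup": {"status": next(fake_statuses)}},
    get_status=lambda d: d["datasetGroup"]["status"],
    poll_secs=0,  # No delay needed for the fake responses
)
print(result["datasetGroup"]["status"])
```

Checking for failure and timeout on every pass keeps the loop from spinning forever if the resource never becomes `ACTIVE`.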

## Defining Interactions Dataset Schema

In this step, we'll need to **define the structure** of our interactions CSV using a JSON schema language - referring to the [Datasets and Schemas](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html) section of the developer guide.

> ⚠️ **NOTE** that:
>
> - The columns list in the JSON must **exactly match** the data file, **including the order of columns**
> - Any fields with **missing values** *must* include `null` in their `type` entry to be correctly treated by the model
> - Watch out also for the `categorical` attribute, which must be set on string fields where appropriate

A comprehensive example schema is provided on the [Interactions Dataset doc page](https://docs.aws.amazon.com/personalize/latest/dg/interactions-datasets.html) for reference.
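
The column-order and `null` rules in the note above can also be checked programmatically before creating the schema. The sketch below is one possible approach, not part of the workshop code: the function `check_schema_against_csv`, the sample schema, and the sample rows are all made up for illustration, and it assumes missing values appear as empty strings in the CSV.

```python
import csv
import io


def check_schema_against_csv(schema, csv_file):
    """Return a list of problems found when comparing a schema dict to a CSV file."""
    reader = csv.reader(csv_file)
    header = next(reader)
    fields = schema["fields"]
    problems = []

    # Column names must match the schema fields, in the same order:
    expected = [f["name"] for f in fields]
    if header != expected:
        problems.append(f"Column mismatch: CSV has {header}, schema has {expected}")
        return problems  # Later checks assume the columns line up

    # Track which columns contain at least one empty value:
    has_missing = [False] * len(header)
    for row in reader:
        for ix, value in enumerate(row[: len(header)]):
            if value == "":
                has_missing[ix] = True

    # Columns with missing values need "null" in their type entry:
    for field, missing in zip(fields, has_missing):
        types = field["type"] if isinstance(field["type"], list) else [field["type"]]
        if missing and "null" not in types:
            problems.append(f"{field['name']} has missing values but is not nullable")
    return problems


# Made-up example: EVENT_VALUE is blank in one row but not declared nullable
sample_schema = {
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "EVENT_VALUE", "type": "float"},
        {"name": "TIMESTAMP", "type": "long"},
    ]
}
sample_csv = io.StringIO(
    "USER_ID,ITEM_ID,EVENT_VALUE,TIMESTAMP\n"
    "1,10,4.0,881250949\n"
    "2,11,,891717742\n"
)
problems = check_schema_against_csv(sample_schema, sample_csv)
print(problems)
```

To run the same check on real data, pass the schema dict and an open file handle for the local copy of the CSV in place of the `io.StringIO` sample.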

▶️ **CHECK** the schema below matches the `interactions.csv` we created earlier, before running the cell!

In [None]:
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "EVENT_VALUE",
            "type": ["float", "null"]
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        },
        {
            "name": "EVENT_TYPE",
            "type": "string"
        }
    ],
    "version": "1.0"
}

create_interactions_schema_resp = personalize.create_schema(
    name="personalize-movielens-interactions-schema",
    schema=json.dumps(interactions_schema),
)

interactions_schema_arn = create_interactions_schema_resp["schemaArn"]
print(json.dumps(create_interactions_schema_resp, indent=2))

With the schema created, we can now create our dataset object.

Note that this step does not *load* the data yet - it just creates an empty dataset in our dataset group and associates the schema with it:

In [None]:
dataset_type = "INTERACTIONS"
create_interactions_ds_resp = personalize.create_dataset(
    name="personalize-movielens-interactions",
    datasetType=dataset_type,
    datasetGroupArn=dataset_group_arn,
    schemaArn=interactions_schema_arn,
)

interactions_dataset_arn = create_interactions_ds_resp["datasetArn"]
print(json.dumps(create_interactions_ds_resp, indent=2))

## Importing the Interactions Data

In this step, we'll create a **dataset import job** to read our interactions data from S3, validate and load it into our Amazon Personalize dataset group.

In [None]:
create_interactions_dsimport_resp = personalize.create_dataset_import_job(
    jobName="personalize-movielens-interactions-01",
    datasetArn=interactions_dataset_arn,
    dataSource={
        "dataLocation": interactions_s3uri,
    },
    roleArn=personalize_role_arn,
)

interactions_import_job_arn = create_interactions_dsimport_resp["datasetImportJobArn"]
print(json.dumps(create_interactions_dsimport_resp, indent=2))

> ⏰ Importing the data can take some time. For small datasets like our movielens-100k extract, the duration is often dominated by the **overheads** of starting up and shutting down the required processing infrastructure and jobs, rather than scaling with the number of records.
>
> For this small sample, the import should take something like 15 minutes.

We'll start off the next (item metadata) import job in parallel, and wait for both to complete in a later section.

Note that:

- We **cannot use the dataset until the import job is complete** (i.e. in `ACTIVE` status)... But also,
- **Only the interactions dataset is mandatory** - so it would be possible to skip over the other remaining sections below and start building solutions, as soon as this import job completes.

## Defining Item Metadata Schema

We'll follow the same general steps for our item metadata set as for the core interactions dataset.

First, create a dataset schema using the [Datasets and Schemas](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html) docs and the [sample items schema](https://docs.aws.amazon.com/personalize/latest/dg/items-datasets.html#schema-examples-items) as a guide.

▶️ **CHECK** the schema below exactly matches the `item-meta.csv` we created earlier, before running the cell!

In [None]:
items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "GENRES",
            "type": "string",
            # Mark string metadata fields (beyond the mandatory ID fields) as 'categorical'
            "categorical": True
        },
        {
            "name": "YEAR",
            # Remember, our year field has some missing values!
            "type": ["int", "null"]
        }
    ],
    "version": "1.0"
}

create_items_schema_resp = personalize.create_schema(
    name="personalize-movielens-items-schema",
    schema=json.dumps(items_schema),
)

items_schema_arn = create_items_schema_resp["schemaArn"]
print(json.dumps(create_items_schema_resp, indent=2))
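
Note that the schema is written as a Python dictionary (hence `True` and the `#` comments) and only becomes JSON text when passed through `json.dumps`. As a small illustration, using two field definitions copied from the cell above, this is what the serialized form looks like:

```python
import json

field = {
    "name": "GENRES",
    "type": "string",
    "categorical": True,  # Python boolean, serialized as the JSON literal `true`
}
nullable_field = {"name": "YEAR", "type": ["int", "null"]}

print(json.dumps(field))
# {"name": "GENRES", "type": "string", "categorical": true}
print(json.dumps(nullable_field))
# {"name": "YEAR", "type": ["int", "null"]}

# Round-tripping confirms nothing is lost in serialization:
assert json.loads(json.dumps(field)) == field
```

Here `"null"` is an ordinary string naming the null type in the schema, so it stays quoted in the JSON output.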

With the schema created, we can now create our dataset object.

Again, this step does not *load* the data yet - it just creates an empty dataset in our dataset group and associates the schema with it:

In [None]:
dataset_type = "ITEMS"
create_items_ds_resp = personalize.create_dataset(
    name="personalize-movielens-items",
    datasetType=dataset_type,
    datasetGroupArn=dataset_group_arn,
    schemaArn=items_schema_arn,
)

items_dataset_arn = create_items_ds_resp["datasetArn"]
print(json.dumps(create_items_ds_resp, indent=2))

## Importing the Items Metadata

In this step, we'll create a **dataset import job** to read our item metadata from S3, validate and load it into our Amazon Personalize dataset group.

In [None]:
create_items_dsimport_resp = personalize.create_dataset_import_job(
    jobName="personalize-movielens-items-01",
    datasetArn=items_dataset_arn,
    dataSource={
        "dataLocation": items_s3uri,
    },
    roleArn=personalize_role_arn,
)

items_import_job_arn = create_items_dsimport_resp["datasetImportJobArn"]
print(json.dumps(create_items_dsimport_resp, indent=2))

> ⏰ As with interactions, importing the data can take some time. For small datasets like our movielens-100k extract, the duration is often dominated by the **overheads** of starting up and shutting down the required processing infrastructure and jobs, rather than scaling with the number of records.
>
> For this small sample, the import should take something like 15 minutes.

## Wait for Imports to Complete

You can of course check progress of your import jobs through the Personalize console, and also through the API.

In the cell below, we set up a simple polling loop to check progress and display updates in the notebook - blocking execution until all jobs are done:

In [None]:
waiting_arns = [interactions_import_job_arn, items_import_job_arn]

def are_imports_finished(descriptions):
    for desc in descriptions:
        status = desc["datasetImportJob"]["status"]
        arn = desc["datasetImportJob"]["datasetImportJobArn"]
        if status == "ACTIVE":
            waiting_arns.remove(arn)
        elif "FAILED" in status:
            raise ValueError(f"Data import failed!\n{desc}")
    # Finished once every import job has been removed from the waiting list:
    return len(waiting_arns) == 0

util.progress.polling_spinner(
    # Build a list (not a lazy map) so that removing finished ARNs from waiting_arns
    # inside are_imports_finished does not disturb the iteration:
    fn_poll_result=lambda: [
        personalize.describe_dataset_import_job(datasetImportJobArn=arn)
        for arn in waiting_arns
    ],
    fn_is_finished=are_imports_finished,
    fn_stringify_result=lambda d: f"{len(waiting_arns)} imports in progress",
    poll_secs=30,
    timeout_secs=60*60,  # Max 1 hour
)
print("Data imported")
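
The completion check can also be written as a pure function that recomputes the pending jobs on each poll, rather than removing entries from a shared list. The sketch below is an alternative under that assumption: the name `pending_import_jobs` and the example ARNs are hypothetical, and the dictionaries only mimic the fields of the `describe_dataset_import_job` response that the cell above reads.

```python
def pending_import_jobs(descriptions):
    """Return ARNs of jobs not yet ACTIVE; raise if any job has failed."""
    pending = []
    for desc in descriptions:
        job = desc["datasetImportJob"]
        if "FAILED" in job["status"]:
            raise ValueError(f"Data import failed!\n{desc}")
        if job["status"] != "ACTIVE":
            pending.append(job["datasetImportJobArn"])
    return pending


# Made-up responses, so the sketch runs without calling AWS:
fake_descriptions = [
    {"datasetImportJob": {"datasetImportJobArn": "arn:example:interactions", "status": "ACTIVE"}},
    {"datasetImportJob": {"datasetImportJobArn": "arn:example:items", "status": "CREATE IN_PROGRESS"}},
]
still_waiting = pending_import_jobs(fake_descriptions)
print(still_waiting)
```

Because nothing is mutated, the function can be called repeatedly on fresh responses, and the imports are done when it returns an empty list.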

## All set!

We've now created our dataset group (project) in Amazon Personalize and imported our source datasets.

In the next notebook we'll create and evaluate some recommendation models based on this data:

- Follow along in the **AWS Console** with the instructions and screenshots in [03a_Creating_and_Evaluating_Solutions_(Console).ipynb](03a_Creating_and_Evaluating_Solutions_(Console).ipynb), *OR*
- Run the same steps in code with the **AWS SDK for Python (Boto3)** by following [03b_Creating_and_Evaluating_Solutions_(Python_SDK).ipynb](03b_Creating_and_Evaluating_Solutions_(Python_SDK).ipynb)