# Lab 2 - Constructing the Graph Database
In this lab we will ingest the product metadata into Neptune and construct a graph from it.

## What is a Graph Database?

A graph database stores data as a graph: nodes, edges and properties rather than tables or documents. Because the relationships between nodes are persisted in the database instead of being calculated at query time, they can be traversed very quickly.
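
As a rough illustration of the model described above, the following sketch keeps nodes, edges and properties in plain Python dictionaries. It is not how Neptune stores data; the names `nodes`, `edges`, `outgoing` and `neighbours`, as well as the sample products, are made up for this example.

```python
# Minimal in-memory property graph (illustrative only, not Neptune's storage model).
# Each node has an ID, a label and a dictionary of properties. Sample data is made up.
nodes = {
    "c1": {"label": "category", "properties": {"name": "footwear"}},
    "s1": {"label": "style", "properties": {"name": "sneaker"}},
    "p1": {"label": "product", "properties": {"product_name": "Trail Runner"}},
    "p2": {"label": "product", "properties": {"product_name": "Court Classic"}},
}

# Edges are stored explicitly as (source, edge label, target) triples.
edges = [
    ("c1", "has", "s1"),
    ("s1", "has", "p1"),
    ("s1", "has", "p2"),
]

# Index outgoing edges once so that following a relationship is a lookup,
# not a join computed at query time.
outgoing = {}
for source, edge_label, target in edges:
    outgoing.setdefault(source, []).append((edge_label, target))


def neighbours(node_id, edge_label):
    """Return the IDs reachable from node_id over edges with the given label."""
    return [target for label, target in outgoing.get(node_id, []) if label == edge_label]


# Two hops: category -> styles -> products.
products = [p for s in neighbours("c1", "has") for p in neighbours(s, "has")]
print(products)
```

The two-hop lookup at the end mirrors the category, style and product chain that this lab builds in Neptune.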

[Amazon Neptune](https://aws.amazon.com/neptune/) is a high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latency. The nature of Neptune makes it a great tool for recommendation applications, as recommendations can be made quickly based on existing relationships.

## Setup
Just as in the first lab, we have to prepare our environment by importing dependencies and creating clients.
### Import dependencies
The following libraries are needed for this lab:

In [None]:
import boto3
import uuid
import pandas as pd
from gremlin_python.structure.graph import Graph
from gremlin_python.process.traversal import T
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.graph_traversal import __

### Create Clients
Next we need to create the AWS service clients needed for this workshop.
- **neptune**: This client is used to create our Neptune DB cluster and endpoint.

In [None]:
neptune = boto3.client('neptune')

### Load variables saved in Lab 1
At the end of Lab 1 we saved some variables that we'll need in this lab. The following cell will load those variables into this lab environment.

In [None]:
%store -r

## Create Neptune Cluster and Endpoint
Currently, there is no Neptune cluster to store the relationships. The following cell creates the Neptune DB cluster, which by default also gets a writer (cluster) endpoint and a reader endpoint. Creation will take a few minutes.

In [None]:
neptune_cluster = neptune.create_db_cluster(
    DBClusterIdentifier='retail-demo-store-products-cluster',
    Engine='neptune')
endpoint = neptune_cluster['DBCluster']['Endpoint']
print('Neptune Endpoint: ' + endpoint)
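
Because `create_db_cluster` returns before the cluster is ready, later cells may fail if they run too early. A minimal sketch of a polling loop, assuming the `Status` field returned by `describe_db_clusters` eventually reads `available`; the helper name `wait_for_cluster` and its parameters are hypothetical and not part of the lab.

```python
import time


def wait_for_cluster(client, cluster_id, poll_seconds=30, max_attempts=40):
    """Poll describe_db_clusters until the cluster reports 'available'.

    `client` is expected to behave like boto3.client('neptune').
    Returns the final cluster description, or raises TimeoutError.
    """
    for _ in range(max_attempts):
        response = client.describe_db_clusters(DBClusterIdentifier=cluster_id)
        cluster = response["DBClusters"][0]
        if cluster["Status"] == "available":
            return cluster
        time.sleep(poll_seconds)
    raise TimeoutError(
        "Cluster {} not available after {} attempts".format(cluster_id, max_attempts))
```

In the notebook this could be called as `wait_for_cluster(neptune, 'retail-demo-store-products-cluster')` before connecting. Depending on how the environment was provisioned, the cluster may also need a DB instance before the endpoint accepts connections; that step is outside this sketch.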

## Data Preparation
Before we can build the nodes and the edges between them, we have to load the product data into a dataframe. This can be done by scanning the DynamoDB table and adding each row to the dataframe.

In [1]:
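# DynamoDB scan step. `ddb_table` must already refer to the products table
# (a boto3 DynamoDB Table resource) before this cell runs.
ddb_response = ddb_table.scan()
items = ddb_response['Items']

# Collect the rows that will populate the dataframe
pd_data = list(items)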

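
A single `scan` call returns at most one page of results (up to 1 MB of data), so a larger table needs several calls. The sketch below follows `LastEvaluatedKey` until the table is exhausted; the helper name `scan_all_items` is hypothetical, and the products table in this workshop may well fit in a single page.

```python
def scan_all_items(table):
    """Collect every item from a DynamoDB table by following LastEvaluatedKey.

    `table` is expected to behave like a boto3 DynamoDB Table resource.
    """
    response = table.scan()
    items = list(response["Items"])
    # DynamoDB includes LastEvaluatedKey only while more pages remain.
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])
    return items
```

With this helper, the cell above could set `pd_data = scan_all_items(ddb_table)` instead of reading only the first page.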

To speed up processing, we can drop the columns whose relationships we are not interested in.

In [None]:
cols_to_drop = ['sk', 'url', 'aliases']
df_products = pd.DataFrame(pd_data)
df_products.drop(cols_to_drop, inplace=True, axis=1)

## Improving Access Patterns
To improve access patterns, we can create a dataframe of the distinct categories that includes a UUID for each one. The UUID later serves as the ID of the category vertex.

In [None]:
df_categories = df_products[['category']].drop_duplicates(subset=['category'])
df_categories['category_id'] = [uuid.uuid4() for _ in range(len(df_categories.index))]

We do the same for product styles.

In [None]:
df_styles = df_products[['style']].drop_duplicates(subset=['style'])
df_styles['style_id'] = [uuid.uuid4() for _ in range(len(df_styles.index))]
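
The same idea without pandas: every distinct value gets its own identifier that can later serve as a vertex ID. The sketch also shows `uuid.uuid5` as a possible alternative to `uuid.uuid4` for cases where IDs should stay the same across notebook runs. That alternative is not used in this lab, and the sample values and the `NAMESPACE` constant are made up for illustration.

```python
import uuid

# Made-up sample values standing in for the 'category' column.
categories = ["footwear", "apparel", "footwear", "accessories", "apparel"]

# Keep first-seen order while removing duplicates (like drop_duplicates).
unique_categories = list(dict.fromkeys(categories))

# Random IDs, as in the lab: different on every run.
random_ids = {name: str(uuid.uuid4()) for name in unique_categories}

# Deterministic alternative: the same name always maps to the same ID.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid")  # hypothetical namespace
stable_ids = {
    name: str(uuid.uuid5(NAMESPACE, "category::" + name)) for name in unique_categories
}

print(unique_categories)
```

Stable IDs would let a re-run address the same vertices instead of creating new ones, at the cost of having to pick and keep a namespace.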

## Ingest into Neptune
As Neptune is a graph database, we have to insert the data as vertices and edges. This involves opening a graph traversal against the cluster endpoint and adding each product as a vertex, item by item.

In [None]:
# Initialize Neptune connection with endpoint from created cluster
graph = Graph()
remote_conn = DriverRemoteConnection(
    'wss://' + endpoint + ':8182/gremlin',
    'g')
g = graph.traversal().withRemote(remote_conn)

# Insert all products
for index, row in df_products.iterrows():
    # Insert item by item.
    vertex_insert = g.addV('product') \
        .property(T.id, row['id']) \
        .property('product_name', row['name']) \
        .property('current_stock', row['current_stock']) \
        .property('style', row['style']) \
        .property('gender_affinity', row['gender_affinity']) \
        .property('image', row['image']) \
        .property('category', row['category']) \
        .property('description', row['description']) \
        .property('price', row['price']) \
        .property('featured', row['featured']) \
        .next()

    # TODO: this loop needs performance improvements and optimizations.
    # Update each item to add its labels array (Neptune does not support dict/maps in addV).
    for prop_label in row['image_labels']:
        if prop_label['confidence'] > 75:
            update_results = g.V(vertex_insert).property('labels_confidence_gt_75',
                                                         prop_label['name'].lower()).next()
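
The label filter in the inner loop can be tried in isolation. A minimal sketch, assuming each entry of `image_labels` is a dictionary with `name` and `confidence` keys, as the loop above implies, and that the numbers arrive as `Decimal` values, which is how the boto3 DynamoDB resource returns them. The helper `confident_labels` and the sample labels are hypothetical.

```python
from decimal import Decimal


def confident_labels(image_labels, threshold=75):
    """Return lower-cased names of labels whose confidence exceeds threshold."""
    return [
        label["name"].lower()
        for label in image_labels
        if label["confidence"] > threshold
    ]


# Made-up labels; a confidence of exactly 75 is excluded by the strict comparison.
sample_labels = [
    {"name": "Shoe", "confidence": Decimal("98.2")},
    {"name": "Sneaker", "confidence": Decimal("75")},
    {"name": "Footwear", "confidence": Decimal("80.5")},
]
print(confident_labels(sample_labels))
```

Each returned name corresponds to one value added to the `labels_confidence_gt_75` property of the product vertex.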

We can then insert all the categories and styles created earlier as multi-label vertices to improve searchability.

In [None]:
# Insert all categories
for index, row in df_categories.iterrows():
    g.addV('category::{}'.format(row['category'])).property(T.id, str(row['category_id'])).property(
        'name', row['category']).next()

# Insert all styles
for index, row in df_styles.iterrows():
    g.addV('style::{}'.format(row['style'])).property(T.id, str(row['style_id'])).property(
        'name', row['style']).next()

With all the insertions complete, the edges now have to be constructed to connect the graph vertices.

In [None]:
# Add Category and Style IDs
df_with_category_ids = pd.merge(df_products, df_categories, on='category', how='inner')
df_with_cat_and_style_ids = pd.merge(df_with_category_ids, df_styles, on='style', how='inner')

# Create Edges for Categories -> Styles
df_edges_category_style = df_with_cat_and_style_ids[['category_id', 'style_id']].drop_duplicates(
    subset=['category_id', 'style_id'])
# Add edges for Categories -> Styles:
for index, row in df_edges_category_style.iterrows():
    cat_to_style_edge_insert = g.V(str(row['category_id'])).addE('has').to(__.V(str(row['style_id']))).next()
    print(cat_to_style_edge_insert)

# Create Edges for Styles --> Products
df_edges_styles_products = df_with_cat_and_style_ids[['style_id', 'id']].drop_duplicates(subset=['style_id', 'id'])
# Add edges for Styles --> Products (ID is the original column of a product_id):
for index, row in df_edges_styles_products.iterrows():
    style_to_prod_edge_insert = g.V(str(row['style_id'])).addE('has').to(__.V(str(row['id']))).next()
    print(style_to_prod_edge_insert)
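
The two merges and the `drop_duplicates` calls reduce to collecting the distinct ID pairs that become edges. A pandas-free sketch of that step, using made-up product rows and hypothetical ID strings in place of the UUIDs:

```python
# Made-up rows standing in for df_products (only the relevant columns).
products = [
    {"id": "p1", "category": "footwear", "style": "sneaker"},
    {"id": "p2", "category": "footwear", "style": "sneaker"},
    {"id": "p3", "category": "footwear", "style": "boot"},
    {"id": "p4", "category": "apparel", "style": "jacket"},
]

# Hypothetical lookup tables playing the role of df_categories and df_styles.
category_ids = {"footwear": "cat-1", "apparel": "cat-2"}
style_ids = {"sneaker": "sty-1", "boot": "sty-2", "jacket": "sty-3"}

# Distinct (category_id, style_id) pairs: one 'has' edge per pair.
category_style_edges = list(dict.fromkeys(
    (category_ids[p["category"]], style_ids[p["style"]]) for p in products))

# Distinct (style_id, product_id) pairs: one 'has' edge per product.
style_product_edges = list(dict.fromkeys(
    (style_ids[p["style"]], p["id"]) for p in products))

print(category_style_edges)
print(style_product_edges)
```

Deduplicating first matters because products `p1` and `p2` share a category and a style, so without it the same category-to-style edge would be inserted twice.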

### Cleanup
With all the insertions complete and edges created, we close the connection to the graph database.

In [1]:
remote_conn.close()

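
To make sure the connection is closed even when an insert fails partway through, the work can be wrapped in `contextlib.closing`, which calls `close()` on exit. A sketch with a stand-in class: `FakeConnection` is hypothetical, and in the lab the wrapped object would be `remote_conn`.

```python
from contextlib import closing


class FakeConnection:
    """Stand-in for a connection object that exposes close()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


conn = FakeConnection()
try:
    with closing(conn):
        # Any exception raised inside the block still triggers close().
        raise RuntimeError("simulated failure during ingestion")
except RuntimeError:
    pass

print(conn.closed)
```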

## Lab 2 Summary
In this lab we created a Neptune cluster and wrote all our product data to it. We then constructed graph edges from categories to styles and from styles to products, which capture the relationships between products.

In the next lab, we will retrain Personalize with the image label data.

### Continue to Lab 3
Open [Lab 3](./Lab-3-Add-Rekognition-Labels-to-Personalize.ipynb) to continue the workshop.