<p>
  <a href="https://colab.research.google.com/github/ezhilvendhan/ecommerce-demo-gds-vertex-ai/blob/main/similarity.ipynb" target="_blank">
    <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
  </a>
</p>

# Install Prerequisites
First, install the packages this notebook depends on.

In [None]:
%pip install --quiet --upgrade graphdatascience
%pip install --quiet google-cloud-storage
%pip install --quiet google-cloud-aiplatform

# Restart the Kernel
After installing the packages, restart the notebook kernel so it can find them. When you run this cell, you may see a notification that the kernel crashed. You can disregard it.

In [3]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'restart': True, 'status': 'ok'}
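Once the kernel is back, it can be worth confirming that the three packages are visible to it. The following is a minimal sketch using only the standard library; the `installed_version` helper is hypothetical, and the names in `REQUIRED` are the distribution names used in the install cell.

```python
from importlib import metadata

# Distribution names installed by the cells above.
REQUIRED = ["graphdatascience", "google-cloud-storage", "google-cloud-aiplatform"]


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in REQUIRED:
    print(f"{name}: {installed_version(name) or 'NOT INSTALLED'}")
```

If any line reports `NOT INSTALLED`, rerun the install cell and restart the kernel again.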

# Working with Neo4j
You'll need to enter the credentials for your Neo4j instance below. You can get these by running the command ":server connect" in the Neo4j Browser. The default user name and database name are both neo4j.

In [3]:
# Edit these variables!
DB_URL = "neo4j+s://c1c3e9b6.databases.neo4j.io:7687"
DB_PASS = ""

# You can leave this default
DB_USER = 'neo4j'

In [4]:
from graphdatascience import GraphDataScience
gds = GraphDataScience(DB_URL, auth=(DB_USER, DB_PASS), aura_ds=True)
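Before importing data, a quick round trip can confirm that the URL and password are correct. This is a sketch only: `check_connection` is a hypothetical helper, and it assumes the `gds` object created above.

```python
def check_connection(gds):
    """Return the GDS library version after a trivial round trip to the database."""
    # run_cypher returns a tabular result; exactly one row means the query ran.
    probe = gds.run_cypher("RETURN 1 AS ok")
    if len(probe) != 1:
        raise RuntimeError("Unexpected response from Neo4j")
    # version() reports the version of the GDS library installed on the server.
    return gds.version()


# Usage, with the `gds` object created above:
# print(check_connection(gds))
```

An authentication or network problem will surface here as an exception rather than later, in the middle of the import.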

# Data Import

## Create Constraints

In [5]:
result = gds.run_cypher(
  """
    CREATE CONSTRAINT childCategoryIdConstraint IF NOT EXISTS FOR (c:Category) REQUIRE c.id IS UNIQUE
  """
)
display(result)

In [None]:
result = gds.run_cypher(
  """
    CREATE CONSTRAINT attributeIdConstraint IF NOT EXISTS FOR (a:Attribute) REQUIRE a.id IS UNIQUE
  """
)
display(result)

In [None]:
result = gds.run_cypher(
  """
    CREATE CONSTRAINT productIdConstraint IF NOT EXISTS FOR (p:Product) REQUIRE p.id IS UNIQUE
  """
)
display(result)

In [6]:
result = gds.run_cypher(
  """
    SHOW CONSTRAINTS YIELD id, name, type, entityType, labelsOrTypes, properties, ownedIndexId;
  """
)
display(result)

Unnamed: 0,id,name,type,entityType,labelsOrTypes,properties,ownedIndexId
0,16,attributeIdConstraint,UNIQUENESS,NODE,[Attribute],[id],15
1,14,childCategoryIdConstraint,UNIQUENESS,NODE,[Category],[id],13
2,6,constraint_184a6ca,NODE_KEY,NODE,[Manager],[filingManager],5
3,8,constraint_3a493df4,NODE_KEY,NODE,[Holding],"[filingManager, cusip, reportCalendarOrQuarter]",7
4,4,constraint_8d3c6074,NODE_KEY,NODE,[Company],[cusip],3
5,10,player_id,UNIQUENESS,NODE,[Player],[id],9
6,18,productIdConstraint,UNIQUENESS,NODE,[Product],[id],17
7,12,trait_name,UNIQUENESS,NODE,[Trait],[name],11
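The listing shows uniqueness constraints for Category, Attribute and Product, but none for SKU or Keyword, even though the relationship import later looks up both by `id`. A uniqueness constraint also creates a backing index, which speeds up those lookups. The sketch below builds statements in the same form as the cells above; the helper and the two constraint names are hypothetical, and creating the constraints is optional.

```python
def uniqueness_constraint(label, name, prop="id"):
    """Build a CREATE CONSTRAINT statement in the same form as the cells above."""
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )


# Hypothetical constraint names; the notebook defines none for these labels.
CONSTRAINTS = {"SKU": "skuIdConstraint", "Keyword": "keywordIdConstraint"}

statements = [uniqueness_constraint(label, name) for label, name in CONSTRAINTS.items()]
for statement in statements:
    print(statement)
    # gds.run_cypher(statement)  # uncomment to run against the database
```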


## Load Data

Create Nodes

In [31]:
result = gds.run_cypher(
  """
    LOAD CSV WITH HEADERS 
    FROM 'https://raw.githubusercontent.com/ezhilvendhan/ecommerce-demo-gds-vertex-ai/main/data/import_node_category.csv' AS line WITH line
    MERGE (n:Category {id:line.id,name:line.name})
  """
)
display(result)

In [32]:
result = gds.run_cypher(
  """
    LOAD CSV WITH HEADERS 
    FROM 'https://raw.githubusercontent.com/ezhilvendhan/ecommerce-demo-gds-vertex-ai/main/data/import_node_sku.csv' AS line WITH line
    MERGE (n:SKU {id:line.id,name:line.name,attributes:line.attributes,brand:toInteger(line.brand),colour:toInteger(line.colour),serial:toInteger(line.serial)})
  """
)
display(result)

In [33]:
result = gds.run_cypher(
  """
    LOAD CSV WITH HEADERS 
    FROM 'https://raw.githubusercontent.com/ezhilvendhan/ecommerce-demo-gds-vertex-ai/main/data/import_node_keyword.csv' AS line WITH line
    MERGE (n:Keyword {id:line.id,keywords:line.keywords})
  """
)
display(result)

In [34]:
result = gds.run_cypher(
  """
    LOAD CSV WITH HEADERS 
    FROM 'https://raw.githubusercontent.com/ezhilvendhan/ecommerce-demo-gds-vertex-ai/main/data/import_node_product.csv' AS line WITH line
    MERGE (n:Product {id:line.id,shop_id:line.shop_id,name:line.name})
  """
)
display(result)

In [35]:
result = gds.run_cypher(
  """
    LOAD CSV WITH HEADERS 
    FROM 'https://raw.githubusercontent.com/ezhilvendhan/ecommerce-demo-gds-vertex-ai/main/data/import_node_attribute.csv' AS line WITH line
    MERGE (n:Attribute {id:line.id,type:line.type,value:line.value})
  """
)
display(result)
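Because `LOAD CSV` returns an empty result on success, the cells above give no visible confirmation. One way to check the import is to count nodes per label. This is a sketch: `count_nodes` and `NODE_COUNT_QUERY` are hypothetical names, and the label list is taken from the load cells above.

```python
# OPTIONAL MATCH keeps labels that have no nodes, reporting them with a count of 0.
NODE_COUNT_QUERY = """
    UNWIND $labels AS label
    OPTIONAL MATCH (n) WHERE label IN labels(n)
    RETURN label, count(n) AS nodes
"""

LABELS = ("Category", "SKU", "Keyword", "Product", "Attribute")


def count_nodes(gds, labels=LABELS):
    """Run the count query and return one row per requested label."""
    return gds.run_cypher(NODE_COUNT_QUERY, params={"labels": list(labels)})


# Usage, with the `gds` object created earlier:
# display(count_nodes(gds))
```

A count of 0 for any label suggests the corresponding CSV file did not load.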

Create Relationships

In [36]:
result = gds.run_cypher(
  """
    LOAD CSV WITH HEADERS 
    FROM 'https://raw.githubusercontent.com/ezhilvendhan/ecommerce-demo-gds-vertex-ai/main/data/import_relation_IS_CATEGORY.csv' AS line 
    WITH line
    MATCH (from:SKU {id:line.sku_id}), (to:Category {id:line.category_id})
    CREATE (from)-[:IS_CATEGORY]->(to)
  """
)
display(result)

In [37]:
result = gds.run_cypher(
  """
    LOAD CSV WITH HEADERS 
    FROM 'https://raw.githubusercontent.com/ezhilvendhan/ecommerce-demo-gds-vertex-ai/main/data/import_relation_IS_SKU.csv' AS line 
    WITH line
    MATCH (from:Product {id:line.product_id}), (to:SKU {id:line.sku_id})
    CREATE (from)-[:IS_SKU]->(to)
  """
)
display(result)

In [38]:
result = gds.run_cypher(
  """
    LOAD CSV WITH HEADERS 
    FROM 'https://raw.githubusercontent.com/ezhilvendhan/ecommerce-demo-gds-vertex-ai/main/data/import_relation_WITH_KEYWORD.csv' AS line 
    WITH line
    MATCH (from:SKU {id:line.sku_id}), (to:Keyword {id:line.keyword_id})
    CREATE (from)-[:WITH_KEYWORD]->(to)
  """
)
display(result)

In [39]:
result = gds.run_cypher(
  """
    LOAD CSV WITH HEADERS 
    FROM 'https://raw.githubusercontent.com/ezhilvendhan/ecommerce-demo-gds-vertex-ai/main/data/import_relation_HOT_SALE.csv' AS line 
    WITH line
    MATCH (from:SKU {id:line.sku_id}), (to:Product {id:line.product_id})
    CREATE (from)-[:HOT_SALE]->(to)
  """
)
display(result)

In [40]:
result = gds.run_cypher(
  """
    LOAD CSV WITH HEADERS 
    FROM 'https://raw.githubusercontent.com/ezhilvendhan/ecommerce-demo-gds-vertex-ai/main/data/import_relation_SUPPLEMENT_WITH.csv' AS line 
    WITH line
    MATCH (from:SKU {id:line.from_sku_id}), (to:SKU {id:line.to_sku_id})
    CREATE (from)-[:SUPPLEMENT_WITH]->(to)
  """
)
display(result)

In [41]:
result = gds.run_cypher(
  """
    LOAD CSV WITH HEADERS 
    FROM 'https://raw.githubusercontent.com/ezhilvendhan/ecommerce-demo-gds-vertex-ai/main/data/import_relation_LOW_PRICE.csv' AS line 
    WITH line
    MATCH (from:SKU {id:line.sku_id}), (to:Product {id:line.product_id})
    CREATE (from)-[:LOW_PRICE]->(to)
  """
)
display(result)
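The relationship cells above use `CREATE`, so running any of them a second time adds a duplicate copy of every relationship. Swapping in `MERGE` makes a rerun harmless. The sketch below generates such a statement; `relationship_load_query` is a hypothetical helper, and it assumes the `import_relation_<TYPE>.csv` naming seen in the URLs above.

```python
BASE_URL = "https://raw.githubusercontent.com/ezhilvendhan/ecommerce-demo-gds-vertex-ai/main/data/"


def relationship_load_query(rel_type, from_label, from_column, to_label, to_column):
    """Build a LOAD CSV statement that MERGEs one relationship per CSV row."""
    return f"""
    LOAD CSV WITH HEADERS
    FROM '{BASE_URL}import_relation_{rel_type}.csv' AS line
    MATCH (from:{from_label} {{id: line.{from_column}}}), (to:{to_label} {{id: line.{to_column}}})
    MERGE (from)-[:{rel_type}]->(to)
    """


query = relationship_load_query("IS_CATEGORY", "SKU", "sku_id", "Category", "category_id")
print(query)
# gds.run_cypher(query)  # uncomment to run against the database
```

`MERGE` does not clean up duplicates that an earlier `CREATE` run already wrote; it only avoids adding new ones.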

# Graph Data Science

First, we're going to create an in-memory graph representation of the data in Neo4j Graph Data Science (GDS).

In [42]:
result = gds.run_cypher(
  """
    CALL gds.graph.project(
      'similarity-graph',                                
      'SKU',
      '*',                                    
      {nodeProperties:['brand', 'colour', 'serial']}                           
    )
  """
)
display(result)

Unnamed: 0,nodeProjection,relationshipProjection,graphName,nodeCount,relationshipCount,projectMillis
0,"{'SKU': {'label': 'SKU', 'properties': {'colou...","{'__ALL__': {'orientation': 'NATURAL', 'aggreg...",similarity-graph,1000,500,15


Note that if you get an error saying the graph already exists, you have probably run this code before. You can drop the existing projection with this command:

In [55]:
result = gds.run_cypher(
  """
    CALL gds.graph.drop('similarity-graph')
  """
)
display(result)

Unnamed: 0,graphName,database,memoryUsage,sizeInBytes,nodeCount,relationshipCount,configuration,density,creationTime,modificationTime,schema
0,similarity-graph,neo4j,,-1,5100,56550,{'relationshipProjection': {'__ALL__': {'orien...,0.002175,2022-07-06T03:05:48.619649000+00:00,2022-07-06T03:22:26.023872000+00:00,"{'graphProperties': {}, 'relationships': {'__A..."
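The drop call itself fails when no graph with that name exists. `gds.graph.drop` accepts a second argument, `failIfMissing`, and passing `false` makes the call safe to run unconditionally before projecting. A minimal sketch, with `drop_graph_if_exists` as a hypothetical helper name:

```python
def drop_graph_if_exists(gds, graph_name):
    """Drop a projected graph; return an empty result instead of failing if it is absent."""
    # The second positional argument of gds.graph.drop is failIfMissing.
    return gds.run_cypher(
        "CALL gds.graph.drop($name, false)",
        params={"name": graph_name},
    )


# Usage, with the `gds` object created earlier:
# drop_graph_if_exists(gds, 'similarity-graph')
```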


Now, let's list the details of the graph to make sure the projection was created as we want.

In [43]:
result = gds.run_cypher(
  """
    CALL gds.graph.list()
  """
)
display(result)

Unnamed: 0,degreeDistribution,graphName,database,memoryUsage,sizeInBytes,nodeCount,relationshipCount,configuration,density,creationTime,modificationTime,schema
0,"{'p99': 1, 'min': 0, 'max': 1, 'mean': 0.5, 'p...",similarity-graph,neo4j,426 KiB,436476,1000,500,{},0.000501,2022-07-06T15:51:12.637584000+00:00,2022-07-06T15:51:12.653368000+00:00,"{'graphProperties': {}, 'relationships': {'__A..."


Let's use the K-Nearest Neighbors (KNN) algorithm to find similar nodes. You can learn more about it [here](https://neo4j.com/docs/graph-data-science/current/algorithms/knn/).

In [44]:
result = gds.run_cypher(
  """
  CALL gds.knn.write('similarity-graph', {
      writeRelationshipType: 'IS_SIMILAR',
      writeProperty: 'score',
      topK: 10,
      randomSeed: 42,
      concurrency: 1,
      nodeProperties: ['brand', 'colour', 'serial']
  })
  YIELD nodesCompared, relationshipsWritten
  """
)
display(result)

Unnamed: 0,nodesCompared,relationshipsWritten
0,1000,10000
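To read the scores this produces, it helps to see how one is computed. The sketch below assumes the default metric that the GDS documentation describes for scalar numeric properties, `1 / (1 + |a - b|)`, averaged with equal weight over the listed properties. The function names and the property values are made up for illustration.

```python
def property_similarity(a, b):
    """Similarity of two numeric property values: 1 / (1 + |a - b|)."""
    return 1.0 / (1.0 + abs(a - b))


def knn_score(props_a, props_b):
    """Mean per-property similarity across all compared properties."""
    sims = [property_similarity(a, b) for a, b in zip(props_a, props_b)]
    return sum(sims) / len(sims)


# Hypothetical (brand, colour, serial) values for two SKUs:
# brand and colour match exactly, serial differs by 1.
print(round(knn_score((3, 7, 100), (3, 7, 101)), 6))  # 0.833333
```

Under this assumption, a score of 1.0 means all three properties are identical, and each mismatched property pulls the mean down according to how far apart the values are.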


In [45]:
result = gds.run_cypher(
  """
    MATCH (n:SKU)-[s:IS_SIMILAR]-(p:SKU) RETURN n.id as a_id, n.name as a_name, p.id as b_id, p.name as b_name, s.score as score
  """
)
display(result)

Unnamed: 0,a_id,a_name,b_id,b_name,score
0,312,sku_name_312,0,sku_name_0,0.500005
1,412,sku_name_412,0,sku_name_0,0.472222
2,300,sku_name_300,0,sku_name_0,0.416699
3,760,sku_name_760,0,sku_name_0,0.444704
4,135,sku_name_135,0,sku_name_0,0.833333
...,...,...,...,...,...
19995,751,sku_name_751,999,sku_name_999,0.416675
19996,515,sku_name_515,999,sku_name_999,0.555556
19997,398,sku_name_398,999,sku_name_999,0.416679
19998,496,sku_name_496,999,sku_name_999,0.444622


To demonstrate interoperability, let's fetch the data from Neo4j and train a classification model on it with Vertex AI.

In [46]:
# Label a pair as similar when its KNN score exceeds 0.5
df = result
df['is_similar'] = df['score'] > 0.5
df

Unnamed: 0,a_id,a_name,b_id,b_name,score,is_similar
0,312,sku_name_312,0,sku_name_0,0.500005,True
1,412,sku_name_412,0,sku_name_0,0.472222,False
2,300,sku_name_300,0,sku_name_0,0.416699,False
3,760,sku_name_760,0,sku_name_0,0.444704,False
4,135,sku_name_135,0,sku_name_0,0.833333,True
...,...,...,...,...,...,...
19995,751,sku_name_751,999,sku_name_999,0.416675,False
19996,515,sku_name_515,999,sku_name_999,0.555556,True
19997,398,sku_name_398,999,sku_name_999,0.416679,False
19998,496,sku_name_496,999,sku_name_999,0.444622,False


Now that we have the data formatted properly, let's write it to disk as a single CSV file. Vertex AI will split it into training, validation, and test sets when we train the model.

In [47]:

df.to_csv('raw.csv', index=False)
df

Unnamed: 0,a_id,a_name,b_id,b_name,score,is_similar
0,312,sku_name_312,0,sku_name_0,0.500005,True
1,412,sku_name_412,0,sku_name_0,0.472222,False
2,300,sku_name_300,0,sku_name_0,0.416699,False
3,760,sku_name_760,0,sku_name_0,0.444704,False
4,135,sku_name_135,0,sku_name_0,0.833333,True
...,...,...,...,...,...,...
19995,751,sku_name_751,999,sku_name_999,0.416675,False
19996,515,sku_name_515,999,sku_name_999,0.555556,True
19997,398,sku_name_398,999,sku_name_999,0.416679,False
19998,496,sku_name_496,999,sku_name_999,0.444622,False


# Authenticate your Google Cloud Account
Now let's write the file to Google Cloud Storage so we can use it in our model.  To do so, we must first authenticate.

Edit the variables below. You can find the project ID in the Google Cloud Console. STORAGE_BUCKET is the name of the bucket to create; it must be globally unique and all lowercase.

In [48]:
# Edit this variable!
PROJECT_ID = 'neo4jbusinessdev'

# You can leave these defaults
STORAGE_BUCKET = PROJECT_ID + '-ev-form13'
REGION = 'us-central1'
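A malformed bucket name only fails later, when the bucket is created, so a local check can save a round trip. The sketch below covers a simplified subset of the Cloud Storage naming rules: 3 to 63 characters, lowercase letters, digits, dashes, underscores and dots, starting and ending with a letter or digit. It does not check global uniqueness or reserved words, and `looks_like_valid_bucket_name` is a hypothetical helper.

```python
import re

# Simplified pattern: 3-63 characters, lowercase alphanumerics plus '.', '_' and '-',
# beginning and ending with a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Return True if the name passes the simplified format check."""
    return _BUCKET_NAME.fullmatch(name) is not None


print(looks_like_valid_bucket_name("neo4jbusinessdev-ev-form13"))  # True
print(looks_like_valid_bucket_name("MyBucket"))  # False

# Usage with the variable defined above:
# assert looks_like_valid_bucket_name(STORAGE_BUCKET)
```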

In [49]:
import os
os.environ['GCLOUD_PROJECT'] = PROJECT_ID

In [50]:
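try:
    # Only available inside Colab; prompts for a Google account
    from google.colab import auth as google_auth
    google_auth.authenticate_user()
except ImportError:
    # Outside Colab, rely on credentials already configured in the environment
    pass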
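Outside Colab the cell above does nothing, and the Google Cloud client libraries fall back to Application Default Credentials. The sketch below reports which case applies. It assumes that Colab kernels load the `google.colab` module at startup, which is a common detection idiom rather than a documented guarantee, and `running_in_colab` is a hypothetical helper.

```python
import sys


def running_in_colab():
    """True when the google.colab module has already been loaded in this kernel."""
    return "google.colab" in sys.modules


if running_in_colab():
    print("Colab detected: authenticate_user() will prompt for a Google account.")
else:
    print("Not in Colab: run `gcloud auth application-default login` in a terminal first.")
```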

# Upload to Google Cloud Storage
Now we can upload our data set to our bucket.

In [51]:
from google.cloud import storage
client = storage.Client()

Run the code below to create the bucket if it doesn't already exist.

In [52]:
bucket = client.bucket(STORAGE_BUCKET)
# lookup_bucket returns None for a missing bucket, whereas get_bucket raises NotFound
if client.lookup_bucket(STORAGE_BUCKET) is None:
  bucket = client.create_bucket(bucket, location=REGION)

In [53]:
filename='raw.csv'
upload_path = f'similarity/{filename}'  # object names use forward slashes on every OS
blob = bucket.blob(upload_path)
blob.upload_from_filename(filename)

# Train a Model on GCP
We'll use the original features to train an AutoML model.

In [54]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

dataset = aiplatform.TabularDataset.create(
    display_name="similarity-raw",
    gcs_source=f"gs://{STORAGE_BUCKET}/similarity/raw.csv",
)
dataset.wait()

print(f'\tDataset: "{dataset.display_name}"')
print(f'\tname: "{dataset.resource_name}"')

Creating TabularDataset
Create TabularDataset backing LRO: projects/803648085855/locations/us-central1/datasets/8799466323882016768/operations/9063718069418852352
TabularDataset created. Resource name: projects/803648085855/locations/us-central1/datasets/8799466323882016768
To use this TabularDataset in another session:
ds = aiplatform.TabularDataset('projects/803648085855/locations/us-central1/datasets/8799466323882016768')
	Dataset: "similarity-raw"
	name: "projects/803648085855/locations/us-central1/datasets/8799466323882016768"


In [55]:
job = aiplatform.AutoMLTabularTrainingJob(
    display_name='similarity-raw',
    optimization_prediction_type='classification'
)

In [None]:
model = job.run(
    dataset=dataset, 
    target_column="is_similar", 
    training_fraction_split=0.8, 
    validation_fraction_split=0.1, 
    test_fraction_split=0.1, 
    model_display_name="similarity-raw", 
    disable_early_stopping=False, 
    budget_milli_node_hours=1000, 
)

No column transformations provided, so now retrieving columns from dataset in order to set default column transformations.
The column transformation of type 'auto' was set for the following columns: ['score', 'a_id', 'b_id', 'b_name', 'a_name'].
View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/6376044015095119872?project=803648085855
AutoMLTabularTrainingJob projects/803648085855/locations/us-central1/trainingPipelines/6376044015095119872 current state:
PipelineState.PIPELINE_STATE_RUNNING
AutoMLTabularTrainingJob projects/803648085855/locations/us-central1/trainingPipelines/6376044015095119872 current state:
PipelineState.PIPELINE_STATE_RUNNING
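The numeric arguments passed to `job.run` above are easy to get wrong: the budget is expressed in thousandths of a node hour, and the three split fractions must add up to 1. A small sketch of both checks, using hypothetical helper names:

```python
import math


def node_hours(budget_milli_node_hours):
    """Convert a budget in milli node hours to node hours."""
    return budget_milli_node_hours / 1000


def splits_are_valid(training, validation, test):
    """Check that the three split fractions sum to 1, allowing for float rounding."""
    return math.isclose(training + validation + test, 1.0)


print(node_hours(1000))  # 1.0
print(splits_are_valid(0.8, 0.1, 0.1))  # True
```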


1000 milli node hours, or one node hour, is the minimum budget that Vertex AI allows. However, at the time of writing, jobs were not finishing within that budget: expect this one to run for about two and a half hours.

We're going to move on while that runs.  You can check on the job later in the [Google Cloud Console](https://console.cloud.google.com/) to see the results.  There's a link to the specific job in the output of the cell above.