# Portfolio Churn Prediction with Amazon SageMaker Autopilot and Neo4j
This notebook describes how to use Neo4j and SageMaker together.  In it you connect to a Neo4j instance, load data and compute an embedding.  You then load that data into Amazon S3.  Finally, you use SageMaker to train a model using the new embedding as an additional feature.  

The data set represents a binary classification problem based on data from the SEC's EDGAR database.  It was scraped from the EDGAR system using the code [here](https://github.com/neo4j-partners/neo4j-sec-edgar-form13).  The data set consists of Form 13 data, the quarterly filings of asset managers with $100M or more of assets under management (AUM).

**Important:** This example notebook is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice.

## Deploy Neo4j
You're going to need a Neo4j deployment to run this lab.  The easiest way to get that is via the [AWS Marketplace](https://aws.amazon.com/marketplace/seller-profile?id=23ec694a-d2af-4641-b4d3-b7201ab2f5f9).  Select "Neo4j Enterprise Edition" and deploy that.  Suggested parameters are:

* Stack name - Enter something here
* Graph Database Version - 4.4.8
* Install Graph Data Science - True
* Graph Data Science License Key - None
* Install Bloom - False
* Bloom License Key - None
* Password - Enter something here
* Node Count - 1
* Instance Type - r6.4xlarge
* SSH CIDR - 0.0.0.0/0

The Marketplace listing deploys an Auto Scaling Group (ASG) and a Load Balancer (LB) in front of that.  When deployment is complete, you can get the DNS name of your LB from the console and use that to connect.  You can view deployed NLBs at [Load Balancer](https://console.aws.amazon.com/ec2/v2/home?#LoadBalancers:sort=loadBalancerName).

## Using the Neo4j API
Now that we have a Neo4j deployment, let's connect to it.  First, install `graphdatascience`, the Python client for Neo4j Graph Data Science.

In [2]:
%pip install graphdatascience

Note: you may need to restart the kernel to use updated packages.


Now, you're going to need the connection string and credentials from the deployment you created above.

In [3]:
# Edit these variables!
DB_URL = "neo4j://<XXX-nlb-XXX.elb.XXX>.amazonaws.com:7687"
DB_PASS = "<your-password>"

# You can leave this default
DB_USER = "neo4j"

In [4]:
from graphdatascience import GraphDataScience

gds = GraphDataScience(DB_URL, auth=(DB_USER, DB_PASS))

## Load Data into Neo4j
Now that we've got our connection object, let's load the dataset into Neo4j.

The dataset is pulled from the SEC's EDGAR database. It consists of public Form 13 filings. Asset managers with over \$100M in AUM are required to submit Form 13 quarterly, and the filings are then made available to the public over HTTP. The CSV file loaded below was pulled from EDGAR using some Python scripts. If you're curious, they're all available at [neo4j-sec-edgar-form13](https://github.com/neo4j-partners/neo4j-sec-edgar-form13). We've filtered the data to only include filings over \$10M in value.

First, we create node key constraints so that each company, manager and holding is stored only once.

In [5]:
result = gds.run_cypher(
    "CREATE CONSTRAINT IF NOT EXISTS ON (p:Company) ASSERT (p.cusip) IS NODE KEY;"
)
display(result)

result = gds.run_cypher(
    "CREATE CONSTRAINT IF NOT EXISTS ON (p:Manager) ASSERT (p.filingManager) IS NODE KEY;"
)
display(result)

result = gds.run_cypher(
    "CREATE CONSTRAINT IF NOT EXISTS ON (p:Holding) ASSERT (p.filingManager, p.cusip, p.reportCalendarOrQuarter) IS NODE KEY;"
)
display(result)

Now let's load the nodes.

In [6]:
result = gds.run_cypher(
    """
        LOAD CSV WITH HEADERS FROM "https://neo4j-dataset.s3.amazonaws.com/form13/2021.csv" AS row
        MERGE (c:Company {cusip:row.cusip})
        ON CREATE SET
            c.nameOfIssuer=row.nameOfIssuer
    """
)
display(result)

In [7]:
result = gds.run_cypher(
    """
        LOAD CSV WITH HEADERS FROM "https://neo4j-dataset.s3.amazonaws.com/form13/2021.csv" AS row
        MERGE (m:Manager {filingManager:row.filingManager})
    """
)
display(result)

In [8]:
result = gds.run_cypher(
    """
        LOAD CSV WITH HEADERS FROM "https://neo4j-dataset.s3.amazonaws.com/form13/2021.csv" AS row
        MERGE (h:Holding {filingManager:row.filingManager, cusip:row.cusip, reportCalendarOrQuarter:row.reportCalendarOrQuarter})
        ON CREATE SET
            h.value=row.value, 
            h.shares=row.shares,
            h.target=row.target,
            h.nameOfIssuer=row.nameOfIssuer
    """
)
display(result)

Now let's create relationships between those nodes.

In [9]:
result = gds.run_cypher(
    """
        LOAD CSV WITH HEADERS FROM "https://neo4j-dataset.s3.amazonaws.com/form13/2021.csv" AS row
        MATCH (m:Manager {filingManager:row.filingManager})
        MATCH (h:Holding {filingManager:row.filingManager, cusip:row.cusip, reportCalendarOrQuarter:row.reportCalendarOrQuarter})
        MERGE (m)-[r:OWNS]->(h)
    """
)
display(result)

In [10]:
result = gds.run_cypher(
    """
        LOAD CSV WITH HEADERS FROM "https://neo4j-dataset.s3.amazonaws.com/form13/2021.csv" AS row
        MATCH (h:Holding {filingManager:row.filingManager, cusip:row.cusip, reportCalendarOrQuarter:row.reportCalendarOrQuarter})
        MATCH (c:Company {cusip:row.cusip})
        MERGE (h)-[r:PARTOF]->(c)
    """
)
display(result)

## Graph Data Science
Now we're going to use Neo4j Graph Data Science to create an in-memory graph representation of the data.  We'll enhance that representation with features we engineer using a graph embedding.

In [13]:
result = gds.run_cypher(
    """
    CALL gds.graph.project(
      "mygraph",
      ["Company", "Manager", "Holding"],
      {
          OWNS: {orientation: "UNDIRECTED"},
          PARTOF: {orientation: "UNDIRECTED"}
      }
    )
    YIELD
      graphName AS graph,
      relationshipProjection AS readProjection,
      nodeCount AS nodes,
      relationshipCount AS rels
  """
)
display(result)

Unnamed: 0,graph,readProjection,nodes,rels
0,mygraph,"{'PARTOF': {'orientation': 'UNDIRECTED', 'aggr...",458170,1787688


If you get an error saying the graph already exists, that's probably because you ran this code before. You can destroy it using this command:

In [None]:
# result = gds.run_cypher(
#  """
#    CALL gds.graph.drop("mygraph")
#  """
# )
# display(result)

Now, let's list the details of the graph to make sure the projection was created as we want.

In [14]:
result = gds.run_cypher(
    """
    CALL gds.graph.list()
  """
)
display(result)

Unnamed: 0,degreeDistribution,graphName,database,memoryUsage,sizeInBytes,nodeCount,relationshipCount,configuration,density,creationTime,modificationTime,schema
0,"{'p99': 18, 'min': 1, 'max': 6864, 'mean': 3.9...",mygraph,neo4j,32 MiB,34237600,458170,1787688,{'relationshipProjection': {'PARTOF': {'orient...,9e-06,2022-07-20T22:18:09.350034000+00:00,2022-07-20T22:18:11.052890000+00:00,"{'relationships': {'PARTOF': {}, 'OWNS': {}}, ..."


Now we can generate an embedding from that graph. This is a new feature we can use in our predictions. We're using FastRP, a more full-featured and higher-performance alternative to Node2Vec. You can learn more on the [Fast Random Projection](https://neo4j.com/docs/graph-data-science/current/algorithms/fastrp/) documentation page.

FastRP has many parameters we could adjust.  One of the most obvious is `embeddingDimension`, the length of the vector produced for each node.  The documentation covers many more.
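To give a feel for what a random-projection embedding does, here is a deliberately simplified, pure-Python sketch: each node starts with a random vector and then repeatedly mixes in its neighbors' vectors. It is not the GDS implementation of FastRP (which, among other things, uses sparse random vectors, degree normalization and per-iteration weights). The function name, the toy graph and the parameter defaults are made up for illustration.

```python
import math
import random


def toy_random_projection_embeddings(adjacency, dimension=4, iterations=2, seed=1):
    """Toy embedding: random start vectors, then repeated neighbor averaging.

    adjacency maps each node to the list of its neighbors (undirected graph).
    """
    rng = random.Random(seed)
    # Step 1: give every node a random starting vector with entries in {-1, 0, +1}.
    current = {
        node: [rng.choice((-1.0, 0.0, 1.0)) for _ in range(dimension)]
        for node in sorted(adjacency)
    }
    embedding = {node: [0.0] * dimension for node in adjacency}
    for _ in range(iterations):
        # Step 2: replace each node's vector with the mean of its neighbors' vectors.
        averaged = {}
        for node, neighbors in adjacency.items():
            if neighbors:
                averaged[node] = [
                    sum(current[n][i] for n in neighbors) / len(neighbors)
                    for i in range(dimension)
                ]
            else:
                averaged[node] = [0.0] * dimension
        # Step 3: L2-normalize each vector and add it to the running embedding.
        for node, vector in averaged.items():
            norm = math.sqrt(sum(v * v for v in vector)) or 1.0
            averaged[node] = [v / norm for v in vector]
            embedding[node] = [e + v for e, v in zip(embedding[node], averaged[node])]
        current = averaged
    return embedding


# Hypothetical four-node graph shaped like Manager -> Holding -> Company.
toy_graph = {
    "manager": ["holding_a", "holding_b"],
    "holding_a": ["manager", "company"],
    "holding_b": ["manager", "company"],
    "company": ["holding_a", "holding_b"],
}
embeddings = toy_random_projection_embeddings(toy_graph)
print(len(embeddings["holding_a"]))
```

Because `holding_a` and `holding_b` have exactly the same neighbors, they end up with identical vectors. That is the sense in which an embedding like this encodes where a node sits in the graph.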

In [15]:
result = gds.run_cypher(
    """
  CALL gds.fastRP.mutate("mygraph",{
    embeddingDimension: 16,
    randomSeed: 1,
    mutateProperty:"embedding"
  })
  """
)
display(result)

Unnamed: 0,nodePropertiesWritten,mutateMillis,nodeCount,preProcessingMillis,computeMillis,configuration
0,458170,0,458170,0,326,"{'nodeSelfInfluence': 0, 'relationshipWeightPr..."


That creates an embedding for every node in the projection, regardless of its label.  However, we only want the embedding on the `Holding` nodes.

We're going to take the embedding from our projection and write it to the `Holding` nodes in the underlying database.

In [16]:
result = gds.run_cypher(
    """
    CALL gds.graph.writeNodeProperties("mygraph", ["embedding"], ["Holding"])
    YIELD writeMillis
  """
)
display(result)

Unnamed: 0,writeMillis
0,1894


In [17]:
result = gds.run_cypher(
    """
    MATCH (n:Holding) RETURN n
  """
)
display(result)

Unnamed: 0,n
0,"(shares, cusip, reportCalendarOrQuarter, filin..."
1,"(shares, cusip, reportCalendarOrQuarter, filin..."
2,"(shares, cusip, reportCalendarOrQuarter, filin..."
3,"(shares, cusip, reportCalendarOrQuarter, filin..."
4,"(shares, cusip, reportCalendarOrQuarter, filin..."
...,...
446917,"(shares, cusip, reportCalendarOrQuarter, filin..."
446918,"(shares, cusip, reportCalendarOrQuarter, filin..."
446919,"(shares, cusip, reportCalendarOrQuarter, filin..."
446920,"(shares, cusip, reportCalendarOrQuarter, filin..."


Note that the query above takes 2-3 minutes to run because it retrieves nearly half a million nodes along with all their properties and our new embedding.

In [18]:
import pandas as pd

df = pd.DataFrame([dict(record.items()) for record in result["n"]])
df

Unnamed: 0,shares,cusip,reportCalendarOrQuarter,filingManager,embedding,nameOfIssuer,value,target
0,270,88579Y101,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[-0.06891358643770218, 0.17251434922218323, -0...",3M Co,52024000,False
1,195,00508Y102,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[-0.0012208922998979688, 0.005601376295089722,...",Acuity Brands Inc,32175000,False
2,4939,00724F101,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[0.11818736791610718, 0.2625264525413513, -0.0...",Adobe Systems Inc,2347852000,False
3,1557,02079K305,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[0.14201900362968445, 0.26428067684173584, 0.1...",Alphabet Inc A,3211344000,False
4,837,02079K107,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[0.30085813999176025, 0.24750658869743347, 0.4...",Alphabet Inc C,1731443000,False
...,...,...,...,...,...,...,...,...
446917,56874,911312106,09-30-2021,LEE DANNER & BASS INC,"[-0.10756000876426697, 0.09883543848991394, 0....",United Parcel Svc. Cl B,10357000,False
446918,231000,92552v100,09-30-2021,LEE DANNER & BASS INC,"[-0.19772686064243317, 0.19481559097766876, 0....",ViaSat Inc,12721000,True
446919,55104,92826C839,09-30-2021,LEE DANNER & BASS INC,"[0.19159139692783356, 0.5284040570259094, 0.15...",Visa Inc,12274000,True
446920,79459,931142103,09-30-2021,LEE DANNER & BASS INC,"[0.16196903586387634, 0.5445767045021057, 0.60...",Wal-Mart Stores Inc.,11075000,False


Note that the embedding column holds an array in each row. To make this dataset more consumable, we should flatten it into individual features: embedding_0, embedding_1, ... embedding_n.


In [19]:
embeddings = pd.DataFrame(df["embedding"].values.tolist()).add_prefix("embedding_")
merged = df.drop(columns=["embedding"]).merge(embeddings, left_index=True, right_index=True)
merged

Unnamed: 0,shares,cusip,reportCalendarOrQuarter,filingManager,nameOfIssuer,value,target,embedding_0,embedding_1,embedding_2,...,embedding_6,embedding_7,embedding_8,embedding_9,embedding_10,embedding_11,embedding_12,embedding_13,embedding_14,embedding_15
0,270,88579Y101,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,3M Co,52024000,False,-0.068914,0.172514,-0.270718,...,0.142906,0.707541,0.255056,0.168236,0.014960,-0.183164,0.214030,-0.149799,0.436635,-0.393186
1,195,00508Y102,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,Acuity Brands Inc,32175000,False,-0.001221,0.005601,-0.008561,...,-0.249999,-0.027272,0.210765,0.232817,0.275897,-0.276700,0.338566,-0.479243,0.789944,-0.336510
2,4939,00724F101,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,Adobe Systems Inc,2347852000,False,0.118187,0.262526,-0.081295,...,-0.591563,0.398565,0.239687,-0.259427,-0.352596,0.025593,0.626399,0.247615,0.445566,0.094596
3,1557,02079K305,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,Alphabet Inc A,3211344000,False,0.142019,0.264281,0.142079,...,-0.150988,1.035650,0.370033,-0.064532,-0.239375,-0.003771,0.124420,0.027277,0.077211,-0.477533
4,837,02079K107,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,Alphabet Inc C,1731443000,False,0.300858,0.247507,0.449530,...,-0.415947,0.365203,-0.428251,-0.000808,0.304490,0.002407,-0.100384,-0.317436,0.368293,-0.037012
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
446917,56874,911312106,09-30-2021,LEE DANNER & BASS INC,United Parcel Svc. Cl B,10357000,False,-0.107560,0.098835,0.136811,...,-0.246284,0.304195,0.127205,-0.927462,-0.280658,0.211018,-0.149063,-0.060732,-0.073806,-0.089820
446918,231000,92552v100,09-30-2021,LEE DANNER & BASS INC,ViaSat Inc,12721000,True,-0.197727,0.194816,0.100359,...,-0.246779,0.370053,0.573178,0.061018,-0.038240,-0.423521,-0.037658,-0.372200,-0.453317,-0.374626
446919,55104,92826C839,09-30-2021,LEE DANNER & BASS INC,Visa Inc,12274000,True,0.191591,0.528404,0.158589,...,-0.766427,-0.110386,0.576788,-0.022191,-0.768111,-0.192285,-0.119118,0.234878,0.124492,-0.344153
446920,79459,931142103,09-30-2021,LEE DANNER & BASS INC,Wal-Mart Stores Inc.,11075000,False,0.161969,0.544577,0.602678,...,-0.070697,-0.004854,0.217884,-0.519946,-0.045873,-0.047081,-0.103562,0.011971,-0.375597,-0.362491


Now that we have the data formatted properly, let's split it into training, testing and validation sets.  We'll write those to disk.

Our data is, in some sense, a time series.  We're going to window over three quarters.  Q4 of 2021 is used to generate labels, so it's not present in the data set.  That leaves Q3 as our validation data set.  Q2 becomes test and Q1 is for training.

We take this approach rather than generating random folds or similar to avoid time-based leakage.

In [20]:
df = merged

train = df.loc[df["reportCalendarOrQuarter"] == "03-31-2021"]
train.to_csv("train.csv", index=False)

test = df.loc[df["reportCalendarOrQuarter"] == "06-30-2021"]
test = test.drop(["target"], axis=1)
test.to_csv("test.csv", index=False)

validate = df.loc[df["reportCalendarOrQuarter"] == "09-30-2021"]
validate = validate.drop(["target"], axis=1)
validate.to_csv("validate.csv", index=False)
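One way to sanity-check a chronological split like the one above is to confirm that no training row is dated on or after the earliest evaluation row. The sketch below is a minimal illustration using only the standard library. The helper names are hypothetical, and the date format is assumed from the `reportCalendarOrQuarter` values shown earlier. The dates are parsed rather than compared as strings because `MM-DD-YYYY` strings do not sort chronologically across years.

```python
from datetime import datetime

DATE_FORMAT = "%m-%d-%Y"  # matches values such as "03-31-2021"


def parse_quarter(value):
    """Parse a quarter-end string such as '03-31-2021' into a datetime."""
    return datetime.strptime(value, DATE_FORMAT)


def has_time_leakage(train_quarters, eval_quarters):
    """Return True if any training row is dated on or after the earliest evaluation row."""
    latest_train = max(parse_quarter(q) for q in train_quarters)
    earliest_eval = min(parse_quarter(q) for q in eval_quarters)
    return latest_train >= earliest_eval


# With the DataFrames from the cell above, the inputs could be
# train["reportCalendarOrQuarter"] and test["reportCalendarOrQuarter"].
chronological_leaks = has_time_leakage(
    ["03-31-2021", "03-31-2021"], ["06-30-2021", "09-30-2021"]
)
shuffled_leaks = has_time_leakage(["03-31-2021", "09-30-2021"], ["06-30-2021"])
print(chronological_leaks, shuffled_leaks)
```

A randomly shuffled split would typically fail this check, because later quarters would end up in the training set.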

## SageMaker Connection
Let's set up our SageMaker connection.

In [21]:
import sagemaker
import boto3

region = boto3.Session().region_name

session = sagemaker.Session()
bucket = session.default_bucket()
prefix = "sagemaker/form13"

role = sagemaker.get_execution_role()

sm = boto3.Session().client(service_name="sagemaker", region_name=region)

## Upload to Amazon S3
Now we're going to upload the training, testing and validation data to our default SageMaker bucket.

In [22]:
train_data_s3_path = session.upload_data(path="train.csv", key_prefix=prefix + "/train")
print("Training data uploaded to: " + train_data_s3_path)

test_data_s3_path = session.upload_data(path="test.csv", key_prefix=prefix + "/test")
print("Testing data uploaded to: " + test_data_s3_path)

validation_data_s3_path = session.upload_data(path="validate.csv", key_prefix=prefix + "/validate")
print("Validation data uploaded to: " + validation_data_s3_path)

Training data uploaded to: s3://sagemaker-us-east-1-159878781974/sagemaker/form13/train/train.csv
Testing data uploaded to: s3://sagemaker-us-east-1-159878781974/sagemaker/form13/test/test.csv
Validation data uploaded to: s3://sagemaker-us-east-1-159878781974/sagemaker/form13/validate/validate.csv


## Setting up the SageMaker Autopilot Job
After uploading the dataset to Amazon S3, you can invoke Autopilot to find the best ML pipeline to train a model on this dataset.

In [23]:
auto_ml_job_config = {"CompletionCriteria": {"MaxCandidates": 3}}

input_data_config = [
    {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://{}/{}/train".format(bucket, prefix),
            }
        },
        "TargetAttributeName": "target",
    }
]

output_data_config = {"S3OutputPath": "s3://{}/{}/output".format(bucket, prefix)}

## Launching the SageMaker Autopilot Job
You can now launch the Autopilot job by calling the `create_auto_ml_job` method.

In [24]:
from time import gmtime, strftime, sleep

timestamp_suffix = strftime("%d-%H-%M-%S", gmtime())

auto_ml_job_name = "automl-form13-" + timestamp_suffix
print("AutoMLJobName: " + auto_ml_job_name)

sm.create_auto_ml_job(
    AutoMLJobName=auto_ml_job_name,
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    AutoMLJobConfig=auto_ml_job_config,
    RoleArn=role,
)

AutoMLJobName: automl-form13-20-22-24-13


{'AutoMLJobArn': 'arn:aws:sagemaker:us-east-1:159878781974:automl-job/automl-form13-20-22-24-13',
 'ResponseMetadata': {'RequestId': '5b71c85a-0098-4607-bbf7-8f4e27dbab8c',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '5b71c85a-0098-4607-bbf7-8f4e27dbab8c',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '96',
   'date': 'Wed, 20 Jul 2022 22:24:14 GMT'},
  'RetryAttempts': 0}}

## Tracking SageMaker Autopilot job progress
A SageMaker Autopilot job consists of the following high-level steps:

* Analyzing Data, where the dataset is analyzed and Autopilot comes up with a list of ML pipelines that should be tried out on the dataset. The dataset is also split into train and validation sets. 
* Feature Engineering, where Autopilot performs feature transformation on individual features of the dataset as well as at an aggregate level. 
* Model Tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm (the last stage of the pipeline).

This job typically takes about 60 minutes to run.

In [25]:
print("JobStatus - Secondary Status")
print("----------------------------")

describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print(describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"])
job_run_status = describe_response["AutoMLJobStatus"]

while job_run_status not in ("Failed", "Completed", "Stopped"):
    describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    job_run_status = describe_response["AutoMLJobStatus"]

    print(
        describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"]
    )
    sleep(30)

JobStatus - Secondary Status
----------------------------
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - AnalyzingData
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
InProgress - FeatureEngineering
...
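The polling loop above runs until the job reaches a terminal state, however long that takes. A variation, sketched here under the assumption that you would rather give up after a fixed time, wraps the same logic in a helper with a timeout. The helper name, its parameters and the two-hour default are hypothetical. The status source is passed in as a function so the sketch runs without an AWS connection.

```python
import time

TERMINAL_STATUSES = ("Failed", "Completed", "Stopped")


def wait_for_terminal_status(
    get_status, poll_seconds=30, timeout_seconds=2 * 60 * 60, sleep=time.sleep
):
    """Poll get_status() until it returns a terminal status or the timeout elapses."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("job still {} after {} seconds".format(status, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in status source for illustration. In this notebook it could instead be:
# lambda: sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)["AutoMLJobStatus"]
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_terminal_status(
    lambda: next(fake_statuses), sleep=lambda seconds: None
)
print(final_status)
```

Passing `sleep` as a parameter keeps the helper testable: a no-op can be substituted so the example finishes immediately.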

## Results
Now use the describe_auto_ml_job API to look up the best candidate selected by the SageMaker Autopilot job.

In [26]:
import pprint

best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)["BestCandidate"]
best_candidate_name = best_candidate["CandidateName"]

print("CandidateName: " + best_candidate_name)
print(
    "FinalAutoMLJobObjectiveMetricName: "
    + best_candidate["FinalAutoMLJobObjectiveMetric"]["MetricName"]
)
print(
    "FinalAutoMLJobObjectiveMetricValue: "
    + str(best_candidate["FinalAutoMLJobObjectiveMetric"]["Value"])
)
print()
pprint.pprint(best_candidate)

CandidateName: automl-form13-20-22-24-136FqgNeC-003-5dfa316b
FinalAutoMLJobObjectiveMetricName: validation:f1_binary
FinalAutoMLJobObjectiveMetricValue: 0.5145900249481201

{'CandidateName': 'automl-form13-20-22-24-136FqgNeC-003-5dfa316b',
 'CandidateProperties': {'CandidateArtifactLocations': {'Explainability': 's3://sagemaker-us-east-1-159878781974/sagemaker/form13/output/automl-form13-20-22-24-13/documentation/explainability/output',
                                                        'ModelInsights': 's3://sagemaker-us-east-1-159878781974/sagemaker/form13/output/automl-form13-20-22-24-13/documentation/model_monitor/output'},
                         'CandidateMetrics': [{'MetricName': 'F1',
                                               'Set': 'Validation',
                                               'StandardMetricName': 'F1',
                                               'Value': 0.5145900249481201},
                                              {'MetricName': 'LogLoss',


## Batch Inference
Now that we have completed the SageMaker Autopilot job on the dataset, let's create a model from the best candidate with Inference Pipelines.

In [27]:
model_name = "automl-form13-model-" + timestamp_suffix
model = sm.create_model(
    Containers=best_candidate["InferenceContainers"], ModelName=model_name, ExecutionRoleArn=role
)
print("Model ARN corresponding to the best candidate is: {}".format(model["ModelArn"]))

Model ARN corresponding to the best candidate is: arn:aws:sagemaker:us-east-1:159878781974:model/automl-form13-model-20-22-24-13


We can use batch inference through Amazon SageMaker batch transform. The same model can also be deployed to perform online inference using Amazon SageMaker hosting.



In [28]:
transform_job_name = "automl-form13-transform-" + timestamp_suffix

transform_input = {
    "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": test_data_s3_path}},
    "ContentType": "text/csv",
    "CompressionType": "None",
    "SplitType": "Line",
}

transform_output = {
    "S3OutputPath": "s3://{}/{}/inference-results".format(bucket, prefix),
}

transform_resources = {"InstanceType": "ml.m5.4xlarge", "InstanceCount": 1}

sm.create_transform_job(
    TransformJobName=transform_job_name,
    ModelName=model_name,
    TransformInput=transform_input,
    TransformOutput=transform_output,
    TransformResources=transform_resources,
)

{'TransformJobArn': 'arn:aws:sagemaker:us-east-1:159878781974:transform-job/automl-form13-transform-20-22-24-13',
 'ResponseMetadata': {'RequestId': 'dd2a70c6-c208-4b61-a6e7-14d7ff345f0d',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'dd2a70c6-c208-4b61-a6e7-14d7ff345f0d',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '112',
   'date': 'Wed, 20 Jul 2022 23:21:32 GMT'},
  'RetryAttempts': 0}}

Now we can watch the transform job for completion.  That takes approximately 20 minutes.

In [29]:
print("JobStatus")
print("---------")

describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
job_run_status = describe_response["TransformJobStatus"]
print(job_run_status)

while job_run_status not in ("Failed", "Completed", "Stopped"):
    describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
    job_run_status = describe_response["TransformJobStatus"]
    print(job_run_status)
    sleep(30)

JobStatus
---------
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
Completed


Now let’s get the URL of the transform job results.  You can open this in S3.

In [30]:
bucket = session.default_bucket()
# Batch transform names each output object after its input object plus ".out"
key = "{}/inference-results/test.csv.out".format(prefix)
url = "s3://{}/{}".format(bucket, key)

print(url)

s3://sagemaker-us-east-1-159878781974/sagemaker/form13/inference-results/test.csv.out


## View All Candidates
You can view all the candidates (pipeline evaluations with different hyperparameter combinations) that were explored by SageMaker Autopilot and sort them by their final performance metric.

In [31]:
candidates = sm.list_candidates_for_auto_ml_job(
    AutoMLJobName=auto_ml_job_name, SortBy="FinalObjectiveMetricValue"
)["Candidates"]
index = 0
for candidate in candidates:
    print(
        str(index)
        + "  "
        + candidate["CandidateName"]
        + "  "
        + str(candidate["FinalAutoMLJobObjectiveMetric"]["Value"])
    )
    index += 1

0  automl-form13-20-22-24-136FqgNeC-003-5dfa316b  0.5145900249481201
1  automl-form13-20-22-24-136FqgNeC-002-1e69de74  0.45396000146865845
2  automl-form13-20-22-24-136FqgNeC-001-fe4c3ef8  0.4429199993610382
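The values listed above are F1 scores (the job reported `validation:f1_binary` as its objective metric). F1 is the harmonic mean of precision and recall for the positive class. As a reminder of how such a number is computed, here is a small self-contained sketch. The labels and predictions are made-up toy values, not results from this job.

```python
def f1_binary(y_true, y_pred):
    """F1 score for the positive class, given parallel lists of booleans."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Toy data for illustration only
labels = [True, True, False, False, True]
predictions = [True, False, False, True, True]
score = f1_binary(labels, predictions)
print(round(score, 3))
```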


## Candidate Generation Notebook
SageMaker Autopilot also auto-generates a Candidate Definitions notebook. You can use this notebook to step interactively through what SageMaker Autopilot did to arrive at the best candidate. You can also use it to override various runtime parameters such as parallelism, hardware used, algorithms explored, feature extraction scripts and more.

This code downloads a file from our SageMaker bucket using the SageMaker session.

In [32]:
def downloadNotebook(s3_path):
    session = sagemaker.Session()
    role = sagemaker.get_execution_role()

    # reformat the s3 URL into something boto3 can handle
    s3_path_parts = s3_path.replace("s3://", "").split("/")
    bucket, key, file = s3_path_parts[0], "/".join(s3_path_parts[1:]), s3_path_parts[-1]
    
    print(bucket)
    print(key)
    print(file)

    print("file" + file)
    notebook = session.read_s3_file(bucket, key)
    with open(file, "w") as text_file:
        text_file.write(notebook)
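The function above splits the S3 path by hand. An alternative sketch, assuming you prefer to lean on the standard library, uses `urllib.parse.urlparse` and rejects anything that is not an `s3://` URI. The helper name and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/key.ext' into (bucket, key, file_name)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(s3_uri))
    key = parsed.path.lstrip("/")
    file_name = key.rsplit("/", 1)[-1]
    return parsed.netloc, key, file_name


# Hypothetical URI, for illustration only
bucket_name, object_key, file_name = split_s3_uri(
    "s3://example-bucket/sagemaker/form13/output/notebook.ipynb"
)
print(bucket_name, object_key, file_name)
```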

We can download the notebook with the command:

In [33]:
notebook_s3_path = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)["AutoMLJobArtifacts"][
    "CandidateDefinitionNotebookLocation"
]
downloadNotebook(notebook_s3_path)

sagemaker-us-east-1-159878781974
sagemaker/form13/output/automl-form13-20-22-24-13/sagemaker-automl-candidates/automl-form13-20-22-24-13-pr-1-72ff29f4d6214c93ae974d671d64ebc9/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb
SageMakerAutopilotCandidateDefinitionNotebook.ipynb
fileSageMakerAutopilotCandidateDefinitionNotebook.ipynb


## Data Exploration Notebook
SageMaker Autopilot also auto-generates a Data Exploration notebook.  This code will download that notebook:


In [34]:
notebook_s3_path = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)["AutoMLJobArtifacts"][
    "DataExplorationNotebookLocation"
]
downloadNotebook(notebook_s3_path)

sagemaker-us-east-1-159878781974
sagemaker/form13/output/automl-form13-20-22-24-13/sagemaker-automl-candidates/automl-form13-20-22-24-13-pr-1-72ff29f4d6214c93ae974d671d64ebc9/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb
SageMakerAutopilotDataExplorationNotebook.ipynb
fileSageMakerAutopilotDataExplorationNotebook.ipynb


## Cleanup
SageMaker stores its data in an Amazon S3 bucket.  You may want to delete that once you're done working with it.

The AWS Marketplace listing we deployed Neo4j Enterprise Edition with created a stack.  To delete the deployment, you would navigate to Amazon [CloudFormation](https://console.aws.amazon.com/cloudformation) in the console and delete the stack there.  Be sure to delete the entire stack as that will delete all the subcomponents of the stack.

## Conclusion
In this notebook, you deployed Neo4j Enterprise Edition.  Within SageMaker Studio, you then loaded a data set into Neo4j Graph Database.  You used Neo4j Graph Data Science to compute a graph embedding on that dataset.  Using that embedding, you ran a SageMaker Autopilot job and inspected the output.

This same flow can be repurposed to add graph embeddings to your own machine learning jobs.  Graph embeddings are just one sort of graph feature that can be used in machine learning.  The approach we used here would also apply to other graph features, such as betweenness or neighborhood.