Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: Apache-2.0

# Social Network Recommendations

In this example we're going to build a powerful social network predictive capability with some simple Gremlin queries. The techniques introduced here can also be used to build predictions in domains other than social networks.

### People You May Know

A common feature of many social network applications is the ability to recommend People-You-May-Know (or People-You-May-Want-To-Know) – sometimes abbreviated PYMK.

Using Amazon Neptune, we can implement a PYMK capability using a well-understood phenomenon called *triadic closure*. Triadic closure is the tendency for elements at a very local level in a graph to form stable triangles as the data changes over time. This behaviour can be observed in graphs in all kinds of different domains. It's the basis of many homophily-based recommendation systems – systems that exploit the fact that similarity breeds connections. In this example we're going to look at using triadic closure in the context of a social network.

Let's imagine we have a social network whose members include Bill, Terry and Sarah. Bill is friends with both Terry and Sarah; that is, Terry and Sarah have a mutual friend in Bill.

Because they have Bill in common, there's a good chance that Sarah and Terry either already know one another or may get to know one another in the near future. Just looking at the graph, we can see they have both the *means* and the *motive* to be friends. Hanging around with Bill provides the means for Sarah and Terry to meet. And because they trust Bill, they have the motive to trust people with whom Bill is friends, increasing the chance that if they do meet, they'll form a connection and close the triangle.

In the context of a social network, we can use triadic closure to implement PYMK. When a particular user logs into the system, we can look up their vertex in the graph, and then traverse their friend-of-a-friend network, looking for opportunities to close triangles. The more paths that extend from our user, through their immediate friends, to someone to whom they are not currently connected, the greater the likelihood the user may either already know that person, or may benefit from getting to know them.
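The path-counting idea above can be sketched in plain Python before we turn to Gremlin. This is only an illustration, not the query used later in this notebook: the function names are hypothetical, and apart from the Bill, Terry and Sarah triangle the friendships are made up for the example.

```python
from collections import Counter

# Hypothetical toy friendships (undirected). Only the Bill/Terry/Sarah
# relationships come from the text; Ann and Raj are illustrative.
friendships = [
    ("Bill", "Terry"),
    ("Bill", "Sarah"),
    ("Ann", "Terry"),
    ("Ann", "Sarah"),
    ("Ann", "Raj"),
]


def build_adjacency(pairs):
    """Map each person to the set of their immediate friends."""
    adjacency = {}
    for a, b in pairs:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def people_you_may_know(adjacency, user):
    """Count two-hop paths from user to people they are not yet friends with."""
    friends = adjacency.get(user, set())
    counts = Counter()
    for friend in friends:
        for candidate in adjacency.get(friend, set()):
            # Skip the user and anyone who is already a friend:
            # only unclosed triangles are of interest.
            if candidate != user and candidate not in friends:
                counts[candidate] += 1
    # Most paths first: more mutual friends, stronger recommendation
    return counts.most_common()


adjacency = build_adjacency(friendships)
print(people_you_may_know(adjacency, "Terry"))  # [('Sarah', 2), ('Raj', 1)]
```

Sarah ranks first for Terry because two of Terry's friends (Bill and Ann) know her, whereas only one path leads to Raj.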

### Setup

Before we begin, we'll clear any existing data from our Neptune cluster, using the cell magic `%%gremlin` and a subsequent drop query:

In [114]:
%%gremlin

g.V().drop()


How do we know which Neptune cluster to access? The cell magics exposed by Neptune Notebooks use a configuration file located by default at `~/graph_notebook_config.json`. When the SageMaker instance is initialized, this configuration is generated from environment variables derived from the cluster being connected to.

You can check the contents of the configuration in two ways. You can print the file itself, or you can look for the configuration being used by the notebook which you have opened.
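As a rough sketch of how such a configuration might be consumed outside the cell magics, the snippet below builds the Gremlin WebSocket URL from the `host`, `port` and `ssl` fields. The helper names are hypothetical, the sample host is a placeholder, and treating a missing `ssl` field as `True` is an assumption made for this example.

```python
import json
import os


def load_notebook_config(path="~/graph_notebook_config.json"):
    """Load the notebook configuration file as a dictionary."""
    with open(os.path.expanduser(path)) as f:
        return json.load(f)


def gremlin_url(config):
    """Build the Gremlin WebSocket URL from a configuration dictionary."""
    # Assumption: a missing "ssl" field is treated as True
    scheme = "wss" if config.get("ssl", True) else "ws"
    return f"{scheme}://{config['host']}:{config['port']}/gremlin"


# Placeholder values; in a notebook you would pass load_notebook_config()
sample = {"host": "my-cluster.example.com", "port": 8182, "ssl": True}
print(gremlin_url(sample))  # wss://my-cluster.example.com:8182/gremlin
```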

In [115]:
%%bash

cat ~/graph_notebook_config.json

{
  "host": "neptunedbcluster-gtdhydmqdzgt.cluster-c6g3m9gnkltx.eu-west-2.neptune.amazonaws.com",
  "port": 8182,
  "auth_mode": "DEFAULT",
  "load_from_s3_arn": "arn:aws:iam::707684582322:role/NeptuneMLQuickStart-NeptuneB-NeptuneLoadFromS3Role-QLGFKTDYTKJD",
  "ssl": true,
  "aws_region": "eu-west-2",
  "sparql": {
    "path": "sparql"
  }
}

In [116]:
%graph_notebook_config

{
  "host": "neptunedbcluster-gtdhydmqdzgt.cluster-c6g3m9gnkltx.eu-west-2.neptune.amazonaws.com",
  "port": 8182,
  "auth_mode": "DEFAULT",
  "load_from_s3_arn": "arn:aws:iam::707684582322:role/NeptuneMLQuickStart-NeptuneB-NeptuneLoadFromS3Role-QLGFKTDYTKJD",
  "ssl": true,
  "aws_region": "eu-west-2",
  "sparql": {
    "path": "sparql"
  }
}


<graph_notebook.configuration.generate_config.Configuration at 0x7fcda0745f98>

### Create a Social Network

Next, we'll create a small social network. Note that the script below comprises a single statement. All the vertices and edges here will be created in the context of a single transaction.

In [117]:
%%gremlin

g.
addV('User').property('name','Bill').property('birthdate', '1988-03-22').
addV('User').property('name','Sarah').property('birthdate', '1992-05-03').
addV('User').property('name','Ben').property('birthdate', '1989-10-21').
addV('User').property('name','Lucy').property('birthdate', '1998-01-17').
addV('User').property('name','Colin').property('birthdate', '2001-08-14').
addV('User').property('name','Emily').property('birthdate', '1998-03-05').
addV('User').property('name','Gordon').property('birthdate', '2002-12-04').
addV('User').property('name','Kate').property('birthdate', '1995-02-12').
addV('User').property('name','Peter').property('birthdate', '2001-02-27').
addV('User').property('name','Terry').property('birthdate', '1989-10-02').
addV('User').property('name','Alistair').property('birthdate', '1992-06-30').
addV('User').property('name','Eve').property('birthdate', '2000-05-13').
addV('User').property('name','Gary').property('birthdate', '1998-09-20').
addV('User').property('name','Mary').property('birthdate', '1997-01-27').
addV('User').property('name','Charlie').property('birthdate', '1989-11-02').
addV('User').property('name','Sue').property('birthdate', '1994-03-08').
addV('User').property('name','Arnold').property('birthdate', '2002-07-23').
addV('User').property('name','Chloe').property('birthdate', '1988-11-04').
addV('User').property('name','Henry').property('birthdate', '1996-03-15').
addV('User').property('name','Josie').property('birthdate', '2003-08-21').
V().hasLabel('User').has('name','Sarah').as('a').V().hasLabel('User').has('name','Bill').addE('FRIEND').to('a').property('strength',1).
V().hasLabel('User').has('name','Colin').as('a').V().hasLabel('User').has('name','Bill').addE('FRIEND').to('a').property('strength',2).
V().hasLabel('User').has('name','Terry').as('a').V().hasLabel('User').has('name','Bill').addE('FRIEND').to('a').property('strength',3).
V().hasLabel('User').has('name','Peter').as('a').V().hasLabel('User').has('name','Colin').addE('FRIEND').to('a').property('strength',1).
V().hasLabel('User').has('name','Kate').as('a').V().hasLabel('User').has('name','Ben').addE('FRIEND').to('a').property('strength',2).
V().hasLabel('User').has('name','Kate').as('a').V().hasLabel('User').has('name','Lucy').addE('FRIEND').to('a').property('strength',3).
V().hasLabel('User').has('name','Eve').as('a').V().hasLabel('User').has('name','Lucy').addE('FRIEND').to('a').property('strength',1).
V().hasLabel('User').has('name','Alistair').as('a').V().hasLabel('User').has('name','Kate').addE('FRIEND').to('a').property('strength',2).
V().hasLabel('User').has('name','Gary').as('a').V().hasLabel('User').has('name','Colin').addE('FRIEND').to('a').property('strength',3).
V().hasLabel('User').has('name','Gordon').as('a').V().hasLabel('User').has('name','Emily').addE('FRIEND').to('a').property('strength',1).
V().hasLabel('User').has('name','Alistair').as('a').V().hasLabel('User').has('name','Emily').addE('FRIEND').to('a').property('strength',3).
V().hasLabel('User').has('name','Terry').as('a').V().hasLabel('User').has('name','Gordon').addE('FRIEND').to('a').property('strength',3).
V().hasLabel('User').has('name','Alistair').as('a').V().hasLabel('User').has('name','Terry').addE('FRIEND').to('a').property('strength',1).
V().hasLabel('User').has('name','Gary').as('a').V().hasLabel('User').has('name','Terry').addE('FRIEND').to('a').property('strength',2).
V().hasLabel('User').has('name','Mary').as('a').V().hasLabel('User').has('name','Terry').addE('FRIEND').to('a').property('strength',3).
V().hasLabel('User').has('name','Henry').as('a').V().hasLabel('User').has('name','Alistair').addE('FRIEND').to('a').property('strength',1).
V().hasLabel('User').has('name','Sue').as('a').V().hasLabel('User').has('name','Eve').addE('FRIEND').to('a').property('strength',2).
V().hasLabel('User').has('name','Sue').as('a').V().hasLabel('User').has('name','Charlie').addE('FRIEND').to('a').property('strength',3).
V().hasLabel('User').has('name','Josie').as('a').V().hasLabel('User').has('name','Charlie').addE('FRIEND').to('a').property('strength',1).
V().hasLabel('User').has('name','Henry').as('a').V().hasLabel('User').has('name','Charlie').addE('FRIEND').to('a').property('strength',2).
V().hasLabel('User').has('name','Henry').as('a').V().hasLabel('User').has('name','Mary').addE('FRIEND').to('a').property('strength',3).
V().hasLabel('User').has('name','Mary').as('a').V().hasLabel('User').has('name','Gary').addE('FRIEND').to('a').property('strength',1).
V().hasLabel('User').has('name','Henry').as('a').V().hasLabel('User').has('name','Gary').addE('FRIEND').to('a').property('strength',2).
V().hasLabel('User').has('name','Chloe').as('a').V().hasLabel('User').has('name','Gary').addE('FRIEND').to('a').property('strength',3).
V().hasLabel('User').has('name','Henry').as('a').V().hasLabel('User').has('name','Arnold').addE('FRIEND').to('a').property('strength',1).
next()


This is what the network looks like:
    
<img src="https://s3.amazonaws.com/aws-neptune-customer-samples/neptune-sagemaker/images/03-social-network.png"/>

### Create a Recommendation

Let's now create a PYMK recommendation for a specific user.

In the query below, we're finding the vertex that represents our user. We're then traversing `FRIEND` relationships (we don't care about relationship direction, so we're using `both()`) to find that user's immediate friends. We're then traversing another hop into the graph, looking for friends of those friends who _are not currently connected to our user_ (i.e., we're looking for the unclosed triangles).

We then count the paths to these candidate friends, and order the results based on the number of times we can reach a candidate via one of the user's immediate friends.

In [118]:
%%gremlin

g.V().hasLabel('User').has('name', 'Terry').as('user').  
  both('FRIEND').aggregate('friends').  
  both('FRIEND').
    where(P.neq('user')).where(P.without('friends')).  
  groupCount().by('name').  
  order(Scope.local).by(values, Order.decr).
  next()


### Using Friendship Strength to Improve Recommendations

What if we wanted to base our recommendations only on reasonably strong friendship bonds?

If you look at the Gremlin we used to create our graph, you'll see that each `FRIEND` edge has a `strength` property. In the following query, the traversal applies a predicate to this `strength` property. Note that we use `bothE()` rather than `both()` to position the traversal on an edge, where we then apply the predicate. We proceed only where `strength` is greater than one.
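To make the effect of the `strength` predicate concrete, here is a minimal plain-Python sketch of the same filtering. It is an illustration rather than how Neptune evaluates the traversal. The edge list is copied from the graph-creation cell but limited to the `FRIEND` edges that touch Terry or one of Terry's immediate friends, and the function name is hypothetical.

```python
from collections import Counter

# (person, person, strength) for the FRIEND edges that touch Terry or one of
# Terry's immediate friends. Direction is ignored, as with bothE().
edges = [
    ("Bill", "Sarah", 1), ("Bill", "Colin", 2), ("Bill", "Terry", 3),
    ("Kate", "Alistair", 2), ("Colin", "Gary", 3), ("Emily", "Gordon", 1),
    ("Emily", "Alistair", 3), ("Gordon", "Terry", 3), ("Terry", "Alistair", 1),
    ("Terry", "Gary", 2), ("Terry", "Mary", 3), ("Alistair", "Henry", 1),
    ("Mary", "Henry", 3), ("Gary", "Mary", 1), ("Gary", "Henry", 2),
    ("Gary", "Chloe", 3),
]


def strong_recommendations(edges, user, min_strength=1):
    """Count two-hop paths that only use edges with strength > min_strength."""
    # Keep only sufficiently strong edges; the same predicate applies to both hops
    adjacency = {}
    for a, b, strength in edges:
        if strength > min_strength:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    friends = adjacency.get(user, set())
    counts = Counter(
        candidate
        for friend in friends
        for candidate in adjacency.get(friend, set())
        if candidate != user and candidate not in friends
    )
    return dict(counts)


print(strong_recommendations(edges, "Terry"))
```

With this data, Colin and Henry are each reachable from Terry by two strong paths and Chloe by one. Emily, Sarah and Kate drop out because at least one edge on every path to them has a strength of 1.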

In [119]:
%%gremlin

g.V().hasLabel('User').has('name', 'Terry').as('user')
  .bothE('FRIEND')    
    .has('strength', P.gt(1)).otherV()
    .aggregate('friends')
  .bothE('FRIEND')
    .has('strength', P.gt(1)).otherV()
    .where(P.neq('user')).where(P.without('friends'))
  .groupCount().by('name')
  .order(Scope.local).by(values, Order.decr)
  .next()



### Number of users in the graph 

In [120]:
%%gremlin
g.V().groupCount().by(label).unfold()


### Number of relations among users

In [121]:
%%gremlin
g.E().groupCount().by(label).unfold()


### Explore Terry's friends

In [122]:
%%gremlin

g.V().hasLabel('User').has('name', 'Terry').both('FRIEND').groupCount().by('name')


## Setup for graph export to S3

In [123]:
s3_bucket_uri="s3://eu-west-2-my-bucket-7076/neptune-ml-social-network/"
# remove any trailing slashes
s3_bucket_uri = s3_bucket_uri.rstrip('/')
s3_bucket_uri

's3://eu-west-2-my-bucket-7076/neptune-ml-social-network'

In [124]:
HOME_DIRECTORY = '~'

import os
import json
import logging


def load_configuration():
    """Read the host, port and IAM flag from the graph notebook configuration."""
    with open(os.path.expanduser(f'{HOME_DIRECTORY}/graph_notebook_config.json')) as f:
        data = json.load(f)
    host = data['host']
    port = data['port']
    # IAM authentication is enabled only when auth_mode is exactly 'IAM'
    iam = data['auth_mode'] == 'IAM'
    return host, port, iam


def get_host():
    """Return only the Neptune cluster host name."""
    host, _, _ = load_configuration()
    return host

In [125]:
neptune_host = get_host()
neptune_host

'neptunedbcluster-gtdhydmqdzgt.cluster-c6g3m9gnkltx.eu-west-2.neptune.amazonaws.com'

In [126]:
from urllib.parse import urlparse

def get_export_service_host():
    """Look up the Neptune Export service host in ~/.bashrc, or return None."""
    with open(os.path.expanduser(f'{HOME_DIRECTORY}/.bashrc')) as f:
        data = f.readlines()
        print(data)
    for d in data:
        if d.startswith('export NEPTUNE_EXPORT_API_URI'):
            # Split on the first '=' only, so a value containing '=' stays whole
            parts = d.split('=', 1)
            if len(parts) == 2:
                path = urlparse(parts[1].rstrip())
                return path.hostname + "/v1"
    logging.error(
        "Unable to determine the Neptune Export Service Endpoint. You will need to enter this or assign it manually.")
    return None

In [127]:
get_export_service_host()

['# .bashrc\n', '\n', '# Source global definitions\n', 'if [ -f /etc/bashrc ]; then\n', '\t. /etc/bashrc\n', 'fi\n', '\n', '# User specific aliases and functions\n', '\n', '# >>> conda initialize >>>\n', "# !! Contents within this block are managed by 'conda init' !!\n", '__conda_setup="$(\'/home/ec2-user/anaconda3/bin/conda\' \'shell.bash\' \'hook\' 2> /dev/null)"\n', 'if [ $? -eq 0 ]; then\n', '    eval "$__conda_setup"\n', 'else\n', '    if [ -f "/home/ec2-user/anaconda3/etc/profile.d/conda.sh" ]; then\n', '        . "/home/ec2-user/anaconda3/etc/profile.d/conda.sh"\n', '    else\n', '        export PATH="/home/ec2-user/anaconda3/bin:$PATH"\n', '    fi\n', 'fi\n', 'unset __conda_setup\n', '# <<< conda initialize <<<\n', '\n', 'export GRAPH_NOTEBOOK_AUTH_MODE=DEFAULT\n', 'export GRAPH_NOTEBOOK_IAM_PROVIDER=ROLE\n', 'export GRAPH_NOTEBOOK_SSL=True\n', 'export GRAPH_NOTEBOOK_HOST=neptunedbcluster-gtdhydmqdzgt.cluster-c6g3m9gnkltx.eu-west-2.neptune.amazonaws.com\n', 'export GRAPH_NOTEBO

'uf2e9s2onb.execute-api.eu-west-2.amazonaws.com/v1'

In [128]:
export_params={ 
"command": "export-pg", 
"params": { "endpoint": neptune_host,
            "profile": "neptune_ml",
            "cloneCluster": False
            }, 
"outputS3Path": f'{s3_bucket_uri}/neptune-export',
"additionalParams": {
        "neptune_ml": {
          "version": "v2.0",
        "targets": [
            {
                "edge": ["User", "FRIEND", "User"],
                "type" : "link_prediction"
            }
         ],
         "features": [
            {
                "node": "User",
                "property": "birthdate",
                "type": "datetime"
            }
         ]
        }
      },
"jobSize": "small"}

In [129]:
%%neptune_ml export start --export-url {get_export_service_host()} --export-iam --wait --store-to export_results
${export_params}

Output()

## Data processing
The first step (data processing) processes the exported graph dataset using standard feature preprocessing techniques to prepare it for use by DGL. This step performs functions such as feature normalization for numeric data and encoding text features using word2vec. At the conclusion of this step the dataset is formatted for model training. 
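As a generic illustration of what feature normalization means, the sketch below applies min-max scaling to a few birth years taken from the sample data. It shows one common normalization technique under our own assumptions and is not a description of the exact transformation the data processing job applies; the function name is hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Birth years of Bill, Sarah and Lucy from the sample graph
birth_years = [1988, 1992, 1998]
print(min_max_normalize(birth_years))
```

Scaling features to a shared range keeps properties with large raw values from dominating those with small ones during training.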


In [130]:
# The processing_job_name can be set to a unique value below; by default it is generated from the current timestamp
import time 
processing_job_name=f'social-link-prediction-processing-{int(time.time())}'

processing_params = f"""
--config-file-name training-data-configuration.json
--job-id {processing_job_name} 
--s3-input-uri {export_results['outputS3Uri']} 
--s3-processed-uri {str(s3_bucket_uri)}/preloading """

In [131]:
%neptune_ml dataprocessing start --wait --store-to processing_results {processing_params}

Output()

## Model training
The second step (model training) trains the ML model that will be used for predictions. 

In [155]:
processing_job_name = 'social-link-prediction-processing-1630073857'

In [156]:
training_job_name=f'social-link-prediction-{int(time.time())}'

training_params=f"""
--job-id {training_job_name} 
--data-processing-id {processing_job_name} 
--instance-type ml.c5.xlarge
--s3-output-uri {str(s3_bucket_uri)}/training """

In [157]:
processing_job_name

'social-link-prediction-processing-1630073857'

In [158]:
%neptune_ml training start --wait --store-to training_results {training_params}

Output()

## Endpoint creation
The final step is to create the inference endpoint, an Amazon SageMaker endpoint instance launched with the model artifacts produced by the best training job. Our graph queries will use this endpoint to return the model predictions for the inputs in the request.

In [159]:
endpoint_params=f"""
--job-id {training_job_name} 
--model-job-id {training_job_name}"""

In [160]:
%neptune_ml endpoint create --wait --store-to endpoint_results {endpoint_params}

Output()

In [161]:
endpoint=endpoint_results['endpoint']['name']
endpoint

'social-l-2021-08-27-15-25-1140000-endpoint'

# Querying using Gremlin


In [182]:
%%gremlin
g.with("Neptune#ml.endpoint","${endpoint}").
    with("Neptune#ml.limit",3).
      V().hasLabel('User').has('name', 'Terry').
        out('FRIEND').with("Neptune#ml.prediction").hasLabel('User').groupCount().by('name')


In [188]:
%%gremlin
g.with("Neptune#ml.endpoint","${endpoint}").
    with("Neptune#ml.limit",5).
      V().hasLabel('User').has('name', 'Sarah').
        out('FRIEND').with("Neptune#ml.prediction").hasLabel('User').groupCount().by('name')

Tab(children=(Output(layout=Layout(overflow='scroll')),), _titles={'0': 'Error'})