# Graph Entity Resolution with DGL on Amazon SageMaker

This notebook shows how to train an entity resolution model using graph neural networks. Entity resolution is the task of identifying and linking entities in a graph that belong to the same real-world entity. This is useful for use cases like user profiling, where users might access an online service via different temporary session IDs generated by different devices. Entity resolution allows us to consolidate all information about a particular user and deduplicate the user profile.

There are two main parts of this notebook.
* First, we process the raw dataset to prepare the features and construct the graph.
* Next, we create and launch a training job using SageMaker to train a graph neural network model with DGL.

In [None]:
!bash setup.sh

import sagemaker
from sagemaker_graph_entity_resolution import config

role = config.role
sess = sagemaker.Session()

## Data Preprocessing and Feature Engineering

### Upload raw data to S3

The dataset we use is the [DCA dataset](https://drive.google.com/drive/folders/0B7XZSACQf0KdNXVIUXEyVGlBZnc?usp=drive_open) released as part of the [2016 CIKM Cup competition](https://competitions.codalab.org/competitions/11171). The dataset contains anonymized browsing logs of users accessing various URLs. To ensure the demonstration runs quickly, and to match the typical format of user activity data that many companies have, we have done some initial preparation steps: we converted the data from JSON to a relational table format and sampled just a subset of the overall data. The data preparation scripts can be found in the `data-prep/` folder.

The prepared dataset consists of two files:

* `logs.csv`: Records user browsing activity. Each entry consists of a timestamp, the anonymized urls that the user visited, the anonymized title of the url page, and the anonymized transient user id. The column names for the dataset are `['ts', 'urls', 'titles', 'uid']`

* `train.csv`: Records ground truth links between pairs of transient user ids. Each entry has a pair of uids that are known to be the same real world user. The file has no headers.
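
To make the two file layouts concrete, here is a minimal sketch that parses a few rows with Python's standard `csv` module. The sample values are invented for illustration (the real values are anonymized tokens), and we assume that `logs.csv` carries a header row with the column names listed above; adjust the reader if your copy has no header.

```python
import csv
import io

# Invented sample rows; real values in the dataset are anonymized tokens.
# Assumption: logs.csv starts with a header row naming the four columns.
logs_text = "ts,urls,titles,uid\n1464769200,a/b/c?d,a,u1\n1464772800,a/e,a,u2\n"
# train.csv has no header: each row is a pair of uids for the same real user.
train_text = "u1,u2\n"

logs = list(csv.DictReader(io.StringIO(logs_text)))
pairs = [tuple(row) for row in csv.reader(io.StringIO(train_text))]

print(logs[0]["uid"], logs[0]["urls"])
print(pairs)
```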


Now, let's move the raw data to a convenient location in an S3 bucket in your account for this project. There it will be picked up by the preprocessing job and training job.

If you would like to use your own dataset for this demonstration, replace the `raw_data_location` in the cell below with the S3 path or local path of your dataset, and modify the data preprocessing step as needed.

In [None]:
# Replace with an S3 location or local path to point to your own dataset
raw_data_location = 's3://{}/{}/data'.format(config.solution_upstream_bucket, config.solution_name)

session_prefix = 'dgl-entity-resolution'
input_data = 's3://{}/{}/{}'.format(config.solution_bucket, session_prefix, config.s3_data_prefix)

!aws s3 cp --recursive $raw_data_location $input_data

# Set S3 locations to store processed data for training and post-training results and artifacts respectively
train_data = 's3://{}/{}/{}'.format(config.solution_bucket, session_prefix, config.s3_processing_output)
train_output = 's3://{}/{}/{}'.format(config.solution_bucket, session_prefix, config.s3_train_output)

### Run Preprocessing job with Amazon SageMaker Processing

The script we have defined at `data-preprocessing/data_preprocessing.py` performs data preprocessing and feature engineering transformations on the raw tabular data.

We convert the relational table to graph edgelists describing the relationships. For example, the columns `['uid', 'urls']` are converted to an edgelist for edge type `('user', 'visits', 'url')`, and the columns `['urls', 'titles']` are converted into an edgelist for edge type `('url', 'owned_by', 'domain')`.
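
As a rough illustration of this conversion, the sketch below turns two columns of a table into a list of `(source, destination)` pairs. The helper `build_edgelist` and the sample rows are hypothetical, and de-duplicating repeated pairs is an assumption on our part; the actual preprocessing script may handle repeats differently.

```python
def build_edgelist(rows, src_col, dst_col):
    """Return the de-duplicated (src, dst) pairs found in two table columns."""
    seen = set()
    edges = []
    for row in rows:
        edge = (row[src_col], row[dst_col])
        if edge not in seen:  # keep the first occurrence, preserve row order
            seen.add(edge)
            edges.append(edge)
    return edges


# Invented rows in the logs.csv column layout.
rows = [
    {"ts": "1", "urls": "a/b", "titles": "a", "uid": "u1"},
    {"ts": "2", "urls": "a/b", "titles": "a", "uid": "u1"},  # repeat visit
    {"ts": "3", "urls": "a/c", "titles": "a", "uid": "u2"},
]

# Edge type ('user', 'visits', 'url')
user_visits_url = build_edgelist(rows, "uid", "urls")
# Edge type ('url', 'owned_by', 'domain')
url_owned_by_domain = build_edgelist(rows, "urls", "titles")

print(user_visits_url)
print(url_owned_by_domain)
```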


We also perform feature engineering to generate features for each user and each URL.

* User features: We use the timestamps in the `ts` column to generate k-hot feature vectors that encode users' weekly browsing habits. For each user, we generate a 168 dimensional (7 days * 24 hours) vector.

* Url features: We use the anonymized tokens in the full url and title to generate features for the url. For example, if a url is `a/b/c?d` and the title is `a`, we have a bag of words of `['a', 'b', 'c', 'd']` for the url. We use a [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to convert the text features into numerical features and then perform [dimensionality reduction](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) to obtain a 20-dimensional feature vector.
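
The two feature ideas above can be sketched with the standard library alone. Both helpers below are hypothetical and not taken from the preprocessing script. We assume timestamps are Unix seconds interpreted in UTC, which may not match the anonymized `ts` values, and we split tokens on any non-word character. The TF-IDF and dimensionality reduction steps are not shown.

```python
import re
from datetime import datetime, timezone


def weekly_activity_vector(timestamps):
    """k-hot vector with one slot per (weekday, hour) pair: 7 * 24 = 168."""
    vec = [0] * (7 * 24)
    for ts in timestamps:
        # Assumption: ts is a Unix timestamp in seconds, read as UTC.
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        vec[dt.weekday() * 24 + dt.hour] = 1
    return vec


def url_tokens(url, title):
    """Unique tokens from the URL and title, split on non-word characters."""
    tokens = re.split(r"\W+", url) + re.split(r"\W+", title)
    return sorted({t for t in tokens if t})


# Two visits exactly one week apart fall into the same weekly slot.
vec = weekly_activity_vector([0, 7 * 86400])
print(len(vec), sum(vec))

print(url_tokens("a/b/c?d", "a"))  # ['a', 'b', 'c', 'd']
```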

To adapt the preprocessing script to work with your own data in the same format, you can modify the Python script `data-preprocessing/data_preprocessing.py` used in the cell below.

The python processing script also splits our ground-truth linked entities into a train and test/validation set. The default test-ratio is 0.3 but this can be modified.
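
A minimal sketch of such a split is shown below. The function `split_pairs` and its fixed random seed are illustrative assumptions; the processing script may shuffle or partition the pairs differently.

```python
import random


def split_pairs(pairs, test_ratio=0.3, seed=0):
    """Shuffle ground-truth uid pairs and split them into train and test sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Ten invented ground-truth pairs.
pairs = [("u{}".format(i), "v{}".format(i)) for i in range(10)]
train_pairs, test_pairs = split_pairs(pairs, test_ratio=0.3)

print(len(train_pairs), len(test_pairs))  # 7 3
```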

We use the built-in SKLearnProcessor provided by SageMaker since it already contains the Python dependencies (pandas, sklearn) that we need for preprocessing and feature engineering.

In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_count=1,
                                     instance_type='ml.m5.xlarge')

sklearn_processor.run(code='data-preprocessing/data_preprocessing.py',
                      arguments = ['--test-ratio', '0.3'],
                      inputs=[ProcessingInput(source=input_data,
                                              destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(destination=train_data,
                                                source='/opt/ml/processing/output')])

### View Results of Data Preprocessing
Once the preprocessing job is complete, we can take a look at the contents of the processing output folder in the S3 bucket to see the transformed data.

You should see the following files:

* `transient_edges.csv`: the set of edges between each transient uid and each url visited by that uid.
* `transient_nodes.csv`: the set of transient uid nodes along with the activity feature vector for the uids.
* `user_train_edges.csv`: the set of ground-truth same-entity pairs of transient uids that will be used during training.
* `user_test_edges.csv`: the set of ground-truth same-entity pairs that will be used to evaluate the trained model.
* `website_nodes.csv`: the set of url nodes along with the feature vectors for the urls.
* `website_group_edges.csv`: the set of edges between each url and its parent domain.

We add these files to our `params` dictionary because our downstream training job will use these file names to construct the identity graph during model training.

In [None]:
from os import path
from sagemaker.s3 import S3Downloader
processed_files = S3Downloader.list(train_data)
print("===== Processed Files =====")
print('\n'.join(processed_files))

params = {
    'train-edges': 'user_train_edges.csv',
    'test-edges': 'user_test_edges.csv',
    'transient-nodes': 'transient_nodes.csv',
    'transient-edges': 'transient_edges.csv',
    'website-nodes': 'website_nodes.csv',
    'website-group-edges': 'website_group_edges.csv'
}

print("Graph will be constructed using the following data:\n{}".format(params))

## Train Graph Neural Network with DGL

Graph Neural Networks (GNNs) work by learning numeric representations for nodes and edges informed by the graph structure. We can model the entity resolution problem as a link prediction problem: we have a few links between users that are the same entity, and we would like to use that information to predict new links/edges between users that are not linked in the graph but may correspond to linked entities.

In order to train a model that can achieve this, we need to specify two modelling assumptions: the `GNN Architecture` and the `Self-Supervision Task`.

* *GNN Architecture*: The GNN architecture is the GNN framework used to learn the node representations that are consumed by the downstream task model. Since we have nodes and edges of different types, we will be using a relational graph convolutional network (R-GCN) model. This architecture works well on heterogeneous graphs. This is also known as the `Graph Encoder`.

* *Self-Supervision Task*: As alluded to, the task that will be used to supervise the encoder is *link prediction*. Formally, we generate some negative edges by creating links between nodes sampled at random from the graph. The goal of the task is to learn a score function that gives higher scores to the real positive edges and lower scores to the negative edges. This is also known as the `Graph Decoder`.
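
The link prediction setup can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helpers are hypothetical, the dot-product score and binary cross-entropy loss are common choices that we have swapped in and may differ from what the training script uses, and the toy embeddings stand in for the output of the graph encoder.

```python
import math
import random


def sample_negative_edges(pos_edges, nodes, rate, rng):
    """For each positive edge, draw `rate` edges by replacing the destination
    with a random node, skipping any candidate that is a known positive.
    Assumes enough non-positive candidates exist for every source node."""
    positives = set(pos_edges)
    negatives = []
    for src, _ in pos_edges:
        drawn = 0
        while drawn < rate:
            dst = rng.choice(nodes)
            if dst != src and (src, dst) not in positives:
                negatives.append((src, dst))
                drawn += 1
    return negatives


def score(emb, edge):
    """Dot-product decoder: a higher score means a more likely link."""
    u, v = edge
    return sum(a * b for a, b in zip(emb[u], emb[v]))


def link_prediction_loss(emb, pos_edges, neg_edges):
    """Binary cross-entropy over edge scores (positives 1, negatives 0)."""
    def log_sigmoid(x):
        return -math.log1p(math.exp(-x))

    pos = sum(-log_sigmoid(score(emb, e)) for e in pos_edges)
    neg = sum(-log_sigmoid(-score(emb, e)) for e in neg_edges)
    return (pos + neg) / (len(pos_edges) + len(neg_edges))


rng = random.Random(0)
nodes = ["u1", "u2", "u3", "u4"]
pos_edges = [("u1", "u2")]
neg_edges = sample_negative_edges(pos_edges, nodes, rate=2, rng=rng)

# Toy embeddings standing in for the encoder output.
emb = {"u1": [1.0, 0.0], "u2": [1.0, 0.0], "u3": [-1.0, 0.0], "u4": [-1.0, 0.0]}
loss = link_prediction_loss(emb, pos_edges, neg_edges)
print(neg_edges, loss)
```

Training lowers this loss by pushing the scores of positive edges up and the scores of sampled negative edges down.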

### Hyperparameters
To train the graph neural network, we need to define a few hyperparameters that determine the GNN architecture, graph sampling parameters, optimizer, and optimization parameters.

Here we set only a few of the hyperparameters. To see all the hyperparameters and their default values, see `dgl-entity-resolution/estimator_fns.py`. The parameters set below are:


* `mini-batch`: Whether to perform mini-batch training, which trains the model on a batch of nodes at a time using the sampled graph neighbourhood of the mini-batch nodes.
* `batch-size`: The number of nodes in a mini-batch that are used to compute a single forward pass of the GNN.
* `num-gpus`: The number of GPUs to use during training. Use only when training with a GPU-enabled instance.
* `embedding-size`: In the inductive case, the number of dimensions of the node specific embedding that is concatenated with the node feature vector. For nodes that don't have features, the dimensionality is just `embedding-size`.
* `n-neighbors`: The number of neighbours to sample for each target node during graph sampling for mini-batch training
* `n-layers`: The number of GNN layers in the model
* `negative-sampling-rate`: How many `negative edges` to sample from the graph for each positive edge in the mini-batch. This is used to supervise the loss function so that it penalizes negative edges and distinguishes them from positive edges.
* `n-epochs`: The number of training epochs for the model training job. We set this to 3 epochs so that the job terminates quickly. In order to obtain better predictions, train the model for more epochs.
* `optimizer`: The optimization algorithm used for gradient based parameter updates
* `lr`: The learning rate for parameter updates

In [None]:
hyperparams = {
    'mini-batch': 'true',
    'batch-size': 1000,
    'num-gpus': 1,
    'embedding-size': 64,
    'n-neighbors': 100,
    'n-hidden': 16,
    'n-layers': 2,
    'negative-sampling-rate': 10,
    'n-epochs': 3,
    'optimizer': 'adam',
    'lr': 1e-2
}

params.update(**hyperparams)
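
SageMaker framework containers pass the entries of the hyperparameter dictionary to the training script as command-line arguments of the form `--key value`. The sketch below shows how a script could read a few of them with `argparse`. It is a hypothetical illustration and not the contents of `estimator_fns.py`; the defaults and the `str2bool` helper are assumptions.

```python
import argparse


def str2bool(value):
    """Parse boolean flags, which arrive on the command line as strings."""
    return str(value).lower() in ("true", "1", "yes")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # File names and hyperparameters share one dictionary, so both arrive here.
    parser.add_argument("--train-edges", type=str, default="user_train_edges.csv")
    parser.add_argument("--mini-batch", type=str2bool, default=True)
    parser.add_argument("--batch-size", type=int, default=1000)
    parser.add_argument("--n-epochs", type=int, default=3)
    parser.add_argument("--lr", type=float, default=1e-2)
    # Ignore hyperparameters that this sketch does not declare.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(
    ["--mini-batch", "true", "--batch-size", "1000", "--lr", "0.01", "--n-layers", "2"]
)
print(args.mini_batch, args.batch_size, args.lr, args.train_edges)
```

Note that `'mini-batch': 'true'` is a string in the dictionary, which is why the sketch converts it explicitly instead of relying on `type=bool`.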

### Create and Fit SageMaker Estimator

With the hyperparameters defined, we can kick off the training job. We will be using the Deep Graph Library (DGL), with PyTorch as the backend deep learning framework, to define and train the graph neural network. Amazon SageMaker makes it easy to do this with the Framework estimators, which have the deep learning frameworks already set up. Here, we create a SageMaker PyTorch estimator and pass in our model training script, hyperparameters, as well as the number and type of training instances we want.

We can then fit the estimator on the training data location in S3.

In [None]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='train_dgl_pytorch_entity_resolution.py',
                   source_dir='sagemaker_graph_entity_resolution/dgl_entity_resolution',
                   role=role, 
                   train_instance_count=1,
                   train_instance_type='ml.g4dn.xlarge',
                   framework_version="1.5.0",
                   py_version='py3',
                   hyperparameters=params,
                   output_path=train_output,
                   code_location=train_output,
                   sagemaker_session=sagemaker.Session())

estimator.fit({'train': train_data})

Once training is complete, the training instances are shut down and SageMaker stores the trained model and the newly predicted links in the output location in S3.