<a href="https://colab.research.google.com/github/cyyeh/citation-networks/blob/master/tutorials/Node_representation_learning_on_large_graphs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Node representation learning on large graphs

In this tutorial, we will learn how to train unsupervised node embeddings in GraphVite. Node embeddings are critical to analyzing large graphs such as social networks and citation networks, especially when labeled data are unavailable or sparse.

We will begin with the training and evaluation steps for node embeddings. Then we will show how to customize datasets and hyperparameters.

---

All the code here can be run on Google Colab directly, and results will be displayed in our browser. To run this tutorial,

1. At the top-right of the menu bar, choose connect to hosted runtime.
2. In the menu, choose Runtime -> Run all.

Since Colab provides only 2 CPU threads and a modest GPU, the code may take some time to run.

Download and install Miniconda and GraphVite. This may take a while.

In [0]:
!wget -c https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!./Miniconda3-latest-Linux-x86_64.sh -b -p /usr/local -f

!conda install -y -c milagraph -c conda-forge graphvite \
  python=3.6 cudatoolkit=$(nvcc -V | grep -Po "(?<=V)\d+\.\d+")
!conda install -y wurlitzer ipykernel

import site
site.addsitedir("/usr/local/lib/python3.6/site-packages")
%reload_ext wurlitzer

## Python Interface

First, import the GraphVite package.

In [0]:
import graphvite as gv
import graphvite.application as gap

Here we use a social network dataset, BlogCatalog, for demonstration. Each node in this graph is a blog user, and each edge represents a following relationship between two users.

We specify the dimension of embedding vectors to be 128.

In [0]:
app = gap.GraphApplication(dim=128)
app.load(file_name=gv.dataset.blogcatalog.train)
app.build()
app.train()

Once we have learned the embeddings, we can evaluate them on a wide range of downstream tasks.

Since some users in BlogCatalog are labeled with their interests, we check whether our embeddings can be used to predict those labels. We use 20% of the labeled data to train a linear classifier, and test on the remaining data.

The evaluation will report both macro-F1 and micro-F1 scores. Larger scores are better.

In [0]:
app.node_classification(file_name=gv.dataset.blogcatalog.label, portions=(0.2,))
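
To make the difference between the two reported scores concrete, here is a minimal pure-Python sketch. It is not GraphVite's implementation: the helper `f1_scores` is hypothetical, and it assumes one label per node for simplicity, whereas BlogCatalog users may carry several labels. Macro-F1 averages the per-class F1 scores, so rare classes count as much as frequent ones; micro-F1 pools all decisions before computing a single F1.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(n_tp, n_fp, n_fn):
        denom = 2 * n_tp + n_fp + n_fn
        return 2 * n_tp / denom if denom else 0.0

    # Macro: average the per-class F1 scores.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy data: class "a" is frequent and easy, class "b" is rare and missed.
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "a", "c", "c"]
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
print("macro-F1: %.3f, micro-F1: %.3f" % (macro_f1, micro_f1))
```

On this toy data the missed rare class pulls macro-F1 well below micro-F1, which is why it helps to look at both numbers.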

In some cases, we may not have node labels for our graphs. As an alternative, we can hold out a small set of edges from the training set and evaluate on the link prediction task.

To ensure that our model has not seen any test data, we should remove all training edges from the validation and test data.

The performance is quantified by AUC, the area under the ROC curve. A larger AUC is better.

In [0]:
app.link_prediction(file_name=gv.dataset.blogcatalog.valid,
                    filter_file=gv.dataset.blogcatalog.train)
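
The two ideas above, filtering out training edges and scoring by AUC, can be sketched in plain Python. This is an illustration rather than GraphVite's code: `filter_edges` and `roc_auc` are hypothetical helpers, the graph is assumed to be undirected, the scores are made up, and AUC is taken in its ranking-based (ROC) sense, i.e. the fraction of positive/negative pairs in which the positive edge receives the higher score.

```python
def filter_edges(eval_edges, train_edges):
    """Drop evaluation edges that also occur in the training set.

    Assumes an undirected graph, so (u, v) and (v, u) are the same edge.
    """
    seen = set()
    for u, v in train_edges:
        seen.add((u, v))
        seen.add((v, u))
    return [(u, v, label) for u, v, label in eval_edges if (u, v) not in seen]


def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


train_edges = [("1", "2")]
eval_edges = [("1", "2", 1), ("1", "3", 1), ("2", "4", 1),
              ("3", "4", 0), ("1", "4", 0)]
kept = filter_edges(eval_edges, train_edges)

# Made-up model scores for the remaining edges (higher = more likely an edge).
toy_scores = {("1", "3"): 0.9, ("2", "4"): 0.4, ("3", "4"): 0.5, ("1", "4"): 0.1}
scores = [toy_scores[(u, v)] for u, v, _ in kept]
labels = [label for _, _, label in kept]
auc = roc_auc(scores, labels)
print("kept %d of %d edges, AUC = %.2f" % (len(kept), len(eval_edges), auc))
```

The quadratic pair loop keeps the definition visible; for large evaluation sets a sort-based computation would be the usual choice.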

For our own graphs, we can customize a [`Dataset`](https://graphvite.io/docs/latest/api/dataset.html#graphvite.dataset.Dataset) and leverage the standard preprocess routines in GraphVite.

For example, the following code defines a dataset that contains train, valid and test splits. The positive edges are divided among the three splits in the ratio 2:1:1.

In [0]:
class MyDataset(gv.dataset.Dataset):

  def __init__(self):
    super(MyDataset, self).__init__(
        "my_dataset",
        urls={"train": [], "valid": [], "test": []}
    )

  def train_preprocess(self, train_file):
    # derive the valid and test paths from the train path
    valid_file = train_file[:train_file.rfind("train.txt")] + "valid.txt"
    test_file = train_file[:train_file.rfind("train.txt")] + "test.txt"
    # replace gv.dataset.blogcatalog.graph with the path to our own edge list
    self.link_prediction_split(gv.dataset.blogcatalog.graph,
                               [train_file, valid_file, test_file],
                               portions=[2, 1, 1])

my_dataset = MyDataset()

with open(my_dataset.train, "r") as fin:
  print("my_dataset.train: #positive: %d" % len(fin.readlines()))
count = [0] * 2
with open(my_dataset.valid, "r") as fin:
  for line in fin:
    count[int(line.strip()[-1])] += 1
  print("my_dataset.valid: #negative: %d, #positive: %d" % tuple(count))
count = [0] * 2
with open(my_dataset.test, "r") as fin:
  for line in fin:
    count[int(line.strip()[-1])] += 1
  print("my_dataset.test: #negative: %d, #positive: %d" % tuple(count))
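
To see what a 2:1:1 ratio means for the positive edges, the following sketch splits a toy edge list by portions. It is not how `link_prediction_split` is implemented: `split_by_portions` is a hypothetical helper, and it covers only the positive edges, while the counting code above indicates that the valid and test files also contain negative edges.

```python
import random


def split_by_portions(edges, portions, seed=0):
    """Shuffle edges and cut them into consecutive chunks sized by `portions`."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    total = sum(portions)
    splits, start, accumulated = [], 0, 0
    for portion in portions:
        accumulated += portion
        # Cumulative cut points guarantee that every edge lands in exactly one split.
        end = len(edges) * accumulated // total
        splits.append(edges[start:end])
        start = end
    return splits


# Toy graph: a path with 100 edges.
edges = [(i, i + 1) for i in range(100)]
train, valid, test = split_by_portions(edges, [2, 1, 1])
print(len(train), len(valid), len(test))  # 50 25 25
```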

The learned embeddings can be accessed as follows.

In [0]:
name2id = app.graph.name2id
vertex_embeddings = app.solver.vertex_embeddings
context_embeddings = app.solver.context_embeddings

# Get the vertex embedding of node "1"
print(vertex_embeddings[name2id["1"]])
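
A common next step is to look up the nodes closest to a given node in embedding space. The sketch below uses small stand-in values in place of `name2id` and `vertex_embeddings` so that it runs without GraphVite; the helper `most_similar` and the choice of cosine similarity are assumptions for illustration, not part of the GraphVite API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(name, name2id, embeddings, k=2):
    """Return the names of the k nodes most similar to `name`."""
    query = embeddings[name2id[name]]
    scored = [(cosine(query, embeddings[index]), other)
              for other, index in name2id.items() if other != name]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


# Stand-ins for app.graph.name2id and app.solver.vertex_embeddings.
toy_name2id = {"1": 0, "2": 1, "3": 2, "4": 3}
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbors_of_1 = most_similar("1", toy_name2id, toy_embeddings)
print(neighbors_of_1)
```

With real embeddings, the same function should work on the arrays above, since it only relies on indexing rows and iterating over their values.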

Alternatively, we can save the learned embeddings to a file.

In [0]:
app.save_model("line_blogcatalog.pkl")

## Configuration File

GraphVite supports a configuration file interface, which can simplify our training procedure.

We can create a configuration scaffold with the following commands.

In [0]:
!graphvite new graph --file my_config.yaml
# node (word) embedding for natural language corpus
!graphvite new word graph --file word_graph.yaml

In the left pane, we can find `my_config.yaml` in the *Files* tab. Since Colab's file editor is limited, we can edit the configuration in the following cell and then run the cell to write it to disk.

In [0]:
%%writefile my_config.yaml
application:
  graph

resource:
  # List of GPU ids. Default is all GPUs
  gpus: []
  # Memory limit for each GPU in bytes. Default is all available memory.
  gpu_memory_limit: auto
  # Number of CPU threads per GPU. Default is all CPUs.
  cpu_per_gpu: auto
  # Dimension of the embeddings.
  dim: 128

format:
  # String of delimiter characters. Change it if your node names contain blank characters.
  delimiters: " \t\r\n"
  # Prefix of comment strings. Change it if you use a comment style other than Python's.
  comment: "#"

graph:
  # Path to edge list file. Each line should be one of the following
  # [node 1] [delimiter] [node 2] [comment]...
  # [node 1] [delimiter] [node 2] [delimiter] [weight] [comment]...
  # [comment]...
  # For standard datasets, you can specify them by <[dataset].[split]>.
  file_name:
  # Symmetrize the graph or not. True is recommended.
  as_undirected: true
  # Normalize the adjacency matrix or not. This may influence the performance a little.
  normalization: false

build:
  optimizer:
    # Optimizer.
    type: SGD
    # Learning rate. Default is usually reasonable.
    lr: 0.025
    # Weight decay.
    weight_decay: 0.005
    # Learning rate schedule, can be "linear" or "constant". Linear is recommended.
    schedule: linear
  # Number of partitions. Auto is recommended.
  num_partition: auto
  # Number of negative samples per positive sample.
  # Larger value results in slower training.
  # The performance may be influenced by num_negative * negative_weight.
  num_negative: 1
  # Batch size of samples in CPU-GPU transfer. Default is recommended.
  batch_size: 100000
  # Number of batches in a partition block.
  # Default is recommended.
  episode_size: auto

# Comment out this section if not needed.
load:
  # Path to model file, can be "*.pkl".
  file_name: graph.pkl

train:
  # Model, can be DeepWalk, LINE or node2vec.
  model: DeepWalk
  # Number of epochs. Default is usually reasonable for sparse graphs.
  # For dense graphs (|E| / |V| > 100), you may use smaller values.
  num_epoch: 2000
  # Resume training from a loaded model.
  resume: false
  # Weight of negative samples. Values larger than 10 may cause unstable training.
  negative_weight: 5
  # Exponent of degrees in negative sampling. Default is recommended.
  negative_sample_exponent: 0.75
  # Augmentation step. Default is usually reasonable.
  # Larger value is needed for sparser graphs.
  augmentation_step: auto
  # Return parameter and in-out parameters (node2vec). Need to be tuned on the validation set.
  p: 1
  q: 1
  # Length of each random walk. Default is recommended.
  random_walk_length: 40
  # Batch size of random walks in samplers. Default is recommended.
  random_walk_batch_size: 100
  # Log every n batches.
  log_frequency: 1000

# Comment out this section if not needed.
evaluate:
  # Comment out any task if not needed.
  - task: node classification
    # Path to node label file. Each line should be one of the following
    # [node] [delimiter] [label] [comment]...
    # [comment]...
    file_name:
    # Portions of data used for training. Each of them corresponds to one evaluation.
    portions: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    # Number of trials repeated. Change it to 1 if your evaluation set is large enough.
    times: 5

  - task: link prediction
    # Path to link prediction file. Each line should be
    # [node 1] [delimiter] [node 2] [delimiter] [label]
    # where label is 1 for positive and 0 for negative.
    file_name:
    # Path to filter file. If you aren't sure that training data is excluded in evaluation,
    # you can specify the training edge list here.
    filter_file:

# Comment out this section if not needed.
save:
  # Path to save file, can be "*.pkl".
  file_name: graph.pkl
  # Save hyperparameters or not.
  save_hyperparameter: false

To run our configuration file, use

In [0]:
# !graphvite run my_config.yaml --cpu 2 --gpu 1
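
The `negative_sample_exponent` option in the configuration above is described as the exponent of degrees in negative sampling. A reasonable reading, and an assumption here, is that a node is drawn as a negative sample with probability proportional to its degree raised to that exponent. The hypothetical helper below shows how an exponent of 0.75 flattens the distribution compared with sampling proportionally to degree.

```python
def negative_sampling_probs(degrees, exponent=0.75):
    """Map each node to a sampling probability proportional to degree ** exponent."""
    weights = {node: degree ** exponent for node, degree in degrees.items()}
    total = sum(weights.values())
    return {node: weight / total for node, weight in weights.items()}


# Toy degrees chosen so that degree ** 0.75 gives round numbers (1, 8, 27).
degrees = {"a": 1, "b": 16, "c": 81}
smoothed = negative_sampling_probs(degrees, exponent=0.75)
proportional = negative_sampling_probs(degrees, exponent=1.0)

for node in degrees:
    print("%s: exponent 0.75 -> %.3f, exponent 1.0 -> %.3f"
          % (node, smoothed[node], proportional[node]))
```

The low-degree node "a" is sampled more often under the smaller exponent, and the hub "c" less often.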

We can also check out existing benchmarks in GraphVite. The full list of benchmarks can be found [here](https://graphvite.io/docs/latest/benchmark.html#node-embedding).

Here we train LINE on the YouTube dataset, which is a million-scale social network. To save time, we skip the evaluation step.

In [0]:
!graphvite baseline line youtube --epoch 200 --no-eval

## Hyperparameters

The default hyperparameters of node embeddings are usually robust. If we want to get the best performance, we may tune the following hyperparameters, listed in order of decreasing importance.

1. `augmentation_step`. It determines the length of random walk augmentation. Generally, we should use a larger value for sparse graphs. Common values are 1, 2, 5 and 10.

2. `num_negative` and `negative_weight`. They control the negative sampling strategy. Common values are 1, 5, 10 for `num_negative` and 1, 2, 5 for `negative_weight`.

3. For node2vec, `p` and `q` control the biased random walk. Common values are 0.25, 1 and 4.
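
To give some intuition for items 1 and 3, the sketch below illustrates both mechanisms on a toy graph. Both helpers are hypothetical and are not GraphVite's code. `augment_walk` assumes that random walk augmentation treats node pairs at most `augmentation_step` apart on a walk as positive samples. `node2vec_weights` follows the usual node2vec rule: the unnormalized weight of the next step is 1/p for returning to the previous node, 1 for a node adjacent to the previous node, and 1/q otherwise.

```python
def augment_walk(walk, augmentation_step):
    """Return (source, target) pairs at most `augmentation_step` apart on the walk."""
    pairs = []
    for i, source in enumerate(walk):
        for target in walk[i + 1:i + 1 + augmentation_step]:
            pairs.append((source, target))
    return pairs


def node2vec_weights(prev, curr, neighbors, p, q):
    """Unnormalized weights for stepping from `curr`, having arrived from `prev`."""
    weights = {}
    for nxt in neighbors[curr]:
        if nxt == prev:
            weights[nxt] = 1.0 / p  # return to the previous node
        elif nxt in neighbors[prev]:
            weights[nxt] = 1.0      # stay close to the previous node
        else:
            weights[nxt] = 1.0 / q  # move further away
    return weights


walk = ["a", "b", "c", "d"]
pairs_step1 = augment_walk(walk, 1)
pairs_step2 = augment_walk(walk, 2)
print(len(pairs_step1), len(pairs_step2))  # 3 5

# Toy undirected graph as adjacency sets.
neighbors = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}
weights = node2vec_weights("a", "b", neighbors, p=0.25, q=4)
print(weights)
```

A larger `augmentation_step` produces more positive pairs from the same walk, which is consistent with the advice to raise it for sparse graphs. With `p=0.25` and `q=4`, the walk strongly prefers returning to "a" over moving outward to "d".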