# ArangoDB cuGraph Adapter Getting Started Guide  

<a href="https://colab.research.google.com/github/arangoml/cugraph-adapter/blob/master/examples/ArangoDB_cuGraph_Adapter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![arangodb](https://github.com/arangoml/cugraph-adapter/blob/master/examples/assets/logos/ArangoDB_logo.png?raw=1)
<a href="https://github.com/rapidsai/cugraph" rel="github.com/rapidsai/cugraph"><img src="https://github.com/arangoml/cugraph-adapter/blob/master/examples/assets/logos/rapids_logo.png?raw=1" width=30% height=30%></a>

Export graphs from [ArangoDB](https://www.arangodb.com/), a multi-model graph database, to [cuGraph](https://github.com/rapidsai/cugraph), a collection of GPU-accelerated graph algorithms.

# Environment Sanity Check



This notebook requires a Tesla T4, P4, or P100 GPU.
1. Open the <u>Runtime</u> dropdown
2. Click on <u>Change Runtime Type</u>
3. Set <u>Hardware accelerator</u> to GPU
4. Re-connect to runtime

Check the output of `!nvidia-smi -L` to make sure you've been allocated a Tesla T4, P4, or P100. If not, use the _Disconnect and delete runtime_ option and repeat the steps above until you are (unfortunately, this is the only option).

In [None]:
!nvidia-smi -L # T4, P4, or P100 is required

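If you prefer to check the GPU model programmatically, the following is a minimal sketch. It assumes each device line printed by `nvidia-smi -L` has the form `GPU 0: Tesla T4 (UUID: GPU-...)`; the helper name `supported_gpus` and the sample string are illustrative and not part of the notebook.

```python
import re
from typing import List

# GPU models this notebook lists as supported.
SUPPORTED_PATTERN = re.compile(r"\b(T4|P4|P100)\b")


def supported_gpus(nvidia_smi_output: str) -> List[str]:
    """Return the names of supported GPUs found in `nvidia-smi -L` output."""
    found = []
    for line in nvidia_smi_output.splitlines():
        # Assumed line format: "GPU <index>: <name> (UUID: <uuid>)"
        match = re.match(r"GPU \d+: (.+?) \(UUID", line.strip())
        if match and SUPPORTED_PATTERN.search(match.group(1)):
            found.append(match.group(1))
    return found


# Hypothetical sample output with a placeholder UUID.
sample = "GPU 0: Tesla T4 (UUID: GPU-00000000-0000-0000-0000-000000000000)"
print(supported_gpus(sample))  # ['Tesla T4']
```

In a live session, the input string could come from `subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout`. An empty result means the runtime should be reallocated as described above.
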
# Setup
Est Time: 5 minutes

In [None]:
# This gets the RAPIDS-Colab install files and checks your GPU. Run this and the next cell only.
# Please read the output of this cell. If your Colab instance is not RAPIDS-compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

In [None]:
# Install the cugraph-adapter & adb-cloud-connector
!pip install git+https://github.com/arangoml/cugraph-adapter.git@housekeeping
!pip install adb-cloud-connector
!git clone -b housekeeping --single-branch https://github.com/arangoml/cugraph-adapter.git

In [None]:
# All imports
import cudf
import cugraph

from adbcug_adapter import ADBCUG_Adapter, ADBCUG_Controller
from adbcug_adapter.typings import CUGId, Json

from arango import ArangoClient
from adb_cloud_connector import get_temp_credentials

import json
import logging
import io, requests
from typing import List, Optional, Any

# Understanding cuGraph & cuDF

(referenced from [docs.rapids.ai](https://docs.rapids.ai/))

RAPIDS cuGraph is a library of graph algorithms that seamlessly integrates into the RAPIDS data science ecosystem and allows the data scientist to easily call graph algorithms using data stored in GPU DataFrames, NetworkX Graphs, or even CuPy or SciPy sparse Matrices.


Here is an example of creating a simple weighted graph:

In [None]:
cug_graph = cugraph.Graph()

df = cudf.DataFrame(
  [('a', 'b', 5), ('a', 'c', 1), ('a', 'd', 4), ('b', 'c', 3), ('c', 'd', 2)],
  columns=['src', 'dst', 'weight']
)

cug_graph.from_cudf_edgelist(
    df,
    source='src',
    destination='dst',
    edge_attr='weight'
)

print('\n--------------------')
print(cug_graph.nodes())
print('\n--------------------')
print(cug_graph.edges())

RAPIDS cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data. It provides a pandas-like API that will be familiar to data engineers & data scientists, so they can use it to easily accelerate their workflows without going into the details of CUDA programming.

For example, the following snippet downloads a CSV, then uses the GPU to parse it into rows and columns and run calculations:

In [None]:
# Load a dataset into a GPU memory resident DataFrame and perform a basic calculation.
# Everything from CSV parsing to calculating tip percentage and computing a grouped average is done on the GPU.

# download CSV file from GitHub
url = "https://github.com/plotly/datasets/raw/master/tips.csv"
content = requests.get(url).content.decode('utf-8')

tips_df = cudf.read_csv(io.StringIO(content))
tips_df['tip_percentage'] = tips_df['tip'] / tips_df['total_bill'] * 100

# display average tip by dining party size
print(tips_df.groupby('size').tip_percentage.mean())

The following snippet loads data into a cuGraph graph and computes PageRank:

In [None]:
from cugraph.experimental.datasets import karate

# Load Karate Graph
G = karate.get_graph(fetch=True)

# Let's now get the PageRank score of each vertex by calling cugraph.pagerank
df_page = cugraph.pagerank(G)

# Let's look at the top 10 PageRank scores
df_page.sort_values('pagerank', ascending=False).head(10)

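For intuition about what a PageRank score represents, here is a minimal CPU-only sketch of power-iteration PageRank in plain Python. It is not how cuGraph implements the algorithm; the function name, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for illustration. It reuses the small `a`-`d` edge list from the weighted-graph example and ignores the weights.

```python
from typing import Dict, List, Tuple


def pagerank(
    edges: List[Tuple[str, str]], damping: float = 0.85, iterations: int = 100
) -> Dict[str, float]:
    """Power-iteration PageRank on a directed edge list (illustrative only)."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links: Dict[str, List[str]] = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                # Each node shares its rank equally among its out-neighbours.
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Treat each undirected edge as two directed edges.
undirected = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("c", "d")]
directed = undirected + [(dst, src) for src, dst in undirected]

scores = pagerank(directed)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

Nodes `a` and `c` each have three neighbours and end up with the highest scores, while `b` and `d` have two neighbours each and score lower. The scores sum to 1.
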
# Create a Temporary ArangoDB Cloud Instance

In [None]:
# Request temporary instance from the managed ArangoDB Cloud Service.
con = get_temp_credentials()
print(json.dumps(con, indent=2))

# Connect to the instance via the python-arango driver
db = ArangoClient(hosts=con["url"]).db(con["dbName"], con["username"], con["password"], verify=True)

Feel free to use the above URL to check out the UI!

# Import Sample Data

For demo purposes, we will be using the [ArangoDB Fraud Detection example graph](https://colab.research.google.com/github/joerg84/Graph_Powered_ML_Workshop/blob/master/Fraud_Detection.ipynb), and the [ArangoDB IMDB Dataset](https://github.com/arangodb/example-datasets/tree/master/Graphs/IMDB).

In [None]:
# Workaround: force a UTF-8 preferred encoding so the shell commands below run in Colab
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!chmod -R 755 cugraph-adapter/
!./cugraph-adapter/tests/assets/arangorestore -c none --server.endpoint http+ssl://{con["hostname"]}:{con["port"]} --server.username {con["username"]} --server.database {con["dbName"]} --server.password {con["password"]} --replication-factor 3  --input-directory "cugraph-adapter/examples/data/fraud_dump" --include-system-collections true
!./cugraph-adapter/tests/assets/arangorestore -c none --server.endpoint http+ssl://{con["hostname"]}:{con["port"]} --server.username {con["username"]} --server.database {con["dbName"]} --server.password {con["password"]} --replication-factor 3  --input-directory "cugraph-adapter/examples/data/imdb_dump" --include-system-collections true

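The two restore commands above differ only in their input directory. As a sketch, the same arguments can be built as a Python list and passed to `subprocess.run`, which avoids shell quoting problems if a password contains special characters. The helper name `restore_args` and the connection values in `sample_con` are placeholders for illustration; the flags are the ones used in the commands above.

```python
from typing import Any, Dict, List


def restore_args(con: Dict[str, Any], dump_dir: str) -> List[str]:
    """Build the arangorestore argument list for one dump directory."""
    return [
        "./cugraph-adapter/tests/assets/arangorestore",
        "-c", "none",
        "--server.endpoint", f"http+ssl://{con['hostname']}:{con['port']}",
        "--server.username", str(con["username"]),
        "--server.database", str(con["dbName"]),
        "--server.password", str(con["password"]),
        "--replication-factor", "3",
        "--input-directory", dump_dir,
        "--include-system-collections", "true",
    ]


# Placeholder credentials; in the notebook, use the `con` dict returned by
# get_temp_credentials() instead.
sample_con = {
    "hostname": "example.invalid",
    "port": 8529,
    "username": "user",
    "dbName": "db",
    "password": "secret",
}

for dump in ("fraud_dump", "imdb_dump"):
    args = restore_args(sample_con, f"cugraph-adapter/examples/data/{dump}")
    print(" ".join(args))
    # To actually run the restore: subprocess.run(args, check=True)
```
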
# Instantiate the Adapter

Connect the ArangoDB-cuGraph Adapter to our database client:

In [None]:
adbcug = ADBCUG_Adapter(db)

# <u>ArangoDB to cuGraph</u>



#### Via ArangoDB Graph Name

Data source
* ArangoDB Fraud-Detection Graph

Package methods used
* [`adbcug_adapter.adapter.arangodb_graph_to_cugraph()`](https://github.com/arangoml/cugraph-adapter/blob/master/adbcug_adapter/adapter.py)

Important notes
* The graph `name` must point to an existing ArangoDB graph
* cuGraph does not support node or edge attributes (apart from edge weight)
* If an ArangoDB edge has an attribute named `weight`, its value will be transferred over to the cuGraph graph. Otherwise, the cuGraph edge weight will default to `0`.

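The weight rule in the notes above can be pictured with plain Python dictionaries. This is only a sketch of the described behaviour, not the adapter's code: the function name `to_weighted_edges` and the sample edge documents are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def to_weighted_edges(adb_edges: List[Dict[str, Any]]) -> List[Tuple[str, str, Any]]:
    """Turn ArangoDB-style edge documents into (source, destination, weight) rows."""
    # Edges without a "weight" attribute fall back to 0, as described in the notes.
    return [(edge["_from"], edge["_to"], edge.get("weight", 0)) for edge in adb_edges]


# Hypothetical edge documents; every attribute other than "weight" is dropped.
sample_edges = [
    {"_from": "account/1", "_to": "account/2", "weight": 2.5, "note": "ignored"},
    {"_from": "account/2", "_to": "account/3"},
]
print(to_weighted_edges(sample_edges))
# [('account/1', 'account/2', 2.5), ('account/2', 'account/3', 0)]
```
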
In [None]:
# Define graph name
graph_name = "fraud-detection"

# Create cuGraph graph from ArangoDB graph name
cug_graph = adbcug.arangodb_graph_to_cugraph(graph_name)

# You can also provide valid Python-Arango AQL query options to the command above, like so:
# cug_graph = adbcug.arangodb_graph_to_cugraph(graph_name, ttl=1000, stream=True)
# See the full parameter list at https://docs.python-arango.com/en/main/specs.html#arango.aql.AQL.execute

# Show graph data
print('\n--------------------')
print(cug_graph.nodes())
print('\n--------------------')
print(cug_graph.edges())

#### Via ArangoDB Collection Names

Data source
* ArangoDB Fraud-Detection Collections

Package methods used
* [`adbcug_adapter.adapter.arangodb_collections_to_cugraph()`](https://github.com/arangoml/cugraph-adapter/blob/master/adbcug_adapter/adapter.py)

Important notes
* The `vertex_collections` & `edge_collections` parameters must point to existing ArangoDB collections within your ArangoDB instance.
* cuGraph does not support node or edge attributes (apart from edge weight)
* If an ArangoDB edge has an attribute named `weight`, its value will be transferred over to the cuGraph graph. Otherwise, the cuGraph edge weight will default to `0`.

In [None]:
# Define collections
vertex_collections = {"account", "bank", "branch", "Class", "customer"}
edge_collections = {"accountHolder", "Relationship", "transaction"}

# Create cuGraph graph from ArangoDB collections
cug_graph = adbcug.arangodb_collections_to_cugraph("fraud-detection", vertex_collections, edge_collections)

# You can also provide valid Python-Arango AQL query options to the command above, like so:
# cug_graph = adbcug.arangodb_collections_to_cugraph("fraud-detection", vertex_collections, edge_collections, ttl=1000, stream=True)
# See the full parameter list at https://docs.python-arango.com/en/main/specs.html#arango.aql.AQL.execute

# Show graph data
print('\n--------------------')
print(cug_graph.nodes())
print('\n--------------------')
print(cug_graph.edges())

#### Via ArangoDB Graph Name with a custom ADBCUG_Controller & verbose logging

Data source
* ArangoDB Fraud-Detection Graph

Package methods used
* [`adbcug_adapter.adapter.arangodb_graph_to_cugraph()`](https://github.com/arangoml/cugraph-adapter/blob/master/adbcug_adapter/adapter.py)
* [`adbcug_adapter.controller._prepare_arangodb_vertex()`](https://github.com/arangoml/cugraph-adapter/blob/master/adbcug_adapter/controller.py)

Important notes
* We are creating a custom `ADBCUG_Controller` to specify *how* to convert our ArangoDB vertex IDs into cuGraph node IDs. View the default `ADBCUG_Controller` [here](https://github.com/arangoml/cugraph-adapter/blob/master/adbcug_adapter/controller.py).
* Using a custom ADBCUG Controller for `ArangoDB --> cuGraph` is optional. However, a custom ADBCUG Controller for `cuGraph --> ArangoDB` is almost always needed, with the exception of homogeneous graphs and graphs whose node IDs already follow the ArangoDB vertex ID format (i.e. `collection/_key`).

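The `collection/_key` format mentioned above can be checked with a few lines of plain Python. This is a hedged sketch: `split_vertex_id` is a hypothetical helper, not a function of the adapter, and the sample IDs are made up.

```python
from typing import Tuple


def split_vertex_id(vertex_id: str) -> Tuple[str, str]:
    """Split an ArangoDB-style vertex ID into (collection, key)."""
    collection, separator, key = vertex_id.partition("/")
    if not separator or not collection or not key:
        raise ValueError(f"Not in 'collection/_key' format: {vertex_id!r}")
    return collection, key


print(split_vertex_id("account/10000001"))  # ('account', '10000001')

# An ID without a collection prefix does not follow the format.
try:
    split_vertex_id("10000001")
except ValueError as error:
    print(error)
```
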
In [None]:
# Define graph name
graph_name = "fraud-detection"

class Custom_ADBCUG_Controller(ADBCUG_Controller):
    """ArangoDB-cuGraph controller.

    Responsible for controlling how nodes & edges are handled when
    transitioning from ArangoDB to cuGraph.

    You can derive your own custom ADBCUG_Controller.
    """

    def _prepare_arangodb_vertex(self, adb_vertex: Json, col: str) -> None:
        """Prepare an ArangoDB vertex before it gets inserted into the cuGraph
        graph.

        Given an ArangoDB vertex, you can modify it before it gets inserted
        into the cuGraph graph, and/or derive a custom node id for cuGraph
        to use by updating the "_id" attribute of the vertex (otherwise the
        vertex's current "_id" value will be used)

        :param adb_vertex: The ArangoDB vertex object to (optionally) modify.
        :type adb_vertex: adbcug_adapter.typings.Json
        :param col: The ArangoDB collection the vertex belongs to.
        :type col: str
        """
        # Custom behaviour: Add a "new_" prefix to every vertex ID
        adb_vertex["_id"] = "new_" + adb_vertex["_id"]

# Instantiate a new adapter with the custom controller
custom_adbcug_adapter = ADBCUG_Adapter(db, controller=Custom_ADBCUG_Controller())

# You can also change the adapter's logging level for access to
# silent, regular, or verbose logging (logging.WARNING, logging.INFO, logging.DEBUG)
custom_adbcug_adapter.set_logging(logging.DEBUG) # verbose logging

# Create a cuGraph graph from an ArangoDB graph using the custom adapter
cug_graph = custom_adbcug_adapter.arangodb_graph_to_cugraph("fraud-detection")

# Show graph data
print('\n--------------------')
print(cug_graph.nodes())
print('\n--------------------')
print(cug_graph.edges())

# <u>cuGraph to ArangoDB</u>

#### Karate Graph

Data source
* [cuGraph 22.06 Datasets](https://github.com/rapidsai/cugraph/blob/branch-22.06/datasets/karate.csv)

Package methods used
* [`adbcug_adapter.adapter.cugraph_to_arangodb()`](https://github.com/arangoml/cugraph-adapter/blob/master/adbcug_adapter/adapter.py)

Important notes
* A custom `ADBCUG Controller` is **not** required here, because the karate graph has only 1 vertex collection (`karateka`) and 1 edge collection (`knows`). See the edge definitions below.

In [None]:
# Fetch Karate Club data
!wget https://raw.githubusercontent.com/rapidsai/cugraph/branch-22.06/datasets/karate.csv

In [None]:
dataframe = cudf.read_csv("karate.csv", delimiter=' ', names=['src', 'dst'], dtype=['int32', 'int32'])

# Create the cuGraph graph
cug_graph = cugraph.Graph()
cug_graph.from_cudf_edgelist(dataframe, source='src', destination='dst')

# Specify ArangoDB edge definitions
edge_definitions = [
    {
        "edge_collection": "knows",
        "from_vertex_collections": ["karateka"],
        "to_vertex_collections": ["karateka"],
    }
]

# Create ArangoDB graph from cuGraph
name = "KarateClubGraph"
db.delete_graph(name, drop_collections=True, ignore_missing=True)
adb_graph = adbcug.cugraph_to_arangodb(name, cug_graph, edge_definitions)

# You can also provide valid Python-Arango Import Bulk options to the command above, like so:
# adb_graph = adbcug.cugraph_to_arangodb(name, cug_graph, edge_definitions, batch_size=5, on_duplicate="replace")
# See the full parameter list at https://docs.python-arango.com/en/main/specs.html#arango.collection.Collection.import_bulk

print('\n--------------------')
print("URL: " + con["url"])
print("Username: " + con["username"])
print("Password: " + con["password"])
print("Database: " + con["dbName"])
print('--------------------\n')
print(f"View the created graph here: {con['url']}/_db/{con['dbName']}/_admin/aardvark/index.html#graph/{name}")

#### Divisibility Graph

Data source
* No source

Package methods used
* [`adbcug_adapter.adapter.cugraph_to_arangodb()`](https://github.com/arangoml/cugraph-adapter/blob/master/adbcug_adapter/adapter.py)

Important notes
* Even though this graph has more than 1 vertex collection, a custom `ADBCUG Controller` is still **not** required here. The cuGraph node IDs already follow the ArangoDB format, so the default ADBCUG Controller takes care of node identification (see [`_identify_cugraph_node()`](https://github.com/arangoml/cugraph-adapter/blob/master/adbcug_adapter/controller.py)).

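To illustrate why neither the single-collection graph nor this one needs a custom controller, here is a sketch of the node identification idea in plain Python. It is an assumption-labelled illustration, not the adapter's implementation: `vertex_collections` and `identify_node` are hypothetical helpers.

```python
from typing import Any, Dict, List


def vertex_collections(edge_definitions: List[Dict[str, Any]]) -> List[str]:
    """Collect every vertex collection named in the edge definitions."""
    cols = set()
    for definition in edge_definitions:
        cols.update(definition["from_vertex_collections"])
        cols.update(definition["to_vertex_collections"])
    return sorted(cols)


def identify_node(node_id: Any, adb_v_cols: List[str]) -> str:
    """Pick the vertex collection a node belongs to."""
    if len(adb_v_cols) == 1:
        # Homogeneous graph: every node goes to the only collection.
        return adb_v_cols[0]
    collection, separator, _ = str(node_id).partition("/")
    if separator and collection in adb_v_cols:
        # The node ID already follows the "collection/_key" format.
        return collection
    raise ValueError(f"Cannot identify a collection for node {node_id!r}")


one_collection = [
    {"edge_collection": "knows",
     "from_vertex_collections": ["karateka"],
     "to_vertex_collections": ["karateka"]},
]
two_collections = [
    {"edge_collection": "is_divisible_by",
     "from_vertex_collections": ["numbers_j"],
     "to_vertex_collections": ["numbers_i"]},
]

print(identify_node(1, vertex_collections(one_collection)))  # karateka
print(identify_node("numbers_j/12", vertex_collections(two_collections)))  # numbers_j
```
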
In [None]:
# Create the cuGraph graph
cug_graph = cugraph.MultiGraph(directed=True)
cug_graph.from_cudf_edgelist(
    cudf.DataFrame(
        [
            (f"numbers_j/{j}", f"numbers_i/{i}", j / i)
            for i in range(1, 101)
            for j in range(1, 101)
            if j % i == 0
        ],
        columns=["src", "dst", "weight"],
    ),
    source="src",
    destination="dst",
    edge_attr="weight"
)

# Specify ArangoDB edge definitions
edge_definitions = [
    {
        "edge_collection": "is_divisible_by",
        "from_vertex_collections": ["numbers_j"],
        "to_vertex_collections": ["numbers_i"],
    }
]

# Create ArangoDB graph from cuGraph
name = "DivisibilityGraph"
db.delete_graph(name, drop_collections=True, ignore_missing=True)
adb_graph = adbcug.cugraph_to_arangodb(name, cug_graph, edge_definitions, keyify_nodes=True)


print('\n--------------------')
print("URL: " + con["url"])
print("Username: " + con["username"])
print("Password: " + con["password"])
print("Database: " + con["dbName"])
print('--------------------\n')
print(f"View the created graph here: {con['url']}/_db/{con['dbName']}/_admin/aardvark/index.html#graph/{name}")

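To inspect the divisibility edge list without a GPU, the same comprehension can be run in plain Python. This sketch only rebuilds the edge tuples; it does not create a cuGraph graph.

```python
# Same comprehension as in the cell above, evaluated on the CPU.
edges = [
    (f"numbers_j/{j}", f"numbers_i/{i}", j / i)
    for i in range(1, 101)
    for j in range(1, 101)
    if j % i == 0
]

print(len(edges))  # 482

# Every divisor of 12 appears as a destination of "numbers_j/12".
divisors_of_12 = [dst for src, dst, _ in edges if src == "numbers_j/12"]
print(divisors_of_12)
# ['numbers_i/1', 'numbers_i/2', 'numbers_i/3', 'numbers_i/4', 'numbers_i/6', 'numbers_i/12']
```

Each edge weight is the quotient `j / i`, so the edge from `numbers_j/12` to `numbers_i/4` carries the weight `3.0`.
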
#### School Graph with a custom ADBCUG_Controller

Data source
* No source, the graph data is arbitrary

Package methods used
* [`adbcug_adapter.adapter.cugraph_to_arangodb()`](https://github.com/arangoml/cugraph-adapter/blob/master/adbcug_adapter/adapter.py)

Important notes
* Here we demonstrate the functionality of having a custom `ADBCUG_Controller`, that overrides the [default ADBCUG_Controller](https://github.com/arangoml/cugraph-adapter/blob/master/adbcug_adapter/controller.py).
* Recall that a custom ADBCUG Controller for `cuGraph --> ArangoDB` is almost always needed, with the exception of homogeneous graphs and graphs whose node IDs already follow the ArangoDB vertex ID format (i.e. `collection/_key`).

In [None]:
# Load some arbitrary data
df = cudf.DataFrame(
  [
   ('student:101', 'lecture:101'),
   ('student:102', 'lecture:102'),
   ('student:103', 'lecture:103'),
   ('student:103', 'student:101'),
   ('student:103', 'student:102'),
   ('teacher:101', 'lecture:101'),
   ('teacher:102', 'lecture:102'),
   ('teacher:103', 'lecture:103'),
   ('teacher:101', 'teacher:102'),
   ('teacher:102', 'teacher:103')
  ],
  columns=['src', 'dst']
)

# Create the cuGraph graph
cug_graph = cugraph.MultiGraph(directed=True)
cug_graph.from_cudf_edgelist(df, source='src', destination='dst')

# Specify ArangoDB edge definitions
edge_definitions = [
    {
        "edge_collection": "attends",
        "from_vertex_collections": ["student"],
        "to_vertex_collections": ["lecture"],
    },
    {
        "edge_collection": "classmate",
        "from_vertex_collections": ["student"],
        "to_vertex_collections": ["student"],
    },
    {
        "edge_collection": "teaches",
        "from_vertex_collections": ["teacher"],
        "to_vertex_collections": ["lecture"],
    },
    {
        "edge_collection": "colleague",
        "from_vertex_collections": ["teacher"],
        "to_vertex_collections": ["teacher"],
    }
]


# Given that our graph is heterogeneous and has a non-ArangoDB way of
# formatting its node IDs, we must derive a custom ADBCUG Controller
# to handle this behavior.
class Custom_ADBCUG_Controller(ADBCUG_Controller):
  """ArangoDB-cuGraph controller.

  Responsible for controlling how nodes & edges are handled when
  transitioning between ArangoDB and cuGraph.

  You can derive your own custom ADBCUG_Controller.
  """

  def _identify_cugraph_node(self, cug_node_id: CUGId, adb_v_cols: List[str]) -> str:
    """Given a cuGraph node, and a list of ArangoDB vertex collections defined,
    identify which ArangoDB vertex collection it should belong to.

    NOTE: You must override this function if len(**adb_v_cols**) > 1
    OR **cug_node_id** does NOT comply with ArangoDB standards
    (i.e. "{collection}/{key}").

    :param cug_node_id: The cuGraph ID of the vertex.
    :type cug_node_id: adbcug_adapter.typings.CUGId
    :param adb_v_cols: All ArangoDB vertex collections specified
        by the **edge_definitions** parameter of cugraph_to_arangodb()
    :type adb_v_cols: List[str]
    :return: The ArangoDB collection name
    :rtype: str
    """
    return str(cug_node_id).split(":")[0] # Identify node based on ':' split

  def _identify_cugraph_edge(
      self,
      from_cug_node: Json,
      to_cug_node: Json,
      adb_e_cols: List[str]
  ) -> str:
    """Given a pair of connected cuGraph nodes, and a list of ArangoDB
    edge collections defined, identify which ArangoDB edge collection it
    should belong to.

    NOTE: You must override this function if len(**adb_e_cols**) > 1.

    NOTE #2: The pair of associated cuGraph nodes can be accessed
    by the **from_cug_node** & **to_cug_node** parameters, and are guaranteed
    to have the following attributes: `{"cug_id", "adb_id", "adb_col", "adb_key"}`

    :param from_cug_node: The cuGraph node representing the edge source.
    :type from_cug_node: adbcug_adapter.typings.Json
    :param to_cug_node: The cuGraph node representing the edge destination.
    :type to_cug_node: adbcug_adapter.typings.Json
    :param adb_e_cols: All ArangoDB edge collections specified
        by the **edge_definitions** parameter of
        ADBCUG_Adapter.cugraph_to_arangodb()
    :type adb_e_cols: List[str]
    :return: The ArangoDB collection name
    :rtype: str
    """
    from_col = from_cug_node["adb_col"] # From node collection
    to_col = to_cug_node["adb_col"] # To node collection

    if (from_col, to_col) == ("student", "lecture"):
      return "attends"
    elif (from_col, to_col) == ("student", "student"):
      return "classmate"
    elif (from_col, to_col) == ("teacher", "lecture"):
      return "teaches"
    elif (from_col, to_col) == ("teacher", "teacher"):
      return "colleague"
    else:
      raise ValueError(f"Unknown edge relationship between {from_cug_node} and {to_cug_node}")

  def _keyify_cugraph_node(self, cug_node_id: CUGId, col: str) -> str:
    """Given a cuGraph node, derive its valid ArangoDB key.

    NOTE: You can override this function if you want to create custom ArangoDB _key
    values from your cuGraph nodes. To enable the use of this method, enable the
    **keyify_nodes** parameter in ADBCUG_Adapter.cugraph_to_arangodb().

    :param cug_node_id: The cuGraph node id.
    :type cug_node_id: adbcug_adapter.typings.CUGId
    :param col: The ArangoDB collection the vertex belongs to.
    :type col: str
    :return: A valid ArangoDB _key value.
    :rtype: str
    """
    return str(cug_node_id).split(":")[1] # Keyify node based on ':' split


# Instantiate the adapter
custom_adbcug_adapter = ADBCUG_Adapter(db, Custom_ADBCUG_Controller())
custom_adbcug_adapter.set_logging(logging.DEBUG) # Update logging to verbose

# Create the ArangoDB graph
name = "SchoolGraph"
db.delete_graph(name, drop_collections=True, ignore_missing=True)
adb_g = custom_adbcug_adapter.cugraph_to_arangodb(name, cug_graph, edge_definitions, keyify_nodes=True)

print('\n--------------------')
print("URL: " + con["url"])
print("Username: " + con["username"])
print("Password: " + con["password"])
print("Database: " + con["dbName"])
print('--------------------\n')
print(f"View the created graph here: {con['url']}/_db/{con['dbName']}/_admin/aardvark/index.html#graph/{name}")

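The `if`/`elif` chain in `_identify_cugraph_edge` repeats information that the edge definitions already contain. As a sketch of an alternative, the lookup can be derived from the definitions themselves and tested in plain Python before wiring it into a controller. The helper names below are hypothetical and the code does not touch the adapter.

```python
from typing import Any, Dict, List, Tuple

edge_definitions = [
    {"edge_collection": "attends", "from_vertex_collections": ["student"], "to_vertex_collections": ["lecture"]},
    {"edge_collection": "classmate", "from_vertex_collections": ["student"], "to_vertex_collections": ["student"]},
    {"edge_collection": "teaches", "from_vertex_collections": ["teacher"], "to_vertex_collections": ["lecture"]},
    {"edge_collection": "colleague", "from_vertex_collections": ["teacher"], "to_vertex_collections": ["teacher"]},
]


def build_edge_lookup(definitions: List[Dict[str, Any]]) -> Dict[Tuple[str, str], str]:
    """Map each (from collection, to collection) pair to its edge collection."""
    lookup = {}
    for definition in definitions:
        for from_col in definition["from_vertex_collections"]:
            for to_col in definition["to_vertex_collections"]:
                lookup[(from_col, to_col)] = definition["edge_collection"]
    return lookup


def identify_node(node_id: Any) -> str:
    """Collection name: the part before ':' (e.g. 'student:101' -> 'student')."""
    return str(node_id).split(":")[0]


def keyify_node(node_id: Any) -> str:
    """ArangoDB key: the part after ':' (e.g. 'student:101' -> '101')."""
    return str(node_id).split(":")[1]


def identify_edge(from_id: Any, to_id: Any, lookup: Dict[Tuple[str, str], str]) -> str:
    """Find the edge collection for a pair of connected node IDs."""
    pair = (identify_node(from_id), identify_node(to_id))
    if pair not in lookup:
        raise ValueError(f"Unknown edge relationship between {from_id} and {to_id}")
    return lookup[pair]


lookup = build_edge_lookup(edge_definitions)
print(identify_edge("student:103", "student:101", lookup))  # classmate
print(identify_edge("teacher:101", "lecture:101", lookup))  # teaches
print(keyify_node("teacher:102"))  # 102
```

This approach only works when each pair of vertex collections maps to a single edge collection, as it does here. If two edge collections connected the same pair, the lookup would be ambiguous and explicit logic would still be needed.
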