# Loading Graphs

In addition to its NetworkX-compatible APIs, GraphScope provides a set of Python APIs
for loading, analyzing, and querying very large graphs.

GraphScope models graph data as [property graphs](https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model), in which vertices and edges are labeled and each label can carry many properties. This tutorial shows how GraphScope loads graphs, including:

- How to load a built-in dataset quickly;
- How to define the schema of a property graph;
- How to load graphs from various locations;
- How to serialize a graph to disk and deserialize it again.


## Prerequisite

First, we install the package if needed and import it.

In [None]:
# Install graphscope package if you are NOT in the Playground
!pip3 install graphscope

In [None]:
import graphscope
graphscope.set_option(show_log=True)

## Load Built-in Datasets

GraphScope comes with a set of popular datasets and utility functions to load them into memory,
which makes it easy for users to get started.
Here's an example:

In [None]:
from graphscope.dataset import load_ldbc
graph = load_ldbc()

In standalone mode, the data is downloaded automatically to `~/.graphscope/datasets` and kept there for future use.

## Building from scratch

However, it is more common for users to load their own data and analyze it.
To load a property graph into GraphScope, we provide the method ``g()``, defined in ``Session``.

First, we create an empty graph.


In [None]:
import graphscope
from graphscope.framework.loader import Loader

graph = graphscope.g()

The class ``Graph`` has several methods:

```python
    def add_vertices(self, vertices, label="_", properties=None, vid_field=0):
        pass

    def add_edges(self, edges, label="_e", properties=None, src_label=None, dst_label=None, src_field=0, dst_field=1):
        pass
```
These methods help users construct the schema of the property graph iteratively.

We will use the files in `ldbc_sample` throughout this tutorial. You can get them [here](https://github.com/GraphScope/gstest/tree/master/ldbc_sample). In this tutorial, they were already downloaded locally in the previous step.

You can inspect the graph schema at any time with ``print(graph.schema)``.


### Build Vertex

We can add a kind of vertices to the graph with ``add_vertices``, which takes the following parameters:

#### vertices

A loader for the data source, which can be a file location, a numpy array, a pandas DataFrame, etc.

A simple example:

In [None]:
graph = graphscope.g()
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/person_0_0.csv', delimiter='|'))

It reads data from `~/.graphscope/datasets/ldbc_sample/person_0_0.csv` and creates a vertex label with the default name ``_``. The first column is used as the ID and the other columns are used as properties; both the names and the data types of the properties are deduced.
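
To make these defaults concrete, here is a minimal stand-alone sketch in plain Python. It is not GraphScope's implementation: the sample rows, the `id` column name, and the `guess_type` helper are hypothetical, and the real loader's inference rules may differ. It only illustrates the idea of taking the first column as the ID and deducing a name and type for each remaining column.

```python
import csv
import io

# Hypothetical sample in the same pipe-delimited layout; the values are made up.
SAMPLE = """id|firstName|lastName|gender
1|Alice|Smith|female
2|Bob|Jones|male
"""

def guess_type(values):
    """Return 'int', 'float' or 'str' depending on what every value parses as."""
    for name, caster in (("int", int), ("float", float)):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "str"

rows = list(csv.reader(io.StringIO(SAMPLE), delimiter="|"))
header, data = rows[0], rows[1:]
columns = list(zip(*data))  # one tuple of values per column

# The first column plays the role of the vertex ID; the rest become properties.
id_column = header[0]
id_type = guess_type(columns[0])
property_types = {name: guess_type(col) for name, col in zip(header[1:], columns[1:])}
print(id_column, id_type, property_types)
```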

#### label

The label name of the vertex. Defaults to ``_``.

A graph cannot contain two labels with the same name, so users need to assign names when there are two or more vertex labels. It also helps to give every label a meaningful name. The name can be any valid identifier.

For example:

In [None]:
graph = graphscope.g()
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/person_0_0.csv', delimiter='|'), label='person')


#### properties

A list of properties. Optional; defaults to ``None``.

The names should be consistent with the header row of the source data file or the column names of the pandas DataFrame.

If ``None``, all columns except the ``vid_field`` column are treated as properties. If an empty list ``[]``, no properties are added. Otherwise, only the listed columns are loaded.
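
The three cases can be summarized in a few lines of plain Python. This is only a sketch of the rule as described here, not GraphScope code; the `select_properties` helper and its error on unknown column names are assumptions made for illustration.

```python
def select_properties(columns, properties=None, vid_field=0):
    """Return the column names that would be kept as properties."""
    # Resolve the ID column, given either as a position or as a name.
    vid_name = columns[vid_field] if isinstance(vid_field, int) else vid_field
    if properties is None:
        # Default: everything except the ID column.
        return [c for c in columns if c != vid_name]
    # An explicit list (possibly empty) is taken as-is, after a sanity check.
    missing = [p for p in properties if p not in columns]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    return list(properties)

columns = ["id", "firstName", "lastName", "gender"]
print(select_properties(columns))                  # None: all but the ID column
print(select_properties(columns, ["firstName"]))   # only the listed column
print(select_properties(columns, []))              # empty list: no properties
```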

For example:

In [None]:
# properties will be firstName,lastName,gender,birthday,creationDate,locationIP,browserUsed
graph = graphscope.g()
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/person_0_0.csv', delimiter='|'), label='person', properties=None)

# properties will be firstName, lastName
graph = graphscope.g()
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/person_0_0.csv', delimiter='|'), label='person', properties=['firstName', 'lastName'])

# no properties
graph = graphscope.g()
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/person_0_0.csv', delimiter='|'), label='person', properties=[])

#### vid_field

The column used as the vertex ID. Values in this column are matched against source and destination IDs when loading edges. Defaults to 0.

It can be a ``str``, naming the column, or an ``int``, giving the zero-based position of the column.

The default value, 0, selects the first column.
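
As a rough illustration of the two accepted forms, the hypothetical helper below maps either a column name or a zero-based position to the same column. The function name and its error handling are assumptions for this sketch, not part of GraphScope.

```python
def resolve_field(columns, field=0):
    """Map a column name or a zero-based position to a (position, name) pair."""
    if isinstance(field, int):
        if not 0 <= field < len(columns):
            raise IndexError(f"column index {field} out of range")
        return field, columns[field]
    if field not in columns:
        raise KeyError(f"no column named {field!r}")
    return columns.index(field), field

columns = ["id", "firstName", "lastName"]
print(resolve_field(columns))               # default: the first column
print(resolve_field(columns, "firstName"))  # the same lookup by name
```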

In [None]:
graph = graphscope.g()
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/person_0_0.csv', delimiter='|'), vid_field='firstName')

graph = graphscope.g()
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/person_0_0.csv', delimiter='|'), vid_field=0)

### Build Edge

Now we can add edges to the graph, which is a little more complicated than adding vertices.

#### edges

Similar to ``vertices`` in the ``Build Vertex`` section: a loader indicating where to read the data from.

Let's see an example:


In [None]:
graph = graphscope.g()
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/person_0_0.csv', delimiter='|'), label='person')
# Note we already added a vertex label named 'person'.
graph = graph.add_edges(Loader('~/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv', delimiter='|'), src_label='person', dst_label='person')

This loads edges with the label ``_e`` (the default value). Both the source and destination vertices have the label ``person``. The **first column** is used as the source vertex ID, the **second column** as the destination vertex ID, and the remaining columns as properties.
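
The column roles can be pictured with a small stand-alone sketch. The `split_edge_row` helper and the sample rows are hypothetical; the parameter names mirror the ``src_field`` and ``dst_field`` arguments of ``add_edges`` shown earlier.

```python
def split_edge_row(row, src_field=0, dst_field=1):
    """Split one parsed row into (source ID, destination ID, remaining values)."""
    rest = [v for i, v in enumerate(row) if i not in (src_field, dst_field)]
    return row[src_field], row[dst_field], rest

# Hypothetical rows: two person IDs followed by one property value.
rows = [["1", "2", "2010-01-01"], ["2", "3", "2011-05-17"]]
for row in rows:
    print(split_edge_row(row))
```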

#### label

The label name of the edge. Defaults to ``_e``. It's recommended to use a meaningful label name.


In [None]:
graph = graphscope.g()
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/person_0_0.csv', delimiter='|'), label='person')
graph = graph.add_edges(Loader('~/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv', delimiter='|'), label='knows', src_label='person', dst_label='person')

#### properties

A list of properties. Defaults to ``None``. The meaning and behavior are identical to those of the vertex ``properties`` parameter.

#### src_label and dst_label

The label names of the source vertex and the destination vertex. We have already seen both in the example above, where they were both set to ``person``. They can also differ, for example:


In [None]:
graph = graphscope.g()
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/person_0_0.csv', delimiter='|'), label='person')
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/comment_0_0.csv', delimiter='|'), label='comment')
# Note we already added the vertex labels 'person' and 'comment'.
graph = graph.add_edges(Loader('~/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv', delimiter='|'), label='likes', src_label='person', dst_label='comment')

#### src_field and dst_field

The columns used for the source vertex ID and the destination vertex ID. Default to 0 and 1, respectively.

Their values and behavior are similar to ``vid_field`` for vertices, except that two columns are needed, since an edge is defined by a source vertex ID and a destination vertex ID. Here's an example:


In [None]:
graph = graphscope.g()
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/person_0_0.csv', delimiter='|'), label='person')
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/comment_0_0.csv', delimiter='|'), label='comment')
graph = graph.add_edges(Loader('~/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv', delimiter='|'), label='likes', src_label='person', dst_label='comment', src_field='Person.id', dst_field='Comment.id')
# Or use the index.
# graph = graph.add_edges(Loader('~/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv', delimiter='|'), label='likes', src_label='person', dst_label='comment', src_field=0, dst_field=1)


## Advanced techniques

Here are some advanced techniques for dealing with very simple or very complex graphs.

### Deduce vertex labels when not ambiguous

If there is only one vertex label in the graph, the vertex labels of an edge can be omitted.
GraphScope will infer that the source and destination vertex labels are that single label.
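
The inference rule can be sketched as follows. This is an illustration only: the `infer_endpoint_labels` helper is hypothetical, and raising an error when several vertex labels exist and none is given is an assumption about the ambiguous case, not documented GraphScope behavior.

```python
def infer_endpoint_labels(vertex_labels, src_label=None, dst_label=None):
    """Fill in missing endpoint labels when exactly one vertex label exists."""
    labels = list(vertex_labels)

    def fill(label):
        if label is not None:
            return label
        if len(labels) == 1:
            return labels[0]
        raise ValueError("vertex label is ambiguous; pass it explicitly")

    return fill(src_label), fill(dst_label)

print(infer_endpoint_labels(["person"]))
```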


In [None]:
graph = graphscope.g()
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/person_0_0.csv', delimiter='|'), label='person')
# GraphScope will assign ``src_label`` and ``dst_label`` to ``person`` automatically.
graph = graph.add_edges(Loader('~/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv', delimiter='|'))


### Deduce vertex from edges

If ``add_edges`` is called with an unseen ``src_label`` or ``dst_label``, GraphScope will extract a vertex table from the endpoints of the edges.
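
Conceptually, extracting a vertex table means collecting the distinct IDs that appear as endpoints. The sketch below shows that idea in plain Python with made-up edge lists; it says nothing about how GraphScope performs the extraction internally.

```python
def endpoint_ids(edges):
    """Return the distinct source IDs and destination IDs of (src, dst) pairs."""
    src_ids = sorted({src for src, _ in edges})
    dst_ids = sorted({dst for _, dst in edges})
    return src_ids, dst_ids

# Hypothetical person -> comment edges: two labels, one table per endpoint side.
likes = [(1, 10), (1, 11), (2, 10)]
person_ids, comment_ids = endpoint_ids(likes)

# Hypothetical person -> person edges: both sides feed the same vertex table.
knows = [(1, 2), (2, 3)]
src_ids, dst_ids = endpoint_ids(knows)
all_person_ids = sorted(set(src_ids) | set(dst_ids))

print(person_ids, comment_ids, all_person_ids)
```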

In [None]:
graph = graphscope.g()
# Deduce vertex label `person` from the source and destination endpoints of edges.
graph = graph.add_edges(Loader('~/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv', delimiter='|'), src_label='person', dst_label='person')

graph = graphscope.g()
# Deduce the vertex label `person` from the source endpoint,
# and vertex label `comment` from the destination endpoint of edges.
graph = graph.add_edges(Loader('~/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv', delimiter='|'), label='likes', src_label='person', dst_label='comment')


### Multiple relations

In some cases, one edge label may connect more than one pair of vertex labels. For example, a
graph may contain two kinds of edges that are both labeled ``likes`` but represent two different relations,
i.e., ``person`` -> ``likes`` <- ``comment`` and ``person`` -> ``likes`` <- ``post``.

In this case, we can simply add the relation again with the same edge label,
but with different source and destination labels.


In [None]:
graph = graphscope.g()
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/person_0_0.csv', delimiter='|'), label='person')
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/comment_0_0.csv', delimiter='|'), label='comment')
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/post_0_0.csv', delimiter='|'), label='post')

graph = graph.add_edges(Loader('~/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv', delimiter='|'),
        label="likes",
        src_label="person", dst_label="comment",
    )

graph = graph.add_edges(Loader('~/.graphscope/datasets/ldbc_sample/person_likes_post_0_0.csv', delimiter='|'),
        label="likes",
        src_label="person", dst_label="post",
    )

Note:

   1. This feature (multiple relations sharing one edge label) is currently only available in `lazy` mode.
   2. When several relations are configured under the same label,
      their properties should match in number and type, and preferably
      in name as well, because all data for one label is stored in a single table
      and the property names are taken from the first configuration.
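
The second point can be checked up front with a small helper. The sketch below is hypothetical: `check_compatible`, the property names, and the type names are illustrative assumptions, and GraphScope may report mismatches differently.

```python
def check_compatible(first, other):
    """Compare two relations given as lists of (property name, type name) pairs.

    Returns the property names that the shared table would use.
    """
    if len(first) != len(other):
        raise ValueError("relations differ in the number of properties")
    for (name_a, type_a), (name_b, type_b) in zip(first, other):
        if type_a != type_b:
            raise ValueError(f"type mismatch: {name_a}:{type_a} vs {name_b}:{type_b}")
    # Names from the first configuration win.
    return [name for name, _ in first]

likes_comment = [("creationDate", "str")]  # hypothetical schema
likes_post = [("createdAt", "str")]        # same type, different name
print(check_compatible(likes_comment, likes_post))
```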


### Specify data types of properties manually

GraphScope deduces data types from the input files, and most of the time this works as expected.
However, users may sometimes want more control. For that purpose, a type can be given alongside the property name, like this:

In [None]:
graph = graphscope.g()
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/post_0_0.csv', delimiter='|'), label='post', properties=['content', ('length', 'int')])

This forces the property to be cast to the specified type. Note that the name and the type must be given together in one tuple. In this case, the property ``length`` will have type ``int`` rather than the default ``int64_t``. The most commonly used types are ``int``, ``int64``, ``float``, ``double``, and ``str``.
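
A mixed list of plain names and ``(name, type)`` tuples can be normalized into one uniform shape, as in the sketch below. The `normalize_properties` helper is hypothetical and only illustrates the input format; ``None`` stands for "leave the type to inference".

```python
def normalize_properties(properties):
    """Turn 'name' or ('name', 'type') entries into (name, type or None) pairs."""
    normalized = []
    for item in properties:
        if isinstance(item, str):
            normalized.append((item, None))  # type left to inference
        else:
            name, type_name = item           # must be a 2-tuple
            normalized.append((name, type_name))
    return normalized

print(normalize_properties(['content', ('length', 'int')]))
```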


### Other Parameters of Graph


The class ``Graph`` has three meta options:

- ``oid_type``: either ``int64_t`` or ``string``. Defaults to ``int64_t`` because it is faster and uses less memory. If the ID column cannot be represented as ``int64_t``, use ``string``.
- ``directed``: bool, defaults to ``True``. Controls whether a directed or an undirected graph is loaded.
- ``generate_eid``: bool, defaults to ``True``. Whether to automatically generate a unique ID for every edge.
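
One way to decide between the two ``oid_type`` values is to test whether every raw ID parses as a 64-bit signed integer. The helper below is a hypothetical pre-check written for this tutorial, not a GraphScope function.

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def choose_oid_type(raw_ids):
    """Return 'int64_t' if every ID string fits in a signed 64-bit integer, else 'string'."""
    for value in raw_ids:
        try:
            number = int(value)
        except (TypeError, ValueError):
            return "string"
        if not INT64_MIN <= number <= INT64_MAX:
            return "string"
    return "int64_t"

print(choose_oid_type(["1", "2", "3"]))
print(choose_oid_type(["user-1", "user-2"]))
```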


## Put It Together

Let's put the pieces together into a complete example.

In [None]:
graph = graphscope.g(oid_type='int64_t', directed=True, generate_eid=True)
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/person_0_0.csv', delimiter='|'), label='person')
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/comment_0_0.csv', delimiter='|'), label='comment')
graph = graph.add_vertices(Loader('~/.graphscope/datasets/ldbc_sample/post_0_0.csv', delimiter='|'), label='post')

graph = graph.add_edges(Loader('~/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv', delimiter='|'), label='knows', src_label='person', dst_label='person')
graph = graph.add_edges(Loader('~/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv', delimiter='|'), label='likes', src_label='person', dst_label='comment')
graph = graph.add_edges(Loader('~/.graphscope/datasets/ldbc_sample/person_likes_post_0_0.csv', delimiter='|'), label='likes', src_label='person', dst_label='post')

print(graph.schema)

A more complex example that loads the LDBC SNB graph can be found [here](https://github.com/alibaba/GraphScope/blob/main/python/graphscope/dataset/ldbc.py).

### Load From Pandas or Numpy

The data source mentioned above is a ``Loader`` object. A loader wraps
a location or the data itself. ``graphscope`` supports loading a graph
from pandas DataFrames or numpy ndarrays, which makes it easy to construct a graph right in the Python console.

Apart from the loader, the other fields, such as properties and label, are the same as in the examples above.


#### From Pandas

In [None]:
import numpy as np
import pandas as pd

leader_id = np.array([0, 0, 0, 1, 1, 3, 3, 6, 6, 6, 7, 7, 8])
member_id = np.array([2, 3, 4, 5, 6, 6, 8, 0, 2, 8, 8, 9, 9])
group_size = np.array([4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2])
e_data = np.transpose(np.vstack([leader_id, member_id, group_size]))
df_group = pd.DataFrame(e_data, columns=['leader_id', 'member_id', 'group_size'])

In [None]:
student_id = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
avg_score = np.array([490.33, 164.5 , 190.25, 762. , 434.2, 513. , 569. ,  25. , 308. ,  87. ])
v_data = np.transpose(np.vstack([student_id, avg_score]))
df_student = pd.DataFrame(v_data, columns=['student_id', 'avg_score']).astype({'student_id': np.int64})

In [None]:
# use a dataframe as datasource, properties omitted, col_0/col_1 will be used as src/dst by default.
# (for vertices, col_0 will be used as vertex_id by default)
graph = graphscope.g().add_vertices(df_student).add_edges(df_group)

#### From Numpy

Note that each array is one column; we pass the columns to the loader in a COO-like format.
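
To see what "each array is a column" means, the sketch below uses plain Python lists as stand-ins for the numpy arrays, with the same values as the `df_group` data above, and transposes them into one row per edge.

```python
# Column-wise data: one list per column, all of equal length.
leader_id = [0, 0, 0, 1, 1, 3, 3, 6, 6, 6, 7, 7, 8]
member_id = [2, 3, 4, 5, 6, 6, 8, 0, 2, 8, 8, 9, 9]
group_size = [4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2]
columns = [leader_id, member_id, group_size]

# Transposing the columns yields one (src, dst, property) row per edge.
edges = list(zip(*columns))
print(len(edges), edges[:3])
```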


In [None]:
array_group = [df_group[col].values for col in ['leader_id', 'member_id', 'group_size']]
array_student = [df_student[col].values for col in ['student_id', 'avg_score']]

graph = graphscope.g().add_vertices(array_student).add_edges(array_group)

### Loader Variants


When a loader wraps a location, it may contain only a string.
The string follows the URI standard. When asked to load a graph
from a location, ``graphscope`` parses the URI and invokes the corresponding loader
according to the scheme.

Currently, ``graphscope`` supports loaders for ``local``, ``s3``, ``oss``, and ``hdfs``.
Data is loaded by [v6d](https://github.com/v6d-io/v6d), which uses
[fsspec](https://github.com/intake/filesystem_spec) to resolve specific schemes and formats.
Any additional configuration can be passed as keyword arguments to ``Loader`` and is
forwarded directly to the corresponding storage class, for example ``host`` and ``port`` for ``HDFS``, or ``access-id`` and ``secret-access-key`` for ``oss`` or ``s3``.
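
How a URI splits into a scheme and a path can be seen with the standard library alone. The sketch below uses `urllib.parse.urlparse`; the `describe_location` helper is hypothetical, and treating a string without a scheme as a local `file` location is an assumption made for this illustration rather than documented behavior.

```python
from urllib.parse import urlparse

def describe_location(location):
    """Return (scheme, bucket or host, path) for a location string."""
    parsed = urlparse(location)
    scheme = parsed.scheme or "file"  # assumption: no scheme means a local file
    return scheme, parsed.netloc, parsed.path

for uri in [
    "file:///var/datafiles/group.e",
    "oss://graphscope_bucket/datafiles/group.e",
    "hdfs:///datafiles/group.e",
    "s3://datafiles/group.e",
]:
    print(describe_location(uri))
```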

```python
from graphscope.framework.loader import Loader

ds1 = Loader("file:///var/datafiles/group.e")
ds2 = Loader("oss://graphscope_bucket/datafiles/group.e", key='access-id', secret='secret-access-key', endpoint='oss-cn-hangzhou.aliyuncs.com')
ds3 = Loader("hdfs:///datafiles/group.e", host='localhost', port='9000', extra_conf={'conf1': 'value1'})
ds4 = Loader("s3://datafiles/group.e", key='access-id', secret='secret-access-key', client_kwargs={'region_name': 'us-east-1'})
```
Users can implement a customized driver to support additional data sources. Taking [ossfs](https://github.com/v6d-io/v6d/blob/main/modules/io/adaptors/ossfs.py) as an example, users need to subclass ``AbstractFileSystem``, which
resolves a specific protocol scheme, and ``AbstractBufferedFile``, which handles reading and writing.
The only methods that need to be overridden are ``_upload_chunk``,
``_initiate_upload``, and ``_fetch_range``. Finally, users need to call ``fsspec.register_implementation('protocol_name', 'protocol_file_system')`` to register the corresponding resolver.

## Serialization and Deserialization
When the graph is huge, loading it can take a long time (possibly hours).
GraphScope provides serialization and deserialization for graph data,
which dump the constructed graph to disk as binary data and load it back. These functions save a lot of time on subsequent loads.

### Serialization

`graph.save_to` takes a `path` argument that indicates where to store the binary data.

In [None]:
graph.save_to('/tmp/seri')

### Deserialization

`graph.load_from` is a `classmethod` whose signature resembles that of `graph.save_to`. Its `path` argument must be exactly the same as the `path` passed to `graph.save_to`, because it relies on naming to find the binary files. Note that during serialization each worker dumps its own data to files with its index as a suffix, so the number of workers used for deserialization must be **exactly the same** as the number used for serialization.

In addition, `graph.load_from` takes an extra `sess` parameter that specifies the session in which the graph is deserialized.
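
Why the worker counts must match can be illustrated with a toy model in which each worker writes one file named after its index. The `part_<index>` naming and the `dump` and `load` helpers are hypothetical; GraphScope's actual on-disk layout may differ.

```python
import os
import tempfile

def dump(parts, directory):
    """Write one file per worker, suffixed with the worker index."""
    for index, payload in enumerate(parts):
        with open(os.path.join(directory, f"part_{index}"), "wb") as f:
            f.write(payload)

def load(directory, num_workers):
    """Read the per-worker files back, refusing a mismatched worker count."""
    found = [name for name in os.listdir(directory) if name.startswith("part_")]
    if len(found) != num_workers:
        raise RuntimeError(f"saved by {len(found)} workers, loading with {num_workers}")
    parts = []
    for index in range(num_workers):
        with open(os.path.join(directory, f"part_{index}"), "rb") as f:
            parts.append(f.read())
    return parts

with tempfile.TemporaryDirectory() as directory:
    dump([b"fragment-0", b"fragment-1"], directory)
    restored = load(directory, num_workers=2)
    print(restored)
```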

In [None]:
from graphscope.framework.graph import Graph
sess = graphscope.session(cluster_type='hosts')
deserialized_graph = Graph.load_from('/tmp/seri', sess)

In [None]:
print(deserialized_graph.schema)