This is a toolset for bulk loading data from raw files into the GraphScope persistent storage service. Currently the tool supports a specific format of the raw data as described in "Data Format", and the original data must be located in HDFS. To load the data files into GraphScope storage, users run the data-loading tool from a terminal on a client machine. We assume the client has access to a Hadoop cluster that can run MapReduce jobs, has read/write access to the HDFS, and can connect to a running GraphScope storage service.
- Java build environment (Maven 3.5+ / JDK 1.8), if you need to build the tools from source code
- Hadoop cluster (version 2.x) that can run MapReduce jobs and provides HDFS
- Running GIE with the persistent storage service (the graph schema should be properly defined)
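As a quick sanity check before proceeding, you can verify the build and Hadoop environment from a terminal. This is a minimal sketch; the exact versions reported by your installation may differ:

```bash
# Verify the Java build environment (only needed when building from source)
mvn -version        # expect Maven 3.5+
java -version       # expect JDK 1.8

# Verify the Hadoop client and HDFS access
hadoop version
hdfs dfs -ls /      # should list the HDFS root without errors
```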
If you have the distribution package `maxgraph.tar.gz`, decompress it. You can then find the MapReduce job jar `data_load_tools-0.0.1-SNAPSHOT.jar` under `maxgraph/lib/` and the executable `load_tool.sh` under `maxgraph/bin/`.
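For example, assuming the package sits in the current directory, the extraction and file locations look like this:

```bash
# Decompress the distribution package
tar xzf maxgraph.tar.gz

# Locate the MapReduce job jar and the wrapper script
ls maxgraph/lib/data_load_tools-0.0.1-SNAPSHOT.jar
ls maxgraph/bin/load_tool.sh
```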
If you want to build from source code, just run `mvn clean package -DskipTests`. You can find the compiled jar `data_load_tools-0.0.1-SNAPSHOT.jar` in the `target/` directory. The `load_tool.sh` script is just a wrapper around the `java` command, so using `data_load_tools-0.0.1-SNAPSHOT.jar` alone is sufficient for the following demonstration.
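A typical build sequence, assuming the source tree of the data load tools is checked out locally, might look like:

```bash
# Build the data load tools from source (tests skipped)
mvn clean package -DskipTests

# The compiled job jar is placed under target/
ls target/data_load_tools-0.0.1-SNAPSHOT.jar
```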
The data loading tools assume the original data files are located in HDFS.
Each file should represent either a vertex type or a relationship of an edge type. Below are sample data of a vertex type `person` and a relationship `person-knows->person` of edge type `knows` (a sketch for uploading such files to HDFS follows the samples):
- `person.csv`
id|name
1000|Alice
1001|Bob
...
- `person_knows_person.csv`
person_id|person_id_1|date
1000|1001|20210611151923
...
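Since the tools read their input from HDFS, the sample files above need to be uploaded first. A minimal sketch, assuming the input directory `/tmp/ldbc_sample` used later in the example config:

```bash
# Create the input directory in HDFS and upload the raw files
hdfs dfs -mkdir -p /tmp/ldbc_sample
hdfs dfs -put person.csv /tmp/ldbc_sample/
hdfs dfs -put person_knows_person.csv /tmp/ldbc_sample/

# Double-check that the files are in place
hdfs dfs -ls /tmp/ldbc_sample
```

Note that the keys of `column.mapping.config` (described in "Building a partitioned graph") must match the actual file names found under `input.path`.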
The first line of the data file is a header that describes the key of each field. The header is not required. If there is no header in the data file, you need to set `skip.header` to `false` in the data building process (for details, see the parameter descriptions in "Building a partitioned graph").
The remaining lines are data records. Each line represents one record, and data fields are separated by a custom separator ("|" in the example above). In the vertex data file `person.csv`, the `id` field and the `name` field are the primary key and the property of the vertex type `person`, respectively. In the edge data file `person_knows_person.csv`, the `person_id` field is the primary key of the source vertex, the `person_id_1` field is the primary key of the destination vertex, and `date` is the property of the edge type `knows`.
All data fields will be parsed according to the data types defined in the graph schema. If an input data field cannot be parsed correctly, the data building process will fail with corresponding errors.
The loading process contains three steps:
- Step 1: A partitioned graph is built from the source files and stored in the same HDFS using a MapReduce job
- Step 2: The graph partitions are loaded into the store servers (in parallel)
- Step 3: Commit to the online service so that data is ready for serving queries
Build the data by running the Hadoop MapReduce job with the following command:
$ hadoop jar data_load_tools-0.0.1-SNAPSHOT.jar com.alibaba.maxgraph.dataload.databuild.OfflineBuild <path/to/config/file>
The config file should follow a format that is recognized by Java's `java.util.Properties` class. Here is an example:
split.size=256
separator=\\|
input.path=/tmp/ldbc_sample
output.path=/tmp/data_output
graph.endpoint=1.2.3.4:55555
column.mapping.config={"person_0_0.csv":{"label":"person","propertiesColMap":{"0":"id","1":"name"}},"person_knows_person_0_0.csv":{"label":"knows","srcLabel":"person","dstLabel":"person","srcPkColMap":{"0":"id"},"dstPkColMap":{"1":"id"},"propertiesColMap":{"2":"date"}}}
skip.header=true
Details of the parameters are listed below:
Config key | Required | Default | Description |
---|---|---|---|
split.size | false | 256 | Hadoop map-reduce input data split size in MB |
separator | false | \\| | Separator used to parse each field in a line |
input.path | true | - | Input HDFS dir |
output.path | true | - | Output HDFS dir |
graph.endpoint | true | - | RPC endpoint of the graph storage service. You can get the RPC endpoint by following this document: GraphScope Store Service |
column.mapping.config | true | - | Mapping info for each input file in JSON format. Each key in the first level should be a file name that can be found in `input.path`, and the corresponding value defines the mapping info. For a vertex type, the mapping info should include 1) `label` of the vertex type, 2) `propertiesColMap` that describes the mapping from input fields to graph properties in the format of `{ columnIdx: "propertyName" }`. For an edge type, the mapping info should include 1) `label` of the edge type, 2) `srcLabel` of the source vertex type, 3) `dstLabel` of the destination vertex type, 4) `srcPkColMap` that describes the mapping from input fields to the primary keys of the source vertex type, 5) `dstPkColMap` that describes the mapping from input fields to the primary keys of the destination vertex type, 6) `propertiesColMap` that describes the mapping from input fields to graph properties of the edge type |
skip.header | false | true | Whether to skip the first line (header) of each input file |
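For readability, the `column.mapping.config` value from the example above is the same JSON pretty-printed (content unchanged):

```
{
  "person_0_0.csv": {
    "label": "person",
    "propertiesColMap": { "0": "id", "1": "name" }
  },
  "person_knows_person_0_0.csv": {
    "label": "knows",
    "srcLabel": "person",
    "dstLabel": "person",
    "srcPkColMap": { "0": "id" },
    "dstPkColMap": { "1": "id" },
    "propertiesColMap": { "2": "date" }
  }
}
```

In the properties file the value is written on a single line, as in the example above; the pretty-printed form is only for illustration.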
After the data building completes, you can find the output files under the `output.path` in HDFS. The output files include a meta file named `META`, an empty file named `_SUCCESS`, and a number of data files, one per partition, named in the pattern of `part-r-xxxxx.sst`. The layout of the output directory should look like:
/tmp/data_output
|- META
|- _SUCCESS
|- part-r-00000.sst
|- part-r-00001.sst
|- part-r-00002.sst
...
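You can verify the output layout directly in HDFS, for example:

```bash
# List the build output; expect META, _SUCCESS, and one part-r-xxxxx.sst file per partition
hdfs dfs -ls /tmp/data_output
```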
Now ingest the offline-built data into the graph storage. If you have `load_tool.sh`, run:
$ ./load_tool.sh -c ingest -d hdfs://1.2.3.4:9000/tmp/data_output
Or you can run with `java`:
$ java -cp data_load_tools-0.0.1-SNAPSHOT.jar com.alibaba.maxgraph.dataload.LoadTool -c ingest -d hdfs://1.2.3.4:9000/tmp/data_output
The offline-built data can be ingested successfully only once; otherwise, errors will occur.
After the data is ingested into the graph storage, you need to commit the data loading. The data cannot be read until it has been committed successfully. If you have `load_tool.sh`, run:
$ ./load_tool.sh -c commit -d hdfs://1.2.3.4:9000/tmp/data_output
Or you can run with `java`:
$ java -cp data_load_tools-0.0.1-SNAPSHOT.jar com.alibaba.maxgraph.dataload.LoadTool -c commit -d hdfs://1.2.3.4:9000/tmp/data_output
Notice: Later committed data will overwrite earlier committed data that has the same vertex types or edge relations.
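Putting the three steps together, a complete loading run with the example paths and endpoint from this document looks roughly like the sketch below; `config.properties` is a placeholder name for the config file described in "Building a partitioned graph":

```bash
# Step 1: build the partitioned graph in HDFS with the MapReduce job
hadoop jar data_load_tools-0.0.1-SNAPSHOT.jar \
    com.alibaba.maxgraph.dataload.databuild.OfflineBuild config.properties

# Step 2: ingest the built partitions into the store servers
./load_tool.sh -c ingest -d hdfs://1.2.3.4:9000/tmp/data_output

# Step 3: commit so the data becomes visible to queries
./load_tool.sh -c commit -d hdfs://1.2.3.4:9000/tmp/data_output
```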