Exports Amazon Neptune property graph data to CSV or JSON, or RDF graph data to Turtle.
- Exporting to the Bulk Loader CSV Format
- Exporting the Results of User-Supplied Queries
- Exporting an RDF Graph
- Building neptune-export
- Deploying neptune-export as an AWS Lambda Function
Exporting to the Bulk Loader CSV Format
When exporting to the CSV format used by the Amazon Neptune bulk loader, neptune-export generates CSV files based on metadata derived from scanning your graph. This metadata is persisted in a JSON file. There are three ways in which you can use the tool to generate bulk load files:
- export-pg: This command makes two passes over your data: the first to generate the metadata, the second to create the data files. By scanning all nodes and edges in the first pass, the tool captures the superset of properties for each label, identifies the broadest datatype for each property, and identifies any properties for which at least one vertex or edge has multiple values. If exporting to CSV, these latter properties are exported to CSV as array types. If exporting to JSON, these property values are exported as array nodes.
- create-pg-config: This command makes a single pass over your data to generate the metadata config file.
- export-pg-from-config: This command makes a single pass over your data to create the CSV or JSON files. It uses a preexisting metadata config file.
export-pg and create-pg-config both generate metadata JSON files describing the properties associated with each node and edge label. By default, these commands scan the entire database, which can take a long time for large datasets.
Both commands also allow you to sample a range of nodes and edges in order to create this metadata. If you are confident that sampling your data will yield the same metadata as scanning the entire dataset, specify the --sample option with these commands. If, however, you have reason to believe the same property on different nodes or edges could yield different datatypes, or different cardinalities, or that nodes or edges with the same labels could contain different sets of properties, you should consider retaining the default behaviour of a full scan.
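To make the trade-off concrete, here is a minimal Python sketch of how metadata of this kind could be derived, and how a sample can miss what a full scan finds. It is not the tool's implementation: the `infer_metadata` function, the three-step `TYPE_ORDER` widening list, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical widening order, narrowest to broadest (not the tool's actual type list).
TYPE_ORDER = ["int", "float", "string"]


def type_name(value):
    """Map a Python value onto one of the hypothetical type names."""
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    return "string"


def broadest(a, b):
    """Return whichever of two type names sits later in TYPE_ORDER."""
    return a if TYPE_ORDER.index(a) >= TYPE_ORDER.index(b) else b


def infer_metadata(elements):
    """elements: iterable of (label, {property: value or list of values})."""
    metadata = defaultdict(dict)
    for label, props in elements:
        for key, value in props.items():
            values = value if isinstance(value, list) else [value]
            entry = metadata[label].setdefault(
                key, {"type": TYPE_ORDER[0], "multi": False}
            )
            for v in values:
                entry["type"] = broadest(entry["type"], type_name(v))
            # A property is multi-valued if any element holds more than one value.
            entry["multi"] = entry["multi"] or len(values) > 1
    return dict(metadata)


graph = [
    ("Person", {"name": "Ann", "age": 31}),
    ("Person", {"name": "Bob", "age": 29.5}),
    ("Person", {"name": "Cy", "email": ["cy@example.com", "c@example.org"]}),
]

full = infer_metadata(graph)         # full scan
sampled = infer_metadata(graph[:1])  # sample of one element

print(full["Person"])
print(sampled["Person"])
```

In this data the sample sees `age` only as an integer and never sees `email`, so metadata built from it would be narrower than the metadata from the full scan.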
All three commands allow you to supply vertex and edge label filters.
- If you supply label filters to the export-pg command, the metadata file and the exported data files will contain data only for the labels specified in the filters.
- If you supply label filters to the create-pg-config command, the metadata file will contain data only for the labels specified in the filters.
- If you supply label filters to the export-pg-from-config command, the exported data files will contain data for the intersection of labels in the config file and the labels specified in the command filters.
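The intersection rule for export-pg-from-config can be pictured as a set operation. The Python sketch below is illustrative only: `labels_to_export` is a hypothetical helper, and treating "no filters" as "export every label in the config" is an assumption rather than documented behaviour.

```python
def labels_to_export(config_labels, filter_labels=None):
    """Labels written by an export-from-config run: config labels ∩ filter labels."""
    config = set(config_labels)
    if not filter_labels:
        # Assumption: with no filters supplied, every label in the config is exported.
        return config
    return config & set(filter_labels)


print(labels_to_export(["Person", "Movie"], ["Movie", "Genre"]))
```

Here `Genre` is requested in the filter but absent from the config file, so only `Movie` would be exported.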
The export-pg and export-pg-from-config commands support parallel export. You can supply a concurrency level, which determines the number of client threads used to perform the parallel export, and, optionally, a range or batch size, which determines how many nodes or edges will be queried by each thread at a time. If you specify a concurrency level but don't supply a range, the tool calculates a range such that each thread queries (1 / concurrency level) * (number of nodes or edges) items.
If using parallel export, we recommend setting the concurrency level to the number of vCPUs on your Neptune instance.
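As a rough illustration of the default range calculation, the following Python sketch splits a count of nodes or edges into one contiguous range per thread. The `thread_ranges` helper is hypothetical, and rounding the range size up is an assumption; the text above does not say how the tool rounds.

```python
import math


def thread_ranges(total, concurrency):
    """Split `total` items into contiguous (start, end) ranges, one per thread."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    # Each thread gets about total / concurrency items; round up so nothing is dropped.
    size = max(1, math.ceil(total / concurrency))
    return [(start, min(start + size, total)) for start in range(0, total, size)]


print(thread_ranges(1_000_000, 4))
```

With 1,000,000 nodes and a concurrency level of 4, each thread in this sketch would query 250,000 nodes.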
You can load balance requests across multiple instances in your cluster (or even multiple clusters) by supplying multiple
neptune-export uses long-running queries to generate the metadata and the data files. You may need to increase the neptune_query_timeout DB parameter in order to run the tool against large datasets.
For large datasets, we recommend running this tool against a standalone database instance that has been restored from a snapshot of your database.
Exporting the Results of User-Supplied Queries
The export-pg-from-queries command allows you to supply groups of Gremlin queries and export the results to CSV or JSON.
Every user-supplied query should return a result set in which every result is a Map. Typically, these are queries that return a valueMap() or a projection created using
Queries are grouped into named groups. All the queries in a named group should return the same columns. Named groups allow you to 'shard' large queries and execute them in parallel (using the --concurrency option). The resulting CSV or JSON files will be written to a directory named after the group.
If there is a possibility that individual rows in a query's result set will contain different keys, use the --two-pass-analysis flag to force neptune-export to determine the superset of keys or column headers for the query.
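The idea behind a two-pass analysis can be sketched in a few lines of Python: a first pass collects the superset of keys, and a second pass writes every row against that full header. This is a simplified model, not the tool's code. The `write_two_pass` helper is hypothetical, and it holds all rows in memory, which a real export of a large result set would likely avoid.

```python
import csv
import io


def write_two_pass(rows):
    """Write map-like rows to CSV, using the superset of keys as the header."""
    rows = list(rows)

    # Pass 1: collect every key that appears in any row, in first-seen order.
    headers = []
    for row in rows:
        for key in row:
            if key not in headers:
                headers.append(key)

    # Pass 2: write the rows; keys missing from a row become empty cells.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=headers, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


print(write_two_pass([{"name": "Ann"}, {"name": "Bob", "age": 29}]))
```

A single-pass writer that took its header from the first row alone would have no `age` column for the second row.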
You can supply multiple named groups using multiple --queries options. Each group comprises a name, an equals sign, and then a semicolon-delimited list of Gremlin queries. Surround the list of queries in double quotes. For example:
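As one illustration of this format, the Python sketch below assembles the value for a --queries option from a list of sharded Gremlin queries. The helper names (`sharded_queries`, `queries_option`), the `Person` label, and the shard size are hypothetical. The sketch also assumes that `range()` slices of the traversal are stable across queries, which Gremlin does not guarantee without an explicit ordering.

```python
def sharded_queries(base, step, shard_size, shards):
    """Build Gremlin queries that each read one range() slice of `base`."""
    queries = []
    for i in range(shards):
        low = i * shard_size
        # In Gremlin's range() step, an upper bound of -1 means "no limit".
        high = -1 if i == shards - 1 else low + shard_size
        queries.append(f"{base}.range({low},{high}).{step}")
    return queries


def queries_option(name, queries):
    """Format one named group as name="query1;query2"."""
    joined = ";".join(queries)
    return f'{name}="{joined}"'


shards = sharded_queries("g.V().hasLabel('Person')", "valueMap()", 100000, 2)
print("--queries", queries_option("person", shards))
```

Because both queries end in the same `valueMap()` step over the same label, they return the same columns, as a named group requires.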
Alternatively, you can supply a JSON file of queries.
Parallel execution of queries
If using parallel export, we recommend setting the concurrency level to the number of vCPUs on your Neptune instance. When neptune-export executes named groups of queries in parallel, it simply flattens all the queries into a queue, and spins up a pool of worker threads according to the concurrency level you have specified using --concurrency. Worker threads continue to take queries from the queue until the queue is exhausted.
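The queue-and-workers pattern described here can be modelled with Python's standard `queue` and `threading` modules. This is a sketch of the pattern only, not the tool's (Java) implementation: `run_groups` and the `execute` callback are hypothetical stand-ins for whatever actually sends a query to Neptune and writes its results.

```python
import queue
import threading


def run_groups(groups, concurrency, execute):
    """groups: {name: [query, ...]}; execute: callable(name, query) -> result."""
    work = queue.Queue()
    # Flatten every named group into one shared queue of (group name, query) pairs.
    for name, queries in groups.items():
        for q in queries:
            work.put((name, q))

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                name, q = work.get_nowait()
            except queue.Empty:
                return  # queue exhausted: this worker is done
            outcome = execute(name, q)
            with lock:
                results.append((name, q, outcome))

    threads = [threading.Thread(target=worker) for _ in range(concurrency)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Stand-in executor: a real one would run the query and write rows to the group's directory.
done = run_groups({"a": ["q1", "q2"], "b": ["q3"]}, 2, lambda name, q: q.upper())
print(sorted(done))
```

Since all groups share one queue, a worker that finishes a short query moves on to the next available one regardless of which group it belongs to.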
Queries whose results contain very large rows can sometimes trigger a CorruptedFrameException. If this happens, adjust the batch size (--batch-size) to reduce the number of results returned to the client in a batch (the default is 64).
Exporting an RDF Graph
At present, neptune-export supports exporting an RDF dataset to Turtle using a single-threaded, long-running query.
Encryption in transit
You can connect to Neptune from neptune-export using SSL by specifying the
If you are using a load balancer or a proxy server (such as HAProxy), you must use SSL termination and have your own SSL certificate on the proxy server.
IAM DB authentication
neptune-export supports exporting from databases that have IAM database authentication enabled. Supply the --use-iam-auth option with each command. Remember to set the SERVICE_REGION environment variable – e.g.
neptune-export also supports connecting through a load balancer to a Neptune database with IAM DB authentication enabled. However, this feature is currently supported only for property graphs, with support for RDF graphs coming soon.
If you are connecting through a load balancer, and have IAM DB authentication enabled, you must also supply either an --nlb-endpoint option (if using a network load balancer) or an --alb-endpoint option (if using an application load balancer), and an
For details on using a load balancer with a database with IAM DB authentication enabled, see Connecting to Amazon Neptune from Clients Outside the Neptune VPC.
Building neptune-export
To build the jar, run:
mvn clean install
Deploying neptune-export as an AWS Lambda Function
The neptune-export jar can be deployed as an AWS Lambda function. To access Neptune, you will either have to configure the function to access resources inside your VPC, or expose the Neptune endpoints via a load balancer.
Be mindful of the AWS Lambda limits, particularly with regard to function timeouts (max 15 minutes) and /tmp directory storage (512 MB). Large exports can easily exceed these limits.
When deployed as a Lambda function, neptune-export will automatically copy the export files to an S3 bucket of your choosing. Optionally, it can also write a completion file to a separate S3 location (useful for triggering additional Lambda functions). You must configure your function with an IAM role that has write access to these S3 locations.
The Lambda function expects a number of parameters, which you can supply either as environment variables or via a JSON input parameter. Fields in the JSON input parameter override any environment variables you have set up.
|Environment Variable|JSON Field|Description|Required|
| | |neptune-export command and command-line options: e.g.| |
| | |S3 location to which exported files will be written|Mandatory|
| | |S3 location of a JSON config file to be used when exporting a property graph from a config file|Optional|
| | |S3 location to which a completion file should be written once all export files have been copied to S3|Optional|
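The precedence rule (JSON input fields override environment variables) can be sketched as a simple lookup. The Python below is illustrative only: `resolve_params` is a hypothetical helper, and `EXAMPLE_SETTING` / `exampleSetting` and `OTHER_SETTING` / `otherSetting` are placeholder names, not the function's real parameter names.

```python
import os


def resolve_params(event, mapping, env=None):
    """Resolve each parameter from the JSON event first, then the environment."""
    env = os.environ if env is None else env
    resolved = {}
    for env_name, json_field in mapping.items():
        if json_field in event:
            resolved[json_field] = event[json_field]  # JSON input wins
        elif env_name in env:
            resolved[json_field] = env[env_name]      # fall back to the environment
    return resolved


# Placeholder names for illustration only.
mapping = {"EXAMPLE_SETTING": "exampleSetting", "OTHER_SETTING": "otherSetting"}
env = {"EXAMPLE_SETTING": "from-env", "OTHER_SETTING": "also-from-env"}
event = {"exampleSetting": "from-json"}

print(resolve_params(event, mapping, env))
```

In this example the first setting takes its value from the JSON input and the second falls back to the environment variable.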