Supporting Neptune in the Databuilder #13

Merged
61 changes: 61 additions & 0 deletions rfcs/013-neptune-support.md
@@ -0,0 +1,61 @@
- Feature Name: Amazon Neptune Databuilder support
- Start Date: 2020-11-10
- RFC PR: [amundsen-io/rfcs#13](https://github.com/amundsen-io/rfcs/pull/13)
- Amundsen Issue: [amundsen-io/amundsen#0000](https://github.com/amundsen-io/amundsen/issues/0000) (leave this empty for now)

# Amazon Neptune Databuilder Support

## Summary

This RFC proposes adding support for Amazon Neptune, Amazon's graph database, to the Amundsen databuilder.

## Motivation

Amundsen currently supports Neptune only in the metadata proxy. This RFC proposes adding Neptune support to the databuilder so that Amundsen supports Neptune throughout its stack.

## Guide-level Explanation (aka Product Details)

Currently the Amundsen databuilder library only supports the Neo4j datastore. The goal of this RFC is to add loaders, publishers, and serializers to the library so that Neptune is supported as well, while maintaining the same interfaces so that switching between Neo4j and Neptune is as easy as swapping the components.

## UI/UX-level Explanation

Not Applicable

## Reference-level Explanation (aka Technical Details)

To support Neptune in the databuilder, several new components are needed:

- A Neptune serializer which converts `GraphNodes` and `GraphRelationships` into the format that Neptune's bulk loader expects.

- A `FsNeptuneCSVLoader`, similar to the `FsNeo4jCSVLoader`, which writes the `GraphNodes` and `GraphRelationships` into CSV files that can be consumed by the publisher.

- A `NeptuneCsvBulkPublisher` which takes the CSV files generated by the `FsNeptuneCSVLoader` and publishes them to Neptune. Publishing can be broken down into two steps:
  1. Uploading the CSV files to Amazon S3.
  2. Making a request to Neptune's bulk loader endpoint that points at the S3 files (details can be found at https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load.html).

  Thanks to the team at Square, most of the process of publishing Amundsen data to Neptune is already implemented in the Neptune bulk loader API found in the https://github.com/amundsen-io/amundsengremlin repo.

- Adding the amundsengremlin repo as a dependency.

- Tests covering the Neptune models, loader, and publisher.
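
The serializer and publisher items above can be illustrated with a minimal sketch. It is not the proposed implementation: the dictionary shapes standing in for graph nodes and relationships, the function names, and the edge ID scheme are all assumptions made for illustration. The system column headers (`~id`, `~label`, `~from`, `~to`), the `name:Type` property headers, and the loader request fields follow the Neptune bulk loader documentation linked above.

```python
import csv
import io


def neptune_type(value):
    """Map a Python value to a Neptune bulk loader column type."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "Bool"
    if isinstance(value, int):
        return "Long"
    if isinstance(value, float):
        return "Double"
    return "String"


def format_value(value):
    """Render a property value as CSV text (booleans in lowercase)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def nodes_to_csv(nodes):
    """Serialize hypothetical node dicts into a Neptune vertex CSV string."""
    # Collect property columns in first-seen order, typed by their first value.
    columns = {}
    for node in nodes:
        for name, value in node["attributes"].items():
            columns.setdefault(name, neptune_type(value))

    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["~id", "~label"] + [f"{n}:{t}" for n, t in columns.items()])
    for node in nodes:
        props = [format_value(node["attributes"].get(n, "")) for n in columns]
        writer.writerow([node["key"], node["label"]] + props)
    return buf.getvalue()


def relationships_to_csv(relationships):
    """Serialize hypothetical relationship dicts into a Neptune edge CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["~id", "~from", "~to", "~label"])
    for rel in relationships:
        # Assumed edge ID scheme; any ID that is unique per edge would do.
        edge_id = f"{rel['label']}:{rel['start_key']}->{rel['end_key']}"
        writer.writerow([edge_id, rel["start_key"], rel["end_key"], rel["label"]])
    return buf.getvalue()


def build_load_request(s3_source, iam_role_arn, region):
    """Build the JSON body for a POST to the Neptune loader endpoint."""
    return {
        "source": s3_source,          # S3 URI of the uploaded CSV files
        "format": "csv",              # Gremlin CSV load format
        "iamRoleArn": iam_role_arn,   # role Neptune assumes to read from S3
        "region": region,
        "failOnError": "FALSE",
    }
```

In this sketch the publisher would first upload the two CSV strings to S3 and then POST the request body to the cluster's `/loader` endpoint; both network steps are left out so that the example stays self-contained.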

## Drawbacks

This RFC adds support for another datastore, which brings in additional components and increases the code size of the repo. In addition, the https://github.com/amundsen-io/amundsengremlin repo will be added as a dependency, which brings in complexities of its own.


## Alternatives

Taking no action is the main alternative here. The dependencies from https://github.com/amundsen-io/amundsengremlin could be separated so that the metadata proxy and the databuilder don't share the same requirements, but that seems unnecessary for now.

## Prior art

N/A

## Unresolved questions

N/A


## Future possibilities

None.