feat: Neptune Data builder Integration #438

AndrewCiambrone · 2021-02-08T20:23:50Z

Summary of Changes

Implements Neptune as a datastore in the Databuilder library.
RFC: https://github.com/amundsen-io/rfcs/blob/master/rfcs/013-neptune-support.md

Tests

I created tests for each model to verify serialization compatibility with Neptune. I also added a test for the Neptune data loader.

Documentation

I created a sample script that shows how to use the FSNeptuneCSVLoader and NeptuneCSVPublisher.

CheckList

Make sure you have checked all steps below to ensure a timely review.

- PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  - In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.
- PR includes a summary of changes.
- PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
- In case of new functionality, my PR adds documentation that describes how to use it.
  - All the public functions and the classes in the PR contain docstrings that explain what it does
- PR passes make test

Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>

AndrewCiambrone · 2021-02-08T20:26:23Z

databuilder/clients/neptune_client.py

@@ -0,0 +1,141 @@
+# Copyright Contributors to the Amundsen project.


Let me know if you all want this to be placed in: https://github.com/amundsen-io/amundsengremlin

AndrewCiambrone · 2021-02-08T20:27:09Z

databuilder/extractor/es_last_updated_extractor.py

@@ -10,9 +10,9 @@
 from databuilder.extractor.generic_extractor import GenericExtractor


-class Neo4jEsLastUpdatedExtractor(GenericExtractor):
+class EsLastUpdatedExtractor(GenericExtractor):


Please Note: I changed this to be more generic so all data stores can use this.

sounds good, cc @allisonsuarez from Lyft side

AndrewCiambrone · 2021-02-08T20:29:50Z

databuilder/models/user.py

@@ -37,7 +37,7 @@ def __init__(self,
                 email: str,
                 first_name: str = '',
                 last_name: str = '',
-                 name: str = '',
+                 full_name: str = '',


Please Note: One of the sample files refers to this as full_name, and I thought that name made more sense. Up to you all if it should be name.

AndrewCiambrone · 2021-02-08T20:33:26Z

Please note that this PR brings support for multiple models that the gremlin metadata proxy currently does not support. So some models will not show up on the frontend even though they are ingested.

feng-tao · 2021-02-10T17:13:20Z

will take a look

dorianj

Generally LGTM. Left some code-level comments and questions. This is super neat!

I'm not familiar enough with the testing and some other aspects to confidently approve, so I will let @feng-tao give word on that.

dorianj · 2021-02-16T18:38:08Z

databuilder/clients/neptune_client.py

+        neptune_uri = "wss://{host}/gremlin".format(
+            host=self._neptune_host
+        )
+        self.source_factory = neptune_bulk_loader_api.get_neptune_graph_traversal_source_factory(


Why save this factory?

dorianj · 2021-02-16T18:45:17Z

databuilder/publisher/neptune_csv_publisher.py

+        super(NeptuneCSVPublisher, self).__init__()
+
+    def init(self, conf: ConfigTree) -> None:
+        self._boto_session = Session(


can/should this share code with NeptuneSessionClient?

I do not think they should. The APIs used by the NeptuneCSVPublisher are used in a different context than the APIs used by the NeptuneSessionClient. They could be combined if you feel strongly, but I think blending the two might break the abstraction of the client.

No strong feelings, totally fine to leave separate if we feel they may diverge more in the future -- just noticed that a lot of the code reading from config and forming connection string was similar and wondered.

dorianj · 2021-02-16T18:52:50Z

databuilder/publisher/neptune_csv_publisher.py

+            errors=True
+        )
+        load_status_payload = load_status_response.get('payload', {})
+        if 'status' not in load_status_payload.get('overallStatus', {}):


nit, and ok to ignore, but I think more pythonic and DRY would be to just attempt the load_status_payload['overallStatus']['status'] and catch the KeyError?

(this applies in a few other places, but won't comment repeatedly)
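For reference, a minimal sketch of the try/except style suggested above. The nested keys (payload, overallStatus, status) come from the diff; the helper name get_overall_status and the sample status value are only illustrative, not part of the PR.

```python
from typing import Any, Dict, Optional


def get_overall_status(load_status_response: Dict[str, Any]) -> Optional[str]:
    """Return payload.overallStatus.status, or None if any key on the path is missing.

    Hypothetical helper for illustration; not taken from the PR.
    """
    try:
        return load_status_response['payload']['overallStatus']['status']
    except KeyError:
        # A missing key at any level lands here, so no chained .get() calls are needed.
        return None


# Sample responses (the status value is illustrative).
complete = {'payload': {'overallStatus': {'status': 'LOAD_COMPLETED'}}}
missing = {'payload': {}}

print(get_overall_status(complete))
print(get_overall_status(missing))
```

One caveat: this only covers missing keys. If an intermediate value can be None or a non-dict, a TypeError would also need handling.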

dorianj · 2021-02-16T18:56:58Z

databuilder/publisher/neptune_csv_publisher.py

+        file_paths = self._get_file_paths()
+        for file_location in file_paths:
+            with open(file_location, 'rb') as file_csv:
+                file_csv_bytes = BytesIO(file_csv.read())


why wrap the file_csv in a BytesIO? can we not pass the file_csv handle straight to neptune_api_client.upload, allowing it to stream from disk rather than buffer? If there's a reason why that's not ok, a comment would be helpful

Oh that is not very efficient. Not sure what I was thinking when I did that. Good catch.
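A rough sketch of the difference, assuming the upload call accepts any binary file-like object. The upload function below is a hypothetical stand-in, not the real neptune_api_client.upload; it just reads in chunks the way a streaming consumer would.

```python
import os
import tempfile
from typing import BinaryIO, List


def upload(fileobj: BinaryIO, chunk_size: int = 4) -> List[bytes]:
    """Hypothetical stand-in for an upload call: consumes a file-like object chunk by chunk."""
    chunks = []
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return chunks


# Write a small CSV to disk so there is something to stream.
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as tmp:
    tmp.write(b'~id,~label\n1,Table\n')
    path = tmp.name

try:
    # Pass the open handle directly; no BytesIO(file_csv.read()) copy of the whole file.
    with open(path, 'rb') as file_csv:
        received = upload(file_csv)
finally:
    os.remove(path)
```

With the handle passed straight through, only one chunk at a time needs to be in memory, whereas the BytesIO wrapper buffers the full file first.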

dorianj · 2021-02-16T19:08:30Z

databuilder/task/neptune_staleness_removal_task.py

+    def validate(self) -> None:
+        """
+        Validation method. Focused on limit the risk on deleting nodes and relations.
+         - Check if deleted nodes will be within 10% of total nodes.


I think this is 5% by default?
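To make the docstring question concrete, here is a sketch of the kind of check being described. The function name, parameter names, and the 5% default are assumptions based on this comment, not the task's actual code.

```python
def exceeds_staleness_threshold(total_count: int, stale_count: int, max_pct: float = 5.0) -> bool:
    """Return True if deleting stale_count records would exceed max_pct percent of total_count.

    Hypothetical helper; the 5.0 default mirrors the value suggested in the review comment.
    """
    if total_count == 0:
        # Nothing to delete from, so the threshold cannot be exceeded.
        return False
    stale_pct = stale_count * 100 / total_count
    return stale_pct > max_pct


print(exceeds_staleness_threshold(total_count=1000, stale_count=40))
print(exceeds_staleness_threshold(total_count=1000, stale_count=60))
```

Whatever the real default is, the docstring should state the same number the config uses.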

dorianj · 2021-02-16T21:47:48Z

databuilder/task/neptune_staleness_removal_task.py

+        self.target_nodes = set(conf.get_list(NeptuneStalenessRemovalTask.TARGET_NODES))
+        self.target_relations = set(conf.get_list(NeptuneStalenessRemovalTask.TARGET_RELATIONS))
+        self.batch_size = conf.get_int(NeptuneStalenessRemovalTask.BATCH_SIZE)
+        self.dry_run = conf.get_bool(NeptuneStalenessRemovalTask.DRY_RUN)


I can't see where this is used?

Great callout. I must have removed it at some point during debugging. Added it back in.

dorianj · 2021-02-16T21:48:47Z

databuilder/task/neptune_staleness_removal_task.py

+            .with_fallback(NeptuneStalenessRemovalTask.DEFAULT_CONFIG)
+        self.target_nodes = set(conf.get_list(NeptuneStalenessRemovalTask.TARGET_NODES))
+        self.target_relations = set(conf.get_list(NeptuneStalenessRemovalTask.TARGET_RELATIONS))
+        self.batch_size = conf.get_int(NeptuneStalenessRemovalTask.BATCH_SIZE)


are deletions being batched? i don't see this used atm?

Great callout, I removed the flag. I could not find a good way to batch delete without it being ridiculously slow, so I found what I think is an okay way to delete the items from the db. It has not caused us any problems so far.

I think the batching stuff is unique to neo4j performance characteristics, so nice that we don't need to do batching here, much simpler

dorianj · 2021-02-16T21:53:46Z

databuilder/task/neptune_staleness_removal_task.py

+            self,
+            total_records: Iterable[Dict[str, Any]],
+            stale_records: Iterable[Dict[str, Any]],
+            types: Iterable[str]


this might be similar to existing code, but i believe this typing won't guarantee that the iterator is replayable -- so doing if type_str not in types multiple times may fail?
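A small sketch of the failure mode, assuming a caller passes a generator for types. The helper name filter_unknown is hypothetical; the point is that materializing the iterable once (here into a set) makes repeated membership checks safe.

```python
from typing import Iterable, List


def filter_unknown(type_strs: List[str], types: Iterable[str]) -> List[str]:
    """Return the entries of type_strs that are not in types (hypothetical helper)."""
    known = set(types)  # consume the iterable exactly once
    return [t for t in type_strs if t not in known]


# A generator satisfies Iterable[str] but can only be walked once.
one_shot = (t for t in ['Table', 'Column'])
first_check = 'Column' in one_shot   # walks past 'Table' and 'Column'
second_check = 'Table' in one_shot   # generator is already exhausted

print(first_check, second_check)
print(filter_unknown(['Table', 'Schema', 'Column'], (t for t in ['Table', 'Column'])))
```

Alternatively, the parameter could be typed as Collection[str] or Set[str] so that one-shot iterators are ruled out by the type checker.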

Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>

feng-tao

looks great, some minor naming nits, thanks for the great contribution!

feng-tao · 2021-02-17T04:44:05Z

databuilder/clients/neptune_client.py

+        edge_traversal.next()
+
+    @staticmethod
+    def _update_entity_properties_on_traversal(


any reason some static methods start with an underscore while others don't?

feng-tao · 2021-02-17T04:44:52Z

databuilder/extractor/es_last_updated_extractor.py

@@ -10,9 +10,9 @@
 from databuilder.extractor.generic_extractor import GenericExtractor


-class Neo4jEsLastUpdatedExtractor(GenericExtractor):
+class EsLastUpdatedExtractor(GenericExtractor):


sounds good, cc @allisonsuarez from Lyft side

feng-tao · 2021-02-17T04:45:58Z

databuilder/extractor/neptune_search_data_extractor.py

+                yield result
+
+    def get_scope(self) -> str:
+        return 'extractor.search_data'


extractor.neptune_search_data ? the scope should be different between extractors
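A simplified illustration of why distinct scopes matter, using plain dicts rather than the real config classes. The scoped_conf helper and the entity_type key are hypothetical; the assumption is only that each component reads the settings found under its own scope prefix.

```python
from typing import Any, Dict


def scoped_conf(conf: Dict[str, Any], scope: str) -> Dict[str, Any]:
    """Return the settings under 'scope.' with the prefix stripped (hypothetical stand-in)."""
    prefix = scope + '.'
    return {key[len(prefix):]: value for key, value in conf.items() if key.startswith(prefix)}


# With distinct scopes, two extractors in the same job config stay independent.
job_conf = {
    'extractor.search_data.entity_type': 'table',
    'extractor.neptune_search_data.entity_type': 'user',
}

print(scoped_conf(job_conf, 'extractor.search_data'))
print(scoped_conf(job_conf, 'extractor.neptune_search_data'))
```

If both extractors returned 'extractor.search_data', they would read the same keys and could not be configured separately.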

feng-tao · 2021-02-17T04:50:02Z

databuilder/loader/file_system_neptune_csv_loader.py

+        self._closer.close()
+
+    def get_scope(self) -> str:
+        return "loader.filesystem_csv_neptune"


loader.neptune_filesystem_csv ?

I was trying to be consistent with https://github.com/amundsen-io/amundsendatabuilder/blob/master/databuilder/loader/file_system_neo4j_csv_loader.py#L187

but I think loader.neptune_filesystem_csv sounds better.

feng-tao

Is it possible to add a short tutorial on how to use Neptune e2e in https://github.com/amundsen-io/amundsen/tree/master/docs/tutorials ? That would be super helpful for others who want to use Neptune, thanks!

Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>

AndrewCiambrone added 6 commits February 8, 2021 13:26

Implement Neptune databuilder connection

8d60860

Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>

merge upstream

d599334

Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>

Added documentation and isort

30cba1b

Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>

Additional documentation

4884edf

Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>

Make the neo4j_es_last_updated more generic for all datastores

abd895a

Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>

A few fixes due to upstream changes

52976fc

Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>

AndrewCiambrone requested review from allisonsuarez, dikshathakur3119, feng-tao, jinhyukchang and a team as code owners February 8, 2021 20:23


feng-tao added the keep fresh label (disables stalebot from closing an issue) Feb 9, 2021

dorianj reviewed Feb 16, 2021


AndrewCiambrone added 3 commits February 16, 2021 18:22

Acknowledge reviewers comments

6bdceb4

Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>

linting fix

dde16bd

Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>

do not read file into memory

01036ed

Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>

feng-tao approved these changes Feb 17, 2021


feng-tao reviewed Feb 17, 2021


acknowledge reviewers comments

1baac4a

Signed-off-by: Andrew Ciambrone <andrjc4@vt.edu>

feng-tao merged commit 303e8aa into amundsen-io:master Feb 17, 2021
