-
Notifications
You must be signed in to change notification settings - Fork 722
(feat): Add Amazon Neptune support 🚀 #1085
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
|
Regarding naming conventions we tend to just go with whatever makes sense for the domain / service really. I'm happy with |
awswrangler/neptune/__init__.py
Outdated
| @@ -0,0 +1,17 @@ | |||
| """Utilities Module for Amazon OpenSearch.""" | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor overlook there
|
Also just note that there's convenient |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
jaidisido
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this. Some general comments.
awswrangler/neptune/neptune.py
Outdated
| Run a Gremlin Query | ||
|
|
||
| >>> import awswrangler as wr | ||
| >>> client = wr.neptune.Client(host='NEPTUNE-ENDPOINT') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't tend to call a class directly in Wrangler. Instead perhaps we can replicate the pattern here where a connect method is called and the connection is then reused in other methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for Neptune, I intended to use the HTTPS endpoints for all the languages, so there isn't really a long living connection like there is with some other databases. It's really just a client that has all the needed information to generate the appropriate requests call at the time of invocation. I was thinking that the client object here would be very similar to the connection object in Redshift. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The client can be implemented as a class, my comment was simply to indicate that no other method in the entire library is calling a class directly and we would prefer to keep it that way for UX. So it can be implemented as a class but there would be a helper to return an instance of the class like you mentioned for here
awswrangler/neptune/neptune.py
Outdated
|
|
||
| >>> import awswrangler as wr | ||
| >>> client = wr.neptune.Client(host='NEPTUNE-ENDPOINT') | ||
| >>> df = wr.neptune.gremlin.read(client, "g.V().limit(5).valueMap()") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this should be wr.neptune.read_gremlin instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would vote for wr.neptune.gremlin.read may be, because each graph might have their own functions and it creates a good isolation between property graph and RDF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went back and forth on this a bit and I ended up with read_gremlin and read_opencypher instead of creating subclasses for each language, as that seemed to better fit the overall patterns in the library. I must have missed that in the docs, but they have been updated now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed would prefer dedicated methods like above
awswrangler/neptune/neptune.py
Outdated
|
|
||
| >>> import awswrangler as wr | ||
| >>> client = wr.neptune.Client(host='NEPTUNE-ENDPOINT') | ||
| >>> wr.neptune.gremlin.to_graph( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would there be one per graph type (i.e. gremlin, sparql) or should this be wr.neptune.to_graph?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about the feedback here, I ended up creating a to_property_graph() and to_rdf_graph() as the incoming data frame for each will have a different format. For Gremlin versus openCypher they will have the same format, so I decided to combine them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be fine. Once we start implementing these methods it will become clearer if they should merged or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A single method should be fine as long as input is a DataFrame but yes that would become clear when implementation is added
To add to this, there is a validate.sh script that can be run locally or as a pre-commit hook for code static checks. An alternative is to prepend |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
…es and result parsing
…ociated with the three query languages
awswrangler/neptune/client.py
Outdated
|
|
||
| # If the value is a Vertex or Edge do special processing | ||
| if isinstance(d[k], Vertex) or isinstance(d[k], Edge) or isinstance(d[k], VertexProperty) or isinstance(d[k], Property): | ||
| d[k] = d[k].__dict__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
Modifying object.__dict__ directly or writing to an instance of a class __dict__ attribute directly is not recommended. Inside every module is a __dict__ object.dict attribute which contains its symbol table. If you modify object.__dict__, then the symbol table is changed. Also, direct assignment to the __dict__ attribute is not possible.
awswrangler/neptune/client.py
Outdated
|
|
||
| # For lists or paths unwind them | ||
| if isinstance(result, list) or isinstance(result, Path): | ||
| for x in result: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
To create a list, try to use list comprehension instead of a loop. List comprehension is the preferred way to make a list using Python, and it's simpler and easier to understand than using a loop.
awswrangler/neptune/client.py
Outdated
| # If this is a list or Path then unwind it | ||
| if isinstance(data, list) or isinstance(data, Path): | ||
| res=[] | ||
| for x in data: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
To create a list, try to use list comprehension instead of a loop. List comprehension is the preferred way to make a list using Python, and it's simpler and easier to understand than using a loop.
awswrangler/neptune/client.py
Outdated
|
|
||
| def read_gremlin(self, query, headers:Dict[str, Any] = None) -> Dict[str, Any]: | ||
| try: | ||
| nest_asyncio.apply() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious - is this required? Nesting event loops seems to be a controversial topic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried running it without nesting event loops and it does not work inside Jupyter without having it nested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a way to use a single event loop... let me check the details
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So looks like they even added the ability to enable event loop nesting to the gremlin driver itself now. You can just enable it by passing transport_factory=AiohttpTransport(call_from_event_loop=True). I added a commit with this
…s. Moved the Gremlin parsing code out of client and into its own static class
…o database infra scripting
…s well as added update option for property graph data
…tion pass locally
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
awswrangler/neptune/client.py
Outdated
| uri = f"{HTTP_PROTOCOL}://{self.host}:{self.port}/gremlin" | ||
| request = self._prepare_request("GET", uri, headers=headers) | ||
| ws_url = f"{WS_PROTOCOL}://{self.host}:{self.port}/gremlin" | ||
| c = client.Client( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed this - do we need to open and close the connection on every query? Can this be moved to __init__?
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
…book for Amazon Neptune
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
… name and added code to create the df when running cardinality tests.
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
Issue #, if available:
Description of changes:
First draft of what a Neptune interface might look like.
I did have an utstanding question though on the naming of the write function names. There seems to be several conventions (
put,to_sql,index, etc.) that different services have used based on how they work. Is there a preferred naming convention we would like to follow here?By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.