-
Notifications
You must be signed in to change notification settings - Fork 670
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add read_items to dynamodb module #1877
Conversation
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
* explicitly return None instead of return * switch from Mapping to Dict
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
Ok, after some fighting with my local VSC setup I finally managed to review code and successfully complete all checks except CodeBuild: I wait for a feedback from your side because I don't know how to fix/proceed. Regarding new feature testing: I should be able to add some mocked tests like the ones added for partiql, but I am not sure how much relevant they would be. For more reliable tests against actual DynamoDB tables in a real AWS account, I probably need some guidance and/or let someone else do the job (to avoid author-bias in tests writing as well as risks of messing up with datalake/full test environments - I think I am not sure about what to do). |
Thank you @a-slice-of-py, this is a great start. My team will review and we can either suggest or create some tests. I agree that they should be integration and not unit tests. We already have some but in a separate branch. Don't worry about the failing CB, it's a flaky test on our side, nothing to do with your PR |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great, left some initial comments. We will crack on with some tests to give you an idea of how to create them
# SEE: handle possible unprocessed keys. As suggested in Boto3 docs, | ||
# this approach should involve exponential backoff, but this should be | ||
# already managed by AWS SDK itself, as stated | ||
# [here](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it should be handled by the SDK. If not, we do have a method which is used for exponential backoff. It requires knowing the exception though
max_items_evaluated: Optional[int] = None, | ||
as_dataframe: bool = True, | ||
boto3_session: Optional[boto3.Session] = None, | ||
) -> Union[pd.DataFrame, List[Dict[str, Any]]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually only return a pandas DataFrame given that this is supposed to be an AWS SDK for pandas
. However I do see the appeal of also enabling list of dicts, especially that the as_dataframe
argument is True by default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I totally understand the focus of awswrangler, but I thought that provide a pandas opt-out choice would encourage the adoption of this new feature and related refactoring of pre-existing code.
Users can adopt it step-by-step: first by substituting their current snippets - which might return "raw" items as list of dicts - with wr.dynamodb.read_items(..., as_dataframe=False)
, and only then dealing with a change in data structures.
|
||
|
||
@apply_configs | ||
def read_items( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One additional argument we usually consider in our read methods is chunked
. If set to True then an Iterator of data frames is yielded. It would make the API memory friendly as the user would be able to process items in chunks, particularly when doing a full scan on the table.
This might be a bit more complex so we could consider it later once we have some working tests.
awswrangler/dynamodb/_read.py
Outdated
kwargs["Limit"] = max_items_evaluated | ||
|
||
# If kwargs are sufficiently informative, proceed with actual read op | ||
if partition_values or key_condition_expression or filter_expression or allow_full_scan or max_items_evaluated: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using any
would be simpler perhaps?
if any((partition_values, key_condition_expression,...)):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely, thanks.
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
# SEE: recursive approach guarantees that each reserved keyword will be properly replaced, | ||
# even if it will require as many calls as the reserved keywords involved (not so efficient...) | ||
return wrapper(*args, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jaidisido do you have any feedback on this?
I wasn't able to find any way to obtain the list of DynamoDB reserved keywords programmatically (having that beforehand would ease the implementation without requiring recursion).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did look/ask around and there is no programmatic way to access the list of reserved keywords sadly.
That being said, looking at its GH history that list is static. It has not been changed since its inception.
This makes sense as adding a keyword to the list would be a serious breaking change in production.
All of this to say that I would be ok with creating a separate file hosting the list of keywords and using it for your check (example).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reference.
I tried to switch to a LBYL approach for reserved words management but I found it harder than expected - mainly due to the diverse format which ProjectionExpression
, KeyConditionExpression
and FilterExpression
can assume - making difficult to find a general preventive action.
So, given that:
- expressions can be also expressed via
Key
andAttr
, which seem to sanitize reserved words too - the (current) EAFP approach is already available
I propose to keep it as it is, at least for the moment.
If we are worried about the recursive approach (which leads to as many validation client calls in case as number of reserved words in the given kwargs), we can put it behind an opt-in/opt-out kwarg, something like sanitize_dynamodb_reserved_words: bool
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I think it's ok to keep it as is for now as it would be too complex to capture all edge cases from the involved expressions
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
I have added some initial tests to get us going as discussed @a-slice-of-py. The test coverage report displayed at the end of our CB logs is one measure of the quality of tests that we should implement. |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
Thanks @jaidisido: I looked at the tests and they seem to already cover nearly all the possible usage of I just don't get why CB logs reported as missing lines 258->269, even if they should have been hit in all scan/query ops, for example at line 115 of |
Looking in detail into the coverage report I can see that use_query and use_scan cases were hit (green), it's just the surrounding pagination and unprocessed keys that is missed (red) I am happy with where we are with this, just waiting on others to review before merging. Thanks again @a-slice-of-py |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, thanks
Feature or Bugfix
Detail
read_items
function to handle read operations towards a given DynamoDB Table, returning the results as pandas DataFrameRelates
wr.dynamodb.read_items
#1867By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.