Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add read_items to dynamodb module #1877

Merged
merged 23 commits into from
Dec 29, 2022
Merged

Conversation

a-slice-of-py
Copy link
Contributor

Feature or Bugfix

  • Feature

Detail

  • Add to DynamoDB module a new read_items function to handle read operations towards a given DynamoDB Table, returning the results as pandas DataFrame

Relates

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 389754c
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 3fa32c5
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: fbc75bd
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: f819f8e
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 3234d41
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 3dcee5a
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: dd85bf6
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 63f87c9
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 0a22032
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 24152a6
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@a-slice-of-py
Copy link
Contributor Author

Ok, after some fighting with my local VSC setup I finally managed to review code and successfully complete all checks except CodeBuild: I wait for a feedback from your side because I don't know how to fix/proceed.

Regarding new feature testing: I should be able to add some mocked tests like the ones added for partiql, but I am not sure how much relevant they would be.

For more reliable tests against actual DynamoDB tables in a real AWS account, I probably need some guidance and/or let someone else do the job (to avoid author-bias in tests writing as well as risks of messing up with datalake/full test environments - I think I am not sure about what to do).

@a-slice-of-py a-slice-of-py changed the title feat: add read_items to dynamodb module (#1867) feat: add read_items to dynamodb module Dec 19, 2022
@jaidisido
Copy link
Contributor

Thank you @a-slice-of-py, this is a great start. My team will review and we can either suggest or create some tests. I agree that they should be integration and not unit tests. We already have some but in a separate branch.

Don't worry about the failing CB, it's a flaky test on our side, nothing to do with your PR

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 5b1e74b
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: e42475f
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

Copy link
Contributor

@jaidisido jaidisido left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great, left some initial comments. We will crack on with some tests to give you an idea of how to create them

Comment on lines +252 to +255
# SEE: handle possible unprocessed keys. As suggested in Boto3 docs,
# this approach should involve exponential backoff, but this should be
# already managed by AWS SDK itself, as stated
# [here](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it should be handled by the SDK. If not, we do have a method which is used for exponential backoff. It requires knowing the exception though

max_items_evaluated: Optional[int] = None,
as_dataframe: bool = True,
boto3_session: Optional[boto3.Session] = None,
) -> Union[pd.DataFrame, List[Dict[str, Any]]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually only return a pandas DataFrame given that this is supposed to be an AWS SDK for pandas. However I do see the appeal of also enabling list of dicts, especially that the as_dataframe argument is True by default

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I totally understand the focus of awswrangler, but I thought that provide a pandas opt-out choice would encourage the adoption of this new feature and related refactoring of pre-existing code.

Users can adopt it step-by-step: first by substituting their current snippets - which might return "raw" items as list of dicts - with wr.dynamodb.read_items(..., as_dataframe=False), and only then dealing with a change in data structures.



@apply_configs
def read_items(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One additional argument we usually consider in our read methods is chunked. If set to True then an Iterator of data frames is yielded. It would make the API memory friendly as the user would be able to process items in chunks, particularly when doing a full scan on the table.

This might be a bit more complex so we could consider it later once we have some working tests.

kwargs["Limit"] = max_items_evaluated

# If kwargs are sufficiently informative, proceed with actual read op
if partition_values or key_condition_expression or filter_expression or allow_full_scan or max_items_evaluated:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using any would be simpler perhaps?
if any((partition_values, key_condition_expression,...)):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, thanks.

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 8b39ce4
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

Comment on lines +201 to +203
# SEE: recursive approach guarantees that each reserved keyword will be properly replaced,
# even if it will require as many calls as the reserved keywords involved (not so efficient...)
return wrapper(*args, **kwargs)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaidisido do you have any feedback on this?

I wasn't able to find any way to obtain the list of DynamoDB reserved keywords programmatically (having that beforehand would ease the implementation without requiring recursion).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did look/ask around and there is no programmatic way to access the list of reserved keywords sadly.

That being said, looking at its GH history that list is static. It has not been changed since its inception.

This makes sense as adding a keyword to the list would be a serious breaking change in production.

All of this to say that I would be ok with creating a separate file hosting the list of keywords and using it for your check (example).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reference.

I tried to switch to a LBYL approach for reserved words management but I found it harder than expected - mainly due to the diverse format which ProjectionExpression, KeyConditionExpression and FilterExpression can assume - making difficult to find a general preventive action.

So, given that:

  • expressions can be also expressed via Key and Attr, which seem to sanitize reserved words too
  • the (current) EAFP approach is already available

I propose to keep it as it is, at least for the moment.

If we are worried about the recursive approach (which leads to as many validation client calls in case as number of reserved words in the given kwargs), we can put it behind an opt-in/opt-out kwarg, something like sanitize_dynamodb_reserved_words: bool.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I think it's ok to keep it as is for now as it would be too complex to capture all edge cases from the involved expressions

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 38e7686
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@jaidisido
Copy link
Contributor

This is looking great, left some initial comments. We will crack on with some tests to give you an idea of how to create them

I have added some initial tests to get us going as discussed @a-slice-of-py. The test coverage report displayed at the end of our CB logs is one measure of the quality of tests that we should implement.

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 1842ee5
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 989d788
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@a-slice-of-py
Copy link
Contributor Author

This is looking great, left some initial comments. We will crack on with some tests to give you an idea of how to create them

I have added some initial tests to get us going as discussed @a-slice-of-py. The test coverage report displayed at the end of our CB logs is one measure of the quality of tests that we should implement.

Thanks @jaidisido: I looked at the tests and they seem to already cover nearly all the possible usage of read_items (excepted the ones involving pagination and unprocessed keys), as confirmed by the coverage reported at the end of logs - awesome.

I just don't get why CB logs reported as missing lines 258->269, even if they should have been hit in all scan/query ops, for example at line 115 of test_dynamodb.py

@jaidisido
Copy link
Contributor

jaidisido commented Dec 28, 2022

Looking in detail into the coverage report I can see that use_query and use_scan cases were hit (green), it's just the surrounding pagination and unprocessed keys that is missed (red)

image

I am happy with where we are with this, just waiting on others to review before merging. Thanks again @a-slice-of-py

@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 7e8b4d2
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

Copy link
Contributor

@malachi-constant malachi-constant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants