88: add entities methods create_many, delete, merge #93

lindsay-stevens · 2024-07-31T07:22:48Z

Closes #88

What has been done to verify that this works as intended?

Unit tests for the data processor method _prep_data_for_merge, and integration tests in test_client.py.

Why is this the best possible solution? Were any other approaches considered?

The original ticket called for an example script but 1) this seemed like functionality that would be widely useful enough to be a part of the library itself, and 2) it was somewhat complicated to implement and test (lots of options and edge cases) so as part of the library it should be easier to actively maintain.

Also, I added the use case to the existing example script create_entities_from_submissions.py because there was an almost complete overlap in the use case - it didn't involve updating data before, but you can do the same thing with merge. The tests contain examples of numerous other usage / call patterns.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Provides some handy new ways to interact with Entities

Do we need any specific form for testing your changes? If so, please attach one.

No, the tests include all necessary data.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

Before submitting this PR, please make sure you have:

included test cases for core behavior and edge cases in tests
run python -m unittest and verified all tests pass
run ruff format pyodk tests and ruff check pyodk tests to lint code
verified that any code or assets from external sources are properly credited in comments

lindsay-stevens · 2024-07-31T07:35:10Z

@yanokwa @tobiasmcnulty does it seem like this new merge method will help simplify your workflows? Any feedback / comments / ideas / edge cases / use cases welcome. Thanks

tobiasmcnulty · 2024-07-31T13:15:32Z

@lindsay-stevens This looks great! Thanks for adding support for these methods directly in the library. I might not be able to test right away, but if real-world feedback would be helpful and is not too urgent, I can try to prioritize testing it and report back.

My only initial comment is that single-threaded deletes and updates can be quite slow (think 6 minutes instead of 6 seconds for ~5000 rows, when comparing merge()-style code to create_many()). It could be worth including a disclaimer in the docs, or adding support for multi-threaded deletes and updates. This is what we do in our code:

def delete_entities(
    client,
    project_id,
    entity_list_name,
    uuids_to_delete,
):
    def delete_entity(uuid):
        delete_resp = client.delete(
            f"/projects/{project_id}/datasets/{entity_list_name}/entities/{uuid}",
        )
        maybe_fail_loudly_for_odk_resp(delete_resp)

    pool = ThreadPool(20)
    print(f"Deleting {len(uuids_to_delete)} UUIDs...")
    pool.map(delete_entity, uuids_to_delete)
    print(f"Done.")

lindsay-stevens · 2024-07-31T14:30:34Z

@tobiasmcnulty thanks! Testing would not be urgent, but appreciated if you have a chance. About the use cases:

why is reducing the processing time important? Are entity updates needed that fast, or is it a cost thing etc?
I can imagine needing to update 5k entities in one go, but what is the use case for a large batch of deletes?

Incorporating some kind of async might be out of scope for this PR. Since pyodk is a library, if it opens threadpools, and user app code then opens threadpools to invoke pyodk, we can end up with unexpectedly a lot of threads. Also a flood of requests to Central. I don't know if there are plans for it, but maybe Central could have a batch update and batch delete endpoint, like there is for create. Another approach could be make pyodk use aiohttp or similar, but that's a large refactor. With the current code, an approach could be to monkey-patch EntityService.update and .delete to use async, or make them no-ops and use the merge return values to run the updates/deletes with async.

tobiasmcnulty · 2024-07-31T15:09:36Z

We may have upwards of 3M entities to update (don't worry, not all in one list), so 6 minutes -> 6 seconds is a big improvement. I agree that long term it would be better to implement these changes server side, so take take the feedback with a grain of salt. Perhaps just a note in the docs so warn users before they can't figure out why merge() isn't returning, especially when create_many() is so fast?

lognaturel

I feel like the create_many and merge signatures should be more similar. I've made a few suggestions to that effect and am interested to hear what you think, @lindsay-stevens

How about also adding a brief note about performance in the merge and delete doc comments, maybe also adding one to update? Something to the effect of "this method sends requests one at a time and may be slow for larger numbers of Entities. Consider making parallel calls and describing your use case on the ODK forum."

pyodk/_endpoints/entities.py

lindsay-stevens · 2024-08-20T09:25:41Z

How about also adding a brief note about performance in the merge and delete doc comments, maybe also adding one to update? Something to the effect of "this method sends requests one at a time and may be slow for larger numbers of Entities. Consider making parallel calls and describing your use case on the ODK forum."

I added a note to merge in d657b0f, but I think adding this kind of statement to individual endpoint methods might be too much, because it would apply to basically all the pyodk endpoint methods (handling lots of forms, or submissions, etc).

- chg: move insert data reshaping to create_many so that it's easier to use and accepts same kind of data as merge. - add: create_source/source_size parameter for both methods to allow specifying this info separately to the data. - chg: docstring improvements

- from mkdocstrings-python dependency on griffe>=0.37 - griffe public API changed in v1.0.0 released 2 weeks ago.

lognaturel

Thanks for making those updates!

lindsay-stevens requested a review from lognaturel July 31, 2024 07:22

add: entities methods create_many, delete, merge

531cb2a

lindsay-stevens force-pushed the pyodk-88 branch from a192675 to 531cb2a Compare July 31, 2024 07:50

add: option to skip merge updates, docs for bulk update/delete

d657b0f

lindsay-stevens requested review from lognaturel and removed request for lognaturel August 16, 2024 07:11

lognaturel reviewed Aug 19, 2024

View reviewed changes

pyodk/_endpoints/entities.py Outdated Show resolved Hide resolved

pyodk/_endpoints/entities.py Show resolved Hide resolved

pyodk/_endpoints/entities.py Outdated Show resolved Hide resolved

pyodk/_endpoints/entities.py Outdated Show resolved Hide resolved

lindsay-stevens added 2 commits August 29, 2024 21:53

fix: pin docs dependency with breaking update that breaks CI

b95ee56

- from mkdocstrings-python dependency on griffe>=0.37 - griffe public API changed in v1.0.0 released 2 weeks ago.

lindsay-stevens requested a review from lognaturel August 29, 2024 12:25

lognaturel approved these changes Aug 29, 2024

View reviewed changes

lindsay-stevens merged commit 0430222 into getodk:master Aug 30, 2024
4 checks passed

lindsay-stevens deleted the pyodk-88 branch August 30, 2024 06:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

88: add entities methods create_many, delete, merge #93

88: add entities methods create_many, delete, merge #93

lindsay-stevens commented Jul 31, 2024

lindsay-stevens commented Jul 31, 2024

tobiasmcnulty commented Jul 31, 2024 •

edited

Loading

lindsay-stevens commented Jul 31, 2024

tobiasmcnulty commented Jul 31, 2024

lognaturel left a comment

lindsay-stevens commented Aug 20, 2024

lognaturel left a comment

88: add entities methods create_many, delete, merge #93

88: add entities methods create_many, delete, merge #93

Conversation

lindsay-stevens commented Jul 31, 2024

What has been done to verify that this works as intended?

Why is this the best possible solution? Were any other approaches considered?

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Do we need any specific form for testing your changes? If so, please attach one.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

Before submitting this PR, please make sure you have:

lindsay-stevens commented Jul 31, 2024

tobiasmcnulty commented Jul 31, 2024 • edited Loading

lindsay-stevens commented Jul 31, 2024

tobiasmcnulty commented Jul 31, 2024

lognaturel left a comment

Choose a reason for hiding this comment

lindsay-stevens commented Aug 20, 2024

lognaturel left a comment

Choose a reason for hiding this comment

tobiasmcnulty commented Jul 31, 2024 •

edited

Loading