Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

88: add entities methods create_many, delete, merge #93

Merged
merged 4 commits into from
Aug 30, 2024

Conversation

lindsay-stevens
Copy link
Contributor

Closes #88

What has been done to verify that this works as intended?

Unit tests for the data processor method _prep_data_for_merge, and integration tests in test_client.py.

Why is this the best possible solution? Were any other approaches considered?

The original ticket called for an example script but 1) this seemed like functionality that would be widely useful enough to be a part of the library itself, and 2) it was somewhat complicated to implement and test (lots of options and edge cases) so as part of the library it should be easier to actively maintain.

Also, I added the use case to the existing example script create_entities_from_submissions.py because there was an almost complete overlap in the use case - it didn't involve updating data before, but you can do the same thing with merge. The tests contain examples of numerous other usage / call patterns.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Provides some handy new ways to interact with Entities

Do we need any specific form for testing your changes? If so, please attach one.

No, the tests include all necessary data.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

Before submitting this PR, please make sure you have:

  • included test cases for core behavior and edge cases in tests
  • run python -m unittest and verified all tests pass
  • run ruff format pyodk tests and ruff check pyodk tests to lint code
  • verified that any code or assets from external sources are properly credited in comments

@lindsay-stevens
Copy link
Contributor Author

@yanokwa @tobiasmcnulty does it seem like this new merge method will help simplify your workflows? Any feedback / comments / ideas / edge cases / use cases welcome. Thanks

@tobiasmcnulty
Copy link
Contributor

tobiasmcnulty commented Jul 31, 2024

@lindsay-stevens This looks great! Thanks for adding support for these methods directly in the library. I might not be able to test right away, but if real-world feedback would be helpful and is not too urgent, I can try to prioritize testing it and report back.

My only initial comment is that single-threaded deletes and updates can be quite slow (think 6 minutes instead of 6 seconds for ~5000 rows, when comparing merge()-style code to create_many()). It could be worth including a disclaimer in the docs, or adding support for multi-threaded deletes and updates. This is what we do in our code:

def delete_entities(
    client,
    project_id,
    entity_list_name,
    uuids_to_delete,
):
    def delete_entity(uuid):
        delete_resp = client.delete(
            f"/projects/{project_id}/datasets/{entity_list_name}/entities/{uuid}",
        )
        maybe_fail_loudly_for_odk_resp(delete_resp)

    pool = ThreadPool(20)
    print(f"Deleting {len(uuids_to_delete)} UUIDs...")
    pool.map(delete_entity, uuids_to_delete)
    print(f"Done.")

@lindsay-stevens
Copy link
Contributor Author

@tobiasmcnulty thanks! Testing would not be urgent, but appreciated if you have a chance. About the use cases:

  • why is reducing the processing time important? Are entity updates needed that fast, or is it a cost thing etc?
  • I can imagine needing to update 5k entities in one go, but what is the use case for a large batch of deletes?

Incorporating some kind of async might be out of scope for this PR. Since pyodk is a library, if it opens threadpools, and user app code then opens threadpools to invoke pyodk, we can end up with unexpectedly a lot of threads. Also a flood of requests to Central. I don't know if there are plans for it, but maybe Central could have a batch update and batch delete endpoint, like there is for create. Another approach could be make pyodk use aiohttp or similar, but that's a large refactor. With the current code, an approach could be to monkey-patch EntityService.update and .delete to use async, or make them no-ops and use the merge return values to run the updates/deletes with async.

@tobiasmcnulty
Copy link
Contributor

We may have upwards of 3M entities to update (don't worry, not all in one list), so 6 minutes -> 6 seconds is a big improvement. I agree that long term it would be better to implement these changes server side, so take take the feedback with a grain of salt. Perhaps just a note in the docs so warn users before they can't figure out why merge() isn't returning, especially when create_many() is so fast?

@lindsay-stevens lindsay-stevens requested review from lognaturel and removed request for lognaturel August 16, 2024 07:11
Copy link
Member

@lognaturel lognaturel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the create_many and merge signatures should be more similar. I've made a few suggestions to that effect and am interested to hear what you think, @lindsay-stevens

How about also adding a brief note about performance in the merge and delete doc comments, maybe also adding one to update? Something to the effect of "this method sends requests one at a time and may be slow for larger numbers of Entities. Consider making parallel calls and describing your use case on the ODK forum."

pyodk/_endpoints/entities.py Outdated Show resolved Hide resolved
pyodk/_endpoints/entities.py Show resolved Hide resolved
pyodk/_endpoints/entities.py Outdated Show resolved Hide resolved
pyodk/_endpoints/entities.py Outdated Show resolved Hide resolved
@lindsay-stevens
Copy link
Contributor Author

How about also adding a brief note about performance in the merge and delete doc comments, maybe also adding one to update? Something to the effect of "this method sends requests one at a time and may be slow for larger numbers of Entities. Consider making parallel calls and describing your use case on the ODK forum."

I added a note to merge in d657b0f, but I think adding this kind of statement to individual endpoint methods might be too much, because it would apply to basically all the pyodk endpoint methods (handling lots of forms, or submissions, etc).

- chg: move insert data reshaping to create_many so that it's
  easier to use and accepts same kind of data as merge.
- add: create_source/source_size parameter for both methods to
  allow specifying this info separately to the data.
- chg: docstring improvements
- from mkdocstrings-python dependency on griffe>=0.37
- griffe public API changed in v1.0.0 released 2 weeks ago.
Copy link
Member

@lognaturel lognaturel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making those updates!

@lindsay-stevens lindsay-stevens merged commit 0430222 into getodk:master Aug 30, 2024
4 checks passed
@lindsay-stevens lindsay-stevens deleted the pyodk-88 branch August 30, 2024 06:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Script to use a CSV to bulk create, update, or delete Entities in an Entity List
3 participants