Move AsyncMinHashLSH from experimental to datasketch.aio#301

Closed
ekzhu wants to merge 9 commits intomasterfrom
move-async-api-to-main

Conversation

@ekzhu
Owner

@ekzhu ekzhu commented Jan 20, 2026

  • Create new datasketch/aio/ submodule with AsyncMinHashLSH, AsyncMinHashLSHInsertionSession, and AsyncMinHashLSHDeleteSession
  • Add optional AsyncMinHashLSH export to main datasketch package
  • Deprecate experimental module with warnings pointing to new location
  • Update tests to use new import path
  • Rename experimental_aio dependency group to aio (keep alias)
  • Remove experimental from coverage omit

New import paths:

  • from datasketch.aio import AsyncMinHashLSH
  • from datasketch import AsyncMinHashLSH

Old imports still work but emit DeprecationWarning.
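A quick way to catch stale old-path imports during a migration like this is to escalate DeprecationWarning to an error in a test run. A stdlib-only sketch; the datasketch import line is an assumption about the deprecated path and is left commented so the snippet runs on its own:

```python
import warnings

# Sketch: make any lingering old-path import fail loudly in a test suite.
# The commented datasketch import is illustrative; everything else is stdlib.
def assert_no_deprecated_imports():
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        # from datasketch.experimental.aio import AsyncMinHashLSH  # would raise here
        return True
```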


@gemini-code-assist

Summary of Changes

Hello @ekzhu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the asynchronous MinHash LSH functionality by promoting it from an experimental module to a dedicated, stable datasketch.aio submodule. This change aims to formalize the async LSH implementation within the library, making it more accessible and indicating its readiness for broader use. It includes updates to import paths, dependency management, and ensures a smooth transition for existing users through deprecation warnings and backward-compatible re-exports.

Highlights

  • Module Relocation: The AsyncMinHashLSH implementation and its related classes (AsyncMinHashLSHInsertionSession, AsyncMinHashLSHDeleteSession) have been moved from the datasketch.experimental.aio module to a new, stable datasketch.aio submodule.
  • Top-Level Export: The AsyncMinHashLSH class is now optionally exported directly from the main datasketch package, allowing for a simpler import path (from datasketch import AsyncMinHashLSH) if the necessary asynchronous dependencies are installed.
  • Deprecation Strategy: The datasketch.experimental and datasketch.experimental.aio modules are now officially deprecated. Importing from these paths will emit DeprecationWarning messages, guiding users to the new datasketch.aio location, while still re-exporting the classes for backward compatibility.
  • Dependency Group Renaming: The experimental_aio dependency group in pyproject.toml has been renamed to aio, with experimental_aio maintained as a deprecated alias for compatibility.
  • Coverage Configuration Update: The experimental directory has been removed from the coverage omit list in pyproject.toml, reflecting its transition out of an experimental status.
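The "optional top-level export" behavior described in the highlights can be sketched with a generic optional-import helper. `optional_import`, `Backend`, and the module name below are illustrative stand-ins, not datasketch code:

```python
import importlib

def optional_import(name):
    """Return the module if importable, else None (optional dependency)."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None

# Stand-in for an async driver such as motor or redis; assumed absent here.
driver = optional_import("nonexistent_driver_xyz")

class Backend:
    """Degrades gracefully: always importable, usable only with the driver."""
    def __init__(self):
        if driver is None:
            raise RuntimeError("optional async dependency missing; install the extra")
```

The class definition itself never fails at import time; the clear error is deferred to first use, which is what lets the package expose the symbol unconditionally.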



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully moves the asynchronous LSH implementation from the experimental module to a new datasketch.aio module, which is a great step towards stabilizing this feature. The changes include creating the new submodule, deprecating the old one with warnings, and updating dependencies and test configurations.

My review focuses on ensuring the correctness and robustness of the newly located async code. I've found a few critical issues in the async storage backends, particularly with Redis, that could lead to data loss or incorrect behavior. I've also identified a high-severity bug related to the MongoDB storage implementation and several medium-severity issues concerning documentation examples, code efficiency, and minor bugs. Addressing these points will significantly improve the quality and reliability of the async functionality.

Review comment threads:

- datasketch/aio/storage.py: 4 threads (1 outdated)
- datasketch/aio/lsh.py: 4 threads (2 outdated)
- datasketch/aio/__init__.py: 1 thread
ekzhu and others added 5 commits January 19, 2026 23:25

* Fix ruff linting errors

- Sort __all__ lists alphabetically (RUF022)
- Add noqa: E402 for intentional imports after deprecation warnings
- Add blank line after docstring Example section (D413)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Address code review feedback from gemini-code-assist

Bug fixes:
- Add missing await on self.execute() in AsyncRedisBuffer
- Fix getmany() to correctly use Redis pipeline (don't await individual commands)
- Add getmany() method to AsyncMongoListStorage for get_subset_counts support
- Fix typo: "AsyncMinHash" -> "AsyncMinHashLSH" in error message
- Remove dead code: await self.keys (storage objects aren't awaitable)

Performance:
- Change tuple buffers to lists in AsyncMongoBuffer for O(1) append

Documentation:
- Fix module docstring example to use prepickle=True for string keys
- Simplify insertion_session and delete_session docstring examples

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add test for get_subset_counts async method

This test exercises the getmany() method that was fixed in a previous
commit, ensuring proper coverage for the async implementation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Run MongoDB and Redis integration tests on PRs

Change workflow triggers from workflow_run (which runs after main Test
completes and doesn't show as PR checks) to push/pull_request triggers.
This makes integration test results visible directly in PR status checks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Run Cassandra integration tests on PRs

Change workflow trigger from workflow_run to push/pull_request triggers
for consistency with MongoDB and Redis test workflows.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@ekzhu
Owner Author

ekzhu commented Jan 20, 2026

@copilot please see if all reviews have been addressed; if not, please address them

Contributor

Copilot AI commented Jan 20, 2026

@ekzhu I've opened a new pull request, #302, to work on those changes. Once the pull request is ready, I'll request review from you.

Move AsyncMinHashLSH from experimental to datasketch.aio (#302)

* Initial plan

* Update documentation and fix RedisStorage import issue

Co-authored-by: ekzhu <320302+ekzhu@users.noreply.github.com>

* Add clarifying comments for optional Redis dependency handling

Co-authored-by: ekzhu <320302+ekzhu@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: ekzhu <320302+ekzhu@users.noreply.github.com>
@ekzhu
Owner Author

ekzhu commented Jan 20, 2026

@gemini-code-assist are all reviews addressed?

@gemini-code-assist

@ekzhu It looks like @Copilot has created a new pull request, #302, to address the changes and will request your review once it's ready. I recommend reviewing that pull request once it's available to confirm that all reviews have been addressed to your satisfaction. I am unable to directly verify the status of reviews in a separate pull request.

@ekzhu
Owner Author

ekzhu commented Jan 20, 2026

@gemini-code-assist

That PR has been merged, please take a look at the latest version of the current PR

ekzhu pushed a commit that referenced this pull request Apr 14, 2026

Address PR #301 review: drop dead code, fix prepickle bug, quiet deprecations

- Delete orphaned datasketch/experimental/aio/storage.py. Nothing imports
  from it (the shims all forward to datasketch.aio.lsh), and the stale
  file still contained the pre-fix bugs that datasketch/aio/storage.py
  silently fixed: un-awaited AsyncRedisBuffer.execute(), tuple-based
  buffer stacks, missing AsyncMongoListStorage.getmany, and a broken
  pipeline getmany impl.

- Drop the dead try/except ImportError fallback around
  `from datasketch.aio import AsyncMinHashLSH` in datasketch/__init__.py.
  datasketch.aio.storage already degrades gracefully when motor/redis are
  absent, so the fallback was unreachable and only served to rebind the
  public symbol to None for type checkers.

- Fix AsyncMinHashLSH.get_subset_counts under prepickle=True: query keys
  now get pickled to match the stored representation, so the lookup
  actually finds the inserted keys. Adds a regression test.

- Refactor the experimental deprecation shims to emit exactly one
  DeprecationWarning per deprecated import path (down from three). Uses
  PEP 562 __getattr__ with globals() caching so the warning fires on
  first attribute access and subsequent accesses skip the hook. Static
  analyzers see the __all__ names via TYPE_CHECKING-guarded imports.

- Minor: add asyncio.run(main()) to the datasketch.aio docstring example
  so it's copy-paste runnable, and add a KEEP IN SYNC comment on the
  duplicated experimental_aio extras group in pyproject.toml.
Co-authored-by: Claude <noreply@anthropic.com>
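The one-warning-per-path behavior can be sketched with PEP 562's module-level `__getattr__` plus a globals() cache. The module and attribute names here are illustrative stand-ins, not datasketch's actual shim:

```python
import types
import warnings

# Throwaway module object standing in for the deprecated shim module.
shim = types.ModuleType("experimental_shim")
_REAL = object()  # stand-in for the relocated AsyncMinHashLSH class

def _module_getattr(name):
    if name == "AsyncMinHashLSH":
        warnings.warn(
            "experimental_shim is deprecated; import from the new location",
            DeprecationWarning,
            stacklevel=2,
        )
        # Cache into the module globals so later accesses resolve normally
        # and skip this hook entirely: exactly one warning per process.
        shim.__dict__[name] = _REAL
        return _REAL
    raise AttributeError(name)

# PEP 562: module attribute lookup falls back to __dict__["__getattr__"].
shim.__getattr__ = _module_getattr
```

First access to `shim.AsyncMinHashLSH` warns and caches; every later access hits the cached global and emits nothing.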
@ekzhu ekzhu closed this Apr 14, 2026
@ekzhu ekzhu deleted the move-async-api-to-main branch April 14, 2026 07:10
ekzhu added a commit that referenced this pull request Apr 17, 2026
* Move AsyncMinHashLSH from experimental to datasketch.aio

- Create new `datasketch/aio/` submodule with AsyncMinHashLSH,
  AsyncMinHashLSHInsertionSession, and AsyncMinHashLSHDeleteSession
- Add optional AsyncMinHashLSH export to main datasketch package
- Deprecate experimental module with warnings pointing to new location
- Update tests to use new import path
- Rename `experimental_aio` dependency group to `aio` (keep alias)
- Remove experimental from coverage omit

New import paths:
- `from datasketch.aio import AsyncMinHashLSH`
- `from datasketch import AsyncMinHashLSH`

Old imports still work but emit DeprecationWarning.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix ruff linting errors

- Sort __all__ lists alphabetically (RUF022)
- Add noqa: E402 for intentional imports after deprecation warnings
- Add blank line after docstring Example section (D413)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Address code review feedback from gemini-code-assist

Bug fixes:
- Add missing await on self.execute() in AsyncRedisBuffer
- Fix getmany() to correctly use Redis pipeline (don't await individual commands)
- Add getmany() method to AsyncMongoListStorage for get_subset_counts support
- Fix typo: "AsyncMinHash" -> "AsyncMinHashLSH" in error message
- Remove dead code: await self.keys (storage objects aren't awaitable)

Performance:
- Change tuple buffers to lists in AsyncMongoBuffer for O(1) append

Documentation:
- Fix module docstring example to use prepickle=True for string keys
- Simplify insertion_session and delete_session docstring examples

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
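The "missing await" class of bug fixed in this commit can be reproduced with a toy buffer; `Buffer` and its methods are illustrative, not the real AsyncRedisBuffer:

```python
import asyncio

class Buffer:
    """Toy async buffer: execute() flushes pending items."""
    def __init__(self):
        self.items = []
        self.flushed = False

    async def execute(self):
        self.items.clear()
        self.flushed = True

    async def insert_buggy(self, x):
        self.items.append(x)
        coro = self.execute()   # bug: coroutine created but never awaited
        coro.close()            # closed here only to silence the RuntimeWarning

    async def insert_fixed(self, x):
        self.items.append(x)
        await self.execute()    # fix: the flush actually runs
```

With the bug, the buffer silently never flushes; CPython only hints at it via a "coroutine was never awaited" warning at garbage collection.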

* Add test for get_subset_counts async method

This test exercises the getmany() method that was fixed in a previous
commit, ensuring proper coverage for the async implementation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Run MongoDB and Redis integration tests on PRs

Change workflow triggers from workflow_run (which runs after main Test
completes and doesn't show as PR checks) to push/pull_request triggers.
This makes integration test results visible directly in PR status checks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Run Cassandra integration tests on PRs

Change workflow trigger from workflow_run to push/pull_request triggers
for consistency with MongoDB and Redis test workflows.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Move AsyncMinHashLSH from experimental to datasketch.aio (#302)

* Initial plan

* Update documentation and fix RedisStorage import issue

Co-authored-by: ekzhu <320302+ekzhu@users.noreply.github.com>

* Add clarifying comments for optional Redis dependency handling

Co-authored-by: ekzhu <320302+ekzhu@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: ekzhu <320302+ekzhu@users.noreply.github.com>

* Address PR #301 review: drop dead code, fix prepickle bug, quiet deprecations (#313)

- Delete orphaned datasketch/experimental/aio/storage.py. Nothing imports
  from it (the shims all forward to datasketch.aio.lsh), and the stale
  file still contained the pre-fix bugs that datasketch/aio/storage.py
  silently fixed: un-awaited AsyncRedisBuffer.execute(), tuple-based
  buffer stacks, missing AsyncMongoListStorage.getmany, and a broken
  pipeline getmany impl.

- Drop the dead try/except ImportError fallback around
  `from datasketch.aio import AsyncMinHashLSH` in datasketch/__init__.py.
  datasketch.aio.storage already degrades gracefully when motor/redis are
  absent, so the fallback was unreachable and only served to rebind the
  public symbol to None for type checkers.

- Fix AsyncMinHashLSH.get_subset_counts under prepickle=True: query keys
  now get pickled to match the stored representation, so the lookup
  actually finds the inserted keys. Adds a regression test.

- Refactor the experimental deprecation shims to emit exactly one
  DeprecationWarning per deprecated import path (down from three). Uses
  PEP 562 __getattr__ with globals() caching so the warning fires on
  first attribute access and subsequent accesses skip the hook. Static
  analyzers see the __all__ names via TYPE_CHECKING-guarded imports.

- Minor: add asyncio.run(main()) to the datasketch.aio docstring example
  so it's copy-paste runnable, and add a KEEP IN SYNC comment on the
  duplicated experimental_aio extras group in pyproject.toml.

Co-authored-by: Claude <noreply@anthropic.com>

* Parallelize AsyncMongoListStorage.getmany with asyncio.gather

Addresses the gemini-code-assist review comment on PR #314: the previous
sequential loop issued N round trips in series. asyncio.gather runs the
underlying find() queries concurrently.

https://claude.ai/code
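The sequential-versus-concurrent difference can be sketched as follows; `fetch` stands in for the per-key `find()` round trip and is not datasketch code:

```python
import asyncio

async def fetch(key):
    await asyncio.sleep(0)          # placeholder for one network round trip
    return [key]                    # placeholder result list for that key

async def getmany_sequential(keys):
    # N awaits in series: total latency is roughly N round trips.
    return [await fetch(k) for k in keys]

async def getmany_concurrent(keys):
    # asyncio.gather schedules all fetches at once: latency ~ 1 round trip.
    return await asyncio.gather(*(fetch(k) for k in keys))
```

`asyncio.gather` preserves argument order, so both versions return identical results; only the latency profile changes.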

* Fix Redis set-backed getmany and wire CI to the new aio extra

Hardcoding `LRANGE` in `AsyncRedisListStorage.getmany` broke the
inherited method on `AsyncRedisSetStorage`: the subclass overrides
`_get_items` to use `SMEMBERS`, but the pipeline in `getmany` bypassed
that override and would fail with `WRONGTYPE` against set keys.

Route `getmany` through `_get_items` so the subclass override applies,
following the existing `itemcounts`/`_get_len` pattern in the same file.

Add a regression test that calls `getmany` on an unordered (set-backed)
hashtable storage, which exercises the previously-broken path under
Redis.

Update the three integration-test workflows (mongo, redis, cassandra)
to install the new `aio` extra instead of the deprecated
`experimental_aio` alias, aligning CI install coverage with the new
public API surface.

https://claude.ai/code

* Apply prepickle consistently in async has_key and remove

`AsyncMinHashLSH.has_key()` forwarded the caller's raw key to storage,
and `_remove()` reused the raw key for the existence check, hashtable
lookup, and the delete itself. Under `prepickle=True` the storage holds
pickled bytes, so:

  await lsh.insert("a", m)
  await lsh.has_key("a")   # returned False
  await lsh.remove("a")    # raised ValueError

Mirror the sync `MinHashLSH._remove` pattern: pickle the key inside
`_remove` first and call the storage primitives (`self.keys.has_key`,
`self.keys.get`, `self.keys.remove`, `hashtable.remove_val`) directly so
they all see the stored representation.

`has_key()` now pickles internally for the public API. `_insert()` was
calling `has_key()` after it had already pickled the key, which
accidentally worked but would have started double-pickling once
`has_key` learned to pickle on its own; switch `_insert`'s duplicate
check to `self.keys.has_key()` directly to keep that path single-pickle.

Add regression tests under `TestAsyncMinHashLSHWithPrepickle` covering
`has_key`, `remove`, and `delete_session.remove` with string keys.

https://claude.ai/code
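The invariant this commit restores can be sketched with a minimal stand-in (`TinyLSH` is illustrative, not the real class): every public method pickles the caller's key before touching storage, so storage only ever sees the stored representation:

```python
import pickle

class TinyLSH:
    """Minimal sketch of the prepickle invariant; not datasketch's API."""
    def __init__(self, prepickle=True):
        self.prepickle = prepickle
        self._keys = set()                     # stands in for the keys storage

    def _k(self, key):
        return pickle.dumps(key) if self.prepickle else key

    def insert(self, key):
        self._keys.add(self._k(key))

    def has_key(self, key):
        return self._k(key) in self._keys      # pickle before the lookup

    def remove(self, key):
        k = self._k(key)                       # pickle once, reuse everywhere
        if k not in self._keys:
            raise ValueError("key does not exist")
        self._keys.discard(k)
```

Forwarding the raw key instead (the pre-fix behavior) makes `has_key("a")` return False and `remove("a")` raise, because the set holds `pickle.dumps("a")`, not `"a"`.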

* Close behavioral-parity gaps in async LSH test coverage

Add four tests covering behaviors that are part of the public async
contract but were only tested implicitly (or not at all):

- test_insert_duplicate_raises: re-inserting an existing bytes key
  must raise ValueError
- test_insert_check_duplication_false: check_duplication=False bypass
  must allow re-insertion without error
- test_is_empty: is_empty() must flip False after insert and back to
  True after removing all keys
- test_insert_duplicate_prepickle: duplicate detection must also work
  for pickled string keys and the bypass must still apply

Design gaps (no async `merge` or `add_to_query_buffer`) are tracked
separately - they are missing APIs rather than missing tests and are
out of scope for this PR.

https://claude.ai/code

* Fix Redis getmany: use direct pipeline calls instead of _get_items

Pipeline command methods in redis.asyncio are synchronous (they queue
and return self). The previous approach routed through async _get_items
which awaited the pipeline return value — while Pipeline.__await__
makes this technically valid, it adds unnecessary async machinery to
what should be a synchronous queue operation.

Revert AsyncRedisListStorage.getmany to the original direct
pipe.lrange() pattern, and add an explicit getmany override in
AsyncRedisSetStorage that uses pipe.smembers(). This is the
"override in subclass" approach from the reviewer's suggestion, and
keeps pipeline command queuing purely synchronous.

https://claude.ai/code
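The subclass-override shape can be sketched with a fake pipeline that mimics redis.asyncio's behavior: command methods queue synchronously and return the pipeline, and only `execute()` is awaited. All classes below are illustrative stand-ins:

```python
import asyncio

class FakePipeline:
    """Mimics redis.asyncio Pipeline: commands queue synchronously."""
    def __init__(self, store):
        self._store = store
        self._keys = []

    def lrange(self, key, start, end):  # synchronous: queues and returns self
        self._keys.append(key)
        return self

    def smembers(self, key):            # synchronous: queues and returns self
        self._keys.append(key)
        return self

    async def execute(self):            # the only awaited step
        return [sorted(self._store.get(k, [])) for k in self._keys]

class ListStorage:
    def __init__(self, store):
        self._store = store

    async def getmany(self, *keys):
        pipe = FakePipeline(self._store)
        for k in keys:
            pipe.lrange(k, 0, -1)       # list-backed command
        return await pipe.execute()

class SetStorage(ListStorage):
    async def getmany(self, *keys):     # explicit override, as in the fix
        pipe = FakePipeline(self._store)
        for k in keys:
            pipe.smembers(k)            # set-backed command
        return await pipe.execute()
```

Each concrete class queues its own command type, so nothing in the base class hardcodes `lrange` for the set-backed subclass.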

* Fix _query_b to unpickle keys under prepickle=True

The sync MinHashLSH._query_b unpickles candidates before returning
them when prepickle is enabled. The async version was missing this
step, returning raw pickled bytes instead of original keys.

https://claude.ai/code
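A minimal sketch of the missing step (the function name is illustrative, not datasketch's):

```python
import pickle

def finalize_query(candidates, prepickle=True):
    """Unpickle stored candidate keys back to the caller's original keys."""
    if prepickle:
        return {pickle.loads(c) for c in candidates}
    return set(candidates)
```

Without this step the async query returned raw pickled bytes, which never compare equal to the string keys the caller inserted.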

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>