Refactoring & Optional data concurrency #35

Open
chris-ha458 opened this issue Aug 28, 2023 · 4 comments

Comments

@chris-ha458
Contributor

chris-ha458 commented Aug 28, 2023

I was thinking about incorporating rayon in certain sections of the code behind a feature flag.
For instance, pub fn most_common_tiebreaker<F> could be changed as follows:

// Note: rayon's par_sort_unstable_by needs a comparator that is Fn + Sync, so the
// tiebreaker is bounded by Fn here instead of FnMut (T and N must also be Send).
pub fn most_common_tiebreaker<F>(&self, tiebreaker: F) -> Vec<(T, N)>
where
    F: Fn(&T, &T) -> ::std::cmp::Ordering + Send + Sync,
{
    let mut items = self
        .map
        .iter() // I'm not sure whether changing this to par_iter() would be useful or a needless complication
        .map(|(key, count)| (key.clone(), count.clone()))
        .collect::<Vec<_>>();
    #[cfg(feature = "use_rayon")]
    {
        use rayon::prelude::*;
        items.par_sort_unstable_by(|(a_item, a_count), (b_item, b_count)| {
            b_count
                .cmp(a_count)
                .then_with(|| tiebreaker(a_item, b_item))
        });
    }
    #[cfg(not(feature = "use_rayon"))]
    {
        items.sort_unstable_by(|(a_item, a_count), (b_item, b_count)| {
            b_count
                .cmp(a_count)
                .then_with(|| tiebreaker(a_item, b_item))
        });
    }
    items
}

The matching optional dependency and feature would be declared in Cargo.toml as well.
Overall, this would be integrated so that the default behavior is preserved unless the feature flag is set (the flag would likely be named rayon).
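
To make the feature gate concrete, here is a minimal, self-contained sketch of the same idea. It is not the crate's actual code: the free function over a plain HashMap with usize counts is a hypothetical stand-in for the Counter method, and the use_rayon feature name is simply the one used in the snippet above. The comparator is written once and shared by both branches, which is only possible because it is bounded by Fn + Sync.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

/// Hypothetical free-function stand-in for `Counter::most_common_tiebreaker`.
/// `usize` counts are an assumption made to keep the sketch short.
pub fn most_common_tiebreaker<T, F>(map: &HashMap<T, usize>, tiebreaker: F) -> Vec<(T, usize)>
where
    T: Clone + Send,
    F: Fn(&T, &T) -> Ordering + Sync,
{
    let mut items: Vec<(T, usize)> = map.iter().map(|(k, c)| (k.clone(), *c)).collect();

    // Shared comparator: highest count first, ties resolved by the caller.
    let compare =
        |a: &(T, usize), b: &(T, usize)| b.1.cmp(&a.1).then_with(|| tiebreaker(&a.0, &b.0));

    // Only compiled when the (assumed) `use_rayon` feature is enabled.
    #[cfg(feature = "use_rayon")]
    {
        use rayon::prelude::*;
        items.par_sort_unstable_by(compare);
    }
    // Default build: plain sequential sort, no rayon dependency needed.
    #[cfg(not(feature = "use_rayon"))]
    {
        items.sort_unstable_by(compare);
    }
    items
}
```

Without the feature enabled, the rayon branch is stripped before compilation, so the default build behaves exactly like a sequential sort.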

If you are open to this, I would begin with a bit of refactoring.
Most, if not all, of the code lives in lib.rs, which is now pushing 2k LOC.
I think the unit tests can be extracted into a separate file, /src/lib/tests.rs.
This is in contrast to /tests/integration_tests.rs, which would hold the integration tests.
Of course, if any specific tests turn out to be better suited as integration tests, I would extract and organise them as such.
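
As a rough sketch of what that split looks like on the lib.rs side: the function below is a hypothetical placeholder, and the test module is written inline only so the example stands on its own. One detail worth keeping in mind is that a bare `mod tests;` declared in src/lib.rs is looked up at src/tests.rs or src/tests/mod.rs by default, so a file at src/lib/tests.rs would need a `#[path = "lib/tests.rs"]` attribute.

```rust
// src/lib.rs (sketch): library code stays here.
// `total_count` is a hypothetical placeholder, not part of the real crate.
pub fn total_count(counts: &[usize]) -> usize {
    counts.iter().sum()
}

// Unit tests live in a child module that is compiled only for `cargo test`.
// Replacing the body with `mod tests;` would load the same module from
// `src/tests.rs` (or `src/tests/mod.rs`) instead of keeping it inline.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn sums_counts() {
        assert_eq!(total_count(&[1, 2, 3]), 6);
    }
}
```

Because the module is a child of the crate root, `use super::*;` still gives the tests access to private items, which is the main practical difference from tests under /tests.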

How do you feel about each?

  • Refactoring the code, especially the tests.
  • Adding rayon (behind a feature flag to preserve default behavior).
@coriolinus
Owner

Refactoring the code is fine, to an extent. This is very much a style thing, and I don't want to pre-approve anything that I'll regret later on. That said, I agree that 2k lines is too big for a well-factored source file.

I'm willing to look at a PR adding Rayon, but that PR should include some benchmarks showing at what data magnitude the feature is justified, and documentation exposing that information to end users. Rayon is sometimes a game-changer, but you can't just naively throw it at problems and expect to see an improvement.
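
To illustrate the kind of measurement meant here, the following is a rough, std-only sketch that times the sequential sort at several input sizes. Everything in it is an assumption for illustration: the helper names, the (u64, u64) item type, and the LCG used to fake counts. A proper benchmark harness would give statistically sounder numbers; this only shows the shape of a size sweep.

```rust
use std::time::{Duration, Instant};

/// Build `n` hypothetical (item, count) pairs. Counts come from a simple
/// LCG so that the sketch needs no external crates.
pub fn make_items(n: usize) -> Vec<(u64, u64)> {
    let mut state: u64 = 0x2545_F491_4F6C_DD1D;
    (0..n as u64)
        .map(|item| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            (item, state >> 40)
        })
        .collect()
}

/// Time a single sort of `n` items: highest count first, ties broken by item.
pub fn time_sort(n: usize) -> Duration {
    let mut items = make_items(n);
    let start = Instant::now();
    items.sort_unstable_by(|(a_item, a_count), (b_item, b_count)| {
        b_count.cmp(a_count).then_with(|| a_item.cmp(b_item))
    });
    let elapsed = start.elapsed();
    // Keep the optimizer from discarding the sorted result.
    std::hint::black_box(&items);
    elapsed
}

/// Run `time_sort` for each size so sequential and parallel builds can be compared.
pub fn sweep(sizes: &[usize]) -> Vec<(usize, Duration)> {
    sizes.iter().map(|&n| (n, time_sort(n))).collect()
}
```

Running the same sweep with and without the parallel sort, over sizes such as 1_000 up to 10_000_000, would show roughly where the crossover point lies on a given machine.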

FWIW: In the context of

    let mut items = self
        .map
        .iter()
        .map(|(key, count)| (key.clone(), count.clone()))
        .collect::<Vec<_>>();

I would be extremely surprised to discover that par_iter performed better under any circumstance. That said, I'm willing to be convinced by a well-crafted benchmark.

Either way, a refactor is a wholly separate concern from adding feature-gated Rayon support, so these should be two distinct PRs.

@chris-ha458
Contributor Author

I agree with your points.
Here is my plan:

  1. Prepare a separate PR for refactoring. A lot of it will be stylistic choices, and finding an acceptable solution will help me understand what you envision for the repo.
  2. When 1. is done, add benches against the current version of the code. This will likely be a separate PR, since there are some opinionated choices to be made, such as nightly #[bench] vs criterion. The data I am planning to handle is at million/billion scale (LLM dataset deduplication/counting), so I'll likely add benches that go that high.
  3. Prepare a rayon PR that includes rayon integration and documentation, with further benches added as necessary.

@chris-ha458
Contributor Author

Step 1 is started in #36.

@coriolinus
Owner

Sounds good. With regard to benches, I'd recommend criterion instead of nightly benches. I want to expose the entire surface of the library, including benchmarking, to end-users without requiring a nightly toolchain.


2 participants