Refactoring & Optional data concurrency #35

chris-ha458 · 2023-08-28T11:56:56Z

I was thinking about incorporating rayon in certain sections of the code behind a feature flag.
For instance, pub fn most_common_tiebreaker<F> could be changed as follows:

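// rayon's par_sort_unstable_by needs a comparator that is Fn + Sync, so the
// tiebreaker cannot be FnMut when the parallel branch is compiled in
// (the sorted elements, i.e. T and N, must also be Send there).
pub fn most_common_tiebreaker<F>(&self, tiebreaker: F) -> Vec<(T, N)>
where
    F: Fn(&T, &T) -> ::std::cmp::Ordering + Sync,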
{
    let mut items = self
        .map
        .iter() // I'm not sure whether changing this to par_iter() would be useful or a needless complication
        .map(|(key, count)| (key.clone(), count.clone()))
        .collect::<Vec<_>>();
    #[cfg(feature = "use_rayon")]
    {
        use rayon::prelude::*;
        items.par_sort_unstable_by(|(a_item, a_count), (b_item, b_count)| {
            b_count
                .cmp(a_count)
                .then_with(|| tiebreaker(a_item, b_item))
        });
    }
    #[cfg(not(feature = "use_rayon"))]
    {
        items.sort_unstable_by(|(a_item, a_count), (b_item, b_count)| {
            b_count
                .cmp(a_count)
                .then_with(|| tiebreaker(a_item, b_item))
        });
    }
    items
}

The corresponding optional dependency and feature flag would be declared in Cargo.toml as well.
Overall, this would be integrated so that default behavior is preserved unless the feature flag is set (the flag would likely be named rayon).
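
A minimal, self-contained sketch of the feature-gated sort described above. It is not the crate's actual code: the free function, the fixed usize count type, and the Cargo.toml lines in the leading comment are assumptions for illustration. It assumes the comparator is Fn + Sync, which rayon's par_sort_unstable_by requires, and it compiles without rayon as long as the use_rayon feature is off.

```rust
// Assumed Cargo.toml entries (illustrative, not taken from the crate):
//
//   [dependencies]
//   rayon = { version = "1", optional = true }
//
//   [features]
//   use_rayon = ["dep:rayon"]

use std::cmp::Ordering;
use std::collections::HashMap;
use std::hash::Hash;

/// Hypothetical free-function version of the proposed method.
/// Returns (item, count) pairs sorted by count descending, with ties
/// resolved by `tiebreaker`.
pub fn most_common_tiebreaker<T, F>(map: &HashMap<T, usize>, tiebreaker: F) -> Vec<(T, usize)>
where
    T: Hash + Eq + Clone + Send,
    F: Fn(&T, &T) -> Ordering + Sync,
{
    let mut items: Vec<(T, usize)> = map.iter().map(|(k, c)| (k.clone(), *c)).collect();

    // One comparator shared by both branches, so they cannot drift apart.
    let compare =
        |a: &(T, usize), b: &(T, usize)| b.1.cmp(&a.1).then_with(|| tiebreaker(&a.0, &b.0));

    // Only one of the two blocks below survives cfg expansion.
    #[cfg(feature = "use_rayon")]
    {
        use rayon::prelude::*;
        items.par_sort_unstable_by(compare);
    }
    #[cfg(not(feature = "use_rayon"))]
    {
        items.sort_unstable_by(compare);
    }
    items
}
```

Defining the comparator once keeps the parallel and sequential paths identical apart from the sort call itself, which is the only place where the feature flag changes behavior.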

If you are open to this, I would begin with a bit of refactoring.
Most if not all of the code is situated inside lib.rs, which is now pushing 2k LOC.
I think the unit tests can be extracted into /src/lib/tests.rs as a separate file.
This is in contrast to /tests/integration_tests.rs, which would be for integration tests.
Of course, if any specific test turns out to be better suited to integration testing, I would extract it and organise it as such.
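
A rough sketch of what the unit-test extraction could look like, using a hypothetical tally function as a stand-in for the library code. One detail worth noting: by default, a mod tests; declaration in src/lib.rs resolves to src/tests.rs or src/tests/mod.rs, so any other location would need a #[path = "..."] attribute.

```rust
// src/lib.rs (sketch)
use std::collections::HashMap;

/// Hypothetical stand-in for library code: counts occurrences of each char.
pub fn tally(input: &str) -> HashMap<char, usize> {
    let mut counts = HashMap::new();
    for ch in input.chars() {
        *counts.entry(ch).or_insert(0) += 1;
    }
    counts
}

// After the extraction, this inline module would shrink to
//
//     #[cfg(test)]
//     mod tests;
//
// and the body below would move, unchanged, into its own file.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn counts_repeated_chars() {
        let counts = tally("abca");
        assert_eq!(counts[&'a'], 2);
        assert_eq!(counts[&'b'], 1);
    }
}
```

Because the module stays a child of the crate root, use super::* keeps working and the tests retain access to private items, which integration tests under /tests do not have.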

How do you feel about each of these?

1. Refactoring the code, especially the tests.
2. Adding rayon (behind a feature flag to preserve default behavior).


coriolinus · 2023-08-28T15:23:36Z

Refactoring the code is fine, to an extent. This is very much a style thing, and I don't want to pre-approve anything that I'll regret later on. That said, I agree that 2k lines is too big for a well-factored source file.

I'm willing to look at a PR adding Rayon, but that PR should include some benchmarks showing at what data magnitude the feature is justified, and documentation exposing that information to end users. Rayon is sometimes a game-changer, but you can't just naively throw it at problems and expect to see an improvement.
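
As a rough illustration of the kind of size sweep meant here, the following std-only sketch times the sequential sort at several input sizes. It is a stand-in, not a proper benchmark harness: the synthetic data generator, the function names, and the chosen sizes are all assumptions, and a single wall-clock measurement per size is far noisier than what a dedicated benchmarking tool reports.

```rust
use std::time::{Duration, Instant};

/// Deterministic pseudo-random (key, count) pairs from a simple LCG,
/// so repeated runs sort the same data. Keys are unique.
fn synthetic_counts(n: usize) -> Vec<(u64, usize)> {
    let mut state: u64 = 0x2545_F491_4F6C_DD1D;
    (0..n as u64)
        .map(|key| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            (key, (state >> 33) as usize % 1000)
        })
        .collect()
}

/// Sorts by count descending, then key ascending, and returns the elapsed time.
fn time_sort(items: &mut [(u64, usize)]) -> Duration {
    let start = Instant::now();
    items.sort_unstable_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    start.elapsed()
}

/// Prints one timing per input size; the sizes are arbitrary examples.
fn run_sweep() {
    for &n in &[1_000usize, 10_000, 100_000, 1_000_000] {
        let mut items = synthetic_counts(n);
        let elapsed = time_sort(&mut items);
        println!("n = {:>9}: {:?}", n, elapsed);
    }
}
```

Running the same sweep against a parallel sort would show the input size, if any, at which the parallel version starts to win; below that point the thread-pool overhead is likely to dominate.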

FWIW: In the context of

    let mut items = self
        .map
        .iter()
        .map(|(key, count)| (key.clone(), count.clone()))
        .collect::<Vec<_>>();

I would be extremely surprised to discover that par_iter performed better under any circumstance. That said, I'm willing to be convinced by a well-crafted benchmark.

Either way, a refactor is a wholly separate concern from adding feature-gated Rayon support, so these should be two distinct PRs.

chris-ha458 · 2023-08-28T23:54:24Z

I agree with your points.
Here is my plan:

1. Prepare a separate PR for refactoring. A lot of it will be stylistic choices, and finding an acceptable solution will help me understand what you envision for the repo.
2. When 1. is done, add benches against the current version of the code. This will likely be a separate PR, since there are some opinionated choices to be made, such as nightly #[bench] vs. criterion. The data I plan to handle is at million/billion scale (LLM dataset deduplication/counting), so I'll likely add benches that go that high.
3. Prepare a rayon PR that includes rayon integration and documentation, with further benches added as they are identified as necessary.

chris-ha458 · 2023-08-29T00:03:05Z

1. is started in #36

coriolinus · 2023-08-29T07:54:44Z

Sounds good. With regard to benches, I'd recommend criterion instead of nightly benches. I want to expose the entire surface of the library, including benchmarking, to end-users without requiring a nightly toolchain.
