Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Datum for string kernels #4632

Closed
wants to merge 2 commits into from
Closed

Conversation

alexandreyc
Copy link
Contributor

Which issue does this PR close?

Closes #4595.

What changes are included in this PR?

For the moment I only implemented the change for LIKE and I would like your feedbacks before implementing other kernels.

Are there any user-facing changes?

There will be some deprecated functions once it's finished.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 2, 2023
@tustvold
Copy link
Contributor

tustvold commented Aug 2, 2023

Really cool to see this, and thank you for breaking this up. I'm away for the next week but will try to get some time to sit down with this, but otherwise will get back to you on my return.

I'm keen to see if there is some way to use type-erasure, i.e. &dyn Fn(&str, &str) or similar to both cut down on codegen and simplify the code easier to read. This does assume the dyn dispatch is negligible compared to the cost of the string operation, but I suspect this may be the case

@tustvold
Copy link
Contributor

tustvold commented Aug 2, 2023

I did a quick and dirty experiment, and it would appear my hunch is broadly correct

fn do_bench() {
   let cmp: &dyn Fn(&str, &str) -> bool = black_box(&|l, r| l == r);
    c.bench_function("eq_simple_dyn", |b| {
        b.iter(|| {
            let l = black_box(&l);
            let r = black_box(&r);
            l.iter()
                .zip(r)
                .map(|(l, r)| match (l, r) {
                    (Some(l), Some(r)) => Some(cmp(l, r)),
                    _ => None,
                })
                .collect::<Vec<_>>()
        })
    });

    let l = create_string_array::<i32>(1024, 0.);
    let needle = l.value(0);
    c.bench_function("eq_simple_scalar", |b| {
        b.iter(|| {
            let l = black_box(&l);
            let needle = black_box(needle);
            l.iter().map(|v| v.map(|x| x == needle)).collect::<Vec<_>>()
        })
    });

    let cmp = black_box(make_eq_scalar(needle));
    c.bench_function("eq_simple_dyn_scalar", |b| {
        b.iter(|| {
            let l = black_box(&l);
            l.iter().map(|v| v.map(|x| cmp(x))).collect::<Vec<_>>()
        })
    });

    dyn_cmp_dict_benchmarks(c);
}

fn make_eq_scalar(needle: &str) -> Box<dyn Fn(&str) -> bool + '_> {
    Box::new(move |a| a == needle)
}

And got

eq_simple               time:   [5.4250 µs 5.4278 µs 5.4308 µs]

eq_simple_dyn           time:   [6.5201 µs 6.5227 µs 6.5253 µs]

eq_simple_scalar        time:   [4.4315 µs 4.4334 µs 4.4355 µs]

eq_simple_dyn_scalar    time:   [4.9554 µs 4.9575 µs 4.9600 µs]

Equality is the cheapest of the string operations and yet the dyn dispatch is barely registering. I suspect if you tweaked it to have a signature like Fn([(&str, &str); 64]) -> u64 the overheads would disappear entirely. Something worth thinking about possibly...

@tustvold
Copy link
Contributor

Sorry for taking so long to get back to you, I think I would like this to follow a similar pattern as #4701 and #4465.

I appreciate this is a much more intrusive change, and is quite fiddly, and so no hard feelings if you would like to back out of this. I already have a partially implemented version of this

@alexandreyc
Copy link
Contributor Author

I was just reading #4701 :)

Yeah I think that it will take me a while to figure this out completely. It might be more productive that you go further with it. Anyway, it allowed me to learn more about the code base. Feel free to close this PR!

@tustvold
Copy link
Contributor

Closing in favor of #4732, thank you for your work on this 👍

@tustvold tustvold closed this Aug 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Datum Based String Kernels
2 participants