Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add flint_mpn_mul et al.; use in fmpz_mul, fmpz_addmul and other places #1141

Merged
merged 10 commits into from Jun 2, 2022

Conversation

fredrik-johansson
Copy link
Collaborator

This needs more profiling.

fmpz_poly_mul_KS had a cutoff at 1000 limbs for using the FLINT FFT which looks too low. I can't get a consistent speedup over mpn_mul below 32000 limbs on my machine. Perhaps because the FFT is missing assembly optimizations without MPIR? Anyway, this puts the FFT threshold in one obvious place (mpn_extras.h), where we can easily replace the SS-FFT with Dan's FFT when that's available.

I haven't profiled fmpz_mul, fmpz_addmul and fmpz_submul carefully at small sizes. There could be some GMP tricks that I'm missing.

@fredrik-johansson
Copy link
Collaborator Author

I'm not sure why there is a test failure for fmpz_mod_mul with MinGW GCC.

However, running valgrind gives

==381268== Invalid write of size 8
==381268==    at 0x492F8CB: _fmpz_sub_mpn_1 (in /home/fredrik/src/flint2/libflint.so.17.0.0)
==381268==    by 0x1095E0: main (in /home/fredrik/src/build/fmpz_mod/test/t-mul)
==381268==  Address 0x57ce608 is 0 bytes after a block of size 8 alloc'd
==381268==    at 0x484DCD3: realloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==381268==    by 0x50B39E8: __gmp_default_reallocate (in /usr/local/lib/libgmp.so.10.4.1)
==381268==    by 0x50CA151: __gmpz_realloc (in /usr/local/lib/libgmp.so.10.4.1)
==381268==    by 0x493FC6A: fmpz_mul_2exp (in /home/fredrik/src/flint2/libflint.so.17.0.0)
==381268==    by 0x10974C: main (in /home/fredrik/src/flint2/build/fmpz_mod/test/t-mul)

even without this patch, so there may be a preexisting bug in fmpz_mul_mod or elsewhere that we have to fix.

@fredrik-johansson
Copy link
Collaborator Author

I believe fmpz_add_ui/fmpz_sub_ui are buggy: they can write two limbs even when only one is allocated. This will need fixing in 2.9...

@wbhart
Copy link
Collaborator

wbhart commented May 26, 2022

It's sad that we are so much behind GMP up to 32000 limbs. I guess the MPIR stuff was the reason. I could believe the crossover moved up to 2000 limbs otherwise, but 32000 is way too high.

And yes, the buggy add/sub_ui will have to be fixed for 2.9-beta3.

@wbhart
Copy link
Collaborator

wbhart commented May 26, 2022

Perhaps @albinahlback could write some test code for it that tests every single case that the new code creates so the bug is caught by test code. I think it is very important to get those functions 100% correct.

@fredrik-johansson
Copy link
Collaborator Author

It's sad that we are so much behind GMP up to 32000 limbs. I guess the MPIR stuff was the reason. I could believe the crossover moved up to 2000 limbs otherwise, but 32000 is way too high.

Presumably mostly due to the missing mpn_sumdiff_n, doubling memory accesses in the butterflies? BTW, I looked at this function in MPIR and the generic version seems to be defined differently from the one in mpn_extras.h (performing an extra negation). I don't understand how the FFT can be correct with both...

@fredrik-johansson
Copy link
Collaborator Author

I'm inclined just to re-rewrite add/sub_ui. Without gotos.

@albinahlback
Copy link
Collaborator

1882/2233 Test #1885: fmpz_mod-test-t-mul ...................................................***Failed    0.01 sec
mul....FAIL
check aliasing secondi = 0, j = 0, k = 1

@fredrik-johansson
Copy link
Collaborator Author

Trying with rewritten fmpz_addmul_ui/fmpz_submul_ui.

@albinahlback
Copy link
Collaborator

As MinGW is the OS where long int is 32 bit, as opposed to 64 bit, I presume there can be some issue there.

@albinahlback
Copy link
Collaborator

I wouldn't merge this until any profiling has been made. I don't see any use of mpn_mul_basecase, which leaves me to believe that this is going to be slower for smaller inputs.

@fredrik-johansson
Copy link
Collaborator Author

mpn_mul and mpn_mul_n call mpn_mul_basecase for small input. There is no need for us to do so.

fmpz/mul.c Outdated Show resolved Hide resolved
@fredrik-johansson
Copy link
Collaborator Author

Perhaps @albinahlback could write some test code for it that tests every single case that the new code creates so the bug is caught by test code. I think it is very important to get those functions 100% correct.

A systematic reason why we often don't catch bugs like this is that most test code doesn't randtest output variables before use, so we often just test the case where the output variable is 0. Setting output variables to random values is something I started doing in Arb a while ago and definitely helps.

Plus wanting more FLINT_ASSERTs of course.

I should also get into the habit of valgrinding the entire test suite more frequently.

@albinahlback
Copy link
Collaborator

We should probably set assert on CI.

@fredrik-johansson
Copy link
Collaborator Author

Should be correct now, but thorough profiling remains.

Here is for the result for fft/profile/p-mul_fft_main:

flint_speedup

Maybe fine-tuning the FFT on this architecture could sort out some of the kinks...

@albinahlback
Copy link
Collaborator

What is the speed with 10 and less limbs?

@fredrik-johansson
Copy link
Collaborator Author

I will try to profile small operands this afternoon.

@fredrik-johansson
Copy link
Collaborator Author

I revised the profiling code to try a vector of different inputs. This might be more meaningful than recycling the same operand in a loop since these are branch-heavy functions. Some high-level benchmarks would be even better though.

ADD
10 bits:      min 1.50x,    max 1.51x
30 bits:      min 1.55x,    max 1.55x
60 bits:      min 1.55x,    max 1.55x
62 bits:      min 1.54x,    max 1.56x
64 bits:      min 1.29x,    max 1.30x
66 bits:      min 1.22x,    max 1.22x
80 bits:      min 1.08x,    max 1.09x
128 bits:      min 1.01x,    max 1.02x
160 bits:      min 1.03x,    max 1.03x
256 bits:      min 1.02x,    max 1.02x
512 bits:      min 1.04x,    max 1.04x
1024 bits:      min 1.04x,    max 1.03x
4096 bits:      min 1.04x,    max 1.03x

SUB
10 bits:      min 1.58x,    max 1.58x
30 bits:      min 1.62x,    max 1.62x
60 bits:      min 1.63x,    max 1.62x
62 bits:      min 1.61x,    max 1.61x
64 bits:      min 1.35x,    max 1.35x
66 bits:      min 1.26x,    max 1.26x
80 bits:      min 1.08x,    max 1.08x
128 bits:      min 0.99x,    max 0.98x
160 bits:      min 0.99x,    max 0.98x
256 bits:      min 1.02x,    max 1.02x
512 bits:      min 1.03x,    max 1.03x
1024 bits:      min 1.03x,    max 1.03x
4096 bits:      min 1.01x,    max 1.00x

ADDMUL
10 bits:      min 0.99x,    max 0.99x
30 bits:      min 0.99x,    max 1.00x
60 bits:      min 0.98x,    max 0.97x
62 bits:      min 0.98x,    max 0.98x
64 bits:      min 0.99x,    max 0.99x
66 bits:      min 0.99x,    max 0.99x
80 bits:      min 1.00x,    max 1.01x
128 bits:      min 0.99x,    max 1.00x
160 bits:      min 1.00x,    max 1.00x
256 bits:      min 1.00x,    max 1.00x
512 bits:      min 1.02x,    max 1.02x
1024 bits:      min 1.00x,    max 1.01x
4096 bits:      min 0.99x,    max 0.99x

MUL
10 bits:      min 1.07x,    max 1.07x
30 bits:      min 1.05x,    max 1.05x
60 bits:      min 1.01x,    max 1.01x
62 bits:      min 1.01x,    max 1.01x
64 bits:      min 1.01x,    max 1.01x
66 bits:      min 1.01x,    max 1.01x
80 bits:      min 1.02x,    max 1.02x
128 bits:      min 1.07x,    max 1.06x
160 bits:      min 1.03x,    max 1.04x
256 bits:      min 1.00x,    max 1.00x
512 bits:      min 1.00x,    max 1.01x
1024 bits:      min 1.00x,    max 1.00x
4096 bits:      min 0.99x,    max 0.99x

@tthsqe12
Copy link
Contributor

Presumably mostly due to the missing mpn_sumdiff_n, doubling memory accesses in the butterflies

Don't think so. The double memory accesses are much less of a problem than you might imagine.

@tthsqe12
Copy link
Contributor

Where is the commit that fixes fmpz_add_ui and friends? so that it can be applied to the 2.9 branch..

@fredrik-johansson
Copy link
Collaborator Author

cb80fb9

@wbhart
Copy link
Collaborator

wbhart commented May 31, 2022

This would be better if there was some test code that would have failed with the previous version of add/sub_ui.

But the profiles look good.

I don't recall anything about the extra negation in MPIR. Both Flint and MPIR are tested pretty well so I'd be surprised if there is not some compensating change somewhere.

@tthsqe12
Copy link
Contributor

It's sad that we are so much behind GMP up to 32000 limbs. I guess the MPIR stuff was the reason. I could believe the crossover moved up to 2000 limbs otherwise, but 32000 is way too high.

Unfortunately, I can confirm this crossover point. The flint fft really only starts to kick in at large sizes.

flint_fft

@wbhart wbhart merged commit 39ac881 into flintlib:trunk Jun 2, 2022
6 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants