Use fast measure(body) in measure!(flow) #155

Merged: weymouth merged 1 commit into master on Jul 31, 2024

weymouth
Collaborator

The solver doesn't need normal or velocity measurements outside a very narrow band, so the fast path of `measure` can be applied there as well, but the width of the band is variable. This PR changes the keyword `fast=false` to `fastd²=Inf`, and `n` and `V` are skipped when `d^2>fastd²`.

Applied in measure!(flow) and tests passing, but I haven't benchmarked yet. Should be a bit faster for the Jelly and will really speed up ParametricBody simulations.
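
A minimal sketch of the idea described above, not WaterLily's actual implementation. The function name `measure_circle`, the static circle geometry, and the zero vectors returned on the fast path are assumptions made for illustration. Only the keyword `fastd²=Inf` and the `d^2>fastd²` test come from the PR description.

```julia
using LinearAlgebra: norm

# Hypothetical stand-in for `measure(body, x, t; fastd²)`: a static circle of radius R.
# Returns (d, n, V); n and V are only computed inside the band d^2 <= fastd².
function measure_circle(x::AbstractVector{T}, R::T; fastd²=Inf) where {T}
    r = norm(x)
    d = r - R                                      # signed distance to the circle
    d^2 > fastd² && return (d, zero(x), zero(x))   # fast path: skip n and V
    n = x / r                                      # outward unit normal (assumes x != 0)
    V = zero(x)                                    # static body, so the velocity is zero
    return (d, n, V)
end

near = measure_circle([1.5, 0.0], 1.0; fastd²=9.0)   # inside the band: full measurement
far  = measure_circle([10.0, 0.0], 1.0; fastd²=9.0)  # outside the band: distance only
```

With the default `fastd²=Inf` the test `d^2 > fastd²` is never true, so callers that do not pass the keyword still get the full measurement everywhere.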

@b-fg
Member

b-fg commented Jul 29, 2024

You could locally merge #152 into this PR (no need to commit it) and do the benchmark with that :)

  for i ∈ 1:N
- dᵢ,nᵢ,Vᵢ = measure(body,loc(i,I,T),t)
+ dᵢ,nᵢ,Vᵢ = measure(body,loc(i,I,T),t,fastd²=d²)
Collaborator Author

This line won't do too much, since the if statement above is already catching most of them. I tried to reduce to d=epsilon, but this broke the code.

@fastmath @inline function fill!(μ₀,μ₁,V,d,I)
d[I] = sdf(body,loc(0,I,T),t)
if abs(d[I])<2+ϵ
d[I] = sdf(body,loc(0,I,T),t,fastd²=d²)
Collaborator Author

This is the important line for ParametricBodies.
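
A rough sketch of why passing `fastd²` to `sdf` could matter for a body whose exact distance is expensive. Everything here is hypothetical: `exact_sdf` stands in for a costly closest-point search, `cheap_bound` is an inexpensive lower bound on the distance (here, the distance to a bounding box), and `d²` is assumed to be the square of the `2+ϵ` band from the diff above. This is not how ParametricBodies is implemented; it only illustrates the early-exit pattern.

```julia
# Counter for the number of expensive evaluations.
const calls = Ref(0)

# Hypothetical "expensive" exact distance: a unit circle, with a call counter.
exact_sdf(x) = (calls[] += 1; hypot(x[1], x[2]) - 1.0)

# Cheap lower bound: distance to the box [-1,1]^2, which contains the unit circle.
function cheap_bound(x)
    q = (max(abs(x[1]) - 1.0, 0.0), max(abs(x[2]) - 1.0, 0.0))
    return hypot(q[1], q[2])
end

function sdf_sketch(x; fastd²=Inf)
    b = cheap_bound(x)
    b^2 > fastd² && return b      # far from the body: the bound is good enough
    return exact_sdf(x)           # near the body: pay for the exact value
end

ϵ = 1.0
d² = (2 + ϵ)^2                    # assumed band, matching |d| < 2+ϵ
grid = [(i, j) for i in -16.0:16.0, j in -16.0:16.0]

calls[] = 0; foreach(x -> sdf_sketch(x), grid);            full = calls[]
calls[] = 0; foreach(x -> sdf_sketch(x; fastd²=d²), grid); banded = calls[]
println("expensive evaluations: full = $full, banded = $banded")
```

Without the keyword every grid point triggers the expensive evaluation, while with `fastd²=d²` only the points near the body do.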

@weymouth
Collaborator Author

> You could locally merge #152 into this PR (no need to commit it) and do the benchmark with that :)

But I need to compare with master. How would that work? Also, I'm working from home so I don't have my GPU available today.

@b-fg
Member

b-fg commented Jul 29, 2024

  1. Merge that branch into this one and run the benchmarks.
  2. Check out that branch and run the benchmarks (I have just updated that branch with master, so this will basically be benchmarking master).
  3. Compare benchmarks from 1 and 2.

But yes, ideally we would like to see GPU results too.

--

Or I could also merge that with master already, because I think it is finished now.

@b-fg b-fg changed the title Use fast measure(body) in measure!(flow) Use fast measure(body) in measure!(flow) Jul 29, 2024
@b-fg
Member

b-fg commented Jul 29, 2024

In fact, you do need to use that PR, because otherwise the changes in `measure` will not actually be benchmarked. I fixed this bug in that PR.

@weymouth
Collaborator Author

CPU benchmarks on my machine show a slow down??!??

Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │   18690fa │ 1.10.0 │   Float32 │       78733 │   0.00 │    16.86 │           643.06 │     1.00 │
│  CPUx01 │   1dbc2d2 │ 1.10.0 │   Float32 │       78733 │   0.00 │    17.39 │           663.25 │     0.97 │
│  CPUx04 │   18690fa │ 1.10.0 │   Float32 │     2274514 │   0.40 │     9.42 │           359.26 │     1.79 │
│  CPUx04 │   1dbc2d2 │ 1.10.0 │   Float32 │     2274514 │   0.45 │     7.73 │           294.88 │     2.18 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │   18690fa │ 1.10.0 │   Float32 │       70606 │   0.00 │   117.36 │           559.61 │     1.00 │
│  CPUx01 │   1dbc2d2 │ 1.10.0 │   Float32 │       70606 │   0.00 │   106.49 │           507.78 │     1.10 │
│  CPUx04 │   18690fa │ 1.10.0 │   Float32 │     2021882 │   0.06 │    61.00 │           290.85 │     1.92 │
│  CPUx04 │   1dbc2d2 │ 1.10.0 │   Float32 │     2021882 │   0.07 │    51.39 │           245.06 │     2.28 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │   18690fa │ 1.10.0 │   Float32 │      196747 │   0.00 │    15.02 │          1145.60 │     1.00 │
│  CPUx01 │   1dbc2d2 │ 1.10.0 │   Float32 │      196739 │   0.00 │    13.55 │          1034.03 │     1.11 │
│  CPUx04 │   18690fa │ 1.10.0 │   Float32 │     5549636 │   0.54 │    10.56 │           805.81 │     1.42 │
│  CPUx04 │   1dbc2d2 │ 1.10.0 │   Float32 │     5549382 │   0.70 │     8.65 │           659.80 │     1.74 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │   18690fa │ 1.10.0 │   Float32 │      230016 │   0.00 │   124.63 │          1188.58 │     1.00 │
│  CPUx01 │   1dbc2d2 │ 1.10.0 │   Float32 │      230016 │   0.00 │   111.15 │          1060.02 │     1.12 │
│  CPUx04 │   18690fa │ 1.10.0 │   Float32 │     6552597 │   0.08 │    74.73 │           712.64 │     1.67 │
│  CPUx04 │   1dbc2d2 │ 1.10.0 │   Float32 │     6552597 │   0.08 │    61.22 │           583.83 │     2.04 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@b-fg
Member

b-fg commented Jul 30, 2024

Here are my benchmarks. 1dbc2d2 is "master" (the benchmark PR) and 8d348ef is the current PR. I do not see any major performance speedup or regression.

Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │   1dbc2d2 │ 1.10.4 │   Float32 │       78733 │   0.00 │     9.13 │           348.45 │     1.00 │
│     CPUx01 │   8d348ef │ 1.10.4 │   Float32 │       78733 │   0.00 │     9.34 │           356.17 │     0.98 │
│     CPUx04 │   1dbc2d2 │ 1.10.4 │   Float32 │     2274514 │   0.00 │     3.11 │           118.73 │     2.93 │
│     CPUx04 │   8d348ef │ 1.10.4 │   Float32 │     2274514 │   0.00 │     3.20 │           121.88 │     2.86 │
│     CPUx08 │   1dbc2d2 │ 1.10.4 │   Float32 │     3555070 │   0.00 │     3.19 │           121.53 │     2.87 │
│     CPUx08 │   8d348ef │ 1.10.4 │   Float32 │     3555070 │   0.00 │     3.63 │           138.55 │     2.51 │
│ GPU-NVIDIA │   1dbc2d2 │ 1.10.4 │   Float32 │     2671216 │   0.00 │     0.65 │            24.97 │    13.96 │
│ GPU-NVIDIA │   8d348ef │ 1.10.4 │   Float32 │     2672028 │   0.00 │     0.66 │            25.16 │    13.85 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │   1dbc2d2 │ 1.10.4 │   Float32 │       70606 │   0.00 │    73.55 │           350.70 │     1.00 │
│     CPUx01 │   8d348ef │ 1.10.4 │   Float32 │       70606 │   0.00 │    58.92 │           280.96 │     1.25 │
│     CPUx04 │   1dbc2d2 │ 1.10.4 │   Float32 │     2021882 │   0.00 │    18.47 │            88.09 │     3.98 │
│     CPUx04 │   8d348ef │ 1.10.4 │   Float32 │     2021882 │   0.00 │    19.25 │            91.81 │     3.82 │
│     CPUx08 │   1dbc2d2 │ 1.10.4 │   Float32 │     3159482 │   0.00 │    19.09 │            91.01 │     3.85 │
│     CPUx08 │   8d348ef │ 1.10.4 │   Float32 │     3159482 │   0.00 │    19.26 │            91.84 │     3.82 │
│ GPU-NVIDIA │   1dbc2d2 │ 1.10.4 │   Float32 │     2347006 │   0.00 │     3.12 │            14.89 │    23.56 │
│ GPU-NVIDIA │   8d348ef │ 1.10.4 │   Float32 │     2347806 │   0.00 │     3.15 │            15.02 │    23.35 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │   1dbc2d2 │ 1.10.4 │   Float32 │      196731 │   0.00 │     7.77 │           592.83 │     1.00 │
│     CPUx01 │   8d348ef │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.92 │           604.10 │     0.98 │
│     CPUx04 │   1dbc2d2 │ 1.10.4 │   Float32 │     5549128 │   1.54 │     4.55 │           346.84 │     1.71 │
│     CPUx04 │   8d348ef │ 1.10.4 │   Float32 │     5549932 │   0.00 │     4.55 │           346.92 │     1.71 │
│     CPUx08 │   1dbc2d2 │ 1.10.4 │   Float32 │     8644684 │   2.59 │     5.01 │           382.28 │     1.55 │
│     CPUx08 │   8d348ef │ 1.10.4 │   Float32 │     8645956 │   1.04 │     5.06 │           386.02 │     1.54 │
│ GPU-NVIDIA │   1dbc2d2 │ 1.10.4 │   Float32 │     6539350 │   0.00 │     1.49 │           113.83 │     5.21 │
│ GPU-NVIDIA │   8d348ef │ 1.10.4 │   Float32 │     6536576 │   0.00 │     1.46 │           111.14 │     5.33 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌────────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│  Backend   │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├────────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│     CPUx01 │   1dbc2d2 │ 1.10.4 │   Float32 │      230016 │   0.00 │    58.94 │           562.14 │     1.00 │
│     CPUx01 │   8d348ef │ 1.10.4 │   Float32 │      230016 │   0.00 │    59.67 │           569.03 │     0.99 │
│     CPUx04 │   1dbc2d2 │ 1.10.4 │   Float32 │     6552597 │   0.37 │    20.24 │           193.04 │     2.91 │
│     CPUx04 │   8d348ef │ 1.10.4 │   Float32 │     6552597 │   0.00 │    21.47 │           204.78 │     2.75 │
│     CPUx08 │   1dbc2d2 │ 1.10.4 │   Float32 │    10222545 │   0.76 │    21.64 │           206.34 │     2.72 │
│     CPUx08 │   8d348ef │ 1.10.4 │   Float32 │    10222545 │   0.66 │    20.88 │           199.14 │     2.82 │
│ GPU-NVIDIA │   1dbc2d2 │ 1.10.4 │   Float32 │     7787222 │   0.00 │     4.69 │            44.76 │    12.56 │
│ GPU-NVIDIA │   8d348ef │ 1.10.4 │   Float32 │     7787224 │   0.00 │     4.67 │            44.57 │    12.61 │
└────────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
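
The Speed-up column in these tables is relative to the first row of each table (CPUx01 on 1dbc2d2), so it does not directly compare the two commits on the same backend. A small sketch of that per-backend comparison follows, using a few wall times copied from the tables above. The dictionary layout and variable names are only illustrative.

```julia
# Wall times [s] copied from the tables above: (master 1dbc2d2, PR 8d348ef)
times = Dict(
    ("tgv",   7, "CPUx01")     => (73.55, 58.92),
    ("tgv",   7, "GPU-NVIDIA") => (3.12, 3.15),
    ("jelly", 6, "CPUx01")     => (58.94, 59.67),
    ("jelly", 6, "GPU-NVIDIA") => (4.69, 4.67),
)

# Ratio > 1 means the PR is faster than master for that case.
ratio = Dict(k => round(m / p, digits=2) for (k, (m, p)) in times)

for k in sort(collect(keys(ratio)))
    println(join(k, " "), " => ", ratio[k])
end
```

Apart from the tgv CPUx01 case at log2p = 7, the ratios stay within a few percent of 1, which matches the reading that there is no major speed-up or regression for these cases.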

@weymouth
Collaborator Author

Yeah, that's actually encouraging. It means the current measure function is already optimized for AutoBodies.

For NURBS ParametricBodies, I measured a 100% speed up with this PR, so it's certainly worth doing.

@b-fg
Member

b-fg commented Jul 31, 2024

So can we merge this?

@weymouth weymouth merged commit cf2a247 into master Jul 31, 2024
42 checks passed