Performance on huge files #26

Closed
HealthyPear opened this issue Mar 27, 2023 · 3 comments

@HealthyPear
Member

I am dealing with ~135 GB particle files and am wondering about the best way to work with them.

The code I used is the following:

from corsikaio import CorsikaParticleFile

input_file = "[....]/DAT100001"

# Iterate over events, stopping as soon as the second event is reached
with CorsikaParticleFile(input_file) as f:
    for event in f:
        if event.header["event_number"] == 2:
            break

I then used cProfile to produce the following profile file,

test_pycorsikaio_simplest.prof.zip

which can be opened with e.g. SnakeViz.

My ideal solution would be to read the file in multi-threaded chunks, but given that this is a CORSIKA file I am not sure whether and how that can be done.

@maxnoe
Member

maxnoe commented Mar 27, 2023

Just to repeat here what I also wrote in Slack:

The main issue is that we read the files 273 bytes at a time, resulting in a lot of system calls that switch between userland and kernel space.

This could be improved considerably by reading the data in larger chunks, e.g. some large multiple of 273 for files without the "buffer size", and the actual buffer size for files that do have one.
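A minimal sketch of that idea, assuming the fixed 273-byte record size quoted above and a made-up chunk factor (this is not the library's actual implementation):

# Sketch: one large read() per CHUNK_RECORDS records instead of one
# system call per record; RECORD_SIZE and CHUNK_RECORDS are illustrative.
RECORD_SIZE = 273
CHUNK_RECORDS = 4096  # roughly 1 MB per read() call

def iter_records(path):
    """Yield fixed-size records from path, buffering large reads."""
    with open(path, "rb") as f:
        while chunk := f.read(RECORD_SIZE * CHUNK_RECORDS):
            # a trailing partial record (e.g. padding) is ignored
            for start in range(0, len(chunk) - RECORD_SIZE + 1, RECORD_SIZE):
                yield chunk[start:start + RECORD_SIZE]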

I can't promise when I will be able to try that out, feel free to try yourself and open a PR.

I doubt multi-threading will help much here: CORSIKA files are sequential, and you need to look for the markers in the first 4 bytes of every chunk (RUNH / EVTH / EVTE / LONGI / RUNE).
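For illustration, a hedged sketch of such a marker check (the marker set follows the list above; "LONGI" shows up as its first four bytes, b"LONG"):

# Sketch: classify a record by its first four bytes.
MARKERS = {b"RUNH", b"EVTH", b"EVTE", b"LONG", b"RUNE"}

def record_kind(record):
    """Return the 4-byte marker if the record starts with one,
    else None (a particle data record of the current event)."""
    head = record[:4]
    return head if head in MARKERS else None

Because every record has to be classified in order to know where events begin and end, the file has to be walked sequentially, which is why parallel chunked reading does not map onto the format easily.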

@HealthyPear
Member Author

In this regard, but also for simpler use cases, what about adding some performance benchmarks to the CI, using test files tracked with Git LFS?
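One way this could look, as a hedged sketch: a pytest-benchmark test over a small sample file. Both the pytest-benchmark dependency and the sample file path are assumptions, not anything that exists in the repository:

# Sketch of a CI benchmark using the pytest-benchmark fixture.
from corsikaio import CorsikaParticleFile

SAMPLE_FILE = "tests/resources/DAT_benchmark"  # hypothetical LFS-tracked file

def test_iterate_all_events(benchmark):
    def read_all():
        with CorsikaParticleFile(SAMPLE_FILE) as f:
            for _ in f:
                pass
    benchmark(read_all)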

@maxnoe
Member

maxnoe commented Nov 23, 2023

The main issue here was addressed: we now read much larger blocks than 273 bytes from the filesystem, which resulted in a speedup: #29

I am closing this. If performance is still an issue, please provide profiling information in a new issue.

maxnoe closed this as completed on Nov 23, 2023