Speed-up Tribe and Party reading #462

Open - wants to merge 10 commits into develop

Conversation

@calum-chamberlain (Member) commented May 9, 2021

What does this PR do?

Implements a few speed-ups for Party IO, and may also include shifting to sparse event-style objects. The main IO patches should be non-breaking and are:

  1. Write chunked catalog files to the party archive, with a user-definable maximum number of events per file.
  2. Read detection catalog files in parallel (optional); requires 1.
  3. Move away from searching exhaustively (and stupidly) through the catalog of detection events for a matching id - just use a dictionary instead.

Why was it initiated? Any relevant Issues?

#445 mentions some of these points. Mostly speed for large catalogs was a real issue, and the catalog reading wasn't the main slow point! Scanning through the whole catalog to assign events to detections was dumb (my bad), and dictionary lookups speed this up a lot.
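
For illustration, the change is roughly the difference between the two lookup patterns below (a minimal sketch, not the exact EQcorrscan code; the assign_events_* names and the detection.id attribute are placeholders):

def assign_events_slow(detections, catalog):
    # Scan the whole catalog for every detection: O(n_detections * n_events).
    for detection in detections:
        for event in catalog:
            if str(event.resource_id) == detection.id:
                detection.event = event
                break

def assign_events_fast(detections, catalog):
    # Build the id -> event mapping once, then each lookup is O(1).
    event_dict = {str(event.resource_id): event for event in catalog}
    for detection in detections:
        detection.event = event_dict.get(detection.id)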

Writing out large catalogs can be really expensive for memory - it SUCKS getting to the end of a big matched-filter run, only to run out of memory when writing out the results... Limiting the number of events written to a single catalog file (hidden from the user in the tar archive) reduces the risk here - the main memory cost is holding both the catalog and the serialized catalog in memory at the same time - only serializing a chunk of the catalog at a time reduces the memory cost, without an obvious major slow-down. In theory this could be parallelized.
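
A minimal sketch of the chunked-writing idea (illustrative only - the real Party.write handles the tar archive, file naming and cleanup internally):

from obspy import Catalog

def write_catalog_chunks(catalog, basename, max_events_per_file=1000):
    # Serialize one chunk at a time: peak memory is the catalog plus one
    # serialized chunk, rather than the catalog plus the whole serialized
    # catalog.
    for i in range(0, len(catalog), max_events_per_file):
        chunk = Catalog(events=catalog.events[i:i + max_events_per_file])
        chunk.write(f"{basename}_{i // max_events_per_file:06d}.xml",
                    format="QUAKEML")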

PR Checklist

  • develop base branch selected?
  • This PR is not directly related to an existing issue (which has no PR yet).
  • All tests still pass.
  • Any new features or fixed regressions are covered by new tests.
  • Any new or changed features are fully documented.
  • Significant changes have been added to CHANGES.md.
  • First-time contributors have added their name to CONTRIBUTORS.md.

@flixha (Collaborator) left a comment


This looks pretty good to me!
I added some tiny comments for some parameters. I tried to test this a bit, and even though I don't have a good larger dataset to test this on right now, I was able to measure some speedups.

One issue I had: the parallel reading works well for me in an isolated script, but running it as part of one of my detection routines, I ran into a multiprocessing deadlock when reading a tribe in parallel. I got these deadlocks quite regularly on my system in my own functions until I started using spawning instead of forking the process pools (multiprocessing.get_context("spawn").Pool()). The deadlock here did not occur for me when spawning processes or when using a ThreadPool (but these can be slower depending on context). From one quick test and some of my experience, I would read the QuakeML files with a spawned pool and the MSEED files with a ThreadPool***.

Unfortunately, it's not so easy to provide a minimal working example for these deadlocks...

***: If there's ever the possibility of corrupt MSEED files being read, then the ThreadPool would cause Python to stop with an uncatchable segfault, while the process pools can handle the error.
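
For reference, the two pool flavours discussed above look roughly like this (a sketch with placeholder per-file reader functions, not EQcorrscan internals):

import multiprocessing
from multiprocessing.pool import ThreadPool

def read_files_spawned(filenames, read_one_file, cores=4):
    # Spawned workers avoid the fork-related deadlocks seen on some Linux
    # set-ups, at the cost of slower worker start-up.
    with multiprocessing.get_context("spawn").Pool(cores) as pool:
        return pool.map(read_one_file, filenames)

def read_files_threaded(filenames, read_one_file, cores=4):
    # Threads start quickly and share memory, but a reader that segfaults
    # on a corrupt file takes the whole interpreter down.
    with ThreadPool(cores) as pool:
        return pool.map(read_one_file, filenames)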

(Inline review comments on eqcorrscan/core/match_filter/party.py and eqcorrscan/core/match_filter/tribe.py: outdated, resolved.)
@calum-chamberlain (Member Author)

Thanks for those comments @flixha, I agree and will get onto them! I'm not fond of setting the process start method, even in a context manager, mostly to ensure that the user can (but probably won't) set the start method themselves.

Quick question: what OS and Python version are you running to get those deadlocks? I pretty much just develop on Linux and haven't run into deadlocks, but I know different OSes support multiprocessing in different ways. Would you also be able to share a script that deadlocks?

Finally, I may switch to using concurrent.futures to more easily allow ProcessPoolExecutor and ThreadPoolExecutor options for the reading of QuakeML and MSEED respectively.
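
A rough sketch of what that could look like (placeholder reader callables; not the merged implementation):

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def read_party_files(quakeml_files, mseed_files, read_quakeml, read_mseed,
                     cores=4):
    # Processes for the CPU-heavy QuakeML parsing, threads for the largely
    # IO-bound MSEED reads.
    with ProcessPoolExecutor(max_workers=cores) as executor:
        catalogs = list(executor.map(read_quakeml, quakeml_files))
    with ThreadPoolExecutor(max_workers=cores) as executor:
        streams = list(executor.map(read_mseed, mseed_files))
    return catalogs, streams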

@calum-chamberlain changed the title from "Parallel reading, chunked writing and catalog keying" to "Speed-up Tribe and Party reading" on May 10, 2021
@calum-chamberlain (Member Author)

An additional speed-up for party reading could be to read the parts of the party (Tribe, detection catalogs, detection stats) in separate processes, before finally reconstructing the party at the end.

I should provide some benchmarks for this at some point to actually demonstrate any speed-ups. One of the main speed-ups was actually the change from the silly exhaustive-search lookup of template streams to a dictionary lookup, which doesn't need any parallelism.

Initial testing showed that reading a Party of c. 300k detections and 2k templates went from taking > 9 hours to < 30 minutes. Still not fast, but at least more manageable...

@flixha (Collaborator) commented May 12, 2021

I'm on Python 3.8.8, and as OS I'm using RHEL 8.3 and RHEL 7.9 depending on the machine. I have seen many deadlocks on these, especially when reading files in parallel. I've spent too much time trying to debug that without really understanding why it's happening. From what I have read, Python on Windows and macOS uses the spawn method by default, and only on Linux uses forking by default, which is less robust in terms of package imports in the workers.

I wasn't fully aware of the context-managing abilities of multiprocessing. I tried to use multiprocessing.set_start_method('spawn') at the start of the script below, but that unfortunately results in a RuntimeError: context has already been set, although it does work if I guard it within if __name__ == "__main__":.
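
For reference, the guarded call is simply:

import multiprocessing

if __name__ == "__main__":
    # set_start_method can only be called once per interpreter, and spawned
    # children re-import the main module, so the call must be guarded.
    multiprocessing.set_start_method("spawn")
    # ... rest of the script ...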

Example that deadlocks on the last line for me on the Linux machines, while it works on macOS 10.13.6:

from obspy import UTCDateTime
from obspy.clients.fdsn import Client
from obspy.core.event import Pick
from eqcorrscan.core.match_filter import Tribe
from eqcorrscan.utils import catalog_utils

client = Client('GEONET')
t1 = UTCDateTime(2016, 9, 4)
t2 = t1 + 86400
catalog = client.get_events(
    starttime=t1, endtime=t2, minmagnitude=4, minlatitude=-49,
    maxlatitude=-35, minlongitude=175.0, maxlongitude=-175.0)
catalog = catalog_utils.filter_picks(
    catalog, channels=['EHZ'], top_n_picks=5)
for event in catalog:
    extra_pick = Pick()
    extra_pick.phase_hint = 'S'
    extra_pick.time = event.picks[0].time + 10
    extra_pick.waveform_id = event.picks[0].waveform_id
    event.picks.append(extra_pick)
tribe = Tribe()
tribe.construct(
    method='from_client', catalog=catalog, client_id='GEONET',
    lowcut=2.0, highcut=9.0, samp_rate=50.0, filt_order=4,
    length=3.0, prepick=0.15, swin='all', process_len=3600)

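# max_events_per_file is the new chunked-writing option from this PR: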
tribe.write('test.tgz', max_events_per_file=4)

# Deadlocks here:
Tribe().read('test.tgz', cores=4)

@calum-chamberlain (Member Author)

Are you running that as a script or from an iPython session? In general, when running scripts that use multiprocessing libraries you should always encapsulate your code in an if __name__ == "__main__": block. Check the multiprocessing docs here - the section on "Safe importing of main module" is what you need.
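
Applied to the script above, that means something like (a sketch):

from eqcorrscan.core.match_filter import Tribe

def main():
    # ... build and write the tribe as in the example above ...
    Tribe().read('test.tgz', cores=4)

if __name__ == "__main__":
    main()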

@flixha (Collaborator) commented May 12, 2021

The if __name__ == "__main__": guard did indeed solve it here!

@calum-chamberlain (Member Author)

I just ran that script without issue on my Ubuntu machine using Python 3.8.8 from conda... Curious that you run into locks. I have seen similar issues on Windows - there should probably be a note somewhere prominent in the docs about encapsulating everything in an if __name__ == "__main__": block.

If you have the time/chance, when you run into these locks again without the block, can you try adding it and report whether or not it works?

Shameful admission on my part, but I either run everything in iPython (for writing) or Jupyter (for sharing with people), or write everything in functions by default because I tend to reuse a lot of code. I don't think I ever run EQcorrscan components in a script like this, hence not making any note of it!

@flixha (Collaborator) commented May 12, 2021

I get the deadlock both in iPython (in a window in VSCode) and when running with python from the terminal. I have not generally had the deadlock problem with EQcorrscan's functions (if I do run into that I can report it, of course), but I have experienced it in several of my own functions that wrap around EQcorrscan functions (e.g., creating templates in parallel). In these cases, the deadlock still occurs even though all parallelization happens inside the functions.

For the test above, if I put that into a function, it will run even without the if __name__ == "__main__": guard in front of the function call.

@calum-chamberlain (Member Author)

Interesting - I also don't use iPython within VSCode or PyCharm; I seem to get odd hangs with them in general. But if you are getting the same locks from the terminal then I don't know!

RE making templates in parallel - that may have something to do with pre-processing trying to process the streams in parallel and nested parallelism not working well. At some point I will redesign pre-processing to move away from multiprocessing - hopefully through some obspy speed-ups I'm playing with.
If you can, for parallel template creation, or anything that doesn't require any kind of memory sharing (e.g. looping through days of detections), I recommend not using Python multiprocessing at all. I usually write a script with start and end arguments, then write a little bash script that calls the Python script multiple times and puts the runs into the background.
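
The pattern described above is roughly the following (run_chunk and the argument names are illustrative), with a shell loop launching several copies of the script in the background:

import argparse
from obspy import UTCDateTime

def run_chunk(start, end):
    # Detection or template-creation work for [start, end) goes here.
    print(f"Processing {start} to {end}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--start", required=True)
    parser.add_argument("--end", required=True)
    args = parser.parse_args()
    run_chunk(UTCDateTime(args.start), UTCDateTime(args.end))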

@flixha (Collaborator) commented May 12, 2021

Thanks for your thoughts on this!
When doing the template-creation in parallel, I use parallel=False, cores=1 for the pre_processing. In total that makes it fast because most time in my scripts is spent on organizing picks and traces.

For the template-creation parallelization, I now use joblib's Parallel(n_jobs=cores)(delayed(function)(parameters) ...), which defaults to a robust pool creation in loky through fork + exec, instead of a simple fork or spawn in multiprocessing. That's more stable in that situation, though it comes with some other inconveniences, e.g., in handling errors in the workers.
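
That pattern looks roughly like this (make_template stands in for whatever per-event function is parallelized):

from joblib import Parallel, delayed

def build_templates_parallel(events, make_template, cores=4):
    # loky, joblib's default backend, starts workers via fork + exec, which
    # avoids many of the fork-only deadlocks seen with multiprocessing.Pool
    # on Linux.
    return Parallel(n_jobs=cores)(
        delayed(make_template)(event) for event in events)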

@flixha (Collaborator) commented May 25, 2021

Hi Calum,
I noticed an issue with the Resource-ID links when reading QuakeML files in parallel in subprocesses, see obspy/obspy#2842.
I haven't seen the problem in this implementation here yet, or it may not be particularly relevant, but I thought it worth mentioning in case it has a "hidden" effect.


if len(all_detections) == 0:
    return Party()

@calum-chamberlain (Member Author) commented on this diff:

TODO: Check what these lines are about - they don't seem related to the topic of this PR?
