Speed-up Tribe and Party reading #462

Open - wants to merge 10 commits into develop

Conversation

@calum-chamberlain (Member) commented May 9, 2021

What does this PR do?

Implements a few speed-ups for Party IO, and may also include shifting to sparse event-style objects. The main IO patches should be non-breaking and are:

  1. Write chunked catalog files to the party archive, with a user-definable maximum number of events per file.
  2. Read detection catalog files in parallel (optional); requires 1.
  3. Move away from searching exhaustively (and stupidly) through the catalog of detection events for a matching id - just use a dictionary instead.

Why was it initiated? Any relevant Issues?

#445 mentions some of these points. Mostly speed for large catalogs was a real issue, and the catalog reading wasn't the main slow point! Scanning through the whole catalog to assign events to detections was dumb (my bad), and dictionary lookups speed this up a lot.
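
For illustration, the change is roughly the difference between the two lookup patterns below (a minimal sketch, not the exact EQcorrscan code; the assign_events_* names and the detection.id attribute are placeholders):

def assign_events_slow(detections, catalog):
    # Scan the whole catalog for every detection: O(n_detections * n_events).
    for detection in detections:
        for event in catalog:
            if str(event.resource_id) == detection.id:
                detection.event = event
                break

def assign_events_fast(detections, catalog):
    # Build the id -> event mapping once, then each lookup is O(1).
    event_dict = {str(event.resource_id): event for event in catalog}
    for detection in detections:
        detection.event = event_dict.get(detection.id)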

Writing out large catalogs can be really expensive for memory - it SUCKS getting to the end of a big matched-filter run, only to run out of memory when writing out the results... Limiting the number of events written to a single catalog file (hidden from the user in the tar archive) reduces the risk here - the main memory cost is holding both the catalog and the serialized catalog in memory at the same time - only serializing a chunk of the catalog at a time reduces the memory cost, without an obvious major slow-down. In theory this could be parallelized.
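
A minimal sketch of the chunked-writing idea (illustrative only - the real Party.write handles the tar archive, file naming and cleanup internally):

from obspy import Catalog

def write_catalog_chunks(catalog, basename, max_events_per_file=1000):
    # Serialize one chunk at a time: peak memory is the catalog plus one
    # serialized chunk, rather than the catalog plus the whole serialized
    # catalog.
    for i in range(0, len(catalog), max_events_per_file):
        chunk = Catalog(events=catalog.events[i:i + max_events_per_file])
        chunk.write(f"{basename}_{i // max_events_per_file:06d}.xml",
                    format="QUAKEML")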

PR Checklist

  • develop base branch selected?
  • This PR is not directly related to an existing issue (which has no PR yet).
  • All tests still pass.
  • Any new features or fixed regressions are covered by new tests.
  • Any new or changed features are fully documented.
  • Significant changes have been added to CHANGES.md.
  • First-time contributors have added their name to CONTRIBUTORS.md.

@flixha (Collaborator) left a comment


This looks pretty good to me!
I added some tiny comments for some parameters. I tried to test this a bit, and even though I don't have a good larger dataset to test this on right now, I was able to measure some speedups.

One issue I had: the parallel reading works well for me in an isolated script, but running it as part of one of my detection routines, I ran into a multiprocessing deadlock when reading a tribe in parallel. I got these deadlocks quite regularly on my system in my own functions until I started using spawning instead of forking the process pools (multiprocessing.get_context("spawn").Pool()). The deadlock here did not occur for me when spawning processes or when using a ThreadPool (but these can be slower depending on context). From one quick test and some of my experience, I would read the QuakeML files with a spawned pool and the MSEED files with a ThreadPool***.

Unfortunately, it's not so easy to provide a minimal working example for these deadlocks...

***: If there's ever the possibility of corrupt MSEED files being read, then the ThreadPool would cause Python to stop with an uncatchable segfault, while the process pools can handle the error.
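
For reference, the two pool flavours discussed above look roughly like this (a sketch with placeholder per-file reader functions, not EQcorrscan internals):

import multiprocessing
from multiprocessing.pool import ThreadPool

def read_files_spawned(filenames, read_one_file, cores=4):
    # Spawned workers avoid the fork-related deadlocks seen on some Linux
    # set-ups, at the cost of slower worker start-up.
    with multiprocessing.get_context("spawn").Pool(cores) as pool:
        return pool.map(read_one_file, filenames)

def read_files_threaded(filenames, read_one_file, cores=4):
    # Threads start quickly and share memory, but a reader that segfaults
    # on a corrupt file takes the whole interpreter down.
    with ThreadPool(cores) as pool:
        return pool.map(read_one_file, filenames)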

(Inline review comments on eqcorrscan/core/match_filter/party.py and eqcorrscan/core/match_filter/tribe.py: outdated, resolved.)
@calum-chamberlain (Member Author)

Thanks for those comments @flixha, I agree and will get onto them! I'm not fond of setting the process start method, even in a context manager, mostly to ensure that the user can (but probably won't) set the start method themselves.

Quick question: what OS and Python version are you running to get those deadlocks? I pretty much just develop on Linux and haven't run into deadlocks, but I know different OSes support multiprocessing in different ways. Would you also be able to share a script that deadlocks?

Finally, I may switch to using concurrent.futures to more easily allow ProcessPoolExecutor and ThreadPoolExecutor options for the reading of QuakeML and MSEED respectively.
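
A rough sketch of what that could look like (placeholder reader callables; not the merged implementation):

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def read_party_files(quakeml_files, mseed_files, read_quakeml, read_mseed,
                     cores=4):
    # Processes for the CPU-heavy QuakeML parsing, threads for the largely
    # IO-bound MSEED reads.
    with ProcessPoolExecutor(max_workers=cores) as executor:
        catalogs = list(executor.map(read_quakeml, quakeml_files))
    with ThreadPoolExecutor(max_workers=cores) as executor:
        streams = list(executor.map(read_mseed, mseed_files))
    return catalogs, streams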

@calum-chamberlain changed the title from "Parallel reading, chunked writing and catalog keying" to "Speed-up Tribe and Party reading" on May 10, 2021
@calum-chamberlain (Member Author)

An additional speed-up for party reading could be to read the parts of the party (Tribe, detection catalogs, detection stats) in separate processes, before finally reconstructing the party at the end.

I should provide some benchmarks for this at some point to actually demonstrate any speed-ups. One of the main speed-ups was actually the change from the silly exhaustive-search lookup of template streams to a dictionary lookup, which doesn't need any parallelism.

Initial testing showed that reading a Party of c. 300k detections and 2k templates went from taking > 9 hours to < 30 minutes. Still not fast, but at least more manageable...

@flixha (Collaborator) commented May 12, 2021

I'm on Python 3.8.8, and as OS I'm using RHEL 8.3 and RHEL 7.9 depending on the machine. I have seen many deadlocks on these, especially when reading files in parallel. I've spent too much time trying to debug that without really understanding why it's happening. From what I have read, Python on Windows and macOS uses the spawn method by default, and only on Linux uses forking by default, which is less robust in terms of package imports in the workers.

I wasn't fully aware of the context-managing abilities of multiprocessing. I tried to use multiprocessing.set_start_method('spawn') at the start of the script below, but that unfortunately results in a RuntimeError: context has already been set, although it does work if I guard it within if __name__ == "__main__":.
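
For reference, the guarded call is simply:

import multiprocessing

if __name__ == "__main__":
    # set_start_method can only be called once per interpreter, and spawned
    # children re-import the main module, so the call must be guarded.
    multiprocessing.set_start_method("spawn")
    # ... rest of the script ...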

Example that deadlocks on the last line for me on the Linux machines, while it works on macOS 10.13.6:

from obspy import UTCDateTime
from obspy.clients.fdsn import Client
from obspy.core.event import Pick
from eqcorrscan.core.match_filter import Tribe
from eqcorrscan.utils import catalog_utils

client = Client('GEONET')
t1 = UTCDateTime(2016, 9, 4)
t2 = t1 + 86400
catalog = client.get_events(
    starttime=t1, endtime=t2, minmagnitude=4, minlatitude=-49,
    maxlatitude=-35, minlongitude=175.0, maxlongitude=-175.0)
catalog = catalog_utils.filter_picks(
    catalog, channels=['EHZ'], top_n_picks=5)
for event in catalog:
    extra_pick = Pick()
    extra_pick.phase_hint = 'S'
    extra_pick.time = event.picks[0].time + 10
    extra_pick.waveform_id = event.picks[0].waveform_id
    event.picks.append(extra_pick)
tribe = Tribe()
tribe.construct(
    method='from_client', catalog=catalog, client_id='GEONET',
    lowcut=2.0, highcut=9.0, samp_rate=50.0, filt_order=4,
    length=3.0, prepick=0.15, swin='all', process_len=3600)

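# max_events_per_file is the new chunked-writing option from this PR: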
tribe.write('test.tgz', max_events_per_file=4)

# Deadlocks here:
Tribe().read('test.tgz', cores=4)

@calum-chamberlain (Member Author)

Are you running that as a script or from an iPython session? In general, when running scripts that use multiprocessing libraries you should always encapsulate your code in an if __name__ == "__main__": block. Check the multiprocessing docs here - the section on "Safe importing of main module" is what you need.
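
Applied to the script above, that means something like (a sketch):

from eqcorrscan.core.match_filter import Tribe

def main():
    # ... build and write the tribe as in the example above ...
    Tribe().read('test.tgz', cores=4)

if __name__ == "__main__":
    main()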

@flixha (Collaborator) commented May 12, 2021

The if __name__ == "__main__": guard did indeed solve it here!

@calum-chamberlain (Member Author)

I just ran that script without issue on my Ubuntu machine using Python 3.8.8 from conda... Curious that you run into locks. I have seen similar issues on Windows - there should probably be a note somewhere prominent in the docs about encapsulating everything in an if __name__ == "__main__": block.

If you have the time/chance, when you run into these locks again without the block, can you try adding it and report whether or not it works?

Shameful admission on my part, but I either run everything in iPython (for writing) or Jupyter (for sharing with people), or write everything in functions by default because I tend to reuse a lot of code. I don't think I ever run EQcorrscan components in a script like this, hence not making any note of it!

@flixha (Collaborator) commented May 12, 2021

I get the deadlock both in iPython (in a window in VSCode) and when running with python from the terminal. I have not generally had the deadlock problem with EQcorrscan's functions (if I do run into that I can report it, of course), but I have experienced it in several of my own functions that wrap around EQcorrscan functions (e.g., creating templates in parallel). In these cases, the deadlock still occurs even though all parallelization happens inside the functions.

For the test above, if I put that into a function, it will run even without the if __name__ == "__main__": guard in front of the function call.

@calum-chamberlain (Member Author)

Interesting - I also don't use iPython within VSCode or PyCharm; I seem to get odd hangs with them in general. But if you are getting the same locks from the terminal then I don't know!

RE making templates in parallel - that may have something to do with pre-processing trying to process the streams in parallel and nested parallelism not working well. At some point I will redesign pre-processing to move away from multiprocessing - hopefully through some obspy speed-ups I'm playing with.
If you can, for parallel template creation, or anything that doesn't require any kind of memory sharing (e.g. looping through days of detections), I recommend not using Python multiprocessing at all. I usually write a script with start and end arguments, then write a little bash script that calls the Python script multiple times and puts the runs into the background.
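
The pattern described above is roughly the following (run_chunk and the argument names are illustrative), with a shell loop launching several copies of the script in the background:

import argparse
from obspy import UTCDateTime

def run_chunk(start, end):
    # Detection or template-creation work for [start, end) goes here.
    print(f"Processing {start} to {end}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--start", required=True)
    parser.add_argument("--end", required=True)
    args = parser.parse_args()
    run_chunk(UTCDateTime(args.start), UTCDateTime(args.end))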

@flixha (Collaborator) commented May 12, 2021

Thanks for your thoughts on this!
When doing the template-creation in parallel, I use parallel=False, cores=1 for the pre_processing. In total that makes it fast because most time in my scripts is spent on organizing picks and traces.

For the template-creation parallelization, I now use joblib's Parallel(n_jobs=cores)(delayed(function)(parameters) ...), which defaults to a robust pool creation in loky through fork + exec, instead of a simple fork or spawn in multiprocessing. That's more stable in that situation, though it comes with some other inconveniences, e.g., in handling errors in the workers.
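
That pattern looks roughly like this (make_template stands in for whatever per-event function is parallelized):

from joblib import Parallel, delayed

def build_templates_parallel(events, make_template, cores=4):
    # loky, joblib's default backend, starts workers via fork + exec, which
    # avoids many of the fork-only deadlocks seen with multiprocessing.Pool
    # on Linux.
    return Parallel(n_jobs=cores)(
        delayed(make_template)(event) for event in events)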

@flixha (Collaborator) commented May 25, 2021

Hi Calum,
I noticed an issue with the Resource-ID links when reading QuakeML files in parallel in subprocesses, see obspy/obspy#2842.
I haven't seen the problem in this implementation here yet, or it may not be particularly relevant, but I thought it worth mentioning in case it has a "hidden" effect.


if len(all_detections) == 0:
    return Party()

@calum-chamberlain (Member Author) commented on this diff:

TODO: Check what these lines are about - they don't seem related to the topic of this PR?
