
Make fqgrep more grep-like in its options #13

Merged (23 commits) Dec 14, 2022

Conversation

nh13 (Member) commented Aug 10, 2022

This adds a ton of changes:

  1. All reader, writer, and matching threads use a rayon thread pool, which means --threads is respected. Previously, reader and writer threads were always allocated outside the match pool, and there were dedicated arguments for the writer threads and for compressing the output (the compression feature has been removed; plain-text FASTQ is the only output format, so just pipe it to a compressor if you need one).
  2. Takes in a pattern as the first positional argument, which is now a regular expression (previously a fixed string).
  3. Takes in zero or more file paths after the positional argument. Uses standard input if no files are given positionally or with -f below.
  4. Input files are assumed to be plain uncompressed FASTQs unless the --decompress option is given, in which case they're assumed to be GZIP compressed. This includes standard input. The exceptions are .gz/.bgz and .fastq/.fq files, which are always treated as GZIP compressed and plain text respectively.
  5. Implement the following options from grep:
  • -c, --count: return only the count of matching records
  • -F, --fixed-strings: interpret the pattern as a set of fixed strings
  • -v: select records that do not match any of the specified patterns
  • --color <color>: color the output records with ANSI color codes
  • -e, --regexp <regexp>...: specify a pattern to use during the search; may be given multiple times
  • -f, --file <file>: read one or more newline-separated patterns from <file>
  • -Z, --decompress: treat all non-.gz/.bgz/.fastq/.fq files as GZIP compressed (default: treat them as uncompressed)
  6. The exit code follows grep: exit with 0 if one or more lines were selected, 1 if no lines were selected, and >1 if an error occurred.
  7. Add non-grep options:
  • --paired: if one file or standard input is used, treat the input as an interleaved paired-end FASTQ. If more than one file is given, ensure that the number of files is a multiple of two, and treat each consecutive pair of files as R1 and R2 respectively. If the pattern matches either R1 or R2, output both (interleaved FASTQ).
  • --reverse-complement: additionally search the reverse complement of the read sequences
  • --progress: write progress (counts of records searched)
  • -t, --threads <threads>: see (1) above

This could use a lot of unit tests and some tests on large datasets (to ensure correctness and performance relative to the previous version).
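For a concrete picture of the grep-like interface above, here is a minimal sketch of how these flags could be declared with clap's derive API (clap 4.x). The struct, field names, defaults, and help text are illustrative assumptions, not fqgrep's actual argument definitions.

    // Illustrative sketch only: field names, defaults, and help text are
    // assumptions and do not necessarily match fqgrep's real argument struct.
    use clap::Parser;
    use std::path::PathBuf;

    #[derive(Parser, Debug)]
    #[command(name = "fqgrep", about = "grep-like search over FASTQ records")]
    struct Opts {
        /// Return only the count of matching records
        #[arg(short = 'c', long = "count")]
        count: bool,

        /// Interpret the pattern(s) as fixed strings, not regular expressions
        #[arg(short = 'F', long = "fixed-strings")]
        fixed_strings: bool,

        /// Select records that do NOT match any of the patterns
        #[arg(short = 'v')]
        invert_match: bool,

        /// Additional pattern(s); may be given multiple times
        #[arg(short = 'e', long = "regexp")]
        regexp: Vec<String>,

        /// Read newline-separated patterns from a file
        #[arg(short = 'f', long = "file")]
        pattern_file: Option<PathBuf>,

        /// Treat files without a recognized extension as GZIP compressed
        #[arg(short = 'Z', long = "decompress")]
        decompress: bool,

        /// Size of the matching thread pool (illustrative default)
        #[arg(short = 't', long = "threads", default_value_t = 2)]
        threads: usize,

        /// Pattern followed by zero or more input FASTQs (stdin if none)
        args: Vec<String>,
    }

    fn main() {
        let opts = Opts::parse();
        println!("{opts:?}");
    }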

nh13 (Member, Author) commented Aug 15, 2022

Weird, there seems to be some issue with the thread pools stalling when threads <= 2

nh13 (Member, Author) commented Aug 18, 2022

@sstadick totally fine if you have zero time, but I am stuck with a weird deadlock when using exactly two threads. I think there's a thread each for reading, writing, and processing, so something is stalling or deadlocking. I suspect a fundamental misunderstanding in my use of rayon thread pools is causing it. Any chance you have ideas?

sstadick (Contributor) left a review comment:
@nh13 I didn't see the notification for this until today!

So the number of threads you need is equal to:

  • 2 for reading if paired (1 if not paired)
  • 1 for writing
  • 1 for the pool.install

Otherwise, as you noted, it will deadlock.

Example: Paired input, 3 threads given

1 thread is immediately started for writing, and blocks waiting on the channel
1 thread is used as the context within pool.install
A thread for reader 1 is created on the threadpool, sending chunks and then blocking once the channel is full.
A task is added to the pool for reader 2 that wants to use a thread, but all 3 threads are in a blocked state:

  • the pool.install thread is waiting on reader2 to start sending over its channel
  • the reader 1 thread is blocked waiting for its channel to be less full before it can start sending again
  • the writer is blocked and waiting for anything to show up over the rx channel.

What I'd recommend is spawning threads not tied to the threadpool to do the reading and writing.
I'd not count the reader and writer threads toward the total number of threads used, since they don't really use CPU and mostly just sleep-wait on syscalls.

That's how gzp, and by association crabz, do their thread accounting.
Alternatively, you could switch to an async model, which adds a lot of complexity but specifically solves this scenario where you want to context-switch between tasks on a limited threadpool.
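To make the recommended layout concrete, here is a minimal sketch that keeps the reader and writer on dedicated std::thread threads and uses the rayon pool only for the CPU-bound matching. The channel types, chunk payloads, and the toy match predicate are assumptions for illustration, not fqgrep's actual code.

    use std::sync::mpsc::sync_channel;
    use std::thread;

    fn main() {
        let threads = 2; // the value a user passes via --threads; IO threads are not counted against it

        // Bounded channels, mirroring the "channel is full" blocking described above.
        let (read_tx, read_rx) = sync_channel::<Vec<String>>(100);
        let (write_tx, write_rx) = sync_channel::<Vec<String>>(100);

        // Dedicated reader thread: it mostly sleeps on IO, so it lives outside
        // the pool and the OS schedules it.
        let reader = thread::spawn(move || {
            for chunk in (0..10).map(|i| vec![format!("record-{i}")]) {
                read_tx.send(chunk).unwrap();
            }
            // read_tx is dropped here, closing the channel so the pool loop ends.
        });

        // Dedicated writer thread: drains matched chunks and prints them.
        let writer = thread::spawn(move || {
            for chunk in write_rx {
                for rec in chunk {
                    println!("{rec}");
                }
            }
        });

        // Only the matching work competes for the --threads-sized pool.
        let pool = rayon::ThreadPoolBuilder::new()
            .num_threads(threads)
            .build()
            .unwrap();
        pool.install(move || {
            use rayon::prelude::*;
            for chunk in read_rx {
                let matched: Vec<String> = chunk
                    .into_par_iter()
                    .filter(|rec| rec.contains("record")) // toy stand-in for the real matcher
                    .collect();
                write_tx.send(matched).unwrap();
            }
            drop(write_tx); // close the writer's channel so it can exit
        });

        reader.join().unwrap();
        writer.join().unwrap();
    }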

progress.record();
progress.record();
}
assert!(
Contributor comment on the diff above:
Note that this compares the full read header, which will likely differ between R1 and R2 since it contains the <read> element. I had to remove this to run locally on the datasets I had on hand.

sstadick (Contributor) commented Aug 22, 2022

I pushed a branch that has debug statements that demonstrate exactly how / why the deadlock happens: https://github.com/fulcrumgenomics/fqgrep/tree/demo/deadlock

After building, run with RUST_LOG="debug" ./target/release/fqgrep ... with threads set to 2 or 3 depending on whether or not you have paired input.

nh13 (Member, Author) commented Aug 22, 2022

@sstadick the pony I want is "I have N tasks and M processors, where N >> M, and M is a thread pool". I want the user to say "use M threads", and not have it be "the tool will use M + 2 threads", if that makes sense. I thought that spawning threads and installing closures would support that? If a thread is blocked, can't another thread be executed?

sstadick (Contributor) commented:
For a number of complicated reasons the pony you are asking for is async, and probably specifically the tokio runtime.

As far as I can tell, rayon maintains a pool of threads and doles out one task per thread. rayon::spawn adds a task to the queue. None of your tasks finish; they all block on a recv or send. Rayon can't know that a thread is blocked; the OS knows. That is, if you were to massively oversubscribe rayon threads to OS threads, the OS could context switch between them, but rayon can't. This is part of why I think it's fine to not count the IO threads toward --threads: the OS can and will manage them.

Part of the reason rayon can't do this is that suspending the execution of a task and restarting it on another thread means moving memory around, which may or may not be safe.

async specifically solves this problem. It maintains a pool of N threads and allows you to launch M tasks on those threads. The difference is that the baked in .await on an async version of a Receiver notifies the async runtime that the task has hit a breakpoint at which time it can be interrupted. Additionally, running in an async function forces Pin/Unpin on your datatypes, annotating what can and can't be moved around.

Lastly, with regard to not counting IO threads: that is what pigz does as well, see sstadick/gzp#11 (comment).
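For contrast, here is a minimal sketch of the async alternative described above, assuming the tokio runtime (which this PR does not use): the .await points on a bounded channel are where a task yields, so many tasks can share a small, fixed pool of worker threads without deadlocking.

    use tokio::sync::mpsc;

    #[tokio::main(flavor = "multi_thread", worker_threads = 2)]
    async fn main() {
        // Small bounded channel: send().await yields when it is full.
        let (tx, mut rx) = mpsc::channel::<u32>(4);

        // Far more tasks than worker threads; each one parks at .await instead
        // of pinning a thread, so the runtime can interleave them on 2 threads.
        let mut producers = Vec::new();
        for i in 0..8u32 {
            let tx = tx.clone();
            producers.push(tokio::spawn(async move {
                tx.send(i).await.unwrap();
            }));
        }
        drop(tx); // drop the original sender so recv() can observe channel closure

        while let Some(v) = rx.recv().await {
            println!("got {v}");
        }
        for p in producers {
            p.await.unwrap();
        }
    }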

nh13 (Member, Author) commented Aug 22, 2022

@sstadick thanks for the explanation, that makes a lot of sense. In a lot of contexts, like the one you have for gzp, the user will want a hard limit on cores. A job that uses more threads than requested may even be killed (e.g. on clusters), and if we want to run multiple tools on the same instance, we don't want to be surprised by over-subscription.

I do think the right answer is to spawn separate reader/writer threads and call it a day, like you have in crabz and elsewhere.

nh13 (Member, Author) commented Aug 25, 2022

@tfenne I think this is ready for review (@sstadick you're welcome to review too, but under no obligation and no pressure).

nh13 marked this pull request as ready for review on August 25, 2022.
nh13 changed the title from "wip: make it grep-like" to "Make fqgrep more grep-like in its options" on Aug 25, 2022.
sstadick (Contributor) left a review:
This looks awesome!!

Resolved review threads: src/lib/color.rs (outdated), src/lib/matcher.rs, src/lib/matcher.rs (outdated), src/lib/mod.rs (outdated)
nh13 requested a review from tfenne on September 1, 2022.
nh13 (Member, Author) commented Sep 1, 2022

@tfenne in case you want to review it, here it is. I think there are probably non-trivial improvements that can be made to the implementation, but I've stared at it long enough I can no longer see them.
