
Make fqgrep more grep-like in its options #13

Merged (23 commits) Dec 14, 2022

Conversation

nh13 (Member) commented Aug 10, 2022

This adds a ton of changes:

  1. All reader, writer, and matching threads use a rayon thread pool, which means --threads is respected. Previously, reader and writer threads were always allocated outside the match pool, and there were dedicated arguments for the writer threads and for compressing the output (the compression feature has been removed; plain-text FASTQ is the only output format, so just pipe it to a compressor if you need one).
  2. Takes in a pattern as the first positional argument, which is now a regular expression (previously a fixed string).
  3. Takes in zero or more file paths after the positional argument. Uses standard input if no files are given positionally or with -f below.
  4. Input files are assumed to be plain uncompressed FASTQs unless the --decompress option is given, in which case they're assumed to be GZIP compressed. This includes standard input. The exceptions are .gz/.bgz and .fastq/.fq files, which are always treated as GZIP compressed and plain text respectively.
  5. Implement the following options from grep:
  • -c, --count: return only the count of matching records
  • -F, --fixed-strings: interpret the pattern as a set of fixed strings
  • -v: select records that do not match any of the specified patterns
  • --color <color>: color the output records with ANSI color codes
  • -e, --regexp <regexp>...: specify a pattern to use during the search; may be given multiple times
  • -f, --file <file>: read one or more newline-separated patterns from <file>
  • -Z, --decompress: treat all non-.gz/.bgz/.fastq/.fq files as GZIP compressed (default: treat them as uncompressed)
  6. The exit code follows grep: exit with 0 if one or more lines were selected, 1 if no lines were selected, and >1 if an error occurred.
  7. Add non-grep options:
  • --paired: if one file or standard input is used, treat the input as an interleaved paired-end FASTQ. If more than one file is given, ensure that the number of files is a multiple of two, and treat each consecutive pair of files as R1 and R2 respectively. If the pattern matches either R1 or R2, output both (interleaved FASTQ).
  • --reverse-complement: additionally search the reverse complement of the read sequences
  • --progress: write progress (counts of records searched)
  • -t, --threads <threads>: see (1) above

This could use a lot of unit tests and some tests on large datasets (to ensure correctness and performance relative to the previous version).
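For a concrete picture of the grep-like interface above, here is a minimal sketch of how these flags could be declared with clap's derive API (clap 4.x). The struct, field names, defaults, and help text are illustrative assumptions, not fqgrep's actual argument definitions.

    // Illustrative sketch only: field names, defaults, and help text are
    // assumptions and do not necessarily match fqgrep's real argument struct.
    use clap::Parser;
    use std::path::PathBuf;

    #[derive(Parser, Debug)]
    #[command(name = "fqgrep", about = "grep-like search over FASTQ records")]
    struct Opts {
        /// Return only the count of matching records
        #[arg(short = 'c', long = "count")]
        count: bool,

        /// Interpret the pattern(s) as fixed strings, not regular expressions
        #[arg(short = 'F', long = "fixed-strings")]
        fixed_strings: bool,

        /// Select records that do NOT match any of the patterns
        #[arg(short = 'v')]
        invert_match: bool,

        /// Additional pattern(s); may be given multiple times
        #[arg(short = 'e', long = "regexp")]
        regexp: Vec<String>,

        /// Read newline-separated patterns from a file
        #[arg(short = 'f', long = "file")]
        pattern_file: Option<PathBuf>,

        /// Treat files without a recognized extension as GZIP compressed
        #[arg(short = 'Z', long = "decompress")]
        decompress: bool,

        /// Size of the matching thread pool (illustrative default)
        #[arg(short = 't', long = "threads", default_value_t = 2)]
        threads: usize,

        /// Pattern followed by zero or more input FASTQs (stdin if none)
        args: Vec<String>,
    }

    fn main() {
        let opts = Opts::parse();
        println!("{opts:?}");
    }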

nh13 (Member, Author) commented Aug 15, 2022

Weird, there seems to be some issue with the thread pools stalling when threads <= 2

nh13 (Member, Author) commented Aug 18, 2022

@sstadick totally fine if you have zero time, but I am stuck with a weird deadlock when using exactly two threads. I think there's a thread each for reading, writing, and processing, so something is stalling or deadlocking. I suspect a fundamental misunderstanding in my use of rayon thread pools is causing it. Any chance you have ideas?

sstadick (Contributor) left a review comment:
@nh13 I didn't see the notification for this until today!

So the number of threads you need is equal to:

  • 2 for reading if paired (1 if not paired)
  • 1 for writing
  • 1 for the pool.install

Otherwise, as you noted, it will deadlock.

Example: Paired input, 3 threads given

1 thread is immediately started for writing, and blocks waiting on the channel
1 thread is used as the context within pool.install
A thread for reader 1 is created on the threadpool, sending chunks and then blocking once the channel is full.
A task is added to the pool for reader 2 that wants to use a thread, but all 3 threads are in a blocked state:

  • the pool.install thread is waiting on reader2 to start sending over its channel
  • the reader 1 thread is blocked waiting for its channel to be less full before it can start sending again
  • the writer is blocked and waiting for anything to show up over the rx channel.

What I'd recommend is spawning threads not tied to the threadpool to do the reading and writing.
I'd not count the reader and writer threads toward the total number of threads used, since they don't really use CPU and mostly just sleep-wait on syscalls.

That's how gzp, and by association crabz, do their thread accounting.
Alternatively, you could switch to an async model, which adds a lot of complexity but specifically solves this scenario where you want to context-switch between tasks on a limited threadpool.
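To make the recommended layout concrete, here is a minimal sketch that keeps the reader and writer on dedicated std::thread threads and uses the rayon pool only for the CPU-bound matching. The channel types, chunk payloads, and the toy match predicate are assumptions for illustration, not fqgrep's actual code.

    use std::sync::mpsc::sync_channel;
    use std::thread;

    fn main() {
        let threads = 2; // the value a user passes via --threads; IO threads are not counted against it

        // Bounded channels, mirroring the "channel is full" blocking described above.
        let (read_tx, read_rx) = sync_channel::<Vec<String>>(100);
        let (write_tx, write_rx) = sync_channel::<Vec<String>>(100);

        // Dedicated reader thread: it mostly sleeps on IO, so it lives outside
        // the pool and the OS schedules it.
        let reader = thread::spawn(move || {
            for chunk in (0..10).map(|i| vec![format!("record-{i}")]) {
                read_tx.send(chunk).unwrap();
            }
            // read_tx is dropped here, closing the channel so the pool loop ends.
        });

        // Dedicated writer thread: drains matched chunks and prints them.
        let writer = thread::spawn(move || {
            for chunk in write_rx {
                for rec in chunk {
                    println!("{rec}");
                }
            }
        });

        // Only the matching work competes for the --threads-sized pool.
        let pool = rayon::ThreadPoolBuilder::new()
            .num_threads(threads)
            .build()
            .unwrap();
        pool.install(move || {
            use rayon::prelude::*;
            for chunk in read_rx {
                let matched: Vec<String> = chunk
                    .into_par_iter()
                    .filter(|rec| rec.contains("record")) // toy stand-in for the real matcher
                    .collect();
                write_tx.send(matched).unwrap();
            }
            drop(write_tx); // close the writer's channel so it can exit
        });

        reader.join().unwrap();
        writer.join().unwrap();
    }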

progress.record();
progress.record();
}
assert!(
Contributor comment on the diff above:
Note that this compares the full read header, which will likely differ between R1 and R2 since it contains the <read> element. I had to remove this to run locally on the datasets I had on hand.

sstadick (Contributor) commented Aug 22, 2022

I pushed a branch that has debug statements that demonstrate exactly how / why the deadlock happens: https://github.com/fulcrumgenomics/fqgrep/tree/demo/deadlock

After building, run with RUST_LOG="debug" ./target/release/fqgrep ... with threads set to 2 or 3 depending on whether or not you have paired input.

nh13 (Member, Author) commented Aug 22, 2022

@sstadick the pony I want is "I have N tasks and M processors, where N >> M, and M is a thread pool". I want the user to say "use M threads", and not have it be "the tool will use M + 2 threads", if that makes sense. I thought that spawning threads and installing closures would support that? If a thread is blocked, can't another thread be executed?

sstadick (Contributor) commented:
For a number of complicated reasons the pony you are asking for is async, and probably specifically the tokio runtime.

As far as I can tell, rayon maintains a pool of threads and doles out one task per thread. rayon::spawn adds a task to the queue. None of your tasks finish; they all block on a recv or send. Rayon can't know that a thread is blocked; the OS knows. That is, if you were to massively oversubscribe rayon threads to OS threads, the OS could context switch between them, but rayon can't. This is part of why I think it's fine to not count the IO threads toward --threads: the OS can and will manage them.

Part of the reason rayon can't do this is that suspending the execution of a task and restarting it on another thread means moving memory around, which may or may not be safe.

async specifically solves this problem. It maintains a pool of N threads and allows you to launch M tasks on those threads. The difference is that the baked in .await on an async version of a Receiver notifies the async runtime that the task has hit a breakpoint at which time it can be interrupted. Additionally, running in an async function forces Pin/Unpin on your datatypes, annotating what can and can't be moved around.

Lastly, with regard to not counting IO threads: that is what pigz does as well, see sstadick/gzp#11 (comment).
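For contrast, here is a minimal sketch of the async alternative described above, assuming the tokio runtime (which this PR does not use): the .await points on a bounded channel are where a task yields, so many tasks can share a small, fixed pool of worker threads without deadlocking.

    use tokio::sync::mpsc;

    #[tokio::main(flavor = "multi_thread", worker_threads = 2)]
    async fn main() {
        // Small bounded channel: send().await yields when it is full.
        let (tx, mut rx) = mpsc::channel::<u32>(4);

        // Far more tasks than worker threads; each one parks at .await instead
        // of pinning a thread, so the runtime can interleave them on 2 threads.
        let mut producers = Vec::new();
        for i in 0..8u32 {
            let tx = tx.clone();
            producers.push(tokio::spawn(async move {
                tx.send(i).await.unwrap();
            }));
        }
        drop(tx); // drop the original sender so recv() can observe channel closure

        while let Some(v) = rx.recv().await {
            println!("got {v}");
        }
        for p in producers {
            p.await.unwrap();
        }
    }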

nh13 (Member, Author) commented Aug 22, 2022

@sstadick thanks for the explanation, that makes a lot of sense. In a lot of contexts, like the one you have for gzp, the user will want a hard limit on cores. A job that uses more threads than requested may even be killed (e.g. on clusters), and if we want to run multiple tools on the same instance, we don't want to be surprised by over-subscription.

I do think the right answer is to spawn separate reader/writer threads and call it a day, like you have in crabz and elsewhere.

nh13 (Member, Author) commented Aug 25, 2022

@tfenne I think this is ready for review (@sstadick you're welcome to review too, but under no obligation and no pressure).

nh13 marked this pull request as ready for review on August 25, 2022.
nh13 changed the title from "wip: make it grep-like" to "Make fqgrep more grep-like in its options" on Aug 25, 2022.
sstadick (Contributor) left a review:
This looks awesome!!

Resolved review threads: src/lib/color.rs (outdated), src/lib/matcher.rs, src/lib/matcher.rs (outdated), src/lib/mod.rs (outdated)
nh13 requested a review from tfenne on September 1, 2022.
nh13 (Member, Author) commented Sep 1, 2022

@tfenne in case you want to review it, here it is. I think there are probably non-trivial improvements that can be made to the implementation, but I've stared at it long enough I can no longer see them.
