## Notebook of demos, tutorial

by Darach, @darachm

This notebook is intended to give new users an idea of how `itermae` works to 
accomplish basic parsing, how to debug its usage, how to run it in parallel, 
and how to use multiple matches to do more complex parsing and filtering.

It is intended to be run in the `itermae` repository, so it can access the 
example data and config files in that repo. It uses the `bash_kernel`.

Any commentary or feedback on this tool or documentation would be very valuable 
and very welcome, so please reach out if you have the time to do so.
Submit an [issue on Gitlab](https://gitlab.com/darachm/itermae/-/issues).

## Table of contents:

1. [Installation](#install)
2. [Quickstart](#quickstart)
3. [Usage](#usage)
4. [Debugging](#debugging)
5. Parallel chopping, intended usage
6. Multiple operations

## Install

Install `itermae` with pip.

In [None]:
python3 -m pip install itermae

You can also build and install the package from inside the repository, by using 
the `make install` command. 

Package requirements (beyond base Python >= 3.6) are listed in
`requirements.txt`, including `pytest` for development and testing.

I also recommend installing [GNU `parallel`](https://www.gnu.org/software/parallel/),
as this utility allows you to easily parallelize text-based file inputs. 
`itermae` was designed for the use-case of a sequencing input file being fed by
`parallel` through a bash pipe into the STDIN of `itermae`.

I recommend you [install parallel from source](https://ftp.gnu.org/gnu/parallel/) 
(scroll down to the latest version), as some package managers (Ubuntu, for example) have
changed what the package named `parallel` contains or ship a version that is years out
of date. Alternatively, use the 
[Singularity image for itermae](https://singularity-hub.org/collections/4537),
which has `parallel` installed for use.

## Quickstart <a id="quickstart" />

Copy and modify the command below.

- You can change the `-N 1000` to reflect how many records to 
  feed into the script at a time.
  
- Edit the config file to specify the input/output formats,
  the matches executed, and how the outputs are filtered and
  formed from these matches.

In [48]:
zcat itermae/data/barseq.fastq.gz \
    | parallel --pipe -l 4 -N 1000 \
        itermae --config itermae/data/example_schema.yml \
    | head

NB501157:100:H5J5LBGX2:1:11101:10000:10043	0	*	0	255	*	=	0	0	TCAGTCGTAGCAGTTCGATG	EEEEEEEEEEEEEEEEEEEE	IE:Z:0
NB501157:100:H5J5LBGX2:1:11101:10000:10043	0	*	0	255	*	=	0	0	TTCAC	AAAAA	IE:Z:1
NB501157:100:H5J5LBGX2:1:11101:10000:10043_TTCAC	0	*	0	255	*	=	0	0	TTCACXTCAGTXTCAGTCGTAGCAGTTCGATG	AAAAAIEEEEEIEEEEEEEEEEEEEEEEEEEE	IE:Z:2
NB501157:100:H5J5LBGX2:1:11101:10000:10138	0	*	0	255	*	=	0	0	TGGGCAGACACAACGCTACA	EEEEEEEEEEEEEEEEEEEE	IE:Z:0
NB501157:100:H5J5LBGX2:1:11101:10000:10138	0	*	0	255	*	=	0	0	GCTTC	AAAAA	IE:Z:1
NB501157:100:H5J5LBGX2:1:11101:10000:10138_GCTTC	0	*	0	255	*	=	0	0	GCTTCXTGGGCXTGGGCAGACACAACGCTACA	AAAAAIEEEEEIEEEEEEEEEEEEEEEEEEEE	IE:Z:2
NB501157:100:H5J5LBGX2:1:11101:10000:16613	0	*	0	255	*	=	0	0	GACAGACTGATAACCCTTGC	EEEEEEEEEEEEEEEEEEEE	IE:Z:0
NB501157:100:H5J5LBGX2:1:11101:10000:16613	0	*	0	255	*	=	0	0	GCTTC	AAAAA	IE:Z:1
NB501157:100:H5J5LBGX2:1:11101:10000:16613_GCTTC	0	*	0	255	*	=	0	0	GCTTCXGACAGXGACAGACTGATAACCCTTGC	AAAAAIEEEEEIEEEEEEEEEEEEEEEEEEEE	IE:Z:2
NB501157:1

## Usage <a id="usage" />

Here is a simple example that extracts the first 10 bases of the input sequences.

In [45]:
head -n 8 itermae/data/barseq.fastq | itermae -m "input > (?P<firstTen>.{10})" -os "firstTen"

NB501157:100:H5J5LBGX2:1:11101:10000:10043	0	*	0	255	*	=	0	0	TTCACGTCCT	AAAAAEEEA/	IE:Z:0
NB501157:100:H5J5LBGX2:1:11101:10000:10138	0	*	0	255	*	=	0	0	GCTTCGTCCT	AAAAAEEEE/	IE:Z:0


Of course, the tool has many more options with which to configure
inputs, outputs, matching, and filtering operations.

To show a higher-level example of where we're aiming, 
here is `itermae` used with a configuration file and `parallel`:

In [33]:
zcat itermae/data/barseq.fastq.gz \
    | parallel --pipe -l 4 -N 10 -j 1 \
        itermae --config itermae/data/tutorial_example.yml \
    | head

>NB501157:100:H5J5LBGX2:1:11101:10000:10043_TTCAC
TCAGTCGTAGCAGTTCGATG
>NB501157:100:H5J5LBGX2:1:11101:10000:10138_GCTTC
TGGGCAGACACAACGCTACA
>NB501157:100:H5J5LBGX2:1:11101:10000:16613_GCTTC
GACAGACTGATAACCCTTGC
>NB501157:100:H5J5LBGX2:1:11101:10000:19701_CTACT
GATGCACTGCGTTCCATGTT
>NB501157:100:H5J5LBGX2:1:11101:10000:5096_TAAGT
AGGGCTCGTCGATTCGTCTT


You can also specify the configuration using command-line arguments 
in order to have fewer files:

In [34]:
zcat itermae/data/barseq.fastq.gz \
    | parallel --pipe -l 4 --quote -N 10 -j 1 \
        itermae -m 'input > (?P<sampleIndex>[ATCGN]{5,5})(?P<prefix>GTCCACGAGGTCTCT){e<=2}(?P<barcode>[ATCGN]{18,22})(?P<untitled_1>CGTACGCTG){i<=1,d<=2,s<=2}' \
            -os 'barcode' -oi 'input.id+"_"+sampleIndex.seq' \
            -of 'barcode.length >= 17 and sampleIndex.length >= 3' \
            --output-format fasta \
    | head

>NB501157:100:H5J5LBGX2:1:11101:10000:10043_TTCAC
TCAGTCGTAGCAGTTCGATG
>NB501157:100:H5J5LBGX2:1:11101:10000:10138_GCTTC
TGGGCAGACACAACGCTACA
>NB501157:100:H5J5LBGX2:1:11101:10000:16613_GCTTC
GACAGACTGATAACCCTTGC
>NB501157:100:H5J5LBGX2:1:11101:10000:19701_CTACT
GATGCACTGCGTTCCATGTT
>NB501157:100:H5J5LBGX2:1:11101:10000:5096_TAAGT
AGGGCTCGTCGATTCGTCTT


Below I will discuss what each option is and how to specify it with
command-line arguments and in YAML.

### Input options

* Input source - where are the input reads coming from?

    The default is 'STDIN', which tells the tool to expect reads piped ("`|`")
    in. Alternatively, you can specify a file path to open.
    
    Specify with the command-line argument `--input` or `-i`, or set the YAML 
    key `from:` inside the `input:` block.
    
    
* Is the input gzip compressed?

    The default is no. Files specified can be gzipped, but compressed data
    cannot be piped in through the default 'STDIN' (pass it through `zcat` first).
    
    Specify with the command-line argument `--gzipped` or `-z`, 
    or set the YAML key `gzipped:` inside the `input:` block to 
    `true` or `false`.
    
    
* What format are the input records in?

    The default is 'fastq'. This tool also handles:
    
    - 'FASTA' files, where per-base quality scores are assumed to be the
        maximum value.
    - 'SAM' files, but all flags are ignored and not included in the 
        outputs. Similarly, all alignment information is forgotten.
    - 'TXT' files, where there is a single sequence per line. The ID of
        the record is defined by the sequence, and all qualities are 
        assumed to be the maximum value.
    - Any other format that BioPython SeqIO will recognize.
    
  All formats specified are case-insensitive.
    
  Specify with the command-line argument `--input-format`, or set the YAML 
  key `format:` inside the `input:` block.


### Matches

Matches are the heart of the tool. `itermae` simply uses the `regex` library,
and is essentially a utility to arrange `regex` fuzzy expression matches on
`Biopython SeqIO` objects such that the groups are handled well and
filtered/formatted appropriately for amplicon sequencing analyses.

However, exposing this flexibility makes its use complicated. There are two
ways of defining a match - the YAML method may be simpler if you're not used to
regular expressions.

- With command-line arguments:

  Specify one match per `-m` or `--match` argument. This requires that you
  write a regular expression in which capture groups are named using the 
  `(?P<nameOfCaptureGroup>ATCGgroupPatternATCG)` syntax. You can also use the
  `regex` fuzzy matching notation, for example `{e<=2}` to 
  permit 2 or fewer errors in the preceding match group. See examples.
  
        
- With YAML config file:
    
  Alternatively, the YAML configuration mode offers a simpler, although
  bulkier, syntax. Inside a YAML configuration file, you specify a dictionary
  with the `matches:` top-level key. Inside this, using indented `-`'s,
  you specify an array of matches. Each needs a value behind the
  key `use:` to denote what is matched against (by default `input`). This
  can also be a group captured from a previous match. 
  
  Inside each match, 
  you specify the pattern to match behind the key `pattern:`. This
  recognizes IUPAC codes, so `A` is `A`, while `R` matches `A` or `G`.
  Additionally `*` matches zero or more of anything, while `+` matches one 
  or more of anything.
  For each character of the pattern, you also need to specify a `marking:`
  string specifying what group each character belongs to. This must be the
  same length as the `pattern:`, and each different character denotes a 
  different group.
  
  Example:
  
      pattern: NGTCCACGAGGTCTCTNCGTACGCTG
      marking: ABBBBBBBBBBBBBBBCDDDDDDDDD

  Denotes a pattern of group `A` with any base, group `B` with a specific
  sequence, group `C` with any base, and group `D` with a specific sequence.
  
  Then, in the key `marked_groups:`, you can specify for each of these groups
  what the parameters of the match are. Specifically, you can give the
  group a `name:` when matched, you can denote that the group must `repeat:`
  a certain number of times or a `repeat_min:` and `repeat_max:` number of
  times. You can set permissible errors with `allowed_errors:`, or specify
  this for specific types of errors - `allowed_insertions:`, 
  `allowed_deletions:`, `allowed_substitutions:`.
  
  Here is a full example:
  
      matches:
        - use: input
          pattern: NGTCCACGAGGTCTCTNCGTACGCTG
          marking: ABBBBBBBBBBBBBBBCDDDDDDDDD
          marked_groups:
            A:
                name: sampleIndex
                repeat: 5
            B:
                name: prefix
                allowed_errors: 2
            C:
                name: barcode
                repeat_min: 18
                repeat_max: 22
            D:
                allowed_insertions: 1
                allowed_deletions: 2
                allowed_substitutions: 2
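
To make the pairing of `pattern:` and `marking:` concrete, here is a minimal Python sketch that splits a pattern into runs according to its marking string. This only illustrates the idea and is not `itermae`'s actual implementation; the helper name `split_by_marking` is hypothetical.

```python
from itertools import groupby

def split_by_marking(pattern, marking):
    """Split `pattern` into (group_letter, sub_pattern) runs using `marking`."""
    if len(pattern) != len(marking):
        raise ValueError("pattern and marking must be the same length")
    runs = []
    position = 0
    # groupby collects consecutive identical marking characters into one run
    for letter, chars in groupby(marking):
        run_length = len(list(chars))
        runs.append((letter, pattern[position:position + run_length]))
        position += run_length
    return runs

runs = split_by_marking(
    "NGTCCACGAGGTCTCTNCGTACGCTG",
    "ABBBBBBBBBBBBBBBCDDDDDDDDD",
)
for letter, sub_pattern in runs:
    print(letter, sub_pattern)
```

Each run is then what a `marked_groups:` entry such as `A:` or `B:` refers to.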


### Output options

* Output - where to write successful outputs to?

    By default, `itermae` writes to standard output, so results can be piped
    ("`|`") on to other tools or redirected to a file. Alternatively, you can
    specify an output file path.
    
    Specify with the command-line argument `--output`.
    
    
* What format are the output records in?

    The default is 'SAM', which is handy for tabular file operations on the
    command line. 'FASTQ' and 'FASTA' are also available.
    
    Specify with the command-line argument `--output-format`.



If the operation or the output sequence definition is missing, then the program fails and tells you where it has a problem. You can omit the output ID, in which case the input ID is used by default. Here is an example of a failure:

In [None]:
cat itermae/data/toy.fastq | itermae -o "input > " -oseq "input" 

One more thing: I'm going to use the `-of` argument to change the output to FASTA format.
The default is SAM (because it's handy for tabular file operations on the command line),
but FASTA is simpler and a little easier to read here. For example:

In [None]:
cat itermae/data/toy.fastq | itermae -o "input > ." -oseq "input" -of "fasta"

### Operations

How do you write an operation? Each operation has two parts. 
The part on the left is what the operation is applied to, and the part on the right is the 
regular expression to apply. Between these is ` > ` (that is: a space, a greater-than sign, a space).

#### inputs
For your first operation, the input has to be `input` because that's the only available sequence, the input sequence of the sequence record. If you use multiple operations, you can feed groups "captured" from previous operations as inputs to later operations (we talk about this later in this guide).
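
As a rough illustration of the `source > pattern` layout, here is a small Python sketch that splits an operation string at the first ` > `. This is not `itermae`'s own parser; `split_operation` is a hypothetical helper used only to show the two parts.

```python
def split_operation(operation):
    """Split 'source > pattern' at the first ' > ' into (source, pattern)."""
    source, separator, pattern = operation.partition(" > ")
    if not separator:
        raise ValueError("operation must contain ' > ' between source and pattern")
    # Strip padding so that "rest  > ..." still names the group "rest"
    return source.strip(), pattern

source, pattern = split_operation("input > (?P<firstTen>.{10})")
print(source)
print(pattern)

later_source, later_pattern = split_operation("rest  > (?P<barcode>[ATCGN]{18,22})")
print(later_source)
```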

#### regular expressions

Then there's the regular expression. These are complex, and `itermae` is designed to expose this complexity 
to the user directly so that you can do more with the tool! Here, I will try to explain a cookbook/copy-paste 
version that you can use if you're new to regexes and need to trial-and-error the result you want:

The pattern you define is matched against the read. So the pattern `.` will match one letter of anything, 
which is why I used it in the above examples (it always matches!). Putting an `ATCG` will only match when
there's exactly an `ATCG` in the read: anywhere in the read, but exactly those four letters in that order.
If you want to match only one character of DNA, you can use `[ATCG]` to match any one of those four letters. 
*Always use uppercase for `itermae`!* If you want to include `N`, you can do that with `[ATCGN]`. 

To match two letters, you could write `[ATCGN][ATCGN]`, or more easily `[ATCGN]{2,2}`. 
The `{}` braces help you specify how many to match. 
You can specify an exact count, `{2,2}`, or a range, `{2,5}`.
So for a DNA barcode of length 18-22 bases, you may want to specify `[ATCGN]{18,22}`.
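
If you'd like to try these building blocks outside of `itermae`, here is a minimal sketch using Python's standard `re` module. `itermae` itself uses the `regex` library, but I'm assuming the basic character-class and `{}` syntax shown here behaves the same in both.

```python
import re

# One character class matches exactly one base; note that lowercase does not match.
print(bool(re.fullmatch(r"[ATCGN]", "A")))
print(bool(re.fullmatch(r"[ATCGN]", "a")))

# `{2,2}` repeats the preceding item exactly twice, same as writing it out twice.
print(bool(re.fullmatch(r"[ATCGN][ATCGN]", "GN")))
print(bool(re.fullmatch(r"[ATCGN]{2,2}", "GN")))

# A range accepts any length between the two bounds, inclusive.
barcode_pattern = re.compile(r"[ATCGN]{18,22}")
results = {length: barcode_pattern.fullmatch("A" * length) is not None
           for length in (17, 18, 22, 23)}
print(results)
```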

`(` `)` parentheses are used to denote groups, like they're used in math. 
Capture groups use these `()` but also use `?P` and `<>` brackets to name the group. 
A named capture group uses notation like `(?P<barcode>[ATCGN]{18,22})`, where 
`(?P<` is the start, `barcode` is the name of the group, and the group is whatever 
matches to `[ATCGN]{18,22}`, until the `)`. It's complex syntax, but necessary to clearly
denote what is the name of the group and what is the group.

To match this barcode only when it is between defined sequences, for example priming sites, you can use a more
complex pattern like `GGTCTCT(?P<barcode>[ATCGN]{18,22})CGTACGCT`. This matches a sequence where:
- There is a `GGTCTCT`
- then 18 to 22 bases of either A, T, C, G, or N
- then `CGTACGCT`

Crucially, it then saves that center 18 to 22 bases as a group named `barcode`.
That means you can then only output that group, or you can put it into the output ID, or you can even match
other operation patterns against _just that group_.
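
Here is a minimal sketch of that named capture group with Python's standard `re` module, so you can see what gets "saved". The read below is made up for illustration, and exact matching is used (the standard library has no fuzzy matching).

```python
import re

# Made-up read: 5-base index, left flank, 20-base barcode, right flank, extra bases.
read = "TTCAC" + "GGTCTCT" + "TCAGTCGTAGCAGTTCGATG" + "CGTACGCT" + "AAAA"

match = re.search(r"GGTCTCT(?P<barcode>[ATCGN]{18,22})CGTACGCT", read)

# The named group holds only the bases between the two fixed sequences.
barcode = match.group("barcode")
start, end = match.span("barcode")
print(barcode, start, end)
```

The flanking sequences are required for the match, but only the `barcode` group is kept for output.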

For example (using a `\` to continue it onto a new line):

In [None]:
cat itermae/data/toy.fastq \
    | itermae -o "input > GGTCTCT(?P<barcode>[ATCGN]{18,22})CGTACG" -oseq "barcode" -of "fasta"

This tries to match the pattern on every read and, only where it matches, writes the matched 
`barcode` group to the output. Note that the operation starts with `input > ` and the output is selected with `-oseq "barcode"`. 
Important!

Note that five reads are output.

#### fuzzy regular expressions

However, errors are common in sequencing. Substitutions, whether they come from a mutation, PCR misincorporation,
or an erroneous basecall, can interfere with your ability to use defined sequence to parse your read.
Indels are rarer in Illumina sequencing, but any early indel will shift the position of the later sequence.
Barcode/amplicon libraries also have size heterogeneity from construction and propagation.

To handle this, we can use "fuzzy" regular expressions. These are regular expressions, but they're approximate,
so they can tolerate single-base changes or errors in length. `itermae` uses the `regex` module to do
fuzzy regular expression matching. This is an excellent library that provides a new syntax to specify
the "fuzziness" of a group.

To write the above example in another way, we add the modifier `{e<=1}` after the group. 
This means that one or fewer errors of any kind (insertion, deletion, substitution) are tolerated
in the preceding group.
So we can change our operation to `input > (GGTCTCT){e<=1}(?P<barcode>[ATCGN]{18,22})`, and this will
tolerate one or fewer errors in the `GGTCTCT` just in front of the barcode. Note the parentheses are necessary.

In [None]:
cat itermae/data/toy.fastq \
    | itermae -o "input > (GGTCTCT){e<=1}(?P<barcode>[ATCGN]{18,22})CGTACG" -oseq "barcode" -of "fasta"

Ah, see there are six that match now. Which one did not match before? To demonstrate, I'll name that first
group `fixed` and output that.

In [None]:
cat itermae/data/toy.fastq \
    | itermae -o "input > (?P<fixed>GGTCTCT){e<=1}(?P<barcode>[ATCGN]{18,22})" -oseq "fixed" -of "fasta"

So there's a `GGTCTAT` in there, the second one. 
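
To see why `GGTCTAT` is accepted under `{e<=1}`, here is a small standard-library sketch that counts substitutions between the expected sequence and each window of a read. This is a deliberate simplification and not how the `regex` module works internally: it handles substitutions only, while `{e<=1}` also tolerates insertions and deletions. The function names and the read are made up for illustration.

```python
def count_substitutions(expected, observed):
    """Count positions where two equal-length sequences differ."""
    if len(expected) != len(observed):
        raise ValueError("sequences must be the same length")
    return sum(1 for a, b in zip(expected, observed) if a != b)

def find_with_substitutions(expected, read, max_errors):
    """Return (start, window, errors) for the first window within max_errors, else None."""
    width = len(expected)
    for start in range(len(read) - width + 1):
        window = read[start:start + width]
        errors = count_substitutions(expected, window)
        if errors <= max_errors:
            return start, window, errors
    return None

# Made-up read carrying `GGTCTAT` where `GGTCTCT` is expected.
read = "TTCAC" + "GTCCACGA" + "GGTCTAT" + "TCAGTCGTAGCAGTTCGATG" + "CGTACG"

exact = find_with_substitutions("GGTCTCT", read, max_errors=0)
fuzzy = find_with_substitutions("GGTCTCT", read, max_errors=1)
print(exact)  # no window matches exactly
print(fuzzy)  # the `GGTCTAT` window is found with one substitution
```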

I can also append this group to the input ID, and then have the barcode be
the main output sequence again. Note that changing the ID means you have to access attributes like 
`.id` or `.seq` from each captured group, as they're handled like BioPython SeqRecords internally. 
And the below also puts a `+'_'+` between them for readability.

In [None]:
cat itermae/data/toy.fastq \
    | itermae -o "input > (?P<fixed>GGTCTCT){e<=1}(?P<barcode>[ATCGN]{18,22})" \
        -oseq "barcode" -oid "input.id+'_'+fixed.seq" -of "fasta"

You can build on this to make it more complex. These particular reads actually have a five base index 
at the beginning that denotes the sample, so below I'll extract that and append it to the read ID, then output
the barcode, while fuzzy matching the priming sites.

In [None]:
cat itermae/data/toy.fastq \
    | itermae \
        -o "input > (?P<sampleIndex>[ATCGN]{5,5})(GTCCACGAGGTCTCT){e<=2}(?P<barcode>[ATCGN]{18,22})(CGTACGCTG){e<=1}" \
        -oseq "barcode" -oid "input.id+'_'+sampleIndex.seq" -of "fasta"

This should get you going, but regular expressions are a powerful and common feature in programming and 
analysis. For more, I recommend you search for "python regular expression tutorial" 
or the like and read a few articles that talk about this, then consult the `regex` package documentation 
for more information about those specific modifiers.

### Filters

Note that some barcodes are different sizes. We can also add filters, here keeping only barcodes 
that start on base 20 and are exactly 20 bases long 
(although we could do this in the pattern, here we use a filter).

In [None]:
cat itermae/data/toy.fastq \
    | itermae \
        -o "input > (?P<sampleIndex>[ATCGN]{5,5})(GTCCACGAGGTCTCT){e<=2}(?P<barcode>[ATCGN]{18,22})(CGTACGCTG){e<=1}" \
        -oseq "barcode" -oid "input.id+'_'+sampleIndex.seq" -of "fasta" \
        --filter "barcode.start == 20 and barcode.length == 20"

A more useful example would be filtering on quality. Quality scores are available through the
`.quality` attribute of the sequence in question. 

You can also access other attributes of each group using the group name.
(Internally, this references an object with the attributes `start`, `end`,
`length`, `quality`, and `seq`.)
You can use these together in filters. You can't filter on the raw `input`
(or on `dummyspacer`, for that matter).
See the `apply_filter()` function in the `__init__.py` file for details.

To make building these easier, the `statistics` package is loaded. 
So you can do things like filtering on
`min(barcode.quality) >= 33` or `statistics.median(barcode.quality) > 34 and barcode.length >= 16`.

For debugging, you can print the thing you're filtering and see that spit
out. So something like `print(statistics.median(barcode.quality))`
(this is using `eval` directly by the way, so be responsible).
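
To get a feel for how such filter expressions evaluate, here is a minimal Python sketch. It is not `itermae`'s `apply_filter()`: the `Group` class, the `passes_filter` helper, and the quality values are hypothetical stand-ins that only mimic the attributes listed above.

```python
import statistics
from dataclasses import dataclass

@dataclass
class Group:
    """Hypothetical stand-in for a captured group (not itermae's real class)."""
    seq: str
    quality: list
    start: int

    @property
    def end(self):
        return self.start + len(self.seq)

    @property
    def length(self):
        return len(self.seq)

def passes_filter(expression, groups):
    """Evaluate a filter expression with the groups and `statistics` in scope."""
    namespace = {"statistics": statistics, **groups}
    # eval runs arbitrary Python, so only use expressions you wrote yourself
    return bool(eval(expression, namespace))

# Made-up barcode: 19 bases at quality 36 and one at quality 32.
barcode = Group(seq="TCAGTCGTAGCAGTTCGATG", quality=[36] * 19 + [32], start=20)
groups = {"barcode": barcode}

print(passes_filter("barcode.start == 20 and barcode.length == 20", groups))
print(passes_filter("min(barcode.quality) >= 33", groups))
print(passes_filter(
    "statistics.median(barcode.quality) > 34 and barcode.length >= 16", groups))
```

The second filter fails because a single low-quality base drags down `min()`, while the median-based filter still passes.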

In [None]:
cat itermae/data/toy.fastq \
    | itermae \
        -o "input > (?P<sampleIndex>[ATCGN]{5,5})(GTCCACGAGGTCTCT){e<=2}(?P<barcode>[ATCGN]{18,22})(CGTACGCTG){e<=1}" \
        -oseq "barcode" -oid "input.id+'_'+sampleIndex.seq" -of "fastq" \
        --filter "statistics.median(barcode.quality) >= 34 and min(barcode.quality) < 36"

### Output format

You can change the output format using `-of` or `--output-format` and the formats SAM, FASTQ, or FASTA. 
You can then write that to a file by using the `>` operator in the shell.

In [None]:
cat itermae/data/toy.fastq \
    | itermae \
        -o "input > (?P<sampleIndex>[ATCGN]{5,5})(GTCCACGAGGTCTCT){e<=2}(?P<barcode>[ATCGN]{18,22})(CGTACGCTG){e<=1}" \
        -oseq "barcode" -oid "input.id+'_'+sampleIndex.seq" -of "fastq" \
        --filter "statistics.median(barcode.quality) >= 34 and min(barcode.quality) < 36" \
    > out.fastq

head -n 4 out.fastq
rm out.fastq

So that's "usage". 

Usually, you can use the debugging features below to build up multiple operations that parse and filter
capture groups, and then output a SAM file. That file can be split based on which sequence is output, what index
it has, etc., and then you can cluster/tabulate barcodes with another tool.

## Debugging

A lot of what `itermae` is doing depends on the regular expressions, so it can get a bit complex and debugging
effectively is important. One strategy is to take a sample of reads
from the middle of the file (or at least not the first ~100,000 of a FASTQ file) and feed in a hundred or so reads
at a time to see how it works.

To help instrument how the operations you define are working, there's a `--report` feature.
This argument specifies a file path where `itermae` will generate a report for each read, reporting if it was
successful or not, what the input read was, what filters it passed and where it matched.

There's also a `--failed` argument, a path where `itermae` puts the failed reads (for whatever reason)
for later inspection.

Additionally, there's the `-v` or `--verbose` argument. A single `-v` outputs information about the run to standard error
(so that standard output still works fine for piping output), and each additional `-v` increases the level
of reporting to individual reads and individual operations.

Here's an example with the first level of verbosity, outputting a report and a failed reads file.

In [None]:
cat itermae/data/toy.fastq | itermae \
        -o "input > (?P<sampleIndex>[ATCGN]{5,5})(GTCCACGAGGTCTCT){e<=2}(?P<barcode>[ATCGN]{18,22})(CGTACGC){e<=1}" \
        -oseq "barcode" -oid "input.id+'_'+sampleIndex.seq" -of "sam" \
        --report report.csv --failed failed.fastq -v \
    > out.sam

head out.sam
rm out.sam

echo "The report:"
head -n 4 report.csv
rm report.csv
echo
echo "The first of the failed reads:"
head -n 4 failed.fastq
rm failed.fastq

We see that the third record (at the fake position 10002) failed because the output could not be formed.
Looking at the failed read, we see that it lacks the pattern we're looking for.

For the report CSV, the columns are:

- passed or not
- input ID
- input sequence
- output sequence, if successful
- the result of any filters (later), which here is just True
- a string that denotes each group matched, where it starts, ends, and length

## Parallel chopping, intended usage

### Input/output

`itermae` is based on a previous tool, but this implementation has removed any pretense of trying to
handle parallelization internally. Instead, this tool just takes a stream of FASTQ on standard input and parses it.
You can readily parallelize by using the command-line tool GNU `parallel` (yes I'll cite it).
This design limits the complexity of the code and prevents a certain _fuzzy_ memory leak...

However, there is a single-thread ability to read from a file on disk using `--input`, and that can be indicated
as being gzipped with the flag `-z`. Similarly, `itermae` is designed to output to standard output, but you can
specify an output file with `--output`. 

Below is an example of using `parallel` to parallelize the above example. 
Of course, you'll need `parallel` installed and available:

In [None]:
cat itermae/data/toy.fastq \
    | parallel --pipe -L 4 \
        'itermae -o "input > (?P<sampleIndex>[ATCGN]{5,5})(GTCCACGAGGTCTCT){e<=2}(?P<barcode>[ATCGN]{18,22})(CGTACGCTG){e<=1}" \
            -oseq "barcode" -oid "input.id+\"_\"+sampleIndex.seq" -of "fasta" ' 

The `--pipe` argument specifies to pipe the input into `itermae`, and to do that in chunks of `-L` size.
You will probably want to use a chunk size of 10-100 thousand, but in multiples of 4 (since it's a FASTQ)!
So something like `parallel --pipe -L 40000` is what you'd want to use on real data.
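
The multiple-of-4 rule comes from each FASTQ record spanning four lines. Here is a tiny Python sketch for working out or checking a chunk size; the helper names are hypothetical and just make the arithmetic explicit.

```python
FASTQ_LINES_PER_RECORD = 4

def lines_for_records(records_per_chunk):
    """Number of lines per chunk so that no FASTQ record is split."""
    return records_per_chunk * FASTQ_LINES_PER_RECORD

def is_safe_chunk(lines_per_chunk):
    """True when a chunk of this many lines ends on a record boundary."""
    return lines_per_chunk % FASTQ_LINES_PER_RECORD == 0

print(lines_for_records(10_000))  # lines needed for 10 thousand records
print(is_safe_chunk(40_000))
print(is_safe_chunk(50_001))
```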

Then you put the `itermae` command in quotes. Here I've used `'` on the outside. Note that if I need to `"`
something inside an argument building the output sequence, I need to escape that with a `\`.

You can also restrict the jobs to run with arguments like `-j`, so consult the `parallel` manual for more info.
I usually leave it to figure things out for me, and it makes full use of the machine.

## Multiple operations

You can add multiple operations. Each sequential operation added on in the command line is executed
one after another. This allows you to take one capture group and search within it for more groups.
The big advantage of this is speed (less combinatorial complexity of mismatches and degeneracy)
and less complicated regular expressions.

For example, here's the above example but broken into two operations. The first takes the sample index, the
second takes the barcode. Note that I introduce `*` to mean 0 or more (usually much more) of `[ATCGN]`.
Also, I put the `GTCCACGAG` group, with an error `{e<=1}` expression, inside the `rest` capture group.

In [None]:
cat itermae/data/toy.fastq | itermae \
        -o "input > (?P<sampleIndex>[ATCGN]{5,5})(?P<rest>(GTCCACGAG){e<=1}[ATCGN]*)" \
        -o "rest  > (GTCCACGAGGTCTCT){e<=2}(?P<barcode>[ATCGN]{18,22})(CGTACGCTG){e<=1}" \
        -oseq "barcode" -oid "input.id+\"_\"+sampleIndex.seq" -of "fasta"

Note that you can also use multiple outputs. For example, here I'll output the barcode with and without
the priming sites around it.

In [None]:
cat itermae/data/toy.fastq | itermae \
        -o "input > (?P<sampleIndex>[ATCGN]{5,5})(?P<rest>(GTCCACGAG){e<=1}[ATCGN]*)" \
        -o "rest  > (?P<upPrime>GTCCACGAGGTCTCT){e<=2}(?P<barcode>[ATCGN]{18,22})(?P<downPrime>CGTACGCTG){e<=1}" \
        -oseq "barcode" -oid "input.id+\"_\"+sampleIndex.seq" \
        -oseq "upPrime+barcode+downPrime" -oid "input.id+\"_withFixedFlanking_\"+sampleIndex.seq" \
        -of "fasta"

Doesn't that look nicer? And it runs a bit faster, since you avoid searching through combinations of all
possible errors in the front and back parts, especially with complex parsing schemes.
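
The chaining idea can also be sketched in plain Python with the standard `re` module: the second pattern is searched only within the `rest` group captured by the first. This is an illustration under simplifying assumptions, not `itermae`'s code: the read is made up, and matching is exact because the standard library has no fuzzy `{e<=1}` syntax.

```python
import re

# Made-up read: sample index, upstream priming site, barcode, downstream priming site.
read = "TTCAC" + "GTCCACGAGGTCTCT" + "TCAGTCGTAGCAGTTCGATG" + "CGTACGCTG" + "CAGG"

# Operation 1: applied to the whole input, captures `sampleIndex` and `rest`.
first = re.search(r"(?P<sampleIndex>[ATCGN]{5,5})(?P<rest>GTCCACGAG[ATCGN]*)", read)

# Operation 2: applied only to the `rest` group captured above.
second = re.search(
    r"(?P<upPrime>GTCCACGAGGTCTCT)(?P<barcode>[ATCGN]{18,22})(?P<downPrime>CGTACGCTG)",
    first.group("rest"),
)

sample_index = first.group("sampleIndex")
barcode = second.group("barcode")
with_flanks = second.group("upPrime") + barcode + second.group("downPrime")
print(sample_index, barcode)
print(with_flanks)
```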