## Notebook of demos, tutorial

This notebook is intended to give new users an idea of how `itermae` works to accomplish basic parsing, how to debug its usage, how to run it in parallel, and how to use multiple operations to do more complex parsing and filtering.

1. Basic usage
2. Debugging
3. Parallel chopping, intended usage
4. Multiple operations

## Basic usage

`itermae` expects a FASTQ file piped into the standard input (STDIN) of the program,
likely using the `|` character. It then applies one or more operations (denoted by `-o` or `--operation`)
to each read. For each read where the operation was successful, it generates an output sequence record
with a sequence defined by the `-oseq` or `--output-seq` argument
and an optional sequence ID defined by the `-oid` or `--output-id` argument.
The output format is by default a SAM file.

Here is a minimal example that does nothing interesting, just outputs the input in SAM format:

In [None]:
cat example-data/toy.fastq | itermae -o "input > ." -oseq "input"

If the operation or the output sequence definition is missing or incomplete, then the program fails and tells you where it has a problem. You can omit the output ID and it will use the input ID by default. Here is an example that fails, because the operation has no pattern after the ` > `:

In [None]:
cat example-data/toy.fastq | itermae -o "input > " -oseq "input"

### operations

How do you write an operation? Each operation has two parts, joined by a separator.
The part on the left is what the operation is applied to, the part on the right is the
regular expression to apply, and between these is ` > ` (that is: a space, a greater-than sign, a space).
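
To make that structure concrete, here is a minimal Python sketch that splits an operation string on the ` > ` separator. The helper name `split_operation` is hypothetical and this is not `itermae`'s own parser; it only illustrates the "input, separator, pattern" layout described above.

```python
def split_operation(operation: str) -> tuple[str, str]:
    """Split 'INPUT > PATTERN' into (input_name, pattern).

    Hypothetical helper for illustration only, not part of itermae.
    """
    input_name, separator, pattern = operation.partition(" > ")
    if not separator or not input_name.strip() or not pattern.strip():
        raise ValueError(f"expected 'INPUT > PATTERN', got: {operation!r}")
    return input_name.strip(), pattern.strip()


# The minimal operation used earlier: apply the pattern '.' to 'input'.
print(split_operation("input > ."))

# An operation with nothing after the separator is rejected.
try:
    split_operation("input > ")
except ValueError as error:
    print("rejected:", error)
```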

#### inputs
For your first operation, the input has to be `input` because that's the only available sequence, the input sequence of the sequence record. If you use multiple operations, you can feed groups "captured" from previous operations as inputs to later operations (we talk about this later in this guide).

#### regular expressions

Then there's the regular expression. These are complex, and `itermae` is designed to expose this complexity 
to the user directly so that you can do more with the tool! Here, I will try to explain a cookbook/copy-paste 
version that you can use if you're new to regexes and need to trial-and-error the result you want:

The pattern you define is matched against the read. So the pattern `.` will match one letter of anything,
which is why I used it in the above examples (it always matches!). Putting in `ATCG` will only match when
there's exactly an `ATCG` in the read, anywhere in the read, but exactly those four letters in that order.
If you want to match only one character of DNA, you can use `[ATCG]` to match any one of those four letters.
*Always use uppercase for `itermae`!* If you want to include `N`, you can do that with `[ATCGN]`.

To match two letters, you could write `[ATCGN][ATCGN]`, or more easily `[ATCGN]{2}`. 
The `{}` braces help you specify how many to match. 
You can specify one number `{3}`, or a range `{2,5}`.
So for a DNA barcode of length 18-22 bases, you may want to specify `[ATCGN]{18,22}`.
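
If you want to try these building blocks outside of `itermae`, Python's standard `re` module uses the same basic syntax. The sketch below runs the patterns from this section against a made-up 14-base read (the read is an assumption, not from the example data).

```python
import re

read = "TTACGATCGGANCT"  # made-up 14-base read, with one N at position 11

any_base = re.search(r".", read)             # '.' matches any one character
exact = re.search(r"ATCG", read)             # exactly these four letters, in order
no_n = re.search(r"[ATCG]{12}", read)        # needs 12 bases in a row without N
with_n = re.search(r"[ATCGN]{12}", read)     # same, but N is allowed
short = re.search(r"[ATCGN]{2,5}", read)     # a range takes as many as it can, up to 5
too_long = re.search(r"[ATCGN]{18,22}", read)  # the read is shorter than 18 bases

print(any_base.group())    # the first letter of the read
print(exact.start())       # position where ATCG begins
print(no_n is None)        # True, the N interrupts the run
print(with_n.group())      # the first 12 bases
print(short.group())       # the first 5 bases
print(too_long is None)    # True, no match at all
```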

`(` `)` parentheses are used to denote groups, like they're used in math. 
Capture groups use these `()` but also use `?P` and `<>` brackets to name the group. 
A named capture group uses notation like `(?P<barcode>[ATCGN]{18,22})`, where 
`(?P<` is the start, `barcode` is the name of the group, and the group is whatever 
matches to `[ATCGN]{18,22}`, until the `)`. It's complex syntax, but necessary to clearly
denote what is the name of the group and what is the group.

To match this barcode only when it is between defined sequences, for example priming sites, you can use a more
complex pattern like `GGTCTCT(?P<barcode>[ATCGN]{18,22})CGTACGCT`. This matches a sequence where:
- There is a `GGTCTCT`
- then 18 to 22 bases of either A, T, C, G, or N
- then `CGTACGCT`

Crucially, it then saves that center 18 to 22 bases as a group named `barcode`.
That means you can then only output that group, or you can put it into the output ID, or you can even match
other operation patterns against _just that group_.
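
As a quick check of how the named group behaves, here is a small sketch with Python's standard `re` module. The read and barcode are made up for illustration; only the pattern comes from the text above.

```python
import re

pattern = re.compile(r"GGTCTCT(?P<barcode>[ATCGN]{18,22})CGTACGCT")

# Made-up read: 5 bases, the first fixed sequence, a 20-base barcode,
# the second fixed sequence, then 4 trailing bases.
barcode = "ACGTACGTTGCAACGTTGCA"
read = "CATGA" + "GGTCTCT" + barcode + "CGTACGCT" + "GATC"

match = pattern.search(read)
print(match.group("barcode"))  # only the bases between the fixed sequences
print(match.span("barcode"))   # where the group starts and ends in the read

# Without the flanking sequences there is no match, so nothing is captured.
print(pattern.search("CATGA" + barcode))
```

The group can be pulled out by name, which is the same idea as passing `barcode` to `-oseq`.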

For example (using a `\` to continue it onto a new line):

In [None]:
cat example-data/toy.fastq \
    | itermae -o "input > GGTCTCT(?P<barcode>[ATCGN]{18,22})CGTACG" -oseq "barcode"

This tries to match the pattern on every read, and only where it matches does it write the matched
`barcode` group to the output. Note that the operation starts with `input > ` and that the output
is selected with `-oseq "barcode"`. This is important!

Note that 5 are output.

#### fuzzy regular expressions

However, errors are common in sequencing. Substitutions, whether they come from a mutation, PCR misincorporation,
or an erroneous basecall, can interfere with your ability to use a defined sequence to parse your read.
Indels are rarer in Illumina sequencing, but any early indel will shift the position of the later sequence.
Barcode/amplicon libraries also have size heterogeneity from construction and propagation.

To handle this, we can use "fuzzy" regular expressions. These are regular expressions, but they're approximate.
They can tolerate single-base changes or errors in length. `itermae` uses the `regex` module to do
fuzzy regular expression matching. This is an excellent library that provides a new syntax to specify
the "fuzziness" of a group.

To write the above example in another way, we add the modifier `{e<=1}` after the group.
This means that at most one error of any kind (insertion, deletion, or substitution) is tolerated
in the preceding group.
So we can change our operation to `input > (GGTCTCT){e<=1}(?P<barcode>[ATCGN]{18,22})`, and this will
tolerate one or fewer errors in the `GGTCTCT` just in front of the barcode. Note the parentheses are necessary.
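
To get a feel for what counts as "one error", the sketch below computes the edit (Levenshtein) distance between an observed sequence and the expected `GGTCTCT`, using plain Python. This is only an illustration of error counting: the `regex` module has its own matching engine and also decides where in the read the fuzzy group sits, which this sketch does not attempt. The observed sequences are made up.

```python
def edit_distance(a: str, b: str) -> int:
    """Fewest substitutions, insertions, and deletions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


expected = "GGTCTCT"
for observed in ["GGTCTCT", "GGTCTAT", "GGTCTT", "GGTATAT"]:
    errors = edit_distance(observed, expected)
    verdict = "within e<=1" if errors <= 1 else "too many errors for e<=1"
    print(observed, errors, verdict)
```

An exact copy has zero errors, a single substitution or a single missing base has one, and two substitutions have two, which `{e<=1}` would not tolerate.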

In [None]:
cat example-data/toy.fastq \
    | itermae -o "input > (GGTCTCT){e<=1}(?P<barcode>[ATCGN]{18,22})CGTACG" -oseq "barcode"

Ah, see, there are six that match now. Which one did not match before? To demonstrate, I'll name that first
group `fixed` and output that.

In [None]:
cat example-data/toy.fastq \
    | itermae -o "input > (?P<fixed>GGTCTCT){e<=1}(?P<barcode>[ATCGN]{18,22})" -oseq "fixed"

So there's a `GGTCTAT` in there, the second one. 

I can also append this group to the input ID, and then have the barcode be
the main output sequence again. Note that changing the ID means you have to access attributes like 
`.id` or `.seq` from each captured group, as they're handled like BioPython SeqRecords internally. 
And the below also puts a `+'_'+` between them for readability.
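
To see why the `.id` and `.seq` attributes are needed, here is a rough Python sketch of how such an ID expression could be evaluated. It assumes the expression is treated as a Python expression with each group name bound to a record-like object. Whether `itermae` does exactly this internally is an assumption, and the `SimpleNamespace` objects are hypothetical stand-ins for the real records, with made-up values.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the per-group records (values are made up).
groups = {
    "input": SimpleNamespace(id="read_0001", seq="TTAGCGGTCTATACGTACGTTGCAACGTTGCA"),
    "fixed": SimpleNamespace(id="read_0001", seq="GGTCTAT"),
}

# The expression from the -oid argument, evaluated with the group names in scope.
expression = "input.id+'_'+fixed.seq"
output_id = eval(expression, {"__builtins__": {}}, groups)
print(output_id)

# Leaving off the attribute tries to add a string to a whole record, which fails.
try:
    eval("input.id+'_'+fixed", {"__builtins__": {}}, groups)
except TypeError as error:
    print("needs an attribute:", error)
```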

In [None]:
cat example-data/toy.fastq \
    | itermae -o "input > (?P<fixed>GGTCTCT){e<=1}(?P<barcode>[ATCGN]{18,22})" \
        -oseq "barcode" -oid "input.id+'_'+fixed.seq"

You can build on this to make it more complex. These particular reads actually have a five base index 
at the beginning that denotes the sample, so below I'll extract that and append it to the read ID, then output
the barcode, while fuzzy matching the priming sites.

In [None]:
cat example-data/toy.fastq \
    | itermae \
        -o "input > (?P<sampleIndex>[ATCGN]{5})(GTCCACGAGGTCTCT){e<=2}(?P<barcode>[ATCGN]{18,22})(CGTACGCTG){e<=1}" \
        -oseq "barcode" -oid "input.id+'_'+sampleIndex.seq"

For outputs, 
you can change the output format using `-of` or `--output-format` and the formats SAM, FASTQ, or FASTA. 
You can then write that to a file by using the `>` operator in the shell.

In [None]:
cat example-data/toy.fastq \
    | itermae \
        -o "input > (?P<sampleIndex>[ATCGN]{5})(GTCCACGAGGTCTCT){e<=2}(?P<barcode>[ATCGN]{18,22})(CGTACGCTG){e<=1}" \
        -oseq "barcode" -oid "input.id+'_'+sampleIndex.seq" -of "FASTQ" \
    > out.fastq

head -n 4 out.fastq

So that's "basic usage". I could now split my output file based on which sample index each read has,
and cluster/tabulate barcodes.
    
This should get you going, but regular expressions are a powerful and common feature in programming and 
analysis. For more, I recommend you search for "python regular expression tutorial" 
or the like and read a few articles that talk about this, then consult the `regex` package documentation 
for more information about those specific modifiers.

## Debugging

`itermae` has a lot going on internally, and it can get a bit complex. It's best to take a sample of reads
from the middle of the file (or at least not the first ~100,000 of a FASTQ file) and feed in a hundred at a
time to see what it does.
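
One way to grab such a sample is sketched below in Python: skip a number of records, then take the next few, four lines per record. The function name `sample_fastq_lines` is hypothetical, the toy records are made up, and the sketch assumes plain four-line FASTQ records (no wrapped sequence lines).

```python
import itertools
from typing import Iterable, Iterator


def sample_fastq_lines(lines: Iterable[str], skip_records: int, take_records: int) -> Iterator[str]:
    """Lazily yield the lines of `take_records` records after skipping `skip_records`.

    Assumes 4-line FASTQ records (header, sequence, '+', qualities).
    """
    start = skip_records * 4
    return itertools.islice(lines, start, start + take_records * 4)


# Toy input: ten made-up records, four lines each.
toy_lines = []
for n in range(10):
    toy_lines += [f"@read{n}\n", "ACGT\n", "+\n", "IIII\n"]

# Skip the first three records, then keep the next two.
sample = list(sample_fastq_lines(toy_lines, skip_records=3, take_records=2))
print("".join(sample), end="")
```

On real data you would pass an open file handle instead of `toy_lines`, for example skipping 100,000 records and taking 100, and pipe the printed lines into `itermae`.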

But there are also additional features to help with that. Extending the above example, we can generate
a report of what was made and how. Below, we use `--report` to specify that.
We also use `--failed` to put the failed reads (for whatever reason) into a file for inspection.

In [None]:
cat example-data/toy.fastq | itermae \
        -o "input > (?P<sampleIndex>[ATCGN]{5})(GTCCACGAGGTCTCT){e<=2}(?P<barcode>[ATCGN]{18,22})(CGTACGCTG){e<=1}" \
        -oseq "barcode" -oid "input.id+'_'+sampleIndex.seq" \
        --report report.txt --failed failed.fastq \
    > out.sam

echo "The report:"
head -n 4 report.txt
echo
echo "The first of the failed reads:"
head -n 4 failed.fastq

We see that the third record (at fake position 10002) failed because it was not able to form the output.
We see the failed read, and see that it's lacking the pattern we're looking for.

For the report, the columns are:

- passed or not
- input ID
- input sequence
- output sequence, if successful
- the result of any filters (later), which here is just True
- a string that denotes each group matched, where it starts, ends, and length

## Parallel chopping, intended usage

### Input/output

`itermae` is based on a previous tool, but this implementation has removed any pretense of trying to
handle parallelization internally. Instead, this tool just takes a stream of FASTQ on standard input and parses it.
You can readily parallelize by using the command-line tool GNU `parallel` (yes I'll cite it).
This design limits the complexity of the code and prevents a certain _fuzzy_ memory leak...

However, there is a single-thread ability to read from a file on disk using `--input`, and that can be indicated
as being gzipped with the flag `-z`. Similarly, `itermae` is designed to output to standard output, but you can
specify an output file with `--output`. 

Below is an example of using `parallel` to parallelize the above example:

In [None]:
cat example-data/toy.fastq \
    | parallel --pipe -L 4 \
        'itermae -o "input > (?P<sampleIndex>[ATCGN]{5})(GTCCACGAGGTCTCT){e<=2}(?P<barcode>[ATCGN]{18,22})(CGTACGCTG){e<=1}" \
            -oseq "barcode" -oid "input.id+\"_\"+sampleIndex.seq" ' 

The `--pipe` argument specifies to pipe the input into `itermae`, and do that in chunks of `-L` size.
You will probably want to use a chunk size of 10-100 thousand, but do that in multiples of 4 (since it's a FASTQ)!
So something like `parallel --pipe -L 40000` is what you'd want to use on real data.
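
The reason for multiples of 4 can be seen with a toy model in Python. This is only a model of splitting a FASTQ by line count, not of what GNU `parallel` does internally, and the records and helper names are made up.

```python
# Five made-up FASTQ records, four lines each (20 lines total).
fastq_lines = []
for n in range(5):
    fastq_lines += [f"@read{n}", "ACGT", "+", "IIII"]


def chunk(lines, size):
    """Cut a list of lines into consecutive chunks of `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def whole_records(chunks):
    """True if every chunk starts on a header and holds complete 4-line records."""
    return all(len(c) % 4 == 0 and c[0].startswith("@read") for c in chunks)


print(whole_records(chunk(fastq_lines, 8)))  # multiple of 4: records stay intact
print(whole_records(chunk(fastq_lines, 6)))  # not a multiple of 4: records get cut
```

With a chunk size that is not a multiple of 4, some chunks begin or end in the middle of a record, so the program receiving that chunk would see a malformed FASTQ.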

Then you put the `itermae` command in quotes. Here I've used `'` on the outside. Note that if I need to use a `"`
inside an argument, such as the one building the output ID, I need to escape it with a `\`.

You can also restrict the jobs to run with arguments like `-j`, so consult the `parallel` manual for more info.

## Multiple operations

(to write)

You can chain operations