# Generating counts from an alignment

Counting is more straightforward than alignment, but it is still an important step, and a bioinformatician has to think carefully about how to approach it. Remember, identifying differentially expressed genes relies on statistically comparing the number of reads assigned to each gene across samples.

We will be using `featureCounts` to count reads. `StringTie` is another popular counter. Install `subread` (of which `featureCounts` is a submodule) with `conda install bioconda::subread`.

After installing, take a look at the `featureCounts` manual:

In [None]:
!featureCounts

We are analyzing data produced in the [Winners vs. Losers paper](https://journals.plos.org/plospathogens/article?id=10.1371/journal.ppat.1012268), which analyzed the expression differences between schistosome eggs that were trapped in the liver and eggs that were trapped in the intestine. Luckily for us, the methods section contains a few details about their use of `featureCounts`:

> The UMIs were deduplicated using open source UMI-tools software package (version 1.1.2) [64]. Deduplicated mapped reads were counted on the gene level using FeatureCounts (version 2.0.1) [65] with options -M and --fraction (counting of multi-mapped reads with expression value as a fraction based on the number of genes assigned, ranging from 2–20 genes).

You can see from the usage message above that the `-M` flag allows multi-mappers to be counted. These are reads that mapped to multiple different locations with equal quality scores. The `--fraction` argument tells the program how to account for multi-mapped reads:

> When '-M' is specified, each reported alignment from a multi-mapping read (identified via 'NH' tag) will carry a fractional count of 1/x, instead of 1 (one), where x is the total number of alignments reported for the same read.

So, if a read mapped to two different genes, each of its two alignments counts as 0.5 toward the gene it aligned to.
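
To make the arithmetic concrete, here is a minimal sketch of the 1/x rule described in the manual. It is an illustration only, not how `featureCounts` is implemented, and the read names, gene names, and NH values are made up.

```python
from collections import defaultdict


def fractional_counts(alignments):
    """Sum fractional counts per gene.

    `alignments` is an iterable of (read_name, gene_id, nh) tuples, one per
    reported alignment, where `nh` is the value of that alignment's NH tag
    (the total number of alignments reported for the read).
    """
    counts = defaultdict(float)
    for _read_name, gene_id, nh in alignments:
        # Each reported alignment contributes 1/NH instead of 1.
        counts[gene_id] += 1.0 / nh
    return dict(counts)


# Hypothetical alignments: read1 maps uniquely, read2 maps to two genes.
toy_alignments = [
    ("read1", "geneA", 1),
    ("read2", "geneA", 2),
    ("read2", "geneB", 2),
]

print(fractional_counts(toy_alignments))
```

In this toy case `geneA` ends up with 1.5 (one unique read plus half of `read2`) and `geneB` with 0.5, which is why the count tables later in this notebook contain non-integer values.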

So far, we have not dealt with duplicate reads, even though we know from FastQC that we had many duplicates:

<img src="assets/duplicates.png" width="400">

There are many different steps at which you can remove duplicates. Some people remove them from the FASTQ files during trimming/filtering, but in my opinion that is unnecessary and can remove data that you might want later. I think it's better to align the duplicates and mark them (rather than remove them) so that they aren't counted during the counting step. To do this, we will first use Picard Tools to mark the duplicate reads, then tell featureCounts to ignore anything marked as a duplicate (with the `--ignoreDup` flag). With this flag, featureCounts checks the FLAG field of each alignment in the BAM and ignores any alignment that has the duplicate bit set.
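
The duplicate bit is defined in the SAM specification as `0x400` (1024 in decimal). The sketch below shows the kind of bit test that marking and ignoring duplicates relies on. It is an illustration of the idea, not code taken from Picard or `featureCounts`, and the example FLAG values are made up.

```python
# SAM FLAG bit for "PCR or optical duplicate" (0x400 == 1024).
DUPLICATE_BIT = 0x400


def is_marked_duplicate(flag):
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_BIT)


# 16 means "read on the reverse strand"; 1040 is 1024 + 16, i.e. the same
# alignment after a tool such as MarkDuplicates has flagged it.
print(is_marked_duplicate(16))
print(is_marked_duplicate(1040))
```

Because marking only sets a bit, the alignment itself stays in the BAM, so you can still choose to include duplicates in a later analysis.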


In [None]:
%mkdir logs
%mkdir dedup

!picard MarkDuplicates -I ../6_alignment/alignment/star/Aligned.sortedByCoord.out.bam -M logs/star_duplicates -O dedup/star.bam --VALIDATION_STRINGENCY SILENT

***20 minutes to complete***

Finally, we are also going to use the strandedness argument (`-s 1`) because the QuantSeq FWD 3’mRNA Library Prep Kit is a stranded kit. Here's the final `featureCounts` command:

In [None]:
!featureCounts -T 32 \
    dedup/star.bam \
    --byReadGroup \
    -s 1 \
    --ignoreDup \
    -M \
    --fraction \
    -a ../2_genome_exploration/genome/annotations.gtf \
    -o star_counts.tsv \
    --verbose
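
Once the command finishes, the output table can be inspected directly. `featureCounts` writes a tab-separated file that starts with a `#` comment line recording the command, followed by a header with six annotation columns (`Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length`) and then one count column per sample. Below is a minimal sketch for pulling out one gene's counts. The helper name `read_gene_counts` and the toy table (gene names, sample column names, values) are made up for illustration; the real column names depend on your BAM file and read groups.

```python
import csv
import io

ANNOTATION_COLUMNS = {"Geneid", "Chr", "Start", "End", "Strand", "Length"}


def read_gene_counts(handle, gene_id):
    """Return {count_column: value} for one gene in a featureCounts table."""
    # Skip the leading comment line(s) that record the program and command.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    for row in reader:
        if row["Geneid"] == gene_id:
            # Keep only the per-sample count columns.
            return {
                name: float(value)
                for name, value in row.items()
                if name not in ANNOTATION_COLUMNS
            }
    raise KeyError(gene_id)


# Made-up table in the same layout as a featureCounts output file.
toy_table = (
    "# Program:featureCounts (toy example)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample_1\tsample_2\n"
    "gene_x\tchr1\t100\t900\t+\t801\t12.50\t3.00\n"
    "gene_y\tchr2\t50\t450\t-\t401\t0.00\t7.33\n"
)

print(read_gene_counts(io.StringIO(toy_table), "gene_y"))
```

With the real output you would pass an open file instead, for example `open("star_counts.tsv")`, together with a gene ID as it appears in your annotation.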

We are now finally at a point in the analysis where we can compare our results with the results published in the Winners vs. Losers paper. One of their most prominent results was that the immunomodulatory gene Smp_245390 was more highly expressed in mature liver eggs than in mature intestinal eggs:

<img src="assets/fig4d.png" width="400">

The paper also included their own counting results in [S2 Table](https://journals.plos.org/plospathogens/article/file?type=supplementary&id=10.1371/journal.ppat.1012268.s002). Here are their results and our results compared:

| Analysis | INT_im1 | INT_im2 | INT_im3 | INT_ma1 | INT_ma2 | INT_ma3 | LIV_im1 | LIV_im2 | LIV_im3 | LIV_ma1 | LIV_ma2 | LIV_ma3 |
|----------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Theirs   | 2.84    | 2.35    | 20.46   | 29.45   | 49.03   | 16.17   | 258.46  | 11.50   | 11.67   | 227.95  | 94.20   | 63.37   |
| Ours     | 0.5     | 1       | 12.3    | 8.67    | 8       | 1       | 218.48  | 2.38    | 4       | 269.16  | 76.61   | 36.19   |

Looks pretty good! Our counts are lower than theirs across the board (other than LIV_ma1), but the main patterns hold true. Based on these numbers, it looks like our analysis is likely to reproduce the main finding from the paper - that Smp_245390 is significantly more highly expressed in mature liver eggs than in mature intestinal eggs.
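
As a rough sanity check of that pattern, we can compare the mean count of the three mature liver samples with the mean of the three mature intestine samples, using the values from the table above. This is only a back-of-the-envelope sketch: the counts are not normalized for library size, and a ratio of means is not a statistical test.

```python
from statistics import mean

# Smp_245390 counts for the mature samples, copied from the table above.
counts = {
    "Theirs": {
        "INT_ma": [29.45, 49.03, 16.17],
        "LIV_ma": [227.95, 94.20, 63.37],
    },
    "Ours": {
        "INT_ma": [8.67, 8.0, 1.0],
        "LIV_ma": [269.16, 76.61, 36.19],
    },
}


def liver_to_intestine_ratio(groups):
    """Ratio of mean mature-liver counts to mean mature-intestine counts."""
    return mean(groups["LIV_ma"]) / mean(groups["INT_ma"])


for analysis, groups in counts.items():
    print(f"{analysis}: {liver_to_intestine_ratio(groups):.1f}x")
```

Both ratios come out well above 1, which agrees with the direction of the published result. The proper comparison comes later, in the differential expression analysis.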

Let's now do all of that again, but with the HISAT alignment...

In [None]:
!picard MarkDuplicates -I ../6_alignment/alignment/hisat/merged.bam -M logs/hisat_duplicates -O dedup/hisat.bam --VALIDATION_STRINGENCY SILENT

!featureCounts -T 32 \
    dedup/hisat.bam \
    --byReadGroup \
    -s 1 \
    --ignoreDup \
    -M \
    --fraction \
    -a ../2_genome_exploration/genome/annotations.gtf \
    -o hisat_counts.tsv \
    --verbose

Let's compare the new numbers:

| Analysis     | INT_im1 | INT_im2 | INT_im3 | INT_ma1 | INT_ma2 | INT_ma3 | LIV_im1 | LIV_im2 | LIV_im3 | LIV_ma1 | LIV_ma2 | LIV_ma3 |
|--------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Theirs       | 2.84    | 2.35    | 20.46   | 29.45   | 49.03   | 16.17   | 258.46  | 11.50   | 11.67   | 227.95  | 94.20   | 63.37   |
| Ours (STAR)  | 0.5     | 1       | 12.3    | 8.67    | 8       | 1       | 218.48  | 2.38    | 4       | 269.16  | 76.61   | 36.19   |
| Ours (HISAT) | 2       | 0.5     | 25.2    | 41.2    | 34.4    | 8.5     | 1095.48 | 15.7    | 17.5    | 2368.03 | 524.07  | 304.8   |
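
To see how the two aligners differ for this gene, a quick sketch like the one below computes the per-sample ratio of HISAT counts to STAR counts, using the numbers from the table above. It only describes these twelve values for Smp_245390 and says nothing about why the aligners disagree.

```python
samples = [
    "INT_im1", "INT_im2", "INT_im3", "INT_ma1", "INT_ma2", "INT_ma3",
    "LIV_im1", "LIV_im2", "LIV_im3", "LIV_ma1", "LIV_ma2", "LIV_ma3",
]

# Smp_245390 counts copied from the table above.
star = [0.5, 1, 12.3, 8.67, 8, 1, 218.48, 2.38, 4, 269.16, 76.61, 36.19]
hisat = [2, 0.5, 25.2, 41.2, 34.4, 8.5, 1095.48, 15.7, 17.5, 2368.03, 524.07, 304.8]

# Ratio of HISAT-derived to STAR-derived counts for each sample.
ratios = {name: h / s for name, s, h in zip(samples, star, hisat)}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")

higher_in_hisat = [name for name, ratio in ratios.items() if ratio > 1]
print(f"HISAT count is higher in {len(higher_in_hisat)} of {len(samples)} samples")
```

For this gene, the HISAT-based count is higher than the STAR-based count in every sample except INT_im2.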

Finally, we can use MultiQC to aggregate everything into a single report, which will now include the MarkDuplicates and featureCounts logs.

In [None]:
!multiqc --force -d ../5_fastq/fastq/qc/ ../5_fastq/trimmed/qc/ ../6_alignment ../7_alignment_qc .