# Generating counts from an alignment

Counting is more straightforward than alignment, but it is still an important step that a bioinformatician has to think through carefully. Remember, identifying differentially expressed genes relies on statistically comparing the number of reads assigned to each gene across samples.

We will be using `featureCounts` to count reads. `StringTie` is another popular counter. Install `subread` (of which `featureCounts` is a submodule) with `mamba install bioconda::subread`.

After installing, take a look at the `featureCounts` manual:

In [None]:
!featureCounts

We are analyzing data produced in [this paper](https://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0009200), which analyzed the expression differences between parasites that were treated with praziquantel or not treated at all. The authors of this paper used HISAT2 to align, but they used a different tool for counting reads. This gives us an opportunity to compare the performance of different counting tools.

You can see from the help text above that the `-M` flag allows multi-mappers to be counted - these are reads that mapped to multiple different locations with equal quality scores. The `--fraction` argument tells the program how to account for multi-mapped reads:

> When '-M' is specified, each reported alignment from a multi-mapping read (identified via 'NH' tag) will carry a fractional count of 1/x, instead of 1 (one), where x is the total number of alignments reported for the same read.

So, if we have a read that mapped to two different genes, each of its two alignments will count as 0.5 for the gene it overlaps.
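To make the arithmetic concrete, here is a minimal Python sketch of fractional counting. It is not how `featureCounts` is implemented. The read and gene names are made up for illustration, and it assumes every reported alignment overlaps exactly one gene.

```python
from collections import defaultdict


def fractional_counts(alignments):
    """Sum fractional counts per gene.

    `alignments` maps a read name to a list of gene IDs, one entry per
    reported alignment of that read (hypothetical toy structure).
    """
    counts = defaultdict(float)
    for read_name, genes in alignments.items():
        x = len(genes)  # total number of alignments reported for this read
        for gene in genes:
            counts[gene] += 1 / x  # each alignment carries 1/x instead of 1
    return dict(counts)


# Toy data: read_1 maps uniquely, read_2 maps to two genes,
# read_3 has four alignments split across two genes.
toy_alignments = {
    "read_1": ["gene_A"],
    "read_2": ["gene_A", "gene_B"],
    "read_3": ["gene_B", "gene_B", "gene_C", "gene_C"],
}

counts = fractional_counts(toy_alignments)
for gene, value in sorted(counts.items()):
    print(gene, value)
```

Because each read's fractions sum to 1, the per-gene totals add up to the number of reads, so no read is counted more than once overall.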

## Dealing with duplicate reads

So far, we have not dealt with duplicate reads, even though we know from FastQC that we had many duplicates. In the example below, over 60% of the reads were duplicates:

<img src="assets/duplicates_3.png" width="400">

Indeed, this was true of all samples:

<img src="assets/duplicates_4.png" width="400">

There are a few different reasons why a sample would have many duplicate reads. The biggest reason is a small amount of input RNA during library prep. With so little starting material, the PCR enrichment steps amplify the library fragments so many times that many of the reads are derived from duplicate fragments:

<img src="assets/pcr.png" width="400">

Many duplicate reads are a hallmark of RNA-seq experiments where the original tissue source was small, resulting in low amounts of extracted RNA. There are several steps at which you can remove duplicates. Some people remove them from the FASTQ during trimming/filtering, but in my opinion that is unnecessary and can remove data that you might later be interested in. I think it's better to align the duplicates and mark them (rather than remove them) so that they aren't counted during the counting step. If you look at last week's [notebook on SAM/BAM QC](../7_alignment_qc/7_alignment_qc.ipynb), you'll see that the bit flag 0x400 represents a duplicate read. So, we can use a tool to set that bit in a read's flag if the read is suspected to be a duplicate.

To do this, we will first use [Picard Tools to mark the duplicate reads](https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard); then we'll tell featureCounts to ignore anything marked as a duplicate (with the `--ignoreDup` flag). With this flag, featureCounts checks the FLAG field of each alignment in the BAM and ignores any alignment that has the duplicate bit set.
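The duplicate check described above is a single bitwise test. The short Python sketch below shows the idea; the function name is made up for illustration, and the example flag value 99 is simply a common flag for a properly paired read.

```python
DUPLICATE_BIT = 0x400  # 1024 in decimal: "PCR or optical duplicate" in the SAM FLAG field


def is_duplicate(flag):
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_BIT)


# 99 = paired (1) + proper pair (2) + mate reverse strand (32) + first in pair (64)
unmarked_flag = 99
# Marking a duplicate sets the 0x400 bit and leaves the other bits untouched
marked_flag = unmarked_flag | DUPLICATE_BIT

print(unmarked_flag, is_duplicate(unmarked_flag))
print(marked_flag, is_duplicate(marked_flag))
```

Marking only changes the flag from 99 to 1123 (99 + 1024); the alignment itself stays in the BAM, which is why we can still choose later whether to count it.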

In [None]:
%mkdir logs
%mkdir dedup

!picard MarkDuplicates \
    -I /data/classes/2025/fall/biol343/course_files/alignments/star/Aligned.sorted.bam \
    -M logs/star_duplicates \
    -O dedup/star.bam \
    --VALIDATION_STRINGENCY SILENT

***20 minutes to complete***
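Once Picard finishes, the file passed to `-M` holds a small tab-separated table of duplication metrics. The sketch below is one way to read it in Python, assuming the usual Picard metrics layout: comment lines, a `## METRICS CLASS` line, a header row, one row per library, then a blank line. The toy text is made up and shows only a subset of the real columns.

```python
import csv
import io


def read_duplication_metrics(handle):
    """Return one dict per library from a Picard MarkDuplicates metrics file."""
    lines = [line.rstrip("\n") for line in handle]
    # The metrics table starts on the line after the "## METRICS CLASS" marker
    start = next(
        i for i, line in enumerate(lines) if line.startswith("## METRICS CLASS")
    ) + 1
    table = []
    for line in lines[start:]:
        if not line.strip():  # a blank line ends the table
            break
        table.append(line)
    return list(csv.DictReader(table, delimiter="\t"))


# Toy metrics text (hypothetical values, subset of columns)
toy_metrics = io.StringIO(
    "## htsjdk.samtools.metrics.StringHeader\n"
    "# MarkDuplicates toy example\n"
    "\n"
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics\n"
    "LIBRARY\tREAD_PAIRS_EXAMINED\tREAD_PAIR_DUPLICATES\tPERCENT_DUPLICATION\n"
    "lib_1\t1000\t600\t0.6\n"
    "\n"
    "## HISTOGRAM\tjava.lang.Double\n"
)

rows = read_duplication_metrics(toy_metrics)
for row in rows:
    # Despite its name, PERCENT_DUPLICATION is reported as a fraction between 0 and 1
    print(row["LIBRARY"], float(row["PERCENT_DUPLICATION"]))
```

To use it on the real output, open the metrics file instead of the toy text, for example `with open("logs/star_duplicates") as handle: rows = read_duplication_metrics(handle)`.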

After marking duplicates, we are ready to do the counting. We're going to count `--byReadGroup` so we can compare counts between and among samples. We will ignore duplicates and also count multimapping reads. We'll also specify that we have paired-end reads. Here's the final `featureCounts` command:

In [None]:
!featureCounts \
    dedup/star.bam \
    -T 32 \
    -p \
    --byReadGroup \
    --ignoreDup \
    -M \
    --fraction \
    -a ../2_genome_exploration/genome/annotations.gtf \
    -o star_counts.tsv \
    --verbose

***5 minute time to completion***
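The output table starts with a comment line recording the command, followed by a header row whose first six columns are `Geneid`, `Chr`, `Start`, `End`, `Strand`, and `Length`; the remaining columns hold the counts. Below is a minimal Python sketch for pulling out one gene's counts. The toy table, gene IDs, and sample column names are made up; with `--byReadGroup`, the real column names depend on the read groups in the BAM.

```python
import csv
import io

ANNOTATION_COLUMNS = {"Geneid", "Chr", "Start", "End", "Strand", "Length"}


def gene_counts(handle, gene_id):
    """Return {count column: count} for one gene, or None if the gene is absent."""
    # Skip the leading comment line(s) that record the program and command
    rows = (line for line in handle if not line.startswith("#"))
    for row in csv.DictReader(rows, delimiter="\t"):
        if row["Geneid"] == gene_id:
            # Counts are parsed as floats because --fraction produces fractional values
            return {
                column: float(value)
                for column, value in row.items()
                if column not in ANNOTATION_COLUMNS
            }
    return None


# Toy table (hypothetical genes, samples, and counts)
toy_table = io.StringIO(
    "# Program:featureCounts; Command:toy example\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample_1\tsample_2\n"
    "gene_A\tchr1\t100\t900\t+\t801\t10.50\t3.00\n"
    "gene_B\tchr1\t2000\t2500\t-\t501\t0.00\t7.25\n"
)

result = gene_counts(toy_table, "gene_A")
print(result)
```

On the real data, the same function could be called with `open("star_counts.tsv")` and a gene ID from the annotation.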

We are now finally at a point in the analysis where we can compare our results with the results published in the paper. One of their most prominent results was showing that the ABC transporter gene Smp_089200 (ABC transporters detoxify drugs by transporting them out of the cell) decreased in expression after drug treatment. Other genes also decreased; here's a selection of them:

<img src="assets/example_genes.png" width="400">

The paper also included their own counting results in [S2 File](https://journals.plos.org/plosntds/article/file?type=supplementary&id=10.1371/journal.pntd.0009200.s006). Here are their results and our results compared:

| Analysis | CTRL A | CTRL B | CTRL C | CTRL D | CTRL E | PZQ A | PZQ B | PZQ C | PZQ D | PZQ E |
|----------|--------|--------|--------|--------|--------|-------|-------|-------|-------|-------|
| Theirs   | 106    | 90     | 86     | 91     | 98     | 43    | 52    | 51    | 58    | 54    |
| Ours     | 6131   | 5536   | 5034   | 5324   | 5321   | 2403  | 2890  | 2956  | 3446  | 2915  |

Looks pretty good! Our counts are higher than theirs across the board because we're reporting slightly different metrics (theirs are normalized, ours are not), but the main patterns hold true. Based on these numbers, it looks like our analysis is likely to reproduce one of the main findings from the paper - that drug treatment significantly decreases the expression of Smp_089200.
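Because the two rows are on different scales, one rough way to compare them is the ratio of the mean PZQ count to the mean CTRL count within each row. The sketch below does that with the numbers from the table above. It is only a sanity check, not a differential expression test.

```python
from statistics import mean

# Smp_089200 counts copied from the comparison table above
counts = {
    "theirs": {"CTRL": [106, 90, 86, 91, 98], "PZQ": [43, 52, 51, 58, 54]},
    "ours": {"CTRL": [6131, 5536, 5034, 5324, 5321], "PZQ": [2403, 2890, 2956, 3446, 2915]},
}


def treatment_ratio(groups):
    """Mean treated count divided by mean control count."""
    return mean(groups["PZQ"]) / mean(groups["CTRL"])


ratios = {name: treatment_ratio(groups) for name, groups in counts.items()}
for name, ratio in ratios.items():
    print(f"{name}: PZQ/CTRL = {ratio:.2f}")
```

Both ratios come out a little above 0.5, so the treated samples show roughly half the control expression in either analysis, even though the raw numbers differ by more than an order of magnitude.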

Let's now do all of that again, but with the HISAT alignment...

In [None]:
!picard MarkDuplicates \
    -I /data/classes/2025/fall/biol343/course_files/alignments/hisat/merged.bam \
    -M logs/hisat_duplicates \
    -O dedup/hisat.bam \
    --VALIDATION_STRINGENCY SILENT

!featureCounts \
    dedup/hisat.bam \
    -T 32 \
    -p \
    --byReadGroup \
    --ignoreDup \
    -M \
    --fraction \
    -a ../2_genome_exploration/genome/annotations.gtf \
    -o hisat_counts.tsv \
    --verbose

***2 hour time to completion***

Now we can make the same comparison for Smp_089200 using the counts from the HISAT2 alignment. Here are the counts from the paper's [S2 File](https://journals.plos.org/plosntds/article/file?type=supplementary&id=10.1371/journal.pntd.0009200.s006) next to ours:

| Analysis | CTRL A | CTRL B | CTRL C | CTRL D | CTRL E | PZQ A | PZQ B | PZQ C | PZQ D | PZQ E |
|----------|--------|--------|--------|--------|--------|-------|-------|-------|-------|-------|
| Theirs   | 106    | 90     | 86     | 91     | 98     | 43    | 52    | 51    | 58    | 54    |
| Ours     |        |        |        |        |        |       |       |       |       |       |

Looks pretty good! Our counts are lower than theirs across the board, but the main patterns hold true. Based on these numbers, it looks like our analysis is likely to reproduce the main finding from the paper - that drug treatment significantly decreases the expression of Smp_089200.

Finally, let's run MultiQC over the QC output from the earlier notebooks and the current directory, so that everything is collected in a single report:

In [None]:
!multiqc --force -d ../5_fastq/fastq/qc/ ../5_fastq/trimmed/qc/ ../6_alignment ../7_alignment_qc .