# Generating counts from an alignment

Counting is more straightforward than alignment, but it is still an important step, and a bioinformatician has to think carefully about how to approach it. Remember, identifying differentially expressed genes relies on statistically comparing the number of reads assigned to each gene across samples.

We will be using `featureCounts` to count reads. `StringTie` is another popular counter. Install `subread` (of which `featureCounts` is a submodule) with `conda install bioconda::subread`.

After installing, take a look at the `featureCounts` manual:

In [1]:
!featureCounts


Version 2.0.6

Usage: featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ... 

## Mandatory arguments:

  -a <string>         Name of an annotation file. GTF/GFF format by default. See
                      -F option for more format information. Inbuilt annotations
                      (SAF format) is available in 'annotation' directory of the
                      package. Gzipped file is also accepted.

  -o <string>         Name of output file including read counts. A separate file
                      including summary statistics of counting results is also
                      included in the output ('<string>.summary'). Both files
                      are in tab delimited format.

  input_file1 [input_file2] ...   A list of SAM or BAM format files. They can be
                      either name or location sorted. If no files provided,
                      <stdin> input is expected. Location-sorted paired-end reads
                      a

We are analyzing data produced in the [Winners vs. Losers paper](https://journals.plos.org/plospathogens/article?id=10.1371/journal.ppat.1012268), which analyzed the expression differences between schistosome eggs that were trapped in the liver and eggs that were trapped in the intestine. Luckily for us, the methods section contains a few details about their use of `featureCounts`:

> The UMIs were deduplicated using open source UMI-tools software package (version 1.1.2) [64]. Deduplicated mapped reads were counted on the gene level using FeatureCounts (version 2.0.1) [65] with options -M and --fraction (counting of multi-mapped reads with expression value as a fraction based on the number of genes assigned, ranging from 2–20 genes).

You can see from the man page above that the `-M` flag allows for multi-mappers to be counted - these are reads that mapped to multiple different locations with equal quality scores. The `--fraction` argument tells the program how to account for multi-mapped reads:

> When '-M' is specified, each reported alignment from a multi-mapping read (identified via 'NH' tag) will carry a fractional count of 1/x, instead of 1 (one), where x is the total number of alignments reported for the same read.

So, if we have a read that mapped to two different genes, each of its two alignments will count as 0.5 toward the gene it aligned to.
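To make the arithmetic concrete, here is a minimal Python sketch of the 1/x rule quoted above. The read names, gene IDs, and the `fractional_counts` helper are made up for illustration; this is not how `featureCounts` is implemented internally, only the counting rule its documentation describes.

```python
from collections import defaultdict


def fractional_counts(alignments):
    """Sum 1/NH per gene over all reported alignments.

    `alignments` is a list of (read_name, gene_id, nh) tuples, where `nh`
    is the value of the NH tag (number of alignments reported for the read).
    """
    counts = defaultdict(float)
    for _read_name, gene_id, nh in alignments:
        counts[gene_id] += 1.0 / nh
    return dict(counts)


# Hypothetical alignments: read_1 maps uniquely, read_2 maps to two genes.
toy_alignments = [
    ("read_1", "gene_A", 1),
    ("read_2", "gene_A", 2),
    ("read_2", "gene_B", 2),
]

toy_counts = fractional_counts(toy_alignments)
print(toy_counts)
```

In this toy case `gene_A` receives one full count from the unique read plus half a count from the multi-mapper, and `gene_B` receives the other half.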

## Dealing with duplicate reads

So far, we have not dealt with duplicate reads, even though we know from FastQC that we had many duplicates. In the example below, over 60% of the reads were duplicates:

<img src="assets/duplicates_2.png" width="400">

Indeed, this was true of all samples:

<img src="assets/duplicates.png" width="400">

There are a few different reasons why a sample would have many duplicate reads. The most common is that there was a low amount of input RNA during library prep. This means that during the PCR enrichment steps, the library fragments are amplified so many times that many of the reads are derived from duplicate fragments:

<img src="assets/pcr.png" width="400">

Many duplicate reads are a hallmark of RNA-seq experiments where the original tissue source was small, which resulted in low amounts of RNA being extracted. There are several steps at which you can remove duplicates. Some people remove them from the FASTQ during trimming/filtering, but my opinion is that this is unnecessary and can potentially remove data that you might later be interested in. I think it's better to align the duplicates but mark them (rather than removing them) so that they aren't counted during the counting step. If you look at last week's [notebook on SAM/BAM QC](../7_alignment_qc/7_alignment_qc.ipynb), you'll see that the bit flag 0x400 represents a duplicate read. So, we can use a tool to update a read's flag if it is suspected to be a duplicate.

To do this, we will first use [Picard Tools to mark the duplicate reads](https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard); then we'll tell featureCounts to ignore anything marked as a duplicate (with the `--ignoreDup` flag). This flag checks the FLAG field in the BAM and ignores any alignment that has the duplicate bit set.
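Checking the duplicate bit is a single bitwise operation. The sketch below shows the idea in Python; the `is_duplicate` helper and the example FLAG values are illustrative and are not taken from our BAM files.

```python
DUPLICATE_BIT = 0x400  # 1024 in decimal: "PCR or optical duplicate" in the SAM spec


def is_duplicate(flag):
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_BIT)


# Illustrative FLAG values (not from our data).
print(is_duplicate(16))    # reverse-strand read, duplicate bit not set
print(is_duplicate(1040))  # 16 + 1024: reverse-strand read marked as a duplicate
```

Marking a read as a duplicate therefore only changes its FLAG; the alignment itself stays in the BAM file and can still be inspected later.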

In [1]:
%mkdir logs
%mkdir dedup

!picard MarkDuplicates -I ../6_alignment/alignment/star/Aligned.sortedByCoord.out.bam -M logs/star_duplicates -O dedup/star.bam --VALIDATION_STRINGENCY SILENT

mkdir: cannot create directory ‘logs’: File exists
mkdir: cannot create directory ‘dedup’: File exists
16:34:42.477 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data/users/corwinbm5021/.conda/envs/biol343_20241029/share/picard-3.2.0-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Tue Oct 29 16:34:42 CDT 2024] MarkDuplicates --INPUT ../6_alignment/alignment/star/Aligned.sortedByCoord.out.bam --OUTPUT dedup/star.bam --METRICS_FILE logs/star_duplicates --VALIDATION_STRINGENCY SILENT --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --FLOW_MODE false --FLOW_DUP_STRATEGY FLOW_QUALITY_SUM_STRATEGY --USE_END_IN_UNPAIRED_READS false --USE_UNPAIRED_CLIPPED_END false --UNPAIRED_END_UNCERTAINTY 0 --UNPAIRED_START_UNCERTAINTY 0 --FLOW_SKIP_FIRST_N_FLOWS 0 --F

***20 minutes to complete***

Finally, we are also going to use the strandedness argument (`-s 1`) because the QuantSeq FWD 3’mRNA Library Prep Kit is a stranded kit. That is, the library was prepared so that each read has the same orientation as the mRNA it came from, so we know which strand each sequenced fragment was transcribed from. Here's the final `featureCounts` command:

In [4]:
!featureCounts -T 32 \
    /data/classes/2024/fall/biol343/course_files/dedup/star.bam \
    --byReadGroup \
    -s 1 \
    --ignoreDup \
    -M \
    --fraction \
    -a ../2_genome_exploration/genome/annotations.gtf \
    -o star_counts.tsv \
    --verbose

        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
	  v2.0.6

||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                                                                            ||
||                           star.bam                                         ||
||                                                                            ||
||             Output file : star_counts.tsv                                  ||
||                 Summary : star_counts.tsv.summary
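As a reminder of what `-s 1` means in the command above, here is a simplified Python sketch of the strand check for single-end reads. The `read_strand` and `counts_toward` helpers are hypothetical names, and the sketch ignores everything else `featureCounts` considers (overlap with the feature, mapping quality, and so on).

```python
REVERSE_BIT = 0x10  # SAM FLAG bit: read aligned to the reverse strand


def read_strand(flag):
    """Return '+' or '-' for a single-end read based on its FLAG."""
    return "-" if flag & REVERSE_BIT else "+"


def counts_toward(flag, gene_strand, strandedness):
    """Return True if the read's strand is compatible with the gene's strand.

    `strandedness` mirrors the -s values:
    0 = unstranded, 1 = stranded, 2 = reversely stranded.
    """
    if strandedness == 0:
        return True
    same_strand = read_strand(flag) == gene_strand
    return same_strand if strandedness == 1 else not same_strand


# A forward-strand read (FLAG 0) overlapping a gene on the '+' strand:
print(counts_toward(0, "+", 1))   # compatible with -s 1
print(counts_toward(16, "+", 1))  # reverse-strand read is not
```

With a forward-stranded kit, a read on the opposite strand of an annotated gene is more likely to come from an overlapping antisense transcript or from noise, which is why it is not assigned to that gene.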

We are now finally at a point in the analysis where we can compare our results with the results published in the Winners vs. Losers paper. One of their most prominent results was showing that the immunomodulatory gene Smp_245390 was more highly expressed in mature liver eggs than in mature intestinal eggs:

<img src="assets/fig4d.png" width="400">

The paper also included their own counting results in [S2 Table](https://journals.plos.org/plospathogens/article/file?type=supplementary&id=10.1371/journal.ppat.1012268.s002). Here are their results and our results compared:

| Analysis | INT_im1 | INT_im2 | INT_im3 | INT_ma1 | INT_ma2 | INT_ma3 | LIV_im1 | LIV_im2 | LIV_im3 | LIV_ma1 | LIV_ma2 | LIV_ma3 |
|----------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Theirs   | 2.84    | 2.35    | 20.46   | 29.45   | 49.03   | 16.17   | 258.46  | 11.50   | 11.67   | 227.95  | 94.20   | 63.37   |
| Ours     | 0.5     | 1       | 12.3    | 8.67    | 8       | 1       | 218.48  | 2.38    | 4       | 269.16  | 76.61   | 36.19   |

Looks pretty good! Our counts are lower than theirs across the board (other than LIV_ma1), but the main patterns hold true. Based on these numbers, it looks like our analysis is likely to reproduce the main finding from the paper: that Smp_245390 is significantly more highly expressed in mature liver eggs than in mature intestinal eggs.
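A quick way to check that impression is to compare the mean count in mature liver eggs with the mean count in mature intestinal eggs, using the numbers from the table above. This is only a rough sketch on raw, unnormalized counts for a single gene; it is not a differential expression test, and the variable names are chosen here for illustration.

```python
from statistics import mean

# Counts for Smp_245390 copied from the table above (mature samples only).
mature_counts = {
    "Theirs": {"INT": [29.45, 49.03, 16.17], "LIV": [227.95, 94.20, 63.37]},
    "Ours": {"INT": [8.67, 8, 1], "LIV": [269.16, 76.61, 36.19]},
}

# Ratio of mean liver counts to mean intestine counts for each analysis.
liver_vs_intestine = {
    analysis: mean(groups["LIV"]) / mean(groups["INT"])
    for analysis, groups in mature_counts.items()
}

for analysis, ratio in liver_vs_intestine.items():
    print(f"{analysis}: mature liver / mature intestine = {ratio:.2f}")
```

Both ratios are well above 1, so the direction of the difference agrees between the two analyses, even though the magnitudes differ.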

Let's now do all of that again, but with the HISAT alignment...

In [6]:
#!picard MarkDuplicates -I ../6_alignment/alignment/hisat/merged.bam -M logs/hisat_duplicates -O dedup/hisat.bam --VALIDATION_STRINGENCY SILENT

!featureCounts -T 32 \
    /data/classes/2024/fall/biol343/course_files/dedup/hisat.bam \
    --byReadGroup \
    -s 1 \
    --ignoreDup \
    -M \
    --fraction \
    -a ../2_genome_exploration/genome/annotations.gtf \
    -o hisat_counts.tsv \
    --verbose

        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
	  v2.0.6

||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                                                                            ||
||                           hisat.bam                                        ||
||                                                                            ||
||             Output file : hisat_counts.tsv                                 ||
||                 Summary : hisat_counts.tsv.summary
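To pull the counts for one gene out of a `featureCounts` table, something like the following Python sketch could be used. It assumes the usual layout of the output file: a leading comment line starting with `#`, then a tab-delimited header whose first six columns are `Geneid`, `Chr`, `Start`, `End`, `Strand`, and `Length`, followed by one column per sample. The `read_gene_counts` helper and the toy table (genes, sample names, and counts) are made up for illustration.

```python
import csv
import io


def read_gene_counts(handle, gene_id):
    """Return {column_name: count} for one gene from a featureCounts table."""
    # Skip the leading comment line(s), which start with '#'.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    annotation_cols = {"Geneid", "Chr", "Start", "End", "Strand", "Length"}
    for row in reader:
        if row["Geneid"] == gene_id:
            # Keep only the per-sample count columns.
            return {k: float(v) for k, v in row.items() if k not in annotation_cols}
    raise KeyError(gene_id)


# Toy table with made-up values, standing in for a real counts file.
toy_table = (
    "# Program:featureCounts (toy example)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample_1\tsample_2\n"
    "gene_A\tchr1\t100\t900\t+\t801\t12.5\t3\n"
    "gene_B\tchr1\t2000\t2500\t-\t501\t0\t7.33\n"
)

toy_gene_counts = read_gene_counts(io.StringIO(toy_table), "gene_B")
print(toy_gene_counts)
```

On the real output, the same function could be called with an open file handle for `star_counts.tsv` or `hisat_counts.tsv`. The exact gene identifier to pass depends on how genes are named in the GTF, so check the `Geneid` column first.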

Let's compare the new numbers:

| Analysis | INT_im1 | INT_im2 | INT_im3 | INT_ma1 | INT_ma2 | INT_ma3 | LIV_im1 | LIV_im2 | LIV_im3 | LIV_ma1 | LIV_ma2 | LIV_ma3 |
|----------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Theirs       | 2.84    | 2.35    | 20.46   | 29.45   | 49.03   | 16.17   | 258.46  | 11.50   | 11.67   | 227.95  | 94.20   | 63.37   |
| Ours (STAR)  | 0.5     | 1       | 12.3    | 8.67    | 8       | 1       | 218.48  | 2.38    | 4       | 269.16  | 76.61   | 36.19   |
| Ours (HISAT) | 2       | 0.5     | 25.2    | 41.2    | 34.4    | 8.5     | 1095.48 | 15.7    | 17.5    | 2368.03 | 524.07  | 304.8   |
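One way to read this table is to divide each of our counts by the published count for the same sample. The Python sketch below does that with the values copied from the table; the variable names are chosen here for illustration, and the ratios describe raw counts for this one gene only.

```python
samples = [
    "INT_im1", "INT_im2", "INT_im3", "INT_ma1", "INT_ma2", "INT_ma3",
    "LIV_im1", "LIV_im2", "LIV_im3", "LIV_ma1", "LIV_ma2", "LIV_ma3",
]

# Counts for Smp_245390 copied from the table above.
theirs = [2.84, 2.35, 20.46, 29.45, 49.03, 16.17, 258.46, 11.50, 11.67, 227.95, 94.20, 63.37]
ours = {
    "STAR": [0.5, 1, 12.3, 8.67, 8, 1, 218.48, 2.38, 4, 269.16, 76.61, 36.19],
    "HISAT": [2, 0.5, 25.2, 41.2, 34.4, 8.5, 1095.48, 15.7, 17.5, 2368.03, 524.07, 304.8],
}

# Ratio of our count to the published count, per aligner and per sample.
ratio_to_published = {
    aligner: {s: count / pub for s, count, pub in zip(samples, counts, theirs)}
    for aligner, counts in ours.items()
}

for aligner, ratios in ratio_to_published.items():
    line = ", ".join(f"{s}={r:.2f}" for s, r in ratios.items())
    print(f"{aligner}: {line}")
```

For several liver samples the HISAT-based counts are many times higher than the published values, while the STAR-based counts stay much closer to them.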

Finally, we can use MultiQC to aggregate the reports, which will now include the MarkDuplicates and featureCounts logs.

In [5]:
!multiqc --force -d ../5_fastq/fastq/qc/ ../5_fastq/trimmed/qc/ ../6_alignment ../7_alignment_qc .

  /// MultiQC 🔍 | v1.17

|           multiqc | Prepending directory to sample names
|           multiqc | Search path : /data/users/corwinbm5021/BIOL343/5_fastq/fastq/qc
|           multiqc | Search path : /data/users/corwinbm5021/BIOL343/5_fastq/trimmed/qc
|           multiqc | Search path : /data/users/corwinbm5021/BIOL343/6_alignment
|           multiqc | Search path : /data/users/corwinbm5021/BIOL343/7_alignment_qc
|           multiqc | Search path : /data/users/corwinbm5021/BIOL343/8_counting
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 208/208  ./star_counts.tsv
|    feature_counts | Found 1 reports
|          samtools | Found 1 stats reports
|          samtools | Found 1 flagstat reports
|              star | Found 3 reports