# Get genome and annotations

To prepare, set the user environment variable (`$USER`) to your user ID.

1. Make a new directory called `genome`
2. Navigate to `genome/`
3. Download the reference genome file, which will be compressed as `.gz`
    1. The `-O` option saves the incoming file under a new file name, which we call `genome.fa.gz`
    2. Check out details and info about the genome at [WormBase ParaSite](https://parasite.wormbase.org/Schistosoma_mansoni_prjea36577/Info/Index/)
4. Decompress the reference

In [1]:
%cd /data/users/wheelenj/biol343
!mkdir genome
%cd genome
!wget -O genome.fa.gz https://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/WBPS19/species/schistosoma_mansoni/PRJEA36577/schistosoma_mansoni.PRJEA36577.WBPS19.genomic.fa.gz
!gzip -d genome.fa.gz

/data/users/wheelenj/biol343
mkdir: cannot create directory ‘genome’: File exists
/data/users/wheelenj/biol343/genome
--2024-07-15 15:18:22--  https://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/WBPS19/species/schistosoma_mansoni/PRJEA36577/schistosoma_mansoni.PRJEA36577.WBPS19.genomic.fa.gz
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.193.165
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.193.165|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 116797085 (111M) [application/x-gzip]
Saving to: ‘genome.fa.gz’

genome.fa.gz          2%[                    ]   3.34M   592KB/s    eta 3m 11s ^C
gzip: genome.fa already exists; do you wish to overwrite (y or n)? 
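
The output above shows two snags that appear when this cell is re-run: `mkdir` complains that `genome` already exists, and `gzip` stops to ask whether to overwrite `genome.fa`. Here is a minimal sketch of a re-runnable version. It uses a toy file in a temporary directory in place of the real download; the `workdir` name and the file contents are placeholders, not course data.

```shell
# Toy stand-in for the genome download; nothing here touches the real files.
workdir=$(mktemp -d)

mkdir -p "$workdir/genome"    # -p: no error if the directory already exists
mkdir -p "$workdir/genome"    # so repeating the command is harmless

# Pretend an earlier run already left a decompressed genome.fa behind...
printf '>chr_demo\nACGT\n' > "$workdir/genome/genome.fa"
# ...and that the compressed file has just been downloaded again.
gzip -c "$workdir/genome/genome.fa" > "$workdir/genome/genome.fa.gz"

# -f overwrites the existing genome.fa instead of prompting.
gzip -df "$workdir/genome/genome.fa.gz"
ls "$workdir/genome"
```

In the notebook cell, the same two changes would be `!mkdir -p genome` and `!gzip -df genome.fa.gz`.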

`grep` is a command line tool that is often used to inspect files, especially files that include sequences (i.e., FastA/FastQ).

View the `grep` manual and then use it to do the following:

- Count the number of contigs/chromosomes
- Take a look at the header of each contig/chromosome

How many chromosomes are there, and what is the length of each?

In [1]:
!grep --help

# !grep -c '>' /data/users/wheelenj/biol343/genome/genome.fa
# !grep '>' /data/users/wheelenj/biol343/genome/genome.fa

Usage: grep [OPTION]... PATTERN [FILE]...
Search for PATTERN in each FILE.
Example: grep -i 'hello world' menu.h main.c

Pattern selection and interpretation:
  -E, --extended-regexp     PATTERN is an extended regular expression
  -F, --fixed-strings       PATTERN is a set of newline-separated strings
  -G, --basic-regexp        PATTERN is a basic regular expression (default)
  -P, --perl-regexp         PATTERN is a Perl regular expression
  -e, --regexp=PATTERN      use PATTERN for matching
  -f, --file=FILE           obtain PATTERN from FILE
  -i, --ignore-case         ignore case distinctions
  -w, --word-regexp         force PATTERN to match only whole words
  -x, --line-regexp         force PATTERN to match only whole lines
  -z, --null-data           a data line ends in 0 byte, not newline

Miscellaneous:
  -s, --no-messages         suppress error messages
  -v, --invert-match        select non-matching lines
  -V, --version             display version information and exit
      

This assembly is full-length (an impressive achievement, which you'll learn more about next semester 🤯), and has been assembled into the 7 autosomes, two sex chromosomes (Z and W), and a mitochondrial genome.
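
Counting headers answers "how many", but sequence lengths need a little more work because a FastA record can wrap across many lines. The sketch below shows one way to do both on a toy file (the `chr_demo_*` records are invented for illustration). Anchoring the pattern with `^>` restricts the match to header lines.

```shell
# Toy FastA with two records; the first sequence is wrapped over two lines.
fasta=$(mktemp)
printf '>chr_demo_1 toy record\nACGTAC\nGT\n>chr_demo_2 toy record\nACG\n' > "$fasta"

# Number of records = number of header lines.
n_records=$(grep -c '^>' "$fasta")
echo "records: $n_records"

# Sum the length of every sequence line between consecutive headers.
lengths=$(awk '
    /^>/ {                       # header line: report the previous record
        if (name != "") print name "\t" len
        name = substr($1, 2)     # first word of the header, without ">"
        len = 0
        next
    }
    { len += length($0) }        # sequence line: add its length
    END { if (name != "") print name "\t" len }
' "$fasta")
echo "$lengths"
```

Pointing `fasta` at `genome.fa` instead of the temporary file should give one line per contig/chromosome.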

In addition to the genomic sequences, we also need the annotations file. Genome annotation files are tab-separated files that include coordinate information for all genomic annotations, for example the start/stop location of every single gene, mRNA, and exon. Annotations are usually in GTF or GFF format; GTF is more common and preferred by many programs. [Here's the definition](http://mblab.wustl.edu/GTF22.html) for the GTF file format.

Let's get the annotations and decompress them.

- This time we'll pipe the commands, so we have to redirect the download to standard out (`-O -`).
- The pipe operator `|` allows you to run a command using standard output from the previous command as the input.
- To write a single command over multiple lines, use the `\` sign.

In [None]:
!wget -O - https://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/WBPS19/species/schistosoma_mansoni/PRJEA36577/schistosoma_mansoni.PRJEA36577.WBPS19.canonical_geneset.gtf.gz | \
    gzip -d > /data/users/wheelenj/biol343/genome/annotations.gtf
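
The same pipe pattern can be tried offline. In this sketch `cat` plays the role of `wget -O -` (both write the compressed bytes to standard output), and the toy GTF line is made up for the demonstration.

```shell
tmp=$(mktemp -d)

# A one-line toy annotation file and its gzipped copy.
printf 'chr_demo\tToy\tgene\t1\t9\n' > "$tmp/toy.gtf"
gzip -c "$tmp/toy.gtf" > "$tmp/toy.gtf.gz"

# Producer writes compressed data to stdout; gzip -d reads it from stdin.
cat "$tmp/toy.gtf.gz" | \
    gzip -d > "$tmp/roundtrip.gtf"

# The decompressed stream should match the original file byte for byte.
cmp "$tmp/toy.gtf" "$tmp/roundtrip.gtf" && echo "identical"
```

Because nothing is written to disk between the two commands, the compressed file never has to be stored.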

Let's check out what the annotations look like. The `head` and `tail` commands allow you to look at the first or last lines respectively. A numerical argument defines how many lines you want to see.

In [2]:
!head -10 /data/users/wheelenj/biol343/genome/annotations.gtf

#!genebuild-version 2022-11-WormBase
SM_V10_1	WormBase	gene	68427	68783	.	-	.	gene_id "Smp_329140"; gene_version "1"; gene_source "WormBase"; gene_biotype "protein_coding";
SM_V10_1	WormBase	transcript	68427	68783	.	-	.	gene_id "Smp_329140"; gene_version "1"; transcript_id "Smp_329140.1"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
SM_V10_1	WormBase	exon	68427	68783	.	-	.	gene_id "Smp_329140"; gene_version "1"; transcript_id "Smp_329140.1"; exon_number "1"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; exon_id "Smp_329140.1.e1"; tag "Ensembl_canonical";
SM_V10_1	WormBase	CDS	68596	68763	.	-	0	gene_id "Smp_329140"; gene_version "1"; transcript_id "Smp_329140.1"; exon_number "1"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; protein_id "

Each line is a different annotation. In the first 10 lines we see gene, exon, CDS, start codon, stop codon, 5' UTR, and 3' UTR. The columns (technically "fields") of a GTF file are described at the link provided above. Fields are tab-separated. `grep` can again be used to search for annotations of interest, but this time we have to use a few regular expressions. For example, suppose we want to know how many transcripts are annotated:

In [None]:
!grep -c "transcript" /data/users/wheelenj/biol343/genome/annotations.gtf
!grep -c -P "\ttranscript\t" /data/users/wheelenj/biol343/genome/annotations.gtf

If we only search for "transcript," we'll get results for "transcript_biotype", "transcript_source", etc. There are far fewer than 231617 transcripts in this genome. To search for only "transcript" when it's alone in a field (that is, surrounded by tabs), we have to use a regular expression and tell `grep` we're doing so with the `-P` flag.
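
To see the difference on something small, here is a sketch with a three-line toy GTF (all names invented). Instead of `grep -P`, it uses `awk` to compare the third field exactly, which is an alternative way to express "alone in a field" and does not depend on Perl regular expression support.

```shell
gtf=$(mktemp)
{
    printf 'chr_demo\tToy\tgene\t1\t100\t.\t+\t.\tgene_id "g1";\n'
    printf 'chr_demo\tToy\ttranscript\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "g1.1"; transcript_biotype "protein_coding";\n'
    printf 'chr_demo\tToy\texon\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "g1.1"; transcript_biotype "protein_coding";\n'
} > "$gtf"

# Substring search: also counts the exon line, which mentions transcript_id.
naive=$(grep -c 'transcript' "$gtf")

# Exact match on field 3 (the feature type) only.
exact=$(awk -F'\t' '$3 == "transcript" { n++ } END { print n + 0 }' "$gtf")

echo "substring matches: $naive, transcript features: $exact"
```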

We can use this same idea to count how many genes are on chromosome 1:

In [None]:
!grep -c -P "^SM_V10_1\t[^\t]*\tgene\t" /data/users/wheelenj/biol343/genome/annotations.gtf

You can combine `grep` and `cut` to extract fields from specific lines. For instance, suppose you're interested in the gene called Smp_104210, which is an opsin protein (the receptor that detects photons in eyes or eye-spots). The following command would show you all the lines that contain that gene ID:

In [None]:
!grep 'Smp_104210' /data/users/wheelenj/biol343/genome/annotations.gtf

Now suppose you wanted the start position of the transcript associated with Smp_104210. We know from the GTF description that the 3rd field contains the feature type and the 4th field contains the start location. `cut` allows you to parse each field of a delimited file. `grep` extracts lines containing the search term, which can then be piped to `cut` to extract the feature type and start position, and then `grep` can again be used to keep only transcript features.

In [None]:
!grep 'Smp_104210' /data/users/wheelenj/biol343/genome/annotations.gtf | cut -f 3,4 | grep 'transcript'
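
The same pipeline on a toy annotation (the `gene_demo` ID and coordinates are invented) shows what each stage contributes. After `cut`, the feature type is the first thing on the line, so the second `grep` can be anchored with `^`.

```shell
gtf=$(mktemp)
{
    printf 'chr_demo\tToy\tgene\t500\t900\t.\t+\t.\tgene_id "gene_demo";\n'
    printf 'chr_demo\tToy\ttranscript\t500\t900\t.\t+\t.\tgene_id "gene_demo"; transcript_id "gene_demo.1";\n'
    printf 'chr_demo\tToy\texon\t500\t650\t.\t+\t.\tgene_id "gene_demo"; transcript_id "gene_demo.1";\n'
} > "$gtf"

# 1) keep lines for the gene, 2) keep feature type + start,
# 3) keep transcript rows, 4) keep only the start coordinate.
start=$(grep 'gene_demo' "$gtf" | cut -f 3,4 | grep '^transcript' | cut -f 2)
echo "transcript start: $start"
```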

# FastQ download and QC

We have the genome and annotations. The RNA-seq data will be mapped to this reference, and then we can count how many RNA-seq reads align to the annotations we care about (genes or transcripts). Now we need to get the FastQ files that we want to align and analyze.
- Create a new directory called `fastq`
- Get the SRA Run Table, which gives the metadata for the FastQ files.
    - *This will include a `cp` command to copy it from the class dir to the user dir*

In [None]:
!mkdir /data/users/wheelenj/biol343/fastq

Let's take a look at the run table:

In [None]:
!head -15 /data/users/wheelenj/biol343/fastq/SraRunTable.txt

You can see that there are 12 different runs, each has an ID that looks something like `SRR26691082`. After the ID, quite a bit of run metadata is provided. You should spend some time learning about what each field denotes.

As we know, the experiment had four different samples: 
- liver immature
- liver mature
- intestine immature
- intestine mature

This information is provided in the metadata. We are all going to work with all 12 runs.

First we have to use `sra-tools` to download the FastQ files from NCBI's SRA database. To do so, we'll use a `while` loop within bash. We can use `cut` to get the first field, but because it's comma-delimited instead of tab-delimited, we have to tell the program with the `-d` option. This can be saved to a file, which then can be looped through line-by-line.

In [None]:
%cd /data/users/wheelenj/biol343/fastq
!cut -d ',' -f 1 /data/users/wheelenj/biol343/fastq/SraRunTable.txt | tail -n +2 > /data/users/wheelenj/biol343/fastq/sra_list.txt
!while IFS= read -r line; do \
    echo "Getting $line from NCBI SRA"; \
    parallel-fastq-dump --sra-id $line --threads 16 --outdir . --gzip; \
    done < sra_list.txt
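
Before starting a long download, it can help to dry-run the loop. In this sketch a two-row toy run table stands in for `SraRunTable.txt` (the second column is a placeholder), and `echo` replaces the download command so nothing is fetched.

```shell
tmp=$(mktemp -d)
printf 'Run,Assay Type\nSRR26691082,RNA-Seq\nSRR26691085,RNA-Seq\n' > "$tmp/SraRunTable.txt"

# First comma-separated field, skipping the header row.
cut -d ',' -f 1 "$tmp/SraRunTable.txt" | tail -n +2 > "$tmp/sra_list.txt"

# Read the list one line at a time; echo stands in for the real download.
n=0
while IFS= read -r run; do
    echo "Would fetch $run from NCBI SRA"
    n=$((n + 1))
done < "$tmp/sra_list.txt"
echo "$n runs listed"
```

`cut -d ','` splits on every comma, including commas inside quoted fields, so this approach is only safe for columns that come before any such field. The run ID is the first column here, so it is unaffected.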

If we take another look at the metadata, there are a few things that are of interest to us and our analysis. First, these reads were generated with `PolyA` selection, which means reads should have many A's on their 3' end. Second, the reads are `ILLUMINA` reads generated on a `NextSeq 500` instrument, which means they may have Illumina adapters, depending on whether or not they were trimmed by the authors prior to uploading them to SRA. Third, from the paper (but not the run table), we know that these are single-end reads. This metadata will be important to us soon.

Whenever we look at FastQ files for the first time, we should perform quality control (QC). The primary tool used for read QC is called FastQC, which is installed in our environment. Let's take a look at the manual and then run QC on our reads:

In [None]:
!fastqc -h

In [None]:
%cd /data/users/wheelenj/biol343/fastq
!fastqc -t 16 *.fastq.gz

You should have a decent idea what each of these QC metrics means. For our immediate purposes, there are a few things to pay attention to:

1. Base sequence quality gets lower toward the end of the read.
2. Almost all reads begin with TATA (the UMI linker).
    1. Some reads include the UMI as well, but some libraries don't have it.
3. There are millions of duplicated sequences (sometimes >60%)!
4. There are PolyA tails.
5. It looks like the adapter sequences have already been trimmed (see below). 

From the [paper](https://journals.plos.org/plospathogens/article?id=10.1371/journal.ppat.1012268#sec010), we know that they used the QuantSeq 3' mRNA FWD V2 library prep kit. You can read a bit about the kit [here](https://www.lexogen.com/store/quantseq-3-mrna-seq-v2-fwd-with-udi/). This kit is optimized for a few things:

- Degraded mRNA - it will only get sequences from the 3' end (and PolyA tails)
- Low input RNA - which is often the case for small eukaryotic parasites, especially when working with eggs, which are difficult to get RNA out of

Typically, manufacturers will provide some guidance on how to trim and filter sequences produced from their kit. For Lexogen (the maker of the QuantSeq kit), this information can be found in the [FAQs](https://faqs.lexogen.com/faq/what-sequences-should-be-trimmed). This site shows that `AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC` is the adapter sequence. Is that sequence (or one like it) in the "Overrepresented sequences" section? If so, we need to trim the adapter. If not, we just need to trim the PolyA tail. We'll deal with duplicates later.

Remember how libraries are prepared. Each read may include a 6 nt UMI, a 4 nt (TATA) linker, the insert, the PolyA tail, and then the adapter. Based on our QC data, it looks like the adapters have been trimmed (`AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC` isn't overrepresented). The PolyA tail *may* have been trimmed, but it's not obvious - we still see up to 40% PolyA at position 59 in SRR26691093. On the other side of the read, we see TATA in all 4 files, and SRR26691093 and SRR26691087 contain the 6 nt UMI while SRR26691085 and SRR26691082 do not.
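
One rough way to check this read layout outside of FastQC is to look at where TATA sits in each read. The sketch below runs on three invented reads; it treats TATA at positions 1-4 as "linker only" and TATA at positions 7-10 as "6 nt prefix plus linker". This is a heuristic under those assumptions, not a validated classifier, since a 6 nt prefix could itself start with TATA by chance.

```shell
# Toy FastQ: one read with a 6 nt prefix + TATA, one starting with TATA, one with neither.
fastq=$(mktemp)
{
    printf '@read_prefix_linker\nACGTACTATAGGGCCCAAAAAA\n+\nIIIIIIIIIIIIIIIIIIIIII\n'
    printf '@read_linker_only\nTATAGGGCCCTTAAAA\n+\nIIIIIIIIIIIIIIII\n'
    printf '@read_neither\nGGGCCCGGGCCC\n+\nIIIIIIIIIIII\n'
} > "$fastq"

# Sequence lines are every 4th line, starting at line 2.
summary=$(awk '
    NR % 4 == 2 {
        total++
        if (substr($0, 1, 4) == "TATA") linker_first++
        else if (substr($0, 7, 4) == "TATA") after_prefix++
    }
    END { print total + 0, linker_first + 0, after_prefix + 0 }
' "$fastq")
echo "total linker_at_start linker_after_6nt: $summary"
```

For the gzipped course files, the input would come from `gzip -dc file.fastq.gz | awk ...` rather than from a plain file.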

We definitely want to trim PolyA tails, but trimming UMIs and linkers actually depends on the type of aligner being used. STAR (the aligner we'll be using) can soft-clip the ends of reads that have a high mismatch rate. This also means that we don't need to trim low-quality bases, because they can also be soft-clipped. What about the duplicates? Those can be marked after alignment, so we don't need to remove them now either.

In the end, we'll just trim PolyA tails and do some trimming that's particular to NextSeq that we're not going to get into. We will use the tool [cutadapt](https://cutadapt.readthedocs.io/en/stable/) to perform the trimming. First, check out the manual:

In [None]:
!cutadapt --help

Edit this block and define the effect of each of the following flags.

`-j`: 

`-m`: 

`-O`:

`-a`:

`-n`:

We're only going to use some of these options (`-j` and `-m`), but it's good to get used to searching tool manuals and deeply understanding each available option/flag.

Now let's trim and run QC on these new files to see how they look. We'll use a bash loop again.

In [None]:
%cd /data/users/wheelenj/biol343/fastq
%mkdir trimmed
!for fastq in *.fastq.gz; do \
    cutadapt -j 16 -m 20 --poly-a --nextseq-trim=10 -o ./trimmed/$fastq $fastq; \
    done
!fastqc -t 16 ./trimmed/*.fastq.gz

Open up the QC files from before/after trimming and compare them. Are they looking better? 

You'll notice that the "Sequence Length Distribution" section went from ✅ to ❌. That's because the trimmed PolyA tail was different for each read, so now we have a broad distribution of lengths rather than all of them being 69 nt. 
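
The length distribution can also be tabulated directly from a FastQ file. This is a minimal sketch on three invented reads; it prints one line per observed read length with the number of reads of that length.

```shell
fastq=$(mktemp)
{
    printf '@r1\nACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIII\n'
    printf '@r2\nTTGGCCAATTGGCCAA\n+\nIIIIIIIIIIIIIIII\n'
    printf '@r3\nACGTACGTACGT\n+\nIIIIIIIIIIII\n'
} > "$fastq"

# Count reads per sequence length, then sort numerically by length.
hist=$(awk 'NR % 4 == 2 { n[length($0)]++ } END { for (len in n) print len "\t" n[len] }' "$fastq" | sort -n)
echo "$hist"
```

Running the same `awk` command on an untrimmed and a trimmed file (via `gzip -dc`) should show a single length before trimming and a spread of lengths after.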

Even though we don't have adapters, we still have overrepresented sequences. What are they? BLAST them against a nucleotide database and see if they are concerning or not.

# Alignment

We will be using STAR (the same tool the authors used) to align RNA-seq reads to the reference genome. STAR has a number of advantages over other aligners:

1. Ultra-fast
2. Can deal with UMIs/linkers/errors at the ends of reads
3. Splice-aware

To allow it to be splice-aware (align across splice junctions, with large gaps corresponding to introns), we need to generate a genome index that incorporates the annotations (GTF) and the sequences. We will be using the [STAR manual](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf) heavily.

## Index the genome

From the manual:

The basic options to generate genome indices are as follows:
```
--runThreadN NumberOfThreads
--runMode genomeGenerate
--genomeDir /path/to/genomeDir
--genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 ...
--sjdbGTFfile /path/to/annotations.gtf
--sjdbOverhang ReadLength-1
```

Most of these are self-explanatory, but `--sjdbOverhang` deserves special consideration. Here's the description:

> `--sjdbOverhang` specifies the length of the genomic sequence around the annotated junction
to be used in constructing the splice junctions database. Ideally, this length should be equal
to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina
2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the
ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as
well as the ideal value.

Using this description and what we know about our read lengths from our QC, choose the best value. Edit and run the code block below.

In [None]:
!STAR \
--runThreadN 16 \
--runMode genomeGenerate \
--genomeDir /data/users/wheelenj/biol343/genome \
--genomeFastaFiles /data/users/wheelenj/biol343/genome/genome.fa \
--sjdbGTFfile /data/users/wheelenj/biol343/genome/annotations.gtf \
--sjdbOverhang 68 \
--genomeSAindexNbases 13
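
The `max(ReadLength)-1` rule from the manual can be computed rather than typed by hand. The sketch below builds a toy FastQ with one 69 nt read and one 50 nt read (`make_read` is a hypothetical helper written for this example) and derives the overhang from it.

```shell
# Hypothetical helper: print a read of N copies of "A".
make_read() {
    awk -v n="$1" 'BEGIN { s = sprintf("%" n "s", ""); gsub(/ /, "A", s); print s }'
}

fastq=$(mktemp)
long_read=$(make_read 69)
short_read=$(make_read 50)
# The read string doubles as its own quality line to keep the toy file valid.
printf '@r1\n%s\n+\n%s\n@r2\n%s\n+\n%s\n' \
    "$long_read" "$long_read" "$short_read" "$short_read" > "$fastq"

# Longest sequence line in the file, then subtract one.
max_len=$(awk 'NR % 4 == 2 && length($0) > m { m = length($0) } END { print m + 0 }' "$fastq")
overhang=$((max_len - 1))
echo "max read length: $max_len, --sjdbOverhang: $overhang"
```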

## Mapping
Here are the manual instructions for the mapping steps:

>The basic options to run a mapping job are as follows:  
`--runThreadN` *NumberOfThreads*  
`--genomeDir` */path/to/genomeDir*  
`--readFilesIn` */path/to/read1 [/path/to/read2 ]*  
`--genomeDir` specifies path to the genome directory where genome indices where generated
(see Section 2. Generating genome indexes).  
`--readFilesIn` name(s) (with path) of the files containing the sequences to be mapped (e.g.
RNA-seq FASTQ files). If using Illumina paired-end reads, the read1 and read2 files have to
be supplied. STAR can process both FASTA and FASTQ files. Multi-line (i.e. sequence split
in multiple lines) FASTA (but not FASTQ) files are supported.  
If the read files are compressed, use the `--readFilesCommand` *UncompressionCommand* option,
where *UncompressionCommand* is the un-compression command that takes the file name as
input parameter, and sends the uncompressed output to stdout. For example, for gzipped
files (\*.gz) use `--readFilesCommand` *zcat* OR `--readFilesCommand` *gunzip -c*. For bzip2-compressed files, use `--readFilesCommand` *bunzip2 -c*.

Run the mapping step in the code block below. Remember to use the trimmed reads. Let's run separate commands for each input FastQ file. You could also provide them all in one command, but then the aligned reads would be in a single output file with a distinguisher in the SAM header, which we haven't yet talked about. Speaking of SAM files, we'll also set an option to create BAMs instead of SAMs and to sort these files by coordinate. Finally, we also need to set `--outFileNamePrefix` to make sure each output file gets a different prefix. We will make a separate directory to organize the output files.

In [None]:
%mkdir /data/users/wheelenj/biol343/mapping

!STAR \
--runThreadN 16 \
--runMode alignReads \
--genomeDir /data/users/wheelenj/biol343/genome \
--readFilesIn /data/users/wheelenj/biol343/fastq/SRR26691082_trim1.fastq \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix /data/users/wheelenj/biol343/mapping/SRR26691082_trim1/

Section 9 of the STAR manual provides guidance for running so-called 2-pass mapping. In this scheme, the first pass maps to the known splice junctions provided in the GTF, while the second pass re-maps to known and novel junctions (which are output in `SJ.out.tab`). However, section 9.1 recommends that we include the junctions from ***all*** samples, so we need to align the remaining samples first:

In [None]:
!STAR \
--runThreadN 16 \
--runMode alignReads \
--genomeDir /data/users/wheelenj/biol343/genome \
--readFilesIn /data/users/wheelenj/biol343/fastq/SRR26691085_trim1.fastq \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix /data/users/wheelenj/biol343/mapping/SRR26691085_trim1/

!STAR \
--runThreadN 16 \
--runMode alignReads \
--genomeDir /data/users/wheelenj/biol343/genome \
--readFilesIn /data/users/wheelenj/biol343/fastq/SRR26691087_trim1.fastq \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix /data/users/wheelenj/biol343/mapping/SRR26691087_trim1/

!STAR \
--runThreadN 16 \
--runMode alignReads \
--genomeDir /data/users/wheelenj/biol343/genome \
--readFilesIn /data/users/wheelenj/biol343/fastq/SRR26691093_trim1.fastq \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix /data/users/wheelenj/biol343/mapping/SRR26691093_trim1/
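
Since the four first-pass commands differ only in the run ID, they could also be generated by a loop. The sketch below is a dry run: it uses the same paths and options as the cells above, but `echo` prints each command instead of executing it, so STAR does not need to be installed to try it.

```shell
base=/data/users/wheelenj/biol343

# Build one STAR command per run; echo prints it instead of running it.
first_pass_cmds=$(
    for run in SRR26691082 SRR26691085 SRR26691087 SRR26691093; do
        echo STAR \
            --runThreadN 16 \
            --runMode alignReads \
            --genomeDir "$base/genome" \
            --readFilesIn "$base/fastq/${run}_trim1.fastq" \
            --outSAMtype BAM SortedByCoordinate \
            --outFileNamePrefix "$base/mapping/${run}_trim1/"
    done
)
echo "$first_pass_cmds"
```

Removing the `echo` in front of `STAR` would turn the dry run into the real thing. Printing the commands first makes it easy to spot a mistyped run ID.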

Now we run them all again, this time providing the paths to the new splice junctions with `--sjdbFileChrStartEnd`.

In [None]:
!STAR \
--runThreadN 16 \
--runMode alignReads \
--genomeDir /data/users/wheelenj/biol343/genome \
--readFilesIn /data/users/wheelenj/biol343/fastq/SRR26691082_trim1.fastq \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix /data/users/wheelenj/biol343/mapping/SRR26691082_trim1/ \
--sjdbFileChrStartEnd /data/users/wheelenj/biol343/mapping/SRR26691082_trim1/SJ.out.tab \
                      /data/users/wheelenj/biol343/mapping/SRR26691085_trim1/SJ.out.tab \
                      /data/users/wheelenj/biol343/mapping/SRR26691087_trim1/SJ.out.tab \
                      /data/users/wheelenj/biol343/mapping/SRR26691093_trim1/SJ.out.tab

!STAR \
--runThreadN 16 \
--runMode alignReads \
--genomeDir /data/users/wheelenj/biol343/genome \
--readFilesIn /data/users/wheelenj/biol343/fastq/SRR26691085_trim1.fastq \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix /data/users/wheelenj/biol343/mapping/SRR26691085_trim1/ \
--sjdbFileChrStartEnd /data/users/wheelenj/biol343/mapping/SRR26691082_trim1/SJ.out.tab \
                      /data/users/wheelenj/biol343/mapping/SRR26691085_trim1/SJ.out.tab \
                      /data/users/wheelenj/biol343/mapping/SRR26691087_trim1/SJ.out.tab \
                      /data/users/wheelenj/biol343/mapping/SRR26691093_trim1/SJ.out.tab

!STAR \
--runThreadN 16 \
--runMode alignReads \
--genomeDir /data/users/wheelenj/biol343/genome \
--readFilesIn /data/users/wheelenj/biol343/fastq/SRR26691087_trim1.fastq \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix /data/users/wheelenj/biol343/mapping/SRR26691087_trim1/ \
--sjdbFileChrStartEnd /data/users/wheelenj/biol343/mapping/SRR26691082_trim1/SJ.out.tab \
                      /data/users/wheelenj/biol343/mapping/SRR26691085_trim1/SJ.out.tab \
                      /data/users/wheelenj/biol343/mapping/SRR26691087_trim1/SJ.out.tab \
                      /data/users/wheelenj/biol343/mapping/SRR26691093_trim1/SJ.out.tab

!STAR \
--runThreadN 16 \
--runMode alignReads \
--genomeDir /data/users/wheelenj/biol343/genome \
--readFilesIn /data/users/wheelenj/biol343/fastq/SRR26691093_trim1.fastq \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix /data/users/wheelenj/biol343/mapping/SRR26691093_trim1/ \
--sjdbFileChrStartEnd /data/users/wheelenj/biol343/mapping/SRR26691082_trim1/SJ.out.tab \
                      /data/users/wheelenj/biol343/mapping/SRR26691085_trim1/SJ.out.tab \
                      /data/users/wheelenj/biol343/mapping/SRR26691087_trim1/SJ.out.tab \
                      /data/users/wheelenj/biol343/mapping/SRR26691093_trim1/SJ.out.tab
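
Typing four `SJ.out.tab` paths into every command is error-prone, so one option is to collect them with a glob. This sketch recreates the directory layout with empty placeholder files in a temporary directory (standing in for the real `mapping` directory) and assembles the option string from whatever it finds.

```shell
mapping=$(mktemp -d)    # stand-in for /data/users/wheelenj/biol343/mapping

# Recreate the per-sample layout with empty placeholder junction files.
for run in SRR26691082 SRR26691085 SRR26691087 SRR26691093; do
    mkdir -p "$mapping/${run}_trim1"
    : > "$mapping/${run}_trim1/SJ.out.tab"
done

# Let the shell expand the glob into one argument per junction file.
set -- "$mapping"/*_trim1/SJ.out.tab
n_sj=$#
sj_option="--sjdbFileChrStartEnd $*"
echo "$n_sj junction files"
echo "$sj_option"
```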

We're also going to merge these BAMs, just in case. We also create an index, which allows for fast random access of the BAM file, rather than only accessing the beginning or end of the file.

In [None]:
%cd /data/users/wheelenj/biol343/mapping
!samtools merge -o merged.bam SRR26691082_trim1/Aligned.sortedByCoord.out.bam \
                              SRR26691085_trim1/Aligned.sortedByCoord.out.bam \
                              SRR26691087_trim1/Aligned.sortedByCoord.out.bam \
                              SRR26691093_trim1/Aligned.sortedByCoord.out.bam
!samtools index merged.bam

Mapping is now complete! Now onto post-alignment QC...

# Alignment QC

In [None]:
%cd /data/users/wheelenj/biol343/
!grep Smp_104210 genome/annotations.gtf

In [None]:
!samtools tview -d T -p SM_V10_Z:64329382 mapping/merged.bam

In [None]:
%cd /data/users/wheelenj/biol343/
!multiqc .

IGV of Smp_316760 (a highly expressed VAL). SM_V10_6:17,278,403-17,279,395

In [None]:
%cd /data/users/wheelenj/biol343/
!grep Smp_316760 genome/annotations.gtf

# Dedup
Only done for low-input, low-complexity libraries. Could even compare the results between deduped and raw alignments.

In [None]:
%cd /data/users/wheelenj/biol343/mapping
!picard MarkDuplicates \
      I=SRR26691082_trim1/Aligned.sortedByCoord.out.bam \
      O=SRR26691082_trim1/dedup.bam \
      M=SRR26691082_trim1/output_duplicate_metrics.txt

In [None]:
!head SRR26691082_trim1/output_duplicate_metrics.txt

In [None]:
%cd /data/users/wheelenj/biol343/mapping
!picard MarkDuplicates \
      I=SRR26691085_trim1/Aligned.sortedByCoord.out.bam \
      O=SRR26691085_trim1/dedup.bam \
      M=SRR26691085_trim1/output_duplicate_metrics.txt

!picard MarkDuplicates \
      I=SRR26691087_trim1/Aligned.sortedByCoord.out.bam \
      O=SRR26691087_trim1/dedup.bam \
      M=SRR26691087_trim1/output_duplicate_metrics.txt

!picard MarkDuplicates \
      I=SRR26691093_trim1/Aligned.sortedByCoord.out.bam \
      O=SRR26691093_trim1/dedup.bam \
      M=SRR26691093_trim1/output_duplicate_metrics.txt
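
A loop keeps the near-identical duplicate-marking commands in sync and gives each sample its own metrics file, so one run cannot overwrite another's metrics. This sketch is a dry run that only prints the commands, using the same `I=`, `O=`, and `M=` arguments as the cells above.

```shell
# Print one picard command per sample; remove echo to actually run them.
picard_cmds=$(
    for run in SRR26691082 SRR26691085 SRR26691087 SRR26691093; do
        dir="${run}_trim1"
        echo picard MarkDuplicates \
            "I=$dir/Aligned.sortedByCoord.out.bam" \
            "O=$dir/dedup.bam" \
            "M=$dir/output_duplicate_metrics.txt"
    done
)
echo "$picard_cmds"
```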

# Counting

Count reads per gene with `featureCounts`, ignoring reads marked as duplicates:

In [None]:
%cd /data/users/wheelenj/biol343/mapping
%mkdir /data/users/wheelenj/biol343/counts
!featureCounts SRR26691082_trim1/dedup.bam SRR26691085_trim1/dedup.bam SRR26691087_trim1/dedup.bam SRR26691093_trim1/dedup.bam -T 16 -a /data/users/wheelenj/biol343/genome/annotations.gtf -g gene_id -G /data/users/wheelenj/biol343/genome/genome.fa -o /data/users/wheelenj/biol343/counts/counts.tsv -M --fraction --ignoreDup
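
The `featureCounts` output table starts with a comment line, followed by a header row whose first six columns are `Geneid`, `Chr`, `Start`, `End`, `Strand`, and `Length`; the per-sample counts begin in column 7. Here is a sketch of trimming that down to a gene-by-sample matrix, using a toy table with invented genes and values in place of `counts.tsv`.

```shell
counts=$(mktemp)
{
    printf '# Program:featureCounts (toy comment line)\n'
    printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample_a.bam\tsample_b.bam\n'
    printf 'gene_demo_1\tchr_demo\t1\t100\t+\t100\t12.5\t7\n'
    printf 'gene_demo_2\tchr_demo\t200\t260\t-\t61\t0\t3\n'
} > "$counts"

# Drop the comment line, then keep the gene ID and every count column.
matrix=$(grep -v '^#' "$counts" | cut -f 1,7-)
echo "$matrix"
```

Fractional values such as `12.5` are expected when `-M --fraction` is used, because a multi-mapping read is split across its alignments.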

# Differential expression
Need multiple replicates to do this correctly, so will have to align/count all 12 files in the end...