# transcriptome assembly and annotation

Aaron Lee, June 2023

This data analysis pipeline largely follows the transcriptome assembly workflow that Ya Yang and Diego Morales-Briones describe for phylogenomic dataset construction and orthology analysis (https://bitbucket.org/yanglab/phylogenomic_dataset_construction/src/master/). It is slightly less conservative, with fewer raw-data cleaning steps. The workflow is as follows:
1. quality checking
2. quality trimming
3. deduplication
4. removing organellar reads
5. de novo transcriptome assembly
6. process transcriptome sequences
7. coding sequence annotation

## dependencies
Anaconda or one of its derivatives (e.g. Miniconda, Mamba). All of the following programs are installed in a single conda environment. Additionally, you will need the chimera removal script from the phylogenomic dataset construction pipeline.

- FastQC
- Trimmomatic
- PRINSEQ
- Bowtie2
- TransRate
- Trinity
- BLAST+
- TransDecoder
- chimera removal script

The following code will help us to write and submit batch scripts to the SLURM job scheduler from this notebook on MSI.

In [2]:
import subprocess, sys, os

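# create the environment once (uncomment on first use)
#subprocess.run(["conda", "env", "create", "-n", "transcriptome", "--file", "transcriptome.yml"])
# NOTE: `conda activate` is a shell function, so calling it through subprocess
# fails (non-zero return code) and cannot change the environment of this kernel.
# Start the notebook from the activated `transcriptome` environment instead.
subprocess.run(["conda", "activate", "transcriptome"])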

CompletedProcess(args=['conda', 'activate', 'transcriptome'], returncode=1)
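
Every step in this notebook writes the same `#SBATCH` header line by line. A minimal sketch of how that boilerplate could be factored out, assuming a hypothetical helper named `write_sbatch` (it is not part of the original notebook, and the defaults simply mirror the values used in the cells here):

```python
from pathlib import Path


def write_sbatch(path, job_name, commands, time="12:00:00", ntasks=16,
                 mem="8g", partition="msismall", mail_user=None):
    """Write a SLURM batch script and return its path (hypothetical helper)."""
    lines = ["#!/usr/bin/bash"]
    if mail_user:
        lines += [f"#SBATCH --mail-user={mail_user}", "#SBATCH --mail-type=ALL"]
    lines += [
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --time={time}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --mem={mem}",
        "",
    ]
    lines += list(commands)
    # joining with "\n" guarantees that every command lands on its own line
    script = Path(path)
    script.write_text("\n".join(lines) + "\n")
    return script
```

Because the commands are passed as a list and joined with newlines, two commands can never run together on one line of the script, which is easy to get wrong when each `outf.write` call has to carry its own `\n`.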

# 1. quality checking

In [5]:
with open("fastqc.sh", "w+") as outf:
    outf.write("#!/usr/bin/bash\n")
    outf.write("#SBATCH --mail-user=lee02893@umn.edu\n")
    outf.write("#SBATCH --mail-type=ALL\n")
    outf.write("#SBATCH --job-name=fastqc\n")
    outf.write("#SBATCH --time=1:00:00\n")
    outf.write("#SBATCH --partition=msismall\n")
    outf.write("#SBATCH --ntasks=16\n")
    outf.write("#SBATCH --mem=8g\n\n")
    #outf.write("module load fastqc\n")
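    # pass full paths to FastQC (a bare `ls` listing drops the directory prefix);
    # assumes the raw reads are the *.fastq.gz files in the data directory
    outf.write("fastqc -t 16 /home/yangya/lee02893/sceletium/data/*.fastq.gz\n")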

subprocess.run(["sbatch", "fastqc.sh"])

    

CompletedProcess(args=['sbatch', 'fastqc.sh'], returncode=0)
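
A return code of 0 only tells us that `sbatch` accepted the script. On success `sbatch` also prints a line of the form `Submitted batch job <id>`, and keeping that ID makes it easier to follow the job later. A small sketch, assuming hypothetical helper names `parse_job_id` and `submit`:

```python
import re
import subprocess


def parse_job_id(sbatch_stdout):
    """Extract the numeric job ID from sbatch's 'Submitted batch job N' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_stdout!r}")
    return int(match.group(1))


def submit(script):
    """Submit a batch script and return its job ID (requires sbatch on PATH)."""
    result = subprocess.run(["sbatch", script], capture_output=True,
                            text=True, check=True)
    return parse_job_id(result.stdout)
```

With `check=True`, a rejected submission raises `subprocess.CalledProcessError` instead of passing silently.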

# 2. quality trimming

In [7]:
with open("trimmomatic.sh", "w+") as outf:
    outf.write("#!/usr/bin/bash\n")
    outf.write("#SBATCH --mail-user=lee02893@umn.edu\n")
    outf.write("#SBATCH --mail-type=ALL\n")
    outf.write("#SBATCH --job-name=trim\n")
    outf.write("#SBATCH --time=12:00:00\n")
    outf.write("#SBATCH --partition=msismall\n")
    outf.write("#SBATCH --ntasks=16\n")
    outf.write("#SBATCH --mem=8g\n")
    #outf.write("module load trimmomatic\n")
    outf.write(
        "java -jar $TRIMMOMATIC/trimmomatic.jar PE -threads 16 "
        # raw paired-end reads
        "/home/yangya/lee02893/sceletium/data/Sceletium_nova_S25_R1_001.fastq.gz "
        "/home/yangya/lee02893/sceletium/data/Sceletium_nova_S25_R2_001.fastq.gz "
        # outputs, in the order Trimmomatic expects: 1P 1U 2P 2U
        "Sceletium_nova.trim.1P.fq Sceletium_nova.trim.1U.fq "
        "Sceletium_nova.trim.2P.fq Sceletium_nova.trim.2U.fq "
        # trimming steps, applied in the order given
        "ILLUMINACLIP:$TRIMMOMATIC/adapters/all_illumina_adapters.fa:2:30:7 "
        "SLIDINGWINDOW:4:20 LEADING:10 TRAILING:10 MINLEN:35\n"
    )


subprocess.run(["sbatch", "trimmomatic.sh"])
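
The Trimmomatic call is long, and the order of its positional arguments matters: two inputs, then paired and unpaired output for mate 1, then paired and unpaired output for mate 2, then the trimming steps. A sketch of building the same command from a sample prefix, assuming a hypothetical function `trimmomatic_pe_command`; the trimming settings are copied from the cell above:

```python
def trimmomatic_pe_command(jar, adapters, r1, r2, prefix, threads=16):
    """Build a paired-end Trimmomatic command line as a single string."""
    # output order required by Trimmomatic PE: 1P, 1U, 2P, 2U
    outputs = [f"{prefix}.trim.{mate}{kind}.fq"
               for mate in (1, 2) for kind in ("P", "U")]
    steps = [
        f"ILLUMINACLIP:{adapters}:2:30:7",
        "SLIDINGWINDOW:4:20",
        "LEADING:10",
        "TRAILING:10",
        "MINLEN:35",
    ]
    args = ["java", "-jar", jar, "PE", "-threads", str(threads),
            r1, r2, *outputs, *steps]
    # plain join (no shell quoting) so $TRIMMOMATIC is still expanded by bash;
    # this assumes none of the paths contain spaces
    return " ".join(args)
```

The returned string can be written to the batch script followed by `"\n"`.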


# 3. deduplication

In [8]:
with open("prinseq.sh", "w+") as outf:
    outf.write("#!/usr/bin/bash\n")
    outf.write("#SBATCH --mail-user=lee02893@umn.edu\n")
    outf.write("#SBATCH --mail-type=ALL\n")
    outf.write("#SBATCH --job-name=prinseq\n")
    outf.write("#SBATCH --time=12:00:00\n")
    outf.write("#SBATCH --partition=msismall\n")
    outf.write("#SBATCH --ntasks=16\n")
    outf.write("#SBATCH --mem=8g\n")
    #outf.write("module load prinseq\n")
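    # paired-end input needs -fastq for mate 1 and -fastq2 for mate 2;
    # the inputs are assumed to be the paired outputs of the trimming step
    outf.write("prinseq-lite.pl -verbose -fastq Sceletium_nova.trim.1P.fq -fastq2 Sceletium_nova.trim.2P.fq -derep 123 -out_good Sceletium_nova_dedup -out_bad null\n")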

subprocess.run(["sbatch", "prinseq.sh"])
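
To see how many reads deduplication removed, it helps to count records before and after. A minimal sketch, assuming standard four-line FASTQ records and a hypothetical helper name `count_fastq_records`:

```python
import gzip


def count_fastq_records(path):
    """Count reads in a FASTQ file, assuming exactly four lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4
```

Calling it on the mate 1 file going into PRINSEQ and on the mate 1 file coming out gives the number of read pairs dropped as duplicates.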


# 4. removing organellar reads

In [10]:
with open("bowtie2.sh", "w+") as outf:
    outf.write("#!/usr/bin/bash\n")
    outf.write("#SBATCH --mail-user=lee02893@umn.edu\n")
    outf.write("#SBATCH --mail-type=ALL\n")
    outf.write("#SBATCH --job-name=bowtie2\n")
    outf.write("#SBATCH --time=12:00:00\n")
    outf.write("#SBATCH --partition=msismall\n")
    outf.write("#SBATCH --ntasks=16\n")
    outf.write("#SBATCH --mem=8g\n")
    #outf.write("module load bowtie2\n")
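    # build bowtie2 index for ice plant chloroplast
    outf.write("bowtie2-build /home/yangya/lee02893/sceletium/3-bowtie2/Mesembryanthemum_crystallinum.cp.fa Mcry_cp\n")
    # align reads to ice plant chloroplast and retain read pairs that DO NOT map concordantly (--un-conc);
    # inputs are assumed to be the PRINSEQ outputs of step 3, and the "%" in the --un-conc path is
    # replaced by the mate number, giving Sceletium_nova.bt2.1P.fq and Sceletium_nova.bt2.2P.fq
    outf.write("bowtie2 -p 16 -x Mcry_cp -1 Sceletium_nova_dedup_1.fastq -2 Sceletium_nova_dedup_2.fastq --un-conc Sceletium_nova.bt2.%P.fq -S Sceletium_nova.cp.sam\n")
    # remove large SAM file
    outf.write("rm *.sam\n")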
    
subprocess.run(["sbatch", "bowtie2.sh"])
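
Each step reads the files written by the previous one, so a job submitted too early will fail on missing input. SLURM can enforce the order with `--dependency=afterok:<jobid>`, which holds a job until the named job has finished successfully. A sketch, assuming a hypothetical helper `sbatch_args`; the job ID would come from the earlier submission:

```python
def sbatch_args(script, after_ok=None):
    """Build the sbatch argument list, optionally waiting on an earlier job."""
    args = ["sbatch"]
    if after_ok is not None:
        # start only after job `after_ok` has completed with exit code 0
        args.append(f"--dependency=afterok:{after_ok}")
    args.append(script)
    return args
```

The resulting list can be passed directly to `subprocess.run`.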


# 5. de novo transcriptome assembly

In [None]:
with open("trinity.sh", "w+") as outf:
    outf.write("#!/usr/bin/bash\n")
    outf.write("#SBATCH --mail-user=lee02893@umn.edu\n")
    outf.write("#SBATCH --mail-type=ALL\n")
    outf.write("#SBATCH --job-name=trinity\n")
    outf.write("#SBATCH --time=1-00:00:00\n")
    outf.write("#SBATCH --partition=msismall\n")
    outf.write("#SBATCH --ntasks=16\n")
    outf.write("#SBATCH --mem=200g\n")
    #outf.write("module load trinity\n")
    outf.write("Trinity --seqType fq --max_memory 200G --CPU 16 --verbose --left Sceletium_nova.bt2.1P.fq --right Sceletium_nova.bt2.2P.fq --output trinity_Sceletium_nova --full_cleanup")
    
subprocess.run(["sbatch", "trinity.sh"])
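
Once Trinity finishes, a quick look at the number of transcripts and their N50 is a useful sanity check before the later processing steps. A dependency-free sketch, assuming hypothetical helper names `read_fasta_lengths` and `n50`:

```python
def read_fasta_lengths(path):
    """Return the length of every sequence in a FASTA file."""
    lengths = []
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input
```

N50 describes contiguity only; it says nothing about whether the transcripts are correct, which is what the chimera and TransRate filters address.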


# 6. process transcriptome sequences

In [None]:
with open("process_transcripts.sh", "w+") as outf:
    outf.write("#!/usr/bin/bash\n")
    outf.write("#SBATCH --mail-user=lee02893@umn.edu\n")
    outf.write("#SBATCH --mail-type=ALL\n")
    outf.write("#SBATCH --job-name=chimeras\n")
    outf.write("#SBATCH --time=1-00:00:00\n")
    outf.write("#SBATCH --partition=msismall\n")
    outf.write("#SBATCH --ntasks=20\n")
    outf.write("#SBATCH --mem=40g\n")
    #outf.write("module load ncbi_blast+\n")
    
    # blastx search Trinity transcripts against ice plant amino acid sequences
    # (single-quoted Python string so the double-quoted -outfmt argument survives)
    outf.write('blastx -db /home/yangya/lee02893/blastdb/iceplant_protein -out Sceletium_nova_blastx.tsv -query trinity_Sceletium_nova/Trinity.fasta -num_threads 20 -evalue 10 -outfmt "6 qseqid qlen sseqid slen frames pident nident length mismatch gapopen qstart qend sstart send evalue bitscore" -max_target_seqs 100\n')
    # remove chimeric sequences using Ya's script
    outf.write("python /home/yangya/lee02893/software/phylogenomic_dataset_construction/scripts/detect_chimera_from_blastx_modified.py Sceletium_nova_blastx.tsv ./\n")

    # remove poorly supported sequences using transrate
    # (the trailing newline keeps this command separate from the next one)
    outf.write("transrate --assembly trinity_Sceletium_nova/Trinity.fasta --left Sceletium_nova.bt2.1P.fq --right Sceletium_nova.bt2.2P.fq\n")

    # call "unigenes"
    # usage: python get_unigenes.py [transrate directory] [chimera directory]
    outf.write("python /home/yangya/lee02893/github/genomics_tools/transcriptome_tools/get_unigenes.py ./ ./\n")
    
subprocess.run(["sbatch", "process_transcripts.sh"])
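
The custom `-outfmt 6` string above produces a tab-separated file with no header row, so the column names have to be supplied when reading it back. A sketch for inspecting the hits, assuming hypothetical helper names `read_blast_hits` and `best_hit_per_query`; the column list is copied from the blastx command:

```python
import csv

# column order copied from the -outfmt string in the blastx command
BLAST_COLUMNS = ("qseqid qlen sseqid slen frames pident nident length mismatch "
                 "gapopen qstart qend sstart send evalue bitscore").split()


def read_blast_hits(path):
    """Read headerless tabular BLAST output into a list of dicts (all values are strings)."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
        return list(reader)


def best_hit_per_query(hits):
    """Keep the highest-bitscore hit for each query transcript."""
    best = {}
    for hit in hits:
        query = hit["qseqid"]
        if query not in best or float(hit["bitscore"]) > float(best[query]["bitscore"]):
            best[query] = hit
    return best
```

This is only for inspection; the chimera detection itself is done by the script from the phylogenomic dataset construction pipeline.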


# 7. coding sequence annotation

In [None]:
# need transrate and transdecoder here

with open("transdecoder.sh", "w+") as outf:
    outf.write("#!/usr/bin/bash\n")
    outf.write("#SBATCH --mail-user=lee02893@umn.edu\n")
    outf.write("#SBATCH --mail-type=ALL\n")
    outf.write("#SBATCH --job-name=transdecoder\n")
    outf.write("#SBATCH --time=1-00:00:00\n")
    outf.write("#SBATCH --partition=msismall\n")
    outf.write("#SBATCH --ntasks=20\n")
    outf.write("#SBATCH --mem=40g\n")

    # identify candidate long open reading frames in the unigenes
    outf.write("TransDecoder.LongOrfs -t unigenes.")

subprocess.run(["sbatch", "transdecoder.sh"])
