MAX is a novel method to quantify the Mutant-Allele eXpression (MAX) at isoform level from RNA-seq data.
MAX requires several files as input, such as:
- A list of mutations containing the information of chromosome, start position, end position, reference sequence, alternative sequence and gene names. Here is an example of the mutation file with the header:
- The GTF annotation file, which can be dowlnloaded from the UCSC Table Browser. The hg19 or hg38 RefGene GTF can be downloaded by running:
wget ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.refGene.gtf.gz # hg19
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz #hg38
- The whole transcriptome reference, or the isoforms of a subset of genes that you aim to focus on. The clean version of hg19 or hg38 reference can be downloaded by running:
wget ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/refMrna.fa.gz # hg19
wget ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/refMrna.fa.gz #hg38
- The RNA-seq data from multiple samples.
- R version 3.6.0 or later with installed packages: GenomicFeatures, BSgenome.Hsapiens.UCSC.hg38 (or BSgenome.Hsapiens.UCSC.hg19), polyester, Biostrings, foreach and doParallel
- C++11 compliant compiler (g++ >= 4.7)
- Download the latest binary version from the MAX release:
wget https://github.com/WenjiangDeng/MAX/releases/download/v0.1.0/MAX-binary-0.1.0.tar.gz
- Uncompress to folder
tar -xzvf MAX-binary-0.1.0.tar.gz
- Move to the MAX_home directory and do configuration for MAX
cd MAX-binary-0.1.0
bash configure.sh
- Add paths of lib folder and bin folder to LD_LIBRARY_PATH and PATH
export LD_LIBRARY_PATH=/path/to/XAEM-binary-0.1.0/lib:$LD_LIBRARY_PATH
export PATH=/path/to/XAEM-binary-0.1.0/bin:$PATH
#Done
- download MAX
wget https://github.com/WenjiangDeng/MAX/releases/download/v0.1.0/MAX-source-0.1.0.tar.gz
tar -xzvf MAX-source-0.1.0.tar.gz
cd MAX-source-0.1.0
#config to run MAX
bash configure.sh
- install boost_1_55_0
wget http://sourceforge.net/projects/boost/files/boost/1.58.0/boost_1_58_0.tar.gz
tar -xvzf boost_1_58_0.tar.gz
cd boost_1_58_0
sudo apt-get update
sudo apt-get install build-essential g++ python-dev autotools-dev libicu-dev build-essential libbz2-dev libboost-all-dev
sudo apt-get install aptitude
aptitude search boost
./bootstrap.sh --prefix=boost_1_58_0_build
./b2
./b2 install
#The Boost C++ Libraries were successfully built! #add the lib and folder to paths
export LD_LIBRARY_PATH=$PWD/boost_1_58_0_build/stage/lib:$LD_LIBRARY_PATH
export PATH=$PWD/boost_1_58_0_build:$PATH
- install tbb44_20160526oss
cd ..
wget https://www.threadingbuildingblocks.org/sites/default/files/software_releases/source/tbb44_20160526oss_src_0.tgz
tar xvf tbb44_20160526oss_src_0.tgz
sudo apt-get install libtbb-dev
# install cmake for ubuntu: cmake 3.5.1
sudo apt install cmake
# install curl
sudo apt install curl
# install autoconf
sudo apt-get install autoconf
# install zlib
sudo apt install zlib1g-dev
sudo apt install zlib1g
# update all installations
sudo apt-get update
# install MAX
DBOOST_ROOT=$PWD/boost_1_58_0/boost_1_58_0_build/ DTBB_INSTALL_DIR=$PWD/tbb44_20160526oss/ DCMAKE_INSTALL_PREFIX=MAX-source-0.1.0 bash install.sh
#The MAX was successfully built!
###########
#add lib and bin folders to paths
export LD_LIBRARY_PATH=$PWD/Circall_0.1.0_build/lib:$LD_LIBRARY_PATH
export PATH=$PWD/Circall_0.1.0_build/bin:$PATH
#done
This step will produce (1) the reference which contains both wild-type and mutant alleles; (2) the index for the reference and (3) the X matrix (design matrix). This step requires the following input files: a list of mutations, the GTF file, the wild-type transcriptome reference, the version of gene model ("hg19" or "hg38") and the working directory. When you have prepared these files, the command to start the analysis is:
# wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz
# wget ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/refMrna.fa.gz
bash MAX.sh -m mutation_list.txt -g hg38.refGene.gtf -r refMrna.fa -v hg38 -d /path/to/directory
The outputs will be: WT_Mut_reference.fa, X_matrix.RData and the Index_reference folder.
Suppose we already created a working directory “MAX_project” (/path/to/MAX_project/) for the quantification.
- The command to generate equivalence class table for each sample is similar with “salmon quant”. The input parameters are the Index_reference folder and the RNA-seq data. The "-l" option defines the library type of reads. Below shows the example if we want to run MAX for sample1 and sample2 with 8 cpus:
MAX -i /path/to/Index_reference -l IU -1 s1_read1.fasta -2 s1_read2.fasta -p 8 -o /path/to/MAX_project/sample1
MAX -i /path/to/Index_reference -l IU -1 s2_read1.fasta -2 s2_read2.fasta -p 8 -o /path/to/MAX_project/sample2
- If the data is compressed in gz format, we can combine "gunzip" in the command:
MAX -i /path/to/Index_reference -l IU -1 <(gunzip -c s1_read1.gz) -2 <(gunzip -c s1_read2.gz) -p 8 -o /path/to/MAX_project/sample1
MAX -i /path/to/Index_reference -l IU -1 <(gunzip -c s2_read1.gz) -2 <(gunzip -c s2_read2.gz) -p 8 -o /path/to/MAX_project/sample2
- After running MAX there will be the output of the equivalence class table for multiple samples. We then create the Y count matrix. For example, if we want to run MAX parallelly using 8 cores, the command is:
Rscript Create_count_matrix.R design.matrix=/path/to/X_matrix.RData workdir=/path/to/MAX_project core=8
When the Y count matrix is constructed, we can use the AEM algorithm to quantify the mutant-allele expression. The command is as follows:
Rscript AEM_update_X_beta.R workdir=/path/to/MAX_project design.matrix=/path/to/X_matrix.RData max.out=/path/to/mutant_expression.RData remove.ycount=TRUE core=8
- workdir: the path to working directory
- design.matrix: the path to the design matrix
- max.out: the output file for isoform expression
- remove.ycount (default=TRUE): to clean all data of Ycount after use
- core: the number of cpu cores for parallel computing, default is 8.
The final results are in the mutant_expression.RData, which contains two objects: the MAX_count for the read counts value and MAX_tpm for the TPM (Transcripts Per Kilobase Million) value.
This section shows a complete run for MAX pipeline. We can test MAX just by copy and paste of the commands. Here we focus on the mutations in the FLT3 gene, which is one of the most frequently mutated oncogenes in Acute Myeloid Leukemia (AML).
- Download the binary file of MAX and configure the path
# Create a working folder
mkdir MAX_binary
cd MAX_binary
# Download the binary version of MAX
wget https://github.com/WenjiangDeng/MAX/releases/download/v0.1.0/MAX-binary-0.1.0.tar.gz
# Configure the tool
tar -xzvf MAX-binary-0.1.0.tar.gz
cd MAX-binary-0.1.0
bash configure.sh
# Add the paths to system
export LD_LIBRARY_PATH=$PWD/lib:$LD_LIBRARY_PATH
export PATH=$PWD/bin:$PATH
cd ..
- Download the mutation file, GTF annotation and the sequences of isoforms from FLT3 gene
mkdir MAX_project
cd MAX_project
wget https://github.com/WenjiangDeng/MAX/releases/download/v0.1.0/mutation_list.txt
wget https://github.com/WenjiangDeng/MAX/releases/download/v0.1.0/test_FLT3.gtf
wget https://github.com/WenjiangDeng/MAX/releases/download/v0.1.0/isoform_ref_FLT3_gene.fa
- construct the wild-type+Mutant reference, reference index and the X matrix
# This step can take several minutes
bash ../MAX-binary-0.1.0/MAX.sh -m mutation_list.txt -g test_FLT3.gtf -r isoform_ref_FLT3_gene.fa -v hg19 -d $PWD
- Download the test RNA-seq data of 10 samples
wget https://github.com/WenjiangDeng/MAX/releases/download/v0.1.0/RNA-seq_FLT3.tar.gz
tar -xzvf RNA-seq_FLT3.tar.gz
- Generate the eqclass table and Y count matrix using 8 cores
for ind in $(seq -f %02.0f 10); do
MAX -i Index_reference -l IU -1 'fasta_flt3/sample_'$ind'_1.fasta' -2 'fasta_flt3/sample_'$ind'_2.fasta' -p 8 -o 'sample_'$ind
done
Rscript ../MAX-binary-0.1.0/R/Create_count_matrix.R workdir=$PWD design.matrix=X_matrix.RData core=8
- Estimate mutant-allele expression using AEM algorithm with 8 cores
Rscript ../MAX-binary-0.1.0/R/AEM_update_X_beta.R workdir=$PWD design.matrix=X_matrix.RData max.out=mutant_expression.RData core=8
The final results are in the mutant_expression.RData, which contains the MAX_count and MAX_tpm objects.
MAX2 is an extension of MAX for heterogenous RNA-seq data analysis. In MAX2, the construction of X matrix and the quasi-mapping steps are the same as in MAX. The only difference is that before isoform quantification, MAX2 will cluster heterogenous samples based on their mutation profile.
- Merge mutation to generate Mutated_eqClass.txt.
# SampleMut is a two-column mutation file. Each row shows a sample harboring an unique mutation, please see mutDat_NPM1.txt as an example
# sampleID is the name of input sample
# SampleEq is the eqClass.txt file from previous quasi-mapping step.
Rscript mergeMutSingleSample.R sampleMut=$sampleMutFn sampleID=$sampleFn sampleEq=eqClass.txt
- Generate the ycount and quantify using the AEM algorithm
Rscript genCountSample.R xmatEq=$xmatEqFn sampleMut=$sampleMutFn sampleID=$sampleFn sampleEq=$outdir/Mutated_eqClass.txt YcountDir=$Ycount_outdir
Rscript estimateBeta.R workdir=$Ycount_outdir sampleMut=$sampleMutFn out=MAX_isoform_expression_AEM.RData
The estimation results will be saved as isoformCount in MAX2_isoform_expression.RData.