This repository contains the scripts used in the manuscript entitled 'Tracking pre-mRNA maturation across subcellular compartments identifies developmental gene regulation through intron retention and nuclear anchoring'. They are provided so that the results in the manuscript can be reproduced and inspected.
The scripts are organized in the order of the analysis.
Please download the datasets from https://doi.org/10.5281/zenodo.4540589 and run the analysis.
The code is provided without warranty. Parts of the analysis code were rewritten for readability, so they may produce errors when run. If you encounter problems or have questions, please report them to the author (zcpan1016@gmail.com).
- Download from Zenodo: https://doi.org/10.5281/zenodo.4540589
cd data
curl -L -o Intron.feature.annotation https://zenodo.org/record/4540589/files/Intron.feature.annotation?download=1
curl -L -o intron_FI_combined_results.txt https://zenodo.org/record/4540589/files/intron_FI_combined_results.txt?download=1
curl -L -o Intron_transcript.txt https://zenodo.org/record/4540589/files/Intron_transcript.txt?download=1
curl -L -o mm10.fa.gz https://zenodo.org/record/4540589/files/mm10.fa.gz?download=1
curl -L -o Mus_musculus.GRCm38.91.chr.gtf.gz https://zenodo.org/record/4540589/files/Mus_musculus.GRCm38.91.chr.gtf.gz?download=1
- Download other data:
curl -L -O ftp://ftp.ensembl.org/pub/release-91/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz
- Unzip
gunzip -c ./mm10.fa.gz > mm10.fa
gunzip -c ./Mus_musculus.GRCm38.91.chr.gtf.gz > ./Mus_musculus.GRCm38.91.chr.gtf
gunzip -c ./Mus_musculus.GRCm38.cdna.all.fa.gz > ./Mus_musculus.GRCm38.cdna.all.fa
cd ..
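Before moving on, it can help to confirm that every file named in the download and unzip commands above is present in data/ and non-empty. The following is a minimal sketch and not part of the original scripts; the function name `find_missing` is hypothetical, and the file list is copied from the commands above. It uses only the standard library and assumes it is run from the repository root.

```python
import os

# File names taken from the curl and gunzip commands above.
EXPECTED_FILES = [
    "Intron.feature.annotation",
    "intron_FI_combined_results.txt",
    "Intron_transcript.txt",
    "mm10.fa",
    "Mus_musculus.GRCm38.91.chr.gtf",
    "Mus_musculus.GRCm38.cdna.all.fa",
]


def find_missing(data_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent or empty in data_dir."""
    missing = []
    for name in expected:
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumption: the current directory is the repository root.
    problems = find_missing("data")
    if problems:
        print("Missing or empty: " + ", ".join(problems))
    else:
        print("All expected data files are present.")
```

An empty file is reported as missing because an interrupted download or a failed gunzip redirect can leave a zero-byte file behind.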
conda create --prefix ./conda_env_py2
conda activate ./conda_env_py2
conda install -c conda-forge -c bioconda python=2 r-base=4 sra-tools star kallisto samtools=1.11 pysam=0.16.0 numpy=1.15.4 tqdm keras scikit-learn h5py r-rtsne r-ggplot2 r-ggthemes r-scales r-ggpubr
pip install Pillow
pip install pyclustering
pip install -U matplotlib
R
> repos <- "http://cran.us.r-project.org"
> install.packages("circlize", repos=repos)
> quit()
conda deactivate
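After the environment is built and activated, a quick check that the command-line tools are reachable can save a failed run later. The sketch below is an illustration only, not part of the original scripts; `find_on_path` and `missing_tools` are hypothetical names, and the tool list is an assumption based on the packages installed above (prefetch and fastq-dump come from sra-tools, Rscript from r-base).

```python
import os

# Executables expected from the conda packages installed above (assumed list).
REQUIRED_TOOLS = ["prefetch", "fastq-dump", "STAR", "kallisto", "samtools", "Rscript"]


def find_on_path(tool, path=None):
    """Return the full path of an executable found on the search path, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, tool)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(tools=REQUIRED_TOOLS, path=None):
    """Return the tools from the list that cannot be found on the search path."""
    return [tool for tool in tools if find_on_path(tool, path) is None]


if __name__ == "__main__":
    absent = missing_tools()
    if absent:
        print("Not found on PATH: " + ", ".join(absent))
    else:
        print("All required tools were found.")
```

The search is written by hand rather than with `shutil.which` so that the same file should also run under the Python 2 interpreter pinned in the environment.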
Output: an example of the expected output files:
./results/mESC/Chr/Aligned.sortedByCoord.out.bam (RNA-seq alignments)
./results/mESC/Chr/kallisto/abundance.tsv (Gene expression)
./results/mESC/Chr/siri_out (Quantification of intron retention levels)
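To inspect the expression output, the kallisto abundance.tsv file can be read directly. kallisto writes it as a tab-separated table with the columns target_id, length, eff_length, est_counts and tpm. The sketch below is not part of the original scripts and `read_tpm` is a hypothetical helper name; it looks the columns up by header name rather than by position.

```python
def read_tpm(abundance_path):
    """Map each target_id in a kallisto abundance.tsv file to its TPM value."""
    tpm = {}
    with open(abundance_path) as handle:
        header = handle.readline().rstrip("\r\n").split("\t")
        id_col = header.index("target_id")
        tpm_col = header.index("tpm")
        for line in handle:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) <= max(id_col, tpm_col):
                continue  # skip blank or truncated lines
            tpm[fields[id_col]] = float(fields[tpm_col])
    return tpm
```

For example, `read_tpm("results/mESC/Chr/kallisto/abundance.tsv")` would return a dictionary keyed by the identifiers in the index, which are transcript identifiers when the index is built from a cDNA FASTA file.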
Download fastq files
conda activate ./conda_env_py2
cd quantification_intronRetention_1
prefetch SRR12883492
fastq-dump --split-files ./SRR12883492/SRR12883492.sra
Create indices
STAR --runMode genomeGenerate --runThreadN 4 --genomeDir ./star_index --genomeFastaFiles ../data/mm10.fa --sjdbGTFfile ../data/Mus_musculus.GRCm38.91.chr.gtf
kallisto index ../data/Mus_musculus.GRCm38.cdna.all.fa --i Mus_musculus.GRCm38.cdna.all.fa.idx
Setup for run
mkdir -p results/mESC/Chr
mkdir -p data/mESC/fastq
mv ./SRR12883492_1.fastq data/mESC/fastq/Chr_1.fastq
mv ./SRR12883492_2.fastq data/mESC/fastq/Chr_2.fastq
- edit PARENT_DIR variable at top of RNAseq_Analysis_Pipeline.py
chmod +x SIRI/bin/SIRI
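Since the pipeline expects paired-end reads, a cheap sanity check before the long run is to confirm that the two mate files hold the same number of records. This is a minimal sketch, not part of the original scripts: the function names are hypothetical, and it assumes uncompressed FASTQ with exactly four lines per record.

```python
def count_fastq_records(path):
    """Count records in an uncompressed FASTQ file (assumes 4 lines per record)."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError("%s has %d lines, which is not a multiple of 4" % (path, n_lines))
    return n_lines // 4


def mates_match(fastq_1, fastq_2):
    """Return True when both mate files contain the same number of records."""
    return count_fastq_records(fastq_1) == count_fastq_records(fastq_2)
```

With the layout created above, the call would be `mates_match("data/mESC/fastq/Chr_1.fastq", "data/mESC/fastq/Chr_2.fastq")`. A mismatch or a `ValueError` usually points to a truncated download or an interrupted fastq-dump.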
Run
- 8 threads 64 GB:
python RNAseq_Analysis_Pipeline.py
conda deactivate
cd ..
Output: cluster.png should look like what is shown in the .ipynb
conda activate ./conda_env_py2
cd xmeans_cluster_U_intron_2
python run_scripts.py
conda deactivate
cd ..
Output: PTC_figure.png should look like what is shown in the .ipynb
conda activate ./conda_env_py2
cd ptc_analysis_3
python run_scripts.py
conda deactivate
cd ..
Output: ./results/performance.png should look like what is shown in the .ipynb
conda activate ./conda_env_py2
cd deep_learning_4
- 8 threads 32 GB:
python run_scripts.py
conda deactivate
cd ..
Output: tsne_plot.pdf should look like the one uploaded to the repo
conda activate ./conda_env_py2
cd tsne_analysis_5
Rscript tsne_plot.R
conda deactivate
cd ..
Output: circos.pdf should look like the one uploaded to the repo
conda activate ./conda_env_py2
cd circos_plot_6
Rscript circos_plot.R
conda deactivate
cd ..
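Steps 2 to 6 all follow the same pattern: enter a directory, run one script, return. For convenience they could be chained with a small driver such as the sketch below. It is not part of the original scripts; `run_steps` is a hypothetical name, the directory and script names are copied from the commands above, and it assumes the conda environment is already activated and the driver is started from the repository root. Step 1 is left out because it needs the FASTQ setup described earlier.

```python
import subprocess

# Directory and command for each step, copied from the commands above.
STEPS = [
    ("xmeans_cluster_U_intron_2", ["python", "run_scripts.py"]),
    ("ptc_analysis_3", ["python", "run_scripts.py"]),
    ("deep_learning_4", ["python", "run_scripts.py"]),
    ("tsne_analysis_5", ["Rscript", "tsne_plot.R"]),
    ("circos_plot_6", ["Rscript", "circos_plot.R"]),
]


def run_steps(steps=STEPS, runner=subprocess.check_call):
    """Run each step in order from its own directory.

    subprocess.check_call raises CalledProcessError on a non-zero exit
    status, so the loop stops at the first failing step.
    """
    for directory, command in steps:
        print("Running %s in %s" % (" ".join(command), directory))
        runner(command, cwd=directory)
```

Calling `run_steps()` starts the whole sequence. The `runner` argument exists only so the order can be checked without executing anything, for example by passing a function that records its arguments.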