High-throughput RNA isoform sequencing using programmable cDNA concatenation

Abstract

Alternative splicing is a core biological process that enables profound and essential diversification of gene function. Short-read RNA sequencing approaches fail to resolve RNA isoforms and therefore primarily enable gene expression measurements, an isoform-unaware representation of the transcriptome. Conversely, full-length RNA sequencing using long-read technologies is able to capture complete transcript isoforms, but its utility is deeply constrained by throughput limitations. Here, we introduce MAS-ISO-seq, a technique for programmably concatenating cDNAs into single molecules optimal for long-read sequencing, boosting the throughput >15-fold to nearly 40 million cDNA reads per run on the Sequel IIe sequencer. We validated unambiguous isoform assignment with MAS-ISO-seq using a synthetic RNA isoform library and applied this approach to single-cell RNA sequencing of tumor-infiltrating T cells. Results demonstrated a >30-fold boosted discovery of differentially spliced genes and robust cell clustering, as well as canonical PTPRC splicing patterns across T cell subpopulations and the concerted expression of the associated hnRNPLL splicing factor. Methods such as MAS-ISO-seq will drive discovery of novel isoforms and the transition from gene expression to transcript isoform expression analyses.

Authors

Aziz M. Al’Khafaji^1*†, Jonathan T. Smith^1*, Kiran V. Garimella^1*†, Mehrtash Babadi^1*†, Victoria Popic^1*, Moshe Sade-Feldman^1,2, Michael Gatzen^1, Siranush Sarkizova^1, Marc A. Schwartz^1,3,4, Emily M. Blaum^1,2, Allyson Day^1, Maura Costello^1, Tera Bowers^1, Stacey Gabriel^1, Eric Banks^1, Anthony A. Philippakis^1, Genevieve M. Boland^5, Paul C. Blainey^1,6,8,†, Nir Hacohen^1,7,10,11,†

1. Broad Institute of Harvard and MIT, Cambridge, MA, USA
2. Department of Medicine, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
3. Department of Pediatrics, Harvard Medical School, Boston, Massachusetts, USA
4. Division of Hematology/Oncology, Boston Children's Hospital, Boston, Massachusetts, USA
5. Department of Pediatric Oncology, Dana Farber Cancer Institute, Boston, Massachusetts, USA
6. Division of Surgical Oncology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
7. Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
8. Center for Cancer Research, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
9. Koch Institute for Integrative Cancer Research at Massachusetts Institute of Technology, Cambridge, MA, USA
10. Harvard Medical School, Boston, MA, USA
11. Center for Immunology and Inflammatory Diseases, Massachusetts General Hospital, Charlestown, MA, USA

* - These authors contributed equally
† - Corresponding authors

Data

All data from this study are available online (or are in the process of being uploaded).

There were two datasets from this study:

Dataset	Number of Samples	Location
Human peripheral blood mononuclear cells (PBMC) / tumor-infiltrating CD8+ T cells	2	Release in progress...
Spike-in RNA Variant Control Mix data (SIRVs set 4, Lexogen)	2	Terra*, FTP: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/MasSeqNatBiotech2021

* - The SIRV samples were prepared with two library preparation techniques: a length-10 MAS-ISO-seq array and a length-15 MAS-ISO-seq array. They were multiplexed into a single pooled sample and sequenced in a single run on a PacBio Sequel IIe. Our software package, Longbow, was then used to demultiplex the pooled SIRV sample into two outputs: one for the length-15 array and one for the length-10 array. These demultiplexed files are currently available in the Terra workspace.
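
As a convenience, the FTP location listed in the table above could be browsed programmatically. The following is a minimal sketch using only the Python standard library. It assumes that the user name embedded in the URL (`gsapubftp-anonymous`) accepts an empty password; this is an assumption, not something stated here. The function names are illustrative and are not part of this repository.

```python
from ftplib import FTP
from urllib.parse import urlparse

# FTP location of the SIRV data, copied from the table above.
SIRV_FTP_URL = "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/MasSeqNatBiotech2021"


def parse_ftp_url(url):
    """Split an FTP URL into (host, user, remote directory)."""
    parts = urlparse(url)
    return parts.hostname, parts.username or "anonymous", parts.path or "/"


def list_remote_files(url=SIRV_FTP_URL, timeout=60):
    """Return the entry names in the remote directory (needs network access)."""
    host, user, directory = parse_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        # Assumption: the anonymous-style account takes an empty password.
        ftp.login(user=user, passwd="")
        ftp.cwd(directory)
        return ftp.nlst()
```

Calling `list_remote_files()` should return the file names in the directory, which can then be fetched with `FTP.retrbinary` or any standard FTP client.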

Code

Analysis Pipelines / Workflows

The pipelines used to analyze the raw sequencing data for this paper are automated using Cromwell and the Workflow Description Language (WDL). The full analysis pipeline used for this paper can be found in the Long Read Methods and Applications (LRMA) pipeline repository.
Each main WDL file strings many sub-tasks together into a workflow that performs a large analysis. The sub-tasks are defined in separate WDL files that are imported by the main WDL. Individual tasks run commands inside Docker images.

The analysis workflows used in this paper can be found in the workflows directory.

The following are quick links to the WDL files used for specific parts of the analysis:

Analysis	Pipeline
SIRV analysis	MasIsoSeqSirvAnalysis.wdl
Main MAS-ISO-seq analysis	PB10xMasSeqArraySingleFlowcellv5.wdl
Downsampled CCS read analysis	PB10xMasSeqArraySingleFlowcellv5_ArrayElementAnalysis_NoSharding_CCS_only.wdl
Downsampled CCS+CLR read analysis	PB10xMasSeqArraySingleFlowcellv5_ArrayElementAnalysisV2.wdl

NOTE: The downsampled WDL files are a subset of the Main MAS-ISO-seq analysis, so the inputs for those WDLs are not included here.
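
For readers who want to run one of these workflows locally rather than on Terra, the sketch below assembles a Cromwell command line in Python. It relies on Cromwell's `run` mode with its `--inputs` and `--imports` options. The jar path, the inputs file name, and the imports archive name are placeholders; the actual input keys each WDL expects are not listed here and must be taken from the WDL itself.

```python
import shlex


def build_cromwell_command(wdl, inputs_json, cromwell_jar="cromwell.jar", imports_zip=None):
    """Build the argument list for a local `cromwell run` invocation.

    `imports_zip` is an optional zip archive containing the sub-task WDL
    files that the main workflow imports.
    """
    cmd = ["java", "-jar", str(cromwell_jar), "run", str(wdl), "--inputs", str(inputs_json)]
    if imports_zip is not None:
        cmd += ["--imports", str(imports_zip)]
    return cmd


if __name__ == "__main__":
    # Placeholder file names: replace with real local paths before running.
    command = build_cromwell_command(
        "MasIsoSeqSirvAnalysis.wdl", "inputs.json", imports_zip="tasks.zip"
    )
    print(shlex.join(command))
```

The returned list can be passed directly to `subprocess.run`. Because the workflows import sub-task WDL files, a local run will most likely need those files bundled and supplied through the imports archive.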

Docker Images

Custom docker images used in this paper are all defined in the LRMA long-read-pipelines repository in the docker folder.

Currently, all of these files live in a branch of the LRMA long-read-pipelines repository.

Longbow

Longbow is the profile HMM tool used to segment the MAS-seq array reads. It is open-source and maintained by the Data Sciences Platform Long Read Methods and Applications (LRMA) group. The source code can be found here: https://github.com/broadinstitute/longbow.

For the preprint, we used version 0.2.2-Paper. This version is considered obsolete and should not be used.

For the final submission, we used version 0.6.3 for all operations except UMI correction. For UMI correction, we used version 0.6.6.
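
To illustrate what segmenting an array read means, here is a deliberately simplified toy sketch. It is not Longbow's algorithm: Longbow uses a profile HMM, which can tolerate sequencing errors, whereas this sketch uses exact string matching. The adapter names and sequences are hypothetical placeholders, not the real MAS-ISO-seq adapters.

```python
def segment_array_read(read, adapters):
    """Split a concatenated read at adapters that appear in a known order.

    `adapters` maps adapter names to sequences, in the order they are
    expected along the read. Returns a list of
    (left_adapter, right_adapter, insert_sequence) tuples and stops at the
    first adapter that cannot be found (exact matching only).
    """
    names = list(adapters)
    segments = []
    pos = read.find(adapters[names[0]])
    if pos < 0:
        return segments
    for left, right in zip(names, names[1:]):
        start = pos + len(adapters[left])           # first base after the left adapter
        next_pos = read.find(adapters[right], start)
        if next_pos < 0:
            break                                   # truncated or error-containing read
        segments.append((left, right, read[start:next_pos]))
        pos = next_pos
    return segments


# Hypothetical adapters and a toy read holding two concatenated inserts.
toy_adapters = {"A": "AAAACCCC", "B": "GGGGTTTT", "C": "ACGTACGT"}
toy_read = "AAAACCCC" + "TTGACA" + "GGGGTTTT" + "CATCAT" + "ACGTACGT"
toy_segments = segment_array_read(toy_read, toy_adapters)
```

Here `toy_segments` holds two inserts, `TTGACA` between adapters A and B and `CATCAT` between B and C. Real reads contain sequencing errors, which is why a probabilistic model is used for the actual segmentation.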

Single-cell analysis

The Jupyter notebooks used for the single-cell analysis and downsampling analysis in this paper can be found in this repository in the scripts directory.

Additional Analysis Scripts

Additional scripts and Jupyter notebooks used to perform analysis and figure creation for the paper are located in the scripts directory.

Terra Workspace Example

A Terra workspace with an example of how to process MAS-ISO-seq data can be found here:

MAS-seq - Data Segmentation and Alignment

This workspace is an example of how to segment and align MAS-ISO-seq data.

The data in this workspace are the same Spike-in RNA Variant Control Mix (SIRVs set 4, Lexogen) samples that we used as controls in the paper.

Pre-print of the Paper

A preprint of the paper can be found on bioRxiv here: High-throughput RNA isoform sequencing using programmable cDNA concatenation.

Other Long-Read Pipelines / Analyses

Additional long-read analyses and pipelines can be found at the LRMA group pipelines repository. These pipelines are not necessarily directly related to this work, but many share common components and sub-workflows / tasks.
