broadinstitute/mas-seq-paper-data

High-throughput RNA isoform sequencing using programmable cDNA concatenation

Abstract

Alternative splicing is a core biological process that enables profound and essential diversification of gene function. Short-read RNA sequencing approaches fail to resolve RNA isoforms and therefore primarily enable gene expression measurements - an isoform-unaware representation of the transcriptome. Conversely, full-length RNA sequencing using long-read technologies is able to capture complete transcript isoforms, but its utility is deeply constrained due to throughput limitations. Here, we introduce MAS-ISO-seq, a technique for programmably concatenating cDNAs into single molecules optimal for long-read sequencing, boosting the throughput >15-fold to nearly 40 million cDNA reads per run on the Sequel IIe sequencer. We validated unambiguous isoform assignment with MAS-ISO-seq using a synthetic RNA isoform library and applied this approach to single-cell RNA sequencing of tumor-infiltrating T cells. Results demonstrated a >30-fold boosted discovery of differentially spliced genes and robust cell clustering, as well as canonical PTPRC splicing patterns across T cell subpopulations and the concerted expression of the associated hnRNPLL splicing factor. Methods such as MAS-ISO-seq will drive discovery of novel isoforms and the transition from gene expression to transcript isoform expression analyses.

Authors

Aziz M. Al’Khafaji1*†, Jonathan T. Smith1*, Kiran V Garimella1*†, Mehrtash Babadi1*†, Victoria Popic1*, Moshe Sade-Feldman1,2, Michael Gatzen1, Siranush Sarkizova1, Marc A. Schwartz1,3,4, Emily M. Blaum1,2, Allyson Day1, Maura Costello1, Tera Bowers1, Stacey Gabriel1, Eric Banks1, Anthony A. Philippakis1, Genevieve M. Boland5, Paul C. Blainey1,6,8,†, Nir Hacohen1,7,10,11,†

  1. Broad Institute of Harvard and MIT, Cambridge, MA, USA
  2. Department of Medicine, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
  3. Department of Pediatrics, Harvard Medical School, Boston, Massachusetts, USA.
  4. Division of Hematology/Oncology, Boston Children's Hospital, Boston, Massachusetts, USA.
  5. Department of Pediatric Oncology, Dana Farber Cancer Institute, Boston, Massachusetts, USA.
  6. Division of Surgical Oncology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
  7. Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
  8. Center for Cancer Research, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
  9. Koch Institute for Integrative Cancer Research at Massachusetts Institute of Technology, Cambridge, MA, USA
  10. Harvard Medical School, Boston, MA, USA
  11. Center for Immunology and Inflammatory Diseases, Massachusetts General Hospital, Charlestown, MA, USA

* - These authors contributed equally
† - Corresponding authors

Data

  • All data from this study are available online (or are in the process of being uploaded).

There were two datasets from this study:

| Dataset | Number of Samples | Location |
| --- | --- | --- |
| Human peripheral blood mononuclear cells (PBMC) / tumor-infiltrating CD8+ T cells | 2 | Release in progress... |
| Spike-in RNA Variant Control Mix data (SIRVs set 4, Lexogen) | 2 | Terra*, FTP: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/MasSeqNatBiotech2021 |

* - The SIRV samples were prepared with two library preparation techniques: a length 10 MAS-ISO-seq array and a length 15 MAS-ISO-seq array. They were multiplexed into a single pooled sample and sequenced in a single run on a PacBio Sequel IIe. Our software package, Longbow, was then used to demultiplex the single SIRV multiplexed sample into two outputs - one for the length 15 array and one for the length 10 array. These demultiplexed files are what is currently available in the Terra workspace.
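
The FTP location listed for the SIRV data can also be browsed programmatically. The following is a minimal Python sketch, not an official download script. It assumes the server accepts the user name embedded in the URL together with an empty password, which is not confirmed here. The helper names `parse_ftp_location` and `list_remote_files` are illustrative and are not part of this repository.

```python
from ftplib import FTP
from urllib.parse import urlparse

# FTP location copied from the dataset listing in this README.
SIRV_FTP_URL = "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/MasSeqNatBiotech2021"


def parse_ftp_location(url):
    """Split an FTP URL into (host, user, remote_dir)."""
    parts = urlparse(url)
    if parts.scheme != "ftp":
        raise ValueError(f"not an FTP URL: {url}")
    # Fall back to the conventional anonymous user if the URL names none.
    return parts.hostname, parts.username or "anonymous", parts.path or "/"


def list_remote_files(url, password="", timeout=60):
    """Return the entry names in the remote directory (needs network access).

    The empty default password is an assumption, not a documented credential.
    """
    host, user, remote_dir = parse_ftp_location(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login(user=user, passwd=password)
        ftp.cwd(remote_dir)
        return ftp.nlst()
```

Calling `list_remote_files(SIRV_FTP_URL)` contacts the server, so it is kept out of the module body; `parse_ftp_location` can be checked offline.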

Code

Analysis Pipelines / Workflows

The pipelines to analyze the raw sequencing data for this paper are automated using Cromwell and the Workflow Description Language (WDL). The full pipeline for the analysis used for this paper can be found in the Long Read Methods and Applications (LRMA) pipeline repository.
These WDL files contain many subtasks strung together into a workflow that performs a large analysis. The subtasks are all defined in separate WDL files that are imported into the main WDL. Individual tasks run commands inside Docker images.
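
Cromwell reads workflow inputs from a JSON file whose keys are qualified with the workflow name. The Python sketch below builds such a file and the matching local `cromwell run` command without executing it. It assumes the workflow declared in the WDL is named after its file. The input names (`gcs_input_dir`, `sample_name`), the bucket path, and the jar path are hypothetical placeholders; the real input names are defined in the WDL files themselves.

```python
import json
import tempfile
from pathlib import Path


def qualify_inputs(workflow_name, inputs):
    """Prefix each input name with the workflow name for Cromwell's inputs JSON."""
    return {f"{workflow_name}.{key}": value for key, value in inputs.items()}


def cromwell_run_command(cromwell_jar, wdl_path, inputs_json):
    """Build (but do not execute) a local `cromwell run` invocation."""
    return [
        "java", "-jar", str(cromwell_jar),
        "run", str(wdl_path),
        "--inputs", str(inputs_json),
    ]


# Assumption: the workflow inside the WDL shares the file's base name.
workflow = "MasIsoSeqSirvAnalysis"

# Hypothetical input names and values, for illustration only.
example_inputs = {
    "gcs_input_dir": "gs://my-bucket/sirv-run",
    "sample_name": "sirv_pool",
}

# Write the qualified inputs to a temporary JSON file.
inputs_path = Path(tempfile.mkdtemp()) / "inputs.json"
inputs_path.write_text(json.dumps(qualify_inputs(workflow, example_inputs), indent=2))

command = cromwell_run_command("cromwell.jar", f"{workflow}.wdl", inputs_path)
print(" ".join(command))
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` once the placeholders are replaced with real paths and inputs.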

The analysis workflows used in this paper can be found in the workflows directory.

The following are quick links to the WDL files used for specific parts of the analysis:

| Analysis | Pipeline |
| --- | --- |
| SIRV analysis | MasIsoSeqSirvAnalysis.wdl |
| Main MAS-ISO-seq analysis | PB10xMasSeqArraySingleFlowcellv5.wdl |
| Downsampled CCS read analysis | PB10xMasSeqArraySingleFlowcellv5_ArrayElementAnalysis_NoSharding_CCS_only.wdl |
| Downsampled CCS+CLR read analysis | PB10xMasSeqArraySingleFlowcellv5_ArrayElementAnalysisV2.wdl |

NOTE: The downsampled WDL files are a subset of the Main MAS-ISO-seq analysis, so the inputs for those WDLs are not included here.

Docker Images

Custom docker images used in this paper are all defined in the LRMA long-read-pipelines repository in the docker folder.

Currently, all of these files live in a branch of the LRMA long-read-pipelines repository.

Longbow

Longbow is the profile HMM tool used to segment the MAS-seq array reads. It is open-source and maintained by the Data Sciences Platform Long Read Methods and Applications (LRMA) group. The source code can be found here: https://github.com/broadinstitute/longbow.

For the preprint, we used version 0.2.2-Paper. This version is considered obsolete and should not be used.

For the final submission, we used version 0.6.3 for all operations except UMI correction, for which we used version 0.6.6.

Single-cell analysis

The Jupyter notebooks used for the single-cell analysis and downsampling analysis in this paper can be found in this repository in the scripts directory.

Additional Analysis Scripts

Additional scripts and Jupyter notebooks used to perform analysis and figure creation for the paper are located in the scripts directory.

Terra Workspace Example

A Terra workspace with an example of how to process MAS-ISO-seq data can be found here:

This workspace is an example of how to segment and align MAS-ISO-seq data.

The data in this workspace are the same Spike-in RNA Variant Control Mix (SIRVs set 4, Lexogen) samples that we used as controls in the paper.

Pre-print of the Paper

A preprint of the paper can be found on bioRxiv here: High-throughput RNA isoform sequencing using programmable cDNA concatenation.

Other Long-Read Pipelines / Analyses

Additional long-read analyses and pipelines can be found at the LRMA group pipelines repository. These pipelines are not necessarily directly related to this work, but many share common components and sub-workflows / tasks.

About

Data and additional information from the initial MAS-ISO-seq study, "High-throughput RNA isoform sequencing using programmable cDNA concatenation"
