Daniel Bunis1†, Rebecca Jaszczak1‡ and Ravi Patel1§
1UCSF DSCoLab
†daniel.bunis@ucsf.edu
‡rebecca.jaszczak@ucsf.edu
§ravi.patel2@ucsf.edu
This pipeline was developed collaboratively by a small team of wet lab “bench” scientists and data scientists of the CoLabs and associated labs at UCSF. We aimed to develop a pipeline for CyTOF analysis that scaled well to datasets with large numbers of events and produced consistent clusters when down-sampling. We also wanted the pipeline to be easily implemented by “bench” scientists analyzing their own CyTOF data, bypassing the need for programming expertise. Cyclone is a stand-alone tool for the analysis of full single-cell high-dimensional CyTOF data. It can integrate with the SCAFFoLD workflow, and perform custom statistics and visualizations.
To get started with this pipeline, you should have:
- A folder containing FCS files from CyTOF data
- File metadata (described below)
- Marker metadata (described below)
- A CSV of grid sizes (CyTOF_pipeline/grid_sizes.csv)
The pipeline:
- Chunks the work of the steps below across 8 ‘checkpoints’
- Calculates UMAP embeddings
- Performs clustering with FlowSOM (first, over an optimization space to allow users to pick their ideal parameters)
- Then, with user-chosen parameters, calculates and outputs various metrics (detailed in the Checkpoints sections below)
The pipeline assumes you have already performed any of the below pre-processing steps that may be necessary for your data:
- population subsetting
- debarcoding
- bead normalization
- batch correction
- compensation
- special transformations (beyond the arcsinh transformation which the pipeline performs; see the sketch after this list)
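As a point of reference, the arcsinh transformation conventionally applied to CyTOF data looks like the R below. The cofactor of 5 is the common CyTOF convention and an assumption here, not a value confirmed from the pipeline’s code, so check the config template for the actual setting.

# Conventional CyTOF arcsinh transformation (for reference only).
# NOTE: cofactor = 5 is the common CyTOF default and is assumed here;
# the pipeline's actual cofactor may differ.
raw_counts <- c(0, 1, 10, 100, 1000)  # example ion counts
cofactor <- 5
transformed <- asinh(raw_counts / cofactor)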
Optionally, if your data has been batch corrected by a method that does not output corrected FCS files and only outputs a corrected matrix, it is still possible to use the pipeline. Feel free to contact us if that is your use case.
To run the pipeline, you first need to:
- Clone the repo
- Install all required packages
- Choose or create an intended output directory
- Create a set of metadata files: file_metadata.csv and marker_metadata.csv
- Finalize your config.yml file
- Submit your pipeline call as an Rscript
Simply installing the cyclone R package, the pipeline’s package counterpart, will take care of most required installations. You can install it from GitHub with:
if (!requireNamespace("remotes")) {
install.packages("remotes")
}
remotes::install_github("ravipatel4/CyTOF_pipeline", subdir = "cyclone")
Some packages are not absolutely required, so we leave it to the user whether to install them. If you want them, you will need to install them yourself:
- Scaffold
remotes::install_github("nolanlab/scaffold")
Output directory
This is the directory where all your checkpoints, as well as any plots that result from running the pipeline, will be saved.
file_metadata.csv
Required columns:
- file_name: the name of the FCS files
- donor_id: specifies the sample origin
- pool_id: use any ID mechanism to indicate samples from the same CyTOF pool
- control_sample: ‘control’ means the batch control used in pools (NOT an experimental control)

In the event that your data doesn’t have pools or different donors, you can simply populate these columns with the same value, e.g. 1.
How it’s used: To direct the pipeline to the location of the FCS files, and associate some metadata with them (donor, pool).
How to create: You can make this in R, Python, or Excel.
Recommended format: CSV (Comma delimited)
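For example, a minimal file_metadata.csv could be written from R as sketched below. The file names, IDs, and output path are illustrative placeholders; only the four column names come from the requirements above.

# Sketch: write a minimal file_metadata.csv from R.
# File names, IDs, and the path are placeholders for illustration.
file_metadata <- data.frame(
  file_name      = c("pool1_donorA.fcs", "pool1_donorB.fcs", "pool1_control.fcs"),
  donor_id       = c("donorA", "donorB", "batch_control"),
  pool_id        = c(1, 1, 1),
  control_sample = c("", "", "control")  # 'control' marks the batch control
)
write.csv(file_metadata, "file_metadata.csv", row.names = FALSE)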
marker_metadata.csv
Required columns:
- channel_name: from the FCS files
- marker_name: from the FCS files
- used_for_UMAP: Boolean, whether the marker should be used in UMAP calculations
- used_for_clustering: Boolean, whether the marker should be used in clustering calculations
- used_for_scaffold: Boolean, whether the marker should be used in scaffold analysis
In most cases, used_for_UMAP, used_for_clustering, and used_for_scaffold can be the same, but there may be cases where you might, for example, use a marker for clustering but not for UMAP calculation.
How to create: Use the CyTOF_pipeline/make_marker_metadata_csv.R script.
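The provided script is the recommended route. If you prefer to seed the CSV by hand, a sketch like the following can pull channel and marker names from one of your FCS files; it assumes the flowCore Bioconductor package (not among the pipeline’s stated requirements) and a placeholder file path.

# Sketch: seed marker_metadata.csv from an FCS file's parameter table.
# Assumes the flowCore Bioconductor package; the FCS path is a placeholder.
library(flowCore)

ff <- read.FCS("pool1_donorA.fcs", transformation = FALSE)
params <- pData(parameters(ff))  # 'name' = channel, 'desc' = marker

marker_metadata <- data.frame(
  channel_name        = params$name,
  marker_name         = params$desc,
  used_for_UMAP       = TRUE,  # edit these Booleans per marker before running
  used_for_clustering = TRUE,
  used_for_scaffold   = TRUE
)
write.csv(marker_metadata, "marker_metadata.csv", row.names = FALSE)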
config.yml
This file directs the pipeline to all of its inputs, and is the main point of control for how the pipeline will be run.
Contents
- path of the directory where the FCS files are; will be searched recursively
- path for where you want the output files saved
- where file_metadata.csv is located
- where marker_metadata.csv is located
- path for gated_fcs_dir, the template for scaffold analysis (can be left blank if not running scaffold)
- rerun_from_step: if you need to restart the pipeline at an intermediate step
- data processing related: can vary per user needs or leave as default
- UMAP related: can vary per user needs or leave as default
- clustering: flowsom recommended, as it is faster, although clara is an alternative
- nthreads: the option to parallelize the optimization step (see Notes on Parallelization; not available on Windows)
Template: CyTOF_pipeline/config.yml
How to create: Fill in the config.yml file, saving it with whatever name you would like. We will use my_config.yml.
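As an illustration only, a filled-in my_config.yml might look roughly like the sketch below. Apart from rerun_from_step, gated_fcs_dir, and nthreads, which appear above, the key names here are guesses; copy the real names and defaults from the CyTOF_pipeline/config.yml template rather than from this sketch.

# Illustrative my_config.yml; key names other than rerun_from_step,
# gated_fcs_dir, and nthreads are assumptions, not the template's names.
fcs_dir: /path/to/fcs_files              # searched recursively
output_dir: /path/to/output
file_metadata: /path/to/file_metadata.csv
marker_metadata: /path/to/marker_metadata.csv
gated_fcs_dir: ""                        # blank if not running scaffold
rerun_from_step: 1                       # restart at an intermediate step if needed
clustering_method: flowsom               # flowsom (recommended) or clara
nthreads: 4                              # see Notes on Parallelization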
Parallelization is available on Mac and Linux operating systems. It is not available for Windows users.
If you are submitting on a shared compute cluster, here is a suggested workflow (a sketch of a matching job script follows this list):
- Structure your initial job call to have one more thread than you will assign in my_config.yml. E.g., if you intend to parallel process over 4 cores, request 5 cores from your scheduler.
- In the same example, ensure that nthreads: 4 is assigned in my_config.yml.
- Larger datasets (over 20 million cells) may need to allocate scratch directories, e.g. --gres=scratch:500G, pending cluster requirements.
- Request enough RAM for the number of cells contained in your dataset. As structured currently, R will split the total amount of RAM passed in your job across the processes requested in my_config.yml.
- We have used ~100 GB RAM without parallelization for ~20 million cells (config ‘nthreads: 1’), ~420 GB RAM for 4x parallelization with ~40 million cells (config ‘nthreads: 4’, scheduler ntasks=5), and ~520 GB RAM for 3x parallelization with ~50 million cells (config ‘nthreads: 3’, scheduler ntasks=4). This may vary based on your computer/cluster specs.
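For example, a minimal SLURM submission script consistent with the 4-core example above might look like the sketch below; the directives and resource values are illustrative and should be matched to your scheduler’s requirements.

#!/bin/bash
#SBATCH --ntasks=5            # one more than nthreads in my_config.yml
#SBATCH --mem=420G            # scale to your cell count (see estimates above)
#SBATCH --gres=scratch:500G   # scratch for very large datasets, if required

Rscript cytof_pipeline.R -c my_config.yml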
If processing in parallel, please note that there will be no FlowSOM clustering begins for grid... log messages after you have received the Starting the FlowSOM clustering... message.
Once everything is set up, run the pipeline, passing your modified my_config.yml as a command line argument:
Rscript cytof_pipeline.R -c my_config.yml
The pipeline is broken down into multiple chunks, where each chunk saves its output as a checkpoint .RData file.
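Because checkpoints are standard .RData files, you can inspect intermediate results from an interactive R session. The object names saved in each checkpoint are not documented here, so list what load() restores rather than assuming a particular name; the path below is a placeholder.

# Sketch: inspect a checkpoint in an interactive R session.
# load() returns the names of the objects it restored.
restored <- load("/path/to/output/checkpoint_4.RData")
print(restored)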
- Read, preprocess and transform the data. Output: checkpoint_1.RData
- Calculate UMAP. Output: checkpoint_2.RData
- Optimize clustering parameters. Output: checkpoint_3.RData & clustering_param_optimization.pdf
- Perform clustering. Output: checkpoint_4.RData
- Generate various matrices. Output: checkpoint_5.RData
- Prepare input files for SCAFFoLD analysis. Output: checkpoint_6.RData
- Generate SCAFFoLD map. Output: checkpoint_7.RData
- Produce final set of visualizations. Output: checkpoint_8.RData and the plots below
- clustering_param_optimization.pdf: DBI x cluster counts graph
- feature_plots.png: a set list of markers helpful for cell type identification
- split_umap_by_cluster.png: histogram of each cluster’s density in UMAP space dimensions
- Rplots.pdf: per cluster, histogram of median expression of all markers
- plots.pdf: UMAP colored by pool ID to survey potential batch effects