---
title: "Generation of comprehensive quality control metrics with SCTK"
output: html_document
---
```{r setup, include=FALSE}
require(Biobase)
knitr::opts_chunk$set(warning = FALSE)
pkgVersion <- package.version("singleCellTK")
```
# Introduction
This pipeline will import data from single-cell preprocessing algorithms (e.g. [CellRanger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger), [HCA Optimus](https://data.humancellatlas.org/pipelines/optimus-workflow), [Alevin](https://salmon.readthedocs.io/en/latest/alevin.html)), generate various quality control metrics (e.g. general metrics, doublet scores, contamination estimates) using multiple tools, and output results in standard data containers (e.g. [SingleCellExperiment](https://rdrr.io/bioc/SingleCellExperiment/man/SingleCellExperiment.html), [Seurat object](https://satijalab.org/seurat/index.html), [AnnData](https://github.com/theislab/anndata)).
For data generated with microfluidic devices, the first major step after UMI counting is to detect cell barcodes that represent droplets containing a true cell and exclude empty droplets that only contain ambient RNA.
We use the following terms for count matrices:
- **"Droplet" matrix** denotes a count matrix that still contains empty droplets;
- **"Cell" matrix** denotes a count matrix of cells where empty droplets have been excluded but no other filtering has been performed;
- **"FilteredCell" matrix** denotes a count matrix where poor-quality cells have also been excluded.
The Droplet and Cell matrices have also been called "raw" and "filtered" matrices, respectively, by tools such as CellRanger. However, the term "filtered" can be ambiguous, as other forms of cell filtering can be applied beyond removing empty droplets (e.g. excluding poor-quality cells based on a low number of UMIs). Both the Droplet matrix and the Cell matrix can undergo QC in this pipeline. However, QC of the Droplet matrix is specific to single-cell data generated with microfluidic devices (e.g. 10X).
To run the pipeline, users can [install the singleCellTK (SCTK) package](installation.html) along with the Python dependencies it may use (mainly [Scrublet](https://github.com/swolock/scrublet) and [AnnData](https://github.com/theislab/anndata)). Alternatively, users can run the Docker version of the pipeline, which is described in detail further down on this page.
# Running the pipeline
## Running SCTK-QC with SCTK local installation
To run the pipeline script, users will need to download the `SCTK_runQC.R` script file [here](https://github.com/compbiomed/singleCellTK/blob/master/exec/SCTK_runQC.R). Instructions for installing the singleCellTK package can be found on the [Installation](installation.html) page.
A simple example of running this pipeline on the 'Cell' matrix generated by CellRanger V3 is shown below:
```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-b /base/path \
-P CellRangerV3 \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat \
-g /Path_to_gmt/name_of_gmt_file.gmt \
-d Cell \
-n 2 \
-T MulticoreParam
```
For flexibility, this pipeline offers several ways to import CellRangerV2/CellRangerV3 data. The pipeline is also compatible with datasets generated by other algorithms. Please refer to the section [*Importing data from different preprocessing tools*](#importing-data-from-different-preprocessing-tools) for more details.
Users can quantify the expression of mitochondrial genes by passing a GMT file containing mitochondrial genes (with the `-g/--gmt` argument). For human or mouse datasets, users can also quantify the expression of mitochondrial genes by setting the `-M/--detectMitoLevel` argument. Please refer to the section [*Gene sets*](#gene-sets) for more details.
The pipeline also provides various parameters to control the quality control process. Please refer to the section [*Parameters*](#parameters) for more details.
## Running SCTK-QC with Docker
### Installing docker
If you have not used Docker before, you can follow the instructions to install and set up Docker on [Windows](https://docs.docker.com/desktop/windows/install/), [Mac](https://docs.docker.com/desktop/mac/install/) or [Linux](https://runnable.com/docker/install-docker-on-linux).
### Running SCTK-QC pipeline using docker image
The Docker image can be obtained by running:
```{r, echo=FALSE, eval=TRUE, include=TRUE, comment=NA}
code <- paste0('docker pull campbio/sctk_qc:', pkgVersion)
cat(code)
```
Each argument is used in the same way as in the command line analysis. Here is an example that performs QC on CellRangerV3 data with the SCTK Docker image:
```{r, echo=FALSE, eval=TRUE, include=TRUE, comment=NA}
code <- paste("docker run --rm -v /path/to/data:/SCTK_docker \\",
paste0(' -it campbio/sctk_qc:', pkgVersion, " \\"),
" -b /SCTK_docker/cellranger_folder \\
-P CellRangerV3 \\
-s SampleName \\
-o /SCTK_docker/Output_Directory \\
-g /SCTK_docker/name_of_gmt_file.gmt \\
-S TRUE \\
-F SCE,AnnData,FlatFile,Seurat", sep='\n'
)
cat(code)
```
By default, the Docker container cannot access files in your host file system. To make files on your machine available, you need to set up a mounted volume. Note that the transcriptome data and the GMT file need to be accessible to the container through a mounted volume. In the above example, the argument `-v` mounts a volume that gives access to the input and output directories. The transcriptome and GMT files stored in the path `/path/to/data` of your machine's file system are then available in the `/SCTK_docker` folder inside the container. To learn more about mounted volumes, please check out [this post](https://docs.docker.com/storage/volumes/).
Please refer to the section [*Parameters*](#parameters) for more details about parameters.
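
Because every path passed to the pipeline is resolved inside the container, it can help to translate host paths before writing the `docker run` command. The following is a minimal shell sketch of that translation, assuming the `-v /path/to/data:/SCTK_docker` mount from the example above. The helper name `to_container_path` is hypothetical and is not part of SCTK or Docker.

```shell
# Hypothetical helper (not part of SCTK or Docker): map a host path that lives
# under the mounted host directory to the path seen inside the container.
to_container_path() {
  host_root=$1
  container_root=$2
  host_path=$3
  case $host_path in
    "$host_root")
      printf '%s\n' "$container_root" ;;
    "$host_root"/*)
      # Strip the host prefix and prepend the container mount point.
      printf '%s\n' "$container_root/${host_path#"$host_root"/}" ;;
    *)
      echo "error: $host_path is not under $host_root" >&2
      return 1 ;;
  esac
}

# Example with the mount used above (-v /path/to/data:/SCTK_docker).
to_container_path /path/to/data /SCTK_docker /path/to/data/name_of_gmt_file.gmt
```

A path outside the mounted directory makes the helper fail, which mirrors the fact that the container cannot see such files.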
### Running SCTK-QC pipeline docker image with singularity
Users who have not used [Singularity](https://sylabs.io/guides/2.6/user-guide/introduction.html) before can install it by following the instructions [here](https://sylabs.io/guides/3.1/user-guide/installation.html). The Singularity image for SCTK-QC can be built using Docker Hub as a source:
```{r, echo=FALSE, eval=TRUE, include=TRUE, comment=NA}
code <- paste0('singularity pull docker://campbio/sctk_qc:', pkgVersion)
cat(code)
```
The singleCellTK Singularity image is used in much the same way as the Docker image. In Singularity 3.0+, the mount volume is [automatically overlaid](https://sylabs.io/guides/3.1/user-guide/bind_paths_and_mounts.html).
We recommend resetting the home directory when you run Singularity. By default, Singularity mounts the `\$HOME` path of your file system, which might contain your personal R/Python library folder. If the home directory is not reset, Singularity will try to use R/Python libraries that were not built within the Singularity image, which can cause conflicts. You can point to a "sanitized home" that differs from the `\$HOME` path on your machine using the argument `-H`/`--home` [(see more information)](https://sylabs.io/guides/3.1/user-guide/bind_paths_and_mounts.html). Alternatively, you can disable the `\$HOME` binding with the argument `--no-home`. In addition, you can use the argument `--bind`/`-B` to specify your own mount volume, which is the path that contains the dataset and will be used to store the output of the QC pipeline. An example is shown below:
```{r, echo=FALSE, eval=TRUE, include=TRUE, comment=NA}
code <- paste("singularity run --home=/PathToSanitizedHome \\",
paste0(' --bind /PathToData:/SCTK_docker sctk_qc_', pkgVersion, ".sif \\"),
" -b /SCTK_docker/cellranger_folder \\
-P CellRangerV3 \\
-s SampleName \\
-o /SCTK_docker/Output_Directory \\
-g /SCTK_docker/name_of_gmt_file.gmt \\
-S TRUE \\
-F SCE,AnnData,FlatFile,Seurat", sep='\n'
)
cat(code)
```
### Important note about docker image
Please run the Docker image on a machine or node whose **CPU** has one of the following architectures: **broadwell, haswell, skylake, cascadelake or a newer architecture**. This avoids the "illegal operation" error from the Scrublet package, because this Python package is compiled with SIMD instructions that are compatible with these CPU architectures. When submitting a job script, specify the CPU architecture in the script header after `#$ -l cpu_arch=` as one of the following: `broadwell`, `haswell`, `skylake`, `cascadelake` or a newer architecture. An example is shown below:
```{r, echo=FALSE, eval=TRUE, include=TRUE, comment=NA}
code <- paste("#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe omp 16
#$ -l cpu_arch=broadwell|haswell|skylake|cascadelake
### this also applies to 'docker run'
singularity run --home=/PathToSanitizedHome \\",
paste0(' --bind /PathToData:/SCTK_docker sctk_qc_', pkgVersion, ".sif \\"),
" -b /SCTK_docker/cellranger_folder \\
-P CellRangerV3 \\
-s SampleName \\
-o /SCTK_docker/Output_Directory \\
-g /SCTK_docker/name_of_gmt_file.gmt \\
-S TRUE \\
-F SCE,AnnData,FlatFile,Seurat", sep='\n'
)
cat(code)
```
# Parameters {#parameters}
## Table of Parameters
The pipeline contains various parameters to control the process of quality control. The function of each parameter is shown below:
### Required arguments
The required arguments are as follows:
| Parameter | Description |
|---------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `-b`, `--basePath` (**required**) | Base path for the output from the preprocessing algorithm. |
| `-P`, `--preproc` (**required**) | Algorithm used for preprocessing. One of `CellRangerV2`, `CellRangerV3`, `BUStools`, `STARSolo`, `SEQC`, `Optimus`, `DropEst`, `SceRDS`, `Seurat`, `CountMatrix` and `AnnData`. |
| `-s`, `--sample` (**required**) | Name of the sample. This will be prepended to the cell barcodes. |
| `-o`, `--directory` (**required**) | Output directory. A new subdirectory will be created with the name "sample". R, Python, and FlatFile directories will be created under the "sample" directory containing the data containers with QC metrics. Default `.`. More information about output directory structure is explained in [*Outputs*](#outputs) section below. |
| `-F`, `--outputFormat` (**required**) | The output format of this QC pipeline. Currently, it supports `SCE`, `Seurat`, `FlatFile`, `AnnData` and `HTAN` (manifest files that meet HTAN requirements). |
| `-S`, `--splitSample` (**required**) | Save a `SingleCellExperiment` object for each sample. Default is `TRUE`. If `FALSE`, the data of all samples will be combined into one `SingleCellExperiment` object and this object will be output. |
### Batch processing
When running the pipeline on more than one sample, take care with the `-S` flag, which controls whether the samples are kept separate or combined.
In addition, the following parameters need one value per sample, separated by a **comma** **without any spacing**:
| Parameter | Description |
|--------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `-b`, `--basePath` | Base path for the output from the preprocessing algorithm; one is required for each sample, even if they are in the same location. |
| `-P`, `--preproc` | In multiple object mode, each entry should have a preprocessing algorithm associated with it; the pipeline supports serial processing of different sample types. |
| `-s`, `--sample` | Names of the samples. |
An example execution is shown below:
```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-b /base/path1,/base/path2 \
-P Preprocessing_Algorithm_1,Preprocess_Algorithm_2 \
-s SampleName1,SampleName2 \
-o Output_Directory \
-S FALSE \
-F SCE,AnnData,FlatFile \
-d Droplet \
-D TRUE
```
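
When the number of samples grows, writing the comma-separated lists by hand becomes error-prone. The following is a minimal shell sketch that builds them from individual values. The helper `join_by_comma` is hypothetical, and the sample values are the placeholders from the example above. The command is only printed, not executed.

```shell
# Hypothetical helper: join all arguments with commas and no spaces.
join_by_comma() {
  # The subshell keeps the IFS change from leaking into the caller.
  ( IFS=','; printf '%s\n' "$*" )
}

# Placeholder values taken from the example above; replace with real ones.
base_paths=$(join_by_comma /base/path1 /base/path2)
preprocs=$(join_by_comma Preprocessing_Algorithm_1 Preprocess_Algorithm_2)
samples=$(join_by_comma SampleName1 SampleName2)

# Print the command for inspection instead of running it.
echo Rscript SCTK_runQC.R -b "$base_paths" -P "$preprocs" -s "$samples" \
  -o Output_Directory -S FALSE -F SCE,AnnData,FlatFile
```

Each list must contain the same number of entries, in the same sample order.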
### Optional arguments
The optional arguments are as follows. Their usage depends on the type of data and on user-defined behaviour.
| Parameter | Description |
|-------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `-g`, `--gmt` | GMT file containing gene sets for quality control. |
| `-t`, `--delim` (required when -g is specified) | Delimiter used in GMT file. Default `"\t"`. |
| `-G`, `--genome` | The name of genome reference. This is only required for CellRangerV2 data. |
| `-y`, `--yaml` | YAML file used to specify parameters of QC functions called by SCTK-QC pipeline. Please check [*Specify parameters using yaml file*](#specify-parameters-using-yaml-file) section for details. |
| `-c`, `--cellData` | The full path of the RDS, H5AD or Matrix file of the cell matrix. This is used only when `--preproc` is `SceRDS`, `Seurat`, `AnnData`, or `CountMatrix`. |
| `-r`, `--rawData` | The full path of the RDS file or Matrix file of the droplet matrix. This would be provided only when `--preproc` is `SceRDS`, `Seurat`, `AnnData`, or `CountMatrix`. |
| `-C`, `--cellPath` | The directory containing `matrix.mtx.gz`, `features.tsv.gz` and `barcodes.tsv.gz` files originally generated by 10x CellrangerV2 or CellrangerV3 (files in the `filtered_feature_bc_matrix` directory). This argument only works when `--preproc` is `CellRangerV2` or `CellRangerV3`. Default is `NULL`. If `base_path` is `NULL`, `cellPath` or `rawPath` should be specified. |
| `-R`, `--rawPath` | The directory containing `matrix.mtx.gz`, `features.tsv.gz` and `barcodes.tsv.gz` files originally generated by 10x CellrangerV2 or CellrangerV3 (files in the `raw_feature_bc_matrix` directory). This argument only works when `--preproc` is `CellRangerV2` or `CellRangerV3`. Default is `NULL`. If `base_path` is `NULL`, `cellPath` or `rawPath` should be specified. |
| `-d`, `--dataType` | Type of data as input. Default is `Both`, which means taking both the droplet and cell matrices as input. If set to `Droplet`, only droplet data is processed. If set to `Cell`, only cell data is processed. |
| `-D`, `--detectCells` | Detect cells from droplet matrix. Default is `FALSE`. This argument is only evaluated when `-d` is `Droplet`. If set as `TRUE`, cells will be detected and cell matrix will be subset from the droplet matrix. Also, QC will be performed on the detected cell matrix. |
| `-m`, `--cellDetectMethod` | Methods to detect cells from droplet matrix. Default is `EmptyDrops`. This argument is only evaluated when `-D` is `TRUE`. Other options could be `Knee` or `Inflection`. More information is provided in the [*Droplet QC* documentation](cnsl_dropletqc.html). |
| `-n`, `--numCores` | Number of cores used to run the pipeline. Default is `1`. Parallel computing is enabled if `-n` is greater than 1. |
| `-T`, `--parallelType` | Type of backend used for parallel computing. Parallel computing in this pipeline depends on the [BiocParallel](https://bioconductor.org/packages/release/bioc/html/BiocParallel.html) package. Default is `MulticoreParam`. It can be `MulticoreParam` or `SnowParam`. This argument is evaluated only when `--numCores` is greater than 1. |
| `-i`, `--studyDesign` | The TXT file containing the description of the study design. Default is `NULL`. This would be shown at the beginning of the HTML report of cell and droplet QC. |
| `-L`, `--subTitle` | The subtitle used in the cell and droplet QC HTML report. Default is `None`. The subtitle can contain information about the sample, such as the sample name. If `-S` is set to `TRUE`, the length of the subtitle should be the same as the number of samples. If `-S` is set to `FALSE`, the length of the subtitle should be one or NULL. |
| `-M`, `--detectMitoLevel` | Detect mitochondrial gene expression level. If `TRUE`, the pipeline will examine mitochondrial gene expression level automatically without the need of importing user defined GMT file. Default is `TRUE`. |
| `-E`, `--mitoType` | Type of mitochondrial gene-set to be loaded when `--detectMitoLevel` is set to `TRUE`. Possible choices are: `human-ensembl`, `human-symbol`, `human-entrez`, `human-ensemblTranscriptID`, `mouse-ensembl`, `mouse-symbol`, `mouse-entrez` and `mouse-ensemblTranscriptID`. |
## Specify parameters using YAML file {#specify-parameters-using-yaml-file}
Users can specify parameters for the QC algorithms in this pipeline with a YAML file (supplied with the `-y`/`--yamlFile` argument). The currently supported QC algorithms include doublet detection (`bcds`, `cxds`, `cxds_bcds_hybrid`, `doubletFinder`, `doubletCells` and `scrublet`), decontamination (`decontX`), emptyDrop detection (`emptyDrops`) and barcodeRankDrops (`barcodeRanks`). A summary of each function is shown below:
![QC yaml parameters](qc_yamlParameters.png)
An example of QC parameters YAML file is shown below:
```
---
Params:   ### should not be omitted
  bcds:
    ntop: 600
  cxds:
    ntop: 600
  cxds_bcds_hybrid:
    nTop: 600
  decontX:
    maxIter: 600
  emptyDrops:
    lower: 50
    niters: 5000
    testAmbient: True
  barcodeRanks:
    lower: 50
```
The format of YAML files is described [here](https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html). The parameter names should match the parameters of each QC function in SCTK. Parameters that are not defined in this YAML file will use their default values. Please refer to the [reference](../reference/index.html#quality-control-preprocessing) for detailed information about the arguments of each QC function.
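
Since the top-level `Params` key should not be omitted, a quick check before launching a long run can catch a malformed file early. The following is a minimal shell sketch under that assumption. The helper `has_params_key` is hypothetical and only looks for the key at the start of a line. It does not validate the rest of the YAML.

```shell
# Hypothetical check (not part of SCTK): succeed only if the file has a
# top-level "Params:" key, i.e. the key starts in the first column.
has_params_key() {
  grep -q '^Params:' "$1"
}

# Write a small parameter file to a temporary location and check it.
yaml_file=$(mktemp)
cat > "$yaml_file" <<'EOF'
---
Params:
  decontX:
    maxIter: 600
EOF

if has_params_key "$yaml_file"; then
  echo "Params key found"
else
  echo "Params key missing" >&2
fi
rm -f "$yaml_file"
```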
## Parallel computing
The SCTK-QC pipeline supports parallel computing to speed up the analysis. Parallel computing is enabled by setting `-n`/`--numCores`, the number of cores used by the pipeline, to a value greater than 1.
The parallel backend is provided by the [BiocParallel](https://bioconductor.org/packages/release/bioc/html/BiocParallel.html) package. Users can therefore select the type of parallel evaluation with the `-T`/`--parallelType` argument. Default is `MulticoreParam`. Currently, `MulticoreParam` and `SnowParam` are supported for the `-T` argument. However, `MulticoreParam` is not supported on Windows, so Windows users should choose `SnowParam` as the parallel backend.
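
The platform rule above can be sketched in a wrapper script that picks the `-T` value from the operating system. This is a minimal sketch: the helper `pick_parallel_type` is hypothetical, and it assumes that anything other than Linux or macOS (as reported by `uname -s`) should fall back to `SnowParam`.

```shell
# Hypothetical helper (not part of SCTK): pick a value for -T from the
# operating system name reported by `uname -s`.
pick_parallel_type() {
  case $1 in
    Linux|Darwin) echo MulticoreParam ;;  # fork-based backend
    *)            echo SnowParam ;;       # e.g. Windows shells such as MINGW
  esac
}

# Print the resulting arguments for the current machine.
parallel_type=$(pick_parallel_type "$(uname -s)")
echo "-n 2 -T $parallel_type"
```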
# Description of input data
## QC on combinations of droplet and cell matrix
The `-d` argument specifies which type of count matrix is used in the pipeline. Default is `Both`, which means the pipeline will run quality control on both the droplet and cell count data.
Users can also choose to run SCTK-QC pipeline on only the droplet count or cell count matrix, instead of running on both. In this case, the pipeline will only take the single input and perform QC on it. An example is shown below:
```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-b /base/path \
-P Preprocessing_Algorithm \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat \
-d Droplet \
-D TRUE \
-m EmptyDrops
```
If the `-d` argument is set to `Droplet`, the QC pipeline takes only the droplet count matrix as input and performs quality control on it. You can choose whether to detect cells from the droplet matrix by setting `-D` to `TRUE`. If so, the cell count matrix is derived from the droplet matrix, and the pipeline also performs quality control on this matrix and outputs the result. You can further choose the method used to detect cells from the droplet matrix by setting the `-m` argument to one of `EmptyDrops`, `Knee` or `Inflection`. `EmptyDrops` keeps cells that pass the `runEmptyDrops()` function test. `Knee` and `Inflection` keep cells that pass the knee or inflection point returned by the `runBarcodeRankDrops()` function.
If the `-d` argument is set to `Cell`, the QC pipeline takes only the cell count matrix as input and performs quality control on it. A figure showing the analysis steps and outputs for the different inputs is shown below:
![QC single input](qc_singleInput.png)\
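
The dependencies between `-d`, `-D` and `-m` described above can be sketched as a small helper that emits only the arguments that are actually evaluated. This is a minimal sketch: the function name `qc_mode_args` is hypothetical, and the defaults (`FALSE` and `EmptyDrops`) follow the parameter table.

```shell
# Hypothetical helper (not part of SCTK): print the -d/-D/-m arguments for a
# given data type, dropping options that the pipeline would not evaluate.
qc_mode_args() {
  data_type=$1
  detect_cells=${2:-FALSE}
  method=${3:-EmptyDrops}
  case $data_type in
    Both|Cell)
      printf '%s\n' "-d $data_type" ;;
    Droplet)
      if [ "$detect_cells" = TRUE ]; then
        case $method in
          EmptyDrops|Knee|Inflection)
            printf '%s\n' "-d Droplet -D TRUE -m $method" ;;
          *)
            echo "error: unknown cell detection method: $method" >&2
            return 1 ;;
        esac
      else
        # -m is only evaluated when -D is TRUE, so it is left out here.
        printf '%s\n' "-d Droplet"
      fi ;;
    *)
      echo "error: unknown data type: $data_type" >&2
      return 1 ;;
  esac
}

qc_mode_args Droplet TRUE Knee
```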
## Importing data from different preprocessing tools {#importing-data-from-different-preprocessing-tools}
### Import data from Cellranger
For flexibility, this pipeline offers several ways to import CellRangerV2/CellRangerV3 data.
1. If the cellranger data set is saved in the default [cellranger output directory](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/overview), you can load the data by running the following code:
For CellRangerV3:
```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-b /base/path \
-P CellRangerV3 \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat
```
For CellRangerV2, the reference used by cellranger needs to be specified by `-G`/`--genome`:
```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-b /base/path \
-P CellRangerV2 \
-s SampleName \
-o Output_Directory \
-S TRUE \
-G GenomeName \
-F SCE,AnnData,FlatFile,Seurat
```
`-b` specifies the base path and usually it's the output folder of 10x `cellranger-count`. `-s` specifies the sample name, which has to be the same as the name of the sample folder under the base folder. The folder layout would look like the following:
```
├── BasePath
    └── SampleName
        ├── outs
        |   ├── filtered_feature_bc_matrix
        |   |   ├── barcodes.tsv.gz
        |   |   ├── features.tsv.gz
        |   |   └── matrix.mtx.gz
        |   ├── raw_feature_bc_matrix
        |   |   ├── barcodes.tsv.gz
        |   |   ├── features.tsv.gz
        |   |   └── matrix.mtx.gz
        ...
```
2. If the `cellranger-count` output has been moved out of the default [cellranger output directory](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/overview), you can specify the paths to the droplet and cell count matrices using the arguments `-R` and `-C`:
```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-P CellRangerV2 \
-C /path/to/cell/matrix \
-R /path/to/droplet/matrix \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat
```
In this case, you **must** omit the `-b` argument, and you can also omit the `-G` argument for CellRangerV2 data.
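
To decide between the two options above, it can be useful to check whether a sample still follows the default layout. The following is a minimal shell sketch, assuming the CellRanger V3 file names shown in the folder layout of option 1. The helper `has_cellranger_matrix` is hypothetical and not part of SCTK.

```shell
# Hypothetical helper (not part of SCTK): return 0 when the three matrix files
# exist under <base>/<sample>/outs/<matrix_dir>, and 1 otherwise.
has_cellranger_matrix() {
  base=$1
  sample=$2
  matrix_dir=$3
  for f in barcodes.tsv.gz features.tsv.gz matrix.mtx.gz; do
    [ -f "$base/$sample/outs/$matrix_dir/$f" ] || return 1
  done
  return 0
}

# Placeholder paths from the examples above; replace with real ones.
if has_cellranger_matrix /base/path SampleName filtered_feature_bc_matrix; then
  echo "default layout found: use -b /base/path -s SampleName"
else
  echo "default layout not found: pass the matrix directories with -C and -R"
fi
```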
### Import data from RDS, H5AD or matrix stored in a text file
1. If your data is stored as a `SingleCellExperiment` object in an RDS file, singleCellTK also supports this type of input. To run quality control with an RDS file as input, run the following code:
```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-P SceRDS \
-s Samplename \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat \
-r /path/to/rds/file/droplet.RDS \
-c /path/to/rds/file/cell.RDS
```
2. If your input is stored as a `Seurat` object in an RDS file, you may use the following code:
```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-P Seurat \
-s Samplename \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat \
-r /path/to/rds/file/droplet.RDS \
-c /path/to/rds/file/cell.RDS
```
**Warning:** Both `SingleCellExperiment` and `Seurat` objects are saved in the same RDS format. Make sure you know which type of object your file contains, or the pipeline may not run successfully.
3. If your input is stored as an `AnnData` object in an H5AD file, run the following code:
```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-P AnnData \
-s Samplename \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat \
-r /path/to/h5ad/file/droplet.h5ad \
-c /path/to/h5ad/file/cell.h5ad
```
4. If your input is stored in a TXT file as a matrix with barcodes as column names and genes as row names, run the following code to start the quality control pipeline:
```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-P CountMatrix \
-s Samplename \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat \
-r /path/to/matrix/file/droplet.txt \
-c /path/to/matrix/file/cell.txt
```
### Methods to run pipeline on data set generated by other algorithms
The SCTK-QC pipeline can import data from the following preprocessing tools or objects:
- [CellRanger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger)
- [Optimus](https://data.humancellatlas.org/pipelines/optimus-workflow)
- [DropEst](https://github.com/hms-dbmi/dropEst)
- [BUStools](https://github.com/BUStools/bustools)
- [Seqc](https://github.com/ambrosejcarr/seqc)
- [STARSolo](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md)
- [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) object saved in RDS file
- [AnnData](https://github.com/theislab/anndata) object saved in HDF5 file
If your data was preprocessed by another algorithm, make sure that the `-b` argument matches the path where the data is stored and that the `-P` argument names the corresponding preprocessing tool. The basic template is shown below:
```{bash, eval=FALSE}
Rscript SCTK_runQC.R \
-b /base/path \
-P Preprocessing_Algorithm \
-s SampleName \
-o Output_Directory \
-S TRUE \
-F SCE,AnnData,FlatFile,Seurat
```
The following table describes how SCTK expects the inputs to be structured and passed for each import function. In all cases, SCTK retains the standard output directory structure from upstream tools. All the import functions return the imported counts matrix as an `assay` in a `SingleCellExperiment` object, with associated information in respective `colData`, `rowData`, `reducedDims`, and `metadata` fields.
![QC data format](qc_inputShell.png)\
The table above also shows the R console functions for the QC algorithms. Detailed information about function parameters and defaults is available in the [*Reference*](../reference/index.html) section.
## Gene sets {#gene-sets}
Quantifying the expression level of gene sets can be a useful quality control metric. For example, the percentage of counts from mitochondrial genes can be an indicator of cell stress or death.
Users can quantify the expression of mitochondrial genes for human or mouse datasets by setting `-M`/`--detectMitoLevel` to `TRUE` and specifying the matching `--mitoType` argument for the dataset. The SCTK-QC pipeline has built-in mitochondrial gene sets for human and mouse genes. It supports four types of gene ID: `gene symbol`, `entrez ID`, `ensembl ID` and `ensembl transcript ID`. Therefore, there are eight options for the `--mitoType` argument: `human-ensembl`, `human-symbol`, `human-entrez`, `human-ensemblTranscriptID`, `mouse-ensembl`, `mouse-symbol`, `mouse-entrez` and `mouse-ensemblTranscriptID`.
To quantify the expression of other gene sets, users can pass a [GMT](https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29) file (with the `-g`/`--gmt` argument) to the pipeline, with one row for each gene set. The first column should be the name of the gene set (e.g. `mito`).
The second column for each gene set in the GMT file (i.e. the description) should state where to look for the matching IDs in the data. If it is set to `rownames`, the gene set IDs will be matched with the row IDs of the data matrix. If a character string or an integer index is supplied, the gene set IDs will be matched to the IDs in that column of the feature table. Gene sets with mitochondrial genes can be found [here](https://github.com/compbiomed/singleCellTK/tree/master/exec).
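
The row layout described above can be sketched with a small shell helper that writes one tab-delimited GMT row. This is a minimal sketch: the helper `write_gmt_row` is hypothetical, and the three human mitochondrial gene symbols are only an illustrative subset, not a complete gene set.

```shell
# Hypothetical helper (not part of SCTK): print one GMT row made of the gene
# set name, the ID location, and the gene IDs, all separated by tabs.
write_gmt_row() {
  name=$1
  location=$2
  shift 2
  printf '%s\t%s' "$name" "$location"
  for gene in "$@"; do
    printf '\t%s' "$gene"
  done
  printf '\n'
}

# Illustrative subset of human mitochondrial gene symbols, matched against the
# row names of the count matrix. Redirect the output to a .gmt file to use it.
write_gmt_row mito rownames MT-ND1 MT-ND2 MT-CO1
```

The default delimiter expected by `-t`/`--delim` is a tab, which is what this sketch emits.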
# Outputs {#outputs}
The output directory is created under the path specified by the `-o`/`--directory` argument. Each sample is stored in a subdirectory (named by the `-s`/`--sample` argument) within this output directory. Within each sample directory, each output format is stored in its own subdirectory. The output file hierarchy is shown below:
```
(root; output directory)
├── level3Meta.csv
├── level4Meta.csv
├── sample1_dropletQC.html
├── sample1_cellQC.html
└── sample1
    ├── sample1_cellQC_summary.csv
    ├── R
    |   ├── sample1_Droplets.rds
    |   └── sample1_Cells.rds
    ├── Python
    |   ├── Droplets
    |   |   └── sample1.h5ad
    |   └── Cells
    |       └── sample1.h5ad
    ├── FlatFile
    |   ├── Droplets
    |   |   ├── assays
    |   |   |   └── sample1_counts.mtx.gz
    |   |   ├── metadata
    |   |   |   └── sample1_metadata.rds
    |   |   ├── sample1_colData.txt.gz
    |   |   └── sample1_rowData.txt.gz
    |   └── Cells
    |       ├── assays
    |       |   └── sample1_counts.mtx.gz
    |       ├── metadata
    |       |   └── sample1_metadata.rds
    |       ├── reducedDims
    |       |   ├── sample1_decontX_UMAP.txt.gz
    |       |   ├── sample1_scrublet_TSNE.txt.gz
    |       |   └── sample1_scrublet_UMAP.txt.gz
    |       ├── sample1_colData.txt.gz
    |       └── sample1_rowData.txt.gz
    └── sample1_QCparameters.yaml
```
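
After a run, the hierarchy above can be used to confirm that the expected containers were written. The following is a minimal shell sketch that lists the R and Python outputs for one sample and reports any that are missing. The helper `expected_outputs` is hypothetical, and which files actually exist depends on the `-F` and `-d` settings used for the run.

```shell
# Hypothetical helper (not part of SCTK): print the expected locations of the
# R and Python outputs for one sample, following the hierarchy shown above.
expected_outputs() {
  out_dir=$1
  sample=$2
  printf '%s\n' \
    "$out_dir/$sample/R/${sample}_Droplets.rds" \
    "$out_dir/$sample/R/${sample}_Cells.rds" \
    "$out_dir/$sample/Python/Droplets/$sample.h5ad" \
    "$out_dir/$sample/Python/Cells/$sample.h5ad"
}

# Report every expected file that is not present on disk.
expected_outputs Output_Directory sample1 | while IFS= read -r f; do
  [ -e "$f" ] || echo "missing: $f"
done
```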
# Documentation of available tools
### Empty droplet detection
- [emptyDrops](https://rdrr.io/github/MarioniLab/DropletUtils/man/emptyDrops.html) from the package [DropletUtils](https://bioconductor.org/packages/release/bioc/html/DropletUtils.html)
- [barcodeRanks](https://rdrr.io/github/MarioniLab/DropletUtils/man/barcodeRanks.html) from the package [DropletUtils](https://bioconductor.org/packages/release/bioc/html/DropletUtils.html)
### Doublet Detection
- [scDblFinder](https://rdrr.io/bioc/scDblFinder/man/scDblFinder.html) from the package [scDblFinder](https://bioconductor.org/packages/release/bioc/html/scDblFinder.html)
- [cxds](https://rdrr.io/bioc/scds/man/cxds.html), [bcds](https://rdrr.io/bioc/scds/man/bcds.html), and [cxds_bcds_hybrid](https://rdrr.io/bioc/scds/man/cxds_bcds_hybrid.html) from the package [scds](http://bioconductor.org/packages/release/bioc/html/scds.html)
- [doubletFinder](https://rdrr.io/github/chris-mcginnis-ucsf/DoubletFinder/man/doubletFinder.html) from the package [DoubletFinder](https://github.com/chris-mcginnis-ucsf/DoubletFinder)
- [Scrublet](https://bioconda.github.io/recipes/scrublet/README.html) from the package [scrublet](https://github.com/allonkleinlab/scrublet)
### Ambient RNA detection
- [decontX](https://rdrr.io/bioc/celda/man/decontX.html) from the package [celda](https://bioconductor.org/packages/release/bioc/html/celda.html)
- [SoupX](https://github.com/constantAmateur/SoupX) from the package [SoupX](https://github.com/constantAmateur/SoupX)