/
scp.Rmd
1035 lines (845 loc) · 37.4 KB
/
scp.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Single Cell Proteomics data processing and analysis"
author:
- name: Laurent Gatto
- name: Christophe Vanderaa
output:
BiocStyle::html_document:
self_contained: yes
toc: true
toc_float: true
toc_depth: 2
code_folding: show
bibliography: scp.bib
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('scp')`"
vignette: >
%\VignetteIndexEntry{Single Cell Proteomics data processing and analysis}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
crop = NULL
## cf https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
```
# The `scp` package
The `scp` package is used to process and analyse mass spectrometry
(MS)-based single cell proteomics (SCP) data. The functions rely on a
specific data structure that wraps
[`QFeatures`](https://rformassspectrometry.github.io/QFeatures/)
objects (@Gatto2023-ry) around
[`SingleCellExperiment`](http://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html)
objects (@Amezquita2019-bf). This data structure could be seen as
Matryoshka dolls were the `SingleCellExperiment` objects are small
dolls contained in the bigger `QFeatures` doll.
The `SingleCellExperiment` class provides a dedicated framework for
single-cell data. The `SingleCellExperiment` serves as an interface to
many cutting-edge methods for processing, visualizing and analysis
single-cell data. More information about the `SingleCellExperiment`
class and associated methods can be found in the
[OSCA book](http://bioconductor.org/books/release/OSCA/).
The `QFeatures` class is a data framework dedicated to manipulate and
process MS-based quantitative data. It preserves the relationship
between the different levels of information: peptide to spectrum match
(PSM) data, peptide data and protein data. The `QFeatures` package
also provides an interface to many utility functions to streamline the
processing MS data. More information about MS data analysis tools can
be found in the
[RforMassSpectrometry project](https://www.rformassspectrometry.org/).
```{r scp_framework, results='markup', fig.cap="`scp` relies on `SingleCellExperiment` and `QFeatures` objects.", echo=FALSE, out.width='100%', fig.align='center'}
knitr::include_graphics("./figures/SCP_framework.png", error = FALSE)
```
Before running the vignette we need to load the `scp` package.
```{r load_scp, message = FALSE}
library("scp")
```
We also load `ggplot2` and `dplyr` for convenient data
manipulation and plotting.
```{r load_other_package, message = FALSE}
library("ggplot2")
library("dplyr")
```
# Before you start
This vignette will guide you through some common steps of mass
spectrometry-based single-cell proteomics (SCP) data analysis. SCP is
an emerging field and further research is required to develop a
principled analysis workflow. Therefore, we **do not guarantee** that the
steps presented here are the best steps for this type of data analysis.
This vignette performs the steps that were described in the SCoPE2
landmark paper (@Specht2021-jm) and that were reproduced in another
work using the `scp` package (@Vanderaa2021-ue). The replication on
the full SCoPE2 dataset using `scp` is available in
[this vignette](https://uclouvain-cbio.github.io/SCP.replication/articles/SCoPE2.html).
We hope to convince the reader that,
although the workflow is probably not optimal, `scp` has the full
potential to perform standardized and principled data analysis. All
functions presented here are comprehensively documented, highly modular,
can easily be extended with new algorithms. Suggestions, feature
requests or bug reports are warmly welcome. Feel free to open an issue
in the
[GitHub repository](https://github.com/UCLouvain-CBIO/scp/issues).
This workflow can be applied to any MS-based SCP data. The minimal
requirement to follow this workflow is that the data should contain
the following information:
- `runCol`/`Raw.file`: field in the feature data and the sample data
that gives the names of MS acquisition runs or files.
- `quantCols`: field in the sample data that links to columns in the
quantification data and that allows to link samples to MS channels
(more details in another
[vignette](https://uclouvain-cbio.github.io/scp/articles/read_scp.html)).
- `SampleType`: field in the sample data that provides the type of
sample that is acquired (carrier, reference, single-cell,...). Only
needed for multiplexing experiments.
- `Potential.contaminant`: field in the feature data that marks
contaminant peptides.
- `Reverse`: field in the feature data that marks reverse peptides.
- `PIF`: field in the feature data that provides spectral purity.
- `PEP` or `dart_PEP`: field in the feature data that provides peptide
posterior error probabilities.
- `Modified.sequence`: field in the feature data that provides the
peptide identifiers.
- `Leading.razor.protein`: field in the feature data that provides the
protein identifiers.
- At least one field in the feature data that contains quantification
values. In this case, there are 16 quantification columns named as
`Reporter.intensity.` followed by an index (`1` to `16`).
Each required field will be described more in detail in the
corresponding sections. Names can be adapted by the user to more
meaningful ones or adapted to other output tables.
# Read in SCP data
The first step is to read in the PSM quantification table generated
by, for example, MaxQuant (@Tyanova2016-yj). We created a small
example data by subsetting the MaxQuant `evidence.txt` table provided
in the SCoPE2 landmark paper (@Specht2021-jm). The `mqScpData` table
is a typical example of what you would get after reading in a CSV file
using `read.csv` or `read.table`. See `?mqScpData` for more
information about the table content.
```{r}
data("mqScpData")
```
We also provide an example of a sample annotation table that provides
useful information about the samples that are present in the example
data. See `?sampleAnnotation` for more information about the table
content.
```{r}
data("sampleAnnotation")
```
As a note, the example sample data contains 5 different types of samples
(`SampleType`) that can be found in a TMT-based SCP data set:
```{r}
table(sampleAnnotation$SampleType)
```
* The carrier channels (`Carrier`) contain 200 cell equivalents and
are meant to boost the peptide identification rate.
* The normalization channels (`Reference`) contain 5 cell equivalents
and are used to partially correct for between-run variation.
* The unused channels (`Unused`) are channels that are left empty due
to isotopic cross-contamination by the carrier channel.
* The negative control channels (`Blank`) contain samples that do not
contain any cell but are processed as single-cell samples.
* The single-cell sample channels contain the single-cell samples of
interest, that are macrophage (`Macrophage`) or monocyte
(`Monocyte`).
Using `readSCP`, we combine both tables in a `QFeatures` object
formatted as described above.
```{r readSCP}
scp <- readSCP(assayData = mqScpData,
colData = sampleAnnotation,
runCol = "Raw.file",
removeEmptyCols = TRUE)
scp
```
See here that the 3 first assays contain 11 columns that correspond to
the TMT-11 labels and the last assay contains 16 columns that
correspond to the TMT-16 labels.
**Important**: More details about the usage of `readSCP()` and how to
read your own data set are provided in the `Load data using readSCP`
[vignette](https://uclouvain-cbio.github.io/scp/articles/read_scp.html).
Another way to get an overview of the scp object is to plot the
`QFeatures` object. This will create a graph where each node is an
assay and links between assays are denoted as edges.
```{r plot1}
plot(scp)
```
# Clean missing data
All single-cell data contain many zeros. The zeros can be biological
zeros or technical zeros and differentiating between the two types is
not a trivial task. To avoid artefacts in downstream steps, we replace
the zeros by the missing value `NA`. The `zeroIsNA` function takes the
`QFeatures` object and the name(s) or index/indices of the assay(s) to
clean and automatically replaces any zero in the selected quantitative
data by `NA`.
```{r zeroIsNA}
scp <- zeroIsNA(scp, i = 1:4)
```
# Filter PSMs
A common steps in SCP is to filter out low-confidence PSMs. Each PSM
assay contains feature meta-information that are stored in the
`rowData` of the assays. The `QFeatures` package allows to quickly
filter the rows of an assay by using these information. The available
variables in the `rowData` are listed below for each assay.
```{r rowDataNames}
rowDataNames(scp)
```
## Filter features based on feature annotations
Below are some examples of criteria that are used to identify
low-confidence. The information is readily available since this was
computed by MaxQuant:
- Remove PSMs that are matched to contaminants
- Remove PSMs that are matched to the decoy database
- Keep PSMs that exhibit a high PIF (parental ion fraction),
indicative of the purity of a spectrum
We can perform this filtering using the `filterFeatures` function from
`QFeatures`. `filterFeatures` automatically accesses the feature
annotations and selects the rows that meet the provided condition(s). For
instance, `Reverse != "+"` keeps the rows for which the `Reverse`
variable in the `rowData` is not `"+"` (*i.e.* the PSM is not matched
to the decoy database).
```{r filter_psms}
scp <- filterFeatures(scp,
~ Reverse != "+" &
Potential.contaminant != "+" &
!is.na(PIF) & PIF > 0.8)
```
## Filter assays based on detected features
To avoid proceeding with failed runs, another interesting filter is to
remove assays with too few features. If a batch contains less than,
for example, 150 features we can then suspect something wrong happened
in that batch and it should be removed. Using `dims`, we can query the
dimensions (hence the number of features and the number of samples) of
all assays contained in the dataset.
```{r dims}
dims(scp)
```
Actually, a `QFeatures` object can be seen as a three-order array:
$features \times samples \times assay$. Hence, `QFeatures` supports
three-order subsetting `x[rows, columns, assays]`. We first select the
assays that have sufficient PSMs (the number of rows is greater than
150), and then subset the `scp` object for the assays that meet the
criterion.
```{r filter_assays}
keepAssay <- dims(scp)[1, ] > 150
scp <- scp[, , keepAssay]
scp
```
Notice the `190321S_LCA10_X_FP97_blank_01` sample was removed because
it did not contain sufficient features, as expected from a blank
run. This could also have been the case for failed runs.
## Filter features based on SCP metrics
Another type of filtering is specific to SCP. In the SCoPE2 analysis,
the authors suggest a filters based on the sample to carrier ratio
(SCR), that is the reporter ion intensity of a single-cell sample
divided by the reporter ion intensity of the carrier channel (200
cells) from the same batch. It is expected that the carrier
intensities are much higher than the single-cell intensities.
The SCR can be computed using the `computeSCR` function from
`scp`. The function must be told which channels are the samples that
must be divided and which channel contains the carrier. This
information is provided in the sample annotations and is accessed using
the `colData`, under the `SampleType` field.
```{r show_annotation}
table(colData(scp)[, "SampleType"])
```
In this dataset, `SampleType` gives the type of sample that is present
in each TMT channel. The SCoPE2 protocole includes 5 types of samples:
* The carrier channels (`Carrier`) contain 200 cell equivalents and
are meant to boost the peptide identification rate.
* The normalization channels (`Reference`) contain 5 cell equivalents
and are used to partially correct for between-run variation.
* The unused channels (`Unused`) are channels that are left empty due
to isotopic cross-contamination by the carrier channel.
* The negative control channels (`Blank`) contain samples that do not
contain any cell but are processed as single-cell samples.
* The single-cell sample channels contain the single-cell samples of
interest, that are macrophage (`Macrophage`) or monocyte
(`Monocyte`).
The `computeSCR` function expects the following input:
* The `QFeatures` dataset
* The assay name(s) or index/indices for which the SCR should be
computed
* `colvar`: the variable in the sample annotations (`colData`) that
hold the information used to discriminate sample channels from
carrier channels.
* `carrierPattern`: a string pattern (following regular expression
syntax) that identifies the carrier channel in each batch.
* `samplePattern`: a string pattern (following regular expression
syntax) that identifies the samples to divide.
Optionally, you can also provide the following arguments:
* `rowDataName`: the name of the column in the `rowData` where to
store the computed SCR for each feature.
* `sampleFUN`: when multiple samples are present in an assay, there
are as many SCR as there are samples that need to be summarized to a
single value per feature. `sampleFUN` tells which function to use
for summarizing the sample values before computing the SCR; the
default is the `mean`.
* `carrierFUN`: some designs might include several carriers per run
(not the case in this example). Similarly to `sampleFUN`,
`carrierFUN` tells which function to use for summarizing the carrier
values before computing the SCR; the default is the same function as
`sampleFUN`.
The function creates a new field in the `rowData` of the assays. We
compute the average SCR for each PSM and store it in the corresponding
`rowData`, under the `MeanSCR` column.
```{r computeSCR}
scp <- computeSCR(scp,
i = 1:3,
colvar = "SampleType",
carrierPattern = "Carrier",
samplePattern = "Macrophage|Monocyte",
sampleFUN = "mean",
rowDataName = "MeanSCR")
```
Before applying the filter, we plot the distribution of the average
SCR. We collect the `rowData` from several assays in a single table
`DataFrame` using the `rbindRowData` function from `QFeatures`.
```{r plot_SCR, warning=FALSE, message=FALSE}
rbindRowData(scp, i = 1:3) |>
data.frame() |>
ggplot(aes(x = MeanSCR)) +
geom_histogram() +
geom_vline(xintercept = c(1/200, 0.1),
lty = c(2, 1)) +
scale_x_log10()
```
The expected ratio between single cells and the carrier is 1/200
(dashed line). We can see that the distribution mode is slightly
shifted towards higher ratios with a mode around 0.01. However, there
are a few PSMs that stand out of the distribution and have a much
higher signal than expected, indicating something wrong happened
during the quantification of those PSMs. We therefore filter out PSMs
with an average SCR higher than 0.1 (solide line). This is again
easily performed using the `filterFeatures` functions.
```{r filter_SCR}
scp <- filterFeatures(scp,
~ !is.na(MeanSCR) &
MeanSCR < 0.1)
```
## Filter features to control for FDR
Finally, we might also want to control for false discovery rate (FDR).
MaxQuant already computes posterior error probabilities (PEP), but
filtering on PEPs is too conservative (@Kall2008-hb) so we provide the
`pep2qvalue` function to convert PEPs to q-values that are directly
related to FDR. We here compute the q-values from the PEP (`dart_PEP`)
across all 3 assays. `dart_PEP` contains the PEP values that have been
updated using the DART-ID algorithm (@Chen2019-uc). The function will
store the results in the `rowData`, we here asked to name the new
column `qvalue_PSMs`.
```{r compute_qvalues}
scp <- pep2qvalue(scp,
i = 1:3,
PEP = "dart_PEP",
rowDataName = "qvalue_PSMs")
```
We also allow to compute q-values at peptide or protein level rather
than PSM. In this case, you need to supply the `groupBy` argument.
Suppose we want to compute the q-values at protein level, we can fetch
the protein information stored under `Leading.razor.protein` in the
`rowData`. This time, we store the q-values in a new field called
`qvalue_proteins`.
```{r compute_qvalues2}
scp <- pep2qvalue(scp,
i = 1:3,
PEP = "dart_PEP",
groupBy = "Leading.razor.protein",
rowDataName = "qvalue_proteins")
```
We can now filter the PSM to control, let's say, the protein FDR at
1\%. This can be performed using `filterFeatures` because the q-values
were stored in the `rowData`.
```{r filter_FDR}
scp <- filterFeatures(scp,
~ qvalue_proteins < 0.01)
```
# Process the PSM data
## Relative reporter ion intensity
In order to partialy correct for between-run variation, SCoPE2
suggests computing relative reporter ion intensities. This means that
intensities measured for single-cells are divided by the reference
channel containing 5-cell equivalents. We use the `divideByReference`
function that divides channels of interest by the reference
channel. Similarly to `computeSCR`, we can point to the samples and
the reference columns in each assay using the annotation contained in
the `colData`.
We here divide all columns (using the regular expression wildcard `.`)
by the reference channel (`Reference`).
```{r divideByReference}
scp <- divideByReference(scp,
i = 1:3,
colvar = "SampleType",
samplePattern = ".",
refPattern = "Reference")
```
# Aggregate PSM data to peptide data
Now that the PSM assays are processed, we can aggregate them to
peptides. This is performed using the `aggregateFeaturesOverAssays`
function. For each assay, the function aggregates several PSMs into a
unique peptide. This is best illustrated by the figure below.
```{r features_aggregation, results = 'markup', fig.cap = "Conceptual illustration of feature aggregation.", echo=FALSE, out.width='100%', fig.align='center'}
knitr::include_graphics("./figures/feature_aggregation.png", error = FALSE)
```
Remember there currently are three assays containing the PSM data.
```{r show_agg_psms}
scp
```
The PSMs are aggregated over the `fcol` feature variable, here the
modified peptide sequence. We also need to supply an aggregating
function that will tell how to combine the quantitative data of the
PSMs to aggregate. We here aggregate the PSM data using the median
value per sample thanks to the `matrixStats:colMedians()`
function. Other functions can be used and we refer to
`?aggregateFeatures` for more information about available aggregation
functions. The `aggregateFeaturesOverAssays()` function will create a
new assay for each aggregated assay. We name the aggregated assays
using the original names and appending `peptides_` at the start.
```{r aggregate_peptides, message = FALSE}
scp <- aggregateFeaturesOverAssays(scp,
i = 1:3,
fcol = "Modified.sequence",
name = paste0("peptides_", names(scp)),
fun = matrixStats::colMedians, na.rm = TRUE)
```
Notice that 3 new assays were created in the `scp` object. Those new
assays contain the aggregated features while the three first assays
are unchanged. This allows to keep track of the data processing.
```{r show_agg_peptides}
scp
```
Under the hood, the `QFeatures` architecture preserves the
relationship between the aggregated assays. See `?AssayLinks` for more
information on relationships between assays. This is illustrated on
the `QFeatures` plot:
```{r plot2}
plot(scp)
```
# Join the SCoPE2 sets in one assay
Up to now, we kept the data belonging to each MS run in separate
assays. We now combine all batches into a single assay. This is done
using the `joinAssays` function from the `QFeatures` package. Note
that we now use the aggregated assays, so assay 4 to 6.
```{r joinAssays}
scp <- joinAssays(scp,
i = 4:6,
name = "peptides")
```
In this case, one new assay is created in the `scp` object that
combines the data from assay 4 to 6. The samples are always distinct
so the number of column in the new assay (here $48$) will always
equals the sum of the columns in the assays to join (here $16 + 16 +
16$). The feature in the joined assay might contain less features than
the sum of the rows of the assays to join since common features
between assays are joined in a single row.
```{r plot3}
plot(scp)
```
# Filter single-cells
Another common step in single-cell data analysis pipelines is to
remove low-quality cells. After subsetting for the samples of interest,
we will use 2 metrics: the median relative intensities per cell and
the median coefficient of variation (CV) per cell.
## Filter samples of interest
We can subset the cells of interest, that is the negative control
samples, the macrophages and the monocytes. This
can easily be done by taking advantage of the `colData` and the
subsetting operators. Recall that `QFeatures` objects support
three-order subsetting, `x[rows, columns, assays]`, where columns are
the samples of interest.
```{r subset samples}
scp
scp <- scp[, scp$SampleType %in% c("Blank", "Macrophage", "Monocyte"), ]
```
The subsetting removes unwanted samples from all assays. The filtered
data set contains the same number of assays with the same number of
features, but the number of columns (hence sampled) decreased.
```{r show_cell_filter}
scp
```
## Filter based on the median relative intensity
We compute the median relative reporter ion intensity for each cell
separately and apply a filter based on this statistic. This procedure
recalls that of library size filtering commonly performed in scRNA-Seq
data analysis, where the library size is the sum of the counts in each
single cell. We compute and store the median intensity in the
`colData`.
```{r compute_colMedians}
medians <- colMedians(assay(scp[["peptides"]]), na.rm = TRUE)
scp$MedianRI <- medians
```
Looking at the distribution of the median per cell can highlight
low-quality cells.
```{r plot_medianRI, warning=FALSE, message=FALSE}
colData(scp) |>
data.frame() |>
ggplot() +
aes(x = MedianRI,
y = SampleType,
fill = SampleType) +
geom_boxplot() +
scale_x_log10()
```
The negative control samples should not contain any peptide
information and are therefore used to assess the amount of background
signal. The graph above confirms that the signal measured in
single-cells (macrophages and monocytes) is above the background
signal, hence no filtering is needed. Would it not be the case, the
same procedure as in the previous section can be used for selecting
the cells that have an associated median RI lower that a defined
threshold.
## Filter based on the median CV
The median CV measures the consistency of quantification for a group
of peptides that belong to a protein. We remove cells that exhibit
high median CV over the different proteins. We compute the median CV
per cell using the `computeMedianCV` function from the `scp`
package. The function takes the `peptides` assay and computes the CV
for each protein in each cell. To perform this, we must supply the
name of the `rowData` field that contains the protein information
through the `groupBy` argument. We also only want to compute CVs if we
have at least 5 peptides per protein. Finally, we also perform a
normalization and divide the columns by the median. The computed
median CVs are automatically stored in the `colData` under the name
that is supplied, here `MedianCV`.
```{r medianCVperCell}
scp <- medianCVperCell(scp,
i = 1:3,
groupBy = "Leading.razor.protein",
nobs = 5,
norm = "div.median",
na.rm = TRUE,
colDataName = "MedianCV")
```
The computed CVs are stored in the `colData` of the `peptides` assay
and holds the median CV per cell computed using at least 5
observations (peptides). The main interest of computing the median CV
per cell is to filter cells with reliable quantification. The negative
control samples are not expected to have reliable quantifications and
hence can be used to estimate an empirical null distribution of the
CV. This distribution helps defining a threshold that filters out
single-cells that contain noisy quantification.
```{r plot_medianCV, message = FALSE, warning = FALSE}
getWithColData(scp, "peptides") |>
colData() |>
data.frame() |>
ggplot(aes(x = MedianCV,
fill = SampleType)) +
geom_boxplot() +
geom_vline(xintercept = 0.65)
```
We can see that the protein quantification for single-cells are much
more consistent than for negative control samples. Based on the
distribution of the negative controls, we decide to keep the cells
that have a median CV lower than 0.65. Note this example is inaccurate
because the null distribution is based on only 3 negative controls,
but more sets could lead to a better estimation of the CV null
distribution.
```{r create_filter}
scp <- scp[, !is.na(scp$MedianCV) & scp$MedianCV < 0.65, ]
```
We can now remove the negative controls since all QC metrics are now
computed.
```{r remove_NC}
scp <- scp[, scp$SampleType != "Blank", ]
```
# Process the peptide data
In this vignette, the peptide data are further processed before
aggregation to proteins. The steps are: normalization, filter peptides
based on missing data and log-transformation.
## Normalization
The columns (samples) of the peptide data are first normalized by
dividing the relative intensities by the median relative
intensities. Then, the rows (peptides) are normalized by dividing the
relative intensities by the mean relative intensities. The normalized
data is stored in a separate assay. This normalization procedure is
suggested in the SCoPE2 analysis and is applied using the `sweep`
method. Beside the dataset and the assay to normalize, the method
expects a `MARGIN`, that is either row-wise (`1`) or column-wise (`2`)
transformation, the `FUN` function to apply and `STATS`, a vector of
values to apply. More conventional normalization procedure can be
found in `?QFeatures::normalize`.
```{r normalize_scale}
## Divide columns by median
scp <- sweep(scp,
i = "peptides",
MARGIN = 2,
FUN = "/",
STATS = colMedians(assay(scp[["peptides"]]), na.rm = TRUE),
name = "peptides_norm_col")
## Divide rows by mean
scp <- sweep(scp,
i = "peptides_norm_col",
MARGIN = 1,
FUN = "/",
STATS = rowMeans(assay(scp[["peptides_norm_col"]]), na.rm = TRUE),
name = "peptides_norm")
```
Notice each call to `sweep` created a new assay. Let's have a look to
the current stat of the `QFeatures` plot:
```{r plot4}
plot(scp)
```
## Remove peptides with high missing rate
Peptides that contain many missing values are not
informative. Therefore, another common procedure is to remove higly
missing data. In this example, we remove peptides with more than 99 \%
missing data. This is done using the `filterNA` function from
`QFeatures`.
```{r filterNA}
scp <- filterNA(scp,
i = "peptides_norm",
pNA = 0.99)
```
## Log-transformation
In this vignette, we perform log2-transformation using the
`logTransform` method from `QFeatures`. Other log-transformation can
be applied by changing the `base` argument.
```{r logTransform}
scp <- logTransform(scp,
base = 2,
i = "peptides_norm",
name = "peptides_log")
```
Similarly to `sweep`, `logTransform` creates a new assay in `scp`.
# Aggregate peptide data to protein data
Similarly to aggregating PSM data to peptide data, we can aggregate
peptide data to protein data using the `aggregateFeatures` function.
```{r aggregate_proteins}
scp <- aggregateFeatures(scp,
i = "peptides_log",
name = "proteins",
fcol = "Leading.razor.protein",
fun = matrixStats::colMedians, na.rm = TRUE)
```
The only difference between `aggregateFeatures` and
`aggregateFeaturesOverAssays` is that the second function can
aggregate several assay at once whereas the former only takes one
assay to aggregate. Hence, only a single assay, `proteins`, was
created in the `scp` object.
```{r show_agg_proteins}
scp
```
After the second aggregation, the `proteins` assay in this example
contains quantitative information for 89 proteins in 15 single-cells.
# Process the protein data
The protein data is further processed in three steps: normalization,
imputation (using the KNN algorithm) and batch correction (using the
`ComBat` algorithm).
## Normalization
Normalization is performed similarly to peptide normalization. We use
the same functions, but since the data were log-transformed at the
peptide level, we subtract by the statistic (median or mean) instead
of dividing.
```{r normalize_center}
## Center columns with median
scp <- sweep(scp, i = "proteins",
MARGIN = 2,
FUN = "-",
STATS = colMedians(assay(scp[["proteins"]]),
na.rm = TRUE),
name = "proteins_norm_col")
## Center rows with mean
scp <- sweep(scp, i = "proteins_norm_col",
MARGIN = 1,
FUN = "-",
STATS = rowMeans(assay(scp[["proteins_norm_col"]]),
na.rm = TRUE),
name = "proteins_norm")
```
## Imputation
The protein data contains a lot of missing values.
```{r missingness}
scp[["proteins_norm"]] |>
assay() |>
is.na() |>
mean()
```
The average missingness in the `proteins` assay is around 25
\%. Including more samples and hence more batches can increase the
missingness up to 70 \% as seen for the complete SCoPE2 dataset
(@Specht2021-jm). Whether imputation is beneficial or deleterious for
the data will not be discussed in this vignette. But taking those
missing value into account is essential to avoid artefacts in
downstream analyses. The data imputation is performed using the K
nearest neighbors algorithm, with k = 3. This is available from the
`impute()` mehtod. More details about the arguments can be found in
`?impute::impute.knn`.
```{r impute}
scp <- impute(scp,
i = "proteins_norm",
name = "proteins_imptd",
method = "knn",
k = 3, rowmax = 1, colmax= 1,
maxp = Inf, rng.seed = 1234)
```
Note that after imputation, no value are missing.
```{r missingness_imputed}
scp[["proteins_imptd"]] |>
assay() |>
is.na() |>
mean()
```
## Batch correction
A very important step for processing SCP data is to correct for batch
effects. Batch effects are caused by technical variation occurring
during different MS runs. Since only a small number of single-cells
can be acquired at once, batch effects are unavoidable.
The `ComBat()` function from the `sva` package can be used to perform
batch correction as it is performed in the SCoPE2 analysis. We do not
claim that `ComBat` is the best algorithm for batch correcting SCP
data and other batch correcting methods could be used using the same
procedure.
We first extract the assay to process.
```{r get_proteins}
sce <- getWithColData(scp, "proteins_imptd")
```
Next, we need to provide a design matrix and the batch annotation to
`Combat`. The design matrix allows to protect variables of interest,
in our case `SampleType`.
```{r prepare_batch_correction}
batch <- sce$runCol
model <- model.matrix(~ SampleType, data = colData(sce))
```
We then load and call `ComBat` and overwrite the data matrix. Recall
the data matrix can be accessed using the `assay` function.
```{r compute_batch_correction, results = 'hide', message = FALSE}
library(sva)
assay(sce) <- ComBat(dat = assay(sce),
batch = batch,
mod = model)
```
Finally, we add the batch corrected assay to the `QFeatures` object
and create the feature links.
```{r add_batch_correction}
scp <- addAssay(scp,
y = sce,
name = "proteins_batchC")
scp <- addAssayLinkOneToOne(scp,
from = "proteins_imptd",
to = "proteins_batchC")
```
For the last time, we plot the overview of the fully processed data
set:
```{r plot5}
plot(scp)
```
# Dimension reduction
Because each assay contains `SingelCellExperiment` objects, we can
easily apply methods developed in the scRNA-Seq field. A useful
package for dimension reduction on single-cell data is the `scater`.
```{r load_scater}
library(scater)
```
This package provides streamline functions to computes various
dimension reduction such as PCA, UMAP, t-SNE, NMF, MDS, ....
## PCA
PCA can be computed using the `runPCA` method. It returns a
`SingleCellExperiment` object for which the dimension reduction
results are stored in the `reducedDim` slot.
```{r run_PCA}
scp[["proteins_batchC"]] <- runPCA(scp[["proteins_batchC"]],
ncomponents = 5,
ntop = Inf,
scale = TRUE,
exprs_values = 1,
name = "PCA")
```
The computed PCA can be displayed using the `plotReducedDim` function. The
`dimred` arguments should give the name of the dimension reduction results to
plot, here we called it `PCA`. The samples are colored by type of sample.
```{r plot_PCA}
plotReducedDim(scp[["proteins_batchC"]],
dimred = "PCA",
colour_by = "SampleType",
point_alpha = 1)
```
This is a minimalistic example with only a few plotted cells, but the
original SCoPE2 dataset contained more than thousand cells.
## UMAP
Similarly to PCA, we can compute a UMAP using the `runUMAP`
method. Note however that the UMAP implementation requires a
initialization, usually provided by PCA. The previous PCA results are
used automatically when supplying `dimred = "PCA"` (`PCA` is the name
of the dimension reduction result that we supplied in the previous
section).
```{r run_UMAP}
scp[["proteins_batchC"]] <- runUMAP(scp[["proteins_batchC"]],
ncomponents = 2,
ntop = Inf,
scale = TRUE,
exprs_values = 1,
n_neighbors = 3,
dimred = "PCA",
name = "UMAP")
```
The computed UMAP can be displayed using the `plotReducedDim`
function. The `dimred` arguments gives the name of the dimension
reduction results to plot, here we called it `UMAP`. The samples are
colored by type of sample.
```{r plot_UMAP}
plotReducedDim(scp[["proteins_batchC"]],
dimred = "UMAP",
colour_by = "SampleType",
point_alpha = 1)
```
The UMAP plot is a very interesting plot for large datasets. A UMAP on
this small example dataset is not useful but is shown for
illustration.
# Monitoring data processing
`QFeatures` keeps the links between the different assays along the
processing of the data. This greatly facilitates the visualization of
the quantitative data for a features at the different processing
levels. For instance, suppose we are interested
in the protein *Plastin-2* (protein ID is `P13796`). A useful QC is to
monitor the data processing at the PSM, peptide and protein
level. This can easily be done thanks to the `QFeatures`
framework. Using the `subsetByFeature`, we can extract the protein of
interest and its related features in the other assays. The data is
formatted to a long format table that can easily be plugged in the
`ggplot2` visualization tool.
```{r monitor_plot, warning = FALSE, fig.width = 10, fig.height = 10, out.width = "100%"}
## Get the features related to Plastin-2 (P13796)
subsetByFeature(scp, "P13796") |>
## Format the `QFeatures` to a long format table
longFormat(colvars = c("runCol", "SampleType", "quantCols")) |>
data.frame() |>
## This is used to preserve ordering of the samples and assays in ggplot2
mutate(assay = factor(assay, levels = names(scp)),
Channel = sub("Reporter.intensity.", "TMT-", quantCols),
Channel = factor(Channel, levels = unique(Channel))) |>
## Start plotting
ggplot(aes(x = Channel, y = value, group = rowname, col = SampleType)) +
geom_point() +
## Plot every assay in a separate facet
facet_wrap(facets = vars(assay), scales = "free_y", ncol = 3) +
## Annotate plot
xlab("Channels") +
ylab("Intensity (arbitrary units)") +
## Improve plot aspect
theme(axis.text.x = element_text(angle = 90),