---
title: "Human dermal fibroblast clonality project"
author: "Davis J. McCarthy"
site: workflowr::wflow_site
output:
  workflowr::wflow_html:
    toc: false
---
## Project overview
This project investigates clonality in human dermal fibroblast cell populations
in 32 cell lines from distinct donors, using bulk whole-exome sequencing and
single-cell RNA-sequencing data.
**Key findings:**
* A novel approach for integrating DNA-seq and single-cell RNA-seq data to
reconstruct clonal substructure and single-cell transcriptomes.
* A new computational method, [cardelino](https://github.com/PMBio/cardelino), to map
single-cell RNA-seq profiles to clones.
* Evidence for non-neutral evolution of clonal populations in human fibroblasts.
* Proliferation and cell cycle pathways are commonly distorted in mutated clonal
populations, with implications for cancer and ageing.
For a richer overview, see the [About](about.html) page.
## Data pre-processing
Pre-processing the raw data described above is complicated and computationally
expensive, so this repository does not reproduce it in an automated way.
However, we provide the source code
for the [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow for
data pre-processing in this repository. Docker images providing the computing
environment and software used are publicly available, split into an image for
command line [bioinformatics tools](https://hub.docker.com/r/davismcc/fibroblast-clonality/)
and an [R installation](https://hub.docker.com/r/davismcc/r-singlecell-img/) with
necessary packages installed.
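
As a minimal sketch, the two images could be fetched as follows. The repository names come from the Docker Hub pages linked above; the `latest` tag and the `PULL` switch are assumptions made for illustration, so check the Docker Hub pages for the tags that are actually published.

```shell
# Image repositories named on the Docker Hub pages linked above.
# The ":latest" tag is an assumption; replace it with a published tag if needed.
IMAGES="davismcc/fibroblast-clonality:latest davismcc/r-singlecell-img:latest"

# By default only print the commands; set PULL=1 to actually run them.
for img in $IMAGES; do
  if [ "${PULL:-0}" = "1" ]; then
    docker pull "$img"
  else
    echo "docker pull $img"
  fi
done
```

Printing the commands first makes it easy to review what will be downloaded, since the images are likely to be large.
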
If you would like to pre-process the data from raw reads to results as we have,
please consult our description of [how to run](data_preprocessing.html) the
workflow.
## Analyses
Here we present the reproducible results of our analyses. They were
generated by rendering the
[R Markdown documents](https://github.com/davismcc/fibroblast-clonality/tree/master/analysis)
into webpages available at the links below.
The results presented in the paper were produced with these analyses.
1. [Simulation results.](simulations.html)
1. [Overview of lines.](overview_lines.html)
1. [Selection models.](selection_models.html)
1. [Analysis of clonal prevalences.](clone_prevalences_cardelino-relax.html)
1. [Analysis for the example cell line *joxm*.](analysis_for_joxm_cardelino-relax.html)
1. [Variance components analysis.](variance_components.html)
1. [Differential expression analysis.](differential_expression_cardelino-relax.html)
1. [Analysis of effects of somatic variants on cis gene expression.](mutated_genes.html)
## Data availability
This is a complicated project, and reproducing all of the results presented, especially from raw data, is highly non-trivial. Nevertheless, we have made all data available so that every result can be reproduced.
Single-cell RNA-seq data have been deposited in the
[ArrayExpress](https://www.ebi.ac.uk/arrayexpress) database at EMBL-EBI under accession
number [E-MTAB-7167](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-7167).
Whole-exome sequencing data are available through the
[HipSci portal](http://www.hipsci.org). Processed data and large results files are
available from [Zenodo](http://doi.org/10.5281/zenodo.1403510) with DOI 10.5281/zenodo.1403510.
To set up the project to reproduce our analyses, first clone the [source code repository](https://github.com/davismcc/fibroblast-clonality) from GitHub. Next, download all of the reference, metadata and results files and add them to the (cloned) project folder with the following structure:
```
.
├── data
│   ├── canopy
│   │   ├── canopy_results.*.rds
│   ├── cell_assignment
│   │   ├── cardelino_results.*.rds
│   ├── de_analysis_FTv62
│   │   ├── cellcycle_analyses
│   │   │   ├── filt_lenient.all_filt_sites.de_results_unstimulated_cells.cc.rds
│   │   │   ├── filt_lenient.cell_coverage_sites.de_results_unstimulated_cells.cc.rds
│   │   │   ├── filt_strict.all_filt_sites.de_results_unstimulated_cells.cc.rds
│   │   │   └── filt_strict.cell_coverage_sites.de_results_unstimulated_cells.cc.rds
│   │   ├── filt_lenient.all_filt_sites.de_results_unstimulated_cells.rds
│   │   ├── filt_lenient.cell_coverage_sites.de_results_unstimulated_cells.rds
│   │   ├── filt_strict.all_filt_sites.de_results_unstimulated_cells.rds
│   │   └── filt_strict.cell_coverage_sites.de_results_unstimulated_cells.rds
│   ├── donor_info_070818.txt
│   ├── donor_info_core.csv
│   ├── donor_neutrality.tsv
│   ├── exome-point-mutations
│   │   ├── high-vs-low-exomes.v62.ft.alldonors-filt_lenient.all_filt_sites.vep_most_severe_csq.txt
│   │   └── high-vs-low-exomes.v62.ft.filt_lenient-alldonors.txt.gz
│   ├── human_H_v5p2.rdata
│   ├── human_c2_v5p2.rdata
│   ├── human_c6_v5p2.rdata
│   ├── neg-bin-rsquared-petr.csv
│   ├── neutralitytestr-petr.tsv
│   ├── sces
│   │   ├── sce_*.rds
│   ├── selection
│   │   ├── neg-bin-params-fit.csv
│   │   ├── neg-bin-rsquared-fit.csv
│   ├── simulations
│   │   ├── *.filt_lenient.cell_coverage_sites.mult.rds
│   │   ├── *.simulate.rds
│   └── variance_components
│       ├── covar_all.csv
│       ├── donorVar
│       │   ├── *.var_part.var1.csv
│       ├── fit_all_gene_highVar.csv
│       ├── fit_per_gene_highVar.csv
│       ├── gene_info_all.csv
│       └── logcnt_all.csv
├── metadata
│   ├── cell_metadata.csv
│   └── data_processing_metadata.tsv
├── references
│   ├── 1000G_phase1.indels.hg19.sites.vcf.gz
│   ├── GRCh37.p13.genome.ERCC92.fa
│   ├── Homo_sapiens.GRCh37.rel75.cdna.all.ERCC92.fa.gz
│   ├── Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
│   ├── dbsnp_138.hg19.biallelicSNPs.HumanCoreExome12.Top1000ExpressedIpsGenes.Maf0.01.HWE0.0001.HipSci.vcf.gz
│   ├── dbsnp_138.hg19.vcf.gz
│   ├── gencode.v19.annotation_ERCC.gtf
│   ├── hipsci.wec.gtarray.HumanCoreExome.imputed_phased.20170327.genotypes.allchr.fibro_samples_v2_filt_vars_sorted_oa.vcf.gz
│   ├── hipsci.wec.gtarray.HumanCoreExome.imputed_phased.20170327.genotypes.allchr.fibro_samples_v2_filt_vars_sorted_oa.vcf.gz.csi
│   └── knownIndels.intervals
```
For simplicity, we ignore all the directories and files present in the source code repository (which you should have cloned) to focus just on where you should _add_ the files downloaded from Zenodo. Yes, it's still complicated, but such is life.
There is a large number of `canopy_results.*.rds` files: these should be stored in the `data/canopy` directory. Similarly, all of the `cardelino_results.*.rds` files should be stored in `data/cell_assignment`. All of the SingleCellExperiment object files (`sce_*.rds`) should be stored in `data/sces`. Simulation results files (`*.mult.rds`; `*.simulate.rds`) should be stored in `data/simulations`. Variance components results should be stored in `data/variance_components` as shown above.
Differential expression results belong in `data/de_analysis_FTv62`.
Metadata files belong in `metadata`. Reference files belong in `references`.
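
The layout described above could be set up with a short shell script along these lines. This is only a sketch under stated assumptions: `PROJECT_DIR`, `DOWNLOAD_DIR` and the `move_matching` helper are hypothetical names, and the script assumes the Zenodo files were unpacked into a single flat directory, which may not match how the archive is actually organised.

```shell
# Hypothetical locations; adjust both to your setup.
# Clone first, e.g.: git clone https://github.com/davismcc/fibroblast-clonality.git
PROJECT_DIR="${PROJECT_DIR:-fibroblast-clonality}"   # the cloned repository
DOWNLOAD_DIR="${DOWNLOAD_DIR:-zenodo_download}"      # unpacked Zenodo files (assumed flat)

# Create the directories from the tree above; mkdir -p leaves existing ones alone.
for d in data/canopy data/cell_assignment \
         data/de_analysis_FTv62/cellcycle_analyses \
         data/exome-point-mutations data/sces data/selection data/simulations \
         data/variance_components/donorVar metadata references; do
  mkdir -p "$PROJECT_DIR/$d"
done

# Move every file in DOWNLOAD_DIR matching a glob pattern into a project
# subdirectory. Patterns with no matches are skipped silently.
move_matching() {
  pattern=$1
  dest=$2
  for f in "$DOWNLOAD_DIR"/$pattern; do
    [ -e "$f" ] && mv "$f" "$PROJECT_DIR/$dest/"
  done
  return 0
}

move_matching 'canopy_results.*.rds'    data/canopy
move_matching 'cardelino_results.*.rds' data/cell_assignment
move_matching 'sce_*.rds'               data/sces
move_matching '*.mult.rds'              data/simulations
move_matching '*.simulate.rds'          data/simulations
```

Only the wildcard-matched result files are moved here; the remaining data, metadata and reference files still need to be copied by hand to the locations shown in the tree.
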
With the data downloaded and organised as above, you will be able to reproduce the analyses presented in the R Markdown documents linked above and, if desired, even run the whole analysis pipeline from raw reads to results following [these instructions](data_preprocessing.html).
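
Before rendering any analysis, it may help to check that a few of the fixed-name files are where the analyses expect them. The sketch below tests only a small sample of paths taken from the tree above; `check_layout` is a hypothetical helper, not part of the repository.

```shell
# Report which of a few expected files are absent from a project directory.
# The paths are a sample from the tree above, not the complete list.
check_layout() {
  project_dir=$1
  missing=0
  for f in data/donor_info_core.csv \
           data/donor_neutrality.tsv \
           metadata/cell_metadata.csv \
           metadata/data_processing_metadata.tsv \
           references/knownIndels.intervals; do
    if [ ! -e "$project_dir/$f" ]; then
      echo "missing: $f"
      missing=$((missing + 1))
    fi
  done
  echo "$missing expected file(s) missing"
}

# Check the current directory unless PROJECT_DIR points elsewhere.
check_layout "${PROJECT_DIR:-.}"
```

The function only reports and never aborts, so it can be run repeatedly while files are being put in place.
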
-------
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.