Permalink
Newer
Older
100644 689 lines (518 sloc) 26.5 KB
May 11, 2016 @jherrero More code cleanup
1 #!/usr/bin/env perl
Oct 23, 2014 @charlesbreeze eFORGE first commit
2
3 =head1 NAME
4
5 eforge.pl - Experimentally derived Functional element Overlap analysis of ReGions from EWAS.
6
7 =head1 SYNOPSIS
8
9 eforge.pl options (-f file) (-mvp mvplist)
10
11 =head1 DESCRIPTION
12
13 Analyse a set of MVPs for their overlap with DNase 1 hotspots compared to matched background MVPs.
14 Identifies enrichment in DHS by tissue and plots graphs and table to display. Arbitrarily a minumum of 5* MVPs is required.
15 Note that if no MVPs are given the script will run on A DEFAULT EWAS* as an example output.
16
17 Several outputs are made.
18
19 A straight base R graphics pdf chart of the data.
20
21 A polychart (https://github.com/Polychart/polychart2) interactive javascript graphic using rCharts (http://ramnathv.github.io/rCharts/).
22
23 A dimple (http://dimplejs.org) d3 interactive graphic using rCharts.
24
25 A table using the Datatables (https://datatables.net) plug-in for the jQuery Javascript library, again accessed through rCharts.
26
27 In each of the graphics the colouring should be consistent. Blue (p value > 0.05), light red or pink (0.05 => p value > 0.01), red or dark red (p value <= 0.01 ) for the 95% and 99% cIs.
28 Or whatever other thresholds are specified.
29
30 eForge functions, plotting options and stats are provided by eForge::eForge, eForge::ePlot and eForge::eStats modules.
31
32 =head1 OPTIONS
33
34 =over
35
May 10, 2016 @jherrero Clean up the code
36 =item B<--dataset TAG>
Oct 23, 2014 @charlesbreeze eFORGE first commit
37
May 10, 2016 @jherrero Clean up the code
38 Set of functional data to look for enrichment. Either ENCODE data ('encode'), unconsolidated Roadmap
39 Epigenome data ('erc'), consolidated Roadmap Epigenome data ('erc2'), or Blueprint data ('blueprint').
40 erc by default.
Oct 23, 2014 @charlesbreeze eFORGE first commit
41
May 10, 2016 @jherrero Clean up the code
42 Use --dataset ? to get a list of available datasets on your local install.
Sep 30, 2015 @jherrero New DB schema - both code and DB are data-agnostic
43
May 10, 2016 @jherrero Clean up the code
44 =item B<--array TAG>
Oct 23, 2014 @charlesbreeze eFORGE first commit
45
May 10, 2016 @jherrero Clean up the code
46 Array (FKA background) is set at default to 450k array ('450k'), the Illumina Infinium HumanMethylation450 BeadChip.
Oct 23, 2014 @charlesbreeze eFORGE first commit
47
May 10, 2016 @jherrero Clean up the code
48 For the time being, it is suficient for MVPs to be on the 450k array. Probes within 1kb of each other
49 will undergo filtering.
Oct 23, 2014 @charlesbreeze eFORGE first commit
50
May 10, 2016 @jherrero Clean up the code
51 Use --array ? to get a list of available backgrounds on your local install.
Oct 23, 2014 @charlesbreeze eFORGE first commit
52
May 10, 2016 @jherrero Clean up the code
53 =item B<--label STRING>
Oct 23, 2014 @charlesbreeze eFORGE first commit
54
55 Supply a label that you want to use for the plotting titles, and filenames.
56
May 10, 2016 @jherrero Clean up the code
57 =item B<--f FILENAME>
Oct 23, 2014 @charlesbreeze eFORGE first commit
58
59 Supply the name of a file containing a list of MVPs.
60 Format must be given by the -format flag.
May 10, 2016 @jherrero Clean up the code
61 If not supplied the analysis is performed either on mvps provided as probeids (cg or ch probes) in a
62 comma separated list through the mvps option or on a set of data from a default ewas study, namely a
63 set of monocyte tDMPs from Jaffe AE and Irizarry RA, Genome Biol 2014.
64
65 Note that at least 5 MVPs are required at a minimum by default.
Oct 23, 2014 @charlesbreeze eFORGE first commit
66
May 10, 2016 @jherrero Clean up the code
67 =item B<--mvps probe_id,probe_id...>
Oct 23, 2014 @charlesbreeze eFORGE first commit
68
69 Can provide the mvps as probeids in a comma separated list.
70
May 10, 2016 @jherrero Clean up the code
71 =item B<--min_mvps INT>
Oct 23, 2014 @charlesbreeze eFORGE first commit
72
73 Specify the minimum number of MVPs to be allowed. Default is 5 now we are using binomial test.
74
May 10, 2016 @jherrero Clean up the code
75 =item B<--thresh FLOAT,FLOAT>
Oct 23, 2014 @charlesbreeze eFORGE first commit
76
77 Alter the default binomial p value thresholds. Give a comma separate list of three e.g. 0.05,0.01 for the defaults
78
May 10, 2016 @jherrero Clean up the code
79 =item B<--format STRING>
Oct 23, 2014 @charlesbreeze eFORGE first commit
80
81 If f is specified, specify the file format as follow:
82
May 10, 2016 @jherrero Clean up the code
83 probeid = list of mvps as probeids each on a separate line. Optionally can add other fields after the probeid which are ignored,
Oct 23, 2014 @charlesbreeze eFORGE first commit
84 unless the pvalue filter is specified, in which case eForge assumes that the second field is the minus log10 pvalue
85
May 10, 2016 @jherrero Clean up the code
86 bed = File given is a bed file of locations (chr\tbeg\tend). bed format should be 0 based and the chromosome should be given as chrN.
Oct 23, 2014 @charlesbreeze eFORGE first commit
87 However we will also accept chomosomes as just N (ensembl) and 1-based format where beg and end are the same*.
88
89 tabix = File contains MVPs in tabix format.
90
May 10, 2016 @jherrero Clean up the code
91 =item B<--filter FLOAT>
Oct 23, 2014 @charlesbreeze eFORGE first commit
92
May 10, 2016 @jherrero Clean up the code
93 Set a filter on the MVPs based on the -log10 pvalue. This works for files in the probeid' format.
94 Give a value as the lower threshold and only MVPs with -log10 pvalues >= to the threshold will be
95 analysed. Default is no filtering.
Oct 23, 2014 @charlesbreeze eFORGE first commit
96
May 11, 2016 @jherrero Rename bkgdstats and related variables for cleaner code
97 =item B<--save_stats>
Oct 23, 2014 @charlesbreeze eFORGE first commit
98
May 11, 2016 @jherrero Rename bkgdstats and related variables for cleaner code
99 Output annotation stats for the original and the random picks.
Oct 23, 2014 @charlesbreeze eFORGE first commit
100
May 10, 2016 @jherrero Clean up the code
101 =item B<--reps INT>
102
103 The number of background matching sets to pick and analyse. Default 1000.
Oct 23, 2014 @charlesbreeze eFORGE first commit
104
May 10, 2016 @jherrero Clean up the code
105 =item B<--proxy TAG>
Oct 23, 2014 @charlesbreeze eFORGE first commit
106
May 10, 2016 @jherrero Clean up the code
107 Apply filter for MVPs in proximity (within 1 kb of another test MVP). With proximity filter specified,
108 eForge will report MVPs removed due to proximity with another MVP in the list and will randomly pick
109 one of the probes among the set of probes that are in proximity (within 1 kb of each other).
Oct 23, 2014 @charlesbreeze eFORGE first commit
110
May 11, 2016 @jherrero Rename bkgdstats and related variables for cleaner code
111 At the moment, this is a dummy flag as only one proximity filter is available for each array. It
112 will become useful if the database and code support more than one. At the moment to turn off
113 proximity filtering, simply specify -noproxy
Oct 23, 2014 @charlesbreeze eFORGE first commit
114
May 10, 2016 @jherrero Clean up the code
115 =item B<--noproxy>
Oct 23, 2014 @charlesbreeze eFORGE first commit
116
117 Turn off proximity filtering.
118
May 10, 2016 @jherrero Clean up the code
119 =item B<--depletion>
Oct 23, 2014 @charlesbreeze eFORGE first commit
120
May 10, 2016 @jherrero Clean up the code
121 Analyse for depletion pattern instead of the default enrichment analysis. Use when dealing with
122 datasets suspected not to overlap with DHS (or the relevant functional assay). Specifying depletion
123 will be indicated on the label (the text "Depletion Analysis" will be added to the file label).
Oct 23, 2014 @charlesbreeze eFORGE first commit
124
May 10, 2016 @jherrero Clean up the code
125 =item B<--noplot>
Oct 23, 2014 @charlesbreeze eFORGE first commit
126
127 Just make the data file, don't plot.
128
May 10, 2016 @jherrero Clean up the code
129 =item B<--help|-h|-?>
Oct 23, 2014 @charlesbreeze eFORGE first commit
130
131 Print a brief help message and exits.
132
May 10, 2016 @jherrero Clean up the code
133 =item B<--man|-m>
Oct 23, 2014 @charlesbreeze eFORGE first commit
134
135 Print this perldoc and exit.
136
137 =back
138
Sep 30, 2015 @jherrero Copyright and authorship
139 =head1 LICENCE AND COPYRIGHT
Oct 23, 2014 @charlesbreeze eFORGE first commit
140
141 eforge.pl Functional analysis of EWAS MVPs
142
Sep 30, 2015 @jherrero Copyright and authorship
143 Copyright (C) [2014-2015] EMBL - European Bioinformatics Institute and University College London
Oct 23, 2014 @charlesbreeze eFORGE first commit
144
Sep 30, 2015 @jherrero Copyright and authorship
145 This program is free software; you can redistribute it and/or modify
146 it under the terms of the GNU General Public License as published by
147 the Free Software Foundation; version 2 dated June, 1991 or at your option
148 any later version.
Oct 23, 2014 @charlesbreeze eFORGE first commit
149
Sep 30, 2015 @jherrero Copyright and authorship
150 This program is distributed in the hope that it will be useful,
151 but WITHOUT ANY WARRANTY; without even the implied warranty of
152 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
153 GNU General Public License for more details.
Oct 23, 2014 @charlesbreeze eFORGE first commit
154
Sep 30, 2015 @jherrero Copyright and authorship
155 A copy of the GNU General Public License is available in the source tree;
156 if not, write to the Free Software Foundation, Inc.,
157 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
Oct 23, 2014 @charlesbreeze eFORGE first commit
158
159 =head1 CONTACT
160
Sep 30, 2015 @jherrero New DB schema - both code and DB are data-agnostic
161 Charles Breeze, C<< <c.breeze at ucl.ac.uk> >>
162
163 Javier Herrero, C<< <javier.herrero at ucl.ac.uk> >>
164
165 =head1 ACKNOWLEDGEMENTS
166
167 This software is based on the FORGE tool developed by Ian Dunham at the EMBL-EBI
Oct 23, 2014 @charlesbreeze eFORGE first commit
168
Sep 30, 2015 @jherrero Copyright and authorship
169 Javier Herrero <javier.herrero@ucl.ac.uk>
170
Oct 23, 2014 @charlesbreeze eFORGE first commit
171 =cut
172
173 use strict;
174 use 5.010;
175 use warnings;
176 use DBI; #database link to sqlite database
177 use Sort::Naturally;
178 use Cwd;
179 use Getopt::Long; #check this module
180 use File::Basename;
181 use Config::IniFiles;
182 use Pod::Usage;
183 use Scalar::Util qw(looks_like_number);
184 use eForge::eStats;
185 use eForge::ePlot;
186 use eForge::eForge;
Oct 23, 2014 @jherrero Added --out_dir option and switch to Data::UUID for default dir name
187 use Data::UUID;
Mar 10, 2015 @jherrero Adding B-Y FDR corrected q-values to output file
188 use Statistics::Multtest qw(BY);
Oct 23, 2014 @charlesbreeze eFORGE first commit
189
190
191 my $cwd = getcwd;
192
Apr 8, 2016 @jherrero More changes for v1.2
193 my $dbname = "eforge_1.2.db";
Jun 23, 2015 @jherrero Update webcode to use eforge_1.1.db
194
May 10, 2016 @jherrero Clean up the code
195 my $array; # Default value
196 my $array_label;
197 my $format = 'probeid'; # Input format
198 my $label = 'Unnamed'; # Label for plots
199 my $reps = 1000;
200 # set binomial p values, multiple test correction is used
201 my $thresh; # string for command line option
202 my $t_marginal = 0.05; # default marginal p-value threshold
203 my $t_strict = 0.01; # default strict p-value threshold
204
May 11, 2016 @jherrero More code cleanup
205 my $min_num_probes = 5; # the minimum number of probes allowed for test. Set to 5 as we have binomial p
May 10, 2016 @jherrero Clean up the code
206
May 11, 2016 @jherrero Rename bkgdstats and related variables for cleaner code
207 my ($dataset, $filename, $save_probe_annotation_stats, $noplot,
May 11, 2016 @jherrero More code cleanup
208 $help, $man, $proxy, $noproxy, $depletion, $filter, $out_dir, $probe_list,
Mar 12, 2015 @jherrero Add --autoopen options
209 $web, $autoopen);
Oct 23, 2014 @charlesbreeze eFORGE first commit
210
211 GetOptions (
May 10, 2016 @jherrero Clean up the code
212 'dataset=s' => \$dataset,
May 11, 2016 @jherrero Rename bkgdstats and related variables for cleaner code
213 'save_stats|bkgrd' => \$save_probe_annotation_stats,
May 10, 2016 @jherrero Clean up the code
214 'array|bkgd=s' => \$array,
Oct 23, 2014 @charlesbreeze eFORGE first commit
215 'label=s' => \$label,
May 10, 2016 @jherrero Clean up the code
216 'f=s' => \$filename,
Oct 23, 2014 @charlesbreeze eFORGE first commit
217 'format=s' => \$format,
May 11, 2016 @jherrero More code cleanup
218 'probes|mvps=s@' => \$probe_list,
219 'min_num_probes|min_mvps=i' => \$min_num_probes,
Oct 23, 2014 @charlesbreeze eFORGE first commit
220 'noplot' => \$noplot,
221 'reps=i' => \$reps,
222 'thresh=s' => \$thresh,
Oct 24, 2014 @jherrero Option proxy is now a string.
223 'proxy=s' => \$proxy,
Oct 23, 2014 @charlesbreeze eFORGE first commit
224 'noproxy' => \$noproxy,
225 'depletion' => \$depletion,
226 'filter=f' => \$filter,
Oct 23, 2014 @jherrero Added --out_dir option and switch to Data::UUID for default dir name
227 'out_dir=s' => \$out_dir,
May 10, 2016 @jherrero Clean up the code
228 'web=s' => \$web,
Mar 12, 2015 @jherrero Add --autoopen options
229 'autoopen' => \$autoopen,
Oct 23, 2014 @charlesbreeze eFORGE first commit
230 'help|h|?' => \$help,
231 'man|m' => \$man,
232
233 );
234
235
236 pod2usage(1) if ($help);
237 pod2usage(-verbose => 2) if ($man);
238
Oct 23, 2014 @jherrero Added --out_dir option and switch to Data::UUID for default dir name
239 if (!$out_dir) {
240 my $ug = new Data::UUID;
241 $out_dir = $ug->to_hexstring($ug->create());
242 }
May 11, 2016 @jherrero Rename bkgdstats and related variables for cleaner code
243 mkdir $out_dir;
244
Oct 23, 2014 @jherrero Added --out_dir option and switch to Data::UUID for default dir name
245
May 10, 2016 @jherrero Clean up the code
246 # Define the thresholds to use.
247 if ($thresh) {
248 ($t_marginal, $t_strict) = parse_pvalue_thresholds($thresh);
Mar 11, 2015 @jherrero Prettify the code, remove some legacy bits
249 }
Oct 23, 2014 @charlesbreeze eFORGE first commit
250
May 10, 2016 @jherrero Clean up the code
251
Sep 30, 2015 @jherrero New DB schema - both code and DB are data-agnostic
252 ## ============================================================================
253 ## Connect to the DB
254 ## ============================================================================
255 # This reads the config file and sets up the $datadir variable
256 my $dirname = dirname(__FILE__);
257 my $cfg = Config::IniFiles->new( -file => "$dirname/eforge.ini" );
258 my $datadir = $cfg->val('Files', 'datadir');
259
260 unless (-s "$datadir/$dbname") {
261 die "Database $dbname not found or empty";
262 }
263 my $dsn = "dbi:SQLite:dbname=$datadir/$dbname";
264 my $dbh = DBI->connect($dsn, "", "") or die $DBI::errstr;
265 ## ============================================================================
266
267
268 ## ============================================================================
269 ## Check the dataset against the info on the DB
270 ## ============================================================================
271 my $all_datasets = get_all_datasets($dbh);
272 if (!defined($all_datasets)) {
273 die "Empty database: no dataset loaded!\n";
May 10, 2016 @jherrero Clean up the code
274 } elsif (!defined($dataset)) {
275 $dataset = $all_datasets->[0]->{tag};
276 print "Using default dataset: [$dataset] ".$all_datasets->[0]->{name}."\n";
277 } elsif ($dataset eq "?") {
Sep 30, 2015 @jherrero New DB schema - both code and DB are data-agnostic
278 print "Available datasets:\n - [".join("\n - [", map {$_->{tag}."] ".$_->{name}} @$all_datasets)."\n";
279 exit();
May 10, 2016 @jherrero Clean up the code
280 } elsif (!grep {$_ eq $dataset} map {$_->{tag}} @$all_datasets) {
281 die "Dataset $dataset unknown\nAvailable datasets:\n - [".join("\n - [", map {$_->{tag}."] ".$_->{name}} @$all_datasets)."\n";
Jan 8, 2015 @charlesbreeze 27k hash addition
282 }
Sep 30, 2015 @jherrero New DB schema - both code and DB are data-agnostic
283 ## ============================================================================
284
285
286 ## ============================================================================
287 ## Check the array name (A.K.A. background) against DB
288 ## ============================================================================
289 my $all_arrays = get_all_arrays($dbh);
290 if (!defined($all_arrays)) {
291 die "Empty database: no background loaded!\n";
May 10, 2016 @jherrero Clean up the code
292 } elsif (!defined($array)) {
293 $array = $all_arrays->[0]->{tag};
294 print "Using default background: [$array] ".$all_arrays->[0]->{name}."\n";
295 $array_label = $all_arrays->[0]->{name};
296 } elsif ($array eq "?") {
Sep 30, 2015 @jherrero New DB schema - both code and DB are data-agnostic
297 print "Available arrays:\n - [".join("\n - [", map {$_->{tag}."] ".$_->{name}} @$all_arrays)."\n";
298 exit();
May 10, 2016 @jherrero Clean up the code
299 } elsif (!grep {$_ eq $array} map {$_->{tag}} @$all_arrays) {
300 die "Array $array unknown\nAvailable arrays:\n - [".join("\n - [", map {$_->{tag}."] ".$_->{name}} @$all_arrays)."\n";
Sep 30, 2015 @jherrero New DB schema - both code and DB are data-agnostic
301 } else {
302 foreach my $this_array (@$all_arrays) {
May 10, 2016 @jherrero Clean up the code
303 if ($this_array->{tag} eq $array) {
304 $array_label = $this_array->{name};
Sep 30, 2015 @jherrero New DB schema - both code and DB are data-agnostic
305 last;
306 }
307 }
308 }
309 ## ============================================================================
310
311
312 ## ============================================================================
313 ## Check the proxy_filter (A.K.A. filter) against DB
314 ## ============================================================================
315 # Set proximity filter
May 11, 2016 @jherrero More code cleanup
316 if (defined $noproxy) {
317 $proxy = undef;
318 } else {
319 my $all_proxy_filters = get_all_proximity_filters($dbh);
May 10, 2016 @jherrero Clean up the code
320 if ($all_proxy_filters->{$array}) {
321 $proxy = $all_proxy_filters->{$array};
Sep 30, 2015 @jherrero New DB schema - both code and DB are data-agnostic
322 }
323 }
324 ## ============================================================================
325
Oct 23, 2014 @charlesbreeze eFORGE first commit
326
May 11, 2016 @jherrero More code cleanup
327 ## ============================================================================
328 ## Append main options (depletion on/off; array; dataset) to $label
329 ## ============================================================================
Mar 11, 2015 @jherrero Prettify the code, remove some legacy bits
330 if (defined $depletion) {
Jan 9, 2015 @jherrero Clean up the code after merge
331 $label = "$label.depletion";
Mar 11, 2015 @jherrero Prettify the code, remove some legacy bits
332 }
May 11, 2016 @jherrero More code cleanup
333 (my $lab = $label) =~ s/\s/_/g; # Avoid whitespaces on the label
May 10, 2016 @jherrero Clean up the code
334 $lab = "$lab.$array.$dataset";
May 11, 2016 @jherrero More code cleanup
335 ## ============================================================================
Oct 23, 2014 @charlesbreeze eFORGE first commit
336
337
May 10, 2016 @jherrero Clean up the code
338 ## ============================================================================
339 ## Read and process the input MVPs
340 ## ============================================================================
Oct 23, 2014 @jherrero Add warning messages for tracking progress on the web interface
341 warn "[".scalar(localtime())."] Processing input...\n";
May 11, 2016 @jherrero More code cleanup
342 # This will read the probes from the file if provided, from the probe list otherwise or use the
343 # example data set as a last resort.
344 my $mvps = get_input_probes($filename, $probe_list);
345 my $original_mvps = [@$mvps];
346 my $num_of_input_mvps = scalar(@$mvps);
347
348 # Apply the proximity filter if requested
349 my ($proximity_excluded);
350 if(defined $proxy) {
351 ($proximity_excluded, $mvps) = proximity_filter($dbh, $array, $mvps);
352 while (my ($excluded_mvp, $mvp) = each %$proximity_excluded) {
353 warn "$excluded_mvp excluded for $proxy proximity filter with $mvp\n";
354 }
355 }
Mar 11, 2015 @jherrero Prettify the code, remove some legacy bits
356
May 11, 2016 @jherrero More code cleanup
357 # $annotated_probes is an arrayref with probe_id, sum, bit, gene_group, cgi_group for each input probe
358 my $annotated_probes = get_probe_annotations_and_overlap_for_dataset($dbh, $dataset, $array, $mvps);
359 my $existing_probes = {map {$_->[0] => 1} @$annotated_probes};
360 $mvps = [keys %$existing_probes];
Oct 23, 2014 @charlesbreeze eFORGE first commit
361
May 11, 2016 @jherrero More code cleanup
362 ## Detect and remove the missing probes.
May 11, 2016 @jherrero Fix buglet: don't count excluded probes as missing
363 my $num_missing_probes = find_missing_probes($original_mvps, $existing_probes, $proximity_excluded);
Oct 23, 2014 @jherrero Move warning message from module to main script
364
May 11, 2016 @jherrero More code cleanup
365 # Print summary of filtering and checks:
366 my $msg = "For $label, $num_of_input_mvps MVPs provided, ". scalar @$mvps.
367 " retained: $num_missing_probes were not found";
368 if (defined $proxy) {
369 $msg .= " and " . scalar(keys %$proximity_excluded) . " excluded using $proxy proximity filter";
Oct 23, 2014 @charlesbreeze eFORGE first commit
370 }
May 11, 2016 @jherrero More code cleanup
371 warn $msg, ".\n";
Oct 23, 2014 @charlesbreeze eFORGE first commit
372
May 11, 2016 @jherrero More code cleanup
373 # Check we have enough MVPs left
374 my $num_of_valid_probes = scalar @$mvps;
375 if ($num_of_valid_probes < $min_num_probes) {
376 die "Fewer than $min_num_probes MVPs. Analysis not run\n";
Mar 11, 2015 @jherrero Prettify the code, remove some legacy bits
377 }
May 10, 2016 @jherrero Clean up the code
378 ## ============================================================================
Oct 23, 2014 @charlesbreeze eFORGE first commit
379
380
381 # get the cell list array and the hash that connects the cells and tissues
May 11, 2016 @jherrero More code cleanup
382 # $samples is a hash whose keys are the $cells (short name for the cell type/lines) and value is
383 # another hash with 'tissue', 'datatype', 'file' and 'acc' keys.
384 # IMPORTANT: $cells contains the list of cells in the order defined in the DB. This is critical
385 # to correctly assign each bit to the right sample.
386 my ($cells, $samples) = get_samples_from_dataset($dbh, $dataset);
Oct 23, 2014 @charlesbreeze eFORGE first commit
387
388 # unpack the bitstrings and store the overlaps by cell.
May 11, 2016 @jherrero Even more code cleanup
389 # $overlaps is a complex hash like:
390 # $overlaps->{'MVPS'}->{$probe_id}->{'SUM'} (total number of overlaps of this probe with features in this dataset)
391 # $overlaps->{'MVPS'}->{$probe_id}->{'PARAMS'} (gene and CGI annotations for this probe)
392 # $overlaps->{'CELLS'}->{$cell}->{'COUNT'} (number of input MVPs that overlap with the signal on this sample)
393 # $overlaps->{'CELLS'}->{$cell}->{'MVPS'} (list of input MVPs that overlap with the signal on this sample)
394 my $overlaps = process_overlaps($annotated_probes, $cells, $dataset);
Oct 23, 2014 @charlesbreeze eFORGE first commit
395
396 # generate stats on the background selection
May 11, 2016 @jherrero Rename bkgdstats and related variables for cleaner code
397 if (defined $save_probe_annotation_stats) {
398 save_probe_annotation_stats($overlaps, $out_dir, $lab, "test");
Mar 11, 2015 @jherrero Prettify the code, remove some legacy bits
399 }
Oct 23, 2014 @charlesbreeze eFORGE first commit
400
401
402
403 # only pick background mvps matching mvps that had bitstrings originally.
404 #reference to hash key 'MVPS' is due to use of eforge.pm module from eForge tool
May 11, 2016 @jherrero More code cleanup
405 #(in subroutines process_overlaps, etc)
Oct 23, 2014 @charlesbreeze eFORGE first commit
406
407
May 11, 2016 @jherrero Even more code cleanup
408 # Identify the feature and cpg island relationship, and then make random picks
May 10, 2016 @jherrero Clean up the code
409 warn "[".scalar(localtime())."] Loading the $array background...\n";
May 11, 2016 @jherrero Even more code cleanup
410 my $random_picks = get_random_matching_picks($overlaps, $array, $datadir, $reps);
Oct 23, 2014 @charlesbreeze eFORGE first commit
411
412 ########check below lines:
413
414 # for bkgrd set need to get distribution of counts instead
415 # make a hash of data -> cell -> bkgrd-Set -> overlap counts
May 11, 2016 @jherrero Rename bkgdstats and related variables for cleaner code
416 my %overlaps_per_cell; #this hash is going to store the overlaps for the random picks, per cell
Oct 23, 2014 @charlesbreeze eFORGE first commit
417
418 # Get the bits for the background sets and process
May 11, 2016 @jherrero Even more code cleanup
419 my $total_num_probes_in_random_picks;
Oct 23, 2014 @charlesbreeze eFORGE first commit
420
May 11, 2016 @jherrero More code cleanup
421 warn "[".scalar(localtime())."] Running the analysis with $num_of_valid_probes MVPs...\n";
May 11, 2016 @jherrero Rename bkgdstats and related variables for cleaner code
422 my $count = 0;
May 11, 2016 @jherrero Even more code cleanup
423 foreach my $this_random_pick (@$random_picks) {
May 11, 2016 @jherrero Rename bkgdstats and related variables for cleaner code
424 warn "[".scalar(localtime())."] Repetition $count out of ".$reps."\n" if (++$count%100 == 0);
May 11, 2016 @jherrero Even more code cleanup
425 $annotated_probes = get_probe_annotations_and_overlap_for_dataset($dbh, $dataset, $array, $this_random_pick);
426
427 $total_num_probes_in_random_picks += scalar @$annotated_probes;
428
May 11, 2016 @jherrero More code cleanup
429 unless (scalar @$annotated_probes == $num_of_valid_probes) {
May 11, 2016 @jherrero Rename bkgdstats and related variables for cleaner code
430 warn "Random pick #$count only has " . scalar @$annotated_probes . " probes compared to $num_of_valid_probes in the input set.\n";
Mar 11, 2015 @jherrero Prettify the code, remove some legacy bits
431 }
May 11, 2016 @jherrero Even more code cleanup
432
433 my $this_pick_overlaps = process_overlaps($annotated_probes, $cells, $dataset);
434
435 # accumulate the overlap counts by cell
436 foreach my $cell (keys %{$this_pick_overlaps->{'CELLS'}}) {
437 push @{$overlaps_per_cell{$cell}}, $this_pick_overlaps->{'CELLS'}->{$cell}->{'COUNT'};
Mar 11, 2015 @jherrero Prettify the code, remove some legacy bits
438 }
May 11, 2016 @jherrero Even more code cleanup
439
May 11, 2016 @jherrero Rename bkgdstats and related variables for cleaner code
440 if (defined $save_probe_annotation_stats) {
441 save_probe_annotation_stats($this_pick_overlaps, $out_dir, $lab, $count);
Mar 11, 2015 @jherrero Prettify the code, remove some legacy bits
442 }
443 }
Oct 23, 2014 @charlesbreeze eFORGE first commit
444
445 $dbh->disconnect();
Oct 23, 2014 @jherrero Add warning messages for tracking progress on the web interface
446 warn "[".scalar(localtime())."] All repetitions done.\n";
Oct 23, 2014 @charlesbreeze eFORGE first commit
447
Oct 23, 2014 @jherrero Add warning messages for tracking progress on the web interface
448 warn "[".scalar(localtime())."] Calculating p-values...\n";
Oct 23, 2014 @charlesbreeze eFORGE first commit
449 #Having got the test overlaps and the bkgd overlaps now calculate p values and output
450 #the table to be read into R for plotting.
451
Oct 23, 2014 @jherrero Added --out_dir option and switch to Data::UUID for default dir name
452
Apr 18, 2016 @jherrero Use compressed files to save space on the server
453 if (!$web) {
454 open(BACKGROUND, "| gzip -9 > $out_dir/background.tsv.gz") or die "Cannot open background.tsv";
455 }
Oct 23, 2014 @charlesbreeze eFORGE first commit
456
457
458
Mar 10, 2015 @jherrero Adding B-Y FDR corrected q-values to output file
459 my @results;
460 my @pvalues;
Oct 23, 2014 @charlesbreeze eFORGE first commit
461 ###ncmp is a function from Sort::Naturally
May 11, 2016 @jherrero More code cleanup
462 foreach my $cell (sort {ncmp($$samples{$a}{'tissue'},$$samples{$b}{'tissue'}) || ncmp($a,$b)} @$cells){
463 # above line sorts by the tissues alphabetically (from $samples hash values)
Oct 23, 2014 @charlesbreeze eFORGE first commit
464
Oct 23, 2014 @jherrero Fix indentation
465 # ultimately want a data frame of names(results)<-c("Zscore", "Cell", "Tissue", "File", "MVPs")
Apr 18, 2016 @jherrero Use compressed files to save space on the server
466 if (!$web) {
May 11, 2016 @jherrero Even more code cleanup
467 print BACKGROUND join("\t", @{$overlaps_per_cell{$cell}}), "\n";
Apr 18, 2016 @jherrero Use compressed files to save space on the server
468 }
May 11, 2016 @jherrero Even more code cleanup
469 my $teststat = ($overlaps->{'CELLS'}->{$cell}->{'COUNT'} or 0); #number of overlaps for the test MVPs
Oct 23, 2014 @charlesbreeze eFORGE first commit
470
471 # binomial pvalue, probability of success is derived from the background overlaps over the tests for this cell
472 # $backmvps is the total number of background mvps analysed
473 # $tests is the number of overlaps found over all the background tests
May 11, 2016 @jherrero Even more code cleanup
474 my $total_num_overlaps_in_random_picks;
475 foreach (@{$overlaps_per_cell{$cell}}) {
476 $total_num_overlaps_in_random_picks += $_;
Mar 11, 2015 @jherrero Prettify the code, remove some legacy bits
477 }
May 11, 2016 @jherrero Even more code cleanup
478 my $p = sprintf("%.6f", $total_num_overlaps_in_random_picks / $total_num_probes_in_random_picks);
Oct 23, 2014 @charlesbreeze eFORGE first commit
479
480 # binomial probability for $teststat or more hits out of $mvpcount mvps
481 # sum the binomial for each k out of n above $teststat
482 my $pbinom;
Mar 11, 2015 @jherrero Prettify the code, remove some legacy bits
483 if (defined $depletion) {
484 foreach my $k (0 .. $teststat) {
May 11, 2016 @jherrero More code cleanup
485 $pbinom += binomial($k, $num_of_valid_probes, $p);
Mar 11, 2015 @jherrero Prettify the code, remove some legacy bits
486 }
487 } else {
May 11, 2016 @jherrero More code cleanup
488 foreach my $k ($teststat .. $num_of_valid_probes) {
489 $pbinom += binomial($k, $num_of_valid_probes, $p);
Mar 11, 2015 @jherrero Prettify the code, remove some legacy bits
490 }
Oct 23, 2014 @charlesbreeze eFORGE first commit
491 }
492 if ($pbinom >1) {
Mar 11, 2015 @jherrero Prettify the code, remove some legacy bits
493 $pbinom=1;
494 }
Mar 10, 2015 @jherrero Adding B-Y FDR corrected q-values to output file
495 # Store the p-values in natural scale (i.e. before log transformation) for FDR correction
496 push(@pvalues, $pbinom);
Mar 12, 2015 @jherrero Keep p-value in natural scale and represent the -log10 in ePlot
497 $pbinom = sprintf("%.2e", $pbinom);
Mar 10, 2015 @jherrero Adding B-Y FDR corrected q-values to output file
498
499 # Z score calculation (note: this is here only for legacy reasons. Z-scores assume normal distribution)
May 11, 2016 @jherrero Even more code cleanup
500 my $zscore = zscore($teststat, $overlaps_per_cell{$cell});
Mar 10, 2015 @jherrero Adding B-Y FDR corrected q-values to output file
501
Oct 23, 2014 @charlesbreeze eFORGE first commit
502 my $mvp_string = "";
May 11, 2016 @jherrero Even more code cleanup
503 $mvp_string = join(",", @{$overlaps->{'CELLS'}->{$cell}->{'MVPS'}})
504 if defined $overlaps->{'CELLS'}->{$cell}->{'MVPS'};
Oct 23, 2014 @charlesbreeze eFORGE first commit
505 # This gives the list of overlapping MVPs for use in the tooltips. If there are a lot of them this can be a little useless
506 my ($shortcell, undef) = split('\|', $cell); # undo the concatenation from earlier to deal with identical cell names.
Mar 10, 2015 @jherrero Adding B-Y FDR corrected q-values to output file
507
May 11, 2016 @jherrero More code cleanup
508 push(@results, [$zscore, $pbinom, $shortcell, $$samples{$cell}{'tissue'}, $$samples{$cell}{'datatype'}, $$samples{$cell}{'file'}, $mvp_string, $$samples{$cell}{'acc'}]);
Mar 10, 2015 @jherrero Adding B-Y FDR corrected q-values to output file
509 }
Apr 18, 2016 @jherrero Use compressed files to save space on the server
510 if (!$web) {
511 close(BACKGROUND);
512 }
Oct 23, 2014 @charlesbreeze eFORGE first commit
513
May 11, 2016 @jherrero More code cleanup
514 ## ============================================================================
515 ## Correct the p-values for multiple testing using the Benjamini-Yekutieli FDR control method
516 ## ============================================================================
Mar 10, 2015 @jherrero Adding B-Y FDR corrected q-values to output file
517 my $qvalues = BY(\@pvalues);
Mar 12, 2015 @jherrero Keep p-value in natural scale and represent the -log10 in ePlot
518 $qvalues = [map {sprintf("%.2e", $_)} @$qvalues];
May 11, 2016 @jherrero More code cleanup
519 ## ============================================================================
520
Oct 23, 2014 @charlesbreeze eFORGE first commit
521
May 11, 2016 @jherrero More code cleanup
522 ## ============================================================================
523 ## Write the results to a tab-separated file
524 ## ============================================================================
525 my $results_filename = "$lab.chart.tsv.gz";
526 open(TSV, "| gzip -9 > $out_dir/$results_filename") or die "Cannot open $out_dir/$results_filename: $!";
Apr 18, 2016 @jherrero Use compressed files to save space on the server
527 print TSV join("\t", "Zscore", "Pvalue", "Cell", "Tissue", "Datatype", "File", "Probe", "Accession", "Qvalue"), "\n";
Mar 10, 2015 @jherrero Adding B-Y FDR corrected q-values to output file
528 for (my $i = 0; $i < @results; $i++) {
Apr 18, 2016 @jherrero Use compressed files to save space on the server
529 print TSV join("\t", @{$results[$i]}, $qvalues->[$i]), "\n";
Mar 10, 2015 @jherrero Adding B-Y FDR corrected q-values to output file
530 }
Apr 18, 2016 @jherrero Use compressed files to save space on the server
531 close(TSV);
May 11, 2016 @jherrero More code cleanup
532 ## ============================================================================
Oct 23, 2014 @charlesbreeze eFORGE first commit
533
534
May 11, 2016 @jherrero More code cleanup
535 ## ============================================================================
536 ## Generate plots
537 ## ============================================================================
Oct 23, 2014 @jherrero Add warning messages for tracking progress on the web interface
538 warn "[".scalar(localtime())."] Generating plots...\n";
Oct 23, 2014 @jherrero Fix indentation
539 unless (defined $noplot){
Oct 23, 2014 @charlesbreeze eFORGE first commit
540 #Plotting and table routines
May 11, 2016 @jherrero More code cleanup
541 Chart($results_filename, $lab, $out_dir, $samples, $cells, $label, $t_marginal, $t_strict, $dataset); # basic pdf plot
542 dChart($results_filename, $lab, $out_dir, $dataset, $label, $t_marginal, $t_strict, $web); # rCharts Dimple chart
543 table($results_filename, $lab, $out_dir, $web); # Datatables chart
Oct 23, 2014 @charlesbreeze eFORGE first commit
544 }
May 11, 2016 @jherrero More code cleanup
545 ## ============================================================================
Oct 23, 2014 @jherrero Add warning messages for tracking progress on the web interface
546
547 warn "[".scalar(localtime())."] Done.\n";
Mar 12, 2015 @jherrero Add --autoopen options
548
549 if ($autoopen) {
550 system("open $out_dir/$lab.table.html");
551 system("open $out_dir/$lab.dchart.html");
552 system("open $out_dir/$lab.chart.pdf");
553 }
May 10, 2016 @jherrero Clean up the code
554
May 11, 2016 @jherrero More code cleanup
555
May 10, 2016 @jherrero Clean up the code
556 ####################################################################################################
557 ####################################################################################################
558 ##
559 ## Sub-functions
560 ##
561 ####################################################################################################
562 ####################################################################################################
563
564
565 =head2 parse_pvalue_thresholds
566
567 Arg[1] : string $thresholds
568 Returns : arrayref of marginal and strict thresholds (floats)
569 Example : ($t_marginal, $t_strict) = parse_pvalue_thesholds("0.05,0.01");
570 Description : This function returns the both marginal and strict p-value thresholds as read from
571 the command line option. The input string should contain both numbers separated by
572 a comma.
573 Exceptions : Dies if $thresholds is empty, does not contain numbers or are not defined between
574 0 and 1 and/or the marginal threshold is not larger or equal to the strict one.
575
576 =cut
577
578 sub parse_pvalue_thresholds {
579 my ($thresh) = @_;
580 my ($t_marginal, $t_strict);
581
582 if (!$thresh) {
583 die "Cannot read p-value thresholds from an empty string\n";
584 }
585
586 ($t_marginal, $t_strict) = split(",", $thresh);
587 unless (looks_like_number($t_marginal) && looks_like_number($t_strict)){
588 die "You must specify numerical p-value thresholds in a comma separated list\n";
589 }
590 unless ((1 >= $t_marginal) && ($t_marginal >= $t_strict) && ($t_strict >= 0)) {
591 die "The p-value thresholds should be 1 >= T.marginal >= T.strict >= 0\n";
592 }
593 return ($t_marginal, $t_strict);
594 }
595
596
May 11, 2016 @jherrero More code cleanup
597 =head2 get_input_probes
May 10, 2016 @jherrero Clean up the code
598
599 Arg[1] : string $filename
May 11, 2016 @jherrero More code cleanup
600 Arg[2] : arrayref $probe_list
May 10, 2016 @jherrero Clean up the code
601 Returns : arrayref of probe IDs (string)
May 11, 2016 @jherrero More code cleanup
602 Example : $mvps = get_input_probes("input.txt", undef);
603 Example : $mvps = get_input_probes(undef, ["cg13430807", "cg10480329,cg06297318,cg19301114"]);
604 Example : $mvps = get_input_probes(undef, undef);
May 10, 2016 @jherrero Clean up the code
605 Description : This function returns the list of input probe IDs. This can come from either
May 11, 2016 @jherrero More code cleanup
606 $filename if defined or from $probe_list otherwise. Each element in $probe_list is a
May 10, 2016 @jherrero Clean up the code
607 string which contains one or more probe IDs separated by commas (see Examples).
608 Falls back to the default data set from Jaffe and Irizarry.
609 The set of probe IDs is checked to remove redundant entries.
610 Exceptions : Dies if the file is not found or cannot be opened for whatever reason.
611
612 =cut
613
May 11, 2016 @jherrero More code cleanup
614 sub get_input_probes {
615 my ($filename, $probe_list) = @_;
616 my $probes;
May 10, 2016 @jherrero Clean up the code
617
618 if (defined $filename) {
619 my $fh;
620 if ($filename =~ /\.gz$/) {
621 open($fh, "gunzip -c $filename |") or die "cannot open file $filename : $!";
622 } elsif ($filename =~ /\.bz2$/) {
623 open($fh, "bunzip2 -c $filename |") or die "cannot open file $filename : $!";
624 } else {
625 open($fh, "$filename") or die "cannot open file $filename : $!";
626 }
May 11, 2016 @jherrero More code cleanup
627 $probes = process_file($fh, $format, $dbh, $array, $filter);
May 10, 2016 @jherrero Clean up the code
628
May 11, 2016 @jherrero More code cleanup
629 } elsif ($probe_list and @$probe_list) {
630 @$probes = split(/,/, join(',', @$probe_list));
May 10, 2016 @jherrero Clean up the code
631
632 } else{
633 # Test MVPs from Liu Y et al. Nat Biotechnol 2013 Pulmonary_function.snps.bed (*put EWAS bedfile here)
634 # If no options are given it will run on the default set of MVPs
635 warn "No probe input given, so running on default set of probes, a set of monocyte tDMPs from Jaffe AE and Irizarry RA, Genome Biol 2014.";
May 11, 2016 @jherrero New example dataset works for 450k and 27k arrays
636 @$probes = qw(cg00839584 cg02497428 cg02780988 cg03055440 cg05445326 cg10045881 cg11051139 cg11058932 cg12091331 cg12962778 cg16303562 cg16501235 cg18589858 cg18712919 cg18854666 cg21792432 cg22081096 cg25059899 cg26989103 cg27443224);
May 10, 2016 @jherrero Clean up the code
637 }
638
639 # Remove redundancy in the input
May 11, 2016 @jherrero More code cleanup
640 my %probes_hash;
641 foreach my $probe (@$probes) {
642 $probes_hash{$probe}++;
643 }
644
645 while (my ($probe, $num) = each %probes_hash) {
646 if ($num > 1) {
647 say "$probe is present $num times in the input. Analysing only once."
648 }
May 10, 2016 @jherrero Clean up the code
649 }
650
May 11, 2016 @jherrero More code cleanup
651 @$probes = keys %probes_hash;
652
653 return($probes);
654 }
655
656
657 =head2 find_missing_probes
658
659 Arg[1] : arrayref of strings $original_probe_ids
660 Arg[2] : hashref $existing_probe_ids (keys are probe_ids, values are ignored)
May 11, 2016 @jherrero Fix buglet: don't count excluded probes as missing
661 Arg[3] : hashref $excluded_probe_ids (keys are probe_ids, values are ignored)
662 Returns : int $num_missing_probes
May 11, 2016 @jherrero More code cleanup
663 Example : my $num_missing_probes = find_missing_probes(['cg001', 'cg002', 'cg003', 'cg004'],
May 11, 2016 @jherrero Fix buglet: don't count excluded probes as missing
664 {'cg001' => 1, 'cg003 => 1}, {'cg002' => 'cg001');
May 11, 2016 @jherrero More code cleanup
665 Description : Detects and prints the list of missing probes if any.
666 Exceptions :
667
668 =cut
669
670 sub find_missing_probes {
May 11, 2016 @jherrero Fix buglet: don't count excluded probes as missing
671 my ($original_probes, $existing_probes_hash, $excluded_probes_hash) = @_;
May 11, 2016 @jherrero More code cleanup
672 my $num_missing_probes = 0;
673
674 my $missing_probes = [];
675 foreach my $probe_id (@$original_probes) {
May 11, 2016 @jherrero Fix buglet: don't count excluded probes as missing
676 unless ($existing_probes_hash->{$probe_id} or $excluded_probes_hash->{$probe_id}) {
May 11, 2016 @jherrero More code cleanup
677 push @$missing_probes, $probe_id;
May 10, 2016 @jherrero Clean up the code
678 }
679 }
May 11, 2016 @jherrero More code cleanup
680 $num_missing_probes = scalar @$missing_probes;
May 10, 2016 @jherrero Clean up the code
681
May 11, 2016 @jherrero More code cleanup
682 if ($num_missing_probes > 0) {
683 warn "The following $num_missing_probes MVPs have not been analysed because they were not found on the selected array\n";
684 warn join("\n", @$missing_probes) . "\n";
685 }
May 10, 2016 @jherrero Clean up the code
686
May 11, 2016 @jherrero More code cleanup
687 return $num_missing_probes;
May 10, 2016 @jherrero Clean up the code
688 }