
Fix typos in documentation
mlinderm committed May 9, 2019
1 parent c198f97 commit 10f4b63
Showing 2 changed files with 5 additions and 6 deletions.
3 changes: 1 addition & 2 deletions BENCHMARKING.md
@@ -122,8 +122,7 @@ The specific command when running DECA on Databricks is below.
]
```

The `20130108.exome.targets.exclude.txt` file is the concatenation of the `20130108.exome.targets.gc.txt` and `20130108.exome.targets.lc.txt` files,
which are in turn generated from `20130108.exome.targets.interval_list` as
described in the [XHMM
tutorial](http://atgu.mgh.harvard.edu/xhmm/tutorial.shtml).
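The concatenation described above can be done with `cat`; a minimal sketch, assuming the two input files are in the working directory:

```
# Build the combined exclude list from the GC-content and low-complexity
# target lists (illustrative; run wherever the interval files live).
cat 20130108.exome.targets.gc.txt 20130108.exome.targets.lc.txt \
  > 20130108.exome.targets.exclude.txt
```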
8 changes: 4 additions & 4 deletions README.md
@@ -215,13 +215,13 @@ environment and would likely need to be modified for other environments.

## Running DECA on AWS with Elastic MapReduce

DECA can readily be run on Amazon AWS using the Elastic MapReduce (EMR) [Spark configuration](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). Data can be read from and written to S3 using the s3a:// scheme. For example, the 1000 Genomes data are available as a [public dataset](https://aws.amazon.com/1000genomes/) on S3 in the `1000genomes` bucket (i.e. `s3a://1000genomes/...`). S3A, provided by Apache Hadoop, is a filesystem connector layered over the AWS Simple Storage Service (S3) cloud data store.
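As a small sketch of the s3a:// scheme in practice (assuming Hadoop tooling is on the path and S3 credentials are configured for your environment):

```
# List the top level of the public 1000 Genomes bucket through the
# Hadoop S3A connector.
hadoop fs -ls s3a://1000genomes/
```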

Note that unlike HDFS, S3 is an eventually consistent filesystem, so you may encounter problems when reading recently written files, as can occur at the end of DECA operations when sharded output files are combined. When writing to S3, use the `-multi_file` option to leave the output sharded for subsequent combination or analysis.
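As an illustrative sketch only, the flag is appended to the DECA invocation. The `normalize_and_discover` subcommand appears in the full examples in this README, but the `-I`/`-o` option names and bucket paths here are placeholders, not a prescribed invocation:

```
# Hypothetical invocation: -multi_file leaves each output as sharded
# part files on S3 instead of merging them on write.
# The -I/-o option names and s3a:// paths are placeholders.
deca-submit normalize_and_discover \
  -I s3a://my-bucket/reads.adam \
  -o s3a://my-bucket/cnvs.gff3 \
  -multi_file
```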

DECA has been tested with emr-5.13. Clusters can be created with the AWS command-line tools or the AWS Management Console. A JSON file [`emr_config.json`](scripts/emr_config.json) is provided in the scripts directory to configure clusters for maximum resource utilization.
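The checked-in [`emr_config.json`](scripts/emr_config.json) is authoritative; as a sketch of what such a file can contain, EMR's documented `maximizeResourceAllocation` setting for Spark looks like:

```json
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
```

This setting tells EMR to size executors to use the maximum compute and memory resources available on each node in the cluster.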

A bootstrap script [`emr_bootstrap.sh`](scripts/emr_bootstrap.sh) is provided in the scripts directory for use as a [bootstrap action](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html). The bootstrap script can either copy a pre-built JAR onto the cluster (faster) or build DECA directly from GitHub (slower). To use the bootstrap script, copy it to S3 and provide its S3 path as the bootstrap action when creating the cluster. To copy a pre-built JAR onto the cluster, provide an S3 path to the DECA CLI JAR, e.g. `s3://path/to/deca-cli_2.11-0.2.1-SNAPSHOT.jar`, as the optional argument to the bootstrap action. After connecting to the EMR master node via SSH, you can launch DECA as you would on any YARN cluster. For example, the following command calls CNVs in the entire 1000 Genomes phase 3 cohort on a cluster of i3.2xlarge nodes.

```
deca-submit \
```
@@ -273,7 +273,7 @@ normalize_and_discover

DECA can readily be run on [Databricks](https://databricks.com) on the Amazon cloud. DECA has been tested on Databricks Light 2.4 as a spark-submit job using the DECA JAR fetched from an S3 bucket. As with EMR, data can be read from and written to S3 using the s3a:// scheme. The Databricks cluster was configured to access S3 via [AWS IAM roles](https://docs.databricks.com/administration-guide/cloud-configurations/aws/iam-roles.html#secure-access-to-s3-buckets-using-iam-roles). Note that access to any public buckets, e.g. the `1000genomes` bucket, must also be included in the cross-account IAM role created according to those instructions. The same eventual-consistency issues described above also apply when writing data to S3 from the Databricks cluster.

An example configuration for calling CNVs directly from the original BAM files:

```json
[
```
