docs(usage.md): updated master.yaml metadata specs, attempted to add …
…some more details about adding a bwa reference (not sure if this is right), still need to add documentation on use case where index is already present somewhere else and it is added to the master.yaml
JRWallace committed Jul 18, 2019
1 parent 6a2d73c commit 5561af9
Showing 1 changed file with 135 additions and 178 deletions.
313 changes: 135 additions & 178 deletions docs/usage.md
@@ -1,5 +1,6 @@
## **Overview**
RefChef is a reference management tool that helps make your next-generation sequencing projects and analyses reproducible. You can use it to document the provenance of reference genomes, transcriptomes, or proteomes downloaded from public databases (as well as their associated indices and annotations). It is a flexible workflow that could also be used to internally track the progress through different versions of draft assemblies. RefChef will: (1) document the exact steps undertaken in the retrieval and processing of genomic references; (2) maintain the associated metadata; (3) provide a mechanism for automatically reproducing retrieval and creation of an exact copy of genomic references.
<<<<<<< Updated upstream

![Diagram](assets/refchef-diagram.svg)

@@ -34,169 +35,193 @@ This information can be specified in a [`cfg.yaml`](#cfg.yaml) file, a [`cfg.ini
**The following example uses a local repository for tracking references.**

Create your own local repository for tracking references:
=======

![Diagram](assets/refchef-diagram.svg)


**RefChef comes with two commands:**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[`refchef-cook`](#refchef-cook):
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Will read recipes and execute the commands that will retrieve the references, indices, or
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; annotations based on the contents of [`master.yaml`](#master.yaml).

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[`refchef-menu`](#refchef-menu):
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Provides a way for the user to list all references present in the system, based
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; on [`master.yaml`](#master.yaml), as well as filter the list of references based on metadata options.

**RefChef requires a `master.yaml` file:**

In addition to the [`refchef-cook`](#refchef-cook) and [`refchef-menu`](#refchef-menu) commands, RefChef requires a [`master.yaml`](#master.yaml) file containing a list of references, indices, annotations, and metadata, as well as the commands necessary to download and process the files. When [`refchef-cook`](#refchef-cook) is executed, RefChef updates [`master.yaml`](#master.yaml): it changes the `complete` option from `false` to `true` and adds a `uuid` for each reference, the date the files were downloaded, their location, and a complete list of files. Based on the arguments you pass to [`refchef-cook`](#refchef-cook), it will either commit those changes to [`master.yaml`](#master.yaml) in a local repository (red arrow in the above figure) or commit and push the changes to a remote repository (blue arrow in the above figure).
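The bookkeeping described above can be pictured with a small Python sketch. This is only an illustration of the fields that change, not RefChef's actual code: the function name `mark_complete` is hypothetical, and `uuid.uuid4()` is a stand-in because this page does not say how RefChef derives its `uuid` values.

```python
import uuid
from datetime import datetime


def mark_complete(component, location, files):
    """Return a copy of a component entry with the completion fields filled in."""
    updated = dict(component)  # shallow copy so the input entry is left untouched
    updated["complete"] = {"status": True, "time": str(datetime.now())}
    updated["location"] = location
    updated["files"] = sorted(files)
    # Assumption: a random uuid4 stands in for whatever RefChef really generates.
    updated["uuid"] = str(uuid.uuid4())
    return updated


# A pending entry, shaped like one item under `levels: references:`.
entry = {
    "component": "primary",
    "complete": {"status": False},
    "commands": ["gunzip *.gz"],
}
done = mark_complete(entry, "/path/to/references/grch38/primary", ["CHECKSUMS"])
print(done["complete"]["status"], done["location"])
```

Returning a new dictionary rather than editing in place keeps the "before" and "after" states easy to compare, which mirrors the two versions of `master.yaml` shown in the Quickstart.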

**RefChef requires configuration information:**

[`refchef-cook`](#refchef-cook) and [`refchef-menu`](#refchef-menu) both require some configuration information, including:

1. Where you'd like the references to be saved
2. The local git repository for version control of references
3. The remote GitHub repository for version control of reference sequences (optional).

This information can be specified in a [`cfg.yaml`](#cfg.yaml) file, a [`cfg.ini`](#cfg.ini) file, or it can be passed as arguments to [`refchef-cook`](#refchef-cook).

## **Quickstart**
[Create a remote repository](https://help.github.com/en/articles/creating-a-new-repository) and [clone it](https://help.github.com/en/articles/cloning-a-repository):

```
cd /Volumes/jwalla12
git init local_references
git clone https://github.com/JRWallace/remote_references.git
```

Create a directory for refchef to store your references:
```
mkdir /Volumes/jwalla12/references
```

Create a [`master.yaml`](#master.yaml) file and save it in your git repository directory (`local_references` in the above example). Here is a [`master.yaml`](#master.yaml) file that will download the grch38 human genome from Ensembl:
Create a [`master.yaml`](#master.yaml) file and save it in your local git repository directory. Here is a [`master.yaml`](#master.yaml) file that will download chromosome 1 of the grch38 human genome from Ensembl:

```
grch38:
metadata:
name: grch38
organism: Homo sapiens
common_name: human
ncbi_taxon_id: 9606
organism: homo sapiens
organization: ensembl
description: Genome Reference Consortium Human Build 38
genbank_accession:
refseq_accession:
ensembl_release_number: 87
custom: no
downloader: jrwallace
description: corresponds to genbank id GCA_000001405.22
downloader: joselynn wallace
ensembl_release_number: 87
accession:
genbank:
refseq:
levels:
references:
- component: primary
complete:
status: false
commands:
- wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
- wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
- wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/CHECKSUMS
- md5sum *.gz > postdownload-checksums.md5
- md5 *.gz > postdownload-checksums.md5
- gunzip *.gz
- md5sum *.* > final_checksums.md5
```
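The recipe above appears with both `md5sum` (GNU coreutils) and `md5` (macOS/BSD), and the two tools print their results in different formats. Where a platform-independent checksum step is wanted, something like the following Python sketch could be used instead. It is not part of RefChef; the helper names `md5_of` and `write_checksums` are hypothetical, and the output mimics the `md5sum` line format (`<hash>  <filename>`).

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksums(directory, pattern, out_name):
    """Write md5sum-style lines for files matching pattern into out_name."""
    directory = Path(directory)
    lines = [
        f"{md5_of(path)}  {path.name}"
        for path in sorted(directory.glob(pattern))
        # Skip directories and the checksum file itself.
        if path.is_file() and path.name != out_name
    ]
    (directory / out_name).write_text("\n".join(lines) + "\n")
    return lines
```

For example, `write_checksums(".", "*.gz", "postdownload-checksums.md5")` would play the role of `md5sum *.gz > postdownload-checksums.md5` in the recipe.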

Pass the configuration arguments directly to [`refchef-cook`](#refchef-cook) in the following example:
- md5 *.* > final_checksums.md5
```
refchef-cook -e -o /Volumes/jwalla12/references -gl /Volumes/jwalla12/local_references
```

After running [`refchef-cook`](#refchef-cook), you'll see the following:
Pass the configuration arguments in a config file or directly to [`refchef-cook`](#refchef-cook) (as seen in the following example):

```
2019-07-16 10:34:12,972 INFO:
===========================================
REFCHEF 🐶
-------------------------------------------
- References will be downloaded to: /Volumes/jwalla12/references
- Remote repository for master.yaml False
- Local repository for master.yaml /Volumes/jwalla12/local_references
- Logs files: /Volumes/jwalla12/local_references/logs/
-------------------------------------------
===========================================
REFCHEF 🐶
-------------------------------------------
- References will be downloaded to: /Volumes/jwalla12/references
- Remote repository for master.yaml False
- Local repository for master.yaml /Volumes/jwalla12/local_references
- Logs files: /Volumes/jwalla12/local_references/logs/
-------------------------------------------
2019-07-16 10:34:12,972 INFO:
-------------------------------------------
The folowing references will be downloaded:
- grch38
===========================================
-------------------------------------------
The folowing references will be downloaded:
- grch38
===========================================
2019-07-16 10:34:12,974 INFO: 🐶 RefChef... getting reference: grch38, component: primary
🐶 RefChef... getting reference: grch38, component: primary
2019-07-16 10:34:12,975 INFO: Running command "wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
Running command "wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
--2019-07-16 10:34:12-- ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
=> ‘Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.8
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.8|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/release-87/fasta/homo_sapiens/dna ... done.
==> SIZE Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ... 881214448
==> PASV ... done. ==> RETR Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ... done.
Length: 881214448 (840M) (unauthoritative)
Homo_sapiens.GRCh38 100%[===================>] 840.39M 10.6MB/s in 91s
2019-07-16 10:35:46 (9.24 MB/s) - ‘Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz’ saved [881214448]
2019-07-16 10:35:46,020 INFO: Running command "wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/CHECKSUMS"
Running command "wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/CHECKSUMS"
--2019-07-16 10:35:46-- ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/CHECKSUMS
=> ‘CHECKSUMS’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.8
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.8|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/release-87/fasta/homo_sapiens/dna ... done.
==> SIZE CHECKSUMS ... 5010
==> PASV ... done. ==> RETR CHECKSUMS ... done.
Length: 5010 (4.9K) (unauthoritative)
CHECKSUMS 100%[===================>] 4.89K --.-KB/s in 0s
2019-07-16 10:35:48 (50.3 MB/s) - ‘CHECKSUMS’ saved [5010]
2019-07-16 10:35:48,338 INFO: Running command "md5 *.gz > postdownload-checksums.md5"
Running command "md5 *.gz > postdownload-checksums.md5"
2019-07-16 10:35:50,186 INFO: Running command "gunzip *.gz"
Running command "gunzip *.gz"
2019-07-16 10:36:47,937 INFO: Running command "md5 *.* > final_checksums.md5"
Running command "md5 *.* > final_checksums.md5"
2019-07-16 10:37:16,145 INFO: References processed: ['grch38']
References processed: ['grch38']
2019-07-16 10:37:16,145 INFO: Location of references: /Volumes/jwalla12/references
Location of references: /Volumes/jwalla12/references
refchef-cook -e -o /Volumes/jwalla12/references -gl /Volumes/jwalla12/remote_references/remote_references -gr jrwallace/remote_references -g commit
```

After this command is run, [`master.yaml`](#master.yaml) will reflect that you have downloaded the references and it will now look like this:
After [`refchef-cook`](#refchef-cook) is run, [`master.yaml`](#master.yaml) will reflect that you have downloaded the references and it will now look like this:

```
grch38:
metadata:
name: grch38_release87
species: Homo sapiens
name: grch38
common_name: human
ncbi_taxon_id: 9606
organism: homo sapiens
organization: ensembl
downloader: jrwallace
custom: false
description: corresponds to genbank id GCA_000001405.22
downloader: joselynn wallace
ensembl_release_number: 87
accession:
genbank: null
refseq: null
levels:
references:
- component: primary
complete:
status: true
time: 2019-07-12 16:02:25.505498
time: 2019-07-18 14:43:33.302255
commands:
- wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
- wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
- wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/CHECKSUMS
- md5sum *.gz > postdownload-checksums.md5
- md5 *.gz > postdownload-checksums.md5
- gunzip *.gz
- md5sum *.* > final_checksums.md5
- md5 *.* > final_checksums.md5
location: /Volumes/jwalla12/references/grch38/primary
files:
- CHECKSUMS
- final_checksums.md5
- Homo_sapiens.GRCh38.dna.primary_assembly.fa
- Homo_sapiens.GRCh38.dna.chromosome.1.fa
- metadata.txt
- postdownload-checksums.md5
uuid: ce305c7a-7473-30da-b7e5-7d4fd9185975
```

To add more files (we will add a bwa index), create a new branch. Here, we make a branch called 'bwa_index_grch38':
```
git checkout -b bwa_index_grch38
```

Create a `new.yaml` file to append new reference indexes to the `master.yaml`.

```
grch38_index:
metadata:
name: grch38_index
common_name: human
ncbi_taxon_id: 9606
organism: homo sapiens
organization: ensembl
custom: no
description: corresponds to genbank id GCA_000001405.22
downloader: joselynn wallace
ensembl_release_number: 87
accession:
genbank:
refseq:
levels:
references:
- component: bwa_index
complete:
status: false
src: ce305c7a-7473-30da-b7e5-7d4fd9185975
commands:
- mkdir
- bwa index /Volumes/jwalla12/references/grch38/primary/Homo_sapiens.GRCh38.dna.chromosome.1.fa
```

Then run [`refchef-cook`](#refchef-cook), passing the path to `new.yaml` with `-n`:

```
refchef-cook -e -o /Volumes/jwalla12/references -gl /Volumes/jwalla12/remote_references/remote_references -gr jrwallace/remote_references -n /Volumes/jwalla12/remote_references/remote_references/new.yaml -g commit
```

Then run [`refchef-menu`](#refchef-menu) to list the references:

```
refchef-menu -f /Volumes/jwalla12/remote_references/remote_references/master.yaml
```
This will show you:

```
dict_ = yaml.load(yml)
┌ 🐶 RefChef Menu ────────────┬───────────┬────────────────────────────────────────────┬──────────────────────────────────────┐
│ name │ organism │ component │ description │ uuid │
├──────────────┼──────────────┼───────────┼────────────────────────────────────────────┼──────────────────────────────────────┤
│ grch38 │ homo sapiens │ primary │ corresponds to genbank id GCA_000001405.22 │ ad4140d7-7dd8-3bb9-87e5-4208502d5a72 │
│ grch38_index │ homo sapiens │ bwa_index │ corresponds to genbank id GCA_000001405.22 │ fa25201e-a994-11e9-a88c-8c8590bd206d │
└──────────────┴──────────────┴───────────┴────────────────────────────────────────────┴──────────────────────────────────────┘
```
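`refchef-menu` can also filter this list by metadata. The exact filter options are not shown on this page, so the following Python sketch only illustrates the idea of matching entries against metadata values. The function `filter_entries` and the dictionary layout are hypothetical, not RefChef's own API.

```python
def filter_entries(entries, **criteria):
    """Keep entries whose fields match every key=value pair (case-insensitive)."""
    return [
        entry
        for entry in entries
        if all(
            str(entry.get(key, "")).lower() == str(value).lower()
            for key, value in criteria.items()
        )
    ]


# Rows shaped like the menu table above (description and uuid omitted).
menu = [
    {"name": "grch38", "organism": "homo sapiens", "component": "primary"},
    {"name": "grch38_index", "organism": "homo sapiens", "component": "bwa_index"},
]

for row in filter_entries(menu, component="bwa_index"):
    print(row["name"])  # prints: grch38_index
```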

Check out the master branch and merge your new branch into master:

```
git checkout master
git merge bwa_index_grch38
```

## **Usage**


## refchef-cook <a name="refchef-cook"></a>
Reads recipes and executes the commands that will retrieve the references, indices, or annotations
Reads recipes and executes the commands that will retrieve the references, indices, or annotations.

**Usage:**
`refchef-cook [*arguments*]`
@@ -219,9 +244,7 @@ Reads recipes and executes the commands that will retrieve the references, indic


## refchef-menu <a name="refchef-menu"></a>


This command provides a way for the user to list all references present in the system, based on `master.yaml`, as well as filter the list of references based on metadata options.
This command provides a way for the user to list all references present in the system, based on `master.yaml`, as well as filter the list of references based on metadata options. You must specify either `--master, -m` or `--config, -c`.

**Usage:**
`refchef-menu [*arguments*]`
@@ -323,70 +346,4 @@ break-on-error=yes
verbose=yes
```

!!! Note
You can opt not to use a config file. In that case, when using `refchef-menu`, you must pass the argument `--master` (`-m`) with the path to the `master.yaml` file.
When using `refchef-cook`, you must pass at least the output directory (`--outdir, -o`) and the path to the local git directory where the `master.yaml` file is located (`--git_local, -gl`). If you want the changes to `master.yaml` to be pushed to a git service, you must also pass `--git_remote` (`-gr`).

### User workflow diagram







#### Downloading and processing references, indices, or annotations.
This command will read a `master.yaml` located in the `github-directory` path from the config file, or the directory passed to `--git_local`. The `master.yaml` file contains a list of references, indices, and annotations, as well as their metadata, and commands necessary to download and process the files (see example below).
The `master.yaml` file stores all the information about a reference that has been or will be downloaded. When `refchef-cook -e` is executed, the files are downloaded to the output directory and processed. RefChef then sets the `complete` status to `true` in `master.yaml` and adds a `uuid`, the date, the location, and the list of files. An entry whose `complete` status is `true` will not be processed again.

###


```
#### Downloading an index linked to a reference.
Indices can be downloaded just like any reference or annotation (see process above), but also, one might download an index that is linked to a particular reference. In that case, the index entry in the `master.yaml` file has a key `src` that takes the `uuid` of the reference to be linked to the index.
Example of index `master.yaml`:
```yaml
index_1:
metadata:
name: index_test1
species: mouse
organization: ucsc
downloader: fgelin
levels:
indices:
- component: bwa_index
complete:
status: false
src: 8040b09f-3844-3c42-b765-1f6a32614895
commands:
- wget -nv https://s3.us-east-2.amazonaws.com/refchef-tests/chr1.fa.gz
- md5 *.fa.gz > postdownload_checksums.md5
- gunzip *.gz
- md5 *.fa > final_checksums.md5
```

In this case, the commands will be processed like before, but in the reference folder, a symlink to the index folder will be created.
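The linking step can be sketched in Python as follows. This is an assumption-laden illustration rather than RefChef's implementation: the helper `link_index`, the link name, and the directory layout are all hypothetical, and the demo runs in a temporary directory so that nothing real is touched.

```python
import os
import tempfile
from pathlib import Path


def link_index(reference_dir, index_dir, link_name):
    """Create a symlink inside reference_dir that points at index_dir."""
    link = Path(reference_dir) / link_name
    # Leave an existing link or file alone so the step can be re-run safely.
    if not link.is_symlink() and not link.exists():
        os.symlink(Path(index_dir).resolve(), link, target_is_directory=True)
    return link


with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: <outdir>/<name>/<component>
    reference = Path(root) / "grch38" / "primary"
    index = Path(root) / "grch38_index" / "bwa_index"
    reference.mkdir(parents=True)
    index.mkdir(parents=True)
    link = link_index(reference, index, "bwa_index")
    print(link, "->", os.readlink(link))
```

Checking for an existing link before calling `os.symlink` avoids a `FileExistsError` when the same entry is processed twice.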

Arguments:
`--execute, -e`: will execute all commands listed in the `master.yaml` for each reference, if reference doesn't exist in the location provided in the config file.
`--new, -n`: path to a new yaml file containing other references to be downloaded and appended to the `master.yaml`.
`--git, -g`: Git action. Choose from `commit` or `push`.
`--outdir, -o`: output directory, where references will be downloaded to.
`--git_local, -gl`: Local git directory, where the `master.yaml` file can be found.
`--git_remote, -gr`: Remote git repository, in the format `user/project_name`.
`--logs, -l`: Whether to save the log files.

Example run:
1 - This will read the `new.yaml` file, append it to `master.yaml`, and commit the changes using git.
`refchef-cook --config /path/to/cfg.yaml --execute --new new.yaml --git commit`.

2 - This will process `master.yaml`, commit and push changes to the remote repository:
`refchef-cook --execute -o /path/to/output/dir --git_local /path/to/git/dir --git_remote user/project_name --git push`


### `refchef-menu`
