# **BEDBASE Demo**

This demo shows how to process BED files, generate statistics and plots for them with the R package GenomicDistributions, and explore the results using the REST API for the bedstat and bedbuncher pipelines.

The general workflow for uploading bed files and their 


## Before starting the tutorial
We need to create a directory where we'll store the bedbase pipelines and the files to be processed.

In [1]:
cd $HOME/Desktop

In [2]:
mkdir bedbase_tutorial
cd bedbase_tutorial

We can download the BED files and PEPs needed for this demo with:

In [3]:
wget http://big.databio.org/example_data/bedbase_demo/bedbase_demo_files_justBED/bedbase_BEDfiles.tar.gz     
wget http://big.databio.org/example_data/bedbase_demo/bedbase_demo_files_justBED/bedbase_demo_PEPs.tar.gz 

--2020-03-17 16:20:59--  http://big.databio.org/example_data/bedbase_demo/bedbase_demo_files_justBED/bedbase_BEDfiles.tar.gz
Resolving big.databio.org (big.databio.org)... 128.143.245.181
Connecting to big.databio.org (big.databio.org)|128.143.245.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60245813 (57M) [application/octet-stream]
Saving to: ‘bedbase_BEDfiles.tar.gz’


2020-03-17 16:20:59 (102 MB/s) - ‘bedbase_BEDfiles.tar.gz’ saved [60245813/60245813]

--2020-03-17 16:21:00--  http://big.databio.org/example_data/bedbase_demo/bedbase_demo_files_justBED/bedbase_demo_PEPs.tar.gz
Resolving big.databio.org (big.databio.org)... 128.143.245.181
Connecting to big.databio.org (big.databio.org)|128.143.245.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1262 (1.2K) [application/octet-stream]
Saving to: ‘bedbase_demo_PEPs.tar.gz’


2020-03-17 16:21:00 (220 MB/s) - ‘bedbase_demo_PEPs.tar.gz’ saved [1262/1262]
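
As an optional sanity check, the sizes that wget reported in the log above (60245813 and 1262 bytes) can be compared with the files on disk. The following is a minimal sketch; the helper name `check_size` is hypothetical and not part of any bedbase tool.

```shell
# check_size FILE EXPECTED_BYTES
# Hypothetical helper: succeeds only if FILE exists and has exactly EXPECTED_BYTES bytes.
check_size() {
  file="$1"
  expected="$2"
  if [ ! -f "$file" ]; then
    echo "missing: $file"
    return 1
  fi
  # wc -c counts bytes; arithmetic expansion strips padding that some wc versions add
  actual=$(( $(wc -c < "$file") ))
  if [ "$actual" -eq "$expected" ]; then
    echo "ok: $file ($actual bytes)"
  else
    echo "size mismatch: $file has $actual bytes, expected $expected"
    return 1
  fi
}

# Usage with the sizes from the wget log above:
# check_size bedbase_BEDfiles.tar.gz 60245813
# check_size bedbase_demo_PEPs.tar.gz 1262
```

If the archives are ever republished with different contents, the expected sizes would need to be updated accordingly.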



To use our files and PEPs, we need to untar them:

In [4]:
tar -zxvf bedbase_BEDfiles.tar.gz
tar -zxvf bedbase_demo_PEPs.tar.gz

bedbase_BEDfiles/
bedbase_BEDfiles/GSE105977_ENCFF449EZT_optimal_idr_thresholded_peaks_hg19.bed.gz
bedbase_BEDfiles/GSE105587_ENCFF018NNF_conservative_idr_thresholded_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSE105587_ENCFF413ANK_peaks_hg19.bed.gz
bedbase_BEDfiles/GSM2423312_ENCFF155HVK_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSE105977_ENCFF617QGK_optimal_idr_thresholded_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSE91663_ENCFF316ASR_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSM2423313_ENCFF722AOG_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSE105587_ENCFF809OOE_conservative_idr_thresholded_peaks_hg19.bed.gz
bedbase_BEDfiles/GSM2827349_ENCFF196DNQ_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSE91663_ENCFF553KIK_optimal_idr_thresholded_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSE91663_ENCFF319TPR_conservative_idr_thresholded_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSE105977_ENCFF634NTU_peaks_hg19.bed.gz
bedbase_BEDfiles/GSE105977_ENCFF937CGY_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSM2827350_ENCFF928JXU_peaks_GRCh38.bed.gz
bedb
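
The file names in the listing above end with the genome assembly they were aligned to (`hg19` or `GRCh38`). A quick way to see how many files belong to each assembly is sketched below; `count_by_assembly` is a hypothetical helper that relies only on this naming pattern.

```shell
# count_by_assembly DIR
# Hypothetical helper: counts files in DIR whose names end in _<assembly>.bed.gz
count_by_assembly() {
  dir="$1"
  for assembly in hg19 GRCh38; do
    n=0
    for f in "$dir"/*_"$assembly".bed.gz; do
      # the glob stays unexpanded when nothing matches, so test that the file exists
      if [ -e "$f" ]; then
        n=$((n + 1))
      fi
    done
    echo "$assembly: $n"
  done
}

# Usage:
# count_by_assembly bedbase_BEDfiles
```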

## First part of the tutorial (insert BED file statistics into Elasticsearch)


### 1) Create a PEP describing the BED files to process

To get started, we'll need a PEP ([Portable Encapsulated Project](https://pepkit.github.io/)). A PEP consists of 1) an annotation sheet (.csv) that contains information about the samples in a project and 2) a project config.yaml file that points to the sample annotation sheet. The config file also has other components, such as derived attributes, which in this case point to the BED files to be processed. The following is an example of a config file that uses the derived attributes `output_file_path` and `yaml_file` to point to the `.bed.gz` files and their respective metadata.

In [5]:
cat bedbase_demo_PEPs/bedstat_config.yaml

metadata:
  sample_table: bedstat_annotation_sheet.csv
  output_dir: ../bedstat/bedstat_pipeline_logs 
  pipeline_interfaces: ../bedstat/pipeline_interface.yaml

constant_attributes: 
  output_file_path: "source"
  yaml_file: "source2"
  protocol: "bedstat"

derived_attributes: [output_file_path, yaml_file]
data_sources:
  source: "../bedbase_BEDfiles/{file_name}" 
  source2: "../bedstat/bedstat_pipeline_logs/submission/{sample_name}.yaml"


### 2) Download bedstat and the Bedbase configuration manager (bbconf)

[bbconf](https://github.com/databio/bbconf) implements convenience methods for interacting with the database backend, which in this case is a local Elasticsearch cluster. For this demo, we'll clone bedstat and install the dev version of `bbconf` as follows:

In [6]:
git clone git@github.com:databio/bedstat
pip install git+https://github.com/databio/bbconf.git@dev

Cloning into 'bedstat'...
remote: Enumerating objects: 161, done.
remote: Counting objects: 100% (161/161), done.
remote: Compressing objects: 100% (88/88), done.
Receiving objects: 100% (358/358), 57.22 KiB | 5.72 MiB/s, done.
remote: Total 358 (delta 78), reused 109 (delta 43), pack-reused 197
Resolving deltas: 100% (152/152), done.
Collecting git+https://github.com/databio/bbconf.git@dev
  Cloning https://github.com/databio/bbconf.git (to revision dev) to /tmp/pip-req-build-evqcaq50
  Running command git clone -q https://github.com/databio/bbconf.git /tmp/pip-req-build-evqcaq50
  Running command git checkout -b dev --track origin/dev
  Switched to a new branch 'dev'
  Branch 'dev' set up to track remote branch 'dev' from 'origin'.
Building wheels for collected packages: bbconf
  Building wheel for bbconf (setup.py) ... done
  Created wheel for bbconf: filename=bbconf-0.0.2.dev0-cp36-none-any.whl size=8961 sha256=c063f39d13c3cca85cbc49fc1f12da2ca1e9741a4d7cf59

To use bbconf, we'll need to create a minimal configuration.yaml file. The path to this configuration file can be stored in the environment variable `$BEDBASE`:

In [None]:
cat $BEDBASE
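
Before running the pipelines, it can help to confirm that `$BEDBASE` is set and points to an existing file. The following is a minimal sketch; `bedbase_config_status` is a hypothetical helper, and the example path in the comments is only an assumption about where the config was saved.

```shell
# bedbase_config_status
# Hypothetical helper: reports whether $BEDBASE points to an existing file.
bedbase_config_status() {
  if [ -z "${BEDBASE:-}" ]; then
    echo "BEDBASE is not set"
    return 1
  elif [ ! -f "$BEDBASE" ]; then
    echo "BEDBASE is set but no file exists at: $BEDBASE"
    return 1
  fi
  echo "using bedbase config: $BEDBASE"
}

# Example (the path is an assumption; use wherever the config was saved):
# export BEDBASE="$HOME/Desktop/bedbase_tutorial/bedbase_config.yaml"
# bedbase_config_status
```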

### 3) Run the bedstat pipeline on the demo PEP

[bedstat](https://github.com/databio/bedstat) is a pypiper pipeline that generates statistics and plots of BED files. For more detailed information about the pipeline and how to set up a local Elasticsearch cluster to insert and query files, see the [bedstat README](https://github.com/databio/bedstat/blob/master/README.md).

To run [bedstat](https://github.com/databio/bedstat) and the other pipelines required in this demo, we will rely on the pipeline submission engine [looper](http://looper.databio.org/en/latest/). For detailed instructions on how to link a project to a pipeline, see the [looper documentation](http://looper.databio.org/en/latest/linking-a-pipeline/). If the pipeline is being run from an HPC environment where docker is not available, we recommend running the pipeline with the `--no-db-commit` flag (this will only calculate statistics and generate plots, without inserting this information into the local Elasticsearch cluster).

In [None]:
cd ~/Desktop/bedstat
looper run bedhost_demo_files_justBED/bedhost_demo_refPEP/demo_config.yaml --no-db-commit --compute local 

Once we have generated plots and statistics, we can insert them into our local Elasticsearch cluster by running the bedstat pipeline with the `--just-db-commit` flag:

In [None]:
cd ~/Desktop/bedstat
looper run bedhost_demo_files_justBED/bedhost_demo_refPEP/demo_config.yaml --just-db-commit --compute local 
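
The two runs above can also be chained so that the database commit only happens if the first stage finished without error. This is a sketch under two assumptions: that looper exits with a non-zero status when a run fails, and that a hypothetical `LOOPER` variable may be used to swap in another command (for example `echo`, to preview the commands without running them). Neither `run_bedstat_stages` nor `LOOPER` is part of looper or bedstat.

```shell
# run_bedstat_stages CONFIG
# Hypothetical wrapper: runs the two bedstat stages from this tutorial in order.
run_bedstat_stages() {
  config="$1"
  runner="${LOOPER:-looper}"
  # Stage 1 computes statistics and plots; stage 2 (the database commit)
  # runs only if stage 1 returned success
  "$runner" run "$config" --no-db-commit --compute local &&
    "$runner" run "$config" --just-db-commit --compute local
}

# Preview the commands without executing them:
# LOOPER=echo
# run_bedstat_stages bedhost_demo_files_justBED/bedhost_demo_refPEP/demo_config.yaml
```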

After the previous steps have been executed, our BED files should be available for querying on our local Elasticsearch cluster. Files can be queried using the `bedbuncher` pipeline described in the next section.


## Second part of the tutorial (use bedbuncher to create bedsets)

### 1) Create a new PEP describing the bedset name and specific JSON query
[bedbuncher](https://github.com/databio/bedbuncher) is a pipeline designed to create bedsets (sets of BED files retrieved from bedbase). To create bedsets, we will need an additional PEP describing the query as well as attributes such as the name assigned to the newly created bedset. This PEP should specify the path to the `JSON` query file. The annotation sheet and configuration file should have the following structure:

In [None]:
cd ~/Desktop/bedbuncher/project
cat bedset_query.csv

In [None]:
cd ~/Desktop/bedbuncher/project
cat cfg.yaml

### 2) Run the bedbuncher pipeline with looper

To create a bedset, we simply need to create a PEP as previously shown and run the bedbuncher pipeline using looper:

In [None]:
cd ~/Desktop/bedbuncher
looper run project/cfg.yaml --compute local

## Third part of the tutorial (run local instance of bedhost)

The last part of the tutorial consists of running a local instance of [bedhost](https://github.com/databio/bedhost/tree/master) (a REST API for the outputs produced by bedstat and bedbuncher) in order to explore and download output files. To access the API, we'll need to download the dev branch of the GitHub repository as follows:

In [None]:
git clone git@github.com:databio/bedhost

Then we need to run the following command, making sure to point to the previously described bedbase config.yaml file:

In [None]:
bedhost serve -c path/to/config

If we have stored the path to the bedbase config in the environment variable `$BEDBASE` (suggested), it's not necessary to specify the path to the config file when starting bedhost:

In [None]:
bedhost serve 