# BEDBASE workflow tutorial

This demo shows how to process BED files, generate statistics and plots for them with the R package GenomicDistributions, and explore the output of the bedstat and bedbuncher pipelines through the `bedhost` REST API.

Notes:

- If you have not already done so, we recommend starting this Jupyter notebook with sudo permissions, since steps such as installing `docker` or running an elasticsearch `docker` container will not execute otherwise. This can be done with `sudo jupyter notebook --allow-root`.

 
 

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Create-a-tutorial-directory-and-download-required-files-and-pipelines" data-toc-modified-id="Create-a-tutorial-directory-and-download-required-files-and-pipelines-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Create a tutorial directory and download required files and pipelines</a></span></li><li><span><a href="#BEDSTAT:-Generate-statistics-and-plots-of-BED-files" data-toc-modified-id="BEDSTAT:-Generate-statistics-and-plots-of-BED-files-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>BEDSTAT: Generate statistics and plots of BED files</a></span><ul class="toc-item"><li><span><a href="#Create-a-PEP-describing-the-BED-files-to-process" data-toc-modified-id="Create-a-PEP-describing-the-BED-files-to-process-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Create a PEP describing the BED files to process</a></span></li><li><span><a href="#Download-the-bedbase-configuration-manager-(bbconf)-and-install-bedstat-dependencies" data-toc-modified-id="Download-the-bedbase-configuration-manager-(bbconf)-and-install-bedstat-dependencies-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Download the bedbase configuration manager (bbconf) and install bedstat dependencies</a></span></li><li><span><a href="#Inititiate-a-local-elasticsearch-cluster" data-toc-modified-id="Inititiate-a-local-elasticsearch-cluster-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Inititiate a local elasticsearch cluster</a></span></li><li><span><a href="#Run-bedstat--on-the-demo-PEP" data-toc-modified-id="Run-bedstat--on-the-demo-PEP-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Run bedstat  on the demo PEP</a></span></li></ul></li><li><span><a href="#BEDBUNCHER:-Create-bedsets-and-their-respective-statistics" data-toc-modified-id="BEDBUNCHER:-Create-bedsets-and-their-respective-statistics-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>BEDBUNCHER: Create bedsets and their respective statistics</a></span><ul 
class="toc-item"><li><span><a href="#Create-a-new-PEP-describing-the-bedset-name-and-specific-JSON-query" data-toc-modified-id="Create-a-new-PEP-describing-the-bedset-name-and-specific-JSON-query-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Create a new PEP describing the bedset name and specific JSON query</a></span></li><li><span><a href="#Create-outputs-directory-and-install-bedbuncher-CML-dependencies" data-toc-modified-id="Create-outputs-directory-and-install-bedbuncher-CML-dependencies-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Create outputs directory and install bedbuncher CML dependencies</a></span></li><li><span><a href="#Run-bedbuncher-using-Looper" data-toc-modified-id="Run-bedbuncher-using-Looper-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Run bedbuncher using Looper</a></span></li></ul></li><li><span><a href="#BEDHOST:--API-to-explore-pipeline-outputs" data-toc-modified-id="BEDHOST:--API-to-explore-pipeline-outputs-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>BEDHOST:  API to explore pipeline outputs</a></span></li></ul></div>

## Create a tutorial directory and download required files and pipelines 
We need to create a directory where we'll store the bedbase pipelines and the files to be processed. We'll also create an environment variable that points to the tutorial directory (we'll need this variable in section 3 of the tutorial).

In [None]:
cd $HOME/Desktop

In [None]:
mkdir bedbase_tutorial
cd bedbase_tutorial
export BBTUTORIAL="$(pwd)"

The files we'll need for this tutorial can be downloaded with the following command:

In [None]:
wget http://big.databio.org/example_data/bedbase_tutorial/bed_files.tar.gz     

The downloaded files are compressed so we'll need to untar them:

In [None]:
tar -zxvf bed_files.tar.gz

Additionally, we'll download the 3 core pipelines and tools needed to complete this tutorial: `bedstat`, `bedbuncher` and `bedhost`

In [None]:
git clone git@github.com:databio/bedstat
git clone git@github.com:databio/bedbuncher
git clone git@github.com:databio/bedhost

## BEDSTAT: Generate statistics and plots of BED files 


### Create a PEP describing the BED files to process

To get started, we'll need a PEP ([Portable Encapsulated Project](https://pepkit.github.io/)). A PEP consists of 1) an annotation sheet (.csv) that contains information about the samples in a project and 2) a project config.yaml file that points to the sample annotation sheet. The config file also has other components, such as derived attributes, which in this case point to the BED files to be processed. The following is an example of a config file that uses the derived attributes `output_file_path` and `yaml_file` to point to the `.bed.gz` files and their respective metadata.

In [None]:
cat bedhost/tutorial_files/PEPs/bedstat_config.yaml
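To make the two parts of a PEP concrete, here is a minimal sketch that writes a made-up annotation sheet and project config into a temporary directory. It assumes the PEP 2.0 layout (`sample_table` plus `sample_modifiers.derive`); the file names, sample names and the source keys `bed_source` and `yaml_source` are hypothetical, and the config shipped with the tutorial may be organized differently.

```shell
# Hypothetical demo PEP: file names, sample names and source keys are made up.
demo_dir="$(mktemp -d)"

# 1) Annotation sheet: one row per BED file. The derived columns
#    (output_file_path, yaml_file) hold a source key, not a path.
cat > "$demo_dir/demo_annotation.csv" << 'EOF'
sample_name,output_file_path,yaml_file
demo_bed_1,bed_source,yaml_source
demo_bed_2,bed_source,yaml_source
EOF

# 2) Project config: points to the sheet and expands each source key into a
#    per-sample path through the {sample_name} placeholder (PEP 2.0 layout assumed).
cat > "$demo_dir/demo_config.yaml" << 'EOF'
pep_version: 2.0.0
sample_table: demo_annotation.csv
sample_modifiers:
  derive:
    attributes: [output_file_path, yaml_file]
    sources:
      bed_source: "bed_files/{sample_name}.bed.gz"
      yaml_source: "metadata/{sample_name}.yaml"
EOF

cat "$demo_dir/demo_config.yaml"
```

Keeping the paths in the config rather than in the sheet means the same annotation sheet can be reused when the BED files move to a different location.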

### Download the bedbase configuration manager (bbconf) and install bedstat dependencies

[bedstat](https://github.com/databio/bedstat) is a [pypiper](http://code.databio.org/pypiper/) pipeline that generates statistics and plots of BED files. Additionally, [bedstat](https://github.com/databio/bedstat) relies on
[bbconf](https://github.com/databio/bbconf), the `bedbase` configuration manager, which implements convenience methods for interacting with an elasticsearch database where our file metadata will be placed. For this demo, we'll use the dev version of `bbconf`, which can be installed as follows:

In [None]:
# Install the bedbase configuration manager
pip install git+https://github.com/databio/bbconf.git@dev --user > bbconf_log.txt

# Install Python dependencies
pip install piper --user > piper_log.txt

# Install R dependencies
Rscript bedstat/scripts/installRdeps.R > R_deps.txt

We'll need to create a directory where we can store the stats and plots generated by `bedstat`. Additionally, we'll create a directory where we can store log and metadata files that we'll need later on.

In [None]:
# -p creates the parent directories and does not fail if they already exist
mkdir -p outputs/bedstat_output/bedstat_pipeline_logs

In order to use bbconf, we'll need to create a minimal configuration.yaml file. The path to this configuration file can be stored in the environment variable `$BEDBASE`.

In [None]:
cat bedhost/tutorial_files/bedbase_configuration.yaml
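As a small sketch of storing the configuration path in `$BEDBASE`, the cell below exports the variable and reports whether the file is actually there. It assumes `$BBTUTORIAL` was exported at the start of the tutorial and falls back to the current directory otherwise.

```shell
# Assumption: BBTUTORIAL points at the tutorial directory; fall back to $PWD if unset.
export BEDBASE="${BBTUTORIAL:-$PWD}/bedhost/tutorial_files/bedbase_configuration.yaml"

# Report whether the configuration file exists at that location.
if [ -f "$BEDBASE" ]; then
  echo "bedbase config found: $BEDBASE"
else
  echo "bedbase config not found at: $BEDBASE"
fi
```

The export only lasts for the current shell session; adding the line to a shell startup file would make it persistent.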

### Inititiate a local elasticsearch cluster

In addition to generating statistics and plots, [bedstat](https://github.com/databio/bedstat) inserts JSON-formatted metadata into an [elasticsearch](https://www.elastic.co/elasticsearch/) database, from which we'll search for files and information about them. (This step may have to be performed outside the notebook, since these commands ask for a sudo password.)

In [None]:
# If docker is not already installed, you can install it with the following commands
# (make sure you have sudo permissions).
# Note: on recent Debian/Ubuntu releases the package is named docker.io rather than docker-engine.

sudo apt-get update
sudo apt-get install docker-engine -y

# Create a persistent volume to house elasticsearch data
sudo docker volume create es-data

# Run the docker container for elasticsearch
sudo docker run -p 9200:9200 -p 9300:9300 -v es-data:/usr/share/elasticsearch/data -e "xpack.ml.enabled=false" \
  -e "discovery.type=single-node" elasticsearch:7.5.1
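Before moving on, it can help to confirm that the cluster answers on the published port. The sketch below assumes `curl` is installed and that the container is still running (for example, the check is run from a second terminal, since `docker run` without `-d` stays in the foreground); the helper name `es_is_up` and the `ES_URL` override are made up for this example.

```shell
# Port 9200 is the one published by the docker run command above.
es_url="${ES_URL:-http://localhost:9200}"

# Succeeds only if the cluster health endpoint returns a successful HTTP response.
es_is_up() {
  curl --silent --fail --max-time 5 "$1/_cluster/health" > /dev/null 2>&1
}

if es_is_up "$es_url"; then
  echo "elasticsearch is reachable at $es_url"
else
  echo "elasticsearch is not reachable at $es_url (is the container running?)"
fi
```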

### Run bedstat  on the demo PEP
To run [bedstat](https://github.com/databio/bedstat) and the other pipelines required in this demo, we will rely on the pipeline submission engine [looper](http://looper.databio.org/en/latest/), which can be installed as follows:

In [None]:
pip install --user loopercli

To establish a modular connection between a project and a pipeline, we'll need a [pipeline interface](http://looper.databio.org/en/latest/linking-a-pipeline/) file, which tells looper how to run the pipeline. If `bedstat` is run in an HPC environment where docker is not available, we recommend running the pipeline with the `--no-db-commit` flag, which calculates statistics and generates plots but does not insert this information into the local elasticsearch cluster. Once the plots and statistics have been generated, we can insert them into our local elasticsearch cluster by running `bedstat` with the `--just-db-commit` flag. If your data lives in a local environment, as is the case in this tutorial, these flags are not required; the two cells below nevertheless walk through the two-step workflow, first with `--no-db-commit` and then with `--just-db-commit`:

In [None]:
looper run bedhost/tutorial_files/PEPs/bedstat_config.yaml --bedbase-config bedhost/tutorial_files/bedbase_configuration.yaml \
--no-db-commit --compute local --limit 10 -R > outputs/bedstat_output/bedstat_pipeline_logs/looper_logs.txt

In [None]:
looper run bedhost/tutorial_files/PEPs/bedstat_config.yaml --bedbase-config bedhost/tutorial_files/bedbase_configuration.yaml \
--just-db-commit --compute local --limit 10 -R > outputs/bedstat_output/bedstat_pipeline_logs/looper_logs.txt

For reference, we can inspect how bedstat encapsulates the information for each BED file:

In [None]:
head outputs/bedstat_output/bedstat_pipeline_logs/looper_logs.txt

After the previous steps have been executed, our BED files should be available for querying on our local elasticsearch cluster. Files can be queried using the `bedbuncher` pipeline described in the section below.


## BEDBUNCHER: Create bedsets and their respective statistics 

### Create a new PEP describing the bedset name and specific JSON query  
[bedbuncher](https://github.com/databio/bedbuncher) is a pipeline designed to create bedsets (sets of BED files retrieved from bedbase), along with their respective statistics and additional outputs such as a `PEP` and an `iGD` database. To run `bedbuncher`, we will need an additional PEP describing the query as well as attributes such as the name assigned to the newly created bedset. This PEP should point to the `JSON` file describing the query used to find files of interest. The annotation sheet and configuration file should have the following structure:

In [None]:
cat bedhost/tutorial_files/PEPs/bedbuncher_query.csv

In [None]:
cat bedhost/tutorial_files/PEPs/bedbuncher_config.yaml

###  Create outputs directory and install bedbuncher CML dependencies

We need a folder where we can store `bedset` related outputs. Though not required, we'll also create a directory where we can store the `bedbuncher` pipeline logs. 

In [None]:
# -p creates the parent directories and does not fail if they already exist
mkdir -p outputs/bedbuncher_output/bedbuncher_pipeline_logs

One of the features of `bedbuncher` is the creation of an [iGD](https://github.com/databio/iGD) database from the files in the bedset. `iGD` can be installed by cloning the repository from GitHub, running `make` to build the binary, and adding the binary's location to the `$PATH` environment variable.

In [None]:
git clone git@github.com:databio/iGD
cd iGD
make > igd_make_log.txt
cd ..

# Add the iGD bin directory to PATH (this may have to be done before starting the tutorial)
export PATH=$BBTUTORIAL/iGD/bin/:$PATH
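A quick way to confirm that the `$PATH` change took effect is to ask the shell to resolve the binary. This sketch assumes the build produces an executable named `igd` in `iGD/bin`; the helper `is_on_path` is made up for this example.

```shell
# Succeeds if the given command name can be resolved by the shell.
is_on_path() {
  command -v "$1" > /dev/null 2>&1
}

# Assumption: the iGD build produces a binary called "igd".
if is_on_path igd; then
  echo "igd found at: $(command -v igd)"
else
  echo "igd not found on PATH; check that ${BBTUTORIAL:-<BBTUTORIAL unset>}/iGD/bin exists"
fi
```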

### Run bedbuncher using Looper 

Once we have cloned the `bedbuncher` repository, set up our local elasticsearch cluster and built the `iGD` binary, we can run `bedbuncher`, passing the location of the `bedbase` configuration file to the `--bedbase-config` argument. Note: if the path to the `bedbase` configuration file has been stored in the `$BEDBASE` environment variable, it's not necessary to pass the `--bedbase-config` argument.

In [None]:
looper run  bedhost/tutorial_files/PEPs/bedbuncher_config.yaml  --bedbase-config bedhost/tutorial_files/bedbase_configuration.yaml \
--compute local -R > outputs/bedbuncher_output/bedbuncher_pipeline_logs/looper_logs.txt 

## BEDHOST:  API to explore pipeline outputs

The last part of the tutorial consists of running a local instance of [bedhost](https://github.com/databio/bedhost/tree/master) (a REST API for the outputs produced by bedstat and bedbuncher) in order to explore plots and statistics and to download pipeline outputs. To run `bedhost`, we'll pip install the package from the previously cloned repository:

In [None]:
pip install bedhost/. --user > bedhost_log.txt

To start bedhost, we run the following command, passing the location of the `bedbase` config file to the `-c` flag.

In [None]:
bedhost serve -c  $BBTUTORIAL/bedhost/tutorial_files/bedbase_configuration.yaml

If we have stored the path to the bedbase config in the environment variable `$BEDBASE` (suggested), it's not necessary to pass the `-c` flag.

In [None]:
bedhost serve 

The `bedhost` API is available at the URL [http://0.0.0.0:8000](http://0.0.0.0:8000). We can now explore the plots and statistics generated by the `bedstat` and `bedbuncher` pipelines.