# **BEDBASE Demo**

This demo shows how to process BED files, generate statistics and plots for them with the R package GenomicDistributions, and explore the results through the REST API for the bedstat and bedbuncher pipelines.

The general workflow for uploading bed files and their 


## Before starting the tutorial (file download)
We need to create a directory where we'll store the bedbase pipelines and the files to be processed.

In [1]:
cd $HOME/Desktop

In [2]:
mkdir bedbase_tutorial
cd bedbase_tutorial

The BED files and PEPs needed for this demo can be downloaded with:

In [3]:
wget http://big.databio.org/example_data/bedbase_demo/bedbase_demo_files_justBED/bedbase_BEDfiles.tar.gz     
wget http://big.databio.org/example_data/bedbase_demo/bedbase_demo_files_justBED/bedbase_demo_PEPs.tar.gz 

--2020-03-19 12:48:12--  http://big.databio.org/example_data/bedbase_demo/bedbase_demo_files_justBED/bedbase_BEDfiles.tar.gz
Resolving big.databio.org (big.databio.org)... 128.143.245.181
Connecting to big.databio.org (big.databio.org)|128.143.245.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60245813 (57M) [application/octet-stream]
Saving to: ‘bedbase_BEDfiles.tar.gz’


2020-03-19 12:49:31 (742 KB/s) - ‘bedbase_BEDfiles.tar.gz’ saved [60245813/60245813]

--2020-03-19 12:49:31--  http://big.databio.org/example_data/bedbase_demo/bedbase_demo_files_justBED/bedbase_demo_PEPs.tar.gz
Resolving big.databio.org (big.databio.org)... 128.143.245.181
Connecting to big.databio.org (big.databio.org)|128.143.245.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1374 (1.3K) [application/octet-stream]
Saving to: ‘bedbase_demo_PEPs.tar.gz’


2020-03-19 12:49:31 (109 MB/s) - ‘bedbase_demo_PEPs.tar.gz’ saved [1374/1374]
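Since `wget` reports the expected length of each archive, a quick size check can catch a truncated download before untarring. The following is a minimal sketch: the helper name `check_size` is hypothetical, and the byte counts used in the usage note are the ones shown in the `wget` output above.

```shell
# Hypothetical helper: compare a file's size in bytes with an expected value.
check_size() {
  local file="$1"
  local expected="$2"
  local actual
  if [ ! -f "$file" ]; then
    echo "missing: $file" >&2
    return 1
  fi
  # wc -c counts bytes; tr strips the padding some platforms add
  actual=$(wc -c < "$file" | tr -d '[:space:]')
  if [ "$actual" -eq "$expected" ]; then
    echo "ok: $file ($actual bytes)"
  else
    echo "size mismatch: $file has $actual bytes, expected $expected" >&2
    return 1
  fi
}
```

For the two archives in this demo, the calls would be `check_size bedbase_BEDfiles.tar.gz 60245813` and `check_size bedbase_demo_PEPs.tar.gz 1374`.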



To use our files and PEPs, we need to untar them:

In [4]:
tar -zxvf bedbase_BEDfiles.tar.gz
tar -zxvf bedbase_demo_PEPs.tar.gz

bedbase_BEDfiles/
bedbase_BEDfiles/GSE105977_ENCFF449EZT_optimal_idr_thresholded_peaks_hg19.bed.gz
bedbase_BEDfiles/GSE105587_ENCFF018NNF_conservative_idr_thresholded_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSE105587_ENCFF413ANK_peaks_hg19.bed.gz
bedbase_BEDfiles/GSM2423312_ENCFF155HVK_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSE105977_ENCFF617QGK_optimal_idr_thresholded_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSE91663_ENCFF316ASR_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSM2423313_ENCFF722AOG_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSE105587_ENCFF809OOE_conservative_idr_thresholded_peaks_hg19.bed.gz
bedbase_BEDfiles/GSM2827349_ENCFF196DNQ_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSE91663_ENCFF553KIK_optimal_idr_thresholded_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSE91663_ENCFF319TPR_conservative_idr_thresholded_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSE105977_ENCFF634NTU_peaks_hg19.bed.gz
bedbase_BEDfiles/GSE105977_ENCFF937CGY_peaks_GRCh38.bed.gz
bedbase_BEDfiles/GSM2827350_ENCFF928JXU_peaks_GRCh38.bed.gz
bedb

## First part of the tutorial (insert BED file statistics into Elasticsearch)


### 1) Create a PEP describing the BED files to process

To get started, we'll need a PEP ([Portable Encapsulated Project](https://pepkit.github.io/)). A PEP consists of 1) an annotation sheet (.csv) that contains information about the samples in a project and 2) a project config.yaml file that points to the sample annotation sheet. The config file also has other components, such as derived attributes, which in this case point to the BED files to be processed. The following is an example of a config file that uses the derived attributes `output_file_path` and `yaml_file` to point to the `.bed.gz` files and their respective metadata.

In [5]:
cat bedbase_demo_PEPs/bedstat_config.yaml

metadata:
  sample_table: bedstat_annotation_sheet.csv
  output_dir: ../bedstat/bedstat_pipeline_logs 
  pipeline_interfaces: ../bedstat/pipeline_interface.yaml

constant_attributes: 
  output_file_path: "source"
  yaml_file: "source2"
  protocol: "bedstat"

derived_attributes: [output_file_path, yaml_file]
data_sources:
  source: ../bedbase_BEDfiles/{file_name} 
  source2: ../bedstat/bedstat_pipeline_logs/submission/{sample_name}.yaml
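To make the role of the `data_sources` templates concrete, the sketch below mimics the placeholder substitution for a single sample. It is an illustration only: `resolve_source` is a hypothetical helper written for this demo, not a function from peppy or looper, and it assumes bash (it relies on `${var//pattern/replacement}`).

```shell
# Illustration only: mimic how a data_sources template such as
# "../bedbase_BEDfiles/{file_name}" is filled in for one sample.
# resolve_source is a hypothetical helper, not part of peppy or looper.
resolve_source() {
  local template="$1"
  local key="$2"
  local value="$3"
  local result
  # Replace every literal "{key}" placeholder with the sample's value
  result=${template//"{$key}"/"$value"}
  printf '%s\n' "$result"
}
```

For example, `resolve_source '../bedbase_BEDfiles/{file_name}' file_name GSE105587_ENCFF413ANK_peaks_hg19.bed.gz` prints the relative path to that BED file, which is the value the derived attribute `output_file_path` would take for a sample whose `file_name` is that archive member.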


### 2) Download bedstat and the Bedbase configuration manager (bbconf)

[bedstat](https://github.com/databio/bedstat) is a [pypiper](http://code.databio.org/pypiper/) pipeline that generates statistics and plots of BED files. [bbconf](https://github.com/databio/bbconf) implements convenience methods for interacting with the database backend, which in this case is a local Elasticsearch cluster. For this demo, we'll use the dev version of `bbconf`, which can be downloaded as follows:

In [8]:
git clone git@github.com:databio/bedstat
# Install Python dependencies
pip install piper --user
pip install --user loopercli
pip install git+https://github.com/databio/bbconf.git@dev --user
# Install R dependencies
#Rscript scripts/installRdeps.R

Cloning into 'bedstat'...
remote: Enumerating objects: 165, done.
remote: Counting objects: 100% (165/165), done.
remote: Compressing objects: 100% (92/92), done.
remote: Total 362 (delta 81), reused 106 (delta 43), pack-reused 197
Receiving objects: 100% (362/362), 57.94 KiB | 1.26 MiB/s, done.
Resolving deltas: 100% (155/155), done.
You are using pip version 18.0, however version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
You are using pip version 18.0, however version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting git+https://github.com/databio/bbconf.git@dev
  Cloning https://github.com/databio/bbconf.git (to revision dev) to /tmp/pip-req-build-k90lb9g3
Building wheels for collected packages: bbconf
  Running setup.py bdist_wheel for bbconf ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-t0r6jqgo/wheels/86/ec/ae/8d3556

We'll need to create a directory where we can store the stats and plots generated by `bedstat`. Additionally, we'll create a directory where we can store log and metadata files that we'll need later on.

In [9]:
mkdir bedstat/bedstat_output
mkdir bedstat/bedstat_pipeline_logs

In order to use bbconf, we'll need to create a minimal configuration.yaml file. The path to this configuration file can be stored in the environment variable `$BEDBASE`.

In [16]:
cat bedbase_demo_PEPs/bedbase_configuration.yaml

path:
  pipelines_output: ../bedstat/bedstat_output

database:
  host: localhost
  bed_index: bed_index
  bedset_index: bedset_index

server:
  host: 0.0.0.0
  port: 8000
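One way to set `$BEDBASE` is sketched below. The helper name `set_bedbase` is hypothetical; it only checks that the file exists and exports its absolute path, so the variable stays valid after changing directories.

```shell
# Hypothetical helper: export BEDBASE as the absolute path of a config file.
set_bedbase() {
  local cfg="$1"
  local dir
  if [ ! -f "$cfg" ]; then
    echo "config not found: $cfg" >&2
    return 1
  fi
  # Resolve the directory to an absolute path, then re-attach the file name
  dir=$(cd "$(dirname "$cfg")" && pwd) || return 1
  export BEDBASE="$dir/$(basename "$cfg")"
}
```

From the `bedbase_tutorial` directory, this would be called as `set_bedbase bedbase_demo_PEPs/bedbase_configuration.yaml`. The variable only lasts for the current shell session unless an equivalent `export` line is added to a startup file such as `~/.bashrc`.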


### 3) Initiate a local Elasticsearch cluster

In addition to generating statistics and plots, [bedstat](https://github.com/databio/bedstat) inserts JSON-formatted metadata into an [Elasticsearch](https://www.elastic.co/elasticsearch/) database, which will later be used to search for files and extract information about them.

In [None]:
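# If docker is not already installed, you can install it with the following commands
# (make sure you have sudo permissions). The package name depends on your
# distribution and repositories; on Ubuntu/Debian it is typically docker.io

sudo apt-get update
sudo apt-get install docker.io -y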

# Create a persistent volume to house elastic search data
docker volume create es-data

# Run the docker container for elasticsearch
docker run -p 9200:9200 -p 9300:9300 -v es-data:/usr/share/elasticsearch/data -e "xpack.ml.enabled=false" \
  -e "discovery.type=single-node" elasticsearch:7.5.1
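Elasticsearch takes a few seconds to start accepting requests, so it can help to wait for port 9200 (the port published by the `docker run` command above) before launching the pipeline. The following is a minimal sketch that assumes `curl` is installed; `wait_for_url` is a hypothetical helper name.

```shell
# Hypothetical helper: poll a URL until it answers or the retries run out.
wait_for_url() {
  local url="$1"
  local retries="${2:-30}"
  local delay="${3:-2}"
  local i=0
  while [ "$i" -lt "$retries" ]; do
    # -s: silent, -f: treat HTTP error statuses as failures
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      echo "up: $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "no response from $url after $retries attempts" >&2
  return 1
}
```

Because the `docker run` command above stays in the foreground, a call such as `wait_for_url http://localhost:9200` would be run from a second terminal (or after starting the container in detached mode with docker's `-d` flag).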

### 4) Run the bedstat pipeline on the demo PEP
To run [bedstat](https://github.com/databio/bedstat) and the other pipelines required in this demo, we will rely on the pipeline submission engine [looper](http://looper.databio.org/en/latest/). For detailed instructions on how to link a project to a pipeline, click [here](http://looper.databio.org/en/latest/linking-a-pipeline/). If the pipeline is being run in an HPC environment where docker is not available, we recommend running the pipeline with the `--no-db-commit` flag (this will only calculate statistics and generate plots, without inserting this information into the local Elasticsearch cluster).

In [23]:
#looper run bedbase_demo_PEPs/bedstat_config.yaml --no-db-commit --compute local --limit 1 -R

looper run bedbase_demo_PEPs/bedstat_config.yaml --bedbase-config bedbase_demo_PEPs/bedbase_configuration.yaml \
--no-db-commit --compute local --limit 1 -R

Command: run (Looper version: 0.12.6)
Reading sample annotations sheet: '/home/jer4xy/Desktop/bedbase_tutorial/bedbase_demo_PEPs/bedstat_annotation_sheet.csv'
Storing sample table from file '/home/jer4xy/Desktop/bedbase_tutorial/bedbase_demo_PEPs/bedstat_annotation_sheet.csv'
Activating compute package 'local'
Finding pipelines for protocol(s): bedstat
Known protocols: bedstat
'/home/jer4xy/Desktop/bedbase_tutorial/bedbase_demo_PEPs/../bedstat/pipeline/bedstat.py' appears to attempt to run on import; does it lack a conditional on '__main__'? Using base type: Sample
## [1 of 15] bedbase_demo_db1 (bedstat)
Submission settings lack memory specification
Writing script to /home/jer4xy/Desktop/bedstat/bedstat_pipeline_logs/submission/bedstat_bedbase_demo_db1.sub
Job script (n=1; 0.00 Gb): ../bedstat/bedstat_pipeline_logs/submission/bedstat_bedbase_demo_db1.sub
Compute node: cphg-5L9SYF2
Start time: 2020-03-19 14:07:14
### Pipeline run code and environment:

*              Command:  

3: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chrUn_GL000224v1, chr17_GL000205v2_random, chrUn_GL000219v1, chrUn_GL000195v1, chrUn_GL000218v1, chr22_KI270733v1_random, chr1_KI270706v1_random, chrUn_GL000220v1, chrUn_GL000216v2, chr17_KI270729v1_random, chr1_KI270713v1_random
  - in 'y': chrCHR_HG107_PATCH, chrCHR_HG126_PATCH, chrCHR_HG1311_PATCH, chrCHR_HG1342_HG2282_PATCH, chrCHR_HG1362_PATCH, chrCHR_HG142_HG150_NOVEL_TEST, chrCHR_HG151_NOVEL_TEST, chrCHR_HG1832_PATCH, chrCHR_HG2021_PATCH, chrCHR_HG2023_PATCH, chrCHR_HG2030_PATCH, chrCHR_HG2058_PATCH, chrCHR_HG2063_PATCH, chrCHR_HG2066_PATCH, chrCHR_HG2072_PATCH, chrCHR_HG2095_PATCH, chrCHR_HG2104_PATCH, chrCHR_HG2116_PATCH, chrCHR_HG2191_PATCH, chrCHR_HG2213_PATCH, chrCHR_HG2217_PATCH, chrCHR_HG2232_PATCH, chrCHR_HG2233_PATCH, chrCHR_HG2235_PATCH, chrCHR_HG2239_PATCH, chrCHR_HG2247_PATCH, chrCHR_HG2288_HG2289_PATCH, chrCHR_HG2290_PATCH, chrCHR_HG2291_PATCH, chrCHR_HG

Once we have generated plots and statistics, we can insert them into our local Elasticsearch cluster by running the bedstat pipeline with the `--just-db-commit` flag:

In [None]:
#looper run bedbase_demo_PEPs/bedstat_config.yaml  --just-db-commit --compute local -R

looper run bedbase_demo_PEPs/bedstat_config.yaml --bedbase-config bedbase_demo_PEPs/bedbase_configuration.yaml \
--just-db-commit --compute local -R

After the previous steps have been executed, our BED files should be available for querying on our local Elasticsearch cluster. Files can be queried using the `bedbuncher` pipeline described in the next section.
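A quick way to confirm that the insertion worked is to ask Elasticsearch how many documents the BED index holds, using its `_count` endpoint. The sketch below assumes `curl` is installed and that the index is named `bed_index`, as in the bedbase configuration file shown earlier; the helper names `es_count_url` and `es_count` are hypothetical.

```shell
# Build the URL of the Elasticsearch _count endpoint for an index.
# Arguments: host, index name, optional port (defaults to 9200).
es_count_url() {
  local host="$1"
  local index="$2"
  local port="${3:-9200}"
  printf 'http://%s:%s/%s/_count\n' "$host" "$port" "$index"
}

# Print the JSON response, which reports the number of documents in the index.
es_count() {
  curl -s --max-time 10 "$(es_count_url "$@")"
}
```

Running `es_count localhost bed_index` should return a JSON object whose `count` field is greater than zero once at least one sample has been committed.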


## Second part of the tutorial (use bedbuncher to create bedsets)

### 1) Create a new PEP describing the bedset name and specific JSON query  
[bedbuncher](https://github.com/databio/bedbuncher) is a pipeline designed to create bedsets (sets of BED files retrieved from bedbase). To create bedsets, we will need an additional PEP describing the query as well as attributes such as the name assigned to the newly created bedset. The PEP's configuration file should point to the `JSON` query file. The sample table and the configuration file should have the following structure:

In [25]:
cat bedbase_demo_PEPs/bedbuncher_query.csv

sample_name,bedset_name,JSONquery_name,bbconfig_name,JSONquery_path,output_folder_path
bedset1,bedbase_demo_bedset,test_query,bedbase_configuration,source1,source2


In [26]:
cat bedbase_demo_PEPs/bedbuncher_config.yaml

metadata:
  sample_table: bedbuncher_query.csv
  output_dir: ../bedbuncher/bedbuncher_pipeline_logs
  pipeline_interfaces: ../bedbuncher/pipeline_interface.yaml 

derived_attributes: [JSONquery_path, bbconfig_path]
data_sources:
  source1: ../bedbuncher/tests/{JSONquery_name}.json
  source2: ./{bbconfig_name}.yaml
constant_attributes:
  protocol: "bedbuncher"


### 2) Download the bedbuncher pipeline 

To download `bedbuncher`, simply clone the repository from GitHub.

In [27]:
git clone git@github.com:databio/bedbuncher

Cloning into 'bedbuncher'...
remote: Enumerating objects: 39, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 235 (delta 22), reused 26 (delta 12), pack-reused 196
Receiving objects: 100% (235/235), 54.59 KiB | 1.24 MiB/s, done.
Resolving deltas: 100% (130/130), done.


One of the features of `bedbuncher` is the creation of an [iGD](https://github.com/databio/iGD) database from the files in the bedset. [iGD](https://github.com/databio/iGD) can be installed as follows:

In [None]:
git clone https://github.com/databio/iGD.git
cd iGD
make

### 3) Run the bedbuncher pipeline using Looper 

Once we have cloned the `bedbuncher` repository, we just need to point looper to the config file shown previously and pass the location of the `bedbase` configuration file to the `--bedbase-config` argument:

In [None]:
cd ..
looper run bedbase_demo_PEPs/bedbuncher_config.yaml --bedbase-config bedbase_demo_PEPs/bedbase_configuration.yaml --compute local

## Third part of the demo (run local instance of bedhost)

The last part of the tutorial consists of running a local instance of [bedhost](https://github.com/databio/bedhost/tree/master) (a REST API for the outputs produced by bedstat and bedbuncher) in order to explore and download output files. To access the API, we'll need to clone the GitHub repository and install it as follows:

In [30]:
git clone git@github.com:databio/bedhost
pip install bedhost/. --user

Cloning into 'bedhost'...
remote: Enumerating objects: 107, done.
remote: Counting objects: 100% (107/107), done.
remote: Compressing objects: 100% (77/77), done.
remote: Total 618 (delta 69), reused 65 (delta 29), pack-reused 511
Receiving objects: 100% (618/618), 171.60 KiB | 423.00 KiB/s, done.
Resolving deltas: 100% (402/402), done.
Processing ./bedhost
Collecting aiofiles (from bedhost==0.0.1)
  Downloading https://files.pythonhosted.org/packages/cf/f2/a67a23bc0bb61d88f82aa7fb84a2fb5f278becfbdc038c5cbb36c31feaf1/aiofiles-0.4.0-py3-none-any.whl
Collecting fastapi (from bedhost==0.0.1)
  Downloading https://files.pythonhosted.org/packages/18/5b/46c084c174fc69b2a7e1d9c22d014f39fb677d9a7635f24734ef56e0fb53/fastapi-0.52.0-py3-none-any.whl (47kB)
    100% |████████████████████████████████| 51kB 1.2MB/s ta 0:00:011
Collecting starlette (from bedhost==0.0.1)
  Downloading https://files.pythonhosted.org/packages/37/2e/f56602beda25b376bbaaeadb626cf212b673457075ffe

Then we need to run the following command, making sure to point to the previously described bedbase config.yaml file:

In [32]:
pip install itsdangerous --user
bedhost serve -c  bedbase_demo_PEPs/bedbase_configuration.yaml


Collecting itsdangerous
  Downloading https://files.pythonhosted.org/packages/76/ae/44b03b253d6fade317f32c24d100b3b35c2239807046a4c953c7b89fa49e/itsdangerous-1.1.0-py2.py3-none-any.whl
Installing collected packages: itsdangerous
Successfully installed itsdangerous-1.1.0
You are using pip version 18.0, however version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
DEBU 2020-03-19 14:37:24,007 | bedhost:est:266 > Configured logger 'bedhost' using logmuse v0.2.6 
DEBU 14:37:24 | bbconf:est:266 > Configured logger 'bbconf' using logmuse v0.2.6 
INFO 14:37:24 | bbconf:bbconf:61 > Established connection with Elasticsearch: localhost 
DEBU 14:37:24 | bbconf:bbconf:63 > Elasticsearch info:
{'name': '3c0f2923e411', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'nZZ-pE_5T-SB1lCM0E8dDg', 'version': {'number': '7.5.1', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '3ae9ac9a93c95bd0cdc054951cf95d88e1e18d96', 'build_dat


If we have stored the path to the bedbase config in the environment variable `$BEDBASE` (suggested), it's not necessary to specify the path to the config file when starting bedhost:

In [None]:
bedhost serve 
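
Once the server is running, a request from another terminal can confirm that it is reachable. The sketch below assumes `curl` is installed, that the server listens on port 8000 as set in the bedbase configuration file, and that bedhost keeps FastAPI's default interactive documentation route `/docs`; that last point is an assumption and the route may differ in your version. Both helper names are hypothetical.

```shell
# Hypothetical helper: print the HTTP status code returned by a URL.
http_status() {
  curl -s --max-time 10 -o /dev/null -w '%{http_code}\n' "$1"
}

# URL of the interactive API docs, assuming FastAPI's default /docs route.
# Arguments: optional host (default localhost) and port (default 8000).
bedhost_docs_url() {
  local host="${1:-localhost}"
  local port="${2:-8000}"
  printf 'http://%s:%s/docs\n' "$host" "$port"
}
```

With the server up, `http_status "$(bedhost_docs_url)"` should print `200` if the route exists; opening the same URL in a browser would then list the available endpoints.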