# BEDBASE workflow tutorial

This tutorial demonstrates how to process, analyze, visualize, and serve BED files. The process has 3 steps: First, individual BED files are analyzed using the [bedstat](https://github.com/databio/bedstat) pipeline. Second, BED files are grouped and then analyzed as groups using the [bedbuncher](https://github.com/databio/bedbuncher) pipeline. Finally, the BED files, along with statistics, plots, and grouping information, are served via a web interface and RESTful API using the [bedhost](https://github.com/databio/bedhost/tree/master) package.

**Glossary of terms:**

- *bedfile*: a tab-delimited file with one genomic region per line. Each genomic region is described by 3 required columns: chrom, start, and end.
- *bedset*: a collection of BED files grouped by a shared biological, experimental, or logical criterion.
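
To make the *bedfile* definition concrete, here is a minimal sketch that checks the three required columns with `awk`. The function name `check_bed` and the demo file are illustrative and not part of any bedbase tool. The sketch assumes tab-delimited, uncompressed input and treats `start` and `end` as non-negative integers with `start` not greater than `end`.

```shell
# Hypothetical helper: report lines that lack the 3 required BED columns.
# Exit status 0 means every line passed; 1 means at least one line failed.
check_bed() {
    awk -F '\t' '
        # Skip header-style lines that some BED files carry.
        /^(#|track|browser)/ { next }
        NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 > $3 + 0 {
            printf "line %d is not a valid region: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Demo on a throwaway file with one good and one bad region.
demo_dir=$(mktemp -d)
printf 'chr1\t100\t200\nchr2\t500\t50\n' > "$demo_dir/demo.bed"
check_bed "$demo_dir/demo.bed" || echo "demo.bed has invalid regions"
```

The files used in this tutorial are gzip-compressed, so they would need to be decompressed (for example with `gzip -dc`) before a check like this one.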


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Preparation" data-toc-modified-id="1.-Preparation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>1. Preparation</a></span></li><li><span><a href="#2.-BEDSTAT:-Generate-statistics-and-plots-of-BED-files" data-toc-modified-id="2.-BEDSTAT:-Generate-statistics-and-plots-of-BED-files-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>2. BEDSTAT: Generate statistics and plots of BED files</a></span><ul class="toc-item"><li><span><a href="#Get-a-PEP-describing-the-bedfiles-to-process" data-toc-modified-id="Get-a-PEP-describing-the-bedfiles-to-process-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Get a PEP describing the bedfiles to process</a></span></li><li><span><a href="#Install-bedstat-dependencies" data-toc-modified-id="Install-bedstat-dependencies-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Install bedstat dependencies</a></span></li><li><span><a href="#Inititiate-a-local-elasticsearch-cluster" data-toc-modified-id="Inititiate-a-local-elasticsearch-cluster-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Inititiate a local elasticsearch cluster</a></span></li><li><span><a href="#Run-bedstat--on-the-demo-PEP" data-toc-modified-id="Run-bedstat--on-the-demo-PEP-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Run bedstat  on the demo PEP</a></span></li></ul></li><li><span><a href="#3.-BEDBUNCHER:-Create-bedsets-and-their-respective-statistics" data-toc-modified-id="3.-BEDBUNCHER:-Create-bedsets-and-their-respective-statistics-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>3. 
BEDBUNCHER: Create bedsets and their respective statistics</a></span><ul class="toc-item"><li><span><a href="#Create-a-new-PEP-describing-the-bedset-name-and-specific-JSON-query" data-toc-modified-id="Create-a-new-PEP-describing-the-bedset-name-and-specific-JSON-query-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Create a new PEP describing the bedset name and specific JSON query</a></span></li><li><span><a href="#Create-outputs-directory-and-install-bedbuncher-CML-dependencies" data-toc-modified-id="Create-outputs-directory-and-install-bedbuncher-CML-dependencies-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Create outputs directory and install bedbuncher CML dependencies</a></span></li><li><span><a href="#Run-bedbuncher-using-Looper" data-toc-modified-id="Run-bedbuncher-using-Looper-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Run bedbuncher using Looper</a></span></li></ul></li><li><span><a href="#4.-BEDHOST:--Serve-BED-files-and-API-to-explore-pipeline-outputs" data-toc-modified-id="4.-BEDHOST:--Serve-BED-files-and-API-to-explore-pipeline-outputs-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>4. BEDHOST:  Serve BED files and API to explore pipeline outputs</a></span></li></ul></div>

## 1. Preparation 

First, we will create a tutorial directory where we'll store the bedbase pipelines and files to be processed. We'll also need to create an environment variable that points to the tutorial directory (we'll need this variable later). 

In [1]:
cd ../..
pwd

/home/bnt4me/Virginia/bed_maker_new


In [2]:
mkdir bedbase_tutorial
cd bedbase_tutorial
export BBTUTORIAL=`pwd`
export BEDBASE_DATA_PATH_HOST=`pwd`
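
Because later steps rely on these two variables, it can help to confirm that they are set before moving on. The following is a small sketch, not part of the bedbase tooling; `require_dir_var` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the named variable holds an existing directory.
require_dir_var() (
    name=$1
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "$name is not set" >&2
        exit 1
    fi
    if [ ! -d "$value" ]; then
        echo "$name points to a missing directory: $value" >&2
        exit 1
    fi
    echo "$name=$value"
)

# Check both variables used throughout the tutorial.
for var in BBTUTORIAL BEDBASE_DATA_PATH_HOST; do
    require_dir_var "$var" || echo "Set $var before continuing"
done
```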

Download some example BED files:

In [3]:
wget http://big.databio.org/example_data/bedbase_tutorial/bed_files.tar.gz

--2021-11-11 10:36:42--  http://big.databio.org/example_data/bedbase_tutorial/bed_files.tar.gz
Resolving big.databio.org (big.databio.org)... 128.143.245.182, 128.143.245.181
Connecting to big.databio.org (big.databio.org)|128.143.245.182|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44549692 (42M) [application/octet-stream]
Saving to: ‘bed_files.tar.gz’


2021-11-11 10:36:47 (9.54 MB/s) - ‘bed_files.tar.gz’ saved [44549692/44549692]



The downloaded files are compressed so we'll need to untar them:

In [4]:
tar -zxvf bed_files.tar.gz
rm -rf bed_files.tar.gz

bed_files/
bed_files/GSE105587_ENCFF018NNF_conservative_idr_thresholded_peaks_GRCh38.bed.gz
bed_files/GSM2423312_ENCFF155HVK_peaks_GRCh38.bed.gz
bed_files/GSE105977_ENCFF617QGK_optimal_idr_thresholded_peaks_GRCh38.bed.gz
bed_files/GSE91663_ENCFF316ASR_peaks_GRCh38.bed.gz
bed_files/GSM2423313_ENCFF722AOG_peaks_GRCh38.bed.gz
bed_files/GSM2827349_ENCFF196DNQ_peaks_GRCh38.bed.gz
bed_files/GSE91663_ENCFF553KIK_optimal_idr_thresholded_peaks_GRCh38.bed.gz
bed_files/GSE91663_ENCFF319TPR_conservative_idr_thresholded_peaks_GRCh38.bed.gz
bed_files/GSE105977_ENCFF937CGY_peaks_GRCh38.bed.gz
bed_files/GSM2827350_ENCFF928JXU_peaks_GRCh38.bed.gz
bed_files/GSE105977_ENCFF793SZW_conservative_idr_thresholded_peaks_GRCh38.bed.gz
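
As an optional sanity check after extraction, we can count the archives that pass `gzip -t`. This is a sketch under the assumption that the files keep the `.bed.gz` suffix shown in the listing above; `count_intact_bed_gz` is a hypothetical helper name.

```shell
# Hypothetical helper: print how many *.bed.gz files in a directory pass `gzip -t`.
count_intact_bed_gz() (
    dir=$1
    count=0
    for f in "$dir"/*.bed.gz; do
        # Skip the unexpanded pattern when the directory has no matches.
        [ -e "$f" ] || continue
        if gzip -t "$f" 2>/dev/null; then
            count=$((count + 1))
        else
            echo "corrupt archive: $f" >&2
        fi
    done
    echo "$count"
)

# Run the check only when the tutorial directory is present.
if [ -d bed_files ]; then
    count_intact_bed_gz bed_files
fi
```

The listing above contains 11 files, so a lower number would suggest an incomplete download or extraction.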


Additionally, we'll download a matrix we need to provide if we wish to plot the tissue specificity of our set of genomic ranges:

In [5]:
wget http://big.databio.org/open_chromatin_matrix/openSignalMatrix_hg38_percentile99_01_quantNormalized_round4d.txt.gz

--2021-11-11 10:36:56--  http://big.databio.org/open_chromatin_matrix/openSignalMatrix_hg38_percentile99_01_quantNormalized_round4d.txt.gz
Resolving big.databio.org (big.databio.org)... 128.143.245.181, 128.143.245.182
Connecting to big.databio.org (big.databio.org)|128.143.245.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 380989260 (363M) [application/octet-stream]
Saving to: ‘openSignalMatrix_hg38_percentile99_01_quantNormalized_round4d.txt.gz’


2021-11-11 10:39:14 (2.65 MB/s) - ‘openSignalMatrix_hg38_percentile99_01_quantNormalized_round4d.txt.gz’ saved [380989260/380989260]



Lastly, we'll download the core pipelines and tools needed to complete this tutorial: `bedbase`, `bedstat`, `bedbuncher`, `bedhost`, and `bedhost-ui`:

In [6]:
git clone -b master git@github.com:Khoroshevskyi/bedbase.git
git clone -b validate_genome_assembly git@github.com:databio/bedstat.git
git clone -b validate_genome_assembly git@github.com:databio/bedbuncher
git clone -b dev git@github.com:databio/bedhost
git clone -b dev git@github.com:databio/bedhost-ui

Cloning into 'bedbase'...
remote: Enumerating objects: 380, done.
remote: Counting objects: 100% (380/380), done.
remote: Compressing objects: 100% (242/242), done.
remote: Total 380 (delta 179), reused 281 (delta 100), pack-reused 0
Receiving objects: 100% (380/380), 544.83 KiB | 8.51 MiB/s, done.
Resolving deltas: 100% (179/179), done.
Cloning into 'bedstat'...
remote: Enumerating objects: 750, done.
remote: Counting objects: 100% (260/260), done.
remote: Compressing objects: 100% (103/103), done.
remote: Total 750 (delta 120), reused 221 (delta 98), pack-reused 490
Receiving objects: 100% (750/750), 165.38 KiB | 5.91 MiB/s, done.
Resolving deltas: 100% (351/351), done.
Cloning into 'bedbuncher'...
remote: Enumerating objects: 611, done.
remote: Counting objects: 100% (143/143), done.
remote: Compressing objects: 100% (73/73), done.
remote: Total 611 (delta 66), reused 105 (delta 37), pack-reused 468
Receiving objects: 100% (611/611), 113.48 KiB | 

## 2. BEDSTAT: Generate statistics and plots of BED files 

### Get a PEP describing the bedfiles to process

The first step is to process the BED files with the `bedstat` pipeline, which computes statistics and makes plots for each individual BED file. To begin, we need annotation information for the BED files we want to load. We'll use the standard [PEP](https://pepkit.github.io/) format, which consists of 1) a sample table (.csv) that annotates the files, and 2) a project config.yaml file that points to the sample table. The config file also has other components, such as derived attributes, which in this case build the paths to the files to be processed. Here is the PEP config file for this example project. It references the annotation sheet and uses derived attributes such as `output_file_path` (the `.bed.gz` file) and `yaml_file` (the per-sample YAML file) to construct those paths.

In [7]:
cat bedbase/tutorial_files/PEPs/bedstat_config.yaml

pep_version: 2.0.0
sample_table: bedstat_annotation_sheet.csv

looper:
    output-dir: $BEDBASE_DATA_PATH_HOST/outputs/bedstat_output/bedstat_pipeline_logs

sample_modifiers:
  append:
    bedbase_config: $BEDBASE_DATA_PATH_HOST/bedbase/tutorial_files/bedbase_configuration_compose.yaml
    pipeline_interfaces: $BEDBASE_DATA_PATH_HOST/bedstat/pipeline_interface.yaml
    output_file_path: OUTPUT
    yaml_file: SAMPLE_YAML
    open_signal_matrix: MATRIX
    bigbed:  BIGBED
  derive:
    attributes: [output_file_path, yaml_file, open_signal_matrix, bigbed]
    sources:
      OUTPUT: "$BEDBASE_DATA_PATH_HOST/bed_files/{file_name}"
      SAMPLE_YAML: "$BEDBASE_DATA_PATH_HOST/outputs/bedstat_output/bedstat_pipeline_logs/submission/{sample_name}_sample.yaml"
      MATRIX: "$BEDBASE_DATA_PATH_HOST/openSignalMatrix_{genome}_percentile99_01_quantNormalized_round4d.txt.gz"
      BIGBED: "$BEDBASE_DATA_PATH_HOST/bigbed_files"
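
To illustrate what a derived attribute does, the sketch below substitutes a sample attribute into one of the `sources` templates from the config above. It only mimics the idea of the substitution; it is not how the PEP tooling is implemented, and `expand_source` is a hypothetical helper name. Environment variables such as `$BEDBASE_DATA_PATH_HOST` are deliberately left untouched here.

```shell
# Hypothetical helper: replace every {key} placeholder in a template with a value.
expand_source() (
    template=$1
    key=$2
    value=$3
    awk -v t="$template" -v k="{$key}" -v v="$value" 'BEGIN {
        out = ""
        # Walk through the template, copying text and swapping each placeholder.
        while ((i = index(t, k)) > 0) {
            out = out substr(t, 1, i - 1) v
            t = substr(t, i + length(k))
        }
        print out t
    }'
)

# Single quotes keep the variable reference literal, as it appears in the YAML file.
# The file name here is a made-up example value for the file_name attribute.
expand_source '$BEDBASE_DATA_PATH_HOST/bed_files/{file_name}' file_name demo_peaks.bed.gz
```

With the `OUTPUT` source, each sample's `file_name` value from the annotation sheet ends up as the last path component of `output_file_path`.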


### Install bedstat dependencies

`bedstat` is a [pypiper](http://code.databio.org/pypiper/) pipeline that generates statistics and plots of bedfiles. Additionally, `bedstat` uses [bbconf](https://github.com/databio/bbconf), the bedbase configuration manager, which implements convenience methods for interacting with the PostgreSQL database where our file metadata will be placed. These Python dependencies can be installed as follows:

In [8]:
pip install -r bedstat/requirements/requirements-all.txt  --user > requirements_log.txt

Next, install the R dependencies:

In [9]:
Rscript bedstat/scripts/installRdeps.R > R_deps.txt

Loading required package: R.utils
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.1 (2020-08-26 16:20:06 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.24.0 (2020-08-26 16:11:58 UTC) successfully loaded. See ?R.oo for help.

Attaching package: ‘R.oo’

The following object is masked from ‘package:R.methodsS3’:

    throw

The following objects are masked from ‘package:methods’:

    getClasses, getMethods

The following objects are masked from ‘package:base’:

    attach, detach, load, save

R.utils v2.11.0 (2021-09-26 08:30:02 UTC) successfully loaded. See ?R.utils for help.

Attaching package: ‘R.utils’

The following object is masked from ‘package:utils’:

    timestamp

The following objects are masked from ‘package:base’:

    cat, commandArgs, getOption, inherits, isOpen, nullfile, parse,

Loading required package: BiocManager
Bioconductor version '3.13' is out-of-date; the current release version '3.14'
  is available with R ver

`bedstat` needs an additional dependency if we wish to calculate and plot the GC content of our bedfiles. Depending on the genome assemblies of the files listed in a PEP, the appropriate BSgenome packages should be installed. The following is an example of how we can do so:

In [10]:
cat bedbase/tutorial_files/scripts/BSgenome_install.R

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("BSgenome.Hsapiens.UCSC.hg38.masked")

In [11]:
Rscript bedbase/tutorial_files/scripts/BSgenome_install.R > BSgenome.txt

'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: https://cloud.r-project.org

Bioconductor version 3.13 (BiocManager 1.30.16), R 4.1.1 (2021-08-10)
Installation paths not writeable, unable to update packages
  path: /usr/lib/R/library
  packages:
    nlme, spatial
Old packages: 'GenomicDistributionsData', 'gert'
package(s) not installed when version(s) same as current; use `force = TRUE` to
  re-install: 'BSgenome.Hsapiens.UCSC.hg38.masked' 


We'll need to create a directory where we can store the stats and plots generated by `bedstat`. Additionally, we'll create a directory where we can store log and metadata files that we'll need later on.

In [12]:
mkdir -p outputs/bedstat_output/bedstat_pipeline_logs

In order to use `bbconf`, we'll need to create a minimal configuration.yaml file. The path to this configuration file can be stored in the environment variable `$BEDBASE`.

In [13]:
cat bedbase/tutorial_files/bedbase_configuration_compose.yaml

path:
  pipeline_output_path: $BEDBASE_DATA_PATH_HOST/outputs
  bedstat_dir: bedstat_output
  bedbuncher_dir: bedbuncher_output
  remote_url_base: null
database:
  host: $DB_HOST_URL
  port: $POSTGRES_PORT
  password: $POSTGRES_PASSWORD
  user: $POSTGRES_USER
  name: $POSTGRES_DB
  dialect: postgresql
  driver: psycopg2
server:
  host: 0.0.0.0
  port: 8000
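
The configuration above refers to several environment variables. Before running the pipelines, we can list any that are still unset. This is a sketch that assumes variables appear in the file as plain `$NAME` references; `list_unset_vars` is a hypothetical helper name.

```shell
# Hypothetical helper: print each $VARIABLE named in a file that is unset or empty.
list_unset_vars() (
    config=$1
    for name in $(grep -o '\$[A-Za-z_][A-Za-z0-9_]*' "$config" | tr -d '$' | sort -u); do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
)

# Run the check only when the tutorial file is present.
config=bedbase/tutorial_files/bedbase_configuration_compose.yaml
if [ -f "$config" ]; then
    list_unset_vars "$config"
fi
```

An empty result means every variable referenced in the file has a value in the current shell.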


### Initiate a local PostgreSQL instance

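In addition to generating statistics and plots, `bedstat` inserts JSON-formatted metadata into a relational [PostgreSQL](https://www.postgresql.org/) database.

If you don't have Docker installed, on Debian or Ubuntu you can install it with `sudo apt-get update && sudo apt-get install -y docker.io` (package names differ between distributions; see the Docker installation documentation for other systems).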

Now, create a persistent volume to house PostgreSQL data:

In [14]:
docker volume create postgres-data

postgres-data


Spin up a `postgres` container. Provide the required environment variables (they need to match the settings in the bedbase configuration file) and bind the created Docker volume to the `/var/lib/postgresql/data` path in the container:

In [15]:
docker run -d --name bedbase-postgres -p 5432:5432 -e POSTGRES_PASSWORD=bedbasepassword -e POSTGRES_USER=postgres -e POSTGRES_DB=postgres -v postgres-data:/var/lib/postgresql/data postgres:13

Unable to find image 'postgres:13' locally
13: Pulling from library/postgres

c13d9b9b: Pulling fs layer
f9d5f5fe: Pulling fs layer
a7a559cb: Pulling fs layer
fd845683: Pulling fs layer
331369f5: Pulling fs layer
54a3ac3a: Pulling fs layer
df00588c: Pulling fs layer
b2e86480: Pulling fs layer
2927c0d9: Pulling fs layer
54a3ac3a: Waiting fs layer
df00588c: Waiting fs layer
2927c0d9: Waiting fs layer
Digest: sha256:1adb50e5c24f550a9e68457a2ce60e9e4103dfc43c3b36e98310168165b443a1
Status: Downloaded newer image for postgres:13
a1c86207e532a4ee5a1f1fb5c2801a9ff21da611faf30c7673dff5b5847ef362


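The `-e` flags above set these variables only inside the container. The bedbase configuration file refers to the same settings through environment variables, so export matching values in your shell as well: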

In [16]:
export DB_HOST_URL='localhost'
export POSTGRES_PORT=5432
export POSTGRES_PASSWORD='bedbasepassword'
export POSTGRES_USER='postgres'
export POSTGRES_DB='postgres'
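
To see how these settings fit together, the sketch below assembles them into a SQLAlchemy-style database URL (`dialect+driver://user:password@host:port/database`), using the `dialect` and `driver` values from the configuration file. Whether `bbconf` builds its connection string in exactly this way is an assumption; `db_url` is a hypothetical helper name.

```shell
# Hypothetical helper: assemble a SQLAlchemy-style URL from the exported settings.
# Unset variables are rendered as empty strings.
db_url() {
    printf '%s+%s://%s:%s@%s:%s/%s\n' \
        postgresql psycopg2 \
        "${POSTGRES_USER:-}" "${POSTGRES_PASSWORD:-}" \
        "${DB_HOST_URL:-}" "${POSTGRES_PORT:-}" "${POSTGRES_DB:-}"
}

# Print the URL for the current shell (note that it includes the password).
db_url
```

A URL with empty fields is a quick hint that one of the exports above is missing from the current shell.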

### Run bedstat  on the demo PEP
To run `bedstat` and the other required pipelines in this tutorial, we will rely on the pipeline submission engine [looper](http://looper.databio.org/en/latest/), which can be installed as follows:

In [17]:
pip install looper --user



In order to establish a modular connection between a project and a pipeline, we'll need to create a [pipeline interface](http://looper.databio.org/en/latest/linking-a-pipeline/) file, which tells looper how to run the pipeline. 

In [18]:
cat bedstat/pipeline_interface.yaml

pipeline_name: BEDSTAT
pipeline_type: sample
var_templates:
  path: "{looper.piface_dir}/pipeline/bedstat.py"
input_schema: http://schema.databio.org/pipelines/bedstat.yaml
pre_submit:
  python_functions:
    - looper.write_sample_yaml
command_template: >
  {pipeline.var_templates.path}
  --bedfile {sample.output_file_path}
  --genome {sample.genome}
  --sample-yaml {sample.yaml_file}
  {% if sample.bedbase_config is defined %} --bedbase-config {sample.bedbase_config} {% endif %}
  {% if sample.open_signal_matrix is defined %} --open-signal-matrix {sample.open_signal_matrix} {% endif %}
  {% if sample.bigbed is defined %} --bigbed {sample.bigbed} {% endif %}


Next, open a new terminal window and run these commands to create a new schema:

1. `docker exec -it bedbase-postgres bash` (`bedbase-postgres` is the container name set in the `docker run` command above; the container ID works as well)
2. `psql -U postgres`
3. `CREATE SCHEMA postgres;`

Then close the `psql` session and the container shell by typing `exit` in each.


Once we have properly linked our project to the pipeline of interest, in this case `bedstat`, we simply need to point the `looper run` command to our `PEP` config file. Additionally, if the bedbase configuration file location is not stored in the `$BEDBASE` variable, we can pass it to `looper` as an additional argument:

In [24]:
looper run bedbase/tutorial_files/PEPs/bedstat_config.yaml --package local \
--command-extra="-R" > outputs/bedstat_output/bedstat_pipeline_logs/looper_logs.txt

Looper version: 1.3.1
Command: run
  os.path.dirname(self._file_path),
  self.config_file = self._file_path
Activating compute package 'local'
## [1 of 11] sample: bedbase_demo_db1; pipeline: BEDSTAT
Calling pre-submit function: looper.write_sample_yaml
Writing script to /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/outputs/bedstat_output/bedstat_pipeline_logs/submission/BEDSTAT_bedbase_demo_db1.sub
Job script (n=1; 0.00Gb): /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/outputs/bedstat_output/bedstat_pipeline_logs/submission/BEDSTAT_bedbase_demo_db1.sub
## [2 of 11] sample: bedbase_demo_db2; pipeline: BEDSTAT
Calling pre-submit function: looper.write_sample_yaml
Writing script to /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/outputs/bedstat_output/bedstat_pipeline_logs/submission/BEDSTAT_bedbase_demo_db2.sub
Job script (n=1; 0.00Gb): /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/outputs/bedstat_output/bedstat_pipeline_logs/submission/BEDSTA

If you see the error `ImportError: cannot import name 'get_logger' from 'peppy.utils'`, reinstall looper.

To check that `bedstat` ran successfully, open the looper log and confirm that it contains no errors. The log file is located at `bedbase_tutorial/outputs/bedstat_output/bedstat_pipeline_logs/looper_logs.txt`.
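
Scanning the log by eye works, but a quick count of suspicious lines can be faster. The sketch below is only a heuristic: the search pattern is our assumption about what a failure looks like, and it may also flag harmless lines that merely contain the word "error". `count_log_errors` is a hypothetical helper name.

```shell
# Hypothetical helper: count log lines that mention an error or a Python traceback.
count_log_errors() {
    # grep -c prints 0 and exits with status 1 when nothing matches, so mask the status.
    grep -ciE 'error|traceback' "$1" || true
}

# Run the check only when the tutorial log is present.
log=outputs/bedstat_output/bedstat_pipeline_logs/looper_logs.txt
if [ -f "$log" ]; then
    echo "suspicious lines in $log: $(count_log_errors "$log")"
fi
```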

For reference, we can inspect how `bedstat` operates on each bedfile:

In [19]:
head outputs/bedstat_output/bedstat_pipeline_logs/looper_logs.txt

Compute node: cphg-Precision-5560
Start time: 2021-11-10 10:43:39
### Pipeline run code and environment:

*              Command:  `/home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/bedstat/pipeline/bedstat.py --bedfile /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/bed_files/GSE105587_ENCFF018NNF_conservative_idr_thresholded_peaks_GRCh38.bed.gz --genome hg38 --sample-yaml /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/outputs/bedstat_output/bedstat_pipeline_logs/submission/bedbase_demo_db1_sample.yaml --bedbase-config /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/bedbase/tutorial_files/bedbase_configuration_compose.yaml --open-signal-matrix /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/openSignalMatrix_hg38_percentile99_01_quantNormalized_round4d.txt.gz --bigbed /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/bigbed_files -R`
*         Compute host:  cphg-Precision-5560
*          Working dir:  /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial
*   

After the previous steps have been executed, our bedfiles should be available for query in our local PostgreSQL database. Files can be queried using the `bedbuncher` pipeline described in the next section.


## 3. BEDBUNCHER: Create bedsets and their respective statistics 

### Create a new PEP describing the bedset name and specific JSON query 

Now that we've processed several individual BED files, we'll turn to the next task: grouping them together into collections of BED files, which we call *bedsets*. For this, we use the `bedbuncher` pipeline, which produces outputs for each bedset, such as a bedset PEP, bedset-level statistics and plots, and an `IGD` database. To run `bedbuncher`, we will need another PEP describing each bedset. Each row of the annotation sheet below describes one bedset, and you can create as many as you wish by adding rows. For each bedset, you need to provide the query that retrieves the desired collection of BED files.

The following example PEP shows the attributes we need to provide for each bedset and the config.yaml file that will grab the files needed to run `bedbuncher`:

In [25]:
cat bedbase/tutorial_files/PEPs/bedbuncher_query.csv

sample_name,bedset_name,genome,query,operator,query_val,bbconfig_name,bedbase_config
sample1,bedsetOver1kRegions,hg38,'regions_no',gt,"""1000""",bedbase_configuration_compose,source1
sample2,bedsetOver50GCContent,hg38,'gc_content',gt,"""0.5""",bedbase_configuration_compose,source1
sample3,bedsetUnder500MeanWidth,hg38,'mean_region_width',lt,"""500""",bedbase_configuration_compose,source1
sample4,bedsetTestSelectCellType,hg38,"""other::text~~:str_1 or other::text~~:str_2""","""str_1,str_2""","""%GM12878%,%HEK293%""",bedbase_configuration_compose,source1
sample5,bedsetTestSelectGenome,hg38,"""name=:name_1 or name=:name_2""","""name_1,name_2""","""GSE105587_ENCFF018NNF_conservative_idr_thresholded_peaks_GRCh38,GSE91663_ENCFF553KIK_optimal_idr_thresholded_peaks_GRCh38""",bedbase_configuration_compose,source1
sample6,bedsetTestCellType,hg38,"""other""",contains,"""""{\""cell_type\"":\ \""K562\""}""""",bedbase_configuration_compose,source1
sample7,bedsetTestSpace,hg38,"""other""",contains,"""

In [21]:
cat bedbase/tutorial_files/PEPs/bedbuncher_config.yaml

pep_version: 2.0.0
sample_table: bedbuncher_query.csv

looper:
    output_dir: $BBTUTORIAL/outputs/bedbuncher_output/bedbuncher_pipeline_logs
    piface_dir: $BBTUTORIAL/bedbuncher

sample_modifiers:
  append:
    pipeline_interfaces: $BBTUTORIAL/bedbuncher/pipeline_interface.yaml 
  derive:
    attributes: [bedbase_config]
    sources:
      source1: $BBTUTORIAL/bedbase/tutorial_files/{bbconfig_name}.yaml


For example, the first row (`sample1`) of the annotation sheet above results in a bedset made up of the bedfiles that have more than 1000 regions.
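
To get a feel for the "more than 1000 regions" criterion, the sketch below counts the lines of each gzipped BED file and lists those above a threshold. This only illustrates the criterion: `bedbuncher` evaluates the query (for example on `regions_no`) against the database rather than counting lines, and any header lines in a file would inflate the count here. Both helper names are hypothetical.

```shell
# Hypothetical helper: print the number of lines (regions) in a gzipped BED file.
count_regions() {
    gzip -dc "$1" | wc -l | tr -d ' '
}

# Hypothetical helper: list the *.bed.gz files in a directory with more than N regions.
files_over() (
    dir=$1
    threshold=$2
    for f in "$dir"/*.bed.gz; do
        # Skip the unexpanded pattern when the directory has no matches.
        [ -e "$f" ] || continue
        if [ "$(count_regions "$f")" -gt "$threshold" ]; then
            basename "$f"
        fi
    done
)

# Run the check only when the tutorial directory is present.
if [ -d bed_files ]; then
    files_over bed_files 1000
fi
```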

### Create outputs directory and install bedbuncher command line dependencies

We need a folder where we can store bedset related outputs. Though not required, we'll also create a directory where we can store the `bedbuncher` pipeline logs. 

In [22]:
mkdir -p outputs/bedbuncher_output/bedbuncher_pipeline_logs

One of the features of `bedbuncher` is the creation of an [IGD](https://github.com/databio/IGD) database from the files in the bedset. `IGD` can be installed by cloning the repository from GitHub, running `make` to build the binary, and adding the binary's location to the `$PATH` environment variable.

In [23]:
git clone git@github.com:databio/IGD
cd IGD
make > igd_make_log.txt 2>&1
cd ..

export PATH=$BBTUTORIAL/IGD/bin/:$PATH

Cloning into 'IGD'...
remote: Enumerating objects: 1297, done.
remote: Counting objects: 100% (67/67), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 1297 (delta 35), reused 40 (delta 17), pack-reused 1230
Receiving objects: 100% (1297/1297), 949.45 KiB | 9.31 MiB/s, done.
Resolving deltas: 100% (804/804), done.
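
After updating `$PATH`, it is worth confirming that the shell can actually find the new binary. The sketch below assumes the build produces an executable named `igd`; that name is our assumption, so adjust it if the binary is called something else. `require_cmd` is a hypothetical helper name.

```shell
# Hypothetical helper: print the resolved path of a command, or fail if it is not on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        command -v "$1"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}

# The binary name is an assumption; the message is shown when it cannot be found.
require_cmd igd || echo "Add the IGD bin directory to PATH and try again"
```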


### Run bedbuncher using Looper 

Once we have cloned the `bedbuncher` repository, set up our local PostgreSQL database, and built the `IGD` binary, we can run the pipeline by pointing `looper run` to the appropriate `PEP` config file. As mentioned earlier, if the path to the bedbase configuration file has been stored in the `$BEDBASE` environment variable, it's not necessary to pass the `--bedbase-config` argument.

In [26]:
looper run  bedbase/tutorial_files/PEPs/bedbuncher_config.yaml  --package local \
--command-extra="-R" > outputs/bedbuncher_output/bedbuncher_pipeline_logs/looper_logs.txt

Looper version: 1.3.1
Command: run
  os.path.dirname(self._file_path),
  self.config_file = self._file_path
Activating compute package 'local'
## [1 of 10] sample: sample1; pipeline: BEDBUNCHER
Writing script to /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/outputs/bedbuncher_output/bedbuncher_pipeline_logs/submission/BEDBUNCHER_sample1.sub
Job script (n=1; 0.00Gb): /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/outputs/bedbuncher_output/bedbuncher_pipeline_logs/submission/BEDBUNCHER_sample1.sub
## [2 of 10] sample: sample2; pipeline: BEDBUNCHER
Writing script to /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/outputs/bedbuncher_output/bedbuncher_pipeline_logs/submission/BEDBUNCHER_sample2.sub
Job script (n=1; 0.00Gb): /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/outputs/bedbuncher_output/bedbuncher_pipeline_logs/submission/BEDBUNCHER_sample2.sub
## [3 of 10] sample: sample3; pipeline: BEDBUNCHER
Writing script to /home/bnt4me/Virgin

## 4. BEDHOST:  Serve BED files and API to explore pipeline outputs

The last part of the tutorial consists of running a local instance of `bedhost` (a REST API for the outputs produced by `bedstat` and `bedbuncher`) in order to explore plots and statistics and to download pipeline outputs.
To run `bedhost`, first use `bedhost-ui` to build the bedhost user interface with React.

In [35]:
cd bedhost-ui
# Install node modules defined in package.json
npm install 
# Build the app for production to the ./build folder
npm run build
# copy the contents of the ./build directory to bedhost/bedhost/static/bedhost-ui
cp -avr ./build ../bedhost/bedhost/static/bedhost-ui

cd ..



> core-js@2.6.11 postinstall /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/bedhost-ui/node_modules/babel-runtime/node_modules/core-js
> node -e "try{require('./postinstall')}catch(e){}"

Thank you for using core-js ( https://github.com/zloirock/core-js ) for polyfilling JavaScript standard library!

The project needs your help! Please consider supporting of core-js on Open Collective or Patreon:
> https://opencollective.com/core-js
> https://www.patreon.com/zloirock

Also, the author of core-js ( https://github.com/zloirock ) is looking for a good job -)


> core-js@3.6.5 postinstall /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/bedhost-ui/node_modules/core-js
> node -e "try{require('./postinstall')}catch(e){}"


> core-js-pure@3.6.5 postinstall /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/bedhost-ui/node_modules/core-js-pure
> node -e "try{require('./postinstall')}catch(e){}"


To run `bedhost`, we'll pip install the package from the previously cloned repository:

In [36]:
pip install bedhost/. --user > bedhost_log.txt

In [158]:
cat $BBTUTORIAL/bedbase/tutorial_files/bedhost_cofig.yaml

path:
  pipeline_output_path: /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/bed_files
  bedstat_dir: /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/outputs/bedstat_output/
  bedbuncher_dir: /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/outputs/bedbuncher_output/

database:
  port: $POSTGRES_PORT
  password: $POSTGRES_PASSWORD
  user: $POSTGRES_USER
  name: $POSTGRES_DB
  dialect: postgresql
  driver: psycopg2

server:
  host: 0.0.0.0
  port: 8000

In [159]:
cd bedhost

To start `bedhost`, we simply need to run the following command passing the location of the bedbase configuration file to the `-c` flag.  

In [38]:
bedhost serve -c  $BBTUTORIAL/bedbase/tutorial_files/bedbase_configuration_compose.yaml

Traceback (most recent call last):
  File "/home/bnt4me/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1799, in _execute_context
    self.dialect.do_execute(
  File "/home/bnt4me/.local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UndefinedColumn: column bedfiles.genome does not exist
LINE 1: ...edfiles_name, bedfiles.md5sum AS bedfiles_md5sum, bedfiles.g...
                                                             ^
HINT:  Perhaps you meant to reference the column "bedfiles.name".


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/bnt4me/.local/bin/bedhost", line 8, in <module>
    sys.exit(main())
  File "/home/bnt4me/.local/lib/python3.8/site-packages/bedhost/main.py", line 68, in main
    bbc = BedBaseConf(bbconf.get_bedbase_cfg(args.config))
  File "/home/bnt4me/.local/lib/python3.8/site-packag


If we have stored the path to the bedbase config in the environment variable `$BEDBASE` (suggested), it's not necessary to use this flag.

In [105]:
bedhost serve 

Traceback (most recent call last):
  File "/home/bnt4me/.local/bin/bedhost", line 8, in <module>
    sys.exit(main())
  File "/home/bnt4me/.local/lib/python3.8/site-packages/bedhost/main.py", line 68, in main
    bbc = BedBaseConf(bbconf.get_bedbase_cfg(args.config))
  File "/home/bnt4me/.local/lib/python3.8/site-packages/bbconf/helpers.py", line 24, in get_bedbase_cfg
    raise BedBaseConnectionError(
bbconf.exceptions.BedBaseConnectionError: You must provide a config file or set the BEDBASE environment variable



Once the server is running, the `bedhost` API can be opened at [http://0.0.0.0:8000](http://0.0.0.0:8000). We can now explore the plots and statistics generated by the `bedstat` and `bedbuncher` pipelines.

## Or optionally run BEDHOST using containers

Alternatively, you can run the application inside a container.

For that we'll use [docker compose](https://docs.docker.com/compose/), a tool for defining and running multi-container Docker applications. The `docker-compose.yaml` file defines two services:
- `fastapi-api`: runs the FastAPI server
- `postgres-db`: runs the PostgreSQL database used by the server


In [36]:
cd $BBTUTORIAL

Use the `BEDBASE_DATA_PATH_HOST` environment variable to point to the host directory with the pipeline results that will be mounted in the container as a volume. 

The environment variables are passed to the container via a `.env` file, which the `docker-compose.yaml` points to for each service. Alternatively, you can export the environment variables before issuing the `docker-compose` command.
When you set the same environment variable in multiple files, Compose uses the following priority to choose which value to use:

1. Compose file
2. Shell environment variables
3. Environment file
4. Dockerfile
5. Variable is not defined
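
Part of this ordering can be mimicked in a few lines of shell: a value exported in the shell wins over one in the environment file, and a variable found in neither place resolves to nothing. This is a simplified sketch, not Compose's actual implementation. It ignores the Compose file and Dockerfile levels, treats an empty shell value as unset, and does not handle quoting inside `.env` files. `resolve_var` is a hypothetical helper name.

```shell
# Hypothetical helper: resolve NAME from the shell first, then from an env file.
resolve_var() (
    name=$1
    env_file=$2
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "from_shell=\${$name:-}"
    if [ -n "$from_shell" ]; then
        echo "$from_shell"
        exit 0
    fi
    if [ -f "$env_file" ]; then
        # Use the last NAME=value line in the file, if there is one.
        sed -n "s/^$name=//p" "$env_file" | tail -n 1
    fi
)

# Show which value would be used for the database host in the current directory.
resolve_var DB_HOST_URL .env
```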

In [37]:
export BEDBASE_DATA_PATH_HOST=$BBTUTORIAL # should be named like this from the beginning

In [57]:
cd bedhost

In [162]:
docker-compose up

ERROR: Service 'fastapi-api' depends on service 'localhost' which is undefined.



In [118]:
export BEDBASE_DATA_PATH=/bedbase
#BEDBASE_DATA_PATH_HOST=/Users/mstolarczyk/Desktop/testing/bedbase_tutorial
export DB_HOST_URL=postgres-db

In [161]:
export DB_HOST_URL='localhost'

In [108]:
cd bedbase_tutorial

bash: cd: bedbase_tutorial: No such file or directory



In [137]:
cat docker-compose.yaml

cat: docker-compose.yaml: No such file or directory



In [136]:
ls
# docker run --rm --init -p 8000:8000 --name bedstat-rest-server \
#   --network="host" \
#   -v /home/bnt4me/Virginia/bed_maker_new/bedbase_tutorial/bed_files \
#   bedstat-rest-api-server uvicorn main:app --reload