# BEDBASE workflow tutorial

This tutorial demonstrates how to process, analyze, visualize, and serve BED files. The process has five steps, plus one optional quality-control step:

1. The [bedmaker](https://github.com/databio/bedmaker) pipeline converts different region data files (bed, bedGraph, bigBed, bigWig, and wig) into BED format and generates a bigBed file for each one for visualization in the Genome Browser. Optionally, the [bedqc](https://github.com/databio/bedqc) pipeline then flags BED files that you might not want to include in the downstream analysis.
2. Individual BED files are analyzed using the [bedstat](https://github.com/databio/bedstat) pipeline.
3. BED files are grouped and then analyzed as groups using the [bedbuncher](https://github.com/databio/bedbuncher) pipeline.
4. [bedembed](https://github.com/databio/bedembed) uses the StarSpace method to embed the BED files and their metadata; the distances between file labels and trained search terms are then calculated as cosine distances.
5. The BED files, along with statistics, plots, and grouping information, are served via a web interface and RESTful API using the [bedhost](https://github.com/databio/bedhost) package.

**Glossary of terms:**

- *bedfile*: a tab-delimited file with one genomic region per line. Each genomic region is described by 3 required columns: chrom, start, and end.
- *bedset*: a collection of BED files grouped by a shared biological, experimental, or logical criterion.
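
To make the *bedfile* definition concrete, the sketch below writes a tiny three-column BED file and checks that every line has at least three tab-separated fields, integer coordinates, and a start that is not greater than the end. The names `check_bed` and `example.bed` are hypothetical and the check is not part of any BEDbase tool; it also ignores optional header lines (`track`, `browser`, `#`) that real files may contain.

```shell
# Minimal sketch: validate the three required BED columns (chrom, start, end).
# check_bed and example.bed are illustrative names, not BEDbase components.

workdir=$(mktemp -d)
printf 'chr1\t100\t200\nchr1\t500\t650\nchr2\t0\t75\n' > "$workdir/example.bed"

check_bed() {
  # Exit status 0 only if every line has >= 3 tab-separated fields,
  # integer start/end columns, and start <= end.
  awk -F'\t' '
    BEGIN { bad = 0 }
    NF < 3 { bad = 1; next }
    $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ { bad = 1; next }
    ($2 + 0) > ($3 + 0) { bad = 1 }
    END { exit bad }
  ' "$1"
}

if check_bed "$workdir/example.bed"; then
  bed_status="valid"
else
  bed_status="invalid"
fi
echo "example.bed is $bed_status"
```

A *bedset* would then simply be a list of such files that share a criterion, for example the same antibody or cell type.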


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Preparation" data-toc-modified-id="1.-Preparation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>1. Preparation</a></span></li><li><span><a href="#2.-BEDMAKER:-convert-non-bed-files-into-bed-files-and-generate-bigBed-files-for-genome-browser-tracks" data-toc-modified-id="2.-BEDMAKER:-convert-non-bed-files-into-bed-files-and-generate-bigBed-files-for-genome-browser-tracks-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>2. BEDMAKER: convert non-bed files into bed files and generate bigBed files for genome browser tracks</a></span><ul class="toc-item"><li><span><a href="#Get-a-PEP-describing-the-files-to-process" data-toc-modified-id="Get-a-PEP-describing-the-files-to-process-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Get a PEP describing the files to process</a></span></li><li><span><a href="#Run-bedmaker-on-the-demo-PEP" data-toc-modified-id="Run-bedmaker-on-the-demo-PEP-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Run bedmaker on the demo PEP</a></span></li></ul></li><li><span><a href="#OPTIONAL-BEDQC:-flag-bed-files-for-futher-evaluation-to-determine-whether-they-should-be-included-in-the-downstream-analysis" data-toc-modified-id="OPTIONAL-BEDQC:-flag-bed-files-for-futher-evaluation-to-determine-whether-they-should-be-included-in-the-downstream-analysis"><span class="toc-item-num">&nbsp;&nbsp;</span>OPTIONAL BEDQC: flag bed files for futher evaluation to determine whether they should be included in the downstream analysis</a></span><ul class="toc-item"><li><span><a href="#Get-a-PEP-describing-the-files-to-process" data-toc-modified-id="Get-a-PEP-describing-the-files-to-process"><span class="toc-item-num">&nbsp;&nbsp;</span>Get a PEP describing the files to process</a></span></li><li><span><a href="#Run-bedqc-on-the-demo-PEP" data-toc-modified-id="Run-bedqc-on-the-demo-PEP"><span class="toc-item-num">&nbsp;&nbsp;</span>Run bedqc on the demo 
PEP</a></span></li></ul></li><li><span><a href="#3.-BEDSTAT:-Generate-statistics-and-plots-of-BED-files" data-toc-modified-id="3.-BEDSTAT:-Generate-statistics-and-plots-of-BED-files-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>3. BEDSTAT: Generate statistics and plots of BED files</a></span><ul class="toc-item"><li><span><a href="#Get-a-PEP-describing-the-bedfiles-to-process" data-toc-modified-id="Get-a-PEP-describing-the-bedfiles-to-process-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Get a PEP describing the bedfiles to process</a></span></li><li><span><a href="#Install-bedstat-dependencies" data-toc-modified-id="Install-bedstat-dependencies-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Install bedstat dependencies</a></span></li><li><span><a href="#Inititiate-a-local-PostgreSQL-instance" data-toc-modified-id="Inititiate-a-local-PostgreSQL-instance-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Inititiate a local PostgreSQL instance</a></span></li><li><span><a href="#Run-bedstat--on-the-demo-PEP" data-toc-modified-id="Run-bedstat--on-the-demo-PEP-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Run bedstat  on the demo PEP</a></span></li></ul></li><li><span><a href="#4.-BEDBUNCHER:-Create-bedsets-and-their-respective-statistics" data-toc-modified-id="4.-BEDBUNCHER:-Create-bedsets-and-their-respective-statistics-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>4. 
BEDBUNCHER: Create bedsets and their respective statistics</a></span><ul class="toc-item"><li><span><a href="#Create-a-new-PEP-describing-the-bedset-name-and-specific-JSON-query" data-toc-modified-id="Create-a-new-PEP-describing-the-bedset-name-and-specific-JSON-query-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Create a new PEP describing the bedset name and specific JSON query</a></span></li><li><span><a href="#Create-outputs-directory-and-install-bedbuncher-CML-dependencies" data-toc-modified-id="Create-outputs-directory-and-install-bedbuncher-CML-dependencies-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Create outputs directory and install bedbuncher CML dependencies</a></span></li><li><span><a href="#Run-bedbuncher-using-Looper" data-toc-modified-id="Run-bedbuncher-using-Looper-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Run bedbuncher using Looper</a></span></li></ul></li><li><span><a href="#5.-BEDHOST:--Serve-BED-files-and-API-to-explore-pipeline-outputs" data-toc-modified-id="5.-BEDHOST:--Serve-BED-files-and-API-to-explore-pipeline-outputs-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>5. BEDHOST:  Serve BED files and API to explore pipeline outputs</a></span></li></ul></div>

## 1. Preparation 

First, we will create a tutorial directory where we'll store the bedbase pipelines and files to be processed. We'll also need to create an environment variable that points to the tutorial directory (we'll need this variable later). 

In [1]:
mkdir -p bedbase_tutorial
cd bedbase_tutorial
export BEDBASE_DATA_PATH_HOST=$(pwd)
export CODE=$(pwd)


Download some example BED files:

In [2]:
wget http://big.databio.org/example_data/bedbase_tutorial/bed_files.tar.gz     

--2022-10-13 12:20:25--  http://big.databio.org/example_data/bedbase_tutorial/bed_files.tar.gz
Resolving big.databio.org (big.databio.org)... 128.143.223.179
Connecting to big.databio.org (big.databio.org)|128.143.223.179|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44549692 (42M) [application/octet-stream]
Saving to: ‘bed_files.tar.gz’


2022-10-13 12:21:20 (801 KB/s) - ‘bed_files.tar.gz’ saved [44549692/44549692]



The downloaded files are compressed so we'll need to untar them:

In [3]:
tar -zxvf bed_files.tar.gz && mv bed_files files

bed_files/
bed_files/GSE105587_ENCFF018NNF_conservative_idr_thresholded_peaks_GRCh38.bed.gz
bed_files/GSM2423312_ENCFF155HVK_peaks_GRCh38.bed.gz
bed_files/GSE105977_ENCFF617QGK_optimal_idr_thresholded_peaks_GRCh38.bed.gz
bed_files/GSE91663_ENCFF316ASR_peaks_GRCh38.bed.gz
bed_files/GSM2423313_ENCFF722AOG_peaks_GRCh38.bed.gz
bed_files/GSM2827349_ENCFF196DNQ_peaks_GRCh38.bed.gz
bed_files/GSE91663_ENCFF553KIK_optimal_idr_thresholded_peaks_GRCh38.bed.gz
bed_files/GSE91663_ENCFF319TPR_conservative_idr_thresholded_peaks_GRCh38.bed.gz
bed_files/GSE105977_ENCFF937CGY_peaks_GRCh38.bed.gz
bed_files/GSM2827350_ENCFF928JXU_peaks_GRCh38.bed.gz
bed_files/GSE105977_ENCFF793SZW_conservative_idr_thresholded_peaks_GRCh38.bed.gz


In [4]:
rm bed_files.tar.gz

Additionally, we'll download a matrix we need to provide if we wish to plot the tissue specificity of our set of genomic ranges:

Lastly, we'll download the core pipelines and tools needed to complete this tutorial. The cell below clones `bedbase`, `bedmaker`, `bedstat`, and `bedboss`; the clones of `bedbuncher`, `bedembed`, `bedhost`, and `bedhost-ui` are commented out.

In [5]:
git clone -b dev git@github.com:databio/bedbase.git
git clone -b dev git@github.com:databio/bedmaker
git clone -b dev_alex git@github.com:databio/bedstat
git clone -b dev git@github.com:databio/bedboss
# git clone -b validate_genome_assembly git@github.com:databio/bedbuncher
# git clone git@github.com:databio/bedembed
# git clone -b dev git@github.com:databio/bedhost
# git clone git@github.com:databio/bedhost-ui

Cloning into 'bedbase'...
remote: Enumerating objects: 443, done.
remote: Counting objects: 100% (96/96), done.
remote: Compressing objects: 100% (44/44), done.
remote: Total 443 (delta 66), reused 59 (delta 48), pack-reused 347
Receiving objects: 100% (443/443), 543.96 KiB | 8.50 MiB/s, done.
Resolving deltas: 100% (215/215), done.
Cloning into 'bedmaker'...
remote: Enumerating objects: 549, done.
remote: Counting objects: 100% (132/132), done.
remote: Compressing objects: 100% (87/87), done.
remote: Total 549 (delta 54), reused 97 (delta 33), pack-reused 417
Receiving objects: 100% (549/549), 1.76 MiB | 9.76 MiB/s, done.
Resolving deltas: 100% (296/296), done.
Cloning into 'bedstat'...
remote: Enumerating objects: 1004, done.
remote: Counting objects: 100% (236/236), done.
remote: Compressing objects: 100% (128/128), done.
remote: Total 1004 (delta 114), reused 183 (delta 72), pack-reused 768
Receiving objects: 100% (1004/1004), 4.82 MiB | 9.40 MiB

### Let's install these packages!

In [45]:
pip install ./bedmaker
pip install ./bedstat
pip install ./bedboss

Processing ./bedmaker
  Preparing metadata (setup.py) ... done
Collecting argparse>=1.4.0
  Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Building wheels for collected packages: bedmaker
  Building wheel for bedmaker (setup.py) ... done
  Created wheel for bedmaker: filename=bedmaker-0.1.0-py3-none-any.whl size=11471 sha256=4ffe2026736a9c6695c045f9e583be754ad10f19dc83d83b203c15f68f259b8e
  Stored in directory: /tmp/pip-ephem-wheel-cache-sdvxky0u/wheels/b0/3c/ec/d391d359533d0588c473b0dd29a3e30863d9c5a04d10b537f1
Successfully built bedmaker
Installing collected packages: argparse, bedmaker
  Attempting uninstall: bedmaker
    Found existing installation: bedmaker 0.1.0
    Uninstalling bedmaker-0.1.0:
      Successfully uninstalled bedmaker-0.1.0
Successfully installed argparse-1.4.0 bedmaker-0.1.0
Processing ./bedstat
  Preparing metadata (setup.py) ... done


Building wheels for collected packages: bedstat
  Building wheel for bedstat (setup.py) ... done
  Created wheel for bedstat: filename=bedstat-0.1.0-py3-none-any.whl size=12949 sha256=c54da40f6e309bf3bc2d79d21c2b8f60bc8554fa4b42ddb8a203a150c5ec68d9
  Stored in directory: /tmp/pip-ephem-wheel-cache-aq2gpxad/wheels/79/d1/85/9e5219be05d57343991497d2aaa54ca43673b643bb74ea4c78
Successfully built bedstat
Installing collected packages: bedstat
  Attempting uninstall: bedstat
    Found existing installation: bedstat 0.1.0
    Uninstalling bedstat-0.1.0:
      Successfully uninstalled bedstat-0.1.0
Successfully installed bedstat-0.1.0
Processing ./bedboss
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: bedboss
  Building wheel for bedboss (setup.py) ... done
  Created wheel for bedboss: filename=bedboss-0.1.0-py3-none-any.whl size=6489 sha256=ef3d08434c5b7605897dbd2f39691180ced19491d6a2f602b9e9977f58653e02
  Stored in directory: /tm

In [42]:
pwd

/home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial


In [22]:
cd bedbase

## 2. BEDBOSS: ALL TOGETHER

### Check and update config files

In [14]:
ls ../../tutorial_files/bedboss

bedboss_looper.yaml           config_db_local.yaml
bedstat_annotation_sheet.csv  pipeline_interface.yaml


Configuration for the local database and bedstat:

In [12]:
cat ../../tutorial_files/bedboss/config_db_local.yaml

path:
  pipeline_output_path: test_f2
  bedstat_dir: outputs/bedstat_output
  bedbuncher_dir: outputs/bedbuncher_output
  remote_url_base: null
database:
  host: localhost
  port: 5432
  password: bedbasepassword
  user: postgres
  name: postgres
  dialect: postgresql
  driver: psycopg2
server:
  host: 0.0.0.0
  port: 8080

Looper pipeline interface and project configuration for bedboss:

In [16]:
cat ../../tutorial_files/bedboss/pipeline_interface.yaml

pipeline_name: BEDMAKER
pipeline_type: sample
command_template: >
  bedboss
  --sample-name {sample.sample_name}
  --input-file {sample.input_file_path}
  --input-type {sample.format}
  --genome {sample.genome}
  --output-folder {sample.output_folder}
  --narrowpeak {sample.narrowpeak}
  --rfg-config {sample.rfg_config_path}
  --bedbase-config {sample.bedbase_config}


In [17]:
cat ../../tutorial_files/bedboss/bedboss_looper.yaml

pep_version: 2.1.0
sample_table: bedstat_annotation_sheet.csv

looper:
    output-dir: ./pipeline_interface.yaml

sample_modifiers:
  append:
    pipeline_interfaces: ./pipeline_interface.yaml
    input_file_path: INPUT
    output_path: "$BEDBASE_DATA_PATH_HOST/output"
    narrowpeak: TRUE
    rfg_config_path: RFG
  derive:
    attributes: [input_file_path, rfg_config_path]
    sources:
      INPUT: "$BEDBASE_DATA_PATH_HOST/files/{file_name}"
      RFG: "$REFGENIE"
  imply:
    - if:
        antibody: [H3K4me3, H3K27me3, H3K27ac, H3K9ac, H4K5ac, H3K4me, H3K36me3, H4K5ac, H3K9ac]
      then:
        narrowpeak: FALSE
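
The `imply` block above can be read as a simple rule: every sample starts with `narrowpeak: TRUE` from `append`, and any sample whose `antibody` is one of the listed histone marks is switched to `narrowpeak: FALSE`. The function below only illustrates that rule in shell; it is not how looper or peppy evaluate the config, the name `implied_narrowpeak` is hypothetical, and `CTCF` is an assumed example of an antibody outside the list.

```shell
# Illustration only: mimic the imply rule from bedboss_looper.yaml.
implied_narrowpeak() {
  case "$1" in
    # Histone marks listed under "imply" get narrowpeak: FALSE.
    H3K4me3|H3K27me3|H3K27ac|H3K9ac|H4K5ac|H3K4me|H3K36me3)
      echo "FALSE" ;;
    # Everything else keeps the appended default.
    *)
      echo "TRUE" ;;
  esac
}

for antibody in CTCF H3K27ac; do
  echo "$antibody -> narrowpeak: $(implied_narrowpeak "$antibody")"
done
```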


### Run bedboss

Additionally, we have to set the environment variable `$REFGENIE` to the path of the refgenie configuration file. If refgenie is not initialized, we have to initialize it locally. Install it with `pip install --user refgenie` and add it to the PATH with `export PATH=~/.local/bin:$PATH`.

In [20]:
export REFGENIE='genome_config.yaml'
refgenie init -c $REFGENIE

Initialized genome configuration file: /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/genome_config.yaml
Created directories:
 - /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/data
 - /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/alias


In [37]:
bedToBigBed

bedToBigBed v. 2.8 - Convert bed file to bigBed. (bbi version: 4)
usage:
   bedToBigBed in.bed chrom.sizes out.bb
Where in.bed is in one of the ascii bed formats, but not including track lines
and chrom.sizes is a two-column file/URL: <chromosome name> <size in bases>
and out.bb is the output indexed big bed file.
If the assembly <db> is hosted by UCSC, chrom.sizes can be a URL like
  http://hgdownload.soe.ucsc.edu/goldenPath/<db>/bigZips/<db>.chrom.sizes
or you may use the script fetchChromSizes to download the chrom.sizes file.
If you have bed annotations on patch sequences from NCBI, a more inclusive
chrom.sizes file can be found using a URL like
  http://hgdownload.soe.ucsc.edu/goldenPath/<db>/database/chromInfo.txt.gz
If not hosted by UCSC, a chrom.sizes file can be generated by running
twoBitInfo on the assembly .2bit file.
The in.bed file must be sorted by chromosome,start,
  to sort a bed file, use the unix sort command:
     sort -k1,1 -k2,2n unsorted.bed > sorted.bed
Sorting 

: 255

In [46]:
looper run ../../tutorial_files/bedboss/bedboss_looper.yaml --package local

Looper version: 1.3.2
Command: run
Detecting duplicate sample names ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Activating compute package 'local'
## [1 of 11] sample: bedbase_demo_db1; pipeline: BEDMAKER
Writing script to /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/pipeline_interface.yaml/submission/BEDMAKER_bedbase_demo_db1.sub
Job script (n=1; 0.00Gb): ./pipeline_interface.yaml/submission/BEDMAKER_bedbase_demo_db1.sub
Compute node: cphg-Precision-5560
Start time: 2022-10-13 13:36:03
processing genome name...
Getting Open Signal Matrix file path...
output_bed = /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output/files_bed/bedbase_demo_db1.bed.gz
output_bigbed = /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output/files_bigbed
### Pipeline run code and environment:

*              Command:  `/home/bnt4me/Virginia/venv/jup_notebook/bin/bedboss --sample-name bedb

*         Compute host:  cphg-Precision-5560
*          Working dir:  /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial
*            Outfolder:  /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output/files_bed/bedmaker_logs/bedbase_demo_db2/
*  Pipeline started at:   (10-13 13:36:04) elapsed: 0.0 _TIME_

### Version log:

*       Python version:  3.8.10
*          Pypiper dir:  `/home/bnt4me/Virginia/venv/jup_notebook/lib/python3.8/site-packages/pypiper`
*      Pypiper version:  0.12.3
*         Pipeline dir:  `/home/bnt4me/Virginia/venv/jup_notebook/bin`
*     Pipeline version:  None

### Arguments passed to pipeline:


----------------------------------------

Got input type: bed
Target exists: `/home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output/files_bed/bedbase_demo_db2.bed.gz`  
Removed existing flag: '/home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output/files_bed/bedmaker_logs/bedbase_demo_db2/bedmaker_co

*         Compute host:  cphg-Precision-5560
*          Working dir:  /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial
*            Outfolder:  /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output/files_bed/bedqc_logs/
*  Pipeline started at:   (10-13 13:36:05) elapsed: 0.0 _TIME_

### Version log:

*       Python version:  3.8.10
*          Pypiper dir:  `/home/bnt4me/Virginia/venv/jup_notebook/lib/python3.8/site-packages/pypiper`
*      Pypiper version:  0.12.3
*         Pipeline dir:  `/home/bnt4me/Virginia/venv/jup_notebook/bin`
*     Pipeline version:  None

### Arguments passed to pipeline:


----------------------------------------

Target to produce: `/home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output/files_bed/bedqc_logs/9cjgig8c`  

> `zcat /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output/files_bed/bedbase_demo_db3.bed.gz > /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/outpu

<pre>
313000</pre>
Command completed. Elapsed time: 0:00:00. Running peak memory: 0.002GB.  
  PID: 568643;	Command: bash;	Return code: 0;	Memory used: 0.0GB

Starting cleanup: 1 files; 0 conditional files for cleanup

Cleaning up flagged intermediate files. . .

### Pipeline completed. Epilogue
*        Elapsed time (this run):  0:00:00
*  Total elapsed time (all runs):  0:00:00
*         Peak memory (this run):  0.002 GB
*        Pipeline completed time: 2022-10-13 13:36:06
Generating bigBed files for: /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/files/GSE105977_ENCFF937CGY_peaks_GRCh38.bed.gz
Determining path to chrom.sizes asset via Refgenie.
Reading refgenie genome configuration file from file: /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/genome_config.yaml
/home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/alias/hg38/fasta/default/hg38.chrom.sizes
Determined path to chrom.sizes asset: /home/bnt4me/Virginia/repos/bedbase/doc

    sys.exit(main())
  File "/home/bnt4me/Virginia/venv/jup_notebook/lib/python3.8/site-packages/bedboss/bedboss.py", line 301, in main
    run_bedboss(**args_dict)
  File "/home/bnt4me/Virginia/venv/jup_notebook/lib/python3.8/site-packages/bedboss/bedboss.py", line 146, in run_bedboss
    bedstat.run_bedstat(
  File "/home/bnt4me/Virginia/venv/jup_notebook/lib/python3.8/site-packages/bedstat/bedstat.py", line 94, in run_bedstat
    bbc = bbconf.BedBaseConf(config_path=bedbase_config, database_only=True)
  File "/home/bnt4me/Virginia/venv/jup_notebook/lib/python3.8/site-packages/bbconf/bbconf.py", line 45, in __init__
    cfg_path = get_bedbase_cfg(config_path)
  File "/home/bnt4me/Virginia/venv/jup_notebook/lib/python3.8/site-packages/bbconf/helpers.py", line 22, in get_bedbase_cfg
    selected_cfg = select_config(config_filepath=cfg, config_env_vars=CFG_ENV_VARS)
  File "/home/bnt4me/Virginia/venv/jup_notebook/lib/python3.8/site-packages/yacman/yacman.py", line 498, in select_config


Start time: 2022-10-13 13:36:08
processing genome name...
Getting Open Signal Matrix file path...
output_bed = /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output/files_bed/bedbase_demo_db7.bed.gz
output_bigbed = /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output/files_bigbed
### Pipeline run code and environment:

*              Command:  `/home/bnt4me/Virginia/venv/jup_notebook/bin/bedboss --sample-name bedbase_demo_db7 --input-file /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/files/GSE91663_ENCFF553KIK_optimal_idr_thresholded_peaks_GRCh38.bed.gz --input-type bed --genome hg38 --output_folder /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output --narrowpeak True --rfg-config genome_config.yaml --bedbase-config ./config_db_local.yaml`
*         Compute host:  cphg-Precision-5560
*          Working dir:  /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial
*            Outfolder:  /home/bnt

*     Pipeline version:  None

### Arguments passed to pipeline:


----------------------------------------

Got input type: bed
Target exists: `/home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output/files_bed/bedbase_demo_db8.bed.gz`  
Removed existing flag: '/home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output/files_bed/bedmaker_logs/bedbase_demo_db8/bedmaker_completed.flag'
Removed existing flag: '/home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output/files_bed/bedqc_logs/bedQC-pipeline_completed.flag'
### Pipeline run code and environment:

*              Command:  `/home/bnt4me/Virginia/venv/jup_notebook/bin/bedboss --sample-name bedbase_demo_db8 --input-file /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/files/GSM2423312_ENCFF155HVK_peaks_GRCh38.bed.gz --input-type bed --genome hg38 --output_folder /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output --narrowpeak True --rfg-config genom

*      Pypiper version:  0.12.3
*         Pipeline dir:  `/home/bnt4me/Virginia/venv/jup_notebook/bin`
*     Pipeline version:  None

### Arguments passed to pipeline:


----------------------------------------

Target to produce: `/home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output/files_bed/bedqc_logs/y4kv37sr`  

> `zcat /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output/files_bed/bedhost_demo_db9.bed.gz > /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output/files_bed/bedqc_logs/y4kv37sr` (568842)
<pre>
</pre>
Command completed. Elapsed time: 0:00:00. Running peak memory: 0.002GB.  
  PID: 568842;	Command: zcat;	Return code: 0;	Memory used: 0.002GB

Targetless command, running...  

> `bash /home/bnt4me/Virginia/venv/jup_notebook/lib/python3.8/site-packages/bedmaker/est_line.sh /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/output/files_bed/bedqc_logs/y4kv37sr ` (568844)
<pre>
303000</pre>
Command comp

Generating bigBed files for: /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/files/GSM2827349_ENCFF196DNQ_peaks_GRCh38.bed.gz
Determining path to chrom.sizes asset via Refgenie.
Reading refgenie genome configuration file from file: /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/genome_config.yaml
/home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/alias/hg38/fasta/default/hg38.chrom.sizes
Determined path to chrom.sizes asset: /home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial/alias/hg38/fasta/default/hg38.chrom.sizes

### Pipeline completed. Epilogue
*        Elapsed time (this run):  0:00:00
*  Total elapsed time (all runs):  0:00:02
*         Peak memory (this run):  0 GB
*        Pipeline completed time: 2022-10-13 13:36:12
Config file path isn't a file: ./config_db_local.yaml
Traceback (most recent call last):
  File "/home/bnt4me/Virginia/venv/jup_notebook/bin/bedboss", line 8, in <module>
    sys.exit(main())
  Fil

  File "/home/bnt4me/Virginia/venv/jup_notebook/lib/python3.8/site-packages/bbconf/bbconf.py", line 45, in __init__
    cfg_path = get_bedbase_cfg(config_path)
  File "/home/bnt4me/Virginia/venv/jup_notebook/lib/python3.8/site-packages/bbconf/helpers.py", line 22, in get_bedbase_cfg
    selected_cfg = select_config(config_filepath=cfg, config_env_vars=CFG_ENV_VARS)
  File "/home/bnt4me/Virginia/venv/jup_notebook/lib/python3.8/site-packages/yacman/yacman.py", line 498, in select_config
    raise result
OSError: ./config_db_local.yaml

Looper finished
Samples valid for job generation: 11 of 11
Commands submitted: 11 of 11
Jobs submitted: 11

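The log above ends with `OSError: ./config_db_local.yaml`, most likely because the relative path passed to `--bedbase-config` is resolved against the working directory (`bedbase_tutorial`), where that file does not exist. A small pre-flight check can catch this before any jobs are submitted. The sketch below is an illustration under assumptions: the helper name `require_file` and the temporary demo directory are hypothetical, and in the tutorial you would point it at your own copies of `config_db_local.yaml` and the file referenced by `$REFGENIE`.

```shell
# Hypothetical helper: fail early if a required file is missing and
# print its absolute path when it exists.
require_file() {
  if [ ! -f "$1" ]; then
    echo "missing file: $1" >&2
    return 1
  fi
  # Resolve to an absolute path so the working directory no longer matters.
  echo "$(cd "$(dirname "$1")" && pwd)/$(basename "$1")"
}

# Demo in a throwaway directory (stand-ins for the real config files).
demo_dir=$(mktemp -d)
touch "$demo_dir/config_db_local.yaml"

abs_config=$(require_file "$demo_dir/config_db_local.yaml")
echo "bedbase config: $abs_config"

# The second file was never created, so the check reports it.
if ! require_file "$demo_dir/genome_config.yaml" 2>/dev/null; then
  echo "genome_config.yaml not found in $demo_dir"
fi
```

Passing the absolute path to the pipeline avoids depending on the directory from which looper launches each job.
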
## 2. BEDMAKER: convert non-bed files into bed files and generate bigBed files for genome browser tracks

### Get a PEP describing the files to process

This is a preprocessing step that converts non-BED files into BED format using `bedmaker`. Currently supported formats are bedGraph, bigBed, bigWig, and wig. `bedmaker` also generates bigBed files that will be used in the Genome Browser. To begin, we'll need some annotation information for the files we want to load. We'll use the standard [PEP](http://pep.databio.org) format for the annotation, which consists of 1) a sample table (.csv) that annotates the files, and 2) a project config.yaml file that points to the sample annotation sheet. The config file also has other components, such as derived and implied attributes, that in this case point to the files to be processed and indicate whether they are narrowpeak or not. Here is the PEP config file for this example project.

In [10]:
cat bedbase/tutorial_files/PEPs/bedmaker_config.yaml

pep_version: 2.0.0
sample_table: bedstat_annotation_sheet.csv

looper:
    output-dir: $BEDBASE_DATA_PATH_HOST/outputs/bedmaker_output/bedmaker_pipeline_logs 

sample_modifiers:
  append:
    pipeline_interfaces: $CODE/bedmaker/pipeline_interface.yaml
    input_file_path: INPUT
    output_bed_path: BOUT
    output_bigbed_path: $BEDBASE_DATA_PATH_HOST/bigbed_files
    narrowpeak: TRUE
    rfg_config_path: RFG
    protocol: "make_bed"
  derive:
    attributes: [input_file_path, output_bed_path, rfg_config_path]
    sources:
      INPUT: "$BEDBASE_DATA_PATH_HOST/files/{file_name}"
      BOUT: "$BEDBASE_DATA_PATH_HOST/bed_files/{file_name}" 
      RFG: "$REFGENIE"
  imply:
    - if:
        antibody: [H3K4me3, H3K27me3, H3K27ac, H3K9ac, H4K5ac, H3K4me, H3K36me3, H4K5ac, H3K9ac]
      then:
        narrowpeak: FALSE
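
The `derive` block in the config above turns the placeholder `INPUT` into a real path by filling `{file_name}` from each row of the sample table. The loop below imitates that substitution for illustration only; it is not how peppy implements derived attributes. The two-row sample table is a made-up excerpt (the sample names and file names are taken from the looper log and the downloaded archive), and the fallback value for `BEDBASE_DATA_PATH_HOST` is an assumption so the sketch runs anywhere.

```shell
# Stand-in for $BEDBASE_DATA_PATH_HOST if it is not already set.
BEDBASE_DATA_PATH_HOST=${BEDBASE_DATA_PATH_HOST:-/tmp/bedbase_tutorial}

# Hypothetical two-row excerpt of a sample table (sample_name,file_name).
sample_table=$(mktemp)
cat > "$sample_table" <<'EOF'
sample_name,file_name
bedbase_demo_db7,GSE91663_ENCFF553KIK_optimal_idr_thresholded_peaks_GRCh38.bed.gz
bedbase_demo_db8,GSM2423312_ENCFF155HVK_peaks_GRCh38.bed.gz
EOF

n_samples=0
{
  read -r header   # skip the CSV header line
  while IFS=, read -r sample_name file_name; do
    # Equivalent of INPUT: "$BEDBASE_DATA_PATH_HOST/files/{file_name}"
    input_file_path="$BEDBASE_DATA_PATH_HOST/files/$file_name"
    echo "$sample_name -> $input_file_path"
    n_samples=$((n_samples + 1))
  done
} < "$sample_table"

echo "derived input_file_path for $n_samples samples"
```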


The output bigBed files will be stored in `$BEDBASE_DATA_PATH_HOST/bigbed_files`, and the BED files will be stored in `$BEDBASE_DATA_PATH_HOST/bed_files`. We'll also need to create a directory where we can store the log and submission files.

In [20]:
mkdir -p outputs/bedmaker_output/bedmaker_pipeline_logs

This step requires `bedToBigBed`. If you don't have it installed, you can download it from [UCSC](http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed) and add the directory that contains it to your PATH.

In [29]:
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed
chmod a+x bedToBigBed

--2022-10-13 13:23:23--  http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9573456 (9.1M)
Saving to: ‘bedToBigBed’


2022-10-13 13:23:42 (524 KB/s) - ‘bedToBigBed’ saved [9573456/9573456]



In [30]:
ls

alias             bedmaker     files               pipeline_interface.yaml
bedbase           bedstat      genome_config.yaml
bedbase_tutorial  bedToBigBed  openSignalMatrix
bedboss           data         output


In [31]:
pwd

/home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial


In [34]:
# PATH entries must be directories, so add the folder that contains bedToBigBed
export PATH=$PATH:/home/bnt4me/Virginia/repos/bedbase/docs_jupyter/bedbase_tutorial
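
Entries in `PATH` are directories, not files, so it is the folder containing `bedToBigBed` that has to be appended. The sketch below shows the idea with a dummy executable; `fake_tool` and the temporary directory are hypothetical stand-ins for `bedToBigBed` and the tutorial directory.

```shell
# Create a dummy executable in a throwaway directory.
tool_dir=$(mktemp -d)
printf '#!/bin/sh\necho "fake_tool ran"\n' > "$tool_dir/fake_tool"
chmod a+x "$tool_dir/fake_tool"

# Append the directory (not the file itself) to PATH.
export PATH="$PATH:$tool_dir"

# command -v prints the resolved location when the tool can be found.
resolved=$(command -v fake_tool)
echo "found: $resolved"
```

Running `command -v bedToBigBed` in the same way is a quick check that the pipeline will be able to find the real tool.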

### Run bedmaker on the demo PEP

To run bedmaker and the other required pipelines in this tutorial, we will rely on the pipeline submission engine looper, which can be installed as follows:

In [None]:
pip install looper --user

In [24]:
looper run bedbase/tutorial_files/PEPs/bedmaker_config.yaml --package local \
--command-extra="-R" > outputs/bedmaker_output/bedmaker_pipeline_logs/looper_logs.txt

Looper version: 1.3.1
Command: run
  os.path.dirname(self._file_path),
  self.config_file = self._file_path
Activating compute package 'local'
## [1 of 11] sample: bedbase_demo_db1; pipeline: BEDMAKER
Writing script to /home/bnt4me/Virginia/bed_maker/bedbase_tutorial/outputs/bedmaker_output/bedmaker_pipeline_logs/submission/BEDMAKER_bedbase_demo_db1.sub
Job script (n=1; 0.00Gb): /home/bnt4me/Virginia/bed_maker/bedbase_tutorial/outputs/bedmaker_output/bedmaker_pipeline_logs/submission/BEDMAKER_bedbase_demo_db1.sub
## [2 of 11] sample: bedbase_demo_db2; pipeline: BEDMAKER
Writing script to /home/bnt4me/Virginia/bed_maker/bedbase_tutorial/outputs/bedmaker_output/bedmaker_pipeline_logs/submission/BEDMAKER_bedbase_demo_db2.sub
Job script (n=1; 0.00Gb): /home/bnt4me/Virginia/bed_maker/bedbase_tutorial/outputs/bedmaker_output/bedmaker_pipeline_logs/submission/BEDMAKER_bedbase_demo_db2.sub
## [3 of 11] sample: bedbase_demo_db3; pipeline: BEDMAKER
Writing script to /h

## OPTIONAL BEDQC: flag bed files for further evaluation to determine whether they should be included in the downstream analysis

### Get a PEP describing the files to process

This optional step uses `bedqc` to flag BED files for further evaluation, to determine whether they should be included in the downstream analysis. Currently it flags BED files that are larger than 2 GB, have more than 5 million regions, and/or have a mean region width of less than 10 bp. To begin, we'll need some annotation information for the files we want to load. We'll use the standard [PEP](http://pep.databio.org) format for the annotation, which consists of 1) a sample table (.csv) that annotates the files, and 2) a project config.yaml file that points to the sample annotation sheet. The config file also has other components, such as derived attributes, that in this case point to the files to be processed. Here is the PEP config file for this example project.

In [4]:
cat ../../bedbase/tutorial_files/PEPs/bedqc_config.yaml

pep_version: 2.0.0
sample_table: bedstat_annotation_sheet.csv

looper:
    output-dir: $BEDBASE_DATA_PATH_HOST/outputs/bedqc_output/bedqc_pipeline_logs 

sample_modifiers:
  append:
    pipeline_interfaces: $BEDBASE_DATA_PATH_HOST/bedqc/pipeline_interface.yaml
    input_file_path: INPUT
    output_dir: $BEDBASE_DATA_PATH_HOST/outputs/bedqc_output/bedqc_pipeline_logs
  derive:
    attributes: [input_file_path]
    sources:
      INPUT: "$BEDBASE_DATA_PATH_HOST/bed_files/{file_name}" 
      


We'll need to create a directory where we can store the output `flaged_bed.csv`, log and submission files.

In [None]:
mkdir -p outputs/bedqc_output/bedqc_pipeline_logs

### Run bedqc on the demo PEP

In [13]:
looper run bedbase/tutorial_files/PEPs/bedqc_config.yaml --package local \
--command-extra="-R" > outputs/bedqc_output/bedqc_pipeline_logs/looper_logs.txt

Looper version: 1.3.0
Command: run
  os.path.dirname(self._file_path),
  self.config_file = self._file_path
Activating compute package 'local'
[36m## [1 of 17] sample: bedbase_demo_db1; pipeline: BEDQC[0m
Writing script to /home/bx2ur/Documents/data/bedbase_tutorial/outputs/bedqc_output/bedqc_pipeline_logs/submission/BEDQC_bedbase_demo_db1.sub
Job script (n=1; 0.00Gb): /home/bx2ur/Documents/data/bedbase_tutorial/outputs/bedqc_output/bedqc_pipeline_logs/submission/BEDQC_bedbase_demo_db1.sub
## [2 of 17] sample: bedbase_demo_db2; pipeline: BEDQC
Writing script to /home/bx2ur/Documents/data/bedbase_tutorial/outputs/bedqc_output/bedqc_pipeline_logs/submission/BEDQC_bedbase_demo_db2.sub
Job script (n=1; 0.00Gb): /home/bx2ur/Documents/data/bedbase_tutorial/outputs/bedqc_output/bedqc_pipeline_logs/submission/BEDQC_bedbase_demo_db2.sub
## [3 of 17] sample: bedbase_demo_db3; pipeline: BEDQC
Writing script to /home/bx2ur/Documents/data/bedbase_tutorial/outputs/bedqc_output/be

The flagged BED files are listed in a CSV file:

In [15]:
cat  outputs/bedqc_output/bedqc_pipeline_logs/flaged_bed.csv

file_name,detail 
ENCFF464DKS.bed.gz,['Mean region width is less than 10 bp.'] 
ENCFF610FVD.bed.gz,['Mean region width is less than 10 bp.'] 
ENCFF756GON.bed.gz,['Mean region width is less than 10 bp.']
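
The three criteria described at the start of this section (file size, number of regions, mean region width) can be sketched in a few lines of Python. This is only an illustration of the checks, not the actual `bedqc` implementation: the function and constant names are hypothetical, and apart from the mean-width message shown in the output above, the reason strings are made up.

```python
import gzip
import os

# Thresholds taken from the description of bedqc; the names are hypothetical.
MAX_FILE_SIZE_BYTES = 2 * 1024**3  # "larger than 2 GB"
MAX_REGION_COUNT = 5_000_000       # "more than 5 million regions"
MIN_MEAN_WIDTH_BP = 10             # "mean region width of less than 10 bp"


def summarize_regions(lines):
    """Return (region_count, mean_width) for an iterable of BED lines."""
    count = 0
    total_width = 0
    for line in lines:
        fields = line.split("\t")
        # Skip blank lines, comments, and "track"/"browser" header lines.
        if len(fields) < 3 or line.startswith(("#", "track", "browser")):
            continue
        count += 1
        total_width += int(fields[2]) - int(fields[1])
    mean_width = total_width / count if count else 0.0
    return count, mean_width


def flag_reasons(file_size, region_count, mean_width):
    """Collect one reason per threshold that the file violates."""
    reasons = []
    if file_size > MAX_FILE_SIZE_BYTES:
        reasons.append("File size is larger than 2 GB.")  # hypothetical wording
    if region_count > MAX_REGION_COUNT:
        reasons.append("File has more than 5 million regions.")  # hypothetical wording
    if mean_width < MIN_MEAN_WIDTH_BP:
        reasons.append("Mean region width is less than 10 bp.")
    return reasons


def check_bed_file(path):
    """Apply the three checks to a plain or gzipped BED file on disk."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        count, mean_width = summarize_regions(handle)
    return flag_reasons(os.path.getsize(path), count, mean_width)
```

A file for which `check_bed_file` returns an empty list would not appear in the flagged CSV under this sketch. Note that an empty file has a mean width of 0 here and would therefore be flagged.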


## 3. BEDSTAT: Generate statistics and plots of BED files 

### Get a PEP describing the bedfiles to process

The first step is to process the BED files using the `bedstat` pipeline, which computes statistics and makes plots for each individual BED file. To begin, we'll need some annotation information for our BED files to load. We'll use the standard [PEP](http://pep.databio.org) format for the annotation, which consists of 1) a sample table (.csv) that annotates the files, and 2) a project config.yaml file that points to the sample annotation sheet. The config file also has other components, such as derived attributes, that in this case point to the bedfiles to be processed. Here is the PEP config file for this example project. It includes annotation information for each BED file, and also points to the `.bed.gz` files using derived attributes `output_file_path` and `yaml_file`.

In [25]:
cat bedbase/tutorial_files/PEPs/bedstat_config.yaml

pep_version: 2.0.0
sample_table: bedstat_annotation_sheet.csv

looper:
    output-dir: $BEDBASE_DATA_PATH_HOST/outputs/bedstat_output/bedstat_pipeline_logs 

sample_modifiers:
  append:
    bedbase_config: $CODE/bedbase/tutorial_files/bedbase_configuration_compose.yaml
    pipeline_interfaces: $CODE/bedstat/pipeline_interface.yaml
    output_file_path: OUTPUT
    yaml_file: SAMPLE_YAML
    open_signal_matrix: MATRIX
    bigbed:  BIGBED
  derive:
    attributes: [output_file_path, yaml_file, open_signal_matrix, bigbed]
    sources:
      OUTPUT: "$BEDBASE_DATA_PATH_HOST/bed_files/{file_name}" 
      SAMPLE_YAML: "$BEDBASE_DATA_PATH_HOST/outputs/bedstat_output/bedstat_pipeline_logs/submission/{sample_name}_sample.yaml"
      MATRIX: "$BEDBASE_DATA_PATH_HOST/openSignalMatrix_{genome}_percentile99_01_quantNormalized_round4d.txt.gz"
      BIGBED: "$BEDBASE_DATA_PATH_HOST/bigbed_files"


### Install bedstat dependencies

`bedstat` is a [pypiper](http://code.databio.org/pypiper/) pipeline that generates statistics and plots of bedfiles. Additionally, `bedstat` uses [bbconf](https://github.com/databio/bbconf), the bedbase configuration manager, which implements convenience methods for interacting with the PostgreSQL database where our file metadata will be placed. These and the appropriate R dependencies can be installed as follows:

In [8]:
pip install -r bedstat/requirements.txt --user > requirements_log.txt

Install the R dependencies:

In [32]:
Rscript bedstat/scripts/installRdeps.R > R_deps.txt

In case there is an issue installing `GenomicDistributionsData`, try:
```
wget http://big.databio.org/GenomicDistributionsData/GenomicDistributionsData_0.0.2.tar.gz
Rscript -e 'install.packages("GenomicDistributionsData_0.0.2.tar.gz", type="source", repos=NULL)'
```

There's an additional dependency needed by `bedstat` if we wish to calculate and plot the GC content of our bedfiles. Depending on the genome assemblies of the files listed in a PEP, the appropriate BSgenome packages should be installed. The following is an example of how we can do so:

In [12]:
cat bedbase/tutorial_files/scripts/BSgenome_install.R

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("BSgenome.Hsapiens.UCSC.hg38.masked")

In [31]:
Rscript bedbase/tutorial_files/scripts/BSgenome_install.R > BSgenome.txt

We'll need to create a directory where we can store the stats and plots generated by `bedstat`. Additionally, we'll create a directory where we can store log and metadata files that we'll need later on.

In [26]:
mkdir -p outputs/bedstat_output/bedstat_pipeline_logs

In order to use `bbconf`, we'll need to create a minimal configuration.yaml file. The path to this configuration file can be stored in the environment variable `$BEDBASE`.

In [27]:
cat bedbase/tutorial_files/bedbase_configuration_compose.yaml

path:
  pipeline_output_path: $BEDBASE_DATA_PATH_HOST/outputs
  bedstat_dir: bedstat_output
  bedbuncher_dir: bedbuncher_output
  remote_url_base: null
database:
  host: $DB_HOST_URL
  port: $POSTGRES_PORT
  password: $POSTGRES_PASSWORD
  user: $POSTGRES_USER
  name: $POSTGRES_DB
  dialect: postgresql
  driver: psycopg2
server:
  host: 0.0.0.0
  port: 8000
remotes:
  http:
    prefix: http://data.bedbase.org/
    description: HTTP compatible path
  s3:
    prefix: s3://data.bedbase.org/
    description: S3 compatible path
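
The `database` section above carries everything needed to reach PostgreSQL. As a rough illustration, the sketch below assembles those settings into an SQLAlchemy-style connection URL of the form `dialect+driver://user:password@host:port/name`. The function name is hypothetical, and whether `bbconf` builds its connection exactly this way is an assumption; the sketch only shows how the fields relate to each other.

```python
import os
from urllib.parse import quote_plus


def build_database_url(env=None, dialect="postgresql", driver="psycopg2"):
    """Assemble a connection URL from the variables used in the config above."""
    env = os.environ if env is None else env
    # Quote the credentials so special characters do not break the URL.
    user = quote_plus(env["POSTGRES_USER"])
    password = quote_plus(env["POSTGRES_PASSWORD"])
    host = env["DB_HOST_URL"]
    port = env["POSTGRES_PORT"]
    name = env["POSTGRES_DB"]
    return f"{dialect}+{driver}://{user}:{password}@{host}:{port}/{name}"


# Values matching the demo settings used later in this tutorial.
demo_env = {
    "DB_HOST_URL": "localhost",
    "POSTGRES_PORT": "5432",
    "POSTGRES_PASSWORD": "bedbasepassword",
    "POSTGRES_USER": "postgres",
    "POSTGRES_DB": "postgres",
}
print(build_database_url(demo_env))
```

A `KeyError` from this function is a quick hint that one of the variables referenced in the configuration file has not been exported.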


### Initiate a local PostgreSQL instance

In addition to generating statistics and plots, `bedstat` inserts JSON-formatted metadata into a relational PostgreSQL database.

If you don't have Docker installed, you can install it on Debian/Ubuntu with `sudo apt-get update && sudo apt-get install docker-engine -y` (the package name may differ depending on your distribution and Docker release).

Now, create a persistent volume to house PostgreSQL data:

In [7]:
docker volume create postgres-data

postgres-data


Spin up a `postgres` container. Provide the required environment variables (they must match the settings in the bedbase configuration file) and bind the created Docker volume to the `/var/lib/postgresql/data` path in the container:

In [8]:
docker run -d --name bedbase-postgres -p 5432:5432 -e POSTGRES_PASSWORD=bedbasepassword -e POSTGRES_USER=postgres -e POSTGRES_DB=postgres -v postgres-data:/var/lib/postgresql/data postgres

Unable to find image 'postgres:latest' locally
latest: Pulling from library/postgres

b6b2107f: Pulling fs layer
51fa2b56: Pulling fs layer
b6f96d81: Pulling fs layer
ac832fde: Pulling fs layer
ee1a3f12: Pulling fs layer
3c06319e: Pulling fs layer
a72764d5: Pulling fs layer
2872ecae: Pulling fs layer
a31f2e3d: Pulling fs layer
442835e0: Pulling fs layer
05af3390: Pulling fs layer
852bb872: Pulling fs layer
0be11543: Pulling fs layer
Digest: sha256:8f7c3c9b61d82a4a021da5d9618faf056633e089302a726d619fa467c73609e4
Status: Downloaded newer image for postgres:latest
11bba276e7c48ccdd101d78aacffe85b19283611ebd91572fbd69da06086c698


If the environment variables referenced in the bedbase configuration file have not been set yet, we have to set them manually:

In [28]:
export DB_HOST_URL=localhost
export POSTGRES_PORT=5432
export POSTGRES_PASSWORD=bedbasepassword
export POSTGRES_USER=postgres
export POSTGRES_DB=postgres
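
Before launching the pipeline, it can be useful to confirm that all five variables are actually visible to the process. The following is a small sketch of such a check; the helper name `missing_variables` is hypothetical and not part of any bedbase package.

```python
import os

# The variables referenced in the `database` section of the bedbase config.
REQUIRED_VARIABLES = (
    "DB_HOST_URL",
    "POSTGRES_PORT",
    "POSTGRES_PASSWORD",
    "POSTGRES_USER",
    "POSTGRES_DB",
)


def missing_variables(env=None, required=REQUIRED_VARIABLES):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All database variables are set.")
```

Keep in mind that variables exported in one shell session (or notebook kernel) are not visible in another, so the check should run in the same session that will call `looper`.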

### Run bedstat  on the demo PEP

In order to establish a modular connection between a project and a pipeline, we'll need to create a [pipeline interface](http://looper.databio.org/en/latest/linking-a-pipeline/) file, which tells looper how to run the pipeline. 

In [10]:
cat bedstat/pipeline_interface_new.yaml

pipeline_name: BEDSTAT
pipeline_type: sample
path: pipeline/bedstat.py
input_schema: http://schema.databio.org/pipelines/bedstat.yaml
command_template: >
  {pipeline.path}
  --bedfile {sample.output_file_path}
  --genome {sample.genome}
  --sample-yaml {sample.yaml_file}
  {% if sample.bedbase_config is defined %} --bedbase-config {sample.bedbase_config} {% endif %}
  {% if sample.open_signal_matrix is defined %} --open-signal-matrix {sample.open_signal_matrix} {% endif %}
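
To illustrate what the `command_template` above expresses, the sketch below rebuilds the same command in plain Python: three arguments are always present, and the two `{% if ... is defined %}` blocks add their flags only when the sample carries the corresponding attribute. This is a hand-written emulation for illustration, not looper's actual template rendering; the function name and the example paths are hypothetical.

```python
import shlex

# Optional attributes and the flags they map to in the command template.
OPTIONAL_FLAGS = (
    ("bedbase_config", "--bedbase-config"),
    ("open_signal_matrix", "--open-signal-matrix"),
)


def build_bedstat_command(pipeline_path, sample):
    """Build the bedstat command line for one sample (illustrative only)."""
    command = [
        pipeline_path,
        "--bedfile", sample["output_file_path"],
        "--genome", sample["genome"],
        "--sample-yaml", sample["yaml_file"],
    ]
    # Mirror the "{% if sample.X is defined %}" blocks of the template.
    for attribute, flag in OPTIONAL_FLAGS:
        if attribute in sample:
            command += [flag, sample[attribute]]
    return shlex.join(command)


# Hypothetical sample attributes, for illustration.
sample = {
    "output_file_path": "/data/bed_files/example.bed.gz",
    "genome": "hg38",
    "yaml_file": "/data/submission/example_sample.yaml",
    "bedbase_config": "/data/bedbase_configuration_compose.yaml",
}
print(build_bedstat_command("pipeline/bedstat.py", sample))
```

Because `open_signal_matrix` is absent from the hypothetical sample, its flag is simply left out of the resulting command.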


Once we have properly linked our project to the pipeline of interest, in this case `bedstat`, we simply need to point the `looper run` command to our `PEP` config file. Additionally, if the bedbase configuration file location is not stored in the `$BEDBASE` variable, we can pass it to `looper` as an additional argument:

In [29]:
looper run bedbase/tutorial_files/PEPs/bedstat_config.yaml --package local \
--command-extra="-R" > outputs/bedstat_output/bedstat_pipeline_logs/looper_logs.txt

Looper version: 1.3.1
Command: run
  os.path.dirname(self._file_path),
  self.config_file = self._file_path
Activating compute package 'local'
## [1 of 11] sample: bedbase_demo_db1; pipeline: BEDSTAT
Calling pre-submit function: looper.write_sample_yaml
Writing script to /home/bnt4me/Virginia/bed_maker/bedbase_tutorial/outputs/bedstat_output/bedstat_pipeline_logs/submission/BEDSTAT_bedbase_demo_db1.sub
Job script (n=1; 0.00Gb): /home/bnt4me/Virginia/bed_maker/bedbase_tutorial/outputs/bedstat_output/bedstat_pipeline_logs/submission/BEDSTAT_bedbase_demo_db1.sub
## [2 of 11] sample: bedbase_demo_db2; pipeline: BEDSTAT
Calling pre-submit function: looper.write_sample_yaml
Writing script to /home/bnt4me/Virginia/bed_maker/bedbase_tutorial/outputs/bedstat_output/bedstat_pipeline_logs/submission/BEDSTAT_bedbase_demo_db2.sub
Job script (n=1; 0.00Gb): /home/bnt4me/Virginia/bed_maker/bedbase_tutorial/outputs/bedstat_output/bedstat_pipeline_logs/submission/BEDSTAT_bedbase_demo_d

For reference, we can inspect how `bedstat` operates on each bedfile:

In [30]:
head outputs/bedstat_output/bedstat_pipeline_logs/looper_logs.txt

Compute node: cphg-Precision-5560
Start time: 2021-12-06 12:17:46
### Pipeline run code and environment:

*              Command:  `/home/bnt4me/Virginia/bed_maker/bedbase_tutorial/bedstat/pipeline/bedstat.py --bedfile /home/bnt4me/Virginia/bed_maker/bedbase_tutorial/bed_files/GSE105587_ENCFF018NNF_conservative_idr_thresholded_peaks_GRCh38.bed.gz --genome hg38 --sample-yaml /home/bnt4me/Virginia/bed_maker/bedbase_tutorial/outputs/bedstat_output/bedstat_pipeline_logs/submission/bedbase_demo_db1_sample.yaml --bedbase-config /home/bnt4me/Virginia/bed_maker/bedbase_tutorial/bedbase/tutorial_files/bedbase_configuration_compose.yaml --open-signal-matrix /home/bnt4me/Virginia/bed_maker/bedbase_tutorial/openSignalMatrix_hg38_percentile99_01_quantNormalized_round4d.txt.gz --bigbed /home/bnt4me/Virginia/bed_maker/bedbase_tutorial/bigbed_files -R`
*         Compute host:  cphg-Precision-5560
*          Working dir:  /home/bnt4me/Virginia/bed_maker/bedbase_tutorial
*            Outfolder:  /home/b

After the previous steps have been executed, our bedfiles should be available for querying in our local PostgreSQL database. Files can be queried using the `bedbuncher` pipeline described in the next section.


## 4. BEDBUNCHER: Create bedsets and their respective statistics 

### Create a new PEP describing the bedset name and specific JSON query 

Now that we've processed several individual BED files, we'll turn to the next task: grouping them together into collections of BED files, which we call *bedsets*. For this, we use the `bedbuncher` pipeline, which produces outputs for each bedset, such as a bedset PEP, bedset-level statistics and plots, and an `IGD` database. To run `bedbuncher`, we will need another PEP describing each bedset. Each row of the annotation sheet below specifies the attributes for one bedset, so you can create as many bedsets as you wish by adding rows. For each bedset, you need to provide the query that retrieves the desired collection of BED files.

The following example PEP shows the attributes we need to provide for each bedset and the config.yaml file that will grab the files needed to run `bedbuncher`:

In [31]:
cat bedbase/tutorial_files/PEPs/bedbuncher_query.csv

sample_name,bedset_name,genome,query,operator,query_val,bbconfig_name,bedbase_config
sample1,bedsetOver1kRegions,hg38,'regions_no',gt,"""1000""",bedbase_configuration_compose,source1
sample2,bedsetOver50GCContent,hg38,'gc_content',gt,"""0.5""",bedbase_configuration_compose,source1
sample3,bedsetUnder500MeanWidth,hg38,'mean_region_width',lt,"""500""",bedbase_configuration_compose,source1
sample4,bedsetTestSelectCellType,hg38,"""other::text~~:str_1 or other::text~~:str_2""","""str_1,str_2""","""%GM12878%,%HEK293%""",bedbase_configuration_compose,source1
sample5,bedsetTestSelectGenome,hg38,"""name=:name_1 or name=:name_2""","""name_1,name_2""","""GSE105587_ENCFF018NNF_conservative_idr_thresholded_peaks_GRCh38,GSE91663_ENCFF553KIK_optimal_idr_thresholded_peaks_GRCh38""",bedbase_configuration_compose,source1
sample6,bedsetTestCellType,hg38,"""other""",contains,"""""{\""cell_type\"":\ \""K562\""}""""",bedbase_configuration_compose,source1
sample7,bedsetTestSpace,hg38,"""other""",contains,"""

In [32]:
cat bedbase/tutorial_files/PEPs/bedbuncher_config.yaml

pep_version: 2.0.0
sample_table: bedbuncher_query.csv

looper:
    output_dir: $BEDBASE_DATA_PATH_HOST/outputs/bedbuncher_output/bedbuncher_pipeline_logs

sample_modifiers:
  append:
    pipeline_interfaces: $CODE/bedbuncher/pipeline_interface.yaml 
  derive:
    attributes: [bedbase_config]
    sources:
      source1: $CODE/bedbase/tutorial_files/{bbconfig_name}.yaml


Running `bedbuncher` with the arguments defined in the example PEP above produces one bedset per row. For example, the first row (`regions_no`, `gt`, `1000`) results in a bedset made up of the bedfiles that contain more than 1000 regions.
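
The simple rows of the query table (a column name, an operator such as `gt` or `lt`, and a value) can be read as a comparison filter. The sketch below applies such a filter to a small in-memory list of records. In the real pipeline the selection is run against the PostgreSQL database; the record values and the function name here are hypothetical and only illustrate the semantics.

```python
import operator

# Map the operator names used in the query table to Python comparisons.
OPERATORS = {"gt": operator.gt, "lt": operator.lt}


def select_bedfiles(records, column, op_name, value):
    """Return the names of records whose column satisfies the comparison."""
    compare = OPERATORS[op_name]
    return [record["name"] for record in records if compare(record[column], value)]


# Hypothetical per-file statistics, for illustration.
records = [
    {"name": "file_a", "regions_no": 500, "mean_region_width": 320.0},
    {"name": "file_b", "regions_no": 1000, "mean_region_width": 750.0},
    {"name": "file_c", "regions_no": 25000, "mean_region_width": 410.0},
]

print(select_bedfiles(records, "regions_no", "gt", 1000))
print(select_bedfiles(records, "mean_region_width", "lt", 500))
```

Note that `gt` is a strict comparison, so a file with exactly 1000 regions is not selected by the first query.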

###  Create outputs directory and install bedbuncher command line dependencies

We need a folder where we can store bedset related outputs. Though not required, we'll also create a directory where we can store the `bedbuncher` pipeline logs. 

In [33]:
mkdir -p outputs/bedbuncher_output/bedbuncher_pipeline_logs

One of the features of `bedbuncher` is the creation of an [IGD](https://github.com/databio/IGD) database from the files in the bedset. `IGD` can be installed by cloning the repository from GitHub, running `make` to build the binary, and adding the binary's location to the `$PATH` environment variable.

In [34]:
git clone git@github.com:databio/IGD
cd IGD
make > igd_make_log.txt 2>&1
cd ..

export PATH=$BEDBASE_DATA_PATH_HOST/IGD/bin/:$PATH

Cloning into 'IGD'...
remote: Enumerating objects: 1297, done.
remote: Counting objects: 100% (67/67), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 1297 (delta 35), reused 40 (delta 17), pack-reused 1230
Receiving objects: 100% (1297/1297), 949.45 KiB | 10.79 MiB/s, done.
Resolving deltas: 100% (804/804), done.
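
Since `bedbuncher` relies on finding the IGD binary through `$PATH`, a quick check that the binary is discoverable can save a failed pipeline run. The sketch below uses `shutil.which` from the Python standard library; the executable name `igd` is an assumption about what the build produces, so adjust it if your build differs.

```python
import shutil


def missing_tools(tool_names):
    """Return the tools from tool_names that cannot be found on PATH."""
    return [tool for tool in tool_names if shutil.which(tool) is None]


if __name__ == "__main__":
    # "igd" is the assumed name of the binary built in the previous step.
    not_found = missing_tools(["igd"])
    if not_found:
        print("Not found on PATH: " + ", ".join(not_found))
    else:
        print("All required command line tools were found.")
```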


### Run bedbuncher using Looper 

Once we have cloned the `bedbuncher` repository, set up our local PostgreSQL database, and built the `IGD` binary, we can run the pipeline by pointing `looper run` to the appropriate `PEP` config file. As mentioned earlier, if the path to the bedbase configuration file has been stored in the `$BEDBASE` environment variable, it's not necessary to pass the `--bedbase-config` argument.

In [36]:
looper run  bedbase/tutorial_files/PEPs/bedbuncher_config.yaml  --package local \
--command-extra="-R" > outputs/bedbuncher_output/bedbuncher_pipeline_logs/looper_logs.txt

Looper version: 1.3.1
Command: run
  os.path.dirname(self._file_path),
  self.config_file = self._file_path
Activating compute package 'local'
## [1 of 10] sample: sample1; pipeline: BEDBUNCHER
Writing script to /home/bnt4me/Virginia/bed_maker/bedbase_tutorial/outputs/bedbuncher_output/bedbuncher_pipeline_logs/submission/BEDBUNCHER_sample1.sub
Job script (n=1; 0.00Gb): /home/bnt4me/Virginia/bed_maker/bedbase_tutorial/outputs/bedbuncher_output/bedbuncher_pipeline_logs/submission/BEDBUNCHER_sample1.sub
## [2 of 10] sample: sample2; pipeline: BEDBUNCHER
Writing script to /home/bnt4me/Virginia/bed_maker/bedbase_tutorial/outputs/bedbuncher_output/bedbuncher_pipeline_logs/submission/BEDBUNCHER_sample2.sub
Job script (n=1; 0.00Gb): /home/bnt4me/Virginia/bed_maker/bedbase_tutorial/outputs/bedbuncher_output/bedbuncher_pipeline_logs/submission/BEDBUNCHER_sample2.sub
## [3 of 10] sample: sample3; pipeline: BEDBUNCHER
Writing script to /home/bnt4me/Virginia/bed_maker/bed

## 5. BEDEMBED: 

### bedembed_train: Uses the StarSpace method to embed the bed files and the meta data.

We need to install [StarSpace](https://github.com/facebookresearch/StarSpace) first.  

In [None]:
mkdir -p bedembed/tools

StarSpace depends on the [Boost](http://www.boost.org/) library, so we need to install Boost and specify its path in the StarSpace makefile before building.

In [None]:
wget https://boostorg.jfrog.io/artifactory/main/release/1.78.0/source/boost_1_78_0.zip
unzip boost_1_78_0.zip
sudo mv boost_1_78_0 /usr/local/bin
cd /usr/local/bin/boost_1_78_0
./bootstrap.sh
./b2

In order to build StarSpace on Mac OS or Linux, use the following:

In [None]:
cd $BEDBASE_DATA_PATH_HOST/bedembed/tools
git clone https://github.com/facebookresearch/Starspace.git
cd Starspace
make
make embed_doc

We need a folder where we can store bedembed related outputs. Though not required, we'll also create a directory where we can store the bedembed pipeline logs.

In [None]:
mkdir -p outputs/bedembed_output/bedembed_pipeline_logs

In [None]:
path_starspace=$BEDBASE_DATA_PATH_HOST'/bedembed/tools/Starspace/starspace'
path_meta=$BEDBASE_DATA_PATH_HOST'/bedbase/tutorial_files/PEPs/bedstat_annotation_sheet.csv'
# download Universe file from rivanna
path_universe=$BEDBASE_DATA_PATH_HOST'/tiles1000.hg19.bed'
path_output=$BEDBASE_DATA_PATH_HOST'/outputs/bedembed_output/'
assembly='hg38'
path_data=$BEDBASE_DATA_PATH_HOST'/bed_files/'
labels="exp_protocol,cell_type,tissue,antibody,treatment"
no_files=10
start_line=0
dim=50
epochs=20
learning_rate=0.001

python ./bedembed/pipeline/bedembed_train.py -star $path_starspace -i $path_data -g $assembly -meta $path_meta -univ $path_universe \
-l $labels -nof $no_files -o $path_output -startline $start_line -dim $dim -epochs $epochs -lr $learning_rate

### bedembed_test: calculate the distances between file labels and trained search terms

### Get a PEP describing the bedfiles to process 

We'll use the standard [PEP](http://pep.databio.org) format for the annotation, which consists of 1) a sample table (.csv) that annotates the files, and 2) a project config.yaml file that points to the sample annotation sheet. The config file also has other components, such as derived attributes, that in this case point to the bedfiles to be processed. Here is the PEP config file for this example project:

In [1]:
cat bedbase/tutorial_files/PEPs/bedembed_test_config.yaml

bedembed_version: 0.0.0
sample_table: bedstat_annotation_sheet.csv

looper:
  output-dir: $BEDBASE_DATA_PATH_HOST/outputs/bedembed_output/bedembed_pipeline_logs 
sample_modifiers:
  append:
    bedbase_config: $BEDBASE_DATA_PATH_HOST/bedbase/tutorial_files/bedbase_configuration_compose.yaml
    pipeline_interfaces: $BEDBASE_DATA_PATH_HOST/bedembed/pipeline_interface_test.yaml
    universe: /project/shefflab/data/StarSpace/universe/universe_tilelen1000.bed
    input_file_path: INPUT
    output_file_path: $BEDBASE_DATA_PATH_HOST/outputs/bedembed_output
    yaml_file: SAMPLE_YAML
  derive:
    attributes: [yaml_file, input_file_path]
    sources:
      INPUT: "/project/shefflab/data/encode/{file_name}"
      SAMPLE_YAML: "$BEDBASE_DATA_PATH_HOST/outputs/bedembed_output/bedembed_pipeline_logs/submission/{sample_name}_sample.yaml"


### Run bedembed using Looper 

Once we have cloned the `bedembed` repository and set up our local PostgreSQL database, we can run the pipeline by pointing `looper run` to the appropriate `PEP` config file. If the path to the bedbase configuration file is provided, the calculated distances are reported to the PostgreSQL database; otherwise they are saved as a CSV file in the `output_file_path`.

In [None]:
looper run bedbase/tutorial_files/PEPs/bedembed_test_config.yaml --package local
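
The introduction of this tutorial describes the distances between file labels and trained search terms as cosine distances. As a reminder of what that measure computes, here is a minimal pure-Python sketch. The vectors and the helper names are hypothetical and unrelated to the actual `bedembed` code; real embeddings would have the dimension chosen during training (`dim=50` in the example above).

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def rank_labels(file_vector, label_vectors):
    """Sort label names from closest to farthest from the file embedding."""
    return sorted(
        label_vectors,
        key=lambda label: cosine_distance(file_vector, label_vectors[label]),
    )


# Hypothetical 2-dimensional embeddings, for illustration only.
file_embedding = [1.0, 0.0]
label_embeddings = {"K562": [0.9, 0.1], "HEK293": [0.0, 1.0]}
print(rank_labels(file_embedding, label_embeddings))
```

A distance near 0 means the two vectors point in almost the same direction, a distance of 1 means they are orthogonal, and 2 means they point in opposite directions.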

## 6. BEDHOST:  Serve BED files and API to explore pipeline outputs

The last part of the tutorial consists of running a local instance of `bedhost` (a REST API for the outputs produced by `bedstat` and `bedbuncher`) in order to explore plots and statistics and to download pipeline outputs. 
To run `bedhost`, first use `bedhost-ui` to build the bedhost user interface with React.

In [38]:
cd bedhost-ui
# Install node modules defined in package.json
npm install 
# Build the app for production to the ./build folder
npm run build
# copy the contents of the ./build directory to bedhost/bedhost/static/bedhost-ui
cp -avr ./build ../bedhost/bedhost/static/bedhost-ui

cd ..

To run `bedhost`, we'll pip install the package from the previously cloned repository:

In [39]:
pip install bedhost/. --user > bedhost_log.txt

To start `bedhost`, we simply need to run the following command passing the location of the bedbase configuration file to the `-c` flag.  

In [None]:
bedhost serve -c  $BEDBASE_DATA_PATH_HOST/bedbase/tutorial_files/bedbase_configuration_compose.yaml

Serving data for columns: ['md5sum']
Serving data for columns: ['md5sum']
Generating GraphQL schema
running bedhost app
[32mINFO[0m:     Started server process [[36m648505[0m]
[32mINFO[0m:     Waiting for application startup.
[32mINFO[0m:     Application startup complete.
[32mINFO[0m:     Uvicorn running on [1mhttp://0.0.0.0:8000[0m (Press CTRL+C to quit)
[32mINFO[0m:     127.0.0.1:47532 - "[1mGET / HTTP/1.1[0m" [32m200 OK[0m
[32mINFO[0m:     127.0.0.1:47532 - "[1mGET /ui/static/css/2.fa6c921b.chunk.css HTTP/1.1[0m" [32m200 OK[0m
[32mINFO[0m:     127.0.0.1:47534 - "[1mGET /ui/static/css/main.4620a2c9.chunk.css HTTP/1.1[0m" [32m200 OK[0m
[32mINFO[0m:     127.0.0.1:47536 - "[1mGET /ui/static/js/2.b0639060.chunk.js HTTP/1.1[0m" [32m200 OK[0m
[32mINFO[0m:     127.0.0.1:47534 - "[1mGET /ui/static/js/main.56118e82.chunk.js HTTP/1.1[0m" [32m200 OK[0m
[32mINFO[0m:     127.0.0.1:47536 - "[1mGET /api/bed/all/data/count HTTP/1.1[0m" [32m200 OK[0m
[(

If we have stored the path to the bedbase config in the environment variable `$BEDBASE` (suggested), it's not necessary to use this flag. 

In [None]:
bedhost serve 

The `bedhost` API can be opened in the url [http://0.0.0.0:8000](http://0.0.0.0:8000). We can now explore the plots and statistics generated by the `bedstat` and `bedbuncher` pipelines.
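
Besides the browser, the API can also be queried programmatically. The sketch below uses only the Python standard library and the `/api/bed/all/data/count` path that appears in the server log above. That this endpoint returns a JSON body is an assumption, and the helper names are hypothetical; on some systems you may need to replace `0.0.0.0` with `localhost` when connecting as a client.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "http://0.0.0.0:8000"


def endpoint_url(path, base_url=BASE_URL):
    """Join the server address and an API path into a full URL."""
    return urljoin(base_url + "/", path.lstrip("/"))


def fetch_json(path, base_url=BASE_URL, timeout=10):
    """GET an endpoint and decode the response, assuming it is JSON."""
    with urlopen(endpoint_url(path, base_url), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires a running bedhost instance; the path is taken from the log above.
    print(fetch_json("/api/bed/all/data/count"))
```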

## or optionally run BEDHOST using containers

Alternatively, you can run the application inside a container.

For that we'll use [docker compose](https://docs.docker.com/compose/), a tool for running multi-container Docker applications. The `docker-compose.yaml` file defines two services: 
- `fastapi-api`: runs the FastAPI server 
- `postgres-db`: runs the PostgreSQL database used by the server


In [24]:
cd $BEDBASE_DATA_PATH_HOST

Use the `BEDBASE_DATA_PATH_HOST` environment variable to point to the host directory with the pipeline results that will be mounted in the container as a volume. 

The environment variables are passed to the container via a `.env` file, which the `docker-compose.yaml` points to for each service. Alternatively, you can export the environment variables before issuing the `docker-compose` command.
When you set the same environment variable in multiple files, here’s the priority used by Compose to choose which value to use:

1. Compose file
2. Shell environment variables
3. Environment file
4. Dockerfile
5. Variable is not defined
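
The priority list above amounts to a "first source that defines the variable wins" lookup. The following sketch models each source as a plain dictionary to show how a value would be chosen; it is a toy model of the ordering described in this tutorial, not Compose's actual implementation, and the function name is hypothetical.

```python
def resolve_variable(name, compose_file, shell_env, env_file, dockerfile):
    """Return the value from the highest-priority source that defines name."""
    # Sources are checked in the priority order listed above.
    for source in (compose_file, shell_env, env_file, dockerfile):
        if name in source:
            return source[name]
    # Lowest priority: the variable is not defined anywhere.
    return None


# Hypothetical settings: the shell export overrides the .env file.
value = resolve_variable(
    "POSTGRES_DB",
    compose_file={},
    shell_env={"POSTGRES_DB": "from_shell"},
    env_file={"POSTGRES_DB": "from_env_file"},
    dockerfile={},
)
print(value)
```

In practice this means that a variable exported in your shell before running `docker-compose up` takes precedence over the same variable in the `.env` file.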

In [26]:
cd bedhost; docker-compose up

Pulling postgres-db (postgres:)...
latest: Pulling from library/postgres
Digest: sha256:8f7c3c9b61d82a4a021da5d9618faf056633e089302a726d619fa467c73609e4
Status: Downloaded newer image for postgres:latest
Recreating postgreSQL-bedbase ... 
Recreating fastAPI-bedbase    ... done
Attaching to postgreSQL-bedbase, fastAPI-bedbase
postgreSQL-bedbase | 
postgreSQL-bedbase | PostgreSQL Database directory appears to contain a database; Skipping initialization
postgreSQL-bedbase | 
postgreSQL-bedbase | 2020-11-02 23:10:28.883 UTC [1] LOG:  starting PostgreSQL 13.0 (Debian 13.0-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
postgreSQL-bedbase | 2020-11-02 23:10:28.885 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
postgreSQL-bedbase | 2020-11-02 23:10:28.885 UTC [1] LOG:  listening on IPv6 address "::", port 5432
postgreSQL-bedbase | 2020-11-02 23:10:28.891 UTC [1] LOG:  