# bedboss stat

This tutorial introduces bedstat (`bedboss stat`), a pipeline that produces statistics and plots from BED and bigBed files.

### 1. Install all dependencies and initialize the database

- Install dependencies: [How to install R dependencies](./how_to_install_r_dep/)
- Initialize database: [How to initialize database](./how_to_create_database/)
- Create config file: [How to create config file](./how_to_bedbase_config/)

### 2. Create a working directory

In [7]:
mkdir stat_tutorial ; cd stat_tutorial 

Create the config file by downloading it and adjusting it to your setup:

In [21]:
cat bedbase_config_test.yaml

path:
  pipeline_output_path: $BEDBOSS_OUTPUT_PATH  # do not change it
  bedstat_dir: bedstat_output
  remote_url_base: null
  bedbuncher_dir: bedbucher_output
database:
  host: localhost
  port: 5432
  password: docker
  user: postgres
  name: pep-db
  dialect: postgresql
  driver: psycopg2
server:
  host: 0.0.0.0
  port: 8000
remotes:
  http:
    prefix: https://data.bedbase.org/
    description: HTTP compatible path
  s3:
    prefix: s3://data.bedbase.org/
    description: S3 compatible path
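
Before running the pipeline, it can help to confirm that the PostgreSQL server named in the `database` section is reachable. The sketch below is not part of bedboss: `yaml_section_value` is a hypothetical helper that reads one key from a flat, two-level YAML file like the one above (it is not a general YAML parser), and `pg_isready` is the standard PostgreSQL client utility, which is assumed to be installed.

```shell
# Hypothetical helper: print the value of KEY inside top-level SECTION.
# Only handles simple "section:" / "  key: value" layouts like the config above.
yaml_section_value() {
    # $1 = YAML file, $2 = top-level section name, $3 = key inside that section
    awk -v section="$2" -v key="$3" '
        /^[^ #]/ { in_section = ($1 == (section ":")) }   # a top-level section starts
        in_section && $1 == (key ":") { print $2; exit }  # first matching key wins
    ' "$1"
}

config=bedbase_config_test.yaml
if [ -f "$config" ] && command -v pg_isready >/dev/null 2>&1; then
    db_host=$(yaml_section_value "$config" database host)
    db_port=$(yaml_section_value "$config" database port)
    # pg_isready exits 0 when the server accepts connections
    pg_isready -h "$db_host" -p "$db_port"
fi
```

Reading the host and port from the config rather than typing them again keeps the check in step with whatever the pipeline itself will use.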


### 3. Download bed and bigbed files

Bed file

In [14]:
wget -O sample1.bed.gz https://github.com/bedbase/bedboss/raw/dev/test/data/bed/hg19/correct/sample1.bed.gz


--2023-02-28 15:32:57--  https://github.com/bedbase/bedboss/raw/dev/test/data/bed/hg19/correct/sample1.bed.gz
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/bedbase/bedboss/dev/test/data/bed/hg19/correct/sample1.bed.gz [following]
--2023-02-28 15:32:57--  https://raw.githubusercontent.com/bedbase/bedboss/dev/test/data/bed/hg19/correct/sample1.bed.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7087126 (6.8M) [application/octet-stream]
Saving to: ‘sample1.bed.gz’


2023-02-28 15:32:58 (95.8 MB/s) - ‘sample1.bed.gz’ saved [7087126/7087126]



BigBed file

In [15]:
wget -O sample1.bigBed https://github.com/bedbase/bedboss/raw/dev/test/data/bigbed/hg19/correct/sample1.bigBed


--2023-02-28 15:33:00--  https://github.com/bedbase/bedboss/raw/dev/test/data/bigbed/hg19/correct/sample1.bigBed
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/bedbase/bedboss/dev/test/data/bigbed/hg19/correct/sample1.bigBed [following]
--2023-02-28 15:33:00--  https://raw.githubusercontent.com/bedbase/bedboss/dev/test/data/bigbed/hg19/correct/sample1.bigBed
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13092350 (12M) [application/octet-stream]
Saving to: ‘sample1.bigBed’


2023-02-28 15:33:00 (101 MB/s) - ‘sample1.bigBed’ saved [13092350/13092350]



In [22]:
ls

bedbase_config_test.yaml  sample1.bed.gz  sample1.bigBed
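
A truncated download would only surface later as a pipeline error, so a quick integrity check on the gzipped BED file can save time. This is a minimal sketch: `check_bed_gz` is a hypothetical helper name, and the line count equals the number of regions only if the file has no header or track lines, which is an assumption about `sample1.bed.gz`.

```shell
# Hypothetical helper: verify a gzipped BED file and print its line count.
check_bed_gz() {
    # gzip -t exits non-zero if the archive is truncated or corrupted
    if ! gzip -t "$1" 2>/dev/null; then
        echo "corrupt or non-gzip file: $1" >&2
        return 1
    fi
    # Count non-empty lines; in a header-less BED file each line is one region
    gzip -dc "$1" | awk 'NF > 0 { n++ } END { print n + 0 }'
}

if [ -f sample1.bed.gz ]; then
    check_bed_gz sample1.bed.gz
fi
```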


### 4. Run statistics:

In addition to the files themselves, we need some metadata: the genome assembly, the config file, and the output folder.

In [23]:
bedboss stat --help

usage: bedboss stat [-h] --bedfile BEDFILE --outfolder OUTFOLDER
                    [--open-signal-matrix OPEN_SIGNAL_MATRIX] [--ensdb ENSDB]
                    [--bigbed BIGBED] --bedbase-config BEDBASE_CONFIG
                    [-y SAMPLE_YAML] --genome GENOME_ASSEMBLY [--no-db-commit]
                    [--just-db-commit]

options:
  -h, --help            show this help message and exit
  --bedfile BEDFILE     a full path to bed file to process [Required]
  --outfolder OUTFOLDER
                        Pipeline output folder [Required]
  --open-signal-matrix OPEN_SIGNAL_MATRIX
                        a full path to the openSignalMatrix required for the
                        tissue specificity plots
  --ensdb ENSDB         a full path to the ensdb gtf file required for genomes
                        not in GDdata
  --bigbed BIGBED       a full path to the bigbed files
  --bedbase-config BEDBASE_CONFIG
                        a path to the bedbase configuration file [Required]


In [39]:
bedboss stat \
--bedfile ./sample1.bed.gz \
--bigbed ./sample1.bigBed \
--outfolder ./test_output \
--genome hg19 \
--bedbase-config ./bedbase_config_test.yaml 


### Pipeline run code and environment:

*              Command:  `/home/bnt4me/virginia/venv/jupyter/bin/bedboss stat --bedfile ./sample1.bed.gz --bigbed ./sample1.bigBed --outfolder ./test_output --genome hg19 --bedbase-config ./bedbase_config_test.yaml`
*         Compute host:  bnt4me-Precision-5560
*          Working dir:  /home/bnt4me/virginia/repos/bedbase_all/bedboss/docs_jupyter/stat_tutorial
*            Outfolder:  ./test_output/
*  Pipeline started at:   (02-28 15:46:52) elapsed: 0.0 _TIME_

### Version log:

*       Python version:  3.10.6
*          Pypiper dir:  `/home/bnt4me/virginia/venv/jupyter/lib/python3.10/site-packages/pypiper`
*      Pypiper version:  0.12.3
*         Pipeline dir:  `/home/bnt4me/virginia/venv/jupyter/bin`
*     Pipeline version:  0.1.0-dev1

### Arguments passed to pipeline:


----------------------------------------

Target to produce: `/home/bnt4me/virginia/repos/bedbase_all/bedboss/docs_jupyter/stat_tutorial/test_output/output/bedstat_output/c5

Once the plots and statistics have been produced, we can look at them:

In [43]:
ls test_output/output/bedstat_output/c557c915a9901ce377ef724806ff7a2c

sample1_chrombins.pdf              sample1_neighbor_distances.png
sample1_chrombins.png              sample1_paritions.pdf
sample1_cumulative_partitions.pdf  sample1_paritions.png
sample1_cumulative_partitions.png  sample1_plots.json
sample1_expected_partitions.pdf    sample1_tssdist.pdf
sample1_expected_partitions.png    sample1_tssdist.png
sample1.json                       sample1_widths_histogram.pdf
sample1_neighbor_distances.pdf     sample1_widths_histogram.png
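
The result directory is named after a digest (here `c557c915a9901ce377ef724806ff7a2c`), so its name is not obvious before the run. The following sketch locates the statistics files without typing the digest. `list_bedstat_results` is a hypothetical helper, and the `<outfolder>/output/bedstat_output/<digest>/` layout is taken from the paths shown in this tutorial; other configurations may differ.

```shell
# Hypothetical helper: print the statistics JSON found in each result directory.
list_bedstat_results() {
    # $1 = bedstat output directory, e.g. test_output/output/bedstat_output
    for dir in "$1"/*/; do
        [ -d "$dir" ] || continue            # the glob matched nothing
        for json in "$dir"*.json; do
            case "$json" in
                *_plots.json) ;;              # skip files such as sample1_plots.json
                *) if [ -f "$json" ]; then echo "$json"; fi ;;
            esac
        done
    done
}

if [ -d test_output/output/bedstat_output ]; then
    list_bedstat_results test_output/output/bedstat_output
fi
```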


In [44]:
cat test_output/output/bedstat_output/c557c915a9901ce377ef724806ff7a2c/sample1.json

{
  "name": ["sample1"],
  "regions_no": [300000],
  "mean_region_width": [663.9],
  "md5sum": ["c557c915a9901ce377ef724806ff7a2c"],
  "median_TSS_dist": [48580],
  "exon_frequency": [14871],
  "exon_percentage": [0.0496],
  "fiveUTR_frequency": [8981],
  "fiveUTR_percentage": [0.0299],
  "intergenic_frequency": [141763],
  "intergenic_percentage": [0.4725],
  "intron_frequency": [106638],
  "intron_percentage": [0.3555],
  "promoterCore_frequency": [10150],
  "promoterCore_percentage": [0.0338],
  "promoterProx_frequency": [6851],
  "promoterProx_percentage": [0.0228],
  "threeUTR_frequency": [10746],
  "threeUTR_percentage": [0.0358]
}
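
In this example the seven `*_frequency` counts add up to `regions_no` (300000), and the matching percentages add up to roughly 1. A quick consistency check along those lines can be sketched as follows. `check_partition_counts` is a hypothetical helper, and it assumes the one-key-per-line JSON layout shown above rather than parsing JSON in general.

```shell
# Hypothetical helper: compare the sum of *_frequency fields with regions_no.
check_partition_counts() {
    # Strip JSON punctuation so each line becomes "key: value", then sum in awk
    tr -d '[]{},"' < "$1" | awk '
        $1 ~ /_frequency:$/ { sum += $2 }      # partition counts
        $1 == "regions_no:" { total = $2 }     # total number of regions
        END {
            printf "frequencies=%d regions_no=%d\n", sum, total
            exit (sum == total ? 0 : 1)        # non-zero status on mismatch
        }'
}

stats=test_output/output/bedstat_output/c557c915a9901ce377ef724806ff7a2c/sample1.json
if [ -f "$stats" ]; then
    check_partition_counts "$stats"
fi
```

For the file shown above this prints `frequencies=300000 regions_no=300000` and exits with status 0.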
