# Integrating Hub into an Existing Package to Boost Performance

This experiment compares Hub's performance in a real-life package. We select a package that currently uses a TileDB backend and record its performance. We then port the backend to Hub and record the performance of the new backend on the same tests. Ideally, the Hub backend should outperform the original TileDB backend.

### Name of package: dnafrag
Link to original package: [https://github.com/kundajelab/dnafrag](https://github.com/kundajelab/dnafrag)

Link to modified package: [https://github.com/DebadityaPal/dnafrag](https://github.com/DebadityaPal/dnafrag)

### Why this package?

We selected "dnafrag" because we wanted to compare Hub's performance to that of TileDB first, before branching out to other similar alternatives. The package had to meet two requirements. First, for ease of porting, it had to be small enough that a single programmer could quickly understand how it works and start porting the backend to Hub. Second, it had to use a TileDB backend, since TileDB is what we want to compare Hub to. The intersection of these requirements produced only a small set of OSS packages; some were outdated and others had errors that could not be resolved. "dnafrag" was the best choice because it was relatively small and ran without any inherent errors.

## The Experiment

In [None]:
from google.colab import drive
drive.mount("/content/drive")

## Installation

The following cell installs all the dependencies and the package itself. If you are running this notebook on Google Colab, restart the Colab runtime after running this cell.

In [None]:
%cd "/content/drive/MyDrive/"
!git clone https://github.com/DebadityaPal/dnafrag
!pip install numpy
!pip install tqdm
!pip install scipy
!pip install tiledb
!pip install hub
!pip install pybedtools
%cd "/content/drive/MyDrive/dnafrag"
!pip install .

(Only for the Google Colab environment)

If the runtime environment was restarted, you can run the notebook from the next cell onwards. The previous cells don't need to be executed again, provided they were executed before the restart.

In [None]:
%cd "/content/drive/MyDrive/dnafrag"
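Before moving on to the tests, it can help to confirm that the dependencies installed above are still visible after the restart. The cell below is a minimal sketch of such a check. The helper `find_missing` and the list `REQUIRED_MODULES` are hypothetical names introduced here, and the list assumes that each pip package above exposes a top-level module of the same name.

```python
import importlib.util

# Assumption: each pip package installed above exposes a top-level module
# with the same name as the package.
REQUIRED_MODULES = ["numpy", "tqdm", "scipy", "tiledb", "hub", "pybedtools", "dnafrag"]


def find_missing(modules):
    """Return the names in `modules` that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```

If any module is reported as missing, re-running the installation cell is the likely fix. Note that this only checks that a module can be located, not that it imports without errors.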

## Testing

The following test and its input files are taken from the original repository of the package. We have modified the functions to add a timer and to use the Hub backend.

In [None]:
import os
import gzip
import json
import tempfile
import time

import numpy as np

import dnafrag

FRAGBED_FILE = "dnafrag/tests/test_fragbed_100k.fragbed.gz"
GENOME_FILE = "dnafrag/tests/test_fragbed_hg19.chrom.sizes"

TEST_CHROM_LENS = [500, 1034, 2031, 60001]
MAX_INTERVAL_LEN = 400
max_output_fraglen=300

NUM_TEST_CHROMS = len(TEST_CHROM_LENS)
TEST_CHROM_NAMES = ["chr{}".format(i) for i in range(NUM_TEST_CHROMS)]

output_dir = "./temp"

bed_entries = None

def time_tiledb():
    # Time a full write of the test fragment BED file with the TileDB backend.
    start = time.perf_counter()
    dnafrag.core.write_fragbed(
        fragment_bed=FRAGBED_FILE,
        output_dir=output_dir + "_tiledb/",
        genome_file=GENOME_FILE,
        max_fraglen=max_output_fraglen,
        backend="tiledb",
    )
    end = time.perf_counter()
    print("Time elapsed in seconds (TileDB): ", end - start)

def time_hub():
    # Time the same write with the Hub backend, using a separate output directory.
    start = time.perf_counter()
    dnafrag.core.write_fragbed(
        fragment_bed=FRAGBED_FILE,
        output_dir=output_dir + "_hub/",
        genome_file=GENOME_FILE,
        max_fraglen=max_output_fraglen,
        backend="hub",
    )
    end = time.perf_counter()
    print("Time elapsed in seconds (Hub): ", end - start)

if __name__ == "__main__":
    time_tiledb()
    time_hub()
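A single timed run can be noisy, especially on a shared Colab machine writing to a mounted drive. The sketch below shows one way to repeat a measurement and report the median. The helpers `benchmark` and `summarize` are hypothetical names introduced here and are not part of dnafrag, and `dummy_workload` is only a stand-in so the cell runs on its own.

```python
import statistics
import time


def benchmark(fn, repeats=5):
    """Call fn() `repeats` times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(label, durations):
    """Print the median and minimum of a list of durations in seconds."""
    print("{}: median {:.4f}s, min {:.4f}s over {} runs".format(
        label, statistics.median(durations), min(durations), len(durations)))


def dummy_workload():
    # Stand-in for a backend write; replace with the real call when benchmarking.
    sum(i * i for i in range(10000))


summarize("dummy", benchmark(dummy_workload, repeats=3))
```

To apply this to the two backends, one could pass a function that calls `dnafrag.core.write_fragbed` with the arguments used above. It is probably safest to give each repeat a fresh output directory, since we have not checked how either backend behaves when the target directory already exists.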

## Clearing out the Space

**IMPORTANT:**

Run this cell very carefully, and only after checking the path. This cell deletes the root folder of the package that was created when the initial cells were run. If the path is changed, other files may be deleted unintentionally.

In [3]:
!rm -r "/content/drive/MyDrive/dnafrag"
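As a more defensive alternative to `rm -r`, the deletion can be guarded by a check on the folder name. The sketch below uses only the Python standard library. The helper `remove_dir_checked` is a hypothetical name introduced here, and the path is the one used throughout this notebook.

```python
import os
import shutil


def remove_dir_checked(path, expected_name):
    """Delete `path` only if it is an existing directory whose final
    component equals `expected_name`. Return True if it was deleted."""
    normalized = os.path.normpath(path)
    if os.path.basename(normalized) != expected_name:
        return False  # Path does not end in the expected folder name.
    if not os.path.isdir(normalized):
        return False  # Nothing to delete, or the path is not a directory.
    shutil.rmtree(normalized)
    return True


# Only deletes the folder if the path really ends in "dnafrag".
deleted = remove_dir_checked("/content/drive/MyDrive/dnafrag", "dnafrag")
print("Deleted:", deleted)
```

The name check does not make deletion safe in general, but it does catch the common mistake of truncating the path to a parent folder such as `/content/drive/MyDrive`.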

## Inference

We can clearly see that Hub outperforms the existing TileDB backend on this test. For a package like this one, switching the backend to Hub would therefore reduce the time taken during computation.

However, this result comes from a single test file, so we should also test the performance with more data if we can obtain some.