# Integrating Hub into an Existing Package to Boost Performance #2

This experiment compares Hub's performance against an existing storage backend in a real-world package. We select a package that currently uses a TileDB backend and record its performance, port the backend to Hub, and then record the performance of the new backend on the same tests. Ideally, the Hub backend should outperform the TileDB backend.

### Name of package: genomelake

Link to original package: https://github.com/kundajelab/genomelake

Link to modified package: https://github.com/DebadityaPal/genomelake

### Why this package?

We chose "genomelake" because we wanted to compare Hub's performance with TileDB's before branching out to similar alternatives. Two requirements guided the choice. First, the package had to be small enough for a single programmer to quickly understand how it works and port its backend to Hub. Second, it had to use a TileDB backend, since TileDB is what we want to compare Hub against. A handful of open-source packages met both requirements, but some were outdated and others had errors that could not be resolved. "genomelake" was the best fit: it is relatively small, ran without errors, and is better known than the other candidates in terms of GitHub stars.

## The Experiment Follows:

In [None]:
from google.colab import drive
drive.mount("/content/drive")

## Fetching the Data

We fetch an NCBI protein sequence dataset for a particular species of yeast and use it as our test dataset.

In [None]:
!wget "https://drive.google.com/uc?export=download&id=1rF5kLOEBdAi7qo3EYZoI0Ixy7snBamsf" -P "/content/drive/MyDrive/Genomelake/Data"
!mv "/content/drive/MyDrive/Genomelake/Data/uc?export=download&id=1rF5kLOEBdAi7qo3EYZoI0Ixy7snBamsf" "/content/drive/MyDrive/Genomelake/Data/yeast_2.fa"
!mkdir "/content/drive/MyDrive/Genomelake/Repo"
%cd "/content/drive/MyDrive/Genomelake/Repo"
!git clone https://github.com/DebadityaPal/genomelake
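
Google Drive download links can return an HTML page instead of the file itself, so a quick sanity check on the downloaded FASTA may save a confusing failure later. The sketch below is not part of genomelake: `count_fasta_records` is a hypothetical helper that simply counts header lines (those starting with `>`), and the path is the one used in the download cell above.

```python
from pathlib import Path

def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    count = 0
    # errors="replace" keeps the check from crashing on unexpected bytes
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count

# Path taken from the download cell above; adjust if your layout differs.
fasta_path = Path("/content/drive/MyDrive/Genomelake/Data/yeast_2.fa")
if fasta_path.exists():
    print("FASTA records found:", count_fasta_records(fasta_path))
else:
    print("File not found:", fasta_path)
```

A count of zero suggests that the download did not produce a valid FASTA file.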

## Installation

The following cell installs all the dependencies and the package itself. If you are running on Google Colab, restart the runtime after running this cell.

In [None]:
%cd "/content/drive/MyDrive/Genomelake/Repo/genomelake"

!pip install hub
!pip install pyBigWig
!pip install bcolz
!pip install pybedtools
!python setup.py install

(Only for Google Colab environment)

If the runtime environment was restarted, you can run the notebook from the next cell onwards. The previous cells do not need to be re-executed, provided they were run before the restart.

In [None]:
%cd "/content/drive/MyDrive/Genomelake"

## Testing

We test one function from the repository, "extract_fasta_to_file". It takes a FASTA file and converts it into a dataset stored with the backend chosen by the user, so it exercises the write path of each backend.

In [2]:
from genomelake.backend import extract_fasta_to_file
import time

# Input FASTA file and the base path for the generated datasets
genome_fasta = "./Data/yeast_2.fa"
genome_data_directory = "./Data/yeast_2"

def time_hub():
  # Time writing the FASTA data through the Hub backend
  start = time.perf_counter()
  extract_fasta_to_file(genome_fasta, genome_data_directory+"_hub", mode="hub")
  end = time.perf_counter()
  print("Time taken by Hub: ", end-start)

def time_tiledb():
  # Time writing the same FASTA data through the TileDB backend
  start = time.perf_counter()
  extract_fasta_to_file(genome_fasta, genome_data_directory+"_tiledb", mode="tiledb")
  end = time.perf_counter()
  print("Time taken by tiledb: ", end-start)

In [None]:
time_tiledb()
time_hub()
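
A single wall-clock measurement can be noisy, particularly on a shared Colab machine writing to a mounted Drive folder. As a rough sketch, a small helper can repeat a measurement and report the median. `time_call` is a hypothetical helper, not part of genomelake or Hub, and the workload here is a stand-in so the example runs on its own.

```python
import statistics
import time

def time_call(func, repeats=3):
    """Call func() `repeats` times and return the list of elapsed seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings

# Stand-in workload so the sketch runs without genomelake installed.
timings = time_call(lambda: sum(range(100_000)), repeats=5)
print("median seconds:", statistics.median(timings))
```

For the comparison in this notebook, `func` could wrap a call to `extract_fasta_to_file`, ideally with a distinct output directory for each repeat so that no run writes into a dataset left behind by an earlier one.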

## Clearing out the Space

**IMPORTANT:**

Run this cell carefully, and only after checking the path. It deletes the root folder for this experiment (data and repository) that was created when the initial cells were run. Changing the path may delete other files unintentionally.

In [4]:
!rm -r "/content/drive/MyDrive/Genomelake"
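
As a more defensive alternative to a bare `rm -r`, the deletion can be guarded in Python so that it only proceeds when the target sits inside an expected parent folder. This is only a sketch: `remove_tree_under` is a hypothetical helper, and the paths are the ones used earlier in this notebook.

```python
import shutil
from pathlib import Path

def remove_tree_under(target, allowed_parent):
    """Delete `target` only if it is a directory strictly inside `allowed_parent`."""
    target = Path(target).resolve()
    allowed_parent = Path(allowed_parent).resolve()
    if allowed_parent not in target.parents:
        raise ValueError(f"Refusing to delete {target}: not inside {allowed_parent}")
    if target.is_dir():
        shutil.rmtree(target)
        return True
    # Nothing to delete (missing path or not a directory)
    return False

# Paths from this notebook; returns False if the folder does not exist.
print(remove_tree_under("/content/drive/MyDrive/Genomelake", "/content/drive/MyDrive"))
```

The guard raises an error instead of deleting when the path has been edited to point outside the expected parent folder.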