Skip to content

Commit

Permalink
Merge branch 'feature/examples' of https://github.com/activeloopai/Hub
Browse files Browse the repository at this point in the history
…into feature/examples
  • Loading branch information
kristinagrig06 committed Mar 30, 2021
2 parents 2246c28 + c99dd19 commit 3ead5eb
Show file tree
Hide file tree
Showing 31 changed files with 2,176 additions and 119 deletions.
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -182,6 +182,7 @@ The [examples](https://github.com/activeloopai/Hub/tree/master/examples) directo
| [Transforming Data](https://github.com/activeloopai/Hub/blob/master/examples/tutorial/Tutorial%203%20-%20Transforming%20Data.ipynb) | Briefs on how data transformation with Hub|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/activeloopai/Hub/blob/master/examples/tutorial/Tutorial%203%20-%20Transforming%20Data.ipynb) |
| [Dynamic Tensors](https://github.com/activeloopai/Hub/blob/master/examples/tutorial/Tutorial%204%20-%20What%20are%20Dynamic%20Tensors.ipynb) | Handling data with variable shape and sizes|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/activeloopai/Hub/blob/master/examples/tutorial/Tutorial%204%20-%20What%20are%20Dynamic%20Tensors.ipynb) |
| [NLP using Hub](https://github.com/activeloopai/Hub/blob/master/examples/notebooks/nlp_using_hub.ipynb) | Fine Tuning Bert for CoLA|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/activeloopai/Hub/blob/master/examples/notebooks/nlp_using_hub.ipynb) |
| [Getting Started with Text on Hub](https://github.com/activeloopai/Hub/blob/master/examples/notebooks/Getting_Started_with_Text_on_Hub.ipynb) | Overview on using Text datasets in Hub | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/activeloopai/Hub/blob/master/examples/notebooks/Getting_Started_with_Text_on_Hub.ipynb) |


## Use Cases
Expand Down
2 changes: 1 addition & 1 deletion README_translations/README_CN.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@

---

[ [English](./README.md) | 简体中文 | [Türkçe](./README_TR.md) | [한글](./README_KR.md)]
[ [English](../README.md) | [Français](./README_FR.md) | 简体中文 | [Türkçe](./README_TR.md) | [한글](./README_KR.md) | [Bahasa Indonesia](./README_ID.md)]

### Hub 的作用是什么?

Expand Down
2 changes: 1 addition & 1 deletion README_translations/README_FR.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@

---

[ English | [简体中文](./README_CN.md) | [Türkçe](./README_TR.md) | [한글](./README_KR.md) | [Français](./READE_FR.md)]
[ [English](../README.md) | Français | [简体中文](./README_CN.md) | [Türkçe](./README_TR.md) | [한글](./README_KR.md) | [Bahasa Indonesia](./README_ID.md)]

### À quoi sert Hub ?

Expand Down
2 changes: 1 addition & 1 deletion README_translations/README_ID.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@

---

[ English | [Français](./README_FR.md) | [简体中文](./README_CN.md) | [Türkçe](./README_TR.md) | [한글](./README_KR.md) | [Bahasa Indonesia](./README_ID.md)]
[ [English](../README.md) | [Français](./README_FR.md) | [简体中文](./README_CN.md) | [Türkçe](./README_TR.md) | [한글](./README_KR.md) | Bahasa Indonesia ]

<i>Perhatian: translasi ini mungkin bukan berasal dari dokumen yang paling baru</i>

Expand Down
2 changes: 1 addition & 1 deletion README_translations/README_KR.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@

---

[ [English](./README.md) | [简体中文](./README_CN.md) | [Türkçe](./README_TR.md) | 한글]
[ [English](../README.md) | [Français](./README_FR.md) | [简体中文](./README_CN.md) | [Türkçe](./README_TR.md) | 한글 | [Bahasa Indonesia](./README_ID.md)]

### Hub는 무엇인가요?

Expand Down
2 changes: 1 addition & 1 deletion README_translations/README_TR.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@

---

[ [English](./README.md) | [简体中文](./README_CN.md) | Türkçe | [한글](./README_KR.md)]
[ [English](../README.md) | [Français](./README_FR.md) | [简体中文](./README_CN.md) | Türkçe | [한글](./README_KR.md) | [Bahasa Indonesia](./README_ID.md)]

### Hub ne içindir?

Expand Down
222 changes: 222 additions & 0 deletions benchmarks/dnafrag_hub_backend.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,222 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "dnafrag.ipynb",
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "tintoRRxUWBh"
},
"source": [
"# Integrating Hub to an Existing Package to boost performance\n",
"\n",
"This is an experiment to compare Hub's performance in real life packages. We select a package that currently uses a TileDB backend and record its performance, then the backend is ported to Hub, and performance of the new backend is recorded on the same tests as before. Ideally, the Hub backend should outperform the previous TileDB backend.\n",
"\n",
"### Name of package: dnafrag\n",
"Link to original package: [https://github.com/kundajelab/dnafrag](https://github.com/kundajelab/dnafrag)\n",
"\n",
"Link to modified package: [https://github.com/DebadityaPal/dnafrag](https://github.com/DebadityaPal/dnafrag)\n",
"\n",
"### Why this package?\n",
"\n",
"The motivation behind selecting \"dnafrag\" as the package is that we wanted to compare Hub's performance to that of TileDB first, before branching out to other similar alternatives. For ease of portability the package selected had to be smaller in size so that a single programmer can quickly understand the working mechanism of the package and start to port the backend to Hub, secondly the package had to use a TileDB backend since that is the package we want to compare Hub to. The intersection of these requirements produced a couple of OSS packages, but few of them were outdated, few more had errors which could not be resolved and so on. \"dnafrag\" was the optimal choice as it was relatively small in size and ran without any inherent errors.\n",
"\n",
"## The Experiment follows:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "d7dNNHIATVzC"
},
"source": [
"from google.colab import drive\n",
"drive.mount(\"/content/drive\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "EAEJD7efZQaa"
},
"source": [
"## Installation\n",
"\n",
"The following cell will install all the dependencies and the package itself, users will have to restart the colab environment after running this cell if they are running it on Google Colab"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Q_vdGS_VUMhX"
},
"source": [
"%cd \"/content/drive/MyDrive/\"\n",
"!git clone https://github.com/DebadityaPal/dnafrag\n",
"!pip install numpy\n",
"!pip install tqdm\n",
"!pip install scipy\n",
"!pip install tiledb\n",
"!pip install hub\n",
"!pip install pybedtools\n",
"%cd \"/content/drive/MyDrive/dnafrag\"\n",
"!pip install ."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "WEP2kDUOcEno"
},
"source": [
"(Only for Google Colab environment)\n",
"\n",
"If the runtime enviroment was restarted, users can run the notebook from the next cell onwards, previous cells dont need to be executed provided they were executed before the restart."
]
},
{
"cell_type": "code",
"metadata": {
"id": "2H5V5zCuYXd3"
},
"source": [
"%cd \"/content/drive/MyDrive/dnafrag\""
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "3BKFSkTWZsE5"
},
"source": [
"## Testing\n",
"\n",
"The following test has been taken from the original repository of the package, even the files have been taken from there. We have modified the functions to add the timer and utilize the Hub Backend."
]
},
{
"cell_type": "code",
"metadata": {
"id": "HZrf0g_UYL6m"
},
"source": [
"import os\n",
"import gzip\n",
"import json\n",
"import tempfile\n",
"import time\n",
"\n",
"import numpy as np\n",
"\n",
"import dnafrag\n",
"\n",
"FRAGBED_FILE = \"dnafrag/tests/test_fragbed_100k.fragbed.gz\"\n",
"GENOME_FILE = \"dnafrag/tests/test_fragbed_hg19.chrom.sizes\"\n",
"\n",
"TEST_CHROM_LENS = [500, 1034, 2031, 60001]\n",
"MAX_INTERVAL_LEN = 400\n",
"max_output_fraglen=300\n",
"\n",
"NUM_TEST_CHROMS = len(TEST_CHROM_LENS)\n",
"TEST_CHROM_NAMES = [\"chr{}\".format(i) for i in range(NUM_TEST_CHROMS)]\n",
"\n",
"output_dir = \"./temp\"\n",
"\n",
"bed_entries = None\n",
"\n",
"def time_tiledb():\n",
" start = time.time()\n",
" dnafrag.core.write_fragbed(\n",
" fragment_bed=FRAGBED_FILE, output_dir=output_dir+\"_tiledb/\", genome_file=GENOME_FILE, max_fraglen=max_output_fraglen, backend=\"tiledb\"\n",
" )\n",
" end = time.time()\n",
" print(\"Time elapsed in seconds (TileDB): \", end-start)\n",
"\n",
"def time_hub():\n",
" start = time.time()\n",
" dnafrag.core.write_fragbed(\n",
" fragment_bed=FRAGBED_FILE, output_dir=output_dir+\"_hub/\", genome_file=GENOME_FILE, max_fraglen=max_output_fraglen, backend=\"hub\"\n",
" )\n",
" end = time.time()\n",
" print(\"Time elapsed in seconds (Hub): \", end-start)\n",
"\n",
"if __name__ == \"__main__\":\n",
" time_tiledb()\n",
" time_hub()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "1NZEBiqLaC8r"
},
"source": [
"## Clearing out the Space\n",
"\n",
"**IMPORTANT:**\n",
"\n",
"Run this cell very carefully and only after checking the path. This cell will delete the root folder of the package that was created when the initial cells were run. Changing the path can have unwanted deletion of other files."
]
},
{
"cell_type": "code",
"metadata": {
"id": "tXclgmfXaAl8"
},
"source": [
"!rm -r \"/content/drive/MyDrive/dnafrag\""
],
"execution_count": 3,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "IFv1ucRyah7o"
},
"source": [
"## Inference\n",
"\n",
"We can clearly see that Hub outperforms the existing TileDB backend, thus for a package like this, shifting their backend to Hub would increase their performance in terms of time taken during computation.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CgLdS50RbOnI"
},
"source": [
"However, we should also test the performance with some more data, if we can get some."
]
},
{
"cell_type": "code",
"metadata": {
"id": "vGwcMhZIaeCB"
},
"source": [
""
],
"execution_count": null,
"outputs": []
}
]
}

0 comments on commit 3ead5eb

Please sign in to comment.