<a href="https://colab.research.google.com/github/bytehub-ai/code-examples/blob/main/tutorials/04_using_cloud_storage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using cloud storage buckets

ByteHub is built on [Dask](https://docs.dask.org/en/latest/), which includes utilities for saving data to, and loading data from, cloud storage services such as [AWS S3](https://aws.amazon.com/s3/), [Azure Blob](https://azure.microsoft.com/en-gb/services/storage/blobs/), and [GCP Cloud Storage](https://cloud.google.com/storage).

This tutorial demonstrates how to save feature data onto AWS S3. Start by installing ByteHub with the `aws` option.

In [1]:
!pip install -q bytehub[aws]


In [2]:
import pandas as pd
import numpy as np
import os
import shutil
import bytehub as bh
print(f'ByteHub version {bh.__version__}')

ByteHub version 0.3.1


In [3]:
# Remove any previously created feature stores
try:
    os.remove('bytehub.db')
except FileNotFoundError:
    pass
try:
    shutil.rmtree('/tmp/featurestore/tutorial')
except FileNotFoundError:
    pass

Create a new feature store. Its metadata will be stored in a local SQLite database named `bytehub.db`.

In [4]:
fs = bh.FeatureStore()

Next we need to create a namespace within the feature store that will allow us to save data to S3. [Follow this guide](https://medium.com/@shamnad.p.s/how-to-create-an-s3-bucket-and-aws-access-key-id-and-secret-access-key-for-accessing-it-5653b6e54337) to create an S3 storage bucket and access keys, then configure them in the cell below.

In [13]:
#@title Configure S3 bucket
bucket_name = "bytehub-demo" #@param {type:"string"}

In [7]:
from getpass import getpass
print('Input AWS access key ID:')
aws_access_key_id = getpass()
print('Input AWS secret access key:')
aws_secret_access_key = getpass()

Input AWS access key ID:
··········
Input AWS secret access key:
··········


In [18]:
# Create the namespace on the AWS storage bucket
fs.create_namespace(
    's3-demo',
    url=f's3://{bucket_name}/demo',
    description='S3 tutorial',
    storage_options={
        'key': aws_access_key_id, 'secret': aws_secret_access_key, 'use_ssl': True
    }
)
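The keys above are typed in interactively, which suits a notebook but not a scheduled job. As a minimal sketch, the same `storage_options` dictionary could instead be built from the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables. The helper name `s3_storage_options` is hypothetical and is not part of ByteHub; the dictionary keys are the ones used in the cell above.

```python
import os


def s3_storage_options(env=os.environ):
    """Build a storage_options dict from AWS environment variables.

    Hypothetical helper for illustration; not part of ByteHub or Dask.
    """
    try:
        key = env["AWS_ACCESS_KEY_ID"]
        secret = env["AWS_SECRET_ACCESS_KEY"]
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError
        raise RuntimeError(
            f"Environment variable {missing.args[0]} is not set"
        ) from None
    # Same keys as the storage_options passed to fs.create_namespace above
    return {"key": key, "secret": secret, "use_ssl": True}


# Demonstration with placeholder values (not real credentials)
demo_env = {
    "AWS_ACCESS_KEY_ID": "AKIA-EXAMPLE",
    "AWS_SECRET_ACCESS_KEY": "not-a-real-secret",
}
options = s3_storage_options(demo_env)
print(sorted(options))
```

The resulting dictionary could then be passed as `storage_options=s3_storage_options()` in place of the literal dictionary, assuming the variables are set in the environment where the notebook runs.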

For details on the available storage options, including those for other cloud platforms, see the [Dask documentation on remote data services](https://docs.dask.org/en/latest/remote-data-services.html#amazon-s3).

Now create a feature in this namespace.

In [19]:
fs.create_feature('s3-demo/numbers', description='Timeseries of random numbers')

Now we can generate a pandas DataFrame with `time` and `value` columns to store.

In [20]:
dts = pd.date_range('2020-01-01', '2021-02-09')
df = pd.DataFrame({'time': dts, 'value': np.random.randn(len(dts))})

df.head()

|   | time | value |
|---|------------|-----------|
| 0 | 2020-01-01 | 0.491179 |
| 1 | 2020-01-02 | 1.282287 |
| 2 | 2020-01-03 | -0.719213 |
| 3 | 2020-01-04 | 1.268211 |
| 4 | 2020-01-05 | 1.30999 |
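Because `np.random.randn` is unseeded, the values shown here will differ on every run. If reproducible demo data is wanted, one option is to draw from a seeded NumPy generator instead. The sketch below assumes nothing about ByteHub; the function name `make_demo_frame` and the seed value are illustrative choices.

```python
import numpy as np
import pandas as pd


def make_demo_frame(seed=0, start="2020-01-01", end="2021-02-09"):
    """Return a DataFrame with daily `time` and random `value` columns.

    Hypothetical helper: the same seed always yields the same values.
    """
    dts = pd.date_range(start, end)
    rng = np.random.default_rng(seed)
    return pd.DataFrame({"time": dts, "value": rng.standard_normal(len(dts))})


df_a = make_demo_frame(seed=42)
df_b = make_demo_frame(seed=42)
print(df_a.equals(df_b))  # True: identical seeds give identical frames
print(len(df_a))          # 406 daily rows between the two dates, inclusive
```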


In [21]:
fs.save_dataframe(df, 's3-demo/numbers')

This data is now stored in the cloud storage bucket. The underlying files can be viewed in the [AWS console](https://s3.console.aws.amazon.com/s3/home).

In [22]:
# Query the data
result = fs.load_dataframe('s3-demo/numbers')
result.head()

| time | s3-demo/numbers |
|------------|-----------|
| 2020-01-01 | 0.491179 |
| 2020-01-02 | 1.282287 |
| 2020-01-03 | -0.719213 |
| 2020-01-04 | 1.268211 |
| 2020-01-05 | 1.30999 |
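The loaded frame is indexed by `time` and has one column named after the feature, whereas the frame that was saved has plain `time` and `value` columns. A rough way to confirm that the round trip preserved the data is sketched below. The function `round_trip_matches` is a hypothetical helper, and it assumes timestamps come back unchanged, as in the output above. A local stand-in replaces the result of `fs.load_dataframe` so that the example runs without an S3 bucket.

```python
import numpy as np
import pandas as pd


def round_trip_matches(saved, loaded, feature_name, atol=1e-9):
    """Check that `loaded` holds the same time series as `saved`.

    `saved` has `time` and `value` columns; `loaded` is indexed by time
    with a column named `feature_name`. Hypothetical helper.
    """
    expected = saved.set_index("time")["value"].sort_index()
    actual = loaded[feature_name].sort_index()
    if not expected.index.equals(actual.index):
        return False
    return bool(np.allclose(expected.to_numpy(), actual.to_numpy(), atol=atol))


# Local stand-in for the frame returned by fs.load_dataframe
dts = pd.date_range("2020-01-01", periods=5)
saved = pd.DataFrame({"time": dts, "value": [0.5, 1.25, -0.75, 1.0, 1.5]})
loaded = saved.set_index("time").rename(columns={"value": "s3-demo/numbers"})

print(round_trip_matches(saved, loaded, "s3-demo/numbers"))
```

In the tutorial itself the equivalent call would be `round_trip_matches(df, result, 's3-demo/numbers')`.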


In [23]:
# Delete the data saved in the tutorial
fs.delete_feature('s3-demo/numbers', delete_data=True) # delete_data will remove the data from S3