<a href="https://colab.research.google.com/github/bytehub-ai/code-examples/blob/main/tutorials/01_bytehub_quick_start.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ByteHub Feature Store: Quick-start guide

Start by installing ByteHub using pip.

In [1]:
!pip install -q bytehub
!pip install -q "pyarrow>2"
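
To confirm which versions were actually installed, a small standard-library check can help. This is a minimal sketch: `installed_version` is a hypothetical helper name, not part of ByteHub, and it simply returns `None` when a package is missing.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for package in ('bytehub', 'pyarrow'):
    print(package, installed_version(package))
```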

In [2]:
import pandas as pd
import numpy as np
import os
import shutil
import bytehub as bh
print(f'ByteHub version {bh.__version__}')

ByteHub version 0.2.3


In [3]:
# Remove any previously created feature stores
try:
    os.remove('bytehub.db')
except FileNotFoundError:
    pass
try:
    shutil.rmtree('/tmp/featurestore/tutorial')
except FileNotFoundError:
    pass
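
The cleanup cell above can also be written more compactly. The sketch below is one possible variant using only the standard library; `reset_store` is a hypothetical helper name, and the demo runs on throwaway paths so nothing real is deleted. Note that `ignore_errors=True` suppresses other failures (such as permission errors) as well as a missing folder, which may or may not be what you want.

```python
import os
import shutil
import tempfile
from contextlib import suppress


def reset_store(db_path, data_dir):
    """Delete the database file and data folder, ignoring either if absent."""
    with suppress(FileNotFoundError):
        os.remove(db_path)
    # ignore_errors=True also skips a directory that does not exist
    shutil.rmtree(data_dir, ignore_errors=True)


# Demonstrate on scratch paths rather than the tutorial's real locations
scratch = tempfile.mkdtemp()
demo_db = os.path.join(scratch, 'bytehub.db')
demo_dir = os.path.join(scratch, 'tutorial')
open(demo_db, 'w').close()
os.makedirs(demo_dir)

reset_store(demo_db, demo_dir)
reset_store(demo_db, demo_dir)  # Second call finds nothing and does nothing
leftovers = os.listdir(scratch)
shutil.rmtree(scratch)
print(leftovers)  # []
```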

Create a new feature store; it will be stored in a local SQLite database named `bytehub.db`.

In [4]:
fs = bh.FeatureStore()

Next, create a namespace called `tutorial` to store some features in. Edit the `url` argument to point at the local folder you would like to use; feature values will be saved inside this folder in Parquet format.

In [5]:
fs.create_namespace(
    'tutorial', url='/tmp/featurestore/tutorial', description='Tutorial datasets'
)

Now create a new feature inside this namespace.

In [6]:
fs.create_feature('tutorial/numbers', description='Timeseries of numbers')

Now we can generate a pandas dataframe with `time` and `value` columns to store in the feature.

In [7]:
dts = pd.date_range('2020-01-01', '2021-02-09')
df = pd.DataFrame({'time': dts, 'value': list(range(len(dts)))})

df.head()

,time,value
0,2020-01-01,0
1,2020-01-02,1
2,2020-01-03,2
3,2020-01-04,3
4,2020-01-05,4
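
As a sanity check on the data being generated, the sketch below rebuilds the same daily range with the standard library only (no pandas). It assumes, as with the default behaviour of `pd.date_range`, that the range is daily and includes both endpoints.

```python
from datetime import date, timedelta

start, end = date(2020, 1, 1), date(2021, 2, 9)
n_days = (end - start).days + 1  # Both endpoints are included

# Mirror the (time, value) rows of the dataframe as plain tuples
rows = [(start + timedelta(days=i), i) for i in range(n_days)]

print(n_days)     # 406
print(rows[-1])   # (datetime.date(2021, 2, 9), 405)
print(rows[366])  # (datetime.date(2021, 1, 1), 366)
```

Because 2020 is a leap year, 1 January 2021 lands at value 366, and the final value in the series is 405.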


In [8]:
fs.save_dataframe(df, 'tutorial/numbers')

Now for some feature engineering. Suppose we want to create another feature called `tutorial/squared` that contains the square of every value in `tutorial/numbers`. To do this, define a transform as shown below. The transform receives a dataframe containing every feature listed in `from_features` and should return a series or dataframe of transformed timeseries values.

In [9]:
@fs.transform('tutorial/squared', from_features=['tutorial/numbers'])
def squared_numbers(df):
    return df ** 2 # Square the input
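
To build intuition for the decorator pattern used here, the sketch below shows one way a decorator could register a function and apply it when the derived feature is read. `ToyStore` is entirely hypothetical and is not ByteHub's implementation: it works on plain lists rather than dataframes, passes each source feature as a separate argument, and assumes transforms are evaluated on load.

```python
class ToyStore:
    """Hypothetical stand-in for a feature store; not ByteHub's implementation."""

    def __init__(self):
        self.raw = {}         # feature name -> list of stored values
        self.transforms = {}  # feature name -> (source names, function)

    def transform(self, name, from_features):
        def register(func):
            # Only record the function; nothing is computed at definition time
            self.transforms[name] = (from_features, func)
            return func
        return register

    def load(self, name):
        if name in self.transforms:
            sources, func = self.transforms[name]
            # Resolve the inputs first, then apply the transform on read
            return func(*(self.load(source) for source in sources))
        return self.raw[name]


store = ToyStore()
store.raw['tutorial/numbers'] = [0, 1, 2, 3]


@store.transform('tutorial/squared', from_features=['tutorial/numbers'])
def squared_numbers(values):
    return [v ** 2 for v in values]  # Square the input


print(store.load('tutorial/squared'))  # [0, 1, 4, 9]
```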

We can now look at some of our timeseries data by using the `load_dataframe` method.

In [10]:
df_query = fs.load_dataframe(
    ['tutorial/numbers', 'tutorial/squared'],
    from_date='2021-01-01', to_date='2021-01-31'
)
df_query.head()

time,tutorial/numbers,tutorial/squared
2021-01-01,366,133956
2021-01-02,367,134689
2021-01-03,368,135424
2021-01-04,369,136161
2021-01-05,370,136900


The `load_dataframe` method can also join, resample and filter features in a single call. Here, `freq='1M'` resamples both features to monthly values.

In [11]:
df_query = fs.load_dataframe(
    ['tutorial/numbers', 'tutorial/squared'],
    from_date='2020-01-01', to_date='2020-12-31', freq='1M'
)
df_query.head()

,tutorial/numbers,tutorial/squared
2020-01-31,30,900
2020-02-29,59,3481
2020-03-31,90,8100
2020-04-30,120,14400
2020-05-31,151,22801
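
The monthly values above are consistent with keeping the last daily observation in each month and labelling it with the month-end date. The sketch below reproduces those numbers with the standard library; it is an illustration of that reading of the output, not a description of how ByteHub resamples internally.

```python
import calendar
from datetime import date, timedelta

start = date(2020, 1, 1)
# Daily (time, value) pairs covering 2020, matching the tutorial data
daily = [(start + timedelta(days=i), i) for i in range(366)]

last_in_month = {}
for day, value in daily:
    last_day = calendar.monthrange(day.year, day.month)[1]
    month_end = date(day.year, day.month, last_day)
    # Later days overwrite earlier ones, so the last observation wins
    last_in_month[month_end] = value

for month_end in sorted(last_in_month)[:5]:
    value = last_in_month[month_end]
    print(month_end, value, value ** 2)
# 2020-01-31 30 900
# 2020-02-29 59 3481
# 2020-03-31 90 8100
# 2020-04-30 120 14400
# 2020-05-31 151 22801
```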


ByteHub also provides tools to list and search the contents of the feature store.

In [12]:
fs.list_features()

,namespace,name,version,description,meta,partition,serialized,transform
0,tutorial,numbers,1,Timeseries of numbers,{},date,False,False
1,tutorial,squared,1,,{},date,False,True


We can add key/value pairs of metadata to a feature.

In [13]:
fs.update_feature('tutorial/numbers', meta={'source': 'ByteHub tutorial'})

In [14]:
fs.list_features(regex=r'num.')

,namespace,name,version,description,meta,partition,serialized,transform
0,tutorial,numbers,2,Timeseries of numbers,{'source': 'ByteHub tutorial'},date,False,False
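
The pattern `num.` means "the letters `num` followed by any one character". The sketch below uses Python's `re` module to show how such a pattern behaves against feature names. Whether ByteHub anchors the pattern at the start of the name is not stated in this tutorial, so both behaviours are shown; the name `copy-of-numbers` is the clone created in the next step.

```python
import re

names = ['numbers', 'squared', 'copy-of-numbers']
pattern = re.compile(r'num.')

# search() accepts a match anywhere in the name
unanchored = [name for name in names if pattern.search(name)]
# match() only accepts a match starting at the first character
anchored = [name for name in names if pattern.match(name)]

print(unanchored)  # ['numbers', 'copy-of-numbers']
print(anchored)    # ['numbers']
```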


We can also copy features along with all of their data using the `clone_feature` method.

In [15]:
fs.clone_feature('tutorial/copy-of-numbers', from_name='tutorial/numbers')
fs.list_features()

,namespace,name,version,description,meta,partition,serialized,transform
0,tutorial,numbers,2,Timeseries of numbers,{'source': 'ByteHub tutorial'},date,False,False
1,tutorial,squared,1,,{},date,False,True
2,tutorial,copy-of-numbers,3,Timeseries of numbers,{'source': 'ByteHub tutorial'},date,False,False


For inference on new data, we might want to simply retrieve the last value of each feature.

In [16]:
fs.last(['tutorial/numbers', 'tutorial/squared', 'tutorial/copy-of-numbers'])

{'tutorial/copy-of-numbers': 405,
 'tutorial/numbers': 405,
 'tutorial/squared': 164025}
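
Conceptually, "last value" means the value at the most recent timestamp of each feature. The sketch below illustrates that idea on plain dictionaries that mirror the tutorial data; `last_value` is a hypothetical helper, not a ByteHub function.

```python
from datetime import date, timedelta

start = date(2020, 1, 1)
# {time: value} mappings mirroring the stored and transformed features
numbers = {start + timedelta(days=i): i for i in range(406)}
squared = {day: value ** 2 for day, value in numbers.items()}


def last_value(series):
    """Return the value at the most recent timestamp in a {time: value} mapping."""
    return series[max(series)]


latest = {
    'tutorial/numbers': last_value(numbers),
    'tutorial/squared': last_value(squared),
}
print(latest)  # {'tutorial/numbers': 405, 'tutorial/squared': 164025}
```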