# Feast Hello World notebook
This notebook is a step-by-step tutorial on how to use Feast and how it works. Read the comments and you'll be fine.  
The first part covers the offline store; the second part covers the online store.

In [1]:
import feast
from feast import Client
from feast.entity import Entity
from feast.value_type import ValueType
import pandas as pd
import datetime as dt
import time

In [2]:
feast.__version__

'0.9.3'

# Part 1: Offline store
The offline store is for storing and retrieving training data. Even after you update data there, you can still retrieve old versions of your features, so you can reproduce your experiments.

## Connect to Feast Core and define our data schema

In [3]:
c = feast.Client(core_url="core:6565", telemetry=False)
c.list_feature_tables()

[]

In [4]:
# An Entity is like a primary key in a database. We'll be managing a taxi company with some taxi drivers.
driver_id = feast.entity.Entity(
    name="driver_id", 
    description="Driver identifier", 
    value_type=ValueType.INT64
)

# These are feature definitions. We'll have two features describing each driver.
driver_name = feast.feature.Feature("name", dtype=feast.value_type.ValueType.STRING)
driver_surname = feast.feature.Feature("surname", dtype=feast.value_type.ValueType.STRING)

# This defines where Feast will store the data. Despite its name, it's not a source of the raw data.
# IMO a better name would be `batch_storage` or `batch_storage_backend`.
batch_source = feast.data_source.FileSource(
    file_format=feast.data_format.ParquetFormat(),  # how internal data will be stored
    file_url="file:///home/jovyan/work/feast/driver_batch.parquet",  # where internal data will be stored
    event_timestamp_column="event_timestamp",  # column defining when an event happened/when information started being true
    created_timestamp_column="created_timestamp",  # column defining when we created/ingested the record
)

# A feature table is like a database table. Different tables can share the same entity (aka primary key).
# We'll be saving and retrieving our features from this table.
driver_info = feast.feature_table.FeatureTable(
    name = "driver_info",
    entities = ["driver_id"],
    features = [
        driver_name,
        driver_surname,
    ],
    batch_source=batch_source,
)

We must apply the definitions of entities and feature tables. We then check that Feast created our table.

In [5]:
c.apply(driver_id)
c.apply(driver_info)
c.list_feature_tables()

[FeatureTable <driver_info>]

## Ingest the data

Say we obtained information about two drivers. Alice was hired on January 1st and Bob on January 2nd, hence their different `event_timestamp`s. But we're ingesting this data only on January 2nd, hence the same `created_timestamp` for both.

In [6]:
driver_info_df = pd.DataFrame({
    "driver_id": [1, 2],
    "name": ["Alice", "Bob"],
    "surname": ["Aoe", "Boe"],
    "event_timestamp": [dt.datetime(2020, 1, 1), dt.datetime(2020, 1, 2)],
    "created_timestamp": [dt.datetime(2020, 1, 2)] * 2,
})
driver_info_df

Unnamed: 0,driver_id,name,surname,event_timestamp,created_timestamp
0,1,Alice,Aoe,2020-01-01,2020-01-02
1,2,Bob,Boe,2020-01-02,2020-01-02


In [7]:
c.ingest(driver_info, driver_info_df)

Removing temporary file(s)...
Data has been successfully ingested into FeatureTable batch source.


## Retrieve the data

The entity dataframe used for retrieval (`entities_of_interest` in our case) can be thought of as a request: "give me the state of entity X as it was on this date". We can specify a different timestamp for each retrieved entity. Notice that the event timestamp in the result is set to the requested timestamp, not to the one from the source data. This makes sense because the result dataframe shows the state of the requested entities _at the requested timestamps_.

Now we're gonna retrieve the state of the world for both our drivers as it was on January 1st.

In [8]:
entities_of_interest = pd.DataFrame(
    {
        "driver_id": [1, 2],
        "event_timestamp": [dt.datetime(2020, 1, 1)] * 2,
    }
)
entities_of_interest

Unnamed: 0,driver_id,event_timestamp
0,1,2020-01-01
1,2,2020-01-01


In [9]:
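# Helper that blocks until a Spark job completes, printing a dot for every poll.
def wait_for_job(job):
    start = dt.datetime.now()
    status = job.get_status().name
    print(status, end="")
    while status != "COMPLETED":
        # Stop instead of polling forever if the job fails
        # (assumes the job status enum has a FAILED member).
        if status == "FAILED":
            raise RuntimeError("The job failed")
        print(".", end="")
        time.sleep(0.5)
        status = job.get_status().name
    print(status)
    print("The job took", dt.datetime.now() - start)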

In [10]:
job = c.get_historical_features(
    ["driver_info:name"], 
    entity_source=entities_of_interest, 
    output_location="file:///home/jovyan/work/output.parquet",
)

wait_for_job(job)

IN_PROGRESS...................................................COMPLETED
The job took 0:00:25.539381


In [11]:
output_df = pd.read_parquet("/home/jovyan/work/output.parquet")
output_df

Unnamed: 0,driver_id,event_timestamp,driver_info__name
0,1,2020-01-01,Alice
1,2,2020-01-01,


We see that the driver with ID 2 has no name. That's because his `event_timestamp` in the data source 
is after the timestamp we specified during retrieval (at the requested time 
this information wasn't yet known). This is called _point-in-time correction_.
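
The lookup rule described above can be sketched in plain Python. This is only an illustration of the idea, not how Feast implements it; the helper name `point_in_time_lookup` and the tuple layout are assumptions made for this example.

```python
import datetime as dt

# Toy copy of the ingested rows: (driver_id, name, event_timestamp).
rows = [
    (1, "Alice", dt.datetime(2020, 1, 1)),
    (2, "Bob", dt.datetime(2020, 1, 2)),
]


def point_in_time_lookup(rows, driver_id, as_of):
    """Return the name from the newest row for driver_id with event_timestamp <= as_of.

    Returns None when nothing was known about the driver at that time.
    (Hypothetical helper for illustration; not part of the Feast API.)
    """
    candidates = [r for r in rows if r[0] == driver_id and r[2] <= as_of]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r[2])[1]


jan_1 = dt.datetime(2020, 1, 1)
print(point_in_time_lookup(rows, 1, jan_1))  # Alice
print(point_in_time_lookup(rows, 2, jan_1))  # None
```

Rows with an `event_timestamp` later than the requested one are simply ignored, which is why driver 2 comes back empty for January 1st.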

Let's try retrieving the state at January 2nd:

In [12]:
job = c.get_historical_features(
    ["driver_info:name"], 
    entity_source=entities_of_interest.assign(event_timestamp=dt.datetime(2020, 1, 2)), 
    output_location="file:///home/jovyan/work/output.parquet",
)

wait_for_job(job)

IN_PROGRESS...........................................COMPLETED
The job took 0:00:21.532359


In [13]:
output_df = pd.read_parquet("/home/jovyan/work/output.parquet")
output_df

Unnamed: 0,driver_id,event_timestamp,driver_info__name
0,1,2020-01-02,Alice
1,2,2020-01-02,Bob


Okay, now the second driver's name is present too.

## Appending data and retrieving it

### Case 1: ingesting data that contains the old records

Now our raw data source has been updated and contains more records. Old records weren't updated and are still there.

In [14]:
# DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions.
more_driver_info_df = pd.concat(
    [
        driver_info_df,
        pd.DataFrame({
            "driver_id": [3, 4],
            "name": ["Celine", "Daniel"],
            "surname": ["Coe", "Doe"],
            "created_timestamp": [dt.datetime(2020, 1, 3)] * 2,
            "event_timestamp": [dt.datetime(2020, 1, 3)] * 2,
        }),
    ],
    ignore_index=True,
)
more_driver_info_df

Unnamed: 0,driver_id,name,surname,event_timestamp,created_timestamp
0,1,Alice,Aoe,2020-01-01,2020-01-02
1,2,Bob,Boe,2020-01-02,2020-01-02
2,3,Celine,Coe,2020-01-03,2020-01-03
3,4,Daniel,Doe,2020-01-03,2020-01-03


In [15]:
c.ingest(driver_info, more_driver_info_df)

Removing temporary file(s)...
Data has been successfully ingested into FeatureTable batch source.


Let's retrieve all the drivers as of January 2nd, i.e. before we hired the new ones. The names of the new drivers should be missing.

In [16]:
more_entities_of_interest = (
    pd.concat([entities_of_interest, pd.DataFrame({"driver_id": [3, 4]})], ignore_index=True)
    .assign(event_timestamp=dt.datetime(2020, 1, 2))
)
more_entities_of_interest

Unnamed: 0,driver_id,event_timestamp
0,1,2020-01-02
1,2,2020-01-02
2,3,2020-01-02
3,4,2020-01-02


In [17]:
job = c.get_historical_features(
    ["driver_info:name"], 
    entity_source=more_entities_of_interest, 
    output_location="file:///home/jovyan/work/output.parquet",
)

wait_for_job(job)
output_df = pd.read_parquet("/home/jovyan/work/output.parquet")
output_df

IN_PROGRESS..........................................COMPLETED
The job took 0:00:21.030560


Unnamed: 0,driver_id,event_timestamp,driver_info__name
0,4,2020-01-02,
1,1,2020-01-02,Alice
2,3,2020-01-02,
3,2,2020-01-02,Bob


Now let's retrieve them as of January 3rd:

In [18]:
job = c.get_historical_features(
    ["driver_info:name"], 
    entity_source=more_entities_of_interest.assign(event_timestamp=dt.datetime(2020, 1, 3)), 
    output_location="file:///home/jovyan/work/output.parquet",
)

wait_for_job(job)
output_df = pd.read_parquet("/home/jovyan/work/output.parquet")
output_df

IN_PROGRESS............................................COMPLETED
The job took 0:00:22.031121


Unnamed: 0,driver_id,event_timestamp,driver_info__name
0,4,2020-01-03,Daniel
1,1,2020-01-03,Alice
2,3,2020-01-03,Celine
3,2,2020-01-03,Bob


### Case 2: Appending only new data, without old records

In [19]:
new_driver_info_df = pd.DataFrame({
    "driver_id": [5, 6],
    "name": ["Elisabeth", "Frank"],
    "surname": ["Eoe", "Foe"],
    "created_timestamp": [dt.datetime(2020, 1, 4)] * 2,
    "event_timestamp": [dt.datetime(2020, 1, 4)] * 2,
})
new_driver_info_df

Unnamed: 0,driver_id,name,surname,created_timestamp,event_timestamp
0,5,Elisabeth,Eoe,2020-01-04,2020-01-04
1,6,Frank,Foe,2020-01-04,2020-01-04


In [20]:
c.ingest(driver_info, new_driver_info_df)

Removing temporary file(s)...
Data has been successfully ingested into FeatureTable batch source.


In [21]:
new_entities_of_interest = (
    pd.concat([more_entities_of_interest, pd.DataFrame({"driver_id": [5, 6]})], ignore_index=True)
    .assign(event_timestamp=dt.datetime(2020, 1, 4))
)
new_entities_of_interest

Unnamed: 0,driver_id,event_timestamp
0,1,2020-01-04
1,2,2020-01-04
2,3,2020-01-04
3,4,2020-01-04
4,5,2020-01-04
5,6,2020-01-04


In [22]:
job = c.get_historical_features(
    ["driver_info:name"], 
    entity_source=new_entities_of_interest, 
    output_location="file:///home/jovyan/work/output.parquet",
)

wait_for_job(job)
output_df = pd.read_parquet("/home/jovyan/work/output.parquet")
output_df

IN_PROGRESS............................................COMPLETED
The job took 0:00:22.034438


Unnamed: 0,driver_id,event_timestamp,driver_info__name
0,6,2020-01-04,Frank
1,4,2020-01-04,Daniel
2,1,2020-01-04,Alice
3,5,2020-01-04,Elisabeth
4,3,2020-01-04,Celine
5,2,2020-01-04,Bob


Okay, we still have all the records, including the old ones.

## Updating records

Let's say Celine got married and changed her surname from Coe to Coe-Joe.

In [23]:
updated_celine_info_df = pd.DataFrame({
    "driver_id": [3],
    "name": ["Celine"],
    "surname": ["Coe-Joe"],
    "created_timestamp": [dt.datetime(2020, 1, 5)],
    "event_timestamp": [dt.datetime(2020, 1, 5)],
})
updated_celine_info_df

Unnamed: 0,driver_id,name,surname,created_timestamp,event_timestamp
0,3,Celine,Coe-Joe,2020-01-05,2020-01-05


In [24]:
c.ingest(driver_info, updated_celine_info_df)

Removing temporary file(s)...
Data has been successfully ingested into FeatureTable batch source.


Let's see if we get the old data for the timestamp before her wedding:

In [25]:
job = c.get_historical_features(
    ["driver_info:name", "driver_info:surname"], 
    entity_source=pd.DataFrame({
        "driver_id": [3],
        "event_timestamp": [dt.datetime(2020, 1, 4)],  # before wedding
    }), 
    output_location="file:///home/jovyan/work/output.parquet",
)

wait_for_job(job)
output_df = pd.read_parquet("/home/jovyan/work/output.parquet")
output_df

IN_PROGRESS..........................................COMPLETED
The job took 0:00:21.030948


Unnamed: 0,driver_id,event_timestamp,driver_info__surname,driver_info__name
0,3,2020-01-04,Coe,Celine


Good, we got a historical value of this feature.

Let's get her data as it was a week after her wedding:

In [26]:
job = c.get_historical_features(
    ["driver_info:name", "driver_info:surname"], 
    entity_source=pd.DataFrame({
        "driver_id": [3],
        "event_timestamp": [dt.datetime(2020, 1, 12)],  # after wedding
    }), 
    output_location="file:///home/jovyan/work/output.parquet",
)

wait_for_job(job)
output_df = pd.read_parquet("/home/jovyan/work/output.parquet")
output_df

IN_PROGRESS..........................................COMPLETED
The job took 0:00:21.031503


Unnamed: 0,driver_id,event_timestamp,driver_info__surname,driver_info__name
0,3,2020-01-12,Coe-Joe,Celine


Good, now we got a value after the change.

## Overwriting wrong data

Now suppose we made a mistake: Celine's surname after her wedding is not Coe-Joe but Coe-Zoe. We noticed this only two weeks after we ingested the wrong data. We will ingest a new record with the same `event_timestamp` (because she still changed her surname on January 5th) but a new `created_timestamp` (because that's when we fix the data).

For comparison, we previously updated Celine with this data:

In [27]:
updated_celine_info_df  # wrong data we used previously

Unnamed: 0,driver_id,name,surname,created_timestamp,event_timestamp
0,3,Celine,Coe-Joe,2020-01-05,2020-01-05


But now we will update her with this data:

In [28]:
fixed_updated_celine_info_df = updated_celine_info_df.assign(
    surname="Coe-Zoe", 
    created_timestamp=dt.datetime(2020, 1, 19)  # we noticed the mistake late
)
fixed_updated_celine_info_df

Unnamed: 0,driver_id,name,surname,created_timestamp,event_timestamp
0,3,Celine,Coe-Zoe,2020-01-19,2020-01-05


In [29]:
c.ingest(driver_info, fixed_updated_celine_info_df)

Removing temporary file(s)...
Data has been successfully ingested into FeatureTable batch source.


Now we'll try getting Celine's data as it was one week after the wedding. That is, **after she changed her name, but before we fixed the data**.

In [30]:
job = c.get_historical_features(
    ["driver_info:name", "driver_info:surname"], 
    entity_source=pd.DataFrame({
        "driver_id": [3],
        "event_timestamp": [dt.datetime(2020, 1, 12)],  # one week after wedding, before we fixed the data
    }), 
    output_location="file:///home/jovyan/work/output.parquet",
)

wait_for_job(job)
output_df = pd.read_parquet("/home/jovyan/work/output.parquet")
output_df

IN_PROGRESS.................................................COMPLETED
The job took 0:00:24.540411


Unnamed: 0,driver_id,event_timestamp,driver_info__surname,driver_info__name
0,3,2020-01-12,Coe-Zoe,Celine


The output contains **fixed** data. This is the true state of the world as it was on the requested date. 

When ingesting the fix, we used the same `event_timestamp` as in the original update. When retrieving the data, Feast therefore uses the record with the newer `created_timestamp` (the fixed record in our case).

If we wanted this to return the wrong data from before the fix, we should have ingested the fix with a different (newer) `event_timestamp` to pretend that the change happened later.
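
The tie-breaking behaviour observed above can be sketched in plain Python. This is a toy model of what the notebook output shows, not Feast's actual implementation; the helper name `surname_as_of` and the tuple layout are assumptions made for this example.

```python
import datetime as dt

# Toy copy of Celine's surname history: (surname, event_timestamp, created_timestamp).
celine_rows = [
    ("Coe", dt.datetime(2020, 1, 3), dt.datetime(2020, 1, 3)),
    ("Coe-Joe", dt.datetime(2020, 1, 5), dt.datetime(2020, 1, 5)),   # wrong update
    ("Coe-Zoe", dt.datetime(2020, 1, 5), dt.datetime(2020, 1, 19)),  # later fix, same event time
]


def surname_as_of(rows, as_of):
    """Pick the surname valid at as_of.

    Only rows with event_timestamp <= as_of are considered. Among those, the
    newest event_timestamp wins, and ties are broken by the newest
    created_timestamp. (Hypothetical helper for illustration only.)
    """
    candidates = [r for r in rows if r[1] <= as_of]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r[1], r[2]))[0]


print(surname_as_of(celine_rows, dt.datetime(2020, 1, 4)))   # Coe
print(surname_as_of(celine_rows, dt.datetime(2020, 1, 12)))  # Coe-Zoe
```

Note that `created_timestamp` is used only to break ties here. It is not compared against the requested date, which is why the fix created on January 19th is returned for a request dated January 12th.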

# Part 2: Online store

The online store serves a subset of the data with low latency.

In Part 1, our jobs for retrieving the data took quite some time. Now we'll ingest the data into an online store and retrieve it from there. This is useful if you need the features at inference time, meaning you need them quickly.
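
Conceptually, an online store can be pictured as a key-value map holding only the latest value per entity, so a read is a single lookup rather than a batch job. The sketch below illustrates that idea in plain Python under simplifying assumptions: the helper name `materialize`, the toy rows, and the inclusive time window are all made up for this example and say nothing about Feast's internals.

```python
import datetime as dt

# Toy offline rows: (driver_id, surname, event_timestamp).
offline_rows = [
    (1, "Aoe", dt.datetime(2020, 1, 1)),
    (2, "Boe", dt.datetime(2020, 1, 2)),
    (3, "Coe", dt.datetime(2020, 1, 3)),
    (3, "Coe-Joe", dt.datetime(2020, 1, 5)),  # newer value for the same driver
]


def materialize(rows, start, end):
    """Build a toy key-value "online store" from offline rows.

    Keeps, for every driver, the surname from the row with the newest
    event_timestamp inside [start, end]. Treating both bounds as inclusive
    is an assumption of this sketch.
    """
    latest = {}
    for driver_id, surname, event_ts in rows:
        if not (start <= event_ts <= end):
            continue
        if driver_id not in latest or event_ts > latest[driver_id][1]:
            latest[driver_id] = (surname, event_ts)
    return {driver_id: surname for driver_id, (surname, _) in latest.items()}


online_store = materialize(offline_rows, dt.datetime(2020, 1, 1), dt.datetime(2020, 2, 1))
print(online_store.get(3))  # Coe-Joe
print(online_store.get(7))  # None
```

The trade-off is that history is gone: the map can answer "what is the value now?" quickly, but not "what was the value on January 4th?".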

In [31]:
job = c.start_offline_to_online_ingestion(
    driver_info,
    dt.datetime(2020, 1, 1),
    dt.datetime(2020, 2, 1),
)
wait_for_job(job)

STARTING.................................................COMPLETED
The job took 0:00:26.510773


The API for retrieving data from the online store is a bit different from the API for the offline store. Here's how we define our entities of interest and how we request the data:

In [32]:
entities_of_interest_online = [
    {"driver_id": i}
    for i in range(1, 7)
]
entities_of_interest_online

[{'driver_id': 1},
 {'driver_id': 2},
 {'driver_id': 3},
 {'driver_id': 4},
 {'driver_id': 5},
 {'driver_id': 6}]

In [34]:
%%time

response = c.get_online_features(
    ["driver_info:name", "driver_info:surname"],
    entities_of_interest_online,
)

CPU times: user 9.36 ms, sys: 7.28 ms, total: 16.6 ms
Wall time: 157 ms


This time we got the response very quickly.

In [35]:
pd.DataFrame(response.to_dict())

Unnamed: 0,driver_info:surname,driver_id,driver_info:name
0,Aoe,1,Alice
1,Boe,2,Bob
2,Coe-Joe,3,Celine
3,Doe,4,Daniel
4,Eoe,5,Elisabeth
5,Foe,6,Frank
