# Feast 101 - Driver Trips Example

![Feast Data Flow](./images/data-flow.png)

## Setup

0. Install Docker, Kubernetes (minikube or Docker Desktop), and Helm.
1. `git clone https://github.com/feast-dev/feast && cd feast`
2. `kubectl create secret generic feast-postgresql --from-literal=postgresql-password=password`
3. `kubectl create secret generic feast-gcp-service-account --from-file=credentials.json=/path/to/key.json`
4. `helm install demo infra/charts/feast --values infra/charts/feast/values-demo.yaml`
5. `kubectl get pods`
```
NAME                                               READY   STATUS    RESTARTS   AGE
demo-feast-core-7f75dc4d48-dzxhb                   1/1     Running   1          24m
demo-feast-jupyter-66bd6bc54f-fjxvh                1/1     Running   0          24m
demo-feast-online-serving-68d89cc996-xvxrj         1/1     Running   4          24m
demo-postgresql-0                                  1/1     Running   0          24m
demo-prometheus-statsd-exporter-799f847b6b-6472n   1/1     Running   0          24m
demo-redis-master-0                                1/1     Running   0          24m
demo-redis-slave-0                                 1/1     Running   0          24m
demo-redis-slave-1                                 1/1     Running   0          22m
```
6. `kubectl port-forward demo-feast-jupyter-66bd6bc54f-fjxvh 8888:8888` (substitute the Jupyter pod name from your own `kubectl get pods` output)

## Features Registry (Feast Core)

### Configuration

In [None]:
import os

In [54]:
# os.environ['FEAST_SPARK_LAUNCHER'] = 'standalone'
# os.environ['FEAST_SPARK_STANDALONE_MASTER'] = 'local[*]'
# os.environ['FEAST_SPARK_HOME'] = os.path.dirname(pyspark.__file__)
# os.environ['FEAST_SPARK_EXTRA_OPTIONS'] = '--jars https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar'\
# ' --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem'

In [None]:
os.environ['FEAST_SPARK_STAGING_LOCATION'] = "gs://feast-templocation-kf-feast/demo/staging/"
os.environ['FEAST_HISTORICAL_FEATURE_OUTPUT_LOCATION'] = "gs://feast-templocation-kf-feast/demo/output"

In [4]:
# os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/path/to/key"

### Basic Imports and Feast Client initialization

In [71]:
import os

from feast import Client, Feature, Entity, ValueType, FeatureTable
from feast.data_source import FileSource, KafkaSource
from feast.data_format import ParquetFormat, AvroFormat

In [65]:
CORE_HOST = os.getenv("DEMO_FEAST_CORE_SERVICE_HOST", "localhost")
SERVING_HOST = os.getenv("DEMO_FEAST_ONLINE_SERVING_SERVICE_HOST", "localhost")
REDIS_HOST = os.getenv('DEMO_REDIS_MASTER_SERVICE_HOST', 'localhost')

client = Client(
    core_url=f"{CORE_HOST}:6565",
    serving_url=f"{SERVING_HOST}:6566",    
    redis_host=REDIS_HOST
)
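
The `DEMO_*_SERVICE_HOST` variables read above follow the naming rule Kubernetes uses when it injects Service addresses into pods: the Service name is upper-cased, dashes become underscores, and `_SERVICE_HOST` is appended. A minimal sketch of that rule, assuming the Helm release is named `demo` as in the setup steps; the helpers `service_host_env` and `resolve_host` are hypothetical and not part of Feast:

```python
import os

def service_host_env(service_name: str) -> str:
    """Return the env var name Kubernetes uses for a Service's cluster IP."""
    return service_name.upper().replace("-", "_") + "_SERVICE_HOST"

def resolve_host(service_name: str, default: str = "localhost") -> str:
    """Look up the injected host, falling back to a default outside the cluster."""
    return os.getenv(service_host_env(service_name), default)

print(service_host_env("demo-feast-core"))
```

Outside the cluster these variables are unset, which is why the cell above falls back to `localhost`.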

### Declare Features

In [66]:
driver_id = Entity(name="driver_id", description="Driver identifier", value_type=ValueType.INT64)

In [68]:
# Daily updated features 
acc_rate = Feature("acc_rate", ValueType.FLOAT)
conv_rate = Feature("conv_rate", ValueType.FLOAT)
avg_daily_trips = Feature("avg_daily_trips", ValueType.INT32)

# Real-time updated features
trips_today = Feature("trips_today", ValueType.INT32)

```python
FeatureTable(
    name = "driver_statistics",
    entities = ["driver_id"],
    features = [
        acc_rate,
        conv_rate,
        avg_daily_trips
    ],
    # ...
)
```


```python
FeatureTable(
    name = "driver_trips",
    entities = ["driver_id"],
    features = [
        trips_today
    ],
    # ...
)
```

![Features Join](./images/features-join.png)

```python
FeatureTable(
    ...,
    batch_source=FileSource(  # Required
        file_format=ParquetFormat(),
        file_url="gs://feast-demo-data-lake",
        # ...
    ),
    stream_source=KafkaSource(  # Optional
        bootstrap_servers="...",
        topic="driver_trips",
        # ...
    ),
)
```

In [69]:
driver_statistics = FeatureTable(
    name = "driver_statistics",
    entities = ["driver_id"],
    features = [
        acc_rate,
        conv_rate,
        avg_daily_trips
    ],
    batch_source=FileSource(
        event_timestamp_column="datetime",
        created_timestamp_column="created",
        file_format=ParquetFormat(),
        file_url="gs://feast-demo-data-lake/driver_statistics",
        date_partition_column="date"
    )
)

In [70]:
driver_trips = FeatureTable(
    name = "driver_trips",
    entities = ["driver_id"],
    features = [
        trips_today
    ],
    batch_source=FileSource(
        event_timestamp_column="datetime",
        created_timestamp_column="created",
        file_format=ParquetFormat(),
        file_url="gs://feast-demo-data-lake/driver_trips",
        date_partition_column="date"
    )
)

### Registering entities and feature tables in Feast Core

In [11]:
client.apply_entity(driver_id)
client.apply_feature_table(driver_statistics)
client.apply_feature_table(driver_trips)

In [12]:
print(client.get_feature_table("driver_statistics").to_yaml())
print(client.get_feature_table("driver_trips").to_yaml())

spec:
  name: driver_statistics
  entities:
  - driver_id
  features:
  - name: acc_rate
    valueType: FLOAT
  - name: conv_rate
    valueType: FLOAT
  - name: avg_daily_trips
    valueType: INT32
  batchSource:
    type: BATCH_FILE
    eventTimestampColumn: datetime
    datePartitionColumn: date
    createdTimestampColumn: created
    fileOptions:
      fileFormat:
        parquetFormat: {}
      fileUrl: gs://feast-demo-data-lake/driver_statistics
meta:
  createdTimestamp: '2020-10-20T06:52:16Z'

spec:
  name: driver_trips
  entities:
  - driver_id
  features:
  - name: trips_today
    valueType: INT32
  batchSource:
    type: BATCH_FILE
    eventTimestampColumn: datetime
    datePartitionColumn: date
    createdTimestampColumn: created
    fileOptions:
      fileFormat:
        parquetFormat: {}
      fileUrl: gs://feast-demo-data-lake/driver_trips
meta:
  createdTimestamp: '2020-10-20T14:01:44Z'



### Populating batch source

In [13]:
import pandas as pd
import numpy as np
from datetime import datetime

In [14]:
def generate_entities():
    return np.random.choice(999999, size=100, replace=False)

In [15]:
def generate_trips(entities):
    n = len(entities)  # one row per driver
    df = pd.DataFrame(columns=["driver_id", "trips_today", "datetime", "created"])
    df['driver_id'] = entities
    df['trips_today'] = np.random.randint(0, 1000, size=n).astype(np.int32)
    # Random event timestamps between 2020-10-10 and 2020-10-20
    df['datetime'] = pd.to_datetime(
        np.random.randint(
            int(datetime(2020, 10, 10).timestamp()),
            int(datetime(2020, 10, 20).timestamp()),
            size=n),
        unit="s"
    )
    df['created'] = pd.to_datetime(datetime.now())
    return df

In [16]:
def generate_stats(entities):
    n = len(entities)  # one row per driver
    df = pd.DataFrame(columns=["driver_id", "conv_rate", "acc_rate", "avg_daily_trips", "datetime", "created"])
    df['driver_id'] = entities
    df['conv_rate'] = np.random.random(size=n).astype(np.float32)
    df['acc_rate'] = np.random.random(size=n).astype(np.float32)
    df['avg_daily_trips'] = np.random.randint(0, 1000, size=n).astype(np.int32)
    # Random event timestamps between 2020-10-10 and 2020-10-20
    df['datetime'] = pd.to_datetime(
        np.random.randint(
            int(datetime(2020, 10, 10).timestamp()),
            int(datetime(2020, 10, 20).timestamp()),
            size=n),
        unit="s"
    )
    df['created'] = pd.to_datetime(datetime.now())
    return df

In [17]:
entities = generate_entities()
stats_df = generate_stats(entities)
trips_df = generate_trips(entities)

In [18]:
stats_df.dtypes

driver_id                   int64
conv_rate                 float32
acc_rate                  float32
avg_daily_trips             int32
datetime           datetime64[ns]
created            datetime64[ns]
dtype: object

In [None]:
#!gsutil -m rm -r 'gs://feast-demo-data-lake/driver_statistics/'
#!gsutil -m rm -r 'gs://feast-demo-data-lake/driver_trips/'

In [27]:
client.ingest(driver_statistics, stats_df)
client.ingest(driver_trips, trips_df)

Removing temporary file(s)...
Data has been successfully ingested into FeatureTable batch source.
Removing temporary file(s)...
Data has been successfully ingested into FeatureTable batch source.


In [28]:
!gsutil ls 'gs://feast-demo-data-lake/driver_statistics/**'
!gsutil ls 'gs://feast-demo-data-lake/driver_trips/**'

gs://feast-demo-data-lake/driver_statistics/date=2020-10-09/0db7e166e0c54d8ca24325df642ae07f.parquet
gs://feast-demo-data-lake/driver_statistics/date=2020-10-10/567b6633d2e645a3af9e5db39d44f120.parquet
gs://feast-demo-data-lake/driver_statistics/date=2020-10-11/fb78ef63c2b14ca093ddd7c700e9abef.parquet
gs://feast-demo-data-lake/driver_statistics/date=2020-10-12/9a36a19e056c4b7b998b6221492e4c6c.parquet
gs://feast-demo-data-lake/driver_statistics/date=2020-10-13/9c50b1c80cda40759da776bbafeef793.parquet
gs://feast-demo-data-lake/driver_statistics/date=2020-10-14/e997f7bf9fc34bbca3198d91a4cbf2fe.parquet
gs://feast-demo-data-lake/driver_statistics/date=2020-10-15/fca72f76472948e6a884632f99233845.parquet
gs://feast-demo-data-lake/driver_statistics/date=2020-10-16/ce5634d71dd346a3963780a0b7bbac0a.parquet
gs://feast-demo-data-lake/driver_statistics/date=2020-10-17/c3dffa56ad164058abb1c3775b22f8fd.parquet
gs://feast-demo-data-lake/driver_statistics/date=2020-10-18/d1f8f71e05974fec9d649e5bae49f7e
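
The listing shows one `date=YYYY-MM-DD` directory per day, matching the `date_partition_column="date"` declared on the batch source. A small sketch of reading the partition value back out of such an object path, for example to filter files by day; the helper `partition_date` is hypothetical and only assumes the layout visible in the listing above:

```python
import re
from datetime import date

PARTITION_RE = re.compile(r"/date=(\d{4})-(\d{2})-(\d{2})/")

def partition_date(path: str) -> date:
    """Extract the `date=YYYY-MM-DD` partition value from an object path."""
    match = PARTITION_RE.search(path)
    if match is None:
        raise ValueError(f"no date partition in {path!r}")
    year, month, day = map(int, match.groups())
    return date(year, month, day)

path = (
    "gs://feast-demo-data-lake/driver_statistics/"
    "date=2020-10-09/0db7e166e0c54d8ca24325df642ae07f.parquet"
)
print(partition_date(path))  # 2020-10-09
```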

## Historical Retrieval For Training

### Point-in-time correction

![Point In Time](./images/pit-2.png)
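
Point-in-time correction means that, for each entity row, only feature values observed at or before that row's own timestamp may be joined, and the most recent such value is used. This keeps future information from leaking into training data. The sketch below illustrates the idea in plain Python; the helper `point_in_time_lookup` and the sample values are hypothetical, and it is not how Feast's retrieval job is implemented.

```python
from bisect import bisect_right
from datetime import datetime

def point_in_time_lookup(history, as_of):
    """Return the latest value observed at or before `as_of`, else None.

    `history` is a list of (event_timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)  # number of observations <= as_of
    return history[idx - 1][1] if idx else None

# Hypothetical trips_today observations for a single driver
history = [
    (datetime(2020, 10, 17, 12, 0), 5),
    (datetime(2020, 10, 18, 12, 0), 9),
    (datetime(2020, 10, 19, 12, 0), 14),
]

print(point_in_time_lookup(history, datetime(2020, 10, 18, 23, 0)))  # 9
print(point_in_time_lookup(history, datetime(2020, 10, 17, 0, 0)))   # None
```

The second call returns `None` because no observation exists yet at that timestamp, which is the case a naive "latest value" join would get wrong.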

In [58]:
import gcsfs
from pyarrow.parquet import ParquetDataset

In [59]:
def read_remote_parquet(path):
    # Expand the glob on GCS, then read all matching Parquet files into one DataFrame
    fs = gcsfs.GCSFileSystem()
    files = ["gs://" + match for match in fs.glob(path)]
    ds = ParquetDataset(files, filesystem=fs)
    return ds.read().to_pandas()

In [77]:
entities_with_timestamp = pd.DataFrame(columns=['driver_id', 'event_timestamp'])
entities_with_timestamp['driver_id'] = np.random.choice(entities, 10, replace=False)
entities_with_timestamp['event_timestamp'] = pd.to_datetime(np.random.randint(
    datetime(2020, 10, 18).timestamp(),
    datetime(2020, 10, 20).timestamp(),
    size=10), unit='s')
entities_with_timestamp

Unnamed: 0,driver_id,event_timestamp
0,330184,2020-10-19 09:07:43
1,333896,2020-10-18 21:19:01
2,522128,2020-10-18 00:10:16
3,789025,2020-10-18 19:41:26
4,836898,2020-10-18 19:25:35
5,61720,2020-10-19 15:08:43
6,43893,2020-10-17 17:31:12
7,390750,2020-10-19 05:27:16
8,99001,2020-10-19 03:28:45
9,794802,2020-10-18 15:09:06


In [81]:
job = client.get_historical_features(
    feature_refs=[
        "driver_statistics:avg_daily_trips",
        "driver_statistics:conv_rate",
        "driver_statistics:acc_rate",
        "driver_trips:trips_today"
    ], 
    entity_source=entities_with_timestamp
)

In [82]:
job.get_output_file_uri()

'gs://feast-templocation-kf-feast/demo/output'

In [83]:
read_remote_parquet(job.get_output_file_uri() + '/part-*')

Unnamed: 0,driver_id,event_timestamp,driver_statistics__acc_rate,driver_statistics__conv_rate,driver_statistics__avg_daily_trips,driver_trips__trips_today
0,522128,2020-10-18 00:10:16,0.013687,0.162499,892,883
1,330184,2020-10-19 09:07:43,0.788955,0.836066,912,642
2,390750,2020-10-19 05:27:16,0.06117,0.715991,865,119
3,836898,2020-10-18 19:25:35,0.882056,0.061671,155,573
4,61720,2020-10-19 15:08:43,0.958883,0.400128,113,415
5,99001,2020-10-19 03:28:45,0.790018,0.85518,644,972
6,333896,2020-10-18 21:19:01,0.315527,0.015839,275,377
7,43893,2020-10-17 17:31:12,0.316299,0.044608,209,805
8,794802,2020-10-18 15:09:06,0.661202,0.471721,770,403
9,789025,2020-10-18 19:41:26,0.3919,0.729488,891,43
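
Each requested `table:feature` reference appears in the result as a `table__feature` column. A small sketch of that mapping as observed in the output above, which can help when selecting training columns; the helper `output_column` is hypothetical and not a Feast function:

```python
def output_column(feature_ref: str) -> str:
    """Map a 'table:feature' reference to the 'table__feature' column name."""
    table, feature = feature_ref.split(":", 1)
    return f"{table}__{feature}"

feature_refs = [
    "driver_statistics:avg_daily_trips",
    "driver_trips:trips_today",
]
print([output_column(ref) for ref in feature_refs])
```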


... Train your model here ...

## Populating Online Storage with Batch Ingestion

In [62]:
job = client.start_offline_to_online_ingestion(
    driver_statistics,
    datetime(2020, 10, 10),
    datetime(2020, 10, 20)
)

In [27]:
job.get_status()

<SparkJobStatus.STARTING: 0>
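
The ingestion job runs asynchronously, so it is still starting right after submission. A minimal polling sketch, assuming only that the job object exposes `get_status()` as above; the `wait_until` helper, the `Status` names, and the `FakeJob` stand-in are hypothetical and exist so the example runs without a cluster:

```python
import time
from enum import Enum

class Status(Enum):  # hypothetical stand-in for the real job status enum
    STARTING = 0
    IN_PROGRESS = 1
    COMPLETED = 2

class FakeJob:
    """Stand-in job that advances one state per status check."""
    def __init__(self):
        self._states = iter(Status)
        self._current = None

    def get_status(self):
        # Move to the next state; stay on the last one once exhausted
        self._current = next(self._states, self._current)
        return self._current

def wait_until(job, done, interval=0.01, timeout=5.0):
    """Poll job.get_status() until it is in `done` or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = job.get_status()
        if status in done:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout}s")
        time.sleep(interval)

print(wait_until(FakeJob(), done={Status.COMPLETED}))
```

With a real job, `done` would hold whichever terminal statuses the client defines, and the interval would be seconds rather than milliseconds.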

In [30]:
entities_sample = np.random.choice(entities, 10, replace=False)
entities_sample = [{"driver_id": e} for e in entities_sample]
entities_sample

[{'driver_id': 416975},
 {'driver_id': 139796},
 {'driver_id': 667201},
 {'driver_id': 459097},
 {'driver_id': 549040},
 {'driver_id': 775871},
 {'driver_id': 232140},
 {'driver_id': 137533},
 {'driver_id': 353207},
 {'driver_id': 258085}]

In [31]:
features = client.get_online_features(
    feature_refs=["driver_statistics:avg_daily_trips"],
    entity_rows=entities_sample).to_dict()

In [32]:
pd.DataFrame(features)

Unnamed: 0,driver_id,driver_statistics:avg_daily_trips
0,416975,526
1,139796,329
2,667201,875
3,459097,260
4,549040,867
5,775871,122
6,232140,699
7,137533,756
8,353207,861
9,258085,441


... Run your production prediction here ...

## Ingestion from Streaming (real-time) Source

In [33]:
import json
import pytz
import io
import avro.schema
from avro.io import BinaryEncoder, DatumWriter
from confluent_kafka import Producer

In [34]:
KAFKA_BROKER = "kafka:9092"

In [35]:
avro_schema_json = json.dumps({
    "type": "record",
    "name": "DriverTrips",
    "fields": [
        {"name": "driver_id", "type": "long"},
        {"name": "trips_today", "type": "int"},
        {
            "name": "datetime",
            "type": {"type": "long", "logicalType": "timestamp-micros"},
        },
    ],
})

In [36]:
driver_trips.stream_source = KafkaSource(
    event_timestamp_column="datetime",
    created_timestamp_column="datetime",
    bootstrap_servers=KAFKA_BROKER,
    topic="driver_trips",
    message_format=AvroFormat(avro_schema_json)
)
client.apply_feature_table(driver_trips)

In [37]:
job = client.start_stream_to_online_ingestion(
    driver_trips
)

In [38]:
def send_avro_record_to_kafka(topic, record):
    value_schema = avro.schema.parse(avro_schema_json)
    writer = DatumWriter(value_schema)
    bytes_writer = io.BytesIO()
    encoder = BinaryEncoder(bytes_writer)
    writer.write(record, encoder)
    
    producer = Producer({
        "bootstrap.servers": KAFKA_BROKER,
    })
    producer.produce(topic=topic, value=bytes_writer.getvalue())
    producer.flush()

In [None]:
for record in trips_df.drop(columns=['created']).to_dict('records'):
    record["datetime"] = (
        record["datetime"].to_pydatetime().replace(tzinfo=pytz.utc)
    )

    send_avro_record_to_kafka(topic="driver_trips", record=record)
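
The schema declares `datetime` as `timestamp-micros`, which the Avro specification defines as a `long` counting microseconds since the Unix epoch in UTC. The loop above attaches UTC to each timestamp before encoding so that the instant is unambiguous. A minimal sketch of the equivalent conversion using only the standard library; the helper `to_timestamp_micros` is hypothetical, and the Avro library performs its own conversion internally:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_timestamp_micros(dt: datetime) -> int:
    """Microseconds since the Unix epoch for a timezone-aware datetime."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return (dt - EPOCH) // timedelta(microseconds=1)

print(to_timestamp_micros(datetime(2020, 10, 19, 9, 7, 43, tzinfo=timezone.utc)))
```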

### Retrieving joined features from several feature tables

In [None]:
entities_sample = np.random.choice(entities, 10, replace=False)
entities_sample = [{"driver_id": e} for e in entities_sample]
entities_sample

In [None]:
features = client.get_online_features(
    feature_refs=["driver_statistics:avg_daily_trips", "driver_trips:trips_today"],
    entity_rows=entities_sample).to_dict()

In [None]:
pd.DataFrame(features)

In [40]:
job.cancel()