# Module 1: Serving fresh online features with Feast, Kafka, Redis

## 1. Overview
In this notebook, we use Spark to build streaming features from events in Kafka and register them in Feast. We then show how Feast combines these streaming features with batch data sources in the online store (Redis), so that users can retrieve features from Redis through Feast at low latency.

If you haven't already, look at the [README](../README.md) for setup instructions prior to starting this notebook.

## 2. Set up Spark Structured Streaming to read from the Kafka topic
We first read the events, apply the schema, and run some transformations. Later, we use `foreachBatch` to push the results to Feast.

In [8]:
from pyspark.sql import SparkSession
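# Import only what the notebook uses; note that this `sum` shadows the built-in `sum`
from pyspark.sql.functions import from_json, sum, window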
from pyspark.sql.types import StructType, IntegerType, DoubleType, TimestampType

import pandas as pd
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 pyspark-shell"

In [9]:
spark = SparkSession.builder.master("local").appName("feast-spark").getOrCreate()
# Reduce partitions since default is 200 which will be slow on a local machine
spark.conf.set("spark.sql.shuffle.partitions", 5)

schema = (
    StructType()
        .add('driver_id', IntegerType(), False)
        .add('miles_driven', DoubleType(), False)
        .add('event_timestamp', TimestampType(), False)
        .add('conv_rate', DoubleType(), False)
        .add('acc_rate', DoubleType(), False)
)

# Subscribe to the "drivers" topic and parse each message value as JSON using the schema
df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "drivers")
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr('CAST(value AS STRING)')
    .select(from_json('value', schema).alias("temp"))
    .select("temp.*")
)
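
To make the schema concrete, here is a plain-Python sketch of what one message on the `drivers` topic might look like and how it maps to typed fields. The payload values and the ISO timestamp format are assumptions for illustration, and `parse_event` is a hypothetical helper, not part of Spark or Feast. Unlike this sketch, Spark's `from_json` yields nulls for unparseable input rather than raising.

```python
import json
from datetime import datetime

# Hypothetical payload: field names follow the schema above, values are made up.
raw_value = json.dumps({
    "driver_id": 1001,
    "miles_driven": 12.5,
    "event_timestamp": "2022-05-13T20:00:00",
    "conv_rate": 0.4,
    "acc_rate": 0.4,
})

# Python types that roughly correspond to the Spark types in the schema.
EXPECTED_TYPES = {
    "driver_id": int,
    "miles_driven": float,
    "event_timestamp": datetime,
    "conv_rate": float,
    "acc_rate": float,
}


def parse_event(value: str) -> dict:
    """Decode one JSON message and check every field against EXPECTED_TYPES."""
    record = json.loads(value)
    record["event_timestamp"] = datetime.fromisoformat(record["event_timestamp"])
    for name, expected in EXPECTED_TYPES.items():
        if not isinstance(record[name], expected):
            raise TypeError(f"{name} should be {expected.__name__}")
    return record


event = parse_event(raw_value)
print(event)
```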

## 3. Set up the feature store

### Apply feature repository
We first run `feast apply` to register the data sources and features and to set up Redis.

In [10]:
!feast apply

Created entity driver
Created feature view driver_hourly_stats
Created feature view driver_daily_features
Created on demand feature view transformed_conv_rate
Created feature service model_v2

Deploying infrastructure for driver_hourly_stats
Deploying infrastructure for driver_daily_features


Now we instantiate a Feast `FeatureStore` object, which we use to fetch features and to push data.

In [12]:
from feast import FeatureStore
from datetime import datetime

store = FeatureStore(repo_path=".")

### Fetch training data from offline store
This is a quick check that the features are available in the batch sources.

In [13]:

entity_df = pd.DataFrame.from_dict(
    {
        "driver_id": [1001, 1002, 1003, 1004, 1001],
        "event_timestamp": [
            datetime(2021, 4, 12, 10, 59, 42),
            datetime(2021, 4, 12, 8, 12, 10),
            datetime(2021, 4, 12, 16, 40, 26),
            datetime(2021, 4, 12, 15, 1, 12),
            datetime.now()
        ]
    }
)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_daily_features:daily_miles_driven"
    ],
).to_df()
print(training_df.head())

      driver_id                  event_timestamp  conv_rate  acc_rate  \
360        1001        2021-04-12 10:59:42+00:00   0.521149  0.751659   
721        1002        2021-04-12 08:12:10+00:00   0.089014  0.212637   
1084       1003        2021-04-12 16:40:26+00:00   0.188855  0.344736   
1445       1004        2021-04-12 15:01:12+00:00   0.296492  0.935305   
1805       1001 2022-05-14 01:52:59.452719+00:00   0.404588  0.407571   

      daily_miles_driven  
360            18.926695  
721            12.005569  
1084           23.490234  
1445           19.204191  
1805          350.650257  
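
Each row of `entity_df` pairs a driver with a timestamp, and the historical lookup returns the feature values as of that timestamp. The idea can be sketched in plain Python as follows. The `history` values and the `point_in_time_value` helper are hypothetical, and the sketch ignores details such as feature view TTLs.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for one driver: (event_timestamp, conv_rate), sorted by time.
history = [
    (datetime(2021, 4, 12, 9, 0), 0.30),
    (datetime(2021, 4, 12, 10, 0), 0.52),
    (datetime(2021, 4, 12, 11, 0), 0.47),
]


def point_in_time_value(history, as_of):
    """Return the latest value whose timestamp is <= as_of, or None if there is none."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx else None


# The 11:00 value is not used because it was recorded after the requested timestamp.
print(point_in_time_value(history, datetime(2021, 4, 12, 10, 59, 42)))  # 0.52
```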


## 4. Materialize batch features & fetch online features from Redis
First we materialize features into the online store (Redis). Materialization loads the latest values for each entity key from the batch sources.

In [14]:
!feast materialize-incremental $(date +%Y-%m-%d)

Materializing 2 feature views to 2022-05-13 20:00:00-04:00 into the redis online store.

driver_hourly_stats from 1748-07-29 05:53:04-04:56:02 to 2022-05-13 20:00:00-04:00:
100%|████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 950.74it/s]
driver_daily_features from 1748-07-29 05:53:04-04:56:02 to 2022-05-13 20:00:00-04:00:
100%|███████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1127.20it/s]
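
As a rough mental model, materialization keeps only the most recent row per entity key up to the end date. The sketch below shows that idea in plain Python. The rows and the `materialize` helper are hypothetical, and the real command also tracks the start of each interval per feature view.

```python
from datetime import datetime

# Hypothetical batch rows: (driver_id, event_timestamp, daily_miles_driven).
batch_rows = [
    (1001, datetime(2022, 5, 12, 0, 0), 300.0),
    (1001, datetime(2022, 5, 13, 0, 0), 350.0),
    (1002, datetime(2022, 5, 13, 0, 0), 120.0),
    (1002, datetime(2022, 5, 14, 0, 0), 999.0),  # later than end_date, so skipped
]


def materialize(rows, end_date):
    """Keep the latest value per driver_id among rows with timestamp <= end_date."""
    latest = {}
    for driver_id, ts, value in rows:
        if ts > end_date:
            continue
        current = latest.get(driver_id)
        if current is None or ts > current[0]:
            latest[driver_id] = (ts, value)
    return {driver_id: value for driver_id, (_, value) in latest.items()}


online = materialize(batch_rows, end_date=datetime(2022, 5, 13, 20, 0))
print(online)  # {1001: 350.0, 1002: 120.0}
```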


Now we can retrieve these features from Redis.

In [16]:
features = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_daily_features:daily_miles_driven",
    ],
    entity_rows=[
        {
            "driver_id": 1001,
        }
    ],
).to_dict()

def print_online_features(features):
    for key, value in sorted(features.items()):
        print(key, " : ", value)

print_online_features(features)

acc_rate  :  [0.4075707495212555]
conv_rate  :  [0.4045884609222412]
daily_miles_driven  :  [350.6502685546875]
driver_id  :  [1001]


## 5. Generating fresher features via stream transformations

### 5a. Building streaming features with Kafka + Spark Structured Streaming
Now we push streaming features into Feast by ingesting events from Kafka and processing with Spark Structured Streaming.
- These features can then be further post-processed and combined with other features or request data in on demand transforms.
- An example might be to push in the last 5 transactions, and in on demand transforms generate the average of those transactions.
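
The second bullet can be sketched in plain Python. The feature name `last_5_transactions`, its values, and the `avg_last_transactions` helper are hypothetical; in Feast this logic would live in an on demand feature view rather than a standalone function.

```python
from statistics import mean

# Hypothetical pushed feature: the last five transaction amounts, most recent last.
pushed_features = {"last_5_transactions": [12.0, 30.0, 18.0, 25.0, 15.0]}


def avg_last_transactions(features: dict) -> dict:
    """Derive the average at request time from the pushed list of transactions."""
    amounts = features["last_5_transactions"]
    return {"avg_last_5_transactions": mean(amounts) if amounts else None}


result = avg_last_transactions(pushed_features)
print(result)  # {'avg_last_5_transactions': 20.0}
```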

In [18]:
def send_to_feast(df, epoch):
    pandas_df: pd.DataFrame = df.toPandas()
    if pandas_df.empty:
        return
    
    if "end" in pandas_df:
        print("processing window")
        # Keep one row per driver. Note that after the descending sort, nth(-1)
        # selects the row with the smallest window "end" for each driver_id.
        pandas_df = pandas_df.sort_values(by=["driver_id","end"], ascending=False).groupby("driver_id").nth(-1)
        pandas_df = pandas_df.rename(columns = {"end": "event_timestamp"})
        pandas_df['created'] = pd.to_datetime('now')
        store.push("driver_stats_push_source", pandas_df)
    pandas_df.sort_values(by="driver_id", inplace=True)
    print(pandas_df.head(20))
    print(f"Num rows: {len(pandas_df.index)}")

daily_miles_driven = (
    df.withWatermark("event_timestamp", "1 second") 
        .groupBy("driver_id", window(timeColumn="event_timestamp", windowDuration="1 day", slideDuration="1 hour"))
        .agg(sum("miles_driven").alias("daily_miles_driven"))
        .select("driver_id", "window.end", "daily_miles_driven")
)

query_1 = daily_miles_driven \
    .writeStream \
    .outputMode("update") \
    .option("checkpointLocation", "/tmp/feast-workshop/q1/") \
    .trigger(processingTime="15 seconds") \
    .foreachBatch(send_to_feast) \
    .start()

query_1.awaitTermination(timeout=30)
query_1.stop()

22/05/14 01:53:51 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.

processing window
              event_timestamp  daily_miles_driven                    created
driver_id                                                                   
1001      2022-04-11 08:00:00           18.926695 2022-05-14 05:53:54.228986
1002      2022-04-11 08:00:00           12.005569 2022-05-14 05:53:54.228986
1003      2022-04-11 08:00:00           23.490234 2022-05-14 05:53:54.228986
1004      2022-04-11 08:00:00           19.204191 2022-05-14 05:53:54.228986
1005      2022-04-11 08:00:00            5.764504 2022-05-14 05:53:54.228986
Num rows: 5




processing window
              event_timestamp  daily_miles_driven                    created
driver_id                                                                   
1001      2023-03-06 12:00:00          709.549807 2022-05-14 05:54:04.133501
1002      2023-03-06 12:00:00          484.523544 2022-05-14 05:54:04.133501
1003      2023-03-06 12:00:00          818.795884 2022-05-14 05:54:04.133501
1004      2023-03-06 11:00:00          494.831386 2022-05-14 05:54:04.133501
1005      2023-03-06 11:00:00          556.985853 2022-05-14 05:54:04.133501
Num rows: 5




processing window
              event_timestamp  daily_miles_driven                    created
driver_id                                                                   
1001      2023-03-08 22:00:00          642.781653 2022-05-14 05:54:15.907547
1002      2023-03-08 23:00:00          753.619536 2022-05-14 05:54:15.907547
1003      2023-03-08 22:00:00          626.791159 2022-05-14 05:54:15.907547
1004      2023-03-08 23:00:00          639.276020 2022-05-14 05:54:15.907547
1005      2023-03-08 23:00:00          590.422572 2022-05-14 05:54:15.907547
Num rows: 5
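
The aggregation above uses a one-day window that slides every hour, so each event contributes to 24 overlapping windows. The following plain-Python sketch lists those windows for a single event. The `windows_containing` helper and the sample timestamp are hypothetical, and the sketch assumes window starts are aligned to the Unix epoch with no start offset.

```python
from datetime import datetime, timedelta


def windows_containing(event_time, window=timedelta(days=1), slide=timedelta(hours=1)):
    """Return every [start, end) window that contains event_time, earliest first."""
    epoch = datetime(1970, 1, 1)
    # Latest window start that is at or before the event.
    last_start = event_time - (event_time - epoch) % slide
    count = window // slide  # number of overlapping windows per event
    starts = [last_start - i * slide for i in range(count)]
    return [(start, start + window) for start in sorted(starts)]


event_time = datetime(2022, 4, 10, 8, 30)  # made-up event timestamp
windows = windows_containing(event_time)
print(len(windows))    # 24
print(windows[0][1])   # earliest window end: 2022-04-10 09:00:00
print(windows[-1][1])  # latest window end: 2022-04-11 08:00:00
```

Because every micro-batch can update many of these windows for the same driver, `send_to_feast` reduces the batch to one row per `driver_id` before pushing.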




### 5b. Verify fresh features
Now we can verify that the `daily_miles_driven` feature has indeed changed from the original materialized features.

In [19]:
features = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_daily_features:daily_miles_driven",
    ],
    entity_rows=[
        {
            "driver_id": 1001,
        }
    ],
).to_dict()
print_online_features(features)

acc_rate  :  [0.4075707495212555]
conv_rate  :  [0.4045884609222412]
daily_miles_driven  :  [642.7816772460938]
driver_id  :  [1001]
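
To see exactly which features changed, the two online responses can be compared directly. This sketch copies the values printed earlier in this notebook into two dictionaries; the `changed_features` helper is hypothetical.

```python
# Online features for driver 1001 after materialization (section 4).
before = {
    "acc_rate": [0.4075707495212555],
    "conv_rate": [0.4045884609222412],
    "daily_miles_driven": [350.6502685546875],
    "driver_id": [1001],
}
# Online features for driver 1001 after the streaming push (section 5).
after = {
    "acc_rate": [0.4075707495212555],
    "conv_rate": [0.4045884609222412],
    "daily_miles_driven": [642.7816772460938],
    "driver_id": [1001],
}


def changed_features(before: dict, after: dict) -> dict:
    """Map each feature whose value differs to its (before, after) pair."""
    return {
        name: (before.get(name), after[name])
        for name in sorted(after)
        if before.get(name) != after[name]
    }


print(changed_features(before, after))
# {'daily_miles_driven': ([350.6502685546875], [642.7816772460938])}
```

Only `daily_miles_driven` differs, which matches the fact that the stream pushes that feature alone.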


### Cleanup
Finally, let's clean up Spark's checkpoint directory.

In [20]:
import shutil

dir_path = '/tmp/feast-workshop/q1/'

try:
    shutil.rmtree(dir_path)
except OSError as e:
    print("Error: %s : %s" % (dir_path, e.strerror))
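
If the cleanup cell may be run more than once, an idempotent variant can be convenient. The sketch below uses a temporary directory as a stand-in for the checkpoint path so that it is safe to run anywhere; the directory layout is made up.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical stand-in for the checkpoint directory used by the streaming query.
checkpoint_dir = Path(tempfile.mkdtemp()) / "q1"
(checkpoint_dir / "offsets").mkdir(parents=True)

# ignore_errors=True makes the removal a no-op when the directory is already gone.
shutil.rmtree(checkpoint_dir, ignore_errors=True)
shutil.rmtree(checkpoint_dir, ignore_errors=True)  # second call does not raise

print(checkpoint_dir.exists())  # False
```

The trade-off is that `ignore_errors=True` also hides genuine failures such as permission errors, so the explicit `try`/`except` above is the better choice when those should be reported.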