# Module 1: Getting started with Feast

## 1. Intro to Feast
- What is a feature view?
- What is the Feast registry?
    - Features as code (e.g. registering features that can be re-used elsewhere)
- Major components of Feast (offline store, online store, materialization, data sources, feature server)
- Batch data sources vs streaming sources

[Feast Quickstart](https://colab.research.google.com/github/feast-dev/feast/blob/master/examples/quickstart/quickstart.ipynb)

## 2. Ingesting stream Setup Spark Structured Streaming to read this Kafka Topic
We first read in the events, apply the schema, run some transformations, and forEachBatch push to feast

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, IntegerType, DoubleType, TimestampType

import pandas as pd
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 pyspark-shell"

In [2]:
spark = SparkSession.builder.master("local").appName("feast-spark").getOrCreate()
# Reduce partitions since default is 200 which will be slow on a local machine
spark.conf.set("spark.sql.shuffle.partitions", 5)

schema = (
    StructType()
        .add('driver_id', IntegerType(), False)
        .add('miles_driven', DoubleType(), False)
        .add('event_timestamp', TimestampType(), False)
        .add('conv_rate', DoubleType(), False)
        .add('acc_rate', DoubleType(), False)
)

# Subscribe to 1 topic, with headers
df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "drivers")
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr('CAST(value AS STRING)')
    .select(from_json('value', schema).alias("temp"))
    .select("temp.*")
)



:: loading settings :: url = jar:file:/Users/dannychiao/.pyenv/versions/3.7.10/envs/python-3.7/lib/python3.7/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /Users/dannychiao/.ivy2/cache
The jars for the packages stored in: /Users/dannychiao/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-0c7bfbbf-c37d-4b79-ae54-87167f0f2485;1.0
	confs: [default]
	found org.apache.spark#spark-sql-kafka-0-10_2.12;3.0.0 in central
	found org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.0.0 in central
	found org.apache.kafka#kafka-clients;2.4.1 in central
	found com.github.luben#zstd-jni;1.4.4-3 in central
	found org.lz4#lz4-java;1.7.1 in local-m2-cache
	found org.xerial.snappy#snappy-java;1.1.7.5 in central
	found org.slf4j#slf4j-api;1.7.30 in local-m2-cache
	found org.spark-project.spark#unused;1.0.0 in central
	found org.apache.commons#commons-pool2;2.6.2 in central
:: resolution report :: resolve 448ms :: artifacts dl 13ms
	:: modules in use:
	com.github.luben#zstd-jni;1.4.4-3 from central in [default]
	org.apache.commons#commons-

# Set up Feast
We now instantiate a Feast `FeatureStore` object to push data to

In [4]:
def print_online_features(features):
    for key, value in sorted(features.items()):
        print(key, " : ", value)

from feast import FeatureStore
from datetime import datetime

store = FeatureStore(repo_path=".")

### Fetch training data from offline store

In [5]:

entity_df = pd.DataFrame.from_dict(
    {
        "driver_id": [1001, 1002, 1003, 1004, 1001],
        "event_timestamp": [
            datetime(2021, 4, 12, 10, 59, 42),
            datetime(2021, 4, 12, 8, 12, 10),
            datetime(2021, 4, 12, 16, 40, 26),
            datetime(2021, 4, 12, 15, 1, 12),
            datetime.now()
        ]
    }
)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_daily_features:daily_miles_driven"
    ],
).to_df()
print(training_df.head())

      driver_id                  event_timestamp  conv_rate  acc_rate  \
360        1001        2021-04-12 10:59:42+00:00   0.521149  0.751659   
721        1002        2021-04-12 08:12:10+00:00   0.089014  0.212637   
1084       1003        2021-04-12 16:40:26+00:00   0.188855  0.344736   
1445       1004        2021-04-12 15:01:12+00:00   0.296492  0.935305   
1805       1001 2022-05-12 22:15:47.590475+00:00   0.404588  0.407571   

      daily_miles_driven  
360            18.926695  
721            12.005569  
1084           23.490234  
1445           19.204191  
1805          350.650257  


### Fetch online features from Redis

In [6]:
features = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_daily_features:daily_miles_driven",
    ],
    entity_rows=[
        {
            "driver_id": 1001,
        }
    ],
).to_dict()
print_online_features(features)

acc_rate  :  [0.4075707495212555]
conv_rate  :  [0.4045884609222412]
daily_miles_driven  :  [350.6502685546875]
driver_id  :  [1001]


## Writing transformed features to Feast
Now we push events into Feast, which can then be further post-processed in on demand transforms.

An example might be to push in the last 5 transactions, and in on demand transforms generate the average of those transactions.

In [7]:
def send_to_feast(df):
    pandas_df: pd.DataFrame = df.toPandas()
    if pandas_df.empty:
        return
    
    if "end" in pandas_df:
        print("processing window")
        # Filter out only for the latest window for the driver id
        pandas_df = pandas_df.sort_values(by=["driver_id","end"], ascending=False).groupby("driver_id").nth(-1)
        pandas_df = pandas_df.rename(columns = {"end": "event_timestamp"})
        pandas_df['created'] = pd.to_datetime('now')
        store.push("driver_stats_push_source", pandas_df)
    pandas_df.sort_values(by="driver_id", inplace=True)
    print(pandas_df.head(20))
    print(f"Num rows: {len(pandas_df.index)}")

avg_conv_rate = (
    df.withWatermark("event_timestamp", "1 second")
        .groupBy("driver_id")
        .agg(max("event_timestamp").alias("timestamp"), count("event_timestamp").alias("num_rows"), avg("conv_rate"), sum("acc_rate"))
)

daily_miles_driven = (
    df.withWatermark("event_timestamp", "1 second") 
        .groupBy("driver_id", window(timeColumn="event_timestamp", windowDuration="1 day", slideDuration="1 hour"))
        .agg(sum("miles_driven").alias("daily_miles_driven"))
        .select("driver_id", "window.end", "daily_miles_driven")
)

query_1 = daily_miles_driven \
    .writeStream \
    .outputMode("update") \
    .option("checkpointLocation", "/tmp/feast-workshop/q1/") \
    .trigger(processingTime="15 seconds") \
    .foreachBatch(send_to_feast) \
    .start()

query_1.awaitTermination(timeout=30)
query_1.stop()

05/12/2022 10:16:06 PM INFO:Callback Server Starting
05/12/2022 10:16:06 PM INFO:Socket listening on ('127.0.0.1', 63229)
22/05/12 22:16:06 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
05/12/2022 10:16:13 PM INFO:Python Server ready to receive messages
05/12/2022 10:16:13 PM INFO:Received command c on object id p0
  if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
                                                                                

processing window
              event_timestamp  daily_miles_driven                    created
driver_id                                                                   
1001      2022-04-11 08:00:00           37.853390 2022-05-13 02:16:16.518250
1002      2022-04-11 08:00:00           24.011137 2022-05-13 02:16:16.518250
1003      2022-04-11 08:00:00           46.980469 2022-05-13 02:16:16.518250
1004      2022-04-11 08:00:00           38.408382 2022-05-13 02:16:16.518250
1005      2022-04-11 08:00:00           11.529007 2022-05-13 02:16:16.518250
Num rows: 5


05/12/2022 10:16:17 PM INFO:Received command c on object id p0
  if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
05/12/2022 10:16:30 PM INFO:Received command c on object id p0


In [8]:
features = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_daily_features:daily_miles_driven",
    ],
    entity_rows=[
        {
            "driver_id": 1001,
        }
    ],
).to_dict()
print_online_features(features)

acc_rate  :  [0.4075707495212555]
conv_rate  :  [0.4045884609222412]
daily_miles_driven  :  [37.853389739990234]
driver_id  :  [1001]


### Cleanup checkpoint

In [9]:
import shutil

dir_path = '/tmp/feast-workshop/q1/'

try:
    shutil.rmtree(dir_path)
except OSError as e:
    print("Error: %s : %s" % (dir_path, e.strerror))