# Feathr Feature Store on Azure Demo Notebook

This notebook illustrates the use of Feature Store to create a model that predicts NYC Taxi fares. It includes these steps:

- Compute and write features.
- Train a model using these features to predict fares.
- Evaluate that model on a new batch of data using existing features, saved to Feature Store.


Note that this notebook uses a real-world dataset, which demonstrates how `Feathr` handles a real-world use case. The feature flow is shown below:
![Feature Flow](https://github.com/linkedin/feathr/blob/main/docs/images/feature_flow.png?raw=true)

First, let's explore the dataset:


In [None]:
import pandas as pd
import json
pd.read_csv('mockdata/feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv')

Basically we want to predict the fares for each driver. 

In [None]:
# Install feathr if it is not already installed
! pip install -U feathr

In [None]:
! pip install scikit-learn

# Feature Engineering with Feathr:
- Duration of trip
- Feature Engineering: Instead of using the raw datetime like `2021-01-01 00:15:56`, we want to engineer customized features from it, for example the day of the week, the day of the month, etc.
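
To make the two bullets above concrete, here is a minimal pandas sketch of the same kind of transformation on two made-up trips. It is not the Feathr definition from the `features/` files. The output column names (`trip_duration_min`, `day_of_week`, `day_of_month`) are hypothetical, and `lpep_pickup_datetime` is assumed to be the pickup column that pairs with `lpep_dropoff_datetime`.

```python
import pandas as pd

# Two made-up trips; values are illustrative only.
trips = pd.DataFrame({
    "lpep_pickup_datetime": ["2021-01-01 00:15:56", "2021-01-03 18:05:00"],
    "lpep_dropoff_datetime": ["2021-01-01 00:31:26", "2021-01-03 18:47:30"],
})

# Parse the raw strings into datetimes before deriving anything from them.
pickup = pd.to_datetime(trips["lpep_pickup_datetime"])
dropoff = pd.to_datetime(trips["lpep_dropoff_datetime"])

features = pd.DataFrame({
    # Duration of the trip in minutes
    "trip_duration_min": (dropoff - pickup).dt.total_seconds() / 60,
    # Day of the week, Monday=0 ... Sunday=6
    "day_of_week": dropoff.dt.dayofweek,
    # Day of the month, 1..31
    "day_of_month": dropoff.dt.day,
})
print(features)
```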

Doing those transformations with Feathr is very straightforward.  

In [None]:
!pygmentize "features/non_agg_features.py"

In [None]:
!pygmentize "features/agg_features.py"

In [None]:
!pygmentize "features/request_features.py"

Set up the necessary environment variables first.

In [None]:
import os
os.environ['REDIS_PASSWORD'] = 'Li7Nn63iNB0x731VTnnz2Vr29WYJHx7JlAzCaH9lbHw='
os.environ['AZURE_CLIENT_ID'] = "b40e49c0-75c7-4959-ad25-896118cd79e8"
os.environ['AZURE_TENANT_ID'] = '72f988bf-86f1-41af-91ab-2d7cd011db47'
os.environ['AZURE_CLIENT_SECRET'] = 'kAB5ps6yvo_f08n-4Av~.IDwHFL_xl_63I'
os.environ['AZURE_PURVIEW_NAME'] = 'feathrazuretest3-purview1'
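
Outside a demo, values like these are usually supplied from the shell or a secret store rather than written into the notebook. As a sketch under that assumption, the hypothetical helper `missing_env_vars` below only checks that the variables named in the cell above are present, without setting or printing their values.

```python
import os

# Variable names taken from the cell above; the helper itself is hypothetical.
REQUIRED_VARS = [
    "REDIS_PASSWORD",
    "AZURE_CLIENT_ID",
    "AZURE_TENANT_ID",
    "AZURE_CLIENT_SECRET",
    "AZURE_PURVIEW_NAME",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    # Report only the names, never the values.
    print("Missing environment variables:", ", ".join(missing))
```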

Then we initialize a Feathr client:

In [None]:
from feathr import FeathrClient
client = FeathrClient()

In [None]:
# We can register the features to a feature registry with Purview (optional):
client.register_features()

In [None]:
import os
from datetime import datetime, timedelta 
 
from feathr.query_feature_list import FeatureQuery
from feathr.settings import ObservationSettings 
from feathr.typed_key import TypedKey 
from feathr.dtype import ValueType

location_id = TypedKey(key_column="DOLocationID",
                key_column_type=ValueType.INT32,
                description="location id in NYC",
                full_name="nyc_taxi.location_id")
feature_query = FeatureQuery(feature_list=[ "f_trip_distance", "f_is_long_trip_distance", "f_day_of_week", 
                        "f_trip_time_duration", "f_location_avg_fare", "f_trip_time_distance"], key=location_id)
settings = ObservationSettings(
    observation_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv",
    event_timestamp_column="lpep_dropoff_datetime",
    timestamp_format="yyyy-MM-dd HH:mm:ss")

client.get_offline_features(observation_settings=settings,
    feature_query=feature_query,
    output_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/output.avro")
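
The `timestamp_format` value above, `yyyy-MM-dd HH:mm:ss`, looks like a Java-style date pattern rather than a Python `strptime` format. As a small sketch, assuming that reading is right, the equivalent Python format can be used to check locally that sample timestamps have the expected shape. The helper `matches_format` and the sample strings are hypothetical.

```python
from datetime import datetime

# Python strptime equivalent of the "yyyy-MM-dd HH:mm:ss" pattern (assumed mapping).
PY_FORMAT = "%Y-%m-%d %H:%M:%S"


def matches_format(value, fmt=PY_FORMAT):
    """Return True if the string parses with the given strptime format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


# Made-up sample values: the first follows the pattern, the second does not.
samples = ["2020-04-01 00:44:02", "04/01/2020 00:44"]
print([matches_format(s) for s in samples])
```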

The result is also available in cloud storage, at the `output_path` specified above.

After getting all the features, let's train a model:

In [None]:
from math import sqrt
import tempfile
import pandas as pd
from sklearn.linear_model import LinearRegression
import glob, os
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split  
import pandavro as pdx
from feathr.job_utils import get_result_df
res_url = client.get_job_result_uri(block=True,timeout_sec=500)
df_res = get_result_df(client)
df_res.sample(10)

In [None]:
# Keep only the label and the feature columns used for training.
# .copy() avoids modifying a view of df_res (pandas SettingWithCopyWarning).
final_df = df_res[['fare_amount', 'passenger_count', "f_trip_distance", "f_is_long_trip_distance", "f_day_of_week" ,"f_trip_time_duration", "f_location_avg_fare", "f_trip_time_distance"]].copy()
final_df = final_df.fillna(0)
final_df['fare_amount'] = final_df['fare_amount'].astype("float64")


train_x, test_x, train_y, test_y = train_test_split(final_df.drop(["fare_amount"], axis=1),
                                                    final_df["fare_amount"],
                                                    test_size=0.2,
                                                    random_state=42)

model = LinearRegression()
model.fit(train_x, train_y)

y_predict = model.predict(test_x) 

y_actual = test_y.values.flatten().tolist() 
rmse = sqrt(mean_squared_error(y_actual, y_predict))

# The metric below is sum(|actual - predicted|) / sum(actual), i.e. a weighted
# variant of MAPE rather than the mean of per-row percentage errors.
sum_actuals = sum_errors = 0

for actual_val, predict_val in zip(y_actual, y_predict):
    sum_errors += abs(actual_val - predict_val)
    sum_actuals += actual_val

mean_abs_percent_error = sum_errors / sum_actuals
print("Model RMSE:")
print(rmse)
print()
print("Model MAPE:")
print(mean_abs_percent_error)
print()
print("Model Accuracy:")
print(1 - mean_abs_percent_error)
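
The value reported as MAPE in the cell above is the sum of absolute errors divided by the sum of actual values, which is often called weighted MAPE. The sketch below, on three made-up fares, shows how it can differ from the per-row mean absolute percentage error. The function names `rmse`, `wmape`, and `mape` are hypothetical helpers, not part of Feathr or scikit-learn.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def wmape(actual, predicted):
    """Sum of absolute errors divided by the sum of actuals."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / sum(actual)


def mape(actual, predicted):
    """Mean of per-row absolute percentage errors; undefined if an actual is 0."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Made-up fares, for illustration only.
actual = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]

print("RMSE :", rmse(actual, predicted))
print("WMAPE:", wmape(actual, predicted))
print("MAPE :", mape(actual, predicted))
```

On these numbers the weighted variant is 4/70 while the per-row mean is 0.1, because large fares dominate the weighted version.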

We now want to push the generated features to the online store:

In [None]:
from datetime import datetime, timedelta

from feathr._materialization_utils import _to_materialization_config
from feathr.materialization_settings import MaterializationSettings
from feathr.sink import RedisSink

redisSink = RedisSink(table_name="nycTaxiDemoFeature")
settings = MaterializationSettings("nycTaxiTable",
                                    sinks=[redisSink],
                                    feature_names=["f_location_avg_fare", "f_location_max_fare"])
job_res = client.materialize_features(settings=settings)

res_url = client.wait_job_to_finish(timeout_sec=300)

We can then get the features from the online store (Redis):

In [None]:
client.get_online_features("nycTaxiDemoFeature", "265", ['f_location_avg_fare', 'f_location_max_fare'])


In [None]:
client.multi_get_online_features("nycTaxiDemoFeature", ["239", "265"], ['f_location_avg_fare', 'f_location_max_fare'])
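
Conceptually, the two calls above are key-based lookups: a location ID goes in, and the values of the requested features come back. The sketch below mimics that shape with a plain dictionary. It does not call Feathr or Redis, the stored numbers are made up, and `lookup` and `multi_lookup` are hypothetical helpers whose return shape is an assumption, not the Feathr client's documented behavior.

```python
# Hypothetical stand-in for an online store: key -> {feature name: value}.
fake_online_store = {
    "239": {"f_location_avg_fare": 14.2, "f_location_max_fare": 80.0},
    "265": {"f_location_avg_fare": 31.5, "f_location_max_fare": 150.0},
}


def lookup(store, key, feature_names):
    """Return the requested feature values for one key, in the requested order."""
    row = store.get(key, {})
    return [row.get(name) for name in feature_names]


def multi_lookup(store, keys, feature_names):
    """Return a dict mapping each key to its list of feature values."""
    return {key: lookup(store, key, feature_names) for key in keys}


names = ["f_location_avg_fare", "f_location_max_fare"]
print(lookup(fake_online_store, "265", names))
print(multi_lookup(fake_online_store, ["239", "265"], names))
```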

In [None]:
client.list_registered_features(project_name="frame_getting_started")