# Feathr Feature Store on Azure Demo Notebook

This notebook illustrates the use of Feature Store to create a model that predicts NYC Taxi fares. It includes these steps:

- Compute and write features.
- Train a model using these features to predict fares.
- Evaluate that model on a new batch of data using existing features, saved to Feature Store.


Note that this is a real-world dataset, which demonstrates how `Feathr` handles real-world use cases. The feature flow is shown below:
![Feature Flow](https://github.com/linkedin/feathr/blob/main/docs/images/feature_flow.png?raw=true)

And the architecture is as below:
![Architecture](https://github.com/linkedin/feathr/blob/main/docs/images/architecture.png?raw=true)

First, let's explore the dataset:


In [1]:
import pandas as pd
import json
pd.read_csv('mockdata/feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv')


Basically we want to predict the fares for each driver. 

# Feature Engineering with Feathr:
- Duration of trip
- Feature engineering: instead of using the raw datetime like `2021-01-01 00:15:56`, we want to engineer customized features, for example the day of the week, the day of the month, etc.
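
To make the datetime-derived features concrete, here is a minimal local sketch using only Python's standard library. It is not how Feathr computes them (Feathr evaluates its transform expressions in a Spark job); the helper name `datetime_features` and the sample dropoff timestamp are illustrative assumptions. Python's `isoweekday()` counts from Monday = 1, whereas Spark SQL's `dayofweek` counts from Sunday = 1, so the sketch converts to the Spark convention.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # Python counterpart of "yyyy-MM-dd HH:mm:ss"


def datetime_features(pickup: str, dropoff: str) -> dict:
    """Derive illustrative calendar and duration features from raw timestamps."""
    start = datetime.strptime(pickup, TS_FORMAT)
    end = datetime.strptime(dropoff, TS_FORMAT)
    return {
        # Spark-style day of week: 1 = Sunday ... 7 = Saturday
        "day_of_week": end.isoweekday() % 7 + 1,
        "day_of_month": end.day,
        # Whole minutes elapsed between pickup and dropoff
        "trip_minutes": int((end - start).total_seconds() // 60),
    }


# The dropoff time here is a made-up value for illustration.
sample = datetime_features("2021-01-01 00:15:56", "2021-01-01 00:34:10")
print(sample)
```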

Set up the necessary environment first, starting with the package installation.

In [None]:
# Install feathr if it is not already installed
! pip install -U feathr scikit-learn

Doing those transformations with Feathr is very straightforward. We only need to define features in Python:

In [None]:
from feathr.anchor import FeatureAnchor
from feathr.feature import Feature
from feathr.dtype import BOOLEAN, INT32, FLOAT, ValueType
from feathr.feature_derivations import DerivedFeature
from feathr.source import PASSTHROUGH_SOURCE

f_trip_distance = Feature(name="f_trip_distance", feature_type=FLOAT, transform="trip_distance")
f_trip_time_duration = Feature(name="f_trip_time_duration",
            feature_type=INT32,
            transform="time_duration(lpep_pickup_datetime, lpep_dropoff_datetime, 'minutes')")

features = [
    f_trip_distance,
    f_trip_time_duration,
    Feature(name="f_is_long_trip_distance",
            feature_type=BOOLEAN,
            transform="cast_float(trip_distance)>30"),
    Feature(name="f_day_of_week",
            feature_type=INT32,
            transform="dayofweek(lpep_dropoff_datetime)"),
]

request_anchor = FeatureAnchor(name="request_features",
                source=PASSTHROUGH_SOURCE,
                features=features)


f_trip_time_distance = DerivedFeature(name="f_trip_time_distance",
                feature_type=FLOAT,
                input_features=[f_trip_distance, f_trip_time_duration],
                transform="f_trip_distance * f_trip_time_duration")
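
As a rough mental model of what the anchored and derived features above evaluate to for a single row, the following plain-Python sketch mirrors the expressions. It is only an illustration: the function name, the sample record, and the precomputed `trip_minutes` field are hypothetical, and the actual evaluation happens inside Feathr's Spark job.

```python
def compute_row_features(row: dict) -> dict:
    """Mirror the feature expressions for one record (illustrative only)."""
    trip_distance = float(row["trip_distance"])
    trip_minutes = int(row["trip_minutes"])  # hypothetical precomputed duration
    return {
        "f_trip_distance": trip_distance,
        "f_trip_time_duration": trip_minutes,
        # Same threshold as the "cast_float(trip_distance)>30" expression
        "f_is_long_trip_distance": trip_distance > 30,
        # Derived feature: product of two anchored features
        "f_trip_time_distance": trip_distance * trip_minutes,
    }


row_features = compute_row_features({"trip_distance": "2.5", "trip_minutes": 18})
print(row_features)
```

The derived feature depends only on other features, not on raw columns, which is why it is declared with `input_features` rather than attached to a source.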


In [None]:
import glob
import os
import tempfile
from math import sqrt

import pandas as pd
import pandavro as pdx  # used below to read the Avro job output
from feathr import FeathrClient
from feathr.dtype import ValueType
from feathr.job_utils import get_result_df
from feathr.query_feature_list import FeatureQuery
from feathr.settings import ObservationSettings
from feathr.typed_key import TypedKey
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [None]:
os.environ['REDIS_PASSWORD'] = ''
os.environ['AZURE_CLIENT_ID'] = ''
os.environ['AZURE_TENANT_ID'] = ''
os.environ['AZURE_CLIENT_SECRET'] = ''
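
Because the values above are left blank, it can help to check that they have been filled in before creating the client. The sketch below is one possible guard, not part of Feathr; the helper name `missing_env_vars` is hypothetical, and treating all four variables as required is an assumption that depends on your configuration.

```python
import os

REQUIRED_VARS = [
    "REDIS_PASSWORD",
    "AZURE_CLIENT_ID",
    "AZURE_TENANT_ID",
    "AZURE_CLIENT_SECRET",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Set these variables before creating the client:", ", ".join(missing))
```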

Then we initialize a Feathr client:

In [None]:
client = FeathrClient()

We can register the features to a feature registry with Purview (optional):

In [None]:
client.register_features()

Prepare a training dataset by getting offline features for the input observation data:

In [None]:
location_id = TypedKey(key_column="DOLocationID",
                key_column_type=ValueType.INT32, 
                description="location id in NYC",
                full_name="nyc_taxi.location_id")

location_features = FeatureQuery(feature_list=["f_location_avg_fare", "f_location_max_fare"], key=location_id)
request_features = FeatureQuery(feature_list=["f_trip_time_distance", "f_trip_distance", "f_trip_time_duration",
                                              "f_is_long_trip_distance", "f_day_of_week"], key=location_id) 
settings = ObservationSettings(
    observation_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv",
    event_timestamp_column="lpep_dropoff_datetime",
    timestamp_format="yyyy-MM-dd HH:mm:ss")

client.get_offline_features(observation_settings=settings,
    feature_query=[location_features, request_features],
    output_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/output.avro")


The result is also available in the cloud.

After getting all the features, let's train a model:

In [None]:
res_url = client.get_job_result_uri(block=True,timeout_sec=500)

tmp_dir = tempfile.TemporaryDirectory()
client.feathr_spark_laucher.download_result(result_path = res_url, local_folder=tmp_dir.name)
dataframe_list = []
# assuming the results are in Avro format
for file in glob.glob(os.path.join(tmp_dir.name, "*.avro")):
    dataframe_list.append(pdx.read_avro(file))
vertical_concat_df = pd.concat(dataframe_list, axis=0)
tmp_dir.cleanup()
df_res = vertical_concat_df.copy()
df_res

In [None]:
# keep only columns of interest; fillna returns a new frame, which avoids
# pandas' SettingWithCopyWarning when the column is reassigned below
columns = ['fare_amount', 'passenger_count', 'f_trip_distance', 'f_is_long_trip_distance',
           'f_day_of_week', 'f_trip_time_duration', 'f_location_avg_fare', 'f_trip_time_distance']
final_df = df_res[columns].fillna(0)
final_df['fare_amount'] = final_df['fare_amount'].astype("float64")


train_x, test_x, train_y, test_y = train_test_split(final_df.drop(["fare_amount"], axis=1),
                                                    final_df["fare_amount"],
                                                    test_size=0.2,
                                                    random_state=42)

model = LinearRegression()
model.fit(train_x, train_y)

y_predict = model.predict(test_x) 

y_actual = test_y.values.flatten().tolist() 
rmse = sqrt(mean_squared_error(y_actual, y_predict))

# Sum of absolute errors divided by the sum of actual values. Strictly speaking
# this is a weighted variant of MAPE rather than the per-sample mean.
sum_errors = sum(abs(actual_val - predict_val) for actual_val, predict_val in zip(y_actual, y_predict))
sum_actuals = sum(y_actual)

mean_abs_percent_error = sum_errors / sum_actuals
print("Model MAPE:")
print(mean_abs_percent_error)
print()
print("Model Accuracy:")
print(1 - mean_abs_percent_error)
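
The value printed as "Model MAPE" above is the total absolute error divided by the total of the actual fares, which differs from the per-sample mean absolute percentage error. The following standalone sketch computes both alongside RMSE so the difference is visible; the function name and the three sample fares are made up for illustration and are not results from this notebook.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Return RMSE, per-sample MAPE, and the aggregate error ratio."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    rmse = sqrt(sum(e * e for e in errors) / len(errors))
    # Per-sample MAPE is undefined when an actual value is zero, so skip those.
    ratios = [e / abs(a) for e, a in zip(errors, actual) if a != 0]
    mape = sum(ratios) / len(ratios)
    # Aggregate ratio: total absolute error over total actuals.
    wape = sum(errors) / sum(actual)
    return {"rmse": rmse, "mape": mape, "wape": wape}


# Made-up fares, for illustration only.
metrics = regression_metrics([10.0, 20.0, 40.0], [12.0, 18.0, 40.0])
print(metrics)
```

The aggregate ratio weights large fares more heavily, while the per-sample mean treats every trip equally, so the two can diverge noticeably on skewed fare distributions.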

We now want to push the generated features to the online store, so we configure the destination in the feature_gen config:

```
operational: {
  name: generateWithDefaultParams
  endTime: 2022-01-02
  endTimeFormat: "yyyy-MM-dd"
  resolution: DAILY
  output: [{
    name: REDIS
    params: {
      table_name: "nycTaxiFeatures"
    }
  }]
}
features: [f_location_avg_fare, f_location_max_fare]
```

In [None]:
job_res = client.materialize_features()

res_url = client.wait_job_to_finish(timeout_sec=300)

We can then get the features from the online store (Redis):

In [None]:
client.get_online_features("nycTaxiDemoFeature", "265", ['f_location_avg_fare', 'f_location_max_fare'])


In [None]:
client.multi_get_online_features("nycTaxiDemoFeature", ["239", "265"], ['f_location_avg_fare', 'f_location_max_fare'])
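
Conceptually, the two calls above are key-value lookups: a table name and an entity key select a row, and the feature names select values from it. The class below is a hypothetical in-memory stand-in that illustrates this shape; it is not Feathr's or Redis's API, its return types are an assumption, and the fare values are placeholders rather than real data.

```python
from typing import Dict, List, Optional


class InMemoryFeatureStore:
    """Toy store: table name -> entity key -> feature name -> value."""

    def __init__(self) -> None:
        self._tables: Dict[str, Dict[str, Dict[str, float]]] = {}

    def put(self, table: str, key: str, features: Dict[str, float]) -> None:
        """Insert or update the features stored for one entity key."""
        self._tables.setdefault(table, {}).setdefault(key, {}).update(features)

    def get(self, table: str, key: str, names: List[str]) -> List[Optional[float]]:
        """Return values in the order requested; missing features come back as None."""
        row = self._tables.get(table, {}).get(key, {})
        return [row.get(name) for name in names]

    def multi_get(self, table: str, keys: List[str], names: List[str]) -> Dict[str, List[Optional[float]]]:
        """Look up several entity keys at once."""
        return {key: self.get(table, key, names) for key in keys}


store = InMemoryFeatureStore()
# Placeholder values, for illustration only.
store.put("nycTaxiDemoFeature", "265", {"f_location_avg_fare": 1.0, "f_location_max_fare": 2.0})

feature_names = ["f_location_avg_fare", "f_location_max_fare"]
single = store.get("nycTaxiDemoFeature", "265", feature_names)
batch = store.multi_get("nycTaxiDemoFeature", ["239", "265"], feature_names)
print(single)
print(batch)
```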

In [None]:
client.list_registered_features(project_name="frame_getting_started")