# Feathr Feature Store on Azure Demo Notebook

This notebook illustrates the use of Feature Store to create a model that predicts NYC Taxi fares. It includes these steps:

- Compute and write features.
- Train a model using these features to predict fares.
- Evaluate that model on a new batch of data using existing features, saved to Feature Store.


Note that this is a real-world dataset, which demonstrates how `Feathr` handles real-world use cases. The feature flow is shown below:
![Feature Flow](https://github.com/linkedin/feathr/blob/main/docs/images/feature_flow.png?raw=true)

And the architecture is as below:
![Architecture](https://github.com/linkedin/feathr/blob/main/docs/images/architecture.png?raw=true)

First, let's explore the dataset:


In [1]:
import pandas as pd
import json
pd.read_csv('mockdata/feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv')

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
0,2,2021-01-01 00:15:56,2021-01-01 00:19:52,N,1,43,151,1,1.01,5.5,0.5,0.5,0.0,0,,0.3,6.8,2,1,0.0
1,22,2021-01-01 11:25:59,2021-01-01 11:34:44,N,1,166,239,1,2.53,10.0,0.5,0.5,2.81,0,,0.3,16.86,1,1,2.75
2,23,2021-01-01 00:45:57,2021-01-01 00:51:55,N,1,41,42,1,1.12,6.0,0.5,0.5,1.0,0,,0.3,8.3,1,1,0.0
3,24,2020-12-31 23:57:51,2021-01-01 23:04:56,N,1,168,75,1,1.99,8.0,0.5,0.5,0.0,0,,0.3,9.3,2,1,0.0
4,25,2021-01-01 17:16:36,2021-01-01 17:16:40,N,2,265,265,3,0.0,-52.0,0.0,-0.5,0.0,0,,-0.3,-52.8,3,1,0.0
5,12,2021-01-01 00:16:36,2021-01-01 00:16:40,N,2,265,265,3,0.0,52.0,0.0,0.5,0.0,0,,0.3,52.8,2,1,0.0
6,42,2021-01-01 05:19:14,2021-01-01 00:19:21,N,5,265,265,1,0.0,180.0,0.0,0.0,36.06,0,,0.3,216.36,1,2,0.0
7,52,2021-01-01 00:26:31,2021-01-01 00:28:50,N,1,75,75,6,0.45,3.5,0.5,0.5,0.96,0,,0.3,5.76,1,1,0.0
8,2,2021-01-01 00:57:46,2021-01-01 00:57:57,N,1,225,225,1,0.0,2.5,0.5,0.5,0.0,0,,0.3,3.8,2,1,0.0
9,32,2021-01-01 00:58:32,2021-01-01 01:32:34,N,1,225,265,1,12.19,38.0,0.5,0.5,2.75,0,,0.3,42.05,1,1,0.0


Our goal is to predict the fare for each trip.

# Feature Engineering with Feathr:
- Duration of trip
- Date and time features: instead of using the raw datetime such as `2021-01-01 00:15:56`, we engineer customized features from it, for example the day of the week and the day of the month.

Doing those transformations with Feathr is very straightforward. We only need to define a few configurations:

```python
f_trip_distance: "(float)trip_distance"
f_is_long_trip_distance: "trip_distance>30"
```
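
To make these transformations concrete, here is a minimal pandas sketch of what such features compute, using three rows copied from the sample output above. It is only an illustration: it does not show how Feathr evaluates the expressions, and the definitions used here (pickup time as the source of the calendar features, pandas weekday numbering, duration in seconds) are assumptions rather than the definitions in the demo's configuration.

```python
import pandas as pd

# Three rows copied from the sample output above.
trips = pd.DataFrame({
    "lpep_pickup_datetime": ["2021-01-01 00:15:56", "2021-01-01 11:25:59", "2021-01-01 00:58:32"],
    "lpep_dropoff_datetime": ["2021-01-01 00:19:52", "2021-01-01 11:34:44", "2021-01-01 01:32:34"],
    "trip_distance": [1.01, 2.53, 12.19],
})

pickup = pd.to_datetime(trips["lpep_pickup_datetime"])
dropoff = pd.to_datetime(trips["lpep_dropoff_datetime"])

features = pd.DataFrame({
    # Mirrors the two expressions in the configuration snippet.
    "f_trip_distance": trips["trip_distance"].astype(float),
    "f_is_long_trip_distance": trips["trip_distance"] > 30,
    # Assumed definitions for the calendar features.
    "f_hour_of_day": pickup.dt.hour,
    "f_day_of_week": pickup.dt.dayofweek,  # pandas convention: Monday=0
    "f_day_of_month": pickup.dt.day,
    # Assumed unit: seconds.
    "f_trip_time_duration": (dropoff - pickup).dt.total_seconds(),
})
print(features)
```

Weekday numbering differs between engines, so the values of a feature like `f_day_of_week` depend on where the expression is evaluated.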

Let's put these definitions in a configuration file.

In [2]:
# Install feathr if it is not already installed
! pip install -U feathr scikit-learn

Collecting feathr
  Using cached feathr-0.1.11-py3-none-any.whl (93 kB)
Collecting scikit-learn
  Using cached scikit_learn-1.0.2-cp39-cp39-macosx_10_13_x86_64.whl (8.0 MB)
Collecting scipy>=1.1.0
  Using cached scipy-1.8.0-cp39-cp39-macosx_12_0_universal2.macosx_10_9_x86_64.whl (55.6 MB)
Collecting threadpoolctl>=2.0.0
  Using cached threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Collecting joblib>=0.11
  Using cached joblib-1.1.0-py2.py3-none-any.whl (306 kB)
Installing collected packages: threadpoolctl, scipy, joblib, scikit-learn, feathr


Set up the necessary environment variables first.

In [3]:
import os
os.environ['REDIS_PASSWORD'] = ''
os.environ['AZURE_CLIENT_ID'] = ''
os.environ['AZURE_TENANT_ID'] = ''
os.environ['AZURE_CLIENT_SECRET'] = ''
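
The empty strings above are placeholders. Before creating the client, it can help to check that each variable actually has a value. The sketch below does that; `missing_env_vars` is a hypothetical helper written for this notebook, not part of Feathr.

```python
import os

REQUIRED_VARS = [
    "REDIS_PASSWORD",
    "AZURE_CLIENT_ID",
    "AZURE_TENANT_ID",
    "AZURE_CLIENT_SECRET",
]

def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in `environ`."""
    return [name for name in names if not environ.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Set these variables before creating the client:", ", ".join(missing))
```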

Then we will initialize a feathr client:

In [4]:
from feathr import FeathrClient
client = FeathrClient()

In [5]:
# We can register the features to a feature registry with Purview (optional):
client.register_features()

2022-03-14 18:46:44.727 | INFO     | feathr._feature_registry:_read_config_from_workspace:418 - Reading feature configuration from ['/Users/hnlin/IdeaProjects/feathr_py_api_doc/feathr_project/feathrcli/data/feathr_user_workspace/feature_conf/features.conf']
2022-03-14 18:46:44.758 | INFO     | feathr._feature_registry:_read_config_from_workspace:430 - Reading feature join configuration from ['/Users/hnlin/IdeaProjects/feathr_py_api_doc/feathr_project/feathrcli/data/feathr_user_workspace/feature_join_conf/feature_join.conf']
2022-03-14 18:46:44.771 | INFO     | feathr._feature_registry:_read_config_from_workspace:440 - Reading feature generation configuration from ['/Users/hnlin/IdeaProjects/feathr_py_api_doc/feathr_project/feathrcli/data/feathr_user_workspace/feature_conf/features.conf']
2022-03-14 18:46:47.073 | INFO     | feathr._feature_registry:register_features:550 - Finished registering features. See https://web.purview.azure.com/resource/feathrazuretest3-purview1/main/catalog/br

In [6]:

returned_spark_job = client.get_offline_features()

2022-03-14 18:46:50.245 | INFO     | feathr._synapse_submission:upload_or_get_cloud_path:36 - Uploading /Users/hnlin/IdeaProjects/feathr_py_api_doc/feathr_project/feathrcli/data/feathr_user_workspace/feature_join_conf/feature_join.conf to cloud..
2022-03-14 18:46:50.812 | INFO     | feathr._synapse_submission:upload_file_to_workdir:295 - /Users/hnlin/IdeaProjects/feathr_py_api_doc/feathr_project/feathrcli/data/feathr_user_workspace/feature_join_conf/feature_join.conf is uploaded to location: abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/feathr_getting_started/feature_join.conf
2022-03-14 18:46:50.813 | INFO     | feathr._synapse_submission:upload_or_get_cloud_path:39 - /Users/hnlin/IdeaProjects/feathr_py_api_doc/feathr_project/feathrcli/data/feathr_user_workspace/feature_join_conf/feature_join.conf is uploaded to location: abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/feathr_getting_started/feature_join.conf
2022-03-14 18:46:50.813 | 

The result is also available in the cloud.

After getting all the features, let's train a model:

In [None]:
import glob
import os
import tempfile
from math import sqrt

import pandas as pd
import pandavro as pdx
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Wait for the Spark job to finish, then get the URI of its output.
res_url = client.get_job_result_uri(block=True, timeout_sec=500)

# Download the result files into a temporary local folder.
tmp_dir = tempfile.TemporaryDirectory()
client.feathr_spark_laucher.download_result(result_path=res_url, local_folder=tmp_dir.name)
dataframe_list = []
# Assuming the results are in Avro format
for file in glob.glob(os.path.join(tmp_dir.name, "*.avro")):
    dataframe_list.append(pdx.read_avro(file))
# Stack the per-file dataframes into a single dataframe.
vertical_concat_df = pd.concat(dataframe_list, axis=0)
tmp_dir.cleanup()
df_res = vertical_concat_df.copy()
df_res

In [None]:
# Keep only the label and the feature columns used for training.
# Copy the selection so the later assignments do not act on a view of df_res.
final_df = df_res[['fare_amount', 'passenger_count', "f_hour_of_day", "f_trip_distance", "f_is_long_trip_distance", "f_day_of_week", "f_day_of_month" ,"f_trip_time_duration", "f_location_avg_fare", "f_trip_time_distance"]].copy()
final_df = final_df.fillna(0)
final_df['fare_amount'] = final_df['fare_amount'].astype("float64")


train_x, test_x, train_y, test_y = train_test_split(final_df.drop(["fare_amount"], axis=1),
                                                    final_df["fare_amount"],
                                                    test_size=0.2,
                                                    random_state=42)

model = LinearRegression()
model.fit(train_x, train_y)

y_predict = model.predict(test_x)

y_actual = test_y.values.flatten().tolist()
rmse = sqrt(mean_squared_error(y_actual, y_predict))

# Accumulate the total absolute error and the total actual fare.
sum_actuals = sum_errors = 0

for actual_val, predict_val in zip(y_actual, y_predict):
    abs_error = abs(actual_val - predict_val)

    sum_errors = sum_errors + abs_error
    sum_actuals = sum_actuals + actual_val

# Total absolute error divided by total actual fare (a sum-weighted form of MAPE).
mean_abs_percent_error = sum_errors / sum_actuals
print("Model RMSE:")
print(rmse)
print()
print("Model MAPE:")
print(mean_abs_percent_error)
print()
print("Model Accuracy:")
print(1 - mean_abs_percent_error)
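
The cell above divides the total absolute error by the total actual fare. That is a sum-normalized variant of MAPE (often called WMAPE), and it can differ from the per-row average. The sketch below computes both, together with RMSE and MAE, on three made-up values so the arithmetic is easy to follow. `regression_report` is a hypothetical helper, not part of Feathr or scikit-learn.

```python
import numpy as np

def regression_report(y_actual, y_predict):
    """Return RMSE, MAE, WMAPE, and per-row MAPE for two equal-length sequences."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_predict = np.asarray(y_predict, dtype=float)
    abs_errors = np.abs(y_actual - y_predict)
    return {
        "rmse": float(np.sqrt(np.mean((y_actual - y_predict) ** 2))),
        "mae": float(np.mean(abs_errors)),
        # Sum of absolute errors over sum of actuals, as in the cell above.
        "wmape": float(abs_errors.sum() / y_actual.sum()),
        # Average of per-row relative errors; undefined if any actual is 0.
        "mape": float(np.mean(abs_errors / np.abs(y_actual))),
    }

# Made-up values for illustration only.
report = regression_report([10.0, 20.0, 40.0], [12.0, 18.0, 44.0])
print(report)
```

Here the absolute errors are 2, 2, and 4, so WMAPE is 8/70 (about 0.114) while the per-row MAPE is the mean of 0.2, 0.1, and 0.1 (about 0.133). Negative fares, like the -52.0 row in the sample data, shrink the denominator of the sum form and can distort it.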

We now want to push the generated features to the online store, so we configure the destination in the feature_gen config:

```python
operational: {
  name: generateWithDefaultParams
  endTime: 2022-01-02
  endTimeFormat: "yyyy-MM-dd"
  resolution: DAILY
  output: [{
    name: REDIS
    params: {
      table_name: "nycTaxiDemoFeature"
    }
  }]
}
features: [f_location_avg_fare, f_location_max_fare]
```

In [None]:
job_res = client.materialize_features()

res_url = client.wait_job_to_finish(timeout_sec=300)

We can then get the features from the online store (Redis):

In [None]:
client.get_online_features("nycTaxiDemoFeature", "265", ['f_location_avg_fare', 'f_location_max_fare'])


In [None]:
client.multi_get_online_features("nycTaxiDemoFeature", ["239", "265"], ['f_location_avg_fare', 'f_location_max_fare'])
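
Conceptually, the online store maps a key to the latest values of the requested features. The sketch below imitates that with pandas and a plain dictionary, using five rows from the sample output. It assumes the key is the drop-off location ID and that the two features are a plain average and maximum of `fare_amount` per location. Neither assumption is confirmed by this notebook, and the real feature definitions may aggregate differently, so the numbers will not match what Redis returns.

```python
import pandas as pd

# Drop-off location and fare for five rows of the sample output above.
trips = pd.DataFrame({
    "DOLocationID": [239, 265, 265, 265, 265],
    "fare_amount": [10.0, -52.0, 52.0, 180.0, 38.0],
})

# Assumed definitions: plain mean and max of the fare per location.
per_location = trips.groupby("DOLocationID")["fare_amount"].agg(
    f_location_avg_fare="mean",
    f_location_max_fare="max",
)

# A dictionary standing in for the online store: string key -> feature values.
online_table = {str(key): row.tolist() for key, row in per_location.iterrows()}

print(online_table["265"])
print([online_table[key] for key in ["239", "265"]])
```

The keys are strings, matching the `"239"` and `"265"` arguments passed to the client calls above.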

In [None]:
client.list_registered_features(project_name="frame_getting_started")