## Data Frame as Batch Source for Kfs

## Use case: Diabetes identification
### Overview
#### Predict whether a patient is likely to develop diabetes based on their information, such as glucose levels and blood pressure.

#### A user can provide their data as a DataFrame; both pandas and Spark DataFrames are supported.

In [1]:
# Import FeatureStore and initialize it for this project.

from katonic.fs.feature_store import FeatureStore

fs = FeatureStore(
    user_name="user",
    project_name="diatbetes_prediction",
    description="testing a dataframe source",
)

In [3]:
import pandas as pd
data = pd.read_csv("datasets/diabetes.csv")
data.head()

Unnamed: 0,patient_id,event_timestamp,pregnancies,glucose,blood_pressure,skin_thickness,insulin,bmi,diabetes_pedigree_function,age,outcome
0,100198,2021-10-17 01:00:00+00:00,6,148,72,35,0,33.6,0.627,50,1
1,100643,2021-10-17 02:00:00+00:00,1,85,66,29,0,26.6,0.351,31,0
2,100756,2021-10-17 03:00:00+00:00,8,183,64,0,0,23.3,0.672,32,1
3,101595,2021-10-17 04:00:00+00:00,1,89,66,23,94,28.1,0.167,21,0
4,101653,2021-10-17 05:00:00+00:00,0,137,40,35,168,43.1,2.288,33,1


In [4]:
# Import the remaining classes needed to define the entity, batch source, and feature view.

from katonic.fs.entities import Entity, FeatureView
from katonic.fs.value_type import ValueType
from katonic.fs.core.offline_stores import DataFrameSource

# Entity
entity = Entity(name="patient_id", value_type=ValueType.STRING)

In [7]:
batch_source = DataFrameSource(
        df=data, # Provide your dataframe 
        event_timestamp_column="event_timestamp", # Event Timestamp 
    )

In [8]:
cols = ["age", "bmi","glucose", "blood_pressure", "insulin", "diabetes_pedigree_function"]

In [9]:
# Feature View

large_data_stats_view  = FeatureView(
    name="diabetes_prediction", # Feature View name
    entities=["patient_id"], # Entity Key
    ttl='2d', # Time to live for the feature view, e.g. in hours, days, or months.
    features=cols, # Columns you want in Feature Store
    batch_source=batch_source,
)

In [10]:
%%time
fs.write_table([entity, large_data_stats_view])

Registered entity patient_id
Registered feature view diabetes_prediction
Deploying infrastructure for diabetes_prediction
CPU times: user 10.2 ms, sys: 0 ns, total: 10.2 ms
Wall time: 56.6 ms


In [12]:
import pandas as pd

entity_df = pd.read_csv("datasets/diabetes_entity_df.csv") # Read the entity dataframe.
entity_df["event_timestamp"] = pd.to_datetime(entity_df["event_timestamp"]) # Parse the timestamp column so it has a datetime dtype.
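
For readers who do not have the CSV at hand, the shape of an entity dataframe can be sketched directly in pandas. This is only an illustration: the column names (`patient_id`, `event_timestamp`, `Outcome`) and the three rows are copied from the `training_df` output in this notebook, and the variable name `entity_df_example` is hypothetical. It assumes the entity dataframe carries the entity key, an event timestamp, and the label.

```python
import pandas as pd

# Illustrative rows only: values copied from the training_df output in this notebook.
entity_df_example = pd.DataFrame(
    {
        "patient_id": [258594, 437014, 100198],
        "event_timestamp": [
            "2021-04-12 07:00:00+00:00",
            "2021-04-12 07:00:00+00:00",
            "2021-10-17 01:00:00+00:00",
        ],
        "Outcome": [1, 1, 1],
    }
)

# Timestamps start out as plain strings (as they would after read_csv);
# convert them to timezone-aware datetimes.
entity_df_example["event_timestamp"] = pd.to_datetime(
    entity_df_example["event_timestamp"], utc=True
)

print(entity_df_example.dtypes)
```

Passing `utc=True` is a choice made for this sketch so that the resulting column is timezone-aware, matching the `+00:00` offsets shown in the data.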

In [13]:
%%time
training_df = fs.get_historical_features(
    entity_df = entity_df, # Entity Data Frame.
    feature_view = ["diabetes_prediction"], # Feature view name
    features = cols # Columns that we want to retrieve
).to_df()

CPU times: user 28.2 ms, sys: 320 µs, total: 28.6 ms
Wall time: 36.5 ms


In [14]:
training_df.head()

Unnamed: 0,event_timestamp,patient_id,Outcome,age,bmi,glucose,blood_pressure,insulin,diabetes_pedigree_function
0,2021-04-12 07:00:00+00:00,258594,1,59,23.5,194,78,0,0.129
1,2021-04-12 07:00:00+00:00,437014,1,40,33.7,115,60,0,0.245
2,2021-10-17 01:00:00+00:00,437078,0,24,34.8,112,80,132,0.217
3,2021-10-17 01:00:00+00:00,258935,0,31,27.5,129,60,231,0.527
4,2021-10-17 01:00:00+00:00,100198,1,50,33.6,148,72,0,0.627


Once we have retrieved the complete training dataset, we can:

- Drop timestamp columns and the `patient_id` column.
- Encode categorical features (if any).
- Split the training dataframe into a train, validation, and test set.
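
The three steps above can be sketched as follows. This is a minimal illustration, not the notebook's own code: `demo_df` is a hypothetical stand-in for `training_df` filled with made-up values, and the 60/20/20 split ratio and the names `X_fit`, `X_val`, and `X_test` are assumptions chosen for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for training_df; all values are made up for illustration.
demo_df = pd.DataFrame(
    {
        "event_timestamp": pd.date_range("2021-10-01", periods=20, freq="D", tz="UTC"),
        "patient_id": list(range(100000, 100020)),
        "Outcome": [0, 1] * 10,
        "age": list(range(21, 41)),
        "bmi": [25.0 + 0.5 * i for i in range(20)],
        "glucose": list(range(80, 180, 5)),
    }
)

# 1. Drop the timestamp, the entity key, and the label from the feature matrix.
features = demo_df.drop(columns=["event_timestamp", "patient_id", "Outcome"])
target = demo_df["Outcome"]

# 2. One-hot encode categorical columns; numeric columns pass through unchanged.
features = pd.get_dummies(features)

# 3. Split 60/20/20: first hold out 40%, then halve the held-out part.
X_fit, X_rest, y_fit, y_rest = train_test_split(
    features, target, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

print(len(X_fit), len(X_val), len(X_test))
```

The cells that follow keep things simpler and train on the full `training_df` without a validation or test split.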

In [15]:
# Import the model class and the helpers for saving and loading it.
from joblib import load, dump
from sklearn.tree import DecisionTreeClassifier

In [16]:
X_train = training_df.drop(["event_timestamp","patient_id","Outcome"], axis=1)
y_train = training_df["Outcome"]

In [17]:
X_train.head()

Unnamed: 0,age,bmi,glucose,blood_pressure,insulin,diabetes_pedigree_function
0,59,23.5,194,78,0,0.129
1,40,33.7,115,60,0,0.245
2,24,34.8,112,80,132,0.217
3,31,27.5,129,60,231,0.527
4,50,33.6,148,72,0,0.627


In [18]:
# Fit a decision tree on the training data and save it to disk.

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
dump(tree, "diabetes_model.bin")

['diabetes_model.bin']

In [19]:
from datetime import datetime

fs.publish_table(
    start_ts=datetime(2021, 10, 1), # Give a start date
    end_ts=datetime(2021, 11, 1) # End date.
)

Materializing 1 feature views from 2021-10-01 00:00:00+00:00 to 2021-11-01 00:00:00+00:00 into the redis online store.



In [20]:
# Retrieve the online features using the entity keys.

patient_ids = [103738, 137959, 170333, 235547]

test = fs.get_online_features(
    entity_rows=[{"patient_id": patient_id} for patient_id in patient_ids], # Entity keys
    feature_view=['diabetes_prediction'], # Feature View name
    features=cols,
).to_df()

In [21]:
# Test Dataframe.
test.head()

Unnamed: 0,age,bmi,glucose,blood_pressure,insulin,diabetes_pedigree_function,patient_id
0,53.0,30.5,197.0,70.0,543.0,0.158,103738
1,26.0,38.5,100.0,68.0,71.0,0.324,137959
2,22.0,32.5,108.0,52.0,63.0,0.318,170333
3,63.0,32.4,142.0,80.0,0.0,0.2,235547


In [22]:
# Load the model and predict the outcome for the test data.
model = load("diabetes_model.bin")

model.predict(test.drop("patient_id", axis=1))

array([1, 0, 0, 0])
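
A bare array is hard to read back against patients, so the predictions can be paired with their entity keys. The sketch below copies the patient IDs and the predicted values from the cells above rather than calling the model; the column name `predicted_outcome` and the variable `results` are hypothetical.

```python
import pandas as pd

# Entity keys used for the online lookup in this notebook.
patient_ids = [103738, 137959, 170333, 235547]

# Values copied from the model.predict output shown above.
predictions = [1, 0, 0, 0]

# Pair each prediction with the patient it belongs to.
results = pd.DataFrame(
    {"patient_id": patient_ids, "predicted_outcome": predictions}
)

print(results)
```

In a live notebook, `predictions` would be the return value of `model.predict(...)`, and the pairing relies on the rows of `test` being in the same order as `patient_ids`.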