# Distributed Profiling of Model Features with Whylogs & Fugue

It is common practice in the Machine Learning world to log incoming model inference requests and outgoing predictions. These logs are later processed and aggregated for monitoring and drift detection. However, consuming this raw data presents several pain points:
+ Machine Learning models vary widely in the number and nature of features and predictions. Some have 10 features and emit probability scores while others may have 30 features and emit a ranking. 
+ They also differ significantly in the type of features, with some having more categorical features and others having more numerical features.

We therefore need a uniform way of processing these logs, because maintaining separate monitoring logic for each model does not scale.

In this tutorial we show how to use [Whylogs](https://whylabs.ai/whylogs) to profile the features and predictions and extract only the essential metrics from these profiles, regardless of the scale of the data.

The purposes of profiling are:
+ To normalize and compress metric data while retaining as much information as possible.
+ To unify data from totally different models so that the following step can process them with the same pipeline.
+ To let the subsequent steps handle only purely numerical time series.
+ To significantly reduce the scale of the problem, making the compute more efficient and cost-effective.
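
To make the unification goal concrete, here is a minimal sketch in plain Python. It is not whylogs: the model names, feature names, and the `summarize` helper are all hypothetical, and the four metrics are a tiny stand-in for a real profile. It only illustrates how logs from models with different feature sets can be reduced to the same long-format shape of purely numerical rows.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical raw logs: two models with completely different feature sets.
LOGS: Dict[str, List[Dict[str, float]]] = {
    "ranker": [{"clicks": 3.0, "dwell": 12.5}, {"clicks": 5.0, "dwell": 7.5}],
    "scorer": [{"age": 31.0}, {"age": 45.0}, {"age": 38.0}],
}

Metric = Tuple[str, str, str, float]  # (model, column, metric name, value)


def summarize(model: str, rows: List[Dict[str, float]]) -> List[Metric]:
    """Reduce raw rows to uniform (model, column, metric, value) tuples."""
    out: List[Metric] = []
    for column in sorted(rows[0]):
        values = [row[column] for row in rows]
        out.append((model, column, "count", float(len(values))))
        out.append((model, column, "mean", mean(values)))
        out.append((model, column, "min", min(values)))
        out.append((model, column, "max", max(values)))
    return out


# Every model ends up in the same schema, whatever its original features were.
metrics = [m for model, rows in LOGS.items() for m in summarize(model, rows)]
for row in metrics:
    print(row)
```

Downstream monitoring code then only needs to understand this one schema, whichever model produced the data.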

We also use the open-source framework [Fugue](https://fugue-tutorials.readthedocs.io/index.html), whose abstraction layer unifies computing logic across Pandas, Spark, Ray, and Dask. One of Fugue's most popular features is the ability to distribute logic across many partitions of a larger dataframe with a simple Python function call. Users provide functions with type-annotated inputs and outputs, and Fugue converts the data according to those annotations. The custom logic therefore stays independent of Fugue and of the execution engine, so it can be written and tested without prior knowledge of either.
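
The idea of engine-independent, type-annotated logic can be sketched without any framework. In the example below, `count_rows` is ordinary Python that only depends on its annotations, and `run_partitioned` is a hypothetical local stand-in for a partitioned transform. It is not Fugue code and makes no claim about how Fugue works internally; the row layout is also an assumption made for illustration.

```python
from itertools import groupby
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]


def count_rows(rows: Rows) -> Rows:
    """Per-partition logic: it knows nothing about the engine that calls it."""
    first = rows[0]
    return [{"ds": first["ds"], "hour": first["hour"], "sample_records": len(rows)}]


def run_partitioned(rows: Rows, by: List[str], fn: Callable[[Rows], Rows]) -> Rows:
    """Hypothetical helper: group rows by the key columns and apply fn to each group."""

    def key(row: Dict[str, Any]) -> tuple:
        return tuple(row[k] for k in by)

    out: Rows = []
    # groupby only groups adjacent items, so sort by the same key first.
    for _, group in groupby(sorted(rows, key=key), key=key):
        out.extend(fn(list(group)))
    return out


logs = [
    {"ds": "2023-02-10", "hour": 5, "value": 1.0},
    {"ds": "2023-02-10", "hour": 6, "value": 2.0},
    {"ds": "2023-02-10", "hour": 5, "value": 3.0},
]
result = run_partitioned(logs, by=["ds", "hour"], fn=count_rows)
print(result)
```

Because `count_rows` has no dependency on the runner, it can be unit-tested on a handful of rows before it is handed to a distributed engine.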

![](images/scale-up-ad.png)

In [1]:
import seaborn as sns
from matplotlib import pyplot as plt

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# this allows plots to appear directly in the notebook
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
import pandas as pd

pd.set_option('display.max_columns', 50)
pd.set_option('display.max_colwidth', 100)

In [4]:
demo_df = pd.read_parquet('addemo23/demo_raw_data.parquet')

In [None]:
#demo_df = demo_df.sample(frac=0.1)

## Load Model Feature and Prediction Logs

### Extract Features and Predictions from model logs

In [None]:
import json
import pandas as pd

def extract_features(df: pd.DataFrame) -> pd.DataFrame:
    json_str = "[" + (",".join(df.features)) + "]"
    feature_df = pd.DataFrame(json.loads(json_str))
    #feature_df = feature_df.reset_index(drop=True)
    return feature_df[sorted(feature_df.columns)]

In [None]:
feature_df = extract_features(demo_df)

In [None]:
feature_df.head(5)

In [None]:
demo_df.head(5)

In [None]:
feature_df.shape, demo_df.shape

In [None]:
pd.merge(demo_df[['occurred_at', 'model_name', 'version', 'predictions']], feature_df, left_index=True, right_index=True)

In [None]:
import json
import pandas as pd

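def extract_features(model_logs_df: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy so the caller's dataframe is not modified in place.
    model_logs_df = model_logs_df.copy()
    # Join the per-row JSON strings into one array so json.loads runs only once.
    json_str = "[" + (",".join(model_logs_df.features)) + "]"
    # Reuse the log index so the index-based merge below stays aligned after filtering or sampling.
    feature_df = pd.DataFrame(json.loads(json_str), index=model_logs_df.index)
    feature_df = feature_df[sorted(feature_df.columns)]
    # Truncate timestamps to whole seconds, then derive the date and hour partition keys.
    model_logs_df['occurred_at'] = model_logs_df['occurred_at'].apply(lambda x: x.replace(microsecond=0))
    model_logs_df['ds'] = model_logs_df['occurred_at'].apply(lambda x: x.strftime("%Y-%m-%d"))
    model_logs_df['hour'] = model_logs_df['occurred_at'].apply(lambda x: x.hour)
    return pd.merge(model_logs_df[['occurred_at', 'ds', 'hour', 'model_name', 'version', 'predictions']], feature_df, left_index=True, right_index=True)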
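
The function above combines two small techniques: parsing all JSON feature strings with a single `json.loads` call, and deriving the `ds` and `hour` keys from the timestamp. The following standard-library sketch isolates both steps on two made-up log rows. The sample values are assumptions, and plain dictionaries stand in for dataframe rows.

```python
import json
from datetime import datetime

# Hypothetical log rows: `features` holds one JSON object per request.
rows = [
    {
        "occurred_at": datetime(2023, 2, 10, 5, 35, 5, 123456),
        "features": '{"feature_2": 3.25, "feature_1": 0.0}',
    },
    {
        "occurred_at": datetime(2023, 2, 10, 6, 1, 0),
        "features": '{"feature_1": 1.0, "feature_2": 7.5}',
    },
]

# Wrap the comma-joined objects in brackets and parse the whole batch at once.
parsed = json.loads("[" + ",".join(row["features"] for row in rows) + "]")

records = []
for row, feats in zip(rows, parsed):
    # Drop sub-second precision, then derive the date and hour partition keys.
    ts = row["occurred_at"].replace(microsecond=0)
    record = {"occurred_at": ts, "ds": ts.strftime("%Y-%m-%d"), "hour": ts.hour}
    # Sort feature names so every record has the same column order.
    record.update({name: feats[name] for name in sorted(feats)})
    records.append(record)

for record in records:
    print(record)
```

The batch parse only works when every entry is a valid JSON document, so one malformed log line fails the whole batch. For dirty data, a per-row parse with error handling is the safer choice.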

In [None]:
features_df = extract_features(demo_df)

In [None]:
features_df.head(5)

In [None]:
features_df.tail(5)

In [None]:
features_df.dtypes

In [None]:
#features_df.ds.unique()

In [None]:
len(features_df.ds.unique())

In [None]:
features_df.hour.unique()

In [None]:
features_df[(features_df['ds'] == '2023-02-10') & (features_df['hour'] == 5)]

### Generate Whylogs Profiles

In [None]:
import json
import numpy as np

import whylogs as why
from whylogs import DatasetProfileView

In [None]:
feb_test_df = features_df[(features_df['ds'] == '2023-02-10') & (features_df['hour'] == 5)]

In [None]:
feb_test_df.head(5)

In [32]:
feb_whylogs_prof = why.log(feb_test_df[['feature_5', 'feature_6']]).view()

In [40]:
mar_test_df = features_df[(features_df['ds'] == '2023-03-10') & (features_df['hour'] == 5)]

In [41]:
mar_test_df.head(5)

index,occurred_at,ds,hour,model_name,version,predictions,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6
1345041,2023-03-10 05:35:05,2023-03-10,5,demo_model,1.0.1,21.38233,0.0,3.25,-7.376024,-355.624186,33.0,-4.216038
1345042,2023-03-10 05:49:41,2023-03-10,5,demo_model,1.0.1,3.573456,1.0,7.159091,-16.573775,-222.700659,55.0,1.512827
1345043,2023-03-10 05:24:38,2023-03-10,5,demo_model,1.0.1,29.099798,0.0,3.157895,-7.376024,-355.871126,44.0,-4.487054
1345044,2023-03-10 05:38:00,2023-03-10,5,demo_model,1.0.1,13.405657,1.0,11.35,-16.573775,-222.698992,33.0,1.146094
1345045,2023-03-10 05:57:26,2023-03-10,5,demo_model,1.0.1,17.109037,1.0,1.785714,-7.376024,-356.029852,110.0,-3.64552


In [42]:
mar_whylogs_prof = why.log(mar_test_df[['feature_5', 'feature_6']]).view()

In [43]:
feb_whylogs_prof.to_pandas()

column,cardinality/est,cardinality/lower_1,cardinality/upper_1,counts/inf,counts/n,counts/nan,counts/null,distribution/max,distribution/mean,distribution/median,distribution/min,distribution/n,distribution/q_01,distribution/q_05,distribution/q_10,distribution/q_25,distribution/q_75,distribution/q_90,distribution/q_95,distribution/q_99,distribution/stddev,type,types/boolean,types/fractional,types/integral,types/object,types/string,types/tensor
feature_5,425.569275,420.134311,431.139744,0,1000,0,0,38148.0,2184.743,715.0,0.0,1000,11.0,22.0,33.0,55.0,3190.0,6248.0,9141.0,14388.0,3310.736242,SummaryType.COLUMN,0,1000,0,0,0,0
feature_6,233.000134,233.0,233.011768,0,1000,0,0,1.785209,-1.798066,-1.543214,-6.845155,1000,-5.256787,-4.975246,-4.793975,-4.1657,0.373155,1.207462,1.48241,1.724903,2.350622,SummaryType.COLUMN,0,1000,0,0,0,0


### Visualize Whylogs Profiles

In [44]:
from whylogs.viz import NotebookProfileVisualizer

from whylogs.viz.utils.histogram_calculations import histogram_from_view
from whylogs.viz.utils.frequent_items_calculations import frequent_items_from_view

In [45]:
visualization = NotebookProfileVisualizer()
visualization.set_profiles(target_profile_view=feb_whylogs_prof, reference_profile_view=mar_whylogs_prof)

In [46]:
visualization.double_histogram(feature_name="feature_6")

### Serialize Whylogs Profiles

In [47]:
feb_whylogs_prof.serialize()[0:100]

b'WHY1\x00\xc2\x02\n\x0e \xe6\xe0\xe4\xd1\xfb0(\xe6\xe0\xe4\xd1\xfb0\x12\x10\n\tfeature_5\x12\x03\n\x01\x00\x12\x11\n\tfeature_6\x12\x04\n\x02\xa6P \xd4\x97\x01*\x0e\x08\x01\x12\ncounts/inf*\x12\x08\n\x12\x0etypes/integral'

### Generate Hourly Profiles using Fugue

In [48]:
import json
import pandas as pd

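def profile_features(features_df: pd.DataFrame) -> pd.DataFrame:
    # Profile all feature columns of this partition together and serialize the view to bytes.
    features_buf = why.log(features_df[['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6']]).view().serialize()
    # Profile the predictions separately so they can be monitored on their own.
    predictions_buf = why.log(features_df[['predictions']]).view().serialize()
    # Keep a single row per partition; its raw feature values are just one sample record.
    profiled_features = features_df.head(1).copy()
    profiled_features = profiled_features.drop(['occurred_at'], axis=1)
    # Attach both serialized profiles and the number of records they summarize.
    profiled_features = profiled_features.assign(features_profile=features_buf, predictions_profile = predictions_buf, sample_records=len(features_df))
    return profiled_features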

In [49]:
feb_test_df.shape

(1000, 12)

In [50]:
profile_features(feb_test_df[(feb_test_df['ds'] == '2023-02-10') & (feb_test_df['hour'] == 5)])

index,ds,hour,model_name,version,predictions,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,features_profile,predictions_profile,sample_records
0,2023-02-10,5,demo_model,1.0.1,59.753181,0.0,1.904762,-16.573775,-231.864749,10087.0,0.528394,b'WHY1\x00\x92\x03\n\x0e \xda\xde\xe9\xd1\xfb0(\xda\xde\xe9\xd1\xfb0\x12\x11\n\tfeature_2\x12\x0...,b'WHY1\x00\xb0\x02\n\x0e \xcd\xdf\xe9\xd1\xfb0(\xcd\xdf\xe9\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,1000


In [51]:
from fugue import transform

hourly_feature_profile_df = transform(
    df=features_df, 
    using=profile_features, 
    schema="*-occurred_at+features_profile:binary,predictions_profile:binary,sample_records:long",
    partition=dict(by=['ds', 'hour', 'model_name', 'version']), 
    engine=None
)

In [52]:
hourly_feature_profile_df

index,ds,hour,model_name,version,predictions,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,features_profile,predictions_profile,sample_records
0,2023-01-01,0,demo_model,1.0.1,28.874601,0.0,5.000000e+08,-13.291135,-303.546123,957.0,-6.954191,b'WHY1\x00\x92\x03\n\x0e \xce\xb4\xea\xd1\xfb0(\xce\xb4\xea\xd1\xfb0\x12\x12\n\tfeature_6\x12\x0...,b'WHY1\x00\xb0\x02\n\x0e \xca\xb5\xea\xd1\xfb0(\xca\xb5\xea\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,1000
1,2023-01-01,1,demo_model,1.0.1,34.759167,1.0,2.619048e+00,-13.291135,-226.094488,1111.0,-5.675494,b'WHY1\x00\x92\x03\n\x0e \xf1\xb5\xea\xd1\xfb0(\xf1\xb5\xea\xd1\xfb0\x12\x12\n\tfeature_6\x12\x0...,b'WHY1\x00\xb0\x02\n\x0e \xd8\xb6\xea\xd1\xfb0(\xd8\xb6\xea\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,1000
2,2023-01-01,2,demo_model,1.0.1,31.434237,0.0,5.000000e+08,-13.291135,-257.492325,1969.0,-3.698853,b'WHY1\x00\x92\x03\n\x0e \xf9\xb6\xea\xd1\xfb0(\xf9\xb6\xea\xd1\xfb0\x12\x10\n\tfeature_1\x12\x0...,b'WHY1\x00\xb0\x02\n\x0e \xe0\xb7\xea\xd1\xfb0(\xe0\xb7\xea\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,1000
3,2023-01-01,3,demo_model,1.0.1,26.973177,0.0,0.000000e+00,-13.291135,-260.120910,990.0,-3.830946,b'WHY1\x00\x92\x03\n\x0e \x82\xb8\xea\xd1\xfb0(\x82\xb8\xea\xd1\xfb0\x12\x11\n\tfeature_2\x12\x0...,b'WHY1\x00\xb0\x02\n\x0e \xe9\xb8\xea\xd1\xfb0(\xe9\xb8\xea\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,1000
4,2023-01-01,4,demo_model,1.0.1,18.229908,0.0,2.590361e+00,-13.291135,-368.447875,55.0,-6.030039,b'WHY1\x00\x92\x03\n\x0e \x8a\xb9\xea\xd1\xfb0(\x8a\xb9\xea\xd1\xfb0\x12\x11\n\tfeature_2\x12\x0...,b'WHY1\x00\xb0\x02\n\x0e \xf1\xb9\xea\xd1\xfb0(\xf1\xb9\xea\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,1000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2091,2023-03-29,3,demo_model,1.0.1,16.853651,0.0,1.739130e+00,16.573775,-226.308186,22.0,-0.404220,b'WHY1\x00\x8d\x03\n\x0e \xd4\xc5\xfa\xd1\xfb0(\xd4\xc5\xfa\xd1\xfb0\x12\x11\n\tfeature_2\x12\x0...,b'WHY1\x00\xb0\x02\n\x0e \x8e\xc6\xfa\xd1\xfb0(\x8e\xc6\xfa\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,188
2092,2023-03-29,4,demo_model,1.0.1,30.805862,1.0,4.736842e+00,16.573775,-355.970607,44.0,-4.701136,b'WHY1\x00\x8d\x03\n\x0e \xa8\xc6\xfa\xd1\xfb0(\xa8\xc6\xfa\xd1\xfb0\x12\x11\n\tfeature_6\x12\x0...,b'WHY1\x00\xb0\x02\n\x0e \xde\xc6\xfa\xd1\xfb0(\xde\xc6\xfa\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,125
2093,2023-03-29,5,demo_model,1.0.1,9.924586,1.0,3.730159e+00,16.573775,-356.026755,22.0,-2.173682,b'WHY1\x00\x8d\x03\n\x0e \xf7\xc6\xfa\xd1\xfb0(\xf7\xc6\xfa\xd1\xfb0\x12\x11\n\tfeature_2\x12\x0...,b'WHY1\x00\xb0\x02\n\x0e \xaa\xc7\xfa\xd1\xfb0(\xaa\xc7\xfa\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,67
2094,2023-03-29,6,demo_model,1.0.1,30.811672,0.0,2.500000e+00,7.376024,-264.412372,33.0,3.319832,b'WHY1\x00\x8d\x03\n\x0e \xc3\xc7\xfa\xd1\xfb0(\xc3\xc7\xfa\xd1\xfb0\x12\x11\n\tfeature_2\x12\x0...,b'WHY1\x00\xb0\x02\n\x0e \xf6\xc7\xfa\xd1\xfb0(\xf6\xc7\xfa\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,74


### Merge Whylogs Profiles

In [53]:
type(feb_whylogs_prof)

whylogs.core.view.dataset_profile_view.DatasetProfileView

In [54]:
feb_whylogs_prof.to_pandas()

column,cardinality/est,cardinality/lower_1,cardinality/upper_1,counts/inf,counts/n,counts/nan,counts/null,distribution/max,distribution/mean,distribution/median,distribution/min,distribution/n,distribution/q_01,distribution/q_05,distribution/q_10,distribution/q_25,distribution/q_75,distribution/q_90,distribution/q_95,distribution/q_99,distribution/stddev,type,types/boolean,types/fractional,types/integral,types/object,types/string,types/tensor
feature_5,425.569275,420.134311,431.139744,0,1000,0,0,38148.0,2184.743,715.0,0.0,1000,11.0,22.0,33.0,55.0,3190.0,6248.0,9141.0,14388.0,3310.736242,SummaryType.COLUMN,0,1000,0,0,0,0
feature_6,233.000134,233.0,233.011768,0,1000,0,0,1.785209,-1.798066,-1.543214,-6.845155,1000,-5.256787,-4.975246,-4.793975,-4.1657,0.373155,1.207462,1.48241,1.724903,2.350622,SummaryType.COLUMN,0,1000,0,0,0,0


In [55]:
mar_whylogs_prof.to_pandas()

column,cardinality/est,cardinality/lower_1,cardinality/upper_1,counts/inf,counts/n,counts/nan,counts/null,distribution/max,distribution/mean,distribution/median,distribution/min,distribution/n,distribution/q_01,distribution/q_05,distribution/q_10,distribution/q_25,distribution/q_75,distribution/q_90,distribution/q_95,distribution/q_99,distribution/stddev,type,types/boolean,types/fractional,types/integral,types/object,types/string,types/tensor
feature_5,13.0,13.0,13.000649,0,267,0,0,154.0,39.426966,33.0,11.0,267,11.0,22.0,22.0,22.0,44.0,55.0,110.0,132.0,24.119169,SummaryType.COLUMN,0,267,0,0,0,0
feature_6,134.000044,134.0,134.006735,0,267,0,0,1.815312,-1.352451,-0.373155,-5.019625,267,-5.019625,-4.793975,-4.630565,-3.987045,0.838042,1.48241,1.573573,1.785209,2.434174,SummaryType.COLUMN,0,267,0,0,0,0


In [56]:
merged_prof_view = feb_whylogs_prof.merge(mar_whylogs_prof)
merged_prof_view.to_pandas()

column,cardinality/est,cardinality/lower_1,cardinality/upper_1,counts/inf,counts/n,counts/nan,counts/null,distribution/max,distribution/mean,distribution/median,distribution/min,distribution/n,distribution/q_01,distribution/q_05,distribution/q_10,distribution/q_25,distribution/q_75,distribution/q_90,distribution/q_95,distribution/q_99,distribution/stddev,type,types/boolean,types/fractional,types/integral,types/object,types/string,types/tensor
feature_5,425.569275,420.134311,431.139744,0,1267,0,0,38148.0,1732.651934,220.0,0.0,1267,11.0,22.0,22.0,44.0,2310.0,5500.0,7546.0,13640.0,3068.471691,SummaryType.COLUMN,0,1267,0,0,0,0
feature_6,235.000137,235.0,235.01187,0,1267,0,0,1.815312,-1.70416,-1.390994,-6.845155,1267,-5.214552,-4.952914,-4.770901,-4.140412,0.528394,1.268737,1.512827,1.755073,2.37447,SummaryType.COLUMN,0,1267,0,0,0,0


In [57]:
merge_test_df = features_df[((features_df['ds'] == '2023-02-10') | (features_df['ds'] == '2023-03-10')) & (features_df['hour'] == 5)]

In [58]:
merge_test_df['ds'].unique()

array(['2023-02-10', '2023-03-10'], dtype=object)

In [59]:
merged_whylogs_prof = why.log(merge_test_df[['feature_5', 'feature_6']]).view()

In [60]:
merged_whylogs_prof.to_pandas()

column,cardinality/est,cardinality/lower_1,cardinality/upper_1,counts/inf,counts/n,counts/nan,counts/null,distribution/max,distribution/mean,distribution/median,distribution/min,distribution/n,distribution/q_01,distribution/q_05,distribution/q_10,distribution/q_25,distribution/q_75,distribution/q_90,distribution/q_95,distribution/q_99,distribution/stddev,type,types/boolean,types/fractional,types/integral,types/object,types/string,types/tensor
feature_5,425.569275,420.134311,431.139744,0,1267,0,0,38148.0,1732.651934,220.0,0.0,1267,11.0,22.0,22.0,44.0,2365.0,5577.0,7590.0,14234.0,3068.471691,SummaryType.COLUMN,0,1267,0,0,0,0
feature_6,235.000137,235.0,235.01187,0,1267,0,0,1.815312,-1.70416,-1.390994,-6.845155,1267,-5.235719,-4.952914,-4.770901,-4.140412,0.528394,1.268737,1.512827,1.724903,2.37447,SummaryType.COLUMN,0,1267,0,0,0,0
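
Comparing the two tables above, the merged view and the profile computed directly on the combined data agree on `counts/n`, `distribution/mean` and `distribution/stddev`, while a few quantiles differ slightly (for example `distribution/q_75` of `feature_5` is 2310.0 versus 2365.0). That pattern is what one would expect if counts and moments are merged exactly and quantiles are approximated. The sketch below shows, in plain Python, how count, mean and variance can be merged exactly from partial summaries using the standard pairwise update formula. The `Moments` class is hypothetical and is not the whylogs implementation.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable


@dataclass
class Moments:
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    @classmethod
    def of(cls, values: Iterable[float]) -> "Moments":
        """Build a summary by merging one single-value summary per element."""
        acc = cls()
        for value in values:
            acc = acc.merge(cls(1, float(value), 0.0))
        return acc

    def merge(self, other: "Moments") -> "Moments":
        """Combine two partial summaries without revisiting the raw data."""
        n = self.n + other.n
        if n == 0:
            return Moments()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Moments(n, mean, m2)

    @property
    def stddev(self) -> float:
        """Sample standard deviation (n - 1 in the denominator)."""
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0


# Made-up values standing in for two hourly partitions.
feb = Moments.of([33.0, 55.0, 44.0])
mar = Moments.of([110.0, 22.0])

merged = reduce(lambda acc, x: acc.merge(x), [feb, mar])
direct = Moments.of([33.0, 55.0, 44.0, 110.0, 22.0])
print(merged.n, merged.mean, merged.stddev)
print(direct.n, direct.mean, direct.stddev)
```

Exact quantiles cannot be merged this way from a fixed-size summary, which is why profiling libraries typically rely on approximate sketches for them and why small discrepancies in the quantile columns are not a cause for concern.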


### Generate Daily Profiles

In [61]:
from functools import reduce

def profile_reduce(hourly_profiles_df: pd.DataFrame) -> pd.DataFrame:
    # Deserialize each hourly feature profile and fold them into a single view with merge().
    features_buf = reduce(
        lambda acc, x: acc.merge(x),
        hourly_profiles_df.features_profile.apply(DatasetProfileView.deserialize),
    ).serialize()
    # Apply the same fold to the prediction profiles.
    predictions_buf = reduce(
        lambda acc, x: acc.merge(x),
        hourly_profiles_df.predictions_profile.apply(DatasetProfileView.deserialize),
    ).serialize()
    # Total number of raw records covered by the merged profiles.
    records = hourly_profiles_df.sample_records.sum()
    # Emit one row per partition: drop the hour key and attach the merged profiles.
    daily_profiles_df = hourly_profiles_df.head(1).copy()
    daily_profiles_df = daily_profiles_df.drop(['hour'], axis=1)
    daily_profiles_df = daily_profiles_df.assign(features_profile=features_buf, predictions_profile = predictions_buf, sample_records=records)
    return daily_profiles_df

In [63]:
profile_reduce(hourly_feature_profile_df[hourly_feature_profile_df['ds'] == '2023-01-01'])

index,ds,model_name,version,predictions,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,features_profile,predictions_profile,sample_records
0,2023-01-01,demo_model,1.0.1,28.874601,0.0,500000000.0,-13.291135,-303.546123,957.0,-6.954191,b'WHY1\x00\x93\x03\n\x0e \xce\xb4\xea\xd1\xfb0(\xce\xb4\xea\xd1\xfb0\x12\x12\n\tfeature_4\x12\x0...,b'WHY1\x00\xb1\x02\n\x0e \xca\xb5\xea\xd1\xfb0(\xca\xb5\xea\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,24001


In [64]:
from fugue import transform

daily_feature_profile_df = transform(
    df=hourly_feature_profile_df, 
    using=profile_reduce, 
    schema="*-hour",
    partition=dict(by=['ds', 'model_name', 'version']), 
    engine=None
)

In [65]:
daily_feature_profile_df

index,ds,model_name,version,predictions,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,features_profile,predictions_profile,sample_records
0,2023-01-01,demo_model,1.0.1,28.874601,0.0,5.000000e+08,-13.291135,-303.546123,957.0,-6.954191,b'WHY1\x00\x93\x03\n\x0e \xce\xb4\xea\xd1\xfb0(\xce\xb4\xea\xd1\xfb0\x12\x12\n\tfeature_6\x12\x0...,b'WHY1\x00\xb1\x02\n\x0e \xca\xb5\xea\xd1\xfb0(\xca\xb5\xea\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,24001
1,2023-01-02,demo_model,1.0.1,17.241968,1.0,3.750000e+00,-0.000000,-226.315937,44.0,-6.295416,b'WHY1\x00\x93\x03\n\x0e \xd6\xcf\xea\xd1\xfb0(\xd6\xcf\xea\xd1\xfb0\x12\x12\n\tfeature_2\x12\x0...,b'WHY1\x00\xb1\x02\n\x0e \xce\xd0\xea\xd1\xfb0(\xce\xd0\xea\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,23999
2,2023-01-03,demo_model,1.0.1,5.838543,1.0,5.000000e+08,13.291135,-222.223722,33.0,-6.190258,b'WHY1\x00\x93\x03\n\x0e \xa3\xe9\xea\xd1\xfb0(\xa3\xe9\xea\xd1\xfb0\x12\x10\n\tfeature_1\x12\x0...,b'WHY1\x00\xb1\x02\n\x0e \x89\xea\xea\xd1\xfb0(\x89\xea\xea\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,24000
3,2023-01-04,demo_model,1.0.1,15.285272,1.0,2.631579e+00,16.573775,-355.747054,363.0,-6.324387,b'WHY1\x00\x93\x03\n\x0e \x95\x85\xeb\xd1\xfb0(\x95\x85\xeb\xd1\xfb0\x12\x10\n\tfeature_1\x12\x0...,b'WHY1\x00\xb1\x02\n\x0e \xfa\x85\xeb\xd1\xfb0(\xfa\x85\xeb\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,24000
4,2023-01-05,demo_model,1.0.1,5.200397,1.0,5.000000e+08,7.376024,-223.297205,33.0,-6.862306,b'WHY1\x00\x93\x03\n\x0e \xf0\x9e\xeb\xd1\xfb0(\xf0\x9e\xeb\xd1\xfb0\x12\x12\n\tfeature_2\x12\x0...,b'WHY1\x00\xb1\x02\n\x0e \xd8\x9f\xeb\xd1\xfb0(\xd8\x9f\xeb\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,24000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
83,2023-03-25,demo_model,1.0.1,20.648773,0.0,8.088235e-01,-16.573775,-368.562824,22.0,-7.097175,b'WHY1\x00\x92\x03\n\x0e \x87\x84\xfa\xd1\xfb0(\x87\x84\xfa\xd1\xfb0\x12\x12\n\tfeature_3\x12\x0...,b'WHY1\x00\xb1\x02\n\x0e \xc5\x84\xfa\xd1\xfb0(\xc5\x84\xfa\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,4248
84,2023-03-26,demo_model,1.0.1,13.163588,1.0,2.500000e+00,-13.291135,-263.689394,33.0,-6.689304,b'WHY1\x00\x93\x03\n\x0e \xce\x93\xfa\xd1\xfb0(\xce\x93\xfa\xd1\xfb0\x12\x12\n\tfeature_4\x12\x0...,b'WHY1\x00\xb1\x02\n\x0e \x89\x94\xfa\xd1\xfb0(\x89\x94\xfa\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,3784
85,2023-03-27,demo_model,1.0.1,29.717026,1.0,2.500000e+00,-0.000000,-213.566814,110.0,-5.962720,b'WHY1\x00\x93\x03\n\x0e \xf8\xa2\xfa\xd1\xfb0(\xf8\xa2\xfa\xd1\xfb0\x12\x12\n\tfeature_3\x12\x0...,b'WHY1\x00\xb1\x02\n\x0e \xb5\xa3\xfa\xd1\xfb0(\xb5\xa3\xfa\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,3686
86,2023-03-28,demo_model,1.0.1,11.663711,1.0,5.520833e+00,13.291135,-263.896567,44.0,-6.435433,b'WHY1\x00\x92\x03\n\x0e \xb0\xb3\xfa\xd1\xfb0(\xb0\xb3\xfa\xd1\xfb0\x12\x12\n\tfeature_5\x12\x0...,b'WHY1\x00\xb1\x02\n\x0e \xe7\xb3\xfa\xd1\xfb0(\xe7\xb3\xfa\xd1\xfb0\x12\x12\n\x0bpredictions\x1...,5290


### Scaling up with Fugue & Dask

#### DASK

In [66]:
from fugue import transform

hourly_feature_profile_df = transform(
    df=features_df, 
    using=profile_features, 
    schema="*-occurred_at+features_profile:binary,predictions_profile:binary,sample_records:long",
    partition=dict(by=['ds', 'hour', 'model_name', 'version']), 
    engine="dask"
)

In [67]:
hourly_feature_profile_df.head(5)

index,ds,hour,model_name,version,predictions,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,features_profile,predictions_profile,sample_records
0,2023-01-01,2,demo_model,1.0.1,31.434237,0.0,500000000.0,-13.291135,-257.492325,1969.0,-3.698853,b'WHY1\x00\x92\x03\n\x0e \xae\xfc\x99\xd2\xfb0(\xae\xfc\x99\xd2\xfb0\x12\x12\n\tfeature_3\x12\x0...,b'WHY1\x00\xb0\x02\n\x0e \xbd\xfe\x99\xd2\xfb0(\xbd\xfe\x99\xd2\xfb0\x12\x12\n\x0bpredictions\x1...,1000
1,2023-01-02,9,demo_model,1.0.1,19.137638,1.0,14.0,13.291135,-280.842539,44.0,6.013381,b'WHY1\x00\x92\x03\n\x0e \x84\xff\x99\xd2\xfb0(\x84\xff\x99\xd2\xfb0\x12\x12\n\tfeature_6\x12\x0...,b'WHY1\x00\xb0\x02\n\x0e \x8b\x81\x9a\xd2\xfb0(\x8b\x81\x9a\xd2\xfb0\x12\x12\n\x0bpredictions\x1...,1000
2,2023-01-02,22,demo_model,1.0.1,11.617042,1.0,4.473684,13.291135,-336.947768,110.0,-6.111603,b'WHY1\x00\x92\x03\n\x0e \xec\x81\x9a\xd2\xfb0(\xec\x81\x9a\xd2\xfb0\x12\x12\n\tfeature_6\x12\x0...,b'WHY1\x00\xb0\x02\n\x0e \xb2\x83\x9a\xd2\xfb0(\xb2\x83\x9a\xd2\xfb0\x12\x12\n\x0bpredictions\x1...,1000
3,2023-01-03,2,demo_model,1.0.1,34.734276,1.0,500000000.0,13.291135,-222.465304,1584.0,-4.064077,b'WHY1\x00\x92\x03\n\x0e \x83\x84\x9a\xd2\xfb0(\x83\x84\x9a\xd2\xfb0\x12\x10\n\tfeature_1\x12\x0...,b'WHY1\x00\xb0\x02\n\x0e \xc7\x85\x9a\xd2\xfb0(\xc7\x85\x9a\xd2\xfb0\x12\x12\n\x0bpredictions\x1...,1000
4,2023-01-04,10,demo_model,1.0.1,6.508048,0.0,5.0,7.376024,-352.856626,44.0,5.019625,b'WHY1\x00\x92\x03\n\x0e \xa3\x86\x9a\xd2\xfb0(\xa3\x86\x9a\xd2\xfb0\x12\x12\n\tfeature_5\x12\x0...,b'WHY1\x00\xb0\x02\n\x0e \xb9\x88\x9a\xd2\xfb0(\xb9\x88\x9a\xd2\xfb0\x12\x12\n\x0bpredictions\x1...,999


Similarly, we can pass `engine="ray"` or `engine="spark"` to run the same logic on `Ray` or `Spark` without changing the profiling functions.