In [None]:
import logging
import warnings

logging.basicConfig()
logging.getLogger().setLevel(logging.INFO)

warnings.filterwarnings("ignore")

## Execute an ML pipeline written with sklearn/pandas and set up a view generator

We execute a complex [ML pipeline for product review classification](/classify_amazonreviews_sklearn.py) written with sklearn and pandas. During execution, we track provenance and capture intermediates in order to set up a `view_generator`, which lets us debug the pipeline data later.


In [2]:
from freamon.adapters.mlinspect.provenance import from_sklearn_pandas
view_generator = from_sklearn_pandas('classify_amazonreviews_sklearn.py')

INFO:root:Patching sys.argv with ['eyes']
INFO:root:Registering source 2 with columns: ['product_id', 'product_parent', 'product_title', 'category_id', 'mlinspect_lineage_2_0']
INFO:root:
                  CREATE OR REPLACE VIEW _freamon_source_2_with_prov AS 
                  SELECT 
                  "product_id" AS "product_id", "product_parent" AS "product_parent", "product_title" AS "product_title", "category_id" AS "category_id", "mlinspect_lineage_2_0" AS "prov_id_source_2"
                  FROM _freamon_source_2
                
INFO:root:Registering source 3 with columns: ['id', 'category', 'mlinspect_lineage_3_0']
INFO:root:
                  CREATE OR REPLACE VIEW _freamon_source_3_with_prov AS 
                  SELECT 
                  "id" AS "id", "category" AS "category", "mlinspect_lineage_3_0" AS "prov_id_source_3"
                  FROM _freamon_source_3
                
INFO:root:Registering source 1 with columns: ['review_id', 'star_rating', 'helpful_votes', 'to

Test accuracy 0.879650554862212


## Generate and materialize a view for data debugging

Next, we generate and materialize a view over the test labels and predictions of the pipeline, sliceable by the `category` and `star_rating` attributes from two input tables.


In [3]:
materialized_view = view_generator.test_view(
    sliceable_by=['category', 'star_rating'], 
    with_features=False, 
    with_y_true=True, 
    with_y_pred=True)

materialized_view

,category,star_rating,y_true,y_pred
0,Digital_Software,5,1,1
1,Digital_Software,4,1,1
2,Digital_Software,5,1,1
3,Digital_Software,5,1,1
4,Digital_Video_Games,5,1,1
...,...,...,...,...
29642,Digital_Video_Games,5,1,1
29643,Digital_Video_Games,1,1,1
29644,Digital_Video_Games,5,0,1
29645,Digital_Software,2,1,0
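
Because the materialized view is an ordinary table of labels and predictions, per-slice statistics reduce to a group-wise aggregation. The following minimal sketch recomputes accuracy per category in plain Python over the nine rows visible in the truncated output above. The `rows` list and the `accuracy_by` helper are illustrative names and are not part of freamon.

```python
from collections import defaultdict

# The nine rows visible in the truncated output above:
# (category, star_rating, y_true, y_pred)
rows = [
    ("Digital_Software", 5, 1, 1),
    ("Digital_Software", 4, 1, 1),
    ("Digital_Software", 5, 1, 1),
    ("Digital_Software", 5, 1, 1),
    ("Digital_Video_Games", 5, 1, 1),
    ("Digital_Video_Games", 5, 1, 1),
    ("Digital_Video_Games", 1, 1, 1),
    ("Digital_Video_Games", 5, 0, 1),
    ("Digital_Software", 2, 1, 0),
]


def accuracy_by(rows, key):
    """Return {group: accuracy}, where `key` maps a row to its group (hypothetical helper)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        group = key(row)
        totals[group] += 1
        hits[group] += int(row[2] == row[3])  # y_true == y_pred
    return {group: hits[group] / totals[group] for group in totals}


# Slice by the category attribute (first field of each row).
per_category = accuracy_by(rows, key=lambda r: r[0])
print(per_category)
```

Passing a different `key`, for example `lambda r: (r[0], r[1] > 3)`, slices by category and rating at once.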


## Feed the materialized view into the fairlearn library to compute fairness metrics

The materialized view can be used directly by external data-debugging libraries such as [Fairlearn](https://fairlearn.org). For example, we can compute the recall and false positive rate for different groups of reviews in the data (e.g., based on the product category and rating).

In [6]:
from fairlearn.metrics import MetricFrame, false_positive_rate
from sklearn.metrics import recall_score

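# Derive a coarse rating label and combine it with the category into one group key.
materialized_view['rating'] = '(low rated)'
# Assign through a single .loc call; the chained form df['rating'].loc[...] = ... may write to a copy.
materialized_view.loc[materialized_view.star_rating.astype(int) > 3, 'rating'] = '(highly rated)'
materialized_view['category_and_rating'] = materialized_view.category + ' ' + materialized_view.rating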

fairness_metrics = MetricFrame(
    metrics={ 'recall' : recall_score, 'false_positive_rate' : false_positive_rate },
    y_true=materialized_view.y_true,
    y_pred=materialized_view.y_pred,
    sensitive_features=materialized_view.category_and_rating
)

fairness_metrics.by_group

category_and_rating,recall,false_positive_rate
Digital_Software (highly rated),0.940072,0.361386
Digital_Software (low rated),0.904889,0.191647
Digital_Video_Games (highly rated),0.993283,0.593622
Digital_Video_Games (low rated),0.865546,0.297251
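
To make the two per-group metrics concrete: recall is TP / (TP + FN) and the false positive rate is FP / (FP + TN), each computed within one group. The sketch below implements this in plain Python as an illustration of the definitions, not of Fairlearn's internals. The `rates_by_group` helper, the toy labels, and the group names `a` and `b` are all hypothetical.

```python
from collections import defaultdict


def rates_by_group(y_true, y_pred, groups):
    """Compute recall and false positive rate separately for each group."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            cell = "tp" if p == 1 else "fn"
        else:
            cell = "fp" if p == 1 else "tn"
        counts[g][cell] += 1

    result = {}
    for g, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        result[g] = {
            # NaN marks a group that has no positives (or no negatives).
            "recall": c["tp"] / positives if positives else float("nan"),
            "false_positive_rate": c["fp"] / negatives if negatives else float("nan"),
        }
    return result


# Hypothetical toy data: five examples in each of two groups.
y_true = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 1, 1, 1, 1, 0]
groups = ["a"] * 5 + ["b"] * 5

by_group = rates_by_group(y_true, y_pred, groups)
print(by_group)
```

A group can have high recall and a high false positive rate at the same time, as group `b` does here; the same pattern shows up for the highly rated video games in the output above.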


## Data-debugging a la SliceFinder via an aggregation query

In addition, we can directly run SQL queries against a virtual internal view over the inputs and intermediates for model training and testing in the pipeline. 

For example, we can compute the mean and variance of the cross-entropy loss of the pipeline predictions for different slices of the data, analogous to [SliceFinder](https://research.google/pubs/pub47966/).

In [7]:
view_generator.execute_query(
"""
SELECT 
    category,
    star_rating > 3 as highly_rated,
    AVG(-(y_true * log(y_pred + 0.00001) + (1 - y_true) * log(1.0 - y_pred + 0.00001))) AS avg_loss,
    VARIANCE(-(y_true * log(y_pred + 0.00001) + (1 - y_true) * log(1.0 - y_pred + 0.00001))) AS var_loss,    
    COUNT(*) as size
FROM _freamon_virtual_test_view    
GROUP BY GROUPING SETS ((star_rating > 3, category), (star_rating > 3), (category))
""")

,category,highly_rated,avg_loss,var_loss,size
0,Digital_Software,True,0.602093,2.648262,9060
1,Digital_Software,False,0.745306,3.171535,6769
2,Digital_Video_Games,True,0.386997,1.785416,10155
3,Digital_Video_Games,False,0.930927,3.789063,3663
4,,True,0.488417,2.203666,19215
5,,False,0.810483,3.395877,10432
6,Digital_Software,,0.663336,2.876865,15829
7,Digital_Video_Games,,0.531187,2.373968,13818
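
The aggregation in the query can be mirrored in plain Python to see what each output row means. The sketch below applies the same loss expression and emulates the three grouping sets over the nine rows visible in the materialized view output earlier; a column that is not part of a grouping set is reported as `None`, like the empty cells above. The helper names are hypothetical. Using `math.log10` is an assumption about the SQL engine's `log` function, and `statistics.variance` (sample variance) is an assumption about `VARIANCE`.

```python
import math
import statistics
from collections import defaultdict

EPS = 0.00001  # same smoothing constant as in the query


def loss(y_true, y_pred, log=math.log10):
    """Loss expression from the query; the log base is an assumption."""
    return -(y_true * log(y_pred + EPS) + (1 - y_true) * log(1.0 - y_pred + EPS))


# (category, star_rating, y_true, y_pred): the rows visible in the earlier output.
rows = [
    ("Digital_Software", 5, 1, 1), ("Digital_Software", 4, 1, 1),
    ("Digital_Software", 5, 1, 1), ("Digital_Software", 5, 1, 1),
    ("Digital_Video_Games", 5, 1, 1), ("Digital_Video_Games", 5, 1, 1),
    ("Digital_Video_Games", 1, 1, 1), ("Digital_Video_Games", 5, 0, 1),
    ("Digital_Software", 2, 1, 0),
]

# Grouping columns, in output order.
COLUMNS = {"category": lambda r: r[0], "highly_rated": lambda r: r[1] > 3}


def grouped_loss_stats(rows, grouping_sets):
    """Emulate GROUP BY GROUPING SETS: one bucket per key per grouping set."""
    buckets = defaultdict(list)
    for row in rows:
        for grouping in grouping_sets:
            key = tuple(f(row) if name in grouping else None
                        for name, f in COLUMNS.items())
            buckets[key].append(loss(row[2], row[3]))
    return {
        key: {
            "avg_loss": statistics.mean(values),
            # Sample variance needs at least two values.
            "var_loss": statistics.variance(values) if len(values) > 1 else None,
            "size": len(values),
        }
        for key, values in buckets.items()
    }


stats = grouped_loss_stats(
    rows, [("highly_rated", "category"), ("highly_rated",), ("category",)]
)
for key, value in stats.items():
    print(key, value)
```

Since `y_pred` holds hard 0/1 predictions here, each row contributes either roughly 0 (correct) or `-log(EPS)` (wrong), so under the base-10 assumption `avg_loss` is about 5 times the error rate of the slice.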
