# A Grammar for the Automated Visual Presentation of Computations on Data

This notebook generates visualizations using the grammar described in submission 1404 for the 2023 IEEE VIS Full Papers CFP.

## Use Case 1: Model metrics in a computational notebook

In this use case, we consider a data scientist who has built a model and wants to communicate its quality using performance metrics. Using our Python package `specmetric`, the data scientist can generate visualizations of different metrics.


In [1]:
# Setup - load data and model
# Starting from https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
# Then replacing the 1-D linear regression plot with an r2 plot
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd

# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# # Use only one feature
# diabetes_X = diabetes_X[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# Calculate a baseline - always predict the mean label of training set
mean_train_labels_baseline = np.full_like(diabetes_y_pred, np.mean(diabetes_y_train))


In [2]:
# Point notebook to local directory to pull in specmetric
import os
import sys
from pathlib import Path
# Two levels above the notebook's working directory (the single-argument os.path.join was a no-op)
module_path = os.path.dirname(os.path.dirname(os.path.abspath('')))

if module_path not in sys.path:
    sys.path.append(module_path)

# Load up specmetric
from specmetric.parser import ComputationTreeParser
from specmetric.computation_tree import ComputationNode
from specmetric.renderer import AltairRenderer
from specmetric.visualization_container import VisualizationContainer


Now that the model is trained and the libraries are loaded, the data scientist calculates some metrics. We write out each metric explicitly as a computation tree. In practice, a function could build this tree from the Python abstract syntax tree (AST) of the scoring functions written by the data scientist. For this use case, we assume that the AST has already been parsed and translated into the DSL that `specmetric` expects.
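
To illustrate the AST step mentioned above, here is a minimal sketch that uses Python's standard `ast` module to list the operations in a scoring expression, innermost first. The helper names `function_name` and `list_operations` are hypothetical and are not part of `specmetric`; mapping these steps onto `ComputationNode` objects is left out because that translation is assumed rather than shown in this notebook.

```python
import ast


def function_name(node):
    """Return a dotted name such as 'np.sum' for a Name or Attribute node."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        return f"{function_name(node.value)}.{node.attr}"
    raise NotImplementedError(type(node).__name__)


def list_operations(source):
    """Walk one expression and record each operation after its operands."""
    steps = []

    def visit(node):
        if isinstance(node, ast.BinOp):
            left = visit(node.left)
            right = visit(node.right)
            op = type(node.op).__name__  # e.g. 'Sub', 'Div'
            steps.append((op, left, right))
            return f"({left} {op} {right})"
        if isinstance(node, ast.Call):
            args = [visit(arg) for arg in node.args]
            name = function_name(node.func)
            steps.append((name, *args))
            return f"{name}({', '.join(args)})"
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        raise NotImplementedError(type(node).__name__)

    visit(ast.parse(source, mode="eval").body)
    return steps


# The r2 formula written as a single expression (names are placeholders)
expression = "1 - np.sum(np.square(y - y_hat)) / np.sum(np.square(y - y_bar))"
steps = list_operations(expression)
for step in steps:
    print(step)
```

The last recorded step is the outer subtraction, which corresponds to the root of the tree; the earlier steps correspond to its descendants.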

In [9]:
# r2
########### BEGIN COMPUTATION GRAPH ##########
## Everything below can be extracted from the abstract syntax tree
# Changes size of container
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
display(HTML("<style>div.output_scroll { height: 44em; display: block;}</style>"))
# display(HTML("<style>.output { flex-direction: row; }</style>"))

y_i = diabetes_y_test
ids = np.arange(len(diabetes_y_test))
y_hat_i = diabetes_y_pred
X = diabetes_X_test
y_bar_scalar = np.mean(y_i)
y_bar_vector = np.full(y_i.shape, y_bar_scalar)
y_i_minus_y_hat_i = y_i - y_hat_i
y_i_minus_y_bar = y_i - y_bar_vector
y_i_minus_y_hat_i_squared = np.square(y_i_minus_y_hat_i)
y_i_minus_y_bar_squared = np.square(y_i_minus_y_bar)
ss_res = np.sum(y_i_minus_y_hat_i_squared)
ss_tot = np.sum(y_i_minus_y_bar_squared)
one = 1
ss_res_ss_tot_ratio = ss_res / ss_tot
r2 = one - ss_res_ss_tot_ratio
data_dict = {
    'ids': ids,
    'y_i': y_i,
    'y_hat_i': y_hat_i,
    'X': X,
    'y_bar_scalar': y_bar_scalar,
    'y_bar_vector': y_bar_vector,
    'y_i_minus_y_hat_i': y_i_minus_y_hat_i,
    'y_i_minus_y_bar': y_i_minus_y_bar,
    'y_i_minus_y_hat_i_squared': y_i_minus_y_hat_i_squared,
    'y_i_minus_y_bar_squared': y_i_minus_y_bar_squared,
    'ss_res': ss_res,
    'ss_tot': ss_tot,
    'one': one,
    'ss_res_ss_tot_ratio': ss_res_ss_tot_ratio,
    'r2': r2
}

minus_scalar = ComputationNode('minus_scalar', None, 'scalar_diff', input_data=['one', 'ss_res_ss_tot_ratio'], output_data='r2')
# Named one_node so it does not shadow the scalar `one` defined above
one_node = ComputationNode('one', minus_scalar, 'scalar', input_data=[], output_data='one')
ratio = ComputationNode('ratio', minus_scalar, 'scalar_ratio', input_data=['ss_res', 'ss_tot'], output_data='ss_res_ss_tot_ratio')
vector_sum_ss_res = ComputationNode('ss_res', ratio, 'vector_sum', input_data=['y_i_minus_y_hat_i_squared'], output_data='ss_res')
vector_sum_ss_tot = ComputationNode('ss_tot', ratio, 'vector_sum', input_data=['y_i_minus_y_bar_squared'], output_data='ss_tot')
square_residuals = ComputationNode('square_residuals', vector_sum_ss_res, 'vector_square', input_data=['y_i_minus_y_hat_i'], output_data='y_i_minus_y_hat_i_squared')
square_variances = ComputationNode('square_variances', vector_sum_ss_tot, 'vector_square', input_data=['y_i_minus_y_bar'], output_data='y_i_minus_y_bar_squared')
vector_difference_residuals = ComputationNode('vector_difference_residuals', square_residuals, 'vector_diff', input_data=['y_i', 'y_hat_i'], output_data='y_i_minus_y_hat_i')
vector_difference_variances = ComputationNode('vector_difference_variances', square_variances, 'vector_diff', input_data=['y_i', 'y_bar_vector'], output_data='y_i_minus_y_bar')
y_i_var_node = ComputationNode('literal_yi_var', vector_difference_variances, 'vector', output_data='y_i')
broadcast = ComputationNode('broadcast_mean', vector_difference_variances, 'broadcast', input_data=['y_bar_scalar', 'y_i'], output_data='y_bar_vector')
mean_y = ComputationNode('mean_y', broadcast, 'mean', input_data=['y_i'], output_data='y_bar_scalar')
y_i_mean_node = ComputationNode('literal_yi_mean', mean_y, 'vector', output_data='y_i')
y_i_res_node = ComputationNode('literal_yi_res', vector_difference_residuals, 'vector', output_data='y_i')
y_hat_node = ComputationNode('literal_yhat', vector_difference_residuals, 'vector', output_data='y_hat_i')

parser = ComputationTreeParser(minus_scalar)
parser.parse_computation_tree()
vis_containers = parser.visualization_containers
########### END COMPUTATION GRAPH ##########

# print("vis_containers is ")
# [vc.pp() for vc in vis_containers]
r = AltairRenderer(vis_containers, data_dict)
charts = r.convert_to_charts()

# print("charts is ", charts)
# print("charts[0] is ", charts[0])
# Concatenate the charts horizontally into a single chart
concat_c = None
for c in charts:
    print("c is ", c)
    if concat_c is None:
        concat_c = c
    else:
        concat_c = concat_c | c
concat_c.display()

SchemaValidationError: Invalid specification

        altair.vegalite.v4.api.Chart, validating 'required'

        'data' is a required property
        

alt.HConcatChart(...)
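
As a quick check of the r2 decomposition used in the computation graph, the following NumPy-only sketch repeats the same elementary steps on toy values (not the diabetes data) and confirms three known properties of r2: a perfect prediction scores 1, always predicting the mean of the evaluated labels scores 0, and a prediction worse than that mean scores below 0. The function name `r2_stepwise` is an assumption for illustration.

```python
import numpy as np


def r2_stepwise(y, y_hat):
    """Compute r2 with the same intermediate steps as the computation graph."""
    y_bar_vector = np.full(y.shape, np.mean(y))   # broadcast the mean
    ss_res = np.sum(np.square(y - y_hat))         # residual sum of squares
    ss_tot = np.sum(np.square(y - y_bar_vector))  # total sum of squares
    return 1 - ss_res / ss_tot


# Toy labels, for illustration only
y = np.array([1.0, 2.0, 3.0, 6.0])

perfect = r2_stepwise(y, y.copy())                        # ss_res is 0
mean_only = r2_stepwise(y, np.full(y.shape, np.mean(y)))  # ss_res equals ss_tot
worse = r2_stepwise(y, np.array([6.0, 3.0, 2.0, 1.0]))    # ss_res exceeds ss_tot

print(perfect, mean_only, worse)
```

Note that the baseline defined in the setup cell uses the mean of the training labels, so its r2 on the test set is not guaranteed to be exactly 0.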

In [4]:
# mean absolute error






In [5]:
# mean squared error


In [6]:
# root mean squared error


In [7]:
# median absolute error


In [8]:
# mean absolute percentage error
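
The cells above are placeholders. As a minimal sketch of how the remaining metrics break down into elementary steps of the same kind used for r2, the block below computes them with plain NumPy on toy values (not the diabetes data). The `specmetric` node types for absolute value, square root, median, and element-wise division are not shown in this notebook, so no `ComputationNode` objects are constructed here. The percentage error assumes that no true label is zero.

```python
import numpy as np

# Toy values, for illustration only
y_true = np.array([3.0, 5.0, 2.0, 8.0])
y_pred = np.array([2.0, 5.0, 4.0, 7.0])

# Shared intermediate vectors
errors = y_true - y_pred            # vector difference
abs_errors = np.abs(errors)         # element-wise absolute value
squared_errors = np.square(errors)  # element-wise square

# Each metric reduces one of the intermediate vectors to a scalar
mae = np.mean(abs_errors)                     # mean absolute error
mse = np.mean(squared_errors)                 # mean squared error
rmse = np.sqrt(mse)                           # root mean squared error
medae = np.median(abs_errors)                 # median absolute error
mape = np.mean(abs_errors / np.abs(y_true))   # mean absolute percentage error

print(mae, mse, rmse, medae, mape)
```

All five metrics share the `errors` vector, so a computation tree for each would start from the same vector difference as the residual branch of the r2 tree.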