# Dval - Shapley for data valuation

This notebook introduces Shapley Dval, a library for calculating how much individual data points contribute to the performance of machine learning models.

In order to show the practical advantages of shapley_dval, we will predict the popularity of songs in the "Top Hits Spotify from 2000-2019" dataset. While doing so, we will highlight how data valuation can help boost the performance by showing which points are hindering model learning.

We will briefly describe each of the library's main entry points. In particular, we will discuss the advantages of this library over vanilla data-Shapley implementations, such as runtime optimizations for large datasets and models.

Let's start with some imports.

In [None]:
%load_ext autoreload
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

To load the dataset, we will use the load_spotify_dataset function. We will only load songs published after 2014, so that we do not mix genres and generational tastes too much.

In [None]:
from valuation.utils import load_spotify_dataset
data = load_spotify_dataset(min_year=2014)
data.head()

The dataset has many high-level features, some quite intuitive ('duration_ms' or 'tempo') and others a bit more cryptic ('valence'?). If you want more information on each feature, you can find it on [this webpage](https://www.kaggle.com/datasets/paradisejoy/top-hits-spotify-from-20002019?resource=download).
In our analysis, we will use all the columns except 'artist' and 'song' to predict the 'popularity' of each song. The next cell prepares the dataset and samples the train, validation and test sets.

In [None]:
target_column = 'popularity'
y = data[target_column]
X = data.drop(target_column, axis=1)
X, X_test, y, y_test = train_test_split(X, y, test_size=0.3, random_state=24)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=24)
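
Because the second split is applied to what remains after the first one, the three sets do not end up as 70/30 of the whole dataset. The sketch below repeats the same two calls on synthetic data (the `*_demo` names are placeholders, not part of the notebook) to show the resulting proportions: roughly half of the rows for training, about a fifth for validation and 30% for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target (1000 rows).
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000)

# Same two-step split as in the cell above.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=24
)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.3, random_state=24
)

# Fraction of the full dataset that lands in each set.
fractions = {
    "train": len(X_train_demo) / len(X_demo),
    "val": len(X_val_demo) / len(X_demo),
    "test": len(X_test_demo) / len(X_demo),
}
print(fractions)
```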

We will keep the song and artist information for the train set in separate pandas Series for future reference, and then drop those columns from all the input (X) sets.

In [None]:
song_name = X_train['song']
artist = X_train['artist']
X_train = X_train.drop(['song', 'artist'], axis=1)
X_test = X_test.drop(['song', 'artist'], axis=1)
X_val = X_val.drop(['song', 'artist'], axis=1)

And now we can move directly to the data valuation part! 

The calculation of Shapley coefficients is computationally very expensive because it has to evaluate many subsets of the original input dataset before converging. For this reason, the Dval library implements techniques to speed up the calculation: caching intermediate results (more on this in later sections) and grouping data so that Shapley values are calculated on groups instead of single data points.
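
To make the cost concrete, here is a minimal sketch of a permutation-sampling Shapley estimator. It is not the Dval implementation: `permutation_shapley` and `toy_utility` are hypothetical names, and the utility is a toy additive function rather than a model that is retrained on every subset. In a real run, every call to `utility` would mean one model training, which is where the runtime goes.

```python
import numpy as np

def permutation_shapley(n_points, utility, n_permutations, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over random permutations of the data indices."""
    rng = np.random.default_rng(seed)
    values = np.zeros(n_points)
    for _ in range(n_permutations):
        permutation = rng.permutation(n_points)
        subset = []
        previous_score = utility(subset)
        for index in permutation:
            subset.append(index)
            score = utility(subset)  # in practice: retrain and score the model
            values[index] += score - previous_score
            previous_score = score
    return values / n_permutations

# Toy additive utility: each point contributes a fixed amount, so the exact
# Shapley value of point i is simply weights[i].
weights = np.array([1.0, -2.0, 0.5])

def toy_utility(subset):
    return float(sum(weights[i] for i in subset))

estimates = permutation_shapley(len(weights), toy_utility, n_permutations=20)
print(estimates)
```

Each permutation costs one utility evaluation per data point (or per group), so the total number of trainings grows as the number of permutations times the number of players. This is why grouping and caching both help.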

In the next cell we will import all the relevant functions and try them in a non-parallel setting (num_jobs=1). In this mode a progress bar is shown, which we can use to adjust iterations_per_job to our time budget.

As the model, we will use a GradientBoostingRegressor, but any model from sklearn, xgboost or lightgbm works. More precisely, any model that has fit and predict methods should run without issues.
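
As an illustration of the fit/predict interface, the sketch below defines a toy regressor with only those two methods. `MeanRegressor` is a hypothetical class written for this example. Whether the library additionally expects scikit-learn helpers such as `get_params` is not stated here, so treat this as a sketch of the interface rather than a guaranteed drop-in model.

```python
import numpy as np

class MeanRegressor:
    """Toy model: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self  # returning self mirrors the scikit-learn convention

    def predict(self, X):
        # One prediction per input row, all equal to the stored mean.
        return np.full(len(X), self.mean_)

model = MeanRegressor().fit([[0.0], [1.0], [2.0]], [10.0, 20.0, 30.0])
predictions = model.predict([[5.0], [6.0]])
print(predictions)
```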

Note: Make sure to restart memcached (or simply start it, if it is not already running). In the terminal, type

`sudo service memcached restart`

In [None]:
from valuation.shapley import create_utility, shapley_dval
from sklearn.ensemble import GradientBoostingRegressor
utility = create_utility(model=GradientBoostingRegressor(n_estimators=3), x_train=X_train, y_train=y_train, x_test=X_val, y_test=y_val, scoring='neg_mean_absolute_error', data_groups=artist)
dval_df = shapley_dval(utility, iterations_per_job=50, num_jobs=1)

Depending on the application, if you do not want to use groups of data you can ignore the data_groups argument (or pass None).
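
To see what grouping means in practice, the sketch below maps each group label to the rows it owns and then collects the rows of a coalition of groups. It uses a made-up Series in place of `artist`; `group_to_rows` and `rows_for` are hypothetical helpers, not library functions.

```python
import pandas as pd

# Made-up group labels, indexed like the training rows would be.
groups = pd.Series(["a", "b", "a", "c", "b"], index=[10, 11, 12, 13, 14])

# Map every group label to the list of row labels that belong to it.
group_to_rows = {
    key: rows.to_list() for key, rows in groups.groupby(groups).groups.items()
}

def rows_for(coalition):
    """Collect the row labels of all groups in a coalition."""
    return sorted(row for key in coalition for row in group_to_rows[key])

print(group_to_rows)
print(rows_for({"a", "c"}))
```

With grouping, the players in the Shapley calculation are the three labels rather than the five rows, which reduces the number of subsets to explore.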

The higher iterations_per_job and num_jobs are, the more precise the Shapley coefficients will be. On my machine, 50 iterations take less than a minute, so I will move on and parallelise the computation.

In [None]:
utility = create_utility(model=GradientBoostingRegressor(n_estimators=3), x_train=X_train, y_train=y_train, x_test=X_val, y_test=y_val, scoring='neg_mean_absolute_error', data_groups=artist)
dval_df = shapley_dval(utility, iterations_per_job=50, num_jobs=20)

The first function, create_utility, returns a utility object which holds all the information on the dataset, the model and the scoring method. Within utility, the Dval library uses a cache to speed up the calculation of the Shapley coefficients, and for this reason memcache needs to be restarted every time a new utility is created.
The second function, shapley_dval, calculates the actual Shapley values and manages parallelization. Let's take a look at the returned dataframe.

In [None]:
dval_df.head()

The first thing to notice is that the dval_df dataframe is sorted in ascending order of shapley_dval. The data_key column holds the label of each data group: in this case, since the groups correspond to artists, it coincides with the artist names. The second column holds the Shapley data valuation score, and the third its (approximate) standard deviation.

A better way to analyse the results is through a plot. In the next cell we will take the 30 data groups with the lowest scores and plot them with error bars.
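
Since each score comes with a standard deviation, one simple way to read the table is to keep only the groups whose score is negative by a clear margin. The sketch below does this on a made-up dataframe. The values are invented for illustration, and the column names `shapley_dval` and `dval_std` are assumptions (only `data_key` is used elsewhere in this notebook).

```python
import pandas as pd

# Made-up scores for illustration only; not output from the notebook.
toy_dval_df = pd.DataFrame({
    "data_key": ["artist_a", "artist_b", "artist_c", "artist_d"],
    "shapley_dval": [-0.8, -0.1, 0.2, 0.9],  # assumed column name
    "dval_std": [0.2, 0.3, 0.1, 0.2],  # hypothetical column name
})

# Keep groups whose score stays negative even after adding two standard deviations.
clearly_harmful = toy_dval_df[
    toy_dval_df["shapley_dval"] + 2 * toy_dval_df["dval_std"] < 0
]
harmful_keys = clearly_harmful["data_key"].to_list()
print(harmful_keys)
```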

In [None]:
from valuation.utils import plot_dval
low_dval = dval_df.iloc[:30]
plot_dval(low_dval, figsize=(20, 4), title='Artists with low data valuation scores', xlabel='artist', ylabel='shapley_dval')


As we can see, many points have a negative Shapley score, meaning that they tend to decrease the total score of the model when present in the training set! What happens if we remove them? In the next cell we will create a new training set which excludes the 30 groups with the lowest scores.

In [None]:
low_dval_artists = dval_df.iloc[:30].data_key.to_list()
artist_filter = ~artist.isin(low_dval_artists)
X_train_good_dval = X_train[artist_filter]
y_train_good_dval = y_train[artist_filter]

Now we will use this cleaned dataset to train a full GradientBoostingRegressor and compare its mean absolute error to the model which uses the full dataset. Notice that the score now is calculated using the test set, while for calculating the shapley values we were using the validation set.

In [None]:
from sklearn.metrics import mean_absolute_error
# Model trained on the cleaned training set (low-scoring artists removed)
clean_model = GradientBoostingRegressor(n_estimators=3).fit(X_train_good_dval, y_train_good_dval)
mean_absolute_error(y_test, clean_model.predict(X_test))

In [None]:
full_model = GradientBoostingRegressor(n_estimators=3).fit(X_train, y_train)
mean_absolute_error(full_model.predict(X_test), y_test)

The score has improved by more than 15%! This is quite an important result, as it shows a self-consistent process to improve the performance of a model by excluding datapoints from its training set.

## Evaluation on anomalous data

Another interesting test is to corrupt some of the data and monitor how its valuation score changes. To do this, we will take one of the artists with the highest score and set the popularity of all their songs to 0.

In [None]:
high_dval = dval_df.iloc[-30:]
plot_dval(high_dval, figsize=(20, 4), title='Artists with high data valuation scores', xlabel='artist', ylabel='shapley_dval')

From the plot above, we can see that Rihanna has one of the highest scores. Let's now take all the train labels related to her, set them to 0 and re-calculate the data valuation scores (remember to restart memcache, since this is a new utility with a new train dataset).

In [None]:
y_train.loc[artist == 'Rihanna'] = 0
utility = create_utility(model=GradientBoostingRegressor(n_estimators=3), x_train=X_train, y_train=y_train, x_test=X_val, y_test=y_val, scoring='neg_mean_absolute_error', data_groups=artist)
dval_df = shapley_dval(utility, iterations_per_job=50, num_jobs=20)

Let's now take the low scoring artists and plot the results

In [None]:
low_dval = dval_df.iloc[:30]
plot_dval(low_dval, figsize=(20, 4), title='Artists with low data valuation scores', xlabel='artist', ylabel='shapley_dval')

And Rihanna (our anomalous data group) has moved from top contributor to having negative impact on the performance of the model, as expected!

# Advanced: Dval cache configuration

Internally, the Dval library uses memcache to speed up the calculation of the Shapley values. This is done within the Utility class: every time a model is trained on a subset of the full dataset, its scores are stored in a distributed cache. In this section, we will see how to change the memcache configuration to adapt the workflow to different situations.

In [None]:
from valuation.utils import MemcachedConfig
memcache_config = MemcachedConfig(
    cache_threshold=0.3,
    allow_repeated_training=False,
    rtol_threshold=0.1,  # ignored when allow_repeated_training=False
    min_repetitions=3,  # ignored when allow_repeated_training=False
)

Above, you can see the main arguments of the memcache configuration:
- cache_threshold determines the minimum number of seconds a model training needs to take for its score to be cached. If a model is very fast to train, you may just want to re-train it every time without saving the score. In most cases, caching the model score is preferable, even when training takes very little time. The default value of cache_threshold is 0.3 seconds.
- If allow_repeated_training is set to True, instead of storing just a single score of a model, the cache will store a running average of its score until a certain relative tolerance (set by the rtol_threshold argument) is achieved. More precisely, most machine learning trainings are non-deterministic: depending on the starting weights or on randomness in the training process, the trained model can have very different scores. If you observe in your workflow that the training process is very noisy even on the same training set, then we recommend setting allow_repeated_training to True. If instead the score is not impacted much by non-deterministic training, setting allow_repeated_training to False will speed up the shapley_dval calculation substantially.
- As mentioned above, the rtol_threshold argument sets the relative tolerance at which the cache returns the running average of a model's score instead of re-training it. If allow_repeated_training is True, set rtol_threshold to small values and the Shapley coefficients will have higher precision.
- Similarly to rtol_threshold, min_repetitions regulates repeated trainings by setting the minimum number of repeated trainings a model has to go through before the cache can return its average score. If the model training is very noisy, set min_repetitions to higher values and the scores will be more reflective of the real average performance of the trained models.
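
The interplay of these four arguments can be sketched with a plain in-memory dictionary. This is not the library's implementation: `ScoreCacheSketch` is a hypothetical class, and the stopping rule (standard error of the mean below `rtol_threshold` times the absolute mean) is an assumption about how the relative tolerance might be applied.

```python
import statistics
import time

class ScoreCacheSketch:
    """In-memory stand-in for the score cache, keyed by the training subset."""

    def __init__(self, cache_threshold=0.3, allow_repeated_training=False,
                 rtol_threshold=0.1, min_repetitions=3):
        self.cache_threshold = cache_threshold
        self.allow_repeated_training = allow_repeated_training
        self.rtol_threshold = rtol_threshold
        self.min_repetitions = min_repetitions
        self._scores = {}  # subset -> list of scores observed so far

    def _is_settled(self, scores):
        if not self.allow_repeated_training:
            return True  # a single cached score is enough
        if len(scores) < max(self.min_repetitions, 2):
            return False
        mean = statistics.fmean(scores)
        std_error = statistics.stdev(scores) / len(scores) ** 0.5
        # Assumed criterion: stop re-training once the mean is stable enough.
        return std_error <= self.rtol_threshold * abs(mean)

    def get_score(self, indices, train_and_score):
        key = frozenset(indices)
        scores = self._scores.get(key)
        if scores is not None and self._is_settled(scores):
            return statistics.fmean(scores)  # cache hit, no training
        start = time.perf_counter()
        score = train_and_score(indices)
        elapsed = time.perf_counter() - start
        if scores is not None:
            scores.append(score)  # update the running average
            return statistics.fmean(scores)
        if elapsed >= self.cache_threshold:
            self._scores[key] = [score]  # only slow trainings are cached
        return score

calls = []

def fake_training(indices):
    """Stand-in for fitting and scoring a model on the given rows."""
    calls.append(tuple(sorted(indices)))
    return -1.5

cache = ScoreCacheSketch(cache_threshold=0.0)
first = cache.get_score([0, 1], fake_training)
second = cache.get_score([1, 0], fake_training)  # same subset, served from cache
print(first, second, len(calls))
```

Keying the cache on the set of indices, rather than on their order, is what lets different permutations that share a subset reuse the same score.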

To show how the Dval cache speeds up the calculation of the shapley values, we will next compare the running time when cache is enabled to that when the cache is turned off. This becomes particularly relevant for big models and many repeated iterations. We will also take a reduced training dataset, to speed up calculation a bit. Note: since this part will involve bigger models and more iterations_per_job, it will take much more time than previous ones.

In [None]:
X_train_small = X_train.iloc[:40]
y_train_small = y_train.iloc[:40]
artist_small = artist.iloc[:40]

In [None]:
%%time
utility = create_utility(model=GradientBoostingRegressor(n_estimators=100), x_train=X_train_small, y_train=y_train_small, x_test=X_val, y_test=y_val, scoring='neg_mean_absolute_error', data_groups=artist_small, enable_cache=True, cache_options=memcache_config)
dval_df = shapley_dval(utility, iterations_per_job=1000, num_jobs=20)

In [None]:
%%time
utility = create_utility(model=GradientBoostingRegressor(n_estimators=100), x_train=X_train_small, y_train=y_train_small, x_test=X_val, y_test=y_val, scoring='neg_mean_absolute_error', data_groups=artist_small, enable_cache=False)
dval_df = shapley_dval(utility, iterations_per_job=1000, num_jobs=20)

As you can see, the cache speeds up the calculation by 20%. The bigger the model or the total number of iterations (iterations_per_job*num_jobs), the more important the cache becomes for speeding up the calculations!
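
One way to build intuition for why caching pays off is to count how often the same subset is requested when sampling permutations. The sketch below is a toy simulation with a hypothetical helper, `count_subset_evaluations`, and is not a measurement of the library. It assumes one utility evaluation per prefix of each sampled permutation.

```python
import numpy as np

def count_subset_evaluations(n_groups, n_permutations, seed=0):
    """Count total and distinct subsets visited by permutation sampling."""
    rng = np.random.default_rng(seed)
    total = 0
    unique_subsets = set()
    for _ in range(n_permutations):
        permutation = rng.permutation(n_groups)
        # Every prefix of the permutation is one subset to train on.
        for end in range(1, n_groups + 1):
            unique_subsets.add(frozenset(permutation[:end].tolist()))
            total += 1
    return total, len(unique_subsets)

total, unique = count_subset_evaluations(n_groups=6, n_permutations=200)
hit_fraction = 1 - unique / total  # share of evaluations a cache could serve
print(total, unique, round(hit_fraction, 3))
```

With only a handful of groups, almost every evaluation is a repeat. The number of possible subsets grows as 2^n, so with many groups repeats become rarer and the benefit of the cache depends on the dataset size and the number of iterations.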