# Dval - Shapley for data valuation

This notebook introduces Shapley methods for the computation of data value using pyDVL.

To show the practical advantages of Shapley-based data valuation, we will predict the popularity of songs in the "Top Hits Spotify from 2000-2019" dataset. Along the way, we will highlight how data valuation can help improve the performance of your models.

We will briefly describe the library's main entry points. We will also show the advantages of this library over vanilla data-Shapley implementations, such as runtime optimizations for large datasets and models.

Let's start with some imports:

In [None]:
%load_ext autoreload

To load the dataset, we will use the load_spotify_dataset function. Internally, it loads data on songs published after 2014, holds out 30% of the data for testing, and uses 30% of the remaining data for validation. It then returns the train, validation and test data, each as a list of the form [X_input, Y_label].

In [None]:
from valuation.utils import load_spotify_dataset
train_data, val_data, test_data = load_spotify_dataset(val_size=0.3, test_size=0.3, target_column='popularity')
train_data[0].head()
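
Since the validation fraction applies to what is left after the test split, the validation set is 21% of all rows and the training set 49%. The sketch below illustrates this arithmetic with a plain shuffled split. The function three_way_split is hypothetical and only mirrors the description above; it is not the actual implementation of load_spotify_dataset.

```python
import random


def three_way_split(n_rows, test_size=0.3, val_size=0.3, seed=0):
    """Return (train, val, test) lists of row indices.

    Hypothetical helper: val_size is taken from the rows that remain
    after the test rows have been set aside.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(test_size * n_rows)
    n_val = round(val_size * (n_rows - n_test))
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 490 210 300
```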

The dataset has many high-level features, some quite intuitive ('duration_ms' or 'tempo'), and others a bit more cryptic ('valence'?). If you want more information on each feature, you can find it on [this webpage](https://www.kaggle.com/datasets/paradisejoy/top-hits-spotify-from-20002019?resource=download).
In our analysis, we will use all the columns except 'artist' and 'song' to predict the 'popularity' of each song. We will nonetheless keep the song and artist of each training sample in separate pandas Series, for future reference.

In [None]:
song_name = train_data[0]['song']
artist = train_data[0]['artist']
train_data[0] = train_data[0].drop(['song', 'artist'], axis=1)
test_data[0] = test_data[0].drop(['song', 'artist'], axis=1)
val_data[0] = val_data[0].drop(['song', 'artist'], axis=1)

Input and label data need to be stored in a Dataset object, a pyDVL-specific container.

In [None]:
from valuation.utils import Dataset
dataset = Dataset(*train_data, *val_data)

The calculation of Shapley values is computationally very expensive because the model must be evaluated on many subsets of the training set before the estimates converge. For this reason, the Dval library implements two techniques to speed up the calculation: it caches intermediate results, and it allows grouping the data so that Shapley values are computed for groups instead of single datapoints. Here we group the songs by artist, which from a Dataset object takes a single call.

In [None]:
from valuation.utils import GroupedDataset
grouped_dataset = GroupedDataset.from_dataset(dataset, artist)
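
To give a feeling for why grouping helps, here is a minimal sketch of an exact Shapley computation, which has to visit every coalition of players and therefore scales as 2**n. The additive toy utility, the song and artist names, and the function exact_shapley are all made up for illustration; this is not how the library computes its values.

```python
from itertools import combinations
from math import factorial


def exact_shapley(players, utility):
    """Exact Shapley values: weighted average of marginal contributions."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Probability that exactly `size` players precede p in a random order
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


# Hypothetical per-song contributions and song-to-artist mapping
song_worth = {"song_1": 1.0, "song_2": 2.0, "song_3": -0.5, "song_4": 0.5}
song_artist = {"song_1": "artist_a", "song_2": "artist_a",
               "song_3": "artist_b", "song_4": "artist_b"}


def song_utility(songs):
    return sum(song_worth[s] for s in songs)


def artist_utility(artists):
    return sum(w for s, w in song_worth.items() if song_artist[s] in artists)


per_song = exact_shapley(list(song_worth), song_utility)
per_artist = exact_shapley(["artist_a", "artist_b"], artist_utility)
print(per_song)
print(per_artist)
print(2 ** len(song_worth), "coalitions for songs vs", 2 ** 2, "for artists")
```

With a real model every utility call means retraining, so moving from songs to artists shrinks the number of players and with it the number of coalitions that can possibly be visited.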

And now we can finally calculate the contribution of each datapoint to the model performance!

As a model, we will use a GradientBoostingRegressor, but any model from sklearn, xgboost or lightgbm works. More precisely, any model that has fit and predict methods should run without issues.
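
As an illustration of that fit/predict interface, here is a toy model written from scratch. MeanRegressor and the scoring helper are hypothetical; whether Utility needs anything beyond fit and predict (for example a score method or parameter cloning) is an assumption that should be checked against the documentation.

```python
class MeanRegressor:
    """Toy model: always predicts the mean of the training labels."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self  # returning self allows chaining, as in sklearn

    def predict(self, X):
        return [self.mean_ for _ in X]


def neg_mean_absolute_error(y_true, y_pred):
    """Negated MAE, so that higher values mean a better model."""
    return -sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


model = MeanRegressor().fit([[0], [1], [2]], [10.0, 20.0, 30.0])
predictions = model.predict([[3], [4]])
score = neg_mean_absolute_error([18.0, 26.0], predictions)
print(predictions, score)  # [20.0, 20.0] -4.0
```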

Note: Make sure that your memcached server is running; start it if it is not, or restart it if it is. See our [documentation](https://appliedai-initiative.github.io/valuation/install.html) for details.

In [None]:
from valuation.shapley import shapley_pydvl
from valuation.utils import Utility
from sklearn.ensemble import GradientBoostingRegressor

utility = Utility(
    model=GradientBoostingRegressor(n_estimators=3),
    data=grouped_dataset,
    scoring='neg_mean_absolute_error',
    enable_cache=True,
)
dval_df = shapley_pydvl(utility, iterations_per_job=50, num_jobs=20)

As you can see, the first object that we have defined (utility) holds all the information on the dataset, the model and the scoring method. Within utility, the dval library uses a cache to speed up the calculation of the Shapley values.
The second call, shapley_pydvl, calculates the actual Shapley values and manages parallelization. Let's take a look at the returned dataframe.

In [None]:
dval_df.head()

The first thing to notice is that the dval_df dataframe is sorted in ascending order of shapley_dval. The data_key column holds the label of each data group: in this case, since the groups correspond to artists, it coincides with the artist names. The second column holds the Shapley data valuation score, and the third its (approximate) standard deviation.
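
To make these three columns more concrete, the following sketch builds a similar table for a toy problem by sampling random permutations and averaging marginal contributions. It assumes a permutation-sampling estimator and reports the standard error of the mean as the spread; whether shapley_pydvl works this way internally is not confirmed here. The artist names, their worth and the interaction bonus are invented, and lru_cache only stands in for the idea of caching utility evaluations.

```python
import random
from functools import lru_cache
from statistics import mean, stdev

WORTH = {"artist_a": 3.0, "artist_b": -1.0, "artist_c": 0.5}  # hypothetical


@lru_cache(maxsize=None)
def utility(coalition):
    """Toy score of a frozenset of artists; repeated coalitions hit the cache."""
    base = sum(WORTH[a] for a in coalition)
    bonus = 0.5 if {"artist_a", "artist_c"} <= coalition else 0.0
    return base + bonus


def permutation_shapley(players, n_permutations, seed=0):
    """Return (key, estimate, standard error) rows sorted by estimate."""
    rng = random.Random(seed)
    marginals = {p: [] for p in players}
    for _ in range(n_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = frozenset()
        previous = utility(coalition)
        for p in order:
            coalition = coalition | {p}
            current = utility(coalition)
            marginals[p].append(current - previous)
            previous = current
    rows = [(p, mean(m), stdev(m) / len(m) ** 0.5) for p, m in marginals.items()]
    return sorted(rows, key=lambda row: row[1])  # ascending, lowest value first


table = permutation_shapley(list(WORTH), n_permutations=200)
for key, value, std_err in table:
    print(f"{key:10s} {value:7.3f} {std_err:7.3f}")
print("distinct coalitions evaluated:", utility.cache_info().currsize)
```

Only eight distinct coalitions exist for three players, so almost all of the 800 utility calls are answered from the cache. With a real model each cache miss is a full training run, which is where the saving comes from.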

A better way to analyse the results is through a plot. In the next cell we will take the 30 artists with the lowest scores and plot them with error bars.

In [None]:
from valuation.utils import plot_shapley_pydvl
low_dval = dval_df.iloc[:30]
fig = plot_shapley_pydvl(low_dval, figsize=(20, 4), title='Artists with low data valuation scores', xlabel='artist', ylabel='shapley_dval')
fig.show()

As you can see, many artists have a negative Shapley score, meaning that they tend to decrease the total score of the model when present in the training set! What happens if we remove them? In the next cell we will create a new training set which excludes the 30 artists with the lowest scores.

In [None]:
low_dval_artists = dval_df.iloc[:30].data_key.to_list()
artist_filter = ~artist.isin(low_dval_artists)
X_train_good_dval = train_data[0][artist_filter]
y_train_good_dval = train_data[1][artist_filter]  # labels, filtered with the same mask as the features

Now we will use this cleaned dataset to train a GradientBoostingRegressor and compare its mean absolute error to that of a model trained on the full dataset. Notice that the score is now calculated on the test set, while the Shapley values were calculated using the validation set.

In [None]:
from sklearn.metrics import mean_absolute_error
# Model trained on the cleaned training set
clean_model = GradientBoostingRegressor(n_estimators=3).fit(X_train_good_dval, y_train_good_dval)
mean_absolute_error(clean_model.predict(test_data[0]), test_data[1])

In [None]:
full_model = GradientBoostingRegressor(n_estimators=3).fit(train_data[0], train_data[1])
mean_absolute_error(full_model.predict(test_data[0]), test_data[1])

The score has improved by more than 15%! This is quite an important result, as it shows a self-consistent process to improve the performance of a model by excluding datapoints from its training set.
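
The percentage quoted above is a relative change between the two mean absolute errors. A small sketch of that arithmetic follows; the two numbers are placeholders chosen for illustration, not the outputs of the cells above, and relative_improvement is a hypothetical helper.

```python
def relative_improvement(baseline_error, new_error):
    """Fraction by which the error decreased with respect to the baseline."""
    return (baseline_error - new_error) / baseline_error


# Placeholder values for illustration only, NOT the notebook's results
mae_full, mae_clean = 10.0, 8.0
print(f"{relative_improvement(mae_full, mae_clean):.0%}")  # 20%
```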

## Evaluation on anomalous data

Another interesting test is to corrupt some of the data and monitor how its valuation score changes. To do this, we will take one of the artists with the highest score and set the popularity of all their songs to 0.

In [None]:
high_dval = dval_df.iloc[-30:]
fig = plot_shapley_pydvl(high_dval, figsize=(20, 4), title='Artists with high data valuation scores', xlabel='artist', ylabel='shapley_dval')
fig.show()

From the plot above, we can see that Rihanna has one of the highest scores. Let's now take all the training labels of her songs, set them to 0 and recalculate the data valuation scores.

In [None]:
# Corrupt the labels (popularity), not the input features
train_data[1].loc[artist == 'Rihanna'] = 0
dataset = Dataset(*train_data, *val_data)
grouped_dataset = GroupedDataset.from_dataset(dataset, artist)
utility = Utility(
    model=GradientBoostingRegressor(n_estimators=3),
    data=grouped_dataset,
    scoring='neg_mean_absolute_error',
    enable_cache=True,
)
dval_df = shapley_pydvl(utility, iterations_per_job=50, num_jobs=20)

Let's now take the low-scoring artists and plot the results.

In [None]:
low_dval = dval_df.iloc[:30]
fig = plot_shapley_pydvl(low_dval, figsize=(20, 4), title='Artists with low data valuation scores', xlabel='artist', ylabel='shapley_dval')
fig.show()

And Rihanna (our anomalous data group) has moved from being a top contributor to having a negative impact on the performance of the model, as expected!
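
The same effect can be reproduced in a fully self-contained toy setting. In the sketch below the "model" predicts the mean training label, the utility is the negative mean absolute error on three validation labels, and Shapley values are computed exactly over three groups. All labels and artist names are invented, and scoring the empty coalition with a prediction of 0 is an arbitrary assumption of this sketch.

```python
from itertools import combinations
from math import factorial

VAL_LABELS = [58.0, 60.0, 62.0]  # hypothetical validation popularity scores


def neg_mae_of_mean_model(train_labels):
    """Score a model that predicts the mean training label (0 if no data)."""
    prediction = sum(train_labels) / len(train_labels) if train_labels else 0.0
    return -sum(abs(v - prediction) for v in VAL_LABELS) / len(VAL_LABELS)


def group_shapley(groups):
    """Exact Shapley value of each group of training labels."""
    names = list(groups)
    n = len(names)

    def utility(coalition):
        labels = [y for name in coalition for y in groups[name]]
        return neg_mae_of_mean_model(labels)

    values = {}
    for name in names:
        others = [o for o in names if o != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                total += weight * (utility(subset + (name,)) - utility(subset))
        values[name] = total
    return values


groups = {
    "artist_a": [60.0, 62.0],
    "artist_b": [55.0, 59.0],
    "artist_c": [70.0, 74.0],
}
before = group_shapley(groups)
top = max(before, key=before.get)  # most valuable group before corruption

# Corrupt the labels of the most valuable group by setting them to 0
corrupted = dict(groups)
corrupted[top] = [0.0] * len(groups[top])
after = group_shapley(corrupted)

print("before:", before)
print("after: ", after)
```

In this toy example the group with the highest value before corruption ends up with the lowest, negative value afterwards, which mirrors what happens to the corrupted artist above.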