# Effects of Dataset Binarization on a Recommender's MAE

In this notebook we explore how binarizing the dataset affects the MAE (mean absolute error) of a recommender system. Because pattern mining algorithms cannot handle continuous values, we mine patterns from a binarized dataset and then use those patterns to predict the ratings of the original dataset. We test several binarization thresholds to determine their effect on the recommender's MAE.
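As a reminder of the metric used throughout, MAE is the average absolute difference between predicted and true ratings. The snippet below is a minimal sketch in plain Python. The helper `mean_absolute_error` and the sample ratings are illustrative assumptions, not part of this project or of Surprise, whose own `mae` function is used in the cells further down.

```python
def mean_absolute_error(true_ratings, predicted_ratings):
    """Average of |predicted - true| over all rating pairs (hypothetical helper)."""
    if len(true_ratings) != len(predicted_ratings):
        raise ValueError("Both sequences must have the same length")
    absolute_errors = [abs(p - t) for t, p in zip(true_ratings, predicted_ratings)]
    return sum(absolute_errors) / len(absolute_errors)


# Made-up ratings, for illustration only
true_ratings = [4, 3, 5, 2]
predicted_ratings = [3.5, 3, 4, 3]

# Absolute errors are 0.5, 0, 1 and 1, so the MAE is 2.5 / 4
example_mae = mean_absolute_error(true_ratings, predicted_ratings)
print(example_mae)
```

A lower MAE means the predictions are, on average, closer to the true ratings.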

The binarization threshold is the minimum rating value that counts as a positive rating. For example, with a threshold of 3, all ratings greater than or equal to 3 are treated as positive, while all ratings below 3 are treated as negative. The threshold therefore determines the number of positive ratings in the dataset: the higher the threshold, the fewer positive ratings there will be.
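The rule above can be sketched on a toy rating matrix. This is only an illustration in plain Python. The `binarize` and `count_positives` helpers and the sample matrix are assumptions made for this example and do not reflect how `BinaryDataset` is implemented.

```python
# Hypothetical user x item ratings on a 1-5 scale; 0 marks a missing rating
ratings = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
]


def binarize(ratings, threshold):
    """Return a boolean matrix that is True where rating >= threshold."""
    return [[rating >= threshold for rating in row] for row in ratings]


def count_positives(binary_matrix):
    """Count the True cells in a boolean matrix."""
    return sum(sum(row) for row in binary_matrix)


# Number of positive ratings for each threshold from 1 to 5
positives_per_threshold = {
    threshold: count_positives(binarize(ratings, threshold))
    for threshold in range(1, 6)
}
print(positives_per_threshold)
```

Raising the threshold can only keep or reduce the number of positive cells, which is why the binary matrix becomes sparser as the threshold grows.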

Copyright 2022 Bernardo C. Rodrigues

See COPYING file for license details

In [None]:
# Determine the number of concepts for different binarization thresholds

import pandas as pd
from surprise import Dataset
from dataset.binary_dataset import BinaryDataset
from fca.formal_concept_analysis import GreConD

dataset = Dataset.load_builtin("ml-100k", prompt=False)
trainset = dataset.build_full_trainset()

results = []

for threshold in range(1, 6):
    binary_dataset = BinaryDataset.load_from_trainset(trainset, threshold=threshold)
    concepts, _ = GreConD(binary_dataset)

    result = [threshold, binary_dataset.number_of_trues, binary_dataset.sparsity, len(concepts)]
    results.append(result)

pd.DataFrame(results, columns=["Threshold", "# of True's", "Sparsity", "# of Concepts"])

In [None]:
# Split the dataset into a train set and a test set
from surprise.model_selection import KFold

# Use the first of 5 folds, which gives an 80% train / 20% test split
kf = KFold(n_splits=5)
fold_generator = kf.split(dataset)
trainset, testset = next(fold_generator)

In [None]:
# Evaluate the MAE for every combination of threshold, k, and coverage
from itertools import product
from surprise.accuracy import mae
from recommenders.grecond_recommender import GreConDRecommender

thresholds = [1, 2, 3, 4, 5]
ks = [1, 5, 10, 20, 30, 50, 60]
coverages = [0.6, 0.8, 1.0]
results = []

for threshold, k, coverage in product(thresholds, ks, coverages):
    algo = GreConDRecommender(
        knn_k=k, grecond_coverage=coverage, dataset_binarization_threshold=threshold
    )
    algo.fit(trainset)

    predictions = algo.test(testset)

    result = [threshold, k, coverage, mae(predictions=predictions, verbose=False)]
    results.append(result)

results = pd.DataFrame(results, columns=["threshold", "k", "coverage", "mae"])

In [None]:
import plotly.express as px

# Each point is one (threshold, k, coverage) configuration; color encodes its MAE
fig = px.scatter_3d(results, x='threshold', y='k', z='coverage', color='mae')
fig.show()