## QuantumCLEF
### Task 1B - Feature Selection for Recommendation


The task is to select the subset of features that produces the best recommendation quality when used by an Item-Based KNN recommendation model. The KNN model computes the item-item similarity as the cosine between the feature vectors, with a shrinkage of 5 added to the denominator and 100 nearest neighbors per item. There are two baselines for this task: the same Item-Based KNN recommendation model trained using all the features, and the same model trained using the features selected by a Bayesian search that optimizes the recommendation effectiveness.

#### Datasets
The dataset is private and refers to a task of movie recommendation. The dataset contains both collaborative data and two different sets of item features:

* 100_ICM: Contains 100 features for each item.
* 400_ICM: Contains 400 features for each item.

The User Rating Matrix (URM) contains tuples in the form (UserID, ItemID), listing which user interacted with which item. The Item Content Matrix (ICM) contains tuples in the form (ItemID, FeatureID, Value). Note that the ICM is sparse and any missing (ItemID, FeatureID) pair should be treated as missing data; a common assumption is to use a value of 0. The features refer to different types of descriptors and tags associated with the items. Some of the features have been normalized.

#### Submission
The submission should be a text file listing the identifiers (FeatureID) of the selected features, one per line, as produced in Step 7 of this notebook.

#### Metrics
The selected features will be used to train an Item-Based KNN recommendation model and measure its performance on the Test Dataset with nDCG@10.
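
As a rough illustration of the metric, the sketch below computes nDCG@10 for a single user with binary relevance. The function name `ndcg_at_k` and the toy lists are hypothetical, and the evaluator used in this notebook may differ in details such as how users without test interactions are handled.

```python
import numpy as np

def ndcg_at_k(ranked_items, relevant_items, k=10):
    """nDCG@k for one user, assuming binary relevance (hypothetical helper)."""
    relevant = set(relevant_items)
    # Gain is 1 for a relevant item in the top-k list, 0 otherwise
    gains = np.array([1.0 if item in relevant else 0.0 for item in ranked_items[:k]])
    # The item at 0-based position p is discounted by log2(p + 2)
    discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))
    dcg = float((gains * discounts).sum())
    # Ideal DCG: all relevant items (up to k) ranked first
    n_ideal = min(len(relevant), k)
    idcg = float((1.0 / np.log2(np.arange(2, n_ideal + 2))).sum())
    return dcg / idcg if idcg > 0 else 0.0

# Toy rankings (hypothetical item identifiers)
perfect = ndcg_at_k([3, 7, 1], [3, 7])        # relevant items ranked first
partial_score = ndcg_at_k([1, 3, 7], [3, 7])  # relevant items ranked after a miss
```

The per-user values are then averaged over the evaluated users to obtain a single score.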

In [1]:
import pandas as pd
from Evaluation.Evaluator import EvaluatorHoldout
import scipy.sparse as sps
import numpy as np
from sklearn.model_selection import train_test_split
import itertools, multiprocessing
from functools import partial

import dimod
from neal import SimulatedAnnealingSampler
from Recommenders.KNN.ItemKNNCBFRecommender import ItemKNNCBFRecommender
from Recommenders.DataIO import DataIO

### Step 1: Load the ICM and URM data from the provided zip file

In [6]:
n_features = 100

URM_all_df = pd.read_csv("feature_selection_dataset_URM_train.csv")
ICM_df = pd.read_csv("feature_selection_dataset_{}_ICM.csv".format(n_features))

In [7]:
URM_all_df

Unnamed: 0,UserID,ItemID
0,0,2
1,0,3
2,0,4
3,0,6
4,0,29
...,...,...
3209725,20427,7055
3209726,20427,7759
3209727,20427,8182
3209728,20427,8527


In [8]:
ICM_df

Unnamed: 0,ItemID,FeatureID,Value
0,0,0,1.0
1,0,1,1.0
2,0,2,1.0
3,0,3,1.0
4,0,4,1.0
...,...,...,...
106974,14606,1,1.0
106975,14606,2,1.0
106976,14606,5,1.0
106977,14606,9,1.0


### Step 2: Transform the ICM and URM from dataframes to sparse matrices

In [11]:
n_users = URM_all_df["UserID"].max() +1
n_items = ICM_df["ItemID"].max() +1
n_users, n_items

(20428, 14607)

In [12]:
def _from_df_to_sparse(URM_df, n_users, n_items):
    URM_sps = sps.csr_matrix((np.ones(len(URM_df)),
                    (URM_df["UserID"].values, URM_df["ItemID"].values)),
                    shape=(n_users, n_items))    
    return URM_sps

In [13]:
URM_all = _from_df_to_sparse(URM_all_df, n_users, n_items)
URM_all

<20428x14607 sparse matrix of type '<class 'numpy.float64'>'
	with 3209730 stored elements in Compressed Sparse Row format>

In [14]:
ICM = sps.csr_matrix((ICM_df["Value"].values,
                    (ICM_df["ItemID"].values, ICM_df["FeatureID"].values)),
                    shape=(n_items, n_features))
ICM

<14607x100 sparse matrix of type '<class 'numpy.float64'>'
	with 106979 stored elements in Compressed Sparse Row format>

### Step 3: Create a local validation split and an evaluator object that will compute the NDCG at 10

In [15]:
URM_train_df, URM_validation_df = train_test_split(URM_all_df, test_size=0.20)

URM_train = _from_df_to_sparse(URM_train_df, n_users, n_items)
URM_validation = _from_df_to_sparse(URM_validation_df, n_users, n_items)

In [16]:
evaluator_validation = EvaluatorHoldout(URM_validation, cutoff_list=[10])

EvaluatorHoldout: Ignoring 2 ( 0.0%) Users that have less than 1 test interactions


### Step 4: Select a subset of features for the ICM

This example is based on the algorithm published in Nembrini, R.; Ferrari Dacrema, M.; Cremonesi, P. *Feature Selection for Recommender Systems with Quantum Computing*. Entropy 2021, 23, 970. https://doi.org/10.3390/e23080970


The idea is to use two recommendation models based on an item-item similarity, one collaborative and one content-based. If the collaborative model is more accurate than the content-based one, we want to retain the features that produce content-based similarities matching those that exist in the collaborative model.

The computation of the content-based similarity for this dataset may require significant RAM. This issue can be addressed by performing the computation in smaller batches.
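
One possible way to batch the computation is sketched below: the binarized content similarity is built for a block of rows at a time and immediately intersected with the collaborative one, so the full content-based product is never held in memory at once. The function name `keep_in_batches`, the `batch_size` value and the toy matrices are assumptions made for illustration.

```python
import numpy as np
import scipy.sparse as sps

def keep_in_batches(ICM, similarity_collaborative, batch_size=1000):
    """Intersect the two binarized similarities block by block (illustrative sketch)."""
    ICM = sps.csr_matrix(ICM)
    ICM_T = ICM.T.tocsc()
    collaborative_bin = sps.csr_matrix(similarity_collaborative).astype(bool)
    blocks = []
    for start in range(0, ICM.shape[0], batch_size):
        end = min(start + batch_size, ICM.shape[0])
        # Content similarity only for the rows in this block
        content_block = ICM[start:end].dot(ICM_T).astype(bool)
        # Keep the item pairs that are non-zero in both similarities
        blocks.append(content_block.multiply(collaborative_bin[start:end]))
    return sps.vstack(blocks, format="csr")

# Toy check against the unbatched computation (random data, hypothetical sizes)
ICM_toy = sps.random(30, 8, density=0.2, format="csr", random_state=0)
URM_toy = sps.random(20, 30, density=0.2, format="csr", random_state=1)
collaborative_toy = URM_toy.T.dot(URM_toy)

keep_batched = keep_in_batches(ICM_toy, collaborative_toy, batch_size=7)
keep_full = collaborative_toy.astype(bool).multiply(ICM_toy.dot(ICM_toy.T).astype(bool))
```

A smaller `batch_size` lowers peak memory at the cost of more iterations.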

In [17]:
similarity_collaborative = URM_train.T.dot(URM_train)
similarity_content = ICM.dot(ICM.T)

similarity_collaborative_bin = similarity_collaborative.astype(bool)
similarity_content_bin = similarity_content.astype(bool)

In [18]:
# Identify which item-item similarities we want to keep as the intersection of content-based and collaborative similarities
Keep = similarity_collaborative_bin.multiply(similarity_content_bin)

In [19]:
# Compute the Feature Penalization Matrix
FPM = ICM.T.dot(Keep).dot(ICM)

# Create the BQM 
BQM = dimod.BinaryQuadraticModel(FPM.toarray(), "BINARY")
BQM.normalize()

5.625481794388281e-09
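
To make the matrix product above easier to read, the toy sketch below checks that `ICM.T.dot(Keep).dot(ICM)` equals a sum, over the kept item pairs, of the outer products of their feature vectors. The three-item, two-feature data is hypothetical and dense NumPy arrays are used only for readability.

```python
import numpy as np

# Toy data (hypothetical): 3 items, 2 binary features
ICM_toy = np.array([[1.0, 0.0],
                    [1.0, 1.0],
                    [0.0, 1.0]])
# Item pairs that are non-zero in both similarities (symmetric, diagonal included)
Keep_toy = np.array([[1.0, 1.0, 0.0],
                     [1.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])

# Same expression as in the cell above, on dense arrays
FPM_toy = ICM_toy.T @ Keep_toy @ ICM_toy

# Equivalent explicit sum over item pairs (i, j)
FPM_loop = np.zeros((2, 2))
for i in range(3):
    for j in range(3):
        FPM_loop += Keep_toy[i, j] * np.outer(ICM_toy[i], ICM_toy[j])
```

Entry `(f, g)` therefore grows with the number of kept item pairs in which one item has feature `f` and the other has feature `g`.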

In [20]:
# Define the desired number of features and the penalty for this constraint
k_largest = 50
penalty = 0.01

BQM_k = dimod.generators.combinations(BQM.num_variables, k_largest)*penalty
BQM_k.update(BQM)
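
The constraint term built with `dimod.generators.combinations` is minimized when exactly `k_largest` variables are set to 1. As an illustration of what such a term looks like, the sketch below writes the quadratic penalty `penalty * (sum_i x_i - k)**2` as a QUBO matrix using NumPy only. The helper names `k_hot_penalty_qubo` and `qubo_energy` are hypothetical, and the code is meant to explain the term, not to replace the dimod generator.

```python
import numpy as np

def k_hot_penalty_qubo(n, k, penalty=1.0):
    """QUBO matrix and offset for penalty * (sum_i x_i - k)**2 with binary x (illustrative)."""
    # Since x_i**2 == x_i for binary variables:
    # (sum x - k)^2 = sum_i (1 - 2k) x_i + 2 * sum_{i<j} x_i x_j + k^2
    Q = np.triu(np.full((n, n), 2.0), k=1)
    Q[np.diag_indices(n)] = 1.0 - 2.0 * k
    return penalty * Q, penalty * k ** 2

def qubo_energy(Q, offset, x):
    """Energy x^T Q x + offset of a binary assignment x."""
    x = np.asarray(x, dtype=float)
    return float(x @ Q @ x + offset)

Q, offset = k_hot_penalty_qubo(n=4, k=2, penalty=0.5)
energy_ok = qubo_energy(Q, offset, [1, 1, 0, 0])   # exactly k variables selected
energy_bad = qubo_energy(Q, offset, [1, 1, 1, 0])  # one variable too many
```

A larger `penalty` makes deviations from the target number of features more costly relative to the FPM term, which is why it is treated as a hyperparameter in Step 6.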

In [21]:
sampler = SimulatedAnnealingSampler()

sampleset = sampler.sample(BQM_k, num_reads = 1000)
sampleset_df = sampleset.aggregate().to_pandas_dataframe()

In [22]:
sampleset_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,92,93,94,95,96,97,98,99,energy,num_occurrences
0,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,0,1,0.154053,1
1,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,0.145916,1
2,0,0,0,0,0,0,0,0,0,0,...,1,0,1,1,1,1,1,1,0.140365,1
3,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,0.152538,1
4,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,0,0.146911,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,0,0,0,0,0,0,0,...,1,0,1,1,1,1,1,1,0.152513,1
996,0,0,0,0,0,0,0,0,0,0,...,0,1,1,1,1,0,1,1,0.146128,1
997,0,0,0,0,0,0,0,0,0,0,...,1,0,1,1,1,1,1,1,0.153319,1
998,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,0.152258,1


In [23]:
# Obtain a dictionary with the feature index and 1/0 depending on whether it should be selected or not
sampleset.first.sample

{0: 0,
 1: 0,
 2: 0,
 3: 0,
 4: 0,
 5: 0,
 6: 0,
 7: 0,
 8: 0,
 9: 0,
 10: 0,
 11: 0,
 12: 0,
 13: 0,
 14: 0,
 15: 0,
 16: 0,
 17: 0,
 18: 0,
 19: 0,
 20: 0,
 21: 0,
 22: 0,
 23: 0,
 24: 0,
 25: 0,
 26: 0,
 27: 0,
 28: 0,
 29: 0,
 30: 0,
 31: 0,
 32: 0,
 33: 0,
 34: 0,
 35: 0,
 36: 0,
 37: 0,
 38: 0,
 39: 0,
 40: 0,
 41: 0,
 42: 0,
 43: 0,
 44: 1,
 45: 0,
 46: 0,
 47: 0,
 48: 0,
 49: 1,
 50: 0,
 51: 1,
 52: 1,
 53: 1,
 54: 1,
 55: 0,
 56: 1,
 57: 1,
 58: 1,
 59: 1,
 60: 1,
 61: 1,
 62: 1,
 63: 1,
 64: 1,
 65: 1,
 66: 1,
 67: 1,
 68: 1,
 69: 1,
 70: 1,
 71: 1,
 72: 1,
 73: 1,
 74: 1,
 75: 1,
 76: 1,
 77: 1,
 78: 1,
 79: 1,
 80: 1,
 81: 1,
 82: 1,
 83: 1,
 84: 1,
 85: 1,
 86: 1,
 87: 1,
 88: 1,
 89: 1,
 90: 1,
 91: 1,
 92: 1,
 93: 1,
 94: 1,
 95: 1,
 96: 1,
 97: 1,
 98: 1,
 99: 1}

In [24]:
def _filter_ICM(ICM_all, selection_dict):
    
    selected_flag = np.zeros(ICM_all.shape[1], dtype=bool)

    for key, value in selection_dict.items():
        if value == 1:
            selected_flag[int(key)] = True

    selected_ICM = ICM_all.tocsc()[:,selected_flag].tocsr()
    
    return selected_ICM, selected_flag.sum()

In [25]:
ICM_selected, n_selected = _filter_ICM(ICM, sampleset.first.sample)

### Step 5: Create an instance of the Item-Based KNN recommendation model, fit it with the provided hyperparameters and evaluate it on the validation URM

The ItemKNNCBFRecommender model requires the User Rating Matrix (shape n_users x n_items) and the Item Content Matrix (shape n_items x n_features).
Given $F$ the set of item features, indexed from $0$ to $|F|-1$, and $|F|$ its cardinality, the item-item similarity between items $i$ and $j$ is computed as follows:

$$
S_{i,j} = \frac{\sum_{f=0}^{|F|-1} ICM_{i,f}ICM_{j,f}}{\sqrt{\sum_{f=0}^{|F|-1} ICM_{i,f}^2} \sqrt{\sum_{f=0}^{|F|-1} ICM_{j,f}^2} + shrink}
$$

Then, the 100 most similar items for each one are selected. Note that this process can be performed in blocks to save RAM.
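
A small dense sketch of this computation, using NumPy only, is given below. The function name `cosine_topk`, the toy ICM and the choice to exclude each item from its own neighborhood are illustrative assumptions; the `ItemKNNCBFRecommender` implementation may differ in its details.

```python
import numpy as np

def cosine_topk(ICM, shrink=5.0, topK=100):
    """Shrunk cosine item-item similarity, keeping topK neighbors per item (dense sketch)."""
    ICM = np.asarray(ICM, dtype=float)
    numerator = ICM @ ICM.T
    norms = np.sqrt((ICM ** 2).sum(axis=1))
    # A positive shrink also avoids a division by zero for items without features
    similarity = numerator / (np.outer(norms, norms) + shrink)
    # Assumption: an item is not counted among its own neighbors
    np.fill_diagonal(similarity, 0.0)
    if topK < similarity.shape[1]:
        # For each row, zero every entry except the topK largest ones
        drop = np.argsort(similarity, axis=1)[:, :-topK]
        np.put_along_axis(similarity, drop, 0.0, axis=1)
    return similarity

# Toy ICM (hypothetical): 4 items, 3 binary features
ICM_toy = np.array([[1.0, 0.0, 1.0],
                    [1.0, 1.0, 0.0],
                    [0.0, 1.0, 1.0],
                    [1.0, 1.0, 1.0]])
S = cosine_topk(ICM_toy, shrink=5.0, topK=2)
```

For the full dataset the same logic would be applied to sparse matrices, one block of items at a time.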

In [26]:
recommender_instance = ItemKNNCBFRecommender(URM_train, ICM_selected)
recommender_instance.fit(topK = 100, shrink = 5, similarity = 'cosine', normalize = True)

result_df, result_string = evaluator_validation.evaluateRecommender(recommender_instance)

ItemKNNCBFRecommender: URM Detected 29 ( 0.2%) items with no interactions.
ItemKNNCBFRecommender: ICM Detected 10050 (68.8%) items with no features.
Unable to load Cython Compute_Similarity, reverting to Python
Similarity column 14607 (100.0%), 4962.50 column/sec. Elapsed time 2.94 sec
EvaluatorHoldout: Processed 20426 (100.0%) in 13.57 sec. Users per second: 1505


In [27]:
"The NDCG@10 is {:.4f}".format(result_df.loc[10, "NDCG"])

'The NDCG@10 is 0.0215'

### Step 6: Optimize the hyperparameters to improve the effectiveness

In [32]:
def _run_experiment(hyperparameters, BQM, ICM, URM_train, evaluator_validation):
        
    k_largest, penalty = hyperparameters

    BQM_k = dimod.generators.combinations(BQM.num_variables, k_largest)*penalty
    BQM_k.update(BQM)

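    # Create the sampler locally so the function does not rely on a notebook global
    sampler = SimulatedAnnealingSampler()
    sampleset = sampler.sample(BQM_k, num_reads = 1000)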
    ICM_selected, n_selected = _filter_ICM(ICM, sampleset.first.sample)

    if n_selected == 0:
        return k_largest, penalty, 0.0, n_selected, sampleset.first.sample
    
    recommender_instance = ItemKNNCBFRecommender(URM_train, ICM_selected)
    recommender_instance.fit(topK = 100, shrink = 5, similarity = 'cosine', normalize = True)

    result_df, result_string = evaluator_validation.evaluateRecommender(recommender_instance)    

    print("k_largest {}, penalty {:.2E}: NDCG@10 is {:.4f}, selected {}".format(k_largest, penalty, result_df.loc[10, "NDCG"], n_selected))
    
    return k_largest, penalty, result_df.loc[10, "NDCG"], n_selected, sampleset.first.sample
    

In [39]:
# To run this optimization in parallel, _run_experiment must not be defined in the
# notebook; import it from a regular Python script instead:
# from run_experiment import _run_experiment

hyperparameter_list = itertools.product(range(75, n_features, 5), 
                                        [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2])

_run_experiment_partial = partial(_run_experiment,
                                  BQM = BQM,
                                  ICM = ICM,
                                  URM_train = URM_train,
                                  evaluator_validation = evaluator_validation)  
        
# pool = multiprocessing.Pool(processes=4, maxtasksperchild=1)
# result_list = pool.map(_run_experiment_partial, hyperparameter_list, chunksize=1)

# pool.close()
# pool.join()


result_list = []
for hyperp in hyperparameter_list:
    result_list.append(_run_experiment_partial(hyperp))

ItemKNNCBFRecommender: URM Detected 29 ( 0.2%) items with no interactions.
ItemKNNCBFRecommender: ICM Detected 12842 (87.9%) items with no features.
Unable to load Cython Compute_Similarity, reverting to Python
Similarity column 14607 (100.0%), 5826.19 column/sec. Elapsed time 2.51 sec
EvaluatorHoldout: Processed 20426 (100.0%) in 12.14 sec. Users per second: 1683
k_largest 75, penalty 1.00E-05: NDCG@10 is 0.0039, selected 17
ItemKNNCBFRecommender: URM Detected 29 ( 0.2%) items with no interactions.
ItemKNNCBFRecommender: ICM Detected 10432 (71.4%) items with no features.
Unable to load Cython Compute_Similarity, reverting to Python
Similarity column 14607 (100.0%), 5251.43 column/sec. Elapsed time 2.78 sec
EvaluatorHoldout: Processed 20426 (100.0%) in 13.70 sec. Users per second: 1491
k_largest 75, penalty 1.00E-04: NDCG@10 is 0.0132, selected 46
ItemKNNCBFRecommender: URM Detected 29 ( 0.2%) items with no interactions.
ItemKNNCBFRecommender: ICM Detected 8072 (55.3%) items with no fe

Similarity column 14607 (100.0%), 3572.63 column/sec. Elapsed time 4.09 sec
EvaluatorHoldout: Processed 20426 (100.0%) in 15.04 sec. Users per second: 1358
k_largest 85, penalty 1.00E-02: NDCG@10 is 0.0278, selected 79
ItemKNNCBFRecommender: URM Detected 29 ( 0.2%) items with no interactions.
ItemKNNCBFRecommender: ICM Detected 3043 (20.8%) items with no features.
Unable to load Cython Compute_Similarity, reverting to Python
Similarity column 14607 (100.0%), 3437.27 column/sec. Elapsed time 4.25 sec
EvaluatorHoldout: Processed 20426 (100.0%) in 15.13 sec. Users per second: 1350
k_largest 85, penalty 1.00E-01: NDCG@10 is 0.0280, selected 84
ItemKNNCBFRecommender: URM Detected 29 ( 0.2%) items with no interactions.
ItemKNNCBFRecommender: ICM Detected 2169 (14.8%) items with no features.
Unable to load Cython Compute_Similarity, reverting to Python
Similarity column 14607 (100.0%), 3302.46 column/sec. Elapsed time 4.42 sec
EvaluatorHoldout: Processed 20426 (100.0%) in 15.06 sec. Users per

EvaluatorHoldout: Processed 20426 (100.0%) in 15.62 sec. Users per second: 1308
k_largest 95, penalty 1.00E+01: NDCG@10 is 0.0377, selected 95
ItemKNNCBFRecommender: URM Detected 29 ( 0.2%) items with no interactions.
ItemKNNCBFRecommender: ICM Detected 185 ( 1.3%) items with no features.
Unable to load Cython Compute_Similarity, reverting to Python
Similarity column 14607 (100.0%), 2441.79 column/sec. Elapsed time 5.98 sec
EvaluatorHoldout: Processed 20426 (100.0%) in 15.63 sec. Users per second: 1307
k_largest 95, penalty 1.00E+02: NDCG@10 is 0.0304, selected 95


In [42]:
best_configuration = None
best_selected_dict = None
best_NDCG = None

for data in result_list:
    
    k_largest, penalty, NDCG, n_selected, selected_dict = data

    if best_NDCG is None or best_NDCG < NDCG:
        best_NDCG = NDCG
        best_configuration = (k_largest, penalty)
        best_selected_dict = selected_dict

In [43]:
k_largest, penalty = best_configuration
print("The overall optimal configuration found is: k_largest {}, penalty {}, NDCG@10 {:.4f}".format(k_largest, penalty, best_NDCG))

The overall optimal configuration found is: k_largest 85, penalty 10.0, NDCG@10 0.0391


### Step 7: Save the submission

In [47]:
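# Use the selection of the best configuration, not the one left over from the last loop iteration
selected_features = [key for key, value in best_selected_dict.items() if value==1]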

In [48]:
file_name = "{task}_{dataset}_{method}_{groupname}_{submissionID}.txt".format(task = "1B",
                                                                              dataset = "{}_ICM".format(n_features),
                                                                              method = "SA", 
                                                                              groupname = "example-group",
                                                                              submissionID = "000")

In [49]:
with open(file_name, 'w') as f:
    f.write("\n".join(map(str, selected_features)))