# About

This notebook is a sandbox for developing randomised perturbation techniques for explicit user-item ratings.  
These correspond to algorithms 1-4 from the Polat and Batmaz framework.

This is used only for testing and debugging and **shouldn't be used to generate perturbed datasets**.

In [1]:
import pandas as pd
import numpy as np
import math
from io import StringIO

# Random Perturbation Technique (RPT)

## Steps
1. select $\alpha$ (this should be sent by the server)
2. for each user $i$, compute the mean vote, the standard deviation, and the z-scores of the items they have rated
3. for each user $i$ and item $j$
    1. draw a random number $r_{ij}$ uniformly from the interval $[-\alpha, \alpha]$
    2. add it to that user's item-wise z-score: $z_{ij}'=z_{ij}+r_{ij}$
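
The steps above can be sketched on a toy data frame. This is a minimal illustration rather than the notebook's pipeline: the ratings are made up, and the seeded generator `rng` and the column name `r` are assumptions introduced here.

```python
import numpy as np
import pandas as pd

# Toy ratings; the values are made up for illustration.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "movie_id": [10, 20, 30, 10, 20, 40],
    "rating": [5, 3, 4, 2, 4, 5],
})

rng = np.random.default_rng(0)  # seeded generator (assumption, for reproducibility)
alpha = np.sqrt(3)              # step 1: would normally be sent by the server

# Step 2: per-user mean, std and z-scores over the rated items
by_user = toy.groupby("user_id")["rating"]
toy["user_mean"] = by_user.transform("mean")
toy["user_std"] = by_user.transform("std")
toy["z_rating"] = (toy["rating"] - toy["user_mean"]) / toy["user_std"]

# Step 3: draw r_ij ~ U[-alpha, alpha] and add it to each z-score
toy["r"] = rng.uniform(-alpha, alpha, size=len(toy))
toy["z_rating_noise"] = toy["z_rating"] + toy["r"]
```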
    
## Selecting $\alpha$
* Fixed range:
    1. choose the range $\gamma$
    2. choose the corresponding $\alpha$ s.t. $[-\alpha, \alpha]$ covers $\gamma$
    3. uniformly draw from $[-\alpha, \alpha]$
* Random range:
    1. choose the range $\gamma$
    2. randomly draw $\alpha$ from the range
    3. uniformly draw from $[-\alpha, \alpha]$
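
A small sketch of the two selection modes follows. It assumes one possible reading of the list above, namely that $\gamma$ acts as an upper bound on $\alpha$; the value of `gamma` and the names `alpha_fixed` and `alpha_random` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = np.sqrt(3)   # hypothetical upper bound of the range
n_draws = 5

# Fixed range: alpha is set once, here taken equal to gamma
alpha_fixed = gamma
noise_fixed = rng.uniform(-alpha_fixed, alpha_fixed, size=n_draws)

# Random range: alpha is itself drawn from [0, gamma], then used as the bound
alpha_random = rng.uniform(0, gamma)
noise_random = rng.uniform(-alpha_random, alpha_random, size=n_draws)
```

In the random-range mode the spread of the noise is itself random, so an observer who knows $\gamma$ still does not know the exact $\alpha$ that was used.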

In [2]:
df_ratings = pd.read_csv("../data/ml-1m/ratings.dat", sep="::", header=None, engine="python", usecols=[0,1,2])
df_ratings.columns = ['user_id', 'movie_id', 'rating']

In [3]:
n_users = df_ratings['user_id'].nunique()
n_items = df_ratings['movie_id'].nunique()
n_users, n_items

(6040, 3706)

In [4]:
df_ratings

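| | user_id | movie_id | rating |
|---|---|---|---|
| 0 | 1 | 1193 | 5 |
| 1 | 1 | 661 | 3 |
| 2 | 1 | 914 | 3 |
| 3 | 1 | 3408 | 4 |
| 4 | 1 | 2355 | 5 |
| ... | ... | ... | ... |
| 1000204 | 6040 | 1091 | 1 |
| 1000205 | 6040 | 1094 | 5 |
| 1000206 | 6040 | 562 | 5 |
| 1000207 | 6040 | 1096 | 4 |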


In [4]:
# we standardise all the ratings, so we can use sigma_max=1
# if we did not standardise, we could use sigma_max = df_ratings['rating'].max()
is_standard = True
sigma_max = 1 if is_standard else df_ratings['rating'].max()
sigma_max

1

In [5]:
df_ratings['z_rating'] = df_ratings.groupby('user_id')['rating'].transform(lambda x: (x - x.mean())/x.std())
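
One edge case of the per-user z-score: a user whose ratings are all identical has a standard deviation of 0, and a user with a single rating has an undefined one, so the division yields `NaN`. The sketch below shows one possible guard on toy data. The helper `safe_zscore` and the choice of mapping such users to a z-score of 0 are assumptions, not something the notebook does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"user_id": [1, 1, 1, 2, 2],
                    "rating": [5, 3, 4, 4, 4]})  # user 2 rates everything 4

def safe_zscore(x):
    """Hypothetical helper: z-score one user's ratings, mapping a zero or undefined std to 0."""
    std = x.std()
    if not np.isfinite(std) or std == 0:
        return pd.Series(0.0, index=x.index)
    return (x - x.mean()) / std

toy["z_rating"] = toy.groupby("user_id")["rating"].transform(safe_zscore)
```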

In [6]:
# in order to do reverse standardisation
df_ratings['user_std'] = df_ratings.groupby('user_id')['rating'].transform(lambda x: x.std())
df_ratings['user_mean'] = df_ratings.groupby('user_id')['rating'].transform(lambda x: x.mean())

In [7]:
cols_export = ['user_id', 'movie_id', 'rating', 'z_rating_noise', 'rating_noise']

# Fixed alpha, no masking

In [8]:
# fixed-alpha
alpha = np.sqrt(3) * sigma_max
df_ratings['z_rating_noise'] = df_ratings['z_rating'] + np.random.uniform(-alpha, alpha, df_ratings.shape[0])
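
The factor $\sqrt{3}$ comes from the variance of a uniform distribution: $U[-\alpha, \alpha]$ has variance $\alpha^2/3$, so $\alpha = \sqrt{3}\,\sigma$ gives noise with standard deviation $\sigma$. A quick empirical check, with a sample size and seed chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma_target = 1.0
alpha = np.sqrt(3) * sigma_target

# Draw a large sample and compare its standard deviation with the target
sample = rng.uniform(-alpha, alpha, size=1_000_000)
empirical_std = sample.std()
```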

In [9]:
# verify that the groupwise mean of z_rating_noise is close to 0
df_ratings_agg = df_ratings.groupby(['user_id']).agg({'z_rating_noise': ['mean']}).reset_index(col_level=1)
df_ratings_agg.columns = df_ratings_agg.columns.droplevel(0)

In [10]:
df_ratings['rating_noise'] = df_ratings['z_rating_noise'] * df_ratings['user_std'] + df_ratings['user_mean']
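
Reversing the standardisation means that noise $r$ on the z-score shifts the rating by $r \cdot \text{user\_std}$, so users with more spread-out ratings receive proportionally larger perturbations. A toy sketch, using a constant noise value purely for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"user_id": [1, 1, 1, 2, 2, 2],
                    "rating": [5, 3, 4, 2, 4, 5]})
g = toy.groupby("user_id")["rating"]
toy["user_mean"] = g.transform("mean")
toy["user_std"] = g.transform("std")
toy["z_rating"] = (toy["rating"] - toy["user_mean"]) / toy["user_std"]

# Without noise, reversing the standardisation recovers the ratings
recovered = toy["z_rating"] * toy["user_std"] + toy["user_mean"]

# With noise r, the rating shifts by r * user_std
r = np.full(len(toy), 0.5)  # constant noise, for illustration only
rating_noise = (toy["z_rating"] + r) * toy["user_std"] + toy["user_mean"]
shift = rating_noise - toy["rating"]
```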

In [13]:
df_ratings[cols_export].to_csv('../data/ml-1m-obfuscation/ratings_obf1_fixed_alpha.csv', index=False)

In [None]:
# df_ratings['alpha'] = df_ratings.groupby('user_id').transform(alpha)

In [11]:
df_small = df_ratings.sample(1000).copy()

In [12]:
df_small

| | user_id | movie_id | rating | z_rating | user_std | user_mean | z_rating_noise | rating_noise |
|---|---|---|---|---|---|---|---|---|
| 191584 | 1181 | 2013 | 2 | -0.909061 | 0.897531 | 2.815911 | -0.944233 | 1.968432 |
| 891599 | 5387 | 2518 | 3 | -0.515941 | 1.060577 | 3.547196 | -1.161372 | 2.315471 |
| 671389 | 4033 | 2700 | 3 | -0.757592 | 1.012909 | 3.767372 | -0.984410 | 2.770254 |
| 3171 | 24 | 3448 | 5 | 1.269303 | 0.828384 | 3.948529 | 1.363415 | 5.077960 |
| 141401 | 911 | 991 | 4 | 0.164146 | 0.800681 | 3.868571 | -0.324322 | 3.608893 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 233137 | 1418 | 2553 | 3 | 0.048585 | 1.007541 | 2.951049 | -0.855722 | 2.088874 |
| 388795 | 2279 | 592 | 5 | 1.250489 | 1.089098 | 3.638095 | 1.059553 | 4.792053 |
| 895621 | 5412 | 1097 | 3 | -0.700245 | 1.227731 | 3.859712 | 0.673679 | 4.686809 |
| 526026 | 3250 | 2599 | 3 | -0.223026 | 1.084787 | 3.241935 | -1.053485 | 2.099128 |


# Random alpha, no masking

In [None]:
# group by user id
grouped = df_ratings.groupby('user_id')

In [32]:
# random alpha
#df_ratings['alpha'] = df_ratings.groupby('user_id')['user_id'].transform(lambda x: np.random.uniform(0, beta))
#df_ratings['z_rating_noise'] = df_ratings['z_rating'] + np.random.uniform(-df_ratings['alpha'], df_ratings['alpha'])

In [15]:
# randomly choose a randomisation treatment for each user: coin flip between Gaussian and uniform
df_ratings['treatment'] = grouped['user_id'].transform(lambda x: np.random.binomial(1, 0.5))
# randomly choose a sigma for each user
df_ratings['sigma'] = grouped['user_id'].transform(lambda x: np.random.uniform(0, sigma_max))

In [22]:
# generate noise for the samples that receive the uniform random treatment
# U[-a, a] has standard deviation a/sqrt(3), so a = sqrt(3)*sigma yields standard deviation sigma (consistent with the fixed-alpha case)
mask_uniform = df_ratings['treatment'] == 0
bound = np.sqrt(3) * df_ratings.loc[mask_uniform, 'sigma'].to_numpy()
df_ratings.loc[mask_uniform, 'noise'] = np.random.uniform(-bound, bound)

In [23]:
# generate noise for the samples that receive the Gaussian random treatment
df_ratings.loc[df_ratings['treatment'] == 1, 'noise'] = np.random.normal(0, df_ratings[df_ratings['treatment'] == 1]['sigma'])

In [24]:
# add the noise
df_ratings['z_rating_noise'] = df_ratings['z_rating'] + df_ratings['noise']
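
The per-user treatment can be sketched on synthetic data as follows. This assumes the uniform bound is $\sqrt{3}\,\sigma$, matching the fixed-alpha case, so that both branches have standard deviation $\sigma$. The frames `users` and `rows` and the 1000 rows per user are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
sigma_max = 1.0

# One row per user: treatment 0 = uniform, 1 = Gaussian; sigma drawn from [0, sigma_max]
users = pd.DataFrame({"user_id": [1, 2, 3, 4]})
users["treatment"] = rng.binomial(1, 0.5, size=len(users))
users["sigma"] = rng.uniform(0, sigma_max, size=len(users))

# Toy ratings table: 1000 rows per user, joined with the per-user parameters
rows = pd.DataFrame({"user_id": np.repeat(users["user_id"].to_numpy(), 1000)})
rows = rows.merge(users, on="user_id", how="left")

# Both branches target standard deviation sigma:
# U[-sqrt(3)*sigma, sqrt(3)*sigma] and N(0, sigma)
bound = np.sqrt(3) * rows["sigma"].to_numpy()
uniform_noise = rng.uniform(-bound, bound)
gaussian_noise = rng.normal(0.0, rows["sigma"].to_numpy())
rows["noise"] = np.where(rows["treatment"] == 0, uniform_noise, gaussian_noise)
```

Drawing both candidate noise vectors and selecting with `np.where` avoids the boolean-mask bookkeeping, at the cost of generating twice as many random numbers.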

In [25]:
# check that it generates one sigma per user (i.e. groupwise)
df_ratings['sigma'].nunique()

6040

In [26]:
# check that the z_rating_noise values are (almost) all distinct, i.e. the noise is drawn per rating rather than per user
df_ratings['z_rating_noise'].nunique()

1000209

In [27]:
# check that the mean of the scrambled distribution for each user stays close to 0
df_ratings.groupby('user_id')['z_rating_noise'].mean().describe()

count    6040.000000
mean       -0.000737
std         0.079570
min        -0.470342
25%        -0.032229
50%        -0.000621
75%         0.032356
max         0.765541
Name: z_rating_noise, dtype: float64

In [22]:
df_ratings['rating_noise'] = df_ratings['z_rating_noise'] * df_ratings['user_std'] + df_ratings['user_mean']

In [23]:
df_ratings.groupby('user_id')[['rating', 'rating_noise']].mean().describe()

| | rating | rating_noise |
|---|---|---|
| count | 6040.0 | 6040.0 |
| mean | 3.702705 | 3.701852 |
| std | 0.429622 | 0.437718 |
| min | 1.015385 | 1.012373 |
| 25% | 3.444444 | 3.439453 |
| 50% | 3.735294 | 3.735794 |
| 75% | 4.0 | 4.000616 |
| max | 4.962963 | 4.961408 |


In [25]:
df_ratings

| | user_id | movie_id | rating | z_rating | user_std | user_mean | z_rating_noise | rating_noise | treatment | sigma | noise |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 1.191425 | 0.680967 | 4.188679 | 1.105785 | 4.941682 | 1 | 0.312498 | -0.085640 |
| 1 | 1 | 661 | 3 | -1.745576 | 0.680967 | 4.188679 | -1.853954 | 2.926199 | 1 | 0.312498 | -0.108377 |
| 2 | 1 | 914 | 3 | -1.745576 | 0.680967 | 4.188679 | -1.715603 | 3.020411 | 1 | 0.312498 | 0.029973 |
| 3 | 1 | 3408 | 4 | -0.277076 | 0.680967 | 4.188679 | -0.852857 | 3.607912 | 1 | 0.312498 | -0.575781 |
| 4 | 1 | 2355 | 5 | 1.191425 | 0.680967 | 4.188679 | 1.187340 | 4.997218 | 1 | 0.312498 | -0.004085 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1000204 | 6040 | 1091 | 1 | -2.185022 | 1.179719 | 3.577713 | -1.711541 | 1.558575 | 0 | 0.897242 | 0.473481 |
| 1000205 | 6040 | 1094 | 5 | 1.205615 | 1.179719 | 3.577713 | 0.377127 | 4.022616 | 0 | 0.897242 | -0.828488 |
| 1000206 | 6040 | 562 | 5 | 1.205615 | 1.179719 | 3.577713 | 0.564847 | 4.244073 | 0 | 0.897242 | -0.640769 |
| 1000207 | 6040 | 1096 | 4 | 0.357956 | 1.179719 | 3.577713 | 0.330450 | 3.967551 | 0 | 0.897242 | -0.027506 |


In [24]:
df_ratings[cols_export].to_csv('../data/ml-1m-obfuscation/ratings_obf2_random_alpha.csv', index=False)

# Fixed alpha, with masking

In [9]:
# group by user id
grouped = df_ratings.groupby('user_id')

In [10]:
# fixed-alpha
alpha = np.sqrt(3) * sigma_max
# choose a beta max based on its user-wise distribution
beta_max = grouped.size().max()/n_items
beta = np.random.uniform(0, beta_max)

In [40]:
# initialise the user-wise data frame for the unrated movies
df_users_unrated = pd.DataFrame()
df_users_unrated['n_rated'] = grouped.size()
# fillna returns a new Series, so the result has to be assigned back
df_users_unrated['n_rated'] = df_users_unrated['n_rated'].fillna(0)
# for each user, collect the movie ids they have not rated
all_items = set(df_ratings['movie_id'].unique())
df_users_unrated['unselected_set'] = grouped['movie_id'].agg(list).apply(
    lambda s: list(all_items.difference(set(s))))

In [49]:
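# for each user, randomly choose a random number of unrated movie ids
# movie ids are not the contiguous range 0..n_items-1, so build the item set from the ids actually present
all_items = set(df_ratings['movie_id'].unique())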
# user-wise set of unselected movie-ids
df_users_unrated = grouped['movie_id'].agg(list).apply(lambda s: list(all_items.difference(set(s)))).reset_index()
df_users_unrated.columns = ['user_id', 'unselected_set']
# pick a random subset of these unselected movies
df_users_unrated['unselected_subset'] = df_users_unrated.apply(lambda x: np.random.choice(x['unselected_set'], int(beta*(n_items-len(x['unselected_set']))), replace=False), axis=1)
df_users_unrated.drop(columns=['unselected_set'], inplace=True)
df_users_unrated.rename(columns={'unselected_subset':'movie_id'}, inplace=True)
# explode the table for each user_id and unrated movie_id => a data frame of users and a random selection of their unrated movies
df_users_unrated = df_users_unrated.explode('movie_id', ignore_index=True)
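
The masking step can be illustrated on a toy frame: for each user, sample `int(beta * n_rated)` movie ids from those they have not rated, then explode to one row per (user, movie) pair. The helper `pick_unrated`, the value of `beta`, and the `min(...)` guard against asking for more items than are available are assumptions added for this sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 1, 2, 2],
    "movie_id": [10, 20, 30, 40, 10, 50],
})
all_items = set(toy["movie_id"].unique())   # ids actually present in the data
beta = 0.5                                  # hypothetical masking ratio

def pick_unrated(rated):
    """Hypothetical helper: sample int(beta * n_rated) ids the user has not rated."""
    unrated = sorted(all_items - set(rated))
    k = min(int(beta * len(rated)), len(unrated))
    return rng.choice(unrated, size=k, replace=False).tolist()

synthetic = (toy.groupby("user_id")["movie_id"].agg(list)
                .apply(pick_unrated)
                .reset_index()
                .explode("movie_id", ignore_index=True)
                .dropna(subset=["movie_id"]))  # users with an empty sample explode to NaN
```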

In [51]:
# Generate random ratings for the unrated items
ratings_distribution = list((df_ratings['rating'].value_counts()/len(df_ratings)).sort_index())
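# the probabilities must be passed as p=...; the third positional argument of np.random.choice is `replace`
df_users_unrated['rating'] = np.random.choice(list(range(1,6)), len(df_users_unrated), p=ratings_distribution)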
# normalise ratings
df_users_unrated['z_rating'] = df_users_unrated.groupby('user_id')['rating'].transform(lambda x: (x - x.mean())/x.std())
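
Sampling synthetic ratings from the empirical rating distribution can be checked on toy data. Note that the probabilities are passed through the `p` keyword, since the third positional argument of `choice` is `replace`. The `observed` ratings below are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
observed = pd.Series([5, 5, 5, 4, 4, 3, 3, 2, 1, 5])  # toy ratings

# Empirical probability of each star value, ordered 1..5
probs = observed.value_counts(normalize=True).reindex(range(1, 6), fill_value=0.0)

# Draw synthetic ratings with the same distribution as the observed ones
synthetic = rng.choice(probs.index.to_numpy(), size=10_000, p=probs.to_numpy())
synthetic_share = pd.Series(synthetic).value_counts(normalize=True).sort_index()
```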

In [40]:
# add the noise to the rated items
df_ratings['z_rating_noise'] = df_ratings['z_rating'] + np.random.uniform(-alpha, alpha, df_ratings.shape[0])
# add the noise to the synthetically rated items
df_users_unrated['z_rating_noise'] = df_users_unrated['z_rating'] + np.random.uniform(-alpha, alpha, df_users_unrated.shape[0])

In [56]:
# De-normalise rated items with noise
df_ratings['rating_noise'] = df_ratings['z_rating_noise'] * df_ratings['user_std'] + df_ratings['user_mean']

In [69]:
# De-normalise synthetically rated items with noise - we need the mean and std from the rated items
# user-wise ratings descriptive stats (mean, std); both columns are constant per user, so 'min' just picks that value
df_user_ratings_desc_stats = df_ratings.groupby('user_id')[['user_mean', 'user_std']].agg('min').reset_index()
df_users_unrated = df_users_unrated.merge(df_user_ratings_desc_stats, on='user_id', how='left')
df_users_unrated['rating_noise'] = df_users_unrated['z_rating_noise'] * df_users_unrated['user_std'] + df_users_unrated['user_mean']
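
The synthetic items are de-normalised with the statistics of the user's real ratings, not those of the synthetic ones. A toy sketch of the merge, using named aggregation as an alternative way to obtain the per-user statistics (the frames `rated` and `synthetic` are made up):

```python
import pandas as pd

rated = pd.DataFrame({"user_id": [1, 1, 2, 2],
                      "rating": [5, 3, 2, 4]})

# Per-user mean and std of the real ratings
stats = (rated.groupby("user_id")["rating"]
              .agg(user_mean="mean", user_std="std")
              .reset_index())

# Synthetic rows carry only a noisy z-score; attach the real-rating stats per user
synthetic = pd.DataFrame({"user_id": [1, 2, 2],
                          "z_rating_noise": [0.0, 1.0, -1.0]})
synthetic = synthetic.merge(stats, on="user_id", how="left")
synthetic["rating_noise"] = (synthetic["z_rating_noise"] * synthetic["user_std"]
                             + synthetic["user_mean"])
```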

In [None]:
# add the original/synthetic flag
df_ratings['is_original'] = True
df_users_unrated['is_original'] = False
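# guard against appending the flag twice when this cell (or the equivalent one in another section) is re-run
if 'is_original' not in cols_export:
    cols_export.append('is_original')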

In [89]:
# concatenate the original and synthetic data with noise
df_ratings_agg = pd.concat([df_ratings[cols_export], df_users_unrated[cols_export]])
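
A toy sketch of the flag-and-concatenate step. It simulates a cell re-run to show why guarding the `append` matters, and it passes `ignore_index=True` so the combined frame gets a fresh index; that option is a suggestion here, not what the cell above does.

```python
import pandas as pd

cols_export = ["user_id", "movie_id", "rating_noise"]
original = pd.DataFrame({"user_id": [1], "movie_id": [10], "rating_noise": [4.2]})
synthetic = pd.DataFrame({"user_id": [1], "movie_id": [20], "rating_noise": [3.1]})

original["is_original"] = True
synthetic["is_original"] = False

for _ in range(2):  # simulate running the cell twice
    if "is_original" not in cols_export:
        cols_export.append("is_original")

combined = pd.concat([original[cols_export], synthetic[cols_export]], ignore_index=True)
```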

In [91]:
df_ratings_agg[cols_export].to_csv('../data/ml-1m-obfuscation/ratings_obf3_fixed_alpha_with_mask.csv', index=False)

# Random alpha, with masking

In [13]:
# group by user id
grouped = df_ratings.groupby('user_id')

In [14]:
# randomly choose a randomisation treatment for each user: coin flip between Gaussian and uniform
df_ratings['treatment'] = grouped['user_id'].transform(lambda x: np.random.binomial(1, 0.5))
# randomly choose a sigma for each user
df_ratings['sigma'] = grouped['user_id'].transform(lambda x: np.random.uniform(0, sigma_max))

In [15]:
# choose a beta max based on its user-wise distribution
beta_max = grouped.size().max()/n_items
# generate a beta for each user
df_ratings['beta'] = grouped['user_id'].transform(lambda x: np.random.uniform(0, beta_max))

In [16]:
# initialise user-wise data frame for each unrated movie: conserve the beta, sigma, treatment
df_users_unrated = grouped[['beta', 'sigma', 'treatment']].agg('min').reset_index()
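# for each user, randomly choose a random number of unrated movie ids
# movie ids are not the contiguous range 0..n_items-1, so build the item set from the ids actually present
all_items = set(df_ratings['movie_id'].unique())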
df_users_unrated['unselected_set'] = grouped['movie_id'].agg(list).apply(lambda s: list(all_items.difference(set(s)))).reset_index()['movie_id']
df_users_unrated['unselected_subset'] = df_users_unrated.apply(lambda x: np.random.choice(x['unselected_set'], int(x['beta']*(n_items-len(x['unselected_set']))), replace=False), axis=1)
df_users_unrated.drop(columns=['unselected_set'], inplace=True)
df_users_unrated.rename(columns={'unselected_subset':'movie_id'}, inplace=True)
# explode the table for each user_id and unrated movie_id
df_users_unrated = df_users_unrated.explode('movie_id', ignore_index=True)

In [18]:
# generate random ratings (1-5 stars) for the unrated movies, using the distribution of the rated ones
ratings_distribution = list((df_ratings['rating'].value_counts()/len(df_ratings)).sort_index())
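# the probabilities must be passed as p=...; the third positional argument of np.random.choice is `replace`
df_users_unrated['rating'] = np.random.choice(list(range(1,6)), len(df_users_unrated), p=ratings_distribution)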
# normalise ratings
df_users_unrated['z_rating'] = df_users_unrated.groupby('user_id')['rating'].transform(lambda x: (x - x.mean())/x.std())

In [19]:
# generate noise for the samples that receive the uniform random treatment
# U[-a, a] has standard deviation a/sqrt(3), so a = sqrt(3)*sigma yields standard deviation sigma (consistent with the fixed-alpha case)
mask_uniform = df_users_unrated['treatment'] == 0
bound = np.sqrt(3) * df_users_unrated.loc[mask_uniform, 'sigma'].to_numpy()
df_users_unrated.loc[mask_uniform, 'noise'] = np.random.uniform(-bound, bound)

In [20]:
# generate noise for the samples that receive the Gaussian random treatment
df_users_unrated.loc[df_users_unrated['treatment'] == 1, 'noise'] = np.random.normal(0, df_users_unrated[df_users_unrated['treatment'] == 1]['sigma'])

In [21]:
# add the noise
df_users_unrated['z_rating_noise'] = df_users_unrated['z_rating'] + df_users_unrated['noise']

In [45]:
# check that it generates one sigma per user (i.e. groupwise)
df_users_unrated['sigma'].nunique()

6040

In [46]:
# check that the z_rating_noise values are (almost) all distinct, i.e. the noise is drawn per rating rather than per user
df_users_unrated['z_rating_noise'].nunique()

303202

In [47]:
# check that the mean of the scrambled distribution for each user stays close to 0
df_users_unrated.groupby('user_id')['z_rating_noise'].mean().describe()

count    5699.000000
mean       -0.000669
std         0.174612
min        -1.166566
25%        -0.061302
50%        -0.000273
75%         0.059241
max         1.347243
Name: z_rating_noise, dtype: float64

In [48]:
len(df_users_unrated.groupby('user_id'))

6040

At the end of this approach, we obtain two datasets: one with the obfuscated initial ratings and one with the obfuscated synthetic ratings.

In [28]:
# De-normalise rated items with noise
df_ratings['rating_noise'] = df_ratings['z_rating_noise'] * df_ratings['user_std'] + df_ratings['user_mean']

In [29]:
# De-normalise synthetically rated items with noise - we need the mean and std from the rated items
# user-wise ratings descriptive stats (mean, std); both columns are constant per user, so 'min' just picks that value
df_user_ratings_desc_stats = df_ratings.groupby('user_id')[['user_mean', 'user_std']].agg('min').reset_index()
df_users_unrated = df_users_unrated.merge(df_user_ratings_desc_stats, on='user_id', how='left')
df_users_unrated['rating_noise'] = df_users_unrated['z_rating_noise'] * df_users_unrated['user_std'] + df_users_unrated['user_mean']

In [30]:
# add the original/synthetic flag
df_ratings['is_original'] = True
df_users_unrated['is_original'] = False
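# guard against appending the flag twice when this cell (or the equivalent one in another section) is re-run
if 'is_original' not in cols_export:
    cols_export.append('is_original')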

In [33]:
# concatenate the original and synthetic data with noise
df_ratings_agg = pd.concat([df_ratings[cols_export], df_users_unrated[cols_export]])

In [35]:
df_ratings_agg[cols_export].to_csv('../data/ml-1m-obfuscation/ratings_obf4_random_alpha_with_mask.csv', index=False)