## This file pre-processes the raw data in the following steps:
1. Filter the raw data: eliminate items with fewer than 5 ratings and users who have rated fewer than 5 movies.
2. Holdout ratings: split the dataset into training and test sets in a proportion of approximately 80:20.
3. Split the training set into three user groups by their taste distribution. The G1 user group represents blockbuster-focused users, the G2 user group represents diverse-taste users and the G3 user group represents niche-focused users.

In [25]:
import os
import sys
import pandas as pd
import numpy as np


In [3]:
def get_count(tp, id):
    # compute the frequency of given id(users/items)
    playcount_groupbyid = tp[[id]].groupby(id, as_index=True)
    count = playcount_groupbyid.size()
    return count

In [4]:
def filter_triplets(tp, min_uc=5, min_sc=0):
    # Only keep the triplets for items which were clicked on by at least min_sc users.
    if min_sc > 0:
        # count the number of occurrences of each movie id:
        # itemcount is indexed by movie id and holds the corresponding counts
        itemcount = get_count(tp, 'movieId')
        # keep ratings of movies which occur at least min_sc times
        tp = tp[tp['movieId'].isin(itemcount.index[itemcount >= min_sc])]
    
    # Only keep the triplets for users who clicked on at least min_uc items.
    # After doing this, some of the items will have fewer than min_uc users, but this should only be a small proportion.
    if min_uc > 0:
        usercount = get_count(tp, 'userId')
        # keep ratings of users who have rated at least min_uc items
        tp = tp[tp['userId'].isin(usercount.index[usercount >= min_uc])]
    
    # Update both usercount and itemcount after filtering
    usercount, itemcount = get_count(tp, 'userId'), get_count(tp, 'movieId') 
    return tp, usercount, itemcount
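
To see the effect of the user filter, here is a minimal sketch on a toy ratings table. The column names follow the cells above; the data and the helper name `keep_active_users` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy ratings (hypothetical): user 1 rated 5 movies, user 2 rated only 2.
toy = pd.DataFrame({
    'userId':  [1, 1, 1, 1, 1, 2, 2],
    'movieId': [10, 11, 12, 13, 14, 10, 11],
    'rating':  [5, 4, 3, 4, 5, 2, 3],
})

def keep_active_users(tp, min_uc=5):
    # count ratings per user, then keep rows of users with at least min_uc ratings
    usercount = tp.groupby('userId').size()
    return tp[tp['userId'].isin(usercount.index[usercount >= min_uc])]

filtered = keep_active_users(toy)
print(filtered['userId'].unique())  # only user 1 remains
```

With the default `min_uc=5`, both rows of user 2 are dropped and all five rows of user 1 are kept.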

In [5]:
# load the dataset
DATA_DIR = 'raw_data/ml-1m/'
raw_data = pd.read_csv(os.path.join(DATA_DIR, 'ratings.csv'), header=0)
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype
---  ------     --------------    -----
 0   userId     1000209 non-null  int64
 1   movieId    1000209 non-null  int64
 2   rating     1000209 non-null  int64
 3   timestamp  1000209 non-null  int64
dtypes: int64(4)
memory usage: 30.5 MB


In [6]:
raw_data, user_activity, item_popularity = filter_triplets(raw_data)
sparsity = 1. * raw_data.shape[0] / (user_activity.shape[0] * item_popularity.shape[0])

print("After filtering, there are %d watching events from %d users and %d movies (sparsity: %.3f%%)" % 
      (raw_data.shape[0], user_activity.shape[0], item_popularity.shape[0], sparsity * 100))

After filtering, there are 1000209 watching events from 6040 users and 3706 movies (sparsity: 4.468%)
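
The percentage printed above is the share of user-movie pairs that actually carry a rating. A short check of that arithmetic, using only the three numbers from the output (the variable names are illustrative):

```python
n_ratings, n_users, n_movies = 1000209, 6040, 3706   # values printed above

# fraction of the user x movie matrix that holds a rating
filled = n_ratings / (n_users * n_movies)
print('%.3f%%' % (filled * 100))   # 4.468%
```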


In [7]:
def split_train_test_proportion(data, test_prop=0.2, randomSeed=98765):
    data_grouped_by_user = data.groupby('userId')
    tr_list, te_list = list(), list()

    np.random.seed(randomSeed)

    for i, (_, group) in enumerate(data_grouped_by_user):
        n_items_u = len(group)

        if n_items_u >= 5:
            idx = np.zeros(n_items_u, dtype='bool')
            idx[np.random.choice(n_items_u, size=int(test_prop * n_items_u), replace=False).astype('int64')] = True

            tr_list.append(group[np.logical_not(idx)])
            te_list.append(group[idx])
        else:
            tr_list.append(group)

        if i % 1000 == 0:
            print("%d users sampled" % i)
            sys.stdout.flush()

    data_tr = pd.concat(tr_list)
    data_te = pd.concat(te_list)
    
    return data_tr, data_te

## Holdout ratings

In [8]:
# split dataset, train:test = 80:20
data_train, data_test = split_train_test_proportion(raw_data)
print('Training set contains: {0:.2f}% ratings,\nTest set contains: {1:.2f}% ratings'.format(100*data_train.shape[0]/raw_data.shape[0], 100*data_test.shape[0]/raw_data.shape[0]))

0 users sampled
1000 users sampled
2000 users sampled
3000 users sampled
4000 users sampled
5000 users sampled
6000 users sampled
Training set contains: 80.24% ratings,
Test set contains: 19.76% ratings
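
The test share lands slightly below 20% because `int(test_prop * n_items_u)` truncates each user's test size. A small sketch of that arithmetic, with made-up numbers of ratings per user:

```python
test_prop = 0.2
# hypothetical numbers of ratings per user
user_sizes = [5, 9, 23, 104]

# int() truncates toward zero, so each user contributes at most 20% of their ratings
test_sizes = [int(test_prop * n) for n in user_sizes]
test_share = 100 * sum(test_sizes) / sum(user_sizes)

print(test_sizes)
print('test share: %.2f%%' % test_share)
```

Every user whose rating count is not a multiple of 5 gives up a little less than 20%, so the overall test share can never exceed 20%.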


In [9]:
# remove timestamps in the saved files
data_train = data_train.drop(['timestamp'],axis=1)
data_test = data_test.drop(['timestamp'],axis=1)
data_train.to_csv('processed_data/data_train.csv',header=False, index=False)
data_test.to_csv('processed_data/data_test.csv',header=False, index=False)

## Split user groups based on user's taste distribution

First, we need to compute the popularity of items. We define popularity by interaction count: the more people interact with an item, the more popular it is. We therefore use the number of ratings of each movie to represent its popularity, and normalise this count to the range [0,1] by dividing it by the count of the most popular item.
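
As a minimal illustration of this max-normalisation, assuming a toy ratings table (the data below is invented; only the column names match the notebook):

```python
import pandas as pd

# hypothetical ratings: movie 10 has 3 ratings, movie 11 has 2, movie 12 has 1
toy = pd.DataFrame({'userId':  [1, 2, 3, 1, 2, 3],
                    'movieId': [10, 10, 10, 11, 11, 12]})

counts = toy.groupby('movieId').size().sort_values(ascending=False)
popularity = counts / counts.max()   # the most popular item gets 1.0
print(popularity)
```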

We assign all items to three groups, based on the list of movie ids sorted by popularity (descending):

1. H items: the items at the top of the sorted list that together account for 20% of the total number of ratings.
2. M items: the items ranked after the head items that together account for 60% of the total number of ratings.
3. T items: the items at the bottom of the sorted list that together account for the remaining 20% of the total number of ratings.

After this assignment, we create a new column that stores the item-group label for each rating in the dataset. The column 'movies_pop' takes the value 1, 2 or 3, which represents H, M or T items respectively.
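
The grouping and labelling can be sketched on a toy popularity list as follows. This is only an illustration under stated assumptions: the counts are invented, and the cut-off rule (the item that crosses the threshold is still included in the group) is chosen to mirror the cell below.

```python
import numpy as np
import pandas as pd

# hypothetical rating counts per movie, already sorted by popularity (descending)
counts = pd.Series([40, 20, 15, 10, 8, 4, 3],
                   index=[101, 102, 103, 104, 105, 106, 107])
total = counts.sum()                      # 100 ratings in total
cum = counts.cumsum().to_numpy()

# number of top movies needed for the cumulative count to reach 20% / 80%
h_end = int(np.searchsorted(cum, 0.2 * total)) + 1
m_end = int(np.searchsorted(cum, 0.8 * total)) + 1

labels = pd.Series(3, index=counts.index)  # default: T items
labels.iloc[:m_end] = 2                    # M items
labels.iloc[:h_end] = 1                    # H items
print(labels.to_dict())
```

Here movie 101 alone already covers 20% of the ratings, so it is the only H item; movies 102 to 104 are M items and the rest are T items.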

In [10]:
# sort all movie ids by popularity (descending)
raw_data_movieId = raw_data.groupby('movieId', as_index=True)
sorted_movieId = raw_data_movieId.size().sort_values(ascending=False)
# cumulative rating counts at which the H and M groups end
H_threshold = raw_data.shape[0]*0.2 # end of H items: 20% of all ratings
M_threshold = raw_data.shape[0]*0.8 # end of M items: 80% of all ratings

# find the end index of H items and M items in the sorted list:
# the smallest number of top movies whose ratings sum to at least the threshold
cum_counts = sorted_movieId.cumsum().to_numpy()
H_endindex = int(np.searchsorted(cum_counts, H_threshold)) + 1
M_endindex = int(np.searchsorted(cum_counts, M_threshold)) + 1

# max normalise the popularity to range [0,1]
sorted_movieId = sorted_movieId/max(sorted_movieId) 
# store the movie ids into different groups
H_movieId = sorted_movieId[:H_endindex]
M_movieId = sorted_movieId[H_endindex:M_endindex]
T_movieId = sorted_movieId[M_endindex:]
# create labels for each type of movies
movies_pop = raw_data['movieId'].isin(H_movieId.index)*1+raw_data['movieId'].isin(M_movieId.index)*2+raw_data['movieId'].isin(T_movieId.index)*3
raw_data['movies_pop'] = movies_pop

In [11]:
OUTPUT_DIR = 'processed_data'

In [12]:
with open(os.path.join(OUTPUT_DIR, 'VAE_CF/unique_movieId.txt'), 'w') as f:
    for movieId in sorted_movieId.index:
        f.write('%s\n' % movieId)
with open(os.path.join(OUTPUT_DIR, 'VAE_CF/unique_userId.txt'), 'w') as f:
    for userId in user_activity.index:
        f.write('%s\n' % userId)

In [24]:
with open(os.path.join(OUTPUT_DIR, 'VAE_CF/MovieCategories.txt'), 'w') as f:
    for movieId in H_movieId.index:
        f.write('%s,%s\n' % (movieId,'H'))
    for movieId in M_movieId.index:
        f.write('%s,%s\n' % (movieId,'M'))
    for movieId in T_movieId.index:
        f.write('%s,%s\n' % (movieId,'T'))

In [13]:
# compute the taste distribution of all users
# (rows are collected in lists because DataFrame.append was removed in pandas 2.0)
user_ids, H_ratios, T_ratios = [], [], []
for userId, user_subset in raw_data.groupby('userId', sort=False):
    total_views = user_subset.shape[0]
    user_ids.append(userId)
    H_ratios.append((user_subset['movies_pop'] == 1).sum()/total_views)
    T_ratios.append((user_subset['movies_pop'] == 3).sum()/total_views)
user_taste_dist = pd.DataFrame({'H_ratio': H_ratios,'T_ratio': T_ratios}, index=user_ids)
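
The same per-user ratios can also be obtained without an explicit loop. The following is a sketch of one possible alternative using `pd.crosstab`, not the notebook's own method; the toy data is invented and only the column names follow the cells above.

```python
import pandas as pd

# hypothetical labelled ratings: user 1 has two H, one M and one T rating
toy = pd.DataFrame({
    'userId':     [1, 1, 1, 1, 2, 2],
    'movies_pop': [1, 1, 2, 3, 2, 3],
})

# row-normalised contingency table: share of each item group per user
shares = pd.crosstab(toy['userId'], toy['movies_pop'], normalize='index')
# keep the H (1) and T (3) columns; a label that never occurs would be filled with 0
taste = shares.reindex(columns=[1, 3], fill_value=0.0)
taste.columns = ['H_ratio', 'T_ratio']
print(taste)
```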

In [21]:
user_taste_dist.to_csv(os.path.join(OUTPUT_DIR,'VAE_CF/user_taste_dist.csv'),header=True, index=True)

In [None]:
user_taste_dist = user_taste_dist.sort_values(by='H_ratio',axis=0, ascending=False)
H_ratio_mean = np.mean(user_taste_dist['H_ratio'])*100
T_ratio_mean = np.mean(user_taste_dist['T_ratio'])*100
M_ratio_mean = 100 - H_ratio_mean - T_ratio_mean
print('the overall taste distribution of all users: H:{0:.2f}%, M:{1:.2f}%, T:{2:.2f}%'.format(H_ratio_mean,M_ratio_mean,T_ratio_mean))
# split user group. G1:G2:G3 = 2:6:2
G1_user = user_taste_dist.iloc[:int(0.2*user_taste_dist.shape[0])] # Blockbuster-focused Users
G2_user = user_taste_dist.iloc[int(0.2*user_taste_dist.shape[0]):int(0.8*user_taste_dist.shape[0])] # Diverse Taste Users
G3_user = user_taste_dist.iloc[int(0.8*user_taste_dist.shape[0]):]# Niche-focused Users
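
The 2:6:2 split depends only on each user's rank by `H_ratio`. A minimal sketch with ten hypothetical users makes the slicing concrete (the values and the names `g1`, `g2`, `g3` are illustrative):

```python
import pandas as pd

# hypothetical H_ratio values for ten users with ids 1..10
taste = pd.DataFrame(
    {'H_ratio': [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.6, 0.4, 0.8, 0.0]},
    index=range(1, 11))
taste = taste.sort_values(by='H_ratio', ascending=False)

n = taste.shape[0]
g1 = taste.iloc[:int(0.2 * n)]               # top 20% by H_ratio
g2 = taste.iloc[int(0.2 * n):int(0.8 * n)]   # middle 60%
g3 = taste.iloc[int(0.8 * n):]               # bottom 20%
print(list(g1.index), list(g3.index))
```

Users 1 and 9 have the highest head-item share and fall into the first group, while users 2 and 10 have the lowest and fall into the third.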

In [21]:
# compute average taste distribution for each group
i = 0
for group in [G1_user,G2_user,G3_user]:
    i += 1
    H = np.mean(group['H_ratio'])*100
    T = np.mean(group['T_ratio'])*100
    M  = 100-H-T
    print('user group {0:} average taste distribution: H:{1:.2f}%, M:{2:.2f}%, T:{3:.2f}%'.format(i,H,M,T))

user group 1 average taste distribution: H:45.93%, M:46.93%, T:7.14%
user group 2 average taste distribution: H:25.50%, M:59.97%, T:14.53%
user group 3 average taste distribution: H:12.42%, M:58.52%, T:29.06%


In [22]:
G1_ratings_full = raw_data[raw_data['userId'].isin(G1_user.index)]
G2_ratings_full = raw_data[raw_data['userId'].isin(G2_user.index)]
G3_ratings_full = raw_data[raw_data['userId'].isin(G3_user.index)]
G1_ratings_train = data_train[data_train['userId'].isin(G1_user.index)]
G2_ratings_train = data_train[data_train['userId'].isin(G2_user.index)]
G3_ratings_train = data_train[data_train['userId'].isin(G3_user.index)]
G1_ratings_test = data_test[data_test['userId'].isin(G1_user.index)]
G2_ratings_test = data_test[data_test['userId'].isin(G2_user.index)]
G3_ratings_test = data_test[data_test['userId'].isin(G3_user.index)]

In [23]:
# save sorted movieId and corresponding popularity
sorted_movieId.to_csv('processed_data/sorted_movieId.csv',header=False, index=True)

In [34]:
# save the corresponding user group with their taste distribution
G1_user.to_csv(os.path.join(OUTPUT_DIR,'group_data/G1_user.csv'),header=True, index=True)
G2_user.to_csv(os.path.join(OUTPUT_DIR,'group_data/G2_user.csv'),header=True, index=True)
G3_user.to_csv(os.path.join(OUTPUT_DIR,'group_data/G3_user.csv'),header=True, index=True)

In [35]:
# full dataset: save the corresponding ratings of users from each user group
G1_ratings_full.to_csv(os.path.join(OUTPUT_DIR,'group_data/G1_ratings_full.csv'),header=False, index=False)
G2_ratings_full.to_csv(os.path.join(OUTPUT_DIR,'group_data/G2_ratings_full.csv'),header=False, index=False)
G3_ratings_full.to_csv(os.path.join(OUTPUT_DIR,'group_data/G3_ratings_full.csv'),header=False, index=False)
# training set: save the corresponding ratings of users from each user group
G1_ratings_train.to_csv(os.path.join(OUTPUT_DIR,'group_data/G1_ratings_train.csv'),header=False, index=False)
G2_ratings_train.to_csv(os.path.join(OUTPUT_DIR,'group_data/G2_ratings_train.csv'),header=False, index=False)
G3_ratings_train.to_csv(os.path.join(OUTPUT_DIR,'group_data/G3_ratings_train.csv'),header=False, index=False)
# test set: save the corresponding ratings of users from each user group
G1_ratings_test.to_csv(os.path.join(OUTPUT_DIR,'group_data/G1_ratings_test.csv'),header=False, index=False)
G2_ratings_test.to_csv(os.path.join(OUTPUT_DIR,'group_data/G2_ratings_test.csv'),header=False, index=False)
G3_ratings_test.to_csv(os.path.join(OUTPUT_DIR,'group_data/G3_ratings_test.csv'),header=False, index=False)

In [36]:
# training set: reload the saved ratings of users from each user group
G1_ratings_train = pd.read_csv(os.path.join(OUTPUT_DIR,'group_data/G1_ratings_train.csv'),header=None, names=['userId', 'movieId','rating'])
G2_ratings_train = pd.read_csv(os.path.join(OUTPUT_DIR,'group_data/G2_ratings_train.csv'),header=None, names=['userId', 'movieId','rating'])
G3_ratings_train = pd.read_csv(os.path.join(OUTPUT_DIR,'group_data/G3_ratings_train.csv'),header=None, names=['userId', 'movieId','rating'])
# test set: reload the saved ratings of users from each user group
G1_ratings_test = pd.read_csv(os.path.join(OUTPUT_DIR,'group_data/G1_ratings_test.csv'),header=None, names=['userId', 'movieId','rating'])
G2_ratings_test = pd.read_csv(os.path.join(OUTPUT_DIR,'group_data/G2_ratings_test.csv'),header=None, names=['userId', 'movieId','rating'])
G3_ratings_test = pd.read_csv(os.path.join(OUTPUT_DIR,'group_data/G3_ratings_test.csv'),header=None, names=['userId', 'movieId','rating'])

## Transfer data format for long-tail GAN training
The cells below show how the files in 'long-tail_GAN/Dataset/ml-1m' are generated, i.e. the dataset with holdout ratings. In the end, we split this dataset the same way as for VAE_CF, so the dataset with holdout users is re-formatted accordingly.

In [None]:
data_train.to_csv('long-tail_GAN/Dataset/ml_1m/item_counts.csv',header=['userId','tagId','rating'], index=False)

In [37]:
unique_sid = pd.unique(raw_data['movieId'])
unique_uid = user_activity.index

In [9]:
with open('long-tail_GAN/Dataset/ml_1m/item_list.txt', 'w') as f:
    for sid in unique_sid:
        f.write('%s\n' % sid)

In [11]:
with open('long-tail_GAN/Dataset/ml_1m/unique_item_id.txt', 'w') as f:
    for (i, sid) in enumerate(unique_sid):
        f.write('%s\n' % i)

In [14]:
with open('long-tail_GAN/Dataset/ml_1m/item2id.txt', 'w') as f:
    for (i, sid) in enumerate(unique_sid):
        f.write('%s\t%s\n' % (sid,i))

In [19]:
with open('long-tail_GAN/Dataset/ml_1m/profile2id.txt', 'w') as f:
    for (i, uid) in enumerate(unique_uid):
        f.write('%s\t%s\n' % (uid,i))

In [45]:
with open('long-tail_GAN/Dataset/ml_1m/niche_items.txt', 'w') as f:
    for sid in pd.unique(raw_data[movies_pop == 2]['movieId']):
        f.write('%s\n' % sid)

In [38]:
# show2id: {movieId: index in unique_sid}
show2id = dict((sid, i) for (i, sid) in enumerate(unique_sid)) # sid is the movieId, i is the index of this movieId in unique_sid
# profile2id: {userId: index in unique_uid}
profile2id = dict((pid, i) for (i, pid) in enumerate(unique_uid)) # pid is the userId, i is the index of this userId in unique_uid

In [39]:
def numerize(tp):
    uid = list(map(lambda x: profile2id[x], tp['userId'])) # wrap map in list so that pandas can process the result
    sid = list(map(lambda x: show2id[x], tp['movieId']))
    return pd.DataFrame(data={'uid': uid, 'sid': sid}, columns=['uid', 'sid'])
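
For illustration, the same id re-indexing can be written with `Series.map`. This is a sketch on invented ids, not a replacement for the function above; note that, unlike the list-based version, `Series.map` keeps the row index of the input frame.

```python
import pandas as pd

unique_sid = [50, 20, 30]   # hypothetical movie ids
unique_uid = [7, 3]         # hypothetical user ids
show2id = {sid: i for i, sid in enumerate(unique_sid)}
profile2id = {pid: i for i, pid in enumerate(unique_uid)}

tp = pd.DataFrame({'userId': [7, 7, 3], 'movieId': [20, 30, 50]})
# look every raw id up in its dictionary to get the contiguous index
numerized = pd.DataFrame({'uid': tp['userId'].map(profile2id),
                          'sid': tp['movieId'].map(show2id)})
print(numerized)
```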

In [43]:
numerize(raw_data).to_csv('long-tail_GAN/Dataset/ml_1m/train_GAN.csv', index=False)

In [44]:
numerize(raw_data[movies_pop == 1]).to_csv('long-tail_GAN/Dataset/ml_1m/train_GAN_popular.csv', index=False)

In [46]:
numerize(raw_data[movies_pop == 2]).to_csv('long-tail_GAN/Dataset/ml_1m/train_GAN_niche.csv', index=False)