# Notebook dedicated to the preprocessing of the data

In this notebook, we first build a usable dataset for our problem by filtering and cleaning the data and removing some items and users. We then perform a leave-one-out split to build the test set. Finally, the prediction for each user is evaluated on a list of 99 sampled non-interacted items to which we add the last item bought, giving 100 candidates per user.

    -   train.csv
    -   test.csv

As in the article, ranking the test item against every item in the dataset is too expensive at evaluation time. To reduce this cost, 99 non-interacted items are randomly sampled for each user and the test item is added among them.

    -   test_negative.csv

We will use two metrics to evaluate the performance:

- Hit Ratio: it checks whether the test item is in the top-K list
- Normalized Discounted Cumulative Gain: it accounts for the position of the test item, giving a higher score to top ranks
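
The two metrics above can be sketched as follows for a single user. This is a minimal illustration, assuming one held-out item per user, so the ideal DCG is 1 and NDCG reduces to `1 / log2(rank + 2)` for a 0-based rank. The function names `hit_ratio_at_k` and `ndcg_at_k` are hypothetical helpers, not part of this notebook.

```python
import math

def hit_ratio_at_k(ranked_items, test_item, k=10):
    """Return 1.0 if test_item is among the first k ranked items, else 0.0."""
    return 1.0 if test_item in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, test_item, k=10):
    """Return 1 / log2(rank + 2) if test_item is in the top k (0-based rank), else 0.0."""
    top_k = list(ranked_items[:k])
    if test_item not in top_k:
        return 0.0
    rank = top_k.index(test_item)  # 0-based position in the ranking
    return 1.0 / math.log2(rank + 2)

# Toy ranking: item ids ordered from highest to lowest predicted score
ranking = [42, 7, 13, 99, 5]
print(hit_ratio_at_k(ranking, 13, k=3), ndcg_at_k(ranking, 13, k=3))  # 1.0 0.5
print(hit_ratio_at_k(ranking, 99, k=3), ndcg_at_k(ranking, 99, k=3))  # 0.0 0.0
```

Averaging these per-user values over all test users would give the HR@K and NDCG@K figures.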

In [1]:
# Requirements

import pandas as pd
import numpy as np
import json 
from sklearn.preprocessing import LabelEncoder

________________________________
### **INITIAL DATABASE (restriction by ratings per user and users per item):**

To ensure effective utilization of the data, we remove items with fewer than 20 ratings and users who have rated fewer than 20 products. This approach helps to eliminate non-informative elements from the dataset, as such items and users provide insufficient information for meaningful analysis or recommendation generation. By focusing on more active users and frequently rated items, we aim to improve the reliability and robustness of the recommendations.
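
The filtering rule described above can be sketched on toy data as follows. This is only an illustration: `filter_min_counts` is a hypothetical helper, the thresholds are lowered to 2 so the toy table stays small, and the `user` / `id` column names are the ones used later in this notebook. Users are filtered first and items second, so a single pass does not guarantee that every remaining user still meets the user threshold once sparse items are gone.

```python
import pandas as pd

def filter_min_counts(df, min_user_ratings=20, min_item_ratings=20):
    """Drop sparse users first, then sparse items, in a single pass."""
    # Number of ratings per user, aligned with the rows of df
    user_counts = df.groupby('user')['id'].transform('count')
    df = df[user_counts >= min_user_ratings]
    # Number of ratings per item, computed on the remaining rows
    item_counts = df.groupby('id')['user'].transform('count')
    return df[item_counts >= min_item_ratings]

# Toy data: user u3 has a single rating and is removed
toy = pd.DataFrame({
    'user': ['u1', 'u1', 'u2', 'u2', 'u3'],
    'id':   ['a',  'b',  'a',  'b',  'a'],
})
filtered = filter_min_counts(toy, min_user_ratings=2, min_item_ratings=2)
print(filtered)
```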

In [2]:
# Import of users data : 

#file = "/Users/aminerazig/Desktop/ENSAE 3A/ADVANCED ML/Advanced ML-project/DATA/Musical_Instruments.jsonl"
file = "C:/Users/USER/Desktop/ENSAE 3A/Advanced ML/Musical_Instruments.jsonl/Musical_Instruments.jsonl"

print(f"Loading of the data from {file}....")
# Use a separate name for the file handle so the path stored in `file` is not overwritten
with open(file, 'r') as f:
    data = [json.loads(line) for line in f]
print("End of loading")


Loading of the data from C:/Users/USER/Desktop/ENSAE 3A/Advanced ML/Musical_Instruments.jsonl/Musical_Instruments.jsonl....
End of loading


In [3]:
df_recommendation = pd.DataFrame(
    [{"id": item["parent_asin"], "user": item["user_id"], "rating": item["rating"], 
      "timestamp": item['timestamp']} for item in data]
)

# Binarize the ratings: every interaction becomes a positive implicit feedback
df_recommendation['rating'] = 1

print(f"Initial Data base shape is : {df_recommendation.shape[0]} rows and {df_recommendation.shape[1]} columns")


Initial Data base shape is : 3017439 rows and 4 columns


In [4]:

# First we check whether there are any duplicates in the dataset (i.e. a user who rated the same product twice)
print(f"{df_recommendation[df_recommendation.duplicated(subset=['user', 'id'], keep=False)].shape}")

# Then we remove those duplicates (averaging the ratings and keeping the latest timestamp):
df_recommendation = df_recommendation.groupby(['user', 'id'], as_index=False).agg(
    rating=('rating', 'mean'),
    timestamp=('timestamp', 'max')
)
df_recommendation['rating'] = np.ceil(df_recommendation['rating'])

print(f"Deletion of duplicates .... Shape after deletion {df_recommendation.shape[0]} rows and {df_recommendation.shape[1]} columns")

(77125, 4)
Deletion of duplicates .... Shape after deletion 2975551 rows and 4 columns


#### **Filter users that have rated fewer than 20 products**

In [5]:
rating_counts = df_recommendation.groupby('user').size().reset_index(name='count')

# Keep the users who have at least `has_rated_min_nb` ratings
has_rated_min_nb = 20
valid_users = rating_counts[rating_counts['count'] >= has_rated_min_nb]['user']

# Keep only the matching rows in the original DataFrame
df_recommendation = df_recommendation[df_recommendation['user'].isin(valid_users)]

print(f"Deletion of users who have rated fewer than 20 products .... Shape after deletion {df_recommendation.shape[0]} rows and {df_recommendation.shape[1]} columns")

Deletion of users who have rated fewer than 20 products .... Shape after deletion 170027 rows and 4 columns


#### **Filter products with fewer than 20 ratings**

In [6]:
rating_counts = df_recommendation.groupby('id').size().reset_index(name='count')

# Keep the 'id' values that have at least `rating_min_nb` ratings
rating_min_nb = 20
valid_ids = rating_counts[rating_counts['count'] >= rating_min_nb]['id']

# Keep only the matching rows in the original DataFrame
df_recommendation = df_recommendation[df_recommendation['id'].isin(valid_ids)]

print(f"Deletion of items with less than 20 ratings .... Shape after deletion {df_recommendation.shape[0]} rows and {df_recommendation.shape[1]} columns")

Deletion of items with less than 20 ratings .... Shape after deletion 42626 rows and 4 columns


In [7]:
print(f" Number of distinct products : {df_recommendation['id'].nunique()}")
print(f" Number of distinct users : {df_recommendation['user'].nunique()}")

 Number of distinct products : 1003
 Number of distinct users : 5107


In [8]:
ratings_per_product = df_recommendation.groupby('id')['user'].nunique()
print("Distribution of the number of distinct users per product : \n")
pd.DataFrame(ratings_per_product.describe())

Distribution of the number of distinct users per product : 



| | user |
|---|---|
| count | 1003.0 |
| mean | 42.498504 |
| std | 39.940309 |
| min | 20.0 |
| 25% | 24.0 |
| 50% | 31.0 |
| 75% | 44.0 |
| max | 473.0 |


In [10]:
# Initialize encoder
encoder = LabelEncoder()

df_recommendation['user'] = encoder.fit_transform(df_recommendation['user'])

# Reinitialize it
encoder = LabelEncoder()

df_recommendation['id'] = encoder.fit_transform(df_recommendation['id'])

In [11]:
csv_name = f"musical_instruments_{rating_min_nb}_{has_rated_min_nb}.csv"
print(f"Saving the final full data base on the csv format ( {csv_name} ) ...")

df_recommendation.to_csv(csv_name, index=False)

print(f"... Successfully saved")


Saving the final full data base on the csv format ( musical_instruments_20_20.csv ) ...
... Successfully saved


**(End preprocessing, csv database)**
___________________________

In [12]:
df_recommendation = pd.read_csv('/users/eleves-a/2024/amine.razig/Advanced-ML-project/base de donnée_20_10.csv')
df_recommendation.head()

| | user | id | rating | timestamp |
|---|---|---|---|---|
| 0 | AE23JYHGEN3D35CHE5OQQYJOW5RA | B000EEHKVY | 5.0 | 1427926325000 |
| 1 | AE23JYHGEN3D35CHE5OQQYJOW5RA | B000TGSM6E | 5.0 | 1480348230000 |
| 2 | AE23JYHGEN3D35CHE5OQQYJOW5RA | B003WZ6VVM | 3.0 | 1425049184000 |
| 3 | AE23JYHGEN3D35CHE5OQQYJOW5RA | B008FDSWJ0 | 5.0 | 1528832546194 |
| 4 | AE23JYHGEN3D35CHE5OQQYJOW5RA | B00EF8VGWE | 5.0 | 1516308993648 |


Let's first encode the 'user' and 'id' columns as integers.

In [14]:
df_recommendation.head()

| | user | id | rating | timestamp |
|---|---|---|---|---|
| 0 | 0 | 3136 | 5.0 | 1427926325000 |
| 1 | 0 | 4374 | 5.0 | 1480348230000 |
| 2 | 0 | 9543 | 3.0 | 1425049184000 |
| 3 | 0 | 14360 | 5.0 | 1528832546194 |
| 4 | 0 | 17526 | 5.0 | 1516308993648 |


## Train / Test Split

We begin by doing a train-test split to perform leave-one-out evaluation on the recommendations. Moreover, we will create a file containing negative samples.

The test set contains the last item each user interacted with. Since all ratings were set to 1 during preprocessing, every interaction is treated as relevant and no rating threshold is applied.

In [12]:
# Sort dataframe by user and timestamp
df_recommendation = df_recommendation.sort_values(by=['user', 'timestamp'])


# Test set: the most recent interaction of each user (rows are sorted by timestamp)
df_test = df_recommendation.groupby('user').tail(1)

# Train set: all remaining interactions
df_train = df_recommendation.drop(df_test.index)
 
df_train.shape, df_test.shape

((37519, 4), (5107, 4))
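
The split above can be sanity-checked on a toy table. The sketch below wraps the same three steps in a hypothetical helper, `leave_one_out_split`, and verifies two properties: one held-out row per user, and a held-out timestamp at least as recent as that user's training rows. The toy values are illustrative only.

```python
import pandas as pd

def leave_one_out_split(df):
    """Hold out each user's most recent interaction; the rest is training data."""
    df = df.sort_values(by=['user', 'timestamp'])
    test = df.groupby('user').tail(1)
    train = df.drop(test.index)
    return train, test

toy = pd.DataFrame({
    'user':      [0, 0, 0, 1, 1],
    'id':        [10, 11, 12, 10, 13],
    'rating':    [1, 1, 1, 1, 1],
    'timestamp': [3, 1, 2, 5, 4],
})
toy_train, toy_test = leave_one_out_split(toy)

# Exactly one held-out row per user
assert toy_test['user'].is_unique

# The held-out interaction is at least as recent as the user's training rows;
# users without any training row are skipped by the fillna
held_out = toy_test.set_index('user')['timestamp']
last_train = toy_train.groupby('user')['timestamp'].max().reindex(held_out.index)
assert (held_out >= last_train.fillna(held_out)).all()

print(toy_test)
```

The same two assertions could be run on `df_train` and `df_test` before writing them to disk.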

Let's create a dataset with 99 negative samples for each user that we concatenate with the test set.

In [13]:
# List of all users and items
all_users = df_recommendation['user'].unique()
all_items = df_recommendation['id'].unique()

# All existing interactions set
interactions = set(zip(df_recommendation['user'], df_recommendation['id']))

# Negative items list
negative_samples = []
num_negatives = 99

for user in df_test['user'].unique():
    # All negative samples for each user
    negative_items = [item for item in all_items if (user, item) not in interactions]

    # Sample from negative samples for each user
    sampled_negatives = np.random.choice(negative_items, size=num_negatives, replace=False)

    # Build a dictionary with one key per negative sample, whatever the value of num_negatives

    negative_sample = {'user': user}
    for i in range(num_negatives):
        negative_sample[f'negative_{i + 1}'] = sampled_negatives[i]

    # adding the negative sample to the list 
    negative_samples.append(negative_sample)

negative_samples_df = pd.DataFrame(negative_samples)

df_test_negative = pd.merge(df_test, negative_samples_df, on='user', how='left')

df_test_negative.shape

(5107, 103)
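
The sampling above uses the global NumPy random state, so the negatives change from one run to the next. A possible variant, shown here as a sketch rather than as the method used in this notebook, draws from a seeded generator and uses a set difference instead of a Python-level scan over all items. `sample_negatives` and its `seed` parameter are hypothetical names.

```python
import numpy as np

def sample_negatives(user_items, all_items, num_negatives, seed=0):
    """Return {user: array of num_negatives item ids the user never interacted with}."""
    rng = np.random.default_rng(seed)  # seeded generator for reproducible draws
    all_items = np.asarray(all_items)
    negatives = {}
    for user, seen in user_items.items():
        # Items of the catalogue that the user has not interacted with
        candidates = np.setdiff1d(all_items, list(seen))
        negatives[user] = rng.choice(candidates, size=num_negatives, replace=False)
    return negatives

# Toy catalogue of 10 items and two users with their interacted items
toy_user_items = {0: {1, 2, 3}, 1: {0, 9}}
toy_negatives = sample_negatives(toy_user_items, range(10), num_negatives=4, seed=42)
print(toy_negatives)
```

On the real data, the `user_items` mapping could be built with something like `df_recommendation.groupby('user')['id'].apply(set).to_dict()`.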

In [14]:
df_test_negative.head()

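| | user | id | rating | timestamp | negative_1 | negative_2 | negative_3 | negative_4 | negative_5 | negative_6 | ... | negative_90 | negative_91 | negative_92 | negative_93 | negative_94 | negative_95 | negative_96 | negative_97 | negative_98 | negative_99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 770 | 1.0 | 1642464772969 | 720 | 259 | 578 | 21 | 421 | 32 | ... | 683 | 125 | 844 | 263 | 933 | 175 | 98 | 447 | 552 | 106 |
| 1 | 1 | 658 | 1.0 | 1664305424891 | 313 | 331 | 783 | 8 | 609 | 626 | ... | 968 | 14 | 67 | 504 | 134 | 544 | 728 | 256 | 108 | 102 |
| 2 | 2 | 506 | 1.0 | 1631833014190 | 948 | 253 | 219 | 332 | 579 | 81 | ... | 899 | 187 | 982 | 654 | 388 | 306 | 596 | 154 | 753 | 454 |
| 3 | 3 | 378 | 1.0 | 1549635394981 | 881 | 554 | 16 | 369 | 243 | 232 | ... | 353 | 161 | 373 | 245 | 445 | 160 | 164 | 634 | 441 | 872 |
| 4 | 4 | 147 | 1.0 | 1647063756658 | 449 | 405 | 52 | 840 | 302 | 46 | ... | 566 | 694 | 115 | 992 | 384 | 469 | 786 | 476 | 121 | 442 |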
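
At evaluation time, each row of this wide table can be turned into a candidate list made of the held-out item followed by its negatives. The sketch below assumes the column layout shown in the preview (`user`, `id`, then `negative_*` columns). `build_candidates` is a hypothetical helper, and the toy frame reuses the first two rows and the first two negatives of the preview.

```python
import pandas as pd

def build_candidates(df_test_negative):
    """Return {user: [test_item, negative_1, ..., negative_n]} from the wide table."""
    negative_cols = [c for c in df_test_negative.columns if c.startswith('negative_')]
    # The held-out item comes first, followed by the sampled negatives
    items = df_test_negative[['id'] + negative_cols].to_numpy()
    return dict(zip(df_test_negative['user'].tolist(), items.tolist()))

toy = pd.DataFrame({
    'user': [0, 1],
    'id': [770, 658],
    'rating': [1.0, 1.0],
    'timestamp': [1642464772969, 1664305424891],
    'negative_1': [720, 313],
    'negative_2': [259, 331],
})
candidates = build_candidates(toy)
print(candidates[0])  # [770, 720, 259]
```

A model would then score every item of a candidate list and the rank of the first entry would feed the Hit Ratio and NDCG computations.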


In [17]:
# Pandas DataFrames to CSV
df_train.to_csv('data/train.csv', index=False)
df_test.to_csv('data/test.csv', index=False)
df_test_negative.to_csv('data/test_negative.csv', index=False)