# Advanced ML, Recommendation project

### Description of the General topic:  
 
Recommendation systems are a class of machine learning tools designed to suggest 
relevant items to users based on their preferences, behaviors, and other users’ 
activities. They are widely used across e-commerce, streaming platforms, social 
media, and online advertising, where they aim to enhance user experience by 
delivering personalized content or product suggestions.

### Flow of the Code Project (All in Python)  :  
 
**Data Preprocessing:** 
- Preprocess textual data using tokenization, stemming, and vectorization 
techniques like TF-IDF or word embeddings. 
 
**Building Collaborative Filtering Models:** 
- Implement SVD for matrix factorization. 
- Build user-based and item-based collaborative filtering models using 
libraries like Scikit-Learn. 
 
**Building Content-Based Filtering Models:**
- Create item profiles using metadata. 
- Use cosine similarity or neural embeddings to identify similar items. 
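
As a minimal sketch of the content-based step, the example below builds TF-IDF item profiles and compares them with cosine similarity. The item names and descriptions are invented for illustration and do not come from the dataset, and `most_similar_item` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item descriptions (not taken from the dataset).
item_texts = {
    "guitar_strings": "steel acoustic guitar strings light gauge",
    "guitar_capo": "capo for acoustic and electric guitar",
    "drum_sticks": "maple drum sticks wood tip",
}

item_ids = list(item_texts)
tfidf = TfidfVectorizer(stop_words="english")
item_profiles = tfidf.fit_transform(list(item_texts.values()))  # one sparse row per item

# Square matrix of pairwise cosine similarities between item profiles.
item_similarity = cosine_similarity(item_profiles)


def most_similar_item(item_id):
    """Return the id of the item closest to `item_id`, excluding the item itself."""
    position = item_ids.index(item_id)
    row = item_similarity[position].copy()
    row[position] = -1.0  # mask the item itself
    return item_ids[int(row.argmax())]


print(most_similar_item("guitar_strings"))
```

In the project, the descriptions would presumably come from the product metadata rather than from hand-written strings.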
 
**Combine with Hybrid Techniques:** 
- Experiment with hybrid models (weighted hybrid, feature-augmented 
collaborative filtering, etc.) to combine collaborative and content-based 
methods. 
- Train deep hybrid models if using neural networks, concatenating 
collaborative and content-based embeddings as input. 
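
A weighted hybrid can be sketched as a convex combination of the two score vectors once they are on a common scale. The function name `weighted_hybrid`, the mixing parameter `alpha`, and the scores below are assumptions made for illustration only.

```python
import numpy as np


def weighted_hybrid(cf_scores, cb_scores, alpha=0.7):
    """Blend two score vectors after min-max scaling each one to [0, 1]."""

    def scale(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return np.zeros_like(x) if span == 0 else (x - x.min()) / span

    return alpha * scale(cf_scores) + (1 - alpha) * scale(cb_scores)


cf = [4.5, 2.0, 3.0]  # hypothetical collaborative scores for three items
cb = [0.1, 0.9, 0.5]  # hypothetical content-based scores for the same items
hybrid = weighted_hybrid(cf, cb, alpha=0.7)
print(hybrid)
```

With `alpha` close to 1 the ranking follows the collaborative scores; with `alpha` close to 0 it follows the content-based ones.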

**Visualization:**  
- Summaries of the results, with charts for visualization wherever 
possible 

In [2]:
import numpy as np
import pandas as pd
import json

________________________________
### **Import of users data :**

In [6]:
# Import of users data : 

file = "/Users/aminerazig/Desktop/ENSAE 3A/ADVANCED ML/Advanced ML-project/DATA/Musical_Instruments.jsonl"

# Use a separate name for the file handle so the path stored in `file` is not overwritten
with open(file, 'r') as f:
    data = [json.loads(line) for line in f]



In [7]:
df_images = pd.DataFrame(
    [{"id": item["parent_asin"], "images": item["images"]} for item in data]
)

In [12]:

df_recommendation = pd.DataFrame(
    [{"id": item["parent_asin"], "user": item["user_id"], "rating": item["rating"]} for item in data]
)

In [13]:
# First, check for duplicates in the dataset (i.e. a user who rated the same product more than once)
print(f"{df_recommendation[df_recommendation.duplicated(subset=['user', 'id'], keep=False)].shape}")

# Then remove those duplicates by averaging the ratings and rounding the mean up : 
df_recommendation = df_recommendation.groupby(['user', 'id'], as_index=False)['rating'].mean()
df_recommendation['rating'] = np.ceil(df_recommendation['rating'])

(77125, 3)


In [101]:
df_recommendation.shape

(42626, 3)

#### **Filter out users who have rated fewer than 20 products**

In [15]:
rating_counts = df_recommendation.groupby('user').size().reset_index(name='count')

# Keep the users who have at least 20 ratings
valid_users = rating_counts[rating_counts['count'] >= 20]['user']

# Keep only the corresponding rows of the original DataFrame
df_recommendation = df_recommendation[df_recommendation['user'].isin(valid_users)]

In [16]:
df_recommendation.shape

(170027, 3)

#### **Filter out products with fewer than 20 ratings**

In [17]:
rating_counts = df_recommendation.groupby('id').size().reset_index(name='count')

# Keep the products that have at least 20 ratings
valid_ids = rating_counts[rating_counts['count'] >= 20]['id']

# Keep only the corresponding rows of the original DataFrame
df_recommendation = df_recommendation[df_recommendation['id'].isin(valid_ids)]

In [18]:
df_recommendation.shape

(42626, 3)
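
The two filters are applied one after the other, so removing sparse products can push some users back under the 20-rating threshold (the counts reported just below, 42,626 ratings for 5,107 users, give an average of about 8 ratings per user). If both thresholds are meant to hold at the same time, one option is to repeat the two filters until nothing changes. The sketch below assumes the same `user` and `id` column names; `k_core_filter` and the toy data are hypothetical.

```python
import pandas as pd


def k_core_filter(df, min_user=20, min_item=20):
    """Repeat the user and product filters until both thresholds hold at once."""
    while True:
        user_counts = df["user"].map(df["user"].value_counts())
        item_counts = df["id"].map(df["id"].value_counts())
        keep = (user_counts >= min_user) & (item_counts >= min_item)
        if keep.all():
            return df
        df = df[keep]


# Toy data with thresholds of 2: product "c" is dropped first,
# which leaves user "u3" with a single rating, so "u3" is dropped next.
toy = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "id": ["a", "b", "a", "b", "b", "c"],
    "rating": [5, 4, 4, 5, 3, 2],
})
filtered = k_core_filter(toy, min_user=2, min_item=2)
print(filtered)
```

On real data this can shrink the dataset considerably, so the thresholds may need to be lowered.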

In [19]:
print(f" Number of distinct products : {df_recommendation['id'].nunique()}")
print(f" Number of distinct users : {df_recommendation['user'].nunique()}")

 Number of distinct products : 1003
 Number of distinct users : 5107


In [20]:
df_recommendation.isna().sum() # 0 missing value

user      0
id        0
rating    0
dtype: int64

In [21]:
ratings_per_product = df_recommendation.groupby('id')['user'].nunique()
print(f"Distribution of the number of distinct users per product : \n")
pd.DataFrame(ratings_per_product.describe())

Distribution of the number of distinct users per product : 



|       | user      |
|-------|-----------|
| count | 1003.0    |
| mean  | 42.498504 |
| std   | 39.940309 |
| min   | 20.0      |
| 25%   | 24.0      |
| 50%   | 31.0      |
| 75%   | 44.0      |
| max   | 473.0     |


In [22]:
df_recommendation.to_csv('base de donnée_20_20.csv')

**(End preprocessing, csv database)**
___________________________

In [8]:
df_recommendation = pd.read_csv("base de donnée_20_20.csv")
df_recommendation.head(5)

Unnamed: 0,user,id,rating
0,AE23JYHGEN3D35CHE5OQQYJOW5RA,B000EEHKVY,5.0
1,AE23JYHGEN3D35CHE5OQQYJOW5RA,B000TGSM6E,5.0
2,AE23JYHGEN3D35CHE5OQQYJOW5RA,B008FDSWJ0,5.0
3,AE23JYHGEN3D35CHE5OQQYJOW5RA,B012VQ5A7S,5.0
4,AE23JYHGEN3D35CHE5OQQYJOW5RA,B076ZSHQ47,3.0


#### Import of products metadata (not used for now) :


In [None]:
# import of products metadata : 

products_1000_metadata = []
file_metadata = "Musical_Instruments.jsonl"

with open(file_metadata, 'r') as file:
    for i, line in enumerate(file):
        if i >= 1000:  
            break
        products_1000_metadata.append(json.loads(line))

In [None]:
# ### DISPLAY OF A FEW IMAGES : 
# import json
# import random
# import requests
# from PIL import Image
# from io import BytesIO



# def get_random_products_with_images(products, num_products=90):
#     products_with_images = [p for p in products if p.get('images') and len(p['images']) > 0]
#     return random.sample(products_with_images, min(num_products, len(products_with_images)))


# def fetch_and_resize_image(url, size=(30, 30)):
#     try:
#         response = requests.get(url)
#         response.raise_for_status()
#         img = Image.open(BytesIO(response.content))
#         return img.resize(size)
#     except Exception as e:
#         print(f"Error while downloading the image: {e}")
#         return None

# # mosaic
# def create_mosaic(images, grid_size=(10, 9), image_size=(30, 30)):
#     mosaic = Image.new('RGB', (grid_size[0] * image_size[0], grid_size[1] * image_size[1]))
#     for idx, img in enumerate(images):
#         if img:
#             x = (idx % grid_size[0]) * image_size[0]
#             y = (idx // grid_size[0]) * image_size[1]
#             mosaic.paste(img, (x, y))
#     mosaic.show()
#     return mosaic


# selected_products = get_random_products_with_images(products_1000_metadata)
# image_urls = [p['images'][0]['large'] for p in selected_products]

# images = [fetch_and_resize_image(url) for url in image_urls]
# mosaic = create_mosaic(images)

# Data Fields

## For User Reviews

| Field              | Type   | Explanation                                                                                                                                                     |
|--------------------|--------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `rating`           | float  | Rating of the product (from 1.0 to 5.0).                                                                                                                        |
| `title`            | str    | Title of the user review.                                                                                                                                       |
| `text`             | str    | Text body of the user review.                                                                                                                                   |
| `images`           | list   | Images that users post after they have received the product. Each image has different sizes (small, medium, large), represented by `small_image_url`, `medium_image_url`, and `large_image_url`. |
| `asin`             | str    | ID of the product.                                                                                                                                              |
| `parent_asin`      | str    | Parent ID of the product. Note: Products with different colors, styles, sizes usually belong to the same parent ID. The “asin” in previous Amazon datasets is actually the parent ID. Please use parent ID to find product meta. |
| `user_id`          | str    | ID of the reviewer.                                                                                                                                             |
| `timestamp`        | int    | Time of the review (unix time).                                                                                                                                 |
| `verified_purchase`| bool   | User purchase verification.                                                                                                                                     |
| `helpful_vote`     | int    | Helpful votes of the review.                                                                                                                                    |


In [20]:
import EDA_functions
image_url = "https://images-na.ssl-images-amazon.com/images/I/71DFEoJ+Z9L._SL256_.jpg"
EDA_functions.show_image(image_url)

# I- Collaborative Filtering

Collaborative filtering recommends products to users based on the behavior of other users with similar preferences. CF methods work on the principle that users who agreed on items in the past are likely to agree again. 
It is an alternative to content filtering that relies only on past user behavior (for example, previous transactions or product ratings) without requiring the creation of explicit profiles. Collaborative filtering analyzes relationships between users and interdependencies among products to identify new user-item associations.

There are two main types:

### a) User-Based Collaborative Filtering
This approach identifies users who have similar preferences (based on ratings or clicks) and recommends items that similar users liked.

### Implementation
We can try to implement this method using matrix factorization techniques like **Singular Value Decomposition (SVD)**, which reduces the dimensionality of the data matrix, capturing latent factors that explain user-item interactions.
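
As a minimal sketch of what a truncated SVD does to a rating matrix, the example below factorizes a small made-up matrix with NumPy and keeps `k` latent factors. The matrix `R` and the choice `k = 2` are assumptions for illustration. Treating unrated cells as zeros is a simplification, since the factorization then tries to reproduce those zeros as if they were ratings.

```python
import numpy as np

# Toy user x item rating matrix (0 = not rated); the values are made up.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

k = 2  # number of latent factors to keep
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Rank-k reconstruction of R from the k largest singular values.
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

user_factors = U[:, :k] * s[:k]  # one k-dimensional vector per user
item_factors = Vt[:k, :].T       # one k-dimensional vector per item

print(np.round(R_hat, 2))
```

The predicted score of user `i` for item `j` is the dot product `user_factors[i] @ item_factors[j]`, which is the entry `R_hat[i, j]`.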


In [26]:
print(f"The number of unique products is : {df_recommendation.id.nunique()}")
print(f"The number of unique users is : {df_recommendation.user.nunique()}")

The number of unique products is : 1003
The number of unique users is : 5107


In [None]:
df_recommendation.groupby('id')['rating'].count().sort_values(ascending = False)

id
B09857JRP2    473
B0BCK6L7S5    434
B0BPJ4Q6FJ    421
B0BSGM6CQ9    404
B0BTC9YJ2W    299
             ... 
B07FYL7LW1     20
B016B6YFDO     20
B002H3EZMC     20
B00M9BSZMI     20
B07H87XJ19     20
Name: rating, Length: 1003, dtype: int64

In collaborative filtering, cosine similarity measures how close one user's rating vector is to another user's. Two users with similar rating vectors are likely to appreciate the same products.

In [27]:
df_recommendation_pivot = df_recommendation.pivot(index='user', columns='id', values='rating')

In [28]:
df_recommendation_pivot = df_recommendation_pivot.fillna(0)

In [29]:
df_recommendation_pivot

id,1423414357,B00005ML71,B0002CZVWS,B0002D01K4,B0002D01KO,B0002D0CCQ,B0002D0CEO,B0002D0CNA,B0002D0L5E,B0002D0Q2W,...,B0C6J149WZ,B0C6J1BN77,B0C6J1X7TD,B0C6J2DPBW,B0C994NVQK,B0C9NGP88D,B0CB98SMQR,B0CBHMCGNS,B0CBK1WSMR,B0CCK4YYNM
AE23JYHGEN3D35CHE5OQQYJOW5RA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AE23LDQTB7L76AP6E6WPBFVYL5DA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AE23WLBRYKEC67DM43M6E2MF7GPQ,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AE23ZFVUOMPKR57BVSWXV34QLMVA,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AE24I2EU3AJAAKBXF367XSV37U6Q,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AHZPLXCE5YQMLXFFBSURYHZUGMTA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AHZQPH7HHSWLUIQFWEQ54NNKKN6A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AHZT6MVWNF4GG6FISMZMORKZKK4A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AHZXMBKQJTVG2J7P7EB5WCYTOLDQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### **Cosine Similarity** : 
See the following page for the formula and an explanation of this metric: https://en.wikipedia.org/wiki/Cosine_similarity


$$
\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}
$$

Where:
- \( A \) and \( B \) are vectors.
- \( A \cdot B \) is the dot product of \( A \) and \( B \).
- \( \|A\| \) and \( \|B\| \) are the Euclidean norms of \( A \) and \( B \).
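
As a worked example with two made-up rating vectors, take \( A = (5, 3, 0) \) and \( B = (4, 0, 2) \):

$$
\text{cosine similarity}(A, B) = \frac{5 \cdot 4 + 3 \cdot 0 + 0 \cdot 2}{\sqrt{5^2 + 3^2 + 0^2}\,\sqrt{4^2 + 0^2 + 2^2}} = \frac{20}{\sqrt{34}\,\sqrt{20}} \approx 0.767
$$

Only the first coordinate, which both users rated, contributes to the numerator, while every rating contributes to the norms.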


In [None]:
def Cosine_similarity (a,b) : 
    """cosine similarity between two vectors"""

    vec1 = np.array(a)
    vec2 = np.array(b)
    
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    
    if norm_vec1 == 0 or norm_vec2 == 0:
        return 0  # NO division by zero
    else:
        return dot_product / (norm_vec1 * norm_vec2)


In [30]:
from sklearn.metrics.pairwise import cosine_similarity
# The similarity matrix is a square matrix that gives the cosine similarity between every pair of users
similarity_matrix = cosine_similarity(df_recommendation_pivot)

In [33]:
df_similarity_matrix = pd.DataFrame(similarity_matrix, index= df_recommendation_pivot.index, columns= df_recommendation_pivot.index)
df_similarity_matrix

user,AE23JYHGEN3D35CHE5OQQYJOW5RA,AE23LDQTB7L76AP6E6WPBFVYL5DA,AE23WLBRYKEC67DM43M6E2MF7GPQ,AE23ZFVUOMPKR57BVSWXV34QLMVA,AE24I2EU3AJAAKBXF367XSV37U6Q,AE24ZJSXZFHFKZF3UYR5CBAYGL7A,AE25GU3LWQGZJN4NNT5GWAGBN2KA,AE27PVJOEVGOHYF5WOXQDZ5NIULA,AE2AVSTY2ZSZUSXZA7GWMXC56ITQ,AE2BCWBDERZKN3ACIXQQISI3LHPA,...,AHZJVJET7N5JKBRKL6E7SMZV6FKQ,AHZKUBNYPXGNNNH5GQKSHJH5A7IA,AHZLJECK27R55RFTZXFUEZIVGHGQ,AHZNF3H5YLL623I5C6PO3TVZFXSQ,AHZNULZBLXPYXTCJSZ6FHCFM2Y5A,AHZPLXCE5YQMLXFFBSURYHZUGMTA,AHZQPH7HHSWLUIQFWEQ54NNKKN6A,AHZT6MVWNF4GG6FISMZMORKZKK4A,AHZXMBKQJTVG2J7P7EB5WCYTOLDQ,AHZYZ2BUDD7WAJPW5G6K2DK5LYPQ
AE23JYHGEN3D35CHE5OQQYJOW5RA,1.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,...,0.000000,0.20109,0.0,0.147115,0.000000,0.0,0.000000,0.000000,0.000000,0.000000
AE23LDQTB7L76AP6E6WPBFVYL5DA,0.0,1.000000,0.000000,0.039735,0.0,0.000000,0.109632,0.0,0.000000,0.0,...,0.221340,0.00000,0.0,0.000000,0.071853,0.0,0.000000,0.000000,0.109938,0.161294
AE23WLBRYKEC67DM43M6E2MF7GPQ,0.0,0.000000,1.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,...,0.000000,0.00000,0.0,0.000000,0.000000,0.0,0.058036,0.000000,0.033162,0.000000
AE23ZFVUOMPKR57BVSWXV34QLMVA,0.0,0.039735,0.000000,1.000000,0.0,0.000000,0.000000,0.0,0.070186,0.0,...,0.000000,0.00000,0.0,0.000000,0.000000,0.0,0.097176,0.000000,0.000000,0.053323
AE24I2EU3AJAAKBXF367XSV37U6Q,0.0,0.000000,0.000000,0.000000,1.0,0.000000,0.000000,0.0,0.212814,0.0,...,0.000000,0.00000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AHZPLXCE5YQMLXFFBSURYHZUGMTA,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,...,0.000000,0.00000,0.0,0.000000,0.000000,1.0,0.000000,0.000000,0.000000,0.000000
AHZQPH7HHSWLUIQFWEQ54NNKKN6A,0.0,0.000000,0.058036,0.097176,0.0,0.089002,0.000000,0.0,0.000000,0.0,...,0.065613,0.00000,0.0,0.000000,0.000000,0.0,1.000000,0.000000,0.081474,0.000000
AHZT6MVWNF4GG6FISMZMORKZKK4A,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,...,0.185231,0.00000,0.0,0.000000,0.000000,0.0,0.000000,1.000000,0.184006,0.000000
AHZXMBKQJTVG2J7P7EB5WCYTOLDQ,0.0,0.109938,0.033162,0.000000,0.0,0.152570,0.000000,0.0,0.000000,0.0,...,0.000000,0.00000,0.0,0.000000,0.000000,0.0,0.081474,0.184006,1.000000,0.000000


#### **Selection of a user :** 

In [159]:
# Get similar users :
select_userid = "AE23LDQTB7L76AP6E6WPBFVYL5DA"

# Weight vector used to take a weighted average over the other users.
# Each weight is a user's similarity to the selected user divided by the sum of all similarities,
# so users who are more similar to the selected user count more in the average : 

similarities = df_similarity_matrix[select_userid].drop(select_userid) # similarity of the selected user to every other user
weights = similarities/similarities.sum() # normalize so that the weights sum to 1

In [140]:
similarities

user
AE23JYHGEN3D35CHE5OQQYJOW5RA    0.000000
AE23WLBRYKEC67DM43M6E2MF7GPQ    0.000000
AE23ZFVUOMPKR57BVSWXV34QLMVA    0.039735
AE24I2EU3AJAAKBXF367XSV37U6Q    0.000000
AE24ZJSXZFHFKZF3UYR5CBAYGL7A    0.000000
                                  ...   
AHZPLXCE5YQMLXFFBSURYHZUGMTA    0.000000
AHZQPH7HHSWLUIQFWEQ54NNKKN6A    0.000000
AHZT6MVWNF4GG6FISMZMORKZKK4A    0.000000
AHZXMBKQJTVG2J7P7EB5WCYTOLDQ    0.109938
AHZYZ2BUDD7WAJPW5G6K2DK5LYPQ    0.161294
Name: AE23LDQTB7L76AP6E6WPBFVYL5DA, Length: 5106, dtype: float64

In [136]:
# products that this user has rated : 
products_rated_by_a = df_recommendation_pivot.loc[select_userid, df_recommendation_pivot.loc[select_userid, :] != 0].index
for id in products_rated_by_a : 
    select_an_images(id)

URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/61mt0r+0QQL._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/710iW-iOQSL._SL256_.jpg


URL de l'image de l'objet https://m.media-amazon.com/images/I/81A37bd-CuL._SL256_.jpg


URL de l'image de l'objet https://m.media-amazon.com/images/I/61JHITA6ZxL._SL256_.jpg


URL de l'image de l'objet https://m.media-amazon.com/images/I/611yOqc6B8L._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/71ToLBwHiZL._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/71LuAdf274L._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/B16ytAeKkbS._SL256_.jpg


URL de l'image de l'objet https://m.media-amazon.com/images/I/612EmYtaRlL._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/91Xj+2m1pzL._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/71OF+jjxZsL._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/81pTbG5M0WL._SL256_.jpg


URL de l'image de l'objet https://m.media-amazon.com/images/I/919nNPuCIgL._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/719r2UVnjGL._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/91VL6NonjcL._SL256_.jpg


URL de l'image de l'objet https://m.media-amazon.com/images/I/511miE781qL._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/71KWkPtyJnL._SL256_.jpg


#### **Similar Users  :** 

We calculate the similarity between a **target user** and the other users in the system to find similar users. The process builds a **weight vector** by normalizing the similarities, so that users who are more similar to the target user receive a larger weight.

1. **Target User Selection:**
   To begin, we choose a **target user** (denoted \( u_t \)) for whom we want to find similar users. For this example, we choose `user id = AE23LDQTB7L76AP6E6WPBFVYL5DA`, the user selected in the code cell above:
   
   $$ u_t = \texttt{AE23LDQTB7L76AP6E6WPBFVYL5DA} $$

2. **Similarity Calculation:**
   We calculate the **similarity** between the target user \( u_t \) and all other users using a **similarity matrix** \( S \), where each entry \( S_{ij} \) represents the similarity between user \( i \) and user \( j \). 

   The **similarity vector** for the target user \( u_t \) is represented as:
   
   $$ \text{similarities}_{u_t} = S_{u_t} $$
   
   We also remove the similarity between the target user and themselves:
   
   $$ \text{similarities}_{u_t, \text{others}} = S_{u_t} \setminus \{ S_{u_t, u_t} \} $$

3. **Weight Calculation:**
   After obtaining the similarity scores between the target user \( u_t \) and the other users, we create a **weight vector** in which each user's similarity score is normalized. This ensures that users who are more similar to the target user are given a higher weight and that the weights sum to 1. The weight for each user \( i \) is calculated by normalizing the similarity score:

   $$ w_i = \frac{S_{u_t, i}}{\sum_{j \neq u_t} S_{u_t, j}} $$

   Where:
   - \( w_i \) is the weight for user \( i \),
   - \( S_{u_t, i} \) is the similarity between the target user \( u_t \) and user \( i \),
   - The denominator is the sum of the similarities between the target user \( u_t \) and all other users (excluding \( u_t \)).
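
The three steps above can be sketched as follows. The function name `similarity_weights` and the three-user similarity matrix are assumptions for illustration. The guard against a zero denominator is an addition that covers a target user with no overlap with anyone else.

```python
import pandas as pd


def similarity_weights(similarity_matrix, target_user):
    """Return w_i = S[target, i] / sum_j S[target, j] over all users except the target."""
    sims = similarity_matrix[target_user].drop(target_user)
    total = sims.sum()
    if total == 0:
        # No overlap with anyone: return all-zero weights instead of dividing by zero.
        return sims * 0.0
    return sims / total


# Hypothetical 3-user similarity matrix.
users = ["u_t", "u_a", "u_b"]
S = pd.DataFrame(
    [[1.0, 0.3, 0.1],
     [0.3, 1.0, 0.0],
     [0.1, 0.0, 1.0]],
    index=users, columns=users,
)
weights = similarity_weights(S, "u_t")
print(weights)
```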



#### **Similar users :** 

In [145]:
# number of similar users
k = 10

# set a threshold for similarity : only users with a similarity score greater than 0.3 will be considered
user_similarity_threshold =  0.3

# top k similar users
similar_users = df_similarity_matrix[df_similarity_matrix[select_userid]>user_similarity_threshold][select_userid].sort_values(ascending=False)[:k]
similar_users_df = similar_users.to_frame(name='similarity')

print (f"The similar (with similarity bigger than {user_similarity_threshold} of cosine similarity) users to user {select_userid} are  : ")

for index, row in similar_users_df.iterrows():
    similarity = row['similarity'] 
    print(f"User: {index}, Similarity: {similarity}")

The similar (with similarity bigger than 0.3 of cosine similarity) users to user AE23LDQTB7L76AP6E6WPBFVYL5DA are  : 
User: AE23LDQTB7L76AP6E6WPBFVYL5DA, Similarity: 1.0
User: AGKLMAODCW3RR4EBYZHPG7VV7J6A, Similarity: 0.3138762255948909
User: AGWSL6RCSUEPR5RGTVMO5HERQUEA, Similarity: 0.3082056047334018


In [146]:
similar_users_df

| user                         | similarity |
|------------------------------|------------|
| AE23LDQTB7L76AP6E6WPBFVYL5DA | 1.0        |
| AGKLMAODCW3RR4EBYZHPG7VV7J6A | 0.313876   |
| AGWSL6RCSUEPR5RGTVMO5HERQUEA | 0.308206   |
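
The selection above keeps the target user in the result (similarity 1.0), so the target counts towards `k` and has to be dropped afterwards. A possible variant, sketched below with a hypothetical helper `top_k_similar_users` and made-up similarities, excludes the target before applying the threshold and the top-k cut.

```python
import pandas as pd


def top_k_similar_users(similarity_matrix, target_user, k=10, threshold=0.3):
    """Return up to k users whose similarity to target_user exceeds threshold."""
    sims = similarity_matrix[target_user].drop(target_user)  # exclude the target
    return sims[sims > threshold].nlargest(k)


# Hypothetical 4-user similarity matrix.
users = ["u_t", "u_a", "u_b", "u_c"]
S = pd.DataFrame(
    [[1.0, 0.5, 0.2, 0.35],
     [0.5, 1.0, 0.0, 0.1],
     [0.2, 0.0, 1.0, 0.0],
     [0.35, 0.1, 0.0, 1.0]],
    index=users, columns=users,
)
neighbours = top_k_similar_users(S, "u_t", k=10, threshold=0.3)
print(neighbours)
```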


#### **Bought and not bought products :** 

In [63]:
# Products that the selected user bought and rated at least 3/5 :
bought_products = df_recommendation_pivot.loc[df_recommendation_pivot.index== select_userid, df_recommendation_pivot.loc[select_userid,:]>=3]
bought_products

id,B005M0MUQK,B007T8CUNG,B008BPI2OW,B00CGFRJ2Y,B00CRQWMYM,B015IJIO5U,B07F69KR6K,B0853X3VDC,B09198262S,B0928HW2P4,B09NLV5LBK,B0B3MWSSYF,B0B8F6LD9F,B0B95V41NR,B0BG95DG2H,B0BPJ4Q6FJ,B0BSR996X8
AE23LDQTB7L76AP6E6WPBFVYL5DA,4.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0


In [82]:
# Columns (products) that the selected user has not rated
not_rated_mask = df_recommendation_pivot.loc[select_userid, :] == 0
not_rated_by_user = df_recommendation_pivot.loc[:, not_rated_mask]

In [88]:
not_bought_products = df_recommendation_pivot.loc[
    df_recommendation_pivot.index != select_userid,  # Exclude selected user
    not_rated_by_user.columns  # products not rated by the selected user
]


In [141]:
# a sub-matrix of the recommendation pivot table containing only products that select_userid has not rated.
not_bought_products

id,1423414357,B00005ML71,B0002CZVWS,B0002D01K4,B0002D01KO,B0002D0CCQ,B0002D0CEO,B0002D0CNA,B0002D0L5E,B0002D0Q2W,...,B0C6J149WZ,B0C6J1BN77,B0C6J1X7TD,B0C6J2DPBW,B0C994NVQK,B0C9NGP88D,B0CB98SMQR,B0CBHMCGNS,B0CBK1WSMR,B0CCK4YYNM
AE23JYHGEN3D35CHE5OQQYJOW5RA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AE23WLBRYKEC67DM43M6E2MF7GPQ,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AE23ZFVUOMPKR57BVSWXV34QLMVA,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AE24I2EU3AJAAKBXF367XSV37U6Q,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0
AE24ZJSXZFHFKZF3UYR5CBAYGL7A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AHZPLXCE5YQMLXFFBSURYHZUGMTA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AHZQPH7HHSWLUIQFWEQ54NNKKN6A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AHZT6MVWNF4GG6FISMZMORKZKK4A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AHZXMBKQJTVG2J7P7EB5WCYTOLDQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [156]:
selected_similar_users = df_similarity_matrix[select_userid].drop(select_userid)
weights = selected_similar_users/selected_similar_users.sum()
weighted_averages = pd.DataFrame(not_bought_products.T.dot(weights.to_numpy()), columns=["weighted_avg"])

In [183]:
# Select only the rows of not_bought_products for similar users
similar_user_ids = similar_users_df.drop(select_userid).index
filtered_not_bought_products = not_bought_products.loc[similar_user_ids]

# Normalize weights
weights = similar_users_df.drop(select_userid) / similar_users_df.drop(select_userid).sum()

# Compute the weighted averages
weighted_averages = filtered_not_bought_products.T.dot(weights.to_numpy())


In [193]:
weighted_averages = pd.DataFrame(weighted_averages)
weighted_averages.columns = ['weighted_averages']

weighted_averages.sort_values(by="weighted_averages", ascending = False)

| id         | weighted_averages |
|------------|-------------------|
| 1423414357 | 0.0               |
| B00005ML71 | 0.0               |
| B0002CZVWS | 0.0               |
| B0002D01K4 | 0.0               |
| B0002D01KO | 0.0               |
| ...        | ...               |
| B0C9NGP88D | 0.0               |
| B0CB98SMQR | 0.0               |
| B0CBHMCGNS | 0.0               |
| B0CBK1WSMR | 0.0               |


In [196]:
top_7_recommendation = weighted_averages.sort_values(by="weighted_averages", ascending = False).head(7)
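
Because unrated cells are stored as 0 in the pivot table, the dot product above counts "not rated" as a rating of 0, which pulls the score of rarely rated products down. One common alternative, which is not the method used in this notebook, divides by the total weight of the neighbours who actually rated each product. The sketch below uses a hypothetical helper `predict_from_neighbours` and made-up ratings.

```python
import numpy as np
import pandas as pd


def predict_from_neighbours(ratings, weights):
    """Weighted mean rating per item, using only the neighbours who rated it.

    ratings: neighbours x items DataFrame, where 0 means "not rated".
    weights: Series of non-negative weights indexed by neighbour.
    """
    rated = (ratings != 0).astype(float)
    weights = weights.reindex(ratings.index)
    numerator = ratings.mul(weights, axis=0).sum()
    denominator = rated.mul(weights, axis=0).sum()
    # Items that no neighbour rated get NaN instead of a misleading 0.
    scores = numerator / denominator.replace(0, np.nan)
    return scores.sort_values(ascending=False)


ratings = pd.DataFrame(
    {"item_a": [5.0, 0.0], "item_b": [4.0, 2.0], "item_c": [0.0, 0.0]},
    index=["u_a", "u_b"],
)
weights = pd.Series({"u_a": 0.75, "u_b": 0.25})
predictions = predict_from_neighbours(ratings, weights)
print(predictions)
```

A product rated by a single neighbour receives that neighbour's rating as its score, so in practice a minimum number of raters per product may be worth requiring.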

In [133]:
def select_an_images (id) : 
    url = df_images[df_images['id']== id]["images"]
    first_non_empty = url[url.apply(lambda x: len(x) > 0)].iloc[0]  # Get the first non-empty value
    url = first_non_empty[0].get("small_image_url")
    print(f"URL de l'image de l'objet {url}")
    EDA_functions.show_image(url)


In [198]:
import EDA_functions
i = 1
for index, rows in top_7_recommendation.iterrows(): 
    print (f"*************** Recommendation number {i} *************** : \n Product id : {index}, associated weight : {rows['weighted_averages']}")
    select_an_images(index)
    i += 1

*************** Recommendation number 1 *************** : 
 Product id : B0BK5CXZBT, associated weight : 2.4772111136146813
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/81Ji49dhQ4L._SL256_.jpg


*************** Recommendation number 2 *************** : 
 Product id : B09M7F7LFB, associated weight : 1.981768890891745
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/B1M8J6Sk-dS._SL256_.jpg


*************** Recommendation number 3 *************** : 
 Product id : B009A5JA98, associated weight : 1.981768890891745
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/716-5TxPjKL._SL256_.jpg


*************** Recommendation number 4 *************** : 
 Product id : B0B2LMZ9RT, associated weight : 1.5136733318311917
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/71BFlaQ-PNL._SL256_.jpg


*************** Recommendation number 5 *************** : 
 Product id : B09VBWXMBC, associated weight : 0.5045577772770639
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/819XYAZuaJL._SL256_.jpg


*************** Recommendation number 6 *************** : 
 Product id : B0BSGM6CQ9, associated weight : 0.5045577772770639
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/71-ODCpCowL._SL256_.jpg


*************** Recommendation number 7 *************** : 
 Product id : B01LQO3GGS, associated weight : 0.49544222272293625
URL de l'image de l'objet https://m.media-amazon.com/images/I/81ksZz+BCUL._SL256_.jpg


In [134]:
import EDA_functions
i = 1
for index, rows in top_7_recommendation.iterrows(): 
    print (f"*************** Recommendation number {i} *************** : \n Product id : {index}, associated weight : {rows['weighted_avg']}")
    select_an_images(index)
    i += 1

*************** Recommendation number 1 *************** : 
 Product id : B0BSGM6CQ9, associated weight : 0.648158443276254
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/71-ODCpCowL._SL256_.jpg


*************** Recommendation number 2 *************** : 
 Product id : B09857JRP2, associated weight : 0.5771212710125483
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/91mFvYib+XL._SL256_.jpg


*************** Recommendation number 3 *************** : 
 Product id : B0BCK6L7S5, associated weight : 0.44555057965388223
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/71CPiMwVvjL._SL256_.jpg


*************** Recommendation number 4 *************** : 
 Product id : B0BTC9YJ2W, associated weight : 0.44049313275962226
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/81j-j-I7uIL._SL256_.jpg


*************** Recommendation number 5 *************** : 
 Product id : B08R5GM6YB, associated weight : 0.3170638682729171
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/91RgxhgU3HL._SL256_.jpg


*************** Recommendation number 6 *************** : 
 Product id : B0BKR2ZM9X, associated weight : 0.26326062306415404
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/61n284XL9HL._SL256_.jpg


*************** Recommendation number 7 *************** : 
 Product id : B09V91H5XM, associated weight : 0.24941302291764603
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/71c2CHulg4L._SL256_.jpg


#### **Automated recommendation** : 

In [13]:
import recommendations

recommendations.Csimilarity_user_recommendation(df_recommendation, "AHZXMBKQJTVG2J7P7EB5WCYTOLDQ", df_images)
# Note: some users have no similar users above 0.3, e.g. "AHZQPH7HHSWLUIQFWEQ54NNKKN6A"
# Works for: AG622C3E6PARXNYNYPZ6OWJZ4SHQ

The number of unique products is : 1003
The number of unique users is : 5107 

URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/81M3vFhTSKL._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/61mt0r+0QQL._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/51swrTvyx+L._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/71GJ6vRLD3L._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/71wUv7yqe7L._SL256_.jpg


URL de l'image de l'objet https://m.media-amazon.com/images/I/818ulEonBFL._SL256_.jpg


URL de l'image de l'objet https://m.media-amazon.com/images/I/612EmYtaRlL._SL256_.jpg


URL de l'image de l'objet https://m.media-amazon.com/images/I/81eghTWnlcL._SL256_.jpg


The similar (with similarity bigger than 0.3 of cosine similarity) users to user AHZXMBKQJTVG2J7P7EB5WCYTOLDQ are  : 
User: AHZXMBKQJTVG2J7P7EB5WCYTOLDQ, Similarity: 0.9999999999999999
User: AEPL7HRZFXZPV4HRLAVLIPG6SWMA, Similarity: 0.48791905846982475
User: AH24AXIG2WSMXXKEJWCZOCCBYP4A, Similarity: 0.48578324309888404
User: AEJQKRNJPIVHWSTSY3LBUTISJX2A, Similarity: 0.37371754637596794
User: AHEESJDW67WKM4PH5VA6VLWJWI6Q, Similarity: 0.37371754637596794
User: AGT4CKMXVZIC3HN6VXSIKKYURVSA, Similarity: 0.37371754637596794
User: AG3DT2NJL6FIRTCNKPJPB6FB7YGQ, Similarity: 0.37371754637596794
User: AEFB2SANVBGRV67PYAPK7VWNSFZQ, Similarity: 0.37371754637596794
User: AFJVVDPUVIX2RVK5R3J53FS62IZQ, Similarity: 0.37371754637596794
User: AEXDQ3ET3KGPPTZNC5U72WED6HEA, Similarity: 0.37371754637596794
*************** Recommendation number 1 *************** : 
 Product id : B0007OGTGS, associated weight : 0.6796050412169841
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/71DB

*************** Recommendation number 2 *************** : 
 Product id : B0B52FBC66, associated weight : 0.6766301402205931
URL de l'image de l'objet https://m.media-amazon.com/images/I/B1fvzQ6xvGS._SL256_.jpg


*************** Recommendation number 3 *************** : 
 Product id : B0027V760M, associated weight : 0.6766301402205931
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/31em1bAem5L._SL256_.jpg


*************** Recommendation number 4 *************** : 
 Product id : B097Q9W8MW, associated weight : 0.5436840329735873
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/81gvUTxTdQL._SL256_.jpg


*************** Recommendation number 5 *************** : 
 Product id : B000V1K7FG, associated weight : 0.5436840329735873
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/71EBehc9WGL._SL256_.jpg


*************** Recommendation number 6 *************** : 
 Product id : B089QY6YYQ, associated weight : 0.5205378312232032
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/710uwv-+sxL._SL256_.jpg


*************** Recommendation number 7 *************** : 
 Product id : B09Y1QWK5W, associated weight : 0.41643026497856256
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/81lBvHI7BDL._SL256_.jpg


id,weighted_averages
B0007OGTGS,0.679605
B0B52FBC66,0.67663
B0027V760M,0.67663
B097Q9W8MW,0.543684
B000V1K7FG,0.543684
B089QY6YYQ,0.520538
B09Y1QWK5W,0.41643
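
The output above lists users whose cosine similarity to the target user exceeds 0.3, then ranks products by a weighted score. The source of `recommendations.Csimilarity_user_recommendation` is not shown here, so the following is only a minimal sketch of how a user-based recommender of this kind could work. The function name `recommend_for_user`, its parameters, and the toy data are assumptions made for illustration; the scores it returns are on the rating scale and will not match the weights printed above.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def recommend_for_user(ratings, user, threshold=0.3, top_n=7):
    """Hypothetical helper, not the code of the `recommendations` module."""
    # User x item matrix; repeated ratings are averaged, missing ones become 0.
    matrix = ratings.pivot_table(
        index="user", columns="id", values="rating", aggfunc="mean"
    ).fillna(0.0)

    # Cosine similarity between the target user and every user.
    sims = pd.Series(
        cosine_similarity(matrix.loc[[user]], matrix)[0], index=matrix.index
    )
    # Keep neighbours above the threshold, excluding the target user.
    neighbours = sims[sims > threshold].drop(user, errors="ignore")
    if neighbours.empty:
        return pd.Series(dtype=float)

    # Similarity-weighted average of the neighbours' ratings for each item.
    scores = matrix.loc[neighbours.index].T.dot(neighbours) / neighbours.sum()

    # Keep only items the target user has not rated and a neighbour has rated.
    unseen = matrix.loc[user] == 0
    scores = scores[unseen & (scores > 0)]
    return scores.sort_values(ascending=False).head(top_n)


# Toy data using the same column names as df_recommendation.
toy = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u2", "u3"],
    "id": ["A", "B", "A", "B", "C", "D"],
    "rating": [5.0, 4.0, 5.0, 4.0, 5.0, 1.0],
})
recs = recommend_for_user(toy, "u1")
print(recs)
```

In this toy example, `u2` is the only neighbour of `u1`, so the single recommendation is item `C`. A user such as `u3`, who shares no item with anyone, gets an empty result, which mirrors the case noted in the code comment about users with no neighbour above 0.3.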


As in the paper **Empirical Analysis of Predictive Algorithms for Collaborative Filtering** by John S. Breese, David Heckerman, and Carl Kadie, we handle multiple ratings by taking the mean of each user's ratings. This is a memory-based algorithm. Averaging also covers the case where a user bought a product in the past and their opinion has since changed.
$$
\bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j}
$$

where $v_{i,j}$ is the rating that user $i$ gave to item $j$, and $I_i$ is the set of items that user $i$ has rated so far.
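
A small sketch of this averaging step with pandas, assuming the `user`, `id`, and `rating` columns of `df_recommendation`. One reading of the text is that repeated ratings of the same product by the same user are first collapsed into their mean, and that the mean rating $\bar{v}_i$ is then taken per user. The toy values and the variable names `dedup` and `user_mean` are made up for illustration.

```python
import pandas as pd

# Toy ratings: user u1 rated product A twice (2 stars, later 4 stars).
toy = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u2"],
    "id": ["A", "A", "B", "A"],
    "rating": [2.0, 4.0, 5.0, 3.0],
})

# Collapse repeated (user, product) ratings into their mean.
dedup = toy.groupby(["user", "id"], as_index=False)["rating"].mean()

# Mean rating per user over the products that user has rated.
user_mean = dedup.groupby("user")["rating"].mean()
print(user_mean)
```

Here `u1` ends up with a rating of 3.0 for product `A` and a mean rating of 4.0 over products `A` and `B`.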

In [20]:
random_user = df_recommendation['user'].sample(n=1).iloc[0]
print(random_user)
recommendations.Csimilarity_user_recommendation(df_recommendation, random_user, df_images)

AG622C3E6PARXNYNYPZ6OWJZ4SHQ
The number of unique products is : 1003
The number of unique users is : 5107 

URL de l'image de l'objet https://m.media-amazon.com/images/I/61OYM7JBOnL._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/710tQqz6WdL._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/718MzeCyYbL._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/817kKHTU-FL._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/71SOIq8ZkuL._SL256_.jpg


URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/81j-j-I7uIL._SL256_.jpg


The similar (with similarity bigger than 0.3 of cosine similarity) users to user AG622C3E6PARXNYNYPZ6OWJZ4SHQ are  : 
User: AG622C3E6PARXNYNYPZ6OWJZ4SHQ, Similarity: 0.9999999999999999
User: AHEPLA43HWNAN5VRVBK457ILTEVQ, Similarity: 0.4472135954999579
User: AGWA5ARINY5E3E3MH43OXEDZ6ZBA, Similarity: 0.40451991747794525
User: AEHTPUOE73PH7X3FN2222VWRVJBQ, Similarity: 0.40249223594996214
User: AECEYRS5X2CFMDC43NZWL3KGVVUA, Similarity: 0.36
User: AEZ37PNFAOEH33X7MJZI2JJL5WQQ, Similarity: 0.35777087639996635
User: AF4OPEDBV52TDNGHJL7MV7IRIAPQ, Similarity: 0.35777087639996635
User: AGQ4KUC5JAXB3A6UGXUM5DIPW4LQ, Similarity: 0.35777087639996635
User: AGV2W5HDU3JDLPVV2H5TWEONRKCA, Similarity: 0.3530090432487313
User: AHR64MUEH6REYKTCJ5RWCM5MOAUQ, Similarity: 0.35032452487268534
*************** Recommendation number 1 *************** : 
 Product id : B08JMQR2JK, associated weight : 0.6349425980625089
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/712zdUra7hL._SL256_.j

*************** Recommendation number 2 *************** : 
 Product id : B0C6J1X7TD, associated weight : 0.5934937124287156
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/61Czp0iMtxL._SL256_.jpg


*************** Recommendation number 3 *************** : 
 Product id : B008BPI2OW, associated weight : 0.5934937124287156
URL de l'image de l'objet https://m.media-amazon.com/images/I/81A37bd-CuL._SL256_.jpg


*************** Recommendation number 4 *************** : 
 Product id : B01LVXOO61, associated weight : 0.5308369140837279
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/71HA4QYM1VL._SL256_.jpg


*************** Recommendation number 5 *************** : 
 Product id : B07PPHHG34, associated weight : 0.5308369140837279
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/716PuuF2EbL._SL256_.jpg


*************** Recommendation number 6 *************** : 
 Product id : B00P6ZOPP0, associated weight : 0.520528419893905
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/71qMtuDODiL._SL256_.jpg


*************** Recommendation number 7 *************** : 
 Product id : B09PWCF6T9, associated weight : 0.520528419893905
URL de l'image de l'objet https://images-na.ssl-images-amazon.com/images/I/71comUxJc5L._SL256_.jpg


id,weighted_averages
B08JMQR2JK,0.634943
B0C6J1X7TD,0.593494
B008BPI2OW,0.593494
B01LVXOO61,0.530837
B07PPHHG34,0.530837
B00P6ZOPP0,0.520528
B09PWCF6T9,0.520528


AECRYQALXW35XL4PXVO3VEOC2CUA


### b) Item-Based Collaborative Filtering
Instead of focusing on user similarity, this method finds items that are similar based on user ratings or interactions.

Reference: https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf
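
As a rough illustration of the item-based idea, the sketch below compares products by the ratings they received, using cosine similarity on an item x user matrix. It is not taken from the project code: the function name `similar_items`, its `top_n` parameter, and the toy data are assumptions, and only the column names `user`, `id`, and `rating` come from `df_recommendation`.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def similar_items(ratings, item, top_n=5):
    """Hypothetical helper: products most similar to `item` by user ratings."""
    # Item x user matrix; repeated ratings are averaged, missing ones become 0.
    matrix = ratings.pivot_table(
        index="id", columns="user", values="rating", aggfunc="mean"
    ).fillna(0.0)

    # Cosine similarity between the target product and every product.
    sims = pd.Series(
        cosine_similarity(matrix.loc[[item]], matrix)[0], index=matrix.index
    )
    # Drop the product itself and keep the closest ones.
    return sims.drop(item).sort_values(ascending=False).head(top_n)


toy = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u3"],
    "id": ["A", "B", "A", "B", "C"],
    "rating": [5.0, 5.0, 4.0, 4.0, 5.0],
})
neighbours = similar_items(toy, "A")
print(neighbours)
```

Products `A` and `B` were rated identically by the same users, so `B` comes out as the closest item to `A`, while `C`, rated only by a different user, has similarity 0.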

# II - Content-based filtering 

# III - Hybrid Recommender Systems 