# Advanced ML, Recommendation Project

### Description of the General topic:  
 
Recommendation systems are a class of machine learning tools designed to suggest 
relevant items to users based on their preferences, their behavior, and the 
activity of other users. They are widely used across e-commerce, streaming 
platforms, social media, and online advertising, where they aim to improve the 
user experience by delivering personalized content or product suggestions.

### Flow of the Code Project (All in Python)  :  
 
**Data Preprocessing:** 
- Preprocess textual data using tokenization, stemming, and vectorization 
techniques like TF-IDF or word embeddings. 
 
**Building Collaborative Filtering Models:** 
- Implement SVD for matrix factorization. 
- Build user-based and item-based collaborative filtering models using 
libraries like Scikit-Learn. 
 
**Building Content-Based Filtering Models:**
- Create item profiles using metadata. 
- Use cosine similarity or neural embeddings to identify similar items. 
 
**Combine with Hybrid Techniques:** 
- Experiment with hybrid models (weighted hybrid, feature-augmented 
collaborative filtering, etc.) to combine collaborative and content-based 
methods. 
- Train deep hybrid models if using neural networks, concatenating 
collaborative and content-based embeddings as input. 

**Visualization:**  
- Summarize the results and visualize them with plots wherever possible.

In [1]:
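# Import the user review data:
import pandas as pd
import json

reviews_path = "/Users/aminerazig/Desktop/ENSAE 3A/ADVANCED ML/Advanced ML-project/DATA/Health_and_Personal_Care.jsonl"

# Use a separate name for the file handle so the path variable is not shadowed
with open(reviews_path, 'r', encoding='utf-8') as f:
    data = [json.loads(line) for line in f]

# Keep the first 10,000 reviews (the variable name still says 1000)
products_1000_usersdata = data[:10000]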

In [6]:
len(products_1000_usersdata)

1000

In [18]:
# Import the product metadata (first 1,000 records):

products_1000_metadata = []
file_metadata = "/Users/aminerazig/Desktop/ENSAE 3A/ADVANCED ML/Advanced ML-project/DATA/meta_Health_and_Personal_Care.jsonl"

with open(file_metadata, 'r') as file:
    for i, line in enumerate(file):
        if i >= 1000:  
            break
        products_1000_metadata.append(json.loads(line))

In [22]:
# Display a few product images:
import json
import random
import requests
from PIL import Image
from io import BytesIO



def get_random_products_with_images(products, num_products=90):
    products_with_images = [p for p in products if p.get('images') and len(p['images']) > 0]
    return random.sample(products_with_images, min(num_products, len(products_with_images)))


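def fetch_and_resize_image(url, size=(30, 30)):
    """Download an image and resize it; return None if the download fails."""
    try:
        # A timeout prevents the cell from hanging on an unresponsive URL
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        img = Image.open(BytesIO(response.content))
        return img.resize(size)
    except Exception as e:
        print(f"Error while downloading the image: {e}")
        return None

# Paste the thumbnails into a grid (mosaic)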
def create_mosaic(images, grid_size=(10, 9), image_size=(30, 30)):
    mosaic = Image.new('RGB', (grid_size[0] * image_size[0], grid_size[1] * image_size[1]))
    for idx, img in enumerate(images):
        if img:
            x = (idx % grid_size[0]) * image_size[0]
            y = (idx // grid_size[0]) * image_size[1]
            mosaic.paste(img, (x, y))
    mosaic.show()
    return mosaic


selected_products = get_random_products_with_images(products_1000_metadata)
image_urls = [p['images'][0]['large'] for p in selected_products]

images = [fetch_and_resize_image(url) for url in image_urls]
mosaic = create_mosaic(images)

# Data Fields

## For User Reviews

| Field              | Type   | Explanation                                                                                                                                                     |
|--------------------|--------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `rating`           | float  | Rating of the product (from 1.0 to 5.0).                                                                                                                        |
| `title`            | str    | Title of the user review.                                                                                                                                       |
| `text`             | str    | Text body of the user review.                                                                                                                                   |
| `images`           | list   | Images that users post after they have received the product. Each image has different sizes (small, medium, large), represented by `small_image_url`, `medium_image_url`, and `large_image_url`. |
| `asin`             | str    | ID of the product.                                                                                                                                              |
| `parent_asin`      | str    | Parent ID of the product. Note: Products with different colors, styles, sizes usually belong to the same parent ID. The “asin” in previous Amazon datasets is actually the parent ID. Please use parent ID to find product meta. |
| `user_id`          | str    | ID of the reviewer.                                                                                                                                             |
| `timestamp`        | int    | Time of the review (unix time).                                                                                                                                 |
| `verified_purchase`| bool   | User purchase verification.                                                                                                                                     |
| `helpful_vote`     | int    | Helpful votes of the review.                                                                                                                                    |


In [17]:
import EDA_functions
image_url = "https://images-na.ssl-images-amazon.com/images/I/71DFEoJ+Z9L._SL256_.jpg"
EDA_functions.show_image(image_url)

# I- Collaborative Filtering

Collaborative filtering recommends products to users based on the behavior of other users with similar preferences. CF methods work on the principle that users who agreed on items in the past are likely to agree again. There are two main types:

### a) User-Based Collaborative Filtering
This approach identifies users who have similar preferences (based on ratings or clicks) and recommends items that similar users liked.

### Implementation
We can try to implement this method using matrix factorization techniques like **Singular Value Decomposition (SVD)**, which reduces the dimensionality of the data matrix, capturing latent factors that explain user-item interactions.
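To make the idea of latent factors concrete, here is a minimal sketch of a truncated SVD on a toy rating matrix. The matrix values, the function name `truncated_svd_reconstruction`, and the choice of two factors are illustrative assumptions. Treating unrated entries as 0 is a simplification that a real model would need to handle more carefully.

```python
import numpy as np

# Toy user x item rating matrix (made-up values; 0 stands for "not rated",
# which is a simplifying assumption for this sketch).
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])


def truncated_svd_reconstruction(matrix, n_factors):
    """Return the rank-n_factors approximation of matrix obtained by SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep only the n_factors largest singular values and their vectors
    return u[:, :n_factors] @ np.diag(s[:n_factors]) @ vt[:n_factors, :]


# Each user and each item is now described by 2 latent factors
approx = truncated_svd_reconstruction(ratings, n_factors=2)
print(np.round(approx, 2))
```

The entries of `approx` at positions that were 0 in `ratings` can be read as rough predicted scores for unrated items.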


In [2]:
df_recommendation = pd.DataFrame(
    [{"id": item["parent_asin"], "user": item["user_id"], "rating": item["rating"]} for item in products_1000_usersdata])


In [19]:
df_recommendation.head(5)

|   | id         | user                         | rating |
|---|------------|------------------------------|--------|
| 0 | B07TDSJZMR | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 4.0    |
| 1 | B08637FWWF | AEVWAM3YWN5URJVJIZZ6XPD2MKIA | 5.0    |
| 2 | B07KJVGNN5 | AHSPLDNW5OOUK2PLH7GXLACFBZNQ | 5.0    |
| 3 | B092RP73CX | AEZGPLOYTSAPR3DHZKKXEFPAXUAA | 4.0    |
| 4 | B08KYJLF5T | AEQAYV7RXZEBXMQIQPL6KCT2CFWQ | 1.0    |


In [5]:
print(f"The number of unique products is : {df_recommendation.id.nunique()}")
print(f"The number of unique users is : {df_recommendation.user.nunique()}")

The number of unique products is : 5927
The number of unique users is : 7336


In [6]:
ratings_per_product = df_recommendation.groupby('id')['user'].nunique()
print(f"The proportion of products rated by different users : \n")
pd.DataFrame(ratings_per_product.describe())

## In the selected sample, at least 75% of the products are rated by a single user (the 75th percentile is 1).


The proportion of products rated by different users : 



|       | user     |
|-------|----------|
| count | 5927.0   |
| mean  | 1.671335 |
| std   | 2.885594 |
| min   | 1.0      |
| 25%   | 1.0      |
| 50%   | 1.0      |
| 75%   | 1.0      |
| max   | 74.0     |


Following the paper **Empirical Analysis of Predictive Algorithms for Collaborative Filtering** by John S. Breese, David Heckerman and Carl Kadie, we use a memory-based approach.
A user may have rated the same product several times, for example after buying it again and changing opinion, so we replace multiple ratings of the same product by their mean. The mean rating of user $i$ is
$$
\bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j}
$$

where $v_{i,j}$ is the rating that user $i$ gave to item $j$ and $I_i$ is the set of items that user $i$ has rated.
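As an illustration of the averaging step and of the mean rating formula, here is a small sketch on made-up data. The DataFrame `toy` and its values are assumptions for the example, with the same column names (`user`, `id`, `rating`) as the recommendation table.

```python
import pandas as pd

# Made-up ratings: user u1 rated product p1 twice (4.0, then 2.0)
toy = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u2", "u2"],
    "id": ["p1", "p1", "p2", "p1", "p3"],
    "rating": [4.0, 2.0, 5.0, 1.0, 3.0],
})

# Step 1: collapse repeated (user, product) pairs into their mean rating
deduplicated = toy.groupby(["user", "id"], as_index=False)["rating"].mean()

# Step 2: mean rating per user, i.e. the average over the set I_i of rated items
user_mean = deduplicated.groupby("user")["rating"].mean()
print(user_mean)
```

Here `u1` ends up with ratings 3.0 (for `p1`) and 5.0 (for `p2`), so the mean rating of `u1` is 4.0.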

In [7]:
# First, check whether the dataset contains duplicates (i.e. a user who rated the same product more than once)
print(df_recommendation[df_recommendation.duplicated(subset=['user', 'id'], keep=False)].shape)

# Then collapse those duplicates by averaging the ratings:
df_recommendation = df_recommendation.groupby(['user', 'id'], as_index=False)['rating'].mean()

(171, 3)


In [9]:
# Pivot the data into wide (user x item) format for user-based collaborative filtering:
df_recommendation = df_recommendation.pivot(index='user', columns='id', values='rating')

This operation produces a very sparse matrix, which can be hard to handle on large datasets because of its memory footprint.
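One possible way to limit memory use is to store only the observed ratings in a SciPy sparse matrix instead of a dense pivot table. The sketch below assumes a long-format DataFrame with columns `user`, `id` and `rating`, with duplicates already averaged (otherwise `csr_matrix` would sum them). The helper name `to_sparse_ratings` and the toy data are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix


def to_sparse_ratings(long_df):
    """Build a sparse user x item matrix from long-format ratings."""
    users = long_df["user"].astype("category")
    items = long_df["id"].astype("category")
    # Category codes give integer row/column positions for each rating
    rows = users.cat.codes.to_numpy().astype("int64")
    cols = items.cat.codes.to_numpy().astype("int64")
    matrix = csr_matrix(
        (long_df["rating"].to_numpy(dtype=float), (rows, cols)),
        shape=(len(users.cat.categories), len(items.cat.categories)),
    )
    return matrix, users.cat.categories, items.cat.categories


# Made-up example: 2 users, 3 products, 3 observed ratings
toy_long = pd.DataFrame({
    "user": ["u1", "u1", "u2"],
    "id": ["p1", "p2", "p3"],
    "rating": [4.0, 5.0, 3.0],
})
sparse_ratings, user_labels, item_labels = to_sparse_ratings(toy_long)
print(sparse_ratings.shape, sparse_ratings.nnz)
```

Only the stored ratings take memory, so the cost grows with the number of reviews and not with the number of users times the number of products.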

In [13]:
df_recommendation

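| user \ id | 1465874399 | 1934786004 | 197480772X | 6148479311 | B000050FEQ | B00008LUPV | B00009QJW6 | B00009ZY40 | B0000A605R | B0000DJ27H | ... | B0C61G3BB3 | B0C6H6RRVB | B0C71W41C5 | B0C7V193T7 | B0C9H8MYG9 | B0CB33QW6H | B0CCW81Y4T | B0CCWPQL6X | B0CD14W2QT | B0CDDBK2G4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AE227RVA23EPOD52V7J7CCRYIHBQ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| AE22IPO5AD7T3QUS6TOPU6T6OL6Q | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| AE22Z3RLVIRU6RT5PNRK5CFFNEFQ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| AE23HUJD2RENUFCMHPVVC3F64KRQ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| AE24FFSUQHE3J6NYBICB7V2WHUAA | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| AHZXJ4N5GBXLRDEKD37LB6ZZTPWQ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| AHZYBJVSPJO4NMRWIQ4TI4Y42CJA | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| AHZZJQYNVZUJNPNQ737ITGEQUB4A | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| AHZZNR5FSD5ODQYVFCWFNLHGX55Q | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |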


In [27]:
user_id = "AE227RVA23EPOD52V7J7CCRYIHBQ"
# List of the products that a given user has rated.
# I_i is the set of items that user i has rated.
rated_items = df_recommendation.loc[user_id][df_recommendation.loc[user_id].notna()].index

print(f"The user {user_id} has rated the products : {rated_items}")

The user AE227RVA23EPOD52V7J7CCRYIHBQ has rated the products : Index(['B076Q1WTB8'], dtype='object', name='id')
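The user-based idea of this section (recommend what similar users liked) can be sketched as a memory-based predictor: start from the target user's mean rating and add a similarity-weighted average of how other users deviated from their own means. This is a simplified variant, not necessarily the exact formula of the paper: similarity here is the cosine between mean-centred rating vectors, with unrated items counted as 0. The function name `predict_user_based` and the toy matrix are assumptions.

```python
import numpy as np
import pandas as pd


def predict_user_based(ratings, user, item):
    """Predict the rating of `user` for `item` from similar users' ratings."""
    means = ratings.mean(axis=1)                       # mean rating per user (NaN skipped)
    centred = ratings.sub(means, axis=0).fillna(0.0)   # unrated items contribute 0
    norms = np.sqrt((centred ** 2).sum(axis=1))
    # Cosine similarity between the target user and every user
    denom = (norms * norms[user]).replace(0.0, np.nan)
    sims = ((centred @ centred.loc[user]) / denom).drop(user).fillna(0.0)
    # Only users who actually rated the item can contribute
    rated = ratings[item].drop(user).notna()
    weights = sims[rated]
    if weights.abs().sum() == 0:
        return means[user]                             # fall back to the user's mean
    deviations = centred.loc[weights.index, item]
    return means[user] + (weights * deviations).sum() / weights.abs().sum()


# Made-up wide matrix: u2 rates like u1, u3 rates the opposite way
toy_wide = pd.DataFrame(
    {
        "p1": [5.0, 5.0, 1.0],
        "p2": [4.0, 5.0, 2.0],
        "p3": [1.0, 1.0, 5.0],
        "p4": [np.nan, 2.0, 5.0],
    },
    index=["u1", "u2", "u3"],
)
prediction = predict_user_based(toy_wide, "u1", "p4")
print(round(float(prediction), 2))
```

On this toy data the prediction for `u1` on `p4` falls below the mean rating of `u1`, because the similar user `u2` rated `p4` low and the dissimilar user `u3` rated it high.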


### b) Item-Based Collaborative Filtering
Instead of focusing on user similarity, this method finds items that are similar based on user ratings or interactions.
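A minimal sketch of this item-based view is shown below: each item is represented by its column of user ratings, and items are compared with cosine similarity. Filling unrated entries with 0 is a simplifying assumption, and the helper names `item_similarity` and `most_similar_items` as well as the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def item_similarity(ratings):
    """Cosine similarity between item columns of a wide user x item matrix."""
    vectors = ratings.fillna(0.0).to_numpy().T          # one row per item
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                             # avoid division by zero
    unit = vectors / norms
    return pd.DataFrame(unit @ unit.T, index=ratings.columns, columns=ratings.columns)


def most_similar_items(ratings, item, top_n=2):
    """Return the top_n items most similar to `item`, excluding itself."""
    sims = item_similarity(ratings)[item].drop(item)
    return sims.sort_values(ascending=False).head(top_n)


# Made-up wide matrix: p1 and p2 are liked by the same users
toy_items = pd.DataFrame(
    {
        "p1": [5.0, 5.0, 1.0],
        "p2": [4.0, 5.0, 2.0],
        "p3": [1.0, 1.0, 5.0],
        "p4": [np.nan, 2.0, 5.0],
    },
    index=["u1", "u2", "u3"],
)
neighbours = most_similar_items(toy_items, "p1")
print(neighbours)
```

A user who liked `p1` would then be recommended its nearest neighbours among the items they have not rated yet.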

# II- Content-Based Filtering
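The project outline mentions item profiles compared with cosine similarity on TF-IDF vectors. A minimal sketch of that idea with scikit-learn follows. The item identifiers and texts are made up, and which metadata fields would feed the item profile in the real dataset is left as an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up item profiles (in the project these would come from product metadata)
item_texts = {
    "item_a": "vitamin c supplement tablets",
    "item_b": "vitamin d supplement capsules",
    "item_c": "electric toothbrush replacement heads",
}
item_ids = list(item_texts)

# Turn each profile into a TF-IDF vector, then compare all pairs of items
tfidf = TfidfVectorizer(stop_words="english").fit_transform(list(item_texts.values()))
similarity = cosine_similarity(tfidf)

# Similarity of item_a to the two other items
print(dict(zip(item_ids[1:], similarity[0, 1:].round(3))))
```

Items that share vocabulary (`item_a` and `item_b`) get a positive score, while items with no word in common (`item_a` and `item_c`) get a score of 0.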

# III- Hybrid Recommender Systems
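The weighted hybrid mentioned in the project outline can be sketched as a convex combination of a collaborative score and a content-based score for each candidate item. The function name `weighted_hybrid`, the min-max rescaling, and the weight `alpha` are illustrative assumptions, not a tuned design.

```python
import numpy as np


def weighted_hybrid(cf_scores, content_scores, alpha=0.5):
    """Blend two score vectors: alpha * collaborative + (1 - alpha) * content."""
    def rescale(x):
        # Min-max rescaling so that both score types live on [0, 1]
        span = x.max() - x.min()
        return np.zeros_like(x) if span == 0 else (x - x.min()) / span

    cf = rescale(np.asarray(cf_scores, dtype=float))
    content = rescale(np.asarray(content_scores, dtype=float))
    return alpha * cf + (1 - alpha) * content


# Made-up scores for three candidate items
hybrid = weighted_hybrid([1.0, 3.0, 5.0], [0.2, 0.9, 0.1], alpha=0.5)
best_item = int(np.argmax(hybrid))
print(hybrid, best_item)
```

Setting `alpha` close to 1 trusts the collaborative scores more, which may be reasonable for items with many ratings. A lower `alpha` leans on content for sparsely rated items.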