## Lab 7: Recommendations 

In this lab, we will apply what we did with movies to Amazon reviews. This dataset comes from https://jmcauley.ucsd.edu/data/amazon/ and is a subset of a larger dataset, containing only ratings for musical instruments.

Open the dataset (after downloading it from the Learning Hub), and answer some questions!


*I've put in some code (from the movies file), but you can change whatever you'd like. This code won't run as is; you need to finish it.*


In [264]:
##load some packages

import random
from traceback import print_tb

import pandas
import matplotlib.pyplot as plt
import numpy as np

## this will optimize our math
from scipy.sparse import csr_matrix as sparse_matrix

from sklearn.neighbors import NearestNeighbors


import os

## You can add more here, if you need it!


In [265]:
## read in the data set
amazon = pandas.read_csv("data/ratings_Musical_Instruments.csv",names=("user","item","rating","timestamp"))
amazon.head()
## we  don't need to read in everything


| | user | item | rating | timestamp |
|---|---|---|---|---|
| 0 | A1YS9MDZP93857 | 6428320 | 3.0 | 1394496000 |
| 1 | A3TS466QBAWB9D | 14072149 | 5.0 | 1370476800 |
| 2 | A3BUDYITWUSIS7 | 41291905 | 5.0 | 1381708800 |
| 3 | A19K10Z0D2NTZK | 41913574 | 5.0 | 1285200000 |
| 4 | A14X336IB4JD89 | 201891859 | 1.0 | 1350432000 |


In [266]:
## How big is dataset
rows, columns = amazon.shape
print(f"{rows} rows and {columns} columns")

## How many unique users are there
print(f"Unique users: {amazon['user'].nunique()}")  # single quotes inside the f-string work on Python versions before 3.12

## How many unique items are there
print(f"Unique items: {amazon['item'].nunique()}")  # single quotes inside the f-string work on Python versions before 3.12

## User with most ratings rates how many items
max_ratings = amazon["user"].value_counts().max()
print(f"Max ratings by a user: {max_ratings}")

## Item with most rating
most_ratings_items = amazon["item"].value_counts()
most_rated_item = most_ratings_items.index[0]
num_most_rated = most_ratings_items.iloc[0]
print(f"Most rated item: {most_rated_item}")
print(f"Number of ratings for most rated item: {num_most_rated}")

## Highest and lowest
avg_ratings_per_item = amazon.groupby("item")["rating"].mean()
highest_rated_item = avg_ratings_per_item.idxmax()
highest_avg_rating = avg_ratings_per_item.max()
print(f"Highest rated item: {highest_rated_item}")
print(f"Highest average rating: {highest_avg_rating}")

lowest_rated_item = avg_ratings_per_item.idxmin()
lowest_avg_rating = avg_ratings_per_item.min()
print(f"Lowest rated item: {lowest_rated_item}")
print(f"Lowest average rating: {lowest_avg_rating}")

## How large would the matrix be
matrix_rows = amazon["user"].nunique()
matrix_columns = amazon["item"].nunique()
total_entries = matrix_rows * matrix_columns
print(f"The matrix would have {matrix_rows} rows and {matrix_columns} columns, with {total_entries} entries")

## Non-zero entries
non_zero_entries = len(amazon)
non_zero_percentage = (non_zero_entries / total_entries) * 100
print(f"The percentage of non-zero entries is {non_zero_percentage}%")

500176 rows and 4 columns
Unique users: 339231
Unique items: 83046
Max ratings by a user: 483
Most rated item: B000ULAP4U
Number of ratings for most rated item: 3523
Highest rated item: 0014072149
Highest average rating: 5.0
Lowest rated item: 0201891859
Lowest average rating: 1.0
The matrix would have 339231 rows and 83046 columns, with 28171777626 entries
The percentage of non-zero entries is 0.0017754506181334572%


### Part 1: Descriptive Questions:

* How big is the dataset, in rows and columns? **500176 rows and 4 columns**
* How many unique users are there? **339231 unique users**
* How many unique items are there? **83046 unique items**
* The user who rated the most instruments has rated how many items? **483 items**
* The item with the MOST ratings is what? How many ratings does it have?  Hint: Check out the Amazon website, by going to "www.amazon.com/dp/item_code", where you put the item into item_code. **B000ULAP4U (Audio Technica ATH-M50 Pro Headphones) with 3523 ratings**
* The item with the highest mean rating is what?  What is the rating? **0014072149 (Bach, J.S. - Double Concerto in d minor BWV 1043 for Two Violins and Piano) with a 5.0 rating** (`idxmax` returns only the first item with the maximum mean, so other items may tie at 5.0)
* What is the item with the lowest mean rating?  What is the rating? **0201891859 (Chopin: Nocturnes) with a 1.0 rating** (`idxmin` likewise returns only the first item with the minimum mean)
* If we built a matrix with all of the users and items, how large would it be? (dimensions, and how many total entries): **The matrix would have 339231 rows and 83046 columns, with 28171777626 entries**
* Looking at the size of the dataset, what is the percentage of non-zero entries in the matrix?: **The percentage of non-zero entries is about 0.0018% (0.0017754506181334572%)**

In [267]:
## Here we are making the same sparse matrix that we made in class
## Look it over, but don't worry overly

def create_X(ratings,n,d,user_key="user",item_key="item"):
    """
    This code takes a dataset and makes it a Sparse matrix, with some baby functions attached
    """
    user_mapper = dict(zip(np.unique(ratings[user_key]), list(range(d))))
    item_mapper = dict(zip(np.unique(ratings[item_key]), list(range(n))))

    user_inverse_mapper = dict(zip(list(range(d)), np.unique(ratings[user_key])))
    item_inverse_mapper = dict(zip(list(range(n)), np.unique(ratings[item_key])))

    user_ind = [user_mapper[i] for i in ratings[user_key]]
    item_ind = [item_mapper[i] for i in ratings[item_key]]

    X = sparse_matrix((ratings["rating"], (item_ind, user_ind)), shape=(n,d))

    return X, user_mapper, item_mapper, user_inverse_mapper, item_inverse_mapper, user_ind, item_ind



In [268]:
## define X in a new window
## For this to work, you need to define n and d properly above!
## The error you see is BECAUSE n and d are incorrect
n = amazon["item"].nunique() # the number of unique items (not 10!)
d = amazon["user"].nunique() # the number of unique users (more than 6!)
X, user_mapper, item_mapper, user_inverse_mapper, item_inverse_mapper, user_ind, item_ind = create_X(amazon, n=n, d=d)
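
As a quick sanity check on a matrix like `X`, here is a minimal sketch on a made-up ratings table (the name `toy` and its values are for illustration only). It builds the same items-by-users layout with `pandas.factorize` instead of dictionary mappers, then compares the number of stored entries with the number of rows. SciPy sums duplicate (item, user) entries when building from coordinates, so repeated pairs would show up as fewer stored entries than rows.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy ratings table (made-up data, same column names as the lab's DataFrame)
toy = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u3", "u3"],
    "item": ["a", "b", "a", "b", "c"],
    "rating": [5.0, 3.0, 4.0, 2.0, 1.0],
})

# factorize gives each distinct label an integer code, in order of first appearance
user_codes, user_labels = pd.factorize(toy["user"])
item_codes, item_labels = pd.factorize(toy["item"])

n_items, n_users = len(item_labels), len(user_labels)
X_toy = csr_matrix(
    (toy["rating"].to_numpy(), (item_codes, user_codes)),
    shape=(n_items, n_users),
)

# Stored entries equal the number of rows when no (item, user) pair repeats
density = X_toy.nnz / (n_items * n_users)
print(X_toy.shape, X_toy.nnz, f"{density:.2%}")
```

On the real data, the same comparison would be `X.nnz` against `len(amazon)`.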

## Part 2: Try some Machine Learning

Thinking of the code from our movie exploration in class, build a BASIC recommender for similar items.  You can do ANY of the following, but you need to choose at least 1:
* Use KNN with at least two different iterations to find 5 items that are close to a specific item.  In class, my algorithm always returned the original movie in the list - can you modify the code so this doesn't happen?
* Use *any type of clustering* to cluster items OR users into groups that are similar.
* Can you decompose the matrix via PCA?  How can this help us recommend?
* Do some outside research, and make something!
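
For the decomposition option, here is a minimal sketch on a made-up toy matrix. It uses scikit-learn's `TruncatedSVD` rather than `PCA`, on the assumption that a matrix this sparse should not be centered or converted to dense form; that substitution is a choice made here, not something the lab prescribes. Each item becomes a short dense vector, and items whose vectors are close together are candidates for "similar item" recommendations.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Made-up items-by-users ratings (rows are items, columns are users; 0 = no rating)
toy = csr_matrix(np.array([
    [5, 4, 0, 0, 1],
    [4, 5, 0, 0, 0],
    [0, 0, 5, 4, 0],
    [0, 1, 4, 5, 0],
], dtype=float))

# Reduce each item's 5-dimensional rating row to 2 latent components
svd = TruncatedSVD(n_components=2, random_state=0)
item_factors = svd.fit_transform(toy)

print(item_factors.shape)             # one 2-d vector per item
print(svd.explained_variance_ratio_)  # share of variance kept by each component
```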


In [269]:
## ML goes here
## Applying KNN
knn_model = NearestNeighbors(metric='cosine', algorithm='brute')
knn_model.fit(X)

def find_similar_items(item_name, model, item_mapper, item_inverse_mapper, X, k=10):
    try:
        item_index = item_mapper[item_name]
    except KeyError:
        # Unknown item: return an empty list so callers always receive a list
        return []

    # Request more neighbors to ensure we have enough after filtering
    distances, indices = model.kneighbors(X[item_index].reshape(1, -1), n_neighbors=k+5)
    
    # Convert indices to item names and exclude the input item
    similar_items = [item_inverse_mapper[int(i)] for i in indices.flatten() 
                     if item_inverse_mapper[int(i)] != item_name]
    
    return similar_items[:k]  # Return only the top k items

# Test the function on the first item in the dataset
test_item = amazon['item'].iloc[0]
similar_items = find_similar_items(test_item, knn_model, item_mapper, item_inverse_mapper, X, k=5)
print(f"Similar items to '{test_item}': {similar_items}")

Similar items to '0006428320': ['B00021SUHC', 'B004I0CDM6', 'B0002TNGYG', 'B002Q0WT6U', 'B009CIIWQA']
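
The "drop the query item" step can be sketched in isolation on a made-up toy matrix. This version asks for `k + 1` neighbors and filters by row index; the function name `neighbors_excluding_self` is hypothetical. The query row can appear at most once in the results, so one extra neighbor is enough to leave `k` after filtering.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Made-up items-by-users ratings; items 0 and 1 share raters, as do items 2 and 3
toy = csr_matrix(np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float))

model = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy)

def neighbors_excluding_self(row, k):
    """Return the row indices of the k nearest items, skipping the query row itself."""
    _, indices = model.kneighbors(toy[row], n_neighbors=k + 1)
    return [int(i) for i in indices.flatten() if int(i) != row][:k]

print(neighbors_excluding_self(0, k=2))
```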


Using your work above, do the following:

1. Create *some sort* of recommendation system,
2. EXPLAIN what you have done and why, and
3. Answer the following questions:
    * What are the three top products you would recommend to a new user with no rating or purchasing history (the "cold start" problem)?
    * What rating do you think a new user would give to item "B009CIIWQA" (a rechargeable Music Stand LED lamp, at https://www.amazon.com/dp/B009CIIWQA)?
    * What are the top three products you would recommend to user "A27L1LDJZVRLJD"?
    * What rating do you think user "A27L1LDJZVRLJD" would give to the LED music lamp, item "B009CIIWQA"?

### 1. Create some sort of recommendation system

In [270]:
# Return the most popular items, ranked by a score that combines average rating and number of ratings
def get_top_popular_items(amazon, n=3, min_reviews=5):
    item_stats = amazon.groupby('item').agg(
        avg_rating=('rating', 'mean'),
        num_ratings=('rating', 'count')
    )
    
    # Filter items with at least min_reviews
    item_stats = item_stats[item_stats['num_ratings'] >= min_reviews]
    
    # Calculate a weighted score that considers both rating and popularity
    # This formula gives more weight to items with more reviews
    item_stats['score'] = item_stats['avg_rating'] * (1 + np.log1p(item_stats['num_ratings']))
    
    # Sort by the weighted score and return the top n items
    return item_stats.sort_values('score', ascending=False).head(n).index.tolist()

# Calculate average rating for each item in the dataset
def get_average_ratings(amazon):
    return amazon.groupby('item')['rating'].mean().sort_values(ascending=False)

# Predict rating for a new user for a specific item    
def predict_rating_for_new_user(item, average_ratings, item_mapper):
    if item in item_mapper:
        return average_ratings[item]
    return None

# Recommend items for an existing user based on their rating history
def recommend_for_existing_user(user, amazon, knn_model, item_mapper, 
                               item_inverse_mapper, X, k=3):
    user_ratings = amazon[amazon['user'] == user]
    
    if user_ratings.empty:
        return None
        
    # Pick the most recently rated item to find similar items
    latest_rated_item = user_ratings.sort_values(by='timestamp', ascending=False).iloc[0]['item']
    return find_similar_items(latest_rated_item, knn_model, item_mapper, 
                             item_inverse_mapper, X, k=k)

# Predict rating for an existing user for a specific item
def predict_rating_for_existing_user(user, item, amazon, average_ratings):
    user_ratings = amazon[amazon['user'] == user]
    
    if user_ratings.empty:
        return None
        
    # Predict the user's mean rating across the items they have rated
    # (the `item` and `average_ratings` arguments are not used here)
    return user_ratings['rating'].mean()

In [271]:
average_ratings = get_average_ratings(amazon)

# 1. Recommend three top products for a new user (cold start problem)
top_3_items = get_top_popular_items(amazon, n=3)
print("Top 3 products for new users (cold start):", top_3_items)

# 2. Predict rating for a new user for item "B009CIIWQA"
item = "B009CIIWQA"
predicted_rating_new_user = predict_rating_for_new_user(item, average_ratings, item_mapper)
if predicted_rating_new_user is not None:
    print(f"Predicted rating for a new user for item 'B009CIIWQA': {predicted_rating_new_user:.2f}")
else:
    print(f"Item '{item}' not found in the dataset.")

# 3. Recommend top three products to user "A27L1LDJZVRLJD"
user = "A27L1LDJZVRLJD"
if user in user_mapper:
    similar_items = recommend_for_existing_user(user, amazon, knn_model, 
                                                item_mapper, item_inverse_mapper, X, k=3)
    if similar_items:
        print(f"Top 3 products recommended for user 'A27L1LDJZVRLJD': {similar_items}")
    else:
        print(f"User '{user}' has no ratings. Recommending popular items.")
        print("Top 3 products for new users (cold start):", top_3_items)
else:
    print(f"User '{user}' not found in the dataset.")

# 4. Predict rating for user "A27L1LDJZVRLJD" for item "B009CIIWQA"
if user in user_mapper:
    predicted_rating_for_user = predict_rating_for_existing_user(user, item, amazon, average_ratings)
    if predicted_rating_for_user is not None:
        print(f"Predicted rating for user 'A27L1LDJZVRLJD' for item 'B009CIIWQA': {predicted_rating_for_user:.2f}")
    else:
        print(f"User '{user}' has no ratings.")
        print(f"Using average item rating: {predicted_rating_new_user:.2f}")
else:
    print(f"User '{user}' not found in the dataset.")

Top 3 products for new users (cold start): ['B000ULAP4U', 'B003VWJ2K8', 'B00FPPQYXM']
Predicted rating for a new user for item 'B009CIIWQA': 4.01
Top 3 products recommended for user 'A27L1LDJZVRLJD': ['B00HTXIP4E', 'B00I9ZITRY', 'B00BUME2XS']
Predicted rating for user 'A27L1LDJZVRLJD' for item 'B009CIIWQA': 4.00


### 2. Explanation
I split the simple recommendation system into smaller functions:
##### 1. `get_top_popular_items(amazon, n=3, min_reviews=5)`

* To get the `n` most popular items, I group the ratings by item and calculate each item's average rating and number of ratings, keeping only items with at least `min_reviews` ratings. I then compute a weighted score so the system recommends items with good ratings that also have a considerable number of ratings. This function addresses the cold start problem.
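
To see how the weighted score in `get_top_popular_items` trades rating against popularity, here is a small worked example with made-up numbers. It also computes a damped (Bayesian-style) mean for comparison; that second formula and its constants `prior_mean` and `prior_weight` are assumptions for illustration, not part of the lab code.

```python
import numpy as np

# Made-up item statistics: a niche item with a perfect average and a popular item
avg_rating = np.array([5.0, 4.5])
num_ratings = np.array([5, 500])

# Score used in get_top_popular_items: average rating scaled by log of review count
log_score = avg_rating * (1 + np.log1p(num_ratings))

# Assumed alternative: shrink each mean toward a prior, weighted by review count
prior_mean, prior_weight = 4.0, 20  # hypothetical constants
damped_mean = (num_ratings * avg_rating + prior_weight * prior_mean) / (num_ratings + prior_weight)

print(np.round(log_score, 2))    # the popular item scores higher despite its lower average
print(np.round(damped_mean, 2))  # stays on the 1-5 rating scale
```

With the log-based score, review count can outweigh a sizeable gap in average rating, which is consistent with the most-reviewed item appearing first in the cold-start list.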

##### 2. `get_average_ratings(amazon)`

* This calculates the average rating for each item in the dataset. It provides the baseline for predicting ratings, especially for new users or items with limited ratings.

##### 3. `predict_rating_for_new_user(item, average_ratings, item_mapper)`

* This function predicts the rating a new user would give to a specific item. If the item exists in the dataset, it returns the item's average rating; otherwise it returns `None`. Without any user-specific data, the item's overall average is a reasonable default prediction, on the assumption that a new user will rate close to the average. This is another way of handling the cold start problem.

##### 4. `recommend_for_existing_user(user, amazon, knn_model, item_mapper, item_inverse_mapper, X, k=3)`

* This function uses the KNN model. It finds the item the user rated most recently and returns the `k` items nearest to it, so the recommendation is driven by that single item rather than by the user's full rating history.
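
In the code, `recommend_for_existing_user` looks up neighbors of the user's most recently rated item only. One possible extension, sketched here on made-up data, scores every unrated item by its cosine similarity to all of the user's rated items, weighted by the ratings. The function name `recommend_from_history` and the toy matrix are hypothetical; this is not what the lab code does.

```python
import numpy as np

# Made-up items-by-users ratings (rows are items, columns are users; 0 = no rating)
R = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Cosine similarity between every pair of item rows
unit_rows = R / np.linalg.norm(R, axis=1, keepdims=True)
item_sim = unit_rows @ unit_rows.T

def recommend_from_history(user_col, k):
    """Rank unrated items by rating-weighted similarity to the user's rated items."""
    user_ratings = R[:, user_col]
    rated = user_ratings > 0
    scores = item_sim[:, rated] @ user_ratings[rated]
    scores[rated] = -np.inf  # never recommend something already rated
    ranked = np.argsort(scores)[::-1]
    return [int(i) for i in ranked if not rated[i]][:k]

print(recommend_from_history(user_col=0, k=2))
```

A dense similarity matrix is only practical for a toy example; at the size of the real matrix, the fitted `NearestNeighbors` model would have to stand in for `item_sim`.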


##### 5. `predict_rating_for_existing_user(user, item, amazon, average_ratings)`

* This function predicts the rating an existing user would give to a specific item. Since there is no personalized model, it uses the user's average rating across all the items they have previously rated. The `item` and `average_ratings` arguments are accepted but not used, so the prediction is the same for every item.
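
A user's mean ignores the item entirely, and an item's mean ignores the user. One common way to combine the two, sketched here on made-up data, is a baseline of the global mean plus a user offset plus an item offset. The function name `baseline_prediction` and the clipping to the 1 to 5 range are assumptions for illustration; the lab code does not do this.

```python
import pandas as pd

# Made-up ratings table with the same column names as the lab's DataFrame
toy = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u3"],
    "item": ["a", "b", "a", "c", "b"],
    "rating": [5.0, 4.0, 3.0, 2.0, 1.0],
})

global_mean = toy["rating"].mean()
user_offset = toy.groupby("user")["rating"].mean() - global_mean
item_offset = toy.groupby("item")["rating"].mean() - global_mean

def baseline_prediction(user, item):
    """Global mean plus user and item offsets; unknown users or items contribute 0."""
    estimate = global_mean + user_offset.get(user, 0.0) + item_offset.get(item, 0.0)
    return float(min(5.0, max(1.0, estimate)))  # keep the estimate on the rating scale

print(baseline_prediction("u1", "c"))        # known user, known item
print(baseline_prediction("new_user", "a"))  # unknown user: falls back to the item's mean
```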



### 3. Answer the following questions
- **What are the three top products you would recommend to a new user with no rating or purchasing history (the "cold start" problem)?**

    The top three products that would be recommended to a new user are **Audio Technica ATH-M50 Pro Headphones (B000ULAP4U), Unknown (B003VWJ2K8), DOC MARTIN: SERIES 6 (B00FPPQYXM)**

- **What rating do you think a new user would give to item "B009CIIWQA" (a rechargeable Music Stand LED lamp, at https://www.amazon.com/dp/B009CIIWQA)?**

    A new user is predicted to give a rating of **4.01** (about 4.0), which is the item's average rating.
    
- **What are the top three products you would recommend to user "A27L1LDJZVRLJD"?**

    The three products recommended to user "A27L1LDJZVRLJD" are **Fashionable Fabric Tenor Trombone Gig Bag Backpack Case Purplish Red (B00HTXIP4E)**, **Unknown Item (B00I9ZITRY)**, and **Dunlop Tortex Standard, 0.73mm, Yellow Guitar Pick, 72 Pack (B00BUME2XS)**

- **What rating do you think user "A27L1LDJZVRLJD" would give to the LED music lamp, item "B009CIIWQA"?**

    The predicted rating for user "A27L1LDJZVRLJD" for item "B009CIIWQA" is **4.0**, the mean of that user's existing ratings.

### Citation
ChatGPT: https://chatgpt.com/