## Lab 7: Recommendations 

In this lab, we apply what we did with movies to Amazon reviews. The dataset comes from https://jmcauley.ucsd.edu/data/amazon/ and is a subset of a larger dataset, containing only reviews of musical instruments.

Open the dataset (after downloading it from the Learning Hub), and answer some questions!


*I've put in some code (from the movies file), but you can change whatever you'd like. This code won't run as is; you need to finish it.*


In [199]:
##load some packages

import random
from traceback import print_tb

import pandas
import matplotlib.pyplot as plt
import numpy as np

## this will optimize our math
from scipy.sparse import csr_matrix as sparse_matrix

from sklearn.neighbors import NearestNeighbors


import os

## You can add more here, if you need it!


In [200]:
## read in the data set
amazon = pandas.read_csv("data/ratings_Musical_Instruments.csv",names=("user","item","rating","timestamp"))
amazon.head()
## we  don't need to read in everything


| | user | item | rating | timestamp |
|---|---|---|---|---|
| 0 | A1YS9MDZP93857 | 6428320 | 3.0 | 1394496000 |
| 1 | A3TS466QBAWB9D | 14072149 | 5.0 | 1370476800 |
| 2 | A3BUDYITWUSIS7 | 41291905 | 5.0 | 1381708800 |
| 3 | A19K10Z0D2NTZK | 41913574 | 5.0 | 1285200000 |
| 4 | A14X336IB4JD89 | 201891859 | 1.0 | 1350432000 |


In [201]:
## How big is dataset
rows, columns = amazon.shape
print(f"{rows} rows and {columns} columns")

## How many unique users are there
## (single quotes inside the f-string keep this valid before Python 3.12)
print(f"Unique users: {amazon['user'].nunique()}")

## How many unique items are there
print(f"Unique items: {amazon['item'].nunique()}")

## User with most ratings rates how many items
max_ratings = amazon["user"].value_counts().max()
print(f"Max ratings by a user: {max_ratings}")

## Item with most rating
most_ratings_items = amazon["item"].value_counts()
most_rated_item = most_ratings_items.index[0]
num_most_rated = most_ratings_items.iloc[0]
print(f"Most rated item: {most_rated_item}")
print(f"Number of ratings for most rated item: {num_most_rated}")

## Highest and lowest
avg_ratings_per_item = amazon.groupby("item")["rating"].mean()
highest_rated_item = avg_ratings_per_item.idxmax()
highest_avg_rating = avg_ratings_per_item.max()
print(f"Highest rated item: {highest_rated_item}")
print(f"Highest average rating: {highest_avg_rating}")

lowest_rated_item = avg_ratings_per_item.idxmin()
lowest_avg_rating = avg_ratings_per_item.min()
print(f"Lowest rated item: {lowest_rated_item}")
print(f"Lowest average rating: {lowest_avg_rating}")

## How large would the matrix be
matrix_rows = amazon["user"].nunique()
matrix_columns = amazon["item"].nunique()
total_entries = matrix_rows * matrix_columns
print(f"The matrix would have {matrix_rows} rows and {matrix_columns} columns, with {total_entries} entries")

## Non-zero entries
non_zero_entries = len(amazon)
non_zero_percentage = (non_zero_entries / total_entries) * 100
print(f"The percentage of non-zero entries is {non_zero_percentage}%")

500176 rows and 4 columns
Unique users: 339231
Unique items: 83046
Max ratings by a user: 483
Most rated item: B000ULAP4U
Number of ratings for most rated item: 3523
Highest rated item: 0014072149
Highest average rating: 5.0
Lowest rated item: 0201891859
Lowest average rating: 1.0
The matrix would have 339231 rows and 83046 columns, with 28171777626 entries
The percentage of non-zero entries is 0.0017754506181334572%


### Part 1: Descriptive Questions

* How big is the dataset, in rows and columns? **500176 rows and 4 columns**
* How many unique users are there? **339231 unique users**
* How many unique items are there? **83046 unique items**
* The user who rated the most instruments has rated how many items? **483 items**
* The item with the MOST ratings is what? How many ratings does it have?  Hint: Check out the Amazon website, by going to "www.amazon.com/dp/item_code", where you put the item into item_code. **B000ULAP4U (Audio Technica ATH-M50 Pro Headphones) with 3523 ratings**
* The item with the highest mean average rating is what? What is the rating? **0014072149 (Bach, J.S. - Double Concerto in d minor BWV 1043 for Two Violins and Piano) with a 5.0 rating.** Note that `idxmax` returns only the first item that reaches the maximum mean, so other items may tie at 5.0.
* What is the item with the lowest mean average? What is the rating? **0201891859 (Chopin: Nocturnes) with a 1.0 rating.** Likewise, `idxmin` returns only the first item at the minimum, so other items may tie at 1.0.
* If we built a matrix with all of the users and items, how large would it be (dimensions, and how many total entries)? **The matrix would have 339231 rows and 83046 columns, with 28171777626 entries**
* Looking at the size of the dataset, what is the percentage of non-zero entries in the matrix? **About 0.0018% (0.0017754506181334572%) of the entries are non-zero**
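
Because an item with a single 5-star review tops a plain mean ranking, it can help to require a minimum number of ratings before comparing averages. The following is a minimal sketch on toy data, not the lab dataset; the function name `top_rated_items` and the threshold `min_ratings` are assumptions made for illustration.

```python
import pandas as pd

# Toy ratings in the same (user, item, rating) layout as the lab data.
toy = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u4", "u5", "u6"],
    "item": ["A", "B", "B", "B", "C", "C"],
    "rating": [5.0, 5.0, 4.0, 5.0, 1.0, 2.0],
})

def top_rated_items(ratings, min_ratings=2):
    """Rank items by mean rating, ignoring items with fewer than min_ratings ratings.

    min_ratings is a hypothetical threshold chosen for illustration.
    """
    stats = ratings.groupby("item")["rating"].agg(["mean", "count"])
    eligible = stats[stats["count"] >= min_ratings]
    # Highest mean first; the rating count breaks ties between equal means.
    return eligible.sort_values(["mean", "count"], ascending=False)

ranked = top_rated_items(toy, min_ratings=2)
print(ranked)
```

Item `A` has a perfect mean but only one rating, so it drops out of the ranking once the threshold is applied.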

In [202]:
## Here we are making the same sparse matrix that we made in class
## Look it over, but don't worry overly

def create_X(ratings,n,d,user_key="user",item_key="item"):
    """
    This code takes a dataset and makes it a Sparse matrix, with some baby functions attached
    """
    user_mapper = dict(zip(np.unique(ratings[user_key]), list(range(d))))
    item_mapper = dict(zip(np.unique(ratings[item_key]), list(range(n))))

    user_inverse_mapper = dict(zip(list(range(d)), np.unique(ratings[user_key])))
    item_inverse_mapper = dict(zip(list(range(n)), np.unique(ratings[item_key])))

    user_ind = [user_mapper[i] for i in ratings[user_key]]
    item_ind = [item_mapper[i] for i in ratings[item_key]]

    X = sparse_matrix((ratings["rating"], (item_ind, user_ind)), shape=(n,d))

    return X, user_mapper, item_mapper, user_inverse_mapper, item_inverse_mapper, user_ind, item_ind



In [203]:
## define X in a new cell
## n and d must match the data: create_X fails if they are too small
n = amazon["item"].nunique()  # number of unique items (rows of X)
d = amazon["user"].nunique()  # number of unique users (columns of X)
X, user_mapper, item_mapper, user_inverse_mapper, item_inverse_mapper, user_ind, item_ind = create_X(amazon, n=n, d=d)
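
To see what `create_X` produces, it may help to build the same kind of item-by-user matrix on a few rows. This is a sketch on toy data, assuming `pd.factorize` as a stand-in for the mapper dictionaries; the names `toy` and `X_toy` are made up for the example.

```python
import pandas as pd
from scipy.sparse import csr_matrix

toy = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u3"],
    "item": ["A", "B", "B", "C"],
    "rating": [5.0, 3.0, 4.0, 1.0],
})

# pd.factorize assigns each distinct label an integer code, which plays the
# same role as the user_mapper / item_mapper dictionaries in create_X.
user_codes, user_labels = pd.factorize(toy["user"], sort=True)
item_codes, item_labels = pd.factorize(toy["item"], sort=True)

n_items, n_users = len(item_labels), len(user_labels)

# Rows are items and columns are users, matching shape=(n, d) in create_X.
# If a (user, item) pair appeared twice, csr_matrix would sum the ratings.
X_toy = csr_matrix(
    (toy["rating"].to_numpy(), (item_codes, user_codes)),
    shape=(n_items, n_users),
)

density = X_toy.nnz / (n_items * n_users)
print(X_toy.toarray())
print(f"{X_toy.nnz} stored ratings, density {density:.2%}")
```

Only the four observed ratings are stored; every other cell is an implicit zero, which is what makes the full 339231 by 83046 matrix practical to hold in memory.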

## Part 2: Try some Machine Learning

Thinking of the code from our movie exploration in class, build a BASIC recommender for similar items.  You can do ANY of the following, but you need to choose at least 1:
* Use KNN with at least two different iterations to find 5 items that are close to a specific item.  In class, my algorithm always returned the original movie in the list - can you modify the code so this doesn't happen?
* Use *any type of clustering* to cluster items OR users into groups that are similar.
* Can you decompose the matrix via PCA?  How can this help us recommend?
* Do some outside research, and make something!
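
For the decomposition option, one possible route is scikit-learn's `TruncatedSVD`, a close relative of PCA that accepts sparse input directly and does not center the data. The sketch below uses a synthetic random matrix in place of `X`, and the choice of 10 components is an arbitrary assumption. Each item becomes a short dense vector, and neighbors are then searched in that reduced space.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

# Synthetic item-by-user matrix standing in for X (values are not real ratings).
X_demo = sparse_random(50, 200, density=0.05, format="csr", random_state=0)

# Compress each item's 200-dimensional rating vector into 10 latent factors.
svd = TruncatedSVD(n_components=10, random_state=0)
item_factors = svd.fit_transform(X_demo)

# Search for similar items in the reduced space instead of the raw matrix.
knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(item_factors)
distances, indices = knn.kneighbors(item_factors[[0]], n_neighbors=6)

print(item_factors.shape)
print(indices.flatten())
```

Items that are rarely rated by the same users can still end up close together in the latent space, which is one way a decomposition can help with very sparse data.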


In [204]:
## ML goes here
## Applying KNN
knn_model = NearestNeighbors(metric='cosine', algorithm='brute')
knn_model.fit(X)

def find_similar_items(item_name, model, item_mapper, item_inverse_mapper, X, k=10):
    try:
        item_index = item_mapper[item_name]
    except KeyError:
        return f"Item '{item_name}' not found."

    distances, indices = model.kneighbors(X[item_index].reshape(1, -1), n_neighbors=k + 1)# +1 to exclude self
    similar_items = [item_inverse_mapper[int(i)] for i in indices.flatten()[1:]]  # Exclude the first one (itself)
    return similar_items

# 3. Create a recommendation function and test it
test_item = amazon['item'].iloc[0]
similar_items = find_similar_items(test_item, knn_model, item_mapper, item_inverse_mapper, X, k=5)
print(f"Similar items to '{test_item}': {similar_items}")

Similar items to '0006428320': ['B00021SUHC', 'B004I0CDM6', 'B0002TNGYG', 'B002Q0WT6U', 'B009CIIWQA']
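
Dropping the first neighbor assumes the query item is always returned first. When another item has an identical rating vector, both sit at cosine distance 0 and the order between them is not guaranteed. A minimal sketch of filtering by index instead, on a toy matrix; `neighbors_excluding_self` is a hypothetical helper, not part of the lab code.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Rows are items, columns are users; rows 0 and 1 are identical on purpose.
X_toy = csr_matrix(np.array([
    [5.0, 0.0, 3.0, 0.0],
    [5.0, 0.0, 3.0, 0.0],
    [4.0, 0.0, 0.0, 1.0],
    [0.0, 2.0, 0.0, 5.0],
]))

model = NearestNeighbors(metric="cosine", algorithm="brute").fit(X_toy)

def neighbors_excluding_self(model, X, item_index, k):
    """Return up to k neighbor row indices, dropping item_index wherever it appears."""
    # Ask for one extra neighbor, but never more than the number of rows.
    n_neighbors = min(k + 1, X.shape[0])
    _, indices = model.kneighbors(X[item_index], n_neighbors=n_neighbors)
    return [int(i) for i in indices.flatten() if i != item_index][:k]

print(neighbors_excluding_self(model, X_toy, item_index=1, k=2))
```

The query is removed by comparing indices, so the result is the same whichever of the two identical rows the model happens to list first.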


Using your work above, do the following:

1. Create *some sort* of recommendation system,
2. EXPLAIN what you have done and why
3. Answer the following questions:
    * What are the three top products you would recommend to a new user with no rating or purchasing history (the "cold start" problem)?
    * What rating do you think a new user would give to item "B009CIIWQA" (a rechargeable Music Stand LED lamp, at https://www.amazon.com/dp/B009CIIWQA)?
    * What are the top three products you would recommend to user "A27L1LDJZVRLJD"?
    * What rating do you think user "A27L1LDJZVRLJD" would give to the LED music lamp, item "B009CIIWQA"?

### 1. Create some sort of recommendation system

In [205]:
## Recommend New Users
def recommend_for_new_user(data, top_n=3):
    top_items = (
        data.groupby('item')
        .agg(avg_rating=('rating', 'mean'), rating_count=('rating', 'count'))
        .sort_values(['rating_count', 'avg_rating'], ascending=False)
        .head(top_n)
    )
    return top_items.index.tolist()
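
Sorting by rating count first favors popular items regardless of how well they are rated. One common alternative is a damped mean, which pulls each item's average toward the global average until the item has enough ratings. This is a sketch on toy data; the function name and the `damping` value of 5 are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "item": ["A", "B", "B", "B", "B", "C", "C"],
    "rating": [5.0, 4.0, 5.0, 4.0, 5.0, 2.0, 3.0],
})

def damped_mean_ranking(ratings, damping=5.0):
    """Rank items by (sum + damping * global_mean) / (count + damping).

    damping acts like that many imaginary ratings at the global mean;
    the value used here is hypothetical.
    """
    global_mean = ratings["rating"].mean()
    stats = ratings.groupby("item")["rating"].agg(["sum", "count"])
    stats["score"] = (stats["sum"] + damping * global_mean) / (stats["count"] + damping)
    return stats.sort_values("score", ascending=False)

ranking = damped_mean_ranking(toy)
print(ranking)
```

Item `A` has the highest raw mean but a single rating, so the damped score places the better-supported item `B` ahead of it.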

In [206]:
## Average rating for new user for an item
def avg_item_rating(item_id):
    item_ratings = amazon[amazon["item"] == item_id]["rating"]
    if len(item_ratings) == 0:
        return "Item not found."
    return round(item_ratings.mean(), 2)

In [207]:
## Recommend for user
def recommend_for_user(user_id, X, user_mapper, item_inverse_mapper, item_mapper, model, k=5):
    if user_id not in user_mapper:
        return "User not found."

    user_idx = user_mapper[user_id]
    user_ratings = X[:, user_idx].toarray().flatten()
    rated_items_idx = np.where(user_ratings > 0)[0]
    scores = {}

    for item_idx in rated_items_idx:
        distances, indices = model.kneighbors(X[item_idx], n_neighbors=k + 1)
        for i in indices.flatten()[1:]:
            if i not in rated_items_idx:
                scores[i] = scores.get(i, 0) + 1

    top_items_idx = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:3]
    return [item_inverse_mapper[i] for i, _ in top_items_idx]

In [208]:
## Predict rating for a user and item
def predict_rating(user_id, item_id, X, user_mapper, item_mapper, item_inverse_mapper, model, k=5):
    try:
        user_idx = user_mapper[user_id]
        item_idx = item_mapper[item_id]
    except KeyError:
        return "User or item not found."

    distances, indices = model.kneighbors(X[item_idx], n_neighbors=k + 1)
    indices = indices.flatten()[1:]

    user_ratings = X[:, user_idx].toarray().flatten()
    neighbor_ratings = [user_ratings[i] for i in indices if user_ratings[i] > 0]

    if not neighbor_ratings:
        return "Not enough info to predict rating."
    return round(np.mean(neighbor_ratings), 2)

In [209]:
# Top 3 products for a new user
print("Top 3 recommendations for a new user:")
print(recommend_for_new_user(amazon))

# New user rating for item "B009CIIWQA"
print("\nEstimated rating for new user for item B009CIIWQA:")
print(avg_item_rating("B009CIIWQA"))

# Top 3 recommendations for user "A27L1LDJZVRLJD"
print("\nTop 3 recommendations for user A27L1LDJZVRLJD:")
print(recommend_for_user("A27L1LDJZVRLJD", X, user_mapper, item_inverse_mapper, item_mapper, knn_model))

# Predicted rating for item "B009CIIWQA" by user "A27L1LDJZVRLJD"
print("\nPredicted rating for LED lamp by A27L1LDJZVRLJD:")
print(predict_rating("A27L1LDJZVRLJD", "B009CIIWQA", X, user_mapper, item_mapper, item_inverse_mapper, knn_model))

Top 3 recommendations for a new user:
['B000ULAP4U', 'B003VWJ2K8', 'B003VWKPHC']

Estimated rating for new user for item B009CIIWQA:
4.01

Top 3 recommendations for user A27L1LDJZVRLJD:
['B00HTXIP4E', 'B00BUME2XS', 'B00I9ZITRY']

Predicted rating for LED lamp by A27L1LDJZVRLJD:
Not enough info to predict rating.


### 2. Explanation
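1. Using the provided `create_X` function, I built a sparse item-by-user matrix, passing the number of unique items as `n` and the number of unique users as `d`. Storing only the observed ratings keeps the matrix small enough to run similarity comparisons.
2. I used a KNN model (`NearestNeighbors` with cosine distance) for this simple recommendation system and fit it on `X`, the matrix returned by `create_X`.
3. To handle the cold start, I recommend the most popular items, ranked by number of ratings with the average rating as a tiebreaker. For a new user's expected rating of a specific item, I use that item's mean rating.
4. To recommend items to an existing user, I take each item the user has rated, find its nearest neighbors, and rank the unrated items by how often they appear as a neighbor. To predict a user's rating for a specific item, I find the items most similar to it and average the ratings the user gave to those neighbors.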
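
The prediction step averages the user's ratings on neighboring items with equal weight. A possible refinement is to weight each neighbor by its cosine similarity (1 minus the cosine distance that `kneighbors` returns). The sketch below is standalone and takes plain lists; `weighted_prediction` is a hypothetical helper, not part of the code above.

```python
import numpy as np

def weighted_prediction(neighbor_distances, neighbor_ratings):
    """Similarity-weighted mean of the ratings the user gave to neighboring items.

    neighbor_distances: cosine distances from the target item to each neighbor.
    neighbor_ratings: the user's rating for each neighbor, 0 meaning "not rated".
    Returns None when the user has rated none of the neighbors.
    """
    distances = np.asarray(neighbor_distances, dtype=float)
    ratings = np.asarray(neighbor_ratings, dtype=float)
    rated = ratings > 0
    weights = (1.0 - distances)[rated]  # cosine similarity of the rated neighbors
    if not rated.any() or weights.sum() <= 0:
        return None
    return float(np.average(ratings[rated], weights=weights))

# The closest neighbor (distance 0.1) was rated 5, a distant one (0.8) was rated 3.
print(weighted_prediction([0.1, 0.5, 0.8], [5.0, 0.0, 3.0]))
```

The closer neighbor dominates the estimate, so the prediction lands nearer to 5 than a plain average of 5 and 3 would.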

### 3. Answer the following questions
- What are the three top products you would recommend to a new user with no rating or purchasing history (the "cold start" problem).
 The top three products that would be recommended to a new user are **Audio Technica ATH-M50 Pro Headphones (B000ULAP4U)**, **Unknown Item (B003VWJ2K8)**, and **Snark ST-2 All Instrument Clip-On Chromatic Tuner (B003VWKPHC)**.

- What rating do you think a new user would give to item "B009CIIWQA" (a rechargeable Music Stand LED lamp, at https://www.amazon.com/dp/B009CIIWQA)?
A new user would give a rating of about **4.01**, which is the item's mean rating across all existing reviews.
- What are the top three products you would recommend to user "A27L1LDJZVRLJD"?
The three products recommended to user "A27L1LDJZVRLJD" are **Fashionable Fabric Tenor Trombone Gig Bag Backpack Case Purplish Red (B00HTXIP4E)**, **Dunlop Tortex Standard, 0.73mm, Yellow Guitar Pick, 72 Pack (B00BUME2XS)**, and **Unknown Item (B00I9ZITRY)**
- What rating do you think user "A27L1LDJZVRLJD" would give to the LED music lamp, item "B009CIIWQA"?
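According to my code, there is **not enough information to predict** this rating: the user has not rated any of the item's five nearest neighbors, so there are no ratings to average.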
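
When the neighbor-based estimate is unavailable, one option is to fall back to less specific estimates instead of returning no answer. The following is a minimal sketch on toy data, assuming a hypothetical `predict_with_fallback` helper that receives the KNN estimate (or `None`) as an argument.

```python
import pandas as pd

toy = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u3"],
    "item": ["A", "A", "B", "C"],
    "rating": [4.0, 5.0, 2.0, 3.0],
})

def predict_with_fallback(ratings, item_id, neighbor_estimate=None):
    """Return (prediction, source), preferring the most specific estimate available.

    neighbor_estimate is a hypothetical argument: the KNN-based value when one
    exists, or None when the user has rated none of the item's neighbors.
    """
    if neighbor_estimate is not None:
        return round(float(neighbor_estimate), 2), "neighbors"
    item_ratings = ratings.loc[ratings["item"] == item_id, "rating"]
    if len(item_ratings) > 0:
        return round(float(item_ratings.mean()), 2), "item mean"
    # Last resort: the average of every rating in the dataset.
    return round(float(ratings["rating"].mean()), 2), "global mean"

print(predict_with_fallback(toy, "A"))        # no neighbor estimate available
print(predict_with_fallback(toy, "Z"))        # item never rated
print(predict_with_fallback(toy, "A", 3.5))   # neighbor estimate available
```

Returning the source alongside the number makes it clear whether a prediction is personalized or simply the item's overall average.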