Idea: supplement a traditional recommendation system with an LLM that generates a customized recommendation tip for each recommendation. 

Demonstration: the user either picks a specific node in the network or a random one. We generate info about this user, then generate the product recommendations, each with a tip (top-k times). If we are ambitious, we could set up a user-friendly front-end demo, e.g. as an app. 

TODO: 

- build graph  - [x] 
- recommendation algorithm, which must return for a User u: - [x]
    - top-k recommended products p1, p2 etc.
    - N set of users v most similar to u
- From this we find, for each  
    - Product pi and User ui in N:
        - M set of the top-l highest-rated and most helpful reviews of pi by ui
- O set of u's highest-rated reviews of products q, r etc. that were also rated highly by the M set users, i.e. the products on which the similarity is based. This gives the LLM (a) info about which products u likes and (b) u's writing style. 
- Prompt packaging: 
    - We generate a prompt that contains the following information:
        1. Description of task. 
        2. M set sorted by rating and helpfulness. 
        3. O set sorted by rating. 
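
The prompt packaging step above could be sketched roughly as follows. This is only an illustration: `build_prompt`, the section titles, and the toy reviews are hypothetical, and it assumes each review is a dict whose `vote` has already been converted to a number.

```python
def build_prompt(task, m_reviews, o_reviews):
    """Assemble a prompt from a task description, the M set and the O set."""
    # M set: sort by rating, then by helpfulness votes, both descending
    m_sorted = sorted(m_reviews, key=lambda r: (r["overall"], r["vote"]), reverse=True)
    # O set: sort by rating, descending
    o_sorted = sorted(o_reviews, key=lambda r: r["overall"], reverse=True)

    lines = ["TASK", task, "", "REVIEWS OF THE RECOMMENDED PRODUCT BY SIMILAR USERS"]
    lines += [f"- ({r['overall']} stars, {r['vote']} votes) {r['reviewText']}" for r in m_sorted]
    lines += ["", "THE USER'S OWN REVIEWS"]
    lines += [f"- ({r['overall']} stars) {r['reviewText']}" for r in o_sorted]
    return "\n".join(lines)


# Toy data (made up for illustration, not taken from the dataset)
m_set = [
    {"overall": 4.0, "vote": 7, "reviewText": "Solid strap for the price."},
    {"overall": 5.0, "vote": 2, "reviewText": "Holds tune well."},
    {"overall": 5.0, "vote": 9, "reviewText": "Great tone, sturdy build."},
]
o_set = [
    {"overall": 4.0, "reviewText": "Decent cable, a bit stiff."},
    {"overall": 5.0, "reviewText": "Love these strings!"},
]
prompt = build_prompt("Write a short recommendation tip for this user.", m_set, o_set)
print(prompt)
```

Keeping the three parts in fixed, labelled sections makes it easy to check by eye what the LLM actually receives.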



# 0. Functions and Global Directories

In [7]:
import os 
import gzip 
import json
import networkx as nx 
import numpy as np

cwd = os.getcwd()
project_dir = os.path.dirname(cwd)
print(project_dir) 
data_path = os.path.join(project_dir, "data", "Musical_Instruments_5.json.gz") 
print(os.path.exists(data_path))

c:\Users\bened\DataScience\Autumn 2025\SINA\assignments\3
True


In [5]:
def parse(path): 
    # Stream one review dict per line of the gzipped JSON-lines file
    with gzip.open(path, 'rt', encoding='utf-8') as g: 
        for line in g:
            yield json.loads(line) 

In [21]:
def create_graph(path): 
    G = nx.DiGraph() 
    for idx, review in enumerate(parse(path)): 
        # -- EXTRACT NODE IDs AND ATTRIBUTES --- 
        user_id = review['reviewerID'] 
        user_name = review.get('reviewerName', None) 
        product_id = review['asin'] 

        # --- USER NODE --- 
        if G.has_node(user_id): 
            # Node exists; check names 
            current_name = G.nodes[user_id].get('reviewerName') 

            if user_name:  # if the dict has a reviewerName (not None) 
                if not current_name:  # and the node corresponding to that reviewerID does not have a name already
                    G.nodes[user_id]['reviewerName'] = user_name  # name reviewer 
                # else if there's a conflict between node's current name and name on this review
                elif user_name != current_name: 
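                    # add an additional name attribute, unless this name is already stored
                    name_keys = [k for k in G.nodes[user_id] if 'Name' in k]  # name attributes the node already has
                    if user_name not in (G.nodes[user_id][k] for k in name_keys): 
                        count = len(name_keys) + 1  # ordinal of the new name (2 for the first conflict)
                        key = "secondName" if count == 2 else f"{count}thName" 
                        G.nodes[user_id][key] = user_name 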
                # if user_name is None we do nothing
        # if no existing node for this user_id, create a new one 
        else: 
            G.add_node(user_id, bipartite='user') 
            if user_name: 
                G.nodes[user_id]['reviewerName'] = user_name 
        
        # --- PRODUCT NODES --- 
        if not G.has_node(product_id): 
            G.add_node(
                product_id, 
                bipartite='product',
                image=review.get('image', None),
                style=review.get('style', None)
                ) 
        
        # --- REVIEW EDGES (User-Review->Product) --- 
        G.add_edge(
            user_id, 
            product_id, 
            type='Reviewed',
            id=idx, 
            overall=review.get('overall', np.nan),
            verified=review.get('verified', False),
            vote=review.get('vote', np.nan),
            reviewText=review.get('reviewText', None), 
            summary=review.get('summary', None), 
            reviewTime=review.get('reviewTime', None), 
            unixReviewTime=review.get('unixReviewTime', np.nan)
        )

    return G

# 1. Data Inspection

In [6]:
# inspect possible keys 
keys = set() 

for review in parse(data_path): 
    for k in review.keys(): 
        if k not in keys: 
            keys.add(k) 

print(keys) 

{'overall', 'verified', 'asin', 'reviewerID', 'image', 'reviewText', 'style', 'reviewTime', 'summary', 'unixReviewTime', 'reviewerName', 'vote'}


In [13]:
# inspect reviewTime field 
for idx, review in enumerate(parse(data_path)): 
    if idx==10: 
        break 
    elif 'reviewTime' in review: 
        print(review['reviewTime'])

10 30, 2016
06 30, 2016
05 9, 2016
04 10, 2016
02 6, 2016
01 2, 2016
06 21, 2015
03 28, 2015
03 19, 2015
03 15, 2015


Format: MM DD, YYYY (the day is not zero-padded).  

Since we already have the Unix timestamp (unixReviewTime), reviewTime is likely not needed for any computational task. 
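
Should the string field ever be needed, a minimal sketch of parsing it is shown below. The helper names are hypothetical, and the timestamp in the usage lines is an illustrative value, not one read from the dataset.

```python
from datetime import datetime, timezone


def parse_review_time(text):
    """Parse a reviewTime string such as '05 9, 2016' into a date."""
    # %d also accepts a day without a leading zero
    return datetime.strptime(text, "%m %d, %Y").date()


def unix_to_date(ts):
    """Convert a Unix timestamp in seconds to a UTC calendar date."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).date()


print(parse_review_time("05 9, 2016"))
print(unix_to_date(1462752000))  # illustrative timestamp (midnight UTC)
```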

In [11]:
# style field 
styles = {}

for review in parse(data_path): 
    if 'style' in review: 
        for k, v in review['style'].items(): 
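            # add v as a single element; set(v) would split the string into characters
            # (the cause of the single-character entries in the recorded output below)
            styles.setdefault(k, set()).add(v) 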

for k, v in styles.items(): 
    print(f"{k}: {v}")

Format:: {'u', 'p', ' DVD-ROM', ' Wireless Phone Accessory', ' CD-ROM', ' ', 'i', 'S', '.', 's', 'e', ' Electronics', ' Misc. Supplies', ' Audio CD', ' Software Download', 'l', 'M', 'c', ' Paperback'}
style:: {' Latching Right Angle', ' HPACA1 (Coiled)', ' SV100-WA', ' Arrows', ' Bass', ' Tone Job', ' Wired Mic Desktop Stand', ' US-16x08', ' Channel R', ' Right Angle', ' 25 Keys, Mini', ' Loudspeaker', ' Rockaway Archer V1', ' Sea Machine', ' Boompole', ' Komp. 10 Ult. Upgrade', ' AUD ATR1300', ' CHS-DUO', ' Key 9', ' Built-in Effects', ' Ditto X4', ' Pitchblack Poly Polyphonic Pedal Tuner', ' UM Pro 10 - Discontinued Model (Red)', ' 61 Key', ' ETX-15SP-Cover', ' DeadCat VMP Windshield', ' Sparkle Drive MOD', ' 15-Inch/1,400 Watts', ' R8', ' Wireless Mic Desktop Stand', ' FATRAT', ' RT-0008-00', ' MXL 770', ' Guitar-Gibson', ' Sonar Platinum', ' Animal Overdrive', ' 16-Rack', ' Small Diaphragm Cardioid and Omni', ' iQ6', ' Style III', ' SuperCardioid Dynamic Snare', ' Reignmaker', ' 12

The style field likely contains information relevant to recommendation tips. 

# 2. Graph Build

In [22]:
G = create_graph(data_path)

# 3. Recommendation Mining 

## 3.1. Create Utility Matrix (U) for lookups 

It makes more sense to create the utility matrix U only once, so that we can look values up as needed. 

In [25]:
import pandas as pd 

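def create_utility_matrix(graph): 
    data = [] 
    # iterate over the graph passed in, not the global G 
    for user, product, review in graph.edges(data=True): 
        rating = review.get('overall') 
        # skip edges with a missing rating (stored as None or NaN) 
        if pd.notna(rating): 
            data.append((product, user, rating))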

    df = pd.DataFrame(data, columns=['product_id', 'user_id', 'rating']) 

    utility_matrix = df.pivot(index='product_id', columns='user_id', values='rating') 

    return utility_matrix 
        

In [26]:
U = create_utility_matrix(G)

In [27]:
U.head() 

user_id,A0072193KFP6LUHKEXLT,A0096681Y127OL1H8W3U,A0103849GBVWICKXD4T6,A0279100VZXR9A2495P4,A0600727NK5MAF66IOY5,A0727497OR0PPNFLFPDV,A07936821FOVJO6NP4Q8,A0833006NJW9KRF77ZFY,A0955928C2RRWOWZN7UC,A10044ECXDUVKS,...,AZYCGMFCK9AIM,AZYJTD9J82V5I,AZYP4FQ2L2C4O,AZZ3WYDJ0XNZW,AZZCLFV6V8693,AZZM5MUOG0LRK,AZZT9G4MJFCHD,AZZX23UGJGKTT,AZZZ3LGTCGUZF,AZZZG8PGB1FS0
product_id,,,,,,,,,,,...,,,,,,,,,,
739079891,,,,,,,,,,,...,,,,,,,,,,
786615206,,,,,,,,,,,...,,,,,,,,,,
1480360295,,,,,,,,,,,...,,,,,,,,,,
1928571018,,,,,,,,,,,...,,,,,,,,,,
9792372326,,,,,,,,,,,...,,,,,,,,,,


In [28]:
U.isnull().sum()

user_id
A0072193KFP6LUHKEXLT    10616
A0096681Y127OL1H8W3U    10581
A0103849GBVWICKXD4T6    10615
A0279100VZXR9A2495P4    10615
A0600727NK5MAF66IOY5    10616
                        ...  
AZZM5MUOG0LRK           10612
AZZT9G4MJFCHD           10612
AZZX23UGJGKTT           10613
AZZZ3LGTCGUZF           10615
AZZZG8PGB1FS0           10615
Length: 27530, dtype: int64

In [29]:
print(len(U.columns)) 
print(len(U.index))

27530
10620


<h2> <i> User-User Collaborative Filtering </i> </h2>

<b> Goal </b> 

Given a user u, we want to return a vector of items with top-r highest predicted rating. 

<b> Procedure </b>

Given user u, compute a vector of cosine similarities between u and all other users. 

Then, for each item i that u has not reviewed, generate u's predicted rating for i by: 

1. Finding set N of top-k users most similar to u who have also rated i. Consult the existing similarity vector to do so. 

2. Generate predicted rating based on N set. 

Repeat this for all items not rated by u, then return at least the top-r items by predicted rating. Open question: on exact ties at the cutoff, return all tied items?   

I think it would make sense to build a single lookup for this so we only have to compute it once, i.e. repeat the process for all users up front. 
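
A toy version of the procedure above may make the prediction step concrete. This is a sketch in plain Python on made-up ratings. The names `ratings`, `cosine` and `predict` are illustrative and are not the notebook's functions. Missing ratings are treated as 0 in the similarity, as in section 3.2.

```python
import math

# Toy ratings: user -> {item: rating}; a missing key means "not rated"
ratings = {
    "u": {"a": 5, "b": 3},
    "v": {"a": 4, "b": 3, "c": 4},
    "w": {"a": 1, "c": 2},
    "x": {"b": 5, "c": 5},
}


def cosine(r1, r2):
    """Cosine similarity of two rating dicts, treating missing ratings as 0."""
    dot = sum(r1[i] * r2[i] for i in r1.keys() & r2.keys())
    norm1 = math.sqrt(sum(x * x for x in r1.values()))
    norm2 = math.sqrt(sum(x * x for x in r2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def predict(user, item, k=2):
    """Similarity-weighted average over the k most similar users who rated item."""
    # similarities to every other user who has rated the item
    sims = {v: cosine(ratings[user], ratings[v])
            for v in ratings if v != user and item in ratings[v]}
    # N set: the k most similar of those raters
    top_k = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:k]
    denom = sum(s for _, s in top_k)
    if denom <= 0:
        return None  # no usable neighbours
    return sum(s * ratings[v][item] for v, s in top_k) / denom


p_uc = predict("u", "c")
print(p_uc)
```

Here the two nearest raters of item c are v and w, so the prediction for u lies between their ratings of 2 and 4, pulled towards the more similar user v.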

## 3.2. Create Similarity Matrix (S)

In [32]:
from sklearn.metrics.pairwise import cosine_similarity 
from sklearn.impute import SimpleImputer 

def create_similarity_matrix(UM): 
    # transpose Utility Matrix 
    user_item_matrix = UM.T 
    # Impute missing values 
    imputer = SimpleImputer(strategy='constant', fill_value=0) 
    user_item_filled = imputer.fit_transform(user_item_matrix) 
    # Compute cosine similarity between users 
    user_similarity = cosine_similarity(user_item_filled) 
    # DataFrame for quick lookups 
    user_similarity_df = pd.DataFrame(
        user_similarity, 
        index=user_item_matrix.index, 
        columns=user_item_matrix.index 
    )
    return user_similarity_df 

In [33]:
# call to create the similarity matrix 
S = create_similarity_matrix(U) 

In [None]:
S.head()

## 3.3. Create Prediction Matrix (P) 

In [36]:
import time 

def predict_ratings(utility_matrix, similarity_matrix, k_only=True, k=5): 
    """
    Function to compute a predicted ratings matrix based on a utility matrix and similarity matrix. 
    Args: 
        utility_matrix: pd.DataFrame. Utility matrix of users' ratings of items (rows: products, columns: users). 
        similarity_matrix: pd.DataFrame. User-user similarity matrix computed from those ratings. 
        k_only: bool, default True. If True, predict from only the top-k most similar users who have rated the item. If False, predict from all users with similarity > 0 who have rated the item. 
        k: int, default 5. Number of most similar users used when k_only is True. 
    Returns: 
        pd.DataFrame. Copy of utility_matrix with missing ratings replaced by predictions wherever one could be computed. 
    """
    # matrix to store predicted ratings
    prediction_matrix = utility_matrix.copy() 
    start = time.time() 
    # iterate through users
    for idx, user in enumerate(utility_matrix.columns): 
        # get similarity vector for this user 
        user_sim = similarity_matrix[user] 
        # iterate through products
        for product in utility_matrix.index: 
            # skip already rated items 
            if not pd.isna(utility_matrix.at[product, user]): 
                continue 
            # Find users who rated this product 
            raters = utility_matrix.loc[product].dropna() 
            if raters.empty: 
                continue 
            # Get similarities of those users to current user 
            similarities = user_sim[raters.index] 
            if k_only: 
                # Select top-k most similar users
                N_sim = similarities.sort_values(ascending=False).head(k) 
            else: 
                # select all users with similarity greater than 0 
                N_sim = similarities[similarities > 0] 
            # Filter to users who rated the product 
            N_ratings = raters[N_sim.index] 
            # Compute weighted average of ratings
            numerator = (N_sim * N_ratings).sum() 
            denominator = N_sim.sum() 

            if denominator > 0: 
                r_hat = numerator / denominator 
                prediction_matrix.at[product, user] = r_hat 
        
        # predict time to generate the matrix 
        if idx == 0: 
            user_time = time.time() - start 
            remaining_users = len(similarity_matrix.index) - 1 
            predicted_latency = remaining_users * user_time 
            print("Seconds to compute for first user:", user_time) 
            print("Minutes predicted to compute matrix:", predicted_latency // 60)

    return prediction_matrix 


Create the prediction matrix. 

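There is a serious latency issue that is not yet resolved: the nested loop visits every (user, product) pair, i.e. 27530 × 10620 ≈ 292 million iterations. It is not worth executing until this is resolved.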
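
One possible way around the nested loop, for the `k_only=False` case only, is to express the weighted average as two matrix products. The sketch below is an assumption, not a tested replacement for `predict_ratings`. `predict_all` is a hypothetical name, it takes plain NumPy arrays in users × items orientation (i.e. the transpose of U), and it relies on the similarities being non-negative, which holds for cosine similarity on non-negative ratings.

```python
import numpy as np


def predict_all(R, S):
    """Vectorised similarity-weighted prediction.

    R: (n_users, n_items) ratings with np.nan for missing entries.
    S: (n_users, n_users) non-negative user-user similarities.
    Returns an array like R with missing entries filled where possible.
    """
    rated = ~np.isnan(R)                 # mask of observed ratings
    R0 = np.where(rated, R, 0.0)         # missing ratings contribute 0
    numer = S @ R0                       # sum over raters v of sim(u, v) * r(v, i)
    denom = S @ rated.astype(float)      # sum over raters v of sim(u, v)
    with np.errstate(divide="ignore", invalid="ignore"):
        pred = np.where(denom > 0, numer / denom, np.nan)
    # keep actual ratings; a user never counts as a rater of their own missing items
    return np.where(rated, R, pred)


# Toy data: 4 users x 3 items (made up for illustration)
R = np.array([[5, 3, np.nan],
              [4, 3, 4],
              [1, np.nan, 2],
              [np.nan, 5, 5]], dtype=float)
R0 = np.nan_to_num(R)
norms = np.linalg.norm(R0, axis=1)
S = (R0 @ R0.T) / np.outer(norms, norms)  # cosine similarity, missing treated as 0
P = predict_all(R, S)
print(P)
```

The top-k variant does not reduce to a plain matrix product, because the neighbour set differs per (user, item) pair, so it would need a different approach.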

In [35]:
P = predict_ratings(U, S, k_only=False) 

KeyboardInterrupt: 

Save as CSV to avoid recomputing

In [None]:
path = os.path.join(project_dir, "data", "Prediction_Matrix.csv") 
P.to_csv(path)

Reload Prediction Matrix

In [None]:
path = os.path.join(project_dir, "data", "Prediction_Matrix.csv") 
P = pd.read_csv(path, index_col=0)  # the first CSV column holds the product_id index 

## 3.4. Top-r recommendations lookup

In [None]:
def get_top_recs(prediction_matrix, utility_matrix, users=None, r=5): 
    """Return {user_id: [product_id, ...]} with each user's top-r unrated items by predicted rating."""
    # default to all users when no subset is given 
    if users is None: 
        users = prediction_matrix.columns 
    recommendations = {} 
    for user in users: 
        # Filter for items not actually rated by user; drop items without a prediction 
        user_preds = prediction_matrix[user][utility_matrix[user].isna()].dropna() 
        # sort by predicted rating 
        top_r_items = user_preds.sort_values(ascending=False).head(r) 
        recommendations[user] = top_r_items.index.tolist() 
    return recommendations

recs = get_top_recs(P, U) 

# 4. LLM Prompting

Given some user u identified by their user_id and k desired recommendations, we want to do the following: 

1. Get the product_ids (p, q etc.) of those k products from the recs lookup. 
2. For each recommended product p, find the N set of users who have rated p and are most similar to u. 
3. Review samples for p: 
    a. If N is too small, say fewer than three users, also find the highest-rated reviews of p with the most votes.  
    b. Regardless of the size of N, find the review of p with the highest rating and the most votes. 
4. Review sample for u: find user u's review with the highest rating. 
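
Step 3 could look roughly like the sketch below. It works on a plain list of review dicts rather than on the graph. `top_reviews`, `review_samples` and the `min_similar` threshold are hypothetical names, `vote` is assumed to be numeric already, and topping the sample up to exactly `min_similar` reviews is one possible reading of step 3a.

```python
def top_reviews(reviews, n=1):
    """Return the n reviews with the highest rating, ties broken by most votes."""
    return sorted(reviews, key=lambda r: (r["overall"], r["vote"]), reverse=True)[:n]


def review_samples(product_reviews, similar_users, min_similar=3):
    """Pick reviews of one product p for the prompt.

    product_reviews: list of dicts with 'user', 'overall', 'vote', 'reviewText'.
    similar_users: the N set, i.e. ids of the users most similar to u.
    """
    # reviews of p written by users in the N set
    samples = [r for r in product_reviews if r["user"] in similar_users]
    # step 3a: too few similar reviewers, so top up with highly rated, highly voted reviews
    if len(samples) < min_similar:
        rest = [r for r in product_reviews if r not in samples]
        samples += top_reviews(rest, n=min_similar - len(samples))
    # step 3b: always include the single best review of p
    best = top_reviews(product_reviews, n=1)
    samples += [r for r in best if r not in samples]
    return samples


# Toy reviews of one product (made up for illustration)
reviews_p = [
    {"user": "v1", "overall": 4.0, "vote": 0, "reviewText": "Does the job."},
    {"user": "v2", "overall": 5.0, "vote": 12, "reviewText": "Best pedal I own."},
    {"user": "v3", "overall": 5.0, "vote": 3, "reviewText": "Very happy with it."},
    {"user": "v4", "overall": 2.0, "vote": 8, "reviewText": "Broke after a month."},
]
picked = review_samples(reviews_p, similar_users={"v1"})
print([r["user"] for r in picked])
```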
    

In [None]:
def find_prompt_info(user_id, graph=G, top_p=1, similarity_matrix=S, recommendations_lookup=recs): 
    recommended_products = recommendations_lookup[user_id][:top_p] 
    # get similarity vector for this user 
    sim_vector = similarity_matrix[user_id]  
    