# Part 5 - ALS Recommender System (Sampled Data)

## Install Dependencies

This section installs the required libraries: DuckDB for efficient data sampling, pandas for data manipulation, `implicit` for ALS-based recommendation modeling, and scikit-learn for data splitting and evaluation. NumPy and SciPy, which the code below also imports, are pulled in as dependencies of these packages.

In [1]:
!pip install duckdb



In [2]:
!pip install pandas



In [11]:
!pip install implicit


Collecting implicit
  Downloading implicit-0.7.2-cp311-cp311-win_amd64.whl.metadata (6.3 kB)
Downloading implicit-0.7.2-cp311-cp311-win_amd64.whl (750 kB)
Installing collected packages: implicit
Successfully installed implicit-0.7.2


In [1]:
!pip install scikit-learn



## ALS Recommendation System

This section implements an ALS recommender system on a 0.01% sample of the Amazon Reviews dataset, in five stages:

### Data Sampling

DuckDB samples 0.01% of the rows from each of the 34 category-specific Parquet files, yielding 51,477 interactions. Users with fewer than 5 reviews in the sample are then filtered out, leaving 6,472 interactions from 728 users and 5,962 items.
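The within-sample activity filter can be illustrated without any dependencies. The sketch below is a minimal, standard-library version of the idea (the notebook itself uses a pandas `groupby(...).transform('count')`); the function name `filter_active_users` and the toy IDs are hypothetical.

```python
from collections import Counter


def filter_active_users(interactions, min_reviews=5):
    """Keep only rows whose user appears at least `min_reviews` times."""
    counts = Counter(user for user, _item, _rating in interactions)
    return [row for row in interactions if counts[row[0]] >= min_reviews]


# Toy (user_id, product_id, rating) rows: "u1" has 5 reviews, "u2" only 2.
toy = [("u1", f"p{i}", 5.0) for i in range(5)]
toy += [("u2", "p8", 3.0), ("u2", "p9", 4.0)]

kept = filter_active_users(toy, min_reviews=5)
print(f"kept {len(kept)} of {len(toy)} rows")

# Retention reported for the actual run: 6,472 of 51,477 sampled rows.
print(f"notebook retention: {6472 / 51477:.1%}")
```

Because the counts are taken inside the 0.01% sample rather than on the full dataset, a user needs five *sampled* reviews to survive, which is why only about 12.6% of the sampled interactions remain.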

### Data Splitting

The filtered data is split, stratified by user, into 80% training (5,177 interactions) and 20% testing (1,295 interactions).
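The reported split sizes follow from how the 20% share is rounded. A quick check, assuming (as scikit-learn's `train_test_split` does when only `test_size` is given) that the test count is rounded up and the remainder goes to training:

```python
import math

n_interactions = 6472  # interactions left after filtering, from the run above
test_size = 0.20

# Assumption: the test share is rounded up; training gets everything else.
n_test = math.ceil(test_size * n_interactions)
n_train = n_interactions - n_test

print(n_train, n_test)
```

Stratifying on the user index keeps each user's interactions divided in roughly the same 80/20 proportion. It needs at least two rows per user, which the minimum of 5 reviews per user already guarantees.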

### Model Training

An ALS model is trained with the `implicit` library (factors=100, regularization=0.05, iterations=15) on a sparse user-item matrix.
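To make the "sparse user-item matrix" concrete, here is a small dependency-free sketch of the two steps involved: mapping string IDs to contiguous integer indices, then listing the non-zero cells as (row, column, value) triplets. The helper `build_index` and the toy rows are hypothetical; the notebook does the same job with pandas category codes and `scipy.sparse.csr_matrix`.

```python
def build_index(ids):
    """Assign each distinct ID a contiguous integer code, in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(ids)))}


# Hypothetical (user_id, product_id, rating) rows.
rows = [("uB", "p2", 5.0), ("uA", "p1", 4.0), ("uB", "p1", 3.0)]

user_index = build_index(u for u, _, _ in rows)
item_index = build_index(i for _, i, _ in rows)

# (row, column, value) triplets: the coordinate form a sparse matrix is built from.
triplets = [(user_index[u], item_index[i], r) for u, i, r in rows]
print(triplets)

# Upper bound on the training-matrix density in the run above:
# 5,177 training interactions in a 728 x 5,962 matrix.
density = 5177 / (728 * 5962)
print(f"density <= {density:.4%}")
```

With roughly one cell in a thousand filled at most, storing only the non-zero entries is what keeps the matrix small enough to train on comfortably.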

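### Evaluation

The model achieves an RMSE of 4.4865 on the test set. This figure should be read with caution: `implicit` fits an implicit-feedback ALS model, whose dot-product scores estimate preference (here well below 1) rather than 1-5 star ratings. Much of the error therefore likely reflects this scale mismatch, in addition to the sparsity of the data and the small sample size.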
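The effect of comparing star ratings with preference-style scores can be seen in a few lines. This is an illustrative sketch with made-up numbers, not values from the notebook: when the scores sit near zero and the ratings near five, the RMSE lands between 4 and 5 and is barely better than predicting zero for everything.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical values: star ratings vs. preference-style scores near zero.
actual = [5.0, 5.0, 4.0, 5.0]
scores = [0.3, 0.0, 0.2, 0.1]

model_rmse = rmse(actual, scores)
zero_rmse = rmse(actual, [0.0] * len(actual))  # baseline: always predict 0
print(f"scores: {model_rmse:.2f}, all-zero baseline: {zero_rmse:.2f}")
```

For a model of this kind, a ranking metric computed on held-out items (for example precision at k) is usually a more informative check than RMSE against star ratings.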

### Recommendations

Top-5 product recommendations are generated for 3 random users, each with its model score (e.g., Product B07TSG87YD is recommended to User AHKLCGILATXKYSYAZROMECV4DPLQ with a score of 0.361).
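Conceptually, a top-N recommendation from a factor model is a ranking of unseen items by the dot product of the user's factor vector with each item's factor vector. The sketch below shows that idea with plain Python lists; `top_n` and the two-factor vectors are hypothetical, and the notebook itself relies on `model.recommend(...)` from `implicit`.

```python
import heapq


def top_n(user_vec, item_vecs, seen, n=5):
    """Rank unseen items by the dot product of user and item factors."""
    scored = (
        (sum(u * v for u, v in zip(user_vec, vec)), item)
        for item, vec in item_vecs.items()
        if item not in seen  # skip items the user already interacted with
    )
    return [(item, score) for score, item in heapq.nlargest(n, scored)]


# Hypothetical 2-factor vectors; real factors come from the trained model.
user_vec = [1.0, 0.0]
item_vecs = {"pA": [0.9, 0.1], "pB": [0.2, 0.8], "pC": [0.5, 0.5]}

recs = top_n(user_vec, item_vecs, seen={"pA"}, n=2)
print(recs)
```

This also explains why the "Predicted Rating" and "Score" columns in the output are identical: both are the same user-item dot product, computed once by hand and once by the library.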

The process completes in ~102 seconds, operating on a significantly reduced dataset for computational feasibility.

In [1]:
import duckdb
import implicit
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import glob
import time
import random
import os

input_path_pattern = "F:/sentiments/sentiments/sentiment_*.parquet"
min_reviews_per_user = 5
test_size = 0.20
n_recommendations = 5
n_demo_users = 3
random_state = 42

sample_fraction = 0.0001
duckdb_sample_percent = sample_fraction * 100

als_factors = 100
als_regularization = 0.05
als_iterations = 15

random.seed(random_state)
np.random.seed(random_state)


start_time = time.time()
print("--- Starting ALS ---")
print(f"Operating on approximately {sample_fraction*100:.4f}% sample of the data using DuckDB")


print(f"Scanning for Parquet files: {input_path_pattern}")
parquet_files = glob.glob(input_path_pattern)

if not parquet_files:
    print(f"Error: No files found matching pattern '{input_path_pattern}'. Exiting.")
    exit()

print(f"Found {len(parquet_files)} files. Reading, filtering, and sampling each via DuckDB...")
sampled_dfs = []
total_rows_processed = 0 # Note: Can't easily get rows processed without extra query per file
errors_reading = 0
duckdb_con = None

try:
    duckdb_con = duckdb.connect(config={'memory_limit': '10GB'}) # In-memory, adjust limit if needed

    for i, file_path in enumerate(parquet_files):
        file_name = os.path.basename(file_path)
        print(f"Processing file {i+1}/{len(parquet_files)}: {file_name}...")
        try:
            # Replace backslashes for SQL compatibility if needed, though DuckDB often handles it
            sql_safe_file_path = file_path.replace('\\', '/')

            query = f"""
            SELECT
                user_id,
                parent_asin AS product_id,
                CAST(rating AS FLOAT) AS rating
            FROM read_parquet('{sql_safe_file_path}')
            WHERE user_id IS NOT NULL AND TRIM(user_id) != ''
              AND parent_asin IS NOT NULL AND TRIM(parent_asin) != ''
              AND rating BETWEEN 1 AND 5
            USING SAMPLE {duckdb_sample_percent} PERCENT (SYSTEM);
            """

            df_sampled_chunk = duckdb_con.execute(query).fetch_df()

            if len(df_sampled_chunk) > 0:
                sampled_dfs.append(df_sampled_chunk)

        except Exception as e:
            print(f"  Warning: Error processing file {file_name}: {e}")
            errors_reading += 1
            continue

    print(f"Finished processing {len(parquet_files)} files. {errors_reading} errors encountered.")

    if not sampled_dfs:
        print("Error: No data was successfully sampled. Check file contents, filters, or sample fraction. Exiting.")
        exit()

    print("Concatenating sampled data chunks...")
    interactions_df_sampled = pd.concat(sampled_dfs, ignore_index=True)
    n_sampled_interactions = len(interactions_df_sampled)
    print(f"Total sampled interactions collected: {n_sampled_interactions}")

    if n_sampled_interactions == 0:
        print("Sample is empty after concatenation. Exiting.")
        exit()

except Exception as e:
    print(f"Error during DuckDB processing loop or concatenation: {e}")
    exit()
finally:
    if duckdb_con:
        print("Closing DuckDB connection.")
        duckdb_con.close()


print(f"Filtering users with >= {min_reviews_per_user} reviews *within the collected sample*...")
try:
    user_counts = interactions_df_sampled.groupby('user_id')['rating'].transform('count')
    interactions_df_final = interactions_df_sampled[user_counts >= min_reviews_per_user].copy() # Use copy to avoid SettingWithCopyWarning

    n_interactions_final = len(interactions_df_final)

    if n_interactions_final == 0:
        print("No interaction data remaining after filtering the sample by user review count. Exiting.")
        exit()

    n_users_final = interactions_df_final["user_id"].nunique()
    n_items_final = interactions_df_final["product_id"].nunique()
    print(f"Final data after sampling and filtering: {n_interactions_final} interactions from {n_users_final} users and {n_items_final} items.")

except Exception as e:
    print(f"Error during user filtering on the sample: {e}")
    exit()


print("Creating user and item ID mappings...")
try:
    # interactions_df_final is already a Pandas DataFrame
    interactions_pd = interactions_df_final

    interactions_pd['user_idx'] = interactions_pd['user_id'].astype('category').cat.codes
    interactions_pd['item_idx'] = interactions_pd['product_id'].astype('category').cat.codes

    user_map = interactions_pd[['user_idx', 'user_id']].drop_duplicates().set_index('user_idx')
    item_map = interactions_pd[['item_idx', 'product_id']].drop_duplicates().set_index('item_idx')

    n_users = interactions_pd['user_idx'].max() + 1
    n_items = interactions_pd['item_idx'].max() + 1
    print(f"Mapped to {n_users} unique user indices and {n_items} unique item indices.")

except Exception as e:
    print(f"Error creating ID mappings: {e}")
    exit()


print(f"Splitting sampled data into training ({1-test_size:.0%}) and testing ({test_size:.0%})...")
try:
    train_pd, test_pd = train_test_split(
        interactions_pd,
        test_size=test_size,
        random_state=random_state,
        stratify=interactions_pd['user_idx']
    )
    print(f"Training set size (sampled): {len(train_pd)}")
    print(f"Test set size (sampled): {len(test_pd)}")

    if train_pd.empty or test_pd.empty:
        print("Train or test set is empty after split. Check sample or split ratio.")
        exit()

except Exception as e:
    print(f"Error splitting sampled data: {e}")
    exit()


print("Creating sparse user-item matrix for training...")
try:
    train_user_items = csr_matrix(
        (train_pd['rating'].astype(np.float32),
         (train_pd['user_idx'], train_pd['item_idx'])),
        shape=(n_users, n_items)
    )
    print("Sparse matrix created.")
except Exception as e:
    print(f"Error creating sparse matrix: {e}")
    exit()


print(f"Training Implicit ALS model (factors={als_factors}, regularization={als_regularization}, iterations={als_iterations})...")
try:
    model = implicit.als.AlternatingLeastSquares(
        factors=als_factors,
        regularization=als_regularization,
        iterations=als_iterations,
        random_state=random_state
    )
    model.fit(train_user_items)
    print("ALS model training complete.")
    user_factors = model.user_factors
    item_factors = model.item_factors
except Exception as e:
    print(f"Error training ALS model: {e}")
    exit()


print("Evaluating model on the test set sample (calculating RMSE)...")
try:
    test_user_indices = test_pd['user_idx'].values
    test_item_indices = test_pd['item_idx'].values
    actual_ratings = test_pd['rating'].values

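    # Note: implicit's ALS scores estimate preference, not 1-5 star ratings,
    # so this RMSE reflects the scale mismatch as well as prediction error.
    predicted_ratings = []
    for u_idx, i_idx in zip(test_user_indices, test_item_indices):
        if u_idx < user_factors.shape[0] and i_idx < item_factors.shape[0]:
            pred = user_factors[u_idx, :].dot(item_factors[i_idx, :])
            predicted_ratings.append(pred)
        else:
            predicted_ratings.append(np.nan)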

    valid_indices = ~np.isnan(predicted_ratings)
    if not np.all(valid_indices):
        print(f"Warning: {np.sum(~valid_indices)} test interactions could not be predicted (index out of bounds?).")
        actual_ratings = actual_ratings[valid_indices]
        predicted_ratings = np.array(predicted_ratings)[valid_indices]

    if len(predicted_ratings) > 0:
        mse = mean_squared_error(actual_ratings, predicted_ratings)
        rmse = np.sqrt(mse)
        print(f"Evaluation complete. Test Set RMSE (on sample): {rmse:.4f}")
    else:
        print("No valid predictions generated for the test set. Cannot calculate RMSE.")

except Exception as e:
    print(f"Error evaluating model: {e}")
    exit()


print(f"\n--- Generating Top {n_recommendations} Recommendations for {n_demo_users} Random Users (from sample) ---")
try:
    all_train_user_indices = train_pd['user_idx'].unique()

    if len(all_train_user_indices) == 0:
        print("No users found in the training set sample to generate recommendations for.")
    else:
        num_users_to_sample = min(n_demo_users, len(all_train_user_indices))
        if num_users_to_sample < n_demo_users:
            print(f"Warning: Only {num_users_to_sample} unique users in training data sample. Showing recommendations for {num_users_to_sample}.")


        random_user_indices = random.sample(list(all_train_user_indices), num_users_to_sample)
        random_users_original_ids = [user_map.loc[idx, 'user_id'] for idx in random_user_indices]
        print(f"Selected random users for demo (original IDs from sample): {random_users_original_ids}")

        for user_idx in random_user_indices:
            original_user_id = user_map.loc[user_idx, 'user_id']
            print(f"\nRecommendations for User: {original_user_id} (Index: {user_idx})")

            recommended_indices, scores = model.recommend(
                user_idx,
                train_user_items[user_idx],
                N=n_recommendations,
                filter_already_liked_items=True
            )

            if len(recommended_indices) == 0:
                print("  No recommendations could be generated.")
                continue

            print(f"  Top {len(recommended_indices)} recommendations:")
            for i, item_idx in enumerate(recommended_indices):
                if item_idx < item_factors.shape[0]:
                    original_product_id = item_map.loc[item_idx, 'product_id']
                    # Same dot product the library uses for scores[i], recomputed by hand.
                    predicted_rating = user_factors[user_idx, :].dot(item_factors[item_idx, :])
                    print(f"    {i+1}. Product ID: {original_product_id}, Predicted Rating: {predicted_rating:.3f} (Score: {scores[i]:.3f})")
                else:
                    print(f"    {i+1}. Recommended item index {item_idx} out of bounds.")


except Exception as e:
    print(f"Error generating recommendations: {e}")


end_time = time.time()
print(f"\n--- Script finished in {end_time - start_time:.2f} seconds ---")
print(f"--- Results based on approximately {sample_fraction*100:.4f}% sample of the original data ---")

--- Starting ALS ---
Operating on approximately 0.0100% sample of the data using DuckDB
Scanning for Parquet files: F:/sentiments/sentiments/sentiment_*.parquet
Found 34 files. Reading, filtering, and sampling each via DuckDB...
Processing file 1/34: sentiment_All_Beauty.parquet...
Processing file 2/34: sentiment_Amazon_Fashion.parquet...
Processing file 3/34: sentiment_Appliances.parquet...
Processing file 4/34: sentiment_Arts_Crafts_and_Sewing.parquet...
Processing file 5/34: sentiment_Automotive.parquet...
Processing file 6/34: sentiment_Baby_Products.parquet...
Processing file 7/34: sentiment_Beauty_and_Personal_Care.parquet...
Processing file 8/34: sentiment_Books.parquet...
Processing file 9/34: sentiment_CDs_and_Vinyl.parquet...
Processing file 10/34: sentiment_Cell_Phones_and_Accessories.parquet...
Processing file 11/34: sentiment_Clothing_Shoes_and_Jewelry.parquet...
Processing file 12/34: sentiment_Digital_Music.parquet...
Processing file 13/34: sentiment_Electronics.parquet...
Processing file 14/34: sentiment_Gift_Cards.parquet...
Processing file 15/34: sentiment_Grocery_and_Gourmet_Food.parquet...
Processing file 16/34: sentiment_Handmade_Products.parquet...
Processing file 17/34: sentiment_Health_and_Household.parquet...
Processing file 18/34: sentiment_Health_and_Personal_Care.parquet...
Processing file 19/34: sentiment_Home_and_Kitchen.parquet...
Processing file 20/34: sentiment_Industrial_and_Scientific.parquet...
Processing file 21/34: sentiment_Kindle_Store.parquet...
Processing file 22/34: sentiment_Magazine_Subscriptions.parquet...
Processing file 23/34: sentiment_Movies_and_TV.parquet...
Processing file 24/34: sentiment_Musical_Instruments.parquet...
Processing file 25/34: sentiment_Office_Products.parquet...
Processing file 26/34: sentiment_Patio_Lawn_and_Garden.parquet...
Processing file 27/34: sentiment_Pet_Supplies.parquet...
Processing file 28/34: sentiment_Software.parquet...
Processing file 29/34: sentiment_Sports_and_Outdoors.parquet...
Processing file 30/34: sentiment_Subscription_Boxes.parquet...
Processing file 31/34: sentiment_Tools_and_Home_Improvement.parquet...
Processing file 32/34: sentiment_Toys_and_Games.parquet...
Processing file 33/34: sentiment_Unknown.parquet...
Processing file 34/34: sentiment_Video_Games.parquet...
Finished processing 34 files. 0 errors encountered.
Concatenating sampled data chunks...
Total sampled interactions collected: 51477
Closing DuckDB connection.
Filtering users with >= 5 reviews *within the collected sample*...
Final data after sampling and filtering: 6472 interactions from 728 users and 5962 items.
Creating user and item ID mappings...
Mapped to 728 unique user indices and 5962 unique item indices.
Splitting sampled data into training (80%) and testing (20%)...
Training set size (sampled): 5177
Test set size (sampled): 1295
Creating sparse user-item matrix for training...
Sparse matrix created.
Training Implicit ALS model (factors=100, regularization=0.05, iterations=15)...
ALS model training complete.
Evaluating model on the test set sample (calculating RMSE)...
Evaluation complete. Test Set RMSE (on sample): 4.4865

--- Generating Top 5 Recommendations for 3 Random Users (from sample) ---
Selected random users for demo (original IDs from sample): ['AHKLCGILATXKYSYAZROMECV4DPLQ', 'AGBEWUAJTLUIA3RXP52JCFCD267A', 'AE4AYOESICGVKF2I3WA6WAU5SBRQ']

Recommendations for User: AHKLCGILATXKYSYAZROMECV4DPLQ (Index: 640)
  Top 5 recommendations:
    1. Product ID: B07TSG87YD, Predicted Rating: 0.361 (Score: 0.361)
    2. Product ID: B0C3LV4FVT, Predicted Rating: 0.358 (Score: 0.358)
    3. Product ID: B0C3FG4HSV, Predicted Rating: 0.226 (Score: 0.226)
    4. Product ID: B00B7M7CLA, Predicted Rating: 0.218 (Score: 0.218)
    5. Product ID: B004GCJMGG, Predicted Rating: 0.218 (Score: 0.218)

Recommendations for User: AGBEWUAJTLUIA3RXP52JCFCD267A (Index: 412)
  Top 5 recommendations:
    1. Product ID: B0BNQGJTF7, Predicted Rating: 0.302 (Score: 0.302)
    2. Product 