# Recommendation Systems (Continued)

- **`Collaborative Filtering`**
  - Recommends items based on the preferences of similar users.
  - It doesn't require knowledge of the items themselves, just information about user interactions.
  - For example, a music streaming service might recommend songs that other users with similar tastes have enjoyed.
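
The idea of "users with similar tastes" can be illustrated with a minimal sketch. The toy rating matrix below is hypothetical (it is not the dataset used in this notebook), and treating unrated items as `0` in the cosine similarity is a simplifying assumption.

```python
import numpy as np

# Hypothetical toy data: rows are users, columns are items, 0 means "not rated".
ratings = np.array(
    [
        [5, 4, 0, 1],  # target user (has not rated item 2)
        [5, 5, 4, 1],  # user with similar tastes
        [1, 2, 1, 5],  # user with different tastes
    ],
    dtype=float,
)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


target = 0
others = [u for u in range(len(ratings)) if u != target]

# The neighbour whose rating vector points in the most similar direction.
most_similar = max(others, key=lambda u: cosine(ratings[target], ratings[u]))

# Among items the target has not rated, pick the one the neighbour rated highest.
unrated = np.flatnonzero(ratings[target] == 0)
recommended = int(max(unrated, key=lambda i: ratings[most_similar, i]))

print(most_similar, recommended)
```

Note that no item attributes are used here: only the interaction matrix drives the recommendation.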

<br>

### Model-Based Collaborative Filtering

- Model-based collaborative filtering is a technique that uses machine learning algorithms to predict user preferences by building a model from user-item interaction data.

- Its key characteristics and advantages include:
  - `Model Training`: Learning patterns from data to build a predictive model using techniques like matrix factorization, neural networks, and clustering.
  - `Scalability`: More scalable than memory-based methods since it doesn't require real-time similarity computations.
  - `Handling Sparsity`: Effective in handling sparse data by learning latent factors that capture underlying patterns.
  - `Accuracy`: Can achieve higher accuracy by discovering complex relationships between users and items.
  - `Efficiency`: Often faster in generating recommendations due to pre-computed models.

### Common Model-Based Techniques

- `Matrix Factorization`: Singular Value Decomposition (SVD) and Alternating Least Squares (ALS) decompose the user-item interaction matrix.
- `Neural Networks`: Autoencoders and deep learning models like RNNs or CNNs learn compressed representations or capture temporal/spatial patterns.
- `Clustering`: K-Means and hierarchical clustering group users or items based on interaction patterns.


In [1]:
%load_ext watermark
%watermark -v -p numpy,pandas,polars,mlxtend,omegaconf --conda

Python implementation: CPython
Python version       : 3.11.8
IPython version      : 8.22.2

numpy    : 1.26.4
pandas   : 2.2.1
polars   : 0.20.18
mlxtend  : 0.23.1
omegaconf: 2.3.0

conda environment: torch_p11



In [2]:
# Built-in library
from pathlib import Path
import re
import json
from typing import Any, Optional, Union
import logging
import warnings

# Standard imports
import numpy as np
import numpy.typing as npt
from pprint import pprint
import pandas as pd
import polars as pl
from rich.console import Console
from rich.theme import Theme

custom_theme = Theme(
    {
        "info": "#76FF7B",
        "warning": "#FBDDFE",
        "error": "#FF0000",
    }
)
console = Console(theme=custom_theme)

# Visualization
import matplotlib.pyplot as plt

# NumPy settings
np.set_printoptions(precision=4)

# Pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 600

# Polars settings
pl.Config.set_fmt_str_lengths(1_000)
pl.Config.set_tbl_cols(n=1_000)
pl.Config.set_tbl_rows(n=200)

warnings.filterwarnings("ignore")


# Black code formatter (Optional)
%load_ext lab_black

# auto reload imports
%load_ext autoreload
%autoreload 2

In [3]:
# Limit the displayed string length to 100 characters
pl.Config.set_fmt_str_lengths(100)

polars.config.Config

In [4]:
fp: str = "../../data/cleaned_products_data.parquet"

prod_df: pl.DataFrame = pl.read_parquet(fp)
print(f"{prod_df.shape = }")
prod_df.head(3)

prod_df.shape = (37853, 6)


productId,title,product_category,summary,genres,metadata
str,str,str,str,str,str
"""B00000APZD""","""tension at the seams""","""hard rock & metal""","""solid funk thrash album""","""hard rock & metal""","""solid funk thrash album hard rock & metal tension at the seams"""
"""B00005KBJR""","""shchedrin: carmen suite / concertos for orchestra nos. 1- naughty limericks, & 2- the chimes""","""classical""","""why is carmen so polite? not great music, but a lot of fun new twist on an old favorite, with delig…","""classical""","""why is carmen so polite? not great music, but a lot of fun new twist on an old favorite, with delig…"
"""B00004TVH5""","""flamenco""","""world | pop | latin""","""unexciting & pointless""","""world pop latin""","""unexciting & pointless world pop latin flamenco"""


### Matrix Factorization

- Matrix factorization is a popular technique used in `collaborative filtering` for recommendation systems.
- It involves `decomposing a large matrix into smaller matrices` to uncover the `latent features` underlying the interactions between users and items.

- Here's a brief explanation:

**Problem Setup**:

- In a recommendation system, we typically have a user-item interaction matrix $R$, where each entry $R_{ui}$ represents the rating or interaction of user $u$ with item $i$.
- This matrix is usually `sparse`, meaning most entries are missing because users have interacted with only a small subset of items.

**Goal**:

- The goal of matrix factorization is to predict the missing entries in the interaction matrix $R$.
- This helps in recommending items to users that they have not yet interacted with.

**Decomposition**:

- Matrix factorization decomposes the interaction matrix $R$ into two lower-dimensional matrices:
  - $U$: A user-feature matrix where each `row represents a user` and the `columns represent the latent features`.
  - $I^T$: The transposed item-feature matrix, where each `row represents a latent feature` and each `column represents an individual item`.
  - Mathematically, $R \approx U \times I^T$.
  - e.g. the predicted rating $r_{m,1}$ of user $m$ for item $1$, with $K$ latent features, is given as:
    - $r_{m,1} = \sum_{k=1}^{K} U_{m,k} \cdot I_{1,k}$
    - $r_{m,1} = [U_{m,1}, \dots, U_{m,K}] \cdot [I_{1,1}, \dots, I_{1,K}]^T$
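
The decomposition can be checked numerically with a small NumPy sketch. The matrix sizes and the random factor values are assumptions chosen only for illustration; in practice the factors are learned from data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed for illustration only).
n_users, n_items, n_factors = 4, 5, 2

# U holds one row of latent features per user, I holds one row per item.
U = rng.normal(size=(n_users, n_factors))
I = rng.normal(size=(n_items, n_factors))

# Full matrix of predicted interactions: R_hat = U x I^T.
R_hat = U @ I.T

# A single prediction is the dot product of one user row and one item row.
m, i = 3, 0
r_mi = float(np.dot(U[m], I[i]))

print(R_hat.shape)
print(np.isclose(r_mi, R_hat[m, i]))
```

Storing two thin matrices instead of the full matrix is what makes the approach practical when there are many users and items.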


**Latent Features:**

- The latent features are abstract representations that `capture the underlying factors influencing user preferences and item characteristics`.
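- For example, in a movie recommendation system, latent features might loosely correspond to genres, actors, or directors, although they are learned from the data and are not guaranteed to be directly interpretable.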

**Optimization**:

- The decomposition is typically achieved by `minimizing` the difference between the actual and predicted interactions.
- This can be formulated as an optimization problem: 
  - $\min_{U, I} \sum_{(u,i) \in \text{observed}} (R_{ui} - U_u \cdot I_i^T)^2 + \lambda (\|U\|^2 + \|I\|^2)$
  - Here $U_u$ is the latent feature (row) vector of user $u$ and $I_i$ is the latent feature (row) vector of item $i$.
  - The first term measures the reconstruction error for observed interactions.
  - The second term is a regularization term to prevent overfitting, with $\lambda$ being the regularization parameter.
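
One common way to minimize this objective is stochastic gradient descent (SGD) over the observed entries. The sketch below is a simplified illustration on hypothetical toy ratings, not the implementation used by any particular library; the function names, learning rate, and regularization value are assumptions, and the constant factor 2 from the gradient is absorbed into the learning rate.

```python
import numpy as np


def factorize(ratings, n_users, n_items, n_factors=2, lr=0.01, reg=0.02,
              n_epochs=200, seed=0):
    """Learn user and item factors with plain SGD on the observed ratings."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_users, n_factors))
    I = rng.normal(scale=0.1, size=(n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - U[u] @ I[i]  # reconstruction error for this entry
            u_old = U[u].copy()    # keep the old user vector for the item update
            U[u] += lr * (err * I[i] - reg * U[u])
            I[i] += lr * (err * u_old - reg * I[i])
    return U, I


def rmse(ratings, U, I):
    """Root mean squared error over the observed ratings."""
    errors = [r - U[u] @ I[i] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))


# Hypothetical observed (user, item, rating) triples.
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

U0, I0 = factorize(toy_ratings, n_users=3, n_items=3, n_epochs=0)
U1, I1 = factorize(toy_ratings, n_users=3, n_items=3, n_epochs=500)

rmse_before = rmse(toy_ratings, U0, I0)
rmse_after = rmse(toy_ratings, U1, I1)
print(rmse_before, rmse_after)
```

Only observed entries contribute to the updates, which is why the method copes with a sparse interaction matrix.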

In [5]:
from surprise import accuracy, Dataset, KNNBasic, Reader, SVD
from surprise.model_selection import (
    cross_validate,
    GridSearchCV,
    KFold,
    train_test_split,
)

In [6]:
fp1: str = "../../data/ratings_data.parquet"
columns: list[str] = ["user_id", "item_id", "rating"]

df: pl.DataFrame = (
    pl.read_parquet(fp1)
    .rename({"userId": "user_id", "productId": "item_id", "score": "rating"})
    .select(columns)
)
print(f"{df.shape = }")

df_pd: pd.DataFrame = df.to_pandas()
df_pd.columns = columns

df.head()

df.shape = (91637, 3)


user_id,item_id,rating
str,str,f32
"""A2WIJ0KSWC96M9""","""B00001R3G3""",3.0
"""A1XPTVJC8OMM3I""","""B000HBK10Q""",4.0
"""A2N71K3SDAS4PV""","""B00027JYY4""",2.0
"""AYXOKMQ8IJX7M""","""B00004WJHY""",5.0
"""ANDR5PCA6W83K""","""B00004WIL4""",5.0


In [7]:
df_pd.head()

Unnamed: 0,user_id,item_id,rating
0,A2WIJ0KSWC96M9,B00001R3G3,3.0
1,A1XPTVJC8OMM3I,B000HBK10Q,4.0
2,A2N71K3SDAS4PV,B00027JYY4,2.0
3,AYXOKMQ8IJX7M,B00004WJHY,5.0
4,ANDR5PCA6W83K,B00004WIL4,5.0


In [8]:
seed: int = 123

# Reader for the custom dataset
reader: Reader = Reader(rating_scale=(1, 5))

# Load the dataset. It must be a Pandas dataframe with
# three columns: user_id, item_id, and rating in that order.
dataset: Dataset = Dataset.load_from_df(df=df_pd, reader=reader)

# Create the grid with hyperparameters to tune.
# After creating a grid of possible values for the hyperparameters, we'll perform
# an exhaustive grid search (2x2x2x2=16 combinations) to find the best combination.
# i.e. (num_options^num_hyperparams)
param_grid: dict[str, list[int | float]] = {
    "n_factors": [200, 300],
    "n_epochs": [5, 10],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.4, 0.6],
    "random_state": [seed],
}
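
As a quick sanity check on the size of the search, the number of combinations is the product of the lengths of the value lists. The sketch below re-declares the grid under a separate, hypothetical name so that it is self-contained.

```python
from itertools import product
from math import prod

# Copy of the grid above, under a hypothetical name to keep this block standalone.
param_grid_example = {
    "n_factors": [200, 300],
    "n_epochs": [5, 10],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.4, 0.6],
    "random_state": [123],
}

# Number of combinations = product of the number of options per hyperparameter.
n_combinations = prod(len(values) for values in param_grid_example.values())

# Enumerate every combination explicitly as a dict of keyword arguments.
combos = [
    dict(zip(param_grid_example, values))
    for values in product(*param_grid_example.values())
]

print(n_combinations)  # 16
print(combos[0])
```

With k-fold cross-validation, each combination is fitted k times, so the total number of model fits is `n_combinations * k`.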

In [9]:
# Perform grid search for SVD (a matrix factorization algorithm that learns latent user and item factors)
grid_search: GridSearchCV = GridSearchCV(
    algo_class=SVD,
    param_grid=param_grid,
    measures=["rmse", "mae"],
    cv=KFold(n_splits=3, random_state=seed, shuffle=True),
)
# Fit the grid search
grid_search.fit(dataset)
# Cost function to minimize is RMSE
algo = grid_search.best_estimator["rmse"]

In [10]:
print(f'The best RMSE: {grid_search.best_score["rmse"]}')

print(f'The best hyperparams: {grid_search.best_params["rmse"]}')

The best RMSE: 0.9710149211593863
The best hyperparams: {'n_factors': 200, 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4, 'random_state': 123}


In [11]:
# Accuracy of the best algorithm using cross validation
cross_validate(
    algo=algo,
    data=dataset,
    measures=["RMSE", "MAE"],
    cv=KFold(n_splits=5, random_state=seed),
    verbose=True,
)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9565  0.9534  0.9673  0.9714  0.9641  0.9625  0.0067  
MAE (testset)     0.7341  0.7365  0.7413  0.7473  0.7422  0.7403  0.0046  
Fit time          0.93    0.99    0.93    0.97    0.93    0.95    0.03    
Test time         0.10    0.10    0.18    0.10    0.12    0.12    0.03    


{'test_rmse': array([0.9565, 0.9534, 0.9673, 0.9714, 0.9641]),
 'test_mae': array([0.7341, 0.7365, 0.7413, 0.7473, 0.7422]),
 'fit_time': (0.9341890811920166,
  0.9937460422515869,
  0.9310412406921387,
  0.9713180065155029,
  0.927001953125),
 'test_time': (0.0987699031829834,
  0.10191988945007324,
  0.1807880401611328,
  0.10056424140930176,
  0.12252211570739746)}
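
For reference, the two reported metrics can be computed by hand. This is a minimal sketch on made-up ratings; the helper names are hypothetical and the numbers are unrelated to the results above.

```python
import math


def rmse(actual: list[float], predicted: list[float]) -> float:
    """Root mean squared error: penalizes large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual: list[float], predicted: list[float]) -> float:
    """Mean absolute error: average size of the error, in rating units."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Made-up example ratings.
actual = [3.0, 4.0, 2.0, 5.0]
predicted = [2.5, 4.0, 3.0, 4.0]

print(rmse(actual, predicted))  # 0.75
print(mae(actual, predicted))   # 0.625
```

RMSE is never smaller than MAE on the same predictions, and the gap widens when a few predictions are far off.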

#### Train A Recommender System

- Using the best hyperparameters found by the grid search, train a recommender system.

In [12]:
from typing import TypedDict


class RecSysModelParams(TypedDict):
    # Key names match the keys returned by the grid search
    n_factors: int
    n_epochs: int
    lr_all: float
    reg_all: float
    random_state: int


model_params: RecSysModelParams = {**grid_search.best_params["rmse"]}
model_params

{'n_factors': 200,
 'n_epochs': 10,
 'lr_all': 0.005,
 'reg_all': 0.4,
 'random_state': 123}

In [13]:
test_size: float = 0.2

data = Dataset.load_from_df(df_pd, reader)
trainset, testset = train_test_split(data, test_size=test_size, random_state=seed)

# Init model
algo_svd: SVD = SVD(**model_params)

# Train model
algo_svd.fit(trainset)

# Predict
predictions = algo_svd.test(testset)

In [14]:
# Calculate the Root Mean Square Error (RMSE)
accuracy.rmse(predictions)

RMSE: 0.9641


0.9640530201704663

In [15]:
df.sample(n=10, seed=seed)

user_id,item_id,rating
str,str,f32
"""AB2AQMK9PCWWX""","""B000BRI5KE""",5.0
"""A1E11HKN4IRY09""","""B00005YW50""",1.0
"""A2P49WD75WHAG5""","""B000KD3DYC""",5.0
"""AAHXKHWO53M9E""","""B000068W4M""",3.0
"""AOYLVUVF5VZTU""","""B00008G0PI""",4.0
"""AXZ60GNHLQMT9""","""B000INX49S""",5.0
"""A27ZOCD5B63Y0P""","""B0000004R2""",5.0
"""ANP3R2R4V4GAN""","""B000PIFXTU""",5.0
"""A1EYSN1T55SV7U""","""B000H5TYIC""",3.0
"""A20Z1PKIH0PFUF""","""B00004NHCC""",5.0


In [16]:
# Extract music recommendations
u_id: str = "A1435P5AMCPB3X"

# Find the number of reviews (items rated) by the user
movies_seen_df: pl.DataFrame = df.filter(pl.col("user_id").eq(u_id))
num_reviews: int = movies_seen_df.shape[0]
print(f"Number of reviews: {num_reviews}")

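# Unique items rated by OTHER users. Note: this does not exclude items the
# target user has also rated whenever another user rated them too.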
movies_not_seen_df: pl.DataFrame = df.filter(pl.col("user_id").ne(u_id)).unique(
    subset=["item_id"]
)
print(f"Number of movies not seen: {movies_not_seen_df.shape[0]:,}")
movies_not_seen_df.head()

Number of reviews: 27
Number of movies not seen: 22,255


user_id,item_id,rating
str,str,f32
"""A3BTZBO1380KB1""","""B000GFLIEQ""",5.0
"""A3TCXTAKB0DRIH""","""B000J103KM""",1.0
"""ATMKJX8W0XYDV""","""B00000BKDU""",5.0
"""A3SGZVO83J5KCH""","""B000PSJD0A""",4.0
"""A2ZO6AUEH7UAKQ""","""B00000JRL9""",5.0

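Filtering out the target user's rows keeps every item that at least one other user rated, which may include items the target user has rated as well. A stricter candidate set removes the user's own items explicitly. The sketch below uses plain Python sets on hypothetical toy data; with Polars, an anti-join (`how="anti"`) against the user's rated items is one way to express the same idea.

```python
# Hypothetical (user_id, item_id, rating) triples.
toy_ratings = [
    ("u1", "a", 5.0), ("u1", "b", 3.0),
    ("u2", "a", 4.0), ("u2", "c", 2.0),
    ("u3", "b", 1.0), ("u3", "d", 5.0),
]


def items_rated_by_others(ratings, user_id):
    """Items that appear in any row belonging to another user."""
    return sorted({item for user, item, _ in ratings if user != user_id})


def unseen_items(ratings, user_id):
    """Items in the catalogue that the user has not rated."""
    seen = {item for user, item, _ in ratings if user == user_id}
    all_items = {item for _, item, _ in ratings}
    return sorted(all_items - seen)


print(items_rated_by_others(toy_ratings, "u1"))  # ['a', 'b', 'c', 'd']
print(unseen_items(toy_ratings, "u1"))           # ['c', 'd']
```

The two sets differ exactly by the items that both the target user and someone else have rated.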

In [17]:
from surprise.prediction_algorithms import Prediction

# Sample (Test the API)
p: Prediction = algo_svd.predict(
    uid=u_id, iid=movies_not_seen_df["item_id"].to_list()[0]
)

console.print(p)

In [18]:
pred_movies: list[tuple[str, float]] = [
    # (item_id, est)
    (row["item_id"], algo_svd.predict(uid=u_id, iid=row["item_id"]).est)
    for row in movies_not_seen_df.iter_rows(named=True)
]
len(pred_movies)

22255

In [19]:
# DF containing the predicted movies for the user
pred_df: pl.DataFrame = pl.DataFrame(
    pred_movies, schema={"item_id": pl.Utf8, "predicted_rating": pl.Float64}
)
pred_df.head()

item_id,predicted_rating
str,f64
"""B000GFLIEQ""",4.450579
"""B000J103KM""",4.35954
"""B00000BKDU""",4.655666
"""B000PSJD0A""",4.458233
"""B00000JRL9""",4.513728


In [20]:
prod_columns: list[str] = ["productId", "title", "product_category"]
unique_prod_df: pl.DataFrame = prod_df.select(prod_columns).unique(subset="productId")
print(f"{unique_prod_df.shape = }")
unique_prod_df.head()

unique_prod_df.shape = (37853, 3)


productId,title,product_category
str,str,str
"""B00005N5HS""","""always faithful""","""miscellaneous | pop | jazz"""
"""B00000HYG3""","""quintet for a day""","""jazz | pop"""
"""B000001V5B""","""the royal philharmonic orchestra plays the beatles [rock dreams, vol. 5]""","""rock | classical | classic rock | pop"""
"""B000BM6AOM""","""homenaje a papo""","""world | latin"""
"""B00001AQZQ""","""for sale""",""""""


In [21]:
movies_seen_full_df = movies_seen_df.join(
    unique_prod_df,
    left_on="item_id",
    right_on="productId",
    how="left",
)
movies_seen_full_df

user_id,item_id,rating,title,product_category
str,str,f32,str,str
"""A1435P5AMCPB3X""","""B00004VNV2""",4.0,"""fistful of metal""","""hard rock & metal | rock"""
"""A1435P5AMCPB3X""","""B00002NDBB""",5.0,,
"""A1435P5AMCPB3X""","""B000A345HQ""",3.0,"""recollection, vol. 3: relapse video collection (2005)""","""hard rock & metal tv movies & tv | rock | alternative rock"""
"""A1435P5AMCPB3X""","""B000CNEQBO""",2.0,"""parabola""","""hard rock & metal movies & tv | movies"""
"""A1435P5AMCPB3X""","""B000FQVYGI""",5.0,"""7th inning stretch""","""classic rock | soundtracks | rock | pop"""
"""A1435P5AMCPB3X""","""B000006UB9""",5.0,"""1967: the first 3 singles, 30th anniversary edition""","""rock | classic rock | pop"""
"""A1435P5AMCPB3X""","""B000AA4J5W""",4.0,"""mrs god (dig)""","""world | rock | hard rock & metal | pop"""
"""A1435P5AMCPB3X""","""B000063DFV""",5.0,"""virtual xi (vinyl replica) (dig)""","""rock | alternative rock | hard rock & metal | pop"""
"""A1435P5AMCPB3X""","""B000NI3G6Y""",4.0,"""light the universe""","""hard rock & metal | pop"""
"""A1435P5AMCPB3X""","""B000BEZOVU""",5.0,"""lashing the rye""","""rock | alternative rock | miscellaneous | pop"""


In [22]:
N: int = 10

rec_df: pl.DataFrame = pred_df.join(
    unique_prod_df,
    left_on="item_id",
    right_on="productId",
    how="left",
).sort(by="predicted_rating", descending=True)

rec_df.head(N)

item_id,predicted_rating,title,product_category
str,f64,str,str
"""B00005A0JQ""",4.899867,"""criminal minded""","""blues | r&b | pop | rap & hip-hop"""
"""B000025IK3""",4.898644,"""blow by blow""","""jazz | rock | hard rock & metal | classic rock"""
"""B000027A9I""",4.893869,"""e. 1999 eternal""","""rock | r&b | rap & hip-hop"""
"""B00008F4LD""",4.889394,"""group sex / wild in the streets""","""rock | alternative rock | hard rock & metal"""
"""B00008G0XL""",4.888565,"""danzig ii: lucifuge [vinyl]""","""hard rock & metal | rock"""
"""B00009L1SV""",4.887216,"""little sparrow""","""country | pop | folk"""
"""B0000004TS""",4.877594,"""collection of songs""","""pop | christian"""
"""B00005YCK2""",4.875045,"""blue train (stereo) [vinyl]""","""pop | jazz"""
"""B0000004X7""",4.872712,"""the low end theory""","""blues | r&b | rap & hip-hop | pop"""
"""B000FLCRK0""",4.86973,"""everybody knows this is nowhere""","""rock | classic rock"""


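When only the top N items are needed, a full sort is not strictly required. As an alternative sketch (not what the cell above does), `heapq.nlargest` from the standard library selects the N highest-scoring items directly; the item IDs and scores below are hypothetical.

```python
import heapq

# Hypothetical (item_id, predicted_rating) pairs.
predicted = [
    ("item_1", 4.45),
    ("item_2", 4.36),
    ("item_3", 4.66),
    ("item_4", 4.46),
    ("item_5", 4.51),
]


def top_n(preds: list[tuple[str, float]], n: int) -> list[tuple[str, float]]:
    """Return the n pairs with the highest predicted rating, best first."""
    return heapq.nlargest(n, preds, key=lambda pair: pair[1])


print(top_n(predicted, 3))
# [('item_3', 4.66), ('item_5', 4.51), ('item_4', 4.46)]
```

For a few thousand candidates the difference is negligible, so sorting the whole frame as above remains a reasonable choice.
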
#### Putting It All Together

In [23]:
from surprise import AlgoBase
from typing import List, Dict, Union, Any
import polars as pl


def recommend_items(
    algo: AlgoBase, df: pl.DataFrame, user_id: str, top_n: Union[int, None] = None
) -> List[Dict[str, Union[str, float]]]:
    """
    Recommend items for a user based on a collaborative filtering algorithm.

    Parameters
    ----------
    algo : AlgoBase
        The collaborative filtering algorithm to use for predictions.
    df : pl.DataFrame, shape (n_samples, n_features)
        The DataFrame containing user-item interactions.
    user_id : str
        The ID of the user for whom to make recommendations.
    top_n : int | None, optional
        The number of top recommendations to return. If None, defaults to 10.

    Returns
    -------
    List[Dict[str, Union[str, float]]]
        A list of dictionaries containing the top N recommended items and their
        predicted ratings, sorted in descending order of predicted rating.

    Raises
    ------
    TypeError
        If `algo` is not an instance of AlgoBase or `df` is not a Polars DataFrame.

    Notes
    -----
    The function collects the unique items rated by other users, predicts the
    user's rating for each of them, and returns the top N recommendations.
    Items the user has already rated are not excluded if other users rated them too.
    """
    if not isinstance(algo, AlgoBase):
        raise TypeError("algo must be an instance of AlgoBase")

    if not isinstance(df, pl.DataFrame):
        raise TypeError("df must be a Polars DataFrame")

    if top_n is None:
        top_n = 10

    # Find the number of reviews/movies seen by the user
    movies_seen_df: pl.DataFrame = df.filter(pl.col("user_id").eq(user_id))
    num_reviews: int = movies_seen_df.shape[0]
    print(f"Number of reviews: {num_reviews:,}")

    # Unique items rated by OTHER users (may still include items this user rated)
    movies_not_seen_df: pl.DataFrame = df.filter(pl.col("user_id").ne(user_id)).unique(
        subset="item_id"
    )
    print(f"Number of movies not seen: {movies_not_seen_df.shape[0]:,}")

    pred_movies: List[Dict[str, Union[str, float]]] = [
        {
            # "user_id": user_id,
            "item_id": row["item_id"],
            "rating": algo.predict(uid=user_id, iid=row["item_id"]).est,
        }
        for row in movies_not_seen_df.iter_rows(named=True)
    ]
    # Inplace sorting based on the estimated rating (descending)
    pred_movies.sort(key=lambda x: x["rating"], reverse=True)

    return pred_movies[:top_n]


def create_rec_data(
    pred: list[dict[str, str | float]], prod_df: pl.DataFrame
) -> pl.DataFrame:
    """
    Create a recommendation dataframe by joining prediction data with product data.

    Parameters
    ----------
    pred : list[dict[str, str | float]]
        List of prediction records.
        Each record is expected to have 'item_id' and 'rating' as keys.
    prod_df : pl.DataFrame
        DataFrame containing product data.
        Expected to have 'productId' column.

    Returns
    -------
    pl.DataFrame
        Sorted recommendation dataframe with product details.

    """
    recommend_df: pl.DataFrame = (
        pl.DataFrame(pred)
        .join(prod_df, left_on="item_id", right_on="productId")
        .sort("rating", descending=True)
    )
    return recommend_df

In [24]:
pred: list[dict[str, str | float]] = recommend_items(
    algo=algo_svd, df=df, user_id=u_id, top_n=5
)

console.print(pred)

Number of reviews: 27
Number of movies not seen: 22,255


In [25]:
rec_df: pl.DataFrame = create_rec_data(pred=pred, prod_df=prod_df)
rec_df

item_id,rating,title,product_category,summary,genres,metadata
str,f64,str,str,str,str,str
"""B00005A0JQ""",4.899867,"""criminal minded""","""blues | r&b | pop | rap & hip-hop""","""...criminal not to already own this!... this is still the dope beat! live and direct... boogie down…","""blues r&b pop rap & hip-hop""","""...criminal not to already own this!... this is still the dope beat! live and direct... boogie down…"
"""B000025IK3""",4.898644,"""blow by blow""","""jazz | rock | hard rock & metal | classic rock""","""blow by blow=tight!!! a generation later, blow by blow still dazzles the ear one of the best guitar…","""jazz rock hard rock & metal classic rock""","""blow by blow=tight!!! a generation later, blow by blow still dazzles the ear one of the best guitar…"
"""B000027A9I""",4.893869,"""e. 1999 eternal""","""rock | r&b | rap & hip-hop""","""a classik true artist bone is tha bomb !!!!!!!!!!!!!!!!!!!! classic. deadly beats!!! east. 1999 u c…","""rock r&b rap & hip-hop""","""a classik true artist bone is tha bomb !!!!!!!!!!!!!!!!!!!! classic. deadly beats!!! east. 1999 u c…"
"""B00008F4LD""",4.889394,"""group sex / wild in the streets""","""rock | alternative rock | hard rock & metal""","""speed, budweiser, and pimples best scene in the world amazing every song is great circle jerks. tha…","""rock alternative rock hard rock & metal""","""speed, budweiser, and pimples best scene in the world amazing every song is great circle jerks. tha…"
"""B00008G0XL""",4.888565,"""danzig ii: lucifuge [vinyl]""","""hard rock & metal | rock""","""danzig's 1989 debut danzig 2:electric boogaloo (sorry, force of habit) the dark side of infinity by…","""hard rock & metal rock""","""danzig's 1989 debut danzig 2:electric boogaloo (sorry, force of habit) the dark side of infinity by…"


<hr>

[![image.png](https://i.postimg.cc/8kdK7bMT/image.png)](https://postimg.cc/3WRC6mtq)

[image source](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html)

<hr>

```py
from surprise import prediction_algorithms

dir(prediction_algorithms)

# Results:
# [
#  'AlgoBase',
#  'BaselineOnly',
#  'CoClustering',
#  'KNNBaseline',
#  'KNNBasic',
#  'KNNWithMeans',
#  'KNNWithZScore',
#  'NMF',
#  'NormalPredictor',
#  'Prediction',
#  'PredictionImpossible',
#  'SVD',
#  'SVDpp',
#  'SlopeOne',
#  ..., 
# ]
```

In [26]:
unique_prod_df.head()

productId,title,product_category
str,str,str
"""B00005N5HS""","""always faithful""","""miscellaneous | pop | jazz"""
"""B00000HYG3""","""quintet for a day""","""jazz | pop"""
"""B000001V5B""","""the royal philharmonic orchestra plays the beatles [rock dreams, vol. 5]""","""rock | classical | classic rock | pop"""
"""B000BM6AOM""","""homenaje a papo""","""world | latin"""
"""B00001AQZQ""","""for sale""",""""""


In [27]:
def get_item_id(item_name: str, df: pl.DataFrame) -> str | None:
    """
    Get the product ID for a given item name.

    Parameters
    ----------
    item_name : str
        The name of the item to search for.
    df : pl.DataFrame
        The DataFrame containing the product information.

    Returns
    -------
    str or None
        The product ID if found, None otherwise.
    """
    try:
        item_id: str = df.filter(
            pl.col("title").str.to_lowercase().eq(item_name.lower())
        )["productId"][0]
    except IndexError:
        item_id: None = None

    return item_id


def get_item_name(
    df: pl.DataFrame, item_raw_id: str | None = None, item_id: int | None = None
) -> str | None:
    """
    Get the item name for a given product ID.

    Parameters
    ----------
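    df : pl.DataFrame
        The DataFrame containing the product information.
    item_raw_id : str | None, optional
        The product ID to search for.
    item_id : int | None, optional
        The zero-based row position of the item in `df`.
        Exactly one of `item_raw_id` or `item_id` must be provided.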

    Returns
    -------
    str or None
        The item name if found, None otherwise.
    """
    if item_raw_id is None and item_id is None:
        raise ValueError("Either `item_raw_id` or `item_id` must be provided.")
    if item_raw_id is not None and item_id is not None:
        raise ValueError("Only one of `item_raw_id` or `item_id` must be provided.")

    if item_id is not None:
        # Clamp to the last valid row position to avoid an IndexError
        item_id = min(item_id, len(df) - 1)
        item_name: str = df["title"].to_list()[item_id]
    else:
        try:
            item_name: str = df.filter(
                pl.col("productId").str.to_lowercase().eq(item_raw_id.lower())
            )["title"][0]
        except IndexError:
            item_name: None = None

    return item_name

In [28]:
print(get_item_id(item_name="vigilante of hope", df=prod_df))
print(get_item_name(item_raw_id="B000H0M4R0", df=prod_df))
print(get_item_name(item_id=200, df=prod_df))

B00008F554
cities and desire
hidden in the stomach


<br><hr>

#### Content-based Filtering

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix


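# NOTE: row i of `item_tfidf_matrix` corresponds to row i of `prod_df`, so any
# DataFrame passed to the recommenders below must keep this row order
# (`unique()` does not preserve order unless `maintain_order=True`).
metadata: pl.DataFrame = (
    prod_df.select(["productId", "title", "metadata"])
    # .unique(subset=["productId"])
    .clone()
)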
metadata = metadata.with_columns(
    metadata=pl.concat_str(pl.col("title"), pl.col("metadata"), separator=" | ")
)

pipe_tfidf: Pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer(stop_words="english", max_df=0.95, min_df=2)),
        ("svd", TruncatedSVD(n_components=300, random_state=seed, n_iter=10)),
    ]
)
item_tfidf_matrix: np.ndarray = pipe_tfidf.fit_transform(metadata["metadata"])

In [30]:
from polars import selectors as cs
import numpy as np
import polars as pl
from sklearn.pipeline import Pipeline
from sklearn.metrics.pairwise import cosine_similarity


def get_content_rec_idxs(
    user_profile: str,
    item_tfidf_matrix: np.ndarray,
    pipe_tfidf: Pipeline,
    top_n: int = 5,
) -> np.ndarray:
    """
    Generate content-based recommendations for a user.

    Parameters
    ----------
    user_profile : str
        The user's profile text.
    item_tfidf_matrix : np.ndarray, shape (n_items, n_features)
        TF-IDF matrix of all items.
    pipe_tfidf : Pipeline
        Scikit-learn Pipeline object for TF-IDF transformation.
    top_n : int, optional
        Number of recommendations to return (default is 5).

    Returns
    -------
    np.ndarray, shape (top_n,)
        Indices of top N recommended items.
    """
    user_profile_vector: np.ndarray = pipe_tfidf.transform([user_profile])
    cosine_sim: np.ndarray = cosine_similarity(user_profile_vector, item_tfidf_matrix)
    return cosine_sim.argsort()[0][::-1][:top_n]


def get_content_based_recs(
    df: pl.DataFrame,
    item_tfidf_matrix: np.ndarray,
    pipe_tfidf: Pipeline,
    item_name: str | None = None,
    user_profile: str | None = None,
    top_n: int = 5,
) -> pl.DataFrame:
    """
    Get content-based recommendations for a user.

    Parameters
    ----------
    df : pl.DataFrame
        DataFrame containing item information.
    item_tfidf_matrix : np.ndarray, shape (n_items, n_features)
        TF-IDF matrix of all items.
    pipe_tfidf : Pipeline
        Scikit-learn Pipeline object for TF-IDF transformation.
    item_name : str | None, optional
        Name of the item to base recommendations on (default is None).
    user_profile : str | None, optional
        The user's profile text (default is None).
    top_n : int, optional
        Number of recommendations to return (default is 5).

    Returns
    -------
    pl.DataFrame
        DataFrame containing recommended items.
    """
    if item_name is None and user_profile is None:
        raise ValueError("Either `item_name` or `user_profile` must be provided.")
    if item_name is not None and user_profile is not None:
        raise ValueError(
            "Only one of `item_name` or `user_profile` should be provided."
        )

    title_df: pl.DataFrame = pl.DataFrame(
        data={
            "title": df["title"].str.to_lowercase(),
        }
    )
    latent_df: pl.DataFrame = pl.concat(
        [title_df, pl.DataFrame(item_tfidf_matrix)], how="horizontal"
    )

    if item_name is not None:
        # Titles in `latent_df` are lowercased, so lowercase the query as well
        item_vector: pl.DataFrame = latent_df.filter(
            pl.col("title").eq(item_name.lower())
        ).select(cs.float())
        cos_sim: np.ndarray = cosine_similarity(
            latent_df.select(cs.float()), item_vector
        )
        sim_df: pl.DataFrame = (
            pl.DataFrame({"title": latent_df["title"], "similarity": cos_sim.flatten()})
            .sort(by="similarity", descending=True)
            .drop("similarity")
        ).slice(0, top_n)
        rec_df: pl.DataFrame = sim_df.join(df, on="title", how="left")

    if user_profile is not None:
        items: list[str] = df["title"].to_list()
        pred_ids: np.ndarray = get_content_rec_idxs(
            user_profile, item_tfidf_matrix, pipe_tfidf, top_n
        )
        rec_item_names: list[str] = [items[idx] for idx in pred_ids]
        rec_df: pl.DataFrame = df.filter(pl.col("title").is_in(rec_item_names))

    return rec_df

In [31]:
def get_content_based_recs(
    df: pl.DataFrame,
    item_tfidf_matrix: np.ndarray,
    pipe_tfidf: Pipeline,
    item_name: str | None = None,
    user_profile: str | None = None,
    top_n: int = 5,
) -> pl.DataFrame:
    """
    Get content-based recommendations for a user.

    Parameters
    ----------
    df : pl.DataFrame
        DataFrame containing item information.
    item_tfidf_matrix : np.ndarray, shape (n_items, n_features)
        TF-IDF matrix of all items.
    pipe_tfidf : Pipeline
        Scikit-learn Pipeline object for TF-IDF transformation.
    item_name : str | None, optional
        Name of the item to base recommendations on (default is None).
    user_profile : str | None, optional
        The user's profile text (default is None).
    top_n : int, optional
        Number of recommendations to return (default is 5).

    Returns
    -------
    pl.DataFrame
        DataFrame containing recommended items.
    """
    if item_name is None and user_profile is None:
        raise ValueError("Either `item_name` or `user_profile` must be provided.")
    if item_name is not None and user_profile is not None:
        raise ValueError(
            "Only one of `item_name` or `user_profile` should be provided."
        )

    title_df: pl.DataFrame = pl.DataFrame(
        data={
            "title": df["title"].str.to_lowercase(),
        }
    )
    latent_df: pl.DataFrame = pl.concat(
        [title_df, pl.DataFrame(item_tfidf_matrix)], how="horizontal"
    )

    if item_name is not None:
        # Titles in `latent_df` are lowercased, so lowercase the query as well
        item_vector: pl.DataFrame = latent_df.filter(
            pl.col("title").eq(item_name.lower())
        ).select(cs.float())
        cos_sim: np.ndarray = cosine_similarity(
            latent_df.select(cs.float()), item_vector
        )
        sim_df: pl.DataFrame = (
            pl.DataFrame({"title": latent_df["title"], "similarity": cos_sim.flatten()})
            .sort(by="similarity", descending=True)
            .drop("similarity")
        ).slice(0, top_n)
        rec_df: pl.DataFrame = sim_df.join(df, on="title", how="left")

    if user_profile is not None:
        pattern: str = r"(" + user_profile + ")"
        rand_metadata: list[str] = (
            df.with_columns(flag=pl.col("product_category").str.extract_all(pattern))
            .filter(pl.col("flag").ne([]))["product_category"]
            .sample(n=3, shuffle=True, seed=42)
            .to_list()
        )
        metadata_str = " ".join(rand_metadata)
        items: list[str] = df["title"].to_list()
        pred_ids: np.ndarray = get_content_rec_idxs(
            metadata_str, item_tfidf_matrix, pipe_tfidf, top_n
        )
        rec_item_names: list[str] = [items[idx] for idx in pred_ids]
        rec_df: pl.DataFrame = df.filter(pl.col("title").is_in(rec_item_names))

    return rec_df
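
The ranking step inside `get_content_rec_idxs` boils down to cosine similarity followed by a descending sort. The sketch below reproduces that step with NumPy only, on a hypothetical 2-dimensional item matrix; the function name and the zero-norm guard are assumptions added for illustration.

```python
import numpy as np


def top_n_similar(query, item_matrix, top_n: int = 5) -> np.ndarray:
    """Return row indices of the items most similar to `query`, best first."""
    query = np.asarray(query, dtype=float)
    item_matrix = np.asarray(item_matrix, dtype=float)

    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(item_matrix, axis=1) * np.linalg.norm(query)
    norms = np.where(norms == 0, 1.0, norms)  # guard against division by zero
    sims = item_matrix @ query / norms

    # Sort by similarity in descending order and keep the first top_n indices.
    return np.argsort(-sims, kind="stable")[:top_n]


# Hypothetical item vectors (one row per item).
items = np.array(
    [
        [1.0, 0.0],   # same direction as the query
        [0.9, 0.1],   # almost the same direction
        [0.0, 1.0],   # orthogonal
        [-1.0, 0.0],  # opposite direction
    ]
)
query = [1.0, 0.0]

print(top_n_similar(query, items, top_n=2))  # [0 1]
```

The returned values are row positions, so they are only meaningful against a DataFrame whose rows are in the same order as the item matrix.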

In [32]:
get_content_based_recs(
    df=unique_prod_df,
    item_tfidf_matrix=item_tfidf_matrix,
    pipe_tfidf=pipe_tfidf,
    item_name=None,
    user_profile="pop|gospel|r&b",
    top_n=10,
)

productId,title,product_category
str,str,str
"""B000026Y9J""","""crystal palace fc: glad all over""","""rock | pop"""
"""B00000HYU6""","""best of the best""","""world | pop | rock"""
"""B00004R7T0""","""survival sickness""","""alternative rock | rock"""
"""B000MTFFTK""","""bill engvall: 15 off cool (2007)""","""miscellaneous tv movies & tv"""
"""B00004TBUQ""","""lotus""","""hard rock & metal"""
"""B0001Q4BL2""","""thomas the tank engine and friends - totally thomas 3-disc collection with train (make someone happ…","""tv movies & tv | children's"""
"""B00008F4RT""","""boomin in ya jeep""","""dance & electronic | rap & hip-hop"""
"""B00005YWJN""","""rotary downs""","""broadway & vocalists | alternative rock | rock"""
"""B00067W2QC""","""the messenger""","""jazz | r&b"""
"""B000BNC3W4""","""full metal panic tsr (the second raid) original soundtrack [audio cd]""","""soundtracks"""


In [33]:
get_content_based_recs(
    df=unique_prod_df,
    item_tfidf_matrix=item_tfidf_matrix,
    pipe_tfidf=pipe_tfidf,
    item_name="hidden in the stomach",
    # user_profile="rap|hip-hop",
    top_n=5,
)

title,productId,product_category
str,str,str
"""hidden in the stomach""","""B0002F4COI""","""jazz"""
"""mas cara que espalda""","""B0002B7DG6""","""latin | world"""
"""os reis do ritmo""","""B00008Y49U""","""world | latin"""
"""coleccion de oro""","""B000068D0W""","""rock | latin | world"""
"""live: out on the road""","""B00008FHSJ""","""folk | miscellaneous | pop"""


In [34]:
# Neighbourhood-based Collaborative Filtering