In [1]:
from sklearnex import patch_sklearn
patch_sklearn()
import polars as pl
from polars.lazyframe.group_by import LazyGroupBy
from sklearn.cluster import KMeans
import numpy as np

Intel(R) Extension for Scikit-learn* enabled (https://github.com/uxlfoundation/scikit-learn-intelex)


Scans a Parquet file into a `LazyFrame`, groups by `parent_asin`, and computes the total number of reviews, the mean rating, and the first brand and category in each group (a null first value is replaced with "Unknown", and the result is lowercased). Brand and category names are then cast to categorical, and their physical integer codes are kept as IDs. Finally, the selected columns are collected into a `DataFrame`.

In [None]:
# group_by returns a LazyGroupBy; agg turns it back into a LazyFrame
grouped: LazyGroupBy = pl.scan_parquet("data/processed/amazon-2023.parquet").group_by("parent_asin")
lf: pl.LazyFrame = grouped.agg([
	pl.len().alias("total_reviews"),
	pl.col("rating").mean().alias("mean_rating"),
	pl.col("brand").first().fill_null("Unknown").str.to_lowercase().alias("brand_name"),
	pl.col("main_category").first().fill_null("Unknown").str.to_lowercase().alias("category_name")
])

lf = lf.with_columns([
	pl.col("brand_name").cast(pl.Categorical).to_physical().alias("brand_id"),
	pl.col("category_name").cast(pl.Categorical).to_physical().alias("category_id")
])

columns: list[str] = ["mean_rating", "total_reviews", "brand_id",
                      "brand_name", "category_id", "category_name"]

df: pl.DataFrame = lf.select(columns).collect(engine="streaming")

The columns `mean_rating`, `total_reviews`, `brand_id`, and `category_id` are selected from the `DataFrame` and converted into a NumPy array. K-means is then fitted with 5 clusters, 10 initializations, and a fixed random seed of 42, and `fit_predict` returns a cluster label for each product.

In [None]:
training_columns: list[str] = ["mean_rating", "total_reviews",
                               "brand_id", "category_id"]

X: np.ndarray = df.select(training_columns).to_numpy()
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
labels: np.ndarray = kmeans.fit_predict(X)
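
K-means uses Euclidean distance, so a feature with a large numeric range can outweigh the others. Here `brand_id` is an integer code that runs into the millions (see the `avg_brand_id` column in the summary), while `mean_rating` stays between 1 and 5. One common option, not used in this notebook, is to standardize the features first. The sketch below assumes scikit-learn's `StandardScaler` and uses synthetic data; `X_demo` and `labels_demo` are hypothetical names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix (not the real data).
X_demo = np.column_stack([
    rng.uniform(1.0, 5.0, 200),        # mean_rating-like values
    rng.integers(1, 50, 200),          # total_reviews-like values
    rng.integers(0, 4_000_000, 200),   # brand_id-like integer codes
]).astype(float)

# Rescale every column to zero mean and unit variance before clustering.
X_scaled = StandardScaler().fit_transform(X_demo)

kmeans_demo = KMeans(n_clusters=5, random_state=42, n_init=10)
labels_demo = kmeans_demo.fit_predict(X_scaled)
print(np.bincount(labels_demo))
```

Scaling changes which features drive the partition, so the resulting clusters would not be comparable to the ones reported below.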

In [None]:
df: pl.DataFrame = df.with_columns(pl.Series("cluster", labels))

summary: pl.DataFrame = df.group_by("cluster").agg([
	pl.len().alias("cluster_size"),
	pl.col("mean_rating").mean().alias("avg_mean_rating"),
	pl.col("total_reviews").mean().alias("avg_total_reviews"),
	pl.col("brand_id").mean().alias("avg_brand_id"),
	pl.col("category_id").mean().alias("avg_category_id"),
	pl.col("brand_id").mode().alias("top_brand_id"),
	pl.col("brand_name").mode().alias("top_brand_name"),
	pl.col("category_name").mode().alias("top_category_name"),
	pl.col("category_id").mode().alias("top_category_id"),
]).sort("cluster")

In [None]:
additional: pl.DataFrame = summary.drop(summary.columns[1:6])
summary: pl.DataFrame = summary.drop(summary.columns[6:])

Two dataframes present the results of the k-means clustering. `summary` holds the cluster sizes along with the per-cluster averages of rating, review count, and brand and category IDs. `additional` holds the most frequent (mode) brand and category in each cluster.

In [28]:
summary

cluster,cluster_size,avg_mean_rating,avg_total_reviews,avg_brand_id,avg_category_id
i32,u32,f64,f64,f64,f64
0,4900703,4.098899,13.111658,656981.198234,185631.833606
1,1520795,4.324956,6.132356,4062700.0,195285.389854
2,1820326,4.279179,7.70994,2733000.0,193409.32291
3,2569403,4.201071,10.548661,1565700.0,190014.075359
4,24554101,4.077346,15.8126,81107.899496,182308.524842


Cluster 0 consists of 4.9 million moderately rated products, mostly books `(avg. rating 4.10, 13.1 reviews)`, from niche brands like abundant earth works.

Cluster 1 includes 1.5 million highly rated books `(avg. rating 4.32, lowest avg. reviews at 6.13)` from brands like vipmvpup.

Cluster 2 features 1.8 million well-rated Kindle books `(avg. rating 4.28, 7.7 reviews)` from authors like kate hoffmann.

Cluster 3 represents 2.6 million solid mid-tier books `(avg. rating 4.20, 10.5 reviews)` from brands like pennzoni.

Cluster 4 contains 24.6 million items, mostly fashion, with the highest average review count `(avg. rating 4.08, 15.8 reviews)`, primarily from unknown brands in amazon fashion.

From a more statistical point of view, an `inverse relationship` between average brand ID and review volume appears to emerge (brand IDs are arbitrary categorical codes, so this does not by itself measure brand prominence), and the highest-rated clusters are not the largest, indicating that mean rating does not rise with cluster size.

In [27]:
additional

cluster,top_brand_id,top_brand_name,top_category_name,top_category_id
i32,list[u32],list[str],list[str],list[u32]
0,[708484],"[""abundant earth works""]","[""books""]",[200611]
1,[3541601],"[""vipmvpup""]","[""books""]",[200611]
2,[2163190],"[""kate hoffmann (author) format: kindle edition""]","[""books""]",[200611]
3,[1128601],"[""pennzoni""]","[""books""]",[200611]
4,[0],"[""unknown""]","[""amazon fashion""]",[200620]


Furthermore, the clustering shows an inverse relationship between cluster size and mean rating across the five clusters (0–4).

Cluster 0 `(4.90 M items; "Abundant Earth Works", Books)` sits at a middling 4.099 rating with 13.11 reviews, suggesting broad but tepid enthusiasm.

Cluster 1 `(1.52 M items; "vipmvpup", Books)` tops the rankings with a 4.325 mean rating yet only 6.13 reviews, which may indicate a small but loyal audience.

Cluster 2 `(1.82 M items; "Kate Hoffmann", Books)` follows closely with a 4.279 rating and 7.71 reviews.

Cluster 3 `(2.57 M items; "Pennzoni", Books)` strikes a balance, with a 4.201 rating and 10.55 reviews.

Finally, Cluster 4 `(24.55 M items; "Unknown", Amazon Fashion)` has the highest review volume (15.81) yet the lowest mean rating (4.077), suggesting that mass exposure may dilute quality perception.

Statistically, the highest-rated clusters are the smallest, while the largest clusters have the lowest mean ratings and the most reviews per product.
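
The ordering claim can be checked directly against the numbers in the `summary` output. The short sketch below copies the per-cluster values from that table (the dictionary name `summary_rows` is just for illustration) and compares the cluster rankings.

```python
# Per-cluster values copied from the `summary` output.
summary_rows = {
    0: {"size": 4_900_703, "rating": 4.098899, "reviews": 13.111658},
    1: {"size": 1_520_795, "rating": 4.324956, "reviews": 6.132356},
    2: {"size": 1_820_326, "rating": 4.279179, "reviews": 7.70994},
    3: {"size": 2_569_403, "rating": 4.201071, "reviews": 10.548661},
    4: {"size": 24_554_101, "rating": 4.077346, "reviews": 15.8126},
}

# Rank the clusters by each metric.
by_rating_desc = sorted(summary_rows, key=lambda c: summary_rows[c]["rating"], reverse=True)
by_size_asc = sorted(summary_rows, key=lambda c: summary_rows[c]["size"])
by_reviews_asc = sorted(summary_rows, key=lambda c: summary_rows[c]["reviews"])

print(by_rating_desc)   # [1, 2, 3, 0, 4]
print(by_size_asc)      # [1, 2, 3, 0, 4]
print(by_reviews_asc)   # [1, 2, 3, 0, 4]
```

All three orderings coincide: the higher a cluster's mean rating, the smaller it is and the fewer reviews per product it has. With only five clusters this is a descriptive pattern, not a tested statistical relationship.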