# Collaborative Filtering with ListenBrainz: A Short Report

**Student:** Carlos Eduardo Patiño Gómez

**Teachers:** Alastair Porter and Dmitry Bogdanov  
**Date:** 19 February 2025

---

## Introduction

This project builds a collaborative filtering model that identifies similar musical artists from user listening histories in ListenBrainz. Because the raw data is very large (over 90 million listening events spanning 10+ years), we developed an efficient preprocessing pipeline and restricted this first model to listens from January 2024. This let us build and test the pipeline on a manageable subset while laying the groundwork for scaling up to the entire year.

---

## Data Preprocessing

1. **Data Extraction:**  
   - **Process:** We read the ListenBrainz JSON Lines file for January 2024, extracting `user_id` and `recording_msid` (MessyBrainz ID).  
   - **Output:** A CSV file (`userid-msid.csv`) containing user–MSID pairs.

2. **Mapping Filtering:**  
   - **Process:** We filtered the large ListenBrainz MSID mapping file (nearly 8 GB), retaining only rows whose MSID appears in our extracted data and whose match quality is `exact_match` or `high_quality`.  
   - **Output:** A filtered mapping CSV (`small_msid_mapping.csv`).

3. **Canonicalization & Artist Extraction:**  
   - **Process:** We joined our user–MSID data with the filtered mapping to obtain user–Recording MBID pairs, applied canonical redirects from `canonical_recording_redirect.csv`, and extracted artist information using canonical metadata (`canonical_musicbrainz_data.csv`). For recordings with multiple artists, only the first artist was selected.  
   - **Output:** A CSV file (`userid-artist.csv`) mapping user IDs to artist MBIDs.

4. **Aggregation:**  
   - **Process:** We aggregated individual listening events into listen counts per user–artist pair. The resulting user–artist count table is stored in `userid-artist-counts.csv`.
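
The four steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the project's code: the field names `user_id` and `recording_msid` and the two accepted match qualities come from the description above, while their position as top-level JSON keys, the `(msid, mbid, match_type)` row layout, the in-memory dictionaries standing in for the CSV files, and all function names are hypothetical.

```python
import json
from collections import Counter

ACCEPTED_QUALITY = {"exact_match", "high_quality"}


def extract_pairs(lines):
    """Yield (user_id, recording_msid) from JSON Lines text, skipping incomplete listens."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        listen = json.loads(line)
        user_id = listen.get("user_id")        # assumed top-level key
        msid = listen.get("recording_msid")    # assumed top-level key
        if user_id is not None and msid:
            yield user_id, msid


def filter_mapping(mapping_rows, wanted_msids):
    """Keep MSID -> recording MBID for wanted MSIDs with an accepted match quality.

    Each row is assumed to be (recording_msid, recording_mbid, match_type).
    """
    return {
        msid: mbid
        for msid, mbid, match_type in mapping_rows
        if msid in wanted_msids and match_type in ACCEPTED_QUALITY
    }


def count_user_artists(pairs, msid_to_mbid, redirects, mbid_to_artists):
    """Aggregate listens into a Counter keyed by (user_id, artist_mbid)."""
    counts = Counter()
    for user_id, msid in pairs:
        mbid = msid_to_mbid.get(msid)
        if mbid is None:
            continue  # no sufficiently good mapping for this MSID
        mbid = redirects.get(mbid, mbid)  # apply canonical redirect, if any
        artists = mbid_to_artists.get(mbid)
        if artists:
            counts[(user_id, artists[0])] += 1  # first credited artist only
    return counts


# Toy data, purely illustrative.
sample_lines = [
    '{"user_id": 1, "recording_msid": "m1"}',
    '{"user_id": 1, "recording_msid": "m1"}',
    '{"user_id": 2, "recording_msid": "m2"}',
    '{"user_id": 2, "recording_msid": "m3"}',
]
pairs = list(extract_pairs(sample_lines))
mapping = filter_mapping(
    [
        ("m1", "r1", "exact_match"),
        ("m2", "r2-old", "high_quality"),
        ("m3", "r3", "low_quality"),  # dropped: match quality too low
    ],
    {msid for _, msid in pairs},
)
counts = count_user_artists(
    pairs,
    mapping,
    {"r2-old": "r2"},
    {"r1": ["artist-a", "artist-b"], "r2": ["artist-c"]},
)
```

Streaming the input through generators means the full month of listens never has to sit in memory at once; only the filtered mapping and the final counts do.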

---

## Model Building

Using the preprocessed data, we built our collaborative filtering model as follows:

1. **Data Matrix Construction:**  
   - The aggregated CSV was loaded into a Pandas DataFrame and converted into a sparse matrix (with users as rows and artists as columns) using SciPy.

2. **ALS Model Training:**  
   - We applied BM25 weighting to the matrix and trained an Alternating Least Squares (ALS) model using the Implicit library with parameters such as 64 latent factors, a regularization of 0.05, and an alpha of 2.0.

3. **Querying for Similar Artists:**  
   - Using helper functions from `listenbrainz_model.py`, we mapped artist MBIDs to matrix indices and queried the model to obtain similar artist recommendations.  
   - An artist mapping (from `musicbrainz_artist.csv` or a generated JSON) was used to convert MBIDs into human-readable names.
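
To illustrate steps 1 and 3, the sketch below builds the ID-to-index maps and ranks artists by the cosine similarity of their latent factor vectors. It uses plain Python lists in place of a SciPy sparse matrix and a trained ALS model. The function names and the toy factor values are hypothetical, and the project's own helpers in `listenbrainz_model.py` may work differently.

```python
import math


def build_index(counts):
    """Map user IDs and artist MBIDs to contiguous indices and emit COO triplets.

    `counts` is an iterable of (user_id, artist_mbid, listen_count) rows.
    The returned (rows, cols, data) lists are in the coordinate form that
    scipy.sparse.coo_matrix((data, (rows, cols))) accepts.
    """
    user_index, artist_index = {}, {}
    rows, cols, data = [], [], []
    for user_id, artist_mbid, n in counts:
        rows.append(user_index.setdefault(user_id, len(user_index)))
        cols.append(artist_index.setdefault(artist_mbid, len(artist_index)))
        data.append(float(n))
    return user_index, artist_index, (rows, cols, data)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_artists(artist_mbid, artist_index, artist_factors, top_n=5):
    """Return up to `top_n` (mbid, score) pairs most similar to `artist_mbid`."""
    query = artist_factors[artist_index[artist_mbid]]
    scored = [
        (mbid, cosine(query, artist_factors[idx]))
        for mbid, idx in artist_index.items()
        if mbid != artist_mbid
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy data, purely illustrative: three artists with 2-dimensional factors.
user_index, artist_index, triplets = build_index(
    [(10, "a", 3), (10, "b", 1), (20, "a", 2), (20, "c", 5)]
)
toy_factors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # rows follow artist_index order
ranked = similar_artists("a", artist_index, toy_factors)
```

In the real pipeline the factor rows would come from the trained model, and the MBIDs in the result would be passed through the artist mapping to obtain readable names.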

---

## Observations and Comparison

- **Recommendations:**  
  For example, querying the model for The Beatles (MBID: `b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d`) returned John Lennon, The Rolling Stones, and The Who among the similar artists. These results match our expectations based on known musical relationships.
  
- **External Comparison:**  
  When compared with similar recommendations from Spotify, we observed substantial overlap (e.g., The Rolling Stones and The Who appeared in both sets) but also some differences in ranking and additional suggestions. Such differences may stem from variations in data sources, model parameters, and user demographics.

- **Data Model Insights:**  
  Our model was built using listening events from the first month of 2024, involving *[insert number]* unique users. In contrast, the Last.fm 360K dataset contains 359,347 users, suggesting that a model built on a larger dataset might be more representative. Scaling up to include data for the entire year would likely improve the robustness and diversity of the recommendations.
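
One way to make the comparison with an external service more concrete is to quantify the overlap between two recommendation lists. The sketch below reports the shared artists and the Jaccard index (intersection size divided by union size). The function name is hypothetical, and the artist names are placeholders, not actual output of our model or of Spotify.

```python
def overlap_report(ours, theirs):
    """Compare two recommendation lists, ignoring case and surrounding whitespace."""
    a = {name.strip().lower() for name in ours}
    b = {name.strip().lower() for name in theirs}
    shared = a & b
    union = a | b
    return {
        "shared": sorted(shared),
        "only_ours": sorted(a - b),
        "only_theirs": sorted(b - a),
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Placeholder names, purely illustrative.
report = overlap_report(
    ["Artist A", "Artist B", "Artist C"],
    ["artist b", "Artist C ", "Artist D"],
)
```

Matching on names is fragile (spelling variants, artists sharing a name); where the external service exposes MusicBrainz IDs, comparing MBIDs would be more reliable.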

---

## Future Work and Extensions

- **Scaling the Pipeline:**  
  While we focused on January 2024 for development, our pipeline is designed to process data month by month. Extending the model to the entire year involves sequentially processing each month's data, aggregating the results, and re-training the model on the combined dataset.

- **Model Tuning and Evaluation:**  
  Further fine-tuning of model parameters and additional comparisons with external services (such as Spotify and Last.fm) will help refine the recommendations and address potential biases.
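
The month-by-month scaling described above amounts to summing the per-month count tables before re-training. A minimal sketch, assuming each month has already been reduced to a mapping from (user, artist) to listen count; the function name and the toy values are hypothetical.

```python
from collections import Counter


def merge_monthly_counts(monthly_counts):
    """Sum per-month {(user_id, artist_mbid): count} mappings into one Counter."""
    total = Counter()
    for month in monthly_counts:
        total.update(month)  # Counter.update adds counts instead of replacing them
    return total


# Toy data, purely illustrative.
january = {(1, "artist-a"): 2, (2, "artist-b"): 1}
february = {(1, "artist-a"): 3, (3, "artist-c"): 4}
merged = merge_monthly_counts([january, february])
```

Because the merge is a plain sum, months can be processed independently and in any order, and a failed month can be re-run without touching the others.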

---

## Conclusion

Our end-to-end pipeline (data extraction, mapping filtering, canonicalization, aggregation, and ALS model training) demonstrates that ListenBrainz data can be used to generate meaningful artist similarity recommendations. Although our initial model is based on a single month of data, scaling the approach to the entire year should yield more robust and representative recommendations. Future work will focus on model refinement and broader evaluation against external music recommendation services.

---

*Note: All code is fully documented and reproducible via the project repository and Documentation notebook. Additionally, any use of LLMs or coding assistants has been recorded, and adjustments were made as needed to ensure accuracy.*
