# Collaborative Filtering with ListenBrainz: A Short Report

**Student:** Carlos Eduardo Patiño Gómez

**Teachers:** Alastair Porter and Dmitry Bogdanov  
**Date:** 19 February 2025

All code for this project is available in our repository:  
[https://github.com/cepatinog/ListenBrainz_project](https://github.com/cepatinog/ListenBrainz_project)

---

## Introduction

This project builds a collaborative filtering model that identifies similar musical artists from the listening histories of ListenBrainz users. Given the massive scale of the raw data (over 90 million listening events spanning 10+ years), we developed a preprocessing pipeline that reduces the listens to user–artist listen counts. For development, we used only the listens from January 2024, stored on an external hard drive at `J:\MusicBrainz`, and ran all code locally under WSL (Ubuntu). The approach can be extended to the entire year by processing each month sequentially and aggregating the results.
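As a rough illustration of the month-by-month extension mentioned above, the sketch below sums per-month count files into yearly totals. It is a minimal example, not the project's code: the function name `merge_monthly_counts` and the column names `user_id`, `artist_mbid`, and `count` are assumptions made for this example, and small in-memory files stand in for real monthly outputs.

```python
import csv
import io
from collections import Counter


def merge_monthly_counts(monthly_files):
    """Sum listen counts across several per-month CSV files.

    Each file is assumed to have the columns user_id, artist_mbid, count
    (hypothetical header names, not confirmed by the project).
    """
    totals = Counter()
    for handle in monthly_files:
        for row in csv.DictReader(handle):
            totals[(row["user_id"], row["artist_mbid"])] += int(row["count"])
    return totals


# Two tiny in-memory "months" stand in for real files on disk.
january = io.StringIO("user_id,artist_mbid,count\n1,artist-a,3\n2,artist-b,1\n")
february = io.StringIO("user_id,artist_mbid,count\n1,artist-a,2\n1,artist-b,4\n")

yearly = merge_monthly_counts([january, february])
```

Because only the running totals are kept in memory, each month can be processed and discarded before the next one is read.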



---

## Data Preprocessing

1. **Data Extraction:**  
   - We read the ListenBrainz JSON Lines files (for January 2024) and extracted the `user_id` and `recording_msid` (MessyBrainz ID).  
   - Output: `userid-msid.csv`.

2. **MSID Mapping Filtering:**  
   - We filtered the large ListenBrainz MSID mapping file to retain only rows with MSIDs present in our data and with a match quality of either `exact_match` or `high_quality`.  
   - Output: `small_msid_mapping.csv`.

3. **Canonicalization & Artist Extraction:**  
   - User–MSID data was joined with the filtered mapping to obtain user–Recording MBID pairs.  
   - Canonical redirects (from `canonical_recording_redirect.csv`) were applied, and canonical metadata (from `canonical_musicbrainz_data.csv`) was used to extract artist information. For recordings with multiple artists, only the first artist was selected.  
   - Output: `userid-artist.csv`.

4. **Aggregation:**  
   - Individual user–artist events were aggregated to compute listen counts.  
   - Output: `userid-artist-counts.csv`, located in the `data/processed` directory.
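
The steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the project's implementation: the JSON keys `user_id` and `recording_msid` are assumed to be top-level fields, the mapping column names (`recording_msid`, `recording_mbid`, `match_type`) are assumed, the canonical redirect step is omitted, and the recording-to-artist lookup is reduced to a dictionary.

```python
import json
from collections import Counter

# Match qualities retained in step 2 of the pipeline.
KEPT_MATCH_TYPES = {"exact_match", "high_quality"}


def extract_user_msid(lines):
    """Yield (user_id, recording_msid) pairs from JSON Lines text."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        listen = json.loads(line)
        msid = listen.get("recording_msid")  # assumed top-level key
        if msid is not None:
            yield listen["user_id"], msid


def filter_mapping(mapping_rows, wanted_msids):
    """Keep MSID -> MBID entries that occur in our data and are well matched."""
    return {
        row["recording_msid"]: row["recording_mbid"]
        for row in mapping_rows
        if row["recording_msid"] in wanted_msids
        and row["match_type"] in KEPT_MATCH_TYPES
    }


def count_listens(pairs, msid_to_mbid, mbid_to_artist):
    """Aggregate listens into (user_id, artist) -> listen count."""
    counts = Counter()
    for user_id, msid in pairs:
        mbid = msid_to_mbid.get(msid)
        artist = mbid_to_artist.get(mbid)
        if artist is not None:  # drop listens without a usable mapping
            counts[(user_id, artist)] += 1
    return counts


# Made-up listens and mapping rows, for illustration only.
raw = [
    '{"user_id": 1, "recording_msid": "m1"}',
    '{"user_id": 1, "recording_msid": "m1"}',
    '{"user_id": 2, "recording_msid": "m2"}',
    '{"user_id": 2, "recording_msid": "m3"}',
]
pairs = list(extract_user_msid(raw))
mapping = filter_mapping(
    [
        {"recording_msid": "m1", "recording_mbid": "r1", "match_type": "exact_match"},
        {"recording_msid": "m2", "recording_mbid": "r2", "match_type": "low_quality"},
        {"recording_msid": "m3", "recording_mbid": "r3", "match_type": "high_quality"},
    ],
    wanted_msids={msid for _, msid in pairs},
)
counts = count_listens(pairs, mapping, {"r1": "artist-a", "r3": "artist-b"})
```

In this toy input, the listen of `m2` is discarded because its mapping is `low_quality`, which mirrors the filtering rule in step 2.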

---

## Model Building

Using the aggregated listen counts, we built our collaborative filtering model as follows:

1. **Data Matrix Construction:**  
   - The `userid-artist-counts.csv` file was loaded and converted into a sparse matrix (users as rows, artists as columns) using functions from `listenbrainz_model.py`.

2. **ALS Model Training:**  
   - BM25 weighting was applied to the sparse matrix, and an Alternating Least Squares (ALS) model was trained using the Implicit library with 64 latent factors, a regularization parameter of 0.05, and an alpha of 2.0.

3. **Querying for Similar Artists:**  
   - We implemented functions to map artist MBIDs to indices and queried the model using its `similar_items` method.  
   - The recommendations were then mapped to human-readable artist names using our artist mapping (generated from `musicbrainz_artist.csv`).
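
To make the querying step more concrete, the sketch below shows the idea in plain Python: artist MBIDs are mapped to integer indices, and neighbours are ranked by cosine similarity between latent factor vectors. This is a conceptual illustration only. ALS training is not shown, the factor vectors are hand-written stand-ins for trained ones, and all names (`build_index`, `similar_artists`, the placeholder IDs) are assumptions for this example rather than functions from `listenbrainz_model.py` or the Implicit library.

```python
import math


def build_index(ids):
    """Map each distinct ID to an integer index, in first-seen order."""
    index = {}
    for item in ids:
        index.setdefault(item, len(index))
    return index


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_artists(mbid, artist_index, item_factors, names, n=2):
    """Return the n artists whose factor vectors are closest to `mbid`.

    The query artist itself is excluded from the result.
    """
    query = item_factors[artist_index[mbid]]
    scored = [
        (names[other], cosine(query, item_factors[idx]))
        for other, idx in artist_index.items()
        if other != mbid
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Placeholder IDs and hand-written 2-dimensional factors (not trained values).
artist_index = build_index(["a", "b", "c", "a"])
item_factors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
names = {"a": "Artist A", "b": "Artist B", "c": "Artist C"}

result = similar_artists("a", artist_index, item_factors, names)
```

Here `Artist B` ranks first because its vector points in nearly the same direction as the query vector, while `Artist C` is orthogonal to it.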

---

## System Architecture

All paths in this project point to our external hard drive, `J:\MusicBrainz`. The main directories are summarized below.

**Key Points:**
- **Data Storage:** Processed outputs (including the final aggregated matrix) are in `data/processed`.
- **Notebooks:** Interactive exploration, model building, and final reporting are handled in the `notebooks` directory.
- **Source Code:** All scripts for data preprocessing and utility functions are organized within `src/`.

---

## Observations and Comparison

- **Model Performance:**  
  Our ALS model, trained on January 2024’s data, generated intuitive similar artist recommendations. For example, querying for similar artists to The Beatles (MBID: `b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d`) produced suggestions such as John Lennon and The Rolling Stones.

- **External Comparisons:**  
  When comparing these results with recommendations from Spotify, we observed significant overlap, although some differences in ranking and additional suggestions were noted. These discrepancies likely arise from differences in data sources, weighting schemes, and user demographics.

- **Data Representativeness:**  
  Our model currently uses data from a single month; scaling the process to all 12 months of 2024 would improve robustness. For comparison, the Last.fm 360K dataset contains 359,347 unique users, which suggests that a larger, more diverse dataset would yield more representative recommendations.
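
One way to put a number on the overlap with an external service is an overlap-at-k score: the fraction of our top-k artists that also appear in the other service's top-k. The sketch below is a suggestion rather than something measured in this project; the function name `overlap_at_k` is hypothetical and the artist lists are placeholders, not real recommendations.

```python
def overlap_at_k(ours, theirs, k):
    """Fraction of our top-k artists that also appear in their top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_ours = ours[:k]
    top_theirs = set(theirs[:k])
    if not top_ours:
        return 0.0
    shared = sum(1 for artist in top_ours if artist in top_theirs)
    return shared / len(top_ours)


# Placeholder rankings, for illustration only.
ours = ["Artist A", "Artist B", "Artist C", "Artist D"]
theirs = ["Artist B", "Artist E", "Artist A", "Artist F"]

score_at_4 = overlap_at_k(ours, theirs, 4)
score_at_3 = overlap_at_k(ours, theirs, 3)
```

The score ignores ranking order within the top-k, so it captures shared suggestions but not the ranking differences noted above.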

---

## Conclusion

This project demonstrates an end-to-end pipeline, from data extraction and preprocessing to ALS model training and recommendation querying, using ListenBrainz data. Although developed on a single month’s data, the approach can be extended to the entire year by processing each month in turn. The code is available in our GitHub repository: [https://github.com/cepatinog/ListenBrainz_project](https://github.com/cepatinog/ListenBrainz_project), and all paths within the project point to our external hard drive (`J:\MusicBrainz`). Our final user–artist count matrix is stored in `data/processed/userid-artist-counts.csv`.

Future work includes refining model parameters, integrating larger datasets, and further comparing recommendations with external music services to address any biases and improve accuracy.

---

*Note: All code used in this project is fully documented and reproducible via the repository and our Documentation notebook.*
