# Data Preprocessing Documentation

## Overview

This notebook documents the preprocessing steps we have completed so far for the ListenBrainz collaborative filtering project. Our objective is to convert the ListenBrainz data dump into a format that can be used to build a user–artist listen count matrix. 

So far, we have completed the following steps:

1. **Extracting Listen Events:**
   - **Task:**  
     Read the ListenBrainz JSON Lines file and extract the `user_id` and `recording_msid` fields.
   - **Process:**
     - We decompressed the raw file (e.g., `1.listens.zst`) using a utility function (`decompress_if_needed`) that handles `.zst` files.
     - We streamed through the JSON Lines file line-by-line, extracted the required fields, and wrote them to a CSV file named `userid-msid.csv`.
   - **Result:**  
     The file `userid-msid.csv` contains records in the following format:
     ```
     user_id,recording_msid
     24848,92fc76b4-bda0-4c2c-82b9-1ef4d489c071
     24076,002091de-47f9-49d0-8da9-9af50e28f06e
     22845,a6877bfb-0256-4471-82ad-2e60a78329c7
     2966,1f1eae6d-e858-4236-9ffa-f4c8bb15d9c5
     ```

2. **Filtering the ListenBrainz MSID Mapping File:**
   - **Task:**  
     Map the MessyBrainz IDs (MSIDs) in our `userid-msid.csv` file to canonical MusicBrainz Recording IDs (MBIDs) using the large MSID mapping file.
   - **Process:**
     - We first extracted all unique MSIDs from `userid-msid.csv`.
     - We then processed the large mapping file (`listenbrainz_msid_mapping.csv-003.zst`) line by line, with a `tqdm` progress bar for visibility, keeping only the rows where:
       - The `recording_msid` is present in our unique MSID set.
       - The `match_type` column has acceptable values (we chose `"exact_match"` and `"high_quality"`).
     - We adjusted our code to use the correct column names from the mapping file: `recording_msid`, `recording_mbid`, and `match_type`.
     - To maintain clean and modular code, we factored out common functions (like `decompress_if_needed`) into a shared utility module (`src/utils/file_utils.py`).
   - **Result:**  
     The filtered mapping file (`small_msid_mapping.csv`) contains only the rows relevant to our processing. At 1,874,637 of the mapping file's 111,339,169 rows (see the log below), it is small enough to load into memory for subsequent processing steps.

      ```
      Loaded 3013073 unique MSIDs from /mnt/j/MusicBrainz/working/userid-msid.csv
      Filtering mapping file: 111339169it [05:39, 328388.01it/s]
      Processed 111339169 rows from /mnt/j/MusicBrainz/listenbrainz_msid_mapping.csv-003
      Kept 1874637 rows in the filtered mapping file: /mnt/j/MusicBrainz/working/small_msid_mapping.csv
      ```

## Code Modules and Structure

- **Extraction Module:**  
  - `src/preprocessing/extract_listens.py`  
    Contains code to read and inspect the JSON Lines file and extract `user_id` and `recording_msid` values to `userid-msid.csv`.

- **Filtering Module:**  
  - `src/preprocessing/filter_mapping.py`  
    Uses the unique MSIDs from `userid-msid.csv` to filter the large mapping file. The acceptable match types are set to `"exact_match"` and `"high_quality"`. The resulting file is `small_msid_mapping.csv`.

- **Shared Utilities:**  
  - `src/utils/file_utils.py`  
    Contains helper functions such as `decompress_if_needed` to manage file decompression across modules.


# ListenBrainz Data Preprocessing Documentation

This notebook documents the preprocessing steps performed for the ListenBrainz collaborative filtering project. It includes code to:
1. Extract listen events from the raw JSON Lines file.
2. Filter the large MSID mapping file to obtain a smaller, relevant mapping.
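3. Canonicalize the mapping and extract artist information to create a user–artist mapping.
4. Aggregate the user–artist events into listen counts.
5. Build an artist MBID-to-name mapping for displaying results.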

Each step is executed via Python scripts located in the `src/preprocessing/` directory. The outputs are stored on our external drive (mounted under `/mnt/j/MusicBrainz/`).

Let's begin!


## Step 1: Extract Listen Events

In this step, we run the extraction script to read the raw ListenBrainz JSON Lines file and extract the `user_id` and `recording_msid` fields. The output is saved as `userid-msid.csv` in our working directory.
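
The extraction described above can be sketched roughly as follows. This is a minimal illustration, not the contents of `extract_listens.py`: the function name `extract_listens` is hypothetical, and the sketch assumes that `user_id` and `recording_msid` are top-level keys in each JSON record.

```python
import csv
import io
import json


def extract_listens(lines, out_file):
    """Write (user_id, recording_msid) rows from JSON Lines input to CSV.

    Returns the number of rows written. Lines that are blank, malformed,
    or missing either field are skipped.
    """
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["user_id", "recording_msid"])
    written = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        # Assumption: both fields sit at the top level of each record.
        user_id = record.get("user_id")
        msid = record.get("recording_msid")
        if user_id is None or msid is None:
            continue
        writer.writerow([user_id, msid])
        written += 1
    return written


# Tiny in-memory demo; the real script streams from the decompressed dump.
sample = io.StringIO(
    '{"user_id": 24848, "recording_msid": "92fc76b4-bda0-4c2c-82b9-1ef4d489c071"}\n'
    '{"user_id": 24076}\n'
)
out = io.StringIO()
print(extract_listens(sample, out))
print(out.getvalue())
```

Because the function iterates over lines one at a time, the same logic should work on an open file handle without loading the whole dump into memory.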


In [None]:
#!python src/preprocessing/extract_listens.py /mnt/j/MusicBrainz/1.listens.zst /mnt/j/MusicBrainz/working/userid-msid.csv

## Step 2: Filter the MSID Mapping File

Next, we filter the large ListenBrainz MSID mapping file using our extracted `userid-msid.csv`. We retain only rows where the `recording_msid` is present in our file and the `match_type` is either `exact_match` or `high_quality`. The filtered file is saved as `small_msid_mapping.csv`.
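
A minimal sketch of this filter is shown below, assuming the column names seen in the mapping file (`recording_msid`, `recording_mbid`, `match_type`). The function names `load_unique_msids` and `filter_mapping` are hypothetical and are not taken from `filter_mapping.py`; the progress bar and decompression are left out.

```python
import csv
import io

ACCEPTED_MATCH_TYPES = {"exact_match", "high_quality"}
MAPPING_COLUMNS = ["recording_msid", "recording_mbid", "match_type"]


def load_unique_msids(listens_file):
    """Collect the distinct MSIDs that appear in userid-msid.csv."""
    return {row["recording_msid"] for row in csv.DictReader(listens_file)}


def filter_mapping(mapping_file, wanted_msids, out_file,
                   accepted=ACCEPTED_MATCH_TYPES):
    """Stream the mapping file and keep only the rows we can use.

    Returns (rows_processed, rows_kept).
    """
    reader = csv.DictReader(mapping_file)
    writer = csv.DictWriter(out_file, fieldnames=MAPPING_COLUMNS,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    processed = kept = 0
    for row in reader:
        processed += 1
        # Set membership keeps each check O(1), which matters at ~111M rows.
        if row["recording_msid"] in wanted_msids and row["match_type"] in accepted:
            writer.writerow(row)
            kept += 1
    return processed, kept


# In-memory demo with made-up identifiers.
listens = io.StringIO("user_id,recording_msid\n1,a\n2,b\n1,a\n")
mapping = io.StringIO(
    "recording_msid,recording_mbid,match_type\n"
    "a,m1,exact_match\n"
    "b,,no_match\n"
    "c,m3,high_quality\n"
)
filtered = io.StringIO()
print(filter_mapping(mapping, load_unique_msids(listens), filtered))
print(filtered.getvalue())
```

Holding only the MSID set in memory and streaming the mapping file row by row is what keeps this step feasible for a file of this size.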


In [3]:
# Inspect the first 20 lines of the mapping file (header plus 19 data rows).
from itertools import islice

mapping_file = "/mnt/j/MusicBrainz/listenbrainz_msid_mapping.csv-003"

print("First 20 lines of the mapping file:\n")
with open(mapping_file, "r", encoding="utf-8", errors="replace") as f:
    # islice stops early if the file has fewer than 20 lines.
    for line in islice(f, 20):
        print(line.strip())


First 20 lines of the mapping file:

recording_msid,recording_mbid,match_type
13ca445f-c0dd-4f64-8726-7da78a3821aa,,no_match
54a40ef8-6bfe-4803-b74a-7b93885c2f01,,no_match
21c07966-97c6-4e02-a575-e1b2fadf0d34,,no_match
3b438b25-b9ad-480f-b248-bd06281919e0,,no_match
26d8cc2b-c249-4b70-9f16-4eea1419303c,,no_match
bf3fe1e0-7ae6-4e57-a38e-6ac7d42d33c3,,no_match
937d9d97-ba8b-4ff4-9af7-1b4a485945c1,,no_match
0e9230f2-1d9d-47ab-a44f-dbc788cfbdf1,,no_match
2c9111a7-96e6-4083-8b45-78cadb11796e,,no_match
012868b2-f6d7-40e6-80a4-d89d077e3d9e,,no_match
03feccff-3632-4acf-974b-b787ff7e9bbf,,no_match
364902a8-1067-4804-b8be-d6b801ee4179,,no_match
13a44caa-8109-46ef-ac25-0e89785bdd18,,no_match
273849c3-b1d7-4656-bf67-748c6cac2179,,no_match
da3aea98-c857-4eb7-b4d6-cce8aa0318d7,,no_match
c514427c-89e1-4746-86bb-49f5d5c8e04a,,no_match
67c2a304-1ff5-44ea-a076-9d797ba15c28,,no_match
127b9ad4-20d6-49e0-ab82-5aad8d27b5e1,,no_match
703adaef-f000-4e6c-bd96-6c9c299af991,,no_match


In [4]:
#!python src/preprocessing/filter_mapping.py /mnt/j/MusicBrainz/working/userid-msid.csv /mnt/j/MusicBrainz/listenbrainz_msid_mapping.csv-003.zst /mnt/j/MusicBrainz/working/small_msid_mapping.csv


## Step 3: Canonicalize and Extract Artist Information

In this step, we combine our user–MSID data with the filtered mapping to obtain user–MBID pairs. We then apply the canonical redirect mapping, so that each recording MBID resolves to its canonical recording, and look up the canonical metadata to extract artist information. For simplicity, we keep only the first artist in the artist credit. The final output is a CSV file (`userid-artist.csv`) with one user ID and artist ID pair per converted listen event.
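
The lookup chain for a single event can be sketched as below. This is an illustration under stated assumptions, not the code in `canonicalize.py`: the function names are hypothetical, an MBID with no redirect entry is assumed to be canonical already, and multiple artist MBIDs are assumed to be comma-separated within the `artist_mbids` field.

```python
def first_artist(artist_mbids):
    """Return the first artist MBID from an artist_mbids field.

    Assumption: multiple MBIDs are comma-separated in a single string.
    """
    return artist_mbids.split(",")[0].strip()


def to_user_artist_pairs(events, msid_to_mbid, redirects, artist_by_recording):
    """Yield (user_id, artist_id) for every event that can be resolved.

    events:              iterable of (user_id, recording_msid)
    msid_to_mbid:        recording_msid -> recording_mbid
    redirects:           recording_mbid -> canonical_recording_mbid
    artist_by_recording: canonical recording_mbid -> artist_mbids string
    """
    for user_id, msid in events:
        mbid = msid_to_mbid.get(msid)
        if mbid is None:
            continue  # no accepted MSID-to-MBID mapping
        # Assumption: an MBID without a redirect is already canonical.
        canonical = redirects.get(mbid, mbid)
        artist_mbids = artist_by_recording.get(canonical)
        if not artist_mbids:
            continue  # no artist metadata for this recording
        yield user_id, first_artist(artist_mbids)


# Demo with made-up identifiers.
events = [("1", "msid-a"), ("2", "msid-b"), ("3", "msid-c")]
msid_to_mbid = {"msid-a": "mbid-1", "msid-c": "mbid-3"}
redirects = {"mbid-1": "mbid-canon"}
artist_by_recording = {"mbid-canon": "artist-x,artist-y", "mbid-3": "artist-z"}

pairs = list(to_user_artist_pairs(events, msid_to_mbid, redirects, artist_by_recording))
print(pairs)
```

Events that fail any lookup are dropped, which is one plausible reason the number of converted events is lower than the number of input events.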


In [8]:
!head -n 20 /mnt/j/MusicBrainz/canonical_recording_redirect.csv


recording_mbid,canonical_recording_mbid,canonical_release_mbid
f3f8a7b8-a376-450c-8139-934d2393d49a,ecb125d7-d23e-4d76-8282-745713563110,ffb4aba3-8aa4-479d-9a62-1bbe881804b8
a20f4c73-1f7a-48e7-903b-a34721c13629,ecb125d7-d23e-4d76-8282-745713563110,ffb4aba3-8aa4-479d-9a62-1bbe881804b8
17344c3d-d600-4bb8-ac2d-93cab18ced4e,ecb125d7-d23e-4d76-8282-745713563110,ffb4aba3-8aa4-479d-9a62-1bbe881804b8
7bf54872-7e1b-450c-9af4-385bcba33b78,ecb125d7-d23e-4d76-8282-745713563110,ffb4aba3-8aa4-479d-9a62-1bbe881804b8
910f5db3-9a25-44ba-8f07-9956123c8e00,ecb125d7-d23e-4d76-8282-745713563110,ffb4aba3-8aa4-479d-9a62-1bbe881804b8
354a700b-0cf8-428a-b486-6474cde76277,ecb125d7-d23e-4d76-8282-745713563110,ffb4aba3-8aa4-479d-9a62-1bbe881804b8
7e6539ce-bc87-4a12-b133-ab9cc8bffb6d,ecb125d7-d23e-4d76-8282-745713563110,ffb4aba3-8aa4-479d-9a62-1bbe881804b8
6282a948-8a9e-4f40-8aa8-9b7352f45181,ecb125d7-d23e-4d76-8282-745713563110,ffb4aba3-8aa4-479d-9a62-1bbe881804b8
7300fa87-9740-401f-b53d-7768d687897b,ecb125d7-d23

In [9]:
!head -n 20 /mnt/j/MusicBrainz/canonical_musicbrainz_data.csv


id,artist_credit_id,artist_mbids,artist_credit_name,release_mbid,release_name,recording_mbid,recording_name,combined_lookup,score
1,1,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,4fd4f7ee-cee8-47fd-84d2-8d65e74bd8f7,Nadal en galego,00b1a29d-ad9e-4b64-aed6-281f69f628ae,Catro Mancebos,variousartistscatromancebos,91870
2,1,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,4fd4f7ee-cee8-47fd-84d2-8d65e74bd8f7,Nadal en galego,0aeea6af-3f85-45f3-88ed-8ce2bdedc4c6,Con Un Sombreiro de Palla,variousartistsconunsombreirodepalla,91870
3,1,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,4fd4f7ee-cee8-47fd-84d2-8d65e74bd8f7,Nadal en galego,24f32cf2-127e-45ca-ad19-91ed3ec87409,Nadais de Xanceda,variousartistsnadaisdexanceda,91870
4,1,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,4fd4f7ee-cee8-47fd-84d2-8d65e74bd8f7,Nadal en galego,28e2548b-9c6f-47b7-8ab5-b1735499f291,De Lejos Venimos,variousartistsdelejosvenimos,91870
5,1,89ad4ac3-39f7-470e-963a-56509c546377,Various Artists,4f

In [11]:
!python ../src/preprocessing/canonicalize.py /mnt/j/MusicBrainz/working/userid-msid.csv /mnt/j/MusicBrainz/working/small_msid_mapping.csv /mnt/j/MusicBrainz/canonical_recording_redirect.csv.zst /mnt/j/MusicBrainz/canonical_musicbrainz_data.csv.zst /mnt/j/MusicBrainz/working/userid-artist.csv


Loaded 8839350 user listening events from /mnt/j/MusicBrainz/working/userid-msid.csv
Loaded 1874637 MSID-to-MBID mappings from /mnt/j/MusicBrainz/working/small_msid_mapping.csv
Loaded 6968027 canonical redirects from /mnt/j/MusicBrainz/canonical_recording_redirect.csv
Loaded artist metadata for 27070779 recordings from /mnt/j/MusicBrainz/canonical_musicbrainz_data.csv
Processing user events: 100%|██████| 8839350/8839350 [01:31<00:00, 96509.83it/s]
Processed 8839350 user events; converted 6089272 events to user-artist pairs.
Output written to /mnt/j/MusicBrainz/working/userid-artist.csv
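
As a quick sanity check, the counts in the logs can be turned into coverage rates. The snippet below only restates numbers printed by the scripts; the mapping rate additionally assumes that each kept mapping row corresponds to a distinct MSID, which the logs do not confirm.

```python
# Counts copied from the script logs in this notebook.
unique_msids = 3_013_073      # unique MSIDs loaded from userid-msid.csv
kept_mappings = 1_874_637     # rows kept in small_msid_mapping.csv
total_events = 8_839_350      # user listening events loaded
converted_events = 6_089_272  # events converted to user-artist pairs

# Assumption: one kept mapping row per distinct MSID.
mapping_rate = kept_mappings / unique_msids
conversion_rate = converted_events / total_events
dropped_events = total_events - converted_events

print(f"MSIDs with an accepted mapping: {mapping_rate:.1%}")
print(f"Events converted to user-artist pairs: {conversion_rate:.1%}")
print(f"Events dropped: {dropped_events}")
```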


## Verification and Inspection

After each step, we inspect the output file as a sanity check. For example, we can view the first few lines of `userid-artist.csv` to confirm that it has the expected columns and that user IDs are paired with artist MBIDs.


In [12]:
!head -n 10 /mnt/j/MusicBrainz/working/userid-artist.csv


user_id,artist_id
24076,29266b3d-b5ae-4d09-b721-326246adf68f
22845,744b52c8-509b-4451-abfd-a17d18d4bd1d
2966,b7539c32-53e7-4908-bda3-81449c367da6
31175,875203e1-8e58-4b86-8dcb-7190faf411c5
4942,84825fb6-c98c-4b43-a184-c7f70619f355
27911,f3e2a7d9-c6bb-4848-95e5-04c0a1e2f511
27045,aa7a2827-f74b-473c-bd79-03d065835cf7
34783,3eadae13-fc37-4c6a-ab0c-d23702e9b455
16639,dfa715ac-b536-44df-af43-570d3ea3edec


## Step 4: Aggregate Listen Counts

In this step, we aggregate the user-artist events from `userid-artist.csv` into listen counts. The resulting file, `userid-artist-counts.csv`, contains three columns: `user_id`, `artist_id`, and `listen_count`.
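
One straightforward way to do this aggregation is with `collections.Counter`, sketched below. The function name `aggregate_counts` is hypothetical and the actual `aggregate_counts.py` may work differently; the column names follow the files shown in this notebook.

```python
import csv
import io
from collections import Counter


def aggregate_counts(pairs_file, out_file):
    """Count how often each (user_id, artist_id) pair occurs.

    Returns the number of distinct pairs written.
    """
    reader = csv.DictReader(pairs_file)
    counts = Counter((row["user_id"], row["artist_id"]) for row in reader)

    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["user_id", "artist_id", "listen_count"])
    # Counter keeps keys in first-seen order, so output follows the input.
    for (user_id, artist_id), listen_count in counts.items():
        writer.writerow([user_id, artist_id, listen_count])
    return len(counts)


# In-memory demo with made-up identifiers.
pairs_csv = io.StringIO(
    "user_id,artist_id\n"
    "1,artist-x\n"
    "2,artist-y\n"
    "1,artist-x\n"
)
counts_csv = io.StringIO()
print(aggregate_counts(pairs_csv, counts_csv))
print(counts_csv.getvalue())
```

This keeps one counter entry per distinct pair in memory, so it assumes the number of distinct user–artist pairs fits comfortably in RAM.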


In [13]:
!python ../src/preprocessing/aggregate_counts.py /mnt/j/MusicBrainz/working/userid-artist.csv /mnt/j/MusicBrainz/working/userid-artist-counts.csv


Aggregated counts written to /mnt/j/MusicBrainz/working/userid-artist-counts.csv


## Verification

In [14]:
!head -n 10 /mnt/j/MusicBrainz/working/userid-artist-counts.csv


user_id,artist_id,listen_count
24076,29266b3d-b5ae-4d09-b721-326246adf68f,36
22845,744b52c8-509b-4451-abfd-a17d18d4bd1d,2
2966,b7539c32-53e7-4908-bda3-81449c367da6,64
31175,875203e1-8e58-4b86-8dcb-7190faf411c5,2
4942,84825fb6-c98c-4b43-a184-c7f70619f355,11
27911,f3e2a7d9-c6bb-4848-95e5-04c0a1e2f511,1
27045,aa7a2827-f74b-473c-bd79-03d065835cf7,2
34783,3eadae13-fc37-4c6a-ab0c-d23702e9b455,1
16639,dfa715ac-b536-44df-af43-570d3ea3edec,114


## Step 5: Build the Artist Mapping

In this step, we process the `musicbrainz_artist.csv` file to create a dictionary mapping each artist's MBID to the artist's name. The mapping is saved as a JSON file (`artist_mapping.json`) and will be used later to display human-readable artist names in our recommendation results.
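
A rough sketch of building and saving such a mapping follows. The column names `artist_mbid` and `name` are assumptions, since the layout of `musicbrainz_artist.csv` is not shown in this notebook, and the function names are hypothetical.

```python
import csv
import io
import json


def build_artist_mapping(artist_file, mbid_col="artist_mbid", name_col="name"):
    """Build {artist MBID: artist name} from a CSV with a header row.

    Assumption: the column names passed in match the real file's header.
    """
    mapping = {}
    for row in csv.DictReader(artist_file):
        mbid = row.get(mbid_col)
        name = row.get(name_col)
        if mbid and name:
            mapping[mbid] = name
    return mapping


def save_mapping(mapping, out_file):
    """Write the mapping as indented JSON, keeping non-ASCII names readable."""
    json.dump(mapping, out_file, ensure_ascii=False, indent=2)


# In-memory demo; the MBIDs and names are taken from the output shown below.
artists_csv = io.StringIO(
    "artist_mbid,name\n"
    "fadeb38c-833f-40bc-9d8c-a6383b38b1be,Доктор Сатана\n"
    "49add228-eac5-4de8-836c-d75cde7369c3,Pete Moutso\n"
)
mapping = build_artist_mapping(artists_csv)
mapping_json = io.StringIO()
save_mapping(mapping, mapping_json)
print(mapping_json.getvalue())
```

Passing `ensure_ascii=False` writes names such as "Доктор Сатана" as readable text instead of `\u` escape sequences; when writing to disk, the file should then be opened with `encoding="utf-8"`.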


In [15]:
!python ../src/preprocessing/artist_mapping.py /mnt/j/MusicBrainz/musicbrainz_artist.csv /mnt/j/MusicBrainz/working/artist_mapping.json


Built mapping for 2531408 artists.
Mapping saved to /mnt/j/MusicBrainz/working/artist_mapping.json


## Verification

In [16]:
!head -n 20 /mnt/j/MusicBrainz/working/artist_mapping.json


{
  "fadeb38c-833f-40bc-9d8c-a6383b38b1be": "Доктор Сатана",
  "49add228-eac5-4de8-836c-d75cde7369c3": "Pete Moutso",
  "165a49a0-2b3b-4078-a3c1-905afdc07c0a": "Babyglock",
  "7b4a548e-a01a-49b7-82e7-b49efeb9732c": "Aric Leavitt",
  "60aca66f-e91a-4cb5-9308-b6e293cd833e": "Fonograff",
  "3e1bd546-d2a7-49cb-b38d-d70904a1d719": "Al Street",
  "df120895-f6c6-4a66-b9cf-73350f0beb61": "Love .45",
  "c14f8d3f-ee81-416f-800f-8eff7e77a2e1": "Sintellect",
  "b68a3969-319a-462f-942b-cd35581414fc": "Evie Tamala",
  "2c8ae2e0-3934-440e-81f5-2ec7fd0d7899": "Jean-Pierre Martin",
  "ac63d693-7b24-4258-a3db-09743b1b4269": "Deejay One",
  "4c4b7c6f-9285-4d6a-bc10-e5c9e08045f8": "wecamewithbrokenteeth",
  "055f435f-dba6-4156-9050-6ac41113e45f": "The Blackbelt Band",
  "ab1b631b-9896-4433-bef9-7868bf8a42f3": "Giant Tomo",
  "66de1369-f9eb-43cb-ae4f-88582a47a624": "Elvin Jones & Jimmy Garrison Sextet",
  "1fbb9556-b647-498a-a8ed-d3b5e8d7f85c": "Tobias Lorsbach",
  "e6895f6e-f636-4ff6-b406-f5ddaf6cb243": "

## Summary

This notebook has documented and executed the following preprocessing steps:
- Extracting listen events from the JSON Lines data.
- Filtering the MSID mapping file by match quality.
- Canonicalizing the MBIDs and extracting artist information to form user–artist pairs.
- Aggregating the user–artist pairs into listen counts.
- Building an artist MBID-to-name mapping.

These steps are implemented as modular scripts in the `src/preprocessing/` directory and can be rerun with the commands shown in each step. Future steps will involve building the collaborative filtering model using the aggregated user–artist data.

Please refer to the final report for an in-depth discussion and analysis of the results.
