## Using the provided `listenbrainz_model.py` file

```python
import os
import pandas as pd
import numpy as np
import json

# Import our model functions from the provided module.
# Make sure that the project root is in PYTHONPATH.
import listenbrainz_model as lb

# Define the working directory (adjust if needed)
working_root = "/mnt/j/MusicBrainz/working"

# Step 1: Load the data matrix
data_matrix_path = os.path.join(working_root, "userid-artist-counts.csv")
matrix_artists, matrix_users, plays = lb.load_data_matrix(data_matrix_path)
print(f"Data matrix shape: {plays.shape}")

# Step 2: Build the ALS model
model = lb.build_model(plays)
print("Model training complete!")

# Step 3: Load the artist mapping.
# Option 1: Use our JSON file created earlier
artist_map_path = os.path.join(working_root, "artist_mapping.json")
with open(artist_map_path, "r", encoding="utf-8") as f:
    artist_mapping = json.load(f)

# Option 2: Use the provided get_artist_map function if using the MusicBrainz CSV directly:
# musicbrainz_artist_csv = "/mnt/j/MusicBrainz/musicbrainz_artist.csv"
# artist_mapping = lb.get_artist_map(musicbrainz_artist_csv)

# Step 4: Query the model for similar artists.
# For demonstration, we select an artist from the matrix.
# We'll choose the first artist from matrix_artists (you can choose a different one if desired)
target_artist_mbid = matrix_artists[0]
print(f"Using target artist MBID: {target_artist_mbid}")

# Find the index of the target artist
try:
    target_index = lb.artist_index(matrix_artists, target_artist_mbid)
except ValueError as e:
    print(f"Error: {e}")
    target_index = None

if target_index is not None:
    # Retrieve similar items from the model
    similar_ids, scores = model.similar_items(target_index, N=20)

    # Prepare a DataFrame with MBIDs, names, and similarity scores
    similar_artists = []
    for idx, score in zip(similar_ids, scores):
        mbid = matrix_artists[idx]
        name = artist_mapping.get(mbid, "unknown")
        similar_artists.append((mbid, name, score))

    df_similar = pd.DataFrame(similar_artists, columns=["Artist MBID", "Artist Name", "Score"])
    print(df_similar)
```


```text
Data matrix shape: (9087, 163209)
  0%|          | 0/15 [00:00<?, ?it/s]
Model training complete!
Using target artist MBID: 00006766-a163-44eb-b6f1-1d82973b95ec
                             Artist MBID                 Artist Name     Score
0   00006766-a163-44eb-b6f1-1d82973b95ec                       Heron  1.000000
1   488bc172-1e96-43ac-874f-2aa6c68a8d5f        The Depth Beneath Us  0.977081
2   2255a6ed-3149-4f1c-a127-ccf6f320e63b            Deer Park Ranger  0.975456
3   7dec8f78-33d5-42d4-95ca-3a1b314b591c            A River Crossing  0.975456
4   28d18fb2-80d1-4ebc-8b2a-491458b35161                       Waves  0.975455
5   bcc19451-48f0-4753-b2ad-f87720ea8570  Old Seas / Young Mountains  0.975455
6   44f979db-4415-46c2-94ee-217427cf5e36                     Soonago  0.975455
7   f6819f90-9a94-4cf2-a8a5-638135b23780                  Satellites  0.975455
8   38a689b1-31ba-4d2e-b314-c62f7a7f4f85            Six Days of Calm  0.975455
9   d3b97060-f1c6-4d92-a1e9-46f1b8a50a80            Catch The Breeze  0.975455
10  defbcce5-e467-4d4e-99f0-91624086b340
```

## Model Building with `listenbrainz_model.py`

We used the provided `listenbrainz_model.py` module to build and query our collaborative filtering model. The key steps were as follows:

1. **Loading the Data Matrix**
   - **Function Used:** `load_data_matrix(user_artist_counts_path)`
   - **Process:**  
     - The aggregated CSV file (`userid-artist-counts.csv`), which contains user IDs, artist MBIDs, and play counts, is loaded using Pandas.
     - The `user` and `artist` columns are treated as categorical data, and their codes are used to construct a sparse matrix (using SciPy's `coo_matrix`), representing the number of listens per user-artist pair.
   - **Output:**  
     - `matrix_artists`: An array (Pandas Categorical) containing artist MBIDs.
     - `matrix_users`: An array containing user IDs.
     - `plays`: A sparse matrix of shape *(num_users, num_artists)* holding the listen counts.

2. **Training the ALS Model**
   - **Function Used:** `build_model(plays)`
   - **Process:**  
     - BM25 weighting is applied to the sparse matrix to adjust the raw play counts, reducing the influence of extremely popular items and over-active users.
     - The Alternating Least Squares (ALS) model from the Implicit library is then initialized with parameters (e.g., 64 latent factors, regularization of 0.05, and an alpha of 2.0) and trained on the weighted matrix.
   - **Output:**  
     A trained ALS model that can be used to compute similarities between items (artists).

3. **Loading the Artist Mapping**
   - **Function Used:** `get_artist_map(musicbrainz_artist_path)`  
     *(The run above instead loads `artist_mapping.json`, a JSON mapping generated earlier from the `musicbrainz_artist.csv` file.)*
   - **Process:**  
     - The mapping file provides a dictionary that links each artist MBID to its textual name.
     - The model itself works with integer indices, which `matrix_artists` translates to MBIDs; this mapping then turns those MBIDs into human-readable artist names.
   - **Output:**  
     A dictionary mapping artist MBIDs to their names.

4. **Querying the Model for Similar Artists**
   - **Function Used:** `artist_index(artists, artist_mbid)` along with the model's `similar_items` method.
   - **Process:**  
     - First, we identify the index of a target artist in the `matrix_artists` array using `artist_index()`.
     - Next, we call the model's `similar_items` method with the target artist’s index to obtain a list of similar artist indices and their corresponding similarity scores.
     - Finally, we convert these indices back to artist MBIDs and use the artist mapping to display the artist names.
   - **Output:**  
     A list (displayed as a Pandas DataFrame) of similar artists, showing each artist's MBID, name, and similarity score.
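
Steps 1 and 2 can be sketched on toy data as follows. This is a minimal illustration rather than the code in `listenbrainz_model.py`, whose source is not shown here: the `plays` column name, the helper name `bm25_weight_sketch`, the constants `k1` and `b`, and the particular smoothed BM25 formula are all assumptions.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Toy stand-in for userid-artist-counts.csv (the "plays" column name is assumed).
counts = pd.DataFrame(
    {
        "user": ["u1", "u1", "u2", "u3", "u3"],
        "artist": ["a1", "a2", "a1", "a2", "a3"],
        "plays": [5, 1, 3, 2, 7],
    }
)

# Categorical codes give every distinct user and artist a dense integer index.
counts["user"] = counts["user"].astype("category")
counts["artist"] = counts["artist"].astype("category")
matrix_users = counts["user"].cat.categories
matrix_artists = counts["artist"].cat.categories

# Rows are users and columns are artists, as in the shape described above.
plays = coo_matrix(
    (
        counts["plays"].to_numpy(dtype=np.float64),
        (counts["user"].cat.codes.to_numpy(), counts["artist"].cat.codes.to_numpy()),
    ),
    shape=(len(matrix_users), len(matrix_artists)),
)


def bm25_weight_sketch(matrix, k1=100.0, b=0.8):
    """BM25-style reweighting of a user-by-artist matrix (illustrative only).

    Assumes at most one stored entry per (user, artist) pair.
    """
    m = coo_matrix(matrix, dtype=np.float64)
    n_users, n_artists = m.shape

    # Artists heard by many users receive a smaller inverse-frequency weight.
    listeners = np.bincount(m.col, minlength=n_artists)
    idf = np.log1p((n_users - listeners + 0.5) / (listeners + 0.5))

    # Users with many total plays have each of their plays discounted.
    user_totals = np.asarray(m.sum(axis=1)).ravel()
    length_norm = (1.0 - b) + b * user_totals / user_totals.mean()

    data = m.data * (k1 + 1.0) / (k1 * length_norm[m.row] + m.data) * idf[m.col]
    return coo_matrix((data, (m.row, m.col)), shape=m.shape)


weighted = bm25_weight_sketch(plays)
print(plays.toarray())
print(weighted.toarray().round(3))
```

The weighted matrix keeps the same sparsity pattern as `plays`; only the stored values change, which is what lets it be passed to the ALS model in place of the raw counts.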

This pipeline, from data matrix creation through model training to querying, forms the backbone of our collaborative filtering system. It generates artist recommendations from implicit listening behavior and presents them with human-readable names for evaluation and analysis.
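
To make the querying step concrete, the sketch below ranks artists by cosine similarity between latent factor vectors. It assumes the model's scores are cosine similarities, which is consistent with the target artist scoring 1.0 against itself in the output above. The factor values, MBID placeholders, artist names, and both helper names are hypothetical and are not taken from `listenbrainz_model.py` or the Implicit library.

```python
import numpy as np


def artist_index_sketch(artists, mbid):
    """Return the position of `mbid` in `artists`; raise ValueError if absent."""
    matches = np.flatnonzero(np.asarray(artists) == mbid)
    if matches.size == 0:
        raise ValueError(f"artist {mbid!r} not found in the matrix")
    return int(matches[0])


def similar_items_sketch(item_factors, index, n=3):
    """Rank items by cosine similarity to the item at `index`."""
    norms = np.linalg.norm(item_factors, axis=1)
    norms[norms == 0] = 1.0  # leave all-zero vectors as zeros instead of dividing by 0
    unit = item_factors / norms[:, None]
    scores = unit @ unit[index]
    top = np.argsort(-scores)[:n]  # highest similarity first
    return top, scores[top]


# Hypothetical 2-dimensional latent factors for four artists.
toy_artists = ["mbid-a", "mbid-b", "mbid-c", "mbid-d"]
toy_names = {"mbid-a": "Artist A", "mbid-b": "Artist B", "mbid-c": "Artist C"}
toy_factors = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [-1.0, 0.0]])

target = artist_index_sketch(toy_artists, "mbid-a")
ids, sims = similar_items_sketch(toy_factors, target, n=3)
for idx, sim in zip(ids, sims):
    mbid = toy_artists[idx]
    print(mbid, toy_names.get(mbid, "unknown"), round(float(sim), 4))
```

Because the vectors are normalized before the dot product, an artist with a longer factor vector (here `mbid-b`) is not favored merely for its magnitude; only the direction in latent space matters.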
