# Source Data

The input is a CSV file of artist playlist data with one row per playlist entry. It can be generated, for example, like this:

```csharp
// Configure the business logic factory
var context = new MusicCatalogueDbContextFactory().CreateDbContext([]);
var factory = new MusicCatalogueFactory(context);

List<string> lines = ["Playlist,Time Of Day,Type,Artist"];

// Iterate over the times of day
foreach (var tod in Enum.GetValues<TimeOfDay>())
{
    // Playlist creation parameters
    var numberOfEntries = new[] { TimeOfDay.Evening, TimeOfDay.Late }.Contains(tod) ? 5 : 10;
    var numberOfPlaylists = 500;

    for (int i = 0; i < numberOfPlaylists; i++)
    {
        // Alternate between tightly curated and "normal" playlists
        var isCurated = i % 2 == 0;
        var type = isCurated ? "Curated" : "Normal";
        var playlist = isCurated ?
            await factory.ArtistPlaylistBuilder.BuildCuratedArtistPlaylist(tod, numberOfEntries) :
            await factory.ArtistPlaylistBuilder.BuildNormalArtistPlaylist(tod, numberOfEntries);

        // Prefix the ID with the time of day so each playlist ID is unique across the file
        lines.AddRange(playlist.Select(x => $"{tod}-{i},{tod},{type},{x.ArtistName}"));
    }
}

// Write the CSV file
File.WriteAllLines("playlists.csv", lines);
```

In [None]:
TOTAL_ARTISTS = 0     # Set this to the total number of artists in the collection (must be greater than 0)

In [None]:
import pandas as pd

df = pd.read_csv("playlists.csv")
display(df)

# Playlist Diversity

This measures what percentage of the total artists in the collection each playlist represents. It is only revealing if the number of artists per playlist varies, in which case the notes below apply. If the test data always contains the same number of artists per playlist, it is of little use other than as a sanity check.

The playlist builder is designed to generate playlists that target:

- A specific time of day
- Mood-based listening
- Prioritisation of coherence of the artists in the list

The lower the percentage, the more the playlists represent a focussed slice of the collection and the less exploratory they are.
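
As a minimal sketch of the coverage calculation, the example below works through it on toy data using only the standard library. The playlist names, artist names, collection size and the `coverage_pct` helper are all made up for illustration and are not part of the notebook.

```python
# Toy data: every name and number here is made up for illustration
total_artists = 20  # hypothetical size of the whole collection

playlists = {
    "Morning-0": ["Artist A", "Artist B", "Artist C", "Artist A"],
    "Late-0": ["Artist D", "Artist D", "Artist E"],
}


def coverage_pct(artists, total):
    """Percentage of the collection's artists that appear in one playlist."""
    if total <= 0:
        raise ValueError("total must be a positive artist count")
    # Repeated artists count once, as with nunique() in pandas
    return 100 * len(set(artists)) / total


coverage = {
    name: coverage_pct(artists, total_artists)
    for name, artists in playlists.items()
}
print(coverage)
```

In this toy case the "Late-0" playlist covers a smaller slice of the collection than "Morning-0", so it would count as the more focussed of the two. The explicit check on `total` is a design choice for the sketch: a zero total would otherwise make the percentage meaningless.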

In [None]:
playlist_variety = (
    df.groupby(["Type", "Time Of Day", "Playlist"])["Artist"]
      .nunique()
      .rename("unique_artists")
      .to_frame()
)

playlist_variety["coverage_pct"] = (
    playlist_variety["unique_artists"] / TOTAL_ARTISTS * 100
)

playlist_variety.groupby(level=["Type", "Time Of Day"]).describe()

# Artist Overlap Between Playlists

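The Jaccard similarity is used to measure how much two playlists overlap. It is the number of artists the two playlists share divided by the number of distinct artists across both, so 0 means no artists in common and 1 means identical artist sets: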

| Jaccard   | Meaning               |
| --------- | --------------------- |
| < 0.30    | Strongly distinct     |
| 0.30–0.50 | Related but different |
| 0.50–0.65 | Clearly overlapping   |
| 0.65–0.80 | Very similar          |
| > 0.80    | Near duplicates       |
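
To make the bands concrete, the sketch below scores two toy playlists and looks up the matching description. The `overlap_band` helper and the artist names are hypothetical. The table does not say which band a score of exactly 0.50 or 0.65 belongs to, so the sketch assumes the higher one.

```python
def overlap_band(score: float) -> str:
    """Map a Jaccard score in [0, 1] to the description in the table."""
    # Assumption: scores of exactly 0.50 or 0.65 fall into the higher band
    if score < 0.30:
        return "Strongly distinct"
    if score < 0.50:
        return "Related but different"
    if score < 0.65:
        return "Clearly overlapping"
    if score <= 0.80:
        return "Very similar"
    return "Near duplicates"


# Toy playlists with made-up artist names
a = {"Artist A", "Artist B", "Artist C", "Artist D"}
b = {"Artist C", "Artist D", "Artist E"}

# 2 shared artists out of 5 distinct artists across both playlists
score = len(a & b) / len(a | b)
print(score, overlap_band(score))
```

The two playlists share two of their five distinct artists, giving a score of 0.4, which the table describes as related but different.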


In [None]:
def jaccard(a, b):
    a = set(a) if not isinstance(a, set) else a
    b = set(b) if not isinstance(b, set) else b
    return len(a & b) / len(a | b) if (a or b) else 0.0

In [None]:
from itertools import combinations
import numpy as np

def build_jaccard_matrix(playlist_sets_subset: pd.DataFrame) -> pd.DataFrame:
    ids = playlist_sets_subset.index.tolist()
    artist_sets = playlist_sets_subset["artists"].tolist()

    # Fill a NumPy array rather than assigning through .loc, which is slow
    # for large matrices. The diagonal is 1.0: a playlist matches itself.
    values = np.eye(len(ids))
    for i, j in combinations(range(len(ids)), 2):
        values[i, j] = values[j, i] = jaccard(artist_sets[i], artist_sets[j])

    return pd.DataFrame(values, index=ids, columns=ids)

In [None]:
# Keep Type and Time Of Day attached to each playlist id
playlist_meta = (
    df[["Playlist", "Type", "Time Of Day"]]
    .drop_duplicates()
    .set_index("Playlist")
)

# Artist set per playlist (still keyed by Playlist id)
playlist_artists = (
    df.groupby("Playlist")["Artist"]
      .apply(set)
)

# Combine to a single table for convenient filtering
playlist_sets = (
    playlist_meta
    .join(playlist_artists.rename("artists"))
)

# Build the Jaccard similarity matrix for each playlist type
jaccard_curated = build_jaccard_matrix(playlist_sets[playlist_sets["Type"] == "Curated"])
jaccard_normal  = build_jaccard_matrix(playlist_sets[playlist_sets["Type"] == "Normal"])

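The summary statistics of the pairwise scores are interpreted as follows:

| Quantity | Interpretation                                                                                   |
| -------- | ------------------------------------------------------------------------------------------------ |
| mean     | Typical overlap: multiply by 100 for the % of artists shared between two randomly selected playlists |
| std      | The spread of the overlap, indicating how consistently constrained the playlists are             |
| min      | Lower tail: the least similar pair of playlists, i.e. highest diversity                          |
| max      | Upper tail: the most similar pair of playlists, i.e. lowest diversity                            |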
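
As an illustration of how these quantities come out of the pairwise scores, the sketch below computes them for three toy playlists with the standard library. The playlists and the `jaccard_score` helper are made up for the example. `statistics.stdev` is the sample standard deviation, which is also what pandas reports as `std`.

```python
from itertools import combinations
from statistics import mean, stdev

# Toy playlists with made-up artist names, for illustration only
playlists = {
    "P1": {"A", "B", "C", "D"},
    "P2": {"A", "B", "C", "E"},
    "P3": {"A", "F", "G", "H"},
}


def jaccard_score(a: set, b: set) -> float:
    """Shared artists divided by distinct artists across both playlists."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


# One score per unordered pair of playlists
scores = [
    jaccard_score(playlists[p], playlists[q])
    for p, q in combinations(playlists, 2)
]

summary = {
    "mean": mean(scores),
    "std": stdev(scores),
    "min": min(scores),
    "max": max(scores),
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Here P1 and P2 share three of five artists (0.6) while each shares only one of seven with P3, so the mean of just under 0.30 hides one strongly overlapping pair. That is why the spread and the tails are worth reading alongside the mean.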

In [None]:
import pandas as pd
import numpy as np

def upper_triangle_values(mat: pd.DataFrame) -> pd.Series:
    return (
        mat.where(np.triu(np.ones(mat.shape), k=1).astype(bool))
          .stack()
    )

In [None]:
curated_vals = upper_triangle_values(jaccard_curated)
normal_vals  = upper_triangle_values(jaccard_normal)

pd.DataFrame({
    "Curated": curated_vals.describe(),
    "Normal":  normal_vals.describe()
})

The following repeats the overlap calculation within each combination of type and time of day, allowing comparisons such as:

- Curated Late vs Normal Late
- Curated Morning vs Normal Morning

In [None]:
jaccard_by_group = {}

for (typ, tod), subset in playlist_sets.groupby(["Type", "Time Of Day"]):
    if len(subset) >= 2:
        jaccard_by_group[(typ, tod)] = upper_triangle_values(build_jaccard_matrix(subset)).describe()

jaccard_by_group_df = pd.DataFrame(jaccard_by_group).T
jaccard_by_group_df.index = pd.MultiIndex.from_tuples(jaccard_by_group_df.index, names=["Type", "Time Of Day"])
jaccard_by_group_df.sort_index()

The following calculates the overlap across types, comparing every curated playlist with every normal playlist:

> Do curated lists pick from a different pool or tighter subset?


In [None]:
# Reuse playlist_sets (one row per playlist with its artist set) built above,
# rather than redefining it with a different column name
cur = playlist_sets[playlist_sets["Type"] == "Curated"]
nor = playlist_sets[playlist_sets["Type"] == "Normal"]

# Jaccard similarity for every (curated, normal) pair of playlists
cross_vals = pd.Series(
    [jaccard(a, b) for a in cur["artists"] for b in nor["artists"]],
    name="cross_type_jaccard",
)
cross_vals.describe()

# Time of Day Variety

The expectation is that, as the time of day progresses from morning to late, the number of unique artists and the coverage should both fall as the playlists become more curated and less exploratory.

In [None]:
tod_variety = (
    df.groupby(["Type", "Time Of Day"])["Artist"]
      .nunique()
      .rename("unique_artists")
      .to_frame()
)

tod_variety["coverage_pct"] = (
    tod_variety["unique_artists"] / TOTAL_ARTISTS * 100
)

tod_variety.sort_values(["Type", "coverage_pct"], ascending=[True, False])

Time of day variety with normal vs curated side by side:

In [None]:
tod_variety_pivot = (
    tod_variety.reset_index()
      .pivot(index="Time Of Day", columns="Type", values="coverage_pct")
      .sort_index()
)

tod_variety_pivot

# Artist Frequency Distribution

If these numbers show a steep head, a small core of artists dominates the selection for each time of day.
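
A small sketch of what a steep head looks like, using made-up selections for a single group and only the standard library:

```python
from collections import Counter

# Made-up selections for a single (Type, Time Of Day) group
selections = ["Artist A"] * 6 + ["Artist B"] * 2 + ["Artist C", "Artist D"]

counts = Counter(selections)
total = sum(counts.values())

# Share of all selections per artist, most frequent first
shares = {artist: count / total for artist, count in counts.most_common()}
for artist, share in shares.items():
    print(f"{artist}: {share:.0%}")
```

In this toy group one artist accounts for well over half of all selections and the rest tail off quickly, which is the pattern the paragraph above describes.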

In [None]:
# Calculate the artist frequency by Type + time of day
tod_artist_freq = (
    df.groupby(["Type", "Time Of Day", "Artist"])
      .size()
      .rename("count")
      .reset_index()
)

# Normalise within each (Type, Time Of Day) for comparison
tod_artist_freq["share"] = (
    tod_artist_freq
    .groupby(["Type", "Time Of Day"])["count"]
    .transform(lambda x: x / x.sum())
)

# Display the top artists for each (Type, Time Of Day)
for (typ, tod), group in tod_artist_freq.groupby(["Type", "Time Of Day"]):
    display(
        group.sort_values("share", ascending=False).head(10)
             .assign(Type=typ, **{"Time Of Day": tod})
             [["Type", "Time Of Day", "Artist", "count", "share"]]
    )

# Dominance

This asks the question:

> Within a given time of day, what fraction of all artist selections come from the top 5 most-used artists?

Healthy numbers would look something like this:

| Time Of Day | Top-5 Share |
| ----------- | ----------- |
| Late        | 0.35–0.45   |
| Evening     | 0.35–0.50   |
| Morning     | 0.30–0.45   |
| Afternoon   | 0.25–0.40   |
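
The sketch below shows one way to compare a top-5 share against these ranges. The ranges are copied from the table. The `top_n_share_from_counts` and `is_healthy` helpers and the selection counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Healthy top-5 share ranges copied from the table
HEALTHY_TOP5 = {
    "Late": (0.35, 0.45),
    "Evening": (0.35, 0.50),
    "Morning": (0.30, 0.45),
    "Afternoon": (0.25, 0.40),
}


def top_n_share_from_counts(counts: dict, n: int = 5) -> float:
    """Fraction of all selections taken by the n most-selected artists."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(sorted(counts.values(), reverse=True)[:n]) / total


def is_healthy(time_of_day: str, share: float) -> bool:
    """True if the share lies within the range listed for that time of day."""
    low, high = HEALTHY_TOP5[time_of_day]
    return low <= share <= high


# Made-up counts: five popular artists plus twelve selected 5 times each
counts = {"A": 10, "B": 9, "C": 8, "D": 7, "E": 6}
counts.update({f"Other {i}": 5 for i in range(12)})

share = top_n_share_from_counts(counts)
print(share, is_healthy("Late", share))
```

The top five artists take 40 of the 100 selections, a share of 0.4, which sits inside the range given for Late.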

In [None]:
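def top_n_share(shares: pd.Series, n: int = 5) -> float:
    # Sum of the n largest shares within one (Type, Time Of Day) group
    return shares.nlargest(n).sum()

In [None]:
# Select the "share" column before applying, so the function receives a
# Series and the grouping columns are not passed to it
dominance = (
    tod_artist_freq
    .groupby(["Type", "Time Of Day"])["share"]
    .apply(top_n_share, n=5)
    .rename("top_5_share")
    .to_frame()
)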

dominance_pivot = (
    dominance.reset_index()
    .pivot(index="Time Of Day", columns="Type", values="top_5_share")
    .sort_index()
)

dominance_pivot