# Dataset Exploration

Here we explore a dataset of thousands of Spotify songs and their playlist groupings.

First we move to the project root (the directory containing `setup.py`), import the packages we need, and set the palette for our plots.

In [None]:
import os

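# Walk up at most three levels until we reach the project root (marked by
# setup.py), so that imports from `src` and relative data paths resolve.
for _ in range(3):
    if os.path.exists(os.path.join(os.getcwd(), 'setup.py')):
        break
    os.chdir('..')
print('Current working directory:', os.getcwd())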

In [None]:
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import powerlaw

from src.utils.config import get_dataset_path
from src.utils.styling import apply_styling

In [None]:
colors = apply_styling()
palette = colors['palette']

Next, we read the dataset from its Parquet file.

In [None]:
df = pd.read_parquet(get_dataset_path('master_spotify'))
print('Rows:', len(df))
df.head(3)

Which artists and songs appear most often across playlists?

In [None]:
artist_counter = Counter(df['artist'])
song_counter = Counter(df['track'])
print(f'Top artists: {artist_counter.most_common(10)}')
print(f'Top songs: {song_counter.most_common(10)}')
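
One caveat with counting by the `track` column alone is that different songs sharing a title would be merged into one entry. The sketch below uses made-up rows (the artist and track names are hypothetical, not taken from the dataset) to show the difference between counting by title and counting by `(artist, track)` pair. Whether this matters for the real data depends on how many titles collide, which is not checked here.

```python
from collections import Counter

# Hypothetical rows: one row per (playlist, song) appearance.
rows = [
    {'artist': 'Artist A', 'track': 'Home'},
    {'artist': 'Artist A', 'track': 'Home'},
    {'artist': 'Artist B', 'track': 'Home'},
    {'artist': 'Artist B', 'track': 'Away'},
]

# Counting by title merges the two different songs called 'Home'.
by_title = Counter(row['track'] for row in rows)

# Counting by (artist, track) keeps them apart.
by_pair = Counter((row['artist'], row['track']) for row in rows)

print(by_title.most_common(1))
print(by_pair.most_common(1))
```

On the DataFrame above, the analogous call would be `Counter(zip(df['artist'], df['track']))`.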

Let's visualize the distribution of artists and songs in our dataset.

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of artists
n_bins = 40
axs[0].hist(artist_counter.values(), bins=n_bins, color=palette[0])
axs[0].set_title('Artists')
axs[0].set_yscale('log')
# axs[0].ticklabel_format(axis='x', style='sci', scilimits=(0, 0))
axs[0].set_ylabel('No. of artists')
axs[0].set_xlabel('No. of times artist is in a playlist')

# Histogram of songs
axs[1].hist(song_counter.values(), bins=n_bins, color=palette[1])
axs[1].set_title('Songs')
axs[1].set_yscale('log')
axs[1].set_ylabel('No. of songs')
axs[1].set_xlabel('No. of times song is in a playlist')

fig.tight_layout(pad=3.0)
fig.savefig('data/06_viz/artists_songs_histogram.png', bbox_inches='tight')
plt.show()
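
The log-scaled histograms suggest that a few artists account for a large share of all playlist appearances. One simple way to put a number on that, sketched here on hypothetical counts rather than the real data, is to compute the share of appearances held by the top fraction of artists. The helper name `top_share` is an assumption made for illustration.

```python
def top_share(counts, fraction=0.01):
    """Return the share of total appearances held by the top `fraction` of items."""
    ordered = sorted(counts, reverse=True)
    # Always include at least one item, even for very small fractions.
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical playlist-appearance counts for ten artists.
toy_counts = [100, 50, 10, 5, 5, 2, 1, 1, 1, 1]
share = top_share(toy_counts, fraction=0.1)
print(f'Top 10% of artists hold {share:.0%} of appearances')
```

Applied to the notebook's data, this would be `top_share(artist_counter.values(), fraction=0.01)`.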

Since the data look heavily skewed, we can use the `powerlaw` library to formally compare the distribution of artist playlist counts to a power law. Specifically, we fit a discrete power law to the number of times each artist appears in a playlist, then plot the [probability density function](https://pythonhosted.org/powerlaw/#powerlaw.Fit.plot_pdf) of the data alongside that of the fitted theoretical distribution.

In [None]:
data = list(artist_counter.values())
fit = powerlaw.Fit(data, discrete=True)
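
To give a sense of what such a fit estimates, the sketch below computes a power-law exponent with the closed-form approximation for discrete data from Clauset, Shalizi and Newman (2009): `alpha ≈ 1 + n / sum(ln(x_i / (xmin - 0.5)))`. This is an illustration under assumptions, not necessarily what the library does internally: `powerlaw` also searches for the best `xmin`, whereas here `xmin` is fixed by hand, the counts are made up, and the helper name `estimate_alpha` is hypothetical.

```python
import math


def estimate_alpha(counts, xmin=1):
    """Approximate MLE of the power-law exponent for discrete data >= xmin."""
    # Only the tail at or above xmin is assumed to follow the power law.
    tail = [x for x in counts if x >= xmin]
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1 + len(tail) / log_sum


# Hypothetical playlist-appearance counts.
toy_counts = [1, 1, 1, 1, 2, 2, 3, 10]
alpha = estimate_alpha(toy_counts, xmin=1)
print(f'Estimated alpha: {alpha:.2f}')
```

The approximation is generally described as reliable only when `xmin` is not too small (roughly 6 or more), so the toy value here is purely illustrative. After fitting, the library exposes its own estimates as `fit.power_law.alpha` and `fit.power_law.xmin`, and `fit.distribution_compare('power_law', 'lognormal')` returns a log-likelihood ratio and a p-value for comparison against an alternative heavy-tailed distribution.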

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(10, 4))

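# fit.plot_pdf draws the empirical PDF of the data; the fitted power law
# itself is drawn by fit.power_law.plot_pdf below.
fit.plot_pdf(
    color=palette[2], linewidth=1.5, linestyle='-', ax=axs[0], label='Empirical PDF'
)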
fit.power_law.plot_pdf(
    color=palette[2],
    linewidth=1.5,
    linestyle='--',
    ax=axs[0],
    label='Theoretical power law',
)
axs[0].hist(
    data,
    bins=np.logspace(np.log10(1), np.log10(max(data)), 40),
    density=True,
    alpha=0.75,
    color=colors['lines'],
)
axs[0].set_xscale('log')
axs[0].set_yscale('log')
axs[0].set_title('Artist Playlist Distribution vs Power Law')
axs[0].set_ylabel('Density')
axs[0].set_xlabel('No. of times artist is in a playlist')
axs[0].legend(frameon=False)
axs[1].axis('off')

fig.tight_layout(pad=3.0)
fig.savefig('data/06_viz/artists_powerlaw.png', bbox_inches='tight')
plt.show()

In this notebook, we explored a dataset of Spotify songs and their playlist groupings. We saw which artists and songs appear most often and observed that the number of playlist appearances per artist is heavily skewed and approximately follows a power law.