# Exploratory Data Analysis
In this notebook, I investigate the structure of the Spotify pop playlists in search of interesting conclusions. All plots are based on data as of July $27^{th}$.

In [45]:
# Import necessary libraries

import pandas as pd
import numpy as np
import sqlite3
from lets_plot import *
# Unfortunately, lets-plot does not allow for inserting images into plots
# Thus, I will use plotly for one of the visualizations
import plotly.graph_objects as go
import plotly.io as pio
# I had difficulties saving the plotly plot using their package
# Thus I went for an alternative way with the packages below
import matplotlib.pyplot as plt
from PIL import Image
import io

# Set up the lets-plot packages and sql magic
LetsPlot.setup_html()
%load_ext sql
%config SqlMagic.autocommit=True

# Connect to the database
%sql sqlite:///../data/clean/spotify_playlists.db --alias db




## What playlists do people usually listen to?
Let us inspect how popular each playlist is.

In [46]:
%sql pop << SELECT name, num_followers FROM playlists

pop = pop.DataFrame()
pop = pop.sort_values('num_followers')
p1 = ggplot(pop, aes(x='name', y='num_followers')) + \
    geom_lollipop(fatten=1.5)  + \
    scale_y_log10() + \
    coord_flip() + \
    ylab('Number of Followers (log scale)') + \
    xlab('Playlist') + \
    ggtitle('Only six playlists cross the line of 5 mln followers') + \
    geom_hline(yintercept=5000000, color='red', size=0.5) + \
    theme(plot_title=element_text(hjust=0.5)) + \
    ggsize(width=1000, height=800)

p1.show()
ggsave(p1, filename='../../docs/figures/playlist_popularity.html')

'c:\\Users\\adamw\\Education\\LSE Summer School\\ME204\\me204-2024-project-adamwadolowski\\docs\\figures\\playlist_popularity.html'

The red line represents 5 million followers. We clearly see that the six most popular playlists lead all other playlists by a margin of 2 million followers. Those are [Today's Top Hits](https://open.spotify.com/playlist/37i9dQZF1DXcBWIGoYBM5M), [Songs to Sing in the Car](https://open.spotify.com/playlist/37i9dQZF1DWWMOmoXKqHTD), [Mega Hit Mix](https://open.spotify.com/playlist/37i9dQZF1DXbYM3nMM0oPk), [Mood Booster](https://open.spotify.com/playlist/37i9dQZF1DX3rxVfibe1L0), [Hit Rewind](https://open.spotify.com/playlist/37i9dQZF1DX0s5kDXi1oC5) and [This Is Taylor Swift](https://open.spotify.com/playlist/37i9dQZF1DX5KpP2LN299J). To see the distribution of followers for all other playlists in more detail, I remove these six playlists and draw a similar graph.

In [47]:
%%sql tab << SELECT name, num_followers
FROM playlists
WHERE name NOT IN ('Today’s Top Hits', 'Songs to Sing in the Car', 'Mega Hit Mix', 'Mood Booster', 'Hit Rewind', 'This Is Taylor Swift');

In [48]:
pop = tab.DataFrame()
pop = pop.sort_values('num_followers')
p2 = ggplot(pop, aes(x='name', y='num_followers')) + \
    geom_lollipop(fatten=1.5)  + \
    coord_flip() + \
    ylab('Number of Followers') + \
    xlab('Playlist') + \
    ggtitle('Number of followers of other playlists varies from a few thousand to almost four million') + \
    theme(plot_title=element_text(hjust=0.5)) + \
    ggsize(width=1000, height=500)

p2.show()
ggsave(p2, filename='../../docs/figures/playlist_popularity_without_outliers.html')

'c:\\Users\\adamw\\Education\\LSE Summer School\\ME204\\me204-2024-project-adamwadolowski\\docs\\figures\\playlist_popularity_without_outliers.html'

After removing the outliers, we see that the number of followers of the remaining playlists is rather uniformly distributed on the interval from 0 to 4 million.

## Are songs with adult content more popular than others?
Now it is time to dig into the details and find out whether the inclusion of adult content is a recipe for a song's success. The [popularity](https://developer.spotify.com/documentation/web-api/reference/get-track) variable is provided by Spotify's API and is based on the total number of plays the track has had and how recent those plays are. Its range is from 0 to 100.

In [49]:
%%sql

tab << SELECT is_explicit, popularity, release_date, title, album_name
FROM songs
LEFT JOIN song_album_map
ON songs.song_id = song_album_map.song_id
LEFT JOIN albums
ON song_album_map.album_id = albums.album_id


In [50]:
songs = tab.DataFrame()
songs['release_date'] = pd.to_datetime(songs['release_date'], format = 'ISO8601')
songs = songs.sort_values('release_date')

# Categorical type resulted in incorrectly formatted plots so I changed the type to str
songs['Explicit content'] = songs['is_explicit'].map({0: 'No', 1: 'Yes'})

In [51]:
p3 = ggplot(songs, aes(x='release_date', y='popularity', color='Explicit content')) + \
    geom_point(alpha=0.5, tooltips=layer_tooltips(['title', 'album_name']), size = 2.5) + \
    ggtitle('Most songs on pop playlists were recently released') + \
    ylab('Popularity') + \
    xlab('Date of Release') + \
    scale_x_datetime() + \
    theme(plot_title=element_text(hjust=0.5)) + \
    scale_color_manual(values=['blue', 'red'], name='Explicit content') + \
    ggsize(width=3000, height=800)

p3.show()
ggsave(p3, filename='../../docs/figures/explicit_content.html')

'c:\\Users\\adamw\\Education\\LSE Summer School\\ME204\\me204-2024-project-adamwadolowski\\docs\\figures\\explicit_content.html'

The plot above is difficult to interpret, mainly because of the overwhelming number of data points in the two most recent years. The only interesting insight is that the pop playlists on Spotify contain no songs with explicit content that were produced before the 90s. Moreover, most songs on the pop playlists are rather recent, released in the last three years. In the next plot I aggregate the songs from each year and compute the following statistics: the mean and standard deviation of popularity and the percentage of songs with explicit content.
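The aggregation described above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the values are made up, the column names mirror the `songs` table, and the names `toy`, `per_year` and `explicit_share` are hypothetical. Unlike the cell that follows, this variant keeps one row per year rather than attaching the statistics to every song.

```python
import pandas as pd

# Toy data with hypothetical values; column names mirror the notebook's `songs` table
toy = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["2020-01-10", "2020-06-01", "2021-03-05", "2021-09-30"]
    ),
    "popularity": [60, 80, 50, 70],
    "is_explicit": [0, 1, 1, 1],
})

# One row per release year: mean and standard deviation of popularity,
# plus the share of songs flagged as explicit
per_year = (
    toy.assign(year=toy["release_date"].dt.year)
       .groupby("year")
       .agg(
           avg_popularity=("popularity", "mean"),
           std_popularity=("popularity", "std"),
           explicit_share=("is_explicit", "mean"),
       )
       .reset_index()
)

print(per_year)
```

A one-row-per-year table avoids drawing the same point once per song, which may make the resulting plot lighter.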

In [52]:
songs['year'] = songs['release_date'].dt.year
songs['avg_popularity'] = songs.groupby('year')['popularity'].transform('mean')
songs['std_popularity'] = songs.groupby('year')['popularity'].transform('std')
songs['is_explicit'] = songs['is_explicit'].astype(int)
songs['Fraction of explicit songs'] = songs.groupby('year')['is_explicit'].transform('mean')
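# Note: despite the names, these bounds are mean +/- one standard deviation,
# not a statistical confidence interval
songs['lower_ci'] = songs['avg_popularity'] - songs['std_popularity']
songs['upper_ci'] = songs['avg_popularity'] + songs['std_popularity']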

In [53]:
# Explicit labels are necessary to remove the comma from the year (it is treated as numeric)
p4 = ggplot(songs, aes(x='year', y='avg_popularity', color='Fraction of explicit songs')) + \
    geom_point(alpha=0.6) + \
    geom_errorbar(aes(ymin='lower_ci', ymax='upper_ci'), width=0.2) + \
    ggtitle('Songs become gold hits when there is no explicit content in them') + \
    ylab('Popularity') + \
    xlab('Date of Release') + \
    scale_color_viridis() + \
    scale_x_continuous(breaks=[1960, 1970, 1980, 1990, 2000, 2010, 2020],
                       labels=['1960', '1970', '1980', '1990', '2000', '2010', '2020']) + \
    theme(plot_title=element_text(hjust=0.5)) + \
    ggsize(width=1000, height=330)
p4.show()
ggsave(p4, filename='../../docs/figures/explicit_content_per_year.html')

'c:\\Users\\adamw\\Education\\LSE Summer School\\ME204\\me204-2024-project-adamwadolowski\\docs\\figures\\explicit_content_per_year.html'

Here, we see that popularity varies a lot among songs released in the same year, except for years in which only a few songs on the pop playlists were released (for example 1960 and 1982). One possible insight from this graph is that only songs without explicit content remain big hits decades after their release. Additionally, the pop playlists have the highest percentage of explicit content among songs released in 2021. This might be just a coincidence, but it could also be related to the Covid-19 lockdown: from today's point of view, listeners may associate those songs with the emotions they had during the pandemic.

## Who is the most popular pop singer?
Currently, the consensus is that Taylor Swift is the most popular singer worldwide. However, in the case of our UK pop playlists, some other singer might take the lead. In order to find the answer, a metric needs to be constructed. My choice is to look at two quantities per artist: the number of their songs across all of the playlists, counted with repetition (each song counts as many times as it occurs), and the average popularity of their songs weighted by the time since their release (as more recent songs tend to be more popular).
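The two per-artist quantities described above can be sketched on toy data as follows. The values and the names `toy`, `summarise` and `per_artist` are hypothetical; only the column names follow the notebook. Note that weighting popularity by the number of days since release gives older songs a larger weight in the average.

```python
import pandas as pd

# Toy data with hypothetical values; one row per (song, artist) pair
toy = pd.DataFrame({
    "artists": ["A", "A", "B"],
    "popularity": [80, 60, 70],
    "days_since_release": [100, 300, 50],
    "num_occurrences": [2, 1, 3],
})

def summarise(group: pd.DataFrame) -> pd.Series:
    """Return the time-weighted mean popularity and the song count with repetition."""
    weights = group["days_since_release"]
    return pd.Series({
        "avg_popularity": (group["popularity"] * weights).sum() / weights.sum(),
        "num_of_songs": group["num_occurrences"].sum(),
    })

per_artist = (
    toy.groupby("artists")[["popularity", "days_since_release", "num_occurrences"]]
       .apply(summarise)
       .reset_index()
)

print(per_artist)
```

For artist `A`, the weighted average is (80 × 100 + 60 × 300) / 400 = 65, which sits closer to the older song's popularity than the plain mean of 70 does.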

In [54]:
%%sql tab << SELECT songs.song_id, artists, popularity, num_occurrences, release_date
FROM songs
LEFT JOIN song_album_map
ON songs.song_id = song_album_map.song_id
LEFT JOIN albums
ON song_album_map.album_id = albums.album_id

In [55]:
singers = tab.DataFrame()
singers['release_date'] = pd.to_datetime(singers['release_date'], format='ISO8601')
# Number of days between the release date and the reference date (NaN where the date is missing)
singers['days_since_release'] = (pd.Timestamp('2024-07-29') - singers['release_date']).dt.days

singers['artists'] = singers['artists'].apply(lambda x: x.split(','))
singers = singers.explode('artists')

singers = singers.dropna(subset=['popularity', 'days_since_release'])

def weighted_avg(df):
    v = df['popularity']
    w = df['days_since_release']
    return (v * w).sum() / w.sum()

avg_popularity = singers.groupby('artists')[['popularity', 'days_since_release']].apply(weighted_avg).reset_index()
avg_popularity.columns = ['artists', 'avg_popularity']
singers = singers.merge(avg_popularity, on='artists')

# ChatGPT helped with the three lines below
sum_df = singers.groupby('artists')['num_occurrences'].sum().reset_index()
sum_df.rename(columns={'num_occurrences': 'num_of_songs'}, inplace=True)
singers = singers.merge(sum_df, on='artists', how='left')

singers = singers.drop_duplicates(subset='artists')
singers['artist'] = singers['artists']
median_popularity = int(singers['avg_popularity'].median())

In [56]:
p5 = ggplot(singers, aes(y='num_of_songs', x='avg_popularity')) + \
              geom_point(alpha = 0.6, tooltips=layer_tooltips(['artist']), color='black', size=2.5) + \
              xlab('Average songs\' popularity per artist') + \
              ylab('Number of songs in the playlists (with repetition)') + \
              ggtitle('Taylor Swift crushes all the rivals in terms of quantity of songs but not the average popularity') + \
              theme(plot_title=element_text(hjust=0.5)) + \
              geom_vline(xintercept=median_popularity, color='red', size=0.5) + \
              geom_text(aes(x=[64], y=[100], label=['Median artists\' popularity']), color='black', hjust=0, size=8) + \
              geom_text(aes(x=[68], y=[273], label=['Taylor Swift']), color='black', hjust=0, size=8) + \
              ggsize(width = 1000, height = 500)
p5.show()
ggsave(p5, filename='../../docs/figures/singers_popularity.html')

'c:\\Users\\adamw\\Education\\LSE Summer School\\ME204\\me204-2024-project-adamwadolowski\\docs\\figures\\singers_popularity.html'

Taylor Swift dominates the playlists with an astonishing 273 songs (with repetition). This result could have been expected, as the [This Is Taylor Swift](https://open.spotify.com/playlist/37i9dQZF1DX5KpP2LN299J) playlist is among the six most-followed playlists. Surprisingly, however, her average popularity barely exceeds the median across artists on the pop playlists (represented by the vertical red line). That suggests she is a well-established singer. The next plot removes the observation for Taylor Swift in the hope of capturing the general behaviour.

In [57]:
singers2 = singers[singers['artist'] != 'Taylor Swift']

p6 = ggplot(singers2, aes(y='num_of_songs', x='avg_popularity')) + \
            geom_point(alpha = 0.6, tooltips=layer_tooltips(['artist']), color='black', size=2.5) + \
            xlab('Average songs\' popularity per artist') + \
            ylab('Number of songs in the playlists (with repetition)') + \
            ggtitle('Most artists have up to two songs on the pop playlists') + \
            theme(plot_title=element_text(hjust=0.5)) + \
            ggsize(width = 1000, height = 500)

p6.show()
ggsave(p6, filename='../../docs/figures/singers_popularity_without_outliers.html')

'c:\\Users\\adamw\\Education\\LSE Summer School\\ME204\\me204-2024-project-adamwadolowski\\docs\\figures\\singers_popularity_without_outliers.html'

Here we see a more natural structure. The vast majority of artists have at most two songs across all the pop playlists, with a few reaching up to 20 occurrences. Shakira, Beyoncé and Luis Miguel stand out with 31, 31 and 33 songs, respectively.

## What defines a good playlist?
People have different tastes, but we can analyse the average behaviour of Spotify listeners. I assume a playlist's quality can be expressed as its number of followers, and I will try to find a relation between the number of songs on a playlist, their average popularity and the number of followers.

In [58]:
%%sql tab << SELECT playlists.playlist_id, num_followers, popularity, name
FROM playlists
LEFT JOIN song_playlist_map
ON playlists.playlist_id = song_playlist_map.playlist_id
LEFT JOIN songs
ON song_playlist_map.song_id = songs.song_id

In [59]:
playlists = tab.DataFrame()
playlist_avg_popularity = playlists.groupby('playlist_id')['popularity'].mean().reset_index()
playlist_avg_popularity.columns = ['playlist_id', 'playlist_avg_popularity']
playlists = playlists.merge(playlist_avg_popularity, on='playlist_id', how='left')

counts = playlists.groupby('playlist_id').size().reset_index(name='num_of_songs')

playlists = playlists.merge(counts, on='playlist_id', how='left')

In [60]:
playlists = playlists.drop_duplicates(subset='playlist_id')

playlists['image_path'] = [f'../data/clean/images/{id}.jpg' for id in playlists['playlist_id']]
playlists['link'] = [f'https://open.spotify.com/playlist/{id}' for id in playlists['playlist_id']]

In [62]:
# ChatGPT helped with this cell

p7 = go.Figure()

# Add scatter plot for hover information
p7.add_trace(go.Scatter(
    x=playlists['playlist_avg_popularity'],
    y=playlists['num_of_songs'],
    mode='markers',
    marker=dict(size=20, opacity=0),
    hovertext=playlists['name'],
    hoverinfo='text'
))

# Add images as annotations
for idx, row in playlists.iterrows():
    img_size = row['num_followers'] / 400000  # Adjustment to control the image size
    p7.add_layout_image(
        dict(
            source=row['image_path'],
            xref="x",
            yref="y",
            x=row['playlist_avg_popularity'],
            y=row['num_of_songs'],
            sizex=img_size,
            sizey=img_size,
            xanchor="center",
            yanchor="middle"
        )
    )


p7.update_layout(
    xaxis_title='Average Popularity of Songs on a Playlist',
    yaxis_title='Number of Songs',
    title='It is all about quality, not quantity',
    showlegend=False,
    xaxis=dict(
        range=[54, max(playlists['playlist_avg_popularity']) + 10],
        showgrid=True,
        gridcolor='lightgray',
        gridwidth=1
    ),
    yaxis=dict(
        range=[0, max(playlists['num_of_songs']) + 20],
        showgrid=True,
        gridcolor='lightgray',
        gridwidth=1
    ),
    plot_bgcolor='white' 
)

p7.show()
p7.write_html('../docs/figures/images_plot.html')

We see that, generally, playlists have at most 100 songs and that there is a positive correlation between the average popularity of songs on a playlist and the number of people that follow it. The best example is the [Today's Top Hits](https://open.spotify.com/playlist/37i9dQZF1DXcBWIGoYBM5M) playlist, with the highest number of followers and the highest average popularity. It is important to mention that this playlist is subject to endogeneity: its number of followers has a large influence on the average popularity of its songs, and vice versa.
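One way to put a number on the visual impression above is a rank correlation, which is less sensitive to the few playlists with a very large following. The sketch below uses made-up values and is not a result from the actual data; the name `toy` is hypothetical and only the column names follow the `playlists` table. It computes the Spearman correlation as the Pearson correlation of the ranks.

```python
import pandas as pd

# Toy data with hypothetical values; column names mirror the notebook's `playlists` table
toy = pd.DataFrame({
    "playlist_avg_popularity": [55.0, 60.0, 65.0, 72.0, 80.0],
    "num_followers": [10_000, 200_000, 150_000, 3_000_000, 30_000_000],
})

# Spearman's rank correlation equals the Pearson correlation of the ranked values
ranks = toy.rank()
rho = ranks["playlist_avg_popularity"].corr(ranks["num_followers"])

print(f"Spearman correlation on the toy data: {rho:.2f}")
```

A value close to 1 would support the positive relationship, although it says nothing about the direction of causality discussed above.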