# An exploration of the top 100 Spotify hits from 2010-2019

## Table of Contents
* [Set Up](#setup)
    * [Import libraries](#libraries)
    * [Set conventions](#conventions)
    * [Make data accessible](#access)

* [Data Exploration](#explore)
    * [Characteristics](#charas)
    * [Display (pre-normalization)](#previs)
        * [Notes on the histograms](#histnotes)
    * [Feature Engineering](#featengin)
        
* [Predictions](#predict)
    
* [Normalization](#norm)
    * [Split Data](#split)
    * [Calculate](#calcnorm)
        * [clipping](#clipping)
        * [z-score](#zscore)
    * [Display (post-normalization)](#postvis)
        * [Notes on the violin plots](#violinnotes)

## Set Up <a class="anchor" id="setup"></a>

### Import libraries <a class="anchor" id="libraries"></a>

In [None]:
from os import path
import csv
import opendatasets as od
import pandas as pd
import numpy as np
import seaborn as sns

import IPython
import IPython.display
from ipywidgets import widgets, interactive, fixed
import matplotlib as mpl
import matplotlib.pyplot as plt

import dataframe_image as dfi

### Set conventions <a class="anchor" id="conventions"></a>

In [None]:
# conventions
%matplotlib inline

mpl.rcParams['figure.figsize'] = (12,6)
mpl.rcParams['axes.grid'] = False

sns.set(style="whitegrid")

### Make data accessible <a class="anchor" id="access"></a>
Dataset found on [Kaggle](https://www.kaggle.com/datasets/muhmores/spotify-top-100-songs-of-20152019)

In [None]:
# TODO: update file download parameters for outsider user

# download dataset
if not path.exists('hits_2010-2019.csv'):
    od.download('https://www.kaggle.com/datasets/muhmores/spotify-top-100-songs-of-20152019')

In [None]:
# load the CSV into a dataframe (read_csv already returns a DataFrame)
df = pd.read_csv('hits_2010-2019.csv', sep=',')

## Data Exploration <a class="anchor" id="explore"></a>

### Characteristics <a class="anchor" id="charas"></a>

First, look at the shape of the data frame, which tells us the size of the dataset we are working with. From the Kaggle page we expect 1000 rows (100 hits/year x 10 years) and 17 columns, or 17,000 data points.

In [None]:
# (rows, columns)
df.shape

In [None]:
print("column labels: ")
for i, label in enumerate(list(df.columns)):
    print(i, label)

In [None]:
df.dtypes

In [None]:
# print first few rows
# note: data sorted sequentially increasing by 'top year'
df.head()

Below is the list of features measured on a scale from 0 to 100, where 0 is low and 100 is high with respect to the meaning of each feature. For more information on each feature, see the [Spotify API](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features).

energy (nrgy), danceability (dnce), liveness (live), valence (val), acousticness (acous), speechiness (spch), popularity (pop)

In [None]:
# describe() only returns stats for numerical data columns
df.describe().transpose()

### Display (before normalization) <a class="anchor" id="previs"></a>

In [None]:
# create list of all features we will be running analysis on
features_list = list(df.columns)[3:]

We are mapping the spread and density of each feature. This allows us to get a general idea of the distribution of each data type. We will refer to these graphs later when prepping the data for further analysis.

In [None]:
# create a dropdown menu so we can easily switch between histograms
def opt(_list):
    feature = widgets.Dropdown(options=_list,
                 value='year released',
                 description='Feature:',
                 disabled=False,)
    return feature

# define function for a histogram
# graphing each feature against the year it topped the charts
def hist2d(feature:str, df):
    sns.histplot(data=df, x='top year', y=feature, stat='count', discrete=(True, True), cbar=True)
    
#     filename = f"hist_{feature}.png"
#     plt.savefig(filename, format='png')
    
    plt.show()

In [None]:
# return the histograms
feature = opt(features_list)
interactive(hist2d, feature=feature, df=fixed(df))
# plt.savefig("hist_feats.png", format='png')

#### Notes on the histograms:<a class="anchor" id="histnotes"></a>

- The distribution of 'year released' initially does not make sense. How can a song be a chart topper in a year prior to its release? 
    - Upon further investigation, I learned that the 'year released' column was determined by the album release 
        date. The singles that charted off the album were released ahead of time in promotion of the album. 
    - There are 75 songs with this error present.
    
- The 'top genre' feature is a bit of a mess to look at. But, based on the colorbar, one genre dominated the charts: almost half of all songs fell into it. Its popularity was cut roughly in half in 2017 and has not recovered since.

- 'dur' is beginning to show a small downward trend, suggesting pop songs are becoming shorter. Are attention spans waning?

- There is also a small downward trend in energy. However, both shifts are too small, and the data covers too short a period (only 10 years), to call them actual trends.

- 'acous' and 'live' remind us that most music today is electronically produced. Songs may be becoming slightly more acoustic, though, which would mean organic sounds are becoming popular again. 
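
One way to put a number on a visual impression like the 'dur' trend is to compare the mean of a feature per chart year. The sketch below uses a small made-up frame (the values are hypothetical, not taken from the dataset) with the same column names as the notebook; it is only an illustration of the idea, not a test of significance.

```python
import pandas as pd

# Toy data: hypothetical durations (seconds) for three chart years.
toy = pd.DataFrame({
    "top year": [2010, 2010, 2011, 2011, 2012, 2012],
    "dur": [240, 230, 225, 221, 210, 214],
})

# Mean duration per chart year.
yearly_mean = toy.groupby("top year")["dur"].mean()

# Year-over-year change in the mean; negative values mean shorter songs.
yearly_change = yearly_mean.diff().dropna()

print(yearly_mean)
print(yearly_change)
```

Running the same `groupby` on `df` would give one mean per year for any numeric feature, which is easier to compare than the 2D histograms.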

### Feature Engineering <a class="anchor" id="featengin"></a>

In [None]:
# remove empty rows
df.dropna(how='all', inplace=True)

In [None]:
# drop 'added' column, irrelevant to the goal of this project
df.drop(labels='added', axis=1, inplace=True)

In [None]:
# df_err holds the rows whose release year is later than the year the song charted
df_err = df.loc[df['year released'] > df['top year']]

# re-running this cell after the masking step below should print zero
print('# of inconsistent data points is', df_err.shape[0])

At this point, we need to do two things to the 'year released' column:

1. Eliminate inconsistent data points. For reference, we identified 75 inconsistent data points in the 'year released' column in the above cell. 
2. Recognize that we are more interested in how many years 'year released' is offset from 'top year' than we are in the actual release date. With that in mind, let's remeasure/redefine 'year released'.

Note: a negative 'year released' value now means that the song was released x years before it hit the charts.
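
To make the two steps concrete, here is a minimal sketch on a three-row toy frame (the years are made up for illustration). It assumes the same column names as the notebook and uses the assignment form of `mask` rather than `inplace=True`.

```python
import pandas as pd

# Toy data: the third row has a release year after its chart year.
toy = pd.DataFrame({
    "year released": [2009, 2010, 2012],
    "top year": [2010, 2010, 2011],
})

# Step 1: replace release years later than the chart year with NaN.
inconsistent = toy["year released"] > toy["top year"]
toy["year released"] = toy["year released"].mask(inconsistent)

# Step 2: store the offset from the chart year instead of the calendar year.
toy["year released"] = toy["year released"].sub(toy["top year"])

print(toy)
```

The first song ends up with an offset of -1 (released one year before charting), the second with 0, and the inconsistent third row stays NaN.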

In [None]:
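# 1: replace inconsistent release years with NaN
# (assign the result back; calling mask(inplace=True) on a selected column may not update df)
df['year released'] = df['year released'].mask(df['year released'] > df['top year'])

# 2: redefine 'year released' as the offset from 'top year'
df['year released'] = df['year released'].sub(df['top year'])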

In [None]:
df.head()

In [None]:
# convert year type from float64 to ints
df['top year'] = df['top year'].astype('int64')

In [None]:
# convert columns that measure on a scale of 0-100 to 0-1

list1 = ['nrgy', 'dnce', 'live', 'val', 'acous', 'spch', 'pop']
df[list1] = df[list1].div(100)

# return some data to verify changes have been made
df.head()

In [None]:
# at this point we only care about columns of numerical measure, drop everything else
DROP_list = ['title', 'artist', 'top genre', 'top year', 'artist type']
# drop() returns a new DataFrame, so df itself is left untouched
df_num = df.drop(labels=DROP_list, axis=1)

## Predictions <a class="anchor" id="predict"></a>

The goal is to create an algorithm that will forecast the features of next year's pop hits. Before building it, I want an intuitive baseline: the range I expect the algorithm's output to fall within. For most of the features, assuming they are roughly normally distributed, I expect around 68% of the forecasts to fall within the mean +/- 1 std. I will specify later on why I can't do this for all the features.
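
The 68% figure comes from the normal distribution, so it is worth checking how well it holds when a feature is skewed. The sketch below uses synthetic samples (not the Spotify data) and a hypothetical helper, `coverage_within_one_std`, to measure the share of values inside mean +/- 1 std.

```python
import numpy as np

def coverage_within_one_std(values: np.ndarray) -> float:
    """Return the fraction of values that lie within mean +/- 1 std."""
    mean, std = values.mean(), values.std()
    inside = (values >= mean - std) & (values <= mean + std)
    return float(inside.mean())

rng = np.random.default_rng(0)

# Roughly bell-shaped feature: coverage should be close to 68%.
normal_cov = coverage_within_one_std(rng.normal(size=100_000))

# Strongly right-skewed feature: coverage drifts away from 68%.
skewed_cov = coverage_within_one_std(rng.exponential(size=100_000))

print(f"normal: {normal_cov:.3f}, skewed: {skewed_cov:.3f}")
```

Applying the same helper to each column of `df_num` would show which features the rule of thumb fits reasonably well.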

In [None]:
# describe() was already printed when exploring the characteristics of the dataset;
# printing it again to remind us of the spread of the data
df_num.describe().transpose()

In [None]:
# create lower and upper bounds of expected output range
LOWER = df_num.mean(axis=0, skipna=True) - df_num.std(axis=0, skipna=True)
UPPER = df_num.mean(axis=0, skipna=True) + df_num.std(axis=0, skipna=True)
df_predict = pd.DataFrame({'lower bound': LOWER, 
                           'upper bound': UPPER},
                          index=list(df_num.columns))
# print the predictions
df_predict = df_predict.transpose()
df_predict

Make corrections to the upper and lower bounds based on what is logically possible:
1. Bounds of 'year released' can only be integers and must be <= 0.
2. The column 'acous' must be >= 0 and <= 1.
3. Round the bounds: two decimals for the features on the 0-1 scale, whole numbers for the rest.

In [None]:
# 1
df_predict.loc['upper bound', 'year released'] = 0

# 2
df_predict.loc['lower bound', 'acous'] = 0

# 3: round the 0-1 features to two decimals and the remaining features to whole numbers
list2 = ['year released', 'bpm', 'dB', 'dur']
df_predict[list1] = df_predict[list1].round(2)
df_predict[list2] = df_predict[list2].round(0)

In [None]:
df_predict

## Normalization <a class="anchor" id="norm"></a>

### Split the data <a class="anchor" id="split"></a>
The data is split chronologically rather than at random because this exploration is time dependent. With 1000 rows sorted by 'top year' (100 per year), a 70/20/10 split falls exactly on year boundaries.

70%, training set data from 2010-2016 

20%, validation set data from 2017-2018

10%, test set data from 2019
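
The positional split relies on the rows being sorted by year with exactly 100 songs each. An alternative sketch, assuming the 'top year' column is still available at split time (in this notebook it is dropped from `df_num`, so it would have to be kept or looked up in `df`), filters on the year directly. The toy values are made up.

```python
import pandas as pd

# Toy data: one hypothetical song per listed chart year.
toy = pd.DataFrame({
    "top year": [2010, 2013, 2016, 2017, 2018, 2019],
    "bpm": [120, 100, 95, 128, 110, 90],
})

# Training set: 2010-2016.
train = toy[toy["top year"] <= 2016]

# Validation set: 2017-2018 (between() includes both endpoints by default).
val = toy[toy["top year"].between(2017, 2018)]

# Test set: 2019.
test = toy[toy["top year"] == 2019]

print(len(train), len(val), len(test))
```

Filtering by year keeps the split correct even if rows are dropped or reordered earlier in the notebook.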

In [None]:
column_indices = {name: i for i, name in enumerate(df_num.columns)}

n = len(df_num)

# 70%
train_df = df_num[0:int(n*0.7)]
# 20%
val_df = df_num[int(n*0.7):int(n*0.9)]
# 10%
test_df = df_num[int(n*0.9):]

num_features = df_num.shape[1]

### Calculate norms <a class="anchor" id="calcnorm"></a>

#### clipping <a class="anchor" id="clipping"></a>
Adjust for obvious outliers.
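
There are two common ways to handle an outlier in pandas, and they behave differently: `Series.mask` replaces the offending values with NaN, while `Series.clip` clamps them to the threshold. The sketch below compares both on a few made-up loudness values, using -12.5 as an example threshold.

```python
import pandas as pd

# Toy loudness values in dB; two of them fall below the threshold.
db = pd.Series([-15.0, -8.0, -5.0, -13.0])
threshold = -12.5

# mask: values below the threshold become NaN (treated as missing).
masked = db.mask(db < threshold)

# clip: values below the threshold are set to the threshold itself.
clipped = db.clip(lower=threshold)

print(masked.tolist())
print(clipped.tolist())
```

Masking shrinks the number of usable values, whereas clipping keeps every row but piles the outliers up at the boundary; which one is preferable depends on the analysis that follows.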

In [None]:
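# mask() replaces values where the condition is True with NaN and keeps the rest;
# the result is assigned back because mask(inplace=True) on a selected column may not update df_num
# note: train_df, val_df and test_df were sliced earlier; re-run the split cell if they should reflect this step
df_num['dB'] = df_num['dB'].mask(df_num['dB'] < -12.5)
df_num['dur'] = df_num['dur'].mask(df_num['dur'] > 300)
df_num['year released'] = df_num['year released'].mask(df_num['year released'] < -5)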

#### z-score <a class="anchor" id="zscore"></a>
x' = (value - mean) / std 
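
As a small worked example of the formula above, the sketch below standardizes made-up training values and then applies the same training mean and std to a validation value. Note that pandas' `std()` is the sample standard deviation (ddof=1).

```python
import pandas as pd

# Toy values standing in for one feature.
train = pd.Series([1.0, 2.0, 3.0])
val = pd.Series([4.0])

# Statistics come from the training set only.
mu = train.mean()
sigma = train.std()  # sample standard deviation (ddof=1)

# Apply the same transformation to both sets.
train_z = (train - mu) / sigma
val_z = (val - mu) / sigma

print(train_z.tolist())
print(val_z.tolist())
```

The training values are centered on zero, while the validation value lands two training standard deviations above the training mean.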

In [None]:
# preventing bias:
# use only the training mean and std so that no information from the
# validation and test sets leaks into the normalization

train_mean = train_df.mean()
train_std = train_df.std()

train_df = (train_df - train_mean) / train_std
val_df = (val_df - train_mean) / train_std
test_df = (test_df - train_mean) / train_std

### Display (after normalization) <a class="anchor" id="postvis"></a>

In [None]:
df_std = (df_num - train_mean) / train_std
df_std = df_std.melt(var_name='Column', value_name='normalized')
plt.figure(figsize=(12, 6))
ax = sns.violinplot(x='Column', y='normalized', data=df_std)
_ = ax.set_xticklabels(df_num.keys(), rotation=90)

# plt.savefig("norm_feats.png", format='png')

#### Notes on the violin plots:<a class="anchor" id="violinnotes"></a>
#### #TODO