# An exploration of the top 100 Spotify hits from 2010-2019

## Table of Contents
* [Set Up](#setup)
    * [Import libraries](#libraries)
    * [Set conventions](#conventions)
    * [Make data accessible](#access)

* [Data Exploration](#explore)
    * [Characteristics](#charas)
    * [Display (pre-normalization)](#previs)
    
* [Normalization](#norm)
    * [Split Data](#split)
    * [Calculate](#calcnorm)
    * [Display (post-normalization)](#postvis)
    
* [Feature Engineering](#featengin)

* [Predictions](#predict)

## Set Up <a class="anchor" id="setup"></a>

### Import libraries <a class="anchor" id="libraries"></a>

In [None]:
from os import path
import csv
import opendatasets as od
import pandas as pd
import numpy as np
import seaborn as sns

import IPython
import IPython.display
from ipywidgets import widgets, interactive, fixed
import matplotlib as mpl
import matplotlib.pyplot as plt

### Set conventions <a class="anchor" id="conventions"></a>

In [None]:
# conventions
mpl.rcParams['figure.figsize'] = (8, 6)
mpl.rcParams['axes.grid'] = False

### Make data accessible <a class="anchor" id="access"></a>

In [None]:
# TODO: update file download parameters for outside user

# download dataset
if not path.exists('hits_2010-2019.csv'):
    od.download('https://www.kaggle.com/datasets/muhmores/spotify-top-100-songs-of-20152019')

In [None]:
# convert to dataframe (pd.read_csv already returns a DataFrame)
df = pd.read_csv('hits_2010-2019.csv', sep=',')

I did this exploration ahead of time, so I already know that the file contains extra blank rows. I'll remove them now to prevent problems later, mainly operations on rows/columns that fail because of NA elements (for example, the integer casts further down).

In [None]:
# remove empty rows
df.dropna(how='all', inplace=True)
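
As a minimal sketch of what `how='all'` does, the toy frame below (made-up values, not rows from the dataset) has one fully blank row and one partially filled row. Only the fully blank row is removed; `how='any'` would also remove the partial one.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame: row 1 is fully blank, row 2 is only missing 'bpm'
toy = pd.DataFrame({
    'title': ['Song A', np.nan, 'Song C'],
    'bpm': [120.0, np.nan, np.nan],
})

only_blank_dropped = toy.dropna(how='all')  # removes rows where every value is NA
any_na_dropped = toy.dropna(how='any')      # removes rows where at least one value is NA

print(len(only_blank_dropped), len(any_na_dropped))
```

Rows that are only partially empty survive `how='all'`, so individual NA values can still be present afterwards.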

## Data Exploration <a class="anchor" id="explore"></a>

### Characteristics <a class="anchor" id="charas"></a>

First, look at the shape of the data frame, which gives the size of the dataset we are working with. From the Kaggle page we expect 1000 rows (100 hits/year x 10 years) and 17 columns, or 17,000 data points in total.

In [None]:
# (rows, columns)
df.shape

In [None]:
print("column labels: ")
for i, label in enumerate(list(df.columns)):
    print(i, label)

In [None]:
# print explicitly: a bare expression is only displayed when it is the last line of a cell
print(df.dtypes)
print("'year released' dtype:", df['year released'].dtypes)
print("'top year' dtype:", df['top year'].dtypes)

In [None]:
# make updates: convert year types to ints
df['year released'] = df['year released'].astype('int64')
df['top year'] = df['top year'].astype('int64')

print("'year released' dtype updated to", df['year released'].dtypes)
print("'top year' dtype updated to", df['top year'].dtypes)

In [None]:
# print first few rows
# note: data sorted sequentially increasing by 'top year'
df.head()

In [None]:
df.describe().transpose()

## Feature Engineering <a class="anchor" id="featengin"></a>

In [None]:
# drop 'added' column, irrelevant to the goal of this project
# Spotify publishes top hits lists when the software updates, no correlation to other features
# so there can be no causation
df.drop(labels='added', axis=1, inplace=True)

In [None]:
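# change artist type to IDs
# replace within the 'artist type' column only, so matching strings elsewhere are untouched
artist_type_IDs = {'Solo':1, 'Duo':2, 'Trio':3, 'Band/Group':4}
df['artist type'] = df['artist type'].replace(artist_type_IDs)
df['artist type'] = df['artist type'].astype('int32')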

In [None]:
df.head()

In [None]:
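# SKIP FOR NOW

# change artist type to zero-based IDs
# note: an alternative to the 1-based mapping above; it has no effect once that cell has run
artist_type_IDs = {'Solo':0, 'Duo':1, 'Trio':2, 'Band/Group':3}
df['artist type'] = df['artist type'].replace(artist_type_IDs)
df['artist type'] = df['artist type'].astype('int32')

# change top genre to IDs, numbered from 1 in order of first appearance
top_genre_IDs = {}
for genre in df['top genre']:
    if genre not in top_genre_IDs:
        top_genre_IDs[genre] = len(top_genre_IDs) + 1
# replace within the 'top genre' column only
df['top genre'] = df['top genre'].replace(top_genre_IDs)
df['top genre'] = df['top genre'].astype('int32')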
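
The genre-ID loop above could also be sketched with `pd.factorize`, which assigns integer codes in order of first appearance. This is an alternative, not what the notebook runs, and the genre names below are made up for illustration.

```python
import pandas as pd

# hypothetical genre column, for illustration only
genres = pd.Series(['pop', 'rock', 'pop', 'edm'])

# factorize returns zero-based codes plus the unique values in order of first appearance
codes, uniques = pd.factorize(genres)

# shift by one to match the loop above, which starts its IDs at 1
genre_ids = pd.Series(codes + 1, index=genres.index, dtype='int32')

print(genre_ids.tolist())
print(list(uniques))
```

Keeping `uniques` around makes it possible to map an ID back to its genre name later.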

### Display (before normalization) <a class="anchor" id="previs"></a>

In [None]:
def opt(_list):
    feature = widgets.Dropdown(options=_list,
                 value='bpm',
                 description='Feature:',
                 disabled=False,)
    return feature

In [None]:
features = list(df.columns)[4:15]
df_test = df[features]
df_test.head()

In [None]:
# TODO: need to update conventions, reference 'top year' hist2d to see problem
def hist2d(feature:str, df):
    # bin counts must be integers
    year_BINS = max(len(df) // 100, 1) # one hundred top songs/year
    feat_BINS = max(int(df[feature].max() - df[feature].min()), 1) # at least one bin
    plt.hist2d(df['top year'], df[feature], bins=(year_BINS, feat_BINS))
    plt.colorbar()
    plt.xlabel('Top Year')
    plt.ylabel(feature)
    plt.show()

feature = opt(list(df.columns)[4:15])
interactive(hist2d, feature=feature, df=fixed(df))

In [None]:
def hist_alt(feature:str, df):
    # pass the selected column name, not the literal string "feature"
    sns.histplot(data=df, x="top year", y=feature, discrete=True, cbar=True)
    plt.show()
    
feature = opt(list(df.columns)[3:16])
interactive(hist_alt, feature=feature, df=fixed(df))

In [None]:
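# SKIP FOR NOW
def hist3d(feature:str, df):
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')

    # data: count songs in each (top year, feature value) cell
    year_bins = df['top year'].nunique()
    feat_bins = max(int(df[feature].max() - df[feature].min()), 1)
    counts, xedges, yedges = np.histogram2d(df['top year'], df[feature],
                                            bins=(year_bins, feat_bins))

    # one bar per cell, anchored at the cell's lower edges
    _xx, _yy = np.meshgrid(xedges[:-1], yedges[:-1], indexing='ij')
    x, y = _xx.ravel(), _yy.ravel()

    top = counts.ravel()
    bottom = np.zeros_like(top)
    width = xedges[1] - xedges[0]
    depth = yedges[1] - yedges[0]

    ax.bar3d(x, y, bottom, width, depth, top, shade=True)
    ax.set_xlabel('Top Year')
    ax.set_ylabel(feature)
    plt.show()
    
feature = opt(list(df.columns)[3:15])
interactive(hist3d, feature=feature, df=fixed(df))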

## Normalization <a class="anchor" id="norm"></a>
The guide tells me to apply z-score normalization, x' = (value - mean) / std, to every feature.

I'm also trying a few others...
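
As a small worked example of the z-score formula, the sketch below normalizes a handful of made-up bpm values (not taken from the dataset). After normalization the values have mean 0 and standard deviation 1.

```python
import pandas as pd

# hypothetical bpm values, for illustration only
bpm = pd.Series([90.0, 100.0, 110.0, 120.0, 130.0])

mean = bpm.mean()
std = bpm.std()  # pandas default is the sample standard deviation (ddof=1)

# x' = (value - mean) / std
z = (bpm - mean) / std

print(z.round(3).tolist())
```

Note that pandas and NumPy use different defaults here: `Series.std()` uses `ddof=1`, while `np.std` uses `ddof=0`, so the two give slightly different scales.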

### Split the data <a class="anchor" id="split"></a>
70% training, 20% validation, 10% test

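TODO: explain further. The rows are not selected randomly because the data is time dependent: it is sorted by 'top year', so the split is sequential and the validation and test sets hold the most recent years.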

In [None]:
# only normalizing features of numerical measure
# select by label so the end column is included (iloc[:,3:14] stops one column short of 14)
df2 = df.loc[:, 'year released':'top year']

column_indices = {name: i for i, name in enumerate(df2.columns)}

n = len(df2)
# 70%
train_df = df2[0:int(n*0.7)]
# 20%
val_df = df2[int(n*0.7):int(n*0.9)]
# 10%
test_df = df2[int(n*0.9):]

num_features = df2.shape[1] # number of columns selected into df2
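
To illustrate what a sequential 70/20/10 split does to time-ordered rows, here is a minimal sketch on a made-up frame with 10 songs per year (the real data has 100). The frame `toy` and its contents are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the real data: 10 songs per year, sorted by 'top year'
toy = pd.DataFrame({'top year': np.repeat(np.arange(2010, 2020), 10)})

n = len(toy)
train = toy[0:int(n*0.7)]
val = toy[int(n*0.7):int(n*0.9)]
test = toy[int(n*0.9):]

# each part covers a contiguous block of years
for name, part in [('train', train), ('val', val), ('test', test)]:
    print(name, len(part), part['top year'].min(), part['top year'].max())
```

Because the rows are sorted, the training set only sees the earliest years and the test set only the latest one.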

### Calculate norms <a class="anchor" id="calcnorm"></a>

z-score normalization

In [None]:
# preventing bias:
# compute the mean and std from the training set only, so the normalization
# does not leak information from the validation and test sets

train_mean = train_df.mean()
train_std = train_df.std()

train_df = (train_df - train_mean) / train_std
val_df = (val_df - train_mean) / train_std
test_df = (test_df - train_mean) / train_std
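
A minimal sketch of the same idea on made-up numbers: the validation values are scaled with the training statistics, so they are not centered on zero themselves. The values below are assumptions for illustration, not data from the notebook.

```python
import pandas as pd

# hypothetical feature values, already split in time order
train = pd.Series([100.0, 110.0, 120.0])
val = pd.Series([150.0, 160.0])

# statistics come from the training values only
toy_mean = train.mean()
toy_std = train.std()

train_z = (train - toy_mean) / toy_std
val_z = (val - toy_mean) / toy_std  # reuses the training statistics

print(train_z.tolist(), val_z.tolist())
```

Validation values far from zero are expected when later years drift away from the training range, which is one thing the post-normalization plot can reveal.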

In [None]:
train_df.describe().transpose()

### Display (after normalization) <a class="anchor" id="postvis"></a>

In [None]:
df2_std = (df2 - train_mean) / train_std
df2_std = df2_std.melt(var_name='Column', value_name='Normalized')
plt.figure(figsize=(12, 6))
ax = sns.violinplot(x='Column', y='Normalized', data=df2_std)
_ = ax.set_xticklabels(df2.keys(), rotation=90)

## Predictions <a class="anchor" id="predict"></a>