# Exploratory Data Analysis (EDA) of Spotify Data Over Time
## Ange Olson

Since April of 2016, I've made (almost) monthly playlists on Spotify consisting of the songs I enjoyed listening to that month. I created them as the month went on, meaning they've got a few cool characteristics:

* They can be used to see which artists/genres I was listening to together at any given time
* They provide point-in-time stats rather than the short-term, medium-term, and long-term song histories directly available from the API
* They're a more accurate signal of which songs I particularly liked than listening frequency alone, and they can be compared against the songs I listened to that didn't end up on a playlist (read: content-based recommendation system)

I've gone through and pulled the data from all the playlists, grabbing the typical metadata on each song (see the [Spotify Documentation on audio features](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features) for a full list and descriptions) including artist name, genre, release year, valence, tempo, key, etc. Because the playlist names are dated almost uniformly across the past 6 years, I was able to use each name to grab the month and year I listened to each song.
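As a minimal sketch of that date-extraction step: the example below assumes playlist titles look like `April 2016` and uses a hypothetical `playlist_name` column. The real naming scheme isn't shown here, so the format string would need adjusting to match it.

```python
import pandas as pd

# Hypothetical playlist titles; the real naming scheme may differ.
playlists = pd.DataFrame(
    {"playlist_name": ["April 2016", "May 2016", "January 2017"]}
)

# Parse "<month name> <year>" into a timestamp for the first day of that month.
playlists["playlist_date"] = pd.to_datetime(
    playlists["playlist_name"], format="%B %Y"
)

# Derive separate month and year columns for grouping later on.
# (Numeric months are an assumption; the cleaned CSV may store them differently.)
playlists["playlist_month"] = playlists["playlist_date"].dt.month
playlists["playlist_year"] = playlists["playlist_date"].dt.year

print(playlists)
```

Names that don't match the format raise an error by default; passing `errors="coerce"` to `pd.to_datetime` turns them into `NaT` instead, which is handy for spotting the playlists that break the naming pattern.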

In the analysis below, I mostly focus on either monthly trends across all years or trends over the past 6 years, rather than analyzing the data as a whole.

# Section I: Basic Info 
## How many observations do I have? How many months am I missing? How balanced is the dataset? 

In [None]:
# import packages, data, set theme 

# packages
from matplotlib import markers
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats 
import statistics
import plotly.express as px
import statsmodels 

# graphing
sns.set_theme(style="whitegrid", palette="Paired")

# data
df = pd.read_csv("time_series_data_clean.csv")

In [None]:
# define pivot table function and use to create main dataframes

# columns summarized by their mean
mean_vals = ["danceability", "energy", "loudness", "speechiness", "instrumentalness", "liveness", "valence", "tempo", "duration_ms", "track_pop", "artist_pop", "release_year"]

# columns summarized by their mode (most frequent value)
mode_vals = ["key", "mode", "time_signature", "artist_genre", "artist"]

def getPivot(mean_vals, mode_vals, index):
    """Aggregate the global df by `index`: means for mean_vals, modes for mode_vals."""

    # define pivot table of means ("mean" avoids the pandas warning raised for np.mean)
    table_means = pd.pivot_table(df, index=index, values=mean_vals, aggfunc="mean")

    # define pivot table of modes
    table_modes = pd.pivot_table(df, index=index, values=mode_vals, aggfunc=statistics.mode)

    # both tables share the same index, so join on it and turn it back into a column
    pivot_df = table_means.join(table_modes).reset_index()

    return pivot_df

# create pivot table by date
# get index val
index = "playlist_date"

datePivot = getPivot(mean_vals=mean_vals, mode_vals=mode_vals, index=index)

datePivot["playlist_date"] = pd.to_datetime(datePivot.playlist_date)

# create pivot table by month
# get index val
index2 = "playlist_month"

monthPivot = getPivot(mean_vals=mean_vals, mode_vals=mode_vals, index=index2)

June and August have the fewest playlists (two each), but every calendar month is represented at least once.

In [None]:
df.groupby('playlist_month')['playlist_date'].nunique()
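
To answer the missing-months question directly, one option is to compare the playlist dates against a complete monthly range. The sketch below runs on a small made-up frame (`toy`) rather than the real data, and assumes `playlist_date` parses to a date inside the playlist's month:

```python
import pandas as pd

# Toy stand-in for df; the real data comes from time_series_data_clean.csv.
toy = pd.DataFrame(
    {"playlist_date": ["2016-04-01", "2016-04-01", "2016-05-01", "2016-08-01"]}
)

# Collapse each date to its calendar month.
observed = pd.to_datetime(toy["playlist_date"]).dt.to_period("M")

# Every month between the first and last playlist, inclusive.
expected = pd.period_range(observed.min(), observed.max(), freq="M")

# Months in the full range that have no playlist.
missing = expected[~expected.isin(observed)]

print(f"{len(missing)} missing month(s):", [str(m) for m in missing])
```

Swapping `toy` for `df` should give the gaps in the actual playlist history, provided the date column parses the same way.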