# Spotify Wrapped But Better

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='wrangling'></a>
## Data Wrangling

### Data Gathering


In [1]:
import numpy as np #linear algebra
import pandas as pd #data processing
import matplotlib.pyplot as plt #data visualisation
import seaborn as sns #data visualisation
import plotly.express as px
import datapane as dp 
import json

In [2]:
#Get input Data
data = pd.read_json('streaming.json') #load json file into a variable
data.head(10) #fetch first 10 rows

Unnamed: 0,ts,username,platform,ms_played,conn_country,ip_addr_decrypted,user_agent_decrypted,master_metadata_track_name,master_metadata_album_artist_name,master_metadata_album_album_name,...,episode_name,episode_show_name,spotify_episode_uri,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode
0,2019-09-26T14:16:32Z,zuzana91,"Android OS 9 API 28 (samsung, SM-A600FN)",199873,CZ,78.128.191.43,unknown,달라달라 (DALLA DALLA),ITZY,IT'z Different,...,,,,trackdone,trackdone,False,,False,1569506707249,False
1,2022-01-18T10:52:14Z,zuzana91,"Android OS 10 API 29 (samsung, SM-A600FN)",175200,SK,194.154.244.235,unknown,Scotty Doesn't Know,Lustra,Left for Dead,...,,,,trackdone,trackdone,False,,False,1642502958552,False
2,2019-04-04T07:51:32Z,zuzana91,Windows 10 (10.0.17134; x64; AppX),135200,CZ,80.79.16.218,unknown,Please Be A Hit,Tiny Meat Gang,Locals Only,...,,,,trackdone,trackdone,False,,False,1554364157462,False
3,2019-05-14T20:24:59Z,zuzana91,Windows 10 (10.0.17134; x64; AppX),229455,CZ,78.128.169.252,unknown,Caro,Bad Bunny,X 100PRE,...,,,,clickrow,trackdone,True,,False,1557865272609,False
4,2020-10-15T16:44:32Z,zuzana91,Windows 10 (10.0.19041; x64; AppX),222502,CZ,78.128.169.252,unknown,In Seoul,Epik High,sleepless in __________,...,,,,trackdone,trackdone,False,,False,1602780049460,False
5,2019-01-21T12:32:35Z,zuzana91,Windows 10 (10.0.17134; x64; AppX),1680,CZ,78.128.169.252,unknown,NEW ORLEANS,BROCKHAMPTON,iridescence,...,,,,clickrow,endplay,False,,False,1548073954071,False
6,2019-06-27T17:39:05Z,zuzana91,Windows 10 (10.0.17134; x64; AppX),219466,CZ,78.128.169.252,unknown,Georgia,Kevin Abstract,ARIZONA BABY,...,,,,trackdone,trackdone,False,,False,1561656926093,False
7,2022-07-05T12:16:41Z,zuzana91,"Android OS 10 API 29 (samsung, SM-A600FN)",207775,CZ,78.128.169.252,unknown,,,,...,Casual Interactions Podcast: Episode 3 - Madam...,Casual Interactions,spotify:episode:0gnsPkAi4PWPIIsHBzBdYL,clickrow,endplay,False,,False,1657023122598,False
8,2017-03-07T07:11:01Z,zuzana91,Windows 10 (10.0.14393; x86),220925,SK,178.253.129.90,unknown,Sleepover,Hayley Kiyoko,Sleepover,...,,,,appload,trackdone,False,,False,1488870349373,False
9,2018-11-29T13:43:15Z,zuzana91,Windows 10 (10.0.17134; x64; AppX),194699,CZ,78.128.169.252,unknown,I Love You,EXID,I Love You,...,,,,clickrow,trackdone,False,,False,1543488440486,False


Content of our data:

- *ts:* Interval data (timestamp)
- *username:* Nominal data
- *platform:* Nominal data
- *ms_played:* Ratio data
- *conn_country:* Nominal data
- *ip_addr_decrypted:* Nominal data
- *master_metadata_track_name:* Nominal data
- *master_metadata_album_artist_name:* Nominal data
- *master_metadata_album_album_name:* Nominal data
- *spotify_track_uri:* Nominal data
- *episode_name:* Nominal data
- *episode_show_name:* Nominal data
- *spotify_episode_uri:* Nominal data
- *reason_start:* Nominal data
- *reason_end:* Nominal data
- *shuffle:* Nominal data
- *skipped:* Nominal data
- *offline:* Nominal data
- *offline_timestamp:* Interval data (Unix timestamp in milliseconds)
- *incognito_mode:* Nominal data

### Data Cleaning

Several columns are unnecessary for this analysis. For example, I am the only user of this account, so every record has the same value for 'username'.

In [3]:

# Drop the columns that are not needed for this analysis
data = data.drop(columns=['spotify_track_uri', 'ip_addr_decrypted', 'user_agent_decrypted', 'username'])

data = data.rename(columns={'master_metadata_track_name':'track','master_metadata_album_artist_name':'artist','master_metadata_album_album_name':'album'})
data.head(10)

Unnamed: 0,ts,platform,ms_played,conn_country,track,artist,album,episode_name,episode_show_name,spotify_episode_uri,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode
0,2019-09-26T14:16:32Z,"Android OS 9 API 28 (samsung, SM-A600FN)",199873,CZ,달라달라 (DALLA DALLA),ITZY,IT'z Different,,,,trackdone,trackdone,False,,False,1569506707249,False
1,2022-01-18T10:52:14Z,"Android OS 10 API 29 (samsung, SM-A600FN)",175200,SK,Scotty Doesn't Know,Lustra,Left for Dead,,,,trackdone,trackdone,False,,False,1642502958552,False
2,2019-04-04T07:51:32Z,Windows 10 (10.0.17134; x64; AppX),135200,CZ,Please Be A Hit,Tiny Meat Gang,Locals Only,,,,trackdone,trackdone,False,,False,1554364157462,False
3,2019-05-14T20:24:59Z,Windows 10 (10.0.17134; x64; AppX),229455,CZ,Caro,Bad Bunny,X 100PRE,,,,clickrow,trackdone,True,,False,1557865272609,False
4,2020-10-15T16:44:32Z,Windows 10 (10.0.19041; x64; AppX),222502,CZ,In Seoul,Epik High,sleepless in __________,,,,trackdone,trackdone,False,,False,1602780049460,False
5,2019-01-21T12:32:35Z,Windows 10 (10.0.17134; x64; AppX),1680,CZ,NEW ORLEANS,BROCKHAMPTON,iridescence,,,,clickrow,endplay,False,,False,1548073954071,False
6,2019-06-27T17:39:05Z,Windows 10 (10.0.17134; x64; AppX),219466,CZ,Georgia,Kevin Abstract,ARIZONA BABY,,,,trackdone,trackdone,False,,False,1561656926093,False
7,2022-07-05T12:16:41Z,"Android OS 10 API 29 (samsung, SM-A600FN)",207775,CZ,,,,Casual Interactions Podcast: Episode 3 - Madam...,Casual Interactions,spotify:episode:0gnsPkAi4PWPIIsHBzBdYL,clickrow,endplay,False,,False,1657023122598,False
8,2017-03-07T07:11:01Z,Windows 10 (10.0.14393; x86),220925,SK,Sleepover,Hayley Kiyoko,Sleepover,,,,appload,trackdone,False,,False,1488870349373,False
9,2018-11-29T13:43:15Z,Windows 10 (10.0.17134; x64; AppX),194699,CZ,I Love You,EXID,I Love You,,,,clickrow,trackdone,False,,False,1543488440486,False
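
The reasoning above (a column that holds the same value in every record adds nothing) can also be checked programmatically. The following is a minimal sketch on a made-up toy frame, not the real export; the names `toy`, `constant_cols` and `cleaned` are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the streaming history (values are made up).
toy = pd.DataFrame({
    "username": ["user_a", "user_a", "user_a"],
    "ms_played": [199873, 175200, 1680],
    "conn_country": ["CZ", "SK", "CZ"],
})

# A column with at most one distinct value carries no information.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
print(constant_cols)

# Drop the constant columns in one call.
cleaned = toy.drop(columns=constant_cols)
print(list(cleaned.columns))
```

On the real data, the same list comprehension could be run on `data` before deciding which columns to drop.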


<a id='eda'></a>
## Exploratory Data Analysis

### Range


In [4]:
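# sort_values returns a new DataFrame, so the result must be assigned.
# Without the assignment, head/tail only show the first and last rows of the
# unsorted data (as in the output recorded below).
data_sorted = data.sort_values(by='ts', ascending=False)
print("Last timestamp: ", data_sorted['ts'].head(1))
print("First timestamp: ", data_sorted['ts'].tail(1))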

Last timestamp:  0    2019-09-26T14:16:32Z
Name: ts, dtype: object
First timestamp:  145063    2022-04-16T15:22:45Z
Name: ts, dtype: object
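
The range can also be read off without sorting at all, by parsing the timestamps and taking the minimum and maximum. This is a small sketch on three timestamps copied from the sample rows above; `toy`, `first_ts` and `last_ts` are illustrative names, and the result says nothing about the range of the full export.

```python
import pandas as pd

# Three timestamps taken from the sample rows shown earlier.
toy = pd.DataFrame({
    "ts": ["2019-09-26T14:16:32Z", "2017-03-07T07:11:01Z", "2022-07-05T12:16:41Z"],
})

# Parse the ISO 8601 strings into timezone-aware datetimes.
ts = pd.to_datetime(toy["ts"], utc=True)

# min/max give the range directly, independent of row order.
first_ts, last_ts = ts.min(), ts.max()
print("First timestamp:", first_ts)
print("Last timestamp: ", last_ts)
```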


### Most streamed song/album/artist

In [5]:
# Total ms played per (artist, track); only ms_played is left to sum after the column selection
toptrack = data[['artist', 'track', 'ms_played']].groupby(['artist', 'track']).sum().sort_values('ms_played', ascending=False)
print(toptrack.head(10))

                                                                     ms_played
artist              track                                                     
RM                  moonchild                                         70055812
BTS                 HOME                                              69286192
RM                  seoul (prod. HONNE)                               68969800
My Chemical Romance Thank You for the Venom                           65575849
RM                  uhgood                                            64459124
                    tokyo                                             62294688
My Chemical Romance The Foundations of Decay                          60503550
RM                  everythingoes                                     59410067
My Chemical Romance You Know What They Do to Guys Like Us in Prison   59336996
BTS                 Dionysus                                          58792928


In [6]:
topartist = data[['artist', 'ms_played']].groupby(['artist']).sum().sort_values('ms_played', ascending=False)
print(topartist.head(10))

                      ms_played
artist                         
BTS                  2095829430
My Chemical Romance  1664463065
Fall Out Boy         1478409335
Frank Iero            820984362
BROCKHAMPTON          686144693
SEVENTEEN             651750914
Twenty One Pilots     473567481
RM                    417794675
DAY6                  386992678
Stray Kids            374403865


In [7]:
topalbum = data[['album', 'ms_played']].groupby(['album']).sum().sort_values('ms_played', ascending=False)
print(topalbum.head(10))

                                ms_played
album                                    
Three Cheers for Sweet Revenge  492048295
mono.                           415584201
Love Yourself 結 'Answer'        398137755
MAP OF THE SOUL : 7             359125993
From Under The Cork Tree        358207013
The Black Parade                347993882
Folie à Deux                    347848373
Parachutes                      347441713
MAP OF THE SOUL : PERSONA       332129106
Infinity On High                313669920
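
Totals in milliseconds are hard to read at this scale. As a rough sketch, the two largest artist totals from the output above can be converted to hours; the column name `hours_played` and the constant `MS_PER_HOUR` are introduced here for illustration.

```python
import pandas as pd

# Two totals copied from the top-artist output above.
top = pd.DataFrame(
    {"ms_played": [2095829430, 1664463065]},
    index=pd.Index(["BTS", "My Chemical Romance"], name="artist"),
)

MS_PER_HOUR = 1000 * 60 * 60  # milliseconds in one hour

# Convert to hours and round to one decimal place for display.
top["hours_played"] = (top["ms_played"] / MS_PER_HOUR).round(1)
print(top)
```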


### Most skipped track

In [8]:
skipped_tracks = data.loc[data['reason_end']== 'fwdbtn'].groupby(['artist','track']).size().sort_values(ascending=False)
print(skipped_tracks.head(10))

artist               track                               
DAY6                 EMERGENCY                               77
BTS                  Intro: Never Mind                       70
Frank Iero           Miss Me                                 69
My Chemical Romance  Welcome to the Black Parade             63
Awkwafina            Intro III                               41
Tiny Meat Gang       Intro                                   39
BTS                  Skit                                    38
                     The Truth Untold                        33
Green Day            Holiday / Boulevard of Broken Dreams    31
Frank Iero           9-6-15                                  31
dtype: int64
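
Raw skip counts favour tracks that are played often. One possible refinement, sketched here on made-up data, is a skip rate (skips divided by plays) with a minimum play count. It keeps the notebook's convention that `reason_end == 'fwdbtn'` marks a skip; `is_skip`, `MIN_PLAYS` and the toy values are assumptions.

```python
import pandas as pd

# Made-up plays; 'fwdbtn' is treated as a skip, as in the cell above.
toy = pd.DataFrame({
    "artist": ["A", "A", "A", "B", "B"],
    "track": ["x", "x", "x", "y", "y"],
    "reason_end": ["fwdbtn", "trackdone", "fwdbtn", "trackdone", "trackdone"],
})

# Count plays and skips per track in one pass.
stats = (
    toy.assign(is_skip=toy["reason_end"].eq("fwdbtn"))
       .groupby(["artist", "track"])["is_skip"]
       .agg(plays="count", skips="sum")
)
stats["skip_rate"] = stats["skips"] / stats["plays"]

# Ignore tracks with too few plays for the rate to be meaningful.
MIN_PLAYS = 2  # hypothetical threshold
ranked = stats[stats["plays"] >= MIN_PLAYS].sort_values("skip_rate", ascending=False)
print(ranked)
```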


### Time of the day/week/year

In [9]:
# pd.to_datetime parses the ISO 8601 strings (including the 'T' and 'Z') directly
data['ts'] = pd.to_datetime(data['ts'], utc=True)

# True for Monday to Friday, False for Saturday and Sunday
data['weekday'] = data['ts'].dt.dayofweek < 5


In [10]:
# Mean ms played per day, averaged over days with at least one stream
daily_ms = data.groupby(data['ts'].dt.date)['ms_played'].sum()
mean_ms = daily_ms.sum() / len(daily_ms)
mean_ms

8456257.327347549

In [11]:
data_2021 = data[data['ts'].dt.year == 2021]
daily_ms_2021 = data_2021.groupby(data_2021['ts'].dt.date)['ms_played'].sum()
mean_ms = daily_ms_2021.sum() / len(daily_ms_2021)
mean_ms

12651914.757575758
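
Both means above divide by the number of days that appear in the data, so days without any listening are not counted. A sketch of the alternative, averaging over every calendar day between the first and last stream, is shown below on made-up values; `mean_active` and `mean_calendar` are illustrative names.

```python
import pandas as pd

# Made-up streams: two on 1 January, one on 4 January, nothing in between.
toy = pd.DataFrame({
    "ts": pd.to_datetime(
        ["2021-01-01T10:00:00Z", "2021-01-01T11:00:00Z", "2021-01-04T09:00:00Z"],
        utc=True,
    ),
    "ms_played": [60000, 120000, 180000],
})

# Daily totals, indexed by midnight of each day that has a stream.
daily = toy.groupby(toy["ts"].dt.normalize())["ms_played"].sum()
mean_active = daily.mean()  # average over days with at least one stream

# Fill the missing days with zero, then average over all calendar days.
all_days = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
daily_full = daily.reindex(all_days, fill_value=0)
mean_calendar = daily_full.mean()

print(mean_active, mean_calendar)
```

Which of the two is more appropriate depends on whether the question is "how much do I listen on a day when I listen" or "how much do I listen per day overall".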

In [12]:
# Total play time (ms) per (weekday, hour); weekday 0 = Monday
track_play_hour_weekday = data.groupby([data['ts'].dt.weekday, data['ts'].dt.hour])['ms_played'].sum()
week_time_matrix = np.zeros((24,7))
for weekday in range(0, 7):
    for h in range(0,24):
        if (weekday, h) in track_play_hour_weekday:
            week_time_matrix[h, weekday] = track_play_hour_weekday[weekday, h]

fig = px.imshow(week_time_matrix, origin='lower', template='plotly_dark', color_continuous_scale=['#11272e', '#1cc8ff'],
                labels=dict(x="Day of Week", y="Hour of Day", color="Play time"),
                x=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
                y=np.arange(0,24)
               )
fig.write_html('plot.html')
fig.update(layout_coloraxis_showscale=False) 
report = dp.Report(dp.Plot(fig)) #Create a report
report.upload(name='Streaming Hour/Weekday Distribution')

Report successfully uploaded. View and share your report <a href='https://datapane.com/reports/d7dw2Z3/streaming-hourweekday-distribution/' target='_blank'>here</a>, or edit your report <a href='https://datapane.com/reports/d7dw2Z3/streaming-hourweekday-distribution/edit/' target='_blank'>here</a>.
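
The nested loop that fills `week_time_matrix` can also be expressed with pandas reshaping. This is a sketch on made-up data, assuming the same layout as above (rows are hours 0 to 23, columns are weekdays 0 to 6 with Monday as 0); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up streams: two on a Monday at 08:xx, one on a Sunday at 23:00.
toy = pd.DataFrame({
    "ts": pd.to_datetime(
        ["2021-01-04T08:15:00Z", "2021-01-04T08:45:00Z", "2021-01-10T23:00:00Z"],
        utc=True,
    ),
    "ms_played": [1000, 2000, 5000],
})

hour = toy["ts"].dt.hour.rename("hour")
weekday = toy["ts"].dt.weekday.rename("weekday")

# Sum per (hour, weekday), pivot weekdays into columns, then make sure every
# hour and weekday is present even if nothing was played in that slot.
week_time = (
    toy.groupby([hour, weekday])["ms_played"].sum()
       .unstack("weekday", fill_value=0)
       .reindex(index=range(24), columns=range(7), fill_value=0)
)
week_time_matrix = week_time.to_numpy()
print(week_time_matrix.shape)
```

The resulting array has the same shape as the one built by the loop and can be passed to `px.imshow` in the same way.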

In [13]:
# Same (weekday, hour) totals, restricted to 2021
track_play_hour_weekday = data[data['ts'].dt.year == 2021].groupby([data['ts'].dt.weekday, data['ts'].dt.hour])['ms_played'].sum()
week_time_matrix = np.zeros((24,7))
for weekday in range(0, 7):
    for h in range(0,24):
        if (weekday, h) in track_play_hour_weekday:
            week_time_matrix[h, weekday] = track_play_hour_weekday[weekday, h]

fig = px.imshow(week_time_matrix, origin='lower', template='plotly_dark', color_continuous_scale=['#11272e', '#1cc8ff'],
                labels=dict(x="Day of Week", y="Hour of Day", color="Play time"),
                x=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
                y=np.arange(0,24)
               )
fig.write_html('plot.html')
fig.update(layout_coloraxis_showscale=False) 
report = dp.Report(dp.Plot(fig)) #Create a report
report.upload(name='Streaming Hour/Weekday Distribution')

Report successfully uploaded. View and share your report <a href='https://datapane.com/reports/d7dw2Z3/streaming-hourweekday-distribution/' target='_blank'>here</a>, or edit your report <a href='https://datapane.com/reports/d7dw2Z3/streaming-hourweekday-distribution/edit/' target='_blank'>here</a>.

In [14]:
# Day of Year, Github style
# Total play time (ms) per day of the year (1 = 1 January), 2021 only
track_play_dayofyear = data[data['ts'].dt.year == 2021].groupby([data['ts'].dt.dayofyear])['ms_played'].sum()

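# Create time matrix: column = 7-day block counted from 1 January, row = day offset within that block.
# Row 0 is therefore the weekday of 1 January (a Friday in 2021), not necessarily Monday.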
dayofyear_matrix = np.zeros((7,53))
for week in range(0, 53):
    for day in range(0,7):
        day_of_year = week * 7 + day
        if day_of_year+1 in track_play_dayofyear:
            dayofyear_matrix[day, week] = track_play_dayofyear[day_of_year+1] / (1000.0 * 60.0 * 60.0)


fig = px.imshow(dayofyear_matrix, origin='lower', template='plotly_dark', color_continuous_scale=['#11272e', '#1cc8ff'],
                labels=dict(x="Week", y="Weekday", color="Play time (h)"),
                x=np.arange(0,53),
                y=np.arange(0,7)
               )
fig.write_html('plot.html')

fig.update(layout_coloraxis_showscale=False) 
report = dp.Report(dp.Plot(fig)) #Create a report
report.upload(name='test_plot')

Report successfully uploaded. View and share your report <a href='https://datapane.com/reports/MA1ZYY3/test-plot/' target='_blank'>here</a>, or edit your report <a href='https://datapane.com/reports/MA1ZYY3/test-plot/edit/' target='_blank'>here</a>.
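
In the matrix above, rows count days from 1 January rather than calendar weekdays. To get a layout closer to the GitHub contribution graph, each day can be placed by its real weekday. The sketch below uses made-up daily totals; `daily_hours`, `week_idx` and `calendar` are illustrative names, and 54 columns are allocated so that any year fits.

```python
import numpy as np
import pandas as pd

year = 2021
days = pd.date_range(f"{year}-01-01", f"{year}-12-31", freq="D")

# Hypothetical daily listening totals in hours (all zero except two days).
daily_hours = pd.Series(0.0, index=days)
daily_hours.loc[pd.Timestamp("2021-01-01")] = 2.5  # a Friday
daily_hours.loc[pd.Timestamp("2021-01-04")] = 1.0  # a Monday

# Shift by the weekday of 1 January so that each column starts on a Monday.
first_weekday = days[0].weekday()  # Monday = 0, ..., Sunday = 6
week_idx = ((days.dayofyear - 1 + first_weekday) // 7).to_numpy()
weekday_idx = days.weekday.to_numpy()

# Rows are real weekdays, columns are weeks of the year.
calendar = np.zeros((7, 54))
calendar[weekday_idx, week_idx] = daily_hours.to_numpy()
print(calendar[:, :2])
```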