# A Musical Interlude: Understanding my music tastes

- In this project, I will use a combination of my streaming history from Spotify and the Spotify API (using spotipy) to see what interesting things we can learn from it.

## Requirements
- Python (tested with 3.9 and 3.11)
- Spotipy (``pip install spotipy``) >= 2.21
- NumPy
- Pandas

## Initialisation

In [1]:
import os
import re
import json
import numpy as np
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials


## Importing my streaming history
The core dataset for this project, the user's streaming history, can be downloaded from the Spotify Account/Privacy page ([here](https://www.spotify.com/uk/account/privacy/)). It'll take a while to arrive, but when it does you'll have a folder containing a whole host of JSON files. Of interest for this project are those named ``StreamingHistory[x].json``; each record in these files contains:
- artistName - The name of the artist
- trackName - The name of the song
- endTime - Timestamp at which the song finished playing, ``YYYY-MM-DD HH:MM``
- msPlayed - Amount of the song that was played, in milliseconds

There are multiple files, named from 0 through [n], split up to ease reading.
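
To make the structure concrete, here is a minimal sketch of what one record looks like and how it loads into pandas. The record is copied from the first row of the table shown further down; the inline JSON string merely stands in for a real file, and the ``minutesPlayed`` column is an illustrative addition, not something the export provides.

```python
import io
import pandas as pd

# Illustrative stand-in for one StreamingHistory file (a JSON array of records)
sample_json = """[
  {"endTime": "2020-11-30 16:44", "artistName": "Ben Fankhauser",
   "trackName": "King Of New York", "msPlayed": 217569}
]"""

sample = pd.read_json(io.StringIO(sample_json), convert_dates=False)
# Parse the timestamp explicitly; the format is YYYY-MM-DD HH:MM
sample["endTime"] = pd.to_datetime(sample["endTime"], format="%Y-%m-%d %H:%M")
# msPlayed is in milliseconds, so derive minutes for readability
sample["minutesPlayed"] = sample["msPlayed"] / 60_000
print(sample.dtypes)
```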

In and of itself this data is interesting, but it can be combined with Spotify's catalog to allow for more advanced analysis. The catalog contains a wide variety of datapoints about each song, including its BPM, danceability, valence and energy. We'll use these traits later to assess my listening habits.

Now, let's import all of these StreamingHistory files into a dataframe:

In [2]:
dataPath = "./data/2021/"
# keep only the .json files; this also skips the .ipynb_checkpoints
# folder that sometimes appears
fileList = [path for path in os.listdir(dataPath) if path.endswith(".json")]
frames = []
for path in fileList:
    content = pd.read_json(os.path.join(dataPath, path))
    frames.append(content)
history = pd.concat(frames)
history


Unnamed: 0,endTime,artistName,trackName,msPlayed
0,2020-11-30 16:44,Ben Fankhauser,King Of New York,217569
1,2020-12-01 00:47,Matilda the Musical Original Cast,Quiet,228000
2,2020-12-01 09:51,Matilda the Musical Original Cast,Quiet,930
3,2020-12-01 09:52,Andy Grammer,85,351
4,2020-12-01 09:52,Bridgit Mendler,Snap My Fingers,264
...,...,...,...,...
5997,2021-12-01 23:05,Maisie Peters,Sad Girl Summer,805
5998,2021-12-01 23:05,Dizzee Rascal,Bonkers,174552
5999,2021-12-01 23:05,Original Broadway Cast Of Matilda The Musical,When I Grow up (feat. Lauren Ward & Bailey Ryon),530
6000,2021-12-01 23:05,Joan Jett & The Blackhearts,I Love Rock 'N Roll,0


This gives us a dataframe with ~26000 entries - i.e. a record of 26000 streams by this user, which should be quite enough for this project.
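
Note that the index in the table above restarts for each file. If a single continuous index is preferred, a variation on the loading step could look like the sketch below. The ``load_history`` helper and the two tiny demo files are made up for illustration; only the ``StreamingHistory*.json`` naming comes from the export itself.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def load_history(data_dir):
    """Concatenate every StreamingHistory*.json file in data_dir (hypothetical helper)."""
    paths = sorted(Path(data_dir).glob("StreamingHistory*.json"))
    frames = [pd.read_json(p, convert_dates=False) for p in paths]
    # ignore_index gives one continuous 0..n-1 index instead of restarting per file
    return pd.concat(frames, ignore_index=True)

# Self-contained demo with two tiny made-up files
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2):
        rows = [{"endTime": "2020-12-01 09:52", "artistName": "Bridgit Mendler",
                 "trackName": "Snap My Fingers", "msPlayed": 264 + i}]
        Path(tmp, f"StreamingHistory{i}.json").write_text(json.dumps(rows))
    demo = load_history(tmp)

print(len(demo), list(demo.index))
```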

## Adding Details about each song
Let's use *spotipy* to get some extra info about each song. API requests are slow and expensive, so we want to keep them to a minimum where possible. Quite a lot of the songs in the ``history`` dataframe are duplicates, so let's filter it down to remove them:

In [3]:
unique = history.drop_duplicates(
    subset=['artistName', 'trackName'], keep="first")
unique

Unnamed: 0,endTime,artistName,trackName,msPlayed
0,2020-11-30 16:44,Ben Fankhauser,King Of New York,217569
1,2020-12-01 00:47,Matilda the Musical Original Cast,Quiet,228000
3,2020-12-01 09:52,Andy Grammer,85,351
4,2020-12-01 09:52,Bridgit Mendler,Snap My Fingers,264
5,2020-12-01 09:52,Nikki Blonsky,Good Morning Baltimore,12996
...,...,...,...,...
5770,2021-11-29 13:35,Klaus Badelt,He's a Pirate,90426
5771,2021-11-29 13:38,Marc Streitenfeld,Walter's Burial,184826
5772,2021-11-29 13:54,James Newton Howard,Your Mother Loves You,119356
5933,2021-11-30 20:34,David Bowie,Heroes - 2017 Remaster,4882


We now have 5185 rows, which seems like about the right number of unique tracks for this period.
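
One thing ``drop_duplicates`` throws away is how often each track was played. If that turns out to be useful later, it can be counted from the full history before deduplicating. The sketch below uses a small made-up frame in place of ``history``, and the ``plays`` column name is my own choice.

```python
import pandas as pd

# Toy stand-in for the history dataframe
toy = pd.DataFrame({
    "artistName": ["Matilda the Musical Original Cast"] * 2 + ["Bridgit Mendler"],
    "trackName": ["Quiet", "Quiet", "Snap My Fingers"],
    "msPlayed": [228000, 930, 264],
})
keys = ["artistName", "trackName"]

# One row per (artist, track) pair, keeping the first stream seen
unique_toy = toy.drop_duplicates(subset=keys, keep="first")

# Number of streams per (artist, track) pair
play_counts = toy.groupby(keys).size().rename("plays").reset_index()
print(play_counts)
```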

Now, let's initialise spotipy

In [4]:
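# Read the credentials from the environment rather than hard-coding them;
# secrets should never be committed alongside a notebook.
SPOTIPY_CLIENT_ID = os.environ["SPOTIPY_CLIENT_ID"]
SPOTIPY_CLIENT_SECRET = os.environ["SPOTIPY_CLIENT_SECRET"]

auth_manager = SpotifyClientCredentials(client_id=SPOTIPY_CLIENT_ID, client_secret=SPOTIPY_CLIENT_SECRET)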
sp = spotipy.Spotify(auth_manager=auth_manager)

In [5]:
mp = sp.search("Maisie Peters", type="artist")
mp['artists']['items'][0]['uri']
trying = sp.search("Maisie Peters I'm Trying (Not Friends)", type="track", market="GB", limit=1)
trying['tracks']['items'][0]

{'album': {'album_type': 'album',
  'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/2RVvqRBon9NgaGXKfywDSs'},
    'href': 'https://api.spotify.com/v1/artists/2RVvqRBon9NgaGXKfywDSs',
    'id': '2RVvqRBon9NgaGXKfywDSs',
    'name': 'Maisie Peters',
    'type': 'artist',
    'uri': 'spotify:artist:2RVvqRBon9NgaGXKfywDSs'}],
  'external_urls': {'spotify': 'https://open.spotify.com/album/1X1EZB1hCoymZ9gU8JKv86'},
  'href': 'https://api.spotify.com/v1/albums/1X1EZB1hCoymZ9gU8JKv86',
  'id': '1X1EZB1hCoymZ9gU8JKv86',
  'images': [{'height': 640,
    'url': 'https://i.scdn.co/image/ab67616d0000b273084229044ca0f2f9f43584cc',
    'width': 640},
   {'height': 300,
    'url': 'https://i.scdn.co/image/ab67616d00001e02084229044ca0f2f9f43584cc',
    'width': 300},
   {'height': 64,
    'url': 'https://i.scdn.co/image/ab67616d00004851084229044ca0f2f9f43584cc',
    'width': 64}],
  'name': 'You Signed Up For This',
  'release_date': '2021-08-27',
  'release_date_precision': 

Let's test it: looking in the Spotify app, we see that Maisie Peters' URI is ``spotify:artist:2RVvqRBon9NgaGXKfywDSs``, which matches what the API just gave us.

Searching for a track returns lots of details about each song; for this project I've selected the following as 'interesting':
- release_date
- duration_ms
- explicit
- popularity
- uri

Now, let's go through the entire `unique` dataframe and add track details. The search can return multiple tracks, but let's assume whichever one comes back first is the correct one - we can test that later if we need to.
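
If the "first result is correct" assumption ever needs checking, one crude option is to compare the requested track name with the name the API returns. The sketch below is only a heuristic under my own assumptions: ``normalise`` and ``looks_like_match`` are hypothetical helpers, and ignoring bracketed suffixes such as "(feat. ...)" is a guess at what a reasonable match looks like.

```python
import re

def normalise(title):
    """Lower-case, drop bracketed suffixes and punctuation (hypothetical helper)."""
    title = re.sub(r"[\(\[].*?[\)\]]", "", title.lower())
    return re.sub(r"[^a-z0-9]+", " ", title).strip()

def looks_like_match(requested, returned):
    """True if one normalised title equals or contains the other."""
    a, b = normalise(requested), normalise(returned)
    if not a or not b:
        return False
    return a == b or a in b or b in a

print(looks_like_match("Heroes - 2017 Remaster", "Heroes"))
print(looks_like_match("Quiet", "Bonkers"))
```

Rows where this returns ``False`` could then be reviewed by hand rather than trusted blindly.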


In [15]:
def getSongDetails(artist, track):
    searchRequest = f"{artist} {track}"
    # only the top match is used, so limit=1 avoids fetching extra results
    results = sp.search(searchRequest, type="track", market="GB", limit=1)
    items = results['tracks']['items']
    if not items:
        # no match found: return empty values so callers still get every key
        return dict.fromkeys(
            ["name", "release_date", "duration_ms", "explicit", "popularity", "uri"])
    return {
        "name": items[0]['name'],
        "release_date" : items[0]['album']['release_date'],
        "duration_ms" : items[0]['duration_ms'],
        "explicit" : items[0]['explicit'],
        "popularity" : items[0]['popularity'],
        "uri" : items[0]['uri']
    }

getSongDetails("Maisie Peters", "You Signed Up For This")


{'name': 'You Signed Up For This',
 'release_date': '2021-08-27',
 'duration_ms': 195133,
 'explicit': False,
 'popularity': 53,
 'uri': 'spotify:track:1xZyqMoN55Rd5WD8cHSva8'}

This function appears to be working correctly, so let's deploy it on the ``unique`` dataframe, starting with the first two rows as a test:


In [20]:
cols = ['name', 'release_date', 'duration_ms', 'explicit', 'popularity', 'uri']
# .copy() avoids the SettingWithCopyWarning raised when writing to a slice
detail_unique = unique.head(2).copy()
details = [getSongDetails(artist, track)
           for artist, track in zip(detail_unique['artistName'], detail_unique['trackName'])]
# one column per detail; assigned by position because the index is not unique
detail_unique[cols] = pd.DataFrame(details)[cols].to_numpy()
detail_unique


Unnamed: 0,endTime,artistName,trackName,msPlayed,name,release_date,duration_ms,explicit,popularity,uri
0,2020-11-30 16:44,Ben Fankhauser,King Of New York,217569,King Of New York,2012-04-10,249275,False,...,...
1,2020-12-01 00:47,Matilda the Musical Original Cast,Quiet,228000,Quiet,2011-10-13,228000,False,45,spotify...
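
Once details exist for every unique track, they presumably need to be attached back to each individual stream. A minimal sketch of one way to do that is a left merge on the artist and track names. The two toy frames below stand in for ``history`` and the detailed ``unique`` frame; the values for "Quiet" are taken from the output above, and everything else is illustrative.

```python
import pandas as pd

keys = ["artistName", "trackName"]

# Toy stand-in for the full streaming history
history_toy = pd.DataFrame({
    "endTime": ["2020-12-01 00:47", "2020-12-01 09:51", "2020-12-01 09:52"],
    "artistName": ["Matilda the Musical Original Cast"] * 2 + ["Bridgit Mendler"],
    "trackName": ["Quiet", "Quiet", "Snap My Fingers"],
    "msPlayed": [228000, 930, 264],
})

# Toy stand-in for the per-track details (one row per unique track)
details_toy = pd.DataFrame({
    "artistName": ["Matilda the Musical Original Cast"],
    "trackName": ["Quiet"],
    "release_date": ["2011-10-13"],
    "duration_ms": [228000],
})

# how="left" keeps every stream, even those without details;
# validate raises if the details frame has duplicate keys
enriched = history_toy.merge(details_toy, on=keys, how="left",
                             validate="many_to_one")
print(enriched)
```

Streams whose track has no details simply end up with missing values in the new columns, which makes gaps easy to spot.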
