# A Musical Interlude: Understanding my music tastes

- In this project, I combine my Spotify streaming history with data from the Spotify API (accessed through spotipy) to see what interesting things we can learn about my listening habits.

## Requirements
- Python (tested with 3.9 and 3.11)
- Spotipy (``pip install spotipy``) >= 2.21
- NumPy
- Pandas

## Initialisation

In [1]:
import os
import re
import json
import numpy as np
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials


## Importing my streaming history
The core dataset for this project, the user's streaming history, can be downloaded from the Spotify Account/Privacy page ([here](https://www.spotify.com/uk/account/privacy/)). It'll take a while to arrive, but when it does you'll have a folder containing a whole host of JSON files. Of interest for this project are those named ``StreamingHistory[x].json``. Each record in these files contains:
- artistName - The name of the artist
- trackName - The name of the song
- endTime - Timestamp at which the song finished playing, ``YYYY-MM-DD HH:MM``
- msPlayed - How long the song was played for, in milliseconds

There are multiple files, named from 0 through [n], split up to ease reading.
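
As a rough illustration of the record layout described above, the sketch below parses a single record using only the standard library. The sample values are copied from one row of this notebook's history; the helper name `parse_stream` and the output keys are hypothetical, chosen just for this example.

```python
import json
from datetime import datetime

# One record in the StreamingHistory layout (values taken from this notebook's data)
sample = '''[
  {"endTime": "2021-12-06 23:59",
   "artistName": "WALK THE MOON",
   "trackName": "Shut Up and Dance",
   "msPlayed": 119774}
]'''


def parse_stream(record):
    """Convert the raw strings and integers of one record into richer Python types."""
    return {
        "artist": record["artistName"],
        "track": record["trackName"],
        # endTime has minute precision and no timezone information
        "ended": datetime.strptime(record["endTime"], "%Y-%m-%d %H:%M"),
        "seconds_played": record["msPlayed"] / 1000,
    }


stream = parse_stream(json.loads(sample)[0])
print(stream["ended"].isoformat(), stream["seconds_played"])
```

The notebook itself loads these files with pandas; this standalone version is only meant to make the four fields and their types concrete.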

In and of itself this data is interesting, but it can be combined with Spotify's catalog to allow for more advanced analysis. The catalog contains a wide variety of data points about each song, including its tempo (BPM), danceability, valence and energy. We'll use these traits later to assess my listening habits.

Now, let's import all of these StreamingHistory files into a dataframe:

In [2]:
dataPath = "./data/2022/"
# Keep only the StreamingHistory JSON files; this also skips the
# .ipynb_checkpoints entries that sometimes appear
fileList = sorted(
    path for path in os.listdir(dataPath)
    if path.startswith("StreamingHistory") and path.endswith(".json"))
frames = []
for path in fileList:
    content = pd.read_json(os.path.join(dataPath, path))
    frames.append(content)
history = pd.concat(frames)
history.to_csv("./data/output/history.csv")
history

This gives us a dataframe with ~17000 entries, i.e. a record of 17000 streams by this user, which should be quite enough for this project.
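
If the data folder holds other files from the export as well, a stricter filename filter may be useful. The following is a minimal sketch, assuming the files follow the ``StreamingHistory[x].json`` naming described earlier; the helper `history_files` and the example listing are hypothetical.

```python
import re

# Matches StreamingHistory0.json, StreamingHistory12.json, ... and captures the number
HISTORY_FILE = re.compile(r"^StreamingHistory(\d+)\.json$")


def history_files(names):
    """Keep only StreamingHistory[x].json names, sorted by their numeric suffix."""
    matched = [(int(m.group(1)), name)
               for name in names
               if (m := HISTORY_FILE.match(name))]
    return [name for _, name in sorted(matched)]


# Hypothetical directory listing, including entries that should be ignored
listing = ["StreamingHistory10.json", "Playlist1.json", ".ipynb_checkpoints",
           "StreamingHistory0.json", "StreamingHistory2.json"]
print(history_files(listing))
```

Sorting on the captured number rather than on the raw string keeps file 10 after file 2, so the concatenated history stays in file order.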

## Adding Details about each song
Let's use *spotipy* to get some extra info about each song. API requests are slow and expensive, so we want to keep them to a minimum where possible. Quite a lot of the songs in the ``history`` dataframe are duplicates, so let's filter it down to one row per artist and track.

In [3]:
unique = history.copy(deep=True)
unique.drop_duplicates(
    subset=['artistName', 'trackName'], keep="first", inplace=True)
unique

Unnamed: 0,endTime,artistName,trackName,msPlayed
0,2021-11-28 18:50,Professor Green,Read All About It,77971
1,2021-12-06 23:59,WALK THE MOON,Shut Up and Dance,119774
2,2021-12-07 04:49,Stephanie Hsu,A Guy That I'd Kinda Be Into,12068
4,2021-12-07 04:52,Maisie Peters,Boy,178013
5,2021-12-07 04:56,Maisie Peters,Feels Like This,223346
...,...,...,...,...
7079,2022-12-07 17:10,Johnny Hodges,The Last Time I Saw Paris,174960
7080,2022-12-07 17:11,Ella Fitzgerald,Let's Fall In Love,27365
7081,2022-12-07 17:18,Paul Desmond,The Way You Look Tonight - 2003 Remastered,439507
7082,2022-12-07 17:22,Hank Jones,My Wish,212587


We now have 4279 rows, which seems like about the right number of unique tracks for this period.

Now, let's initialise spotipy.

In [4]:
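# Credentials are read from the environment rather than hard-coded, so that
# they are never saved alongside the notebook
SPOTIPY_CLIENT_ID = os.environ["SPOTIPY_CLIENT_ID"]
SPOTIPY_CLIENT_SECRET = os.environ["SPOTIPY_CLIENT_SECRET"]

auth_manager = SpotifyClientCredentials(client_id=SPOTIPY_CLIENT_ID, client_secret=SPOTIPY_CLIENT_SECRET)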
sp = spotipy.Spotify(auth_manager=auth_manager)

In [5]:
mp = sp.search("Maisie Peters", type="artist")
mp['artists']['items'][0]['uri']
trying = sp.search("Maisie Peters I'm Trying (Not Friends)", type="track", market="GB", limit=1)
trying['tracks']['items'][0]

{'album': {'album_type': 'album',
  'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/2RVvqRBon9NgaGXKfywDSs'},
    'href': 'https://api.spotify.com/v1/artists/2RVvqRBon9NgaGXKfywDSs',
    'id': '2RVvqRBon9NgaGXKfywDSs',
    'name': 'Maisie Peters',
    'type': 'artist',
    'uri': 'spotify:artist:2RVvqRBon9NgaGXKfywDSs'}],
  'external_urls': {'spotify': 'https://open.spotify.com/album/1X1EZB1hCoymZ9gU8JKv86'},
  'href': 'https://api.spotify.com/v1/albums/1X1EZB1hCoymZ9gU8JKv86',
  'id': '1X1EZB1hCoymZ9gU8JKv86',
  'images': [{'height': 640,
    'url': 'https://i.scdn.co/image/ab67616d0000b273084229044ca0f2f9f43584cc',
    'width': 640},
   {'height': 300,
    'url': 'https://i.scdn.co/image/ab67616d00001e02084229044ca0f2f9f43584cc',
    'width': 300},
   {'height': 64,
    'url': 'https://i.scdn.co/image/ab67616d00004851084229044ca0f2f9f43584cc',
    'width': 64}],
  'name': 'You Signed Up For This',
  'release_date': '2021-08-27',
  'release_date_precision': 

Doing this for a track returns lots of details about each song. I've chosen to keep them all in a JSON field, so that we can use whichever fields seem most appropriate to answer the questions at hand. 
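
To give a feel for how individual fields can be pulled out of the stored JSON later, here is a small sketch. The helper `summarise_track` is hypothetical, and the `example` dictionary is a heavily cut-down stand-in for a real search result, keeping only a handful of keys.

```python
def summarise_track(details):
    """Pull a few commonly used fields out of a track object from the search."""
    album = details.get("album", {})
    return {
        "track": details.get("name"),
        "artists": [artist["name"] for artist in details.get("artists", [])],
        "album": album.get("name"),
        "release_date": album.get("release_date"),
    }


# Cut-down stand-in for a search result; real track objects carry many more keys
example = {
    "name": "I'm Trying (Not Friends)",
    "artists": [{"name": "Maisie Peters"}],
    "album": {"name": "You Signed Up For This", "release_date": "2021-08-27"},
}
print(summarise_track(example))
```

Using `.get()` throughout means a record with missing keys yields `None` values instead of raising, which is convenient when scanning thousands of stored results.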

Now, let's go through the entire `unique` dataframe and add track details. The search can return multiple tracks, but let's assume that whichever one comes back first is the correct one; we can test that later if we need to. To do this, I've created a function:

In [6]:
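def getSongDetails(artist, track):
    """Return the first search hit for "artist track", or None if the lookup fails."""
    searchRequest = f"{artist} {track}"
    try:
        # limit=1 because only the first match is ever used
        results = sp.search(searchRequest, type="track", market="GB", limit=1)
    except Exception as error:
        print(f"Failed to download this song: {searchRequest} ({error})")
        return None

    items = results['tracks']['items']
    if not items:
        # An empty result would otherwise raise an IndexError below
        print(f"No match found for this song: {searchRequest}")
        return None
    return items[0]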
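
The assumption that the first hit is the right one can be checked afterwards by comparing the returned track against what was asked for. The sketch below is one possible heuristic, not a definitive test: the helper `looks_like_match` is hypothetical, and the prefix comparison is an assumption made to tolerate suffixes such as "- 2003 Remastered".

```python
def looks_like_match(artist, track, details):
    """Return True when the hit's artists and title agree with the request."""
    def norm(text):
        return text.casefold().strip()

    hit_artists = {norm(a["name"]) for a in details.get("artists", [])}
    hit_title = norm(details.get("name", ""))
    wanted_title = norm(track)
    # Titles often differ only by a suffix, so accept a prefix match either way
    title_ok = bool(hit_title) and (
        hit_title.startswith(wanted_title) or wanted_title.startswith(hit_title))
    return norm(artist) in hit_artists and title_ok


# Cut-down stand-in for a search hit
hit = {"name": "The Way You Look Tonight", "artists": [{"name": "Paul Desmond"}]}
print(looks_like_match("Paul Desmond", "The Way You Look Tonight - 2003 Remastered", hit))
print(looks_like_match("Hank Jones", "My Wish", hit))
```

Records that fail a check like this can be collected and inspected by hand rather than silently kept.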

As the API data is in JSON format, let's convert our input dataframe to a list of dictionaries (one per row), which maps directly onto JSON. This will make storing and analysing the data much easier.

In [7]:
uniqueList = unique.to_dict('records')

Now we have this list, let's apply the getSongDetails() function over the first 10 records to check that it works correctly.

In [8]:
# Only the first 10 records, as a quick check before the full run
for record in uniqueList[:10]:
    record['details'] = getSongDetails(record['artistName'], record['trackName'])

uniqueList[1:10]

[{'endTime': '2021-12-06 23:59',
  'artistName': 'WALK THE MOON',
  'trackName': 'Shut Up and Dance',
  'msPlayed': 119774,
  'details': {'album': {'album_type': 'album',
    'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/6DIS6PRrLS3wbnZsf7vYic'},
      'href': 'https://api.spotify.com/v1/artists/6DIS6PRrLS3wbnZsf7vYic',
      'id': '6DIS6PRrLS3wbnZsf7vYic',
      'name': 'WALK THE MOON',
      'type': 'artist',
      'uri': 'spotify:artist:6DIS6PRrLS3wbnZsf7vYic'}],
    'external_urls': {'spotify': 'https://open.spotify.com/album/3mNoFlD1wsoXfkljfFzExT'},
    'href': 'https://api.spotify.com/v1/albums/3mNoFlD1wsoXfkljfFzExT',
    'id': '3mNoFlD1wsoXfkljfFzExT',
    'images': [{'height': 640,
      'url': 'https://i.scdn.co/image/ab67616d0000b27343294cfa2688055c9d821bf3',
      'width': 640},
     {'height': 300,
      'url': 'https://i.scdn.co/image/ab67616d00001e0243294cfa2688055c9d821bf3',
      'width': 300},
     {'height': 64,
      'url': 'https://i.s

This does appear to work as expected. The following lines of code will apply this function over the entire unique list of songs.
**Please note that this takes more than 15 minutes to run, so it is disabled by default. An example of the resultant file is stored in ``./data/output/unique_with_details.json``, and can be used for the analysis part of this project.**

In [9]:
for record in uniqueList:
    record['details'] = getSongDetails(record['artistName'], record['trackName'])

with open('data/output/unique_with_details.json', 'w', encoding='utf-8') as file:
    json.dump(uniqueList, file, ensure_ascii=False, indent=4)
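
Because the full run is long, it can help to make it resumable, so that an interruption does not throw away the lookups already made. The following is a minimal sketch under stated assumptions: `add_details`, `save` and `fake_fetch` are hypothetical helpers, and the lookup function is passed in as `fetch` so the example runs without touching the API. In the notebook, `getSongDetails` could be passed instead.

```python
import json
import os
import tempfile


def save(records, path):
    """Write the records to disk in the same format as the cell above."""
    with open(path, "w", encoding="utf-8") as file:
        json.dump(records, file, ensure_ascii=False, indent=4)


def add_details(records, fetch, checkpoint_path, every=100):
    """Fill in record['details'] where it is missing, saving progress as we go."""
    done = 0
    for record in records:
        if record.get("details") is not None:
            continue  # already looked up on a previous run
        record["details"] = fetch(record["artistName"], record["trackName"])
        done += 1
        if done % every == 0:
            save(records, checkpoint_path)  # periodic checkpoint
    save(records, checkpoint_path)
    return done


def fake_fetch(artist, track):
    """Stand-in for the real API lookup, used only for this demonstration."""
    return {"name": track, "artists": [{"name": artist}]}


records = [
    {"artistName": "Maisie Peters", "trackName": "Boy", "details": {"name": "Boy"}},
    {"artistName": "Hank Jones", "trackName": "My Wish"},
]
path = os.path.join(tempfile.gettempdir(), "unique_with_details_demo.json")
print(add_details(records, fake_fetch, path))
```

Only the second record triggers a lookup here, since the first already has details. A record whose lookup returned `None` would be retried on the next run.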
