# Visualizing an artist's musical network using the Spotify API
> Chandler Haukap

### I love music and I love data visualization. Why not use music to learn how to visualize network graphs?

Coming into this project I had no experience plotting. [Rebecca Weng's article](https://towardsdatascience.com/tutorial-network-visualization-basics-with-networkx-and-plotly-and-a-little-nlp-57c9bbb55bb9) on creating network graphics using plotly was a huge help!

## Importing the libraries

1) [Numpy](https://numpy.org/) and [Pandas](https://pandas.pydata.org/): The standard data science tools for manipulating matrices. If you're unfamiliar, [start here](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html).
2) [Spotipy](https://spotipy.readthedocs.io/en/2.19.0/): A fantastic Python wrapper for the [Spotify API](https://developer.spotify.com/documentation/web-api/).
3) [Networkx](https://networkx.org/): A tool for analyzing networks. To oversimplify, networkx is to networks as pandas is to matrices.
4) [MatPlotLib.pyplot](https://matplotlib.org/): We'll use this library for simple graphics like bar charts.
5) [Plotly](https://plotly.com/python/): This graphics tool is much more powerful, but also more complicated. We'll use it to make the network graphs.
6) Json: A built-in library. It converts objects from JavaScript Object Notation to Python dictionaries and back.

In [1]:
import pandas as pd
import spotipy
import numpy as np
from spotipy.oauth2 import SpotifyOAuth
import networkx as nx
import matplotlib.pyplot as plt
import plotly.offline as py
import plotly.graph_objects as go
import json
%matplotlib inline

## Initialize the Spotify API

I've stored my credentials for the Spotify API in a file called "keys.json". I obviously can't show you, but there is an example file in the GitHub repository!
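If you're building your own "keys.json", here is a minimal sketch of a loader that fails early when a field is missing. The helper name `load_spotify_keys` is my own invention, and the field names are simply the ones the notebook looks up (`client_id`, `secret`, `redirect_uri` under a `spotify` key). Check the example file in the repository for the authoritative layout.

```python
import json

# Field names taken from the lookups used when building the Spotify client:
# keys['spotify']['client_id'], ['secret'] and ['redirect_uri'].
REQUIRED_FIELDS = ("client_id", "secret", "redirect_uri")


def load_spotify_keys(path):
    """Load a keys file and raise KeyError if a Spotify field is missing or empty.

    Hypothetical helper, not part of spotipy.
    """
    with open(path) as json_file:
        keys = json.load(json_file)

    spotify = keys.get("spotify", {})
    missing = [field for field in REQUIRED_FIELDS if not spotify.get(field)]
    if missing:
        raise KeyError(f"keys file is missing spotify fields: {missing}")
    return keys
```

The return value has the same shape as the `keys` dictionary used in the rest of the notebook, so `keys = load_spotify_keys('../keys.json')` should slot in without other changes.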

In [3]:
with open('../keys.json') as json_file:
    keys = json.load(json_file)
    
scope = ""
sp = spotipy.Spotify(
  auth_manager=SpotifyOAuth(client_id=keys['spotify']['client_id'],
  client_secret=keys['spotify']['secret'],
  redirect_uri=keys['spotify']['redirect_uri'],
  scope=scope,
  open_browser=False)
)
sp.me()['display_name']

'Chandler'

It works! However, getting my name isn't helping us form a network. We need an artist as a starting point, and I know just who to pick...

In [4]:
sp.search(q='artist:Lana Del Rey', type='artist')['artists']['items'][0]['id']

'00FQb4jTyendYWaN8pK0wa'

Lana Del Rey. No reason other than I'm a huge fan. With that ID we can request her albums.

In [6]:
albums = sp.artist_albums('00FQb4jTyendYWaN8pK0wa')['items']
for album in albums:
    print(album['name'])

Blue Banisters
Chemtrails Over The Country Club
Norman Fucking Rockwell!
Lust For Life
Lust For Life
Lust For Life
Honeymoon
Honeymoon
Honeymoon
Ultraviolence (Deluxe)
Ultraviolence
Ultraviolence
Ultraviolence (Deluxe)
Ultraviolence
Ultraviolence
Ultraviolence
Ultraviolence (Deluxe)
Ultraviolence
Ultraviolence - Audio Commentary
Born To Die - The Paradise Edition
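One thing to keep in mind: the Spotify API returns results in pages, so a single call may not return every album. Below is a rough sketch of how the remaining pages could be collected. It assumes the response is a paging object with `items` and `next` keys and relies on spotipy's `next` method; `fetch_all_pages` is a hypothetical helper name, not something the library provides.

```python
def fetch_all_pages(client, first_page):
    """Collect 'items' from a paged response and from every page after it.

    Assumes `client.next(page)` returns the following page, as spotipy's
    Spotify.next does, and that the last page has a falsy 'next' value.
    """
    items = list(first_page['items'])
    page = first_page
    while page.get('next'):
        page = client.next(page)
        items.extend(page['items'])
    return items
```

With the client from above, this would be called as `albums = fetch_all_pages(sp, sp.artist_albums('00FQb4jTyendYWaN8pK0wa'))`. Expect more API calls, and more duplicate editions, than the single-page version.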


And now we can get every track from every album:

In [7]:
all_tracks = []
for album in albums:
    all_tracks += sp.album_tracks(album['id'])['items']

The Spotify API returns a list of artists for every track. Let's see who Lana has featured on her songs:

In [11]:
for track in all_tracks:
    for artist in track['artists']:
        if artist['name'] != 'Lana Del Rey':
            print(artist['name'])

Nikki Lane
Zella Day
Weyes Blood
The Weeknd
A$AP Rocky
Playboi Carti
A$AP Rocky
Stevie Nicks
Sean Ono Lennon
The Weeknd
A$AP Rocky
Playboi Carti
A$AP Rocky
Stevie Nicks
Sean Ono Lennon
The Weeknd
A$AP Rocky
Playboi Carti
A$AP Rocky
Stevie Nicks
Sean Ono Lennon
Photek


Ok, there are some duplicates, but we can deal with that in the next section. For now, it's enough to know that it's working properly.
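If you want a clean list right away, here is a small sketch that keeps each featured artist once, in the order they first appear. It only assumes the track structure shown above (a list of dictionaries, each with an `artists` list of dictionaries that have a `name`); `unique_featured_artists` is a name I made up for illustration.

```python
def unique_featured_artists(tracks, primary_name):
    """Return each featured artist's name once, in first-seen order."""
    seen = set()
    names = []
    for track in tracks:
        for artist in track['artists']:
            name = artist['name']
            # Skip the primary artist and anyone already recorded
            if name != primary_name and name not in seen:
                seen.add(name)
                names.append(name)
    return names
```

Calling `unique_featured_artists(all_tracks, 'Lana Del Rey')` would collapse the repeated names in the output above into one entry each.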

## Wrapping it all up into a function

This function encapsulates all the work that we've done so far.

In [12]:
def get_featured_artists(artist_ids):
    artists = []
    for artist_id in artist_ids:
        artist = sp.artist(artist_id)
        all_tracks = []
        for album in sp.artist_albums(artist_id)['items']:
            all_tracks += sp.album_tracks(album['id'])['items']

        for track in all_tracks:
            for featured in track['artists']:
                # Skip the primary artist; every other credit is a connection
                if featured['id'] != artist['id']:
                    artists.append([
                        artist['name'],
                        artist['id'],
                        featured['name'],
                        featured['id'],
                        track['name'],
                        track['id']
                    ])
                    
    return artists

In [14]:
artists = get_featured_artists(['00FQb4jTyendYWaN8pK0wa'])
for artist in artists:
    print(f"{artist[0]} -> {artist[2]}")

Lana Del Rey -> Nikki Lane
Lana Del Rey -> Zella Day
Lana Del Rey -> Weyes Blood
Lana Del Rey -> The Weeknd
Lana Del Rey -> A$AP Rocky
Lana Del Rey -> Playboi Carti
Lana Del Rey -> A$AP Rocky
Lana Del Rey -> Stevie Nicks
Lana Del Rey -> Sean Ono Lennon
Lana Del Rey -> The Weeknd
Lana Del Rey -> A$AP Rocky
Lana Del Rey -> Playboi Carti
Lana Del Rey -> A$AP Rocky
Lana Del Rey -> Stevie Nicks
Lana Del Rey -> Sean Ono Lennon
Lana Del Rey -> The Weeknd
Lana Del Rey -> A$AP Rocky
Lana Del Rey -> Playboi Carti
Lana Del Rey -> A$AP Rocky
Lana Del Rey -> Stevie Nicks
Lana Del Rey -> Sean Ono Lennon
Lana Del Rey -> Photek


Fantastic! That takes care of the first layer of the network, but we need more than that to make a cool graphic. We need a function that also gets the network of each of Lana Del Rey's connections.

[Recursion](https://www.youtube.com/watch?v=Mv9NEXX1VHc) would be the simplest solution to write: just recursively call `get_featured_artists` until we reach some stopping condition. However, that would duplicate a ton of work. For example, if The Weeknd and A$AP Rocky have any shared connections, we would end up requesting the same artist twice. Given how quickly I expect the number of connections to grow, we need a more elegant solution.

That's why I wrote this iterative version that builds the network one layer at a time, and removes duplicate artist IDs from each layer before requesting them.

In [15]:
def get_network(size, root_artist_id):
    network = get_featured_artists([root_artist_id])
    next_index = 1
    last_index = 0
    
    for i in range(0, size-1):
        next_index = len(network)
        network += get_featured_artists(np.unique(np.array([a[3] for a in network[last_index:]])))
        last_index = next_index

    return network
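Note that `get_network` only removes duplicates within a layer. An artist who shows up again in a later layer (Lana Del Rey herself, for instance, is a featured artist on her collaborators' tracks) gets requested again. Here is a sketch of a breadth-first variant that remembers who has already been queried. To keep it self-contained, the lookup is passed in as `get_edges`, a hypothetical stand-in for `get_featured_artists`: it takes a list of artist IDs and returns rows where `row[3]` is the featured artist's ID.

```python
def expand_network(root_id, depth, get_edges):
    """Breadth-first expansion that queries each artist at most once.

    get_edges is an assumed callable shaped like get_featured_artists:
    list of artist IDs in, list of rows out, with the featured ID at row[3].
    """
    visited = {root_id}      # every artist already queried or queued
    frontier = [root_id]     # artists to query in the current layer
    network = []

    for _ in range(depth):
        if not frontier:
            break
        rows = get_edges(frontier)
        network += rows

        next_frontier = []
        for row in rows:
            featured_id = row[3]
            if featured_id not in visited:
                visited.add(featured_id)
                next_frontier.append(featured_id)
        frontier = next_frontier

    return network
```

Under those assumptions, `expand_network('00FQb4jTyendYWaN8pK0wa', 3, get_featured_artists)` should cover the same three layers as `get_network(3, ...)` while skipping repeat requests. The trade-off is that the rows returned can differ slightly, because artists already expanded are not expanded a second time.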

In [20]:
result = get_network(3, '00FQb4jTyendYWaN8pK0wa')
print(len(result))

KeyboardInterrupt: 

74,521 entries. That's way more than I expected. But, I bet there are some duplicates. I know for a fact that most music labels put out multiple versions of an album and I know that the function I wrote doesn't account for two artists connecting on multiple songs. 

Therefore, we'll need to clean the data to see how many connections are actually present.

In [None]:
df = pd.DataFrame(result)
df.columns = [
    'primary_artist',
    'primary_artist_id',
    'featured_artist',
    'featured_artist_id',
    'track',
    'track_id'
]
df['unique_id'] = df.apply(lambda t: t.primary_artist_id + t.featured_artist_id, axis=1)
df['track'] = df.groupby('unique_id').track.transform(lambda q: "|".join(np.unique(np.array(q))))
df['track_id'] = df.groupby('unique_id').track_id.transform(lambda q: "|".join(np.unique(np.array(q))))
print(df.drop_duplicates(['unique_id']).shape)

24,776 entries seems more reasonable. Now, I'm going to save this dataframe to a CSV file so I don't have to go through this whole process twice.

In [None]:
df.drop_duplicates(['unique_id']).to_csv("lana_3.csv", index=False)
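If the groupby and transform calls in the cleaning step feel opaque, here is a plain-Python sketch of the same idea: merge every row that shares a (primary artist, featured artist) pair, and join the distinct track names and IDs with `|`. It works on the raw six-element rows returned by `get_featured_artists`; the function name `collapse_connections` is made up for this illustration, and it is not a drop-in replacement for the pandas version.

```python
def collapse_connections(rows):
    """Merge rows that share a primary/featured artist pair.

    Each row is [primary, primary_id, featured, featured_id, track, track_id].
    Returns a dict keyed by primary_id + featured_id.
    """
    merged = {}
    for primary, primary_id, featured, featured_id, track, track_id in rows:
        key = primary_id + featured_id
        entry = merged.setdefault(key, {
            'primary_artist': primary,
            'featured_artist': featured,
            'tracks': set(),
            'track_ids': set(),
        })
        entry['tracks'].add(track)
        entry['track_ids'].add(track_id)

    # Sort before joining so the output is deterministic, like np.unique
    return {
        key: {
            'primary_artist': entry['primary_artist'],
            'featured_artist': entry['featured_artist'],
            'track': '|'.join(sorted(entry['tracks'])),
            'track_id': '|'.join(sorted(entry['track_ids'])),
        }
        for key, entry in merged.items()
    }
```

One caveat with this encoding, in either version: a track name that itself contains `|` would be split into several pieces by any later code that counts tracks with `split("|")`.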

## Playing with the network

In [22]:
depth_3 = pd.read_csv('lana_3.csv')
depth_3.head()

Unnamed: 0,primary_artist,primary_artist_id,featured_artist,featured_artist_id,track,track_id,unique_id
0,Lana Del Rey,00FQb4jTyendYWaN8pK0wa,Nikki Lane,2kWeFaiHBskk8oqky3KHcR,Breaking Up Slowly,1hn1kCOG5dm1XgZYKpfaLR,00FQb4jTyendYWaN8pK0wa2kWeFaiHBskk8oqky3KHcR
1,Lana Del Rey,00FQb4jTyendYWaN8pK0wa,Zella Day,100sLnojEpcadRx4edEBA6,For Free,2lhfd0CF0dFlwRVH8NG8vv,00FQb4jTyendYWaN8pK0wa100sLnojEpcadRx4edEBA6
2,Lana Del Rey,00FQb4jTyendYWaN8pK0wa,Weyes Blood,3Uqu1mEdkUJxPe7s31n1M9,For Free,2lhfd0CF0dFlwRVH8NG8vv,00FQb4jTyendYWaN8pK0wa3Uqu1mEdkUJxPe7s31n1M9
3,Lana Del Rey,00FQb4jTyendYWaN8pK0wa,The Weeknd,1Xyo4u8uXC1ZmMpatF05PJ,Lust For Life (with The Weeknd),0mt02gJ425Xjm7c3jYkOBn|2KzCkzW1nt8qGHjYoTWGlc|...,00FQb4jTyendYWaN8pK0wa1Xyo4u8uXC1ZmMpatF05PJ
4,Lana Del Rey,00FQb4jTyendYWaN8pK0wa,A$AP Rocky,13ubrt8QOOCPljQ2FL1Kca,Groupie Love (feat. A$AP Rocky)|Summer Bummer ...,03hqMhmCZiNKMSPmVabPLP|2LYIQ9DuoE92bTfXDOwRiM|...,00FQb4jTyendYWaN8pK0wa13ubrt8QOOCPljQ2FL1Kca


First, I want to add a column with the number of songs that the artists share. This should be simple given the cleaning work that I already did.

In [23]:
depth_3['songs'] = depth_3.track.map(lambda t: len(t.split("|")))
depth_3.songs.describe()

count    24776.000000
mean         1.990111
std          4.291406
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max        132.000000
Name: songs, dtype: float64

132 songs... no idea what's going on there but for now I want to push on!

## Visualizing the graph

First, I ran into a problem with names like 'Joey Bada$$'. The HTML that I plan on generating needs the HTML character code for '$', which is `&#36;`.

In [25]:
depth_3.featured_artist = depth_3.featured_artist.map(lambda r: r.replace("$", "&#36;"))
depth_3.primary_artist = depth_3.primary_artist.map(lambda r: r.replace("$", "&#36;"))
depth_3.iloc[4].featured_artist

'A&#36;AP Rocky'

I want the edges to be undirected. I might be able to gain more insights from a directed network, but for the purposes of visualizing who has worked with whom the directionality isn't important.

In [26]:
def add_undirected_edge(graph, node_1, node_2, songs):
    # nx.Graph is undirected, so a single add_edge call links both directions
    graph.add_edge(
        node_1, 
        node_2, 
        songs=songs
    )

Adding the nodes and edges!

In [27]:
graph = nx.Graph()
_ = depth_3.apply(lambda e: graph.add_node(e.featured_artist), 
                  axis=1)
_ = depth_3.apply(lambda y: 
                  add_undirected_edge(
                      graph,
                      y.primary_artist, 
                      y.featured_artist, 
                      songs=y.songs
                  ), 
                  axis=1)

The last piece of information I need is the "closeness" of each artist to Lana Del Rey.

In [40]:
paths = dict(nx.shortest_path_length(graph, source='Lana Del Rey'))
depth_3['closeness'] = depth_3.featured_artist.map(lambda t: paths[t])
depth_3.closeness.describe()

count    24776.000000
mean         2.475218
std          0.532184
min          0.000000
25%          2.000000
50%          2.000000
75%          3.000000
max          3.000000
Name: closeness, dtype: float64
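For intuition about what "closeness" means here: on an unweighted graph, the shortest path length from one source is the number of hops found by a breadth-first search. Here is a minimal plain-Python sketch of that idea. It is an illustration, not networkx's actual implementation, and `hop_distances` and its adjacency-dictionary input are assumptions of mine.

```python
from collections import deque


def hop_distances(adjacency, source):
    """Breadth-first search returning {node: number of hops from source}.

    adjacency maps each node to an iterable of its neighbors.
    Nodes that cannot be reached from source are left out of the result.
    """
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances
```

So a closeness of 1 means the artist shares a track with Lana Del Rey, 2 means they share a track with one of her collaborators, and so on.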

### Selecting a layout

There are 10,000 nodes. I can't place them by hand. Luckily, there are a few algorithms out there specifically for designing a network visualization layout! I'm using a spring layout, but I would encourage you to try [different layouts](https://networkx.org/documentation/stable/reference/drawing.html#module-networkx.drawing.layout).

In [41]:
node_positions = nx.spring_layout(graph, scale=2)
node_positions['Lana Del Rey']

array([-0.21876849, -0.16519715])

With the nodes positioned, we'll start by drawing the lines. This step consists of creating a scatterplot, with `mode='lines'`, for each edge.

In [48]:
edge_trace = []
for edge in graph.edges(data=True):
    char_1 = edge[0]
    char_2 = edge[1]

    x0, y0 = node_positions[char_1]
    x1, y1 = node_positions[char_2]

    # edge[2] holds the attributes stored on the edge, including the song count
    trace  = go.Scatter(x = [x0, x1, None],
                        y = [y0, y1, None],
                        line = dict(color = 'rgba(0, 0, 255, 1)'),
                        hoverinfo = 'text',
                        text = [f"{char_1}--{char_2}: {edge[2]['songs']}"],
                        mode = 'lines')
    
    edge_trace.append(trace)
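One trace per edge gets heavy with tens of thousands of edges. A common alternative is to put every edge into a single trace, using `None` entries to break the line between segments. The sketch below only builds the coordinate lists; `edge_coordinates` is a hypothetical helper, and it assumes `positions` maps each node to an `(x, y)` pair the way `node_positions` does.

```python
def edge_coordinates(edges, positions):
    """Flatten edges into x and y lists, with None separating the segments.

    edges is an iterable of (node_1, node_2) pairs, and positions maps
    each node to its (x, y) coordinates.
    """
    xs, ys = [], []
    for node_1, node_2 in edges:
        x0, y0 = positions[node_1]
        x1, y1 = positions[node_2]
        # The trailing None stops the line from continuing to the next edge
        xs += [x0, x1, None]
        ys += [y0, y1, None]
    return xs, ys
```

The result could be drawn with a single `go.Scatter(x=xs, y=ys, mode='lines', hoverinfo='none')`, where `xs, ys = edge_coordinates(graph.edges(), node_positions)`. The trade-off is that all edges share one style and you lose the per-edge hover text.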

And now the nodes. These will all be added to a single scatterplot.

In [52]:
node_trace = go.Scatter(x = [],
                        y = [],
                        text = [],
                        textposition = "top center",
                        textfont_size = 10,
                        mode = 'markers+text',
                        hoverinfo = 'none',
                        marker = dict(color = [],
                                      size  = [],
                                      line  = None),
                        textfont = dict(family= "sans serif",
                                        size  = 18,
                                        color = "Black")
                       )

for node in graph.nodes():
    x, y = node_positions[node]
    node_trace['x'] += tuple([x])
    node_trace['y'] += tuple([y])
    if node == 'Lana Del Rey':
        node_trace['marker']['color'] += tuple(['red'])
        node_trace['marker']['size'] += tuple([40])
    else:
        node_trace['marker']['color'] += tuple(['white'])
        node_trace['marker']['size'] += tuple([
            40 / (paths[node] if paths[node] != 0 else 1)
        ])
    node_trace['text'] += tuple(['<b>' + node + '</b>'])

KeyboardInterrupt: 
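That loop is slow because every `+=` builds a brand new tuple and assigns it back to the trace, so the total work grows quadratically with the number of nodes. Here is a sketch of a faster approach: collect plain Python lists first, then hand them to the trace in one go. The helper name `node_marker_style` is my own, and the colors and sizes follow the rules used in the loop above.

```python
def node_marker_style(nodes, distances, root, base_size=40):
    """Return (colors, sizes) lists for the node markers.

    The root is red; every other node is white and shrinks with its
    distance from the root. distances maps each node to its hop count.
    """
    colors, sizes = [], []
    for node in nodes:
        distance = distances[node]
        colors.append('red' if node == root else 'white')
        # Guard against dividing by zero for the root (distance 0)
        sizes.append(base_size / distance if distance != 0 else base_size)
    return colors, sizes
```

With `colors, sizes = node_marker_style(graph.nodes(), paths, 'Lana Del Rey')`, the lists can be passed directly as `marker=dict(color=colors, size=sizes)` when constructing `node_trace`, along with x, y, and text lists built the same way.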

Finally, we can combine the edges and the nodes into a single figure and save it. I'm saving to an HTML file because plotly has some built-in HTML functionality that allows for zoom.

With 10,000+ nodes, we'll need zoom ;)

In [None]:
layout = go.Layout(
    paper_bgcolor='rgba(73,73,73,1)',
    plot_bgcolor='rgba(0,100,0,0)',
    height=1080,
    width=1080
)
fig = go.Figure(layout = layout)

for trace in edge_trace:
    fig.add_trace(trace)

fig.add_trace(node_trace)

fig.update_layout(showlegend = False)
fig.update_xaxes(showticklabels = False, showgrid=False, zeroline=False)
fig.update_yaxes(showticklabels = False, showgrid=False, zeroline=False)

#fig.show()
py.plot(fig, filename='lana_test.html')