# Getting data from the web: API

## Spotify and `Spotipy`

Spotify allows developers to programmatically obtain data through its API.  

We are going to use the credentials you've already created to explore this API.  The [developer dashboard](https://developer.spotify.com/dashboard/login) allows you to access those credentials.

We'll also be using [spotipy](https://spotipy.readthedocs.io/), a Python library that is designed to interact with this API.

In [None]:
import spotipy

In order to use the API, we need to authenticate with Spotify.  We are going to provide credentials to Spotify so that it knows we are a legitimate registered user.

In [None]:
from spotipy.oauth2 import SpotifyClientCredentials

As it's written, the code below is not quite right:
1. It needs to have legitimate values for the CLIENT_ID and CLIENT_SECRET
2. Ideally one NEVER wants to store credentials in a workflow like this
  * if you share the notebook with other people, or store it on GitHub, or otherwise put it in a publicly accessible place, then your secret credentials will be exposed

In [None]:
# Storing our credentials
CLIENT_ID='2a3'
CLIENT_SECRET='3b4'

# Authenticating with spotify
sp = spotipy.Spotify(auth_manager = SpotifyClientCredentials(client_id=CLIENT_ID,
                                                             client_secret=CLIENT_SECRET))

# Carrying out a basic query
sp.search(q='Neil Young', limit=20)

We can fix the code in two ways:
1. Supply the proper credentials
2. Put the credentials into a file that's kept more private

In [None]:
# If we put our module in an unconventional location, then 
# we need to tell Python where to look for module files on this system
import sys
sys.path.append('/home/jovyan')

# Import the module file with our credential info
import spotify_key

Now we can access the credentials through variable names, rather than by explicitly writing out any private information.

In [None]:
# Example:
print(spotify_key.CLIENT_ID)
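For reference, a credentials module like the one imported above is just a Python file that assigns two variables. The sketch below is a minimal illustration, not the actual file: the module name `spotify_key_demo`, the placeholder values, and the temporary directory are all assumptions made so that the example can run anywhere.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module name and placeholder values, for illustration only.
MODULE_NAME = "spotify_key_demo"
module_source = (
    "CLIENT_ID = 'your-client-id-here'\n"
    "CLIENT_SECRET = 'your-client-secret-here'\n"
)

# Write the module into a private directory (a temporary one in this sketch).
private_dir = tempfile.mkdtemp()
Path(private_dir, MODULE_NAME + ".py").write_text(module_source)

# Tell Python where to look for the file, then import it by name.
sys.path.append(private_dir)
importlib.invalidate_caches()
spotify_key_demo = importlib.import_module(MODULE_NAME)

# The values are now available as attributes of the module.
print(spotify_key_demo.CLIENT_ID)
```

In practice the file would live somewhere that is not shared or committed to version control, and the notebook would only ever refer to the variable names.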

Here is the more correct code.

In [None]:
# WE NO LONGER NEED THIS
# Storing our credentials
# CLIENT_ID='2a3'
# CLIENT_SECRET='3b4'

# Authenticating with spotify
sp = spotipy.Spotify(auth_manager = SpotifyClientCredentials(client_id=spotify_key.CLIENT_ID,
                                                             client_secret=spotify_key.CLIENT_SECRET))

# Carrying out a basic query
sp.search(q='Neil Young', limit=20)

Now we can get rolling with the API.

In [None]:
results = sp.search(q='Neil Young', limit=20)
print(results)

Sometimes it's like trying to drink water from a firehose.

### How would I know how to write the code above in the first place?

It can be VERY important and helpful to consult the documentation.

Check out: https://developer.spotify.com/documentation/web-api/
* The examples are mainly in JavaScript, which we're not using in this course.

Fortunately, Python libraries such as spotipy that interact with the Spotify API have their own documentation:
* https://spotipy.readthedocs.io/
  * For installation, examples, reference, link to source code
* GitHub repos can themselves contain useful documentation on the main README
  * ... not that we want to overburden ourselves with Python, but note that you can even peruse source code if you want
    * https://github.com/spotipy-dev/spotipy
    * note the use of `requests` and `json` in, for example, `spotipy/client.py` (and of `urllib3`: `requests` is built on `urllib3` and is intended to make HTTP requests more painless)

In [None]:
# Let's put our Python dictionary skills to good use

type(results)

In [None]:
results.keys()

In [None]:
results['tracks'].keys()

In [None]:
# Note: this line raises an AttributeError, because 'items' holds a list, not a dictionary
results['tracks']['items'].keys()

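So... we have to navigate our way through nested data structures, not all of which are dictionaries. Here, `results['tracks']['items']` is a list, so it has no `.keys()` method; we have to index into it first.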

In [None]:
results['tracks']['items'][0].keys()
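When exploring an unfamiliar response, it can help to check at each level whether you are holding a dictionary or a list. The sketch below uses a trimmed, made-up stand-in for a search response (the track and artist names are invented, and the real response has many more keys), together with a hypothetical helper called `describe`.

```python
# A trimmed, made-up stand-in for a search response (not real API output).
mock_results = {
    "tracks": {
        "items": [
            {"name": "Track A", "popularity": 70,
             "album": {"name": "Album A"},
             "artists": [{"name": "Artist A"}]},
            {"name": "Track B", "popularity": 55,
             "album": {"name": "Album B"},
             "artists": [{"name": "Artist B"}, {"name": "Artist C"}]},
        ]
    }
}

def describe(obj):
    """Summarize one level of a nested structure (hypothetical helper)."""
    if isinstance(obj, dict):
        return "dict with keys " + str(sorted(obj.keys()))
    if isinstance(obj, list):
        return "list of length " + str(len(obj))
    return type(obj).__name__

# Walk down one level at a time: dict -> dict -> list -> dict
print(describe(mock_results))
print(describe(mock_results["tracks"]))
print(describe(mock_results["tracks"]["items"]))
print(describe(mock_results["tracks"]["items"][0]))
```

Dictionaries are indexed by key and lists by position, which is why the path to a single track mixes both kinds of index.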

I will let you explore the documentation links for more info, and we will just cut to using some interesting data like artist name, track name, etc.

In [None]:
print(results['tracks']['items'][0]['album']['name'])
print(results['tracks']['items'][0]['name'])
print(results['tracks']['items'][0]['artists'])
print(results['tracks']['items'][0]['popularity'])

In [None]:
results = sp.search(q='Neil Young', limit=20)

# using 'enumerate' for a collection of items allows you to loop over the items, as well
# as to use a numerical count for indexing
for idx, track in enumerate(results['tracks']['items']):
    print(idx, track['name'])

In [None]:
results = sp.search(q='Neil Young', limit=20)

for idx, track in enumerate(results['tracks']['items']):
    print(idx, track['name'], ' : ', track['artists'][0]['name'])

As long as you can traverse this data structure, you can collect information from every returned item into something a little more manageable.
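As a rough sketch of that idea, the loop below flattens each item into a small dictionary with only the fields of interest. The items are made up to mimic the shape of the search results, and `collect_tracks` is a hypothetical helper name.

```python
# Made-up items shaped like the entries of results['tracks']['items'].
mock_items = [
    {"name": "Track A", "popularity": 70, "artists": [{"name": "Artist A"}]},
    {"name": "Track B", "popularity": 55,
     "artists": [{"name": "Artist B"}, {"name": "Artist C"}]},
]

def collect_tracks(items):
    """Flatten each nested item into one flat record (hypothetical helper)."""
    records = []
    for idx, track in enumerate(items):
        records.append({
            "position": idx,
            "track": track["name"],
            # Keep only the first listed artist, as in the loops above
            "artist": track["artists"][0]["name"],
            "popularity": track["popularity"],
        })
    return records

for record in collect_tracks(mock_items):
    print(record)
```

A list of flat records like this is easy to inspect, sort, or hand to a table-building tool later.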

We are going to collect information from a Spotify playlist, and put information about the tracks into a Pandas dataframe.

In [None]:
playlists = sp.user_playlists('spotify')

In [None]:
playlists['items'][0]

URI:  Uniform Resource Identifier
* These allow us to pass specific identifiers to Spotify so that we can tell it exactly what we're looking for.

In [None]:
pitems = sp.playlist_items('spotify:playlist:37i9dQZF1DXcBWIGoYBM5M')
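The URIs used in this notebook follow the pattern `spotify:<type>:<id>`. The sketch below splits a URI into those parts. The helper names are hypothetical, and the web-player URL pattern is an assumption based on the track URL used elsewhere in this notebook.

```python
def parse_spotify_uri(uri):
    """Split a URI of the form 'spotify:<type>:<id>' into (type, id)."""
    prefix, kind, identifier = uri.split(":")
    if prefix != "spotify":
        raise ValueError("not a Spotify URI: " + uri)
    return kind, identifier

def uri_to_url(uri):
    """Build a web-player URL from a URI (assumed URL pattern)."""
    kind, identifier = parse_spotify_uri(uri)
    return "https://open.spotify.com/" + kind + "/" + identifier

playlist_uri = 'spotify:playlist:37i9dQZF1DXcBWIGoYBM5M'
print(parse_spotify_uri(playlist_uri))
print(uri_to_url(playlist_uri))
```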

In [None]:
len(pitems['items'])

In [None]:
for i in pitems['items']:
    print(i['track']['artists'][0]['name'] + ' : ' + i['track']['name'])

In [None]:
print(pitems['items'][0]['track']['artists'][0]['name'])
print(pitems['items'][0]['track']['name'])
print(pitems['items'][0]['track']['uri'])

This track also has a URI, which we can use to get audio features about this track:

In [None]:
sp.audio_features('spotify:track:6dOtVTDdiauQNBQEDOtlAB')

In [None]:
sp.track('spotify:track:6dOtVTDdiauQNBQEDOtlAB')

If we have the track URL, can we somehow get the lyrics too?

In [None]:
# The requests library lets us fetch a web page directly
import requests

response = requests.get('https://open.spotify.com/track/6dOtVTDdiauQNBQEDOtlAB')

In [None]:
# Checking that we have a successful request
response

In [None]:
# Find the first occurrence of "lyrics"
response.text.find('lyrics')

In [None]:
# Is it an indicator of an html tag?
response.text[0:200]

In [None]:
# Find the next occurrence of "lyrics"
response.text.find('lyrics', 134)
# response.text[]
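Rather than searching for one occurrence at a time, we could collect every position in a single pass. The sketch below uses a short made-up string in place of `response.text`, and `find_all` is a hypothetical helper. Note that each new search has to start just past the previous match, otherwise `find` returns the same position again.

```python
def find_all(text, target):
    """Return the starting index of every occurrence of target in text."""
    positions = []
    start = 0
    while True:
        idx = text.find(target, start)
        if idx == -1:          # find() returns -1 when there are no more matches
            break
        positions.append(idx)
        start = idx + 1        # continue searching just past this match
    return positions

# A made-up string standing in for the page text
sample = "a lyrics b lyrics"
for pos in find_all(sample, "lyrics"):
    # Show a little context around each match
    print(pos, repr(sample[max(0, pos - 2):pos + 8]))
```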

... looks like it wants us to have Premium access.  In any case, retrieving the lyrics here is not straightforward, and there are likely copyright issues related to sharing lyrical content.

Back to some data analysis.

In [None]:
eilish_track = sp.track('spotify:track:6dOtVTDdiauQNBQEDOtlAB')

In [None]:
eilish_track['popularity']

We are now going to put some track info into a dataframe for easier analysis.



In [None]:
import pandas as pd

The audio features that we can grab are the keys of the dictionary that `audio_features` returns for a track.

In [None]:
audiofeatures = sp.audio_features('spotify:track:6dOtVTDdiauQNBQEDOtlAB')[0].keys()

In [None]:
audiofeatures

Now we do a little for loop to create a dictionary that has a list of 50 values for each audio feature.

In [None]:
# initialize the dictionary
# with empty lists for each audio feature
af = {}
for i in audiofeatures:
    af[i] = []

# iterate over every track, retrieve the audio_features, and append values to the dictionary's lists
for i in pitems['items']:
    f = sp.audio_features(i['track']['uri'])[0]
    for j in audiofeatures:
        af[j].append(f[j])
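The same pattern can be tried without calling the API. The sketch below uses made-up feature dictionaries in place of the `audio_features` results (the numbers are invented) and builds one list per feature, so that every list ends up with one value per track.

```python
# Made-up feature dictionaries, one per track (not real API output).
mock_features = [
    {"danceability": 0.70, "energy": 0.55, "loudness": -6.1},
    {"danceability": 0.45, "energy": 0.80, "loudness": -4.3},
    {"danceability": 0.62, "energy": 0.33, "loudness": -9.8},
]

# Use the first track's keys as the list of feature names
feature_names = list(mock_features[0].keys())

# Initialize one empty list per feature
columns = {name: [] for name in feature_names}

# Append each track's value to the matching list
for features in mock_features:
    for name in feature_names:
        columns[name].append(features[name])

print(columns)
```

A dictionary of equal-length lists is a shape that maps directly onto the columns of a table.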

In [None]:
af

We can quickly make this into a Pandas dataframe:

In [None]:
top50_df = pd.DataFrame(af)

In [None]:
top50_df

But, we'll also include the artist name, track name, top50 rank, and popularity score.

In [None]:
top50rank = []
artist = []
track = []
popularity = []

for ix,item in enumerate(pitems['items']):
    top50rank.append(ix+1)
    artist.append(item['track']['artists'][0]['name'])
    track.append(item['track']['name'])
    popularity.append(item['track']['popularity'])

top50_df['top50rank'] = top50rank
top50_df['artist'] = artist
top50_df['track'] = track
top50_df['popularity'] = popularity

In [None]:
top50_df

It may be more convenient when looking at snapshots of the data to have the columns in a different order.

To do that, I'll get a list of the column names in the order I want, and then reassign a view of the dataframe with that column name list back into the dataframe variable.

In [None]:
columnsinorder = list(top50_df.columns[-4:]) + list(top50_df.columns[:-4])

In [None]:
top50_df = top50_df.loc[:,columnsinorder]
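The reordering itself is just list slicing: take the last four names, then everything before them. Here is a minimal sketch with a plain list of made-up column names standing in for the dataframe's columns.

```python
# A shortened, made-up list of column names in their original order
columns = ["danceability", "energy", "loudness",
           "top50rank", "artist", "track", "popularity"]

n_last = 4  # number of trailing columns to move to the front

# The last n_last names first, followed by all the names before them
columns_in_order = columns[-n_last:] + columns[:-n_last]

print(columns_in_order)
```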

In [None]:
top50_df

Now we've made a more complete dataframe.

At this point we can have more fun.  Analyze, visualize, summarize, ...

In [None]:
top50_df.sort_values(by='popularity',ascending=False)

In [None]:
top50_df['energy'].plot(kind='hist')

In [None]:
top50_df.plot(x='energy',y='danceability',kind='scatter')

In [None]:
import matplotlib.pyplot as plt

In [None]:
top50_df.plot(x='energy',y='danceability',kind='scatter')
plt.xlim(0,1)
plt.ylim(0,1)

In [None]:
top50_df.describe()

In [None]:
top50_df.plot(x='instrumentalness',y='liveness',kind='scatter')

In [None]:
top50_df.loc[top50_df['instrumentalness'] > 0.05]

In [None]:
top50_df.plot(x='instrumentalness',y='loudness',kind='scatter')

In [None]:
top50_df.plot(y='loudness',kind='hist')

We could easily get carried away with analysis and visualization at this point.

We will soon look at song lyrics in tandem with this, as a branching off point for getting into natural language processing.

# Branch points

There are now several items of interest that we can pursue:
1. Looking at statistics and modeling
2. Analyzing the language of lyrical content
3. Networks of artists, users, playlists, etc

For the moment, let's look at some ways to make music recommendations.

# Popularity score

The first way to make recommendations is the easiest.  Just look at what songs are on the top50 list, or most popular, or highest on some other metric, and use that to recommend music.

In [None]:
top50_df

In [None]:
# Here are the top 10 recommendations based on the top50 list:
top10 = top50_df[:10]

for i,row in top10.iterrows():
    print(row['artist'] + ' : ' + row['track'])

In [None]:
# Here are the top 10 recommendations based on popularity score:
top10 = top50_df.sort_values(by='popularity',ascending=False)[:10]

for i,row in top10.iterrows():
    print(row['artist'] + ' : ' + row['track'])

In [None]:
# Here are the top 10 recommendations if I want a high danceability score:
top10 = top50_df.sort_values(by='danceability',ascending=False)[:10]

for i,row in top10.iterrows():
    print(row['artist'] + ' : ' + row['track'])
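All three cells above follow one pattern: sort by a metric, keep the first few rows, and print them. A rough sketch of that pattern as a reusable function, using made-up records and a hypothetical helper name `top_n`, might look like this:

```python
# Made-up track records (not real data)
mock_tracks = [
    {"artist": "Artist A", "track": "Track A", "popularity": 70, "danceability": 0.52},
    {"artist": "Artist B", "track": "Track B", "popularity": 91, "danceability": 0.48},
    {"artist": "Artist C", "track": "Track C", "popularity": 64, "danceability": 0.88},
]

def top_n(records, metric, n=10):
    """Return the n records with the highest value of the given metric."""
    return sorted(records, key=lambda record: record[metric], reverse=True)[:n]

for metric in ["popularity", "danceability"]:
    print("Top 2 by", metric)
    for record in top_n(mock_tracks, metric, n=2):
        print("  " + record["artist"] + " : " + record["track"])
```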

# Collaborative filtering

A second way to make recommendations is to find other users who like the same stuff you do, and then look at what other content they love.

Let's make some fictitious users and look at their (fictitious) similarities.

In [None]:
people = ['Alice','Ben','Charlie','Dan','Evelyn']

In [None]:
import random
random.seed(3)

In [None]:
random.randint(0,100)

In [None]:
top50_df['track'][:5]

In [None]:
t10 = list(top50_df['track'][:10])

In [None]:
t10

In [None]:
t = []
p = []
r = []
for i in t10[:5]:
    for j in people:
        t.append(i)
        p.append(j)
        r.append(random.randint(0,100))
for i in t10[5:10]:
    for j in people:
        if j != 'Ben':
            t.append(i)
            p.append(j)
            r.append(random.randint(0,100))            

raters = pd.DataFrame({'track': t, 'people':p, 'rating':r})

In [None]:
raters

In [None]:
raters[raters['people']=='Evelyn']

In [None]:
raters[raters['people']=='Dan']

In [None]:
import matplotlib.pyplot as plt

In [None]:
for i in ['Alice', 'Charlie', 'Dan', 'Evelyn']:
    plt.scatter(raters.loc[raters['people']=='Ben', 'rating'], 
                raters.loc[raters['people']==i, 'rating'][:5])
    plt.xlim([0,100])
    plt.xlabel("Ben's rating")
    plt.ylim([0,100])
    plt.ylabel(i+"'s rating")
    plt.show()

We could also look at the similarities in everyone's ratings of pairs of tracks, to see if there are similarities between tracks rather than between users.

For example, which tracks in the top 5 have ratings similar to those of the top track, in terms of how our users are rating them?

In [None]:
for i in range(0,5):
    rating1 = raters.loc[raters['track'] == t10[0], 'rating']
    rating2 = raters.loc[raters['track'] == t10[i], 'rating']

    plt.scatter(rating1, rating2)
    plt.xlim([0,100])
    plt.xlabel(t10[0]+" ratings")
    plt.ylim([0,100])
    plt.ylabel(t10[i]+" ratings")
    plt.show()

# User-based collaborative filtering

Here we make rating predictions on the basis of other users who are similar.

In [None]:
from scipy.spatial.distance import euclidean

In [None]:
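def euclidean_dist(person1, person2):
    # Despite the name, this returns a similarity score in (0, 1]:
    # identical ratings give 1, and larger distances give smaller scores.

    person1ratings = raters.loc[raters['people']==person1, 'rating']
    # Keep only person2's first 5 ratings so that they line up with Ben's 5 rated tracks
    # (this assumes person1 is Ben and that the rows are in the same track order)
    person2ratings = raters.loc[raters['people']==person2, 'rating'][:5]

    return 1 / (1 + euclidean(person1ratings, person2ratings))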

In [None]:
euclidean_dist('Ben','Evelyn')
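To see what the `1 / (1 + distance)` transformation does, here is a small standalone sketch using plain lists and the standard library's `math.dist` instead of scipy. The ratings are made up and the function name `euclidean_similarity` is hypothetical.

```python
import math

def euclidean_similarity(ratings1, ratings2):
    """Turn a Euclidean distance into a similarity score in (0, 1]."""
    distance = math.dist(ratings1, ratings2)
    return 1 / (1 + distance)

# Identical ratings: distance 0, so the similarity is exactly 1
print(euclidean_similarity([10, 20, 30], [10, 20, 30]))

# Ratings that differ by 3 and 4: distance 5, so the similarity is 1/6
print(euclidean_similarity([0, 0], [3, 4]))

# Very different ratings: large distance, similarity close to 0
print(euclidean_similarity([0, 0, 0], [100, 100, 100]))
```

The score shrinks toward 0 as the ratings drift apart, so a higher value always means "more similar".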

In [None]:
for i in people:
    if i != 'Ben':
        print(i,':',euclidean_dist('Ben',i))

In [None]:
def matches(person):
    best = {}
    for i in people:
        if i != person:
            best[i] = euclidean_dist(person,i)
    return dict(sorted(best.items(), key=lambda item: -item[1]))

In [None]:
matches('Ben')

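It's easy now to switch out the similarity metric.  Simply replace "euclidean" with something else.  (But keep track of whether a low or a high score means greater similarity: scipy's `euclidean` and `cosine` both return distances, where lower means more similar, and `1 / (1 + distance)` turns them into scores where higher means more similar.)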

In [None]:
from scipy.spatial.distance import cosine

def cosine_dist(person1, person2):    
    person1ratings = raters.loc[raters['people']==person1, 'rating']
    person2ratings = raters.loc[raters['people']==person2, 'rating'][:5]
    return 1 / (1 + cosine(person1ratings, person2ratings))

def matches(person):
    best = {}
    for i in people:
        if i != person:
            best[i] = cosine_dist(person,i)
    return dict(sorted(best.items(), key=lambda item: -item[1]))

matches('Ben')
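Cosine distance compares the direction of two rating vectors rather than their magnitude, so it can disagree with Euclidean distance. The sketch below is a plain-Python version of the usual definition, one minus the cosine of the angle between the vectors, written out only to make the behaviour visible. The ratings are made up.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

cautious = [10, 20, 30]   # made-up ratings from a user who rates low
generous = [20, 40, 60]   # same pattern, but every rating doubled

# Same direction, so the cosine distance is (numerically) zero ...
print(cosine_distance(cautious, generous))

# ... even though the Euclidean distance between them is large
print(math.dist(cautious, generous))
```

Which behaviour is preferable depends on whether two users who rank tracks the same way, but on different parts of the scale, should count as similar.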

To get rankings of my unrated tracks, I could just look at Evelyn's ratings, since she's closest to me.

In [None]:
evelyn_ratings = raters.loc[raters['people'] == 'Evelyn', ['track','rating']][5:]
evelyn_ratings.sort_values(by='rating',ascending=False)

It's more comprehensive to look at a weighted average over everyone.  We weight every rating by the similarity score between myself and that person, and then we divide the total by the sum of all the similarity scores.

In [None]:
weights = matches('Ben')

# Initialize a dictionary to hold our predicted ratings
track_predictions = {}

# Get my 5 unrated tracks and assign an initial score of 0
for i in t10[5:10]:

    weighted_rating = 0

    # Calculate the weighted score based on my similarity with others
    total_weight = 0
    for person in people:
        if person != 'Ben':
            weight = weights[person]
            weighted_rating += weight * raters.loc[(raters['people'] == person) & 
                                                   (raters['track'] == i), 'rating'].iloc[0]
            total_weight += weight
    track_predictions[i] = weighted_rating / total_weight
    
preds_sorted = dict(sorted(track_predictions.items(), key=lambda item: -item[1]))
for i in preds_sorted.keys():
    print(i, ":", round(preds_sorted[i],1))
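The arithmetic behind each prediction is a weighted average: multiply every rating by that person's similarity score, add the products up, and divide by the sum of the scores. A small worked sketch with made-up numbers and a hypothetical helper name `predict_rating`:

```python
def predict_rating(similarities, ratings):
    """Similarity-weighted average of other people's ratings for one track."""
    total_weight = sum(similarities.values())
    weighted_sum = sum(similarities[person] * ratings[person]
                       for person in similarities)
    return weighted_sum / total_weight

# Made-up similarity scores between me and two other people
similarities = {"Alice": 0.5, "Charlie": 0.25}

# Their made-up ratings of a track I have not rated
ratings = {"Alice": 80, "Charlie": 20}

# (0.5 * 80 + 0.25 * 20) / (0.5 + 0.25) = 45 / 0.75 = 60
print(predict_rating(similarities, ratings))
```

Because Alice's score is twice Charlie's, the prediction lands closer to her rating than a plain average of 50 would.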

# Item-based filtering

What if we look not at the similarity between people, but the similarity between tracks?

Conceptually, this just swaps the roles of people and tracks.

In [None]:
def euclidean_dist(track1, track2):    
    
    # one difference here is that we do not use Ben's ratings, because he has not rated all pairs
    # for a larger dataset, we would only want to include ratings here from people who have rated both tracks
    track1ratings = raters.loc[(raters['track']==track1) & (raters['people']!='Ben'), 'rating']
    track2ratings = raters.loc[(raters['track']==track2) & (raters['people']!='Ben'), 'rating']
    return 1 / (1 + euclidean(track1ratings, track2ratings))

def matches(track1):
    best = {}
    for i in t10:
        if i != track1:
            best[i] = euclidean_dist(track1, i)
    return dict(sorted(best.items(), key=lambda item: -item[1]))

In [None]:
print('Matching with', t10[0])
matches(t10[0])
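For a larger dataset, where different people have rated different tracks, the comparison would need to be restricted to people who rated both tracks. One possible sketch of that restriction is below. The ratings, the nested-dictionary layout, and the choice to return 0 when nobody rated both tracks are all assumptions made for illustration.

```python
import math

# Made-up ratings, stored as {track: {person: rating}}
mock_ratings = {
    "Track A": {"Alice": 80, "Ben": 60, "Charlie": 30},
    "Track B": {"Alice": 77, "Charlie": 34},
    "Track C": {"Ben": 10},
}

def item_similarity(track1, track2, ratings):
    """Similarity between two tracks, using only people who rated both."""
    shared = sorted(set(ratings[track1]) & set(ratings[track2]))
    if not shared:
        return 0.0  # assumption: no overlap is treated as no similarity
    v1 = [ratings[track1][person] for person in shared]
    v2 = [ratings[track2][person] for person in shared]
    return 1 / (1 + math.dist(v1, v2))

# Alice and Charlie rated both A and B; Ben's rating of A is ignored
print(item_similarity("Track A", "Track B", mock_ratings))

# Nobody rated both B and C
print(item_similarity("Track B", "Track C", mock_ratings))
```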

In [None]:
# Initialize a dictionary to hold our predicted ratings
track_predictions = {}

# Get my 5 unrated tracks and assign an initial score of 0
for i in t10[5:10]:

    weights = matches(i)
    
    weighted_rating = 0

    # Calculate the weighted score based on my current track ratings
    total_weight = 0
    for t in t10[:5]:
        weight = weights[t]
        weighted_rating += weight * raters.loc[(raters['people'] == 'Ben') & 
                                               (raters['track'] == t), 'rating'].iloc[0]
        total_weight += weight
    track_predictions[i] = weighted_rating / total_weight
    
preds_sorted = dict(sorted(track_predictions.items(), key=lambda item: -item[1]))
for i in preds_sorted.keys():
    print(i, ":", round(preds_sorted[i],1))

# Content-based filtering

A different idea:  Use content to make ratings.

That content could be audio feature scores, sentiment in the lyrics, similar lyrical topics, genre, etc.
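As a first taste of the idea, and only a sketch under stated assumptions, the code below ranks tracks by how close their audio features are to those of a track we already like. The feature values are made up, `most_similar` is a hypothetical helper, and the similarity score reuses the `1 / (1 + distance)` form from earlier.

```python
import math

# Made-up audio feature scores, all on a 0-to-1 scale
mock_tracks = {
    "Track A": {"energy": 0.80, "danceability": 0.70},
    "Track B": {"energy": 0.78, "danceability": 0.72},
    "Track C": {"energy": 0.20, "danceability": 0.30},
}

def most_similar(target, tracks, features):
    """Rank the other tracks by feature similarity to the target track."""
    target_vec = [tracks[target][f] for f in features]
    scored = []
    for name, values in tracks.items():
        if name == target:
            continue
        vec = [values[f] for f in features]
        scored.append((name, 1 / (1 + math.dist(target_vec, vec))))
    # Highest similarity first
    return sorted(scored, key=lambda pair: -pair[1])

for name, score in most_similar("Track A", mock_tracks, ["energy", "danceability"]):
    print(name, ":", round(score, 3))
```

No user ratings are involved here at all. Features measured on very different scales would need rescaling before being compared this way.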

### To be continued ###