## Getting data from the Spotify API to build features for prediction
1. Started with the songs that I scraped from Billboard and stored in `billboards.csv`.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import spotipy
import spotipy.util as util
import json

import pickle
from datetime import datetime
from collections import Counter
from fuzzywuzzy import fuzz

# custom functions used to clean the data queried
import cleaning_functions

I use the `spotipy` package to create a client object, which makes the API easier to navigate than making raw calls with `requests`.

NTS: I should try at least one call with `requests` directly, just so I have some idea of how to do it.
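
For reference, a minimal sketch of what the raw `requests` route might look like for the client-credentials flow that `spotipy` wraps. The helper names `build_token_request` and `fetch_access_token` are hypothetical and not part of this project; the sketch assumes the credentials have the same `client_id` and `client_secret` fields used elsewhere in this notebook.

```python
import base64

TOKEN_URL = "https://accounts.spotify.com/api/token"


def build_token_request(client_id, client_secret):
    """Return the headers and form body for a client-credentials token request."""
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    encoded = base64.b64encode(raw).decode("ascii")
    headers = {"Authorization": f"Basic {encoded}"}
    body = {"grant_type": "client_credentials"}
    return headers, body


def fetch_access_token(client_id, client_secret):
    """POST to the token endpoint and return the access token string."""
    import requests  # imported here so the helper above works without it installed

    headers, body = build_token_request(client_id, client_secret)
    response = requests.post(TOKEN_URL, headers=headers, data=body, timeout=10)
    response.raise_for_status()
    return response.json()["access_token"]
```

Splitting the request construction from the network call keeps the first part easy to check by hand without touching the API.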

In [2]:
# reading in credentials necessary to use the API
# remember to save credentials in double quotes or else it gets mad at you
with open('../credentials.json') as file:
    credentials = json.load(file)  # the with block closes the file for us

token = util.oauth2.SpotifyClientCredentials(client_id=credentials['client_id'],
                                             client_secret=credentials['client_secret'])
# creates an access token for you to do what you need to do
cache_token = token.get_access_token()
sp = spotipy.Spotify(cache_token)

### Loading in data and search queries to use the API

In [4]:
data = pd.read_csv('../data/billboards.csv')
data.rename(columns={
    'Artist': 'artist',
    'Song':'song'
}, inplace=True)

data['search_queries'] = (data.artist +' '+ data.song).str.lower()
data.head()

```python
# run the queries first so a failure doesn't leave behind an empty file
results_list = list(get_features(data)) # testing on the head only

with open('../data/spotify_api_search_results.pkl', 'wb') as file:
    pickle.dump(results_list[0], file)
```

```python
with open('../data/spotify_api_search_results.pkl', 'wb') as file:
    pickle.dump(test_df, file)
```
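
The `get_features` helper called above is not shown in this notebook. As a rough illustration only, here is a sketch of how one search response could be flattened into a row with the columns that appear in the results table. The function name `parse_first_track` is hypothetical, and the response is a hand-written stand-in shaped like what `sp.search(q=..., type='track', limit=1)` returns, so nothing here hits the API.

```python
def parse_first_track(response, search_query):
    """Flatten the first track of a search response into one row (dict)."""
    items = response.get("tracks", {}).get("items", [])
    if not items:
        # nothing found: keep the query so the miss can be retried by hand
        return {"search_query": search_query}
    track = items[0]
    return {
        "duration_ms": track["duration_ms"],
        "explicit": track["explicit"],
        "artists": ", ".join(artist["name"] for artist in track["artists"]),
        "name": track["name"],
        "release_date": track["album"]["release_date"],
        "search_query": search_query,
        "uri": track["uri"],
    }


# hand-written stand-in for a real search response
fake_response = {
    "tracks": {"items": [{
        "duration_ms": 220306,
        "explicit": True,
        "artists": [{"name": "21 Savage"}],
        "name": "Bank Account",
        "album": {"release_date": "2017-09-23"},
        "uri": "spotify:track:2fQrGHiQOvpL9UgPvtYy6G",
    }]}
}
row = parse_first_track(fake_response, "21 savage bank account")
```

Returning a row with only `search_query` for a miss means a later `dropna()` or `isna()` check can separate found songs from missing ones.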

2. Ran a search for each query, built by concatenating the artist and song name. The results are stored in `spotify_api_search_results.pkl`.

In [13]:
output_data = pickle.load(open('../data/spotify_api_search_results.pkl', 'rb')).dropna()
output_data.rename(columns={'key':'artists'}, inplace=True, errors='ignore') # this is to fix a mistake that i made in ETL
output_data.head()

Unnamed: 0,duration_ms,explicit,artists,name,release_date,search_query,uri
0,255560.0,True,"2 Chainz, Travis Scott",4 AM,2017-06-16,2 chainz 4 am,spotify:track:1nX9KhK3Fff27SnrIor2Yb
1,210200.0,True,"2 Chainz, Ty Dolla $ign, Trey Songz, Jhene Aiko",It's A Vibe,2017-06-16,2 chainz it's a vibe,spotify:track:6H0AwSQ20mo62jGlPGB8S6
2,234666.0,True,"2 Chainz, YG, Offset",PROUD,2018-02-08,2 chainz proud,spotify:track:365wwIjijQdlRJEjUWTidq
4,225893.0,True,"2 Chainz, Drake, Quavo",Bigger Than You (feat. Drake & Quavo),2018-06-15,"2 chainz, drake bigger > you",spotify:track:5S1IUPueD0xE0vj4zU3nSf
5,220306.0,True,21 Savage,Bank Account,2017-09-23,21 savage bank account,spotify:track:2fQrGHiQOvpL9UgPvtYy6G


In [15]:
# keep only the track ID, i.e. the part of the URI after 'track:'
output_data['id'] = output_data.uri.str.split('track:').str[1]
output_data.drop(['uri'], axis=1, inplace=True)
output_data.head()

Unnamed: 0,duration_ms,explicit,artists,name,release_date,search_query,id
0,255560.0,True,"2 Chainz, Travis Scott",4 AM,2017-06-16,2 chainz 4 am,1nX9KhK3Fff27SnrIor2Yb
1,210200.0,True,"2 Chainz, Ty Dolla $ign, Trey Songz, Jhene Aiko",It's A Vibe,2017-06-16,2 chainz it's a vibe,6H0AwSQ20mo62jGlPGB8S6
2,234666.0,True,"2 Chainz, YG, Offset",PROUD,2018-02-08,2 chainz proud,365wwIjijQdlRJEjUWTidq
4,225893.0,True,"2 Chainz, Drake, Quavo",Bigger Than You (feat. Drake & Quavo),2018-06-15,"2 chainz, drake bigger > you",5S1IUPueD0xE0vj4zU3nSf
5,220306.0,True,21 Savage,Bank Account,2017-09-23,21 savage bank account,2fQrGHiQOvpL9UgPvtYy6G


### Combining the songs that I found manually

Some songs weren't found with the programmatic search query, so I went back and wrote a modified query for each one, hoping that it would return a song. These songs and their results are stored in `manually_found_songs.pkl`.

```python
file_path = '../data/spotify_api_search_results.pkl'

with open(file_path, 'rb') as file:
    output_data = pickle.load(file)

# rows where the first-pass search returned nothing
missing_data = output_data[output_data['name'].isna()]

missing_modified = pd.read_csv('../data/missing_modified_results.csv')

# both frames should have the same length, so after resetting the index
# they can be concatenated side by side
missing_data = pd.concat([missing_data.search_query.reset_index(),
                          missing_modified.reset_index()], axis=1).drop('index', axis=1).dropna()

with open('../data/manually_found_songs.pkl', 'wb') as file:
    pickle.dump(missing_data, file)
```
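
The modified queries here were written by hand. Purely as an illustration of how part of that step might be automated, the sketch below strips the collaboration markers (`x`, `+`) that show up in the failed queries. The function `simplify_query` is hypothetical, and it would also alter a song title that contains a standalone "x", so its output would still need a manual check.

```python
import re

# collaboration markers seen in the failed queries, e.g. "a x b" or "a + b"
SEPARATORS = re.compile(r"\s+(?:x|\+)\s+")


def simplify_query(query):
    """Replace collaboration separators with a single space."""
    return SEPARATORS.sub(" ", query).strip()


simplified = simplify_query("jacquees x dej loaf at the club")
# 'jacquees dej loaf at the club'
```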

In [16]:
missing_data = pickle.load(open('../data/manually_found_songs.pkl', 'rb'))
missing_data.head()

Unnamed: 0,search_query,artists,duration_ms,explicit,id,modified_query,name,popularity
0,2 chainz x gucci mane x quavo good drank,"2 Chainz, Gucci Mane, Quavo",222706.0,True,39pS70eeDvyCAF3t8NAlVV,quavo good drank,Good Drank,65.0
1,21 savage my choppa hate n****s,"21 Savage, Metro Boomin",148640.0,True,2D2w9943rsnJOGCrI4aMQp,21 savage my choppa,My Choppa Hate Niggas,67.0
2,becky g + natti natasha sin pijama,Marillo,191555.0,False,6ItVmJiq8rDwBKq026zSa3,becky g pijama,Como Becky G Sin Pijama,0.0
9,jacquees x dej loaf at the club,"Jacquees, DeJ Loaf",173053.0,False,0NqZ65jPelNB13gzsvH2Ma,dej loaf at the club,At The Club,67.0
10,kygo x selena gomez it ain't me,"Kygo, Selena Gomez",216586.0,False,677RjvAT2lpYjo1Whczjzx,kygo it aint me,It Ain't Me,50.0


In [17]:
output_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 809 entries, 0 to 827
Data columns (total 7 columns):
duration_ms     809 non-null float64
explicit        809 non-null object
artists         809 non-null object
name            809 non-null object
release_date    809 non-null object
search_query    809 non-null object
id              809 non-null object
dtypes: float64(1), object(6)
memory usage: 50.6+ KB


In [18]:
missing_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13 entries, 0 to 18
Data columns (total 8 columns):
search_query      13 non-null object
artists           13 non-null object
duration_ms       13 non-null float64
explicit          13 non-null object
id                13 non-null object
modified_query    13 non-null object
name              13 non-null object
popularity        13 non-null float64
dtypes: float64(2), object(6)
memory usage: 936.0+ bytes


3. Need to drop the songs that were missing from the first pass of the API, and concat the manually found songs to that dataframe. Note that JAY-Z's songs were not on Spotify when I ran this, which doesn't surprise me as he has his own competing music streaming service, TIDAL.

In [20]:
all_songs = pd.concat([output_data.drop('uri', axis=1, errors='ignore'),
           missing_data.drop(['modified_query','popularity'], axis=1)], axis=0)

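FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.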

In [21]:
all_songs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 822 entries, 0 to 18
Data columns (total 7 columns):
artists         822 non-null object
duration_ms     822 non-null float64
explicit        822 non-null object
id              822 non-null object
name            822 non-null object
release_date    809 non-null object
search_query    822 non-null object
dtypes: float64(1), object(6)
memory usage: 51.4+ KB
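
The 13 rows without a `release_date` in the summary above are the manually found songs, whose frame has no such column. A toy sketch of that behaviour, using made-up two-column frames rather than the real data, and passing `sort=False` explicitly so the column order of the first frame is kept:

```python
import pandas as pd

# made-up stand-ins for the first-pass results and the manually found songs
first_pass = pd.DataFrame({"name": ["song a", "song b"],
                           "release_date": ["2017-06-16", "2018-02-08"]})
manual = pd.DataFrame({"name": ["song c"], "popularity": [65.0]})

# drop the column only the manual frame has, then stack the rows;
# columns missing from one frame are filled with NaN
combined = pd.concat([first_pass, manual.drop("popularity", axis=1)],
                     axis=0, sort=False, ignore_index=True)
```

`ignore_index=True` is an extra choice in this sketch: it gives the stacked frame a fresh 0..n-1 index instead of the repeated labels visible in the `info()` output (`0 to 18`).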


In [22]:
all_songs.to_csv('../data/all_data_0504.csv', index=False)

4. Now that I have each song's track ID (pulled from the URI in the search results), I can use the Spotify API endpoint that lets you get the audio features. The results of this process are stored in `audio_features.pkl`.

In [None]:
audio_features = all_songs['id'].apply(get_audio_features)

# flatten the Series of per-track results into a plain list
audio_list = list(audio_features.values)
audio_features = pd.DataFrame(audio_list)

with open('../data/audio_features.pkl', 'wb') as file:
    pickle.dump(audio_features, file)
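
`get_audio_features` is not defined in this notebook, and applying it row by row means one request per song. The audio-features endpoint accepts a batch of IDs (up to 100 per request, as far as I know), so a batched version might look like the sketch below. The names `chunked` and `fetch_audio_features` are hypothetical; `client` is assumed to be something like the `sp` object, exposing an `audio_features` method that takes a list of IDs.

```python
def chunked(ids, size=100):
    """Yield successive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_audio_features(client, track_ids):
    """Collect audio-feature dicts for all IDs, one request per batch."""
    rows = []
    for batch in chunked(list(track_ids)):
        # the client is expected to return one entry per ID,
        # with None for IDs it has no features for
        rows.extend(item for item in client.audio_features(batch) if item)
    return rows
```

With roughly 800 songs this would take 9 requests instead of one per track, and the returned list of dicts can go straight into `pd.DataFrame`.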

NTS: it seems like the track method returns results that I already have stored, so I'm not going to bother with that right now. Might be something to look into later when I come back to the project?

5. There is also an `audio_analysis` endpoint on the Spotify API. The result of this query is stored in a dated file, `analysis_query_{query_date}.pkl`.

In [None]:
analysis_df = pd.DataFrame.from_dict(output_data['id'].apply(get_audio_analysis))

# each row holds a single result, so pull it out to flatten
# (list.extend returns None, so build the list directly)
flatten_list = [values[0] for values in analysis_df.values]
analysis_df = pd.DataFrame(flatten_list)
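
An audio-analysis response is deeply nested (a `track` summary plus lists such as `sections` and `beats`), so a DataFrame built from it directly ends up with list-valued columns. One possible way to get a flat row per track is sketched below. `summarize_analysis` is a hypothetical helper, the four fields it keeps are an arbitrary choice, and the input is a hand-written stand-in far smaller than a real response.

```python
def summarize_analysis(analysis):
    """Reduce one audio-analysis response to a flat dict of simple values."""
    track = analysis.get("track", {})
    return {
        "tempo": track.get("tempo"),
        "duration": track.get("duration"),
        "n_sections": len(analysis.get("sections", [])),
        "n_beats": len(analysis.get("beats", [])),
    }


# hand-written stand-in for a real response
fake_analysis = {
    "track": {"tempo": 120.0, "duration": 180.5},
    "sections": [{"start": 0.0}, {"start": 60.0}],
    "beats": [{"start": 0.5}, {"start": 1.0}, {"start": 1.5}],
}
summary = summarize_analysis(fake_analysis)
```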

In [None]:
# zero-padded month and day of the query, in MM_DD form
query_date = datetime.today().strftime('%m_%d')
print(query_date)
with open(f'../data/analysis_query_{query_date}.pkl', 'wb') as file:
    pickle.dump(analysis_df, file)