# Spotify Top 100 P1 - Scraping Using Spotify API

Question: Can a top-100 song be predicted? Which features help a song reach the top 100, or even the top 10, most-streamed tracks on Spotify?

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import SpotifyUtilities.SpotifyUtilities as spotify
import numpy as np
import pandas as pd
import pickle

## Part 1 - Scraping Track and Artist Names from the Charts and a Playlist

- get_top200() returns the track and artist names from Spotify's top 200 chart, which Spotify updates weekly.
- Pass the parameter "daily=True" to get the top 200 **daily** tracks instead.

In [3]:
top200 = spotify.get_top200()

In [4]:
top200[:10]

[('ROCKSTAR ', 'DaBaby'),
 ('Savage Love ', 'Jawsh 685'),
 ('Blinding Lights', 'The Weeknd'),
 ('Watermelon Sugar', 'Harry Styles'),
 ('Roses - Imanbek Remix', 'SAINt JHN'),
 ('Come & Go ', 'Juice WRLD'),
 ('POPSTAR ', 'DJ Khaled'),
 ('Wishing Well', 'Juice WRLD'),
 ('Breaking Me', 'Topic'),
 ('death bed ', 'Powfu')]

- get_songs() returns the track and artist names from **any** playlist.
- The default is [this playlist](https://open.spotify.com/playlist/5tIkO3qnEYSRYnEs1jgP8x). See the documentation if you'd like to use another playlist.

In [5]:
decade_hits = spotify.get_songs()

1000 track and artist names uploaded


In [6]:
decade_hits[990:]

[('Let Me Hold You ', 'Cheat Codes'),
 ('My Way', 'Calvin Harris'),
 ('Cheers ', 'Rihanna'),
 ("Hold On, We're Going Home", 'Drake'),
 ("Can't Stop Dancin'", 'Becky G'),
 ('SOS', 'Rihanna'),
 ('Welcome To St Tropez ', 'DJ Antoine'),
 ('Stereo Love ', 'Edward Maya'),
 ('Me & My Girls', 'Selena Gomez'),
 ('Years ', 'Alesso')]

## Part 2 - Using the Spotify API to Load Track Features into a DataFrame

get_features_df() is the heart of this notebook. Given a list of track and artist names, it retrieves each track's audio features 100 tracks at a time (the limit imposed by Spotify). If the resulting DataFrame contains any null values, it then searches for those tracks again and replaces the nulls with the correct values.
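
The batching and retry behaviour described above can be sketched roughly as follows. This is not the SpotifyUtilities implementation: `chunked`, `collect_features`, and the `fetch_batch` callable are hypothetical names used only for illustration, and the retry step assumes a missing result is represented as `None`.

```python
from typing import Callable, Iterator, List, Optional, Sequence

BATCH_SIZE = 100  # batch size mentioned in the text above


def chunked(items: Sequence, size: int = BATCH_SIZE) -> Iterator[Sequence]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_features(
    track_ids: Sequence[str],
    fetch_batch: Callable[[Sequence[str]], List[Optional[dict]]],
) -> List[Optional[dict]]:
    """Fetch features batch by batch, then retry any missing entries once.

    `fetch_batch` is a hypothetical stand-in for whatever performs the real
    API request; it must return one entry (a dict or None) per id.
    """
    rows: List[Optional[dict]] = []
    for batch in chunked(track_ids):
        rows.extend(fetch_batch(batch))

    # Second pass: look up each missing entry individually.
    for i, row in enumerate(rows):
        if row is None:
            rows[i] = fetch_batch([track_ids[i]])[0]
    return rows
```

With 250 ids this makes three batch requests (100, 100, and 50 ids), followed by one single-id request per missing entry.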

In [7]:
top200_df = spotify.get_features_df(top200)

Having trouble finding 'Don't Start Now' by 'Dua Lipa'
1 to 100 complete!
Having trouble finding 'Can't Die' by 'Juice WRLD'
101 to 200 complete!
2 null values found. Now adjusting...
Null values recovered
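
Both tracks flagged above contain an apostrophe, and several scraped titles (for example 'ROCKSTAR ') carry trailing whitespace. Whether either detail is what trips up the lookup is an assumption, since the SpotifyUtilities source is not shown here. If it is, a small hypothetical helper such as `clean_name` (not part of SpotifyUtilities) could normalize names before searching:

```python
def clean_name(name: str) -> str:
    """Trim whitespace and drop apostrophes from a track or artist name."""
    # Treat the typographic apostrophe the same as the ASCII one.
    without_apostrophes = name.replace("\u2019", "").replace("'", "")
    # Collapse repeated spaces and trim both ends.
    return " ".join(without_apostrophes.split())


pairs = [("Don't Start Now", "Dua Lipa"), ("ROCKSTAR ", "DaBaby")]
cleaned = [(clean_name(track), clean_name(artist)) for track, artist in pairs]
print(cleaned)
```

Dropping apostrophes is only one possible normalization; it would need to be checked against real search results before being relied on.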


In [8]:
del top200
top200_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track             200 non-null    object 
 1   artist            200 non-null    object 
 2   danceability      200 non-null    float64
 3   energy            200 non-null    float64
 4   key               200 non-null    float64
 5   loudness          200 non-null    float64
 6   mode              200 non-null    float64
 7   speechiness       200 non-null    float64
 8   acousticness      200 non-null    float64
 9   instrumentalness  200 non-null    float64
 10  liveness          200 non-null    float64
 11  valence           200 non-null    float64
 12  tempo             200 non-null    float64
 13  duration_ms       200 non-null    float64
 14  time_signature    200 non-null    float64
 15  top100            200 non-null    int64  
 16  top10             200 non-null    int64  
dt

In [9]:
decade_hits_df = spotify.get_features_df(decade_hits)

Having trouble finding 'Don't Tell 'Em' by 'Jeremih'
Having trouble finding 'Don't You Worry Child ' by 'Swedish House Mafia'
Having trouble finding 'Don't Wanna Go Home' by 'Jason Derulo'
1 to 100 complete!
Having trouble finding 'Don't' by 'Ed Sheeran'
Having trouble finding 'Don't Wake Me Up' by 'Chris Brown'
Having trouble finding 'Ain't It Fun' by 'Paramore'
Having trouble finding 'You're The One That I Want' by 'Lo-Fang'
Having trouble finding 'I'm an Albatraoz' by 'AronChupa'
Having trouble finding 'Don't Look Down ' by 'Martin Garrix'
101 to 200 complete!
Having trouble finding 'Can't Feel My Face' by 'The Weeknd'
Having trouble finding 'Don't Stop the Party ' by 'Pitbull'
Having trouble finding 'Don't Wanna Go Home' by 'Jason Derulo'
Having trouble finding 'Danza Kuduro ' by 'Lucenzo & Qwote'
201 to 300 complete!
Having trouble finding 'Ain't Nobody ' by 'Felix Jaehn'
Having trouble finding 'Perfect Timing' by 'Jason Derulo'
Having trouble finding 'I Cant Fight This Feeling ' 

In [10]:
del decade_hits
decade_hits_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track             1000 non-null   object 
 1   artist            1000 non-null   object 
 2   danceability      1000 non-null   float64
 3   energy            1000 non-null   float64
 4   key               1000 non-null   float64
 5   loudness          1000 non-null   float64
 6   mode              1000 non-null   float64
 7   speechiness       1000 non-null   float64
 8   acousticness      1000 non-null   float64
 9   instrumentalness  1000 non-null   float64
 10  liveness          1000 non-null   float64
 11  valence           1000 non-null   float64
 12  tempo             1000 non-null   float64
 13  duration_ms       1000 non-null   float64
 14  time_signature    1000 non-null   float64
 15  top100            1000 non-null   int64  
 16  top10             1000 non-null   int64  
d

## Part 3 - Cleaning the Data Part 1

In [11]:
# Include the ranking for top200
top200_df.reset_index(inplace=True)
top200_df['index'] = top200_df['index'] + 1
top200_df.rename({'index': 'rank'}, axis=1, inplace=True)

top200_df.head()

Unnamed: 0,rank,track,artist,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,top100,top10
0,1,ROCKSTAR,DaBaby,0.746,0.69,11.0,-7.956,1.0,0.164,0.247,0.0,0.101,0.497,89.977,181733.0,4.0,0,0
1,2,Savage Love,Jawsh 685,0.767,0.481,0.0,-8.52,0.0,0.0803,0.234,0.0,0.269,0.761,150.076,171375.0,4.0,0,0
2,3,Blinding Lights,The Weeknd,0.514,0.73,1.0,-5.934,1.0,0.0598,0.00146,9.5e-05,0.0897,0.334,171.005,200040.0,4.0,0,0
3,4,Watermelon Sugar,Harry Styles,0.548,0.816,0.0,-4.209,1.0,0.0465,0.122,0.0,0.335,0.557,95.39,174000.0,4.0,0,0
4,5,Roses - Imanbek Remix,SAINt JHN,0.785,0.721,8.0,-5.457,1.0,0.0506,0.0149,0.00432,0.285,0.894,121.962,176219.0,4.0,0,0


In [12]:
# Adjusting top100 and top10 column for top200
mask_top100 = top200_df['rank'] <= 100
mask_top10 = top200_df['rank'] <= 10

top200_df.loc[mask_top100, 'top100'] = 1
top200_df.loc[mask_top10, 'top10'] = 1

top200_df.head()

Unnamed: 0,rank,track,artist,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,top100,top10
0,1,ROCKSTAR,DaBaby,0.746,0.69,11.0,-7.956,1.0,0.164,0.247,0.0,0.101,0.497,89.977,181733.0,4.0,1,1
1,2,Savage Love,Jawsh 685,0.767,0.481,0.0,-8.52,0.0,0.0803,0.234,0.0,0.269,0.761,150.076,171375.0,4.0,1,1
2,3,Blinding Lights,The Weeknd,0.514,0.73,1.0,-5.934,1.0,0.0598,0.00146,9.5e-05,0.0897,0.334,171.005,200040.0,4.0,1,1
3,4,Watermelon Sugar,Harry Styles,0.548,0.816,0.0,-4.209,1.0,0.0465,0.122,0.0,0.335,0.557,95.39,174000.0,4.0,1,1
4,5,Roses - Imanbek Remix,SAINt JHN,0.785,0.721,8.0,-5.457,1.0,0.0506,0.0149,0.00432,0.285,0.894,121.962,176219.0,4.0,1,1
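
As an alternative sketch, the same flags can be derived directly from the boolean comparison instead of writing 1s through a mask. The small `ranks` frame below is made-up demo data, not the notebook's `top200_df`:

```python
import pandas as pd

# Made-up ranks chosen to sit on both sides of each cutoff.
ranks = pd.DataFrame({"rank": [1, 10, 11, 100, 101, 200]})

# A boolean comparison cast to int64 gives 1 inside the cutoff, 0 outside.
ranks["top100"] = (ranks["rank"] <= 100).astype("int64")
ranks["top10"] = (ranks["rank"] <= 10).astype("int64")

print(ranks)
```

This form also resets stale values to 0, whereas the mask assignment only ever sets 1s.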


In [13]:
# Fixing the data types for top200
update1 = top200_df.round({'tempo':0})
update2 = update1.astype({
    'key':'int64',
    'mode':'int64',
    'tempo':'int64',
    'duration_ms':'int64',
    'time_signature':'int64'
})

# Keep the first 100 rows and drop 'rank' in one step; drop() returns a new
# DataFrame, which avoids pandas' SettingWithCopyWarning from an in-place
# drop on a slice
top_100 = update2.head(100).drop('rank', axis=1)
del update1, update2, top200_df

top_100.head()


Unnamed: 0,track,artist,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,top100,top10
0,ROCKSTAR,DaBaby,0.746,0.69,11,-7.956,1,0.164,0.247,0.0,0.101,0.497,90,181733,4,1,1
1,Savage Love,Jawsh 685,0.767,0.481,0,-8.52,0,0.0803,0.234,0.0,0.269,0.761,150,171375,4,1,1
2,Blinding Lights,The Weeknd,0.514,0.73,1,-5.934,1,0.0598,0.00146,9.5e-05,0.0897,0.334,171,200040,4,1,1
3,Watermelon Sugar,Harry Styles,0.548,0.816,0,-4.209,1,0.0465,0.122,0.0,0.335,0.557,95,174000,4,1,1
4,Roses - Imanbek Remix,SAINt JHN,0.785,0.721,8,-5.457,1,0.0506,0.0149,0.00432,0.285,0.894,122,176219,4,1,1


In [14]:
# Fixing the data types for decade_hits
update1 = decade_hits_df.round({'tempo':0})
decade_hits = update1.astype({
    'key':'int64',
    'mode':'int64',
    'tempo':'int64',
    'duration_ms':'int64',
    'time_signature':'int64'
})

del update1, decade_hits_df

decade_hits.head()

Unnamed: 0,track,artist,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,top100,top10
0,She Came II Give It II U,Usher,0.821,0.565,4,-5.403,0,0.0837,0.0419,0.0,0.0796,0.732,120,242560,4,0,0
1,Loyal,Chris Brown,0.841,0.522,10,-5.963,0,0.049,0.0168,1e-06,0.188,0.616,99,264947,4,0,0
2,Fireball,Pitbull,0.69,0.937,10,-5.393,1,0.0642,0.0894,0.000104,0.0532,0.794,123,235012,4,0,0
3,Chandelier,Sia,0.399,0.787,1,-2.88,1,0.0499,0.0197,6.1e-05,0.0685,0.572,117,216120,5,0,0
4,Black Widow,Iggy Azalea,0.743,0.72,11,-3.753,1,0.124,0.192,0.000386,0.109,0.519,164,209423,4,0,0


In [15]:
decade_hits.shape

(1000, 17)

## Save to CSV and Pickle for next Notebook

In [16]:
top_100.to_csv('top_100.csv')
decade_hits.to_csv('decade_hits.csv')

In [17]:
with open('top_100.pickle', 'wb') as to_write:
    pickle.dump(top_100, to_write)
    
with open('decade_hits.pickle', 'wb') as to_write:
    pickle.dump(decade_hits, to_write)
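
For the next notebook, the saved files can be read back along these lines. This is a sketch on a tiny made-up frame written to a temporary directory, not the actual `top_100` data. Because `to_csv()` writes the index as an unnamed first column by default, passing `index_col=0` to `read_csv()` keeps it from reappearing as an "Unnamed: 0" column.

```python
import os
import pickle
import tempfile

import pandas as pd

# Tiny made-up frame standing in for top_100 / decade_hits.
demo = pd.DataFrame({"track": ["a", "b"], "tempo": [90, 150]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")
    pkl_path = os.path.join(tmp, "demo.pickle")

    # Save the same way the notebook does.
    demo.to_csv(csv_path)
    with open(pkl_path, "wb") as to_write:
        pickle.dump(demo, to_write)

    # Without index_col, the saved index comes back as an extra column.
    raw_columns = list(pd.read_csv(csv_path).columns)

    # With index_col=0, the first CSV column is used as the index again.
    from_csv = pd.read_csv(csv_path, index_col=0)

    with open(pkl_path, "rb") as to_read:
        from_pickle = pickle.load(to_read)

print(raw_columns)
print(list(from_csv.columns))
print(list(from_pickle.columns))
```

The pickle keeps dtypes exactly as saved, while the CSV has them re-inferred on load, which is one reason to keep both formats.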

In [18]:
!ls

Classwork_nonsense             __pycache__
EDA_MVP_proj3.ipynb            challenge_set_7_emmanuel.ipynb
Metis_2020_Submissions         challenge_set_8_emmanuel.ipynb
Metis_Files                    decade_hits.csv
Project2_BGG                   decade_hits.pickle
Scraping_Instagram.ipynb       gitHub_blog
Scraping_Spotify.ipynb         model_selection.py
SpotifyUtilities               top_100.csv
Stats_Prework                  top_100.pickle


In [19]:
del top_100, decade_hits