# Analyzing the Data of Top 200 songs on Spotify
## Amanuel Awoke
##  Ferzam Mohammad
## Josue Velasquez

# Introduction
## Motivation
The music industry has changed a lot in the last decade with the introduction of streaming services like Apple Music or Spotify. Services like Spotify allow users to stream music for personal consumption, often for free or for a subscription fee. These services have made it easier to consume music and have increased opportunities for people to start producing music, but they have also changed how musicians make money. Whenever a user listens to a song on a streaming service, the service typically keeps track of the number of “plays” that song has. Music artists are then paid a small amount based on the number of plays they have accumulated for their music. Given how little these artists are paid from streaming services, maximizing the revenue made from a song is valuable for those looking to release music on these services. Play count also indicates where a song stands in the streaming services’ popularity lists, and making it onto their top 100 or 200 songs is a factor considered in whether these songs are added to global, official top songs charts, e.g. the Billboard charts.
 
Our group thought it would be interesting to see if we could try to make predictions for how popular a song might be given different features for a song (e.g. the genre of the song, how fast or slow it is, the key the song is written in, the time of year a song was released, how many listeners an artist already gets on average, etc.). If we can indicate how many plays a song will get, we can give a prediction for how much money a song will make on a streaming service. Much like the Moneyball scenario, it’s possible that artists are focusing on producing music that meets criteria which they think makes a song popular when, in reality, they should be focusing on other aspects of their music. Understanding what components of a song make it popular would help artists figure out the best way to produce music in order to make money off of these streaming services.

The Moneyball story demonstrated the importance of data science in producing a strong baseball team, and while music is different from sports, our project should hopefully reflect similar data science practices in order to reach a valuable conclusion. It may be relatively straightforward to look at aspects like which genres make the most money or whether a song by Taylor Swift will end up on the top 200 chart given her “incredibly loyal fanbase” of over 40 million people, but maybe there are other similarities between popular songs that could indicate factors which help make a song more popular. Data science practices like hypothesis tests (for example, t-tests) or machine learning models help us here by giving us tools to identify characteristics in a song, clarify how those characteristics might relate to play count, and predict what the play count (or popularity) for a similar song could be given factors that we have determined have an effect on play count (or popularity).
One valuable conclusion might also concern how little artists are paid by streaming services, or how little they make from streaming alone.


# Collect Data

This is the first step in the data lifecycle, where we identify the information to collect. We gather data from the [Spotify Charts Regional Top 200](https://spotifycharts.com/regional) to identify which songs had the highest stream counts, dating back to January 1st, 2017. Note that the download URLs used in the code below point to the global daily chart rather than a single country's chart. Spotify Charts provides each track's stream count, its top 200 rank, and the artist(s) who created it. Spotify Charts already compiles the data into CSV files, so it isn't necessary to scrape the website directly. To download one yourself, select a date in the dropdown at the top right of the website, then select "Download to CSV" above it. The pandas method read_csv() was used to load the CSV files into dataframes.
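As a minimal sketch of the loading step, the snippet below parses a small in-memory CSV instead of a real download. The sample text is made up for illustration, and it assumes the first line of a chart file is a note rather than the header, which is what the junk-row removal in the download code suggests. Under that assumption, `skiprows=1` is one way to land directly on the real header row.

```python
import io
import pandas as pd

# Hypothetical sample mimicking a downloaded chart file:
# a note row first, then the real header row, then the data.
sample_csv = (
    ",,,Note about how these figures are generated,\n"
    "Position,Track Name,Artist,Streams,URL\n"
    "1,Song A,Artist A,1000,https://example.com/a\n"
    "2,Song B,Artist B,900,https://example.com/b\n"
)

# skiprows=1 drops the note row so the second line is used as the header
chart = pd.read_csv(io.StringIO(sample_csv), skiprows=1)
chart.columns = ['position', 'track_name', 'artist', 'streams', 'url']
chart['date'] = '2017-01-01'  # the date is not in the file, so we attach it ourselves
print(chart)
```

Because the header is parsed correctly here, `streams` is read as a numeric column and no junk row needs to be deleted afterwards.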



In [None]:
import os
import pandas as pd
import requests
from bs4 import BeautifulSoup
import spotipy
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Since the download URLs for the CSV files follow a consistent pattern based on the date they cover, we used a loop to build the links and then download all of the files.
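An alternative sketch for building the same kind of links is shown here, assuming we only want the first day of each month in 2017. It relies on `pd.date_range` with the month-start frequency instead of hand-written month and day arithmetic; the names `base_url`, `first_days`, and `monthly_urls` are ours.

```python
import pandas as pd

base_url = "https://spotifycharts.com/regional/global/daily/"

# freq="MS" (month start) yields the first day of every month in the range
first_days = pd.date_range("2017-01-01", "2017-12-01", freq="MS")

# strftime zero-pads the month and day, so no manual "0" prefixing is needed
monthly_urls = [base_url + d.strftime("%Y-%m-%d") + "/download" for d in first_days]
print(len(monthly_urls), monthly_urls[0])
```

Switching to `freq="D"` would produce one link per day, with month lengths and leap years handled by pandas.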

In [None]:
# Collect links from spotify charts top 200 streams per day
ref_str = "https://spotifycharts.com/regional/global/daily/"
ref_arr = []

# gets the 1st of every month in 2017; the commented-out loops extend this through October 2020

# for year in range(2017, 2021):
for year in range(2017, 2018):
    date = ""
    
    endingMonth = 12
    if year == 2020:
        endingMonth = 10
        
    # for month in range (1, endingMonth + 1):
    for month in range (1, 13):
       
        dayCount = -1

        #gets proper day count per month
        thirtyDayCountMonths = [4, 6, 9, 11]
        if month == 2:
            dayCount = 29
        elif month in thirtyDayCountMonths:
            dayCount = 30
        else:
            dayCount = 31

        if int(month) < 10:
            month = "0" + str(month)
        #for day in range (1, 16):
           
        #    if int(day) < 10:
        #        day = "0" + str(day)

        date = str(year) + "-" + str(month) + "-" + "01" + "/download"
        date = ref_str + date
        ref_arr.append(date)

ref_arr

In [None]:
#Loop downloading and appending of dataframes 

df = pd.DataFrame(columns =['position', 'track_name', 'artist', 'streams', 'url', 'date'] )
#make dir to save to
path = "sheets"
try:
    os.mkdir(path)
except OSError:
    print ("Folder already exists")

frames = []
for i in ref_arr:
    r = requests.get(i, allow_redirects = True)
    #String manipulation to read from the correct csv files:
    #characters 48-57 of each URL hold the YYYY-MM-DD date
    date = i[48:58]
    #print(date)
    fileName = "regional-global-daily-" + date + ".csv"
    #print(fileName)
    #Use a context manager so the file is closed before pandas reads it
    with open(fileName, "wb") as f:
        f.write(r.content)

    #os.rename(fileName, "sheets/" + fileName)  

    df_new = pd.read_csv(fileName)
    df_new.columns= ['position', 'track_name', 'artist', 'streams', 'url']
    df_new['date'] = date
    
    df_new = df_new.iloc[1:] #deletes junk row from csv conversion
    frames.append(df_new)

# DataFrame.append was removed in pandas 2.0, so the frames are combined with pd.concat.
# ignore_index=True gives the regular 0-based index. This is really helpful when trying to add more to the dataframe later, because otherwise there are lots of duplicate indices
df = pd.concat(frames, ignore_index=True)
#df.drop(['position'], axis=1, inplace=True) #delete position row since rank alraedy has this information

## Wrangled data into dataframe

In [None]:
df

# Data Processing

[Spotipy](https://spotipy.readthedocs.io/en/2.16.1/#) is a lightweight Python library for the [Spotify Web API](https://developer.spotify.com/documentation/web-api/) used to retrieve more detailed data for tracks now that their names have been retrieved from the Spotify Top 200. We must first authenticate our usage of the API.

In [None]:
import os
import spotipy
from spotipy.oauth2 import SpotifyOAuth
from spotipy.oauth2 import SpotifyClientCredentials

# Credentials are read from environment variables rather than hard-coded,
# so the client secret is not exposed in the notebook
SPOTIPY_CLIENT_ID = os.environ["SPOTIPY_CLIENT_ID"]
SPOTIPY_CLIENT_SECRET = os.environ["SPOTIPY_CLIENT_SECRET"]
SPOTIPY_CLIENT_REDIRECT = "http://www.cs.umd.edu/class/fall2020/cmsc320-0201/"

scope = "user-library-read"

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope=scope, client_id=SPOTIPY_CLIENT_ID, client_secret=SPOTIPY_CLIENT_SECRET, redirect_uri=SPOTIPY_CLIENT_REDIRECT))

In [None]:
#gets artist genres and artist ids for each artist in dataframe and puts them in arrays
# artist_genres = []
# artist_ids = []

# for index, row in df.iterrows():
#     artist = row['artist']
#     #print(index)
#     #print(artist)
#     trackArtistWithoutSpaces = '+'.join(artist.split())
#     result = sp.search(trackArtistWithoutSpaces)
#     track = result['tracks']['items'][0]
#     artist_id = track["artists"][0]["id"]
#     #print(artist_id)
#     #print(track)
#     artist_ids.append(artist_id)
#     artist = sp.artist(track["artists"][0]["external_urls"]["spotify"])
#     artist_genres.append(artist["genres"])
#     #print(artist["genres"])



In [None]:
#print(df["artist_genres"])

#creates new column in dataframe based on genre filter within filter func
#def filt_func(genre_list):
#    genre = ['pop','rap','edm','rock','indie']
#    result = list(filter(lambda x: x in genre, genre_list))
#    return "other" if len(result) == 0 else result[0]
    #print(result)
#df['genre'] = df['artist_genres'].apply(lambda x: filt_func(x))
#df['streams'] = df['streams'].apply(lambda x: int(x))

In [None]:
# This cell was to help understand how to get the track_id and artist_id from a search query
#trackName = df.iloc[0].at['track_name']
#trackNameWithoutSpaces = '+'.join(trackName.split())
#searchQuery = sp.search(trackNameWithoutSpaces, 1, 0)
#track_object = searchQuery['tracks']['items'][0]
#artist_object = track_object['artists'][0]
#artist_id = artist_object['id']
#track_id = track_object['id']
#audiofeatures = sp.audio_features(track_id)
#print(track_id)
#print(artist_id)
#print(audiofeatures[0]['danceability'])
#print(trackNameWithoutSpaces)

In [None]:
artist_id_list = []
track_id_list = []
audioFeaturesDf = pd.DataFrame()
# This gets the artist id and the track id (which is included in the "audio_features" search we make)
# Does this by making one search through the api and gets the ids from the information returned
for index, row in df.iterrows():
    trackName = row['track_name']
    track_id = ""
    artist_id = ""
    print(trackName)
    # We need to check that track_name is a string; some rows hold NaN values instead of a name
    if isinstance(trackName, str):
        trackNameWithoutSpaces = '+'.join(trackName.split())
        searchQuery = sp.search(trackNameWithoutSpaces, 1, 0)
        if (len(searchQuery['tracks']['items']) != 0):
            track_object = searchQuery['tracks']['items'][0]
            artist_object = track_object['artists'][0] if type(track_object['artists']) is list else track_object['artists']
            artist_id = artist_object['id']
            track_id = track_object['id']
            artist_id_list.append(artist_id)
            track_id_list.append(track_id)
        # If our query returned nothing then append a nan in the place of artist and track for this entry
        else:
            artist_id_list.append(np.nan)
            track_id_list.append(np.nan)
        #print(trackItem)
    # If we had stored a nan, then just plan to append a nan in this position
    else:
        artist_id_list.append(np.nan)
        track_id_list.append(np.nan)
    audiofeatures = {'duration_ms' : np.nan, 'key' : np.nan, 'mode' : np.nan, 'time_signature' : np.nan, 'acousticness' : np.nan, 'danceability' : np.nan, 'energy' : np.nan, 'instrumentalness' : np.nan, 'liveness' : np.nan, 'loudness' : np.nan, 'speechiness' : np.nan, 'valence' : np.nan, 'tempo' : np.nan, 'id' : np.nan, 'uri' : np.nan, 'track_href' : np.nan, 'analysis_url' : np.nan, 'type' : np.nan, }
    #print(audiofeatures)
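    # If we successfully found a track when we did our search, then get the audio features for that
    if (track_id != ""):
        features = sp.audio_features(track_id)[0]
        # The API can return None for a track without audio features; keep the NaN placeholders in that case
        if features is not None:
            audiofeatures = features
    #print(track_id)
    #print(audiofeatures)
    # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
    audioFeaturesDf = pd.concat([audioFeaturesDf, pd.DataFrame([audiofeatures])], ignore_index=True)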
    
audioFeaturesDf.head()

In [None]:
audioFeaturesDf['artist_id'] = artist_id_list
audioFeaturesDf.head(205)

In [None]:
# Copy the audio features we want to analyze into the main dataframe.
# artist_id is copied too, because the artist lookups at the end of the notebook read it from df.
feature_columns = ['duration_ms', 'acousticness', 'danceability', 'energy', 'instrumentalness',
                   'liveness', 'loudness', 'speechiness', 'valence', 'tempo', 'artist_id']
for column in feature_columns:
    df[column] = audioFeaturesDf[column]
df.head(201)

In [None]:
#visualization
#plotting all the new metrics in our dataframe vs streams
df['streams'] = df['streams'].astype(float)
df['position'] = df['position'].astype(int)
df.loc[df['position'] == 1].head(10) # Previously this was showing that every 

Here are the features for the first few songs sorted by streams. We will remove duplicates of songs and keep the version of each song with the highest stream count.
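To illustrate why sorting before dropping duplicates keeps the highest-stream version, here is a small sketch on a made-up dataframe. The `toy` data and the shortened list of identifying columns are assumptions for the example.

```python
import pandas as pd

# Made-up entries: "Song A" charts on two different dates with different stream counts
toy = pd.DataFrame({
    'track_name': ['Song A', 'Song A', 'Song B'],
    'artist': ['X', 'X', 'Y'],
    'duration_ms': [200000, 200000, 180000],
    'streams': [500.0, 900.0, 700.0],
    'date': ['2017-01-01', '2017-02-01', '2017-01-01'],
})

# After sorting by streams (descending), the first row seen for each song is its
# highest-stream entry, so keep='first' retains exactly that row
deduped = (toy.sort_values('streams', ascending=False)
              .drop_duplicates(['artist', 'duration_ms'], keep='first'))
print(deduped)
```

Only the 900-stream entry for "Song A" survives, alongside the single entry for "Song B".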

In [None]:
df.sort_values('streams', ascending=False).drop_duplicates(['artist', 'duration_ms', 'acousticness', 'danceability', 'energy'], keep='first').head(10)

In [None]:
from scipy.stats import normaltest
from numpy.random import seed
from numpy.random import randn
alpha = 0.05
#data = df['tempo'].sample(n=10).array
data = []
for i in range(0,100):
    data.append(np.mean(df['streams'].sample(n=100)))
print(data)
plt.hist(data)
plt.xlabel("Estimate")
plt.ylabel("Frequency")


In [None]:
from sklearn import linear_model

#Get averages of each col
duration_mean = np.mean(df['duration_ms'])
acousticness_mean = np.mean(df['acousticness'])
danceability_mean = np.mean(df['danceability'])
energy_mean = np.mean(df['energy'])
instrumentalness_mean = np.mean(df['instrumentalness'])
liveness_mean = np.mean(df['liveness'])
loudness_mean = np.mean(df['loudness'])
speechiness_mean = np.mean(df['speechiness'])
valence_mean = np.mean(df['valence'])
tempo_mean = np.mean(df['tempo'])





print(duration_mean)
# line = linear_model.LinearRegression()



# for crime in crimes:
#     x = []
#     y = []
#     for region in regions:
#         x.append(region["Total"].values)
#         y.append(region[crime].values)
        
        
#     x = np.array(x)
#     x = x.flatten()
#     y = np.array(y)
#     y = y.flatten()
    
    
#     plt.scatter(x, y)
#     plt.title(crime)
        
# line = linear_model.LinearRegression()
# x = np.array(df['duration_ms']).flatten()
# y = np.array(df['streams']).flatten()
# print(x)
# line.fit(x,y)
# predicted = line.predict(x.reshape(-1,1))
    
# #     r_value = line.score(x.reshape(-1,1),y)
        
        
# #     plt.plot(x,predicted, c='r', label=r_value)
# #     plt.legend()
    
# #     plt.show()
# #     plt.close()

# plt.scatter(x, y)
# print(type(df['streams']))



In [None]:
#how do i get rows with values in a range
#where dancability

# df_dance = 

In [None]:
plt.scatter(df['duration_ms'], df['streams'])
plt.xlabel('duration in milliseconds')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['acousticness'],df['streams'])
plt.xlabel('acousticness scale of 0-1')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['danceability'],df['streams'])
plt.xlabel('danceability scale of 0-1')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['energy'],df['streams'])
plt.xlabel('energy scale of 0-1')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['instrumentalness'],df['streams'])
plt.xlabel('instrumentalness scale of 0-1')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['liveness'],df['streams'])
plt.xlabel('liveness scale of 0-1')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['loudness'],df['streams'])
plt.xlabel('loudness in decibels')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['speechiness'],df['streams'])
plt.xlabel('speechiness scale of 0-.5')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['valence'],df['streams'])
plt.xlabel('valence scale of 0-1')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['tempo'],df['streams'])
plt.xlabel('tempo scale of 0-200')
plt.ylabel('streams in millions')

In [None]:
#violin plot of genre vs streams in millions
# ax = sns.violinplot(x='genre', y='streams', data=df, palette='muted')


In [None]:
plt.scatter(df['date'],df['streams'])
plt.xlabel('date')
plt.ylabel('streams in millions')

Can we find any trends within these features for the top 5 songs from all the months we look at in our dataframe?

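We are going to normalize the number of streams by dividing each value by the mean stream count, which makes the data easier to visualize. After this step, a value of 1 corresponds to the average stream count, so the stream axis in the later plots is relative to the mean rather than a raw count.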
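A quick sketch on made-up numbers shows what dividing by the mean does, and how it differs from a z-score, which is what "standardizing" usually refers to. The variable names here are ours.

```python
import pandas as pd

streams = pd.Series([2_000_000.0, 1_000_000.0, 3_000_000.0])

# Dividing by the mean rescales the values so that their mean is 1
relative = streams / streams.mean()

# A z-score instead centers the values at 0 with unit (sample) standard deviation
z_scores = (streams - streams.mean()) / streams.std()

print(relative.tolist())
print(z_scores.tolist())
```

Both transformations preserve the ordering of the songs, so the shape of a scatter plot is unchanged; only the axis scale differs.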

In [None]:
df['streams'] = df['streams'] / df['streams'].mean()

In [None]:
top5s = df.loc[df['position'] <= 5]
top5s.head(10)

We're going to look at the same comparisons we made in the graphs above, but only for the top 5 tracks on the 1st of each month. These tracks have the most streams compared to all the other songs from those days, so it could be valuable to see whether any features made them so successful. We will only plot features whose data appeared to be more spread out in the previous plots.

In [None]:
color_list = ['r', 'cyan', 'purple', 'b', 'green']

i = 0
for index, row in top5s.iterrows():
    plt.scatter(row['duration_ms'],row['streams'], color=color_list[i])
    i = (i + 1) % 5

plt.xlabel('duration in milliseconds')
plt.ylabel('streams in millions')
plt.legend(['Number 1 Song For Month', 'Number 2 Song For Month', 'Number 3 Song For Month', 'Number 4 Song For Month', 'Number 5 Song For Month'])


This graph shows how the duration of a song in milliseconds compares to the number of streams that song received, using only the top 5 tracks for each date in our dataframe. It shows that the songs with the most streams in this set are longer than 240000 ms, or 4 minutes. This is surprising, because the average song is usually around 3 minutes and 30 seconds or less.

In [None]:
i = 0
for index, row in top5s.iterrows():
    plt.scatter(row['acousticness'],row['streams'], color=color_list[i])
    i = (i + 1) % 5

plt.xlabel('acousticness scale of 0-1')
plt.ylabel('streams in millions')
#plt.legend(['Number 1 Song For Month', 'Number 2 Song For Month', 'Number 3 Song For Month', 'Number 4 Song For Month', 'Number 5 Song For Month'])

This graph displays a confidence score for how likely it is that a song is acoustic (with a value of 1 being very likely that the song is acoustic) compared to the number of streams the song has. All of the confidence scores are less than .5, which indicates that most of these songs are probably not acoustic.

In [None]:
i = 0
for index, row in top5s.iterrows():
    plt.scatter(row['energy'],row['streams'], color=color_list[i])
    i = (i + 1) % 5

plt.xlabel('energy scale of 0-1')
plt.ylabel('streams in millions')
#plt.legend(['Number 1 Song For Month', 'Number 2 Song For Month', 'Number 3 Song For Month', 'Number 4 Song For Month', 'Number 5 Song For Month'])

This graph shows how the "energy" of a song, or generally how noisy and fast the song is, compares to the number of streams for the top 5 songs on the 1st of each month. Here, we see that the songs with the most streams are around or above .6 on the energy scale (a higher score means the song is higher energy).

In [None]:
i = 0
for index, row in top5s.iterrows():
    plt.scatter(row['danceability'],row['streams'], color=color_list[i])
    i = (i + 1) % 5

plt.xlabel('danceability scale of 0-1')
plt.ylabel('streams in millions')
#plt.legend(['Number 1 Song For Month', 'Number 2 Song For Month', 'Number 3 Song For Month', 'Number 4 Song For Month', 'Number 5 Song For Month'])

This graph shows how "danceable" a song is, using a value provided to us by the Spotify API, compared to the number of streams that song got. Danceability is measured as a value from 0 to 1, where 1 is most danceable. This graph appears to be similar to the graph describing energy, so the two values may have been determined using similar characteristics (i.e. both are measuring how upbeat or fast a song is).

In [None]:
i = 0
for index, row in top5s.iterrows():
    plt.scatter(row['loudness'],row['streams'], color=color_list[i])
    i = (i + 1) % 5

plt.xlabel('loudness in decibels')
plt.ylabel('streams in millions')
#plt.legend(['Number 1 Song For Month', 'Number 2 Song For Month', 'Number 3 Song For Month', 'Number 4 Song For Month', 'Number 5 Song For Month'])

This graph describes the average volume of each track in our top 5s data set compared to the number of streams each song had. It appears to trend similarly to the last two graphs, indicating that the volume of a track may be correlated with how danceable or energetic a song is.

In [None]:
i = 0
for index, row in top5s.iterrows():
    plt.scatter(row['valence'],row['streams'], color=color_list[i])
    i = (i + 1) % 5

plt.xlabel('valence scale of 0-1')
plt.ylabel('streams in millions')
#plt.legend(['Number 1 Song For Month', 'Number 2 Song For Month', 'Number 3 Song For Month', 'Number 4 Song For Month', 'Number 5 Song For Month'])

This graph describes the "valence" of a song compared to the number of streams it got. Valence is described as the "positivity" of a song where "Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry)," according to the Spotify API reference. The reference does not describe how this value is determined, but our data seems to show there may be a correlation between valence and the number of streams a song is getting in the set of number 1 songs. However, this graph does not take into account the other features of the songs. It may be worth considering songs where all features except this one are held roughly constant, so that we can judge whether there is a correlation between this value and the number of streams.

In [None]:
i = 0
for index, row in top5s.iterrows():
    plt.scatter(row['tempo'],row['streams'], color=color_list[i])
    i = (i + 1) % 5

plt.xlabel('tempo scale of 0-200 Beats Per Minute')
plt.ylabel('streams in millions')
#plt.legend(['Number 1 Song For Month', 'Number 2 Song For Month', 'Number 3 Song For Month', 'Number 4 Song For Month', 'Number 5 Song For Month'])

This graph describes the tempo of a song compared to the number of streams that song has. Given our dataset, it is unclear whether there is a correlation between the tempo of a song and the number of streams it gets.

There appeared to be a potential relationship between valence and the number of streams a song was getting, so it might be interesting to look at what the different features are like for songs with a valence of around .4 or higher.

In [None]:
highValenceTopTracks = top5s.loc[top5s['valence'] > .4]
highValenceTopTracks.sort_values('streams', ascending=False).head(10)

We have duplicate pieces of data, so let's remove the duplicates for this test. We're going to keep the version of each song that has the most streams.

In [None]:
highValenceTopTracks = highValenceTopTracks.sort_values('streams', ascending=False).drop_duplicates(['artist', 'duration_ms', 'acousticness', 'danceability', 'energy'], keep='first') # After the descending sort, keep='first' retains the entry with the most streams for each song
highValenceTopTracks.sort_values('streams', ascending=False).head(10)

In [None]:
highValenceTopTracks = highValenceTopTracks.sort_values('valence', ascending=False)
highValenceTopTracks.head()

We believe no one feature has a strong effect on the number of streams, but it's possible that a combination of values in different features work together to improve stream count.

Using our list of songs with a high valence, we can look at the relationship between features like danceability or tempo and stream count to see how songs with both a high valence and varying levels of these features fare in terms of streams.

Let's get high valence songs from our original dataframe to have a larger sample size.
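One possible way to assign every song to a valence range in a single step is `pd.cut`. The sketch below uses made-up valence values and bucket names of our own choosing; it uses left-closed intervals so that a value sitting exactly on a boundary (such as .5) still lands in a bucket.

```python
import numpy as np
import pandas as pd

valence = pd.Series([0.1, 0.3, 0.5, 0.79, 0.8, 1.0])

# Bin edges; np.inf as the last edge keeps a valence of exactly 1.0 in the top bucket
bins = [0, 0.3, 0.5, 0.8, np.inf]
labels = ['very low', 'low', 'high', 'very high']

# right=False makes each interval include its left edge: [0, .3), [.3, .5), [.5, .8), [.8, inf)
buckets = pd.cut(valence, bins=bins, labels=labels, right=False)
print(buckets.astype(str).tolist())
```

The resulting categorical column could then be passed to `groupby` or used to color a scatter plot, instead of maintaining four separate dataframes.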

In [None]:
# >= on the lower bounds so songs sitting exactly on a boundary (.3, .5, .8) are not dropped from every bucket
highValenceTracks = df.loc[(df['valence'] >= .5) & (df['valence'] < .8)]
veryHighValenceTracks = df.loc[df['valence'] >= .8]
lowValenceTracks = df.loc[(df['valence'] >= .3) & (df['valence'] < .5)]
veryLowValenceTracks = df.loc[df['valence'] < .3]
#lowValenceTracks.head()
#highValenceTracks.head()

In [None]:
# Plotting danceability against streams for each valence range to see if these two values work together to impact stream counts
# Each scatter call carries its own label so the legend entries always match the colors
plt.scatter(veryLowValenceTracks['danceability'],veryLowValenceTracks['streams'], color="blue", label='Song with valence below .3')
plt.scatter(lowValenceTracks['danceability'],lowValenceTracks['streams'], color="turquoise", label='Song with valence between .3 and .5')
plt.scatter(highValenceTracks['danceability'],highValenceTracks['streams'], color="yellow", label='Song with valence between .5 and .8')
plt.scatter(veryHighValenceTracks['danceability'],veryHighValenceTracks['streams'], color="red", label='Song with valence above .8')
plt.title('Danceability and streams for songs with bounded valences')
plt.xlabel('danceability scale of 0-1')
plt.ylabel('streams in millions')
plt.legend()

This graph displays how the danceability of a song compares to the number of streams it has, with songs colored by valence range. While there is little indication of a linear correlation, it appears that the songs with the most streams all also have a danceability of > .5.

We no longer see the relationship we were seeing earlier between valence and number of streams. Maybe the relationship that leads to more streams is a combination of these features together. It might be worth trying to see if there is a relationship between streams and a combination of features like valence AND loudness or tempo AND danceability.
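One way to probe a combined effect like this is a linear regression that includes an interaction term (the product of two features). The sketch below is not part of our analysis: it runs on synthetic data that we generate ourselves, with a target deliberately built from the product of two features, and uses NumPy's least-squares solver to check that the interaction coefficient is recovered.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic features in plausible ranges (made up for illustration)
valence = rng.uniform(0, 1, n)
loudness = rng.uniform(-12, 0, n)

# Synthetic target: depends only on the PRODUCT of the two features, plus small noise
streams = 1.0 + 0.5 * valence * loudness + rng.normal(0, 0.01, n)

# Design matrix: intercept, each feature on its own, and their interaction
X = np.column_stack([np.ones(n), valence, loudness, valence * loudness])
coef, *_ = np.linalg.lstsq(X, streams, rcond=None)

names = ['intercept', 'valence', 'loudness', 'valence*loudness']
print(dict(zip(names, coef.round(3))))
```

In this synthetic setup the fitted interaction coefficient comes out close to the 0.5 used to generate the data, while the individual feature coefficients stay near zero. On the real dataframe the same design matrix could be built from `df['valence']` and `df['loudness']`, though nothing here guarantees an interaction actually exists in our data.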

Instead of considering high valence songs, let's look at the top streamed songs for the 1st day of every month in the year of 2017.

In [None]:
top500OverYear = df.sort_values('streams', ascending=False).drop_duplicates(['artist', 'duration_ms', 'acousticness', 'danceability', 'energy'], keep='first').head(500)
top500OverYear.head(20)

In [None]:
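from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib versions

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Use the 3D axes' own scatter so that streams is plotted on the z axis
# (plt.scatter would treat the third argument as the marker size)
ax.scatter(top500OverYear['valence'], top500OverYear['tempo'], top500OverYear['streams'])
ax.set_xlabel('valence scale of 0-1')
ax.set_ylabel('tempo in beats per minute')
ax.set_zlabel('streams (relative to mean)')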

In [None]:
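corr = df.select_dtypes(include='number').corr()  # correlate numeric columns only; text columns such as track_name cannot be correlated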
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

It seems like each individual feature has very little effect on the number of streams, which would support our findings above. After seeing this, we had a couple of ideas. It is possible that different features we are currently tracking work together to make a song popular, but it is also possible that we are missing other important features. After looking back at the most popular songs over the course of our entire dataframe, we noticed the majority of artists were well known or already accomplished. While it is obvious that an artist's "followers" or typical listeners will increase the number of streams a song will get, it would be interesting to know if the number of typical listeners was more important than all these other aspects of the song.
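The heatmap can be complemented by a ranked list of how strongly each feature correlates with streams. The sketch below uses a tiny made-up dataframe, not our real data, to show the idea; `toy`, `corr_with_streams`, and `by_strength` are names introduced for the example.

```python
import pandas as pd

# Made-up values chosen so the correlations are easy to reason about
toy = pd.DataFrame({
    'streams': [1.0, 2.0, 3.0, 4.0],
    'energy': [0.2, 0.4, 0.6, 0.8],        # rises in step with streams
    'acousticness': [0.9, 0.7, 0.5, 0.3],  # falls in step with streams
    'tempo': [120.0, 90.0, 130.0, 100.0],  # no clear pattern
})

# Pearson correlation of every feature with streams, excluding streams itself
corr_with_streams = toy.corr()['streams'].drop('streams')

# Order features by the absolute size of the correlation, strongest first
order = corr_with_streams.abs().sort_values(ascending=False).index
by_strength = corr_with_streams[order]
print(by_strength)
```

Sorting by absolute value keeps strong negative correlations near the top, where a plain sort would bury them at the bottom.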

We can start by getting the number of followers each artist has for every track entry in our dataframe. The Spotify API also provides a "popularity" index from 0 to 100, with 100 being the most popular. We will get this information, as well. There will be repetition between duplicate entries for an artist, but we are keeping these in so that we can still build a dataframe.
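Since many rows share the same artist, one option is to look up each distinct artist only once and then map the results back onto every row. The sketch below illustrates this with a hypothetical `fetch_artist` function standing in for the real `sp.artist` call; the fake catalog and ids are made up, and only the `popularity` and `followers['total']` fields of the response are used.

```python
import numpy as np
import pandas as pd

def fetch_artist(artist_id):
    """Hypothetical stand-in for sp.artist(artist_id), returning a dict with the same fields we use."""
    fake_catalog = {
        'a1': {'popularity': 90, 'followers': {'total': 1000}},
        'a2': {'popularity': 60, 'followers': {'total': 200}},
    }
    return fake_catalog[artist_id]

# Made-up track rows: artist 'a1' appears twice and one row has no artist id
tracks = pd.DataFrame({'artist_id': ['a1', 'a2', 'a1', np.nan]})

# Look up each distinct, non-missing artist exactly once
unique_ids = tracks['artist_id'].dropna().unique()
artist_info = {aid: fetch_artist(aid) for aid in unique_ids}

# Map the cached results back onto every row; na_action='ignore' leaves missing ids as NaN
tracks['popularity'] = tracks['artist_id'].map(
    lambda a: artist_info[a]['popularity'], na_action='ignore')
tracks['followers'] = tracks['artist_id'].map(
    lambda a: artist_info[a]['followers']['total'], na_action='ignore')
print(tracks)
```

With the real API this would reduce the number of requests from one per row to one per distinct artist, while still leaving one value per track entry in the dataframe.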

In [None]:
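popularity_index_list = []
num_followers_list = []
# artist_id was stored in audioFeaturesDf above, with one entry per row of df
for artist_id in audioFeaturesDf['artist_id']:
    # Rows whose search returned nothing hold NaN instead of an id; keep a NaN placeholder for them
    if not isinstance(artist_id, str):
        popularity_index_list.append(np.nan)
        num_followers_list.append(np.nan)
        continue
    artist_object = sp.artist(artist_id)
    # Store the values so both lists line up with the rows of df
    popularity_index_list.append(artist_object['popularity'])
    num_followers_list.append(artist_object['followers']['total'])
print(popularity_index_list)
print(num_followers_list)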