# Analyzing the Data of Top 200 songs on Spotify
## Amanuel Awoke
##  Ferzam M
## Josue Velasquez

# Introduction
## Motivation
The music industry has changed a lot in the last decade with the introduction of streaming services like Apple Music and Spotify. Services like Spotify allow users to stream music for personal consumption, either for free or for a subscription fee. These services have made it easier to consume music and have increased opportunities for people to start producing music, but they have also changed how musicians make money. Whenever a user listens to a song on a streaming service, the service typically keeps track of the number of “plays” that song has. Artists are then paid a small amount based on the number of plays their music has accumulated. Given how little artists are paid per play, maximizing the revenue made from a song is valuable for anyone releasing music on these services. Play count also determines where a song stands in a streaming service's popularity lists, and making it onto a top 100 or top 200 list is a factor in whether a song is added to global, official top-song charts such as the Billboard 200.
 
Our group thought it would be interesting to see if we could predict how popular a song might be given different features of the song (e.g. the genre of the song, how fast or slow it is, the key the song is written in, the time of year a song was released, how many listeners an artist already gets on average, etc.). If we can estimate how many plays a song will get, we can predict how much money it will make on a streaming service. Much like the Moneyball scenario, it’s possible that artists are focusing on producing music that meets criteria which they think make a song popular when, in reality, they should be focusing on other aspects of their music. Understanding what components of a song make it popular would help artists figure out the best way to produce music in order to make money from these streaming services.

The Moneyball story demonstrated the importance of data science in producing a strong baseball team, and while music is different from sports, our project should reflect similar data science practices in order to reach a valuable conclusion. It may be relatively straightforward to look at aspects like which genres make the most money or whether a song by Taylor Swift will end up on the top 200 chart given her “incredibly loyal fanbase” of over 40 million people, but there may be other similarities between popular songs that point to factors which help make a song more popular. Data science practices like hypothesis tests (possibly t-tests) or machine learning models help us here by giving us tools to identify characteristics of a song, clarify how those characteristics relate to play count, and predict what the play count (or popularity) of a similar song could be, given the factors that we have determined have an effect on it.
One valuable conclusion might also be how little money artists are paid by streaming services, or how little they can make from streaming alone.


# Collect Data

This is the first step in the data lifecycle, where we identify the information to collect. We gather data from the [Spotify Charts Regional Top 200](https://spotifycharts.com/regional) to identify which songs had the highest daily stream counts worldwide (the code below requests the global daily chart), from January 1st, 2017 to the present day. Spotify Charts provides the tracks with the highest stream counts, their top 200 rank, and the artist(s) who created each song. Spotify Charts already compiles the data into CSV files, so it isn't necessary to scrape the website directly. To download one yourself, select a date in the dropdown at the top right of the website, then select "Download to CSV" above it. The pandas function `read_csv()` was used to load the CSV files into dataframes.
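As a minimal sketch of that last step, the snippet below parses a small in-memory stand-in for one downloaded sheet. The sample rows are made up, and the layout (a note line above the real header row) is an assumption based on the junk row that this notebook drops after loading. Passing `header=1` asks pandas to use the second line as the header.

```python
import io
import pandas as pd

# Made-up stand-in for one downloaded sheet; real files come from Spotify Charts.
# Assumption: the first line is a note and the second line is the real header.
sample_csv = io.StringIO(
    ",,,Note: sample file for illustration,\n"
    "Position,Track Name,Artist,Streams,URL\n"
    "1,Song A,Artist A,1500000,https://open.spotify.com/track/aaa\n"
    "2,Song B,Artist B,1200000,https://open.spotify.com/track/bbb\n"
)

# header=1 skips the note line and uses the second line as column names
chart = pd.read_csv(sample_csv, header=1)
chart.columns = ['position', 'track_name', 'artist', 'streams', 'url']
chart['date'] = "2017-01-01"

print(chart[['position', 'track_name', 'streams']])
```

Reading the header this way would make the later "drop the junk row" step unnecessary, if the file layout matches the assumption.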



In [None]:
import os
import pandas as pd
import requests
from bs4 import BeautifulSoup
import spotipy
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Since the download URLs for the sheets follow a consistent pattern based on the date they record, we used a loop to build the links and then downloaded all of the sheets.
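One possible way to build the same links without tracking month lengths by hand is to let pandas enumerate the calendar days. This is only a sketch: `build_download_urls` is a hypothetical helper name, and the URL pattern is the one used in this notebook.

```python
import pandas as pd

BASE_URL = "https://spotifycharts.com/regional/global/daily/"

def build_download_urls(start, end):
    """Return one download URL per calendar day from start to end, inclusive."""
    days = pd.date_range(start=start, end=end, freq="D")
    return [f"{BASE_URL}{day:%Y-%m-%d}/download" for day in days]

urls = build_download_urls("2017-01-01", "2017-01-15")
print(len(urls))
print(urls[0])
```

Because `pd.date_range` follows the real calendar, leap years and 30-day months need no special cases.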

In [None]:
# Collect links from spotify charts top 200 streams per day
import calendar

ref_str = "https://spotifycharts.com/regional/global/daily/"
ref_arr = []

# The full range is January 2017 to October 2020; the active loops below are
# limited to the first 15 days of January 2017 to keep test runs short.
# for year in range(2017, 2021):
for year in range(2017, 2018):
    # 2020 data only runs through October
    endingMonth = 10 if year == 2020 else 12

    # for month in range(1, endingMonth + 1):
    for month in range(1, 2):
        # monthrange gives the proper day count per month, including leap years
        dayCount = calendar.monthrange(year, month)[1]

        # for day in range(1, dayCount + 1):
        for day in range(1, 16):
            # zero-pad month and day to match the YYYY-MM-DD format in the URL
            date = f"{year}-{month:02d}-{day:02d}/download"
            ref_arr.append(ref_str + date)


ref_arr

In [None]:
#Loop downloading and appending of dataframes

columns = ['position', 'track_name', 'artist', 'streams', 'url']
#make dir to save to; exist_ok avoids an error when the folder already exists
path = "sheets"
os.makedirs(path, exist_ok=True)

frames = []
for i in ref_arr:
    #String manipulation to pull the YYYY-MM-DD date out of the URL
    date = i[48:58]
    fileName = os.path.join(path, "regional-global-daily-" + date + ".csv")

    #only download sheets that are not already saved locally
    if not os.path.exists(fileName):
        print("Downloading... " + fileName)
        r = requests.get(i, allow_redirects=True)
        r.raise_for_status()
        with open(fileName, "wb") as f:
            f.write(r.content)

    df_new = pd.read_csv(fileName)
    df_new.columns = columns
    df_new['date'] = date
    df_new = df_new.iloc[1:] #deletes junk row from csv conversion
    frames.append(df_new)

print("Done")
#DataFrame.append was removed in pandas 2.0, so combine the sheets with pd.concat
#the index (1-200 for each day) is kept and doubles as the chart rank
df = pd.concat(frames)
df.drop(['position'], axis=1, inplace=True) #delete position column since the index already has this information
#streams are read in as strings, so convert them to int
df['streams'] = df['streams'].astype(int)

## Wrangled data into dataframe

In [None]:
df

# Data Processing

[Spotipy](https://spotipy.readthedocs.io/en/2.16.1/#) is a lightweight Python library for the [Spotify Web API](https://developer.spotify.com/documentation/web-api/) used to retrieve more detailed data for tracks now that their names have been retrieved from the Spotify Top 200. We must first authenticate our usage of the API.
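One way to keep the client ID and secret out of the notebook itself is to read them from environment variables. The sketch below is a hypothetical helper (`load_spotify_credentials` is not part of Spotipy); the variable names follow the `SPOTIPY_*` convention, and the values in the example call are placeholders rather than real credentials.

```python
import os

def load_spotify_credentials(env=None):
    """Collect the Spotify app credentials from environment variables."""
    env = os.environ if env is None else env
    names = {
        "client_id": "SPOTIPY_CLIENT_ID",
        "client_secret": "SPOTIPY_CLIENT_SECRET",
        "redirect_uri": "SPOTIPY_REDIRECT_URI",
    }
    # Fail early with a readable message if anything is unset or empty
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}

# Example with placeholder values (not real credentials)
creds = load_spotify_credentials({
    "SPOTIPY_CLIENT_ID": "your-client-id",
    "SPOTIPY_CLIENT_SECRET": "your-client-secret",
    "SPOTIPY_REDIRECT_URI": "http://localhost:8080/callback",
})
print(sorted(creds))
```

The resulting dictionary could then be unpacked into the auth manager, for example `SpotifyOAuth(scope=scope, **creds)`.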

In [None]:
import os
import spotipy
from spotipy.oauth2 import SpotifyOAuth

# Read the app credentials from environment variables so that the client
# secret is not stored in the notebook
SPOTIPY_CLIENT_ID = os.environ["SPOTIPY_CLIENT_ID"]
SPOTIPY_CLIENT_SECRET = os.environ["SPOTIPY_CLIENT_SECRET"]
SPOTIPY_CLIENT_REDIRECT = "http://www.cs.umd.edu/class/fall2020/cmsc320-0201/"

scope = "user-library-read"

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope=scope, client_id=SPOTIPY_CLIENT_ID, client_secret=SPOTIPY_CLIENT_SECRET, redirect_uri=SPOTIPY_CLIENT_REDIRECT))

# results = sp.current_user_saved_tracks()


# How to get audio features of a track from our data frame
# not needed anymore, but a good reference for how to get track data
#trackName = df.iloc[0].at['track_name']
#trackNameWithoutSpaces = '+'.join(trackName.split())
#print(trackNameWithoutSpaces)
#trackItem = sp.search(trackNameWithoutSpaces, 1, 0)
#track_id = trackItem['tracks']['items'][0]['id']
#audiofeatures = sp.audio_features(track_id)
#print(track_id)
#audiofeatures[0]

In [None]:
#gets artist genres and artist ids for each artist in dataframe and puts them in arrays
# artist_genres = []
# artist_ids = []

# for index, row in df.iterrows():
#     artist = row['artist']
#     #print(index)
#     #print(artist)
#     trackArtistWithoutSpaces = '+'.join(artist.split())
#     result = sp.search(trackArtistWithoutSpaces)
#     track = result['tracks']['items'][0]
#     artist_id = track["artists"][0]["id"]
#     #print(artist_id)
#     #print(track)
#     artist_ids.append(artist_id)
#     artist = sp.artist(track["artists"][0]["external_urls"]["spotify"])
#     artist_genres.append(artist["genres"])
#     #print(artist["genres"])



In [None]:
#add artist id and genres to dataframe
# df['Artist Id'] = artist_ids
# df['artist_genres'] = artist_genres

In [None]:
#print(df["artist_genres"])

#creates new column in dataframe based on genre filter within filter func
# def filt_func(genre_list):
#     genre = ['pop','rap','edm','rock','indie']
#     result = list(filter(lambda x: x in genre, genre_list))
#     return "other" if len(result) == 0 else result[0]
#     #print(result)
# df['genre'] = df['artist_genres'].apply(lambda x: filt_func(x))
# df['streams'] = df['streams'].apply(lambda x: int(x))
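The commented-out `filt_func` above keeps a genre only when it exactly equals one of the five labels. If the artist genres turn out to be finer-grained strings such as "dance pop" (an assumption about the data, not something shown in this notebook), matching on whole words may bucket more artists. A hypothetical variant, `broad_genre`, could look like this:

```python
BROAD_GENRES = ['pop', 'rap', 'edm', 'rock', 'indie']

def broad_genre(genre_list):
    """Return the first broad genre that appears as a whole word in the artist's genres."""
    for genre in genre_list:
        for word in genre.split():
            if word in BROAD_GENRES:
                return word
    return "other"

print(broad_genre(['dance pop', 'post-teen pop']))
print(broad_genre(['trap', 'southern rap']))
print(broad_genre([]))
```

Matching whole words rather than substrings avoids false hits such as "trap" being counted as "rap".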

In [None]:

audioFeaturesDf = pd.DataFrame(columns=["duration_ms", "key", "mode", "time_signature", "acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "valence", "tempo", "id", "uri", "track_href", "analysis_url", "type"])

#Example lookup of a song 
trackName = df.iloc[0].at['track_name']
trackNameWithoutSpaces = '+'.join(trackName.split())
trackItem = sp.search(trackNameWithoutSpaces, 1, 0)
track_id = trackItem['tracks']['items'][0]['id']
audiofeatures = sp.audio_features(track_id)
print(track_id)
print(audiofeatures[0]['danceability'])
# print(row['track_name'])

In [None]:
import time

tic = time.perf_counter()

#Take each song and look up its audio features, then create a dataframe for them
feature_cols = list(audioFeaturesDf.columns)
feature_rows = []
print("Searching...")
for index, row in df.iterrows():
    #read the name from the row itself; the index repeats for every day, so it cannot be used for positional lookups
    trackName = row['track_name']
    trackNameWithoutSpaces = '+'.join(trackName.split())
    trackItem = sp.search(trackNameWithoutSpaces, 1, 0)
    #default every feature to NaN so that songs without a match still get a row
    audiofeatures = dict.fromkeys(feature_cols, np.nan)
    if len(trackItem['tracks']['items']) != 0:
        track_id = trackItem['tracks']['items'][0]['id']
        found = sp.audio_features(track_id)[0]
        if found is not None: #the lookup can come back empty for some tracks
            audiofeatures = found
    feature_rows.append(audiofeatures)
#DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
audioFeaturesDf = pd.DataFrame(feature_rows, columns=feature_cols)
toc = time.perf_counter()
print(f"Searches took {toc - tic:0.4f} seconds")



audioFeaturesDf
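Searching by track name costs one request per row and can match the wrong song. A possible alternative, sketched below, is to take the track id from the `url` column and look up features in batches. This rests on two assumptions: that the chart URLs have the form `https://open.spotify.com/track/<id>`, and that the audio features lookup accepts up to 100 ids per call. The helper names and URLs here are made up for illustration.

```python
from urllib.parse import urlparse

def track_id_from_url(url):
    """Return the last path segment of a track URL, assumed to be the track id."""
    return urlparse(url).path.rstrip("/").split("/")[-1]

def chunked(items, size=100):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder URLs for illustration only
urls = [f"https://open.spotify.com/track/id{n}" for n in range(250)]
# De-duplicate while keeping order, since the same song charts on many days
track_ids = list(dict.fromkeys(track_id_from_url(u) for u in urls))
batches = list(chunked(track_ids, size=100))
print([len(batch) for batch in batches])
```

Each batch could then be passed to `sp.audio_features(batch)`, which would cut the number of requests considerably for a chart where most songs repeat from day to day.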

In [None]:
#Append audio features to master dataframe
feature_names = ['duration_ms', 'acousticness', 'danceability', 'energy', 'instrumentalness',
                 'liveness', 'loudness', 'speechiness', 'valence', 'tempo']
for name in feature_names:
    #assign by position: df is indexed by chart rank, which does not line up with the 0-based index of audioFeaturesDf
    df[name] = pd.to_numeric(audioFeaturesDf[name]).to_numpy()
df
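Column assignment in pandas aligns on index labels rather than on row position, which matters here because the chart dataframe is indexed by rank while the features dataframe is indexed from 0. The toy example below (made-up values) shows the difference between the two kinds of assignment.

```python
import pandas as pd

# Toy frames: `charts` is indexed by rank (1-3), `features` by position (0-2)
charts = pd.DataFrame({'track_name': ['Song A', 'Song B', 'Song C']}, index=[1, 2, 3])
features = pd.DataFrame({'tempo': [100.0, 120.0, 140.0]})

# Label-aligned assignment: rank 1 receives the value stored under label 1
aligned = charts.copy()
aligned['tempo'] = features['tempo']

# Positional assignment: the first row receives the first value
positional = charts.copy()
positional['tempo'] = features['tempo'].to_numpy()

print(aligned)
print(positional)
```

In the label-aligned version every song is shifted by one row and the last song ends up with no tempo at all, so positional assignment is the safer choice when the two frames were built in the same row order.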

In [None]:
#visualization
#plotting all the new metrics in our dataframe vs streams

# Data Visualization

In [None]:
#Histogram of sample means: draw 100 random tracks, average their streams, and repeat this 100 times
#The spread of these averages estimates the standard error of the mean


from scipy.stats import normaltest
from numpy.random import seed
from numpy.random import randn


alpha = 0.05
#data = df['tempo'].sample(n=10).array
data = []
for i in range(0,100):
    data.append(np.mean(df['streams'].sample(n=100)))
print(data)
plt.hist(data)
plt.xlabel("Estimate")
plt.ylabel("Frequency")
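The cell above imports `normaltest` and sets `alpha` without using them. A minimal sketch of how they might be applied to the sample means is shown here, using synthetic right-skewed numbers in place of the real `streams` column (the lognormal parameters are arbitrary assumptions).

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)
# Synthetic, right-skewed stand-in for the streams column (not real chart data)
streams = rng.lognormal(mean=13.5, sigma=0.5, size=3000)

# 100 estimates, each the mean of 100 randomly chosen tracks
sample_means = np.array(
    [rng.choice(streams, size=100, replace=False).mean() for _ in range(100)]
)

alpha = 0.05
stat, p_value = normaltest(sample_means)
# Failing to reject at alpha is consistent with (not proof of) normality
looks_normal = p_value > alpha
print(f"std of sample means: {sample_means.std(ddof=1):.1f}")
print(f"normaltest p-value: {p_value:.3f}")
```

Even when individual stream counts are skewed, the averages of 100 tracks should vary far less than the raw values, which is what the histogram is meant to show.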


In [None]:
from sklearn import linear_model

#Get averages of each col
duration_mean = np.mean(df['duration_ms'])
acousticness_mean = np.mean(df['acousticness'])
danceability_mean = np.mean(df['danceability'])
energy_mean = np.mean(df['energy'])
instrumentalness_mean = np.mean(df['instrumentalness'])
liveness_mean = np.mean(df['liveness'])
loudness_mean = np.mean(df['loudness'])
speechiness_mean = np.mean(df['speechiness'])
valence_mean = np.mean(df['valence'])
tempo_mean = np.mean(df['tempo'])





print(duration_mean)
# line = linear_model.LinearRegression()



# Sketch of a regression of streams on duration (left commented out for now)
# line = linear_model.LinearRegression()
# x = np.array(df['duration_ms']).reshape(-1, 1)  # sklearn expects a 2-D feature array
# y = np.array(df['streams'])
# line.fit(x, y)
# predicted = line.predict(x)
# r_value = line.score(x, y)

# plt.scatter(x, y)
# plt.plot(x, predicted, c='r', label=r_value)
# plt.legend()
# plt.show()
# plt.close()



In [None]:
#TODO: select the rows whose danceability falls within a given range

# df_dance = 
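Selecting rows whose value falls in a range can be done with `Series.between` or with an explicit boolean mask. The sketch below uses a toy dataframe with made-up values, and the 0.6 to 0.75 range is an arbitrary example.

```python
import pandas as pd

# Toy data standing in for the master dataframe
songs = pd.DataFrame({
    'track_name': ['Song A', 'Song B', 'Song C', 'Song D'],
    'danceability': [0.35, 0.62, 0.80, 0.71],
    'streams': [900_000, 1_500_000, 1_200_000, 2_000_000],
})

# between() is inclusive on both ends by default
df_dance = songs[songs['danceability'].between(0.6, 0.75)]

# The same filter written as an explicit boolean mask
df_dance_mask = songs[(songs['danceability'] >= 0.6) & (songs['danceability'] <= 0.75)]

print(df_dance)
```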

In [None]:
plt.scatter(df['duration_ms'], df['streams'])
plt.xlabel('duration in milliseconds')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['acousticness'],df['streams'])
plt.xlabel('acousticness scale of 0-1')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['danceability'],df['streams'])
plt.xlabel('danceability scale of 0-1')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['energy'],df['streams'])
plt.xlabel('energy scale of 0-1')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['instrumentalness'],df['streams'])
plt.xlabel('instrumentalness scale of 0-1')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['liveness'],df['streams'])
plt.xlabel('liveness scale of 0-1')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['loudness'],df['streams'])
plt.xlabel('loudness in decibels (dB)')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['speechiness'],df['streams'])
plt.xlabel('speechiness scale of 0-1')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['valence'],df['streams'])
plt.xlabel('valence scale of 0-1')
plt.ylabel('streams in millions')

In [None]:
plt.scatter(df['tempo'],df['streams'])
plt.xlabel('tempo in beats per minute (BPM)')
plt.ylabel('streams in millions')
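The scatter plots above can be summarized numerically by correlating each audio feature with the stream count. The sketch below does this on synthetic data in which only danceability is built to influence streams; the column names mirror this notebook, but the values are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300
# Synthetic stand-in for the master dataframe
songs = pd.DataFrame({
    'danceability': rng.uniform(0, 1, n),
    'energy': rng.uniform(0, 1, n),
    'tempo': rng.uniform(60, 200, n),
})
songs['streams'] = (
    1_000_000 + 500_000 * songs['danceability'] + rng.normal(0, 100_000, n)
).astype(int)

features = ['danceability', 'energy', 'tempo']
# Pearson correlation of each feature with streams, strongest first
correlations = songs[features].corrwith(songs['streams']).sort_values(key=abs, ascending=False)
print(correlations)
```

Pearson correlation only captures linear relationships, so a value near zero does not rule out a more complicated pattern that a scatter plot might still reveal.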

In [None]:
#violin plot of genre vs streams in millions
# ax = sns.violinplot(x='genre', y='streams', data=df, palette='muted')


In [None]:
plt.scatter(df['date'],df['streams'])
plt.xlabel('date')
plt.ylabel('streams in millions')
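Since the `date` column is stored as text, plotting it directly treats each day as a separate category. A small sketch of converting it to real dates and totaling streams per day follows, using a toy dataframe with made-up values.

```python
import pandas as pd

# Toy rows standing in for the master dataframe
songs = pd.DataFrame({
    'date': ['2017-01-01', '2017-01-01', '2017-01-02', '2017-01-02'],
    'streams': [1_500_000, 1_200_000, 1_600_000, 1_100_000],
})

# Parse the date strings so that plots get a real time axis
songs['date'] = pd.to_datetime(songs['date'], format='%Y-%m-%d')

# Total streams across the chart for each day
daily_totals = songs.groupby('date')['streams'].sum()
print(daily_totals)
```

A line plot of `daily_totals` would then show how overall listening changes over time, which may be easier to read than one point per song per day.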

# Insight