# Analysis for Spotify Trends

## Introduction

The dataset we will use for this analysis is based on Spotify data. Our main motivation is to understand what makes music likeable and apt in certain situations, and to provide analysis that artists can put to use.

Below, I describe our data and the questions we will answer, along with their significance and relevance and how someone can benefit from this information.

## Data Extraction from Spotify

For this project we used the Spotify API for data extraction. Since we will be conducting inference as well as regression-based analysis, the dataset needed to satisfy certain requirements:

1) It needs to be a random sample (a software-generated sample is never truly random, so strictly speaking it is pseudorandom).
2) The sample size needs to be less than 10% of the total population.
3) The extraction needs to be implemented as a function, so that someone else can repeat the same analysis later on a slightly different dataset, with parameters they choose.
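
The first two requirements can be checked mechanically. The following is a minimal sketch and not part of the extraction itself: `draw_offsets` and `satisfies_ten_percent_condition` are hypothetical helper names, and the population size of 900 is only an illustrative stand-in, since the true number of matching tracks on Spotify is not known here.

```python
import random


def draw_offsets(sample_size, population_size, random_state):
    """Draw a reproducible pseudorandom sample of offsets without replacement."""
    rng = random.Random(random_state)  # local generator, leaves the global seed untouched
    return rng.sample(range(population_size), sample_size)


def satisfies_ten_percent_condition(sample_size, population_size):
    """Return True when the sample is strictly smaller than 10% of the population."""
    # Integer arithmetic avoids floating-point surprises at the boundary
    return sample_size * 10 < population_size


# Same seed -> same sample, which is what makes the draw pseudorandom but repeatable
first = draw_offsets(sample_size=5, population_size=900, random_state=101)
second = draw_offsets(sample_size=5, population_size=900, random_state=101)
print(first == second)

# Illustrative numbers only: a hypothetical population of 900
print(satisfies_ten_percent_condition(80, 900))
print(satisfies_ten_percent_condition(90, 900))
```

Using a local `random.Random` instance is a design choice for the sketch: it keeps the draw repeatable without changing the state of the module-level generator.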

*Note: Before diving deeper into the dataset and what it contains, I am going to extract the data and clean it.*

### Getting the Spotify API set up

In [1]:
# Imports for the Spotify API client and for data cleaning
import spotipy
import pandas as pd


In [2]:
# These are credentials for a dummy account; a real account would load them from a secrets file instead

SPOTIPY_CLIENT_ID = '544d6eb7a5c14986bdd9727cf2b40f9d'
SPOTIPY_CLIENT_SECRET = '631e2374886c4d508cbcc8b83afa3dd3' # This secret can be rotated later to prevent misuse
PORT_NUMBER = 8080

client_credentials_manager = spotipy.SpotifyClientCredentials(client_id=SPOTIPY_CLIENT_ID, client_secret=SPOTIPY_CLIENT_SECRET)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

sp


<spotipy.client.Spotify at 0x7fa628b3e5e0>
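
The comment in the cell above mentions that a real account would keep its credentials in a secrets file. One common alternative is to read them from environment variables. The sketch below is an illustration under that assumption: `load_spotify_credentials` is a hypothetical helper, the variable names simply reuse the constant names from the cell above, and the values are placeholders.

```python
import os


def load_spotify_credentials(env=None):
    """Return (client_id, client_secret) read from environment-style variables."""
    env = os.environ if env is None else env
    names = ('SPOTIPY_CLIENT_ID', 'SPOTIPY_CLIENT_SECRET')
    missing = [name for name in names if not env.get(name)]
    if missing:
        # Fail early with a clear message instead of sending empty credentials
        raise RuntimeError('Missing credentials: ' + ', '.join(missing))
    return env['SPOTIPY_CLIENT_ID'], env['SPOTIPY_CLIENT_SECRET']


# Demonstration with placeholder values; in practice the variables would be set outside the notebook
client_id, client_secret = load_spotify_credentials({
    'SPOTIPY_CLIENT_ID': 'placeholder-id',
    'SPOTIPY_CLIENT_SECRET': 'placeholder-secret',
})
print(client_id)
```

The returned pair could then be passed to `spotipy.SpotifyClientCredentials(client_id=..., client_secret=...)` exactly as in the cell above, so nothing secret needs to be committed with the notebook.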

### Getting the Pseudo Random Song Samples with specific parameters

In [3]:
def get_random_songs(number_of_songs=1, genre='Pop', year_range='2011-2022', random_state=101):
    import random

    # Seed the generator so the same arguments always produce the same sample
    random.seed(random_state)

    # Random offsets into the search results (sampling without replacement)
    random_offset = random.sample(range(0, 900), number_of_songs)

    # Random search character for each query (sampling with replacement)
    chars = 'abcdefghijklmnopqrstuvwxyz'
    random_char = random.choices(chars, k=number_of_songs)

    # Random position within each returned page of 10 results (sampling with replacement)
    random_id = random.choices(range(0, 10), k=number_of_songs)

    columns = ['track_id', 'track_name', 'artist_id', 'artist_name', 'album_id', 'album_name', 'release_date']
    rows = []

    for i in range(number_of_songs):
        # Pseudo-random query selection
        results = sp.search(q='genre:' + genre + ' year:' + year_range + ' ' + random_char[i],
                            type='track', offset=random_offset[i])
        track = results['tracks']['items'][random_id[i]]

        rows.append([
            track['id'], track['name'],                              # track info
            track['artists'][0]['id'], track['artists'][0]['name'],  # artist info (first one listed)
            track['album']['id'], track['album']['name'],            # album info
            track['album']['release_date'],
        ])

    # Build the DataFrame once instead of concatenating one row per iteration
    return pd.DataFrame(rows, columns=columns)
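
`get_random_songs` reads position `random_id[i]` from each page of results, which assumes every query returns a full page of 10 tracks. A minimal sketch of a guard for shorter pages follows. `build_search_query` and `pick_track` are hypothetical helpers, and the `results` dictionary is a hand-made stand-in for the shape of a search response, not real API output.

```python
def build_search_query(genre, year_range, search_char):
    """Assemble the same query string that get_random_songs sends to the search call."""
    return 'genre:' + genre + ' year:' + year_range + ' ' + search_char


def pick_track(results, position):
    """Return the track at `position`, or None when the page has fewer items than that."""
    items = results['tracks']['items']
    if position >= len(items):
        return None
    return items[position]


# Hand-made stand-in for a search response that came back with only two tracks
results = {'tracks': {'items': [{'id': 'track-a'}, {'id': 'track-b'}]}}

print(build_search_query('Pop', '2011-2022', 'q'))
print(pick_track(results, 1))
print(pick_track(results, 7))
```

A caller could skip the iteration, or redraw the position, whenever `pick_track` returns `None`. Which of the two is appropriate depends on whether the final sample must contain exactly `number_of_songs` rows.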
    

In [4]:
def get_song_info(track_uri):
    # Audio fields to keep, in output column order
    feature_names = ['danceability', 'energy', 'key', 'loudness', 'mode',
                     'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence',
                     'tempo', 'type', 'time_signature']

    #1. Extract the audio features of the track (the result is a list with one entry per track)
    features = sp.audio_features(track_uri)[0]

    #2. Collect the audio information, keyed by the bare track ID (last part of 'spotify:track:<id>')
    song_list = [track_uri.split(':')[2]] + [features[name] for name in feature_names]

    df_song_audio = pd.DataFrame([song_list], columns=['track_id'] + feature_names)
    return df_song_audio
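
`get_song_info` issues one request per track. As far as I know, the audio-features endpoint accepts several track IDs per request (a cap of 100 is assumed here, not verified in this notebook), so the IDs could be grouped first. The sketch below only shows the grouping step; `chunked` is a hypothetical helper and the IDs are made up.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up IDs standing in for df['track_id']
track_ids = ['id%03d' % n for n in range(250)]

# ASSUMPTION: at most 100 IDs per audio-features request
batches = list(chunked(track_ids, 100))
print([len(batch) for batch in batches])
```

Each batch could then be passed to `sp.audio_features` in place of a single URI, which would reduce 90 requests to one for a sample of this size.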
    

In [9]:
#Creating a genre-specific dataset of 90 songs; here we have chosen Pop
df = get_random_songs(number_of_songs=90, genre='Pop', year_range='2011-2022', random_state = 100)

#Creates an empty dataframe with the columns that we want
song_audio_info = pd.DataFrame(columns=['track_id', 'tempo',
                                           'danceability', 'energy', 'key',
                                           'loudness', 'acousticness', 'instrumentalness',
                                           'liveness', 'valence'])


for track_id in df['track_id']:
    #print(track_id)
    song_audio_info = pd.concat([song_audio_info, get_song_info('spotify:track:'+track_id)])
    

df= pd.merge(df,song_audio_info, on=['track_id'])
df.head()

| | track_id | track_name | artist_id | artist_name | album_id | album_name | release_date | tempo | danceability | energy | key | loudness | acousticness | instrumentalness | liveness | valence | mode | speechiness | type | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41P6Tnd8KIHqON0QIydx6a | the perfect pair | 35l9BRT7MXmM8bv2WDQiyB | beabadoobee | 2rhNQbqRNxiNQkDXTffe1V | Beatopia | 2022-07-15 | 146.053 | 0.634 | 0.663 | 11 | -6.818 | 0.433 | 0.124 | 0.102 | 0.6 | 1.0 | 0.0331 | audio_features | 4.0 |
| 1 | 08zJpaUQVi9FrKv2e32Bah | Planez | 3KV3p5EY4AvKxOlhGHORLg | Jeremih | 7DMyQuDPe8xzjC0UDSDa96 | Late Nights: The Album | 2015-12-04 | 129.336 | 0.688 | 0.556 | 6 | -7.738 | 0.7 | 2e-06 | 0.108 | 0.416 | 1.0 | 0.264 | audio_features | 4.0 |
| 2 | 0HZhYMZOcUzZKSFwPOti6m | Jar of Hearts | 7H55rcKCfwqkyDFH9wpKM6 | Christina Perri | 3XNK8vPk3O1rjhDZyOMJ6n | lovestrong. | 2011-05-10 | 74.541 | 0.349 | 0.348 | 3 | -6.142 | 0.726 | 0.0 | 0.12 | 0.0886 | 1.0 | 0.0316 | audio_features | 4.0 |
| 3 | 1hL3lpPKYZdLoKAbWXd1ni | Last Friday Night (T.G.I.F.) | 6jJ0s89eD6GaHleKKya26X | Katy Perry | 4SKjR5h4bkN68UlkdSnF6j | New Year's Eve Party 2022 | 2022-12-01 | 126.024 | 0.65 | 0.817 | 3 | -3.826 | 0.00124 | 1.1e-05 | 0.658 | 0.728 | 0.0 | 0.0441 | audio_features | 4.0 |
| 4 | 2SwoSqftZfZgpfUGfiEKhB | I'm Good (Blue) | 1Cs0zKBU1kc0i8ypK3B9ai | David Guetta | 6hlI7DWjxUCZ8NiWCyWSv8 | House Hits | 2022-12-02 | 128.04 | 0.561 | 0.965 | 7 | -3.673 | 0.00383 | 7e-06 | 0.371 | 0.304 | 0.0 | 0.0343 | audio_features | 4.0 |
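
Because search characters and result positions are sampled with replacement, the same track can in principle be drawn more than once. If that happens, `pd.merge` on `track_id` pairs every copy on the left with every copy on the right, and the row count grows. The toy frames below use made-up IDs and values, not rows from the dataset. They illustrate the effect and one possible remedy, which is to drop duplicate IDs from the audio table before merging.

```python
import pandas as pd

# Toy data: track 'a' was sampled twice, so it appears twice in both frames
songs = pd.DataFrame({'track_id': ['a', 'a', 'b'], 'track_name': ['A', 'A', 'B']})
audio = pd.DataFrame({'track_id': ['a', 'a', 'b'], 'tempo': [120.0, 120.0, 95.0]})

# Naive merge: the two copies of 'a' on each side produce 2 x 2 = 4 rows, plus 1 for 'b'
naive = pd.merge(songs, audio, on=['track_id'])

# De-duplicating the right-hand frame keeps one audio row per track
deduplicated = pd.merge(songs, audio.drop_duplicates(subset='track_id'), on=['track_id'])

print(len(naive), len(deduplicated))
```

Whether the song table itself should also be de-duplicated is a separate decision, since it changes the sample size used in the later analysis.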


In [10]:
# Save this data and push it to the repo, since running the extraction above takes a long time
#   ...the code below is only for time-saving purposes, so we need not run the API calls multiple times
#   ...and if the internet connection fails, we still have a dataset that allows for analysis without fetching new data
df.to_csv(r"extracted_uncleaned_data.csv", header=True, index = False)
data = pd.read_csv('extracted_uncleaned_data.csv')
data.head()
## df = data # Uncomment this line in the scenarios mentioned above

| | track_id | track_name | artist_id | artist_name | album_id | album_name | release_date | tempo | danceability | energy | key | loudness | acousticness | instrumentalness | liveness | valence | mode | speechiness | type | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41P6Tnd8KIHqON0QIydx6a | the perfect pair | 35l9BRT7MXmM8bv2WDQiyB | beabadoobee | 2rhNQbqRNxiNQkDXTffe1V | Beatopia | 2022-07-15 | 146.053 | 0.634 | 0.663 | 11 | -6.818 | 0.433 | 0.124 | 0.102 | 0.6 | 1.0 | 0.0331 | audio_features | 4.0 |
| 1 | 08zJpaUQVi9FrKv2e32Bah | Planez | 3KV3p5EY4AvKxOlhGHORLg | Jeremih | 7DMyQuDPe8xzjC0UDSDa96 | Late Nights: The Album | 2015-12-04 | 129.336 | 0.688 | 0.556 | 6 | -7.738 | 0.7 | 2e-06 | 0.108 | 0.416 | 1.0 | 0.264 | audio_features | 4.0 |
| 2 | 0HZhYMZOcUzZKSFwPOti6m | Jar of Hearts | 7H55rcKCfwqkyDFH9wpKM6 | Christina Perri | 3XNK8vPk3O1rjhDZyOMJ6n | lovestrong. | 2011-05-10 | 74.541 | 0.349 | 0.348 | 3 | -6.142 | 0.726 | 0.0 | 0.12 | 0.0886 | 1.0 | 0.0316 | audio_features | 4.0 |
| 3 | 1hL3lpPKYZdLoKAbWXd1ni | Last Friday Night (T.G.I.F.) | 6jJ0s89eD6GaHleKKya26X | Katy Perry | 4SKjR5h4bkN68UlkdSnF6j | New Year's Eve Party 2022 | 2022-12-01 | 126.024 | 0.65 | 0.817 | 3 | -3.826 | 0.00124 | 1.1e-05 | 0.658 | 0.728 | 0.0 | 0.0441 | audio_features | 4.0 |
| 4 | 2SwoSqftZfZgpfUGfiEKhB | I'm Good (Blue) | 1Cs0zKBU1kc0i8ypK3B9ai | David Guetta | 6hlI7DWjxUCZ8NiWCyWSv8 | House Hits | 2022-12-02 | 128.04 | 0.561 | 0.965 | 7 | -3.673 | 0.00383 | 7e-06 | 0.371 | 0.304 | 0.0 | 0.0343 | audio_features | 4.0 |
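
The save-then-reload step above can also be written as a small cache, so the slow extraction runs only when no saved file exists. This is a minimal sketch under that assumption: `load_or_build` is a hypothetical helper, and `fake_extraction` stands in for the real API calls so the example runs offline with made-up rows.

```python
import os
import tempfile

import pandas as pd


def load_or_build(csv_path, build):
    """Read the cached CSV if it exists; otherwise call `build()` and cache its result."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    frame = build()
    frame.to_csv(csv_path, header=True, index=False)
    return frame


calls = []


def fake_extraction():
    """Stand-in for the slow API extraction; records each call."""
    calls.append(1)
    return pd.DataFrame({'track_id': ['a', 'b'], 'tempo': [120.0, 95.0]})


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'extracted_uncleaned_data.csv')
    first = load_or_build(path, fake_extraction)   # builds the frame and writes the file
    second = load_or_build(path, fake_extraction)  # reads the cached file instead

print(len(calls))
```

In the notebook, `build` could be a function that wraps the `get_random_songs` and `get_song_info` steps, which would make the commented-out `df = data` line unnecessary.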
