# Analysis for Spotify Trends

## Introduction

This analysis is based on Spotify data. Our main goal is to understand what makes music likeable and suited to certain situations, and to provide analysis that artists can make use of.

Below, I describe our data and the questions that we will be answering, along with their significance, their relevance, and how someone can benefit from this information.

**The dataset we plan to extract consists of genre and time-period based data and randomly samples 90 songs from a larger dataset (these parameters can be adjusted as well). For each of those 90 tracks we have a lot of information as described a few cells below. Using this, we hope to answer the following questions:**

### Motivation

As a group we are very passionate about music, and some of us play instruments as well. With new stars emerging almost daily, we felt that it would be a good idea to learn what makes music the way it is. Specifically, we wanted answers to the questions listed below.

Spotify is one of the most popular platforms for listening to music, so working with its data gives us a large population that is almost completely representative on an international scale.

### Research Questions

**Descriptive and Data Based Analysis:**
--> @Yesh, please fill this out

**Inference Based Analysis:**
--> @Yesh, please fill this out

**Linear Regression Based Analysis:**
--> @Yesh, please fill this out, in my opinion we should be using danceability variable and try to predict that using linear regression

**Logistic Regression Based Analysis:**
--> I will fill this out when I get to logistic regression, but I will be predicting the mode of the song, i.e. whether it is major or minor

### Data Extraction from Spotify

For this project we used the Spotify API for data extraction. Since we will be conducting inference as well as regression-based analysis, the dataset needed to meet certain requirements:

1) It needs to be a random sample (true randomness does not exist, so strictly speaking it is pseudorandom).
2) The size of the dataset needs to be less than 10% of the total population.
3) The extraction needs to be implemented as a function, so that someone else can repeat this analysis later on a slightly different dataset, with parameters they decide upon.
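
The first two requirements can be checked mechanically. The sketch below is illustrative and is not the extraction code used in this notebook: `draw_sample_indices` is a hypothetical helper, and the population size of 1,000 is an assumption made only for the example. It draws a seeded sample without replacement and rejects any sample that is 10% or more of the population.

```python
import random


def draw_sample_indices(population_size, sample_size, seed=101):
    """Draw a reproducible pseudorandom sample of indices without replacement.

    Hypothetical helper for illustration; not part of the Spotify API.
    """
    # Requirement 2: the sample must be smaller than 10% of the population.
    if sample_size >= 0.10 * population_size:
        raise ValueError("sample must be less than 10% of the population")

    # A local generator keeps the seed from affecting other random calls.
    rng = random.Random(seed)
    return rng.sample(range(population_size), sample_size)


first = draw_sample_indices(population_size=1000, sample_size=90, seed=101)
second = draw_sample_indices(population_size=1000, sample_size=90, seed=101)
print(first == second)  # same seed, same sample
print(len(set(first)))  # number of distinct indices drawn
```

Fixing the seed is what makes the sample repeatable for someone rerunning the analysis; changing it gives a different, equally valid sample.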

*Note: Before going deeper into the dataset and what it contains, I am going to extract the data and clean it.*

#### Setting up the Spotify API

In [1]:
# Imports for data cleaning and other functions
import spotipy
import pandas as pd


In [2]:
# These credentials belong to a dummy account; a real account would need to use the secrets file instead

SPOTIPY_CLIENT_ID = '544d6eb7a5c14986bdd9727cf2b40f9d'
SPOTIPY_CLIENT_SECRET = '631e2374886c4d508cbcc8b83afa3dd3' # This secret can be rotated later to prevent misuse
PORT_NUMBER = 8080

client_credentials_manager = spotipy.SpotifyClientCredentials(client_id=SPOTIPY_CLIENT_ID, client_secret=SPOTIPY_CLIENT_SECRET)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

sp


<spotipy.client.Spotify at 0x7fa628b3e5e0>

#### Getting the Pseudo Random Song Samples with specific parameters

In [3]:
def get_random_songs(number_of_songs=1, genre='Pop', year_range='2011-2022', random_state = 101):
    #Set random seed
    import random
    random.seed(random_state)

    #Generate random offsets (sampling without replacement)
    random_offset = random.sample(range(0, 900), number_of_songs)

    #Generate random search character for query (sampling with replacement)
    chars = 'abcdefghijklmnopqrstuvwxyz'
    random_char = random.choices(chars, k=number_of_songs)

    #Generate random id to select in the output list of 10 (sampling with replacement)
    random_id = random.choices(range(0, 10), k=number_of_songs)

    df_random_songs = pd.DataFrame(columns=['track_id', 'track_name', 'artist_id', 'artist_name', 'album_id', 'album_name', 'release_date'])

    for i in range(0,number_of_songs):
        #Pseudo-random query selection
        results = sp.search(q='genre:' + genre + ' year:' + year_range+' '+random_char[i], type='track', offset=random_offset[i])

        song_list = []
        #Adding track info
        song_list.append(results['tracks']['items'][random_id[i]]['id'])
        song_list.append(results['tracks']['items'][random_id[i]]['name'])

        #Adding artist info (first one listed)
        song_list.append(results['tracks']['items'][random_id[i]]['artists'][0]['id'])
        song_list.append(results['tracks']['items'][random_id[i]]['artists'][0]['name'])

        #Adding album info
        song_list.append(results['tracks']['items'][random_id[i]]['album']['id'])
        song_list.append(results['tracks']['items'][random_id[i]]['album']['name'])
        song_list.append(results['tracks']['items'][random_id[i]]['album']['release_date'])

        df_song = pd.DataFrame([song_list], columns=['track_id', 'track_name', 'artist_id', 'artist_name', 'album_id', 'album_name', 'release_date'])

        df_random_songs=pd.concat([df_random_songs, df_song])

    df_random_songs=df_random_songs.reset_index(drop=True)
    
    return df_random_songs    
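
`get_random_songs` indexes straight into the search results, so a query that returns fewer tracks than the chosen index would raise an `IndexError`. One possible guard is sketched below. The helper names `pick_item` and `flatten_track` are hypothetical, and the dictionary is a hand-made stand-in that only mimics the fields the function reads from the search response.

```python
def pick_item(items, index):
    """Return items[index], or None when the result list is too short."""
    return items[index] if 0 <= index < len(items) else None


def flatten_track(item):
    """Flatten one track entry into the seven columns used in this notebook."""
    first_artist = item["artists"][0]  # only the first listed artist is kept
    album = item["album"]
    return {
        "track_id": item["id"],
        "track_name": item["name"],
        "artist_id": first_artist["id"],
        "artist_name": first_artist["name"],
        "album_id": album["id"],
        "album_name": album["name"],
        "release_date": album["release_date"],
    }


# Hand-made stand-in for results['tracks']['items']; all values are made up.
fake_items = [
    {
        "id": "track-1",
        "name": "Example Song",
        "artists": [{"id": "artist-1", "name": "Example Artist"}],
        "album": {"id": "album-1", "name": "Example Album", "release_date": "2020-01-01"},
    }
]

chosen = pick_item(fake_items, 0)
row = flatten_track(chosen)
missing = pick_item(fake_items, 7)  # index past the end of a short result list
print(row["track_name"], missing)
```

Rows collected this way could be appended to a list and passed to a single `pd.DataFrame(rows)` call, which avoids concatenating inside the loop.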
    

In [4]:
def get_song_info(track_uri):
    #1. Extract the audio features of the track
    song_results = sp.audio_features(track_uri)
    
    #2. Collect the following audio information about the track
    song_list=[track_uri.split(':')[2],
               song_results[0]['danceability'],
               song_results[0]['energy'],
               song_results[0]['key'],
               song_results[0]['loudness'],
               song_results[0]['mode'],
               song_results[0]['speechiness'],
               song_results[0]['acousticness'],
               song_results[0]['instrumentalness'],
               song_results[0]['liveness'],
               song_results[0]['valence'],
               song_results[0]['tempo'],
               song_results[0]['type'],
               song_results[0]['time_signature']]
               
    
    df_song_audio=pd.DataFrame([song_list], columns=['track_id', 'danceability', 'energy', 'key', 'loudness', 'mode',
                                                    'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence',
                                                    'tempo', 'type', 'time_signature'])
    return df_song_audio
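
`get_song_info` makes one request per track. spotipy's `audio_features` also accepts a list of track IDs or URIs (up to 100 per call), so the requests could be batched. The following is a minimal sketch of that idea, not the method used in this notebook: `fetch_audio_features` and `StubClient` are hypothetical names, and the stub stands in for the real client so the example runs offline.

```python
import pandas as pd

AUDIO_COLUMNS = ["id", "danceability", "energy", "key", "loudness", "mode",
                 "speechiness", "acousticness", "instrumentalness",
                 "liveness", "valence", "tempo", "time_signature"]


def fetch_audio_features(client, track_ids, batch_size=100):
    """Collect audio features for many tracks with one request per batch."""
    rows = []
    for start in range(0, len(track_ids), batch_size):
        batch = track_ids[start:start + batch_size]
        for features in client.audio_features(batch):
            if features is None:  # skip tracks that come back without features
                continue
            rows.append({col: features[col] for col in AUDIO_COLUMNS})
    frame = pd.DataFrame(rows, columns=AUDIO_COLUMNS)
    return frame.rename(columns={"id": "track_id"})


class StubClient:
    """Stand-in for the spotipy client; returns made-up values."""

    def __init__(self):
        self.calls = 0

    def audio_features(self, tracks):
        self.calls += 1
        return [{**dict.fromkeys(AUDIO_COLUMNS, 0.5), "id": t} for t in tracks]


stub = StubClient()
features_df = fetch_audio_features(stub, [f"id{i}" for i in range(250)])
print(features_df.shape, stub.calls)
```

With the real client, `sp` would be passed in place of `stub`, and the result could be merged with the track table on `track_id` as in the cell that follows.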
    

In [9]:
# Creating a genre-specific dataset of 90 songs; here we have chosen Pop
df = get_random_songs(number_of_songs=90, genre='Pop', year_range='2011-2022', random_state = 100)

#Creates an empty dataframe with the columns that we want
song_audio_info = pd.DataFrame(columns=['track_id', 'tempo',
                                           'danceability', 'energy', 'key',
                                           'loudness', 'acousticness', 'instrumentalness',
                                           'liveness', 'valence'])


for track_id in df['track_id']:
    #print(track_id)
    song_audio_info = pd.concat([song_audio_info, get_song_info('spotify:track:'+track_id)])
    

df= pd.merge(df,song_audio_info, on=['track_id'])
df.head()

Unnamed: 0,track_id,track_name,artist_id,artist_name,album_id,album_name,release_date,tempo,danceability,energy,key,loudness,acousticness,instrumentalness,liveness,valence,mode,speechiness,type,time_signature
0,41P6Tnd8KIHqON0QIydx6a,the perfect pair,35l9BRT7MXmM8bv2WDQiyB,beabadoobee,2rhNQbqRNxiNQkDXTffe1V,Beatopia,2022-07-15,146.053,0.634,0.663,11,-6.818,0.433,0.124,0.102,0.6,1.0,0.0331,audio_features,4.0
1,08zJpaUQVi9FrKv2e32Bah,Planez,3KV3p5EY4AvKxOlhGHORLg,Jeremih,7DMyQuDPe8xzjC0UDSDa96,Late Nights: The Album,2015-12-04,129.336,0.688,0.556,6,-7.738,0.7,2e-06,0.108,0.416,1.0,0.264,audio_features,4.0
2,0HZhYMZOcUzZKSFwPOti6m,Jar of Hearts,7H55rcKCfwqkyDFH9wpKM6,Christina Perri,3XNK8vPk3O1rjhDZyOMJ6n,lovestrong.,2011-05-10,74.541,0.349,0.348,3,-6.142,0.726,0.0,0.12,0.0886,1.0,0.0316,audio_features,4.0
3,1hL3lpPKYZdLoKAbWXd1ni,Last Friday Night (T.G.I.F.),6jJ0s89eD6GaHleKKya26X,Katy Perry,4SKjR5h4bkN68UlkdSnF6j,New Year's Eve Party 2022,2022-12-01,126.024,0.65,0.817,3,-3.826,0.00124,1.1e-05,0.658,0.728,0.0,0.0441,audio_features,4.0
4,2SwoSqftZfZgpfUGfiEKhB,I'm Good (Blue),1Cs0zKBU1kc0i8ypK3B9ai,David Guetta,6hlI7DWjxUCZ8NiWCyWSv8,House Hits,2022-12-02,128.04,0.561,0.965,7,-3.673,0.00383,7e-06,0.371,0.304,0.0,0.0343,audio_features,4.0


In [10]:
# Going to save this data and push it to the repo, since running the above functions takes a long time
#   ...the code below is only for saving time, so we need not run the API calls multiple times
#   ...also, if the internet connection fails, we still need a dataset that allows for analysis without having to get new data
df.to_csv(r"extracted_uncleaned_data.csv", header=True, index = False)
data = pd.read_csv('extracted_uncleaned_data.csv')
data.head()
## df = data # This line can be used in the above mentioned scenario

Unnamed: 0,track_id,track_name,artist_id,artist_name,album_id,album_name,release_date,tempo,danceability,energy,key,loudness,acousticness,instrumentalness,liveness,valence,mode,speechiness,type,time_signature
0,41P6Tnd8KIHqON0QIydx6a,the perfect pair,35l9BRT7MXmM8bv2WDQiyB,beabadoobee,2rhNQbqRNxiNQkDXTffe1V,Beatopia,2022-07-15,146.053,0.634,0.663,11,-6.818,0.433,0.124,0.102,0.6,1.0,0.0331,audio_features,4.0
1,08zJpaUQVi9FrKv2e32Bah,Planez,3KV3p5EY4AvKxOlhGHORLg,Jeremih,7DMyQuDPe8xzjC0UDSDa96,Late Nights: The Album,2015-12-04,129.336,0.688,0.556,6,-7.738,0.7,2e-06,0.108,0.416,1.0,0.264,audio_features,4.0
2,0HZhYMZOcUzZKSFwPOti6m,Jar of Hearts,7H55rcKCfwqkyDFH9wpKM6,Christina Perri,3XNK8vPk3O1rjhDZyOMJ6n,lovestrong.,2011-05-10,74.541,0.349,0.348,3,-6.142,0.726,0.0,0.12,0.0886,1.0,0.0316,audio_features,4.0
3,1hL3lpPKYZdLoKAbWXd1ni,Last Friday Night (T.G.I.F.),6jJ0s89eD6GaHleKKya26X,Katy Perry,4SKjR5h4bkN68UlkdSnF6j,New Year's Eve Party 2022,2022-12-01,126.024,0.65,0.817,3,-3.826,0.00124,1.1e-05,0.658,0.728,0.0,0.0441,audio_features,4.0
4,2SwoSqftZfZgpfUGfiEKhB,I'm Good (Blue),1Cs0zKBU1kc0i8ypK3B9ai,David Guetta,6hlI7DWjxUCZ8NiWCyWSv8,House Hits,2022-12-02,128.04,0.561,0.965,7,-3.673,0.00383,7e-06,0.371,0.304,0.0,0.0343,audio_features,4.0
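
The save-then-reload step above could be wrapped so that the API is only called when no saved copy exists. This is a sketch under assumptions: `load_or_fetch` is a hypothetical helper, and `fake_fetch` is a small stand-in for the Spotify extraction with made-up values.

```python
import os
import tempfile

import pandas as pd


def load_or_fetch(csv_path, fetch):
    """Read csv_path if it exists; otherwise call fetch() and save its result."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    frame = fetch()
    frame.to_csv(csv_path, header=True, index=False)
    return frame


fetch_calls = []


def fake_fetch():
    # Stand-in for the slow API extraction; values are made up.
    fetch_calls.append(1)
    return pd.DataFrame({"track_id": ["a", "b"], "tempo": [120.0, 98.5]})


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "extracted_uncleaned_data.csv")
    first_load = load_or_fetch(path, fake_fetch)   # fetches and writes the file
    second_load = load_or_fetch(path, fake_fetch)  # reads the saved file instead

print(len(fetch_calls), list(second_load["track_id"]))
```

One design consequence: a stale file is never refreshed automatically, so the CSV has to be deleted whenever a new sample is wanted.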


### Data Exploration

#### What the Data Means:

1) **acousticness**: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

2) **analysis_url**: A URL to access the full audio analysis of this track. An access token is required to access this data.


3) **danceability**: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

4) **duration_ms**: The duration of the track in milliseconds.

5) **energy**: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

6) **id**: The Spotify ID for the track.

7) **instrumentalness**: Predicts whether a track contains no vocals. **Ooh** and **aah** sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly **vocal**. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

8) **key**: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

9) **liveness**: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

10) **loudness**: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

11) **mode**: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

12) **speechiness**: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

13) **tempo**: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

14) **time_signature**: An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7, indicating time signatures of **3/4** to **7/4**.

15) **track_href**: A link to the Web API endpoint providing full details of the track.

16) **type**: The object type, which is always **audio_features**. This is not very informative, but it is provided by the API.

17) **uri**: The Spotify URI for the track.

18) **valence**: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
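
Several of the definitions above give explicit cut-offs. As a sketch, they can be turned into labels. The function names are hypothetical, and the handling of a value that sits exactly on a boundary is an assumption, since the descriptions only say "above", "between", and "below".

```python
def speechiness_label(value):
    """Map a speechiness score to the three bands described above."""
    if value > 0.66:
        return "spoken word"
    if value >= 0.33:
        return "music and speech"
    return "music"


def is_probably_instrumental(instrumentalness):
    """Values above 0.5 are intended to represent instrumental tracks."""
    return instrumentalness > 0.5


def is_probably_live(liveness):
    """A value above 0.8 gives a strong likelihood that the track is live."""
    return liveness > 0.8


# Values taken from the first row of the sample shown earlier.
print(speechiness_label(0.0331), is_probably_instrumental(0.124), is_probably_live(0.102))
```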


*Note: Some of these variables are stored in a data type that is not ideal for analysis, so they need to be recoded during the cleaning step.*

For more information, refer to: https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90 entries, 0 to 89
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track_id          90 non-null     object 
 1   track_name        90 non-null     object 
 2   artist_id         90 non-null     object 
 3   artist_name       90 non-null     object 
 4   album_id          90 non-null     object 
 5   album_name        90 non-null     object 
 6   release_date      90 non-null     object 
 7   tempo             90 non-null     object 
 8   danceability      90 non-null     object 
 9   energy            90 non-null     object 
 10  key               90 non-null     object 
 11  loudness          90 non-null     object 
 12  acousticness      90 non-null     object 
 13  instrumentalness  90 non-null     object 
 14  liveness          90 non-null     object 
 15  valence           90 non-null     object 
 16  mode              90 non-null     float64
 17 

**Note:** As shown above, some of the data types do not match the values they hold and hence need to be recoded, in addition to checking for missing values and basic cleaning.

### Data Cleaning and Pre-Processing

In [24]:
# Dropping this column since it's not helpful for analysis
df = df.drop(columns=['type'])
df.head()

Unnamed: 0,track_id,track_name,artist_id,artist_name,album_id,album_name,release_date,tempo,danceability,energy,key,loudness,acousticness,instrumentalness,liveness,valence,mode,speechiness,time_signature
0,41P6Tnd8KIHqON0QIydx6a,the perfect pair,35l9BRT7MXmM8bv2WDQiyB,beabadoobee,2rhNQbqRNxiNQkDXTffe1V,Beatopia,2022-07-15,146.053,0.634,0.663,11,-6.818,0.433,0.124,0.102,0.6,1.0,0.0331,4.0
1,08zJpaUQVi9FrKv2e32Bah,Planez,3KV3p5EY4AvKxOlhGHORLg,Jeremih,7DMyQuDPe8xzjC0UDSDa96,Late Nights: The Album,2015-12-04,129.336,0.688,0.556,6,-7.738,0.7,2e-06,0.108,0.416,1.0,0.264,4.0
2,0HZhYMZOcUzZKSFwPOti6m,Jar of Hearts,7H55rcKCfwqkyDFH9wpKM6,Christina Perri,3XNK8vPk3O1rjhDZyOMJ6n,lovestrong.,2011-05-10,74.541,0.349,0.348,3,-6.142,0.726,0.0,0.12,0.0886,1.0,0.0316,4.0
3,1hL3lpPKYZdLoKAbWXd1ni,Last Friday Night (T.G.I.F.),6jJ0s89eD6GaHleKKya26X,Katy Perry,4SKjR5h4bkN68UlkdSnF6j,New Year's Eve Party 2022,2022-12-01,126.024,0.65,0.817,3,-3.826,0.00124,1.1e-05,0.658,0.728,0.0,0.0441,4.0
4,2SwoSqftZfZgpfUGfiEKhB,I'm Good (Blue),1Cs0zKBU1kc0i8ypK3B9ai,David Guetta,6hlI7DWjxUCZ8NiWCyWSv8,House Hits,2022-12-02,128.04,0.561,0.965,7,-3.673,0.00383,7e-06,0.371,0.304,0.0,0.0343,4.0


In [25]:
# Checking for obvious missing values and NaN values
df.isna().sum()

track_id            0
track_name          0
artist_id           0
artist_name         0
album_id            0
album_name          0
release_date        0
tempo               0
danceability        0
energy              0
key                 0
loudness            0
acousticness        0
instrumentalness    0
liveness            0
valence             0
mode                0
speechiness         0
time_signature      0
dtype: int64

In [29]:
df.dtypes

track_id             object
track_name           object
artist_id            object
artist_name          object
album_id             object
album_name           object
release_date         object
tempo                object
danceability         object
energy               object
key                  object
loudness             object
acousticness         object
instrumentalness     object
liveness             object
valence              object
mode                float64
speechiness         float64
time_signature      float64
dtype: object

**Recoding/Encoding Style:**

1) tempo: needs to become an int/float

2) danceability: needs to become a float

3) energy: needs to become a float

4) key: needs to become a letter-based object (not a must for analysis, but improves the understandability of the code)

5) loudness: needs to become a float

6) acousticness: needs to become a float

7) instrumentalness: needs to become a float

8) liveness: needs to become a float

9) valence: needs to become a float

10) mode: needs to become a String: "Major" / "Minor"

*Therefore, we have 3 types of encoding to be done:*

**1) Encoding to floating point type**

**2) Encoding mode to String type**

**3) Making key a String type as well**
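
The second and third encodings can be written as simple lookups. This is a sketch, not the notebook's final cleaning code: the names `KEY_NAMES`, `MODE_NAMES`, and `encode_key_and_mode` are assumptions, sharps and flats are written with `#` and `b`, and mapping `-1` (no key detected) to "Unknown" is just one possible choice.

```python
import pandas as pd

# Pitch-class notation: 0 = C, 1 = C#/Db, ... 11 = B; -1 means no key detected.
KEY_NAMES = {-1: "Unknown", 0: "C", 1: "C#/Db", 2: "D", 3: "D#/Eb", 4: "E",
             5: "F", 6: "F#/Gb", 7: "G", 8: "G#/Ab", 9: "A", 10: "A#/Bb", 11: "B"}
MODE_NAMES = {1: "Major", 0: "Minor"}


def encode_key_and_mode(frame):
    """Return a copy with 'key' and 'mode' recoded as readable strings."""
    out = frame.copy()
    # Cast to int first so values stored as objects or floats (e.g. 1.0) match the lookup.
    out["key"] = out["key"].astype(int).map(KEY_NAMES)
    out["mode"] = out["mode"].astype(int).map(MODE_NAMES)
    return out


# Key and mode values from the first three rows of the sample shown earlier.
example = pd.DataFrame({"key": [11, 6, 3], "mode": [1.0, 1.0, 1.0]})
encoded = encode_key_and_mode(example)
print(encoded)
```

Returning a copy leaves the original numeric columns available, which matters if `mode` is later used as a 0/1 response in logistic regression.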





In [30]:
# Checking some basic columns for missing values that were not detected by the above check
# Only checking these columns, since the rest need to be encoded
df['track_id'].unique()
df['track_name'].unique()
df['album_id'].unique()
df['album_name'].unique()
df['release_date'].unique()
print("Manually checked all of the above columns/values and can conclude that there are no missing values")




Manually checked all of the above columns/values and can conclude that there are no missing values


#### Encoding to a Floating Point Type:


In [52]:
# Checking for NaN values here
df['tempo'].unique()
df['danceability'].unique()
df['energy'].unique()
df['loudness'].unique()
df['acousticness'].unique()
df['instrumentalness'].unique()
df['liveness'].unique()
df['valence'].unique()
print("Manually checked all of the above columns/values and can conclude that there are no missing values")

# While these columns show as object type, an individual value behaves as a float when checked. Hence going to convert them explicitly for good measure
# It also helps with rounding some values

Manually checked all of the above columns/values and can conclude that there are no missing values
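
Reading `unique()` output by eye gets harder as the sample grows. A programmatic alternative is sketched below with a hypothetical helper, `count_non_numeric`, and made-up data. `pd.to_numeric(..., errors="coerce")` turns anything that cannot be parsed as a number into `NaN`, so counting the `NaN` values exposes hidden placeholders such as empty strings.

```python
import pandas as pd


def count_non_numeric(frame, columns):
    """Count, per column, the values that cannot be read as numbers."""
    coerced = frame[columns].apply(pd.to_numeric, errors="coerce")
    return coerced.isna().sum()


# Made-up object-dtype data with two hidden problems in 'energy'.
example = pd.DataFrame(
    {"tempo": [146.053, 129.336, 74.541], "energy": [0.663, "", "n/a"]},
    dtype=object,
)
problems = count_non_numeric(example, ["tempo", "energy"])
print(problems)
```

A count of zero for every column would support the manual conclusion above; genuine `NaN` values are counted as well, so the check also covers ordinary missing data.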


In [54]:
# Cast every numeric audio feature from object to float in one step
float_columns = ['tempo', 'danceability', 'energy', 'loudness',
                 'acousticness', 'instrumentalness', 'liveness', 'valence']
df = df.astype({column: float for column in float_columns})
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90 entries, 0 to 89
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track_id          90 non-null     object 
 1   track_name        90 non-null     object 
 2   artist_id         90 non-null     object 
 3   artist_name       90 non-null     object 
 4   album_id          90 non-null     object 
 5   album_name        90 non-null     object 
 6   release_date      90 non-null     object 
 7   tempo             90 non-null     float64
 8   danceability      90 non-null     float64
 9   energy            90 non-null     float64
 10  key               90 non-null     object 
 11  loudness          90 non-null     float64
 12  acousticness      90 non-null     float64
 13  instrumentalness  90 non-null     float64
 14  liveness          90 non-null     float64
 15  valence           90 non-null     float64
 16  mode              90 non-null     float64
 17 

In [55]:
df.head()

Unnamed: 0,track_id,track_name,artist_id,artist_name,album_id,album_name,release_date,tempo,danceability,energy,key,loudness,acousticness,instrumentalness,liveness,valence,mode,speechiness,time_signature
0,41P6Tnd8KIHqON0QIydx6a,the perfect pair,35l9BRT7MXmM8bv2WDQiyB,beabadoobee,2rhNQbqRNxiNQkDXTffe1V,Beatopia,2022-07-15,146.053,0.634,0.663,11,-6.818,0.433,0.124,0.102,0.6,1.0,0.0331,4.0
1,08zJpaUQVi9FrKv2e32Bah,Planez,3KV3p5EY4AvKxOlhGHORLg,Jeremih,7DMyQuDPe8xzjC0UDSDa96,Late Nights: The Album,2015-12-04,129.336,0.688,0.556,6,-7.738,0.7,2e-06,0.108,0.416,1.0,0.264,4.0
2,0HZhYMZOcUzZKSFwPOti6m,Jar of Hearts,7H55rcKCfwqkyDFH9wpKM6,Christina Perri,3XNK8vPk3O1rjhDZyOMJ6n,lovestrong.,2011-05-10,74.541,0.349,0.348,3,-6.142,0.726,0.0,0.12,0.0886,1.0,0.0316,4.0
3,1hL3lpPKYZdLoKAbWXd1ni,Last Friday Night (T.G.I.F.),6jJ0s89eD6GaHleKKya26X,Katy Perry,4SKjR5h4bkN68UlkdSnF6j,New Year's Eve Party 2022,2022-12-01,126.024,0.65,0.817,3,-3.826,0.00124,1.1e-05,0.658,0.728,0.0,0.0441,4.0
4,2SwoSqftZfZgpfUGfiEKhB,I'm Good (Blue),1Cs0zKBU1kc0i8ypK3B9ai,David Guetta,6hlI7DWjxUCZ8NiWCyWSv8,House Hits,2022-12-02,128.04,0.561,0.965,7,-3.673,0.00383,7e-06,0.371,0.304,0.0,0.0343,4.0
