<a href="https://www.kaggle.com/code/akouaorsot/musical-analytics-spotify-2010-2022?scriptVersionId=143739317" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

With the rise of music platforms such as SoundCloud, Deezer, and Spotify, access to extensive music libraries has expanded, and barriers for up-and-coming artists to share their music have come down. As a result, it is now easier to trace which songs and albums attract the most attention and popularity. With this newly available pool of data points, the music and entertainment industry would be well served by leveraging data analytics to tailor its musical offering to this market.

This notebook is our modest attempt to analyze this 2010-2022 Spotify dataset: improve the quality of the data, explore a variety of questions, run some tests, and draw some insights.

The work of [James Chryssanthacopoulos](https://towardsdatascience.com/advanced-music-analytics-using-machine-learning-f344e4795bbc) was also a great inspiration.

# 1. Set-up

In [None]:
## Importing Libraries
import numpy as np  # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re           # Regex Text manipulation
import itertools    # For iterators and efficient looping
from collections import Counter


## Data Visualization
import matplotlib.pyplot as plt     # data visualization
import seaborn as sns               # Other visualization package
sns.set_style("darkgrid")

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

## Categorical Encoding
import category_encoders as ce      # Feature encoding

from sklearn import metrics

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
paths = []
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        paths.append(os.path.join(dirname, filename)) 
#         print(os.path.join(dirname, filename))

In [None]:
# Loading the data
df = pd.read_csv(paths[0])
df.head()

In [None]:
df.info()

#### As noted above, there are 23 columns, of which 13 represent audio features of any given song. To better understand the underlying information captured, we will rely on the following source: [SPOTONTRACK - What do the audio features mean?](https://help.spotontrack.com/article/what-do-the-audio-features-mean)

* **Danceability**: describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.

* **Energy**: represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
 
* **Key**: Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

* **Mode**: indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

* **Loudness**: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

* **Speechiness**: detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

* **Acousticness**: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. A value near 1.0 indicates high confidence that the track is acoustic, and therefore a low likelihood of electrical amplification.

* **Instrumentalness**: predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater the likelihood that the track contains no vocal content.

* **Liveness**: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. 

* **Valence**: describes the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

* **Tempo**: estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

* **Time Signature**: indicates how many counts are in each measure and which type of note will receive one count.
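
To make the encoded columns above easier to read, here is a minimal sketch that translates `key` and `mode` into a readable label and buckets `speechiness` using the 0.33 and 0.66 thresholds quoted above. The helper names `describe_key` and `speechiness_band`, as well as the band labels, are hypothetical and not part of the dataset or of any Spotify API.

```python
# Pitch Class notation: index 0 is C, 1 is C♯/D♭, and so on up to 11 (B).
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def describe_key(key, mode):
    """Return a readable label such as 'D minor' (hypothetical helper)."""
    if key == -1:
        return "No key detected"
    scale = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[int(key)]} {scale}"


def speechiness_band(value):
    """Bucket speechiness with the 0.33 / 0.66 thresholds (labels are our own)."""
    if value > 0.66:
        return "mostly speech"
    if value >= 0.33:
        return "music and speech"
    return "mostly music"


print(describe_key(2, 0))
print(speechiness_band(0.45))
```

On the real data, these helpers could be applied row by row, for example with `df.apply(lambda r: describe_key(r['key'], r['mode']), axis=1)`, assuming the column names match those listed by `df.info()`.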

# 2. Data Cleaning

In [None]:
# Since the following columns do not hold 
# as much pertinent information, we will drop them

df.drop(columns=['playlist_url', 'track_id', 'artist_id'],
       inplace=True)

In [None]:
# Shape of data: No. of rows and columns
no_rows, no_cols = df.shape

# Check for missing values
null_values = df.isnull().sum().sum()

# No. of duplicated rows
duplicates_n = df.duplicated().sum()


# Diagnostics results
data_check = """
- The data has {0} rows and {1} columns
- There are {2} null/na values and {3} duplicated rows. 
""" 

print(data_check.format(no_rows, no_cols,
                        null_values, duplicates_n))

In [None]:
# A closer look shows row 447 has NA/null values,
# mostly in the audio feature columns.
# print() is used because only a cell's last expression is displayed.
print(df[df.isnull().any(axis=1)])

# Comment 
# df[(df['year'] == 2004) & (df['artist_genres'].str.find('pop') > -1)]

# Since it's only 1 observation, we will drop it
df.drop([447], inplace=True)

# 3. Exploratory Data Analysis

In [None]:
# Descriptive Statistics - Quick overview
df.describe()

In [None]:
# Segment audio features
col_names = ['track_popularity', 'artist_popularity', 
             'danceability', 'energy', 'key',
             'loudness', 'speechiness', 
             'acousticness', 'instrumentalness',
             'liveness', 'valence', 'tempo',]
audio_features = df[col_names]

## Multiplot of distributions: one histogram per feature
fig = plt.figure(figsize=(20, 12))
gs = fig.add_gridspec(4, 3, hspace=0.5, wspace=0.5)
axs = gs.subplots()
fig.suptitle("Distribution of All Audio Features")

for i, ax in enumerate(axs.flatten()):
    vals = audio_features.iloc[:, i].values
    title = audio_features.columns[i]
    ax.hist(vals)
    ax.set_title(f'{title}')

plt.tight_layout()
plt.show()

In [None]:
df.columns

### Takeaways: The following correlation matrix gives a quick snapshot of possible relationships between our variables, with the color shading indicating how weak or strong each relationship is on a positive/negative scale. Some noteworthy points:
1. Our target variable 'track_popularity' does not seem strongly correlated with most individual variables, except 'artist_popularity', which has a positive coefficient of 0.33. In any further analysis (such as regression), we would expect a greater weight for this variable than for the others.
2. The 'energy' variable, which denotes how intense the song seems, is positively correlated with 'loudness' at 0.69 and negatively correlated with 'acousticness' at -0.54 (a measure of how acoustic, rather than electrically amplified, a track is).
3. The 'valence' variable, which measures the positiveness of the song, is positively correlated with 'danceability' at 0.41, 'loudness' at 0.39, and 'energy' at 0.31.
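
Reading coefficients off a heatmap can be error-prone, so the sketch below ranks every feature by the absolute value of its correlation with the target. It runs on synthetic data (the values are illustrative and will not reproduce the figures quoted above); on the real data, the same chain could be applied to the notebook's correlation matrix.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numeric columns (values are illustrative only)
rng = np.random.default_rng(0)
n = 200
artist_popularity = rng.uniform(30, 100, n)
toy = pd.DataFrame({
    "artist_popularity": artist_popularity,
    "energy": rng.uniform(0, 1, n),
    "track_popularity": 0.5 * artist_popularity + rng.normal(0, 10, n),
})

# Correlation of every feature with the target, strongest (by magnitude) first
target_corr = (
    toy.corr()["track_popularity"]
    .drop("track_popularity")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strong negative relationships near the top instead of burying them at the bottom of the list.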

In [None]:
# Segment numerical variables
col_nums = ['track_popularity', 'artist_popularity', 
             'danceability', 'energy', 'key',
             'loudness', 'mode', 'speechiness', 
             'acousticness', 'instrumentalness',
             'liveness', 'valence', 'tempo',
            'duration_ms', 'time_signature']
df_nums = df[col_nums]

# Creating the correlation matrix
corr_matrix = df_nums.corr()

# Heatmap for display
fig = plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True)
plt.show()

In [None]:
# Categorical Encoding: Binary variable
binary_encoder = ce.BinaryEncoder(cols=['mode'], return_df=True)
df_nums = binary_encoder.fit_transform(df_nums)

## One-hot-encoding
one_hot_encoder = ce.OneHotEncoder(cols=["time_signature"], use_cat_names=True,
                                   return_df=True)
df_nums = one_hot_encoder.fit_transform(df_nums)
df_nums.head()

In [None]:
# Split the data into training and testing sets
X = df_nums.drop(columns=['track_popularity'])
y = df_nums['track_popularity']
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33, random_state=42)

In [None]:
# Creating a regression model
reg = LinearRegression().fit(X_train, y_train)
mlr_eq = "ŷ = " + str(round(reg.intercept_, 2))

for i, coef in enumerate(reg.coef_):
    mlr_eq += " + {:.2f} X_{}".format(coef, i)

print(mlr_eq)

reg_coef_df = pd.DataFrame(data=reg.coef_, index=X.columns, columns=['Coefficient'])
reg_coef_df.sort_values(by='Coefficient')

In [None]:
# Since the target variable is discrete, we round our predictions
# to whole numbers before calculating the error metrics
y_pred = np.round(reg.predict(X_test), 0)

r2 = metrics.r2_score(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print("r2 Score = {:.2%}".format(r2))
print("Mean Squared Error = {:.2f}".format(mse))
print("Root Mean Squared Error = {:.2f}".format(rmse))
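
Error figures such as these are easier to judge against a reference point. The sketch below, which is an addition and not part of the original analysis, compares a linear regression with a baseline that always predicts the training mean, using scikit-learn's `DummyRegressor`. The data is synthetic and the names `X_toy` and `y_toy` are placeholders, so the printed numbers say nothing about the Spotify dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for X and y (illustrative only)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(300, 3))
y_toy = 60 + 5 * X_toy[:, 0] + rng.normal(scale=8, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42)

models = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "linear regression": LinearRegression(),
}

# Fit each model and record (r2, rmse) on the held-out split
scores = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    scores[name] = (r2_score(y_te, pred), rmse)
    print(f"{name}: r2 = {scores[name][0]:.2%}, RMSE = {rmse:.2f}")
```

If the regression's RMSE is only slightly below the baseline's, the features explain little of the variation in the target.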

# Potential Pathways
* **Music analysis**: Explore patterns in audio features to understand trends and preferences in popular songs.
* **Artist impact**: Analyze how artist involvement and attributes relate to a song's success.
* **Temporal trends**: Identify any shifts in music attri
* Audio features across time, most notable increase/decrease
* If an artist's genres contain certain genre substrings, create a clustering model (K-means or others) for track/artist popularity and other audio features
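
As a rough sketch of the "audio features across time" idea, the snippet below averages each feature per year and then measures the change between the first and last year. The rows are made up for illustration, and the `year` column name is an assumption about the dataset.

```python
import pandas as pd

# Toy rows standing in for the dataset; the 'year' column name is an assumption
toy = pd.DataFrame({
    "year": [2010, 2010, 2016, 2016, 2022, 2022],
    "energy": [0.80, 0.76, 0.70, 0.68, 0.62, 0.60],
    "acousticness": [0.10, 0.14, 0.18, 0.20, 0.26, 0.30],
})

# Average each audio feature per year
yearly_means = toy.groupby("year").mean()

# Change between the first and last year, largest absolute shift first
change = (yearly_means.iloc[-1] - yearly_means.iloc[0]).sort_values(
    key=abs, ascending=False)
print(change)
```

A negative value marks a feature that decreased over the period, a positive value one that increased.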

### Reduction
Given that the 'artist_genres' column is a list of sub-genres, we will proceed as follows to condense it into a single string capturing what we will assume to be the dominant genre.

1. We created a reference list, leveraging the sources below, for a more condensed list of noteworthy genres. It is not exhaustive, but should capture the majority, if not all, of the subgenres in this dataset.
    * [Dam Vibes - 25 Types of Music Genres](https://www.damvibes.com/music-theory/25-types-of-music-genres/)
    * [Gideon Waman - 25 Most Popular Genres of Music](https://strongsounds.com/blog/most-popular-genres-of-music/)
    * [Brian Clark - 35 Types of Music Genres and Styles](https://www.musicianwave.com/top-music-genres/)
2. The initial column is an object. It is a list, but encoded as a string. So, we will clean it up by stripping certain characters, then save everything into a 'sub_genres' list of lists. 
3. From that list, we filter and extract the dominant genre to store in another list
4. There will be an additional column 'main_genre' with that new list.
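
The steps above can be sketched on a single made-up value. This is a simplified illustration under stated assumptions, not the notebook's implementation: it parses the string with `ast.literal_eval` rather than stripping characters, the example string only assumes the general shape of an 'artist_genres' entry, and `candidates` is a tiny stand-in for the full reference list.

```python
import ast

# Example value in the assumed style of an 'artist_genres' entry
raw = "['dance pop', 'pop', 'pop rap']"

# Step 2: turn the string back into a Python list of sub-genres
sub_genres = ast.literal_eval(raw)

# Step 3: count how many sub-genres mention each candidate genre
candidates = ["pop", "rock", "rap"]   # tiny illustrative reference list
counts = {g: sum(g in sub for sub in sub_genres) for g in candidates}

# Step 4: the genre with the most matches becomes the main genre
main_genre = max(counts, key=counts.get) if any(counts.values()) else "Other"
print(sub_genres, counts, main_genre)
```

Counting substring matches means that 'pop rap' contributes to both 'pop' and 'rap', which is why the genre with the highest total is taken as dominant.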

In [None]:
reference_list = [
    'Country', 'R&B', 'Folk', 'Jazz', 
    'Metal', 'Soul','Reggae','Disco',
    'Classical', 'Gospel', 'Grime', 
    'Blues', ['Afro', 'Soca'],
    ['Pop', 'neo mellow', 'singer-songwriter', 
    'Stomp And Holler', 'Idol', 'Boy Band'], 
    ['Rock', 'British Invasion'],
    ['Hip Hop', 'drill', 'rap'],

    ['Electro', 'EDM', 'House','Techno', 
     'Ambient', 'Dubstep', 'Trance', 'New French Touch',
     'Eurodance', 'DNB', 'Drum and Bass', 'UK Garage'],
    ['Latin', 'salsa', 'bachata', 'sertanejo',
     'reggaeton', 'cumbia', 'urbano'],
    ['Alternative', 'Punk', 'Funk', 
     'Emo', 'Alt', 'Indie', 'Grunge'],
    ['Instrumental', 'Schlager', 'Instru'],
    ['Soundtrack', 'movie', 'show', 'hollywood', 'Film'],
    ['Musical', 'Broadway'],

]

In [None]:
def extract_genre(string):
    """ Obtain the main genre from a string of 
    sub_genres through instance counts
    """
    main_genre, genre_count = '', 0

    for genre in reference_list:
        # Grouped entries cover subgenres that do not use the root word
        # of the main genre; single genres are wrapped in a list so that
        # both cases share one code path
        names = genre if isinstance(genre, list) else [genre]

        # Count occurrences afresh for each entry so that totals
        # never carry over from one genre to the next
        instances = sum(string.count(name.lower()) for name in names)

        if instances > genre_count:
            main_genre = names[0]
            genre_count = instances

    if main_genre == '':
        main_genre = 'Other'

    return main_genre

In [None]:
# Strip the list syntax (brackets and quotes) and split into individual sub-genres
df['artist_genres_clean'] = df['artist_genres'].apply(lambda st: re.sub(r"\['|'\]|'|\[\]", "", st).split(", "))

# Join the sub-genres into one string per track, then extract the dominant genre
df['artist_genres_clean'] = df['artist_genres_clean'].apply(" | ".join)
df['main_genre'] = df['artist_genres_clean'].apply(extract_genre)

In [None]:
df['main_genre'].value_counts()

# Machine Learning

In [None]:
## Develop a model and use the 2023 data as test data