## 🎶 Jukebox AI 🎶

As K.R.A.P. records, we want to develop a predictive machine learning model that can analyze music and market trends to optimize our release strategy. We have two datasets: one contains the musical characteristics of each of the day's top 50 songs in different countries, and the other contains the name, country, artist and publication date of the tracks.

To start, we are going to do Exploratory Data Analysis, or EDA for short. EDA in machine learning involves analyzing datasets to summarize their main characteristics, often using visual methods. It helps in understanding the data's structure, detecting patterns, spotting anomalies, and forming hypotheses before applying machine learning models.

Our goal is to find correlations between song characteristics to understand what makes a song popular. Identifying these relationships will help us determine whether and how strongly different variables are connected. 

![inner-loop-1](../images/inner-loop-1.png)


### 🐠 Install & Import packages

We will need to install and import packages as we develop our notebook. We've created a couple of starter cells for you but you will need to add more as you work through the notebook.

In [None]:
!pip install seaborn
# Install more modules that you need here

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# Import more modules and classes that you need here - REMEMBER TO RERUN THE CELL AFTER MODIFYING!

### 📦 Load Data

Let's get our hands on our data and explore it! We have two separate datasets: the first one contains only the song characteristics, and the other holds each song's popularity and the country where it is popular. We need both for our model, so we will merge the datasets and do the analysis on the combined data.

The datasets are stored in GitHub, so if you don't have them locally, this is one way to load them.

In [None]:
# Load the datasets
song_properties_data = pd.read_parquet('https://github.com/rhoai-mlops/jukebox/raw/refs/heads/main/99-data_prep/song_properties.parquet')
song_properties_data

In [None]:
song_rankings_data = pd.read_parquet('https://github.com/rhoai-mlops/jukebox/raw/refs/heads/main/99-data_prep/song_rankings.parquet')
song_rankings_data

In [None]:
# Merge the datasets
data = pd.merge(song_properties_data.drop(["snapshot_date", "name", "artists"], axis=1), song_rankings_data, on='spotify_id')
data.head()
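
To see what the merge does, here is a minimal sketch on two tiny made-up tables. The `props` and `ranks` names and all values are illustrative assumptions, not taken from the real datasets. `pd.merge` performs an inner join by default, so only `spotify_id` values present in both tables survive. An outer join with `indicator=True` is one way to check which rows would be lost.

```python
import pandas as pd

# Toy stand-ins for the two datasets (made-up values)
props = pd.DataFrame({
    "spotify_id": ["a", "b", "c"],
    "danceability": [0.8, 0.5, 0.6],
})
ranks = pd.DataFrame({
    "spotify_id": ["a", "a", "b", "d"],
    "country": ["SE", "TR", "SE", "TR"],
    "popularity": [90, 85, 70, 60],
})

# Inner join (the default): one row per ranking entry that has matching properties
merged = pd.merge(props, ranks, on="spotify_id")

# Outer join with an indicator column shows which ids exist in only one table
check = pd.merge(props, ranks, on="spotify_id", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]

print(merged)
print(unmatched[["spotify_id", "_merge"]])
```

Running the same outer-join check on the real data before merging would tell you how many songs have rankings but no properties, or the other way around.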

In [None]:
# Transposing the dataframe, making it easier to compare statistics for each column. 
# Instead of scrolling horizontally through a wide table, you can read the statistics vertically.
data.describe().T

In [None]:
# Remove rows from a DataFrame that contain missing values
# Dropping missing values ensures that the dataset used for training or analysis is complete and consistent
data = data.dropna()
data.head()
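
Before dropping rows, it can help to know how much data is about to disappear and which columns are responsible. The following is a small sketch on a made-up frame (the `toy` name and its values are assumptions for illustration). The same two calls, `isna().sum()` and `dropna()`, work on `data`.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps in it
toy = pd.DataFrame({
    "country": ["SE", None, "TR", "SE"],
    "popularity": [90, 85, np.nan, 70],
    "tempo": [120.0, 98.5, 110.0, 101.2],
})

# Count missing values per column
missing_per_column = toy.isna().sum()

# Drop every row that has at least one missing value and report the loss
rows_before = len(toy)
cleaned = toy.dropna()
rows_dropped = rows_before - len(cleaned)

print(missing_per_column)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```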

In [None]:
# the country codes we have data from
data['country'].unique()

In [None]:
# Country codes are strings. Here we give each country a unique number
# so we can treat countries as numbers instead of strings,
# because computers don't like strings
mapping = {c:i for i, c in enumerate(data['country'].unique())}
mapping

In [None]:
pd.set_option('future.no_silent_downcasting', True)
data["country"] = data['country'].replace(
   mapping
).astype(int)
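
As a rough sketch of the same encoding on a tiny made-up series, the snippet below builds the dictionary the same way and compares it with `pd.factorize`, which numbers values in order of first appearance and also returns the lookup table. The `countries` series and the `reverse_mapping` helper are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Made-up country column
countries = pd.Series(["SE", "TR", "SE", "US", "TR"], name="country")

# Same idea as the dictionary above: number countries in order of first appearance
mapping = {c: i for i, c in enumerate(countries.unique())}
encoded_with_map = countries.map(mapping)

# pd.factorize does both steps at once and returns the unique values too
codes, uniques = pd.factorize(countries)

# Reverse lookup: turn a code back into its country string
reverse_mapping = {i: c for c, i in mapping.items()}

print(mapping)
print(list(codes))
print(reverse_mapping[2])
```

Keeping a reverse lookup around makes later outputs easier to read. Keep in mind that these codes are arbitrary labels: country 2 is not "more" than country 1.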

Now that the data is in better shape, let's look for correlations!

We want to find correlations between song characteristics to decide what makes a song popular. Correlation tells us whether, and how strongly, variables are connected.

In [None]:
# Again dropping some strings, because, yes, computers don't like strings 😁

corr_data = data.drop(["spotify_id", "snapshot_date", "album_name", "name", "artists", "album_release_date"], axis=1)

In [None]:
# First let's check "country" to see whether changes in other variables are associated with it

corr = corr_data.corr()['country'].sort_values(ascending = False)
corr = corr.to_frame()
corr.style.background_gradient(cmap="RdYlBu")

As you might have spotted, the correlation values range from **-1 to 1**:  
- **Positive values (closer to 1)** indicate a strong direct relationship (as one variable increases, so does the other).  
- **Negative values (closer to -1)** suggest an inverse relationship (as one increases, the other decreases).  
- **Values near 0** mean little to no correlation.  

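The color gradient uses a **Red-Yellow-Blue (RdYlBu) color scale**. By default it is stretched between the lowest and the highest value in the column, so read the numbers as well as the colors:  
- **Red** marks the lowest values (the most negative correlations).  
- **Blue** marks the highest values (the most positive correlations).  
- **Yellow** marks values in the middle of that range.  

Check the visual again: you can now spot patterns and relationships at a glance! 🔍 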
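
To make the -1 to 1 range concrete, here is a minimal sketch that computes the Pearson correlation by hand and compares it with `DataFrame.corr()`, which uses Pearson by default. The `toy` frame is made up so that one pair of columns rises together perfectly and another moves in exactly opposite directions. The column names and the `pearson` helper are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values: loudness rises with energy, acousticness falls as energy rises
toy = pd.DataFrame({
    "energy":       [0.2, 0.4, 0.6, 0.8, 1.0],
    "loudness":     [-12.0, -10.0, -8.0, -6.0, -4.0],
    "acousticness": [0.9, 0.7, 0.5, 0.3, 0.1],
})

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pos = pearson(toy["energy"], toy["loudness"])      # close to +1
r_neg = pearson(toy["energy"], toy["acousticness"])  # close to -1

print(round(r_pos, 3), round(r_neg, 3))
print(toy.corr().round(3))
```

Real song data is much noisier than this, so expect values well inside the range rather than at its ends.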

Let's also check the `popularity`.

In [None]:
# Now let's check the popularity

corr = corr_data.corr()['popularity'].sort_values(ascending = False)
corr = corr.to_frame()
corr.style.background_gradient(cmap="RdYlBu")

In [3]:
## See what's more correlated? What does the data tell you? 

### Quiz Time 🤓

In [None]:
import sys
import os
sys.path.append(os.path.abspath('../.dontlookhere/'))
from quiz1 import *

In [None]:
quiz_eda()

### A heatmap!

A heatmap is a graphical representation of data where individual values are represented by colors. It is commonly used to visualize the intensity of values in a matrix format, making patterns or relationships between variables easier to interpret.

In data analysis, heatmaps are often used to visualize the correlation matrix of a dataset. The correlation matrix is a table showing the pairwise correlation coefficients between features in the dataset. A heatmap colors these correlation values, providing an intuitive way to understand the relationships between different features.

Create a heatmap below for our dataset and spot relationships between song characteristics at a glance. You can quickly see which features are highly correlated, both positively and negatively.

In [None]:
plt.figure(figsize = (20, 10))
sns.heatmap(corr_data.corr(), annot = True, cmap='RdYlBu')
plt.show()

### A heatmap helps for feature selection

By visualizing the characteristics with the correlation heatmap, we can choose which song characteristics (data features) to retain in our model. Features that correlate strongly with the target variable (country) might be more useful, while features that are highly correlated with each other may be redundant.

Look at the map again and think about the important features we should use for model development!
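
Reading redundant pairs off a large heatmap by eye gets tedious. One possible approach, sketched here on synthetic data, is to list every feature pair whose absolute correlation exceeds a cut-off. The `toy` frame, the random values, and the `threshold` of 0.8 are all assumptions chosen for illustration. On the notebook's data you would start from `corr_data.corr()` instead.

```python
import numpy as np
import pandas as pd

# Synthetic features: loudness is built to track energy, tempo is independent
rng = np.random.default_rng(0)
n = 200
energy = rng.random(n)
toy = pd.DataFrame({
    "energy": energy,
    "loudness": energy * 10 + rng.normal(0, 0.5, n),
    "tempo": rng.random(n) * 100,
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is ignored
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to a Series indexed by (feature_a, feature_b)
pairs = upper.stack()

threshold = 0.8  # assumed cut-off, tune it for your data
redundant_pairs = pairs[pairs > threshold]

print(redundant_pairs)
```

For each pair in the result, keeping just one of the two features is a common starting point. Which one to keep is a judgment call.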

🌍 Let's see what the mean values of the features are for each country. Remember, countries are represented as numbers now. 

In [None]:
corr_data.groupby('country').mean().reset_index()

It seems like we have a country with better taste, or one with a large influence on the popularity list 🙈

If you scroll to the right in the output, you'll see the `popularity` column. A country with higher popularity values tends to have more globally recognized songs in its chart. So that country might have "better taste" because its popular songs align more closely with global trends (higher popularity) :)

Now let's see what the popular songs are for a given period 👇

In [None]:
data['snapshot_date'] = pd.to_datetime(data['snapshot_date'])

start_date = pd.Timestamp(2023, 12, 20)
end_date = pd.Timestamp(2024, 1, 1)
filtered_data = data[(data['snapshot_date'] >= start_date) & (data['snapshot_date'] < end_date)]

# Group by song and calculate the mean popularity for each song
popularity_per_song = filtered_data.groupby('name')['popularity'].mean()

# Sort the songs by popularity in descending order and select the top 10
top_10_songs = popularity_per_song.nlargest(10).reset_index()['name']

print("Top 10 Popular Songs from 20th Dec 2023 to 1st January 2024:")
print(top_10_songs)

Christmas songs! 🎅  not surprising :)
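
If you want to try other periods, the filtering and ranking steps can be wrapped in a small helper. This is only a sketch: the `top_songs` function and the `toy` data are hypothetical, and the window includes `start` but excludes `end`, matching the filter used above.

```python
import pandas as pd

def top_songs(df, start, end, n=10):
    """Return the n songs with the highest mean popularity in the window [start, end)."""
    window = df[(df["snapshot_date"] >= start) & (df["snapshot_date"] < end)]
    return window.groupby("name")["popularity"].mean().nlargest(n)

# Made-up chart entries; two of them fall outside the window
toy = pd.DataFrame({
    "name": ["Song A", "Song B", "Song A", "Song C", "Song B"],
    "popularity": [80, 95, 90, 99, 85],
    "snapshot_date": pd.to_datetime(
        ["2023-12-20", "2023-12-25", "2023-12-31", "2024-01-01", "2023-12-19"]
    ),
})

result = top_songs(toy, pd.Timestamp(2023, 12, 20), pd.Timestamp(2024, 1, 1), n=2)
print(result)
```

Note that grouping by `name` merges different songs that share a title. Grouping by `spotify_id` would avoid that at the cost of less readable output.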

### Quiz Time 🤓

In [None]:
quiz_heatmap()

🦄 Now that we understand the data and have identified the key characteristics to focus on for determining 'which country would like my song more,' we can dive into Data Science! Exciting, isn’t it?

Let's move to the second folder.

From the left menu, click `🗂️/jukebox` to go one folder up, then open the `2-dev_datascience` folder. Open the first notebook, [1-experiment_train.ipynb](../2-dev_datascience/1-experiment_train.ipynb), and continue from there :)