# Song Popularity EDA

## 1. Introduction

Welcome to a new and exciting Kaggle community competition!

This competition is about analyzing the popularity of songs based on a set of features.

It is a great opportunity to get a better understanding of the data and to learn how to use it to
make predictions, so it is targeted at beginners.

This is a classification challenge, with the evaluation metric being the [AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic).
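
To make the metric concrete, here is a minimal sketch of what AUC measures: the share of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The function name `pairwise_auc` and the toy labels and scores are made up for illustration; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead of this quadratic loop.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores, invented for illustration (not competition data)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why only the order of the predicted scores matters for this metric, not their absolute values.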

The data consists of the standard Kaggle `train.csv` and `test.csv`, along with `sample_submission.csv`, which shows
how the submission file should be structured.

This notebook is a Python port of the R EDA notebook by Heads Or Tails, presented in [this live stream](https://www.youtube.com/watch?v=JXF-7rCcR1c).

## 2. Preparations

We load a range of libraries, set the path to the input directory and load the data.

In [2]:
from warnings import filterwarnings
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [3]:
filterwarnings('ignore')
plt.style.use('seaborn')
%matplotlib inline

In [4]:
if Path('../kaggle').exists():
    path: str = '../kaggle/input/song-popularity-prediction'
else:
    path: str = '../input/'

In [5]:
train: pd.DataFrame = pd.read_csv(path + '/train.csv', index_col='id')
test: pd.DataFrame = pd.read_csv(path + '/test.csv')
sample_submission: pd.DataFrame = pd.read_csv(path + '/sample_submission.csv')

## 3. Overview: structure and data content

The first thing you want to do is to look at your actual data in its raw form. This will tell you about the types of
features you will be dealing with (numerical, categorical, string, etc.), as well as already reveal some characteristics
 of the dataset. This includes checking for missing values.

Generally, we don't want to look at the test data any more than strictly necessary. The test dataset is intended to
serve as our final model validation, and should only include data that the model has never seen before. Since our
brain is a part of the modelling process as well (or at least it should be), we want to avoid picking up any signal
in the test data that could consciously or unconsciously influence our decision. Thus, this EDA will almost entirely
focus on the `train.csv` data.

### 3.1 A look at the data

In [6]:
train.head()

id,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,audio_mode,speechiness,tempo,time_signature,audio_valence,song_popularity
0,212990.0,0.642286,0.85652,0.707073,0.002001,10.0,,-5.619088,0,0.08257,158.386236,4,0.734642,0
1,,0.054866,0.733289,0.835545,0.000996,8.0,0.436428,-5.236965,1,0.127358,102.752988,3,0.711531,1
2,193213.0,,0.188387,0.783524,-0.002694,5.0,0.170499,-4.951759,0,0.052282,178.685791,3,0.425536,0
3,249893.0,0.48866,0.585234,0.552685,0.000608,0.0,0.094805,-7.893694,0,0.035618,128.71563,3,0.453597,0
4,165969.0,0.493017,,0.740982,0.002033,10.0,0.094891,-2.684095,0,0.050746,121.928157,4,0.741311,0


In [7]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 0 to 39999
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   song_duration_ms  35899 non-null  float64
 1   acousticness      36008 non-null  float64
 2   danceability      35974 non-null  float64
 3   energy            36025 non-null  float64
 4   instrumentalness  36015 non-null  float64
 5   key               35935 non-null  float64
 6   liveness          35914 non-null  float64
 7   loudness          36043 non-null  float64
 8   audio_mode        40000 non-null  int64  
 9   speechiness       40000 non-null  float64
 10  tempo             40000 non-null  float64
 11  time_signature    40000 non-null  int64  
 12  audio_valence     40000 non-null  float64
 13  song_popularity   40000 non-null  int64  
dtypes: float64(11), int64(3)
memory usage: 4.6 MB


We find:
- With 15 columns (14 plus the `id` index) and 40k rows, this is a relatively small dataset. It is small enough to
explore in its entirety, without having to select subsets for reasons of speed.
- There is an `id` column which appears to number the rows sequentially. It can be used directly as the index
of the dataframe.
- There are no string columns (or otherwise complex columns) in the dataset. All of the features can be expressed
numerically. Our target `song_popularity` appears to be binary; probably `audio_mode` as well. The features `key`
and `time_signature` look like categorical or ordinal variables.
- We can immediately see some missing values in the data. This is something we need to keep in mind in future
exploratory and modelling steps.
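
One way to check the impression that some columns are binary or categorical is to count the distinct values per column. The sketch below uses a tiny made-up frame (`toy`) that borrows a few column names from the dataset; the values and the "exactly two distinct values" rule are assumptions for illustration, not results from `train.csv`.

```python
import pandas as pd

# Tiny made-up frame mimicking a few columns; NOT the competition data
toy = pd.DataFrame({
    "key": [10.0, 8.0, 5.0, 0.0, 10.0, None],
    "audio_mode": [0, 1, 0, 0, 0, 1],
    "time_signature": [4, 3, 3, 3, 4, 2],
    "tempo": [158.4, 102.8, 178.7, 128.7, 121.9, 99.2],
    "song_popularity": [0, 1, 0, 0, 0, 1],
})

# Number of distinct non-missing values per column (NaN is ignored by default)
cardinality = toy.nunique()

# Illustrative rule: columns with exactly two distinct values are binary candidates
binary_cols = cardinality[cardinality == 2].index.tolist()

print(cardinality)
print(binary_cols)
```

On the real data, the same call would be `train.nunique()`; columns with a handful of distinct values are candidates for categorical or ordinal treatment.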

Let's look at a larger subset of the data in tabular format:

In [8]:
train.sample(50)

id,song_duration_ms,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,audio_mode,speechiness,tempo,time_signature,audio_valence,song_popularity
28983,,0.023593,0.800073,0.767618,0.001328,11.0,0.139236,-5.565384,0,0.171397,119.834165,4,0.938746,0
9548,204875.0,0.0236,0.362087,0.502013,0.001538,7.0,0.115882,-4.619155,0,0.10722,130.903058,4,0.520312,1
25493,250127.0,0.065226,0.332186,0.733157,0.002163,3.0,0.188708,-10.246128,0,0.082915,99.230396,4,0.707097,0
20480,,,0.479867,0.657406,-0.00043,6.0,0.078104,-3.501137,0,0.052342,123.239636,4,0.632949,0
25273,266133.0,0.008426,0.567235,,-0.0002,5.0,0.104818,-6.140414,1,0.25271,150.768351,3,0.422224,0
36948,256275.0,0.044745,0.635059,0.663622,0.001854,9.0,0.098191,-2.636522,0,0.193922,128.857669,3,0.447307,0
27232,112799.0,0.917544,0.239584,0.158236,0.004231,10.0,0.183332,-19.045237,0,0.072375,92.64558,3,0.374094,1
28101,173910.0,0.565735,0.537464,0.512794,0.001328,6.0,0.109059,,0,0.046291,126.30482,3,0.470938,0
29583,,0.944546,,0.396532,0.003519,2.0,0.171368,-12.986482,1,0.06557,86.943964,2,0.330708,0
25657,,0.178766,0.499098,0.536247,0.115561,1.0,0.073414,-12.669024,0,0.2264,92.814619,3,0.68703,0


We find:
- Using `sample()`, we get random rows of data, which guards against the first few rows being unrepresentative
(for instance, if they have been altered).
- The 50 rows that `sample()` returned confirm most of our impressions.
- There are plenty of missing values (`NaN` in pandas) in several of the columns.
- The different scales of the features are also apparent. Some features have values around 0.5, while others go down
to 1e-6 or up to almost 200, and `song_duration_ms` reaches the hundreds of thousands.
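
The difference in scales can be summarised with a per-column minimum and maximum. The following is a minimal sketch on a small made-up frame (`toy`) built from the first few values shown in the `head()` output above; the `span` column is an illustrative addition.

```python
import pandas as pd

# A few values copied from the head() output above, for illustration only
toy = pd.DataFrame({
    "song_duration_ms": [212990.0, 193213.0, 249893.0],
    "instrumentalness": [0.002001, -0.002694, 0.000608],
    "tempo": [158.386236, 178.685791, 128.71563],
})

# One row per feature, with its smallest and largest value
ranges = toy.agg(["min", "max"]).T

# Width of the observed range, to compare scales at a glance
ranges["span"] = ranges["max"] - ranges["min"]

print(ranges)
```

Applied to the full data, `train.agg(["min", "max"]).T` (or `train.describe()`) would give the same overview, which is useful when deciding whether features need scaling before modelling.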

### 3.2 Missing values

Let's take a closer look at the missing values.

In [9]:
print(
    f"The training set has {train.isnull().sum().sum()} missing values, the test set has {test.isnull().sum().sum()}.")

The training set has 32187 missing values, the test set has 7962.
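
The totals above do not show which columns are affected. A per-column breakdown could look like the sketch below, which uses a small made-up frame (`toy`) rather than the competition data; the column names `n_missing` and `pct_missing` are chosen here for illustration.

```python
import pandas as pd

# Small made-up frame with some gaps; NOT the competition data
toy = pd.DataFrame({
    "energy": [0.7, None, 0.78, 0.55],
    "key": [10.0, 8.0, None, None],
    "tempo": [158.4, 102.8, 178.7, 128.7],
})

# Count and percentage of missing values per column, worst column first
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing)
```

Replacing `toy` with `train` would give the same table for the training set, which complements the `Non-Null Count` column of `train.info()`.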


We can also visualise the missing values to give us an overview of the entire dataset.

In [None]:
# displot is figure-level and creates its own figure, so the size is set
# through height/aspect rather than a separate plt.figure() call
sns.displot(
    data=train.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",  # stack to 100% so each bar shows the missing share
    height=6,
    aspect=3,
    palette='copper'
)
plt.title('Share of missing values per column in train data', weight='bold', size=20, color='brown')
plt.xlabel(" ")
plt.ylabel(" ")
plt.xticks(size=12, weight='bold', color='maroon')
plt.yticks(size=12, weight='bold', color='maroon');

In [None]:
# One column of pixels per row of train, one band per feature; missing cells stand out
sns.heatmap(train.isna().transpose(), xticklabels=False, cbar=True, cmap='viridis')
plt.show()