## Imports

In [128]:
import pandas as pd
import numpy as np

# Graphic libraries
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

# Clustering
from sklearn_extra.cluster import KMedoids


## Loading data, quick analysis.

I need to get a better understanding of what each field in the given dataset represents.

Based on a Google search and the quick analysis I do below, the meanings I can give to these columns are:

* 'song_name', 'artist', 'album', 'duration_ms', 'artist_genres', 'artist_popularity', 'artist_followers', and 'release_year' = The names of these columns are self-explanatory.
* 'danceability' = How danceable a song is; the bigger the value, the more danceable the song.
* 'energy' = In music, energy is usually related to sound power. High-pitched sounds have high frequencies and thus represent high-energy states. A bigger value indicates a more energetic song.
* 'loudness' = The intensity of a sound wave, measured in decibels (dB). The bigger the value, the louder the song. In this dataset all values are negative, so values closer to 0 are louder.
* 'mode' = A mode is most commonly described as a type of musical scale coupled with a set of characteristic melodic and harmonic behaviors. Here it distinguishes major and minor keys: major is represented by 1 and minor by 0. (<a href='https://en.wikipedia.org/wiki/Mode_(music)'>Link about mode</a>)
* 'key' = This field is related to the previous one. It is a categorical field in which each value represents a pitch class. In set theory an integer notation is used, which assigns a number between 0 and 11 to each pitch class: 0 = C, 1 = C#, 2 = D, and so on. (<a href="https://open.library.okstate.edu/musictheory/chapter/pitch-and-pitch-class/#:~:text=Pitch%20classes%20are%20given%20an,tone%20with%20an%20individual%20frequency.&text=A%20system%20of%20naming%20pitch,%2C%20D%20as%202%2C%20etc.">Link about pitch</a>)
* 'speechiness' = Detects the presence of spoken words in a track. A value near 0 means the track most likely contains little or no spoken word.
* 'acousticness' = How acoustic a song is. A score of 1.0 means the song is most likely acoustic.
* 'instrumentalness' = How likely the track is to contain no vocals. The closer to 1.0, the more instrumental the song is.
* 'liveness' = Presence of a live audience in the recording; a bigger value might indicate that the song was recorded live.
* 'valence' = Describes whether the song is likely to make someone feel happy or sad; higher values might be associated with more happiness.
* 'tempo' = The speed or pace of a song, measured in beats per minute (BPM); the bigger the value, the faster the song.

* 'mode' and 'key' = I'm less sure about my interpretation of these two fields, so I'll try to confirm it in EDA.
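
To make the 'key' and 'mode' encodings easier to read, here is a minimal sketch of a lookup that turns the two integers into a label. It assumes the pitch-class notation described above (0 = C up to 11 = B) and mode 1 = major, 0 = minor; the names `PITCH_CLASSES`, `MODES`, and `describe_key` are hypothetical helpers, not part of the dataset or of any library.

```python
# Pitch classes in integer notation: index 0 is C, index 11 is B.
PITCH_CLASSES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

# Assumed encoding of the 'mode' column: 1 = major, 0 = minor.
MODES = {0: 'minor', 1: 'major'}


def describe_key(key, mode):
    """Return a readable label such as 'C major' for a (key, mode) pair."""
    return f"{PITCH_CLASSES[key]} {MODES[mode]}"


print(describe_key(0, 1))
print(describe_key(9, 0))
```

Something like `data.apply(lambda row: describe_key(row['key'], row['mode']), axis=1)` could then be used to label each song, if that turns out to be useful in EDA.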


In [129]:
data = pd.read_csv('./data/dataset_desafio.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 291 entries, 0 to 290
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   song_name          291 non-null    object 
 1   artist             291 non-null    object 
 2   album              291 non-null    object 
 3   danceability       291 non-null    float64
 4   energy             291 non-null    float64
 5   key                291 non-null    int64  
 6   loudness           291 non-null    float64
 7   mode               291 non-null    int64  
 8   speechiness        291 non-null    float64
 9   acousticness       291 non-null    float64
 10  instrumentalness   291 non-null    float64
 11  liveness           291 non-null    float64
 12  valence            291 non-null    float64
 13  tempo              291 non-null    float64
 14  duration_ms        291 non-null    int64  
 15  song_popularity    291 non-null    int64  
 16  artist_genres      291 non

In [130]:
# Let's pull out some statistics about the data. I'll select the fields that I consider most informative.
# First of all, I'm going to separate categorical and numerical columns in order to facilitate data exploration.
mask_num_cols = (data.dtypes != 'object')
data_num_cols = data[data.dtypes.index[mask_num_cols]]
data_cat_cols = data[data.dtypes.index[~mask_num_cols]]

In [131]:
data_num_cols.describe()

statistic,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,song_popularity,artist_popularity,artist_followers,release_year
count,291.0,291.0,291.0,291.0,291.0,291.0,291.0,291.0,291.0,291.0,291.0,291.0,291.0,291.0,291.0,291.0
mean,0.577519,0.664526,5.402062,-8.009598,0.487973,0.064844,0.320144,0.074815,0.215163,0.594482,120.306581,226859.237113,44.075601,54.838488,2741029.0,2002.19244
std,0.169332,0.219575,3.628412,3.394579,0.500716,0.052145,0.313434,0.217098,0.201732,0.253076,28.324863,70661.662009,25.216876,19.259517,6953692.0,14.362214
min,0.103,0.157,0.0,-18.752,0.0,0.025,4e-06,0.0,0.0265,0.0382,44.37,86893.0,0.0,0.0,13.0,1961.0
25%,0.4515,0.4905,2.0,-10.2245,0.0,0.0351,0.0334,0.0,0.09105,0.399,98.5585,180967.0,26.0,45.0,42388.5,1991.0
50%,0.587,0.701,6.0,-7.178,0.0,0.0453,0.202,2.9e-05,0.135,0.617,120.099,217560.0,50.0,59.0,468567.0,2004.0
75%,0.7105,0.8535,9.0,-5.578,1.0,0.06765,0.5775,0.00263,0.26,0.822,136.165,266126.5,63.0,67.0,1868729.0,2014.5
max,0.949,0.986,11.0,-1.657,1.0,0.435,0.984,0.942,0.97,0.973,193.66,622000.0,98.0,91.0,60574310.0,2022.0


We can see that most of the audio features lie between 0 and 1, but some variables are on a very different scale: 'duration_ms' and 'artist_followers' in particular, and to a lesser extent 'tempo', 'release_year', and 'loudness'.

Since I'm planning on implementing K-Means, and this algorithm is distance-based, this difference in magnitude can create problems. So later on I'm planning to bring all the variables to the same scale by standardizing the data.

Another thing we can observe here is that we are probably going to have outliers, given the gap between the mean, the standard deviation, and the max value of some variables (for example 'duration_ms' and 'artist_followers'). So I'll also investigate later on whether we need to remove them or can ignore them.
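
As a rough illustration of the standardization step mentioned above, the sketch below applies a z-score to a small toy matrix. The values are made up for illustration (one column on a 0 to 1 scale, one on a 'duration_ms'-like scale) and are not rows from the dataset. As far as I know, scikit-learn's `StandardScaler` applies the same transformation with its default settings.

```python
import numpy as np

# Toy feature matrix (illustrative values, not rows from the dataset):
# column 0 mimics a 0-1 feature, column 1 mimics a duration in milliseconds.
X = np.array([
    [0.45, 180_000.0],
    [0.59, 217_000.0],
    [0.71, 266_000.0],
    [0.95, 622_000.0],
])

# Z-score standardization: subtract each column's mean and divide by its
# population standard deviation (ddof=0), column by column.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # each column is now centered near 0
print(X_scaled.std(axis=0))   # and has unit standard deviation
```

After this step both columns contribute on a comparable scale to any Euclidean distance, which is what a distance-based algorithm needs.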

## Null and duplicate presence analysis.

In [132]:
# Looking at a sample of the dataset, I could see that for many rows the column 'artist_genres' is an empty list.
len(data[data['artist_genres'] == "['[]']"])

41

In [133]:
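# I will proceed to replace this empty list with NaN values.
# Assigning the result back avoids the chained in-place replace, and np.nan is the spelling supported by all NumPy versions.
data['artist_genres'] = data['artist_genres'].replace("['[]']", np.nan)
data.isnull().sum()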

song_name             0
artist                0
album                 0
danceability          0
energy                0
key                   0
loudness              0
mode                  0
speechiness           0
acousticness          0
instrumentalness      0
liveness              0
valence               0
tempo                 0
duration_ms           0
song_popularity       0
artist_genres        41
artist_popularity     0
artist_followers      0
release_year          0
dtype: int64

In [134]:
data.duplicated().sum()

0

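We can observe that there are no duplicates in the given dataset, and that the only missing values are the 41 empty 'artist_genres' entries converted to NaN above. None of the numerical columns has nulls, so no imputation is needed.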

## Outliers

In [158]:
# Checking outliers on numerical cols.
for i in data_num_cols.columns:
    fig = px.box(data, x=i)
    fig.update_layout(height=250, width=750)
    fig.show()
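
To complement the visual check, here is a minimal sketch that counts outliers numerically. It assumes the common 1.5 x IQR convention for box-plot whiskers; `iqr_outliers` is a hypothetical helper and the sample values are illustrative, not taken from the dataset.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Toy sample with one obviously extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers = iqr_outliers(sample)
print(outliers)
```

Applied per column (for example `iqr_outliers(data['duration_ms'])`), this gives a count to go with each box plot.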

Okay, after inspecting the data and seeing the outliers, I've concluded that I won't be removing them.

Having seen the outlier values, and with a better understanding of the values each field can take, I can say that these outliers are valid data and representative of the sample. I strongly believe that they are legitimate values that represent important scenarios; removing them could distort the reality of the data, harm the analysis, and lead to a loss of important information.

Conclusions I draw from this:
* The K-Means algorithm is sensitive to outliers (because it seeks to minimize the sum of squared distances). So it could be a good starting point but not a reliable one; I will consider using other clustering algorithms that are more robust to the presence of outliers (DBSCAN or median-based algorithms).
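
The sensitivity mentioned above can be seen on a tiny example. A K-Means centroid is the mean of the points assigned to a cluster, so a single extreme point drags it away from the bulk of the data. The sketch below compares that mean with a coordinate-wise median, used here only as a simple stand-in for a more robust center (it is not the K-Medoids algorithm itself). The points are made up for illustration.

```python
import numpy as np

# Four points close together plus one extreme point (illustrative values).
points = np.array([
    [1.0, 1.0],
    [2.0, 1.0],
    [1.0, 2.0],
    [2.0, 2.0],
    [50.0, 50.0],
])

# Mean-based center, as used by K-Means for a single cluster.
mean_center = points.mean(axis=0)

# Coordinate-wise median: a simple, more robust alternative center.
median_center = np.median(points, axis=0)

print(mean_center)    # pulled toward the extreme point
print(median_center)  # stays with the bulk of the points
```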

## EDA

Let's do an Exploratory Data Analysis to get more insights from the given data.

In [136]:
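# numeric_only=True restricts the correlation to numerical columns (recent pandas versions raise an error on text columns otherwise).
correlation_matrix = data.corr(numeric_only=True)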
fig = px.imshow(correlation_matrix)
fig.update_layout(height=850, width=850)
fig.show()





This correlation matrix indicates the linear relationship between each pair of variables. A positive correlation means that when one variable increases, the other tends to increase as well. A negative correlation means that when one variable increases, the other tends to decrease.

From the correlation matrix we can say that:
* Most pairs of variables show no strong positive or negative correlation, and apart from the pairs listed below there are no clear signs of multicollinearity.
* 'danceability' is highly correlated with 'valence' (how happy a song makes you feel).
* The more energetic a song is, the less acoustic and the louder it is: 'energy' and 'acousticness' are negatively correlated, while 'energy' and 'loudness' are highly positively correlated.
* Song popularity is positively correlated with artist popularity, which makes sense.
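
Reading the strongest pairs off a heatmap by eye is error-prone, so here is a hedged sketch of how they could be listed explicitly. `top_correlated_pairs` is a hypothetical helper and the toy DataFrame contains made-up values, not data from the dataset.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df, n=5):
    """Return the n variable pairs with the largest absolute Pearson correlation."""
    corr = df.corr(numeric_only=True)
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Sort by absolute value but keep the signed coefficient in the result.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy data (illustrative values only).
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],  # moves together with 'a'
    'c': [5.0, 3.0, 4.0, 1.0, 2.0],   # moves against 'a'
})
top = top_correlated_pairs(toy, n=3)
print(top)
```

Calling `top_correlated_pairs(data)` on the real DataFrame would give the actual coefficients behind the bullets above.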

In [137]:
# Let's see which artists are in the top 5 most popular ones.
# Grouping by artist and obtaining the mean popularity; I use the mean because I know that each artist has a single popularity value.
data_artist_grouped = data.groupby('artist', as_index=False)['artist_popularity'].mean()
# Now I'll sort artist_popularity in descending order to keep only the most popular ones.
data_artist_grouped = data_artist_grouped.sort_values('artist_popularity', ascending=False).iloc[0:5]

fig = px.bar(data_artist_grouped, x='artist', y='artist_popularity')
fig.update_layout(height=500, width=500)
fig.show()

In [138]:
# Let's see if these popular artists also have the most popular songs.
# Grouping by artist and obtaining the mean song popularity, in order to consider all of an artist's songs and obtain an average song popularity value.
data_artist_grouped = data.groupby('artist', as_index=False)['song_popularity'].mean()
# Now I'll sort song_popularity in descending order to keep only the most popular ones.
data_artist_grouped = data_artist_grouped.sort_values('song_popularity', ascending=False).iloc[0:5]

fig = px.bar(data_artist_grouped, x='artist', y='song_popularity')
fig.update_layout(height=500, width=500)
fig.show()

Here we can see that, out of the top 5 most popular artists, 3 are also among the artists with the most popular songs (David Guetta, Eminem, and Shawn Mendes).
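
The overlap between the two rankings can also be computed instead of being read off the charts. The following is a minimal sketch on toy data: the artist names are placeholders and the numbers are made up, and `top_artists` is a hypothetical helper.

```python
import pandas as pd


def top_artists(df, column, n=5):
    """Return the set of n artists with the highest mean of `column`."""
    return set(df.groupby('artist')[column].mean().nlargest(n).index)


# Toy data with placeholder artist names (illustrative values only).
toy = pd.DataFrame({
    'artist': ['A', 'A', 'B', 'C', 'D'],
    'artist_popularity': [90, 90, 80, 70, 60],
    'song_popularity': [50, 70, 85, 20, 75],
})

by_artist = top_artists(toy, 'artist_popularity', n=2)
by_song = top_artists(toy, 'song_popularity', n=2)

# Artists that appear in both rankings.
overlap = by_artist & by_song
print(overlap)
```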

In [139]:
# Now I'm going to analyze the popularity of each artist genre.
# 'artist_genres' stores a list as a string, so I drop the nulls and split each row into individual genres.
genres = (
    data['artist_genres']
    .dropna()
    .str.split(',')
    .explode()
    # Removing the '[', ']' and ' characters.
    .str.replace(r"[\[\]']", '', regex=True)
    # Removing leading and trailing spaces.
    .str.strip()
)

# Counting how many times each genre appears and keeping the 10 most frequent ones.
# Naming the count column explicitly keeps this working across pandas versions.
data_artist_genre = (
    genres.value_counts()
    .head(10)
    .rename_axis('artist_genres')
    .reset_index(name='total_count')
)

fig = px.pie(data_artist_genre, values='total_count', names='artist_genres')
fig.update_layout(height=500, width=500)
fig.show()

Here we can see which artist genres are the most frequent in the dataset; "rock" is the genre that appears most often.
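
As an alternative way to count genres, here is a small sketch using only the standard library. It assumes each non-missing value is a valid Python list literal such as `"['rock', 'pop']"`, which may not hold for every row of the real column. `count_genres` is a hypothetical helper and the sample strings are made up.

```python
import ast
from collections import Counter


def count_genres(raw_values):
    """Count individual genres from strings that look like Python lists."""
    counts = Counter()
    for raw in raw_values:
        # Skip missing values (NaN or None), which are not strings.
        if not isinstance(raw, str):
            continue
        genres = ast.literal_eval(raw)
        counts.update(genre.strip() for genre in genres)
    return counts


# Illustrative values only, not rows from the dataset.
sample = ["['rock', 'pop']", "['rock']", None, "['latin pop', 'pop']"]
counts = count_genres(sample)
print(counts.most_common(2))
```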

In [140]:
# Let's see how many songs were released each year.
df_song_release_per_year = data.groupby('release_year')['song_name'].count().reset_index().sort_values('song_name',ascending=False)

fig = px.bar(df_song_release_per_year, x='release_year', y='song_name')
fig.update_layout(height=500, width=1000)
fig.show()

The year in which the most songs were released was 2012, with 5.5% of the total songs in the dataset.

In [151]:
# Let's see how each variable changed over the years, to check whether there is a pattern in how songs were changing in each of the given variables.
# list.pop() expects an integer index, so 'release_year' is excluded with a list comprehension instead.
num_cols = [col for col in data_num_cols.columns if col != 'release_year']

data_groupesd_per_year = data.groupby('release_year')[num_cols].mean().reset_index()
#fig = px.line(data, x='release_year', y=num_cols)
#fig.show()
data_groupesd_per_year
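
Removing one name from a list of column names is easy to get wrong: `list.pop` expects an integer position, and comparing a whole list to a string with `!=` yields a single boolean instead of a filtered list. The sketch below shows three ways that do work. The column names are just an illustrative subset.

```python
columns = ['danceability', 'energy', 'release_year', 'tempo']

# list.pop() takes a position, so the index has to be looked up first;
# it returns the removed element, not the remaining list.
by_pop = columns.copy()
removed = by_pop.pop(by_pop.index('release_year'))

# list.remove() takes the value directly and modifies the list in place
# (it returns None).
by_remove = columns.copy()
by_remove.remove('release_year')

# A list comprehension builds a new list and leaves the original untouched.
by_filter = [col for col in columns if col != 'release_year']

print(removed)
print(by_filter)
```

The list comprehension is usually the safest choice in a notebook, since re-running the cell does not fail once the element is already gone.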

It's a well-balanced class, so there is not much information we can get out of it.