
# Introduction

This notebook identifies Oku Hanako's music styles by grouping songs that are played in a similar style.
Similarity is determined by comparing the songs' audio features from Spotify.
The data used in this notebook were obtained using the code in [/common/get_Spotify_audio_features.ipynb](https://github.com/edwardthezeroth/data-projects/blob/master/common/get_Spotify_audio_features.ipynb).

In [0]:
%%capture
# Suppress output of this cell (the %%capture cell magic must be the first line of the cell)

"""Install and import dependencies."""
# Import numpy to help with plots
import numpy as np

# Import pandas for dataframes
import pandas as pd 

# Install and import pandas profiling for better data summarisation
!pip install pandas-profiling

import pandas_profiling

# Import matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline 

# Import scikit-learn for scaling and k-means clustering
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

In [23]:
"""Set up connection to Google Drive."""

# import Drive helper and mount drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [26]:
"""Define paths and load file audio features"""

# Main path in Google Drive
home_path = '/content/drive/My Drive/Colab Notebooks'

# Path to save data
data_path = f'{home_path}/Data/Spotify audio features'

# File name
save_file_name = 'oku_hanako_audio_features.csv'

# Load file
all_songs_features = pd.read_csv(f'{data_path}/{save_file_name}')

# This command won't work until the pip installation is updated
#all_songs_features.drop(['album', 'track', 'track_id'], axis=1).profile_report(style={'full_width':True})

#pandas_profiling.ProfileReport(all_songs_features)

all_songs_features.drop(['album', 'track', 'track_id'], axis=1).describe()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
count,269.0,269.0,269.0,269.0,269.0,269.0,269.0,269.0,269.0,269.0,269.0,269.0,269.0
mean,0.743738,0.514996,297460.156134,0.355054,0.059672,4.836431,0.143045,-8.958569,0.921933,0.030319,102.06484,3.981413,0.413611
std,0.245904,0.10111,44680.873209,0.165996,0.209257,3.052283,0.089277,2.31707,0.268777,0.005027,33.134268,0.160539,0.170933
min,0.00129,0.264,90680.0,0.0215,0.0,0.0,0.0431,-17.103,0.0,0.0225,60.688,3.0,0.0482
25%,0.675,0.452,274067.0,0.229,4e-06,2.0,0.0958,-10.465,1.0,0.0268,77.074,4.0,0.279
50%,0.829,0.513,297213.0,0.329,3.8e-05,5.0,0.116,-8.636,1.0,0.029,82.464,4.0,0.392
75%,0.922,0.58,323973.0,0.452,0.00057,7.0,0.158,-7.386,1.0,0.033,130.093,4.0,0.517
max,0.988,0.841,401720.0,0.799,0.964,11.0,0.692,-3.8,1.0,0.0484,199.712,5.0,0.902


# Feature Selection and Engineering

A description of the audio features is documented by [Spotify](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/).
Some of the audio features will be removed because they do not relate to music style.

* remove duration - music style is about how the music is played, not how long it is played
* remove liveness - although an artist may play their music differently before a live audience, the presence of the audience does not describe the music style
* remove speechiness - because these songs contain singing, not talking or rap


The key or pitch class is [represented with numbers](https://en.wikipedia.org/wiki/Pitch_class#Other_ways_to_label_pitch_classes). 
Successive tones are one semitone apart, and the tone returns to the same class every 12 semitones.
Due to the cyclical nature of this labelling system, it would be more appropriate to treat the key as a categorical variable instead of a continuous variable.
One-hot encoding will thus be applied so that the key is represented more appropriately.
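
The mismatch between numeric and musical distance can be illustrated with a minimal sketch. The helper names `numeric_distance` and `semitone_distance` are hypothetical and are not used elsewhere in this notebook. Note that one-hot encoding does not reproduce the cyclical distance; it only removes the misleading numeric ordering by making all distinct keys equally far apart.

```python
def numeric_distance(key_a: int, key_b: int) -> int:
    """Distance if the key is treated as an ordinary number."""
    return abs(key_a - key_b)


def semitone_distance(key_a: int, key_b: int) -> int:
    """Shortest distance in semitones around the 12-tone circle."""
    diff = abs(key_a - key_b) % 12
    return min(diff, 12 - diff)


# Pitch classes: 0 = C, 1 = C sharp, 11 = B
c, c_sharp, b = 0, 1, 11

print(numeric_distance(c, b), semitone_distance(c, b))              # 11 1
print(numeric_distance(c, c_sharp), semitone_distance(c, c_sharp))  # 1 1
```

C and B are neighbours on the keyboard, yet the numeric labels place them as far apart as possible.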

The numerical audio features have values between 0 and 1 except for loudness, tempo, and time signature.
Loudness and tempo will be min-max scaled so that their range matches the other numerical features.
The time signature only takes values from 3 to 5 and varies little (standard deviation of about 0.16), so it is left unscaled.
Scaling matters because features with larger ranges would otherwise dominate the distance calculations in later steps (clustering and dimension reduction).
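
As a rough illustration of why scale matters, the sketch below compares two tracks from the data set on energy and tempo only, before and after min-max scaling the tempo. The helper `min_max` is a hypothetical stand-in for what a min-max scaler computes; the tempo bounds are the minimum and maximum from the summary table above.

```python
# (energy, tempo in BPM) for two tracks in the data set:
# "Kaban no Naka no Yakimochi" and "Hontowane"
song_a = (0.323, 78.248)
song_b = (0.473, 175.899)

# Observed tempo range from the summary table
tempo_min, tempo_max = 60.688, 199.712


def min_max(value: float, lower: float, upper: float) -> float:
    """Rescale value linearly so that lower maps to 0 and upper maps to 1."""
    return (value - lower) / (upper - lower)


# Per-feature gaps between the two songs on the raw scale
energy_gap = abs(song_a[0] - song_b[0])
raw_tempo_gap = abs(song_a[1] - song_b[1])

# Tempo gap after min-max scaling; energy is already between 0 and 1
scaled_tempo_gap = abs(
    min_max(song_a[1], tempo_min, tempo_max)
    - min_max(song_b[1], tempo_min, tempo_max)
)

print(f"energy gap:         {energy_gap:.3f}")
print(f"tempo gap (raw):    {raw_tempo_gap:.3f}")
print(f"tempo gap (scaled): {scaled_tempo_gap:.3f}")
```

On the raw scale the tempo gap is hundreds of times larger than the energy gap, so a Euclidean distance would effectively ignore energy. After scaling, both gaps lie between 0 and 1.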

In [27]:
"""Keep selected features"""

# Specify features to keep
columns_to_keep = ['album', 'track'
                   , 'acousticness'
                   , 'danceability'
                   , 'energy'
                   , 'instrumentalness'
                   , 'key'
                   , 'loudness'
                   , 'mode'
                   , 'tempo'
                   , 'time_signature'
                   , 'valence'
                  ]

# Keep selected features
selected_songs_features = all_songs_features[columns_to_keep]

selected_songs_features.head(3)

Unnamed: 0,album,track,acousticness,danceability,energy,instrumentalness,key,loudness,mode,tempo,time_signature,valence
0,Kasumisou,Kaban no Naka no Yakimochi,0.541,0.535,0.323,2e-06,5,-7.98,1,78.248,4,0.339
1,Kasumisou,Hontowane,0.343,0.378,0.473,3e-06,5,-8.124,1,175.899,4,0.572
2,Kasumisou,Aenakutemo,0.926,0.361,0.207,0.000699,11,-12.718,1,81.101,4,0.335


In [41]:
def one_hot_encode_key(df):
  """
  One-hot encode key with useful column names
  
  Args:
      df: The dataframe with the key column
      
  Returns:
      A dataframe with the key one-hot encoded with useful column names
  """
  
  # Define the key dictionary
  key_dict = {
      0: "C",
      1: "C_sharp",
      2: "D",
      3: "D_sharp",
      4: "E",
      5: "F",
      6: "F_sharp",
      7: "G",
      8: "G_sharp",
      9: "A",
      10: "A_sharp",
      11: "B"
  }
  
  # Convert the numerical representation of key to letters
  df_key_described = df.replace({"key": key_dict})
  
  # One-hot encode key
  key_one_hot = pd.get_dummies(df_key_described['key'])
  
  # Join the encoding to the dataframe
  df_joined_encoding = df_key_described.join(key_one_hot)
  
  # Drop key
  encoded_df = df_joined_encoding.drop('key', axis=1)
  
  return encoded_df


encoded_df = one_hot_encode_key(selected_songs_features)
encoded_df.head(3)

Unnamed: 0,album,track,acousticness,danceability,energy,instrumentalness,loudness,mode,tempo,time_signature,valence,A,A_sharp,B,C,C_sharp,D,D_sharp,E,F,F_sharp,G,G_sharp
0,Kasumisou,Kaban no Naka no Yakimochi,0.541,0.535,0.323,2e-06,-7.98,1,78.248,4,0.339,0,0,0,0,0,0,0,0,1,0,0,0
1,Kasumisou,Hontowane,0.343,0.378,0.473,3e-06,-8.124,1,175.899,4,0.572,0,0,0,0,0,0,0,0,1,0,0,0
2,Kasumisou,Aenakutemo,0.926,0.361,0.207,0.000699,-12.718,1,81.101,4,0.335,0,0,1,0,0,0,0,0,0,0,0,0


In [42]:
# Define columns to scale
minmax_cols = ['tempo', 'loudness']

# Initialise min max scaler
minmax_scaler = MinMaxScaler()

# Train the min max scaler on the selected columns
minmax_scaler.fit(encoded_df[minmax_cols])

# Apply the min max scaler
encoded_df[minmax_cols] = minmax_scaler.transform(encoded_df[minmax_cols])


encoded_df.head(3)

Unnamed: 0,album,track,acousticness,danceability,energy,instrumentalness,loudness,mode,tempo,time_signature,valence,A,A_sharp,B,C,C_sharp,D,D_sharp,E,F,F_sharp,G,G_sharp
0,Kasumisou,Kaban no Naka no Yakimochi,0.541,0.535,0.323,2e-06,0.685785,1,0.126309,4,0.339,0,0,0,0,0,0,0,0,1,0,0,0
1,Kasumisou,Hontowane,0.343,0.378,0.473,3e-06,0.674961,1,0.828713,4,0.572,0,0,0,0,0,0,0,0,1,0,0,0
2,Kasumisou,Aenakutemo,0.926,0.361,0.207,0.000699,0.329625,1,0.146831,4,0.335,0,0,1,0,0,0,0,0,0,0,0,0


# Remove Duplicates

Some songs appear multiple times across albums and singles, as shown by the shared track name.
In addition, some songs have several versions (for example, live or hikigatari versions), as indicated in the track name.

To avoid deciding case by case whether different versions of the same song count as duplicates, this part of the project will only consider the base version of each song.

In [0]:
# Count the number of times each track name appears
num_track_appearances = (
    all_songs_features
    .groupby('track')
    .count()
    [['track_id']]
    .rename(columns={'track_id': 'number_of_tracks'})
)

# Show the five track names that appear most often
(
    num_track_appearances
    .sort_values(by=['number_of_tracks'], ascending=False)
    .head(5)
)
#.loc[num_track_appearances['number_of_tracks'] > 1]

track,number_of_tracks
Tegami,4
Yasashii Hana,4
Happy Days,4
Kimi no Egao,4
Waratte Waratte,4


In [0]:
# Show an example of a track with multiple versions
all_songs_features.loc[(all_songs_features['track'].str.contains('Garnet'))]

Unnamed: 0,album,track,track_id,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
29,Time Note,Garnet (Hikigatari),5hu1dBV5JKRU6gUnZpOJ1Y,0.929,0.614,311587,0.236,0.0,3,0.201,-10.005,1,0.0347,72.688,4,0.517
110,Hanako Oku Best ~ My Letters ~,Garnet,40jwSjLscs3wDOvYUpaJoB,0.753,0.515,320493,0.355,0.000176,3,0.129,-8.36,1,0.0283,73.643,4,0.362
202,Garnet,Garnet - Hikigatari,6rXPX0WpTKNGAOAoLPcF8Q,0.923,0.601,316280,0.251,1.2e-05,3,0.23,-10.281,1,0.033,72.921,4,0.493
204,Garnet,Garnet,2rZUu2qXb7SoUNktHKgfky,0.79,0.51,319067,0.353,0.000278,3,0.148,-8.399,1,0.0292,73.565,4,0.318
240,Hatsukoi,Garnet - Live Ver.,5u0psQQOoq6CA5E8RqsWmE,0.899,0.39,348733,0.255,0.00027,10,0.676,-9.847,1,0.0335,74.362,4,0.279
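
One possible way to keep only the base version of each song is sketched below. This is not the notebook's final method: the function name `keep_base_versions` is hypothetical, and the version markers in `VERSION_PATTERN` are an assumption taken only from the Garnet rows above, so other tracks may need additional markers.

```python
import pandas as pd

# Assumed version markers, taken only from the Garnet example; extend as needed
VERSION_PATTERN = r"\(Hikigatari\)|- Hikigatari|- Live Ver\."


def keep_base_versions(df: pd.DataFrame) -> pd.DataFrame:
    """Drop tracks whose name carries a version marker, then keep the
    first occurrence of each remaining track name."""
    is_variant = df["track"].str.contains(VERSION_PATTERN, regex=True)
    base_only = df.loc[~is_variant]
    return base_only.drop_duplicates(subset="track", keep="first")


# Small illustrative frame built from the Garnet rows shown above
example = pd.DataFrame({
    "album": ["Time Note", "Hanako Oku Best ~ My Letters ~",
              "Garnet", "Garnet", "Hatsukoi"],
    "track": ["Garnet (Hikigatari)", "Garnet", "Garnet - Hikigatari",
              "Garnet", "Garnet - Live Ver."],
})

print(keep_base_versions(example))
```

Keeping the first occurrence is an arbitrary choice here. The two base copies of Garnet have slightly different audio features, so which album's copy to keep may deserve a deliberate rule.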


In [0]:


"""

1. feature selection
-- remove duration because how the music is played should be independent of the track length
-- remove liveness. Although a live performance may be played differently, it is still mostly the same song
-- remove speechiness because there is no need to evaluate how much speaking occurs in the songs


2. feature engineering
-- encode key: treat it as categorical instead of numerical


3. PCA and plot
-- https://www.kdnuggets.com/2019/01/dimension-reduction-data-science.html


4. clustering k-means with number of clusters evaluation
-- try other clustering methods
-- https://www.kdnuggets.com/2018/06/5-clustering-algorithms-data-scientists-need-know.html
-- https://www.kdnuggets.com/2019/10/clustering-metrics-better-elbow-method.html


5. remove duplicates
-- try using all tracks first, then try removing duplicates and see whether the results change


6. Publish results on a Tableau public dashboard for others to play with
--
"""