
# Introduction

This notebook identifies Oku Hanako's musical styles by grouping her songs that have a similar style.
Similarity of style is determined by the audio features provided by Spotify.
The data used in this notebook were obtained using the code in [/common/get_Spotify_audio_features.ipynb](https://github.com/edwardthezeroth/data-projects/blob/master/common/get_Spotify_audio_features.ipynb).

In [0]:
%%capture
# %%capture suppresses the output of this cell; it must be the first line of the cell

"""Import dependencies."""
# Import pandas for dataframes
import pandas as pd 

# Import matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline 

# Import numpy to help with plots
import numpy as np

# Import scikit-learn for scaling and k-means clustering
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

In [5]:
"""Set up connection to Google Drive."""

# import Drive helper and mount drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [23]:
"""Define paths and load the audio features file."""

# Main path in Google Drive
home_path = '/content/drive/My Drive/Colab Notebooks'

# Path where the data is stored
data_path = f'{home_path}/Data/Spotify audio features'

# File name
save_file_name = 'oku_hanako_audio_features.csv'

# Load file
all_songs_features = pd.read_csv(f'{data_path}/{save_file_name}')


all_songs_features.head(5)

Unnamed: 0,album,track,track_id,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Kasumisou,Kaban no Naka no Yakimochi,3krX7tJSZtgcExvoz2Brdt,0.541,0.535,329067,0.323,2e-06,5,0.113,-7.98,1,0.0268,78.248,4,0.339
1,Kasumisou,Hontowane,2iU639pRsHohsb7JnD3O7s,0.343,0.378,272933,0.473,3e-06,5,0.18,-8.124,1,0.0258,175.899,4,0.572
2,Kasumisou,Aenakutemo,3brkLSKOt1XqeNWs26q803,0.926,0.361,314493,0.207,0.000699,11,0.128,-12.718,1,0.029,81.101,4,0.335
3,Kasumisou,Zettai,2zOZ64scyLxDLCfWMVzBcC,0.85,0.406,303000,0.297,0.000196,9,0.183,-10.697,0,0.0257,83.399,4,0.171
4,Kasumisou,Negai,76HWQNnmamYdndbKeB5aSo,0.843,0.455,332507,0.393,0.00355,1,0.231,-7.52,1,0.0275,79.01,4,0.22


# Feature Selection and Engineering

Descriptions of the audio features are documented by [Spotify](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/). Three of them are removed:

-- remove duration because music style should be independent of track length

-- remove liveness because whether an audience is present does not reflect the music style

-- remove speechiness because Oku Hanako does not mix singing and talking in her songs

Although the key is represented by numbers on an ordered scale, the scale is also cyclical: the values follow pitch-class notation, so the key after 11 (B) wraps back to 0 (C).
The key should hence be treated as a categorical variable instead of a continuous variable.
One-hot encoding is thus applied to the key so that it can be represented more appropriately.

The features will also be rescaled so that they have similar ranges.
This is important for the later steps (clustering and dimensionality reduction) to work properly.
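
The two steps described above could look roughly like the following. This is a minimal sketch on a made-up three-row frame, not the notebook's actual data: the `toy` frame, its values, and the `key_` column prefix are assumptions for illustration. It one-hot encodes `key` with `pd.get_dummies` and rescales the continuous columns to the range [0, 1] with `MinMaxScaler`.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({
    'key': [5, 11, 9],
    'loudness': [-8.0, -12.0, -10.0],
    'tempo': [80.0, 180.0, 130.0],
})

# One-hot encode the key so that no ordering between keys is implied
encoded = pd.get_dummies(toy, columns=['key'], prefix='key', dtype=int)

# Rescale the continuous features to the range [0, 1]
continuous_columns = ['loudness', 'tempo']
encoded[continuous_columns] = MinMaxScaler().fit_transform(encoded[continuous_columns])

print(encoded)
```

Min-max scaling is used here because `MinMaxScaler` is already imported in this notebook; features that Spotify already reports on a 0 to 1 scale would be left largely unchanged by it.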


In [48]:
# Features to keep
columns_to_keep = ['album', 'track'
                   , 'acousticness'
                   , 'danceability'
                   , 'energy'
                   , 'instrumentalness'
                   , 'key'
                   , 'loudness'
                   , 'mode'
                   , 'tempo'
                   , 'time_signature'
                   , 'valence'
                  ]

# Select features to keep (copy so that later edits do not affect the original dataframe)
selected_songs_features = all_songs_features[columns_to_keep].copy()

selected_songs_features.head(1)

Unnamed: 0,album,track,acousticness,danceability,energy,instrumentalness,key,loudness,mode,tempo,time_signature,valence
0,Kasumisou,Kaban no Naka no Yakimochi,0.541,0.535,0.323,2e-06,5,-7.98,1,78.248,4,0.339


# Remove Duplicates

Some songs appear multiple times across different albums and singles, as shown by the shared track names.
In addition, some songs exist in several versions, as indicated in the track name.

To avoid deciding case by case whether different versions of the same song count as duplicates, this part of the project will only consider the base version of each song.

In [42]:
# Count the number of times a track appears
num_track_appearances = all_songs_features \
    .groupby('track') \
    .count() \
    [['track_id']] \
    .rename(columns = {'track_id': 'number_of_tracks'})

# Show the five track names that appear most often
num_track_appearances \
    .sort_values(by=['number_of_tracks'], ascending=False) \
    .head(5)

track,number_of_tracks
Tegami,4
Yasashii Hana,4
Happy Days,4
Kimi no Egao,4
Waratte Waratte,4


In [44]:
# Show an example of a track with multiple versions
all_songs_features.loc[(all_songs_features['track'].str.contains('Garnet'))]

Unnamed: 0,album,track,track_id,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
29,Time Note,Garnet (Hikigatari),5hu1dBV5JKRU6gUnZpOJ1Y,0.929,0.614,311587,0.236,0.0,3,0.201,-10.005,1,0.0347,72.688,4,0.517
110,Hanako Oku Best ~ My Letters ~,Garnet,40jwSjLscs3wDOvYUpaJoB,0.753,0.515,320493,0.355,0.000176,3,0.129,-8.36,1,0.0283,73.643,4,0.362
202,Garnet,Garnet - Hikigatari,6rXPX0WpTKNGAOAoLPcF8Q,0.923,0.601,316280,0.251,1.2e-05,3,0.23,-10.281,1,0.033,72.921,4,0.493
204,Garnet,Garnet,2rZUu2qXb7SoUNktHKgfky,0.79,0.51,319067,0.353,0.000278,3,0.148,-8.399,1,0.0292,73.565,4,0.318
240,Hatsukoi,Garnet - Live Ver.,5u0psQQOoq6CA5E8RqsWmE,0.899,0.39,348733,0.255,0.00027,10,0.676,-9.847,1,0.0335,74.362,4,0.279
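
One possible way to keep only the base version of each song is sketched below, using a few rows modelled on the Garnet sample above. The version markers (a parenthesised suffix or a ` - ` suffix) are an assumption inferred from that sample and may not cover every naming pattern in the full data; `is_base_version` is a hypothetical helper column.

```python
import pandas as pd

# Toy rows modelled on the Garnet sample; not the full data set
tracks = pd.DataFrame({
    'album': ['Time Note', 'Hanako Oku Best ~ My Letters ~', 'Garnet', 'Garnet', 'Hatsukoi'],
    'track': ['Garnet (Hikigatari)', 'Garnet', 'Garnet - Hikigatari', 'Garnet', 'Garnet - Live Ver.'],
})

# Assumption: alternative versions end with "(...)" or contain " - <version>"
version_pattern = r'\(.*\)$| - '
tracks['is_base_version'] = ~tracks['track'].str.contains(version_pattern, regex=True)

# Keep base versions only, then keep the first appearance of each track name
base_tracks = (
    tracks[tracks['is_base_version']]
    .drop_duplicates(subset='track', keep='first')
    .drop(columns='is_base_version')
)

print(base_tracks)
```

Keeping the first appearance is an arbitrary choice; the audio features of the two base "Garnet" rows differ slightly, so averaging them would be another option.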


In [0]:


"""

1. feature selection
-- remove duration because how the music is played should be independent of the track length
-- remove liveness. Although a live performance may be played differently, it is still mostly the same song
-- remove speechiness because there is no need to evaluate how much speaking occurs in the songs


2. feature engineering
-- encode key: treat it as categorical instead of numerical


3. PCA and plot
-- https://www.kdnuggets.com/2019/01/dimension-reduction-data-science.html


4. clustering k-means with number of clusters evaluation
-- try other clustering methods
-- https://www.kdnuggets.com/2018/06/5-clustering-algorithms-data-scientists-need-know.html
-- https://www.kdnuggets.com/2019/10/clustering-metrics-better-elbow-method.html


5. remove duplicates
-- try using all tracks first, then try removing duplicates and see whether the results change


6. Publish results on a Tableau public dashboard for others to play with
"""