# Spotify Genre Detection
----
In this project, we aim to build a system that classifies Spotify tracks by genre. We will explore several classifiers and preprocessing steps along the way. The dataset comes from [Kaggle](https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db) and includes features precomputed by Spotify. These are not necessarily industry-standard audio features; they are specific to Spotify. In a later project, we will investigate extracting our own features from audio snippets, so that we can build a more specialized classifier tailored to our own music library.

## Preliminary Steps and Data Processing
Before classifying by genre, we import the libraries we need so that we can explore some basic properties of the dataset and process the data further if necessary.

In [1]:
# for processing and scaling
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# for visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()

Let's now read in our dataset using Pandas.

In [2]:
data = pd.read_csv('data/spotify-features.csv')
data.head()

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Movie,Henri Salvador,C'est beau de faire un Show,0BRjO6ga9RKCKjfDqeFgWV,0,0.611,0.389,99373,0.91,0.0,C#,0.346,-1.828,Major,0.0525,166.969,4/4,0.814
1,Movie,Martin & les fées,Perdu d'avance (par Gad Elmaleh),0BjC1NfoEOOusryehmNudP,1,0.246,0.59,137373,0.737,0.0,F#,0.151,-5.559,Minor,0.0868,174.003,4/4,0.816
2,Movie,Joseph Williams,Don't Let Me Be Lonely Tonight,0CoSDzoNIKCRs124s9uTVy,3,0.952,0.663,170267,0.131,0.0,C,0.103,-13.879,Minor,0.0362,99.488,5/4,0.368
3,Movie,Henri Salvador,Dis-moi Monsieur Gordon Cooper,0Gc6TVm52BwZD07Ki6tIvf,0,0.703,0.24,152427,0.326,0.0,C#,0.0985,-12.178,Major,0.0395,171.758,4/4,0.227
4,Movie,Fabien Nataf,Ouverture,0IuslXpMROHdEPvSl1fTQK,4,0.95,0.331,82625,0.225,0.123,F,0.202,-21.15,Major,0.0456,140.576,4/4,0.39


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232725 entries, 0 to 232724
Data columns (total 18 columns):
genre               232725 non-null object
artist_name         232725 non-null object
track_name          232725 non-null object
track_id            232725 non-null object
popularity          232725 non-null int64
acousticness        232725 non-null float64
danceability        232725 non-null float64
duration_ms         232725 non-null int64
energy              232725 non-null float64
instrumentalness    232725 non-null float64
key                 232725 non-null object
liveness            232725 non-null float64
loudness            232725 non-null float64
mode                232725 non-null object
speechiness         232725 non-null float64
tempo               232725 non-null float64
time_signature      232725 non-null object
valence             232725 non-null float64
dtypes: float64(9), int64(2), object(7)
memory usage: 32.0+ MB
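The summary above reports 232725 non-null entries in every column, so no imputation appears to be needed. As a minimal sketch of how the same check, plus a look at label balance, could be done programmatically, the snippet below uses a tiny stand-in frame named `toy`. The frame, the `Jazz` and `Rock` labels, and the variable names are assumptions for illustration only; on the real dataset the same calls would be made on `data`.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical labels); the real `data` is loaded with pd.read_csv.
toy = pd.DataFrame({
    "genre": ["Movie", "Movie", "Jazz", "Rock"],
    "tempo": [166.969, 174.003, 99.488, 140.576],
    "key": ["C#", "F#", "C", "F"],
})

# Count missing values per column; all zeros means there is nothing to impute.
missing_per_column = toy.isna().sum()

# Count tracks per genre to see how balanced the target labels are.
genre_counts = toy["genre"].value_counts()

print(missing_per_column)
print(genre_counts)
```

Strongly unbalanced genre counts would be worth knowing about before choosing an evaluation metric or a train/test split strategy.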


Let's separate the data into our feature set, the track identifiers, and our target labels.

In [15]:
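# columns 4 onward (popularity through valence) are the features
X_feats = data.iloc[:, 4:]
# columns 1-3 (artist_name, track_name, track_id) identify each track
X_id_info = data.iloc[:, 1:4]
# genre is the target label
y = data['genre']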

In [17]:
X_feats.head()

Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,0,0.611,0.389,99373,0.91,0.0,C#,0.346,-1.828,Major,0.0525,166.969,4/4,0.814
1,1,0.246,0.59,137373,0.737,0.0,F#,0.151,-5.559,Minor,0.0868,174.003,4/4,0.816
2,3,0.952,0.663,170267,0.131,0.0,C,0.103,-13.879,Minor,0.0362,99.488,5/4,0.368
3,0,0.703,0.24,152427,0.326,0.0,C#,0.0985,-12.178,Major,0.0395,171.758,4/4,0.227
4,4,0.95,0.331,82625,0.225,0.123,F,0.202,-21.15,Major,0.0456,140.576,4/4,0.39
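The positional slices used for this split depend on the column order of the CSV. One possible alternative, sketched here on a two-row stand-in frame, is to select by column name instead. The names `toy`, `id_columns`, and `target_column` are assumptions introduced for this example, and only two feature columns are included to keep it short.

```python
import pandas as pd

# Two-row stand-in with the same leading columns as the dataset (features abbreviated).
toy = pd.DataFrame({
    "genre": ["Movie", "Movie"],
    "artist_name": ["Henri Salvador", "Joseph Williams"],
    "track_name": ["C'est beau de faire un Show", "Don't Let Me Be Lonely Tonight"],
    "track_id": ["0BRjO6ga9RKCKjfDqeFgWV", "0CoSDzoNIKCRs124s9uTVy"],
    "popularity": [0, 3],
    "tempo": [166.969, 99.488],
})

id_columns = ["artist_name", "track_name", "track_id"]  # identify a track, not used as features
target_column = "genre"                                   # label we want to predict

X_id_info = toy[id_columns]
y = toy[target_column]
# Everything that is neither an identifier nor the target is treated as a feature.
X_feats = toy.drop(columns=id_columns + [target_column])

print(list(X_feats.columns))
```

Selecting by name keeps the split correct even if columns are reordered or an extra index column appears when the file is read.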


Next, we can split the features into numerical and categorical columns.

In [6]:
X_feats_numeric = X_feats.select_dtypes(exclude='object')
X_feats_numeric.head()

Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence
0,0,0.611,0.389,99373,0.91,0.0,0.346,-1.828,0.0525,166.969,0.814
1,1,0.246,0.59,137373,0.737,0.0,0.151,-5.559,0.0868,174.003,0.816
2,3,0.952,0.663,170267,0.131,0.0,0.103,-13.879,0.0362,99.488,0.368
3,0,0.703,0.24,152427,0.326,0.0,0.0985,-12.178,0.0395,171.758,0.227
4,4,0.95,0.331,82625,0.225,0.123,0.202,-21.15,0.0456,140.576,0.39


In [10]:
X_feats_categorical = X_feats.select_dtypes(include='object')
X_feats_categorical.head()

Unnamed: 0,key,mode,time_signature
0,C#,Major,4/4
1,F#,Minor,4/4
2,C,Minor,5/4
3,C#,Major,4/4
4,F,Major,4/4
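The numeric and categorical subsets should together cover every feature column exactly once, and both cells rely on `X_feats` having been defined first. As a minimal sketch, the split and a sanity check can be written in one self-contained snippet. The frame `toy_feats` is a stand-in built from a few of the rows shown earlier, and splitting on `"number"` rather than `'object'` is a substitution made for this example, since it does not depend on how string columns are stored.

```python
import pandas as pd

# Stand-in for X_feats with a mix of numeric and string-valued columns.
toy_feats = pd.DataFrame({
    "popularity": [0, 1, 3],
    "tempo": [166.969, 174.003, 99.488],
    "key": ["C#", "F#", "C"],
    "mode": ["Major", "Minor", "Minor"],
    "time_signature": ["4/4", "4/4", "5/4"],
})

# Integer and float columns go to the numeric subset, everything else to the categorical one.
numeric = toy_feats.select_dtypes(include="number")
categorical = toy_feats.select_dtypes(exclude="number")

# Sanity check: the two subsets partition the original columns.
assert set(numeric.columns) | set(categorical.columns) == set(toy_feats.columns)
assert set(numeric.columns).isdisjoint(categorical.columns)

print(list(numeric.columns))
print(list(categorical.columns))
```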