In [1]:
import pandas as pd

pd.options.display.max_rows = 200

train_data = pd.read_csv('../input/Train.csv')
test_data = pd.read_csv('../input/Test.csv')

First, let's remove the features we don't need (like `id`, `track_href`, `name`, etc.).

In [2]:
ignore = ['analysis_url',
          'id',
          'track_href',
          'uri',
          'type',
          'album',
          'name',
          'duration_ms',
         ]

train_data.drop(ignore, axis=1, inplace=True)
clean_test_data = test_data.drop(ignore, axis=1)


Now, assign classes. The training set is made up of 200 songs: the first 100 rows are Steven Wilson songs and the remaining 100 are totally different songs. So, the positive class is **1** and the negative class is **0**.

Yes, this should totally be in the csv ¯\\_(ツ)_/¯ (I'm gonna update it Soon™).

In [3]:
# rows 0-99 are Steven Wilson songs (class 1), rows 100-199 are everything else (class 0)
train_data.loc[:99, 'class'] = 1
train_data.loc[100:, 'class'] = 0

Now, the important part: **which features am I gonna use?** What features define Steven Wilson's style the best?
  
After trying many combinations, analyzing some [graphs](https://www.kaggle.com/danielgrijalvas/comparing-steven-wilson-and-porcupine-tree) and checking with the Test dataset (a playlist with songs that may or may not sound like SW), I came to the conclusion that the best features for this problem are:
* Energy
* Instrumentalness
* Loudness
* Acousticness
* Valence  
  
However, a statistical test can tell us whether each feature differs significantly between class 1 and class 0, so we can keep the features with the highest significance. Using scikit-learn's `SelectKBest` with `f_classif` (an ANOVA F-test, which for two classes is basically a t-test), the features with the highest significance (or "score", according to the [scikit-learn docs](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)) are:
* Danceability
* Instrumentalness
* Loudness
* Speechiness
* Valence  
  
See, I *almost* got it right. `SelectKBest` chose `Danceability` instead of `Energy`, and `Speechiness` instead of `Acousticness`. I understand that the data from `Energy` is very spread out and that's why it got a low score/significance; but `Speechiness`...? That seems weird.
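
The link between `f_classif` and a t-test can be checked numerically. With only two classes, the F-statistic that `f_classif` computes should equal the square of the pooled-variance t-statistic, and the p-values should match. The sketch below uses synthetic data (the three random features are made up, not the Spotify features) and assumes SciPy is installed alongside scikit-learn.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Synthetic two-class data: 100 rows per class, 3 made-up features
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 3)),
               rng.normal(0.5, 1.0, size=(100, 3))])
y = np.array([1] * 100 + [0] * 100)

# ANOVA F-test, one statistic per feature (what SelectKBest uses by default)
F, p_f = f_classif(X, y)

# Classic two-sample t-test with pooled variance, one statistic per feature
t, p_t = ttest_ind(X[y == 1], X[y == 0], equal_var=True)

print(np.allclose(F, t ** 2))  # F equals t squared
print(np.allclose(p_f, p_t))   # same p-values
```

So ranking features by `f_classif` score is the same as ranking them by the absolute value of the t-statistic.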

In [4]:
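from sklearn.feature_selection import SelectKBest, f_classif

# f_classif is SelectKBest's default score function; passed explicitly for clarity
features = SelectKBest(score_func=f_classif, k=5)
X_train = train_data.drop('class', axis=1)
features.fit(X_train, train_data['class'])

# names of the 5 selected columns
cols = list(X_train.columns[features.get_support()])
cols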

['danceability', 'instrumentalness', 'loudness', 'speechiness', 'valence']
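
To see *why* a feature ends up in or out of the top 5, it helps to look at the scores themselves, which a fitted `SelectKBest` exposes as `scores_`. Here is a minimal sketch on synthetic data; the `demo` frame and its column names (`separated`, `spread_out`, `noise`) are made up for illustration. A feature whose class means are close relative to its spread gets a low score, which is the situation described for `Energy` above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(42)

# Hypothetical features: first 100 rows are class 1, last 100 are class 0
demo = pd.DataFrame({
    # class means far apart, small spread -> high score expected
    'separated': np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)]),
    # class means slightly apart, large spread -> low score expected
    'spread_out': np.concatenate([rng.normal(0, 5, 100), rng.normal(0.5, 5, 100)]),
    # no class difference at all
    'noise': rng.normal(0, 1, 200),
})
demo['class'] = [1] * 100 + [0] * 100

X = demo.drop(columns='class')
selector = SelectKBest(score_func=f_classif, k=1).fit(X, demo['class'])

# One F-score per feature, highest first
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)
```

Running the same `pd.Series(features.scores_, ...)` line on the real selector would show how far `Speechiness` is from `Acousticness` in the ranking.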

Now, let's train the machine learning algorithm, K-Nearest Neighbors. I'm leaving K as the default (5), but maybe later I'll try with higher values.  

Also, I'm setting `weights='distance'` so that the closer neighbors have *even more* influence over the classification. Honestly, I don't really know how that works, so I'll probably change it to `uniform` weights.
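
For what it's worth, here is a small sketch of what `weights='distance'` does: each of the K neighbors votes with weight 1/distance instead of one vote each. The toy 1-D data and the query point are made up. The example recomputes the class-1 probability by hand and compares it with `predict_proba`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D training set (made-up values, not the Spotify data)
X_train = np.array([[0.0], [1.0], [2.0], [5.0], [6.0]])
y_train = np.array([1, 1, 0, 0, 0])
query = np.array([[1.8]])

k = 3
uniform = KNeighborsClassifier(n_neighbors=k, weights='uniform').fit(X_train, y_train)
weighted = KNeighborsClassifier(n_neighbors=k, weights='distance').fit(X_train, y_train)

# Distances to, and indices of, the k nearest training points
dist, idx = weighted.kneighbors(query)

# Manual distance-weighted vote: weight = 1 / distance
inv = 1.0 / dist[0]
manual_prob1 = inv[y_train[idx[0]] == 1].sum() / inv.sum()

uniform_prob1 = uniform.predict_proba(query)[0, 1]
weighted_prob1 = weighted.predict_proba(query)[0, 1]

print(uniform_prob1)                 # 2 of the 3 neighbors are class 1
print(weighted_prob1, manual_prob1)  # the two values agree
```

With uniform weights the query is labeled 1 (two of the three neighbors are class 1), but with distance weights the single very close class-0 neighbor outvotes them and the label flips to 0.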

In [5]:
from sklearn.neighbors import KNeighborsClassifier as knn

# train KNN on the selected features
# (n_neighbors defaults to 5; p=2 means Euclidean distance)
kn = knn(weights='distance', p=2)
kn.fit(train_data[cols], train_data['class'])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='distance')

The algorithm is trained. Let's use what it learned to classify a playlist (`Test.csv`, a playlist with 100 songs that may or may not fit well among Steven Wilson songs). I'll also add the predicted class and probabilities to the `Test` dataset.

In [6]:
# classify/predict class of test songs
predictions = kn.predict(clean_test_data[cols])
test_data['predict'] = predictions

# probability that a song is 0/1 (columns follow the order of kn.classes_)
test_prob = kn.predict_proba(clean_test_data[cols])
test_data['prob0'] = test_prob[:, 0]
test_data['prob1'] = test_prob[:, 1]

## Results
Let's check the results. If you want to check them yourself, give [Steven Wilson](https://open.spotify.com/artist/4X42BfuhWCAZ2swiVze9O0)/[Porcupine Tree](https://open.spotify.com/artist/5NXHXK6hOCotCF8lvGM1I0) a listen, then head over to the [Test playlist](https://open.spotify.com/user/jdgs.gt/playlist/6wCTUaDlzdzrqMUzkCd9Zx) and listen to the songs where `prob1` is 1 to check the similarities. Or listen to the songs where `prob0` is 1, and you'll hear the huge differences in musical style.

**UPDATE**: I added the songs with > 80% probability of being **1** (similar to Steven Wilson) to [this playlist](https://open.spotify.com/user/jdgs.gt/playlist/0D7EWUrrBuza4H8SuzDqyI). It's just 35 songs long, but the results are really good.

In [7]:
test_data[['name','predict', 'prob0', 'prob1']]

|   | name | predict | prob0 | prob1 |
|---|------|---------|-------|-------|
| 0 | Animals | 0.0 | 0.575349 | 0.424651 |
| 1 | The Closest I've Come | 1.0 | 0.0 | 1.0 |
| 2 | Coma Pony | 0.0 | 0.81181 | 0.18819 |
| 3 | Doce | 1.0 | 0.350808 | 0.649192 |
| 4 | The End Is Begun | 0.0 | 1.0 | 0.0 |
| 5 | (Go) Get It | 1.0 | 0.435726 | 0.564274 |
| 6 | Heartbreaker | 1.0 | 0.334085 | 0.665915 |
| 7 | The Invisible Man | 1.0 | 0.0 | 1.0 |
| 8 | Kodokunohatsumei | 1.0 | 0.0 | 1.0 |
| 9 | The Lesson | 1.0 | 0.195803 | 0.804197 |
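
The > 80% cut-off mentioned in the update can be sketched like this. To keep the example self-contained it uses only the ten rows shown above, copied by hand into a small frame; the `results` and `similar` names are made up here. On the full data, the same filter would be applied to `test_data`.

```python
import pandas as pd

# The ten rows printed above, copied by hand
results = pd.DataFrame({
    'name': ['Animals', "The Closest I've Come", 'Coma Pony', 'Doce',
             'The End Is Begun', '(Go) Get It', 'Heartbreaker',
             'The Invisible Man', 'Kodokunohatsumei', 'The Lesson'],
    'prob1': [0.424651, 1.0, 0.18819, 0.649192, 0.0,
              0.564274, 0.665915, 1.0, 1.0, 0.804197],
})

# Keep only the songs with more than 80% probability of being class 1
similar = results[results['prob1'] > 0.8]
print(similar['name'].tolist())
```

Of these ten songs, four pass the threshold: the three with `prob1` equal to 1.0, plus "The Lesson".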


## Conclusion
In my opinion, the algorithm did an excellent job. But in a problem like this, the accuracy is a little subjective, right?