GridSearch params added to final estimators
bhavika committed Dec 18, 2016
1 parent fcb271e commit cf995d6
Showing 34 changed files with 283 additions and 2,873 deletions.
53 changes: 51 additions & 2 deletions README.md
@@ -1,2 +1,51 @@
# JoyDivision
Music Mood Classification on the Million Song Dataset
## Music Mood Classification Using the Million Song Dataset

### Installation

1. Unzip tekwani.zip
2. cd tekwani
3. If you want to create a virtual environment, run virtualenv <the name of the environment, say 'tekwani'>
4. To begin using the virtual environment, you must activate it.
$ source tekwani/bin/activate
5. Now install the packages specified in requirements.txt:
pip install -r requirements.txt
(pip freeze > requirements.txt does the opposite: it records the current state of the environment and overwrites requirements.txt, so do not run it before installing.)


### Running the solution

1. Download the data from the Google Drive link here: http://bit.do/datasets
The total download size should be about 2.8 GB.
2. Move the downloaded files to tekwani/data and check that it contains the following files:
- fullset.pkl
- train.pkl
- test.pkl
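
The check in step 2 can also be done programmatically before running anything. This is a minimal sketch and not part of the repository: the helper name `missing_data_files` is hypothetical, and the default path `data` assumes the current directory is `tekwani`.

```python
from pathlib import Path

# File names taken from the list in step 2.
EXPECTED_FILES = ("fullset.pkl", "train.pkl", "test.pkl")


def missing_data_files(data_dir="data"):
    """Return the expected dataset files that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("All dataset files are present.")
```

An empty list means all three pickles are in place.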

The folder tekwani/explore contains plots, grid search output for the different estimators, scripts to handle h5 files, and a list of getter methods
for h5 files (hdf5_getters.txt).

`evaluation.py` is the final script; it generates the results shown in the report.
It runs 6 models on each of 3 feature combinations: audio and descriptive, descriptive only, and audio only.
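
Six models on three feature combinations gives 18 evaluation runs. The sketch below only enumerates those runs and is not the code in `evaluation.py`. The column names are hypothetical placeholders, and the model list is an assumption (it reuses the six estimators named for the grid search scripts in this README).

```python
from itertools import product

# Hypothetical column groups; the real feature names live in the project's dataset.
AUDIO = ["audio_feature_1", "audio_feature_2"]
DESCRIPTIVE = ["descriptive_feature_1", "descriptive_feature_2"]

FEATURE_COMBINATIONS = {
    "audio and descriptive": AUDIO + DESCRIPTIVE,
    "descriptive only": DESCRIPTIVE,
    "audio only": AUDIO,
}

# Assumed to be the same six estimators that the grid search scripts tune.
MODEL_NAMES = ["AdaBoost", "ExtraTrees", "GradientBoosting", "SVM", "KNN", "RandomForest"]


def planned_runs():
    """Return one (model, combination name, columns) tuple per evaluation run."""
    return [
        (model, combo, FEATURE_COMBINATIONS[combo])
        for combo, model in product(FEATURE_COMBINATIONS, MODEL_NAMES)
    ]


if __name__ == "__main__":
    runs = planned_runs()
    print(len(runs), "runs")
    for model, combo, columns in runs[:3]:
        print(model, "|", combo, "|", len(columns), "columns")
```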


### Other files

1. `create_dataset.py` builds the dataset and randomly splits it 60-40 into train and test sets.
2. `models.py` evaluates feature importance for ensemble estimators and performs cross validation for all estimators.
3. `read_h5.py` is used to pull data out of HDF5 files. To run this, you need to download the Million Song Subset and place it in `data`.
4. `spotify.py` searches Spotify's database for the track IDs of the songs we have labeled.
5. `spotify_audio_features.py` fetches Danceability, Energy, Speechiness, Acousticness, Valence and Instrumentalness for all the track IDs we were able to get from
`spotify.py`.
6. `rfe.py` ranks the features. I use the optimal number of features (n) obtained here to select the top n features in `feature_importance.py` for the estimators.
7. `feature_combinations.py` evaluates the importance of features relative to the groups they are combined with, for example descriptive features paired with timbre,
audio features paired with descriptive, etc.
8. `get_train_test.py` serves the train and test sets (*.pkl) to any file that imports it.
9. `scratch/labels.csv` contains the full list of songs and the labels I assigned to them.
10. `scratch/models.out` contains the output of `models.py`. These are only cross validation results.
11. `learning_curve.py` plots the training score and cross validation score for an SVM with a linear kernel.
12. `*.out` files are output files.
13. `hdf5_getters.py` is an interface provided along with the Million Song Dataset by LabROSA (Columbia University). It is used to read HDF5 files, the initial
form of the Million Song Dataset.
14. All files named `gridsearch_.py` are used to do a hyperparameter search for the models used. These models are AdaBoostClassifier, ExtraTreesClassifier,
GradientBoostingClassifier, SVM, KNearestNeighbour and RandomForestClassifier.
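
To make the hyperparameter search in the last item concrete, here is a minimal sketch of how a parameter grid expands into candidates. It is not the code in the grid search scripts. The grid values are inferred from the `params` entries in `explore/gridsearch_svm.out`, and `expand_grid` is a hypothetical helper written with the standard library only.

```python
from itertools import product

# Grid inferred from the 'params' entries in explore/gridsearch_svm.out
# (an assumption about what gridsearch_svm.py passes to the search).
SVM_GRID = {"C": [3], "gamma": [0.1, 1, 2, 3, 4], "kernel": ["linear", "rbf"]}


def expand_grid(grid):
    """Yield every parameter combination, with the last sorted key varying fastest."""
    keys = sorted(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


candidates = list(expand_grid(SVM_GRID))
print(len(candidates), "candidates")
for params in candidates[:2]:
    print(params)
```

Each candidate is then fit and scored with cross validation, and the best-scoring one is kept. Sorting the keys reproduces the order seen in the log, where `kernel` alternates between `linear` and `rbf` for each value of `gamma`.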

48 changes: 48 additions & 0 deletions explore/gridsearch_svm.out
@@ -0,0 +1,48 @@
/home/bhavika/anaconda2/bin/python /home/bhavika/PycharmProjects/JoyDivision/src/gridsearch_svm.py
Train--------
False 2205
True 2171
Name: Mood, dtype: int64
Test----------
True 1522
False 1498
Name: Mood, dtype: int64
SVM grid search:
CV results {'std_train_score': array([ 0.02296023, 0.00114103, 0.02296023, 0. , 0.02296023,
0. , 0.02296023, 0. , 0.02296023, 0. ]), 'rank_test_score': array([1, 6, 1, 7, 1, 8, 1, 8, 1, 8], dtype=int32), 'mean_score_time': array([ 0.21195304, 0.43780494, 0.40236795, 0.7167635 , 0.27751696,
0.49956441, 0.17739654, 0.4735024 , 0.19878852, 0.47999859]), 'std_test_score': array([ 1.97534395e-02, 1.67744387e-02, 1.97534395e-02,
6.83468817e-04, 1.97534395e-02, 1.77551496e-06,
1.97534395e-02, 1.77551496e-06, 1.97534395e-02,
1.77551496e-06]), 'param_gamma': masked_array(data = [0.1 0.1 1 1 2 2 3 3 4 4],
mask = [False False False False False False False False False False],
fill_value = ?)
, 'split1_train_score': array([ 0.71402467, 0.9954317 , 0.71402467, 1. , 0.71402467,
1. , 0.71402467, 1. , 0.71402467, 1. ]), 'split0_test_score': array([ 0.70077661, 0.68570123, 0.70077661, 0.50525354, 0.70077661,
0.50388305, 0.70077661, 0.50388305, 0.70077661, 0.50388305]), 'mean_test_score': array([ 0.72052102, 0.70246801, 0.72052102, 0.50457038, 0.72052102,
0.50388483, 0.72052102, 0.50388483, 0.72052102, 0.50388483]), 'param_C': masked_array(data = [3 3 3 3 3 3 3 3 3 3],
mask = [False False False False False False False False False False],
fill_value = ?)
, 'split0_train_score': array([ 0.75994513, 0.99771376, 0.75994513, 1. , 0.75994513,
1. , 0.75994513, 1. , 0.75994513, 1. ]), 'params': ({'kernel': 'linear', 'C': 3, 'gamma': 0.1}, {'kernel': 'rbf', 'C': 3, 'gamma': 0.1}, {'kernel': 'linear', 'C': 3, 'gamma': 1}, {'kernel': 'rbf', 'C': 3, 'gamma': 1}, {'kernel': 'linear', 'C': 3, 'gamma': 2}, {'kernel': 'rbf', 'C': 3, 'gamma': 2}, {'kernel': 'linear', 'C': 3, 'gamma': 3}, {'kernel': 'rbf', 'C': 3, 'gamma': 3}, {'kernel': 'linear', 'C': 3, 'gamma': 4}, {'kernel': 'rbf', 'C': 3, 'gamma': 4}), 'std_fit_time': array([ 0.04328609, 0.02048838, 0.12678599, 0.03631639, 0.53177655,
0.04621804, 0.15611959, 0.02182198, 0.11950135, 0.02658749]), 'std_score_time': array([ 0.02327311, 0.00198412, 0.02210987, 0.12797236, 0.06835306,
0.01183248, 0.01528645, 0.02398157, 0.00379264, 0.00293255]), 'mean_train_score': array([ 0.7369849 , 0.99657273, 0.7369849 , 1. , 0.7369849 ,
1. , 0.7369849 , 1. , 0.7369849 , 1. ]), 'mean_fit_time': array([ 1.73574996, 0.73997653, 2.153373 , 0.97303653, 2.27456653,
0.74146402, 1.66263652, 0.62005496, 2.29283547, 0.67161345]), 'param_kernel': masked_array(data = ['linear' 'rbf' 'linear' 'rbf' 'linear' 'rbf' 'linear' 'rbf' 'linear' 'rbf'],
mask = [False False False False False False False False False False],
fill_value = ?)
, 'split1_test_score': array([ 0.74028349, 0.71925011, 0.74028349, 0.5038866 , 0.74028349,
0.5038866 , 0.74028349, 0.5038866 , 0.74028349, 0.5038866 ])}
Best SVM SVC(C=3, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.1, kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Best CV score for SVM 0.720521023766
Best SVM params: {'kernel': 'linear', 'C': 3, 'gamma': 0.1}
Finished in: 51.4404470921
[Sun Dec 18 07:07:30 2016 : 424343] speechd: Speech Dispatcher 0.8 starting
[Sun Dec 18 07:07:30 2016 : 424545] speechd: Speech Dispatcher already running.

Speech Dispatcher already running.


Process finished with exit code 0
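
The log above reports `gamma=0.1` with a linear kernel as the best SVM, although five candidates share the top mean test score. The sketch below recomputes `rank_test_score` from the logged `mean_test_score` values to show the tie. It uses only the standard library; `min_ranks` is a hypothetical helper, and picking the first tied candidate is an assumption that happens to match the reported best parameters.

```python
# mean_test_score and the candidate order are copied from the log above.
MEAN_TEST_SCORE = [0.72052102, 0.70246801, 0.72052102, 0.50457038, 0.72052102,
                   0.50388483, 0.72052102, 0.50388483, 0.72052102, 0.50388483]
GAMMAS = [0.1, 0.1, 1, 1, 2, 2, 3, 3, 4, 4]
KERNELS = ["linear", "rbf"] * 5


def min_ranks(scores):
    """Rank scores from best to worst; tied scores share the lowest rank."""
    return [1 + sum(other > score for other in scores) for score in scores]


ranks = min_ranks(MEAN_TEST_SCORE)
best_index = ranks.index(1)  # first candidate that holds rank 1

print(ranks)
print("best:", KERNELS[best_index], "gamma =", GAMMAS[best_index],
      "score =", MEAN_TEST_SCORE[best_index])
```

All five linear-kernel candidates score identically because a linear kernel does not use `gamma`, so the `gamma=0.1` in the best parameters carries no information. The RBF candidates with `gamma` of 1 or more reach a train score of 1.0 and a test score near 0.5, which points to overfitting.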
20 changes: 0 additions & 20 deletions explore/songs_df.py

This file was deleted.

7 changes: 7 additions & 0 deletions requirements.txt
@@ -0,0 +1,7 @@
pandas==0.18.1
scikit-learn==0.17.1
numpy==1.11.0
virtualenv==1.11.4
matplotlib==1.5.3
seaborn==0.7.1
xgboost==0.6