GridSearch params added to final estimators
Showing 34 changed files with 283 additions and 2,873 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,51 @@ | ||
# JoyDivision | ||
Music Mood Classification on the Million Song Dataset | ||
## Music Mood Classification Using the Million Song Dataset | ||
|
||
### Installation | ||
|
||
1. Unzip tekwani.zip | ||
2. cd tekwani | ||
3. If you want to create a virtual environment, run virtualenv <the name of the environment, say 'tekwani'> | ||
4. To begin using the virtual environment, you must activate it. | ||
$ source tekwani/bin/activate | ||
5. Now install the packages specified in requirements.txt: | ||
pip install -r requirements.txt | ||
To record the current state of an environment instead, use pip freeze > requirements.txt. This overwrites requirements.txt, so it is not part of the installation. | ||
|
||
|
||
### Running the solution | ||
|
||
1. Download the data from the Google Drive link here: http://bit.do/datasets | ||
The total download size should be about 2.8 GB. | ||
2. Move the downloaded files to tekwani/data and check that it contains the following files: | ||
-fullset.pkl | ||
-train.pkl | ||
-test.pkl | ||
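The check in step 2 can be automated. The following is a minimal sketch, assuming the scripts are run from the directory that contains `tekwani`; the helper name `missing_data_files` is hypothetical and not part of the repository.

```python
import os

# File names taken from the list above; the relative path is an assumption.
REQUIRED_FILES = ("fullset.pkl", "train.pkl", "test.pkl")


def missing_data_files(data_dir):
    """Return the required file names that are not present in data_dir."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(data_dir, name))]


if __name__ == "__main__":
    missing = missing_data_files(os.path.join("tekwani", "data"))
    if missing:
        print("Missing data files: " + ", ".join(missing))
    else:
        print("All data files found.")
```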
|
||
The folder tekwani/explore contains plots, grid search output for the different estimators, scripts to handle h5 files, and a list of getter methods | ||
for h5 files (hdf5_getters.txt). | ||
|
||
`evaluation.py` is the final script; it generates the results shown in the report. | ||
It runs 6 models on 3 feature combinations: audio and descriptive, descriptive only, and audio only. | ||
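Six models on three feature combinations gives 18 evaluation runs. The sketch below only enumerates them; it is not the code in `evaluation.py`. The model labels are shorthand and assume that the six models are the estimators this README lists for the grid search scripts.

```python
from itertools import product

# Assumed labels for the six estimators; shorthand, not class names.
MODELS = ["AdaBoost", "ExtraTrees", "GradientBoosting",
          "SVM", "KNN", "RandomForest"]
# The three feature combinations named above.
FEATURE_SETS = ["audio and descriptive", "descriptive only", "audio only"]

# Every (feature set, model) pair is one evaluation run.
runs = list(product(FEATURE_SETS, MODELS))
for feature_set, model in runs:
    print("{:<18} {}".format(model, feature_set))
print("{} runs in total".format(len(runs)))
```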
|
||
|
||
### Other files | ||
|
||
1. `create_dataset.py` builds the dataset and randomly splits it into a 60-40 distribution for train and test sets. | ||
2. `models.py` evaluates feature importance for ensemble estimators and performs cross validation for all estimators. | ||
3. `read_h5.py` is used to pull data out of HDF5 files. To run this, you need to download the Million Song Subset and place it in `data`. | ||
4. `spotify.py` searches for Track IDs for the songs we have labeled in Spotify's database. | ||
5. `spotify_audio_features.py` fetches Danceability, Energy, Speechiness, Acousticness, Valence and Instrumentalness for all the track IDs we were able to get from | ||
`spotify.py`. | ||
6. `rfe.py` gets the ranking of features. I use the optimal number of features (n) obtained here to select the top n features in `feature_importance.py` for the estimators. | ||
7. `feature_combinations.py` evaluates the importance of features when compared to the groups they're combined with. For example, descriptive features paired with timbre, | ||
audio features paired with descriptive, etc. | ||
8. `get_train_test.py` serves the train and test sets (*.pkl) to any file that imports it. | ||
9. `scratch\labels.csv` contains the full list of songs and the labels I assigned to them. | ||
10. `scratch\models.out` contains the output of `models.py`. These are only cross validation results. | ||
11. `learning_curve.py` plots the training score and cross validation score for an SVM with a linear kernel. | ||
12. `*.out` files are output files. | ||
13. `hdf5_getters.py` is an interface provided along with the Million Song Dataset by LabROSA (Columbia University). It is used to read HDF5 files, which are the initial | ||
form of the Million Song Dataset. | ||
14. All files named `gridsearch_*.py` are used to do a hyperparameter search for the models used. These models are AdaBoostClassifier, ExtraTreesClassifier, | ||
GradientBoostingClassifier, SVM, K-Nearest Neighbours and RandomForestClassifier. | ||
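A grid search fits one model per combination of hyperparameter values. The sketch below enumerates such combinations with the standard library only. The grid is an assumption inferred from the `params` entry in the logged SVM results, since the grid defined in `gridsearch_svm.py` is not shown here, and `expand_grid` is a hypothetical helper.

```python
from itertools import product

# Assumed grid, inferred from the logged SVM 'params' entry.
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [3],
    "gamma": [0.1, 1, 2, 3, 4],
}


def expand_grid(grid):
    """List every parameter combination; the last sorted key varies fastest."""
    keys = sorted(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


candidates = expand_grid(param_grid)
for params in candidates:
    print(params)
```

This grid yields 10 candidates. A linear kernel does not use `gamma`, so the five linear candidates describe the same model, which is consistent with their identical scores in the log.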
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
/home/bhavika/anaconda2/bin/python /home/bhavika/PycharmProjects/JoyDivision/src/gridsearch_svm.py | ||
Train-------- | ||
False 2205 | ||
True 2171 | ||
Name: Mood, dtype: int64 | ||
Test---------- | ||
True 1522 | ||
False 1498 | ||
Name: Mood, dtype: int64 | ||
SVM grid search: | ||
CV results {'std_train_score': array([ 0.02296023, 0.00114103, 0.02296023, 0. , 0.02296023, | ||
0. , 0.02296023, 0. , 0.02296023, 0. ]), 'rank_test_score': array([1, 6, 1, 7, 1, 8, 1, 8, 1, 8], dtype=int32), 'mean_score_time': array([ 0.21195304, 0.43780494, 0.40236795, 0.7167635 , 0.27751696, | ||
0.49956441, 0.17739654, 0.4735024 , 0.19878852, 0.47999859]), 'std_test_score': array([ 1.97534395e-02, 1.67744387e-02, 1.97534395e-02, | ||
6.83468817e-04, 1.97534395e-02, 1.77551496e-06, | ||
1.97534395e-02, 1.77551496e-06, 1.97534395e-02, | ||
1.77551496e-06]), 'param_gamma': masked_array(data = [0.1 0.1 1 1 2 2 3 3 4 4], | ||
mask = [False False False False False False False False False False], | ||
fill_value = ?) | ||
, 'split1_train_score': array([ 0.71402467, 0.9954317 , 0.71402467, 1. , 0.71402467, | ||
1. , 0.71402467, 1. , 0.71402467, 1. ]), 'split0_test_score': array([ 0.70077661, 0.68570123, 0.70077661, 0.50525354, 0.70077661, | ||
0.50388305, 0.70077661, 0.50388305, 0.70077661, 0.50388305]), 'mean_test_score': array([ 0.72052102, 0.70246801, 0.72052102, 0.50457038, 0.72052102, | ||
0.50388483, 0.72052102, 0.50388483, 0.72052102, 0.50388483]), 'param_C': masked_array(data = [3 3 3 3 3 3 3 3 3 3], | ||
mask = [False False False False False False False False False False], | ||
fill_value = ?) | ||
, 'split0_train_score': array([ 0.75994513, 0.99771376, 0.75994513, 1. , 0.75994513, | ||
1. , 0.75994513, 1. , 0.75994513, 1. ]), 'params': ({'kernel': 'linear', 'C': 3, 'gamma': 0.1}, {'kernel': 'rbf', 'C': 3, 'gamma': 0.1}, {'kernel': 'linear', 'C': 3, 'gamma': 1}, {'kernel': 'rbf', 'C': 3, 'gamma': 1}, {'kernel': 'linear', 'C': 3, 'gamma': 2}, {'kernel': 'rbf', 'C': 3, 'gamma': 2}, {'kernel': 'linear', 'C': 3, 'gamma': 3}, {'kernel': 'rbf', 'C': 3, 'gamma': 3}, {'kernel': 'linear', 'C': 3, 'gamma': 4}, {'kernel': 'rbf', 'C': 3, 'gamma': 4}), 'std_fit_time': array([ 0.04328609, 0.02048838, 0.12678599, 0.03631639, 0.53177655, | ||
0.04621804, 0.15611959, 0.02182198, 0.11950135, 0.02658749]), 'std_score_time': array([ 0.02327311, 0.00198412, 0.02210987, 0.12797236, 0.06835306, | ||
0.01183248, 0.01528645, 0.02398157, 0.00379264, 0.00293255]), 'mean_train_score': array([ 0.7369849 , 0.99657273, 0.7369849 , 1. , 0.7369849 , | ||
1. , 0.7369849 , 1. , 0.7369849 , 1. ]), 'mean_fit_time': array([ 1.73574996, 0.73997653, 2.153373 , 0.97303653, 2.27456653, | ||
0.74146402, 1.66263652, 0.62005496, 2.29283547, 0.67161345]), 'param_kernel': masked_array(data = ['linear' 'rbf' 'linear' 'rbf' 'linear' 'rbf' 'linear' 'rbf' 'linear' 'rbf'], | ||
mask = [False False False False False False False False False False], | ||
fill_value = ?) | ||
, 'split1_test_score': array([ 0.74028349, 0.71925011, 0.74028349, 0.5038866 , 0.74028349, | ||
0.5038866 , 0.74028349, 0.5038866 , 0.74028349, 0.5038866 ])} | ||
Best SVM SVC(C=3, cache_size=200, class_weight=None, coef0=0.0, | ||
decision_function_shape=None, degree=3, gamma=0.1, kernel='linear', | ||
max_iter=-1, probability=False, random_state=None, shrinking=True, | ||
tol=0.001, verbose=False) | ||
Best CV score for SVM 0.720521023766 | ||
Best SVM params: {'kernel': 'linear', 'C': 3, 'gamma': 0.1} | ||
Finished in: 51.4404470921 | ||
[Sun Dec 18 07:07:30 2016 : 424343] speechd: Speech Dispatcher 0.8 starting | ||
[Sun Dec 18 07:07:30 2016 : 424545] speechd: Speech Dispatcher already running. | ||
|
||
Speech Dispatcher already running. | ||
|
||
|
||
Process finished with exit code 0 |
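One way to read the CV results above is to pick the candidate with the highest mean test score and compare it with the mean train score. The sketch below is an illustration and not part of `gridsearch_svm.py`; the numbers are copied from the log as printed, and the variable names are assumptions.

```python
# Values copied from the logged CV results, in candidate order.
mean_test_score = [0.72052102, 0.70246801, 0.72052102, 0.50457038, 0.72052102,
                   0.50388483, 0.72052102, 0.50388483, 0.72052102, 0.50388483]
mean_train_score = [0.7369849, 0.99657273, 0.7369849, 1.0, 0.7369849,
                    1.0, 0.7369849, 1.0, 0.7369849, 1.0]
kernels = ["linear", "rbf"] * 5
gammas = [0.1, 0.1, 1, 1, 2, 2, 3, 3, 4, 4]

# max() returns the first of several tied candidates.
best = max(range(len(mean_test_score)), key=lambda i: mean_test_score[i])
print("best: kernel={} gamma={} score={}".format(
    kernels[best], gammas[best], mean_test_score[best]))

# A large train-test gap suggests the candidate overfits the training folds.
gaps = [train - test for train, test in zip(mean_train_score, mean_test_score)]
for kernel, gamma, gap in zip(kernels, gammas, gaps):
    print("{:<6} gamma={:<3} train-test gap={:.3f}".format(kernel, gamma, gap))
```

The first linear candidate is selected, matching the logged best parameters. The rbf candidates with gamma of 1 or more reach a train score of 1.0 while their test score stays near 0.50.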
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
pandas==0.18.1 | ||
scikit-learn==0.17.1 | ||
numpy==1.11.0 | ||
virtualenv==1.11.4 | ||
matplotlib==1.5.3 | ||
seaborn==0.7.1 | ||
xgboost==0.6 |
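Pinned requirements use the form `name==version`. The sketch below flags lines that do not match that form before they are passed to `pip install -r`. It is an illustration only: the helper `malformed_pins` is hypothetical, and the pattern covers plain pins, not the full requirements file syntax.

```python
import re

# A plain pin: a package name, a double equals sign, and a version.
PIN = re.compile(r"^[A-Za-z0-9_.\-]+==[A-Za-z0-9_.\-]+$")


def malformed_pins(lines):
    """Return non-empty, non-comment lines that are not name==version pins."""
    bad = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#") and not PIN.match(line):
            bad.append(line)
    return bad


# Illustrative input, not the project's requirements file.
print(malformed_pins(["example-pkg==1.0", "other-pkg=2.0", ""]))
```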