GridSearch params added to final estimators

bhavika · Dec 18, 2016 · cf995d6
1 parent fcb271e
commit cf995d6
Showing 34 changed files with 283 additions and 2,873 deletions.
diff --git a/README.md b/README.md
@@ -1,2 +1,51 @@
-# JoyDivision
-Music Mood Classification on the Million Song Dataset
+## Music Mood Classification Using the Million Song Dataset
+
+### Installation
+
+1. Unzip tekwani.zip
+2. cd tekwani
+3. If you want to create a virtual environment, run virtualenv <the name of the environment, say 'tekwani'>
+4. To begin using the virtual environment, you must activate it.
+   $ source tekwani/bin/activate
+5. Now install the packages specified in requirements.txt. You can do this using
+   pip install -r requirements.txt
+   (pip freeze > requirements.txt is only needed to record the current state of the environment; it overwrites the file)
+
+
+### Running the solution
+
+1. Download the data from the Google Drive link here: http://bit.do/datasets
+   The total download size should be about 2.8 GB. 
+2. Move the downloaded files to tekwani/data and check that it contains the following files:
+    - fullset.pkl
+    - train.pkl
+    - test.pkl
+
+The folder tekwani/explore contains plots, output for different estimators' grid search results, scripts to handle h5 files and a list of getter methods 
+for h5 files (hdf5_getters.txt)
+
+The file `evaluation.py` is the final file that generates results as shown in the report. 
+It will run 6 models for 3 feature combinations - audio and descriptive, descriptive only, audio only.
+
+
+### Other files
+
+1. `create_dataset.py` builds the dataset and randomly splits it into a 60-40 distribution for train and test sets.
+2. `models.py` evaluates feature importance for ensemble estimators and performs cross validation for all estimators.
+3. `read_h5.py` is used to pull data out of HDF5 files. To run this, you need to download the Million Song Subset and place it in `data`
+4. `spotify.py` searches for Track IDs for the songs we have labeled in Spotify's database.
+5. `spotify_audio_features.py` fetches Danceability, Energy, Speechiness, Acousticness, Valence and Instrumentalness for all the track IDs we were able to get from 
+    `spotify.py`. 
+6. `rfe.py` gets the ranking of features. I use the number of optimal features (n) obtained here to select the top n features in `feature_importance.py` for the estimators.
+7. `feature_combinations.py` evaluates the importance of features when compared to the groups they're combined with. For example, descriptive features paired with timbre, 
+    audio features paired with descriptive, etc.
+8. `get_train_test.py` serves the train and test sets (*.pkl) to any file that imports it.
+9. `scratch/labels.csv` contains the full list of songs and the labels I assigned to them. 
+10. `scratch/models.out` contains the output of `models.py`. These are only cross validation results. 
+11. `learning_curve.py` plots the training score and cross validation score for an SVM with a linear kernel.
+12. `*.out` files are output files. 
+13. `hdf5_getters.py` is an interface provided along with the Million Song Dataset by LabROSA (Columbia University). It is used to read HDF5 files, the initial
+    form of the Million Song Dataset.
+14. All files named `gridsearch_*.py` are used to do a hyperparameter search for the models used. These models are AdaBoostClassifier, ExtraTreesClassifier, 
+    GradientBoostingClassifier, SVM, K-nearest neighbours and RandomForestClassifier.
+
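Note on the README's description of `create_dataset.py`: the script itself is not part of this diff, so the following is only a minimal sketch of what a random 60-40 split could look like. The function name `split_60_40` and the `seed` parameter are hypothetical and are not taken from the repository; it uses only the standard library.

```python
import random


def split_60_40(rows, seed=0):
    """Shuffle a copy of `rows` and return (train, test), with about 60% in train.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.6 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


train, test = split_60_40(range(10), seed=42)
print(len(train), len(test))
```

Fixing the seed matters here because `get_train_test.py` serves the pickled sets to other scripts; a split that changes between runs would make their results hard to compare.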
diff --git a/explore/gridsearch_svm.out b/explore/gridsearch_svm.out
@@ -0,0 +1,48 @@
+/home/bhavika/anaconda2/bin/python /home/bhavika/PycharmProjects/JoyDivision/src/gridsearch_svm.py
+Train--------
+False    2205
+True     2171
+Name: Mood, dtype: int64
+Test----------
+True     1522
+False    1498
+Name: Mood, dtype: int64
+SVM grid search:
+CV results {'std_train_score': array([ 0.02296023,  0.00114103,  0.02296023,  0.        ,  0.02296023,
+        0.        ,  0.02296023,  0.        ,  0.02296023,  0.        ]), 'rank_test_score': array([1, 6, 1, 7, 1, 8, 1, 8, 1, 8], dtype=int32), 'mean_score_time': array([ 0.21195304,  0.43780494,  0.40236795,  0.7167635 ,  0.27751696,
+        0.49956441,  0.17739654,  0.4735024 ,  0.19878852,  0.47999859]), 'std_test_score': array([  1.97534395e-02,   1.67744387e-02,   1.97534395e-02,
+         6.83468817e-04,   1.97534395e-02,   1.77551496e-06,
+         1.97534395e-02,   1.77551496e-06,   1.97534395e-02,
+         1.77551496e-06]), 'param_gamma': masked_array(data = [0.1 0.1 1 1 2 2 3 3 4 4],
+             mask = [False False False False False False False False False False],
+       fill_value = ?)
+, 'split1_train_score': array([ 0.71402467,  0.9954317 ,  0.71402467,  1.        ,  0.71402467,
+        1.        ,  0.71402467,  1.        ,  0.71402467,  1.        ]), 'split0_test_score': array([ 0.70077661,  0.68570123,  0.70077661,  0.50525354,  0.70077661,
+        0.50388305,  0.70077661,  0.50388305,  0.70077661,  0.50388305]), 'mean_test_score': array([ 0.72052102,  0.70246801,  0.72052102,  0.50457038,  0.72052102,
+        0.50388483,  0.72052102,  0.50388483,  0.72052102,  0.50388483]), 'param_C': masked_array(data = [3 3 3 3 3 3 3 3 3 3],
+             mask = [False False False False False False False False False False],
+       fill_value = ?)
+, 'split0_train_score': array([ 0.75994513,  0.99771376,  0.75994513,  1.        ,  0.75994513,
+        1.        ,  0.75994513,  1.        ,  0.75994513,  1.        ]), 'params': ({'kernel': 'linear', 'C': 3, 'gamma': 0.1}, {'kernel': 'rbf', 'C': 3, 'gamma': 0.1}, {'kernel': 'linear', 'C': 3, 'gamma': 1}, {'kernel': 'rbf', 'C': 3, 'gamma': 1}, {'kernel': 'linear', 'C': 3, 'gamma': 2}, {'kernel': 'rbf', 'C': 3, 'gamma': 2}, {'kernel': 'linear', 'C': 3, 'gamma': 3}, {'kernel': 'rbf', 'C': 3, 'gamma': 3}, {'kernel': 'linear', 'C': 3, 'gamma': 4}, {'kernel': 'rbf', 'C': 3, 'gamma': 4}), 'std_fit_time': array([ 0.04328609,  0.02048838,  0.12678599,  0.03631639,  0.53177655,
+        0.04621804,  0.15611959,  0.02182198,  0.11950135,  0.02658749]), 'std_score_time': array([ 0.02327311,  0.00198412,  0.02210987,  0.12797236,  0.06835306,
+        0.01183248,  0.01528645,  0.02398157,  0.00379264,  0.00293255]), 'mean_train_score': array([ 0.7369849 ,  0.99657273,  0.7369849 ,  1.        ,  0.7369849 ,
+        1.        ,  0.7369849 ,  1.        ,  0.7369849 ,  1.        ]), 'mean_fit_time': array([ 1.73574996,  0.73997653,  2.153373  ,  0.97303653,  2.27456653,
+        0.74146402,  1.66263652,  0.62005496,  2.29283547,  0.67161345]), 'param_kernel': masked_array(data = ['linear' 'rbf' 'linear' 'rbf' 'linear' 'rbf' 'linear' 'rbf' 'linear' 'rbf'],
+             mask = [False False False False False False False False False False],
+       fill_value = ?)
+, 'split1_test_score': array([ 0.74028349,  0.71925011,  0.74028349,  0.5038866 ,  0.74028349,
+        0.5038866 ,  0.74028349,  0.5038866 ,  0.74028349,  0.5038866 ])}
+Best SVM SVC(C=3, cache_size=200, class_weight=None, coef0=0.0,
+  decision_function_shape=None, degree=3, gamma=0.1, kernel='linear',
+  max_iter=-1, probability=False, random_state=None, shrinking=True,
+  tol=0.001, verbose=False)
+Best CV score for SVM 0.720521023766
+Best SVM params: {'kernel': 'linear', 'C': 3, 'gamma': 0.1}
+Finished in:  51.4404470921
+[Sun Dec 18 07:07:30 2016 : 424343] speechd: Speech Dispatcher 0.8 starting
+[Sun Dec 18 07:07:30 2016 : 424545] speechd: Speech Dispatcher already running.
+
+Speech Dispatcher already running.
+
+
+Process finished with exit code 0
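Two things stand out when reading the output above, and both can be checked with a few lines of arithmetic. First, the five `linear` candidates all report the same mean test score (0.72052102), which is consistent with `gamma` having no effect on a linear kernel, so the flat grid appears to fit the same linear model five times. Second, the `rbf` candidates with `gamma` of 2, 3 and 4 score 0.50388483, which is very close to the share of the larger class in the training set. The sketch below assumes the grid was written as one flat dict (inferred from the printed `params`, not from `gridsearch_svm.py` itself); `expand` is a hypothetical helper that mimics how a parameter grid is enumerated.

```python
from itertools import product


def expand(grid):
    """Expand a dict (or a list of dicts) of parameter lists into candidates.

    Hypothetical helper for illustration only.
    """
    grids = [grid] if isinstance(grid, dict) else grid
    candidates = []
    for g in grids:
        keys = sorted(g)
        for values in product(*(g[k] for k in keys)):
            candidates.append(dict(zip(keys, values)))
    return candidates


# Flat grid, as the printed 'params' suggest: every kernel is paired with every gamma.
flat_grid = {"kernel": ["linear", "rbf"], "C": [3], "gamma": [0.1, 1, 2, 3, 4]}

# Alternative: one sub-grid per kernel, so gamma is only varied for 'rbf'.
split_grid = [
    {"kernel": ["linear"], "C": [3]},
    {"kernel": ["rbf"], "C": [3], "gamma": [0.1, 1, 2, 3, 4]},
]

n_flat = len(expand(flat_grid))
n_split = len(expand(split_grid))

# Class counts copied from the "Train" section of the output.
train_counts = {False: 2205, True: 2171}
majority_baseline = max(train_counts.values()) / float(sum(train_counts.values()))

print(n_flat, n_split, round(majority_baseline, 8))
```

scikit-learn's `GridSearchCV` accepts a list of dicts for `param_grid`, so the second form can be passed directly. The baseline figure suggests, though it does not prove, that the high-`gamma` `rbf` models predict a single class on held-out folds while fitting the training folds perfectly (mean train score 1.0).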
diff --git a/explore/songs_df.py b/explore/songs_df.py
diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,7 @@
+pandas==0.18.1
+scikit-learn==0.17.1
+numpy==1.11.0
+virtualenv==1.11.4
+matplotlib==1.5.3
+seaborn==0.7.1
+xgboost==0.6