Using pipeline in the TCGA-MLexample, add feature selection #25
Conversation
@yl565, awesome! I've never used an sklearn pipeline, but I think it's the right move. Use the conda environment in #15 (even though it's not merged yet) and remove the pandas package list. Export the notebook to a script so we can comment on specific lines. Rather than create a new notebook, save to the original name. Let me know if you disagree.
You can use the following command to export all notebooks to scripts: `jupyter nbconvert --to=script --FilesWriter.build_directory=scripts *.ipynb`
OK, it's taking me a while to set up the environment and commit all the changes. Still learning how to use git. Is everything working? Also, can I ask what command we should use to commit all changes (including deleting some files)?
You can do
Everything is working. One issue is that you made your changes on top of ab862e0 (an old repo version) rather than ae27311 (the most recent version). Therefore, you're missing these four commits:
There may be a way for you to rebase and redo your changes on top of the most recent master. For now this isn't a huge deal, but it means that the pull request cannot be automatically merged. (I can still merge it manually though.)
Conflicts: scripts/1.TCGA-MLexample.py
Thanks for the tips. I synced my fork to master and updated the two files; looks like it can be auto-merged now?
Yeah, looks like what you did did the trick. The diff view on
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing, grid_search
from sklearn.linear_model import SGDClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer, StandardScaler
`Imputer` is never used. Let's keep the imports minimalist.
Those are all the comments I have.
OK, I updated the files. Also fixed a bug so that it's now printing the correct number of features.
Great work with this pull request. I'm going to merge it.
After I merged this, I realized that 34225cc (which all of your commits were squashed into) changed the mode of several text files to executable. This is something we will want to change back. I'll look into it later.
cognoma/machine-learning@34225cc accidentally changed the mode of these files. Revert mode to state prior to cognoma/machine-learning#25.
I modified the example to use the following pipeline:
median absolute deviation (MAD) feature selection -> z-score scaling -> grid search with SGDClassifier
Since a pipeline is used, all the feature extraction steps (feature selection/scaling) are estimated using only x_train. The mask of selected features and the estimated mean/std (for scaling) are then applied to x_test.
To reduce computation time, I used the top 500 features with the largest MAD. Higher performance may be achieved with more features included (grid searching the optimal number of features may also be an option).
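The steps above can be sketched roughly as follows. This is a minimal illustration using the modern sklearn API (`GridSearchCV` from `sklearn.model_selection` rather than the older `grid_search` module imported above) and a synthetic stand-in for the TCGA expression matrix; the data, grid values, and `mad_scores` helper are illustrative, not the PR's actual code.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expression matrix: 200 samples, 1000 features.
rng = np.random.RandomState(0)
X = rng.randn(200, 1000)
y = rng.randint(0, 2, size=200)

def mad_scores(X, y=None):
    """Score each feature by its median absolute deviation (unsupervised)."""
    return np.median(np.abs(X - np.median(X, axis=0)), axis=0)

# Grid search over the SGDClassifier regularization strength (values illustrative).
clf_grid = GridSearchCV(
    SGDClassifier(random_state=0),
    param_grid={'alpha': [1e-4, 1e-3, 1e-2]},
    cv=3,
)

# MAD feature selection -> z-score scaling -> grid-searched SGDClassifier.
pipeline = make_pipeline(
    SelectKBest(mad_scores, k=500),  # keep the top 500 features by MAD
    StandardScaler(),
    clf_grid,
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipeline.fit(X_train, y_train)  # selection mask and mean/std come from X_train only
print(pipeline.score(X_test, y_test))
```

Because every transformer is fit inside the pipeline, calling `pipeline.score(X_test, y_test)` applies the training-set feature mask and scaling parameters to the test data, avoiding leakage.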
(I noticed that in the example x_train and x_test are standardized separately, which works fine if x_test contains sufficiently many samples. From a practical point of view, I think it would work better to estimate the mean/std from x_train and apply it to standardize x_test.)
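The suggested alternative can be shown in a few lines; the data here is synthetic and only illustrates the fit-on-train, transform-on-test pattern:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data with nonzero mean and non-unit variance (illustrative values).
rng = np.random.RandomState(0)
X_train = rng.randn(100, 5) * 3.0 + 2.0
X_test = rng.randn(10, 5) * 3.0 + 2.0

scaler = StandardScaler().fit(X_train)   # mean/std estimated from training data only
X_train_std = scaler.transform(X_train)  # exactly zero mean / unit variance
X_test_std = scaler.transform(X_test)    # same training mean/std applied to the test set
```

Fitting the scaler only on `X_train` mirrors what a deployed model would see: at prediction time there may be only a handful of test samples, too few to estimate a stable mean and standard deviation of their own.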