Using pipeline in the TCGA-MLexample, add feature selection #25

Merged
merged 9 commits into from
Aug 5, 2016
Conversation

yl565
Contributor

@yl565 yl565 commented Aug 3, 2016

I modified the example to use the following pipeline:
median absolute deviation (MAD) feature selection -> z-score scaling -> grid search with SGDClassifier
Since a pipeline is used, all the feature extraction steps (feature selection/scaling) are estimated using only x_train. The mask of selected features and the estimated mean/std (for scaling) are then applied to x_test.

To reduce computation time, I used the top 500 features with the largest MAD. Higher performance may be achieved with more features included (grid searching over the number of features may also be an option).
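A minimal sketch of the pipeline described above, under stated assumptions: the data are synthetic stand-ins rather than the TCGA matrices, `mad_score` is a hypothetical helper name, and 20 features are kept instead of 500 so the example runs quickly. It uses `sklearn.model_selection`, which replaced the older `sklearn.grid_search` and `sklearn.cross_validation` modules in later scikit-learn releases. It is not necessarily how the notebook in this pull request implements the steps.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def mad_score(X, y=None):
    """Median absolute deviation of each column (hypothetical helper; y is ignored)."""
    X = np.asarray(X, dtype=float)
    return np.median(np.abs(X - np.median(X, axis=0)), axis=0)


# Synthetic stand-in for the expression matrix (X) and mutation status (y).
X, y = make_classification(n_samples=300, n_features=50, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# MAD feature selection -> z-score scaling -> SGDClassifier.
pipeline = make_pipeline(
    SelectKBest(score_func=mad_score, k=20),
    StandardScaler(),
    SGDClassifier(random_state=0),
)

# Grid search refits the whole pipeline on each training fold, so the feature
# mask and the mean/std never see held-out samples.
param_grid = {"sgdclassifier__alpha": [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="roc_auc")
search.fit(X_train, y_train)

# The fitted mask and scaling parameters are reused unchanged on X_test.
test_scores = search.decision_function(X_test)
```

Adding `"selectkbest__k"` to `param_grid` would be one way to grid search the number of features as well.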

(I noticed that in the example x_train and x_test are standardized separately, which works fine if x_test contains sufficient samples. From a practical point of view, I think it would work better to estimate the mean/std from x_train and apply them to standardize x_test.)
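To illustrate the difference between the two standardization approaches, here is a small sketch on random data (the arrays and variable names are assumptions for illustration, not taken from the notebook). With only a few test samples, standardizing the test set on its own statistics gives different values than reusing the training mean/std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(4, 3))  # deliberately tiny test set

# Approach suggested above: estimate mean/std on the training set only,
# then apply the same transformation to the test set.
scaler = StandardScaler().fit(X_train)
X_test_shared = scaler.transform(X_test)

# Approach noted in the original example: standardize the test set on its own.
X_test_separate = StandardScaler().fit_transform(X_test)

# Separate scaling forces the test columns to mean 0 regardless of how the
# test samples relate to the training distribution.
print(X_test_shared.mean(axis=0))
print(X_test_separate.mean(axis=0))
```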

@dhimmel
Member

dhimmel commented Aug 3, 2016

@yl565, awesome! I've never used an sklearn pipeline, but I think it's the right move.

Use the conda environment in #15 (even though it's not merged yet) and remove the pandas package list.

Export notebook to script so we can comment on specific lines.

Rather than create a new notebook, save to original name. Let me know if you disagree.

@dhimmel
Member

dhimmel commented Aug 3, 2016

You can use the following command to export all notebooks to scripts:

jupyter nbconvert --to=script --FilesWriter.build_directory=scripts *.ipynb

@yl565
Contributor Author

yl565 commented Aug 3, 2016

OK, it took me a while to set up the environment and commit all the changes. Still learning how to use git. Is everything working? Also, can I ask what command we should use to commit all changes (including deleting some files)?

@dhimmel
Member

dhimmel commented Aug 4, 2016

what command we should use to commit all changes (including deleting some files)

You can run git add --all, which stages every addition, modification, and deletion. This is sometimes too broad, as there may be changes you don't want to commit. In that case, you can run git rm deleted_file_name and then git add new_file_name.

@dhimmel
Member

dhimmel commented Aug 4, 2016

Still learning how to use git. Is everything working?

Everything is working. One issue is that you made your changes on top of ab862e0 (an old repo version) rather than ae27311 (the most recent version). Therefore, you're missing these four commits:

ae27311b684c31bacfee7d0dd2be9f418d7301fc Use grid_search in notebook and add visualization (#18)
1781a412c9ad818c9cb290e97bfbf29a2ed6075c Ignore Jupyter notebook checkpoints (#17)
7ab14ad44860c904abc4ecc768a13508b4f79575 Export notebook to .py file for easy diff viewing (#16)
892c6994d0f57c671611781d1b96cbba31ee3c7e adding machine learning example (#10)

There may be a way for you to rebase and redo your changes on top of the most recent master. For now this isn't a huge deal, but it means that the pull request cannot be automatically merged. (I can still merge it manually though).

@yl565
Contributor Author

yl565 commented Aug 4, 2016

Thanks for the tips. I synced my fork to master and updated the two files, looks like it can be auto-merged now?

@dhimmel
Member

dhimmel commented Aug 4, 2016

Yeah, looks like what you did did the trick. The diff view on scripts/1.TCGA-MLexample.py is now really helpful. I have a few more comments to make, which I'll do now.

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing, grid_search
from sklearn.linear_model import SGDClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer, StandardScaler
Member

Imputer is never used. Let's keep the imports minimalist.

@dhimmel
Member

dhimmel commented Aug 4, 2016

Those are all the comments I have.

@yl565
Contributor Author

yl565 commented Aug 5, 2016

OK, I updated the files. Also fixed a bug so that it's now printing the correct number of features.

In [13]:

# Typically, this can only be done where the number of mutations is large enough
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
'Size: {:,} features, {:,} training samples, {:,} testing samples'.format(len(X.columns), len(X_train), len(X_test))
Out[13]:
'Size: 20,501 features, 6,935 training samples, 771 testing samples'

@dhimmel
Member

dhimmel commented Aug 5, 2016

Great work with this pull request. I'm going to merge it.

@dhimmel dhimmel merged commit 34225cc into cognoma:master Aug 5, 2016
@dhimmel
Member

dhimmel commented Aug 5, 2016

After I merged this, I realized that 34225cc (which all of your commits were squashed into) changed the mode of several text files to executable. This is something we will want to change back. Will look more into it later.
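For context, git tracks whether a file is executable, so a mode change appears in the commit even when the file contents are unchanged. The sketch below shows one way such a change could be undone on a POSIX filesystem. It is an illustration only: the helper name `clear_executable_bits` and the temporary file are assumptions, and this is not necessarily how the follow-up commits reverted the modes.

```python
import os
import stat
import tempfile

# Execute permission bits for user, group, and others.
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def clear_executable_bits(path):
    """Remove execute permission from path; return True if the mode changed."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & EXEC_BITS:
        os.chmod(path, mode & ~EXEC_BITS)
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical text file that was accidentally marked executable (0o755).
    path = os.path.join(tmp, "example.py")
    with open(path, "w") as handle:
        handle.write("print('hello')\n")
    os.chmod(path, 0o755)

    changed = clear_executable_bits(path)
    final_mode = stat.S_IMODE(os.stat(path).st_mode)
```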

dhimmel added a commit to dhimmel/machine-learning that referenced this pull request Aug 5, 2016
cognoma/machine-learning@34225cc accidentally
changed the mode of these files. Revert mode to state prior to
cognoma#25.
dhimmel added a commit that referenced this pull request Aug 6, 2016
34225cc accidentally changed the mode of these files. Revert mode to state prior to #25.
dhimmel added a commit to cognoma/cognoml that referenced this pull request Oct 25, 2016
cognoma/machine-learning@34225cc accidentally changed the mode of these files. Revert mode to state prior to cognoma/machine-learning#25.