Using pipeline in the TCGA-MLexample, add feature selection #25
Conversation
@yl565, awesome! I've never used an sklearn pipeline, but I think it's the right move. Use the conda environment in #15 (even though it's not merged yet) and remove the pandas package list. Export the notebook to a script so we can comment on specific lines. Rather than create a new notebook, save to the original name. Let me know if you disagree.
You can use the following command to export all notebooks to scripts: `jupyter nbconvert --to=script --FilesWriter.build_directory=scripts *.ipynb`
OK, it's taking me a while to set up the environment and commit all the changes. Still learning how to use git. Is everything working? Also, can I ask what command we should use to commit all changes (including deleting some files)?
You can do
Everything is working. One issue is that you made your changes on top of ab862e0 (an old repo version) rather than ae27311 (the most recent version). Therefore, you're missing these four commits:
There may be a way for you to rebase and redo your changes on top of the most recent master. For now this isn't a huge deal, but it means that the pull request cannot be automatically merged. (I can still merge it manually though.)
Conflicts: scripts/1.TCGA-MLexample.py
Thanks for the tips. I synced my fork to master and updated the two files; looks like it can be auto-merged now?
Yeah, looks like what you did did the trick. The diff view on
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing, grid_search
from sklearn.linear_model import SGDClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer, StandardScaler
`Imputer` is never used. Let's keep the imports minimalist.
Those are all the comments I have.
OK, I updated the files. Also fixed a bug so that it's now printing the correct number of features.
Great work with this pull request. I'm going to merge it.
After I merged this, I realized that 34225cc (which all of your commits were squashed into) changed the mode of several text files to executable. This is something we will want to change back. I'll look into it later.
cognoma/machine-learning@34225cc accidentally changed the mode of these files. Revert mode to state prior to cognoma/machine-learning#25.
I modified the example to use the following pipeline:
median absolute deviation (MAD) feature selection -> z-score scaling -> grid search with SGDClassifier
Since a pipeline is used, all the feature extraction steps (feature selection/scaling) are estimated using only x_train. The mask of selected features and the estimated mean/std (for scaling) are then applied to x_test.
To reduce computation time, I used the top 500 features with the largest MAD. Higher performance may be achieved with more features included (grid searching the optimal number of features may also be an option).
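The steps above can be sketched roughly as follows. This is a minimal illustration using the modern sklearn API (`GridSearchCV` from `sklearn.model_selection` rather than the older `grid_search` module imported above) and a synthetic stand-in for the TCGA expression matrix; the data, grid values, and `mad_scores` helper are illustrative, not the PR's actual code.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expression matrix: 200 samples, 1000 features.
rng = np.random.RandomState(0)
X = rng.randn(200, 1000)
y = rng.randint(0, 2, size=200)

def mad_scores(X, y=None):
    """Score each feature by its median absolute deviation (unsupervised)."""
    return np.median(np.abs(X - np.median(X, axis=0)), axis=0)

# Grid search over the SGDClassifier regularization strength (values illustrative).
clf_grid = GridSearchCV(
    SGDClassifier(random_state=0),
    param_grid={'alpha': [1e-4, 1e-3, 1e-2]},
    cv=3,
)

# MAD feature selection -> z-score scaling -> grid-searched SGDClassifier.
pipeline = make_pipeline(
    SelectKBest(mad_scores, k=500),  # keep the top 500 features by MAD
    StandardScaler(),
    clf_grid,
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipeline.fit(X_train, y_train)  # selection mask and mean/std come from X_train only
print(pipeline.score(X_test, y_test))
```

Because every transformer is fit inside the pipeline, calling `pipeline.score(X_test, y_test)` applies the training-set feature mask and scaling parameters to the test data, avoiding leakage.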
(I noticed that in the example x_train and x_test are standardized separately, which works fine if x_test contains sufficiently many samples. From a practical point of view, I think it would work better to estimate the mean/std from x_train and apply it to standardize x_test.)
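The suggested alternative can be shown in a few lines; the data here is synthetic and only illustrates the fit-on-train, transform-on-test pattern:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data with nonzero mean and non-unit variance (illustrative values).
rng = np.random.RandomState(0)
X_train = rng.randn(100, 5) * 3.0 + 2.0
X_test = rng.randn(10, 5) * 3.0 + 2.0

scaler = StandardScaler().fit(X_train)   # mean/std estimated from training data only
X_train_std = scaler.transform(X_train)  # exactly zero mean / unit variance
X_test_std = scaler.transform(X_test)    # same training mean/std applied to the test set
```

Fitting the scaler only on `X_train` mirrors what a deployed model would see: at prediction time there may be only a handful of test samples, too few to estimate a stable mean and standard deviation of their own.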