
Improved the performance of the SGD classifier on sparse mutations by reducing the noise #71

Merged
merged 11 commits
Feb 2, 2017

Conversation

htcai
Member

@htcai htcai commented Dec 2, 2016

In response to #52, I experimented with a variety of methods to improve the performance of the SGD classifier on sparse genes (e.g., RIT1, MAP2K1).

Mainly, I used PCA to reduce the noise and then employed LDA to compress the data to 2 dimensions. This strategy is motivated by the apparent overfitting for many genes observed in #52.

Taking RIT1 as an example, the improvement is notable: the test AUROC is 0.837, compared with 0.65 in #52. For several other genes I explored (e.g., MAP2K1, SMAD2), the improvement is mostly between 0.07 and 0.10. Moreover, the total time needed for dimensionality reduction and model training is less than half a minute on Ubuntu on my ASUS laptop.

Still, there are a few issues. The optimal n_components of PCA differs across genes, so it would be great to search over multiple values of n_components, which should be possible with a pipeline. However, Ubuntu (8 GB RAM + 8 GB swap) always runs out of memory when running the pipeline. My MacBook does not have this issue, since it uses memory compression, but it never finishes running the pipeline. Hence, I didn't use a pipeline in the submitted file.

Also, I am not sure how to interpret the probability plot at the end of the notebook.

Any suggestions and/or comments are welcome. Thanks in advance!

Member

@gwaybio gwaybio left a comment

Nice work! Really cool idea combining PCA and LDA - it is surprising that the ROC curves look so good.

There are some plots (namely, the decision plot and probability plot) that are concerning. I made some comments inline.

I also have a general comment about RIT1:

  1. It is an oncogene important in the RAS/MAPK pathway
  2. Therefore, it would be interesting to test if a RIT1 classifier can also classify KRAS/NRAS/HRAS mutations.
  3. Along the same lines, can a KRAS/NRAS/HRAS classifier predict RIT1?

I think these questions could offer an alternative way of improving the performance of a RIT1-specific classifier, but they may not work for other genes with low prevalence.

# In[9]:

# Typically, this can only be done where the number of mutations is large enough
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
Member

Looks like with this split there are only (6016 * 0.002874) * 0.1 ≈ 1.7 gold standard positives in the test set. We may want to get a bit creative to overcome this sample size issue.

Member Author

I decided to change the test size to 0.2 and set stratify=y, which should make the testing score more reliable.
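For reference, a stratified split along these lines might look like the following (X and y as in the earlier cells; a sketch, not the exact notebook code):

from sklearn.model_selection import train_test_split

# Hold out 20% of samples; stratify=y keeps the positive/negative ratio
# the same in the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)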


# In[10]:

get_ipython().run_cell_magic('time', '', 'scale = StandardScaler()\nX_train_scale = scale.fit_transform(X_train)\nX_test_scale = scale.transform(X_test)')
Member

Any reason why you are fit_transforming X_train but not X_test? Same comment about the PCA implementation below.

Member Author

When training the models for feature transformation and dimensionality reduction, I think I should only use information from the training data. An example of this approach can be found in Raschka 2015.


# In[12]:

get_ipython().run_cell_magic('time', '', '\nlda = LinearDiscriminantAnalysis(n_components=2)\nX_train_lda = lda.fit_transform(X_train_pca, y_train)\nX_test_lda = lda.transform(X_test_pca)')
Member

Wow, only 2 components! Can you add a scatter plot here where you color points by RIT1 mutation status?

Member Author

Sorry that I misunderstood the number of components allowed for LDA. According to the documentation of sklearn.discriminant_analysis.LinearDiscriminantAnalysis, n_components cannot be larger than n_classes - 1. Therefore, in this two-class case LDA can only produce one-dimensional output, which is probably not very informative. But now I am using PCA only, keeping the data at 30 dimensions.

# Ignore numpy warning caused by seaborn
warnings.filterwarnings('ignore', 'using a non-integer number instead of an integer')

ax = sns.distplot(predict_df.query("status == 0").decision_function, hist=False, label='Negatives')
Member

It looks like the LDA transform causes the decision function to shift far to the negative. Any reason for this?

# In[24]:

ax = sns.distplot(predict_df.query("status == 0").probability, hist=False, label='Negatives')
ax = sns.distplot(predict_df.query("status == 1").probability, hist=False, label='Positives')
Member

It looks like the positives all have extremely low probability of being positive but the negatives are flat. Looks fishy. Is there an intuition as to why this is happening?

@htcai
Member Author

htcai commented Dec 7, 2016

@gwaygenomics thanks a lot for so many important comments and questions! I will try to answer the questions one by one.

@KT12
Contributor

KT12 commented Dec 7, 2016

@htcai I'll just leave this link for reference since it contains discussion of PCA >> LDA for classification.

@htcai
Member Author

htcai commented Dec 8, 2016

@KT12 thank you for the link! That is actually where I learned about this strategy. But now I believe that I need to re-read the post and the linked articles.

@htcai
Member Author

htcai commented Dec 20, 2016

Following @dhimmel 's suggestion, I am now using only PCA to remove the noise and reduce the dimensionality to 30. I renamed the file to RIT1-PCA-htcai.ipynb. As a side note, I corrected my misunderstanding of LDA: n_components cannot be larger than n_classes - 1, according to the documentation. Therefore, LDA can only produce one-dimensional output in this case, which is not very (intuitively) informative.

Firstly, I set test_size=0.2 and stratify=y, since there are very few positive samples in the dataset, as @gwaygenomics pointed out.

Also, I scaled the data both before and after PCA. Without the second round of scaling, the range of decision_function is extremely large (e.g., > 50,000). The extra scaling also improved the testing score, which is now 0.843, even better than the result from the earlier analysis involving LDA.

In addition, the plot of predict_proba looks bumpy, which may well be a consequence of PCA.

Please let me know what I should do to further improve the notebook. Thanks in advance!

Moreover, I am interested in @gwaygenomics 's comments on the pathway questions. I would like to further explore the connection between RIT1 and KRAS/NRAS/HRAS.

Member

@dhimmel dhimmel left a comment

Nice job with some really interesting and helpful results. Made some comments. We can discuss at the meetup tomorrow if you can make it.

# In[1]:

import os
import urllib
Member

Can remove urllib and random imports

# In[8]:

# Here are the percentage of tumors with RIT1
y.value_counts(True)
Member

Can you also run y.value_counts()? There are so few positives I want to know the exact number... this is cool!

get_ipython().run_cell_magic('time', '', 'scale_pre = StandardScaler()\nX_train_scale = scale_pre.fit_transform(X_train)\nX_test_scale = scale_pre.transform(X_test)')


# ## Reducing noise via PCA
Member

Let's get rid of %%time in this section for better diff viewing, and in the "Feature Standardization" section as well.


# In[10]:

get_ipython().run_cell_magic('time', '', 'scale_pre = StandardScaler()\nX_train_scale = scale_pre.fit_transform(X_train)\nX_test_scale = scale_pre.transform(X_test)')
Member

I think you can write this all in a pipeline, which will simplify the code and make the process more clear. See example in 2.TCGA-MLexample.py.

Member

We can discuss at the meetup whether we should do cross-validation the right way or the wrong-but-faster way, as discussed in #70 (comment).


param_grid = {
'alpha': [10 ** x for x in range(-4, 1)],
'l1_ratio': [0, 0.2, 0.5, 0.8, 1],
Member

I'd sample alpha more densely and reduce l1_ratio to [0, 0.15]. See #56.
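One possible way to write that denser grid (values illustrative only, not the exact grid used later):

import numpy as np

# Log-spaced alpha values and the restricted l1_ratio suggested above
param_grid = {
    'alpha': list(10 ** np.linspace(-4, 0, num=20)),
    'l1_ratio': [0, 0.15],
}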


# In[17]:

# Cross-validated performance heatmap
Member

This figure shows how hard the problem is. We're struggling to find a nice optimum of parameters. Performance is all over the place.


# In[11]:

get_ipython().run_cell_magic('time', '', 'n_components = 30\npca = PCA(n_components=n_components, random_state=0)\nX_train_pca = pca.fit_transform(X_train_scale)\nX_test_pca = pca.transform(X_test_scale)')
Member

It would be nice to know what percentage of variance 30 components capture. It seems like 30 could be removing some important signal.

# Percentage of preserved variance
print('{:.4}'.format(sum(pca.explained_variance_ratio_)))
Member

So 35 components capture ~60% of the variance. Can we increase the number of components until we capture at least 90% of the variance?
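If we do want to target a variance fraction instead of a fixed component count, one possible sketch (assuming a scikit-learn version that accepts a float n_components; X_train_scale as in the notebook):

from sklearn.decomposition import PCA

# A float n_components keeps as many components as are needed to explain
# at least that fraction of the variance (requires the full SVD solver).
pca = PCA(n_components=0.9, svd_solver='full', random_state=0)
X_train_pca = pca.fit_transform(X_train_scale)
print(pca.n_components_, sum(pca.explained_variance_ratio_))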

Member Author

@htcai htcai Jan 10, 2017

As I mentioned below, increasing the dimensionality immediately leads to over-fitting and an extremely low testing score. It appears that a large proportion of the variation in the dataset is probably noise.

@htcai
Member Author

htcai commented Jan 10, 2017

I made the following main changes.

  1. I increased the proportion of testing data to 30%, which leads to a more interpretable visualization of probability (i.e., the last plot). As @gwaygenomics pointed out, a small portion of negative samples were mistakenly predicted to be positive, which is displayed as a second (but small) peak in the high-probability range. In addition, although this strategy slightly reduces the training and testing score (to 0.763), the testing score is still better than the baseline SGD Classifier (0.65).

  2. Following @dhimmel 's suggestion, I fixed the l1_ratio and searched over a denser range of alpha (np.linspace(10**-4, 10**-2, num=16)). The validation results display a complicated effect of changing alpha.

  3. I calculated the explained variance ratio of the 35 components preserved by PCA (5 more than before), which turns out to be 0.5996. This appears to be quite low. However, if more components are kept, over-fitting occurs immediately. Moreover, I am applying PCA to reduce the noise in the data, so a low explained_variance_ratio may not be unexpected.

  4. Raw counts of positive samples are provided for both the testing set and the entire dataset.

@dhimmel
Member

dhimmel commented Jan 10, 2017

Great work with this pull request -- this is interesting stuff!

Can you re-export RIT1-PCA-htcai.py?

The validation results display a complicated effect of changing alpha.

Here's the plot:

[plot: perf-grid]

I don't actually think the effect of alpha is complicated. Instead, I think we have a really noisy estimate of performance. It is concerning that we're not able to identify an optimal alpha range; instead, the best-performing values are scattered. Yuck! The bright side is that if we can crack how to deal with these low-positive-count situations, we're doing something really powerful.

I think we may be able to decrease the noise in our performance measurements by playing with the cv parameter of GridSearchCV. Let's start by trying out 10-fold cross-validation (cv=10). If that still doesn't clarify things, we may need repeated cross-validation, although we'll have to see how inconvenient that is in sklearn (very, according to this answer, although I don't think it should actually be that hard).
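A minimal sketch of the 10-fold version (clf and param_grid as defined in the notebook; illustrative only):

from sklearn.model_selection import GridSearchCV

# cv=10 gives 10-fold (stratified for classifiers) cross-validation,
# which should reduce the noise in the performance estimates.
cv = GridSearchCV(estimator=clf, param_grid=param_grid,
                  cv=10, scoring='roc_auc', n_jobs=-1)
cv.fit(X=X_train_scale, y=y_train)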

I calculated the explained variance ratio of the 35 components preserved by PCA (5 more than before), which turns out to be 0.5996. This appears to be quite low. However, if more components are kept, over-fitting occurs immediately.

Interesting. I'm worried that for some mutations the ~40% of uncaptured variance will be important. But I agree it's more important to actually be able to fit a model than include all the relevant features.

@dhimmel
Member

dhimmel commented Jan 10, 2017

We could also use StratifiedShuffleSplit -- while not as elegant as repeated cross-validation, it should work.

On second thought, perhaps it's even more elegant!

@htcai
Member Author

htcai commented Jan 10, 2017

@dhimmel Thanks a lot for your detailed and valuable comments! I will re-export the .py script after I further revise the notebook according to your suggestions. Unfortunately, the pipeline still drains the memory even if I only cross-validate over two values of alpha.

In order to locate the optimal range of alpha, a cross-validation involving a larger number of values (e.g., 100 or 200) can be run. After the dimensionality reduction, it will probably take less than one minute if alpha is the only hyper-parameter being tuned.

Training a classifier on an extremely imbalanced dataset has always been a thorny issue. I tried under-sampling and over-sampling several weeks ago, but it turned out to be a disaster: the performance was much worse than the baseline SGD classifier. I am happy to explore other methods later.

It is a good idea to try 10-fold cross-validation. My only concern is that there will be only 1 or 2 positive samples in each of the 10 folds. Still, only the experimental results can tell whether that is a problem.

I agree that the removal of ~40% of the variance is a problem for other mutations. For the roughly 10 mutations I experimented with, the optimal number of principal components is between 30 and 100. Therefore, given sufficient computational resources, a GridSearchCV over n_components should work for optimizing the classifier for a different mutation.
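A hypothetical sketch of what that search could look like (component and alpha values illustrative; memory cost is the open question):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Pipeline that lets the grid search tune the PCA dimensionality
# alongside the classifier's regularization strength.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(random_state=0)),
    ('clf', SGDClassifier(random_state=0, class_weight='balanced',
                          loss='log', penalty='elasticnet')),
])
param_grid = {
    'pca__n_components': [30, 50, 80, 100],
    'clf__alpha': [10 ** x for x in range(-4, 1)],
}
search = GridSearchCV(pipe, param_grid, scoring='roc_auc', n_jobs=-1)
# search.fit(X_train, y_train)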

@dhimmel
Member

dhimmel commented Jan 10, 2017

Therefore, given sufficient computational resources, a GridSearchCV over n_components should work for optimizing the classifier for a different mutation.

Great idea -- really insightful! Agree that computation time will be the drawback.

My only concern is that there will be only 1 or 2 positive samples in each of the 10 folds.

There will be more positives in the training folds, which is ultimately more important. The larger number of testing folds (10 versus 3) will counterbalance the smaller number of positives in each. This is why leave-one-out CV is best, even though there are often 0 positives in a given testing fold.

@htcai
Member Author

htcai commented Jan 14, 2017

@dhimmel I used 10-fold GridSearchCV without StratifiedShuffleSplit and searched over a larger space of alpha: [x for x in np.linspace(10**-4, 10**-2, num=200)]. The performance measurement is still rather noisy. I probably need to try repeated cross-validation to reduce the noise.

On the other hand, StratifiedShuffleSplit tends to cause over-fitting and makes the selection prefer extremely large values.

@dhimmel
Member

dhimmel commented Jan 14, 2017

@htcai interesting. I'd love to see 1 × 10-fold for larger alphas, like you did for StratifiedShuffleSplit. I'm wondering if these higher alphas are actually producing better models... I don't really see a way that StratifiedShuffleSplit could cause inflated CV performance.

Do you want to switch the notebook to StratifiedShuffleSplit, so I can look over that code? Then let's see the coefficients of the best model to see what's going on with these high-performing large-alphas.

An alternative would be 10 × 10-fold CV, but that could be a pain to implement. And I don't really think it's fundamentally different from StratifiedShuffleSplit -- just a little more efficient.

@htcai
Member Author

htcai commented Jan 14, 2017

I updated the notebook with StratifiedShuffleSplit and searched for alpha over [2**x for x in range(-20, 60)]. Larger values of alpha are generally preferable, but 0.5 is the optimal one.

It is good to see that the positive and the negative curves are more neatly segregated in the plots of decision_function and probability.

However, over-fitting seems to be pronounced.

cv_folds = [[t, c] for t, c in cv_folds]


# In[17]:

get_ipython().run_cell_magic('time', '', "clf = SGDClassifier(random_state=0, class_weight='balanced', loss='log', penalty='elasticnet')\ncv = GridSearchCV(estimator=clf, param_grid=param_grid, n_jobs=-1, scoring='roc_auc')\ncv.fit(X = X_train_scale, y=y_train)")
Member

Move everything besides cv.fit(X = X_train_scale, y=y_train) out of the %%time cell for better py export.

Member

In GridSearchCV you need to specify cv=sss.
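For example (sss, clf, and param_grid as defined in the notebook; a sketch only):

from sklearn.model_selection import GridSearchCV

# Passing the splitter object makes GridSearchCV draw its folds
# from StratifiedShuffleSplit instead of the default 3-fold CV.
cv = GridSearchCV(estimator=clf, param_grid=param_grid,
                  cv=sss, scoring='roc_auc', n_jobs=-1)
cv.fit(X=X_train_scale, y=y_train)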

# In[14]:
# In[16]:

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
Member

If computation time doesn't become prohibitive, I'd love something more thorough such as:

StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)

n_splits times test_size equals the average number of times each observation will be used for testing in CV. The higher this is, the less variation we should have in our performance estimates. However, we want test_size to be low so that the percentage of observations used for training is maximized. Therefore, we should increase n_splits up to the point where it starts taking a long time. Make sense?

Member Author

Yes, this is very sensible. It does not take a long time to run the grid search with 100 splits. Actually, I set n_splits=150. Now the heat map looks much less noisy.


sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
cv_folds = sss.split(X_train_scale, y_train)
cv_folds = [[t, c] for t, c in cv_folds]
Member

Computing cv_folds yourself may not be necessary.

@dhimmel
Member

dhimmel commented Jan 16, 2017

Cool. One last thing, so we can understand what's going on with the super-high alphas: can you export the coefficients from the best model with alpha = 2048?

@htcai
Member Author

htcai commented Jan 16, 2017

That is a good idea! I visualized the coefficients of the best_estimator_ with a bar plot.
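For reference, one way that extraction could look (cv is the fitted GridSearchCV, and the estimator is the bare SGDClassifier at this point; a sketch only):

import pandas as pd

# Pull the coefficient vector out of the grid search winner and plot it
best_clf = cv.best_estimator_
coef = pd.Series(best_clf.coef_[0])
coef.plot(kind='bar', figsize=(14, 3), title='Coefficients of the best model')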

Member

@dhimmel dhimmel left a comment

I'm going over these results with @cgreene right now. We think the predicted probabilities should be centered around 0.0029 (21 / (21 + 7285)). So we're trying to diagnose what's going on. Two changes to param_grid will help us figure this out... hopefully.


param_grid = {
'alpha': [2**x for x in range(-20, 60)],
'l1_ratio': [0]
Member

Can add 0.15 to this list and rerun?

# In[15]:

param_grid = {
'alpha': [2**x for x in range(-20, 60)],
Member

Let's reduce the upper bound to 30 at most.

@htcai
Member Author

htcai commented Jan 16, 2017

I searched over four values [0, 0.05, 0.1, 0.15] and the best parameters are {'alpha': 1, 'l1_ratio': 0.05}. Still, it seems that the predicted probabilities are higher than expected.

Member

@dhimmel dhimmel left a comment

Yeah I'm a little confused by the probabilities. Using a pipeline would help us make sure the right transformations are being applied.

Otherwise these are fantastic results. Really like the new CV plot!


# In[24]:

X_transformed = scale_post.transform(pca.transform(scale_pre.transform(X)))
Member

Did you want to put scale_pre, pca, and scale_post in a pipeline? This would really simplify things and help avoid errors.
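A minimal sketch of that transformation-only pipeline (n_components and names follow the notebook; not the exact implementation):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# One fit/transform call applies scale_pre, PCA, and scale_post
# in a fixed order, which avoids chaining them by hand.
transform = Pipeline([
    ('scale_pre', StandardScaler()),
    ('pca', PCA(n_components=35, random_state=0)),
    ('scale_post', StandardScaler()),
])
X_train_t = transform.fit_transform(X_train)
X_test_t = transform.transform(X_test)
X_transformed = transform.transform(X)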

@htcai
Member Author

htcai commented Jan 19, 2017

I finally got it running with a pipeline (hopefully in the correct order) that includes scaling, PCA, and the classifier. To keep the running time manageable, I can only search over a smaller space of alpha and use 3-fold cross-validation. As a consequence, I decided to increase test_size in StratifiedShuffleSplit to 0.2 and to use only 20% of the entire data set for testing. Also, the CV plot does not seem as good as before.

@dhimmel
Member

dhimmel commented Jan 29, 2017

Hey @htcai, I meant to suggest using a pipeline but preserving the exact functionality. Sorry I wasn't more clear. Feel free to revert to 544fd03.

I don't want to drag this pull request out, so either I can merge now, or we can talk Tuesday about the pipeline method I meant to suggest.

@htcai
Member Author

htcai commented Jan 31, 2017

@dhimmel Thanks for your clarification. I guess you are suggesting pipelining the scaling and the PCA while leaving the classifier alone. If this is not the case, we should talk further about the implementation tomorrow.

I increased the number of principal components to 50, which is a plausible value if we want a uniform one for all of the sparse genes. The testing score for RIT1 actually improved compared with 30 and 35.

@dhimmel
Member

dhimmel commented Feb 1, 2017

@htcai I remember that you made some progress last night at the meetup. Let me know when this is ready for me to take a look again.

@htcai
Member Author

htcai commented Feb 2, 2017

@dhimmel I am not sure whether you are referring to pipelining the data standardization and PCA. If this is the case, then it is already contained in the latest commit.

I tried FastICA and NMF. Both of them, with different n_components, either skew the predicted probabilities (as does PCA) or drop the testing score below random guessing (i.e., 50%).

In addition, I am currently running NMF with 500 components. It seems that it will not finish within a reasonable amount of time.

@dhimmel
Member

dhimmel commented Feb 2, 2017

@dhimmel I am not sure whether you are referring to pipelining the data standardization and PCA. If this is the case, then it is already contained in the latest commit.

Great, that's what I was referring to. This PR is ready to merge. Let's do a separate for the other dimensionality reduction methods.
