
Add initial machine learning pipeline #57

Merged · 27 commits · Nov 30, 2016

Conversation

redshiftzero (Contributor)

This PR adds an initial machine learning pipeline that takes the features in the database, trains a series of binary classifiers, evaluates how well each classifier performs, and then saves the relevant performance metrics in the database, as well as pickling the trained model objects (for use in future scoring). The work in this PR corresponds to the latter half of this diagram, from the features schema onward:
[diagram: pipeline]

A more complete description of our pipeline is in docs/pipeline.md, and a (very) brief description of how specialized classifiers might be integrated is in CONTRIB.md.

@psivesely (Contributor) left a comment

Once the changes and a rebase are made, I'll take another look. I'll do my part to help us iterate faster than we did on the features branch.


Run this step with:

```
python features.py
```
Contributor

This fails because the default Python in Xenial is 2.7. `./features.py` will read the shebang and choose the version appropriately. Be sure to fix the ones below too.
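
For illustration, a minimal sketch of the shebang approach (assuming the scripts target Python 3):

```python
#!/usr/bin/env python3
# With this shebang and the executable bit set (chmod +x features.py),
# `./features.py` runs under python3 even where the default `python` is 2.7.
print("running under the interpreter named in the shebang")
```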

Contributor Author

Word - changed. Also changed `attack.py` below.


* `num_kfolds`: value of k for k-fold cross-validation

* `feature_scaling`: this option will take the features and rescale them to zero mean and unit standard deviation. For some classifiers, primarily those based on decision trees, this should not improve performance, but for many others, e.g. SVMs, it is necessary.
Contributor

Proposal to help ML newbies like me understand what's going on here:

`feature_scaling`: setting feature scaling will take the features and compute their standard score, re-scaling them to zero.... See also scikit-learn's documentation.

Contributor Author

Added these links!
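
For context, a minimal sketch of standard scoring with scikit-learn's `StandardScaler` (synthetic data; the pipeline's actual wiring may differ):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
# Each column is rescaled to zero mean and unit standard deviation.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
```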

* `auc`: [Area under the ROC curve](http://people.inf.elte.hu/kiss/12dwhdm/roc.pdf)
* `tpr`: true positive rate [array for default sklearn thresholds]
* `fpr`: false positive rate [array for default sklearn thresholds]
* `precision_at_k` for `k=[0.01, 0.05, 0.1, 0.5, 1, 5, 10]`: "Fraction of SecureDrop users correctly identified in the top k percent"
Contributor

Percent of what? I'll probably figure it out later as I read on in the docs or source, but it would be better if it were clarified here.

Contributor Author

Percentage of the testing set; I've now stated that explicitly.


Args:
options [dict]: attack setup file
"""
Contributor

The last code I wrote (see:

    """Return an :obj:`collections.OrderedDict` from ``dict_str``.

), I wrote with the documentation style described in #53. I'm not absolutely set on one particular style, but we should be consistent. If you want to make other suggestions, do so in #53 and we can discuss, but if you also like the python-gnupg style, then you should make the appropriate changes here.

Contributor Author

I was using Google-style docstrings, but I don't have strong feelings one way or the other, so the docstring format you like there is fine with me, and from now on I will follow it. However, I wrote most of the existing docstrings in this branch before that issue was filed, so in the interest of time I'd rather not rewrite them at this stage.

Contributor

Sgtm.
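
For reference, a minimal sketch of the reST-flavored (python-gnupg-like) docstring style under discussion, on a hypothetical function:

```python
import ast
import collections

def parse_ordered_dict(dict_str):
    """Return an :obj:`collections.OrderedDict` from ``dict_str``.

    :param str dict_str: String representation of a dict literal.
    :rtype: collections.OrderedDict
    """
    return collections.OrderedDict(ast.literal_eval(dict_str))
```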

import matplotlib.pyplot as plt
import numpy as np
import pickle
from scipy import interp
Contributor

Don't see you using this dependency, and I believe you meant to import the `interpolate` sub-package anyway.

Contributor Author

Removed this.

return fig


def precision_recall_at_x_proportion(test_labels, test_predictions, x_proportion=0.01,
Contributor

No idea what this function is doing or why. It needs at least a docstring, maybe a comment in `get_metrics` as well. Also, can you move this function right below `get_metrics`, since the two are related and the two functions above this one are not?

Contributor Author

Added a docstring and moved it up in the file.
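
For readers following along, a hypothetical sketch of a precision-at-top-k-percent metric (illustrative only, not the branch's exact implementation):

```python
import numpy as np

def precision_at_k_percent(test_labels, test_scores, k=0.1):
    """Precision over the top k percent of test examples, ranked by score.

    E.g., with 10,000 test examples and k=0.1, precision is computed
    over the 10 highest-scored examples.
    """
    n_top = max(1, int(np.ceil(k / 100.0 * len(test_scores))))
    top = np.argsort(test_scores)[::-1][:n_top]
    return float(np.mean(np.asarray(test_labels)[top]))
```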

self.feature_scaling = feature_scaling
self.db = database.ModelStorage()

def get_dict(self):
Contributor

There's a built-in attribute that does this, `self.__dict__`. I would just rename the object attributes to match sklearn's. More clear that way anyway.

Contributor Author

Done in 5a47434
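
A tiny sketch of the point above (hypothetical class, not the branch's actual code): once attribute names mirror sklearn's keyword arguments, `self.__dict__` replaces a hand-rolled `get_dict()`:

```python
class ModelConfig(object):
    def __init__(self, n_estimators=100, max_depth=None):
        # Attribute names deliberately match sklearn's keyword arguments.
        self.n_estimators = n_estimators
        self.max_depth = max_depth

config = ModelConfig()
print(config.__dict__)  # {'n_estimators': 100, 'max_depth': None}
# ...which can be splatted straight into an estimator, e.g.
# RandomForestClassifier(**config.__dict__)
```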

return engine


class DatasetLoader(object):
Contributor

`class DatasetLoader(Database)`, and the classes below too.

Contributor Author

Done
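
A minimal sketch of the suggested inheritance (base-class internals and the connection URL are hypothetical):

```python
from sqlalchemy import create_engine

class Database(object):
    """Shared connection plumbing for the storage classes."""
    def __init__(self, db_url="postgresql:///fpsd"):  # hypothetical URL
        self.engine = create_engine(db_url)

class DatasetLoader(Database):
    """Inherits the engine rather than constructing its own."""
    pass
```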

options["world_type"], options["model_type"],
options["base_rate"], json.dumps(options["hyperparameters"]),
self.metric_formatter(eval_metrics)))
with safe_session(self.engine) as session:
Contributor

`with self.safe_session() as session:`, and below too.

Contributor Author

Done
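
For readers unfamiliar with the pattern, a sketch of `safe_session` as a method, written as a SQLAlchemy context manager (assumes the class sets `self.engine`; details may differ from the branch):

```python
from contextlib import contextmanager
from sqlalchemy.orm import sessionmaker

class Database(object):
    # __init__ (elsewhere) sets self.engine to a SQLAlchemy engine.

    @contextmanager
    def safe_session(self):
        """Yield a session, committing on success and rolling back on error."""
        session = sessionmaker(bind=self.engine)()
        try:
            yield session
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()
```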

def get_feature_importances(model):

try:
return model.feature_importances_
Contributor

This attribute is cool. If we can identify features that are consistently found not to be useful for our top classifiers, we could use this information to help us implement the feature selection stage described in #63.

Contributor Author

👍
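
A quick sketch of how that could feed into feature selection (synthetic data for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Feature indices from most to least important; features that rank
# consistently low across the top classifiers are pruning candidates.
ranking = np.argsort(model.feature_importances_)[::-1]
print(ranking)
```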

@redshiftzero force-pushed the ml-classifiers branch 2 times, most recently from cdf0dc7 to 07dc7b8 (November 17, 2016 01:43)
@coveralls


Coverage remained the same at 72.727% when pulling 847e4e6 on ml-classifiers into b183c0c on master.

Database needs a password, so I'm not sure how this was working before? Either way, it's there now, and it reads from the environment variable $PGPASSFILE.
The existing code was always using the test database, which is not what we want. The production or test database can be selected using this new test keyword.
@coveralls


Coverage remained the same at 72.727% when pulling 5a47434 on ml-classifiers into b183c0c on master.

@redshiftzero (Contributor Author)

OK, comments addressed: the Ansible-ification of the creation of the models schema and tables is done, Travis builds are passing, and it's rebased on current master. Should be good to go 🌞

@conorsch (Contributor)

What a review process this has been! Thanks for your patience here, @redshiftzero. Given the frequent back-and-forth, I'm inclined to merge, and we can bite off smaller hunks to discuss in discrete issues going forward.

@redshiftzero (Contributor Author) commented Nov 30, 2016

👍 sounds good - we can make issues for any other outstanding problems and address them in smaller PRs.

@conorsch dismissed psivesely's stale review (November 30, 2016 23:26)

@redshiftzero has implemented the changes requested

@conorsch (Contributor) left a comment

Changes requested during review have been implemented. :shipit:
