
Add initial machine learning pipeline #57

Merged · 27 commits · Nov 30, 2016

Conversation

redshiftzero (Contributor)

This PR adds an initial machine learning pipeline that takes the features in the database, trains a series of binary classifiers, evaluates how well each classifier performs, and then saves the relevant performance metrics in the database, as well as pickling the trained model objects (for use in future scoring). The work in this PR corresponds to the latter half of this diagram, from the features schema onward:
[diagram: pipeline]

A more complete description of our pipeline is in docs/pipeline.md, and a (very) brief description of how specialized classifiers might be integrated is in CONTRIB.md.

@psivesely (Contributor) left a comment

Once the changes and a rebase are made, I'll take another look. I'll do my part to help us iterate faster than we did on the features branch.


Run this step with:

```
python features.py
```
Contributor

This fails because the default Python in Xenial is 2.7. `./features.py` will read the shebang and choose the version appropriately. Be sure to fix the ones below too.
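
For illustration, a minimal sketch of the shebang approach (assuming the scripts target Python 3):

```python
#!/usr/bin/env python3
# With this shebang and the executable bit set (chmod +x features.py),
# `./features.py` runs under python3 even where the default `python` is 2.7.
print("running under the interpreter named in the shebang")
```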

Contributor Author

Word - changed. Also changed `attack.py` below.


* `num_kfolds`: value of k for k-fold cross-validation

* `feature_scaling`: this option will take the features and rescale them to zero mean and unit standard deviation. For some classifiers, primarily those based on decision trees, this should not improve performance, but for many others, e.g. SVMs, it is necessary.
Contributor

Proposal to help ML newbies like me understand what's going on here:

`feature_scaling`: setting feature scaling will take the features and compute their standard score, re-scaling them to zero.... See also scikit-learn's documentation.

Contributor Author

Added these links!
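
For context, a minimal sketch of standard scoring with scikit-learn's `StandardScaler` (synthetic data; the pipeline's actual wiring may differ):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
# Each column is rescaled to zero mean and unit standard deviation.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
```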

* `auc`: [Area under the ROC curve](http://people.inf.elte.hu/kiss/12dwhdm/roc.pdf)
* `tpr`: true positive rate [array for default sklearn thresholds]
* `fpr`: false positive rate [array for default sklearn thresholds]
* `precision_at_k` for `k=[0.01, 0.05, 0.1, 0.5, 1, 5, 10]`: "Fraction of SecureDrop users correctly identified in the top k percent"
Contributor

Percent of what? I'll probably figure it out later as I read on in the docs or source, but it would be better if it were clarified here.

Contributor Author

Percentage of the testing set; I've now stated that explicitly.


Args:
options [dict]: attack setup file
"""
Contributor

The last code I wrote (see:

    """Return an :obj:`collections.OrderedDict` from ``dict_str``.

), I wrote with the documentation style described in #53. I'm not absolutely set on one particular style, but we should be consistent. If you want to make other suggestions, do so in #53 and we can discuss, but if you also like the python-gnupg style, then you should make the appropriate changes here.

Contributor Author

I was using Google-style docstrings, but I don't have strong feelings one way or the other, so the docstring format you like there is fine with me, and from now on I will follow it. However, I wrote most of the existing docstrings in this branch before that issue was filed, so in the interest of time I'd rather not rewrite them at this stage.

Contributor

Sgtm.
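
For reference, a minimal sketch of the reST-flavored (python-gnupg-like) docstring style under discussion, on a hypothetical function:

```python
import ast
import collections

def parse_ordered_dict(dict_str):
    """Return an :obj:`collections.OrderedDict` from ``dict_str``.

    :param str dict_str: String representation of a dict literal.
    :rtype: collections.OrderedDict
    """
    return collections.OrderedDict(ast.literal_eval(dict_str))
```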

import matplotlib.pyplot as plt
import numpy as np
import pickle
from scipy import interp
Contributor

Don't see you using this dependency, and I believe you meant to import the `interpolate` sub-package anyway.

Contributor Author

Removed this.

return fig


def precision_recall_at_x_proportion(test_labels, test_predictions, x_proportion=0.01,
Contributor

No idea what this function is doing or why. It needs at least a docstring, maybe a comment in `get_metrics` as well. Also, can you move this function right below `get_metrics`, since the two are related and the two functions above this one are not?

Contributor Author

Added a docstring and moved it up in the file.
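
For readers following along, a hypothetical sketch of a precision-at-top-k-percent metric (illustrative only, not the branch's exact implementation):

```python
import numpy as np

def precision_at_k_percent(test_labels, test_scores, k=0.1):
    """Precision over the top k percent of test examples, ranked by score.

    E.g., with 10,000 test examples and k=0.1, precision is computed
    over the 10 highest-scored examples.
    """
    n_top = max(1, int(np.ceil(k / 100.0 * len(test_scores))))
    top = np.argsort(test_scores)[::-1][:n_top]
    return float(np.mean(np.asarray(test_labels)[top]))
```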

self.feature_scaling = feature_scaling
self.db = database.ModelStorage()

def get_dict(self):
Contributor

There's a built-in attribute that does this, `self.__dict__`. I would just rename the object attributes to match sklearn's. More clear that way anyway.

Contributor Author

Done in 5a47434
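
A tiny sketch of the point above (hypothetical class, not the branch's actual code): once attribute names mirror sklearn's keyword arguments, `self.__dict__` replaces a hand-rolled `get_dict()`:

```python
class ModelConfig(object):
    def __init__(self, n_estimators=100, max_depth=None):
        # Attribute names deliberately match sklearn's keyword arguments.
        self.n_estimators = n_estimators
        self.max_depth = max_depth

config = ModelConfig()
print(config.__dict__)  # {'n_estimators': 100, 'max_depth': None}
# ...which can be splatted straight into an estimator, e.g.
# RandomForestClassifier(**config.__dict__)
```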

return engine


class DatasetLoader(object):
Contributor

`class DatasetLoader(Database)`, and the classes below too.

Contributor Author

Done
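
A minimal sketch of the suggested inheritance (base-class internals and the connection URL are hypothetical):

```python
from sqlalchemy import create_engine

class Database(object):
    """Shared connection plumbing for the storage classes."""
    def __init__(self, db_url="postgresql:///fpsd"):  # hypothetical URL
        self.engine = create_engine(db_url)

class DatasetLoader(Database):
    """Inherits the engine rather than constructing its own."""
    pass
```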

options["world_type"], options["model_type"],
options["base_rate"], json.dumps(options["hyperparameters"]),
self.metric_formatter(eval_metrics)))
with safe_session(self.engine) as session:
Contributor

`with self.safe_session() as session:`, and below too.

Contributor Author

Done
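
For readers unfamiliar with the pattern, a sketch of `safe_session` as a method, written as a SQLAlchemy context manager (assumes the class sets `self.engine`; details may differ from the branch):

```python
from contextlib import contextmanager
from sqlalchemy.orm import sessionmaker

class Database(object):
    # __init__ (elsewhere) sets self.engine to a SQLAlchemy engine.

    @contextmanager
    def safe_session(self):
        """Yield a session, committing on success and rolling back on error."""
        session = sessionmaker(bind=self.engine)()
        try:
            yield session
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()
```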

def get_feature_importances(model):

try:
return model.feature_importances_
Contributor

This attribute is cool. If we can identify features that are consistently found not to be useful for our top classifiers, we could use this information to help us implement the feature selection stage described in #63.

Contributor Author

👍
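
A quick sketch of how that could feed into feature selection (synthetic data for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Feature indices from most to least important; features that rank
# consistently low across the top classifiers are pruning candidates.
ranking = np.argsort(model.feature_importances_)[::-1]
print(ranking)
```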

@redshiftzero force-pushed the ml-classifiers branch 2 times, most recently from cdf0dc7 to 07dc7b8 (November 17, 2016 01:43)
@coveralls


Coverage remained the same at 72.727% when pulling 847e4e6 on ml-classifiers into b183c0c on master.

Database needs a password, so I'm not sure how this was working before? Either way, it's there now, and it reads from the environment variable $PGPASSFILE.
The existing code was always using the test database, which is not what we want. The production or test database can be selected using this new test keyword.
@coveralls


Coverage remained the same at 72.727% when pulling 5a47434 on ml-classifiers into b183c0c on master.

@redshiftzero (Contributor Author)

OK, comments addressed: the Ansible-ification of the creation of the models schema and tables is done, Travis builds are passing, and it's rebased on current master. Should be good to go 🌞

@conorsch (Contributor)

What a review process this has been! Thanks for your patience here, @redshiftzero. Given the frequent back-and-forth, I'm inclined to merge, and we can bite off smaller hunks to discuss in discrete issues going forward.

@redshiftzero (Contributor Author) commented Nov 30, 2016

👍 sounds good - we can make issues for any other outstanding problems and address them in smaller PRs.

@conorsch dismissed psivesely's stale review (November 30, 2016 23:26)

@redshiftzero has implemented the changes requested

@conorsch (Contributor) left a comment

Changes requested during review have been implemented. :shipit:
