
Feature: Implement classification algorithm #2

Closed
reubano opened this Issue Aug 1, 2017 · 12 comments


reubano commented Aug 1, 2017

Overview

Currently, there is just a placeholder for the algorithm that classifies nodules in scans. Nodules are areas of interest that might be cancerous. We need to adapt the Data Science Bowl algorithms to predict P(cancer) for a given set of nodule centroids.

Expected Behavior

Given a model trained to perform this task, a DICOM image, and the nodule centroids, return the P(cancer) for each nodule.

Design doc reference:
Jobs to be done > Annotate > Prediction service
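The expected call could be sketched roughly as follows. All names here are illustrative, not the project's actual API; the nested-list scan and constant model merely stand in for a real DICOM volume and trained backend model:

```python
from typing import List, Sequence, Tuple

Centroid = Tuple[int, int, int]  # (z, y, x) voxel coordinates

def extract_patch(scan, centroid: Centroid, size: int = 2):
    """Crop a cubic patch of side ``2*size`` around the centroid.

    ``scan`` is a nested list indexed (z, y, x), standing in for the
    DICOM volume a real implementation would load.
    """
    z, y, x = centroid
    return [[row[max(x - size, 0):x + size]
             for row in plane[max(y - size, 0):y + size]]
            for plane in scan[max(z - size, 0):z + size]]

def classify_nodules(scan, centroids: Sequence[Centroid], model) -> List[float]:
    """Return P(cancer) for each nodule centroid in the scan."""
    return [float(model.predict_proba(extract_patch(scan, c)))
            for c in centroids]

class ConstantModel:
    """Stand-in for a trained classifier; always predicts 0.5."""
    def predict_proba(self, patch) -> float:
        return 0.5
```

The essential contract is only the signature: scan plus centroids in, one probability per centroid out.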

Technical details

Out of scope

This feature is a first-pass at getting a model that completes the task with the defined input and output. We are not yet judging the model based on its accuracy or computational performance.

Acceptance criteria

  • trained model
  • documentation for the trained model (e.g., cross validation performance, data used) and how to re-train it

NOTE: All PRs must follow the standard PR checklist.

@reubano reubano added this to the 1-mvp milestone Aug 1, 2017

@reubano reubano added the feature label Aug 1, 2017

@reubano reubano changed the title from "major official prediction" to "Feature: Implement classification algorithm" Aug 1, 2017


reiinakano commented Aug 22, 2017

Not sure if this is a dumb question, but given that there will probably be multiple models in the end, how do you intend to separate them? Would it make sense to further divide prediction/classify into a subfolder for each model? e.g. prediction/classify/vgg would contain prediction/classify/vgg/src/, prediction/classify/vgg/trained_model/predict/, and prediction/classify/vgg/assets/

Question applicable to #1 and #3


isms commented Aug 22, 2017

@reiinakano Not a dumb question at all, and in fact having a "model zoo" was considered. It's not out of the question as the project moves forward, but until we have something to go from the current goal is just to have a best-available frozen model (+ its preprocessing and training pipeline) for each task.


reiinakano commented Aug 23, 2017

Thanks for the answer. My concern is that you're considering a lot of different algorithms from the get-go (as seen in #18 - #28). What happens when multiple people decide to work on different algorithms? Personally, I don't think it would be a terrible idea to separate them into subpackages from the start to facilitate parallel development.

As long as they have the same interface, switching to the best-available model is a matter of changing the subpackage name in calls. Also, if the best model moving forward is an ensemble of the algorithms (as is virtually always the case), they're all already nicely separated for easy integration into an ensemble.
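The "same interface, swap the subpackage" idea can be illustrated with a toy sketch. The registry and model names below are hypothetical stand-ins for the proposed per-model subpackages (e.g. prediction/classify/vgg), not real project code:

```python
from typing import Callable, Dict, List

# Registry standing in for per-model subpackages; in the proposed layout
# each entry would instead live under prediction/classify/<name>/.
CLASSIFIERS: Dict[str, Callable] = {}

def register(name: str):
    """Decorator that records a classifier under a subpackage-like name."""
    def decorator(fn):
        CLASSIFIERS[name] = fn
        return fn
    return decorator

@register("vgg")
def vgg_predict(centroids) -> List[float]:
    return [0.1 for _ in centroids]   # dummy scores, not a real model

@register("resnet")
def resnet_predict(centroids) -> List[float]:
    return [0.2 for _ in centroids]   # dummy scores, not a real model

def ensemble_predict(centroids) -> List[float]:
    """Average per-model probabilities; a uniform interface makes this trivial."""
    all_scores = [fn(centroids) for fn in CLASSIFIERS.values()]
    return [sum(scores) / len(all_scores) for scores in zip(*all_scores)]
```

Switching the best-available model is then one name change (`CLASSIFIERS["vgg"]` vs `CLASSIFIERS["resnet"]`), and the ensemble simply iterates over whatever is registered.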


isms commented Aug 23, 2017

Agreed that could be a useful refactoring once the pain point exists. If you're interested in working on this bit of architecture, I'd encourage you to stay alert for the right time (which could be soon) and propose the change as a refactoring; at the moment it's premature.


vessemer commented Aug 24, 2017

As was already mentioned: different architectures require different preprocessing methods. Would it be a good idea for the feature's input to also include information about the preprocessing function? It may be better to open a new issue dedicated to the preprocessing step; that way, a model could be fed along with a dictionary of parameters used to create a preprocessing instance. I'm not opening the issue myself because I'm not confident about how the expected behaviour should look. Nevertheless, I'll be glad to implement such features.
One more question I have: which backend will be included in requirements?


isms commented Aug 25, 2017

@vessemer Thanks for the thinking around this. I have a couple questions just to make sure I understand!

different architectures require different preprocessing methods. Would it be a good idea for the feature's input to also include information about the preprocessing function?

You mean should we be passing more information from the interface part back to the prediction part? Or something else?

One more question I have: which backend will be included in requirements?

Sorry, not sure I understand the question?

And it seems like you probably already gathered this, but as a general observation, our thought is that we'd like to first see one solid implementation even if it will need to be refactored to generalize later on. We're afraid of being "architecture astronauts" and making things too complicated before everyone has a common point of reference.


vessemer commented Aug 25, 2017

@isms thanks for the quick response. I mean that prior to any model evaluation some preprocessing should take place over the raw DICOM data. I agree with your concern regarding cumbersome architecture at the starting point.

Given a model trained to perform this task

Right now it's not clear to me what is meant by the 'model': whether it is a model object from some ML backend such as TensorFlow or Theano, or some higher-level entity with its own preprocessing method.

@vessemer vessemer referenced this issue Aug 27, 2017

Merged

Classification algorithm #76

1 of 1 task complete

reubano commented Aug 28, 2017

I mean that prior to any model evaluation some preprocessing should take place over the raw DICOM data.

Hi @vessemer, I see you've already added some preprocessing logic in your PR, but #58 may give you a bit more context. Essentially, each competition implementation has its own set of preprocessing steps. During the integration of any individual implementation, its associated preprocessing (and other) steps should be decoupled so that they work together as modular pieces of a functional whole. As @isms said, in the future, we may go down the path of mixing and matching the best pieces of various implementations and/or allow for an ensemble of implementations to work together.

Right now it's not clear to me what is meant by the 'model': whether it is a model object from some ML backend such as TensorFlow or Theano, or some higher-level entity with its own preprocessing method.

You're on the right track with your initial thought. Serialized models will live in the assets folder of their respective step (classify, identify, and segment). In this case it will be algorithms/classify/assets/. And of course, the use of TensorFlow/Theano/etc. will be up to individual implementation.
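As a rough sketch of the load step, assuming a pickled model (the actual serialization format will depend on the backend each implementation chooses; git-lfs only changes how the asset file is stored, not how it is loaded, and the filename below is hypothetical):

```python
import os
import pickle

# Per-step assets folder described above; each step (classify, identify,
# segment) has its own.
ASSETS_DIR = os.path.join("algorithms", "classify", "assets")

def load_model(filename: str):
    """Deserialize a trained classifier from the classify step's assets folder."""
    with open(os.path.join(ASSETS_DIR, filename), "rb") as f:
        return pickle.load(f)
```

A TensorFlow or Theano implementation would swap `pickle.load` for its backend's own restore mechanism, but the "one assets folder per step" convention stays the same.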


vessemer commented Sep 4, 2017

Hi @reubano, as it was mentioned in the technical details from the issue #2:

This feature should be implemented in the prediction/classify/trained_model/predict method. Code to train the model should live in the prediction/classify/src/ folder.
A fully serialized version of the model that can be loaded should live in the prediction/classify/assets/ folder using git-lfs.

According to this, classify shouldn't live in the prediction/src directory but directly under prediction. Is there a typo in the issue, or is it now outdated?


reubano commented Sep 4, 2017

@vessemer thanks for pointing that out! The file structure has changed since this issue was first created. The new base path is prediction/src/algorithms/classify/.


vessemer commented Sep 6, 2017

Essentially, each competition implementation has its own set of preprocessing steps. During the integration of any individual implementation, its associated preprocessing (and other) steps should be decoupled so that they work together as modular pieces of a functional whole. As @isms said, in the future, we may go down the path of mixing and matching the best pieces of various implementations and/or allow for an ensemble of implementations to work together.

@reubano, I've noticed that most algorithms aimed at classifying nodule candidates have similar preprocessing pipelines. In general, these pipelines can be separated into two stages: global (lung-level) and local (nodule-level) operations. It's important to decouple these stages; for example, it is neither necessary nor consistent to apply zero-mean normalization at the local stage. Moreover, at the local stage it is common to apply augmentation on the fly (even at prediction time). I've therefore introduced a preprocessing class which operates at the global level, along with the option to apply it while reading medical files. As for the data generators at the local level, it's also quite common to apply simple spatial augmentation and then bring the patch to the desired shape. These latter transformations may vary greatly from model to model, so I think it would be better to wrap each model in a class with standardized functionality, so that we can build them once with a desired setup, serialize them, and reuse them whenever we want.
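The wrapper idea might look roughly like this sketch. Class, method, and argument names are hypothetical, not an agreed interface, and pickle stands in for whatever serialization the chosen backend supports:

```python
import pickle
from typing import Callable, List, Sequence

class ClassificationModel:
    """Bundle a backend model with its local-stage (per-patch) transform.

    Global-stage (lung-level) preprocessing would be applied once per
    scan, outside the wrapper; the wrapper standardizes only what varies
    per model: the patch transform and prediction.
    """

    def __init__(self, backend_model: Callable, patch_transform: Callable):
        self.backend_model = backend_model      # e.g. a TF/Theano predict fn
        self.patch_transform = patch_transform  # per-model spatial transform

    def predict(self, patches: Sequence) -> List[float]:
        """Apply the model's own transform to each patch, then predict."""
        return [float(self.backend_model(self.patch_transform(p)))
                for p in patches]

    def save(self, path: str) -> None:
        """Serialize the configured wrapper for later reuse."""
        with open(path, "wb") as f:
            pickle.dump(self, f)

    @staticmethod
    def load(path: str) -> "ClassificationModel":
        with open(path, "rb") as f:
            return pickle.load(f)
```

Each implementation would construct its wrapper once with its own transform, serialize it, and callers would only ever see the shared `predict` interface.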


lamby commented Sep 8, 2017

Merged in #76

@lamby lamby closed this Sep 8, 2017

@vessemer vessemer referenced this issue Sep 21, 2017

Merged

Data Generator #132

1 of 1 task complete

@reubano reubano referenced this issue Sep 26, 2017

Closed

3D data augmentation #137

0 of 1 task complete