API Cleanup #3900

Merged
merged 8 commits on Aug 22, 2017

Conversation

AlexDBlack (Member) commented Aug 21, 2017

Good to go IMO. I'll leave this open a bit longer for reviews.

Addresses:

  • #3889 (Fixed override, added proposed convenience method: .inputType().convolutional(h,w,d))
  • #3888 (.layer(Layer) overload now exists, in addition to .layer(int, Layer))
  • #3897 (.regularization(boolean) deprecated, use removed internally)
  • #3896 (improved dropout javadoc)
  • #3895 (.units(int) is equivalent to .nOut(int))
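A rough usage sketch of how the convenience methods listed above fit together. This is illustrative only: layer choices, sizes, and package paths are placeholders, not taken from the PR diff, and .units(int) is assumed to be available on the feed-forward layer builders.

import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;

// Sketch only; hyperparameters are placeholders.
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .list()
        // new .layer(Layer) overload - no explicit index required (#3888)
        .layer(new DenseLayer.Builder().units(256).build())   // .units(int) as an alias for .nOut(int) (#3895)
        .layer(new OutputLayer.Builder().nOut(10).build())
        // convenience for convolutional input: height, width, depth (#3889)
        .setInputType(InputType.convolutional(28, 28, 1))
        .build();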
@@ -213,6 +214,7 @@ public double getL2ByParam(String variable) {
* Fluent interface for building a list of configurations
*/
public static class ListBuilder extends MultiLayerConfiguration.Builder {
private AtomicInteger layerCounter = new AtomicInteger(-1); //Used only for .layer(Layer) method

stolsvik (Contributor) commented Aug 21, 2017

AtomicInteger sounds like ("communicates") you're expecting concurrency, while that is not true - the use of an ordinary map rules out concurrency. You could thus just use an int.

However, I think it would be wiser to just use the "find max key in map" logic and then invoke the indexed variant with max+1, as that is very straightforward and immediately understandable for a reader of the code, and there is no possibility of wondering what happens if you mix the methods. Right now, while reviewing this, I am wondering what happens if I e.g. add layers by index out of order, and then add one layer using the non-indexed method. After thinking a bit, I can't immediately see how that would fail - but my point is that this particular thinking should really not be necessary!

Also, add JavaDoc explaining what the two methods do and how they interact.

I would note that the "same index overrides" behavior has tripped me up over a dozen times already, and I am still just getting to know the API! It should rather have thrown - it makes NO sense that this is possible, IMHO. It should at least be JavaDocced, and could potentially also output a log.warn(..)
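A minimal sketch of the suggested "find max key in map" approach. Field and variable names here are assumed, not taken from the PR; the layer map is assumed to be keyed by integer index.

// Hypothetical sketch: append by delegating to the existing indexed variant.
// 'layerwise' is assumed to be the Map<Integer, NeuralNetConfiguration.Builder> of indexed layers.
public ListBuilder layer(Layer layer) {
    int nextIndex = layerwise.isEmpty()
            ? 0
            : java.util.Collections.max(layerwise.keySet()) + 1;  // find max key in map
    return layer(nextIndex, layer);  // invoke the indexed variant with max+1
}

This makes the interaction between the two methods explicit: the non-indexed call always appends after the highest index used so far.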

AlexDBlack (Member) commented Aug 22, 2017

Changed to int.

public ListBuilder setInputType(InputType inputType){
return (ListBuilder)super.setInputType(inputType);
}

stolsvik (Contributor) commented Aug 21, 2017

Since my ML-level hasn't reached anything other than feedforward at this point, this might be utterly meaningless: are you sure it makes sense for a ListBuilder to accept all the different types of InputType? Will the ListBuilder - in this usage scenario - be used in a non-feedforward setting? I had the idea that these other types would be used "within" the network, but not as the top, "receiving" part? As I said, I might be totally off base here.

Since I evidently cannot comment on non-changed parts of this file, I just want to point out that in the .build() method, the reference to inputType is the only one using "this.". Why is that? It makes me wonder whether that means something.

maxpumperla (Contributor) commented Aug 21, 2017

e.g. Embedding would be another common type

AlexDBlack (Member) commented Aug 22, 2017

Yes, it does make sense to accept all input types.
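For context, the same setInputType(...) call can receive several InputType variants. The factory method names below are as they appear in the InputType class, but exact signatures may vary between versions, and listBuilder is a placeholder variable.

// Illustrative only - pick one, depending on the network's input:
listBuilder.setInputType(InputType.feedForward(784));          // flat/dense input
listBuilder.setInputType(InputType.convolutional(28, 28, 1));  // height, width, depth
listBuilder.setInputType(InputType.recurrent(100));            // time-series input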

* </p>
*
* @param dropOut Dropout probability (probability of retaining an activation)
* @param inputRetainFraction Dropout probability (probability of retaining an input activation for a layer)

stolsvik (Contributor) commented Aug 21, 2017

Point out (again) here in the @param section that 0.0 disables dropout, and that 1.0 also effectively disables it, since it means that all nodes are retained. Helps with understanding.

I am not sure that "fraction" is the best word. I'm not a native English speaker, so I often just have this "hunch" that there is a better synonym, but I can't immediately grasp it. So: if the percentage is 56%, what is the number 0.56 called? Reading the TensorFlow doc, they use "probability", which makes much sense.

AlexDBlack (Member) commented Aug 22, 2017

Additional javadoc changes + changed to inputRetainProbability

* Note 1: This sets the probability per-layer. Care should be taken when setting lower values for
* complex networks (too much information may be lost with aggressive dropout values).<br>
* Note 2: Frequently, dropout is not applied to input (first layer) our output layer. This needs to be
* handled MANUALLY by the user - set .dropout(0) on those layers when using global dropout setting.<br>

stolsvik (Contributor) commented Aug 21, 2017

Point out again that 0 disables it. The point is that this disable-value is hard to grasp, as it semantically means "keep NO nodes".

* activation with probability x, and set to 0 with probability 1-x.<br>
* dropOut(0.0) is disabled (default).
* Dropout probability. This is the probability of <it>retaining</it> an input activation for a layer. So
* dropOut(x) will keep an input activation with probability x, and set to 0 with probability 1-x.<br>

stolsvik (Contributor) commented Aug 21, 2017

I am not sure whether I understand this, because if I do, I think it is wrong.
Dropout should "zero out nodes", and I've come to understand that this means that their OUTPUT will be zero. Which again means that the NEXT layer will see a consistent set of zeroed and non-zeroed node values, i.e. if for Layer L nodes 0, 2 and 5 are zeroed, then ALL nodes in Layer L+1 will see nodes 0, 2 and 5 as zeroed.

Reading this JavaDoc, I get the feeling that node 0 in Layer L+1 (which is THIS layer, since we're talking "input activations") might see the input activation zeroed for nodes 1, 2 and 3 from Layer L, while node 1 in Layer L+1 could see zeroed values for nodes 0, 2 and 3 - i.e. independently zeroed.

If the latter is the case, then I believe this dropout is implemented incorrectly. But I probably misunderstand the wording.

Also, with dropout, one should have this scaling thing. This should be mentioned in the JavaDoc. Here's TensorFlow's dropout doc:
https://www.tensorflow.org/api_docs/python/tf/nn/dropout

"With probability keep_prob, outputs the input element scaled up by 1 / keep_prob, otherwise outputs 0. The scaling is so that the expected sum is unchanged."

Which brings us to the next thing that should be javadocced, and that is that dropout should only be applied in the training phase, not during evaluation. In particular, this also goes for the scaling. When using TensorFlow, you as the user need to set keep_prob to your wanted value when training, while in evaluation mode you need to set it to 1. Using e.g. 0.5, this means that when training, half of the nodes are zeroed, but the output is scaled by 2 for the remaining ones. But in evaluation, no nodes are silenced, and there is no scaling.

Is this handled automatically by DL4J? Would be nice to have it here in the JavaDoc.

Regarding whether it is "before" or "after" this layer: This could be explicitly explained like "setting a dropout on this layer is the same as adding a {@link DropoutLayer} in front of this layer".
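For reference, a self-contained sketch of the inverted-dropout scaling described above (plain Java, independent of DL4J's actual implementation; keepProb is the retain probability):

import java.util.Random;

public class InvertedDropoutSketch {
    // The mask is applied once to the layer's output vector, so every node in the
    // next layer sees the same set of zeroed activations (not independently zeroed).
    static double[] dropout(double[] activations, double keepProb, boolean training, Random rng) {
        if (!training || keepProb >= 1.0) {
            return activations.clone();  // evaluation: keep everything, no scaling
        }
        double[] out = new double[activations.length];
        for (int i = 0; i < activations.length; i++) {
            // keep with probability keepProb, scaled by 1/keepProb so the expected sum is unchanged
            out[i] = rng.nextDouble() < keepProb ? activations[i] / keepProb : 0.0;
        }
        return out;
    }
}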

AlexDBlack (Member) commented Aug 22, 2017

Dropout zeros out activations, by definition.
The "scaling thing" is mentioned in the inverted dropout implementation.
Added note on train vs. test.

* dropOut(0.0) is disabled (default).
* Dropout probability. This is the probability of <it>retaining</it> an input activation for a layer. So
* dropOut(x) will keep an input activation with probability x, and set to 0 with probability 1-x.<br>
* dropOut(0.0) is disabled (default). When {@link #useDropConnect(boolean)} is set to true (false by default),

stolsvik (Contributor) commented Aug 21, 2017

Something like: "dropout(0) is a special value that disables the dropout mechanism, which effectively is the same as dropout(1.0), since that means to retain all nodes."

The problem is that 0 semantically would mean dropping all nodes, which can hinder understanding unless you are very explicit about it.

maxpumperla (Contributor) commented Aug 21, 2017

Keras' definition is not exactly cleaner, btw:

rate: float between 0 and 1. Fraction of the input units to drop.

* Note 1: This sets the probability per-layer. Care should be taken when setting lower values for
* complex networks (too much information may be lost with aggressive dropout values).<br>
* Note 2: Frequently, dropout is not applied to input (first layer) our output layer. This needs to be
* handled MANUALLY by the user - set .dropout(0) on those layers when using global dropout setting.<br>

stolsvik (Contributor) commented Aug 21, 2017

Regarding the input layer: the "dropout paper" (the original!) suggests 0.8 retain on the input, 0.5 on hidden layers. So yes, it should definitely be different, but not necessarily "retain all".

AGAIN the point about "0": I consistently get a "huh?!" feeling seeing that number - which implies that it should always be accompanied by something like "... thus disabling the dropout for this layer", or similar.

Yes, I realize that I really don't like that special value, as it so directly contradicts the semantics of what the number means.
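A hedged sketch of the kind of per-layer override being discussed here: retain probabilities follow the original dropout paper's suggestion, layer types and sizes are placeholders, and dropOut is assumed to be settable both globally and per layer.

// Global retain probability 0.5; override per layer where the paper suggests otherwise.
new NeuralNetConfiguration.Builder()
        .dropOut(0.5)                                                              // global default
        .list()
        .layer(new DenseLayer.Builder().nIn(784).nOut(256).dropOut(0.8).build())   // input-facing layer: retain 0.8
        .layer(new DenseLayer.Builder().nOut(128).build())                         // hidden layer: inherits global 0.5
        .layer(new OutputLayer.Builder().nOut(10).dropOut(0.0).build())            // 0.0 = dropout disabled
        .build();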

* Note: This sets the probability per-layer. Care should be taken when setting lower values for complex networks.
* Note 1: This sets the probability per-layer. Care should be taken when setting lower values for
* complex networks (too much information may be lost with aggressive dropout values).<br>
* Note 2: Frequently, dropout is not applied to input (first layer) our output layer. This needs to be

sorensf commented Aug 22, 2017

Note 2: Frequently, dropout is not applied to input (first layer) our output layer.

"our" should probably be "or". And since the javadoc is the same in Layer.java, any changes here should be reflected there as well.
