
Setting learningRate dynamically produces NaN #6809

Closed
liweigu opened this issue Dec 6, 2018 · 10 comments

Comments

@liweigu commented Dec 6, 2018

DL4J version: 1.0.0-SNAPSHOT

  • To implement a GAN model, I need to set the generator's or the discriminator's learningRate to 0 alternately (a sketch of this alternation appears after the code below).
  • There appears to be a bug when doing it that way; the simple program below reproduces it.
  • With 'train1stTime = true', the '2nd predictedValue' is NaN, although it should be a valid double value.
  • -- 'train1stTime = true' means: train first, then set some layers' learningRate to 0, then train again.
  • With 'train1stTime = false', the '2nd predictedValue' is a valid double value, as expected.
  • -- 'train1stTime = false' means: set some layers' learningRate to 0 first, then train.

The code:

package com.liweigu.dl.study.gan;

import org.deeplearning4j.nn.api.Layer;
import org.deeplearning4j.nn.api.OptimizationAlgorithm;
import org.deeplearning4j.nn.conf.BackpropType;
import org.deeplearning4j.nn.conf.ComputationGraphConfiguration.GraphBuilder;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.graph.ComputationGraph;
import org.deeplearning4j.nn.layers.BaseLayer;
import org.deeplearning4j.nn.weights.WeightInit;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions.LossFunction;

/**
 * Set learningRate dynamically for testing.
 * To implement a GAN model, I need to set the generator's or the discriminator's learningRate to 0 alternately.
 * There appears to be a bug when doing it that way; this simple program reproduces it.
 * With 'train1stTime = true', the '2nd predictedValue' is NaN, although it should be a valid double value.
 * -- 'train1stTime = true' means: train first, then set some layers' learningRate to 0, then train again.
 * With 'train1stTime = false', the '2nd predictedValue' is a valid double value, as expected.
 * -- 'train1stTime = false' means: set some layers' learningRate to 0 first, then train.
 *
 * @author liweigu714@163.com
 */
public class DynamicLR {
public static double LearningRate = 2e-4;

public static void main(String[] args) {
	run();
}

public static void run() {
	System.out.println("run DynamicLR");

	// Notice:
	// if set train1stTime = true, the 2nd predictedValue is NaN.
	// if set train1stTime = false, the 2nd predictedValue is a valid double value.
	boolean train1stTime = true;

	ComputationGraph multiLayerNetwork = getNetwork();

	DataSet trainDataSet;
	DataSet testDataSet;
	INDArray[] outputs;
	INDArray output;
	double predictedValue;

	// train for 1st time
	if (train1stTime) {
		trainDataSet = getRandomDataSet();
		multiLayerNetwork.fit(trainDataSet);
		// print current output
		testDataSet = getRandomDataSet();
		outputs = multiLayerNetwork.output(testDataSet.getFeatures());
		output = outputs[0];
		predictedValue = output.getDouble(0);
		System.out.println("1st predictedValue = " + predictedValue);
	}

	// freeze some layer(s)
	String freezeLayerType = "g";
	freeze(multiLayerNetwork, freezeLayerType);

	// train for 2nd time
	trainDataSet = getRandomDataSet();
	multiLayerNetwork.fit(trainDataSet);
	// print current output
	testDataSet = getRandomDataSet();
	outputs = multiLayerNetwork.output(testDataSet.getFeatures());
	output = outputs[0];
	predictedValue = output.getDouble(0);
	System.out.println("2nd predictedValue = " + predictedValue);
}

// generate random DataSet
private static DataSet getRandomDataSet() {
	INDArray features = Nd4j.randn(new long[]{1, 100});
	// set label to one, for testing.
	INDArray labels = Nd4j.ones(new long[]{1, 1});
	DataSet dataSet = new DataSet(features, labels);
	return dataSet;
}

/**
 * Set some layers' learningRate to 0 to 'freeze' them, i.e. leave their parameters unchanged during training.
 * @param multiLayerNetwork the network to modify
 * @param freezeLayerType "g" means freeze layers named "g-*", "d" means freeze layers named "d-*".
 */
private static void freeze(ComputationGraph multiLayerNetwork, String freezeLayerType) {
	Layer[] layers = multiLayerNetwork.getLayers();
	for (Layer layer : layers) {
		if (layer instanceof BaseLayer) {
			BaseLayer baseLayer = (BaseLayer) layer;
			String layerName = baseLayer.getConf().getLayer().getLayerName();
			if (freezeLayerType.equals("g")) {
				if (layerName.contains("g-")) {
					multiLayerNetwork.setLearningRate(layerName, 0);
				} else if (layerName.contains("d-")) {
					multiLayerNetwork.setLearningRate(layerName, LearningRate);
				}
			} else if (freezeLayerType.equals("d")) {
				if (layerName.contains("g-")) {
					System.out.println(layerName + " = " + LearningRate);
					multiLayerNetwork.setLearningRate(layerName, LearningRate);
				} else if (layerName.contains("d-")) {
					System.out.println(layerName + " = " + 0);
					multiLayerNetwork.setLearningRate(layerName, 0);
				}
			}
		}
	}
}

private static ComputationGraph getNetwork() {
	ComputationGraph multiLayerNetwork;

	NeuralNetConfiguration.Builder builder = new NeuralNetConfiguration.Builder();
	builder.seed(140);
	builder.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT); 
	builder.weightInit(WeightInit.XAVIER);
	builder.updater(new Adam(LearningRate));

	GraphBuilder graphBuilder = builder.graphBuilder()
			.pretrain(false).backprop(true)
			.backpropType(BackpropType.Standard)
			.addInputs("input")
			.setOutputs("f-output");

	// g-* for generator
	graphBuilder = graphBuilder.addLayer("g-dense",
    		new DenseLayer.Builder()
			.nIn(100)
			.nOut(10)
			.updater(new Adam(LearningRate))
			.weightInit(WeightInit.RELU)
			.activation(Activation.LEAKYRELU)
			.build(),
			"input");
	// d-* for discriminator
	graphBuilder = graphBuilder.addLayer("d-dense",
    		new DenseLayer.Builder()
			.nIn(10)
			.nOut(10)
			.updater(new Adam(LearningRate))
			.weightInit(WeightInit.RELU)
			.activation(Activation.LEAKYRELU).build(),
			"g-dense");
	// final output
	graphBuilder = graphBuilder.addLayer("f-output",
			new OutputLayer.Builder(LossFunction.XENT)
			.nIn(10)
			.nOut(1)
			.updater(new Adam(LearningRate))
			.weightInit(WeightInit.XAVIER)
			.activation(Activation.SIGMOID)
			.build(),
			"d-dense");

	multiLayerNetwork = new ComputationGraph(graphBuilder.build());
	multiLayerNetwork.init();
	System.out.println(multiLayerNetwork.summary());

	return multiLayerNetwork;
}

}
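
For context, a sketch of how the freeze helper above would be used to alternate the two sub-networks; numEpochs, discriminatorBatch and generatorBatch are hypothetical placeholders, not part of the test program:

for (int epoch = 0; epoch < numEpochs; epoch++) {
	freeze(multiLayerNetwork, "g");            // zero the generator's LR, so only the discriminator learns
	multiLayerNetwork.fit(discriminatorBatch); // hypothetical discriminator batch
	freeze(multiLayerNetwork, "d");            // zero the discriminator's LR, so only the generator learns
	multiLayerNetwork.fit(generatorBatch);     // hypothetical generator batch
}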

@liweigu (Author) commented Dec 11, 2018

The NaN occurs in AdamUpdater at this line:
INDArray sqrtV = Transforms.sqrt(v.dup(gradientReshapeOrder), false).addi(epsilon);
At run time, the values are:
v = [[ 0.0079, -4.0894e-5, 0.0004 ... 4.945e-7, 5.9659e-6, 3.8776e-5]]
epsilon = [[ 0.0079, -4.0894e-5, 0.0004 ... 4.945e-7, 5.9659e-6, 3.8776e-5]]
sqrtV = [[ 0.0891, NaN, 0.0195 ... 0.0007, 0.0024, 0.0062]]
The second value in sqrtV is NaN because the corresponding value in v is negative (-4.0894e-5), and the square root of a negative number is NaN.
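
The same arithmetic in plain Java, as a minimal illustration only (not the DL4J code path), using the negative element of v shown above:

double negativeV = -4.0894e-5;            // the negative element of v; Adam's v should never be negative
System.out.println(Math.sqrt(negativeV)); // prints NaN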

@liweigu (Author) commented Dec 11, 2018

Using RmsProp instead of Adam, no NaN occurs.
The predictedValues are:
1st predictedValue = 0.5404776334762573
2nd predictedValue = 0.6183269023895264

Please check AdamUpdater.

@saudet (Member) commented Dec 12, 2018

One difference between Adam and RmsProp is that Adam has a momentum parameter, while RmsProp doesn't. Could you try resetting the state of the updater to see if that helps?

@liweigu (Author) commented Dec 12, 2018

It works when this line is added after the 1st training:
multiLayerNetwork.setUpdater(null);
There is no NaN in the predictedValues:
1st predictedValue = 0.4953722655773163
2nd predictedValue = 0.6004246473312378
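
A minimal sketch of where the workaround fits in run() above; the assumption is that setUpdater(null) simply discards the existing updater state so it is rebuilt on the next fit call:

	// 1st training
	multiLayerNetwork.fit(trainDataSet);
	// workaround: drop the accumulated Adam m/v state before changing learning rates
	multiLayerNetwork.setUpdater(null);
	// freeze some layers and train again
	freeze(multiLayerNetwork, freezeLayerType);
	multiLayerNetwork.fit(getRandomDataSet());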

@liweigu (Author) commented Dec 12, 2018

Another way to produce NaN is to add a BatchNormalization layer using RenormalizeL2PerParamType after the "g-dense" layer, like this:

	// additional imports needed: org.deeplearning4j.nn.conf.layers.BatchNormalization,
	// org.deeplearning4j.nn.conf.GradientNormalization, org.nd4j.linalg.learning.config.RmsProp
	graphBuilder = graphBuilder.addLayer("g-dense",
			new DenseLayer.Builder()
			.nIn(100)
			.nOut(10)
			.updater(new Adam(LearningRate))
			.weightInit(WeightInit.RELU)
			.activation(Activation.LEAKYRELU)
			.build(),
			"input");
	graphBuilder = graphBuilder.addLayer("g-bn1",
			new BatchNormalization.Builder()
			.gradientNormalization(GradientNormalization.RenormalizeL2PerParamType)
			.nIn(10)
			.nOut(10)
			.updater(new RmsProp(LearningRate))
			.weightInit(WeightInit.RELU)
			.activation(Activation.LEAKYRELU)
			.build(),
			"g-dense");

Note that using RenormalizeL2PerLayer instead of RenormalizeL2PerParamType does not produce NaN.

@saudet (Member) commented Dec 12, 2018

That doesn't necessarily indicate there is a bug. Could you elaborate on the issue and why you think the behavior is incorrect?

@liweigu (Author) commented Dec 12, 2018

There are many factors that can lead to NaN, so we cannot be certain that RenormalizeL2PerParamType causes it.
I may run more tests and then open a new issue for RenormalizeL2PerParamType.
Let's focus on AdamUpdater for now.

@AlexDBlack self-assigned this Feb 15, 2019

@AlexDBlack (Contributor) commented Feb 15, 2019

OK, after looking at this again: it turns out to be a legitimate issue.
What is happening is that we combine updaters between layers when their configuration is the same.
Updaters like Adam have two state components, m and v, for a total updater state size of 2*numParams.
Originally the updater state is laid out as [mParam1, mParam2][vParam1, vParam2] in one block.
Once we change the LR, param1 and param2 belong to different updater blocks,
so we need to rearrange the updater state to [mParam1][vParam1] in block 1 and [mParam2][vParam2] in block 2.
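
To illustrate the layout change with plain Java arrays (a sketch only; the values are placeholders and this is not the actual DL4J view-array code):

double mParam1 = 0.1, mParam2 = 0.2, vParam1 = 0.01, vParam2 = 0.02; // placeholder Adam state values
// before the LR change: one shared block holds all m values, then all v values
double[] sharedBlock = {mParam1, mParam2, vParam1, vParam2};
// after the LR change: each param needs its own block, each with its own [m][v] layout
double[] block1 = {sharedBlock[0], sharedBlock[2]}; // [mParam1, vParam1]
double[] block2 = {sharedBlock[1], sharedBlock[3]}; // [mParam2, vParam2]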

AlexDBlack added a commit that referenced this issue Feb 15, 2019
@AlexDBlack (Contributor) commented Feb 15, 2019

@liweigu Thanks for reporting: this has been fixed in #7169 and will be merged soon.

AlexDBlack added a commit that referenced this issue Feb 15, 2019
[WIP] Misc DL4J/ND4J Fixes (#7169)
* #7152 Yolo2OutputLayer label shape validation

* #7143 empty array checks

* #7123 Update old URLs

* #7115 Fix issue with LastTimeStep and masks

* #6809 Fix updater state view array layout when dynamically changing LR with updaters like Adam

* Delete old/unmaintained VideoRecordReader

* #7104 Fix memory info reporting when activations shape/length can't be inferred (some RNNs etc)
@lock (bot) commented Mar 17, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 17, 2019
