
Performance Function #11

Closed
eralmansouri opened this issue Dec 8, 2014 · 13 comments

@eralmansouri

Hello,

I'm trying to use cross-entropy instead of MSE, but I'm not sure how to change the performance function. Is it built in?

@cazala
Owner

cazala commented Dec 9, 2014

Hi @eralmansouri, there's no option to change the error criterion, but if you want to swap the mean squared error in the trainer for an average cross-entropy, you could change line 54 in the file src/trainer.js:

delta += Math.pow(target[i] - output[i], 2);

with this:

delta += Math.log(output[i]) * target[i];

You should probably also change the way the injected error is computed at the output layer, because I think the derivative term cancels out when using CE. To do that, change line 119 in the file src/neuron.js from this:

this.error.projected = this.derivative * error;

to this:

this.error.projected = 'undefined' != typeof target ? error : this.derivative * error;

That means that if there's a target value in the backpropagation (i.e. this is an output neuron), you don't multiply by the derivative.

You also have to comment out or remove line 523 of that same file:

buildSentence(responsibility, ' *= ', derivative, store_propagation);

That way the optimizer won't multiply by the derivative when computing the output layer's backpropagation error.

I've never played with cross-entropy for training, but it would be cool to have the option to use it as an alternative to MSE.

@eralmansouri
Author

Wow, thanks. I'm getting better performance on my dataset with this.

However, I only changed this.error.projected; what about this.error.gated?

Your new value for delta doesn't work properly for me, so I used the cost function I found online:

delta -= (target[k] * Math.log(output[k])) + ((1-target[k]) * Math.log(1-output[k]));
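
For clarity, here is the same cost accumulated over all the outputs of one training example (just a sketch; the loop bounds and array names are assumed from the line above):

var delta = 0;
for (var k = 0; k < output.length; k++) {
    // binary cross-entropy: cost = -sum_k [ t_k * log(o_k) + (1 - t_k) * log(1 - o_k) ]
    delta -= (target[k] * Math.log(output[k])) + ((1 - target[k]) * Math.log(1 - output[k]));
}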

By the way, after the cross-entropy modifications, your DSR example here reaches a 96% success rate in 12k iterations, down from 100k.

I thought that might interest you. Also, I noticed that you update the weights after each training example; would batch learning make training faster? It would be great if there were some kind of regularization parameter too.

I apologize for any English mistakes. I'm not very good at it to begin with and I have been awake for way too long.

@cazala
Owner

cazala commented Dec 9, 2014

Impressive. I knew that CE was better for classification networks, but I never paid much attention to it. That's a huge improvement you've made on the DSR test; I'll play with it a bit and try to add cross-entropy as a built-in option.

About error.gated: you could change that too, but output neurons shouldn't gate any connections, so that error should always be zero anyway.

And about batch training, to be honest I have no clue; I should read up on how to implement it. I'm an amateur at neural nets and still have a lot to learn, but if you feel you can implement it, go ahead and please share the code. If we can find a way to add it as a training parameter it would be awesome (:

@cazala
Owner

cazala commented Dec 9, 2014

@eralmansouri I tried to replicate what you did but I couldn't get such a performance boost. These are the lines I changed: 60a8fd4. Did you do anything else besides what I did there?

EDIT

I just tweaked the learning rate a little bit and now I get a huge performance increase: when training both networks (MSE vs. CE) on the same task (Discrete Sequence Recall), the cross-entropy one is able to finish in 15k iterations while the other one sometimes takes up to 100k or more.

Also, the CE network is able to solve much more complex tasks, like a DSR with 10 different symbols and a longer sequence:

var LSTM = new Architect.LSTM(10,8,2);      
LSTM.trainer.DSR({
    targets: [3,5,7,9],
    distractors: [2,4,6,8],
    prompts: [0,1],         
    length: 12,
    iterations: 250000,
    rate: .17
})

The MSE network is unable to solve this task before reaching the 250k-iteration limit, while the CE one finishes in about 70k iterations.

@eralmansouri
Author

I actually don't even remember what I changed or when I ran that test. That said, I tried your original code and it hit 95% success in < 50k iterations, and then it hit the 250k limit the next time I ran it. The drastic improvement I saw earlier may have been due to random initialization. I'm glad that you are seeing a performance boost.

I would suggest not using targets or inputs that aren't between 0 and 1; use min-max scaling or similar to pre-process the inputs and outputs (see the sketch below). Also, I would suggest not using CE with targets that aren't exactly 1 or 0. Perhaps you'll see further performance improvements with CE on tasks other than DSR.
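
Something like this is what I mean by min-max scaling (just a sketch; the minMaxScale name is made up):

// Rescale a plain array of numbers to the [0, 1] range.
function minMaxScale(values) {
    var min = Math.min.apply(null, values);
    var max = Math.max.apply(null, values);
    if (max === min) return values.map(function () { return 0; });
    return values.map(function (x) {
        return (x - min) / (max - min);
    });
}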

By the way, the line to calculate delta that I gave you earlier has an error in some cases. I changed it to avoid Math.log(0), because it results in NaN once it's multiplied by zero. To clarify, this is the equation I posted earlier:

delta -= (target[k] * Math.log(output[k])) + ((1-target[k]) * Math.log(1-output[k]));

If the network outputs exactly 1 and the target value is exactly 1, the second part of the equation evaluates to (1-1) * -Infinity, which is NaN and causes the training to stop. I just added a tiny push away from zero:

delta -= (target[k] * Math.log(output[k]+1e-15)) + ((1-target[k]) * Math.log((1+1e-15)-output[k]));
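
To see why the guard is needed (plain JavaScript behavior):

console.log(Math.log(0));                          // -Infinity
console.log((1 - 1) * Math.log(1 - 1));            // NaN, and NaN poisons the whole delta
console.log((1 - 1) * Math.log((1 + 1e-15) - 1));  // 0 with the epsilon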

I don't think it will affect the network's performance since it's really small, but I could be wrong; I don't have a lot of experience with this.

@cazala
Owner

cazala commented Dec 10, 2014

I added a feature to the trainer so you can pass an optional cost function for the training, and I added 2 built-in cost functions (CE and MSE). The documentation is in the readme.
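
Roughly, using it looks like this (a sketch only; check the readme for the exact option name and the names of the built-in cost functions):

var trainer = new Trainer(myNetwork);
trainer.train(trainingSet, {
    rate: .1,
    iterations: 20000,
    // one of the built-in cost functions, or your own function(target, output)
    cost: Trainer.cost.CROSS_ENTROPY
});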

I figured out that the performance boost was actually due to a bug at the output layer: the error responsibility was being multiplied by the derivative of the activation function (as happens at all the other layers), but I read the paper again and found that Eq. 10 says that for the output layer the error responsibility is just the injected error. This is a huge bug that has been there since I started coding the lib (oops?) and it was slowing down the training process A LOT. I tested the Perceptron and LSTM architectures on the XOR and DSR tasks and the performance increase is about 80% (!!)

@eralmansouri
Author

Whoa, that seems like something we should have noticed earlier. Sadly, I'm not seeing any performance improvement on my dataset, even with the latest code. I think I assumed the multiplication by the derivative in the output layer was specific to MSE and that CE behaved differently, so I had already removed it from the start.

Also, I believe for CE, the change in weight is calculated as rate * gradient * input.from.activation instead of only rate * gradient. I'm not sure.

Did you update the examples on your README to use the latest revision?

@cazala
Owner

cazala commented Dec 10, 2014

Not yet; I just updated the Trainer section of the readme to add the optional cost function part. I'll update the Demos branch later today.

@eralmansouri
Author

Hey, I'm having a small misunderstanding about something that I think you might be able to help clear up. I was under the impression that LSTM networks can learn this training set correctly, but I think I'm doing something wrong.

I'm trying to create a super simple network that outputs 1 or 0 depending on its last output: on its first activation the output is 0, on its second it's 1, then 0 again, then 1, and so on.

var myNetwork = new Architect.LSTM(1,1,1);
var myTrainer = new Trainer(myNetwork);
var trainingSet = [
    { input: [1], output: [1] },
    { input: [1], output: [0] }
]
myTrainer.train(trainingSet, {rate: 0.01, iterations: 100000, log:100});

Why does this example not perform accurately?

@cazala
Owner

cazala commented Dec 10, 2014

Mm, I tried a small variation of what you did; not exactly the same, but I think it achieves the desired effect. Instead of always giving the network the same input (in your case, 1), train it using both 1 and 0 as valid inputs, and then to test it just activate the network once and feed the output back in as the next input:

var network = new Architect.LSTM(1,2,1);
network.trainer.train([
    { input: [0], output: [1] }, 
    { input: [0], output: [0] }, 
    { input: [1], output: [1] }, 
    { input: [1], output: [0] }
], { rate: .1 });

var output = [0]; 
for (var i = 0; i < 20; i++) 
    console.log(output = network.activate(output));

@eralmansouri
Author

Hmm... that's weird. How come it doesn't work correctly without varying the inputs?

Also, it doesn't work at all if optimization is disabled. I was under the assumption that the optimized and non-optimized networks behave identically. Where did you learn that optimization technique, by the way? It's time-consuming to implement, but the performance can be worth it.

@cazala
Owner

cazala commented Dec 11, 2014

Wow, it's weird that the unoptimized network doesn't work; they should behave exactly the same. It's probably because of the last change that I pushed, even though the tests passed; I'll have to debug it. The optimization process only replicates the behavior of the neuron (same algorithm) but gets rid of unnecessary ifs, loops, and variables (e.g. if a neuron doesn't have a self-connection, the previous state gets multiplied by 0 in the original algorithm, while the optimized function doesn't even compute that term and just leaves it at 0).
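
Just to illustrate the idea (this is not the actual code the optimizer emits, only a sketch with made-up names):

// Generic algorithm: the self-connection term is always computed, even when
// the neuron has no self-connection (its weight is just 0 in that case).
function updateStateGeneric(prevState, selfGain, selfWeight, input) {
    return selfGain * selfWeight * prevState + input;
}

// Hard-coded version for a neuron known to have no self-connection:
// the first term would always be 0, so it's simply never emitted.
function updateStateNoSelfConnection(prevState, input) {
    return input;
}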

And about the different inputs: I realize that what I did was actually different, I just trained the net to output 0 when the input was 1 and 1 when the input was 0 (since I feed the network's output back in as the next input).

@eralmansouri
Author

I'm having some trouble re-creating the difference between the optimized and non-optimized networks. I think I may have been using a modified module by mistake. I hope I didn't waste too much of your time.

Is there a way to train the example I posted earlier without changing the inputs?
