Experiment: dumping all the Learner state into one class that the callbacks can mutate #67

Merged
jph00 merged 3 commits into fastai:master on Apr 5, 2019

Conversation

@marcrasi (Contributor) commented Apr 4, 2019

I wanted to experiment with dumping all the Learner state into one class that the callbacks can mutate. This makes the callback mechanism very similar to the Python one (a rough sketch of the shape is below, after the list).

This PR includes:

  • The mechanism (with some simplifying assumptions and missing pieces).
  • A Recorder and ParamScheduler.
  • An example model that trains with Recorder and ParamScheduler.
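
Schematically, the mechanism looks something like this. This is a minimal sketch, not the notebook's exact code: the Learner members shown and the event cases other than beginFit and afterForwardsBackwards are illustrative assumptions.

import TensorFlow

enum CallbackEvent {
    case beginFit, beginEpoch, afterForwardsBackwards, endEpoch, endFit
}

// A callback gets a mutable reference to the Learner at each event and may read
// or modify any of its state, mirroring the Python fastai callbacks.
class Callback<Opt: Optimizer> {
    func apply(event: CallbackEvent, learner: Learner<Opt>) {}
}

// All of the training state lives in this one class.
class Learner<Opt: Optimizer> {
    var optimizer: Opt
    var loss = Tensor<Float>(0)
    var callbacks: [Callback<Opt>] = []

    init(optimizer: Opt) { self.optimizer = optimizer }

    func notify(_ event: CallbackEvent) {
        for callback in callbacks { callback.apply(event: event, learner: self) }
    }

    func fit(epochs: Int) {
        notify(.beginFit)
        for _ in 0..<epochs {
            notify(.beginEpoch)
            // ... per batch: forward pass, loss, backward pass, optimizer step ...
            notify(.afterForwardsBackwards)
            notify(.endEpoch)
        }
        notify(.endFit)
    }
}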

@review-notebook-app commented:

Check out this pull request on ReviewNB: https://app.reviewnb.com/fastai/fastai_docs/pull/67

Visit www.reviewnb.com to learn how we simplify your Jupyter Notebook workflows.

@jph00 (Member) commented Apr 5, 2019

So happy to see this! 😀😃😄😁🤩

@sgugger (Contributor) commented Apr 5, 2019

Hey Marc, that seems great and very close to what we were doing. Could you just strip your notebook before committing? That's done automatically when you run

tools/run-after-git-clone

You can also manually strip a given notebook with tools/fastai-nbstripout

@dan-zheng (Contributor) left a comment

Huge 👍 to this training loop abstraction.

Defining a single apply method containing a switch statement is especially concise and natural:

class Recorder<Opt: Optimizer> : Callback<Opt> ... {
    override func apply(event: CallbackEvent, learner: Learner<Opt>) {
        switch event {
        case .beginFit:
            losses = []
            lrs = []
        case .afterForwardsBackwards:
            losses.append(learner.loss.scalar!)
            lrs.append(learner.optimizer.learningRate)
        default: break
        }
    }
}

@jph00 (Member) commented Apr 5, 2019

Btw our CI checks for unstripped notebooks - that's why the red X is shown in GH. You'll know if you've fixed it because it'll turn into a green tick.

Sorry for this slight friction - we've found doing this once leads to much less friction in the future.

@jph00 (Member) commented Apr 5, 2019

Marc, looks like that didn't work for you. Here are the docs on the tool: https://docs.fast.ai/dev/develop.html#things-to-run-after-git-clone

Let us know if you need a hand.

@marcrasi (Contributor, Author) commented Apr 5, 2019

I think that there are some other notebooks that I didn't touch in this PR that the CI is finding. Should I just also strip those in the PR?

@jph00 (Member) commented Apr 5, 2019

Thanks!

@jph00 merged commit 7b010f1 into fastai:master on Apr 5, 2019
@marcrasi (Contributor, Author) commented Apr 5, 2019

BTW, I have no specific plans to do anything else to this right now. So anyone feel free to copy/modify/whatever!

@jph00 (Member) commented Apr 5, 2019

Heh looks like the stripout issue was my fault! Sorry!

@sgugger (Contributor) commented Apr 5, 2019

It's always your fault ;)
I'll build on this tomorrow morning, but it looks great!

" // I'm getting some crashes in AD-generated code if I put a `lossFunc` in the learner.\n",
" // So I'm putting a `lossWithGradient` for now, to work around this.\n",
" // (model, context, inputs, labels) -> (loss, grad)\n",
" typealias LossWithGradient = (Model, Context, Tensor<Float>, Tensor<Float>) -> (Tensor<Float>, Model.AllDifferentiableVariables)\n",
Review comment (Contributor):
@marcrasi: could you provide more context on the crashes from using lossFunc?
We may want to prioritize a fix.

@marcrasi (Contributor, Author) replied:

Yeah!

All of this is with one of the nightly binaries that I downloaded on April 2.

Here's a notebook demonstrating the crash: https://github.com/marcrasi/fastai_docs/blob/demonstrate-ad-crash/dev_swift/callback_experiment.ipynb

Running the learner.fit cell fails with:

error: Execution was interrupted, reason: signal SIGSEGV: address access protected (fault address: 0x341bc60).
The process has been left at the point where it was interrupted, use "thread return -x" to return to the state before expression evaluation.

I copied the code into a .swift file (https://github.com/marcrasi/fastai_docs/blob/demonstrate-ad-crash/dev_swift/callback_experiment.swift), ran swiftc, and ran the binary. The binary also segfaults. Running lldb on the binary gives a stacktrace:

marcrasi@marcrasi:~/Downloads$ lldb ./callback_experiment
(lldb) target create "./callback_experiment"
Current executable set to './callback_experiment' (x86_64).
(lldb) run
Process 79345 launched: '/usr/local/google/home/marcrasi/Downloads/callback_experiment' (x86_64)
Process 79345 stopped
* thread #1, name = 'callback_experi', stop reason = signal SIGSEGV: address access protected (fault address: 0x555555fa7760)
    frame #0: 0x0000555555fa7760
->  0x555555fa7760: movb   %ah, (%rdi,%rsi,4)
    0x555555fa7763: idivl  %edi
    0x555555fa7765: jg     0x555555fa7767
    0x555555fa7767: addb   %al, (%rdx)
Target 0: (callback_experiment) stopped.
(lldb) bt
* thread #1, name = 'callback_experi', stop reason = signal SIGSEGV: address access protected (fault address: 0x555555fa7760)
  * frame #0: 0x0000555555fa7760
    frame #1: 0x0000555555562ebe callback_experiment`AD__$s19callback_experiment7LearnerC13trainOneBatch2xb2yby10TensorFlow0I0VySfG_AJtFAJ5ModelQz_AJtXEfU___primal_src_0_wrt_0_1 + 526
    frame #2: 0x000055555556325b callback_experiment`AD__$s19callback_experiment7LearnerC13trainOneBatch2xb2yby10TensorFlow0I0VySfG_AJtFAJ5ModelQz_AJtXEfU___vjp_src_0_wrt_0_1 + 27
    frame #3: 0x0000555555563365 callback_experiment`partial apply forwarder for AD__$s19callback_experiment7LearnerC13trainOneBatch2xb2yby10TensorFlow0I0VySfG_AJtFAJ5ModelQz_AJtXEfU___vjp_src_0_wrt_0_1 + 21
    frame #4: 0x000055555555d62a callback_experiment`reabstraction thunk helper <A where A: TensorFlow.Optimizer, A.Model.AllDifferentiableVariables == A.Model.CotangentVector, A.Model.CotangentVector == A.Model.AllDifferentiableVariables.TangentVector.AllDifferentiableVariables, A.Model.Input == TensorFlow.Tensor<Swift.Float>, A.Model.Output == TensorFlow.Tensor<Swift.Float>, A.Model.AllDifferentiableVariables.TangentVector.AllDifferentiableVariables == A.Model.CotangentVector.TangentVector.AllDifferentiableVariables, A.Model.CotangentVector.TangentVector.AllDifferentiableVariables == A.Model.TangentVector.TangentVector.AllDifferentiableVariables> from @callee_guaranteed (@in_guaranteed A.Model, @guaranteed TensorFlow.Tensor<Swift.Float>) -> (@owned TensorFlow.Tensor<Swift.Float>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<Swift.Float>) -> (@out A.Model.AllDifferentiableVariables, @owned TensorFlow.Tensor<Swift.Float>)) to @escaping @callee_guaranteed (@in_guaranteed A.Model, @in_guaranteed TensorFlow.Tensor<Swift.Float>) -> (@owned TensorFlow.Tensor<Swift.Float>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<Swift.Float>) -> (@out A.Model.AllDifferentiableVariables, @out TensorFlow.Tensor<Swift.Float>)) + 26
    frame #5: 0x0000555555563484 callback_experiment`partial apply forwarder for reabstraction thunk helper <A where A: TensorFlow.Optimizer, A.Model.AllDifferentiableVariables == A.Model.CotangentVector, A.Model.CotangentVector == A.Model.AllDifferentiableVariables.TangentVector.AllDifferentiableVariables, A.Model.Input == TensorFlow.Tensor<Swift.Float>, A.Model.Output == TensorFlow.Tensor<Swift.Float>, A.Model.AllDifferentiableVariables.TangentVector.AllDifferentiableVariables == A.Model.CotangentVector.TangentVector.AllDifferentiableVariables, A.Model.CotangentVector.TangentVector.AllDifferentiableVariables == A.Model.TangentVector.TangentVector.AllDifferentiableVariables> from @callee_guaranteed (@in_guaranteed A.Model, @guaranteed TensorFlow.Tensor<Swift.Float>) -> (@owned TensorFlow.Tensor<Swift.Float>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<Swift.Float>) -> (@out A.Model.AllDifferentiableVariables, @owned TensorFlow.Tensor<Swift.Float>)) to @escaping @callee_guaranteed (@in_guaranteed A.Model, @in_guaranteed TensorFlow.Tensor<Swift.Float>) -> (@owned TensorFlow.Tensor<Swift.Float>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<Swift.Float>) -> (@out A.Model.AllDifferentiableVariables, @out TensorFlow.Tensor<Swift.Float>)) + 36
    frame #6: 0x00007ffff7c1f88a libswiftTensorFlow.so`reabstraction thunk helper <A><A1, B1 where A: Swift.Differentiable, A1: Swift.Differentiable, B1: TensorFlow.TensorFlowFloatingPoint> from @callee_guaranteed (@in_guaranteed A, @in_guaranteed A1) -> (@owned TensorFlow.Tensor<B1>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<B1>) -> (@out A.CotangentVector, @out A1.CotangentVector)) to @escaping @callee_guaranteed (@in_guaranteed A, @in_guaranteed A1) -> (@out TensorFlow.Tensor<B1>, @owned @escaping @callee_guaranteed (@in_guaranteed TensorFlow.Tensor<B1>) -> (@out A.CotangentVector, @out A1.CotangentVector)) + 74
    frame #7: 0x00007ffff7c6639d libswiftTensorFlow.so`reabstraction thunk helper <A><A1, B1 where A: Swift.Differentiable, A1: Swift.Differentiable, B1: TensorFlow.TensorFlowFloatingPoint> from @callee_guaranteed (@in_guaranteed A, @in_guaranteed A1) -> (@owned TensorFlow.Tensor<B1>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<B1>) -> (@out A.CotangentVector, @out A1.CotangentVector)) to @escaping @callee_guaranteed (@in_guaranteed A, @in_guaranteed A1) -> (@out TensorFlow.Tensor<B1>, @owned @escaping @callee_guaranteed (@in_guaranteed TensorFlow.Tensor<B1>) -> (@out A.CotangentVector, @out A1.CotangentVector))partial apply forwarder with unmangled suffix ".209" + 93
    frame #8: 0x00007ffff7821f77 libswiftCore.so`(extension in Swift):Swift.Differentiable.valueWithPullback<A, B where A1: Swift.Differentiable, B1: Swift.Differentiable>(at: A1, in: (A, A1) -> B1) -> (value: B1, pullback: (B1.CotangentVector) -> (A.CotangentVector, A1.CotangentVector)) + 199
    frame #9: 0x00007ffff7c1fddd libswiftTensorFlow.so`(extension in TensorFlow):Swift.Differentiable.valueWithGradient<A, B where A1: Swift.Differentiable, B1: TensorFlow.TensorFlowFloatingPoint>(at: A1, in: (A, A1) -> TensorFlow.Tensor<B1>) -> (value: TensorFlow.Tensor<B1>, gradient: (A.CotangentVector, A1.CotangentVector)) + 1181
    frame #10: 0x000055555555ce95 callback_experiment`callback_experiment.Learner.trainOneBatch(xb: TensorFlow.Tensor<Swift.Float>, yb: TensorFlow.Tensor<Swift.Float>) -> () + 1685
    frame #11: 0x000055555555d8c0 callback_experiment`callback_experiment.Learner.trainOneEpoch() -> () + 496
    frame #12: 0x000055555555db50 callback_experiment`callback_experiment.Learner.fit(epochs: Swift.Int) -> () + 576
    frame #13: 0x000055555555a733 callback_experiment`main + 3171
    frame #14: 0x00007fffedd3c2b1 libc.so.6`__libc_start_main(main=(callback_experiment`main), argc=1, argv=0x00007fffffffddc8, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffddb8) at libc-start.c:291
    frame #15: 0x00005555555598aa callback_experiment`_start + 42

I have not tried reducing the reproducer at all.

@jph00 mentioned this pull request on Apr 5, 2019