Experiment: dumping all the Learner state into one class that the callbacks can mutate #67

Merged
jph00 merged 3 commits into fastai:master on Apr 5, 2019

Conversation

@marcrasi (Contributor) commented Apr 4, 2019

I wanted to experiment with dumping all the Learner state into one class that the callbacks can mutate. This makes the callback mechanism very similar to the Python one (a rough sketch of the shape is below, after the list).

This PR includes:

  • The mechanism (with some simplifying assumptions and missing pieces).
  • A Recorder and ParamScheduler.
  • An example model that trains with Recorder and ParamScheduler.
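
Schematically, the mechanism looks something like this. This is a minimal sketch, not the notebook's exact code: the Learner members shown and the event cases other than beginFit and afterForwardsBackwards are illustrative assumptions.

import TensorFlow

enum CallbackEvent {
    case beginFit, beginEpoch, afterForwardsBackwards, endEpoch, endFit
}

// A callback gets a mutable reference to the Learner at each event and may read
// or modify any of its state, mirroring the Python fastai callbacks.
class Callback<Opt: Optimizer> {
    func apply(event: CallbackEvent, learner: Learner<Opt>) {}
}

// All of the training state lives in this one class.
class Learner<Opt: Optimizer> {
    var optimizer: Opt
    var loss = Tensor<Float>(0)
    var callbacks: [Callback<Opt>] = []

    init(optimizer: Opt) { self.optimizer = optimizer }

    func notify(_ event: CallbackEvent) {
        for callback in callbacks { callback.apply(event: event, learner: self) }
    }

    func fit(epochs: Int) {
        notify(.beginFit)
        for _ in 0..<epochs {
            notify(.beginEpoch)
            // ... per batch: forward pass, loss, backward pass, optimizer step ...
            notify(.afterForwardsBackwards)
            notify(.endEpoch)
        }
        notify(.endFit)
    }
}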

@review-notebook-app commented:

Check out this pull request on ReviewNB: https://app.reviewnb.com/fastai/fastai_docs/pull/67

Visit www.reviewnb.com to learn how we simplify your Jupyter Notebook workflows.

@jph00 (Member) commented Apr 5, 2019

So happy to see this! 😀😃😄😁🤩

@sgugger (Contributor) commented Apr 5, 2019

Hey Marc, that seems great and very close to what we were doing. Could you just strip your notebook before committing? That's done automatically when you run

tools/run-after-git-clone

You can also manually strip a given notebook with tools/fastai-nbstripout

@dan-zheng (Contributor) left a comment

Huge 👍 to this training loop abstraction.

Defining a single apply method containing a switch statement is especially concise and natural:

class Recorder<Opt: Optimizer> : Callback<Opt> ... {
    override func apply(event: CallbackEvent, learner: Learner<Opt>) {
        switch event {
        case .beginFit:
            losses = []
            lrs = []
        case .afterForwardsBackwards:
            losses.append(learner.loss.scalar!)
            lrs.append(learner.optimizer.learningRate)
        default: break
        }
    }
}

@jph00 (Member) commented Apr 5, 2019

Btw our CI checks for unstripped notebooks - that's why the red X is shown in GH. You'll know if you've fixed it because it'll turn into a green tick.

Sorry for this slight friction - we've found doing this once leads to much less friction in the future.

@jph00 (Member) commented Apr 5, 2019

Marc, looks like that didn't work for you. Here are the docs on the tool: https://docs.fast.ai/dev/develop.html#things-to-run-after-git-clone

Let us know if you need a hand.

@marcrasi (Contributor, Author) commented Apr 5, 2019

I think that there are some other notebooks that I didn't touch in this PR that the CI is finding. Should I just also strip those in the PR?

@jph00 (Member) commented Apr 5, 2019

Thanks!

@jph00 merged commit 7b010f1 into fastai:master on Apr 5, 2019
@marcrasi (Contributor, Author) commented Apr 5, 2019

BTW, I have no specific plans to do anything else to this right now. So anyone feel free to copy/modify/whatever!

@jph00 (Member) commented Apr 5, 2019

Heh looks like the stripout issue was my fault! Sorry!

@sgugger (Contributor) commented Apr 5, 2019

It's always your fault ;)
I'll build on this tomorrow morning, but it looks great!

" // I'm getting some crashes in AD-generated code if I put a `lossFunc` in the learner.\n",
" // So I'm putting a `lossWithGradient` for now, to work around this.\n",
" // (model, context, inputs, labels) -> (loss, grad)\n",
" typealias LossWithGradient = (Model, Context, Tensor<Float>, Tensor<Float>) -> (Tensor<Float>, Model.AllDifferentiableVariables)\n",
Review comment (Contributor):
@marcrasi: could you provide more context on the crashes from using lossFunc?
We may want to prioritize a fix.

@marcrasi (Contributor, Author) replied:

Yeah!

All of this is with one of the nightly binaries that I downloaded on April 2.

Here's a notebook demonstrating the crash: https://github.com/marcrasi/fastai_docs/blob/demonstrate-ad-crash/dev_swift/callback_experiment.ipynb

Running the learner.fit cell fails with:

error: Execution was interrupted, reason: signal SIGSEGV: address access protected (fault address: 0x341bc60).
The process has been left at the point where it was interrupted, use "thread return -x" to return to the state before expression evaluation.

I copied the code into a .swift file (https://github.com/marcrasi/fastai_docs/blob/demonstrate-ad-crash/dev_swift/callback_experiment.swift), ran swiftc, and ran the binary. The binary also segfaults. Running lldb on the binary gives a stacktrace:

marcrasi@marcrasi:~/Downloads$ lldb ./callback_experiment
(lldb) target create "./callback_experiment"
Current executable set to './callback_experiment' (x86_64).
(lldb) run
Process 79345 launched: '/usr/local/google/home/marcrasi/Downloads/callback_experiment' (x86_64)
Process 79345 stopped
* thread #1, name = 'callback_experi', stop reason = signal SIGSEGV: address access protected (fault address: 0x555555fa7760)
    frame #0: 0x0000555555fa7760
->  0x555555fa7760: movb   %ah, (%rdi,%rsi,4)
    0x555555fa7763: idivl  %edi
    0x555555fa7765: jg     0x555555fa7767
    0x555555fa7767: addb   %al, (%rdx)
Target 0: (callback_experiment) stopped.
(lldb) bt
* thread #1, name = 'callback_experi', stop reason = signal SIGSEGV: address access protected (fault address: 0x555555fa7760)
  * frame #0: 0x0000555555fa7760
    frame #1: 0x0000555555562ebe callback_experiment`AD__$s19callback_experiment7LearnerC13trainOneBatch2xb2yby10TensorFlow0I0VySfG_AJtFAJ5ModelQz_AJtXEfU___primal_src_0_wrt_0_1 + 526
    frame #2: 0x000055555556325b callback_experiment`AD__$s19callback_experiment7LearnerC13trainOneBatch2xb2yby10TensorFlow0I0VySfG_AJtFAJ5ModelQz_AJtXEfU___vjp_src_0_wrt_0_1 + 27
    frame #3: 0x0000555555563365 callback_experiment`partial apply forwarder for AD__$s19callback_experiment7LearnerC13trainOneBatch2xb2yby10TensorFlow0I0VySfG_AJtFAJ5ModelQz_AJtXEfU___vjp_src_0_wrt_0_1 + 21
    frame #4: 0x000055555555d62a callback_experiment`reabstraction thunk helper <A where A: TensorFlow.Optimizer, A.Model.AllDifferentiableVariables == A.Model.CotangentVector, A.Model.CotangentVector == A.Model.AllDifferentiableVariables.TangentVector.AllDifferentiableVariables, A.Model.Input == TensorFlow.Tensor<Swift.Float>, A.Model.Output == TensorFlow.Tensor<Swift.Float>, A.Model.AllDifferentiableVariables.TangentVector.AllDifferentiableVariables == A.Model.CotangentVector.TangentVector.AllDifferentiableVariables, A.Model.CotangentVector.TangentVector.AllDifferentiableVariables == A.Model.TangentVector.TangentVector.AllDifferentiableVariables> from @callee_guaranteed (@in_guaranteed A.Model, @guaranteed TensorFlow.Tensor<Swift.Float>) -> (@owned TensorFlow.Tensor<Swift.Float>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<Swift.Float>) -> (@out A.Model.AllDifferentiableVariables, @owned TensorFlow.Tensor<Swift.Float>)) to @escaping @callee_guaranteed (@in_guaranteed A.Model, @in_guaranteed TensorFlow.Tensor<Swift.Float>) -> (@owned TensorFlow.Tensor<Swift.Float>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<Swift.Float>) -> (@out A.Model.AllDifferentiableVariables, @out TensorFlow.Tensor<Swift.Float>)) + 26
    frame #5: 0x0000555555563484 callback_experiment`partial apply forwarder for reabstraction thunk helper <A where A: TensorFlow.Optimizer, A.Model.AllDifferentiableVariables == A.Model.CotangentVector, A.Model.CotangentVector == A.Model.AllDifferentiableVariables.TangentVector.AllDifferentiableVariables, A.Model.Input == TensorFlow.Tensor<Swift.Float>, A.Model.Output == TensorFlow.Tensor<Swift.Float>, A.Model.AllDifferentiableVariables.TangentVector.AllDifferentiableVariables == A.Model.CotangentVector.TangentVector.AllDifferentiableVariables, A.Model.CotangentVector.TangentVector.AllDifferentiableVariables == A.Model.TangentVector.TangentVector.AllDifferentiableVariables> from @callee_guaranteed (@in_guaranteed A.Model, @guaranteed TensorFlow.Tensor<Swift.Float>) -> (@owned TensorFlow.Tensor<Swift.Float>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<Swift.Float>) -> (@out A.Model.AllDifferentiableVariables, @owned TensorFlow.Tensor<Swift.Float>)) to @escaping @callee_guaranteed (@in_guaranteed A.Model, @in_guaranteed TensorFlow.Tensor<Swift.Float>) -> (@owned TensorFlow.Tensor<Swift.Float>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<Swift.Float>) -> (@out A.Model.AllDifferentiableVariables, @out TensorFlow.Tensor<Swift.Float>)) + 36
    frame #6: 0x00007ffff7c1f88a libswiftTensorFlow.so`reabstraction thunk helper <A><A1, B1 where A: Swift.Differentiable, A1: Swift.Differentiable, B1: TensorFlow.TensorFlowFloatingPoint> from @callee_guaranteed (@in_guaranteed A, @in_guaranteed A1) -> (@owned TensorFlow.Tensor<B1>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<B1>) -> (@out A.CotangentVector, @out A1.CotangentVector)) to @escaping @callee_guaranteed (@in_guaranteed A, @in_guaranteed A1) -> (@out TensorFlow.Tensor<B1>, @owned @escaping @callee_guaranteed (@in_guaranteed TensorFlow.Tensor<B1>) -> (@out A.CotangentVector, @out A1.CotangentVector)) + 74
    frame #7: 0x00007ffff7c6639d libswiftTensorFlow.so`reabstraction thunk helper <A><A1, B1 where A: Swift.Differentiable, A1: Swift.Differentiable, B1: TensorFlow.TensorFlowFloatingPoint> from @callee_guaranteed (@in_guaranteed A, @in_guaranteed A1) -> (@owned TensorFlow.Tensor<B1>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<B1>) -> (@out A.CotangentVector, @out A1.CotangentVector)) to @escaping @callee_guaranteed (@in_guaranteed A, @in_guaranteed A1) -> (@out TensorFlow.Tensor<B1>, @owned @escaping @callee_guaranteed (@in_guaranteed TensorFlow.Tensor<B1>) -> (@out A.CotangentVector, @out A1.CotangentVector))partial apply forwarder with unmangled suffix ".209" + 93
    frame #8: 0x00007ffff7821f77 libswiftCore.so`(extension in Swift):Swift.Differentiable.valueWithPullback<A, B where A1: Swift.Differentiable, B1: Swift.Differentiable>(at: A1, in: (A, A1) -> B1) -> (value: B1, pullback: (B1.CotangentVector) -> (A.CotangentVector, A1.CotangentVector)) + 199
    frame #9: 0x00007ffff7c1fddd libswiftTensorFlow.so`(extension in TensorFlow):Swift.Differentiable.valueWithGradient<A, B where A1: Swift.Differentiable, B1: TensorFlow.TensorFlowFloatingPoint>(at: A1, in: (A, A1) -> TensorFlow.Tensor<B1>) -> (value: TensorFlow.Tensor<B1>, gradient: (A.CotangentVector, A1.CotangentVector)) + 1181
    frame #10: 0x000055555555ce95 callback_experiment`callback_experiment.Learner.trainOneBatch(xb: TensorFlow.Tensor<Swift.Float>, yb: TensorFlow.Tensor<Swift.Float>) -> () + 1685
    frame #11: 0x000055555555d8c0 callback_experiment`callback_experiment.Learner.trainOneEpoch() -> () + 496
    frame #12: 0x000055555555db50 callback_experiment`callback_experiment.Learner.fit(epochs: Swift.Int) -> () + 576
    frame #13: 0x000055555555a733 callback_experiment`main + 3171
    frame #14: 0x00007fffedd3c2b1 libc.so.6`__libc_start_main(main=(callback_experiment`main), argc=1, argv=0x00007fffffffddc8, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffddb8) at libc-start.c:291
    frame #15: 0x00005555555598aa callback_experiment`_start + 42

I have not tried reducing the reproducer at all.

@jph00 mentioned this pull request on Apr 5, 2019