Experiment: dumping all the Learner state into one class that the callbacks can mutate #67
Conversation
Check out this pull request on ReviewNB: https://app.reviewnb.com/fastai/fastai_docs/pull/67
So happy to see this! 😀😃😄😁🤩
Hey Marc, that seems great and very close to what we were doing. Could you just strip your notebook before committing? That's done automatically once you've run the post-clone setup. You can also manually strip a given notebook with `tools/fastai-nbstripout`.
Huge 👍 to this training loop abstraction. Defining a single `apply` method containing a `switch` statement is especially concise and natural:
```swift
class Recorder<Opt: Optimizer> : Callback<Opt> ... {
    override func apply(event: CallbackEvent, learner: Learner<Opt>) {
        switch event {
        case .beginFit:
            losses = []
            lrs = []
        case .afterForwardsBackwards:
            losses.append(learner.loss.scalar!)
            lrs.append(learner.optimizer.learningRate)
        default: break
        }
    }
}
```
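For context, here is a minimal, framework-free sketch of the event/callback machinery the `Recorder` snippet above assumes. This is illustrative only: plain `Float` fields stand in for tensors and optimizer state, and the `send` method and field names are hypothetical, not the PR's actual API.

```swift
// Simplified sketch of the callback machinery a Recorder like the one
// above plugs into. Floats stand in for tensors; names are illustrative.

enum CallbackEvent {
    case beginFit, afterForwardsBackwards, endFit
}

class Learner {
    var loss: Float = 0
    var learningRate: Float = 0.01
    var callbacks: [Callback] = []

    // Broadcast an event to every registered callback, passing the
    // learner itself so callbacks can read (or mutate) its state.
    func send(_ event: CallbackEvent) {
        for cb in callbacks { cb.apply(event: event, learner: self) }
    }
}

class Callback {
    func apply(event: CallbackEvent, learner: Learner) {}
}

final class Recorder: Callback {
    var losses: [Float] = []
    var lrs: [Float] = []

    override func apply(event: CallbackEvent, learner: Learner) {
        switch event {
        case .beginFit:
            losses = []
            lrs = []
        case .afterForwardsBackwards:
            losses.append(learner.loss)
            lrs.append(learner.learningRate)
        default: break
        }
    }
}

// Usage: one recorded loss/lr pair per forward/backward pass.
let learner = Learner()
let recorder = Recorder()
learner.callbacks = [recorder]
learner.send(.beginFit)
learner.loss = 0.5
learner.send(.afterForwardsBackwards)
print(recorder.losses)  // [0.5]
```

The single `apply(event:learner:)` entry point means adding a new hook is just a new enum case, rather than a new overridable method on the base class.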
Btw, our CI checks for unstripped notebooks - that's why the red X is shown in GitHub. You'll know you've fixed it when it turns into a green tick. Sorry for the slight friction - we've found that doing this once leads to much less friction in the future.
Marc, looks like that didn't work for you. Here are the docs on the tool: https://docs.fast.ai/dev/develop.html#things-to-run-after-git-clone Let us know if you need a hand.
I think the CI is also flagging some notebooks that I didn't touch in this PR. Should I just strip those in this PR too?
Thanks!
BTW, I have no specific plans to do anything else to this right now. So anyone feel free to copy/modify/whatever!
Heh looks like the stripout issue was my fault! Sorry! |
It's always your fault ;) |
```swift
// I'm getting some crashes in AD-generated code if I put a `lossFunc` in the learner.
// So I'm putting a `lossWithGradient` for now, to work around this.
// (model, context, inputs, labels) -> (loss, grad)
typealias LossWithGradient = (Model, Context, Tensor<Float>, Tensor<Float>) -> (Tensor<Float>, Model.AllDifferentiableVariables)
```
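To illustrate the shape of such a closure, here is a heavily simplified stand-in: scalar `Float`s replace tensors, the model's single weight plays the role of `AllDifferentiableVariables`, and the gradient is written by hand rather than produced by AD. `Model`, `Context`, and the squared-error loss are hypothetical, not the PR's code.

```swift
// Illustrative stand-ins; the real types are TensorFlow tensors and models.
struct Model { var w: Float }
struct Context {}

// Mirrors the shape of LossWithGradient above, with Float replacing
// Tensor<Float> and the single weight serving as the gradient type.
typealias LossWithGradient = (Model, Context, Float, Float) -> (Float, Float)

// A hand-written squared-error loss plus its gradient w.r.t. w,
// standing in for what AD would normally derive from a plain lossFunc.
let lossWithGradient: LossWithGradient = { model, _, x, y in
    let pred = model.w * x
    let loss = (pred - y) * (pred - y)
    let grad = 2 * (pred - y) * x  // d(loss)/dw
    return (loss, grad)
}

let (loss, grad) = lossWithGradient(Model(w: 2), Context(), 3, 5)
print(loss, grad)  // 1.0 6.0
```

Bundling loss and gradient into one closure sidesteps the AD crash at the cost of asking users to supply (or pre-derive) the gradient themselves.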
@marcrasi: could you provide more context on the crashes from using `lossFunc`? We may want to prioritize a fix.
Yeah!
All of this is with one of the nightly binaries that I downloaded on 4/2.
Here's a notebook demonstrating the crash: https://github.com/marcrasi/fastai_docs/blob/demonstrate-ad-crash/dev_swift/callback_experiment.ipynb
On the `learner.fit` cell, it says:
error: Execution was interrupted, reason: signal SIGSEGV: address access protected (fault address: 0x341bc60).
The process has been left at the point where it was interrupted, use "thread return -x" to return to the state before expression evaluation.
I copied the code into a `.swift` file (https://github.com/marcrasi/fastai_docs/blob/demonstrate-ad-crash/dev_swift/callback_experiment.swift), ran `swiftc`, and ran the binary. The binary also segfaults. Running `lldb` on the binary gives a stacktrace:
marcrasi@marcrasi:~/Downloads$ lldb ./callback_experiment
(lldb) target create "./callback_experiment"
Current executable set to './callback_experiment' (x86_64).
(lldb) run
Process 79345 launched: '/usr/local/google/home/marcrasi/Downloads/callback_experiment' (x86_64)
Process 79345 stopped
* thread #1, name = 'callback_experi', stop reason = signal SIGSEGV: address access protected (fault address: 0x555555fa7760)
frame #0: 0x0000555555fa7760
-> 0x555555fa7760: movb %ah, (%rdi,%rsi,4)
0x555555fa7763: idivl %edi
0x555555fa7765: jg 0x555555fa7767
0x555555fa7767: addb %al, (%rdx)
Target 0: (callback_experiment) stopped.
(lldb) bt
* thread #1, name = 'callback_experi', stop reason = signal SIGSEGV: address access protected (fault address: 0x555555fa7760)
* frame #0: 0x0000555555fa7760
frame #1: 0x0000555555562ebe callback_experiment`AD__$s19callback_experiment7LearnerC13trainOneBatch2xb2yby10TensorFlow0I0VySfG_AJtFAJ5ModelQz_AJtXEfU___primal_src_0_wrt_0_1 + 526
frame #2: 0x000055555556325b callback_experiment`AD__$s19callback_experiment7LearnerC13trainOneBatch2xb2yby10TensorFlow0I0VySfG_AJtFAJ5ModelQz_AJtXEfU___vjp_src_0_wrt_0_1 + 27
frame #3: 0x0000555555563365 callback_experiment`partial apply forwarder for AD__$s19callback_experiment7LearnerC13trainOneBatch2xb2yby10TensorFlow0I0VySfG_AJtFAJ5ModelQz_AJtXEfU___vjp_src_0_wrt_0_1 + 21
frame #4: 0x000055555555d62a callback_experiment`reabstraction thunk helper <A where A: TensorFlow.Optimizer, A.Model.AllDifferentiableVariables == A.Model.CotangentVector, A.Model.CotangentVector == A.Model.AllDifferentiableVariables.TangentVector.AllDifferentiableVariables, A.Model.Input == TensorFlow.Tensor<Swift.Float>, A.Model.Output == TensorFlow.Tensor<Swift.Float>, A.Model.AllDifferentiableVariables.TangentVector.AllDifferentiableVariables == A.Model.CotangentVector.TangentVector.AllDifferentiableVariables, A.Model.CotangentVector.TangentVector.AllDifferentiableVariables == A.Model.TangentVector.TangentVector.AllDifferentiableVariables> from @callee_guaranteed (@in_guaranteed A.Model, @guaranteed TensorFlow.Tensor<Swift.Float>) -> (@owned TensorFlow.Tensor<Swift.Float>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<Swift.Float>) -> (@out A.Model.AllDifferentiableVariables, @owned TensorFlow.Tensor<Swift.Float>)) to @escaping @callee_guaranteed (@in_guaranteed A.Model, @in_guaranteed TensorFlow.Tensor<Swift.Float>) -> (@owned TensorFlow.Tensor<Swift.Float>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<Swift.Float>) -> (@out A.Model.AllDifferentiableVariables, @out TensorFlow.Tensor<Swift.Float>)) + 26
frame #5: 0x0000555555563484 callback_experiment`partial apply forwarder for reabstraction thunk helper <A where A: TensorFlow.Optimizer, A.Model.AllDifferentiableVariables == A.Model.CotangentVector, A.Model.CotangentVector == A.Model.AllDifferentiableVariables.TangentVector.AllDifferentiableVariables, A.Model.Input == TensorFlow.Tensor<Swift.Float>, A.Model.Output == TensorFlow.Tensor<Swift.Float>, A.Model.AllDifferentiableVariables.TangentVector.AllDifferentiableVariables == A.Model.CotangentVector.TangentVector.AllDifferentiableVariables, A.Model.CotangentVector.TangentVector.AllDifferentiableVariables == A.Model.TangentVector.TangentVector.AllDifferentiableVariables> from @callee_guaranteed (@in_guaranteed A.Model, @guaranteed TensorFlow.Tensor<Swift.Float>) -> (@owned TensorFlow.Tensor<Swift.Float>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<Swift.Float>) -> (@out A.Model.AllDifferentiableVariables, @owned TensorFlow.Tensor<Swift.Float>)) to @escaping @callee_guaranteed (@in_guaranteed A.Model, @in_guaranteed TensorFlow.Tensor<Swift.Float>) -> (@owned TensorFlow.Tensor<Swift.Float>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<Swift.Float>) -> (@out A.Model.AllDifferentiableVariables, @out TensorFlow.Tensor<Swift.Float>)) + 36
frame #6: 0x00007ffff7c1f88a libswiftTensorFlow.so`reabstraction thunk helper <A><A1, B1 where A: Swift.Differentiable, A1: Swift.Differentiable, B1: TensorFlow.TensorFlowFloatingPoint> from @callee_guaranteed (@in_guaranteed A, @in_guaranteed A1) -> (@owned TensorFlow.Tensor<B1>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<B1>) -> (@out A.CotangentVector, @out A1.CotangentVector)) to @escaping @callee_guaranteed (@in_guaranteed A, @in_guaranteed A1) -> (@out TensorFlow.Tensor<B1>, @owned @escaping @callee_guaranteed (@in_guaranteed TensorFlow.Tensor<B1>) -> (@out A.CotangentVector, @out A1.CotangentVector)) + 74
frame #7: 0x00007ffff7c6639d libswiftTensorFlow.so`reabstraction thunk helper <A><A1, B1 where A: Swift.Differentiable, A1: Swift.Differentiable, B1: TensorFlow.TensorFlowFloatingPoint> from @callee_guaranteed (@in_guaranteed A, @in_guaranteed A1) -> (@owned TensorFlow.Tensor<B1>, @owned @escaping @callee_guaranteed (@guaranteed TensorFlow.Tensor<B1>) -> (@out A.CotangentVector, @out A1.CotangentVector)) to @escaping @callee_guaranteed (@in_guaranteed A, @in_guaranteed A1) -> (@out TensorFlow.Tensor<B1>, @owned @escaping @callee_guaranteed (@in_guaranteed TensorFlow.Tensor<B1>) -> (@out A.CotangentVector, @out A1.CotangentVector))partial apply forwarder with unmangled suffix ".209" + 93
frame #8: 0x00007ffff7821f77 libswiftCore.so`(extension in Swift):Swift.Differentiable.valueWithPullback<A, B where A1: Swift.Differentiable, B1: Swift.Differentiable>(at: A1, in: (A, A1) -> B1) -> (value: B1, pullback: (B1.CotangentVector) -> (A.CotangentVector, A1.CotangentVector)) + 199
frame #9: 0x00007ffff7c1fddd libswiftTensorFlow.so`(extension in TensorFlow):Swift.Differentiable.valueWithGradient<A, B where A1: Swift.Differentiable, B1: TensorFlow.TensorFlowFloatingPoint>(at: A1, in: (A, A1) -> TensorFlow.Tensor<B1>) -> (value: TensorFlow.Tensor<B1>, gradient: (A.CotangentVector, A1.CotangentVector)) + 1181
frame #10: 0x000055555555ce95 callback_experiment`callback_experiment.Learner.trainOneBatch(xb: TensorFlow.Tensor<Swift.Float>, yb: TensorFlow.Tensor<Swift.Float>) -> () + 1685
frame #11: 0x000055555555d8c0 callback_experiment`callback_experiment.Learner.trainOneEpoch() -> () + 496
frame #12: 0x000055555555db50 callback_experiment`callback_experiment.Learner.fit(epochs: Swift.Int) -> () + 576
frame #13: 0x000055555555a733 callback_experiment`main + 3171
frame #14: 0x00007fffedd3c2b1 libc.so.6`__libc_start_main(main=(callback_experiment`main), argc=1, argv=0x00007fffffffddc8, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffddb8) at libc-start.c:291
frame #15: 0x00005555555598aa callback_experiment`_start + 42
I have not tried reducing the reproducer at all.
I wanted to experiment with dumping all the Learner state into one class that the callbacks can mutate. This makes the callbacks mechanism very similar to the Python one.
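In outline, the idea can be sketched like this (a toy model, not the PR's actual code: the `beginEpoch` hook, `Scheduler`, and all field names are hypothetical): every piece of mutable training state lives on the `Learner`, and callbacks receive a reference to it so they can both observe and change it, just as fastai's Python callbacks do.

```swift
// Toy sketch: all mutable state lives on the Learner, and callbacks
// mutate it directly. Names are illustrative.
class Learner {
    var learningRate: Float = 1.0
    var epoch = 0
    var stop = false
    var callbacks: [Callback] = []

    func fit(epochs: Int) {
        for e in 0..<epochs {
            epoch = e
            callbacks.forEach { $0.beginEpoch(self) }
            if stop { break }  // a callback requested early stopping
        }
    }
}

class Callback {
    func beginEpoch(_ learner: Learner) {}
}

// A callback that mutates learner state: halves the LR each epoch and
// requests an early stop, the way Python fastai callbacks can.
final class Scheduler: Callback {
    override func beginEpoch(_ learner: Learner) {
        learner.learningRate *= 0.5
        if learner.epoch >= 2 { learner.stop = true }
    }
}

let learner = Learner()
learner.callbacks = [Scheduler()]
learner.fit(epochs: 10)
print(learner.epoch, learner.learningRate)  // 2 0.125
```

Because the callback holds a reference to the learner rather than copies of individual values, a single callback can coordinate several pieces of state (LR, stop flag, epoch counter) without threading them through the training loop's signature.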
This PR includes: