
TODO: progress towards v0.3.0 #9

Closed
mehdi-cit opened this issue May 10, 2016 · 35 comments

@mehdi-cit

It's all in the question's title!
I am of the opinion that this project, if carried out correctly, would give a huge boost to ArrayFire (using ArrayFire from Node, that is).

@unbornchikken
Member

unbornchikken commented May 11, 2016

Yeah, it's basically feature complete. But unfortunately it has a serious flaw: for the sake of performance we need deterministic scopes (arrayfire/arrayfire-dotnet#8 (comment)). But that implies that we will call destructors synchronously, and because destructors are synchronization points in ArrayFire, ArrayFire.js is synchronous despite my best effort to make it asynchronous. I'm thinking about a new approach, but that's gonna make every call and value access asynchronous for sure, which is kinda ugly and hurts performance.

So it works, but in its current state it blocks the event loop.
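A minimal sketch of why a synchronous sync point is a problem (plain Node.js, no ArrayFire, all names hypothetical): while the synchronous wait runs, even callbacks that are already queued cannot fire.

```javascript
// Hypothetical sketch: a synchronous device wait (like a destructor that
// hits a driver synchronization point) monopolizes the Node.js event loop.
function syncDeviceWait(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {} // busy-wait: nothing else on the loop can run
}

const events = [];
setImmediate(() => events.push('immediate callback'));
syncDeviceWait(50);                // the queued callback cannot fire yet
events.push('sync wait finished');
// Once the loop turns, 'immediate callback' lands *after* the sync wait.
```

This is exactly the shape of the problem: the wait itself may be short, but every timer, socket, and `setImmediate` on the process is stalled for its duration.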

EDIT:

TODO list of new version:

  • device functions
  • unified functions
  • array class
    • construct
    • indexing and assign
    • operators
    • misc properties and methods
  • create array
  • math functions
  • vector functions
  • reductions
  • complex types
  • update readme and docs
  • add a detailed contributing guide
  • wait for contributions of other functions, or implement them in spare time

@mehdi-cit
Author

@unbornchikken Thanks for the quick reply.
Please do keep us posted if you find ways to mitigate this!

@unbornchikken unbornchikken changed the title Is this project still active? (developed/maintained) TODO: work out a way to have deterministic RAII scopes but keep the flow as async May 26, 2016
@unbornchikken
Member

unbornchikken commented May 26, 2016

@umar456 Please take a look at my previous comment. I'm thinking about a solution. What if, let's say, the array's destructor implementation wouldn't block, but would instead enqueue an asynchronous operation to free up the array's resources once all previous operations have completed? cc @pavanky

@umar456
Member

umar456 commented May 26, 2016

@pavanky can give you more detail, but the destructor shouldn't be a blocking call. It should be managing the reference counts for all arrays, and an array should be marked for deletion once all of its work is done. Those objects will be deleted at a later time (if memory of that size is not reused, or when the garbage collector is called). That event needs to be blocking because the GPU drivers perform a synchronization on the device, but we try to avoid that whenever possible.
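A rough sketch of the scheme described above, with entirely hypothetical names (`DeviceBuffer` and `MemoryPool` are not ArrayFire APIs): the destructor path only drops a reference count and marks the buffer; actual deletion is deferred to a later sweep.

```javascript
// Hypothetical sketch of deferred deletion via reference counting.
class DeviceBuffer {
  constructor(bytes) {
    this.bytes = bytes;
    this.refs = 1;
    this.deleted = false;
  }
  retain() { this.refs++; return this; }
  release(pool) {
    // Non-blocking: just mark for deletion once the last reference drops.
    if (--this.refs === 0) pool.markForDeletion(this);
  }
}

class MemoryPool {
  constructor() { this.pending = []; }
  markForDeletion(buf) { this.pending.push(buf); }
  // Called only under memory pressure; this is the rare point that may
  // actually synchronize with the device.
  sweep() {
    for (const buf of this.pending) buf.deleted = true;
    this.pending = [];
  }
}
```

The key property is that `release` never waits on the device; the expensive synchronization is concentrated in `sweep`, which runs as rarely as possible.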

@unbornchikken
Member

Ok, I'm gonna work out a simple repro case with a code flow that - in theory - shouldn't block at all, but according to v8 performance data it does.

@unbornchikken
Member

@mehdi-cit as you can see, this project is still active, but life events prevented me from making significant progress on it in the last few months. But since my pet ML project still has, and will have, a dependency on ArrayFire.js (and on CMake.js), you can expect me to put my focus back on those eventually.

@mehdi-cit
Author

Not sure if this can help with the issue at hand, but it could be a good alternative when it comes to "integrating" JavaScript and C++ code:
https://github.com/charto/nbind

@unbornchikken
Member

unbornchikken commented Aug 16, 2016

@mehdi-cit unfortunately it's not as simple as nbind states. In ArrayFire there are a bunch of operations that act as synchronization points: constructors, complex operators, memory copy, etc. Which means, if you wrap them naively as-is, then you're gonna block the main loop at that point until all previously enqueued AF operations complete. In Node.js you should never block the event loop.

Let's say nbind supports asynchronous operations the standard way, i.e. by using nan's async workers: https://github.com/nodejs/nan/blob/master/doc/asyncworker.md (note: almost all native library wrappers do this). But AsyncWorker uses libuv worker threads, so you gotta synchronize AF calls somehow. In the current version you gotta use manual locks, but eventually a thread-safe ArrayFire will land and make that unnecessary. If you lock libuv workers, then you'll serialize them whenever more than one AF operation is executing in parallel. Which means you'll kill libuv entirely and make Node.js totally synchronous, which is really, really bad.

The only viable option is to launch a separate libuv loop for AF and make your binding a proxy for it. Well, this is where things get really complicated, especially if you're interacting with v8 from C++, because of the verbosity and complexity of v8/nan.
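The separate-loop idea can be approximated in plain JavaScript with a serialized promise chain standing in for the dedicated libuv loop (a hypothetical sketch, not the actual binding design): every native call is funneled through one ordered queue, callers get Promises, and the main event loop never blocks.

```javascript
// Hypothetical sketch: a proxy that serializes every "native" call on one
// queue (standing in for a dedicated libuv loop for ArrayFire).
class AfProxy {
  constructor() { this.tail = Promise.resolve(); }
  enqueue(op) {
    const result = this.tail.then(op); // ops run strictly one after another
    this.tail = result.catch(() => {}); // keep the chain alive after errors
    return result;                      // caller awaits this Promise
  }
}
```

Usage: `proxy.enqueue(() => nativeCall(...))` returns a Promise; because all operations share one chain, AF calls never race each other, yet the main loop stays free.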

That's why I'm creating fastcall. It will offer about the same performance that you could get with C++ based bindings (dyncall is really that fast; according to my benchmarks, its overhead is negligible, 5%-ish), together with the above-mentioned separate libuv loop support.

Once fastcall stabilizes, I'll be back to this project. However I gotta invent something for proper RAII in JS, because of arrayfire/arrayfire-dotnet#8 (comment)

@robertleeplummerjr

robertleeplummerjr commented Aug 22, 2016

I've been watching arrayfire.js for a few months now, and am in love with it. I'm working on https://github.com/harthur-org/brain.js and its recurrent neural net, and want to connect it to arrayfire.js at some future point in time when things line up. I've spent a great deal of time researching before coding anything, and started here (amongst many research papers, how-to articles, and many other libraries) for the most part:
https://github.com/karpathy/recurrentjs

After reviewing each of the major methods that are associated with the overall mathematical procedures, I found something that concerned me:

  1. create a new matrix for each and every mathematical operation: https://github.com/karpathy/recurrentjs/blob/master/src/recurrent.js#L199
  2. create a new closure for every mathematical operation: https://github.com/karpathy/recurrentjs/blob/master/src/recurrent.js#L211

The reason this concerned me is, first, that those are (from past experience) memory leaks; and second, after reviewing the recurrent neural net in arrayfire.js, I see a similar approach (please note, I am still very naive about arrayfire.js, and this isn't a slam against the library, but rather a rethinking of the semi-normal). It got me thinking: how could we greatly speed this up? Or rather, how can we use fewer resources to do the same thing? One of the biggest bottlenecks of multi-threading is, of course, memory to and from cpu -> gpu -> cpu -> repeat, garbage collecting, etc. But what if we could offload the entire operation of calculations to the gpu, so that there is only one input (values -> input -> gpu) and one output (values -> output -> cpu)?

Tinkering with the idea, I came up with a few pseudo code sessions to try and wrap my head around what I was aiming for, and yes, I'll say it was a yak shave at best. This was how I saw the operation on the first go around ([outlined here](https://github.com/BrainJS/brain.js/issues/24)). I thought: really, what we are trying to do is build a state tree, just like in parsing, so you'd have something like:

```js
{
  left: leftMatrix,
  right: rightMatrix,
  into: intoMatrix,
  forwardFunction: forwardFunction,
  backpropagateFunction: backpropagateFunction
}
```

This would repeat over and over again, depending on the complexity of your math. `forwardFunction` would be the math moving forward, say add, multiply, or relu. So here, if we send `left` and `right` as arguments called with `forwardFunction`, then `into` is where `forwardFunction` puts the resulting values from the addition. The state tree is a little odd & backward, but you get the gist (exaggerated a bit to show structure):

```text
 left-\
       > add = into -> next -\
right-/                       \
                               > multiply = into -> next -\
 left-\                       /                            \
       > add = into -> next -/                              \
right-/                                                      \
                                                              > relu = into -> DONE!
 left-\                                                      /
       > add = into -> next -\                              /
right-/                       \                            /
                               > multiply = into -> next -/
 left-\                       /
       > add = into -> next -/
right-/
```

In this scenario, rather than doing just-in-time operations on math, we'd actually set up a math equation in the gpu that could be fed a set of data and would, at some future point in time, return the answer to it.
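A toy version of that idea can be sketched on the CPU (hypothetical `Equation` API, plain numbers standing in for matrices): calls only build an expression tree, and nothing is computed until `run` evaluates the whole tree in one go.

```javascript
// Hypothetical sketch: deferred evaluation via an expression tree.
// Imagine evaluate() being a single pass executed on the GPU.
class Equation {
  node(op, inputs) { return { op, inputs }; }
  add(a, b)      { return this.node((x, y) => x + y, [a, b]); }
  multiply(a, b) { return this.node((x, y) => x * y, [a, b]); }
  relu(a)        { return this.node(x => Math.max(0, x), [a]); }
  evaluate(n) {
    if (typeof n === 'number') return n; // leaf: a concrete value
    return n.op(...n.inputs.map(i => this.evaluate(i)));
  }
  run(root, callback) { callback(this.evaluate(root)); }
}
```

Building `eq.relu(eq.add(eq.multiply(2, 3), 4))` allocates only tree nodes; the arithmetic happens later, in one traversal, when `run` is called.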

In the original neural net, the setup doesn't really exist, other than some tricky prev value checks to trigger backfeeding the neural net; the just-in-time calculations look something like:

```js
var h0 = this.multiply(hiddenMatrix.weight, inputVector, this);
var h1 = this.multiply(hiddenMatrix.transition, hiddenPrev, this);
var hiddenD = this.relu(this.add(this.add(h0, h1, this), hiddenMatrix.bias, this), this);
```

What I'm proposing is that with this new thinking, we'd setup a math problem that could be used, similar to:

```js
var eq = new Equation();
return eq.relu(
      eq.add(
        eq.add(
          eq.multiply(
            hiddenModel.weight,
            input
          ),
          eq.multiply(
            hiddenModel.transition,
            previousInput
          )
        ),
        hiddenModel.bias
      )
    );
```

Which instantiates an equation that can be used like:

```js
eq.run(input, function(output) {
  console.log('look ma!  no CPU!', output);
});
```

@robertleeplummerjr

robertleeplummerjr commented Aug 22, 2016

I accidentally hit enter before I was done. So my question is: would this solve the problem we are having of a blocking, ultimately synchronous thread? By giving the gpu the whole problem, with minimal in and out, would that address it, or at least help, so that we could have a full-fledged multi-threaded approach?

@robertleeplummerjr

@UniqueFool, curious your thoughts here.

@unbornchikken
Member

unbornchikken commented Aug 23, 2016

@robertleeplummerjr Just wait for the new fastcall based bindings to come out before taking any serious dependency on AF.js, please. In this version I'm working on the fully asynchronous, declarative approach that you proposed, with just one huge exception: I wanna have control flow too, not just expressions! Like:

```js
const result = yield raii.scope(() => {
    const arr1 = af.randu(42);
    const arr2 = af.constant(0, 42);
    for (let i = 0; i < 10; i++) {
        arr2.set(Math.random() * 42, arr1.get(Math.random() * 42));
    }
    return arr1.host();
});
```

You'll get yer plain old JavaScript, but it doesn't get executed right away. It gets enqueued in a separate libuv loop, and you'll get a Promise that resolves asynchronously once all of the operations have completed on the device. And there will be an asynchronous RAII mechanism that will do exactly the same automatic RAM and VRAM resource management that the C++ bindings have.
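One way such an asynchronous RAII scope could work in principle (a hypothetical sketch with mock resources, not the actual fastcall mechanism): resources created inside the scope are tracked, and every tracked resource is disposed once the scope's promise settles, mirroring C++ destructor semantics without blocking.

```javascript
// Hypothetical sketch of an asynchronous RAII scope. Resources must
// expose a dispose() method; raii.track() registers them with the
// innermost active scope.
const raii = {
  _live: null,
  track(resource) {
    if (raii._live) raii._live.push(resource);
    return resource;
  },
  async scope(body) {
    const prev = raii._live;       // support nested scopes
    raii._live = [];
    const tracked = raii._live;
    try {
      return await body();
    } finally {
      raii._live = prev;
      for (const r of tracked) r.dispose(); // deterministic cleanup
    }
  }
};
```

The point of the sketch is the ordering guarantee: the body's return value is produced first, then every resource allocated inside the scope is freed, whether the body succeeded or threw.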

@robertleeplummerjr

What is the eta? No rush on perfection :)

@unbornchikken
Member

That above is just the trailer. Kinda like No Man's Sky. :) ETA: when it's done. ;) I'll keep you posted in this thread about my progress.

@robertleeplummerjr

Anything I can do to help?

@unbornchikken
Member

Unfortunately nothing at this stage. Once I'm starting to add some actual methods to the new binding, you can help to add the others.

@robertleeplummerjr

I love feedback, and brainstorming. I'll be here to assist in the meantime.

@robertleeplummerjr

Any updates?

@unbornchikken
Member

On it. I've just reached the second milestone with fastcall, one major feature remains: callback support. Few weeks ahead.

@robertleeplummerjr

Saweet! As of this evening I got rnn, lstm, and gru networks up and running with unit tests!!! Your audience is standing by to watch the master at work.

@robertleeplummerjr

Bragging rights: BrainJS/brain.js#29

@robertleeplummerjr

Your code looks fantastic, by the way!

@unbornchikken
Member

@robertleeplummerjr

(awesome!)

@unbornchikken
Member

Work on the new version has been started: https://github.com/arrayfire/arrayfire-js/tree/fastcall

Sorry for the delay, I had to invent the wheel to make an efficient ArrayFire binding possible on Node.

@robertleeplummerjr

How ironic, I just landed the rnn, lstm, and gru last night!
BrainJS/brain.js#29

Ty for your hard work!

@robertleeplummerjr

nearly ready?

@unbornchikken
Member

unbornchikken commented Dec 19, 2016

It depends on what you mean by nearly. :) I'm working on the array class. Once it gets ready, only the function wrap grinding remains. Which is a lot of work, but repetitive and easy to do. That's where I'm hoping for a bunch of PRs, though.

@robertleeplummerjr

robertleeplummerjr commented Dec 21, 2016

I would like to break down what will happen, once this is ready, to better understand. Here is an example I posted from above of how the neural net equation is composed:

```js
var eq = new Equation();
return eq.relu(
      eq.add(
        eq.add(
          eq.multiply(
            hiddenModel.weight,
            input
          ),
          eq.multiply(
            hiddenModel.transition,
            previousInput
          )
        ),
        hiddenModel.bias
      )
    );
```

This will give us, not processed numbers, but rather all the steps (think of it like a parse tree) needed to produce processed numbers at a later point in time. Sometime later we do:

```js
eq.run();
```

and then to run the equation backward (needed for backpropagation) we run:

```js
eq.runBackpropagate();
```

The standard model is to perform equations on the gpu and send them to the cpu, and the cpu sends them back to the gpu, and then again to the cpu. What you end up with are tons of copies of arrays that ultimately (arguably) are not needed. So my question is this: is there any way to keep the values on the gpu, and have them completely processed there, so there is one (or far fewer) input(s), and much less copying in and out?

To illustrate the standard model:

```js
relu(
      add(
        add(
          multiply(
            hiddenModel.weight,
            input
          ), // send to gpu, process, receive on cpu
          multiply(
            hiddenModel.transition,
            previousInput
          ) // send to gpu, process, receive on cpu
        ), // send to gpu, process, receive on cpu
        hiddenModel.bias
      ) // send to gpu, process, receive on cpu
    ); // send to gpu, process, receive on cpu
```

Will we get something like this with your work?:

```js
relu(
      add(
        add(
          multiply(
            hiddenModel.weight,
            input
          ),
          multiply(
            hiddenModel.transition,
            previousInput
          )
        ),
        hiddenModel.bias
      )
    ); // send to gpu, process, receive on cpu
```

@unbornchikken
Member

You'll get something like that, but it's from the work of @pavanky and co., not me. ArrayFire's JIT technology merges a bunch of operations into a single kernel and launches it at once. It works on most operations; however, there is much to be done for better coverage: https://github.com/arrayfire/arrayfire/milestone/16 (search for "JIT" there).
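The fusion idea itself can be sketched in a few lines of plain JavaScript (scalars instead of GPU kernels, all names hypothetical): elementwise operations stay lazy, and evaluation makes a single pass with a single output allocation instead of one intermediate array per operation.

```javascript
// Hypothetical sketch of operation fusion: lazy elementwise nodes are
// merged into one loop (a stand-in for a single fused GPU kernel).
function lazy(fn, ...inputs) { return { fn, inputs }; }
function add(a, b) { return lazy((x, y) => x + y, a, b); }
function mul(a, b) { return lazy((x, y) => x * y, a, b); }

// Evaluate the whole expression tree at element index i.
function evalAt(node, i) {
  if (Array.isArray(node)) return node[i]; // concrete array leaf
  return node.fn(...node.inputs.map(n => evalAt(n, i)));
}

// One fused pass over the data, one output allocation -- no intermediate
// arrays for the inner add/mul nodes.
function materialize(node, length) {
  const out = new Float64Array(length);
  for (let i = 0; i < length; i++) out[i] = evalAt(node, i);
  return out;
}
```

For example, `materialize(add(mul(a, b), c), n)` computes `a[i] * b[i] + c[i]` in one loop, which is the CPU analogue of what the JIT does when it merges kernels.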

@unbornchikken unbornchikken changed the title TODO: work out a way to have deterministic RAII scopes but keep the flow as async TODO: progress towards v0.3.0 Dec 22, 2016
@unbornchikken
Member

I've added a TODO list on top of this thread.

@robertleeplummerjr

Well, sir...

@unbornchikken
Member

The good news is that next week is a holiday for me, so I can work on this module ... if my kids and my wife leave me some breathing room. :)

@robertleeplummerjr

Same exact situation here.

@unbornchikken
Member

Unfortunately I couldn't come up with a solution in Node.js that is better (faster) than what we have in the current version. ArrayFire's design doesn't fit very well with the nature of Node's event queue, so it turns out my first approach is the best I can come up with. The good news is that it's already near feature complete; just check out the examples folder.
