Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Help me understand how to use this. #28

Closed
Zodiase opened this issue May 1, 2017 · 8 comments
Closed

Help me understand how to use this. #28

Zodiase opened this issue May 1, 2017 · 8 comments
Labels

Comments

@Zodiase
Copy link

Zodiase commented May 1, 2017

Using hamsters.js@4.0.0, I was able to run the example code. However I don't think it's making any sense to me. Also since I'm not seeing the result I'm expecting, I don't think this library is working properly.

Suppose running this code:

const foo = () => {
  var params = {
    'array':[0,1,2,3,4,5,6,7,8,9]
  };
  hamsters.run(params, function() {
      var arr = params.array;
      arr.forEach(function(item) {
        rtn.data.push((item * 120)/10);
      });
  }, function(output) {
     console.log(output);
  }, 10);
};
foo();

I was expecting the workload to be distributed among these 10 threads. But after some console logging, I see they are getting the exact same array and they are doing the exact same work, just now 10 times. How does that help? Maybe the example code just doesn't explain this but I was expecting either:

  • the thread function to be able to know its thread index so it can find the proper sub-set in the data array to work on.
  • somehow the data array is partitioned for these threads so they only see what they should do.

Could you help me understand how this library should be used? If you need an example, say we have a big array of size n (n > 10k) of positive integers and we are trying to find (x) => x * (x - 1) * (x - 2) * ... * 2 * 1 for every item in the array. Since the data items have no dependency on each other, ideally we could spawn n threads where each one would grab one number and start calculating. Could you illustrate how this could be done with this library?

Thanks a lot.

Side note: Since params and rtn are special "internal" variables, why not pass them in this way so it's easier to understand?

  hamsters.run(params, function(params, rtn) {
      var arr = params.array;
      arr.forEach(function(item) {
        rtn.data.push((item * 120)/10);
      });
  }, function(output) {
     console.log(output);
  }, 10);
@austinksmith
Copy link
Owner

I was expecting the workload to be distributed among these 10 threads. But after some console logging, I see they are getting the exact same array and they are doing the exact same work, just now 10 times. How does that help? Maybe the example code just doesn't explain this but I was expecting either:

Testing in my console shows that using v4.1.0 that the result of your supplied function

const foo = () => {
  var params = {
    'array':[0,1,2,3,4,5,6,7,8,9]
  };
  hamsters.run(params, function() {
      var arr = params.array;
      arr.forEach(function(item) {
        rtn.data.push((item * 120)/10);
      });
  }, function(output) {
     console.log(output);
  }, 10);
};
foo();

Is exactly what it should be which is an array containing 10 subarrays as your final output, since you have not asked the library to aggregate your results back together. If you want to have a single output you need to change your logic to

const foo = () => {
  var params = {
    'array':[0,1,2,3,4,5,6,7,8,9]
  };
  hamsters.run(params, function() {
      var arr = params.array;
      arr.forEach(function(item) {
        rtn.data.push((item * 120)/10);
      });
  }, function(output) {
     console.log(output);
  }, 10, true);
};
foo();

As far as

But after some console logging, I see they are getting the exact same array and they are doing the exact same work, just now 10 times.

This isn't the case, your array assuming you are defining multiple threads WILL be split across as many threads as you have specified, the same operation you defined before will be executed across all items in the array regardless of what thread they make use of. Inspecting http://hamsters.io/performance while pasting your function into the console shows me that each thread is in fact getting a subarray equal to Array.size / threads. Perhaps the problem you're having is that you aren't making use of a real worker implementation and you're seeing the behavior of the legacy mode which honestly should be following the same exact process so I'm at a loss as to what problems you're seeing.

Could you help me understand how this library should be used? If you need an example, say we have a big array of size n (n > 10k) of positive integers and we are trying to find (x) => x * (x - 1) * (x - 2) * ... * 2 * 1 for every item in the array. Since the data items have no dependency on each other, ideally we could spawn n threads where each one would grab one number and start calculating. Could you illustrate how this could be done with this library?

That's exactly what is illustrated above, you've just forgotten to tell the library you want a single output.

@Zodiase
Copy link
Author

Zodiase commented May 1, 2017

Sorry maybe I didn't make it clear in my first statement. By "some logging" I meant more console.logs than I showed in the code, I removed all of them to make the code simpler and closer to the original example code in readme. It would look more like this in my testing:

const foo = () => {
  var params = {
    'array':[0,1,2,3,4,5,6,7,8,9]
  };
  hamsters.run(params, function() {
      console.log('params', params);
      var arr = params.array;
      arr.forEach(function(item) {
        console.log('run');
        rtn.data.push((item * 120)/10);
      });
  }, function(output) {
     console.log(output);
  }, 10);
};
foo();

I saw "params" in the logs 10 times, each with identical content, and "run" 100 times, which is exactly the size of the array times the thread count I provided, while I'm expecting "run" only 10 times.

@Zodiase
Copy link
Author

Zodiase commented May 1, 2017

As of the log from the output function, for me that was:

[ [ 0, 12, 24, 36, 48, 60, 72, 84, 96, 108 ],
  [ 0, 12, 24, 36, 48, 60, 72, 84, 96, 108 ],
  [ 0, 12, 24, 36, 48, 60, 72, 84, 96, 108 ],
  [ 0, 12, 24, 36, 48, 60, 72, 84, 96, 108 ],
  [ 0, 12, 24, 36, 48, 60, 72, 84, 96, 108 ],
  [ 0, 12, 24, 36, 48, 60, 72, 84, 96, 108 ],
  [ 0, 12, 24, 36, 48, 60, 72, 84, 96, 108 ],
  [ 0, 12, 24, 36, 48, 60, 72, 84, 96, 108 ],
  [ 0, 12, 24, 36, 48, 60, 72, 84, 96, 108 ],
  [ 0, 12, 24, 36, 48, 60, 72, 84, 96, 108 ] ]

while I was expecting something like:

[ [ 0 ],
  [ 12 ],
  [ 24 ],
  [ 36 ],
  [ 48 ],
  [ 60 ],
  [ 72 ],
  [ 84 ],
  [ 96 ],
  [ 108 ] ]

@austinksmith
Copy link
Owner

You need to provide your entire example source for me to understand why you are seeing different behavior from both the provided benchmark example logic and jasmine tests.

const foo = () => {
  var params = {
    'array':[0,1,2,3,4,5,6,7,8,9]
  };
  hamsters.run(params, function() {
      var arr = params.array;
      arr.forEach(function(item) {
        rtn.data.push((item * 120)/10);
      });
  }, function(output) {
     console.log(output);
  }, 10, true);
};

Using 4.0.0 functions exactly as it should and my final output is in fact how it should be which is a single array that looks like [0, 12, 24, 36, 48, 60, 72, 84, 108]

As mentioned before, you are not telling your output to be aggregated into a single output, pay special attention to #5 on the how it works section of the read me.

This optional argument will tell the library whether or not we want to aggregate our individual thread outputs together after execution, this is only relevant if you are executing across multiple threads and defaults to false.

Additionally the documentation does not say to declare hamsters as a constant, the hamsters object should never be treated as immutable because the library modifies it self during runtime.

var hamsters = require('hamsters.js');

@Zodiase
Copy link
Author

Zodiase commented May 1, 2017

As mentioned before, you are not telling your output to be aggregated into a single output, pay special attention to #5 on the how it works section of the read me.

I'm aware of that and I didn't expect the result to be aggregated (as I plan to do the aggregation myself). What I was saying was the result contains redundant data that apparently is produced by redundant work.

I'll use the aggregation flag, hopefully it's easier for you to see what I mean.

image

I'm printing the amount of data each thread received, as well as the amount of work cycles they did. With 2 threads, the total amount of work cycle should always be the input data size, which is 10, but we are seeing 20 cycles (runs).

Also, with the aggregation flag set, the output contains redundant data.

@austinksmith
Copy link
Owner

So I've managed to reproduce this issue only under the following conditions

  1. The library is making use of legacy mode which means making use of the main thread.
  2. The function invoked is using more than 1 thread
  3. The function invoked is using a non typed array, using a typed array does not suffer the same issue.

My debugging so far leads me to believe this is an inheritance issue and a race condition based on the time slicing behavior of setTimeout causing multiple "threads" to modify the same array, I'll have a fix ready in the next release version which shouldn't be too long.

In the mean time my recommendations are to use a 3rd party worker implementation with Node.js as this is only going to affect the legacy mode of the library and only within Node or browsers that do not support web workers eg. IE9.

Thanks for spotting this, it's very hard to debug multithreaded logic so understand my requests for more info are because I wasn't seeing the problem until I recreated every condition of your setup.

@Zodiase
Copy link
Author

Zodiase commented May 1, 2017

I'm absolutely glad to help. I was looking for easy-to-use node parallelism solutions and came across this library and I think it is very promising.

By the way, could you comment on my side note there? I'd assume there's some limitation that I'm not aware of forcing you to design the API this way but wouldn't it easier for people to understand to have the thread function signature like this:

function threadFunc (params, report) {
  // Do work here with data provided in `params`.
  var data = params.array.map((v) => v + 1);
  // Send results back with the `report` function.
  report(data);
}

Or simply:

function threadFunc (params) {
  // Do work here with data provided in `params`.
  var data = params.array.map((v) => v + 1);
  // `return`ed data is automatically collected by the main process.
  return data;
}

I know in the second case it's probably harder to allow async work, in which case Promise probably could be used. My general suggestion is to make the thread function look more like a normal function, instead of using some seemingly undefined "internal variables".

@austinksmith
Copy link
Owner

I've gone ahead and pushed out v4.1.1, please update to v4.1.1 and your issues should be resolved.

https://github.com/austinksmith/Hamsters.js/releases/tag/v4.1.1

It's unfortunately not possible to do it that way unless they added that functionality as a first class part of the language.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants