[Dynet-92]. Multi-device support #704

Merged — 46 commits merged into clab:master on Aug 10, 2017

@xunzhang
xunzhang (Collaborator) commented Jul 17, 2017

  • Refactor DyNet to support multiple devices more cleanly.
  • Honor the --dynet-devices argument.
  • Implement interfaces (partly): specify the device as an argument when defining an expression, or call cg.change_expr_device before defining an expression.
  • Implement memcpy between devices in the forward process.
  • Test forward in hybrid CPU/GPU mode with a basic expression: V * tanh(affine_transform(b, W, x)) + a.
  • Implement memcpy between devices in the backward process.
  • Test backward in hybrid CPU/GPU mode with a basic expression: V * tanh(affine_transform(b, W, x)) + a.
  • Add support for a fallback-to-CPU mechanism when an operation has no GPU implementation yet.
  • Debug the hang issue when using multiple GPUs.
  • Add more feature tests.

Original Usage:

./a.out --dynet-devices CPU,GPU:0,GPU:1

int main(int argc, char *argv[])
{
  dynet::initialize(argc, argv);

  for (iter) {
    ComputationGraph cg(dynet::devices_map["GPU:0"]); // default device if not specified
    Expression W = parameter(cg, p_W, dynet::devices_map["CPU"]);
    Expression b = parameter(cg, p_b); // default: GPU:0
    Expression x = input(cg, {2}, x_values); // default: GPU:0
    cg.change_expr_device(dynet::devices_map["GPU:1"]); // change default device for future expressions
    Expression h = tanh(affine_transform({b, W, x})); // resides on GPU:1
    
    Expression last = ...;
    cg.forward(last);
    cg.backward(last);
    // update
  }
  return 0;
}

Modified Usage:

./a.out --dynet-devices CPU,GPU:0,GPU:1

int main(int argc, char *argv[])
{
  dynet::initialize(argc, argv);

  for (iter) {
    ComputationGraph cg;
    Expression W = parameter(cg, p_W, dynet::devices_map["GPU:0"]);
    Expression b = parameter(cg, p_b); // defaults to p_b's device (GPU:0)
    Expression x = input(cg, {2}, x_values, dynet::devices_map["CPU"]);
    Expression x_2 = to_device(x, dynet::devices_map["GPU:0"]);
    Expression h = affine_transform({b, W, x}); // defaults to b's device (GPU:0)
    Expression h_2 = to_device(h, dynet::devices_map["CPU"]);
    Expression v = tanh(h_2); // defaults to h_2's device (CPU), assuming tanh has no CUDA implementation in this case
    
    Expression last = ...;
    cg.forward(last);
    cg.backward(last);
    // update
  }
  return 0;
}

To reviewer @neubig: you can run a quick test using the code below:

// usage: ./a.out --dynet-devices CPU,GPU:0

#include <iostream>
#include <vector>
#include "dynet/dynet.h"
#include "dynet/training.h"
#include "dynet/expr.h"
#include "dynet/io.h"
#include "dynet/model.h"
#include "dynet/devices.h"

using namespace std;
using namespace dynet;

int main(int argc, char** argv) {
  dynet::initialize(argc, argv);

  const unsigned ITERATIONS = 30; 

  // ParameterCollection (all the model parameters).
  ParameterCollection m;
  SimpleSGDTrainer sgd(m);

  const unsigned HIDDEN_SIZE = 8;
  Parameter p_W = m.add_parameters({HIDDEN_SIZE, 2});
  Parameter p_b = m.add_parameters({HIDDEN_SIZE});
  Parameter p_V = m.add_parameters({1, HIDDEN_SIZE});
  Parameter p_a = m.add_parameters({1});
  if (argc == 2) {
    // Load the model and parameters from file if given.
    TextFileLoader loader(argv[1]);
    loader.populate(m);
  }

  // Static declaration of the computation graph.
  ComputationGraph cg; 
  Expression W = parameter(cg, p_W);
  Expression b = parameter(cg, p_b);
  Expression V = parameter(cg, p_V);
  Expression a = parameter(cg, p_a);

  // Set x_values to change the inputs to the network.
  vector<dynet::real> x_values(2);
  Expression x = input(cg, {2}, &x_values);
  dynet::real y_value;  // Set y_value to change the target output.
  Expression y = input(cg, &y_value);

  Expression aa = W * x + b;
  Expression hhh = to_device(aa, dynet::get_global_device("CPU"));
  Expression h = tanh(hhh);
  Expression hh = to_device(h, dynet::get_global_device("GPU:0"));
  Expression y_pred = V*hh + a;
  Expression loss_expr = squared_distance(y_pred, y); 

  // Show the computation graph, just for fun.
  cg.print_graphviz();

  // Train the parameters.
  for (unsigned iter = 0; iter < ITERATIONS; ++iter) {
    double loss = 0;
    for (unsigned mi = 0; mi < 4; ++mi) {
      bool x1 = mi % 2;
      bool x2 = (mi / 2) % 2;
      x_values[0] = x1 ? 1 : -1; 
      x_values[1] = x2 ? 1 : -1; 
      y_value = (x1 != x2) ? 1 : -1; 

      loss += as_scalar(cg.forward(loss_expr));
      cg.backward(loss_expr);
      sgd.update();

    }
    loss /= 4;
    cerr << "E = " << loss << endl;
  }

  // Output the model and parameter objects to a file.
  TextFileSaver saver("/tmp/xor.model");
  saver.save(m);
}

@xunzhang xunzhang changed the title from Dynet 92 model parallelism to Dynet 92 Multi-device support Jul 17, 2017

@xunzhang xunzhang changed the title from Dynet 92 Multi-device support to [WIP] [Dynet-92]. Multi-device support Jul 17, 2017

@xunzhang xunzhang referenced this pull request Jul 20, 2017

Closed

[WIP] ThreadPool Device Support #713

2 of 3 tasks complete
@neubig
neubig (Contributor) commented Jul 20, 2017

In general, this is great: I think multi-device support will be a great feature for DyNet to have. First, I have a high-level comment. In my mind, there are two design decisions here:

How do we specify the "default" device of a graph node when it is not specified explicitly?

  1. Current implementation: a default is passed to ComputationGraph, and that default is used.
  2. Alternatively, we could have the node default to the device of its first argument.

The first has the advantage of perhaps being easier to understand, but may result in hidden memory moves where people aren't expecting them. It also adds some code complexity.

When some of the inputs are not on the same device, what do we do?

  1. Current implementation: the ExecutionEngine is responsible for moving memory.
  2. The ExecutionEngine throws an error, telling the user to move the memory themselves (using something like dy.change_device(x, device)).
  3. A combination of 1 and 2, where 2 is on by default, but 1 can be chosen.

Options 1 and 3 have the advantage of not crashing, but also have the potential to hide memory moves that the user really wouldn't want to be doing. (For example, in the example code, the weight matrix would be passed from CPU to GPU every time it was used, which would be really, really bad.) Option 2 has the advantage of preventing this, but may result in a slightly increased coding burden.

My opinion: I tend to prefer 2./2. respectively, but could be convinced otherwise.

@yoavg
yoavg (Contributor) commented Jul 20, 2017

This is really great!

I like options 2 and 2 also, but would like to propose a variation of the second (2):

The name dy.change_device(x, device) is a bit confusing imo, as we are not so much changing the device of x as copying x to another device (x can also still be used on the original device afterwards). So I propose to change @neubig's proposed interface slightly to:

Expression y = x.to_device(device)

letting both x and y be used.

Another proposal (maybe it's already there, I didn't look at the code) is to also allow multiple CPU devices. There, the copying would be a no-op, but we could still run different CPU devices as different threads.

How do things look in terms of synchronization in the current implementation?

@xunzhang
xunzhang (Collaborator) commented Jul 22, 2017

Very helpful comments!! I also prefer 2./2. then. And I think supporting both a copy like x.to_device(device) and a move like x.change_device(device) is necessary. But the to_device interface is somewhat hard to implement, since there will be some discrete VariableIndex indices and we need to refactor the executor code to support it. I think I will first finish change_device and then implement to_device.

Currently, we don't support specifying a CPU id. I will think about that in the near future.

change default device to the first argument by default, honor change_device interface instead of doing memcpy in executor
@xunzhang
xunzhang (Collaborator) commented Jul 22, 2017

The remark about to_device and change_device in my last comment might be incorrect. Basically, I think to_device is a copy-like operation which will create an additional node, while change_device is more like changing the device assignment of an existing expression (I'm not sure whether this semantic is useful or not).

@neubig
neubig (Contributor) commented Jul 22, 2017

@xunzhang Yes, to_device is an operation that will create a new node (whose memory is stored on a different device than its input). I don't think we should have a function to change the device of a particular node, for the reasons you mentioned: it would complicate things and require special handling in the executor. Regarding Yoav's comment about having multiple CPU devices, I'm not sure that this is necessary. In order to do things on multiple threads, we'll need a multi-threaded execution engine anyway, so we can probably have that engine perform multiple operations using the same CPU device. Let's save this discussion for a later commit, when we tackle multi-threading the execution engine.

@xunzhang
xunzhang (Collaborator) commented Jul 22, 2017

@neubig Right, cool. I will finish this soon.

@xunzhang
xunzhang (Collaborator) commented Jul 24, 2017

This pull request is review-ready. It does not affect the old code and interfaces, and I will split the remaining work into future pull requests.

The remaining tasks include:

  1. Fix the remaining places that hard-code default_device.
  2. Fix the multi-GPU hanging bug: this was not introduced by this pull request and should be addressed together with multi-device support elsewhere.
  3. Python interface: I think the code in this pull request will not break current usage.
  4. Add tests, refactor the failing GPU unit tests, and update the documentation.

@xunzhang xunzhang changed the title from [WIP] [Dynet-92]. Multi-device support to [Dynet-92]. Multi-device support Jul 24, 2017

@neubig

Thanks, this is great! I have a bunch of small comments, but once they're resolved and I can confirm that this works in my environment, I think we can merge. Also, some of my comments might just be oversights, so if there's anything you don't think needs to be fixed, just tell me.

Review comments (now outdated) on: dynet/exec.cc, dynet/expr.cc, dynet/expr.h, dynet/nodes-linalg.h, dynet/nodes-matrixmultiply.h, dynet/nodes-trig.h, dynet/tensor.cc
@neubig

neubig approved these changes Aug 3, 2017

@neubig neubig merged commit 5903c85 into clab:master Aug 10, 2017

1 check passed

continuous-integration/travis-ci/pr — The Travis CI build passed
@neubig
neubig (Contributor) commented Aug 10, 2017

Thanks! I confirmed that this is working as expected, so I merged. This is great to have :)

@duyvuleo
duyvuleo (Contributor) commented Aug 17, 2017

Is it actually working? I tried the "./examples/train_xor-multidevice" example and got the following error:

terminate called after throwing an instance of 'std::runtime_error'
what(): Invalid device name: GPU:0
Aborted (core dumped)

@neubig
neubig (Contributor) commented Aug 17, 2017

The documentation isn't finished yet, but I think you need to add --dynet-devices CPU,GPU:0 to the command line.

@duyvuleo
duyvuleo (Contributor) commented Aug 17, 2017

It works. Thanks!

@xunzhang xunzhang referenced this pull request Aug 25, 2017

Closed

Multi-device support #92

4 of 4 tasks complete