Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RPC] add C++ RPC infrastructure and distributed sampler #408

Closed
wants to merge 70 commits into from
Closed

[RPC] add C++ RPC infrastructure and distributed sampler #408

wants to merge 70 commits into from

Conversation

aksnzhy
Copy link
Contributor

@aksnzhy aksnzhy commented Feb 22, 2019

Description

This PR add the C++ RPC infrastructure and distributed sampler API.

Now we need to add Windows wrapper for the c socket API for tcp_socket.cc. Anyone can help me do this job? I don't have the windows development environment.

Also, this code has been tested on multiple machines outside the unittest. If we need to add unittest test for the network code?

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [Model], [Doc], [Feature]])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the my best knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • Related issue is referred in this PR

Changes

  • python/dgl/contrib/sampling/dis_sampler.py
  • python/dgl/network.py
  • src/graph/network/communicator.h
  • src/graph/network/msg_queue.cc
  • src/graph/network/msg_queue.h
  • src/graph/network/serialize.cc
  • src/graph/network/serialize.h
  • src/graph/network/socket_communicator.cc
  • src/graph/network/socket_communicator.h
  • src/graph/network/tcp_socket.cc
  • src/graph/network/tcp_socket.h
  • src/graph/network.cc
  • src/graph/network.h

@jermainewang jermainewang changed the title add C++ RPC infrastructure and distributed sampler [RPC] add C++ RPC infrastructure and distributed sampler Feb 22, 2019
@jermainewang jermainewang self-requested a review February 22, 2019 19:51
@BarclayII
Copy link
Collaborator

BarclayII commented Feb 22, 2019

I assume that I only need to port TCPSocket to Windows? The implementation would be a little bit different between WinSock and UNIX sockets.

Re unit test: I think it's better to have some sort of tests, otherwise I wouldn't know whether Windows can work. IMO they don't have to be integrated into CI (like unit tests), but I need some scripts to set up a simulated, or connect to an existing, distributed environment.

@BarclayII BarclayII self-assigned this Feb 22, 2019
@BarclayII
Copy link
Collaborator

Also, what about PR #325 ?

@aksnzhy
Copy link
Contributor Author

aksnzhy commented Feb 23, 2019

Also, what about PR #325 ?

I closed this PR. It is out of date.

@aksnzhy
Copy link
Contributor Author

aksnzhy commented Feb 23, 2019

I assume that I only need to port TCPSocket to Windows? The implementation would be a little bit different between WinSock and UNIX sockets.

Re unit test: I think it's better to have some sort of tests, otherwise I wouldn't know whether Windows can work. IMO they don't have to be integrated into CI (like unit tests), but I need some scripts to set up a simulated, or connect to an existing, distributed environment.

Yes, only the TCPSocket class need to be changed. The other components are all platform-agnostic.

I will add some unittests for distributed sampler.

python/dgl/graph_index.py Outdated Show resolved Hide resolved
src/graph/graph_apis.cc Outdated Show resolved Hide resolved
src/graph/network/serialize.cc Outdated Show resolved Hide resolved
DGL_REGISTER_GLOBAL("graph_index._CAPI_ReceiverRecvSubgraph")
.set_body([] (DGLArgs args, DGLRetValue* rv) {
// The data buffer will be allocated just once
static char* data_buffer = new char[kMaxBufferSize];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memory is allocated when the process starts regardless of how people use DGL. Maybe allocate memory more explicitly. e.g., only when people starts to use the distributed module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it by using a global pointer, which will only be allocated by CreateReceiver API

bool SocketCommunicator::InitSender(const char* ip, int port) {
// Sender only has a client socket
socket_.resize(1);
socket_[0] = new TCPSocket();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use shared_ptr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to control the pointer by handle in Finalize() method.

}
}

int MessageQueue::Add(const char* src, int size, bool is_blocking) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a large memory copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not very easy to implement zero-copy here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how much does the memory copy affect the performance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me evaluate it by removing the message queue for a test.

@jermainewang
Copy link
Member

jermainewang commented Feb 28, 2019

First of all, @aksnzhy @zheng-da do we agree that this will not merged into master at the moment because this feature is not part of the release plan?

@BarclayII
Copy link
Collaborator

I have updated tcp_socket.cc with Windows socket implementation.

@aksnzhy
Copy link
Contributor Author

aksnzhy commented Mar 6, 2019

It seems that the NodeFlow and Sampler API has been changed. I need to update the distributed sampler code.

@aksnzhy
Copy link
Contributor Author

aksnzhy commented Mar 7, 2019

@jermainewang Could we change the branch from dgl:gaint to dgl:master? The windows support has been completed and I have updated my code with the new NodeFlow code in master branch.

I have also pushed another PR to Da's local dgl branch.

* refactor.

* accelerate update_all in nodeflow.

* fix.

* refactor.

* fix lint.

* fix lint.

* reorganize.

* reorg.

* remove.

* add doc.

* impl block_incidence_matrix

* fix lint.

* fix.

* simple fix.

* fix test.

* fix interface.

* fix eid.

* fix comments.
@@ -105,18 +82,18 @@ DGL_REGISTER_GLOBAL("network._CAPI_ReceiverRecvSubgraph")
if (size <= 0) {
LOG(ERROR) << "Receive error: (size: " << size << ")";
}
NodeFlow nf;
NodeFlow* nf = new NodeFlow();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is the memory free'd?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zheng-da
I do the same allocation like this:

nflows[i] = new NodeFlow();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the sampler free the memory?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are manually freed from Python code after instantiation of NodeFlow objects (python/dgl/nodeflow.py:59)

aksnzhy and others added 20 commits March 8, 2019 16:48
* update version

* update news
* fix cython bug

* fix

* fix
* update README performance number

* news
…mutable graph. (#437)

* fix rgcn tutorial

* small fix

* upd

* findedge/s

* upd

* upd

* upd

* upd

* add test

* remove redundancy

* upd

* upd

* upd

* upd

* add edge_subgraph

* explicit cast

* add test immutable subg

* reformat

* reformat

* fix bug

* upd

* hotfix

* subgraph docs

* fix adj mat

* fix edges() order

* add test

* revert

* upd

* fix

* upd
…449)

* fix send with builtin bug

* test cases

* remove todo comment
* refactored sampler code

* docstring

* fix tutorial
* fix performance bug in EdgeSoftmax

* minor
* fix bug in pickling

* fix lint
const IdArray& node_mapping,
const IdArray& edge_mapping,
const IdArray& layer_offsets,
const IdArray& flow_offsets) {
Copy link
Collaborator

@BarclayII BarclayII Mar 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This interface makes changing the nodeflow structure subtle; one needs to also change the interface here (and the deserialization, and all the calling statements).

Is it possible to change this to something like size_t SerializeNodeFlow(char *buffer, const NodeFlow &nf)?

EDIT: or maybe make these methods of NodeFlow instead (e.g. size_t NodeFlow::Serialize(char *buffer) and void NodeFlow::Deserialize(const char *buffer, size_t len))

Copy link
Contributor Author

@aksnzhy aksnzhy Mar 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. I'll do this change after new NodeFlow structure merged.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If "new NodeFlow structure" refers to the one with optional tensor data (which is the one I'm using) then it won't be merged until after PR #453. It's your call but I would prefer changing the interface now so later on NodeFlow could be extended further more easily.

@aksnzhy aksnzhy closed this Mar 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants