
Save embeddings issue when distributed training #53

Closed
ming-ch opened this issue Jun 1, 2020 · 2 comments

ming-ch commented Jun 1, 2020

Hi,

I ran dist_train.py (examples/tf/graphsage/dist_train.py) and it works well. However, when I try to save embeddings after training, it raises RuntimeError("Graph is finalized and cannot be modified."). I hit the same issue when I run Bipartite GraphSAGE in distributed mode.

  Traceback (most recent call last):
    File "dist_train.py", line 132, in <module>
      main()
    File "dist_train.py", line 128, in main
      train(config, g)
    File "dist_train.py", line 81, in train
      u_embs = trainer.get_node_embedding("u")
    File "/usr/local/lib/python2.7/dist-packages/graphlearn/python/model/tf/trainer.py", line 57, in get_node_embedding
      ids, emb, iterator = self.model.node_embedding(node_type)
amznero (Contributor) commented Jun 3, 2020

The same problem happened to me a few days ago. I suspect it is caused by the MonitoredTrainingSession (MTS, for short).

graph-learn uses MonitoredTrainingSession to initialize the PS (parameter-server, i.e. TensorFlow's distributed mode) environment.

But MTS finalizes the TF graph (not the GNN graph) once the initialization step completes, so users are no longer allowed to modify it (no new variables/ops can be defined).

In this issue: after trainer.py initializes the session, the TF graph is frozen, and the model then trains for n epochs (no modification there). But when the code reaches the embedding-saving step, it registers new ops to fetch the source nodes' embeddings, which tries to modify the TF graph and raises the error.
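The failure mode can be illustrated without TensorFlow at all. Below is a minimal pure-Python sketch of the finalize pattern; the Graph class here is a hypothetical stand-in for illustration, not graph-learn's or TF's actual API:

```python
class Graph:
    """Toy stand-in for a TF computation graph (hypothetical)."""

    def __init__(self):
        self.ops = {}
        self._finalized = False

    def add_op(self, name):
        # MonitoredTrainingSession effectively finalizes the graph once the
        # session is created; any later attempt to add an op fails like this.
        if self._finalized:
            raise RuntimeError("Graph is finalized and cannot be modified.")
        self.ops[name] = object()
        return self.ops[name]

    def finalize(self):
        self._finalized = True


g = Graph()
g.add_op("loss")           # ops defined during model build: fine
g.add_op("src_embedding")  # the fix: define the fetch op *before* finalize
g.finalize()               # session creation freezes the graph
# g.add_op("late_op")      # would raise RuntimeError, as in this issue
```

The modification to graph_sage.py below follows exactly this pattern: the embedding tensors are kept from build() (pre-finalize) instead of being re-created in node_embedding() (post-finalize).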

To work around it, I made the following modifications to graph_sage.py.

def build(self):
  ...
  pos_src_emb = self.encoders['src'].encode(self.pos_src_ego_tensor)
  pos_dst_emb = self.encoders['dst'].encode(self.pos_dst_ego_tensor)
  neg_dst_emb = self.encoders['dst'].encode(self.neg_dst_ego_tensor)
  
  # some modifications
  self.pos_src_emb = pos_src_emb
  self.pos_dst_emb = pos_dst_emb
  self.neg_dst_emb = neg_dst_emb

  self.loss = self._unsupervised_loss(pos_src_emb, pos_dst_emb, neg_dst_emb)
  ...

...

def node_embedding(self, type):
  iterator = self.ego_flow.iterator
  # remove
  # ego_tensor = self.ego_flow.pos_src_ego_tensor
  # remove
  # src_emb = self.encoders['src'].encode(ego_tensor)
  # add
  src_emb = self.pos_src_emb
  src_ids = self.pos_src_ego_tensor.src.ids
  return src_ids, src_emb, iterator

...

A similar issue was discussed on StackOverflow.

@baoleai baoleai self-assigned this Jun 3, 2020
@baoleai baoleai added and removed the bug label Jun 3, 2020
@baoleai baoleai removed their assignment Jun 3, 2020
baoleai (Collaborator) commented Jun 3, 2020

get_node_embedding in trainer.py currently only supports local training mode. For distributed training there is indeed a problem, as @amznero said; as a workaround, you can call get_node_embedding in the train function of trainer.py before the session is run. There may be a more elegant solution; I will fix it asap :)
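A sketch of that ordering, using toy classes only (ToyGraph, ToySession, and train are hypothetical stand-ins, not graph-learn's API): the key point is that the embedding fetch op is registered before the session freezes the graph.

```python
class ToyGraph:
    """Hypothetical stand-in for a TF graph."""

    def __init__(self):
        self.ops = {}
        self.finalized = False

    def add_op(self, name, value):
        if self.finalized:
            raise RuntimeError("Graph is finalized and cannot be modified.")
        self.ops[name] = value
        return name


class ToySession:
    """Stand-in for MonitoredTrainingSession: freezes its graph on creation."""

    def __init__(self, graph):
        graph.finalized = True
        self.graph = graph

    def run(self, fetches):
        return [self.graph.ops[f] for f in fetches]


def train(graph, n_epochs=2):
    # The suggested workaround: register the embedding fetch op BEFORE
    # creating the session, while the graph is still mutable.
    emb_op = graph.add_op("u_embeddings", [0.1, 0.2])
    train_op = graph.add_op("train_step", None)
    sess = ToySession(graph)       # graph is frozen from here on
    for _ in range(n_epochs):
        sess.run([train_op])       # training needs no new ops
    return sess.run([emb_op])[0]   # ...and neither does saving embeddings
```

Reversing the order (creating the session first, then registering the embedding op) reproduces the RuntimeError from the original traceback.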
