
Save embeddings issue when distributed training #53

Closed
ming-ch opened this issue Jun 1, 2020 · 2 comments

ming-ch commented Jun 1, 2020

Hi,

I ran dist_train.py (examples/tf/graphsage/dist_train.py) and it works well. However, when I try to save embeddings after training, it raises RuntimeError("Graph is finalized and cannot be modified."). I hit the same issue when I run Bipartite GraphSAGE in distributed mode.

  Traceback (most recent call last):
    File "dist_train.py", line 132, in <module>
      main()
    File "dist_train.py", line 128, in main
      train(config, g)
    File "dist_train.py", line 81, in train
      u_embs = trainer.get_node_embedding("u")
    File "/usr/local/lib/python2.7/dist-packages/graphlearn/python/model/tf/trainer.py", line 57, in get_node_embedding
      ids, emb, iterator = self.model.node_embedding(node_type)
amznero (Contributor) commented Jun 3, 2020

The same problem happened to me a few days ago. I suspect it is caused by the MonitoredTrainingSession (MTS, for short).

graph-learn uses MonitoredTrainingSession to initialize the PS (parameter-server, i.e. TensorFlow's distributed mode) environment.

But MTS finalizes the TF graph (not the GNN graph) once the initialization step completes, so users are no longer allowed to modify it (no new variables/ops can be defined).

In this issue: after trainer.py initializes the session, the TF graph is frozen, and the model then trains for n epochs (no modification there). But when the code reaches the embedding-saving step, it registers new ops to fetch the source nodes' embeddings, which tries to modify the TF graph and raises the error.
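The failure mode can be illustrated without TensorFlow at all. Below is a minimal pure-Python sketch of the finalize pattern; the Graph class here is a hypothetical stand-in for illustration, not graph-learn's or TF's actual API:

```python
class Graph:
    """Toy stand-in for a TF computation graph (hypothetical)."""

    def __init__(self):
        self.ops = {}
        self._finalized = False

    def add_op(self, name):
        # MonitoredTrainingSession effectively finalizes the graph once the
        # session is created; any later attempt to add an op fails like this.
        if self._finalized:
            raise RuntimeError("Graph is finalized and cannot be modified.")
        self.ops[name] = object()
        return self.ops[name]

    def finalize(self):
        self._finalized = True


g = Graph()
g.add_op("loss")           # ops defined during model build: fine
g.add_op("src_embedding")  # the fix: define the fetch op *before* finalize
g.finalize()               # session creation freezes the graph
# g.add_op("late_op")      # would raise RuntimeError, as in this issue
```

The modification to graph_sage.py below follows exactly this pattern: the embedding tensors are kept from build() (pre-finalize) instead of being re-created in node_embedding() (post-finalize).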

To work around it, I made the following modifications to graph_sage.py.

def build(self):
  ...
  pos_src_emb = self.encoders['src'].encode(self.pos_src_ego_tensor)
  pos_dst_emb = self.encoders['dst'].encode(self.pos_dst_ego_tensor)
  neg_dst_emb = self.encoders['dst'].encode(self.neg_dst_ego_tensor)
  
  # some modifications
  self.pos_src_emb = pos_src_emb
  self.pos_dst_emb = pos_dst_emb
  self.neg_dst_emb = neg_dst_emb

  self.loss = self._unsupervised_loss(pos_src_emb, pos_dst_emb, neg_dst_emb)
  ...

...

def node_embedding(self, type):
  iterator = self.ego_flow.iterator
  # remove
  # ego_tensor = self.ego_flow.pos_src_ego_tensor
  # remove
  # src_emb = self.encoders['src'].encode(ego_tensor)
  # add
  src_emb = self.pos_src_emb
  src_ids = self.pos_src_ego_tensor.src.ids
  return src_ids, src_emb, iterator

...

A similar issue was discussed on StackOverflow.

@baoleai baoleai self-assigned this Jun 3, 2020
@baoleai baoleai added and removed the bug label Jun 3, 2020
@baoleai baoleai removed their assignment Jun 3, 2020
baoleai (Collaborator) commented Jun 3, 2020

get_node_embedding in trainer.py currently only supports local training mode. For distributed training there is indeed a problem, as @amznero said; as a workaround, you can call get_node_embedding in the train function of trainer.py before the session is run. There may be a more elegant solution; I will fix it asap :)
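A sketch of that ordering, using toy classes only (ToyGraph, ToySession, and train are hypothetical stand-ins, not graph-learn's API): the key point is that the embedding fetch op is registered before the session freezes the graph.

```python
class ToyGraph:
    """Hypothetical stand-in for a TF graph."""

    def __init__(self):
        self.ops = {}
        self.finalized = False

    def add_op(self, name, value):
        if self.finalized:
            raise RuntimeError("Graph is finalized and cannot be modified.")
        self.ops[name] = value
        return name


class ToySession:
    """Stand-in for MonitoredTrainingSession: freezes its graph on creation."""

    def __init__(self, graph):
        graph.finalized = True
        self.graph = graph

    def run(self, fetches):
        return [self.graph.ops[f] for f in fetches]


def train(graph, n_epochs=2):
    # The suggested workaround: register the embedding fetch op BEFORE
    # creating the session, while the graph is still mutable.
    emb_op = graph.add_op("u_embeddings", [0.1, 0.2])
    train_op = graph.add_op("train_step", None)
    sess = ToySession(graph)       # graph is frozen from here on
    for _ in range(n_epochs):
        sess.run([train_op])       # training needs no new ops
    return sess.run([emb_op])[0]   # ...and neither does saving embeddings
```

Reversing the order (creating the session first, then registering the embedding op) reproduces the RuntimeError from the original traceback.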
