
[Model] Improve GAT models #348

Merged: 6 commits merged into dmlc:master on Jan 11, 2019

Conversation

@jermainewang (Member) commented on Jan 9, 2019:

Description

A more efficient GAT implementation:

  • Use SPMV to compute softmax.
  • Batch multiple attention heads together for more efficiency (see the sketch below).
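
A minimal sketch (hypothetical names and Cora-like shapes, not this PR's actual code) of what batching the heads amounts to: one linear projection produces all heads at once, and the per-head attention terms a1/a2 each come from a single einsum.

```python
# Hypothetical sketch of batched multi-head projection (not this PR's code).
import torch
import torch.nn as nn

num_nodes, in_dim, out_dim, num_heads = 2708, 1433, 8, 8  # Cora-like sizes

fc = nn.Linear(in_dim, num_heads * out_dim, bias=False)
h = torch.randn(num_nodes, in_dim)                    # input node features
feat = fc(h).view(num_nodes, num_heads, out_dim)      # one matmul for all heads

attn_l = torch.randn(num_heads, out_dim)              # per-head parameters "a1"
attn_r = torch.randn(num_heads, out_dim)              # per-head parameters "a2"
a1 = torch.einsum('nhd,hd->nh', feat, attn_l)         # (num_nodes, num_heads)
a2 = torch.einsum('nhd,hd->nh', feat, attn_r)
```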

Results

| Dataset  | Test accuracy | Time (s) | Baseline #1 time (s) | Baseline #2 time (s) |
|----------|---------------|----------|----------------------|----------------------|
| Cora     | 84.0%         | 0.0127   | 0.0982 (7.7x)        | 0.0424 (3.3x)        |
| Citeseer | 70.7%         | 0.0123   | n/a                  | n/a                  |
| Pubmed   | 78.1%         | 0.0302   | n/a                  | n/a                  |

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change

@zheng-da @eric-haibin-lin We cannot implement this in MXNet because MXNet does not support autograd through sparse matmul. Based on the experience with GCN, if MXNet supported this, it should be faster.

@jermainewang jermainewang changed the title [Model] Improvate GAT models [Model] Improve GAT models Jan 9, 2019
@jermainewang (Member, Author) commented on the diff:

@sufeidechabei

```python
def edge_attention(self, edges):
    # an edge UDF to compute unnormalized attention values from src and dst
    a = self.leaky_relu(edges.src['a1'] + edges.dst['a2'])
    a = torch.exp(a).clamp(-10, 10)  # use clamp to avoid overflow
```
@BarclayII (Collaborator) commented on Jan 10, 2019:

A caveat is that the softmax is technically still not the "correct and numerically stable" softmax. For example, if the pre-softmax scores are [5, 10, 5, 5], then the result becomes uniform, since the code just clips the exponentials to a maximum value of 10.

We should leave a comment there.
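
To make the concern concrete, a quick numeric check (hypothetical scores, plain PyTorch; not code from this PR):

```python
import torch

scores = torch.tensor([5., 10., 5., 5.])    # hypothetical pre-softmax scores

clamped = torch.exp(scores).clamp(-10, 10)  # what the current code does
print(clamped / clamped.sum())              # uniform: tensor([0.2500, 0.2500, 0.2500, 0.2500])

print(torch.softmax(scores, dim=0))         # true softmax: ~[0.0066, 0.9802, 0.0066, 0.0066]
```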

@jermainewang (Member, Author) replied:

I see. Is this a reasonable approximation? I found that in this PyTorch implementation:
https://github.com/Diego999/pyGAT/blob/3301fd68cbf349925d7458d9d47766e48021cf78/layers.py#L111
they even use torch.exp(-leaky_relu(...)). Not sure whether this is equivalent.

Collaborator replied:

-leaky_relu is actually better (although not perfect due to underflow), since its exponential would always be no greater than 1. The opposite is not, since the result will easily overflow.

@jermainewang (Member, Author) replied:

But is it still correct? It seems to me that they change the activation function to -leaky_relu.

@BarclayII (Collaborator) replied on Jan 10, 2019:

The value range of the pre-softmax values doesn't matter if we disregard numerical stability and no clipping is involved, so we only need to prevent the exponential from overflowing and underflowing. But clipping in the current way is bad: the exponential easily goes above 10, so the clipped values pile up at 10 and the model will have trouble learning a sharp attention this way. -leaky_relu without clipping doesn't have this issue.

@jermainewang (Member, Author) replied:

Hmm, I think -leaky_relu is just a little better than leaky_relu because the negative slope (the author used 0.2) is less steep than the positive slope (which is 1.0). It can still go beyond 10, right?

Collaborator replied:

Oops, I was thinking of relu, my bad. Still, I think it will be hard for the value to explode in the negative case.
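
A small numeric check of both points (hypothetical scores; comparing relu with leaky_relu at the paper's 0.2 negative slope):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-100., -10., 0., 10., 100.])   # hypothetical pre-activation scores

# With plain relu, exp(-relu(x)) never exceeds 1.
print(torch.exp(-F.relu(x)))                     # all values <= 1

# With leaky_relu (negative slope 0.2), the negative branch can still grow:
# -leaky_relu(-100) = 20, so exp(20) ~ 4.9e8 -- beyond 10, but far from
# float32 overflow unless the scores are extremely negative.
print(torch.exp(-F.leaky_relu(x, negative_slope=0.2)))
```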

@lingfanyu (Collaborator) left a comment:

The code looks good to me. It seems that the accuracy on Pubmed and Citeseer hasn't matched the paper's.

If this is because of the clamping in the softmax implementation, I actually wonder how much it would cost to have an extra update_all round just to calculate the max of the pre-softmax attention values?

@yzh119 (Member) commented on Jan 10, 2019:

@ylfdq1118 There are two available methods to avoid the clamp:

  1. Calling an extra update_all to calculate the max of pre-softmax attention values.
  2. Calculate the max of pre-softmax attention values in reduce function.

If my understanding is right, in the first method we compute the max value in a degree-bucketing way, and then we could use SPMV to accelerate the following steps (src_mul_edge). But in the second method we would execute src_mul_edge in a degree-bucketing way, which might be a heavy cost.

Is that the point why you mentioned adding an extra update_all round?

@VoVAllen (Collaborator) commented:

According to PyTorch's implementation, they subtract the max before doing the softmax.
A max operation is expected to become a kernel in the future. Like the max readout operation, both are segment max (segment reduction) operations. I think it's worth the time to investigate how to put this into DGL.
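
For reference, a generic sketch of the max-subtraction trick for an edge softmax grouped by destination node. This is not DGL's API or the kernel discussed here; it is a plain-PyTorch illustration that assumes a recent version with Tensor.scatter_reduce_ and uses hypothetical names:

```python
import torch

def edge_softmax(scores, dst, num_nodes):
    """scores: (E,) unnormalized edge scores; dst: (E,) destination node ids."""
    # 1) per-destination max (the "segment max" discussed above)
    max_per_dst = torch.full((num_nodes,), float('-inf'))
    max_per_dst.scatter_reduce_(0, dst, scores, reduce='amax', include_self=True)
    # 2) subtract the max before exponentiating -> no overflow
    exp_scores = torch.exp(scores - max_per_dst[dst])
    # 3) per-destination sum, then normalize
    denom = torch.zeros(num_nodes).index_add_(0, dst, exp_scores)
    return exp_scores / denom[dst]

# tiny example: four edges into node 0, one edge into node 1
scores = torch.tensor([5., 10., 5., 5., 3.])
dst = torch.tensor([0, 0, 0, 0, 1])
print(edge_softmax(scores, dst, num_nodes=2))
```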

@yzh119 (Member) commented on Jan 10, 2019:

@VoVAllen It's unnecessary to call segment_max for softmax; the backward function might be slower compared to a single op.
A custom op is in progress (#335); you may find the sparse softmax implementation there.

@lingfanyu (Collaborator) commented on Jan 10, 2019:

You guys are right. We don't have kernel support for max reduction yet. Previously, I tried using scipy to calculate the max outside DGL, but that costs extra communication between host and device and may not be doable since edge features are no longer scalar.

@jermainewang (Member, Author) commented:

One thing I noticed about the TensorFlow baseline: their loss value is much larger than ours (all the hyperparameters are the same), and the training accuracy is lower than the validation accuracy. Do you guys know why this happens?

@jermainewang (Member, Author) commented:

Figured out what happens: there is a weird dropout. Adding that, plus multiple output attention heads, gives better results. Updated.

@jermainewang jermainewang merged commit efae0f9 into dmlc:master Jan 11, 2019
@jermainewang jermainewang deleted the improve-gat branch January 11, 2019 07:02