
[Model] Improve GAT models #348

Merged: 6 commits merged into dmlc:master on Jan 11, 2019

Conversation

@jermainewang (Member) commented on Jan 9, 2019:

Description

A more efficient GAT implementation:

  • Use SPMV to compute softmax.
  • Batch multiple attention heads together for more efficiency (see the sketch below).
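
A minimal sketch (hypothetical names and Cora-like shapes, not this PR's actual code) of what batching the heads amounts to: one linear projection produces all heads at once, and the per-head attention terms a1/a2 each come from a single einsum.

```python
# Hypothetical sketch of batched multi-head projection (not this PR's code).
import torch
import torch.nn as nn

num_nodes, in_dim, out_dim, num_heads = 2708, 1433, 8, 8  # Cora-like sizes

fc = nn.Linear(in_dim, num_heads * out_dim, bias=False)
h = torch.randn(num_nodes, in_dim)                    # input node features
feat = fc(h).view(num_nodes, num_heads, out_dim)      # one matmul for all heads

attn_l = torch.randn(num_heads, out_dim)              # per-head parameters "a1"
attn_r = torch.randn(num_heads, out_dim)              # per-head parameters "a2"
a1 = torch.einsum('nhd,hd->nh', feat, attn_l)         # (num_nodes, num_heads)
a2 = torch.einsum('nhd,hd->nh', feat, attn_r)
```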

Results

| Dataset  | Test accuracy | Time (s) | Baseline #1 time (s) | Baseline #2 time (s) |
|----------|---------------|----------|----------------------|----------------------|
| Cora     | 84.0%         | 0.0127   | 0.0982 (7.7x)        | 0.0424 (3.3x)        |
| Citeseer | 70.7%         | 0.0123   | n/a                  | n/a                  |
| Pubmed   | 78.1%         | 0.0302   | n/a                  | n/a                  |

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change

@zheng-da @eric-haibin-lin We cannot implement this in MXNet because MXNet does not support autograd through sparse matmul. Based on the experience with GCN, if MXNet supported this, it should be faster.

@jermainewang jermainewang changed the title [Model] Improvate GAT models [Model] Improve GAT models Jan 9, 2019
@jermainewang (Member, Author) commented on the diff:

@sufeidechabei

```python
def edge_attention(self, edges):
    # an edge UDF to compute unnormalized attention values from src and dst
    a = self.leaky_relu(edges.src['a1'] + edges.dst['a2'])
    a = torch.exp(a).clamp(-10, 10)  # use clamp to avoid overflow
```
@BarclayII (Collaborator) commented on Jan 10, 2019:

A caveat is that the softmax is technically still not the "correct and numerically stable" softmax. For example, if the pre-softmax scores are [5, 10, 5, 5], then the result becomes uniform, since the code just clips the exponentials to a maximum value of 10.

We should leave a comment there.
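
To make the concern concrete, a quick numeric check (hypothetical scores, plain PyTorch; not code from this PR):

```python
import torch

scores = torch.tensor([5., 10., 5., 5.])    # hypothetical pre-softmax scores

clamped = torch.exp(scores).clamp(-10, 10)  # what the current code does
print(clamped / clamped.sum())              # uniform: tensor([0.2500, 0.2500, 0.2500, 0.2500])

print(torch.softmax(scores, dim=0))         # true softmax: ~[0.0066, 0.9802, 0.0066, 0.0066]
```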

@jermainewang (Member, Author) replied:

I see. Is this a reasonable approximation? I found that in this PyTorch implementation:
https://github.com/Diego999/pyGAT/blob/3301fd68cbf349925d7458d9d47766e48021cf78/layers.py#L111
they even use torch.exp(-leaky_relu(...)). Not sure whether this is equivalent.

Collaborator replied:

-leaky_relu is actually better (although not perfect due to underflow), since its exponential would always be no greater than 1. The opposite is not, since the result will easily overflow.

@jermainewang (Member, Author) replied:

But is it still correct? It seems to me that they change the activation function to -leaky_relu.

@BarclayII (Collaborator) replied on Jan 10, 2019:

The value range of the pre-softmax values doesn't matter if we disregard numerical stability and no clipping is involved, so we only need to prevent the exponential from overflowing and underflowing. But clipping in the current way is bad: the exponential easily goes above 10, so the clipped values pile up at 10 and the model will have trouble learning a sharp attention this way. -leaky_relu without clipping doesn't have this issue.

@jermainewang (Member, Author) replied:

Hmm, I think -leaky_relu is just a little better than leaky_relu because the negative slope (the author used 0.2) is less steep than the positive slope (which is 1.0). It can still go beyond 10, right?

Collaborator replied:

Oops, I was thinking of relu, my bad. Still, I think it will be hard for the value to explode in the negative case.
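
A small numeric check of both points (hypothetical scores; comparing relu with leaky_relu at the paper's 0.2 negative slope):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-100., -10., 0., 10., 100.])   # hypothetical pre-activation scores

# With plain relu, exp(-relu(x)) never exceeds 1.
print(torch.exp(-F.relu(x)))                     # all values <= 1

# With leaky_relu (negative slope 0.2), the negative branch can still grow:
# -leaky_relu(-100) = 20, so exp(20) ~ 4.9e8 -- beyond 10, but far from
# float32 overflow unless the scores are extremely negative.
print(torch.exp(-F.leaky_relu(x, negative_slope=0.2)))
```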

@lingfanyu (Collaborator) left a comment:

The code looks good to me. It seems that the accuracy on Pubmed and Citeseer hasn't matched the paper's.

If this is because of the clamping in the softmax implementation, I actually wonder how much it would cost to have an extra update_all round just to calculate the max of the pre-softmax attention values?

@yzh119 (Member) commented on Jan 10, 2019:

@ylfdq1118 There are two available methods to avoid the clamp:

  1. Calling an extra update_all to calculate the max of pre-softmax attention values.
  2. Calculate the max of pre-softmax attention values in reduce function.

If my understanding is right, in the first method we compute the max value in a degree-bucketing way, and then we could use SPMV to accelerate the following steps (src_mul_edge). But in the second method we would execute src_mul_edge in a degree-bucketing way, which might be a heavy cost.

Is that the point why you mentioned adding an extra update_all round?

@VoVAllen (Collaborator) commented:

According to PyTorch's implementation, they subtract the max before doing the softmax.
A max operation is expected to become a kernel in the future. Like the max readout operation, both are segment max (segment reduction) operations. I think it's worth the time to investigate how to put this into DGL.
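
For reference, a generic sketch of the max-subtraction trick for an edge softmax grouped by destination node. This is not DGL's API or the kernel discussed here; it is a plain-PyTorch illustration that assumes a recent version with Tensor.scatter_reduce_ and uses hypothetical names:

```python
import torch

def edge_softmax(scores, dst, num_nodes):
    """scores: (E,) unnormalized edge scores; dst: (E,) destination node ids."""
    # 1) per-destination max (the "segment max" discussed above)
    max_per_dst = torch.full((num_nodes,), float('-inf'))
    max_per_dst.scatter_reduce_(0, dst, scores, reduce='amax', include_self=True)
    # 2) subtract the max before exponentiating -> no overflow
    exp_scores = torch.exp(scores - max_per_dst[dst])
    # 3) per-destination sum, then normalize
    denom = torch.zeros(num_nodes).index_add_(0, dst, exp_scores)
    return exp_scores / denom[dst]

# tiny example: four edges into node 0, one edge into node 1
scores = torch.tensor([5., 10., 5., 5., 3.])
dst = torch.tensor([0, 0, 0, 0, 1])
print(edge_softmax(scores, dst, num_nodes=2))
```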

@yzh119 (Member) commented on Jan 10, 2019:

@VoVAllen It's unnecessary to call segment_max for softmax; the backward function might be slower compared to a single op.
A custom op is in progress (#335); you may find the sparse softmax implementation there.

@lingfanyu (Collaborator) commented on Jan 10, 2019:

You guys are right. We don't have kernel support for max reduction yet. Previously, I tried using scipy to calculate the max outside DGL, but that costs extra communication between host and device and may not be doable since edge features are no longer scalar.

@jermainewang (Member, Author) commented:

One thing I noticed about the TensorFlow baseline: their loss value is much larger than ours (all the hyperparameters are the same), and the training accuracy is lower than the validation accuracy. Do you guys know why this happens?

@jermainewang (Member, Author) commented:

Figured out what happens: there is a weird dropout. Adding that, plus multiple output attention heads, gives better results. Updated.

@jermainewang jermainewang merged commit efae0f9 into dmlc:master Jan 11, 2019
@jermainewang jermainewang deleted the improve-gat branch January 11, 2019 07:02