## GAT
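1. Idea: for each node, compute an attention score against every neighboring node, then use the normalized scores as weights to average the neighbors' features into a new representation of that node.
2. Attention score: concatenate the linearly transformed features of the two nodes, take the inner product with a learnable weight vector, and pass the result through LeakyReLU.
3. Advantage over GCN: GCN treats all neighbors as equivalent (or distinguishes them only through manually set edge weights), whereas GAT learns a separate attention score per neighbor, so each node's update weights its neighbors more sensibly.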
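
In symbols, the three points above correspond to the usual form of a GAT layer. The notation is introduced here for illustration and does not appear in the original notes: $h_i$ is the input feature vector of node $i$, $W$ the shared weight matrix, $a$ the attention vector, $\mathcal{N}_i$ the neighborhood of node $i$, and $\sigma$ the output nonlinearity (ELU in the code further down).

$$
e_{ij} = \mathrm{LeakyReLU}\left(a^{\top}\,[\,W h_i \,\|\, W h_j\,]\right), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}, \qquad
h_i' = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij}\, W h_j\Big)
$$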

## 1 Initial setup

In [2]:
import argparse
import torch
import numpy as np
import random

parser = argparse.ArgumentParser()
parser.add_argument('--no-cuda', action='store_true', default=False, help='Disables CUDA training.')
parser.add_argument('--fastmode', action='store_true', default=False, help='Validate during training pass.')
parser.add_argument('--sparse', action='store_true', default=False, help='GAT with sparse version or not.')
parser.add_argument('--seed', type=int, default=72, help='Random seed.')
parser.add_argument('--epochs', type=int, default=10000, help='Number of epochs to train.')
parser.add_argument('--lr', type=float, default=0.005, help='Initial learning rate.')
parser.add_argument('--weight_decay', type=float, default=5e-4, help='Weight decay (L2 loss on parameters).')
parser.add_argument('--hidden', type=int, default=8, help='Number of hidden units.')
parser.add_argument('--nb_heads', type=int, default=8, help='Number of head attentions.')
parser.add_argument('--dropout', type=float, default=0.6, help='Dropout rate (1 - keep probability).')
parser.add_argument('--alpha', type=float, default=0.2, help='Alpha for the leaky_relu.')
parser.add_argument('--patience', type=int, default=100, help='Patience')


args, _ = parser.parse_known_args()
args.cuda = not args.no_cuda and torch.cuda.is_available()

random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)

<torch._C.Generator at 0x27dff944590>

## 2 Load the data

In [3]:
## Define the one-hot encoding function
def encode_onehot(labels):
    # The classes must be sorted before encoding to enable static class encoding.
    # In other words, make sure the first class always maps to index 0.
    classes = sorted(list(set(labels)))
    classes_dict = {c: np.identity(len(classes))[i, :] for i, c in enumerate(classes)}
    labels_onehot = np.array(list(map(classes_dict.get, labels)), dtype=np.int32)
    return labels_onehot
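
A small usage sketch may help to see what the encoder returns. The function is repeated so the snippet runs on its own, and the label list is a made-up toy input rather than the Cora data.

```python
import numpy as np

def encode_onehot(labels):
    # Same logic as the cell above, repeated so this snippet is self-contained.
    classes = sorted(set(labels))
    classes_dict = {c: np.identity(len(classes))[i, :] for i, c in enumerate(classes)}
    return np.array(list(map(classes_dict.get, labels)), dtype=np.int32)

# Toy labels (assumption for illustration only).
toy_labels = ['Rule_Learning', 'Case_Based', 'Rule_Learning', 'Theory']
onehot = encode_onehot(toy_labels)

# Sorted class order is Case_Based, Rule_Learning, Theory, so
# 'Case_Based' maps to column 0 regardless of where it first appears.
print(onehot)
```

Because the classes are sorted first, the column assigned to each class does not depend on the order of the rows in the input file.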

## 2-1 Process the node data

In [4]:
## Load the node data
import scipy.sparse as sp
idx_features_labels = np.genfromtxt("C:\\Users\\Wei Zhou\\Desktop\\test\\图神经网络几个算法\\cora.content", dtype=np.dtype(str))
print(idx_features_labels)

[['31336' '0' '0' ... '0' '0' 'Neural_Networks']
 ['1061127' '0' '0' ... '0' '0' 'Rule_Learning']
 ['1106406' '0' '0' ... '0' '0' 'Reinforcement_Learning']
 ...
 ['1128978' '0' '0' ... '0' '0' 'Genetic_Algorithms']
 ['117328' '0' '0' ... '0' '0' 'Case_Based']
 ['24043' '0' '0' ... '0' '0' 'Neural_Networks']]


In [5]:
## Build the sparse matrix of node features
features = sp.csr_matrix(idx_features_labels[:, 1:-1], dtype=np.float32)

In [6]:
## Extract the labels and one-hot encode them
labels = encode_onehot(idx_features_labels[:, -1])

In [7]:
## Extract the node ids and re-index them as 0..N-1
idx = np.array(idx_features_labels[:, 0], dtype=np.int32)
idx_map = {j: i for i, j in enumerate(idx)}

## 2-2 Process the edge data

In [8]:
## Load the edge data
edges_unordered = np.genfromtxt("C:\\Users\\Wei Zhou\\Desktop\\test\\图神经网络几个算法\\cora.cites", dtype=np.int32)

In [9]:
## Remap the edge endpoints with the node index map built earlier and store the result as an array
edges = np.array(list(map(idx_map.get, edges_unordered.flatten())), dtype=np.int32).reshape(edges_unordered.shape)
print(edges)

[[ 163  402]
 [ 163  659]
 [ 163 1696]
 ...
 [1887 2258]
 [1902 1887]
 [ 837 1686]]
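
The remapping step can be illustrated on a tiny example. The ids and citation pairs below are toy values chosen for illustration; only the mapping logic mirrors the cells above.

```python
import numpy as np

# Toy paper ids, in the order they would appear in the node file (assumption).
raw_ids = np.array([31336, 1061127, 1106406], dtype=np.int32)
idx_map = {j: i for i, j in enumerate(raw_ids)}

# Toy citation pairs expressed with the raw ids (assumption).
edges_unordered = np.array([[31336, 1106406],
                            [1061127, 31336]], dtype=np.int32)

# Flatten, look every id up in the map, then restore the (E, 2) shape.
edges = np.array(list(map(idx_map.get, edges_unordered.flatten())),
                 dtype=np.int32).reshape(edges_unordered.shape)
print(edges)
```

After this step every endpoint is a row index into the feature matrix, which is what the adjacency matrix construction needs.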


In [10]:
## Build the sparse adjacency matrix from the edges
adj = sp.coo_matrix((np.ones(edges.shape[0]), (edges[:, 0], edges[:, 1])), shape=(labels.shape[0], labels.shape[0]), dtype=np.float32)

In [11]:
# Convert the directed graph to an undirected one (symmetrize the adjacency matrix)
adj = adj + adj.T.multiply(adj.T > adj) - adj.multiply(adj.T > adj)
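
To see why this expression produces a symmetric matrix, here is a dense NumPy sketch of the same formula on a made-up three-node graph. The notebook applies it to a SciPy sparse matrix; element-wise `*` on dense arrays plays the role of `.multiply` here.

```python
import numpy as np

# Toy directed graph (assumption): edges 0 -> 1 and 2 -> 1.
A = np.array([[0., 1., 0.],
              [0., 0., 0.],
              [0., 1., 0.]])

# Positions where the reverse edge is present but the forward edge is not.
mask = A.T > A

# Add the missing reverse edges; the subtracted term is zero for 0/1 matrices.
sym = A + A.T * mask - A * mask
print(sym)
```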

## 2-3 Normalize the node features and the adjacency matrix

In [12]:
## Normalization of the adjacency matrix
def normalize_adj(mx):
    """Symmetrically normalize sparse matrix: D^-1/2 * A * D^-1/2"""
    rowsum = np.array(mx.sum(1))
    r_inv_sqrt = np.power(rowsum, -0.5).flatten()
    r_inv_sqrt[np.isinf(r_inv_sqrt)] = 0.
    r_mat_inv_sqrt = sp.diags(r_inv_sqrt)
    return mx.dot(r_mat_inv_sqrt).transpose().dot(r_mat_inv_sqrt)
## Normalization of the node features
def normalize_features(mx):
    """Row-normalize sparse matrix"""
    rowsum = np.array(mx.sum(1))
    r_inv = np.power(rowsum, -1).flatten()
    r_inv[np.isinf(r_inv)] = 0.
    r_mat_inv = sp.diags(r_inv)
    mx = r_mat_inv.dot(mx)
    return mx
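
The two normalizations can be checked by hand on a small dense example. The helpers `sym_normalize` and `row_normalize` below are hypothetical dense stand-ins for the sparse functions above, and the matrices are toy inputs.

```python
import numpy as np

def sym_normalize(a):
    """Dense D^-1/2 * A * D^-1/2; assumes `a` is symmetric."""
    deg = a.sum(axis=1)
    with np.errstate(divide='ignore'):
        d_inv_sqrt = np.power(deg, -0.5)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0   # isolated nodes get weight 0
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def row_normalize(x):
    """Scale every row so that it sums to 1 (all-zero rows stay zero)."""
    rowsum = x.sum(axis=1)
    with np.errstate(divide='ignore'):
        r_inv = np.power(rowsum, -1.0)
    r_inv[np.isinf(r_inv)] = 0.0
    return x * r_inv[:, None]

# Toy path graph 0 - 1 - 2, with self-loops added as in the next cell.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A_hat = sym_normalize(A + np.eye(3))

# Toy bag-of-words features; the second row is deliberately empty.
X = np.array([[1., 1., 0., 0.],
              [0., 0., 0., 0.],
              [1., 0., 1., 1.]])
X_norm = row_normalize(X)

print(A_hat)
print(X_norm)
```

With degrees 2, 3 and 2 after adding self-loops, the entry between nodes 0 and 1 becomes 1/sqrt(2*3). Note that the dense `GraphAttentionLayer` later in this notebook only tests `adj > 0`, so for that model the normalized values act as a connectivity mask.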

In [13]:
## Apply the normalization
features = normalize_features(features)
adj = normalize_adj(adj + sp.eye(adj.shape[0]))
print(features)
print(adj)

  (0, 1426)	0.05
  (0, 1352)	0.05
  (0, 1236)	0.05
  (0, 1209)	0.05
  (0, 1205)	0.05
  (0, 902)	0.05
  (0, 845)	0.05
  (0, 734)	0.05
  (0, 702)	0.05
  (0, 698)	0.05
  (0, 648)	0.05
  (0, 619)	0.05
  (0, 521)	0.05
  (0, 507)	0.05
  (0, 456)	0.05
  (0, 351)	0.05
  (0, 252)	0.05
  (0, 176)	0.05
  (0, 125)	0.05
  (0, 118)	0.05
  (1, 1425)	0.05882353
  (1, 1389)	0.05882353
  (1, 1332)	0.05882353
  (1, 1266)	0.05882353
  (1, 1263)	0.05882353
  :	:
  (2706, 475)	0.05263158
  (2706, 287)	0.05263158
  (2706, 132)	0.05263158
  (2706, 54)	0.05263158
  (2706, 48)	0.05263158
  (2706, 4)	0.05263158
  (2707, 1351)	0.05263158
  (2707, 1335)	0.05263158
  (2707, 1333)	0.05263158
  (2707, 1301)	0.05263158
  (2707, 1205)	0.05263158
  (2707, 1203)	0.05263158
  (2707, 1178)	0.05263158
  (2707, 1156)	0.05263158
  (2707, 1075)	0.05263158
  (2707, 1073)	0.05263158
  (2707, 877)	0.05263158
  (2707, 774)	0.05263158
  (2707, 737)	0.05263158
  (2707, 564)	0.05263158
  (2707, 422)	0.05263158
  (2707, 304)	0.0526315

## 2-4 Convert to tensors for model input

In [14]:
adj = torch.FloatTensor(np.array(adj.todense()))
features = torch.FloatTensor(np.array(features.todense()))
labels = torch.LongTensor(np.where(labels)[1])
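
The `np.where(labels)[1]` expression turns the one-hot rows back into integer class indices, which is the target format `F.nll_loss` expects. A minimal NumPy sketch with a made-up one-hot matrix:

```python
import numpy as np

# Toy one-hot labels for three nodes and three classes (assumption).
onehot = np.array([[0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]], dtype=np.int32)

# np.where returns (row_indices, col_indices) of the non-zero entries in
# row-major order; with exactly one 1 per row, the column indices are the classes.
class_idx = np.where(onehot)[1]
print(class_idx)
```

For valid one-hot input this gives the same result as `onehot.argmax(axis=1)`.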

## 2-5 Split the dataset

In [15]:
idx_train = range(140)
idx_val = range(200, 500)
idx_test = range(500, 1500)
idx_train = torch.LongTensor(idx_train)
idx_val = torch.LongTensor(idx_val)
idx_test = torch.LongTensor(idx_test)

## 3 Build the model

In [16]:
import torch
import torch.nn as nn
import torch.nn.functional as F

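## Multi-head attention
1. Each head in multi-head attention has its own independent attention mechanism with its own parameters (in the code below, a weight matrix W and an attention vector a per head). With N heads there are therefore N independent sets of attention parameters.
2. Multiple heads play a role similar to multiple convolution kernels: each head can learn to pick up a different pattern from the same input.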
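
A shape-only sketch of how head outputs are combined, assuming the same sizes as the defaults above (8 heads, 8 hidden units). Attention weighting is left out on purpose; each head is reduced to a random projection so that only the concatenation is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, nfeat, nhid, nheads = 5, 16, 8, 8   # toy sizes (assumption)

x = rng.normal(size=(n_nodes, nfeat))

# One independent projection matrix per head.
head_weights = [rng.normal(size=(nfeat, nhid)) for _ in range(nheads)]
head_outputs = [x @ W for W in head_weights]

# Concatenating along the feature axis gives nhid * nheads features per node.
hidden = np.concatenate(head_outputs, axis=1)
print(hidden.shape)  # (5, 64)
```

This is why the output layer of the GAT model below takes `nhid * nheads` input features.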

In [17]:
## Build the GAT model
class GAT(nn.Module):
    def __init__(self, nfeat, nhid, nclass, dropout, alpha, nheads):
        super(GAT, self).__init__()
        self.dropout = dropout

        self.attentions = [GraphAttentionLayer(nfeat, nhid, dropout=dropout, alpha=alpha, concat=True) for _ in range(nheads)]
        for i, attention in enumerate(self.attentions):
            self.add_module('attention_{}'.format(i), attention)

        self.out_att = GraphAttentionLayer(nhid * nheads, nclass, dropout=dropout, alpha=alpha, concat=False)

    def forward(self, x, adj):
        x = F.dropout(x, self.dropout, training=self.training)
        x = torch.cat([att(x, adj) for att in self.attentions], dim=1)
        x = F.dropout(x, self.dropout, training=self.training)
        x = F.elu(self.out_att(x, adj))
        return F.log_softmax(x, dim=1)


In [18]:
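## Build GraphAttentionLayer
class GraphAttentionLayer(nn.Module):
    def __init__(self, in_features, out_features, dropout, alpha, concat=True):
        super(GraphAttentionLayer, self).__init__()
        ## Dropout rate, used to reduce overfitting
        self.dropout = dropout
        ## Input feature dimension
        self.in_features = in_features
        ## Output feature dimension
        self.out_features = out_features
        ## Negative slope of the LeakyReLU activation
        self.alpha = alpha
        ## True for hidden heads whose outputs are concatenated (ELU is applied); False for the output layer
        self.concat = concat

        ## Learnable weight matrix mapping the input dimension to the output dimension
        self.W = nn.Parameter(torch.empty(size=(in_features, out_features)))
        ## Initialize the parameter
        nn.init.xavier_uniform_(self.W.data, gain=1.414)
        ## gain is an optional scaling factor for the initialization range. Here it is set to 1.414 (about sqrt(2)),
        ## a value commonly used with ReLU and its variants such as LeakyReLU to help preserve the variance.

        ## Learnable weight vector used to compute the attention scores
        self.a = nn.Parameter(torch.empty(size=(2*out_features, 1)))
        ## Initialize the parameter
        nn.init.xavier_uniform_(self.a.data, gain=1.414)
        ## Nonlinearity applied to the raw scores
        self.leakyrelu = nn.LeakyReLU(self.alpha)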



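    def forward(self, h, adj):
        ## Linear transform of the node feature matrix: (N, in_features) -> (N, out_features)
        Wh = torch.mm(h, self.W)
        ## Raw (unnormalized) attention scores for every node pair, shape (N, N)
        e = self._prepare_attentional_mechanism_input(Wh)
        ## Mask value: a very large negative number with the same shape as e
        zero_vec = -9e15*torch.ones_like(e)
        ## Keep the score e where two nodes are connected by an edge; otherwise use the -9e15 mask
        attention = torch.where(adj > 0, e, zero_vec)
        ## Normalize the scores over each row with softmax; masked entries become (numerically) zero
        attention = F.softmax(attention, dim=1)
        ## Dropout on the attention weights, which adds regularization during training
        attention = F.dropout(attention, self.dropout, training=self.training)
        ## Multiply the attention matrix with the transformed features: a weighted sum over neighbors
        h_prime = torch.matmul(attention, Wh)

        ## Hidden heads (concat=True) apply ELU here; their outputs are concatenated in GAT.forward.
        ## The output layer (concat=False) returns the aggregation unchanged.
        if self.concat:
            return F.elu(h_prime)
        else:
            return h_prime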
        
        
        
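    ## Raw attention scores between all node pairs
    def _prepare_attentional_mechanism_input(self, Wh):
        ## Multiply Wh by the first half of self.a: (N, out_features) x (out_features, 1) -> (N, 1)
        Wh1 = torch.matmul(Wh, self.a[:self.out_features, :])
        ## Multiply Wh by the second half of self.a: (N, out_features) x (out_features, 1) -> (N, 1)
        Wh2 = torch.matmul(Wh, self.a[self.out_features:, :])
        ## Broadcast add of Wh1 (N, 1) and the transpose of Wh2 (1, N) gives the (N, N) raw score matrix e
        e = Wh1 + Wh2.T
        ## Apply the LeakyReLU nonlinearity to the scores
        return self.leakyrelu(e)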

    def __repr__(self):
        return self.__class__.__name__ + ' (' + str(self.in_features) + ' -> ' + str(self.out_features) + ')'
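
The split-and-broadcast trick in `_prepare_attentional_mechanism_input` and the masked softmax in `forward` can be checked with a small NumPy sketch. Everything here is a toy stand-in: random values replace the learned `W` and `a`, and `leaky_relu` is a hypothetical helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, out_features, alpha = 4, 3, 0.2

Wh = rng.normal(size=(n_nodes, out_features))   # stands in for h @ W
a = rng.normal(size=(2 * out_features, 1))      # stands in for the attention vector
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]])                  # toy graph with self-loops

def leaky_relu(x, slope):
    return np.where(x > 0, x, slope * x)

# Split form used by the layer: (N, 1) + (1, N) broadcasts to (N, N).
e_split = leaky_relu(Wh @ a[:out_features] + (Wh @ a[out_features:]).T, alpha)

# Explicit form: concatenate Wh_i and Wh_j for every pair, then dot with a.
e_concat = np.empty((n_nodes, n_nodes))
for i in range(n_nodes):
    for j in range(n_nodes):
        pair = np.concatenate([Wh[i], Wh[j]])
        e_concat[i, j] = leaky_relu(pair @ a[:, 0], alpha)

# Masked softmax per row: non-neighbors get a very negative score.
masked = np.where(adj > 0, e_split, -9e15)
masked = masked - masked.max(axis=1, keepdims=True)   # numerical stability
attention = np.exp(masked) / np.exp(masked).sum(axis=1, keepdims=True)

# Weighted average of the transformed neighbor features.
h_prime = attention @ Wh
print(np.allclose(e_split, e_concat))
```

The split form avoids materializing all N x N concatenated pairs, yet yields the same scores as the explicit concatenation.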



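## Differences between SpGAT and GAT
1. GAT (the dense version above) computes a score for every node pair in an N x N matrix and then masks out the pairs that are not connected.
2. SpGAT, as implemented below, computes the same kind of attention but only for node pairs that actually share an edge, using sparse matrix operations. It does not add positional or structural information beyond the adjacency matrix.
3. The dense version is simpler to read; the sparse version needs less memory on large, sparsely connected graphs. The `--sparse` flag chooses between the two.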
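
As a rough check on how an edge-list computation relates to the dense one, the NumPy sketch below normalizes per-edge exponentials by their row sums and compares the result with a masked softmax. The scores are random toy values, not the output of either layer; note also that the sparse layer below exponentiates the negated LeakyReLU output, so its scores differ in sign from the dense layer's.

```python
import numpy as np

rng = np.random.default_rng(0)
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]])                 # toy graph with self-loops
scores = rng.normal(size=adj.shape)         # stand-in for per-pair attention scores
Wh = rng.normal(size=(3, 2))                # stand-in for transformed features

# Dense route: mask, softmax per row, then aggregate.
masked = np.where(adj > 0, scores, -np.inf)
dense_att = np.exp(masked) / np.exp(masked).sum(axis=1, keepdims=True)
dense_out = dense_att @ Wh

# Edge-list route: only the existing edges are touched.
rows, cols = np.nonzero(adj)
edge_e = np.exp(scores[rows, cols])         # one value per edge
rowsum = np.zeros(3)
np.add.at(rowsum, rows, edge_e)             # sum of edge values per source node
sparse_out = np.zeros_like(Wh)
np.add.at(sparse_out, rows, edge_e[:, None] * Wh[cols])
sparse_out /= rowsum[:, None]

print(np.allclose(dense_out, sparse_out))
```

Dividing the aggregated features by the row sums at the end is equivalent to applying softmax weights first, which is the order of operations the sparse layer uses.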

In [19]:
## Build the SpGAT model
class SpGAT(nn.Module):
    def __init__(self, nfeat, nhid, nclass, dropout, alpha, nheads):
        super(SpGAT, self).__init__()
        self.dropout = dropout

        self.attentions = [SpGraphAttentionLayer(nfeat, 
                                                 nhid, 
                                                 dropout=dropout, 
                                                 alpha=alpha, 
                                                 concat=True) for _ in range(nheads)]
        for i, attention in enumerate(self.attentions):
            self.add_module('attention_{}'.format(i), attention)

        self.out_att = SpGraphAttentionLayer(nhid * nheads, 
                                             nclass, 
                                             dropout=dropout, 
                                             alpha=alpha, 
                                             concat=False)

    def forward(self, x, adj):
        x = F.dropout(x, self.dropout, training=self.training)
        x = torch.cat([att(x, adj) for att in self.attentions], dim=1)
        x = F.dropout(x, self.dropout, training=self.training)
        x = F.elu(self.out_att(x, adj))
        return F.log_softmax(x, dim=1)

In [20]:
## Build SpGraphAttentionLayer
class SpGraphAttentionLayer(nn.Module):

    def __init__(self, in_features, out_features, dropout, alpha, concat=True):
        super(SpGraphAttentionLayer, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.alpha = alpha
        self.concat = concat

        self.W = nn.Parameter(torch.zeros(size=(in_features, out_features)))
        nn.init.xavier_normal_(self.W.data, gain=1.414)
                
        self.a = nn.Parameter(torch.zeros(size=(1, 2*out_features)))
        nn.init.xavier_normal_(self.a.data, gain=1.414)

        self.dropout = nn.Dropout(dropout)
        self.leakyrelu = nn.LeakyReLU(self.alpha)
        self.special_spmm = SpecialSpmm()  # multiplies a sparse matrix by a dense matrix

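    def forward(self, input, adj):
        dv = 'cuda' if input.is_cuda else 'cpu'
        ## Number of rows of the input matrix, i.e. the number of nodes in the graph
        N = input.size()[0]
        ## Indices of the non-zero entries of the adjacency matrix, i.e. the node pairs that share an edge
        edge = adj.nonzero().t()
        ## Transform the node features by multiplying with the weight matrix self.W
        h = torch.mm(input, self.W)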
        # h: N x out
        assert not torch.isnan(h).any()

        ## Concatenate the features of the two endpoint nodes of every edge
        edge_h = torch.cat((h[edge[0, :], :], h[edge[1, :], :]), dim=1).t()
        # edge: 2*D x E
        ## Unnormalized attention coefficient of every edge
        edge_e = torch.exp(-self.leakyrelu(self.a.mm(edge_h).squeeze()))
        assert not torch.isnan(edge_e).any()
        # edge_e: E
        ## Row sums of the coefficients, used below as the normalization denominator
        e_rowsum = self.special_spmm(edge, edge_e, torch.Size([N, N]), torch.ones(size=(N,1), device=dv))
        # e_rowsum: N x 1
    
        ## Apply dropout to the edge coefficients
        edge_e = self.dropout(edge_e)
        # edge_e: E

        ## Aggregate the node features, weighted by the attention coefficients
        h_prime = self.special_spmm(edge, edge_e, torch.Size([N, N]), h)
        assert not torch.isnan(h_prime).any()
        # h_prime: N x out

        ## Divide by the row sums to normalize the aggregated features
        h_prime = h_prime.div(e_rowsum)
        # h_prime: N x out
        assert not torch.isnan(h_prime).any()

        if self.concat:
            # if this layer is not last layer,
            return F.elu(h_prime)
        else:
            # if this layer is last layer,
            return h_prime

    def __repr__(self):
        return self.__class__.__name__ + ' (' + str(self.in_features) + ' -> ' + str(self.out_features) + ')'


In [21]:
class SpecialSpmm(nn.Module):
    def forward(self, indices, values, shape, b):
        return SpecialSpmmFunction.apply(indices, values, shape, b)

In [22]:
class SpecialSpmmFunction(torch.autograd.Function):
    """Special function for only sparse region backpropagation layer."""
    @staticmethod
    def forward(ctx, indices, values, shape, b):
        ## The indices must not require gradients, i.e. they are constants
        assert indices.requires_grad == False
        ## Build a sparse COO (coordinate format) tensor
        a = torch.sparse_coo_tensor(indices, values, shape)
        ## Save a and b for use in backward
        ctx.save_for_backward(a, b)
        ## Save the number of rows of the sparse matrix
        ctx.N = shape[0]
        return torch.matmul(a, b)

    @staticmethod
    def backward(ctx, grad_output):
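        ## Recover the saved tensors from the context
        a, b = ctx.saved_tensors
        ## Initialize the gradients
        grad_values = grad_b = None
        ## Check whether the gradient with respect to values is needed
        if ctx.needs_input_grad[1]:
            ## Dense gradient with respect to a: grad_output times the transpose of b
            grad_a_dense = grad_output.matmul(b.t())
            ## Flattened (row * N + col) index of every edge
            edge_idx = a._indices()[0, :] * ctx.N + a._indices()[1, :]
            ## Pick out the gradient entries at the edge positions
            grad_values = grad_a_dense.view(-1)[edge_idx]
        ## Check whether the gradient with respect to b is needed
        if ctx.needs_input_grad[3]:
            ## Transpose of a times grad_output
            grad_b = a.t().matmul(grad_output)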
        return None, grad_values, None, grad_b
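
The flat-index step in `backward` is easy to misread, so here is a NumPy sketch of just that part. The edge list, `b` and `grad_output` are random toy values; the point is only that indexing the flattened dense gradient with `row * N + col` selects the same entries as indexing with `(row, col)` directly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F_out = 4, 3

# Toy edge list in COO form (assumption): (row, col) of each stored value.
rows = np.array([0, 1, 1, 3])
cols = np.array([1, 0, 2, 3])

b = rng.normal(size=(N, F_out))             # dense right-hand matrix
grad_output = rng.normal(size=(N, F_out))   # gradient arriving from the next layer

# Gradient of (a @ b) with respect to a dense a would be grad_output @ b.T.
grad_a_dense = grad_output @ b.T            # shape (N, N)

# Only the entries at existing edges matter for the sparse values.
edge_idx = rows * N + cols
grad_values = grad_a_dense.reshape(-1)[edge_idx]

print(np.allclose(grad_values, grad_a_dense[rows, cols]))
```

Entries of the dense gradient outside the edge positions are discarded, which is what restricts backpropagation to the sparse region.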

## Model selection

In [23]:
import torch.optim as optim
if args.sparse:
    model = SpGAT(nfeat=features.shape[1], 
                nhid=args.hidden, 
                nclass=int(labels.max()) + 1, 
                dropout=args.dropout, 
                nheads=args.nb_heads, 
                alpha=args.alpha)
else:
    model = GAT(nfeat=features.shape[1], 
                nhid=args.hidden, 
                nclass=int(labels.max()) + 1, 
                dropout=args.dropout, 
                nheads=args.nb_heads, 
                alpha=args.alpha)
optimizer = optim.Adam(model.parameters(), 
                       lr=args.lr, 
                       weight_decay=args.weight_decay)

## Move the model and data to the GPU

In [24]:
# torch.autograd.Variable has been deprecated since PyTorch 0.4: tensors
# track gradients themselves, so no Variable wrapping is needed here.
if args.cuda:
    model.cuda()
    features = features.cuda()
    adj = adj.cuda()
    labels = labels.cuda()
    idx_train = idx_train.cuda()
    idx_val = idx_val.cuda()
    idx_test = idx_test.cuda()

## Define the training procedure

In [25]:
import time
def train(epoch):
    t = time.time()
    model.train()
    optimizer.zero_grad()
    output = model(features, adj)
    loss_train = F.nll_loss(output[idx_train], labels[idx_train])
    acc_train = accuracy(output[idx_train], labels[idx_train])
    loss_train.backward()
    optimizer.step()

    if not args.fastmode:
        # Evaluate validation set performance separately,
        # deactivates dropout during validation run.
        model.eval()
        output = model(features, adj)

    loss_val = F.nll_loss(output[idx_val], labels[idx_val])
    acc_val = accuracy(output[idx_val], labels[idx_val])
    print('Epoch: {:04d}'.format(epoch+1),
          'loss_train: {:.4f}'.format(loss_train.data.item()),
          'acc_train: {:.4f}'.format(acc_train.data.item()),
          'loss_val: {:.4f}'.format(loss_val.data.item()),
          'acc_val: {:.4f}'.format(acc_val.data.item()),
          'time: {:.4f}s'.format(time.time() - t))

    return loss_val.data.item()

In [26]:
def accuracy(output, labels):
    preds = output.max(1)[1].type_as(labels)
    correct = preds.eq(labels).double()
    correct = correct.sum()
    return correct / len(labels)

## Run the training

In [27]:
import os
import glob
t_total = time.time()
loss_values = []
bad_counter = 0
best = args.epochs + 1
best_epoch = 0
for epoch in range(args.epochs):
    loss_values.append(train(epoch))

    torch.save(model.state_dict(), '{}.pkl'.format(epoch))
    if loss_values[-1] < best:
        best = loss_values[-1]
        best_epoch = epoch
        bad_counter = 0
    else:
        bad_counter += 1

    if bad_counter == args.patience:
        break

    files = glob.glob('*.pkl')
    for file in files:
        epoch_nb = int(file.split('.')[0])
        if epoch_nb < best_epoch:
            os.remove(file)

files = glob.glob('*.pkl')
for file in files:
    epoch_nb = int(file.split('.')[0])
    if epoch_nb > best_epoch:
        os.remove(file)

print("Optimization Finished!")
print("Total time elapsed: {:.4f}s".format(time.time() - t_total))

Epoch: 0001 loss_train: 1.9525 acc_train: 0.1500 loss_val: 1.9401 acc_val: 0.3033 time: 2.7198s
Epoch: 0002 loss_train: 1.9417 acc_train: 0.2071 loss_val: 1.9312 acc_val: 0.4600 time: 0.0708s
Epoch: 0003 loss_train: 1.9250 acc_train: 0.2714 loss_val: 1.9225 acc_val: 0.5533 time: 0.0701s
Epoch: 0004 loss_train: 1.9133 acc_train: 0.4071 loss_val: 1.9134 acc_val: 0.5933 time: 0.0640s
Epoch: 0005 loss_train: 1.9069 acc_train: 0.5143 loss_val: 1.9044 acc_val: 0.6033 time: 0.0702s
Epoch: 0006 loss_train: 1.8998 acc_train: 0.4429 loss_val: 1.8955 acc_val: 0.5933 time: 0.0696s
Epoch: 0007 loss_train: 1.8963 acc_train: 0.5214 loss_val: 1.8866 acc_val: 0.5933 time: 0.0698s
Epoch: 0008 loss_train: 1.8793 acc_train: 0.5214 loss_val: 1.8776 acc_val: 0.6033 time: 0.0690s
Epoch: 0009 loss_train: 1.8467 acc_train: 0.5857 loss_val: 1.8681 acc_val: 0.6067 time: 0.0707s
Epoch: 0010 loss_train: 1.8370 acc_train: 0.5643 loss_val: 1.8583 acc_val: 0.6067 time: 0.0794s
Epoch: 0011 loss_train: 1.8400 acc_train
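
The stopping rule in the loop above can be isolated into a small pure-Python sketch. `early_stop_epoch` is a hypothetical helper written for this illustration and the loss list is made up; checkpoint saving and deletion are left out, and the initial best value is `inf` rather than `args.epochs + 1`.

```python
def early_stop_epoch(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a patience-based stopping rule."""
    best = float('inf')
    best_epoch = 0
    bad_counter = 0
    stop_epoch = len(val_losses) - 1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: remember it and reset the counter.
            best, best_epoch, bad_counter = loss, epoch, 0
        else:
            bad_counter += 1
        if bad_counter == patience:
            # No improvement for `patience` consecutive epochs.
            stop_epoch = epoch
            break
    return best_epoch, stop_epoch

# Toy validation losses (assumption): the best value is at epoch 2.
losses = [1.9, 1.7, 1.5, 1.6, 1.55, 1.52, 1.4]
print(early_stop_epoch(losses, patience=3))  # (2, 5)
```

In this toy run training stops at epoch 5, so the lower loss at epoch 6 is never reached; a larger patience trades extra training time for a lower chance of stopping too early.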

## Define the test procedure

In [28]:
def compute_test():
    model.eval()
    output = model(features, adj)
    loss_test = F.nll_loss(output[idx_test], labels[idx_test])
    acc_test = accuracy(output[idx_test], labels[idx_test])
    print("Test set results:",
          "loss= {:.4f}".format(loss_test.data.item()),
          "accuracy= {:.4f}".format(acc_test.data.item()))

## Run the test

In [29]:
compute_test()

Test set results: loss= 0.6648 accuracy= 0.8410
