# ref

- [Graph Machine Learning (그래프 머신러닝)](https://product.kyobobook.co.kr/detail/S000200738068)

- [GitHub](https://github.com/PacktPublishing/Graph-Machine-Learning)

# Neural Graph Learning (NGL)

- NGL (Neural Graph Learning)

- A nonlinear version of the label propagation and label spreading algorithms.

- [NSL (Neural Structured Learning) framework](https://github.com/tensorflow/neural-structured-learning)
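
As a rough illustration of what "nonlinear label propagation" means here: NGL-style training adds a neighbor-distance penalty to the usual supervised loss, so connected nodes are pushed toward similar embeddings. The sketch below is a plain-Python toy, not the NSL implementation. The names `graph_regularized_loss` and `alpha`, the squared-L2 distance, and the toy embeddings are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def squared_l2(u: List[float], v: List[float]) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def graph_regularized_loss(
    supervised_loss: float,
    embeddings: Dict[str, List[float]],
    edges: List[Tuple[str, str, float]],
    alpha: float = 0.1,
) -> float:
    """Supervised loss plus alpha times the weighted neighbor distances."""
    neighbor_penalty = sum(
        weight * squared_l2(embeddings[u], embeddings[v])
        for u, v, weight in edges
    )
    return supervised_loss + alpha * neighbor_penalty


# Toy data (hypothetical): "a" and "b" agree, "a" and "c" disagree.
embeddings = {"a": [0.0, 1.0], "b": [0.0, 1.0], "c": [1.0, 0.0]}
edges = [("a", "b", 1.0), ("a", "c", 0.5)]

loss = graph_regularized_loss(0.5, embeddings, edges, alpha=0.25)
print(loss)  # 0.75: only the a-c edge contributes (0.5 * 2.0 = 1.0)
```

Setting `alpha=0` recovers the plain supervised loss, which is one way to see the graph term as a regularizer rather than a separate objective.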

## Load Dataset

`-` Dataset: Cora

- 2,708 computer science papers, each labeled with one of 7 classes

- Each paper is a node, connected to other nodes through citations

- 5,429 edges in total



In [1]:
from stellargraph import datasets



In [2]:
dataset = datasets.Cora()

In [3]:
%config Completer.use_jedi = False

In [4]:
dataset.download()

In [5]:
label_index = {
      'Case_Based': 0,
      'Genetic_Algorithms': 1,
      'Neural_Networks': 2,
      'Probabilistic_Methods': 3,
      'Reinforcement_Learning': 4,
      'Rule_Learning': 5,
      'Theory': 6,
  }

In [7]:
G, labels = dataset.load()

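- G: describes the network's nodes, edges, and bag-of-words (BOW) representations

- labels: a mapping from each paper ID to one of the classes

- Training samples: include neighbor information, which is used to regularize training

- Validation samples: include no neighbor information, so the predicted label depends only on the node's own features, i.e. its BOW representation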

In [8]:
import numpy as np
from sklearn import preprocessing, feature_extraction, model_selection

In [9]:
import tensorflow as tf
from tensorflow.train import Example, Features, Feature, Int64List, BytesList, FloatList

In [10]:
GRAPH_PREFIX="NL_nbr"

In [12]:
def _int64_feature(*value):
    """Returns int64 tf.train.Feature from a bool / enum / int / uint."""
    return Feature(int64_list=Int64List(value=list(value)))

def _bytes_feature(value):
    """Returns bytes tf.train.Feature from a string."""
    return Feature(
        bytes_list=BytesList(value=[value.encode('utf-8')])
    )

def _float_feature(*value):
    return Feature(float_list=FloatList(value=list(value)))

- `_int64_feature` takes bool, enum, int, or uint values and returns a `tf.train.Feature` holding an `int64_list`

- `_bytes_feature` takes a string, encodes it as UTF-8, and returns a `tf.train.Feature` holding a `bytes_list`

- `_float_feature` takes float values and returns a `tf.train.Feature` holding a `float_list`

`-` Define the functions that build the semi-supervised dataset

In [19]:
from functools import reduce
from typing import List, Tuple
import pandas as pd
import six

def addFeatures(x, y):
    res = Features()
    res.CopyFrom(x)
    res.MergeFrom(y)
    return res

def neighborFeatures(features: Features, weight: float, prefix: str):  # takes a Features object, an edge weight, and a key prefix
    data = {f"{prefix}_weight": _float_feature(weight)}
    for name, feature in six.iteritems(features.feature):
        data[f"{prefix}_{name}"] = feature 
    return Features(feature=data)

def neighborsFeatures(neighbors: List[Tuple[Features, float]]):
    return reduce(
        addFeatures, 
        [neighborFeatures(sample, weight, f"{GRAPH_PREFIX}_{ith}") for ith, (sample, weight) in enumerate(neighbors)],
        Features()
    )

def getNeighbors(idx, adjMatrix, topn=5):  # use the node index and the adjacency matrix to pick the top-n neighbors
    weights = adjMatrix.loc[idx]
    return weights[weights>0].sort_values(ascending=False).head(topn).to_dict()
    

def semisupervisedDataset(G, labels, ratio=0.2, topn=5):  # split the nodes into labelled and unlabelled sets
    # ratio: fraction of nodes that keep their labels
    # topn: number of neighbors attached to each training sample
    n = int(np.round(len(labels)*ratio)) 
    
    labelled, unlabelled = model_selection.train_test_split(
        labels, train_size=n, test_size=None, stratify=labels
    )
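
For Cora's 2,708 papers, the default `ratio=0.2` works out as shown below. This sketch only reproduces the size computation, not the stratified split itself, and `split_sizes` is a hypothetical name introduced for illustration.

```python
def split_sizes(num_nodes: int, ratio: float = 0.2):
    """Return (labelled, unlabelled) counts for a given label ratio."""
    n_labelled = int(round(num_nodes * ratio))
    # With train_size=n and test_size=None, the rest becomes the other split.
    return n_labelled, num_nodes - n_labelled


print(split_sizes(2708))  # (542, 2166)
```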

### 1. Build a DataFrame of node features and store the graph as an adjacency matrix

In [16]:
adjMatrix = pd.DataFrame.sparse.from_spmatrix(G.to_adjacency_matrix(), index=G.nodes(), columns=G.nodes())
    
features = pd.DataFrame(G.node_features(), index=G.nodes())
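
To see what the adjacency matrix holds, here is a plain-Python sketch that builds one from an edge list. It assumes an undirected graph and uses a dict of dicts instead of a sparse DataFrame. `adjacency_from_edges` and the three toy papers are made up for illustration.

```python
def adjacency_from_edges(nodes, edges):
    """Build a symmetric adjacency table: adjacency[u][v] counts u-v edges."""
    adjacency = {u: {v: 0 for v in nodes} for u in nodes}
    for source, target in edges:
        adjacency[source][target] += 1
        adjacency[target][source] += 1  # undirected: mirror every edge
    return adjacency


# Toy citation graph (hypothetical): p1 - p2 - p3
nodes = ["p1", "p2", "p3"]
edges = [("p1", "p2"), ("p2", "p3")]

adjacency = adjacency_from_edges(nodes, edges)
print(adjacency["p2"])  # {'p1': 1, 'p2': 0, 'p3': 1}
```

Each row plays the role of `adjMatrix.loc[idx]`: one weight per candidate neighbor, with zero meaning "not connected".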

### 2. Implement a helper that uses adjMatrix to find a node's top-n nearest neighbors, returning node IDs with their edge weights

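```python
def getNeighbors(idx, adjMatrix, topn=5):
    # Keep only connected nodes (weight > 0) and take the topn heaviest edges.
    weights = adjMatrix.loc[idx]
    neighbors = weights[weights > 0]\
        .sort_values(ascending=False)\
        .head(topn)
    # Series.iteritems() was removed in pandas 2.0; items() yields the same (id, weight) pairs.
    return [(k, v) for k, v in neighbors.items()]
```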
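
The same selection logic can be checked on a toy graph without pandas. The sketch below is a plain-Python analogue under assumed inputs: `toy_adjacency` is a made-up weighted adjacency stored as a dict of dicts, and `top_neighbors` is a hypothetical stand-in for `getNeighbors`.

```python
def top_neighbors(idx, adjacency, topn=5):
    """Return up to topn (node, weight) pairs with the largest positive weights."""
    weights = adjacency[idx]
    connected = [(node, w) for node, w in weights.items() if w > 0]
    connected.sort(key=lambda pair: pair[1], reverse=True)
    return connected[:topn]


# Hypothetical weights from p1 to every node, including itself.
toy_adjacency = {
    "p1": {"p1": 0.0, "p2": 1.0, "p3": 3.0, "p4": 2.0},
}

print(top_neighbors("p1", toy_adjacency, topn=2))  # [('p3', 3.0), ('p4', 2.0)]
```

Nodes with zero weight are dropped before sorting, so asking for more neighbors than exist simply returns a shorter list.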

### 3. Merge the information into a single dictionary keyed by node ID

In [20]:
dataset = {
        index: Features(feature = {
            #"id": _bytes_feature(str(index)), 
            "id": _int64_feature(index),
            "words": _float_feature(*[float(x) for x in features.loc[index].values]), 
            "label": _int64_feature(label_index[label])
        })
        for index, label in pd.concat([labelled, unlabelled]).items()
    }

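NameError: name 'labelled' is not defined

- Run on its own, this cell fails because `labelled` and `unlabelled` are created only inside `semisupervisedDataset`; the complete function below includes this step.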

In [21]:
from functools import reduce
from typing import List, Tuple
import pandas as pd
import six

def addFeatures(x, y):
    res = Features()
    res.CopyFrom(x)
    res.MergeFrom(y)
    return res

def neighborFeatures(features: Features, weight: float, prefix: str):
    data = {f"{prefix}_weight": _float_feature(weight)}
    for name, feature in six.iteritems(features.feature):
        data[f"{prefix}_{name}"] = feature 
    return Features(feature=data)

def neighborsFeatures(neighbors: List[Tuple[Features, float]]):
    return reduce(
        addFeatures, 
        [neighborFeatures(sample, weight, f"{GRAPH_PREFIX}_{ith}") for ith, (sample, weight) in enumerate(neighbors)],
        Features()
    )

def getNeighbors(idx, adjMatrix, topn=5):
    weights = adjMatrix.loc[idx]
    return weights[weights>0].sort_values(ascending=False).head(topn).to_dict()
    

def semisupervisedDataset(G, labels, ratio=0.2, topn=5):
    n = int(np.round(len(labels)*ratio))
    
    labelled, unlabelled = model_selection.train_test_split(
        labels, train_size=n, test_size=None, stratify=labels
    )
    
    adjMatrix = pd.DataFrame.sparse.from_spmatrix(G.to_adjacency_matrix(), index=G.nodes(), columns=G.nodes())
    
    features = pd.DataFrame(G.node_features(), index=G.nodes())
    
    dataset = {
        index: Features(feature = {
            #"id": _bytes_feature(str(index)), 
            "id": _int64_feature(index),
            "words": _float_feature(*[float(x) for x in features.loc[index].values]), 
            "label": _int64_feature(label_index[label])
        })
        for index, label in pd.concat([labelled, unlabelled]).items()
    }
    
    trainingSet = [
        Example(features=addFeatures(
            dataset[exampleId], 
            neighborsFeatures(
                [(dataset[nodeId], weight) for nodeId, weight in getNeighbors(exampleId, adjMatrix, topn).items()]
            )
        ))
        for exampleId in labelled.index
    ]
    
    testSet = [Example(features=dataset[exampleId]) for exampleId in unlabelled.index]

    serializer = lambda _list: [e.SerializeToString() for e in _list]
    
    return serializer(trainingSet), serializer(testSet)
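
The way `addFeatures` and `neighborsFeatures` combine a node with its neighbors can be mimicked with ordinary dicts. This is only an analogue: the real `Features` objects are protobuf messages, and `merge`, `neighbor_fields`, `neighbors_fields`, and the toy node records below are assumptions for illustration.

```python
from functools import reduce

GRAPH_PREFIX = "NL_nbr"  # same prefix as in the notebook


def merge(x: dict, y: dict) -> dict:
    """Dict analogue of addFeatures: copy x, then merge y's entries into the copy."""
    merged = dict(x)
    merged.update(y)
    return merged


def neighbor_fields(node: dict, weight: float, prefix: str) -> dict:
    """Prefix every field of one neighbor and record its edge weight."""
    fields = {f"{prefix}_weight": weight}
    for name, value in node.items():
        fields[f"{prefix}_{name}"] = value
    return fields


def neighbors_fields(neighbors) -> dict:
    """Fold all neighbors into one dict; each gets a distinct index prefix."""
    return reduce(
        merge,
        [neighbor_fields(node, weight, f"{GRAPH_PREFIX}_{i}")
         for i, (node, weight) in enumerate(neighbors)],
        {},
    )


# Hypothetical node records (the "words" field is omitted for brevity).
nodes = {
    10: {"id": 10, "label": 2},
    11: {"id": 11, "label": 2},
    12: {"id": 12, "label": 6},
}

# Training sample: the node's own fields plus its prefixed neighbor fields.
training_example = merge(nodes[10], neighbors_fields([(nodes[11], 1.0), (nodes[12], 1.0)]))
# Test sample: the node's own fields only.
test_example = dict(nodes[10])

print(list(training_example))
print(list(test_example))
```

Because every neighbor gets its own `NL_nbr_<i>` prefix, the merged keys never collide with the node's own `id` and `label` fields.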