## The Cora Dataset

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
(Note: Load the **cora.graphml** file into yEd live to view the graph.)

![cora dataset](../figures/yEd-live-cora.png)
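To make the 0/1-valued word vectors described above concrete, here is a minimal sketch of how such a vector could be built. The five-word dictionary, the sample text, and the helper `to_binary_vector` are hypothetical and only for illustration; the real Cora dictionary has 1433 words.

```python
# Hypothetical five-word dictionary (the real Cora dictionary has 1433 words)
dictionary = ["graph", "learning", "network", "neural", "protein"]

def to_binary_vector(words, dictionary):
    """Return a 0/1 list: 1 if the dictionary word occurs in `words`, else 0."""
    present = set(words)
    return [1 if word in present else 0 for word in dictionary]

# Hypothetical publication text, split into words
paper_words = "neural network learning on a graph with a neural model".split()
features = to_binary_vector(paper_words, dictionary)
print(features)  # [1, 1, 1, 1, 0]
```

Repeated words do not change the result: the vector records only presence or absence, not counts.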


In [1]:
from torch_geometric.datasets import Planetoid
dataset = Planetoid(root="../data", name="Cora")

# Cora only has one graph
data = dataset[0]
print(data)

Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])


In [2]:
print(f'Dataset: {dataset}')
print(f'URL: {dataset.url}')
print('---------------')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of nodes: {data.x.shape[0]}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')
print(f'\nGraph:')
print('------')
print(f'Edges are directed: {data.is_directed()}')
print(f'Graph has isolated nodes: {data.has_isolated_nodes()}')
print(f'Graph has loops: {data.has_self_loops()}')

Dataset: Cora()
URL: https://github.com/kimiyoung/planetoid/raw/master/data
---------------
Number of graphs: 1
Number of nodes: 2708
Number of features: 1433
Number of classes: 7

Graph:
------
Edges are directed: False
Graph has isolated nodes: False
Graph has loops: False
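The output above reports an undirected graph, yet `edge_index` has shape `[2, 10556]`. PyTorch Geometric stores every undirected edge as two directed entries, so the 10556 columns correspond to 5278 node pairs. That is fewer than the 5429 links quoted in the description, presumably because of preprocessing in this version of the dataset (an assumption, not something the output shows). The sketch below mimics this storage scheme with plain Python lists; the toy edges and the helper `is_symmetric` are hypothetical.

```python
# Toy undirected graph with three edges (hypothetical data)
undirected_edges = [(0, 1), (0, 2), (1, 2)]

# Store each edge in both directions as two parallel rows (sources, targets)
sources = [u for u, v in undirected_edges] + [v for u, v in undirected_edges]
targets = [v for u, v in undirected_edges] + [u for u, v in undirected_edges]
edge_index = [sources, targets]

def is_symmetric(edge_index):
    """Return True if every (u, v) entry has a matching (v, u) entry."""
    pairs = set(zip(*edge_index))
    return all((v, u) in pairs for u, v in pairs)

print(len(edge_index[0]))        # 6 directed entries for 3 undirected edges
print(is_symmetric(edge_index))  # True
```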


## The Facebook Page-Page Dataset

This dataset was created using the Facebook Graph API in November 2017. In this dataset, each of the 22,470 nodes represents an official Facebook page. Pages are connected when there are mutual likes between them. Node features (128-dim vectors) are created from textual descriptions written by the owners of these pages. Our goal is to classify each node into one of four categories: *politicians*, *companies*, *television shows*, and *governmental organizations*.

In [18]:
from torch_geometric.datasets import FacebookPagePage
dataset = FacebookPagePage(root="../data/Facebook-Page-Page")

data = dataset[0]
print(data)

Data(x=[22470, 128], edge_index=[2, 342004], y=[22470])


In [7]:
print(f'Dataset: {dataset}')
print(f'URL: {dataset.url}')
print('---------------')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of nodes: {data.x.shape[0]}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')
print(f'\nGraph:')
print('------')
print(f'Edges are directed: {data.is_directed()}')
print(f'Graph has isolated nodes: {data.has_isolated_nodes()}')
print(f'Graph has loops: {data.has_self_loops()}')

Dataset: FacebookPagePage()
URL: https://graphmining.ai/datasets/ptg/facebook.npz
---------------
Number of graphs: 1
Number of nodes: 22470
Number of features: 128
Number of classes: 4

Graph:
------
Edges are directed: False
Graph has isolated nodes: False
Graph has loops: True


Unlike Cora, Facebook Page-Page doesn't have training, validation, and test masks by default. We can create them arbitrarily with the `range()` function:

In [8]:
# range() excludes its stop value, so each split starts where the previous one stops
data.train_mask = range(18000)
data.val_mask = range(18000, 20000)
data.test_mask = range(20000, 22470)
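Because `range()` excludes its stop value, it is easy to leave a node out of every split. The following is a minimal sketch of a sanity check; the helper `unassigned_nodes` is hypothetical and not part of PyTorch Geometric.

```python
def unassigned_nodes(num_nodes, *splits):
    """Return the node indices that appear in none of the given index ranges."""
    assigned = set()
    for split in splits:
        assigned.update(split)
    return [i for i in range(num_nodes) if i not in assigned]

num_nodes = 22470  # number of nodes in Facebook Page-Page

# Each range starts where the previous one stops: every node is covered
contiguous = unassigned_nodes(
    num_nodes, range(18000), range(18000, 20000), range(20000, num_nodes)
)
print(contiguous)  # []

# Starting each range one past the previous stop leaves two nodes unused
gapped = unassigned_nodes(
    num_nodes, range(18000), range(18001, 20000), range(20001, num_nodes)
)
print(gapped)  # [18000, 20000]
```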

The graph is too large to visualize with **networkx**, and I couldn't find a *.graphml* file of the Facebook Page-Page dataset to load into **yEd live**. Therefore, I will create a .graphml file using networkx and view the graph in yEd live.

NetworkX's **write_graphml()** function expects node and edge attributes to be plain Python data types, but the PyTorch Geometric dataset contains tensors, which are not supported by the GraphML format. To resolve this issue, we need to convert the tensors to Python data types before creating the NetworkX graph object.

In [26]:
import networkx as nx
from torch_geometric.utils.convert import to_networkx

data = dataset[0]
# Without node_attrs/edge_attrs, only the graph structure is converted (no tensors are attached)
G = to_networkx(data)

nx.write_graphml(G, '../data/facebook_page_page.graphml')
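The conversion above exports only the graph structure. If node labels should also end up in the GraphML file (for example, to color nodes by category later), they have to be attached as plain Python values. Below is a minimal sketch on a toy graph; the `labels` and `edges` lists are hypothetical stand-ins for values that could be obtained with something like `data.y.tolist()` and `data.edge_index.t().tolist()`, and the file is written to an in-memory buffer rather than to disk.

```python
import io
import networkx as nx

# Toy stand-ins (hypothetical) for label and edge lists taken from the dataset
labels = [0, 3, 1]
edges = [(0, 1), (1, 2)]

G_small = nx.Graph()
for node, label in enumerate(labels):
    G_small.add_node(node, label=int(label))  # plain int, not a tensor
G_small.add_edges_from(edges)

# Write GraphML to an in-memory buffer, then read it back to verify the attribute
buffer = io.BytesIO()
nx.write_graphml(G_small, buffer)
buffer.seek(0)
restored = nx.read_graphml(buffer)  # node ids are read back as strings
print(restored.nodes["1"]["label"])
```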

It is not straightforward to plot this graph in **yEd live**. It can be done using **Gephi**, but the nodes with few connections should be filtered out to improve performance. The remaining nodes can be plotted so that the size of each node depends on its number of connections and its color indicates the category it belongs to. Two layouts can be applied: *Fruchterman-Reingold* and *ForceAtlas2*.
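The degree-based filtering mentioned above can also be done before loading the file into Gephi. The sketch below works on a plain edge list; the helper `filter_by_degree`, the toy edges, and the `min_degree` threshold are hypothetical, and a suitable threshold for the real graph has to be chosen by experiment.

```python
from collections import Counter

def filter_by_degree(edges, min_degree):
    """Keep only edges whose endpoints both have at least `min_degree` connections."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    keep = {node for node, d in degree.items() if d >= min_degree}
    kept_edges = [(u, v) for u, v in edges if u in keep and v in keep]
    return kept_edges, degree

# Toy edge list (hypothetical): node 4 has a single connection
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]
filtered, degree = filter_by_degree(edges, min_degree=2)
print(filtered)   # [(0, 1), (0, 2), (0, 3), (1, 2)]
print(degree[0])  # 3
```

Degrees are computed once on the original graph, so a node that passes the threshold may end up with fewer connections after filtering (node 3 in the toy example). The `degree` counter can also be exported as a node attribute to drive node size in Gephi.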