*Note: With the current version of Torch, only CPU is available for torch_geometric.*

This directory contains a selection of the Cora dataset (www.research.whizbang.com/data).

The Cora dataset consists of Machine Learning papers. These papers are classified into one of the following seven classes:
		Case_Based
		Genetic_Algorithms
		Neural_Networks
		Probabilistic_Methods
		Reinforcement_Learning
		Rule_Learning
		Theory

The papers were selected such that, in the final corpus, every paper cites or is cited by at least one other paper. There are 2708 papers in the whole corpus.

After stemming, removing stopwords, and removing all words with a document frequency below 10, we were left with a vocabulary of 1433 unique words.


THE DIRECTORY CONTAINS TWO FILES:

The .content file contains descriptions of the papers in the following format:

		<paper_id> <word_attributes>+ <class_label>

The first entry in each line contains the unique string ID of the paper followed by binary values indicating whether each word in the vocabulary is present (indicated by 1) or absent (indicated by 0) in the paper. Finally, the last entry in the line contains the class label of the paper.
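
As a minimal sketch of the format just described, the snippet below splits one line into its three parts. It assumes tab-separated fields (the loader later in this notebook splits on tabs); the function name `parse_content_line` and the sample line with a five-word vocabulary are hypothetical and only serve as illustration.

```python
def parse_content_line(line):
    """Split one tab-separated .content line into (paper_id, features, label)."""
    fields = line.rstrip("\n").split("\t")
    paper_id, label = fields[0], fields[-1]
    features = [int(v) for v in fields[1:-1]]  # binary word indicators
    return paper_id, features, label

# Hypothetical line with a 5-word vocabulary (the real file has 1433 columns).
sample = "31336\t0\t1\t0\t0\t1\tNeural_Networks\n"
paper_id, features, label = parse_content_line(sample)
print(paper_id, features, label)
```

In the real file each line would yield a feature list of length 1433.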

The .cites file contains the citation graph of the corpus. Each line describes a link in the following format:

		<ID of cited paper> <ID of citing paper>

Each line contains two paper IDs. The first entry is the ID of the paper being cited and the second ID stands for the paper which contains the citation. The direction of the link is from right to left. If a line is represented by "paper1 paper2" then the link is "paper2->paper1". 
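
The right-to-left link direction is easy to get backwards, so here is a small sketch that turns .cites lines into explicit (citing, cited) pairs. The function name `parse_cites` and the paper IDs are hypothetical; whitespace-separated fields are assumed.

```python
from collections import defaultdict

def parse_cites(lines):
    """Return directed (citing, cited) pairs from .cites lines."""
    edges = []
    for line in lines:
        cited, citing = line.split()   # file order: cited first, citing second
        edges.append((citing, cited))  # link direction: citing -> cited
    return edges

# Hypothetical IDs for illustration.
lines = ["paper1\tpaper2\n", "paper1\tpaper3\n", "paper2\tpaper3\n"]
edges = parse_cites(lines)

# Group the citing papers under each cited paper.
cited_by = defaultdict(list)
for citing, cited in edges:
    cited_by[cited].append(citing)

print(edges)
print(dict(cited_by))
```

Here the line "paper1 paper2" becomes the pair ("paper2", "paper1"), matching the link "paper2->paper1".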

In [None]:
%%shell

wget "https://web.archive.org/web/20150918182409/http://www.cs.umd.edu/~sen/lbc-proj/data/cora.tgz"
tar -xzvf cora.tgz

pip install torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cpu.html
pip install torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cpu.html
pip install torch-geometric

pip install networkx

pip install icecream
pip install tqdm

--2021-11-19 15:07:29--  https://web.archive.org/web/20150918182409/http://www.cs.umd.edu/~sen/lbc-proj/data/cora.tgz
Resolving web.archive.org (web.archive.org)... 207.241.237.3
Connecting to web.archive.org (web.archive.org)|207.241.237.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-gzip]
Saving to: ‘cora.tgz’

cora.tgz                [ <=>                ] 163.15K  --.-KB/s    in 0.1s    

2021-11-19 15:07:30 (1.21 MB/s) - ‘cora.tgz’ saved [167063]

cora/
cora/README
cora/cora.content
cora/cora.cites
Looking in links: https://data.pyg.org/whl/torch-1.10.0+cpu.html
Collecting torch-scatter
  Downloading https://data.pyg.org/whl/torch-1.10.0%2Bcpu/torch_scatter-2.0.9-cp37-cp37m-linux_x86_64.whl (291 kB)
Installing collected packages: torch-scatter
Successfully installed torch-scatter-2.0.9
Looking in links: https://data.pyg.org/whl/torch-1.10.0+cpu.html
Collecting 



In [None]:
import numpy as np
import pandas as pd
import random
from icecream import ic

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch_geometric
import networkx as nx
import scipy
from tqdm.notebook import tqdm   
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

print("Torch version:", torch.__version__)
print("CUDA Present:", torch.cuda.is_available())
print("CUDA Version:", torch.version.cuda)

Torch version: 1.10.0+cu111
CUDA Present: False
CUDA Version: 11.1


In [None]:
CONFIG = {
    'PATH': './cora',
    'LIMIT': 20,
    'HIDDEN_CHANNELS': 1024,
    'NUM_LAYERS': 2,
    'DROPOUT_RATE': 0,
    'EPOCHS': 200
}

print("Here's the configuration: ")
for k, v in CONFIG.items():
    print(f"{k} = {v}")

Here's the configuration: 
PATH = ./cora
LIMIT = 20
HIDDEN_CHANNELS = 1024
NUM_LAYERS = 2
DROPOUT_RATE = 0
EPOCHS = 200


In [None]:
class Data:
    def __init__(self, path):
        self.path = path
    
    def readFile(self, path):
        lines = []
        with open(path) as file:
            lines = file.readlines()
        return lines

    def readContent(self, data):
        nodes, labels, x = [], [], []
        for d in data:
            words = d.split("\t")
            nodes.append(words[0].strip())
            labels.append(words[-1].strip())
            x.append([int(w) for w in words[1:-1]])  # binary word indicators as ints
            # x.append(words[1:-1])

        # ic(x[0])
        LE = LabelEncoder()
        labels = LE.fit_transform(labels)
        ic(labels)
        x_req = torch.Tensor(x)
        # ic(x.shape)
        x = pd.DataFrame.from_records(x)
        
        return nodes, labels, LE, x_req, x

    def getLabels(self, LE, data):
        return LE.inverse_transform(data)

    def readCites(self, data):
        edges = []
        for d in data:
            words = d.split("\t")
            edges.append([
                words[0].strip(),
                words[1].strip()
            ])
        return edges

    def splitDataCount(self, data, labels):
        lcounter = dict((l, 0) for l in labels)
        indices = []
        for i in range(len(labels)):
            label = labels[i]
            if lcounter[label] < CONFIG['LIMIT']:
                indices.append(i)
                lcounter[label] += 1
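        chosen = set(indices)  # set lookup avoids scanning the list for every node
        rest = [x for x in range(len(labels)) if x not in chosen]
        # rest = random.sample(rest, 1000)  # note: the recorded run reports 1000 test nodes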
        indices = torch.LongTensor(indices)
        rest = torch.LongTensor(rest)
        return indices, rest

    def normalizeMatrix(self, A):
        # Row normalization: D^-1 A, where D holds the row sums of A
        return scipy.sparse.diags(np.array(A.sum(1)).flatten() ** -1).dot(A)

    def toTensor(self, A):
        A = A.tocoo()
        i = torch.tensor(np.vstack((A.row, A.col)), dtype=torch.long)
        v = torch.tensor(A.data, dtype=torch.float)
        return torch.sparse_coo_tensor(i, v, torch.Size(A.shape))

    def buildGraph(self):
        nodes, edges = self.getGraph()
        G = nx.Graph()
        G.add_nodes_from(nodes)
        G.add_edges_from(edges)
        A = nx.adjacency_matrix(G)
        I = scipy.sparse.identity(A.shape[0])
        A = A + I
        A = self.normalizeMatrix(A)
        A = self.toTensor(A)
        ic(A.shape)
        ic(nx.info(G))
        return A, G

    def getIndices(self):
        return self.train, self.test

    def getGraph(self):
        return self.nodes, self.edges

    def getMatrix(self):
        return self.A

    def getXY(self):
        return self.x, torch.LongTensor(self.labels)

    def printData(self):
        print(f"Number of nodes: {len(self.nodes)}")
        print(f"Number of features per node: {len(self.x[0])}")
        print(f"Categories: {set(self.labels)}")

    def handle(self):
        data = self.readFile(self.path + '/cora.content')
        e_data = self.readFile(self.path + '/cora.cites')
        self.nodes, self.labels, self.LE, self.x, self.split = self.readContent(data)
        self.train, self.test = self.splitDataCount(self.split, self.labels)
        self.edges = self.readCites(e_data)
        self.A, self.G = self.buildGraph()
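
The `buildGraph` step above adds self-loops and then divides every row by its sum, which amounts to D^-1 (A + I). The following dependency-free sketch reproduces that arithmetic on a hypothetical three-node path graph; the helper name `row_normalize_with_self_loops` is an assumption for illustration and is not part of the notebook.

```python
def row_normalize_with_self_loops(adj):
    """Return D^-1 (A + I) for a dense adjacency matrix given as nested lists."""
    n = len(adj)
    # Add the identity so every node also keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Divide each row by its sum so the row sums to 1.
    return [[v / sum(row) for v in row] for row in a_hat]

# Hypothetical undirected path graph 0 - 1 - 2.
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
norm = row_normalize_with_self_loops(adj)
for row in norm:
    print(row)
```

Each row sums to 1, but the result is generally not symmetric (entry (0, 1) is 1/2 while entry (1, 0) is 1/3), which distinguishes it from the symmetric D^-1/2 (A + I) D^-1/2 normalization.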

In [None]:
class MyGCNLayer(nn.Module):
    def __init__(self, in_channels, out_channels):
        """
        in_channels: #features in the input
        out_channels: #features in the output

        these layers have their *own* independent weights and biases
        """
        super().__init__()
        
        self.W = nn.Parameter(torch.empty(in_channels, out_channels))
        nn.init.xavier_uniform_(self.W)
        self.b = nn.Parameter(torch.zeros(out_channels))

    def forward(self, X, A):
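        """
        computes A (X W) + b on the normalized adjacency matrix A;
        Data.normalizeMatrix builds A by *row* normalization (D^-1 A),
        not by symmetric normalization
        """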
        a = torch.mm(X, self.W)
        b = torch.spmm(A, a)
        return b + self.b
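
To make the order of operations in `MyGCNLayer.forward` concrete, here is a dependency-free sketch of A (X W) + b on tiny dense matrices. The helper names `matmul` and `gcn_layer` and all numeric values are hypothetical; the notebook itself uses `torch.mm` and `torch.spmm`.

```python
def matmul(p, q):
    """Multiply two dense matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*q)] for row in p]

def gcn_layer(x, a, w, b):
    """Compute A (X W) + b, the same order of operations as MyGCNLayer.forward."""
    xw = matmul(x, w)    # project features: (n, in) @ (in, out)
    agg = matmul(a, xw)  # mix each node with its neighbours using A
    return [[v + bias for v, bias in zip(row, b)] for row in agg]

# Hypothetical values: 2 nodes, 2 input features, 1 output feature.
x = [[1.0, 0.0], [0.0, 2.0]]
a = [[0.5, 0.5], [0.5, 0.5]]  # row-normalized adjacency with self-loops
w = [[1.0], [1.0]]
b = [0.5]
print(gcn_layer(x, a, w, b))
```

Multiplying by W first keeps the intermediate matrix at `out_channels` columns before the sparse product with A, which is presumably why the layer is written in this order.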

In [None]:
class MyGCN(nn.Module):
    def __init__(
            self, 
            in_channels, 
            hidden_channels, 
            num_layers, 
            out_channels, 
            dropout_rate
        ):
        super().__init__()
        self.in_channels = in_channels
        self.hidden_channels = hidden_channels
        self.num_layers = num_layers
        self.out_channels = out_channels
        self.dropout_rate = dropout_rate
        
        # nn.ModuleList registers the layers so model.parameters() includes their weights
        self.MyGCNLayers = nn.ModuleList()
        self.MyGCNLayers.append(
            MyGCNLayer(self.in_channels, self.hidden_channels)
        )
        self.outputLayers = MyGCNLayer(self.hidden_channels, self.out_channels)

        for _ in range(1, self.num_layers):
            self.MyGCNLayers.append(
                MyGCNLayer(self.hidden_channels, self.hidden_channels)
            )

    def forward(self, X, A):
        """
        math done on the normalized A (row-normalized by Data.normalizeMatrix)
        """
        for layer in range(self.num_layers):
            # forwarded to the *appropriate* MyGCNLayer
            X = self.MyGCNLayers[layer](X, A)
            X = F.relu(X)
            X = F.dropout(X, p=self.dropout_rate, training=self.training)
        X = self.outputLayers(X, A)
        # dim=1 normalizes over the class dimension of each node
        return F.log_softmax(X, dim=1)

In [None]:
dataset = Data(CONFIG['PATH'])
dataset.handle()
X, y = dataset.getXY()
train, test = dataset.getIndices()
ic(X.shape, y.shape)
ic(train.shape, test.shape)

A = dataset.getMatrix()

ic| labels: array([2, 5, 4, ..., 1, 0, 2])
ic| A.shape: torch.Size([2708, 2708])
ic| nx.info(G): 'Graph with 2708 nodes and 5278 edges'
ic| X.shape: torch.Size([2708, 1433]), y.shape: torch.Size([2708])
ic| train.shape: torch.Size([140]), test.shape: torch.Size([1000])
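
The 140 training indices reported above are consistent with taking the first `LIMIT` = 20 nodes of each of the seven classes. The sketch below mirrors that selection rule in plain Python; the function name `split_first_k_per_class` and the label list are hypothetical.

```python
def split_first_k_per_class(labels, limit):
    """Take the first `limit` indices of each class for training; hold out the rest."""
    seen = {}
    train, rest = [], []
    for i, label in enumerate(labels):
        if seen.get(label, 0) < limit:
            train.append(i)
            seen[label] = seen.get(label, 0) + 1
        else:
            rest.append(i)
    return train, rest

# Hypothetical labels: 3 classes, at most 2 training nodes per class.
labels = [0, 1, 0, 2, 0, 1, 1, 2, 2, 2]
train_idx, rest_idx = split_first_k_per_class(labels, limit=2)
print(train_idx, rest_idx)
print(7 * 20)  # seven classes x LIMIT = 20 training nodes
```

Because the selection depends on file order rather than on random sampling, the split is deterministic for a given copy of the dataset.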


In [None]:
model = MyGCN(
    in_channels=X.shape[1],
    hidden_channels=CONFIG['HIDDEN_CHANNELS'],
    num_layers=CONFIG['NUM_LAYERS'],
    out_channels=7,
    dropout_rate=CONFIG['DROPOUT_RATE']
)

loss = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
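
Since the model ends in `log_softmax` and the criterion is `NLLLoss`, the pair together computes the usual cross-entropy. The dependency-free sketch below spells out that arithmetic; the helper names and the scores are hypothetical, and the mean over samples is assumed to match the default reduction of `NLLLoss`.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax of one row of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def nll_loss(log_probs, targets):
    """Mean negative log-probability assigned to the target class."""
    return -sum(row[t] for row, t in zip(log_probs, targets)) / len(targets)

# Hypothetical scores for 2 nodes and 3 classes.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
loss_value = nll_loss([log_softmax(r) for r in logits], targets)
print(loss_value)
print(math.log(7))  # loss of a uniform guess over seven classes
```

A uniform guess over seven classes gives a loss of ln 7, roughly 1.946, which is a useful sanity check for the loss at the start of training.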

In [None]:
losses = []
for _ in tqdm(range(CONFIG['EPOCHS'])):
    optimizer.zero_grad()
    output = model(X, A)  # call the module rather than .forward() directly
    train_x = torch.index_select(output, 0, train)
    train_y = torch.index_select(y, 0, train)
    l = loss(train_x, train_y)
    l.backward()
    losses.append(l.item())
    optimizer.step()




In [None]:
losses

[1.9416035413742065,
 1.8918306827545166,
 1.8432536125183105,
 1.7955442667007446,
 1.7486099004745483,
 1.7024567127227783,
 1.657133936882019,
 1.6127026081085205,
 1.5692163705825806,
 1.5267143249511719,
 1.4852229356765747,
 1.4447592496871948,
 1.4053369760513306,
 1.3669685125350952,
 1.3296667337417603,
 1.2934414148330688,
 1.2582987546920776,
 1.224238395690918,
 1.1912541389465332,
 1.1593350172042847,
 1.128466248512268,
 1.0986305475234985,
 1.0698095560073853,
 1.0419834852218628,
 1.0151307582855225,
 0.989228367805481,
 0.9642512798309326,
 0.9401729106903076,
 0.9169657826423645,
 0.894602358341217,
 0.8730552792549133,
 0.8522971868515015,
 0.8323013186454773,
 0.8130406737327576,
 0.7944879531860352,
 0.7766167521476746,
 0.7594007253646851,
 0.7428139448165894,
 0.7268312573432922,
 0.7114283442497253,
 0.6965811848640442,
 0.6822671294212341,
 0.668463408946991,
 0.6551485061645508,
 0.6423014402389526,
 0.6299020648002625,
 0.6179308295249939,
 0.606369137763977,

In [None]:
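model.eval()  # disable dropout for evaluation
with torch.no_grad():  # gradients are not needed at test time
    output = model(X, A)
test_x = torch.index_select(output, 0, test)
test_y = torch.index_select(y, 0, test)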



In [None]:
predictions = torch.argmax(test_x, dim=1)

In [None]:
predictions.shape

torch.Size([1000])

In [None]:
predictions, test_y = predictions.numpy(), test_y.numpy()

In [None]:
acc = 0
for i in range(len(test_y)):
    if predictions[i] == test_y[i]:
        acc += 1
ic(acc/len(test_y)*100)

ic| acc/len(test_y)*100: 76.5


76.5
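
The accuracy loop above counts matching positions and converts the count to a percentage. A compact sketch of the same computation follows; the function name `accuracy_percent` and the example values are hypothetical.

```python
def accuracy_percent(predictions, targets):
    """Percentage of positions where the predicted class equals the target."""
    correct = sum(int(p == t) for p, t in zip(predictions, targets))
    return 100.0 * correct / len(targets)

# Hypothetical predictions and labels: 3 of 4 match.
print(accuracy_percent([2, 5, 4, 1], [2, 5, 0, 1]))  # 75.0
```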