# Cora citation network

In [1]:
import networkx as nx
import numpy as np
import pickle as p
from matplotlib import pyplot as plt
%matplotlib inline

data_loc = './../data/raw/cora/'  # 'cora.cites', 'cora.content'

Cora is a directed citation network of 2708 papers in which every link is a citation. Edges point from the citing paper to the cited paper; the raw edge list in cora.cites stores each pair in the opposite order (cited paper first). Each paper has exactly one of seven labels:

- Case Based
- Genetic Algorithms
- Neural Networks
- Probabilistic Methods
- Reinforcement Learning
- Rule Learning
- Theory

Each paper also has a binary feature vector of 1433 elements (word-existence indicators) describing its content. In cora.content, each row starts with the paper id, continues with the feature values, and ends with the paper's string label (e.g. Case_Based or Neural_Networks).
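
As a rough illustration of that row layout, the sketch below parses a single record. The sample line is made up and shortened to five feature columns (real rows carry 1433), and the helper name parse_content_line is hypothetical, not part of the dataset or of any library.

```python
# Hypothetical, shortened record: paper id, five feature values, string label.
# Real rows in cora.content carry 1433 feature columns.
sample_line = "1001\t0\t1\t0\t0\t1\tNeural_Networks\n"


def parse_content_line(line):
    """Split one whitespace-separated record into (paper_id, features, label)."""
    paper_id, *features, label = line.strip().split()
    return int(paper_id), [int(v) for v in features], label


paper_id, features, label = parse_content_line(sample_line)
print(paper_id, features, label)
```

The starred assignment takes the first field as the id and the last as the label, so the same code works for any number of feature columns.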

## Load data from edge list

In [2]:
graph_file = open(data_loc+'cora.cites', 'r')

Print the first 5 lines of the graph file:

In [3]:
for _ in range(5): print(repr(graph_file.readline()))

'35\t1033\n'
'35\t103482\n'
'35\t103515\n'
'35\t1050679\n'
'35\t1103960\n'


In [4]:
graph_file.seek(0)  # rewind after the five-line preview
cora_edgelist = []
for line in graph_file:
    i, j = line.split()
    cora_edgelist.append((int(j), int(i)))  # swap so each edge runs citing -> cited
graph_file.close()
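
To make the direction swap concrete, this minimal sketch replays it on the first three lines printed above, without touching the file. It assumes, as stated in the introduction, that each raw line lists the cited paper first; the names raw_lines and edges are illustrative only.

```python
# First three raw lines of cora.cites, copied from the preview output.
raw_lines = ['35\t1033\n', '35\t103482\n', '35\t103515\n']

edges = []
for line in raw_lines:
    cited, citing = line.split()
    # Store the pair as (citing, cited) so the edge points at the cited paper.
    edges.append((int(citing), int(cited)))

print(edges)
```

Under that assumption, paper 35 ends up as the target of all three edges, i.e. it is cited by papers 1033, 103482, and 103515.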

In [5]:
print("Number of edges:", len(cora_edgelist))

Number of edges: 5429


In [6]:
cora = nx.DiGraph(cora_edgelist)

In [7]:
print("Number of nodes:", len(cora))

Number of nodes: 2708


I would like standard network data, with node ids ranging from 0 to 2707.

In [8]:
# Get a conversion dictionary
lookup = {}
for new_ids, ids in enumerate(cora.nodes()):
    lookup[ids] = new_ids
# Create new graph with new node ids
new_cora = nx.DiGraph()
for i, j in cora.edges():
    new_cora.add_edge(lookup[i], lookup[j])
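
A possible alternative, sketched here on a toy graph rather than on Cora, is to let NetworkX apply the mapping with nx.relabel_nodes. Unlike rebuilding the graph from its edges, this also keeps any node that has no edges. The toy node ids and the extra isolated node 999 are made up for illustration.

```python
import networkx as nx

# Toy directed graph with made-up ids, plus one isolated node.
toy = nx.DiGraph([(1033, 35), (103482, 35), (1033, 103482)])
toy.add_node(999)

# Same idea as the lookup above: old id -> consecutive integer id.
toy_lookup = {old: new for new, old in enumerate(toy.nodes())}

# relabel_nodes returns a relabelled copy by default and keeps isolated nodes.
relabeled = nx.relabel_nodes(toy, toy_lookup)

print(sorted(relabeled.nodes()))
print(sorted(relabeled.edges()))
```

The mapping is still returned as a plain dict, so it can be reused later to align labels and features with the new node ids.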

Dump data as pickle:

In [9]:
with open('./../data/cora.graph', 'wb') as f:  # file is closed once the dump finishes
    p.dump(new_cora, f)
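
Reading the graph back follows the same pattern in reverse. The sketch below is a round trip on a tiny stand-in graph written to a temporary directory; the file name toy.graph is hypothetical and chosen so that the example does not touch the real data folder.

```python
import os
import pickle
import tempfile

import networkx as nx

# Tiny stand-in for new_cora.
graph = nx.DiGraph([(0, 1), (2, 1)])

# Hypothetical location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'toy.graph')

with open(path, 'wb') as f:
    pickle.dump(graph, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored.is_directed(), sorted(restored.edges()))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.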

## Load labels and features

In [11]:
content = data_loc + 'cora.content'  # keep the path only; the file is opened below
labels = {'Case_Based': 0, 'Genetic_Algorithms': 1, 'Neural_Networks': 2,
          'Probabilistic_Methods': 3, 'Reinforcement_Learning': 4,
          'Rule_Learning': 5, 'Theory': 6}
# np.zeros gives defined initial values; np.ndarray would leave the memory uninitialised
cora_labels = np.zeros(len(new_cora), dtype=int)
cora_features = np.zeros((len(new_cora), 1433), dtype=int)
with open(content, 'r') as f:
    for line in f:
        idx, *data, label = line.strip().split()
        row = lookup[int(idx)]  # new node id in the range 0..2707
        cora_labels[row] = labels[label]
        cora_features[row] = [int(val) for val in data]
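
Once the label and feature arrays are filled, a quick sanity check can catch parsing mistakes. The sketch below runs on small made-up arrays (four papers, three features) that stand in for cora_labels and cora_features; the values are illustrative, not taken from the dataset.

```python
import numpy as np

# Made-up stand-ins for cora_labels and cora_features.
toy_labels = np.array([2, 2, 0, 6])
toy_features = np.array([[0, 1, 0],
                         [1, 1, 0],
                         [0, 0, 0],
                         [1, 0, 1]])

# Number of papers per class; minlength=7 keeps empty classes in the result.
class_counts = np.bincount(toy_labels, minlength=7)

# Word indicators should contain nothing but 0 and 1.
is_binary = bool(np.isin(toy_features, [0, 1]).all())

print(class_counts, is_binary)
```

On the real arrays, the class counts should sum to the number of nodes and the binary check should hold for every entry.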