k-Core Index

For more information about the project, please refer to the [project specification](https://cgi.cse.unsw.edu.au/~cs9312/24T2/project/). You can edit this file and add anything you like. We will only use the code cell of the `KCore` class for testing. You can add descriptions and some theoretical analysis (e.g. index space, query time complexity, and index time complexity) to this file without creating a separate PDF document.

**Note**: Make sure to **sequentially run all cells in each section** in order to carry over intermediate variables/packages to the next cell.

## 1. Code Template
You need to implement the `KCore` class to support k-core queries for large graphs (e.g. hundreds of millions of vertices and edges). A code template is given below.

The `KCore` class is initialized with a graph `G`, whose data structure is presented in the next section. The constructor calls the function `preprocess` to precompute an index structure for `G`. The `query` function takes one input, `k`, and outputs the `k`-cores of `G` as a list of vertex groups.
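
To make the expected output concrete, here is a minimal, self-contained sketch of what a `k`-core query returns on a toy graph. It uses naive peeling with no index, so it only illustrates the output format and is not the approach of the template. The names `naive_k_cores` and `toy_adj` are hypothetical.

```python
from collections import deque

def naive_k_cores(adj_list, k):
    """Return the connected components of the k-core, found by naive peeling."""
    n = len(adj_list)
    degree = [len(neighbors) for neighbors in adj_list]
    alive = [degree[v] >= k for v in range(n)]
    # Repeatedly delete vertices whose remaining degree drops below k.
    stack = [v for v in range(n) if not alive[v]]
    while stack:
        v = stack.pop()
        for u in adj_list[v]:
            if alive[u]:
                degree[u] -= 1
                if degree[u] < k:
                    alive[u] = False
                    stack.append(u)
    # Group the surviving vertices into connected components with BFS.
    cores = []
    seen = [False] * n
    for s in range(n):
        if not alive[s] or seen[s]:
            continue
        seen[s] = True
        component = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            component.append(v)
            for u in adj_list[v]:
                if alive[u] and not seen[u]:
                    seen[u] = True
                    queue.append(u)
        cores.append(sorted(component))
    return cores

# Toy graph: triangle 0-1-2 with a pendant vertex 3, plus a separate triangle 4-5-6.
toy_adj = [[1, 2], [0, 2], [0, 1, 3], [2], [5, 6], [4, 6], [4, 5]]
print(naive_k_cores(toy_adj, 2))  # [[0, 1, 2], [4, 5, 6]]
print(naive_k_cores(toy_adj, 3))  # []
```

Each inner list is one connected `k`-core. The order of the groups and of the vertices inside a group is an artifact of this sketch.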

In [1]:
################################################################################
# You can import any Python Standard Library modules.
from collections import deque
################################################################################

class KCore(object):
    def __init__(self, G):

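        self.vertex_num = G.vertex_num
        self.adj_list = G.adj_list
        # d[u]: degree of u; preprocess() decrements it in place and query() compares it against k.
        self.d = [len(G.adj_list[u]) for u in range(self.vertex_num)]

        # D: vertex IDs sorted by ascending degree.
        self.D = sorted(range(len(self.d)), key=lambda k: self.d[k])

        # p[v]: position of vertex v in D.
        self.p = [0]*G.vertex_num
        for i in range(G.vertex_num):
            self.p[self.D[i]] = i

        # b[deg]: index in D of the first vertex with degree deg (only degrees >= 1 that occur).
        self.b = {}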
        bucket_val = 1
        index = 0
        while index < G.vertex_num:
            if bucket_val == self.d[self.D[index]]:
                self.b[bucket_val] = index
                index = index + 1
                bucket_val = bucket_val + 1
            elif bucket_val < self.d[self.D[index]]:
                while bucket_val < self.d[self.D[index]]:
                    bucket_val = bucket_val + 1
            elif bucket_val > self.d[self.D[index]]:
                index = index + 1

        self.preprocess(G)

    def preprocess(self, G):
        for i in range(self.vertex_num):
            v=self.D[i]
            for u in self.adj_list[v]:
                if self.d[u]>self.d[v]:
                    du = self.d[u]
                    pu = self.p[u]
                    if du not in self.b:
                        flag = 1
                        for k in self.b:
                            if du < k:
                                self.b[du] = self.b[k]
                                flag = 0
                                break
                        if flag:
                            self.b[du] = self.vertex_num-1
                    pw = self.b[du]
                    w = self.D[pw]

                    if u!=w:
                        self.p[u] = pw
                        self.p[w] = pu
                        self.D[pu] = w
                        self.D[pw] = u
                    self.b[du] += 1
                    if self.b[du] >= self.vertex_num:
                        self.b.pop(du)
                    self.d[u] -= 1
       
        return self.d

    def query(self, k):
        cores = []
        unvisited = [0]*self.vertex_num
        loc_unvisited = [0]*self.vertex_num
        unvisited_len = self.vertex_num - 1
        for i in range(self.vertex_num):
            unvisited[i] = i
            loc_unvisited[i] = i
        
        while unvisited_len >= 0:  # >= 0 so the vertex left at position 0 is examined too
            v = unvisited[unvisited_len]

            if self.d[v] < k:
                unvisited_len -= 1
                continue
            
            group = []
            q = deque()
            q.append(v)
            while len(q) > 0:
                u = q.popleft()
                if (loc_unvisited[u] > unvisited_len) or (self.d[u] < k):
                    continue

                group.append(u)
                unvisited,loc_unvisited = self.swap(unvisited,loc_unvisited,loc_unvisited[u],unvisited_len)

                unvisited_len -= 1
                for w in self.adj_list[u]:
                    if (self.d[w] >= k) and (loc_unvisited[w] <= unvisited_len):
                        q.append(w)
                    elif (self.d[w] < k) and (loc_unvisited[w] <= unvisited_len):
                        unvisited,loc_unvisited = self.swap(unvisited,loc_unvisited,loc_unvisited[w],unvisited_len)
                        unvisited_len -= 1
            cores.append(group)

        return cores

    def swap(self,arr,loc_arr, i, j):
        temp = arr[i]
        arr[i] = arr[j]
        arr[j] = temp
        loc_arr[arr[i]] = i
        loc_arr[arr[j]] = j
        return arr,loc_arr
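
For the theoretical analysis, it can help to compare against the textbook bucket-based peeling procedure (in the style of Batagelj and Zaversnik), which computes every vertex's core number in O(n + m) time with O(n) extra space. The sketch below is a standalone illustration under that assumption and is not the class above. `core_numbers`, `order`, `pos`, and `bin_start` are hypothetical names that play roughly the roles of `D`, `p`, and `b`.

```python
def core_numbers(adj_list):
    """Core number of every vertex via bucket-based peeling (simple graphs only)."""
    n = len(adj_list)
    deg = [len(neighbors) for neighbors in adj_list]
    max_deg = max(deg, default=0)

    # Counting sort by degree: bin_start[d] is the first slot of bucket d in `order`.
    bin_start = [0] * (max_deg + 1)
    for d in deg:
        bin_start[d] += 1
    start = 0
    for d in range(max_deg + 1):
        count = bin_start[d]
        bin_start[d] = start
        start += count

    order = [0] * n            # vertices sorted by current degree
    pos = [0] * n              # pos[v] is the index of v in `order`
    next_free = bin_start[:]   # next unused slot in each bucket
    for v in range(n):
        pos[v] = next_free[deg[v]]
        order[pos[v]] = v
        next_free[deg[v]] += 1

    # Peel vertices in order; each higher-degree neighbor moves down one bucket.
    for i in range(n):
        v = order[i]
        for u in adj_list[v]:
            if deg[u] > deg[v]:
                du, pu = deg[u], pos[u]
                pw = bin_start[du]   # first slot of u's current bucket
                w = order[pw]
                if u != w:           # swap u to the front of its bucket
                    order[pu], order[pw] = w, u
                    pos[u], pos[w] = pw, pu
                bin_start[du] += 1   # the bucket now starts one slot later
                deg[u] -= 1
    return deg                       # deg[v] now holds the core number of v

# Triangle 0-1-2 with pendant vertex 3, plus a separate triangle 4-5-6.
print(core_numbers([[1, 2], [0, 2], [0, 1, 3], [2], [5, 6], [4, 6], [4, 5]]))
# [2, 2, 2, 1, 2, 2, 2]
```

With core numbers available, a vertex belongs to some `k`-core exactly when its core number is at least `k`, so a query only needs to group those vertices into connected components.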

## 2. Graph Data Structure
The following is the data structure of the input graph `G`.

In [2]:
################################################################################
# Do not edit this code cell.
################################################################################

class UndirectedUnweightedGraph(object):
    def __init__(self, edge_list):
        self.adj_list = []
        self.vertex_num = 0
        self.edge_num = 0
        info = True
        for [vertex_u, vertex_v] in edge_list:
            if info:
                info = False
                self.vertex_num = vertex_u
                self.edge_num = vertex_v
                self.adj_list = [list() for _ in range(self.vertex_num)]
            else:
                self.adj_list[vertex_u].append(vertex_v)
                self.adj_list[vertex_v].append(vertex_u)
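
As the constructor above shows, the first row of `edge_list` is a header holding the vertex count and the edge count, and every later row is one undirected edge. The following sketch mirrors that logic in a hypothetical standalone helper, `adjacency_from_edge_list`, on a hand-made toy edge list, so the layout can be inspected without loading a dataset.

```python
def adjacency_from_edge_list(edge_list):
    """Build adjacency lists from rows of [u, v], where the first row is a header."""
    rows = iter(edge_list)
    vertex_num, edge_num = next(rows)   # header row: [vertex count, edge count]
    adj_list = [[] for _ in range(vertex_num)]
    for u, v in rows:
        adj_list[u].append(v)           # undirected: store both directions
        adj_list[v].append(u)
    return vertex_num, edge_num, adj_list

toy_edge_list = [
    [4, 4],   # header: 4 vertices, 4 edges
    [0, 1],
    [0, 2],
    [1, 2],
    [2, 3],
]
vertex_num, edge_num, adj_list = adjacency_from_edge_list(toy_edge_list)
print(adj_list)  # [[1, 2], [0, 2], [0, 1, 3], [2]]
```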

## 3. How to test your code

### 3.1 Download the sample dataset.

Running the following command will create the **COMP9312-24T2-Project** folder, which contains the files for the three datasets.

> **Cora** (2k vertices) is a real citation graph, **map_BJ_part** (4k vertices) is a real road network for a small area of Beijing, and **map_NY_part** (7k vertices) is a real road network for a small area of New York. For the two road networks, we erase the weight of each edge in the original dataset to generate unweighted graphs for simplicity.

There are three files for each dataset: ***.graph** includes the graph information and all graph edges (vertex IDs are consecutive and start from 0), ***.query** includes a set of queries for testing, and ***.answer** includes the correct answer to each query for your reference.

If the dataset already exists, an error like "*destination path 'COMP9312-24T2-Project' already exists*" will appear.

**NOTE**: We will test the code using different datasets.

In [None]:
!git clone https://github.com/kevinChnn/COMP9312-24T2-Project

Cloning into 'COMP9312-24T2-Project'...
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 15 (delta 3), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (15/15), 91.59 KiB | 4.16 MiB/s, done.
Resolving deltas: 100% (3/3), done.


### 3.2 The main function

Our test procedure first loads the graph dataset and the query dataset. Then, it calls the `KCore` class to preprocess the graph. After that, it will run each query and test its efficiency and correctness.
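
Because the order of the groups and of the vertices inside each group is not significant, a correctness check should compare the results as sets. Below is a minimal sketch of such a check, assuming the expected answer is available as a collection of vertex groups. `same_cores` and `min_internal_degree` are hypothetical helpers and are not part of the test harness.

```python
def same_cores(cores, expected):
    """Order-insensitive comparison of two collections of vertex groups."""
    return {frozenset(c) for c in cores} == {frozenset(c) for c in expected}

def min_internal_degree(adj_list, group):
    """Smallest number of neighbors that any vertex of `group` has inside `group`."""
    members = set(group)
    return min(sum(1 for u in adj_list[v] if u in members) for v in group)

# Triangle 0-1-2 with pendant vertex 3, plus a separate triangle 4-5-6.
toy_adj = [[1, 2], [0, 2], [0, 1, 3], [2], [5, 6], [4, 6], [4, 5]]
result = [[2, 1, 0], [4, 5, 6]]
print(same_cores(result, [[6, 4, 5], [0, 1, 2]]))                   # True
print(all(min_internal_degree(toy_adj, g) >= 2 for g in result))    # True
```

The second check is a quick sanity test that every vertex of a returned group has at least `k` neighbors inside the group. It does not verify that the group is maximal or connected.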

In [3]:
import time
import numpy as np
import pickle

if __name__ == "__main__":

    print('\n######## Loading the dataset...')
    #edge_list = np.loadtxt('./COMP9312-24T2-Project/test-kcore2.graph', dtype=int) 
    edge_list = np.loadtxt('./COMP9312-24T2-Project/cora.graph', dtype=int)
    G = UndirectedUnweightedGraph(edge_list)
    

    print('\n######## Preprocessing the graph...')
    start_preprocessing = time.time()
    KC = KCore(G)
    end_preprocessing = time.time()
    print("Preprocessing time: {:.8f}".format(end_preprocessing-start_preprocessing))

    print('\n######## Query Testing...')
    k_lo = 2 # inclusive
    k_hi = 5 # exclusive
    for k in range(k_lo,k_hi):
        print(f"Querying {k}-cores...")
        with open(f'./COMP9312-24T2-Project/cora-{k}.core', 'rb') as f:
            correct_cores = pickle.load(f)
        start_query = time.time()
        cores = KC.query(k)
        end_query = time.time()
        
        if len(cores) != len(correct_cores) or sum(len(core) for core in cores) != sum(len(core) for core in correct_cores):
            print("Query time: {:.8f} | {}-cores : False".format((end_query-start_query), k))
        else:
            test_cores = {frozenset(core) for core in cores}
            print("Query time: {:.8f} | {}-cores : {}".format((end_query-start_query), k, test_cores==correct_cores))


######## Loading the dataset...

######## Preprocessing the graph...
Preprocessing time: 0.00337720

######## Query Testing...
Querying 2-cores...
Query time: 0.00338411 | 2-cores : True
Querying 3-cores...
Query time: 0.00156188 | 3-cores : True
Querying 4-cores...
Query time: 0.00058317 | 4-cores : True
