This notebook is used solely to compute features for the training and testing sets. Our aim is to separate the steps of the workflow so that the code of each module (feature engineering, prediction, and evaluation) does not get mixed with the others. Rather than writing one huge notebook that executes everything in one place, this separation keeps the code clean and makes further development and maintenance easier.

Feature engineering is one of the most important steps in the entire process, since the data we are given do not provide explicit features. We therefore have to come up with a set of features that may or may not contribute to the quality of the predictions.

The computation of several features has proven to be time-consuming, so it is not practical to recompute everything from scratch every time we work on the project. That is why we save the computed features to files, so that feature extraction does not have to be repeated.
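
A minimal sketch of this caching idea, assuming each feature is a NumPy array stored in its own `.npy` file. The helper name `load_or_compute`, the temporary directory, and the toy feature are hypothetical and only illustrate the pattern; they are not the file layout used by this project.

```python
import os
import tempfile

import numpy as np


def load_or_compute(path, compute_fn):
    """Return the array stored at `path`, computing and saving it on first use."""
    if os.path.exists(path):
        return np.load(path)   # cache hit: skip the expensive computation
    values = np.asarray(compute_fn())
    np.save(path, values)      # cache miss: persist the result for later runs
    return values


# Hypothetical usage with a throwaway directory and a toy "feature".
cache_dir = tempfile.mkdtemp()
feature_path = os.path.join(cache_dir, 'toy_feature.npy')

calls = []

def expensive_feature():
    calls.append(1)            # record how often the computation really runs
    return np.arange(5, dtype=float)

first = load_or_compute(feature_path, expensive_feature)
second = load_or_compute(feature_path, expensive_feature)
```

The second call reads the array back from disk, so `expensive_feature` runs only once.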

This notebook proceeds as follows:
- reading data sets,
- building the citation graph,
- computing features,
- saving the features to files to be fed to the classifiers (in another notebook),
- additional modifications to the features (adding/replacing features that are already computed)

# Libraries and Utility functions

In this section, we import the necessary libraries and define several utility functions. The **execution time** of every important step is also recorded. Knowing how long specific features take to compute helps us decide where expensive computation should be avoided.
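
The `start = time.time()` / `end = time.time()` pattern is repeated in most cells of this notebook. As a sketch only, the same measurement could be wrapped in a small context manager; the name `timed` and the `timings` dictionary are hypothetical helpers, not something defined elsewhere in this notebook.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, log):
    """Measure the wall-clock time of the enclosed block and record it in `log`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        log[label] = elapsed
        print('%s takes %.4f s' % (label, elapsed))


timings = {}
with timed('Summing a range', timings):
    total = sum(range(100000))
```

Keeping the measurements in a dictionary makes it easy to compare the cost of the different features at the end of a run.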

In [91]:
import time

start = time.time()

# --- utility libraries --- #
import numpy as np
import scipy as sc
import csv

# --- working with graph by using NetworKit --- #
import networkit as nk

# --- working with text --- #
import nltk
# nltk.download('stopwords') # if stopwords haven't been downloaded, please do
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import KeyedVectors
from nltk import sent_tokenize, word_tokenize

# --- plotting real cute stuffs --- #
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

end = time.time()
print('Importing libraries and setting up parameters takes %.4f s' % (end-start))

Importing libraries and setting up parameters takes 0.0009 s


In [2]:
def build_graph(nodes, edges):
    '''
    Build the graph from the set of nodes and edges.
    NetworKit does not require node labels; it only needs the node indices 0, 1, 2, ...
    
    Parameters
    ----------
    nodes: set of nodes
    edges: set of edges
    
    Returns
    -------
    the graph
    '''
    g = nk.Graph(len(nodes)) # adding nodes

    for edge in edges:
        if not g.hasEdge(edge[0], edge[1]): # avoid multiple edges
            g.addEdge(edge[0], edge[1])
            
    return g
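
The `hasEdge` check above guards against inserting the same edge twice. A possible alternative, sketched here in plain Python on made-up edges, is to deduplicate the edge list before building the graph. It assumes the graph is undirected, so that `(u, v)` and `(v, u)` denote the same edge; the function name `unique_undirected_edges` is hypothetical.

```python
def unique_undirected_edges(edges):
    """Drop duplicate edges, treating (u, v) and (v, u) as the same edge."""
    seen = set()
    unique = []
    for u, v in edges:
        key = (u, v) if u <= v else (v, u)   # canonical orientation
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Toy edge list with a reversed duplicate and an exact duplicate.
toy_edges = [(0, 1), (1, 0), (2, 1), (0, 1), (2, 3)]
deduped = unique_undirected_edges(toy_edges)
```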

In [5]:
def preprocess(text, dg_removal=True, sw_removal=True, stemming=True):
    '''
    Preprocess text: digit removal, stopword removal, stemming
    
    Parameters
    ----------
    text: text on which preprocessing is applied
    dg_removal: whether to apply digit removal or not
    sw_removal: whether to apply stopword removal or not
    stemming: whether to apply stemming or not
    
    Returns
    -------
    the text after preprocessing
    '''
    result = text
    
    sw = set(nltk.corpus.stopwords.words('english')) # set of stopwords
    stemmer = nltk.stem.PorterStemmer() # stemmer
    
    if dg_removal:
        result = re.sub('[0-9]', '', result)
    
    if sw_removal:
        result = ' '.join([token for token in result.split() if token not in sw])
        
    if stemming:
        result = ' '.join([stemmer.stem(token) for token in result.split()])
    
    return result

In [157]:
def print_feat_info(feat_name, set_name, arr):
    '''
    Print the first values, mean, standard deviation and non-zero ratio of a feature on the training or testing set
    
    Parameters
    ----------
    feat_name: feature name
    set_name: 'training' | 'testing'
    arr: the feature array
    '''
    print("%s: " % set_name, arr[0:5])
    print('--> Mean = %.3f, Std = %.3f, Non-null ratio: %.2f'
          %(np.mean(arr), np.std(arr), float(np.count_nonzero(arr))/float(len(arr))))
    print()

With the libraries and utility functions set up, let's get to work!

# 1. Reading data

In [6]:
path_data = '../data/' # path to the data
path_submission = '../submission/' # path to submission files

In [7]:
start = time.time()

# ====== read in node information ====== #
with open(path_data + 'node_information.csv', 'r') as f:
    reader = csv.reader(f)
    node_info = list(reader)

end = time.time()
print('Reading node information takes %.4f s' % (end-start))

Reading node information takes 0.3492 s


In [8]:
start = time.time()

# ====== read training data as str ====== #
training = np.genfromtxt(path_data + 'training_set.txt', dtype=str)

end = time.time()
print('Reading training set takes %.4f s' % (end-start))

Reading training set takes 2.9434 s


In [9]:
start = time.time()

# ====== read testing data as str ====== #
testing = np.genfromtxt(path_data + 'testing_set.txt', dtype=str)

end = time.time()
print('Reading testing set takes %.4f s' % (end-start))

Reading testing set takes 0.1551 s


# 2. Building the citation graph

With the data loaded, we have enough information to construct the citation graph. The number of edges is approximately half the size of the training set, i.e. the negative and positive class labels are **balanced**, which should later make it easier to train classifiers.
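
The balance claim can be checked by counting the labels directly. The sketch below uses a few made-up rows in the same "source, target, label" layout as the training set; the IDs and the variable names `toy_training`, `label_counts` and `positive_ratio` are illustrative only.

```python
from collections import Counter

# Toy rows in the "source target label" layout (made-up IDs and labels).
toy_training = [
    ('1001', '1002', '1'),
    ('1001', '1003', '0'),
    ('1002', '1004', '1'),
    ('1003', '1004', '0'),
]

# Count how often each label occurs; labels are strings, as when read with dtype=str.
label_counts = Counter(row[2] for row in toy_training)
positive_ratio = label_counts['1'] / len(toy_training)
print('Positive ratio: %.2f' % positive_ratio)
```

A ratio close to 0.5 on the real training set would confirm that the two classes are balanced.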

In [10]:
start = time.time()

# ====== build the graph ====== #

nodes = [element[0] for element in node_info] # create index list to be passed as nodes
edges = [(nodes.index(element[0]), nodes.index(element[1])) for element in training if element[2] == '1']
g = build_graph(nodes, edges)

end = time.time()
print('Building the citation graph takes %.4f s' % (end-start))

Building the citation graph takes 172.6058 s


In [11]:
# check for general information of the graph
print('Number of vertices: %d' % g.numberOfNodes())
print('Number of edges (after multiple edges removal): %d' % g.numberOfEdges())

Number of vertices: 27770
Number of edges (after multiple edges removal): 334690


# 3. Computing features

The features are listed below; the computation rule is the same for both the training and testing sets.

| Feature                | Explanation                                                        | Value       | Type        |
|:----------------------:|:------------------------------------------------------------------:|:-----------:|:-----------:|
| Common neighbors       | Number of common neighbors                                         | numerical   | topological |
| Jaccard coefficient    | Link-based Jaccard coefficient                                     | numerical   | topological |
| Adamic-Adar coefficient| Adamic-Adar coefficient between two nodes                          | numerical   | topological |
| In the same k-core     | Whether both nodes/one of them/none of them are in the same k-core | categorical | topological |
| Katz index             | (centrality package) By-pair maximum of Katz centrality value      | numerical   | topological |
| Katz index             | (linkprediction package) The traditional approach to compute Katz  | numerical   | topological |
| Degree                 | By-pair maximum of degree centrality values                        | numerical   | topological |
| Betweenness centrality | By-pair maximum of betweenness centrality values                   | numerical   | topological |
| PageRank score         | By-pair maximum of PageRank score                                  | numerical   | topological |
| Preferential Attachment| Preferential attachment metric of a pair of nodes                  | numerical   | topological |
| Resource Allocation    | Resource Allocation metric of a pair of nodes                      | numerical   | topological |
| Cosine similarity      | Cosine similarity between word vectors of titles + abstracts       | numerical   | semantic    |
| Title overlap          | Number of overlapping words in title                               | numerical   | meta-data   |
| Common authors         | The number of common authors between two articles                  | numerical   | meta-data   |
| Temporal difference    | Difference in publication year (absolute value)                    | numerical   | meta-data   |
| Same journal           | Whether two articles are published in the same journal             | binary      | meta-data   |
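
To make the neighbourhood-based rows of the table concrete, here is a sketch of their textbook definitions on a toy undirected graph stored as an adjacency dictionary. The graph data and the function name `neighbourhood_scores` are hypothetical, and the formulas are the standard ones from the link-prediction literature; they may differ in detail from what a given library implements.

```python
import math

# Toy undirected graph as an adjacency dict (hypothetical data).
adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2},
}


def neighbourhood_scores(adj, u, v):
    """Return the common-neighbour based scores for the pair (u, v)."""
    nu, nv = adj[u], adj[v]
    common = nu & nv
    union = nu | nv
    return {
        'common_neighbors': len(common),
        'jaccard': len(common) / len(union) if union else 0.0,
        # skip degree-1 nodes, for which log(degree) would be zero
        'adamic_adar': sum(1.0 / math.log(len(adj[z]))
                           for z in common if len(adj[z]) > 1),
        'resource_allocation': sum(1.0 / len(adj[z]) for z in common),
        'preferential_attachment': len(nu) * len(nv),
    }


scores = neighbourhood_scores(adj, 1, 3)
```

Nodes 1 and 3 share both of their neighbours (0 and 2), so their Jaccard coefficient is 1 and the Adamic-Adar and Resource Allocation scores each sum two terms, one per common neighbour.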

In [12]:
# compute the dictionary of (ID STRING - index INT) to accelerate access to a node's ID in the built graph
ID = dict(zip(nodes, [nodes.index(n) for n in nodes]))

It is important to create such a mapping between an article's ID (e.g. '1001', '1002', '1003') and a node's index in the graph (0, 1, 2), because it remarkably reduces the computation time.
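
Note that `list.index` scans the list from the start on every call, so building the mapping (or the edge list) with it costs quadratic time in the number of nodes. A sketch of a single-pass alternative with `enumerate` is shown below on made-up IDs; it assumes the article IDs are unique, and the names `toy_nodes`, `toy_ID` and `slow_ID` are illustrative.

```python
# Toy article IDs standing in for the first column of the node information file.
toy_nodes = ['1001', '1002', '1003', '1004']

# enumerate() yields each position once, so the mapping is built in one pass
toy_ID = {node: index for index, node in enumerate(toy_nodes)}

# Same result as the list.index-based construction, without the repeated scans
slow_ID = dict(zip(toy_nodes, [toy_nodes.index(n) for n in toy_nodes]))
```

With such a dictionary available, edge endpoints can also be looked up through it instead of calling `nodes.index` for every training row.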

We are now ready to compute the set of features of interest.

## 3.1 Topological features

### 3.1.1 - Number of common neighbors

In [13]:
def common_neighbors(ds, g):
    '''
    Feature: The number of common neighbors
    
    Parameters
    ----------
    ds: dataset to compute feature from
    g: the graph
    '''
    size = len(ds)
    common_neigh = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        
        # count the neighbors shared by the two nodes
        common_neigh[i] = len(
            set(g.neighbors(ID[src]))
            .intersection(set(g.neighbors(ID[dest])))
        )
        
    return common_neigh

In [14]:
start = time.time()

# compute the number of common neighbors on the training set
train_common_neigh = common_neighbors(training, g)

end = time.time()
print('Computing the number of common neighbors for training set takes %.4f s' %(end-start))

Computing the number of common neighbors for training set takes 13.1309 s


In [15]:
start = time.time()

# compute the number of common neighbors on the testing set
test_common_neigh = common_neighbors(testing, g)

end = time.time()
print('Computing the number of common neighbors for testing set takes %.4f s' %(end-start))

Computing the number of common neighbors for testing set takes 0.7275 s


In [158]:
print_feat_info('Common neighbors', 'Training', train_common_neigh)
print_feat_info('Common neighbors', 'Testing', test_common_neigh)

Training:  [ 1. 20.  0.  0.  0.]
--> Mean = 6.232, Std = 11.137, Non-null ratio: 0.53

Testing:  [ 0. 24. 59. 21.  0.]
--> Mean = 6.159, Std = 10.945, Non-null ratio: 0.53



### 3.1.2 - Jaccard coefficient

In [146]:
def jaccard_coeff(ds, g):
    '''
    Feature: Link-based Jaccard coefficient
    
    Parameters
    ----------
    ds: dataset to compute feature from
    g: the graph
    '''
    size = len(ds)
    coeff = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        
        # count the neighbors shared by the two nodes
        inters = len(
            set(g.neighbors(ID[src]))
            .intersection(set(g.neighbors(ID[dest])))
        ) # intersection of neighbors
        
        union = len(
            set(g.neighbors(ID[src]))
            .union(set(g.neighbors(ID[dest])))
        ) # union of neighbors
        
        coeff[i] = (float(inters)/float(union) if union != 0 else 0)
        
    return coeff

In [147]:
start = time.time()

# compute the Jaccard coefficient on the training set
train_jaccard_coeff = jaccard_coeff(training, g)

end = time.time()
print('Computing link-based Jaccard coefficient for training set takes %.4f s' %(end-start))

Computing link-based Jaccard coefficient for training set takes 22.8225 s


In [148]:
start = time.time()

# compute the Jaccard coefficient on the testing set
test_jaccard_coeff = jaccard_coeff(testing, g)

end = time.time()
print('Computing link-based Jaccard coefficient for testing set takes %.4f s' %(end-start))

Computing link-based Jaccard coefficient for testing set takes 1.3702 s


In [165]:
print_feat_info('Jaccard coefficient', 'Training', train_jaccard_coeff)
print_feat_info('Jaccard coefficient', 'Testing', test_jaccard_coeff)

Training:  [0.05882353 0.09708738 0.         0.         0.        ]
--> Mean = 0.058, Std = 0.090, Non-null ratio: 0.53

Testing:  [0.         0.07430341 0.06533776 0.22105263 0.        ]
--> Mean = 0.061, Std = 0.097, Non-null ratio: 0.53



### 3.1.3 - Adamic-Adar coefficient

In [150]:
def adamic_adar_coeff(ds, g):
    '''
    Feature: Adamic-Adar coefficient
    
    Parameters
    ----------
    ds: the dataset
    g: graph
    '''
    
    size = len(ds)
    aa_coeff = np.zeros(size)
    aa_index = nk.linkprediction.AdamicAdarIndex(g)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        
        aa_coeff[i] = aa_index.run(ID[src], ID[dest])
        
    return aa_coeff

In [151]:
start = time.time()

# compute the adamic-adar coefficient
train_aa_coeff = adamic_adar_coeff(training, g)

end = time.time()
print('Computing Adamic-Adar coefficient feature for training set takes %.4f s' %(end-start))

Computing Adamic-Adar coefficient feature for training set takes 5.6423 s


In [152]:
start = time.time()

# compute the adamic-adar coefficient
test_aa_coeff = adamic_adar_coeff(testing, g)

end = time.time()
print('Computing Adamic-Adar coefficient feature for testing set takes %.4f s' %(end-start))

Computing Adamic-Adar coefficient feature for testing set takes 0.3349 s


In [166]:
print_feat_info('Adamic-Adar', 'Training', train_aa_coeff)
print_feat_info('Adamic-Adar', 'Testing', test_aa_coeff)

Training:  [0.51389834 4.32036615 0.         0.         0.        ]
--> Mean = 1.513, Std = 2.740, Non-null ratio: 0.53

Testing:  [ 0.          5.37797275 15.05361173  4.89942438  0.        ]
--> Mean = 1.498, Std = 2.692, Non-null ratio: 0.53



### 3.1.4 - In k-core

In [167]:
start = time.time()

core_decomp = nk.community.CoreDecomposition(g, storeNodeOrder=True)
core_decomp.run()
cover_g = core_decomp.getCover()
order = 15 # the k of the k-core (important parameter)

end = time.time()
print('Core decomposition of the graph takes %.4f s' % (end-start))

Core decomposition of the graph takes 0.1260 s


In [168]:
print('There are %d nodes that belong in the %d-core decomposition of this graph' 
      % (len(cover_g.getMembers(order)), order))

There are 9647 nodes that belong in the 15-core decomposition of this graph


In [169]:
def in_kcore(ds, kcore):
    '''
    Compute feature: whether the two nodes of a pair belong to the given k-core
    
    Parameters
    ----------
    ds: dataset
    kcore: the nodes of the k-core after decomposition, as a set of node indices (ranging from 0 to 27,769)
    
    Returns
    -------
    A numpy array of ordinal values: 
        - 0 if neither node is in the k-core, 
        - 0.5 if exactly one of them is in the k-core, 
        - 1 if they are both in the k-core
    '''
    size = len(ds)
    same_kcore = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        
        # compute whether two nodes are in the given kcore | one of them is in the kcore | none of them
        index_src = ID[src] # index of src
        index_dest = ID[dest] # index of dest
        
        if index_src in kcore and index_dest in kcore:
            result = 1.0
        elif index_src not in kcore and index_dest not in kcore:
            result = 0.0
        else:
            result = 0.5
            
        same_kcore[i] = result
        
    return same_kcore
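
For readers unfamiliar with the notion, the k-core of a graph is the largest subgraph in which every node has at least k neighbours inside the subgraph. The sketch below computes it by repeatedly removing low-degree nodes from a toy adjacency dictionary. It is a plain-Python illustration of the definition, not the routine used by the graph library, and the name `k_core_nodes` is hypothetical.

```python
def k_core_nodes(adj, k):
    """Return the set of nodes in the k-core of an undirected graph."""
    degree = {node: len(neigh) for node, neigh in adj.items()}
    alive = set(adj)
    # repeatedly peel off nodes whose remaining degree is below k
    to_remove = [node for node in alive if degree[node] < k]
    while to_remove:
        node = to_remove.pop()
        if node not in alive:
            continue               # already removed through another path
        alive.discard(node)
        for neigh in adj[node]:
            if neigh in alive:
                degree[neigh] -= 1
                if degree[neigh] < k:
                    to_remove.append(neigh)
    return alive


# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
two_core = k_core_nodes(toy_adj, 2)
```

Here the pendant node 3 is peeled off and the triangle remains as the 2-core, while the 3-core of this toy graph is empty.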

In [170]:
start = time.time()

# compute the position of two nodes wrt k-core
train_in_kcore = in_kcore(training, cover_g.getMembers(order))

end = time.time()
print('Computing the in k-core feature for training set takes %.4f s' %(end-start))

Computing the in k-core feature for training set takes 1.1210 s


In [171]:
start = time.time()

# compute the position of two nodes wrt k-core
test_in_kcore = in_kcore(testing, cover_g.getMembers(order))

end = time.time()
print('Computing the in k-core feature for testing set takes %.4f s' %(end-start))

Computing the in k-core feature for testing set takes 0.1098 s


In [172]:
print_feat_info('In k-core', 'Training', train_in_kcore)
print_feat_info('In k-core', 'Testing', test_in_kcore)

Training:  [0.  1.  0.  0.5 0.5]
--> Mean = 0.564, Std = 0.420, Non-null ratio: 0.70

Testing:  [1.  1.  1.  1.  0.5]
--> Mean = 0.555, Std = 0.418, Non-null ratio: 0.70



### 3.1.5 - Katz index

#### (A) *centrality* package

#### (B) *linkprediction* package

### 3.1.6 - By-pair maximum of degrees

In [174]:
def max_degrees(ds, g):
    '''
    Feature: Maximum degrees among 2 nodes
    
    Parameters
    ----------
    ds: dataset to compute feature from
    g: the graph
    '''
    size = len(ds)
    max_degree = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        
        # take the larger of the two degrees
        src_deg = g.degree(ID[src])
        dest_deg = g.degree(ID[dest])
        max_degree[i] = max(src_deg, dest_deg)
        
    return max_degree

In [175]:
start = time.time()

# compute the by-pair maximum degree on the training set
train_degrees = max_degrees(training, g)

end = time.time()
print('Computing the by-pair maximum degree for training set takes %.4f s' %(end-start))

Computing the by-pair maximum degree for training set takes 1.4165 s


In [176]:
start = time.time()

# compute the by-pair maximum degree on the testing set
test_degrees = max_degrees(testing, g)

end = time.time()
print('Computing the by-pair maximum degree for testing set takes %.4f s' %(end-start))

Computing the by-pair maximum degree for testing set takes 0.0842 s


In [177]:
print_feat_info('By-pair max degree', 'Training', train_degrees)
print_feat_info('By-pair max degree', 'Testing', test_degrees)

Training:  [ 12. 147.   5.  20.  24.]
--> Mean = 106.541, Std = 239.810, Non-null ratio: 1.00

Testing:  [ 59. 302. 739.  65. 150.]
--> Mean = 107.564, Std = 243.680, Non-null ratio: 1.00



### 3.1.7 - By-pair maximum of betweenness

### 3.1.8 - By-pair maximum of PageRank

In [178]:
# ====== compute PageRank index ====== #
start = time.time()

page_rank_g = nk.centrality.PageRank(g)
page_rank_g.run()

end = time.time()
print('Computing the PageRank index of the graph takes %.4f s' % (end-start))

Computing the PageRank index of the graph takes 0.2161 s


In [179]:
def max_pagerank(ds, pr):
    '''
    Compute feature: by-pair maximum of PageRank (log-scaled)
    
    Parameters
    ----------
    ds: dataset to compute feature from
    pr: PageRank centrality object
    '''
    size = len(ds)
    max_pr = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        
        # take the larger of the two PageRank scores
        # and apply a log to rescale the very small values
        _max = max(pr[ID[src]], pr[ID[dest]])
        max_pr[i] = np.log(_max) if _max != 0.0 else 0.0
        
    return max_pr
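
The per-row Python loop above can also be expressed with NumPy array operations. The following sketch applies the same "log of the by-pair maximum" rule to made-up scores and index pairs. It assumes the scores are non-negative, so that "different from zero" and "greater than zero" coincide; all variable names are illustrative.

```python
import numpy as np

# Hypothetical per-node scores (e.g. PageRank) and node-index pairs.
scores = np.array([0.10, 0.40, 0.0, 0.25])
pairs = np.array([[0, 1], [2, 3], [2, 2]])

# by-pair maximum, computed for all rows at once
pair_max = np.maximum(scores[pairs[:, 0]], scores[pairs[:, 1]])

# log of the maximum, keeping 0.0 where the maximum is exactly zero
log_max = np.zeros_like(pair_max)
positive = pair_max > 0
log_max[positive] = np.log(pair_max[positive])
```

Using it on the real data would first require converting the string IDs of each pair to integer indices through the `ID` dictionary.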

In [180]:
start = time.time()

# compute the by-pair maximum PageRank on the training set
train_pagerank = max_pagerank(training, page_rank_g.scores())

end = time.time()
print('Computing the by-pair maximum page rank for training set takes %.4f s' %(end-start))

Computing the by-pair maximum page rank for training set takes 1.9228 s


In [181]:
start = time.time()

# compute the by-pair maximum PageRank on the testing set
test_pagerank = max_pagerank(testing, page_rank_g.scores())

end = time.time()
print('Computing the by-pair maximum page rank for testing set takes %.4f s' %(end-start))

Computing the by-pair maximum page rank for testing set takes 0.1348 s


In [182]:
print_feat_info('By-pair max PageRank', 'Training', train_pagerank)
print_feat_info('By-pair max PageRank', 'Testing', test_pagerank)

Training:  [-10.4952174   -9.04507102 -11.03082025 -10.67156463 -10.4743869 ]
--> Mean = -9.655, Std = 0.869, Non-null ratio: 1.00

Testing:  [-9.71123221 -8.05014677 -7.28060674 -9.78108665 -8.84688464]
--> Mean = -9.667, Std = 0.884, Non-null ratio: 1.00



### 3.1.9 - Preferential Attachment index

In [184]:
def pref_attach(ds, pa):
    '''
    Feature: Preferential Attachment between 2 nodes
    
    Parameters
    ----------
    ds: dataset
    pa: preferential attachment object
    '''
    size = len(ds)
    pa_result = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        _pa = pa.run(ID[src], ID[dest])
        pa_result[i] = np.log(_pa) if _pa != 0 else 0.0
        
    return pa_result

In [185]:
pa_object = nk.linkprediction.PreferentialAttachmentIndex(g)

In [186]:
start = time.time()

# compute the Preferential Attachment index
train_pref_attach = pref_attach(training, pa_object)

end = time.time()
print('Computing Preferential Attachment feature for training set takes %.4f s' %(end-start))

Computing Preferential Attachment feature for training set takes 1.9145 s


In [187]:
start = time.time()

# compute the Preferential Attachment index
test_pref_attach = pref_attach(testing, pa_object)

end = time.time()
print('Computing Preferential Attachment feature for testing set takes %.4f s' %(end-start))

Computing Preferential Attachment feature for testing set takes 0.1245 s


In [188]:
print_feat_info('Preferential Attachment', 'Training', train_pref_attach)
print_feat_info('Preferential Attachment', 'Testing', test_pref_attach)

Training:  [4.27666612 9.35988044 1.60943791 5.6347896  5.12396398]
--> Mean = 6.442, Std = 2.200, Non-null ratio: 1.00

Testing:  [ 6.9679092   9.51708951 12.01246969  8.1062129   6.95654544]
--> Mean = 6.379, Std = 2.223, Non-null ratio: 0.99



### 3.1.10 - Resource Allocation measure

In [190]:
def res_allocation(ds, ra):
    '''
    Feature: ResourceAllocation index
    
    Parameters
    ----------
    ds: dataset
    ra: Resource Allocation object
    '''
    size = len(ds)
    ra_result = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        _ra = ra.run(ID[src], ID[dest])
        ra_result[i] = _ra
        
    return ra_result

In [191]:
ra_object = nk.linkprediction.ResourceAllocationIndex(g)

In [192]:
start = time.time()

# compute Resource Allocation 
train_res_alloc = res_allocation(training, ra_object)

end = time.time()
print('Computing Resource Allocation feature for training set takes %.4f s' %(end-start))

Computing Resource Allocation feature for training set takes 5.1549 s


In [193]:
start = time.time()

# compute Resource Allocation 
test_res_alloc = res_allocation(testing, ra_object)

end = time.time()
print('Computing Resource Allocation feature for testing set takes %.4f s' %(end-start))

Computing Resource Allocation feature for testing set takes 0.3013 s


In [194]:
print_feat_info('Resource Allocation', 'Training', train_res_alloc)
print_feat_info('Resource Allocation', 'Testing', test_res_alloc)

Training:  [0.14285714 0.22640079 0.         0.         0.        ]
--> Mean = 0.125, Std = 0.247, Non-null ratio: 0.53

Testing:  [0.         0.31153472 1.34259427 0.29841899 0.        ]
--> Mean = 0.125, Std = 0.243, Non-null ratio: 0.53



## 3.2 Semantic features: Cosine similarity

### 3.2.1 - TF-IDF

### 3.2.2 - word2vec

### 3.2.3 - doc2vec

## 3.3 Meta-data features

### 3.3.1 - Number of overlapping words in titles

### 3.3.2 - Number of common authors

In [195]:
def common_authors(ds):
    '''
    Compute feature: number of common authors
    
    Parameters
    ----------
    ds: dataset to compute feature from
    
    Returns
    -------
    A numpy array
    '''
    size = len(ds)
    common_auth = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        src_info, dest_info = node_info[ID[src]], node_info[ID[dest]] # get the associated node information by index
        
        # count the authors shared by the two articles
        common_auth[i] = len(
            set(src_info[3].split(','))
            .intersection(set(dest_info[3].split(',')))
        )
        
    return common_auth
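
One pitfall of splitting on commas is worth keeping in mind: `''.split(',')` returns `['']`, so two articles with an empty author field would be counted as sharing one author, and names that differ only by surrounding spaces would not match. Whether the author column actually contains such cases is an assumption here. The sketch below shows one possible normalisation; the function name `count_common_authors` and the example names are hypothetical.

```python
def count_common_authors(src_authors, dest_authors):
    """Count shared authors between two comma-separated author strings."""
    def to_set(field):
        # strip surrounding spaces and discard empty names
        return {name.strip() for name in field.split(',') if name.strip()}
    return len(to_set(src_authors) & to_set(dest_authors))


# The plain split-and-intersect approach counts the empty string as an author.
naive = len(set(''.split(',')).intersection(set(''.split(','))))

# The normalised version ignores empty fields and inconsistent spacing.
empty_case = count_common_authors('', '')
spaced_case = count_common_authors('A. Smith, B. Jones', 'B. Jones,C. Lee')
```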

In [196]:
start = time.time()

# compute the number of common authors on the training set
train_common_auth = common_authors(training)

end = time.time()
print('Computing the number of common authors for training set takes %.4f s' %(end-start))

Computing the number of common authors for training set takes 2.7997 s


In [197]:
start = time.time()

# compute the number of common authors on the testing set
test_common_auth = common_authors(testing)

end = time.time()
print('Computing the number of common authors for testing set takes %.4f s' %(end-start))

Computing the number of common authors for testing set takes 0.1966 s


In [199]:
print_feat_info('Common authors', 'Training', train_common_auth)
print_feat_info('Common authors', 'Testing', test_common_auth)

Training:  [0. 0. 0. 0. 0.]
--> Mean = 0.083, Std = 0.357, Non-null ratio: 0.06

Testing:  [0. 0. 0. 0. 0.]
--> Mean = 0.082, Std = 0.351, Non-null ratio: 0.06



### 3.3.3 - Temporal difference in publication year

In [200]:
def temporal_difference(ds):
    '''
    Compute feature: Difference in publication year
    
    Parameters
    ----------
    ds: the dataset to compute
    
    Returns
    -------
    A numpy array where each entry corresponds to the temporal difference of a pair of nodes
    '''
    size = len(ds)
    temp_diff = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        src_info, dest_info = node_info[ID[src]], node_info[ID[dest]] # get the associated node information by index
        
        # compute the difference in publication year in absolute value (because we don't know which one cites the other)
        temp_diff[i] = abs(
            int(src_info[1]) - int(dest_info[1])
        )
        
    return temp_diff

In [201]:
start = time.time()

# compute the temporal difference
train_temp_diff = temporal_difference(training)

end = time.time()
print('Computing temporal difference for training set takes %.4f s' %(end-start))

Computing temporal difference for training set takes 2.8968 s


In [202]:
start = time.time()

# compute the temporal difference
test_temp_diff = temporal_difference(testing)

end = time.time()
print('Computing temporal difference for testing set takes %.4f s' %(end-start))

Computing temporal difference for testing set takes 0.2023 s


In [203]:
print_feat_info('Temporal difference', 'Training', train_temp_diff)
print_feat_info('Temporal difference', 'Testing', test_temp_diff)

Training:  [0. 1. 2. 4. 5.]
--> Mean = 2.795, Std = 2.435, Non-null ratio: 0.85

Testing:  [0. 1. 2. 0. 5.]
--> Mean = 2.814, Std = 2.443, Non-null ratio: 0.85



### 3.3.4 - Published in the same journal

In [204]:
def same_journal(ds):
    '''
    Compute feature: whether two articles are published in the same journal
    
    Parameters
    ----------
    ds: dataset to compute feature from
    
    Returns
    -------
    A numpy array of binary values (0|1)
    '''
    size = len(ds)
    same_journal = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        src_info, dest_info = node_info[ID[src]], node_info[ID[dest]] # get the associated node information by index
        
        # 1 if two articles are published in the same journal, 0 otherwise
        same_journal[i] = int(
            len(src_info[4])>0 and  # journal info of source not null
            len(dest_info[4])>0 and # journal info of dest not null
            src_info[4] == dest_info[4] # the same journal title
        )
        
    return same_journal

In [205]:
start = time.time()

# compute the same-journal feature on the training set
train_same_journal = same_journal(training)

end = time.time()
print('Computing whether two articles are published in the same journal for training set takes %.4f s' %(end-start))

Computing whether two articles are published in the same journal for training set takes 2.4349 s


In [206]:
start = time.time()

# compute the same-journal feature on the testing set
test_same_journal = same_journal(testing)

end = time.time()
print('Computing whether two articles are published in the same journal for testing set takes %.4f s' %(end-start))

Computing whether two articles are published in the same journal for testing set takes 0.1387 s


In [207]:
print_feat_info('In same journal', 'Training', train_same_journal)
print_feat_info('In same journal', 'Testing', test_same_journal)

Training:  [1. 0. 0. 0. 0.]
--> Mean = 0.110, Std = 0.313, Non-null ratio: 0.11

Testing:  [0. 0. 1. 1. 0.]
--> Mean = 0.109, Std = 0.312, Non-null ratio: 0.11



# 4. Saving features

# 5. Additional operations