This notebook is solely used to train features for training and testing set. Our aim is to separate different steps in the workflow such that the code of each module, namely Feature engineering, Predictions, and Evaluation, do not mix up one to another. Having avoided to write a huge notebook that executes everything in one place, the separation brings a way to have a clean, proper code, thus making it easier for further development and maintenance.

Feature engineering is one of the most important step among the entire process, since the data that we are given do not provided explicit features. Therefore, we have to come up with a set of features that may or may not contribute to the quality of prediction.

The computation of several has proven to be time-consuming. Therefore, it is not practical to recompute everything from stratch every time we work on the project. That's why we decided to save computed features into files, so that we do not have to repeat the process of feature extraction.

This notebook is proceeded as follows:
- reading data sets,
- building the citation graph,
- computing features,
- saving the features to files to be fed to the classifiers (in another notebook)

# Libraries and Utility functions

In this section, we precise the necessary libraries and define several utility functions. The **execution time** is also noted down for all important steps. It is useful to bear in mind the amount of time needed to compute some specific features, that way, we must think of a way to avoid expensive computation.

In [53]:
import time

start = time.time()

# --- utility librairies --- #
import numpy as np
import scipy as sc
import csv

# --- working with graph by using NetworKit --- #
import networkit as nk

# --- working with text --- #
import nltk
# nltk.download('stopwords') # if stopwords haven't been downloaded, please do
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import KeyedVectors
from gensim.models.doc2vec import TaggedDocument, Doc2Vec
from nltk import sent_tokenize, word_tokenize

# --- plotting real cute stuffs --- #
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

end = time.time()
print('Importing libraries and setting up parameters takes %.4f s' % (end-start))

Importing libraries and setting up parameters takes 0.0003 s


In [2]:
def build_graph(nodes, edges):
    '''
    Build the graph from the set of nodes and edges.
    NetworKit does not require labels for nodes, it only needs the index 0,1,2... of the nodes.
    
    Parameters
    ----------
    nodes: set of nodes
    edges: set of edges
    
    Returns
    -------
    the graph
    '''
    g = nk.Graph(len(nodes)) # adding nodes

    for edge in edges:
        if not g.hasEdge(edge[0], edge[1]): # avoid multiple edges
            g.addEdge(edge[0], edge[1])
            
    return g

In [3]:
def preprocess(text, dg_removal=True, sw_removal=True, stemming=True):
    '''
    Preprocess text: digit removal, stopword removal, stemming
    
    Parameters
    ----------
    text: text on which preprocessing is applied
    dg_removal: whether to apply digit removal or not
    sw_removal: whether to apply stopword removal or not
    stemming: whether to apply stemming or not
    
    Returns
    -------
    the text after preprocessing
    '''
    result = text
    
    sw = set(nltk.corpus.stopwords.words('english')) # set of stopwords
    stemmer = nltk.stem.PorterStemmer() # stemmer
    
    if dg_removal:
        result = re.sub('[0-9]', '', result)
    
    if sw_removal:
        result = ' '.join([token for token in result.split() if token not in sw])
        
    if stemming:
        result = ' '.join([stemmer.stem(token) for token in result.split()])
    
    return result

In [4]:
def print_feat_info(feat_name, set_name, arr):
    '''
    Print mean and standard deviation of a feature on training or testing set
    
    Parameters
    ----------
    feat_name: feature name
    set_namme: 'training' | 'testing'
    arr: the feature array
    '''
    print("%s: " % set_name, arr[0:5])
    print('--> Mean = %.3f, Std = %.3f, Non-null ratio: %.2f'
          %(np.mean(arr), np.std(arr), float(np.count_nonzero(arr))/float(len(arr))))
    print()

Once the libraries and utility functions are properly set up, let's get work done!

# 1. Reading data

In [5]:
path_data = '../data/' # path to the data
path_submission = '../submission/' # path to submission files

In [6]:
start = time.time()

# ====== read in node informations ====== #
with open(path_data + 'node_information.csv', 'r') as f:
    reader = csv.reader(f)
    node_info = list(reader)

end = time.time()
print('Reading node information takes %.4f s' % (end-start))

Reading node information takes 0.3328 s


In [7]:
start = time.time()

# ====== read training data as str ====== #
training = np.genfromtxt(path_data + 'training_set.txt', dtype=str)

end = time.time()
print('Reading training set takes %.4f s' % (end-start))

Reading training set takes 2.7601 s


In [8]:
start = time.time()

# ====== read testing data as str ====== #
testing = np.genfromtxt(path_data + 'testing_set.txt', dtype=str)

end = time.time()
print('Reading testing set takes %.4f s' % (end-start))

Reading testing set takes 0.1431 s


# 2. Building the citation graph

With the data loaded in, we should have enough information to construct the citation graph. It is observed that the number of edges is approximately half of the size of the training set i.e. the negative and positive class labels are **balanced**. Hence, it would later be easier to train classifiers.

In [9]:
start = time.time()

# ====== build the graph ====== #

nodes = [element[0] for element in node_info] # create index list to be passed as nodes
edges = [(nodes.index(element[0]), nodes.index(element[1])) for element in training if element[2] == '1']
g = build_graph(nodes, edges)

end = time.time()
print('Building the citation graph takes %.4f s' % (end-start))

Building the citation graph takes 172.3828 s


In [10]:
# check for general information of the graph
print('Number of vertices: %d' % g.numberOfNodes())
print('Number of edges (after multiple edges removal): %d' % g.numberOfEdges())

Number of vertices: 27770
Number of edges (after multiple edges removal): 334690


# 3. Computing features

The list of features is described as follows, and the computation rule is the same for both training and testing set.

| Feature                | Explanation                                                        | Value       | Type        |
|:----------------------:|:------------------------------------------------------------------:|:-----------:|:-----------:|
| Common neighbors       | Number of common neighbors                                         | numercial   | topological |
| Jaccard coefficient    | Link-based Jaccard coefficient                                     | numerical   | topological |
| Adamic-Adar coefficient| Adamic-Adar coefficient between two nodes                          | numercial   | topological |
| In the same k-core     | Whether both nodes/one of them/none of them are in the same k-core | categorical | topological |
| Katz index             | (centrality package) By-pair maximum of Katz centrality value      | numerical   | topological |
| Katz index             | (linkprediction package) The traditional approach to compute Katz  | numerical   | topological |
| Degree                 | By-pair maximum of degree centrality values                        | numerical   | topological |
| Betweenness centrality | By-pair maximum of betweenness centrality values                   | numerical   | topological |
| PageRank score         | By-pair maximum of PageRank score                                  | numerical   | topological |
| Preferential Attachment| Preferential attachment metric of a pair of nodes                  | numerical   | topological |
| Resource Allocation    | Resource Allocation matrix of a pair of nodes                      | numerical   | topological |
| Cosine similarity      | Cosine similarity between word vectors of titles + abstracts       | numerical   | semantic    |
| Title overlap          | Number of overlapping words in title                               | numerical   | meta-data   |
| Common authors         | The number of common authors between two articles                  | numerical   | meta-data   |
| Temporal difference    | Difference in publication year (absolute value)                    | numerical   | meta-data   |
| Same journal           | Whether two articles are published in the same journal             | binary      | meta-data   |

In [11]:
# compute the dictionary of (ID STRING - index INT) to accelerate access to a node's ID in the built graph
ID = dict(zip(nodes, [nodes.index(n) for n in nodes]))

It is important to create such a mapping between an article's ID (e.g. '1001', '1002', '1003') and a node's ID in the graph (0, 1, 2), because it speeds up remarkably the computation time.

Until now, we are ready to compute the set of features of interest.

## 3.1 Topological features

### 3.1.1 - Number of common neighbors

In [66]:
def common_neighbors(ds, g):
    '''
    Feature: The number of common neighbors
    
    Parameters
    ----------
    ds: dataset to compute feature from
    g: the graph
    '''
    size = len(ds)
    common_neigh = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        
        # collect the number of overlapping words in title
        common_neigh[i] = len(
            set(g.neighbors(ID[src]))
            .intersection(set(g.neighbors(ID[dest])))
        )
        
    return common_neigh

In [67]:
start = time.time()

# compute the average degree
train_common_neigh = common_neighbors(training, g)

end = time.time()
print('Computing the number of common neighbors for training set takes %.4f s' %(end-start))

Computing the number of common neighbors for training set takes 9.9154 s


In [68]:
start = time.time()

# compute the average degree
test_common_neigh = common_neighbors(testing, g)

end = time.time()
print('Computing the number of common neighbors for testing set takes %.4f s' %(end-start))

Computing the number of common neighbors for testing set takes 0.6105 s


In [69]:
print_feat_info('Common neighbors', 'Training', train_common_neigh)
print_feat_info('Common neighbors', 'Testing', test_common_neigh)

Training:  [ 1. 20.  0.  0.  0.]
--> Mean = 6.232, Std = 11.137, Non-null ratio: 0.53

Testing:  [ 0. 24. 59. 21.  0.]
--> Mean = 6.159, Std = 10.945, Non-null ratio: 0.53



### 3.1.2 - Jaccard coefficient

In [70]:
def jaccard_coeff(ds, g):
    '''
    Feature: Link-based Jaccard coefficient
    
    Parameters
    ----------
    ds: dataset to compute feature from
    g: the graph
    '''
    size = len(ds)
    coeff = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        
        # collect the number of overlapping words in title
        inters = len(
            set(g.neighbors(ID[src]))
            .intersection(set(g.neighbors(ID[dest])))
        ) # intersection of neighbors
        
        union = len(
            set(g.neighbors(ID[src]))
            .union(set(g.neighbors(ID[dest])))
        ) # union of neighbors
        
        coeff[i] = (float(inters)/float(union) if union != 0 else 0)
        
    return coeff

In [71]:
start = time.time()

# compute the average degree
train_jaccard_coeff = jaccard_coeff(training, g)

end = time.time()
print('Computing link-based Jaccard coefficient for training set takes %.4f s' %(end-start))

Computing link-based Jaccard coefficient for training set takes 21.0012 s


In [72]:
start = time.time()

# compute the average degree
test_jaccard_coeff = jaccard_coeff(testing, g)

end = time.time()
print('Computing link-based Jaccard coefficient for testing set takes %.4f s' %(end-start))

Computing link-based Jaccard coefficient for testing set takes 1.0386 s


In [73]:
print_feat_info('Jaccard coefficient', 'Training', train_jaccard_coeff)
print_feat_info('Jaccard coefficient', 'Testing', test_jaccard_coeff)

Training:  [0.05882353 0.09708738 0.         0.         0.        ]
--> Mean = 0.058, Std = 0.090, Non-null ratio: 0.53

Testing:  [0.         0.07430341 0.06533776 0.22105263 0.        ]
--> Mean = 0.061, Std = 0.097, Non-null ratio: 0.53



### 3.1.3 - Adamic-Adar coefficient

In [74]:
def adamic_adar_coeff(ds, g):
    '''
    Feature: Adamic-Adar coefficient
    
    Parameters
    ----------
    ds: the dataset
    g: graph
    '''
    
    size = len(ds)
    aa_coeff = np.zeros(size)
    aa_index = nk.linkprediction.AdamicAdarIndex(g)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        
        aa_coeff[i] = aa_index.run(ID[src], ID[dest])
        
    return aa_coeff

In [75]:
start = time.time()

# compute the adamic-adar coefficient
train_aa_coeff = adamic_adar_coeff(training, g)

end = time.time()
print('Computing Adamic-Adar coefficient feature for training set takes %.4f s' %(end-start))

Computing Adamic-Adar coefficient feature for training set takes 6.0754 s


In [76]:
start = time.time()

# compute the adamic-adar coefficient
test_aa_coeff = adamic_adar_coeff(testing, g)

end = time.time()
print('Computing Adamic-Adar coefficient feature for testing set takes %.4f s' %(end-start))

Computing Adamic-Adar coefficient feature for testing set takes 0.3486 s


In [77]:
print_feat_info('Adamic-Adar', 'Training', train_aa_coeff)
print_feat_info('Adamic-Adar', 'Testing', test_aa_coeff)

Training:  [0.51389834 4.32036615 0.         0.         0.        ]
--> Mean = 1.513, Std = 2.740, Non-null ratio: 0.53

Testing:  [ 0.          5.37797275 15.05361173  4.89942438  0.        ]
--> Mean = 1.498, Std = 2.692, Non-null ratio: 0.53



### 3.1.4 - In k-core

In [78]:
start = time.time()

core_decomp = nk.community.CoreDecomposition(g, storeNodeOrder=True)
core_decomp.run()
cover_g = core_decomp.getCover()
order = 15 # important parameters

end = time.time()
print('Core decomposition of the graph takes %.4f s' % (end-start))

Core decomposition of the graph takes 0.1697 s


In [79]:
print('There are %d nodes that belong in the %d-core decomposition of this graph' 
      % (len(cover_g.getMembers(order)), order))

There are 9647 nodes that belong in the 15-core decomposition of this graph


In [80]:
def in_kcore(ds, kcore):
    '''
    Compute feature: whether a pair of nodes is found in the same k-core graph
    
    Parameters
    ----------
    ds: dataset
    kcore: the k-core graph after decomposition as a set of nodes index (ranged from 0 to 27,770)
    
    Returns
    -------
    A numpy array of ordinal values: 
        - 0 if both nodes are not in the kcores, 
        - 0.5 if one of them is in the kcores, 
        - 1 of they are both in the k-core
    '''
    size = len(ds)
    same_kcore = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        
        # compute whether two nodes are in the given kcore | one of them is in the kcore | none of them
        index_src = ID[src] # index of src
        index_dest = ID[dest] # index of dest
        
        if index_src in kcore and index_dest in kcore:
            result = 1.0
        elif index_src not in kcore and index_dest not in kcore:
            result = 0.0
        else:
            result = 0.5
            
        same_kcore[i] = result
        
    return same_kcore

In [81]:
start = time.time()

# compute the position of two nodes wrt k-core
train_in_kcore = in_kcore(training, cover_g.getMembers(order))

end = time.time()
print('Computing the in k-core feature for training set takes %.4f s' %(end-start))

Computing the in k-core feature for training set takes 1.7808 s


In [82]:
start = time.time()

# compute the position of two nodes wrt k-core
test_in_kcore = in_kcore(testing, cover_g.getMembers(order))

end = time.time()
print('Computing the in k-core feature for testing set takes %.4f s' %(end-start))

Computing the in k-core feature for testing set takes 0.1182 s


In [83]:
print_feat_info('In k-core', 'Training', train_in_kcore)
print_feat_info('In k-core', 'Testing', test_in_kcore)

Training:  [0.  1.  0.  0.5 0.5]
--> Mean = 0.564, Std = 0.420, Non-null ratio: 0.70

Testing:  [1.  1.  1.  1.  0.5]
--> Mean = 0.555, Std = 0.418, Non-null ratio: 0.70



### 3.1.5 - Katz index

Katz measure is one of the most curious feature to consider from the dataset, since it causes overfitting to every classifier, even to Naive Bayes that is often proved to be "immune" to overfitting due to its simple assumption of independence.

`Networkit` oddly provides computation of Katz measure in two different packages, namely `centrality` and `linkprediction`. Due to the lack of detailed documentation on their website, we assume each has different behaviors, since katz index computed with `centrality` package is relatively fast (within a few seconds) meanwhile the same measure obtained by `linkprediction` package takes about 9 hours to complete. It is the latter one that causes overfitting, whereas the former one does not cause much trouble (but does not have significant contribution to the accuracy score either).

#### (A) *centrality* package

In [12]:
start = time.time()

# ====== create a Katz Centrality object and run it ====== #
katz = nk.centrality.KatzCentrality(g)
katz.run()

end = time.time()
print('Computing katz centrality (with centrality package) takes %.4f s' % (end-start))

Computing katz centrality (with centrality package) takes 0.1100 s


In [13]:
def max_katz(ds, katz_scores):
    '''
    By-pair maximumKatz index between a pair of nodes (using centrality package and get the maximum)
    '''
    size = len(ds)
    katz_result = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1]
        _katz = max(katz_scores[ID[src]], katz_scores[ID[dest]])
        katz_result[i] = np.log(_katz) if _katz != 0.0 else 0.0
        
    return katz_result

In [14]:
start = time.time()

# compute the katz index for training set
train_katz = max_katz(training, katz.scores())

end = time.time()
print('Computing Katz max feature for training set takes %.4f s' %(end-start))

Computing Katz max feature for training set takes 1.7918 s


In [15]:
start = time.time()

# compute the katz index for testing set
test_katz = max_katz(testing, katz.scores())

end = time.time()
print('Computing Katz average feature for testing set takes %.4f s' %(end-start))

Computing Katz average feature for testing set takes 0.1408 s


In [17]:
print_feat_info('Katz with centrality package', 'Training', train_katz)
print_feat_info('Katz with centrality package', 'Testing', test_katz)

Training:  [-5.5804487  -4.2164855  -5.64652461 -5.25171866 -5.36014155]
--> Mean = -4.878, Std = 0.685, Non-null ratio: 1.00

Testing:  [-4.69918791 -3.80632847 -2.88068326 -4.78584888 -4.22347698]
--> Mean = -4.881, Std = 0.693, Non-null ratio: 1.00



#### (B) *linkprediction* package

Because the computation of Katz centrality measure with `linkprediction` package is highly time-consuming, we have decided to avoid any recomputation. Instead, we ran it only once and saved the result into files (please refer to `katz_training.txt` and `katz_testing.txt`). Then, we only need to read back the data from file to reconstruct the feature.

In [22]:
start = time.time()

# ====== read katz for training set from file ====== #
_train_katz_linkpred = np.genfromtxt(path_data + 'katz_training.txt', dtype=float)
train_katz_linkpred = [np.log(value) if value != 0 else 0 for value in _train_katz_linkpred]

end = time.time()
print('Reconstructing katz index for traing set takes %.4fs' % (end-start))

Reconstructing katz index for traing set takes 2.6767s


In [23]:
start = time.time()

# ====== read katz for training set from file ====== #
_test_katz_linkpred = np.genfromtxt(path_data + 'katz_testing.txt', dtype=float)
test_katz_linkpred = [np.log(value) if value != 0 else 0 for value in _test_katz_linkpred]

end = time.time()
print('Reconstructing katz index for traing set takes %.4fs' % (end-start))

Reconstructing katz index for traing set takes 0.1655s


In [24]:
print_feat_info('Katz link prediction', 'Training', train_katz_linkpred)
print_feat_info('Katz link prediction', 'Testing', test_katz_linkpred)

Training:  [-5.293029862567551, -5.199655114063047, 0, -13.487961947295274, -15.799096614000588]
--> Mean = -10.194, Std = 7.904, Non-null ratio: 0.92

Testing:  [-15.072323905681971, -7.365925687758105, -6.472026643304163, -7.537850990784671, -13.279108565893594]
--> Mean = -12.131, Std = 7.010, Non-null ratio: 0.91



### 3.1.6 - By-pair maximum of degrees

In [84]:
def max_degrees(ds, g):
    '''
    Feature: Maximum degrees among 2 nodes
    
    Parameters
    ----------
    ds: dataset to compute feature from
    g: the graph
    '''
    size = len(ds)
    max_degree = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        
        # collect the number of overlapping words in title
        src_deg = g.degree(ID[src])
        dest_deg = g.degree(ID[dest])
        max_degree[i] = max(src_deg, dest_deg)
        
    return max_degree

In [85]:
start = time.time()

# compute the average degree
train_degrees = max_degrees(training, g)

end = time.time()
print('Computing the by-pair maximum degree for training set takes %.4f s' %(end-start))

Computing the by-pair maximum degree for training set takes 1.7575 s


In [86]:
start = time.time()

# compute the average degree
test_degrees = max_degrees(testing, g)

end = time.time()
print('Computing the by-pair maximum degree for testing set takes %.4f s' %(end-start))

Computing the by-pair maximum degree for testing set takes 0.1244 s


In [87]:
print_feat_info('By-pair max degree', 'Training', train_degrees)
print_feat_info('By-pair max degree', 'Testing', test_degrees)

Training:  [ 12. 147.   5.  20.  24.]
--> Mean = 106.541, Std = 239.810, Non-null ratio: 1.00

Testing:  [ 59. 302. 739.  65. 150.]
--> Mean = 107.564, Std = 243.680, Non-null ratio: 1.00



### 3.1.7 - By-pair maximum of betweenness

`Networkit` offers various choices for the computation of betweenness centrality, such as traditional, estimate and approximate approach. We have decided to come up with the traditional approach, that is to consider all-pair shortest paths. It is acceptable in our case because the citation is not too large (27700 nodes are a number smaller than what a social network has in the present day). The computation should be completed within 10 minutes or less.

In [26]:
# ====== compute betweenness centrality ====== #
start = time.time()

# use the traditional approach of betweeness computation
btwn = nk.centrality.Betweenness(g)
btwn.run()

end = time.time()
print('Compute betweenness centrality of every node in the graph takes %.4f s' % (end-start))

Compute betweenness centrality of every node in the graph takes 460.1832 s


In [27]:
def max_betweeness(ds, btwn):
    '''
    Compute feature: by-pair maximum of betweenness centrality
    
    Parameters
    ----------
    ds: dataset to compute feature from
    g: the graph
    
    Returns
    -------
    A numpy array of numerical values
    '''
    size = len(ds)
    max_btw = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        
        # collect the average of betweenness centrality of 2 nodes
        _max = max(btwn[ID[src]], btwn[ID[dest]])
        max_btw[i] = np.log(_max) if _max != 0 else 0.0
        
    return max_btw

In [28]:
start = time.time()

# compute the average degree
train_btwn = max_betweeness(training, btwn.scores())

end = time.time()
print('Computing the average betweenness for training set takes %.4f s' %(end-start))

Computing the average betweenness for training set takes 1.7032 s


In [29]:
start = time.time()

# compute the average degree
test_btwn = max_betweeness(testing, btwn.scores())

end = time.time()
print('Computing the average betweenness for testing set takes %.4f s' %(end-start))

Computing the average betweenness for testing set takes 0.1040 s


In [30]:
print_feat_info('Betweenness', 'Training', train_btwn)
print_feat_info('Betweenness', 'Testing', test_btwn)

Training:  [10.5093708  11.39325368  9.53921731  8.04019582  9.38724288]
--> Mean = 11.460, Std = 2.127, Non-null ratio: 1.00

Testing:  [12.58866497 14.77807021 15.9955066  11.20495121 13.16245853]
--> Mean = 11.440, Std = 2.155, Non-null ratio: 1.00



### 3.1.8 - By-pair maximum of PageRank

In [88]:
# ====== compute PageRank index ====== #
start = time.time()

page_rank_g = nk.centrality.PageRank(g)
page_rank_g.run()

end = time.time()
print('Computing the PageRank index of the graph takes %.4f s' % (end-start))

Computing the PageRank index of the graph takes 0.2050 s


In [89]:
def max_pagerank(ds, pr):
    '''
    Compute feature: average of pagerank
    
    Parameters
    ----------
    ds: dataset to compute feature from
    pr: PageRank centrality object
    '''
    size = len(ds)
    max_pr = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        
        # collect the average of betweenness centrality of 2 nodes
        # log to "dampen" too small values
        _max = max(pr[ID[src]], pr[ID[dest]])
        max_pr[i] = np.log(_max) if _max != 0.0 else 0.0
        
    return max_pr

In [90]:
start = time.time()

# compute the average pagerank on training set
train_pagerank = max_pagerank(training, page_rank_g.scores())

end = time.time()
print('Computing the by-pair maximum page rank for training set takes %.4f s' %(end-start))

Computing the by-pair maximum page rank for training set takes 2.5733 s


In [91]:
start = time.time()

# compute the average pagerank
test_pagerank = max_pagerank(testing, page_rank_g.scores())

end = time.time()
print('Computing the by-pair maximum page rank for testing set takes %.4f s' %(end-start))

Computing the by-pair maximum page rank for testing set takes 0.1430 s


In [92]:
print_feat_info('By-pair max PageRank', 'Training', train_pagerank)
print_feat_info('By-pair max PageRank', 'Testing', test_pagerank)

Training:  [-10.4952174   -9.04507102 -11.03082025 -10.67156463 -10.4743869 ]
--> Mean = -9.655, Std = 0.869, Non-null ratio: 1.00

Testing:  [-9.71123221 -8.05014677 -7.28060674 -9.78108665 -8.84688464]
--> Mean = -9.667, Std = 0.884, Non-null ratio: 1.00



### 3.1.9 - Preferential Attachment index

In [93]:
def pref_attach(ds, pa):
    '''
    Feature: Preferential Attachment between 2 nodes
    
    Parameters
    ----------
    ds: dataset
    pa: prefenrential attachment object
    '''
    size = len(ds)
    pa_result = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        _pa = pa.run(ID[src], ID[dest])
        pa_result[i] = np.log(_pa) if _pa != 0 else 0.0
        
    return pa_result

In [94]:
pa_object = nk.linkprediction.PreferentialAttachmentIndex(g)

In [95]:
start = time.time()

# compute the Preferential Attachment index
train_pref_attach = pref_attach(training, pa_object)

end = time.time()
print('Computing Preferential Attachment feature for training set takes %.4f s' %(end-start))

Computing Preferential Attachment feature for training set takes 2.2458 s


In [96]:
start = time.time()

# compute the Preferential Attachment index
test_pref_attach = pref_attach(testing, pa_object)

end = time.time()
print('Computing Preferential Attachment feature for testing set takes %.4f s' %(end-start))

Computing Preferential Attachment feature for testing set takes 0.1471 s


In [97]:
print_feat_info('Preferential Attachment', 'Training', train_pref_attach)
print_feat_info('Preferential Attachment', 'Testing', test_pref_attach)

Training:  [4.27666612 9.35988044 1.60943791 5.6347896  5.12396398]
--> Mean = 6.442, Std = 2.200, Non-null ratio: 1.00

Testing:  [ 6.9679092   9.51708951 12.01246969  8.1062129   6.95654544]
--> Mean = 6.379, Std = 2.223, Non-null ratio: 0.99



### 3.1.10 - Resource Allocation measure

In [98]:
def res_allocation(ds, ra):
    '''
    Feature: ResourceAllocation index
    
    Parameters
    ----------
    ds: dataset
    ra: Resource Allocation object
    '''
    size = len(ds)
    ra_result = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        _ra = ra.run(ID[src], ID[dest])
        ra_result[i] = _ra
        
    return ra_result

In [99]:
ra_object = nk.linkprediction.ResourceAllocationIndex(g)

In [100]:
start = time.time()

# compute Resource Allocation 
train_res_alloc = res_allocation(training, ra_object)

end = time.time()
print('Computing Resource Allocation feature for training set takes %.4f s' %(end-start))

Computing Resource Allocation feature for training set takes 6.0396 s


In [101]:
start = time.time()

# compute Resource Allocation 
test_res_alloc = res_allocation(testing, ra_object)

end = time.time()
print('Computing Resource Allocation feature for testing set takes %.4f s' %(end-start))

Computing Resource Allocation feature for testing set takes 0.3480 s


In [102]:
print_feat_info('Resouce Allocation', 'Training', train_res_alloc)
print_feat_info('Resouce Allocation', 'Testing', test_res_alloc)

Training:  [0.14285714 0.22640079 0.         0.         0.        ]
--> Mean = 0.125, Std = 0.247, Non-null ratio: 0.53

Testing:  [0.         0.31153472 1.34259427 0.29841899 0.        ]
--> Mean = 0.125, Std = 0.243, Non-null ratio: 0.53



## 3.2 Semantic features: Cosine similarity

Before computing semantic features, it is mandatory to build a corpus of all texts extracted from the given data. We define as corpus as the collection of all articles' title and abstract.

In [35]:
start = time.time()

# ====== corpus is the set of titles + abstracts, apply preprocessing to each article ======#

# nltk.download('stopwords') # uncomment if haven't downloaded stopwords
corpus = [preprocess(element[2] + ' ' + element[5], dg_removal=True, sw_removal=True, stemming=True) 
          for element in node_info]

end = time.time()
print('Building the corpus with preprocessing on words takes %.4f s' % (end-start))

Building the corpus with preprocessing on words takes 50.9963 s


In [36]:
def cosine_sim_text(ds, vectors, is_w2v):
    '''
    Compute feature: cosine similarity in title and abstract, cosine similarity on either TF-IDF or word2vec
    
    Parameters
    ----------
    ds: dataset to compute feature from
    vectors: vectors of word embeddings or TF-IDF
    is_w2v: True if vectors contain word embeddings, False in case of TF-IDF
    
    Returns
    -------
    A numpy array of cosine values
    '''
    size = len(ds)
    cosines = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        
        # collect the cosine similarity
        src_vect, dest_vect = vectors[ID[src]], vectors[ID[dest]] # get the corresponding vector in TD-IDF matrix

        # compute cosine similarity
        cos = 1 - sc.spatial.distance.cosine(src_vect, dest_vect) if is_w2v else cosine_similarity(src_vect, dest_vect)
        
        cosines[i] = cos
        
    return cosines

### 3.2.1 - TD-IDF

In [37]:
start = time.time()

# ====== fit TF-IDF ====== #
vectorizer = TfidfVectorizer(stop_words='english') # create a TF-IDF vectorizer
tfidf = vectorizer.fit_transform(corpus) # TD-IDF matrix of the entire corpus (set of abstracts)
print('TF-IDF matrix of shape:', tfidf.shape)

end = time.time()
print('Building TF-IDF matrix takes %.4f s' % (end-start))

TF-IDF matrix of shape: (27770, 17080)
Building TF-IDF matrix takes 1.8592 s


In [38]:
start = time.time()

# compute the cosine similarity
train_cos_tfidf = cosine_sim_text(training, tfidf, False)

end = time.time()
print('Computing cosine similarity for training set takes %.4f s' %(end-start))

Computing cosine similarity for training set takes 345.2234 s


In [39]:
start = time.time()

# compute the cosine similarity
test_cos_tfidf = cosine_sim_text(testing, tfidf, False)

end = time.time()
print('Computing cosine similarity for testing set takes %.4f s' %(end-start))

Computing cosine similarity for testing set takes 17.6878 s


In [40]:
print_feat_info('Cosine similarity with TF-IDF', 'Training', train_cos_tfidf)
print_feat_info('Cosine similarity with TF-IDF', 'Testing', test_cos_tfidf)

Training:  [0.19996622 0.06436945 0.02053711 0.05937844 0.09852643]
--> Mean = 0.114, Std = 0.116, Non-null ratio: 0.97

Testing:  [0.11804009 0.30786265 0.20753805 0.16112407 0.31824453]
--> Mean = 0.114, Std = 0.116, Non-null ratio: 0.97



### 3.2.2 - word2vec

In [41]:
start = time.time()

# ====== reading google vector ====== #
google_vecs = KeyedVectors.load_word2vec_format(path_data + 'GoogleNews-vectors-negative300.bin.gz', binary=True)

end = time.time()
print('Loading word2vec of google takes %.4f s' % (end-start))

INFO:gensim.models.utils_any2vec:loading projection weights from ../data/GoogleNews-vectors-negative300.bin.gz
INFO:gensim.models.utils_any2vec:loaded (3000000, 300) matrix from ../data/GoogleNews-vectors-negative300.bin.gz


Loading word2vec of google takes 118.7759 s


In [42]:
start = time.time()

# building documents for word2vec training
train_tag = []
total = len(corpus)
processed = 0
i = 0
#nltk.download('punkt') # uncomment if package 'punkt' not already downloaded
for x in corpus:
    words = []
    sentences = sent_tokenize(x)
    for s in sentences:
        words.extend(word_tokenize(s)) 
    doc = words
    i = i+1
    train_tag.append(doc)

end = time.time()
print('Building documents for word2vec training takes %.4f s' % (end-start))

Building documents for word2vec training takes 10.3649 s


In [43]:
def word2vec(x_data,vector):
    print ("Loading GoogleNews-vectors-negative300.bin")
    google_vecs = vector
    print ("GoogleNews-vectors-negative300.bin loaded")
    
    print ("Averaging Word Embeddings...")
    x_data_embeddings = []
    total = len(x_data)
    for tagged_plot in x_data:
        count = 0  
        doc_vector = np.zeros(300)
        for sentence in tagged_plot:
            try:
                doc_vector += google_vecs[sentence]
            except KeyError:
                doc_vector += [0.0]*300
                continue

        x_data_embeddings.append(doc_vector)
            
    return np.array(x_data_embeddings)

In [44]:
start = time.time()

# get word embessings
x_embeddings = word2vec(train_tag, google_vecs)
print('Word embeddings of shape:', x_embeddings.shape)

end = time.time()
print('Embedding words takes %.4f s' % (end-start))

Loading GoogleNews-vectors-negative300.bin
GoogleNews-vectors-negative300.bin loaded
Averaging Word Embeddings...
Word embeddings of shape: (27770, 300)
Embedding words takes 12.1084 s


In [50]:
start = time.time()

# ====== compute cosine similarity with word embeddings in training set ====== #
train_cos_w2v = np.nan_to_num(cosine_sim_text(training, x_embeddings, True))

end = time.time()
print('Computing cosine similarity using word embedding for training features takes %.4f s' % (end-start))

  dist = 1.0 - uv / np.sqrt(uu * vv)


Computing cosine similarity using word embedding for training features takes 19.9758 s


In [51]:
start = time.time()

# ======compute cosine similarity with word embeddings in testing set ====== #
test_cos_w2v = np.nan_to_num(cosine_sim_text(testing, x_embeddings, True))

end = time.time()
print('Computing cosine similarity using word embedding for testing features takes %.4f s' % (end-start))

  dist = 1.0 - uv / np.sqrt(uu * vv)


Computing cosine similarity using word embedding for testing features takes 1.0727 s


In [52]:
print_feat_info('Cosine similarity with word2vec', 'Training', train_cos_w2v)
print_feat_info('Cosine similarity with word2vec', 'Testing', test_cos_w2v)

Training:  [0.76839412 0.64235938 0.76174991 0.73752149 0.7232033 ]
--> Mean = 0.691, Std = 0.097, Non-null ratio: 1.00

Testing:  [0.72750916 0.69251008 0.61316547 0.60472135 0.37674628]
--> Mean = 0.691, Std = 0.097, Non-null ratio: 1.00



### 3.2.3 doc2vec

In [57]:
train_tag = []
print ("Building TaggedDocuments for train")
labels = [counter[0] for counter in node_info]
total = len(corpus)
processed = 0
i = 0
for x in corpus:
    words = []
    sentences = sent_tokenize(x)
    for s in sentences:
        words.extend(word_tokenize(s)) 
    doc = TaggedDocument(words, labels[i])
    i = i+1
    train_tag.append(doc)
print ("Done")

Building TaggedDocuments for train
Done


In [58]:
model = Doc2Vec(min_count=1, window=10, size=300, sample=1e-4, negative=5, workers=2)
print ("Building Vocabulary")
model.build_vocab(train_tag)
for epoch in range(1):
    print ("Training epoch %s" % epoch)
    model.train(train_tag, total_examples=model.corpus_count , epochs=model.iter)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
    model.train(train_tag, total_examples=model.corpus_count, epochs=model.iter)
    
x_train = []
for doc_id in range(len(train_tag)):
    inferred_vector = model.infer_vector(train_tag[doc_id].words)
    x_train.append(inferred_vector)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags


Building Vocabulary


INFO:gensim.models.doc2vec:PROGRESS: at example #10000, processed 665191 words (2394513/s), 14582 word types, 10 tags
INFO:gensim.models.doc2vec:PROGRESS: at example #20000, processed 1289024 words (2356692/s), 21649 word types, 10 tags
INFO:gensim.models.doc2vec:collected 25455 word types and 10 unique tags from a corpus of 27770 examples and 1783670 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=1 retains 25455 unique words (100% of original 25455, drops 0)
INFO:gensim.models.word2vec:min_count=1 leaves 1783670 word corpus (100% of original 1783670, drops 0)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 25455 items
INFO:gensim.models.word2vec:sample=0.0001 downsamples 713 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 889654 word corpus (49.9% of prior 1783670)
INFO:gensim.models.base_any2vec:estimated required memory for 25455 words and 300 dimensions: 73833500 bytes
INFO:gensim.mode

Training epoch 0


INFO:gensim.models.base_any2vec:EPOCH 1 - PROGRESS: at 23.31% examples, 244825 words/s, in_qsize 4, out_qsize 0
INFO:gensim.models.base_any2vec:EPOCH 1 - PROGRESS: at 52.21% examples, 276827 words/s, in_qsize 3, out_qsize 0
INFO:gensim.models.base_any2vec:EPOCH 1 - PROGRESS: at 79.91% examples, 280979 words/s, in_qsize 3, out_qsize 0
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.base_any2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.base_any2vec:EPOCH - 1 : training on 1783670 raw words (1067839 effective words) took 3.7s, 291197 effective words/s
INFO:gensim.models.base_any2vec:EPOCH 2 - PROGRESS: at 26.44% examples, 276349 words/s, in_qsize 3, out_qsize 0
INFO:gensim.models.base_any2vec:EPOCH 2 - PROGRESS: at 53.39% examples, 282332 words/s, in_qsize 3, out_qsize 0
INFO:gensim.models.base_any2vec:EPOCH 2 - PROGRESS: at 82.78% examples, 289056 words/s, in_qsize 3, out_qsize 0
INFO:gensim.

In [61]:
start = time.time()

# ====== compute cosine similarity with word embeddings in training set ====== #
train_cos_d2v = np.nan_to_num(cosine_sim_text(training, x_train, True))

end = time.time()
print('Computing cosine similarity using word embedding for training features takes %.4f s' % (end-start))

Computing cosine similarity using word embedding for training features takes 27.5655 s


In [62]:
start = time.time()

# ====== compute cosine similarity with word embeddings in training set ====== #
test_cos_d2v = np.nan_to_num(cosine_sim_text(testing, x_train, True))

end = time.time()
print('Computing cosine similarity using word embedding for training features takes %.4f s' % (end-start))

Computing cosine similarity using word embedding for training features takes 1.5624 s


In [64]:
print_feat_info('Cosine similarity with doc2vec', 'Training', train_cos_d2v)
print_feat_info('Cosine similarity with doc2vec', 'Testing', test_cos_d2v)

Training:  [-0.18935569  0.84578013  0.42042828  0.58582169  0.64102918]
--> Mean = 0.482, Std = 0.193, Non-null ratio: 1.00

Testing:  [ 0.5447917   0.39170963 -0.05685651  0.49049193  0.64354986]
--> Mean = 0.483, Std = 0.193, Non-null ratio: 1.00



## 3.3 Meta-data features

### 3.3.1 - Number of overlapping words in titles

In [31]:
def overlap_title(ds):
    '''
    Compute feature: number of overlapping words in the title
    
    Parameters
    ----------
    ds: dataset to compute feature from
    
    Returns
    -------
    A numpy array of numerical values
    '''
    size = len(ds)
    overlap_title = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        src_info, dest_info = node_info[ID[src]], node_info[ID[dest]] # get the associated node information by index
        
        # collect the number of overlapping words in title
        src_title, dest_title = preprocess(src_info[2]).split(), preprocess(dest_info[2]).split()
        overlap_title[i] = len(
            set(src_title)
            .intersection(set(dest_title))
        )
        
    return overlap_title

In [32]:
start = time.time()

# compute the number of overlapping words in title
train_overlap_title = overlap_title(training)

end = time.time()
print('Computing number of overlapping words in title for training set takes %.4f s' %(end-start))

Computing number of overlapping words in title for training set takes 659.3079 s


In [33]:
start = time.time()

# compute the number of overlapping words in title
test_overlap_title = overlap_title(testing)

end = time.time()
print('Computing number of overlapping words in title for testing set takes %.4f s' %(end-start))

Computing number of overlapping words in title for testing set takes 31.8326 s


In [34]:
print_feat_info('Number of overlapping words in title', 'Training', train_overlap_title)
print_feat_info('Number of overlapping words in title', 'Testing', test_overlap_title)

Training:  [2. 1. 0. 0. 0.]
--> Mean = 0.495, Std = 0.862, Non-null ratio: 0.32

Testing:  [0. 2. 1. 1. 0.]
--> Mean = 0.491, Std = 0.865, Non-null ratio: 0.32



### 3.3.2 - Number of common authors

In [103]:
def common_authors(ds):
    '''
    Compute feature: number of common authors
    
    Parameters
    ----------
    ds: dataset to compute feature from
    
    Returns
    -------
    A numpy array
    '''
    size = len(ds)
    common_auth = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        src_info, dest_info = node_info[ID[src]], node_info[ID[dest]] # get the associated node information by index
        
        # compute the difference in publication year in absolute value (because we don't know which one cites the other)
        common_auth[i] = len(
            set(src_info[3].split(','))
            .intersection(set(dest_info[3].split(',')))
        )
        
    return common_auth

In [104]:
start = time.time()

# compute the temporal difference
train_common_auth = common_authors(training)

end = time.time()
print('Computing the number of common authors for training set takes %.4f s' %(end-start))

Computing the number of common authors for training set takes 2.7053 s


In [105]:
start = time.time()

# compute the temporal difference
test_common_auth = common_authors(testing)

end = time.time()
print('Computing the number of common authors for testing set takes %.4f s' %(end-start))

Computing the number of common authors for testing set takes 0.1831 s


In [106]:
print_feat_info('Common authors', 'Training', train_common_auth)
print_feat_info('Common authors', 'Testing', test_common_auth)

Training:  [0. 0. 0. 0. 0.]
--> Mean = 0.083, Std = 0.357, Non-null ratio: 0.06

Testing:  [0. 0. 0. 0. 0.]
--> Mean = 0.082, Std = 0.351, Non-null ratio: 0.06



### 3.3.3 - Temporal difference in publication year

In [107]:
def temporal_difference(ds):
    '''
    Compute feature: Difference in publication year
    
    Parameters
    ----------
    ds: the dataset to compute
    
    Returns
    -------
    A numpy array where each entry corresponds to the temporal difference of a pair of nodes
    '''
    size = len(ds)
    temp_diff = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        src_info, dest_info = node_info[ID[src]], node_info[ID[dest]] # get the associated node information by index
        
        # compute the difference in publication year in absolute value (because we don't know which one cites the other)
        temp_diff[i] = abs(
            int(src_info[1]) - int(dest_info[1])
        )
        
    return temp_diff

In [108]:
start = time.time()

# compute the temporal difference
train_temp_diff = temporal_difference(training)

end = time.time()
print('Computing temporal difference for training set takes %.4f s' %(end-start))

Computing temporal difference for training set takes 2.0621 s


In [109]:
start = time.time()

# compute the temporal difference
test_temp_diff = temporal_difference(testing)

end = time.time()
print('Computing temporal difference for testing set takes %.4f s' %(end-start))

Computing temporal difference for testing set takes 0.1297 s


In [110]:
print_feat_info('Temporal difference', 'Training', train_temp_diff)
print_feat_info('Temporal difference', 'Testing', test_temp_diff)

Training:  [0. 1. 2. 4. 5.]
--> Mean = 2.795, Std = 2.435, Non-null ratio: 0.85

Testing:  [0. 1. 2. 0. 5.]
--> Mean = 2.814, Std = 2.443, Non-null ratio: 0.85



### 3.3.4 - Published in the same journal

In [111]:
def same_journal(ds):
    '''
    Compute feature: whether two articles are published in the same journal
    
    Parameters
    ----------
    ds: dataset to compute feature from
    
    Returns
    -------
    A numpy array of binary values (0|1)
    '''
    size = len(ds)
    same_journal = np.zeros(size)
    
    for i in range(size):
        src, dest = ds[i][0], ds[i][1] # get the source and dest ID
        src_info, dest_info = node_info[ID[src]], node_info[ID[dest]] # get the associated node information by index
        
        # 1 if two articles are published in the same journal, 0 otherwise
        same_journal[i] = int(
            len(src_info[4])>0 and  # journal info of source not null
            len(dest_info[4])>0 and # journal info of dest not null
            src_info[4] == dest_info[4] # the same journal title
        )
        
    return same_journal

In [112]:
start = time.time()

# compute the temporal difference
train_same_journal = same_journal(training)

end = time.time()
print('Computing whether two articles are published in the same journal for training set takes %.4f s' %(end-start))

Computing whether two articles are published in the same journal for training set takes 1.7626 s


In [113]:
start = time.time()

# compute the temporal difference
test_same_journal = same_journal(testing)

end = time.time()
print('Computing whether two articles are published in the same journal for testing set takes %.4f s' %(end-start))

Computing whether two articles are published in the same journal for testing set takes 0.1241 s


In [114]:
print_feat_info('In same journal', 'Training', train_same_journal)
print_feat_info('In same journal', 'Testing', test_same_journal)

Training:  [1. 0. 0. 0. 0.]
--> Mean = 0.110, Std = 0.313, Non-null ratio: 0.11

Testing:  [0. 0. 1. 1. 0.]
--> Mean = 0.109, Std = 0.312, Non-null ratio: 0.11



# 4. Saving features

In [118]:
# list of selected features
features = [
    'common_neighbors', # 0
    'jaccard', # 1
    'adamic_adar', # 2
    'in_kcore', # 3
    'katz_centrality', # 4
    'katz_linkpred', # 5
    'max_degrees', # 6
    'max_betweenness', # 7
    'max_pagerank', # 8
    'pref_attach', # 9
    'res_alloc', # 10
    'cos_tfidf', # 11
    'cos_w2v', # 12
    'cos_d2v', # 13
    'overlap_title', # 14
    'common_authors', # 15
    'temporal_diff', # 16
    'same_journal' # 17
]

## 4.1 Saving training features

In [119]:
# ====== create array of training feature ====== #
training_features = np.array([
    train_common_neigh,
    train_jaccard_coeff,
    train_aa_coeff,
    train_in_kcore,
    train_katz,
    train_katz_linkpred,
    train_degrees,
    train_btwn,
    train_pagerank,
    train_pref_attach,
    train_res_alloc,
    train_cos_tfidf,
    train_cos_w2v,
    train_cos_d2v,
    train_overlap_title,
    train_common_auth,
    train_temp_diff,
    train_same_journal
]).T

In [120]:
# ====== Saving features (training_features) ====== #
with open(path_data + 'training_features.csv', 'w', newline='') as f:
    csv_out = csv.writer(f)
    csv_out.writerow(features)
    for row in training_features:
        csv_out.writerow(row)

## 4.2 Saving testing features

In [121]:
# ====== create array of testing features ====== #
testing_features = np.array([
    test_common_neigh,
    test_jaccard_coeff,
    test_aa_coeff,
    test_in_kcore,
    test_katz,
    test_katz_linkpred,
    test_degrees,
    test_btwn,
    test_pagerank,
    test_pref_attach,
    test_res_alloc,
    test_cos_tfidf,
    test_cos_w2v,
    test_cos_d2v,
    test_overlap_title,
    test_common_auth,
    test_temp_diff,
    test_same_journal
]).T

In [122]:
# ====== Saving features (testing_features) ====== #
with open(path_data + 'testing_features.csv', 'w', newline='') as f:
    csv_out = csv.writer(f)
    csv_out.writerow(features)
    for row in testing_features:
        csv_out.writerow(row)