# Feature Engineering

In this notebook we're going to generate features for our link prediction classifier.

In [1]:
from neo4j import GraphDatabase

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# tag::imports[]
from sklearn.ensemble import RandomForestClassifier
# end::imports[]

In [2]:
bolt_uri = "bolt://localhost:7687"
# bolt_uri = "bolt://link-prediction-neo4j"
driver = GraphDatabase.driver(bolt_uri, auth=("neo4j", "letmein"))
#driver = GraphDatabase.driver(bolt_uri, auth=("neo4j", "admin"), max_connection_lifetime=50000)

print(driver.address)

localhost:7687


We'll start by loading the training and test sets that we saved in the previous notebook:

In [8]:
# Load the CSV files saved in the train/test notebook

df_train_under = pd.read_csv("data/df_train_under.csv")
df_test_under = pd.read_csv("data/df_test_under.csv")

In [9]:
df_train_under.sample(5)

Unnamed: 0,node1,node2,label
129619,123477,134755,1
123580,11602,117372,1
132545,140047,145863,1
45586,2726,50968,0
33395,12019,145308,0


In [10]:
df_test_under.sample(5)

Unnamed: 0,node1,node2,label
68306,195698,214156,0
60476,10640,134462,0
139485,248684,248691,1
127115,179863,227351,1
33202,151363,51148,0


# Generating graphy features

We’ll start with features for a simple model that tries to predict whether two authors will collaborate in the future. The features are extracted from common neighbors, preferential attachment, and the total union of neighbors.

The following function computes each of these measures for pairs of nodes:

#### Link Prediction algorithms

Common neighbors captures the idea that two strangers who have a friend in common are more likely to be introduced than those who don’t have any friends in common.

Preferential Attachment scores a pair of nodes by multiplying the number of neighbors that each node has. It is based on the idea that the more connected a node is, the more likely it is to receive new links.

Total Neighbors computes the closeness of nodes, based on the number of unique neighbors that they have between them, i.e. the size of the union of their neighbor sets. Like Preferential Attachment, it favors pairs of well-connected nodes.
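
To make the three measures concrete, here is a minimal sketch that computes them with plain Python sets on a toy graph. The graph, the `neighbors` dictionary, and the function names are illustrative assumptions; they are not part of this notebook's data or of the GDS library, and GDS may treat edge cases differently.

```python
# Toy undirected co-authorship graph as an adjacency map (hypothetical data).
neighbors = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}

def common_neighbors(g, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(g[u] & g[v])

def preferential_attachment(g, u, v):
    """Product of the two nodes' neighbor counts."""
    return len(g[u]) * len(g[v])

def total_neighbors(g, u, v):
    """Number of distinct nodes adjacent to u or v."""
    return len(g[u] | g[v])

scores = {
    "cn": common_neighbors(neighbors, "a", "d"),
    "pa": preferential_attachment(neighbors, "a", "d"),
    "tn": total_neighbors(neighbors, "a", "d"),
}
print(scores)  # {'cn': 1, 'pa': 2, 'tn': 2}
```

Nodes `a` and `d` share one neighbor (`b`), have 2 and 1 neighbors respectively, and have 2 distinct neighbors between them.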



In [11]:
# tag::graphy-features[]
def apply_graphy_features(data, rel_type):
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           gds.alpha.linkprediction.commonNeighbors(p1, p2, {
             relationshipQuery: $relType}) AS cn,
           gds.alpha.linkprediction.preferentialAttachment(p1, p2, {
             relationshipQuery: $relType}) AS pa,
           gds.alpha.linkprediction.totalNeighbors(p1, p2, {
             relationshipQuery: $relType}) AS tn
    """
    pairs = [{"node1": node1, "node2": node2}  for node1,node2 in data[["node1", "node2"]].values.tolist()]
    
    with driver.session(database="neo4j") as session:
        result = session.run(query, {"pairs": pairs, "relType": rel_type})
        features = pd.DataFrame([dict(record) for record in result])    
    return pd.merge(data, features, on = ["node1", "node2"])
# end::graphy-features[]

Let's apply the function to our training and test DataFrames:

In [12]:
# tag::apply-graphy-features[]
df_train_under = apply_graphy_features(df_train_under, "CO_AUTHOR_EARLY")
df_test_under = apply_graphy_features(df_test_under, "CO_AUTHOR")
# end::apply-graphy-features[]

In [20]:
# Inspect the columns added so far
df_train_under

Unnamed: 0,node1,node2,label,cn,pa,tn
0,166652,188273,0,1.000,7.000,7.000
1,65078,238563,0,0.000,33.000,14.000
2,102296,11128,0,0.000,16.000,10.000
3,72660,139273,0,0.000,52.000,17.000
4,113037,112244,0,0.000,12.000,7.000
...,...,...,...,...,...,...
162187,263882,263889,1,6.000,49.000,8.000
162188,263883,263889,1,6.000,49.000,8.000
162189,263886,263889,1,6.000,49.000,8.000
162190,20601,263890,1,0.000,10.000,11.000


Now we're going to add some new features that are generated from graph algorithms.

## Triangles and The Clustering Coefficient

We'll start by running the Triangle Count and Local Clustering Coefficient algorithms over our train and test subgraphs. Triangle Count returns the number of triangles that each node forms, and Local Clustering Coefficient returns each node's clustering coefficient. The clustering coefficient of a node indicates the likelihood that its neighbours are also connected.

### Community detection algorithms
The Triangle Count algorithm counts the number of triangles for each node in the graph. A triangle is a set of three nodes where each node has a relationship to the other two. In graph theory terminology, this is sometimes referred to as a 3-clique. The Triangle Count algorithm in the GDS library only finds triangles in undirected graphs.

The Local Clustering Coefficient algorithm computes the local clustering coefficient for each node in the graph. The local clustering coefficient Cn of a node n describes the likelihood that the neighbours of n are also connected. To compute Cn we use the number of triangles that the node is a part of, Tn, and the degree of the node, dn: Cn = 2 * Tn / (dn * (dn - 1)).
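
As a rough illustration of both definitions, the sketch below counts triangles and computes the local clustering coefficient on a small hand-made graph. The graph and helper names are assumptions for illustration only, and returning 0 for nodes with fewer than two neighbors is a convention chosen here, not something taken from the notebook.

```python
from itertools import combinations

# Hypothetical toy graph: a triangle a-b-c plus a pendant node d attached to c.
neighbors = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

def triangle_count(g, n):
    """Count the pairs of n's neighbors that are themselves connected."""
    return sum(1 for u, v in combinations(sorted(g[n]), 2) if v in g[u])

def local_clustering_coefficient(g, n):
    """C_n = 2 * T_n / (d_n * (d_n - 1)); taken as 0 when d_n < 2."""
    d = len(g[n])
    if d < 2:
        return 0.0
    return 2 * triangle_count(g, n) / (d * (d - 1))

triangles = {n: triangle_count(neighbors, n) for n in neighbors}
coefficients = {n: local_clustering_coefficient(neighbors, n) for n in neighbors}
```

Node `c` is part of one triangle but has three neighbors, so only one of its three possible neighbor pairs is connected and its coefficient is 1/3. Nodes `a` and `b` each have a coefficient of 1.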

In [22]:
# tag::train-triangles[]
query = """
CALL gds.triangleCount.write({
  nodeProjection: 'Author',
  relationshipProjection: {
    CO_AUTHOR_EARLY: {
      type: 'CO_AUTHOR_EARLY',
      orientation: 'UNDIRECTED'
    }
  },
  writeProperty: 'trianglesTrain'
});
"""

with driver.session(database="neo4j") as session:
    result = session.run(query)
# end::train-triangles[]    
    df = pd.DataFrame([dict(record) for record in result])
df

Unnamed: 0,writeMillis,nodePropertiesWritten,globalTriangleCount,nodeCount,postProcessingMillis,createMillis,computeMillis,configuration
0,20,80299,97205,80299,0,27,13,"{'writeConcurrency': 4, 'writeProperty': 'tria..."


In [23]:
# tag::test-triangles[]
query = """
CALL gds.triangleCount.write({
  nodeProjection: 'Author',
  relationshipProjection: {
    CO_AUTHOR_LATE: {
      type: 'CO_AUTHOR_LATE',
      orientation: 'UNDIRECTED'
    }
  },
  writeProperty: 'trianglesTest'
});
"""

with driver.session(database="neo4j") as session:
    result = session.run(query)
# end::test-triangles[]    
    df = pd.DataFrame([dict(record) for record in result])
df    

Unnamed: 0,writeMillis,nodePropertiesWritten,globalTriangleCount,nodeCount,postProcessingMillis,createMillis,computeMillis,configuration
0,28,80299,95413,80299,0,21,14,"{'writeConcurrency': 4, 'writeProperty': 'tria..."


In [24]:
# tag::train-coefficient[]
query = """
CALL gds.localClusteringCoefficient.write({
  nodeProjection: 'Author',
  relationshipProjection: {
    CO_AUTHOR_EARLY: {
      type: 'CO_AUTHOR_EARLY',
      orientation: 'UNDIRECTED'
    }
  },
  writeProperty: 'coefficientTrain'
});
"""

with driver.session(database="neo4j") as session:
    result = session.run(query)
# end::train-coefficient[]
    df = pd.DataFrame([dict(record) for record in result])
df

Unnamed: 0,writeMillis,nodePropertiesWritten,averageClusteringCoefficient,nodeCount,postProcessingMillis,createMillis,computeMillis,configuration
0,24,80299,0.375,80299,0,19,19,"{'writeConcurrency': 4, 'triangleCountProperty..."


In [25]:
# tag::test-coefficient[]
query = """
CALL gds.localClusteringCoefficient.write({
  nodeProjection: 'Author',
  relationshipProjection: {
    CO_AUTHOR_LATE: {
      type: 'CO_AUTHOR_LATE',
      orientation: 'UNDIRECTED'
    }
  },
  writeProperty: 'coefficientTest'
});
"""

with driver.session(database="neo4j") as session:
    result = session.run(query)
# end::test-coefficient[]
    df = pd.DataFrame([dict(record) for record in result])
df

Unnamed: 0,writeMillis,nodePropertiesWritten,averageClusteringCoefficient,nodeCount,postProcessingMillis,createMillis,computeMillis,configuration
0,16,80299,0.337,80299,0,20,14,"{'writeConcurrency': 4, 'triangleCountProperty..."


The following function will add these features to our train and test DataFrames:

In [31]:
# tag::triangles-coefficient-features[]
def apply_triangles_features(data, triangles_prop, coefficient_prop):
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
    pair.node2 AS node2,
    apoc.coll.min([p1[$trianglesProp], p2[$trianglesProp]]) AS minTriangles,
    apoc.coll.max([p1[$trianglesProp], p2[$trianglesProp]]) AS maxTriangles,
    apoc.coll.min([p1[$coefficientProp], p2[$coefficientProp]]) AS minCoefficient,
    apoc.coll.max([p1[$coefficientProp], p2[$coefficientProp]]) AS maxCoefficient
    """    
    pairs = [{"node1": node1, "node2": node2} for node1, node2 in data[["node1", "node2"]].values.tolist()]
    params = {
        "pairs": pairs,
        "trianglesProp": triangles_prop,
        "coefficientProp": coefficient_prop
    }

    with driver.session(database="neo4j") as session:
        result = session.run(query, params)
        features = pd.DataFrame([dict(record) for record in result])       
    
    return pd.merge(data, features, on = ["node1", "node2"])
# end::triangles-coefficient-features[]
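
One way to read the min/max construction above: a pair of nodes has no natural order, so taking the minimum and maximum of a per-node property gives features that don't change if `node1` and `node2` are swapped. The sketch below illustrates that with a hypothetical `triangles` lookup; the values and the `pair_min_max` helper are assumptions for illustration.

```python
# Hypothetical per-node triangle counts keyed by node id.
triangles = {101: 4, 202: 102}

def pair_min_max(prop, node1, node2):
    """Return (min, max) of a node property for an unordered pair."""
    values = [prop[node1], prop[node2]]
    return min(values), max(values)

forward = pair_min_max(triangles, 101, 202)
backward = pair_min_max(triangles, 202, 101)
print(forward, backward)  # (4, 102) (4, 102)
```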

In [32]:
# tag::apply-triangles-coefficient-features[]
df_train_under = apply_triangles_features(df_train_under, "trianglesTrain", "coefficientTrain")
df_test_under = apply_triangles_features(df_test_under, "trianglesTest", "coefficientTest")
# end::apply-triangles-coefficient-features[]

In [33]:
df_train_under.sample(5)

Unnamed: 0,node1,node2,label,cn,pa,tn,minTriangles,maxTriangles,minCoefficient,maxCoefficient
19716,129569,19172,0,1.0,120.0,25.0,4,102,0.267,0.537
73679,145736,170899,0,1.0,96.0,21.0,10,18,0.15,0.667
49869,89621,111820,0,0.0,8.0,6.0,1,6,1.0,1.0
4500,1442,232697,0,0.0,75.0,28.0,3,34,0.113,1.0
119318,107272,107275,1,3.0,16.0,5.0,6,6,1.0,1.0


In [34]:
df_test_under.sample(5)

Unnamed: 0,node1,node2,label,cn,pa,tn,minTriangles,maxTriangles,minCoefficient,maxCoefficient
142300,36667,253259,1,2.0,54.0,19.0,3,21,0.269,1.0
39492,188450,156729,0,0.0,18.0,9.0,3,11,0.733,1.0
98129,89614,145193,1,3.0,40.0,10.0,6,7,0.333,1.0
103481,58019,161116,1,6.0,224.0,33.0,12,29,0.126,0.571
66660,192250,205206,0,0.0,40.0,13.0,10,28,1.0,1.0


## Community Detection

Community detection algorithms evaluate how a group is clustered or partitioned. Nodes are considered more similar to nodes that fall in their community than to nodes in other communities.

We'll run two community detection algorithms over the train and test subgraphs: Label Propagation and Louvain. First up, Label Propagation:

### Label Propagation and Louvain

The Label Propagation algorithm (LPA) is a fast algorithm for finding communities in a graph. It detects these communities using network structure alone as its guide, and doesn’t require a pre-defined objective function or prior information about the communities.
LPA works by propagating labels throughout the network and forming communities based on this process of label propagation.
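
The propagation step can be sketched in a few lines. This is a simplified, deterministic variant written for illustration: nodes are visited in sorted order and ties go to the smallest label. Those choices, the toy graph, and the function name are assumptions, and the GDS implementation will differ in its details.

```python
from collections import Counter

# Hypothetical toy graph: two disconnected triangles.
neighbors = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1},
    3: {4, 5}, 4: {3, 5}, 5: {3, 4},
}

def label_propagation(g, max_iterations=10):
    """Each node repeatedly adopts the most common label among its neighbors."""
    labels = {n: n for n in g}  # every node starts in its own community
    for _ in range(max_iterations):
        changed = False
        for n in sorted(g):
            if not g[n]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[m] for m in g[n])
            best = max(counts.values())
            # Break ties deterministically by picking the smallest label.
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[n]:
                labels[n] = new_label
                changed = True
        if not changed:
            break  # converged: no label changed in a full pass
    return labels

labels = label_propagation(neighbors)
```

On this graph the labels settle after one pass, leaving one community per triangle.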

The Louvain method is an algorithm to detect communities in large networks. It maximizes a modularity score for each community, where the modularity quantifies the quality of an assignment of nodes to communities. This means evaluating how much more densely connected the nodes within a community are, compared to how connected they would be in a random network.
The Louvain algorithm is a hierarchical clustering algorithm that recursively merges communities into a single node and executes the modularity clustering on the condensed graphs.
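
To give a feel for the modularity score, the sketch below evaluates the standard modularity formula for an unweighted, undirected graph on a toy example. It only scores a given assignment and does not implement Louvain itself. The edge list, community labels, and function name are illustrative assumptions.

```python
# Hypothetical toy graph: two triangles joined by a single edge (2-3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
community = {0: "x", 1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}

def modularity(edges, community):
    """Q = sum over communities of (internal_edges / m) - (total_degree / 2m)^2."""
    m = len(edges)
    internal = {}  # edges with both endpoints in the community
    degree = {}    # sum of node degrees in the community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
        for c in degree
    )

q_split = modularity(edges, community)
q_single = modularity(edges, {n: "all" for n in community})
```

Splitting the graph into its two triangles gives a modularity of 5/14 (about 0.357), while putting every node into one community gives 0, so the split is the better assignment by this measure.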

In [35]:
# tag::train-lpa[]
query = """
CALL gds.labelPropagation.write({
  nodeProjection: "Author",
  relationshipProjection: {
    CO_AUTHOR_EARLY: {
      type: 'CO_AUTHOR_EARLY',
      orientation: 'UNDIRECTED'
    }
  },
  writeProperty: "partitionTrain"
});
"""

with driver.session(database="neo4j") as session:
    result = session.run(query)
# end::train-lpa[]
    df = pd.DataFrame([dict(record) for record in result])
df    

Unnamed: 0,writeMillis,nodePropertiesWritten,ranIterations,didConverge,communityCount,communityDistribution,postProcessingMillis,createMillis,computeMillis,configuration
0,305,80299,9,True,43748,"{'p99': 15, 'min': 1, 'max': 247, 'mean': 1.83...",41,45,188,"{'maxIterations': 10, 'writeConcurrency': 4, '..."


In [36]:
# tag::test-lpa[]
query = """
CALL gds.labelPropagation.write({
  nodeProjection: "Author",
  relationshipProjection: {
    CO_AUTHOR_LATE: {
      type: 'CO_AUTHOR_LATE',
      orientation: 'UNDIRECTED'
    }
  },
  writeProperty: "partitionTest"
});
"""

with driver.session(database="neo4j") as session:
    result = session.run(query)
# end::test-lpa[]    
    df = pd.DataFrame([dict(record) for record in result])
df    

Unnamed: 0,writeMillis,nodePropertiesWritten,ranIterations,didConverge,communityCount,communityDistribution,postProcessingMillis,createMillis,computeMillis,configuration
0,212,80299,7,True,48500,"{'p99': 12, 'min': 1, 'max': 562, 'mean': 1.65...",23,32,63,"{'maxIterations': 10, 'writeConcurrency': 4, '..."


And now Louvain. The Louvain algorithm returns intermediate communities, which are useful for finding the fine-grained communities that exist in a graph. We'll add a property to each node containing the community revealed on the first iteration of the algorithm:

In [37]:
# tag::train-louvain[]
query = """
CALL gds.louvain.stream({
  nodeProjection: 'Author',
  relationshipProjection: {
    CO_AUTHOR_EARLY: {
      type: 'CO_AUTHOR_EARLY',
      orientation: 'UNDIRECTED'
    }
  },
  includeIntermediateCommunities: true
})
YIELD nodeId, communityId, intermediateCommunityIds
WITH gds.util.asNode(nodeId) AS node, intermediateCommunityIds[0] AS smallestCommunity
SET node.louvainTrain = smallestCommunity;
"""

with driver.session(database="neo4j") as session:
    display(session.run(query).consume().counters)
# end::train-louvain[]    

{'properties_set': 80299}

In [38]:
# tag::test-louvain[]
query = """
CALL gds.louvain.stream({
  nodeProjection: 'Author',
  relationshipProjection: {
    CO_AUTHOR_LATE: {
      type: 'CO_AUTHOR_LATE',
      orientation: 'UNDIRECTED'
    }
  },
  includeIntermediateCommunities: true
})
YIELD nodeId, communityId, intermediateCommunityIds
WITH gds.util.asNode(nodeId) AS node, intermediateCommunityIds[0] AS smallestCommunity
SET node.louvainTest = smallestCommunity;
"""

with driver.session(database="neo4j") as session:
    display(session.run(query).consume().counters)
# end::test-louvain[]    

{'properties_set': 80299}

In [39]:
# tag::community-features[]
def apply_community_features(data, partition_prop, louvain_prop):
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
    pair.node2 AS node2,
    gds.alpha.linkprediction.sameCommunity(p1, p2, $partitionProp) AS sp,    
    gds.alpha.linkprediction.sameCommunity(p1, p2, $louvainProp) AS sl
    """
    pairs = [{"node1": node1, "node2": node2} for node1, node2 in data[["node1", "node2"]].values.tolist()]
    params = {
        "pairs": pairs,
        "partitionProp": partition_prop,
        "louvainProp": louvain_prop
    }
    
    with driver.session(database="neo4j") as session:
        result = session.run(query, params)
        features = pd.DataFrame([dict(record) for record in result])
    
    return pd.merge(data, features, on = ["node1", "node2"])
# end::community-features[]
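
The `sp` and `sl` columns are binary features: 1.0 when both nodes carry the same community value and 0.0 otherwise, as the sample output further down suggests. A plain-Python sketch of that idea follows. The `partition` dictionary and the `same_community` helper are hypothetical, and treating a node without a community as matching nothing is an assumption made here.

```python
# Hypothetical community assignments keyed by node id.
partition = {1: 10, 2: 10, 3: 42}

def same_community(partition, node1, node2):
    """Return 1.0 if both nodes have the same known community, else 0.0."""
    c1 = partition.get(node1)
    c2 = partition.get(node2)
    return 1.0 if c1 is not None and c1 == c2 else 0.0

pairs = [(1, 2), (1, 3), (3, 4)]
features = [same_community(partition, a, b) for a, b in pairs]
print(features)  # [1.0, 0.0, 0.0]
```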

In [40]:
# tag::apply-community-features[]
df_train_under = apply_community_features(df_train_under, "partitionTrain", "louvainTrain")
df_test_under = apply_community_features(df_test_under, "partitionTest", "louvainTest")
# end::apply-community-features[]

In [43]:
# tag::train-after-features[]
df_train_under.drop(columns=["node1", "node2"]).sample(5, random_state=42)
# end::train-after-features[]

Unnamed: 0,label,cn,pa,tn,minTriangles,maxTriangles,minCoefficient,maxCoefficient,sp,sl
116159,1,4.0,25.0,6.0,10,10,1.0,1.0,1.0,1.0
107109,1,2.0,12.0,5.0,3,3,0.5,1.0,1.0,1.0
82258,1,2.0,9.0,4.0,3,3,1.0,1.0,1.0,1.0
79773,0,0.0,308.0,36.0,24,26,0.113,0.264,0.0,0.0
42367,0,1.0,15.0,7.0,3,10,1.0,1.0,0.0,0.0


In [44]:
# tag::test-after-features[]
df_test_under.drop(columns=["node1", "node2"]).sample(5, random_state=42)
# end::test-after-features[]

Unnamed: 0,label,cn,pa,tn,minTriangles,maxTriangles,minCoefficient,maxCoefficient,sp,sl
35910,0,0.0,17.0,18.0,0,42,0.0,0.309,0.0,0.0
98339,1,3.0,910.0,58.0,30,69,0.116,0.13,0.0,0.0
147201,1,1.0,6.0,4.0,0,0,0.0,0.0,1.0,1.0
126570,1,6.0,380.0,33.0,39,40,0.287,0.333,1.0,0.0
124563,1,2.0,15.0,6.0,2,3,0.3,0.667,1.0,1.0


In [48]:
# Sort the columns alphabetically, then move label to the end
df_train_under = df_train_under.reindex(columns=sorted(df_train_under.columns))
df_train_under = df_train_under.reindex(columns=[c for c in df_train_under.columns if c != 'label'] + ['label'])

df_test_under = df_test_under.reindex(columns=sorted(df_test_under.columns))
df_test_under = df_test_under.reindex(columns=[c for c in df_test_under.columns if c != 'label'] + ['label'])


# Save our DataFrames to CSV files for use in the next notebook

df_train_under.to_csv("data/df_train_under_all.csv", index=False)
df_test_under.to_csv("data/df_test_under_all.csv", index=False)

# df_train_under = pd.read_csv("data/df_train_under_all.csv")
# df_test_under = pd.read_csv("data/df_test_under_all.csv")

# Save the samples as CSV files as well
df_train_under.drop(columns=["node1", "node2"]).sample(5, random_state=42).to_csv("data/df_train_under_sample.csv", index=False, float_format='%g')
df_test_under.drop(columns=["node1", "node2"]).sample(5, random_state=42).to_csv("data/df_test_under_sample.csv", index=False, float_format='%g')