# ABOUT
- This notebook trains node embeddings for every "card_id to ID_COL" pair.
- background:
    - Embeddings for the ID columns may be useful.
    - The dataset is too large to apply the original node2vec algorithm.
    - Instead, the graph is stored as a CSR (compressed sparse row) matrix and embedded with GGVec from `nodevectors`, used here as a scalable stand-in for node2vec.
- steps:
    - Convert the edge list to a CSR matrix.
    - Use GGVec to generate the embeddings.

- GGVec notes:
    - If you are using GGVec, keep order at 1. Using higher order embeddings will take quadratically more time. Additionally, keep negative_ratio low (0.05-0.1), learning_rate high (0.1), and use aggressive early stopping values. GGVec generally only needs a few (less than 100) epochs to get most of the embedding quality you need.
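
The "convert to csr matrix" step can be illustrated with a minimal pure-Python sketch. It is not what `csrgraph` does internally; the function name `edges_to_csr` and the toy node names are hypothetical, and in practice `cg.read_edgelist` or `scipy.sparse.csr_matrix` would handle this.

```python
def edges_to_csr(edges):
    """Build an undirected, unweighted CSR adjacency structure from (src, dst) pairs."""
    # Assign a dense integer index to every node name in first-seen order.
    index = {}
    for src, dst in edges:
        for name in (src, dst):
            index.setdefault(name, len(index))

    # Collect neighbours per node; store both directions because the graph is undirected.
    neighbours = [set() for _ in range(len(index))]
    for src, dst in edges:
        i, j = index[src], index[dst]
        neighbours[i].add(j)
        neighbours[j].add(i)

    # CSR layout: `indices` holds all neighbour lists back to back, and
    # indices[indptr[i]:indptr[i + 1]] is the slice belonging to node i.
    indptr, indices = [0], []
    for row in neighbours:
        indices.extend(sorted(row))
        indptr.append(len(indices))
    return index, indptr, indices


# Hypothetical toy edge list (card -> merchant), not taken from the dataset.
toy_edges = [
    ("C_ID_a", "M_ID_x"),
    ("C_ID_a", "M_ID_y"),
    ("C_ID_b", "M_ID_y"),
]
index, indptr, indices = edges_to_csr(toy_edges)
```

Only the neighbour lists and their offsets are stored, so memory grows with the number of edges rather than with the square of the number of nodes, which is why this layout suits a graph of this size.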

## load data
- The graph is loaded from an edge list and held as a CSR matrix.

In [32]:
import csrgraph as cg
import nodevectors

In [33]:
# generate graph from edge list
path = r"C:\Users\tanch\Documents\NTU\NTU Year 4\Semester 1\CZ4041 - Machine Learning\Team Project\data\edgelist for node2vec\card_id_merchant_id.csv"
G = cg.read_edgelist(path, directed=False, sep=',')

In [34]:
G.nnodes

660173

## train node2vec
- train node2vec embeddings and save

In [39]:
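ggvec_model = nodevectors.GGVec(
    learning_rate=0.1,     # high learning rate, as suggested in the GGVec notes
    n_components=32,       # embedding dimension
    negative_ratio=0.075,  # kept within the suggested 0.05-0.1 range
    max_epoch=350,         # upper bound; early stopping usually ends training sooner
    tol="auto",            # early-stopping tolerance chosen automatically
    verbose=True,          # show the progress bar and loss
)
ggvec_model.fit(G)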

Loss: 0.0218	:  47%|██████████████████████████████▊                                  | 166/350 [03:58<04:24,  1.44s/it]

Converged! Loss: 0.0218





In [40]:
path = r"C:\Users\tanch\Documents\NTU\NTU Year 4\Semester 1\CZ4041 - Machine Learning\Team Project\model\node2vec_card_id_merchant_group_id"
ggvec_model.save(path)

### here we show that connected nodes indeed have similar embeddings, i.e., a high dot product

In [54]:
# given 4 nodes: a and b are connected, and c and d are connected
from numpy import dot
a = "C_ID_0001506ef0"
b = "M_ID_19799774fc"
c = "C_ID_0002709b5a"
d = "M_ID_be730907ce"
b in G[a], d in G[a], b in G[c], d in G[c]

(True, False, False, True)

In [62]:
dot(ggvec_model.model[a], ggvec_model.model[b]), dot(ggvec_model.model[a], ggvec_model.model[d])

(0.7478426979817574, 0.16052633975867792)

In [63]:
dot(ggvec_model.model[c], ggvec_model.model[b]), dot(ggvec_model.model[c], ggvec_model.model[d])

(-0.02798832961395234, 0.8640927129767475)
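
A raw dot product also depends on the length of each vector, so a complementary check is cosine similarity, which divides the lengths out. The sketch below is a minimal pure-Python version; the helper name `cosine` and the toy vectors are hypothetical and are not values from the trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot_uv = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot_uv / norm if norm else 0.0


# Hypothetical 2-dimensional embeddings: a and b point the same way, d does not.
toy_vectors = {
    "a": [0.9, 0.1],
    "b": [0.8, 0.2],
    "d": [0.1, 0.9],
}
connected = cosine(toy_vectors["a"], toy_vectors["b"])
unconnected = cosine(toy_vectors["a"], toy_vectors["d"])
```

With the real model, the same helper could be applied to `ggvec_model.model[a]` and `ggvec_model.model[b]`; a connected pair would be expected to score higher than an unconnected one, as the dot products above already suggest.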