# Creating undirected and directed word2vec semantic networks

Load the free association network:

In [1]:
%run 'get-graph-norms.ipynb'

Load the methods used to construct the directed and undirected graphs:

In [2]:
%run 'methods-dir-graphs.ipynb'

The imports for this notebook need to be loaded after the notebooks run above:

In [3]:
import pickle
import powerlaw
import networkx as nx
import numpy as np  # used below for the threshold ranges
import os
import matplotlib.pyplot as plt

from gensim.models.keyedvectors import KeyedVectors

In [4]:
path_w2v = os.path.join(os.pardir, 'data', 'word2vec')
path_w2v_raw = os.path.join(path_w2v, 'GoogleNews-vectors-negative300.bin')

### Load word2vec data

This might take some time:

In [5]:
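# Load the pretrained GoogleNews vectors (binary word2vec format)
w2v_model = KeyedVectors.load_word2vec_format(path_w2v_raw, binary=True)
# Precompute L2-normalised vectors (gensim < 4.0 API; deprecated in gensim 4)
w2v_model.init_sims()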

Keep only the words that appear in both the word2vec vocabulary and the association network:

In [6]:
vocab = list(set(fan_vocab) & set(w2v_model.vocab))
print('Words not in w2v vocab:', len(fan_vocab)-len(vocab))

Words not in w2v vocab: 41


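Compute the similarity matrix by taking the dot product between all pairs of vectors in the vocabulary. Because the dot product is commutative, the matrix is symmetric, so each unordered pair of words only needs to be computed once (half of the matrix). The same is then done for cosine similarity: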

In [7]:
mat_inner = get_similarity_matrix(vocab, 'dot', w2v_model)

In [8]:
mat_cosine = get_similarity_matrix(vocab, 'cos', w2v_model)
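
The helper `get_similarity_matrix` is defined in `methods-dir-graphs.ipynb` and is not reproduced here. As a rough sketch of the idea only, the hypothetical function `upper_triangle_similarity` below computes all pairwise similarities with NumPy and keeps each unordered pair once. The function name, the toy vectors, and the choice to zero out the lower triangle are assumptions made for illustration; the actual helper may store its result differently.

```python
import numpy as np

def upper_triangle_similarity(vectors, method="dot"):
    """Hypothetical helper: pairwise similarities, each unordered pair once."""
    vectors = np.asarray(vectors, dtype=float)
    if method == "cos":
        # Cosine similarity is the dot product of L2-normalised vectors
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        vectors = vectors / norms
    sim = vectors @ vectors.T
    # Keep only entries with row index < column index; the diagonal
    # (self-similarity) and the mirrored lower half are set to zero
    return np.triu(sim, k=1)

# Toy data (assumption): three 2-dimensional "word vectors"
toy = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])
print(upper_triangle_similarity(toy, "dot"))
print(upper_triangle_similarity(toy, "cos"))
```

With the lower triangle zeroed, a positive threshold applied to the whole matrix picks up each word pair at most once.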

## Directed graphs

### 1. k-NN

In [9]:
dir_edges_knn_dot = get_knn_edges(mat_inner, vocab, fan_dict_k)
dir_edges_knn_cos = get_knn_edges(mat_cosine, vocab, fan_dict_k)

with open(os.path.join(path_w2v, 'w2v_fan_directed-knn-dot_edgeset.pkl'), 'wb+') as f:
    pickle.dump(dir_edges_knn_dot, f)
    
with open(os.path.join(path_w2v, 'w2v_fan_directed-knn-cos_edgeset.pkl'), 'wb+') as f:
    pickle.dump(dir_edges_knn_cos, f)
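
`get_knn_edges` also comes from `methods-dir-graphs.ipynb`, so its exact behaviour is not visible in this notebook. The sketch below shows one common way to build directed k-nearest-neighbour edges, under two assumptions: that the third argument (here `k_per_word`, by analogy with `fan_dict_k`) maps each word to the number of outgoing edges it should get, and that the similarity matrix passed in is full and symmetric. The name `knn_edges` and the toy data are hypothetical.

```python
import numpy as np

def knn_edges(sim, words, k_per_word):
    """Hypothetical sketch: link each word to its k most similar words."""
    sim = np.asarray(sim, dtype=float)
    edges = []
    for i, word in enumerate(words):
        # Never ask for more neighbours than there are other words
        k = min(k_per_word.get(word, 0), len(words) - 1)
        if k <= 0:
            continue
        row = sim[i].copy()
        row[i] = -np.inf  # exclude the word itself
        # Indices of the k largest similarities, most similar first
        top = np.argsort(-row, kind="stable")[:k]
        edges.extend((word, words[j], float(sim[i, j])) for j in top)
    return edges

# Toy data (assumption): a full symmetric similarity matrix
words = ["cat", "dog", "car"]
sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.3],
                [0.1, 0.3, 1.0]])
edges = knn_edges(sim, words, {"cat": 1, "dog": 1, "car": 1})
print(edges)
```

In the toy example "car" points to "dog" but "dog" does not point back, which is why the resulting edge set is directed.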

### 2. cs-method


The third argument of `get_cs_edges` is `R_max`. Several values were tried for each similarity measure to find the one that yields the desired average degree:

In [10]:
dir_edges_cs_dot = get_cs_edges(mat_inner, vocab, 8)

Average degree: 12.52481414506731


In [11]:
dir_edges_cs_cos = get_cs_edges(mat_cosine, vocab, 76)

Average degree: 12.595137633112317


In [12]:
with open(os.path.join(path_w2v, 'w2v_fan_directed-cs-dot_edgeset.pkl'), 'wb+') as f:
    pickle.dump(dir_edges_cs_dot, f)
    
with open(os.path.join(path_w2v, 'w2v_fan_directed-cs-cos_edgeset.pkl'), 'wb+') as f:
    pickle.dump(dir_edges_cs_cos, f)

## Undirected graphs

The threshold ranges were chosen so that the number of edges remaining after thresholding roughly matches the number of edges in the association network.
Graphs obtained with thresholds close to the chosen one are also saved.
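
One way to arrive at such a range is to count, for a handful of candidate thresholds, how many similarity values survive and compare that with a target edge count. The sketch below is only an illustration: `count_edges`, `closest_threshold`, the toy matrix, and the target of 3 edges are all hypothetical, and it assumes positive thresholds applied to a matrix that stores each pair once (zeros elsewhere).

```python
import numpy as np

def count_edges(sim, thresholds):
    """Return (threshold, number of entries >= threshold) pairs."""
    sim = np.asarray(sim, dtype=float)
    return [(float(t), int(np.count_nonzero(sim >= t))) for t in thresholds]

def closest_threshold(sim, thresholds, target_edges):
    """Pick the candidate whose edge count is nearest to target_edges."""
    counts = count_edges(sim, thresholds)
    return min(counts, key=lambda tc: abs(tc[1] - target_edges))

# Toy upper-triangular similarity matrix (assumption)
toy = np.array([[0.0, 0.9, 0.4, 0.2],
                [0.0, 0.0, 0.6, 0.5],
                [0.0, 0.0, 0.0, 0.3],
                [0.0, 0.0, 0.0, 0.0]])
candidates = [0.25, 0.45, 0.65]
print(count_edges(toy, candidates))
print(closest_threshold(toy, candidates, target_edges=3))
```

Counting entries is much cheaper than building a graph for every candidate, so a coarse sweep of this kind can narrow the range before the graphs are actually constructed and saved.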

In [13]:
undirected_settings = [
    # With float steps, np.arange may or may not include the stop value
    # (in the output below 0.39 is included, 4.22 is not)
    (np.arange(0.37, 0.39, .01), mat_cosine, 'cos'),
    (np.arange(4.19, 4.22, .01), mat_inner, 'dot')]

for th, mat, method in undirected_settings:
    print('similarity:', method)
    for t in th:
        # Collect every word pair whose similarity reaches the threshold
        edges = []
        x, y = np.where(mat >= t)
        for w1, w2 in zip(x, y):
            edge = (vocab[w1], vocab[w2], mat[w1, w2])
            edges.append(edge)

        # Only words that occur in at least one edge become nodes
        g_un = nx.Graph()
        g_un.add_weighted_edges_from(edges)

        # tau: threshold, m: number of thresholded pairs, k: average degree
        str_print = "tau: {:.2f}, m: {}, k: {:.2f}".format(
            t, len(edges), 2*nx.number_of_edges(g_un)/nx.number_of_nodes(g_un))
        print(str_print)

        save_path = 'w2v_fan_%.2f_%s_edgeset.pkl' % (t, method)

        with open(os.path.join(path_w2v, save_path), 'wb') as f:
            pickle.dump(edges, f)

similarity: cos
tau: 0.37, m: 60147, k: 24.42
tau: 0.38, m: 52317, k: 21.35
tau: 0.39, m: 45527, k: 18.67
similarity: dot
tau: 4.19, m: 44981, k: 22.32
tau: 4.20, m: 44442, k: 22.10
tau: 4.21, m: 43886, k: 21.85
