<a href="https://colab.research.google.com/github/ale0xb/keywords-vis/blob/master/keywords_vis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cross-domain Visual Exploration of Academic Corpora via the Latent Meaning of User-authored Keywords (Full paper --> https://osf.io/h29qv/)
## (Alejandro Benito, Roberto Therón) http://vis.usal.es
## Abstract
Nowadays, scholars dedicate a substantial amount of their work to querying and browsing increasingly large collections 
of research papers on the Internet. In parallel, the recent surge of novel interdisciplinary approaches in science requires 
scholars to acquire competencies in new fields for which they may lack the necessary vocabulary to formulate adequate queries. 
This problem, together with the issue of information overload, poses new challenges in the fields of natural language processing
(NLP) and visualization design that call for a rapid response from the scientific community. In this respect, we report on a novel
visualization scheme that enables the exploration of research paper collections via the analysis of semantic proximity relationships
found in author-assigned keywords. Our proposal replaces traditional string queries with a bag-of-words (BoW) extracted from a 
user-generated auxiliary corpus that captures the intentionality of the research. Continuing on the line established by previous 
works, we combine novel advances in the fields of NLP with visual network analysis techniques to offer scholars a perspective of 
the target corpus that better fits their research needs. To highlight the advantages of our proposal, we conduct two experiments 
employing a collection of visualization research papers and an auxiliary cross-domain BoW. Here, we showcase how our visualization 
can be used to maximize the effectiveness of a browsing session by enhancing the language acquisition task, which allows an effective 
extraction of knowledge that is in line with the users’ previous expectations.

## Google Colab only
Uncomment the following two cells if you're running this notebook in Google Colab.

In [1]:
#!pip install numpy==1.16.3 scipy==1.2.1 nltk==3.2.5 scikit-learn==0.20.3 iteration_utilities==0.7.0 networkx==2.3

In [2]:
#!mkdir datasets
#!mkdir model
#!curl -O https://raw.githubusercontent.com/ale0xb/keywords-vis/master/datasets/dh_papers-complete.json 
#!curl -O https://raw.githubusercontent.com/ale0xb/keywords-vis/master/datasets/vispubdata_papers-2018-complete.json
#!curl -O https://raw.githubusercontent.com/ale0xb/keywords-vis/master/model/all-paths.pkl
  
#!mv dh_papers-complete.json ./datasets
#!mv vispubdata_papers-2018-complete.json ./datasets
#!mv all-paths.pkl ./model

## Imports

In [3]:
import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()


import scipy
import scipy.sparse as sps
from scipy.spatial.distance import pdist, squareform
from scipy.spatial import ConvexHull
from scipy import stats

from collections import Counter


from sklearn.preprocessing import normalize



import shutil
import pickle
import itertools
import copy
from iteration_utilities import flatten
import json

import networkx as nx
from networkx.readwrite import json_graph

[nltk_data] Downloading package stopwords to /Users/alexb/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Stem, unigrams, skipgrams

# Main method
Our document exploration method comprises two main phases: The first involves all the necessary steps to generate a keyword-to-keyword similarity matrix from a latent semantic analysis of the corpus. The second phase focuses on the querying, filtering and visualization of this similarity matrix. As we introduced in previous sections, our method aims to remove the need to provide a textual query to extract knowledge from a given target corpus $C_{t}$ by relying instead on an auxiliary user-generated query corpus $C_{q}$. This distinction allows us to form two bags of words (BoW), formed by the keyword associations found in the query and target corpora, which are used as the two main inputs of our scheme. 

The query corpus can be freely composed from the user's reference manager or from any other source she or he considers relevant to study. Under this assumption, we expect the user to be familiar with the language of the query dataset whereas the target corpus is to be explored. At the end of the process, our method allows the user to query the target corpus by using keywords exclusive to the query corpus, effectively skipping the need for a language acquisition stage which may be highly time-consuming.

# Datasets

Before we continue to explain our proposed visualization scheme, in this section we comment on the two document collections that were employed during our experiments. In the first sections, we discussed some of the problems related to the selection of an appropriate query string during the extraction phase of mapping studies and literature reviews, which we aim to address in this work. To this end, we replace this query string with a bag-of-keywords (BoW) obtained from an auxiliary corpus. This first BoW represents the intentionality of the research; that is, it provides a high-level semantic expression that is representative of the kind of knowledge the researcher is interested in extracting from the target corpus. We construct this hypothetical situation in the context of two inherently interdisciplinary bodies of knowledge, Digital Humanities (DH) and visualization, which we introduce below:

## Query Corpus: Digital Humanities Visualization Papers

DH is an interdisciplinary area of scholarship in which computational methods are applied to the resolution of research questions related to traditional humanities disciplines such as history, philosophy, linguistics, literature, art, archaeology, music, cultural studies and social sciences. This process usually involves the "application of developed computational methods"<cite data-cite="5825190/EPG46MXF"></cite> in a variety of fields of computer science, such as topic modeling, digital mapping, text mining, information retrieval, digital publishing or visualization, in "novel and unexpected ways" <cite data-cite="5825190/EPG46MXF"></cite>. In recent years, visualization in particular has become a hot topic in DH, as evidenced by the increasing number of visualization-related submissions to the annual DH conference. 

This surge has also had an impact on the visualization community, who have turned their attention to the DH as a vibrant new area of application for novel visualization techniques. An excellent example of this recent interest is the Workshop on Visualization for the DH [VIS4DH](http://vis4dh.dbvis.de/), which has taken place as a parallel session to the IEEE Vis Conference since its first edition in 2016. One of the recurrent discussions of this workshop has orbited around the idea of how to produce significant visualization advances in the context of the DH. Whereas visualization techniques have been showcased in a large number of computing problems related to the humanities, some authors have warned of an increasing tendency in the DH visualization community to apply standard visualization techniques (such as force-directed graph layouts or word clouds) to the resolution of intrinsically distinct research questions. This tendency, as these authors note, might be impeding the production of valuable visualization research in the humanities <cite data-cite="5825190/SFVYGKHS"></cite> <cite data-cite="5825190/GGZM7TH4"></cite> and therefore they stress the need to incorporate appropriate methodologies and evaluation techniques into the design process of the humanities.

### Description
According to the context presented in the previous paragraphs, the first dataset was constructed from metadata describing papers published in the DH conferences between the years 2015 and 2018 <cite data-cite="5825190/2GKEGQL9"></cite>. Given the broad range of themes present in this conference, we limited our search to papers that fell in the domain of visualization, i.e., papers that contained the word "visualization" in their title, subject or any of their keywords. We also completed this data with author-keyword associations extracted from papers presented at the three editions of the Workshop on Visualization for the DH held between 2016 and 2018. This composition ensures that we have a varied and rich BoW to query a larger, general-purpose target corpus. 

**The humanities-visualization dataset accounts for 257 documents, containing 728 unique keywords that appear a total of 1131 times, which gives an average of 4.40 keywords per paper.**


In [4]:
with open('./datasets/dh_papers-complete.json') as f:  # close the file handle after reading
    dh_papers = json.load(f)

## Target Corpus: Data Visualization Research Papers
The second document collection is related to the general topic of visualization. Visualization is a major research theme in computer science that relates to the generation of graphics, diagrams, images and animations that help to enhance the comprehensibility of the underlying data and computational algorithms at play in a broad range of computer-related domains. For these reasons, visualization research papers provide a rich and varied set of keyword associations to explore and to connect to other knowledge domains (e.g. the humanities).
The dataset comprises metadata from more than 3000 research papers presented at the IEEE Visualization set of conferences (InfoVis, SciVis, VAST and Vis) from 1990 to 2018, and it was recently compiled by a group of experts in visualization <cite data-cite="5825190/NKZVXPYC"></cite>. The dataset is publicly accessible at https://vispubdata.org and is actively maintained and updated by its authors. Data visualization research papers represent a rich corpus with multiple connections to other fields of modern science such as astronomy, sports, humanities, biology and machine learning, among others. 

**To date, the dataset contains 3102 research papers, of which 2123 contain author keywords. The number of unique keywords in this dataset is 5108, appearing a total of 9877 times, which results in an average of 4.64 keywords per paper.**

In [5]:
with open('./datasets/vispubdata_papers-2018-complete.json') as f:  # close the file handle after reading
    vis_papers = json.load(f)

## Preprocessing
Prior to applying the semantic model to our data, we perform tokenization and stemming on the author-assigned keywords. In the tokenization process we split each multi-term keyword into its constituent parts, which are then stemmed and ultimately added to the BoW. Note that tokens appearing twice or more in the same document were counted as one. 

We noticed that in our case the inclusion of these two pre-processing techniques was highly beneficial for the following reasons. The first and most obvious is that it provides an automated way to match a high number of linguistic variations of the same concept (e.g. singular and plural), a circumstance that, unlike in keyword taxonomies, can be observed in uncontrolled keywords due to their closer proximity to natural language. Second, it allows the detection and subsequent removal of embedded stop words; that is, words that do not carry any real meaning in the context of the collection and that might not appear on their own in the corpus. 

Take for example the multi-term keywords _"visual document analysis"_ and _"visual citation analysis"_: Although at a high level these two concepts are clearly related (because they represent two specializations of visual analysis), drawing a clearer distinction between them might not be immediately obvious if they are found in a corpus related to visual analytics. In this case, the particles "visual" and "analysis" can be interpreted as noise because they do not add value to our understanding of the contents of the corpus. On the other hand, all three particles could carry important significance in other contexts.
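
The effect described above can be illustrated with a toy sketch. It is not the notebook's implementation (which follows in the next cells and relies on NLTK's Porter stemmer): `toy_stem`, `STOP_TOKENS` and `preprocess` are hypothetical names introduced only for this example, and the "stemmer" merely strips a trailing plural "s".

```python
# Hypothetical corpus-specific noise terms, taken from the example above.
STOP_TOKENS = {"visual", "analysis"}


def toy_stem(token):
    """Crude stand-in for a real stemmer: strips a trailing plural 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def preprocess(keywords):
    """Split multi-term keywords, drop noise tokens, stem, and keep each stem once."""
    seen, stems = set(), []
    for keyword in keywords:
        for token in keyword.lower().split():
            if token in STOP_TOKENS:
                continue  # embedded stop word: carries no meaning in this corpus
            stem = toy_stem(token)
            if stem in seen:
                continue  # tokens repeated within a document count once
            seen.add(stem)
            stems.append(stem)
    return stems


doc_a = preprocess(["visual document analysis", "documents"])
doc_b = preprocess(["visual citation analysis", "citation networks"])
print(doc_a)
print(doc_b)
```

Once the shared particles are removed, the two documents are described only by the tokens that actually distinguish them, and the singular and plural forms collapse into one stem.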

In [6]:
def remove_duplicates(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]
  
def stem_keywords(keywords):
    terms_dict = {}
    keywords_stemmed = []
    stems_dictionary = {}
    stems_dictionary_inv = {}

    for doc in keywords:
        doc_stemmed = []
        for k in doc:
            for tok in k.split():
                if tok in stop_words:
                    continue
        
                stem_tok = stemmer.stem(tok)

                if stem_tok in stop_stems:
                    continue
                    
                if stem_tok not in stems_dictionary:
                    stems_dictionary[stem_tok] = [tok]
                elif tok not in stems_dictionary[stem_tok]:
                    stems_dictionary[stem_tok].append(tok)
          
                if tok not in stems_dictionary_inv:  # key on the token, not on the full keyword
                    stems_dictionary_inv[tok] = [stem_tok]
                elif stem_tok not in stems_dictionary_inv[tok]:
                    stems_dictionary_inv[tok].append(stem_tok)

                doc_stemmed.append(stem_tok)
        keywords_stemmed.append(remove_duplicates(doc_stemmed))

    return(keywords_stemmed, stems_dictionary, stems_dictionary_inv)

In [7]:
def trace_stem(keyword_stem):
    if keyword_stem in trace_stem.intersection_keywords_set:
        return 'link'
    elif keyword_stem in trace_stem.source_keywords_set:
        return 'query'
    elif keyword_stem in trace_stem.dest_keywords_set:
        return 'target'
    else:
        raise RuntimeError('Error: keyword %s could not be traced' % keyword_stem)

In [8]:
def trace_color(word, font_size, position, orientation, random_state=None, **kwargs):
    trace = trace_stem(stemmer.stem(word))
    if trace == 'query':
        trace_color = '#d95f02'
    elif trace == 'link':
        trace_color = '#7570b3'
    else:
        trace_color = '#1b9e77'
    return trace_color

In [9]:
stop_stems = ['visual', 'digit', 'human', 'humanit', 'humanist']

In [10]:
dh_papers_offset = len(dh_papers) - 1
dh_keywords = [t['keywords'] for t in dh_papers]
vis_keywords = [t['keywords'] for t in vis_papers]

all_papers = dh_papers + vis_papers

keywords = dh_keywords + vis_keywords
keywords_flat = list(flatten(keywords))


dh_docs_stemmed, dh_stems_dictionary, dh_stems_dictionary_inv = stem_keywords(dh_keywords)
vis_docs_stemmed, vis_stems_dictionary, vis_stems_dictionary_inv = stem_keywords(vis_keywords)

keywords_stemmed = dh_docs_stemmed + vis_docs_stemmed

stems_dictionary = {}
for key in (dh_stems_dictionary.keys() | vis_stems_dictionary.keys()):
    if key in dh_stems_dictionary: stems_dictionary.setdefault(key, []).append(dh_stems_dictionary[key])
    if key in vis_stems_dictionary: stems_dictionary.setdefault(key, []).append(vis_stems_dictionary[key])


for k,v in stems_dictionary.copy().items():
    stems_dictionary[k] = remove_duplicates(list(flatten(v)))
  
  
stems_dictionary_inv = {**dh_stems_dictionary_inv, **vis_stems_dictionary_inv}



source_keywords_set = remove_duplicates(list(flatten(dh_docs_stemmed)))
dest_keywords_set = remove_duplicates(list(flatten(vis_docs_stemmed)))


intersection_keywords_set = [k for k in source_keywords_set if k in dest_keywords_set]

source_keywords_set = [k for k in source_keywords_set if k not in intersection_keywords_set]
dest_keywords_set = [k for k in dest_keywords_set if k not in intersection_keywords_set]

all_keywords_unique = source_keywords_set + intersection_keywords_set + dest_keywords_set

trace_stem.source_keywords_set = source_keywords_set
trace_stem.intersection_keywords_set = intersection_keywords_set
trace_stem.dest_keywords_set = dest_keywords_set


keyword2indx = {k:v for v,k in enumerate(all_keywords_unique)}
indx2keyword = {indx:keyword for keyword,indx in keyword2indx.items()}


print("There are %s source documents and %s destination documents" % (len(dh_docs_stemmed), len(vis_docs_stemmed)))
print('Vocabulary size is %s: %s source stems, %s intersection stems and %s destination stems' % 
  (len(all_keywords_unique), len(source_keywords_set), len(intersection_keywords_set), len(dest_keywords_set)))


There are 257 source documents and 2123 destination documents
Vocabulary size is 2720: 257 source stems, 320 intersection stems and 2143 destination stems


# Building counts

Once keywords have been tokenized and stemmed, the next step of our method relies on counting the number of times each unique token appears in the query and target BoWs. Similarly, we calculate skipgram counts in order to measure the number of times two tokens are seen together. The skipgram counts are employed to construct an $N \times N$ sparse matrix in which each cell represents the absolute count of observed associations between any two given tokens. At this stage, a binary term-document sparse matrix $T$ is also created. This binary matrix is employed in the last step of the method to project the results onto document space and produce a set of paper recommendations. 

With the full vocabulary $V_{g}$ we build a square term-context frequency matrix $F\in\mathbb{R}^{|V_{g}| \times |V_{g}|}$ and a binary term-document matrix $B\in\mathbb{B}^{|V_{g}| \times |C_{g}|}$. The term-context frequency matrix captures how many times two terms appear together in the corpus. Following <cite data-cite="5825190/WEDH9C6G"></cite>, this translates into $\#(w,c) \cdot |C_{g}|$. 

Finally, we retain the provenance of each token by indexing the square matrix $\mathbf{M}$ in the following manner:

$
  M_{i} =
    \begin{cases}
      0 \leq i < |V_{q}|  & \iff M_{i} \in V_{q} \\
      |V_{q}| \leq i < |V_{q}| + |V_{l}|  & \iff M_{i} \in V_{l} \\
      |V_{q}| + |V_{l}| \leq i < |V_{g}|  & \iff M_{i} \in V_{t}
    \end{cases}
$
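
The index layout above can be sketched as a small lookup function. This is only an illustration: `provenance` is a hypothetical helper, the labels mirror those returned by `trace_stem`, and the block sizes are the ones printed by the preprocessing cell (257 query-only stems, 320 shared stems and 2143 target-only stems).

```python
def provenance(i, n_query, n_link, n_target):
    """Return which vocabulary block the matrix row/column index i belongs to."""
    total = n_query + n_link + n_target
    if not 0 <= i < total:
        raise IndexError(f"index {i} is outside the vocabulary (size {total})")
    if i < n_query:
        return "query"   # 0 <= i < |V_q|
    if i < n_query + n_link:
        return "link"    # |V_q| <= i < |V_q| + |V_l|
    return "target"      # |V_q| + |V_l| <= i < |V_g|


sizes = (257, 320, 2143)
for idx in (0, 256, 257, 576, 577, 2719):
    print(idx, provenance(idx, *sizes))
```

Because the blocks are contiguous, the provenance of a token is recoverable from its index alone, without storing a per-token label.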

In [11]:
def build_unigrams(keywords, verbose=False):
    unigram_counts = Counter()
  
    for ii, doc in enumerate(keywords):
        if ii % 2000 == 0 and verbose:
            print(f'finished {ii/len(keywords):.2%} of papers')
        for keyword in doc:
            unigram_counts[keyword] += 1
            
    print('done')
    print('vocabulary size: {}'.format(len(unigram_counts)))
    print('most common: {}'.format(unigram_counts.most_common(10)))
  
    return unigram_counts

In [12]:
def build_skipgrams(keywords_stemmed, verbose=False):
    n_docs = len(keywords_stemmed)
    skipgram_counts = Counter()
    for idoc, doc in enumerate(keywords_stemmed):
        for keyword_a, keyword_b in itertools.combinations(doc, 2):
            skipgram_a = (keyword_a, keyword_b)
            skipgram_b = (keyword_b, keyword_a)
            skipgram_counts[skipgram_a] += 1
            skipgram_counts[skipgram_b] += 1
        if idoc % 1000 == 0 and verbose:
            print(f'finished {idoc/n_docs:.2%} of documents')

    print('done')
    print('number of skipgrams: {}'.format(len(skipgram_counts)))
    print('most common: {}'.format(skipgram_counts.most_common(20)))

    return skipgram_counts

In [13]:
#Term-Document Matrix
def build_td_mat(keywords, keyword2indx):
    doc_indxs = []
    keyword_indxs = []
    tdat_values = []
    for ii, doc in enumerate(keywords):
        for keyword in doc:
            doc_indxs.append(ii)
            keyword_indxs.append(keyword2indx[keyword])
            tdat_values.append(True)

    return sps.csr_matrix((tdat_values, (doc_indxs, keyword_indxs)))

In [14]:
def build_wordcounts(skipgram_counts, keyword2indx, verbose=False):
    row_indxs = []
    col_indxs = []
    dat_values = []
    ii = 0
    for (keyword1, keyword2), sg_count in skipgram_counts.items():
        ii += 1
        if ii % 1000 == 0 and verbose:
            print(f'finished {ii/len(skipgram_counts):.2%} of skipgrams')
        key1_indx = keyword2indx[keyword1]
        key2_indx = keyword2indx[keyword2]
        
        row_indxs.append(key1_indx)
        col_indxs.append(key2_indx)
        dat_values.append(sg_count)
    
    wwcnt_mat = sps.csr_matrix((dat_values, (row_indxs, col_indxs)))
    print('wwcnt sparse matrix was built (%d, %d)' % wwcnt_mat.shape)
    return wwcnt_mat

In [15]:
unigram_counts = build_unigrams(keywords_stemmed)
tf_bin_mat = build_td_mat(keywords_stemmed, keyword2indx)
tf_bin_list = tf_bin_mat.toarray().tolist()

skipgram_counts = build_skipgrams(keywords_stemmed)
wwcnt_mat = build_wordcounts(skipgram_counts, keyword2indx)

done
vocabulary size: 2720
most common: [('data', 492), ('analysi', 316), ('interact', 299), ('volum', 285), ('render', 269), ('inform', 259), ('analyt', 237), ('model', 190), ('design', 170), ('graph', 145)]
done
number of skipgrams: 84158
most common: [(('volum', 'render'), 207), (('render', 'volum'), 207), (('data', 'analysi'), 104), (('analysi', 'data'), 104), (('interact', 'data'), 60), (('data', 'interact'), 60), (('field', 'vector'), 55), (('vector', 'field'), 55), (('data', 'analyt'), 52), (('analyt', 'data'), 52), (('analyt', 'analysi'), 52), (('analysi', 'analyt'), 52), (('interact', 'analyt'), 51), (('analyt', 'interact'), 51), (('analysi', 'interact'), 48), (('interact', 'analysi'), 48), (('design', 'studi'), 48), (('studi', 'design'), 48), (('interact', 'inform'), 47), (('inform', 'interact'), 47)]
wwcnt sparse matrix was built (2720, 2720)


# Latent Semantic Analysis
We selected Latent Semantic Analysis (LSA), a distributional semantic model (DSM), to define a semantic space of keywords.

LSA is a theory of language and distributional semantic model that extracts and represents the contextual-usage meaning of 
words by applying statistical calculations to a corpus of text <cite data-cite="5825190/B4LPB8XA"></cite>. LSA (or LSI, for Latent Semantic Indexing, as it is known 
in the information retrieval community) assumes that the occurring patterns of words in a variety of contexts are able 
to determine the degree of similarity among such words <cite data-cite="5825190/YJCC8EDY"></cite>. 

LSA is a fully unsupervised method that, unlike the case of predictive semantic models, does not employ any knowledge base 
or human-generated dictionary. Rather, it relies solely on the analysis of raw text. Because LSA originated in the psychology community, 
since its beginnings it was thoroughly evaluated to measure its accuracy in replicating human judgments of meaning similarity <cite data-cite="5825190/U2RP32HP"></cite>. The 
similarity estimates derived by LSA are not based on simple contiguity frequencies or co-occurrence, but rather they depend on a 
deeper statistical analysis that extracts the underlying semantics from a corpus. 

This kind of analysis has the positive effect of producing results that are conceptually similar in meaning to a given query term even if these results do not share specific words with the search criteria. Beyond that, some authors have stressed the role of LSA as a fundamental computational theory of the acquisition and representation of knowledge that is closely related to the inductive property of learning, for which people seem to acquire much more knowledge than appears to be available in experience <cite data-cite="5825190/SD9XSNXL"></cite>.

## Singular Value Decomposition
To produce a semantic analysis of the words in a corpus, LSA makes use of a well-known linear algebra matrix decomposition method called Singular Value Decomposition (SVD) that we briefly summarize for the reader hereafter: 

SVD decomposes a given matrix $M$ into the product of three matrices $U \Sigma V^{T}$, where $U$ and $V$ have orthonormal columns 
($U^{T} U = V^{T} V = I$) and $\Sigma$ is a diagonal matrix holding the singular values of $M$ sorted in decreasing order; the number of non-zero singular values equals the rank $r$ of the input matrix. 

Let $\Sigma_{k}$, where $k < r$, be the diagonal matrix formed by the first $k$ singular values of $\Sigma$, and let $U_{k}$ and $V_{k}$ be the matrices that result from keeping only the first $k$ columns of $U$ and $V$. The matrix $\hat{M} = U_{k} \Sigma_{k} V^T_{k}$ is the rank-$k$ matrix that minimizes the Frobenius norm of the difference between the input matrix $M$ and any matrix of rank at most $k$, i.e. $\hat{M} \in \arg\min_{\operatorname{rank}(X) \leq k} \Vert M - X\Vert_{F}$. That is, the resulting matrix is the best rank-$k$ approximation to the original in the least-squares sense.
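
As a minimal sketch of this truncation, the following example applies a dense SVD from NumPy to a small hand-made matrix (the notebook itself uses a sparse solver on the real data). The matrix values are invented for illustration; the check at the end relies on the known fact that the Frobenius error of the rank-$k$ truncation equals the norm of the discarded singular values.

```python
import numpy as np

# Small symmetric matrix standing in for the co-occurrence statistics M (made-up values).
M = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
])

# Dense SVD; NumPy returns the singular values in descending order.
U, s, Vt = np.linalg.svd(M)

k = 2
# Keep only the first k singular triplets to obtain the rank-k approximation.
M_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius error of the truncation versus the norm of the discarded singular values.
frob_error = np.linalg.norm(M - M_hat, ord="fro")
discarded = np.sqrt(np.sum(s[k:] ** 2))
print(np.isclose(frob_error, discarded))
```

In the notebook, the rows of the truncated $U$ are the quantities that end up being used as keyword vectors.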


In [16]:
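from scipy.sparse.linalg import svds  # explicit import: `sps.linalg` is not guaranteed to be loaded

def svd_reduce(pmi_use, embedding_size):
    print('Will reduce matrix of shape (%d, %d) with embedding_size=%d' %(pmi_use.shape[0], pmi_use.shape[1], embedding_size))
    # Truncated SVD: returns (U, s, Vt) for the `embedding_size` largest singular values
    return svds(pmi_use, k=embedding_size)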

In [17]:
def generate_wordvecs(uu):
    word_vecs = uu
    word_vecs_norm = normalize(uu, norm='l2')
    return (word_vecs, word_vecs_norm)

# Pointwise Mutual Information

In previous sections, we explained that LSA extracts latent semantics by factorizing a matrix of co-occurrence statistics $\mathbf{M}$. This matrix can be built via different methods, such as term frequency (TF) or term frequency-inverse document frequency (TF-IDF). In our case, we detected that narrow-domain corpora produce a large overlap of insignificant words (noise) that we wanted to eliminate. To this end, we relied on a well-known metric of information science, Pointwise Mutual Information (PMI) <cite data-cite="5825190/GEINZ3ZS"></cite>, because (1) it provides an efficient way to remove repetitive terms from the analysis and (2) when used in conjunction with LSA/SVD, it is capable of generating linguistic models that excel at distributional similarity tasks <cite data-cite="5825190/AGWC3HVT"></cite>. 

The usage of the smoothed PPMI matrix in LSA favors the detection of infrequent and informative relationships occurring in the high-dimensional semantic space over uninformative terms. This feature helps to provide a view of the target corpus that is based on the specifics of the user-generated query corpus and to identify keyword pairs that share a common latent meaning.

PMI compares the probability of seeing a pair of tokens together in a document with the probability of seeing them together if they occurred independently in the corpus. It is defined as the log ratio between the joint probability of *w* and *c* and the product of their marginal probabilities. These probabilities can be estimated empirically from the corpus by counting the number of times *w* and *c* appear in the same document relative to the number of times each of them appears overall. In this paper, we do not consider the order in which terms appear within a document and therefore, the word-context matrix is built solely on co-occurrence. 

Similarly, the term-document matrix is a sparse binary matrix whose entries are defined as $B(t,d) = \{1 \text{ if t occurs in d or 0 otherwise}\}$.


(1) $PMI(w,c) = \log\frac{\hat{P}(w,c)}{\hat{P}(w)\hat{P}(c)} = \log\frac{\#(w,c) \cdot |C_{g}|}{\#(w) \cdot \#(c)}$

Following recommendations in the recent NLP literature (Levy et al., 2015), we employ a smoothed version of the PMI matrix. During our experiments, we found that setting the smoothing factor $\alpha$ to 0.95 yielded the best results in the similarity task, which is in line with observations from other studies: 

(2) $SPMI(w,c) = \log\frac{\hat{P}(w,c)}{\hat{P}(w)\hat{P}_{\alpha}(c)}$

where the smoothed unigram distribution of the context is:

(3) $\hat{P}_{\alpha}(c) = \frac{\#(c)^{\alpha}}{\sum_{c}{\#(c)^{\alpha}}}$

The pairwise results are stored in a smoothed PMI matrix $M^{SPMI}$ that matches the original dimensions of $F$, $|V_{g}| \times |V_{g}|$. A common problem with $M^{SPMI}$ is that it contains entries of the form $\text{PMI}(w,c) = \log 0 = -\infty$ for word-context pairs that were never observed. This issue is solved in the NLP literature by using *positive* PMI (PPMI), in which the negative entries are replaced by 0:


(4) $M = SPPMI(w,c)=
    \begin{cases}
      SPMI(w,c) & \text{if}\ SPMI(w,c)>0 \\
      0 & \text{otherwise}
    \end{cases}
$
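
Equations (2) to (4) can be traced by hand on a toy corpus. The sketch below is a simplified, pure-Python illustration rather than the notebook's `build_pmis`: the four documents are invented, the marginals are taken from the co-occurrence counts, and a base-2 logarithm is used as in `build_pmis` (a different base only rescales the values).

```python
import math
from collections import Counter
from itertools import combinations

# Invented toy corpus: each document is a list of stemmed keyword tokens.
docs = [
    ["volum", "render", "data"],
    ["volum", "render"],
    ["data", "analysi"],
    ["data", "analysi", "interact"],
]

# Symmetric co-occurrence counts #(w, c): every unordered pair is counted in both directions.
pair_counts = Counter()
for doc in docs:
    for a, b in combinations(doc, 2):
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1

total = sum(pair_counts.values())

# Marginal counts #(w), obtained by summing each row of the co-occurrence table.
marginal = Counter()
for (w, _c), n in pair_counts.items():
    marginal[w] += n

alpha = 0.95  # smoothing factor reported in the text
alpha_norm = sum(n ** alpha for n in marginal.values())


def sppmi(w, c):
    """Smoothed positive PMI of word w and context c, following equations (2)-(4)."""
    n_wc = pair_counts.get((w, c), 0)
    if n_wc == 0:
        return 0.0  # unobserved pair: log 0 = -inf is clipped to 0
    p_wc = n_wc / total
    p_w = marginal[w] / total
    p_c_alpha = marginal[c] ** alpha / alpha_norm  # equation (3)
    return max(math.log2(p_wc / (p_w * p_c_alpha)), 0.0)


print(sppmi("volum", "render"), sppmi("volum", "data"), sppmi("volum", "analysi"))
```

The pair that always appears together ("volum", "render") scores higher than a pair involving the very frequent token "data", and the pair that never co-occurs is clipped to zero.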

In [18]:
def build_pmis(wwcnt_mat, skipgram_counts, keyword2indx, verbose=False):
    num_skipgrams = wwcnt_mat.sum()
    assert(sum(skipgram_counts.values())==num_skipgrams)

    # for creating sparse matrix
    row_indxs = []
    col_indxs = []
    sppmi_values = []

    # smoothing
    alpha = 0.95
    nca_denom = np.sum(np.array(wwcnt_mat.sum(axis=0)).flatten()**alpha)
    sum_over_words = np.array(wwcnt_mat.sum(axis=0)).flatten()
    sum_over_words_alpha = sum_over_words**alpha
    sum_over_contexts = np.array(wwcnt_mat.sum(axis=1)).flatten()

    ii = 0
    for (keyword1, keyword2), sg_count in skipgram_counts.items():
        ii += 1
        if ii % 5000 == 0 and verbose:
            print(f'finished {ii/len(skipgram_counts):.2%} of skipgrams')

        keyword1_indx = keyword2indx[keyword1]
        keyword2_indx = keyword2indx[keyword2]

        nwc = sg_count
        Pwc = nwc / num_skipgrams
        nw = sum_over_contexts[keyword1_indx]
        Pw = nw / num_skipgrams

        nc = sum_over_words[keyword2_indx]
        Pc = nc / num_skipgrams

        nca = sum_over_words_alpha[keyword2_indx]
        Pca = nca / nca_denom

        sppmi = max(np.log2(Pwc/(Pw*Pca)), 0)

        row_indxs.append(keyword1_indx)
        col_indxs.append(keyword2_indx)
        sppmi_values.append(sppmi)
    
    
    sppmi_mat = sps.csr_matrix((sppmi_values, (row_indxs, col_indxs)))
  
    print('sparse ppmi matrix was built')

    return sppmi_mat

In [19]:
#Build ppmi 
sppmi_mat = build_pmis(wwcnt_mat, skipgram_counts, keyword2indx)

uu, ss, vv = svd_reduce(sppmi_mat, 50)
word_vecs, word_vecs_norm = generate_wordvecs(uu)

#use normalized/not normalized
use_vecs = word_vecs_norm

sparse ppmi matrix was built
Will reduce matrix of shape (2720, 2720) with embedding_size=50


## Distance Matrix

One of the most popular (dis)similarity measures employed in NLP is the cosine of the angle formed by two word vectors <cite data-cite="5825190/SAAU4NU2"></cite>. This measure discards the length of the vectors and quantifies the difference in their direction in multidimensional space. We selected this measure because, as reported by other studies, it is adequate to represent cognitive similarity beyond simple linguistic similarity <cite data-cite="undefined"></cite>. The formula of the cosine is well known and can be applied easily to the LSA vectors to build a similarity matrix $S$:

(5) $
    S(x,y) = \cos(x,y) = \frac{\sum^{n}_{i=1} x_{i} \cdot y_{i}}{\sqrt{\sum^{n}_{i=1} x^{2}_{i} \cdot \sum^{n}_{i=1} y^{2}_{i}}}
$

Analogously, the distance between two vectors, which is the quantity stored in the distance matrix $D$, can be expressed as:

(6) $
    D(x,y) = 1 - S(x,y)
$

As a final step, we employed the distance matrix $D$ to detect and merge synonyms (i.e. token pairs with $D(x,y) \approx 0$; the code below uses a threshold of 0.01), which resulted in a reduction in vocabulary sizes ($|V_{q}| = 176, |V_{t}| = 1745, |V_{l}| = 320$).
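
The following pure-Python sketch illustrates the cosine and its complement, $1 - \cos(x,y)$, which is the value that `pdist(..., 'cosine')` returns and that the next cell thresholds at 0.01. The three two-dimensional vectors and their labels are invented for the example; real keyword vectors have 50 dimensions.

```python
import math
from itertools import combinations


def cosine(x, y):
    """Cosine of the angle between two vectors, as in equation (5)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return dot / norm


# Hypothetical keyword vectors: the first two point in almost the same direction.
vectors = {
    "graph": [1.0, 0.0],
    "network": [0.999, 0.02],
    "volum": [0.0, 1.0],
}

THRESHOLD = 0.01  # same cut-off as the synonym-detection cell

# Pairs whose complement of the cosine falls below the threshold are merge candidates.
near_duplicates = [
    (a, b)
    for a, b in combinations(sorted(vectors), 2)
    if 1 - cosine(vectors[a], vectors[b]) <= THRESHOLD
]
print(near_duplicates)
```

Only the pair of nearly parallel vectors is flagged; orthogonal vectors have a cosine of zero and therefore lie at the maximum distance of 1 for non-negative data.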


In [20]:
#Generate distance matrix
Y = pdist(use_vecs, 'cosine')


print('Finding synonyms...')
distance_matrix = squareform(Y)
dist_matrix_sq = np.triu(distance_matrix)
syns = {}
remove = []
rows = dist_matrix_sq.shape[0]
cols = dist_matrix_sq.shape[0]
for x in range(0, rows):
    for y in range(0, cols):
        if x >= y:
            continue
        elif dist_matrix_sq[x][y] <= 0.01:
            if x in syns:
                syns[x].append(y)
            else:
                syns[x] = [y]
            if y in syns:
                syns[y].append(x)
            else:
                syns[y] = [x]
            remove.append(y)

keep = []
for i in range(0, rows):
    if i not in remove:
        keep.append(i)

use_vecs_c = use_vecs[keep]

sppmi_mat_c = sppmi_mat[keep,:]
sppmi_mat_c = sppmi_mat_c[:,keep]

print('Found %s synonyms. Operating with %s vectors now' % (len(set(remove)), len(use_vecs_c)))

Y_c = pdist(use_vecs_c, 'cosine')


sep_indx_src = len(source_keywords_set) -1
sep_indx_inter = len(source_keywords_set) + len(intersection_keywords_set) - 1

new_seps = []

prev = 'query'
for i in range(len(use_vecs_c)):
    current = trace_stem(indx2keyword[keep[i]])
    if current != prev:
        new_seps.append(i)
    prev = current
    

sep_indx_src_c = new_seps[0] - 1
sep_indx_inter_c = new_seps[1] - 1
indx2keyword_c = {k:indx2keyword[keep[k]] for k in range(len(use_vecs_c))}
keyword2indx_c = {v:k for k,v in indx2keyword_c.items()}


Finding synonyms...
Found 479 synonyms. Operating with 2241 vectors now


In [21]:
syn_query = []
syn_link = []
syn_target = []

for k,v in indx2keyword_c.items():
    trace = trace_stem(v)
    if trace == 'query':
        syn_query.append(v)
    elif trace == 'link':
        syn_link.append(v)
    else:
        syn_target.append(v)

# print(len(syn_query), len(syn_link), len(syn_target))

## Analyzing Inter-group Similarities

The second stage of our method focuses on exploring the similarity matrix $S$ that was obtained in the last step. To overcome the conceptual distance between the query and target corpora, we look for structural patterns in the similarity relationships between keywords on the query vocabulary and those found exclusively in the target vocabulary. For this task, we rely on the construction of a complete graph $G$ using the distance matrix $D$, which enables us to analyze the similarity between nodes (tokens) using different scaling techniques to reduce the complexity of the resulting graph. 


In order to map all tokens in $V_{t}$ to their counterparts in $V_{q}$, we identify the shortest path that connects each token in $V_{t}$ to any token in $V_{q}$. Formally, we can define the set of shortest paths $P'_{j}$ from the token $j$ in $V_{t}$ to all tokens in $V_{q}$, where each path is a sequence of node pairs $(t_{j}^{t},t_{k1}^{r}), (t_{k1}^{r},t_{k2}^{r}), \dots, (t_{kl}^{r},t_{i}^{q})$ with $r \in \{q,l,t\}$. Given that all pairs are edges representing distances, the sum of the distances along a path in $P'_{j}$ gives the total distance between the token $t_{j}^{t}$ and the token in $V_{q}$ at which the path ends. Therefore, there exists a shortest path $P$ in $P'_{j}$ connecting the node $t_{j}^{t}$ to another node $t_{i}^{q}$ that, by (6), yields a maximum similarity over all other alternative paths to tokens in $V_{q}$. Note that when $|P| = 1$, the similarity score $sim$ is equal to the value of the similarity matrix $S$ at $S(t_{j}, t_{i})$.

(7) $
    sim(t_{j}^{t}, t_{i}^{q}) = 1 - \min_{P \in P'_{j}} \sum_{(t_{k},t_{k+1}) \in P} dist(t_{k},t_{k+1})
$


By (7), the path $P_{t^{t}_{j}}$ that maximizes the similarity score $sim$ is a *significant* path of the target token $t^{t}_{j}$ in $G$ because it connects it to its most similar counterpart in $V_{q}$. These paths can be easily computed by a multi-source version of the Dijkstra algorithm.
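
The multi-source search can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the notebook itself relies on `nx.multi_source_dijkstra`, whereas the functions `build_graph` and `multi_source_dijkstra` below, the token names and the edge distances are all hypothetical.

```python
import heapq

def build_graph(edges):
    """Build an undirected adjacency dict from (u, v, distance) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph

def multi_source_dijkstra(graph, sources, target):
    """Return (distance, path) from the nearest source to the target."""
    dist = {s: 0.0 for s in sources}
    prev = {}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph[node].items():
            new_d = d + weight
            if new_d < dist.get(neighbor, float('inf')):
                dist[neighbor] = new_d
                prev[neighbor] = node
                heapq.heappush(heap, (new_d, neighbor))
    if target not in dist:
        raise ValueError('target is not reachable from any source')
    # Walk the predecessors back from the target to the source it was reached from.
    path = [target]
    while path[-1] in prev:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path

# Hypothetical tokens: q1, q2 (query), l1 (link), t1 (target).
toy_graph = build_graph([
    ('q1', 'l1', 0.2), ('q2', 'l1', 0.5), ('l1', 't1', 0.3),
    ('q1', 't1', 0.9), ('q2', 't1', 0.6), ('q1', 'q2', 0.7),
])
distance, path = multi_source_dijkstra(toy_graph, {'q1', 'q2'}, 't1')
similarity = 1 - distance  # the score of equation (7) for this toy target
```

In this toy graph the significant path of `t1` passes through the link token `l1`, because the two-hop route is shorter than either direct edge to a query token.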

After all shortest paths have been calculated, we can group similar nodes by the number of links shared by their respective paths from $V_{t}$ to $V_{q}$. In this way, sets of target nodes that present structural similarities in their relationship with the query dataset can be grouped together. This builds upon the idea that nodes related to the same topics are likely to share more links in the shortest paths that relate them to tokens in $V_{q}$, while shortest paths of dissimilar nodes have few or no links in common <cite data-cite="5825190/2M3GJCKB"></cite>.


In particular, the subgraph resulting from merging two or more shortest paths with common elements $P_{1}, P_{2}, \dots, P_{n}$ is a spanning tree of its nodes in $G$. This procedure is illustrated in the figure below. On the left, two shortest paths for tokens $t_{1}^{t}$ and $t_{2}^{t}$ are shown. As $|V_{t}| \gg |V_{q}|$, some paths will share a common destination token in $V_{q}$ ($t_{1}^{q}$ in the example). Input paths are ultimately merged into the same tree $T_{t_{1}^{q}}$.

<img src='img/path-generation.png'>*Figure*: Shortest paths $P(t_{1}^{t})$ (left, top) and $P(t_{2}^{t})$ (left, bottom) connecting tokens $t_{1}^{t}$ and $t_{2}^{t}$ to their closest neighbor in $V_{q}$. The proposed method detects coincident tokens in the resulting paths and constructs the spanning tree that contains them. This results in a partition of the dataset in which tokens in $V_{t}$ are grouped together if they relate to $V_{q}$ in a similar manner.</img>




After merging paths with common elements we obtain a set of trees $T = \{T_{1}, T_{2}, \dots, T_{i}\}$, one for each token $t_{i}^{q} \in V_{q}$ present in any path in $P$. Note that at this point all tokens in $V_{t}$ can be found in $T$, whereas some tokens in $V_{q}$ may be missing. This issue is easily solved by adding each token $t_{j}^{q}$ not present in $T$ to the MST of its nearest neighbor, given that there is no shorter path connecting $t_{j}^{q}$ to any other token in $V_{t}$. At the end of this process, any tree, or a combination of trees in $T$, along with related documents, can be represented in a visualization according to the procedure outlined in the next sections.
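
The merging step can be illustrated with a small sketch. It assumes, as in the notebook cells below, that every path is a list of tokens that starts at its query token, so paths are grouped by their first element. The function name `merge_paths_into_trees` and the toy paths are hypothetical. Note that the union of independently computed shortest paths is only guaranteed to be a tree when ties are broken consistently.

```python
def merge_paths_into_trees(paths):
    """Group paths by their first node and merge each group into an edge set."""
    trees = {}
    for path in paths:
        root = path[0]
        edges = trees.setdefault(root, set())
        for u, v in zip(path, path[1:]):
            # Store undirected edges in a canonical order so shared links coincide.
            edges.add((u, v) if u <= v else (v, u))
    return trees

# Hypothetical paths: two targets reach q1 through the same link token l1.
toy_paths = [
    ['q1', 'l1', 't1'],
    ['q1', 'l1', 't2'],
    ['q2', 't3'],
]
trees = merge_paths_into_trees(toy_paths)

# A connected edge set over n nodes is a tree when it has n - 1 edges.
nodes_q1 = {node for edge in trees['q1'] for node in edge}
```

Here the shared link between `q1` and `l1` is stored once, so the two paths ending in `t1` and `t2` collapse into a single tree rooted at `q1`.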

### Generating the model
The following cells generate a visualization model based on paths extracted from proximity data. 
If you want to recreate the process (*it could take several hours*),
remove the pre-generated pickle file './model/all-paths.pkl' and run the code.

In [22]:
def generate_paths(dist_matrix, sep_indx_src, sep_indx_inter, indx2keyword):
    # Build an undirected weighted graph from the upper triangle of the distance matrix
    dist_matrix_sq = np.triu(squareform(dist_matrix))
    dists_graph = nx.from_numpy_array(dist_matrix_sq)

    paths_by_dest = {}
    paths_location = './model/all-paths.pkl'

    try:
        # Reuse the pre-generated paths when the pickle file is available
        with open(paths_location, 'rb') as infile:
            paths_by_dest = pickle.load(infile)
    except FileNotFoundError:
        # Query tokens occupy indices 0..sep_indx_src and act as Dijkstra sources
        sources = set(range(0, sep_indx_src + 1))
        for j in range(sep_indx_inter + 1, len(indx2keyword)):
            print('Calculating best path for %s/%s' % (j, len(indx2keyword) - 1))
            dist, path = nx.multi_source_dijkstra(dists_graph, sources, target=j)
            paths_by_dest[j] = {'path' : path, 'distance': dist}
            print(indx2keyword[j], paths_by_dest[j]['distance'], [(indx2keyword[n], trace_stem(indx2keyword[n])) for n in paths_by_dest[j]['path']])

        # Persist only freshly computed paths
        with open('all-paths.pkl', 'wb') as output:
            pickle.dump(paths_by_dest, output)
        shutil.move('all-paths.pkl', paths_location)

    return dists_graph, paths_by_dest

### Main

In [23]:
def sortFn(a):
    cmp_str = ''
    for i in range(len(a)):
        cmp_str += indx2keyword_c[a[i]]

    return cmp_str

def merge_subs(lst_of_lsts):
    copy_list = copy.deepcopy(lst_of_lsts)
    res = []
    for row in copy_list:
        for i, resrow in enumerate(res):
            if row[0]==resrow[0]:
                res[i] += row[1:]
                break
        else:
            res.append(row)
    return res

dists_graph, paths_by_dest = generate_paths(Y_c, sep_indx_src_c, sep_indx_inter_c, indx2keyword_c)

In [24]:
all_paths = [v['path'] for k,v in paths_by_dest.items()]

merged_paths_sorted = sorted(merge_subs(all_paths), key=sortFn)

for p in merged_paths_sorted:
    print([(indx2keyword_c[j], trace_stem(indx2keyword_c[j])) for j in (p)])

[('aborigin', 'query'), ('unifi', 'target'), ('unifi', 'target'), ('input', 'target'), ('teleoper', 'target'), ('atom', 'target'), ('echographi', 'target'), ('levels-of-detail', 'target'), ('teleoper', 'target'), ('haptic', 'target'), ('sphere', 'target'), ('cross', 'target'), ('sphere', 'target'), ('terascal', 'target'), ('world-wide-web', 'target'), ('pdm', 'target'), ('wayfind', 'target'), ('shock', 'target'), ('forcefeedback', 'target'), ('projector', 'target'), ('pc', 'target'), ('invers', 'target'), ('head', 'target'), ('surgic', 'target'), ('hysteroscopi', 'target'), ('shear-warp', 'target'), ('speech', 'target'), ('haptic', 'target'), ('6-dof', 'target'), ('invers', 'target'), ('kinemat', 'target'), ('section', 'target'), ('anatom', 'target'), ('section', 'target'), ('carv', 'target'), ('seam', 'target')]
[('academ', 'query'), ('manipul', 'target'), ('contour', 'target'), ('magnif', 'target'), ('data-driven', 'target'), ('data-driven', 'target'), ('data-min', 'target'), ('oracl

In [25]:
merged_paths_dict = {}
for p in merged_paths_sorted:
    merged_paths_dict[indx2keyword_c[p[0]]] = p
    print([('/'.join(stems_dictionary[indx2keyword_c[j]]), trace_stem(indx2keyword_c[j])) for j in sorted(p)])

[('aboriginal', 'query'), ('unified', 'target'), ('unified', 'target'), ('input', 'target'), ('atomic', 'target'), ('echography', 'target'), ('levels-of-detail', 'target'), ('teleoperation', 'target'), ('teleoperation', 'target'), ('haptic/haptics', 'target'), ('haptic/haptics', 'target'), ('crossing', 'target'), ('sphere', 'target'), ('sphere', 'target'), ('terascale', 'target'), ('world-wide-web', 'target'), ('pdm', 'target'), ('wayfinding', 'target'), ('shock', 'target'), ('forcefeedback', 'target'), ('projector/projectors', 'target'), ('pc', 'target'), ('inverse', 'target'), ('inverse', 'target'), ('head', 'target'), ('surgical', 'target'), ('hysteroscopy', 'target'), ('shear-warp', 'target'), ('speech', 'target'), ('6-dof', 'target'), ('kinematics', 'target'), ('anatomic', 'target'), ('sections', 'target'), ('sections', 'target'), ('carving', 'target'), ('seam', 'target')]
[('academic', 'query'), ('writing', 'link'), ('writing', 'link'), ('writing', 'link'), ('writing', 'link'), (

# Visualizing documents via keyword proximity

In the last stage, the user is expected to provide a set of keywords to explore the collection. Following the reasoning explained in previous sections, the user employs keywords specific to the query vocabulary to obtain related keywords and documents from the target corpus. These elements are presented to the user in a visualization that shows exploration paths related to the input query expression. The user is then able to progressively form a mental image of the target corpus by following these paths and, optionally, perform further research on the list of document suggestions that are displayed in the same visualization space. In this section, we comment on the steps that were taken to produce this output.

The visualization employs a single tree as input, which can be one of the trees in $T$ if only a single keyword is provided, or a tree resulting from combining two or more trees in $T$. The tree is drawn in the plane using the Kamada-Kawai layout algorithm <cite data-cite="5825190/3CIKKD6P"></cite>, where tokens are depicted as vertices (text) and cosine distances as edges (solid lines) in the network. Query, link and target keywords are shown in orange, blue and green, respectively. Tokens are translated into their original forms to ensure the readability of the results. In a subsequent step, the visualization is completed by projecting documents into the semantic subspace defined by $T$. First, the $TD$ matrix is filtered to obtain documents that contain any of the terms in $T$. Note that each of the resulting documents may contain one or more terms (components) of the semantic subspace $T$. Then, documents are projected according to their components' positions in the plane, as assigned by the Kamada-Kawai layout.

<img src='img/doc-projections.png'>*Figure*: Documents are projected into the 2D representation of the semantic subspace defined by $T$. $d_{1}$ is projected to its only component in the subspace, $t_{1}$. Similarly, $d_{2}$ contains terms $t_{3}$ and $t_{4}$, and therefore it is projected at the midpoint of the link between the two terms. Finally, documents such as $d_{3}$ that contain three or more terms are projected at the centroid of the convex hull formed by the positions of such terms in the plane.</img>

Documents are represented as dots in the visualization and follow the same color scheme as keywords: documents in the query corpus are shown in orange, whereas those in the target corpus are shown in green. Whenever two or more documents share the same position in the plane, they are aggregated into a single circle whose size encodes their number. We represent links between a document and its related components in the plane with dashed lines, which facilitates the task of identifying relationships between terms and documents.
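
The projection rule can be sketched without external dependencies. The sketch below follows the notebook's choice of averaging the convex-hull vertices (rather than computing an area centroid) and uses Andrew's monotone chain instead of SciPy's `ConvexHull`, which can fail on degenerate inputs such as collinear points. The function names and the toy coordinates are hypothetical.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices of 2D points."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]

def project_document(term_positions):
    """Place a document at the mean of the hull vertices of its terms.

    One term yields the term position, two terms yield the midpoint, and
    three or more terms yield the mean of the convex-hull vertices.
    """
    hull = convex_hull([tuple(p) for p in term_positions])
    cx = sum(p[0] for p in hull) / len(hull)
    cy = sum(p[1] for p in hull) / len(hull)
    return cx, cy

# Hypothetical layout positions of the terms contained in three documents.
d1 = project_document([(0.5, 0.5)])
d2 = project_document([(0.0, 0.0), (1.0, 2.0)])
d3 = project_document([(0, 0), (2, 0), (2, 2), (0, 2), (1, 0.5)])
```

In the last toy document the interior term does not affect the projected position, since only hull vertices are averaged.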


In [26]:
def get_recommendations_for_path(component):
    nodes = list(component.nodes())
    token_indices = [keep[n] for n in nodes]
    docs = []
    for doc_index, doc in enumerate(tf_bin_list):
        doc_dict = {}
        doc_tokens = []
        for token_index, token in enumerate(doc):
            if token == True and token_index in token_indices:
                doc_tokens.append(token_index)
        if len(doc_tokens) > 0:
            the_doc = all_papers[doc_index]
            doc_dict['title'] = the_doc['title']
            doc_dict['keywords'] = the_doc['keywords']

            if doc_index <= dh_papers_offset:
                doc_dict['trace'] = 'query'
            else:
                doc_dict['trace'] = 'target'

            doc_dict['tokens'] = [indx2keyword[i] for i in doc_tokens]
            doc_dict['token_indices'] = doc_tokens
            doc_dict['token_indices_c'] = [nodes[token_indices.index(i)] for i in doc_tokens]
            docs.append(doc_dict)
        
    return docs

In [27]:
def scale_number(unscaled, to_min, to_max, from_min, from_max):
    if from_min == from_max: return to_max
    return (to_max-to_min)*(unscaled-from_min)/(from_max-from_min)+to_min

def build_keywords_tree(G, indx2keyword, stems_dictionary):
    counts = []
    counts.extend(list(map(lambda x: x[1]["count"], list(G.nodes(data=True)))))
  
    nodes = G.nodes()
  
    min_dist = 0
    max_dist = 0
  
    for row, data in nx.shortest_path_length(G, weight='weight'):
        for col, dist in data.items():
            if min_dist > dist:
                min_dist = dist
            if max_dist < dist:
                max_dist = dist
  
    df = pd.DataFrame(index=nodes, columns=nodes)
    for row, data in nx.shortest_path_length(G, weight='weight'):
        for col, dist in data.items():
            scaled_dist = scale_number(dist, 1, 4, min_dist, max_dist)
            df.loc[row,col] = scaled_dist
  
    df = df.fillna(df.max().max())

    pos = nx.kamada_kawai_layout(G, scale=1.0, dim=2, weight='weight', dist=df.to_dict())
  
  
    x = {k:float('%.4f' % v[0])  for k,v in pos.items()}
    y = {k:float('%.4f' % v[1]) for k,v in pos.items()}
  
    nx.set_node_attributes(G, x, 'x')
    nx.set_node_attributes(G, y, 'y')
  
    graph_json = json_graph.node_link_data(G)

    docs = get_recommendations_for_path(G)
    node_list = list(nodes)

    docs_json = {}
    
    for doc in docs:
        doc_keyword_ids = "+".join([str(k) for k in sorted(doc['token_indices_c'])])
        if doc_keyword_ids in docs_json:
            #Doc already added
            docs_json[doc_keyword_ids]['count'] += 1
            docs_json[doc_keyword_ids]['docs'].append(doc)
        else:
            if len(doc['token_indices_c']) == 1:
                cx = pos[doc['token_indices_c'][0]][0]
                cy = pos[doc['token_indices_c'][0]][1]
            elif len(doc['token_indices_c']) == 2:
                ax = pos[doc['token_indices_c'][0]][0]
                ay = pos[doc['token_indices_c'][0]][1]
                bx = pos[doc['token_indices_c'][1]][0]
                by = pos[doc['token_indices_c'][1]][1]
                cx = (ax + bx) / 2
                cy = (ay + by) / 2
            else:
                the_points = [ (pos[i][0], pos[i][1]) for i in doc['token_indices_c']]
                hull = ConvexHull(the_points)
                cx = np.mean(hull.points[hull.vertices,0])
                cy = np.mean(hull.points[hull.vertices,1])

            docs_json[doc_keyword_ids] = {
              'cx': cx,
              'cy': cy,
              'count': 1,
              'docs': [doc]
            }
    
  
    json_obj = {'graph' : graph_json, 'docs' : docs_json}

    return json.dumps(json_obj)


## Combining Significant Paths

Apart from the visualization of a single tree, our visualization scheme also supports the combination of two or more query terms to represent related keywords and documents. Given that by definition all trees in $T$ are disjoint subgraphs of $G$, we can find an MST in $G$ that contains all vertices in $T_{1}, T_{2}, \dots, T_{n}$ and that presents the minimum edit distance of all possible MSTs to the sum of all subgraphs. This reasoning is depicted in the figure below.

<img src='img/path-query.png'>*Figure*: Paths for the user-provided terms $T_{t_{1}^{q}}$ and  $T_{t_{2}^{q}}$ are combined into a new path that results from calculating the MST of nodes in the two paths. This procedure ensures that the two paths are presented in the most coherent possible way in the visualization.</img>

The tree resulting from the combination of the two paths has similar properties to any other tree in $T$ and thus can be displayed in the same manner as described above.
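
As an illustration of the combination step, the sketch below computes an MST over the nodes of two trees with Kruskal's algorithm. The notebook itself uses `nx.minimum_spanning_tree` on the induced subgraph. The function `minimum_spanning_tree`, the token names and the toy distances are hypothetical.

```python
def minimum_spanning_tree(nodes, distance):
    """Kruskal's algorithm over the complete graph induced by `nodes`."""
    nodes = list(dict.fromkeys(nodes))  # drop duplicates, keep order
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    candidate_edges = sorted(
        (distance(u, v), u, v)
        for i, u in enumerate(nodes)
        for v in nodes[i + 1:]
    )
    tree = []
    for w, u, v in candidate_edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree

# Hypothetical cosine distances between the tokens of two trees.
toy_distances = {
    frozenset(('q1', 't1')): 0.2, frozenset(('q1', 't2')): 0.3,
    frozenset(('t1', 't2')): 0.6, frozenset(('q2', 't3')): 0.25,
    frozenset(('q1', 'q2')): 0.8, frozenset(('t1', 't3')): 0.4,
    frozenset(('t2', 't3')): 0.7, frozenset(('q1', 't3')): 0.9,
    frozenset(('q2', 't1')): 0.85, frozenset(('q2', 't2')): 0.95,
}
tree_a = ['q1', 't1', 't2']
tree_b = ['q2', 't3']
combined = minimum_spanning_tree(
    tree_a + tree_b,
    lambda u, v: toy_distances[frozenset((u, v))],
)
combined_edges = {frozenset((u, v)) for u, v, _ in combined}
```

In this toy case the edges of both input trees are preserved and a single bridging edge between `t1` and `t3` joins them, which matches the intuition of keeping the combined tree as close as possible to its parts.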

In [28]:
# path_keys = ['co-retweet', 'cooccurr']

# path_keys = ['bayerisch', 'jazz']

# path_keys = ['aborigin']

# path_keys = ['palaeographi', 'mediev']

# path_keys = ['russian', 'transmedia']

# path_keys = ['women']

# path_keys = ['co-retweet', 'coword']

# path_keys = ['coword']

# path_keys = ['shakespear']
path_keys = ['willa', 'racial']


query_indxs = []
sum_path = []
for key in path_keys:
    sum_path += merged_paths_dict[key]
    query_indxs.append(merged_paths_dict[key][0])


path_graph = dists_graph.subgraph(sum_path).copy()


for u,v,d in path_graph.edges(data=True):
    d['sim'] = 1 - d['weight']
  
for i in path_graph.nodes():
    trace = trace_stem(indx2keyword_c[i])
    if trace == 'query':
        trace_color = '#d95f02'
    elif trace == 'link':
        trace_color = '#7570b3'
    else:
        trace_color = '#1b9e77'
    path_graph.nodes[i]['keyword'] = stems_dictionary[indx2keyword_c[i]]

    # Total frequency of all surface forms that share this stem
    count = 0
    for k in stems_dictionary[indx2keyword_c[i]]:
        count += unigram_counts[k]

    path_graph.nodes[i]['count'] = count
    path_graph.nodes[i]['trace_color'] = trace_color
    path_graph.nodes[i]['trace'] = trace
        
T=nx.minimum_spanning_tree(path_graph)

json_results = build_keywords_tree(T, indx2keyword_c, stems_dictionary) 
# print(json_results)

In [29]:
print(json_results)

{"graph": {"directed": false, "multigraph": false, "graph": {}, "nodes": [{"keyword": ["gabor"], "count": 1, "trace_color": "#1b9e77", "trace": "target", "x": -0.1819, "y": 0.7791, "id": 2052}, {"keyword": ["occlusion-free"], "count": 0, "trace_color": "#1b9e77", "trace": "target", "x": 0.4362, "y": 0.2237, "id": 1414}, {"keyword": ["heuristic-based"], "count": 0, "trace_color": "#1b9e77", "trace": "target", "x": 0.3956, "y": 0.611, "id": 1930}, {"keyword": ["filtering", "filters", "filter", "filtered"], "count": 28, "trace_color": "#1b9e77", "trace": "target", "x": 0.1005, "y": 1.0, "id": 523}, {"keyword": ["spiral"], "count": 1, "trace_color": "#1b9e77", "trace": "target", "x": 0.2504, "y": -0.2788, "id": 1806}, {"keyword": ["metro"], "count": 1, "trace_color": "#1b9e77", "trace": "target", "x": 0.5575, "y": -0.2393, "id": 1810}, {"keyword": ["oac"], "count": 1, "trace_color": "#1b9e77", "trace": "target", "x": -0.3815, "y": -0.0278, "id": 1813}, {"keyword": ["racial"], "count": 1, "

In [30]:
import IPython
import IPython.display as display


display.display(display.Javascript('window.path_data = %s' % json_results))

display.display(IPython.core.display.HTML('''
         <div id='vis'></div>
         <style>.container { width:100% !important; }</style>
         <script src="/static/components/requirejs/require.js"></script>
         <script>
          requirejs.config({
            paths: {
              "d3": "https://d3js.org/d3.v5.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        <script>
          requirejs(['d3'], function(d3) {
            const width = 1850,
                  height = 1850;

            const svg = d3.select('#vis').append('svg')
                            .attr('id', 'vis-svg')
                            .attr('width', width)
                            .attr('height', height);
            const graph_json = window.path_data;
            
            console.log(JSON.stringify(graph_json))
            
            const graph = graph_json['graph'];
            const docs = graph_json['docs'];

            const xScale = d3.scaleLinear().domain(d3.extent(graph.nodes, d=>d.x)).range([140, width-140])
            const yScale = d3.scaleLinear().domain(d3.extent(graph.nodes, d=>d.y)).range([140, height-140])

            const linkline = d3.line()
                                  .x(d => xScale(d.x))
                                  .y(d => yScale(d.y));

            const docs_three = Object.keys(docs).filter(d=>d.split('+').length >= 2).map(d => docs[d])

            const doc_lines_data = []

            docs_three.forEach(doc => {
              doc['docs'].forEach(d => {
                d['token_indices_c'].forEach(t=> {
                  const target_node = graph.nodes.filter(r => r.id == t)[0]
                  const line_points = [
                    {
                      'x' : doc.cx,
                      'y' : doc.cy 
                    },
                    {
                      'x' : target_node.x,
                      'y' : target_node.y
                    }
                  ]
                  doc_lines_data.push(line_points)
                })
              })
            })


            const doc_links = svg.append("g")
                            .attr("id", "doc_links")
                          .selectAll("line")
                          .data(doc_lines_data)
                          .enter().append("path")
                            .attr("stroke-dasharray", '5,5')
                            .style("stroke", "black")
                            .attr("stroke-width", 0.5)
                            .attr("opacity", 0.4)
                            .attr("d", d => linkline(d));


            const links = svg.append("g")
                            .attr("id", "links")
                          .selectAll("line")
                          .data(graph.links)
                          .enter().append("path")
                            .style("stroke", "black")
                            .attr("stroke-width", 0.5)
                            .attr("opacity", 0.4)
                            .attr("d", d => linkline(graph.nodes.filter(b=>b.id == d.source || b.id == d.target)));


            function getBB(selection) {
                selection.each(function(d){d.bbox = this.getBBox();})
            }

            const nodes_g = svg.append("g")
                            .attr("id", "nodes")
                          .selectAll('g')
                          .data(graph.nodes)
                          .enter()
                          .append('g');
            nodes_g
                  .append('text')
                    .attr("class", "node")
                    .text(d=> d.keyword[0])
                    .attr('x', d => xScale(d.x))
                    .attr('y', d => yScale(d.y))
                    .style('fill', d=>d.trace_color)
                    .style('font-size', "14px")
                    .style('font-weight', "bold")
                    .style('text-anchor', "middle")
                    .call(getBB);

            nodes_g.insert("rect", "text")
                              .attr('x', d => xScale(d.x) - d.bbox.width/2)
                              .attr('y', d => yScale(d.y) - 3*(d.bbox.height/4))
                              .attr("width", function(d){return d.bbox.width})
                              .attr("height", function(d){return d.bbox.height})
                              .style("fill", "white")
                              // .style("stroke", "black");

            const all_docs = Object.values(docs)                    

            const docs_g = svg.append("g")
                      .attr("id", "docs")
                    .selectAll('g')
                    .data(all_docs)
                    .enter()
                    .append('g')

            const radiusScale = d3.scaleLinear().domain(d3.extent(all_docs, d=>d.count)).range([3,5])
            docs_g.append('circle')
                .attr('r', d=>radiusScale(d.count))
                .attr('cx', d=>xScale(d.cx))
                .attr('cy', d=>yScale(d.cy))
                .style('fill', d => {
                  console.log(d);
                  if (d['docs'][0].trace == 'query')
                    return '#d95f02'
                  else return '#1b9e77'
                })
                .style('opacity', 0.5);

            docs_g.append('text')
                  .attr('class', 'doc-text')
                  .text(d => d.docs[0].title)
                  .attr('x', d => xScale(d.cx))
                  .attr('y', d => 10 + yScale(d.cy))
                  .style('font-size', "9px")
                  .style('text-anchor', "middle")
                  .style('font-weight', "300")
                  .style('font-family', "Open Sans")
          })
        </script>
      
      '''))

<IPython.core.display.Javascript object>

# Scratchpad
If you want to add your own experiments or have a closer look at the data structures this is the right place to do so.

# References

<div class="cite2c-biblio"></div>