# Landmarks Tutorial

In this notebook, we use the landmarks submodule of Gismo to give an interactive description of the ACM topics and researchers of the LINCS laboratory (https://www.lincs.fr).

This notebook can be used as a blueprint to analyze other groups of people through the lens of a topic classification.

Before starting this tutorial, we recommend going through the [ACM](https://gismo.readthedocs.io/en/latest/tutorials/tutorial_acm.html) and [DBLP](https://gismo.readthedocs.io/en/latest/tutorials/tutorial_dblp.html) tutorials.

## Lincs Researchers

In this section, we associate each LINCS researcher with a DBLP id.

### List of DBLP researchers

First, we open the DBLP database (see the [DBLP Tutorial](https://gismo.readthedocs.io/en/latest/tutorials/tutorial_dblp.html) to get your copy of the database).

In [1]:
from pathlib import Path

p = Path("../../../../Datasets/xplorer")

from gismo.filesource import FileSource
source = FileSource(filename="dblp", path=p)

``source`` is a list-like object whose entries are dicts that describe articles.

In [2]:
source[1234567]

{'type': 'article',
 'title': 'CAIM Discretization Algorithm.',
 'year': '2004',
 'venue': 'IEEE Trans. Knowl. Data Eng.',
 'authors': ['Lukasz A. Kurgan', 'Krzysztof J. Cios']}

Let's extract the set of authors. Each author name is lowercased and its spaces are replaced with underscores to ease later processing.

In [3]:
dblp_authors = {a.lower().replace(" ", "_") for paper in source for a in paper['authors']}

In [4]:
"fabien_mathieu" in dblp_authors

True

In [5]:
"fabin_mathieu" in dblp_authors

False
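As a minimal sketch of the normalization used above, the following toy example applies the same rule to a tiny hand-made list of article dicts. The `normalize` helper and the `toy_source` entries (with placeholder titles) are illustrative assumptions, not part of Gismo or of the DBLP dataset.

```python
def normalize(name):
    """Lowercase a display name and replace spaces with underscores."""
    return name.lower().replace(" ", "_")


# Toy stand-in for the DBLP source: a list of article dicts (placeholder titles).
toy_source = [
    {"title": "Paper A", "authors": ["Fabien Mathieu", "Laurent Viennot"]},
    {"title": "Paper B", "authors": ["Fabien Mathieu"]},
]

# Same set comprehension as in the tutorial, applied to the toy source.
toy_authors = {normalize(a) for paper in toy_source for a in paper["authors"]}

print(sorted(toy_authors))     # ['fabien_mathieu', 'laurent_viennot']
print(normalize("Ana Bušić"))  # ana_bušić (accents are kept)
```

Note that the rule keeps accents, so a display name such as "Ana Bušić" does not directly give an ASCII id like `ana_busic`. This is one reason why a direct lookup can fail and fallbacks are useful.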

### List of Lincs Members

First, we fetch the LINCS webpage that lists its researchers and feed it to BeautifulSoup.

In [6]:
import requests
from bs4 import BeautifulSoup as bs
soup = bs(requests.get('https://www.lincs.fr/people/').text)

We write a function that converts the table rows of the HTML page into researcher dicts. Each dict has a *name* (display name) and a *dblp* (DBLP id) entry.

In [7]:
# difflib helps cope with possible typos
from difflib import get_close_matches
def row2dict(row, dblp_authors, manual=None):
    """
    Soup 2 dict conversion
    
    Parameters
    ----------
    row: soup
        The row to convert
    dblp_authors: set
        The set of DBLP authors
    manual: dict
        Manual associations between name and id
        
    Returns
    -------
    dict
        A dict shaped like {'name': "John Doe", 'dblp': "john_doe"}
    """
    if manual is None:
        manual = {}
    # name extraction
    row = row('td')[1]
    name = row.text
    # manual association
    if name in manual:
        return {'name': name, 'dblp': manual[name]}
    # Attempt direct transformation
    dblp = name.lower().replace(" ", "_")
    # If result exists in dblp, return that
    if dblp in dblp_authors:
        return {'name': name, 'dblp': dblp}
    # Attempt to use the lincs automatic URL to infer dblp name
    a = row.a
    if a:
        href = a['href'].split('?more=')[1]
        if href in dblp_authors:
            return {'name': name, 'dblp': href}
    # last chance: subselect potential candidates that share the same last name and use difflib to guess the good answer.
    print(f"No direct dblp entry found for {name}, start fuzzy search")
    last_name = name.split()[-1]
    candidates = [a for a in dblp_authors if last_name.lower() in a]
    candidates = get_close_matches(name.lower().replace(" ", "_"), candidates, cutoff=0.3)
    if len(candidates)>0:
        print(f"Found candidate(s): {candidates}")
        dblp = candidates[0]
        return {'name': name, 'dblp': dblp}
    # If all failed, return empty id
    return {'name': name, 'dblp': ""}
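The fuzzy fallback at the end of `row2dict` can be tried in isolation. The sketch below uses only the standard `difflib` module. The candidate list is a hypothetical example: only its first entry appears in this tutorial, the second one is made up to give the matcher a choice.

```python
from difflib import get_close_matches

# Hypothetical candidate ids sharing the last name "elayoubi"
# (the second entry is invented for illustration).
candidates = ["salah-eddine_elayoubi", "s._elayoubi"]

# Same normalization as in row2dict.
query = "Salah Eddine Elayoubi".lower().replace(" ", "_")

# Candidates are returned by decreasing similarity; cutoff=0.3 is permissive.
matches = get_close_matches(query, candidates, cutoff=0.3)
print(matches[0])  # salah-eddine_elayoubi
```

Restricting the candidates to ids that contain the last name before calling `get_close_matches` keeps the search small, which matters when the full author set has millions of entries.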

The manual overrides below were populated iteratively: we ran the next cells, checked the result, and added an entry for each researcher that was wrongly matched.

In [8]:
manual = {"Giovanni Pau": "giovanni_pau_0001",
         "Luis Uzeda Garcia": "luis_guilherme_uzeda_garcia"}

We can now build the list of LINCS researchers.

In [9]:
lincs = [row2dict(line, dblp_authors, manual) for table in soup('table')[:2] for line in table('tr')]

No direct dblp entry found for Salah Eddine Elayoubi, start fuzzy search
Found candidate(s): ['salah-eddine_elayoubi']


In [10]:
lincs

[{'name': 'Ana Bušić', 'dblp': 'ana_busic'},
 {'name': 'Anais Vergne', 'dblp': 'anais_vergne'},
 {'name': 'Bartek Blaszczyszyn', 'dblp': 'bartek_blaszczyszyn'},
 {'name': 'Chung Shue (Calvin) Chen', 'dblp': 'chung_shue_chen'},
 {'name': 'Daniel Kofman', 'dblp': 'daniel_kofman'},
 {'name': 'Eitan Altman', 'dblp': 'eitan_altman'},
 {'name': 'Fabien Mathieu', 'dblp': 'fabien_mathieu'},
 {'name': 'François Baccelli', 'dblp': 'françois_baccelli'},
 {'name': 'Laurent Decreusefond', 'dblp': 'laurent_decreusefond'},
 {'name': 'Laurent Viennot', 'dblp': 'laurent_viennot'},
 {'name': 'Luca Muscariello', 'dblp': 'luca_muscariello'},
 {'name': 'Ludovic Noirie', 'dblp': 'ludovic_noirie'},
 {'name': 'Makhlouf Hadji', 'dblp': 'makhlouf_hadji'},
 {'name': 'Marc Lelarge', 'dblp': 'marc_lelarge'},
 {'name': 'Marceau Coupechoux', 'dblp': 'marceau_coupechoux'},
 {'name': 'Petr Kuznetsov', 'dblp': 'petr_kuznetsov'},
 {'name': 'Philippe Jacquet', 'dblp': 'philippe_jacquet'},
 {'name': 'Philippe Martins', 'd

## DBLP Gismo

In this section, we use Landmarks to construct a small XGismo focused on the LINCS researchers. In detail:
- We construct a large Gismo between articles and researchers, exactly like in the DBLP tutorial;
- We use landmarks to extract a (much smaller) list of articles based on collaboration proximity;
- We build an XGismo between researchers and keywords from this smaller source.

### Construction of a full Gismo on authors

This part is similar to the one from the DBLP tutorial.

In [11]:
from gismo.corpus import Corpus
from gismo.embedding import Embedding
from gismo.gismo import Gismo
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_author = CountVectorizer(dtype=float, preprocessor=lambda x:x, tokenizer=lambda x: x.split(' '))

In [12]:
def to_authors_text(dic):
    return " ".join([a.lower().replace(' ', '_') for a in dic['authors']])
corpus = Corpus(source, to_text=to_authors_text)
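To see what the vectorizer receives, here is `to_authors_text` applied to the sample article displayed earlier in this tutorial. This is a small standalone sketch that does not need Gismo. It shows why the tokenizer simply splits on spaces: after normalization, each author is exactly one token.

```python
def to_authors_text(dic):
    return " ".join([a.lower().replace(" ", "_") for a in dic["authors"]])


# The sample article shown earlier in this tutorial.
article = {
    "type": "article",
    "title": "CAIM Discretization Algorithm.",
    "year": "2004",
    "venue": "IEEE Trans. Knowl. Data Eng.",
    "authors": ["Lukasz A. Kurgan", "Krzysztof J. Cios"],
}

text = to_authors_text(article)
print(text)             # lukasz_a._kurgan krzysztof_j._cios
print(text.split(" "))  # ['lukasz_a._kurgan', 'krzysztof_j._cios']
```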

In [13]:
try:
    embedding = Embedding(filename="dblp_author_embedding", path=p)
except Exception:  # no saved embedding available: build it and save it
    embedding = Embedding(vectorizer=vectorizer_author)
    embedding.fit_transform(corpus)
    embedding.save(filename="dblp_author_embedding", path=p)
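The cell above follows a common "load if cached, otherwise compute and save" pattern. The sketch below reproduces that pattern with the standard library only. `load_or_compute` and `expensive` are hypothetical helpers written for this illustration, and `pickle` is used as a generic stand-in, not as a description of how Gismo stores its embeddings.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(path, compute):
    """Return the object pickled at `path`, or compute it and save it there."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        result = compute()
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result


calls = []


def expensive():
    """Stand-in for a long computation; records each call."""
    calls.append(1)
    return {"fabien_mathieu": 0, "laurent_viennot": 1}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "toy_embedding.pkl"
    first = load_or_compute(cache, expensive)   # computes and saves
    second = load_or_compute(cache, expensive)  # loads from disk

print(first == second, len(calls))  # True 1
```

Catching a specific exception such as `FileNotFoundError` avoids hiding unrelated errors, for example a typo in the computation itself.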

In [14]:
gismo = Gismo(corpus, embedding)

Given the size of the dataset, processing a query can take about one second.

In [15]:
gismo.rank("fabien_mathieu")

True

In [16]:
from gismo.post_processing import post_features_cluster_print
gismo.post_features_cluster = post_features_cluster_print
gismo.get_features_by_cluster()

 F: 0.02. R: 0.26. S: 0.76.
- F: 0.03. R: 0.25. S: 0.75.
-- F: 0.04. R: 0.24. S: 0.74.
--- F: 0.06. R: 0.22. S: 0.70.
---- F: 0.07. R: 0.20. S: 0.64.
----- F: 0.24. R: 0.18. S: 0.58.
------ F: 0.29. R: 0.18. S: 0.60.
------- F: 0.44. R: 0.15. S: 0.80.
-------- fabien_mathieu (R: 0.11; S: 1.00)
-------- F: 0.60. R: 0.04. S: 0.45.
--------- laurent_viennot (R: 0.02; S: 0.48)
--------- F: 0.70. R: 0.02. S: 0.38.
---------- diego_perino (R: 0.02; S: 0.43)
---------- yacine_boufkhad (R: 0.01; S: 0.30)
------- F: 0.77. R: 0.03. S: 0.37.
-------- julien_reynier (R: 0.01; S: 0.38)
-------- fabien_de_montgolfier (R: 0.01; S: 0.39)
-------- anh-tuan_gai (R: 0.01; S: 0.30)
------ gheorghe_postelnicu (R: 0.00; S: 0.19)
----- F: 0.33. R: 0.01. S: 0.30.
------ the_dang_huynh (R: 0.01; S: 0.29)
------ dohy_hong (R: 0.00; S: 0.18)
----- nidhi_hegde (R: 0.00; S: 0.21)
---- F: 0.93. R: 0.02. S: 0.33.
----- ludovic_noirie (R: 0.01; S: 0.35)
----- françois_durand (R: 0.01; S: 0.32)
--- F: 0.48. R: 0.02. S

### Using landmarks to shrink a source

To reduce the size of the dataset, we make landmarks out of the researchers, and we credit each entry with a budget of 2,000 articles.

In [17]:
from gismo.landmarks import Landmarks
lincs_landmarks_full = Landmarks(source=lincs, to_text=lambda x: x['dblp'], 
                                 x_density=2000)

We launch the computation of the source. This takes a couple of minutes, as a ranking diffusion needs to be performed for all researchers.

In [18]:
import logging
logging.basicConfig()
log = logging.getLogger("Gismo")
log.setLevel(level=logging.DEBUG)

In [19]:
reduced_source = lincs_landmarks_full.get_reduced_source(gismo)

INFO:Gismo:Start computation of 45 landmarks.
DEBUG:Gismo:Processing ana_busic.
DEBUG:Gismo:Landmarks of ana_busic computed.
DEBUG:Gismo:Processing anais_vergne.
DEBUG:Gismo:Landmarks of anais_vergne computed.
DEBUG:Gismo:Processing bartek_blaszczyszyn.
DEBUG:Gismo:Landmarks of bartek_blaszczyszyn computed.
DEBUG:Gismo:Processing chung_shue_chen.
DEBUG:Gismo:Landmarks of chung_shue_chen computed.
DEBUG:Gismo:Processing daniel_kofman.
DEBUG:Gismo:Landmarks of daniel_kofman computed.
DEBUG:Gismo:Processing eitan_altman.
DEBUG:Gismo:Landmarks of eitan_altman computed.
DEBUG:Gismo:Processing fabien_mathieu.
DEBUG:Gismo:Landmarks of fabien_mathieu computed.
DEBUG:Gismo:Processing françois_baccelli.
DEBUG:Gismo:Landmarks of françois_baccelli computed.
DEBUG:Gismo:Processing laurent_decreusefond.
DEBUG:Gismo:Landmarks of laurent_decreusefond computed.
DEBUG:Gismo:Processing laurent_viennot.
DEBUG:Gismo:Landmarks of laurent_viennot computed.
DEBUG:Gismo:Processing luca_muscariello.
DEBUG:Gismo

In [20]:
print(f"Source length went down from {len(source)} to {len(reduced_source)}.")

Source length went down from 4857787 to 57993.


Instead of nearly 5,000,000 general-purpose articles, we now have about 58,000 articles that lie in the neighborhood of the considered researchers. We can now close the file descriptor, as we no longer need access to the original source.

In [21]:
source.close()
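Conceptually, the reduction performed in this subsection can be pictured as the union of the best-ranked articles of each landmark. The sketch below is only an illustration of that idea under the assumption that each landmark keeps a fixed budget of top articles. It does not reproduce Gismo's ranking diffusion or the internals of `get_reduced_source`, and the `reduce_source` helper and the rankings are hypothetical.

```python
def reduce_source(rankings, budget):
    """Union of the `budget` best-ranked article indices of each landmark."""
    kept = set()
    for ranked_articles in rankings.values():
        kept.update(ranked_articles[:budget])
    return sorted(kept)


# Hypothetical rankings: article indices sorted by decreasing relevance.
rankings = {
    "fabien_mathieu": [4, 7, 1, 9, 3],
    "laurent_viennot": [7, 4, 2, 8, 0],
    "ana_busic": [5, 6, 4, 1, 2],
}

reduced = reduce_source(rankings, budget=2)
print(reduced)  # [4, 5, 6, 7]
```

Because close collaborators share many articles, the union is smaller than the sum of the budgets. This is consistent with the figures above: 45 landmarks with a budget of 2,000 articles each give at most 90,000 articles, and the reduced source contains 57,993.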

### Building a small XGismo

#### Author Embedding

Author embedding takes a couple of seconds instead of a couple of minutes.

In [22]:
reduced_corpus = Corpus(reduced_source, to_text=to_authors_text)
reduced_author_embedding = Embedding(vectorizer=vectorizer_author)
reduced_author_embedding.fit_transform(reduced_corpus)

#### Sanity Check

We can rebuild a small author Gismo. This part is merely a sanity check to verify that the reduction didn't change things too much in the vicinity of the LINCS researchers.

In [23]:
reduced_gismo = Gismo(reduced_corpus, reduced_author_embedding)

Ranking is nearly instant.

In [24]:
reduced_gismo.rank("fabien_mathieu")

True

The results are almost identical to what was returned by the full Gismo.

In [25]:
from gismo.post_processing import post_features_cluster_print
reduced_gismo.post_features_cluster = post_features_cluster_print
reduced_gismo.get_features_by_cluster()

 F: 0.02. R: 0.26. S: 0.74.
- F: 0.03. R: 0.25. S: 0.72.
-- F: 0.03. R: 0.24. S: 0.71.
--- F: 0.05. R: 0.22. S: 0.67.
---- F: 0.07. R: 0.20. S: 0.64.
----- F: 0.25. R: 0.18. S: 0.57.
------ F: 0.29. R: 0.18. S: 0.60.
------- F: 0.44. R: 0.15. S: 0.81.
-------- fabien_mathieu (R: 0.11; S: 1.00)
-------- F: 0.59. R: 0.04. S: 0.45.
--------- laurent_viennot (R: 0.02; S: 0.47)
--------- F: 0.69. R: 0.02. S: 0.38.
---------- diego_perino (R: 0.02; S: 0.42)
---------- yacine_boufkhad (R: 0.01; S: 0.30)
------- F: 0.77. R: 0.03. S: 0.37.
-------- julien_reynier (R: 0.01; S: 0.37)
-------- fabien_de_montgolfier (R: 0.01; S: 0.38)
-------- anh-tuan_gai (R: 0.01; S: 0.29)
------ gheorghe_postelnicu (R: 0.00; S: 0.19)
----- F: 0.33. R: 0.01. S: 0.30.
------ the_dang_huynh (R: 0.01; S: 0.29)
------ dohy_hong (R: 0.00; S: 0.18)
----- nidhi_hegde (R: 0.00; S: 0.21)
---- F: 0.57. R: 0.02. S: 0.30.
----- F: 0.94. R: 0.02. S: 0.33.
------ françois_durand (R: 0.01; S: 0.32)
------ ludovic_noirie (R: 0.0

#### Word Embedding

Now we build the word embedding, with the spaCy add-on. This takes a couple of minutes instead of a couple of hours.

In [26]:
import spacy
# Initialize spacy 'en' model, keeping only tagger component needed for lemmatization
nlp = spacy.load('en', disable=['parser', 'ner'])
# Who cares about DET and such?
keep = {'ADJ', 'NOUN', 'NUM', 'PROPN', 'SYM', 'VERB'}

preprocessor=lambda txt: " ".join([token.lemma_.lower() for token in nlp(txt) 
                                   if token.pos_ in keep and not token.is_stop])
vectorizer_text = CountVectorizer(dtype=float, min_df=5, max_df=.02, ngram_range=(1, 3), preprocessor=preprocessor)

In [27]:
reduced_corpus.to_text = lambda e: e['title']
reduced_word_embedding = Embedding(vectorizer=vectorizer_text)
reduced_word_embedding.fit_transform(reduced_corpus)

#### Gathering pieces together

We can combine the reduced embeddings to build an XGismo between authors and words.

In [28]:
from gismo.gismo import XGismo
xgismo = XGismo(x_embedding=reduced_author_embedding, y_embedding=reduced_word_embedding)

We can save this for later use.

In [29]:
xgismo.save(filename="reduced_lincs_xgismo", path=p, erase=True)

The file should be about 53 MB, whereas a full-size DBLP XGismo is about 4 GB. What about speed and quality of results?

In [30]:
xgismo.rank("anne_bouillard", y=False)

True

In [31]:
xgismo.get_documents_by_rank()

['anne_bouillard',
 'bruno_gaujal',
 'ana_busic',
 'eric_thierry',
 'ramavarapu_s._sreenivas',
 'albert_benveniste',
 'giovanni_stea',
 'sidney_rosario',
 'jean_mairesse',
 'nico_m._van_dijk',
 'eitan_altman',
 'stefan_haar',
 'frank_p._kelly',
 'zhen_liu_0001',
 'claude_jard',
 'aurore_junier',
 'giovanni_chiola',
 'elie_de_panafieu',
 'christelle_rovetta',
 'bruno_salvy',
 'yves_dallery',
 'rémi_varloot',
 'wojciech_szpankowski',
 'ulle_endriss',
 'laurent_jouhet',
 'radu_mateescu_0001',
 'marco_ajmone_marsan',
 'françois_baccelli',
 'jean-marc_vincent']

Let's try a more elaborate display.

In [32]:
from gismo.post_processing import post_documents_cluster_print, post_features_cluster_print
xgismo.parameters.distortion = 1.0
xgismo.post_documents_cluster = post_documents_cluster_print
xgismo.post_features_cluster = post_features_cluster_print
xgismo.get_documents_by_cluster()

 F: 0.01. R: 0.08. S: 0.83.
- F: 0.08. R: 0.08. S: 0.82.
-- F: 0.51. R: 0.03. S: 0.69.
--- F: 0.58. R: 0.03. S: 0.67.
---- F: 0.84. R: 0.03. S: 0.67.
----- F: 0.84. R: 0.03. S: 0.68.
------ anne_bouillard (R: 0.02; S: 0.79)
------ bruno_gaujal (R: 0.01; S: 0.66)
------ eric_thierry (R: 0.00; S: 0.60)
------ laurent_jouhet (R: 0.00; S: 0.59)
----- elie_de_panafieu (R: 0.00; S: 0.49)
---- radu_mateescu_0001 (R: 0.00; S: 0.39)
--- aurore_junier (R: 0.00; S: 0.50)
-- F: 0.45. R: 0.02. S: 0.38.
--- F: 0.74. R: 0.01. S: 0.35.
---- ana_busic (R: 0.00; S: 0.37)
---- nico_m._van_dijk (R: 0.00; S: 0.25)
---- frank_p._kelly (R: 0.00; S: 0.31)
---- christelle_rovetta (R: 0.00; S: 0.33)
---- yves_dallery (R: 0.00; S: 0.29)
--- eitan_altman (R: 0.00; S: 0.43)
--- F: 0.67. R: 0.00. S: 0.42.
---- zhen_liu_0001 (R: 0.00; S: 0.35)
---- françois_baccelli (R: 0.00; S: 0.42)
--- jean-marc_vincent (R: 0.00; S: 0.30)
-- F: 0.12. R: 0.01. S: 0.35.
--- F: 0.22. R: 0.01. S: 0.32.
---- F: 0.50. R: 0.01. S: 0.31.

In [33]:
xgismo.get_features_by_cluster(target_k=1.5, resolution=.5, distortion=.5)

 F: 0.32. R: 0.20. S: 0.85.
- F: 0.35. R: 0.17. S: 0.84.
-- F: 0.50. R: 0.15. S: 0.84.
--- F: 0.72. R: 0.10. S: 0.82.
---- network calculus (R: 0.03; S: 0.80)
---- calculus (R: 0.03; S: 0.79)
---- free choice (R: 0.01; S: 0.69)
---- case delay (R: 0.01; S: 0.75)
---- case (R: 0.01; S: 0.79)
--- F: 0.63. R: 0.02. S: 0.69.
---- closed queueing (R: 0.01; S: 0.69)
---- queueing (R: 0.01; S: 0.50)
--- ospf (R: 0.01; S: 0.52)
--- sampling (R: 0.01; S: 0.58)
-- delay (R: 0.01; S: 0.47)
-- end (R: 0.01; S: 0.56)
- F: 0.94. R: 0.02. S: 0.41.
-- service orchestration (R: 0.01; S: 0.50)
-- orchestrations (R: 0.01; S: 0.38)


## Rebuild landmarks

### Lincs landmarks

We can rebuild Lincs landmarks on the XGismo.

In [34]:
lincs_landmarks = Landmarks(source=lincs, to_text=lambda x: x['dblp'],
                           rank = lambda g, q: g.rank(q, y=False))
lincs_landmarks.fit(xgismo)

INFO:Gismo:Start computation of 45 landmarks.
DEBUG:Gismo:Processing ana_busic.
DEBUG:Gismo:Landmarks of ana_busic computed.
DEBUG:Gismo:Processing anais_vergne.
DEBUG:Gismo:Landmarks of anais_vergne computed.
DEBUG:Gismo:Processing bartek_blaszczyszyn.
DEBUG:Gismo:Landmarks of bartek_blaszczyszyn computed.
DEBUG:Gismo:Processing chung_shue_chen.
DEBUG:Gismo:Landmarks of chung_shue_chen computed.
DEBUG:Gismo:Processing daniel_kofman.
DEBUG:Gismo:Landmarks of daniel_kofman computed.
DEBUG:Gismo:Processing eitan_altman.
DEBUG:Gismo:Landmarks of eitan_altman computed.
DEBUG:Gismo:Processing fabien_mathieu.
DEBUG:Gismo:Landmarks of fabien_mathieu computed.
DEBUG:Gismo:Processing françois_baccelli.
DEBUG:Gismo:Landmarks of françois_baccelli computed.
DEBUG:Gismo:Processing laurent_decreusefond.
DEBUG:Gismo:Landmarks of laurent_decreusefond computed.
DEBUG:Gismo:Processing laurent_viennot.
DEBUG:Gismo:Landmarks of laurent_viennot computed.
DEBUG:Gismo:Processing luca_muscariello.
DEBUG:Gismo

We can extract the LINCS researchers that are the most similar to a given researcher (not necessarily from LINCS).

In [35]:
xgismo.rank("anne_bouillard", y=False)
lincs_landmarks.post_item = lambda l, i: l[i]['name']
lincs_landmarks.get_landmarks_by_rank(xgismo)

['Ana Bušić',
 'Elie de Panafieu',
 'Andrea Marcano',
 'Fabien Mathieu',
 'Marc-Olivier Buob',
 'Dalia Popescu',
 'Bartek Blaszczyszyn',
 'François Baccelli',
 'Eitan Altman',
 'Dimitrios Milioris',
 'Renata Teixeira',
 'Philippe Jacquet',
 'Laurent Massoulié',
 'Marc Lelarge',
 'Thomas Bonald',
 'Tijani Chahed']

We can also organize the landmarks in clusters (here for the same researcher query).

In [36]:
xgismo.rank("anne_bouillard", y=False)
from gismo.post_processing import post_landmarks_cluster_print
lincs_landmarks.post_cluster = post_landmarks_cluster_print
lincs_landmarks.get_landmarks_by_cluster(xgismo, balance=.5, target_k=1.2)

 F: 0.03. 
- F: 0.09. 
-- F: 0.11. 
--- F: 0.12. 
---- Ana Bušić 
---- F: 0.39. 
----- Marc-Olivier Buob 
----- Bartek Blaszczyszyn 
---- F: 0.40. 
----- Dalia Popescu 
----- F: 0.47. 
------ François Baccelli 
------ Eitan Altman 
--- F: 0.80. 
---- Elie de Panafieu 
---- Fabien Mathieu 
-- Dimitrios Milioris 
- Andrea Marcano 


### ACM landmarks

We can build other landmarks using the ACM categories. This enables us to describe things in terms of categories.

In [37]:
from gismo.datasets.acm import get_acm, flatten_acm
acm = flatten_acm(get_acm(), min_size=10)

In [38]:
acm_landmarks = Landmarks(acm, to_text=lambda e: e['query'])

In [39]:
log.setLevel(logging.INFO)
acm_landmarks.fit(xgismo)

INFO:Gismo:Start computation of 113 landmarks.
INFO:Gismo:All landmarks are built.


In [40]:
xgismo.rank("fabien_mathieu", y=False)
acm_landmarks.post_item = lambda l, i: l[i]['name']
acm_landmarks.get_landmarks_by_rank(xgismo, balance=.5, target_k=1.2)

['Web searching and information discovery',
 'Theory of computation',
 'Software system structures']

In [41]:
xgismo.rank("combinatorics")
acm_landmarks.post_cluster = post_landmarks_cluster_print
acm_landmarks.get_landmarks_by_cluster(xgismo, balance=.5, target_k=1.5)

 F: 0.27. 
- F: 0.57. 
-- F: 0.91. 
--- Discrete mathematics 
--- Mathematics of computing 
--- Graph theory 
-- F: 0.94. 
--- Data structures 
--- Models of computation 
--- Hardware test 
--- Semantics and reasoning 
--- Design and analysis of algorithms 
--- Theory of computation 
-- F: 0.89. 
--- Symbolic and algebraic algorithms 
--- Symbolic and algebraic manipulation 
--- Numerical analysis 
--- Mathematical analysis 
- Formal languages and automata theory 


Note that we fully ignore the original ACM category hierarchy. Instead, Gismo builds its own hierarchy tailored to the query.

### Combining landmarks

Through the ``post_processing`` methods, we can intertwine multiple landmarks. For example, the following method associates LINCS researchers and keywords with a tree of ACM categories.

In [42]:
from gismo.common import auto_k
import numpy as np
def post_cluster_acm(l, cluster, depth=0, kw_size=.3, mts_size=.5):
    tk_kw = 1/kw_size
    tk_mts = 1/mts_size
    n = l.x_direction.shape[0]

    kws_view = cluster.vector[0, n:]
    k = auto_k(data=kws_view.data, max_k=100, target=tk_kw)
    keywords = [xgismo.embedding.features[i] 
                for i in kws_view.indices[np.argsort(-kws_view.data)[:k]]]

    if len(cluster.children) > 0:
        print(f"|{'-'*depth}")
        for c in cluster.children:
            post_cluster_acm(l, c, depth=depth+1)
    else:
        domain = l[cluster.indice]['name']
        researchers = ", ".join(lincs_landmarks.get_landmarks_by_rank(cluster, 
                                                          target_k=tk_mts,
                                                        distortion=0.5))    
        print(f"|{'-'*depth} {domain} ({researchers}) ({', '.join(keywords)})")


In [43]:
xgismo.rank("combinatorics")
acm_landmarks.post_cluster = post_cluster_acm
acm_landmarks.get_landmarks_by_cluster(xgismo, target_k=1.5)

|
|-
|--
|--- Discrete mathematics (Elie de Panafieu, Philippe Jacquet) (graph, combinatoric, analytic, random, analytic combinatoric, complexity)
|--- Mathematics of computing (Elie de Panafieu, Philippe Jacquet) (graph, random, complexity, analytic, combinatoric)
|--- Graph theory (Elie de Panafieu, Philippe Jacquet) (graph, random, complexity)
|--
|--- Data structures (Philippe Jacquet, Elie de Panafieu, Marc-Olivier Buob) (compression, graph, complexity, structure, random, analytic)
|--- Models of computation (Elie de Panafieu, Philippe Jacquet) (complexity, graph, random, parallel, function)
|--- Hardware test (Philippe Jacquet, Elie de Panafieu, Marc-Olivier Buob) (compression, graph, complexity, random)
|--- Semantics and reasoning (Philippe Jacquet, Elie de Panafieu, Marc-Olivier Buob) (graph, complexity, random)
|--- Design and analysis of algorithms (Philippe Jacquet, Elie de Panafieu, Marc-Olivier Buob) (graph, complexity, random, algorithms, problem)
|--- Theory of computat

Conversely, one can associate ACM categories and keywords to a tree of Lincs researchers.

In [44]:
def post_cluster_lincs(l, cluster, depth=0, kw_size=.3, acm_size=.5):
    tk_kw = 1/kw_size
    tk_acm = 1/acm_size
    n = l.x_direction.shape[0]

    kws_view = cluster.vector[0, n:]
    k = auto_k(data=kws_view.data, max_k=100, target=tk_kw)
    keywords = [xgismo.embedding.features[i] 
                for i in kws_view.indices[np.argsort(-kws_view.data)[:k]]]

    if len(cluster.children) > 0:
        print(f"|{'-'*depth}")
        for c in cluster.children:
            post_cluster_lincs(l, c, depth=depth+1)
    else:
        researcher = l[cluster.indice]['name']
        domains = ", ".join(acm_landmarks.get_landmarks_by_rank(cluster, 
                                                          target_k=tk_acm,
                                                        distortion=0.5))
        print(f"|{'-'*depth} {researcher} ({domains}) ({', '.join(keywords)})")

In [45]:
xgismo.rank("anne_bouillard", y=False)
lincs_landmarks.post_cluster = post_cluster_lincs
lincs_landmarks.get_landmarks_by_cluster(xgismo, target_k=1.4)

|
|-
|--
|--- Ana Bušić (Mathematical analysis, Mathematics of computing, Probabilistic reasoning algorithms, Probability and statistics, Symbolic and algebraic algorithms) (sampling, perfect sampling, perfect, queueing, closed queueing, exact, closed)
|---
|---- Elie de Panafieu (Discrete mathematics, Symbolic and algebraic algorithms, Symbolic and algebraic manipulation, Models of computation, Mathematics of computing, Mathematical analysis, Graph theory) (network calculus, calculus, analytic, combinatoric, analytic combinatoric, kernels)
|---- Fabien Mathieu (Symbolic and algebraic algorithms, Models of computation, Symbolic and algebraic manipulation) (network calculus, calculus, space, network design, transport network, kernels, fronthaul)
|-- Marc-Olivier Buob (Design and analysis of algorithms) (routing, delays, pattern matching, lightweight, log, space time, space, route, matching, pattern)
|- Andrea Marcano (Power and energy) (impact)
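Both post-processing functions above share the same recursive pattern: print a bare prefix for an internal cluster, recurse on its children with an increased depth, and print a full description for a leaf. The sketch below isolates that pattern on a toy tree. The `Node` class and the `render` function are hypothetical stand-ins written for this illustration. They are not Gismo's cluster objects, and the labels are simply borrowed from the outputs above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a cluster: a label and optional children."""
    label: str = ""
    children: list = field(default_factory=list)


def render(node, depth=0, lines=None):
    """Collect one line per node, prefixed by '|' and `depth` dashes."""
    if lines is None:
        lines = []
    if node.children:
        # Internal node: bare prefix, then recurse one level deeper.
        lines.append(f"|{'-' * depth}")
        for child in node.children:
            render(child, depth + 1, lines)
    else:
        # Leaf: prefix followed by the description.
        lines.append(f"|{'-' * depth} {node.label}")
    return lines


tree = Node(children=[
    Node(children=[Node("Discrete mathematics"), Node("Graph theory")]),
    Node("Formal languages and automata theory"),
])

print("\n".join(render(tree)))
```

Returning the lines instead of printing them directly makes the function easy to test. The tutorial functions print as they go, which is convenient in a notebook.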


That's all for this tutorial!