# DBLP exploration

This tutorial shows how to explore DBLP with Gismo.

If you have never used Gismo before, you may want to start with the *Toy example tutorial* or the *ACM* tutorial.

Note: the DBLP database is not small. Recommended requirements to execute this Notebook:
- Fast Internet connection (you will need to download a few hundred MB)
- 4 GB of free space
- 4 GB of RAM (8 GB or more recommended)
- Decent CPU (can take more than one hour on slow CPUs)

Here, *documents* are articles in DBLP. The *features* of an article vary across the tutorial: first the words of its title, then its authors.

## Initialisation

First, we load the required packages.

In [1]:
import numpy as np
import spacy
from gismo import Corpus, Embedding, CountVectorizer, cosine_similarity, Gismo
from pathlib import Path
from functools import partial

from gismo.datasets.dblp import Dblp
from gismo.filesource import FileSource
from gismo.post_processing import post_features_cluster_print, post_documents_cluster_print

Then, we prepare the DBLP source.

First we choose the location of the DBLP files. If you want to run this Notebook on your own machine, please change the path below and check that it exists.

In [4]:
path = Path("../../../../../Datasets/DBLP")
path.exists()

True
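
If the folder does not exist yet, it can be created before building the database. A minimal sketch using only `pathlib`; the folder name `my_dblp_folder` and the temporary base directory are placeholders for illustration, not paths used by this tutorial.

```python
from pathlib import Path
import tempfile

# Placeholder location for illustration only: replace with your own DBLP folder.
base = Path(tempfile.mkdtemp())
dblp_path = base / "my_dblp_folder"

# Create the folder (and any missing parents); do nothing if it already exists.
dblp_path.mkdir(parents=True, exist_ok=True)
created = dblp_path.exists()
print(created)
```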

Construction of the DBLP files. This only needs to be performed the first time or when you want to refresh the database. It takes about 10 minutes on a Surface Pro 4 with a fiber Internet connection.

In [8]:
dblp = Dblp(path=path)
dblp.build()

Retrieve https://dblp.uni-trier.de/xml/dblp.xml.gz from the Internet.
DBLP database downloaded to ..\..\..\..\..\Datasets\DBLP\dblp.xml.gz.
Converting DBLP database from ..\..\..\..\..\Datasets\DBLP\dblp.xml.gz (may take a while).
Building Index.
Conversion done.


Then, we can load the database as a filesource.

In [9]:
source = FileSource(filename="dblp", path=path)
source[0]

{'type': 'inproceedings',
 'authors': ['Arnon Rosenthal'],
 'title': 'The Future of Classic Data Administration: Objects + Databases + CASE',
 'venue': 'SWEE',
 'year': '1998'}

Each article is a dict with fields ``type``, ``venue``, ``title``, ``year``, and ``authors``. We build a corpus that will tell Gismo that the content of an article is its ``title`` value.

In [10]:
corpus = Corpus(source, to_text=lambda x: x['title'])
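
Conceptually, `to_text` is just a function applied to each item of the source to obtain the text to embed. The plain-Python sketch below illustrates that idea with the record shown above and one record taken from a later output of this tutorial. The names `toy_source` and `iterate_text` are illustrative and are not part of Gismo.

```python
# Two records in the same dict format as the DBLP source.
# The second one only carries the fields visible in this tutorial's outputs.
toy_source = [
    {'type': 'inproceedings',
     'authors': ['Arnon Rosenthal'],
     'title': 'The Future of Classic Data Administration: Objects + Databases + CASE',
     'venue': 'SWEE',
     'year': '1998'},
    {'authors': ['Richard Black', 'Heimir Sverrisson', 'Laurent Massoulié'],
     'title': 'Scalable Local Area Service Discovery.',
     'venue': 'ICC',
     'year': '2007'},
]


def iterate_text(source, to_text):
    """Yield the text representation of each item (illustrative helper)."""
    for item in source:
        yield to_text(item)


titles = list(iterate_text(toy_source, to_text=lambda x: x['title']))
print(titles)
```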

We build an embedding on top of that corpus.
- We set ``min_df=30`` to exclude rare features;
- We set ``max_df=.02`` to exclude anything present in more than 2% of the corpus;
- We use `spacy` to lemmatize and remove some stopwords; remove `preprocessor=...` from the input if you want to skip this (it takes time);
- We add a few manually selected stopwords to fine-tune things;
- We set ``ngram_range=(1, 2)`` to include bi-grams in the embedding.

This will take a few minutes (without spacy) up to a few hours (with spacy enabled). You can save the embedding for later if you want.
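
To make the effect of ``min_df``, ``max_df`` and ``ngram_range`` more concrete, here is a small plain-Python sketch of the filtering they describe: a feature is kept if it appears in at least ``min_df`` documents and in at most a fraction ``max_df`` of the corpus. The toy corpus, the scaled-down thresholds and the helpers `ngrams` and `select_features` are assumptions made for illustration. This is not the vectorizer's implementation.

```python
from collections import Counter


def ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams (space-joined) for n in the inclusive ngram_range."""
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]


def select_features(docs, min_df, max_df, ngram_range=(1, 2)):
    """Keep features whose document frequency is in [min_df, max_df * len(docs)]."""
    df = Counter()
    for tokens in docs:
        # A feature counts once per document, however often it occurs in it.
        df.update(set(ngrams(tokens, ngram_range)))
    limit = max_df * len(docs)
    return sorted(f for f, c in df.items() if min_df <= c <= limit)


# Hypothetical toy corpus of already tokenized titles.
docs = [
    ["machine", "learning", "covid"],
    ["machine", "learning", "network"],
    ["deep", "learning", "covid"],
    ["online", "learning", "student"],
]
features = select_features(docs, min_df=2, max_df=0.75)
print(features)
```

In this toy run, `learning` is dropped because it appears in every document, and features seen in a single document are dropped as too rare.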

In [None]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
keep = {'ADJ', 'NOUN', 'NUM', 'PROPN', 'SYM', 'VERB'}
vectorizer = CountVectorizer(min_df=30, max_df=.02, ngram_range=(1, 2), dtype=float,
                             preprocessor=lambda txt: " ".join([w.lemma_.lower() for w in nlp(txt) 
                                                                if w.pos_ in keep and not w.is_stop]),
                             stop_words=['a', 'about', 'an', 'and', 'for', 'from', 'in', 'of', 'on', 'the', 'with'])

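try:
    # Reuse a previously saved embedding if one is available
    embedding = Embedding.load(filename="dblp_embedding", path=path)
except Exception:
    # Otherwise compute the embedding and save it for later
    embedding = Embedding(vectorizer=vectorizer)
    embedding.fit_transform(corpus)
    embedding.dump(filename="dblp_embedding", path=path)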



In [None]:
embedding.x

We see from ``embedding.x`` that the embedding links about 6,200,000 documents to 193,000 features. On average, each document is linked to about 10 features.

Now, we initialize the gismo object and customize the post-processors to ease the display.

In [None]:
gismo = Gismo(corpus, embedding)

In [None]:
def post_article(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{dic['title']} By {authors} ({dic['venue']}, {dic['year']})"
    
gismo.post_documents_item = post_article

def post_title(g, i):
    return g.corpus[i]['title']

def post_meta(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{authors} ({dic['venue']}, {dic['year']})"


gismo.post_documents_cluster = partial(post_documents_cluster_print, post_item=post_title)
gismo.post_features_cluster = post_features_cluster_print
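
These post-processors only need `g.corpus[i]` to return an article dict, so their output can be checked without building a full Gismo object. The sketch below copies `post_article` and `post_meta` from the cell above and runs them on the sample record shown earlier. The `fake_gismo` stand-in is an assumption made for illustration.

```python
from types import SimpleNamespace


def post_article(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{dic['title']} By {authors} ({dic['venue']}, {dic['year']})"


def post_meta(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{authors} ({dic['venue']}, {dic['year']})"


# Stand-in for a Gismo object: only the `corpus` attribute is used here.
fake_gismo = SimpleNamespace(corpus=[
    {'type': 'inproceedings',
     'authors': ['Arnon Rosenthal'],
     'title': 'The Future of Classic Data Administration: Objects + Databases + CASE',
     'venue': 'SWEE',
     'year': '1998'},
])

article_line = post_article(fake_gismo, 0)
meta_line = post_meta(fake_gismo, 0)
print(article_line)
print(meta_line)
```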

As the dataset is big, we lower the precision of the computation to speed up things a little bit.

In [None]:
gismo.parameters.n_iter = 2

## Machine Learning (and Covid-19) query

We perform the query *Machine learning*. The returned ``True`` indicates that some of the query features were found among the features of the corpus.

In [None]:
gismo.rank("Machine Learning")

What are the best articles on *Machine Learning*?

In [None]:
gismo.get_documents_by_rank()

OK, this seems to go everywhere. Maybe we can narrow with a more specific request.

In [13]:
gismo.rank("Machine Learning and covid-19")

True

In [14]:
gismo.get_documents_by_rank()

['Ergonomics of Virtual Learning During COVID-19. By Lu Yuan, Alison Garaudy (AHFE (11), 2021)',
 'University Virtual Learning in Covid Times. By Verónica Marín-Díaz, Eloísa Reche, Javier Martín (Technol. Knowl. Learn., 2022)',
 'DCML: Deep contrastive mutual learning for COVID-19 recognition. By Hongbin Zhang, Weinan Liang, Chuanxiu Li, Qipeng Xiong, Haowei Shi, Lang Hu, Guangli Li (Biomed. Signal Process. Control., 2022)',
 'Interpretable Sequence Learning for Covid-19 Forecasting. By Sercan Ömer Arik, Chun-Liang Li, Jinsung Yoon, Rajarishi Sinha, Arkady Epshteyn, Long T. Le, Vikas Menon, Shashank Singh 0005, Leyou Zhang, Martin Nikoltchev, Yash Sonthalia, Hootan Nakhost, Elli Kanal, Tomas Pfister (NeurIPS, 2020)',
 'Interpretable Sequence Learning for COVID-19 Forecasting. By Sercan Ömer Arik, Chun-Liang Li, Jinsung Yoon, Rajarishi Sinha, Arkady Epshteyn, Long T. Le, Vikas Menon, Shashank Singh 0005, Leyou Zhang, Nate Yoder, Martin Nikoltchev, Yash Sonthalia, Hootan Nakhost, Elli Ka

Sounds nice. How are the top-10 articles related? Note: as the graph structure is really sparse on the document side (about 10 features per document), it is best to deactivate the query distortion, which is intended for longer documents.

In [15]:
gismo.parameters.distortion = 0.0
gismo.get_documents_by_cluster(k=10)

 F: 0.45. R: 0.01. S: 0.74.
- F: 0.45. R: 0.01. S: 0.73.
-- F: 0.45. R: 0.00. S: 0.66.
--- F: 0.66. R: 0.00. S: 0.53.
---- Ergonomics of Virtual Learning During COVID-19. (R: 0.00; S: 0.63)
---- University Virtual Learning in Covid Times. (R: 0.00; S: 0.35)
--- DCML: Deep contrastive mutual learning for COVID-19 recognition. (R: 0.00; S: 0.60)
--- Dual Teaching: Simultaneous Remote and In-Person Learning During COVID. (R: 0.00; S: 0.35)
--- An Analysis of the Effectiveness of Emergency Distance Learning under COVID-19. (R: 0.00; S: 0.56)
-- F: 0.70. R: 0.00. S: 0.60.
--- F: 1.00. R: 0.00. S: 0.54.
---- Interpretable Sequence Learning for Covid-19 Forecasting. (R: 0.00; S: 0.54)
---- Interpretable Sequence Learning for COVID-19 Forecasting. (R: 0.00; S: 0.54)
--- Automated Machine Learning for COVID-19 Forecasting. (R: 0.00; S: 0.59)
-- The Deaf Experience in Remote Learning during COVID-19. (R: 0.00; S: 0.52)
- The Study on the Efficiency of Smart Learning in the COVID-19. (R: 0.00; S:

Now, let's look at the main keywords.

In [16]:
gismo.get_features_by_rank(20)

['covid',
 '19',
 'covid 19',
 'learning covid',
 'machine',
 'machine learning',
 'pandemic',
 '19 pandemic',
 'online learning',
 'online',
 'chest',
 'deep',
 '19 detection',
 'deep learning',
 'student',
 'distance learning',
 'ray',
 'chest ray',
 'classification',
 'ct']

Let's organize them.

In [17]:
# On the feature side, the graph is more dense so we can use query distortion
gismo.get_features_by_cluster(distortion=1)

 F: 0.31. R: 0.06. S: 0.97.
- F: 0.46. R: 0.06. S: 0.97.
-- F: 0.55. R: 0.05. S: 0.97.
--- F: 0.96. R: 0.05. S: 0.97.
---- covid (R: 0.01; S: 1.00)
---- 19 (R: 0.01; S: 0.99)
---- covid 19 (R: 0.01; S: 0.99)
---- learning covid (R: 0.01; S: 0.97)
--- F: 0.97. R: 0.01. S: 0.55.
---- pandemic (R: 0.00; S: 0.56)
---- 19 pandemic (R: 0.00; S: 0.54)
-- F: 0.95. R: 0.00. S: 0.44.
--- online learning (R: 0.00; S: 0.44)
--- online (R: 0.00; S: 0.45)
- F: 1.00. R: 0.01. S: 0.31.
-- machine (R: 0.00; S: 0.31)
-- machine learning (R: 0.00; S: 0.31)


Rough, very broad analysis:
- One big keyword cluster about Coronavirus / Covid-19, pandemic, online learning;
- Machine Learning as a separate small cluster.

In [18]:
np.dot(gismo.embedding.query_projection("Machine learning")[0], gismo.embedding.y)

<1x6232511 sparse matrix of type '<class 'numpy.float64'>'
	with 88256 stored elements in Compressed Sparse Row format>

About 88,000 articles have an explicit link to machine learning.

In [19]:
np.dot(gismo.embedding.query_projection("Covid-19")[0], gismo.embedding.y)

<1x6232511 sparse matrix of type '<class 'numpy.float64'>'
	with 11831 stored elements in Compressed Sparse Row format>

12,000 articles with an explicit link to covid-19.
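
The number of stored elements in these products can be read as the number of documents that share at least one feature with the query. A toy plain-Python sketch of that count follows. The inverted index `index` and its content are hypothetical and do not reflect Gismo's data structures.

```python
# Hypothetical feature -> set of document ids, playing the role of embedding.y.
index = {
    "machine": {0, 2, 5},
    "machine learning": {0, 5},
    "covid": {1, 5},
    "19": {1, 5},
    "covid 19": {1, 5},
}


def linked_documents(query_features, index):
    """Return the documents sharing at least one feature with the query."""
    docs = set()
    for feature in query_features:
        docs |= index.get(feature, set())
    return docs


ml_docs = linked_documents(["machine", "machine learning"], index)
covid_docs = linked_documents(["covid", "19", "covid 19"], index)
both = ml_docs & covid_docs
print(len(ml_docs), len(covid_docs), len(both))
```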

## Authors query

Instead of looking at words, we can explore authors and their collaborations.

We just have to rewire the corpus to output a string of authors.

In [20]:
def to_authors_text(dic):
    return " ".join([a.replace(' ', '_') for a in dic['authors']])
corpus = Corpus(source, to_text=to_authors_text)

We can build a new embedding on top of this modified corpus. We tell the vectorizer to be stupid: don't preprocess, words are separated by spaces.

This will take a few minutes (you can save the embedding for later if you want).
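
The underscore trick is what makes a whitespace split safe: each author becomes a single token, and the original name can be recovered for display. A short check in plain Python, using an author list taken from the outputs of this tutorial:

```python
def to_authors_text(dic):
    return " ".join([a.replace(' ', '_') for a in dic['authors']])


article = {'authors': ['Mathieu Even', 'Hadrien Hendrikx', 'Laurent Massoulié']}

text = to_authors_text(article)                    # one string per article
tokens = text.split(' ')                           # same split as the tokenizer
names = [t.replace('_', ' ') for t in tokens]      # back to readable names
print(text)
print(tokens)
```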

In [21]:
vectorizer = CountVectorizer(dtype=float,
                            preprocessor=lambda x:x, tokenizer=lambda x: x.split(' '))
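try:
    # Reuse a previously saved embedding if one is available
    a_embedding = Embedding.load(filename="dblp_aut_embedding", path=path)
except Exception:
    # Otherwise compute the embedding and save it for later
    a_embedding = Embedding(vectorizer=vectorizer)
    a_embedding.fit_transform(corpus)
    a_embedding.dump(filename="dblp_aut_embedding", path=path)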

In [22]:
a_embedding.x

<6232511x3200857 sparse matrix of type '<class 'numpy.float64'>'
	with 20237296 stored elements in Compressed Sparse Row format>

We now have about 3,200,000 authors to explore. Let's reload gismo and try to play.
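
A few orders of magnitude can be derived from the shape and the number of stored elements displayed above. The arithmetic below uses only those three numbers; the variable names are illustrative.

```python
# Values read from the sparse matrix summary displayed above.
n_documents, n_authors = 6232511, 3200857
n_links = 20237296  # stored (article, author) pairs

authors_per_article = n_links / n_documents
articles_per_author = n_links / n_authors
density = n_links / (n_documents * n_authors)

print(f"{authors_per_article:.2f} authors per article on average")
print(f"{articles_per_author:.2f} articles per author on average")
print(f"density of the matrix: {density:.2e}")
```

The matrix is extremely sparse, which is consistent with the remarks on query distortion made later for author features.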

In [23]:
gismo = Gismo(corpus, a_embedding)
gismo.post_documents_item = post_article
gismo.post_features_item = lambda g, i: g.embedding.features[i].replace("_", " ")

In [24]:
gismo.post_documents_cluster = partial(post_documents_cluster_print, post_item=post_meta)
gismo.post_features_cluster = post_features_cluster_print

### Laurent Massoulié query

In [25]:
gismo.rank("Laurent_Massoulié")

True

What are the most central articles of Laurent Massoulié in terms of collaboration?

In [26]:
gismo.get_documents_by_rank(k=10)

['Asynchrony and Acceleration in Gossip Algorithms. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2020)',
 'Decentralized Optimization with Heterogeneous Delays: a Continuous-Time Approach. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2021)',
 'Scalable Local Area Service Discovery. By Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007)',
 'A spectral method for community detection in moderately-sparse degree-corrected stochastic block models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2016)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (ITCS, 2017)',
 'An Impossibility Result for Reconstruction in a Degree-Corrected Planted-Partition Model. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015)',
 

We see lots of duplicates. This is not surprising, as many articles are published first as a research report, then as a conference paper, and last as a journal article. Luckily, Gismo can cover for you.

In [27]:
gismo.get_documents_by_coverage(k=10)

['Asynchrony and Acceleration in Gossip Algorithms. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2020)',
 'Decentralized Optimization with Heterogeneous Delays: a Continuous-Time Approach. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2021)',
 'Scalable Local Area Service Discovery. By Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007)',
 'A spectral method for community detection in moderately-sparse degree-corrected stochastic block models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2016)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (ITCS, 2017)',
 'An Impossibility Result for Reconstruction in a Degree-Corrected Planted-Partition Model. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015)',
 

Hum, not working well. The reason here is query distortion. Query distortion is a gismo feature that modulates the clustering with the query. Sadly, when features are authors, the underlying graph has a very specific structure (highly sparse and redundant) that makes query distortion *too* effective. The solution is to deactivate it.

In [28]:
gismo.parameters.distortion = 0
gismo.get_documents_by_coverage(k=10)

['Asynchrony and Acceleration in Gossip Algorithms. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2020)',
 'Scalable Local Area Service Discovery. By Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007)',
 'A spectral method for community detection in moderately-sparse degree-corrected stochastic block models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015)',
 'From tree matching to sparse graph alignment. By Luca Ganassali, Laurent Massoulié (CoRR, 2020)',
 'Optimal Content Placement for Peer-to-Peer Video-on-Demand Systems. By Bo Tan 0002, Laurent Massoulié (IEEE/ACM Trans. Netw., 2013)',
 'Robustness of Spectral Methods for Community Detection. By Ludovic Stephan, Laurent Massoulié (COLT, 2019)',
 'Concentration of Non-Isotropic Random Tensors with Applications to Learning and Empirical Risk Minimization. By Mathieu Even, Laurent Massoulié (CoRR, 2021)',
 'Correlation Detection in Trees for Planted Graph Alignment. By Luca Ganassali, Lauren

Much better. No duplicates and more diversity in the results. Let's observe the communities.
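
To get an intuition of why a covering selection reduces redundancy, here is a deliberately naive stand-in that keeps only the first article of each author team. It is not Gismo's coverage algorithm, only a toy illustration run on the author lists of the ranked output shown earlier. The helper `first_of_each_team` is a hypothetical name.

```python
def first_of_each_team(ranked_author_lists, k):
    """Scan items in rank order, skipping author sets that were already selected."""
    seen, selected = set(), []
    for authors in ranked_author_lists:
        team = frozenset(authors)
        if team in seen:
            continue
        seen.add(team)
        selected.append(authors)
        if len(selected) == k:
            break
    return selected


even = ["Mathieu Even", "Hadrien Hendrikx", "Laurent Massoulié"]
black = ["Richard Black", "Heimir Sverrisson", "Laurent Massoulié"]
gulikers = ["Lennart Gulikers", "Marc Lelarge", "Laurent Massoulié"]

# Author lists of the visible part of the ranked output, in rank order.
ranked = [even, even, black, gulikers, gulikers, gulikers, gulikers]

diverse = first_of_each_team(ranked, k=10)
print(len(ranked), "->", len(diverse))
```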

In [29]:
gismo.get_documents_by_cluster(k=20, resolution=.9)

 F: 0.37. R: 0.07. S: 0.86.
- F: 0.37. R: 0.07. S: 0.86.
-- F: 0.81. R: 0.01. S: 0.63.
--- F: 1.00. R: 0.01. S: 0.56.
---- Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2020) (R: 0.00; S: 0.56)
---- Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2021) (R: 0.00; S: 0.56)
--- F: 1.00. R: 0.01. S: 0.63.
---- Mathieu Even, Laurent Massoulié (CoRR, 2021) (R: 0.00; S: 0.63)
---- Mathieu Even, Laurent Massoulié (COLT, 2021) (R: 0.00; S: 0.63)
-- F: 0.37. R: 0.06. S: 0.81.
--- F: 0.50. R: 0.04. S: 0.71.
---- F: 1.00. R: 0.01. S: 0.59.
----- Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015) (R: 0.00; S: 0.59)
----- Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2016) (R: 0.00; S: 0.59)
----- Lennart Gulikers, Marc Lelarge, Laurent Massoulié (ITCS, 2017) (R: 0.00; S: 0.59)
----- Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015) (R: 0.00; S: 0.59)
---- F: 0.84. R: 0.03. S: 0.64.
----- F: 1.00. R: 0.01. S: 0.64.
------ Luca Ganassali, Lauren

OK! We see that the articles are organized by writing communities. Also note how Gismo managed to organize a hierarchical grouping of the communities.

Now, let's look in terms of authors. This is actually the interesting part when studying collaborations.

In [30]:
gismo.get_features_by_rank()

['Laurent Massoulié',
 'Marc Lelarge',
 'Stratis Ioannidis',
 'Hadrien Hendrikx',
 'Nidhi Hegde 0001',
 'Peter B. Key',
 'Francis R. Bach',
 'Anne-Marie Kermarrec',
 'Ayalvadi J. Ganesh',
 'Luca Ganassali',
 'Mathieu Even',
 'Lennart Gulikers',
 'Milan Vojnovic',
 'Dan-Cristian Tomozei',
 'Amin Karbasi',
 'Augustin Chaintreau',
 'Mathieu Leconte',
 'Bo Tan 0002',
 'James Roberts',
 'Rémi Varloot',
 'Kuang Xu']

We see many authors that were not present in the articles listed above. This is an important observation: central articles (with respect to a query) are not necessarily written by central authors!

Let's organize them into communities.

In [31]:
gismo.get_features_by_cluster(resolution=.6)

 F: 0.00. R: 0.21. S: 0.54.
- F: 0.00. R: 0.21. S: 0.54.
-- F: 0.01. R: 0.21. S: 0.54.
--- F: 0.01. R: 0.20. S: 0.54.
---- F: 0.01. R: 0.20. S: 0.55.
----- F: 0.02. R: 0.18. S: 0.54.
------ F: 0.03. R: 0.17. S: 0.52.
------- F: 0.08. R: 0.13. S: 0.42.
-------- F: 0.15. R: 0.12. S: 0.41.
--------- Laurent Massoulié (R: 0.10; S: 1.00)
--------- Marc Lelarge (R: 0.01; S: 0.17)
--------- Luca Ganassali (R: 0.01; S: 0.19)
-------- F: 0.12. R: 0.01. S: 0.22.
--------- Lennart Gulikers (R: 0.00; S: 0.22)
--------- Milan Vojnovic (R: 0.00; S: 0.06)
------- F: 0.10. R: 0.02. S: 0.27.
-------- F: 0.35. R: 0.01. S: 0.27.
--------- Hadrien Hendrikx (R: 0.01; S: 0.26)
--------- Mathieu Even (R: 0.00; S: 0.20)
-------- Francis R. Bach (R: 0.01; S: 0.07)
------- F: 0.08. R: 0.02. S: 0.18.
-------- F: 0.12. R: 0.01. S: 0.17.
--------- Peter B. Key (R: 0.01; S: 0.12)
--------- Ayalvadi J. Ganesh (R: 0.01; S: 0.14)
-------- Anne-Marie Kermarrec (R: 0.01; S: 0.07)
------ Nidhi Hegde 0001 (R: 0.01; S: 0.1

### Jim Roberts query

In [32]:
gismo.rank("James_W._Roberts")

True

Let's have a covering set of articles.

In [33]:
gismo.get_documents_by_coverage(k=10)

['Integrated Admission Control for Streaming and Elastic Traffic. By Nabil Benameur, Slim Ben Fredj, Frank Delcoigne, Sara Oueslati-Boulahia, James W. Roberts (QofIS, 2001)',
 'An In-Camera Data Stream Processing System for Defect Detection in Web Inspection Tasks. By S. Hossain Hajimowlana, Roberto Muscedere, Graham A. Jullien, James W. Roberts (Real Time Imaging, 1999)',
 "Modifications of Thomae's Function and Differentiability. By Kevin Beanland, James W. Roberts, Craig Stevenson (Am. Math. Mon., 2009)",
 'A Traffic Control Framework for High Speed Data Transmission. By James W. Roberts, Brahim Bensaou, Y. Canetti (Modelling and Evaluation of ATM Networks, 1993)',
 'Statistical bandwidth sharing: a study of congestion at flow level. By Slim Ben Fredj, Thomas Bonald, Alexandre Proutière, G. Régnié, James W. Roberts (SIGCOMM, 2001)',
 'Swing: Traffic capacity of a simple WDM ring network. By Thomas Bonald, Sara Oueslati, James W. Roberts, Charlotte Roger (ITC, 2009)',
 'Impact of "Tr

Who are the associated authors?

In [34]:
gismo.get_features_by_rank(k=10)

['James W. Roberts',
 'Thomas Bonald',
 'Maher Hamdi',
 'Sara Oueslati-Boulahia',
 'Ali Ibrahim',
 'Alexandre Proutière',
 'Sara Oueslati',
 'Jorma T. Virtamo',
 'Slim Ben Fredj',
 'Jussi Kangasharju']

Let's organize them.

In [35]:
gismo.get_features_by_cluster(k=10, resolution=.4)

 F: 0.01. R: 0.24. S: 0.51.
- F: 0.01. R: 0.23. S: 0.51.
-- F: 0.04. R: 0.21. S: 0.50.
--- F: 0.15. R: 0.20. S: 0.50.
---- F: 0.23. R: 0.18. S: 0.90.
----- James W. Roberts (R: 0.14; S: 1.00)
----- Thomas Bonald (R: 0.05; S: 0.25)
---- F: 0.57. R: 0.01. S: 0.32.
----- Sara Oueslati-Boulahia (R: 0.01; S: 0.27)
----- Slim Ben Fredj (R: 0.01; S: 0.30)
---- Sara Oueslati (R: 0.01; S: 0.15)
--- Alexandre Proutière (R: 0.01; S: 0.04)
-- Maher Hamdi (R: 0.01; S: 0.12)
-- Ali Ibrahim (R: 0.01; S: 0.06)
- F: 0.01. R: 0.01. S: 0.05.
-- Jorma T. Virtamo (R: 0.01; S: 0.04)
-- Jussi Kangasharju (R: 0.00; S: 0.03)


### Combined queries

We can input multiple authors.

In [36]:
gismo.rank("Laurent_Massoulié and James_W._Roberts")

True
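
With the whitespace tokenizer used for the author embedding, a query is split on spaces, and, as stated earlier, the returned ``True`` means that at least one query feature was found. The toy sketch below mimics that matching, assuming that tokens absent from the vocabulary (here the word `and`) are simply ignored. The small `vocabulary` set and the helper `known_tokens` are hypothetical.

```python
def known_tokens(query, vocabulary):
    """Return the query tokens that exist in the vocabulary, in query order."""
    return [token for token in query.split(' ') if token in vocabulary]


# Hypothetical excerpt of the author vocabulary (underscore-joined names).
vocabulary = {"Laurent_Massoulié", "James_W._Roberts",
              "Thomas_Bonald", "Marc_Lelarge"}

found = known_tokens("Laurent_Massoulié and James_W._Roberts", vocabulary)
success = len(found) > 0
print(found, success)
```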

Let's have a covering set of articles.

In [37]:
gismo.get_documents_by_coverage(k=10)

['Integrated Admission Control for Streaming and Elastic Traffic. By Nabil Benameur, Slim Ben Fredj, Frank Delcoigne, Sara Oueslati-Boulahia, James W. Roberts (QofIS, 2001)',
 'Defect detection in web inspection using fuzzy fusion of texture features. By S. Hossain Hajimowlana, Roberto Muscedere, Graham A. Jullien, James W. Roberts (ISCAS, 2000)',
 "Modifications of Thomae's Function and Differentiability. By Kevin Beanland, James W. Roberts, Craig Stevenson (Am. Math. Mon., 2009)",
 'A Traffic Control Framework for High Speed Data Transmission. By James W. Roberts, Brahim Bensaou, Y. Canetti (Modelling and Evaluation of ATM Networks, 1993)',
 'Statistical bandwidth sharing: a study of congestion at flow level. By Slim Ben Fredj, Thomas Bonald, Alexandre Proutière, G. Régnié, James W. Roberts (SIGCOMM, 2001)',
 'Swing: Traffic capacity of a simple WDM ring network. By Thomas Bonald, Sara Oueslati, James W. Roberts, Charlotte Roger (ITC, 2009)',
 'Impact of "Trunk Reservation" on Elasti

Note that we get here only articles by Roberts, yet the articles returned have slightly changed.

Now, let's look at the main authors.

In [38]:
gismo.get_features_by_rank()

['James W. Roberts',
 'Laurent Massoulié',
 'Thomas Bonald',
 'Marc Lelarge',
 'Maher Hamdi',
 'Nidhi Hegde 0001',
 'Stratis Ioannidis',
 'Alexandre Proutière',
 'Sara Oueslati-Boulahia',
 'Hadrien Hendrikx',
 'Ali Ibrahim',
 'Peter B. Key',
 'Francis R. Bach',
 'Jorma T. Virtamo',
 'Sara Oueslati']

We see a mix of the co-authors of both researchers. How are they organized?

In [39]:
gismo.get_features_by_cluster(resolution=.4)

 F: 0.01. R: 0.20. S: 0.55.
- F: 0.02. R: 0.19. S: 0.55.
-- F: 0.05. R: 0.11. S: 0.50.
--- F: 0.23. R: 0.10. S: 0.50.
---- James W. Roberts (R: 0.07; S: 0.92)
---- Thomas Bonald (R: 0.03; S: 0.24)
---- Sara Oueslati-Boulahia (R: 0.00; S: 0.25)
--- Sara Oueslati (R: 0.00; S: 0.14)
-- F: 0.14. R: 0.05. S: 0.21.
--- F: 0.24. R: 0.05. S: 0.21.
---- Laurent Massoulié (R: 0.05; S: 0.39)
---- Hadrien Hendrikx (R: 0.00; S: 0.10)
--- Francis R. Bach (R: 0.00; S: 0.03)
-- Marc Lelarge (R: 0.01; S: 0.07)
-- Maher Hamdi (R: 0.01; S: 0.11)
-- F: 0.10. R: 0.01. S: 0.09.
--- Nidhi Hegde 0001 (R: 0.01; S: 0.08)
--- Alexandre Proutière (R: 0.00; S: 0.04)
-- Stratis Ioannidis (R: 0.00; S: 0.04)
-- Ali Ibrahim (R: 0.00; S: 0.06)
-- Peter B. Key (R: 0.00; S: 0.05)
- Jorma T. Virtamo (R: 0.00; S: 0.04)


## Cross-gismo

Gismo can combine two embeddings to create one hybrid gismo. This is called a cross-gismo (XGismo). This feature can be used to analyze authors with respect to the words they use (and vice-versa).
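
The intuition is that both embeddings are built on the same documents, so authors can be linked to words through the articles they wrote. The toy sketch below composes the two relations by plain counting. It is not XGismo's actual computation, and the data in `doc_authors` and `doc_words` is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: document id -> authors, and document id -> title words.
doc_authors = {
    0: ["Sébastien_Tixeuil"],
    1: ["Sébastien_Tixeuil", "Shlomi_Dolev"],
    2: ["Fabien_Mathieu"],
}
doc_words = {
    0: ["self", "stabilization"],
    1: ["byzantine", "stabilization"],
    2: ["p2p", "network"],
}


def author_word_counts(doc_authors, doc_words):
    """Count, for each author, the words of the documents they wrote."""
    counts = defaultdict(Counter)
    for doc, authors in doc_authors.items():
        for author in authors:
            counts[author].update(doc_words.get(doc, []))
    return counts


links = author_word_counts(doc_authors, doc_words)
print(links["Sébastien_Tixeuil"].most_common(1))
```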

In [40]:
from gismo.gismo import XGismo
gismo = XGismo(x_embedding=a_embedding, y_embedding=embedding)
gismo.diteration.n_iter = 2 # to speed up a little bit computation time

Note that XGismo does not use the underlying corpus, so we can now close the source (the source keeps the file ``dblp.data`` open).

In [41]:
source.close()

In [42]:
gismo.post_documents_item = lambda g, i: g.corpus[i].replace("_", " ")
gismo.post_features_cluster = post_features_cluster_print
gismo.post_documents_cluster = post_documents_cluster_print

Let's try a request.

In [43]:
gismo.rank("self-stabilization")

True

What are the associated keywords?

In [44]:
gismo.get_features_by_rank(k=10)

['stabilization',
 'self',
 'self stabilization',
 'stabilize',
 'self stabilize',
 'distribute',
 'distributed',
 'robust',
 'sensor',
 'stabilizing']

How are keywords structured?

In [45]:
gismo.get_features_by_cluster(k=20, resolution=.8)

 F: 0.35. R: 0.02. S: 0.80.
- F: 0.61. R: 0.02. S: 0.80.
-- F: 0.76. R: 0.02. S: 0.81.
--- F: 0.82. R: 0.01. S: 0.81.
---- F: 0.92. R: 0.01. S: 0.81.
----- stabilization (R: 0.00; S: 0.81)
----- self stabilization (R: 0.00; S: 0.81)
----- stabilizing (R: 0.00; S: 0.75)
---- F: 0.94. R: 0.00. S: 0.68.
----- sensor (R: 0.00; S: 0.68)
----- wireless (R: 0.00; S: 0.66)
---- fault (R: 0.00; S: 0.76)
--- F: 0.85. R: 0.01. S: 0.67.
---- F: 0.97. R: 0.01. S: 0.67.
----- self (R: 0.00; S: 0.76)
----- stabilize (R: 0.00; S: 0.69)
----- self stabilize (R: 0.00; S: 0.66)
---- distributed (R: 0.00; S: 0.70)
-- F: 0.79. R: 0.00. S: 0.46.
--- distribute (R: 0.00; S: 0.67)
--- robot (R: 0.00; S: 0.41)
--- byzantine (R: 0.00; S: 0.44)
--- optimal (R: 0.00; S: 0.64)
--- mobile (R: 0.00; S: 0.53)
- F: 0.54. R: 0.00. S: 0.36.
-- F: 0.65. R: 0.00. S: 0.44.
--- F: 0.68. R: 0.00. S: 0.38.
---- robust (R: 0.00; S: 0.44)
---- stability (R: 0.00; S: 0.32)
--- adaptive (R: 0.00; S: 0.51)
-- F: 0.54. R: 0.00. S: 

Who are the associated researchers?

In [46]:
gismo.get_documents_by_rank(k=10)

['Ted Herman',
 'Shlomi Dolev',
 'Sébastien Tixeuil',
 'Sukumar Ghosh',
 'George Varghese',
 'Shay Kutten',
 'Toshimitsu Masuzawa',
 'Stefan Schmid 0001',
 'Swan Dubois',
 'Laurence Pilard']

How are they structured?

In [47]:
gismo.get_documents_by_cluster(k=10, resolution=.9)

 F: 0.78. R: 0.05. S: 0.83.
- F: 0.93. R: 0.05. S: 0.83.
-- F: 0.98. R: 0.01. S: 0.82.
--- Ted Herman (R: 0.01; S: 0.82)
--- George Varghese (R: 0.00; S: 0.81)
-- F: 0.96. R: 0.02. S: 0.80.
--- F: 0.97. R: 0.01. S: 0.80.
---- Shlomi Dolev (R: 0.01; S: 0.78)
---- Toshimitsu Masuzawa (R: 0.00; S: 0.74)
---- Laurence Pilard (R: 0.00; S: 0.80)
--- Shay Kutten (R: 0.00; S: 0.82)
-- F: 0.94. R: 0.02. S: 0.82.
--- F: 0.98. R: 0.01. S: 0.81.
---- Sébastien Tixeuil (R: 0.01; S: 0.82)
---- Sukumar Ghosh (R: 0.01; S: 0.80)
--- Swan Dubois (R: 0.00; S: 0.81)
- Stefan Schmid 0001 (R: 0.00; S: 0.67)


We can also query researchers. Just use underscores in the query and add `y=False` to indicate that the input is *documents*.

In [48]:
gismo.rank("Sébastien_Tixeuil and Fabien_Mathieu", y=False)

True

What are the associated keywords?

In [49]:
gismo.get_features_by_rank(k=10)

['p2p',
 'grid',
 'byzantine',
 'stabilization',
 'reloaded',
 'refresh',
 'self',
 'self stabilization',
 'p2p network',
 'live streaming']

Using covering can yield other keywords of interest.

In [50]:
gismo.get_features_by_coverage(k=10)

['p2p',
 'grid',
 'byzantine',
 'preference',
 'pagerank',
 'fun',
 'reloaded',
 'p2p network',
 'old',
 'acyclic']

How are keywords structured?

In [51]:
gismo.get_features_by_cluster(k=20, resolution=.7)

 F: 0.39. R: 0.21. S: 0.67.
- F: 0.72. R: 0.09. S: 0.31.
-- F: 0.72. R: 0.06. S: 0.31.
--- p2p (R: 0.02; S: 0.34)
--- refresh (R: 0.01; S: 0.27)
--- live streaming (R: 0.01; S: 0.29)
--- live (R: 0.01; S: 0.33)
--- streaming (R: 0.01; S: 0.32)
-- reloaded (R: 0.01; S: 0.29)
-- p2p network (R: 0.01; S: 0.29)
-- old (R: 0.01; S: 0.27)
-- acyclic (R: 0.01; S: 0.32)
- grid (R: 0.02; S: 0.47)
- F: 0.66. R: 0.09. S: 0.62.
-- F: 0.79. R: 0.08. S: 0.62.
--- byzantine (R: 0.01; S: 0.56)
--- stabilization (R: 0.01; S: 0.58)
--- self (R: 0.01; S: 0.62)
--- self stabilization (R: 0.01; S: 0.59)
--- stabilize (R: 0.01; S: 0.60)
--- self stabilize (R: 0.01; S: 0.60)
--- asynchronous (R: 0.01; S: 0.56)
-- fun (R: 0.01; S: 0.48)
- preference (R: 0.01; S: 0.24)
- pagerank (R: 0.01; S: 0.24)


Who are the associated researchers?

In [52]:
gismo.get_documents_by_rank(k=10)

['Sébastien Tixeuil',
 'Fabien Mathieu',
 'Shlomi Dolev',
 'Toshimitsu Masuzawa',
 'Michel Raynal',
 'Nitin H. Vaidya',
 'Stéphane Devismes',
 'Fukuhito Ooshita',
 'Edmond Bianco',
 'Ted Herman']

How are they structured?

In [53]:
gismo.get_documents_by_cluster(k=10, resolution=.8)

 F: 0.00. R: 0.01. S: 0.66.
- F: 0.12. R: 0.00. S: 0.66.
-- F: 0.36. R: 0.00. S: 0.52.
--- F: 0.80. R: 0.00. S: 0.50.
---- F: 0.80. R: 0.00. S: 0.52.
----- Sébastien Tixeuil (R: 0.00; S: 0.53)
----- Shlomi Dolev (R: 0.00; S: 0.42)
----- Toshimitsu Masuzawa (R: 0.00; S: 0.47)
----- Stéphane Devismes (R: 0.00; S: 0.41)
----- Fukuhito Ooshita (R: 0.00; S: 0.50)
---- Ted Herman (R: 0.00; S: 0.39)
--- F: 0.82. R: 0.00. S: 0.31.
---- Michel Raynal (R: 0.00; S: 0.37)
---- Nitin H. Vaidya (R: 0.00; S: 0.26)
-- Fabien Mathieu (R: 0.00; S: 0.73)
- Edmond Bianco (R: 0.00; S: 0.04)
