# DBLP exploration

This tutorial shows how to explore DBLP with Gismo.

If you have never used Gismo before, you may want to start with the *Toy example tutorial* or the *ACM* tutorial.

Note: the DBLP database is not small. Recommended requirements to execute this Notebook:
- Fast Internet connection (you will need to download a few hundred MB)
- 4 GB of free disk space
- 4 GB of RAM (8 GB or more recommended)
- Decent CPU (can take more than one hour on slow CPUs)

Here, *documents* are articles in DBLP. The *features* of an article will vary from one section to the next: words from the title first, then authors.

## Initialisation

First, we load the required packages.

In [1]:
import numpy as np
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from pathlib import Path
from functools import partial

from gismo.datasets.dblp import Dblp
from gismo.filesource import FileSource
from gismo.corpus import Corpus
from gismo.embedding import Embedding
from gismo.gismo import Gismo
from gismo.post_processing import post_features_cluster_print, post_documents_cluster_print

Then, we prepare the DBLP source.

First, we choose the location of the DBLP files. If you want to run this Notebook on your own machine, change the path below and check that it exists.

In [2]:
path = Path("../../../../../Datasets/DBLP")
path.exists()

True
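
If the folder does not exist yet, it can be created before building the database. The following is a minimal sketch using only the standard library; `ensure_dir` is a hypothetical helper (not part of Gismo), and the demo runs in a temporary folder so that nothing is written to your real data location.

```python
from pathlib import Path
import tempfile


def ensure_dir(path):
    """Create the directory (and its parents) if missing; return it as a Path."""
    path = Path(path)
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demo in a throwaway directory (hypothetical location, for illustration only).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = ensure_dir(Path(tmp) / "Datasets" / "DBLP")
    demo_exists = demo_path.exists()

print(demo_exists)
```

In the Notebook itself, you could call `ensure_dir(path)` on your own `path` before creating the `Dblp` object.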

Next, we build the DBLP files. This only needs to be done the first time, or when you want to refresh the database. It takes about 10 minutes on a Surface Pro 4 with a fiber Internet connection.

In [3]:
dblp = Dblp(path=path)
dblp.build()

File ..\..\..\..\..\Datasets\DBLP\dblp.xml.gz already exists. Use refresh option to overwrite.
File ..\..\..\..\..\Datasets\DBLP\dblp.data already exists. Use refresh option to overwrite.


Then, we can load the database as a filesource.

In [4]:
source = FileSource(filename="dblp", path=path)
source[0]

{'type': 'article',
 'title': 'Spectre Attacks: Exploiting Speculative Execution.',
 'venue': 'meltdownattack.com',
 'year': '2018',
 'authors': ['Paul Kocher',
  'Daniel Genkin',
  'Daniel Gruss',
  'Werner Haas',
  'Mike Hamburg',
  'Moritz Lipp',
  'Stefan Mangard',
  'Thomas Prescher 0002',
  'Michael Schwarz 0001',
  'Yuval Yarom']}

Each article is a dict with fields ``type``, ``venue``, ``title``, ``year``, and ``authors``. We build a corpus that will tell Gismo that the content of an article is its ``title`` value.

In [5]:
corpus = Corpus(source, to_text=lambda x: x['title'])
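
To see what `to_text` does without loading DBLP, here is a small stand-alone sketch. The list `records` is a stand-in for the file source (the first record is abridged from the output above, the second one is made up for illustration), and `iterate_text` is a hypothetical helper that only mimics the idea of mapping each item to its text; it is not Gismo's `Corpus` class.

```python
# Stand-in for the DBLP source: a plain list of dicts.
records = [
    {"type": "article",
     "title": "Spectre Attacks: Exploiting Speculative Execution.",
     "venue": "meltdownattack.com",
     "year": "2018",
     "authors": ["Paul Kocher", "Daniel Genkin"]},  # author list abridged
    {"type": "article",
     "title": "A made-up title about graphs.",  # hypothetical record
     "venue": "CoRR",
     "year": "2020",
     "authors": ["Jane Doe"]},
]


def iterate_text(source, to_text):
    """Yield the text of each item, as selected by to_text."""
    for item in source:
        yield to_text(item)


titles = list(iterate_text(records, to_text=lambda x: x["title"]))
print(titles)
```

Changing the `to_text` function is all it takes to expose another field, which is what the authors section does later on.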

We build an embedding on top of that corpus.
- We set ``min_df=30`` to exclude rare features;
- We set ``max_df=.02`` to exclude anything present in more than 2% of the corpus;
- We use `spacy` to lemmatize and remove some stopwords; remove `preprocessor=...` from the input if you want to skip this (it takes time);
- We add a few manually selected stopwords to fine-tune things;
- We set ``ngram_range=[1, 2]`` to include bi-grams in the embedding.

This will take a few minutes (without spacy) up to a few hours (with spacy enabled). You can save the embedding for later if you want.
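
The effect of `min_df`, `max_df`, and `ngram_range` can be illustrated on a toy corpus. The sketch below is plain Python and only mimics the document-frequency filtering idea (an integer `min_df` is read as an absolute count, a float `max_df` as a fraction of the corpus); it is not scikit-learn's implementation, and `ngrams` and `select_features` are hypothetical names.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def select_features(docs, min_df, max_df):
    """Keep n-grams present in at least min_df documents
    and in at most a fraction max_df of the documents."""
    df = Counter()
    for doc in docs:
        # A feature counts once per document, hence the set.
        df.update(set(ngrams(doc.lower().split())))
    limit = max_df * len(docs)
    return sorted(f for f, count in df.items() if min_df <= count <= limit)


# Toy titles, made up for illustration.
docs = [
    "machine learning on graphs",
    "machine learning on images",
    "learning to rank",
    "random graphs",
]
features = select_features(docs, min_df=2, max_df=0.5)
print(features)
```

Here `learning` is dropped because it appears in 3 of the 4 titles (more than 50%), while words and bi-grams seen only once are dropped as too rare.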

In [6]:
# Note: spaCy 3+ dropped the 'en' shortcut; there, use a full model name such as 'en_core_web_sm'.
nlp = spacy.load('en', disable=['parser', 'ner'])
keep = {'ADJ', 'NOUN', 'NUM', 'PROPN', 'SYM', 'VERB'}
vectorizer = CountVectorizer(min_df=30, max_df=.02, ngram_range=[1, 2], dtype=float,
                             preprocessor=lambda txt: " ".join([w.lemma_.lower() for w in nlp(txt) 
                                                                if w.pos_ in keep and not w.is_stop]),
                             stop_words=['a', 'about', 'an', 'and', 'for', 'from', 'in', 'of', 'on', 'the', 'with'])
embedding = Embedding(vectorizer=vectorizer)

# Reuse a previously saved embedding if there is one; otherwise compute and save it.
try:
    embedding.load(filename="dblp_embedding", path=path)
except Exception:
    embedding.fit_transform(corpus)
    embedding.save(filename="dblp_embedding", path=path)

In [7]:
embedding.x

<4889758x149361 sparse matrix of type '<class 'numpy.float64'>'
	with 45847196 stored elements in Compressed Sparse Row format>

We see from ``embedding.x`` that the embedding links about 5,000,000 documents to 150,000 features. On average, each document is linked to about 10 features.
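
The average can be checked directly from the matrix description above. This is a small arithmetic sketch; the three numbers are copied from the output of ``embedding.x``, and the variable names are ours.

```python
# Figures copied from the sparse matrix description above.
n_docs = 4_889_758       # rows: documents
n_features = 149_361     # columns: features
n_stored = 45_847_196    # stored (non-zero) elements

# Average number of features linked to a document.
avg_features = n_stored / n_docs

# Fraction of the matrix that is actually filled.
density = n_stored / (n_docs * n_features)

print(f"{avg_features:.2f} features per document on average")
print(f"density: {density:.2e}")
```

The very low density is the reason why a sparse matrix format is used here.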

Now, we initialize the Gismo object and customize the post-processors to ease the display.

In [8]:
gismo = Gismo(corpus, embedding)

In [9]:
def post_article(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{dic['title']} By {authors} ({dic['venue']}, {dic['year']})"
    
gismo.post_documents_item = post_article

def post_title(g, i):
    return g.corpus[i]['title']

def post_meta(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{authors} ({dic['venue']}, {dic['year']})"


gismo.post_documents_cluster = partial(post_documents_cluster_print, post_item=post_title)
gismo.post_features_cluster = post_features_cluster_print
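
A post-processor is just a function of a gismo-like object and a document index. To try one without building anything, the sketch below calls two of them on a stand-in object; `fake_gismo` is a hypothetical placeholder that only exposes a `corpus` list, and the record is taken from an article that appears later in this tutorial.

```python
from types import SimpleNamespace


def post_article(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{dic['title']} By {authors} ({dic['venue']}, {dic['year']})"


def post_meta(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{authors} ({dic['venue']}, {dic['year']})"


# Stand-in for a Gismo object: only the corpus attribute is needed here.
fake_gismo = SimpleNamespace(corpus=[
    {'type': 'inproceedings',  # type is an assumption, it is not used below
     'title': 'Robustness of Spectral Methods for Community Detection.',
     'venue': 'COLT',
     'year': '2019',
     'authors': ['Ludovic Stephan', 'Laurent Massoulié']},
])

full = post_article(fake_gismo, 0)
meta = post_meta(fake_gismo, 0)
print(full)
print(meta)
```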

As the dataset is big, we lower the precision of the computation to speed things up a little.

In [10]:
gismo.parameters.n_iter = 2

## Machine Learning (and Covid-19) query

We perform the query *Machine Learning*. The returned ``True`` indicates that some of the query features were found among the features of the corpus.

In [11]:
gismo.rank("Machine Learning")

True

What are the best articles on *Machine Learning*?

In [12]:
gismo.get_documents_by_rank()

['Can Machine Learning Model with Static Features be Fooled: an Adversarial Machine Learning Approach. By Rahim Taheri, Reza Javidan, Mohammad Shojafar, Vinod P, Mauro Conti (CoRR, 2019)',
 'How Developers Iterate on Machine Learning Workflows - A Survey of the Applied Machine Learning Literature. By Doris Xin, Litian Ma, Shuchen Song, Aditya G. Parameswaran (CoRR, 2018)',
 'Informed Machine Learning - Towards a Taxonomy of Explicit Integration of Knowledge into Machine Learning. By Laura von Rüden, Sebastian Mayer, Jochen Garcke, Christian Bauckhage, Jannis Schücker (CoRR, 2019)',
 'Three Differential Emotion Classification by Machine Learning Algorithms using Physiological Signals - Discriminantion of Emotions by Machine Learning Algorithms. By Eun-Hye Jang, Byoung-Jun Park, Sang-Hyeob Kim, Jin-Hun Sohn (ICAART (1), 2012)',
 'When Lempel-Ziv-Welch Meets Machine Learning: A Case Study of Accelerating Machine Learning using Coding. By Fengan Li, Lingjiao Chen, Arun Kumar 0001, Jeffrey 

OK, the results are all over the place. Maybe we can narrow things down with a more specific query.

In [13]:
gismo.rank("Machine Learning and covid-19")

True

In [14]:
gismo.get_documents_by_rank()

['GSA-DenseNet121-COVID-19: a Hybrid Deep Learning Architecture for the Diagnosis of COVID-19 Disease based on Gravitational Search Optimization Algorithm. By Dalia Ezzat, Aboul Ella Hassanien, Hassan Aboul Ella (CoRR, 2020)',
 'NAIST COVID: Multilingual COVID-19 Twitter and Weibo Dataset. By Zhiwei Gao, Shuntaro Yada, Shoko Wakamiya, Eiji Aramaki (CoRR, 2020)',
 'COVID-MobileXpert: On-Device COVID-19 Screening using Snapshots of Chest X-Ray. By Xin Li, Chengyin Li, Dongxiao Zhu (CoRR, 2020)',
 'Deep-COVID: Predicting COVID-19 From Chest X-Ray Images Using Deep Transfer Learning. By Shervin Minaee, Rahele Kafieh, Milan Sonka, Shakib Yazdani, Ghazaleh Jamalipour Soufi (CoRR, 2020)',
 'COVID-CAPS: A Capsule Network-based Framework for Identification of COVID-19 cases from X-ray Images. By Parnian Afshar, Shahin Heidarian, Farnoosh Naderkhani, Anastasia Oikonomou, Konstantinos N. Plataniotis, Arash Mohammadi 0001 (CoRR, 2020)',
 'COVID-Net: A Tailored Deep Convolutional Neural Network Des

Sounds nice. How are the top-10 articles related? Note: as the graph structure is really sparse on the document side (about 10 features per document), it is best to deactivate the query distortion, which is intended for longer documents.

In [15]:
gismo.parameters.distortion = 0.0
gismo.get_documents_by_cluster(k=10)

 F: 0.56. R: 0.02. S: 0.93.
- F: 0.57. R: 0.02. S: 0.92.
-- GSA-DenseNet121-COVID-19: a Hybrid Deep Learning Architecture for the Diagnosis of COVID-19 Disease based on Gravitational Search Optimization Algorithm. (R: 0.00; S: 0.64)
-- F: 0.66. R: 0.01. S: 0.91.
--- F: 0.66. R: 0.01. S: 0.90.
---- NAIST COVID: Multilingual COVID-19 Twitter and Weibo Dataset. (R: 0.00; S: 0.69)
---- CORD-19: The Covid-19 Open Research Dataset. (R: 0.00; S: 0.76)
---- A SIDARTHE Model of COVID-19 Epidemic in Italy. (R: 0.00; S: 0.79)
---- A First Instagram Dataset on COVID-19. (R: 0.00; S: 0.80)
--- F: 0.68. R: 0.00. S: 0.73.
---- COVID-MobileXpert: On-Device COVID-19 Screening using Snapshots of Chest X-Ray. (R: 0.00; S: 0.70)
---- Deep-COVID: Predicting COVID-19 From Chest X-Ray Images Using Deep Transfer Learning. (R: 0.00; S: 0.63)
-- COVID-CAPS: A Capsule Network-based Framework for Identification of COVID-19 cases from X-ray Images. (R: 0.00; S: 0.60)
-- COVID-Net: A Tailored Deep Convolutional Neu

OK! Let's decode this. We have:
- Lone wolves on gravitational search (first article) and serious gaming (last article);
- A large cluster on applications of ML to X-ray / CT scan processing;
- A cluster on datasets and social networks;
- Note that it's not perfect: the social network cluster has been nested inside the X-ray one, although a manual classification would probably have put the two side by side...

Now, let's look at the main keywords.

In [16]:
gismo.get_features_by_rank(20)

['19',
 'covid',
 'covid 19',
 'machine',
 'machine learning',
 'pandemic',
 'chest',
 'chest ray',
 'coronavirus',
 'ray',
 'dataset',
 'deep',
 'ct',
 'epidemic',
 'ray image',
 'social',
 'learn',
 'twitter',
 'infection',
 'spread']

Let's organize them.

In [17]:
# On the feature side, the graph is more dense so we can use query distortion
gismo.get_features_by_cluster(distortion=1)

 F: 0.15. R: 0.07. S: 0.93.
- F: 0.26. R: 0.06. S: 0.93.
-- F: 0.29. R: 0.06. S: 0.97.
--- F: 0.97. R: 0.05. S: 0.98.
---- 19 (R: 0.02; S: 0.99)
---- covid (R: 0.02; S: 0.97)
---- covid 19 (R: 0.02; S: 0.98)
--- pandemic (R: 0.00; S: 0.32)
--- F: 0.86. R: 0.00. S: 0.30.
---- chest (R: 0.00; S: 0.32)
---- chest ray (R: 0.00; S: 0.28)
---- ray (R: 0.00; S: 0.33)
--- dataset (R: 0.00; S: 0.28)
-- coronavirus (R: 0.00; S: 0.27)
- F: 0.79. R: 0.01. S: 0.14.
-- machine (R: 0.00; S: 0.16)
-- machine learning (R: 0.00; S: 0.12)


Rough, very broad analysis:
- One big keyword cluster about Coronavirus / Covid-19;
- Chest X-ray as a sub-cluster;
- Machine Learning as a separate small cluster.
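
The printed tree encodes the hierarchy with leading dashes: the more dashes, the deeper the node. If you want to reuse such an output programmatically, the following sketch parses it into `(depth, name, R, S)` tuples for the leaves. It relies only on the text layout shown above; `parse_leaves` is a hypothetical helper, not a Gismo function.

```python
import re

# Output of the previous cell, copied verbatim.
SAMPLE = """\
 F: 0.15. R: 0.07. S: 0.93.
- F: 0.26. R: 0.06. S: 0.93.
-- F: 0.29. R: 0.06. S: 0.97.
--- F: 0.97. R: 0.05. S: 0.98.
---- 19 (R: 0.02; S: 0.99)
---- covid (R: 0.02; S: 0.97)
---- covid 19 (R: 0.02; S: 0.98)
--- pandemic (R: 0.00; S: 0.32)
--- F: 0.86. R: 0.00. S: 0.30.
---- chest (R: 0.00; S: 0.32)
---- chest ray (R: 0.00; S: 0.28)
---- ray (R: 0.00; S: 0.33)
--- dataset (R: 0.00; S: 0.28)
-- coronavirus (R: 0.00; S: 0.27)
- F: 0.79. R: 0.01. S: 0.14.
-- machine (R: 0.00; S: 0.16)
-- machine learning (R: 0.00; S: 0.12)
"""

ROW = re.compile(r"^\s*(-*)\s*(.*?)\s*$")                 # leading dashes, then label
LEAF = re.compile(r"^(.*) \(R: ([\d.]+); S: ([\d.]+)\)$")  # "name (R: x; S: y)"


def parse_leaves(text):
    """Return a list of (depth, name, R, S) for the leaf lines of the tree."""
    leaves = []
    for raw in text.splitlines():
        if not raw.strip():
            continue
        dashes, label = ROW.match(raw).groups()
        match = LEAF.match(label)
        if match:  # cluster header lines ("F: ... R: ... S: ...") are skipped
            name, r, s = match.groups()
            leaves.append((len(dashes), name, float(r), float(s)))
    return leaves


leaves = parse_leaves(SAMPLE)
for depth, name, r, s in leaves:
    print(depth, name, r, s)
```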

In [18]:
np.dot(gismo.embedding.query_projection("Machine learning")[0], gismo.embedding.y)

<1x4889758 sparse matrix of type '<class 'numpy.float64'>'
	with 56026 stored elements in Compressed Sparse Row format>

About 56,000 articles have an explicit link to machine learning.

In [19]:
np.dot(gismo.embedding.query_projection("Covid-19")[0], gismo.embedding.y)

<1x4889758 sparse matrix of type '<class 'numpy.float64'>'
	with 762 stored elements in Compressed Sparse Row format>

About 760 articles have an explicit link to Covid-19.
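
We read the two counts above as the number of articles that share at least one feature with the query. On a toy example, and assuming features are stored as plain Python sets (an assumption made for illustration only; the cells above work on sparse matrices), the same count can be sketched as follows. `explicit_matches` is a hypothetical helper and the documents are made up.

```python
# Toy documents: index -> set of features (made up for illustration).
doc_features = {
    0: {"machine", "learning", "machine learning", "graph"},
    1: {"covid", "19", "covid 19", "dataset"},
    2: {"deep", "learning", "covid", "chest"},
    3: {"peer", "streaming"},
}


def explicit_matches(query_features, doc_features):
    """Return the indices of documents sharing at least one feature with the query."""
    return [i for i, feats in doc_features.items() if feats & query_features]


ml_docs = explicit_matches({"machine", "learning", "machine learning"}, doc_features)
covid_docs = explicit_matches({"covid", "19", "covid 19"}, doc_features)
print(len(ml_docs), len(covid_docs))
```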

## Authors query

Instead of looking at words, we can explore authors and their collaborations.

We just have to rewire the corpus to output a string of authors.

In [20]:
def to_authors_text(dic):
    return " ".join([a.replace(' ', '_') for a in dic['authors']])
corpus = Corpus(source, to_text=to_authors_text)

We can build a new embedding on top of this modified corpus. We tell the vectorizer to be stupid: don't preprocess, and treat words as separated by spaces.

This will take a few minutes (you can save the embedding for later if you want).
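
The underscore trick is what lets a plain split on spaces recover one token per author. A minimal stand-alone sketch (the record is a reduced, author-only dict built from names that appear in this tutorial):

```python
def to_authors_text(dic):
    return " ".join([a.replace(' ', '_') for a in dic['authors']])


# Reduced record: only the field used by to_authors_text.
record = {'authors': ['Bo Tan 0002', 'Laurent Massoulié']}

text = to_authors_text(record)            # one token per author
tokens = text.split(' ')                  # what the tokenizer of the vectorizer does
names = [t.replace('_', ' ') for t in tokens]  # back to a readable form

print(text)
print(tokens)
print(names)
```

The last step mirrors the display post-processing applied to features later on, which replaces underscores with spaces.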

In [21]:
vectorizer = CountVectorizer(dtype=float,
                            preprocessor=lambda x:x, tokenizer=lambda x: x.split(' '))
a_embedding = Embedding(vectorizer=vectorizer)
# Reuse a previously saved embedding if there is one; otherwise compute and save it.
try:
    a_embedding.load(filename="dblp_aut_embedding", path=path)
except Exception:
    a_embedding.fit_transform(corpus)
    a_embedding.save(filename="dblp_aut_embedding", path=path)

In [22]:
a_embedding.x

<4889758x2532137 sparse matrix of type '<class 'numpy.float64'>'
	with 15083911 stored elements in Compressed Sparse Row format>

We now have about 2,500,000 authors to explore. Let's reload gismo and try to play.

In [23]:
gismo = Gismo(corpus, a_embedding)
gismo.post_documents_item = post_article
gismo.post_features_item = lambda g, i: g.embedding.features[i].replace("_", " ")

In [24]:
gismo.post_documents_cluster = partial(post_documents_cluster_print, post_item=post_meta)
gismo.post_features_cluster = post_features_cluster_print

### Laurent Massoulié query

In [25]:
gismo.rank("Laurent_Massoulié")

True

What are the most central articles of Laurent Massoulié in terms of collaboration?

In [26]:
gismo.get_documents_by_rank(k=10)

['Robustness of Spectral Methods for Community Detection. By Ludovic Stephan, Laurent Massoulié (COLT, 2019)',
 'Robustness of spectral methods for community detection. By Ludovic Stephan, Laurent Massoulié (CoRR, 2018)',
 'From tree matching to sparse graph alignment. By Luca Ganassali, Laurent Massoulié (CoRR, 2020)',
 'Spectral alignment of correlated Gaussian random matrices. By Luca Ganassali, Marc Lelarge, Laurent Massoulié (CoRR, 2019)',
 'Planting trees in graphs, and finding them back. By Laurent Massoulié, Ludovic Stephan, Don Towsley (COLT, 2019)',
 'Brief announcement: adaptive content placement for peer-to-peer video-on-demand systems. By Bo Tan 0002, Laurent Massoulié (PODC, 2010)',
 'Optimal content placement for peer-to-peer video-on-demand systems. By Bo Tan 0002, Laurent Massoulié (INFOCOM, 2011)',
 'Optimal Content Placement for Peer-to-Peer Video-on-Demand Systems. By Bo Tan 0002, Laurent Massoulié (IEEE/ACM Trans. Netw., 2013)',
 'Scalable Local Area Service Discov

We see lots of duplicates. This is not surprising, as many articles are published first as a research report, then as a conference paper, and finally as a journal article. Luckily, Gismo can cover for you.

In [27]:
gismo.get_documents_by_coverage(k=10)

['Robustness of Spectral Methods for Community Detection. By Ludovic Stephan, Laurent Massoulié (COLT, 2019)',
 'Robustness of spectral methods for community detection. By Ludovic Stephan, Laurent Massoulié (CoRR, 2018)',
 'From tree matching to sparse graph alignment. By Luca Ganassali, Laurent Massoulié (CoRR, 2020)',
 'Spectral alignment of correlated Gaussian random matrices. By Luca Ganassali, Marc Lelarge, Laurent Massoulié (CoRR, 2019)',
 'Planting trees in graphs, and finding them back. By Laurent Massoulié, Ludovic Stephan, Don Towsley (COLT, 2019)',
 'Brief announcement: adaptive content placement for peer-to-peer video-on-demand systems. By Bo Tan 0002, Laurent Massoulié (PODC, 2010)',
 'Optimal content placement for peer-to-peer video-on-demand systems. By Bo Tan 0002, Laurent Massoulié (INFOCOM, 2011)',
 'Optimal Content Placement for Peer-to-Peer Video-on-Demand Systems. By Bo Tan 0002, Laurent Massoulié (IEEE/ACM Trans. Netw., 2013)',
 'Scalable Local Area Service Discov

Hmm, this is not working well. The reason is query distortion, a Gismo feature that modulates the clustering with the query. Sadly, when features are authors, the underlying graph has a very specific structure (highly sparse and redundant) that makes query distortion *too* effective. The solution is to deactivate it.

In [28]:
gismo.parameters.distortion = 0
gismo.get_documents_by_coverage(k=10)

['Robustness of Spectral Methods for Community Detection. By Ludovic Stephan, Laurent Massoulié (COLT, 2019)',
 'Scalable Local Area Service Discovery. By Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007)',
 'Asynchronous Accelerated Proximal Stochastic Gradient for Strongly Convex Distributed Finite Sums. By Hadrien Hendrikx, Francis Bach, Laurent Massoulié (CoRR, 2019)',
 'From tree matching to sparse graph alignment. By Luca Ganassali, Laurent Massoulié (CoRR, 2020)',
 'Brief announcement: adaptive content placement for peer-to-peer video-on-demand systems. By Bo Tan 0002, Laurent Massoulié (PODC, 2010)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2016)',
 'Adaptive matching for expert systems with uncertain task types. By Virag Shah, Lennart Gulikers, Laurent Massoulié, Milan Vojnovic (Allerton, 2017)',
 'Spectral alignment of correlated Gaussian random matrices. By Luca Ganassal

Much better. No duplicates and more diversity in the results. Let's observe the communities.
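
How Gismo computes its coverage is not detailed in this tutorial. To give an intuition of why a covering approach removes duplicates, the sketch below uses a simple greedy set cover over co-author sets. This is a swapped-in technique for illustration, not Gismo's actual algorithm, so its picks differ from the output above; `greedy_cover` is a hypothetical helper, and the data is a subset of the ranked list shown earlier.

```python
def greedy_cover(ranked, k):
    """Pick up to k articles, each time choosing the one that adds the most
    authors not covered yet (ties go to the best-ranked article)."""
    covered, picked = set(), []
    candidates = list(range(len(ranked)))
    while candidates and len(picked) < k:
        best = max(candidates, key=lambda i: len(ranked[i][1] - covered))
        if not (ranked[best][1] - covered):
            break  # no remaining article brings a new author
        picked.append(best)
        covered |= ranked[best][1]
        candidates.remove(best)
    return picked


# (title, set of authors), in ranking order.
ranked = [
    ("Robustness of Spectral Methods for Community Detection.",
     {"Ludovic Stephan", "Laurent Massoulié"}),
    ("Robustness of spectral methods for community detection.",
     {"Ludovic Stephan", "Laurent Massoulié"}),
    ("From tree matching to sparse graph alignment.",
     {"Luca Ganassali", "Laurent Massoulié"}),
    ("Spectral alignment of correlated Gaussian random matrices.",
     {"Luca Ganassali", "Marc Lelarge", "Laurent Massoulié"}),
    ("Planting trees in graphs, and finding them back.",
     {"Laurent Massoulié", "Ludovic Stephan", "Don Towsley"}),
    ("Brief announcement: adaptive content placement for peer-to-peer video-on-demand systems.",
     {"Bo Tan 0002", "Laurent Massoulié"}),
]

picked = greedy_cover(ranked, k=3)
for i in picked:
    print(ranked[i][0])
```

Once an author set is covered, a duplicate entry with the same authors brings nothing new and is never selected.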

In [29]:
gismo.get_documents_by_cluster(k=20, resolution=.9)

 F: 0.36. R: 0.09. S: 0.87.
- F: 0.36. R: 0.08. S: 0.86.
-- F: 0.36. R: 0.06. S: 0.83.
--- F: 0.36. R: 0.03. S: 0.73.
---- F: 0.88. R: 0.01. S: 0.60.
----- F: 1.00. R: 0.01. S: 0.60.
------ Ludovic Stephan, Laurent Massoulié (COLT, 2019) (R: 0.00; S: 0.60)
------ Ludovic Stephan, Laurent Massoulié (CoRR, 2018) (R: 0.00; S: 0.60)
----- Laurent Massoulié, Ludovic Stephan, Don Towsley (COLT, 2019) (R: 0.00; S: 0.54)
---- F: 1.00. R: 0.01. S: 0.61.
----- Bo Tan 0002, Laurent Massoulié (PODC, 2010) (R: 0.00; S: 0.61)
----- Bo Tan 0002, Laurent Massoulié (INFOCOM, 2011) (R: 0.00; S: 0.61)
----- Bo Tan 0002, Laurent Massoulié (IEEE/ACM Trans. Netw., 2013) (R: 0.00; S: 0.61)
--- F: 0.41. R: 0.03. S: 0.69.
---- F: 0.86. R: 0.01. S: 0.59.
----- Luca Ganassali, Laurent Massoulié (CoRR, 2020) (R: 0.00; S: 0.58)
----- Luca Ganassali, Marc Lelarge, Laurent Massoulié (CoRR, 2019) (R: 0.00; S: 0.56)
---- F: 0.61. R: 0.03. S: 0.60.
----- F: 1.00. R: 0.02. S: 0.59.
------ Lennart Gulikers, Marc Lelarge,

OK! We see that the articles are organized by writing communities. Also note how Gismo managed to organize a hierarchical grouping of the communities.

Now, let's look in terms of authors. This is actually the interesting part when studying collaborations.

In [30]:
gismo.get_features_by_rank()

['Laurent Massoulié',
 'Marc Lelarge',
 'Stratis Ioannidis',
 'Nidhi Hegde',
 'Peter B. Key',
 'Anne-Marie Kermarrec',
 'Ayalvadi J. Ganesh',
 'Lennart Gulikers',
 'Hadrien Hendrikx',
 'Francis Bach',
 'Dan-Cristian Tomozei',
 'Amin Karbasi',
 'Milan Vojnovic',
 'Augustin Chaintreau',
 'Mathieu Leconte',
 'Bo Tan 0002',
 'James Roberts',
 'Ludovic Stephan',
 'Kuang Xu',
 'Fabio Picconi']

We see many authors that were not present in the articles listed above. This is an important observation: central articles (with respect to a query) are not necessarily written by central authors!

Let's organize them into communities.

In [31]:
gismo.get_features_by_cluster(resolution=.6)

 F: 0.00. R: 0.21. S: 0.52.
- F: 0.00. R: 0.21. S: 0.52.
-- F: 0.01. R: 0.21. S: 0.52.
--- F: 0.02. R: 0.20. S: 0.51.
---- F: 0.03. R: 0.17. S: 0.50.
----- F: 0.04. R: 0.17. S: 0.49.
------ F: 0.08. R: 0.16. S: 0.49.
------- F: 0.13. R: 0.15. S: 0.50.
-------- F: 0.15. R: 0.14. S: 0.49.
--------- Laurent Massoulié (R: 0.10; S: 1.00)
--------- Nidhi Hegde (R: 0.01; S: 0.18)
--------- Peter B. Key (R: 0.01; S: 0.16)
--------- Ayalvadi J. Ganesh (R: 0.01; S: 0.16)
--------- Lennart Gulikers (R: 0.00; S: 0.22)
--------- Hadrien Hendrikx (R: 0.00; S: 0.18)
--------- Francis Bach (R: 0.00; S: 0.06)
-------- Marc Lelarge (R: 0.01; S: 0.15)
------- Milan Vojnovic (R: 0.00; S: 0.06)
------ Anne-Marie Kermarrec (R: 0.01; S: 0.08)
----- Mathieu Leconte (R: 0.00; S: 0.10)
---- F: 0.03. R: 0.02. S: 0.14.
----- F: 0.09. R: 0.02. S: 0.14.
------ Stratis Ioannidis (R: 0.01; S: 0.15)
------ Augustin Chaintreau (R: 0.00; S: 0.06)
----- Amin Karbasi (R: 0.00; S: 0.05)
---- Dan-Cristian Tomozei (R: 0.00; 

### Jim Roberts query

In [32]:
gismo.rank("James_W._Roberts")

True

Let's have a covering set of articles.

In [33]:
gismo.get_documents_by_coverage(k=10)

['Integrated Admission Control for Streaming and Elastic Traffic. By Nabil Benameur, Slim Ben Fredj, Frank Delcoigne, Sara Oueslati-Boulahia, James W. Roberts (QofIS, 2001)',
 'An In-Camera Data Stream Processing System for Defect Detection in Web Inspection Tasks. By S. Hossain Hajimowlana, Roberto Muscedere, Graham A. Jullien, James W. Roberts (Real-Time Imaging, 1999)',
 "Modifications of Thomae's Function and Differentiability. By Kevin Beanland, James W. Roberts, Craig Stevenson (The American Mathematical Monthly, 2009)",
 'A Traffic Control Framework for High Speed Data Transmission. By James W. Roberts, Brahim Bensaou, Y. Canetti (Modelling and Evaluation of ATM Networks, 1993)',
 'Statistical bandwidth sharing: a study of congestion at flow level. By Slim Ben Fredj, Thomas Bonald, Alexandre Proutière, G. Régnié, James W. Roberts (SIGCOMM, 2001)',
 'Swing: Traffic capacity of a simple WDM ring network. By Thomas Bonald, Sara Oueslati, James W. Roberts, Charlotte Roger (Internati

Who are the associated authors?

In [34]:
gismo.get_features_by_rank(k=10)

['James W. Roberts',
 'Thomas Bonald',
 'Maher Hamdi',
 'Sara Oueslati-Boulahia',
 'Ali Ibrahim',
 'Alexandre Proutière',
 'Sara Oueslati',
 'Jorma T. Virtamo',
 'Slim Ben Fredj',
 'Jussi Kangasharju']

Let's organize them.

In [35]:
gismo.get_features_by_cluster(k=10, resolution=.4)

 F: 0.01. R: 0.23. S: 0.55.
- F: 0.01. R: 0.22. S: 0.54.
-- F: 0.04. R: 0.20. S: 0.53.
--- F: 0.16. R: 0.19. S: 0.53.
---- F: 0.25. R: 0.17. S: 0.91.
----- James W. Roberts (R: 0.12; S: 1.00)
----- Thomas Bonald (R: 0.05; S: 0.28)
---- F: 0.57. R: 0.02. S: 0.34.
----- Sara Oueslati-Boulahia (R: 0.01; S: 0.28)
----- Slim Ben Fredj (R: 0.01; S: 0.32)
---- Sara Oueslati (R: 0.01; S: 0.16)
--- Alexandre Proutière (R: 0.01; S: 0.05)
-- Maher Hamdi (R: 0.01; S: 0.13)
-- Ali Ibrahim (R: 0.01; S: 0.07)
- F: 0.01. R: 0.01. S: 0.05.
-- Jorma T. Virtamo (R: 0.01; S: 0.04)
-- Jussi Kangasharju (R: 0.01; S: 0.03)


### Combined queries

We can input multiple authors.

In [36]:
gismo.rank("Laurent_Massoulié and James_W._Roberts")

True

Let's have a covering set of articles.

In [37]:
gismo.get_documents_by_coverage(k=10)

['Integrated Admission Control for Streaming and Elastic Traffic. By Nabil Benameur, Slim Ben Fredj, Frank Delcoigne, Sara Oueslati-Boulahia, James W. Roberts (QofIS, 2001)',
 'An In-Camera Data Stream Processing System for Defect Detection in Web Inspection Tasks. By S. Hossain Hajimowlana, Roberto Muscedere, Graham A. Jullien, James W. Roberts (Real-Time Imaging, 1999)',
 "Modifications of Thomae's Function and Differentiability. By Kevin Beanland, James W. Roberts, Craig Stevenson (The American Mathematical Monthly, 2009)",
 'A Traffic Control Framework for High Speed Data Transmission. By James W. Roberts, Brahim Bensaou, Y. Canetti (Modelling and Evaluation of ATM Networks, 1993)',
 'Statistical bandwidth sharing: a study of congestion at flow level. By Slim Ben Fredj, Thomas Bonald, Alexandre Proutière, G. Régnié, James W. Roberts (SIGCOMM, 2001)',
 'Swing: Traffic capacity of a simple WDM ring network. By Thomas Bonald, Sara Oueslati, James W. Roberts, Charlotte Roger (Internati

Note that we only get articles by Roberts here, yet the articles returned have slightly changed.

Now, let's look at the main authors.

In [38]:
gismo.get_features_by_rank()

['James W. Roberts',
 'Laurent Massoulié',
 'Thomas Bonald',
 'Nidhi Hegde',
 'Maher Hamdi',
 'Marc Lelarge',
 'Stratis Ioannidis',
 'Alexandre Proutière',
 'Sara Oueslati-Boulahia',
 'Ali Ibrahim',
 'Peter B. Key',
 'Jorma T. Virtamo',
 'Sara Oueslati',
 'Anne-Marie Kermarrec',
 'Ayalvadi J. Ganesh',
 'Slim Ben Fredj']

We see a mix of the co-authors of both researchers. How are they organized?

In [39]:
gismo.get_features_by_cluster(resolution=.4)

 F: 0.01. R: 0.20. S: 0.54.
- F: 0.01. R: 0.20. S: 0.54.
-- F: 0.05. R: 0.19. S: 0.53.
--- F: 0.25. R: 0.09. S: 0.84.
---- James W. Roberts (R: 0.06; S: 0.91)
---- Thomas Bonald (R: 0.03; S: 0.26)
--- F: 0.12. R: 0.08. S: 0.23.
---- Laurent Massoulié (R: 0.05; S: 0.40)
---- Nidhi Hegde (R: 0.01; S: 0.10)
---- Marc Lelarge (R: 0.01; S: 0.06)
---- Stratis Ioannidis (R: 0.01; S: 0.06)
---- Alexandre Proutière (R: 0.01; S: 0.05)
---- Peter B. Key (R: 0.00; S: 0.07)
---- Ayalvadi J. Ganesh (R: 0.00; S: 0.06)
--- F: 0.57. R: 0.01. S: 0.31.
---- Sara Oueslati-Boulahia (R: 0.00; S: 0.26)
---- Slim Ben Fredj (R: 0.00; S: 0.29)
--- Sara Oueslati (R: 0.00; S: 0.15)
--- Anne-Marie Kermarrec (R: 0.00; S: 0.03)
-- Maher Hamdi (R: 0.01; S: 0.12)
-- Ali Ibrahim (R: 0.00; S: 0.07)
- Jorma T. Virtamo (R: 0.00; S: 0.04)


## Cross-gismo

Gismo can combine two embeddings to create one hybrid gismo. This is called a cross-gismo (XGismo). This feature can be used to analyze authors with respect to the words they use (and vice versa).
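
The idea of linking authors to words through the articles they share can be sketched on toy data. The example below is a plain, unweighted count written for illustration; it is not how XGismo is implemented, and the author names, words, and the `cross` helper are all hypothetical.

```python
from collections import Counter, defaultdict

# Toy data: for each article, its authors and its title words (made up).
doc_authors = [["Alice"], ["Alice", "Bob"], ["Bob"]]
doc_words = [["graph", "ranking"], ["graph", "p2p"], ["p2p", "streaming"]]


def cross(doc_authors, doc_words):
    """Link each author to the words of the articles they wrote."""
    profile = defaultdict(Counter)
    for authors, words in zip(doc_authors, doc_words):
        for author in authors:
            profile[author].update(words)
    return profile


profile = cross(doc_authors, doc_words)
for author, words in profile.items():
    print(author, dict(words))
```

In this picture, authors play the role of documents and words play the role of features, which matches how the cross-gismo is queried below.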

In [40]:
from gismo.gismo import XGismo
gismo = XGismo(x_embedding=a_embedding, y_embedding=embedding)
gismo.diteration.n_iter = 2 # fewer iterations, to speed up computation a little

Note that XGismo does not use the underlying corpus, so we can now close the source (the source keeps the file ``dblp.data`` open).

In [41]:
source.close()

In [42]:
gismo.post_documents_item = lambda g, i: g.corpus[i].replace("_", " ")
gismo.post_features_cluster = post_features_cluster_print
gismo.post_documents_cluster = post_documents_cluster_print

Let's try a request.

In [43]:
gismo.rank("self-stabilization")

True

What are the associated keywords?

In [44]:
gismo.get_features_by_rank(k=10)

['stabilization',
 'self',
 'self stabilization',
 'stabilize',
 'self stabilize',
 'distribute',
 'distributed',
 'robust',
 'sensor',
 'adaptive']

How are keywords structured?

In [45]:
gismo.get_features_by_cluster(k=20, resolution=.8)

 F: 0.03. R: 0.02. S: 0.80.
- F: 0.50. R: 0.02. S: 0.80.
-- F: 0.65. R: 0.02. S: 0.80.
--- F: 0.70. R: 0.02. S: 0.81.
---- F: 0.77. R: 0.02. S: 0.81.
----- F: 0.91. R: 0.01. S: 0.81.
------ stabilization (R: 0.00; S: 0.81)
------ self (R: 0.00; S: 0.75)
------ self stabilization (R: 0.00; S: 0.82)
------ stabilize (R: 0.00; S: 0.72)
------ self stabilize (R: 0.00; S: 0.67)
----- F: 0.86. R: 0.00. S: 0.75.
------ distribute (R: 0.00; S: 0.68)
------ distributed (R: 0.00; S: 0.76)
----- F: 0.92. R: 0.00. S: 0.68.
------ sensor (R: 0.00; S: 0.67)
------ wireless (R: 0.00; S: 0.65)
----- fault (R: 0.00; S: 0.75)
---- adaptive (R: 0.00; S: 0.57)
--- F: 0.82. R: 0.00. S: 0.47.
---- F: 0.82. R: 0.00. S: 0.47.
----- networks (R: 0.00; S: 0.75)
----- byzantine (R: 0.00; S: 0.46)
----- optimal (R: 0.00; S: 0.65)
----- mobile (R: 0.00; S: 0.56)
---- robot (R: 0.00; S: 0.44)
-- robust (R: 0.00; S: 0.46)
- F: 0.60. R: 0.00. S: 0.13.
-- F: 0.68. R: 0.00. S: 0.10.
--- nonlinear (R: 0.00; S: 0.08)
---

Who are the associated researchers?

In [46]:
gismo.get_documents_by_rank(k=10)

['Ted Herman',
 'Shlomi Dolev',
 'Sébastien Tixeuil',
 'Sukumar Ghosh',
 'George Varghese',
 'Shay Kutten',
 'Toshimitsu Masuzawa',
 'Stefan Schmid 0001',
 'Swan Dubois',
 'Bertrand Ducourthial']

How are they structured?

In [47]:
gismo.get_documents_by_cluster(k=10, resolution=.9)

 F: 0.79. R: 0.05. S: 0.84.
- F: 0.88. R: 0.05. S: 0.84.
-- F: 0.93. R: 0.04. S: 0.83.
--- F: 0.95. R: 0.03. S: 0.83.
---- Ted Herman (R: 0.01; S: 0.81)
---- Sébastien Tixeuil (R: 0.01; S: 0.83)
---- Sukumar Ghosh (R: 0.01; S: 0.80)
---- Swan Dubois (R: 0.00; S: 0.79)
--- F: 0.96. R: 0.02. S: 0.78.
---- Shlomi Dolev (R: 0.01; S: 0.78)
---- Shay Kutten (R: 0.00; S: 0.83)
---- Toshimitsu Masuzawa (R: 0.00; S: 0.72)
-- F: 0.98. R: 0.01. S: 0.80.
--- George Varghese (R: 0.00; S: 0.82)
--- Bertrand Ducourthial (R: 0.00; S: 0.79)
- Stefan Schmid 0001 (R: 0.00; S: 0.72)


We can also query researchers. Just use underscores in the query and add `y=False` to indicate that the input is made of *documents* (which are the researchers in this cross-gismo).

In [48]:
gismo.rank("Sébastien_Tixeuil and Fabien_Mathieu", y=False)

True
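
Writing the underscores by hand is error-prone when a query involves several researchers. A tiny helper can build the query string from readable names; `authors_query` is a hypothetical convenience function, not part of Gismo.

```python
def authors_query(*names):
    """Turn readable author names into a query string with underscores."""
    return " and ".join(name.replace(" ", "_") for name in names)


query = authors_query("Sébastien Tixeuil", "Fabien Mathieu")
print(query)
```

The resulting string can then be passed to `gismo.rank(query, y=False)` as in the cell above.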

What are the associated keywords?

In [49]:
gismo.get_features_by_rank(k=10)

['p2p',
 'byzantine',
 'stabilization',
 'stabilize',
 'p2p network',
 'refresh',
 'reload',
 'self',
 'self stabilize',
 'self stabilization']

Using covering can yield other keywords of interest.

In [50]:
gismo.get_features_by_coverage(k=10)

['p2p',
 'byzantine',
 'grid',
 'preference',
 'pagerank',
 'p2p network',
 'refresh',
 'streaming',
 'acyclic',
 'stabilization']

How are keywords structured?

In [51]:
gismo.get_features_by_cluster(k=20, resolution=.7)

 F: 0.06. R: 0.23. S: 0.64.
- F: 0.62. R: 0.13. S: 0.27.
-- F: 0.68. R: 0.12. S: 0.27.
--- F: 0.69. R: 0.11. S: 0.27.
---- p2p (R: 0.02; S: 0.32)
---- p2p network (R: 0.01; S: 0.28)
---- F: 0.84. R: 0.06. S: 0.25.
----- refresh (R: 0.01; S: 0.24)
----- reload (R: 0.01; S: 0.24)
----- old (R: 0.01; S: 0.23)
----- live streaming (R: 0.01; S: 0.25)
----- preference base (R: 0.01; S: 0.23)
----- live (R: 0.01; S: 0.32)
---- streaming (R: 0.01; S: 0.31)
---- acyclic (R: 0.01; S: 0.27)
--- pagerank (R: 0.01; S: 0.20)
-- preference (R: 0.01; S: 0.19)
- F: 0.39. R: 0.10. S: 0.61.
-- F: 0.82. R: 0.09. S: 0.61.
--- F: 0.84. R: 0.03. S: 0.54.
---- byzantine (R: 0.02; S: 0.54)
---- asynchronous (R: 0.01; S: 0.55)
--- F: 0.91. R: 0.06. S: 0.60.
---- stabilization (R: 0.02; S: 0.58)
---- stabilize (R: 0.01; S: 0.61)
---- self (R: 0.01; S: 0.62)
---- self stabilize (R: 0.01; S: 0.60)
---- self stabilization (R: 0.01; S: 0.58)
-- grid (R: 0.01; S: 0.41)


Who are the associated researchers?

In [52]:
gismo.get_documents_by_rank(k=10)

['Sébastien Tixeuil',
 'Fabien Mathieu',
 'Shlomi Dolev',
 'Toshimitsu Masuzawa',
 'Nitin H. Vaidya',
 'Edmond Bianco',
 'Stéphane Devismes',
 'Franck Petit',
 'Michel Raynal',
 'Ted Herman']

How are they structured?

In [53]:
gismo.get_documents_by_cluster(k=10, resolution=.8)

 F: 0.00. R: 0.01. S: 0.63.
- F: 0.08. R: 0.00. S: 0.63.
-- F: 0.71. R: 0.00. S: 0.52.
--- Sébastien Tixeuil (R: 0.00; S: 0.56)
--- F: 0.93. R: 0.00. S: 0.46.
---- Shlomi Dolev (R: 0.00; S: 0.46)
---- Toshimitsu Masuzawa (R: 0.00; S: 0.48)
---- Stéphane Devismes (R: 0.00; S: 0.44)
---- Franck Petit (R: 0.00; S: 0.44)
---- Ted Herman (R: 0.00; S: 0.42)
--- F: 0.78. R: 0.00. S: 0.34.
---- Nitin H. Vaidya (R: 0.00; S: 0.30)
---- Michel Raynal (R: 0.00; S: 0.37)
-- Fabien Mathieu (R: 0.00; S: 0.69)
- Edmond Bianco (R: 0.00; S: 0.04)
