# DBLP exploration

This tutorial shows how to explore DBLP with Gismo.

Note: the DBLP database is not small. Recommended requirements to execute this Notebook:
- Fast Internet connection (you will need to download a few hundred MB)
- 4 GB of free space
- 4 GB of RAM (8 GB or more recommended)
- Decent CPU (can take more than one hour on slow CPUs)

Here, *documents* are articles in DBLP. The *features* associated with an article will vary from one section to the next.

## Initialisation

First, we load the required packages.

In [1]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from pathlib import Path

from gismo.datasets.dblp import Dblp
from gismo.filesource import FileSource
from gismo.corpus import Corpus
from gismo.embedding import Embedding
from gismo.gismo import Gismo
from gismo.post_processing import print_feature_cluster

Then, we prepare the DBLP source.

First, we choose the location of the DBLP files. If you want to run this Notebook yourself, please change the path below and check that it exists.

In [2]:
path = Path("../../../../../Datasets/DBLP")
path.exists()

True
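If the directory does not exist yet, it can be created before building the dataset. The sketch below is only an illustration: `ensure_dir` is a hypothetical helper (not part of Gismo), and a temporary directory stands in for the real dataset location.

```python
from pathlib import Path
import tempfile


def ensure_dir(path):
    """Create `path` (and any missing parents) if needed, then return it."""
    path = Path(path)
    path.mkdir(parents=True, exist_ok=True)
    return path


# A temporary directory stands in for the real dataset location.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = ensure_dir(Path(tmp) / "Datasets" / "DBLP")
    demo_exists = demo_path.exists()
    print(demo_exists)
```

Calling the helper on a directory that already exists is harmless, so it can be left in place when the Notebook is re-run.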

We then build the DBLP files. This only needs to be performed the first time, or when you want to refresh the database. It takes about 10 minutes on a Surface Pro 4 with a fiber Internet connection.

In [3]:
dblp = Dblp(path=path)
dblp.build()

File ..\..\..\..\..\Datasets\DBLP\dblp.xml.gz already exists.
File ..\..\..\..\..\Datasets\DBLP\dblp.data already exists.


Then, we can load the database as a filesource.

In [4]:
source = FileSource(filename="dblp", path=path)
source[0]

{'type': 'article',
 'title': 'Spectre Attacks: Exploiting Speculative Execution.',
 'venue': 'meltdownattack.com',
 'year': '2018',
 'authors': ['Paul Kocher',
  'Daniel Genkin',
  'Daniel Gruss',
  'Werner Haas',
  'Mike Hamburg',
  'Moritz Lipp',
  'Stefan Mangard',
  'Thomas Prescher 0002',
  'Michael Schwarz 0001',
  'Yuval Yarom']}

Each article is a dict with fields ``type``, ``venue``, ``title``, ``year``, and ``authors``. We build a corpus that will tell Gismo that the content of an article is its ``title`` value.

In [5]:
corpus = Corpus(source, to_text=lambda x: x['title'])

We build an embedding on top of that corpus.
- We set ``min_df=30`` to exclude rare features;
- We set ``max_df=.02`` to exclude anything present in more than 2% of the corpus;
- We add a few manually selected stopwords to fine-tune things;
- We set ``ngram_range=[1, 2]`` to include bi-grams in the embedding.

This will take a few minutes (you can save the embedding for later if you want).

In [6]:
vectorizer = CountVectorizer(min_df=30, max_df=.02, ngram_range=[1, 2], dtype=float, 
                             stop_words=['a', 'about', 'an', 'and', 'for', 'from', 'in', 'of', 'on', 'the', 'with'])
embedding = Embedding(vectorizer=vectorizer)

try:
    embedding.load(filename="dblp_embedding", path=path)
except:
    embedding.fit_transform(corpus)
    embedding.save(filename="dblp_embedding", path=path)
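To get a feel for what ``min_df``, ``max_df``, ``stop_words`` and ``ngram_range`` do, here is a small sketch on a toy corpus. The titles and thresholds are made up for illustration; only the `CountVectorizer` parameters are the same as in the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy titles (made up); the real corpus has millions of them.
toy_titles = [
    "spectral graph clustering",
    "spectral graph alignment",
    "deep learning for images",
    "deep learning for text",
    "learning to rank",
    "learning theory",
]

toy_vectorizer = CountVectorizer(
    min_df=2,             # keep features present in at least 2 titles
    max_df=0.5,           # drop features present in more than 50% of titles
    ngram_range=(1, 2),   # single words and bi-grams
    stop_words=["for"],   # manually selected stopword
    dtype=float,
)
toy_x = toy_vectorizer.fit_transform(toy_titles)

# Features that survived the filtering, in alphabetical order.
kept = sorted(toy_vectorizer.vocabulary_)
print(kept)
print(toy_x.shape)
```

On this toy corpus, `learning` appears in 4 titles out of 6 and is removed by ``max_df``, words that appear only once are removed by ``min_df``, and the bi-grams `deep learning` and `spectral graph` are kept.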

In [7]:
embedding.x

<4889758x160216 sparse matrix of type '<class 'numpy.float64'>'
	with 49388202 stored elements in Compressed Sparse Row format>

We see from ``embedding.x`` that the embedding links about 5,000,000 documents to 160,000 features. On average, each document is linked to about 10 features.
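The average can be checked directly from the numbers printed in the matrix summary above. This is plain arithmetic on the displayed shape and number of stored elements:

```python
# Values copied from the sparse matrix summary displayed above.
n_docs, n_features, n_stored = 4889758, 160216, 49388202

# Average number of stored (non-zero) features per document.
avg_features_per_doc = n_stored / n_docs

# Fraction of matrix entries that are actually stored.
density = n_stored / (n_docs * n_features)

print(f"{avg_features_per_doc:.1f} features per document on average")
print(f"{density:.2e} of the matrix entries are stored")
```

The very low density is why a sparse format is used for the embedding.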

Now, we initialize the gismo object and customize the post-processors to ease the display.

In [8]:
gismo = Gismo(corpus, embedding)

In [9]:
def post_article(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{dic['title']} By {authors} ({dic['venue']}, {dic['year']})"
    
gismo.post_document = post_article
def print_document_cluster(gismo, cluster, depth=""):
    sim = cosine_similarity(cluster.vector, gismo.diteration.y_relevance.reshape(1, -1))[0][0]
    if len(cluster.children) == 0:
        txt = gismo.corpus[cluster.indice]["title"]
        print(f"{depth} {txt} "
              f"(R: {gismo.diteration.x_relevance[cluster.indice]:.2f}; "
              f"S: {sim:.2f})")
    else:
        print(f"{depth} F: {cluster.focus:.2f}. "
              f"R: {sum(gismo.diteration.x_relevance[cluster.members]):.2f}. "
              f"S: {sim:.2f}.")
    for c in cluster.children:
        print_document_cluster(gismo, c, depth=depth + '-')
gismo.post_document_cluster = print_document_cluster
gismo.post_feature_cluster = print_feature_cluster
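The recursive display convention used by `print_document_cluster` (one extra `-` per nesting level) can be tried on a toy tree. The sketch below uses a hypothetical `ToyCluster` class rather than Gismo's own cluster objects, and returns the lines instead of printing them directly so that they are easy to inspect.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyCluster:
    """Hypothetical stand-in for a cluster: a label and optional children."""
    label: str
    children: List["ToyCluster"] = field(default_factory=list)


def format_cluster(cluster, depth=""):
    """Return one line per node, prefixed by one '-' per nesting level."""
    if not cluster.children:
        lines = [f"{depth} {cluster.label}"]
    else:
        lines = [f"{depth} [{len(cluster.children)} children]"]
    for child in cluster.children:
        lines.extend(format_cluster(child, depth + "-"))
    return lines


toy_tree = ToyCluster("root", [
    ToyCluster("article A"),
    ToyCluster("group", [ToyCluster("article B"), ToyCluster("article C")]),
])
toy_lines = format_cluster(toy_tree)
print("\n".join(toy_lines))
```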

As the dataset is big, we lower the precision of the computation to speed up things a little bit.

In [10]:
gismo.diteration.n_iter = 2

## Machine Learning (and Covid-19) query

We perform the query *Machine Learning*. The returned ``True`` indicates that some of the query features were found in the corpus' features.

In [11]:
gismo.rank("Machine Learning")

True

What are the best articles on *Machine Learning*?

In [12]:
gismo.get_ranked_documents()

['Machine Learning Unplugged - Development and Evaluation of a Workshop About Machine Learning. By Elisaweta Ossovski, Michael Brinkmeier (ISSEP, 2019)',
 'Can Machine Learning Model with Static Features be Fooled: an Adversarial Machine Learning Approach. By Rahim Taheri, Reza Javidan, Mohammad Shojafar, Vinod P, Mauro Conti (CoRR, 2019)',
 'MARVIN: An Open Machine Learning Corpus and Environment for Automated Machine Learning Primitive Annotation and Execution. By Chris A. Mattmann, Sujen Shah, Brian Wilson (CoRR, 2018)',
 'Trustless Machine Learning Contracts; Evaluating and Exchanging Machine Learning Models on the Ethereum Blockchain. By A. Besir Kurtulmus, Kenny Daniel (CoRR, 2018)',
 'How Developers Iterate on Machine Learning Workflows - A Survey of the Applied Machine Learning Literature. By Doris Xin, Litian Ma, Shuchen Song, Aditya G. Parameswaran (CoRR, 2018)',
 'Machine learning in computer forensics (and the lessons learned from machine learning in computer security). By 

OK, this seems to go everywhere. Maybe we can narrow it down with a more specific query.

In [13]:
gismo.rank("Machine Learning and covid-19")

True

In [14]:
gismo.get_ranked_documents()

['GSA-DenseNet121-COVID-19: a Hybrid Deep Learning Architecture for the Diagnosis of COVID-19 Disease based on Gravitational Search Optimization Algorithm. By Dalia Ezzat, Aboul Ella Hassanien, Hassan Aboul Ella (CoRR, 2020)',
 'COVID-CT-Dataset: A CT Scan Dataset about COVID-19. By Jinyu Zhao, Yichen Zhang, Xuehai He, Pengtao Xie (CoRR, 2020)',
 'NAIST COVID: Multilingual COVID-19 Twitter and Weibo Dataset. By Zhiwei Gao, Shuntaro Yada, Shoko Wakamiya, Eiji Aramaki (CoRR, 2020)',
 'COVID-MobileXpert: On-Device COVID-19 Screening using Snapshots of Chest X-Ray. By Xin Li, Chengyin Li, Dongxiao Zhu (CoRR, 2020)',
 'Deep-COVID: Predicting COVID-19 From Chest X-Ray Images Using Deep Transfer Learning. By Shervin Minaee, Rahele Kafieh, Milan Sonka, Shakib Yazdani, Ghazaleh Jamalipour Soufi (CoRR, 2020)',
 'COVID-CAPS: A Capsule Network-based Framework for Identification of COVID-19 cases from X-ray Images. By Parnian Afshar, Shahin Heidarian, Farnoosh Naderkhani, Anastasia Oikonomou, Konst

Sounds nice. How are the top-10 articles related?

In [15]:
gismo.get_clustered_ranked_documents(k=10, resolution=.9)

 F: 0.97. R: 0.02. S: 0.95.
- GSA-DenseNet121-COVID-19: a Hybrid Deep Learning Architecture for the Diagnosis of COVID-19 Disease based on Gravitational Search Optimization Algorithm. (R: 0.00; S: 0.96)
- F: 1.00. R: 0.01. S: 0.92.
-- COVID-CT-Dataset: A CT Scan Dataset about COVID-19. (R: 0.00; S: 0.92)
-- NAIST COVID: Multilingual COVID-19 Twitter and Weibo Dataset. (R: 0.00; S: 0.92)
-- COVID-MobileXpert: On-Device COVID-19 Screening using Snapshots of Chest X-Ray. (R: 0.00; S: 0.92)
-- Deep-COVID: Predicting COVID-19 From Chest X-Ray Images Using Deep Transfer Learning. (R: 0.00; S: 0.92)
-- COVID-CAPS: A Capsule Network-based Framework for Identification of COVID-19 cases from X-ray Images. (R: 0.00; S: 0.92)
-- COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest Radiography Images. (R: 0.00; S: 0.92)
- F: 1.00. R: 0.00. S: 0.94.
-- CORD-19: The Covid-19 Open Research Dataset. (R: 0.00; S: 0.94)
-- ArCOV-19: The First Arabic CO

OK! Let's decode this. We have:
- One single article on gravitational search
- A large cluster on applications of ML to X-ray / CT scan processing
- A social cluster
- Note that it's not perfect: a Twitter dataset falls in the X-ray cluster...
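In the output above, the `S` value is the one computed in `print_document_cluster`: the cosine similarity between a cluster's vector and `gismo.diteration.y_relevance`. As a reminder of how cosine similarity behaves, here is a pure-Python sketch on made-up vectors (the vectors are hypothetical and much smaller than the real ones).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if one is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


query_profile = [1.0, 1.0, 0.0]   # made-up relevance over 3 features
aligned = [2.0, 2.0, 0.0]         # same direction as the query profile
partial = [1.0, 0.0, 1.0]         # shares one feature with the query profile
unrelated = [0.0, 0.0, 3.0]       # shares no feature with the query profile

for name, vec in [("aligned", aligned), ("partial", partial), ("unrelated", unrelated)]:
    print(f"{name}: S = {cosine(query_profile, vec):.2f}")
```

A value close to 1 means the cluster points in the same direction as the query, regardless of the magnitude of the vectors.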

Now, let's look at the main keywords.

In [16]:
gismo.get_ranked_features(20)

['19',
 'covid',
 'covid 19',
 'machine',
 'machine learning',
 'pandemic',
 'chest',
 'coronavirus',
 'dataset',
 'ray',
 'chest ray',
 'deep',
 'ray images',
 'ct',
 'images',
 'twitter',
 'deep learning',
 'social',
 'using machine',
 'epidemic']

Let's organize them.

In [17]:
gismo.get_clustered_ranked_features(20)

 F: 0.00. R: 0.07. S: 0.92.
- F: 0.13. R: 0.07. S: 0.92.
-- F: 0.18. R: 0.07. S: 0.92.
--- F: 0.25. R: 0.06. S: 0.92.
---- F: 0.31. R: 0.06. S: 0.96.
----- F: 0.97. R: 0.05. S: 0.98.
------ 19 (R: 0.02; S: 0.99)
------ covid (R: 0.02; S: 0.97)
------ covid 19 (R: 0.02; S: 0.98)
----- pandemic (R: 0.00; S: 0.31)
----- F: 0.80. R: 0.01. S: 0.31.
------ chest (R: 0.00; S: 0.33)
------ ray (R: 0.00; S: 0.34)
------ chest ray (R: 0.00; S: 0.27)
------ ray images (R: 0.00; S: 0.27)
------ images (R: 0.00; S: 0.34)
----- F: 0.41. R: 0.00. S: 0.31.
------ dataset (R: 0.00; S: 0.29)
------ twitter (R: 0.00; S: 0.24)
----- F: 0.71. R: 0.00. S: 0.32.
------ deep (R: 0.00; S: 0.36)
------ deep learning (R: 0.00; S: 0.27)
---- coronavirus (R: 0.00; S: 0.29)
---- ct (R: 0.00; S: 0.28)
---- social (R: 0.00; S: 0.28)
--- F: 0.99. R: 0.01. S: 0.16.
---- machine (R: 0.00; S: 0.16)
---- machine learning (R: 0.00; S: 0.16)
-- epidemic (R: 0.00; S: 0.18)
- using machine (R: 0.00; S: 0.01)


Rough, very broad analysis:
- Covid-19 as a pandemic disease is a core cluster. As there are more articles in DBLP (a computer science database) about ML than about Covid-19, Gismo focused more on Covid-19
- X-ray processing seems to be a field of interest
- Twitter datasets

In [18]:
np.dot(gismo.embedding.query_projection("Machine learning")[0], gismo.embedding.y)

<1x4889758 sparse matrix of type '<class 'numpy.float64'>'
	with 49442 stored elements in Compressed Sparse Row format>

About 50,000 articles have an explicit link to machine learning.

In [19]:
np.dot(gismo.embedding.query_projection("Covid-19")[0], gismo.embedding.y)

<1x4889758 sparse matrix of type '<class 'numpy.float64'>'
	with 793 stored elements in Compressed Sparse Row format>

About 800 articles have an explicit link to Covid-19.
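The two counts above are the numbers of stored elements in the product of a projected query with a feature-to-document matrix. The idea can be reproduced on a toy dense matrix. This is a sketch under assumptions: the matrix, the feature names and `count_linked_documents` are made up, and the weights used by the real embedding are ignored.

```python
import numpy as np

# Toy feature-to-document matrix: one row per feature, one column per document.
toy_features = ["machine", "machine learning", "covid"]
toy_y = np.array([
    [1, 1, 1, 0, 0],  # "machine" appears in documents 0, 1, 2
    [1, 0, 1, 0, 0],  # "machine learning" appears in documents 0, 2
    [0, 0, 0, 1, 0],  # "covid" appears in document 3
], dtype=float)


def count_linked_documents(query_features):
    """Count documents with at least one of the query features."""
    q = np.array([[1.0 if f in query_features else 0.0 for f in toy_features]])
    scores = q @ toy_y  # shape (1, number of documents)
    return int(np.count_nonzero(scores))


print(count_linked_documents(["machine", "machine learning"]))
print(count_linked_documents(["covid"]))
```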

## Authors query

Instead of looking at words, we can explore authors and their collaborations.

We just have to rewire the corpus to output a string of authors.

In [20]:
def to_authors_text(dic):
    return " ".join([a.replace(' ', '_') for a in dic['authors']])
corpus = Corpus(source, to_text=to_authors_text)
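The underscore trick can be checked on a small example: each author becomes a single space-free token, and the original name is recovered by replacing underscores with spaces. The toy article below reuses three author names from the first DBLP entry shown earlier.

```python
def to_authors_text(dic):
    # Same function as above: one token per author, spaces replaced by underscores.
    return " ".join([a.replace(' ', '_') for a in dic['authors']])


toy_article = {'authors': ['Paul Kocher', 'Thomas Prescher 0002', 'Yuval Yarom']}

text = to_authors_text(toy_article)
tokens = text.split(' ')                            # one token per author
recovered = [t.replace('_', ' ') for t in tokens]   # back to readable names

print(text)
print(recovered)
```

Note that the round trip assumes author names do not contain underscores themselves.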

We can build a new embedding on top of this modified corpus. We tell the vectorizer to be stupid: don't preprocess, and words are separated by spaces.

This will take a few minutes (you can save the embedding for later if you want).

In [21]:
vectorizer = CountVectorizer(dtype=float,
                            preprocessor=lambda x:x, tokenizer=lambda x: x.split(' '))
a_embedding = Embedding(vectorizer=vectorizer)
try:
    a_embedding.load(filename="dblp_aut_embedding", path=path)
except:
    a_embedding.fit_transform(corpus)
    a_embedding.save(filename="dblp_aut_embedding", path=path)
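The load-or-build pattern used for both embeddings is a form of on-disk caching. The generic sketch below illustrates it with `pickle` and an explicit `FileNotFoundError`, so that unrelated errors are not silently swallowed. `load_or_compute` and `expensive` are hypothetical names, and this is not how `Embedding.load` is implemented.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_file, compute):
    """Return the cached result if the file exists, else compute and cache it."""
    cache_file = Path(cache_file)
    try:
        with cache_file.open("rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        result = compute()
        with cache_file.open("wb") as f:
            pickle.dump(result, f)
        return result


calls = []  # records how many times the expensive function really runs


def expensive():
    calls.append(1)
    return {"answer": sum(range(1000))}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "toy_cache.pkl"
    first = load_or_compute(cache, expensive)   # computed, then saved
    second = load_or_compute(cache, expensive)  # loaded from disk

print(first == second, len(calls))
```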

In [22]:
a_embedding.x

<4889758x2532137 sparse matrix of type '<class 'numpy.float64'>'
	with 15083911 stored elements in Compressed Sparse Row format>

We now have about 2,500,000 authors to explore. Let's reload gismo and try to play.

In [23]:
gismo = Gismo(corpus, a_embedding)
gismo.post_document = post_article
gismo.post_feature = lambda g, i: g.embedding.features[i].replace("_", " ")

def print_document_cluster(gismo, cluster, depth=""):
    sim = cosine_similarity(cluster.vector, gismo.diteration.y_relevance.reshape(1, -1))[0][0]
    if len(cluster.children) == 0:
        dic = gismo.corpus[cluster.indice]
        authors = ", ".join(dic['authors'])
        txt = f"{authors} ({dic['venue']}, {dic['year']})"
        #post_article(gismo, cluster.indice)
        print(f"{depth} {txt} ")
         #     f"(R: {gismo.diteration.x_relevance[cluster.indice]:.2f}; "
          #    f"S: {sim:.2f})")
    else:
        print(f"{depth} F: {cluster.focus:.2f}. "
              f"R: {sum(gismo.diteration.x_relevance[cluster.members]):.2f}. "
              f"S: {sim:.2f}.")
    for c in cluster.children:
        print_document_cluster(gismo, c, depth=depth + '-')
gismo.post_document_cluster = print_document_cluster
gismo.post_feature_cluster = print_feature_cluster

### Laurent Massoulié query

In [24]:
gismo.rank("Laurent_Massoulié")

True

What are the most central articles of Laurent Massoulié in terms of collaboration?

In [25]:
gismo.get_ranked_documents(k=10)

['Robustness of Spectral Methods for Community Detection. By Ludovic Stephan, Laurent Massoulié (COLT, 2019)',
 'Robustness of spectral methods for community detection. By Ludovic Stephan, Laurent Massoulié (CoRR, 2018)',
 'From tree matching to sparse graph alignment. By Luca Ganassali, Laurent Massoulié (CoRR, 2020)',
 'Spectral alignment of correlated Gaussian random matrices. By Luca Ganassali, Marc Lelarge, Laurent Massoulié (CoRR, 2019)',
 'Planting trees in graphs, and finding them back. By Laurent Massoulié, Ludovic Stephan, Don Towsley (COLT, 2019)',
 'Brief announcement: adaptive content placement for peer-to-peer video-on-demand systems. By Bo Tan 0002, Laurent Massoulié (PODC, 2010)',
 'Optimal content placement for peer-to-peer video-on-demand systems. By Bo Tan 0002, Laurent Massoulié (INFOCOM, 2011)',
 'Optimal Content Placement for Peer-to-Peer Video-on-Demand Systems. By Bo Tan 0002, Laurent Massoulié (IEEE/ACM Trans. Netw., 2013)',
 'Scalable Local Area Service Discov

We see lots of duplicates. This is not surprising, as many articles are published first as a research report, then as a conference paper, and finally as a journal article. Luckily, Gismo can cover for you.

In [26]:
gismo.get_covering_documents(k=10)

['Robustness of Spectral Methods for Community Detection. By Ludovic Stephan, Laurent Massoulié (COLT, 2019)',
 'Robustness of spectral methods for community detection. By Ludovic Stephan, Laurent Massoulié (CoRR, 2018)',
 'From tree matching to sparse graph alignment. By Luca Ganassali, Laurent Massoulié (CoRR, 2020)',
 'Spectral alignment of correlated Gaussian random matrices. By Luca Ganassali, Marc Lelarge, Laurent Massoulié (CoRR, 2019)',
 'Planting trees in graphs, and finding them back. By Laurent Massoulié, Ludovic Stephan, Don Towsley (COLT, 2019)',
 'Brief announcement: adaptive content placement for peer-to-peer video-on-demand systems. By Bo Tan 0002, Laurent Massoulié (PODC, 2010)',
 'Optimal content placement for peer-to-peer video-on-demand systems. By Bo Tan 0002, Laurent Massoulié (INFOCOM, 2011)',
 'Optimal Content Placement for Peer-to-Peer Video-on-Demand Systems. By Bo Tan 0002, Laurent Massoulié (IEEE/ACM Trans. Netw., 2013)',
 'Scalable Local Area Service Discov

Hmm, this is not working well. The reason here is query distortion. Query distortion is a gismo feature that modulates the clustering with the query. Sadly, when features are authors, the underlying graph has a very specific structure (highly sparse and redundant) that makes query distortion *too* effective. The solution is to deactivate it.

In [27]:
gismo.query_distortion = False
gismo.get_covering_documents(k=10)

['Robustness of Spectral Methods for Community Detection. By Ludovic Stephan, Laurent Massoulié (COLT, 2019)',
 'Scalable Local Area Service Discovery. By Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007)',
 'Asynchronous Accelerated Proximal Stochastic Gradient for Strongly Convex Distributed Finite Sums. By Hadrien Hendrikx, Francis Bach, Laurent Massoulié (CoRR, 2019)',
 'From tree matching to sparse graph alignment. By Luca Ganassali, Laurent Massoulié (CoRR, 2020)',
 'Brief announcement: adaptive content placement for peer-to-peer video-on-demand systems. By Bo Tan 0002, Laurent Massoulié (PODC, 2010)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2016)',
 'Adaptive matching for expert systems with uncertain task types. By Virag Shah, Lennart Gulikers, Laurent Massoulié, Milan Vojnovic (Allerton, 2017)',
 'Spectral alignment of correlated Gaussian random matrices. By Luca Ganassal

Much better. No duplicates and more diversity in the results. Let's observe the communities.

In [28]:
gismo.get_clustered_ranked_documents(k=20, resolution=.9)

 F: 0.36. R: 0.09. S: 0.87.
- F: 0.36. R: 0.08. S: 0.86.
-- F: 0.36. R: 0.06. S: 0.83.
--- F: 0.36. R: 0.03. S: 0.73.
---- F: 0.88. R: 0.01. S: 0.60.
----- F: 1.00. R: 0.01. S: 0.60.
------ Ludovic Stephan, Laurent Massoulié (COLT, 2019) 
------ Ludovic Stephan, Laurent Massoulié (CoRR, 2018) 
----- Laurent Massoulié, Ludovic Stephan, Don Towsley (COLT, 2019) 
---- F: 1.00. R: 0.01. S: 0.61.
----- Bo Tan 0002, Laurent Massoulié (PODC, 2010) 
----- Bo Tan 0002, Laurent Massoulié (INFOCOM, 2011) 
----- Bo Tan 0002, Laurent Massoulié (IEEE/ACM Trans. Netw., 2013) 
--- F: 0.41. R: 0.03. S: 0.69.
---- F: 0.86. R: 0.01. S: 0.59.
----- Luca Ganassali, Laurent Massoulié (CoRR, 2020) 
----- Luca Ganassali, Marc Lelarge, Laurent Massoulié (CoRR, 2019) 
---- F: 0.61. R: 0.03. S: 0.60.
----- F: 1.00. R: 0.02. S: 0.59.
------ Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2016) 
------ Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015) 
------ Lennart Gulikers, Marc Lelarge, L

OK! We see writing communities.

Now, let's look in terms of authors. This is actually the interesting part when studying collaborations.

In [29]:
gismo.get_ranked_features()

['Laurent Massoulié',
 'Marc Lelarge',
 'Stratis Ioannidis',
 'Nidhi Hegde',
 'Peter B. Key',
 'Anne-Marie Kermarrec',
 'Ayalvadi J. Ganesh',
 'Lennart Gulikers',
 'Hadrien Hendrikx',
 'Francis Bach',
 'Dan-Cristian Tomozei',
 'Amin Karbasi',
 'Milan Vojnovic',
 'Augustin Chaintreau',
 'Mathieu Leconte',
 'Bo Tan 0002',
 'James Roberts',
 'Ludovic Stephan',
 'Kuang Xu',
 'Fabio Picconi']

We see many authors that were not present in the articles listed above. This is an important observation: central articles (with respect to a query) are not necessarily written by central authors!

Let's organize them into communities.

In [30]:
gismo.get_clustered_ranked_features(resolution=.6)

 F: 0.00. R: 0.21. S: 0.52.
- F: 0.00. R: 0.21. S: 0.52.
-- F: 0.01. R: 0.21. S: 0.52.
--- F: 0.02. R: 0.20. S: 0.51.
---- F: 0.03. R: 0.17. S: 0.50.
----- F: 0.04. R: 0.17. S: 0.49.
------ F: 0.08. R: 0.16. S: 0.49.
------- F: 0.13. R: 0.15. S: 0.50.
-------- F: 0.15. R: 0.14. S: 0.49.
--------- Laurent_Massoulié (R: 0.10; S: 1.00)
--------- Nidhi_Hegde (R: 0.01; S: 0.18)
--------- Peter_B._Key (R: 0.01; S: 0.16)
--------- Ayalvadi_J._Ganesh (R: 0.01; S: 0.16)
--------- Lennart_Gulikers (R: 0.00; S: 0.22)
--------- Hadrien_Hendrikx (R: 0.00; S: 0.18)
--------- Francis_Bach (R: 0.00; S: 0.06)
-------- Marc_Lelarge (R: 0.01; S: 0.15)
------- Milan_Vojnovic (R: 0.00; S: 0.06)
------ Anne-Marie_Kermarrec (R: 0.01; S: 0.08)
----- Mathieu_Leconte (R: 0.00; S: 0.10)
---- F: 0.03. R: 0.02. S: 0.14.
----- F: 0.09. R: 0.02. S: 0.14.
------ Stratis_Ioannidis (R: 0.01; S: 0.15)
------ Augustin_Chaintreau (R: 0.00; S: 0.06)
----- Amin_Karbasi (R: 0.00; S: 0.05)
---- Dan-Cristian_Tomozei (R: 0.00; 

### Jim Roberts query

In [31]:
gismo.rank("James_W._Roberts")

True

Let's have a covering set of articles.

In [32]:
gismo.get_covering_documents(k=10)

['Integrated Admission Control for Streaming and Elastic Traffic. By Nabil Benameur, Slim Ben Fredj, Frank Delcoigne, Sara Oueslati-Boulahia, James W. Roberts (QofIS, 2001)',
 'An In-Camera Data Stream Processing System for Defect Detection in Web Inspection Tasks. By S. Hossain Hajimowlana, Roberto Muscedere, Graham A. Jullien, James W. Roberts (Real-Time Imaging, 1999)',
 "Modifications of Thomae's Function and Differentiability. By Kevin Beanland, James W. Roberts, Craig Stevenson (The American Mathematical Monthly, 2009)",
 'A Traffic Control Framework for High Speed Data Transmission. By James W. Roberts, Brahim Bensaou, Y. Canetti (Modelling and Evaluation of ATM Networks, 1993)',
 'Statistical bandwidth sharing: a study of congestion at flow level. By Slim Ben Fredj, Thomas Bonald, Alexandre Proutière, G. Régnié, James W. Roberts (SIGCOMM, 2001)',
 'Swing: Traffic capacity of a simple WDM ring network. By Thomas Bonald, Sara Oueslati, James W. Roberts, Charlotte Roger (Internati

Who are the associated authors?

In [33]:
gismo.get_ranked_features(k=10)

['James W. Roberts',
 'Thomas Bonald',
 'Maher Hamdi',
 'Sara Oueslati-Boulahia',
 'Ali Ibrahim',
 'Alexandre Proutière',
 'Sara Oueslati',
 'Jorma T. Virtamo',
 'Slim Ben Fredj',
 'Jussi Kangasharju']

Let's organize them.

In [34]:
gismo.get_clustered_ranked_features(k=10, resolution=.4)

 F: 0.01. R: 0.23. S: 0.55.
- F: 0.01. R: 0.22. S: 0.54.
-- F: 0.04. R: 0.20. S: 0.53.
--- F: 0.16. R: 0.19. S: 0.53.
---- F: 0.25. R: 0.17. S: 0.91.
----- James_W._Roberts (R: 0.12; S: 1.00)
----- Thomas_Bonald (R: 0.05; S: 0.28)
---- F: 0.57. R: 0.02. S: 0.34.
----- Sara_Oueslati-Boulahia (R: 0.01; S: 0.28)
----- Slim_Ben_Fredj (R: 0.01; S: 0.32)
---- Sara_Oueslati (R: 0.01; S: 0.16)
--- Alexandre_Proutière (R: 0.01; S: 0.05)
-- Maher_Hamdi (R: 0.01; S: 0.13)
-- Ali_Ibrahim (R: 0.01; S: 0.07)
- F: 0.01. R: 0.01. S: 0.05.
-- Jorma_T._Virtamo (R: 0.01; S: 0.04)
-- Jussi_Kangasharju (R: 0.01; S: 0.03)


### Combined queries

We can input multiple authors.

In [35]:
gismo.rank("Laurent_Massoulié and James_W._Roberts")

True

In [36]:
gismo.embedding.query_projection("Laurent_Massoulié and James_W._Roberts")

(<1x2532137 sparse matrix of type '<class 'numpy.float64'>'
 	with 2 stored elements in Compressed Sparse Row format>,
 True)
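The projection above has two stored elements, which suggests that only the two author tokens matched the vocabulary while `and` did not. The following pure-Python sketch mimics that behaviour on a made-up vocabulary. `project_query` is a hypothetical helper, and the real `query_projection` may also apply weights that are ignored here.

```python
# Made-up vocabulary mapping author tokens to feature indices.
toy_vocabulary = {"Laurent_Massoulié": 0, "James_W._Roberts": 1, "Thomas_Bonald": 2}


def project_query(query, vocabulary):
    """Return ({feature index: count}, found) for the tokens of a query."""
    counts = {}
    for token in query.split(' '):
        if token in vocabulary:  # unknown tokens such as "and" are ignored
            idx = vocabulary[token]
            counts[idx] = counts.get(idx, 0) + 1
    return counts, bool(counts)


projection, found = project_query("Laurent_Massoulié and James_W._Roberts", toy_vocabulary)
print(projection, found)
```

The boolean plays the same role as the `True` returned above: it tells whether at least one query token was found among the features.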

Let's have a covering set of articles.

In [37]:
gismo.get_covering_documents(k=10)

['Integrated Admission Control for Streaming and Elastic Traffic. By Nabil Benameur, Slim Ben Fredj, Frank Delcoigne, Sara Oueslati-Boulahia, James W. Roberts (QofIS, 2001)',
 'An In-Camera Data Stream Processing System for Defect Detection in Web Inspection Tasks. By S. Hossain Hajimowlana, Roberto Muscedere, Graham A. Jullien, James W. Roberts (Real-Time Imaging, 1999)',
 "Modifications of Thomae's Function and Differentiability. By Kevin Beanland, James W. Roberts, Craig Stevenson (The American Mathematical Monthly, 2009)",
 'A Traffic Control Framework for High Speed Data Transmission. By James W. Roberts, Brahim Bensaou, Y. Canetti (Modelling and Evaluation of ATM Networks, 1993)',
 'Statistical bandwidth sharing: a study of congestion at flow level. By Slim Ben Fredj, Thomas Bonald, Alexandre Proutière, G. Régnié, James W. Roberts (SIGCOMM, 2001)',
 'Swing: Traffic capacity of a simple WDM ring network. By Thomas Bonald, Sara Oueslati, James W. Roberts, Charlotte Roger (Internati

Note that we only get articles by Roberts here, yet the articles returned have slightly changed.

Now, let's look at the main authors.

In [38]:
gismo.get_ranked_features()

['James W. Roberts',
 'Laurent Massoulié',
 'Thomas Bonald',
 'Nidhi Hegde',
 'Maher Hamdi',
 'Marc Lelarge',
 'Stratis Ioannidis',
 'Alexandre Proutière',
 'Sara Oueslati-Boulahia',
 'Ali Ibrahim',
 'Peter B. Key',
 'Jorma T. Virtamo',
 'Sara Oueslati',
 'Anne-Marie Kermarrec',
 'Ayalvadi J. Ganesh',
 'Slim Ben Fredj']

We see a mix of the co-authors of both researchers. How are they organized?

In [39]:
gismo.get_clustered_ranked_features(resolution=.4)

 F: 0.01. R: 0.20. S: 0.54.
- F: 0.01. R: 0.20. S: 0.54.
-- F: 0.05. R: 0.19. S: 0.53.
--- F: 0.25. R: 0.09. S: 0.84.
---- James_W._Roberts (R: 0.06; S: 0.91)
---- Thomas_Bonald (R: 0.03; S: 0.26)
--- F: 0.12. R: 0.08. S: 0.23.
---- Laurent_Massoulié (R: 0.05; S: 0.40)
---- Nidhi_Hegde (R: 0.01; S: 0.10)
---- Marc_Lelarge (R: 0.01; S: 0.06)
---- Stratis_Ioannidis (R: 0.01; S: 0.06)
---- Alexandre_Proutière (R: 0.01; S: 0.05)
---- Peter_B._Key (R: 0.00; S: 0.07)
---- Ayalvadi_J._Ganesh (R: 0.00; S: 0.06)
--- F: 0.57. R: 0.01. S: 0.31.
---- Sara_Oueslati-Boulahia (R: 0.00; S: 0.26)
---- Slim_Ben_Fredj (R: 0.00; S: 0.29)
--- Sara_Oueslati (R: 0.00; S: 0.15)
--- Anne-Marie_Kermarrec (R: 0.00; S: 0.03)
-- Maher_Hamdi (R: 0.01; S: 0.12)
-- Ali_Ibrahim (R: 0.00; S: 0.07)
- Jorma_T._Virtamo (R: 0.00; S: 0.04)


Rough analysis:
- One cluster centered around Jim
- One cluster centered around Laurent
- Others

Note that Massoulié's cluster is bigger than Roberts'. This is a possible explanation for the predominance of Roberts' articles when asking for the best documents (Massoulié's articles are more diluted).

## Cross-gismo

Gismo can combine two embeddings to create one hybrid gismo. This is called a cross-gismo. This feature can be used to analyze authors with respect to the words they use (and vice versa).

In [40]:
from gismo.gismo import XGismo
gismo = XGismo(x_embedding=a_embedding, y_embedding=embedding)
gismo.diteration.n_iter = 2 # to speed up a little computation
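Conceptually, both embeddings share the same documents, so authors can be linked to words through the articles they wrote. The toy sketch below illustrates that idea with a plain matrix product. It is only a conceptual illustration with made-up data: the exact computation and weighting performed by `XGismo` are not shown in this Notebook.

```python
import numpy as np

authors = ["alice", "bob"]                       # made-up authors
words = ["graph", "stabilization", "streaming"]  # made-up words

# Document-to-author links: one row per document, one column per author.
doc_author = np.array([
    [1, 0],  # document 0 written by alice
    [1, 1],  # document 1 written by alice and bob
    [0, 1],  # document 2 written by bob
], dtype=float)

# Document-to-word links: one row per document, one column per word.
doc_word = np.array([
    [1, 0, 0],  # document 0 uses "graph"
    [1, 1, 0],  # document 1 uses "graph" and "stabilization"
    [0, 0, 1],  # document 2 uses "streaming"
], dtype=float)

# Author-to-word links obtained by going through the shared documents.
author_word = doc_author.T @ doc_word

for name, row in zip(authors, author_word):
    print(name, dict(zip(words, row.tolist())))
```

In the resulting hybrid view, authors play the role of documents and words play the role of features, which matches the way the queries below are answered.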

Note that XGismo does not use the underlying corpus, so we can now close the source (the source keeps the ``dblp.data`` file open).

In [41]:
source.close()

In [42]:
gismo.post_document = lambda g, i: g.corpus[i].replace("_", " ")
gismo.post_feature_cluster = print_feature_cluster
def print_document_cluster(gismo, cluster, depth=""):
    sim = cosine_similarity(cluster.vector, gismo.diteration.y_relevance.reshape(1, -1))[0][0]
    if len(cluster.children) == 0:
        txt = gismo.corpus[cluster.indice].replace("_", " ")
        print(f"{depth} {txt} "
              f"(R: {gismo.diteration.x_relevance[cluster.indice]:.2f}; "
              f"S: {sim:.2f})")
    else:
        print(f"{depth} F: {cluster.focus:.2f}. "
              f"R: {sum(gismo.diteration.x_relevance[cluster.members]):.2f}. "
              f"S: {sim:.2f}.")
    for c in cluster.children:
        print_document_cluster(gismo, c, depth=depth + '-')
gismo.post_document_cluster = print_document_cluster

Let's try a request.

In [43]:
gismo.rank("self-stabilization")

True

What are the associated keywords?

In [44]:
gismo.get_ranked_features(k=10)

['stabilization',
 'self',
 'self stabilization',
 'stabilizing',
 'self stabilizing',
 'distributed',
 'robust',
 'sensor',
 'adaptive',
 'fault']

Using covering can yield other keywords of interest.

In [45]:
gismo.get_covering_features(k=10)

['stabilization',
 'nonlinear',
 'linear',
 'byzantine',
 'sensor',
 'fault',
 'self',
 'self stabilization',
 'stabilizing',
 'self stabilizing']

How are keywords structured?

In [46]:
gismo.get_clustered_ranked_features(k=20)

 F: 0.04. R: 0.02. S: 0.79.
- F: 0.62. R: 0.02. S: 0.79.
-- F: 0.62. R: 0.02. S: 0.80.
--- F: 0.62. R: 0.02. S: 0.80.
---- stabilization (R: 0.00; S: 0.81)
---- self (R: 0.00; S: 0.76)
---- self stabilization (R: 0.00; S: 0.82)
---- stabilizing (R: 0.00; S: 0.72)
---- self stabilizing (R: 0.00; S: 0.66)
---- distributed (R: 0.00; S: 0.74)
---- robust (R: 0.00; S: 0.46)
---- adaptive (R: 0.00; S: 0.57)
---- dynamic (R: 0.00; S: 0.67)
---- optimal (R: 0.00; S: 0.65)
---- stability (R: 0.00; S: 0.36)
--- F: 0.93. R: 0.00. S: 0.66.
---- sensor (R: 0.00; S: 0.66)
---- wireless (R: 0.00; S: 0.65)
---- sensor networks (R: 0.00; S: 0.65)
--- fault (R: 0.00; S: 0.75)
-- F: 0.91. R: 0.00. S: 0.47.
--- byzantine (R: 0.00; S: 0.46)
--- mobile (R: 0.00; S: 0.56)
--- asynchronous (R: 0.00; S: 0.54)
- F: 0.51. R: 0.00. S: 0.16.
-- nonlinear (R: 0.00; S: 0.08)
-- linear (R: 0.00; S: 0.27)


Who are the associated researchers?

In [47]:
gismo.get_ranked_documents(k=10)

['Ted Herman',
 'Shlomi Dolev',
 'Sébastien Tixeuil',
 'Sukumar Ghosh',
 'George Varghese',
 'Shay Kutten',
 'Toshimitsu Masuzawa',
 'Stefan Schmid 0001',
 'Swan Dubois',
 'Bertrand Ducourthial']

How are they structured?

In [48]:
gismo.get_clustered_ranked_documents(k=10, resolution=.9)

 F: 0.75. R: 0.05. S: 0.84.
- F: 0.87. R: 0.05. S: 0.84.
-- F: 0.92. R: 0.04. S: 0.83.
--- F: 0.95. R: 0.03. S: 0.83.
---- Ted Herman (R: 0.01; S: 0.81)
---- Sébastien Tixeuil (R: 0.01; S: 0.83)
---- Sukumar Ghosh (R: 0.01; S: 0.81)
---- Swan Dubois (R: 0.00; S: 0.80)
--- F: 0.97. R: 0.01. S: 0.77.
---- Shlomi Dolev (R: 0.01; S: 0.79)
---- Toshimitsu Masuzawa (R: 0.00; S: 0.73)
--- Shay Kutten (R: 0.00; S: 0.82)
-- F: 0.96. R: 0.01. S: 0.80.
--- George Varghese (R: 0.00; S: 0.84)
--- Bertrand Ducourthial (R: 0.00; S: 0.79)
- Stefan Schmid 0001 (R: 0.00; S: 0.71)


We can also query researchers. Just use underscores in the query and add `y=False` to indicate that the input is *documents*.

In [49]:
gismo.rank("Sébastien_Tixeuil and Fabien_Mathieu", y=False)

True

What are the associated keywords?

In [50]:
gismo.get_ranked_features(k=10)

['byzantine',
 'stabilization',
 'p2p',
 'stabilizing',
 'live streaming',
 'p2p networks',
 'reloaded',
 'self',
 'self stabilization',
 'live']

Using covering can yield other keywords of interest.

In [51]:
gismo.get_covering_features(k=10)

['byzantine',
 'preference',
 'p2p',
 'pagerank',
 'grid',
 'live streaming',
 'p2p networks',
 'acyclic',
 'streaming',
 'stabilization']

How are keywords structured?

In [52]:
gismo.get_clustered_ranked_features(k=20, resolution=.7)

 F: 0.23. R: 0.22. S: 0.66.
- F: 0.38. R: 0.21. S: 0.66.
-- F: 0.81. R: 0.09. S: 0.61.
--- byzantine (R: 0.02; S: 0.53)
--- stabilization (R: 0.02; S: 0.58)
--- stabilizing (R: 0.01; S: 0.60)
--- self (R: 0.01; S: 0.62)
--- self stabilization (R: 0.01; S: 0.59)
--- self stabilizing (R: 0.01; S: 0.59)
--- robots (R: 0.01; S: 0.51)
--- asynchronous (R: 0.01; S: 0.55)
-- F: 0.74. R: 0.10. S: 0.30.
--- F: 0.79. R: 0.02. S: 0.36.
---- p2p (R: 0.02; S: 0.34)
---- streaming (R: 0.01; S: 0.33)
--- F: 0.88. R: 0.05. S: 0.29.
---- live streaming (R: 0.01; S: 0.29)
---- reloaded (R: 0.01; S: 0.29)
---- live (R: 0.01; S: 0.31)
---- preference based (R: 0.01; S: 0.27)
---- refresh (R: 0.01; S: 0.27)
--- p2p networks (R: 0.01; S: 0.30)
--- acyclic (R: 0.01; S: 0.31)
-- pagerank (R: 0.01; S: 0.24)
-- grid (R: 0.01; S: 0.43)
- preference (R: 0.01; S: 0.21)


Who are the associated researchers?

In [53]:
gismo.get_ranked_documents(k=10)

['Sébastien Tixeuil',
 'Fabien Mathieu',
 'Shlomi Dolev',
 'Toshimitsu Masuzawa',
 'Franck Petit',
 'Stéphane Devismes',
 'Edmond Bianco',
 'Ted Herman',
 'Nitin H. Vaidya',
 'Michel Raynal']

How are they structured?

In [54]:
gismo.get_clustered_ranked_documents(k=10, resolution=.8)

 F: 0.00. R: 0.01. S: 0.62.
- F: 0.10. R: 0.01. S: 0.61.
-- F: 0.21. R: 0.00. S: 0.51.
--- F: 0.75. R: 0.00. S: 0.49.
---- Sébastien Tixeuil (R: 0.00; S: 0.56)
---- F: 0.92. R: 0.00. S: 0.45.
----- Shlomi Dolev (R: 0.00; S: 0.45)
----- Toshimitsu Masuzawa (R: 0.00; S: 0.48)
----- Franck Petit (R: 0.00; S: 0.43)
----- Stéphane Devismes (R: 0.00; S: 0.44)
----- Ted Herman (R: 0.00; S: 0.42)
--- F: 0.73. R: 0.00. S: 0.32.
---- Nitin H. Vaidya (R: 0.00; S: 0.29)
---- Michel Raynal (R: 0.00; S: 0.34)
-- Fabien Mathieu (R: 0.00; S: 0.71)
- Edmond Bianco (R: 0.00; S: 0.05)
