# DBLP exploration

This tutorial shows how to explore DBLP with Gismo.

Note: the DBLP database is not small. Recommended requirements to execute this Notebook:
- Fast Internet connection (you will need to download a few hundred MB)
- 4 GB of free disk space
- 4 GB of RAM (8 GB or more recommended)
- Decent CPU (can take more than one hour on slow CPUs)

Here, *documents* are articles in DBLP. The *features* of an article will vary: words of the title in the first part, authors in the second part.

## Initialisation

First, we load the required packages.

In [1]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from pathlib import Path

from gismo.datasets.dblp import Dblp
from gismo.filesource import FileSource
from gismo.corpus import Corpus
from gismo.embedding import Embedding
from gismo.gismo import Gismo
from gismo.post_processing import print_feature_cluster

Then, we prepare the DBLP source.

First, we choose the location of the DBLP files. If you want to run this Notebook on your own machine, please change the path below and check that it exists.

In [2]:
path = Path("../../../../../Datasets/DBLP")
path.exists()

True
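
If the folder does not exist yet, it can be created before building the dataset. The snippet below is a minimal sketch using only `pathlib`; the helper name `ensure_dir` is ours and is not part of Gismo. The demonstration runs in a temporary directory so that nothing is written to your disk.

```python
import tempfile
from pathlib import Path


def ensure_dir(path):
    """Create the directory (and any missing parents) if needed, then return it."""
    path = Path(path)
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration in a throw-away location.
with tempfile.TemporaryDirectory() as tmp:
    demo = ensure_dir(Path(tmp) / "Datasets" / "DBLP")
    created = demo.is_dir()

print(created)
```

In the Notebook, the same helper could be applied to the `path` variable defined above before calling `path.exists()`.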

Construction of the DBLP files. This only needs to be performed the first time, or when you want to refresh the database. It takes about 10 minutes on a Surface Pro 4 with a fiber Internet connection.

In [3]:
dblp = Dblp(path=path)
dblp.build()

File ..\..\..\..\..\Datasets\DBLP\dblp.xml.gz already exists.
File ..\..\..\..\..\Datasets\DBLP\dblp.data already exists.


Then, we can load the database as a filesource.

In [4]:
source = FileSource(filename="dblp", path=path)
source[0]

{'type': 'article',
 'title': 'Spectre Attacks: Exploiting Speculative Execution.',
 'venue': 'meltdownattack.com',
 'year': '2018',
 'authors': ['Paul Kocher',
  'Daniel Genkin',
  'Daniel Gruss',
  'Werner Haas',
  'Mike Hamburg',
  'Moritz Lipp',
  'Stefan Mangard',
  'Thomas Prescher 0002',
  'Michael Schwarz 0001',
  'Yuval Yarom']}

Each article is a dict with fields ``type``, ``venue``, ``title``, ``year``, and ``authors``. We build a corpus that will tell Gismo that the content of an article is its ``title`` value.

In [5]:
corpus = Corpus(source, to_text=lambda x: x['title'])

We build an embedding on top of that corpus.
- We set ``min_df=30`` to exclude rare features;
- We set ``max_df=.02`` to exclude anything present in more than 2% of the corpus;
- We add a few manually selected stopwords to fine-tune things;
- We set ``ngram_range=(1, 2)`` to include bi-grams in the embedding.

This will take a few minutes (you can save the embedding for later if you want).

In [None]:
vectorizer = CountVectorizer(min_df=30, max_df=.02, ngram_range=(1, 2), dtype=float,
                             stop_words=['a', 'about', 'an', 'and', 'for', 'from', 'in', 'of', 'on', 'the', 'with'])
embedding = Embedding(vectorizer=vectorizer)

# Reuse a previously saved embedding if possible; otherwise compute it and save it.
try:
    embedding.load(filename="dblp_embedding", path=path)
except Exception:
    embedding.fit_transform(corpus)
    embedding.save(filename="dblp_embedding", path=path)
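
To make the effect of ``min_df``, ``max_df``, the stopwords and the bi-grams more concrete, here is a toy sketch in plain Python. It is not scikit-learn's implementation: the function names are ours, the titles are made up, and tokens are obtained by a simple whitespace split. It only mimics the rules described above: stopwords are dropped before n-grams are formed, an integer ``min_df`` is an absolute document count, and a float ``max_df`` is a proportion of the corpus.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def select_features(titles, min_df, max_df, stop_words=()):
    """Keep n-grams whose document frequency lies in [min_df, max_df * n_docs]."""
    stop = set(stop_words)
    df = Counter()
    for title in titles:
        tokens = [w for w in title.lower().split() if w not in stop]
        # A feature counts once per document, however often it occurs in it.
        df.update(set(ngrams(tokens)))
    n_docs = len(titles)
    return sorted(f for f, c in df.items() if min_df <= c <= max_df * n_docs)


# Made-up toy corpus of five titles.
titles = [
    "learning on graphs",
    "graph learning",
    "deep learning",
    "learning to rank",
    "deep learning for graphs",
]
features = select_features(titles, min_df=2, max_df=0.5, stop_words=["on", "for", "to"])
print(features)
# ['deep', 'deep learning', 'graphs', 'learning graphs']
```

In this toy corpus, *learning* appears in every title and is removed by ``max_df``, while *graph* and *rank* appear only once and are removed by ``min_df``.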

In [None]:
embedding.x

We see from ``embedding.x`` that the embedding links about 5,000,000 documents to 160,000 features. On average, each document is linked to about 10 features.

Now, we initialize the gismo object and customize the post-processors to ease the display.

In [None]:
gismo = Gismo(corpus, embedding)

In [10]:
def post_article(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{dic['title']} By {authors} ({dic['venue']}, {dic['year']})"
    
gismo.post_document = post_article
def print_document_cluster(gismo, cluster, depth=""):
    sim = cosine_similarity(cluster.vector, gismo.diteration.y_relevance.reshape(1, -1))[0][0]
    if len(cluster.children) == 0:
        txt = gismo.corpus[cluster.indice]["title"]
        print(f"{depth} {txt} "
              f"(R: {gismo.diteration.x_relevance[cluster.indice]:.2f}; "
              f"S: {sim:.2f})")
    else:
        print(f"{depth} F: {cluster.focus:.2f}. "
              f"R: {sum(gismo.diteration.x_relevance[cluster.members]):.2f}. "
              f"S: {sim:.2f}.")
    for c in cluster.children:
        print_document_cluster(gismo, c, depth=depth + '-')
gismo.post_document_cluster = print_document_cluster
gismo.post_feature_cluster = print_feature_cluster
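
To see what ``post_article`` returns without loading the whole database, it can be run against a stand-in object. This is only a sketch: ``fake_gismo`` is a hypothetical placeholder exposing just the ``corpus`` attribute that the function reads, and the record is the first entry of the source displayed earlier.

```python
from types import SimpleNamespace


def post_article(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{dic['title']} By {authors} ({dic['venue']}, {dic['year']})"


# First record of the source, as displayed by `source[0]`.
record = {
    'type': 'article',
    'title': 'Spectre Attacks: Exploiting Speculative Execution.',
    'venue': 'meltdownattack.com',
    'year': '2018',
    'authors': ['Paul Kocher', 'Daniel Genkin', 'Daniel Gruss', 'Werner Haas',
                'Mike Hamburg', 'Moritz Lipp', 'Stefan Mangard',
                'Thomas Prescher 0002', 'Michael Schwarz 0001', 'Yuval Yarom'],
}

# Hypothetical stand-in: a plain list plays the role of the corpus.
fake_gismo = SimpleNamespace(corpus=[record])
line = post_article(fake_gismo, 0)
print(line)
```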

As the dataset is big, we lower the precision of the computation to speed things up a little.

In [11]:
gismo.diteration.n_iter = 2

## Machine Learning (and Covid-19) query

We perform the query *Machine Learning*. The returned ``True`` indicates that some of the query features were found among the features of the corpus.

In [None]:
gismo.rank("Machine Learning")

What are the best articles on *Machine Learning*?

In [None]:
gismo.get_ranked_documents()

OK, this seems to go everywhere. Maybe we can narrow the scope with a more specific request.

In [None]:
gismo.rank("Machine Learning and covid-19")

In [None]:
gismo.get_ranked_documents()

Sounds nice. How are these articles related?

In [None]:
gismo.get_clustered_ranked_documents(k=10, resolution=.9)

OK! Let's decode this. We have:
- One single article on gravitational search
- A large cluster on applications of ML to X-ray / CT scan processing
- A social cluster
- Note that it's not perfect: a Twitter dataset falls in the X-ray cluster...

Now, let's look at the main keywords.

In [None]:
gismo.get_ranked_features(20)

Let's organize them.

In [None]:
gismo.get_clustered_ranked_features(20)

Rough, very broad analysis:
- Covid-19 as a pandemic disease is a core cluster. As there are more articles in DBLP (a computer science database) about ML than about Covid-19, Gismo focused more on ML
- X-ray processing seems to be a field of interest
- Twitter datasets

In [None]:
np.dot(gismo.embedding.query_projection("Machine learning")[0], gismo.embedding.y)

About 50,000 articles have an explicit link to machine learning.

In [None]:
np.dot(gismo.embedding.query_projection("Covid-19")[0], gismo.embedding.y)

About 800 articles have an explicit link to Covid-19.

## Authors query

Instead of looking at words, we can explore authors and their collaborations.

We just have to rewire the corpus to output a string of authors.

In [6]:
def to_authors_text(dic):
    return " ".join([a.replace(' ', '_') for a in dic['authors']])
corpus = Corpus(source, to_text=to_authors_text)

We can build a new embedding on top of this modified corpus. We tell the vectorizer to be stupid: don't preprocess, and assume that words are separated by spaces.

This will take a few minutes (you can save the embedding for later if you want).

In [7]:
vectorizer = CountVectorizer(dtype=float,
                             preprocessor=lambda x: x, tokenizer=lambda x: x.split(' '))
embedding = Embedding(vectorizer=vectorizer)
# Reuse a previously saved embedding if possible; otherwise compute it and save it.
try:
    embedding.load(filename="dblp_aut_embedding", path=path)
except Exception:
    embedding.fit_transform(corpus)
    embedding.save(filename="dblp_aut_embedding", path=path)
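
The underscore trick used above can be checked on a single record. The sketch below reuses ``to_authors_text`` and the same splitting rule as the tokenizer; the helper name ``tokenize`` is ours, and the author pair is taken from one of the DBLP articles shown later in this Notebook.

```python
def to_authors_text(dic):
    return " ".join([a.replace(' ', '_') for a in dic['authors']])


def tokenize(text):
    # Same rule as the tokenizer passed to CountVectorizer: split on single spaces.
    return text.split(' ')


article = {'authors': ['Bo Tan 0002', 'Laurent Massoulié']}

text = to_authors_text(article)   # one token per author, spaces replaced by '_'
tokens = tokenize(text)           # features as seen by the vectorizer
names = [t.replace('_', ' ') for t in tokens]  # readable names, as in post_feature

print(text)
print(tokens)
print(names == article['authors'])
```

Because the preprocessor is the identity, the case and accents of the names are kept, which is why queries are written as ``Laurent_Massoulié``. One caveat of this simple rule: an empty author list produces an empty string, and ``''.split(' ')`` returns ``['']``, that is, a single empty token.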

In [8]:
embedding.x

<4889758x2532137 sparse matrix of type '<class 'numpy.float64'>'
	with 15083911 stored elements in Compressed Sparse Row format>
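
A few orders of magnitude can be read directly from this summary. The arithmetic below is a small sketch that only uses the three numbers displayed above, interpreting each stored element as one (article, author) pair.

```python
# Figures copied from the `embedding.x` summary shown above.
n_articles = 4_889_758
n_authors = 2_532_137
n_links = 15_083_911  # stored elements, i.e. (article, author) pairs

authors_per_article = n_links / n_articles
articles_per_author = n_links / n_authors
density = n_links / (n_articles * n_authors)

print(f"{authors_per_article:.2f} authors per article on average")
print(f"{articles_per_author:.2f} articles per author on average")
print(f"density of the matrix: {density:.1e}")
```

So an article has about 3 authors on average, an author has about 6 articles, and roughly one entry in a million is non-zero, which is why a sparse format is used.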

We now have about 2,500,000 authors to explore. Let's reload gismo and try a few of them.

In [12]:
gismo = Gismo(corpus, embedding)
gismo.post_document = post_article
gismo.post_feature = lambda g, i: g.embedding.features[i].replace("_", " ")

def print_document_cluster(gismo, cluster, depth=""):
    if len(cluster.children) == 0:
        # Leaf: display the authors, venue and year of the article.
        dic = gismo.corpus[cluster.indice]
        authors = ", ".join(dic['authors'])
        txt = f"{authors} ({dic['venue']}, {dic['year']})"
        print(f"{depth} {txt} ")
    else:
        # Internal node: display focus, total relevance and cosine similarity.
        sim = cosine_similarity(cluster.vector, gismo.diteration.y_relevance.reshape(1, -1))[0][0]
        print(f"{depth} F: {cluster.focus:.2f}. "
              f"R: {sum(gismo.diteration.x_relevance[cluster.members]):.2f}. "
              f"S: {sim:.2f}.")
    for c in cluster.children:
        print_document_cluster(gismo, c, depth=depth + '-')


gismo.post_document_cluster = print_document_cluster
gismo.post_feature_cluster = print_feature_cluster

### Laurent Massoulié query

In [13]:
gismo.diteration.offset = 1
gismo.rank("Laurent_Massoulié")

True

What are the most central articles of Laurent Massoulié in terms of collaboration?

In [14]:
gismo.get_ranked_documents(k=10)

['Robustness of Spectral Methods for Community Detection. By Ludovic Stephan, Laurent Massoulié (COLT, 2019)',
 'Robustness of spectral methods for community detection. By Ludovic Stephan, Laurent Massoulié (CoRR, 2018)',
 'From tree matching to sparse graph alignment. By Luca Ganassali, Laurent Massoulié (CoRR, 2020)',
 'Spectral alignment of correlated Gaussian random matrices. By Luca Ganassali, Marc Lelarge, Laurent Massoulié (CoRR, 2019)',
 'Planting trees in graphs, and finding them back. By Laurent Massoulié, Ludovic Stephan, Don Towsley (COLT, 2019)',
 'Brief announcement: adaptive content placement for peer-to-peer video-on-demand systems. By Bo Tan 0002, Laurent Massoulié (PODC, 2010)',
 'Optimal content placement for peer-to-peer video-on-demand systems. By Bo Tan 0002, Laurent Massoulié (INFOCOM, 2011)',
 'Optimal Content Placement for Peer-to-Peer Video-on-Demand Systems. By Bo Tan 0002, Laurent Massoulié (IEEE/ACM Trans. Netw., 2013)',
 'Scalable Local Area Service Discov

We see lots of duplicates. This is not surprising, as many articles can be published first as a research report, then as a conference paper, and last as a journal article. Luckily, Gismo can cover for you.

In [15]:
gismo.get_covering_documents(k=10, stretch=3)

['Robustness of Spectral Methods for Community Detection. By Ludovic Stephan, Laurent Massoulié (COLT, 2019)',
 'Scalable Local Area Service Discovery. By Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007)',
 'From tree matching to sparse graph alignment. By Luca Ganassali, Laurent Massoulié (CoRR, 2020)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2016)',
 'Asynchronous Accelerated Proximal Stochastic Gradient for Strongly Convex Distributed Finite Sums. By Hadrien Hendrikx, Francis Bach, Laurent Massoulié (CoRR, 2019)',
 'Non-Metric Coordinates for Predicting Network Proximity. By Peter B. Key, Laurent Massoulié, Dan-Cristian Tomozei (INFOCOM, 2008)',
 'Planting trees in graphs, and finding them back. By Laurent Massoulié, Ludovic Stephan, Don Towsley (COLT, 2019)',
 'Brief announcement: adaptive content placement for peer-to-peer video-on-demand systems. By Bo Tan 0002, Laurent Mas
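
For comparison, a much cruder way to drop such duplicates is to keep only the first entry for each title, ignoring case. The sketch below is explicitly *not* what ``get_covering_documents`` does; it is a naive alternative whose function name is ours, applied to a few (title, venue, year) triples copied from the ranked list above.

```python
def drop_title_duplicates(entries):
    """Keep the first entry for each title, ignoring case and surrounding spaces."""
    seen = set()
    kept = []
    for title, venue, year in entries:
        key = title.casefold().strip()
        if key not in seen:
            seen.add(key)
            kept.append((title, venue, year))
    return kept


# (title, venue, year) triples taken from the ranked documents above.
ranked = [
    ("Robustness of Spectral Methods for Community Detection.", "COLT", 2019),
    ("Robustness of spectral methods for community detection.", "CoRR", 2018),
    ("From tree matching to sparse graph alignment.", "CoRR", 2020),
    ("Optimal content placement for peer-to-peer video-on-demand systems.", "INFOCOM", 2011),
    ("Optimal Content Placement for Peer-to-Peer Video-on-Demand Systems.", "IEEE/ACM Trans. Netw.", 2013),
]
unique = drop_title_duplicates(ranked)
for title, venue, year in unique:
    print(f"{title} ({venue}, {year})")
```

Such a rule only catches titles that are identical up to case. A retitled version, like the *Brief announcement* entry in the list above, would be kept.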

Sounds nice. Let's look at the communities.

In [17]:
gismo.get_clustered_ranked_documents(k=20, resolution=.9)

 F: 0.36. R: 0.09. S: 0.87.
- F: 0.36. R: 0.08. S: 0.86.
-- F: 0.36. R: 0.06. S: 0.83.
--- F: 0.36. R: 0.03. S: 0.73.
---- F: 0.88. R: 0.01. S: 0.60.
----- F: 1.00. R: 0.01. S: 0.60.
------ Ludovic Stephan, Laurent Massoulié (COLT, 2019) 
------ Ludovic Stephan, Laurent Massoulié (CoRR, 2018) 
----- Laurent Massoulié, Ludovic Stephan, Don Towsley (COLT, 2019) 
---- F: 1.00. R: 0.01. S: 0.61.
----- Bo Tan 0002, Laurent Massoulié (PODC, 2010) 
----- Bo Tan 0002, Laurent Massoulié (INFOCOM, 2011) 
----- Bo Tan 0002, Laurent Massoulié (IEEE/ACM Trans. Netw., 2013) 
--- F: 0.41. R: 0.03. S: 0.69.
---- F: 0.86. R: 0.01. S: 0.59.
----- Luca Ganassali, Laurent Massoulié (CoRR, 2020) 
----- Luca Ganassali, Marc Lelarge, Laurent Massoulié (CoRR, 2019) 
---- F: 0.61. R: 0.03. S: 0.60.
----- F: 1.00. R: 0.02. S: 0.59.
------ Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2016) 
------ Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015) 
------ Lennart Gulikers, Marc Lelarge, L

OK! We see writing communities.
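
A word on the numbers printed for each internal node. According to the ``print_document_cluster`` function defined above, ``F`` is the ``focus`` attribute of the cluster, ``R`` is the sum of ``x_relevance`` over the members of the cluster, and ``S`` is the cosine similarity between the cluster vector and ``y_relevance``. The sketch below recomputes a cosine similarity in plain Python on made-up toy vectors, only to illustrate how ``S`` behaves; the vectors and their values are assumptions, not Gismo data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either one is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up toy vectors over four features; the values are illustrative only.
query_relevance = [0.5, 0.3, 0.2, 0.0]
cluster_a = [1.0, 1.0, 0.0, 0.0]  # overlaps with the heaviest query features
cluster_b = [0.0, 0.0, 0.0, 1.0]  # no overlap with the query

s_a = cosine(cluster_a, query_relevance)
s_b = cosine(cluster_b, query_relevance)
print(f"S(a) = {s_a:.2f}, S(b) = {s_b:.2f}")
```

A cluster whose vector points in the same direction as the relevance vector gets a similarity close to 1, while a cluster with no feature in common gets 0.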

Now, let's look in terms of authors. This is actually the interesting part when studying collaborations. So far, the top articles do not seem to be very informative in the author space.

In [18]:
gismo.get_ranked_features()

['Laurent Massoulié',
 'Marc Lelarge',
 'Stratis Ioannidis',
 'Nidhi Hegde',
 'Peter B. Key',
 'Anne-Marie Kermarrec',
 'Ayalvadi J. Ganesh',
 'Lennart Gulikers',
 'Hadrien Hendrikx',
 'Francis Bach',
 'Dan-Cristian Tomozei',
 'Amin Karbasi',
 'Milan Vojnovic',
 'Augustin Chaintreau',
 'Mathieu Leconte',
 'Bo Tan 0002',
 'James Roberts',
 'Ludovic Stephan',
 'Kuang Xu',
 'Fabio Picconi']

We see many authors that were not present in the articles listed above. This is an important observation: central articles (with respect to a query) are not necessarily written by central authors!

Let's organize them into communities.

In [20]:
gismo.get_clustered_ranked_features()

 F: 0.01. R: 0.21. S: 0.59.
- F: 0.02. R: 0.21. S: 0.58.
-- F: 0.03. R: 0.20. S: 0.63.
--- F: 0.13. R: 0.19. S: 0.59.
---- F: 0.27. R: 0.13. S: 0.80.
----- F: 0.32. R: 0.11. S: 0.84.
------ Laurent_Massoulié (R: 0.10; S: 1.00)
------ Nidhi_Hegde (R: 0.01; S: 0.32)
----- F: 0.37. R: 0.02. S: 0.33.
------ F: 0.59. R: 0.02. S: 0.33.
------- Stratis_Ioannidis (R: 0.01; S: 0.34)
------- Amin_Karbasi (R: 0.00; S: 0.20)
------ Augustin_Chaintreau (R: 0.00; S: 0.20)
---- F: 0.53. R: 0.02. S: 0.26.
----- Marc_Lelarge (R: 0.01; S: 0.35)
----- Mathieu_Leconte (R: 0.00; S: 0.19)
---- Peter_B._Key (R: 0.01; S: 0.28)
---- F: 0.62. R: 0.01. S: 0.28.
----- Anne-Marie_Kermarrec (R: 0.01; S: 0.30)
----- Ayalvadi_J._Ganesh (R: 0.01; S: 0.27)
---- F: 0.36. R: 0.01. S: 0.23.
----- Lennart_Gulikers (R: 0.00; S: 0.22)
----- Milan_Vojnovic (R: 0.00; S: 0.19)
---- Dan-Cristian_Tomozei (R: 0.00; S: 0.17)
--- F: 0.94. R: 0.01. S: 0.21.
---- Hadrien_Hendrikx (R: 0.00; S: 0.21)
---- Francis_Bach (R: 0.00; S: 0.23)

### Jim Roberts query

In [21]:
gismo.rank("James_W._Roberts")

True

Who are the associated authors?

In [22]:
gismo.get_ranked_features()

['James W. Roberts',
 'Thomas Bonald',
 'Maher Hamdi',
 'Sara Oueslati-Boulahia',
 'Ali Ibrahim',
 'Alexandre Proutière',
 'Sara Oueslati',
 'Jorma T. Virtamo',
 'Slim Ben Fredj',
 'Jussi Kangasharju',
 'Keith W. Ross',
 'Brahim Bensaou',
 'Alain Simonian',
 'Nabil Benameur',
 'Abdesselem Kortebi',
 'Zbigniew Dziong',
 'Luca Muscariello',
 'Craig Stevenson',
 'Y. Canetti',
 'Philippe Olivier',
 'Kevin Beanland']

Let's organize them.

In [23]:
gismo.get_clustered_ranked_features()

 F: 0.01. R: 0.28. S: 0.51.
- F: 0.02. R: 0.28. S: 0.51.
-- F: 0.02. R: 0.26. S: 0.48.
--- F: 0.02. R: 0.25. S: 0.47.
---- F: 0.03. R: 0.24. S: 0.46.
----- F: 0.27. R: 0.21. S: 0.42.
------ F: 0.30. R: 0.19. S: 0.97.
------- F: 0.66. R: 0.17. S: 0.99.
-------- James_W._Roberts (R: 0.12; S: 1.00)
-------- Thomas_Bonald (R: 0.05; S: 0.66)
------- Ali_Ibrahim (R: 0.01; S: 0.28)
------- Alexandre_Proutière (R: 0.01; S: 0.29)
------ F: 0.83. R: 0.02. S: 0.32.
------- Sara_Oueslati-Boulahia (R: 0.01; S: 0.32)
------- Slim_Ben_Fredj (R: 0.01; S: 0.32)
------- Nabil_Benameur (R: 0.00; S: 0.27)
----- Maher_Hamdi (R: 0.01; S: 0.29)
----- F: 0.80. R: 0.02. S: 0.28.
------ Sara_Oueslati (R: 0.01; S: 0.28)
------ F: 1.00. R: 0.01. S: 0.23.
------- Abdesselem_Kortebi (R: 0.00; S: 0.23)
------- Luca_Muscariello (R: 0.00; S: 0.23)
---- F: 0.47. R: 0.01. S: 0.26.
----- Jorma_T._Virtamo (R: 0.01; S: 0.23)
----- Alain_Simonian (R: 0.00; S: 0.23)
--- F: 1.00. R: 0.01. S: 0.22.
---- Jussi_Kangasharju (R: 0

### Combined queries

We can input multiple authors.

In [24]:
gismo.rank("Laurent_Massoulié and James_W._Roberts")

True

Now, let's look at the main authors.

In [25]:
gismo.get_ranked_features()

['James W. Roberts',
 'Laurent Massoulié',
 'Thomas Bonald',
 'Nidhi Hegde',
 'Maher Hamdi',
 'Marc Lelarge',
 'Stratis Ioannidis',
 'Alexandre Proutière',
 'Sara Oueslati-Boulahia',
 'Ali Ibrahim',
 'Peter B. Key',
 'Jorma T. Virtamo',
 'Sara Oueslati',
 'Anne-Marie Kermarrec',
 'Ayalvadi J. Ganesh',
 'Slim Ben Fredj']

We see a mix of the co-authors of both. How are they organized?

In [26]:
gismo.get_clustered_ranked_features()

 F: 0.00. R: 0.20. S: 0.53.
- F: 0.03. R: 0.13. S: 0.52.
-- F: 0.04. R: 0.13. S: 0.51.
--- F: 0.10. R: 0.12. S: 0.47.
---- F: 0.31. R: 0.11. S: 0.46.
----- F: 0.66. R: 0.09. S: 0.91.
------ James_W._Roberts (R: 0.06; S: 0.91)
------ Thomas_Bonald (R: 0.03; S: 0.61)
----- F: 0.70. R: 0.01. S: 0.32.
------ Sara_Oueslati-Boulahia (R: 0.00; S: 0.29)
------ Slim_Ben_Fredj (R: 0.00; S: 0.29)
----- Ali_Ibrahim (R: 0.00; S: 0.26)
---- F: 0.42. R: 0.01. S: 0.25.
----- Nidhi_Hegde (R: 0.01; S: 0.20)
----- Alexandre_Proutière (R: 0.01; S: 0.27)
--- Maher_Hamdi (R: 0.01; S: 0.26)
--- Sara_Oueslati (R: 0.00; S: 0.26)
-- Jorma_T._Virtamo (R: 0.00; S: 0.21)
- F: 0.19. R: 0.07. S: 0.32.
-- F: 0.34. R: 0.06. S: 0.37.
--- Laurent_Massoulié (R: 0.05; S: 0.40)
--- Marc_Lelarge (R: 0.01; S: 0.14)
--- Stratis_Ioannidis (R: 0.01; S: 0.14)
-- Peter_B._Key (R: 0.00; S: 0.11)
-- F: 0.62. R: 0.01. S: 0.11.
--- Anne-Marie_Kermarrec (R: 0.00; S: 0.12)
--- Ayalvadi_J._Ganesh (R: 0.00; S: 0.11)


Rough analysis:
- One cluster centered around Jim
- One cluster centered around Laurent
- One subcluster with high focus that relates Nidhi (Laurent's top 10) and Alexandre (Jim's top 10). As expected, they have a strong collaboration history!

## Cross-gismo

To be done.

Don't forget to close the source when you're done (the source keeps the ``dblp.data`` file open).

In [27]:
source.close()
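
If a cell fails midway, the explicit ``close()`` above may never be reached. One possible safeguard is ``contextlib.closing`` from the standard library, which only assumes that the object has a ``close()`` method. The sketch below demonstrates the pattern on ``DummySource``, a hypothetical stand-in class, so that it runs without the DBLP files.

```python
from contextlib import closing


class DummySource:
    """Hypothetical stand-in for a source object that holds a file open."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


dummy = DummySource()
try:
    with closing(dummy):
        # Simulate an error happening while the source is in use.
        raise RuntimeError("simulated failure while exploring")
except RuntimeError:
    pass

print(dummy.closed)  # True: close() ran even though the block raised
```

In the Notebook, the same pattern would read ``with closing(source): ...``, assuming nothing more than the ``close()`` method used above.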