# DBLP exploration

This tutorial shows how to explore DBLP with Gismo.

If you have never used Gismo before, you may want to start with the *Toy example tutorial* or the *ACM* tutorial.

Note: the DBLP database is not small. Recommended requirements to execute this Notebook:
- Fast Internet connection (you will need to download a few hundred MB)
- 4 GB of free space
- 4 GB of RAM (8 GB or more recommended)
- Decent CPU (can take more than one hour on slow CPUs)

Here, *documents* are articles in DBLP. The *features* of an article vary from one section to the next: first the words of its title, then its authors.

## Initialisation

First, we load the required packages.

In [None]:
import numpy as np
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from pathlib import Path

from gismo.datasets.dblp import Dblp
from gismo.filesource import FileSource
from gismo.corpus import Corpus
from gismo.embedding import Embedding
from gismo.gismo import Gismo
from gismo.post_processing import post_features_cluster_print

Then, we prepare the DBLP source.

First, we choose the location of the DBLP files. If you want to run this Notebook on your own machine, please change the path below and check that it exists.

In [2]:
path = Path("../../../../../Datasets/DBLP")
path.exists()

True

Next, we build the DBLP files. This only needs to be done the first time, or when you want to refresh the database. It takes about 10 minutes on a Surface Pro 4 with a fiber Internet connection.

In [3]:
dblp = Dblp(path=path)
dblp.build()

File ..\..\..\..\..\Datasets\DBLP\dblp.xml.gz already exists. Use refresh option to overwrite.
File ..\..\..\..\..\Datasets\DBLP\dblp.data already exists. Use refresh option to overwrite.


Then, we can load the database as a filesource.

In [4]:
source = FileSource(filename="dblp", path=path)
source[0]

{'type': 'article',
 'title': 'Spectre Attacks: Exploiting Speculative Execution.',
 'venue': 'meltdownattack.com',
 'year': '2018',
 'authors': ['Paul Kocher',
  'Daniel Genkin',
  'Daniel Gruss',
  'Werner Haas',
  'Mike Hamburg',
  'Moritz Lipp',
  'Stefan Mangard',
  'Thomas Prescher 0002',
  'Michael Schwarz 0001',
  'Yuval Yarom']}

Each article is a dict with fields ``type``, ``venue``, ``title``, ``year``, and ``authors``. We build a corpus that will tell Gismo that the content of an article is its ``title`` value.

In [5]:
corpus = Corpus(source, to_text=lambda x: x['title'])
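To make the role of ``to_text`` concrete, here is a minimal sketch in plain Python that does not use Gismo at all. It applies the same lambda to two records shaped like ``source[0]``. The second record is made up for illustration, and the author list of the first one is shortened.

```python
# Records shaped like the entries returned by the DBLP source.
records = [
    {"type": "article",
     "title": "Spectre Attacks: Exploiting Speculative Execution.",
     "venue": "meltdownattack.com",
     "year": "2018",
     "authors": ["Paul Kocher", "Daniel Genkin"]},  # author list shortened
    {"type": "article",  # hypothetical record, for illustration only
     "title": "A Toy Title.",
     "venue": "Toy Venue",
     "year": "2020",
     "authors": ["Jane Doe"]},
]

# Same extractor as the one passed to Corpus above.
to_text = lambda x: x["title"]

# Only the title of each record is kept as its textual content.
texts = [to_text(r) for r in records]
print(texts)
```

Swapping the lambda is all it takes to change what counts as the content of a document, which is what the authors section does later on.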

We build an embedding on top of that corpus.
- We set ``min_df=30`` to exclude rare features;
- We set ``max_df=.02`` to exclude anything present in more than 2% of the corpus;
- We use `spacy` to lemmatize and remove some stopwords; remove `preprocessor=...` from the input if you want to skip this step (it takes time);
- We add a few manually selected stopwords to fine-tune things;
- We set ``ngram_range=[1, 2]`` to include bi-grams in the embedding.
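The effect of ``min_df`` and ``max_df`` can be illustrated without scikit-learn. The sketch below is a plain-Python approximation of document-frequency filtering, with ``min_df`` read as an absolute count and ``max_df`` as a proportion, as in the values listed above. The function name and the toy titles are hypothetical, and this is not how ``CountVectorizer`` is implemented internally.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df, max_df):
    """Keep tokens whose document frequency is in [min_df, max_df * len(docs)].

    min_df is an absolute count and max_df a proportion of the corpus.
    Plain-Python illustration only.
    """
    df = Counter()
    for doc in docs:
        # set() so that a token is counted at most once per document
        df.update(set(doc.lower().split()))
    max_count = max_df * len(docs)
    return sorted(t for t, c in df.items() if min_df <= c <= max_count)


# Hypothetical toy titles
toy_docs = [
    "learning graphs",
    "learning trees",
    "learning graphs fast",
    "learning rare",
]

# 'learning' is too frequent (4 of 4 documents); 'trees', 'fast', 'rare' are too rare.
kept = filter_by_document_frequency(toy_docs, min_df=2, max_df=0.75)
print(kept)
```

On the real corpus the same logic removes features seen in fewer than 30 titles or in more than 2% of them.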

This will take a few minutes (without spacy) up to a few hours (with spacy enabled). You can save the embedding for later if you want.

In [6]:
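# Note: spaCy 3 and later no longer accept the 'en' shortcut;
# use a full model name such as 'en_core_web_sm' there.
nlp = spacy.load('en', disable=['parser', 'ner'])
# Part-of-speech tags to keep; tokens with any other tag are dropped by the preprocessor
keep = {'ADJ', 'NOUN', 'NUM', 'PROPN', 'SYM', 'VERB'}
vectorizer = CountVectorizer(min_df=30, max_df=.02, ngram_range=[1, 2], dtype=float,
                             preprocessor=lambda txt: " ".join([w.lemma_.lower() for w in nlp(txt)
                                                                if w.pos_ in keep and not w.is_stop]),
                             stop_words=['a', 'about', 'an', 'and', 'for', 'from', 'in', 'of', 'on', 'the', 'with'])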
embedding = Embedding(vectorizer=vectorizer)

try:
    embedding.load(filename="dblp_embedding", path=path)
except:
    embedding.fit_transform(corpus)
    embedding.save(filename="dblp_embedding", path=path)
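The cell above follows a load-or-build pattern: try to read a saved result, and compute and save it only if that fails. Here is a generic sketch of the same pattern using ``pickle`` from the standard library. The helper name ``load_or_build`` and the toy payload are hypothetical. The sketch catches ``FileNotFoundError`` specifically, which is a design choice for the sketch and not a statement about which exception ``embedding.load`` raises.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_file, build):
    """Return the cached object if cache_file exists, else build it and cache it."""
    cache_file = Path(cache_file)
    try:
        with cache_file.open("rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        result = build()
        with cache_file.open("wb") as f:
            pickle.dump(result, f)
        return result


calls = []


def expensive_build():
    calls.append(1)  # record each real computation
    return {"vocabulary": ["graph", "learning"]}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "toy_embedding.pkl"
    first = load_or_build(cache, expensive_build)   # builds and saves
    second = load_or_build(cache, expensive_build)  # loads from disk

print(first == second, len(calls))
```

Catching a narrow exception keeps unrelated errors (a typo, a memory error) from silently triggering a multi-hour rebuild.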

In [7]:
embedding.x

<4889758x149361 sparse matrix of type '<class 'numpy.float64'>'
	with 45847196 stored elements in Compressed Sparse Row format>

We see from ``embedding.x`` that the embedding links about 5,000,000 documents to 150,000 features. On average, each document is linked to about 10 features.
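These orders of magnitude can be checked directly from the matrix summary printed above. The short computation below only uses the three numbers shown in that output; the variable names are chosen for illustration.

```python
# Figures copied from the repr of embedding.x above
n_documents = 4889758
n_features = 149361
n_stored = 45847196

# Average number of stored (non-zero) features per document
avg_features_per_document = n_stored / n_documents

# Fraction of the document x feature matrix that is actually stored
density = n_stored / (n_documents * n_features)

print(f"{avg_features_per_document:.2f} features per document on average")
print(f"density: {density:.2e}")
```

The very low density is why a sparse format is used for the embedding.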

Now, we initialize the Gismo object and customize the post-processors to ease the display.

In [8]:
gismo = Gismo(corpus, embedding)

In [29]:
def post_article(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{dic['title']} By {authors} ({dic['venue']}, {dic['year']})"
    
gismo.post_documents_item = post_article


def get_sim(csr, arr):
    # Cosine similarity between a one-row sparse matrix and a dense vector
    return csr.dot(arr)[0] / np.linalg.norm(csr.data) / np.linalg.norm(arr)


def post_documents_cluster_print(gismo, cluster, depth=""):
    sim = get_sim(cluster.vector, gismo.diteration.y_relevance)
    if len(cluster.children) == 0:
        txt = gismo.corpus[cluster.indice]["title"]
        print(f"{depth} {txt} "
              f"(R: {gismo.diteration.x_relevance[cluster.indice]:.2f}; "
              f"S: {sim:.2f})")
    else:
        print(f"{depth} F: {cluster.focus:.2f}. "
              f"R: {sum(gismo.diteration.x_relevance[cluster.members]):.2f}. "
              f"S: {sim:.2f}.")
    for c in cluster.children:
        post_documents_cluster_print(gismo, c, depth=depth + '-')
gismo.post_documents_cluster = post_documents_cluster_print
gismo.post_features_cluster = post_features_cluster_print
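The ``get_sim`` helper above is a cosine similarity: a dot product divided by the two norms. Because one of the vectors is sparse, only its stored entries contribute to the dot product and to its norm. The plain-Python sketch below shows the same computation with a dictionary standing in for the sparse row; the function name and the toy vectors are hypothetical.

```python
import math


def sparse_dense_cosine(sparse_row, dense):
    """Cosine similarity between a sparse vector {index: value} and a dense list."""
    # Only the stored entries of the sparse vector contribute to the dot product
    dot = sum(v * dense[i] for i, v in sparse_row.items())
    norm_sparse = math.sqrt(sum(v * v for v in sparse_row.values()))
    norm_dense = math.sqrt(sum(v * v for v in dense))
    return dot / (norm_sparse * norm_dense)


dense = [3.0, 0.0, 4.0]

# Same direction as the dense vector
same_direction = sparse_dense_cosine({0: 3.0, 2: 4.0}, dense)

# Non-zero only where the dense vector is zero
orthogonal = sparse_dense_cosine({1: 1.0}, dense)

print(same_direction, orthogonal)
```

In the cluster display, this value is the ``S`` score: how well a cluster vector is aligned with the relevance vector of the query.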

As the dataset is big, we lower the precision of the computation to speed up things a little bit.

In [30]:
gismo.parameters.n_iter = 2

## Machine Learning (and Covid-19) query

We perform the query *Machine learning*. The returned ``True`` indicates that some of the query features were found among the features of the corpus.

In [31]:
gismo.rank("Machine Learning")

True

What are the best articles on *Machine Learning*?

In [32]:
gismo.get_documents_by_rank()

['Can Machine Learning Model with Static Features be Fooled: an Adversarial Machine Learning Approach. By Rahim Taheri, Reza Javidan, Mohammad Shojafar, Vinod P, Mauro Conti (CoRR, 2019)',
 'How Developers Iterate on Machine Learning Workflows - A Survey of the Applied Machine Learning Literature. By Doris Xin, Litian Ma, Shuchen Song, Aditya G. Parameswaran (CoRR, 2018)',
 'Informed Machine Learning - Towards a Taxonomy of Explicit Integration of Knowledge into Machine Learning. By Laura von Rüden, Sebastian Mayer, Jochen Garcke, Christian Bauckhage, Jannis Schücker (CoRR, 2019)',
 'Three Differential Emotion Classification by Machine Learning Algorithms using Physiological Signals - Discriminantion of Emotions by Machine Learning Algorithms. By Eun-Hye Jang, Byoung-Jun Park, Sang-Hyeob Kim, Jin-Hun Sohn (ICAART (1), 2012)',
 'When Lempel-Ziv-Welch Meets Machine Learning: A Case Study of Accelerating Machine Learning using Coding. By Fengan Li, Lingjiao Chen, Arun Kumar 0001, Jeffrey 

OK, the results go in every direction. Maybe we can narrow things down with a more specific query.

In [33]:
gismo.rank("Machine Learning and covid-19")

True

In [34]:
gismo.get_documents_by_rank()

['GSA-DenseNet121-COVID-19: a Hybrid Deep Learning Architecture for the Diagnosis of COVID-19 Disease based on Gravitational Search Optimization Algorithm. By Dalia Ezzat, Aboul Ella Hassanien, Hassan Aboul Ella (CoRR, 2020)',
 'NAIST COVID: Multilingual COVID-19 Twitter and Weibo Dataset. By Zhiwei Gao, Shuntaro Yada, Shoko Wakamiya, Eiji Aramaki (CoRR, 2020)',
 'COVID-MobileXpert: On-Device COVID-19 Screening using Snapshots of Chest X-Ray. By Xin Li, Chengyin Li, Dongxiao Zhu (CoRR, 2020)',
 'Deep-COVID: Predicting COVID-19 From Chest X-Ray Images Using Deep Transfer Learning. By Shervin Minaee, Rahele Kafieh, Milan Sonka, Shakib Yazdani, Ghazaleh Jamalipour Soufi (CoRR, 2020)',
 'COVID-CAPS: A Capsule Network-based Framework for Identification of COVID-19 cases from X-ray Images. By Parnian Afshar, Shahin Heidarian, Farnoosh Naderkhani, Anastasia Oikonomou, Konstantinos N. Plataniotis, Arash Mohammadi 0001 (CoRR, 2020)',
 'COVID-Net: A Tailored Deep Convolutional Neural Network Des

Sounds nice. How are the top-10 articles related? Note: as the graph structure is really sparse on the document side (about 10 features per document), it is best to deactivate query distortion, which is intended for longer documents.

In [35]:
gismo.parameters.distortion = 0.0
gismo.get_documents_by_cluster(k=10)

 F: 0.56. R: 0.02. S: 0.93.
- F: 0.57. R: 0.02. S: 0.92.
-- GSA-DenseNet121-COVID-19: a Hybrid Deep Learning Architecture for the Diagnosis of COVID-19 Disease based on Gravitational Search Optimization Algorithm. (R: 0.00; S: 0.64)
-- F: 0.66. R: 0.01. S: 0.91.
--- F: 0.66. R: 0.01. S: 0.90.
---- NAIST COVID: Multilingual COVID-19 Twitter and Weibo Dataset. (R: 0.00; S: 0.69)
---- CORD-19: The Covid-19 Open Research Dataset. (R: 0.00; S: 0.76)
---- A SIDARTHE Model of COVID-19 Epidemic in Italy. (R: 0.00; S: 0.79)
---- A First Instagram Dataset on COVID-19. (R: 0.00; S: 0.80)
--- F: 0.68. R: 0.00. S: 0.73.
---- COVID-MobileXpert: On-Device COVID-19 Screening using Snapshots of Chest X-Ray. (R: 0.00; S: 0.70)
---- Deep-COVID: Predicting COVID-19 From Chest X-Ray Images Using Deep Transfer Learning. (R: 0.00; S: 0.63)
-- COVID-CAPS: A Capsule Network-based Framework for Identification of COVID-19 cases from X-ray Images. (R: 0.00; S: 0.60)
-- COVID-Net: A Tailored Deep Convolutional Neu

OK! Let's decode this. We have:
- Lone wolves on gravitational search (first article) and serious gaming (last article)
- A large cluster on applications of ML to X-Ray / CT scan processing
- A cluster on datasets and social networks
- Note that it's not perfect: the social network cluster has been nested inside X-Ray, although a manual classification would probably have put it nearby instead.
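When reading such a tree, the number of leading dashes gives the depth of a node. Lines starting with ``F:`` are internal nodes (focus, total relevance, similarity), and the other lines are articles. The sketch below extracts the depth from a few lines copied from the output above; the helper name is hypothetical and the parsing only relies on the printed format.

```python
def parse_cluster_line(line):
    """Split a printed line into (depth, text); depth = number of leading dashes."""
    stripped = line.lstrip("-")
    return len(line) - len(stripped), stripped.strip()


# Lines copied from the clustering output above
lines = [
    " F: 0.56. R: 0.02. S: 0.93.",
    "- F: 0.57. R: 0.02. S: 0.92.",
    "-- F: 0.66. R: 0.01. S: 0.91.",
    "--- F: 0.66. R: 0.01. S: 0.90.",
    "---- CORD-19: The Covid-19 Open Research Dataset. (R: 0.00; S: 0.76)",
]

parsed = [parse_cluster_line(l) for l in lines]
for depth, text in parsed:
    kind = "cluster" if text.startswith("F:") else "article"
    print(depth, kind, text)
```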

Now, let's look at the main keywords.

In [36]:
gismo.get_features_by_rank(20)

['19',
 'covid',
 'covid 19',
 'machine',
 'machine learning',
 'pandemic',
 'chest',
 'chest ray',
 'coronavirus',
 'ray',
 'dataset',
 'deep',
 'ct',
 'epidemic',
 'ray image',
 'social',
 'learn',
 'twitter',
 'infection',
 'spread']

Let's organize them.

In [38]:
# On the feature side, the graph is denser, so we can use query distortion
gismo.get_features_by_cluster(distortion=1)

 F: 0.15. R: 0.07. S: 0.93.
- F: 0.26. R: 0.06. S: 0.93.
-- F: 0.29. R: 0.06. S: 0.97.
--- F: 0.97. R: 0.05. S: 0.98.
---- 19 (R: 0.02; S: 0.99)
---- covid (R: 0.02; S: 0.97)
---- covid 19 (R: 0.02; S: 0.98)
--- pandemic (R: 0.00; S: 0.32)
--- F: 0.86. R: 0.00. S: 0.30.
---- chest (R: 0.00; S: 0.32)
---- chest ray (R: 0.00; S: 0.28)
---- ray (R: 0.00; S: 0.33)
--- dataset (R: 0.00; S: 0.28)
-- coronavirus (R: 0.00; S: 0.27)
- F: 0.79. R: 0.01. S: 0.14.
-- machine (R: 0.00; S: 0.16)
-- machine learning (R: 0.00; S: 0.12)


Rough, very broad analysis:
- One big keyword cluster about Coronavirus / Covid-19;
- Chest X-ray as a sub-cluster;
- Machine Learning as a separate small cluster.

In [None]:
np.dot(gismo.embedding.query_projection("Machine learning")[0], gismo.embedding.y)

About 50,000 articles have an explicit link to machine learning.

In [None]:
np.dot(gismo.embedding.query_projection("Covid-19")[0], gismo.embedding.y)

About 800 articles have an explicit link to covid-19.
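One way to read these counts is that a document has an explicit link to a query when it shares at least one feature with it. The sketch below illustrates that reading on a hypothetical toy corpus in plain Python. It is an assumption about what is being counted, not a description of how ``query_projection`` works.

```python
# Hypothetical toy corpus: each document is reduced to its set of features.
doc_features = [
    {"machine", "learning", "machine learning"},
    {"covid", "19", "covid 19", "dataset"},
    {"graph", "theory"},
    {"covid", "machine", "learning"},
]


def count_linked_documents(query_features, docs):
    """Number of documents sharing at least one feature with the query."""
    return sum(1 for features in docs if features & query_features)


n_ml = count_linked_documents({"machine", "learning", "machine learning"}, doc_features)
n_covid = count_linked_documents({"covid", "19", "covid 19"}, doc_features)
print(n_ml, n_covid)
```

Documents without any shared feature can still be ranked by Gismo through indirect links, which is why the ranked results are not limited to these explicit matches.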

## Authors query

Instead of looking at words, we can explore authors and their collaborations.

We just have to rewire the corpus to output a string of authors.

In [None]:
def to_authors_text(dic):
    return " ".join([a.replace(' ', '_') for a in dic['authors']])
corpus = Corpus(source, to_text=to_authors_text)

We can build a new embedding on top of this modified corpus. We tell the vectorizer to keep things simple: no preprocessing, and words are separated by spaces.
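The underscores matter because the tokenizer splits on spaces. The sketch below replays ``to_authors_text`` and the split on a shortened record in plain Python, and compares it with what would happen without the underscores.

```python
def to_authors_text(dic):
    # Same transformation as above: one space-free token per author
    return " ".join([a.replace(" ", "_") for a in dic["authors"]])


# Two authors taken from the first DBLP record shown earlier
record = {"authors": ["Paul Kocher", "Thomas Prescher 0002"]}

text = to_authors_text(record)
tokens = text.split(" ")  # what the tokenizer above does

# Without underscores, first names, last names and homonym ids become separate tokens
naive_tokens = " ".join(record["authors"]).split(" ")

print(tokens)
print(naive_tokens)
```

With underscores, each author is exactly one feature of the embedding.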

This will take a few minutes (you can save the embedding for later if you want).

In [None]:
vectorizer = CountVectorizer(dtype=float,
                            preprocessor=lambda x:x, tokenizer=lambda x: x.split(' '))
a_embedding = Embedding(vectorizer=vectorizer)
try:
    a_embedding.load(filename="dblp_aut_embedding", path=path)
except:
    a_embedding.fit_transform(corpus)
    a_embedding.save(filename="dblp_aut_embedding", path=path)

In [None]:
a_embedding.x

We now have about 2,500,000 authors to explore. Let's reload gismo and try to play.

In [None]:
gismo = Gismo(corpus, a_embedding)
gismo.post_documents_item = post_article
gismo.post_features_item = lambda g, i: g.embedding.features[i].replace("_", " ")

def print_document_cluster(gismo, cluster, depth=""):
    if len(cluster.children) == 0:
        # Leaf: print the authors, venue and year of the article
        dic = gismo.corpus[cluster.indice]
        authors = ", ".join(dic['authors'])
        print(f"{depth} {authors} ({dic['venue']}, {dic['year']}) ")
    else:
        # Internal node: print focus, total relevance and similarity to the query
        sim = cosine_similarity(cluster.vector, gismo.diteration.y_relevance.reshape(1, -1))[0][0]
        print(f"{depth} F: {cluster.focus:.2f}. "
              f"R: {sum(gismo.diteration.x_relevance[cluster.members]):.2f}. "
              f"S: {sim:.2f}.")
    for c in cluster.children:
        print_document_cluster(gismo, c, depth=depth + '-')


gismo.post_documents_cluster = print_document_cluster
gismo.post_features_cluster = post_features_cluster_print

### Laurent Massoulié query

In [None]:
gismo.rank("Laurent_Massoulié")

What are the most central articles of Laurent Massoulié in terms of collaboration?

In [None]:
gismo.get_documents_by_rank(k=10)

We see lots of duplicates. This is not surprising, as many articles are published first as a research report, then as a conference paper, and finally as a journal article. Luckily, Gismo can cover for you.

In [None]:
gismo.get_documents_by_coverage(k=10)

Hmm, this is not working well. The reason here is query distortion. Query distortion is a Gismo feature that modulates the clustering with the query. Sadly, when features are authors, the underlying graph has a very specific structure (highly sparse and redundant) that makes query distortion *too* effective. The solution is to deactivate it.

In [None]:
gismo.parameters.distortion = 0
gismo.get_documents_by_coverage(k=10)

Much better. No duplicates and more diversity in the results. Let's observe the communities.
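To give an intuition of why a coverage-based selection removes duplicates, here is a generic greedy sketch: at each step, pick the item that adds the most features not yet covered. This is a textbook heuristic written for illustration, with hypothetical names and toy data. It is not Gismo's coverage algorithm, whose details are not described in this tutorial.

```python
def greedy_cover(ranked_items, k):
    """Pick up to k items, each time taking the one adding the most unseen features.

    ranked_items: list of (name, set_of_features), best-ranked first.
    Ties are broken in favour of the better-ranked item.
    """
    covered, selected = set(), []
    remaining = list(ranked_items)
    while remaining and len(selected) < k:
        # max() returns the first maximal element, i.e. the best-ranked one
        best = max(remaining, key=lambda item: len(item[1] - covered))
        if not best[1] - covered:
            break  # nothing new left to cover
        selected.append(best[0])
        covered |= best[1]
        remaining.remove(best)
    return selected


# Hypothetical example: three versions of the same paper, then a different one
ranked = [
    ("report", {"A", "B"}),
    ("conference", {"A", "B"}),
    ("journal", {"A", "B"}),
    ("other", {"A", "C"}),
]
selection = greedy_cover(ranked, k=2)
print(selection)
```

A pure ranking would return the three versions of the same paper first; the greedy selection skips the redundant ones because they bring no new author.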

In [None]:
gismo.get_documents_by_cluster(k=20, resolution=.9)

OK! We see that the articles are organized by writing communities. Also note how Gismo managed to organize a hierarchical grouping of the communities.

Now, let's look in terms of authors. This is actually the interesting part when studying collaborations.

In [None]:
gismo.get_features_by_rank()

We see many authors that were not present in the articles listed above. This is an important observation: central articles (with respect to a query) are not necessarily written by central authors!

Let's organize them into communities.

In [None]:
gismo.get_features_by_cluster(resolution=.6)

### Jim Roberts query

In [None]:
gismo.rank("James_W._Roberts")

Let's have a covering set of articles.

In [None]:
gismo.get_documents_by_coverage(k=10)

Who are the associated authors?

In [None]:
gismo.get_features_by_rank(k=10)

Let's organize them.

In [None]:
gismo.get_features_by_cluster(k=10, resolution=.4)

### Combined queries

We can input multiple authors.

In [None]:
gismo.rank("Laurent_Massoulié and James_W._Roberts")
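The query string above can be assembled from plain names. The helper below is a hypothetical convenience function, not part of Gismo; it only reproduces the underscore convention used when building the authors corpus and joins the names with ``and`` as in the cell above.

```python
def author_query(names):
    """Turn human-readable names into the underscore form, joined by ' and '."""
    return " and ".join(name.replace(" ", "_") for name in names)


query = author_query(["Laurent Massoulié", "James W. Roberts"])
print(query)
```

The names must match the DBLP spelling exactly (including initials and accents), since each author is a single feature.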

Let's have a covering set of articles.

In [None]:
gismo.get_documents_by_coverage(k=10)

Note that we only get articles by Roberts here, yet the returned articles have slightly changed.

Now, let's look at the main authors.

In [None]:
gismo.get_features_by_rank()

We see a mix of both co-authors. How are they organized?

In [None]:
gismo.get_features_by_cluster(resolution=.4)

## Cross-gismo

Gismo can combine two embeddings to create one hybrid gismo. This is called a cross-gismo (XGismo). This feature can be used to analyze authors with respect to the words they use (and vice versa).
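Conceptually, the two embeddings describe the same documents through two different lenses (authors and words), so authors can be linked to words through the documents they share. The plain-Python sketch below illustrates that idea on hypothetical toy data. It is a conceptual illustration only and does not reflect how XGismo is implemented.

```python
from collections import Counter

# Hypothetical toy data: the same three documents seen through two embeddings
doc_authors = [{"Alice"}, {"Alice", "Bob"}, {"Carol"}]
doc_words = [{"routing"}, {"routing", "stabilization"}, {"learning"}]


def author_word_links(doc_authors, doc_words):
    """Count, for each author, the words of the documents they signed."""
    links = {}
    for authors, words in zip(doc_authors, doc_words):
        for author in authors:
            links.setdefault(author, Counter()).update(words)
    return links


links = author_word_links(doc_authors, doc_words)
for author in sorted(links):
    print(author, dict(links[author]))
```

In the resulting structure, authors play the role of documents and words play the role of features, which matches how the cross-gismo is queried below.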

In [None]:
from gismo.gismo import XGismo
gismo = XGismo(x_embedding=a_embedding, y_embedding=embedding)
gismo.parameters.n_iter = 2  # fewer iterations, to speed up computation a little

Note that XGismo does not use the underlying corpus, so we can now close the source (the source keeps the ``dblp.data`` file open).

In [None]:
source.close()
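If you want a resource to be closed even when a cell raises an error, ``contextlib.closing`` from the standard library works with any object that exposes a ``close()`` method. The sketch below uses a hypothetical ``ToySource`` class as a stand-in; applying the same pattern to the DBLP source only relies on the ``close()`` call shown above.

```python
from contextlib import closing


class ToySource:
    """Hypothetical stand-in for an object that keeps a file open."""

    def __init__(self):
        self.closed = False

    def __getitem__(self, i):
        if self.closed:
            raise ValueError("source is closed")
        return {"title": f"document {i}"}

    def close(self):
        self.closed = True


toy = ToySource()
with closing(toy) as src:
    # The source is usable inside the block...
    first_title = src[0]["title"]

# ...and closed automatically when the block exits
print(first_title, toy.closed)
```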

In [None]:
# In the cross-gismo, documents are authors: display them without underscores
gismo.post_documents_item = lambda g, i: g.corpus[i].replace("_", " ")
gismo.post_features_cluster = post_features_cluster_print


def print_document_cluster(gismo, cluster, depth=""):
    sim = cosine_similarity(cluster.vector, gismo.diteration.y_relevance.reshape(1, -1))[0][0]
    if len(cluster.children) == 0:
        txt = gismo.corpus[cluster.indice].replace("_", " ")
        print(f"{depth} {txt} "
              f"(R: {gismo.diteration.x_relevance[cluster.indice]:.2f}; "
              f"S: {sim:.2f})")
    else:
        print(f"{depth} F: {cluster.focus:.2f}. "
              f"R: {sum(gismo.diteration.x_relevance[cluster.members]):.2f}. "
              f"S: {sim:.2f}.")
    for c in cluster.children:
        print_document_cluster(gismo, c, depth=depth + '-')


gismo.post_documents_cluster = print_document_cluster

Let's try a request.

In [None]:
gismo.rank("self-stabilization")

What are the associated keywords?

In [None]:
gismo.get_features_by_rank(k=10)

How are keywords structured?

In [None]:
gismo.get_features_by_cluster(k=20, resolution=.8)

Who are the associated researchers?

In [None]:
gismo.get_documents_by_rank(k=10)

How are they structured?

In [None]:
gismo.get_documents_by_cluster(k=10, resolution=.9)

We can also query researchers. Just use underscores in the query and add `y=False` to indicate that the input is *documents*.

In [None]:
gismo.rank("Sébastien_Tixeuil and Fabien_Mathieu", y=False)

What are the associated keywords?

In [None]:
gismo.get_features_by_rank(k=10)

Using covering can yield other keywords of interest.

In [None]:
gismo.get_features_by_coverage(k=10)

How are keywords structured?

In [None]:
gismo.get_features_by_cluster(k=20, resolution=.7)

Who are the associated researchers?

In [None]:
gismo.get_documents_by_rank(k=10)

How are they structured?

In [None]:
gismo.get_documents_by_cluster(k=10, resolution=.8)