# DBLP exploration

This tutorial shows how to explore DBLP with Gismo.

If you have never used Gismo before, you may want to start with the *Toy example tutorial* or the *ACM* tutorial.

Note: the DBLP database is not small. Recommended requirements to execute this Notebook:
- Fast Internet connection (you will need to download a few hundred MB)
- 4 GB of free space
- 4 GB of RAM (8 GB or more recommended)
- Decent CPU (can take more than one hour on slow CPUs)
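
The free-space requirement can be checked before starting the download. The sketch below uses only the standard library; the helper name `has_free_space` is an illustrative choice and is not part of Gismo.

```python
import shutil
from pathlib import Path


def has_free_space(path, required_gb):
    """Return True if the disk holding `path` has at least `required_gb` GB free."""
    # disk_usage needs an existing path, so walk up to the closest existing parent.
    probe = Path(path).resolve()
    while not probe.exists():
        probe = probe.parent
    free_gb = shutil.disk_usage(probe).free / 1e9
    return free_gb >= required_gb


# Check the current directory against the 4 GB recommendation.
print(has_free_space(".", required_gb=4))
```

Point the first argument at the folder that will hold the DBLP files to test the right disk.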

Here, *documents* are articles in DBLP. The *features* of an article will vary across the tutorial: words from titles first, then author names.

## Initialisation

First, we load the required packages.

In [1]:
import numpy as np
import spacy
from gismo import Corpus, Embedding, CountVectorizer, cosine_similarity, Gismo
from pathlib import Path
from functools import partial

from gismo.datasets.dblp import Dblp
from gismo.filesource import FileSource
from gismo.post_processing import post_features_cluster_print, post_documents_cluster_print

Then, we prepare the DBLP source.

First, we choose the location of the DBLP files. If you want to run this Notebook on your own machine, please change the path below and check that it exists.

In [2]:
path = Path("../../../../../Datasets/DBLP")
path.exists()

True

Next, we build the DBLP files. This only needs to be done the first time or when you want to refresh the database. It takes about 10 minutes on a Surface Pro 4 with a fiber Internet connection.

In [3]:
dblp = Dblp(path=path)
dblp.build()

File ..\..\..\..\..\Datasets\DBLP\dblp.xml.gz already exists. Use refresh option to overwrite.
File ..\..\..\..\..\Datasets\DBLP\dblp.data already exists. Use refresh option to overwrite.


Then, we can load the database as a filesource.

In [4]:
source = FileSource(filename="dblp", path=path)
source[0]

{'type': 'inproceedings',
 'authors': ['Arnon Rosenthal'],
 'title': 'The Future of Classic Data Administration: Objects + Databases + CASE',
 'venue': 'SWEE',
 'year': '1998'}

Each article is a dict with fields ``type``, ``venue``, ``title``, ``year``, and ``authors``. We build a corpus that will tell Gismo that the content of an article is its ``title`` value.

In [5]:
corpus = Corpus(source, to_text=lambda x: x['title'])
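
Since `to_text` is an ordinary function from a record to a string, other views of an article are easy to define. The following sketch applies plain Python to the record shown above; `title_and_venue` is a hypothetical alternative that this tutorial does not use.

```python
# Sample record, copied from the output of source[0] above.
article = {
    "type": "inproceedings",
    "authors": ["Arnon Rosenthal"],
    "title": "The Future of Classic Data Administration: Objects + Databases + CASE",
    "venue": "SWEE",
    "year": "1998",
}


def title_only(record):
    """Same mapping as the lambda passed to Corpus in this tutorial."""
    return record["title"]


def title_and_venue(record):
    """Hypothetical variant: append the venue so it can also become a feature."""
    return f"{record['title']} {record['venue']}"


print(title_only(article))
print(title_and_venue(article))
```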

We build an embedding on top of that corpus.
- We set ``min_df=30`` to exclude rare features;
- We set ``max_df=.02`` to exclude anything present in more than 2% of the corpus;
- We use `spacy` to lemmatize and remove some stopwords; remove `preprocessor=...` from the input if you want to skip this (takes time);
- We add a few manually selected stopwords to fine-tune things;
- We set ``ngram_range=(1, 2)`` to include bi-grams in the embedding.

This will take a few minutes (without spacy) up to a few hours (with spacy enabled). You can save the embedding for later if you want.
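
To make the effect of `min_df`, `max_df` and `ngram_range` concrete, here is a toy sketch in plain Python. It assumes the usual document-frequency semantics (an integer `min_df` is an absolute count, a float `max_df` is a fraction of the corpus) and a simplistic whitespace tokenizer; the helper names and the mini-corpus are made up for the illustration.

```python
from collections import Counter


def ngrams(tokens, ngram_range=(1, 2)):
    """List all n-grams of `tokens` for n in the inclusive range."""
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]


def select_features(docs, min_df=1, max_df=1.0, ngram_range=(1, 2)):
    """Keep n-grams whose document frequency lies within [min_df, max_df * len(docs)]."""
    df = Counter()
    for doc in docs:
        # A set, so that each document counts at most once per n-gram.
        df.update(set(ngrams(doc.lower().split(), ngram_range)))
    limit = max_df * len(docs)
    return sorted(f for f, count in df.items() if min_df <= count <= limit)


docs = [
    "machine learning for networks",
    "machine learning theory",
    "deep learning",
    "graph theory",
]

# "learning" is in 3 of 4 documents (too frequent); most other n-grams are too rare.
print(select_features(docs, min_df=2, max_df=0.5))
```

On this toy corpus only `machine`, `machine learning` and `theory` survive: the thresholds prune both ends of the frequency spectrum.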

In [None]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
keep = {'ADJ', 'NOUN', 'NUM', 'PROPN', 'SYM', 'VERB'}
vectorizer = CountVectorizer(min_df=30, max_df=.02, ngram_range=(1, 2), dtype=float,
                             preprocessor=lambda txt: " ".join([w.lemma_.lower() for w in nlp(txt) 
                                                                if w.pos_ in keep and not w.is_stop]),
                             stop_words=['a', 'about', 'an', 'and', 'for', 'from', 'in', 'of', 'on', 'the', 'with'])

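try:
    # Reuse a previously saved embedding if there is one.
    embedding = Embedding.load(filename="dblp_embedding", path=path)
except Exception:
    # Otherwise compute the embedding and save it for next time.
    embedding = Embedding(vectorizer=vectorizer)
    embedding.fit_transform(corpus)
    embedding.dump(filename="dblp_embedding", path=path)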
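
The cell above follows a load-or-build pattern: try to reload a saved result, and compute it only if that fails. A generic sketch of the same pattern with `pickle` is given below; `load_or_build` is a hypothetical helper, and nothing here claims to reflect how `Embedding.load` and `Embedding.dump` store their data.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(file, build):
    """Return the object pickled in `file`, building and saving it if missing."""
    file = Path(file)
    try:
        with file.open("rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        result = build()
        with file.open("wb") as f:
            pickle.dump(result, f)
        return result


calls = []


def expensive():
    calls.append(1)  # track how many times the build really runs
    return {"features": ["machine", "learning"]}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "toy_embedding.pkl"
    first = load_or_build(cache, expensive)   # builds and saves
    second = load_or_build(cache, expensive)  # loads from disk

print(first == second, len(calls))
```

Catching only `FileNotFoundError` lets other problems, such as a corrupted file, surface instead of silently triggering a long rebuild.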

In [None]:
embedding.x

We see from ``embedding.x`` that the embedding links about 6,200,000 documents to 193,000 features. On average, each document is linked to about 10 features.

Now, we initialize the gismo object and customize the post-processors to ease the display.

In [None]:
gismo = Gismo(corpus, embedding)

In [None]:
def post_article(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{dic['title']} By {authors} ({dic['venue']}, {dic['year']})"
    
gismo.post_documents_item = post_article

def post_title(g, i):
    return g.corpus[i]['title']

def post_meta(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{authors} ({dic['venue']}, {dic['year']})"


gismo.post_documents_cluster = partial(post_documents_cluster_print, post_item=post_title)
gismo.post_features_cluster = post_features_cluster_print
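
`functools.partial` pre-fills some arguments of a function and returns a new callable. The toy printer below is a stand-in written for this example (it is not `post_documents_cluster_print`, whose exact signature is not shown here); it only illustrates how binding `post_item` gives a ready-to-use formatter.

```python
from functools import partial


def print_items(items, post_item=str):
    """Toy stand-in for a display helper: format each item with `post_item`."""
    return [post_item(item) for item in items]


articles = [
    {"title": "A", "year": "1998"},
    {"title": "B", "year": "2004"},
]

# Bind the formatter once; the result is called like a one-argument function.
print_titles = partial(print_items, post_item=lambda a: a["title"])
print_years = partial(print_items, post_item=lambda a: a["year"])

print(print_titles(articles))
print(print_years(articles))
```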

As the dataset is big, we lower the precision of the computation to speed things up a little.

In [None]:
gismo.parameters.n_iter = 2

## Machine Learning (and Covid-19) query

We perform the query *Machine learning*. The returned ``True`` indicates that some of the query features were found among the corpus features.

In [None]:
gismo.rank("Machine Learning")
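
The boolean returned by `rank` can be pictured as a vocabulary lookup. The sketch below illustrates that idea under simplifying assumptions (lowercased whitespace tokens, a toy vocabulary); it is not Gismo's implementation.

```python
def query_found(query, vocabulary):
    """Return True if at least one token of `query` belongs to `vocabulary`."""
    tokens = query.lower().split()
    return any(token in vocabulary for token in tokens)


# Toy vocabulary standing in for the features of an embedding.
vocabulary = {"machine", "learning", "network", "graph"}

print(query_found("Machine Learning", vocabulary))
print(query_found("quantum chromodynamics", vocabulary))
```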

What are the best articles on *Machine Learning*?

In [None]:
gismo.get_documents_by_rank()

OK, this seems to go everywhere. Maybe we can narrow it down with a more specific request.

In [None]:
gismo.rank("Machine Learning and covid-19")

In [None]:
gismo.get_documents_by_rank()

Sounds nice. How are the top-10 articles related? Note: as the graph structure is really sparse on the document side (about 10 features per document), it is best to deactivate the query distortion, which is intended for longer documents.

In [None]:
gismo.parameters.distortion = 0.0
gismo.get_documents_by_cluster(k=10)

Now, let's look at the main keywords.

In [None]:
gismo.get_features_by_rank(20)

Let's organize them.

In [None]:
# On the feature side, the graph is more dense so we can use query distortion
gismo.get_features_by_cluster(distortion=1)

Rough, very broad analysis:
- One big keyword cluster about Coronavirus / Covid-19, pandemic, online learning;
- Machine Learning as a separate small cluster.

In [None]:
np.dot(gismo.embedding.query_projection("Machine learning")[0], gismo.embedding.y)

88,000 articles with an explicit link to machine learning.

In [None]:
np.dot(gismo.embedding.query_projection("Covid-19")[0], gismo.embedding.y)

12,000 articles with an explicit link to covid-19.
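
One way to read the two counts above is as the number of documents that share at least one feature with the query. A toy version of that count follows, assuming documents are represented as sets of features; the data is hypothetical and the code uses plain Python rather than the sparse matrices handled by Gismo.

```python
def count_linked(query_features, documents):
    """Count documents that contain at least one of the query features."""
    wanted = set(query_features)
    return sum(1 for features in documents if wanted & set(features))


# Hypothetical mini-corpus: each document is the set of its features.
documents = [
    {"machine learning", "network"},
    {"covid-19", "machine learning"},
    {"graph", "theory"},
    {"covid-19", "pandemic"},
]

print(count_linked(["machine learning"], documents))
print(count_linked(["covid-19"], documents))
print(count_linked(["machine learning", "covid-19"], documents))
```

The last count is smaller than the sum of the first two because one document matches both features.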

## Authors query

Instead of looking at words, we can explore authors and their collaborations.

We just have to rewire the corpus to output a string of authors.

In [None]:
def to_authors_text(dic):
    return " ".join([a.replace(' ', '_') for a in dic['authors']])
corpus = Corpus(source, to_text=to_authors_text)
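
The underscores matter: they turn each author into a single space-free token. A quick check on a hypothetical two-author record shows the round trip.

```python
def to_authors_text(dic):
    # Same function as in the tutorial: one token per author.
    return " ".join([a.replace(' ', '_') for a in dic['authors']])


# Hypothetical record with two authors.
record = {"authors": ["Laurent Massoulié", "James W. Roberts"]}

text = to_authors_text(record)
print(text)

# Splitting on spaces recovers exactly one token per author.
tokens = text.split(' ')
print(tokens)

# Replacing underscores restores the display names.
restored = [t.replace('_', ' ') for t in tokens]
print(restored)
```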

We can build a new embedding on top of this modified corpus. We tell the vectorizer to be stupid: no preprocessing, and words are simply separated by spaces.

This will take a few minutes (you can save the embedding for later if you want).

In [None]:
vectorizer = CountVectorizer(dtype=float,
                            preprocessor=lambda x:x, tokenizer=lambda x: x.split(' '))
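try:
    # Reuse a previously saved author embedding if there is one.
    a_embedding = Embedding.load(filename="dblp_aut_embedding", path=path)
except Exception:
    # Otherwise compute the embedding and save it for next time.
    a_embedding = Embedding(vectorizer=vectorizer)
    a_embedding.fit_transform(corpus)
    a_embedding.dump(filename="dblp_aut_embedding", path=path)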

In [None]:
a_embedding.x

We now have about 3,200,000 authors to explore. Let's reload gismo and try to play.

In [None]:
gismo = Gismo(corpus, a_embedding)
gismo.post_documents_item = post_article
gismo.post_features_item = lambda g, i: g.embedding.features[i].replace("_", " ")

In [None]:
gismo.post_documents_cluster = partial(post_documents_cluster_print, post_item=post_meta)
gismo.post_features_cluster = post_features_cluster_print

### Laurent Massoulié query

In [None]:
gismo.rank("Laurent_Massoulié")

What are the most central articles of Laurent Massoulié in terms of collaboration?

In [None]:
gismo.get_documents_by_rank(k=10)

We see lots of duplicates. This is not surprising, as many articles can be published first as a research report, then as a conference paper, and last as a journal article. Luckily, Gismo can cover for you.
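
The intent of a covering selection can be illustrated with a simple greedy filter: walk down the ranking and skip any article too similar to one already kept. This is only a toy illustration with made-up data and a Jaccard similarity on author sets; it is not the algorithm used by `get_documents_by_coverage`.

```python
def jaccard(a, b):
    """Jaccard similarity between two non-empty sets (1.0 means identical)."""
    return len(a & b) / len(a | b)


def greedy_cover(ranked, k, threshold=0.5):
    """Keep up to k items, skipping those too similar to an item already kept."""
    kept = []
    for name, authors in ranked:
        if all(jaccard(authors, other) <= threshold for _, other in kept):
            kept.append((name, authors))
        if len(kept) == k:
            break
    return [name for name, _ in kept]


# Hypothetical ranking: the first two entries are the same work published twice.
ranked = [
    ("report", {"alice", "bob"}),
    ("conference version", {"alice", "bob"}),
    ("other paper", {"alice", "carol"}),
    ("solo paper", {"dave"}),
]

print(greedy_cover(ranked, k=3))
```

The duplicate is skipped, so the selection reaches further down the ranking and gains diversity.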

In [None]:
gismo.get_documents_by_coverage(k=10)

Hmm, not working well. The reason here is query distortion. Query distortion is a Gismo feature that modulates the clustering with the query. Sadly, when features are authors, the underlying graph has a very specific structure (highly sparse and redundant) that makes query distortion *too* effective. The solution is to deactivate it.

In [None]:
gismo.parameters.distortion = 0
gismo.get_documents_by_coverage(k=10)

Much better. No duplicates and more diversity in the results. Let's observe the communities.

In [None]:
gismo.get_documents_by_cluster(k=20, resolution=.9)

OK! We see that the articles are organized by writing communities. Also note how Gismo managed to organize a hierarchical grouping of the communities.

Now, let's look in terms of authors. This is actually the interesting part when studying collaborations.

In [None]:
gismo.get_features_by_rank()

We see many authors that were not present in the articles listed above. This is an important observation: central articles (with respect to a query) are not necessarily written by central authors!

Let's organize them into communities.

In [None]:
gismo.get_features_by_cluster(resolution=.6)

### Jim Roberts query

In [None]:
gismo.rank("James_W._Roberts")

Let's have a covering set of articles.

In [None]:
gismo.get_documents_by_coverage(k=10)

Who are the associated authors?

In [None]:
gismo.get_features_by_rank(k=10)

Let's organize them.

In [None]:
gismo.get_features_by_cluster(k=10, resolution=.4)

### Combined queries

We can input multiple authors.

In [None]:
gismo.rank("Laurent_Massoulié and James_W._Roberts")

Let's have a covering set of articles.

In [None]:
gismo.get_documents_by_coverage(k=10)

Note that we get here only articles by Roberts, yet the articles returned have slightly changed.

Now, let's look at the main authors.

In [None]:
gismo.get_features_by_rank()

We see a mix of both co-authors. How are they organized?

In [None]:
gismo.get_features_by_cluster(resolution=.4)

## Cross-gismo

Gismo can combine two embeddings to create one hybrid gismo. This is called a cross-gismo (XGismo). This feature can be used to analyze authors with respect to the words they use (and vice versa).
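
Conceptually, crossing the two embeddings amounts to linking authors to words through the articles they share. The sketch below illustrates that idea on made-up data with plain dictionaries; it is a simplification and does not reproduce how XGismo actually weights or normalizes the links.

```python
from collections import Counter, defaultdict

# Hypothetical articles: each has authors (first embedding) and words (second one).
articles = [
    {"authors": ["alice", "bob"], "words": ["self-stabilization", "graph"]},
    {"authors": ["alice"], "words": ["graph", "ranking"]},
    {"authors": ["carol"], "words": ["ranking"]},
]


def cross(articles):
    """Map each author to a count of the words used in their articles."""
    author_words = defaultdict(Counter)
    for article in articles:
        for author in article["authors"]:
            author_words[author].update(article["words"])
    return author_words


links = cross(articles)

# Words associated with one author, most frequent first.
print(links["alice"].most_common())

# Authors associated with one word.
print(sorted(a for a, words in links.items() if "ranking" in words))
```

In this picture, the articles only serve as an intermediate step: the result links authors and words directly, in both directions.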

In [None]:
from gismo.gismo import XGismo
gismo = XGismo(x_embedding=a_embedding, y_embedding=embedding)
gismo.diteration.n_iter = 2 # to speed up a little bit computation time

Note that XGismo does not use the underlying corpus, so we can now close the source (the source keeps the file ``dblp.data`` open).

In [None]:
source.close()

In [None]:
gismo.post_documents_item = lambda g, i: g.corpus[i].replace("_", " ")
gismo.post_features_cluster = post_features_cluster_print
gismo.post_documents_cluster = post_documents_cluster_print

Let's try a request.

In [None]:
gismo.rank("self-stabilization")

What are the associated keywords?

In [None]:
gismo.get_features_by_rank(k=10)

How are keywords structured?

In [None]:
gismo.get_features_by_cluster(k=20, resolution=.8)

Who are the associated researchers?

In [None]:
gismo.get_documents_by_rank(k=10)

How are they structured?

In [None]:
gismo.get_documents_by_cluster(k=10, resolution=.9)

We can also query researchers. Just use underscores in the query and add `y=False` to indicate that the input is *documents*.

In [None]:
gismo.rank("Sébastien_Tixeuil and Fabien_Mathieu", y=False)

What are the associated keywords?

In [None]:
gismo.get_features_by_rank(k=10)

Using covering can yield other keywords of interest.

In [None]:
gismo.get_features_by_coverage(k=10)

How are keywords structured?

In [None]:
gismo.get_features_by_cluster(k=20, resolution=.7)

Who are the associated researchers?

In [None]:
gismo.get_documents_by_rank(k=10)

How are they structured?

In [None]:
gismo.get_documents_by_cluster(k=10, resolution=.8)