# Data preparation

This notebook creates the objects used in the paper *Once upon a time in Algotel*. After a few settings (mostly choosing where the data will be stored and checking your packages), you only need to execute it once.

## Packages

You need ``Gismo>=0.4.1`` for the Notebook to work.

In [1]:
import gismo
gismo.__version__

'0.4.1'

If you don't have Gismo, you can install it with pip (``pip install gismo``) or from the sources at https://github.com/balouf/gismo

If you have an older version of Gismo, upgrading is strongly recommended (``pip install gismo -U``).
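
The version requirement can also be checked programmatically. The following is a minimal sketch that assumes plain dotted version strings such as `0.4.1`; the helpers `version_tuple` and `is_recent_enough` are hypothetical and not part of Gismo.

```python
from itertools import takewhile


def version_tuple(version):
    """Convert a dotted version string such as '0.4.1' into a tuple of ints."""
    parts = []
    for chunk in version.split("."):
        # Keep only the leading digits of each chunk ('1rc1' -> 1).
        digits = "".join(takewhile(str.isdigit, chunk))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


REQUIRED = (0, 4, 1)  # minimal version required by this notebook


def is_recent_enough(version, required=REQUIRED):
    """Return True if `version` is at least `required`."""
    return version_tuple(version) >= required
```

Calling `is_recent_enough(gismo.__version__)` would then tell whether an upgrade is needed. Comparing tuples of integers rather than strings avoids ranking `0.10.0` below `0.4.1`.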

## Data folder

You need to specify where the data will be stored. You can use ``Path(".")`` to select your working directory, or any other location you prefer. At least 2 GB of free space is recommended (you can delete some files afterwards).

In [2]:
from pathlib import Path
data_folder = Path("../../../datasets")
data_folder.exists()

True
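
The free space available in the chosen location can be checked with the standard library. This is a minimal sketch; the helpers `free_gigabytes` and `has_room` are hypothetical names, and the 2 GB default simply mirrors the recommendation above (counted here as 10**9 bytes per GB).

```python
import shutil
from pathlib import Path


def free_gigabytes(folder):
    """Return the free space, in GB (10**9 bytes), on the disk holding `folder`."""
    return shutil.disk_usage(folder).free / 10**9


def has_room(folder, required_gb=2):
    """Return True if at least `required_gb` GB are free where `folder` lives."""
    return free_gigabytes(folder) >= required_gb


print(f"Free space: {free_gigabytes(Path('.')):.1f} GB")
```

In this notebook, `has_room(data_folder)` would be the natural call.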

## DBLP retrieval

This part first retrieves the DBLP database.

In [3]:
from gismo.datasets.dblp import Dblp

dblp = Dblp(path=data_folder)
dblp.build()

File ..\..\..\datasets\dblp.xml.gz already exists. Use refresh option to overwrite.
File ..\..\..\datasets\dblp.data already exists. Use refresh option to overwrite.


After a few minutes, we have something usable by Gismo.

In [4]:
from gismo.filesource import FileSource

source = FileSource(filename="dblp", path=data_folder)
source[500000]

{'type': 'article',
 'authors': ['Jun Hou',
  'Qianmu Li',
  'Rong Tan',
  'Shunmei Meng',
  'Hanrui Zhang',
  'Sainan Zhang'],
 'title': 'An Intrusion Tracking Watermarking Scheme.',
 'year': '2019',
 'venue': 'IEEE Access'}

We have many articles there:

In [5]:
len(source)

5301990

How many authors?

In [6]:
dblp_authors = {auth for art in source for auth in art['authors']}

In [7]:
len(dblp_authors)

2740056

## Loading program committees

This data has been semi-automatically processed independently and is shipped with the project. You just need to load it.

In [8]:
import json, gzip
with gzip.open('algotels_1999_2021.json.gz', 'rt', encoding='utf8') as f:
    algotels = json.load(f)

`algotels` is a dict with two keys.
- `by_year` -> dict that associates year (string) to program committee (list of strings)
- `pcs` -> list of all PC chairs (list of strings)

Note that we use underscores instead of spaces for the names. This is just a trick to facilitate pre-processing later on.

In [9]:
algotels['by_year']['2007']

['Guillaume_Chelius',
 'David_Coudert',
 'Marin_Bertier',
 'Lélia_Blin',
 'Tijani_Chahed',
 'Claude_Chaudet',
 'Marcelo_Dias_de_Amorim',
 'Bertrand_Ducourthial',
 'Jean-Michel_Fourneau',
 'Jérôme_Galtier',
 'Cyril_Gavoille',
 'Yacine_Ghamri-Doudane',
 'Isabelle_Guérin_Lassous',
 'Clémence_Magnien',
 'Thomas_Noël',
 'Philippe_Owezarski',
 'Christophe_Paul',
 'Christophe_Prieur',
 'Hervé_Rivano',
 'Franck_Rousseau',
 'Bruno_Sericola',
 'David_Simplot-Ryl',
 'Radu_State',
 'Sébastien_Tixeuil',
 'Fabrice_Valois',
 'Sandrine_Vial',
 'Frédéric_Weis']

In [10]:
algotels['pcs']

['Benoît_Darties',
 'Alessia_Milani',
 'Thomas_Begin',
 'Erwan_Le_Merrer',
 'Christelle_Caillouet',
 'Cristel_Pelsser',
 'Aline_Carneiro_Viana',
 'Stéphane_Devismes',
 'David_Ilcinkas',
 'Katia_Jaffrès-Runser',
 'Lélia_Blin',
 'Frédéric_Giroire',
 'Jérémie_Chalopin',
 'Fabrice_Theoleyre',
 'Nicolas_Nisse',
 'Franck_Rousseau',
 'Nicolas_Hanusse',
 'Fabien_Mathieu',
 'Bertrand_Ducourthial',
 'Pascal_Felber',
 'Maria_Potop-Butucaru',
 'Hervé_Rivano',
 'Augustin_Chaintreau',
 'Clémence_Magnien',
 'David_Simplot-Ryl',
 'Sébastien_Tixeuil',
 'Guillaume_Chelius',
 'David_Coudert',
 'Marcelo_Dias_de_Amorim',
 'Jean-Claude_König',
 'Matthieu_Latapy',
 'Philippe_Owezarski',
 'Isabelle_Guérin_Lassous',
 'Frédéric_Havet',
 'Khaldoun_Al_Agha',
 'Cyril_Gavoille',
 'Thomas_Noël',
 'Laurent_Viennot',
 'Véronique_Vèque',
 'Eric_Fleury',
 'Pierre_Fraigniaud',
 'Karine_Altisen',
 'Quentin_Bramas']
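
The underscore convention used in these lists keeps each full name as a single token when a string of names is split on spaces. A minimal sketch of the idea, where the helper `to_token` is hypothetical:

```python
def to_token(name):
    """Replace spaces by underscores so that a full name becomes one token."""
    return name.replace(" ", "_")


names = ["Isabelle Guérin Lassous", "David Coudert"]

# Without underscores, splitting on spaces breaks the names into pieces.
naive = " ".join(names).split(" ")

# With underscores, each name survives the split as a single token.
tokens = " ".join(to_token(n) for n in names).split(" ")
```

Here `naive` contains five fragments, whereas `tokens` contains exactly one entry per researcher.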

How many researchers have served on a program committee?

In [11]:
pcs_people = {auth for year in algotels['by_year'].values() for auth in year}

In [12]:
len(pcs_people)

266

## Corpus reduction

DBLP provides references for more than 5,000,000 articles. This is huge. We want to reduce the number of articles. The goal here is twofold:
- Smaller datasets are faster, so the reduction will make it easier to perform various experiments;
- Reduction is also an opportunity to focus on the articles that are most relevant to the study: we do not reduce randomly, but select articles that are *close* to Algotel. This focus gives more weight to the relevant fields and vocabulary.

### Algotel corpus

We slightly rearrange the content of ``algotels`` to make it easier to process afterwards. In detail, we flatten the content into a list of dicts, each dict having a display name (`name`) and a usable content (`dblp`).

In [13]:
algotels_lmks = [{'name': k, 'dblp': " ".join(v)} for k, v in algotels['by_year'].items()]
algotels_lmks.append({'name': 'pcs', 'dblp': " ".join(algotels['pcs'])})
algotels_lmks

[{'name': '1999',
  'dblp': 'Jean-Claude_Bermond Fabrice_Clérot Afonso_Ferreira Jean-Michel_Fourneau Pierre_Fraigniaud Cyril_Gavoille Gérard_Hébuterne Jean-Luc_Lutton Philippe_Mahey Fabrice_R._Noreils Stéphane_Ubéda Véronique_Vèque'},
 {'name': '2000',
  'dblp': 'Alexandre_Caminada Fabrice_Clérot Eric_Fleury Jean-Michel_Fourneau Pierre_Fraigniaud Etienne_Gaudin Gérard_Hébuterne Daniel_Kofman Jean-Claude_König Martine_Labbé Philippe_Nain Thomas_Noël Stephane_Perennes Patrick_Snape Kim_Loan_Thai François_Tillerot Laurent_Viennot'},
 {'name': '2001',
  'dblp': 'André-Luc_Beylot Stéphane_Boucheron Fabrice_Chauvet Eric_Fleury Jérôme_Galtier Etienne_Gaudin Cyril_Gavoille Michel_Gendreau S._Grisouard Gérard_Hébuterne Eric_Horlait Philippe_Jacquet Jean-Claude_König Christian_Laforest Xavier_Lagrange Geraldo_Robson_Mateus Michel_Morvan Jean-Jacques_Pansiot Nihal_Pekergin Brigitte_Plateau Alain_Quilliot Michel_Riguidel Patrick_Tortelier Véronique_Vèque Laurent_Toutain Stéphane_Ubéda'},
 {'name':

### Gismo on DBLP authors

This part builds a Gismo on authors (a weighted bipartite graph between articles and authors).

The following cell only performs initialization.

In [14]:
from gismo.corpus import Corpus
from gismo.embedding import Embedding
from gismo.gismo import Gismo
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_author = CountVectorizer(dtype=float, preprocessor=lambda x:x, tokenizer=lambda x: x.split(' '))

The following function converts a DBLP article dict from the source into a string representation of its authors.

In [15]:
def to_authors_text(dic):
    return " ".join([a.replace(' ', '_') for a in dic['authors']])
corpus = Corpus(source, to_text=to_authors_text)

This builds the embedding (the actual weighted bipartite graph between articles and authors).

In [16]:
embedding = Embedding(vectorizer=vectorizer_author)
embedding.fit_transform(corpus)
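
To picture what the embedding stores, here is a toy sketch of an article/author bipartite graph that uses raw occurrence counts as weights. The three articles and their authors are made up, and the actual weighting applied by Gismo is not reproduced here.

```python
from collections import Counter

# Toy corpus: each article is reduced to its list of author tokens.
articles = [
    ["Alice_A", "Bob_B"],
    ["Alice_A", "Carol_C"],
    ["Bob_B"],
]

# Article side of the graph: article index -> {author: weight}.
article_to_authors = [Counter(authors) for authors in articles]

# Author side of the graph: author -> {article index: weight}.
author_to_articles = {}
for index, authors in enumerate(article_to_authors):
    for author, weight in authors.items():
        author_to_articles.setdefault(author, {})[index] = weight
```

Both dictionaries describe the same edges, seen from either side; a sparse matrix and its transpose play the same role at DBLP scale.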

The Gismo is the object that glues all pieces together.

In [17]:
gismo = Gismo(corpus, embedding)

Gismo can do many things that we don't need right now. For example, it can easily find your closest co-authors (including yourself).

In [18]:
gismo.rank("Fabien_Mathieu")
gismo.get_features_by_rank()

['Fabien_Mathieu',
 'Laurent_Viennot',
 'Diego_Perino',
 'Julien_Reynier',
 'Céline_Comte',
 'Ludovic_Noirie',
 'François_Durand',
 'Fabien_de_Montgolfier',
 'The_Dang_Huynh',
 'Thomas_Bonald',
 'Yacine_Boufkhad',
 'Ilkka_Norros',
 'Mohamed_Bouklit',
 'Anh-Tuan_Gai',
 'François_Baccelli',
 'Nidhi_Hegde',
 'Gheorghe_Postelnicu',
 'Dohy_Hong',
 'Anne_Bouillard']

In [19]:
gismo.rank('Philippe_Jacquet')
gismo.get_features_by_rank()

['Philippe_Jacquet',
 'Wojciech_Szpankowski',
 'Bernard_Mans',
 'Georgios_Rodolakis',
 'Salman_Malik',
 'Paul_Mühlethaler',
 'Dimitris_Milioris',
 'Cédric_Adjih',
 'Dalia_Georgiana_Popescu',
 'Emmanuel_Baccelli',
 'Laurent_Viennot',
 'Mireille_Régnier',
 'Thomas_Heide_Clausen']

### Reducing the number of articles

Through the Landmarks submodule, Gismo can associate arbitrary items (like program committees) with articles and/or authors.

The following lines associate each entry of `algotels_lmks` with up to 20,000 articles selected by Gismo, and build a selection of articles by merging all the results.

In [20]:
from gismo.landmarks import Landmarks
landmarks_full = Landmarks(source=algotels_lmks, to_text=lambda x: x['dblp'],
                                 x_density=20000)

In [21]:
reduced_source = landmarks_full.get_reduced_source(gismo)

In [22]:
print(f"Source length went down from {len(source)} to {len(reduced_source)}.")

Source length went down from 5301990 to 119370.
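
The principle of the reduction (keep a bounded number of articles per landmark, then merge the selections) can be sketched as follows. This is only an illustration: the function `reduce_corpus` and the relevance scores are hypothetical, and the way Gismo actually scores articles is not reproduced.

```python
def reduce_corpus(scores_by_landmark, max_per_landmark):
    """Keep the best-scored articles of each landmark and merge the selections.

    scores_by_landmark: {landmark name: {article id: relevance score}}
    """
    selected = set()
    for scores in scores_by_landmark.values():
        # Rank the articles of this landmark by decreasing score.
        ranked = sorted(scores, key=scores.get, reverse=True)
        selected.update(ranked[:max_per_landmark])
    return sorted(selected)


# Hypothetical relevance scores for two landmarks over five articles.
toy_scores = {
    "1999": {0: 0.9, 1: 0.5, 2: 0.1},
    "2000": {1: 0.8, 3: 0.7, 4: 0.2},
}
kept = reduce_corpus(toy_scores, max_per_landmark=2)
```

Because the selections of different landmarks overlap (article 1 is picked twice in the toy data), the merged set can be much smaller than the sum of the individual selections.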


We can close the original DBLP source (the big DBLP source keeps a file open while in use).

In [23]:
source.close()
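
When a resource must be closed after use, `contextlib.closing` from the standard library guarantees that `close()` is called even if an error occurs in between. The sketch below uses a hypothetical stand-in class, `DummySource`; whether the DBLP source can be used directly in a `with` statement is not assumed here, only that it exposes `close()`.

```python
from contextlib import closing


class DummySource:
    """Hypothetical stand-in for any object exposing a close() method."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


with closing(DummySource()) as src:
    # Work with the source here; close() is called on exit, even on error.
    was_open = not src.closed
```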

How many authors?

In [24]:
reduced_dblp_authors = {auth for art in reduced_source for auth in art['authors']}

In [25]:
len(reduced_dblp_authors)

90676

## XGismo

An XGismo object merges two bipartite graphs into a new one. Here we will merge a bipartite graph between articles and authors with a bipartite graph between articles and vocabulary, producing a bipartite graph between authors and vocabulary.

First we rebuild an author graph on the new (reduced) corpus of articles.

In [26]:
reduced_corpus = Corpus(reduced_source, to_text=to_authors_text)
reduced_author_embedding = Embedding(vectorizer=vectorizer_author)
reduced_author_embedding.fit_transform(reduced_corpus)

Then we build the vocabulary graph.
- We use spaCy to enhance word selection;
- We make a few additional tweaks, like filtering by frequency and detecting sequences of consecutive words (n-grams).

In [27]:
import spacy
# Initialize spacy 'en' model, keeping only tagger component needed for lemmatization
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
# Who cares about DET and such?
keep = {'ADJ', 'NOUN', 'NUM', 'PROPN', 'SYM', 'VERB'}

preprocessor = lambda txt: " ".join([token.lemma_.lower() for token in nlp(txt)
                                     if token.pos_ in keep and not token.is_stop])
# ngram_range is given as a tuple, as recent scikit-learn versions expect
vectorizer_text = CountVectorizer(dtype=float, min_df=5, max_df=.02, ngram_range=(1, 3), preprocessor=preprocessor)
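
The frequency filter and the n-gram detection can be illustrated without scikit-learn. The sketch below is a simplified, pure-Python version of the idea, with hypothetical helpers `ngrams` and `frequency_filter` and made-up lemmatized titles. As in `CountVectorizer`, an integer `min_df` is an absolute number of documents and a float `max_df` is a proportion of the corpus.

```python
from collections import Counter


def ngrams(words, n_min=1, n_max=3):
    """Return all runs of n_min to n_max consecutive words, joined by spaces."""
    return [
        " ".join(words[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(words) - n + 1)
    ]


def frequency_filter(documents, min_df, max_df):
    """Keep terms present in at least `min_df` documents (absolute count)
    and in at most a fraction `max_df` of the documents."""
    df = Counter(term for doc in documents for term in set(ngrams(doc)))
    limit = max_df * len(documents)
    return {term for term, count in df.items() if min_df <= count <= limit}


# Toy lemmatized titles (hypothetical).
titles = [
    ["self", "stabilization", "tree"],
    ["self", "stabilization", "ring"],
    ["tree", "exploration"],
    ["graph", "exploration"],
]
vocabulary = frequency_filter(titles, min_df=2, max_df=0.5)
```

On the toy titles, only the terms that appear in exactly two of the four documents survive, including the bigram `self stabilization`. Rare terms carry too little signal to link authors, while very common ones do not discriminate between them.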

Creation of the vocabulary graph (will take a few minutes).

In [28]:
reduced_corpus.to_text = lambda e: e['title']
reduced_word_embedding = Embedding(vectorizer=vectorizer_text)
reduced_word_embedding.fit_transform(reduced_corpus)

Now the xgismo can be made from the two graphs.

In [29]:
from gismo.gismo import XGismo
xgismo = XGismo(x_embedding=reduced_author_embedding, y_embedding=reduced_word_embedding)
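
The merge of the two graphs can be pictured as a composition through the shared articles: an author is linked to a word whenever one of their articles contains that word. The sketch below illustrates this on made-up data with a hypothetical `compose` function; it uses plain counts and does not reproduce the weighting or normalization that Gismo applies.

```python
from collections import defaultdict

# Toy article -> authors and article -> words incidences (unweighted).
article_authors = {0: ["Alice_A", "Bob_B"], 1: ["Alice_A"]}
article_words = {0: ["routing", "graph"], 1: ["graph"]}


def compose(article_authors, article_words):
    """Link authors to words through the articles they share.

    The weight of (author, word) is the number of articles signed by
    the author whose title contains the word.
    """
    author_words = defaultdict(lambda: defaultdict(int))
    for article, authors in article_authors.items():
        for author in authors:
            for word in article_words.get(article, []):
                author_words[author][word] += 1
    return {author: dict(words) for author, words in author_words.items()}


author_words = compose(article_authors, article_words)
```

In matrix terms, this amounts to multiplying the author/article matrix by the article/word matrix, which is why the articles disappear from the resulting graph.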

The dimensions of the xgismo (number of authors times vocabulary size):

In [30]:
xgismo.embedding.x

<90676x33900 sparse matrix of type '<class 'numpy.float64'>'
	with 2485287 stored elements in Compressed Sparse Row format>

XGismo can link words or researchers to words or researchers. For example:

In [31]:
xgismo.rank("self-stabilization")

True

In [32]:
xgismo.get_documents_by_rank(k=10)

['Ted_Herman',
 'Shlomi_Dolev',
 'Sébastien_Tixeuil',
 'Toshimitsu_Masuzawa',
 'Shay_Kutten',
 'Stéphane_Devismes',
 'Swan_Dubois',
 'Stefan_Schmid_0001',
 'Bertrand_Ducourthial',
 'Karine_Altisen']

In [33]:
xgismo.get_features_by_rank(k=10)

['self',
 'stabilization',
 'self stabilization',
 'stabilize',
 'self stabilize',
 'byzantine',
 'distributed',
 'tree',
 'asynchronous',
 'fault']

In [34]:
xgismo.rank("Pierre_Fraigniaud", y=False)

True

In [35]:
xgismo.get_documents_by_rank(k=10)

['Pierre_Fraigniaud',
 'Amos_Korman',
 'Andrzej_Pelc',
 'David_Peleg',
 'Cyril_Gavoille',
 'Michel_Raynal',
 'Fedor_V._Fomin',
 'Dimitrios_M._Thilikos',
 'David_Ilcinkas',
 'Nicolas_Nisse']

In [36]:
xgismo.get_features_by_rank(k=10)

['local',
 'decision',
 'local decision',
 'advice',
 'compute',
 'broadcasting',
 'verify',
 'exploration',
 'distributed',
 'tree']

## Saving

We can now save the xgismo, which is the only thing we need in addition to the program committees.

In [37]:
xgismo.save(filename="algotels_xgismo", path=data_folder, compress=True, erase=True)

## Cleaning (optional)

If you don't want to re-use the DBLP database in the future and need to save some space, you can safely remove the DBLP files you have created.

The list of the files is:

In [38]:
for file in data_folder.glob('dblp*'):
    if file.is_file():
        print(f"{file} ({file.stat().st_size} bytes)")

..\..\..\datasets\dblp.data (940287966 bytes)
..\..\..\datasets\dblp.dtd (12973 bytes)
..\..\..\datasets\dblp.index (21208238 bytes)
..\..\..\datasets\dblp.xml.gz (616436534 bytes)
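
Raw byte counts are hard to read. As a small sketch, the hypothetical helper `human_size` below formats sizes with binary units and is applied to the total of the four sizes listed above.

```python
def human_size(n_bytes):
    """Format a byte count with binary units (1 KiB = 1024 bytes)."""
    size = float(n_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        # Stop at the first unit that fits, or at the largest unit available.
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024


# Sizes printed by the previous cell, in bytes.
sizes = [940287966, 12973, 21208238, 616436534]
total = sum(sizes)
print(human_size(total))
```

The four files add up to about 1.5 GiB, which is the space recovered by the optional cleaning step.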


If you are OK with deleting these files, execute the following cell.

In [32]:
for file in data_folder.glob('dblp*'):
    if file.is_file():
        file.unlink()

This notebook is finally over. You can now switch to the other one and start playing!