# Data preparation

This notebook creates the objects used in the paper *Fun with FUN*. After a short setup (mostly choosing where the data will be stored and checking your packages), you only need to run it once.

## Packages

You need ``Gismo>=0.4.1`` for the Notebook to work.

In [1]:
import gismo
gismo.__version__

'0.4.2'

If you don't have gismo, you can install it from pip (``pip install gismo``) or from source at https://github.com/balouf/gismo.

If you have an older version of Gismo, upgrade is strongly recommended (``pip install gismo -U``).
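
If you prefer to check the requirement programmatically, a minimal sketch could look like the following. The helper `version_tuple` is hypothetical (it is not part of Gismo) and assumes plain numeric versions such as `0.4.2`; versions with pre-release suffixes would need a proper parser.

```python
def version_tuple(version):
    """Turn a plain 'X.Y.Z' version string into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split("."))


MINIMUM = (0, 4, 1)  # requirement stated above: Gismo>=0.4.1

# In the notebook, you would pass gismo.__version__ instead of these literals.
for candidate in ["0.4.0", "0.4.2"]:
    status = "OK" if version_tuple(candidate) >= MINIMUM else "too old"
    print(candidate, status)
```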

## Data folder

You need to specify where the data will be stored. You can use ``Path(".")`` to select your working directory, or another location if you prefer. At least 2 GB of free space is recommended (you can clean up some files afterwards).

In [3]:
from pathlib import Path
data_folder = Path("../../../../../Datasets")
data_folder.exists()

True
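
To check the recommended free space before downloading anything, one option is `shutil.disk_usage` from the standard library. This is only a sketch: the helper name `has_free_space` and the exact threshold (2 GB taken as 2 × 10⁹ bytes) are illustrative choices, not part of the notebook.

```python
import shutil
from pathlib import Path


def has_free_space(folder, required_bytes=2 * 10**9):
    """Return True if the filesystem holding `folder` has enough free space."""
    return shutil.disk_usage(folder).free >= required_bytes


# Path(".") stands in for data_folder here.
print(has_free_space(Path(".")))
```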

## DBLP retrieval

This part first retrieves the DBLP database (if you already have a local copy, it is not refreshed by default).

In [5]:
from gismo.datasets.dblp import Dblp

dblp = Dblp(path=data_folder)
dblp.build()

File ..\..\..\..\..\Datasets\dblp.xml.gz already exists. Use refresh option to overwrite.
File ..\..\..\..\..\Datasets\dblp.data already exists. Use refresh option to overwrite.
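
The messages above reflect a "build only if missing" policy. The sketch below illustrates that general pattern with a hypothetical helper, `build_if_missing`. It is not Gismo's implementation, and its `refresh` argument merely mirrors the option named in the messages.

```python
import tempfile
from pathlib import Path


def build_if_missing(target, build, refresh=False):
    """Call build(target) unless target already exists and refresh is False.

    Returns True if the file was (re)built, False if the existing copy was kept.
    """
    if target.exists() and not refresh:
        print(f"File {target} already exists. Use refresh option to overwrite.")
        return False
    build(target)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "toy.data"
    write_toy = lambda path: path.write_text("toy content")
    first = build_if_missing(target, write_toy)                # file is created
    second = build_if_missing(target, write_toy)               # existing file is kept
    third = build_if_missing(target, write_toy, refresh=True)  # file is rebuilt
```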


After a few minutes, we have something usable by Gismo.

In [6]:
from gismo.filesource import FileSource

source = FileSource(filename="dblp", path=data_folder)
source[500000]

{'type': 'article',
 'authors': ['Krishnanand N. Kaipa', 'Carlos W. Morato', 'Satyandra K. Gupta'],
 'title': 'Design of Hybrid Cells to Facilitate Safe and Efficient Human-Robot Collaboration During Assembly Operations.',
 'year': '2018',
 'venue': 'J. Comput. Inf. Sci. Eng.'}

We have many articles there:

In [7]:
len(source)

5772344

How many unique authors?

In [8]:
dblp_authors = {auth for art in source for auth in art['authors']}

In [9]:
len(dblp_authors)

2976175

## Loading program committees and authors

This data has been semi-automatically processed independently and is shipped with the project. You just need to load it.

In [10]:
import json
with open('fun_pcs.json', 'rt', encoding='utf8') as f:
    fun_pcs = json.load(f)
fun_pcs.keys()

dict_keys(['by_year', 'pcs'])

`fun_pcs` is a dict with two keys.
- `by_year` -> dict that associates year (string) to program committee (list of strings)
- `pcs` -> list of all PC chairs (list of strings)
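
As an illustration of this layout, here is a toy object with the same shape. The names and years are made up for the example; only the two keys come from the actual file.

```python
import json

# Toy data with the same shape as fun_pcs (hypothetical names and years).
toy_pcs = {
    "by_year": {
        "2001": ["Alice_Example", "Bob_Sample"],
        "2004": ["Bob_Sample", "Carol_Placeholder"],
    },
    "pcs": ["Alice_Example", "Carol_Placeholder"],
}

# JSON object keys are always strings, which is why years are stored as strings.
reloaded = json.loads(json.dumps(toy_pcs))
print(sorted(reloaded["by_year"]))
print(reloaded["by_year"]["2004"])
```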

In [11]:
import json
with open('fun_authors.json', 'rt', encoding='utf8') as f:
    fun_authors = json.load(f)
fun_authors.keys()

dict_keys(['1998', '2012', '2021', '2018', '2016', '2010', '2007', '2014', '2002', '2004'])

`fun_authors` is a dict with one key per edition.


Note that we use underscores instead of spaces for the names. This is just a trick to facilitate pre-processing later on.
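
To see why the underscores help, consider a sketch where several names are packed into one string: since a name no longer contains spaces, splitting on spaces recovers exactly one token per person. The two sample names appear in the data.

```python
names = ["Virginia Vassilevska Williams", "Tom C. van der Zanden"]

# One token per person: spaces inside a name become underscores.
tokens = [name.replace(" ", "_") for name in names]
packed = " ".join(tokens)

# Splitting on spaces gives back one entry per person.
print(packed.split(" "))
```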

In [12]:
fun_pcs['by_year']['2022']

['Oswin_Aichholzer',
 'Alkida_Balliu',
 'Pierluigi_Crescenzi',
 'Miriam_Di_Ianni',
 'David_Eppstein',
 'Panagiota_Fatourou',
 'Fedor_V._Fomin',
 'Pierre_Fraigniaud',
 'Magnús_M._Halldórsson',
 'Taisuke_Izumi',
 'Kei_Kimura',
 'Masashi_Kiyomi',
 'Lisa_Kohl',
 'Irina_Kostitsyna',
 'Jayson_Lynch',
 'Tillmann_Miltzow',
 'Neeldhara_Misra',
 'Valia_Mitsou',
 'Takaaki_Mizuki',
 'Lata_Narayanan',
 'Harumichi_Nishimura',
 'Yoshio_Okamoto',
 'Boaz_Patt-Shamir',
 'Andrea_Pietracaprina',
 'Sergio_Rajsbaum',
 'Adele_A._Rescigno',
 'Ryuhei_Uehara',
 'Yushi_Uno',
 'Virginia_Vassilevska_Williams',
 'Aaron_Williams',
 'Prudence_W._H._Wong',
 'Tom_C._van_der_Zanden']

In [13]:
fun_authors['1998']

['Danny_Krizanc',
 'Jurek_Czyzowicz',
 'Alon_Itai',
 'Massimo_Santini_0001',
 'Andrei_Z._Broder',
 'Yan_Gérard',
 'Prosenjit_Bose',
 'Sebastiano_Vigna',
 'Wojciech_Szpankowski',
 'Tami_Tamir',
 'Ronald_I._Becker',
 'Erkki_Sutinen',
 'Yen-I_Chiang',
 'M._Cecilia_Verri',
 'Barry_Hayes',
 'Paolo_Boldi',
 'Michael_Rodeh',
 'Joseph_C._Culberson',
 'David_Eppstein',
 'Renzo_Sprugnoli',
 'Maurice_Nivat',
 'Refael_Hassin',
 'Anil_Maheshwari',
 'Steven_S._Seiden',
 'Alberto_Pedrotti',
 'Evangelos_Kranakis',
 'Rudolf_Fleischer',
 'Shlomi_Rubinstein',
 'Bruno_Simeone',
 'Hadas_Shachnai',
 'Aviezri_S._Fraenkel',
 'Marshall_W._Bern',
 'Donatella_Merlini',
 'Avrim_Blum',
 'Prabhakar_Raghavan',
 'Alain_Daurat']

How many researchers in the PCs?

In [14]:
pcs_people = {auth for year in fun_pcs['by_year'].values() for auth in year}
len(pcs_people)

164

How many authors?

In [15]:
aus_people = {auth for year in fun_authors.values() for auth in year}
len(aus_people)

465

## Corpus reduction

DBLP provides references for more than 5,000,000 articles, which is more than we need. We want to reduce the number of articles, with a twofold goal:
- Smaller datasets are faster, so the reduction will make it easier to perform various experiments;
- Reduction is also an opportunity to focus on the articles that are the most relevant for the study: we will not reduce randomly, but by selecting articles that are *close* to the FUN community. This focus gives more weight to the relevant fields and vocabulary.

### FUN landmarks

We slightly rearrange the content of the PCs and authors to make it easier to process afterwards. In detail, we flatten the content into a list of dicts, each dict having a display name (`name`) and a usable content (`dblp`).

In [16]:
funs_lmks = [{'name': f"pc_{k}", 'dblp': " ".join(v)} for k, v in fun_pcs['by_year'].items()]
funs_lmks.append({'name': 'pcs', 'dblp': " ".join(fun_pcs['pcs'])})
funs_lmks += [{'name': f"au_{k}", 'dblp': " ".join(v)} for k, v in fun_authors.items()]
funs_lmks

[{'name': 'pc_1998',
  'dblp': 'Giorgio_Ausiello Shimon_Even Zvi_Galil Elena_Lodi Fabrizio_Luccio Jürg_Nievergelt Linda_Pagli David_Peleg Kurt_Mehlhorn Franco_P._Preparata Arnold_L._Rosenberg Nicola_Santoro'},
 {'name': 'pc_2001',
  'dblp': 'Jean-Claude_Bermond Pierluigi_Crescenzi Frank_Dehne Paolo_Ferragina Michele_Flammini Paola_Flocchini Roberto_Grossi Danny_Krizanc Elena_Lodi Linda_Pagli David_Peleg Andrea_Pietracaprina Geppino_Pucci Luca_Trevisan Sergio_Rajsbaum Antonio_Restivo Nicola_Santoro Ugo_Vaccaro Sebastiano_Vigna Tandy_J._Warnow'},
 {'name': 'pc_2004',
  'dblp': 'Lars_Arge Michael_A._Bender Gerth_Stølting_Brodal Pierluigi_Crescenzi Martin_Farach-Colton Paolo_Ferragina Rudolf_Fleischer Paola_Flocchini Pierre_Fraigniaud Roberto_Grossi Stefano_Leonardi Giovanni_Manzini Gonzalo_Navarro Andrea_Pietracaprina Giuseppe_Prencipe Rajeev_Raman Kunihiko_Sadakane Peter_Sanders_0001 Steven_Skiena Christos_D._Zaroliagis'},
 {'name': 'pc_2007',
  'dblp': 'Nancy_M._Amato Nina_Amenta Marcel

### Gismo on DBLP authors

This part builds a Gismo on authors (a bipartite graph between articles and authors with stochastic TF-IDTF weights).

This is just initialization stuff.

In [17]:
from gismo.corpus import Corpus
from gismo.embedding import Embedding
from gismo.gismo import Gismo
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_author = CountVectorizer(dtype=float, preprocessor=lambda x:x, tokenizer=lambda x: x.split(' '))
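
The custom `preprocessor` and `tokenizer` matter here. By default, `CountVectorizer` lowercases its input and extracts tokens with the pattern `(?u)\b\w\w+\b`, which would cut author names at dots and hyphens. The sketch below applies that pattern with `re` (rather than scikit-learn itself) and compares it with the plain split used above.

```python
import re

DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # CountVectorizer's default token_pattern
text = "Fedor_V._Fomin Boaz_Patt-Shamir"

default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, text)
split_tokens = text.split(" ")

print(default_tokens)  # names are cut into fragments
print(split_tokens)    # one token per author
```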

This part tells how to convert a DBLP article dict from the source into a string representation of authors.

In [18]:
def to_authors_text(dic):
    return " ".join([a.replace(' ', '_') for a in dic['authors']])
corpus = Corpus(source, to_text=to_authors_text)

This builds the embedding (the actual weighted bipartite graph between articles and authors).

In [19]:
embedding = Embedding(vectorizer=vectorizer_author)
embedding.fit_transform(corpus)

The Gismo is the object that glues all pieces together.

In [20]:
gismo = Gismo(corpus, embedding)

Gismo can do many things that we don't need right now. For example, it can easily tell you who your closest co-authors are (including yourself).

In [21]:
gismo.rank("Fabien_Mathieu")
gismo.get_features_by_rank()

['Fabien_Mathieu',
 'Laurent_Viennot',
 'Diego_Perino',
 'Céline_Comte',
 'Julien_Reynier',
 'Ludovic_Noirie',
 'François_Durand',
 'Fabien_de_Montgolfier',
 'The_Dang_Huynh',
 'Yacine_Boufkhad',
 'Thomas_Bonald',
 'Ilkka_Norros',
 'Mohamed_Bouklit',
 'Anh-Tuan_Gai',
 'François_Baccelli',
 'Nidhi_Hegde_0001',
 'Gheorghe_Postelnicu',
 'Anne_Bouillard',
 'Dohy_Hong']

In [22]:
gismo.rank('Sébastien_Tixeuil')
gismo.get_features_by_rank()

['Sébastien_Tixeuil',
 'Mikhail_Nesterenko',
 'Maria_Potop-Butucaru',
 'Toshimitsu_Masuzawa',
 'Quentin_Bramas',
 'Swan_Dubois',
 'Alexandre_Maurer',
 'Lélia_Blin',
 'Stéphane_Devismes',
 'Maria_Gradinariu_Potop-Butucaru',
 'Fukuhito_Ooshita',
 'Zohir_Bouzid',
 'Silvia_Bonomi',
 'Sylvie_Delaët',
 'Anissa_Lamani',
 'Ajoy_Kumar_Datta',
 'Xavier_Urbain',
 'Adam_Heriban',
 'Giovanni_Farina',
 'Xavier_Défago',
 'Maria_Gradinariu',
 'Franck_Petit',
 'Lionel_Rieg']
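
To give an intuition of what such a ranking captures, here is a deliberately naive sketch that counts shared articles on a toy corpus. Gismo's actual ranking is more elaborate and is not reproduced here; the articles and names below are made up.

```python
from collections import Counter

# Toy corpus: each article is reduced to its list of authors (hypothetical names).
toy_articles = [
    ["Alice_Example", "Bob_Sample"],
    ["Alice_Example", "Bob_Sample", "Carol_Placeholder"],
    ["Alice_Example", "Carol_Placeholder"],
    ["Bob_Sample", "Dan_Dummy"],
]


def naive_coauthors(author, articles):
    """Rank people by the number of articles they share with `author`."""
    counts = Counter()
    for authors in articles:
        if author in authors:
            counts.update(authors)
    return [name for name, _ in counts.most_common()]


ranking = naive_coauthors("Alice_Example", toy_articles)
print(ranking)
```

As in the Gismo output, the queried author comes first, simply because they appear in all of their own articles.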

### Reducing the number of articles

Through the Landmarks submodule, Gismo can associate arbitrary items (like the program committee or the authors of an edition) with articles and/or authors.

The following lines associate each entry of `funs_lmks` with up to 20,000 articles, selected by Gismo, and build a selection of articles by merging all results.

In [23]:
from gismo.landmarks import Landmarks
landmarks_full = Landmarks(source=funs_lmks, to_text=lambda x: x['dblp'],
                                 x_density=20000)

In [24]:
reduced_source = landmarks_full.get_reduced_source(gismo)

In [25]:
print(f"Source length went down from {len(source)} to {len(reduced_source)}.")

Source length went down from 5772344 to 156744.
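
The reduction principle (select up to a fixed number of articles per landmark, then merge the selections) can be sketched on toy data as follows. This is only an illustration: the overlap score used here is an assumption made for the example, whereas Gismo relies on its own ranking.

```python
# Toy corpus: article index -> set of authors (hypothetical names).
toy_articles = {
    0: {"Alice_Example", "Bob_Sample"},
    1: {"Bob_Sample", "Carol_Placeholder"},
    2: {"Dan_Dummy"},
    3: {"Carol_Placeholder", "Erin_Fake"},
    4: {"Frank_Mock"},
}
toy_landmarks = {
    "pc_toy": {"Alice_Example", "Bob_Sample"},
    "au_toy": {"Carol_Placeholder"},
}


def select_articles(landmark, articles, limit):
    """Keep at most `limit` articles sharing at least one author with the landmark."""
    scores = {i: len(authors & landmark) for i, authors in articles.items()}
    ranked = sorted((i for i, s in scores.items() if s > 0),
                    key=lambda i: (-scores[i], i))
    return set(ranked[:limit])


# Merge the per-landmark selections into one reduced corpus.
reduced = set()
for landmark in toy_landmarks.values():
    reduced |= select_articles(landmark, toy_articles, limit=2)
print(sorted(reduced))
```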


We can close the original DBLP source (the big DBLP source keeps a file open while in use).

In [26]:
source.close()
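
If you want the file to be released even when some processing fails midway, `contextlib.closing` from the standard library works with any object exposing a `close()` method. The sketch uses a stand-in class, `ToySource`, and makes no assumption about whether `FileSource` itself supports the `with` statement.

```python
from contextlib import closing


class ToySource:
    """Stand-in for an object that holds a resource until close() is called."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


toy = ToySource()
with closing(toy) as src:
    pass  # work with the source here
print(toy.closed)
```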

How many authors?

In [28]:
reduced_dblp_authors = {auth for art in reduced_source for auth in art['authors']}
len(reduced_dblp_authors)

97518

## XGismo

An XGismo (pronounced *Cross*-Gismo) object merges two bipartite graphs into a new one. Here we will merge a bipartite graph between articles and authors with a bipartite graph between articles and vocabulary, producing a bipartite graph between authors and vocabulary.
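
The merge can be pictured as a composition through the shared articles. The sketch below uses plain co-occurrence counts on toy data; XGismo's actual weights are different, so this only illustrates the structure of the result.

```python
from collections import defaultdict

# Two toy bipartite graphs sharing the same articles (hypothetical content).
article_authors = {0: ["Alice_Example", "Bob_Sample"], 1: ["Alice_Example"]}
article_words = {0: ["graph", "exploration"], 1: ["graph", "routing"]}

# Compose them: an author is linked to every word of the articles they wrote.
author_words = defaultdict(lambda: defaultdict(int))
for article, authors in article_authors.items():
    for author in authors:
        for word in article_words[article]:
            author_words[author][word] += 1

for author, words in author_words.items():
    print(author, dict(words))
```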

First we rebuild an author graph on the new (reduced) corpus of articles.

In [29]:
reduced_corpus = Corpus(reduced_source, to_text=to_authors_text)
reduced_author_embedding = Embedding(vectorizer=vectorizer_author)
reduced_author_embedding.fit_transform(reduced_corpus)

Then we build the vocabulary graph.
- We use spacy to enhance word selection
- We make a few additional tweaks, like filtering by frequency and detection of consecutive words (n-grams)
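
The frequency filter and the n-gram detection can be sketched in plain Python as follows. The helper names are hypothetical, and the thresholds follow the `CountVectorizer` conventions: an integer `min_df` is a document count, while a float `max_df` is a proportion of documents.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """All runs of n_min to n_max consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def filter_vocabulary(docs, min_df, max_df):
    """Keep terms found in at least min_df docs and at most max_df * len(docs) docs."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(ngrams(tokens)))  # count each term once per document
    max_count = max_df * len(docs)
    return {term for term, count in doc_freq.items() if min_df <= count <= max_count}


toy_docs = [
    ["self", "stabilization"],
    ["self", "stabilization", "ring"],
    ["graph", "exploration"],
    ["tree", "exploration"],
]
vocabulary = filter_vocabulary(toy_docs, min_df=2, max_df=0.5)
print(sorted(vocabulary))
```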

In [30]:
import spacy
# Initialize spacy 'en' model, keeping only tagger component needed for lemmatization
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
# Who cares about DET and such?
keep = {'ADJ', 'NOUN', 'NUM', 'PROPN', 'SYM', 'VERB'}

preprocessor=lambda txt: " ".join([token.lemma_.lower() for token in nlp(txt)
                                   if token.pos_ in keep and not token.is_stop])
# ngram_range is given as a tuple, the type documented by scikit-learn
vectorizer_text = CountVectorizer(dtype=float, min_df=5, max_df=.02, ngram_range=(1, 3), preprocessor=preprocessor)

Creation of the vocabulary graph (will take a few minutes).

In [31]:
reduced_corpus.to_text = lambda e: e['title']
reduced_word_embedding = Embedding(vectorizer=vectorizer_text)
reduced_word_embedding.fit_transform(reduced_corpus)

Now the xgismo can be made from the two graphs.

In [32]:
from gismo.gismo import XGismo
xgismo = XGismo(x_embedding=reduced_author_embedding, y_embedding=reduced_word_embedding)

The dimensions of the xgismo (number of authors times vocabulary size):

In [33]:
xgismo.embedding.x

<97518x39085 sparse matrix of type '<class 'numpy.float64'>'
	with 2472965 stored elements in Compressed Sparse Row format>

XGismo can link words or researchers to words or researchers. For example:

In [39]:
xgismo.rank("self-stabilization")

True

In [40]:
xgismo.get_documents_by_rank(k=10)

['Sébastien_Tixeuil',
 'Shlomi_Dolev',
 'Toshimitsu_Masuzawa',
 'Shay_Kutten',
 'Stéphane_Devismes',
 'Swan_Dubois',
 'Karine_Altisen',
 'Bertrand_Ducourthial',
 'Masafumi_Yamashita',
 'Franck_Petit']

In [41]:
xgismo.get_features_by_rank(k=10)

['self',
 'stabilization',
 'stabilize',
 'self stabilization',
 'self stabilize',
 'byzantine',
 'distribute',
 'stabilizing',
 'networks',
 'distributed']

In [42]:
xgismo.rank("Pierre_Fraigniaud", y=False)

True

In [43]:
xgismo.get_documents_by_rank(k=10)

['Pierre_Fraigniaud',
 'Amos_Korman',
 'Andrzej_Pelc',
 'David_Peleg',
 'Cyril_Gavoille',
 'Laurent_Feuilloley',
 'Michel_Raynal',
 'Sergio_Rajsbaum',
 'Juho_Hirvonen',
 'David_Ilcinkas']

In [44]:
xgismo.get_features_by_rank(k=10)

['local',
 'distribute',
 'distributed',
 'decision',
 'broadcasting',
 'exploration',
 'local decision',
 'tree exploration',
 'routing',
 'verify']

## Saving

We can now save the xgismo, which is the only thing we need in addition to the program committees.

In [45]:
xgismo.save(filename="xgismo_fun", path=data_folder, compress=True, erase=True)

## Cleaning (optional)

If you don't want to re-use the DBLP database in the future and need to save some space, you can safely remove the DBLP files you have created.

The list of the files is:

In [46]:
for file in data_folder.glob('dblp*'):
    if file.is_file():
        print(f"{file} ({file.stat().st_size} bytes)")

..\..\..\..\..\Datasets\dblp.data (1028624622 bytes)
..\..\..\..\..\Datasets\dblp.dtd (12973 bytes)
..\..\..\..\..\Datasets\dblp.index (23089626 bytes)
..\..\..\..\..\Datasets\dblp.xml.gz (683059541 bytes)
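
Raw byte counts are hard to read. A small hypothetical helper such as `human_size` below can format them with binary prefixes; it is a sketch, not something provided by Gismo or `pathlib`.

```python
def human_size(num_bytes):
    """Format a byte count with binary prefixes (KiB, MiB, GiB)."""
    size = float(num_bytes)
    for unit in ["B", "KiB", "MiB", "GiB"]:
        # Stop at the first unit that fits, or at the largest unit available.
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024


# Sizes taken from the listing above.
for size in [1028624622, 12973, 23089626, 683059541]:
    print(human_size(size))
```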


If you are OK with deleting these files, run the following cell.

In [32]:
for file in data_folder.glob('dblp*'):
    if file.is_file():
        file.unlink()

This notebook is finally over. You can now switch to the other one and start playing!