Author: Pascal, pascal@bayesimpact.org

Date: 2016-06-22

Skip the run test because the ROME version has to be updated to make it work in the exported repository. TODO: Update ROME and remove the `skiptest` flag.


# ROME update from v328 to v329

In June 2016 I realized that Pôle Emploi had released a new version of the ROME. I want to investigate what changed and whether we need to do anything about it.

You might not be able to reproduce this notebook, mostly because it requires both versions of the ROME in your `data/rome/csv` folder, which only happens just before we switch to v329. You'll have to trust me on the results ;-)

In [1]:
import collections
import glob
from os import path
import pandas

rome_path = '../../data/rome/csv'

OLD_VERSION = '328'
NEW_VERSION = '329'

old_version_files = frozenset(glob.glob(rome_path + '/*%s*' % OLD_VERSION))
new_version_files = frozenset(glob.glob(rome_path + '/*%s*' % NEW_VERSION))

First let's check if there are new or deleted files (only matching by file names).

In [2]:
new_files = new_version_files - frozenset(f.replace(OLD_VERSION, NEW_VERSION) for f in old_version_files)
deleted_files = old_version_files - frozenset(f.replace(NEW_VERSION, OLD_VERSION) for f in new_version_files)

print('%d new files' % len(new_files))
print('%d deleted files' % len(deleted_files))

0 new files
0 deleted files


So we have the same set of files: good start.
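The name-based comparison above works by swapping the version string in each path before taking a set difference. A minimal sketch of the same idea on made-up file names (the `diff_file_sets` helper and the file names are hypothetical, used only for illustration):

```python
def diff_file_sets(old_files, new_files, old_version, new_version):
    """Return (added, removed) files, matching names modulo the version string."""
    old_as_new = frozenset(f.replace(old_version, new_version) for f in old_files)
    new_as_old = frozenset(f.replace(new_version, old_version) for f in new_files)
    added = frozenset(new_files) - old_as_new
    removed = frozenset(old_files) - new_as_old
    return added, removed


# Hypothetical file names, for illustration only.
old_files = {'unix_item_v328_utf8.csv', 'unix_texte_v328_utf8.csv'}
new_files = {'unix_item_v329_utf8.csv', 'unix_extra_v329_utf8.csv'}

added, removed = diff_file_sets(old_files, new_files, '328', '329')
print(sorted(added))    # ['unix_extra_v329_utf8.csv']
print(sorted(removed))  # ['unix_texte_v328_utf8.csv']
```

Note that `str.replace` swaps every occurrence of the version string, so this approach assumes the version number does not also appear elsewhere in the path.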

Now let's set up a dataset that, for each table, links the old file and the new file.

In [3]:
new_to_old = dict((f, f.replace(NEW_VERSION, OLD_VERSION)) for f in new_version_files)

# Load all datasets.
Dataset = collections.namedtuple('Dataset', ['basename', 'old', 'new'])
data = [Dataset(
        basename=path.basename(f),
        old=pandas.read_csv(new_to_old[f]),
        new=pandas.read_csv(f))
    for f in sorted(new_version_files)]

def find_dataset_by_name(data, partial_name):
    for dataset in data:
        if partial_name in dataset.basename:
            return dataset
    raise ValueError('No dataset named %s, the list is\n%s' % (partial_name, [d.basename for d in data]))

Let's make sure the structure hasn't changed:

In [4]:
for dataset in data:
    if set(dataset.old.columns) != set(dataset.new.columns):
        print('Columns of %s have changed.' % dataset.basename)

All files have the same columns as before: still good.
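Had the check above printed anything, the next step would be to see which columns differ. A possible sketch on toy data (the `column_changes` helper and both frames are hypothetical, not part of the notebook):

```python
import pandas


def column_changes(old, new):
    """Return (added, removed) column names between two DataFrames."""
    old_columns = set(old.columns)
    new_columns = set(new.columns)
    return sorted(new_columns - old_columns), sorted(old_columns - new_columns)


# Toy frames, for illustration only.
old = pandas.DataFrame({'code_ogr': [1], 'statut': [1]})
new = pandas.DataFrame({'code_ogr': [1], 'libelle': ['a']})

added_columns, removed_columns = column_changes(old, new)
print(added_columns)    # ['libelle']
print(removed_columns)  # ['statut']
```

Like the set comparison in the cell above, this ignores column order and column types.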

In [5]:
untouched = 0
for dataset in data:
    diff = len(dataset.new.index) - len(dataset.old.index)
    if diff > 0:
        print('%d values added in %s' % (diff, dataset.basename))
    elif diff < 0:
        print('%d values removed in %s' % (-diff, dataset.basename))
    else:
        untouched += 1
print('%d/%d files with the same number of rows' % (untouched, len(data)))

17 values added in unix_coherence_item_v329_utf8.csv
17 values added in unix_cr_gd_dp_appellations_v329_utf8.csv
4 values added in unix_item_arborescence_v329_utf8.csv
9 values added in unix_item_v329_utf8.csv
34 values added in unix_liens_rome_referentiels_v329_utf8.csv
7 values added in unix_referentiel_activite_riasec_v329_utf8.csv
7 values added in unix_referentiel_activite_v329_utf8.csv
17 values added in unix_referentiel_appellation_v329_utf8.csv
2 values added in unix_referentiel_competence_v329_utf8.csv
1 values added in unix_texte_v329_utf8.csv
11/21 files with the same number of rows


So we have minor additions in about half of the files (10 out of 21). At one point we cared about `referentiel_activite` and `referentiel_activite_riasec`, but we have no concrete application for them now.

The only interesting ones are `referentiel_appellation` and `referentiel_competence`, so let's look at them more closely.
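An identical row count does not by itself prove that a file is unchanged: a row could have been edited, or one added while another was removed. A minimal sketch of a content check on toy data (the `same_content` helper and the frames are hypothetical, and this check was not run as part of this notebook):

```python
import pandas


def same_content(old, new):
    """Check whether two DataFrames hold the same rows, ignoring row and column order."""
    if set(old.columns) != set(new.columns) or len(old) != len(new):
        return False
    columns = sorted(old.columns)
    # Sort rows on all columns so that a mere reordering is not reported as a change.
    old_sorted = old[columns].sort_values(columns).reset_index(drop=True)
    new_sorted = new[columns].sort_values(columns).reset_index(drop=True)
    return old_sorted.equals(new_sorted)


# Toy frames, for illustration only.
old = pandas.DataFrame({'code_ogr': [1, 2], 'libelle': ['a', 'b']})
reordered = pandas.DataFrame({'code_ogr': [2, 1], 'libelle': ['b', 'a']})
modified = pandas.DataFrame({'code_ogr': [1, 2], 'libelle': ['a', 'c']})

print(same_content(old, reordered))  # True
print(same_content(old, modified))   # False
```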

In [6]:
jobs = find_dataset_by_name(data, 'referentiel_appellation')
new_ogrs = set(jobs.new.code_ogr) - set(jobs.old.code_ogr)
new_jobs = jobs.new[jobs.new.code_ogr.isin(new_ogrs)]

job_groups = find_dataset_by_name(data, 'referentiel_code_rome_v')
pandas.merge(new_jobs, job_groups.new[['code_rome', 'libelle_rome']], on='code_rome', how='left')

Unnamed: 0,code_ogr,libelle_appellation_long,libelle_appellation_court,code_rome,code_type_section_appellation,libelle_type_section_appellation,statut,libelle_rome
0,38992,Expert / Experte à distance sinistres et domma...,Expert(e) à distance sinistres et dommages en ...,C1107,1,PRINCIPALE,1,Indemnisations en assurances
1,38993,Assistant / Assistante de programmes immobiliers,Assistant / Assistante de programmes immobiliers,C1503,1,PRINCIPALE,1,Management de projet immobilier
2,38994,Assistant / Assistante en promotion immobilière,Assistant / Assistante en promotion immobilière,C1503,1,PRINCIPALE,1,Management de projet immobilier
3,38995,E-merchandiser,E-merchandiser,D1506,1,PRINCIPALE,1,Marchandisage
4,38996,Assistant / Assistante de rédaction,Assistant / Assistante de rédaction,E1106,1,PRINCIPALE,1,Journalisme et information média
5,38997,Chef de publicité online,Chef de publicité online,E1401,1,PRINCIPALE,1,Développement et promotion publicitaire
6,38998,Assistant chargé / Assistante chargée d''affa...,Assistant chargé / Assistante chargée d''affa...,F1106,1,PRINCIPALE,1,Ingénierie et études du BTP
7,38999,Assistant chargé / Assistante chargée d''affai...,Assistant(e) chargé(e) d''affaires BTPgénie cl...,F1106,1,PRINCIPALE,1,Ingénierie et études du BTP
8,39000,Conducteur / Conductrice de travaux en rénovat...,Conducteur / Conductrice de travaux rénovation...,F1201,1,PRINCIPALE,1,Conduite de travaux du BTP
9,39001,Installateur / Installatrice de chauffage bois,Installateur / Installatrice de chauffage bois,F1603,1,PRINCIPALE,1,Installation d''équipements sanitaires et ther...


The new entries look legitimate (many of them are jobs related to new technologies). However, there's a typo in `Infirmier(ière) coordinateur(trcie) en établis...`: the shortened feminine form should be `(trice)` instead of `(trcie)`.
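A typo like this one could be located and patched with literal string matching. The sketch below uses toy rows and assumes the typo sits in the `libelle_appellation_court` column; the second label is made up to carry the typo and is not an actual ROME entry:

```python
import pandas

# Toy rows, for illustration only; the second label is invented.
jobs_sample = pandas.DataFrame({
    'code_ogr': [1, 2],
    'libelle_appellation_court': [
        'E-merchandiser',
        'Coordinateur(trcie) de test',
    ],
})

TYPO, FIX = '(trcie)', '(trice)'

# regex=False so that the parentheses are matched literally.
has_typo = jobs_sample.libelle_appellation_court.str.contains(TYPO, regex=False)
print(jobs_sample[has_typo].code_ogr.tolist())  # [2]

# Build a patched copy and leave the original frame untouched.
patched = jobs_sample.assign(
    libelle_appellation_court=jobs_sample.libelle_appellation_court.str.replace(
        TYPO, FIX, regex=False))
print(patched.libelle_appellation_court.tolist())
# ['E-merchandiser', 'Coordinateur(trice) de test']
```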

In [7]:
competences = find_dataset_by_name(data, 'referentiel_competence')
new_ogrs = set(competences.new.code_ogr) - set(competences.old.code_ogr)
competences.new[competences.new.code_ogr.isin(new_ogrs)]

Unnamed: 0,code_ogr,libelle_competence,code_type_competence,libelle_type_competence,statut
1848,39009,Techniques de l''expertise à distance en assur...,1,SAVOIRS THEORIQUES ET PROCEDURAUX,1
2563,39010,Utilisation de logiciels immobiliers,2,SAVOIRS DE L''ACTION,1


The new entries also look legitimate (including one related to new technologies).

# Conclusion
The new version of ROME, v329, doesn't introduce any major changes: mainly the addition of a few new rows in existing files. However, it introduces a typo. If we can patch it properly, we should switch all our notebooks and scripts to v329: the transition should be transparent, with a very small advantage for the new version.

This confirms the [Changelog](http://www.pole-emploi.org/front/common/tools/load_file.html?galleryId=53360) written by Pôle Emploi.