# Homogenous Job Groups

Author: `guillaume@bayes.org`

Date: `2017-01-04`

Can people easily switch between the jobs of a job group? To find out, we look in several datasets for clues that a job group might be non-homogenous. We are going to use three datasets:

## Mobility

We look at the job mobility links and investigate whether they differ from the group mobility links, both for inlinks (mobility links towards a given job in a group) and outlinks (mobility links from a given job in the job group). Notebook on job mobility [here](./ROME_mobility_similarity.ipynb).

## FAP

If one job group is split between several FAP groups, it is an indication that it might not be homogenous.

## Salary

Are job groups that have very different salaries not homogenous?

## Imports and renaming

First, a few imports and renaming:

In [1]:
%matplotlib inline

import codecs

from os import path

import pandas as pd
import seaborn as _
from bob_emploi.lib import cleaned_data
from bob_emploi.lib import read_data

rome_version = 'v329'
data_folder = '../../../data'
rome_folder = path.join(data_folder, 'rome/csv')
mobility_csv = path.join(rome_folder, 'unix_rubrique_mobilite_%s_utf8.csv' % rome_version)
salaries = pd.read_csv('../../../data/fhs_salaries.csv', dtype={'departement_id': str})
fap_names = read_data.parse_intitule_fap('../../../data/intitule_fap2009.txt')
with codecs.open('../../../data/crosswalks/passage_fap2009_romev3.txt', 'r', 'latin-1') as fap_file:
    fap_romeq_mapping = read_data.parse_fap_rome_crosswalk(fap_file.readlines())

jobs_names = cleaned_data.rome_jobs(data_folder=data_folder).name
job_groups = cleaned_data.rome_job_groups(data_folder=data_folder).name
mobility = pd.read_csv(mobility_csv)

mobility.head()

Unnamed: 0,code_rome,code_rome_cible,code_appellation_source,code_appellation_cible,code_type_mobilite,libelle_type_mobilite
0,A1101,A1416,,,1,Proche
1,A1202,A1203,,,1,Proche
2,A1203,A1202,,,1,Proche
3,A1203,A1405,,,1,Proche
4,A1203,A1414,,,1,Proche


Let's add the ROME names:

In [2]:
# Rename columns.
mobility.rename(columns={
        'code_rome': 'group_source',
        'code_appellation_source': 'job_source',
        'code_rome_cible': 'group_target',
        'code_appellation_cible': 'job_target',
    }, inplace=True)

# Add names.
mobility['group_source_name'] = mobility['group_source'].map(job_groups)
mobility['group_target_name'] = mobility['group_target'].map(job_groups)
mobility['job_source_name'] = mobility['job_source'].map(jobs_names)
mobility['job_target_name'] = mobility['job_target'].map(jobs_names)
mobility = mobility[[
        'group_source', 'group_source_name', 'job_source', 'job_source_name',
        'group_target', 'group_target_name', 'job_target', 'job_target_name',
        'code_type_mobilite', 'libelle_type_mobilite'
    ]]
mobility.head()

Unnamed: 0,group_source,group_source_name,job_source,job_source_name,group_target,group_target_name,job_target,job_target_name,code_type_mobilite,libelle_type_mobilite
0,A1101,Conduite d'engins agricoles et forestiers,,,A1416,"Polyculture, élevage",,,1,Proche
1,A1202,Entretien des espaces naturels,,,A1203,Entretien des espaces verts,,,1,Proche
2,A1203,Entretien des espaces verts,,,A1202,Entretien des espaces naturels,,,1,Proche
3,A1203,Entretien des espaces verts,,,A1405,Arboriculture et viticulture,,,1,Proche
4,A1203,Entretien des espaces verts,,,A1414,Horticulture et maraîchage,,,1,Proche


## Homogenous from mobility inlinks

Let's compute all the jobs that have mobility links pointing towards them. We assume that the corresponding job group is not homogenous, otherwise the mobility link would have pointed towards the whole group.

In [3]:
# Only keep links pointing to a specific job inside a group.
job_targets = mobility[mobility.job_target.notnull()][['job_source', 'group_source', 'job_target', 'group_target']]

job_targets.head()

Unnamed: 0,job_source,group_source,job_target,group_target
5,,A1204,11059.0,G1202
6,,A1204,15194.0,K1707
30,,A1412,14761.0,A1413
31,,A1412,17423.0,A1413
32,,A1412,17425.0,A1413


Let's see which job groups are homogenous:

In [4]:
# Every group targeted by a job-level link gets False, as long as group_source is filled in.
homogenous_according_to_inlink = job_targets.groupby('group_target').first().isnull()['group_source']
homogenous_according_to_inlink.name = 'homogenous'

# Setting a value for all job groups.
homogenous_according_to_inlink = homogenous_according_to_inlink.reindex(job_groups.index).fillna(True)

inlink = homogenous_according_to_inlink.to_frame().join(job_groups)
inlink[inlink.homogenous].head()

Unnamed: 0_level_0,homogenous,name
code_rome,Unnamed: 1_level_1,Unnamed: 2_level_1
A1204,True,Protection du patrimoine naturel
D1301,True,Management de magasin de détail
A1205,True,Sylviculture
D1501,True,Animation de vente
D1504,True,Direction de magasin de grande distribution


And not homogenous:

In [5]:
inlink[~inlink.homogenous].head()

Unnamed: 0_level_0,homogenous,name
code_rome,Unnamed: 1_level_1,Unnamed: 2_level_1
D1214,False,Vente en habillement et accessoires de la pers...
D1401,False,Assistanat commercial
D1402,False,Relation commerciale grands comptes et entrepr...
D1403,False,Relation commerciale auprès de particuliers
D1404,False,Relation commerciale en vente de véhicules


In [6]:
print('We have ' + repr(homogenous_according_to_inlink.sum()) + ' homogenous groups according to inlinks.')

We have 319 homogenous groups according to inlinks.


## Homogenous from mobility outlinks

Now let's check the exact opposite: job groups that have links starting from a job inside them.

In [7]:
# Only keep links starting from a specific job inside a group.
jobs_sources = mobility[mobility.job_source.notnull()][['job_source', 'group_source', 'job_target', 'group_target']]

jobs_sources.head()

Unnamed: 0,job_source,group_source,job_target,group_target
38,17010.0,A1413,,D1501
80,15405.0,B1303,15415.0,B1603
184,10868.0,D1102,,G1604
187,10868.0,D1102,,H3303
214,10259.0,D1201,,D1214


Let's see which job groups are homogenous:

In [8]:
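# Extract groups: every group with a link starting from one of its jobs gets False,
# as long as group_target is filled in.
homogenous_according_to_outlink = jobs_sources.groupby('group_source').first().isnull()['group_target']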
homogenous_according_to_outlink.name = 'homogenous'

# Setting a value for all groups.
homogenous_according_to_outlink = homogenous_according_to_outlink.reindex(job_groups.index).fillna(True)

outlink = homogenous_according_to_outlink.to_frame().join(job_groups)
outlink[outlink.homogenous].head()

Unnamed: 0_level_0,homogenous,name
code_rome,Unnamed: 1_level_1,Unnamed: 2_level_1
A1204,True,Protection du patrimoine naturel
D1301,True,Management de magasin de détail
D1401,True,Assistanat commercial
D1403,True,Relation commerciale auprès de particuliers
D1404,True,Relation commerciale en vente de véhicules


And not homogenous:

In [9]:
outlink[~outlink.homogenous].head()

Unnamed: 0_level_0,homogenous,name
code_rome,Unnamed: 1_level_1,Unnamed: 2_level_1
D1214,False,Vente en habillement et accessoires de la pers...
D1402,False,Relation commerciale grands comptes et entrepr...
D1408,False,Téléconseil et télévente
E1102,False,"Ecriture d'ouvrages, de livres"
E1104,False,Conception de contenus multimédias


In [10]:
print('We have ' + repr(homogenous_according_to_outlink.sum()) + ' homogenous groups according to outlinks.')

We have 451 homogenous groups according to outlinks.
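
Both mobility heuristics reduce to the same operation: find the groups that appear on the job-level side of at least one link, and mark every other group as homogenous. Here is a minimal sketch on toy data, assuming the renamed column names used above. The helper name `homogenous_from_links` and the toy rows are hypothetical and not part of the notebook:

```python
import pandas as pd

def homogenous_from_links(links, job_col, group_col, all_groups):
    """Flag a group as homogenous when no link refers to a specific job in it."""
    # Groups that appear in at least one link with a job-level endpoint.
    flagged = links.loc[links[job_col].notnull(), group_col].unique()
    return pd.Series(~pd.Index(all_groups).isin(flagged), index=all_groups, name='homogenous')

# Toy mobility links (hypothetical rows, same column layout as `mobility`).
toy_links = pd.DataFrame({
    'group_source': ['A1204', 'A1412', 'A1413'],
    'job_source': [None, None, 17010.0],
    'group_target': ['G1202', 'A1413', 'D1501'],
    'job_target': [11059.0, 14761.0, None],
})
toy_groups = ['A1204', 'A1412', 'A1413', 'D1501', 'G1202']

inlink_flags = homogenous_from_links(toy_links, 'job_target', 'group_target', toy_groups)
outlink_flags = homogenous_from_links(toy_links, 'job_source', 'group_source', toy_groups)
print(inlink_flags)
print(outlink_flags)
```

Passing the full list of groups up front plays the same role as the `reindex(...).fillna(True)` step in the cells above: groups without any job-level link default to homogenous.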


## Homogenous according to FAP

FAP is another way to group jobs. There is a mapping from FAP to ROME, see [here](../../datasets/rome/ROME-FAP_Mapping.ipynb). If a ROME group maps to several FAP groups, we assume that the ROME group is not homogenous.

First we extract the ROME to FAP mapping:

In [11]:
# parse_fap_rome_crosswalk actually gives qualified ROME codes.
fap_romeq_mapping = fap_romeq_mapping.rename(columns={'rome': 'romeQ'})

fap_romeq_mapping['rome'] = fap_romeq_mapping['romeQ'].apply(lambda s: s[:5])
fap_mapping = fap_romeq_mapping.groupby(['rome','fap'], as_index=False).first()
del fap_mapping['romeQ']

flatten_mapping = fap_mapping.groupby('rome', as_index=False).agg({'fap': lambda x: sorted(x.tolist())})
flatten_mapping['homogenous'] = flatten_mapping.fap.apply(lambda x: len(x) == 1)

flatten_mapping.head()

Unnamed: 0,rome,fap,homogenous
0,A1101,[A0Z43],True
1,A1201,[A0Z42],True
2,A1202,[A1Z41],True
3,A1203,[A1Z41],True
4,A1204,"[A0Z42, G1Z70, H0Z91]",False


Now we flag every ROME group that maps to several FAP groups as not homogenous:

In [12]:
# Keep only the ROME groups mapped to several FAP groups (homogenous is False for all of them).
homogenous_according_to_fap = flatten_mapping[~flatten_mapping.homogenous].set_index('rome')['homogenous']

# Setting a value for all job groups.
homogenous_according_to_fap = homogenous_according_to_fap.reindex(job_groups.index).fillna(True)

fap = homogenous_according_to_fap.to_frame().join(job_groups)
fap[~fap.homogenous].head()

Unnamed: 0_level_0,homogenous,name
code_rome,Unnamed: 1_level_1,Unnamed: 2_level_1
A1204,False,Protection du patrimoine naturel
D1401,False,Assistanat commercial
D1407,False,Relation technico-commerciale
D1501,False,Animation de vente
D1506,False,Marchandisage


Let's see the homogenous ones:

In [13]:
fap[fap.homogenous].head()

Unnamed: 0_level_0,homogenous,name
code_rome,Unnamed: 1_level_1,Unnamed: 2_level_1
D1214,True,Vente en habillement et accessoires de la pers...
D1301,True,Management de magasin de détail
D1402,True,Relation commerciale grands comptes et entrepr...
D1403,True,Relation commerciale auprès de particuliers
D1404,True,Relation commerciale en vente de véhicules


In [14]:
print('We have ' + repr(homogenous_according_to_fap.sum()) + ' homogenous groups according to fap.')

We have 330 homogenous groups according to fap.
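
The FAP criterion amounts to counting the distinct FAP codes per ROME group. Here is a minimal sketch of that idea with `groupby(...).nunique()`, using a toy crosswalk built from rows of the mapping table shown above. The variable names are hypothetical:

```python
import pandas as pd

# Toy crosswalk: one row per (ROME group, FAP code) pair.
toy_crosswalk = pd.DataFrame({
    'rome': ['A1101', 'A1202', 'A1204', 'A1204', 'A1204'],
    'fap': ['A0Z43', 'A1Z41', 'A0Z42', 'G1Z70', 'H0Z91'],
})

# A ROME group is considered homogenous when it maps to exactly one FAP code.
fap_count = toy_crosswalk.groupby('rome')['fap'].nunique()
toy_homogenous = fap_count == 1
print(toy_homogenous)
```

This gives the same flags as the `len(x) == 1` test on the sorted lists, without materializing the lists.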


## Homogenous according to FHS

The FHS contains all salaries asked by jobseekers. We are going to see if we can determine the homogeneity of a job group from how similar the salaries inside it are.

Let's only keep salaries that are reasonable, and count the total number of jobseekers by department/rome:

In [15]:
reasonable_salaries = salaries[(salaries.salary_high > 10000) & (salaries.salary_high < 5000000)]
jobseekers = reasonable_salaries.groupby(['departement_id', 'code_rome']).sum()['count'].reset_index()

# Going to country level.
counts = jobseekers.groupby('code_rome').sum()
counts['name'] = job_groups

jobseekers.head()

Unnamed: 0,departement_id,code_rome,count
0,2,A1101,18
1,2,A1201,9
2,2,A1202,45
3,2,A1203,165
4,2,A1204,1


Let's define two homogeneity measures based on salaries:  

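* Homogeneity

Defined as median / (max - min), this shows how large the median is compared to the spread between min and max: the higher the value, the more concentrated the salaries.

* Strangeness

Defined as (max - median) / (500 + median - min), this shows how far the max is from the median compared to the min. We add 500 to the denominator to avoid infinite strangeness when min = median. That puts an upper bound of (max - median) / 500 on the strangeness.

In the code below, min, median and max are approximated by the salaries reached at 20%, 50% and 80% of the cumulative jobseeker count.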
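
The `min`, `median` and `max` in these definitions are not computed on individual salaries but on salary brackets weighted by a jobseeker count. Below is a minimal sketch of a count-weighted quantile on toy data. The helper name `weighted_quantile` and the toy numbers are hypothetical, and the sketch uses a common "first value whose cumulative count reaches the target" rule, which is close to, but not identical to, the `<=` filter used in the notebook's `homogeneity` function:

```python
import pandas as pd

def weighted_quantile(values, counts, q):
    """Smallest value whose cumulative count reaches a fraction q of the total."""
    df = pd.DataFrame({'value': values, 'count': counts}).sort_values('value')
    cumulative = df['count'].cumsum()
    return df.loc[cumulative >= q * df['count'].sum(), 'value'].iloc[0]

# Hypothetical salary brackets and number of jobseekers asking for each.
toy_salaries = [17000, 18000, 20000, 30000]
toy_counts = [12, 20, 13, 5]

low = weighted_quantile(toy_salaries, toy_counts, 0.2)
median = weighted_quantile(toy_salaries, toy_counts, 0.5)
high = weighted_quantile(toy_salaries, toy_counts, 0.8)
print(low, median, high)
```

Using the 20% and 80% levels rather than the true extremes keeps a handful of outlier salaries from dominating the spread.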

In [16]:
def homogeneity(df):
    total_count = df['count'].sum()
    if total_count < 15:
        return None
    cumulative_count = df.sort_values('salary_high')['count'].cumsum()
    # "min", "median" and "max" are taken at 20%, 50% and 80% of the cumulative count.
    salary_min = df[cumulative_count <= 0.2 * total_count].salary_low.max()
    salary_median = df[cumulative_count <= 0.5 * total_count].salary_low.max()
    salary_max = df[cumulative_count <= 0.8 * total_count].salary_high.max()
    return pd.DataFrame([{
        'min': salary_min,
        'median': salary_median,
        'max': salary_max,

        # First metric we define: strangeness as:
        # (max - median) / (500 + median - min)
        'strangeness': (salary_max - salary_median) / (500 + salary_median - salary_min),

        # Second metric we define: homogeneity as:
        # median / (max - min)
        'homogeneity': salary_median / (salary_max - salary_min),
    }])
    
salarie_sum = reasonable_salaries.groupby(['code_rome', 'salary_low', 'salary_high']).sum().reset_index()
salaries_stats = salarie_sum.groupby('code_rome').apply(homogeneity).reset_index().set_index('code_rome')
salaries_stats['job_group_name'] = job_groups
salaries_stats = salaries_stats[['job_group_name', 'min', 'median', 'max', 'homogeneity', 'strangeness']]
salaries_stats.sort_values('homogeneity').head(10)

Unnamed: 0_level_0,job_group_name,min,median,max,homogeneity,strangeness
code_rome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
J1408,Ostéopathie et chiropraxie,13200.0,18000,75000,0.291262,10.754717
L1102,Mannequinat et pose artistique,17800.0,21600,62000,0.488688,9.395349
J1201,Biologie médicale,17500.0,25000,62000,0.561798,4.625
C1105,Études actuarielles en assurances,21600.0,45000,95000,0.613079,2.09205
M1207,Trésorerie et financement,28500.0,40000,90000,0.650407,4.166667
J1102,Médecine généraliste et spécialisée,21600.0,42000,80000,0.719178,1.818182
J1103,Médecine dentaire,18000.0,32000,62000,0.727273,2.068966
F1203,Direction et ingénierie d'exploitation de gise...,24000.0,45000,85000,0.737705,1.860465
C1103,Courtage en assurances,20000.0,32000,62000,0.761905,2.4
C1303,Gestion de portefeuilles sur les marchés finan...,36000.0,46000,95000,0.779661,4.666667
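
As a sanity check, the two metrics can be recomputed by hand from the `min`, `median` and `max` columns. The sketch below does so for the first row of the table above, with the offset of 500 used in the `homogeneity` function. The helper name `salary_metrics` is hypothetical:

```python
def salary_metrics(low, median, high, offset=500):
    """Return (homogeneity, strangeness) for the three salary levels."""
    homogeneity = median / (high - low)
    strangeness = (high - median) / (offset + median - low)
    return homogeneity, strangeness

# First row of the table above: J1408, Ostéopathie et chiropraxie.
homogeneity, strangeness = salary_metrics(13200, 18000, 75000)
print(round(homogeneity, 6), round(strangeness, 6))
# prints: 0.291262 10.754717
```

Both values match the `homogeneity` and `strangeness` columns of that row.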


Unfortunately, even for low homogeneity, it seems that we have groups where the job is the same but salaries vary a lot. Let's look at the other end of the ranking, the groups with the highest homogeneity, and compare with the other metric, "strangeness":

In [17]:
salaries_stats.sort_values('homogeneity').tail(10)

Unnamed: 0_level_0,job_group_name,min,median,max,homogeneity,strangeness
code_rome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
G1605,Plonge en restauration,17000.0,17300,17500,34.6,0.25
D1505,Personnel de caisse,17000.0,17300,17500,34.6,0.25
A1409,Élevage de lapins et volailles,17100.0,17300,17600,34.6,0.428571
K1303,Assistance auprès d'enfants,17000.0,17300,17500,34.6,0.25
K1304,Services domestiques,17000.0,17300,17400,43.25,0.125
J1301,Personnel polyvalent des services hospitaliers,17100.0,17300,17500,43.25,0.285714
K2204,Nettoyage de locaux,17000.0,17300,17400,43.25,0.125
D1507,Mise en rayon libre-service,17200.0,17300,17500,57.666667,0.333333
A1402,Aide agricole de production légumière ou végétale,17200.0,17300,17400,86.5,0.166667
H2413,"Préparation de fils, montage de métiers textiles",,18000,23200,,


Strangeness does not help either: the groups with high strangeness in the first table are again groups where the job is the same but salaries vary a lot.

Each time, the job groups with high strangeness and low homogeneity are the ones that just have very variable salaries, like CEO, Ostéopathie, Mannequinat et pose artistique, Trésorerie et financement, Études actuarielles en assurances.

Those are the same jobs, but with very dissimilar salaries.

So we can't use these metrics to discover non-homogenous groups, and we'll drop them from the analysis.


## Stats summary

Let's see the percentage of homogenous groups according to each metric:

In [18]:
def pretty_percentage(number):
    return '%.2f%%' % (number * 100)

nb_groups = job_groups.count()

print('Percentage of homogenous according to inlinks = ' + pretty_percentage(homogenous_according_to_inlink.sum() / nb_groups))
print('Percentage of homogenous according to outlinks = ' + pretty_percentage(homogenous_according_to_outlink.sum() / nb_groups))
print('Percentage of homogenous according to the FAP = ' + pretty_percentage(homogenous_according_to_fap.sum() / nb_groups))

Percentage of homogenous according to inlinks = 60.08%
Percentage of homogenous according to outlinks = 84.93%
Percentage of homogenous according to the FAP = 62.15%


We see that, for each metric taken separately, most groups are homogenous. Let's now check the intersection.

## Intersection metrics

In order to find a "homogeneity" metric that has decent precision and recall, we propose to look at the intersection of the different metrics we defined above.

In [19]:
# Combining all the homogenous metrics we obtained.
homogenous_groups_inlink_outlink = homogenous_according_to_outlink & homogenous_according_to_inlink
homogenous_groups_inlink_fap = homogenous_according_to_inlink & homogenous_according_to_fap
homogenous_groups_outlink_fap = homogenous_according_to_outlink & homogenous_according_to_fap
homogenous_groups_inlink_outlink_fap = homogenous_according_to_outlink & homogenous_according_to_fap & homogenous_groups_inlink_fap
homogenous_groups_inlink_outlink_fap.name = 'homogenous'
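# homogenous_groups_inlink_fap is (inlink & fap), so its negation is always True when ~fap holds:
# this expression is equivalent to ~outlink & ~fap and does not constrain inlinks.
antigenous_groups_inlink_outlink_fap = ~homogenous_according_to_outlink & ~homogenous_according_to_fap & ~homogenous_groups_inlink_fap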
antigenous_groups_inlink_outlink_fap.name = 'antigenous'

print('Percentage of homogenous groups using inlinks and outlinks = ' + pretty_percentage(homogenous_groups_inlink_outlink.sum() / nb_groups))
print('Percentage of homogenous groups using inlinks and fap = ' + pretty_percentage(homogenous_groups_inlink_fap.sum() / nb_groups))
print('Percentage of homogenous groups using outlinks and fap = ' + pretty_percentage(homogenous_groups_outlink_fap.sum() / nb_groups))
print('Percentage of homogenous groups using in/outlinks + fap = ' + pretty_percentage(homogenous_groups_inlink_outlink_fap.sum() / nb_groups))

Percentage of homogenous groups using inlinks and outlinks = 55.74%
Percentage of homogenous groups using inlinks and fap = 36.53%
Percentage of homogenous groups using outlinks and fap = 53.48%
Percentage of homogenous groups using in/outlinks + fap = 34.09%
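
Combining the heuristics is an element-wise AND on boolean Series that share the same index. Here is a minimal sketch on four groups, with flags copied from the sample outputs shown earlier in this notebook. The variable names are hypothetical:

```python
import pandas as pd

groups = ['A1204', 'D1214', 'D1301', 'D1401']
by_inlink = pd.Series([True, False, True, False], index=groups)
by_outlink = pd.Series([True, False, True, True], index=groups)
by_fap = pd.Series([False, True, True, False], index=groups)

# Element-wise AND keeps the groups that every heuristic considers homogenous.
all_three = by_inlink & by_outlink & by_fap
# The mean of a boolean Series is the share of True values.
print('%.2f%%' % (all_three.mean() * 100))
# prints: 25.00%
```

Only D1301 passes all three tests here, which is consistent with it appearing in the list of homogenous groups.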


Examples of homogenous groups:

In [20]:
homogenous = homogenous_groups_inlink_outlink_fap.to_frame().join(job_groups)
homogenous[homogenous.homogenous].head(10)

Unnamed: 0_level_0,homogenous,name
code_rome,Unnamed: 1_level_1,Unnamed: 2_level_1
D1301,True,Management de magasin de détail
A1205,True,Sylviculture
D1504,True,Direction de magasin de grande distribution
D1508,True,Encadrement du personnel de caisses
D1509,True,Management de département en grande distribution
E1105,True,Coordination d'édition
E1201,True,Photographie
E1204,True,Projection cinéma
E1303,True,Encadrement des industries graphiques
E1308,True,Intervention technique en industrie graphique


Example of the most non-homogenous groups:

In [21]:
antigenous = antigenous_groups_inlink_outlink_fap.to_frame().join(job_groups)
antigenous[antigenous.antigenous].head(10)

Unnamed: 0_level_0,antigenous,name
code_rome,Unnamed: 1_level_1,Unnamed: 2_level_1
F1102,True,Conception - aménagement d'espaces intérieurs
F1604,True,Montage d'agencements
F1606,True,Peinture en bâtiment
F1608,True,Pose de revêtements rigides
F1610,True,Pose et restauration de couvertures
F1702,True,Construction de routes et voies
F1705,True,Pose de canalisations
F1706,True,Préfabrication en béton industriel
H2102,True,Conduite d'équipement de production alimentaire
H2201,True,Assemblage d'ouvrages en bois


Using the intersection of inlinks, outlinks and FAP seems to be a good compromise between precision and recall.


## Homogenous groups coverage

We are going to look at the size of homogenous groups to see if they cover a lot of jobseekers or not.

In [22]:
job_seekers_count = salaries.groupby(['code_rome'])['count'].sum()
job_seekers_count.name = 'job_seekers'
job_seekers_count = job_seekers_count.to_frame().join(homogenous)

nb_seekers = job_seekers_count['job_seekers'].sum()
nb_homogenous_seekers = job_seekers_count[job_seekers_count.homogenous]['job_seekers'].sum()

print("Percentage of users in homogenous groups: " + pretty_percentage(nb_homogenous_seekers/nb_seekers))

Percentage of users in homogenous groups: 38.10%


In [23]:
nb_homogenous_groups = len(job_seekers_count[job_seekers_count.homogenous]['job_seekers'])

print("Average percentage of seekers per homogenous group: " + pretty_percentage(nb_homogenous_seekers / (nb_homogenous_groups * nb_seekers)))

Average percentage of seekers per homogenous group: 0.21%


In [24]:
nb_non_homogenous_seekers = nb_seekers - nb_homogenous_seekers
nb_non_homogenous_groups = nb_groups - nb_homogenous_groups

print("Average percentage of seekers per non-homogenous group: " + pretty_percentage(nb_non_homogenous_seekers / (nb_non_homogenous_groups * nb_seekers)))

Average percentage of seekers per non-homogenous group: 0.18%


In [25]:
seekers_per_group = nb_seekers / float(nb_groups)
seekers_per_homogenous_group = nb_homogenous_seekers / float(nb_homogenous_groups)
size_ratio = seekers_per_homogenous_group / seekers_per_group

print("Relative size of homogenous groups compared to the average group: " + pretty_percentage(size_ratio))

Relative size of homogenous groups compared to the average group: 112.39%


Homogenous groups are on average about 12% larger than the average group, even though together they cover only 38% of jobseekers.
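
The ratio computed in the last cell compares averages per group, which is a different quantity from the share of jobseekers covered. Here is a minimal sketch of the distinction with hypothetical numbers. The helper name `relative_group_size` is not part of the notebook:

```python
def relative_group_size(seekers_in_subset, groups_in_subset, seekers_total, groups_total):
    """Average seekers per group in the subset, relative to the overall average."""
    return (seekers_in_subset / groups_in_subset) / (seekers_total / groups_total)

# Hypothetical numbers: 2 of 10 groups hold 300 of 1000 jobseekers.
coverage = 300 / 1000
ratio = relative_group_size(300, 2, 1000, 10)
print('%.2f%% of seekers, relative size %.2f%%' % (coverage * 100, ratio * 100))
# prints: 30.00% of seekers, relative size 150.00%
```

A subset can therefore cover a minority of jobseekers while its groups are larger than average.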

# Conclusions

* Salary does not give good information on the homogeneity of a job group. It could be used if we want super homogenous groups with high precision and low recall.

* There are relatively small overlaps between homogeneity metrics (around 25%).

* Combining in/outlinks + FAP, we find that 34% of job groups are homogenous, for instance "Management de magasin de détail" and "Sylviculture". Some of the most antigenous groups according to those metrics are "Conception - aménagement d'espaces intérieurs" and "Montage d'agencements".

* Surprisingly, homogenous groups cover more people than the average group.


## Further work

- If we want more precision on homogeneity at the expense of recall, we can take into account salaries that are very homogenous.
- Draw a Venn diagram of the intersection of the different heuristics.
- Upload the inlinks/outlinks/FAP/salaries to the AirTable.
- Do some manual evaluation of groups to see how homogenous they are.