# Model tuning

This notebook illustrates the successive steps of tuning the model.


## Preamble

### Imports

In [1]:
# setting up sys.path for relative imports
from pathlib import Path
import sys
project_root = str(Path(sys.path[0]).parents[1].absolute())
if project_root not in sys.path:
    sys.path.append(project_root)

In [37]:
# imports and customization of display
# import os
import re
from functools import partial
from itertools import product
# import numpy as np
import pandas as pd
pd.options.display.min_rows = 6
pd.options.display.width=108
# from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
# from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
# from matplotlib import pyplot as plt

from src.pimest import ContentGetter
from src.pimest import PathGetter
from src.pimest import PDFContentParser
from src.pimest import BlockSplitter
from src.pimest import SimilaritySelector
# from src.pimest import custom_accuracy
from src.pimest import text_sim_score
# from src.pimest import text_similarity
# from src.pimest import build_text_processor

### Data acquisition

We load the manually labelled data into a dataframe.

In [3]:
ground_truth_df = pd.read_csv(Path('..') / '..' / 'ground_truth' / 'manually_labelled_ground_truth.csv',
                              sep=';',
                              encoding='latin-1',
                              index_col='uid')
ground_truth_uids = list(ground_truth_df.index)

acqui_pipe = Pipeline([('PathGetter', PathGetter(ground_truth_uids=ground_truth_uids,
                                                  train_set_path=Path('..') / '..' / 'ground_truth',
                                                  ground_truth_path=Path('..') / '..' / 'ground_truth',
                                                  )),
                        ('ContentGetter', ContentGetter(missing_file='to_nan')),
                        ('ContentParser', PDFContentParser(none_content='to_empty')),
                       ],
                       verbose=True)

texts_df = acqui_pipe.fit_transform(ground_truth_df)
texts_df['ingredients'] = texts_df['ingredients'].fillna('')
texts_df

[Pipeline] ........ (step 1 of 3) Processing PathGetter, total=   0.1s
[Pipeline] ..... (step 2 of 3) Processing ContentGetter, total=   0.6s
Launching 8 processes.
[Pipeline] ..... (step 3 of 3) Processing ContentParser, total=  36.7s


uid,designation,ingredients,path,content,text
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ...",../../ground_truth/a0492df6-9c76-4303-8813-65e...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,Concentré Liquide Asian CHEF® \n\nBouteille de...
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de blé T65, eau, levure, vinaigre de ci...",../../ground_truth/d183e914-db2f-4e2f-863a-a3b...,b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n4 0 obj\r<</L...,
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,- 100% Semoule de BLE dur de qualité supérieur...,../../ground_truth/ab48a1ed-7a3d-4686-bb6d-ab4...,b'%PDF-1.4\n%\xc7\xec\x8f\xa2\n5 0 obj\n<</Len...,Direction Qualité \n\n \n\n \n\nPATES ALIMENTA...
...,...,...,...,...,...
e67341d8-350f-46f4-9154-4dbbb8035621,PRÉPARATION POUR CRÈME BRÛLÉE BIO 6L,"Sucre roux de canne*° (64%), amidon de maïs*, ...",../../ground_truth/e67341d8-350f-46f4-9154-4db...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE \n\nCREME BRÛLÉE 6L \n\nREF : ...
a8f6f672-20ac-4ff8-a8f2-3bc4306c8df3,Céréales instantanées en poudre saveur caramel...,"Farine 87,1 % (Blé (GLUTEN), Blé hydrolysé (GL...",../../ground_truth/a8f6f672-20ac-4ff8-a8f2-3bc...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,81 rue de Sans Souci – CS13754 – 69576 Limones...
0faad739-ea8c-4f03-b62e-51ee592a0546,"FARINE DE BLÉ TYPE 45, 10KG",Farine de blé T45,../../ground_truth/0faad739-ea8c-4f03-b62e-51e...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,\n1050/10502066400 \n\n10502055300/1050202520...


### Train / Test split

We will run a grid search to find the best parameters for our model.
To avoid overestimating the model's performance, the test set must be kept strictly separate from the training set, including during the grid search!

In [4]:
train, test = train_test_split(texts_df, test_size=100, random_state=42)

From here on, only the training set is used to tune the hyperparameters.
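As a minimal sketch of this discipline, the toy example below tunes on the training split only and touches the held-out split once at the end. The synthetic data and the `LogisticRegression` estimator are assumptions chosen for illustration; they are not part of this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the labelled documents (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=42)

# Hyperparameters are selected by cross-validation on the training set only.
toy_search = GridSearchCV(LogisticRegression(max_iter=1000),
                          {'C': [0.1, 1.0, 10.0]},
                          cv=4)
toy_search.fit(X_train, y_train)

# The held-out test set is used once, after tuning is finished.
test_score = toy_search.score(X_test, y_test)
```

The cross-validated score reported by the search is computed on folds of the training set, so it can be optimistic; the test score is the one to report.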

## Tuning the text splitting function

The goal of this part is to optimize the function that splits texts into blocks. We test a few candidate functions via a grid search.

### Defining the candidate functions

We define the split functions:

In [5]:
# definitions of splitter funcs
splitter_funcs = []
def split_func(text):
    return(text.split('\n\n'))
splitter_funcs.append(split_func)
def split_func(text):
    return(text.split('\n'))
splitter_funcs.append(split_func)
def split_func(text):
    regex = r'\s*\n\s*\n\s*'
    return(re.split(regex, text))
splitter_funcs.append(split_func)
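To make the difference between the three candidates concrete, the sketch below applies them to a short sample string. The distinct function names and the sample text are introduced here for illustration only; the notebook itself keeps the name `split_func` for all three.

```python
import re

def split_double_newline(text):
    # Candidate 1: split on exactly two consecutive newlines.
    return text.split('\n\n')

def split_single_newline(text):
    # Candidate 2: split on every newline.
    return text.split('\n')

def split_blank_lines(text):
    # Candidate 3: split on two newlines, absorbing surrounding whitespace.
    return re.split(r'\s*\n\s*\n\s*', text)

sample = 'A \n\nB\n \nC\nD'
blocks_1 = split_double_newline(sample)   # ['A ', 'B\n \nC\nD']
blocks_2 = split_single_newline(sample)   # ['A ', '', 'B', ' ', 'C', 'D']
blocks_3 = split_blank_lines(sample)      # ['A', 'B', 'C\nD']
```

Only the regex version treats a line containing just spaces as a block separator, and it also strips the whitespace around each separator.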

### Setting up the pipeline

We then build a text processing pipeline.
Since SimilaritySelector takes a pandas.Series as input, we insert a utility function between BlockSplitter (whose transform() method returns a pandas.DataFrame) and SimilaritySelector that selects the 'blocks' column.

In [6]:
def select_col(df, col_name='blocks'):
    return df[col_name].fillna('')
col_selector = FunctionTransformer(select_col)
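The behaviour of this column selector can be checked in isolation. The sketch below runs it on a toy dataframe whose contents are made up for the example; `kw_args` is the standard `FunctionTransformer` argument for passing keyword arguments to the wrapped function.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def select_col(df, col_name='blocks'):
    # Return one column as a Series, with missing values replaced by ''.
    return df[col_name].fillna('')

# Toy dataframe (illustrative values only).
toy = pd.DataFrame({'blocks': ['sucre, sel', None], 'other': [1, 2]})

selector = FunctionTransformer(select_col, kw_args={'col_name': 'blocks'})
selected = selector.fit_transform(toy)
```

Here `selected` is a pandas.Series, which is the input type expected by the next step of the pipeline.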

In [7]:
process_pipe = Pipeline([('Splitter', BlockSplitter()),
                         ('ColumnSelector', col_selector),
                         ('SimilaritySelector', SimilaritySelector())
                        ],
                       verbose=False)

We can check that this pipeline works.
Note that the results are not representative: we fit and predict on the same dataset!

In [8]:
process_pipe.fit(train, train['ingredients'])
process_pipe.predict(train)

Launching 8 processes.
Launching 8 processes.


uid
02d5ceb9-21c2-4965-8f65-309bca7638b2    Café chicorée solubles et fibres de chicorée.\...
bbe72396-6ed4-4df1-935b-0c0a7dbd77dc                                                     
507b428e-e99d-464b-b9d3-50629efe4355    COMPOSITION\nMélange de Blés de pays recommand...
                                                              ...                        
4b28bb17-1f1d-4cbb-ac3b-80227ef248ab    Gluten\nCrustacés\nOeufs\nPoisson\nSoja\nLait\...
d2137dae-ff21-46ec-83be-7400773c6c3b    Amidon modifié de pomme de terre - Fécule de p...
571d98ae-9647-4bd4-ad1a-a497f93987cb    Composition typique (Données inappropriées pou...
Length: 400, dtype: object

### Helper function

The grid search must vary parameters that are packed into dictionaries before being passed to SimilaritySelector.
We write a function that builds the appropriate Cartesian product of these parameters.

In [9]:
def prod_params(dict_to_prod):
    """ 
    In : dict of dicts.
    First level key : parameter name
    Second level key : name of scenario with this parameter value
    Values : parameter value
    
    Returns a tuple: 
        - list of labels to name scenario
        - list of dictionaries to pass to count_vect_kwargs
    """
    label_lists = [list(dict_.keys()) for dict_ in dict_to_prod.values()]
    labels = list(map(lambda x: ', '.join(x), list(product(*label_lists))))
    values_iter = list(product(*[list(dict_.values()) for dict_ in dict_to_prod.values()]))
    parms_names = list(dict_to_prod.keys())
    dict_out = [{key: val for (key, val) in zip(parms_names, values_)} for values_ in values_iter]
    return(labels, dict_out)

In [10]:
prod_params({'stop_words': {'no stopwords removal': None, 'with stopwords removal' : {'de', 'le'}},
             'ngram_ranges': {'no_ngram': (1, 1), 'bigrams': (1, 2)}})

(['no stopwords removal, no_ngram',
  'no stopwords removal, bigrams',
  'with stopwords removal, no_ngram',
  'with stopwords removal, bigrams'],
 [{'stop_words': None, 'ngram_ranges': (1, 1)},
  {'stop_words': None, 'ngram_ranges': (1, 2)},
  {'stop_words': {'de', 'le'}, 'ngram_ranges': (1, 1)},
  {'stop_words': {'de', 'le'}, 'ngram_ranges': (1, 2)}])
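For the list of dictionaries alone, scikit-learn's `ParameterGrid` offers a comparable Cartesian product. The sketch below is an alternative shown for comparison, not a replacement for `prod_params`: it does not generate the scenario labels, and it iterates with keys sorted alphabetically, so the order differs from the output above.

```python
from sklearn.model_selection import ParameterGrid

# Same toy parameters as in the call above, given as lists of values.
grid = ParameterGrid({'stop_words': [None, {'de', 'le'}],
                      'ngram_range': [(1, 1), (1, 2)]})

# Each element is a dict that could be passed as vectorizer kwargs.
kwargs_list = list(grid)
```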

### Applying the grid search: tuning text preprocessing

We then run a grid search that varies the text preprocessing options:
- function used to split document text into blocks
- whether stopwords are removed
- whether n-grams are taken into account
- whether accents are stripped
- as a first comparison only, candidate selection by l2/l1 projection or by cosine similarity

Scoring uses the Levenshtein similarity.

In [11]:
lev_scorer = partial(text_sim_score, similarity='levenshtein')
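The implementation of `text_sim_score` is not shown in this notebook. As a rough illustration of what a Levenshtein-based similarity can look like, the sketch below computes the edit distance by dynamic programming and normalizes it by the longer string's length. Both function names and the normalization are assumptions made for this example; the project's own scorer may differ.

```python
def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    """Similarity in [0, 1]: 1 for identical strings (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest

example = levenshtein_similarity('kitten', 'sitting')  # distance 3, length 7
```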

In [12]:
stop_words = {'pas', 'le', 'en', 'pour', 'ou', 'ce', 'de', 'dans', 'du', 'and', 'un', 'sur', 'et',
              'of', 'est', 'par', 'la', 'les', 'dont', 'au', 'des', 'que'}

In [13]:
ngram_ranges = {'no_ngram': (1, 1), 'bigrams': (1, 2), 'trigrams': (1, 3)}

In [14]:
kwargs_to_prod = prod_params({'stop_words': {'no stopwords removal': None, 'with stopwords removal' : stop_words},
                              'ngram_range': ngram_ranges,
                              'strip_accents': {'keep accents': None, 'remove accents': 'unicode'}
                             }
                            )

In [15]:
param_grid = [{'Splitter__splitter_func': splitter_funcs,
               'SimilaritySelector__similarity': ['projection', 'cosine'],
               'SimilaritySelector__count_vect_kwargs': kwargs_to_prod[1],
              }
             ]
search = GridSearchCV(process_pipe,
                      param_grid,
                      cv=8, 
                      scoring= lev_scorer,
                      n_jobs=-1,
                      verbose=1,
                     ).fit(train, train['ingredients'])

Fitting 8 folds for each of 72 candidates, totalling 576 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   10.9s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   57.2s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  2.3min


Launching 8 processes.


[Parallel(n_jobs=-1)]: Done 576 out of 576 | elapsed:  3.2min finished


In [16]:
labels = list(product(kwargs_to_prod[0], ['Projection l2/l1', 'Cosinus'], ['Split 1', 'Split 2', 'Split 3']))
labels = list(map(lambda x: ', '.join(x), labels))
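Hand-built labels only match the results if they follow the iteration order of the grid. A less fragile option, sketched below on a toy search, is to read the parameters of each candidate from `cv_results_['params']`. The estimator and data are placeholders chosen for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy search standing in for the pipeline search (assumption).
X, y = make_classification(n_samples=120, n_features=5, random_state=0)
toy_search = GridSearchCV(LogisticRegression(max_iter=1000),
                          {'C': [0.1, 1.0], 'fit_intercept': [True, False]},
                          cv=3).fit(X, y)

# One row per candidate: its parameters, then its mean and std test score.
results = pd.DataFrame(toy_search.cv_results_['params'])
results['mean'] = toy_search.cv_results_['mean_test_score']
results['std'] = toy_search.cv_results_['std_test_score']
results = results.sort_values('mean', ascending=False)
```

With this layout each score stays attached to the parameters that produced it, whatever the order in which candidates were evaluated.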

In [17]:
search.best_params_

{'SimilaritySelector__count_vect_kwargs': {'stop_words': {'and',
   'au',
   'ce',
   'dans',
   'de',
   'des',
   'dont',
   'du',
   'en',
   'est',
   'et',
   'la',
   'le',
   'les',
   'of',
   'ou',
   'par',
   'pas',
   'pour',
   'que',
   'sur',
   'un'},
  'ngram_range': (1, 3),
  'strip_accents': 'unicode'},
 'SimilaritySelector__similarity': 'projection',
 'Splitter__splitter_func': <function __main__.split_func(text)>}

In [18]:
for i in range(len(search.cv_results_['rank_test_score'])):
    str_result = f"{search.cv_results_['mean_test_score'][i]:.2%} +/- {search.cv_results_['std_test_score'][i]:.2%}"
    print(labels[i], str_result)

no stopwords removal, no_ngram, keep accents, Projection l2/l1, Split 1 50.15% +/- 5.61%
no stopwords removal, no_ngram, keep accents, Projection l2/l1, Split 2 38.90% +/- 4.52%
no stopwords removal, no_ngram, keep accents, Projection l2/l1, Split 3 52.65% +/- 6.09%
no stopwords removal, no_ngram, keep accents, Cosinus, Split 1 40.30% +/- 4.91%
no stopwords removal, no_ngram, keep accents, Cosinus, Split 2 25.87% +/- 2.25%
no stopwords removal, no_ngram, keep accents, Cosinus, Split 3 41.98% +/- 5.27%
no stopwords removal, no_ngram, remove accents, Projection l2/l1, Split 1 49.47% +/- 5.77%
no stopwords removal, no_ngram, remove accents, Projection l2/l1, Split 2 38.56% +/- 4.19%
no stopwords removal, no_ngram, remove accents, Projection l2/l1, Split 3 52.02% +/- 6.05%
no stopwords removal, no_ngram, remove accents, Cosinus, Split 1 40.93% +/- 5.02%
no stopwords removal, no_ngram, remove accents, Cosinus, Split 2 26.06% +/- 2.00%
no stopwords removal, no_ngram, remove accents, Cosinus,

This first test shows:
- that the model performs much better with stopwords removed
- that the most effective split is the regex-based function (two line breaks surrounded by optional whitespace), i.e. split 3
- that adding bigrams improves performance, while adding trigrams on top brings no further gain
- that cosine similarity appears noticeably less effective than selection by projection (l2/l1)

Note: the standard deviation is fairly high (around 5-6%). Scenarios whose means differ by only 2-3% cannot be told apart by this grid search.
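One simple way to act on this remark is to flag every candidate whose mean lies within one standard deviation of the best mean. The helper below is a hypothetical illustration of that rule of thumb, applied to the first three scores printed above; it is not a formal statistical test.

```python
import numpy as np

def within_one_std_of_best(means, stds):
    """Indices of candidates whose mean is at least best_mean - best_std."""
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    best = int(np.argmax(means))
    threshold = means[best] - stds[best]
    return [i for i, mean in enumerate(means) if mean >= threshold]

# Split 1, Split 2, Split 3 (no stopwords removal, no n-grams, projection).
means = [0.5015, 0.3890, 0.5265]
stds = [0.0561, 0.0452, 0.0609]
tied = within_one_std_of_best(means, stds)  # [0, 2]
```

Under this rule, Split 1 and Split 3 cannot be separated, while Split 2 is clearly behind.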

### Applying the grid search: tuning the similarity computation

Building on the parameters already selected, we now determine the best-performing similarity computation.
Only the projection-based computation is parametric (norm in the source space vs. norm in the projected space), so we vary only these parameters (in addition to the comparison with cosine similarity).
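As a reminder of what the norm names mean, the sketch below computes the l1 to l5 norms of a small vector with NumPy. How SimilaritySelector combines the source norm and the projected norm is not shown in this notebook, so the example only illustrates the norms themselves.

```python
import numpy as np

x = np.array([3.0, 4.0, 0.0])

# lp norm: (sum |x_i|^p)^(1/p); larger p gives more weight to the largest entry.
norms = {f'l{p}': float(np.linalg.norm(x, ord=p)) for p in (1, 2, 3, 4, 5)}
```

For a fixed vector the lp norm decreases as p grows and approaches the largest absolute component (4 here).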

We also compare model performance depending on whether texts are vectorized using word counts or only a binary indicator (presence or absence of the word).
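The difference between the two vectorizations can be seen on a one-line document. The sketch below uses scikit-learn's `CountVectorizer` directly, outside of the project pipeline, with a made-up document.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['sucre sucre sel']

# Vocabulary is sorted alphabetically: column 0 is 'sel', column 1 is 'sucre'.
counts = CountVectorizer(binary=False).fit_transform(docs).toarray()  # [[1, 2]]
flags = CountVectorizer(binary=True).fit_transform(docs).toarray()    # [[1, 1]]
```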

In [19]:
process_pipe.set_params(**{'Splitter__splitter_func': splitter_funcs[2],
                           })

kwargs_to_prod = prod_params({'stop_words': {'with stopwords removal' : stop_words},
                              'ngram_range': {'bigrams': (1, 2)},
                              'binary': {'counts': False, 'binary flag': True},
                              'strip_accents': {'remove accents': 'unicode'}
                             })

param_grid = [{
               'SimilaritySelector__source_norm': ['l2'],
               'SimilaritySelector__projected_norm': ['l1'],
               'SimilaritySelector__similarity': ['projection'],
               'SimilaritySelector__count_vect_kwargs': kwargs_to_prod[1],
              },
              {
               'SimilaritySelector__similarity': ['projection'],
               'SimilaritySelector__source_norm': ['l3'],
               'SimilaritySelector__projected_norm': ['l2'],
               'SimilaritySelector__count_vect_kwargs': kwargs_to_prod[1],
              },
              {
               'SimilaritySelector__similarity': ['projection'],
               'SimilaritySelector__source_norm': ['l3'],
               'SimilaritySelector__projected_norm': ['l1'],
               'SimilaritySelector__count_vect_kwargs': kwargs_to_prod[1],
              },
              {
               'SimilaritySelector__similarity': ['projection'],
               'SimilaritySelector__source_norm': ['l4'],
               'SimilaritySelector__projected_norm': ['l3'],
               'SimilaritySelector__count_vect_kwargs': kwargs_to_prod[1],
              },
              {
               'SimilaritySelector__similarity': ['projection'],
               'SimilaritySelector__source_norm': ['l4'],
               'SimilaritySelector__projected_norm': ['l2'],
               'SimilaritySelector__count_vect_kwargs': kwargs_to_prod[1],
              },
              {
               'SimilaritySelector__similarity': ['projection'],
               'SimilaritySelector__source_norm': ['l5'],
               'SimilaritySelector__projected_norm': ['l4'],
               'SimilaritySelector__count_vect_kwargs': kwargs_to_prod[1],
              },
              {
               'SimilaritySelector__similarity': ['cosine'],
               'SimilaritySelector__count_vect_kwargs': kwargs_to_prod[1],
              }
             ]
search = GridSearchCV(process_pipe,
                      param_grid,
                      cv=8, 
                      scoring= lev_scorer,
                      n_jobs=-1,
                      verbose=1,
                     ).fit(train, train['ingredients'])

Fitting 8 folds for each of 14 candidates, totalling 112 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   12.0s


Launching 8 processes.


[Parallel(n_jobs=-1)]: Done 112 out of 112 | elapsed:   36.6s finished


In [20]:
labels = ['l2, l1',
          'l3, l2', 
          'l3, l1',
          'l4, l3',
          'l4, l2',
          'l5, l4',
          'cosine',
         ]

labels = list(product(labels, kwargs_to_prod[0]))
labels = list(map(lambda x: ', '.join(x), labels))

for i in range(len(search.cv_results_['rank_test_score'])):
    str_result = f"{search.cv_results_['mean_test_score'][i]:.2%} +/- {search.cv_results_['std_test_score'][i]:.2%}"
    print(labels[i], str_result)

l2, l1, with stopwords removal, bigrams, counts, remove accents 60.86% +/- 5.65%
l2, l1, with stopwords removal, bigrams, binary flag, remove accents 61.00% +/- 5.61%
l3, l2, with stopwords removal, bigrams, counts, remove accents 61.55% +/- 5.25%
l3, l2, with stopwords removal, bigrams, binary flag, remove accents 62.05% +/- 4.73%
l3, l1, with stopwords removal, bigrams, counts, remove accents 59.33% +/- 5.34%
l3, l1, with stopwords removal, bigrams, binary flag, remove accents 59.06% +/- 5.44%
l4, l3, with stopwords removal, bigrams, counts, remove accents 59.61% +/- 4.00%
l4, l3, with stopwords removal, bigrams, binary flag, remove accents 62.61% +/- 4.29%
l4, l2, with stopwords removal, bigrams, counts, remove accents 61.11% +/- 5.30%
l4, l2, with stopwords removal, bigrams, binary flag, remove accents 61.00% +/- 5.61%
l5, l4, with stopwords removal, bigrams, counts, remove accents 56.23% +/- 3.40%
l5, l4, with stopwords removal, bigrams, binary flag, remove accents 61.82% +/- 3.67

This second test leads to the following conclusions:
- as in the first test, selecting the best candidate by cosine similarity performs worse than by projection
- several parameter configurations reach similar performance with projection (a trailing b denotes the binary-flag variant):
    - l2/l1
    - l2/l1b
    - l3/l2
    - l3/l2b
    - l3/l1b
    - l4/l2
    - l4/l2b

### Applying the grid search: impact of words unseen during training

We also check whether a HashingVectorizer-type vectorizer, which can account for words not seen during training, affects performance (or its standard deviation, which is very high).
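The sketch below illustrates the difference on toy documents: a vocabulary-based vectorizer silently drops a word it has not seen at fit time, whereas a hashing vectorizer maps any token to a column. The documents and the `n_features` value are arbitrary choices for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

train_docs = ['sucre sel']
new_doc = ['sucre poivre']   # 'poivre' was not seen during fitting

# Vocabulary-based: only 'sucre' is counted in the new document.
count_vect = CountVectorizer().fit(train_docs)
count_total = int(count_vect.transform(new_doc).sum())   # 1

# Hash-based: stateless, so both tokens are counted.
hash_vect = HashingVectorizer(n_features=2**10, norm=None, alternate_sign=False)
hash_total = int(hash_vect.transform(new_doc).sum())     # 2
```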

In [21]:
process_pipe.set_params(**{'Splitter__splitter_func': splitter_funcs[2],

                           })

kwargs_to_prod = prod_params({'stop_words': {'with stopwords removal' : stop_words},
                              'ngram_range': {'bigrams': (1, 2)},
                              'binary': {'counts': False, 'binary flag': True},
                             })

param_grid = [{'SimilaritySelector__count_vect_kwargs': kwargs_to_prod[1],
               'SimilaritySelector__count_vect_type': ['TfidfVectorizer', 'HashingVectorizer'],
               'SimilaritySelector__similarity': ['projection'],
               'SimilaritySelector__source_norm': ['l4'],
               'SimilaritySelector__projected_norm': ['l2'],               
              },
              {'SimilaritySelector__count_vect_kwargs': kwargs_to_prod[1],
               'SimilaritySelector__count_vect_type': ['TfidfVectorizer', 'HashingVectorizer'],
               'SimilaritySelector__similarity': ['projection'],               
               'SimilaritySelector__source_norm': ['l3'],
               'SimilaritySelector__projected_norm': ['l2'],
              },
              {'SimilaritySelector__count_vect_kwargs': kwargs_to_prod[1],
               'SimilaritySelector__count_vect_type': ['TfidfVectorizer', 'HashingVectorizer'],
               'SimilaritySelector__similarity': ['projection'],               
               'SimilaritySelector__source_norm': ['l2'],
               'SimilaritySelector__projected_norm': ['l1'],
              },
              {'SimilaritySelector__count_vect_kwargs': kwargs_to_prod[1],
               'SimilaritySelector__count_vect_type': ['TfidfVectorizer', 'HashingVectorizer'],
               'SimilaritySelector__similarity': ['cosine'],               
              },
             ]
search = GridSearchCV(process_pipe,
                      param_grid,
                      cv=8, 
                      scoring= lev_scorer,
                      n_jobs=-1,
                      verbose=1,
                     ).fit(train, train['ingredients'])

Fitting 8 folds for each of 16 candidates, totalling 128 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   11.5s


Launching 8 processes.


[Parallel(n_jobs=-1)]: Done 128 out of 128 | elapsed:   43.1s finished


In [22]:
labels = [
          'l4/l2',
          'l3/l2',
          'l2/l1',
          'cosine'
         ]

labels = list(product(labels, kwargs_to_prod[0], ['TfidfVectorizer', 'HashingVectorizer']))
labels = list(map(lambda x: ', '.join(x), labels))

for i in range(len(search.cv_results_['rank_test_score'])):
    str_result = f"{search.cv_results_['mean_test_score'][i]:.2%} +/- {search.cv_results_['std_test_score'][i]:.2%}"
    print(labels[i], str_result)

l4/l2, with stopwords removal, bigrams, counts, TfidfVectorizer 61.11% +/- 5.30%
l4/l2, with stopwords removal, bigrams, counts, HashingVectorizer 61.07% +/- 5.23%
l4/l2, with stopwords removal, bigrams, binary flag, TfidfVectorizer 61.00% +/- 5.61%
l4/l2, with stopwords removal, bigrams, binary flag, HashingVectorizer 60.51% +/- 5.54%
l3/l2, with stopwords removal, bigrams, counts, TfidfVectorizer 61.55% +/- 5.25%
l3/l2, with stopwords removal, bigrams, counts, HashingVectorizer 59.97% +/- 5.25%
l3/l2, with stopwords removal, bigrams, binary flag, TfidfVectorizer 62.05% +/- 4.73%
l3/l2, with stopwords removal, bigrams, binary flag, HashingVectorizer 59.58% +/- 5.46%
l2/l1, with stopwords removal, bigrams, counts, TfidfVectorizer 60.86% +/- 5.65%
l2/l1, with stopwords removal, bigrams, counts, HashingVectorizer 60.36% +/- 5.89%
l2/l1, with stopwords removal, bigrams, binary flag, TfidfVectorizer 61.00% +/- 5.61%
l2/l1, with stopwords removal, bigrams, binary flag, HashingVectorizer 60.

Using a HashingVectorizer instead of a TfidfVectorizer, to account for words not seen during training, does not improve model performance.
On the contrary, performance seems to drop overall, by up to a few points.

### Applying the grid search: word weighting

We additionally apply absolute and relative word weighting in the cosine similarity search.

The options for the target vector are:
- mean of the ingredient-list text vectors, using only a binary flag (presence / absence of the word): the target is the mean document frequency of the words in the ingredient lists
- mean of the ingredient-list text vectors, taking into account the word counts in each text: the target is the mean term frequency of the words within the ingredient lists
- mean of the "absolute" scores of each word within the ingredient lists. This is a "smooth document frequency" (it grows logarithmically)
- mean of the "relative" scores of each word between ingredient lists and data sheet contents. Here the document frequency is compared between the two corpora, to give more weight to words that are more frequent in the ingredient lists than in the rest of the text body.
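The first option can be sketched as follows: vectorize the ingredient lists with binary flags, average them into a target vector, then rank candidate blocks by cosine similarity to that target. This is an illustrative reconstruction built on toy strings, under the assumption that selection picks the block with the highest similarity; it is not the SimilaritySelector code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy ingredient lists and candidate blocks (made up for the example).
ingredient_lists = ['sucre, sel', 'sucre, farine de ble']
blocks = ['Ingredients : sucre, sel', 'Poids net : 500 g']

# Binary flags: the mean over documents is the document frequency of each word.
vect = CountVectorizer(binary=True).fit(ingredient_lists)
target = np.asarray(vect.transform(ingredient_lists).mean(axis=0)).reshape(1, -1)

# Rank blocks by cosine similarity to the target vector.
similarities = cosine_similarity(vect.transform(blocks), target).ravel()
best_block = blocks[int(np.argmax(similarities))]
```

Switching to `binary=False` would give the second option (mean term frequency) with the same code.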

We compare against the l4/l2b projection, which is among the best-performing configurations so far.

In [47]:
process_pipe.set_params(**{'Splitter__splitter_func': splitter_funcs[2],
                           'SimilaritySelector__count_vect_type': 'TfidfVectorizer',
                           })

kwargs_to_prod = prod_params({'stop_words': {'with stopwords removal' : stop_words},
                              'ngram_range': {'no_ngrams': (1, 1), 'bigrams': (1, 2)},
                              'binary': {'counts': False, 'binary flag': True},
                              'use_idf': {'without idf': False, 'with idf': True},
                              'strip_accents': {'remove accents': 'unicode'},
                             })

param_grid = [{
               'SimilaritySelector__count_vect_kwargs': kwargs_to_prod[1],
               'SimilaritySelector__scoring': ['default', 'absolute_score', 'relative_score'],    
               'SimilaritySelector__similarity': ['cosine'],
              },
              {'SimilaritySelector__count_vect_kwargs': kwargs_to_prod[1],
               'SimilaritySelector__scoring': ['default'],
               'SimilaritySelector__similarity': ['projection'],
               'SimilaritySelector__source_norm': ['l4'],
               'SimilaritySelector__projected_norm': ['l2'],                   
              },
             ]
search = GridSearchCV(process_pipe,
                      param_grid,
                      cv=8, 
                      scoring= lev_scorer,
                      n_jobs=-1,
                      verbose=1,
                     ).fit(train, train['ingredients'])

Fitting 8 folds for each of 32 candidates, totalling 256 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   13.6s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  1.1min


Launching 8 processes.


[Parallel(n_jobs=-1)]: Done 256 out of 256 | elapsed:  1.5min finished


In [48]:
labels = [
          'cosine',
         ]


labels = list(product(labels, kwargs_to_prod[0], ['default', 'absolute score', 'relative score']))
labels.extend(list(product(['projection l4/l2'],  kwargs_to_prod[0])))
labels = list(map(lambda x: ', '.join(x), labels))

for i in range(len(search.cv_results_['rank_test_score'])):
    str_result = f"{search.cv_results_['mean_test_score'][i]:.2%} +/- {search.cv_results_['std_test_score'][i]:.2%}"
    print(labels[i], str_result)

cosine, with stopwords removal, no_ngrams, counts, without idf, remove accents, default 53.69% +/- 7.13%
cosine, with stopwords removal, no_ngrams, counts, without idf, remove accents, absolute score 54.47% +/- 6.87%
cosine, with stopwords removal, no_ngrams, counts, without idf, remove accents, relative score 29.42% +/- 5.32%
cosine, with stopwords removal, no_ngrams, counts, with idf, remove accents, default 55.47% +/- 6.70%
cosine, with stopwords removal, no_ngrams, counts, with idf, remove accents, absolute score 52.47% +/- 6.91%
cosine, with stopwords removal, no_ngrams, counts, with idf, remove accents, relative score 32.73% +/- 5.21%
cosine, with stopwords removal, no_ngrams, binary flag, without idf, remove accents, default 53.29% +/- 7.86%
cosine, with stopwords removal, no_ngrams, binary flag, without idf, remove accents, absolute score 54.15% +/- 8.17%
cosine, with stopwords removal, no_ngrams, binary flag, without idf, remove accents, relative score 28.02% +/- 5.00%
cosine,

In [49]:
search.cv_results_

{'mean_fit_time': array([1.04464698, 1.11869857, 1.38373205, 0.7387183 , 0.99120578,
        1.4639748 , 0.70843965, 0.8971028 , 1.28756195, 0.74357998,
        0.91837344, 1.26502818, 1.15127203, 1.58092356, 2.05790946,
        1.1457637 , 1.53968149, 2.02482742, 1.18108353, 1.59118012,
        2.13614097, 1.2229169 , 1.59080085, 2.14523304, 0.78127733,
        0.64556399, 0.62803563, 0.61303991, 0.97116473, 1.0362848 ,
        1.06738931, 0.99505311]),
 'std_fit_time': array([0.09646691, 0.12062726, 0.13310404, 0.12761933, 0.10728558,
        0.11898674, 0.10610455, 0.0447356 , 0.11340495, 0.11330933,
        0.05391361, 0.12981145, 0.11105268, 0.08852161, 0.13011447,
        0.15965745, 0.08680976, 0.10783568, 0.09796659, 0.14748077,
        0.14729109, 0.0783453 , 0.11034447, 0.13701008, 0.201603  ,
        0.03673408, 0.01226879, 0.02465827, 0.08458008, 0.07782379,
        0.09005581, 0.08323097]),
 'mean_score_time': array([0.42370096, 0.3385323 , 0.43205762, 0.35798511, 0.382356

We conclude:
- that projection similarity remains the most effective way to select the candidate
- that in this mode, using idf degrades performance
- that, nevertheless, with cosine similarity, using idf has a positive effect for the default and relative scoring functions.
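For reference, the effect of the `use_idf` option can be inspected on scikit-learn's `TfidfVectorizer` alone. In the toy corpus below (made up for the example), a word present in every document gets the minimum idf weight, while rarer words are weighted up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['sucre sel', 'sucre farine']

vect = TfidfVectorizer(use_idf=True).fit(docs)

# Map each word to its idf weight: ln((1 + n) / (1 + df)) + 1 by default.
idf = {word: float(vect.idf_[index]) for word, index in vect.vocabulary_.items()}
```

Here 'sucre' appears in both documents and gets weight 1, while 'sel' and 'farine' get a larger weight.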

### Applying the grid search: word embeddings

We measure the impact of using word embeddings on performance.
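Assuming that the 'tSVD' option refers to a truncated SVD of the document-term matrix, the general idea can be sketched with scikit-learn as below. The toy documents and the number of components are arbitrary, and the way SimilaritySelector wires the embedding internally is not shown in this notebook.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only); it contains 7 distinct words.
docs = ['sucre sel eau',
        'sucre farine de ble',
        'eau sel',
        'farine de ble levure']

tfidf = TfidfVectorizer().fit_transform(docs)

# Project the sparse vectors onto 2 dense components.
svd = TruncatedSVD(n_components=2, random_state=0)
embedded = svd.fit_transform(tfidf)
```

Similarities are then computed in the reduced space instead of the vocabulary space.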

In [25]:
process_pipe.set_params(**{'Splitter__splitter_func': splitter_funcs[2],
                           'SimilaritySelector__count_vect_type': 'TfidfVectorizer',
                           })

kwargs_to_prod = prod_params({'stop_words': {'with stopwords removal' : stop_words},
                              'ngram_range': {'no_ngram': (1, 1)},
                              'binary': {'counts': False, 'binary flag': True},           
                              'strip_accents': {'remove accents': 'unicode'},
                             })

param_grid = [{
               'SimilaritySelector__count_vect_kwargs': kwargs_to_prod[1],
               'SimilaritySelector__scoring': ['default', 'absolute_score', 'relative_score'],    
               'SimilaritySelector__similarity': ['cosine'],
               'SimilaritySelector__embedding_method': [None, 'Word2Vec', 'tSVD'],
              },
             ]
search = GridSearchCV(process_pipe,
                      param_grid,
                      cv=8, 
                      scoring= lev_scorer,
                      n_jobs=-1,
                      verbose=1,
                      error_score='raise',
                     ).fit(train, train['ingredients'])

Fitting 8 folds for each of 18 candidates, totalling 144 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   21.6s


Launching 8 processes.


[Parallel(n_jobs=-1)]: Done 144 out of 144 | elapsed:  1.5min finished


In [26]:
labels = list(product(kwargs_to_prod[0],
                      ['No embed', 'Word2Vec', 'tSVD'],
                      ['default', 'absolute score', 'relative score'],
                     ))
labels = list(map(lambda x: ', '.join(x), labels))

for i in range(len(search.cv_results_['rank_test_score'])):
    str_result = f"{search.cv_results_['mean_test_score'][i]:.2%} +/- {search.cv_results_['std_test_score'][i]:.2%}"
    print(labels[i], str_result)

with stopwords removal, no_ngram, counts, remove accents, No embed, default 53.69% +/- 7.13%
with stopwords removal, no_ngram, counts, remove accents, No embed, absolute score 54.47% +/- 6.87%
with stopwords removal, no_ngram, counts, remove accents, No embed, relative score 29.42% +/- 5.32%
with stopwords removal, no_ngram, counts, remove accents, Word2Vec, default 53.57% +/- 4.97%
with stopwords removal, no_ngram, counts, remove accents, Word2Vec, absolute score 52.97% +/- 5.81%
with stopwords removal, no_ngram, counts, remove accents, Word2Vec, relative score 10.18% +/- 2.67%
with stopwords removal, no_ngram, counts, remove accents, tSVD, default 50.95% +/- 7.02%
with stopwords removal, no_ngram, counts, remove accents, tSVD, absolute score 50.11% +/- 7.72%
with stopwords removal, no_ngram, counts, remove accents, tSVD, relative score 8.43% +/- 1.01%
with stopwords removal, no_ngram, binary flag, remove accents, No embed, default 53.29% +/- 7.86%
with stopwords removal, no_ngram, bi

In [28]:
search.best_params_

{'SimilaritySelector__count_vect_kwargs': {'stop_words': {'and',
   'au',
   'ce',
   'dans',
   'de',
   'des',
   'dont',
   'du',
   'en',
   'est',
   'et',
   'la',
   'le',
   'les',
   'of',
   'ou',
   'par',
   'pas',
   'pour',
   'que',
   'sur',
   'un'},
  'ngram_range': (1, 1),
  'binary': True,
  'strip_accents': 'unicode'},
 'SimilaritySelector__embedding_method': 'tSVD',
 'SimilaritySelector__scoring': 'absolute_score',
 'SimilaritySelector__similarity': 'cosine'}

### Random search: final validation of all criteria

Finally, we run a random search to check whether the conclusions drawn from the systematic explorations of specific sub-domains hold up.
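The sketch below shows the same mechanism on a toy problem: `RandomizedSearchCV` accepts a list of dictionaries and, when every value is a list, draws `n_iter` distinct candidates from the combined grid. Setting `random_state` makes the draw reproducible. The estimator, data and parameter values are placeholders for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=120, n_features=5, random_state=0)

# Two sub-grids, 6 candidates in total; 4 of them are sampled.
param_distributions = [
    {'C': [0.01, 0.1, 1.0, 10.0], 'fit_intercept': [True]},
    {'C': [0.1, 1.0], 'fit_intercept': [False]},
]
toy_search = RandomizedSearchCV(LogisticRegression(max_iter=1000),
                                param_distributions,
                                n_iter=4,
                                cv=3,
                                random_state=42).fit(X, y)

# Parameters of the candidates that were actually evaluated.
sampled = toy_search.cv_results_['params']
```

Because only a subset of the grid is evaluated, the best candidate found depends on the draw.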

In [44]:
kwargs_to_prod = prod_params({'stop_words': {'with stopwords removal' : stop_words, 'keep stopwords': None},
                              'use_idf': {'with idf': True, 'no idf': False},
                              'binary': {'counts': False, 'binary flag': True},                                       
                              'ngram_range': {'no_ngram': (1, 1), 'bigrams': (1, 2), 'trigrams': (1, 3)},  
                              'strip_accents': {'remove accents': 'unicode', 'keep accents': None},
                             })
param_grid = [{
               'SimilaritySelector__count_vect_kwargs': kwargs_to_prod[1],
               'SimilaritySelector__scoring': ['default', 'absolute_score', 'relative_score'],    
               'SimilaritySelector__similarity': ['cosine'],
               'SimilaritySelector__embedding_method': [None, 'Word2Vec', 'tSVD'],
    
              },
              {'SimilaritySelector__count_vect_kwargs': kwargs_to_prod[1],
               'SimilaritySelector__scoring': ['default'],
               'SimilaritySelector__similarity': ['projection'],
               'SimilaritySelector__source_norm': ['l2', 'l3', 'l4', 'l5'],
               'SimilaritySelector__projected_norm': ['l1', 'l2', 'l3', 'l4'],                   
              },
             ]
len(kwargs_to_prod[1])

search = RandomizedSearchCV(process_pipe,
                            param_grid,
                            n_iter=50,
                            cv=8, 
                            scoring= lev_scorer,
                            n_jobs=-1,
                            verbose=1,
                           ).fit(train, train['ingredients'])

Fitting 8 folds for each of 50 candidates, totalling 400 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   21.1s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  2.0min


Launching 8 processes.


[Parallel(n_jobs=-1)]: Done 400 out of 400 | elapsed:  4.0min finished


In [45]:
search.best_params_

{'SimilaritySelector__source_norm': 'l5',
 'SimilaritySelector__similarity': 'projection',
 'SimilaritySelector__scoring': 'default',
 'SimilaritySelector__projected_norm': 'l3',
 'SimilaritySelector__count_vect_kwargs': {'stop_words': None,
  'use_idf': False,
  'binary': True,
  'ngram_range': (1, 2),
  'strip_accents': None}}

In [46]:
search.best_score_

0.6091800832952281