# PIM Estimator

This notebook aims to test the PIM-Estimator module.

In [1]:
import os

import numpy as np
import pandas as pd

from src.pimest import IngredientExtractor
from src.pimest import PIMIngredientExtractor
from src.pimest import PathGetter
from src.pimest import ContentGetter
from src.pimapi import Requester
from src.pimpdf import PDFDecoder

# 1. Extracting the data

First, let's refresh the data from the production environment.

In [2]:
#requester = Requester('prd', proxies=None)
requester = Requester('prd')
print('----------------------------------------')
requester.refresh_directory()
print('----------------------------------------')
requester.modification_report()
print('----------------------------------------')
requester.fetch_list_from_PIM(requester.modified_items(), batch_size=20)
print('----------------------------------------')
requester.dump_data_from_result()
print('----------------------------------------')
requester.dump_files_from_result()
print('----------------------------------------')
requester.modification_report()
print('----------------------------------------')

----------------------------------------
Done
----------------------------------------
Number of items: 13142
Number of items with outdated data: 252
Number of items with outdated files: 252
----------------------------------------
Done
----------------------------------------
Done
----------------------------------------
Launching 13 threads.
Thread complete!
Thread complete!
Thread complete!
Thread complete!
Thread complete!
Thread complete!
Thread complete!
Thread complete!
Thread complete!
Thread complete!
Thread complete!
Thread complete!
Thread complete!
Done
----------------------------------------
Number of items: 13142
Number of items with outdated data: 0
Number of items with outdated files: 0
----------------------------------------


Then, fetch the ingredient lists into a pandas DataFrame:

In [3]:
requester.fetch_all_from_PIM(page_size=1000, max_page=-1, nx_properties='*')
mapping = {'uid': 'uid', 'Libellé': 'title', 'Ingrédients': 'properties.pprodc:ingredientsList'}
df = requester.result_to_dataframe(record_path='entries', mapping=mapping, index='uid')
df

Done


uid,Libellé,Ingrédients
afee12c7-177e-4a68-9539-8cbb68442503,DESTR D'ODEURS AIR&TEXTILES 750CCX6 DESODOR U2,
7d390121-17e8-43bf-a357-9d06b79d2d47,THÉ VERT AGRUME BTE 25S FRAICH LIPTON,
f234cd84-c8f6-433f-85ec-6e0b6980adc6,TORTILLA BLE 30CM,
e82a8173-b379-41ac-b319-aa058a04fcfb,VIN ROUGE MÉDITERRANÉE 25CL X12,
4b12c47c-84f5-4132-b362-22b864379a67,VIN MÉDITERRANÉE ROSÉ 25CL X12,
...,...,...
5cde49c6-9e7e-4bd2-b22a-3239f643379d,ROULEAU CÉLISOFT 1.20X50 M CITRON,
0273eadc-851a-4b68-8020-8041700a4f3d,2D VENT FRAIS 5LX4 DESODOR U2,
ef42a938-2203-446e-8d28-9fd27c6d3146,3D VENT FRAIS 5LX4 DESODOR U2,
68f5d81b-7f91-40a0-8504-0ec320a86de4,NETTOYANT INOX 500ML LOT 2X6 KING,


We only keep the products for which there is an ingredient list in the system.

In [4]:
df = df.loc[pd.notna(df['Ingrédients'])]
df

uid,Libellé,Ingrédients
a84ebaef-3c39-4661-a8bf-bcd0b297076d,ETUI RIZ DE CAMARGUE IGP LONG BLANC,RIZ LONG BLANC
898c0810-7f5a-4a1d-b169-b191dcd9d0bd,CRAC FORM 210G X 8,"Farine de maïs, farine de riz, sucre, sel."
075672b6-c1d5-43d9-9e80-21ec44b2d872,CRAC'FORM MAÏS RIZ SARRASIN 210G X 8,"farine de riz 40%, farine de maïs 38%, amidon ..."
0560a9c2-d635-46ef-a23f-df7ad8632257,SAC 500 GR MORILLES DE CULTURE,MORILLES
01817326-7f55-4a1c-928e-47d1bed8a38e,SPY PYRENEA 150 PK6 EP,EAU DE SOURCE
...,...,...
16a1db98-79ba-4ea7-9943-6d093a4c8ee9,TOMATES SÉCHÉES À L’HUILE POCHE 650G LA PULPE,"Tomates séchées réhydratées, huile de tourneso..."
c778d6f9-aa06-47d4-a4ec-199eec9e1373,MORCEAUX DE POIVRONS ROUGES ET JAUNES GRILLÉS ...,"Poivrons rouges et jaunes, huile de tournesol,..."
3d4a97ed-d6be-49c4-8eae-8ce72c50e68f,QUARTIERS D’ARTICHAUTS MARINÉS À L’HUILE POCHE...,"Artichauts, huile de tournesol, sel, sucre, pl..."
4128f89a-8df7-4da7-a2c0-3ee1302a46f4,BARQUETTE PATE A TARTINER POULAIN,"Ingrédients : Sucre, huile de tournesol, NOISE..."


# 2. Training the estimator

For this simple test, the estimator is trained on the whole dataset. This is not good practice, since no data is held out for evaluation; it only serves to demonstrate how the class is used.

## 2.1 Importing the module

The cell below is only there to reload the source code of the `pimest` module without restarting the kernel.

In [5]:
#import importlib
#import src.pimest
#importlib.reload(src.pimest)

## 2.2 Training the estimator

Although this is not good practice, we train the estimator on the whole dataset.

In [6]:
estim = IngredientExtractor()
estim.fit(df.loc[:, 'Ingrédients'])

<src.pimest.IngredientExtractor at 0x1c31b2e0348>
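Fitting on every available product leaves nothing to evaluate on. As a rough sketch of the alternative, the uids could be shuffled reproducibly and a fixed number held out. The helper name `split_uids` and the toy uids are hypothetical and only illustrate the idea; they are not part of `pimest`.

```python
import random


def split_uids(uids, test_size, seed=42):
    """Shuffle the uids reproducibly and hold out `test_size` of them."""
    shuffled = list(uids)
    random.Random(seed).shuffle(shuffled)
    # The first `test_size` shuffled uids form the test set, the rest the train set.
    return shuffled[test_size:], shuffled[:test_size]


# Toy uids standing in for the DataFrame index.
uids = [f"uid-{i}" for i in range(10)]
train_uids, test_uids = split_uids(uids, test_size=3)
```

The estimator would then be fitted on the rows selected by `train_uids` only, and checked on those selected by `test_uids`.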

In [7]:
estim.vectorized_texts_

<9648x4190 sparse matrix of type '<class 'numpy.int64'>'
	with 169139 stored elements in Compressed Sparse Row format>

In [8]:
estim.vocabulary_

{'riz': 3385,
 'long': 2509,
 'blanc': 735,
 'farine': 1870,
 'de': 1371,
 'maïs': 2628,
 'sucre': 3665,
 'sel': 3539,
 '40': 168,
 '38': 163,
 'amidon': 467,
 'sarrasin': 3501,
 'fibre': 1897,
 'pomme': 3146,
 'millet': 2664,
 'teff': 3753,
 'fabriqué': 1861,
 'dans': 1367,
 'un': 3919,
 'atelier': 606,
 'qui': 3307,
 'utilise': 3928,
 'des': 1404,
 'protéines': 3228,
 'lait': 2426,
 'et': 1816,
 'du': 1496,
 'soja': 3590,
 'morilles': 2729,
 'eau': 1724,
 'source': 3616,
 'kombu': 2402,
 'déshydraté': 1532,
 '100': 38,
 'graines': 2140,
 'moutarde': 2747,
 'vinaigre': 3989,
 'alcool': 427,
 'acidifiant': 345,
 'acide': 334,
 'citrique': 1086,
 'conservateur': 1193,
 'disulfite': 1467,
 'potassium': 3169,
 'jus': 2362,
 'orange': 2914,
 'citron': 1089,
 'base': 671,
 'concentré': 1168,
 'correcteur': 1238,
 'acidité': 352,
 'lactate': 2414,
 'sodium': 3585,
 'édulcorants': 4105,
 'aspartame': 593,
 'acésulfame': 367,
 'conservateurs': 1194,
 'sorbate': 3600,
 'benzoate': 687,
 'arômes

In [9]:
estim.mean_corpus_

matrix([[0.00196932, 0.00010365, 0.00010365, ..., 0.00943201, 0.00849917,
         0.00082919]])
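The three fitted attributes above look like the output of a bag-of-words count vectorizer: a token-to-column mapping, a matrix of counts, and the column-wise mean of that matrix. The sketch below reproduces that idea in plain Python on two ingredient lists taken from the table above. The tokenization rule (lowercase, tokens of at least two word characters) and all function names are assumptions made for illustration, not the actual `IngredientExtractor` implementation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\w\w+", text.lower())


def build_vocabulary(texts):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(text, vocabulary):
    """Dense count vector of a text; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(text)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


def mean_vector(texts, vocabulary):
    """Column-wise mean of the count vectors of all texts."""
    vectors = [count_vector(text, vocabulary) for text in texts]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


corpus = ["RIZ LONG BLANC", "Farine de maïs, farine de riz, sucre, sel."]
vocabulary = build_vocabulary(corpus)
mean_corpus = mean_vector(corpus, vocabulary)
```

With a real corpus of several thousand ingredient lists, most entries of the mean vector are close to zero, as in the output above, because each token appears in only a small share of the documents.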

# 3. Testing the estimator

## 3.1 Parsing a doc into blocks

First, we parse a single document into blocks of text:

In [10]:
uid = '7ad672f8-40d4-4527-ab49-af3284d23fab'
path = os.path.join('.', 'dumps', 'prd', uid, 'FTF.pdf')
blocks = PDFDecoder.path_to_blocks(path)
blocks

['FICHE TECHNIQUE ',
 ' ',
 'BBEETTTTEERRAAVVEESS  RROOUUGGEESS    ',
 'CCOOUUPPÉÉEESS  EENN  DDÉÉSS  ',
 'LLEEGGEERREEMMEENNTT  VVIINNAAIIGGRREEEESS  ',
 ' \n \n \n \n \n \n \n ',
 'Page : 1/2 ',
 ' \nService Qualité Groupe ',
 'd’aucy/CGC ',
 '56 500 LOCMINE ',
 'Référence : SQ/LV/624 - Version : F ',
 'Date : 3/08/18 ',
 ' ',
 'CARACTÉRISTIQUES GÉNÉRALES :  ',
 'Dénomination Réglementaire ',
 'Code produit \nFormat \nContenance \nPoids Net Total \nPoids Net Égoutté ',
 'Liste des Ingrédients ',
 'Campagne de Production \nOrigine légume ',
 ' \n ',
 'Betteraves rouges coupées en dés ',
 'légèrement vinaigrées ',
 '4505 \n5/1 ',
 '4250 ml \n4000 g \n2655 g ',
 'Betteraves rouges, eau, vinaigre d’alcool (0,9%), sucre, sel, ',
 'acidifiant : acide citrique. (E330) ',
 'Septembre à Décembre ',
 'France ',
 ' ',
 'NOMBRE DE PARTS ',
 'Adultes \nEnfants ',
 '26 parts \n44 parts ',
 ' \nCARACTÉRISTIQUES DU PRODUIT : ',
 'Réf. CTCPA _ décision n°48 – Conserves de betteraves rouges \n \nBette

## 3.2 Predicting the ingredient block

Then we predict which block is most likely to hold the ingredient list:

In [11]:
block_num = estim.predict(blocks)
print(blocks[block_num])

Pour 100 g*AR**Pour 100 g*AR**Energie (kJ)128Glucides (g)5,22%Energie (kcal)31Dont sucres (g)4,85%Matières grasses (g)0,10%Fibres alimentaires (g)2,4Dont acides gras saturés (g)0,00%Protéines (g)1,02%*pour 100 g de produit égouttéSel (g)0,6010%**Apport de référence pour un adulte-type (8400kJ/2000 kcal).Les apports de références varient en fonction de l'âge, du sexe et de l'activité physique.2%


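For the product with uid `7ad672f8-40d4-4527-ab49-af3284d23fab`, the block printed above is the nutritional values table, not the ingredient list (which sits in the block starting with 'Betteraves rouges, eau, vinaigre d’alcool'). In this run, the estimator therefore did not identify the ingredient block.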
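One way to picture the selection step is to score every block against a reference bag of ingredient words and return the index of the best one. The sketch below uses cosine similarity as a stand-in; the scoring actually used by `estim.predict` may well differ, and the reference text, sample blocks, and function names are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\w\w+", text.lower())


def cosine(counts_a, counts_b):
    """Cosine similarity between two token-count mappings (0.0 if either is empty)."""
    dot = sum(n * counts_b.get(tok, 0) for tok, n in counts_a.items())
    norm_a = math.sqrt(sum(n * n for n in counts_a.values()))
    norm_b = math.sqrt(sum(n * n for n in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_block(blocks, reference_counts):
    """Return the index of the block most similar to the reference counts."""
    scores = [cosine(Counter(tokenize(block)), reference_counts) for block in blocks]
    return max(range(len(blocks)), key=scores.__getitem__)


# Hypothetical reference built from typical ingredient words.
reference = Counter(tokenize("sucre, sel, eau, vinaigre, farine de riz"))
blocks = [
    "FICHE TECHNIQUE",
    "Page : 1/2",
    "Betteraves rouges, eau, vinaigre, sucre, sel",
]
best = predict_block(blocks, reference)
```

A long block that happens to share a few common words with the reference (a nutrition table mentioning sugars and salt, for instance) can still compete with the true ingredient block, so the scoring choice matters.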

# 4. Wrapped Estimator

A wrapper class makes it possible to directly compare the current content of the PIM system with what has been extracted from the associated PDF file.

This helper inherits directly from the `IngredientExtractor` class:

In [12]:
#from importlib import reload
#import src.pimest
#reload(src.pimest)

In [13]:
#wrapped_estim = PIMIngredientExtractor(env='prd', proxies=None)
wrapped_estim = PIMIngredientExtractor(env='prd')
wrapped_estim.fit(df.loc[:, 'Ingrédients'])

<src.pimest.PIMIngredientExtractor at 0x1c31b45c708>

In [122]:
wrapped_estim.compare_uid_data('78f66d90-aeab-4f15-8130-0c418955b79a')

Fetching data from PIM for uid 78f66d90-aeab-4f15-8130-0c418955b79a...
Done
----------------------------------------------------------
Ingredient list from PIM is :

Semoule supérieure de BLE dur, OEUFS frais de poules élevées en plein air (30%).

----------------------------------------------------------
Supplier technical datasheet from PIM for uid 78f66d90-aeab-4f15-8130-0c418955b79a is:
https://produits.groupe-pomona.fr/nuxeo/nxfile/default/78f66d90-aeab-4f15-8130-0c418955b79a/pprodad:technicalSheet/FT-186759_Tagliat%20nid%207oeuf%20Als%202mm%20sac3KG_Gd%20Mere.pdf?changeToken=22-0
----------------------------------------------------------
Downloading content of technical datasheet file...
Done!
----------------------------------------------------------
Parsing content of technical datasheet file...
Done!
----------------------------------------------------------
Ingredient list extracted from technical datasheet:

Semoule supérieure de blé dur, œufs frais de 
poules élevées en ple

In [15]:
wrapped_estim.print_blocks()

0  |  FERRERO France  

1  |  CS90058
76136 MONT-SAINT-
AIGNAN 

2  |  sept-17 

3  |  NUTELLA 

4  |  INFORMATIONS NUTRITIONNELLES 

5  |  Pâtes à tartiner aux noisettes 

6  |  Valeurs nutritionnelles  

7  |  moyennes   

8  |  Pour 100g 

9  |  Pour 15g 

10  |  %* par 15g 

11  |  INGREDIENTS 

12  |  ENERGIE 

13  |  2252 Kj/539 kcal 

14  |  336 Kj/80 kcal 

15  |  Sucre, huile de palme, noisettes 13%, lait écrémé 
en poudre 8,7%, cacao maigre 7,4%, émulsifiants :  

16  |  lécithines [soja], vanilline 

17  |  ALLERGENES 

18  |  MATIERES GRASSES 

19  |  dont acides gras saturés 

20  |  GLUCIDES 

21  |  dont sucres 

22  |  30,9 g 

23  |  10,6 g 

24  |  57,5 g 

25  |  56,3 g 

26  |  4,6 g 

27  |  1,6 g 

28  |  8,6 g 

29  |  8,4 g 

30  |  PROTEINES 

31  |  6,3 g 

32  |  0,9 g 

33  |  CONSERVATION 

34  |  *Apport de référence pour un adulte-type (8 400 kJ/2 000 kcal) 

35  |  SEL 

36  |  0,11 g 

37  |  0,016 g 

38  |   - noisettes 

39  |   - lait 

40  |   - so

In [16]:
from sklearn.model_selection import train_test_split

train_uids, test_uids = train_test_split(df, test_size=500, random_state=42)
#test_uids.reset_index().loc[:, 'uid'].to_csv(os.path.join('.', 'test_uids.csv'), header=True, encoding='utf-8-sig', index=False)

# 5. Transformers

A handful of transformers have been developed to process the data.

## 5.1 PathGetter

This transformer takes a DataFrame with uids as index and adds a column holding the path of the PDF file on disk. The path depends on whether the uid belongs to the ground truth or to the "normal" train set.

The root paths can be passed at initialization, or they will be defaulted to what is specified in the configuration file.

The ground truth uids must be declared at initialization.
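The path lookup described above can be sketched as a small function. The directory layout `<root>/<uid>/<filename>` and the default file name `FTF.pdf` are assumptions (the latter is borrowed from the path built in section 3.1), and `resolve_pdf_path` is a hypothetical helper, not the actual `PathGetter` code.

```python
from pathlib import Path


def resolve_pdf_path(uid, ground_truth_uids, train_set_path, ground_truth_path,
                     filename="FTF.pdf"):
    """Return the expected PDF location for a uid.

    Assumed layout: <root>/<uid>/<filename>, where the root depends on
    whether the uid belongs to the ground truth.
    """
    root = ground_truth_path if uid in ground_truth_uids else train_set_path
    return Path(root) / uid / filename


ground_truth = {"a0492df6-9c76-4303-8813-65ec5ccbfa70"}
train_root = Path("dumps") / "prd"
truth_root = Path("ground_truth")

paths = {
    uid: resolve_pdf_path(uid, ground_truth, train_root, truth_root)
    for uid in ["a0492df6-9c76-4303-8813-65ec5ccbfa70",
                "a84ebaef-3c39-4661-a8bf-bcd0b297076d"]
}
```

Keeping the ground-truth uids in a set makes the membership test cheap even when the DataFrame holds thousands of rows.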

In [17]:
from pathlib import Path

In [18]:
ground_truth_df = pd.read_csv(os.path.join('.', 'ground_truth', 'manually_labelled_ground_truth.csv'),
                              sep=';',
                              encoding='latin-1',
                              index_col='uid')
ground_truth_uids = list(ground_truth_df.index)
ground_truth_uids

['a0492df6-9c76-4303-8813-65ec5ccbfa70',
 'd183e914-db2f-4e2f-863a-a3b2d054c0b8',
 'ab48a1ed-7a3d-4686-bb6d-ab4f367cada8',
 '528d4be3-425c-4f8b-8a87-12f1bc645ddd',
 '51b38427-b2ea-4c56-93e8-4242361ef31b',
 'e4c8c61f-a401-4384-8128-181447e5bdd2',
 '50766d10-6135-4958-b743-1cddfcb7c230',
 'bb9f3995-57b3-429d-b075-2d81a90e406f',
 '52bf5f5a-79f2-4652-8eac-2b80c697269b',
 '04235024-80f3-46c2-bad0-aae0d5fab024',
 '93a5d344-4ab0-48de-bbb0-fec8117d07a2',
 '440ca0b4-1c79-433f-ac8d-6e2ae77d3288',
 '7b9cf3b3-bf4b-408c-b8c8-807a09745979',
 'b638ab96-427c-4d5f-9a19-74684eec3b8e',
 '8bb101c0-8c6e-4665-8310-1a2d13257b2e',
 '54f40033-f9cf-411c-81a5-11974f6715aa',
 '2405356e-ecc9-404f-8279-16fddf168b73',
 'c57be8e4-30e0-417b-90fa-b60d9f9ece48',
 '6622442d-d88f-4b90-9414-b91c9e8e5296',
 '507b428e-e99d-464b-b9d3-50629efe4355',
 '545e8bc5-15fd-458a-8c9f-b18faecc2016',
 'b45b3942-accd-4115-8f1f-736dbd42a35a',
 '6c07fd47-116d-4531-a164-d4f4040f212e',
 'ca711137-4644-45b7-9928-456a09d9746a',
 'e70d347d-5e91-

In [19]:
transformer = PathGetter(ground_truth_uids=ground_truth_uids,
                         train_set_path=Path.cwd() / 'dumps' / 'prd',
                         ground_truth_path=Path.cwd() / 'ground_truth',
                        )

In [20]:
idx = pd.Index(ground_truth_uids[:5] + list(df.index[:5]), name='uid')
test_df = df.loc[idx,:]
test_df = transformer.fit_transform(test_df)
test_df

uid,Libellé,Ingrédients,path
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de BLE T65, eau, levure, huile de colz...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,100% Semoule de BLE dur de qualité supérieure,C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,"Fève de tonka, taux de coumarine compris entre...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...
a84ebaef-3c39-4661-a8bf-bcd0b297076d,ETUI RIZ DE CAMARGUE IGP LONG BLANC,RIZ LONG BLANC,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...
898c0810-7f5a-4a1d-b169-b191dcd9d0bd,CRAC FORM 210G X 8,"Farine de maïs, farine de riz, sucre, sel.",C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...
075672b6-c1d5-43d9-9e80-21ec44b2d872,CRAC'FORM MAÏS RIZ SARRASIN 210G X 8,"farine de riz 40%, farine de maïs 38%, amidon ...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...
0560a9c2-d635-46ef-a23f-df7ad8632257,SAC 500 GR MORILLES DE CULTURE,MORILLES,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...
01817326-7f55-4a1c-928e-47d1bed8a38e,SPY PYRENEA 150 PK6 EP,EAU DE SOURCE,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...


## 5.2 ContentGetter

This transformer reads the files from disk and loads their content as bytes into the DataFrame.

In [21]:
transformer_2 = ContentGetter(missing_file='to_nan')
test_df = transformer_2.fit_transform(test_df)
test_df

uid,Libellé,Ingrédients,path,content
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de BLE T65, eau, levure, huile de colz...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n4 0 obj\r<</L...
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,100% Semoule de BLE dur de qualité supérieure,C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.4\n%\xc7\xec\x8f\xa2\n5 0 obj\n<</Len...
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,"Fève de tonka, taux de coumarine compris entre...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.3\n%\xc7\xec\x8f\xa2\n7 0 obj\n<</Len...
a84ebaef-3c39-4661-a8bf-bcd0b297076d,ETUI RIZ DE CAMARGUE IGP LONG BLANC,RIZ LONG BLANC,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...
898c0810-7f5a-4a1d-b169-b191dcd9d0bd,CRAC FORM 210G X 8,"Farine de maïs, farine de riz, sucre, sel.",C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...
075672b6-c1d5-43d9-9e80-21ec44b2d872,CRAC'FORM MAÏS RIZ SARRASIN 210G X 8,"farine de riz 40%, farine de maïs 38%, amidon ...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...
0560a9c2-d635-46ef-a23f-df7ad8632257,SAC 500 GR MORILLES DE CULTURE,MORILLES,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...
01817326-7f55-4a1c-928e-47d1bed8a38e,SPY PYRENEA 150 PK6 EP,EAU DE SOURCE,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.3\n3 0 obj\n<</Type /Page\n/Parent 1 ...
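The content column above holds the raw bytes of each file. A minimal sketch of such a read step follows, assuming that `missing_file='to_nan'` means "store NaN when the file does not exist" and that any other value lets the error propagate. The function name `read_content` and that second behavior are assumptions, not documented `ContentGetter` behavior.

```python
import tempfile
from pathlib import Path


def read_content(path, missing_file="to_nan"):
    """Read a file as bytes.

    With missing_file='to_nan', a missing file yields float('nan');
    with any other value, the FileNotFoundError is re-raised.
    """
    try:
        return Path(path).read_bytes()
    except FileNotFoundError:
        if missing_file == "to_nan":
            return float("nan")
        raise


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "FTF.pdf"
    existing.write_bytes(b"%PDF-1.5\n")
    found = read_content(existing)
    absent = read_content(Path(tmp) / "missing.pdf")
```

Returning NaN instead of raising keeps the row in the DataFrame, so later steps can decide how to handle products without a datasheet.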


## 5.3 PDFContentParser

This transformer parses the content of the PDF files into text using the functionality defined in the `pimpdf` module.

In [22]:
from src.pimest import PDFContentParser
transformer_3 = PDFContentParser()
test_df = transformer_3.fit_transform(test_df)
test_df

Launching 4 processes.


uid,Libellé,Ingrédients,path,content,text
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,Concentré Liquide Asian CHEF® \n\nBouteille de...
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de BLE T65, eau, levure, huile de colz...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n4 0 obj\r<</L...,
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,100% Semoule de BLE dur de qualité supérieure,C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.4\n%\xc7\xec\x8f\xa2\n5 0 obj\n<</Len...,Direction Qualité \n\n \n\n \n\nPATES ALIMENTA...
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,"Fève de tonka, taux de coumarine compris entre...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE\n\nFEVE DE TONKA\n\nDipteryx O...
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.3\n%\xc7\xec\x8f\xa2\n7 0 obj\n<</Len...,CAVIAR \n\nD’AUBERGINES\n500G\n\nPRÉSENTATION\...
a84ebaef-3c39-4661-a8bf-bcd0b297076d,ETUI RIZ DE CAMARGUE IGP LONG BLANC,RIZ LONG BLANC,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,"SARL BENOIT\n\n367, chemin de Mérieux - Quarti..."
898c0810-7f5a-4a1d-b169-b191dcd9d0bd,CRAC FORM 210G X 8,"Farine de maïs, farine de riz, sucre, sel.",C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,CRAC’FORM \n\nDonnées commerciales \n\n \n \nB...
075672b6-c1d5-43d9-9e80-21ec44b2d872,CRAC'FORM MAÏS RIZ SARRASIN 210G X 8,"farine de riz 40%, farine de maïs 38%, amidon ...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,CRAC’FORM MAÏS RIZ SARRASIN \n\nDonnées commer...
0560a9c2-d635-46ef-a23f-df7ad8632257,SAC 500 GR MORILLES DE CULTURE,MORILLES,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE & LOGISTIQUE ASSURANCE QUALITE...
01817326-7f55-4a1c-928e-47d1bed8a38e,SPY PYRENEA 150 PK6 EP,EAU DE SOURCE,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.3\n3 0 obj\n<</Type /Page\n/Parent 1 ...,Designation du produit\nMarque\nNom de la sour...


## 5.4 BlockSplitter

This transformer splits the content of the text column into blocks of text using the provided splitter function.

In [23]:
from src.pimest import BlockSplitter
splitter_func = lambda x: x.split('\n\n')
transformer_4 = BlockSplitter(splitter_func=splitter_func)
test_df = transformer_4.fit_transform(test_df)
test_df

Launching 4 processes.


uid,Libellé,Ingrédients,path,content,text,blocks
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,Concentré Liquide Asian CHEF® \n\nBouteille de...,"[Concentré Liquide Asian CHEF® , Bouteille de ..."
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de BLE T65, eau, levure, huile de colz...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n4 0 obj\r<</L...,,[]
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,100% Semoule de BLE dur de qualité supérieure,C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.4\n%\xc7\xec\x8f\xa2\n5 0 obj\n<</Len...,Direction Qualité \n\n \n\n \n\nPATES ALIMENTA...,"[Direction Qualité , , , PATES ALIMENTAIRES ..."
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,"Fève de tonka, taux de coumarine compris entre...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE\n\nFEVE DE TONKA\n\nDipteryx O...,"[FICHE TECHNIQUE, FEVE DE TONKA, Dipteryx Odor..."
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.3\n%\xc7\xec\x8f\xa2\n7 0 obj\n<</Len...,CAVIAR \n\nD’AUBERGINES\n500G\n\nPRÉSENTATION\...,"[CAVIAR , D’AUBERGINES\n500G, PRÉSENTATION, LE..."
a84ebaef-3c39-4661-a8bf-bcd0b297076d,ETUI RIZ DE CAMARGUE IGP LONG BLANC,RIZ LONG BLANC,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,"SARL BENOIT\n\n367, chemin de Mérieux - Quarti...","[SARL BENOIT, 367, chemin de Mérieux - Quartie..."
898c0810-7f5a-4a1d-b169-b191dcd9d0bd,CRAC FORM 210G X 8,"Farine de maïs, farine de riz, sucre, sel.",C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,CRAC’FORM \n\nDonnées commerciales \n\n \n \nB...,"[CRAC’FORM , Données commerciales , \n \nBéné..."
075672b6-c1d5-43d9-9e80-21ec44b2d872,CRAC'FORM MAÏS RIZ SARRASIN 210G X 8,"farine de riz 40%, farine de maïs 38%, amidon ...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,CRAC’FORM MAÏS RIZ SARRASIN \n\nDonnées commer...,"[CRAC’FORM MAÏS RIZ SARRASIN , Données commerc..."
0560a9c2-d635-46ef-a23f-df7ad8632257,SAC 500 GR MORILLES DE CULTURE,MORILLES,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE & LOGISTIQUE ASSURANCE QUALITE...,[FICHE TECHNIQUE & LOGISTIQUE ASSURANCE QUALIT...
01817326-7f55-4a1c-928e-47d1bed8a38e,SPY PYRENEA 150 PK6 EP,EAU DE SOURCE,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.3\n3 0 obj\n<</Type /Page\n/Parent 1 ...,Designation du produit\nMarque\nNom de la sour...,[Designation du produit\nMarque\nNom de la sou...


In [24]:
for text in test_df['blocks'].iloc[0]:
    print('-------------------------------------------\n', text)

-------------------------------------------
 Concentré Liquide Asian CHEF® 
-------------------------------------------
 Bouteille de 980 ml
-------------------------------------------
 DESCRIPTION DU PRODUIT
-------------------------------------------
 BÉNÉFICES CLÉS DU PRODUIT
-------------------------------------------
 CODE EAN
-------------------------------------------
 7613035849105
-------------------------------------------
 Concentré liquide aux saveurs asiatiques.
-------------------------------------------
 Utilisable à chaud et à froid.
Sans exhausteurs de goût ajoutés 
(glutamates, inosinates, guanylates).
Adapté à un régime végétarien.
Sans gluten.
-------------------------------------------
 INGRÉDIENTS
-------------------------------------------
 Eau, maltodextrine, sel, arômes, sucre, arôme naturel de citronnelle, amidon modifié, ail en poudre, épices (combava, curcuma), extraits 
d'épices (gingembre, poivre), stabilisant (gomme xanthane).   
-------------------------
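Splitting on `'\n\n'` keeps whitespace-only blocks, as the `blocks` column shows for some documents. A possible alternative splitter, sketched below, splits on blank lines that may contain spaces and drops empty blocks. The function name `split_blocks` is hypothetical, and the sample text is a made-up excerpt in the style of the parsed datasheets.

```python
import re


def split_blocks(text):
    """Split on blank lines (possibly containing spaces) and drop empty blocks."""
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]


# Made-up sample in the style of the parsed datasheets.
sample = ("Direction Qualité \n\n \n\n \n\nPATES ALIMENTAIRES"
          "\n\nINGRÉDIENTS\n\nSemoule de blé dur")

naive_blocks = sample.split("\n\n")   # keeps the whitespace-only blocks
clean_blocks = split_blocks(sample)   # drops them and trims each block
```

Since the splitter is passed to `BlockSplitter` as a plain function, a variant like this could be swapped in without touching the transformer itself.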

# 6. Pipelining transformers

One can pipeline these transformers using the scikit-learn standard Pipeline.
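Conceptually, a pipeline passes the output of each step to the next one. The sketch below mimics that chaining with two toy steps exposing `fit` and `transform`. It is a simplified illustration only: scikit-learn's `Pipeline` does considerably more, and the classes and `run_pipeline` helper here are hypothetical.

```python
class SplitBlocks:
    """Toy step: add a 'blocks' entry by splitting the text on blank lines."""

    def fit(self, records):
        return self

    def transform(self, records):
        return [{**r, "blocks": r["text"].split("\n\n")} for r in records]


class CountBlocks:
    """Toy step: add the number of blocks found by the previous step."""

    def fit(self, records):
        return self

    def transform(self, records):
        return [{**r, "n_blocks": len(r["blocks"])} for r in records]


def run_pipeline(steps, records):
    """Fit then transform each (name, step) pair in order, feeding outputs forward."""
    for _name, step in steps:
        records = step.fit(records).transform(records)
    return records


steps = [("SplitBlocks", SplitBlocks()), ("CountBlocks", CountBlocks())]
result = run_pipeline(steps, [{"text": "INGRÉDIENTS\n\nEau, sel"}])
```

The order of the steps matters: `CountBlocks` relies on the `blocks` entry produced by `SplitBlocks`, just as each transformer in this notebook relies on the column added by the previous one.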

## 6.1 Data acquisition

The first step is to build a data acquisition pipeline, which provides a DataFrame containing the full text of each PDF document.

In [134]:
from importlib import reload
import src.pimest
reload(src.pimest)
from src.pimest import ContentGetter
from src.pimest import PathGetter
from src.pimest import PDFContentParser
from sklearn.pipeline import Pipeline

In [27]:
acqui_pipe = Pipeline([('PathGetter', PathGetter(ground_truth_uids=ground_truth_uids,
                                                  train_set_path=Path.cwd() / 'dumps' / 'prd',
                                                  ground_truth_path=Path.cwd() / 'ground_truth',
                                                  )),
                        ('ContentGetter', ContentGetter(missing_file='to_nan')),
                        ('ContentParser', PDFContentParser()),
                       ],
                       verbose=True)

In [48]:
idx = pd.Index(ground_truth_uids)
texts_df = df.loc[idx,]
texts_df = acqui_pipe.fit_transform(texts_df)
texts_df

[Pipeline] ........ (step 1 of 4) Processing PathGetter, total=   0.2s
[Pipeline] ..... (step 2 of 4) Processing ContentGetter, total=  13.6s
Launching 4 processes.
[Pipeline] ..... (step 3 of 4) Processing ContentParser, total= 3.4min
Launching 4 processes.
[Pipeline] ..... (step 4 of 4) Processing BlockSplitter, total=   0.8s


Unnamed: 0,Libellé,Ingrédients,path,content,text,blocks
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,Concentré Liquide Asian CHEF® \n\nBouteille de...,"[Concentré Liquide Asian CHEF® , Bouteille de ..."
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de BLE T65, eau, levure, huile de colz...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n4 0 obj\r<</L...,,[]
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,100% Semoule de BLE dur de qualité supérieure,C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.4\n%\xc7\xec\x8f\xa2\n5 0 obj\n<</Len...,Direction Qualité \n\n \n\n \n\nPATES ALIMENTA...,"[Direction Qualité , , , PATES ALIMENTAIRES ..."
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,"Fève de tonka, taux de coumarine compris entre...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE\n\nFEVE DE TONKA\n\nDipteryx O...,"[FICHE TECHNIQUE, FEVE DE TONKA, Dipteryx Odor..."
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.3\n%\xc7\xec\x8f\xa2\n7 0 obj\n<</Len...,CAVIAR \n\nD’AUBERGINES\n500G\n\nPRÉSENTATION\...,"[CAVIAR , D’AUBERGINES\n500G, PRÉSENTATION, LE..."
...,...,...,...,...,...,...
7c709a5e-b913-4a02-9396-a6469b09482a,Mini-bonbons aux fruits en sachet 1 kg SKENDY,"Sirop de glucose, sucre, arômes naturels, acid...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.6\r\n%\xbd\xbe\xbc\r\n1 0 obj\r\n<<\r...,FTin. 396\nPage: 1/2\n\nIndice de révision:\nD...,"[FTin. 396\nPage: 1/2, Indice de révision:\nDa..."
c5dee4ab-9f57-4533-9f89-e216ee110f68,"FARINE DE BLÉ TYPE 45, 25KG",farine de BLE T45,C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,\n1050/10502066400 \n\n10502055300/1050202520...,"[ \n1050/10502066400 , 10502055300/10502025200..."
e67341d8-350f-46f4-9154-4dbbb8035621,PRÉPARATION POUR CRÈME BRÛLÉE BIO 6L,"Sucre roux de canne*°(64%), amidon de maïs*, p...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE \n\nCREME BRÛLÉE 6L \n\nREF : ...,"[FICHE TECHNIQUE , CREME BRÛLÉE 6L , REF : NAP..."
a8f6f672-20ac-4ff8-a8f2-3bc4306c8df3,Céréales instantanées en poudre saveur caramel...,"Farine 87,1 % (BLE (GLUTEN), BLE hydrolysé (GL...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,81 rue de Sans Souci – CS13754 – 69576 Limones...,[81 rue de Sans Souci – CS13754 – 69576 Limone...


## 6.2 Block splitting and prediction

Based on the full text of each PDF document, we can now compute a list of blocks for each one and predict which block is the best candidate for the ingredient list.

In [135]:
from src.pimest import BlockSplitter
from src.pimest import SimilaritySelector

In [136]:
def splitter(text):
    # Split the full text into blocks on blank lines.
    return text.split('\n\n')

In [137]:
process_pipe = Pipeline([('BlockSplitter', BlockSplitter(splitter_func=splitter)),
                         ('SimilaritySelector', SimilaritySelector())
                       ],
                       verbose=True)

In [138]:
predicted_df = process_pipe.fit_transform(texts_df)
predicted_df

Launching 4 processes.
[Pipeline] ..... (step 1 of 2) Processing BlockSplitter, total=   0.6s
[... truncated debug output: interleaved numeric arrays printed by the 4 worker processes ...]

Unnamed: 0,Libellé,Ingrédients,path,content,text,blocks,predicted
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,Concentré Liquide Asian CHEF® \n\nBouteille de...,"[Concentré Liquide Asian CHEF® , Bouteille de ...","Eau, maltodextrine, sel, arômes, sucre, arôme ..."
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de BLE T65, eau, levure, huile de colz...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n4 0 obj\r<</L...,,[],
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,100% Semoule de BLE dur de qualité supérieure,C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.4\n%\xc7\xec\x8f\xa2\n5 0 obj\n<</Len...,Direction Qualité \n\n \n\n \n\nPATES ALIMENTA...,"[Direction Qualité , , , PATES ALIMENTAIRES ...",Conforme à : \n– Décret « Pâtes Alimentaires ...
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,"Fève de tonka, taux de coumarine compris entre...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE\n\nFEVE DE TONKA\n\nDipteryx O...,"[FICHE TECHNIQUE, FEVE DE TONKA, Dipteryx Odor...",Information sur le risque de contaminations cr...
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.3\n%\xc7\xec\x8f\xa2\n7 0 obj\n<</Len...,CAVIAR \n\nD’AUBERGINES\n500G\n\nPRÉSENTATION\...,"[CAVIAR , D’AUBERGINES\n500G, PRÉSENTATION, LE...","A u b e r g i n e 6 0 , 5 % \n(aubergine, h..."
...,...,...,...,...,...,...,...
7c709a5e-b913-4a02-9396-a6469b09482a,Mini-bonbons aux fruits en sachet 1 kg SKENDY,"Sirop de glucose, sucre, arômes naturels, acid...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.6\r\n%\xbd\xbe\xbc\r\n1 0 obj\r\n<<\r...,FTin. 396\nPage: 1/2\n\nIndice de révision:\nD...,"[FTin. 396\nPage: 1/2, Indice de révision:\nDa...",MENTIONS D'ÉTIQUETAGE- LABELS MENTIONS\nINGRED...
c5dee4ab-9f57-4533-9f89-e216ee110f68,"FARINE DE BLÉ TYPE 45, 25KG",farine de BLE T45,C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,\n1050/10502066400 \n\n10502055300/1050202520...,"[ \n1050/10502066400 , 10502055300/10502025200...",Les valeures moyennes indiquées sont soumises ...
e67341d8-350f-46f4-9154-4dbbb8035621,PRÉPARATION POUR CRÈME BRÛLÉE BIO 6L,"Sucre roux de canne*°(64%), amidon de maïs*, p...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE \n\nCREME BRÛLÉE 6L \n\nREF : ...,"[FICHE TECHNIQUE , CREME BRÛLÉE 6L , REF : NAP...",\nGluten \nŒufs \nPoissons \nCrustacés \nArac...
a8f6f672-20ac-4ff8-a8f2-3bc4306c8df3,Céréales instantanées en poudre saveur caramel...,"Farine 87,1 % (BLE (GLUTEN), BLE hydrolysé (GL...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,81 rue de Sans Souci – CS13754 – 69576 Limones...,[81 rue de Sans Souci – CS13754 – 69576 Limone...,"Farine 87,1 % (Blé (GLUTEN), Blé hydrolysé (GL..."


In [142]:
print('\n-----------------------------------------------\n'.join(predicted_df['predicted']))

Eau, maltodextrine, sel, arômes, sucre, arôme naturel de citronnelle, amidon modifié, ail en poudre, épices (combava, curcuma), extraits 
d'épices (gingembre, poivre), stabilisant (gomme xanthane).   
-----------------------------------------------

-----------------------------------------------
Conforme à : 
–  Décret « Pâtes Alimentaires » 55-1175 du 31/08/55 
–  Arrêté « Pâtes-Semoules de blé dur » du 27/05/57 
–  Arrêté « Quantité Nominale  Pâtes» du 08/10/08 
–  Décret « Métrologie » 78-166 du 31/01/78 
–  Règlement « OGM » 1829/2003 du 22/09/2003 
–  Réglementation européenne Pesticides et Contaminants 
–  Règlement CE 178/2002 du 28/01/2002 sur la sécurité alimentaire des denrées 
–  Règlement CE « Microbiologie des Denrées Alimentaires» 2073/2005 du 15/11/05 
–  Règlement UE n°1169/2011 du 25/10/11 concernant l’information des consommateurs sur les 
denrées alimentaires 
-----------------------------------------------
Information sur le risque de contaminations croisées : le 

# 7. Scoring

Now it is time to evaluate the performance of this simple model.

## 7.1 Target ingredient lists acquisition

First, we load the manually labelled ground truth into a target DataFrame.

In [143]:
y = pd.read_csv(os.path.join('.', 'ground_truth', 'manually_labelled_ground_truth.csv'),
                sep=';',
                encoding='latin-1',
                index_col='uid')
y

uid,designation,ingredients
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ..."
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de blé T65, eau, levure, vinaigre de ci..."
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,- 100% Semoule de BLE dur de qualité supérieur...
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,fève de tonka (graines ridées de 25 à 50mm de ...
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol..."
...,...,...
7c709a5e-b913-4a02-9396-a6469b09482a,Mini-bonbons aux fruits en sachet 1 kg SKENDY,SIROP DE GLUCOSE / SUCRE / AROMES / ACIDIFIANT...
c5dee4ab-9f57-4533-9f89-e216ee110f68,"FARINE DE BLÉ TYPE 45, 25KG",Farine de blé T45
e67341d8-350f-46f4-9154-4dbbb8035621,PRÉPARATION POUR CRÈME BRÛLÉE BIO 6L,"Sucre roux de canne*° (64%), amidon de maïs*, ..."
a8f6f672-20ac-4ff8-a8f2-3bc4306c8df3,Céréales instantanées en poudre saveur caramel...,"Farine 87,1 % (Blé (GLUTEN), Blé hydrolysé (GL..."


In [149]:
comparison = pd.concat([y['ingredients'], predicted_df['predicted']], axis=1)
comparison

uid,ingredients,predicted
a0492df6-9c76-4303-8813-65ec5ccbfa70,"Eau, maltodextrine, sel, arômes, sucre, arôme ...","Eau, maltodextrine, sel, arômes, sucre, arôme ..."
d183e914-db2f-4e2f-863a-a3b2d054c0b8,"Farine de blé T65, eau, levure, vinaigre de ci...",
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,- 100% Semoule de BLE dur de qualité supérieur...,Conforme à : \n– Décret « Pâtes Alimentaires ...
528d4be3-425c-4f8b-8a87-12f1bc645ddd,fève de tonka (graines ridées de 25 à 50mm de ...,Information sur le risque de contaminations cr...
51b38427-b2ea-4c56-93e8-4242361ef31b,"Aubergine 60,5% (aubergine, huile de tournesol...","A u b e r g i n e 6 0 , 5 % \n(aubergine, h..."
...,...,...
7c709a5e-b913-4a02-9396-a6469b09482a,SIROP DE GLUCOSE / SUCRE / AROMES / ACIDIFIANT...,MENTIONS D'ÉTIQUETAGE- LABELS MENTIONS\nINGRED...
c5dee4ab-9f57-4533-9f89-e216ee110f68,Farine de blé T45,Les valeures moyennes indiquées sont soumises ...
e67341d8-350f-46f4-9154-4dbbb8035621,"Sucre roux de canne*° (64%), amidon de maïs*, ...",\nGluten \nŒufs \nPoissons \nCrustacés \nArac...
a8f6f672-20ac-4ff8-a8f2-3bc4306c8df3,"Farine 87,1 % (Blé (GLUTEN), Blé hydrolysé (GL...","Farine 87,1 % (Blé (GLUTEN), Blé hydrolysé (GL..."
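
Several rows in the table above differ from the ground truth only in spacing, casing or accents (for instance the `Aubergine` row, where the extracted text has its letters spaced out). A softer score than strict equality can make such near misses visible. The sketch below is one possible approach, not the notebook's own scoring method: the helper names `normalize` and `similarity` are hypothetical, and the sample strings are made up for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and collapse every run of whitespace to one space."""
    return re.sub(r"\s+", " ", text).strip().lower()


def similarity(truth, predicted):
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(truth), normalize(predicted)).ratio()


# Hypothetical sample values, not taken from the dataset.
truth = "Farine de blé T45"
print(similarity(truth, "farine de BLE  T45\n"))  # near miss: casing, accent, spacing
print(similarity(truth, ""))                      # empty prediction
```

To apply it to `comparison`, missing predictions would first need to be replaced by empty strings (for example with `fillna('')`), since the helpers assume string inputs.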


In [150]:
comparison.loc[comparison['ingredients'] == comparison['predicted']]

uid,ingredients,predicted
52bf5f5a-79f2-4652-8eac-2b80c697269b,jus d'orange issu de l'agriculture biologique.,jus d'orange issu de l'agriculture biologique.
4f83306f-66de-4545-9b12-7790b57b61ae,"Sirop de glucose, sucre, eau, stabilisants (E4...","Sirop de glucose, sucre, eau, stabilisants (E4..."
fdb4c751-1f88-430d-9a84-80cdb1047082,"Pommes 70%, abricots 25%, sucre, antioxydant: ...","Pommes 70%, abricots 25%, sucre, antioxydant: ..."
eb42ef77-7050-4f9a-8f10-7c9df5c15b21,Riz long étuvé de qualité supérieure issu de l...,Riz long étuvé de qualité supérieure issu de l...
15c7bf93-fa9d-4a5a-88ff-a3ef21e90303,"Huile d'olive 99,6%, Arôme truffe blanche (0,4%)","Huile d'olive 99,6%, Arôme truffe blanche (0,4%)"
9ea66ea7-7791-4c45-b682-0b92dce9de8e,"Cornichons, eau, vinaigre d'alcool, sucre, sel...","Cornichons, eau, vinaigre d'alcool, sucre, sel..."
0abe8883-1b53-409d-a6cc-f14303763ef8,"Eau Minérale Naturelle, gaz de la source","Eau Minérale Naturelle, gaz de la source"
1427f551-0d0b-41c9-850d-b398077a15d0,"Sirop de glucose-fructose, fraises, sucre, gél...","Sirop de glucose-fructose, fraises, sucre, gél..."
bb201173-2549-4165-9202-d1412882a8f6,"Haricots rouges secs trempés, eau, sel, concen...","Haricots rouges secs trempés, eau, sel, concen..."
02d5ceb9-21c2-4965-8f65-309bca7638b2,"Café soluble (48,5%), fibres de chicorée (olig...","Café soluble (48,5%), fibres de chicorée (olig..."
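
The exact-match filter above lists the rows where the prediction equals the ground truth character for character. The same comparison can be summarized as a single accuracy figure. The snippet below is a minimal sketch under that assumption: `exact_match_accuracy` is a hypothetical helper, and the toy lists stand in for the two columns of `comparison`.

```python
def exact_match_accuracy(truth, predicted):
    """Return the share of positions where both strings are strictly equal."""
    pairs = list(zip(truth, predicted))
    if not pairs:
        raise ValueError("cannot score an empty sample")
    return sum(t == p for t, p in pairs) / len(pairs)


# Hypothetical toy values standing in for the two columns of `comparison`.
truth = ["eau, sel", "farine de blé T45", "sucre, eau"]
predicted = ["eau, sel", "", "sucre,  eau"]

accuracy = exact_match_accuracy(truth, predicted)
print(f"Exact-match accuracy: {accuracy:.1%}")
```

On the notebook's data, the helper could be called as `exact_match_accuracy(comparison['ingredients'], comparison['predicted'])`. Strict equality is a harsh criterion: a single extra space, as in the third toy pair, counts as a miss.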
