# PIM Estimator

This notebook aims to test the PIM-Estimator module.

In [1]:
import os

import numpy as np
import pandas as pd

from src.pimest import IngredientExtractor
from src.pimest import PIMIngredientExtractor
from src.pimest import PathGetter
from src.pimest import ContentGetter
from src.pimapi import Requester
from src.pimpdf import PDFDecoder

# 1. Extracting the data

First, let's refresh the data from the production environment.

In [4]:
#requester = Requester('prd', proxies=None)
requester = Requester('prd')
print('----------------------------------------')
requester.refresh_directory()
print('----------------------------------------')
requester.modification_report()
print('----------------------------------------')
# requester.fetch_list_from_PIM(requester.modified_items(), batch_size=20)
print('----------------------------------------')
# requester.dump_data_from_result()
print('----------------------------------------')
# requester.dump_files_from_result()
print('----------------------------------------')
# requester.modification_report()
print('----------------------------------------')

----------------------------------------
Done
----------------------------------------
Number of items: 13152
Number of items with outdated data: 13151
Number of items with outdated files: 13151
----------------------------------------
----------------------------------------
----------------------------------------
----------------------------------------
----------------------------------------


Then, fetch the ingredient lists into a pandas DataFrame:

In [5]:
requester.fetch_all_from_PIM(page_size=1000, max_page=-1, nx_properties='*')
mapping = {'uid': 'uid', 'Libellé': 'title', 'Ingrédients': 'properties.pprodc:ingredientsList'}
df = requester.result_to_dataframe(record_path='entries', mapping=mapping, index='uid')
df

Done




uid,Libellé,Ingrédients
afee12c7-177e-4a68-9539-8cbb68442503,DESTR D'ODEURS AIR&TEXTILES 750CCX6 DESODOR U2,
7d390121-17e8-43bf-a357-9d06b79d2d47,THÉ VERT AGRUME BTE 25S FRAICH LIPTON,
f234cd84-c8f6-433f-85ec-6e0b6980adc6,T WHEAT 30 A 18X6 52C MISSION 1620,"WHEAT flour (55%), water, vegetable fat (palm)..."
e82a8173-b379-41ac-b319-aa058a04fcfb,VIN ROUGE MÉDITERRANÉE 25CL X12,
4b12c47c-84f5-4132-b362-22b864379a67,VIN MÉDITERRANÉE ROSÉ 25CL X12,
...,...,...
5cde49c6-9e7e-4bd2-b22a-3239f643379d,ROULEAU CÉLISOFT 1.20X50 M CITRON,
0273eadc-851a-4b68-8020-8041700a4f3d,2D VENT FRAIS 5LX4 DESODOR U2,
ef42a938-2203-446e-8d28-9fd27c6d3146,3D VENT FRAIS 5LX4 DESODOR U2,
68f5d81b-7f91-40a0-8504-0ec320a86de4,NETTOYANT INOX 500ML LOT 2X6 KING,


We only keep the products for which there is an ingredient list in the system.

In [6]:
df = df.loc[pd.notna(df['Ingrédients'])]
df

uid,Libellé,Ingrédients
f234cd84-c8f6-433f-85ec-6e0b6980adc6,T WHEAT 30 A 18X6 52C MISSION 1620,"WHEAT flour (55%), water, vegetable fat (palm)..."
a84ebaef-3c39-4661-a8bf-bcd0b297076d,ETUI RIZ DE CAMARGUE IGP LONG BLANC,RIZ LONG BLANC
898c0810-7f5a-4a1d-b169-b191dcd9d0bd,CRAC FORM 210G X 8,"Farine de maïs, farine de riz, sucre, sel."
075672b6-c1d5-43d9-9e80-21ec44b2d872,CRAC'FORM MAÏS RIZ SARRASIN 210G X 8,"farine de riz 40%, farine de maïs 38%, amidon ..."
0560a9c2-d635-46ef-a23f-df7ad8632257,SAC 500 GR MORILLES DE CULTURE,MORILLES
...,...,...
16a1db98-79ba-4ea7-9943-6d093a4c8ee9,TOMATES SÉCHÉES À L’HUILE POCHE 650G LA PULPE,"Tomates séchées réhydratées, huile de tourneso..."
c778d6f9-aa06-47d4-a4ec-199eec9e1373,MORCEAUX DE POIVRONS ROUGES ET JAUNES GRILLÉS ...,"Poivrons rouges et jaunes, huile de tournesol,..."
3d4a97ed-d6be-49c4-8eae-8ce72c50e68f,QUARTIERS D’ARTICHAUTS MARINÉS À L’HUILE POCHE...,"Artichauts, huile de tournesol, sel, sucre, pl..."
4128f89a-8df7-4da7-a2c0-3ee1302a46f4,BARQUETTE PATE A TARTINER POULAIN,"Ingrédients : Sucre, huile de tournesol, NOISE..."


# 2. Training the estimator

For this simple test, the estimator will be trained on the whole dataset (which is not good practice - this is just to demonstrate the usage of this class).

## 2.1 Importing the module

The cell below is only here to reload the source code of the `pimest` module without restarting the kernel.

In [7]:
#import importlib
#import src.pimest
#importlib.reload(src.pimest)

## 2.2 Training the estimator

As noted above, the estimator is fitted on the full set of ingredient lists, with no held-out data.

In [8]:
estim = IngredientExtractor()
estim.fit(df.loc[:, 'Ingrédients'])

<src.pimest.IngredientExtractor at 0x7f4b80c41a30>

In [9]:
estim.vectorized_texts_

<9672x4139 sparse matrix of type '<class 'numpy.int64'>'
	with 169700 stored elements in Compressed Sparse Row format>

In [10]:
estim.vocabulary_

{'wheat': 3999,
 'flour': 1917,
 '55': 235,
 'water': 3997,
 'vegetable': 3917,
 'fat': 1866,
 'palm': 2926,
 'stabilisers': 3605,
 '422': 181,
 '412': 173,
 '466': 198,
 'salt': 3451,
 'raising': 3293,
 'agents': 394,
 '450': 191,
 '500': 219,
 'emulsifier': 1748,
 '471': 201,
 'gluten': 2081,
 'dextrose': 1419,
 'acidity': 349,
 'regulator': 3321,
 '296': 134,
 'rice': 3342,
 'semolina': 3516,
 'preservatives': 3163,
 '281': 131,
 '202': 110,
 'treatment': 3827,
 'agent': 392,
 '920': 297,
 'riz': 3354,
 'long': 2480,
 'blanc': 730,
 'farine': 1863,
 'de': 1369,
 'maïs': 2597,
 'sucre': 3632,
 'sel': 3509,
 '40': 166,
 '38': 161,
 'amidon': 465,
 'sarrasin': 3470,
 'fibre': 1888,
 'pomme': 3110,
 'millet': 2634,
 'teff': 3718,
 'fabriqué': 1856,
 'dans': 1364,
 'un': 3884,
 'atelier': 603,
 'qui': 3274,
 'utilise': 3892,
 'des': 1402,
 'protéines': 3194,
 'lait': 2396,
 'et': 1812,
 'du': 1492,
 'soja': 3559,
 'morilles': 2698,
 'eau': 1723,
 'source': 3584,
 'kombu': 2378,
 'déshydr

In [11]:
estim.mean_corpus_

matrix([[0.00196443, 0.00010339, 0.00010339, ..., 0.00971878, 0.00889165,
         0.00082713]])

# 3. Testing the estimator

## 3.1 Parsing a doc into blocks

First, we parse a single document into blocks of text:

In [17]:
uid = '7ad672f8-40d4-4527-ab49-af3284d23fab'
path = os.path.join('.', 'dumps', 'prd', uid, 'FTF.pdf')
blocks = PDFDecoder.path_to_blocks(path)
blocks

['FICHE TECHNIQUE ',
 ' ',
 'BBEETTTTEERRAAVVEESS  RROOUUGGEESS    ',
 'CCOOUUPPÉÉEESS  EENN  DDÉÉSS  ',
 'LLEEGGEERREEMMEENNTT  VVIINNAAIIGGRREEEESS  ',
 ' \n \n \n \n \n \n \n ',
 'Page : 1/2 ',
 ' \nService Qualité Groupe ',
 'd’aucy/CGC ',
 '56 500 LOCMINE ',
 'Référence : SQ/LV/624 - Version : F ',
 'Date : 3/08/18 ',
 ' ',
 'CARACTÉRISTIQUES GÉNÉRALES :  ',
 'Dénomination Réglementaire ',
 'Code produit \nFormat \nContenance \nPoids Net Total \nPoids Net Égoutté ',
 'Liste des Ingrédients ',
 'Campagne de Production \nOrigine légume ',
 ' \n ',
 'Betteraves rouges coupées en dés ',
 'légèrement vinaigrées ',
 '4505 \n5/1 ',
 '4250 ml \n4000 g \n2655 g ',
 'Betteraves rouges, eau, vinaigre d’alcool (0,9%), sucre, sel, ',
 'acidifiant : acide citrique. (E330) ',
 'Septembre à Décembre ',
 'France ',
 ' ',
 'NOMBRE DE PARTS ',
 'Adultes \nEnfants ',
 '26 parts \n44 parts ',
 ' \nCARACTÉRISTIQUES DU PRODUIT : ',
 'Réf. CTCPA _ décision n°48 – Conserves de betteraves rouges \n \nBette

## 3.2 Predicting the ingredient block

Then we predict which block is most likely to hold the ingredient list:

In [18]:
block_num = estim.predict(blocks)
print(blocks[block_num])

Pour 100 g*AR**Pour 100 g*AR**Energie (kJ)128Glucides (g)5,22%Energie (kcal)31Dont sucres (g)4,85%Matières grasses (g)0,10%Fibres alimentaires (g)2,4Dont acides gras saturés (g)0,00%Protéines (g)1,02%*pour 100 g de produit égouttéSel (g)0,6010%**Apport de référence pour un adulte-type (8400kJ/2000 kcal).Les apports de références varient en fonction de l'âge, du sexe et de l'activité physique.2%


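In the output above, the block selected for the product with uid `7ad672f8-40d4-4527-ab49-af3284d23fab` is the nutritional-values table rather than the ingredient list (`Betteraves rouges, eau, vinaigre d’alcool (0,9%), sucre, sel, ...`), so the estimator did not identify the ingredient block for this document.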
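
How `predict` scores the blocks is not shown in this notebook. The following is a minimal sketch of one plausible approach, assuming each block is turned into a bag-of-words vector and compared with the corpus mean (cf. `mean_corpus_`) by cosine similarity. All function names are hypothetical and this is not the `pimest` implementation:

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens of 2+ characters, similar in spirit to a count vectorizer.
    return re.findall(r"\w\w+", text.lower())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors stored as dicts.
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fit_mean(corpus):
    # Average term count over all known ingredient lists.
    total = Counter()
    for doc in corpus:
        total.update(tokenize(doc))
    return {term: count / len(corpus) for term, count in total.items()}


def predict(blocks, mean_vector):
    # Index of the block whose bag of words is closest to the corpus mean.
    scores = [cosine(Counter(tokenize(b)), mean_vector) for b in blocks]
    return max(range(len(blocks)), key=scores.__getitem__)


corpus = ["Farine de maïs, farine de riz, sucre, sel.", "Betteraves rouges, eau, sucre, sel"]
blocks = ["FICHE TECHNIQUE", "Page : 1/2", "Betteraves rouges, eau, vinaigre, sucre, sel"]
best = predict(blocks, fit_mean(corpus))
```

With a scoring rule of this kind, a long block that shares many frequent tokens with the corpus (such as `de`, `sucre` or `sel` in a nutrition table) can outrank the real ingredient list, which may be one reason for misses like the one above.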

# 4. Wrapped Estimator

A wrapper class makes it possible to directly compare the current content of the PIM system with what has been extracted from the associated PDF file.

This helper inherits directly from the `IngredientExtractor` class:

In [None]:
#from importlib import reload
#import src.pimest
#reload(src.pimest)

In [19]:
#wrapped_estim = PIMIngredientExtractor(env='prd', proxies=None)
wrapped_estim = PIMIngredientExtractor(env='prd')
wrapped_estim.fit(df.loc[:, 'Ingrédients'])

<src.pimest.PIMIngredientExtractor at 0x7f4b80e11a60>

In [20]:
wrapped_estim.compare_uid_data('78f66d90-aeab-4f15-8130-0c418955b79a')

Fetching data from PIM for uid 78f66d90-aeab-4f15-8130-0c418955b79a...
Done
----------------------------------------------------------
Ingredient list from PIM is :

Semoule supérieure de BLE dur, OEUFS frais de poules élevées en plein air (30%).

----------------------------------------------------------
Supplier technical datasheet from PIM for uid 78f66d90-aeab-4f15-8130-0c418955b79a is:
https://produits.groupe-pomona.fr/nuxeo/nxfile/default/78f66d90-aeab-4f15-8130-0c418955b79a/pprodad:technicalSheet/FT-186759_Tagliat%20nid%207oeuf%20Als%202mm%20sac3KG_Gd%20Mere.pdf?changeToken=22-0
----------------------------------------------------------
Downloading content of technical datasheet file...
Done!
----------------------------------------------------------
Parsing content of technical datasheet file...
Done!
----------------------------------------------------------
Ingredient list extracted from technical datasheet:

Semoule supérieure de blé dur, œufs frais de 
poules élevées en ple

In [21]:
wrapped_estim.print_blocks()

0  |    

1  |    

2  |  Version du 12/03/2018  

3  |  FICHE TECHNIQUE  

4  |    

5  |  Pâtes d’Alsace (IGP) L’ALSACIENNE 7 œufs frais au kg  

6  |    
Dénomination Légale : 
  

7  |    

8  |    

9  |    

10  |   
  

11  |    

12  |    

13  |    

14  |    

15  |    

16  |    Pâtes alimentaires 7 œufs frais au kilo de semoule de blé dur  

17  |    

18  |    

19  |    

20  |    

21  |    

22  |    IGP Pâtes d’Alsace  (Indication Géographique Protégée)  

23  |    

24  |   
  

25  |    

26  |    
  

27  |    

28  |    

29  |    

30  |    

31  |    

32  |  Caractéristiques Techniques : 
  

33  |  Ingrédients  

34  |  Semoule supérieure de blé dur, œufs frais de 
poules élevées en plein air (30%).  

35  |  OGM  

36  |  Absence  

37  |  Allergènes  

38  |  -  Blé 
-  Oeufs  

39  |  Ionisation  Non  

40  |  Valeurs nutritionnelles 
pour 100g de pâtes crues  

41  |    

42  |  Mat.grasses :  

43  |  Glucides :  

44  |  Fibres :  

45  |   3g  

46  | 

In [22]:
from sklearn.model_selection import train_test_split

train_uids, test_uids = train_test_split(df, test_size=500, random_state=42)
#test_uids.reset_index().loc[:, 'uid'].to_csv(os.path.join('.', 'test_uids.csv'), header=True, encoding='utf-8-sig', index=False)

# 5. Transformers

A handful of transformers have been developed to process the data.

## 5.1 PathGetter

This transformer takes a DataFrame indexed by uid and adds a column holding the path to the PDF file on disk. The root directory depends on whether the uid belongs to the ground truth or to the "normal" train set.

The root paths can be passed at initialization; otherwise they default to the values specified in the configuration file.

The ground truth uids must be declared at initialization.
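
The path resolution described above could look roughly like the sketch below. It assumes a `<root>/<uid>/FTF.pdf` layout for both roots (this layout appears earlier for the `dumps/prd` directory, and is only assumed for the ground truth). `resolve_pdf_path` is a hypothetical helper, not part of `pimest`:

```python
from pathlib import Path


def resolve_pdf_path(uid, ground_truth_uids, train_root, ground_truth_root,
                     filename="FTF.pdf"):
    # Ground-truth uids are looked up under their own root, all others under the train root.
    root = ground_truth_root if uid in ground_truth_uids else train_root
    # Assumed layout: one folder per uid containing the technical datasheet.
    return Path(root) / uid / filename


gt_uids = {"a0492df6-9c76-4303-8813-65ec5ccbfa70"}
train_root = Path("dumps") / "prd"
gt_root = Path("ground_truth")

gt_path = resolve_pdf_path("a0492df6-9c76-4303-8813-65ec5ccbfa70", gt_uids, train_root, gt_root)
train_path = resolve_pdf_path("f234cd84-c8f6-433f-85ec-6e0b6980adc6", gt_uids, train_root, gt_root)
```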

In [23]:
from pathlib import Path

In [24]:
ground_truth_df = pd.read_csv(os.path.join('.', 'ground_truth', 'manually_labelled_ground_truth.csv'),
                              sep=';',
                              encoding='latin-1',
                              index_col='uid')
ground_truth_uids = list(ground_truth_df.index)
ground_truth_uids

['a0492df6-9c76-4303-8813-65ec5ccbfa70',
 'd183e914-db2f-4e2f-863a-a3b2d054c0b8',
 'ab48a1ed-7a3d-4686-bb6d-ab4f367cada8',
 '528d4be3-425c-4f8b-8a87-12f1bc645ddd',
 '51b38427-b2ea-4c56-93e8-4242361ef31b',
 'e4c8c61f-a401-4384-8128-181447e5bdd2',
 '50766d10-6135-4958-b743-1cddfcb7c230',
 'bb9f3995-57b3-429d-b075-2d81a90e406f',
 '52bf5f5a-79f2-4652-8eac-2b80c697269b',
 '04235024-80f3-46c2-bad0-aae0d5fab024',
 '93a5d344-4ab0-48de-bbb0-fec8117d07a2',
 '440ca0b4-1c79-433f-ac8d-6e2ae77d3288',
 '7b9cf3b3-bf4b-408c-b8c8-807a09745979',
 'b638ab96-427c-4d5f-9a19-74684eec3b8e',
 '8bb101c0-8c6e-4665-8310-1a2d13257b2e',
 '54f40033-f9cf-411c-81a5-11974f6715aa',
 '2405356e-ecc9-404f-8279-16fddf168b73',
 'c57be8e4-30e0-417b-90fa-b60d9f9ece48',
 '6622442d-d88f-4b90-9414-b91c9e8e5296',
 '507b428e-e99d-464b-b9d3-50629efe4355',
 '545e8bc5-15fd-458a-8c9f-b18faecc2016',
 'b45b3942-accd-4115-8f1f-736dbd42a35a',
 '6c07fd47-116d-4531-a164-d4f4040f212e',
 'ca711137-4644-45b7-9928-456a09d9746a',
 'e70d347d-5e91-

In [25]:
transformer = PathGetter(ground_truth_uids=ground_truth_uids,
                         train_set_path=Path.cwd() / 'dumps' / 'prd',
                         ground_truth_path=Path.cwd() / 'ground_truth',
                        )

In [26]:
idx = pd.Index(ground_truth_uids[:5] + list(df.index[:5]), name='uid')
test_df = df.loc[idx,:]
test_df = transformer.fit_transform(test_df)
test_df

uid,Libellé,Ingrédients,path
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ...",/home/pmasse/PIM-Recognizer/ground_truth/a0492...
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de BLE T65, eau, levure, huile de colz...",/home/pmasse/PIM-Recognizer/ground_truth/d183e...
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,100% Semoule de BLE dur de qualité supérieure,/home/pmasse/PIM-Recognizer/ground_truth/ab48a...
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,"Fève de tonka, taux de coumarine compris entre...",/home/pmasse/PIM-Recognizer/ground_truth/528d4...
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol...",/home/pmasse/PIM-Recognizer/ground_truth/51b38...
f234cd84-c8f6-433f-85ec-6e0b6980adc6,T WHEAT 30 A 18X6 52C MISSION 1620,"WHEAT flour (55%), water, vegetable fat (palm)...",/home/pmasse/PIM-Recognizer/dumps/prd/f234cd84...
a84ebaef-3c39-4661-a8bf-bcd0b297076d,ETUI RIZ DE CAMARGUE IGP LONG BLANC,RIZ LONG BLANC,/home/pmasse/PIM-Recognizer/dumps/prd/a84ebaef...
898c0810-7f5a-4a1d-b169-b191dcd9d0bd,CRAC FORM 210G X 8,"Farine de maïs, farine de riz, sucre, sel.",/home/pmasse/PIM-Recognizer/dumps/prd/898c0810...
075672b6-c1d5-43d9-9e80-21ec44b2d872,CRAC'FORM MAÏS RIZ SARRASIN 210G X 8,"farine de riz 40%, farine de maïs 38%, amidon ...",/home/pmasse/PIM-Recognizer/dumps/prd/075672b6...
0560a9c2-d635-46ef-a23f-df7ad8632257,SAC 500 GR MORILLES DE CULTURE,MORILLES,/home/pmasse/PIM-Recognizer/dumps/prd/0560a9c2...


## 5.2 ContentGetter

This transformer reads the files from disk and loads their content as raw bytes into a new `content` column of the DataFrame.

In [27]:
transformer_2 = ContentGetter(missing_file='to_nan')
test_df = transformer_2.fit_transform(test_df)
test_df

uid,Libellé,Ingrédients,path,content
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ...",/home/pmasse/PIM-Recognizer/ground_truth/a0492...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de BLE T65, eau, levure, huile de colz...",/home/pmasse/PIM-Recognizer/ground_truth/d183e...,b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n4 0 obj\r<</L...
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,100% Semoule de BLE dur de qualité supérieure,/home/pmasse/PIM-Recognizer/ground_truth/ab48a...,b'%PDF-1.4\n%\xc7\xec\x8f\xa2\n5 0 obj\n<</Len...
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,"Fève de tonka, taux de coumarine compris entre...",/home/pmasse/PIM-Recognizer/ground_truth/528d4...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol...",/home/pmasse/PIM-Recognizer/ground_truth/51b38...,b'%PDF-1.3\n%\xc7\xec\x8f\xa2\n7 0 obj\n<</Len...
f234cd84-c8f6-433f-85ec-6e0b6980adc6,T WHEAT 30 A 18X6 52C MISSION 1620,"WHEAT flour (55%), water, vegetable fat (palm)...",/home/pmasse/PIM-Recognizer/dumps/prd/f234cd84...,
a84ebaef-3c39-4661-a8bf-bcd0b297076d,ETUI RIZ DE CAMARGUE IGP LONG BLANC,RIZ LONG BLANC,/home/pmasse/PIM-Recognizer/dumps/prd/a84ebaef...,
898c0810-7f5a-4a1d-b169-b191dcd9d0bd,CRAC FORM 210G X 8,"Farine de maïs, farine de riz, sucre, sel.",/home/pmasse/PIM-Recognizer/dumps/prd/898c0810...,
075672b6-c1d5-43d9-9e80-21ec44b2d872,CRAC'FORM MAÏS RIZ SARRASIN 210G X 8,"farine de riz 40%, farine de maïs 38%, amidon ...",/home/pmasse/PIM-Recognizer/dumps/prd/075672b6...,
0560a9c2-d635-46ef-a23f-df7ad8632257,SAC 500 GR MORILLES DE CULTURE,MORILLES,/home/pmasse/PIM-Recognizer/dumps/prd/0560a9c2...,
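
The empty `content` cells above correspond to paths with no file on disk. A rough sketch of the reading step follows, assuming `missing_file='to_nan'` means that a missing file yields NaN instead of raising. `read_content` is a hypothetical helper, not the `ContentGetter` code:

```python
import math
import tempfile
from pathlib import Path


def read_content(path, missing_file="to_nan"):
    # Return the raw bytes of the file; a missing file gives NaN or re-raises.
    try:
        return Path(path).read_bytes()
    except FileNotFoundError:
        if missing_file == "to_nan":
            return math.nan
        raise


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "FTF.pdf"
    existing.write_bytes(b"%PDF-1.5\n")
    found = read_content(existing)
    absent = read_content(Path(tmp) / "missing.pdf")
```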


## 5.3 PDFContentParser

This transformer parses the content of the PDF files into text, using the functionality defined in the `pimpdf` module.

In [28]:
from src.pimest import PDFContentParser
transformer_3 = PDFContentParser(none_content='to_empty')
test_df = transformer_3.fit_transform(test_df)
test_df

Launching 8 processes.


uid,Libellé,Ingrédients,path,content,text
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ...",/home/pmasse/PIM-Recognizer/ground_truth/a0492...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,Concentré Liquide Asian CHEF® \n\nBouteille de...
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de BLE T65, eau, levure, huile de colz...",/home/pmasse/PIM-Recognizer/ground_truth/d183e...,b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n4 0 obj\r<</L...,
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,100% Semoule de BLE dur de qualité supérieure,/home/pmasse/PIM-Recognizer/ground_truth/ab48a...,b'%PDF-1.4\n%\xc7\xec\x8f\xa2\n5 0 obj\n<</Len...,Direction Qualité \n\n \n\n \n\nPATES ALIMENTA...
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,"Fève de tonka, taux de coumarine compris entre...",/home/pmasse/PIM-Recognizer/ground_truth/528d4...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE\n\nFEVE DE TONKA\n\nDipteryx O...
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol...",/home/pmasse/PIM-Recognizer/ground_truth/51b38...,b'%PDF-1.3\n%\xc7\xec\x8f\xa2\n7 0 obj\n<</Len...,CAVIAR \n\nD’AUBERGINES\n500G\n\nPRÉSENTATION\...
f234cd84-c8f6-433f-85ec-6e0b6980adc6,T WHEAT 30 A 18X6 52C MISSION 1620,"WHEAT flour (55%), water, vegetable fat (palm)...",/home/pmasse/PIM-Recognizer/dumps/prd/f234cd84...,,
a84ebaef-3c39-4661-a8bf-bcd0b297076d,ETUI RIZ DE CAMARGUE IGP LONG BLANC,RIZ LONG BLANC,/home/pmasse/PIM-Recognizer/dumps/prd/a84ebaef...,,
898c0810-7f5a-4a1d-b169-b191dcd9d0bd,CRAC FORM 210G X 8,"Farine de maïs, farine de riz, sucre, sel.",/home/pmasse/PIM-Recognizer/dumps/prd/898c0810...,,
075672b6-c1d5-43d9-9e80-21ec44b2d872,CRAC'FORM MAÏS RIZ SARRASIN 210G X 8,"farine de riz 40%, farine de maïs 38%, amidon ...",/home/pmasse/PIM-Recognizer/dumps/prd/075672b6...,,
0560a9c2-d635-46ef-a23f-df7ad8632257,SAC 500 GR MORILLES DE CULTURE,MORILLES,/home/pmasse/PIM-Recognizer/dumps/prd/0560a9c2...,,


## 5.4 BlockSplitter

This transformer splits the content of the `text` column into blocks of text using the provided splitter function.

In [29]:
from src.pimest import BlockSplitter
splitter_func = lambda x: x.split('\n\n')
transformer_4 = BlockSplitter(splitter_func=splitter_func)
test_df = transformer_4.fit_transform(test_df)
test_df

Launching 8 processes.


uid,Libellé,Ingrédients,path,content,text,blocks
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ...",/home/pmasse/PIM-Recognizer/ground_truth/a0492...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,Concentré Liquide Asian CHEF® \n\nBouteille de...,"[Concentré Liquide Asian CHEF® , Bouteille de ..."
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de BLE T65, eau, levure, huile de colz...",/home/pmasse/PIM-Recognizer/ground_truth/d183e...,b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n4 0 obj\r<</L...,,[]
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,100% Semoule de BLE dur de qualité supérieure,/home/pmasse/PIM-Recognizer/ground_truth/ab48a...,b'%PDF-1.4\n%\xc7\xec\x8f\xa2\n5 0 obj\n<</Len...,Direction Qualité \n\n \n\n \n\nPATES ALIMENTA...,"[Direction Qualité , , , PATES ALIMENTAIRES ..."
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,"Fève de tonka, taux de coumarine compris entre...",/home/pmasse/PIM-Recognizer/ground_truth/528d4...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE\n\nFEVE DE TONKA\n\nDipteryx O...,"[FICHE TECHNIQUE, FEVE DE TONKA, Dipteryx Odor..."
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol...",/home/pmasse/PIM-Recognizer/ground_truth/51b38...,b'%PDF-1.3\n%\xc7\xec\x8f\xa2\n7 0 obj\n<</Len...,CAVIAR \n\nD’AUBERGINES\n500G\n\nPRÉSENTATION\...,"[CAVIAR , D’AUBERGINES\n500G, PRÉSENTATION, LE..."
f234cd84-c8f6-433f-85ec-6e0b6980adc6,T WHEAT 30 A 18X6 52C MISSION 1620,"WHEAT flour (55%), water, vegetable fat (palm)...",/home/pmasse/PIM-Recognizer/dumps/prd/f234cd84...,,,[]
a84ebaef-3c39-4661-a8bf-bcd0b297076d,ETUI RIZ DE CAMARGUE IGP LONG BLANC,RIZ LONG BLANC,/home/pmasse/PIM-Recognizer/dumps/prd/a84ebaef...,,,[]
898c0810-7f5a-4a1d-b169-b191dcd9d0bd,CRAC FORM 210G X 8,"Farine de maïs, farine de riz, sucre, sel.",/home/pmasse/PIM-Recognizer/dumps/prd/898c0810...,,,[]
075672b6-c1d5-43d9-9e80-21ec44b2d872,CRAC'FORM MAÏS RIZ SARRASIN 210G X 8,"farine de riz 40%, farine de maïs 38%, amidon ...",/home/pmasse/PIM-Recognizer/dumps/prd/075672b6...,,,[]
0560a9c2-d635-46ef-a23f-df7ad8632257,SAC 500 GR MORILLES DE CULTURE,MORILLES,/home/pmasse/PIM-Recognizer/dumps/prd/0560a9c2...,,,[]
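
Splitting on `'\n\n'` alone leaves whitespace-only blocks in the lists (see the `Direction Qualité` row). As an illustration, here is a slightly stricter splitter that drops them; `split_blocks` is a hypothetical name, and it assumes `splitter_func` accepts any callable mapping a string to a list of strings:

```python
import re


def split_blocks(text):
    # Split on blank lines, tolerating spaces or tabs on the "empty" line.
    parts = re.split(r"\n[ \t]*\n", text)
    # Trim each block and drop those that contain only whitespace.
    return [part.strip() for part in parts if part.strip()]


sample = "Direction Qualité \n\n \n\n \n\nPATES ALIMENTAIRES"
blocks = split_blocks(sample)
```

Such a function could be passed as `BlockSplitter(splitter_func=split_blocks)`. Whether stripping the blocks is desirable depends on how the downstream selector uses them.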


In [30]:
for text in test_df['blocks'].iloc[0]:
    print('-------------------------------------------\n', text)

-------------------------------------------
 Concentré Liquide Asian CHEF® 
-------------------------------------------
 Bouteille de 980 ml
-------------------------------------------
 DESCRIPTION DU PRODUIT
-------------------------------------------
 BÉNÉFICES CLÉS DU PRODUIT
-------------------------------------------
 CODE EAN
-------------------------------------------
 7613035849105
-------------------------------------------
 Concentré liquide aux saveurs asiatiques.
-------------------------------------------
 Utilisable à chaud et à froid.
Sans exhausteurs de goût ajoutés 
(glutamates, inosinates, guanylates).
Adapté à un régime végétarien.
Sans gluten.
-------------------------------------------
 INGRÉDIENTS
-------------------------------------------
 Eau, maltodextrine, sel, arômes, sucre, arôme naturel de citronnelle, amidon modifié, ail en poudre, épices (combava, curcuma), extraits 
d'épices (gingembre, poivre), stabilisant (gomme xanthane).   
-------------------------

# 6. Pipelining transformers

One can pipeline these transformers using the scikit-learn standard Pipeline.
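
For readers unfamiliar with the mechanism, here is a minimal sketch of two stateless transformers chained in a `Pipeline`. The two classes are hypothetical stand-ins that work on plain lists of strings, not the `pimest` transformers:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class Splitter(BaseEstimator, TransformerMixin):
    """Hypothetical step: split each text into blocks on blank lines."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return [text.split("\n\n") for text in X]


class LongestBlock(BaseEstimator, TransformerMixin):
    """Hypothetical step: keep the longest block of each document."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return [max(blocks, key=len) if blocks else "" for blocks in X]


pipe = Pipeline([("split", Splitter()), ("select", LongestBlock())])
result = pipe.fit_transform(["INGRÉDIENTS\n\nEau, sel, sucre", "Page 1\n\nFarine de blé"])
```

Each step receives the output of the previous one, so any object exposing `fit` and `transform` can be chained this way.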

## 6.1 Data acquisition

The first step is to build a data acquisition pipeline, which produces a DataFrame containing the full text of each PDF document.

In [31]:
from importlib import reload
import src.pimest
reload(src.pimest)
from src.pimest import ContentGetter
from src.pimest import PathGetter
from src.pimest import PDFContentParser
from sklearn.pipeline import Pipeline

In [32]:
acqui_pipe = Pipeline([('PathGetter', PathGetter(ground_truth_uids=ground_truth_uids,
                                                  train_set_path=Path.cwd() / 'dumps' / 'prd',
                                                  ground_truth_path=Path.cwd() / 'ground_truth',
                                                  )),
                        ('ContentGetter', ContentGetter(missing_file='to_nan')),
                        ('ContentParser', PDFContentParser(none_content='to_empty')),
                       ],
                       verbose=True)

We then restrict the data to the ground-truth uids, so that the pipeline only processes ground-truth files.

In [33]:
idx = pd.Index(ground_truth_uids)
idx_ds = pd.Series([''] * len(idx), index=idx)
texts_df, idx_ds = df.align(idx_ds, join='right', axis=0)
texts_df

Unnamed: 0,Libellé,Ingrédients
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ..."
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de BLE T65, eau, levure, huile de colz..."
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,100% Semoule de BLE dur de qualité supérieure
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,"Fève de tonka, taux de coumarine compris entre..."
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol..."
...,...,...
7c709a5e-b913-4a02-9396-a6469b09482a,Mini-bonbons aux fruits en sachet 1 kg SKENDY,"Sirop de glucose, sucre, arômes naturels, acid..."
c5dee4ab-9f57-4533-9f89-e216ee110f68,"FARINE DE BLÉ TYPE 45, 25KG",farine de BLE T45
e67341d8-350f-46f4-9154-4dbbb8035621,PRÉPARATION POUR CRÈME BRÛLÉE BIO 6L,"Sucre roux de canne*°(64%), amidon de maïs*, p..."
a8f6f672-20ac-4ff8-a8f2-3bc4306c8df3,Céréales instantanées en poudre saveur caramel...,"Farine 87,1 % (BLE (GLUTEN), BLE hydrolysé (GL..."
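
If the goal is only to keep the ground-truth rows in a given order, `DataFrame.reindex` is a possibly simpler way to express the same selection. The sketch below uses toy data with hypothetical uids. Note that uids missing from the DataFrame come back as NaN rows, which may need to be dropped before running the pipeline:

```python
import pandas as pd

# Toy stand-ins for the notebook's `df` and `ground_truth_uids` (hypothetical values).
df = pd.DataFrame(
    {"Libellé": ["A", "B", "C"], "Ingrédients": ["eau", "sel", "sucre"]},
    index=pd.Index(["u1", "u2", "u3"], name="uid"),
)
ground_truth_uids = ["u3", "u1", "u9"]  # "u9" is absent from df

# Keep exactly the requested uids, in the requested order.
reindexed = df.reindex(pd.Index(ground_truth_uids))

# Rows for unknown uids are filled with NaN, so they are easy to list.
missing = reindexed.index[reindexed["Ingrédients"].isna()].tolist()
```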


And we are now set to run our pipeline:

In [34]:
texts_df = acqui_pipe.fit_transform(texts_df)
texts_df

[Pipeline] ........ (step 1 of 3) Processing PathGetter, total=   0.1s
[Pipeline] ..... (step 2 of 3) Processing ContentGetter, total=   0.6s
Launching 8 processes.
[Pipeline] ..... (step 3 of 3) Processing ContentParser, total=  36.8s


Unnamed: 0,Libellé,Ingrédients,path,content,text
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ...",/home/pmasse/PIM-Recognizer/ground_truth/a0492...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,Concentré Liquide Asian CHEF® \n\nBouteille de...
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de BLE T65, eau, levure, huile de colz...",/home/pmasse/PIM-Recognizer/ground_truth/d183e...,b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n4 0 obj\r<</L...,
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,100% Semoule de BLE dur de qualité supérieure,/home/pmasse/PIM-Recognizer/ground_truth/ab48a...,b'%PDF-1.4\n%\xc7\xec\x8f\xa2\n5 0 obj\n<</Len...,Direction Qualité \n\n \n\n \n\nPATES ALIMENTA...
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,"Fève de tonka, taux de coumarine compris entre...",/home/pmasse/PIM-Recognizer/ground_truth/528d4...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE\n\nFEVE DE TONKA\n\nDipteryx O...
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol...",/home/pmasse/PIM-Recognizer/ground_truth/51b38...,b'%PDF-1.3\n%\xc7\xec\x8f\xa2\n7 0 obj\n<</Len...,CAVIAR \n\nD’AUBERGINES\n500G\n\nPRÉSENTATION\...
...,...,...,...,...,...
7c709a5e-b913-4a02-9396-a6469b09482a,Mini-bonbons aux fruits en sachet 1 kg SKENDY,"Sirop de glucose, sucre, arômes naturels, acid...",/home/pmasse/PIM-Recognizer/ground_truth/7c709...,b'%PDF-1.6\r\n%\xbd\xbe\xbc\r\n1 0 obj\r\n<<\r...,FTin. 396\nPage: 1/2\n\nIndice de révision:\nD...
c5dee4ab-9f57-4533-9f89-e216ee110f68,"FARINE DE BLÉ TYPE 45, 25KG",farine de BLE T45,/home/pmasse/PIM-Recognizer/ground_truth/c5dee...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,\n1050/10502066400 \n\n10502055300/1050202520...
e67341d8-350f-46f4-9154-4dbbb8035621,PRÉPARATION POUR CRÈME BRÛLÉE BIO 6L,"Sucre roux de canne*°(64%), amidon de maïs*, p...",/home/pmasse/PIM-Recognizer/ground_truth/e6734...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE \n\nCREME BRÛLÉE 6L \n\nREF : ...
a8f6f672-20ac-4ff8-a8f2-3bc4306c8df3,Céréales instantanées en poudre saveur caramel...,"Farine 87,1 % (BLE (GLUTEN), BLE hydrolysé (GL...",/home/pmasse/PIM-Recognizer/ground_truth/a8f6f...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,81 rue de Sans Souci – CS13754 – 69576 Limones...


## 6.2 Block splitting and prediction

From the full text of each PDF document, we can now compute a list of blocks and predict which block is the best candidate for the ingredient list.

In [35]:
from src.pimest import BlockSplitter
from src.pimest import SimilaritySelector

In [36]:
def splitter(text):
    # Split the full text into blocks on blank lines.
    return text.split('\n\n')

In [37]:
process_pipe = Pipeline([('BlockSplitter', BlockSplitter(splitter_func=splitter)),
                         ('SimilaritySelector', SimilaritySelector())
                       ],
                       verbose=True)

In [38]:
predicted_df = process_pipe.fit_transform(texts_df)
predicted_df

Launching 8 processes.
[Pipeline] ..... (step 1 of 2) Processing BlockSplitter, total=   0.3s
[Pipeline]  (step 2 of 2) Processing SimilaritySelector, total=   1.2s


Unnamed: 0,Libellé,Ingrédients,path,content,text,blocks,predicted
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ...",/home/pmasse/PIM-Recognizer/ground_truth/a0492...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,Concentré Liquide Asian CHEF® \n\nBouteille de...,"[Concentré Liquide Asian CHEF® , Bouteille de ...","Eau, maltodextrine, sel, arômes, sucre, arôme ..."
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de BLE T65, eau, levure, huile de colz...",/home/pmasse/PIM-Recognizer/ground_truth/d183e...,b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n4 0 obj\r<</L...,,[],
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,100% Semoule de BLE dur de qualité supérieure,/home/pmasse/PIM-Recognizer/ground_truth/ab48a...,b'%PDF-1.4\n%\xc7\xec\x8f\xa2\n5 0 obj\n<</Len...,Direction Qualité \n\n \n\n \n\nPATES ALIMENTA...,"[Direction Qualité , , , PATES ALIMENTAIRES ...",Conforme à : \n– Décret « Pâtes Alimentaires ...
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,"Fève de tonka, taux de coumarine compris entre...",/home/pmasse/PIM-Recognizer/ground_truth/528d4...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE\n\nFEVE DE TONKA\n\nDipteryx O...,"[FICHE TECHNIQUE, FEVE DE TONKA, Dipteryx Odor...",Information sur le risque de contaminations cr...
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol...",/home/pmasse/PIM-Recognizer/ground_truth/51b38...,b'%PDF-1.3\n%\xc7\xec\x8f\xa2\n7 0 obj\n<</Len...,CAVIAR \n\nD’AUBERGINES\n500G\n\nPRÉSENTATION\...,"[CAVIAR , D’AUBERGINES\n500G, PRÉSENTATION, LE...","A u b e r g i n e 6 0 , 5 % \n(aubergine, h..."
...,...,...,...,...,...,...,...
7c709a5e-b913-4a02-9396-a6469b09482a,Mini-bonbons aux fruits en sachet 1 kg SKENDY,"Sirop de glucose, sucre, arômes naturels, acid...",/home/pmasse/PIM-Recognizer/ground_truth/7c709...,b'%PDF-1.6\r\n%\xbd\xbe\xbc\r\n1 0 obj\r\n<<\r...,FTin. 396\nPage: 1/2\n\nIndice de révision:\nD...,"[FTin. 396\nPage: 1/2, Indice de révision:\nDa...",MENTIONS D'ÉTIQUETAGE- LABELS MENTIONS\nINGRED...
c5dee4ab-9f57-4533-9f89-e216ee110f68,"FARINE DE BLÉ TYPE 45, 25KG",farine de BLE T45,/home/pmasse/PIM-Recognizer/ground_truth/c5dee...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,\n1050/10502066400 \n\n10502055300/1050202520...,"[ \n1050/10502066400 , 10502055300/10502025200...",Les valeures moyennes indiquées sont soumises ...
e67341d8-350f-46f4-9154-4dbbb8035621,PRÉPARATION POUR CRÈME BRÛLÉE BIO 6L,"Sucre roux de canne*°(64%), amidon de maïs*, p...",/home/pmasse/PIM-Recognizer/ground_truth/e6734...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE \n\nCREME BRÛLÉE 6L \n\nREF : ...,"[FICHE TECHNIQUE , CREME BRÛLÉE 6L , REF : NAP...",\nGluten \nŒufs \nPoissons \nCrustacés \nArac...
a8f6f672-20ac-4ff8-a8f2-3bc4306c8df3,Céréales instantanées en poudre saveur caramel...,"Farine 87,1 % (BLE (GLUTEN), BLE hydrolysé (GL...",/home/pmasse/PIM-Recognizer/ground_truth/a8f6f...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,81 rue de Sans Souci – CS13754 – 69576 Limones...,[81 rue de Sans Souci – CS13754 – 69576 Limone...,"Farine 87,1 % (Blé (GLUTEN), Blé hydrolysé (GL..."


In [39]:
print('\n-----------------------------------------------\n'.join(predicted_df['predicted']))

Eau, maltodextrine, sel, arômes, sucre, arôme naturel de citronnelle, amidon modifié, ail en poudre, épices (combava, curcuma), extraits 
d'épices (gingembre, poivre), stabilisant (gomme xanthane).   
-----------------------------------------------

-----------------------------------------------
Conforme à : 
–  Décret « Pâtes Alimentaires » 55-1175 du 31/08/55 
–  Arrêté « Pâtes-Semoules de blé dur » du 27/05/57 
–  Arrêté « Quantité Nominale  Pâtes» du 08/10/08 
–  Décret « Métrologie » 78-166 du 31/01/78 
–  Règlement « OGM » 1829/2003 du 22/09/2003 
–  Réglementation européenne Pesticides et Contaminants 
–  Règlement CE 178/2002 du 28/01/2002 sur la sécurité alimentaire des denrées 
–  Règlement CE « Microbiologie des Denrées Alimentaires» 2073/2005 du 15/11/05 
–  Règlement UE n°1169/2011 du 25/10/11 concernant l’information des consommateurs sur les 
denrées alimentaires 
-----------------------------------------------
Information sur le risque de contaminations croisées : le 

# 7 Scoring

Now it is time to evaluate the performance of this simple model.

## 7.1 Target ingredient lists acquisition

First, we load the manually labelled ground truth into a target DataFrame.

In [40]:
y = pd.read_csv(os.path.join('.', 'ground_truth', 'manually_labelled_ground_truth.csv'),
                sep=';',
                encoding='latin-1',
                index_col='uid')
y

Unnamed: 0_level_0,designation,ingredients
uid,Unnamed: 1_level_1,Unnamed: 2_level_1
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ..."
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de blé T65, eau, levure, vinaigre de ci..."
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,- 100% Semoule de BLE dur de qualité supérieur...
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,fève de tonka (graines ridées de 25 à 50mm de ...
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol..."
...,...,...
7c709a5e-b913-4a02-9396-a6469b09482a,Mini-bonbons aux fruits en sachet 1 kg SKENDY,SIROP DE GLUCOSE / SUCRE / AROMES / ACIDIFIANT...
c5dee4ab-9f57-4533-9f89-e216ee110f68,"FARINE DE BLÉ TYPE 45, 25KG",Farine de blé T45
e67341d8-350f-46f4-9154-4dbbb8035621,PRÉPARATION POUR CRÈME BRÛLÉE BIO 6L,"Sucre roux de canne*° (64%), amidon de maïs*, ..."
a8f6f672-20ac-4ff8-a8f2-3bc4306c8df3,Céréales instantanées en poudre saveur caramel...,"Farine 87,1 % (Blé (GLUTEN), Blé hydrolysé (GL..."


In [41]:
comparison = pd.concat([y['ingredients'], predicted_df['predicted']], axis=1)
comparison

Unnamed: 0_level_0,ingredients,predicted
uid,Unnamed: 1_level_1,Unnamed: 2_level_1
a0492df6-9c76-4303-8813-65ec5ccbfa70,"Eau, maltodextrine, sel, arômes, sucre, arôme ...","Eau, maltodextrine, sel, arômes, sucre, arôme ..."
d183e914-db2f-4e2f-863a-a3b2d054c0b8,"Farine de blé T65, eau, levure, vinaigre de ci...",
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,- 100% Semoule de BLE dur de qualité supérieur...,Conforme à : \n– Décret « Pâtes Alimentaires ...
528d4be3-425c-4f8b-8a87-12f1bc645ddd,fève de tonka (graines ridées de 25 à 50mm de ...,Information sur le risque de contaminations cr...
51b38427-b2ea-4c56-93e8-4242361ef31b,"Aubergine 60,5% (aubergine, huile de tournesol...","A u b e r g i n e 6 0 , 5 % \n(aubergine, h..."
...,...,...
7c709a5e-b913-4a02-9396-a6469b09482a,SIROP DE GLUCOSE / SUCRE / AROMES / ACIDIFIANT...,MENTIONS D'ÉTIQUETAGE- LABELS MENTIONS\nINGRED...
c5dee4ab-9f57-4533-9f89-e216ee110f68,Farine de blé T45,Les valeures moyennes indiquées sont soumises ...
e67341d8-350f-46f4-9154-4dbbb8035621,"Sucre roux de canne*° (64%), amidon de maïs*, ...",\nGluten \nŒufs \nPoissons \nCrustacés \nArac...
a8f6f672-20ac-4ff8-a8f2-3bc4306c8df3,"Farine 87,1 % (Blé (GLUTEN), Blé hydrolysé (GL...","Farine 87,1 % (Blé (GLUTEN), Blé hydrolysé (GL..."
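
The side-by-side table above relies on `pd.concat` with `axis=1`, which aligns rows on the index (here `uid`) rather than on position. A minimal sketch of that behaviour, using made-up labels and strings (the names `target`, `predicted` and `toy_comparison` are illustrative assumptions, not objects from this notebook):

```python
import pandas as pd

# Hypothetical toy data: index labels and strings are invented for illustration.
target = pd.Series(
    {'a': 'eau, sel', 'b': 'farine de blé', 'c': 'sucre'},
    name='ingredients',
)
predicted = pd.Series(
    {'c': 'sucre', 'a': 'eau, sel'},  # different order, and no entry for 'b'
    name='predicted',
)

# axis=1 places the Series side by side and aligns rows on the index labels.
# The default outer join keeps 'b' and fills its missing prediction with NaN.
toy_comparison = pd.concat([target, predicted], axis=1)
print(toy_comparison)
```

Because the alignment is label-based, the row order of the two inputs does not matter, and a product without a prediction shows up as a missing value rather than shifting the other rows.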


## 7.2 Accuracy based on strict comparison

The most straightforward way to assess the model is a strict comparison: a prediction counts as correct only if the predicted string is exactly identical to the ground truth.

In [49]:
accuracy = sum(comparison['ingredients'] == comparison['predicted']) / len(comparison)
accuracy

0.03
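
To make the behaviour of this metric concrete, here is a small sketch on made-up data (the `toy` DataFrame and its strings are assumptions for illustration, not rows from the dataset). It computes the same strict accuracy and shows that any difference in case, accents or whitespace counts as a miss:

```python
import pandas as pd

# Hypothetical toy data, not taken from the notebook's dataset.
toy = pd.DataFrame(
    {
        'ingredients': ['Eau, sel', 'Farine de blé T45', 'Sucre, amidon'],
        'predicted': ['Eau, sel', 'farine de BLE T45', 'Sucre,  amidon \n'],
    }
)

# Strict comparison: a row matches only if both strings are identical
# character for character.
strict_match = toy['ingredients'] == toy['predicted']

# The mean of a boolean Series is the share of True values,
# i.e. the same quantity as sum(match) / len(toy).
strict_accuracy = strict_match.mean()

print(strict_match.tolist())
print(strict_accuracy)
```

Only the first row matches here, even though the other two predictions carry the same information as their targets.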

This score (3%) is surprisingly low: when reading the results, more ingredient lists seem to be correctly extracted. Maybe