# PIM Estimator

This notebook aims to test the PIM-Estimator module.

In [1]:
import os

import numpy as np
import pandas as pd

from src.pimest import IngredientExtractor
from src.pimest import PIMIngredientExtractor
from src.pimest import PathGetter
from src.pimest import ContentGetter
from src.pimapi import Requester
from src.pimpdf import PDFDecoder

# 1. Extracting the data

First, let's refresh the data from the production environment.

In [2]:
#requester = Requester('prd', proxies=None)
requester = Requester('prd')
print('----------------------------------------')
requester.refresh_directory()
print('----------------------------------------')
requester.modification_report()
print('----------------------------------------')
requester.fetch_list_from_PIM(requester.modified_items(), batch_size=20)
print('----------------------------------------')
requester.dump_data_from_result()
print('----------------------------------------')
requester.dump_files_from_result()
print('----------------------------------------')
requester.modification_report()
print('----------------------------------------')

----------------------------------------
Done
----------------------------------------
Number of items: 13132
Number of items with outdated data: 145
Number of items with outdated files: 145
----------------------------------------
Done
----------------------------------------
Done
----------------------------------------
Launching 8 threads.
Thread complete!
Thread complete!
Thread complete!
Thread complete!
Thread complete!
Thread complete!
Thread complete!
Thread complete!
Done
----------------------------------------
Number of items: 13132
Number of items with outdated data: 0
Number of items with outdated files: 0
----------------------------------------
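
The `batch_size=20` argument passed to `fetch_list_from_PIM` suggests that the modified items are requested in groups of 20. How `Requester` does this internally is not shown in this notebook. The sketch below only illustrates the idea, and `iter_batches` is a hypothetical helper, not part of `src.pimapi`.

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield successive lists of at most `batch_size` items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Placeholder uids standing in for the 145 outdated items reported above.
modified = [f"uid-{i}" for i in range(145)]
batches = list(iter_batches(modified, batch_size=20))
print(len(batches), len(batches[-1]))
```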


Then, fetch the ingredient lists into a pandas DataFrame:

In [3]:
requester.fetch_all_from_PIM(page_size=1000, max_page=-1, nx_properties='*')
mapping = {'uid': 'uid', 'Libellé': 'title', 'Ingrédients': 'properties.pprodc:ingredientsList'}
df = requester.result_to_dataframe(record_path='entries', mapping=mapping, index='uid')
df

Done


uid,Libellé,Ingrédients
afee12c7-177e-4a68-9539-8cbb68442503,DESTR D'ODEURS AIR&TEXTILES 750CCX6 DESODOR U2,
7d390121-17e8-43bf-a357-9d06b79d2d47,THÉ VERT AGRUME BTE 25S FRAICH LIPTON,
f234cd84-c8f6-433f-85ec-6e0b6980adc6,TORTILLA BLE 30CM,
e82a8173-b379-41ac-b319-aa058a04fcfb,VIN ROUGE MÉDITERRANÉE 25CL X12,
4b12c47c-84f5-4132-b362-22b864379a67,VIN MÉDITERRANÉE ROSÉ 25CL X12,
...,...,...
5cde49c6-9e7e-4bd2-b22a-3239f643379d,ROULEAU CÉLISOFT 1.20X50 M CITRON,
0273eadc-851a-4b68-8020-8041700a4f3d,2D VENT FRAIS 5LX4 DESODOR U2,
ef42a938-2203-446e-8d28-9fd27c6d3146,3D VENT FRAIS 5LX4 DESODOR U2,
68f5d81b-7f91-40a0-8504-0ec320a86de4,NETTOYANT INOX PAR LOT DE 2,


We only keep the products for which there is an ingredient list in the system.

In [4]:
df = df.loc[pd.notna(df['Ingrédients'])]
df

uid,Libellé,Ingrédients
a84ebaef-3c39-4661-a8bf-bcd0b297076d,ETUI RIZ DE CAMARGUE IGP LONG BLANC,RIZ LONG BLANC
898c0810-7f5a-4a1d-b169-b191dcd9d0bd,CRAC FORM 210G X 8,"Farine de maïs, farine de riz, sucre, sel."
075672b6-c1d5-43d9-9e80-21ec44b2d872,CRAC'FORM MAÏS RIZ SARRASIN 210G X 8,"farine de riz 40%, farine de maïs 38%, amidon ..."
0560a9c2-d635-46ef-a23f-df7ad8632257,SAC 500 GR MORILLES DE CULTURE,MORILLES
01817326-7f55-4a1c-928e-47d1bed8a38e,SPY PYRENEA 150 PK6 EP,EAU DE SOURCE
...,...,...
16a1db98-79ba-4ea7-9943-6d093a4c8ee9,TOMATES SÉCHÉES À L’HUILE POCHE 650G LA PULPE,"Tomates séchées réhydratées, huile de tourneso..."
c778d6f9-aa06-47d4-a4ec-199eec9e1373,MORCEAUX DE POIVRONS ROUGES ET JAUNES GRILLÉS ...,"Poivrons rouges et jaunes, huile de tournesol,..."
3d4a97ed-d6be-49c4-8eae-8ce72c50e68f,QUARTIERS D’ARTICHAUTS MARINÉS À L’HUILE POCHE...,"Artichauts, huile de tournesol, sel, sucre, pl..."
4128f89a-8df7-4da7-a2c0-3ee1302a46f4,BARQUETTE PATE A TARTINER POULAIN,"Ingrédients : Sucre, huile de tournesol, NOISE..."


# 2. Training the estimator

For this simple test, the estimator is trained on the whole dataset. This is not good practice, since no data is held out for evaluation; it only demonstrates how the class is used.

## 2.1 Importing the module

The cell below only makes it possible to reload the source code of the `pimest` module without restarting the kernel.

In [5]:
#import importlib
#import src.pimest
#importlib.reload(src.pimest)

## 2.2 Training the estimator

As noted above, the estimator is fitted on the whole dataset, which is not good practice.

In [6]:
estim = IngredientExtractor()
estim.fit(df.loc[:, 'Ingrédients'])

<src.pimest.IngredientExtractor at 0x297dba70d48>

In [7]:
estim.vectorized_texts_

<9633x4161 sparse matrix of type '<class 'numpy.int64'>'
	with 168749 stored elements in Compressed Sparse Row format>

In [8]:
estim.vocabulary_

{'riz': 3362,
 'long': 2489,
 'blanc': 735,
 'farine': 1866,
 'de': 1368,
 'maïs': 2607,
 'sucre': 3639,
 'sel': 3516,
 '40': 168,
 '38': 163,
 'amidon': 467,
 'sarrasin': 3478,
 'fibre': 1890,
 'pomme': 3123,
 'millet': 2643,
 'teff': 3727,
 'fabriqué': 1858,
 'dans': 1364,
 'un': 3893,
 'atelier': 606,
 'qui': 3284,
 'utilise': 3902,
 'des': 1401,
 'protéines': 3205,
 'lait': 2406,
 'et': 1813,
 'du': 1493,
 'soja': 3566,
 'morilles': 2708,
 'eau': 1721,
 'source': 3591,
 'kombu': 2386,
 'déshydraté': 1529,
 '100': 38,
 'graines': 2129,
 'moutarde': 2725,
 'vinaigre': 3962,
 'alcool': 427,
 'acidifiant': 345,
 'acide': 334,
 'citrique': 1085,
 'conservateur': 1191,
 'disulfite': 1464,
 'potassium': 3146,
 'jus': 2349,
 'orange': 2892,
 'citron': 1087,
 'base': 671,
 'concentré': 1166,
 'correcteur': 1236,
 'acidité': 352,
 'lactate': 2394,
 'sodium': 3561,
 'édulcorants': 4076,
 'aspartame': 593,
 'acésulfame': 367,
 'conservateurs': 1192,
 'sorbate': 3575,
 'benzoate': 687,
 'arômes

In [9]:
estim.mean_corpus_

matrix([[0.00197239, 0.00010381, 0.00010381, ..., 0.00892764, 0.00851241,
         0.00083048]])

# 3. Testing the estimator

## 3.1 Parsing a doc into blocks

First, we parse a single doc into blocks of texts:

In [10]:
uid = '7ad672f8-40d4-4527-ab49-af3284d23fab'
path = os.path.join('.', 'dumps', 'prd', uid, 'FTF.pdf')
blocks = PDFDecoder.path_to_blocks(path)
blocks

['FICHE TECHNIQUE ',
 ' ',
 'BBEETTTTEERRAAVVEESS  RROOUUGGEESS    ',
 'CCOOUUPPÉÉEESS  EENN  DDÉÉSS  ',
 'LLEEGGEERREEMMEENNTT  VVIINNAAIIGGRREEEESS  ',
 ' \n \n \n \n \n \n \n ',
 'Page : 1/2 ',
 ' \nService Qualité Groupe ',
 'd’aucy/CGC ',
 '56 500 LOCMINE ',
 'Référence : SQ/LV/624 - Version : F ',
 'Date : 3/08/18 ',
 ' ',
 'CARACTÉRISTIQUES GÉNÉRALES :  ',
 'Dénomination Réglementaire ',
 'Code produit \nFormat \nContenance \nPoids Net Total \nPoids Net Égoutté ',
 'Liste des Ingrédients ',
 'Campagne de Production \nOrigine légume ',
 ' \n ',
 'Betteraves rouges coupées en dés ',
 'légèrement vinaigrées ',
 '4505 \n5/1 ',
 '4250 ml \n4000 g \n2655 g ',
 'Betteraves rouges, eau, vinaigre d’alcool (0,9%), sucre, sel, ',
 'acidifiant : acide citrique. (E330) ',
 'Septembre à Décembre ',
 'France ',
 ' ',
 'NOMBRE DE PARTS ',
 'Adultes \nEnfants ',
 '26 parts \n44 parts ',
 ' \nCARACTÉRISTIQUES DU PRODUIT : ',
 'Réf. CTCPA _ décision n°48 – Conserves de betteraves rouges \n \nBette

## 3.2 Predicting the ingredient block

Then we predict which block is most likely to hold the ingredient list:

In [11]:
block_num = estim.predict(blocks)
print(blocks[block_num])

Pour 100 g*AR**Pour 100 g*AR**Energie (kJ)128Glucides (g)5,22%Energie (kcal)31Dont sucres (g)4,85%Matières grasses (g)0,10%Fibres alimentaires (g)2,4Dont acides gras saturés (g)0,00%Protéines (g)1,02%*pour 100 g de produit égouttéSel (g)0,6010%**Apport de référence pour un adulte-type (8400kJ/2000 kcal).Les apports de références varient en fonction de l'âge, du sexe et de l'activité physique.2%


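For the product parsed above (uid `7ad672f8-40d4-4527-ab49-af3284d23fab`), the block printed here is the nutrition table, not the ingredient list (the block starting with `Betteraves rouges, eau, vinaigre d’alcool (0,9%)` in the parsed output). In this run, the estimator therefore did not identify the ingredient block.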
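
The fitted attributes shown earlier (`vectorized_texts_`, `vocabulary_`, `mean_corpus_`) suggest a bag-of-words model in which each candidate block is compared with the average training text. The scoring rule actually used by `IngredientExtractor.predict` is not shown in this notebook, so the following is only a minimal sketch under that assumption, using cosine similarity and standard-library code. Every function name in it is hypothetical.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\w\w+")  # tokens of two or more word characters


def tokenize(text):
    return TOKEN.findall(text.lower())


def fit_mean_corpus(texts):
    """Return the average word-count vector of the training texts."""
    totals = Counter()
    for text in texts:
        totals.update(tokenize(text))
    return {word: count / len(texts) for word, count in totals.items()}


def cosine(counts, mean_corpus):
    """Cosine similarity between a block's counts and the corpus mean."""
    dot = sum(n * mean_corpus.get(w, 0.0) for w, n in counts.items())
    norm_a = math.sqrt(sum(n * n for n in counts.values()))
    norm_b = math.sqrt(sum(v * v for v in mean_corpus.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_block(blocks, mean_corpus):
    """Return the index of the block closest to the corpus mean (assumed rule)."""
    scores = [cosine(Counter(tokenize(b)), mean_corpus) for b in blocks]
    return max(range(len(blocks)), key=scores.__getitem__)


# Toy data: three ingredient lists taken from the DataFrame shown earlier.
train_texts = [
    "Farine de maïs, farine de riz, sucre, sel.",
    "RIZ LONG BLANC",
    "EAU DE SOURCE",
]
mean_corpus = fit_mean_corpus(train_texts)
toy_blocks = ["FICHE TECHNIQUE", "Page : 1/2", "Farine de riz, sucre, sel", "France"]
best = predict_block(toy_blocks, mean_corpus)
print(best, toy_blocks[best])
```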

# 4. Wrapped Estimator

A wrapper class makes it possible to compare the current content of the PIM system directly with what has been extracted from the associated PDF file.

This helper inherits directly from the `IngredientExtractor` class:

In [12]:
#from importlib import reload
#import src.pimest
#reload(src.pimest)

In [13]:
#wrapped_estim = PIMIngredientExtractor(env='prd', proxies=None)
wrapped_estim = PIMIngredientExtractor(env='prd')
wrapped_estim.fit(df.loc[:, 'Ingrédients'])

<src.pimest.PIMIngredientExtractor at 0x297dd527488>

In [14]:
wrapped_estim.compare_uid_data('6fa56f54-57cf-4328-b034-f11380d301d0')

Fetching data from PIM for uid 6fa56f54-57cf-4328-b034-f11380d301d0...
Done
----------------------------------------------------------
Ingredient list from PIM is :

Sucre, huile de palme, NOISETTES 13%, LAIT écrémé en poudre 8,7%, cacao maigre 7,4%, émulsifiants : lécithines (SOJA), vanilline

----------------------------------------------------------
Supplier technical datasheet from PIM for uid 6fa56f54-57cf-4328-b034-f11380d301d0 is:
https://produits.groupe-pomona.fr/nuxeo/nxfile/default/6fa56f54-57cf-4328-b034-f11380d301d0/pprodad:technicalSheet/FC2017%20Nutella%20.pdf?changeToken=162-0
----------------------------------------------------------
Downloading content of technical datasheet file...
Done!
----------------------------------------------------------
Parsing content of technical datasheet file...
Done!
----------------------------------------------------------
Ingredient list extracted from technical datasheet:

Sucre, huile de palme, noisettes 13%, lait écrémé 
en poudre 
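
The two printed lists differ in case and line breaks even where the wording agrees, so a normalised comparison can make the agreement easier to judge. This is a sketch based on the standard-library `difflib`, not something `compare_uid_data` is known to do. `normalize` and `similarity` are hypothetical names, and the extracted string is the first part of the ingredient text found in the datasheet.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case the text and collapse all whitespace runs to one space."""
    return re.sub(r"\s+", " ", text).strip().lower()


def similarity(left, right):
    """Return a ratio in [0, 1]; 1.0 means identical after normalisation."""
    return SequenceMatcher(None, normalize(left), normalize(right)).ratio()


pim_text = (
    "Sucre, huile de palme, NOISETTES 13%, LAIT écrémé en poudre 8,7%, "
    "cacao maigre 7,4%, émulsifiants : lécithines (SOJA), vanilline"
)
extracted_text = (
    "Sucre, huile de palme, noisettes 13%, lait écrémé \n"
    "en poudre 8,7%, cacao maigre 7,4%, émulsifiants :  "
)

score = similarity(pim_text, extracted_text)
is_prefix = normalize(pim_text).startswith(normalize(extracted_text))
print(round(score, 2), is_prefix)
```

Here the extracted text is a prefix of the PIM text once both are normalised, which points to a block-splitting issue rather than a wording difference.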

In [15]:
wrapped_estim.print_blocks()

0  |  FERRERO France  

1  |  CS90058
76136 MONT-SAINT-
AIGNAN 

2  |  sept-17 

3  |  NUTELLA 

4  |  INFORMATIONS NUTRITIONNELLES 

5  |  Pâtes à tartiner aux noisettes 

6  |  Valeurs nutritionnelles  

7  |  moyennes   

8  |  Pour 100g 

9  |  Pour 15g 

10  |  %* par 15g 

11  |  INGREDIENTS 

12  |  ENERGIE 

13  |  2252 Kj/539 kcal 

14  |  336 Kj/80 kcal 

15  |  Sucre, huile de palme, noisettes 13%, lait écrémé 
en poudre 8,7%, cacao maigre 7,4%, émulsifiants :  

16  |  lécithines [soja], vanilline 

17  |  ALLERGENES 

18  |  MATIERES GRASSES 

19  |  dont acides gras saturés 

20  |  GLUCIDES 

21  |  dont sucres 

22  |  30,9 g 

23  |  10,6 g 

24  |  57,5 g 

25  |  56,3 g 

26  |  4,6 g 

27  |  1,6 g 

28  |  8,6 g 

29  |  8,4 g 

30  |  PROTEINES 

31  |  6,3 g 

32  |  0,9 g 

33  |  CONSERVATION 

34  |  *Apport de référence pour un adulte-type (8 400 kJ/2 000 kcal) 

35  |  SEL 

36  |  0,11 g 

37  |  0,016 g 

38  |   - noisettes 

39  |   - lait 

40  |   - so

In [16]:
from sklearn.model_selection import train_test_split

train_uids, test_uids = train_test_split(df, test_size=500, random_state=42)
#test_uids.reset_index().loc[:, 'uid'].to_csv(os.path.join('.', 'test_uids.csv'), header=True, encoding='utf-8-sig', index=False)

# 5. Transformers

A handful of transformers have been developed to process the data.

## 5.1 PathGetter

This transformer takes a DataFrame indexed by uid and adds a column holding the path to the PDF file on disk. The root directory depends on whether the uid belongs to the ground truth set or to the "normal" train set.

The root paths can be passed at initialization; otherwise they default to the values specified in the configuration file.

The ground truth uids must be declared at initialization.

In [17]:
from pathlib import Path

In [18]:
ground_truth_df = pd.read_csv(os.path.join('.', 'ground_truth', 'manually_labelled_ground_truth.csv'),
                              sep=';',
                              encoding='latin-1',
                              index_col='uid')
ground_truth_uids = list(ground_truth_df.index)
ground_truth_uids

['a0492df6-9c76-4303-8813-65ec5ccbfa70',
 'd183e914-db2f-4e2f-863a-a3b2d054c0b8',
 'ab48a1ed-7a3d-4686-bb6d-ab4f367cada8',
 '528d4be3-425c-4f8b-8a87-12f1bc645ddd',
 '51b38427-b2ea-4c56-93e8-4242361ef31b',
 'e4c8c61f-a401-4384-8128-181447e5bdd2',
 '50766d10-6135-4958-b743-1cddfcb7c230',
 'bb9f3995-57b3-429d-b075-2d81a90e406f',
 '52bf5f5a-79f2-4652-8eac-2b80c697269b',
 '04235024-80f3-46c2-bad0-aae0d5fab024',
 '93a5d344-4ab0-48de-bbb0-fec8117d07a2',
 '440ca0b4-1c79-433f-ac8d-6e2ae77d3288',
 '7b9cf3b3-bf4b-408c-b8c8-807a09745979',
 'b638ab96-427c-4d5f-9a19-74684eec3b8e',
 '8bb101c0-8c6e-4665-8310-1a2d13257b2e',
 '54f40033-f9cf-411c-81a5-11974f6715aa',
 '2405356e-ecc9-404f-8279-16fddf168b73',
 'c57be8e4-30e0-417b-90fa-b60d9f9ece48',
 '6622442d-d88f-4b90-9414-b91c9e8e5296',
 '507b428e-e99d-464b-b9d3-50629efe4355',
 '545e8bc5-15fd-458a-8c9f-b18faecc2016',
 'b45b3942-accd-4115-8f1f-736dbd42a35a',
 '6c07fd47-116d-4531-a164-d4f4040f212e',
 'ca711137-4644-45b7-9928-456a09d9746a',
 'e70d347d-5e91-

In [19]:
transformer = PathGetter(ground_truth_uids=ground_truth_uids,
                         train_set_path=Path.cwd() / 'dumps' / 'prd',
                         ground_truth_path=Path.cwd() / 'ground_truth',
                        )

In [25]:
idx = pd.Index(ground_truth_uids[:5] + list(df.index[:5]), name='uid')
test_df = df.loc[idx,:]
test_df = transformer.fit_transform(test_df)
test_df

uid,Libellé,Ingrédients,path
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de BLE T65, eau, levure, huile de colz...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,100% Semoule de BLE dur de qualité supérieure,C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,"Fève de tonka, taux de coumarine compris entre...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...
a84ebaef-3c39-4661-a8bf-bcd0b297076d,ETUI RIZ DE CAMARGUE IGP LONG BLANC,RIZ LONG BLANC,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...
898c0810-7f5a-4a1d-b169-b191dcd9d0bd,CRAC FORM 210G X 8,"Farine de maïs, farine de riz, sucre, sel.",C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...
075672b6-c1d5-43d9-9e80-21ec44b2d872,CRAC'FORM MAÏS RIZ SARRASIN 210G X 8,"farine de riz 40%, farine de maïs 38%, amidon ...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...
0560a9c2-d635-46ef-a23f-df7ad8632257,SAC 500 GR MORILLES DE CULTURE,MORILLES,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...
01817326-7f55-4a1c-928e-47d1bed8a38e,SPY PYRENEA 150 PK6 EP,EAU DE SOURCE,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...
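
The path rule described for `PathGetter` can be sketched without pandas. The layout `<root>/<uid>/FTF.pdf` matches the train-set path built earlier with `os.path.join('.', 'dumps', 'prd', uid, 'FTF.pdf')`. Using the same layout and file name under the ground truth root is an assumption, and `resolve_pdf_path` is a hypothetical function, not the transformer's code.

```python
from pathlib import PurePosixPath


def resolve_pdf_path(uid, ground_truth_uids, train_root, ground_truth_root,
                     filename="FTF.pdf"):
    """Pick the root directory by ground-truth membership, then append uid and file name."""
    root = ground_truth_root if uid in ground_truth_uids else train_root
    return PurePosixPath(root) / uid / filename


# One ground truth uid and one train-set uid, both taken from the table above.
ground_truth = {"a0492df6-9c76-4303-8813-65ec5ccbfa70"}
uids = ["a0492df6-9c76-4303-8813-65ec5ccbfa70", "a84ebaef-3c39-4661-a8bf-bcd0b297076d"]

paths = {
    uid: resolve_pdf_path(uid, ground_truth, "dumps/prd", "ground_truth")
    for uid in uids
}
for uid, path in paths.items():
    print(uid, path)
```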


## 5.2 ContentGetter

This transformer reads the files from disk and loads their content as bytes into the DataFrame.

In [26]:
transformer_2 = ContentGetter(missing_file='to_nan')
test_df = transformer_2.fit_transform(test_df)
test_df

uid,Libellé,Ingrédients,path,content
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de BLE T65, eau, levure, huile de colz...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n4 0 obj\r<</L...
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,100% Semoule de BLE dur de qualité supérieure,C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.4\n%\xc7\xec\x8f\xa2\n5 0 obj\n<</Len...
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,"Fève de tonka, taux de coumarine compris entre...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.3\n%\xc7\xec\x8f\xa2\n7 0 obj\n<</Len...
a84ebaef-3c39-4661-a8bf-bcd0b297076d,ETUI RIZ DE CAMARGUE IGP LONG BLANC,RIZ LONG BLANC,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...
898c0810-7f5a-4a1d-b169-b191dcd9d0bd,CRAC FORM 210G X 8,"Farine de maïs, farine de riz, sucre, sel.",C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...
075672b6-c1d5-43d9-9e80-21ec44b2d872,CRAC'FORM MAÏS RIZ SARRASIN 210G X 8,"farine de riz 40%, farine de maïs 38%, amidon ...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...
0560a9c2-d635-46ef-a23f-df7ad8632257,SAC 500 GR MORILLES DE CULTURE,MORILLES,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...
01817326-7f55-4a1c-928e-47d1bed8a38e,SPY PYRENEA 150 PK6 EP,EAU DE SOURCE,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.3\n3 0 obj\n<</Type /Page\n/Parent 1 ...
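
The `missing_file='to_nan'` option presumably turns an unreadable path into a missing value instead of stopping the whole transformation. That reading is an assumption, since the option is not documented in this notebook. The sketch below shows one way such behaviour could be written, and `read_content` is a hypothetical helper.

```python
import math
import tempfile
from pathlib import Path


def read_content(path, missing_file="to_nan"):
    """Return the file's bytes; on a missing file return NaN or re-raise."""
    try:
        return Path(path).read_bytes()
    except FileNotFoundError:
        if missing_file == "to_nan":  # assumed meaning of the option
            return float("nan")
        raise


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "FTF.pdf"
    existing.write_bytes(b"%PDF-1.5\n")
    found = read_content(existing)
    missing = read_content(Path(tmp) / "absent.pdf")

print(found[:5], missing)
```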


## 5.3 PDFContentParser

This transformer parses the content of the PDF files using the functionality defined in the `pimpdf` module.

In [27]:
from src.pimest import PDFContentParser
transformer_3 = PDFContentParser()
test_df = transformer_3.fit_transform(test_df)
test_df

Launching 4 processes.


uid,Libellé,Ingrédients,path,content,text
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,Concentré Liquide Asian CHEF® \n\nBouteille de...
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de BLE T65, eau, levure, huile de colz...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n4 0 obj\r<</L...,
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,100% Semoule de BLE dur de qualité supérieure,C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.4\n%\xc7\xec\x8f\xa2\n5 0 obj\n<</Len...,Direction Qualité \n\n \n\n \n\nPATES ALIMENTA...
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,"Fève de tonka, taux de coumarine compris entre...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE\n\nFEVE DE TONKA\n\nDipteryx O...
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.3\n%\xc7\xec\x8f\xa2\n7 0 obj\n<</Len...,CAVIAR \n\nD’AUBERGINES\n500G\n\nPRÉSENTATION\...
a84ebaef-3c39-4661-a8bf-bcd0b297076d,ETUI RIZ DE CAMARGUE IGP LONG BLANC,RIZ LONG BLANC,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,"SARL BENOIT\n\n367, chemin de Mérieux - Quarti..."
898c0810-7f5a-4a1d-b169-b191dcd9d0bd,CRAC FORM 210G X 8,"Farine de maïs, farine de riz, sucre, sel.",C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,CRAC’FORM \n\nDonnées commerciales \n\n \n \nB...
075672b6-c1d5-43d9-9e80-21ec44b2d872,CRAC'FORM MAÏS RIZ SARRASIN 210G X 8,"farine de riz 40%, farine de maïs 38%, amidon ...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,CRAC’FORM MAÏS RIZ SARRASIN \n\nDonnées commer...
0560a9c2-d635-46ef-a23f-df7ad8632257,SAC 500 GR MORILLES DE CULTURE,MORILLES,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE & LOGISTIQUE ASSURANCE QUALITE...
01817326-7f55-4a1c-928e-47d1bed8a38e,SPY PYRENEA 150 PK6 EP,EAU DE SOURCE,C:\Users\pmasse\Pyprojects\PIM-Recognizer\dump...,b'%PDF-1.3\n3 0 obj\n<</Type /Page\n/Parent 1 ...,Designation du produit\nMarque\nNom de la sour...


# 6. Pipelining transformers

These transformers can be chained with the standard scikit-learn `Pipeline`.

In [28]:
from importlib import reload
import src.pimest
reload(src.pimest)
from src.pimest import ContentGetter
from src.pimest import PathGetter
from src.pimest import PDFContentParser

In [29]:
from sklearn.pipeline import Pipeline

In [30]:
my_pipeline = Pipeline([('PathGetter', PathGetter(ground_truth_uids=ground_truth_uids,
                                                  train_set_path=Path.cwd() / 'dumps' / 'prd',
                                                  ground_truth_path=Path.cwd() / 'ground_truth',
                                                  )),
                        ('ContentGetter', ContentGetter(missing_file='to_nan')),
                        ('ContentParser', PDFContentParser()),
                        # ('BlockSplitter', ...),  # incomplete step (no transformer given), left out so the pipeline can be built
                       ],
                       verbose=True)

In [32]:
idx = pd.Index(ground_truth_uids)
test_df = df.loc[idx,:]
test_df = my_pipeline.fit_transform(test_df)
test_df

[Pipeline] .............. (step 1 of 3) Processing path, total=   0.2s
[Pipeline] ........... (step 2 of 3) Processing content, total=   2.9s
Launching 4 processes.
[Pipeline] ..... (step 3 of 3) Processing contentParser, total= 3.2min


uid,Libellé,Ingrédients,path,content,text
a0492df6-9c76-4303-8813-65ec5ccbfa70,Concentré liquide Asian en bouteille 980 ml CHEF,"Eau, maltodextrine, sel, arômes, sucre, arôme ...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,Concentré Liquide Asian CHEF® \n\nBouteille de...
d183e914-db2f-4e2f-863a-a3b2d054c0b8,Pain burger curry 80 g CREATIV BURGER,"Farine de BLE T65, eau, levure, huile de colz...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n4 0 obj\r<</L...,
ab48a1ed-7a3d-4686-bb6d-ab4f367cada8,Macaroni en sachet 500 g PANZANI,100% Semoule de BLE dur de qualité supérieure,C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.4\n%\xc7\xec\x8f\xa2\n5 0 obj\n<</Len...,Direction Qualité \n\n \n\n \n\nPATES ALIMENTA...
528d4be3-425c-4f8b-8a87-12f1bc645ddd,Fève de Tonka en sachet 100 g COMPTOIR COLONIAL,"Fève de tonka, taux de coumarine compris entre...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE\n\nFEVE DE TONKA\n\nDipteryx O...
51b38427-b2ea-4c56-93e8-4242361ef31b,Caviar d'aubergine en pot 500 g PUGET RESTAURA...,"Aubergine 60,5% (aubergine, huile de tournesol...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.3\n%\xc7\xec\x8f\xa2\n7 0 obj\n<</Len...,CAVIAR \n\nD’AUBERGINES\n500G\n\nPRÉSENTATION\...
...,...,...,...,...,...
7c709a5e-b913-4a02-9396-a6469b09482a,Mini-bonbons aux fruits en sachet 1 kg SKENDY,"Sirop de glucose, sucre, arômes naturels, acid...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.6\r\n%\xbd\xbe\xbc\r\n1 0 obj\r\n<<\r...,FTin. 396\nPage: 1/2\n\nIndice de révision:\nD...
c5dee4ab-9f57-4533-9f89-e216ee110f68,"FARINE DE BLÉ TYPE 45, 25KG",farine de BLE T45,C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,\n1050/10502066400 \n\n10502055300/1050202520...
e67341d8-350f-46f4-9154-4dbbb8035621,PRÉPARATION POUR CRÈME BRÛLÉE BIO 6L,"Sucre roux de canne*°(64%), amidon de maïs*, p...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,FICHE TECHNIQUE \n\nCREME BRÛLÉE 6L \n\nREF : ...
a8f6f672-20ac-4ff8-a8f2-3bc4306c8df3,Céréales instantanées en poudre saveur caramel...,"Farine 87,1 % (BLE (GLUTEN), BLE hydrolysé (GL...",C:\Users\pmasse\Pyprojects\PIM-Recognizer\grou...,b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n...,81 rue de Sans Souci – CS13754 – 69576 Limones...
