# PIM Estimator

This notebook tests the PIM estimator module (`src.pimest`).

In [1]:
import os

import numpy as np
import pandas as pd

from src.pimest import IngredientExtractor
from src.pimest import PIMIngredientExtractor
from src.pimapi import Requester
from src.pimpdf import PDFDecoder

# 1. Extracting the data

First, let's refresh the data from the production environment.

In [None]:
#requester = Requester('prd', proxies=None)
requester = Requester('prd')
print('----------------------------------------')
requester.refresh_directory()
print('----------------------------------------')
requester.modification_report()
print('----------------------------------------')
requester.fetch_list_from_PIM(requester.modified_items(), batch_size=20)
print('----------------------------------------')
requester.dump_data_from_result()
print('----------------------------------------')
requester.dump_files_from_result()
print('----------------------------------------')
requester.modification_report()
print('----------------------------------------')

Then, fetch the ingredient lists into a pandas DataFrame:

In [3]:
requester.fetch_all_from_PIM(page_size=1000, max_page=-1, nx_properties='*')
mapping = {'uid': 'uid', 'Libellé': 'title', 'Ingrédients': 'properties.pprodc:ingredientsList'}
df = requester.result_to_dataframe(record_path='entries', mapping=mapping, index='uid')
df

Done


uid,Libellé,Ingrédients
afee12c7-177e-4a68-9539-8cbb68442503,DESTRUCTEUR D'ODEUR 500 ML,
7d390121-17e8-43bf-a357-9d06b79d2d47,THÉ VERT AGRUME BTE 25S FRAICH LIPTON,
f234cd84-c8f6-433f-85ec-6e0b6980adc6,TORTILLA BLE 30CM,
e82a8173-b379-41ac-b319-aa058a04fcfb,VIN ROUGE MÉDITERRANÉE 25CL X12,
4b12c47c-84f5-4132-b362-22b864379a67,VIN MÉDITERRANÉE ROSÉ 25CL X12,
...,...,...
c9c05f33-afc1-4b78-8391-2bcba74887a7,Garniture pâtissière à froid en sac 2.5 kg COM...,"Sucre, amidon modifié, poudre de LACTOSERUM, m..."
b7c1f419-6b98-4787-88ea-f2dfdbb345ea,"Sun muffin en sac 2,5 kg COMPLET","Sucre, farine de FROMENT, amidon de FROMENT, a..."
d9887a2e-b463-4329-9eb1-60839a21ba42,"Cookie en sac 2,5 kg COMPLET","Farine de FROMENT, sucre, poudres à lever : (E..."
dfbdd53f-2caa-4a54-b419-4cd57c276c8c,"Crème pâtissière à froid pur beurre en sac 2,5...","Sucre, amidon modifié, poudre de LACTOSERUM, p..."


We keep only the products that have an ingredient list in the PIM system.

In [4]:
df = df.loc[pd.notna(df['Ingrédients'])]
df

uid,Libellé,Ingrédients
34d33a48-5735-49ec-a08f-6642933dec00,Préparation de concentré de citron et d'aneth ...,Préparation aux jus de citron & aneth (condime...
d45d3058-f7d5-4cb8-be37-99bcc2ed06e9,Ecume de saveurs en bouteille 150 ml MISS ALGAE,Préparation de jus de yuzu ( condiment de bals...
47b861bc-42bc-4f7e-9b76-a6d4504b3907,Pain spécial burger à la semoule de blé 72 g H...,"Farine de BLÉ 59%, eau, levure, dextrose, semo..."
d81eae2e-0232-46f8-b058-bdc1e6b7a64e,Ketchup en flacon souple 220 ml HEINZ,"Tomates (148 g pour 100 g de ketchup), vinaigr..."
abd31acd-e2d9-437a-9588-5bf0d0f3f059,Confit de canard du Gers IGP en boîte 2/1 CANA...,"Cuisses de canard confites, graisse de canard,..."
...,...,...
c9c05f33-afc1-4b78-8391-2bcba74887a7,Garniture pâtissière à froid en sac 2.5 kg COM...,"Sucre, amidon modifié, poudre de LACTOSERUM, m..."
b7c1f419-6b98-4787-88ea-f2dfdbb345ea,"Sun muffin en sac 2,5 kg COMPLET","Sucre, farine de FROMENT, amidon de FROMENT, a..."
d9887a2e-b463-4329-9eb1-60839a21ba42,"Cookie en sac 2,5 kg COMPLET","Farine de FROMENT, sucre, poudres à lever : (E..."
dfbdd53f-2caa-4a54-b419-4cd57c276c8c,"Crème pâtissière à froid pur beurre en sac 2,5...","Sucre, amidon modifié, poudre de LACTOSERUM, p..."
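
Note that `pd.notna` only removes missing values: an empty or whitespace-only string would still pass the filter. The sketch below illustrates the difference on made-up data (the toy DataFrame and the stricter `has_text` mask are assumptions for illustration, not part of the notebook's pipeline).

```python
import pandas as pd

# Toy data standing in for the PIM export (values are made up for illustration).
toy = pd.DataFrame(
    {
        'Libellé': ['A', 'B', 'C', 'D'],
        'Ingrédients': ['Sucre, eau', None, '', '   '],
    },
    index=pd.Index(['u1', 'u2', 'u3', 'u4'], name='uid'),
)

# Same filter as in the notebook: drops only missing values (None / NaN).
not_missing = toy.loc[pd.notna(toy['Ingrédients'])]

# Stricter variant (assumption): also drop strings that are empty once stripped.
has_text = toy['Ingrédients'].fillna('').str.strip().ne('')
non_empty = toy.loc[has_text]

print(list(not_missing.index))
print(list(non_empty.index))
```

Whether the stricter filter is needed depends on how empty ingredient lists are stored in the PIM system, which this notebook does not show.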


# 2. Training the estimator

For this simple test, the estimator is trained on the whole dataset. This is not good practice, since no data is held out for evaluation; it only demonstrates how the class is used.

## 2.1 Importing the module

The cell below makes it possible to reload the source code of the `pimest` module without restarting the kernel.

In [None]:
#import importlib
#import src.pimest
#importlib.reload(src.pimest)

## 2.2 Training the estimator

As noted above, we train the estimator on the whole dataset, even though this is not good practice.

In [5]:
estim = IngredientExtractor()
estim.fit(df.loc[:, 'Ingrédients'])

<src.pimest.IngredientExtractor at 0x2e8a8e24fc8>

In [6]:
estim.vectorized_texts_

<9578x4078 sparse matrix of type '<class 'numpy.int64'>'
	with 167447 stored elements in Compressed Sparse Row format>

In [7]:
estim.mean_corpus_

matrix([[0.00198371, 0.00010441, 0.00010441, ..., 0.00908332, 0.00803926,
         0.00083525]])
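
The two attributes above look like a bag-of-words count matrix (one row per ingredient list, one column per vocabulary word) and its column-wise mean. The value `0.00010441` is 1/9578, which fits a word that appears exactly once in the 9578 texts. The sketch below reproduces that kind of computation on a made-up corpus. It assumes scikit-learn's `CountVectorizer` with default settings, which is a guess about how `IngredientExtractor` works internally, not something the notebook confirms.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for df['Ingrédients'] (made-up examples).
corpus = [
    'sucre, farine de blé, oeuf',
    'sucre, eau, sel',
    'farine de blé, eau, levure, sel',
]

# Bag-of-words counts: one row per text, one column per vocabulary word.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)  # sparse matrix of integer counts

# Column-wise mean: the average count of each word over the corpus.
mean_counts = np.asarray(counts.mean(axis=0)).ravel()

# 'sucre' appears once in two of the three texts, so its mean count is 2/3.
sucre_mean = mean_counts[vectorizer.vocabulary_['sucre']]
print(counts.shape, sucre_mean)
```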

# 3. Testing the estimator

## 3.1 Parsing a doc into blocks

First, we parse a single document into blocks of text:

In [None]:
uid = '7ad672f8-40d4-4527-ab49-af3284d23fab'
path = os.path.join('.', 'dumps', 'prd', uid, 'FTF.pdf')
blocks = PDFDecoder.path_to_blocks(path)
blocks

## 3.2 Predicting the ingredient block

Then we predict which block is most likely to hold the ingredient list:

In [None]:
block_num = estim.predict(blocks)
print(blocks[block_num])

We can see that, for this product, the estimator has successfully identified the ingredient block!
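
One plausible way to pick the ingredient block is to vectorize each block with the fitted vocabulary and keep the block closest to the average ingredient list. The sketch below does this with cosine similarity on made-up data. The function `predict_block`, the toy corpus and the choice of cosine similarity are all assumptions for illustration; the actual scoring used by `IngredientExtractor.predict` is not shown in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy training corpus standing in for the PIM ingredient lists (made up).
corpus = [
    'sucre, farine de blé, oeuf',
    'sucre, eau, sel',
    'farine de blé, eau, levure, sel',
]
vectorizer = CountVectorizer()
# Mean word counts over the corpus, kept as a (1, n_words) array.
mean_counts = np.asarray(vectorizer.fit_transform(corpus).mean(axis=0))


def predict_block(blocks):
    """Hypothetical helper: index of the block most similar to the mean corpus vector."""
    block_counts = vectorizer.transform(blocks).toarray()
    scores = cosine_similarity(block_counts, mean_counts).ravel()
    return int(np.argmax(scores))


# Toy blocks standing in for the output of PDFDecoder.path_to_blocks (made up).
toy_blocks = [
    'Fiche technique produit',
    'Ingrédients : sucre, farine de blé, sel, eau',
    'Conservation : à température ambiante',
]
best = predict_block(toy_blocks)
print(toy_blocks[best])
```

Blocks that share no word with the training vocabulary get a score of zero, so only blocks that use ingredient-like vocabulary can be selected.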

# 4. Wrapped Estimator

A helper wrapper class makes it possible to directly compare the current content of the PIM system with what has been extracted from the associated PDF file.

This helper inherits directly from the `IngredientExtractor` class:

In [None]:
#import importlib
#import src.pimest
#importlib.reload(src.pimest)

In [8]:
#wrapped_estim = PIMIngredientExtractor(env='prd', proxies=None)
wrapped_estim = PIMIngredientExtractor(env='prd')
wrapped_estim.fit(df.loc[:, 'Ingrédients'])

<src.pimest.PIMIngredientExtractor at 0x2e8a777ac48>

In [10]:
wrapped_estim.compare_uid_data('d0aa2c1c-4317-4e5f-8a18-82e56976da22')

Fetching data from PIM for uid d0aa2c1c-4317-4e5f-8a18-82e56976da22...
Done
----------------------------------------------------------
Ingredient list from PIM is :

Sucre, dextrose, maltodextrine, stabilisants (E460, E450, E516, E401, E404), épaississant (E407), arôme naturel (LAIT).Peut contenir : FRUITS A COQUE, OEUF, SOJA et GLUTEN.

----------------------------------------------------------
Supplier technical datasheet from PIM for uid d0aa2c1c-4317-4e5f-8a18-82e56976da22 is:
https://produits.groupe-pomona.fr/nuxeo/nxfile/default/d0aa2c1c-4317-4e5f-8a18-82e56976da22/pprodad:technicalSheet/FT%20NESTL%C3%89%20Docello%20Panna%20Cotta%20-%20Etui%20de%20600%20g.pdf?changeToken=76-0
----------------------------------------------------------
Downloading content of technical datasheet file...
Done!
----------------------------------------------------------
Parsing content of technical datasheet file...
Done!
----------------------------------------------------------
Ingredient list extrac

In [None]:
wrapped_estim.print_blocks(wrapped_estim.resp)

In [None]:
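from sklearn.model_selection import train_test_split

# Hold out 500 products for testing; random_state makes the split reproducible.
# Despite their names, train_uids and test_uids are DataFrames indexed by uid.
train_uids, test_uids = train_test_split(df, test_size=500, random_state=42)
# Save the uids of the held-out products to a CSV file.
test_uids.reset_index().loc[:, 'uid'].to_csv(os.path.join('.', 'test_uids.csv'), header=True, encoding='utf-8-sig', index=False)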
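
The split above is what makes it possible to avoid training on the whole dataset. The sketch below shows the behavior of `train_test_split` on a made-up DataFrame: an integer `test_size` is an absolute number of rows, the two parts do not overlap, and a fixed `random_state` gives the same split on every run. The toy data is an assumption for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the filtered PIM DataFrame (made up).
toy = pd.DataFrame(
    {'Ingrédients': [f'texte {i}' for i in range(10)]},
    index=pd.Index([f'uid-{i}' for i in range(10)], name='uid'),
)

# An integer test_size is an absolute number of rows, not a fraction.
train, test = train_test_split(toy, test_size=3, random_state=42)

# The held-out uids never appear in the training part.
no_overlap = set(train.index).isdisjoint(test.index)

# Same seed, same split.
_, test_again = train_test_split(toy, test_size=3, random_state=42)

print(len(train), len(test), no_overlap)
```

With the notebook's classes, the estimator would then be fitted on the training part only, for example `estim.fit(train_uids['Ingrédients'])`, and evaluated on the products listed in `test_uids.csv`.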