# PIM Estimator

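This notebook tests the PIM estimator module (`src.pimest`): it refreshes the product data, trains an `IngredientExtractor`, and checks its predictions on supplier datasheets.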

In [1]:
import os

import numpy as np
import pandas as pd

from src.pimest import IngredientExtractor
from src.pimest import PIMIngredientExtractor
from src.pimapi import Requester
from src.pimpdf import PDFDecoder

# 1. Extracting the data

First, let's refresh the data from the production environment.

In [2]:
requester = Requester('prd')

In [None]:
#requester = Requester('prd', proxies=None)
requester = Requester('prd')
print('----------------------------------------')
requester.refresh_directory()
print('----------------------------------------')
requester.modification_report()
print('----------------------------------------')
requester.fetch_list_from_PIM(requester.modified_items(), batch_size=20)
print('----------------------------------------')
requester.dump_data_from_result()
print('----------------------------------------')
requester.dump_files_from_result()
print('----------------------------------------')
requester.modification_report()
print('----------------------------------------')

----------------------------------------
Done
----------------------------------------
Number of items: 13000
Number of items with outdated data: 219
Number of items with outdated files: 219
----------------------------------------
An error occured in this thread!
HTTPSConnectionPool(host='produits.groupe-pomona.fr', port=443): Max retries exceeded with url: /nuxeo/api/v1/id/00000000-0000-0000-0000-000000000000/@search?query=SELECT+%2A+FROM+Document+WHERE+ecm%3AprimaryType%3D%27pomProduct%27+AND+ecm%3AisVersion%3D0+AND+ecm%3Auuid+in+%28%270989658d-b709-413f-be9a-32b528da596d%27%2C+%279f9fc2fa-c3e2-4cb1-9f09-6d031d338282%27%2C+%27453d74bd-dcaa-4a99-acc9-853a5d3ff0ec%27%2C+%2743201561-4011-4a84-828a-dda1c99bbb50%27%2C+%27ca69f52c-3b35-4eab-bf20-f28c2d70c732%27%2C+%27918d4fbe-6ba2-47c3-9255-c2309907eeb7%27%2C+%276c441199-d8f5-4b72-baac-7365dbb1ca1e%27%2C+%27509b00e1-5df6-459a-b469-b3eceed6f4ca%27%2C+%27b47de1a4-349d-4260-8301-1cba1a34b7e8%27%2C+%275c86dd6e-356c-4236-ab3b-081056bc8af6%27%2
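
The refresh cell passes `batch_size=20` to `fetch_list_from_PIM`, and the error output shows a single request whose query lists many uids in an `ecm:uuid in (...)` clause. Below is a minimal sketch of how a uid list could be split into batches and turned into such a query. The helper names `chunked` and `build_query` are hypothetical and the uids are placeholders; how `Requester` actually does this is not shown in this notebook.

```python
from urllib.parse import urlencode


def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def build_query(uids):
    """Build a query shaped like the one visible in the error output (assumed form)."""
    quoted = ', '.join(f"'{uid}'" for uid in uids)
    return ("SELECT * FROM Document WHERE ecm:primaryType='pomProduct' "
            f"AND ecm:isVersion=0 AND ecm:uuid in ({quoted})")


# Placeholder identifiers, not real product uids.
uids = [f'uid-{i:03d}' for i in range(45)]
batches = list(chunked(uids, batch_size=20))

# URL-encode the query for the first batch, as it would appear after "@search?".
encoded = urlencode({'query': build_query(batches[0])})

print(len(batches), [len(b) for b in batches])  # 3 [20, 20, 5]
```

Smaller batches keep each request URL short, at the cost of more round trips to the server.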

Then, fetch the ingredient lists into a pandas DataFrame:

In [3]:
requester.fetch_all_from_PIM(page_size=1000, max_page=-1, nx_properties='*')
mapping = {'uid': 'uid', 'Libellé': 'title', 'Ingrédients': 'properties.pprodc:ingredientsList'}
df = requester.result_to_dataframe(record_path='entries', mapping=mapping, index='uid')
df

Done


uid,Libellé,Ingrédients
9c7c134e-ce31-4f99-9fdc-fc65e30f6b4d,Sucettes caramel et fruits en présentoir de 25...,"Sucettes caramel: LAIT frais, sucre, sirop de ..."
f3fb0863-eb63-4a31-82af-d8663bc14460,"Fourchette 16,5 cm champagne PLASTICO",
9b234a3a-e1a0-4a63-b91c-7528e92f94ab,"Couteau 16,5 cm oxibio PLASTI",
371646a0-04e8-4fb6-bf0c-dac7896f173b,"Cuillère oxibio 16,5 cm PLASTI",
122576b5-d45d-41e1-b8bf-a070b13a062f,Boisson citron-citron vert en bouteille verre ...,"Eau, jus à base de concentrés de : citron (8,5..."
...,...,...
04a8b2c0-c794-45fe-af0d-54264b4927f4,CELIOUATE 38X38 BLANC OLYMPIA BETON X50,
7f51a13e-05a5-4497-9349-4219ae64f7fd,CELIOUATE 38X38 BETON OLYMPIA ANTHRACITE,
4128f89a-8df7-4da7-a2c0-3ee1302a46f4,BARQUETTE PATE A TARTINER POULAIN,"Ingrédients : Sucre, huile de tournesol, noise..."
e770da68-157f-4fd2-a951-ffc63fa5f69e,MELANGE JAKARTA EN SAC 1KG,


We only keep the products for which there is an ingredient list in the system.

In [4]:
df = df.loc[pd.notna(df['Ingrédients'])]
df

uid,Libellé,Ingrédients
9c7c134e-ce31-4f99-9fdc-fc65e30f6b4d,Sucettes caramel et fruits en présentoir de 25...,"Sucettes caramel: LAIT frais, sucre, sirop de ..."
122576b5-d45d-41e1-b8bf-a070b13a062f,Boisson citron-citron vert en bouteille verre ...,"Eau, jus à base de concentrés de : citron (8,5..."
0ea9000b-eb8e-4d11-94ed-0ddd685742ac,Pic-Nic break en étui 50 g PIC-NIC,"Pate a tartiner aux noisettes: sucre, huiles v..."
22ac3d02-45a5-451a-993f-35890db3a61a,Rocher lait en étui 35 g SUCHARD,"Sucre, praliné 22% (NOISETTES, sucre), pâte de..."
ea8695cf-9d9f-41cb-9c75-dc129f1deb3e,Rocher noir en étui 35 g SUCHARD,"Ingrédients : Ingrédients: Sucre, pâte de caca..."
...,...,...
1f073f3b-c855-4e20-8339-f46a5bd26724,TOMATES CERISES CONFITES MARINÉES À L’HUILE PO...,"Tomates confites, huile de tournesol, sel, suc..."
16a1db98-79ba-4ea7-9943-6d093a4c8ee9,TOMATES SÉCHÉES À L’HUILE POCHE 650G LA PULPE,"Tomates séchées réhydratées, huile de tourneso..."
c778d6f9-aa06-47d4-a4ec-199eec9e1373,MORCEAUX DE POIVRONS ROUGES ET JAUNES GRILLÉS ...,"Poivrons rouges et jaunes, huile de tournesol,..."
3d4a97ed-d6be-49c4-8eae-8ce72c50e68f,QUARTIERS D’ARTICHAUTS MARINÉS À L’HUILE POCHE...,"Artichauts, huile de tournesol, sel, sucre, pl..."
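
One caveat of this filter, sketched here on made-up rows rather than on the PIM data: `pd.notna` removes missing values only, so empty or whitespace-only strings are kept. Whether such values occur in the PIM export is not known from this notebook; the stricter variant below is an optional assumption, not part of the original pipeline.

```python
import pandas as pd

# Toy data: one real list, one missing value, one empty string, one blank string.
toy = pd.DataFrame(
    {'Ingrédients': ['Sucre, LAIT', None, '', '   ']},
    index=pd.Index(['a', 'b', 'c', 'd'], name='uid'),
)

# Same filter as in the notebook: only the missing value is dropped.
kept = toy.loc[pd.notna(toy['Ingrédients'])]

# Stricter variant: also drop strings that are empty once stripped.
stripped = toy['Ingrédients'].str.strip()
stricter = toy.loc[stripped.notna() & (stripped != '')]

print(list(kept.index))      # ['a', 'c', 'd']
print(list(stricter.index))  # ['a']
```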


# 2. Training the estimator

For this simple test, the estimator is trained on the whole dataset. This is not good practice, since no data is held out for evaluation; the goal here is only to demonstrate how the class is used.

## 2.1 Importing the module

The cell below is only there to reload the source code of the `pimest` module without restarting the kernel.

In [None]:
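#import importlib
#import src.pimest
#importlib.reload(src.pimest)
#from src.pimest import IngredientExtractor, PIMIngredientExtractor  # rebind the reloaded classes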

## 2.2 Training the estimator

As noted above, we fit the estimator on every available ingredient list:

In [5]:
estim = IngredientExtractor()
estim.fit(df.loc[:, 'Ingrédients'])

<src.pimest.IngredientExtractor at 0x1dce3d42d88>

In [6]:
estim.vectorized_texts_

<9570x4076 sparse matrix of type '<class 'numpy.int64'>'
	with 167232 stored elements in Compressed Sparse Row format>

In [7]:
estim.mean_corpus_

matrix([[0.00198537, 0.00010449, 0.00010449, ..., 0.0091954 , 0.00804598,
         0.00062696]])

# 3. Testing the estimator

## 3.1 Parsing a doc into blocks

First, we parse a single document into blocks of text:

In [None]:
uid = '7ad672f8-40d4-4527-ab49-af3284d23fab'
path = os.path.join('.', 'dumps', 'prd', uid, 'FTF.pdf')
blocks = PDFDecoder.path_to_blocks(path)
blocks

## 3.2 Predicting the ingredient block

Then we predict which block is most likely to hold the ingredient list:

In [None]:
block_num = estim.predict(blocks)
print(blocks[block_num])

We can see that, for the product parsed above, the estimator has successfully identified the ingredient block!
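
The source of `IngredientExtractor` is not part of this notebook. The fitted attributes `vectorized_texts_` (a sparse matrix of integer counts) and `mean_corpus_` (a single row of averages) suggest a bag-of-words approach, so the sketch below shows one plausible reading: average the word counts of the known ingredient lists, then pick the block whose counts are most similar to that average. The class `ToyBlockPicker`, the tokenizer, and the choice of cosine similarity are all assumptions made for illustration; the texts are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r'\w+', text.lower())


class ToyBlockPicker:
    """Hypothetical stand-in, not the actual IngredientExtractor."""

    def fit(self, texts):
        counts = [Counter(tokenize(t)) for t in texts]
        self.vocabulary_ = sorted(set().union(*counts))
        # Mean count of each vocabulary word over the training texts.
        self.mean_corpus_ = [
            sum(c[word] for c in counts) / len(counts)
            for word in self.vocabulary_
        ]
        return self

    def _cosine(self, text):
        c = Counter(tokenize(text))
        vec = [c[word] for word in self.vocabulary_]
        dot = sum(a * b for a, b in zip(vec, self.mean_corpus_))
        norm = (math.sqrt(sum(a * a for a in vec))
                * math.sqrt(sum(b * b for b in self.mean_corpus_)))
        return dot / norm if norm else 0.0

    def predict(self, blocks):
        """Return the index of the block closest to the mean corpus vector."""
        scores = [self._cosine(b) for b in blocks]
        return max(range(len(blocks)), key=scores.__getitem__)


train_texts = [
    'Sucre, huile de tournesol, sel',
    'Eau, sucre, jus de citron',
    'Tomates, huile de tournesol, sel, sucre',
]
toy_blocks = [
    'FICHE TECHNIQUE / SPECIFICATION SHEET',
    'Epinards, eau, sel',
    'Mise à jour le 06/06/2018',
]

picker = ToyBlockPicker().fit(train_texts)
print(picker.predict(toy_blocks))  # 1
```

In this toy run only the second block shares words with the training texts, so it gets the only non-zero score.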

# 4. Wrapped Estimator

A wrapper class makes it possible to compare the current content of the PIM system directly with what has been extracted from the associated PDF file.

This helper inherits directly from the `IngredientExtractor` class:

In [None]:
#import importlib
#import src.pimest
#importlib.reload(src.pimest)
#from src.pimest import IngredientExtractor, PIMIngredientExtractor  # rebind the reloaded classes

In [9]:
#wrapped_estim = PIMIngredientExtractor(env='prd', proxies=None)
wrapped_estim = PIMIngredientExtractor(env='prd')
wrapped_estim.fit(df.loc[:, 'Ingrédients'])

<src.pimest.PIMIngredientExtractor at 0x1dce3f90988>

In [16]:
wrapped_estim.compare_uid_data('71083f9d-14b5-4111-a4a4-af4f654282e4')

Fetching data from PIM for uid 71083f9d-14b5-4111-a4a4-af4f654282e4...
Done
----------------------------------------------------------
Ingredient list from PIM is :

None

----------------------------------------------------------
Supplier technical datasheet from PIM for uid 71083f9d-14b5-4111-a4a4-af4f654282e4 is:
https://produits.groupe-pomona.fr/nuxeo/nxfile/default/71083f9d-14b5-4111-a4a4-af4f654282e4/pprodad:technicalSheet/Fiche%20Technique%20produit%20210AC23.pdf?changeToken=74-0
----------------------------------------------------------
Downloading content of technical datasheet file...
Done!
----------------------------------------------------------
Parsing content of technical datasheet file...
Done!
----------------------------------------------------------
Ingredient list extracted from technical datasheet:

Ce produit est à conserver dans un endroit sec et frais, loin de toute source de chaleur. Ce produit n'est pas un jouet.
L'utilisateur devra vérifier la compatibilité d
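
`compare_uid_data` prints the PIM text and the extracted text one after the other, leaving the comparison to the reader. If a numeric score were wanted, one option is a string similarity ratio from the standard library, as sketched below. This is an assumption for illustration, not something the module is shown to do; the `similarity` and `normalize` helpers are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return ' '.join(text.lower().split())


def similarity(pim_text, extracted_text):
    """Return a ratio in [0, 1], or None when either text is missing."""
    if not pim_text or not extracted_text:
        return None
    return SequenceMatcher(
        None, normalize(pim_text), normalize(extracted_text)
    ).ratio()


# Missing PIM content, as in the output above: nothing to compare against.
print(similarity(None, 'Ce produit est à conserver dans un endroit sec'))  # None

# Same list up to case and spacing.
print(similarity('Sucre,  SEL', 'sucre, sel'))  # 1.0
```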

In [13]:
wrapped_estim.print_blocks(wrapped_estim.resp)

0  |  BELL 

1  |  Rue Nicolas Appert -BP 30173 
59653 Villeneuve d'Ascq - France
Siège social : La Woëstyne - 59173  

2  |  Renescure - France 

3  |  Dénomination légale / Legal  

4  |  name 

5  |  Définition produit / Product  

6  |  definition 

7  |  Date de Durabilité Minimale /  

8  |  Minimum Durability Date
Conditions de conservation  

9  |  après ouverture / After opening  

10  |  storage instructions
Format / Format 

11  |  Boîte / Can 

12  |  4/4 

13  |  Liste des ingrédients /  

14  |  Ingredient list 

15  |  FICHE TECHNIQUE / SPECIFICATION SHEET 

16  |  Photo non contractuelle / Non-contractual picture 

17  |  Client / Client 

18  |  Marque / Brand 

19  |  EPINARDS HACHES
CHOPPED SPINACH 

20  |  AVRIL 

21  |  EPINARDS HACHES
CHOPPED SPINACH 

22  |  REP951 

23  |  Mise à jour le  /  

24  |  Update 

25  |  06/06/2018 

26  |  Responsable  

27  |  Qualité / Quality  

28  |  Manager  

29  |  D.Fernandez 

30  |  Epinards en conserves, préparés à parti

Finally, we hold out 500 products as a test set and save their uids to `test_uids.csv`:

In [27]:
from sklearn.model_selection import train_test_split

train_uids, test_uids = train_test_split(df, test_size=500, random_state=42)
test_uids.reset_index().loc[:, 'uid'].to_csv(os.path.join('.', 'test_uids.csv'), header=True, encoding='utf-8-sig', index=False)