# PIM Estimator

This notebook aims to test the PIM-Estimator module.

In [1]:
import os

import numpy as np
import pandas as pd

from src.pimest import IngredientExtractor
from src.pimest import PIMIngredientExtractor
from src.pimest import PathGetter
from src.pimest import ContentGetter
from src.pimapi import Requester
from src.pimpdf import PDFDecoder



# 1. Extracting the data

First, let's refresh the data from the production environment.

In [None]:
#requester = Requester('prd', proxies=None)
requester = Requester('prd')
print('----------------------------------------')
requester.refresh_directory()
print('----------------------------------------')
requester.modification_report()
print('----------------------------------------')
requester.fetch_list_from_PIM(requester.modified_items(), batch_size=20)
print('----------------------------------------')
requester.dump_data_from_result()
print('----------------------------------------')
requester.dump_files_from_result()
print('----------------------------------------')
requester.modification_report()
print('----------------------------------------')

----------------------------------------
Done
----------------------------------------
Number of items: 13134
Number of items with outdated data: 2579
Number of items with outdated files: 9222
----------------------------------------
An error occured in this thread!
HTTPSConnectionPool(host='produits.groupe-pomona.fr', port=443): Max retries exceeded with url: /nuxeo/api/v1/id/00000000-0000-0000-0000-000000000000/@search?query=SELECT+%2A+FROM+Document+WHERE+ecm%3AprimaryType%3D%27pomProduct%27+AND+ecm%3AisVersion%3D0+AND+ecm%3Auuid+in+%28%2710f5f525-bba4-43d5-8ce7-90aa5b2a50e4%27%2C+%27e46095be-6ee1-4158-b800-ce9b60665f6e%27%2C+%2705d39275-7eac-4bc2-9a3a-f3e05ab7d2b2%27%2C+%27308fc588-b2b2-4b7d-a1ce-0ad7491599d3%27%2C+%27d3c9715f-7744-4977-9101-f4ea7303cdd5%27%2C+%27e6938350-eea1-40a3-bc84-c11da6141cd1%27%2C+%27a420f474-bc07-430b-bcf8-7312323523f6%27%2C+%276b873d7d-0082-4d92-be34-b35d012fac16%27%2C+%276ea6bf9a-3633-4870-a64c-32937c77ca7f%27%2C+%277fff32c1-ded6-4e74-8172-68cc8c4d6c9e%27

Done
----------------------------------------
Done
----------------------------------------
Launching 456 threads.
Thread complete!
An error occured in this thread!
HTTPSConnectionPool(host='produits.groupe-pomona.fr', port=443): Max retries exceeded with url: /nuxeo/nxfile/default/05e9edf1-012b-4f84-88c5-464511e9c992/pprodad:technicalSheet (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response')))
An error occured in this thread!
HTTPSConnectionPool(host='produits.groupe-pomona.fr', port=443): Max retries exceeded with url: /nuxeo/nxfile/default/1371878b-bdc1-42fd-8770-605f1334499d/pprodad:technicalSheet (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response')))
An error occured in this thread!
HTTPSConnectionPool(host='produits.groupe-pomona.fr', port=443): Max retries exceeded with url: /nuxeo/nxfile/default/00c7bcb1-970e-4da9-9de8-2ec31a021a5a/pprodad:technicalS

An error occured in this thread!
HTTPSConnectionPool(host='produits.groupe-pomona.fr', port=443): Max retries exceeded with url: /nuxeo/nxfile/default/0765eb77-6350-4718-b82a-71b13766db65/pprodad:technicalSheet (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response')))
An error occured in this thread!
HTTPSConnectionPool(host='produits.groupe-pomona.fr', port=443): Max retries exceeded with url: /nuxeo/nxfile/default/0eee2d3d-1600-475b-9def-c2a9f20e1b76/pprodad:technicalSheet (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response')))
An error occured in this thread!
HTTPSConnectionPool(host='produits.groupe-pomona.fr', port=443): Max retries exceeded with url: /nuxeo/nxfile/default/377c4cd6-6287-44b9-9426-3b5437512574/pprodad:technicalSheet (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response')))
An erro

An error occured in this thread!
HTTPSConnectionPool(host='produits.groupe-pomona.fr', port=443): Max retries exceeded with url: /nuxeo/nxfile/default/11bfb6c0-c86c-4a2c-90ce-3e5c5a63e783/pprodad:technicalSheet (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response')))
An error occured in this thread!
HTTPSConnectionPool(host='produits.groupe-pomona.fr', port=443): Max retries exceeded with url: /nuxeo/nxfile/default/1a421514-f7a7-42c9-8418-72ada62c6268/pprodad:technicalSheet (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response')))
An error occured in this thread!
HTTPSConnectionPool(host='produits.groupe-pomona.fr', port=443): Max retries exceeded with url: /nuxeo/nxfile/default/00c39185-edad-41a3-afde-eb7bcab2f0af/pprodad:technicalSheet (Caused by ProxyError('Cannot connect to proxy.', RemoteDisconnected('Remote end closed connection without response')))
An erro

Then, fetch the ingredient lists into a pandas DataFrame:

In [None]:
requester.fetch_all_from_PIM(page_size=1000, max_page=-1, nx_properties='*')
mapping = {'uid': 'uid', 'Libellé': 'title', 'Ingrédients': 'properties.pprodc:ingredientsList'}
df = requester.result_to_dataframe(record_path='entries', mapping=mapping, index='uid')
df

We only keep the products for which there is an ingredient list in the system.

In [None]:
df = df.loc[pd.notna(df['Ingrédients'])]
df

# 2. Training the estimator

For this simple test, the estimator will be trained on the whole dataset (which is not good practice - this is just to demonstrate the usage of this class).

## 2.1 Importing the module

The cell below only serves to reload the source code of the pimest module without restarting the kernel.

In [None]:
#import importlib
#import src.pimest
#importlib.reload(src.pimest)

## 2.2 Training the estimator

Although this is not good practice, we train the estimator on the whole dataset.

In [None]:
estim = IngredientExtractor()
estim.fit(df.loc[:, 'Ingrédients'])

In [None]:
estim.vectorized_texts_

In [None]:
estim.vocabulary_

In [None]:
estim.mean_corpus_
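The internals of `IngredientExtractor` are not shown in this notebook. As a rough intuition only, and assuming a bag-of-words model, attributes with these names could be produced along the following lines. The toy corpus and all variable names below are illustrative, not taken from the module.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy ingredient lists (illustrative data, not from the PIM system).
corpus = [
    "farine de blé, eau, sel",
    "eau, sucre, sel",
    "lait, sucre, cacao",
]

# Assumption: texts are turned into word-count vectors.
vectorizer = CountVectorizer()
vectorized_texts = vectorizer.fit_transform(corpus)  # sparse, (n_docs, n_terms)
vocabulary = vectorizer.vocabulary_                  # dict: term -> column index

# Assumption: the "mean corpus" is the average count vector over all texts.
mean_corpus = np.asarray(vectorized_texts.mean(axis=0)).ravel()

print(vectorized_texts.shape)
print(mean_corpus[vocabulary["sel"]])
```

In this sketch, a term that appears in many ingredient lists gets a high weight in the mean vector, which is what would make that vector usable as a prototype of "what an ingredient list looks like".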

# 3. Testing the estimator

## 3.1 Parsing a doc into blocks

First, we parse a single doc into blocks of text:

In [None]:
uid = '7ad672f8-40d4-4527-ab49-af3284d23fab'
path = os.path.join('.', 'dumps', 'prd', uid, 'FTF.pdf')
blocks = PDFDecoder.path_to_blocks(path)
blocks

## 3.2 Predicting the ingredient block

Then we predict which block is most likely to hold the ingredient list:

In [None]:
block_num = estim.predict(blocks)
print(blocks[block_num])
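How `predict` ranks the blocks is not visible here. One plausible approach, shown purely as a sketch, is to vectorize each block with the fitted vocabulary and return the index of the block most similar (cosine similarity) to the mean training vector. The function `predict_block` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy training texts standing in for the known ingredient lists.
train_texts = ["farine de blé, eau, sel", "eau, sucre, sel", "lait, sucre, cacao"]
vectorizer = CountVectorizer().fit(train_texts)

# Mean count vector of the training texts, kept 2D for cosine_similarity.
mean_vec = np.asarray(vectorizer.transform(train_texts).mean(axis=0)).reshape(1, -1)


def predict_block(blocks):
    """Return the index of the block closest to the mean training vector."""
    block_vecs = vectorizer.transform(blocks)
    sims = cosine_similarity(block_vecs, mean_vec).ravel()
    return int(np.argmax(sims))


toy_blocks = [
    "Fiche technique produit",
    "Ingrédients : eau, sucre, sel",
    "Conservation : au sec",
]
best = predict_block(toy_blocks)
print(toy_blocks[best])
```

Blocks that share no word with the training vocabulary get a similarity of zero, so they can never outrank a block that contains known ingredient terms.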

We can see that for the product with uid `7ad672f8-40d4-4527-ab49-af3284d23fab`, the estimator has successfully identified the ingredient block!

# 4. Wrapped Estimator

A wrapper class makes it possible to directly compare the current content of the PIM system with what has been extracted from the associated pdf file.

This helper inherits directly from the `IngredientExtractor` class:

In [None]:
#import importlib
#import src.pimest
#importlib.reload(src.pimest)

In [None]:
#wrapped_estim = PIMIngredientExtractor(env='prd', proxies=None)
wrapped_estim = PIMIngredientExtractor(env='prd')
wrapped_estim.fit(df.loc[:, 'Ingrédients'])

In [None]:
wrapped_estim.compare_uid_data('78f66d90-aeab-4f15-8130-0c418955b79a')

In [None]:
wrapped_estim.print_blocks()

In [None]:
from sklearn.model_selection import train_test_split

train_uids, test_uids = train_test_split(df, test_size=500, random_state=42)
#test_uids.reset_index().loc[:, 'uid'].to_csv(os.path.join('.', 'test_uids.csv'), header=True, encoding='utf-8-sig', index=False)
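Note that an integer `test_size` in `train_test_split` is an absolute number of rows, not a fraction, and that splitting a DataFrame returns two DataFrames that keep the original index. A small check on toy data (the frame below is made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame indexed by uid, mimicking the layout of df.
toy_df = pd.DataFrame(
    {"Ingrédients": [f"ingredient list {i}" for i in range(10)]},
    index=pd.Index([f"uid-{i}" for i in range(10)], name="uid"),
)

# test_size=3 keeps exactly 3 rows for testing; random_state makes it repeatable.
toy_train, toy_test = train_test_split(toy_df, test_size=3, random_state=42)

print(len(toy_train), len(toy_test))
print(sorted(toy_test.index))
```

An estimator fitted on the training part only can then be evaluated on uids it has never seen.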

# 5. Transformers

A handful of transformers have been developed to process the data.

## 5.1 PathGetter

This transformer takes a DataFrame indexed by uid and adds a column holding the path to the pdf file on disk. The path depends on whether the uid belongs to the ground truth or to the "normal" train set.

The root paths can be passed at initialization; otherwise they default to the values specified in the configuration file.

The ground truth uids must be declared at initialization.

In [None]:
from pathlib import Path

In [None]:
ground_truth_df = pd.read_csv(os.path.join('.', 'ground_truth', 'manually_labelled_ground_truth.csv'),
                              sep=';',
                              encoding='latin-1',
                              index_col='uid')
ground_truth_uids = list(ground_truth_df.index)
ground_truth_uids

In [None]:
transformer = PathGetter(ground_truth_uids=ground_truth_uids,
                         train_set_path=Path.cwd() / 'dumps' / 'prd',
                         ground_truth_path=Path.cwd() / 'ground_truth',
                        )

In [None]:
idx = pd.Index(ground_truth_uids[:5] + list(df.index[:5]), name='uid')
test_df = df.loc[idx,:]
test_df = transformer.fit_transform(test_df)
test_df
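To make the idea concrete, here is a minimal stand-in for such a transformer. It is not the `PathGetter` source: the class name `ToyPathGetter`, the `filename` parameter, the `path` column name and the `<root>/<uid>/FTF.pdf` layout for the ground truth folder are all assumptions made for this sketch.

```python
from pathlib import Path

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ToyPathGetter(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: add a 'path' column based on the uid index."""

    def __init__(self, ground_truth_uids, train_set_path, ground_truth_path,
                 filename="FTF.pdf"):
        self.ground_truth_uids = ground_truth_uids
        self.train_set_path = train_set_path
        self.ground_truth_path = ground_truth_path
        self.filename = filename

    def fit(self, X, y=None):
        # Nothing to learn: the transformer is stateless.
        return self

    def transform(self, X):
        gt_uids = set(self.ground_truth_uids)
        X = X.copy()
        # Pick the root folder depending on where the uid comes from.
        X["path"] = [
            (Path(self.ground_truth_path) if uid in gt_uids
             else Path(self.train_set_path)) / uid / self.filename
            for uid in X.index
        ]
        return X


toy = pd.DataFrame({"Libellé": ["A", "B"]},
                   index=pd.Index(["uid-a", "uid-b"], name="uid"))
toy_paths = ToyPathGetter(["uid-a"], "dumps/prd", "ground_truth").fit_transform(toy)
print(toy_paths["path"].tolist())
```

Storing the constructor arguments unchanged under the same attribute names follows the scikit-learn estimator convention, which keeps the class usable inside a `Pipeline`.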

## 5.2 ContentGetter

This transformer reads the files from disk and loads their content as binaries into the dataframe.

In [None]:
transformer_2 = ContentGetter(missing_file='to_nan')
test_df = transformer_2.fit_transform(test_df)
test_df
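The exact behaviour of `missing_file='to_nan'` is not documented in this notebook. Judging by its name, a missing file presumably yields `NaN` instead of raising. The helper `read_or_nan` below is a hypothetical illustration of that policy, not the module's code.

```python
import math
import tempfile
from pathlib import Path


def read_or_nan(path):
    """Return the file content as bytes, or NaN if it cannot be read."""
    try:
        return Path(path).read_bytes()
    except OSError:
        # Assumed 'to_nan' policy: swallow the error and mark the row as missing.
        return float("nan")


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "FTF.pdf"
    existing.write_bytes(b"%PDF-1.4 toy content")
    found = read_or_nan(existing)
    missing = read_or_nan(Path(tmp) / "absent.pdf")

print(type(found).__name__, missing)
```

Using `NaN` as the marker means rows without a file can later be dropped with the usual pandas tools such as `dropna` or `pd.notna`.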

## 5.3 PDFContentParser

This transformer parses the content of the PDF files into text using the functions defined in the pimpdf module.

In [None]:
from src.pimest import PDFContentParser
transformer_3 = PDFContentParser()
test_df = transformer_3.fit_transform(test_df)
test_df

## 5.4 BlockSplitter

This transformer splits the content of the text column into blocks of text using the provided splitter function.

In [None]:
from src.pimest import BlockSplitter
splitter_func = lambda x: x.split('\n\n')
transformer_4 = BlockSplitter(splitter_func=splitter_func)
test_df = transformer_4.fit_transform(test_df)
test_df

In [None]:
for text in test_df['blocks'].iloc[0]:
    print('-------------------------------------------\n', text)
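Splitting on the literal string `'\n\n'` misses separators that contain spaces or Windows line endings. A slightly more tolerant splitter, offered as an optional alternative rather than as what the module uses, could look like this (`split_blocks` is a hypothetical name):

```python
import re


def split_blocks(text):
    """Split on blank lines, tolerating whitespace-only lines and CRLF."""
    parts = re.split(r"\n\s*\n", text.replace("\r\n", "\n"))
    # Drop empty blocks and surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]


# The second separator holds a single space, so a plain '\n\n' split misses it.
sample = "Titre\n\nIngrédients : eau, sel\n \nConservation"
toy_blocks = split_blocks(sample)

print(toy_blocks)
print(len(sample.split("\n\n")))
```

Since `BlockSplitter` accepts any callable through `splitter_func`, a function like this could be passed in place of the lambda.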

# 6. Pipelining transformers

One can pipeline these transformers using the scikit-learn standard Pipeline.
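As a reminder of the mechanics, a `Pipeline` calls `fit_transform` on each step in order and feeds the output of one step to the next. The toy steps below are plain functions wrapped in `FunctionTransformer`; they are illustrative stand-ins and not part of the pimest module.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_upper(frame):
    """First step: add an upper-cased copy of the text column."""
    frame = frame.copy()
    frame["upper"] = frame["text"].str.upper()
    return frame


def add_length(frame):
    """Second step: add the text length as a new column."""
    frame = frame.copy()
    frame["length"] = frame["text"].str.len()
    return frame


toy_pipe = Pipeline([
    ("Upper", FunctionTransformer(add_upper)),
    ("Length", FunctionTransformer(add_length)),
])

toy_out = toy_pipe.fit_transform(pd.DataFrame({"text": ["eau", "sel, sucre"]}))
print(toy_out)
```

Each step receives a DataFrame and returns an enriched copy, which is the same pattern the transformers in this notebook follow.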

## 6.1 Data acquisition

The first step is to build a data acquisition pipeline, which provides a DataFrame with the full text of the pdf documents.

In [None]:
from importlib import reload
import src.pimest
reload(src.pimest)
from src.pimest import ContentGetter
from src.pimest import PathGetter
from src.pimest import PDFContentParser
from sklearn.pipeline import Pipeline

In [None]:
acqui_pipe = Pipeline([('PathGetter', PathGetter(ground_truth_uids=ground_truth_uids,
                                                  train_set_path=Path.cwd() / 'dumps' / 'prd',
                                                  ground_truth_path=Path.cwd() / 'ground_truth',
                                                  )),
                        ('ContentGetter', ContentGetter(missing_file='to_nan')),
                        ('ContentParser', PDFContentParser()),
                       ],
                       verbose=True)

In [None]:
idx = pd.Index(ground_truth_uids)
texts_df = df.loc[idx, :]
texts_df = acqui_pipe.fit_transform(texts_df)
texts_df

## 6.2 Block splitting and prediction

Based on the full text of the pdf documents, we can now compute a block list for each of them and predict which block is the best candidate.

In [None]:
from src.pimest import BlockSplitter
from src.pimest import SimilaritySelector

In [None]:
def splitter(text):
    return text.split('\n\n')

In [None]:
process_pipe = Pipeline([('BlockSplitter', BlockSplitter(splitter_func=splitter)),
                         ('SimilaritySelector', SimilaritySelector())
                       ],
                       verbose=True)

In [None]:
predicted_df = process_pipe.fit_transform(texts_df)
predicted_df

In [None]:
print('\n-----------------------------------------------\n'.join(predicted_df['predicted']))

# 7. Scoring

It is now time to evaluate the performance of this simple model.

## 7.1 Target ingredient lists acquisition

First, we load the manually labelled ground truth into a target DataFrame.

In [None]:
y = pd.read_csv(os.path.join('.', 'ground_truth', 'manually_labelled_ground_truth.csv'),
                sep=';',
                encoding='latin-1',
                index_col='uid')
y

In [None]:
comparison = pd.concat([y['ingredients'], predicted_df['predicted']], axis=1)
comparison

In [None]:
comparison.loc[comparison['ingredients'] == comparison['predicted']]
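Strict string equality is a harsh criterion: a prediction that differs from the label only by case or whitespace counts as a miss. As a possible complement, and not as the scoring used by the module, a normalized similarity ratio can be computed with the standard library. The helper names and toy pairs below are illustrative.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case the text and collapse all runs of whitespace."""
    return " ".join(str(text).lower().split())


def similarity(expected, predicted):
    """Similarity ratio in [0, 1] between two normalized texts."""
    return SequenceMatcher(None, normalize(expected), normalize(predicted)).ratio()


# (expected, predicted) pairs made up for the example.
pairs = [
    ("Eau, sucre, sel", "eau,  sucre,\nsel"),
    ("Farine de blé, eau", "Conservation : au sec"),
]

exact = [expected == predicted for expected, predicted in pairs]
scores = [similarity(expected, predicted) for expected, predicted in pairs]

print(exact)
print(scores)
```

The first pair fails the exact comparison but is identical after normalization, while the second pair scores lower, so the ratio separates near misses from genuinely wrong blocks.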