# Introduction

## Goal

This notebook shows how data stored in unformatted documents can be leveraged to improve data quality in the PIM.

It relies on a handful of modules developed within this project.

## What pipeline?

The steps of this project are as follows:

- fetch all product IDs from the PIM
- split the products into a train set and a test set
- train the algorithm on the train set
- have it make predictions on the test set
- compare each prediction with the ingredient list stored for that product
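
The split step listed above could be sketched as follows. This is a minimal illustration using only the standard library, not the code used in this project: the `split_uids` helper, the 20 % test ratio, the seed and the placeholder uids are all assumptions made for the example.

```python
import random


def split_uids(uids, test_ratio=0.2, seed=42):
    """Shuffle uids reproducibly and split them into (train, test) lists.

    Hypothetical helper: the ratio and seed are illustrative defaults.
    """
    # Sort first so the result does not depend on the input order
    shuffled = sorted(uids)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder uids, for illustration only (real uids come from the PIM)
uids = [f"uid-{i:03d}" for i in range(10)]
train_uids, test_uids = split_uids(uids)
print(len(train_uids), len(test_uids))
```

Fixing the seed keeps the split identical from one run to the next, so that results stay comparable. If scikit-learn is available, `sklearn.model_selection.train_test_split` does the same job.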

# Train / Test split

We will use production data to train and test this model. The product IDs are the PIM uids, and are therefore listed in the directory of the PIM-API module.

First, let's get those uids.

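In [1]:
from src import pimapi

# 'prd' is assumed to target the production environment
requester = pimapi.Requester('prd')
# Refresh the local directory of uids, then display it
requester.refresh_directory()
requester._directory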

uid,type,title,lastModified,lastRefreshed,lastFetchedData,lastFetchedFiles
58f67e91-6d16-4f12-97c7-e67d24be6805,pomProduct,Curcuma moulu en sac 1 kg LA CASE AUX EPICES,2019-12-06 11:24:21.231000+00:00,2020-02-03 17:20:32.292940+00:00,2020-02-03 13:22:18.211934+00:00,2020-02-03 13:22:48.460434+00:00
f46327cf-8f3c-4b67-85fa-7e6e6daaf618,pomProduct,Pique à viande bleu en sachet de 100 HIPPOPOTAMUS,2020-01-06 14:52:29.288000+00:00,2020-02-03 17:20:32.292940+00:00,2020-02-03 13:22:18.211934+00:00,2020-02-03 13:22:48.460434+00:00
58a1ae66-ca4d-4d82-9d30-4073b1edaeb8,pomProduct,Pique à viande rouge en sachet de 100 HIPPOPOT...,2020-01-06 14:52:48.715000+00:00,2020-02-03 17:20:32.292940+00:00,2020-02-03 13:22:18.211934+00:00,2020-02-03 13:22:48.460434+00:00
9bc59474-7839-458b-b56a-ba334fe4894b,pomProduct,Pique à viande noir en sachet de 100 HIPPOPOTAMUS,2020-01-06 14:53:00.627000+00:00,2020-02-03 17:20:32.292940+00:00,2020-02-03 13:22:18.211934+00:00,2020-02-03 13:22:48.460434+00:00
c97834e7-124e-4491-9f2a-3e4009fdda4e,pomProduct,Pique à viande marron en sachet de 100 HIPPOPO...,2020-01-06 14:53:12.375000+00:00,2020-02-03 17:20:32.292940+00:00,2020-02-03 13:22:18.211934+00:00,2020-02-03 13:22:48.460434+00:00
...,...,...,...,...,...,...
5fc3e589-5165-4358-877f-d554076fea4f,pomProduct,THE NR CHAI BIO (25 SAC PYR)X6 P.LEAF,2020-02-03 14:49:49.491000+00:00,2020-02-03 17:20:32.292940+00:00,NaT,NaT
140dd2b5-abc7-4a22-b298-0363f92271bb,pomProduct,GALETTE EN BARQUETTE DE 325G,2020-02-03 15:33:56.751000+00:00,2020-02-03 17:20:32.292940+00:00,NaT,NaT
2a4d75f0-3425-44be-8749-57c506e92de5,pomProduct,PALETS EN BARQUETTE DE 325G,2020-02-03 15:34:53.449000+00:00,2020-02-03 17:20:32.292940+00:00,NaT,NaT
ebb5548e-d5ac-4750-b9f4-180337c16c8a,pomProduct,PETIT BEURRE DE BARQUETTE DE 350G,2020-02-03 15:36:08.705000+00:00,2020-02-03 17:20:32.292940+00:00,NaT,NaT


The modification status of the products is available via the `modification_report` method:

In [9]:
requester.modification_report()

Number of items: 12858
Number of items with outdated data: 3693
Number of items with outdated files: 4153
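
To make these counts concrete, here is a minimal sketch of how such a report might be derived from the directory timestamps. It is not the actual `modification_report` implementation. It assumes that an item is outdated when it was never fetched (shown as `NaT` above, `None` here) or was modified after its last fetch. The `is_outdated` and `modification_counts` helpers and the sample entries are hypothetical.

```python
from datetime import datetime, timezone


def is_outdated(last_modified, last_fetched):
    """Assumed rule: never fetched, or modified after the last fetch."""
    return last_fetched is None or last_modified > last_fetched


def modification_counts(directory):
    """Count outdated items in a dict mapping uid -> timestamp dict."""
    return {
        "items": len(directory),
        "outdated_data": sum(
            is_outdated(row["lastModified"], row["lastFetchedData"])
            for row in directory.values()
        ),
        "outdated_files": sum(
            is_outdated(row["lastModified"], row["lastFetchedFiles"])
            for row in directory.values()
        ),
    }


def utc(*args):
    """Shorthand for a timezone-aware UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)


# Made-up directory entries, for illustration only
directory = {
    "uid-a": {"lastModified": utc(2019, 12, 6),
              "lastFetchedData": utc(2020, 2, 3),
              "lastFetchedFiles": utc(2020, 2, 3)},
    "uid-b": {"lastModified": utc(2020, 2, 3, 14),
              "lastFetchedData": None,
              "lastFetchedFiles": None},
    "uid-c": {"lastModified": utc(2020, 2, 3, 15),
              "lastFetchedData": utc(2020, 2, 3, 13),
              "lastFetchedFiles": utc(2020, 2, 3, 16)},
}
report = modification_counts(directory)
print(report)
```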


The PIM uids of the products are the keys of our requester's `_directory`. We now fetch the product data from the PIM and extract the ingredients associated with these uids.

In [10]:
requester.fetch_all_from_PIM(page_size=1000, max_page=-1, nx_properties='*')
requester.result[0].json()['entries'][0]

Done


TypeError: unhashable type: 'slice'

In [13]:
mapping = {'uid': 'uid', 'Libellé': 'title', 'Ingrédients': 'properties.pprodc:ingredientsList'}
df = requester.result_to_dataframe(record_path='entries', mapping=mapping, index='uid')
df

uid,Libellé,Ingrédients
9c7c134e-ce31-4f99-9fdc-fc65e30f6b4d,Sucettes caramel et fruits en présentoir de 25...,"Sucettes caramel: LAIT frais, sucre, sirop de ..."
f3fb0863-eb63-4a31-82af-d8663bc14460,"Fourchette 16,5 cm champagne PLASTICO",
9b234a3a-e1a0-4a63-b91c-7528e92f94ab,"Couteau 16,5 cm oxibio PLASTI",
371646a0-04e8-4fb6-bf0c-dac7896f173b,"Cuillère oxibio 16,5 cm PLASTI",
122576b5-d45d-41e1-b8bf-a070b13a062f,Boisson citron-citron vert en bouteille verre ...,"Eau, jus à base de concentrés de : citron (8,5..."
...,...,...
ee2a60ff-581f-4c7b-968b-5e7983b3c5ff,HARICOTS VERTS EXTRA FINS 4/4 EPISAVEURS,"haricots verts , eau, sel"
272c9c44-39a3-4391-bc3b-d40aad9e1c25,FLAGEOLETS EXTRA FINS 5/1,"flageolets verts, eau, sel"
e493062f-5aa6-48be-9fc8-d7b37235c466,BOITE BURGER KRAFT 145X130X80,
8ca01ff2-7e2b-4ac3-ba71-f99c5f97fd9e,SERVIETTE OUATE 20X20 CM BLANCHE PERSONNALISEE,
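
The `mapping` used above associates each output column with a dotted path into the nested JSON entries. The sketch below illustrates that idea in plain Python. It is not the actual `result_to_dataframe` code, which may well rely on something like `pandas.json_normalize`. The `get_path` and `entries_to_rows` helpers and the sample entries are hypothetical.

```python
def get_path(record, dotted_path):
    """Follow a dotted path through nested dicts; return None if a key is missing."""
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def entries_to_rows(entries, mapping):
    """Build one flat row per entry, with one column per item of the mapping."""
    return [
        {column: get_path(entry, path) for column, path in mapping.items()}
        for entry in entries
    ]


mapping = {
    "uid": "uid",
    "Libellé": "title",
    "Ingrédients": "properties.pprodc:ingredientsList",
}

# Made-up entries, for illustration only
entries = [
    {"uid": "uid-1", "title": "Green beans",
     "properties": {"pprodc:ingredientsList": "green beans, water, salt"}},
    {"uid": "uid-2", "title": "Paper napkin",
     "properties": {"pprodc:ingredientsList": None}},
    {"uid": "uid-3", "title": "Kraft box", "properties": {}},
]

rows = entries_to_rows(entries, mapping)

# Non-food items have no ingredient list, so they are dropped here
with_ingredients = [row for row in rows if row["Ingrédients"]]
print(with_ingredients)
```

As the table above shows, many products (cutlery, packaging, napkins) have no ingredient list. Filtering them out before the train/test split is one possible design choice.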
