#### Library and function imports

In [1]:
import json
import os

import cv2
import numpy as np
import pandas as pd
import pytesseract
import requests
from dotenv import load_dotenv
from pdf2image import convert_from_path
from pytesseract import Output

In [2]:
load_dotenv()
pytesseract.pytesseract.tesseract_cmd = os.getenv("TESSERACT_CMD")

# PDF Parsing

This section extracts text from PDF files. Each PDF is first converted to images, which are preprocessed (grayscale conversion and noise reduction) to improve OCR quality.  
Because the font, size, and scan quality of each PDF are unknown in advance, the conversion is repeated at several DPI settings and the setting whose OCR output contains the most characters is kept. This is not the fastest approach, but it adapts to PDFs of varying quality.  
Possible improvements include pixel-level analysis or online OCR tools accessed through an API.  
The output of this step is plain text.
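
The DPI selection described above can be sketched as follows. This is a minimal illustration, not the actual `src.pdf_parser` implementation: the function names, the DPI candidates, and the choice to count only non-whitespace characters are assumptions. The rendering and OCR steps are passed in as callables so the selection logic can be tried without Tesseract installed.

```python
from typing import Callable, Iterable, List, Tuple


def pick_best_dpi(
    pdf_path: str,
    render: Callable[[str, int], List[object]],
    ocr: Callable[[object], str],
    dpis: Iterable[int] = (150, 200, 300, 400),  # assumed candidates
) -> Tuple[int, str]:
    """Return (dpi, text) for the DPI whose OCR output has the most characters."""
    best_dpi, best_text, best_count = 0, "", 0
    for dpi in dpis:
        pages = render(pdf_path, dpi)
        text = "\n".join(ocr(page) for page in pages)
        # Assumption: count non-whitespace characters so padding does not win.
        count = len("".join(text.split()))
        if count > best_count:
            best_dpi, best_text, best_count = dpi, text, count
    return best_dpi, best_text


def render_with_pdf2image(pdf_path: str, dpi: int) -> List[object]:
    # Imported lazily so pick_best_dpi can be tested without this package.
    from pdf2image import convert_from_path
    return convert_from_path(pdf_path, dpi=dpi)


def ocr_with_tesseract(page: object) -> str:
    import cv2
    import numpy as np
    import pytesseract
    gray = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)  # grayscale
    denoised = cv2.fastNlMeansDenoising(gray)                # noise reduction
    return pytesseract.image_to_string(denoised)
```

With the real dependencies available, a call would look like `pick_best_dpi(path, render_with_pdf2image, ocr_with_tesseract)`.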

In [3]:
from src.pdf_parser import pdf_to_text

In [4]:
text_1 = pdf_to_text('data/pdfs/Analysis-Example-3_B.pdf')
print(text_1)



pdf parsed
Ananas
Ananas origine Costa Rica 34-03

Residuo_misurato Quantita

misura vedi nota in Italia
Ethephon 0,27 mg/kg 0,04 2 SI -
Fludioxonil 0,038 mg/kg 0011 7 SI :

Note:

——
* == @&




# LLM

In this section, an LLM extracts a JSON object from the raw text parsed in the previous step. The LLM receives a one-shot prompt that shows how the JSON output should be structured.
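
A one-shot prompt of this kind might be assembled as in the sketch below. The wording of the prompt, the example report, and the helper name `build_one_shot_prompt` are all hypothetical; only the JSON keys are taken from the output shown in this notebook. The real prompt lives in `src.get_llm_response` and may differ.

```python
import json

# Made-up example report and answer, used only to show the expected structure.
EXAMPLE_INPUT = "Mele Golden\nCaptan 0,15 mg/kg"
EXAMPLE_OUTPUT = {
    "Product name on the analysis report": "Mele Golden",
    "Product": "Apple",
    "Pesticide_molecules": [
        {
            "Molecule in the report": "Captan",
            "Molecule": "Captan",
            "Measured": "0.15",
            "Unit": "mg/kg",
        }
    ],
}


def build_one_shot_prompt(raw_text: str) -> str:
    """Combine instructions, one worked example, and the report to analyse."""
    example_json = json.dumps(EXAMPLE_OUTPUT, indent=4)
    return (
        "Extract the product and the pesticide residues from the lab report below.\n"
        "Answer with JSON only, using exactly the keys of the example.\n\n"
        f"Example report:\n{EXAMPLE_INPUT}\n\n"
        f"Example answer:\n{example_json}\n\n"
        f"Report:\n{raw_text}\n\n"
        "Answer:"
    )
```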

In [5]:
from langchain.chat_models import init_chat_model
import openai
from openai import OpenAI
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
from src.get_llm_response import get_llm_response

In [6]:
response_1 = get_llm_response(text_1)
print(response_1)

```json
{
    "Product name on the analysis report": "Ananas origine Costa Rica 34-03",
    "Product": "Pineapple",
    "Pesticide_molecules": [
        {
            "Molecule in the report": "Ethephon",
            "Molecule": "Ethephon",
            "Measured": "0.27",
            "Unit": "mg/kg"
        },
        {
            "Molecule in the report": "Fludioxonil",
            "Molecule": "Fludioxonil",
            "Measured": "0.038",
            "Unit": "mg/kg"
        }
    ]
}
```


The LLM returns plain text, so the JSON has to be extracted from it. In the LangGraph workflow, a tool does this automatically.

In [7]:
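# Strip the Markdown code fences that wrap the LLM reply, then parse the JSON.
response_1 = response_1.strip()
if response_1.startswith('```'):
    response_1 = response_1.split('\n', 1)[1]   # drop the opening ```json line
if response_1.endswith('```'):
    response_1 = response_1.rsplit('\n', 1)[0]  # drop the closing ``` line
# Caution: this rebinds `json` from the module to the parsed dict, so this cell
# cannot be re-run without importing the module again first.
json = json.loads(response_1)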

In [8]:
json

{'Product name on the analysis report': 'Ananas origine Costa Rica 34-03',
 'Product': 'Pineapple',
 'Pesticide_molecules': [{'Molecule in the report': 'Ethephon',
   'Molecule': 'Ethephon',
   'Measured': '0.27',
   'Unit': 'mg/kg'},
  {'Molecule in the report': 'Fludioxonil',
   'Molecule': 'Fludioxonil',
   'Measured': '0.038',
   'Unit': 'mg/kg'}]}
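
The fence-stripping cell above assumes that the fences sit on the first and last lines of the reply. A slightly more defensive variant, sketched here with a hypothetical helper name, searches for a fenced block anywhere in the reply and falls back to parsing the whole string. It also returns the result instead of assigning it to `json`, which avoids shadowing the module.

```python
import json
import re

# Matches ```json ... ``` or ``` ... ``` anywhere in the reply.
_FENCE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL | re.IGNORECASE)


def extract_json(reply: str):
    """Parse the JSON payload of an LLM reply, with or without code fences."""
    match = _FENCE.search(reply)
    payload = match.group(1) if match else reply
    # json.loads raises json.JSONDecodeError if the payload is not valid JSON.
    return json.loads(payload.strip())
```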

## EU database API

Once the JSON has been created, the EU product and substance databases are used to look up the corresponding product_id and substance_id.

In [9]:
import httpx

In [10]:
from src.european_db_API_calls import fetch_all_data
from src.european_db_API_calls import get_substance_mrl_EU
from src.european_db_API_calls import get_db_pest, get_db_prod

In [11]:
df_prod = get_db_prod()
df_prod.head()

Unnamed: 0,language,product_id,product_parent_id,product_code,product_type_id,product_name,product_scientific_names
0,EN,379,,1100000,1,"PRODUCTS OF ANIMAL ORIGIN - FISH, FISHPRODUCTS...",
1,EN,109,108.0,220010,4,Garlic,"Allium sativum,"
2,EN,349,313.0,1016000,3,(f) poultry,
3,EN,338,335.0,1014030,4,Liver,
4,EN,204,201.0,300030,4,Peas,"Pisum sativum,"


In [12]:
df_pest = get_db_pest()
df_pest.head()

Unnamed: 0,substance_id,substance_name,as_cas_number,substance_status,approval_date,expiry_date,risk_assessment,substance_category,as_is_group,as_micro_org,...,candidate_for_substitution_type,active_substance_part_of_group,active_substance_part_of_group_id,as_member_id,as_member_name,pesticide_residue_linked,pest_res_linked_annex,pest_res_linked_legislation,pest_res_linked_legislation_url,pest_res_mrl_webpage
0,357,(E)-11-Tetradecen-1-yl acetate (SCLP Acetates),33189-72-9,Approved,01/09/2022,30/08/2037,EFSA,AT - Attractant,No,No,...,,Straight Chain Lepidopteran Pheromones (SCLPs)...,1530.0,374.0,(Z)-8-Dodecen-1-yl acetate (SCLP Actetates),Straight Chain Lepidopteran Pheromones (SCLP),Annex IV,Reg. (EU) 2023/1719,http://data.europa.eu/eli/reg/2023/1719/oj,https://ec.europa.eu/food/plant/pesticides/eu-...
1,689,Iminoctadine,13516-27-3,Not approved,,,,FU - Fungicide,No,No,...,,,,,,Default MRL of 0.01 mg/kg according to Art 18(...,,,,
2,859,Silthiofam,175217-20-6,Approved,01/07/2018,30/06/2033,EFSA,,No,No,...,,,,,,Silthiofam,Annex II,Reg. (EU) No 398/2014,http://eur-lex.europa.eu/legal-content/EN/TXT/...,https://ec.europa.eu/food/plant/pesticides/eu-...
3,699,Endosulfan,115-29-7,Not approved,,,,IN - Insecticide,No,No,...,,,,,,Endosulfan (sum of alpha- and beta-isomers and...,Annex III,Reg. (EC) No 839/2008,http://eur-lex.europa.eu/LexUriServ/LexUriServ...,https://ec.europa.eu/food/plant/pesticides/eu-...
4,1328,"(E,Z)-7,9-Dodecadien-1-yl acetate (SCLP Acetates)",54364-62-4,Approved,01/09/2022,30/08/2037,EFSA,AT - Attractant,No,No,...,,Straight Chain Lepidopteran Pheromones (SCLPs)...,1530.0,356.0,"(Z,E)-9,12-Tetradecadien-1-yl acetate (SCLP -...",Straight Chain Lepidopteran Pheromones (SCLP),Annex IV,Reg. (EU) 2023/1719,http://data.europa.eu/eli/reg/2023/1719/oj,https://ec.europa.eu/food/plant/pesticides/eu-...


# Data Processing functions

The product_id and substance_id are added to the JSON. These IDs are then used in an API call to the EU database to retrieve the maximum residue level (MRL) for that specific product and substance. The MRL is appended to the JSON.
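
The ID lookup could work along the lines of the sketch below. It is not the code in `src.data_processing`: the helper names, the fuzzy matching with `difflib`, and the cutoff are assumptions, and the rows are toy records whose IDs mirror values shown in this notebook (the spelling "Pineapples" is assumed). The column names follow the two DataFrames above.

```python
from difflib import get_close_matches
from typing import Dict, List, Optional


def lookup_id(name: str, rows: List[Dict], name_key: str, id_key: str,
              cutoff: float = 0.8) -> Optional[int]:
    """Return the ID of the row whose name is closest to `name`, or None."""
    names = {str(row[name_key]).lower(): row[id_key] for row in rows}
    hit = get_close_matches(name.lower(), list(names), n=1, cutoff=cutoff)
    return names[hit[0]] if hit else None


def enrich_ids(report: Dict, products: List[Dict], substances: List[Dict]) -> Dict:
    """Return a copy of the report with product_id and substance_id added."""
    enriched = dict(report)
    enriched["product_id"] = lookup_id(
        report["Product"], products, "product_name", "product_id")
    enriched["Pesticide_molecules"] = [
        {**mol, "substance_id": lookup_id(
            mol["Molecule"], substances, "substance_name", "substance_id")}
        for mol in report["Pesticide_molecules"]
    ]
    return enriched


# Toy rows for illustration only.
products = [{"product_name": "Pineapples", "product_id": 81},
            {"product_name": "Garlic", "product_id": 109}]
substances = [{"substance_name": "Ethephon", "substance_id": 89},
              {"substance_name": "Fludioxonil", "substance_id": 37}]
report = {"Product": "Pineapple",
          "Pesticide_molecules": [{"Molecule": "Ethephon"},
                                  {"Molecule": "Fludioxonil"}]}
enriched = enrich_ids(report, products, substances)
```

With the real data, the rows could be obtained from the DataFrames via `df_prod.to_dict("records")` and `df_pest.to_dict("records")`.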

In [13]:
from src.data_processing import enrich_json_ids
from src.data_processing import add_mrl_limits
from src.data_processing import print_conformity_report

In [14]:
json_enriched = enrich_json_ids(json, df_prod, df_pest)
json_enriched

{'Product name on the analysis report': 'Ananas origine Costa Rica 34-03',
 'Product': 'Pineapple',
 'Pesticide_molecules': [{'Molecule in the report': 'Ethephon',
   'Molecule': 'Ethephon',
   'Measured': '0.27',
   'Unit': 'mg/kg',
   'substance_id': 89},
  {'Molecule in the report': 'Fludioxonil',
   'Molecule': 'Fludioxonil',
   'Measured': '0.038',
   'Unit': 'mg/kg',
   'substance_id': 37}],
 'product_id': 81}

In [15]:
json_with_mrls = add_mrl_limits(json_enriched)
json_with_mrls

{'Product name on the analysis report': 'Ananas origine Costa Rica 34-03',
 'Product': 'Pineapple',
 'Pesticide_molecules': [{'Molecule in the report': 'Ethephon',
   'Molecule': 'Ethephon',
   'Measured': '0.27',
   'Unit': 'mg/kg',
   'substance_id': 89,
   'MRL_limit': '0.03'},
  {'Molecule in the report': 'Fludioxonil',
   'Molecule': 'Fludioxonil',
   'Measured': '0.038',
   'Unit': 'mg/kg',
   'substance_id': 37,
   'MRL_limit': '0.01'}],
 'product_id': 81}

Once the JSON contains all the required information, a report is generated by comparing each measured value with its MRL. If every value is below its limit, the product is compliant; if any value is above its limit, the product is not compliant.
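
The comparison itself can be sketched as below, using the values from the enriched JSON above. The function names are hypothetical, and treating a value exactly equal to its limit as compliant is an assumption, since the text only covers the lower and higher cases. `Decimal` is used so that string values such as `'0.038'` are compared exactly, and decimal commas are tolerated.

```python
from decimal import Decimal
from typing import Dict


def to_decimal(value: str) -> Decimal:
    """Parse a numeric string, accepting either '.' or ',' as decimal separator."""
    return Decimal(value.strip().replace(",", "."))


def is_compliant(measured: str, mrl: str) -> bool:
    # Assumption: a measurement equal to the limit counts as compliant.
    return to_decimal(measured) <= to_decimal(mrl)


def product_is_compliant(report: Dict) -> bool:
    """A product is compliant only if every molecule is within its limit."""
    return all(
        is_compliant(mol["Measured"], mol["MRL_limit"])
        for mol in report["Pesticide_molecules"]
    )


example = {
    "Pesticide_molecules": [
        {"Molecule": "Ethephon", "Measured": "0.27", "MRL_limit": "0.03"},
        {"Molecule": "Fludioxonil", "Measured": "0.038", "MRL_limit": "0.01"},
    ]
}
overall = product_is_compliant(example)
```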

In [16]:
print_conformity_report(json_with_mrls)

Product - EU              Product - Analysis  Molecule - EU Molecule - Analysis Analysis_result MRL_limit    Compliant
    Pineapple Ananas origine Costa Rica 34-03      Ethephon            Ethephon            0.27      0.03 Non Conforme
    Pineapple Ananas origine Costa Rica 34-03   Fludioxonil         Fludioxonil           0.038      0.01 Non Conforme


The product is not compliant


# LangGraph

All the previous functions are orchestrated by a LangGraph workflow so that the whole process runs automatically.

The graph is built by creating one node per function and then connecting the nodes. The output of the workflow is the final report.
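
The idea of one node per function, each reading from and writing to a shared state, can be illustrated without any dependency. The sketch below is a plain-Python stand-in, not LangGraph code and not the content of `src.langgraph_builder`: the runner and the three placeholder nodes are hypothetical. In LangGraph the same structure would be declared on a `StateGraph` with `add_node` and `add_edge` and then compiled.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]


def run_linear_workflow(nodes: List[Node], initial: State) -> State:
    """Run the nodes in order; each returns a partial update merged into the state."""
    state = dict(initial)
    for node in nodes:
        state.update(node(state))
    return state


# Placeholder nodes standing in for the real parsing, LLM, and report steps.
def parse_pdf(state: State) -> State:
    return {"text": f"text of {state['pdf_path']}"}


def extract_json_node(state: State) -> State:
    return {"json_data": {"source_text": state["text"]}}


def build_report(state: State) -> State:
    return {"report": f"report for {state['json_data']['source_text']}"}


result = run_linear_workflow(
    [parse_pdf, extract_json_node, build_report],
    {"pdf_path": "example.pdf"},
)
```

Because every node returns only the keys it produces, intermediate results such as `text` and `json_data` remain available in the final state for inspection.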

In [17]:
from src.langgraph_builder import build_workflow

In [18]:
app = build_workflow()

# Graph execution
result = app.invoke({"pdf_path": "data/pdfs/Analysis-Example-2_B.pdf",
                    "df_prod": df_prod,
                    "df_pest": df_pest
                    })


pdf parsed
Product - EU  Product - Analysis  Molecule - EU Molecule - Analysis Analysis_result MRL_limit    Compliant
   pineapples       pinas - fruta      ethephon              etefon            0.52      0.03 Non Conforme


The product is not compliant


The state of the graph is printed below: first the text parsed from the PDF, then the JSON created by the LLM and modified by the various tools.

In [21]:
print(result['text'])

o—

oo =e
eo; er &. =—— a =
— eee See

ln, agai
—
Tipo de muestra: pinas - fruta
Referencia: Lote #1, Parcela #9

Fecha de confirmacion: 26/07/2023
Fecha ‘er reporte: 01/08/2023

Periodo de andalisis: 26/07/2023 - 01/08/2023


Muestra homogeneizada

ETH_01_A - Método especifico para la determinacion de residuos de Etefon en productos vegetales y piensos Termminado 7
por LC-MSMS - acreditado
Sustancia Acr. Resultado Limites Unidad U
(Espana)
etefon A 0,52 2 mg/kg (b)

GMS_01_D - Método multi-residuo para la cuantificaci6n de plaguicidas seleccionados en productos vegetales y Terminado 4
piensos mediante GC-MSMS - acreditado

LMS_01_D - Método multi-residuo para la cuantificacién de plaguicidas seleccionados en productos vegetales y Terminado 7
piensos mediante LC-MSMS - acreditado

Sustancia Acr. Resultado Limites Unidad U





In [22]:
print(result['json_data'])

{'Product name on the analysis report': 'pinas - fruta', 'Product': 'pineapples', 'Pesticide_molecules': [{'Molecule in the report': 'etefon', 'Molecule': 'ethephon', 'Measured': '0.52', 'Unit': 'mg/kg', 'substance_id': 89, 'MRL_limit': '0.03'}], 'product_id': 81}
