## Retrieving data from the producer's catalog

The first phase of the project is to **extract product data for the current collection from the current catalog (.pdf)**. Because the products and their characteristics vary widely, the initial focus is on a single product type: **football socks**.
For the first version of the project **I will use the OpenAI API**; as the project develops, a YOLO machine learning model may be implemented for this purpose.

Screenshot (.png) of the catalog page with football socks:
<p align="center">
  <img src="milano_socks.png" width="700" height="590">
</p>

The extracted data will be a .json with the following elements for each product:
```json
{
  "type_of_clothing": "...",
  "product_catalog_number": "...",
  "color": "...",
  "price": "..."
}
```
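
One way to make this structure explicit in code is a small record type that rejects incomplete rows. The sketch below uses only a standard-library dataclass; the names `ProductRecord` and `from_dict` are hypothetical and not part of the project, and the sample values are made up for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ProductRecord:
    """One product row mirroring the JSON structure above (hypothetical name)."""
    type_of_clothing: str
    product_catalog_number: str
    color: str
    price: str

    @classmethod
    def from_dict(cls, raw: dict) -> "ProductRecord":
        """Build a record from a parsed JSON object, rejecting missing keys."""
        expected = [f.name for f in fields(cls)]
        missing = [name for name in expected if name not in raw]
        if missing:
            raise ValueError(f"missing keys: {missing}")
        # Coerce values to str and trim whitespace; extra keys are ignored.
        return cls(**{name: str(raw[name]).strip() for name in expected})


# Made-up values, only to show the call.
example = ProductRecord.from_dict({
    "type_of_clothing": "socks",
    "product_catalog_number": "ABC-123",
    "color": "blue",
    "price": "19.99",
})
```

Keeping `price` as a string here follows the JSON structure above; converting it to a number could be a later cleaning step.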

#### Activities
- uploading a fragment of the catalog: a photo of the page with socks in .png format,
- using ChatGPT to get the results in table form **(to do: change to class and instructor)**,
- verification with EDA (describing whether the model made mistakes, etc.).
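
The verification step could start with a few automatic checks before the rows are compared manually against the catalog page. This is a minimal sketch, assuming the extracted data is a list of dicts with the keys listed above; the function name `basic_checks` and the sample rows are made up for illustration.

```python
from collections import Counter

EXPECTED_KEYS = {"type_of_clothing", "product_catalog_number", "color", "price"}


def basic_checks(rows):
    """Return a list of human-readable problems found in the extracted rows."""
    problems = []
    for i, row in enumerate(rows):
        missing = EXPECTED_KEYS - set(row)
        if missing:
            problems.append(f"row {i}: missing keys {sorted(missing)}")
        empty = [k for k in sorted(EXPECTED_KEYS & set(row)) if not str(row[k]).strip()]
        if empty:
            problems.append(f"row {i}: empty values for {empty}")
    # A repeated catalog number is not necessarily an error, but is worth a manual look.
    counts = Counter(row.get("product_catalog_number") for row in rows)
    for number, n in counts.items():
        if number and n > 1:
            problems.append(f"catalog number {number!r} appears {n} times")
    return problems


# Made-up rows: one clean, one with an empty color, one without a catalog number.
sample_rows = [
    {"type_of_clothing": "socks", "product_catalog_number": "A1", "color": "blue", "price": "10"},
    {"type_of_clothing": "socks", "product_catalog_number": "A1", "color": "", "price": "10"},
    {"type_of_clothing": "socks", "color": "red", "price": "10"},
]
problems = basic_checks(sample_rows)
```

Checks like these only catch structural problems; whether a color or price was misread still has to be confirmed against the screenshot.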


#### Required imports:

```python
import base64
import json
import re
from getpass import getpass
from pathlib import Path

import pandas as pd
import streamlit as st
from dotenv import dotenv_values, load_dotenv
from itables import show
from openai import OpenAI
```

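#### Loading the OpenAI key from the `.env` file:

```python
# Read key-value pairs from the local .env file.
env = dotenv_values(".env")
# Also export the same values into the process environment.
load_dotenv()

openai_client = OpenAI(api_key=env["OPENAI_API_KEY"])
```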
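
If the `.env` file does not contain the key, `env["OPENAI_API_KEY"]` raises a `KeyError`. A possible fallback is sketched below, assuming that prompting for the key interactively is acceptable in the notebook; the helper `resolve_api_key` is hypothetical and not part of the project.

```python
import os
from getpass import getpass


def resolve_api_key(env, name="OPENAI_API_KEY", prompt=getpass):
    """Look up the key in the .env mapping, then in the process environment,
    and only then ask the user for it (getpass does not echo the input)."""
    key = env.get(name) or os.environ.get(name)
    if not key:
        key = prompt(f"{name}: ")
    if not key:
        raise RuntimeError(f"{name} was not provided")
    return key
```

The client could then be created with `OpenAI(api_key=resolve_api_key(env))`.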

#### Image encoding:

```python
image_path = "milano_socks.png"

def prepare_image_for_open_ai(image_path):
    """Return the PNG file at image_path as a base64 data URL for the API request."""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    return f"data:image/png;base64,{image_data}"
```
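
The function above hard-codes the `image/png` MIME type, which fits the current screenshot. If other catalog pages arrive as `.jpg` files, a variant could guess the type from the file extension. The sketch below is one way to do that with the standard library; the name `image_to_data_url` is hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(image_path):
    """Encode an image file as a data URL, guessing the MIME type from the extension."""
    path = Path(image_path)
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"not a recognised image file: {path.name}")
    encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```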

#### GPT request:

```python
response = openai_client.chat.completions.create(
    # model="gpt-4o",
    model="gpt-4o-mini",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": """
Read the product information from the catalog photo. For each product, extract: type of clothing (e.g., T-shirt, shorts, shoes, socks),
color (use a basic color name, e.g., blue rather than sky blue or azure), product catalog number, price.
Return the data as a JSON array with one object per product. Return only the JSON, without additional comments. Structure of each object:
{
  "type_of_clothing": "...",
  "product_catalog_number": "...",
  "color": "...",
  "price": "..."
}
"""
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": prepare_image_for_open_ai(image_path),
                        "detail": "high"
                    },
                },
            ],
        }
    ],
)

# Raw text of the model's reply; it is parsed in the next step.
content = response.choices[0].message.content
```

#### Cleaning the response:

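```python
# The reply may wrap the JSON in a code fence or extra text, so extract the array first.
match = re.search(r"\[\s*{.*?}\s*\]", content, re.DOTALL)
if match is None:
    raise ValueError("No JSON array found in the model response")
data = json.loads(match.group(0))
```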
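
The regular expression above assumes the reply contains an array of flat objects. A somewhat more tolerant parser is sketched below as an alternative, not as the project's method: it tries plain JSON first, strips a Markdown code fence if present, and accepts a single object as well as an array. The function name `extract_records` is hypothetical.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable


def extract_records(text):
    """Parse a list of product dicts out of a model reply.

    Handles plain JSON, JSON wrapped in a Markdown code fence, a single
    object instead of an array, and an array surrounded by extra prose.
    """
    # Drop a leading fence (optionally tagged "json") and a trailing fence.
    cleaned = re.sub(rf"^{FENCE}(?:json)?\s*|\s*{FENCE}$", "", text.strip())
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the outermost [...] span in the reply.
        start, end = cleaned.find("["), cleaned.rfind("]")
        if start == -1 or end <= start:
            raise ValueError("no JSON array found in the reply")
        parsed = json.loads(cleaned[start:end + 1])
    # Normalise a single object to a one-element list.
    return [parsed] if isinstance(parsed, dict) else parsed
```

With this helper, `data = extract_records(content)` could replace the regex-based lines.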


#### Loading the response into a DataFrame:

```python
df = pd.DataFrame(data)
# Render the DataFrame as an interactive table in the notebook.
show(df)
```


#### Saving data to Excel (xlsx) file:

```python
# Writing .xlsx needs an Excel engine such as openpyxl to be installed.
df.to_excel("socks_current_data.xlsx", index=False)
```