## Retrieving data from the producer's catalog

The first phase of the project is to **extract product data for the current collection from the producer's current catalog (.pdf)**. Because the products and their characteristics vary widely, the initial scope is limited to one product type: **football socks**.
For the first version of the project **I will use the OpenAI API**; as development continues, the idea is to implement a YOLO machine learning model for this purpose.

Screenshot (.png) of the catalog page with football socks:
<p align="center">
  <img src="milano_socks.png" width="700" height="590">
</p>

The extracted data will be structured as JSON with the following elements:
```json
{
  "type_of_clothing": "...",
  "product_catalog_number": "...",
  "color": "...",
  "price_in_PLN": "..."
}
```

#### Activities
- uploading a fragment of the catalog: a photo of the page with socks in .png format,
- using ChatGPT to get the results in table form **(to be changed to a class and instructor)**,
- verification with EDA (describing whether the model made mistakes, etc.).


#### Required imports:

In [4]:
import json
from pathlib import Path
import base64
from getpass import getpass
from openai import OpenAI
import pandas as pd
from dotenv import dotenv_values, load_dotenv
import re
from itables import show

#### Getting the OpenAI key from environment variables:

In [5]:
env = dotenv_values(".env")
load_dotenv()

openai_client = OpenAI(api_key=env["OPENAI_API_KEY"])
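
Indexing `env["OPENAI_API_KEY"]` raises a bare `KeyError` when the `.env` file is missing or incomplete. A minimal sketch of a more explicit lookup is shown here, assuming a hypothetical helper `get_api_key` (not part of the notebook) that accepts any mapping, such as the dict returned by `dotenv_values` or `os.environ`. The key in the example is a placeholder.

```python
from typing import Mapping, Optional


def get_api_key(env: Mapping[str, Optional[str]], name: str = "OPENAI_API_KEY") -> str:
    """Return the value stored under `name`, or raise a readable error.

    `get_api_key` is a hypothetical helper used for illustration only.
    """
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or the environment")
    # Strip stray whitespace that is easy to introduce when editing .env by hand.
    return value.strip()


# In-memory mapping with a placeholder key; in the notebook, `env` comes from dotenv_values(".env").
print(get_api_key({"OPENAI_API_KEY": " sk-example "}))
```

The returned string could then be passed to `OpenAI(api_key=...)` exactly as in the cell above.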

#### Image encoding:

In [6]:
image_path = "milano_socks.png"

def prepare_image_for_open_ai(image_path):
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode('utf-8')

    return f"data:image/png;base64,{image_data}"
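
The function above hard-codes `image/png` in the data URL, which is fine for this screenshot. If other catalog pages were saved in a different format, the MIME type could be guessed from the file extension instead. The sketch below assumes a hypothetical variant named `image_to_data_url`; it uses only the standard library.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(image_path: str) -> str:
    """Encode an image file as a base64 data URL, guessing the MIME type from the extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognised image file: {image_path}")
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

For `milano_socks.png` this should produce the same `data:image/png;base64,...` prefix as `prepare_image_for_open_ai`.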

#### GPT request:

In [56]:
response = openai_client.chat.completions.create(
    # model="gpt-4o",
    model="gpt-4o-mini",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": """
Read product information from the catalog image. Extract data about socks displayed in the image. 
Return a list of all socks shown, mapping the dominant visible color to one of the standard base colors: white, black, green, red, navy blue, blue, yellow, burgundy, orange, azure, grey, mint. 
Do not include combinations like "white black" — only extract the most dominant color for each item (e.g., "black", "white", "blue").

Present the data in JSON format. Include:
{
  "type_of_clothing": "socks",
  "product_catalog_number": "...",
  "color": "...",
  "price_in_PLN": "49.95"
}

Only return the JSON list with all products. Do not include explanations or other text.

"""
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": prepare_image_for_open_ai(image_path),
                        "detail": "high"
                    },
                },
            ],
        }
    ],
)

content = response.choices[0].message.content

#### Cleaning the response:

In [57]:
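content = response.choices[0].message.content
# Extract the JSON list even if the model wraps it in extra text or a code fence.
match = re.search(r"\[\s*{.*?}\s*\]", content, re.DOTALL)
if match is None:
    raise ValueError("No JSON list found in the model response")
data = json.loads(match.group(0))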
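
Chat models often wrap JSON in a Markdown code fence despite the instruction to return only the list, which is why a regular expression is used here rather than calling `json.loads` on the raw reply. The sketch below wraps the same pattern in a hypothetical helper, `extract_json_list`, and additionally checks that every record carries the keys requested in the prompt. The sample reply and its catalog number are made up for illustration.

```python
import json
import re

# Keys requested in the prompt.
EXPECTED_KEYS = {"type_of_clothing", "product_catalog_number", "color", "price_in_PLN"}


def extract_json_list(content: str) -> list:
    """Pull the first JSON list of objects out of a model reply and check its keys."""
    match = re.search(r"\[\s*{.*?}\s*\]", content, re.DOTALL)
    if match is None:
        raise ValueError("No JSON list found in the response")
    records = json.loads(match.group(0))
    for i, record in enumerate(records):
        missing = EXPECTED_KEYS - set(record)
        if missing:
            raise ValueError(f"Record {i} is missing keys: {sorted(missing)}")
    return records


# Made-up reply for illustration; "X-1" is a placeholder, not a real catalog number.
fence = "`" * 3  # Markdown code fence that chat models often add around JSON
sample = (
    fence
    + 'json\n[{"type_of_clothing": "socks", "product_catalog_number": "X-1", '
    + '"color": "red", "price_in_PLN": "49.95"}]\n'
    + fence
)
print(extract_json_list(sample))
```

Failing early on a missing key makes a malformed reply visible before it reaches the DataFrame.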


#### Loading the response into a DataFrame:

In [61]:
df = pd.DataFrame(data)
show(df)



#### Saving data to Excel (xlsx) file:

In [59]:
df.to_excel("socks_current_data.xlsx", index=False)

## Data preprocessing/EDA:

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   type_of_clothing        12 non-null     object
 1   product_catalog_number  12 non-null     object
 2   color                   12 non-null     object
 3   price_in_PLN            12 non-null     object
dtypes: object(4)
memory usage: 516.0+ bytes


In [60]:
df.nunique()

type_of_clothing           1
product_catalog_number    12
color                     11
price_in_PLN               1
dtype: int64
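
`nunique()` shows that one color is repeated but not which one. A small sketch that could be run on the parsed `data` list to name the repeated colors and flag any color outside the palette given in the prompt follows; `color_report` is a hypothetical helper and the sample records are placeholders, not the real model output.

```python
from collections import Counter

# Base colors listed in the prompt.
ALLOWED_COLORS = {
    "white", "black", "green", "red", "navy blue", "blue",
    "yellow", "burgundy", "orange", "azure", "grey", "mint",
}


def color_report(records):
    """Return (colors used more than once with their counts, colors outside the palette)."""
    counts = Counter(r["color"].strip().lower() for r in records)
    repeated = {color: n for color, n in counts.items() if n > 1}
    unknown = sorted(set(counts) - ALLOWED_COLORS)
    return repeated, unknown


# Placeholder records for illustration only.
sample = [{"color": "blue"}, {"color": "Blue"}, {"color": "teal"}, {"color": "mint"}]
repeated, unknown = color_report(sample)
print(repeated)  # colors assigned to more than one product
print(unknown)   # colors the prompt did not allow
```

On the DataFrame itself, `df[df.duplicated("color", keep=False)]` should list the affected rows in a similar way.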

#### What can we read from the output above?

| Status | Column name | Description | Needed actions |
|--------|-------------|-------------|----------------|
| ✅ | type_of_clothing | 1 unique type: correct, since all products are socks; dtype object: correct, because it is a string value | None |
| ✅ | product_catalog_number | 12 unique manufacturer product codes: correct, since there are 12 products on this page; dtype object: correct, because it is a string value | None |
| ❌ | color | 11 unique colors: incorrect, since each product needs its own color, so there should be 12; dtype object: correct, because it is a string value, but it needs to be encoded | Improve the prompt, encode colors |
| ❌ | price_in_PLN | 1 unique price: correct, since all products have the same price; dtype object: incorrect, because it should be a numerical value | Convert str to numeric |
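
The two needed actions on the data side (numeric price, encoded color) could look roughly like the sketch below, which works on the parsed records before they reach the DataFrame. The helper `clean_record`, the `color_code` field, and the choice of the palette index as the encoding are all assumptions made for illustration; accepting a decimal comma is likewise an assumption about what the model might return.

```python
# Palette in the order listed in the prompt; the list index serves as the color code.
PALETTE = ["white", "black", "green", "red", "navy blue", "blue",
           "yellow", "burgundy", "orange", "azure", "grey", "mint"]


def clean_record(record, palette=PALETTE):
    """Return a copy with a float price and an integer `color_code` (hypothetical field)."""
    cleaned = dict(record)
    # Accept a decimal comma as well as a decimal point (assumption about the model output).
    cleaned["price_in_PLN"] = float(str(record["price_in_PLN"]).replace(",", "."))
    # list.index raises ValueError for a color outside the palette.
    cleaned["color_code"] = palette.index(record["color"].strip().lower())
    return cleaned


# Placeholder record for illustration, not real catalog data.
print(clean_record({"color": "Navy Blue", "price_in_PLN": "49,95"}))
```

Working directly on `df`, `pd.to_numeric(df["price_in_PLN"])` and `df["color"].astype("category").cat.codes` are the usual pandas counterparts, although the category codes follow alphabetical order rather than the palette order used here.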
