# Overview
This notebook tests whether ChatGPT can be used to label data. The labeled data will be used to train a smaller model, such as BERT, which will then be deployed on mobile devices.

In [32]:
import os
import pandas as pd
from openai import OpenAI
from dotenv import load_dotenv

# API key

In [17]:
load_dotenv()

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

# Simple test

In [18]:
response = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a geography expert."},
        {"role": "user", "content": "What is the capital of Uruguay?"}
    ],
    model="gpt-4",
)

response_content = response.choices[0].message.content

print(response_content)

The capital of Uruguay is Montevideo.


# Dataset
To generate the training data, I will use phone screen views in CSV format. Each file is a flattened XML view tree obtained through the Android Accessibility API.

In [34]:
coupon_df = pd.read_csv("data/18929485529.csv")
coupon_df.head()

Unnamed: 0,ID,User ID,Time,I,Language,Application Name,Package Name,Class Name,Context,View ID,View Depth,View Class Name,Text,Description,Seen Timestamp,Is Visible,X 1,Y 1,X 2,Y 2
0,18929485529,165559,2024-09-04T10:55:25.287,1,de,PENNY,de.penny.app,de.penny.app.main.view.MainActivity,,,0,de.penny.app.main.view.MainActivity,,,0,False,0,0,0,0
1,18929485529,165559,2024-09-04T10:55:25.287,2,de,PENNY,de.penny.app,de.penny.app.main.view.MainActivity,,android:id/content,2,android.widget.FrameLayout,,,1725440082464,True,0,0,1080,2400
2,18929485529,165559,2024-09-04T10:55:25.287,3,de,PENNY,de.penny.app,de.penny.app.main.view.MainActivity,,,11,android.widget.TextView,UVP 14.99,,1725440082464,True,339,833,498,874
3,18929485529,165559,2024-09-04T10:55:25.287,4,de,PENNY,de.penny.app,de.penny.app.main.view.MainActivity,,,11,android.widget.TextView,9.99,,1725440082464,True,356,884,482,960
4,18929485529,165559,2024-09-04T10:55:25.287,5,de,PENNY,de.penny.app,de.penny.app.main.view.MainActivity,,,10,android.widget.TextView,UVP,,1725440082464,True,63,986,125,1027


# Labels

In [50]:
ner_tags = ["N/A", "PRODUCT", "PRICE"]

# First approach (text only)
The first approach coalesces all of the text fields from one application at one point in time into a single string. Note that a single text field can contain multiple entities that need to be labeled, so in the general case we shouldn't label CSV rows directly.

In [36]:
grouping_columns = ['Application Name', 'Seen Timestamp']
grouped_dfs = [group for _, group in coupon_df.groupby(grouping_columns)]
view_texts_coalesced = [" ".join(df['Text'].dropna()) for df in grouped_dfs]

print(view_texts_coalesced)

['', 'UVP 14.99 9.99 UVP JOHNNIE WALKER Red Label Blended Scotch je 0,7 I UVP 0.99 0.75 UVP SAN MIGUEL Especial je 0,5 I UVP 2.99 2.79 UVP FELIX Knabber Mix je 200 g 3.89 Preisknaller FELIX So gut wie es aussieht in Gelee je 12 x 85 g Sparen auf Top-Marken ab 05.09. bis 07.09. Angebote Vorteile Einkaufsliste Vorteilscode Mein PENNY', 'UVP 14.99 9.99 UVP JOHNNIE WALKER Red Label Blended Scotch je 0,7 I UVP 0.99 0.75 UVP SAN MIGUEL Especial je 0,5 I', 'UVP 14.99 9.99 UVP JOHNNIE WALKER Red Label Blended Scotch je 0,7 I UVP 0.99 0.75 UVP SAN MIGUEL Especial je 0,5 I UVP 2.99 2.79 UVP FELIX Knabber Mix je 200 g 3.89 Preisknaller FELIX So gut wie es aussieht in Gelee je 12 x 85 g Framstag ab 06.09. bis 07.09.', 'UVP 2.99 2.79 UVP FELIX Knabber Mix je 200 g 3.89 Preisknaller FELIX So gut wie es aussieht in Gelee je 12 x 85 g Framstag ab 06.09. bis 07.09. UVP 8.99 5.85 UVP CHANTRÉ Weinbrand je 0,7 I', 'Framstag ab 06.09. bis 07.09. UVP 8.99 5.85 UVP CHANTRÉ Weinbrand je 0,7 I 2.79 2.49 1.09 0

# Inference

In [51]:
def text_only_labeling(text, tags, temperature=0):
    """Ask the model to label the entities in `text` with `tags` and return its raw reply string."""
    prompt = f"""
    You are an NER tagging assistant. Your task is to label all entities in the text based on the tags provided.
    Here are the tags: {', '.join(tags)}

    For each entity, return the entity and its tag.

    Input text: "{text}"

    Respond with the entities in this JSON format:
    [
        {{ "entity": str, "tag": str }},
        ...
    ]
    """
    
    response = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="gpt-4",
        temperature=temperature
    )

    return response.choices[0].message.content

print(text_only_labeling(view_texts_coalesced[1], ner_tags))

[
    { "entity": "UVP 14.99 9.99 UVP", "tag": "PRICE" },
    { "entity": "JOHNNIE WALKER Red Label Blended Scotch", "tag": "PRODUCT" },
    { "entity": "0,7 I UVP 0.99 0.75 UVP", "tag": "PRICE" },
    { "entity": "SAN MIGUEL Especial", "tag": "PRODUCT" },
    { "entity": "0,5 I UVP 2.99 2.79 UVP", "tag": "PRICE" },
    { "entity": "FELIX Knabber Mix", "tag": "PRODUCT" },
    { "entity": "je 200 g 3.89", "tag": "PRICE" },
    { "entity": "Preisknaller FELIX So gut wie es aussieht in Gelee", "tag": "PRODUCT" },
    { "entity": "je 12 x 85 g", "tag": "PRICE" },
    { "entity": "Sparen auf Top-Marken ab 05.09. bis 07.09.", "tag": "N/A" },
    { "entity": "Angebote Vorteile Einkaufsliste Vorteilscode Mein PENNY", "tag": "N/A" }
]
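
`text_only_labeling` returns the reply as a raw string, so it has to be parsed before the labels can be used as training data. Below is a minimal sketch of that step. `parse_labels` and `sample_response` are hypothetical names that are not part of the notebook, and the sketch assumes the reply contains a single JSON array in the format requested by the prompt, possibly surrounded by extra text.

```python
import json


def parse_labels(raw_response, allowed_tags):
    """Parse the model's reply into a list of (entity, tag) tuples.

    Hypothetical helper, not part of the original notebook.
    """
    # Assumption: the reply may contain text around the JSON array,
    # so keep only the span from the first "[" to the last "]".
    start = raw_response.find("[")
    end = raw_response.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in the response")

    items = json.loads(raw_response[start:end + 1])

    labels = []
    for item in items:
        entity, tag = item["entity"], item["tag"]
        # Reject tags that were not offered in the prompt.
        if tag not in allowed_tags:
            raise ValueError(f"unexpected tag: {tag!r}")
        labels.append((entity, tag))
    return labels


# Shortened, hand-written reply in the same shape as the output above.
sample_response = """
[
    { "entity": "JOHNNIE WALKER Red Label Blended Scotch", "tag": "PRODUCT" },
    { "entity": "3.89", "tag": "PRICE" }
]
"""
parsed = parse_labels(sample_response, ["N/A", "PRODUCT", "PRICE"])
print(parsed)
```

Rejecting unknown tags early makes it easier to spot replies where the model uses labels that were not in the prompt.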


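There are some issues: quantities such as "je 200 g" and "je 12 x 85 g" are labeled as prices, and some PRICE entities absorb neighboring quantity text (e.g. "0,7 I UVP 0.99 0.75 UVP"). Let's try adding a dedicated label.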

In [53]:
ner_tags_extended = ner_tags + ["WEIGHT"]

print(text_only_labeling(view_texts_coalesced[1], ner_tags_extended))

[
    { "entity": "UVP 14.99 9.99 UVP", "tag": "PRICE" },
    { "entity": "JOHNNIE WALKER Red Label Blended Scotch", "tag": "PRODUCT" },
    { "entity": "0,7 I", "tag": "WEIGHT" },
    { "entity": "UVP 0.99 0.75 UVP", "tag": "PRICE" },
    { "entity": "SAN MIGUEL Especial", "tag": "PRODUCT" },
    { "entity": "0,5 I", "tag": "WEIGHT" },
    { "entity": "UVP 2.99 2.79 UVP", "tag": "PRICE" },
    { "entity": "FELIX Knabber Mix", "tag": "PRODUCT" },
    { "entity": "200 g", "tag": "WEIGHT" },
    { "entity": "3.89", "tag": "PRICE" },
    { "entity": "FELIX So gut wie es aussieht in Gelee", "tag": "PRODUCT" },
    { "entity": "12 x 85 g", "tag": "WEIGHT" },
    { "entity": "05.09. bis 07.09.", "tag": "N/A" },
    { "entity": "Angebote Vorteile Einkaufsliste Vorteilscode Mein PENNY", "tag": "N/A" }
]


That seems to have solved the issue, although volumes such as "0,7 I" are now tagged as WEIGHT as well. Two prices are still extracted as a single entity (e.g. "UVP 14.99 9.99 UVP"), but that could easily be fixed in postprocessing. Let's try a different part of the dataset.

In [55]:
print(text_only_labeling(view_texts_coalesced[6], ner_tags_extended))

[
    { "entity": "UVP 8.99 5.85 UVP", "tag": "PRICE" },
    { "entity": "CHANTRÉ Weinbrand", "tag": "PRODUCT" },
    { "entity": "0,7 I", "tag": "WEIGHT" },
    { "entity": "2.79 2.49", "tag": "PRICE" },
    { "entity": "10% gespart", "tag": "N/A" },
    { "entity": "PENNY Gouda-Scheiben", "tag": "PRODUCT" },
    { "entity": "je 400 g", "tag": "WEIGHT" },
    { "entity": "1.09 0.69", "tag": "PRICE" },
    { "entity": "36% gespart", "tag": "N/A" },
    { "entity": "KINDER Überraschungs-Ei", "tag": "PRODUCT" },
    { "entity": "je 20 g", "tag": "WEIGHT" },
    { "entity": "Framstag ab 06.09. bis 07.09.", "tag": "N/A" }
]
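
The merged price entities seen in the outputs above (e.g. "UVP 8.99 5.85 UVP" and "2.79 2.49") could be split in postprocessing. The following is a minimal sketch of one way to do it, not a tested part of the pipeline. `split_price_entities` and `PRICE_PATTERN` are hypothetical names, and the sketch assumes that the reply has already been parsed into a list of dicts and that every price is a number with exactly two decimal places.

```python
import re

# Assumption: prices look like "14.99" or "0,75" (two decimal places).
PRICE_PATTERN = re.compile(r"\b\d+[.,]\d{2}\b")


def split_price_entities(labels):
    """Split PRICE entities that contain several amounts into one entity per amount.

    `labels` is a list of {"entity": str, "tag": str} dicts,
    the shape requested in the prompt.
    """
    result = []
    for label in labels:
        if label["tag"] != "PRICE":
            result.append(label)
            continue

        amounts = PRICE_PATTERN.findall(label["entity"])
        if not amounts:
            # Nothing recognizable as a price: keep the entity unchanged.
            result.append(label)
            continue

        result.extend({"entity": amount, "tag": "PRICE"} for amount in amounts)
    return result


# A few entities copied from the output above.
labels = [
    {"entity": "UVP 8.99 5.85 UVP", "tag": "PRICE"},
    {"entity": "CHANTRÉ Weinbrand", "tag": "PRODUCT"},
    {"entity": "2.79 2.49", "tag": "PRICE"},
]
split_labels = split_price_entities(labels)
print(split_labels)
```

This sketch drops the "UVP" marker, so it does not record which amount is the recommended retail price and which is the offer price. If that distinction matters for training, it would need a separate tag or an extra field.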
