# PART1. KG Construction

# Environment Setup

Install the required packages.

In [None]:
%pip install pandas
%pip install -U numpy
%pip install openai
%pip install tqdm

# Load Packages

In [2]:
# from openai import OpenAI
from openai import AzureOpenAI
import pandas as pd
import json
import re
from tqdm import tqdm

# Load the Data

We load a dataset derived from the ratings and reviews of 1K+ Amazon products, with details as listed on Amazon's official website. This notebook uses the `product_name` and `about_product` columns.

In [3]:
data = pd.read_csv('data/text_data.csv')
data.head()

Unnamed: 0,product_id,product_name,about_product
0,B07JW9H4J1,Wayona Nylon Braided USB to Lightning Fast Cha...,High Compatibility : Compatible With iPhone 12...
1,B098NS6PVG,Ambrane Unbreakable 60W / 3A Fast Charging 1.5...,"Compatible with all Type C enabled devices, be..."
2,B096MSW6CT,Sounce Fast Phone Charging Cable & Data Sync U...,【 Fast Charger& Data Sync】-With built-in safet...
3,B08HDJ86NZ,boAt Deuce USB 300 2 in 1 Type-C & Micro USB S...,The boAt Deuce USB 300 2 in 1 cable is compati...
4,B08CF3B7N1,Portronics Konnect L 1.2M Fast Charging 3A 8 P...,[CHARGE & SYNC FUNCTION]- This cable comes wit...


# Preliminary


In [4]:
# ENTITY TYPES
entity_types = [
    "product",
    "material",
    "brand",
    "measurement",
    "color",
    "characteristic"
]

# RELATION TYPES
relation_types = [
    "hasCharacteristic",
    "madeOfMaterial",
    "hasBrand",
    "hasMeasurement",
    "hasColor",
]
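
The two lists above act as a small schema for the graph. As a minimal sketch (not part of the original notebook), the check below shows how a triple returned by the model could be validated against that schema before it is added to the KG. The names `ALLOWED_ENTITY_TYPES`, `ALLOWED_RELATION_TYPES`, `REQUIRED_KEYS`, and `is_valid_triple` are hypothetical and introduced only for illustration.

```python
# Hypothetical copies of the schema lists, so this sketch runs on its own.
ALLOWED_ENTITY_TYPES = ["product", "material", "brand", "measurement", "color", "characteristic"]
ALLOWED_RELATION_TYPES = ["hasCharacteristic", "madeOfMaterial", "hasBrand", "hasMeasurement", "hasColor"]
REQUIRED_KEYS = ("head", "head_type", "relation", "tail", "tail_type")


def is_valid_triple(triple):
    """Return True if the triple has every required key and only uses known types."""
    if not isinstance(triple, dict):
        return False
    if any(key not in triple for key in REQUIRED_KEYS):
        return False
    return (
        triple["head_type"] in ALLOWED_ENTITY_TYPES
        and triple["tail_type"] in ALLOWED_ENTITY_TYPES
        and triple["relation"] in ALLOWED_RELATION_TYPES
    )


good = {"head": "boAt Wave Call Smart Watch", "head_type": "product",
        "relation": "hasBrand", "tail": "boAt", "tail_type": "brand"}
bad = {"head": "boAt Wave Call Smart Watch", "head_type": "product",
       "relation": "soldBy", "tail": "boAt", "tail_type": "brand"}  # relation not in the schema

print(is_valid_triple(good), is_valid_triple(bad))
```

A filter like this could be applied to each parsed response so that relations or types the model invents do not enter the graph.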

# Prompt Engineering

The system prompt provides:
1. Task description
2. Output description

The user prompt provides:
1. Instruction
2. Example


In [5]:
system_prompt = """You are an expert agent specialized in analyzing product specifications in an online retail store.
Your task is to identify the entities and relations requested with the user prompt, from a given product specification.
You must generate the output as a JSON list of JSON objects, each having the following keys: "head", "head_type", "relation", "tail", and "tail_type".
The "head" key must contain the text of the extracted entity with one of the types from the provided list in the user prompt, the "head_type"
key must contain the type of the extracted head entity which must be one of the types from the provided user list,
the "relation" key must contain the type of relation between the "head" and the "tail", the "tail" key must represent the text of an
extracted entity which is the tail of the relation, and the "tail_type" key must contain the type of the tail entity. Attempt to extract as
many entities and relations as you can.
"""


user_prompt = """Based on the following example, extract entities and relations from the provided text.
Use the following entity types:
# ENTITY TYPES:
{entity_types}

Use the following relation types:
# RELATION TYPES:
{relation_types}

Example:

# TEXT

product_name: boAt Wave Call Smart Watch, Smart Talk with Advanced Dedicated Bluetooth Calling Chip, 1.69” HD Display with 550 NITS & 70% Color Gamut, 150+ Watch Faces, Multi-Sport Modes,HR,SpO2, IP68(Active Black)
about product: Bluetooth Calling- Wave Call comes with a premium built-in speaker and bluetooth calling via which you can stay connected to your friends, family, and colleagues|Dial Pad- Its dial pad is super responsive and convenient. You can also save upto 10 contacts in this smart watch|Screen Size- Wave Call comes with a 1.69” HD Display that features a bold, bright, and highly responsive 2.5D curved touch interface|Resolution- With 550 nits of brightness get sharper color resolution that brightens your virtual world exponentially.|Design- The ultra slim and lightweight design of the watch is ideal to keep you surfing your wave all day!|Watch Faces- Wave Call comes with 150+ Cloud watchfaces for you to pick from, complementing your every mood and outfit|HR, SpO2 & Breathing- Monitor your heart rate and blood oxygen levels on-the-go with the heart rate and SpO2 monitor. It also comes with Guided Breathing to help you relax and embrace mindfulness.
################

# OUTPUT
[
  {{
    "head": "boAt Wave Call Smart Watch",
    "head_type": "product",
    "relation": "hasBrand",
    "tail": "boAt",
    "tail_type": "brand"
  }},
  {{
    "head": "boAt Wave Call Smart Watch",
    "head_type": "product",
    "relation": "hasMeasurement",
    "tail": "1.69” HD Display",
    "tail_type": "measurement"
  }},
  {{
    "head": "boAt Wave Call Smart Watch",
    "head_type": "product",
    "relation": "hasCharacteristic",
    "tail": "Multi-Sport Modes",
    "tail_type": "characteristic"
  }},
  {{
    "head": "boAt Wave Call Smart Watch",
    "head_type": "product",
    "relation": "hasColor",
    "tail": "Active Black",
    "tail_type": "color"
  }},
  {{
    "head": "boAt Wave Call Smart Watch",
    "head_type": "product",
    "relation": "hasMeasurement",
    "tail": "550 nits of brightness",
    "tail_type": "measurement"
  }}
]
For the following text, extract entities and relations as in the provided example.

# TEXT
{text}
################

# OUTPUT

"""


In [6]:
openai_api_key = "e877599eba2e4bcd99bcc08f92005b7b"
api_version = "2024-06-01"
endpoint = "https://hkust.azure-api.net"

In [7]:
# client = OpenAI(api_key=openai_api_key)  # use this instead if you call the OpenAI API directly
# Replace the key, API version, and endpoint with those of the API provider you use.
client = AzureOpenAI(api_key=openai_api_key, api_version=api_version, azure_endpoint=endpoint)

def extract_information(text, model, system_prompt, user_prompt, entity_types, relation_types):
    """Send one product description to the chat model and return the raw text reply."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0.0,  # keep sampling as deterministic as possible for extraction
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": user_prompt.format(
                    text=text,
                    entity_types=entity_types,
                    relation_types=relation_types
                )
            }
        ]
    )
    return completion.choices[0].message.content

We use a regular expression to pull the JSON list of triples out of each model response, then parse it with `json.loads`.

In [8]:
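kg = []
# Greedy match from the first "[" to the last "]", so a "]" inside a value
# (e.g. "[CHARGE & SYNC FUNCTION]") does not cut the list short.
pattern = r"\[.*\]"
for i in tqdm(range(len(data))):
    try:
        text = "product_name: " + data.loc[i, "product_name"] + "\n" + "about product: " + data.loc[i, "about_product"] + "\n"
        output = extract_information(text, "gpt-4o-mini", system_prompt, user_prompt, entity_types, relation_types)
        output = re.findall(pattern, output, re.DOTALL)[0]
        output = json.loads(output)
        kg.extend(output)
    except Exception as e:
        # Skip rows whose text fields are missing or whose response cannot be parsed.
        print(f"Row {i} skipped: {e}")
        continue

print(len(kg))
kg = pd.DataFrame(kg)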

100%|██████████| 10/10 [00:47<00:00,  4.71s/it]

65
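
The extraction step above assumes each reply contains one well-formed JSON list. As a hedged sketch of a more forgiving parser (the function name `parse_triples` and the sample reply are hypothetical), the code below slices from the first `[` to the last `]` and, if parsing fails, retries after removing trailing commas. The trailing-comma repair is a heuristic and could in principle alter a string value that contains `,]` or `,}`.

```python
import json
import re


def parse_triples(response):
    """Best-effort extraction of a JSON list from a model reply; returns [] on failure."""
    start = response.find("[")
    end = response.rfind("]")
    if start == -1 or end <= start:
        return []
    candidate = response[start:end + 1]
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        # Heuristic repair: drop commas that directly precede a closing bracket or brace.
        repaired = re.sub(r",\s*([\]}])", r"\1", candidate)
        try:
            parsed = json.loads(repaired)
        except json.JSONDecodeError:
            return []
    return parsed if isinstance(parsed, list) else []


# Hypothetical reply: extra prose, a "]" inside a value, and a trailing comma.
reply = (
    "Here is the result:\n"
    "[\n"
    '  {"head": "Cable", "relation": "hasCharacteristic", "tail": "[CHARGE & SYNC]"},\n'
    "]\n"
    "Done."
)
triples = parse_triples(reply)
print(triples)
```

Returning an empty list on failure lets the calling loop continue without a `try`/`except` around the parsing step, at the cost of hiding which rows failed unless they are logged separately.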





In [9]:
kg.to_csv("data/kg.csv", index=False, encoding="utf-8")
kg.head(20)

Unnamed: 0,head,head_type,relation,tail,tail_type
0,Wayona Nylon Braided USB to Lightning Fast Cha...,product,hasBrand,Wayona,brand
1,Wayona Nylon Braided USB to Lightning Fast Cha...,product,hasMeasurement,3 FT,measurement
2,Wayona Nylon Braided USB to Lightning Fast Cha...,product,hasColor,Grey,color
3,Wayona Nylon Braided USB to Lightning Fast Cha...,product,hasCharacteristic,High Compatibility,characteristic
4,Wayona Nylon Braided USB to Lightning Fast Cha...,product,hasCharacteristic,Fast Charge&Data Sync,characteristic
5,Wayona Nylon Braided USB to Lightning Fast Cha...,product,hasCharacteristic,Durability,characteristic
6,Wayona Nylon Braided USB to Lightning Fast Cha...,product,hasCharacteristic,High Security Level,characteristic
7,Ambrane Unbreakable 60W / 3A Fast Charging 1.5...,product,hasBrand,Ambrane,brand
8,Ambrane Unbreakable 60W / 3A Fast Charging 1.5...,product,hasMeasurement,1.5m,measurement
9,Ambrane Unbreakable 60W / 3A Fast Charging 1.5...,product,hasCharacteristic,Fast Charging,characteristic


# Post-Processing

To improve the quality of the extracted knowledge graph (KG), we can add two post-processing steps. First, merge entities that share the same meaning into a single representation, which reduces redundancy in the graph. Second, apply disambiguation to separate entities that have identical representations but distinct meanings. Together, these steps make the KG more accurate and easier to use for downstream analysis.
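
As a minimal sketch of the merging step, assuming that "same meaning" can be approximated by ignoring case and extra whitespace (real synonym merging would need more than this), the code below maps each tail entity to its most frequent surface form and drops the duplicates that result. Grouping by `tail_type` as well as by text is a very light form of disambiguation: the same string under two different types stays separate. The names `normalize` and `merge_tails` and the sample triples are hypothetical.

```python
from collections import Counter, defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so near-identical strings compare equal."""
    return " ".join(text.casefold().split())


def merge_tails(triples):
    """Rewrite each tail to its group's most frequent spelling and drop duplicate triples."""
    groups = defaultdict(Counter)
    for t in triples:
        groups[(t["tail_type"], normalize(t["tail"]))][t["tail"]] += 1
    canonical = {key: counts.most_common(1)[0][0] for key, counts in groups.items()}

    merged, seen = [], set()
    for t in triples:
        tail = canonical[(t["tail_type"], normalize(t["tail"]))]
        row = (t["head"], t["relation"], tail)
        if row not in seen:
            seen.add(row)
            merged.append({**t, "tail": tail})
    return merged


sample = [
    {"head": "Cable A", "relation": "hasCharacteristic", "tail": "Fast Charging", "tail_type": "characteristic"},
    {"head": "Cable B", "relation": "hasCharacteristic", "tail": "fast charging", "tail_type": "characteristic"},
    {"head": "Cable C", "relation": "hasCharacteristic", "tail": "Fast Charging", "tail_type": "characteristic"},
    {"head": "Cable A", "relation": "hasCharacteristic", "tail": "Fast  Charging", "tail_type": "characteristic"},
]
merged = merge_tails(sample)
for row in merged:
    print(row["head"], "->", row["tail"])
```

Here the four input triples collapse to three, all pointing at the single node "Fast Charging".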

# PART2. Intention KG Mining

In [10]:
sessions = []
with open('data/session_data.jsonl') as f:
    for line in f:
        sessions.append(json.loads(line))
print(sessions)

[{'Session': [{'id': 'B099ST26S3', 'title': "Neal's Yard Remedies Geranium and Orange Foaming Bath | Promote Calmness & Wellbeing | 200ml", 'price': 11.25, 'brand': "Neal's Yard Remedies", 'size': '200 ml (Pack of 1)', 'model': '0838', 'desc': 'Helps promote a feeling of calm and wellbeing'}, {'id': 'B001M4C34O', 'title': "Neal's Yard Remedies Seaweed and Arnica Foaming Bath | Ease Tiredness & Restore Vitality | 200 ml", 'price': 10.0, 'brand': "Neal's Yard Remedies", 'size': '1 Count (Pack of 1)', 'model': '0120', 'desc': 'Directions: Directions: Pour liberally under running water. For adult use only.'}, {'id': 'B082MVK3C6', 'title': "Neal's Yard Remedies Mothers Bath Oil | Encourages a Sense of Wellbeing | 100ml", 'price': 15.0, 'brand': "Neal's Yard Remedies", 'size': '100 ml (Pack of 1)', 'model': '1660', 'desc': 'Boosts skin’s suppleness'}, {'id': '1787009726', 'title': "Magnetic Let's Play", 'price': 6.29, 'brand': 'Imagine That Publishing', 'author': 'Clover, Alfie'}], 'Message'

In [11]:
for i in sessions[0]['Session']:
    print(i)

{'id': 'B099ST26S3', 'title': "Neal's Yard Remedies Geranium and Orange Foaming Bath | Promote Calmness & Wellbeing | 200ml", 'price': 11.25, 'brand': "Neal's Yard Remedies", 'size': '200 ml (Pack of 1)', 'model': '0838', 'desc': 'Helps promote a feeling of calm and wellbeing'}
{'id': 'B001M4C34O', 'title': "Neal's Yard Remedies Seaweed and Arnica Foaming Bath | Ease Tiredness & Restore Vitality | 200 ml", 'price': 10.0, 'brand': "Neal's Yard Remedies", 'size': '1 Count (Pack of 1)', 'model': '0120', 'desc': 'Directions: Directions: Pour liberally under running water. For adult use only.'}
{'id': 'B082MVK3C6', 'title': "Neal's Yard Remedies Mothers Bath Oil | Encourages a Sense of Wellbeing | 100ml", 'price': 15.0, 'brand': "Neal's Yard Remedies", 'size': '100 ml (Pack of 1)', 'model': '1660', 'desc': 'Boosts skin’s suppleness'}
{'id': '1787009726', 'title': "Magnetic Let's Play", 'price': 6.29, 'brand': 'Imagine That Publishing', 'author': 'Clover, Alfie'}


In [12]:
prompt = f"""Below is a user chronological record list:
{sessions[0]['Session']}
Explain the basic intentions of this user exactly. Output several different intentions one by one to answer the following question: The user bought these items because they want to:
intention 1: [a simple verb phrase within 10 words]
intention 2: [a simple verb phrase within 10 words]...."""
print(prompt)  # alternative endings: "because they want to" / "because the items are all" / "because the items are capable of" ...

Below is a user chronological record list:
[{'id': 'B099ST26S3', 'title': "Neal's Yard Remedies Geranium and Orange Foaming Bath | Promote Calmness & Wellbeing | 200ml", 'price': 11.25, 'brand': "Neal's Yard Remedies", 'size': '200 ml (Pack of 1)', 'model': '0838', 'desc': 'Helps promote a feeling of calm and wellbeing'}, {'id': 'B001M4C34O', 'title': "Neal's Yard Remedies Seaweed and Arnica Foaming Bath | Ease Tiredness & Restore Vitality | 200 ml", 'price': 10.0, 'brand': "Neal's Yard Remedies", 'size': '1 Count (Pack of 1)', 'model': '0120', 'desc': 'Directions: Directions: Pour liberally under running water. For adult use only.'}, {'id': 'B082MVK3C6', 'title': "Neal's Yard Remedies Mothers Bath Oil | Encourages a Sense of Wellbeing | 100ml", 'price': 15.0, 'brand': "Neal's Yard Remedies", 'size': '100 ml (Pack of 1)', 'model': '1660', 'desc': 'Boosts skin’s suppleness'}, {'id': '1787009726', 'title': "Magnetic Let's Play", 'price': 6.29, 'brand': 'Imagine That Publishing', 'author'

In [13]:
# Replace the client with that of the API provider you use, if needed.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ]
)
print(response.choices[0].message.content)

intention 1: Promote calmness and wellbeing with foaming bath.  
intention 2: Ease tiredness and restore vitality with bath products.  
intention 3: Enhance skin suppleness using bath oil.  
intention 4: Entertain with a magnetic play set.  
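
To turn a reply like the one above into graph edges, the numbered lines first need to be parsed. The sketch below is one possible way to do that, not the notebook's own method: `parse_intentions` and the relation name `hasIntention` are hypothetical, and the session is identified here simply by its index.

```python
import re

INTENTION_LINE = re.compile(r"^\s*intention\s+(\d+)\s*:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)


def parse_intentions(response):
    """Return the intention phrases from lines shaped like 'intention N: ...'."""
    return [m.group(2).rstrip(".").strip() for m in INTENTION_LINE.finditer(response)]


# The model reply shown above, copied as a string.
reply = (
    "intention 1: Promote calmness and wellbeing with foaming bath.  \n"
    "intention 2: Ease tiredness and restore vitality with bath products.  \n"
    "intention 3: Enhance skin suppleness using bath oil.  \n"
    "intention 4: Entertain with a magnetic play set.  \n"
)
intentions = parse_intentions(reply)

# Hypothetical edge format: link session 0 to each mined intention.
edges = [
    {"head": "session_0", "relation": "hasIntention", "tail": phrase}
    for phrase in intentions
]
for edge in edges:
    print(edge)
```

Lines that do not follow the `intention N:` pattern are ignored, so any extra commentary in the reply does not produce edges.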
