A knowledge graph represents semantic relationships between entities. Nodes stand for entities such as objects, people, or places, and edges define the relationships between them, for example "Ottawa" being the "capital" of "Canada." Virtual assistants use these graphs to answer queries, and the same structure can model many kinds of relationships, such as movies and actors or recipes and ingredients. By integrating data from multiple sources, like census data and online reviews, knowledge graphs can infer facts more accurately, such as estimating the number of Chinese restaurants in New York. They are built using Natural Language Processing (NLP) through semantic enrichment, which transforms unstructured text into structured knowledge. Commercially, knowledge graphs power recommendation systems (e.g., YouTube videos or retail product pairings) and help detect fraudulent insurance claims. The transcript concludes with a simple, humorous graph over "human," "coffee," and "sleep," where "human consumes coffee," "human needs sleep," and "coffee prevents sleep," advising against caffeine after 5 PM.
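
As a minimal sketch of the node-and-edge structure described above, the examples from the summary can be written as (subject, relation, object) triples in plain Python. The `triples` list and the `objects_of` helper are illustrative names chosen here, not part of any library.

```python
# Each edge of the graph is one (subject, relation, object) triple.
triples = [
    ("Ottawa", "capital_of", "Canada"),
    ("human", "consumes", "coffee"),
    ("human", "needs", "sleep"),
    ("coffee", "prevents", "sleep"),
]

def objects_of(subject, relation=None):
    """Return the objects linked from `subject`, optionally filtered by relation."""
    return [
        o for s, r, o in triples
        if s == subject and (relation is None or r == relation)
    ]

print(objects_of("human"))               # ['coffee', 'sleep']
print(objects_of("coffee", "prevents"))  # ['sleep']
```

A graph library stores the same information with indexes for fast traversal; the triples are only the simplest way to see what a node and an edge are.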

Building a knowledge graph for an e-commerce website can significantly enhance both search and recommendation systems by organizing product data into a structured format. Let's break it down step by step.

Step 1: Define the Objective and Dataset
Objective: The goal is to create a product knowledge graph to improve search and recommendation on an e-commerce website. The knowledge graph will capture product relationships, attributes, and categories, allowing the website to:

- Provide more relevant search results.
- Make personalized product recommendations.
- Understand product attributes well enough to improve categorization and user experience.

Dataset: For the demonstration, we need a real-world e-commerce dataset that is small, easily loaded, and suitable for building a knowledge graph. A commonly used, publicly available option is the Amazon product data, specifically the product metadata (e.g., title, description, category, and price). This walkthrough loads the Digital Music metadata from a CSV file. A sample dataset can be obtained from: https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews

In [1]:
import pandas as pd

# Sample dataset (this can be a CSV file with product metadata)
df = pd.read_csv('/Users/rshankar/Downloads/Projects/deep-learning/amazon-search-recommend/data/Digital_Music_Meta.csv')

# Preview the dataset
df.head()


Unnamed: 0,main_category,title,average_rating,rating_number,features,description,price,images,videos,store,categories,details,parent_asin,bought_together
0,Digital Music,Baja Marimba Band,4.9,8,[],[],,[{'thumb': 'https://m.media-amazon.com/images/...,[],,[],"{'Date First Available': 'February 28, 2010'}",B000V87RP2,
1,Digital Music,'80s Halloween-All Original Artists & Recordings,5.0,3,[],[],14.98,[{'thumb': 'https://m.media-amazon.com/images/...,[],"Love and Rockets (Artist), Duran Duran (...",[],{'Package Dimensions': '5.55 x 4.97 x 0.54 inc...,B0062F0MJQ,
2,Digital Music,TRIO +1,5.0,1,[],['CD ALBUM'],57.99,[{'thumb': 'https://m.media-amazon.com/images/...,[],Rob Wasserman Format: Audio CD,[],"{'Is Discontinued By Manufacturer': 'No', 'Pac...",B00005GT12,
3,Digital Music,"Gold and Silver: Lehar, Delibes, Lanner, Johan...",5.0,1,[],[],29.91,[{'thumb': 'https://m.media-amazon.com/images/...,[],"Franz Lehar (Composer), Leo Delibes (Com...",[],"{'Manufacturer': 'Hungaroton / White Label', '...",B0007PD2BW,
4,Digital Music,Grateful Dead Dave's Picks Volume 25 Live at B...,4.9,20,[],['Sold out. Numbered limited edition'],149.99,[{'thumb': 'https://m.media-amazon.com/images/...,[],"Grateful Dead (Artist, Orchestra) Format: ...",[],{'Package Dimensions': '5.55 x 4.97 x 0.54 inc...,B079CPD45R,


In [11]:
df.shape

(70537, 16)

Step 2: Refine, Clean, and Extract Attributes Using NLP
To build a knowledge graph, we first need to clean the raw text and extract relevant attributes (e.g., product features, brand, material). This can be done with NLP techniques such as Named Entity Recognition (NER), or with a large language model (LLM) served through an OpenAI-compatible API and prompted to return structured attributes.

For the cleaning process, we can use text pre-processing techniques such as removing unnecessary words and correcting misspellings. Then we use NER or an LLM to extract structured entities like "Brand", "Material", "Color", etc., from the product descriptions.

Here’s how we can clean and refine the attributes:

In [3]:
import re
import spacy

# Load the spaCy model for NER (install with `pip install spacy`, then
# `python -m spacy download en_core_web_sm`)
nlp = spacy.load('en_core_web_sm')

# Function to clean text
def clean_text(text):
    if not isinstance(text, str):
        return ''  # Missing values (NaN) become an empty string
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove special characters
    return text.strip().lower()

# Example: Use SpaCy for NER to extract entities
def extract_entities(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

# Refine attributes for the dataset
df['cleaned_description'] = df['description'].apply(clean_text)
df['extracted_entities'] = df['cleaned_description'].apply(extract_entities)

# Preview the cleaned and extracted entities
df[['title', 'extracted_entities']].head()


Unnamed: 0,title,extracted_entities
0,Baja Marimba Band,[]
1,'80s Halloween-All Original Artists & Recordings,[]
2,TRIO +1,[]
3,"Gold and Silver: Lehar, Delibes, Lanner, Johan...",[]
4,Grateful Dead Dave's Picks Volume 25 Live at B...,[]


In [10]:
df[['title', 'extracted_entities']].sample(5)


Unnamed: 0,title,extracted_entities
35894,Omen Escape to Nowhere,[]
50831,Britten/Berkeley: Complete Works for Voice and...,"[(16, CARDINAL), (chinese, NORP), (58713, DATE..."
41734,Berg;Chamber Concerto,[]
8031,Tribal Eyes,"[(1, CARDINAL), (2, CARDINAL), (3, CARDINAL), ..."
39365,"I Can Hear It Now... 1919-1932, Vol. 3","[(record vinyl, PERSON)]"
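
The sample above shows mostly empty or noisy entity lists. One likely contributor, stated here as an assumption rather than something verified on the full dataset, is that `description` is stored in the CSV as a stringified Python list (for example "['CD ALBUM']"), and that lowercasing removes capitalization cues that NER models tend to rely on. The sketch below parses the list with `ast.literal_eval` and keeps the original case; `parse_description` is a hypothetical helper name.

```python
import ast
import re

def parse_description(raw):
    """Turn a stringified list such as "['CD ALBUM']" into one plain string."""
    if not isinstance(raw, str):
        return ""  # NaN or other non-string values
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        parsed = raw  # Not a Python literal; treat it as plain text
    if isinstance(parsed, (list, tuple)):
        text = " ".join(str(p) for p in parsed)
    else:
        text = str(parsed)
    # Collapse whitespace but keep case and punctuation for the NER model
    return re.sub(r"\s+", " ", text).strip()

print(parse_description("['Sold out. Numbered limited edition']"))
print(repr(parse_description("[]")))
```

The result could then be passed to `extract_entities` in place of the lowercased text, for example via `df['description'].apply(parse_description)`. Whether this improves the entities for this dataset would need to be checked on a sample.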


In [None]:
import requests
from tqdm import tqdm  # Import tqdm for the progress bar
import pandas as pd

# LM Studio exposes an OpenAI-compatible HTTP API on localhost. It is called
# directly with requests below, so no OpenAI client object is needed.

# Function to extract attributes using LM Studio
def extract_attributes_with_lm_studio(description):
    # Make an API call to the LLM
    response = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json={
            "model": "lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF",
            "messages": [
                {"role": "system", "content": "Extract product attributes from the description."},
                {"role": "user", "content": description}
            ],
            "temperature": 0.7,
        }
    )

    if response.status_code == 200:
        return response.json().get('choices')[0]['message']['content']
    else:
        return None  # Handle API errors or missing content

# Register tqdm with pandas so that progress_apply shows a live progress bar
tqdm.pandas(desc="Extracting Attributes")

def extract_with_progress(row):
    # row is the entire DataFrame row, access description using row['description']
    return extract_attributes_with_lm_studio(row['description'])

# Apply the function row by row. Wrapping df.apply(...) in tqdm would only
# start the bar after every row had already been processed.
df['extracted_attributes1'] = df.progress_apply(extract_with_progress, axis=1)

# Show the results
print(df[['title', 'extracted_attributes1']])
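
The LLM reply is free text, while Step 3 looks up `Brand` with `.get`, which needs a dictionary. A minimal sketch of bridging the two follows, assuming the system prompt is changed to ask for a JSON object (that prompt change is an assumption, not part of the notebook). `parse_attributes` is a hypothetical helper, and the `reply` string is an illustrative example rather than real model output.

```python
import json
import re

def parse_attributes(raw):
    """Parse an LLM reply into a dict; return {} when no JSON object is found."""
    if not isinstance(raw, str):
        return {}  # None (API error) or NaN
    # Models often wrap JSON in extra prose, so grab the outermost {...} span
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}

# Illustrative reply only
reply = 'Here are the attributes:\n{"Brand": "ExampleBrand", "Format": "Audio CD"}'
print(parse_attributes(reply).get("Brand", "Unknown"))                 # ExampleBrand
print(parse_attributes("no structured data").get("Brand", "Unknown"))  # Unknown
```

Storing the parsed result in a column named `extracted_attributes`, for example `df['extracted_attributes'] = df['extracted_attributes1'].apply(parse_attributes)`, would give Step 3 the per-row dictionary it reads from.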


Step 3: Build the Product Knowledge Graph
A knowledge graph consists of nodes (entities like products, categories, brands) and edges (relationships between them). After extracting key product attributes, we can use these attributes to form the nodes and relationships.

For example:

- Nodes: Product ID, Brand, Category, Material
- Edges: Product -> Belongs to -> Category, Product -> Made of -> Material, Product -> Manufactured by -> Brand

We can use libraries like networkx to build the knowledge graph:

In [None]:
import networkx as nx

# Create a directed graph
KG = nx.DiGraph()

# Add nodes and edges to the graph
for _, row in df.iterrows():
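    product_id = row['parent_asin']
    # 'extracted_attributes' must hold a dict per row; the LLM step above stores
    # raw text in 'extracted_attributes1', so fall back to 'Unknown' otherwise
    attrs = row.get('extracted_attributes')
    brand = attrs.get('Brand', 'Unknown') if isinstance(attrs, dict) else 'Unknown'
    category = row['main_category']  # 'store' holds artist/seller text, not a category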
    
    # Add product node
    KG.add_node(product_id, type='product', name=row['title'])
    
    # Add brand and category nodes
    KG.add_node(brand, type='brand')
    KG.add_node(category, type='category')

    # Add edges
    KG.add_edge(product_id, brand, relationship='Manufactured by')
    KG.add_edge(product_id, category, relationship='Belongs to')

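# Visualize a small sample of the knowledge graph (optional); drawing every
# node of a graph built from ~70,000 products is slow and unreadable
import matplotlib.pyplot as plt
sample_nodes = list(KG.nodes)[:50]
plt.figure(figsize=(10, 10))
nx.draw(KG.subgraph(sample_nodes), with_labels=True, node_size=3000, node_color='skyblue', font_size=10)
plt.show()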
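
To connect the graph back to the recommendation objective from Step 1, here is a rough sketch of one simple approach: rank other products by how many attribute nodes (brand, category) they share with a given product. It works on plain (product, target, relationship) tuples, which is the shape that `KG.edges(data='relationship')` yields for the graph above. `recommend_related` and the sample edges are hypothetical and only for illustration.

```python
from collections import Counter, defaultdict

def recommend_related(edges, product_id, top_n=3):
    """Rank other products by how many attribute nodes they share with product_id."""
    products_by_target = defaultdict(set)
    targets_by_product = defaultdict(set)
    for product, target, _relationship in edges:
        products_by_target[target].add(product)
        targets_by_product[product].add(target)

    scores = Counter()
    for target in targets_by_product[product_id]:
        for other in products_by_target[target]:
            if other != product_id:
                scores[other] += 1
    # Sort by score, then by ID so that ties are resolved deterministically
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [pid for pid, _ in ranked[:top_n]]

# Hypothetical edges for illustration only
edges = [
    ("P1", "BrandA", "Manufactured by"),
    ("P1", "Digital Music", "Belongs to"),
    ("P2", "BrandA", "Manufactured by"),
    ("P2", "Digital Music", "Belongs to"),
    ("P3", "BrandB", "Manufactured by"),
    ("P3", "Digital Music", "Belongs to"),
]
print(recommend_related(edges, "P1"))  # ['P2', 'P3']
```

Catch-all nodes such as an 'Unknown' brand connect many unrelated products, so in practice they would probably need to be excluded or down-weighted before scoring.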
