<a href="https://colab.research.google.com/github/davidelgas/DataSciencePortfolio/blob/main/Knowledge%20Graph/Knowledge_Graph_with_DNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

In this notebook I build a small neural network and a knowledge graph from an open-source e-commerce dataset.


My use case:<br>
A language interface (chatbot or search engine) that answers user queries.<br>
A knowledge graph to structure relationships between products, conversations, and recommendations.<br>
A PyTorch-based neural network for embeddings, retrieval, or ranking.<br>
Retrieval-Augmented Generation (RAG) to enhance responses with knowledge graph lookups.<br><br>
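
To make the retrieval step concrete, here is a minimal sketch of a knowledge graph lookup feeding a prompt. The triples, relation names, and the `build_context` helper are all hypothetical, and the token-matching rule is an assumption chosen for brevity, not the method this notebook commits to.

```python
# Hypothetical toy knowledge graph stored as (head, relation, tail) triples.
TRIPLES = [
    ("trail_shoe_x", "has_category", "running_shoes"),
    ("trail_shoe_x", "has_brand", "acme"),
    ("road_shoe_y", "has_category", "running_shoes"),
    ("road_shoe_y", "has_brand", "zenith"),
]

def build_context(query, triples):
    """Return the triples whose head or tail is named in the query, one per line."""
    # Naive entity matching: split the query into lowercase tokens.
    tokens = set(query.lower().replace("?", " ").split())
    facts = [t for t in triples if t[0] in tokens or t[2] in tokens]
    return "\n".join(f"{h} {r} {t}" for h, r, t in facts)

query = "Who makes trail_shoe_x?"
context = build_context(query, TRIPLES)

# The retrieved facts are prepended to the question before it reaches the language model.
prompt = f"Facts:\n{context}\n\nQuestion: {query}"
print(prompt)
```

A real system would likely replace the token match with entity linking or embedding similarity; the point here is only the shape of the data flow from graph lookup to prompt.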



# Dataset
**Product Metadata**<br>
Describes the structured attributes of a product, such as name, brand, category, price, specifications, and descriptions. This data is essential for building a knowledge graph and improving search and recommendation systems.

**User Interactions**<br>
Refers to data on how users engage with products, including clicks, purchases, reviews, ratings, wish lists, and cart additions. This data is valuable for recommendation systems and behavior-based predictions.

**Graph-Friendly**<br>
Indicates whether the dataset has a structure that can be easily converted into a knowledge graph. This means it contains relationships between entities, such as product-category, user-bought-product, or brand-produces-product.

**Pre-Built KG**<br>
A dataset that already includes a knowledge graph structure with entities (products, categories, users) and relationships. This saves time in preprocessing and can be directly used for graph neural networks (GNNs) and embeddings.

**Good for PyTorch NN**<br>
Determines whether the dataset is well-suited for training a neural network in PyTorch, particularly for tasks like embedding generation, recommendation systems, or graph-based learning. This usually depends on the dataset’s structure, size, and data richness.
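
As an illustration of the Graph-Friendly criterion, the sketch below turns flat records into (head, relation, tail) triples for the three relationship types named above. The field names, relation labels, and sample rows are made up for the example and do not come from any of the datasets scored here.

```python
# Hypothetical product and purchase records; field names are illustrative only.
products = [
    {"product": "p1", "category": "laptops", "brand": "acme"},
    {"product": "p2", "category": "phones", "brand": "zenith"},
]
purchases = [("u1", "p1"), ("u2", "p1"), ("u2", "p2")]

def to_triples(products, purchases):
    """Flatten the records into (head, relation, tail) triples."""
    triples = []
    for row in products:
        triples.append((row["product"], "in_category", row["category"]))
        triples.append((row["brand"], "produces", row["product"]))
    for user, product in purchases:
        triples.append((user, "bought", product))
    return triples

triples = to_triples(products, purchases)
for triple in triples:
    print(triple)
```

A dataset scores well on this criterion when such a conversion needs little more than column selection, as it does here.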


## Dataset Scoring (1-5 Scale)

| Dataset                     | Product Metadata | User Interactions | Graph-Friendly | Pre-Built KG | Good for PyTorch NN | Total Score |
|-----------------------------|-----------------|-------------------|---------------|-------------|---------------------|-------------|
| **ICECAT**                  | 5               | 1                 | 4             | 1           | 4                   | **15**      |
| **Olist**                   | 4               | 5                 | 4             | 1           | 5                   | **19**      |
| **AliExpress KG**           | 5               | 5                 | 5             | 5           | 5                   | **25**      |
| **Facebook AI eCommerce KG**| 5               | 5                 | 5             | 5           | 5                   | **25**      |
| **DBPedia + Wikidata**      | 5               | 1                 | 5             | 5           | 2                   | **18**      |
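
The Total Score column is an unweighted sum of the five criteria. The short check below recomputes it from the table and ranks the datasets; treating every criterion as equally important is an assumption of this scoring, and a weighted sum would be a reasonable alternative.

```python
# Scores copied from the table, in column order:
# Product Metadata, User Interactions, Graph-Friendly, Pre-Built KG, Good for PyTorch NN
scores = {
    "ICECAT": [5, 1, 4, 1, 4],
    "Olist": [4, 5, 4, 1, 5],
    "AliExpress KG": [5, 5, 5, 5, 5],
    "Facebook AI eCommerce KG": [5, 5, 5, 5, 5],
    "DBPedia + Wikidata": [5, 1, 5, 5, 2],
}

# Unweighted total per dataset.
totals = {name: sum(values) for name, values in scores.items()}

# Highest total first; ties keep their table order because the sort is stable.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for name, total in ranked:
    print(f"{name}: {total}")
```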


# Workflow

1.   Load dataset
2.   Preprocess and Clean the Data
3.   Convert Data into a Knowledge Graph Structure
4.   Generate Knowledge Graph Embeddings
5.   Build a Graph-Based Recommendation Model
6.   Integrate Conversational AI (Natural Language Model)
7.   Train & Optimize the Full System
8.   Implement Real-Time Inference for Recommendations
9.   Deploy as an Interactive API or Chatbot
10.  Appendix of code and notes




In [None]:
# Access to Google Drive
# This seems to propagate credentials better from its own cell

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Load Data

# Load dataset

# Convert Data into a Knowledge Graph Structure

# Generate Knowledge Graph Embeddings

# Build a Graph-Based Recommendation Model

# Integrate Conversational AI (Natural Language Model)

# Train & Optimize the Full System

# Implement Real-Time Inference for Recommendations

# Deploy as an Interactive API or Chatbot

# Appendix

## Scoring Criteria for Selecting an Encoder


| **Factor**                 | **Description** |
|---------------------------|----------------|
| **Computational Efficiency** | How fast is the encoding on CPU/GPU? |
| **Memory Usage**          | How much memory does it require? |
| **Scalability**           | Can it handle large datasets like OpenBG500? |
| **Preserves Semantic Meaning** | Does the encoding capture relationships between entities? |
| **Compatibility with PyTorch** | How well does it integrate into PyTorch models? |
| **Ease of Implementation** | How difficult is it to set up? |

Each encoding method gets a **score from 1 to 5** for each factor.

---

## Scoring Different Encoding Methods

| Encoding Method  | Computational Efficiency | Memory Usage | Scalability | Semantic Meaning | PyTorch Compatibility | Ease of Implementation | **Total Score** |
|-----------------|------------------------|--------------|-------------|------------------|----------------------|--------------------|--------------|
| **Label Encoding** (Integer Mapping) | **5** (Very fast) | **5** (Very low) | **5** (Handles millions of nodes) | **1** (No meaning captured) | **5** (PyTorch works with integers easily) | **5** (Simple `map()`) | **26** |
| **One-Hot Encoding** | **2** (Slow for large datasets) | **1** (Consumes huge memory) | **1** (Bad for large graphs) | **3** (Some structure captured) | **3** (Can be used, but not ideal) | **3** (Easy but inefficient) | **13** |
| **BERT Embeddings** (Text-Based) | **2** (Slow on CPU) | **3** (Moderate) | **3** (Can use pre-trained models) | **5** (Captures meaning well) | **4** (PyTorch supports it, but needs preprocessing) | **2** (Requires NLP model) | **19** |
| **Word2Vec/FastText** | **3** (Faster than BERT) | **3** (Moderate) | **4** (Good for large datasets) | **4** (Captures word meaning) | **4** (PyTorch supports it) | **3** (Requires preprocessing) | **21** |
| **Knowledge Graph Embeddings (TransE, RotatE)** | **4** (Moderate) | **4** (Efficient for large graphs) | **5** (Scales well) | **5** (Captures graph meaning) | **5** (Implemented in PyTorch libraries such as PyKEEN) | **3** (Requires model training) | **26** |
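
The two top-scoring methods work together: label encoding supplies integer IDs, and a knowledge graph embedding model learns one vector per ID. The sketch below shows the TransE idea, that a true triple should satisfy h + r ≈ t, in plain Python so that it runs without PyTorch. The embeddings are random and untrained, the sizes are arbitrary, and the helper names are hypothetical; in the notebook these tables would more naturally be `nn.Embedding` layers.

```python
import math
import random

def init_embeddings(n, dim, rng):
    """One random vector per integer ID, each component uniform in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]

def transe_distance(h, r, t):
    """L2 norm of h + r - t; a smaller distance means a more plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

rng = random.Random(0)
entity_emb = init_embeddings(n=4, dim=8, rng=rng)    # 4 entities (arbitrary)
relation_emb = init_embeddings(n=2, dim=8, rng=rng)  # 2 relations (arbitrary)

# Score the label-encoded triple (head=0, relation=1, tail=2).
h, r, t = entity_emb[0], relation_emb[1], entity_emb[2]
d = transe_distance(h, r, t)
print(f"distance for an untrained triple: {d:.3f}")

# A perfectly fitted triple has tail = head + relation, so its distance is zero.
perfect_tail = [hi + ri for hi, ri in zip(h, r)]
print("distance for a perfect triple:", transe_distance(h, r, perfect_tail))
```

Training pushes the distance down for observed triples and up for corrupted ones, which is why the method scores high on semantic meaning but lower on ease of implementation.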



In [None]:
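import pandas as pd

# Build the vocabularies from all three splits so that an entity or relation
# seen only in test/validation still gets an ID (otherwise map() yields NaN).
all_triples = pd.concat([triples_df_train, triples_df_test, triples_df_val], ignore_index=True)

# pd.unique keeps first-seen order, so the IDs are reproducible across runs
# (iteration order over a Python set of strings is not).
all_entities = pd.unique(all_triples[["head", "tail"]].to_numpy().ravel())
all_relations = pd.unique(all_triples["relation"])

# Create mapping dictionaries
entity2id = {entity: idx for idx, entity in enumerate(all_entities)}
relation2id = {relation: idx for idx, relation in enumerate(all_relations)}

def encode_triples(df):
    """Return a copy of df with head/relation/tail replaced by integer IDs."""
    encoded = df.copy()
    encoded["head"] = encoded["head"].map(entity2id)
    encoded["relation"] = encoded["relation"].map(relation2id)
    encoded["tail"] = encoded["tail"].map(entity2id)
    # Fail loudly on unmapped values (for example, if this cell is run twice).
    if encoded[["head", "relation", "tail"]].isna().any().any():
        raise ValueError("Found a triple with an entity or relation missing from the vocabulary")
    return encoded

# Encode train, test, and validation sets
triples_df_train = encode_triples(triples_df_train)
triples_df_test = encode_triples(triples_df_test)
triples_df_val = encode_triples(triples_df_val)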


In [None]:
import torch

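# Convert to tensor format, selecting the columns explicitly so that
# column 0 is head, column 1 is relation, and column 2 is tail
columns = ["head", "relation", "tail"]
train_tensor = torch.tensor(triples_df_train[columns].to_numpy(), dtype=torch.long)
test_tensor = torch.tensor(triples_df_test[columns].to_numpy(), dtype=torch.long)
val_tensor = torch.tensor(triples_df_val[columns].to_numpy(), dtype=torch.long)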

In [None]:
import torch

# Check the shape of the tensors
print("Train Tensor Shape:", train_tensor.shape)
print("Test Tensor Shape:", test_tensor.shape)
print("Validation Tensor Shape:", val_tensor.shape)

# Access the first 5 samples
print("First 5 Training Samples:\n", train_tensor[:5])

# Get specific columns
heads = train_tensor[:, 0]  # Head entities
relations = train_tensor[:, 1]  # Relations
tails = train_tensor[:, 2]  # Tail entities

print("First 5 Head Entities:\n", heads[:5])
print("First 5 Relations:\n", relations[:5])
print("First 5 Tail Entities:\n", tails[:5])

# Element-wise tensor arithmetic (illustrative only: summing integer IDs has no meaning)
sum_tensor = heads + tails
print("Sum of Head & Tail Entities:\n", sum_tensor[:5])

# Get unique values
unique_heads = torch.unique(heads)
print(f"Unique Head Entities Count: {unique_heads.shape[0]}")


In [None]:
import torch

device = torch.device("cpu")  # Force CPU mode for now

print("Using Device:", device)


In [None]:
import torch.nn as nn
import torch.optim as optim

# Define a simple MLP model
class SimpleMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    # Three-layer network: input -> hidden (ReLU) -> output
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Set dimensions
input_dim = 3  # (head, relation, tail)
hidden_dim = 16
output_dim = 1  # Binary classification or regression

# Initialize model
model = SimpleMLP(input_dim, hidden_dim, output_dim).to(device)

# Define loss and optimizer
criterion = nn.MSELoss()  # Example: MSE loss for regression
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Dummy training loop: random targets, only to check that the forward and backward passes run
# Linear layers need float inputs; the raw integer IDs are placeholder features, not meaningful ones
inputs = train_tensor.float().to(device)
for epoch in range(5):  # Short training example
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, torch.rand_like(outputs))  # Dummy target values
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")
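
The loop above trains against random targets, so it only confirms that the plumbing works. One common way to get real labels for a binary classifier over triples is negative sampling: keep each observed triple as a positive and corrupt its tail to create a negative. The sketch below is a plain-Python illustration under that assumption; `corrupt_tails` and the toy triples are hypothetical, and it assumes every (head, relation) pair has at least one tail that is not a known triple.

```python
import random

def corrupt_tails(triples, num_entities, rng):
    """For each true (h, r, t), draw a tail t_neg so that (h, r, t_neg) is not a known triple."""
    known = set(triples)
    negatives = []
    for h, r, _ in triples:
        t_neg = rng.randrange(num_entities)
        # Resample until the corrupted triple is not one of the observed ones.
        while (h, r, t_neg) in known:
            t_neg = rng.randrange(num_entities)
        negatives.append((h, r, t_neg))
    return negatives

# Toy label-encoded triples over 4 entities and 2 relations.
positives = [(0, 0, 1), (1, 0, 2), (2, 1, 3)]
rng = random.Random(0)
negatives = corrupt_tails(positives, num_entities=4, rng=rng)

# Training examples with labels: 1.0 for observed triples, 0.0 for corrupted ones.
examples = positives + negatives
labels = [1.0] * len(positives) + [0.0] * len(negatives)
print(list(zip(examples, labels)))
```

With labels like these, the random targets could be replaced by `labels` and the MSE loss by a binary loss such as `nn.BCEWithLogitsLoss`.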
