# Urban Air Quality Knowledge Graph: Embedding & Similarity Test

This notebook demonstrates how to:

- Generate semantic embeddings for nodes in a Neo4j knowledge graph.
- Store the embeddings back in Neo4j.
- Run semantic similarity searches against the stored embeddings.

## 🔧 Environment Setup

Before running the notebook, install the required Python packages:

!pip install -r ../requirements.txt

In [2]:
import sys
from pathlib import Path

# Add src to the Python path so the project modules can be imported
sys.path.append(str(Path("../src").resolve()))

# Import the embedding and search helpers from the project modules
from neo4j_embedding_pipeline import generate_embeddings
from neo4j_similarity_search import similarity_search

# Generate embeddings for the graph nodes and store them in Neo4j
generate_embeddings()

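The internals of `generate_embeddings` are not shown in this notebook. As a rough illustration of the generate-and-store step, here is a minimal sketch under stated assumptions: the `Source` label, the `name` and `embedding` properties, and the helper name `embed_and_store` are hypothetical, `embed` stands in for whichever embedding model the pipeline uses, and `session` is any object with a `run(query, **params)` method like the Neo4j Python driver's session.

```python
from typing import Callable, Sequence

# Hypothetical Cypher: the label and property names are assumptions,
# not taken from the project's pipeline module.
FETCH_NODES = "MATCH (n:Source) RETURN elementId(n) AS id, n.name AS name"
STORE_EMBEDDING = (
    "MATCH (n) WHERE elementId(n) = $id "
    "SET n.embedding = $embedding"
)


def embed_and_store(session, embed: Callable[[str], Sequence[float]]) -> int:
    """Embed each node's name and write the vector back to the node.

    Returns the number of nodes updated.
    """
    # Materialize the read result before issuing any writes on the session.
    records = list(session.run(FETCH_NODES))
    for record in records:
        vector = [float(x) for x in embed(record["name"])]
        session.run(STORE_EMBEDDING, id=record["id"], embedding=vector)
    return len(records)
```

Passing the embedding function in as an argument keeps the sketch independent of any particular model and makes it easy to exercise with a stub session.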


In [4]:
# Example similarity query
query_text = "vehicle emissions"
index_name = "source_embeddings"

results = similarity_search(query_text, index_name, top_k=5)

print("🔍 Retrieved similar nodes:")
for record in results:
    print(f"- {record['name']} (Category: {record['category']}, Score: {record['score']:.3f})")

🔍 Retrieved similar nodes:
- Vehicle exhaust (Category: Uncategorized, Score: 0.856)
- Cold start emissions for urban and rural drives for cars (Category: MobileSource, Score: 0.843)
- Aircraft emissions (Category: MobileSource, Score: 0.830)
- Evaporative emissions (fuel storage and handling) (Category: AreaSource, Score: 0.809)
- Vehicle exhaust (incomplete combustion) (Category: Uncategorized, Score: 0.804)
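The implementation of `similarity_search` lives in the project's `src` directory and is not reproduced here. A search of this kind can be sketched with Neo4j's vector index procedure `db.index.vector.queryNodes`, under the following assumptions: the helper name `vector_search` is hypothetical, `embed` stands in for the same embedding model used at indexing time, and `node.category` is assumed to be a stored property (the real module may derive the category differently).

```python
from typing import Callable, Dict, List, Sequence

# db.index.vector.queryNodes is Neo4j's vector index query procedure.
# The returned property names (name, category) are assumptions.
VECTOR_QUERY = (
    "CALL db.index.vector.queryNodes($index_name, $top_k, $embedding) "
    "YIELD node, score "
    "RETURN node.name AS name, node.category AS category, score "
    "ORDER BY score DESC"
)


def vector_search(
    session,
    embed: Callable[[str], Sequence[float]],
    query_text: str,
    index_name: str,
    top_k: int = 5,
) -> List[Dict]:
    """Embed the query text and return the top_k nearest nodes from the index."""
    embedding = [float(x) for x in embed(query_text)]
    result = session.run(
        VECTOR_QUERY,
        index_name=index_name,
        top_k=top_k,
        embedding=embedding,
    )
    return [
        {"name": r["name"], "category": r["category"], "score": r["score"]}
        for r in result
    ]
```

Note that a vector index always returns the nearest neighbours it has, so the scores, not the mere presence of results, indicate how relevant a match is.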


## 🔍 Semantic Similarity Tests

Test embedding quality with varied semantic queries covering different node types:
- Sources
- Pollutants
- Mitigation measures
- Meteorological factors
- Street canyon effects

In [6]:
# Define test queries and the index each one should be run against
test_cases = [
    {"query": "vehicle emissions", "index": "source_embeddings"},
    {"query": "industrial pollution", "index": "source_embeddings"},
    {"query": "traffic-related pollutants", "index": "pollutant_embeddings"},
    {"query": "domestic heating pollution", "index": "source_embeddings"},
    {"query": "effective policies to reduce NOx", "index": "mitigation_embeddings"},
    {"query": "impact of wind speed on air quality", "index": "meteorological_embeddings"},
    {"query": "urban design affecting pollution dispersion", "index": "street_canyon_embeddings"}
]

# Run each similarity search and print the top results
for case in test_cases:
    print("\n" + "="*80)
    print(f"🔎 Query: '{case['query']}' (Index: '{case['index']}')\n")
    results = similarity_search(case['query'], case['index'], top_k=3)

    print("Retrieved similar nodes:")
    for record in results:
        print(f"- {record['name']} (Category: {record['category']}, Similarity Score: {record['score']:.3f})")


🔎 Query: 'vehicle emissions' (Index: 'source_embeddings')

Retrieved similar nodes:
- Vehicle exhaust (Category: Uncategorized, Similarity Score: 0.856)
- Cold start emissions for urban and rural drives for cars (Category: MobileSource, Similarity Score: 0.843)
- Aircraft emissions (Category: MobileSource, Similarity Score: 0.830)

🔎 Query: 'industrial pollution' (Index: 'source_embeddings')

Retrieved similar nodes:
- Metal smelters and steel mills (Category: StationarySource, Similarity Score: 0.773)
- Chemical manufacturing plants (Category: StationarySource, Similarity Score: 0.761)
- Mining (metal ore, coal) (Category: Uncategorized, Similarity Score: 0.756)

🔎 Query: 'traffic-related pollutants' (Index: 'pollutant_embeddings')

Retrieved similar nodes:
- Chlorofluorocarbons (CFCs) (Category: HazardousOrganicCompounds, Similarity Score: 0.805)
- Volatile organic compounds (VOCs) (Category: GaseousPollutants, Similarity S
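Reading the printed results by eye works for a handful of queries, but the same checks can be made repeatable. The following is a minimal sketch: the helper names `category_share` and `any_name_contains` and the expected terms are illustrative assumptions, while the sample records are the 'industrial pollution' results recorded above.

```python
from typing import Dict, List, Sequence


def category_share(results: List[Dict], category: str) -> float:
    """Fraction of returned nodes whose category equals the expected one."""
    if not results:
        return 0.0
    matches = sum(1 for r in results if r["category"] == category)
    return matches / len(results)


def any_name_contains(results: List[Dict], terms: Sequence[str]) -> bool:
    """True if at least one returned node name contains one of the terms."""
    names = [r["name"].lower() for r in results]
    return any(term.lower() in name for term in terms for name in names)


# Results recorded above for the query 'industrial pollution'.
industrial = [
    {"name": "Metal smelters and steel mills", "category": "StationarySource", "score": 0.773},
    {"name": "Chemical manufacturing plants", "category": "StationarySource", "score": 0.761},
    {"name": "Mining (metal ore, coal)", "category": "Uncategorized", "score": 0.756},
]

print(round(category_share(industrial, "StationarySource"), 2))  # 0.67
print(any_name_contains(industrial, ["smelter", "refinery"]))    # True
```

Checks like these make it easier to spot questionable matches, such as Chlorofluorocarbons (CFCs) ranking first for 'traffic-related pollutants', without rereading every output block.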

## ⚠️ Edge Case Testing

Test how the embeddings handle ambiguous, unusual, or unrelated queries:

In [8]:
# Edge-case test queries
edge_case_queries = [
    {"query": "banana", "index": "pollutant_embeddings"},  # unrelated concept
    {"query": "emissions from unicorns", "index": "source_embeddings"},  # non-existent source
    {"query": "", "index": "mitigation_embeddings"},  # empty string
    {"query": "temperature", "index": "meteorological_embeddings"},  # vague term
]

# Run the edge-case tests
for case in edge_case_queries:
    print("\n" + "="*80)
    print(f"⚠️ Edge Case Query: '{case['query']}' (Index: '{case['index']}')\n")
    results = similarity_search(case['query'], case['index'], top_k=3)

    if results:
        print("Retrieved similar nodes:")
        for record in results:
            print(f"- {record['name']} (Category: {record['category']}, Similarity Score: {record['score']:.3f})")
    else:
        print("⚠️ No relevant nodes retrieved.")


⚠️ Edge Case Query: 'banana' (Index: 'pollutant_embeddings')

Retrieved similar nodes:
- Benzene (Category: GaseousPollutants, Similarity Score: 0.606)
- Benzo(a)pyrene (Category: Uncategorized, Similarity Score: 0.605)
- Lead (Pb) (Category: TraceElements, Similarity Score: 0.595)

⚠️ Edge Case Query: 'emissions from unicorns' (Index: 'source_embeddings')

Retrieved similar nodes:
- Aircraft emissions (Category: MobileSource, Similarity Score: 0.777)
- Biogenic emissions (vegetation) (Category: NaturalSource, Similarity Score: 0.761)
- Vehicle exhaust (Category: Uncategorized, Similarity Score: 0.729)

⚠️ Edge Case Query: '' (Index: 'mitigation_embeddings')

Retrieved similar nodes:
- Public awareness campaigns (Category: PolicyBasedMeasure, Similarity Score: 0.576)
- Banning leaded gasoline (Category: PolicyBasedMeasure, Similarity Score: 0.553)
- Employment of Best Available Technologies (BATs) (Category: TechnologicalMeas
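
The recorded outputs show that the search returns nodes even for 'banana' and for an empty query, so the "no relevant nodes" branch in the loop is never reached. One possible guard, sketched here under assumptions, is to reject blank queries and drop low-scoring hits. The helper name `filter_results` and the `min_score` value of 0.7 are hypothetical; the threshold was picked only because the recorded 'banana' scores sit near 0.60 while on-topic queries score roughly 0.76 to 0.86.

```python
from typing import Dict, List


def filter_results(query_text: str, results: List[Dict], min_score: float = 0.7) -> List[Dict]:
    """Return only hits at or above min_score; blank queries yield no hits."""
    if not query_text.strip():
        return []
    return [r for r in results if r["score"] >= min_score]


# Results recorded above for the edge-case queries.
banana = [
    {"name": "Benzene", "category": "GaseousPollutants", "score": 0.606},
    {"name": "Benzo(a)pyrene", "category": "Uncategorized", "score": 0.605},
    {"name": "Lead (Pb)", "category": "TraceElements", "score": 0.595},
]
unicorns = [
    {"name": "Aircraft emissions", "category": "MobileSource", "score": 0.777},
    {"name": "Biogenic emissions (vegetation)", "category": "NaturalSource", "score": 0.761},
    {"name": "Vehicle exhaust", "category": "Uncategorized", "score": 0.729},
]

print(len(filter_results("banana", banana)))                     # 0
print(len(filter_results("emissions from unicorns", unicorns)))  # 3
print(len(filter_results("", banana)))                           # 0
```

A score threshold alone does not catch 'emissions from unicorns', because the word "emissions" pulls the query close to real source nodes. Any cutoff would need tuning against a larger set of labelled queries.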