# Section 1: Environment Setup and Imports
In this first section, we’ll:
1. Load environment variables containing API keys and Neo4j credentials.
1. Import the necessary Python libraries.
1. Create a connection to the Neo4j database.

**Key Points**:
- We’re using python-dotenv to load .env.
- We’ll verify that we can connect to Neo4j.
- No data manipulation happens here yet; just setup.
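
Because `os.getenv` returns `None` for an unset variable, a missing credential otherwise only surfaces when the driver tries to connect. The sketch below shows one way to fail fast instead. `require_env` is a hypothetical helper written for this illustration and is not part of python-dotenv or the Neo4j driver; the URI and user name in the example are placeholder values.

```python
import os
from typing import Mapping, Optional


def require_env(names, env: Optional[Mapping[str, str]] = None) -> dict:
    """Return {name: value} for every requested variable, or raise if any is unset or empty.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise EnvironmentError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}


# Example with an explicit mapping so it runs without a real .env file.
settings = require_env(
    ["NEO4J_URI", "NEO4J_USER"],
    env={"NEO4J_URI": "bolt://localhost:7687", "NEO4J_USER": "neo4j"},
)
print(settings["NEO4J_URI"])
```

In the notebook, this would be called with no `env` argument after `load_dotenv(...)`, so that it reads from the real process environment.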


In [3]:
# Setup additional dependencies 
!pip install rdflib

Collecting rdflib
  Downloading rdflib-7.1.1-py3-none-any.whl.metadata (11 kB)
Downloading rdflib-7.1.1-py3-none-any.whl (562 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 562.4/562.4 kB 7.2 MB/s eta 0:00:00
Installing collected packages: rdflib
Successfully installed rdflib-7.1.1


In [6]:
# Section 1: Environment Setup and Imports

import os
from dotenv import load_dotenv
from neo4j import GraphDatabase
from rdflib import Graph
import pandas as pd

# Load environment variables
load_dotenv("../.env")

# Retrieve environment variables
NEO4J_URI = os.getenv("NEO4J_URI")
NEO4J_USER = os.getenv("NEO4J_USER")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Initialize Neo4j driver
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

# Test connection
with driver.session() as session:
    result = session.run("RETURN 1 AS test")
    print("Neo4j Connection Test Result:", result.single()["test"])


Neo4j Connection Test Result: 1


# Section 2: Parse and Ingest Ontology (TTL) into Neo4j
We’ll parse the ontology Turtle file using rdflib. We’ll store classes, object properties, and data properties as reference nodes in Neo4j. For now, we won’t apply complex SHACL validation, but we’ll set up the structure for later validation steps.

**Key Points**:
- Parse TTL using rdflib.
- Create OntologyClass, OntologyObjectProperty, and OntologyDataProperty nodes in Neo4j.
- This provides a reference backbone for later data integration and LLM-driven enrichment.


In [14]:
# Section 2: Parse and Ingest Ontology

print("Parsing ontology...")
# Parse ontology
g = Graph()
ontology_path = "./data/ontology_creation/refined_ontology_v1.ttl"
g.parse(ontology_path, format="turtle")

classes = [s for s, p, o in g if str(p).endswith("type") and o.endswith("Class")]
object_properties = [s for s, p, o in g if str(p).endswith("type") and o.endswith("ObjectProperty")]
data_properties = [s for s, p, o in g if str(p).endswith("type") and o.endswith("DatatypeProperty")]

print("Ingesting ontology into Neo4j...")

with driver.session() as session:
    # Create OntologyClass nodes
    for c in classes:
        label = c.split("#")[-1]
        session.run("""
        MERGE (oc:OntologyClass {name: $name, uri: $uri})
        """, name=label, uri=str(c))

    # Create OntologyObjectProperty nodes
    for op in object_properties:
        label = op.split("#")[-1]
        session.run("""
        MERGE (oop:OntologyObjectProperty {name: $name, uri: $uri})
        """, name=label, uri=str(op))

    # Create OntologyDataProperty nodes
    for dp in data_properties:
        label = dp.split("#")[-1]
        session.run("""
        MERGE (odp:OntologyDataProperty {name: $name, uri: $uri})
        """, name=label, uri=str(dp))

print("Ontology parsing and ingestion completed.")


Parsing ontology...
Ingesting ontology into Neo4j...
Ontology parsing and ingestion completed.
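
The ingestion cell above derives each node's `name` with `split("#")[-1]`, which returns the whole URI when an ontology uses slash-terminated namespaces instead of hash fragments. A minimal sketch of a more tolerant extraction follows. `local_name` is a hypothetical helper, and the example URIs are made up for illustration.

```python
def local_name(uri: str) -> str:
    """Return the part of a URI after the last '#' or, failing that, the last '/'."""
    text = str(uri).rstrip("#/")
    if "#" in text:
        return text.rsplit("#", 1)[-1]
    return text.rsplit("/", 1)[-1]


# Hypothetical URIs: one hash-style namespace, one slash-style namespace.
print(local_name("http://example.org/onto#Survey"))
print(local_name("http://example.org/onto/Survey"))
```

Both calls yield `Survey`, so the same helper could replace the three `split("#")[-1]` expressions if the ontology mixes namespace styles.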


# Step 3: Initial Data Load

Ingest all of the provided data files except for the large directory of chunked group-variable data. 

1. SurveyNode.csv – loads survey nodes.
1. ExamplesNode.csv – links examples to survey nodes.
1. GeographyNode.csv – loads geography nodes and links them to surveys when a SurveyID column is present.
1. SurveyGroupNode.csv – loads survey groups and links them to survey nodes.
1. SurveyVariablesNoGroupNode.csv – creates a NoGroup node for each survey and links these variables under that pseudo-group.

**Note**:
This code does not iterate through the GroupNodesWithVariables directory. We will handle the large chunked data set separately in a future step. For now, this section ensures that all the basic data (survey, examples, geography, groups, and no-group variables) are loaded into the graph.

**What This Code Does**:
The code applies one consistent pattern to every file and prepares the graph for the next step, ingesting the large, chunked group-variable data set:
1. Surveys: Ingests and sets them as the backbone (:SurveyNode).
1. Examples: Creates :ExamplesNode and links them to their surveys with :HAS_EXAMPLE.
1. Geography: Creates :GeographyNode and, when a row has a SurveyID, links it to that survey with :APPLIES_TO.
1. Groups: Creates :SurveyGroupNode and links them to their surveys. This lays the groundwork for grouped variables.
1. No-Group Variables: Treats ungrouped variables as children of a NoGroup group. Creates the NoGroup group node per survey if it doesn’t exist and links variables under it.
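
One detail worth handling in the loaders for this step: pandas reads an empty CSV cell as `NaN`, so `row.get("Universe", None)` returns `NaN` rather than `None` when the column exists but the cell is blank. Passing that value to Cypher stores a floating-point NaN, whereas passing `None` leaves the property unset. The sketch below shows one way to convert the values before calling `session.run`. `clean_params` is a hypothetical helper, and the sample row is invented.

```python
import math


def clean_params(params: dict) -> dict:
    """Replace float NaN values with None so Cypher receives null instead of NaN."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in params.items()
    }


# Invented row for illustration; "Universe" stands in for an empty CSV cell.
row = {"SurveyID": "S1", "Universe": float("nan"), "Concept": "Income"}
print(clean_params(row))
```

The cleaned dictionary could then be passed to a query as keyword parameters, for example `session.run(query, **clean_params(params))`.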


In [17]:
# Section 3: Ingest Survey, Example, Geography, Group, and No-Group Variable Data
import pandas as pd
from tqdm import tqdm

survey_csv_path = "../data/data_extraction/SurveyNode.csv"
examples_csv_path = "../data/data_extraction/ExamplesNode.csv"
geography_csv_path = "../data/data_extraction/GeographyNode.csv"
group_csv_path = "../data/data_extraction/SurveyGroupNode.csv"
no_group_vars_csv_path = "../data/data_extraction/SurveyVariablesNoGroupNode.csv"

# 1) Ingest Survey Nodes
df_survey = pd.read_csv(survey_csv_path)
with driver.session() as session:
    for idx, row in tqdm(df_survey.iterrows(), total=len(df_survey), desc="Loading Surveys"):
        session.run("""
        MERGE (s:SurveyNode {SurveyID: $SurveyID})
        SET s.DatasetType = $DatasetType,
            s.ReleaseYear = $ReleaseYear,
            s.Source = $Source,
            s.source = "CSV_import",
            s.enrichment_pass = 0,
            s.creation_timestamp = timestamp()
        """, 
        SurveyID=row["SurveyID"],
        DatasetType=row.get("DatasetType", None),
        ReleaseYear=row.get("ReleaseYear", None),
        Source=row.get("Source", None))

# 2) Ingest Examples and Link to Surveys
df_examples = pd.read_csv(examples_csv_path)
with driver.session() as session:
    for idx, row in tqdm(df_examples.iterrows(), total=len(df_examples), desc="Loading Examples"):
        session.run("""
        MATCH (s:SurveyNode {SurveyID: $SurveyID})
        MERGE (e:ExamplesNode {ExampleID: $ExampleID})
        SET e.source = "CSV_import",
            e.enrichment_pass = 0,
            e.creation_timestamp = timestamp()
        MERGE (s)-[:HAS_EXAMPLE]->(e)
        """,
        SurveyID=row["SurveyID"],
        ExampleID=row["ExampleID"] if "ExampleID" in row else idx)

# 3) Ingest Geography Nodes and Link to Surveys
df_geo = pd.read_csv(geography_csv_path)
with driver.session() as session:
    for idx, row in tqdm(df_geo.iterrows(), total=len(df_geo), desc="Loading Geography"):
        session.run("""
        MERGE (g:GeographyNode {GeoID: $GeoID})
        SET g.GeographyLevel = $GeographyLevel,
            g.source = "CSV_import",
            g.enrichment_pass = 0,
            g.creation_timestamp = timestamp()
        """,
        GeoID=row["GeoID"] if "GeoID" in row else f"Geo_{idx}",
        GeographyLevel=row.get("GeographyLevel", None))

        # Link geography to survey if SurveyID is present
        if "SurveyID" in row and pd.notnull(row["SurveyID"]):
            session.run("""
            MATCH (s:SurveyNode {SurveyID: $SurveyID})
            MATCH (g:GeographyNode {GeoID: $GeoID})
            MERGE (s)-[:APPLIES_TO]->(g)
            """, SurveyID=row["SurveyID"], GeoID=row["GeoID"] if "GeoID" in row else f"Geo_{idx}")

# 4) Ingest Survey Groups and Link to Surveys
df_group = pd.read_csv(group_csv_path)
with driver.session() as session:
    for idx, row in tqdm(df_group.iterrows(), total=len(df_group), desc="Loading Groups"):
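        session.run("""
        MATCH (s:SurveyNode {SurveyID: $SurveyID})
        MERGE (g:SurveyGroupNode {SurveyID: $SurveyID, GroupName: $GroupName})
        SET g.SurveyGroupID = $SurveyGroupID,
            g.source = "CSV_import",
            g.enrichment_pass = 0,
            g.creation_timestamp = timestamp()
        MERGE (s)-[:HAS_GROUP]->(g)
        """,
        SurveyID=row["SurveyID"],
        GroupName=row["GroupName"],
        # Step 4 matches groups on SurveyGroupID (SurveyID + "_" + GroupName), so set it here.
        SurveyGroupID=f'{row["SurveyID"]}_{row["GroupName"]}')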

# 5) Ingest No-Group Variables
df_no_group_vars = pd.read_csv(no_group_vars_csv_path)

# If the CSV uses "Variable Name" as a column header, rename it to "VariableName"
if "Variable Name" in df_no_group_vars.columns:
    df_no_group_vars.rename(columns={"Variable Name": "VariableName"}, inplace=True)

with driver.session() as session:
    survey_ids_no_group = df_no_group_vars["SurveyID"].unique()
    
    # Ensure a NoGroup node exists for each Survey
    for sid in tqdm(survey_ids_no_group, desc="Ensuring NoGroup Node per Survey"):
        session.run("""
        MATCH (s:SurveyNode {SurveyID: $SurveyID})
        MERGE (ng:SurveyGroupNode {SurveyID: $SurveyID, GroupName: "NoGroup"})
        SET ng.source = "CSV_import",
            ng.enrichment_pass = 0,
            ng.creation_timestamp = timestamp()
        MERGE (s)-[:HAS_GROUP]->(ng)
        """, SurveyID=sid)

    # Create variable nodes and link them to NoGroup
    for idx, row in tqdm(df_no_group_vars.iterrows(), total=len(df_no_group_vars), desc="Loading No-Group Variables"):
        session.run("""
        MATCH (g:SurveyGroupNode {SurveyID: $SurveyID, GroupName: "NoGroup"})
        MERGE (v:SurveyVariableNode {
            SurveyID: $SurveyID,
            VariableName: $VariableName
        })
        ON CREATE SET v.Universe = $Universe,
                      v.Concept = $Concept,
                      v.MeasurementUnit = $MeasurementUnit,
                      v.source = "CSV_import",
                      v.enrichment_pass = 0,
                      v.creation_timestamp = timestamp()
        MERGE (g)-[:HAS_VARIABLE]->(v)
        """,
        SurveyID=row["SurveyID"],
        VariableName=row["VariableName"],
        Universe=row.get("Universe", None),
        Concept=row.get("Concept", None),
        MeasurementUnit=row.get("MeasurementUnit", None))

print("Base data (Surveys, Examples, Geography, Groups, No-Group Variables) loaded successfully.")


Loading Surveys: 100%|██████████████████████| 1648/1648 [00:17<00:00, 96.08it/s]
Loading Examples: 100%|███████████████████| 22649/22649 [04:12<00:00, 89.82it/s]
Loading Geography: 100%|██████████████████| 10101/10101 [03:11<00:00, 52.70it/s]
Loading Groups: 100%|█████████████████████| 64873/64873 [18:35<00:00, 58.14it/s]
Ensuring NoGroup Node per Survey: 100%|█████| 1618/1618 [00:27<00:00, 58.73it/s]
Loading No-Group Variables: 100%|████| 539733/539733 [13:41:42<00:00, 10.95it/s]

Base data (Surveys, Examples, Geography, Groups, No-Group Variables) loaded successfully.





# Step 4: Merging GroupNodesWithVariables

Below is code for the next step, which involves ingesting the large set of group-variable data stored in multiple CSV files within the GroupNodesWithVariables directory. We will:

1. Iterate over all chunked CSV files in GroupNodesWithVariables.
1. For each chunk:
    - Read the CSV into a DataFrame.
    - (If needed) rename "Variable Name" to "VariableName".
    - For each row, match the existing :SurveyGroupNode by SurveyGroupID (SurveyID and GroupName joined with an underscore), then find or create the :SurveyVariableNode keyed by SurveyGroupID and VariableName.
    - Link the variable to the group with :HAS_VARIABLE.

**Performance Considerations**:
- This may be time-consuming if the datasets are large. Consider breaking them into batches or using Neo4j’s LOAD CSV for even faster ingestion. For now, we’ll proceed similarly to previous steps, using tqdm to track progress.
- To speed things up, the following uniqueness constraints were added in Neo4j before this step. Each one also creates a backing index that MATCH and MERGE lookups can use:

**NOTE**: this syntax is for Neo4j 5.x or later...
```
// Create a uniqueness constraint for SurveyGroupID
CREATE CONSTRAINT FOR (g:SurveyGroupNode)
REQUIRE g.SurveyGroupID IS UNIQUE;

// Create a uniqueness constraint for SurveyID
CREATE CONSTRAINT FOR (s:SurveyNode)
REQUIRE s.SurveyID IS UNIQUE;

// Create a uniqueness constraint for SurveyGroupID and VariableName
CREATE CONSTRAINT FOR (v:SurveyVariableNode)
REQUIRE (v.SurveyGroupID, v.VariableName) IS UNIQUE;

```
- Running these constraints in Neo4j’s browser or via a session before loading can significantly speed up merges.
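
The batching idea mentioned above can be sketched as follows: instead of one `session.run` call per row, rows are grouped into lists and each list is sent as a single `$rows` parameter that Cypher expands with `UNWIND`. This is an illustration under stated assumptions, not the notebook's method. `batched`, `load_in_batches`, and `RecordingSession` are hypothetical names, the batch size of 1000 is an arbitrary choice, and the query is a reduced version of the per-row query used in this step. `RecordingSession` stands in for a real Neo4j session so the example runs without a database.

```python
from typing import Iterable, Iterator, List


def batched(rows: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most `size` rows."""
    batch: List[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# One statement per batch: UNWIND expands the $rows list on the server.
BATCH_QUERY = """
UNWIND $rows AS row
MATCH (g:SurveyGroupNode {SurveyGroupID: row.SurveyGroupID})
MERGE (v:SurveyVariableNode {SurveyGroupID: row.SurveyGroupID, VariableName: row.VariableName})
MERGE (g)-[:HAS_VARIABLE]->(v)
"""


def load_in_batches(session, rows: Iterable[dict], size: int = 1000) -> int:
    """Send rows to `session.run` in batches and return the number of calls made."""
    calls = 0
    for batch in batched(rows, size):
        session.run(BATCH_QUERY, rows=batch)
        calls += 1
    return calls


class RecordingSession:
    """Stand-in for a Neo4j session so the sketch runs without a database."""

    def __init__(self):
        self.batch_sizes = []

    def run(self, query, rows):
        self.batch_sizes.append(len(rows))


# Invented rows for illustration only.
demo_rows = [{"SurveyGroupID": "S1_G1", "VariableName": f"V{i}"} for i in range(2500)]
demo_session = RecordingSession()
print(load_in_batches(demo_session, demo_rows), demo_session.batch_sizes)
```

With a real session, 2,500 rows would cost three round trips instead of 2,500. With pandas, `df_chunk.to_dict("records")` would be one way to produce the list of row dictionaries.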

**Assumptions**:
- Each chunk CSV has at least SurveyID, GroupName, VariableName (or Variable Name) columns.
- Additional properties like Universe, Concept, MeasurementUnit may be included and will be set on :SurveyVariableNode.
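
These assumptions can be checked per chunk before any row is sent to Neo4j. The sketch below applies the two header renames used in this step and reports which required columns are still missing. `normalize_columns`, `REQUIRED_COLUMNS`, and `RENAMES` are hypothetical names introduced for illustration.

```python
REQUIRED_COLUMNS = {"SurveyID", "GroupName", "VariableName"}
# Header variants handled in this step, mapped to the names the loader expects.
RENAMES = {"Variable Name": "VariableName", "Group": "GroupName"}


def normalize_columns(columns):
    """Apply the known renames and report any required column that is still missing."""
    normalized = [RENAMES.get(name, name) for name in columns]
    missing = sorted(REQUIRED_COLUMNS - set(normalized))
    return normalized, missing


print(normalize_columns(["SurveyID", "Group", "Variable Name", "Universe"]))
print(normalize_columns(["SurveyID", "Concept"]))
```

A chunk whose `missing` list is non-empty could be skipped or reported instead of failing partway through with a `KeyError`.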

In [20]:
import glob
import pandas as pd
from tqdm import tqdm
from neo4j import GraphDatabase
import os

# Retrieve environment variables for Neo4j
NEO4J_URI = os.getenv("NEO4J_URI")
NEO4J_USER = os.getenv("NEO4J_USER")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")

# Validate that all required environment variables are set
if not NEO4J_URI or not NEO4J_USER or not NEO4J_PASSWORD:
    raise EnvironmentError("Missing Neo4j credentials. Please set NEO4J_URI, NEO4J_USER, and NEO4J_PASSWORD environment variables.")

# Establish the Neo4j driver connection
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))


# Directory containing chunk files
group_variables_dir = "../data/data_extraction/GroupNodesWithVariables/"  
chunk_files = sorted(glob.glob(group_variables_dir + "Processed_GroupNode_chunk_*.csv"))
 

for chunk_file in chunk_files:
    df_chunk = pd.read_csv(chunk_file)

    # Standardize column names
    if "Variable Name" in df_chunk.columns:
        df_chunk.rename(columns={"Variable Name": "VariableName"}, inplace=True)
    if "Group" in df_chunk.columns:
        df_chunk.rename(columns={"Group": "GroupName"}, inplace=True)

    # Create SurveyGroupID if it doesn't already exist
    if "SurveyGroupID" not in df_chunk.columns:
        df_chunk["SurveyGroupID"] = df_chunk["SurveyID"] + "_" + df_chunk["GroupName"]

    # Create SurveyGroupVariableID as the unique key
    df_chunk["SurveyGroupVariableID"] = df_chunk["SurveyGroupID"] + "_" + df_chunk["VariableName"]

    with driver.session() as session:
        # Ingest variables in this chunk
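        # Use the file name only in the progress label so the bar is not truncated by the long path
        for idx, row in tqdm(df_chunk.iterrows(), total=len(df_chunk), desc=f"Loading {os.path.basename(chunk_file)}"):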
            session.run("""
            MATCH (g:SurveyGroupNode {SurveyGroupID: $SurveyGroupID})
            MERGE (v:SurveyVariableNode {
                SurveyGroupID: $SurveyGroupID,
                VariableName: $VariableName,
                SurveyGroupVariableID: $SurveyGroupVariableID
            })
            ON CREATE SET v.Universe = $Universe,
                          v.Concept = $Concept,
                          v.MeasurementUnit = $MeasurementUnit,
                          v.source = "CSV_import",
                          v.enrichment_pass = 0,
                          v.creation_timestamp = timestamp()
            MERGE (g)-[:HAS_VARIABLE]->(v)
            """,
            SurveyGroupID=row["SurveyGroupID"],
            VariableName=row["VariableName"],
            SurveyGroupVariableID=row["SurveyGroupVariableID"],
            Universe=row.get("Universe", None),
            Concept=row.get("Concept", None),
            MeasurementUnit=row.get("MeasurementUnit", None))

    print(f"Finished loading {chunk_file}")

print("All group-variable chunks loaded successfully.")


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_000.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_001.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_002.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_003.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_004.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_005.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_006.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_007.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_008.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_009.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_010.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_011.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_012.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_013.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_014.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_015.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_016.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_017.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_018.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_019.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_020.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_021.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_022.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_023.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_024.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_025.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_026.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_027.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_028.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_029.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_030.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_031.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_032.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_033.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_034.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_035.csv


  df_chunk = pd.read_csv(chunk_file)
Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_036.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_037.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_038.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_039.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_040.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_041.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_042.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_043.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_044.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_045.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_046.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_047.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_048.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_049.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_050.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_051.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_052.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_053.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_054.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_055.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_056.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_057.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_058.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_059.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_060.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_061.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_062.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_063.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_064.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_065.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_066.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_067.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_068.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_069.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_070.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_071.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_072.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_073.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_074.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_075.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_076.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_077.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_078.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_079.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_080.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_081.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_082.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_083.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_084.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_085.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_086.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_087.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_088.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_089.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_090.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_091.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_092.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_093.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_094.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_095.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_096.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_097.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_098.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_099.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_100.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_101.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_102.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_103.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_104.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_105.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_106.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_107.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_108.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_109.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_110.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_111.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_112.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_113.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_114.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_115.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_116.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_117.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_118.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_119.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_120.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_121.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_122.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_123.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_124.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_125.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_126.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_127.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_128.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_129.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_130.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_131.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun


Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_132.csv


Loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chun

Finished loading ../data/data_extraction/GroupNodesWithVariables/Processed_GroupNode_chunk_133.csv
All group-variable chunks loaded successfully.





## Step 5: Quick data check and validation
This step focuses on data validation and quality checks before we proceed with LLM-based enrichment. It involves:
1. Creating constraints and indexes in Neo4j:
    - To ensure data integrity and improve query performance, we’ll add uniqueness constraints. Neo4j backs each uniqueness constraint with an index, so the same statements both detect unexpected duplicates and speed up lookups during enrichment and analysis.

**Data Validation and Constraints**:
Now that all data (surveys, examples, geography, groups, variables) has been ingested, the next logical step is to validate and improve data quality. This involves:
- Adding Neo4j constraints to ensure uniqueness of key identifiers: SurveyID for survey nodes, (SurveyID, GroupName) for group nodes, and (SurveyID, VariableName) for variable nodes.
- Reviewing logs and metrics to confirm that ingestion completed without errors or anomalies.

**NOTE**: We run these checks before initiating LLM-driven enrichment so that the knowledge graph is stable and consistent first.
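
Creating a uniqueness constraint fails if the existing data already violates it, so it can help to look for duplicate keys first. The following is a minimal sketch rather than part of the original pipeline: `duplicate_check_query` and `find_duplicates` are hypothetical helper names, and the labels and key properties are the ones listed above. It assumes a session object like the one returned by `driver.session()`.

```python
# Hypothetical helpers (not from the original notebook) for spotting duplicate keys.

# Node label -> properties that together should identify a node uniquely
CHECKS = {
    "SurveyNode": ["SurveyID"],
    "SurveyGroupNode": ["SurveyID", "GroupName"],
    "SurveyVariableNode": ["SurveyID", "VariableName"],
}


def duplicate_check_query(label, key_props):
    """Build a Cypher query listing key combinations shared by more than one node."""
    keys = ", ".join(f"n.{prop} AS {prop}" for prop in key_props)
    return (
        f"MATCH (n:{label}) "
        f"WITH {keys}, count(*) AS copies "
        f"WHERE copies > 1 "
        f"RETURN {', '.join(key_props)}, copies "
        f"ORDER BY copies DESC LIMIT 10"
    )


def find_duplicates(session, checks=CHECKS):
    """Run one duplicate check per label and collect the offending rows."""
    report = {}
    for label, key_props in checks.items():
        # Result.data() returns the rows as a list of dictionaries
        report[label] = session.run(duplicate_check_query(label, key_props)).data()
    return report
```

Calling `find_duplicates(session)` inside a `with driver.session() as session:` block should return an empty list for each label when the data is clean. Any rows that do come back identify the keys to fix before adding the constraints.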

In [26]:
# Note: the FOR ... REQUIRE constraint syntax used below requires Neo4j 4.4 or later.
# Next Step: Add Constraints

# Uniqueness constraints to create: (node label, Cypher key expression)
constraints = [
    ("SurveyNode", "n.SurveyID"),
    ("SurveyGroupNode", "(n.SurveyID, n.GroupName)"),
    ("SurveyVariableNode", "(n.SurveyID, n.VariableName)"),
]

# Add uniqueness constraints
with driver.session() as session:
    for label, key in constraints:
        try:
            session.run(
                f"CREATE CONSTRAINT constraint_for_{label} "
                f"FOR (n:{label}) REQUIRE {key} IS UNIQUE"
            )
        except Exception as e:
            # Re-running the cell is fine: an existing constraint is not an error
            if "already exists" in str(e):
                print(f"Constraint for {label} already exists.")
            else:
                raise

print("Constraints have been created or verified.")




Constraint for SurveyNode already exists.
Constraint for SurveyGroupNode already exists.
Constraint for SurveyVariableNode already exists.
Constraints have been created or verified.
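
The "already exists" messages rely on matching text in the error string, so an independent check can be reassuring. The sketch below is an assumption-laden illustration rather than part of the original notebook: `missing_constraints` is a hypothetical helper, and it relies on the `SHOW CONSTRAINTS` command and its `name` column being available in the Neo4j version in use.

```python
# Hypothetical helper (not from the original notebook) to confirm the constraints exist.

# Constraint names as created in the cell above
EXPECTED_CONSTRAINTS = {
    "constraint_for_SurveyNode",
    "constraint_for_SurveyGroupNode",
    "constraint_for_SurveyVariableNode",
}


def missing_constraints(session, expected=EXPECTED_CONSTRAINTS):
    """Return the expected constraint names that the database does not report."""
    # SHOW CONSTRAINTS yields one row per constraint; "name" is one of its columns
    existing = {record["name"] for record in session.run("SHOW CONSTRAINTS")}
    return set(expected) - existing
```

Used as `missing_constraints(session)` inside a `with driver.session() as session:` block, an empty set would indicate that all three constraints are in place.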
