# Data Dictionary using nl2sql modules

This notebook demonstrates how to use the `nl2sql.knowledge_base.data_dictionary` module to create and work with database data dictionaries in a structured way.


## Settings

In [1]:
import os

if os.getcwd().endswith("notebooks"):
    os.chdir("..")
print(os.getcwd())


/Users/cmcoutosilva/Projects/github/nl2sql-agent


In [2]:
import yaml
from pathlib import Path

import tiktoken

from nl2sql.utils import print_section
from nl2sql.config import load_database_config, load_schema_config
from nl2sql.database.postgresql import PostgreSQLConnector
from nl2sql.knowledge_base.data_dictionary import DataDictionary

## Database Connection

First, let's establish a connection using the PostgreSQL connector.

In [3]:
# Create database connection
database_config = load_database_config()
connector = PostgreSQLConnector(**database_config)

print(f"Connected to database: {connector.params.database}")
print(f"Inspector available: {connector.inspector is not None}")

Connected to database: olist_ecommerce
Inspector available: True


## Creating Data Dictionary

The `DataDictionary` class allows us to create a structured representation of our database schema with all tables, columns, relationships, and metadata.

In [4]:
# Load database schema configuration
database_schema = load_schema_config()

print("Database schema configuration:", "-" * 30, sep="\n")
print(yaml.dump(database_schema, default_flow_style=False))

Database schema configuration:
------------------------------
olist_ecommerce:
  ecommerce:
  - customers
  - geolocation
  - order_items
  - order_payments
  - order_reviews
  - orders
  - product_category_name_translations
  - products
  - sellers
  marketing:
  - closed_deals
  - marketing_qualified_leads
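
The nested mapping printed above (database → schema → list of tables) can be flattened into fully qualified names when every table needs to be visited. This is a minimal sketch, assuming the configuration is a plain dict with the shape shown in the output; `qualified_table_names` is a hypothetical helper, not part of `nl2sql`.

```python
def qualified_table_names(database_schema: dict[str, dict[str, list[str]]]) -> list[str]:
    """Flatten {database: {schema: [tables]}} into 'database.schema.table' strings."""
    return [
        f"{db_name}.{schema_name}.{table_name}"
        for db_name, schemas in database_schema.items()
        for schema_name, tables in schemas.items()
        for table_name in tables
    ]


# Abbreviated copy of the configuration printed above.
example_config = {
    "olist_ecommerce": {
        "ecommerce": ["customers", "orders"],
        "marketing": ["closed_deals", "marketing_qualified_leads"],
    }
}

names = qualified_table_names(example_config)
for name in names:
    print(name)
```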



In [5]:
# Create data dictionary from inspector
data_dictionary = DataDictionary.from_inspector(
    inspector=connector.inspector,
    database_schema=database_schema
)

print("Data dictionary created successfully!")
print(f"Databases: {list(data_dictionary.databases.keys())}")
for db_name, db_info in data_dictionary.databases.items():
    print(f"  {db_name}: {list(db_info.schemas.keys())}")
    for schema_name, schema_info in db_info.schemas.items():
        print(f"    {schema_name}: {len(schema_info.tables)} tables")

Data dictionary created successfully!
Databases: ['olist_ecommerce']
  olist_ecommerce: ['ecommerce', 'marketing']
    ecommerce: 9 tables
    marketing: 2 tables
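
The hierarchy that the output above walks through (databases, schemas, tables) can be pictured as a few nested containers. The sketch below uses standard-library dataclasses purely for illustration. The real module appears to be built on Pydantic models, and the class names `ColumnSketch`, `TableSketch`, `SchemaSketch`, and `DatabaseSketch` are hypothetical. Only the attribute names (`name`, `schema_name`, `description`, `primary_keys`, `columns`, `tables`, `schemas`) are taken from this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnSketch:
    name: str
    type: str
    description: str = ""


@dataclass
class TableSketch:
    name: str
    schema_name: str
    description: str = ""
    primary_keys: list[str] = field(default_factory=list)
    columns: list[ColumnSketch] = field(default_factory=list)


@dataclass
class SchemaSketch:
    name: str
    tables: dict[str, TableSketch] = field(default_factory=dict)


@dataclass
class DatabaseSketch:
    name: str
    schemas: dict[str, SchemaSketch] = field(default_factory=dict)


orders = TableSketch(
    name="orders",
    schema_name="ecommerce",
    primary_keys=["order_id"],
    columns=[ColumnSketch("order_id", "TEXT", "unique identifier of the order.")],
)
database = DatabaseSketch(
    name="olist_ecommerce",
    schemas={"ecommerce": SchemaSketch("ecommerce", {"orders": orders})},
)

# Navigation uses the same chained lookups as the notebook cells.
table = database.schemas["ecommerce"].tables["orders"]
print(table.name, table.primary_keys, len(table.columns))
```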


In [6]:
print(data_dictionary.model_dump_json(indent=2))

{
  "databases": {
    "olist_ecommerce": {
      "name": "olist_ecommerce",
      "schemas": {
        "ecommerce": {
          "name": "ecommerce",
          "tables": {
            "customers": {
              "name": "customers",
              "schema_name": "ecommerce",
              "description": "This dataset has information about the customer and its location. Use it to identify unique customers in the orders dataset and to find the orders delivery location. At our system each order is assigned to a unique customer_id. This means that the same customer will get different ids for different orders. The purpose of having a customer_unique_id on the dataset is to allow you to identify customers that made repurchases at the store. Otherwise you would find that each order had a different customer associated with.",
              "primary_keys": [
                "customer_id"
              ],
              "foreign_keys": [],
              "columns": [
                {
            

## Exploring Table Information

Let's explore the structured information available for a specific table.

In [7]:
# Get information about the orders table
orders_table = data_dictionary.databases["olist_ecommerce"].schemas["ecommerce"].tables["orders"]

print("=== ORDERS TABLE INFORMATION ===")
print(f"Table name: {orders_table.name}")
print(f"Schema: {orders_table.schema_name}")
print(f"Description: {orders_table.description}")
print(f"Primary keys: {orders_table.primary_keys}")
print(f"Number of columns: {len(orders_table.columns)}")
print(f"Foreign keys: {len(orders_table.foreign_keys)}")

print("\n=== COLUMNS ===")
for col in orders_table.columns:
    fk_info = f" [FK: {col.foreign_keys[0]['referred_table']}]" if col.foreign_keys else ""
    pk_info = " [PK]" if col.is_primary_key else ""
    print(f"  {col.name} ({col.type}){pk_info}{fk_info}: {col.description}")


=== ORDERS TABLE INFORMATION ===
Table name: orders
Schema: ecommerce
Description: This is the core dataset. From each order you might find all other information.
Primary keys: ['order_id']
Number of columns: 8
Foreign keys: 1

=== COLUMNS ===
  order_id (TEXT) [PK]: unique identifier of the order.
  customer_id (TEXT) [FK: customers]: key to the customer dataset. Each order has a unique customer_id.
  order_status (TEXT): Reference to the order status (delivered, shipped, etc).
  order_purchase_timestamp (TIMESTAMP): Shows the purchase timestamp.
  order_approved_at (TIMESTAMP): Shows the payment approval timestamp.
  order_delivered_carrier_date (TIMESTAMP): Shows the order posting timestamp. When it was handled to the logistic partner.
  order_delivered_customer_date (TIMESTAMP): Shows the actual order delivery date to the customer.
  order_estimated_delivery_date (TIMESTAMP): Shows the estimated delivery date that was informed to customer at the purchase moment.


## Formatted Context Output

Each component (table, schema, or the whole data dictionary) can render itself as plain text via `format_context()`, suitable for documentation or for inclusion in an LLM prompt.

In [8]:
# Show formatted context for a specific table
print("=== FORMATTED CONTEXT FOR ORDERS TABLE ===")
print(orders_table.format_context())

=== FORMATTED CONTEXT FOR ORDERS TABLE ===
TABLE: orders
DESCRIPTION: This is the core dataset. From each order you might find all other information.
PRIMARY KEYS: order_id
FOREIGN KEYS:
  - customer_id -> ecommerce.customers.customer_id
COLUMNS:
  - order_id (TEXT, NOT NULL): unique identifier of the order.
  - customer_id (TEXT, NOT NULL): key to the customer dataset. Each order has a unique customer_id.
  - order_status (TEXT, NULL): Reference to the order status (delivered, shipped, etc).
  - order_purchase_timestamp (TIMESTAMP, NULL): Shows the purchase timestamp.
  - order_approved_at (TIMESTAMP, NULL): Shows the payment approval timestamp.
  - order_delivered_carrier_date (TIMESTAMP, NULL): Shows the order posting timestamp. When it was handled to the logistic partner.
  - order_delivered_customer_date (TIMESTAMP, NULL): Shows the actual order delivery date to the customer.
  - order_estimated_delivery_date (TIMESTAMP, NULL): Shows the estimated delivery date that was informed t

In [9]:
# Show formatted context for the entire ecommerce schema (truncated for readability)
ecommerce_schema = data_dictionary.databases["olist_ecommerce"].schemas["ecommerce"]
schema_context = ecommerce_schema.format_context()

print("=== FIRST 2000 CHARACTERS OF ECOMMERCE SCHEMA CONTEXT ===")
print(schema_context[:2000])
print("...")
print(f"\nTotal context length: {len(schema_context)} characters")

=== FIRST 2000 CHARACTERS OF ECOMMERCE SCHEMA CONTEXT ===
SCHEMA: ecommerce

TABLE: customers
DESCRIPTION: This dataset has information about the customer and its location. Use it to identify unique customers in the orders dataset and to find the orders delivery location. At our system each order is assigned to a unique customer_id. This means that the same customer will get different ids for different orders. The purpose of having a customer_unique_id on the dataset is to allow you to identify customers that made repurchases at the store. Otherwise you would find that each order had a different customer associated with.
PRIMARY KEYS: customer_id
COLUMNS:
  - customer_id (TEXT, NOT NULL): key to the orders dataset. Each order has a unique customer_id.
  - customer_unique_id (TEXT, NOT NULL): unique identifier of a customer.
  - customer_zip_code_prefix (TEXT, NULL): first five digits of customer zip code
  - customer_city (TEXT, NULL): customer city name
  - customer_state (TEXT, NULL)
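
To make the layout above concrete, here is a rough sketch of how a table could be rendered into that text format. It is not the module's implementation: `render_table_context` is a hypothetical function, the input is a plain dict rather than the module's table object, and the `nullable` key is an assumption introduced to produce the `NULL` / `NOT NULL` marker.

```python
def render_table_context(table: dict) -> str:
    """Render a table description in a layout similar to the output above."""
    lines = [
        f"TABLE: {table['name']}",
        f"DESCRIPTION: {table['description']}",
        f"PRIMARY KEYS: {', '.join(table['primary_keys'])}",
        "COLUMNS:",
    ]
    for column in table["columns"]:
        # `nullable` is an assumed key, used only for this illustration.
        nullability = "NULL" if column["nullable"] else "NOT NULL"
        lines.append(
            f"  - {column['name']} ({column['type']}, {nullability}): {column['description']}"
        )
    return "\n".join(lines)


orders = {
    "name": "orders",
    "description": "This is the core dataset.",
    "primary_keys": ["order_id"],
    "columns": [
        {"name": "order_id", "type": "TEXT", "nullable": False,
         "description": "unique identifier of the order."},
        {"name": "order_status", "type": "TEXT", "nullable": True,
         "description": "Reference to the order status (delivered, shipped, etc)."},
    ],
}

context = render_table_context(orders)
print(context)
```

A compact, line-oriented layout like this keeps the token count low while staying easy for both people and language models to scan.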

## Analysis Functions

The data dictionary provides several analysis functions to identify missing metadata.

In [10]:
# Find tables with missing descriptions
missing_table_descriptions = data_dictionary.get_tables_with_missing_descriptions()
print("=== TABLES WITH MISSING DESCRIPTIONS ===")
for db_name, schemas in missing_table_descriptions.items():
    for schema_name, tables in schemas.items():
        print(f"{db_name}.{schema_name}: {tables}")

# Find columns with missing descriptions
missing_column_descriptions = data_dictionary.get_columns_with_missing_descriptions()
print("\n=== COLUMNS WITH MISSING DESCRIPTIONS ===")
for db_name, schemas in missing_column_descriptions.items():
    for schema_name, tables in schemas.items():
        for table_name, columns in tables.items():
            print(f"{db_name}.{schema_name}.{table_name}: {len(columns)} columns missing descriptions")
            print(f"  Missing: {columns[:5]}{'...' if len(columns) > 5 else ''}")  # Show first 5

=== TABLES WITH MISSING DESCRIPTIONS ===

=== COLUMNS WITH MISSING DESCRIPTIONS ===
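
Both lists above are empty, which indicates that every table and column in this database already has a description. To show what such a check does when gaps exist, the sketch below scans plain dicts for empty or absent descriptions. `find_missing_descriptions` is a hypothetical stand-in for the module's methods, and the missing descriptions in the example data are invented for the illustration.

```python
def find_missing_descriptions(tables: dict[str, dict]) -> dict[str, list[str]]:
    """Map each table name to the columns whose description is empty or absent."""
    missing: dict[str, list[str]] = {}
    for table_name, table in tables.items():
        undocumented = [
            column["name"]
            for column in table["columns"]
            if not (column.get("description") or "").strip()
        ]
        if undocumented:  # Tables without gaps are left out of the report.
            missing[table_name] = undocumented
    return missing


# Example data with deliberately blanked descriptions (not the real database state).
example_tables = {
    "orders": {"columns": [
        {"name": "order_id", "description": "unique identifier of the order."},
        {"name": "order_status", "description": ""},
    ]},
    "customers": {"columns": [
        {"name": "customer_id", "description": "key to the orders dataset."},
        {"name": "customer_city", "description": None},
        {"name": "customer_state"},
    ]},
}

missing = find_missing_descriptions(example_tables)
for table_name, columns in missing.items():
    print(f"{table_name}: {len(columns)} columns missing descriptions: {columns}")
```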


In [11]:
# Estimate the cost of including the full data dictionary in every prompt.
# Prices below are expressed in USD per 1M tokens and cover only the dictionary's tokens.

encoding = tiktoken.get_encoding("o200k_base")
num_tokens = len(encoding.encode(data_dictionary.format_context()))

gpt_4_1_price = 2 / 1E6 * num_tokens
gpt_4_1_mini_price = 0.4 / 1E6 * num_tokens

print_section("Price per query:")
print(f"GPT-4.1 price: {gpt_4_1_price:.4f} USD")
print(f"GPT-4.1-mini price: {gpt_4_1_mini_price:.4f} USD\n")

print_section("Price per 1K queries:")
print(f"GPT-4.1 price per 1K queries: {gpt_4_1_price * 1E3:.4f} USD")
print(f"GPT-4.1-mini price per 1K queries: {gpt_4_1_mini_price * 1E3:.4f} USD")

Price per query:
GPT-4.1 price: 0.0040 USD
GPT-4.1-mini price: 0.0008 USD

Price per 1K queries:
GPT-4.1 price per 1K queries: 4.0100 USD
GPT-4.1-mini price per 1K queries: 0.8020 USD
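
The printed prices let us back out the token count: 4.0100 USD per 1K queries at 2 USD per 1M tokens corresponds to about 2,005 tokens per query. The sketch below restates that arithmetic as a small helper. `prompt_cost_usd` is a hypothetical name, the prices are the values hard-coded in the cell above (they may change over time), and the estimate covers only the data dictionary, not the rest of the prompt or the model's output.

```python
def prompt_cost_usd(num_tokens: int, price_per_million_usd: float, queries: int = 1) -> float:
    """Cost of sending `num_tokens` tokens with each of `queries` requests."""
    return num_tokens * price_per_million_usd / 1_000_000 * queries


# Token count implied by the output above:
# 4.0100 USD per 1K queries at 2 USD per 1M tokens.
implied_tokens = round(4.0100 / 1_000 / 2 * 1_000_000)

per_query = prompt_cost_usd(implied_tokens, price_per_million_usd=2.0)
per_thousand = prompt_cost_usd(implied_tokens, price_per_million_usd=2.0, queries=1_000)
mini_per_thousand = prompt_cost_usd(implied_tokens, price_per_million_usd=0.4, queries=1_000)

print(f"Implied tokens per query: {implied_tokens}")
print(f"GPT-4.1: {per_query:.4f} USD per query, {per_thousand:.4f} USD per 1K queries")
print(f"GPT-4.1-mini: {mini_per_thousand:.4f} USD per 1K queries")
```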


## Saving and Loading Data Dictionary

The data dictionary can be saved to and loaded from YAML files for persistence.

In [12]:
# Save data dictionary to YAML file
output_path = Path("knowledge/data_dictionary.yml")
saved_path = data_dictionary.save(output_path)
print(f"Data dictionary saved to: {saved_path}")

# Check file size
file_size = saved_path.stat().st_size
print(f"File size: {file_size:,} bytes")

Data dictionary saved to: knowledge/data_dictionary.yml
File size: 24,998 bytes


In [13]:
# Load data dictionary from YAML file
loaded_data_dictionary = DataDictionary.load(saved_path)
print("Data dictionary loaded successfully!")
print(f"Loaded databases: {list(loaded_data_dictionary.databases.keys())}")

# Verify the loaded data matches the original
original_tables = set(data_dictionary.databases["olist_ecommerce"].schemas["ecommerce"].tables.keys())
loaded_tables = set(loaded_data_dictionary.databases["olist_ecommerce"].schemas["ecommerce"].tables.keys())
print(f"Original tables: {len(original_tables)}")
print(f"Loaded tables: {len(loaded_tables)}")
print(f"Tables match: {original_tables == loaded_tables}")

Data dictionary loaded successfully!
Loaded databases: ['olist_ecommerce']
Original tables: 9
Loaded tables: 9
Tables match: True
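
The check above compares only the table names of one schema. A stricter round-trip test compares the full serialized content. The sketch below shows the idea with the standard library's `json` module and a plain dict standing in for the data dictionary; the file format and the `round_trip` helper are assumptions for illustration. With the real objects, a comparable check might compare `data_dictionary.model_dump()` with `loaded_data_dictionary.model_dump()`, assuming both are Pydantic models as the use of `model_dump_json` suggests.

```python
import json
import tempfile
from pathlib import Path


def round_trip(payload: dict, path: Path) -> dict:
    """Write `payload` to `path` as JSON and read it back."""
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return json.loads(path.read_text(encoding="utf-8"))


# Small stand-in for the serialized data dictionary.
original = {
    "databases": {
        "olist_ecommerce": {
            "name": "olist_ecommerce",
            "schemas": {
                "ecommerce": {
                    "name": "ecommerce",
                    "tables": {
                        "orders": {"name": "orders", "primary_keys": ["order_id"]},
                    },
                }
            },
        }
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    restored = round_trip(original, Path(tmp_dir) / "data_dictionary.json")

# Deep equality covers every nested field, not just the table names.
print(f"Round trip preserved content: {restored == original}")
```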


## Working with Individual Components

The data dictionary provides access to individual components (databases, schemas, tables, columns) with their own methods.

In [14]:
# Work with individual table
order_items_table = data_dictionary.databases["olist_ecommerce"].schemas["ecommerce"].tables["order_items"]

print("=== ORDER ITEMS TABLE ANALYSIS ===")
print(f"Table: {order_items_table.name}")
print(f"Primary keys: {order_items_table.primary_keys}")
print(f"Foreign keys: {len(order_items_table.foreign_keys)}")

# Show foreign key relationships
print("\n=== FOREIGN KEY RELATIONSHIPS ===")
for fk in order_items_table.foreign_keys:
    print(f"  {fk['constrained_columns']} -> {fk['referred_schema']}.{fk['referred_table']}.{fk['referred_columns']}")

# Show columns with missing descriptions
missing_cols = order_items_table.get_columns_with_missing_descriptions()
print(f"\nColumns with missing descriptions: {missing_cols}")

=== ORDER ITEMS TABLE ANALYSIS ===
Table: order_items
Primary keys: ['order_id', 'order_item_id']
Foreign keys: 3

=== FOREIGN KEY RELATIONSHIPS ===
  ['order_id'] -> ecommerce.orders.['order_id']
  ['product_id'] -> ecommerce.products.['product_id']
  ['seller_id'] -> ecommerce.sellers.['seller_id']

Columns with missing descriptions: []
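
The relationship lines above show Python list syntax (`['order_id']`) because each foreign key entry stores its column names as lists. Here is a minimal sketch of a cleaner rendering, assuming each entry is a dict with the four keys used in the cell above; `format_foreign_key` is a hypothetical helper, not part of the module.

```python
def format_foreign_key(fk: dict) -> str:
    """Render a foreign key as 'col -> schema.table.col'.

    Composite keys are joined with commas on both sides.
    """
    constrained = ", ".join(fk["constrained_columns"])
    referred = ", ".join(fk["referred_columns"])
    return f"{constrained} -> {fk['referred_schema']}.{fk['referred_table']}.{referred}"


# Entry shaped like the first relationship printed above.
example_fk = {
    "constrained_columns": ["order_id"],
    "referred_schema": "ecommerce",
    "referred_table": "orders",
    "referred_columns": ["order_id"],
}

formatted = format_foreign_key(example_fk)
print(formatted)
```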


## Summary

This notebook demonstrated the key features of the `nl2sql.knowledge_base.data_dictionary` module:

### Key Features:
1. **Structured Data Representation**: Organized hierarchy of databases → schemas → tables → columns
2. **Metadata Extraction**: Automatic extraction of table descriptions, column types, constraints, and relationships
3. **Primary/Foreign Key Detection**: Automatic identification and mapping of table relationships
4. **Context Formatting**: Human-readable context output for documentation and AI systems
5. **Analysis Functions**: Built-in methods to identify missing metadata and documentation gaps
6. **Persistence**: Save/load functionality for data dictionary persistence
7. **Pydantic Models**: Type-safe data structures with validation

### Use Cases:
- **Database Documentation**: Generate comprehensive database documentation
- **Schema Analysis**: Identify missing metadata and documentation gaps
- **AI Context**: Provide structured context for natural language to SQL systems
- **Data Governance**: Track and maintain database schema information
- **Migration Planning**: Understand database structure and relationships

Together, these features make the data dictionary a single, file-backed source of schema metadata that can feed both documentation and natural-language-to-SQL prompts.