# Hugging Face Demo 
Code from overview document.

## Set up a special HF environment 
1. install required libraries (M1 has special 
2. install ipykernel
3. add to jupyter `python -m ipykernel install --user --name=hf_env --display-name "Python (hf_env)"`
4. restart jupyter

## Packages (in order)
1. torch
1. torchvision 
1. torchaudio
1. transformers
1. tf-keras
1. pandas
1. numpy
1. matplotlib

## M1/M2/M3 Macs ... 
- You have to use the CPU (not GPU) no CUDA support.
- torch, torchvision, torchaudio are different
- `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu`

---
## Simple Test Script
**Output**: `{'sequence': 'Climate change is a significant challenge for the planet.', 'labels': ['environment', 'policy', 'economics'], 'scores': [0.9726617336273193, 0.012952088378369808, 0.00014581126742996275]}`

In [2]:
from transformers import pipeline

# Load a zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Test text
text = "Climate change is a significant challenge for the planet."
labels = ["environment", "policy", "economics"]

# Perform zero-shot classification
result = classifier(text, candidate_labels=labels, multi_label=True)
print(result)


Device set to use mps:0


{'sequence': 'Climate change is a significant challenge for the planet.', 'labels': ['environment', 'policy', 'economics'], 'scores': [0.9726617336273193, 0.012952088378369808, 0.00014581126742996275]}


## Load a pre-trained model for zero-shot classification

In [3]:
from transformers import pipeline

# Load a pre-trained model for zero-shot classification
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Example text
text = """
Climate change is impacting agriculture and biodiversity worldwide. 
Governments are implementing policies to reduce carbon emissions and promote renewable energy.
"""

# Define candidate labels
labels = ["environment", "policy", "science", "technology", "health"]

# Perform classification
results = classifier(text, candidate_labels=labels, multi_label=True)

# Print results
print("Concepts extracted:")
for label, score in zip(results["labels"], results["scores"]):
    print(f"{label}: {score:.2f}")


Device set to use mps:0


Concepts extracted:
environment: 0.71
policy: 0.64
science: 0.11
technology: 0.00
health: 0.00


## Census Example using Public Metadata
**NOTE**: This creates sample data as provided in the overview paper for an 'all in one' example here :)

In [4]:
import pandas as pd
from transformers import pipeline

# Sample data based on the provided example
data = {
    "SurveyID": [
        "POPESTprmagesex2014",
        "POPESTPROJagegroups2014",
        "POPESTPROJbirths2014",
        "POPESTPROJdeaths2014",
        "POPESTPROJnat2014",
    ],
    "Metadata": [
        "Annual Estimates of the Resident Population by Five-Year Age Groups and Sex for the Municipios of Puerto Rico. Source: U.S. Census Bureau, Population Division. Note: The estimates are based on the 2010 Census and reflect changes to the April 1, 2010 population due to the Count Question Resolution program and geographic program revisions.",
        "Projected Population by Age Groups, Sex, Race, and Hispanic Origin for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division. Note: 'In combination' means in combination with one or more other races. The sum of the five race-in-combination groups adds to more than the total population because individuals may report more than one race.",
        "Projected Births by Sex, Race, and Hispanic Origin for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division. Note: Hispanic origin is considered an ethnicity, not a race. Hispanics may be of any race. All projected births are considered native born.",
        "Projected Deaths by Single Year of Age, Sex, Race, and Hispanic Origin for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division. Note: Hispanic origin is considered an ethnicity, not a race. Hispanics may be of any race.",
        "Projected Population by Single Year of Age, Sex, Race, Hispanic Origin, and Nativity for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division. Note: Hispanic origin is considered an ethnicity, not a race. Hispanics may be of any race.",
    ],
}

# Convert the sample data into a DataFrame
df = pd.DataFrame(data)

# Initialize the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Define candidate labels
categories = [
    "population estimates",
    "migration",
    "housing data",
    "economic data",
    "birth data",
    "death data",
    "projections",
    "geographic information",
    "race and ethnicity",
]

# Function to classify metadata
def classify_metadata(text):
    result = classifier(text, candidate_labels=categories, multi_label=True)
    return {label: score for label, score in zip(result["labels"], result["scores"])}

# Apply classification to the Metadata column
df["Extracted Concepts"] = df["Metadata"].apply(classify_metadata)

# Save to CSV
output_path = "processed_census_metadata.csv"
df.to_csv(output_path, index=False)

print(f"Processed metadata saved to {output_path}")
# Display the first few rows
print(df.head())


Device set to use mps:0


Processed metadata saved to processed_census_metadata.csv
                  SurveyID                                           Metadata  \
0      POPESTprmagesex2014  Annual Estimates of the Resident Population by...   
1  POPESTPROJagegroups2014  Projected Population by Age Groups, Sex, Race,...   
2     POPESTPROJbirths2014  Projected Births by Sex, Race, and Hispanic Or...   
3     POPESTPROJdeaths2014  Projected Deaths by Single Year of Age, Sex, R...   
4        POPESTPROJnat2014  Projected Population by Single Year of Age, Se...   

                                  Extracted Concepts  
0  {'population estimates': 0.9928871989250183, '...  
1  {'projections': 0.9980393648147583, 'populatio...  
2  {'birth data': 0.6991598606109619, 'projection...  
3  {'projections': 0.996796190738678, 'death data...  
4  {'projections': 0.9980165958404541, 'populatio...  


# Super Bonus! Pipeline Implementation with Subcategories
### **Pipeline Implementation with Subcategories**

Our **Pipeline Implementation with Subcategories** enriches Census metadata by leveraging **Hugging Face models** to extract both high-level categories and granular subcategories (concepts) from metadata descriptions. The steps are as follows:

1. **Classify High-Level Categories**:
   - Using **Zero-Shot Classification** (e.g., `facebook/bart-large-mnli`), the pipeline assigns each metadata description a broad category such as `Demographics`, `Economics`, `Housing`, etc.
   - This provides an overarching organizational theme for each variable.

2. **Extract Subcategories (Concepts)**:
   - Using **Named Entity Recognition (NER)** (e.g., `dslim/bert-base-NER`), the pipeline identifies detailed terms or concepts embedded in the metadata descriptions.
   - These concepts include granular details such as "Sex," "Industry," or "Median Earnings," which serve as key nodes in the knowledge graph.

3. **Enrich Metadata and Create Relationships**:
   - The enriched dataset includes both the assigned category and extracted concepts for each metadata description.
   - Relationships are built for the knowledge graph:
     - `BELONGS_TO`: Links metadata descriptions to their high-level category.
     - `HAS_CONCEPT`: Links metadata descriptions to specific concepts extracted via NER.

4. **Prepare Data for Knowledge Graph Integration**:
   - Outputs an enriched DataFrame and a relationships DataFrame for direct import into graph databases like Neo4j.

This approach provides a structured and scalable way to organize metadata, enabling meaningful relationships and efficient querying in the knowledge graph.

---

### **Why Named Entity Recognition (NER) Is the Best Choice**

#### **What Is NER?**
Named Entity Recognition (NER) is a technique used in Natural Language Processing (NLP) to identify and extract key terms or entities from text. Entities can include people, organizations, locations, or, in our case, **concepts and subcategories** from metadata descriptions.

#### **Why Is NER Effective?**
- **Pre-Trained Knowledge**: NER models, such as `dslim/bert-base-NER`, are trained on large corpora to identify entities with high accuracy, even without fine-tuning.
- **Out-of-the-Box Performance**: NER works immediately without requiring extensive setup or labeled data.
- **Granularity**: Extracts detailed, contextually relevant terms (e.g., "Sex," "Industry") that serve as concepts for the knowledge graph.
- **Scalability**: Handles large datasets efficiently, adapting to various text lengths and structures.

#### **Other Possible Choices**
1. **Rule-Based Extraction**:
   - Relies on predefined patterns or regular expressions to extract key terms.
   - **Strengths**: High precision for structured or repetitive text.
   - **Weaknesses**: Fragile, hard to scale, and misses subtle variations in text.

2. **Clustering or Topic Modeling**:
   - Groups metadata descriptions into clusters or topics based on semantic similarity.
   - **Strengths**: Useful for finding hidden patterns or relationships.
   - **Weaknesses**: Difficult to interpret; doesn’t extract specific entities.

3. **Fine-Tuning a Transformer Model**:
   - Adapts a model like `bert-base-uncased` to your dataset by training it on labeled metadata.
   - **Strengths**: Custom-fit for your domain.
   - **Weaknesses**: Requires significant computational resources and labeled data.

#### **Decision Criteria and Analysis**
| **Criteria**         | **NER**                     | **Rule-Based**             | **Clustering/Topics**      | **Fine-Tuning**            |
|-----------------------|-----------------------------|----------------------------|----------------------------|----------------------------|
| Setup Effort          | Low                        | Medium                     | Medium                     | High                       |
| Scalability           | High                       | Low                        | Medium                     | Medium                     |
| Granularity           | High                       | Medium                     | Low                        | High                       |
| Flexibility           | High                       | Low                        | High                       | High                       |
| Computational Cost    | Low                        | Low                        | Medium                     | High                       |

NER strikes the best balance of **ease of use**, **accuracy**, and **scalability**. While fine-tuning offers more customization and clustering finds broader themes, NER’s ability to extract precise and meaningful concepts makes it the most effective and efficient choice for enriching Census metadata.

--- 

In [3]:
import pandas as pd
from transformers import pipeline

# Load Hugging Face pipelines
zero_shot_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
ner_pipeline = pipeline("ner", model="dslim/bert-base-NER")

# Predefined categories (high-level)
categories = [
    "Demographics", "Economics", "Housing", "Education", "Employment",
    "Health", "Geography", "Population Density", "Social Characteristics", "Environment"
]

# Assume you have a DataFrame with a column called 'metadata'
metadata_df = pd.DataFrame({
    "metadata": [
        "Annual Estimates of the Resident Population by Five-Year Age Groups and Sex for the Municipios of Puerto Rico. Source: U.S. Census Bureau, Population Division. Note: The estimates are based on the 2010 Census and reflect changes to the April 1, 2010 population due to the Count Question Resolution program and geographic program revisions.",
        "Projected Population by Age Groups, Sex, Race, and Hispanic Origin for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division. Note: 'In combination' means in combination with one or more other races. The sum of the five race-in-combination groups adds to more than the total population because individuals may report more than one race.",
        "Projected Births by Sex, Race, and Hispanic Origin for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division. Note: Hispanic origin is considered an ethnicity, not a race. Hispanics may be of any race. All projected births are considered native born.",
        "Projected Deaths by Single Year of Age, Sex, Race, and Hispanic Origin for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division. Note: Hispanic origin is considered an ethnicity, not a race. Hispanics may be of any race.",
        "Projected Population by Single Year of Age, Sex, Race, Hispanic Origin, and Nativity for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division. Note: Hispanic origin is considered an ethnicity, not a race. Hispanics may be of any race.",
    ]
})

# Step 1: Classify High-Level Categories
def classify_category(description):
    result = zero_shot_classifier(description, categories)
    return result["labels"][0]  # Top category

metadata_df["Category"] = metadata_df["metadata"].apply(classify_category)

# Step 2: Extract Subcategories or Concepts
def extract_concepts(description):
    entities = ner_pipeline(description)
    return [entity["word"] for entity in entities]

metadata_df["Concepts"] = metadata_df["metadata"].apply(extract_concepts)

# Step 3: Enrich with Relationships (High-Level + Subcategories)
relationships = []
for _, row in metadata_df.iterrows():
    metadata_description = row["metadata"]
    category_node = row["Category"]
    concept_nodes = row["Concepts"]
    
    # Add relationship to high-level category
    relationships.append({
        "source": metadata_description,
        "target": category_node,
        "relationship": "BELONGS_TO"
    })
    
    # Add relationships to concepts (subcategories)
    for concept in concept_nodes:
        relationships.append({
            "source": metadata_description,
            "target": concept,
            "relationship": "HAS_CONCEPT"
        })

# Convert relationships to DataFrame for export
relationships_df = pd.DataFrame(relationships)

# Output enriched metadata and relationships
print("Enriched Metadata DataFrame:")
print(metadata_df)

print("\nRelationships DataFrame:")
print(relationships_df)

# Save enriched metadata and relationships
metadata_df.to_csv("enriched_metadata_with_subcategories.csv", index=False)
relationships_df.to_csv("relationships_with_subcategories.csv", index=False)


Device set to use mps:0
Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


Enriched Metadata DataFrame:
                                            metadata      Category  \
0  Annual Estimates of the Resident Population by...  Demographics   
1  Projected Population by Age Groups, Sex, Race,...  Demographics   
2  Projected Births by Sex, Race, and Hispanic Or...  Demographics   
3  Projected Deaths by Single Year of Age, Sex, R...  Demographics   
4  Projected Population by Single Year of Age, Se...  Demographics   

                                            Concepts  
0  [Puerto, Rico, U, ., S, ., Census, Bureau, Div...  
1  [Hispanic, United, States, U, ., S, ., Census,...  
2  [Hispanic, United, States, U, ., S, ., Census,...  
3  [Hispanic, United, States, U, ., S, ., Census,...  
4  [Hispanic, United, States, U, ., S, ., Census,...  

Relationships DataFrame:
                                               source        target  \
0   Annual Estimates of the Resident Population by...  Demographics   
1   Annual Estimates of the Resident Population by..

# **Issue with NER**
Named Entity Recognition (NER) struggled to effectively extract meaningful concepts from Census metadata due to:
- **Misinterpretation**: Fragmentation of terms (e.g., splitting "Hispanic Origin" into "Hispanic" and "Origin").
- **Lack of Domain Knowledge**: General-purpose NER models are not trained on Census-specific terminology, leading to poor recognition of key phrases.
- **Noisy Outputs**: Punctuation and irrelevant terms (e.g., "U", ".", "S") were extracted as entities, resulting in incorrect relationships in the knowledge graph.

---

## **What Would Be Needed to Fix NER**
To make NER work, we would need to:
1. **Preprocess Descriptions**: Remove punctuation, stopwords, and irrelevant terms before running NER.
2. **Postprocess Results**: Filter out noisy outputs and refine extracted entities to better match domain-specific concepts.
3. **Add Custom Rules**: Implement regex or predefined patterns to capture phrases missed by NER.
4. **Fine-Tune NER Models**: Train an NER model on labeled Census metadata to improve accuracy for domain-specific terms.

While these steps could work, they would require significant development, testing, and ongoing refinement for scalability.

---

## **Using ChatGPT for Concept Extraction**
ChatGPT addresses these challenges by:
- **Understanding Context**: ChatGPT’s pre-trained knowledge includes Census-related terminology, avoiding fragmentation or misinterpretation of key phrases.
- **Dynamic Flexibility**: Allows for extracting both high-level categories (e.g., "Demographics") and granular concepts (e.g., "Hispanic Origin") in a single prompt.
- **Simplified Pipeline**: Eliminates the need for extensive preprocessing, postprocessing, or fine-tuning, providing accurate outputs directly.
- **Scalable and Adaptable**: Easily processes large datasets and adapts to new descriptions with minimal adjustments to prompts.

This approach reduces complexity and ensures consistent, domain-specific concept extraction with high precision.

In [12]:
import os
import openai
import pandas as pd
from dotenv import load_dotenv
import re

# Load environment variables from .env file
load_dotenv("../.env")

# Initialize OpenAI client
client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Load your metadata descriptions into a DataFrame
metadata_df = pd.DataFrame({
    "metadata": [
        "Annual Estimates of the Resident Population by Five-Year Age Groups and Sex for the Municipios of Puerto Rico. Source: U.S. Census Bureau, Population Division. Note: The estimates are based on the 2010 Census and reflect changes to the April 1, 2010 population due to the Count Question Resolution program and geographic program revisions.",
        "Projected Population by Age Groups, Sex, Race, and Hispanic Origin for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division. Note: 'In combination' means in combination with one or more other races. The sum of the five race-in-combination groups adds to more than the total population because individuals may report more than one race.",
        "Projected Births by Sex, Race, and Hispanic Origin for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division. Note: Hispanic origin is considered an ethnicity, not a race. Hispanics may be of any race. All projected births are considered native born.",
        "Projected Deaths by Single Year of Age, Sex, Race, and Hispanic Origin for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division. Note: Hispanic origin is considered an ethnicity, not a race. Hispanics may be of any race.",
        "Projected Population by Single Year of Age, Sex, Race, Hispanic Origin, and Nativity for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division. Note: Hispanic origin is considered an ethnicity, not a race. Hispanics may be of any race.",
    ]
})

import os
import re
import openai
import pandas as pd
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv("../.env")

# Initialize OpenAI client
client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Load your metadata descriptions into a DataFrame
metadata_df = pd.DataFrame({
    "metadata": [
        "Annual Estimates of the Resident Population by Five-Year Age Groups and Sex for the Municipios of Puerto Rico. Source: U.S. Census Bureau, Population Division. Note: The estimates are based on the 2010 Census and reflect changes to the April 1, 2010 population due to the Count Question Resolution program and geographic program revisions.",
        "Projected Population by Age Groups, Sex, Race, and Hispanic Origin for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division. Note: 'In combination' means in combination with one or more other races. The sum of the five race-in-combination groups adds to more than the total population because individuals may report more than one race.",
        "Projected Births by Sex, Race, and Hispanic Origin for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division. Note: Hispanic origin is considered an ethnicity, not a race. Hispanics may be of any race. All projected births are considered native born.",
        "Projected Deaths by Single Year of Age, Sex, Race, and Hispanic Origin for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division. Note: Hispanic origin is considered an ethnicity, not a race. Hispanics may be of any race.",
        "Projected Population by Single Year of Age, Sex, Race, Hispanic Origin, and Nativity for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division. Note: Hispanic origin is considered an ethnicity, not a race. Hispanics may be of any race.",
    ]
})

def extract_metadata_enrichment(description):
    try:
        prompt = f"""
        Extract the following from the metadata description in a structured format:
        1. High-Level Category: The overarching category this description belongs to (e.g., Demographics, Economics, Housing).
        2. Concepts: List key terms or phrases that describe the metadata in brackets, separated by commas (e.g., [Population, Hispanic Origin, Age Groups])
        
        Format your response exactly like this:
        High-Level Category: [category]
        Concepts: [concept1, concept2, concept3]
        
        Metadata: "{description}"
        """
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are an assistant that extracts metadata enrichment in a structured format."},
                {"role": "user", "content": prompt}
            ]
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error processing metadata: {description}\nError: {e}")
        return None

# Apply the API call to each row in the DataFrame
metadata_df["Enrichment"] = metadata_df["metadata"].apply(extract_metadata_enrichment)

def parse_enrichment(enrichment):
    if enrichment is None:
        return pd.Series([None, None])
    try:
        # Extract Category and Concepts using regex
        category_match = re.search(r'High-Level Category:\s*(.+?)(?=\n|$)', enrichment)
        concepts_match = re.search(r'Concepts:\s*\[(.*?)\]', enrichment)
        
        if category_match and concepts_match:
            category = category_match.group(1).strip()
            # Split concepts and clean up each concept
            concepts = [concept.strip() for concept in concepts_match.group(1).split(',')]
            return pd.Series([category, concepts])
        return pd.Series([None, None])
    except Exception as e:
        print(f"Error parsing enrichment: {e}")
        return pd.Series([None, None])

# Parse and add extracted data
metadata_df[["Category", "Concepts"]] = metadata_df["Enrichment"].apply(parse_enrichment)

# Generate relationships for the knowledge graph
relationships = []

for idx, row in metadata_df.iterrows():
    metadata_description = row["metadata"]
    category_node = row["Category"]
    concept_nodes = row["Concepts"]
    
    # Create a shorter identifier for the metadata
    metadata_id = f"Metadata_{idx + 1}"
    
    # Add relationship to the category
    if category_node and isinstance(category_node, str):
        relationships.append({
            "source": metadata_id,
            "target": category_node,
            "relationship": "BELONGS_TO",
            "source_type": "Metadata",
            "target_type": "Category"
        })
    
    # Add relationships to concepts
    if concept_nodes and isinstance(concept_nodes, list):
        for concept in concept_nodes:
            if concept:  # Check if concept is not empty
                relationships.append({
                    "source": metadata_id,
                    "target": concept.strip(),
                    "relationship": "HAS_CONCEPT",
                    "source_type": "Metadata",
                    "target_type": "Concept"
                })

# Convert relationships to a DataFrame
relationships_df = pd.DataFrame(relationships)

# Create a mapping DataFrame to store metadata IDs and their full descriptions
metadata_mapping = pd.DataFrame({
    'metadata_id': [f"Metadata_{i+1}" for i in range(len(metadata_df))],
    'full_metadata': metadata_df['metadata']
})

# Save results
metadata_df.to_csv("enriched_metadata_chatgpt.csv", index=False)
relationships_df.to_csv("relationships_chatgpt.csv", index=False)
metadata_mapping.to_csv("metadata_mapping_chatgpt.csv", index=False)

print("Enriched Metadata:")
print(metadata_df)

print("\nRelationships for Knowledge Graph:")
print(relationships_df)

print("\nMetadata Mapping:")
print(metadata_mapping)


Enriched Metadata:
                                            metadata  \
0  Annual Estimates of the Resident Population by...   
1  Projected Population by Age Groups, Sex, Race,...   
2  Projected Births by Sex, Race, and Hispanic Or...   
3  Projected Deaths by Single Year of Age, Sex, R...   
4  Projected Population by Single Year of Age, Se...   

                                          Enrichment        Category  \
0  High-Level Category: [Demographics]\nConcepts:...  [Demographics]   
1  High-Level Category: [Demographics]\nConcepts:...  [Demographics]   
2  High-Level Category: [Demographics]\nConcepts:...  [Demographics]   
3  High-Level Category: [Demographics]\nConcepts:...  [Demographics]   
4  High-Level Category: [Demographics]\nConcepts:...  [Demographics]   

                                            Concepts  
0  [Annual Estimates, Resident Population, Five-Y...  
1  [Projected Population, Age Groups, Sex, Race, ...  
2  [Projected Births, Sex, Race, Hispanic Orig

# But we have an Issue: Generic Redundant Information: 

## **Issue Summary**
When processing metadata for the knowledge graph, we encountered the following challenges:
1. **Redundant Information**:
   - Generic terms like "US Census Bureau" added noise without value.
   - Variations of terms (e.g., "Population Division" vs. "Division of Population Studies") caused inconsistencies.
2. **Scaling**:
   - Cleaning and normalizing millions of rows manually was infeasible.
3. **Future Data Sources**:
   - Other agencies might introduce new terms, requiring scalable and adaptable handling.

---

### **Solution**
To address these challenges, we implemented a **dynamic normalization pipeline** using:
1. **Predefined Normalization Dictionary**:
   - Common terms are mapped to their canonical forms (e.g., `"US Census Bureau" → "Census Bureau"`).
   - Ensures consistent terminology from the start.

2. **Dynamic Matching with RapidFuzz**:
   - Automatically detects variations of terms using similarity scoring (`score_cutoff=80`).
   - Matches terms not explicitly listed in the dictionary but similar enough to known values.

3. **Hierarchical Relationships**:
   - Structured the knowledge graph to include clear parent-child relationships:
     - Example: `"Population Division BELONGS_TO Census Bureau"`

4. **Iterative Refinement**:
   - Logs new terms not in the dictionary to `new_terms_log.txt` for review and potential inclusion.
   - Allows the normalization dictionary to grow dynamically over time.

5. **Automation at Scale**:
   - Processes millions of rows programmatically, ensuring scalability and adaptability to new datasets.

---

### **Key Details to Remember**
- **RapidFuzz for Matching**:
  - Faster and more accurate than legacy approaches like `fuzzywuzzy`.
  - Used to match terms dynamically during processing.

- **Normalization Dictionary**:
  - Acts as a single source of truth for term standardization.
  - Periodic updates from `new_terms_log.txt` ensure continuous improvement.

- **Output Organization**:
  - All outputs are saved in a dedicated `/normalized` directory for easier management:
    - `relationships_with_normalization.csv`: Knowledge graph relationships.
    - `enriched_metadata_with_normalization.csv`: Metadata with normalized concepts.
    - `new_terms_log.txt`: New terms for review.

- **Future-Proofing**:
  - Hierarchical graph relationships (e.g., `BELONGS_TO`, `HAS_CONCEPT`) support integration of additional agencies or divisions without restructuring.

---

### **Next Steps for Ongoing Improvement**
1. **Review New Terms Regularly**:
   - Check `new_terms_log.txt` and update the normalization dictionary to improve automation accuracy.

2. **Monitor for Edge Cases**:
   - Analyze logs to identify terms that don’t match well or require additional refinement.

3. **Expand Normalization**:
   - As datasets grow, refine and expand the dictionary to accommodate new agencies, divisions, or data types.

4. **Ensure Query Efficiency**:
   - With the hierarchical structure in place, ensure the graph database supports fast and intuitive querying.

---

This approach balances flexibility, scalability, and maintainability, ensuring the pipeline can handle diverse and expanding datasets without manual intervention. 🚀

In [14]:
import os
import re
import openai
import pandas as pd
from dotenv import load_dotenv
from rapidfuzz import process  # For dynamic dictionary matching

# Load environment variables from .env file
load_dotenv("../.env")

# Initialize OpenAI client
client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Load your metadata descriptions into a DataFrame
metadata_df = pd.DataFrame({
    "metadata": [
        "Annual Estimates of the Resident Population by Five-Year Age Groups and Sex for the Municipios of Puerto Rico. Source: U.S. Census Bureau, Population Division.",
        "Projected Population by Age Groups, Sex, Race, and Hispanic Origin for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division.",
        "Projected Births by Sex, Race, and Hispanic Origin for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division.",
        "Projected Deaths by Single Year of Age, Sex, Race, and Hispanic Origin for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division.",
        "Projected Population by Single Year of Age, Sex, Race, Hispanic Origin, and Nativity for the United States: 2014-2060. Source: U.S. Census Bureau, Population Division."
    ]
})

# Predefined normalization dictionary
normalization_dict = {
    "US Census Bureau": "Census Bureau",
    "Population Division": "Population Division"
}

# Dynamic term logs for review
new_terms_log = []

# Normalize text based on the dictionary
def normalize_text(term, norm_dict):
    # Use RapidFuzz for dynamic matching
    best_match = process.extractOne(term, norm_dict.keys(), score_cutoff=80)
    return norm_dict[best_match[0]] if best_match else term

# Enrichment extraction via OpenAI API
def extract_metadata_enrichment(description):
    try:
        prompt = f"""
        Extract the following from the metadata description in a structured format:
        1. High-Level Category: The overarching category this description belongs to (e.g., Demographics, Economics, Housing).
        2. Concepts: List key terms or phrases that describe the metadata in brackets, separated by commas (e.g., [Population, Hispanic Origin, Age Groups])
        
        Format your response exactly like this:
        High-Level Category: [category]
        Concepts: [concept1, concept2, concept3]
        
        Metadata: "{description}"
        """
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are an assistant that extracts metadata enrichment in a structured format."},
                {"role": "user", "content": prompt}
            ]
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error processing metadata: {description}\nError: {e}")
        return None

# Parse enrichment and normalize concepts
def parse_and_normalize(enrichment):
    if enrichment is None:
        return pd.Series([None, None])
    try:
        # Extract Category and Concepts
        category_match = re.search(r'High-Level Category:\s*(.+?)(?=\n|$)', enrichment)
        concepts_match = re.search(r'Concepts:\s*\[(.*?)\]', enrichment)

        category = category_match.group(1).strip() if category_match else None
        concepts = [concept.strip() for concept in concepts_match.group(1).split(",")] if concepts_match else []

        # Normalize concepts using the dictionary
        normalized_concepts = [normalize_text(concept, normalization_dict) for concept in concepts]

        # Log any new terms not in the dictionary
        for concept in normalized_concepts:
            if concept not in normalization_dict.values() and concept not in new_terms_log:
                new_terms_log.append(concept)

        return pd.Series([category, normalized_concepts])
    except Exception as e:
        print(f"Error parsing enrichment: {e}")
        return pd.Series([None, None])

# Apply API enrichment and normalization
metadata_df["Enrichment"] = metadata_df["metadata"].apply(extract_metadata_enrichment)
metadata_df[["Category", "Concepts"]] = metadata_df["Enrichment"].apply(parse_and_normalize)

# Relationships for the knowledge graph
relationships = []

for idx, row in metadata_df.iterrows():
    metadata_description = row["metadata"]
    category_node = row["Category"]
    concept_nodes = row["Concepts"]
    metadata_id = f"Metadata_{idx + 1}"
    
    # Add BELONGS_TO relationships
    if category_node:
        relationships.append({
            "source": metadata_id,
            "target": category_node,
            "relationship": "BELONGS_TO",
            "source_type": "Metadata",
            "target_type": "Category"
        })

    # Add HAS_CONCEPT relationships
    if concept_nodes:
        for concept in concept_nodes:
            relationships.append({
                "source": metadata_id,
                "target": concept,
                "relationship": "HAS_CONCEPT",
                "source_type": "Metadata",
                "target_type": "Concept"
            })

# Define output directory
output_dir = "./normalized"

# Ensure the output directory exists
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Save relationships to the output directory
relationships_df = pd.DataFrame(relationships)
relationships_df.to_csv(os.path.join(output_dir, "relationships_with_normalization.csv"), index=False)

# Save enriched metadata to the output directory
metadata_df.to_csv(os.path.join(output_dir, "enriched_metadata_with_normalization.csv"), index=False)

# Save new terms log to the output directory
if new_terms_log:
    with open(os.path.join(output_dir, "new_terms_log.txt"), "w") as f:
        for term in new_terms_log:
            f.write(f"{term}\n")

print("Relationships for Knowledge Graph:")
print(relationships_df)

print("\nNew Terms Logged for Review:")
print(new_terms_log)


Relationships for Knowledge Graph:
        source                     target relationship source_type target_type
0   Metadata_1             [Demographics]   BELONGS_TO    Metadata    Category
1   Metadata_1           Annual Estimates  HAS_CONCEPT    Metadata     Concept
2   Metadata_1        Resident Population  HAS_CONCEPT    Metadata     Concept
3   Metadata_1       Five-Year Age Groups  HAS_CONCEPT    Metadata     Concept
4   Metadata_1                        Sex  HAS_CONCEPT    Metadata     Concept
5   Metadata_1  Municipios of Puerto Rico  HAS_CONCEPT    Metadata     Concept
6   Metadata_1              Census Bureau  HAS_CONCEPT    Metadata     Concept
7   Metadata_1        Population Division  HAS_CONCEPT    Metadata     Concept
8   Metadata_2             [Demographics]   BELONGS_TO    Metadata    Category
9   Metadata_2       Projected Population  HAS_CONCEPT    Metadata     Concept
10  Metadata_2                 Age Groups  HAS_CONCEPT    Metadata     Concept
11  Metadata_2   