<a href="https://colab.research.google.com/github/charoo-rumsan/DSPy_research/blob/main/Header_labelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import os
from google.colab import files

print("Please upload your CSV file.")
uploaded = files.upload()

# Assuming a single file is uploaded, get its name
if uploaded:
    uploaded_filename = list(uploaded.keys())[0]
    print(f"File '{uploaded_filename}' uploaded successfully.")
else:
    print("No file uploaded. Please upload a CSV file to continue.")

Please upload your CSV file.


Saving first_100_rows (1) - first_100_rows (1).csv.csv to first_100_rows (1) - first_100_rows (1).csv (1).csv
File 'first_100_rows (1) - first_100_rows (1).csv (1).csv' uploaded successfully.


In [None]:
# Install dspy
!pip install dspy-ai



To use DSPy, you'll need an API key for a language model provider (e.g., OpenAI). If you don't already have one, create a key. In Colab, add the key to the secrets manager under the "🔑" in the left panel. Give it the name `OPENAI_API_KEY`. Then, we'll configure DSPy to use it.
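
A minimal sketch of reading that key, assuming it is stored either in Colab's secrets manager or in an environment variable of the same name. The helper name `load_api_key` is hypothetical; `google.colab.userdata` is only importable inside Colab, so the import is guarded.

```python
import os


def load_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from Colab secrets if available, else from the environment."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata

        key = userdata.get(name)
        if key:
            return key
    except Exception:
        # Not running in Colab, or the secret is missing / not shared with this notebook.
        pass

    key = os.environ.get(name, "")
    if not key:
        raise RuntimeError(f"{name} is not set in Colab secrets or the environment")
    return key
```

Reading the key at run time keeps it out of the notebook source, so the notebook can be shared or committed without exposing it.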

First, let's extract the headers from the uploaded file. The `HeaderExtractor` class below is plain Python and makes no LLM call: it detects the delimiter from a sample of the file and reads only the header row. The DSPy `Signature` and `Module` that standardize those headers are defined further down.

In [None]:
import csv
import os
from pathlib import Path
import polars as pl

class HeaderExtractor:
    def __init__(self):
        self.supported_formats = ['.csv', '.tsv', '.txt']

    def extract_headers_from_file(self, file_path: str):
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"File not found: {file_path}")

        file_ext = Path(file_path).suffix.lower()
        if file_ext not in self.supported_formats:
            raise ValueError(f"Unsupported file format: {file_ext}")

        headers = self._extract_headers(file_path)
        print(f"✅ Extracted {len(headers)} headers from file.")

        return headers

    def _extract_headers(self, file_path: str):
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            sample = f.read(1024)
            f.seek(0)
            delimiter = csv.Sniffer().sniff(sample).delimiter
        df = pl.read_csv(file_path, separator=delimiter, n_rows=0)
        return df.columns
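
The delimiter detection used by `_extract_headers` can be tried in isolation. The sketch below uses only the standard library and works on in-memory text; `sniff_headers` and the sample strings are hypothetical, and the comma fallback is an assumption rather than part of the class above.

```python
import csv
import io


def sniff_headers(text: str, sample_size: int = 1024) -> list[str]:
    """Detect the delimiter from a sample, then return the first row as headers."""
    sample = text[:sample_size]
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters=",\t;|")
    except csv.Error:
        # Sniffing can fail on very short or irregular samples; assume commas.
        dialect = csv.excel
    reader = csv.reader(io.StringIO(text), dialect)
    return next(reader, [])


tsv_text = "start\tend\tGeneral Questions/House No.\n2024-01-01\t2024-01-02\t12\n"
csv_text = "a,b,c\n1,2,3\n"

print(sniff_headers(tsv_text))
print(sniff_headers(csv_text))
```

Restricting the candidate delimiters keeps the sniffer from picking a character such as `/` or a space that happens to repeat inside long survey headers.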

In [None]:
import json

# Instantiate the HeaderExtractor
extractor = HeaderExtractor()

# Assuming 'uploaded_filename' holds the path to your CSV file from earlier steps
if 'uploaded_filename' in locals() and uploaded_filename:
    # Extract headers using the new class
    extracted_headers = extractor.extract_headers_from_file(uploaded_filename)

    print("\nHeaders extracted using HeaderExtractor:")
    print(extracted_headers)

    # Save to JSON file
    output_filename = 'extracted_headers_using_class.json'
    with open(output_filename, 'w') as f:
        json.dump(extracted_headers, f, indent=4)
    print(f"\nHeaders also saved to {output_filename}")

else:
    print("No CSV file was uploaded. Please upload a file first.")

✅ Extracted 353 headers from file.

Headers extracted using HeaderExtractor:

Headers also saved to extracted_headers_using_class.json


In [None]:
import litellm

# These parameters mirror the dspy.LM configuration used elsewhere in this notebook.
# Adjust `model` and `base_url` to match your own provider.

print("Asking the LLM directly: What is AI?")

try:
    response = litellm.completion(
        # `model` and `messages` are required; the call fails without them.
        model="ollama/llama3.2:3b",
        base_url="http://localhost:11434",
        # Alternative model: "llama3.1:latest"
        # api_key="",
        messages=[{"role": "user", "content": "What is AI?"}],
    )
    # The structure of the response might vary slightly, typically it's response.choices[0].message.content
    print(f"LLM's Answer: {response.choices[0].message.content}")
except Exception as e:
    print(f"An error occurred while calling the LLM directly: {e}")

Asking the LLM directly: What is AI?
An error occurred while calling the LLM directly: litellm.APIError: APIError: OpenrouterException - {"error":{"message":"Insufficient credits. This account never purchased credits. Make sure your key is on the correct account or org, and if so, purchase more at https://openrouter.ai/settings/credits","code":402}}


In [None]:
import dspy
from dspy.teleprompt import BootstrapFewShot


# Set up the language model.
# dspy.LM takes a LiteLLM-style "provider/model" string, so another provider can be
# selected by changing `model` and supplying that provider's API key.
# Never hard-code API keys in the notebook; read them from Colab secrets or the environment.
llm = dspy.LM(
    model="ollama/llama3.1:latest",
    base_url="https://jo3m4y06rnnwhaz.askbhunte.com",
    api_key="",
)
dspy.configure(lm=llm)

print("DSPy configured with ollama.")

DSPy configured with ollama.


In [None]:

math = dspy.ChainOfThought("question -> answer: float")
print(math(question="Two dice are tossed. What is the probability that the sum equals two?"))

Prediction(
    reasoning="To find the probability that the sum of two dice equals 2, we need to consider all possible outcomes where this condition is met. There are only two ways this can happen: when one die shows a 1 and the other shows a 1 (order does not matter). The total number of outcomes for rolling two dice is 6 * 6 = 36 because each die has 6 faces, and we're looking at combinations, not permutations. We calculate the probability by dividing the number of successful outcomes by the total possible outcomes.\n\nThe successful outcome in this case is having a sum of 2, which happens in only one way: (1, 1). So there's only one successful outcome out of a total of 36 possibilities. Thus, the probability is 1/36.",
    answer=0.027777777777777776
)
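
The model's reasoning is internally inconsistent (it first says "two ways", then counts one), so it is worth checking the number independently. A short enumeration with the standard library, offered as a sanity check rather than as part of the DSPy program:

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of two six-sided dice
outcomes = list(product(range(1, 7), repeat=2))

# Outcomes whose faces sum to 2
favourable = [o for o in outcomes if sum(o) == 2]

probability = Fraction(len(favourable), len(outcomes))
print(favourable, probability, float(probability))
```

Only `(1, 1)` qualifies, which gives 1/36 and agrees with the `answer` field returned above.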


In [None]:
import dspy

# Define the signature for standardizing headers
class StandardizeHeader(dspy.Signature):
    """Standardize a given CSV header name to a simpler, more usable format."""

    original_header = dspy.InputField(desc="The original, potentially complex, CSV header string")
    standardized_header = dspy.OutputField(desc="A simplified and standardized version of the header. Examples: 'General Questions/Municipality and Ward Details/Name of Municipality (नगरपालिकाको नाम)' -> 'municipality_name', 'General Questions/_GPS Coordinates_latitude' -> 'latitude'")

# Define the DSPy module to use this signature
class HeaderStandardizer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.predictor = dspy.Predict(StandardizeHeader)

    def forward(self, original_header):
        prediction = self.predictor(original_header=original_header)
        return prediction.standardized_header

print("DSPy Signature and Module for header standardization defined.")

DSPy Signature and Module for header standardization defined.


In [None]:
# Instantiate the HeaderStandardizer
standardizer = HeaderStandardizer()

# Ensure extracted_headers is available
if 'extracted_headers' in locals():
    standardized_headers = []
    print("\nStandardizing headers...")

    for header in extracted_headers:
        if not header.strip():
            continue

        # Correct DSPy usage
        standardized_name = standardizer(original_header=header)

        standardized_headers.append(standardized_name)
        print(f"'{header}' -> '{standardized_name}'")

    print("\nStandardized Headers List:")
    print(standardized_headers)

    # Save to JSON
    import json
    output_filename_standardized = 'standardized_headers.json'

    with open(output_filename_standardized, 'w') as f:
        json.dump(standardized_headers, f, indent=4)

    print(f"\nStandardized headers saved to {output_filename_standardized}")

else:
    print("Error: 'extracted_headers' list not found. Please ensure headers were extracted successfully.")



Standardizing headers...
'start' -> 'general_questions/municipality_and_ward_details/name_of_municipality_()'
'end' -> 'municipality_name'
'today' -> 'date'
'username' -> 'user_name'
'simserial' -> 'simulation_serial'
'subscriberid' -> 'subscribed_id'
'deviceid' -> 'device_id'
'phonenumber' -> 'phone_number'
'General Questions/Municipality and Ward Details/Name of Municipality (नगरपालिकाको नाम)' -> 'municipality_name'
'General Questions/Municipality and Ward Details/Ward Number (वडा नं .)' -> 'ward_number'
'General Questions/Municipality and Ward Details/Ward Number (वडा नं )' -> 'ward_number'
'General Questions/Name of the Tole (सर्वेक्षण भैरहेको स्थानको नाम)' -> 'tole_name'
'General Questions/House No. (घर नं)' -> 'house_number'
'GPS Coordinates' -> 'latitude'
'General Questions/_GPS Coordinates_latitude' -> 'latitude'
'General Questions/_GPS Coordinates_longitude' -> 'longitu
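
Several distinct headers in this run map to the same standardized name (for example both `'GPS Coordinates'` and `'General Questions/_GPS Coordinates_latitude'` become `'latitude'`), which would produce duplicate column names. A minimal sketch for detecting such collisions; `find_collisions` is a hypothetical helper, and the sample pairs are copied from the output above.

```python
from collections import defaultdict


def find_collisions(originals, standardized):
    """Map each standardized name used more than once to the originals that produced it."""
    groups = defaultdict(list)
    for original, clean in zip(originals, standardized):
        groups[clean].append(original)
    return {clean: group for clean, group in groups.items() if len(group) > 1}


originals = [
    "today",
    "GPS Coordinates",
    "General Questions/_GPS Coordinates_latitude",
]
standardized = ["date", "latitude", "latitude"]

collisions = find_collisions(originals, standardized)
print(collisions)
```

Running a check like this before saving makes it easy to spot headers that need a manual fix or a more specific prompt.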

**CODE OPTIMIZATION V1**

---



In [None]:
import dspy
from typing import List
import asyncio
from tqdm.asyncio import tqdm_asyncio
import json
import re
import nest_asyncio

nest_asyncio.apply()

In [None]:
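# 1. OLLAMA SETUP WITH CACHING

ollama_lm = dspy.LM(
    model="ollama/llama3.1:latest",
    # Same server root as the configuration that returned a prediction earlier;
    # the "ollama/" provider is not an OpenAI-style endpoint, so no "/v1" suffix.
    base_url="https://jo3m4y06rnnwhaz.askbhunte.com",
    api_key="anything",
    temperature=0.0,
    max_tokens=64,        # short outputs only; raise this before using ChainOfThought
    cache=True,           # repeated headers are answered from the cache
    timeout=60,           # seconds; forwarded to LiteLLM
)
dspy.configure(lm=ollama_lm)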

In [None]:
# 2. Better Signature with Examples

class StandardizeHeader(dspy.Signature):
    """Convert messy CSV headers into clean snake_case English column names.
    Rules: remove prefixes, take meaningful English part, use snake_case, no special chars."""

    original_header: str = dspy.InputField()
    standardized_header: str = dspy.OutputField(desc="e.g. municipality_name, latitude, respondent_age")

In [None]:
FEW_SHOT_EXAMPLES = [
    dspy.Example(
        original_header="General Questions/Municipality and Ward Details/Name of Municipality (नगरपालिकाको नाम)",
        standardized_header="municipality_name"
    ).with_inputs("original_header"),
    dspy.Example(original_header="General Questions/_GPS Coordinates_latitude", standardized_header="latitude").with_inputs("original_header"),
    dspy.Example(original_header="General Questions/_GPS Coordinates_longitude", standardized_header="longitude").with_inputs("original_header"),
    dspy.Example(original_header="Household Survey/Section A - Demographics/A1. Respondent Age (वर्ष)", standardized_header="respondent_age").with_inputs("original_header"),
    dspy.Example(original_header="Income Sources/Q12_3. Remittances last year (USD)", standardized_header="remittances_usd").with_inputs("original_header"),
    dspy.Example(original_header="id", standardized_header="id").with_inputs("original_header"),
    dspy.Example(original_header="Timestamp", standardized_header="timestamp").with_inputs("original_header"),
]

In [None]:
# 3. Simple predictor with fixed few-shot demos (plain Predict, no reasoning step)
# ========================
class HeaderStandardizer(dspy.Module):
    def __init__(self):
        super().__init__()
        # Use the StandardizeHeader signature so its instructions reach the prompt.
        self.predict = dspy.Predict(StandardizeHeader)
        # Attach the labeled examples as demos on the predictor, as DSPy's few-shot
        # optimizers do; `examples` is not a setting that dspy.context() applies to prompts.
        self.predict.demos = list(FEW_SHOT_EXAMPLES)

    def forward(self, original_header: str) -> str:
        pred = self.predict(original_header=original_header)
        return pred.standardized_header.strip().lower()

standardizer = HeaderStandardizer()


In [None]:
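# 4. ASYNC + BOUNDED CONCURRENCY
# ========================
async def standardize_async(header: str) -> str:
    try:
        # The DSPy call is blocking, so run it in a worker thread; calling it
        # directly would stall the event loop and serialize the requests.
        # Assumes the LM configured in the main thread is visible to worker threads.
        return await asyncio.to_thread(standardizer, original_header=header)
    except Exception:
        # Fallback: last path segment, text before "(", converted to snake_case
        clean = re.sub(r"[^a-zA-Z0-9 ]", " ", header.split("/")[-1].split("(")[0])
        return "_".join(clean.lower().split()) or "unknown"

async def run_all(headers: List[str]) -> List[str]:
    # At most 18 requests in flight; tune for your server. The default thread
    # pool may cap the effective concurrency below this value.
    semaphore = asyncio.Semaphore(18)

    async def worker(h):
        async with semaphore:
            return await standardize_async(h)

    tasks = [worker(h) for h in headers]
    # gather() keeps results in input order; as_completed() does not, and the
    # results are later zipped with `headers`.
    return await tqdm_asyncio.gather(*tasks, desc="Standardizing")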
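
The pattern in this cell (a semaphore to bound concurrency, a rule-based fallback on failure, and results kept in input order) can be exercised without a model server. The sketch below uses only the standard library; `fake_model_call` is a hypothetical stand-in for the blocking DSPy call and is made to fail on any header containing `/` so that the fallback path runs.

```python
import asyncio
import re
import time
from typing import Callable, List


def fallback_name(header: str) -> str:
    """Rule-based fallback: last path segment, text before '(', snake_case."""
    tail = header.split("/")[-1].split("(")[0]
    clean = re.sub(r"[^a-zA-Z0-9 ]", " ", tail)
    return "_".join(clean.lower().split()) or "unknown"


def fake_model_call(header: str) -> str:
    """Hypothetical stand-in for a blocking LLM call; fails on headers with '/'."""
    time.sleep(0.01)
    if "/" in header:
        raise RuntimeError("simulated model failure")
    return header.strip().lower()


async def run_all_ordered(
    headers: List[str], call: Callable[[str], str], limit: int = 4
) -> List[str]:
    semaphore = asyncio.Semaphore(limit)

    async def worker(header: str) -> str:
        async with semaphore:
            try:
                # Run the blocking call in a thread so other workers can proceed.
                return await asyncio.to_thread(call, header)
            except Exception:
                return fallback_name(header)

    # gather() returns results in the order of its arguments.
    return await asyncio.gather(*(worker(h) for h in headers))


headers = [
    "Timestamp",
    "General Questions/_GPS Coordinates_latitude",
    "General Questions/House No. (x)",
]
results = asyncio.run(run_all_ordered(headers, fake_model_call))
print(list(zip(headers, results)))
```

Inside a notebook cell, `await run_all_ordered(headers, fake_model_call)` can be used in place of `asyncio.run(...)`. Note that the fallback keeps every word of the last path segment, so its names are longer than the ones the model is asked to produce.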

In [None]:
# 5. RUN IT on all extracted headers
# ========================
if 'extracted_headers' in locals() and extracted_headers:
    headers = [h.strip() for h in extracted_headers if h.strip()]
    print(f"Starting async standardization of {len(headers)} headers...")

    # Top-level await works in a notebook cell (nest_asyncio was applied above)
    clean_headers = await run_all(headers)

    print("\nSample results:")
    for h, c in zip(headers[:25], clean_headers[:25]):
        print(f"  {h} → {c}")

    # Save
    with open("standardized_headers.json", "w", encoding="utf-8") as f:
        json.dump(clean_headers, f, indent=2, ensure_ascii=False)

    print(f"\nAll {len(clean_headers)} headers standardized and saved!")
else:
    print("No extracted_headers found.")


Starting async standardization of 352 headers...


Standardizing:   0%|          | 0/352 [00:00<?, ?it/s]
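
The cell above saves only the list of cleaned names, so the link back to each original header depends on list order. A minimal sketch that saves an explicit original-to-standardized mapping instead; `save_header_mapping` and the demo file name are hypothetical, and the sample pairs come from the earlier synchronous run.

```python
import json
import os
import tempfile


def save_header_mapping(originals, standardized, path):
    """Write {original: standardized} to a JSON file and return the mapping."""
    if len(originals) != len(standardized):
        raise ValueError("originals and standardized must have the same length")
    mapping = dict(zip(originals, standardized))
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-Latin header text readable in the file
        json.dump(mapping, f, indent=2, ensure_ascii=False)
    return mapping


demo_path = os.path.join(tempfile.gettempdir(), "header_mapping_demo.json")
mapping = save_header_mapping(
    ["deviceid", "phonenumber"], ["device_id", "phone_number"], demo_path
)

with open(demo_path, encoding="utf-8") as f:
    reloaded = json.load(f)
print(reloaded)
```

A mapping like this can be passed directly to a DataFrame rename call, and it stays correct even if the results are later filtered or reordered.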

In [None]:
# 3. COMPILED SINGLE-HEADER STANDARDIZER
class SingleHeaderStandardizer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.predict = dspy.ChainOfThought(StandardizeHeader)

    def forward(self, original_header: str):
        # Predictors take their inputs as keyword arguments. Passing an Example
        # positionally raises "forward() takes 1 positional argument but 2 were given".
        pred = self.predict(original_header=original_header)
        return pred.standardized_header.strip().lower()

    # Labeled few-shot examples, used as the training set for compilation
    few_shot_examples = [
        dspy.Example(original_header="General Questions/Municipality and Ward Details/Name of Municipality (नगरपालिकाको नाम)", standardized_header="municipality_name").with_inputs("original_header"),
        dspy.Example(original_header="General Questions/_GPS Coordinates_latitude", standardized_header="latitude").with_inputs("original_header"),
        dspy.Example(original_header="General Questions/_GPS Coordinates_longitude", standardized_header="longitude").with_inputs("original_header"),
        dspy.Example(original_header="Household Survey/Section A - Demographics/A1. Respondent Age (वर्ष)", standardized_header="respondent_age").with_inputs("original_header"),
        dspy.Example(original_header="Income Sources/Q12_3. Remittances last year (USD)", standardized_header="remittances_usd").with_inputs("original_header"),
        dspy.Example(original_header="id", standardized_header="id").with_inputs("original_header"),
    ]

# Compile once; the optimizer attaches demos to the predictor
print("Compiling predictor with few-shot examples...")
standardizer = SingleHeaderStandardizer()

compiled = BootstrapFewShot(metric=None, max_labeled_demos=6).compile(
    standardizer, trainset=SingleHeaderStandardizer.few_shot_examples
)
print("Compiled! Now running async at max speed...")

Compiling predictor with few-shot examples...


  0%|          | 0/6 [00:00<?, ?it/s]2025/11/26 10:06:50 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'original_header': 'General Questions/Municipality and Ward Details/Name of Municipality (नगरपालिकाको नाम)', 'standardized_header': 'municipality_name'}) (input_keys={'original_header'}) with None due to ChainOfThought.forward() takes 1 positional argument but 2 were given.
2025/11/26 10:06:50 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'original_header': 'General Questions/_GPS Coordinates_latitude', 'standardized_header': 'latitude'}) (input_keys={'original_header'}) with None due to ChainOfThought.forward() takes 1 positional argument but 2 were given.
 33%|███▎      | 2/6 [00:00<00:00, 11.69it/s]2025/11/26 10:06:50 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'original_header': 'General Questions/_GPS Coordinates_longitude', 'standardized_heade

Bootstrapped 0 full traces after 5 examples for up to 1 rounds, amounting to 6 attempts.
Compiled! Now running async at max speed...
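
The `takes 1 positional argument but 2 were given` message in this log is ordinary Python behaviour for a method that accepts only keyword arguments. The sketch below reproduces it with a hypothetical stand-in class; it is not DSPy's `ChainOfThought`, only an illustration of the calling convention.

```python
class KeywordOnlyPredictor:
    """Hypothetical stand-in for a predictor whose forward() accepts only keyword inputs."""

    def forward(self, **kwargs):
        return {"standardized_header": kwargs["original_header"].lower()}

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)


predictor = KeywordOnlyPredictor()

error_message = ""
try:
    # Positional call: `self` plus one extra positional argument is rejected.
    predictor({"original_header": "Timestamp"})
except TypeError as err:
    error_message = str(err)
    print(error_message)

# Keyword call: the input is bound by field name and accepted.
result = predictor(original_header="Timestamp")
print(result)
```

Because every example fails the same way, the bootstrap step collects no traces, which is why the log reports 0 bootstrapped traces even though the final message says the compile finished.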





In [None]:
# 4. ULTRA-FAST STANDARDIZATION

print("Creating and compiling standardizer...")

# NOTE: BatchHeaderStandardizer is not defined in any cell shown in this notebook.
# Define it (with a `few_shot_examples` attribute) before running this cell.
student = BatchHeaderStandardizer()

# Use the examples that are now on the instance
trainset = student.few_shot_examples

compiled_standardizer = BootstrapFewShot(
    metric=None,
    max_labeled_demos=6,
    max_bootstrapped_demos=4,
    max_rounds=1
).compile(student, trainset=trainset)

print("Compiled successfully! Ready for lightning-fast header cleaning!")

Creating and compiling standardizer...


  0%|          | 0/7 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

2025/11/26 09:58:41 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'General Questions/Municipality and Ward Details/Name of Municipality (नगरप

Processed 0 / 1 examples: 100%|██████████| 1/1 [00:09<00:00,  9.05s/it]

2025/11/26 09:58:41 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'original_header': 'General Questions/Municipality and Ward Details/Name of Municipality (नगरपालिकाको नाम)', 'standardized_header': 'municipality_name'}) (input_keys={'original_header'}) with None due to 'NoneType' object has no attribute 'standardized_header'.
 14%|█▍        | 1/7 [00:09<00:54,  9.10s/it]


  0%|          | 0/1 [00:00<?, ?it/s]

2025/11/26 09:58:50 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'General Questions/_GPS Coordinates_latitude'}) (input_keys={'original_header'})): l

Processed 0 / 1 examples: 100%|██████████| 1/1 [00:08<00:00,  8.59s/it]

2025/11/26 09:58:50 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'original_header': 'General Questions/_GPS Coordinates_latitude', 'standardized_header': 'latitude'}) (input_keys={'original_header'}) with None due to 'NoneType' object has no attribute 'standardized_header'.
 29%|██▊       | 2/7 [00:17<00:44,  8.83s/it]


  0%|          | 0/1 [00:00<?, ?it/s]

2025/11/26 09:58:58 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'General Questions/_GPS Coordinates_longitude'}) (input_keys={'original_header'})): 

Processed 0 / 1 examples: 100%|██████████| 1/1 [00:08<00:00,  8.33s/it]

2025/11/26 09:58:58 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'original_header': 'General Questions/_GPS Coordinates_longitude', 'standardized_header': 'longitude'}) (input_keys={'original_header'}) with None due to 'NoneType' object has no attribute 'standardized_header'.
 43%|████▎     | 3/7 [00:26<00:34,  8.63s/it]


  0%|          | 0/1 [00:00<?, ?it/s]

2025/11/26 09:59:06 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'Household Survey/Section A - Demographics/A1. Respondent Age (वर्ष)'}) (inp

Processed 0 / 1 examples: 100%|██████████| 1/1 [00:08<00:00,  8.19s/it]

2025/11/26 09:59:06 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'original_header': 'Household Survey/Section A - Demographics/A1. Respondent Age (वर्ष)', 'standardized_header': 'respondent_age'}) (input_keys={'original_header'}) with None due to 'NoneType' object has no attribute 'standardized_header'.
 57%|█████▋    | 4/7 [00:34<00:25,  8.48s/it]


  0%|          | 0/1 [00:00<?, ?it/s]

2025/11/26 09:59:15 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'Income Sources/Q12_3. Remittances last year (USD)'}) (input_keys={'original_header'

Processed 0 / 1 examples: 100%|██████████| 1/1 [00:08<00:00,  8.17s/it]

2025/11/26 09:59:15 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'original_header': 'Income Sources/Q12_3. Remittances last year (USD)', 'standardized_header': 'remittances_usd'}) (input_keys={'original_header'}) with None due to 'NoneType' object has no attribute 'standardized_header'.
 71%|███████▏  | 5/7 [00:42<00:16,  8.39s/it]


  0%|          | 0/1 [00:00<?, ?it/s]

2025/11/26 09:59:22 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'id'}) (input_keys={'original_header'})): litellm.APIConnectionError: OllamaExceptio

Processed 0 / 1 examples: 100%|██████████| 1/1 [00:07<00:00,  7.84s/it]

2025/11/26 09:59:22 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'original_header': 'id', 'standardized_header': 'id'}) (input_keys={'original_header'}) with None due to 'NoneType' object has no attribute 'standardized_header'.
 86%|████████▌ | 6/7 [00:50<00:08,  8.22s/it]


  0%|          | 0/1 [00:00<?, ?it/s]

2025/11/26 09:59:30 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'Timestamp'}) (input_keys={'original_header'})): litellm.APIConnectionError: OllamaE

Processed 0 / 1 examples: 100%|██████████| 1/1 [00:07<00:00,  7.85s/it]

2025/11/26 09:59:30 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'original_header': 'Timestamp', 'standardized_header': 'timestamp'}) (input_keys={'original_header'}) with None due to 'NoneType' object has no attribute 'standardized_header'.
100%|██████████| 7/7 [00:58<00:00,  8.34s/it]


Bootstrapped 0 full traces after 6 examples for up to 1 rounds, amounting to 7 attempts.
Compiled successfully! Ready for lightning-fast header cleaning!





In [None]:
# 5. Run on actual headers

if 'extracted_headers' in locals() and extracted_headers:
    headers = [h.strip() for h in extracted_headers if h.strip()]
    print(f"\nStandardizing {len(headers)} headers in ONE call...")

    clean_headers = compiled_standardizer(headers)   # ← single call!

    # Show preview
    for orig, clean in zip(headers[:25], clean_headers[:25]):
        print(f"{orig} → {clean}")

    # Save
    import json
    output_path = "standardized_headers.json"
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(clean_headers, f, indent=2, ensure_ascii=False)

    # Report the file that was actually written
    print(f"\nDone! {len(clean_headers)} headers saved to {output_path}")
else:
    print("No extracted_headers found in this notebook.")


Standardizing 352 headers in ONE call...
  0%|          | 0/352 [00:00<?, ?it/s]

2025/11/26 10:05:02 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'deviceid'}) (input_keys={'original_header'})): litellm.APIConnectionError: OllamaEx

Processed 0 / 352 examples:   0%|          | 1/352 [00:08<51:00,  8.72s/it]

2025/11/26 10:05:03 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'phonenumber'}) (input_keys={'original_header'})): litellm.APIConnectionError: Ollam

Processed 0 / 352 examples:   1%|          | 2/352 [00:09<22:41,  3.89s/it]

2025/11/26 10:05:03 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'end'}) (input_keys={'original_header'})): litellm.APIConnectionError: OllamaExcepti

Processed 0 / 352 examples:   1%|          | 3/352 [00:09<13:00,  2.24s/it]

2025/11/26 10:05:03 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'start'}) (input_keys={'original_header'})): litellm.APIConnectionError: OllamaExcep

Processed 0 / 352 examples:   1%|▏         | 5/352 [00:09<05:55,  1.02s/it]

2025/11/26 10:05:03 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'subscriberid'}) (input_keys={'original_header'})): litellm.APIConnectionError: Olla

Processed 0 / 352 examples:   1%|▏         | 5/352 [00:09<05:55,  1.02s/it]

2025/11/26 10:05:03 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'today'}) (input_keys={'original_header'})): litellm.APIConnectionError: OllamaExcep

Processed 0 / 352 examples:   2%|▏         | 6/352 [00:09<05:54,  1.02s/it]

2025/11/26 10:05:03 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'simserial'}) (input_keys={'original_header'})): litellm.APIConnectionError: OllamaE

Processed 0 / 352 examples:   2%|▏         | 7/352 [00:09<03:23,  1.69it/s]

2025/11/26 10:05:10 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'General Questions/Municipality and Ward Details/Name of Municipality (नगरप

Processed 0 / 352 examples:   3%|▎         | 9/352 [00:16<09:38,  1.69s/it]

2025/11/26 10:05:11 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'General Questions/Municipality and Ward Details/Ward Number (वडा नं .)'})

Processed 0 / 352 examples:   4%|▍         | 14/352 [00:17<08:32,  1.52s/it]

2025/11/26 10:05:11 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'General Questions/Municipality and Ward Details/Ward Number (वडा नं )'})

Processed 0 / 352 examples:  34%|███▍      | 119/352 [00:17<00:09, 25.02it/s]

2025/11/26 10:05:11 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'General Questions/Name of the Tole (सर्वेक्षण भैरहेक

Processed 0 / 352 examples:  35%|███▌      | 124/352 [00:17<00:09, 25.02it/s]

2025/11/26 10:05:11 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'General Questions/House No. (घर नं)'}) (input_keys={'original_header'})): l

Processed 0 / 352 examples:  36%|███▌      | 126/352 [00:17<00:09, 25.02it/s]

2025/11/26 10:05:12 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'GPS Coordinates'}) (input_keys={'original_header'})): litellm.APIConnectionError: O

Processed 0 / 352 examples:  88%|████████▊ | 310/352 [00:18<00:00, 114.91it/s]

2025/11/26 10:05:12 ERROR dspy.utils.parallelizer: Error for (predict = Predict(StringSignature(original_header -> reasoning, standardized_header
    instructions='Convert messy CSV headers into clean snake_case English column names.\nRules: remove prefixes, take meaningful English part, use snake_case, no special chars.'
    original_header = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Original Header:', 'desc': '${original_header}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    standardized_header = Field(annotation=str required=True json_schema_extra={'desc': 'e.g. municipality_name, latitude, respondent_age', '__dspy_field_type': 'output', 'prefix': 'Standardized Header:'})
)), Example({'original_header': 'General Questions/_GPS Coordinates_latitude'}) (input_keys={'original_header'})): l

Processed 0 / 352 examples:  98%|█████████▊| 345/352 [00:18<00:00, 18.87it/s]






Exception: Execution cancelled due to errors or interruption.
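
Every failure in the log above is a `litellm.APIConnectionError` raised for Ollama, and the counter stays at "Processed 0 / 352", so no example completed. The messages are truncated, so the root cause is an inference, but the pattern suggests the Ollama server was not reachable from the notebook runtime. A minimal pre-flight check, assuming Ollama is expected at its default address `http://localhost:11434` (the helper name `ollama_is_reachable` is hypothetical and not part of DSPy or LiteLLM), might look like this:

```python
import urllib.error
import urllib.request

# Assumption: Ollama listens on its default address. Change this if the
# server runs on another host or port.
OLLAMA_BASE_URL = "http://localhost:11434"


def ollama_is_reachable(base_url: str = OLLAMA_BASE_URL, timeout: float = 3.0) -> bool:
    """Return True if GET <base_url>/api/tags answers with HTTP 200.

    /api/tags is the Ollama endpoint that lists locally available models,
    so a 200 response means the server is up and accepting requests.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


if not ollama_is_reachable():
    print("Ollama is not reachable; start the server before running the evaluation.")
```

Running this cell before the evaluation fails fast with one readable message instead of hundreds of per-example tracebacks.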