# Estrazione Dati Comune - Municipal Data Extraction

This notebook fills in missing values in CSV files about Italian municipalities by:
- Crawling official municipality websites
- Downloading and analyzing PDF documents
- Using TF-IDF retrieval to find relevant information
- Generating several search queries per missing cell and extracting candidate values

**Key Features:**
- Automatic categorization of missing cells
- Multiple queries generated per cell for better coverage
- Text extraction from PDF and HTML documents
- Document ranking with TF-IDF
- Complete audit trail of queries and sources

**Version:** 2.0 (Package-based)
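
The TF-IDF ranking listed under Key Features can be illustrated with a small standard-library sketch. This is not the code inside `municipality_extractor`, whose vectorizer is not shown in this notebook. The function names `tokenize` and `rank_documents`, the smoothed IDF formula, and the three sample snippets are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"\w+", text.lower())


def rank_documents(query, documents):
    """Return (index, score) pairs sorted by TF-IDF cosine similarity to the query."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    def idf(term):
        # Smoothed inverse document frequency (an assumed variant); always positive.
        return math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0

    def vectorize(tokens):
        # Weight each term by its raw count times its IDF.
        counts = Counter(tokens)
        return {term: count * idf(term) for term, count in counts.items()}

    def cosine(a, b):
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = vectorize(tokenize(query))
    scores = [
        (i, cosine(query_vec, vectorize(tokens)))
        for i, tokens in enumerate(tokenized)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical snippets standing in for text extracted from crawled pages.
docs = [
    "Popolazione residente al 31 dicembre 2023",
    "Orari di apertura della biblioteca comunale",
    "Bilancio di previsione 2024 approvato dal consiglio",
]
ranking = rank_documents("popolazione residente 2023", docs)
print(ranking[0])
```

Only the first snippet shares terms with the query, so it ranks first and the other two score zero. Generating several differently worded queries per cell, as the pipeline does, raises the chance that at least one query overlaps with the wording of the source document.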

---

In [None]:
# Install dependencies and setup Python path
!pip install -q -r requirements.txt

import sys
from pathlib import Path

# Add src/ to Python path for importing municipality_extractor
if 'google.colab' in sys.modules:
    repo_path = Path('/content/estrazione_dati_comune')
else:
    repo_path = Path.cwd()

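src_path = repo_path / 'src'
if not src_path.exists():
    # Report the problem instead of printing a misleading success message
    print(f"✗ {src_path} not found - importing municipality_extractor may fail")
elif str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))
    print(f"✓ Added {src_path} to Python path")
else:
    print(f"✓ {src_path} is already on the Python path")

print("✓ Setup cell finished (check the pip output above for install errors)")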

In [None]:
# Mount Google Drive (Colab only) - Optional
try:
    from google.colab import drive
    drive.mount('/content/drive')
    print("✓ Google Drive mounted successfully")
    
    # Optional: Uncomment to change working directory to Drive folder
    # import os
    # os.chdir('/content/drive/MyDrive/estrazione_dati_comune')
    
except ImportError:
    print("Not running in Colab - skipping Drive mount")

In [None]:
# Configure pipeline parameters
from municipality_extractor import RunConfig
from pathlib import Path

config = RunConfig(
    # === Core Settings ===
    base_url="https://www.comune.vigone.to.it/",  # Municipality website URL
    comune="Vigone",  # Municipality name (optional, used in logging)
    years_to_fill=[2023, 2024],  # Years to extract data for
    
    # === Directories ===
    input_dir=Path("input"),  # Where to find input CSV files
    output_dir=Path("output"),  # Where to save results
    # cache_dir will default to output_dir/cache
    
    # === Crawling Limits ===
    max_pages=500,  # Maximum pages to crawl
    max_depth=None,  # Maximum crawl depth (None = unlimited)
    max_pdf_mb=50.0,  # Maximum PDF file size in MB
    
    # === Data Sources ===
    allow_external_official=False,  # Include external sources (ISTAT, MEF, etc.)
    
    # === Advanced Settings (usually don't need to change) ===
    politeness_delay=0.5,  # Seconds between requests (be polite!)
    top_k_queries=10,  # Number of top queries to use per missing cell
    max_tfidf_features=5000,  # TF-IDF vectorizer max features
    context_window_chars=500,  # Context chars around extracted values
)

# Display configuration
print("✓ Configuration created:")
print(f"  - Base URL: {config.base_url}")
print(f"  - Municipality: {config.comune}")
print(f"  - Years: {config.years_to_fill}")
print(f"  - Input dir: {config.input_dir}")
print(f"  - Output dir: {config.output_dir}")
print(f"  - Max pages: {config.max_pages}")

In [None]:
# Run the complete extraction pipeline
from municipality_extractor import run_pipeline

print("Starting extraction pipeline...\n")
print("=" * 80)

# Execute pipeline - this will:
# 1. Crawl the municipality website
# 2. Extract text from HTML and PDF documents
# 3. Build TF-IDF search index
# 4. Process all CSV files and fill missing values
# 5. Generate audit reports
result = run_pipeline(config)

print("\n" + "=" * 80)
print("Pipeline complete!\n")

# Display results summary
if result.get('success'):
    print("✓ Pipeline completed successfully\n")
    print("Results:")
    print(f"  - Documents processed: {result.get('documents', 0)}")
    print(f"  - Values extracted: {result.get('sources', 0)}")
    print(f"  - Queries generated: {result.get('queries', 0)}")
    print(f"\nOutput files saved to: {config.output_dir}/")
    print("  - *_filled.csv: CSV files with extracted values")
    print("  - sources_long.csv: Complete audit trail of sources")
    print("  - queries_generated.csv: All queries used")
    print("  - run_report.md: Summary statistics")
else:
    print(f"✗ Pipeline failed: {result.get('error', 'Unknown error')}")

In [None]:
# Optional: View results and inspect outputs
import pandas as pd

# List all output files
output_files = sorted([f for f in config.output_dir.glob("*") if f.is_file()])
print("Generated output files:")
for f in output_files:
    size_kb = f.stat().st_size / 1024
    print(f"  - {f.name:30s} ({size_kb:>8.1f} KB)")

# Display sources summary if available
sources_file = config.output_dir / "sources_long.csv"
if sources_file.exists():
    print("\n" + "=" * 80)
    print("Sources Summary (first 10 extracted values):")
    print("=" * 80)
    sources_df = pd.read_csv(sources_file)
    display(sources_df.head(10))
    
    print(f"\nTotal values extracted: {len(sources_df)}")
    print(f"Average confidence: {sources_df['confidence'].mean():.2%}")
    print(f"Unique source URLs: {sources_df['source_url'].nunique()}")

# Display extraction report if available
report_file = config.output_dir / "run_report.md"
if report_file.exists():
    print("\n" + "=" * 80)
    print("Extraction Report:")
    print("=" * 80)
    print(report_file.read_text())

---

## Output Files Description

The pipeline writes the following files to the configured `output_dir` (`output/` in this notebook):

### Main Outputs

- **`*_filled.csv`** - Your original CSV files with missing values filled in
- **`sources_long.csv`** - Complete audit trail showing:
  - Which cells were filled (csv_file, row, column)
  - Extracted values and confidence scores
  - Source URLs and document IDs
  - Text snippets with context
  - Keywords that matched
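
One way to use the audit trail is to flag low-confidence extractions for manual review. The sketch below uses only the standard library. The `confidence` and `source_url` column names come from the results cell of this notebook. The `csv_file`, `row`, `column`, and `value` names, the 0 to 1 confidence scale, and the 0.5 threshold are assumptions, and the sample rows are made up for illustration.

```python
import csv
import io


def low_confidence_rows(csv_text, threshold=0.5):
    """Return audit rows whose confidence is below the threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    flagged = []
    for row in reader:
        # Assumes confidence is stored as a number between 0 and 1.
        if float(row["confidence"]) < threshold:
            flagged.append(row)
    return flagged


# Made-up sample rows for illustration only; not real pipeline output.
sample = """csv_file,row,column,value,confidence,source_url
demografia.csv,3,2023,5120,0.91,https://example.org/a
demografia.csv,4,2024,5098,0.42,https://example.org/b
bilancio.csv,7,2023,12000,0.30,https://example.org/c
"""

flagged = low_confidence_rows(sample, threshold=0.5)
for row in flagged:
    print(row["csv_file"], row["row"], row["column"], row["confidence"])
```

To run this against a real file, pass `Path("output/sources_long.csv").read_text()` instead of `sample`, after checking that the column names match the actual header.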

### Audit & Analysis Files

- **`queries_generated.csv`** - All search queries generated for each missing cell
- **`run_report.md`** - Summary statistics including:
  - Pages crawled and documents processed
  - Fill rates per CSV file
  - Success/failure statistics
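
The fill rate in the report can be cross-checked by comparing an input CSV with its `*_filled.csv` counterpart. The sketch below assumes that a missing cell is an empty string and that the fill rate is the share of originally missing cells that are no longer empty. The package may define either differently, and the helper names and sample data are hypothetical.

```python
import csv
import io


def count_missing(csv_text):
    """Count empty cells in CSV text, skipping the header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return sum(1 for row in rows[1:] for cell in row if cell.strip() == "")


def fill_rate(original_text, filled_text):
    """Fraction of originally missing cells that are no longer empty."""
    missing_before = count_missing(original_text)
    if missing_before == 0:
        return 1.0  # nothing needed filling
    missing_after = count_missing(filled_text)
    return (missing_before - missing_after) / missing_before


# Made-up example: three cells are missing before, one is still missing after.
original = "indicatore,2023,2024\npopolazione,,5098\nfamiglie,,\n"
filled = "indicatore,2023,2024\npopolazione,5120,5098\nfamiglie,2210,\n"
rate = fill_rate(original, filled)
print(f"Fill rate: {rate:.0%}")
```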

### Cache Directory

- **`cache/`** - Downloaded HTML/PDF files and extracted text
  - Persists across runs to avoid re-downloading
  - Useful for debugging and faster re-runs
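
The effect of a persistent cache can be sketched as follows. This is an illustration of the general technique, not the package's cache layout: the hashed file naming, the `.bin` suffix, and the `cache_path_for` and `fetch_cached` helpers are assumptions, and `fake_download` stands in for a real HTTP request.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_path_for(url, cache_dir):
    """Map a URL to a stable file path inside the cache directory."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.bin"


def fetch_cached(url, cache_dir, download):
    """Return the bytes for a URL, calling download(url) only on a cache miss."""
    path = cache_path_for(url, cache_dir)
    if path.exists():
        return path.read_bytes()
    data = download(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data


# Stand-in downloader that records how often it is called.
calls = []


def fake_download(url):
    calls.append(url)
    return b"<html>example</html>"


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_cached("https://example.org/page", tmp, fake_download)
    second = fetch_cached("https://example.org/page", tmp, fake_download)

print(f"Downloads performed: {len(calls)}")
```

The second call is served from disk, so the downloader runs only once. Deleting the cache directory forces a fresh download on the next run.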

---

## Next Steps

1. **Review filled CSVs** - Check `*_filled.csv` files for accuracy
2. **Verify sources** - Inspect `sources_long.csv` to validate data sources
3. **Adjust if needed** - Modify configuration parameters and re-run
4. **Use cache** - Keep the `cache/` directory for faster subsequent runs

**Tip:** To extract data for a different municipality, change `base_url` and `comune` in the configuration cell, then re-run that cell and the two cells after it (run pipeline and view results).

For more information, see the [GitHub repository](https://github.com/eugenioservidiome/estrazione_dati_comune).