# 🚀 Entity Resolution Pipeline - A100 Optimized

**Estimated runtime: ~45 minutes** (vs. 3+ hours on an M1 MacBook)

## Prerequisites
1. Upload your data to Google Drive in this structure (the top-level folder must match `DRIVE_ROOT` in Cell 1, which currently points to `ricerca`):
```
My Drive/
└── entity-resolution/
    ├── entity-resolution-pipeline/    ← The full pipeline folder
    ├── dati europe cb/                ← Crunchbase data
    ├── new orbis/                     ← Orbis Excel files (optional)
    └── database-done.xlsx             ← Your manual matches file
```
2. Select **A100 GPU** runtime (Runtime → Change runtime type → A100)
3. Run all cells in order

In [9]:
#@title Cell 1: Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# === CONFIGURE YOUR PATHS HERE ===
DRIVE_ROOT = '/content/drive/MyDrive/ricerca'
PIPELINE_FOLDER = 'entity-resolution-pipeline'

# Derived paths
PIPELINE_PATH = f'{DRIVE_ROOT}/{PIPELINE_FOLDER}'
CB_DATA_PATH = f'{DRIVE_ROOT}/dati europe cb'
ORBIS_DATA_PATH = f'{DRIVE_ROOT}/new orbis'
DB_DONE_PATH = f'{DRIVE_ROOT}/database-done.xlsx'

import os
print(f'✅ Drive mounted')
print(f'📁 Pipeline: {PIPELINE_PATH}')
print(f'📁 Crunchbase: {CB_DATA_PATH}')
print(f'📁 Orbis: {ORBIS_DATA_PATH}')
print(f'📄 Database-done: {DB_DONE_PATH}')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Drive mounted
📁 Pipeline: /content/drive/MyDrive/ricerca/entity-resolution-pipeline
📁 Crunchbase: /content/drive/MyDrive/ricerca/dati europe cb
📁 Orbis: /content/drive/MyDrive/ricerca/new orbis
📄 Database-done: /content/drive/MyDrive/ricerca/database-done.xlsx


In [11]:
#@title Cell 2: Validate Data Exists
import os

errors = []

# Check pipeline folder
if not os.path.exists(PIPELINE_PATH):
    errors.append(f'❌ Pipeline folder not found: {PIPELINE_PATH}')
else:
    print(f'✅ Pipeline folder found')
    # Check key files
    key_files = ['run_pipeline.py', 'src/data_io.py', 'src/normalize.py', 'configs']
    for f in key_files:
        if os.path.exists(f'{PIPELINE_PATH}/{f}'):
            print(f'   ✅ {f}')
        else:
            errors.append(f'   ❌ Missing: {f}')

# Check Crunchbase data
if not os.path.exists(CB_DATA_PATH):
    errors.append(f'❌ Crunchbase data not found: {CB_DATA_PATH}')
else:
    print(f'✅ Crunchbase data found')

# Check for existing Orbis parquet (preferred) or raw Excel
orbis_parquet = f'{PIPELINE_PATH}/data/interim/orbis_clean/orbis_raw.parquet'
if os.path.exists(orbis_parquet):
    size_gb = os.path.getsize(orbis_parquet) / 1e9
    print(f'✅ Orbis parquet found ({size_gb:.2f} GB) - Will skip Excel conversion')
elif os.path.exists(ORBIS_DATA_PATH):
    print(f'⚠️ No Orbis parquet, will convert from Excel (slow)')
else:
    errors.append(f'❌ No Orbis data found')

# Check database-done.xlsx
if os.path.exists(DB_DONE_PATH):
    print(f'✅ database-done.xlsx found')
else:
    print(f'⚠️ database-done.xlsx not found (alias step will be limited)')

if errors:
    print('\n🛑 ERRORS FOUND:')
    for e in errors:
        print(e)
    raise Exception('Please fix the errors above before continuing')
else:
    print('\n🎉 All data validated! Ready to proceed.')

✅ Crunchbase data found
⚠️ database-done.xlsx not found (alias step will be limited)

🛑 ERRORS FOUND:
❌ Pipeline folder not found: /content/drive/MyDrive/ricerca/entity-resolution-pipeline
❌ No Orbis data found


Exception: Please fix the errors above before continuing

In [None]:
#@title Cell 3: Check GPU and Install Dependencies
!nvidia-smi

# Install ALL required packages
print('\n📦 Installing dependencies...')
!pip install -q \
    sentence-transformers \
    faiss-gpu \
    pandas \
    pyarrow \
    tqdm \
    rapidfuzz \
    scikit-learn \
    pyyaml \
    joblib \
    openpyxl \
    xlrd \
    numpy

import torch
print(f'\n✅ PyTorch version: {torch.__version__}')
print(f'✅ CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'✅ GPU: {torch.cuda.get_device_name(0)}')
    print(f'✅ VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')
else:
    print('⚠️ No GPU detected! Select GPU runtime: Runtime → Change runtime type → A100')

In [None]:
#@title Cell 4: Setup Python Path
import sys
sys.path.insert(0, PIPELINE_PATH)
sys.path.insert(0, f'{PIPELINE_PATH}/src')

%cd {PIPELINE_PATH}

# Verify imports work
try:
    from config import Config, get_project_paths
    from data_io import ingest_all_data
    from normalize import normalize_name
    print('✅ All core modules imported successfully')
except ImportError as e:
    print(f'❌ Import error: {e}')
    raise

In [None]:
#@title Cell 5: Create A100-Optimized Config
a100_config = f'''
# A100 Optimized Configuration
# Designed for 80GB VRAM + 160GB RAM

paths:
  project_root: {PIPELINE_PATH}
  raw_crunchbase: {CB_DATA_PATH}
  raw_orbis: {ORBIS_DATA_PATH}
  database_done: {DB_DONE_PATH}

blocking:
  max_candidates_per_cb: 500
  same_country_boost: 2.0
  rare_token_min_df: 3
  rare_token_max_df: 50

embeddings:
  enabled: true
  model_name: all-MiniLM-L6-v2
  batch_size: 4096
  device: cuda
  use_fp16: true

faiss:
  use_gpu: true
  nlist: 4096
  nprobe: 128
  top_k_per_query: 100

processing:
  feature_chunk_size: 500000
  parallel_workers: 32

tiers:
  tier_a_threshold: 0.95
  tier_b_threshold: 0.80
  tier_c_threshold: 0.60

features:
  semantic_similarity_weight: 0.30
  string_similarity_weight: 0.25
  domain_match_weight: 0.25
  structural_weight: 0.20

model:
  classifier: gradient_boosting
  n_estimators: 200
  max_depth: 8
  learning_rate: 0.1
'''

import os
os.makedirs('configs', exist_ok=True)
with open('configs/a100_colab.yaml', 'w') as f:
    f.write(a100_config)

print('✅ Created configs/a100_colab.yaml')
print('\nConfiguration:')
print(a100_config)
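
The weights and thresholds in this config are easy to mistype when tuning. As a minimal sketch (the helper `check_config` is hypothetical and not part of the pipeline), the check below assumes that the four feature weights are meant to sum to 1.0 and that the tier thresholds should decrease from A to C. Both hold for the values above, but neither is confirmed by the pipeline code:

```python
import math

def check_config(cfg):
    """Return a list of problems found in a parsed config dict (hypothetical helper)."""
    problems = []

    # Assumption: the feature weights are meant to sum to 1.0.
    weights = cfg.get("features", {})
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"feature weights sum to {total:.3f}, expected 1.0")

    # Assumption: tier A is the strictest cutoff and tier C the loosest.
    tiers = cfg.get("tiers", {})
    cutoffs = [tiers.get(f"tier_{t}_threshold") for t in "abc"]
    if None in cutoffs:
        problems.append("missing tier threshold")
    elif not cutoffs[0] > cutoffs[1] > cutoffs[2]:
        problems.append(f"tier thresholds are not strictly decreasing: {cutoffs}")

    return problems

# Values copied from the A100 config above.
cfg = {
    "features": {
        "semantic_similarity_weight": 0.30,
        "string_similarity_weight": 0.25,
        "domain_match_weight": 0.25,
        "structural_weight": 0.20,
    },
    "tiers": {
        "tier_a_threshold": 0.95,
        "tier_b_threshold": 0.80,
        "tier_c_threshold": 0.60,
    },
}
print(check_config(cfg) or "config looks consistent")
```

In the notebook, `cfg` could instead come from `yaml.safe_load(a100_config)`, since `pyyaml` is installed in Cell 3.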

In [None]:
#@title Cell 6: Run Full Pipeline 🚀
import time
start_time = time.time()

print('='*60)
print('STARTING ENTITY RESOLUTION PIPELINE (A100 OPTIMIZED)')
print('='*60)

!python run_pipeline.py --step all --config configs/a100_colab.yaml

elapsed = time.time() - start_time
print('\n' + '='*60)
print(f'🎉 PIPELINE COMPLETED in {elapsed/60:.1f} minutes')
print('='*60)

In [None]:
#@title Cell 7: View Results
import pandas as pd
import os

# Check for output files
outputs = {
    'matches': 'data/outputs/matches/matches_final.parquet',
    'review_queue': 'data/outputs/review/review_queue.csv',
    'cb_clean': 'data/interim/cb_clean/cb_clean.parquet',
    'orbis_clean': 'data/interim/orbis_clean/orbis_clean.parquet',
}

print('📊 Output Files:')
for name, path in outputs.items():
    if os.path.exists(path):
        size = os.path.getsize(path) / 1e6
        print(f'  ✅ {name}: {path} ({size:.1f} MB)')
    else:
        print(f'  ❌ {name}: Not found')

# Load and display matches if available
if os.path.exists(outputs['matches']):
    matches = pd.read_parquet(outputs['matches'])
    print(f'\n📈 Total matches: {len(matches)}')
    print(f'\nTier distribution:')
    print(matches['tier'].value_counts())
    print(f'\nSample matches:')
    display(matches.head(10))

In [None]:
#@title Cell 8: Download Results to Local Machine
import os      # imported here so this cell also runs on its own
import shutil

from google.colab import files

# Create a zip of all outputs
output_dir = 'data/outputs'
if os.path.exists(output_dir):
    shutil.make_archive('/content/pipeline_results', 'zip', output_dir)
    print('📦 Created pipeline_results.zip')

    # Option 1: Download directly (uncomment to use)
    # files.download('/content/pipeline_results.zip')

    # Option 2: Already saved to Drive (default)
    print(f'\n✅ Results already saved to Drive at:')
    print(f'   {PIPELINE_PATH}/data/outputs/')
else:
    print('❌ No outputs found')

---
## 📊 Results Summary

After the pipeline completes, your results are in:
- `data/outputs/matches/matches_final.parquet` - All matched pairs with scores
- `data/outputs/review/review_queue.csv` - Pairs needing manual review
- `data/outputs/reports/` - Run logs and analytics

Because the pipeline ran from your mounted Google Drive folder, all of these files are written directly to Drive.
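
Writes to a mounted Drive folder are flushed in the background, so it can be worth forcing a flush before you close the runtime. A minimal sketch, assuming the notebook runs in Colab: the wrapper name `flush_drive` is made up for illustration, while `drive.flush_and_unmount()` is the Colab call that does the work.

```python
def flush_drive():
    """Flush pending Drive writes; return False when not running in Colab."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.flush_and_unmount()  # flushes pending writes, then unmounts the Drive
    return True

if flush_drive():
    print('Drive flushed and unmounted')
else:
    print('Not running in Colab; nothing to flush')
```

After this call the Drive is unmounted, so rerun Cell 1 before reading or writing the files again.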