# 🎯 Multimodal Donor Legacy Intent Prediction

This notebook runs the complete multimodal fusion pipeline combining:
- **Tabular Data**: Donor demographics and giving history (9 features)
- **Text Data**: Contact reports via BERT embeddings (768-dim)
- **Graph Data**: Family relationships via GNN embeddings (64-dim)
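
As a rough illustration of how the three inputs line up per donor, the sketch below concatenates them into one feature vector. Concatenation is an assumption made here for illustration only; the fusion actually implemented in `multimodal_arch.py` is not shown in this notebook and may differ. The arrays are random placeholders, not real donor data.

```python
import numpy as np

n_donors = 4  # tiny placeholder batch, not the real 50,000 donors
rng = np.random.default_rng(0)

tabular = rng.standard_normal((n_donors, 9)).astype(np.float32)     # demographics + giving history
text_emb = rng.standard_normal((n_donors, 768)).astype(np.float32)  # BERT embeddings
graph_emb = rng.standard_normal((n_donors, 64)).astype(np.float32)  # GNN embeddings

# Assumed early fusion: stack the modalities along the feature axis.
fused = np.concatenate([tabular, text_emb, graph_emb], axis=1)
print(fused.shape)  # (4, 841), since 9 + 768 + 64 = 841
```

All three arrays must share the same row order (one row per donor) for the concatenation to be meaningful.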

---

## 🚀 Quick Start Guide

### **Step 1: Set Up GPU Runtime**
1. Click **Runtime** → **Change runtime type**
2. Select **Hardware accelerator**: **GPU** (Tesla T4 recommended)
3. Click **Save**

### **Step 2: Run Setup Cell**
- Run Cell 2 to install all required packages (~2 minutes)

### **Step 3: Upload Files When Prompted**

You'll need to upload these files:

#### **📊 Dataset Files** (Cell 5):
- `donors.csv` (50,000 donors)
- `contact_reports.csv` (32,665 reports)
- `relationships.csv` (15,000 relationships)

#### **🤖 Pre-trained Model** (Cell 5 - Optional):
- `best_contact_classifier.pt` (trained BERT model)
  - If it is not uploaded, a new model is trained (~10 min)

#### **💻 Source Code Files** (Cell 7):
- `multimodal_arch.py`
- `bert_pipeline.py`
- All files from `gnn_models/` folder
- All files from `data_generation/` folder

### **Step 4: Run All Cells**
- Click **Runtime** → **Run all**
- Total time: ~5-10 minutes with pre-trained model

---

## ✅ Success Indicators

You should see:
- ✅ `✅ Imported GNN modules successfully`
- ✅ `Loaded from checkpoint['model_state_dict']`
- ✅ `✅ Saved tabular feature scaler`
- ✅ No "Falling back to dummy embeddings" messages
- ✅ Train/Val Loss showing numbers (not `nan`)
- ✅ Both classes predicted in classification report
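
The `Loaded from checkpoint[...]` message comes from a lookup that tries several checkpoint layouts in turn. A minimal sketch of that lookup as a standalone helper follows; `extract_state_dict` is a hypothetical name introduced here, and plain dicts stand in for real tensors.

```python
def extract_state_dict(checkpoint):
    """Return (state_dict, label) for the checkpoint layouts the pipeline tries."""
    if isinstance(checkpoint, dict):
        for key in ("model_state_dict", "state_dict"):
            if key in checkpoint:
                return checkpoint[key], f"checkpoint['{key}']"
    # Otherwise assume the object is already a state dict.
    return checkpoint, "raw state dict"


# Plain dicts stand in for real tensors here.
wrapped = {"model_state_dict": {"linear.weight": [0.1, 0.2]}, "epoch": 3}
state, source = extract_state_dict(wrapped)
print(source)  # checkpoint['model_state_dict']
```

With a real checkpoint, the returned `state` would then be passed to `model.load_state_dict(state)`.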

---

## 📈 Expected Results

- **Test Accuracy**: 75-82%
- **Test AUC**: 0.75-0.85
- **Macro F1 Score**: 0.60-0.70
- **Legacy Intent Recall**: 40-50% (minority class detected!)
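
High accuracy combined with modest minority-class recall is typical of imbalanced labels. The sketch below computes these metrics by hand on a small set of invented labels (not results from this pipeline) to show how the numbers relate; `binary_report` is a hypothetical helper, and the pipeline itself may compute its metrics differently.

```python
def binary_report(y_true, y_pred):
    """Accuracy, per-class recall, and macro F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    recalls, f1s = {}, {}
    for cls in (0, 1):
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        recalls[cls], f1s[cls] = recall, f1
    return {
        "accuracy": accuracy,
        "recall": recalls,
        "macro_f1": sum(f1s.values()) / 2,
        "classes_predicted": sorted(set(y_pred)),
    }


# Toy labels: 8 negatives, 2 positives (class 1 = legacy intent).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
report = binary_report(y_true, y_pred)
print(report["accuracy"], report["recall"][1], report["macro_f1"])  # 0.8 0.5 0.6875
```

Here one of the two positives is missed, so recall for class 1 is 50% even though overall accuracy is 80%. The `classes_predicted` entry is a quick way to confirm that both classes appear in the predictions.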

---

## ❓ Need Help?

- **Detailed Guide**: See `COLAB_QUICKSTART_GUIDE.md`
- **Complete Documentation**: See `FINAL_FIX_SUMMARY.md`
- **Project Structure**: See `PROJECT_STRUCTURE.md`


## 1. Environment Setup & Installation


In [None]:
# Install required packages
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install transformers datasets accelerate
!pip install torch-geometric
!pip install scikit-learn matplotlib seaborn
!pip install pandas numpy tqdm
!pip install networkx

print("✅ All packages installed successfully!")


In [None]:
import os
import sys
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set up paths
os.makedirs('src', exist_ok=True)
os.makedirs('synthetic_donor_dataset', exist_ok=True)

# Add src to Python path
sys.path.append('src')

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"🚀 Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

print("✅ Environment setup complete!")


## 2. Upload Dataset Files

Upload the following files from your project:
- `donors.csv`
- `contact_reports.csv` 
- `relationships.csv`
- `best_contact_classifier.pt` (optional - will train new model if not provided)
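
Before running the pipeline, it can help to confirm that the CSV headers contain the columns the pipeline script reads (`ID` and `Legacy_Intent_Binary` in `donors.csv`, `Donor_ID` and `Report_Text` in `contact_reports.csv`). The sketch below is a minimal header check; `missing_columns` is a hypothetical helper, and the columns of `relationships.csv` are not visible in this notebook, so that file is not checked.

```python
import csv
import io

# Columns the pipeline script reads; relationships.csv columns are not
# visible in this notebook, so that file is left out of the check.
REQUIRED_COLUMNS = {
    "donors.csv": {"ID", "Legacy_Intent_Binary"},
    "contact_reports.csv": {"Donor_ID", "Report_Text"},
}


def missing_columns(name, handle):
    """Return the required columns absent from a CSV's header row."""
    header = next(csv.reader(handle), [])
    return sorted(REQUIRED_COLUMNS.get(name, set()) - set(header))


# In-memory stand-in for a real file opened with open(path, newline="").
sample = io.StringIO("ID,Lifetime_Giving,Engagement_Score\n1,500.0,0.7\n")
print(missing_columns("donors.csv", sample))  # ['Legacy_Intent_Binary']
```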


In [None]:
from google.colab import files
import shutil

print("📁 Please upload your dataset files:")
print("1. donors.csv")
print("2. contact_reports.csv")
print("3. relationships.csv")
print("4. best_contact_classifier.pt (optional)")
print()

# Upload files
uploaded = files.upload()

# Move CSVs into the dataset directory; the checkpoint stays in the working directory
for filename in uploaded.keys():
    if filename.endswith('.csv'):
        shutil.move(filename, f'synthetic_donor_dataset/{filename}')
        print(f"✅ Moved {filename} to synthetic_donor_dataset/")
    elif filename.endswith('.pt'):
        # Uploads already land in the working directory, so no move is needed
        print(f"✅ Kept {filename} in root directory")

print("\n📊 Checking uploaded files:")
for file in ['synthetic_donor_dataset/donors.csv', 'synthetic_donor_dataset/contact_reports.csv', 'synthetic_donor_dataset/relationships.csv']:
    if os.path.exists(file):
        df = pd.read_csv(file)
        print(f"✅ {file}: {len(df):,} rows")
    else:
        print(f"❌ {file}: Not found")

if os.path.exists('best_contact_classifier.pt'):
    print("✅ best_contact_classifier.pt: Found (will use for BERT embeddings)")
else:
    print("⚠️ best_contact_classifier.pt: Not found (will train new BERT model)")


## 3. Upload Project Code Files

Upload the following Python files from your `src/` directory:
- `multimodal_arch.py`
- `bert_pipeline.py`
- `gnn_models/` folder (all files)
- `data_generation/` folder (all files)
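
If the upload dialog hands back bare file names without their folder prefix (an assumption worth verifying in your session), files can be routed by name instead. The sketch below maps each file name to a target directory; `TARGET_DIRS` and `destination` are hypothetical names, and the file list is taken from the modules this notebook checks or imports.

```python
from pathlib import PurePosixPath

# Assumed layout, taken from the file names this notebook checks or imports.
TARGET_DIRS = {
    "gnn_pipeline.py": "src/gnn_models",
    "gnn_models.py": "src/gnn_models",
    "gnn_analysis.py": "src/gnn_models",
    "data_generation.py": "src/data_generation",
}


def destination(filename):
    """Map an uploaded file name to its path under src/."""
    base = PurePosixPath(filename).name  # drop any folder prefix
    folder = TARGET_DIRS.get(base, "src")
    return f"{folder}/{base}"


for name in ["multimodal_arch.py", "gnn_pipeline.py", "src/bert_pipeline.py"]:
    print(name, "->", destination(name))
```

Any file not listed in `TARGET_DIRS` falls through to `src/`, which matches where `multimodal_arch.py` and `bert_pipeline.py` are expected.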


In [None]:
print("📁 Please upload your project code files:")
print("1. multimodal_arch.py")
print("2. bert_pipeline.py")
print("3. All files from gnn_models/ folder")
print("4. All files from data_generation/ folder")
print()

# Upload code files
uploaded_code = files.upload()

# Create necessary directories
os.makedirs('src/gnn_models', exist_ok=True)
os.makedirs('src/data_generation', exist_ok=True)

# Move files to correct locations
for filename in uploaded_code.keys():
    if 'gnn_models' in filename:
        shutil.move(filename, f'src/{filename}')
        print(f"✅ Moved {filename} to src/gnn_models/")
    elif 'data_generation' in filename:
        shutil.move(filename, f'src/{filename}')
        print(f"✅ Moved {filename} to src/data_generation/")
    elif filename.endswith('.py') and 'src' in filename:
        # Handle files like 'src/multimodal_arch.py'
        new_name = filename.replace('src/', '')
        shutil.move(filename, f'src/{new_name}')
        print(f"✅ Moved {filename} to src/{new_name}")
    elif filename.endswith('.py'):
        shutil.move(filename, f'src/{filename}')
        print(f"✅ Moved {filename} to src/")

print("\n📊 Checking uploaded code files:")
code_files = [
    'src/multimodal_arch.py',
    'src/bert_pipeline.py',
    'src/gnn_models/gnn_pipeline.py',
    'src/gnn_models/gnn_models.py',
    'src/gnn_models/gnn_analysis.py',  # imported by the pipeline script
    'src/data_generation/data_generation.py'
]

for file in code_files:
    if os.path.exists(file):
        print(f"✅ {file}")
    else:
        print(f"❌ {file}: Not found")


## 4. Create Multimodal Pipeline Script
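
The pipeline script written in this section builds one text per donor by joining all of that donor's contact reports, with an empty string for donors who have none. A minimal single-pass sketch of that grouping follows, using plain Python in place of the script's pandas filtering; `texts_per_donor` is a hypothetical helper and the rows are invented placeholders.

```python
from collections import defaultdict


def texts_per_donor(donor_ids, reports):
    """Join each donor's report texts; donors with no reports get ''."""
    grouped = defaultdict(list)
    for donor_id, text in reports:
        # Missing text becomes an empty string, as with fillna('') in the script.
        grouped[donor_id].append("" if text is None else str(text))
    return [" ".join(grouped.get(donor_id, [])) for donor_id in donor_ids]


donor_ids = [101, 102, 103]
reports = [(101, "Met at gala."), (103, "Asked about bequests."), (101, None)]
print(texts_per_donor(donor_ids, reports))
# ['Met at gala. ', '', 'Asked about bequests.']
```

Grouping once and then looking up each donor avoids rescanning every report for every donor, which matters with tens of thousands of rows.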


In [None]:
# Create the multimodal pipeline script for Colab
print("Creating multimodal pipeline script...")

# Write the pipeline script directly
with open('colab_multimodal_pipeline.py', 'w') as f:
    f.write('''#!/usr/bin/env python3
"""
Multimodal Pipeline for Google Colab
Combines BERT, GNN, and tabular data for donor legacy intent prediction
"""

import pandas as pd
import numpy as np
import sys
import os
import torch
from pathlib import Path

# Add src directory to path for imports
sys.path.append('src')

def load_data():
    """Load the synthetic donor dataset"""
    print("Loading synthetic donor dataset...")
    
    # Load donors data
    donors_df = pd.read_csv('synthetic_donor_dataset/donors.csv')
    print(f"Loaded {len(donors_df):,} donors")
    
    # Load contact reports
    contact_reports_df = pd.read_csv('synthetic_donor_dataset/contact_reports.csv')
    print(f"Loaded {len(contact_reports_df):,} contact reports")
    
    return donors_df, contact_reports_df

def load_bert_embeddings(donors_df, contact_reports_df):
    """Load or generate BERT embeddings from the trained model"""
    print("Loading BERT embeddings...")
    
    try:
        # Import BERT pipeline components
        from bert_pipeline import (
            setup_transformer_environment, 
            select_model, 
            EmbeddingExtractor,
            run_bert_pipeline_on_contact_reports
        )
        
        # Check if we have a trained model
        model_path = 'best_contact_classifier.pt'
        if os.path.exists(model_path):
            print(f"Found trained model: {model_path}")
            
            # Load the trained model
            device = setup_transformer_environment()
            model_info = select_model('bert')  # Use BERT as default
            
            # Load model state (handle different checkpoint formats)
            checkpoint = torch.load(model_path, map_location=device)
            model = model_info['model']
            
            # Try different checkpoint formats
            try:
                if isinstance(checkpoint, dict):
                    if 'model_state_dict' in checkpoint:
                        model.load_state_dict(checkpoint['model_state_dict'])
                        print("Loaded from checkpoint['model_state_dict']")
                    elif 'state_dict' in checkpoint:
                        model.load_state_dict(checkpoint['state_dict'])
                        print("Loaded from checkpoint['state_dict']")
                    else:
                        # Checkpoint dict might contain the state dict directly
                        model.load_state_dict(checkpoint)
                        print("Loaded checkpoint as state dict")
                else:
                    # Checkpoint is the state dict itself
                    model.load_state_dict(checkpoint)
                    print("Loaded checkpoint directly")
            except Exception as load_error:
                print(f"Error loading checkpoint: {load_error}")
                print(f"Checkpoint type: {type(checkpoint)}")
                if isinstance(checkpoint, dict):
                    print(f"Checkpoint keys: {checkpoint.keys()}")
                raise
            
            model.to(device)
            
            # Initialize tokenizer
            tokenizer = model_info['tokenizer']
            
            # Create embedding extractor
            extractor = EmbeddingExtractor(model, tokenizer, device)
            
            # Get contact report texts for donors
            donor_texts = []
            for donor_id in donors_df['ID']:
                # Get all contact reports for this donor
                donor_reports = contact_reports_df[contact_reports_df['Donor_ID'] == donor_id]
                
                if len(donor_reports) > 0:
                    # Combine all report texts for this donor
                    combined_text = ' '.join(donor_reports['Report_Text'].fillna('').astype(str))
                    donor_texts.append(combined_text)
                else:
                    # Use empty text if no contact reports
                    donor_texts.append('')
            
            # Extract embeddings
            bert_embeddings = extractor.extract_embeddings(donor_texts, batch_size=16)
            
            print(f"BERT embeddings shape: {bert_embeddings.shape}")
            return bert_embeddings
            
        else:
            print("No trained BERT model found. Training new model...")
            
            # Run the full BERT pipeline to train and extract embeddings
            bert_results = run_bert_pipeline_on_contact_reports(
                data_dir="synthetic_donor_dataset",
                model_choice='bert',
                batch_size=16,
                epochs=3,  # Reduced for faster training
                learning_rate=2e-5
            )
            
            # Extract embeddings using the trained model
            extractor = bert_results['extractor']
            
            # Get contact report texts for donors
            donor_texts = []
            for donor_id in donors_df['ID']:
                donor_reports = contact_reports_df[contact_reports_df['Donor_ID'] == donor_id]
                if len(donor_reports) > 0:
                    combined_text = ' '.join(donor_reports['Report_Text'].fillna('').astype(str))
                    donor_texts.append(combined_text)
                else:
                    donor_texts.append('')
            
            bert_embeddings = extractor.extract_embeddings(donor_texts, batch_size=16)
            print(f"BERT embeddings shape: {bert_embeddings.shape}")
            return bert_embeddings
            
    except Exception as e:
        print(f"❌ Error loading BERT embeddings: {e}")
        import traceback
        print("Full error traceback:")
        traceback.print_exc()
        print("\\n⚠️  Falling back to dummy embeddings (random noise)...")
        print("WARNING: Model will not use real text features!")
        # Fallback to dummy embeddings
        rng = np.random.default_rng(42)
        bert_embeddings = rng.standard_normal((len(donors_df), 768)).astype(np.float32)
        print(f"BERT embeddings shape: {bert_embeddings.shape}")
        return bert_embeddings

def load_gnn_embeddings(donors_df, relationships_df):
    """Load or generate GNN embeddings from the trained model"""
    print("Loading GNN embeddings...")
    
    try:
        # Import GNN pipeline components
        import sys
        if 'src' not in sys.path:
            sys.path.append('src')
        
        # Try importing - handle different module structures
        try:
            from gnn_models.gnn_pipeline import main_gnn_pipeline
            from gnn_models.gnn_analysis import get_node_embeddings
            print("✅ Imported GNN modules successfully")
        except (ImportError, ModuleNotFoundError) as import_error:
            print(f"Warning: {import_error}")
            print("Trying alternative import...")
            import gnn_models.gnn_pipeline as gnn_pipeline_module
            import gnn_models.gnn_analysis as gnn_analysis_module
            main_gnn_pipeline = gnn_pipeline_module.main_gnn_pipeline
            get_node_embeddings = gnn_analysis_module.get_node_embeddings
            print("✅ Imported GNN modules via alternative path")
        
        # Check if we have relationships data
        if relationships_df.empty:
            print("No relationship data found. Using dummy GNN embeddings...")
            rng = np.random.default_rng(42)
            gnn_embeddings = rng.standard_normal((len(donors_df), 64)).astype(np.float32)
            print(f"GNN embeddings shape: {gnn_embeddings.shape}")
            return gnn_embeddings
        
        # Run GNN pipeline to get embeddings
        print("Running GNN pipeline to generate embeddings...")
        gnn_results = main_gnn_pipeline(
            donors_df=donors_df,
            relationships_df=relationships_df,
            contact_reports_df=None,  # Not needed for GNN
            giving_history_df=None    # Not needed for GNN
        )
        
        # Extract embeddings
        gnn_embeddings = gnn_results['embeddings']
        print(f"GNN embeddings shape: {gnn_embeddings.shape}")
        
        # Ensure embeddings match donor count
        if len(gnn_embeddings) != len(donors_df):
            print(f"Warning: GNN embeddings count ({len(gnn_embeddings)}) doesn't match donor count ({len(donors_df)})")
            # Pad or truncate as needed
            if len(gnn_embeddings) < len(donors_df):
                # Pad with zeros
                padding = np.zeros((len(donors_df) - len(gnn_embeddings), gnn_embeddings.shape[1]))
                gnn_embeddings = np.vstack([gnn_embeddings, padding])
            else:
                # Truncate
                gnn_embeddings = gnn_embeddings[:len(donors_df)]
        
        return gnn_embeddings
        
    except Exception as e:
        print(f"❌ Error loading GNN embeddings: {e}")
        import traceback
        print("Full error traceback:")
        traceback.print_exc()
        print("\\n⚠️  Falling back to dummy embeddings (random noise)...")
        print("WARNING: Model will not use real graph features!")
        # Fallback to dummy embeddings
        rng = np.random.default_rng(42)
        gnn_embeddings = rng.standard_normal((len(donors_df), 64)).astype(np.float32)
        print(f"GNN embeddings shape: {gnn_embeddings.shape}")
        return gnn_embeddings

def load_actual_embeddings(donors_df, contact_reports_df):
    """Load actual embeddings from BERT and GNN pipelines"""
    print("Loading actual embeddings from trained models...")
    
    # Load relationships data for GNN
    relationships_df = pd.read_csv('synthetic_donor_dataset/relationships.csv')
    
    # Load BERT embeddings
    bert_embeddings = load_bert_embeddings(donors_df, contact_reports_df)
    
    # Load GNN embeddings  
    gnn_embeddings = load_gnn_embeddings(donors_df, relationships_df)
    
    return bert_embeddings, gnn_embeddings

def main():
    """Main execution function"""
    print("=" * 80)
    print("MULTIMODAL DONOR LEGACY INTENT PREDICTION")
    print("=" * 80)
    
    # Load data
    donors_df, contact_reports_df = load_data()
    
    # Load actual embeddings from trained models
    bert_embeddings, gnn_embeddings = load_actual_embeddings(donors_df, contact_reports_df)
    
    # Import multimodal architecture
    from multimodal_arch import run_multimodal_fusion_pipeline
    
    # Run the multimodal fusion pipeline
    print("\\n" + "=" * 60)
    print("STARTING MULTIMODAL FUSION PIPELINE")
    print("=" * 60)
    
    results = run_multimodal_fusion_pipeline(
        donors_df=donors_df,
        contact_reports_df=contact_reports_df,
        bert_embeddings=bert_embeddings,  # From BERT pipeline
        gnn_embeddings=gnn_embeddings,     # From GNN pipeline
        target_column='Legacy_Intent_Binary',
        batch_size=32,
        epochs=50,
        learning_rate=1e-3
    )
    
    # Access results
    test_accuracy = results['results']['test_accuracy']
    test_auc = results['results']['test_auc']
    
    print("\\n" + "=" * 60)
    print("FINAL RESULTS SUMMARY")
    print("=" * 60)
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print(f"Test AUC: {test_auc:.4f}" if test_auc is not None else "Test AUC: N/A")
    print("Model saved as: multimodal_fusion_model.pt")
    print("Best model saved as: best_multimodal_model.pt")

if __name__ == "__main__":
    main()
''')

print("✅ Created colab_multimodal_pipeline.py")


## 5. Run the Multimodal Pipeline


In [None]:
# Fix the column issue before running the pipeline
print("🔧 Applying column fix...")

# Update the multimodal_arch.py file to handle missing columns
fix_code = '''
# Fix for missing columns in multimodal_arch.py
import os

# Read the current file
with open('src/multimodal_arch.py', 'r') as f:
    content = f.read()

# Replace the problematic section
old_section = """    # 1. Tabular features
    tabular_cols = [
        'Lifetime_Giving', 'Last_Gift', 'Consecutive_Yr_Giving_Count',
        'Total_Yr_Giving_Count', 'Engagement_Score', 'Legacy_Intent_Probability',
        'Estimated_Age'
    ]"""

new_section = """    # 1. Tabular features (using available columns)
    tabular_cols = [
        'Lifetime_Giving', 'Engagement_Score', 'Estimated_Age'
    ]
    
    # Check which columns are available and add them
    available_cols = []
    for col in tabular_cols:
        if col in donors_df.columns:
            available_cols.append(col)
        else:
            print(f"Warning: Column '{col}' not found in dataset")
    
    # Add any additional numeric columns that might be useful
    numeric_cols = donors_df.select_dtypes(include=[np.number]).columns.tolist()
    for col in numeric_cols:
        if col not in available_cols and col != 'ID' and col != 'Legacy_Intent_Binary':
            available_cols.append(col)
    
    print(f"Using tabular columns: {available_cols}")
    tabular_cols = available_cols"""

# Apply the fix
if old_section in content:
    content = content.replace(old_section, new_section)
    with open('src/multimodal_arch.py', 'w') as f:
        f.write(content)
    print("✅ Applied column fix to multimodal_arch.py")
else:
    print("⚠️ Fix already applied or section not found")
'''

exec(fix_code)

# Run the complete multimodal pipeline
print("🚀 Starting multimodal pipeline...")
exec(open('colab_multimodal_pipeline.py').read())


## 6. Download Results

Download the trained models and results for use outside of Colab.
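
When several output files exist, bundling them into one archive means a single download instead of one browser prompt per file. The sketch below zips whichever files are present; `bundle_existing` and the archive name `pipeline_outputs.zip` are hypothetical, and the demo writes a placeholder file to a temporary directory in place of a real model.

```python
import os
import tempfile
import zipfile


def bundle_existing(paths, archive="pipeline_outputs.zip"):
    """Zip whichever of `paths` exist and return the paths that were added."""
    added = []
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            if os.path.exists(path):
                zf.write(path, arcname=os.path.basename(path))
                added.append(path)
    return added


# Demo with a placeholder file standing in for a trained model.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "best_multimodal_model.pt")
    with open(present, "wb") as fh:
        fh.write(b"placeholder")
    absent = os.path.join(tmp, "multimodal_fusion_model.pt")
    archive_path = os.path.join(tmp, "pipeline_outputs.zip")
    added = bundle_existing([present, absent], archive=archive_path)
    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()

print(names)  # ['best_multimodal_model.pt']
```

In Colab, the resulting archive could then be passed to `files.download` in the same way as the individual files.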


In [None]:
# Download trained models and results
files_to_download = [
    'multimodal_fusion_model.pt',
    'best_multimodal_model.pt',
    'best_contact_classifier.pt'  # If retrained
]

print("📥 Downloading trained models:")
for file in files_to_download:
    if os.path.exists(file):
        files.download(file)
        print(f"✅ Downloaded {file}")
    else:
        print(f"❌ {file} not found")

print("\n🎉 Pipeline execution complete!")


## 🔧 Troubleshooting

### Common Issues:

1. **Import Errors**: Make sure all Python files are uploaded to the correct directories
2. **Memory Issues**: Reduce batch size or use fewer epochs
3. **CUDA Errors**: The code falls back to CPU automatically when no GPU is detected at startup
4. **Missing Files**: Ensure all dataset files are uploaded
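
For memory issues, one option is to retry with a smaller batch size instead of editing the value by hand. The sketch below halves the batch size after each out-of-memory error; `run_with_backoff` and `fake_train` are hypothetical names, and the check assumes the error is a `RuntimeError` whose message contains "out of memory", which is how PyTorch typically reports CUDA memory exhaustion.

```python
def run_with_backoff(train_fn, batch_size, min_batch_size=1):
    """Call train_fn(batch_size), halving the batch size after out-of-memory errors."""
    while True:
        try:
            return batch_size, train_fn(batch_size)
        except RuntimeError as err:
            # Only retry on out-of-memory errors; re-raise anything else,
            # and give up once the minimum batch size has been tried.
            if "out of memory" not in str(err).lower() or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)


def fake_train(batch_size):
    # Stand-in for a real training call: pretend only batches <= 8 fit.
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory (simulated)")
    return "ok"


print(run_with_backoff(fake_train, 32))  # (8, 'ok')
```

In a real session, `train_fn` would wrap the call that runs the fusion pipeline with the given batch size.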

### Performance Tips:
- Use GPU runtime for faster training
- Reduce epochs for quicker testing
- Use smaller batch sizes if memory is limited

### Support:
If you encounter issues, check:
- All required files are uploaded
- Python packages are installed correctly
- Dataset files are in the correct format
