# ProPublica Donor Data Analysis

This notebook analyzes the IRS 990 data collected from ProPublica to identify potential funders for nonprofits.

## What This Notebook Does

1. **Loads Data**: Imports the datasets collected by `collect_data.py`
2. **Analyzes Funders**: Examines donor-advised funds and private foundations
3. **Grant Analysis**: Studies historical grant patterns and amounts
4. **Funder Matching**: Identifies potential funders based on keywords and criteria
5. **Visualizations**: Creates charts and graphs to understand the data
6. **Export Results**: Generates prospect lists and reports

## Prerequisites

Before running this notebook:
1. Run `collect_data.py` to gather the data
2. Ensure the collected CSV files are in the `data/` folder, relative to the notebook's working directory
3. Customize the focus area keywords for your nonprofit

---

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import json
import glob
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

# Configure pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

print("✅ Libraries imported successfully!")
print(f"📁 Current directory: {os.getcwd()}")
print(f"📊 Available data files: {len(glob.glob('data/*.csv'))} CSV files")

## 1. Load Data

Let's load the most recent datasets collected by the data collection script.
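
The loader picks the newest file by sorting filenames, which only works if the timestamp in each name is zero-padded so that string order matches chronological order. The sketch below illustrates that selection step in isolation. The helper name `latest_file` and the `YYYYMMDD_HHMMSS` suffix are assumptions for illustration (the export cell in this notebook uses that format; `collect_data.py` may differ).

```python
import glob
import os
import tempfile

def latest_file(pattern):
    """Return the lexicographically last path matching pattern, or None.

    Assumes filenames end in a zero-padded timestamp such as
    YYYYMMDD_HHMMSS, so string order equals chronological order.
    """
    matches = sorted(glob.glob(pattern))
    return matches[-1] if matches else None

# Demonstration with empty throwaway files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for stamp in ("20240105_093000", "20231231_235959", "20240301_120000"):
        open(os.path.join(tmp, f"grants_data_{stamp}.csv"), "w").close()

    newest = latest_file(os.path.join(tmp, "grants_data_*.csv"))
    newest_name = os.path.basename(newest)
    # No files match this pattern, so the helper returns None
    missing = latest_file(os.path.join(tmp, "private_foundations_*.csv"))
```

Returning `None` for a missing file type lets the caller decide whether that file is required (funds and foundations) or optional (grants).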

In [None]:
# Function to load the most recent datasets
def load_latest_data():
    """Load the most recent data files from the data directory"""
    
    # Find the most recent files
    donor_files = glob.glob('data/donor_advised_funds_*.csv')
    foundation_files = glob.glob('data/private_foundations_*.csv')
    grants_files = glob.glob('data/grants_data_*.csv')
    
    if not donor_files or not foundation_files:
        print("❌ No data files found!")
        print("💡 Run 'python collect_data.py' first to collect data")
        return None, None, None
    
    # Get the most recent files (sorted by filename which includes timestamp)
    latest_donor_file = sorted(donor_files)[-1]
    latest_foundation_file = sorted(foundation_files)[-1]
    latest_grants_file = sorted(grants_files)[-1] if grants_files else None
    
    print(f"📊 Loading data files:")
    print(f"   • Donor-advised funds: {latest_donor_file}")
    print(f"   • Private foundations: {latest_foundation_file}")
    if latest_grants_file:
        print(f"   • Grants data: {latest_grants_file}")
    
    # Load the data
    donor_funds_df = pd.read_csv(latest_donor_file)
    foundations_df = pd.read_csv(latest_foundation_file)
    grants_df = pd.read_csv(latest_grants_file) if latest_grants_file else pd.DataFrame()
    
    print(f"\n✅ Data loaded successfully!")
    print(f"   • {len(donor_funds_df)} donor-advised funds")
    print(f"   • {len(foundations_df)} private foundations")
    print(f"   • {len(grants_df)} grant records")
    
    return donor_funds_df, foundations_df, grants_df

# Load the data
donor_funds_df, foundations_df, grants_df = load_latest_data()

In [None]:
# Quick data overview
if donor_funds_df is not None:
    print("=" * 60)
    print("📊 DATASET OVERVIEW")
    print("=" * 60)

    print("\n🏦 DONOR-ADVISED FUNDS")
    print(f"Total organizations: {len(donor_funds_df)}")
    if 'revenue_amount' in donor_funds_df.columns:
        print(f"Average revenue: ${donor_funds_df['revenue_amount'].mean():,.0f}")
        print(f"Total revenue: ${donor_funds_df['revenue_amount'].sum():,.0f}")

    print("\n🏛️ PRIVATE FOUNDATIONS")
    print(f"Total organizations: {len(foundations_df)}")
    if 'asset_amount' in foundations_df.columns:
        print(f"Average assets: ${foundations_df['asset_amount'].mean():,.0f}")
        print(f"Total assets: ${foundations_df['asset_amount'].sum():,.0f}")

    if not grants_df.empty:
        print("\n💰 GRANTS DATA")
        print(f"Total grant records: {len(grants_df)}")
        print(f"Total grant amount: ${grants_df['amount'].sum():,.2f}")
        print(f"Average grant: ${grants_df['amount'].mean():,.2f}")
        print(f"Median grant: ${grants_df['amount'].median():,.2f}")

    print("\n📍 GEOGRAPHIC DISTRIBUTION")
    all_orgs = pd.concat([donor_funds_df, foundations_df], ignore_index=True)
    if 'state' in all_orgs.columns:
        top_states = all_orgs['state'].value_counts().head()
        for state, count in top_states.items():
            print(f"   {state}: {count} organizations")
    
    print("=" * 60)

## 2. Configure Your Nonprofit Focus Area

**🎯 CUSTOMIZE THIS SECTION FOR YOUR ORGANIZATION**

Choose keywords that match your nonprofit's mission and programs. This will help identify funders who support similar causes.
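
Keywords are matched as case-insensitive substrings of the grant text, so short keywords can match inside unrelated words: `STEM` also matches "system", and `arts` matches "parts". The sketch below shows one way to tighten this with `re.escape` and `\b` word boundaries. The helper `build_keyword_pattern` is hypothetical and is not used by the cells in this notebook.

```python
import re

def build_keyword_pattern(keywords):
    """Compile a case-insensitive, whole-word pattern from literal keywords."""
    # re.escape keeps any regex metacharacters in a keyword literal
    escaped = [re.escape(k) for k in keywords]
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

pattern = build_keyword_pattern(["STEM", "mental health", "school"])

# Plain substring search: "stem" is found inside "system"
substring_hit = re.search("stem", "Upgrade HVAC system", re.IGNORECASE) is not None

# Whole-word search: "system" no longer counts as a STEM grant
whole_word_hit = pattern.search("Upgrade HVAC system") is not None

# Multi-word keywords still match as phrases
phrase_hit = pattern.search("Youth Mental Health outreach") is not None

# Trade-off: plurals are missed unless listed as separate keywords
plural_hit = pattern.search("Local schools fund") is not None
```

Whole-word matching trades recall for precision, so plural and derived forms ("schools", "educational") need their own entries. The string in `pattern.pattern` can be passed to `Series.str.contains(..., case=False, na=False)`.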

In [None]:
# 🎯 FOCUS AREA CONFIGURATIONS
# Choose the one that matches your nonprofit or create your own

FOCUS_AREAS = {
    'education': {
        'keywords': ['education', 'school', 'student', 'learning', 'teacher', 'classroom', 
                    'literacy', 'scholarship', 'academic', 'curriculum', 'STEM', 'university', 'college'],
        'description': 'Educational institutions and programs'
    },
    'health': {
        'keywords': ['health', 'medical', 'healthcare', 'hospital', 'clinic', 'patient', 
                    'disease', 'treatment', 'research', 'mental health', 'wellness', 'therapy'],
        'description': 'Health and medical services'
    },
    'arts': {
        'keywords': ['arts', 'culture', 'music', 'theater', 'theatre', 'dance', 'visual arts', 
                    'museum', 'gallery', 'performance', 'artist', 'creative', 'cultural'],
        'description': 'Arts and cultural organizations'
    },
    'environment': {
        'keywords': ['environment', 'environmental', 'conservation', 'sustainability', 'climate', 
                    'wildlife', 'renewable', 'green', 'ecosystem', 'pollution', 'nature'],
        'description': 'Environmental and conservation efforts'
    },
    'social_services': {
        'keywords': ['social services', 'community', 'homeless', 'housing', 'poverty', 'food bank', 
                    'shelter', 'human services', 'family services', 'youth', 'seniors'],
        'description': 'Social and community services'
    }
}

# 📝 SELECT YOUR FOCUS AREA
# Change this to match your nonprofit's mission
CHOSEN_FOCUS = 'education'  # ← CHANGE THIS

# You can also add custom keywords
CUSTOM_KEYWORDS = []  # Add your own keywords here if needed

# Get the selected focus area
if CHOSEN_FOCUS in FOCUS_AREAS:
    focus_config = FOCUS_AREAS[CHOSEN_FOCUS]
    # Copy the list so custom keywords do not modify FOCUS_AREAS in place
    chosen_keywords = list(focus_config['keywords'])
    focus_description = focus_config['description']

    print(f"🎯 Selected focus area: {CHOSEN_FOCUS.title()}")
    print(f"📝 Description: {focus_description}")
    print(f"🔑 Keywords: {', '.join(chosen_keywords[:10])}...")  # Show first 10

    # Extend only when a valid focus area defined chosen_keywords
    if CUSTOM_KEYWORDS:
        chosen_keywords.extend(CUSTOM_KEYWORDS)
        print(f"✅ Added {len(CUSTOM_KEYWORDS)} custom keywords")
else:
    print(f"❌ Invalid focus area selected. Choose one of: {', '.join(FOCUS_AREAS)}")

## 3. Grant Analysis Functions

These functions help us analyze the grants data to find relevant funders.
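
Each search function follows the same three steps: filter grants by keyword, drop grants below a minimum amount, then summarize per grantor. The sketch below runs those steps on a few invented rows (all names and amounts are made up for illustration). It uses pandas named aggregation, which is an alternative to renaming the aggregated columns by hand; it is not the notebook's own implementation.

```python
import pandas as pd

# Invented sample rows; column names follow the ones used in this notebook
grants = pd.DataFrame({
    "grantor_ein": [111111111, 111111111, 222222222, 222222222],
    "grantor_name": ["Alpha Fund", "Alpha Fund",
                     "Beta Foundation", "Beta Foundation"],
    "purpose": ["Literacy program", "General support",
                "School library", "STEM scholarship"],
    "amount": [10000, 50000, 7500, 2500],
    "filing_year": [2021, 2022, 2022, 2023],
})

keywords = ["literacy", "school", "scholarship"]

# Step 1: keyword match on the purpose text (missing purposes never match)
mask = grants["purpose"].str.contains("|".join(keywords), case=False, na=False)

# Step 2: keep only matching grants at or above the minimum amount
matching = grants[mask & (grants["amount"] >= 5000)]

# Step 3: one row per grantor, largest total first
summary = (
    matching.groupby(["grantor_ein", "grantor_name"], as_index=False)
    .agg(
        total_grants=("amount", "sum"),
        grant_count=("amount", "count"),
        avg_grant=("amount", "mean"),
        latest_filing=("filing_year", "max"),
    )
    .sort_values("total_grants", ascending=False)
    .reset_index(drop=True)
)
```

Here the 50,000 "General support" grant is excluded because its purpose matches no keyword, and the 2,500 scholarship is excluded by the minimum amount, so each grantor ends up with one counted grant.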

In [None]:
def find_funders_by_keywords(grants_df, keywords, min_grant_amount=5000):
    """Find funders that have made grants related to specific keywords"""
    if grants_df.empty:
        return pd.DataFrame()
    
    # Create keyword pattern
    keyword_pattern = '|'.join(keywords)
    
    # Search in purpose field
    if 'purpose' in grants_df.columns:
        mask = grants_df['purpose'].str.contains(keyword_pattern, case=False, na=False)
        matching_grants = grants_df[mask]
    else:
        matching_grants = pd.DataFrame()
    
    if matching_grants.empty:
        return pd.DataFrame()
    
    # Filter by minimum amount
    matching_grants = matching_grants[matching_grants['amount'] >= min_grant_amount]
    
    # Summarize by grantor
    summary = matching_grants.groupby(['grantor_ein', 'grantor_name', 'grantor_type']).agg({
        'amount': ['sum', 'count', 'mean'],
        'filing_year': 'max'
    }).round(2)
    
    summary.columns = ['total_grants', 'grant_count', 'avg_grant', 'latest_filing']
    summary = summary.reset_index()
    summary = summary.sort_values('total_grants', ascending=False)
    
    return summary

def find_funders_by_recipient_type(grants_df, recipient_keywords, min_grant_amount=5000):
    """Find funders that give to specific types of organizations"""
    if grants_df.empty:
        return pd.DataFrame()
    
    # Create keyword pattern
    keyword_pattern = '|'.join(recipient_keywords)
    
    # Search in recipient names
    if 'recipient_name' in grants_df.columns:
        mask = grants_df['recipient_name'].str.contains(keyword_pattern, case=False, na=False)
        matching_grants = grants_df[mask]
    else:
        matching_grants = pd.DataFrame()
    
    if matching_grants.empty:
        return pd.DataFrame()
    
    # Filter by minimum amount
    matching_grants = matching_grants[matching_grants['amount'] >= min_grant_amount]
    
    # Summarize by grantor
    summary = matching_grants.groupby(['grantor_ein', 'grantor_name', 'grantor_type']).agg({
        'amount': ['sum', 'count', 'mean'],
        'filing_year': 'max'
    }).round(2)
    
    summary.columns = ['total_grants', 'grant_count', 'avg_grant', 'latest_filing']
    summary = summary.reset_index()
    summary = summary.sort_values('total_grants', ascending=False)
    
    return summary

def analyze_funder_patterns(grants_df, grantor_ein):
    """Analyze the grant-making patterns of a specific funder"""
    grantor_grants = grants_df[grants_df['grantor_ein'] == grantor_ein]
    
    if grantor_grants.empty:
        return {}
    
    analysis = {
        'grantor_name': grantor_grants['grantor_name'].iloc[0],
        'total_grants': len(grantor_grants),
        'total_amount': grantor_grants['amount'].sum(),
        'avg_grant': grantor_grants['amount'].mean(),
        'median_grant': grantor_grants['amount'].median(),
        'grant_range': (grantor_grants['amount'].min(), grantor_grants['amount'].max()),
        'years_active': grantor_grants['filing_year'].nunique(),
        'year_range': (grantor_grants['filing_year'].min(), grantor_grants['filing_year'].max()),
    }
    
    # Top recipients
    if 'recipient_name' in grantor_grants.columns:
        top_recipients = grantor_grants.groupby('recipient_name')['amount'].sum().sort_values(ascending=False).head(5)
        analysis['top_recipients'] = top_recipients.to_dict()
    
    return analysis

print("✅ Analysis functions defined!")

## 4. Data Visualization

Let's create visualizations to understand our potential funder landscape.
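
Revenue, asset, and grant amounts are usually heavily skewed, which is why the histogram panels use a logarithmic x-axis. On a log axis, equal-width linear bins appear very uneven, and zero or negative values cannot be drawn at all. The sketch below computes log-spaced bin edges from the positive values only. The helper `log_bins` is a hypothetical addition and is not called by the dashboard cell.

```python
import numpy as np

def log_bins(values, n_bins=20):
    """Return n_bins + 1 log-spaced edges covering the positive values."""
    values = np.asarray(values, dtype=float)
    positive = values[np.isfinite(values) & (values > 0)]
    if positive.size == 0:
        return None  # nothing can be drawn on a log axis
    lo, hi = positive.min(), positive.max()
    if lo == hi:
        hi = lo * 10  # widen a degenerate range so edges stay increasing
    edges = np.logspace(np.log10(lo), np.log10(hi), n_bins + 1)
    # Pin the endpoints so floating-point drift cannot exclude min or max
    edges[0], edges[-1] = lo, hi
    return edges

amounts = [0, 500, 5_000, 50_000, 5_000_000, float("nan")]
edges = log_bins(amounts, n_bins=4)

# Histogram only the positive values; zeros and NaN are left out
positive_amounts = [a for a in amounts if a > 0]
counts, _ = np.histogram(positive_amounts, bins=edges)
```

The resulting `edges` array can be passed as `bins=edges` to a Matplotlib `hist` call, together with the positive subset of the data.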

In [None]:
# Create comprehensive visualizations
if donor_funds_df is not None and not donor_funds_df.empty:
    
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    fig.suptitle('ProPublica Donor Data Analysis Dashboard', fontsize=16, fontweight='bold')
    
    # 1. Geographic distribution of all organizations
    all_orgs = pd.concat([donor_funds_df, foundations_df], ignore_index=True)
    if 'state' in all_orgs.columns:
        top_states = all_orgs['state'].value_counts().head(10)
        axes[0, 0].bar(top_states.index, top_states.values)
        axes[0, 0].set_title('Organizations by State')
        axes[0, 0].set_xlabel('State')
        axes[0, 0].set_ylabel('Count')
        axes[0, 0].tick_params(axis='x', rotation=45)
    
    # 2. Organization type distribution (skipped if the column is missing)
    if 'organization_type' in all_orgs.columns:
        type_counts = all_orgs['organization_type'].value_counts()
        axes[0, 1].pie(type_counts.values, labels=type_counts.index, autopct='%1.1f%%')
        axes[0, 1].set_title('Organization Types')
    
    # 3. Revenue distribution (donor-advised funds)
    if 'revenue_amount' in donor_funds_df.columns:
        revenue_data = donor_funds_df['revenue_amount'].dropna()
        if not revenue_data.empty:
            axes[0, 2].hist(revenue_data, bins=20, alpha=0.7, edgecolor='black')
            axes[0, 2].set_title('Revenue Distribution - Donor-Advised Funds')
            axes[0, 2].set_xlabel('Revenue ($)')
            axes[0, 2].set_ylabel('Frequency')
            axes[0, 2].set_xscale('log')
    
    # 4. Asset distribution (private foundations)
    if 'asset_amount' in foundations_df.columns:
        asset_data = foundations_df['asset_amount'].dropna()
        if not asset_data.empty:
            axes[1, 0].hist(asset_data, bins=20, alpha=0.7, color='orange', edgecolor='black')
            axes[1, 0].set_title('Asset Distribution - Private Foundations')
            axes[1, 0].set_xlabel('Assets ($)')
            axes[1, 0].set_ylabel('Frequency')
            axes[1, 0].set_xscale('log')
    
    # 5. Grant amount distribution
    if not grants_df.empty:
        grant_amounts = grants_df['amount'].dropna()
        if not grant_amounts.empty:
            axes[1, 1].hist(grant_amounts, bins=30, alpha=0.7, color='green', edgecolor='black')
            axes[1, 1].set_title('Grant Amount Distribution')
            axes[1, 1].set_xlabel('Grant Amount ($)')
            axes[1, 1].set_ylabel('Frequency')
            axes[1, 1].set_xscale('log')
    
    # 6. Top grantors by total amount
    if not grants_df.empty:
        top_grantors = grants_df.groupby('grantor_name')['amount'].sum().sort_values(ascending=False).head(8)
        if not top_grantors.empty:
            y_pos = np.arange(len(top_grantors))
            axes[1, 2].barh(y_pos, top_grantors.values)
            axes[1, 2].set_yticks(y_pos)
            axes[1, 2].set_yticklabels([name[:25] + '...' if len(name) > 25 else name 
                                       for name in top_grantors.index])
            axes[1, 2].set_title('Top Grantors by Total Amount')
            axes[1, 2].set_xlabel('Total Grant Amount ($)')
    
    plt.tight_layout()
    plt.show()
    
else:
    print("❌ No data available for visualization")
    print("💡 Run 'python collect_data.py' first to collect data")

## 5. Find Relevant Funders

Now let's find funders that align with your chosen focus area.

In [None]:
# Analyze grants data for relevant funders
if not grants_df.empty and 'chosen_keywords' in locals():
    
    print(f"🔍 Analyzing grants for {CHOSEN_FOCUS} funders...")
    print(f"🎯 Using keywords: {', '.join(chosen_keywords[:5])}...")
    
    # Find funders by grant keywords
    relevant_funders = find_funders_by_keywords(
        grants_df, 
        chosen_keywords, 
        min_grant_amount=5000
    )
    
    print(f"\n✅ Found {len(relevant_funders)} funders with {CHOSEN_FOCUS} grants")
    
    if not relevant_funders.empty:
        print(f"\n🏆 Top {CHOSEN_FOCUS.title()} Funders:")
        display(relevant_funders.head(10))
        
        # Analyze the top funder in detail
        if len(relevant_funders) > 0:
            top_funder_ein = relevant_funders.iloc[0]['grantor_ein']
            top_funder_analysis = analyze_funder_patterns(grants_df, top_funder_ein)
            
            print(f"\n🔬 Detailed Analysis of Top Funder:")
            print(f"   Name: {top_funder_analysis.get('grantor_name', 'Unknown')}")
            print(f"   Total grants made: {top_funder_analysis.get('total_grants', 0)}")
            print(f"   Total amount granted: ${top_funder_analysis.get('total_amount', 0):,.2f}")
            print(f"   Average grant: ${top_funder_analysis.get('avg_grant', 0):,.2f}")
            print(f"   Grant range: ${top_funder_analysis.get('grant_range', (0,0))[0]:,.0f} - ${top_funder_analysis.get('grant_range', (0,0))[1]:,.0f}")
            print(f"   Years active: {top_funder_analysis.get('years_active', 0)}")
            
            if 'top_recipients' in top_funder_analysis:
                print(f"\n   Top Recipients:")
                for recipient, amount in list(top_funder_analysis['top_recipients'].items())[:5]:
                    print(f"     • {recipient}: ${amount:,.2f}")
    
    else:
        print(f"❌ No funders found with {CHOSEN_FOCUS} grants in the dataset")
        print("💡 Try expanding keywords or checking the data collection")
    
    # Also search by recipient type
    recipient_keywords = FOCUS_AREAS[CHOSEN_FOCUS]['keywords'][:5]  # Use subset for recipient search
    recipient_funders = find_funders_by_recipient_type(
        grants_df,
        recipient_keywords,
        min_grant_amount=5000
    )
    
    if not recipient_funders.empty:
        print(f"\n🎯 Funders Supporting {CHOSEN_FOCUS.title()} Organizations:")
        display(recipient_funders.head(5))

else:
    if grants_df.empty:
        print("❌ No grants data available for analysis")
    else:
        print("⚠️ Focus area not configured properly")

## 6. Export Results

Let's create prospect lists and reports for fundraising outreach.
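
The targeted funder report matches grant summaries to the organization list by EIN. EINs are nine-digit identifiers, and when a CSV column of EINs is parsed as integers, leading zeros are lost; some sources also write them with a hyphen (`12-3456789`). If the grants file and the organization files store EINs differently, the equality comparison in the export cell finds no match and the report comes out empty. The sketch below normalizes EINs to a nine-digit string. The helper `normalize_ein` is hypothetical and not part of `collect_data.py`.

```python
import re

def normalize_ein(value):
    """Return the EIN as a 9-digit string, or None if it cannot be one."""
    if value is None:
        return None
    text = str(value).strip()
    # Floats such as 43210987.0 appear when a numeric column contains NaN
    if re.fullmatch(r"\d+\.0", text):
        text = text[:-2]
    digits = text.replace("-", "")
    if not digits.isdigit() or len(digits) > 9:
        return None
    # Restore leading zeros dropped by integer parsing
    return digits.zfill(9)

from_int = normalize_ein(43210987)
from_hyphenated = normalize_ein("04-3210987")
from_float = normalize_ein(43210987.0)
from_nan = normalize_ein(float("nan"))
too_long = normalize_ein("1234567890")
```

One way to use it is to map both `all_orgs['ein']` and `relevant_funders['grantor_ein']` through the helper before comparing them. Alternatively, passing `dtype={'ein': str}` to `pd.read_csv` keeps the values as text from the start, assuming the CSV itself still contains the leading zeros.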

In [None]:
# Create comprehensive reports for your fundraising efforts
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

if donor_funds_df is not None and not donor_funds_df.empty:
    
    # 1. Create master prospect list
    all_orgs = pd.concat([donor_funds_df, foundations_df], ignore_index=True)
    
    # Select key columns for prospect list
    prospect_columns = ['name', 'city', 'state', 'ein', 'organization_type']
    if 'revenue_amount' in all_orgs.columns:
        prospect_columns.append('revenue_amount')
    if 'asset_amount' in all_orgs.columns:
        prospect_columns.append('asset_amount')
    
    prospect_list = all_orgs[prospect_columns].copy()
    
    # Sort by revenue/assets (whichever is available), missing values last
    if 'revenue_amount' in prospect_list.columns:
        prospect_list = prospect_list.sort_values('revenue_amount', ascending=False, na_position='last')
    elif 'asset_amount' in prospect_list.columns:
        prospect_list = prospect_list.sort_values('asset_amount', ascending=False, na_position='last')
    
    # Save prospect list
    prospect_filename = f'data/prospect_list_{CHOSEN_FOCUS}_{timestamp}.csv'
    prospect_list.to_csv(prospect_filename, index=False)
    print(f"📋 Saved prospect list: {prospect_filename}")
    
    # 2. Create focused funder report
    if not grants_df.empty and 'relevant_funders' in locals() and not relevant_funders.empty:
        
        # Enhanced funder report with location and organization type
        enhanced_funders = relevant_funders.copy()
        
        # Add organization details
        funder_details = []
        for _, funder in enhanced_funders.iterrows():
            ein = funder['grantor_ein']
            org_info = all_orgs[all_orgs['ein'] == ein]
            
            if not org_info.empty:
                org_data = org_info.iloc[0]
                funder_details.append({
                    'grantor_name': funder['grantor_name'],
                    'grantor_ein': ein,
                    'city': org_data.get('city', ''),
                    'state': org_data.get('state', ''),
                    'organization_type': org_data.get('organization_type', ''),
                    'total_grants': funder['total_grants'],
                    'grant_count': funder['grant_count'],
                    'avg_grant': funder['avg_grant'],
                    'latest_filing': funder['latest_filing']
                })
        
        funder_report_df = pd.DataFrame(funder_details)
        funder_report_filename = f'data/targeted_funders_{CHOSEN_FOCUS}_{timestamp}.csv'
        funder_report_df.to_csv(funder_report_filename, index=False)
        print(f"🎯 Saved targeted funders report: {funder_report_filename}")
    
    # 3. Create Excel workbook with multiple sheets
    excel_filename = f'data/comprehensive_analysis_{CHOSEN_FOCUS}_{timestamp}.xlsx'
    
    with pd.ExcelWriter(excel_filename, engine='openpyxl') as writer:
        
        # All prospects
        prospect_list.to_excel(writer, sheet_name='All_Prospects', index=False)
        
        # Donor-advised funds
        donor_funds_df.to_excel(writer, sheet_name='Donor_Advised_Funds', index=False)
        
        # Private foundations
        foundations_df.to_excel(writer, sheet_name='Private_Foundations', index=False)
        
        # Targeted funders (if available)
        if 'relevant_funders' in locals() and not relevant_funders.empty:
            relevant_funders.to_excel(writer, sheet_name='Targeted_Funders', index=False)
        
        # Grants data sample (first 1000 rows to avoid huge files)
        if not grants_df.empty:
            grants_sample = grants_df.head(1000)
            grants_sample.to_excel(writer, sheet_name='Sample_Grants', index=False)
    
    print(f"📊 Saved comprehensive Excel report: {excel_filename}")
    
    # 4. Create summary statistics
    summary_stats = {
        'analysis_date': datetime.now().isoformat(),
        'focus_area': CHOSEN_FOCUS,
        'total_prospects': len(prospect_list),
        'donor_advised_funds': len(donor_funds_df),
        'private_foundations': len(foundations_df),
        'grant_records': len(grants_df),
        'targeted_funders': len(relevant_funders) if 'relevant_funders' in locals() else 0
    }
    
    if not grants_df.empty:
        summary_stats.update({
            'total_grant_amount': float(grants_df['amount'].sum()),
            'avg_grant_amount': float(grants_df['amount'].mean()),
            'median_grant_amount': float(grants_df['amount'].median())
        })
    
    summary_filename = f'data/analysis_summary_{CHOSEN_FOCUS}_{timestamp}.json'
    with open(summary_filename, 'w') as f:
        json.dump(summary_stats, f, indent=2)
    
    print(f"📈 Saved analysis summary: {summary_filename}")
    
    # Print final summary
    print(f"\n{'='*60}")
    print(f"🎉 ANALYSIS COMPLETE - {CHOSEN_FOCUS.upper()}")
    print(f"{'='*60}")
    print(f"📊 Total prospects identified: {len(prospect_list)}")
    if 'relevant_funders' in locals():
        print(f"🎯 Targeted funders found: {len(relevant_funders)}")
    print("📁 Files exported to: data/")
    print(f"{'='*60}")

else:
    print("❌ No data available for export")
    print("💡 Run 'python collect_data.py' first")

## 7. Next Steps & Action Items

### 🎯 Research Phase
1. **Review Your Prospect Lists**: Start with the targeted funders who have a history of supporting your cause area
2. **Research Each Funder**: 
   - Visit their websites
   - Check recent 990 filings on ProPublica
   - Understand their application processes and deadlines
3. **Prioritize Prospects**: Focus on local funders and those with grant sizes matching your needs

### 📝 Outreach Phase
1. **Prepare Compelling Proposals** that clearly align with each funder's interests
2. **Follow Guidelines Exactly** - respect word limits, deadlines, and submission processes
3. **Build Relationships** before asking for funding when possible
4. **Track Your Outreach** and follow up appropriately

### 🔄 Continuous Improvement
1. **Re-run Analysis Quarterly** to find new funders and updated information
2. **Refine Keywords** based on successful grants and feedback
3. **Expand Geographic Search** if local options are limited
4. **Track Success Rates** to improve your approach

---

## 📚 Additional Resources

- **ProPublica Nonprofit Explorer**: https://projects.propublica.org/nonprofits/
- **Foundation Directory Online**: https://fdo.foundationcenter.org/
- **GuideStar**: https://www.guidestar.org/
- **GrantSpace**: https://grantspace.org/

---

**🎉 Congratulations!** You've successfully analyzed IRS 990 data to identify potential funders. Remember: fundraising is about building relationships. Use this data as a starting point for meaningful connections with funders who share your mission.

### 💡 Pro Tips
- **Quality over quantity**: Better to research 10 funders thoroughly than 100 superficially
- **Local focus**: Geographic proximity often increases funding likelihood  
- **Timing matters**: Many foundations have annual deadlines
- **Persistence pays**: Building relationships takes time but yields better results