# ADS NASA Query Analysis Notebook

This notebook queries the NASA ADS (Astrophysics Data System) with keywords extracted from our wordcloud analysis.

## Features:
- Extract top keywords from wordcloud analysis
- Query ADS to count publications before retrieval
- Query planning based on result counts, so oversized result sets are paginated or refined
- Retrieve and analyze relevant publications

## Workflow:
1. **Extract Keywords**: Get top words from titles and abstracts
2. **Count Check**: Query ADS to see how many publications match
3. **Query Planning**: Adjust search strategy based on result count
4. **Data Retrieval**: Fetch publication data if the result count is manageable
5. **Analysis**: Process and analyze the retrieved publications
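
The query-planning step (step 3) can be sketched as a small pure function. This is a minimal illustration rather than the notebook's own code: the name `classify_result_count` and its return labels are hypothetical, and the default thresholds of 1,000 and 10,000 are the ones used by the recommendation logic in Step 3.

```python
def classify_result_count(count, manageable=1_000, large=10_000):
    """Map an ADS hit count to a retrieval strategy (hypothetical labels)."""
    if count <= manageable:
        return "retrieve"   # small enough to fetch directly
    if count <= large:
        return "paginate"   # feasible, but needs several requests
    return "refine"         # too many hits: tighten the keywords


for n in (424, 3_402, 9_425, 25_000):
    print(f"{n:>6,} -> {classify_result_count(n)}")
```

Keeping the decision separate from the API call makes the thresholds easy to adjust and to test without network access.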


## Setup and Imports


In [1]:
# Setup and imports
import sys
import os
import json
from pprint import pprint

# Add the src directory to the path
sys.path.append('../src')

# Import our custom modules
from wordcloud_utils import extract_top_words_from_json_files
from ads_parser import (
    test_ads_connection,
    search_papers_by_keywords,
    process_search_results,
    search_and_process_papers
)

print("✅ Libraries imported successfully")
print("✅ Ready to start ADS analysis!")


✅ Libraries imported successfully
✅ Ready to start ADS analysis!


## Step 1: Test ADS Connection


In [2]:
# Test ADS API connection
print("🚀 Testing ADS API connection...")
connection_ok = test_ads_connection()

if connection_ok:
    print("\n✅ ADS connection successful! Ready to proceed.")
else:
    print("\n❌ ADS connection failed. Please check your API token in .env file.")
    print("You can get an ADS API token from: https://ui.adsabs.harvard.edu/user/settings/token")


🚀 Testing ADS API connection...
🔍 Testing ADS API connection...
✅ ADS API connection successful!
   Found 1103 total results
   Retrieved 1 documents

✅ ADS connection successful! Ready to proceed.


## Step 2: Extract Keywords from Wordcloud Analysis


In [16]:
# Extract top keywords from our wordcloud analysis
titles_freq_file = '../wordclouds/titles_word_frequencies.json'
abstracts_freq_file = '../wordclouds/abstracts_word_frequencies.json'

# Check if the wordcloud files exist
if not os.path.exists(titles_freq_file) or not os.path.exists(abstracts_freq_file):
    print("❌ Wordcloud frequency files not found!")
    print("Please run the wordcloud_analysis.ipynb notebook first to generate the frequency data.")
    print(f"Looking for:")
    print(f"  - {titles_freq_file}")
    print(f"  - {abstracts_freq_file}")
else:
    # Extract different sets of keywords for testing
    print("📊 Extracting keywords from wordcloud analysis...")
    
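    # Judging by the output, each call merges the top-N words from titles and
    # abstracts, so a "top N" set can contain more than N unique words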
    top_4_keywords = extract_top_words_from_json_files(titles_freq_file, abstracts_freq_file, 4)
    top_5_keywords = extract_top_words_from_json_files(titles_freq_file, abstracts_freq_file, 5)
    top_10_keywords = extract_top_words_from_json_files(titles_freq_file, abstracts_freq_file, 10)
    top_15_keywords = extract_top_words_from_json_files(titles_freq_file, abstracts_freq_file, 15)
    
    print(f"\n📝 Extracted keyword sets:")
    print(f"Top 4 unique keywords: {len(top_4_keywords)} words")
    print(f"  {top_4_keywords}")
    print(f"\nTop 5 unique keywords: {len(top_5_keywords)} words")
    print(f"  {top_5_keywords}")
    print(f"\nTop 10 unique keywords: {len(top_10_keywords)} words")
    print(f"  {top_10_keywords}")
    print(f"\nTop 15 unique keywords: {len(top_15_keywords)} words")
    print(f"  {top_15_keywords}")


📊 Extracting keywords from wordcloud analysis...

📝 Extracted keyword sets:
Top 4 unique keywords: 6 words
  ['period', 'eclipsing', 'photometric', 'contact', 'light', 'binary']

Top 5 unique keywords: 8 words
  ['period', 'eclipsing', 'photometric', 'contact', 'uma', 'light', 'mass', 'binary']

Top 10 unique keywords: 13 words
  ['binaries', 'system', 'type', 'period', 'curves', 'eclipsing', 'photometric', 'contact', 'uma', 'light', 'mass', 'orbital', 'binary']

Top 15 unique keywords: 20 words
  ['binaries', 'system', 'type', 'curve', 'photometric', 'investigation', 'orbital', 'period', 'curves', 'parameters', 'contact', 'uma', 'binary', 'short', 'component', 'ratio', 'systems', 'eclipsing', 'light', 'mass']


## Step 3: Count Publications Before Retrieval

Before retrieving any data, let's see how many publications each keyword set would return.
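
The queries logged in this notebook have the form `full:period AND full:eclipsing AND ...`, that is, every keyword is qualified with a search field and the terms are joined with `AND`. The sketch below reproduces that string format. The helper name `build_ads_query` is hypothetical; how `search_papers_by_keywords` builds its query internally is not shown here, so treat this as an assumption based on the logged output.

```python
def build_ads_query(keywords, field="full"):
    """Join field-qualified keywords with AND (hypothetical helper)."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    return " AND ".join(f"{field}:{kw}" for kw in keywords)


query = build_ads_query(["period", "eclipsing", "binary"])
print(query)  # full:period AND full:eclipsing AND full:binary
```

Requesting a single row with such a query is enough to read the total from `numFound`, which is what the counting function in this step relies on.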


In [17]:
def count_publications_for_keywords(keywords, description=""):
    """
    Count how many publications match the given keywords in the full text without retrieving the full data.
    """
    search_fields = "full"
    print(f"\n🔍 Counting publications for {description}...")
    print(f"Keywords: {keywords}")
    print(f"Search fields: {search_fields}")
    
    # Use max_results=1 to minimize data transfer while getting the count
    results = search_papers_by_keywords(keywords, search_fields=search_fields, max_results=1)
    
    if results and "response" in results:
        total_count = results["response"].get("numFound", 0)
        print(f"📊 Total publications found: {total_count:,}")
        return total_count
    else:
        print("❌ Failed to get publication count")
        return 0

# Count publications for different keyword sets using only 'full' search
if 'top_5_keywords' in locals():
    counts = {}
    
    print("\n" + "="*60)
    print("PUBLICATION COUNT ANALYSIS")
    print("="*60)
    
    # Top 4 keywords - full text search only
    counts['top4_full'] = count_publications_for_keywords(
        top_4_keywords, "Top 4 keywords (full text search)"
    )
    
    # Top 5 keywords - full text search only
    counts['top5_full'] = count_publications_for_keywords(
        top_5_keywords, "Top 5 keywords (full text search)"
    )
    
    # Top 10 keywords - full text search only
    counts['top10_full'] = count_publications_for_keywords(
        top_10_keywords, "Top 10 keywords (full text search)"
    )
    
    # Top 15 keywords - full text search only
    counts['top15_full'] = count_publications_for_keywords(
        top_15_keywords, "Top 15 keywords (full text search)"
    )
    
    # Show summary
    print("\n" + "="*60)
    print("SUMMARY OF PUBLICATION COUNTS")
    print("="*60)
    for strategy, count in counts.items():
        print(f"{strategy:20s}: {count:,} publications")
        
    # Recommendations
    print("\n💡 RECOMMENDATIONS:")
    manageable_threshold = 1000
    large_threshold = 10000
    
    for strategy, count in counts.items():
        if count <= manageable_threshold:
            print(f"✅ {strategy}: {count:,} publications - Manageable for full retrieval")
        elif count <= large_threshold:
            print(f"⚠️  {strategy}: {count:,} publications - Large but feasible with pagination")
        else:
            print(f"❌ {strategy}: {count:,} publications - Too many, consider refining keywords")
else:
    print("❌ Keywords not available. Please run the previous cell first.")



PUBLICATION COUNT ANALYSIS

🔍 Counting publications for Top 4 keywords (full text search)...
Keywords: ['period', 'eclipsing', 'photometric', 'contact', 'light', 'binary']
Search fields: full
🔍 Searching for: full:period AND full:eclipsing AND full:photometric AND full:contact AND full:light AND full:binary
✅ Found 9425 papers, retrieved 1
📊 Total publications found: 9,425

🔍 Counting publications for Top 5 keywords (full text search)...
Keywords: ['period', 'eclipsing', 'photometric', 'contact', 'uma', 'light', 'mass', 'binary']
Search fields: full
🔍 Searching for: full:period AND full:eclipsing AND full:photometric AND full:contact AND full:uma AND full:light AND full:mass AND full:binary
✅ Found 3402 papers, retrieved 1
📊 Total publications found: 3,402

🔍 Counting publications for Top 10 keywords (full text search)...
Keywords: ['binaries', 'system', 'type', 'period', 'curves', 'eclipsing', 'photometric', 'contact', 'uma', 'light', 'mass', 'orbital', 'binary']
Search fields: full


## Step 4: Search for Bibcodes Only

Now let's search for papers matching the top-5 keyword set and retrieve only their bibcodes, paging through the results with `search_all_bibcodes`.
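
Retrieving every bibcode requires splitting the result set into pages of at most 2,000 rows, the per-request limit this notebook works with. The arithmetic can be sketched independently of the API; `plan_pages` is a hypothetical helper, and each tuple it returns corresponds to the `start` and `rows` parameters of one request.

```python
def plan_pages(total_found, page_size=2000):
    """Return (start, rows) pairs that together cover total_found results."""
    if total_found < 0 or page_size <= 0:
        raise ValueError("total_found must be >= 0 and page_size > 0")
    return [
        (start, min(page_size, total_found - start))
        for start in range(0, total_found, page_size)
    ]


print(plan_pages(3402))  # [(0, 2000), (2000, 1402)]
```

For 3,402 matches this yields two requests, and for 9,425 matches five, with the last page holding only the remainder.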


In [13]:
import requests
from ads_parser import ADS_API_TOKEN, ADS_API_BASE_URL
import time

def search_all_bibcodes(keywords, search_fields="full"):
    """
    Search for papers and return ALL bibcodes using pagination.
    Handles the 2000 per request limit automatically.
    """
    if not ADS_API_TOKEN:
        print("❌ Error: ADS_API_TOKEN not found in environment variables")
        return []
    
    headers = {"Authorization": f"Bearer {ADS_API_TOKEN}"}
    
    # Build query based on search fields
    if search_fields == "title":
        query_parts = [f"title:{keyword}" for keyword in keywords]
    elif search_fields == "abs":
        query_parts = [f"abs:{keyword}" for keyword in keywords]
    elif search_fields == "full":
        query_parts = [f"full:{keyword}" for keyword in keywords]
    elif search_fields == "title,abs":
        title_parts = [f"title:{keyword}" for keyword in keywords]
        abs_parts = [f"abs:{keyword}" for keyword in keywords]
        query_parts = title_parts + abs_parts
    else:
        print(f"❌ Invalid search_fields: {search_fields}")
        return []
    
    # Join keywords with AND operator
    query = " AND ".join(query_parts)
    
    # First request to get total count
    initial_params = {
        "q": query,
        "fl": "bibcode",
        "rows": 1,  # Just get count first
        "sort": "date desc"
    }
    
    try:
        print(f"🔍 Searching for bibcodes with query: {query}")
        print(f"📊 Getting total count first...")
        
        response = requests.get(
            f"{ADS_API_BASE_URL}/search/query",
            headers=headers,
            params=initial_params,
            timeout=30
        )
        
        if response.status_code != 200:
            print(f"❌ Initial request failed with status {response.status_code}")
            return []
        
        data = response.json()
        total_found = data.get("response", {}).get("numFound", 0)
        
        if total_found == 0:
            print("❌ No papers found for this query")
            return []
        
        print(f"✅ Found {total_found:,} total papers")
        
        # Calculate pagination
        max_per_request = 2000
        requests_needed = (total_found + max_per_request - 1) // max_per_request
        
        print(f"📄 Will need {requests_needed} requests to get all bibcodes")
        
        all_bibcodes = []
        
        # Get all bibcodes with pagination
        for i in range(requests_needed):
            start = i * max_per_request
            remaining = total_found - start
            rows = min(max_per_request, remaining)
            
            print(f"📥 Request {i+1}/{requests_needed}: Getting bibcodes {start+1:,}-{start+rows:,}")
            
            params = {
                "q": query,
                "fl": "bibcode",
                "rows": rows,
                "start": start,
                "sort": "date desc"
            }
            
            response = requests.get(
                f"{ADS_API_BASE_URL}/search/query",
                headers=headers,
                params=params,
                timeout=30
            )
            
            if response.status_code == 200:
                data = response.json()
                docs = data.get("response", {}).get("docs", [])
                batch_bibcodes = [doc.get("bibcode") for doc in docs if doc.get("bibcode")]
                all_bibcodes.extend(batch_bibcodes)
                
                print(f"   ✅ Retrieved {len(batch_bibcodes)} bibcodes")
                
                # Check rate limit
                if 'X-RateLimit-Remaining' in response.headers:
                    remaining_requests = response.headers['X-RateLimit-Remaining']
                    print(f"   🔄 API requests remaining: {remaining_requests}")
                
                # Small delay between requests to be nice to the API
                if i < requests_needed - 1:  # Don't delay after last request
                    time.sleep(0.5)
                    
            else:
                print(f"   ❌ Request {i+1} failed with status {response.status_code}")
                break
        
        print(f"\n🎯 FINAL RESULTS:")
        print(f"   Total papers found: {total_found:,}")
        print(f"   Total bibcodes retrieved: {len(all_bibcodes):,}")
        print(f"   API requests used: {min(i+1, requests_needed)}")
        
        return all_bibcodes
        
    except requests.exceptions.RequestException as e:
        print(f"❌ Request failed: {e}")
        return []

# Use the top-5 keyword set for the search; MAX_RESULTS below is printed for reference only
print("🎯 BIBCODE SEARCH WITH TOP 5 KEYWORDS")
print("="*50)

CHOSEN_KEYWORDS = top_5_keywords  # Using top 5 keywords
CHOSEN_SEARCH_FIELD = "full"      # Full text search

# Expected result count from counts['top5_full'] (informational only:
# search_all_bibcodes paginates until every match has been fetched)
MAX_RESULTS = counts.get('top5_full', 2000)

print(f"Keywords: {CHOSEN_KEYWORDS}")
print(f"Search field: {CHOSEN_SEARCH_FIELD}")
print(f"Max results: {MAX_RESULTS}")

# Search for bibcodes
found_bibcodes = search_all_bibcodes(CHOSEN_KEYWORDS, CHOSEN_SEARCH_FIELD)

print(f"\n📋 BIBCODE SEARCH RESULTS:")
print(f"Total bibcodes retrieved: {len(found_bibcodes)}")

if found_bibcodes:
    print(f"\nFirst 10 bibcodes:")
    for i, bibcode in enumerate(found_bibcodes[:10]):
        print(f"  {i+1:2d}. {bibcode}")
    
    if len(found_bibcodes) > 10:
        print(f"  ... and {len(found_bibcodes) - 10} more")


🎯 BIBCODE SEARCH WITH TOP 5 KEYWORDS
Keywords: ['period', 'eclipsing', 'photometric', 'contact', 'uma', 'light', 'mass', 'binary']
Search field: full
Max results: 3402
🔍 Searching for bibcodes with query: full:period AND full:eclipsing AND full:photometric AND full:contact AND full:uma AND full:light AND full:mass AND full:binary
📊 Getting total count first...
✅ Found 3,402 total papers
📄 Will need 2 requests to get all bibcodes
📥 Request 1/2: Getting bibcodes 1-2,000
   ✅ Retrieved 2000 bibcodes
   🔄 API requests remaining: 4962
📥 Request 2/2: Getting bibcodes 2,001-3,402
   ✅ Retrieved 1402 bibcodes
   🔄 API requests remaining: 4961

🎯 FINAL RESULTS:
   Total papers found: 3,402
   Total bibcodes retrieved: 3,402
   API requests used: 2

📋 BIBCODE SEARCH RESULTS:
Total bibcodes retrieved: 3402

First 10 bibcodes:
   1. 2025NewA..12102445Y
   2. 2025NewA..11902392H
   3. 2025NewA..11902418Z
   4. 2025RAA....25i5018W
   5. 2025RAA....25h5002B
   6. 2025NewA..11802374N
   7. 2025AJ....1

## Step 4b: Search with Top 4 Keywords

Let's also search with the top-4 keyword set to see how the results differ. Because the keywords are joined with AND, using fewer terms makes the query less restrictive and returns more papers.
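
Because every keyword is joined with AND, removing keywords from a query can only widen the set of matches, never shrink it. The toy example below illustrates this with an in-memory corpus. The documents and the `match_all` helper are invented for illustration and do not involve the ADS API.

```python
def match_all(docs, keywords):
    """Return ids of docs whose word set contains every keyword."""
    required = set(keywords)
    return {doc_id for doc_id, words in docs.items() if required <= words}


# Toy corpus (invented): document id -> set of words in its text.
docs = {
    "A": {"period", "eclipsing", "binary", "uma", "mass"},
    "B": {"period", "eclipsing", "binary"},
    "C": {"period", "binary"},
}

broad = match_all(docs, ["period", "eclipsing", "binary"])
narrow = match_all(docs, ["period", "eclipsing", "binary", "uma", "mass"])

print(sorted(broad))    # ['A', 'B']
print(sorted(narrow))   # ['A']
print(narrow <= broad)  # True
```

The same reasoning applies whenever one keyword set is a subset of another: the results of the longer query are contained in the results of the shorter one.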


In [18]:
# Search with TOP 4 KEYWORDS for comparison
print("🎯 BIBCODE SEARCH WITH TOP 4 KEYWORDS - ALL RESULTS")
print("="*60)

CHOSEN_KEYWORDS_4 = top_4_keywords  # Using top 4 keywords
CHOSEN_SEARCH_FIELD_4 = "full"      # Full text search

print(f"Keywords: {CHOSEN_KEYWORDS_4}")
print(f"Search field: {CHOSEN_SEARCH_FIELD_4}")
print(f"Getting ALL bibcodes (automatic pagination)")

# Search for ALL bibcodes with top 4 keywords
found_bibcodes_4 = search_all_bibcodes(CHOSEN_KEYWORDS_4, CHOSEN_SEARCH_FIELD_4)

print(f"\n📋 TOP 4 KEYWORDS SEARCH RESULTS:")
print(f"Total bibcodes retrieved: {len(found_bibcodes_4):,}")

if found_bibcodes_4:
    print(f"\nFirst 10 bibcodes:")
    for i, bibcode in enumerate(found_bibcodes_4[:10]):
        print(f"  {i+1:2d}. {bibcode}")
    
    if len(found_bibcodes_4) > 10:
        print(f"  ... and {len(found_bibcodes_4) - 10:,} more")

# Quick comparison with the top-5 keyword results from Step 4 (found_bibcodes)
if 'found_bibcodes' in locals() and found_bibcodes:
    print(f"\n📊 QUICK COMPARISON:")
    print(f"  Top 4 keywords: {len(found_bibcodes_4):,} bibcodes")
    print(f"  Top 5 keywords: {len(found_bibcodes):,} bibcodes")
    
    # Check overlap between the two searches
    if found_bibcodes_4 and found_bibcodes:
        overlap_4_5 = set(found_bibcodes_4).intersection(found_bibcodes)
        print(f"  Overlap between searches: {len(overlap_4_5):,} bibcodes")
        
        if len(found_bibcodes_4) > 0:
            overlap_pct = (len(overlap_4_5) / len(found_bibcodes_4)) * 100
            print(f"  {overlap_pct:.1f}% of top-4 results also found in top-5 search")
else:
    print(f"\n💡 Run the top-5 keyword search (Step 4) first to enable comparison")


🎯 BIBCODE SEARCH WITH TOP 4 KEYWORDS - ALL RESULTS
Keywords: ['period', 'eclipsing', 'photometric', 'contact', 'light', 'binary']
Search field: full
Getting ALL bibcodes (automatic pagination)
🔍 Searching for bibcodes with query: full:period AND full:eclipsing AND full:photometric AND full:contact AND full:light AND full:binary
📊 Getting total count first...
✅ Found 9,425 total papers
📄 Will need 5 requests to get all bibcodes
📥 Request 1/5: Getting bibcodes 1-2,000
   ✅ Retrieved 2000 bibcodes
   🔄 API requests remaining: 4955
📥 Request 2/5: Getting bibcodes 2,001-4,000
   ✅ Retrieved 2000 bibcodes
   🔄 API requests remaining: 4954
📥 Request 3/5: Getting bibcodes 4,001-6,000
   ✅ Retrieved 2000 bibcodes
   🔄 API requests remaining: 4953
📥 Request 4/5: Getting bibcodes 6,001-8,000
   ✅ Retrieved 2000 bibcodes
   🔄 API requests remaining: 4952
📥 Request 5/5: Getting bibcodes 8,001-9,425
   ✅ Retrieved 1425 bibcodes
   🔄 API requests remaining: 4951

🎯 FINAL RESULTS:
   Total papers foun

## Step 5: Compare with WUMaCat Bibcodes

Compare the bibcodes from the top-4 keyword search with the unique bibcodes in WUMaCat.csv to find overlaps and differences.


In [21]:
import pandas as pd

# Load WUMaCat.csv and extract bibcodes
print("📂 LOADING WUMACAT BIBCODES")
print("="*50)

wumacat_file = '../data/WUMaCat.csv'
if os.path.exists(wumacat_file):
    # Read the CSV file
    wumacat_df = pd.read_csv(wumacat_file)
    
    # Extract unique bibcodes (column 'Bibcode')
    wumacat_bibcodes = set(wumacat_df['Bibcode'].dropna().unique())
    
    print(f"✅ Loaded WUMaCat.csv")
    print(f"📊 Total unique bibcodes in WUMaCat: {len(wumacat_bibcodes)}")
    
    # Show sample WUMaCat bibcodes
    print(f"\nFirst 10 WUMaCat bibcodes:")
    for i, bibcode in enumerate(list(wumacat_bibcodes)[:10]):
        print(f"  {i+1:2d}. {bibcode}")
else:
    print(f"❌ WUMaCat file not found: {wumacat_file}")
    wumacat_bibcodes = set()

# Compare the bibcodes
print(f"\n🔍 BIBCODE COMPARISON")
print("="*50)

if found_bibcodes_4 and wumacat_bibcodes:
    # Convert found_bibcodes to set for comparison
    found_bibcodes_set = set(found_bibcodes_4)
    
    # Find overlaps and differences
    overlap = found_bibcodes_set.intersection(wumacat_bibcodes)
    ads_only = found_bibcodes_set - wumacat_bibcodes
    wumacat_only = wumacat_bibcodes - found_bibcodes_set
    
    print(f"📊 COMPARISON RESULTS:")
    print(f"  Found in ADS search:     {len(found_bibcodes_set):,} bibcodes")
    print(f"  Found in WUMaCat:       {len(wumacat_bibcodes):,} bibcodes")
    print(f"  Overlap (both):         {len(overlap):,} bibcodes")
    print(f"  ADS only (new):         {len(ads_only):,} bibcodes")
    print(f"  WUMaCat only (missing): {len(wumacat_only):,} bibcodes")
    
    # Calculate percentages
    if len(found_bibcodes_set) > 0:
        overlap_pct_ads = (len(overlap) / len(found_bibcodes_set)) * 100
        print(f"\n📈 OVERLAP STATISTICS:")
        print(f"  {overlap_pct_ads:.1f}% of ADS results are already in WUMaCat")
    
    if len(wumacat_bibcodes) > 0:
        overlap_pct_wumacat = (len(overlap) / len(wumacat_bibcodes)) * 100
        print(f"  {overlap_pct_wumacat:.1f}% of WUMaCat papers found in ADS search")
    
    # Show samples
    if overlap:
        print(f"\n✅ OVERLAPPING BIBCODES (first 5):")
        for i, bibcode in enumerate(list(overlap)[:5]):
            print(f"  {i+1}. {bibcode}")
    
    if ads_only:
        print(f"\n🆕 NEW BIBCODES FROM ADS (first 5):")
        for i, bibcode in enumerate(list(ads_only)[:5]):
            print(f"  {i+1}. {bibcode}")
    
    # Save results for further analysis
    comparison_results = {
        'ads_bibcodes': list(found_bibcodes_set),
        'wumacat_bibcodes': list(wumacat_bibcodes),
        'overlap': list(overlap),
        'ads_only': list(ads_only),
        'wumacat_only': list(wumacat_only)
    }
    
    # Save to JSON file
    output_file = '../data/bibcode_comparison.json'
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(comparison_results, f, indent=2, ensure_ascii=False)
    
    print(f"\n💾 Comparison results saved to: {output_file}")
    
else:
    print("❌ Cannot perform comparison - missing bibcode data")


📂 LOADING WUMACAT BIBCODES
✅ Loaded WUMaCat.csv
📊 Total unique bibcodes in WUMaCat: 424

First 10 WUMaCat bibcodes:
   1. 2015AJ....150....9X
   2. 2015NewA...36..100G
   3. 2011PASP..123..895Y
   4. 2009Ap&SS.321...19L
   5. 2018NewA...61....1K
   6. 2011A&A...525A..66D
   7. 2013NewA...23...59B
   8. 2009Ap&SS.321..209H
   9. 2016AJ....152..219S
  10. 2010RAA....10..569H

🔍 BIBCODE COMPARISON
📊 COMPARISON RESULTS:
  Found in ADS search:     9,425 bibcodes
  Found in WUMaCat:       424 bibcodes
  Overlap (both):         366 bibcodes
  ADS only (new):         9,059 bibcodes
  WUMaCat only (missing): 58 bibcodes

📈 OVERLAP STATISTICS:
  3.9% of ADS results are already in WUMaCat
  86.3% of WUMaCat papers found in ADS search

✅ OVERLAPPING BIBCODES (first 5):
  1. 2015AJ....150....9X
  2. 2015NewA...36..100G
  3. 2011PASP..123..895Y
  4. 2018NewA...61....1K
  5. 2009Ap&SS.321...19L

🆕 NEW BIBCODES FROM ADS (first 5):
  1. 2012NewA...17...46U
  2. 2000Ap&SS.273..257Q
  3. 1977ApJ...211..8

## Summary: Comparison of Search Strategies

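This section compares the keyword strategies by result count, by their mutual overlap, and by their overlap with WUMaCat. Note that the search labelled "15 keywords" in the code and output below is the top-5 keyword search from Step 4, whose results are stored in `found_bibcodes`.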
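
The trade-off between precision and completeness can be made concrete by treating WUMaCat as a reference set. The sketch below recomputes two ratios from counts reported in this notebook (424 WUMaCat bibcodes, with 316 and 366 of them recovered by the two searches). The function name `overlap_stats` is hypothetical, and using WUMaCat as a stand-in for "relevant papers" is an assumption made for illustration: a paper outside WUMaCat is not necessarily irrelevant.

```python
def overlap_stats(n_found, n_reference, n_overlap):
    """Return (% of results in reference, % of reference recovered)."""
    if n_found <= 0 or n_reference <= 0:
        raise ValueError("counts must be positive")
    return (100 * n_overlap / n_found, 100 * n_overlap / n_reference)


# Counts taken from the outputs of this notebook.
searches = [("3,402-result search", 3402, 316),
            ("9,425-result search", 9425, 366)]

for label, n_found, n_overlap in searches:
    in_ref, recovered = overlap_stats(n_found, 424, n_overlap)
    print(f"{label}: {in_ref:.1f}% in WUMaCat, "
          f"{recovered:.1f}% of WUMaCat recovered")
```

The first ratio falls as the search widens while the second rises, which is the tension the recommendations at the end of this section refer to.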


In [19]:
# SUMMARY: Compare all search strategies
print("📊 COMPREHENSIVE SEARCH STRATEGY COMPARISON")
print("="*70)

# Check what searches were performed
searches_performed = []
if 'found_bibcodes' in locals() and found_bibcodes:
    searches_performed.append(('15 keywords', len(found_bibcodes), found_bibcodes))
    
if 'found_bibcodes_4' in locals() and found_bibcodes_4:
    searches_performed.append(('4 keywords', len(found_bibcodes_4), found_bibcodes_4))

if 'wumacat_bibcodes' in locals() and wumacat_bibcodes:
    wumacat_count = len(wumacat_bibcodes)
    print(f"🗂️  WUMaCat baseline:     {wumacat_count:,} bibcodes")

if searches_performed:
    print(f"\n🔍 ADS SEARCH RESULTS:")
    for name, count, bibcodes in searches_performed:
        print(f"   {name:15s}: {count:,} bibcodes")
    
    # Compare overlaps between different searches
    if len(searches_performed) >= 2:
        search1_name, search1_count, search1_bibcodes = searches_performed[0]
        search2_name, search2_count, search2_bibcodes = searches_performed[1]
        
        overlap_searches = set(search1_bibcodes).intersection(set(search2_bibcodes))
        
        print(f"\n🔗 SEARCH OVERLAP ANALYSIS:")
        print(f"   Overlap between {search1_name} and {search2_name}: {len(overlap_searches):,} bibcodes")
        
        if search1_count > 0 and search2_count > 0:
            pct1 = (len(overlap_searches) / search1_count) * 100
            pct2 = (len(overlap_searches) / search2_count) * 100
            print(f"   {pct1:.1f}% of {search1_name} results found in {search2_name} search")
            print(f"   {pct2:.1f}% of {search2_name} results found in {search1_name} search")
    
    # WUMaCat comparisons
    if 'wumacat_bibcodes' in locals():
        print(f"\n📋 WUMaCat OVERLAP ANALYSIS:")
        for name, count, bibcodes in searches_performed:
            overlap_wuma = set(bibcodes).intersection(wumacat_bibcodes)
            new_discoveries = set(bibcodes) - wumacat_bibcodes
            
            if count > 0:
                overlap_pct = (len(overlap_wuma) / count) * 100
                print(f"   {name:15s}: {len(overlap_wuma):3d} overlap ({overlap_pct:4.1f}%), {len(new_discoveries):,} new discoveries")
    
    # Final recommendations
    print(f"\n💡 RECOMMENDATIONS:")
    if len(searches_performed) >= 2:
        smaller_search = min(searches_performed, key=lambda x: x[1])
        larger_search = max(searches_performed, key=lambda x: x[1])
        
        print(f"   • {smaller_search[0]} search: More focused, {smaller_search[1]:,} results")
        print(f"   • {larger_search[0]} search: Broader coverage, {larger_search[1]:,} results")
        print(f"   • Consider your research goals: precision vs. completeness")
    
    # Save comprehensive summary
    summary_data = {
        'search_strategies': {
            name: {
                'total_bibcodes': count,
                'sample_bibcodes': bibcodes[:10] if bibcodes else []
            } for name, count, bibcodes in searches_performed
        },
        'wumacat_baseline': len(wumacat_bibcodes) if 'wumacat_bibcodes' in locals() else 0
    }
    
    summary_file = '../data/search_strategy_summary.json'
    with open(summary_file, 'w', encoding='utf-8') as f:
        json.dump(summary_data, f, indent=2, ensure_ascii=False)
    
    print(f"\n💾 Search strategy summary saved to: {summary_file}")
    
else:
    print("❌ No search results available for comparison")


📊 COMPREHENSIVE SEARCH STRATEGY COMPARISON
🗂️  WUMaCat baseline:     424 bibcodes

🔍 ADS SEARCH RESULTS:
   15 keywords    : 3,402 bibcodes
   4 keywords     : 9,425 bibcodes

🔗 SEARCH OVERLAP ANALYSIS:
   Overlap between 15 keywords and 4 keywords: 3,402 bibcodes
   100.0% of 15 keywords results found in 4 keywords search
   36.1% of 4 keywords results found in 15 keywords search

📋 WUMaCat OVERLAP ANALYSIS:
   15 keywords    : 316 overlap ( 9.3%), 3,086 new discoveries
   4 keywords     : 366 overlap ( 3.9%), 9,059 new discoveries

💡 RECOMMENDATIONS:
   • 15 keywords search: More focused, 3,402 results
   • 4 keywords search: Broader coverage, 9,425 results
   • Consider your research goals: precision vs. completeness

💾 Search strategy summary saved to: ../data/search_strategy_summary.json
