# Exact Keyword Search Experiment

This notebook experiments with **exact keyword searches** in ADS using the `=field:"keyword"` syntax. The `=` prefix requests an exact match on the given field (no synonym expansion), which avoids the very large result sets that a plain keyword search returns.

## Objectives:
1. **Extract top keywords** from our wordcloud analysis
2. **Test exact searches** in different ADS fields (title, abstract, full text)
3. **Measure overlap** with WUMaCat bibcodes to evaluate search precision
4. **Compare strategies** to find optimal exact search approach
5. **Analyze results** and recommend best practices

## ADS Search Syntax:
- **Default search**: `keyword` (unfielded, with synonym expansion; returns large result sets)
- **Exact field search**: `=title:"keyword"`, `=abs:"keyword"`, `=full:"keyword"`
- **Exact author search**: `=author:"Last, First"`
- **Combined exact search**: `=title:"binary" AND =title:"eclipsing"` (papers containing ALL keywords)
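
As a minimal sketch of how such a combined query string could be assembled, each keyword is wrapped in the `=field:"keyword"` form and the clauses are joined with `AND`. The helper name `build_exact_query` is hypothetical and is not part of `ads_parser`.

```python
def build_exact_query(keywords, field="title"):
    """Join exact-match clauses for one ADS field with AND (illustrative helper)."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    # One =field:"keyword" clause per keyword, in the order given.
    clauses = ['={}:"{}"'.format(field, keyword) for keyword in keywords]
    return " AND ".join(clauses)


print(build_exact_query(["binary", "eclipsing"], field="title"))
# =title:"binary" AND =title:"eclipsing"
```

Passing `field="abs"` or `field="full"` produces the same structure for abstract and full-text searches.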

## Methodology:
We use our existing wordcloud frequency data to pick the most frequent terms, then test how well exact field searches joined with **AND logic** retrieve papers that contain ALL specified keywords. Each combination is the top *N* keywords of the ranked list: **20, 15, 10, 7, and 5 keywords** for full text, and **10, 5, 4, 3, and 2 keywords** for abstracts. The result limit is raised to **20,000 papers** per search so that broad queries are not cut short.
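
Collecting up to 20,000 records generally means paging through the results. The sketch below only plans the `(start, rows)` pairs for such paging and sends no requests. The 2,000-row page size is an assumption about the API's per-request cap and should be checked against the current ADS documentation; `plan_pages` is a hypothetical helper, not a function from `ads_parser`.

```python
def plan_pages(total_found, limit=20_000, page_size=2_000):
    """Return (start, rows) pairs covering min(total_found, limit) records."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Never plan more records than exist or than the overall limit allows.
    wanted = max(0, min(total_found, limit))
    pages = []
    for start in range(0, wanted, page_size):
        # The last page may be shorter than page_size.
        pages.append((start, min(page_size, wanted - start)))
    return pages


print(plan_pages(4409))
# [(0, 2000), (2000, 2000), (4000, 409)]
```

With these assumed values, a query that matches more than 20,000 papers is covered by ten full pages and the remainder is ignored.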


## 1. Setup and Data Loading


In [1]:
# Setup and imports
import sys
import os
import json
from datetime import datetime

# Add the src directory to the path
sys.path.append('../src')

# Import our enhanced ADS parser functions
from ads_parser import (
    test_ads_connection, 
    search_exact_keywords, 
    test_keyword_combination_sizes
)

# Import wordcloud utilities
from wordcloud_utils import (
    load_wumacat_bibcodes,
    get_top_keywords_from_wordclouds,
    save_experiment_results
)

print("✅ Libraries imported successfully")
print("✅ New keyword combination testing function available")
print("✅ Ready for full-text keyword combination experiment!")

# Test ADS connection
print("\n🔍 Testing ADS API connection...")
if test_ads_connection():
    print("✅ ADS connection successful - ready to proceed!")
else:
    print("❌ ADS connection failed - check your API token")


✅ Libraries imported successfully
✅ New keyword combination testing function available
✅ Ready for full-text keyword combination experiment!

🔍 Testing ADS API connection...
🔍 Testing ADS API connection...
✅ ADS API connection successful!
   Found 1104 total results
   Retrieved 1 documents
✅ ADS connection successful - ready to proceed!


## 2. Load WUMaCat Reference Data


In [2]:
# Load WUMaCat bibcodes for overlap analysis
print("📂 Loading WUMaCat bibcodes for reference...")
wumacat_bibcodes = load_wumacat_bibcodes('../data/WUMaCat.csv')

print(f"\n📊 WUMaCat Reference Dataset:")
print(f"   Total unique bibcodes: {len(wumacat_bibcodes)}")

# Show a few examples
if wumacat_bibcodes:
    sample_bibcodes = list(wumacat_bibcodes)[:5]
    print(f"   Sample bibcodes: {sample_bibcodes}")

print("\n✅ Reference data loaded successfully!")


📂 Loading WUMaCat bibcodes for reference...
✅ Loaded 424 unique bibcodes from WUMaCat

📊 WUMaCat Reference Dataset:
   Total unique bibcodes: 424
   Sample bibcodes: ['2011AJ....141..147L', '2006AcA....56..127G', '2002IBVS.5258....1P', '2019IBVS.6266....1N', '2013NewA...20...52P']

✅ Reference data loaded successfully!


## 3. Extract Top Keywords from Wordclouds


In [3]:
# Extract top keywords from our wordcloud frequency analysis
print("🎯 Extracting top keywords from wordcloud analysis...")

# Load keywords from wordcloud frequency files
titles_freq_file = '../wordclouds/titles_word_frequencies.json'
abstracts_freq_file = '../wordclouds/abstracts_word_frequencies.json'

# Check if files exist
if os.path.exists(titles_freq_file) and os.path.exists(abstracts_freq_file):
    # Extract top 25 keywords for testing (need at least 20 for largest combination)
    top_keywords = get_top_keywords_from_wordclouds(
        titles_freq_file, 
        abstracts_freq_file, 
        top_n=25, 
        exclude_generic=True
    )
    
    print(f"\n🔍 Top {len(top_keywords)} keywords for exact searching:")
    for i, keyword in enumerate(top_keywords, 1):
        print(f"   {i:2d}. {keyword}")

    print("\n✅ Keywords extracted successfully!")

else:
    print("❌ Wordcloud frequency files not found!")
    print("   Please run the wordcloud_analysis.ipynb notebook first.")

    # Fallback: use manually selected binary star keywords
    print("\n🔄 Using fallback binary star keywords...")
    top_keywords = [
        'binary', 'binaries', 'eclipsing', 'contact', 'detached', 
        'stellar', 'photometry', 'variability', 'period', 'orbital',
        'lightcurve', 'eclipse', 'companion', 'mass', 'radius',
        'magnitude', 'flux', 'brightness', 'amplitude', 'pulsation',
        'transit', 'occultation', 'synchronization', 'evolution', 'formation'
    ]
    print(f"   Using {len(top_keywords)} fallback keywords: {top_keywords}")


🎯 Extracting top keywords from wordcloud analysis...
✅ Extracted top 25 keywords for exact searching

🔍 Top 25 keywords for exact searching:
    1. binary
    2. light
    3. contact
    4. period
    5. photometric
    6. mass
    7. type
    8. curves
    9. binaries
   10. eclipsing
   11. orbital
   12. ratio
   13. curve
   14. component
   15. star
   16. components
   17. massive
   18. solutions
   19. wilson
   20. devinney
   21. overcontact
   22. primary
   23. short
   24. stars
   25. secondary

✅ Keywords extracted successfully!


## 4. Run Comprehensive Strategy Comparison

This is the main experiment: we run exact keyword searches against the full-text field with progressively smaller keyword combinations, and measure their effectiveness by the overlap with our WUMaCat reference dataset.


In [4]:
# Run keyword combination size experiment in full text
print("🚀 Running keyword combination size experiment...")
print("This will test different numbers of keywords (20, 15, 10, 7, 5) in full-text searches")
print("using AND logic to find papers containing ALL specified keywords.")
print("Increased limit to 20,000 papers for more comprehensive results.")

# Test different combination sizes in full text
combination_sizes = [20, 15, 10, 7, 5]
print(f"\n🎯 Available keywords: {len(top_keywords)}")
print(f"📊 Testing combination sizes: {combination_sizes}")
print(f"🔍 Search field: full text (more comprehensive than titles/abstracts)")

# Run the experiment
combination_results = test_keyword_combination_sizes(
    all_keywords=top_keywords,
    wumacat_bibcodes=wumacat_bibcodes,
    combination_sizes=combination_sizes,
    source_field="full"
)

print("\n✅ Keyword combination experiment completed!")

# Display summary results
if combination_results and 'summary' in combination_results:
    summary = combination_results['summary']
    print(f"\n🎯 EXPERIMENT SUMMARY:")
    print(f"   Successful combinations: {summary.get('successful_combinations', [])}")
    
    if 'best_overlap' in summary:
        best_overlap = summary['best_overlap']
        print(f"   Best overlap: {best_overlap['percentage']:.1f}% with {best_overlap['size']} keywords")
    
    if 'best_precision' in summary:
        best_precision = summary['best_precision']
        print(f"   Best precision: {best_precision['percentage']:.1f}% with {best_precision['size']} keywords")
        
    if 'best_f1' in summary:
        best_f1 = summary['best_f1']
        print(f"   Best F1-score: {best_f1['score']:.1f}% with {best_f1['size']} keywords")
else:
    print("\n⚠️  Combination experiment did not complete successfully")


🚀 Running keyword combination size experiment...
This will test different numbers of keywords (20, 15, 10, 7, 5) in full-text searches
using AND logic to find papers containing ALL specified keywords.
Increased limit to 20,000 papers for more comprehensive results.

🎯 Available keywords: 25
📊 Testing combination sizes: [20, 15, 10, 7, 5]
🔍 Search field: full text (more comprehensive than titles/abstracts)
🧪 Testing keyword combination sizes in 'full' field
   Available keywords: 25
   Combination sizes to test: [20, 15, 10, 7, 5]
   WUMaCat reference size: 424

📊 Testing combination size: 20 keywords
   Keywords: ['binary', 'light', 'contact', 'period', 'photometric', 'mass', 'type', 'curves', 'binaries', 'eclipsing', 'orbital', 'ratio', 'curve', 'component', 'star', 'components', 'massive', 'solutions', 'wilson', 'devinney']
🔍 Searching for exact keywords in full field:
   Keywords: ['binary', 'light', 'contact', 'period', 'photometric', 'mass', 'type', 'curves', 

## 5. Detailed Analysis of Combination Results

Let's analyze the detailed results for each keyword combination size to understand the trade-offs between precision and recall.
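
For reference, the metrics in this section can be computed from two sets of bibcodes as sketched below. That the notebook's `precision`, `overlap_percentage`, and `f1_score` values follow these standard definitions is an assumption, although it is consistent with the 20-keyword full-text figures reported in this notebook (925 papers found, 198 overlapping, 424 in the reference set). The function name `overlap_metrics` and the synthetic bibcodes are illustrative only.

```python
def overlap_metrics(found_bibcodes, reference_bibcodes):
    """Return overlap count plus precision, recall and F1 as percentages."""
    found = set(found_bibcodes)
    reference = set(reference_bibcodes)
    overlap = len(found & reference)

    # Precision: share of retrieved papers that are in the reference set.
    precision = overlap / len(found) if found else 0.0
    # Recall: share of the reference set that was retrieved.
    recall = overlap / len(reference) if reference else 0.0
    # F1 is the harmonic mean of the two; define it as 0 when nothing overlaps.
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0

    return {
        "overlap_count": overlap,
        "precision": round(100 * precision, 1),
        "recall": round(100 * recall, 1),
        "f1_score": round(100 * f1, 1),
    }


# Synthetic bibcodes sized like the 20-keyword full-text search:
# 424 reference papers, 925 papers found, 198 of them in the reference set.
reference = {f"ref-{i}" for i in range(424)}
found = {f"ref-{i}" for i in range(198)} | {f"other-{i}" for i in range(727)}
print(overlap_metrics(found, reference))
# {'overlap_count': 198, 'precision': 21.4, 'recall': 46.7, 'f1_score': 29.4}
```

Because recall is measured against the 424 WUMaCat bibcodes only, a low precision does not by itself mean the other retrieved papers are irrelevant; it means they are not in the reference catalogue.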


In [5]:
# Analyze detailed results for each combination size
print("📊 DETAILED COMBINATION ANALYSIS:")
print("="*60)

if combination_results and 'combination_results' in combination_results:
    combo_results = combination_results['combination_results']
    
    # Create a summary table
    print(f"\n{'Size':<4} {'Keywords':<35} {'Found':<8} {'Overlap':<8} {'Precision':<10} {'Recall':<8} {'F1':<6}")
    print("-" * 80)
    
    for size in sorted(combo_results.keys(), reverse=True):
        data = combo_results[size]
        
        if "error" not in data:
            keywords_str = ', '.join(data['keywords'][:3]) + ('...' if len(data['keywords']) > 3 else '')
            keywords_str = keywords_str[:32] + '...' if len(keywords_str) > 35 else keywords_str
            
            found = data['total_found']
            overlap = data['overlap_count']
            precision = data.get('precision', 0)
            recall = data.get('overlap_percentage', 0)  # Same as recall in our case
            f1 = data.get('f1_score', 0)
            
            print(f"{size:<4} {keywords_str:<35} {found:<8} {overlap:<8} {precision:<10.1f} {recall:<8.1f} {f1:<6.1f}")
        else:
            print(f"{size:<4} {'ERROR':<35} {'-':<8} {'-':<8} {'-':<10} {'-':<8} {'-':<6}")
    
    print("\n💡 KEY INSIGHTS:")

    # Analyze trends
    successful_sizes = [size for size, data in combo_results.items() 
                       if "error" not in data and data['total_found'] > 0]

    if len(successful_sizes) >= 2:
        # Sort by size for trend analysis
        sorted_successful = sorted([(size, combo_results[size]) for size in successful_sizes])

        print(f"\n📈 TRENDS (as keyword count increases):")
        print(f"   • Total papers found: Generally decreases (more restrictive)")
        print(f"   • Precision: Generally increases (more relevant papers)")
        print(f"   • Recall: May decrease (fewer WUMaCat papers found)")
        
        # Find optimal balance
        best_balance = None
        best_balance_score = 0
        
        for size, data in sorted_successful:
            # Balance score: 40% precision + 40% recall + 20% result-count term
            precision = data.get('precision', 0)
            recall = data.get('overlap_percentage', 0)
            count = data['total_found']

            # Note: the count term grows with the number of papers found (capped at
            # 1,000 papers), so larger result sets score higher, not lower
            balance_score = (precision * 0.4 + recall * 0.4 + min(count/100, 10) * 0.2)
            
            if balance_score > best_balance_score:
                best_balance_score = balance_score
                best_balance = (size, data)
        
        if best_balance:
            size, data = best_balance
            print(f"\n🎯 RECOMMENDED COMBINATION:")
            print(f"   • {size} keywords: {data['keywords']}")
            print(f"   • Found {data['total_found']} papers total")
            print(f"   • {data['overlap_count']} WUMaCat overlaps ({data['overlap_percentage']:.1f}%)")
            print(f"   • Precision: {data.get('precision', 0):.1f}%")
            print(f"   • Good balance of precision, recall, and manageability")
    
    else:
        print("   ⚠️  Limited successful combinations for trend analysis")

else:
    print("❌ No detailed results available for analysis")

print(f"\n✅ Detailed analysis completed!")


📊 DETAILED COMBINATION ANALYSIS:

Size Keywords                            Found    Overlap  Precision  Recall   F1    
--------------------------------------------------------------------------------
20   binary, light, contact...           925      198      21.4       46.7     29.4  
15   binary, light, contact...           3227     288      8.9        67.9     15.8  
10   binary, light, contact...           4409     314      7.1        74.1     13.0  
7    binary, light, contact...           7436     340      4.6        80.2     8.7   
5    binary, light, contact...           9000     352      3.9        83.0     7.5   

💡 KEY INSIGHTS:

📈 TRENDS (as keyword count increases):
   • Total papers found: Generally decreases (more restrictive)
   • Precision: Generally increases (more relevant papers)
   • Recall: May decrease (fewer WUMaCat papers found)

🎯 RECOMMENDED COMBINATION:
   • 5 keywords: ['binary', 'light', 'contact', 'period', 'photometric']
   • Found 

## 6. Abstract Search Experiment

Now let's test exact keyword searches in abstracts with smaller keyword combinations [10, 5, 4, 3, 2]. Abstracts are far shorter than full text, so fewer papers contain every keyword of a long combination; smaller combinations should still yield meaningful results.


In [8]:
# Test keyword combinations in abstracts
print("🔬 Running abstract search experiment...")
print("This will test smaller keyword combinations (10, 5, 4, 3) in abstract fields")
print("since abstracts are more concise and focused than full text.")

# Test smaller combination sizes in abstracts
abstract_combination_sizes = [10, 5, 4, 3, 2]
print(f"\n🎯 Available keywords: {len(top_keywords)}")
print(f"📊 Testing combination sizes in abstracts: {abstract_combination_sizes}")
print(f"🔍 Search field: abstracts (more focused than full text)")

# Run the abstract experiment
abstract_results = test_keyword_combination_sizes(
    all_keywords=top_keywords,
    wumacat_bibcodes=wumacat_bibcodes,
    combination_sizes=abstract_combination_sizes,
    source_field="abs"
)

print("\n✅ Abstract search experiment completed!")

# Display summary results
if abstract_results and 'summary' in abstract_results:
    summary = abstract_results['summary']
    print(f"\n🎯 ABSTRACT EXPERIMENT SUMMARY:")
    print(f"   Successful combinations: {summary.get('successful_combinations', [])}")
    
    if 'best_overlap' in summary:
        best_overlap = summary['best_overlap']
        print(f"   Best overlap: {best_overlap['percentage']:.1f}% with {best_overlap['size']} keywords")
    
    if 'best_precision' in summary:
        best_precision = summary['best_precision']
        print(f"   Best precision: {best_precision['percentage']:.1f}% with {best_precision['size']} keywords")
        
    if 'best_f1' in summary:
        best_f1 = summary['best_f1']
        print(f"   Best F1-score: {best_f1['score']:.1f}% with {best_f1['size']} keywords")
else:
    print("\n⚠️  Abstract experiment did not complete successfully")


🔬 Running abstract search experiment...
This will test smaller keyword combinations (10, 5, 4, 3) in abstract fields
since abstracts are more concise and focused than full text.

🎯 Available keywords: 25
📊 Testing combination sizes in abstracts: [10, 5, 4, 3, 2]
🔍 Search field: abstracts (more focused than full text)
🧪 Testing keyword combination sizes in 'abs' field
   Available keywords: 25
   Combination sizes to test: [10, 5, 4, 3, 2]
   WUMaCat reference size: 424

📊 Testing combination size: 10 keywords
   Keywords: ['binary', 'light', 'contact', 'period', 'photometric', 'mass', 'type', 'curves', 'binaries', 'eclipsing']
🔍 Searching for exact keywords in abs field:
   Keywords: ['binary', 'light', 'contact', 'period', 'photometric', 'mass', 'type', 'curves', 'binaries', 'eclipsing']
   Query: =abs:"binary" AND =abs:"light" AND =abs:"contact" AND =abs:"period" AND =abs:"photometric" AND =abs:"mass" AND =abs:"type" AND =abs:"curves" AND =abs:"binaries" AND =abs

## 7. Comprehensive Results Summary

Let's compare and summarize the results from both full-text and abstract searches to identify the optimal search strategies.
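
The "best balanced" selection can be illustrated on its own with the figures from the two result tables in this section. This is a simplified sketch, not the notebook's code: the `Strategy` tuple and `pick_balanced` are hypothetical names, and only the 10 to 5,000 paper bounds are taken from the analysis cell.

```python
from typing import NamedTuple


class Strategy(NamedTuple):
    field: str    # ADS field searched ("full" or "abs")
    size: int     # number of keywords combined with AND
    found: int    # total papers returned
    f1: float     # F1-score in percent


# Values copied from the full-text and abstract result tables of this notebook.
STRATEGIES = [
    Strategy("full", 20, 925, 29.4),
    Strategy("full", 15, 3227, 15.8),
    Strategy("full", 10, 4409, 13.0),
    Strategy("full", 7, 7436, 8.7),
    Strategy("full", 5, 9000, 7.5),
    Strategy("abs", 10, 158, 19.2),
    Strategy("abs", 5, 669, 28.9),
    Strategy("abs", 4, 1152, 24.9),
    Strategy("abs", 3, 2038, 21.4),
    Strategy("abs", 2, 28615, 3.2),
]


def pick_balanced(strategies, min_found=10, max_found=5000):
    """Highest F1 among strategies with a manageable result count."""
    candidates = [s for s in strategies if min_found <= s.found <= max_found]
    # Fall back to the overall best F1 when no strategy falls in the range.
    return max(candidates or strategies, key=lambda s: s.f1)


best = pick_balanced(STRATEGIES)
print(best.field, best.size, best.f1)
# full 20 29.4
```

The 5-keyword abstract search is a close second (F1 of 28.9% with 669 papers), so the choice between the two is sensitive to small changes in the keyword list.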


In [9]:
# Comprehensive analysis of both full-text and abstract results
print("📋 COMPREHENSIVE RESULTS SUMMARY")
print("="*60)

print(f"\n🔍 FULL-TEXT SEARCH RESULTS:")
print("-" * 40)
if combination_results and 'combination_results' in combination_results:
    full_combo_results = combination_results['combination_results']
    
    print(f"{'Size':<4} {'Found':<8} {'Overlap':<8} {'Precision':<10} {'Recall':<8} {'F1':<6}")
    print("-" * 50)
    
    for size in sorted(full_combo_results.keys(), reverse=True):
        data = full_combo_results[size]
        if "error" not in data:
            found = data['total_found']
            overlap = data['overlap_count']
            precision = data.get('precision', 0)
            recall = data.get('overlap_percentage', 0)
            f1 = data.get('f1_score', 0)
            print(f"{size:<4} {found:<8} {overlap:<8} {precision:<10.1f} {recall:<8.1f} {f1:<6.1f}")

print(f"\n📄 ABSTRACT SEARCH RESULTS:")
print("-" * 40)
if abstract_results and 'combination_results' in abstract_results:
    abs_combo_results = abstract_results['combination_results']
    
    print(f"{'Size':<4} {'Found':<8} {'Overlap':<8} {'Precision':<10} {'Recall':<8} {'F1':<6}")
    print("-" * 50)
    
    for size in sorted(abs_combo_results.keys(), reverse=True):
        data = abs_combo_results[size]
        if "error" not in data:
            found = data['total_found']
            overlap = data['overlap_count']
            precision = data.get('precision', 0)
            recall = data.get('overlap_percentage', 0)
            f1 = data.get('f1_score', 0)
            print(f"{size:<4} {found:<8} {overlap:<8} {precision:<10.1f} {recall:<8.1f} {f1:<6.1f}")

# Find overall best strategies
print(f"\n🏆 BEST STRATEGIES COMPARISON:")
print("-" * 50)

all_strategies = []

# Add full-text strategies
if combination_results and 'combination_results' in combination_results:
    for size, data in combination_results['combination_results'].items():
        if "error" not in data and data['total_found'] > 0:
            all_strategies.append({
                'field': 'full',
                'size': size,
                'data': data
            })

# Add abstract strategies
if abstract_results and 'combination_results' in abstract_results:
    for size, data in abstract_results['combination_results'].items():
        if "error" not in data and data['total_found'] > 0:
            all_strategies.append({
                'field': 'abs',
                'size': size,
                'data': data
            })

if all_strategies:
    # Find best by different metrics
    best_precision = max(all_strategies, key=lambda x: x['data'].get('precision', 0))
    best_recall = max(all_strategies, key=lambda x: x['data'].get('overlap_percentage', 0))
    best_f1 = max(all_strategies, key=lambda x: x['data'].get('f1_score', 0))
    
    # Find best balanced strategy (good F1 + reasonable paper count)
    balanced_strategies = [s for s in all_strategies if 10 <= s['data']['total_found'] <= 5000]
    if balanced_strategies:
        best_balanced = max(balanced_strategies, key=lambda x: x['data'].get('f1_score', 0))
    else:
        best_balanced = best_f1
    
    print(f"🎯 Best Precision: {best_precision['data'].get('precision', 0):.1f}% ")
    print(f"   → {best_precision['field']} field, {best_precision['size']} keywords")
    print(f"   → Found {best_precision['data']['total_found']} papers, {best_precision['data']['overlap_count']} WUMaCat overlaps")

    print(f"\n📊 Best Recall: {best_recall['data'].get('overlap_percentage', 0):.1f}%")
    print(f"   → {best_recall['field']} field, {best_recall['size']} keywords")
    print(f"   → Found {best_recall['data']['total_found']} papers, {best_recall['data']['overlap_count']} WUMaCat overlaps")

    print(f"\n⚖️  Best F1-Score: {best_f1['data'].get('f1_score', 0):.1f}%")
    print(f"   → {best_f1['field']} field, {best_f1['size']} keywords")
    print(f"   → Found {best_f1['data']['total_found']} papers, {best_f1['data']['overlap_count']} WUMaCat overlaps")

    print(f"\n🎯 RECOMMENDED STRATEGY (Best Balanced):")
    print(f"   → {best_balanced['field']} field, {best_balanced['size']} keywords")
    print(f"   → F1-Score: {best_balanced['data'].get('f1_score', 0):.1f}%")
    print(f"   → Precision: {best_balanced['data'].get('precision', 0):.1f}% | Recall: {best_balanced['data'].get('overlap_percentage', 0):.1f}%")
    print(f"   → Found {best_balanced['data']['total_found']} papers, {best_balanced['data']['overlap_count']} WUMaCat overlaps")
    print(f"   → Keywords: {best_balanced['data']['keywords'][:5]}..." if len(best_balanced['data']['keywords']) > 5 else f"   → Keywords: {best_balanced['data']['keywords']}")

print(f"\n💡 KEY INSIGHTS:")
print(f"   • Full-text searches: Better for comprehensive discovery")
print(f"   • Abstract searches: Better precision with fewer keywords")
print(f"   • Optimal range: Likely 4-10 keywords depending on field")
print(f"   • Use full-text for broad discovery, abstracts for focused reviews")

# Save comprehensive results
experiment_summary = {
    "full_text_results": combination_results,
    "abstract_results": abstract_results,
    # The best_* strategy dicts are only assigned above when all_strategies is
    # non-empty. Checking locals() is unreliable here because an earlier cell
    # already defines best_precision with a different structure.
    "best_strategies": {
        "best_precision": {
            "field": best_precision['field'],
            "size": best_precision['size'],
            "precision": best_precision['data'].get('precision', 0)
        } if all_strategies else None,
        "best_recall": {
            "field": best_recall['field'], 
            "size": best_recall['size'],
            "recall": best_recall['data'].get('overlap_percentage', 0)
        } if all_strategies else None,
        "recommended": {
            "field": best_balanced['field'],
            "size": best_balanced['size'],
            "f1_score": best_balanced['data'].get('f1_score', 0),
            "keywords": best_balanced['data']['keywords']
        } if all_strategies else None
    }
}

# Save results
output_file = '../data/comprehensive_keyword_experiment_results.json'
save_experiment_results(experiment_summary, output_file)

print(f"\n✅ Comprehensive analysis completed!")
print(f"📄 Results saved to: {output_file}")


📋 COMPREHENSIVE RESULTS SUMMARY

🔍 FULL-TEXT SEARCH RESULTS:
----------------------------------------
Size Found    Overlap  Precision  Recall   F1    
--------------------------------------------------
20   925      198      21.4       46.7     29.4  
15   3227     288      8.9        67.9     15.8  
10   4409     314      7.1        74.1     13.0  
7    7436     340      4.6        80.2     8.7   
5    9000     352      3.9        83.0     7.5   

📄 ABSTRACT SEARCH RESULTS:
----------------------------------------
Size Found    Overlap  Precision  Recall   F1    
--------------------------------------------------
10   158      56       35.4       13.2     19.2  
5    669      158      23.6       37.3     28.9  
4    1152     196      17.0       46.2     24.9  
3    2038     264      13.0       62.3     21.4  
2    28615    324      1.6        76.4     3.2   

🏆 BEST STRATEGIES COMPARISON:
--------------------------------------------------
🎯 Best Precision: 35.4% 
   ‚Ü