# Ticker Linking Module Testing

This notebook exercises the ticker linking module (`jobs.ingest.linker`): symbol map construction, ticker pattern matching, common-word filtering, single-article linking, and batch linking.

**Note:** This notebook can run with or without a database connection. Set `USE_DATABASE = False` to use mock data.

## Setup

In [1]:
# Configuration: Set to False to test without database
USE_DATABASE = False

print(f"Database mode: {'ENABLED' if USE_DATABASE else 'DISABLED (using mock data)'}")

Database mode: DISABLED (using mock data)


In [2]:
# Add project root to path
import sys
from pathlib import Path

project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

print(f"Project root: {project_root}")

Project root: /Users/alex/market-pulse-v2


In [3]:
# Import required modules
import logging
from datetime import UTC, datetime
from pprint import pprint
from unittest.mock import Mock

from app.db.models import Article, Ticker
from jobs.ingest.linker import TickerLinker, COMMON_WORD_TICKERS

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

print("✓ Imports successful")

✓ Imports successful


## 1. Load Tickers from Database (or Mock Data)

In [16]:
if USE_DATABASE:
    # Load from database
    from app.db.session import SessionLocal
    
    session = SessionLocal()
    tickers = session.query(Ticker).all()
    print(f"✓ Loaded {len(tickers)} tickers from database")
else:
    # Create mock tickers for testing
    session = None
    
    mock_ticker_data = [
        ("WAR", "some ETF"),
        ("TSLA", "Tesla, Inc."),
        ("NVDA", "NVIDIA Corporation"),
        ("MSFT", "Microsoft Corporation"),
        ("V", "Visa Inc."),
        ("GOOGL", "Alphabet Inc."),
        ("AMZN", "Amazon.com, Inc."),
        ("META", "Meta Platforms, Inc."),
        ("AMD", "Advanced Micro Devices, Inc."),
        ("INTC", "Intel Corporation"),
        ("GME", "GameStop Corp."),
        ("AMC", "AMC Entertainment Holdings, Inc."),
        ("SPY", "SPDR S&P 500 ETF Trust"),
        ("QQQ", "Invesco QQQ Trust"),
        ("NFLX", "Netflix, Inc."),
        ("DIS", "The Walt Disney Company"),
        ("UBER", "Uber Technologies, Inc."),
        ("PLTR", "Palantir Technologies Inc."),
        ("COIN", "Coinbase Global, Inc."),
        ("SHOP", "Shopify Inc."),
    ]
    
    tickers = []
    for symbol, name in mock_ticker_data:
        mock_ticker = Mock(spec=Ticker)
        mock_ticker.symbol = symbol
        mock_ticker.name = name
        tickers.append(mock_ticker)
    
    print(f"✓ Created {len(tickers)} mock tickers for testing")

# Display first 10 tickers
print("\nFirst 10 tickers:")
for ticker in tickers[:10]:
    print(f"  {ticker.symbol}: {ticker.name}")

✓ Created 20 mock tickers for testing

First 10 tickers:
  WAR: some ETF
  TSLA: Tesla, Inc.
  NVDA: NVIDIA Corporation
  MSFT: Microsoft Corporation
  V: Visa Inc.
  GOOGL: Alphabet Inc.
  AMZN: Amazon.com, Inc.
  META: Meta Platforms, Inc.
  AMD: Advanced Micro Devices, Inc.
  INTC: Intel Corporation


## 2. Initialize TickerLinker

In [17]:
# Initialize the linker
linker = TickerLinker(tickers, max_scraping_workers=5)

print("✓ TickerLinker initialized")
print(f"  Alias map size: {len(linker.alias_to_ticker)} entries")
print(f"  Number of tickers: {len(linker.tickers)}")
print("\nSample alias mappings:")
for alias, symbol in list(linker.alias_to_ticker.items())[:10]:
    print(f"  '{alias}' -> {symbol}")

INFO:jobs.ingest.linker:Built ticker symbol map with 40 entries (no aliases)


✓ TickerLinker initialized
  Alias map size: 40 entries
  Number of tickers: 20

Sample alias mappings:
  'war' -> WAR
  'WAR' -> WAR
  'tsla' -> TSLA
  'TSLA' -> TSLA
  'nvda' -> NVDA
  'NVDA' -> NVDA
  'msft' -> MSFT
  'MSFT' -> MSFT
  'v' -> V
  'V' -> V
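
The log line and the sample mappings suggest that each symbol is stored under both its lowercase and uppercase spelling, which would explain why 20 tickers produce 40 entries. A minimal sketch of that idea, assuming a plain dict keyed by both spellings (the helper name `build_symbol_map` is hypothetical and is not taken from `jobs.ingest.linker`):

```python
def build_symbol_map(symbols):
    """Map each symbol's lowercase and uppercase spelling to the canonical symbol."""
    symbol_map = {}
    for symbol in symbols:
        canonical = symbol.upper()
        symbol_map[canonical.lower()] = canonical  # e.g. 'war' -> 'WAR'
        symbol_map[canonical] = canonical          # e.g. 'WAR' -> 'WAR'
    return symbol_map


demo_map = build_symbol_map(["WAR", "TSLA", "V"])
print(len(demo_map))  # two entries per symbol
```

Storing both spellings lets a lookup succeed for `$tsla` and `$TSLA` alike without normalizing the text first, at the cost of doubling the map size.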


## 3. Test Common Word Tickers Filter

Check that common English words are in the filter list to prevent false positives.
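
The rule this section checks (a common-word ticker needs a `$` prefix, or an ALL CAPS mention plus financial context) can be sketched as below. This is an illustration under assumptions, not the module's implementation: `DEMO_COMMON_WORDS`, `DEMO_FINANCIAL_TERMS`, and `common_word_allowed` are hypothetical names, "financial context" is reduced to a keyword check, and the stricter `CAPITALIZED_COMMON_WORDS` rule is not modeled.

```python
import re

# Hypothetical stand-ins; the real sets live in jobs.ingest.linker.
DEMO_COMMON_WORDS = {"WAR", "FAST", "V"}
DEMO_FINANCIAL_TERMS = {"stock", "stocks", "etf", "earnings", "shares", "portfolio"}


def common_word_allowed(symbol, text):
    """Return True if `symbol` may be linked given how it appears in `text`."""
    if symbol not in DEMO_COMMON_WORDS:
        return True  # not a common word, so normal matching applies
    # Rule 1: an explicit $ prefix always qualifies (any letter case).
    if re.search(r"\$" + re.escape(symbol) + r"\b", text, flags=re.IGNORECASE):
        return True
    # Rule 2: an ALL CAPS mention plus at least one financial term in the text.
    has_caps = re.search(r"\b" + re.escape(symbol) + r"\b", text) is not None
    words = set(re.findall(r"[a-z]+", text.lower()))
    return has_caps and bool(words & DEMO_FINANCIAL_TERMS)


print(common_word_allowed("WAR", "The war on drugs has been going on for decades."))
print(common_word_allowed("WAR", "Looking at WAR stock performance this quarter."))
```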

In [None]:
# Check some common word tickers (including newly added ones)
test_common_words = ['AI', 'GO', 'RUN', 'FAST', 'HOME', 'PLAY', 'WORK', 'V', 'A', 'T', 'WAR', 'BRO', 'LOT', 'FUN', 'YALL', 'WOW', 'SUB', 'TILL', 'TALK', 'EAT', 'COST', 'DIPS']

print("Common word ticker filter test:")
print("These words require $ prefix OR (ALL CAPS + financial context):\n")
for word in test_common_words:
    is_common = word in COMMON_WORD_TICKERS
    symbol = "✓" if is_common else "✗"
    status = "COMMON WORD (strict rules)" if is_common else "Normal matching allowed"
    print(f"  {symbol} {word:6s} : {status}")

print(f"\nTotal common words in filter: {len(COMMON_WORD_TICKERS)}")

# Import and display the new CAPITALIZED_COMMON_WORDS list
from jobs.ingest.linker import CAPITALIZED_COMMON_WORDS
print(f"\nCapitalized common words (ALWAYS require $): {CAPITALIZED_COMMON_WORDS}")

## 4. Test Ticker Pattern Matching

Test the `_find_ticker_matches` method with various text patterns.
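
The matching rules inside `_find_ticker_matches` are not shown in this notebook. As a rough mental model only, the sketch below pairs a cashtag pattern with a bare ALL CAPS pattern and returns a `{symbol: [matched terms]}` dict. `DEMO_SYMBOLS`, `CASHTAG`, `BARE`, and `find_matches` are hypothetical names, and the real method may record matched terms differently.

```python
import re
from collections import defaultdict

# Hypothetical demo universe; the notebook's linker receives its tickers at init.
DEMO_SYMBOLS = {"TSLA", "NVDA", "AMD", "V"}

# "$" followed by 1-5 letters in any case (cashtag form).
CASHTAG = re.compile(r"\$([A-Za-z]{1,5})\b")
# Bare ALL CAPS token of 2-5 letters that is not part of a cashtag.
BARE = re.compile(r"(?<!\$)\b[A-Z]{2,5}\b")


def find_matches(text):
    """Return {symbol: [matched terms]} for known symbols mentioned in text."""
    matches = defaultdict(list)
    for m in CASHTAG.finditer(text):
        symbol = m.group(1).upper()
        if symbol in DEMO_SYMBOLS:
            matches[symbol].append(m.group(0))
    for m in BARE.finditer(text):
        if m.group(0) in DEMO_SYMBOLS:
            matches[m.group(0)].append(m.group(0))
    return dict(matches)


print(find_matches("I love $TSLA and NVDA"))
print(find_matches("The letter V is very common in English"))
```

Requiring at least two letters for bare tokens is one simple way to keep single-letter symbols such as `V` from matching without a `$` prefix.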

In [8]:
# Test cases for ticker matching
test_texts = [
    "I love $AAPL and $TSLA stocks!",
    "Apple (AAPL) reported strong earnings.",
    "Tesla TSLA is performing well.",
    "$nvda and $amd are great semiconductor stocks",
    "I bought apples at the store (not the stock)",
    "Just got my visa application approved",
    "Visa Inc (V) stock is up 5% today",
    "$V is doing well in the market",
    "Microsoft MSFT, Google GOOGL, and Amazon AMZN are tech giants",
    "The letter V is very common in English",
]

print("Ticker Matching Tests:\n" + "="*70)
for i, text in enumerate(test_texts, 1):
    matches = linker._find_ticker_matches(text)
    print(f"\n{i}. Text: \"{text}\"")
    if matches:
        print("   ✓ Matches found:")
        for ticker, terms in matches.items():
            print(f"     • {ticker}: {terms}")
    else:
        print("   ✗ No matches found")

Ticker Matching Tests:

1. Text: "I love $AAPL and $TSLA stocks!"
   ✓ Matches found:
     • AAPL: ['AAPL', '$AAPL']
     • TSLA: ['TSLA', '$TSLA']

2. Text: "Apple (AAPL) reported strong earnings."
   ✓ Matches found:
     • AAPL: ['AAPL']

3. Text: "Tesla TSLA is performing well."
   ✓ Matches found:
     • TSLA: ['TSLA']

4. Text: "$nvda and $amd are great semiconductor stocks"
   ✓ Matches found:
     • NVDA: ['$nvda', 'nvda']
     • AMD: ['amd', '$amd']

5. Text: "I bought apples at the store (not the stock)"
   ✗ No matches found

6. Text: "Just got my visa application approved"
   ✗ No matches found

7. Text: "Visa Inc (V) stock is up 5% today"
   ✗ No matches found

8. Text: "$V is doing well in the market"
   ✓ Matches found:
     • V: ['$V']

9. Text: "Microsoft MSFT, Google GOOGL, and Amazon AMZN are tech giants"
   ✓ Matches found:
     • MSFT: ['MSFT']
     • GOOGL: ['GOOGL']
     • AMZN: ['AMZN']

10. Text: "The letter V is very commo

## 5. Test Article Linking with Context Analysis

Create test articles and link them to tickers using the full linking pipeline.

In [None]:
# Create test articles
test_articles = [
    # Clear ticker mention with $
    Article(
        source="test",
        url="https://example.com/1",
        published_at=datetime.now(UTC),
        title="Apple Stock Soars",
        text="Apple stock $AAPL is up 5% today after strong iPhone sales."
    ),
    
    # Multiple tickers
    Article(
        source="test",
        url="https://example.com/2",
        published_at=datetime.now(UTC),
        title="Tech Stocks Rally",
        text="$AAPL, $MSFT, and $GOOGL all saw gains today in the tech sector."
    ),
    
    # Ticker without $ but with financial context
    Article(
        source="test",
        url="https://example.com/3",
        published_at=datetime.now(UTC),
        title="Tesla Earnings Beat Expectations",
        text="Tesla TSLA stock surged after the company reported quarterly earnings that beat analyst expectations."
    ),
    
    # Ambiguous - should not match (visa application, not Visa Inc)
    Article(
        source="test",
        url="https://example.com/4",
        published_at=datetime.now(UTC),
        title="Visa Application Process",
        text="The visa application process has been simplified. You can now apply for a visa online."
    ),
    
    # Clear Visa Inc mention with $
    Article(
        source="test",
        url="https://example.com/5",
        published_at=datetime.now(UTC),
        title="Visa Inc Earnings",
        text="Visa Inc $V reported strong quarterly revenue growth driven by increased payment volume."
    ),
]

print(f"✓ Created {len(test_articles)} test articles")

In [None]:
# Link articles to tickers
print("\nArticle Linking Results:\n" + "="*80)

for i, article in enumerate(test_articles, 1):
    ticker_links = linker.link_article(article, use_title_only=False)
    
    print(f"\n{i}. Article: {article.title}")
    print(f"   Text: {article.text[:80]}...")
    
    if ticker_links:
        print(f"   ✓ Linked to {len(ticker_links)} ticker(s):")
        for link in ticker_links:
            print(f"     • {link.ticker}")
            print(f"       - Confidence: {link.confidence:.2f}")
            print(f"       - Matched terms: {', '.join(link.matched_terms)}")
            print(f"       - Reasoning: {', '.join(link.reasoning)}")
    else:
        print("   ✗ No tickers linked (below confidence threshold)")

## 6. Test Reddit Comment Fast Path

Test the optimized fast path for Reddit comments (skips context analysis for speed).
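
One way to picture a fast path is a dispatch on the article source: comments get a fixed score per match, while other sources go through a per-ticker context step. The sketch below is an assumption about the general shape, not the behavior of `link_article`. The function `link_with_fast_path`, its callbacks, and the `fast_confidence` value are all hypothetical.

```python
def link_with_fast_path(source, text, find_matches, analyze_context, fast_confidence=0.7):
    """Return {ticker: confidence}; comments skip the analyze_context callback."""
    matches = find_matches(text)
    if source == "reddit_comment":
        # Fast path: one fixed score per match, no per-ticker analysis.
        return {ticker: fast_confidence for ticker in matches}
    # Full path: score every matched ticker with the (slower) context step.
    return {ticker: analyze_context(ticker, text) for ticker in matches}


analyzed = []  # records which tickers went through the context step


def demo_find(text):
    # Toy matcher: treats every "$WORD" token as a ticker.
    return {w[1:].upper(): [w] for w in text.split() if w.startswith("$") and w[1:].isalpha()}


def demo_analyze(ticker, text):
    analyzed.append(ticker)
    return 0.9


print(link_with_fast_path("reddit_comment", "$GME $AMC to the moon!", demo_find, demo_analyze))
print(analyzed)  # stays empty because the fast path never calls demo_analyze
```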

In [12]:
# Create Reddit comment test articles
reddit_comments = [
    Article(
        source="reddit_comment",
        url="https://reddit.com/comment/4",
        published_at=datetime.now(UTC),
        title="Comment 4",
        text="*Reported* homicides. Truth is the first casualty of war. And our govt. is at war with its citizens."
    ),
]

print("Reddit Comment Fast Path Tests:\n" + "="*80)

for i, comment in enumerate(reddit_comments, 1):
    # This should use the fast path
    ticker_links = linker.link_article(comment)
    
    print(f"\n{i}. Comment: {comment.text}")
    
    if ticker_links:
        print(f"   ✓ Linked to {len(ticker_links)} ticker(s):")
        for link in ticker_links:
            print(f"     • {link.ticker}: {link.confidence:.2f} confidence ({', '.join(link.matched_terms)})")
    else:
        print("   ✗ No tickers linked")

Reddit Comment Fast Path Tests:

1. Comment: *Reported* homicides. Truth is the first casualty of war. And our govt. is at war with its citizens.
   ✗ No tickers linked


## 7. Test Edge Cases

Test various edge cases and potential false positives.
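
The cell for this section prints results and keeps the expected outcome of each case in a comment. If the expectations should be checked automatically, a small helper like the following could compare linked symbols against expected sets. It is a sketch: `check_expectations` and `stub_link` are hypothetical, and the stub stands in for the real linker so the example runs on its own.

```python
def check_expectations(link_fn, cases):
    """Run link_fn on each text and collect cases whose symbols differ from expected."""
    failures = []
    for text, expected in cases:
        actual = set(link_fn(text))
        if actual != set(expected):
            failures.append((text, set(expected), actual))
    return failures


def stub_link(text):
    # Stand-in for the real linker: only recognizes the cashtag "$V".
    return {"V"} if "$V" in text.split() else set()


cases = [
    ("The letter V is very common in English.", set()),
    ("I'm buying $V today. Visa is a strong company.", {"V"}),
]
print(check_expectations(stub_link, cases))  # an empty list means every case passed
```

In the notebook itself, `link_fn` could wrap `linker.link_article` and return `{link.ticker for link in ticker_links}` for an `Article` built from the text.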

In [None]:
edge_cases = [
    # Single letter ticker without $ (should NOT match)
    Article(
        source="test",
        url="https://example.com/edge1",
        published_at=datetime.now(UTC),
        title="Letter V",
        text="The letter V is very common in English."
    ),
    
    # Single letter ticker with $ (SHOULD match)
    Article(
        source="test",
        url="https://example.com/edge2",
        published_at=datetime.now(UTC),
        title="Visa Stock",
        text="I'm buying $V today. Visa is a strong company."
    ),
    
    # Common word ticker without $ and no financial context (should NOT match)
    Article(
        source="test",
        url="https://example.com/edge3",
        published_at=datetime.now(UTC),
        title="Fast Cars",
        text="I love fast cars and going on road trips."
    ),
    
    # Mixed case tickers (should normalize)
    Article(
        source="test",
        url="https://example.com/edge4",
        published_at=datetime.now(UTC),
        title="Tech Stocks",
        text="Looking at $aapl, $Msft, and $GOOGL for my portfolio."
    ),
    
    # Empty text (should not crash)
    Article(
        source="test",
        url="https://example.com/edge5",
        published_at=datetime.now(UTC),
        title="",
        text=""
    ),
    
    # $-prefixed tickers in casual prose with punctuation (should still match)
    Article(
        source="test",
        url="https://example.com/edge6",
        published_at=datetime.now(UTC),
        title="Investment Discussion",
        text="Check out $PLTR and $COIN - both have huge potential!"
    ),
    
    # NEW TEST: WAR in lowercase context (should NOT match)
    Article(
        source="test",
        url="https://example.com/edge7",
        published_at=datetime.now(UTC),
        title="War Discussion",
        text="The war on drugs has been going on for decades."
    ),
    
    # NEW TEST: WAR in ALL CAPS with financial context (SHOULD match)
    Article(
        source="test",
        url="https://example.com/edge8",
        published_at=datetime.now(UTC),
        title="ETF Analysis",
        text="Looking at WAR stock performance, the ETF has shown strong gains this quarter."
    ),
    
    # NEW TEST: WAR with $ prefix (SHOULD match)
    Article(
        source="test",
        url="https://example.com/edge9",
        published_at=datetime.now(UTC),
        title="ETF Discussion",
        text="I'm considering buying $WAR for my portfolio."
    ),
    
    # NEW TEST: Multiple common words in lowercase (should NOT match)
    Article(
        source="test",
        url="https://example.com/edge10",
        published_at=datetime.now(UTC),
        title="Food Talk",
        text="I love to eat good food and talk with my bro about fun stuff. The cost is worth it till we're done."
    ),
    
    # NEW TEST: Multiple common words in ALL CAPS with financial context (SHOULD match if they're tickers)
    Article(
        source="test",
        url="https://example.com/edge11",
        published_at=datetime.now(UTC),
        title="Stock Portfolio",
        text="My stock portfolio includes FUN and COST. Both stocks have shown solid earnings growth."
    ),
    
    # NEW TEST: Letter A at start of sentence (should NOT match)
    Article(
        source="test",
        url="https://example.com/edge12",
        published_at=datetime.now(UTC),
        title="General Discussion",
        text="A great opportunity is emerging in the market."
    ),
    
    # NEW TEST: Letter I in sentence (should NOT match)
    Article(
        source="test",
        url="https://example.com/edge13",
        published_at=datetime.now(UTC),
        title="Personal Opinion",
        text="I think the market will continue to grow."
    ),
    
    # NEW TEST: $A and $I with dollar sign (SHOULD match)
    Article(
        source="test",
        url="https://example.com/edge14",
        published_at=datetime.now(UTC),
        title="Ticker Discussion",
        text="Looking at $A and $I for potential investments."
    ),
]

print("Edge Case Tests:\n" + "="*80)

for i, article in enumerate(edge_cases, 1):
    ticker_links = linker.link_article(article, use_title_only=False)
    
    print(f"\n{i}. Title: {article.title or '(empty)'}")
    print(f"   Text: {article.text[:60] if article.text else '(empty)'}...")
    
    if ticker_links:
        print(f"   ✓ Linked to {len(ticker_links)} ticker(s):")
        for link in ticker_links:
            print(f"     • {link.ticker}: {link.confidence:.2f}")
    else:
        print("   ✗ No tickers linked")

## 8. Batch Processing Test

Test linking multiple articles at once and gather statistics.
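
The statistics in this section can also be gathered by a reusable helper. The sketch below assumes only what the cell relies on: `results` is an iterable of `(article, links)` pairs and each link has a `confidence` attribute. The name `summarize_links` and the demo objects are hypothetical.

```python
from statistics import mean
from types import SimpleNamespace


def summarize_links(results):
    """Summarize (article, links) pairs; each link must expose .confidence."""
    results = list(results)
    confidences = [link.confidence for _, links in results for link in links]
    return {
        "articles": len(results),
        "linked_articles": sum(1 for _, links in results if links),
        "total_links": len(confidences),
        # Guard the empty case so a batch with no links does not raise.
        "avg_confidence": mean(confidences) if confidences else 0.0,
        "min_confidence": min(confidences, default=0.0),
        "max_confidence": max(confidences, default=0.0),
    }


demo_results = [
    ("article-1", [SimpleNamespace(confidence=0.9), SimpleNamespace(confidence=0.7)]),
    ("article-2", []),
]
print(summarize_links(demo_results))
```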

In [None]:
# Combine all test articles
all_articles = test_articles + reddit_comments + edge_cases

print(f"Batch processing {len(all_articles)} articles...\n")

# Link all articles
results = linker.link_articles(all_articles)

# Calculate statistics
total_links = sum(len(links) for _, links in results)
linked_articles = sum(1 for _, links in results if links)
avg_confidence = sum(link.confidence for _, links in results for link in links) / total_links if total_links > 0 else 0

print("\nBatch Processing Summary:")
print("="*60)
print(f"Total articles processed: {len(all_articles)}")
print(f"Articles with ticker links: {linked_articles}")
print(f"Articles without links: {len(all_articles) - linked_articles}")
print(f"Total ticker links created: {total_links}")
print(f"Average links per article: {total_links/len(all_articles):.2f}")
print(f"Average confidence score: {avg_confidence:.2f}")

# Show distribution of confidence scores
if total_links > 0:
    all_confidences = [link.confidence for _, links in results for link in links]
    print(f"\nConfidence Distribution:")
    print(f"  Min: {min(all_confidences):.2f}")
    print(f"  Max: {max(all_confidences):.2f}")
    print(f"  Avg: {avg_confidence:.2f}")

## 9. Performance Testing

Test the performance of the linking system with a larger batch.
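
For short runs, `time.perf_counter()` is generally a better clock than `time.time()` because it is monotonic and has higher resolution. A minimal timing helper, with hypothetical names `time_call` and `throughput`, might look like this:

```python
import time


def time_call(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def throughput(count, elapsed):
    """Items per second; avoids ZeroDivisionError when the call was too fast to measure."""
    return count / elapsed if elapsed > 0 else float("inf")


result, elapsed = time_call(sorted, range(1000, 0, -1))
print(f"{throughput(len(result), elapsed):.0f} items/second")
```

With the notebook's objects this could be used as `perf_results, elapsed = time_call(linker.link_articles, performance_articles)`.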

In [None]:
import time

# Create a larger batch of test articles
performance_articles = []
test_patterns = [
    "$AAPL is up today",
    "$TSLA reported strong earnings",
    "$NVDA and $AMD are both rising",
    "Microsoft $MSFT announced new products",
    "Google stock $GOOGL surged after hours",
]

for i in range(50):
    pattern = test_patterns[i % len(test_patterns)]
    performance_articles.append(
        Article(
            source="test",
            url=f"https://example.com/perf{i}",
            published_at=datetime.now(UTC),
            title=f"Test Article {i}",
            text=f"This is test article {i}. {pattern}"
        )
    )

print(f"Performance test with {len(performance_articles)} articles...\n")

# Time the linking process (perf_counter is monotonic and higher resolution than time.time)
start_time = time.perf_counter()
perf_results = linker.link_articles(performance_articles)
end_time = time.perf_counter()

elapsed = end_time - start_time
per_article = elapsed / len(performance_articles)

print("\nPerformance Results:")
print("="*60)
print(f"Total time: {elapsed:.2f} seconds")
print(f"Time per article: {per_article*1000:.2f} ms")
print(f"Articles per second: {len(performance_articles)/elapsed:.2f}")

# Calculate success rate
successful_links = sum(1 for _, links in perf_results if links)
print(f"\nLinking success rate: {successful_links/len(performance_articles)*100:.1f}%")

## 10. Real Database Articles Test (Optional)

Test linking with actual articles from the database (only runs if `USE_DATABASE = True`).

In [None]:
if USE_DATABASE and session:
    # Get some real articles from the database
    real_articles = session.query(Article).limit(10).all()
    
    if real_articles:
        print(f"Testing with {len(real_articles)} real articles from database:\n" + "="*80)
        
        for i, article in enumerate(real_articles, 1):
            ticker_links = linker.link_article(article, use_title_only=False)
            
            print(f"\n{i}. Source: {article.source}")
            print(f"   Title: {article.title[:60] if article.title else '(no title)'}...")
            print(f"   URL: {article.url[:50]}...")
            
            if ticker_links:
                print(f"   ✓ Linked to {len(ticker_links)} ticker(s):")
                for link in ticker_links:
                    print(f"     • {link.ticker}: {link.confidence:.2f} ({', '.join(link.matched_terms[:3])})")
            else:
                print("   ✗ No tickers linked")
    else:
        print("No articles found in database")
else:
    print("Skipping database test (USE_DATABASE = False)")
    print("\nTo test with real database articles:")
    print("  1. Set USE_DATABASE = True in the first cell")
    print("  2. Restart the notebook and run all cells")

## 11. Test Specific Linker Methods

Test individual methods directly for debugging.

In [None]:
# Test _fast_reddit_comment_linking directly
print("Testing _fast_reddit_comment_linking method:\n" + "="*60)

test_comment = Article(
    source="reddit_comment",
    url="https://reddit.com/test",
    published_at=datetime.now(UTC),
    title="Test Comment",
    text="🚀🚀🚀 $GME $AMC to the moon! YOLO! 💎🙌"
)

fast_results = linker._fast_reddit_comment_linking(test_comment)

print(f"Comment: {test_comment.text}")
print(f"\nResults from fast path:")
for link in fast_results:
    print(f"  • {link.ticker}: {link.confidence:.2f}")
    print(f"    Matched: {link.matched_terms}")
    print(f"    Reasoning: {link.reasoning}")

In [None]:
# Test _extract_text_for_matching method
print("Testing _extract_text_for_matching method:\n" + "="*60)

test_article = Article(
    source="test",
    url="https://example.com/test",
    published_at=datetime.now(UTC),
    title="Test Article Title",
    text="This is the article body text with $AAPL mention."
)

extracted_text = linker._extract_text_for_matching(test_article, use_title_only=True)
print(f"Article title: {test_article.title}")
print(f"Article text: {test_article.text}")
print(f"\nExtracted text (title_only=True):")
print(f"  {extracted_text}")

extracted_full = linker._extract_text_for_matching(test_article, use_title_only=False)
print(f"\nExtracted text (title_only=False):")
print(f"  {extracted_full}")

## 12. Cleanup

In [None]:
# Close database session if it was opened
if USE_DATABASE and session:
    session.close()
    print("✓ Database session closed")
else:
    print("✓ No database session to close (using mock data)")

## Summary

This notebook tested the following linking module functions:

### Functions Tested:

1. **`TickerLinker.__init__()`** - Initialization with tickers
2. **`TickerLinker._build_alias_map()`** - Building ticker symbol mapping
3. **`TickerLinker._find_ticker_matches()`** - Pattern matching for tickers in text
4. **`TickerLinker._extract_text_for_matching()`** - Text extraction from articles
5. **`TickerLinker._fast_reddit_comment_linking()`** - Fast path for Reddit comments
6. **`TickerLinker.link_article()`** - Main linking method with context analysis
7. **`TickerLinker.link_articles()`** - Batch processing of multiple articles

### Key Features Tested:

- ✓ **$TICKER format matching** (high confidence - 0.9)
- ✓ **TICKER format matching** (medium confidence - 0.7)
- ✓ **Common word filtering** (prevents false positives like "I", "AI", "GO")
- ✓ **Single letter ticker handling** (requires $ prefix)
- ✓ **Context analysis integration** (confidence scoring)
- ✓ **Reddit comment optimization** (fast path without heavy analysis)
- ✓ **Confidence scoring** (0.5 minimum threshold)
- ✓ **Matched terms tracking** (shows what triggered the match)
- ✓ **Edge cases** (empty text, mixed case, ambiguous mentions)
- ✓ **Performance benchmarking** (articles per second)
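
The confidence figures in this list (0.9 for `$TICKER`, 0.7 for bare `TICKER`, 0.5 as the minimum threshold) can be combined as in the sketch below. How the real context analysis adjusts a score is not shown in this notebook, so the additive `context_adjustment` parameter, the `DemoLink` class, and `score_matches` are assumptions made for illustration.

```python
from dataclasses import dataclass, field

CASHTAG_CONFIDENCE = 0.9  # "$TICKER" format, per the list above
BARE_CONFIDENCE = 0.7     # "TICKER" format, per the list above
MIN_CONFIDENCE = 0.5      # minimum threshold, per the list above


@dataclass
class DemoLink:
    ticker: str
    confidence: float
    matched_terms: list = field(default_factory=list)


def score_matches(matches, context_adjustment=0.0):
    """Turn {ticker: [terms]} into links, dropping those below the threshold."""
    links = []
    for ticker, terms in matches.items():
        has_cashtag = any(term.startswith("$") for term in terms)
        base = CASHTAG_CONFIDENCE if has_cashtag else BARE_CONFIDENCE
        # Hypothetical: context shifts the base score, clamped to [0, 1].
        confidence = max(0.0, min(1.0, base + context_adjustment))
        if confidence >= MIN_CONFIDENCE:
            links.append(DemoLink(ticker, confidence, list(terms)))
    return links


demo_matches = {"TSLA": ["$TSLA"], "MSFT": ["MSFT"]}
print([link.ticker for link in score_matches(demo_matches)])
print([link.ticker for link in score_matches(demo_matches, context_adjustment=-0.3)])
```

With a negative adjustment of 0.3, the bare `MSFT` mention falls below the threshold while the cashtag mention survives, which mirrors the "No tickers linked (below confidence threshold)" branch in the linking cells.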

### Testing Modes:

- **Mock Mode** (`USE_DATABASE = False`): Tests with 20 mock tickers, no DB required
- **Database Mode** (`USE_DATABASE = True`): Tests with real tickers and articles from DB

### Next Steps:

- Run with real database to test integration
- Adjust confidence thresholds if needed
- Add more common words to filter if false positives occur
- Monitor performance with larger datasets