# 🏢 Entity Recognition Tool - Step by Step Tutorial

This notebook demonstrates how to build a complete company entity recognition system from scratch, starting with basic text processing and progressively adding more advanced features.

## 📋 What We'll Build:

1. **Step 1**: Basic entity extraction from text
2. **Step 2**: Simple Streamlit app for text processing
3. **Step 3**: Add PDF processing functionality
4. **Step 4**: Add image OCR capabilities
5. **Step 5**: Add AI vision analysis
6. **Step 6**: Final comprehensive application

Let's get started! 🚀

## 🛠️ Initial Setup

First, let's import all the libraries we'll need and set up our environment.

In [None]:
# Core libraries
import json
import pandas as pd
import numpy as np
from typing import List, Dict, Any
import re
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Set up API key (read from the environment; a later cell allows a manual override)
API_KEY = None

# Try to get from environment
env_key = os.getenv("OPENAI_API_KEY")
if env_key and len(env_key) > 10 and env_key.startswith("sk-"):
    API_KEY = env_key
    print("✅ Using API key from environment variable")
else:
    print("❌ No valid API key found in environment")
    print("📝 Please set OPENAI_API_KEY in your .env file or environment")
    print("🔗 Get your API key from: https://platform.openai.com/api-keys")

if API_KEY:
    # The "sk-" prefix was already validated above, so only confirm it here.
    # Avoid echoing more of the key than the prefix into notebook output.
    print(f"✅ API key format looks correct (starts with: {API_KEY[:3]}...)")
else:
    print("❌ No API key available - please set it before proceeding")
    
print("✅ Initial setup complete!")
print(f"📁 Working directory: {os.getcwd()}")
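
Notebook output is often saved and shared along with the code, so it helps to print as little of a secret as possible. The helper below is a minimal sketch of one way to do that; `mask_key` is a hypothetical name, not part of the original notebook, and the sample key is a made-up placeholder.

```python
def mask_key(key: str, visible: int = 4) -> str:
    """Return a display-safe form of a secret, showing only its last few characters."""
    if not key:
        return "<not set>"
    if len(key) <= visible:
        # Too short to reveal anything safely: hide every character.
        return "*" * len(key)
    return "*" * 8 + key[-visible:]


# Placeholder value for illustration only, not a real key.
print(mask_key("sk-example-not-a-real-key-1234"))
```

Calling `mask_key(API_KEY)` in a status message would then confirm which key is loaded without exposing its prefix.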

---

# 📝 Step 1: Entity Extraction with 3-Step Pipeline

Let's start by building a complete entity recognition system using a multi-prompt pipeline:
1. **Extract** - Find potential company mentions
2. **Ground** - Match against company database  
3. **Verify** - Use RAG-style verification
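
Before involving the API, the three stages can be sketched offline to show how data flows between them. This is only an illustration under simplifying assumptions: a regular expression stands in for the LLM extraction call, a small dictionary (`ALIASES`) stands in for the company database, and verification is reduced to a presence check plus one entry per ticker. All names here are hypothetical.

```python
import re
from typing import Dict, List

# Toy alias table standing in for the company database (illustrative only).
ALIASES: Dict[str, str] = {"apple": "AAPL", "aapl": "AAPL", "tesla": "TSLA", "tsla": "TSLA"}


def extract(text: str) -> List[str]:
    """Step 1 stand-in: treat capitalised words as candidate mentions."""
    return re.findall(r"\b[A-Z][A-Za-z]+\b", text)


def ground(candidates: List[str]) -> List[Dict[str, str]]:
    """Step 2: keep only candidates that resolve to a known ticker."""
    return [
        {"original": c, "ticker": ALIASES[c.lower()]}
        for c in candidates
        if c.lower() in ALIASES
    ]


def verify(grounded: List[Dict[str, str]], text: str) -> List[Dict[str, str]]:
    """Step 3 stand-in: confirm the mention occurs in the text, one entry per ticker."""
    seen, verified = set(), []
    for g in grounded:
        if g["original"] in text and g["ticker"] not in seen:
            seen.add(g["ticker"])
            verified.append(g)
    return verified


text = "Apple (AAPL) rallied while Tesla slipped. Analysts disagree."
candidates = extract(text)
result = verify(ground(candidates), text)
print(candidates)
print(result)
```

Extraction is deliberately liberal (it also picks up "Analysts"), and the later stages narrow the list down. The real pipeline follows the same shape, with model calls in place of the first and third functions.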

In [None]:
# Optional: Set API key directly here for testing (uncomment and add your key)
# API_KEY = "sk-your-actual-api-key-here"

# Or check what's currently set
print("Current API key status:")
if API_KEY:
    print(f"✅ API key available (ends with: ...{API_KEY[-4:]})")
else:
    print("❌ No API key set")
    print("💡 You can:")
    print("   1. Create a .env file with OPENAI_API_KEY=your_key")
    print("   2. Set it directly at the top of this cell")
    print("   3. Export it in your terminal: export OPENAI_API_KEY=your_key")

In [None]:
# Test API key and credits
print("🧪 Testing API Key and Credits...")

if API_KEY:
    try:
        from openai import OpenAI
        client = OpenAI(api_key=API_KEY)
        
        print(f"🔑 Testing key ending with: ...{API_KEY[-4:]}")
        
        # Make a minimal API call to test
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Say 'API test successful'"}],
            max_tokens=10
        )
        
        print("✅ API Test Successful!")
        print(f"📝 Response: {response.choices[0].message.content}")
        print("💳 Credits available - ready to proceed!")
        
        # Test with the model we'll actually use
        print("\n🔬 Testing gpt-4o-mini model...")
        response2 = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Test"}],
            max_tokens=5
        )
        print("✅ gpt-4o-mini model working!")
        print("🚀 Ready for entity extraction pipeline!")
        
    except Exception as e:
        print(f"❌ API Error: {e}")
        error_str = str(e).lower()
        
        if "401" in error_str or "unauthorized" in error_str:
            print("🔑 Issue: Invalid API key")
            print("💡 Solution: Check your API key at https://platform.openai.com/api-keys")
        elif "quota" in error_str:
            print("💳 Issue: No credits remaining")
            print("💡 Solution: Add billing at https://platform.openai.com/account/billing")
        elif "rate_limit" in error_str or "rate limit" in error_str or "429" in error_str:
            print("⏰ Issue: Rate limit exceeded")
            print("💡 Solution: Wait a moment and try again")
        else:
            print("🔧 Unknown API issue")
            print("💡 Solution: Check https://status.openai.com/ for service status")
            
else:
    print("❌ No API key found")
    print("💡 Please set your API key in the previous cells")

In [None]:
from openai import OpenAI
import json

# First, let's create a simple single-prompt extractor for comparison
class SimpleEntityExtractor:
    """Simple entity extractor using OpenAI API"""
    
    def __init__(self, api_key: str):
        if not api_key:
            raise ValueError("API key is required")
        self.client = OpenAI(api_key=api_key)
    
    def extract_companies(self, text: str) -> List[Dict[str, Any]]:
        """Extract company mentions from text"""
        
        prompt = f"""
        Extract all company names and stock tickers from the following text.
        Return the result as a JSON object with a 'companies' array.
        
        Text: {text}
        
        Format: {{"companies": [{{"name": "Company Name", "ticker": "TICK", "original_text": "as found in text"}}]}}
        """
        
        try:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are a financial entity extraction expert. Always respond with valid JSON."},
                    {"role": "user", "content": prompt}
                ],
                response_format={"type": "json_object"}
            )
            
            result = json.loads(response.choices[0].message.content)
            return result.get('companies', [])
            
        except Exception as e:
            print(f"Error extracting entities: {e}")
            return []

# Create simple extractor instance if API key is available
if API_KEY:
    try:
        simple_extractor = SimpleEntityExtractor(API_KEY)
        print("✅ Simple entity extractor created!")
    except Exception as e:
        print(f"❌ Error creating simple extractor: {e}")
        simple_extractor = None
else:
    print("⚠️ Skipping simple extractor creation - no API key")
    simple_extractor = None

# Now let's create the advanced 3-step pipeline
class CompanyDatabase:
    """Simple company database for entity grounding"""
    
    def __init__(self):
        # Hardcoded database of known companies
        self.companies = [
            {"name": "Apple Inc.", "ticker": "AAPL", "aliases": ["Apple", "AAPL"], "sector": "Technology"},
            {"name": "Microsoft Corporation", "ticker": "MSFT", "aliases": ["Microsoft", "MSFT", "MS"], "sector": "Technology"},
            {"name": "Tesla, Inc.", "ticker": "TSLA", "aliases": ["Tesla", "TSLA"], "sector": "Automotive"},
            {"name": "Alphabet Inc.", "ticker": "GOOGL", "aliases": ["Google", "Alphabet", "GOOGL", "GOOG"], "sector": "Technology"},
            {"name": "NVIDIA Corporation", "ticker": "NVDA", "aliases": ["NVIDIA", "Nvidia", "NVDA"], "sector": "Technology"},
            {"name": "Meta Platforms, Inc.", "ticker": "META", "aliases": ["Meta", "Facebook", "META", "FB"], "sector": "Technology"},
            {"name": "Amazon.com, Inc.", "ticker": "AMZN", "aliases": ["Amazon", "AMZN"], "sector": "E-commerce"},
            {"name": "Netflix, Inc.", "ticker": "NFLX", "aliases": ["Netflix", "NFLX"], "sector": "Entertainment"},
            {"name": "Adobe Inc.", "ticker": "ADBE", "aliases": ["Adobe", "ADBE"], "sector": "Software"},
            {"name": "Intel Corporation", "ticker": "INTC", "aliases": ["Intel", "INTC"], "sector": "Technology"}
        ]
    
    def search_companies(self, query: str) -> List[Dict[str, Any]]:
        """Search for companies matching query"""
        query_lower = query.lower()
        matches = []
        
        for company in self.companies:
            # Check if query matches name, ticker, or aliases
            if (query_lower in company['name'].lower() or 
                query_lower == company['ticker'].lower() or
                any(query_lower in alias.lower() for alias in company['aliases'])):
                matches.append(company)
        
        return matches
    
    def get_all_companies(self) -> List[Dict[str, Any]]:
        """Get all companies in database"""
        return self.companies

class AdvancedEntityExtractor:
    """Advanced entity extractor with multi-prompt pipeline and deduplication"""
    
    def __init__(self, api_key: str):
        if not api_key:
            raise ValueError("API key is required")
        self.client = OpenAI(api_key=api_key)
        self.company_db = CompanyDatabase()
    
    def _deduplicate_extracted_entities(self, entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Remove duplicate entities from step 1 extraction"""
        seen = set()
        deduped = []
        
        for entity in entities:
            # Create a key based on text (case insensitive)
            key = entity['text'].lower().strip()
            if key not in seen:
                seen.add(key)
                deduped.append(entity)
        
        return deduped
    
    def _deduplicate_grounded_entities(self, entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Remove duplicate entities from step 2 grounding based on ticker"""
        seen_tickers = set()
        deduped = []
        
        for entity in entities:
            ticker = entity['matched_company']['ticker']
            if ticker not in seen_tickers:
                seen_tickers.add(ticker)
                deduped.append(entity)
            else:
                # If we see the same ticker again, keep the one with higher confidence
                existing_idx = next(i for i, e in enumerate(deduped) 
                                  if e['matched_company']['ticker'] == ticker)
                if entity['confidence'] > deduped[existing_idx]['confidence']:
                    deduped[existing_idx] = entity
        
        return deduped
    
    def _deduplicate_verified_entities(self, entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Remove duplicate entities from step 3 verification based on ticker"""
        seen_tickers = set()
        deduped = []
        
        for entity in entities:
            ticker = entity['ticker']
            if ticker not in seen_tickers:
                seen_tickers.add(ticker)
                deduped.append(entity)
            else:
                # If we see the same ticker again, keep the one with higher confidence
                existing_idx = next(i for i, e in enumerate(deduped) 
                                  if e['ticker'] == ticker)
                if entity['confidence'] > deduped[existing_idx]['confidence']:
                    deduped[existing_idx] = entity
        
        return deduped
    
    def step1_extract_entities(self, text: str) -> List[Dict[str, Any]]:
        """Step 1: Extract potential company mentions from text"""
        
        prompt = f"""
        STEP 1: ENTITY EXTRACTION
        
        Extract ALL potential company names, brand names, and stock tickers from the following text.
        Be liberal in extraction - include anything that might be a company.
        IMPORTANT: Do not include duplicates in your response.
        
        Text: {text}
        
        Return JSON format:
        {{
            "entities": [
                {{
                    "text": "extracted text exactly as found",
                    "type": "company_name|ticker|brand",
                    "context": "surrounding context (10-15 words)"
                }}
            ]
        }}
        """
        
        try:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are an expert at extracting company mentions from text. Be thorough but avoid duplicates."},
                    {"role": "user", "content": prompt}
                ],
                response_format={"type": "json_object"}
            )
            
            result = json.loads(response.choices[0].message.content)
            entities = result.get('entities', [])
            
            # Apply deduplication
            deduped_entities = self._deduplicate_extracted_entities(entities)
            
            return deduped_entities
            
        except Exception as e:
            print(f"Error in step 1: {e}")
            return []
    
    def step2_ground_entities(self, entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Step 2: Ground entities against company database"""
        
        grounded_entities = []
        
        for entity in entities:
            # Search in company database
            matches = self.company_db.search_companies(entity['text'])
            
            if matches:
                # Take the best match (first one for now)
                best_match = matches[0]
                grounded_entities.append({
                    'original': entity['text'],
                    'type': entity['type'],
                    'context': entity.get('context', ''),
                    'matched_company': best_match,
                    'match_type': 'database_exact' if entity['text'].lower() == best_match['ticker'].lower() else 'database_fuzzy',
                    'confidence': 95 if entity['text'].lower() == best_match['ticker'].lower() else 85
                })
        
        # Apply deduplication based on ticker
        deduped_entities = self._deduplicate_grounded_entities(grounded_entities)
        
        return deduped_entities
    
    def step3_verify_entities(self, grounded_entities: List[Dict[str, Any]], original_text: str) -> List[Dict[str, Any]]:
        """Step 3: Use RAG-style verification with company database context"""
        
        if not grounded_entities:
            return []
        
        # Prepare company database context
        db_context = "Company Database Context:\n"
        for company in self.company_db.get_all_companies():
            db_context += f"- {company['name']} ({company['ticker']}): {', '.join(company['aliases'][:3])}\n"
        
        # Prepare entities for verification
        entities_text = "\n".join([
            f"- '{e['original']}' → {e['matched_company']['name']} ({e['matched_company']['ticker']})"
            for e in grounded_entities
        ])
        
        prompt = f"""
        STEP 3: ENTITY VERIFICATION (RAG-style)
        
        Given the company database context below, verify and refine the extracted entities.
        IMPORTANT: Return only unique companies (no duplicates by ticker).
        
        {db_context}
        
        Original Text: {original_text}
        
        Extracted and Matched Entities:
        {entities_text}
        
        For each entity, verify it's correctly matched and actually mentioned in the original text.
        Rate confidence 1-100 based on:
        - Exact ticker match: 95-100
        - Company name match: 85-95  
        - Alias/brand match: 75-85
        - Contextual relevance: +/- 10
        
        Return JSON format (NO DUPLICATES):
        {{
            "verified_entities": [
                {{
                    "original": "text as found",
                    "company_name": "official company name",
                    "ticker": "stock ticker",
                    "confidence": 85,
                    "reasoning": "why this match is correct"
                }}
            ]
        }}
        """
        
        try:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are a financial entity verification expert. Use the company database to verify entities and assign accurate confidence scores. Return unique entities only."},
                    {"role": "user", "content": prompt}
                ],
                response_format={"type": "json_object"}
            )
            
            result = json.loads(response.choices[0].message.content)
            entities = result.get('verified_entities', [])
            
            # Apply final deduplication
            deduped_entities = self._deduplicate_verified_entities(entities)
            
            return deduped_entities
            
        except Exception as e:
            print(f"Error in step 3: {e}")
            return []
    
    def process_text_pipeline(self, text: str) -> Dict[str, Any]:
        """Complete 3-step pipeline: Extract → Ground → Verify with deduplication"""
        
        print("🔍 Step 1: Extracting potential entities...")
        extracted = self.step1_extract_entities(text)
        print(f"   Found {len(extracted)} unique potential entities")
        
        print("🗃️ Step 2: Grounding against company database...")
        grounded = self.step2_ground_entities(extracted)
        print(f"   Grounded {len(grounded)} unique entities")
        
        print("✅ Step 3: Verifying with RAG-style analysis...")
        verified = self.step3_verify_entities(grounded, text)
        print(f"   Verified {len(verified)} unique final entities")
        
        return {
            'step1_extracted': extracted,
            'step2_grounded': grounded,
            'step3_verified': verified,
            'summary': f"Pipeline processed {len(extracted)} → {len(grounded)} → {len(verified)} unique entities"
        }

# Create advanced extractor instance if API key is available
if API_KEY:
    try:
        advanced_extractor = AdvancedEntityExtractor(API_KEY)
        print("✅ Advanced entity extractor with deduplication created!")
        print("📊 Pipeline: Extract → Database Lookup → RAG Verification → Deduplicate")
    except Exception as e:
        print(f"❌ Error creating advanced extractor: {e}")
        advanced_extractor = None
else:
    print("⚠️ Skipping advanced extractor creation - no API key")
    advanced_extractor = None
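
One thing to watch in `CompanyDatabase.search_companies` is that it uses substring checks (`query_lower in ...`), so a short query can match unrelated companies. The sketch below reproduces that rule on a two-company table and contrasts it with a stricter whole-string comparison. `COMPANIES`, `substring_search`, and `exact_lookup` are hypothetical names used only for this illustration.

```python
from typing import Any, Dict, List

COMPANIES: List[Dict[str, Any]] = [
    {"name": "Apple Inc.", "ticker": "AAPL", "aliases": ["Apple", "AAPL"]},
    {"name": "Intel Corporation", "ticker": "INTC", "aliases": ["Intel", "INTC"]},
]


def substring_search(query: str) -> List[Dict[str, Any]]:
    """Same matching rule as CompanyDatabase.search_companies."""
    q = query.lower()
    return [
        c for c in COMPANIES
        if q in c["name"].lower()
        or q == c["ticker"].lower()
        or any(q in a.lower() for a in c["aliases"])
    ]


def exact_lookup(query: str) -> List[Dict[str, Any]]:
    """Stricter alternative: the whole query must equal a name, ticker, or alias."""
    q = query.strip().lower()
    return [
        c for c in COMPANIES
        if q in {c["name"].lower(), c["ticker"].lower(), *(a.lower() for a in c["aliases"])}
    ]


print([c["ticker"] for c in substring_search("in")])   # matches "Inc." and "Intel"
print([c["ticker"] for c in exact_lookup("in")])       # no whole-string match
print([c["ticker"] for c in exact_lookup("Intel")])
```

The trade-off is recall: exact comparison will miss variants such as "Apple Inc" without the trailing period unless they are added as aliases, so which rule is preferable depends on how noisy the Step 1 output is.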

In [None]:
# Test the advanced 3-step pipeline with detailed output
sample_text = """
Apple (AAPL) reported strong Q4 earnings, with revenue reaching $89.5 billion. 
Microsoft (MSFT) also showed impressive results with Azure growing 35% year-over-year.
Meanwhile, Tesla stock (TSLA) surged after announcing new production milestones.
Google parent company Alphabet (GOOGL) saw advertising revenue rebound.
Facebook's parent Meta announced new VR initiatives.
"""

print("🚀 Testing Advanced 3-Step Entity Recognition Pipeline with Deduplication")
print(f"📄 Text: {sample_text.strip()}")
print("\n" + "="*70)

# Check if extractor is available
if advanced_extractor is None:
    print("❌ Advanced extractor not available. Please run the previous cell first.")
else:
    try:
        # Run the complete pipeline
        pipeline_result = advanced_extractor.process_text_pipeline(sample_text)

        print(f"\n📊 {pipeline_result['summary']}")
        print("\n" + "="*70)

        # Show detailed results for each step
        print("\n🔍 STEP 1 - Raw Extraction:")
        for i, entity in enumerate(pipeline_result['step1_extracted'], 1):
            print(f"  {i}. '{entity['text']}' ({entity['type']})")

        print(f"\n🗃️ STEP 2 - Database Grounding:")
        for i, entity in enumerate(pipeline_result['step2_grounded'], 1):
            print(f"  {i}. '{entity['original']}' → {entity['matched_company']['ticker']} ({entity['confidence']}%)")

        print(f"\n✅ STEP 3 - Final Verification:")
        if len(pipeline_result['step3_verified']) > 5:
            print(f"⚠️ WARNING: Found {len(pipeline_result['step3_verified'])} entities - checking for duplicates...")
            
            # Check for duplicates by ticker
            tickers_seen = {}
            for i, entity in enumerate(pipeline_result['step3_verified'], 1):
                ticker = entity['ticker']
                if ticker in tickers_seen:
                    print(f"  🔴 DUPLICATE: {ticker} found at positions {tickers_seen[ticker]} and {i}")
                else:
                    tickers_seen[ticker] = i
                print(f"  {i}. {entity['company_name']} ({entity['ticker']}) - '{entity['original']}' ({entity['confidence']}%)")
        else:
            for i, entity in enumerate(pipeline_result['step3_verified'], 1):
                print(f"  {i}. {entity['company_name']} ({entity['ticker']}) - '{entity['original']}' ({entity['confidence']}%)")

        print(f"\n🎯 Final Result: {len(pipeline_result['step3_verified'])} entities")
        
        # Additional analysis
        if len(pipeline_result['step3_verified']) > 5:
            print("\n🔧 DEBUGGING INFO:")
            unique_tickers = set(e['ticker'] for e in pipeline_result['step3_verified'])
            print(f"   Unique tickers found: {len(unique_tickers)}")
            print(f"   All tickers: {list(unique_tickers)}")
            
            if len(unique_tickers) < len(pipeline_result['step3_verified']):
                print("   ❌ Deduplication failed - multiple entries per ticker")
            else:
                print("   ✅ No ticker duplicates, but more entities than expected")
        
    except Exception as e:
        print(f"❌ Error running pipeline: {e}")
        print("This may be due to API issues or missing dependencies.")

In [None]:
# Test the advanced 3-step pipeline with AGGRESSIVE deduplication
sample_text = """
Apple (AAPL) reported strong Q4 earnings, with revenue reaching $89.5 billion. 
Microsoft (MSFT) also showed impressive results with Azure growing 35% year-over-year.
Meanwhile, Tesla stock (TSLA) surged after announcing new production milestones.
Google parent company Alphabet (GOOGL) saw advertising revenue rebound.
Facebook's parent Meta announced new VR initiatives.
"""

print("🚀 Testing Advanced 3-Step Entity Recognition Pipeline with AGGRESSIVE Deduplication")
print(f"📄 Text: {sample_text.strip()}")
print("\n" + "="*70)

# Check if extractor is available
if advanced_extractor is None:
    print("❌ Advanced extractor not available. Please run the previous cell first.")
else:
    try:
        # Run the complete pipeline
        pipeline_result = advanced_extractor.process_text_pipeline(sample_text)

        print(f"\n📊 {pipeline_result['summary']}")
        
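        # Apply an additional deduplication pass.
        # force_deduplicate_final_results is not defined in the cells above, so fall back
        # to the extractor's own ticker-based deduplication when it is missing.
        original_count = len(pipeline_result['step3_verified'])
        dedupe = globals().get('force_deduplicate_final_results',
                               advanced_extractor._deduplicate_verified_entities)
        pipeline_result['step3_verified'] = dedupe(pipeline_result['step3_verified'])
        final_count = len(pipeline_result['step3_verified'])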
        
        print("\n" + "="*70)

        # Show detailed results for each step
        print("\n🔍 STEP 1 - Raw Extraction:")
        for i, entity in enumerate(pipeline_result['step1_extracted'], 1):
            print(f"  {i}. '{entity['text']}' ({entity['type']})")

        print(f"\n🗃️ STEP 2 - Database Grounding:")
        for i, entity in enumerate(pipeline_result['step2_grounded'], 1):
            print(f"  {i}. '{entity['original']}' → {entity['matched_company']['ticker']} ({entity['confidence']}%)")

        print(f"\n✅ STEP 3 - Final Verification (after aggressive deduplication):")
        for i, entity in enumerate(pipeline_result['step3_verified'], 1):
            print(f"  {i}. {entity['company_name']} ({entity['ticker']}) - '{entity['original']}' ({entity['confidence']}%)")
            print(f"     Reasoning: {entity.get('reasoning', 'N/A')}")

        print(f"\n🎯 Final Result: {final_count} unique entities")
        
        if original_count != final_count:
            print(f"🔧 Deduplication removed {original_count - final_count} duplicates")
        
        # Verify uniqueness
        tickers = [e['ticker'] for e in pipeline_result['step3_verified']]
        if len(tickers) == len(set(tickers)):
            print("✅ All entities are unique by ticker!")
        else:
            print("❌ Still found duplicates - manual inspection needed")
        
    except Exception as e:
        print(f"❌ Error running pipeline: {e}")
        print("This may be due to API issues or missing dependencies.")
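
The cell above refers to a helper named `force_deduplicate_final_results`, which does not appear in the earlier cells. The following is a minimal sketch of what such a helper might look like, assuming it should normalise tickers and keep the highest-confidence entry for each one. The behaviour is a guess at the intent, not the author's implementation, and it would need to run before the cell that calls it.

```python
from typing import Any, Dict, List


def force_deduplicate_final_results(entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep the highest-confidence entry per normalised ticker."""
    best: Dict[str, Dict[str, Any]] = {}
    for entity in entities:
        ticker = str(entity.get("ticker", "")).strip().upper()
        if not ticker:
            # Assumption: entries without a ticker cannot be compared, so they are dropped.
            continue
        confidence = entity.get("confidence", 0)
        if ticker not in best or confidence > best[ticker].get("confidence", 0):
            # Re-assigning an existing key keeps its original position in the dict.
            best[ticker] = {**entity, "ticker": ticker}
    return list(best.values())


sample = [
    {"ticker": "aapl", "company_name": "Apple Inc.", "confidence": 80},
    {"ticker": "AAPL", "company_name": "Apple Inc.", "confidence": 95},
    {"ticker": "TSLA", "company_name": "Tesla, Inc.", "confidence": 90},
]
deduped = force_deduplicate_final_results(sample)
print(deduped)
```

Unlike the class's `_deduplicate_verified_entities`, this version also treats `aapl` and `AAPL` as the same ticker, which matters because the model is free to vary the casing in its JSON output.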

In [None]:
# Test the advanced 3-step pipeline
sample_text = """
Apple (AAPL) reported strong Q4 earnings, with revenue reaching $89.5 billion. 
Microsoft (MSFT) also showed impressive results with Azure growing 35% year-over-year.
Meanwhile, Tesla stock (TSLA) surged after announcing new production milestones.
Google parent company Alphabet (GOOGL) saw advertising revenue rebound.
Facebook's parent Meta announced new VR initiatives.
"""

print("🚀 Testing Advanced 3-Step Entity Recognition Pipeline")
print(f"📄 Text: {sample_text.strip()}")
print("\n" + "="*70)

# Check if extractor is available
if advanced_extractor is None:
    print("❌ Advanced extractor not available. Please run the previous cell first.")
else:
    try:
        # Run the complete pipeline
        pipeline_result = advanced_extractor.process_text_pipeline(sample_text)

        print(f"\n📊 {pipeline_result['summary']}")
        print("\n" + "="*70)

        # Show detailed results for each step
        print("\n🔍 STEP 1 - Raw Extraction:")
        for i, entity in enumerate(pipeline_result['step1_extracted'], 1):
            print(f"  {i}. '{entity['text']}' ({entity['type']}) - Context: {entity.get('context', '')[:50]}...")

        print(f"\n🗃️ STEP 2 - Database Grounding:")
        for i, entity in enumerate(pipeline_result['step2_grounded'], 1):
            print(f"  {i}. '{entity['original']}' → {entity['matched_company']['name']} ({entity['matched_company']['ticker']}) - {entity['match_type']} ({entity['confidence']}%)")

        print(f"\n✅ STEP 3 - Final Verification:")
        for i, entity in enumerate(pipeline_result['step3_verified'], 1):
            print(f"  {i}. {entity['company_name']} ({entity['ticker']}) - Found as: '{entity['original']}' (Confidence: {entity['confidence']}%)")
            print(f"     Reasoning: {entity.get('reasoning', 'N/A')}")

        print(f"\n🎯 Final Result: {len(pipeline_result['step3_verified'])} verified companies")
        
    except Exception as e:
        print(f"❌ Error running pipeline: {e}")
        print("This may be due to API issues or missing dependencies.")

---

# 🖥️ Step 2: Simple Streamlit App for Text Processing

Now let's create a minimal Streamlit app that uses our entity extractor. We'll save this to a file and then run it.
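
Saving the app source and launching it can be sketched as follows. The file name `entity_app_step2.py` and the short stand-in source are assumptions for illustration; in the notebook, the string written to disk would be the full app source defined in the next cell.

```python
from pathlib import Path

# Stand-in source; in the notebook this would be the app string from the next cell.
app_source = 'import streamlit as st\n\nst.title("Entity Extractor")\n'

app_path = Path("entity_app_step2.py")  # hypothetical file name
app_path.write_text(app_source, encoding="utf-8")
print(f"Wrote {app_path} ({app_path.stat().st_size} bytes)")

# Then, from a terminal in the same directory:
#   streamlit run entity_app_step2.py
```

Keeping the app in its own file matters because Streamlit runs scripts as standalone programs, so the classes defined in the notebook's memory are not visible to it and have to be repeated in the app source.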

In [None]:
# Create enhanced Streamlit app with 3-step pipeline
streamlit_app_step2_enhanced = '''
import streamlit as st
import json
from openai import OpenAI
import pandas as pd
from typing import List, Dict, Any

# App configuration
st.set_page_config(
    page_title="Entity Extractor - Step 2 Enhanced",
    page_icon="🏢",
    layout="wide"
)

class CompanyDatabase:
    """Simple company database for entity grounding"""
    
    def __init__(self):
        self.companies = [
            {"name": "Apple Inc.", "ticker": "AAPL", "aliases": ["Apple", "AAPL"]},
            {"name": "Microsoft Corporation", "ticker": "MSFT", "aliases": ["Microsoft", "MSFT", "MS"]},
            {"name": "Tesla, Inc.", "ticker": "TSLA", "aliases": ["Tesla", "TSLA"]},
            {"name": "Alphabet Inc.", "ticker": "GOOGL", "aliases": ["Google", "Alphabet", "GOOGL", "GOOG"]},
            {"name": "NVIDIA Corporation", "ticker": "NVDA", "aliases": ["NVIDIA", "Nvidia", "NVDA"]},
            {"name": "Meta Platforms, Inc.", "ticker": "META", "aliases": ["Meta", "Facebook", "META", "FB"]},
            {"name": "Amazon.com, Inc.", "ticker": "AMZN", "aliases": ["Amazon", "AMZN"]},
            {"name": "Netflix, Inc.", "ticker": "NFLX", "aliases": ["Netflix", "NFLX"]},
            {"name": "Adobe Inc.", "ticker": "ADBE", "aliases": ["Adobe", "ADBE"]},
            {"name": "Intel Corporation", "ticker": "INTC", "aliases": ["Intel", "INTC"]}
        ]
    
    def search_companies(self, query: str) -> List[Dict[str, Any]]:
        query_lower = query.lower()
        matches = []
        for company in self.companies:
            if (query_lower in company['name'].lower() or 
                query_lower == company['ticker'].lower() or
                any(query_lower in alias.lower() for alias in company['aliases'])):
                matches.append(company)
        return matches
    
    def get_all_companies(self) -> List[Dict[str, Any]]:
        return self.companies

class AdvancedEntityExtractor:
    """Advanced entity extractor with multi-prompt pipeline"""
    
    def __init__(self, api_key: str):
        self.client = OpenAI(api_key=api_key)
        self.company_db = CompanyDatabase()
    
    def step1_extract_entities(self, text: str) -> List[Dict[str, Any]]:
        prompt = f"""
        STEP 1: ENTITY EXTRACTION
        Extract ALL potential company names, brand names, and stock tickers from the text.
        Text: {text}
        Return JSON: {{"entities": [{{"text": "name", "type": "company_name|ticker|brand", "context": "context"}}]}}
        """
        
        try:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "Extract all potential company mentions thoroughly."},
                    {"role": "user", "content": prompt}
                ],
                response_format={"type": "json_object"}
            )
            result = json.loads(response.choices[0].message.content)
            return result.get('entities', [])
        except Exception as e:
            st.error(f"Error in step 1: {e}")
            return []
    
    def step2_ground_entities(self, entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        grounded_entities = []
        for entity in entities:
            matches = self.company_db.search_companies(entity['text'])
            if matches:
                best_match = matches[0]
                grounded_entities.append({
                    'original': entity['text'],
                    'type': entity['type'],
                    'context': entity.get("context", ""),
                    'matched_company': best_match,
                    'match_type': 'exact' if entity['text'].lower() == best_match['ticker'].lower() else 'fuzzy',
                    'confidence': 95 if entity['text'].lower() == best_match['ticker'].lower() else 85
                })
        return grounded_entities
    
    def step3_verify_entities(self, grounded_entities: List[Dict[str, Any]], original_text: str) -> List[Dict[str, Any]]:
        if not grounded_entities:
            return []
        
        db_context = "Company Database:\\n"
        for company in self.company_db.get_all_companies()[:10]:
            db_context += f"- {company['name']} ({company['ticker']})\\n"
        
        entities_text = "\\n".join([
            f"- {e['original']} → {e['matched_company']['name']} ({e['matched_company']['ticker']})"
            for e in grounded_entities
        ])
        
        prompt = f"""
        STEP 3: ENTITY VERIFICATION (RAG-style)
        
        {db_context}
        
        Original Text: {original_text}
        Extracted Entities: {entities_text}
        
        Verify entities and return JSON:
        {{"verified_entities": [{{"original": "text", "company_name": "name", "ticker": "ticker", "confidence": 85, "reasoning": "why correct"}}]}}
        """
        
        try:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "Verify entities using the company database context."},
                    {"role": "user", "content": prompt}
                ],
                response_format={"type": "json_object"}
            )
            result = json.loads(response.choices[0].message.content)
            return result.get('verified_entities', [])
        except Exception as e:
            st.error(f"Error in step 3: {e}")
            return []
    
    def process_text_pipeline(self, text: str) -> Dict[str, Any]:
        with st.status("Running 3-step pipeline...", expanded=True) as status:
            st.write("🔍 Step 1: Extracting potential entities...")
            extracted = self.step1_extract_entities(text)
            st.write(f"   Found {len(extracted)} potential entities")
            
            st.write("🗃️ Step 2: Grounding against company database...")
            grounded = self.step2_ground_entities(extracted)
            st.write(f"   Grounded {len(grounded)} entities")
            
            st.write("✅ Step 3: Verifying with RAG-style analysis...")
            verified = self.step3_verify_entities(grounded, text)
            st.write(f"   Verified {len(verified)} final entities")
            
            status.update(label="Pipeline complete!", state="complete", expanded=False)
        
        return {
            'step1_extracted': extracted,
            'step2_grounded': grounded,
            'step3_verified': verified,
            'summary': f"Pipeline: {len(extracted)} → {len(grounded)} → {len(verified)} entities"
        }

def main():
    st.title("🏢 Entity Recognition - Step 2: Advanced 3-Step Pipeline")
    st.markdown("""
    **Enhanced entity extraction with:**
    - 🔍 **Step 1**: Extract potential entities with AI
    - 🗃️ **Step 2**: Ground against company database  
    - ✅ **Step 3**: Verify with RAG-style analysis
    """)
    
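    # API key: assumed to come from the OPENAI_API_KEY environment variable, as in the notebook setup
    import os
    api_key = os.getenv("OPENAI_API_KEY")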
    
    if not api_key:
        st.warning("Please set your OpenAI API key to continue.")
        return
    
    # Initialize extractor
    extractor = AdvancedEntityExtractor(api_key)
    
    # Show company database
    with st.expander("📊 View Company Database"):
        db_df = pd.DataFrame(extractor.company_db.get_all_companies())
        st.dataframe(db_df, use_container_width=True)
    
    # Text input
    st.subheader("📝 Enter Financial Text")
    sample_text = "Apple (AAPL) reported strong Q4 earnings. Microsoft (MSFT) showed impressive Azure growth. Tesla (TSLA) announced milestones. Google parent Alphabet (GOOGL) saw revenue rebound. Facebook's Meta announced VR initiatives."
    
    text_input = st.text_area(
        "Enter or paste financial text here:",
        value=sample_text,
        height=150,
        placeholder="Paste financial news or article text here..."
    )
    
    if st.button("🚀 Run 3-Step Pipeline", type="primary"):
        if text_input.strip():
            # Run the pipeline
            result = extractor.process_text_pipeline(text_input)
            
            # Create tabs to show each step
            tab1, tab2, tab3, tab4 = st.tabs(["📊 Summary", "🔍 Step 1: Extract", "🗃️ Step 2: Ground", "✅ Step 3: Verify"])
            
            with tab1:
                st.metric("Pipeline Summary", result['summary'])
                
                if result['step3_verified']:
                    st.success(f"✅ Successfully identified {len(result['step3_verified'])} companies!")
                    
                    # Final results table
                    df = pd.DataFrame(result['step3_verified'])
                    st.subheader("🎯 Final Verified Companies")
                    st.dataframe(df, use_container_width=True)
                    
                    # Company cards
                    st.subheader("🏷️ Company Cards")
                    cols = st.columns(min(len(result['step3_verified']), 3))
                    for i, entity in enumerate(result['step3_verified']):
                        with cols[i % 3]:
                            st.info(f"""**{entity['company_name']}**
                            
📈 **Ticker:** {entity['ticker']}
🔍 **Found as:** '{entity['original']}'
📊 **Confidence:** {entity['confidence']}%
💡 **Reasoning:** {entity['reasoning']}""")
                else:
                    st.warning("No companies found in the text.")
            
            with tab2:
                st.subheader("🔍 Step 1: Raw Entity Extraction")
                if result['step1_extracted']:
                    df1 = pd.DataFrame(result['step1_extracted'])
                    st.dataframe(df1, use_container_width=True)
                else:
                    st.info("No entities extracted in step 1")
            
            with tab3:
                st.subheader("🗃️ Step 2: Database Grounding")
                if result['step2_grounded']:
                    # Create a simplified view for grounded entities
                    grounded_simple = []
                    for entity in result['step2_grounded']:
                        grounded_simple.append({
                            'Original': entity['original'],
                            'Matched Company': entity['matched_company']['name'],
                            'Ticker': entity['matched_company']['ticker'],
                            'Match Type': entity['match_type'],
                            'Confidence': f"{entity['confidence']}%"
                        })
                    df2 = pd.DataFrame(grounded_simple)
                    st.dataframe(df2, use_container_width=True)
                else:
                    st.info("No entities grounded in step 2")
            
            with tab4:
                st.subheader("✅ Step 3: RAG Verification")
                if result['step3_verified']:
                    df3 = pd.DataFrame(result['step3_verified'])
                    st.dataframe(df3, use_container_width=True)
                else:
                    st.info("No entities verified in step 3")
        else:
            st.error("Please enter some text to analyze.")

if __name__ == "__main__":
    main()
'''

# Save the enhanced Streamlit app
with open('app_step2_enhanced.py', 'w') as f:
    f.write(streamlit_app_step2_enhanced)

print("✅ Enhanced Step 2 Streamlit app created: app_step2_enhanced.py")
print("🚀 Features the complete 3-step pipeline:")
print("   🔍 Step 1: AI Entity Extraction")
print("   🗃️ Step 2: Database Grounding") 
print("   ✅ Step 3: RAG-style Verification")
print("\n🚀 To run this enhanced app, execute in terminal:")
print("streamlit run app_step2_enhanced.py")

---

# 📄 Step 3: Add PDF Processing Functionality

Now let's extend our capabilities to handle PDF files. We'll add PDF text extraction using PyPDF2.
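
As a rough sketch of the page-by-page approach: a PDF reader exposes a sequence of pages, each page offers an `extract_text()` method, and the per-page results are joined into one string. The helper `join_page_texts` and the `FakePage` stand-in are hypothetical names introduced only so the snippet runs without a PDF file or PyPDF2 installed; pages that yield no text (for example scanned images) are skipped.

```python
from typing import Iterable, Protocol


class SupportsExtractText(Protocol):
    """Anything with an extract_text() method, e.g. a PDF page object."""

    def extract_text(self) -> str: ...


def join_page_texts(pages: Iterable[SupportsExtractText]) -> str:
    """Concatenate the text of each page, skipping pages that yield nothing."""
    parts = []
    for page in pages:
        page_text = page.extract_text() or ""  # guard against None or empty results
        if page_text.strip():
            parts.append(page_text.strip())
    return "\n".join(parts)


class FakePage:
    """Hypothetical stand-in for a PDF page, used only for this demo."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo_pages = [
    FakePage("Apple (AAPL) Q4 results"),
    FakePage(""),  # e.g. an image-only page
    FakePage("Tesla (TSLA) update\n"),
]
print(join_page_texts(demo_pages))
```

With a real document, the same helper could be handed the reader's `pages` collection instead of the demo list.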

In [None]:
# First, let's create a PDF processing class
import PyPDF2
import io

class PDFProcessor:
    """Process PDF files to extract text"""
    
    @staticmethod
    def extract_text_from_pdf(pdf_file) -> str:
        """Extract text from uploaded PDF file"""
        try:
            # Read the PDF file
            if hasattr(pdf_file, 'read'):
                # File-like object (from Streamlit)
                pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_file.read()))
            else:
                # File path
                pdf_reader = PyPDF2.PdfReader(pdf_file)
            
            # Extract text from all pages
            text = ""
            for page in pdf_reader.pages:
                # Guard against pages that yield no text (e.g. scanned images)
                text += (page.extract_text() or "") + "\n"
            
            return text.strip()
        except Exception as e:
            print(f"Error reading PDF: {str(e)}")
            return ""  # keep the declared str return type on failure

# Instantiate the PDF processor (it is exercised through the Streamlit app below)
pdf_processor = PDFProcessor()
print("✅ PDF processor created!")
print("📄 Ready to handle PDF files in our Streamlit app")

In [None]:
# Create enhanced Streamlit app with PDF support
streamlit_app_step3 = '''
import streamlit as st
import json
from openai import OpenAI
import pandas as pd
import PyPDF2
import io

# App configuration
st.set_page_config(
    page_title="Entity Extractor - Step 3",
    page_icon="🏢",
    layout="wide"
)

class SimpleEntityExtractor:
    """Simple entity extractor using OpenAI API"""
    
    def __init__(self, api_key: str):
        self.client = OpenAI(api_key=api_key)
    
    def extract_companies(self, text: str):
        """Extract company mentions from text"""
        
        # Truncate the input to 3,000 characters to stay within API limits
        prompt = f"""
        Extract all company names and stock tickers from the following text.
        Return the result as a JSON object with a "companies" field containing a list.
        
        Text: {text[:3000]}
        
        Format: {{"companies": [{{"name": "Company Name", "ticker": "TICK", "original_text": "original"}}]}}
        """
        
        try:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are a financial entity extraction expert. Always respond with valid JSON."},
                    {"role": "user", "content": prompt}
                ],
                response_format={"type": "json_object"}
            )
            
            result = json.loads(response.choices[0].message.content)
            return result.get('companies', [])
            
        except Exception as e:
            st.error(f"Error extracting entities: {e}")
            return []

class PDFProcessor:
    """Process PDF files to extract text"""
    
    @staticmethod
    def extract_text_from_pdf(pdf_file) -> str:
        """Extract text from uploaded PDF file"""
        try:
            # Read the PDF file
            pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_file.read()))
            
            # Extract text from all pages
            text = ""
            for page_num in range(len(pdf_reader.pages)):
                page = pdf_reader.pages[page_num]
                text += page.extract_text() + "\\n"
            
            return text.strip()
        except Exception as e:
            st.error(f"Error reading PDF: {str(e)}")
            return None

# Main app
def main():
    st.title("🏢 Entity Recognition - Step 3: Text + PDF Processing")
    st.markdown("Extract company mentions from financial text and PDF documents using AI")
    
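    # API key: assumed to come from the OPENAI_API_KEY environment variable, as in the notebook setup
    import os
    api_key = os.getenv("OPENAI_API_KEY")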
    
    if not api_key:
        st.warning("Please set your OpenAI API key to continue.")
        return
    
    # Initialize components
    extractor = SimpleEntityExtractor(api_key)
    pdf_processor = PDFProcessor()
    
    # Create tabs for different input methods
    tab1, tab2 = st.tabs(["📝 Text Input", "📄 PDF Upload"])
    
    with tab1:
        st.subheader("📝 Enter Financial Text")
        sample_text = "Apple (AAPL) reported strong Q4 earnings. Microsoft (MSFT) showed impressive Azure growth. Tesla (TSLA) announced new production milestones."
        
        text_input = st.text_area(
            "Enter or paste financial text here:",
            value=sample_text,
            height=150,
            placeholder="Paste financial news or article text here...",
            key="text_input"
        )
        
        if st.button("🔍 Extract from Text", type="primary", key="text_extract"):
            process_text(text_input, extractor)
    
    with tab2:
        st.subheader("📄 Upload PDF Document")
        uploaded_file = st.file_uploader("Choose a PDF file", type="pdf")
        
        if uploaded_file is not None:
            if st.button("🔍 Extract from PDF", type="primary", key="pdf_extract"):
                with st.spinner("Extracting text from PDF..."):
                    extracted_text = pdf_processor.extract_text_from_pdf(uploaded_file)
                
                if extracted_text:
                    st.success(f"✅ Extracted {len(extracted_text)} characters from PDF")
                    
                    # Show preview of extracted text
                    with st.expander("Preview extracted text"):
                        st.text(extracted_text[:1000] + "..." if len(extracted_text) > 1000 else extracted_text)
                    
                    # Process the extracted text
                    process_text(extracted_text, extractor)
                else:
                    st.error("❌ Could not extract text from PDF")

def process_text(text: str, extractor):
    """Process text and display results"""
    if text.strip():
        with st.spinner("Extracting companies..."):
            entities = extractor.extract_companies(text)
        
        if entities:
            st.success(f"✅ Found {len(entities)} companies!")
            
            # Display results in a table
            df = pd.DataFrame(entities)
            st.subheader("📊 Extracted Companies")
            st.dataframe(df, use_container_width=True)
            
            # Display as cards
            st.subheader("🏷️ Company Cards")
            cols = st.columns(min(len(entities), 3))
            for i, entity in enumerate(entities):
                with cols[i % 3]:
                    st.info(f"**{entity['name']}**\\n\\nTicker: {entity.get('ticker', 'N/A')}\\n\\nFound as: '{entity.get('original_text', 'N/A')}'")
        else:
            st.warning("No companies found in the text.")
    else:
        st.error("Please provide some text to analyze.")

if __name__ == "__main__":
    main()
'''

# Save the enhanced Streamlit app
with open('app_step3.py', 'w') as f:
    f.write(streamlit_app_step3)

print("✅ Step 3 Streamlit app created: app_step3.py")
print("📄 Now supports both text input and PDF file uploads!")
print("\n🚀 To run this app, execute in terminal:")
print("streamlit run app_step3.py")

---

# 🖼️ Step 4: Add Image OCR Capabilities

Now let's add the ability to extract text from images using OCR (Optical Character Recognition).
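
The OCR helper in the next cell boosts contrast and then binarizes the image with Otsu's method before handing it to Tesseract. To make that preprocessing less of a black box, here is a pure-Python illustration of the idea on a single row of grayscale values. It is a teaching sketch, not OpenCV's implementation, and the function names (`adjust_contrast`, `otsu_threshold`, `binarize`) are hypothetical.

```python
from typing import List


def adjust_contrast(pixels: List[int], alpha: float = 1.5, beta: float = 10) -> List[int]:
    """Scale and shift grayscale values, clamping the result to 0..255."""
    return [min(255, abs(round(alpha * p + beta))) for p in pixels]


def otsu_threshold(pixels: List[int]) -> int:
    """Return the threshold that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the background class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels: List[int]) -> List[int]:
    """Map pixels above the Otsu threshold to 255 (white) and the rest to 0 (black)."""
    t = otsu_threshold(pixels)
    return [255 if p > t else 0 for p in pixels]


# Dark "ink" pixels followed by lighter "paper" pixels
row = [20, 22, 25, 30, 120, 130, 140, 150]
boosted = adjust_contrast(row)
print(boosted)
print(binarize(boosted))
```

The point of the binarization step is that Tesseract tends to cope better with clean black-on-white input than with low-contrast grayscale.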

In [None]:
# Import OCR libraries
from PIL import Image, ImageDraw, ImageFont
import pytesseract
import cv2
import base64

class OCRProcessor:
    """Process images to extract text using OCR"""
    
    @staticmethod
    def extract_text_from_image(image: Image.Image) -> str:
        """Extract text from image using OCR"""
        try:
            # Convert PIL Image to OpenCV format
            img_array = np.array(image)
            
            # Convert to grayscale if needed
            if len(img_array.shape) == 3:
                gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
            else:
                gray = img_array
            
            # Apply some preprocessing for better OCR
            alpha = 1.5  # Contrast control
            beta = 10    # Brightness control
            adjusted = cv2.convertScaleAbs(gray, alpha=alpha, beta=beta)
            
            # Apply threshold to get binary image
            _, thresh = cv2.threshold(adjusted, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
            
            # Perform OCR
            text = pytesseract.image_to_string(thresh)
            
            return text.strip()
        except Exception as e:
            print(f"OCR extraction failed: {str(e)}")
            return ""

# Test OCR with a simple text image
def create_test_image():
    """Create a test image with company names"""
    img = Image.new('RGB', (400, 200), color='white')
    draw = ImageDraw.Draw(img)
    
    try:
        # This path is macOS-specific; other systems fall back to PIL's default font
        font = ImageFont.truetype('/System/Library/Fonts/Arial.ttf', 24)
    except OSError:
        font = ImageFont.load_default()
    
    draw.text((20, 50), "Apple Inc. (AAPL)", fill='black', font=font)
    draw.text((20, 90), "Microsoft Corp. (MSFT)", fill='black', font=font)
    draw.text((20, 130), "Tesla Inc. (TSLA)", fill='black', font=font)
    
    return img

# Test OCR functionality
ocr_processor = OCRProcessor()
test_image = create_test_image()

print("🖼️ Testing OCR functionality...")
extracted_text = ocr_processor.extract_text_from_image(test_image)
print(f"📝 Extracted text: {repr(extracted_text)}")

if extracted_text and ('Apple' in extracted_text or 'Microsoft' in extracted_text):
    print("✅ OCR is working correctly!")
else:
    print("⚠️ OCR may need adjustment, but functionality is ready")

In [None]:
# Create Streamlit app with OCR support
streamlit_app_step4 = '''
import streamlit as st
import json
from openai import OpenAI
import pandas as pd
import PyPDF2
import io
from PIL import Image
import pytesseract
import cv2
import numpy as np

# App configuration
st.set_page_config(
    page_title="Entity Extractor - Step 4",
    page_icon="🏢",
    layout="wide"
)

class SimpleEntityExtractor:
    """Simple entity extractor using OpenAI API"""
    
    def __init__(self, api_key: str):
        self.client = OpenAI(api_key=api_key)
    
    def extract_companies(self, text: str):
        """Extract company mentions from text"""
        
        # Truncate the input to 3,000 characters to stay within API limits
        prompt = f"""
        Extract all company names and stock tickers from the following text.
        Return the result as a JSON object with a "companies" field containing a list.
        
        Text: {text[:3000]}
        
        Format: {{"companies": [{{"name": "Company Name", "ticker": "TICK", "original_text": "original"}}]}}
        """
        
        try:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are a financial entity extraction expert. Always respond with valid JSON."},
                    {"role": "user", "content": prompt}
                ],
                response_format={"type": "json_object"}
            )
            
            result = json.loads(response.choices[0].message.content)
            return result.get('companies', [])
            
        except Exception as e:
            st.error(f"Error extracting entities: {e}")
            return []

class PDFProcessor:
    """Process PDF files to extract text"""
    
    @staticmethod
    def extract_text_from_pdf(pdf_file) -> str:
        """Extract text from uploaded PDF file"""
        try:
            pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_file.read()))
            text = ""
            for page_num in range(len(pdf_reader.pages)):
                page = pdf_reader.pages[page_num]
                text += page.extract_text() + "\\n"
            return text.strip()
        except Exception as e:
            st.error(f"Error reading PDF: {str(e)}")
            return None

class OCRProcessor:
    """Process images to extract text using OCR"""
    
    @staticmethod
    def extract_text_from_image(image: Image.Image) -> str:
        """Extract text from image using OCR"""
        try:
            # Convert PIL Image to OpenCV format
            img_array = np.array(image)
            
            # Convert to grayscale if needed
            if len(img_array.shape) == 3:
                gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
            else:
                gray = img_array
            
            # Apply some preprocessing for better OCR
            alpha = 1.5  # Contrast control
            beta = 10    # Brightness control
            adjusted = cv2.convertScaleAbs(gray, alpha=alpha, beta=beta)
            
            # Apply threshold to get binary image
            _, thresh = cv2.threshold(adjusted, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
            
            # Perform OCR
            text = pytesseract.image_to_string(thresh)
            
            return text.strip()
        except Exception as e:
            st.error(f"OCR extraction failed: {str(e)}")
            return ""

# Main app
def main():
    st.title("🏢 Entity Recognition - Step 4: Text + PDF + Images (OCR)")
    st.markdown("Extract company mentions from text, PDF documents, and images using AI and OCR")
    
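    # API key: assumed to come from the OPENAI_API_KEY environment variable, as in the notebook setup
    import os
    api_key = os.getenv("OPENAI_API_KEY")
    
    if not api_key: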
        st.warning("Please set your OpenAI API key to continue.")
        return
    
    # Initialize components
    extractor = SimpleEntityExtractor(api_key)
    pdf_processor = PDFProcessor()
    ocr_processor = OCRProcessor()
    
    # Create tabs for different input methods
    tab1, tab2, tab3 = st.tabs(["📝 Text Input", "📄 PDF Upload", "🖼️ Image Upload"])
    
    with tab1:
        st.subheader("📝 Enter Financial Text")
        sample_text = "Apple (AAPL) reported strong Q4 earnings. Microsoft (MSFT) showed impressive Azure growth. Tesla (TSLA) announced new production milestones."
        
        text_input = st.text_area(
            "Enter or paste financial text here:",
            value=sample_text,
            height=150,
            placeholder="Paste financial news or article text here...",
            key="text_input"
        )
        
        if st.button("🔍 Extract from Text", type="primary", key="text_extract"):
            process_text(text_input, extractor, "Text Input")
    
    with tab2:
        st.subheader("📄 Upload PDF Document")
        uploaded_file = st.file_uploader("Choose a PDF file", type="pdf")
        
        if uploaded_file is not None:
            if st.button("🔍 Extract from PDF", type="primary", key="pdf_extract"):
                with st.spinner("Extracting text from PDF..."):
                    extracted_text = pdf_processor.extract_text_from_pdf(uploaded_file)
                
                if extracted_text:
                    st.success(f"✅ Extracted {len(extracted_text)} characters from PDF")
                    
                    # Show preview of extracted text
                    with st.expander("Preview extracted text"):
                        st.text(extracted_text[:1000] + "..." if len(extracted_text) > 1000 else extracted_text)
                    
                    # Process the extracted text
                    process_text(extracted_text, extractor, "PDF")
                else:
                    st.error("❌ Could not extract text from PDF")
    
    with tab3:
        st.subheader("🖼️ Upload Image")
        uploaded_image = st.file_uploader("Choose an image file", type=["png", "jpg", "jpeg", "bmp", "tiff"])
        
        if uploaded_image is not None:
            # Display the uploaded image
            image = Image.open(uploaded_image)
            st.image(image, caption="Uploaded Image", use_column_width=True)
            
            if st.button("🔍 Extract from Image (OCR)", type="primary", key="image_extract"):
                with st.spinner("Extracting text from image using OCR..."):
                    extracted_text = ocr_processor.extract_text_from_image(image)
                
                if extracted_text:
                    st.success(f"✅ Extracted text from image: {len(extracted_text)} characters")
                    
                    # Show preview of extracted text
                    with st.expander("Preview extracted text"):
                        st.text(extracted_text)
                    
                    # Process the extracted text
                    process_text(extracted_text, extractor, "Image (OCR)")
                else:
                    st.warning("⚠️ No text found in the image")

def process_text(text: str, extractor, source: str):
    """Process text and display results"""
    if text.strip():
        with st.spinner("Extracting companies..."):
            entities = extractor.extract_companies(text)
        
        if entities:
            st.success(f"✅ Found {len(entities)} companies from {source}!")
            
            # Display results in a table
            df = pd.DataFrame(entities)
            df['Source'] = source
            st.subheader("📊 Extracted Companies")
            st.dataframe(df, use_container_width=True)
            
            # Display as cards
            st.subheader("🏷️ Company Cards")
            cols = st.columns(min(len(entities), 3))
            for i, entity in enumerate(entities):
                with cols[i % 3]:
                    st.info(f"**{entity['name']}**\\n\\nTicker: {entity.get('ticker', 'N/A')}\\n\\nFound as: '{entity.get('original_text', 'N/A')}'\\n\\nSource: {source}")
        else:
            st.warning(f"No companies found in the {source.lower()}.")
    else:
        st.error("Please provide some content to analyze.")

if __name__ == "__main__":
    main()
'''

# Save the enhanced Streamlit app
with open('app_step4.py', 'w') as f:
    f.write(streamlit_app_step4)

print("✅ Step 4 Streamlit app created: app_step4.py")
print("🖼️ Now supports text input, PDF files, and image OCR!")
print("\n🚀 To run this app, execute in terminal:")
print("streamlit run app_step4.py")

---

# 🤖 Step 5: Add AI Vision Analysis

Next, let's add AI vision capabilities to analyze charts and graphs for company information, using the vision-capable `gpt-4o-mini` model.
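
Sending an image to the chat API comes down to base64-encoding the image bytes and embedding them in a `data:` URL inside the message content. The sketch below isolates just that step, mirroring the message layout used in the next cell. `build_vision_messages` is a hypothetical helper name, and the placeholder bytes stand in for real PNG data, so no API call is made.

```python
import base64
from typing import Any, Dict, List


def build_vision_messages(
    image_bytes: bytes, prompt: str, mime_type: str = "image/png"
) -> List[Dict[str, Any]]:
    """Build a chat `messages` list pairing a text prompt with an inline image."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{mime_type};base64,{encoded}",
                        "detail": "high",
                    },
                },
            ],
        }
    ]


# Placeholder bytes stand in for real PNG data in this demo.
messages = build_vision_messages(b"not-a-real-image", "List any companies shown in this image.")
url = messages[0]["content"][1]["image_url"]["url"]
print(url.split(",", 1)[0])
```

Keeping the payload construction separate from the network call makes it easy to inspect what is sent before spending tokens on a request.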

In [None]:
# Create AI Vision processor
class VisionProcessor:
    """Process images using AI vision to detect companies"""
    
    def __init__(self, api_key: str):
        self.client = OpenAI(api_key=api_key)
    
    def analyze_image_for_companies(self, image: Image.Image) -> Dict[str, Any]:
        """Use a vision-capable chat model (gpt-4o-mini here) to analyze an image for company information"""
        try:
            # Convert image to base64
            buffered = io.BytesIO()
            image.save(buffered, format="PNG")
            img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
            
            # Create prompt for vision analysis
            prompt = """Analyze this image and extract any company information present.
            Look for:
            1. Company names mentioned in text, labels, or titles
            2. Stock tickers (like AAPL, MSFT, etc.)
            3. Company logos or branding
            4. Data labels in charts referring to companies
            
            Return a JSON object with:
            {
                "companies_found": [
                    {
                        "name": "company name",
                        "ticker": "ticker symbol if found",
                        "context": "where/how it was found in the image"
                    }
                ],
                "has_chart": true/false,
                "chart_type": "bar/line/pie/other if applicable",
                "description": "brief description of what the image shows"
            }
            """
            
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": prompt},
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": f"data:image/png;base64,{img_base64}",
                                    "detail": "high"
                                }
                            }
                        ]
                    }
                ],
                response_format={"type": "json_object"},
                max_tokens=500
            )
            
            result = json.loads(response.choices[0].message.content)
            return result
            
        except Exception as e:
            print(f"Vision API analysis failed: {str(e)}")
            return {
                "companies_found": [],
                "has_chart": False,
                "description": "Failed to analyze image"
            }

# Test vision processing with our test image
vision_processor = VisionProcessor(API_KEY)
print("🤖 Testing AI Vision functionality...")

# Use the same test image from OCR
vision_result = vision_processor.analyze_image_for_companies(test_image)
print(f"🔍 Vision analysis result: {json.dumps(vision_result, indent=2)}")

if vision_result.get('companies_found'):
    print(f"✅ Vision AI detected {len(vision_result['companies_found'])} companies!")
else:
    print("⚠️ Vision AI didn't detect companies in test image, but functionality is ready")

---

# 🎯 Step 6: Final Comprehensive Application

Now let's create the final version that combines all capabilities: text processing, PDF handling, OCR, and AI vision analysis.
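
Once several input paths exist, the same company can show up more than once, for instance in OCR text and again in the vision analysis. One possible way to reconcile such results is sketched below. This is an optional illustration rather than part of the app that follows: `merge_company_results` is a hypothetical helper, and it assumes each result is a dict with `name` and `ticker` keys like those the extractors return.

```python
from typing import Any, Dict, List


def merge_company_results(
    results_by_source: Dict[str, List[Dict[str, Any]]]
) -> List[Dict[str, Any]]:
    """Merge company dicts from several sources, de-duplicating by ticker (or name)."""
    merged: Dict[str, Dict[str, Any]] = {}
    for source, companies in results_by_source.items():
        for company in companies:
            ticker = (company.get("ticker") or "").strip().upper()
            name = (company.get("name") or "").strip()
            key = ticker or name.lower()
            if not key:
                continue  # nothing to identify this entry by
            # The first occurrence decides the display name; later ones only add sources
            entry = merged.setdefault(key, {"name": name, "ticker": ticker, "sources": []})
            if source not in entry["sources"]:
                entry["sources"].append(source)
    return list(merged.values())


demo = {
    "text": [{"name": "Apple Inc.", "ticker": "AAPL"}, {"name": "Tesla", "ticker": "TSLA"}],
    "ocr": [{"name": "Apple", "ticker": "aapl"}],
    "vision": [{"name": "Microsoft", "ticker": None}],
}
for row in merge_company_results(demo):
    print(row)
```

Keying on the ticker first keeps "Apple" and "Apple Inc." together, while entries without a ticker fall back to a case-insensitive name match.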

In [None]:
# Create the final comprehensive Streamlit app
streamlit_app_final = '''
import streamlit as st
import json
from openai import OpenAI
import pandas as pd
import PyPDF2
import io
from PIL import Image
import pytesseract
import cv2
import numpy as np
import base64
from typing import Dict, Any, List

# App configuration
st.set_page_config(
    page_title="Entity Recognition - Complete",
    page_icon="🏢",
    layout="wide"
)

class ComprehensiveEntityExtractor:
    """Complete entity extractor with all capabilities"""
    
    def __init__(self, api_key: str):
        self.client = OpenAI(api_key=api_key)
    
    def extract_companies(self, text: str):
        """Extract company mentions from text"""
        
        # Truncate the input to 3,000 characters to stay within API limits
        prompt = f"""
        Extract all company names and stock tickers from the following text.
        Return the result as a JSON object with a "companies" field containing a list.
        
        Text: {text[:3000]}
        
        Format: {{"companies": [{{"name": "Company Name", "ticker": "TICK", "original_text": "original", "confidence": 90}}]}}
        """
        
        try:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are a financial entity extraction expert. Always respond with valid JSON."},
                    {"role": "user", "content": prompt}
                ],
                response_format={"type": "json_object"}
            )
            
            result = json.loads(response.choices[0].message.content)
            return result.get('companies', [])
            
        except Exception as e:
            st.error(f"Error extracting entities: {e}")
            return []
    
    def analyze_image_for_companies(self, image: Image.Image) -> Dict[str, Any]:
        """Use a vision-capable chat model (gpt-4o-mini here) to analyze an image for company information"""
        try:
            # Convert image to base64
            buffered = io.BytesIO()
            image.save(buffered, format="PNG")
            img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
            
            prompt = """Analyze this image and extract any company information present.
            Look for company names, stock tickers, logos, and data labels in charts.
            
            Return JSON: {"companies_found": [{"name": "Company", "ticker": "TICK", "context": "where found"}], "description": "what the image shows"}
            """
            
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": prompt},
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": f"data:image/png;base64,{img_base64}",
                                    "detail": "high"
                                }
                            }
                        ]
                    }
                ],
                response_format={"type": "json_object"},
                max_tokens=500
            )
            
            result = json.loads(response.choices[0].message.content)
            return result
            
        except Exception as e:
            st.warning(f"Vision analysis failed: {str(e)}")
            return {"companies_found": [], "description": "Failed to analyze image"}

class PDFProcessor:
    @staticmethod
    def extract_text_from_pdf(pdf_file) -> str:
        """Return the text of every page in an uploaded PDF, or an empty string on failure"""
        try:
            pdf_reader = PyPDF2.PdfReader(io.BytesIO(pdf_file.read()))
            text = ""
            for page in pdf_reader.pages:
                # extract_text() can yield nothing for scanned or image-only pages
                text += (page.extract_text() or "") + "\\n"
            return text.strip()
        except Exception as e:
            st.error(f"Error reading PDF: {str(e)}")
            return ""

class OCRProcessor:
    @staticmethod
    def extract_text_from_image(image: Image.Image) -> str:
        try:
            img_array = np.array(image)
            if len(img_array.shape) == 3:
                gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
            else:
                gray = img_array
            
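            # Boost contrast (alpha) and brightness (beta) before thresholding
            alpha = 1.5
            beta = 10
            adjusted = cv2.convertScaleAbs(gray, alpha=alpha, beta=beta)
            # Otsu's method picks the binarization threshold automatically
            _, thresh = cv2.threshold(adjusted, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)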
            text = pytesseract.image_to_string(thresh)
            return text.strip()
        except Exception as e:
            st.warning(f"OCR extraction failed: {str(e)}")
            return ""

def main():
    st.title("🏢 Complete Entity Recognition System")
    st.markdown("""Extract company mentions from **text**, **PDF documents**, and **images** using:
    - 🤖 **AI-powered text analysis**
    - 📄 **PDF text extraction** 
    - 🖼️ **OCR (Optical Character Recognition)**
    - 👁️ **AI Vision analysis** for charts and graphics
    """)
    
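    # API key: entered in the sidebar so it is never hard-coded in the file
    api_key = st.sidebar.text_input("OpenAI API Key", type="password")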
    
    if not api_key:
        st.warning("Please set your OpenAI API key to continue.")
        return
    
    # Initialize all processors
    extractor = ComprehensiveEntityExtractor(api_key)
    pdf_processor = PDFProcessor()
    ocr_processor = OCRProcessor()
    
    # Create tabs for different input methods
    tab1, tab2, tab3 = st.tabs(["📝 Text Input", "📄 PDF Upload", "🖼️ Image Analysis"])
    
    with tab1:
        st.subheader("📝 Enter Financial Text")
        sample_text = "Apple (AAPL) reported strong Q4 earnings with revenue of $89.5B. Microsoft (MSFT) showed impressive Azure growth of 35% YoY. Tesla (TSLA) announced new production milestones, while Google parent Alphabet (GOOGL) saw advertising revenue rebound."
        
        text_input = st.text_area(
            "Enter or paste financial text here:",
            value=sample_text,
            height=150,
            placeholder="Paste financial news, earnings reports, or any text mentioning companies...",
            key="text_input"
        )
        
        if st.button("🔍 Extract Companies from Text", type="primary", key="text_extract"):
            process_text(text_input, extractor, "📝 Text Input")
    
    with tab2:
        st.subheader("📄 Upload PDF Document")
        st.markdown("Upload financial reports, earnings statements, or any PDF containing company information.")
        
        uploaded_file = st.file_uploader("Choose a PDF file", type="pdf")
        
        if uploaded_file is not None:
            if st.button("🔍 Extract Companies from PDF", type="primary", key="pdf_extract"):
                with st.spinner("Extracting text from PDF..."):
                    extracted_text = pdf_processor.extract_text_from_pdf(uploaded_file)
                
                if extracted_text:
                    st.success(f"✅ Extracted {len(extracted_text)} characters from PDF")
                    
                    with st.expander("Preview extracted text"):
                        st.text(extracted_text[:1000] + "..." if len(extracted_text) > 1000 else extracted_text)
                    
                    process_text(extracted_text, extractor, "📄 PDF Document")
                else:
                    st.error("❌ Could not extract text from PDF")
    
    with tab3:
        st.subheader("🖼️ Upload Image for Analysis")
        st.markdown("Upload charts, screenshots, infographics, or any image containing company information.")
        
        uploaded_image = st.file_uploader(
            "Choose an image file", 
            type=["png", "jpg", "jpeg", "bmp", "tiff"],
            help="Supports PNG, JPG, JPEG, BMP, and TIFF formats"
        )
        
        if uploaded_image is not None:
            image = Image.open(uploaded_image)
            st.image(image, caption="Uploaded Image", use_column_width=True)
            
            col1, col2 = st.columns(2)
            
            with col1:
                if st.button("👁️ AI Vision Analysis", type="primary", key="vision_extract"):
                    with st.spinner("Analyzing image with AI Vision..."):
                        vision_result = extractor.analyze_image_for_companies(image)
                    
                    st.subheader("🤖 AI Vision Results")
                    st.write(f"**Description:** {vision_result.get('description', 'N/A')}")
                    
                    companies = vision_result.get('companies_found', [])
                    if companies:
                        df = pd.DataFrame(companies)
                        df['Source'] = '👁️ AI Vision'
                        df['confidence'] = 85  # Default confidence for vision
                        display_results(df, "👁️ AI Vision Analysis")
                    else:
                        st.info("No companies detected through AI vision analysis.")
            
            with col2:
                if st.button("📝 OCR Text Extraction", type="secondary", key="ocr_extract"):
                    with st.spinner("Extracting text using OCR..."):
                        extracted_text = ocr_processor.extract_text_from_image(image)
                    
                    if extracted_text:
                        st.success(f"✅ OCR extracted {len(extracted_text)} characters")
                        
                        with st.expander("Preview OCR text"):
                            st.text(extracted_text)
                        
                        process_text(extracted_text, extractor, "📝 Image OCR")
                    else:
                        st.warning("⚠️ No text found in the image through OCR")

def process_text(text: str, extractor, source: str):
    """Process text and display results"""
    if text.strip():
        with st.spinner("Extracting companies using AI..."):
            entities = extractor.extract_companies(text)
        
        if entities:
            # Create DataFrame with source information
            df = pd.DataFrame(entities)
            df['Source'] = source
            if 'confidence' not in df.columns:
                df['confidence'] = 90  # Default confidence
            
            display_results(df, source)
        else:
            st.warning(f"No companies found in the {source.lower()}.")
    else:
        st.error("Please provide some content to analyze.")

def display_results(df: pd.DataFrame, source: str):
    """Display extraction results"""
    st.success(f"✅ Found {len(df)} companies from {source}!")
    
    # Results table
    st.subheader("📊 Extracted Companies")
    
    # Reorder columns for better display
    column_order = ['name', 'ticker', 'original_text', 'confidence', 'Source']
    display_df = df.reindex(columns=[col for col in column_order if col in df.columns])
    
    st.dataframe(display_df, use_container_width=True)
    
    # Company cards
    st.subheader("🏷️ Company Details")
    cols = st.columns(min(len(df), 3))
    
    for i, (_, entity) in enumerate(df.iterrows()):
        with cols[i % 3]:
            confidence = entity.get('confidence', entity.get('Confidence', 'N/A'))
            st.info(f"""**{entity['name']}**
            
📈 **Ticker:** {entity.get('ticker', 'N/A')}
🔍 **Found as:** '{entity.get('original_text', 'N/A')}'
📊 **Confidence:** {confidence}%
📍 **Source:** {entity.get('Source', source)}""")
    
    # Download option
    csv = df.to_csv(index=False)
    st.download_button(
        label="📥 Download Results as CSV",
        data=csv,
        file_name=f"company_extraction_{source.replace(' ', '_').lower()}.csv",
        mime="text/csv"
    )

if __name__ == "__main__":
    main()
'''

# Save the final comprehensive Streamlit app
with open('app_final.py', 'w', encoding='utf-8') as f:
    f.write(streamlit_app_final)

print("✅ Final comprehensive Streamlit app created: app_final.py")
print("🎯 Includes ALL features: Text, PDF, OCR, and AI Vision!")
print("\n🚀 To run the final app, execute in terminal:")
print("streamlit run app_final.py")
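
The vision call in `analyze_image_for_companies` trusts the model to return well-formed JSON containing both expected keys. With `max_tokens=500`, a long reply can be cut off mid-object, in which case `json.loads` raises and the whole result is discarded. A more forgiving parser could look roughly like the sketch below. `normalize_vision_result` and `empty_result` are hypothetical helpers, not part of the app, and the fallback shape is assumed to mirror the one the app already returns on failure.

```python
import json
from typing import Any, Dict


def empty_result() -> Dict[str, Any]:
    """Fallback shape, copied from the app's error path."""
    return {"companies_found": [], "description": "Failed to analyze image"}


def normalize_vision_result(raw: str) -> Dict[str, Any]:
    """Parse the model's reply and coerce it into the expected shape (hypothetical helper)."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Truncated or non-JSON reply: fall back instead of raising.
        return empty_result()
    if not isinstance(data, dict):
        return empty_result()

    companies = data.get("companies_found")
    if not isinstance(companies, list):
        companies = []
    # Keep only dict entries with a non-empty name, because the UI reads entity['name'].
    cleaned = [c for c in companies if isinstance(c, dict) and c.get("name")]
    return {
        "companies_found": cleaned,
        "description": str(data.get("description", "N/A")),
    }


good = normalize_vision_result(
    '{"companies_found": [{"name": "Apple", "ticker": "AAPL"}, {"ticker": "X"}],'
    ' "description": "bar chart"}'
)
truncated = normalize_vision_result('{"companies_found": [{"name": "App')
print(good)
print(truncated)
```

The entry without a `name` is dropped, and the truncated reply degrades to the empty result rather than raising.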

---

# 🎉 Summary

Congratulations! You've built a complete entity recognition system step by step. Here's what we've accomplished:

## 📱 Applications Created:

1. **`app_step2.py`** - Basic text processing with OpenAI
2. **`app_step3.py`** - Added PDF file support
3. **`app_step4.py`** - Added image OCR capabilities
4. **`app_final.py`** - Complete system with AI vision analysis

## 🛠️ Technologies Used:

- **OpenAI GPT models** (`gpt-4o-mini` for the vision step) for text analysis and vision processing
- **Streamlit** for web interface
- **PyPDF2** for PDF text extraction
- **Tesseract OCR** for image text recognition
- **OpenCV** for image preprocessing
- **PIL (Pillow)** for image handling
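
Of these, the OpenCV step is the least self-explanatory. Before Otsu thresholding, `OCRProcessor` calls `cv2.convertScaleAbs(gray, alpha=1.5, beta=10)`, which scales each pixel by `alpha`, adds `beta`, takes the absolute value, and saturates the result to the 0-255 range. The pure-Python sketch below approximates that arithmetic for a handful of pixel values, so the effect is visible without installing OpenCV. `scale_abs` is an illustrative stand-in, not an OpenCV function.

```python
def scale_abs(pixel: int, alpha: float = 1.5, beta: float = 10) -> int:
    """Approximate the per-pixel result of cv2.convertScaleAbs for 8-bit output."""
    value = abs(pixel * alpha + beta)
    # Round to the nearest integer, then clip at the 8-bit maximum.
    return min(255, round(value))


samples = [0, 50, 100, 160, 200, 255]
adjusted = [scale_abs(p) for p in samples]
print(adjusted)  # [10, 85, 160, 250, 255, 255]
```

With these settings, any input of 164 or above maps to 255, so light backgrounds flatten to white before the threshold is computed.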

## 🎯 Capabilities:

✅ Extract companies from plain text  
✅ Process PDF documents  
✅ Read text from images (OCR)  
✅ Analyze charts and graphics with AI  
✅ Interactive web interface  
✅ Export results to CSV  
✅ Source tracking  
✅ Confidence scoring, with fixed defaults (90 for text, 85 for AI vision) when the model does not return a score  
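
Source tracking and confidence scoring come down to tagging every extracted record before it is displayed. The sketch below shows the same idea with plain dictionaries, and additionally merges records from several sources so that each ticker appears once with its best score. `tag_records` and `merge_by_ticker` are hypothetical helpers. The defaults of 90 and 85 are the ones used in `app_final.py`, while the merge rule is an assumption made for illustration.

```python
from typing import Dict, List


def tag_records(records: List[dict], source: str, default_confidence: int) -> List[dict]:
    """Attach a source label and fill in a confidence when the record has none."""
    tagged = []
    for record in records:
        item = dict(record)  # copy so the caller's data is untouched
        item["Source"] = source
        item.setdefault("confidence", default_confidence)
        tagged.append(item)
    return tagged


def merge_by_ticker(records: List[dict]) -> List[dict]:
    """Keep one record per ticker, preferring the highest confidence (assumed rule)."""
    best: Dict[str, dict] = {}
    for record in records:
        key = record.get("ticker") or record["name"]
        if key not in best or record["confidence"] > best[key]["confidence"]:
            best[key] = record
    return list(best.values())


text_hits = tag_records([{"name": "Apple", "ticker": "AAPL", "confidence": 95}], "Text Input", 90)
vision_hits = tag_records(
    [{"name": "Apple Inc.", "ticker": "AAPL"}, {"name": "Tesla", "ticker": "TSLA"}],
    "AI Vision",
    85,
)
merged = merge_by_ticker(text_hits + vision_hits)
for row in merged:
    print(row["ticker"], row["confidence"], row["Source"])
```

Here the text result for AAPL wins over the vision result because it carries the higher score, while TSLA keeps the vision default.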

## 🚀 Next Steps:

You can now:
- Run any of the step-by-step applications to see the progression
- Use the final application for comprehensive entity recognition
- Extend the system with additional features like:
  - Entity verification against company databases
  - Batch processing capabilities
  - Advanced visualization
  - API endpoints
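
As a starting point for the first extension, extracted names can be checked against a known company list with fuzzy matching from the standard library. This is a minimal sketch: `COMPANY_DB` is a tiny hypothetical lookup table, `verify_entity` is an illustrative helper, and the 0.6 cutoff is an arbitrary assumption that would need tuning on real data.

```python
import difflib
from typing import Optional

# Hypothetical reference data; a real system would load this from a database or file.
COMPANY_DB = {
    "apple inc.": "AAPL",
    "microsoft corporation": "MSFT",
    "tesla, inc.": "TSLA",
    "alphabet inc.": "GOOGL",
}


def verify_entity(name: str, ticker: Optional[str] = None, cutoff: float = 0.6) -> dict:
    """Match an extracted name (and optional ticker) against COMPANY_DB."""
    # An exact ticker match is the strongest signal, so check it first.
    if ticker:
        for db_name, db_ticker in COMPANY_DB.items():
            if db_ticker == ticker.upper():
                return {"verified": True, "matched_name": db_name, "ticker": db_ticker}
    # Otherwise fall back to fuzzy matching on the lower-cased name.
    matches = difflib.get_close_matches(name.lower(), list(COMPANY_DB), n=1, cutoff=cutoff)
    if matches:
        return {"verified": True, "matched_name": matches[0], "ticker": COMPANY_DB[matches[0]]}
    return {"verified": False, "matched_name": None, "ticker": ticker}


print(verify_entity("Microsoft Corp"))
print(verify_entity("Zzzzqqq"))
```

Records that come back with `verified` set to `False` could be shown with a lower confidence or filtered out before display.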

Happy coding! 🎊