# Pitch Deck Alignment Analysis

This notebook analyzes a pitch deck to determine how well it aligns with the thesis themes in the Neo4j graph database.

**Workflow:**
1. Load pitch deck from `data/pitch_decks`
2. Extract core takeaways and theses using GPT-4
3. Query Neo4j for all thesis data
4. Use GPT-4 to check alignment between the pitch deck and all theses (batched in one call)
5. Aggregate results by core thesis
6. Display detailed alignment results for each thesis
7. Generate comprehensive analysis using GPT-4

**Output:**
- Alignment scores (0-1) for each thesis
- Evidence and strength of alignment
- Core thesis alignment summary
- GPT-4 analysis of overall fit

**Cost:** ~$0.20-0.50 per analysis. Batching all theses into one call is much cheaper than one call per thesis, because the pitch deck text is sent only once.
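
The cost note above contrasts one batched call with one call per thesis. A rough sketch of why batching helps is shown below. It compares only prompt size, ignores output tokens and the extracted-themes summary, and relies on two assumptions that are not measured anywhere in this notebook: about 4 characters per token, and about 500 characters of text per thesis. The deck length (11,862 characters) and thesis count (19) are taken from the run recorded in this notebook.

```python
# Rough input-size comparison: one batched call vs. one call per thesis.
# chars_per_token and thesis_chars are illustrative assumptions, not measurements.

def estimate_input_tokens(deck_chars: int, n_theses: int, thesis_chars: int,
                          batched: bool, chars_per_token: int = 4) -> int:
    """Estimate total prompt tokens sent across all alignment calls."""
    if batched:
        # The deck is sent once, followed by every thesis.
        total_chars = deck_chars + n_theses * thesis_chars
    else:
        # The deck is re-sent together with each individual thesis.
        total_chars = n_theses * (deck_chars + thesis_chars)
    return total_chars // chars_per_token


batched = estimate_input_tokens(11_862, 19, 500, batched=True)
individual = estimate_input_tokens(11_862, 19, 500, batched=False)
print(f"batched: ~{batched:,} prompt tokens")
print(f"individual: ~{individual:,} prompt tokens")
```

Under these assumptions the per-thesis approach sends roughly eleven times as many prompt tokens, since the deck dominates each prompt. Actual cost depends on the model's pricing and on output length.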


In [1]:
import os
import json
import re
from pathlib import Path
from typing import List, Dict, Any
import fitz  # PyMuPDF
from openai import OpenAI
from neo4j import GraphDatabase
from dotenv import load_dotenv

# Load environment variables
env_path = Path("../.env").resolve()
load_dotenv(env_path)

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
NEO4J_URI = os.getenv("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USER = os.getenv("NEO4J_USER", "neo4j")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD", "password")

# Initialize clients
openai_client = OpenAI(api_key=OPENAI_API_KEY) if OPENAI_API_KEY else None

# Initialize Neo4j driver
neo4j_driver = None
neo4j_status = "✗ (NEO4J_URI not set)"
neo4j_error = None
if NEO4J_URI:
    try:
        neo4j_driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))
        neo4j_driver.verify_connectivity()
        neo4j_status = f"✓ Connected to {NEO4J_URI}"
    except Exception as e:
        neo4j_status = "✗ Connection failed"
        neo4j_error = f"{type(e).__name__}: {str(e)[:100]}"
        neo4j_driver = None

# Report configuration status once, regardless of which branch ran
print("✓ Configuration loaded")
print(f"  OpenAI client: {'✓' if openai_client else '✗'}")
print(f"  Neo4j driver: {neo4j_status}")
if neo4j_error:
    print(f"  Error: {neo4j_error}")


✓ Configuration loaded
  OpenAI client: ✓
  Neo4j driver: ✓ Connected to neo4j+s://be06b044.databases.neo4j.io


In [2]:
# Step 1: Load Pitch Deck from data/pitch_decks

def extract_text_from_pdf(pdf_path: Path) -> str:
    """Extract all text from a PDF file."""
    # The context manager closes the document even if extraction fails
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)

# Set the pitch deck file path
PITCH_DECK_DIR = Path("../data/pitch_decks")
PITCH_DECK_PATH = None  # Optionally set to a Path, e.g. PITCH_DECK_DIR / "my_deck.pdf" (hypothetical name)

# List available pitch decks (sorted so the default choice is deterministic)
if PITCH_DECK_DIR.exists():
    pdf_files = sorted(PITCH_DECK_DIR.glob("*.pdf"))
    if pdf_files:
        print(f"Available pitch decks in {PITCH_DECK_DIR}:")
        for i, pdf_file in enumerate(pdf_files, 1):
            print(f"  {i}. {pdf_file.name}")
        
        # Use the first PDF if PITCH_DECK_PATH not set, or set it manually:
        if not PITCH_DECK_PATH:
            PITCH_DECK_PATH = pdf_files[0]
            print(f"\nUsing: {PITCH_DECK_PATH.name}")
    else:
        print(f"⚠ No PDF files found in {PITCH_DECK_DIR}")
else:
    print(f"⚠ Directory {PITCH_DECK_DIR} does not exist")

# Load pitch deck
if PITCH_DECK_PATH and PITCH_DECK_PATH.exists():
    print(f"\n✓ Loading PDF: {PITCH_DECK_PATH.name}")
    pitch_deck_content = extract_text_from_pdf(PITCH_DECK_PATH)
    print(f"  Extracted {len(pitch_deck_content):,} characters")
    print(f"  Preview (first 500 chars):\n{pitch_deck_content[:500]}...")
else:
    print("⚠ No pitch deck loaded. Please set PITCH_DECK_PATH to a PDF file in data/pitch_decks")
    pitch_deck_content = ""


Available pitch decks in ..\data\pitch_decks:
  1. 20240403 Cognaize Series B Pitch Deck.pdf
  2. OptiU Investor Deck - Oct 2025.pdf

Using: 20240403 Cognaize Series B Pitch Deck.pdf

✓ Loading PDF: 20240403 Cognaize Series B Pitch Deck.pdf
  Extracted 11,862 characters
  Preview (first 500 chars):
Company Overview
Automating Unstructured Data 
with Hybrid Intelligence
3 April 2024
Cognaize Holdings Inc. Confidential & Proprietary
Automating Unstructured Data with Hybrid Intelligence
• Founded in 2020
• HQ in New York City, with offices in LA, 
Frankfurt, and Armenia
• Over 100 customers including Moody’s, 
Fitch, and Truist
• Strong sales pipeline: $16.9M (as of April 3rd)
2
Cognaize has developed a new tech platform and methodology 
to apply human-centered AI to complex automation tasks ...


In [3]:
# Step 2: Extract Core Takeaways and Theses using GPT-4

def extract_pitch_deck_themes(text: str) -> List[Dict[str, str]]:
    """
    Extract core takeaways, theses, and value propositions from pitch deck.
    Returns a list of themes with descriptions.
    """
    if not openai_client:
        print("⚠ OpenAI client not configured")
        return []
    
    prompt = f"""Analyze the following pitch deck and extract the core takeaways, theses, and value propositions.

For each key theme/thesis, provide:
1. A concise name/title for the theme
2. A 2-3 sentence description of what this theme asserts or claims

Focus on:
- Core value propositions
- What the company does, for what target market, and what is its differentiation
- Key market assertions
- Business model assertions
- Market trends which led to this company's formation
- Differentiated technology perspectives
- Background of the founding team and management
- Perspectives on capital efficiency
- Perspective on the use of AI and machine learning

Return a JSON array with this structure:
[
    {{
        "theme": "Theme name",
        "description": "2-3 sentence description of what this theme asserts"
    }},
    ...
]

Pitch deck content (first 100K characters, truncated to stay within token limits):
{text[:100000]}
"""

    try:
        print("Extracting themes from pitch deck...")
        response = openai_client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {"role": "system", "content": "You are an expert at analyzing pitch decks and extracting key themes and theses. Always return valid JSON."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.3,
            max_tokens=2000
        )
        
        result_text = response.choices[0].message.content.strip()
        
        # Extract JSON from response
        if "```json" in result_text:
            result_text = result_text.split("```json")[1].split("```")[0]
        elif "```" in result_text:
            result_text = result_text.split("```")[1].split("```")[0]
        
        json_match = re.search(r'\[.*\]', result_text, re.DOTALL)
        if json_match:
            result_text = json_match.group(0)
        
        themes = json.loads(result_text)
        print(f"✓ Extracted {len(themes)} themes")
        return themes
    except Exception as e:
        print(f"⚠ Error extracting themes: {e}")
        import traceback
        traceback.print_exc()
        return []

# Extract themes
if pitch_deck_content:
    pitch_themes = extract_pitch_deck_themes(pitch_deck_content)
    
    print(f"\nExtracted Themes:")
    for i, theme in enumerate(pitch_themes, 1):
        print(f"\n{i}. {theme.get('theme', 'Unknown')}")
        print(f"   {theme.get('description', 'No description')}")
else:
    print("⚠ No pitch deck content to analyze")
    pitch_themes = []


Extracting themes from pitch deck...
✓ Extracted 10 themes

Extracted Themes:

1. Hybrid Intelligence for Unstructured Data
   Cognaize leverages a unique blend of human-centered AI to automate the processing of complex, unstructured documents. This approach allows for more accurate and efficient handling of data, setting the company apart in the automation space.

2. Targeting Financial Services and Insurance
   The company identifies financial services and insurance sectors in the USA as its primary markets, recognizing the vast opportunity within these industries for automating unstructured data processes.

3. Focusing on Longtail Use Cases
   Unlike competitors that target high-volume automation tasks, Cognaize focuses on longtail use cases. This strategy addresses a market need for automation in areas with lower volume but high complexity, which are often overlooked by other providers.

4. Enterprise Knowledge Platform
   Cognaize has developed an Enterprise Knowledge Platform tha
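
The theme-extraction cell above strips an optional code fence and then searches for the outermost `[...]` span before calling `json.loads`. The same steps could be factored into one helper, sketched below. `parse_json_array` is a hypothetical name, not part of this notebook, and unlike the cell above it raises an error instead of returning an empty list when no array is found.

```python
import json
import re
from typing import Any, List

FENCE = "`" * 3  # a Markdown code-fence marker, built this way to keep the example readable


def parse_json_array(raw: str) -> List[Any]:
    """Parse a JSON array from a model reply that may be wrapped in a fence or in prose."""
    text = raw.strip()
    # Prefer the contents of a fenced block if one is present.
    fenced = re.search(re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # Then take the outermost [...] span, ignoring any surrounding prose.
    match = re.search(r"\[.*\]", text, re.DOTALL)
    if not match:
        raise ValueError("no JSON array found in model reply")
    return json.loads(match.group(0))


reply = "Here are the themes:\n" + FENCE + 'json\n[{"theme": "A", "description": "B"}]\n' + FENCE
themes = parse_json_array(reply)
print(len(themes), themes[0]["theme"])
```

Raising on a missing array keeps "the model returned nothing parseable" distinct from "the model returned an empty list", which the caller can then handle explicitly.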

In [4]:
# Step 3: Query Neo4j for All Thesis Data

def get_all_theses(driver) -> List[Dict[str, Any]]:
    """Retrieve all Thesis nodes from Neo4j."""
    if not driver:
        print("⚠ Neo4j driver not configured")
        return []
    
    with driver.session() as session:
        result = session.run("""
            MATCH (t:Thesis)
            RETURN t.thesis as thesis,
                   t.title as title,
                   t.node_num as node_num,
                   t.description as description,
                   t.core_thesis as core_thesis
            ORDER BY t.node_num
        """)
        
        theses = []
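        for record in result:
            # Every key in the RETURN clause exists on the record, so a missing
            # node property comes back as None; fall back with `or` instead of
            # relying on .get() defaults, which only apply to absent keys.
            theses.append({
                'thesis': record['thesis'] or '',
                'title': record['title'] or '',
                'node_num': record['node_num'] or 0,
                'description': record['description'] or '',
                'core_thesis': record['core_thesis'] or ''
            })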
        
        return theses

# Get all theses
if neo4j_driver:
    print("Querying Neo4j for all theses...")
    thesis_data = get_all_theses(neo4j_driver)
    print(f"✓ Retrieved {len(thesis_data)} theses")
    
    if thesis_data:
        print(f"\nSample thesis:")
        sample = thesis_data[0]
        print(f"  Title: {sample.get('title', 'N/A')}")
        print(f"  Node #: {sample.get('node_num', 'N/A')}")
        print(f"  Core Thesis: {sample.get('core_thesis', 'N/A')}")
else:
    print("⚠ Neo4j not connected")
    thesis_data = []


Querying Neo4j for all theses...
✓ Retrieved 19 theses

Sample thesis:
  Title: Emerging Impact of Generative AI
  Node #: 1
  Core Thesis: AI


In [5]:
# Step 4: Check Alignment Using GPT-4 (Batch All Theses in One Call)

def check_all_theses_alignment_batch(
    pitch_deck_text: str,
    pitch_themes: List[Dict[str, str]],
    all_theses: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """
    Use GPT-4 to check alignment between pitch deck and all theses in one batch call.
    Much more cost-effective than individual calls.
    """
    if not openai_client:
        print("⚠ OpenAI client not configured")
        return []
    
    # Format pitch deck themes summary
    themes_summary = "\n".join([
        f"- {theme.get('theme', 'Unknown')}: {theme.get('description', '')}"
        for theme in pitch_themes
    ])
    
    # Format all theses
    theses_text = ""
    for i, thesis in enumerate(all_theses, 1):
        theses_text += f"""
{i}. TITLE: {thesis.get('title', 'N/A')}
   STATEMENT: {thesis.get('thesis', 'N/A')}
   DESCRIPTION: {thesis.get('description', 'N/A')}
   CORE THEME: {thesis.get('core_thesis', 'N/A')}
   NODE #: {thesis.get('node_num', 'N/A')}
"""
    
    prompt = f"""Analyze whether the following pitch deck supports or aligns with each of these investment theses.

PITCH DECK THEMES EXTRACTED:
{themes_summary}

FULL PITCH DECK CONTENT (first 50K characters):
{pitch_deck_text[:50000]}

INVESTMENT THESES TO EVALUATE:
{theses_text}

For each thesis, determine:
1. Does the pitch deck support this thesis? (Yes/No/Partially)
2. Alignment score (0-1): How well does it align? Be specific:
   - 0.8-1.0: Strong alignment with clear evidence
   - 0.6-0.79: Moderate alignment with some evidence
   - 0.4-0.59: Weak alignment or partial support
   - 0.0-0.39: No alignment or contradicts thesis
3. Evidence: Specific examples from the pitch deck that support or contradict the thesis
4. Strength: Strong/Moderate/Weak/None

Return a JSON array with one object per thesis, in the same order as listed:
[
    {{
        "node_num": 1,
        "title": "...",
        "supports": "Yes",
        "alignment_score": 0.75,
        "evidence": "The pitch deck mentions X which aligns with this thesis because...",
        "strength": "Moderate"
    }},
    ...
]

Be thorough and accurate. Return ONLY valid JSON, no other text."""

    try:
        print("Analyzing alignment with all theses using GPT-4 (this may take 30-60 seconds)...")
        response = openai_client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {"role": "system", "content": "You are an expert investment analyst who evaluates pitch decks against investment thesis frameworks. Always return valid JSON. Be thorough and accurate in your alignment assessments."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.3,
            max_tokens=4000  # Increased for multiple theses
        )
        
        result_text = response.choices[0].message.content.strip()
        
        # Extract JSON from response
        if "```json" in result_text:
            result_text = result_text.split("```json")[1].split("```")[0]
        elif "```" in result_text:
            result_text = result_text.split("```")[1].split("```")[0]
        
        json_match = re.search(r'\[.*\]', result_text, re.DOTALL)
        if json_match:
            result_text = json_match.group(0)
        
        alignments = json.loads(result_text)
        
        # Merge with original thesis data
        results = []
        for alignment in alignments:
            # Find matching thesis by node_num
            node_num = alignment.get('node_num')
            thesis = next((t for t in all_theses if t.get('node_num') == node_num), None)
            if thesis:
                results.append({
                    **thesis,
                    **alignment  # Add alignment data
                })
        
        print(f"✓ Analyzed {len(results)} theses")
        return results
        
    except Exception as e:
        print(f"⚠ Error analyzing alignment: {e}")
        import traceback
        traceback.print_exc()
        return []

# Check alignment for all theses
if pitch_deck_content and thesis_data and pitch_themes:
    alignment_results = check_all_theses_alignment_batch(
        pitch_deck_content,
        pitch_themes,
        thesis_data
    )
    
    # Show summary
    if alignment_results:
        print(f"\n{'='*60}")
        print("ALIGNMENT RESULTS SUMMARY")
        print(f"{'='*60}")
        
        # Count by strength
        strength_counts = {}
        for result in alignment_results:
            strength = result.get('strength', 'Unknown')
            strength_counts[strength] = strength_counts.get(strength, 0) + 1
        
        print(f"\nAlignment by Strength:")
        for strength, count in sorted(strength_counts.items(), key=lambda x: x[1], reverse=True):
            print(f"  {strength}: {count}")
        
        # Show top alignments
        sorted_results = sorted(alignment_results, key=lambda x: x.get('alignment_score', 0), reverse=True)
        print(f"\nTop 5 Alignments:")
        for i, result in enumerate(sorted_results[:5], 1):
            print(f"  {i}. {result.get('title', 'N/A')} - Score: {result.get('alignment_score', 0):.3f} ({result.get('strength', 'N/A')})")
else:
    print("⚠ Cannot check alignment: missing required data")
    print("   Make sure you've run:")
    print("   - Cell 2: Load pitch deck")
    print("   - Cell 3: Extract themes")
    print("   - Cell 4: Query Neo4j for theses")
    alignment_results = []


Analyzing alignment with all theses using GPT-4 (this may take 30-60 seconds)...
✓ Analyzed 19 theses

ALIGNMENT RESULTS SUMMARY

Alignment by Strength:
  None: 15
  Strong: 3
  Moderate: 1

Top 5 Alignments:
  1. Hybrid AI: Enterprise Strategy Essential - Score: 1.000 (Strong)
  2. Verticalized Approaches Enhance Solutions - Score: 0.800 (Strong)
  3. AI Business Models: Risks, Rewards - Score: 0.800 (Strong)
  4. Reshaping Compute Infrastructure Demand - Score: 0.500 (Moderate)
  5. Emerging Impact of Generative AI - Score: 0.000 (None)
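
In the run above, a score of 0.500 was labeled Moderate even though the prompt's rubric places 0.4-0.59 in the Weak band. One way to keep the two fields consistent is to derive the label from the score after parsing, rather than trusting the model's label. The sketch below encodes the rubric bands from the prompt; `strength_from_score` is a hypothetical helper, not part of this notebook.

```python
def strength_from_score(score: float) -> str:
    """Map an alignment score in [0, 1] to the strength label defined by the prompt's rubric."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 0.8:
        return "Strong"    # 0.8-1.0
    if score >= 0.6:
        return "Moderate"  # 0.6-0.79
    if score >= 0.4:
        return "Weak"      # 0.4-0.59
    return "None"          # 0.0-0.39


for s in (1.0, 0.8, 0.5, 0.0):
    print(s, strength_from_score(s))
```

It could be applied to each parsed result, for example `result['strength'] = strength_from_score(result['alignment_score'])`, assuming the score is present and numeric.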


In [8]:
# Step 6: Detailed Alignment Results

# Show detailed results for all theses
if 'alignment_results' in locals() and alignment_results:
    print("="*80)
    print("DETAILED ALIGNMENT RESULTS")
    print("="*80)
    
    # Sort by alignment score
    sorted_results = sorted(alignment_results, key=lambda x: x.get('alignment_score', 0), reverse=True)
    
    print(f"\n{'Rank':<6} {'Score':<8} {'Strength':<12} {'Node#':<8} {'Title':<50}")
    print("-" * 80)
    
    for i, result in enumerate(sorted_results, 1):
        print(f"{i:<6} {result.get('alignment_score', 0):.4f}  {result.get('strength', 'N/A'):<12} "
              f"#{result.get('node_num', '?'):<7} {result.get('title', 'N/A')[:48]}")
    
    # Show top 5 with full details
    print(f"\n{'='*80}")
    print("TOP 5 ALIGNMENTS (DETAILED):")
    print(f"{'='*80}")
    for i, result in enumerate(sorted_results[:5], 1):
        print(f"\n{i}. {result.get('title', 'N/A')}")
        print(f"   Node #: {result.get('node_num', 'N/A')}")
        print(f"   Core Thesis: {result.get('core_thesis', 'N/A')}")
        print(f"   Alignment Score: {result.get('alignment_score', 0):.4f}")
        print(f"   Strength: {result.get('strength', 'N/A')}")
        print(f"   Supports: {result.get('supports', 'N/A')}")
        print(f"   Evidence: {result.get('evidence', 'N/A')[:300]}...")
        print(f"   Full Thesis: {result.get('thesis', 'N/A')[:200]}...")
else:
    print("⚠ No alignment results to display")
    print("   Run Cell 5 first to check alignment with all theses")


DETAILED ALIGNMENT RESULTS

Rank   Score    Strength     Node#    Title                                             
--------------------------------------------------------------------------------
1      1.0000  Strong       #3       Hybrid AI: Enterprise Strategy Essential
2      0.8000  Strong       #2       Verticalized Approaches Enhance Solutions
3      0.8000  Strong       #6       AI Business Models: Risks, Rewards
4      0.5000  Moderate     #4       Reshaping Compute Infrastructure Demand
5      0.0000  None         #1       Emerging Impact of Generative AI
6      0.0000  None         #5       Scaling Agentic AI Infrastructure
7      0.0000  None         #7       Trapped By Localized Challenges
8      0.0000  None         #8       Accelerating Drug Development Revolution
9      0.0000  None         #9       Accelerating Drug Discovery Technologies
10     0.0000  None         #10      Digital Health Interoperability Transformation
11     0.0000  None         #11      Tech-Enab

In [9]:
# Step 4: Check Alignment Using GPT-4 (Batch All Theses in One Call)

def check_all_theses_alignment_batch(
    pitch_deck_text: str,
    pitch_themes: List[Dict[str, str]],
    all_theses: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """
    Use GPT-4 to check alignment between pitch deck and all theses in one batch call.
    Much more cost-effective than individual calls.
    """
    if not openai_client:
        print("⚠ OpenAI client not configured")
        return []
    
    # Format pitch deck themes summary
    themes_summary = "\n".join([
        f"- {theme.get('theme', 'Unknown')}: {theme.get('description', '')}"
        for theme in pitch_themes
    ])
    
    # Format all theses
    theses_text = ""
    for i, thesis in enumerate(all_theses, 1):
        theses_text += f"""
{i}. TITLE: {thesis.get('title', 'N/A')}
   STATEMENT: {thesis.get('thesis', 'N/A')}
   DESCRIPTION: {thesis.get('description', 'N/A')}
   CORE THEME: {thesis.get('core_thesis', 'N/A')}
   NODE #: {thesis.get('node_num', 'N/A')}
"""
    
    prompt = f"""Analyze whether the following pitch deck supports or aligns with each of these investment theses.

PITCH DECK THEMES EXTRACTED:
{themes_summary}

FULL PITCH DECK CONTENT (first 50K characters):
{pitch_deck_text[:50000]}

INVESTMENT THESES TO EVALUATE:
{theses_text}

For each thesis, determine:
1. Does the pitch deck support this thesis? (Yes/No/Partially)
2. Alignment score (0-1): How well does it align? Be specific:
   - 0.8-1.0: Strong alignment with clear evidence
   - 0.6-0.79: Moderate alignment with some evidence
   - 0.4-0.59: Weak alignment or partial support
   - 0.0-0.39: No alignment or contradicts thesis
3. Evidence: Specific examples from the pitch deck that support or contradict the thesis
4. Strength: Strong/Moderate/Weak/None

Return a JSON array with one object per thesis, in the same order as listed:
[
    {{
        "node_num": 1,
        "title": "...",
        "supports": "Yes",
        "alignment_score": 0.75,
        "evidence": "The pitch deck mentions X which aligns with this thesis because...",
        "strength": "Moderate"
    }},
    ...
]

Be thorough and accurate. Return ONLY valid JSON, no other text."""

    try:
        print("Analyzing alignment with all theses using GPT-4 (this may take 30-60 seconds)...")
        response = openai_client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {"role": "system", "content": "You are an expert investment analyst who evaluates pitch decks against investment thesis frameworks. Always return valid JSON. Be thorough and accurate in your alignment assessments."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.3,
            max_tokens=4000  # Increased for multiple theses
        )
        
        result_text = response.choices[0].message.content.strip()
        
        # Extract JSON from response
        if "```json" in result_text:
            result_text = result_text.split("```json")[1].split("```")[0]
        elif "```" in result_text:
            result_text = result_text.split("```")[1].split("```")[0]
        
        json_match = re.search(r'\[.*\]', result_text, re.DOTALL)
        if json_match:
            result_text = json_match.group(0)
        
        alignments = json.loads(result_text)
        
        # Merge with original thesis data
        results = []
        for alignment in alignments:
            # Find matching thesis by node_num
            node_num = alignment.get('node_num')
            thesis = next((t for t in all_theses if t.get('node_num') == node_num), None)
            if thesis:
                results.append({
                    **thesis,
                    **alignment  # Add alignment data
                })
        
        print(f"✓ Analyzed {len(results)} theses")
        return results
        
    except Exception as e:
        print(f"⚠ Error analyzing alignment: {e}")
        import traceback
        traceback.print_exc()
        return []

# Check alignment for all theses
if pitch_deck_content and thesis_data and pitch_themes:
    alignment_results = check_all_theses_alignment_batch(
        pitch_deck_content,
        pitch_themes,
        thesis_data
    )
    
    # Show summary
    if alignment_results:
        print(f"\n{'='*60}")
        print("ALIGNMENT RESULTS SUMMARY")
        print(f"{'='*60}")
        
        # Count by strength
        strength_counts = {}
        for result in alignment_results:
            strength = result.get('strength', 'Unknown')
            strength_counts[strength] = strength_counts.get(strength, 0) + 1
        
        print(f"\nAlignment by Strength:")
        for strength, count in sorted(strength_counts.items(), key=lambda x: x[1], reverse=True):
            print(f"  {strength}: {count}")
        
        # Show top alignments
        sorted_results = sorted(alignment_results, key=lambda x: x.get('alignment_score', 0), reverse=True)
        print(f"\nTop 5 Alignments:")
        for i, result in enumerate(sorted_results[:5], 1):
            print(f"  {i}. {result.get('title', 'N/A')} - Score: {result.get('alignment_score', 0):.3f} ({result.get('strength', 'N/A')})")
else:
    print("⚠ Cannot check alignment: missing required data")
    print("   Make sure you've run:")
    print("   - Cell 2: Load pitch deck")
    print("   - Cell 3: Extract themes")
    print("   - Cell 4: Query Neo4j for theses")
    alignment_results = []


Analyzing alignment with all theses using GPT-4 (this may take 30-60 seconds)...
✓ Analyzed 19 theses

ALIGNMENT RESULTS SUMMARY

Alignment by Strength:
  None: 15
  Strong: 3
  Weak: 1

Top 5 Alignments:
  1. Hybrid AI: Enterprise Strategy Essential - Score: 1.000 (Strong)
  2. Verticalized Approaches Enhance Solutions - Score: 0.800 (Strong)
  3. AI Business Models: Risks, Rewards - Score: 0.800 (Strong)
  4. Reshaping Compute Infrastructure Demand - Score: 0.500 (Weak)
  5. Emerging Impact of Generative AI - Score: 0.000 (None)


In [10]:
# Step 5: Aggregate Results by Core Thesis

def aggregate_by_core_thesis(alignment_results: List[Dict[str, Any]]) -> Dict[str, Dict]:
    """Aggregate alignment scores by core thesis theme."""
    core_thesis_scores = {}
    
    for result in alignment_results:
        core_thesis = result.get('core_thesis', 'Unknown')
        score = result.get('alignment_score', 0)
        
        if core_thesis not in core_thesis_scores:
            core_thesis_scores[core_thesis] = {
                'total_score': 0.0,
                'match_count': 0,
                'strong_matches': 0,
                'moderate_matches': 0,
                'weak_matches': 0,
                'top_matches': []
            }
        
        core_thesis_scores[core_thesis]['total_score'] += score
        core_thesis_scores[core_thesis]['match_count'] += 1
        
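        # Count by strength (a null strength from the model is treated as an empty label)
        strength = (result.get('strength') or '').lower()
        if 'strong' in strength:
            core_thesis_scores[core_thesis]['strong_matches'] += 1
        elif 'moderate' in strength:
            core_thesis_scores[core_thesis]['moderate_matches'] += 1
        else:
            # Note: 'Weak', 'None', and any unrecognized label all land in this bucket,
            # so "Weak Alignments" in the summary also includes theses with no alignment
            core_thesis_scores[core_thesis]['weak_matches'] += 1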
        
        # Add to top matches
        core_thesis_scores[core_thesis]['top_matches'].append((result, score))
    
    # Calculate average scores and sort
    for core_thesis in core_thesis_scores:
        data = core_thesis_scores[core_thesis]
        data['average_score'] = data['total_score'] / data['match_count'] if data['match_count'] > 0 else 0
        # Sort top matches by score
        data['top_matches'].sort(key=lambda x: x[1], reverse=True)
        data['top_matches'] = data['top_matches'][:10]  # Keep top 10
    
    # Sort by average score
    sorted_core_theses = sorted(
        core_thesis_scores.items(),
        key=lambda x: x[1]['average_score'],
        reverse=True
    )
    
    return dict(sorted_core_theses)

# Aggregate results
if 'alignment_results' in locals() and alignment_results:
    core_thesis_aggregates = aggregate_by_core_thesis(alignment_results)
    
    print("="*60)
    print("CORE THESIS ALIGNMENT SUMMARY")
    print("="*60)
    
    for core_thesis, data in core_thesis_aggregates.items():
        print(f"\n{core_thesis}:")
        print(f"  Average Alignment Score: {data['average_score']:.3f}")
        print(f"  Total Theses: {data['match_count']}")
        print(f"  Strong Alignments: {data['strong_matches']}")
        print(f"  Moderate Alignments: {data['moderate_matches']}")
        print(f"  Weak Alignments: {data['weak_matches']}")
        print(f"  Top Matches:")
        for i, (thesis, score) in enumerate(data['top_matches'][:3], 1):
            title = thesis.get('title', thesis.get('thesis', 'Unknown')[:60])
            strength = thesis.get('strength', 'N/A')
            print(f"    {i}. {title} (score: {score:.3f}, {strength})")
else:
    print("⚠ No alignment results to aggregate")
    print("   Run Cell 5 first to check alignment with all theses")
    core_thesis_aggregates = {}


CORE THESIS ALIGNMENT SUMMARY

AI:
  Average Alignment Score: 0.443
  Total Theses: 7
  Strong Alignments: 3
  Moderate Alignments: 0
  Weak Alignments: 4
  Top Matches:
    1. Hybrid AI: Enterprise Strategy Essential (score: 1.000, Strong)
    2. Verticalized Approaches Enhance Solutions (score: 0.800, Strong)
    3. AI Business Models: Risks, Rewards (score: 0.800, Strong)

Biotech:
  Average Alignment Score: 0.000
  Total Theses: 5
  Strong Alignments: 0
  Moderate Alignments: 0
  Weak Alignments: 5
  Top Matches:
    1. Accelerating Drug Development Revolution (score: 0.000, None)
    2. Accelerating Drug Discovery Technologies (score: 0.000, None)
    3. Digital Health Interoperability Transformation (score: 0.000, None)

Construction Tech:
  Average Alignment Score: 0.000
  Total Theses: 7
  Strong Alignments: 0
  Moderate Alignments: 0
  Weak Alignments: 7
  Top Matches:
    1. Data-Driven Intelligence Demand Growth (score: 0.000, None)
    2. Technology's Role In Construction (
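
In the summary above, "Weak Alignments" also counts theses whose strength is None, which is why Biotech shows 5 weak alignments with an average score of 0.000. A sketch of a per-label tally that keeps None separate is shown below. `count_strengths_by_core` is a hypothetical helper, and the lowercase label names are a convention chosen for this example.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, List

KNOWN_LABELS = {"strong", "moderate", "weak", "none"}


def count_strengths_by_core(results: List[Dict[str, Any]]) -> Dict[str, Counter]:
    """Tally strength labels per core thesis, keeping 'none' separate from 'weak'."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for result in results:
        core = result.get("core_thesis") or "Unknown"
        # A missing or null strength is treated as no alignment.
        label = str(result.get("strength") or "None").strip().lower()
        if label not in KNOWN_LABELS:
            label = "unknown"
        counts[core][label] += 1
    return dict(counts)


sample = [
    {"core_thesis": "AI", "strength": "Strong"},
    {"core_thesis": "AI", "strength": "Weak"},
    {"core_thesis": "AI", "strength": "None"},
    {"core_thesis": "Biotech", "strength": None},
]
tally = count_strengths_by_core(sample)
print({core: dict(c) for core, c in tally.items()})
```

Because `Counter` returns 0 for absent keys, a report can read `tally[core]["weak"]` and `tally[core]["none"]` without checking first whether either label occurred.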

In [11]:
# Step 6: Detailed Alignment Results

# Show detailed results for all theses
if 'alignment_results' in locals() and alignment_results:
    print("="*80)
    print("DETAILED ALIGNMENT RESULTS")
    print("="*80)
    
    # Sort by alignment score
    sorted_results = sorted(alignment_results, key=lambda x: x.get('alignment_score', 0), reverse=True)
    
    print(f"\n{'Rank':<6} {'Score':<8} {'Strength':<12} {'Node#':<8} {'Title':<50}")
    print("-" * 80)
    
    for i, result in enumerate(sorted_results, 1):
        print(f"{i:<6} {result.get('alignment_score', 0):.4f}  {result.get('strength', 'N/A'):<12} "
              f"#{result.get('node_num', '?'):<7} {result.get('title', 'N/A')[:48]}")
    
    # Show top 5 with full details
    print(f"\n{'='*80}")
    print("TOP 5 ALIGNMENTS (DETAILED):")
    print(f"{'='*80}")
    for i, result in enumerate(sorted_results[:5], 1):
        print(f"\n{i}. {result.get('title', 'N/A')}")
        print(f"   Node #: {result.get('node_num', 'N/A')}")
        print(f"   Core Thesis: {result.get('core_thesis', 'N/A')}")
        print(f"   Alignment Score: {result.get('alignment_score', 0):.4f}")
        print(f"   Strength: {result.get('strength', 'N/A')}")
        print(f"   Supports: {result.get('supports', 'N/A')}")
        # `or` also covers a value that is present but None, which a .get() default does not
        print(f"   Evidence: {(result.get('evidence') or 'N/A')[:300]}...")
        print(f"   Full Thesis: {(result.get('thesis') or 'N/A')[:200]}...")
else:
    print("⚠ No alignment results to display")
    print("   Run Cell 5 first to check alignment with all theses")


DETAILED ALIGNMENT RESULTS

Rank   Score    Strength     Node#    Title                                             
--------------------------------------------------------------------------------
1      1.0000  Strong       #3       Hybrid AI: Enterprise Strategy Essential
2      0.8000  Strong       #2       Verticalized Approaches Enhance Solutions
3      0.8000  Strong       #6       AI Business Models: Risks, Rewards
4      0.5000  Weak         #4       Reshaping Compute Infrastructure Demand
5      0.0000  None         #1       Emerging Impact of Generative AI
6      0.0000  None         #5       Scaling Agentic AI Infrastructure
7      0.0000  None         #7       Trapped By Localized Challenges
8      0.0000  None         #8       Accelerating Drug Development Revolution
9      0.0000  None         #9       Accelerating Drug Discovery Technologies
10     0.0000  None         #10      Digital Health Interoperability Transformation
11     0.0000  None         #11      Tech-Enab
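
The detailed view slices `evidence` and `thesis` to a fixed length and always appends `...`, even when nothing was cut. A small helper can make that truncation explicit. This is only a sketch: `preview` is a hypothetical name, not a function defined elsewhere in this notebook.

```python
from typing import Optional


def preview(text: Optional[str], limit: int) -> str:
    """Return text cut to `limit` characters, adding '...' only when cut."""
    if not text:
        # Covers both a missing key (None) and an empty string.
        return "N/A"
    text = str(text)
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + "..."


# Hypothetical record shaped like one entry of alignment_results.
example_result = {"title": "Hybrid AI: Enterprise Strategy Essential", "evidence": None}
print(f"   Evidence: {preview(example_result.get('evidence'), 300)}")
print(f"   Title:    {preview(example_result.get('title'), 20)}")
```

With this in place, the two slicing expressions in the top-5 loop could be replaced by `preview(result.get('evidence'), 300)` and `preview(result.get('thesis'), 200)`.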

In [12]:
# Step 7: Generate Comprehensive Analysis with GPT-4

def generate_alignment_analysis(
    pitch_themes: List[Dict[str, str]],
    alignment_results: List[Dict[str, Any]],
    core_thesis_aggregates: Dict[str, Dict]
) -> str:
    """Generate comprehensive analysis of pitch deck alignment with thesis themes."""
    if not openai_client:
        return "⚠ OpenAI client not configured"
    
    # Build context summary
    context_parts = []
    
    context_parts.append("PITCH DECK THEMES:")
    for theme in pitch_themes:
        context_parts.append(f"- {theme.get('theme')}: {theme.get('description')}")
    
    context_parts.append("\n\nALIGNMENT RESULTS:")
    # Top alignments
    sorted_results = sorted(alignment_results, key=lambda x: x.get('alignment_score', 0), reverse=True)
    context_parts.append("\nTop 10 Alignments:")
    for result in sorted_results[:10]:
        title = result.get('title', 'N/A')
        score = result.get('alignment_score', 0)
        strength = result.get('strength', 'N/A')
        # `or` also covers evidence that is present but None
        evidence = (result.get('evidence') or 'N/A')[:150]
        context_parts.append(f"  - {title}: Score {score:.3f} ({strength}) - {evidence}...")
    
    context_parts.append("\n\nCORE THESIS ALIGNMENT SUMMARY:")
    for core_thesis, data in core_thesis_aggregates.items():
        context_parts.append(f"\n{core_thesis}:")
        context_parts.append(f"  - Average Score: {data['average_score']:.3f}")
        context_parts.append(f"  - Total Theses: {data['match_count']}")
        context_parts.append(f"  - Strong: {data['strong_matches']}, Moderate: {data['moderate_matches']}, Weak: {data['weak_matches']}")
    
    context = "\n".join(context_parts)
    
    prompt = f"""Analyze the alignment between this pitch deck and the thesis themes in our database.

Context:
{context}

Provide a comprehensive analysis covering:

1. **Overall Alignment**: How well does this pitch deck align with our thesis themes? (Overall score/assessment)

2. **Strengths**: Which areas show strong alignment? What core theses does this pitch deck best support?

3. **Gaps**: Are there important thesis themes that are NOT addressed in this pitch deck? What's missing?

4. **Recommendations**: 
   - How could the pitch deck be improved to better align with our theses?
   - Are there specific thesis themes they should emphasize more?
   - Any red flags or areas of concern?

5. **Strategic Fit**: Does this pitch deck represent a good strategic fit with our investment thesis framework?

Be specific, actionable, and reference the alignment scores where relevant.
"""

    try:
        print("Generating comprehensive analysis...")
        response = openai_client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {"role": "system", "content": "You are an expert investment analyst who evaluates pitch decks against investment thesis frameworks. Provide detailed, actionable analysis."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.3,
            max_tokens=2000
        )
        
        analysis = response.choices[0].message.content
        return analysis
    except Exception as e:
        return f"⚠ Error generating analysis: {e}"

# Generate analysis (all three inputs are produced by earlier cells)
required_vars = ('pitch_themes', 'alignment_results', 'core_thesis_aggregates')
if all(globals().get(name) for name in required_vars):
    analysis = generate_alignment_analysis(pitch_themes, alignment_results, core_thesis_aggregates)
    
    print("="*60)
    print("COMPREHENSIVE ALIGNMENT ANALYSIS")
    print("="*60)
    print("\n" + analysis)
else:
    print("⚠ Cannot generate analysis: missing pitch themes or alignment data")
    print("   Make sure you've run:")
    print("   - The cell that extracts themes from the pitch deck")
    print("   - Cell 5: Check alignment with all theses")
    print("   - Cell 6: Aggregate results by core thesis")


Generating comprehensive analysis...
COMPREHENSIVE ALIGNMENT ANALYSIS

### Overall Alignment:

The pitch deck from Cognaize demonstrates a strong alignment with our AI-focused thesis themes, achieving an average score of 0.443 in this category. This indicates a robust strategic fit, particularly in areas emphasizing hybrid AI, verticalized approaches, and AI business models. However, there's a notable lack of alignment with Biotech and Construction Tech themes, which is understandable given the company's focus on financial services and insurance sectors. The overall alignment suggests that Cognaize's pitch deck is well-tailored to our investment interests in AI but lacks relevance to our broader thematic interests in Biotech and Construction Tech.

### Strengths:

1. **Hybrid AI and Human-Centered AI**: The pitch deck's emphasis on hybrid intelligence for unstructured data processing aligns perfectly with our thesis on the necessity of enterprise strategy around hybrid AI, scoring a st
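
The 0.443 average quoted for the AI theme can be checked by hand against the detailed results table. The sketch below assumes that nodes #1 through #7 are the seven theses grouped under the AI core thesis (the grouping itself is not visible in the truncated output) and that the aggregate is a plain arithmetic mean. `ai_scores` is a hypothetical name.

```python
from statistics import mean

# Scores for nodes #1-#7 as printed in the detailed results table.
# Assumption: these seven nodes make up the AI core thesis.
ai_scores = {3: 1.0, 2: 0.8, 6: 0.8, 4: 0.5, 1: 0.0, 5: 0.0, 7: 0.0}

ai_average = mean(ai_scores.values())
print(f"AI average alignment: {ai_average:.3f}")  # prints 0.443
```

Under that assumption the figure matches. It also shows that the average is pulled down by three zero-score theses, so the "strong alignment" wording in the generated text appears to describe the top three matches rather than the theme as a whole.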

In [None]:
# Optional: Export Results to JSON

def export_results(
    pitch_themes: List[Dict[str, str]],
    alignment_results: List[Dict[str, Any]],
    core_thesis_aggregates: Dict[str, Dict],
    output_path: Path
):
    """Export alignment results to JSON file."""
    export_data = {
        'pitch_themes': pitch_themes,
        'alignment_results': alignment_results,
        'core_thesis_summary': {}
    }
    
    # Add core thesis aggregates
    for core_thesis, data in core_thesis_aggregates.items():
        export_data['core_thesis_summary'][core_thesis] = {
            'average_score': float(data['average_score']),
            'match_count': data['match_count'],
            'strong_matches': data['strong_matches'],
            'moderate_matches': data['moderate_matches'],
            'weak_matches': data['weak_matches'],
            'top_matches': [
                {
                    'title': result.get('title', ''),
                    'thesis': result.get('thesis', ''),
                    'node_num': result.get('node_num', 0),
                    'alignment_score': float(score),
                    'strength': result.get('strength', ''),
                    'evidence': result.get('evidence', '')
                }
                for result, score in data['top_matches']
            ]
        }
    
    # Write to file
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(export_data, f, indent=2, ensure_ascii=False)
    
    print(f"✓ Results exported to {output_path}")

# Uncomment to export results
# OUTPUT_DIR = Path("../data/analysis")
# OUTPUT_DIR.mkdir(exist_ok=True, parents=True)
# if 'alignment_results' in locals() and alignment_results and 'core_thesis_aggregates' in locals() and core_thesis_aggregates:
#     export_results(pitch_themes, alignment_results, core_thesis_aggregates, 
#                    OUTPUT_DIR / "pitch_deck_alignment_results.json")
