# Executive Search Insights
## Intranet Search Analytics -- Interview Reference

This notebook extracts key metrics from the intranet search analytics database and presents them as **executive-level findings**, organized for use with the **STAR interview framework**.

**Prerequisites:** The DuckDB database must exist at `../data/searchanalytics.db`. Run `python process_search_analytics.py` first.

**Audience:** Internal Communications Executive

**Sections:**
1. Executive Summary Dashboard
2. Search Volume & Adoption Trends
3. Search Quality Scorecard
4. Content Gap Analysis (The Big Win)
5. Search Term Performance
6. User Journey & Behavior Insights
7. Regional & Temporal Patterns
8. Interview Cheat Sheet

In [None]:
# ===== SETUP & DATABASE CONNECTION =====
import duckdb
import pandas as pd
import numpy as np
from pathlib import Path

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as mticker
import seaborn as sns

# Style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)
plt.rcParams['font.size'] = 11
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 12

# Color scheme
C = {
    'success': '#2E7D32', 'success_light': '#81C784',
    'fail': '#C62828', 'fail_light': '#EF9A9A',
    'warn': '#F57C00', 'warn_light': '#FFB74D',
    'neutral': '#1565C0', 'neutral_light': '#64B5F6',
    'gray': '#757575', 'gray_light': '#E0E0E0',
}
OUTCOME_COLORS = {
    'Success': C['success'], 'Engaged': C['neutral'],
    'Abandoned': C['warn'], 'No Results': C['fail'], 'Unknown': C['gray'],
}

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: f'{x:,.1f}')

# Connect
db_path = Path('../data/searchanalytics.db')
if not db_path.exists():
    raise FileNotFoundError(
        f"Database not found at {db_path}\n"
        "Run: python process_search_analytics.py")

con = duckdb.connect(str(db_path), read_only=True)

def query(sql):
    """Execute SQL and return DataFrame"""
    return con.execute(sql).df()

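def fmt_pct(val, decimals=1):
    """Format a percentage value (42.0 -> '42.0%'); return 'N/A' if missing."""
    if pd.isna(val):
        return 'N/A'
    return f"{val:.{decimals}f}%"

def fmt_num(val):
    """Format a number as an integer with thousands separators; return 'N/A' if missing."""
    if pd.isna(val):
        return 'N/A'
    return f"{int(val):,}"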

# Verify tables
tables = query("SHOW TABLES")['name'].tolist()
required = ['searches', 'searches_daily', 'searches_journeys', 'searches_terms']
missing = [t for t in required if t not in tables]
if missing:
    print(f"WARNING: Missing tables: {missing}")
    print("Run: python process_search_analytics.py --full-refresh")
else:
    for t in required:
        n = query(f"SELECT COUNT(*) as n FROM {t}")['n'][0]
        print(f"  {t}: {n:,} rows")

date_range = query("""
    SELECT MIN(session_date) as first_date, MAX(session_date) as last_date,
           COUNT(DISTINCT session_date) as days
    FROM searches
""")
print(f"\nDate range: {date_range['first_date'][0]} to {date_range['last_date'][0]} ({date_range['days'][0]} days)")
print("\nReady.")

---
# 1. Executive Summary Dashboard
**Business Question:** What is the overall health of our intranet search, at a glance?

In [None]:
# ===== EXECUTIVE SUMMARY KPIs =====
summary = query("""
    SELECT
        MIN(date) as date_from,
        MAX(date) as date_to,
        COUNT(*) as days_covered,
        SUM(search_starts) as total_searches,
        SUM(unique_users) as total_user_days,
        SUM(unique_sessions) as total_sessions,
        -- Sum of daily distinct-term counts; a term searched on several days is counted once per day
        SUM(unique_search_terms) as total_unique_terms,
        -- Session Success Rate % = sessions_with_clicks / sessions_with_results
        ROUND(100.0 * SUM(sessions_with_clicks) / NULLIF(SUM(sessions_with_results), 0), 1) as session_success_rate,
        -- Null Result Rate % = null_results / result_events
        ROUND(100.0 * SUM(null_results) / NULLIF(SUM(result_events), 0), 1) as null_result_rate,
        -- Abandonment Rate % = sessions_abandoned / sessions_with_results
        ROUND(100.0 * SUM(sessions_abandoned) / NULLIF(SUM(sessions_with_results), 0), 1) as abandonment_rate,
        -- Avg Searches per Session
        ROUND(SUM(search_starts) * 1.0 / NULLIF(SUM(unique_sessions), 0), 2) as avg_searches_per_session,
        -- Reformulation rate from journeys
        ROUND(100.0 * (SELECT COUNT(*) FROM searches_journeys WHERE had_reformulation = true)
            / NULLIF((SELECT COUNT(*) FROM searches_journeys), 0), 1) as reformulation_rate
    FROM searches_daily
""")

s = summary.iloc[0]
print("=" * 64)
print("  INTRANET SEARCH ANALYTICS -- EXECUTIVE SUMMARY")
print("=" * 64)
print(f"  Period:     {s['date_from']} to {s['date_to']} ({int(s['days_covered'])} days)")
print(f"  Searches:   {fmt_num(s['total_searches'])}")
print(f"  Users:      {fmt_num(s['total_user_days'])} user-days")
print(f"  Sessions:   {fmt_num(s['total_sessions'])}")
print("-" * 64)
print(f"  Session Success Rate:    {fmt_pct(s['session_success_rate'])}")
print(f"  Null Result Rate:        {fmt_pct(s['null_result_rate'])}")
print(f"  Abandonment Rate:        {fmt_pct(s['abandonment_rate'])}")
print(f"  Avg Searches/Session:    {s['avg_searches_per_session']:.2f}")
print(f"  Reformulation Rate:      {fmt_pct(s['reformulation_rate'])}")
print("=" * 64)

# Store for cheat sheet
EXEC = s.to_dict()

### Key Takeaway (Interview-Ready)
> "I built a search analytics solution tracking **[total_searches]** searches across **[user_days]** user-days over **[days]** days. The key finding: our session success rate was **[X]%**, meaning roughly **[Y in Z]** search sessions that returned results ended with a click on actual content. The null result rate of **[N]%** represented our biggest content gap opportunity."
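
The bracketed placeholders above can be filled from the KPI values that the summary cell stores. The following is a minimal sketch, assuming a plain dict with the same keys as `EXEC`; the helpers `rate_to_odds` and `fill_summary` are hypothetical names introduced here, and the sample numbers are made up for illustration.

```python
def rate_to_odds(rate_pct, out_of=10):
    """Turn a percentage into a rounded 'N in M' phrase (42.0 -> '4 in 10')."""
    n = round(rate_pct / 100 * out_of)
    return f"{n} in {out_of}"


def fill_summary(kpis):
    """Build a one-line summary from a dict shaped like EXEC (assumed keys)."""
    return (
        f"{int(kpis['total_searches']):,} searches across "
        f"{int(kpis['total_user_days']):,} user-days over "
        f"{int(kpis['days_covered'])} days; session success rate "
        f"{kpis['session_success_rate']:.1f}% (roughly "
        f"{rate_to_odds(kpis['session_success_rate'])} sessions with results); "
        f"null result rate {kpis['null_result_rate']:.1f}%"
    )


# Made-up numbers for illustration only, not results from the database
example = {
    'total_searches': 1000,
    'total_user_days': 400,
    'days_covered': 30,
    'session_success_rate': 42.0,
    'null_result_rate': 8.5,
}
print(fill_summary(example))
```

In the notebook itself, `fill_summary(EXEC)` would be the natural call once the summary cell has run.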

---
# 2. Search Volume & Adoption Trends
**Business Question:** Is intranet search usage growing? How are users adopting the platform?

In [None]:
# ===== SEARCH VOLUME OVER TIME =====
daily = query("""
    SELECT date, search_starts, unique_users, unique_sessions,
           day_of_week, day_of_week_num, new_users, returning_users
    FROM searches_daily
    ORDER BY date
""")
daily['date'] = pd.to_datetime(daily['date'])

fig, ax1 = plt.subplots(figsize=(14, 5))
ax1.bar(daily['date'], daily['search_starts'], color=C['neutral_light'],
        alpha=0.7, label='Daily Searches', width=0.8)
ax1.set_ylabel('Search Count', color=C['neutral'])
ax1.tick_params(axis='y', labelcolor=C['neutral'])
ax1.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
ax1.xaxis.set_major_locator(mdates.WeekdayLocator(interval=1))
plt.xticks(rotation=45, ha='right')

ax2 = ax1.twinx()
ax2.plot(daily['date'], daily['unique_users'], color=C['success'],
         linewidth=2, marker='o', markersize=3, label='Unique Users')
ax2.set_ylabel('Unique Users', color=C['success'])
ax2.tick_params(axis='y', labelcolor=C['success'])

fig.legend(loc='upper left', bbox_to_anchor=(0.12, 0.95))
plt.title('Daily Search Volume & Unique Users')
plt.tight_layout()
plt.show()

# Growth rate
if len(daily) >= 14:
    first_week = daily.head(7)['search_starts'].sum()
    last_week = daily.tail(7)['search_starts'].sum()
    growth = ((last_week - first_week) / first_week * 100) if first_week > 0 else 0
    print(f"Growth (first week vs last week): {growth:+.1f}%")
    print(f"  First 7 days: {first_week:,} searches")
    print(f"  Last 7 days:  {last_week:,} searches")

avg_daily = daily['search_starts'].mean()
avg_users = daily['unique_users'].mean()
print(f"\nAverage daily searches: {avg_daily:,.0f}")
print(f"Average daily unique users: {avg_users:,.0f}")

In [None]:
# ===== NEW vs RETURNING USERS =====
fig, ax = plt.subplots(figsize=(14, 5))
ax.fill_between(daily['date'], 0, daily['returning_users'],
                color=C['neutral_light'], alpha=0.7, label='Returning Users')
ax.fill_between(daily['date'], daily['returning_users'],
                daily['returning_users'] + daily['new_users'],
                color=C['success_light'], alpha=0.7, label='New Users')
ax.set_ylabel('Users')
ax.set_title('New vs Returning Users Over Time')
ax.legend(loc='upper left')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

total_new = daily['new_users'].sum()
total_returning = daily['returning_users'].sum()
total_all = total_new + total_returning
if total_all > 0:
    print(f"New users:       {total_new:,} ({100*total_new/total_all:.1f}%)")
    print(f"Returning users: {total_returning:,} ({100*total_returning/total_all:.1f}%)")

In [None]:
# ===== DAY OF WEEK PATTERNS =====
dow = daily.groupby(['day_of_week', 'day_of_week_num']).agg(
    avg_searches=('search_starts', 'mean'),
    total_searches=('search_starts', 'sum'),
    avg_users=('unique_users', 'mean'),
).reset_index().sort_values('day_of_week_num')

fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.bar(dow['day_of_week'], dow['avg_searches'], color=C['neutral'])
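# Grey out weekend bars. Assumes day_of_week_num uses ISO numbering
# (Mon=1 ... Sun=7); adjust the threshold if the pipeline numbers days differently.
for i, (_, row) in enumerate(dow.iterrows()):
    if row['day_of_week_num'] >= 6:
        bars[i].set_color(C['gray_light'])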
ax.set_ylabel('Average Daily Searches')
ax.set_title('Search Volume by Day of Week')
for bar in bars:
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height(),
            f'{bar.get_height():.0f}', ha='center', va='bottom', fontsize=10)
plt.tight_layout()
plt.show()

busiest = dow.loc[dow['avg_searches'].idxmax()]
quietest = dow.loc[dow['avg_searches'].idxmin()]
print(f"Busiest day:  {busiest['day_of_week']} ({busiest['avg_searches']:.0f} avg searches)")
print(f"Quietest day: {quietest['day_of_week']} ({quietest['avg_searches']:.0f} avg searches)")

### Key Takeaway (Interview-Ready)
> "Search adoption showed **[growth trend]** over the measurement period. We tracked **[X]** unique user-days, with **[Y]%** coming from returning users -- evidence of repeat engagement. **[Busiest day]** was the peak day on average with **[N]** daily searches, while weekends showed minimal activity, confirming this is a workplace productivity tool."
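
The growth trend and returning-user share quoted above come from simple arithmetic on the daily table. The sketch below reproduces that arithmetic on a toy DataFrame, assuming one row per day sorted by date and the column names used in `searches_daily`. The function names are hypothetical, and unlike the notebook cell this version returns `None` rather than 0 when the first week has no searches.

```python
import pandas as pd


def first_vs_last_week_growth(daily, col="search_starts"):
    """Percent change between the sum of the first 7 rows and the last 7 rows.

    Returns None with fewer than 14 rows or a zero first-week baseline.
    """
    if len(daily) < 14:
        return None
    first = daily[col].head(7).sum()
    last = daily[col].tail(7).sum()
    if first == 0:
        return None
    return float((last - first) / first * 100)


def returning_share(daily):
    """Share of user-days that belong to returning users, in percent."""
    returning = daily["returning_users"].sum()
    total = daily["new_users"].sum() + returning
    return float(100 * returning / total) if total else None


# Toy data for illustration only
toy = pd.DataFrame({
    "search_starts": [10] * 7 + [15] * 7,
    "new_users": [2] * 14,
    "returning_users": [6] * 14,
})
growth = first_vs_last_week_growth(toy)
share = returning_share(toy)
print(f"Growth: {growth:+.1f}%  Returning share: {share:.1f}%")
```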

---
# 3. Search Quality Scorecard
**Business Question:** Are users finding what they need? Is search quality improving over time?

In [None]:
# ===== QUALITY METRICS TRENDED =====
quality = query("""
    SELECT date, sessions_with_clicks, sessions_with_results, sessions_abandoned,
           null_results, result_events, search_starts, success_clicks, unique_sessions
    FROM searches_daily
    ORDER BY date
""")
quality['date'] = pd.to_datetime(quality['date'])

# Daily rates
quality['session_success_rate'] = 100.0 * quality['sessions_with_clicks'] / quality['sessions_with_results'].replace(0, np.nan)
quality['null_result_rate'] = 100.0 * quality['null_results'] / quality['result_events'].replace(0, np.nan)
quality['abandonment_rate'] = 100.0 * quality['sessions_abandoned'] / quality['sessions_with_results'].replace(0, np.nan)

# 7-day rolling averages
for col in ['session_success_rate', 'null_result_rate', 'abandonment_rate']:
    quality[f'{col}_7d'] = quality[col].rolling(7, min_periods=1).mean()

fig, ax = plt.subplots(figsize=(14, 6))
ax.plot(quality['date'], quality['session_success_rate_7d'],
        color=C['success'], linewidth=2.5, label='Session Success Rate %')
ax.plot(quality['date'], quality['null_result_rate_7d'],
        color=C['fail'], linewidth=2.5, label='Null Result Rate %')
ax.plot(quality['date'], quality['abandonment_rate_7d'],
        color=C['warn'], linewidth=2.5, label='Abandonment Rate %')

ax.set_ylabel('Rate (%)')
ax.set_title('Search Quality Metrics -- 7-Day Rolling Average')
ax.legend(loc='best')
ax.set_ylim(0, 100)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Overall assessment
overall_success = 100.0 * quality['sessions_with_clicks'].sum() / quality['sessions_with_results'].sum()
overall_null = 100.0 * quality['null_results'].sum() / quality['result_events'].sum()
overall_abandon = 100.0 * quality['sessions_abandoned'].sum() / quality['sessions_with_results'].sum()

assessment = pd.DataFrame({
    'Metric': ['Session Success Rate', 'Null Result Rate', 'Abandonment Rate'],
    'Value': [f'{overall_success:.1f}%', f'{overall_null:.1f}%', f'{overall_abandon:.1f}%'],
    'Target': ['>40%', '<5%', '<50%'],
    'Assessment': [
        'Good' if overall_success > 40 else ('Fair' if overall_success > 25 else 'Needs Improvement'),
        'Good' if overall_null < 5 else ('Fair' if overall_null < 15 else 'Needs Improvement'),
        'Good' if overall_abandon < 50 else ('Fair' if overall_abandon < 70 else 'Needs Improvement'),
    ]
})
print("\nOverall Quality Assessment:")
display(assessment.style.hide(axis='index'))

### Key Takeaway (Interview-Ready)
> "I established a search quality scorecard with three headline metrics. Session Success Rate was **[X]%** (target >40%), Null Result Rate was **[Y]%** (target <5%), and Abandonment Rate was **[Z]%** (target <50%). The 7-day rolling average showed **[improving/stable/declining]** quality over time, providing a clear signal for whether content and search improvements were working."
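
The **[improving/stable/declining]** label can be derived rather than judged by eye. One possible approach, sketched below, compares the mean of the first and last seven daily values of a rate. The function name `classify_trend` and the 1.0 percentage-point tolerance are assumptions made for this example, not thresholds defined by the notebook.

```python
import pandas as pd


def classify_trend(series, window=7, tolerance=1.0, higher_is_better=True):
    """Label a rate series (in %) as improving, stable, or declining.

    Compares the mean of the last `window` values with the mean of the
    first `window` values; differences within `tolerance` count as stable.
    """
    s = pd.Series(series, dtype="float64").dropna()
    if len(s) < 2 * window:
        return "insufficient data"
    delta = s.tail(window).mean() - s.head(window).mean()
    if abs(delta) <= tolerance:
        return "stable"
    improving = delta > 0 if higher_is_better else delta < 0
    return "improving" if improving else "declining"


# Toy series for illustration only
print(classify_trend([30] * 7 + [40] * 7))                         # success rate rising
print(classify_trend([10] * 7 + [5] * 7, higher_is_better=False))  # null rate falling
```

For null result rate and abandonment rate, pass `higher_is_better=False`, since a falling value is the good direction for those two metrics.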

---
# 4. Content Gap Analysis (The Big Win)
**Business Question:** What are employees searching for that we have no content for? Where is the biggest opportunity for Internal Communications?

This is the most actionable section. Each zero-result term represents a **content gap** -- employees actively looking for information that does not exist on the intranet.

In [None]:
# ===== TOP ZERO-RESULT SEARCH TERMS =====
zero_results = query("""
    SELECT
        search_term,
        SUM(search_count) as total_searches,
        SUM(null_result_count) as null_count,
        SUM(result_events) as result_events,
        ROUND(100.0 * SUM(null_result_count) / NULLIF(SUM(result_events), 0), 1) as null_rate,
        SUM(unique_users) as unique_users_searching
    FROM searches_terms
    GROUP BY search_term
    HAVING SUM(null_result_count) > 0
       AND ROUND(100.0 * SUM(null_result_count) / NULLIF(SUM(result_events), 0), 1) >= 50
    ORDER BY SUM(search_count) DESC
    LIMIT 20
""")

display_df = zero_results[['search_term', 'total_searches', 'null_count', 'null_rate', 'unique_users_searching']].rename(columns={
    'search_term': 'Search Term',
    'total_searches': 'Total Searches',
    'null_count': 'Zero-Result Count',
    'null_rate': 'Null Rate %',
    'unique_users_searching': 'Unique Users',
})

print("TOP 20 CONTENT GAPS: Search Terms with >= 50% Null Rate")
print("Each row = a term where at least half of result events came back empty.\n")
display(display_df.style.hide(axis='index')
    .format({'Total Searches': '{:,}', 'Zero-Result Count': '{:,}',
             'Null Rate %': '{:.1f}%', 'Unique Users': '{:,}'}))

In [None]:
# ===== FAILED SEARCH VOLUME SUMMARY =====
gap_summary = query("""
    SELECT
        SUM(null_results) as total_failed,
        SUM(result_events) as total_results,
        ROUND(100.0 * SUM(null_results) / NULLIF(SUM(result_events), 0), 1) as overall_null_rate
    FROM searches_daily
""")

pure_zero = query("""
    SELECT COUNT(*) as cnt FROM (
        SELECT search_term
        FROM searches_terms
        GROUP BY search_term
        HAVING SUM(null_result_count) = SUM(result_events) AND SUM(result_events) > 0
    )
""")

# Volume covered by top 10 gaps
top10_gap_volume = zero_results.head(10)['total_searches'].sum()
all_gap_volume = query("""
    SELECT SUM(search_count) as vol FROM (
        SELECT search_term, SUM(search_count) as search_count, SUM(null_result_count) as nulls, SUM(result_events) as results
        FROM searches_terms GROUP BY search_term
        HAVING SUM(null_result_count) > 0 AND ROUND(100.0 * SUM(null_result_count) / NULLIF(SUM(result_events), 0), 1) >= 50
    )
""")['vol'][0]
top10_pct = 100.0 * top10_gap_volume / all_gap_volume if all_gap_volume > 0 else 0

gs = gap_summary.iloc[0]
print(f"Total failed searches (zero results):      {fmt_num(gs['total_failed'])}")
print(f"Overall null result rate:                   {fmt_pct(gs['overall_null_rate'])}")
print(f"Unique terms that ALWAYS return zero:       {fmt_num(pure_zero['cnt'][0])}")
print(f"Top 10 gap terms cover:                     {top10_pct:.0f}% of all gap-related searches")

# Chart: horizontal bar of top content gaps
if len(zero_results) > 0:
    top_n = zero_results.head(15).sort_values('total_searches')
    fig, ax = plt.subplots(figsize=(12, max(5, len(top_n) * 0.45)))
    ax.barh(top_n['search_term'], top_n['total_searches'], color=C['fail'])
    ax.set_xlabel('Total Searches (terms with >= 50% null rate)')
    ax.set_title('Top Content Gaps: Most-Searched Terms with >= 50% Null Rate')
    for i, (_, row) in enumerate(top_n.iterrows()):
        ax.text(row['total_searches'] + 0.3, i, f"{int(row['total_searches']):,}",
                va='center', fontsize=9)
    plt.tight_layout()
    plt.show()

### Key Takeaway (Interview-Ready) -- THE BIG WIN

> **SITUATION:** Employees were searching the intranet but Internal Communications had no data on what content was missing.
>
> **TASK:** I built a search analytics pipeline to identify content gaps and measure search quality.
>
> **ACTION:** I analyzed all search terms that consistently returned zero results. I identified **[N]** unique terms that always fail and ranked them by search volume so the content team could prioritize.
>
> **RESULT:** The top content gap, "[top term]", was searched **[X]** times and returned zero results in **[Z]%** of cases. Addressing just the top 10 terms would cover **[Y]%** of all gap-related search volume. This gave Internal Comms their first-ever **data-driven content roadmap**.
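
The coverage figure in the RESULT line depends on how a "gap" is defined. The sketch below recomputes it in pandas on toy data, using the same rule as the SQL (at least one null result and a null rate of 50% or more). It assumes a frame with the `searches_terms` column names; `gap_coverage` is a hypothetical helper, and unlike the SQL it does not round the null rate before comparing it with the threshold.

```python
import pandas as pd


def gap_coverage(terms, top_n=10, min_null_rate=50.0):
    """Return (gap terms sorted by volume, % of gap volume covered by the top_n)."""
    agg = terms.groupby("search_term", as_index=False)[
        ["search_count", "null_result_count", "result_events"]
    ].sum()
    agg = agg[agg["result_events"] > 0].copy()
    agg["null_rate"] = 100.0 * agg["null_result_count"] / agg["result_events"]
    gaps = agg[(agg["null_result_count"] > 0) & (agg["null_rate"] >= min_null_rate)]
    gaps = gaps.sort_values("search_count", ascending=False)
    total = gaps["search_count"].sum()
    pct = float(100.0 * gaps.head(top_n)["search_count"].sum() / total) if total else 0.0
    return gaps, pct


# Toy data for illustration only: "a" appears on two days
terms = pd.DataFrame({
    "search_term": ["a", "a", "b", "c"],
    "search_count": [6, 4, 5, 20],
    "null_result_count": [6, 2, 5, 1],
    "result_events": [6, 4, 5, 20],
})
gaps, pct = gap_coverage(terms, top_n=1)
print(gaps[["search_term", "search_count", "null_rate"]])
print(f"Top 1 gap term covers {pct:.0f}% of gap-related searches")
```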

---
# 5. Search Term Performance
**Business Question:** Which search terms are working well? Which need attention?

In [None]:
# ===== TOP 20 TERMS BY VOLUME WITH SUCCESS RATES =====
top_terms = query("""
    SELECT
        search_term as "Search Term",
        SUM(search_count) as "Searches",
        SUM(success_click_count) as "Success Clicks",
        ROUND(100.0 * SUM(success_click_count) / NULLIF(SUM(search_count), 0), 1) as "Success CTR %",
        SUM(null_result_count) as "Null Results",
        ROUND(100.0 * SUM(null_result_count) / NULLIF(SUM(result_events), 0), 1) as "Null Rate %",
        SUM(unique_users) as "Unique Users"
    FROM searches_terms
    GROUP BY search_term
    HAVING SUM(search_count) >= 3
    ORDER BY SUM(search_count) DESC
    LIMIT 20
""")

print("Top 20 Search Terms by Volume")
display(top_terms.style.hide(axis='index')
    .format({'Searches': '{:,}', 'Success Clicks': '{:,}',
             'Success CTR %': '{:.1f}%', 'Null Results': '{:,}',
             'Null Rate %': '{:.1f}%', 'Unique Users': '{:,}'})
    .bar(subset=['Success CTR %'], color=C['success_light'], vmin=0, vmax=100)
    .bar(subset=['Null Rate %'], color=C['fail_light'], vmin=0, vmax=100))

In [None]:
# ===== PROBLEM TERMS vs SUCCESS STORIES =====

# Problem: results exist but nobody clicks (relevance issue)
problem_terms = query("""
    SELECT
        search_term as "Search Term",
        SUM(search_count) as "Searches",
        ROUND(100.0 * SUM(success_click_count) / NULLIF(SUM(search_count), 0), 1) as "Success CTR %",
        ROUND(100.0 * SUM(null_result_count) / NULLIF(SUM(result_events), 0), 1) as "Null Rate %"
    FROM searches_terms
    GROUP BY search_term
    HAVING SUM(search_count) >= 5
       AND ROUND(100.0 * SUM(success_click_count) / NULLIF(SUM(search_count), 0), 1) < 20
       AND ROUND(100.0 * SUM(null_result_count) / NULLIF(SUM(result_events), 0), 1) < 50
    ORDER BY SUM(search_count) DESC
    LIMIT 15
""")

print("RELEVANCE PROBLEMS: Results shown but users don't click (CTR < 20%, Null Rate < 50%)")
print("Action: Improve content titles, metadata, or search ranking\n")
if len(problem_terms) > 0:
    display(problem_terms.style.hide(axis='index')
        .format({'Searches': '{:,}', 'Success CTR %': '{:.1f}%', 'Null Rate %': '{:.1f}%'}))
else:
    print("No terms matching these criteria (good news!)")

# Success stories: high volume + high CTR
success_terms = query("""
    SELECT
        search_term as "Search Term",
        SUM(search_count) as "Searches",
        ROUND(100.0 * SUM(success_click_count) / NULLIF(SUM(search_count), 0), 1) as "Success CTR %",
        ROUND(100.0 * SUM(null_result_count) / NULLIF(SUM(result_events), 0), 1) as "Null Rate %"
    FROM searches_terms
    GROUP BY search_term
    HAVING SUM(search_count) >= 5
       AND ROUND(100.0 * SUM(success_click_count) / NULLIF(SUM(search_count), 0), 1) >= 30
    ORDER BY SUM(search_count) DESC
    LIMIT 15
""")

print("\nSUCCESS STORIES: High click-through terms (CTR >= 30%)")
print("These demonstrate what 'good' looks like -- benchmark for other content\n")
if len(success_terms) > 0:
    display(success_terms.style.hide(axis='index')
        .format({'Searches': '{:,}', 'Success CTR %': '{:.1f}%', 'Null Rate %': '{:.1f}%'}))
else:
    print("No terms matching these criteria.")

In [None]:
# ===== QUERY COMPLEXITY: Word count vs success =====
complexity = query("""
    SELECT
        CASE
            WHEN word_count = 1 THEN '1 word'
            WHEN word_count = 2 THEN '2 words'
            WHEN word_count = 3 THEN '3 words'
            WHEN word_count >= 4 THEN '4+ words'
        END as query_length,
        CASE
            WHEN word_count = 1 THEN 1
            WHEN word_count = 2 THEN 2
            WHEN word_count = 3 THEN 3
            WHEN word_count >= 4 THEN 4
        END as sort_order,
        SUM(search_count) as searches,
        ROUND(100.0 * SUM(success_click_count) / NULLIF(SUM(search_count), 0), 1) as success_ctr,
        ROUND(100.0 * SUM(null_result_count) / NULLIF(SUM(result_events), 0), 1) as null_rate,
        ROUND(100.0 * SUM(search_count) / NULLIF((SELECT SUM(search_count) FROM searches_terms), 0), 1) as pct_of_total
    FROM searches_terms
    GROUP BY 1, 2
    ORDER BY 2
""")

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Volume distribution
axes[0].bar(complexity['query_length'], complexity['pct_of_total'], color=C['neutral'])
axes[0].set_ylabel('% of Total Searches')
axes[0].set_title('Query Length Distribution')
for i, v in enumerate(complexity['pct_of_total']):
    axes[0].text(i, v + 0.3, f'{v:.0f}%', ha='center', fontsize=10)

# CTR by length
axes[1].bar(complexity['query_length'], complexity['success_ctr'], color=C['success'])
axes[1].set_ylabel('Success CTR %')
axes[1].set_title('Success Rate by Query Length')
for i, v in enumerate(complexity['success_ctr']):
    axes[1].text(i, v + 0.3, f'{v:.1f}%', ha='center', fontsize=10)

# Null rate by length
axes[2].bar(complexity['query_length'], complexity['null_rate'], color=C['fail'])
axes[2].set_ylabel('Null Result Rate %')
axes[2].set_title('Null Rate by Query Length')
for i, v in enumerate(complexity['null_rate']):
    axes[2].text(i, v + 0.3, f'{v:.1f}%', ha='center', fontsize=10)

plt.tight_layout()
plt.show()

best_length = complexity.loc[complexity['success_ctr'].idxmax()]
print(f"Best performing query length: {best_length['query_length']} (CTR: {best_length['success_ctr']:.1f}%)")
most_common = complexity.loc[complexity['searches'].idxmax()]
print(f"Most common query length: {most_common['query_length']} ({most_common['pct_of_total']:.0f}% of all searches)")

### Key Takeaway (Interview-Ready)
> "I categorized search terms into three groups: **Success Stories** (CTR >= 30%), **Relevance Problems** (results exist but CTR < 20%), and **Content Gaps** (null rate >= 50%). This framework gave Internal Comms a clear action plan for each category: create content for gaps, improve metadata for relevance problems, and replicate what works from success stories. **[Best length]** queries performed best at **[X]% CTR**, which **[supports/does not support]** the idea that more specific queries get better results."
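
The three groups can be expressed as a single classification rule. The sketch below is one way to do that, assuming the thresholds used in the cells above (CTR >= 30%, CTR < 20%, null rate >= 50%, at least 5 searches for the CTR groups). The notebook's queries are independent and can overlap, so checking for a content gap first is a choice made for this example; `categorize_term` and the "Low Volume" and "Uncategorized" labels are hypothetical.

```python
def categorize_term(searches, success_clicks, null_results, result_events,
                    min_searches=5):
    """Assign one label to a search term from its aggregated counts."""
    null_rate = 100.0 * null_results / result_events if result_events else None

    # Content gaps are checked first and have no minimum-volume requirement
    if null_results > 0 and null_rate is not None and null_rate >= 50:
        return "Content Gap"
    if searches < min_searches:
        return "Low Volume"

    ctr = 100.0 * success_clicks / searches
    if ctr >= 30:
        return "Success Story"
    if ctr < 20 and null_rate is not None and null_rate < 50:
        return "Relevance Problem"
    return "Uncategorized"


# Toy counts for illustration only
print(categorize_term(10, 0, 9, 10))   # mostly empty results
print(categorize_term(10, 4, 0, 10))   # 40% CTR
print(categorize_term(10, 1, 1, 10))   # results shown, few clicks
```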

---
# 6. User Journey & Behavior Insights
**Business Question:** How do search sessions play out? Where do users struggle?

In [None]:
# ===== JOURNEY OUTCOME DISTRIBUTION =====
outcomes = query("""
    SELECT
        journey_outcome,
        journey_outcome_sort,
        COUNT(*) as sessions,
        ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER(), 1) as pct
    FROM searches_journeys
    GROUP BY journey_outcome, journey_outcome_sort
    ORDER BY journey_outcome_sort
""")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Donut chart
colors_list = [OUTCOME_COLORS.get(o, C['gray']) for o in outcomes['journey_outcome']]
wedges, texts, autotexts = ax1.pie(
    outcomes['sessions'], labels=outcomes['journey_outcome'],
    colors=colors_list, autopct='%1.1f%%', startangle=90,
    pctdistance=0.8, wedgeprops=dict(width=0.4))
ax1.set_title('Session Journey Outcomes')

# Bar chart (rows reversed so the first outcome is drawn at the top)
rev = outcomes.iloc[::-1]
ax2.barh(rev['journey_outcome'], rev['sessions'],
         color=[OUTCOME_COLORS.get(o, C['gray']) for o in rev['journey_outcome']])
for i, (_, row) in enumerate(rev.iterrows()):
    ax2.text(row['sessions'] + outcomes['sessions'].max() * 0.01, i,
             f"{int(row['sessions']):,} ({row['pct']:.1f}%)",
             va='center', fontsize=10)
ax2.set_xlabel('Sessions')
ax2.set_title('Session Count by Outcome')

plt.tight_layout()
plt.show()

In [None]:
# ===== SESSION COMPLEXITY + BEHAVIORAL KPIs =====
complexity_dist = query("""
    SELECT
        session_complexity,
        session_complexity_sort,
        COUNT(*) as sessions,
        ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER(), 1) as pct
    FROM searches_journeys
    GROUP BY session_complexity, session_complexity_sort
    ORDER BY session_complexity_sort
""")

behavior = query("""
    SELECT
        ROUND(100.0 * SUM(CASE WHEN had_reformulation THEN 1 ELSE 0 END)
            / COUNT(*), 1) as reformulation_rate,
        ROUND(100.0 * SUM(CASE WHEN recovered_from_null THEN 1 ELSE 0 END)
            / NULLIF(SUM(CASE WHEN had_null_result THEN 1 ELSE 0 END), 0), 1) as null_recovery_rate,
        SUM(CASE WHEN had_null_result THEN 1 ELSE 0 END) as sessions_with_null,
        SUM(CASE WHEN recovered_from_null THEN 1 ELSE 0 END) as sessions_recovered,
        ROUND(AVG(CASE WHEN sec_result_to_click IS NOT NULL THEN sec_result_to_click END), 1) as avg_sec_to_click,
        ROUND(AVG(CASE WHEN sec_search_to_result IS NOT NULL THEN sec_search_to_result END), 2) as avg_sec_search_to_result
    FROM searches_journeys
""")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Complexity distribution
ccolors = [C['success_light'], C['neutral_light'], C['warn_light'], C['fail_light']][:len(complexity_dist)]
ax1.bar(complexity_dist['session_complexity'], complexity_dist['sessions'], color=ccolors)
ax1.set_ylabel('Sessions')
ax1.set_title('Session Complexity Distribution')
for i, (_, row) in enumerate(complexity_dist.iterrows()):
    ax1.text(i, row['sessions'], f"{row['pct']:.0f}%", ha='center', va='bottom', fontsize=10)

# KPI text card
b = behavior.iloc[0]
kpi_lines = [
    f"Reformulation Rate:      {fmt_pct(b['reformulation_rate'])}",
    f"  (users who rephrased their query)",
    "",
    f"Null Recovery Rate:      {fmt_pct(b['null_recovery_rate'])}",
    f"  ({fmt_num(b['sessions_recovered'])} of {fmt_num(b['sessions_with_null'])} null sessions)",
    "",
    f"Avg Search-to-Result:    {b['avg_sec_search_to_result']:.2f}s",
    f"Avg Result-to-Click:     {b['avg_sec_to_click']:.1f}s",
]
ax2.text(0.1, 0.5, '\n'.join(kpi_lines), transform=ax2.transAxes,
         fontsize=13, verticalalignment='center', fontfamily='monospace',
         bbox=dict(boxstyle='round', facecolor='#f5f5f5', alpha=0.8))
ax2.set_title('Behavioral KPIs')
ax2.axis('off')

plt.tight_layout()
plt.show()

### Key Takeaway (Interview-Ready)
> "Journey analysis revealed that **[X]%** of sessions ended in Success (user found content), while **[Y]%** were Abandoned (results shown but the user never clicked). The reformulation rate of **[Z]%** shows how often users needed to rephrase queries -- a signal for search relevance improvements. The null recovery rate of **[W]%** tells us how resilient the search experience is: **[W]%** of sessions that initially got zero results were able to recover by rephrasing. Search-to-result latency averaged **[N]** seconds, **[within/outside]** our acceptable limits."
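
The two behavioral rates use different denominators: reformulation is measured against all sessions, while null recovery is measured only against sessions that hit a zero-result search. A minimal pandas sketch of that distinction follows, assuming boolean columns named as in `searches_journeys`; `behavior_rates` is a hypothetical helper and the four-session frame is toy data.

```python
import pandas as pd


def behavior_rates(journeys):
    """Compute reformulation and null recovery rates (in %) from session flags."""
    n = len(journeys)
    with_null = int(journeys["had_null_result"].sum())
    recovered = int(journeys["recovered_from_null"].sum())
    reformulated = int(journeys["had_reformulation"].sum())
    return {
        # Denominator: all sessions
        "reformulation_rate": 100.0 * reformulated / n if n else None,
        # Denominator: only sessions that saw at least one null result
        "null_recovery_rate": 100.0 * recovered / with_null if with_null else None,
    }


# Toy data for illustration only
journeys = pd.DataFrame({
    "had_reformulation": [True, False, True, False],
    "had_null_result": [True, True, False, False],
    "recovered_from_null": [True, False, False, False],
})
rates = behavior_rates(journeys)
print(rates)
```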

---
# 7. Regional & Temporal Patterns
**Business Question:** When do employees search, and what does that tell us about our global workforce?

In [None]:
# ===== TIME OF DAY DISTRIBUTION =====
time_dist = query("""
    SELECT
        SUM(searches_night) as "APAC (03-09 CET)",
        SUM(searches_morning) as "CET (09-16 CET)",
        SUM(searches_afternoon) as "Americas (16-22 CET)",
        SUM(searches_evening) as "Dead Time (22-03 CET)"
    FROM searches_daily
""")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Pie chart
labels = time_dist.columns.tolist()
values = time_dist.iloc[0].values.astype(float)
region_colors = [C['warn'], C['success'], C['neutral'], C['gray']]
ax1.pie(values, labels=labels, colors=region_colors,
        autopct='%1.1f%%', startangle=90)
ax1.set_title('Search Volume by Time Zone / Region')

# Hour-of-day x day-of-week heatmap
hourly_dow = query("""
    SELECT
        event_weekday as weekday,
        event_weekday_num as dow_num,
        event_hour as hour,
        COUNT(*) as searches
    FROM searches
    WHERE name = 'SEARCH_TRIGGERED'
    GROUP BY 1, 2, 3
    ORDER BY 2, 3
""")

if len(hourly_dow) > 0:
    pivot = hourly_dow.pivot_table(index='weekday', columns='hour', values='searches',
                                    aggfunc='sum', fill_value=0)
    dow_order = hourly_dow.drop_duplicates('weekday').sort_values('dow_num')['weekday'].tolist()
    pivot = pivot.reindex(dow_order)
    sns.heatmap(pivot, cmap='YlOrRd', ax=ax2,
                cbar_kws={'label': 'Searches'})
    ax2.set_title('Search Heatmap: Day x Hour (CET)')
    ax2.set_xlabel('Hour (CET)')
    ax2.set_ylabel('')

plt.tight_layout()
plt.show()

# Peak hours
peak_hours = query("""
    SELECT event_hour as hour, COUNT(*) as searches
    FROM searches WHERE name = 'SEARCH_TRIGGERED'
    GROUP BY 1 ORDER BY 2 DESC LIMIT 3
""")
print("Top 3 peak hours (CET):")
for _, row in peak_hours.iterrows():
    print(f"  {int(row['hour']):02d}:00 -- {int(row['searches']):,} searches")

# Regional percentages
total_regional = values.sum()
if total_regional > 0:
    print(f"\nRegional breakdown:")
    for label, val in zip(labels, values):
        print(f"  {label}: {val:,.0f} ({100*val/total_regional:.1f}%)")

### Key Takeaway (Interview-Ready)
> "Time-of-day analysis confirmed our intranet serves a global workforce. **[X]%** of searches occur during CET business hours (09-16), **[Y]%** during Americas hours (16-22 CET), and **[Z]%** during APAC hours (03-09 CET). The peak search hour is **[HH]:00 CET**. This data informed content publishing schedules -- publishing before the peak ensures content is fresh when most users search. The day-of-week heatmap revealed the exact windows of highest activity."
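
The regional split rests on mapping each CET hour to one of four windows. The sketch below shows that mapping using the window labels from the chart. It assumes half-open windows (start hour included, end hour excluded), which the notebook does not state; `region_bucket` and `bucket_shares` are hypothetical helpers.

```python
from collections import Counter


def region_bucket(hour_cet):
    """Map an hour of day in CET (0-23) to the regional window used in the chart."""
    if not 0 <= hour_cet <= 23:
        raise ValueError(f"hour must be between 0 and 23, got {hour_cet}")
    if 3 <= hour_cet < 9:
        return "APAC (03-09 CET)"
    if 9 <= hour_cet < 16:
        return "CET (09-16 CET)"
    if 16 <= hour_cet < 22:
        return "Americas (16-22 CET)"
    return "Dead Time (22-03 CET)"


def bucket_shares(hours):
    """Percentage of searches per window, given one CET hour per search."""
    counts = Counter(region_bucket(h) for h in hours)
    total = sum(counts.values())
    return {k: 100.0 * v / total for k, v in counts.items()} if total else {}


# Toy hours for illustration only
shares = bucket_shares([10, 10, 17, 4])
print(shares)
```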

---
# 8. Interview Cheat Sheet
All key numbers compiled in one place for quick reference during STAR-based interview answers.

In [None]:
# ===== COMPILE ALL KEY NUMBERS =====
cheat = query("""
    WITH daily_totals AS (
        SELECT
            MIN(date) as date_from, MAX(date) as date_to,
            COUNT(*) as days,
            SUM(search_starts) as total_searches,
            SUM(unique_users) as total_user_days,
            SUM(unique_sessions) as total_sessions,
            ROUND(100.0 * SUM(sessions_with_clicks) / NULLIF(SUM(sessions_with_results), 0), 1) as session_success_rate,
            ROUND(100.0 * SUM(null_results) / NULLIF(SUM(result_events), 0), 1) as null_result_rate,
            ROUND(100.0 * SUM(sessions_abandoned) / NULLIF(SUM(sessions_with_results), 0), 1) as abandonment_rate,
            ROUND(SUM(search_starts) * 1.0 / NULLIF(SUM(unique_sessions), 0), 2) as avg_searches_per_session,
            SUM(null_results) as total_null_results,
            SUM(new_users) as total_new_users,
            SUM(returning_users) as total_returning_users
        FROM searches_daily
    ),
    journey_totals AS (
        SELECT
            COUNT(*) as total_journeys,
            ROUND(100.0 * SUM(CASE WHEN had_reformulation THEN 1 ELSE 0 END) / COUNT(*), 1) as reformulation_rate,
            ROUND(100.0 * SUM(CASE WHEN recovered_from_null THEN 1 ELSE 0 END)
                / NULLIF(SUM(CASE WHEN had_null_result THEN 1 ELSE 0 END), 0), 1) as null_recovery_rate,
            ROUND(AVG(sec_result_to_click), 1) as avg_sec_to_click,  -- AVG already skips NULLs
            ROUND(AVG(sec_search_to_result), 2) as avg_latency,
            ROUND(100.0 * SUM(CASE WHEN journey_outcome = 'Success' THEN 1 ELSE 0 END) / COUNT(*), 1) as pct_success,
            ROUND(100.0 * SUM(CASE WHEN journey_outcome = 'Abandoned' THEN 1 ELSE 0 END) / COUNT(*), 1) as pct_abandoned,
            ROUND(100.0 * SUM(CASE WHEN journey_outcome = 'No Results' THEN 1 ELSE 0 END) / COUNT(*), 1) as pct_no_results,
            ROUND(100.0 * SUM(CASE WHEN journey_outcome = 'Engaged' THEN 1 ELSE 0 END) / COUNT(*), 1) as pct_engaged
        FROM searches_journeys
    ),
    term_totals AS (
        SELECT
            COUNT(DISTINCT search_term) as unique_terms,
            (SELECT COUNT(*) FROM (
                SELECT search_term FROM searches_terms GROUP BY search_term
                HAVING SUM(null_result_count) = SUM(result_events) AND SUM(result_events) > 0
            )) as pure_zero_result_terms
        FROM searches_terms
    )
    SELECT * FROM daily_totals, journey_totals, term_totals
""")

c = cheat.iloc[0]

cheat_sheet = pd.DataFrame([
    ['SCOPE', ''],
    ['  Period', f"{pd.Timestamp(c['date_from']):%Y-%m-%d} to {pd.Timestamp(c['date_to']):%Y-%m-%d} ({int(c['days'])} days)"],
    ['  Total Searches', fmt_num(c['total_searches'])],
    ['  User-Days', fmt_num(c['total_user_days'])],
    ['  Total Sessions', fmt_num(c['total_sessions'])],
    ['  Unique Search Terms', fmt_num(c['unique_terms'])],
    ['', ''],
    ['QUALITY METRICS', ''],
    ['  Session Success Rate', fmt_pct(c['session_success_rate'])],
    ['  Null Result Rate', fmt_pct(c['null_result_rate'])],
    ['  Abandonment Rate', fmt_pct(c['abandonment_rate'])],
    ['  Avg Searches/Session', f"{c['avg_searches_per_session']:.2f}"],
    ['  Reformulation Rate', fmt_pct(c['reformulation_rate'])],
    ['  Null Recovery Rate', fmt_pct(c['null_recovery_rate'])],
    ['', ''],
    ['JOURNEY OUTCOMES', ''],
    ['  Success (clicked result)', fmt_pct(c['pct_success'])],
    ['  Engaged (navigated, no click)', fmt_pct(c['pct_engaged'])],
    ['  Abandoned (results, no action)', fmt_pct(c['pct_abandoned'])],
    ['  No Results', fmt_pct(c['pct_no_results'])],
    ['', ''],
    ['PERFORMANCE', ''],
    ['  Avg Search Latency', f"{c['avg_latency']:.2f}s"],
    ['  Avg Time to Click', f"{c['avg_sec_to_click']:.1f}s"],
    ['', ''],
    ['CONTENT GAPS', ''],
    ['  Total Failed Searches', fmt_num(c['total_null_results'])],
    ['  Terms Always Returning Zero', fmt_num(c['pure_zero_result_terms'])],
    ['', ''],
    ['USER ADOPTION', ''],
    ['  New Users', f"{fmt_num(c['total_new_users'])} ({100*c['total_new_users']/(c['total_new_users']+c['total_returning_users']):.0f}%)"],
    ['  Returning Users', f"{fmt_num(c['total_returning_users'])} ({100*c['total_returning_users']/(c['total_new_users']+c['total_returning_users']):.0f}%)"],
], columns=['Metric', 'Value'])

print("=" * 56)
print("  INTERVIEW REFERENCE: ALL KEY NUMBERS")
print("=" * 56)
for _, row in cheat_sheet.iterrows():
    if row['Metric'] == '':
        continue  # spacer row; section headers already print a leading blank line
    if row['Value'] == '':
        print(f"\n  {row['Metric']}")
    else:
        print(f"  {row['Metric']:<40s} {row['Value']}")
print("\n" + "=" * 56)

### STAR Talking Points

Fill in the numbers from above and rehearse these four stories:

---

#### Story 1: Building the Analytics Pipeline (Technical Leadership)
- **S:** Our intranet search had no analytics -- zero visibility into what employees were searching for or whether they found content.
- **T:** I was tasked with designing and building an end-to-end search analytics solution from raw telemetry to executive dashboards.
- **A:** I built a Python ETL pipeline (1,000+ lines) that ingests Azure App Insights telemetry, enriches it with session analysis and journey classification, and outputs analytics-ready data. I designed 30+ DAX measures for Power BI and created 5 dashboard pages. I documented the complete data model and visualization guide.
- **R:** We went from zero visibility to tracking **[total_searches]** searches across **[users]** users, with automated session classification, content gap detection, and quality scorecards. The solution now runs weekly and serves senior management.

---

#### Story 2: Identifying Content Gaps (Business Impact)
- **S:** Internal Communications had no data-driven way to decide what intranet content to create or improve.
- **T:** Use search analytics to identify the most impactful content gaps.
- **A:** I analyzed zero-result search terms and identified **[pure_zero_result_terms]** unique terms that never returned a result, then ranked them by search volume.
- **R:** This gave Internal Comms a prioritized content roadmap. Addressing just the top 10 gaps would cover **[X]%** of all gap-related searches. Each term on the list represents a direct, measurable improvement opportunity.
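
The **[X]%** coverage figure in this story is not part of the cheat sheet, so it has to be computed separately. A minimal sketch, assuming a per-term DataFrame with `search_term` and `null_result_count` columns (as in `searches_terms`) that has already been restricted to the gap terms; `top_n_coverage` is a hypothetical helper and the data below is synthetic:

```python
import pandas as pd

def top_n_coverage(terms: pd.DataFrame, n: int = 10) -> float:
    """Percent of all zero-result searches accounted for by the n highest-volume terms."""
    # Collapse repeated rows (e.g. one row per term per day) into one total per term
    per_term = terms.groupby('search_term')['null_result_count'].sum()
    total = per_term.sum()
    if total == 0:
        return 0.0
    return round(100 * float(per_term.nlargest(n).sum()) / float(total), 1)

# Synthetic gap terms (illustrative only, not real results)
gaps = pd.DataFrame({
    'search_term': ['payslip', 'vpn', 'canteen menu', 'parking', 'vpn'],
    'null_result_count': [50, 20, 15, 5, 10],
})
coverage = top_n_coverage(gaps, n=2)
print(f"Top 2 terms cover {coverage}% of zero-result searches")
```

With the real data, the input would come from a `query(...)` call against `searches_terms`, with `n=10` to match the talking point.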

---

#### Story 3: Measuring Search Quality (Data-Driven Decisions)
- **S:** Leadership needed to know if the intranet was actually working -- no baseline metrics existed.
- **T:** Define and implement quality KPIs that an executive audience would understand and act on.
- **A:** I designed a scorecard with Session Success Rate (**[session_success_rate]%**), Null Result Rate (**[null_result_rate]%**), and Abandonment Rate (**[abandonment_rate]%**). I set benchmark targets based on industry standards and trended these metrics weekly.
- **R:** For the first time, leadership could look at a single number (session success rate) and tell whether search was improving. The reformulation rate of **[reformulation_rate]%** became a key input for the search team's optimization roadmap.
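
A weekly trend of these rates is safer as a ratio of sums than as an average of daily rates, which matches how the cheat-sheet query aggregates. A minimal sketch, assuming a daily DataFrame with `date`, `sessions_with_clicks`, and `sessions_with_results` columns like `searches_daily`; `weekly_success_rate` is an illustrative name and the data is synthetic:

```python
import pandas as pd

def weekly_success_rate(daily: pd.DataFrame) -> pd.Series:
    """Session success rate (%) per week, computed as SUM(clicks) / SUM(results)."""
    df = daily.copy()
    df['date'] = pd.to_datetime(df['date'])
    weekly = (
        df.set_index('date')[['sessions_with_clicks', 'sessions_with_results']]
        .resample('W')  # calendar weeks ending on Sunday
        .sum()
    )
    # Weeks without any result sessions become NaN instead of dividing by zero
    denominator = weekly['sessions_with_results'].where(weekly['sessions_with_results'] > 0)
    return (100 * weekly['sessions_with_clicks'] / denominator).round(1)

# Synthetic daily rows (illustrative only)
daily = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-03', '2024-01-08', '2024-01-10'],
    'sessions_with_clicks': [10, 70, 45, 45],
    'sessions_with_results': [20, 80, 60, 40],
})
rates = weekly_success_rate(daily)
print(rates)
```

In the first synthetic week the daily rates are 50.0% and 87.5%, which would average to 68.8%, while the ratio of sums gives 80.0% because it weights the busier day more heavily.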

---

#### Story 4: Global Workforce Insights (Strategic Thinking)
- **S:** We assumed our intranet served primarily a European audience but had no data.
- **T:** Analyze temporal patterns to understand global usage.
- **A:** I implemented CET-based time zone analysis, classifying searches into APAC (03-09 CET), CET business hours (09-16), and Americas (16-22 CET) windows.
- **R:** The data revealed the actual regional split, informing content publishing schedules and multilingual content priorities. Publishing shortly before the peak hour puts fresh content in front of the most users.
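
The regional split can be reproduced from hourly search counts. A minimal sketch, assuming an hourly DataFrame with `hour` and `searches` columns (the shape of the peak-hours query without its `LIMIT`) and the windows from the Section 7 takeaway (APAC 03-09, CET 09-16, Americas 16-22 CET). The windows are treated as half-open here, so hour 16 counts as Americas; that boundary choice, and the names `WINDOWS` and `window_shares`, are assumptions rather than the notebook's own logic:

```python
import pandas as pd

# Assumed half-open CET hour windows [start, end)
WINDOWS = {
    'APAC': (3, 9),
    'CET': (9, 16),
    'Americas': (16, 22),
}

def window_shares(hourly: pd.DataFrame) -> dict:
    """Return each window's percentage of all searches, plus the peak hour."""
    total = float(hourly['searches'].sum())
    shares = {}
    for name, (start, end) in WINDOWS.items():
        in_window = hourly['hour'].between(start, end - 1)  # between() is inclusive
        window_total = float(hourly.loc[in_window, 'searches'].sum())
        shares[name] = round(100 * window_total / total, 1) if total else 0.0
    peak_hour = int(hourly.loc[hourly['searches'].idxmax(), 'hour']) if total else None
    return {'shares': shares, 'peak_hour': peak_hour}

# Synthetic hourly counts (illustrative only, not real results)
example = pd.DataFrame({'hour': [4, 10, 11, 16, 23], 'searches': [10, 40, 30, 15, 5]})
result = window_shares(example)
print(result)
```

Hours from 22 to 03 CET fall outside all three windows, so the shares do not have to add up to 100%.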

In [None]:
# ===== CLEANUP =====
con.close()
print("Database connection closed. Notebook complete.")