# Killer Statistics Analysis

Comprehensive analysis of killer patterns and behaviors across CSI episodes.

## Analysis Components
1. **Distribution Analysis**: How many killers per episode?
2. **Speaking Patterns**: Do killers speak differently?
3. **Temporal Analysis**: When do killers appear and reveal themselves?
4. **Archetype Discovery**: Common killer behavior patterns
5. **Statistical Testing**: Significance of behavioral differences

In [None]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from IPython.display import display, HTML, Markdown

# Add src to path
sys.path.append(str(Path('../src').resolve()))

# Import killer statistics analyzer
from analysis.killer_statistics import KillerStatisticsAnalyzer

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("Setup complete!")

## 1. Initialize Analyzer and Load Data

In [None]:
# Initialize the analyzer
analyzer = KillerStatisticsAnalyzer(data_dir=Path('../data/original'))

# Get overall statistics
overall_stats = analyzer.get_overall_statistics()

print("Dataset Overview:")
print("="*50)
print(f"Total episodes analyzed: {overall_stats['total_episodes']}")
print(f"Episodes with identified killers: {overall_stats['episodes_with_killers']}")
print(f"Total killers across all episodes: {overall_stats['total_killers']}")
print(f"Average killers per episode: {overall_stats['avg_killers_per_episode']:.2f}")
print("\nKiller count distribution:")
for count, episodes in sorted(overall_stats['killer_distribution'].items()):
    print(f"  {count} killer(s): {episodes} episodes ({episodes/overall_stats['total_episodes']*100:.1f}%)")

## 2. Killer Distribution Analysis

In [None]:
# Visualize killer distributions
fig = analyzer.plot_killer_distributions()

# Additional summary
print("\nKey Insights:")
print(f"• Average killer speaking ratio: {overall_stats['avg_killer_speaking_ratio']:.3f}")
print(f"• Average killer word ratio: {overall_stats['avg_killer_word_ratio']:.3f}")
print(f"• Average first appearance (normalized): {overall_stats['avg_killer_first_appearance']:.3f}")
print(f"• Average dialogue spread: {overall_stats['avg_killer_spread']:.3f}")
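
The speaking and word ratios above can be read as the killers' share of an episode's dialogue. A minimal sketch of one plausible definition follows. The `(speaker, text)` utterance format, the `dialogue_ratios` helper, and the whitespace word split are assumptions made for illustration; they are not taken from `KillerStatisticsAnalyzer`.

```python
def dialogue_ratios(utterances, killers):
    """Return (speaking_ratio, word_ratio) for the given killer names.

    utterances: list of (speaker, text) pairs, one per sentence (assumed format).
    killers: set of speaker names labelled as killers.
    """
    total_sentences = len(utterances)
    total_words = sum(len(text.split()) for _, text in utterances)
    if total_sentences == 0 or total_words == 0:
        return 0.0, 0.0

    killer_sentences = sum(1 for speaker, _ in utterances if speaker in killers)
    killer_words = sum(
        len(text.split()) for speaker, text in utterances if speaker in killers
    )
    return killer_sentences / total_sentences, killer_words / total_words


# Toy episode with made-up speakers and lines
episode = [
    ("grissom", "What do we have here"),
    ("suspect_a", "I was at home"),
    ("grissom", "The evidence disagrees"),
    ("suspect_a", "Fine I did it"),
]
speaking_ratio, word_ratio = dialogue_ratios(episode, {"suspect_a"})
print(f"speaking ratio: {speaking_ratio:.3f}, word ratio: {word_ratio:.3f}")
```

Counting sentences and words separately matters because a character with a few long monologues and one with many short replies can have the same word ratio but very different speaking ratios.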

## 3. Speaking Pattern Analysis

In [None]:
# Analyze speaking patterns
speaking_patterns = analyzer.analyze_speaking_patterns()

# Create comparison DataFrame
comparison_df = pd.DataFrame({
    'Metric': ['Mean Sentences', 'Median Sentences', 'Std Dev', 'Mean Words', 'Median Words'],
    'Killers': [
        speaking_patterns['killer_sentences']['mean'],
        speaking_patterns['killer_sentences']['median'],
        speaking_patterns['killer_sentences']['std'],
        speaking_patterns['killer_words']['mean'],
        speaking_patterns['killer_words']['median']
    ],
    'Non-Killers': [
        speaking_patterns['non_killer_sentences']['mean'],
        speaking_patterns['non_killer_sentences']['median'],
        speaking_patterns['non_killer_sentences']['std'],
        speaking_patterns['non_killer_words']['mean'],
        speaking_patterns['non_killer_words']['median']
    ]
})

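# Display with styling: bold any value above 50
def highlight_differences(val):
    if isinstance(val, (int, float)):
        return 'font-weight: bold' if val > 50 else ''
    return ''

styler = comparison_df.style.format({
    'Killers': '{:.1f}',
    'Non-Killers': '{:.1f}'
})
# Styler.map replaced Styler.applymap in pandas 2.1; fall back on older versions
apply_elementwise = getattr(styler, 'map', None) or styler.applymap
styled_df = apply_elementwise(highlight_differences, subset=['Killers', 'Non-Killers'])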

display(HTML("<h3>Speaking Pattern Comparison</h3>"))
display(styled_df)

# Statistical test results
tests = speaking_patterns['statistical_tests']
print("\nStatistical Significance Tests (Mann-Whitney U):")
print("="*50)
print("Sentence count difference:")
print(f"  U-statistic: {tests['sentence_mann_whitney_u']:.2f}")
print(f"  p-value: {tests['sentence_p_value']:.4f}")
print(f"  Significant: {'YES ✓' if tests['sentence_significant'] else 'NO ✗'}")
print("\nWord count difference:")
print(f"  U-statistic: {tests['word_mann_whitney_u']:.2f}")
print(f"  p-value: {tests['word_p_value']:.4f}")
print(f"  Significant: {'YES ✓' if tests['word_significant'] else 'NO ✗'}")
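
To make the Mann-Whitney U numbers above easier to interpret, here is a minimal standard-library sketch of the test. It uses the normal approximation with no tie or continuity correction, so it is a simplification rather than the analyzer's actual code; the `mann_whitney_u` helper and the sentence counts are made up for illustration. In practice `scipy.stats.mannwhitneyu` is the usual route.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Simplified sketch: no tie or continuity correction, so p-values are
    only rough for small or heavily tied samples.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both samples must be non-empty")
    # U for sample_a: each pair with a > b counts 1, each tie counts 0.5
    u_a = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in sample_a
        for b in sample_b
    )
    mean_u = n_a * n_b / 2
    std_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / std_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u_a, p_value


# Made-up sentence counts per character
killer_sentences = [12, 18, 25, 30, 41, 22, 35, 28]
non_killer_sentences = [3, 8, 5, 15, 7, 10, 4, 9]
u_stat, p_value = mann_whitney_u(killer_sentences, non_killer_sentences)
print(f"U = {u_stat:.1f}, p = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```

The test compares ranks rather than means, which suits sentence and word counts: they tend to be skewed by a few very talkative characters.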

## 4. Temporal Pattern Analysis

In [None]:
# Analyze temporal patterns
temporal_patterns = analyzer.analyze_temporal_patterns()

# Visualize temporal patterns
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Appearance timing distribution
appearance = temporal_patterns['appearance_analysis']
ax = axes[0]
timing_data = [
    appearance['early_appearance_rate'] * 100,
    appearance['middle_appearance_rate'] * 100,
    appearance['late_appearance_rate'] * 100
]
colors = ['green', 'yellow', 'red']
ax.pie(timing_data, labels=['Early (0-33%)', 'Middle (33-67%)', 'Late (67-100%)'],
       colors=colors, autopct='%1.1f%%', startangle=90)
ax.set_title('When Killers First Appear', fontweight='bold')

# Average positions
ax = axes[1]
positions = ['First\nAppearance', 'Last\nAppearance']
values = [appearance['avg_first_appearance'], appearance['avg_last_appearance']]
bars = ax.bar(positions, values, color=['lightgreen', 'lightcoral'])
ax.set_ylabel('Normalized Position (0-1)')
ax.set_title('Average Killer Appearance Positions', fontweight='bold')
ax.set_ylim(0, 1)
ax.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
for bar, val in zip(bars, values):
    ax.text(bar.get_x() + bar.get_width()/2, val + 0.02,
            f'{val:.2f}', ha='center', va='bottom')

# Dialogue spread
ax = axes[2]
spread_data = [stats.killer_spread for stats in analyzer.killer_stats.values() if stats.killer_count > 0]
ax.hist(spread_data, bins=15, color='purple', alpha=0.7, edgecolor='black')
ax.set_xlabel('Dialogue Spread (0=concentrated, 1=throughout)')
ax.set_ylabel('Number of Episodes')
ax.set_title('Killer Dialogue Distribution', fontweight='bold')
ax.axvline(x=appearance['avg_spread'], color='red', linestyle='--',
           label=f'Mean: {appearance["avg_spread"]:.2f}')
ax.legend()

plt.suptitle('Temporal Analysis of Killer Behavior', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

# Reveal analysis
if 'reveal_analysis' in temporal_patterns and temporal_patterns['reveal_analysis']['episodes_with_reveals'] > 0:
    reveal = temporal_patterns['reveal_analysis']
    print("\nKiller Reveal Analysis:")
    print(f"  Episodes with clear reveals: {reveal['episodes_with_reveals']}")
    print(f"  Average reveal position: {reveal['avg_reveal_position']:.2f}")
    print(f"  Post-reveal dialogue ratio: {reveal['avg_post_reveal_dialogue']:.2f}")
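
The normalized positions and spread plotted above can be sketched as follows. This assumes the killer's dialogue is available as 0-based sentence indices and that spread is the distance between first and last appearance; both are illustrative assumptions, and the `temporal_profile` helper is hypothetical. The early/middle/late cut-offs follow the thirds used in the pie chart labels.

```python
def temporal_profile(killer_indices, total_sentences):
    """Summarise where a killer's lines fall within an episode.

    killer_indices: 0-based positions of the killer's sentences (assumed input).
    total_sentences: number of sentences in the episode.
    Positions are normalized to [0, 1]; spread is last minus first.
    """
    if not killer_indices or total_sentences < 2:
        return None
    scale = total_sentences - 1
    first = min(killer_indices) / scale
    last = max(killer_indices) / scale
    # Thirds match the pie chart labels: early, middle, late
    if first < 1 / 3:
        timing = "early"
    elif first < 2 / 3:
        timing = "middle"
    else:
        timing = "late"
    return {"first": first, "last": last, "spread": last - first, "timing": timing}


# Made-up example: killer speaks at sentences 10, 50 and 90 of 101
profile = temporal_profile([10, 50, 90], total_sentences=101)
print(profile)
```

Under this definition a spread near 0 means the killer's lines are clustered in one part of the episode, and a spread near 1 means they run from the opening to the end.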

## 5. Killer Archetype Analysis

In [None]:
# Analyze killer archetypes
archetypes = analyzer.analyze_killer_archetypes()

if archetypes:
    # Create archetype DataFrame
    archetype_data = []
    for archetype_name, data in archetypes.items():
        archetype_data.append({
            'Archetype': archetype_name.replace('_', ' ').title(),
            'Count': data['count'],
            'Percentage': f"{data['percentage']:.1f}%",
            'Examples': ', '.join(data['example_episodes'])
        })
    
    archetype_df = pd.DataFrame(archetype_data)
    archetype_df = archetype_df.sort_values('Count', ascending=False)
    
    # Display archetype table
    display(HTML("<h3>Killer Archetypes</h3>"))
    display(archetype_df.style.set_properties(**{'text-align': 'left'}))
    
    # Visualize archetype distribution
    fig, ax = plt.subplots(figsize=(10, 6))
    
    archetype_names = archetype_df['Archetype'].values
    archetype_counts = archetype_df['Count'].values
    
    colors = plt.cm.Set3(np.linspace(0, 1, len(archetype_names)))
    bars = ax.bar(archetype_names, archetype_counts, color=colors, alpha=0.8, edgecolor='black')
    
    ax.set_xlabel('Killer Archetype', fontsize=12)
    ax.set_ylabel('Number of Episodes', fontsize=12)
    ax.set_title('Distribution of Killer Archetypes', fontsize=14, fontweight='bold')
    ax.tick_params(axis='x', rotation=45)
    
    # Add value labels on bars
    for bar, count in zip(bars, archetype_counts):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                str(count), ha='center', va='bottom', fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    print("\nArchetype Descriptions:")
    print("• Verbose: Killers who talk significantly more than average")
    print("• Silent: Killers who speak very little")
    print("• Early Bird: Killers who appear in the first third of the episode")
    print("• Late Arrival: Killers who don't appear until the final third")
    print("• Consistent: Killers whose dialogue is spread throughout the episode")
    print("• Burst: Killers who speak in concentrated bursts")
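
The archetype descriptions above amount to threshold rules over three per-episode metrics. A rough sketch of such rules follows. Every threshold and the `classify_archetypes` helper are hypothetical placeholders; the cut-offs that `analyze_killer_archetypes()` actually uses are not shown in this notebook.

```python
def classify_archetypes(speaking_ratio, first_appearance, spread):
    """Assign archetype labels from three per-episode killer metrics.

    All thresholds below are hypothetical placeholders chosen for illustration.
    """
    labels = []
    # Volume of dialogue
    if speaking_ratio > 0.15:
        labels.append("verbose")
    elif speaking_ratio < 0.03:
        labels.append("silent")
    # Timing of first appearance (thirds of the episode)
    if first_appearance < 1 / 3:
        labels.append("early_bird")
    elif first_appearance >= 2 / 3:
        labels.append("late_arrival")
    # How widely the dialogue is distributed
    if spread > 0.6:
        labels.append("consistent")
    elif spread < 0.2:
        labels.append("burst")
    return labels


print(classify_archetypes(speaking_ratio=0.20, first_appearance=0.05, spread=0.85))
print(classify_archetypes(speaking_ratio=0.01, first_appearance=0.80, spread=0.10))
```

Note that in this sketch an episode can carry several labels at once, or none, so archetype percentages need not sum to 100.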

## 6. Killer vs Suspect Comparison

In [None]:
# Compare killers with suspects
suspect_comparison = analyzer.compare_killers_vs_suspects()

if suspect_comparison.get('suspects_available') and suspect_comparison.get('comparison'):
    comp = suspect_comparison['comparison']
    
    # Create comparison visualization
    fig, ax = plt.subplots(figsize=(10, 6))
    
    categories = ['Killers', 'Suspects', 'Innocent']
    values = [
        comp['killer_avg_sentences'],
        comp['suspect_avg_sentences'],
        comp['innocent_avg_sentences']
    ]
    colors = ['crimson', 'orange', 'lightblue']
    
    bars = ax.bar(categories, values, color=colors, alpha=0.7, edgecolor='black')
    ax.set_ylabel('Average Sentences per Character', fontsize=12)
    ax.set_title('Speaking Frequency: Killers vs Suspects vs Innocent', fontsize=14, fontweight='bold')
    
    # Add value labels
    for bar, val in zip(bars, values):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                f'{val:.1f}', ha='center', va='bottom', fontweight='bold')
    
    # Add statistical test result
    p_value = comp['kruskal_wallis_p_value']
    sig_text = f"Kruskal-Wallis Test: p = {p_value:.4f}"
    if comp['significant_difference']:
        sig_text += " (Significant)"
    else:
        sig_text += " (Not Significant)"
    
    ax.text(0.5, 0.95, sig_text, transform=ax.transAxes,
            ha='center', va='top', fontsize=11, style='italic',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    
    plt.tight_layout()
    plt.show()
    
    print("\nStatistical Analysis:")
    print(f"  Kruskal-Wallis H-statistic: {comp['kruskal_wallis_statistic']:.2f}")
    print(f"  p-value: {p_value:.4f}")
    print(f"  Significant difference between groups: {'YES ✓' if comp['significant_difference'] else 'NO ✗'}")
else:
    print("Suspect data not available in this dataset.")
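
For reference, the Kruskal-Wallis H statistic reported above can be sketched with the standard library alone. This version handles exactly three groups and omits the tie correction, so it is an illustration rather than the analyzer's implementation; the helper names and sentence counts are made up. `scipy.stats.kruskal` is the usual library route.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def kruskal_wallis_three_groups(group_a, group_b, group_c):
    """H statistic and p-value for exactly three non-empty groups.

    No tie correction. With three groups, H is compared against a chi-square
    distribution with 2 degrees of freedom, whose survival function is
    exp(-H / 2).
    """
    groups = [group_a, group_b, group_c]
    pooled = [x for g in groups for x in g]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    rank_term = 0.0
    offset = 0
    for g in groups:
        rank_sum = sum(ranks[offset:offset + len(g)])
        rank_term += rank_sum ** 2 / len(g)
        offset += len(g)
    h_stat = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)
    return h_stat, math.exp(-h_stat / 2)


# Made-up sentence counts per character
killers = [25, 30, 41]
suspects = [12, 18, 22]
innocent = [3, 8, 5]
h_stat, p_value = kruskal_wallis_three_groups(killers, suspects, innocent)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```

A significant result only says that at least one group differs; it does not say which one, so pairwise follow-up tests would be needed to single out killers.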

## 7. Baseline Prediction Accuracies

In [None]:
# Get baseline prediction accuracies
baselines = analyzer.get_baseline_prediction_accuracy()

# Create baseline comparison
fig, ax = plt.subplots(figsize=(10, 6))

baseline_names = ['Random\n(50/50)', 'Majority\nClass', 'Most Common\nKiller Count']
baseline_values = [
    baselines['random_baseline'],
    baselines['majority_class_baseline'],
    baselines['most_common_count_baseline']
]

colors = ['gray', 'lightcoral', 'lightgreen']
bars = ax.bar(baseline_names, baseline_values, color=colors, alpha=0.7, edgecolor='black')

ax.set_ylabel('Accuracy', fontsize=12)
ax.set_title('Baseline Prediction Accuracies', fontsize=14, fontweight='bold')
ax.set_ylim(0, 1)
ax.axhline(y=0.5, color='red', linestyle='--', alpha=0.3, label='Random chance')

# Add value labels
for bar, val in zip(bars, baseline_values):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
            f'{val:.3f}', ha='center', va='bottom', fontweight='bold')

ax.legend()
plt.tight_layout()
plt.show()

print("\nBaseline Details:")
print(f"• Random baseline: Always 50% (coin flip)")
print(f"• Majority class: Predict '{baselines['majority_class_prediction']}' for all episodes")
print(f"• Most common count: Always predict {baselines['most_common_killer_count']} killer(s) per episode")
print(f"\nThese are the minimum accuracies that any ML model should beat.")
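
The majority-class and most-common-count baselines both reduce to the same idea: always predict the most frequent value and measure how often that is right. A small sketch under that assumption follows; the `majority_class_baseline` helper and the labels are made up for illustration.

```python
from collections import Counter


def majority_class_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    prediction, hits = Counter(labels).most_common(1)[0]
    return prediction, hits / len(labels)


# Made-up per-character labels: 1 = killer, 0 = not a killer
character_labels = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
prediction, accuracy = majority_class_baseline(character_labels)
print(f"Always predict {prediction}: accuracy {accuracy:.3f}")

# The same helper applied to made-up killer counts per episode
killer_counts = [1, 1, 2, 0, 1]
common_count, count_accuracy = majority_class_baseline(killer_counts)
print(f"Always predict {common_count} killer(s): accuracy {count_accuracy:.3f}")
```

With imbalanced labels like these, always predicting the majority class already scores well, which is why a model's accuracy should be read against this number rather than against 50%.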

## 8. Episode-Level Analysis

In [None]:
# Analyze specific episodes in detail
episode_stats = []
for stats in analyzer.killer_stats.values():
    if stats.killer_count > 0:
        episode_stats.append({
            'Episode': stats.episode_id,
            'Killers': stats.killer_count,
            'Speaking Ratio': stats.killer_speaking_ratio,
            'First Appear': stats.killer_first_appearance,
            'Spread': stats.killer_spread
        })

episode_df = pd.DataFrame(episode_stats)

# Find interesting episodes
print("Interesting Episodes:")
print("="*50)

# Most talkative killer
most_talkative = episode_df.nlargest(3, 'Speaking Ratio')
print("\nMost Talkative Killers:")
for _, row in most_talkative.iterrows():
    print(f"  {row['Episode']}: {row['Speaking Ratio']:.3f} of all dialogue")

# Most silent killer
most_silent = episode_df.nsmallest(3, 'Speaking Ratio')
print("\nMost Silent Killers:")
for _, row in most_silent.iterrows():
    print(f"  {row['Episode']}: {row['Speaking Ratio']:.3f} of all dialogue")

# Latest appearing killer
latest_appearance = episode_df.nlargest(3, 'First Appear')
print("\nLatest First Appearances:")
for _, row in latest_appearance.iterrows():
    print(f"  {row['Episode']}: First appears at {row['First Appear']:.2f} position")

# Most concentrated dialogue
most_concentrated = episode_df.nsmallest(3, 'Spread')
print("\nMost Concentrated Dialogue (burst speakers):")
for _, row in most_concentrated.iterrows():
    print(f"  {row['Episode']}: Spread of {row['Spread']:.3f}")

## 9. Summary Report

In [None]:
# Generate and save comprehensive report
output_dir = Path('../experiments/killer_analysis')
output_dir.mkdir(parents=True, exist_ok=True)

report = analyzer.generate_report(save_path=output_dir / 'killer_statistics.json')

# Create summary markdown
summary_md = f"""
# Killer Statistics Summary

## Dataset Overview
- **Total Episodes**: {overall_stats['total_episodes']}
- **Episodes with Killers**: {overall_stats['episodes_with_killers']}
- **Average Killers per Episode**: {overall_stats['avg_killers_per_episode']:.2f}

## Key Findings

### Speaking Patterns
- Killers speak **{speaking_patterns['killer_sentences']['mean']:.1f}** sentences on average
- Non-killers speak **{speaking_patterns['non_killer_sentences']['mean']:.1f}** sentences on average
- Statistical significance: **p = {speaking_patterns['statistical_tests']['sentence_p_value']:.4f}**

### Temporal Patterns
- Average first appearance: **{temporal_patterns['appearance_analysis']['avg_first_appearance']:.2f}** (normalized position)
- Early appearance rate: **{temporal_patterns['appearance_analysis']['early_appearance_rate']*100:.1f}%**
- Average dialogue spread: **{temporal_patterns['appearance_analysis']['avg_spread']:.2f}**

### Baseline Accuracies
- Random: **{baselines['random_baseline']:.3f}**
- Majority Class: **{baselines['majority_class_baseline']:.3f}**
- Most Common Count: **{baselines['most_common_count_baseline']:.3f}**

## Implications for Neural Models
1. A useful model must beat the majority-class baseline of {baselines['majority_class_baseline']:.3f}
2. Speaking patterns show {'significant' if speaking_patterns['statistical_tests']['sentence_significant'] else 'no significant'} differences
3. Temporal patterns suggest killers {'appear early' if temporal_patterns['appearance_analysis']['avg_first_appearance'] < 0.4 else 'appear throughout'}
4. Multiple archetypes exist, requiring flexible model representations
"""

display(Markdown(summary_md))

# Save summary
with open(output_dir / 'killer_summary.md', 'w', encoding='utf-8') as f:
    f.write(summary_md)

print(f"\nResults saved to {output_dir}")
print("Files created:")
for file in sorted(output_dir.glob('*')):
    print(f"  • {file.name}")