# Business Insights Analysis - YouTube Trending Videos

This notebook uses **pandas** to analyze the engineered features of YouTube trending videos and draw business insights from them.

## Key Business Questions:
1. **High Engagement Videos**: What content shows exceptional engagement?
2. **Quick Trending**: Which categories trend fastest?
3. **Ranking Patterns**: Is there a correlation between engagement and trending rank?
4. **Temporal Patterns**: How quickly do videos trend after publication?

## Analysis Approach:
- **Pandas-based analysis** for efficiency and speed
- **Statistical analysis** with correlations and distributions
- **Visual insights** with data summaries
- **Actionable business recommendations**
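
As a hedged illustration of the correlation step, the sketch below computes a rank-based (Spearman) correlation between `engagement_score` and `trending_rank` on a toy DataFrame. The values are invented for illustration, and the sketch assumes that rank 1 is the top trending position; it is not the method used by `YouTubeBusinessInsights`.

```python
import pandas as pd

# Toy data standing in for the engineered features; values are illustrative only.
toy = pd.DataFrame({
    "engagement_score": [0.12, 0.08, 0.05, 0.03, 0.01],
    "trending_rank": [1, 2, 3, 4, 5],  # assumption: 1 is the top position
})

# Spearman correlation is the Pearson correlation of the ranked values.
# Ranking first keeps the sketch free of any SciPy dependency.
rho = toy["engagement_score"].rank().corr(toy["trending_rank"].rank())

print(f"Spearman rho: {rho:.2f}")
```

A negative value here would mean that higher engagement goes with a better (lower-numbered) rank. A rank-based measure suits this question because rank is ordinal and engagement scores tend to be heavily skewed.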

In [None]:
# Setup notebook environment
from notebook_setup import setup_notebook_environment, test_imports

# Setup paths and test imports
project_root = setup_notebook_environment()
test_imports()

In [None]:
# Import required modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Import our business insights module
from config.settings import Config
from src.analytics.business_insights import YouTubeBusinessInsights

# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

# Configure matplotlib
plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 8)
sns.set_palette("husl")

print("📊 Business Insights Analysis Setup Complete!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## Load Engineered Features Data

Load the feature-engineered dataset that was created by our PySpark pipeline.
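
Before running the analyses, it can help to confirm that the loaded frame has the columns the later cells read. The sketch below is one possible check; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the column list is inferred from the cells in this notebook rather than taken from the pipeline's schema.

```python
import pandas as pd

# Columns referenced by later cells in this notebook (inferred, not authoritative).
REQUIRED_COLUMNS = [
    "title", "channel_title", "category_name", "trending_date",
    "views", "likes", "comment_count",
    "engagement_score", "days_to_trend", "trending_rank",
]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from frame."""
    return [col for col in required if col not in frame.columns]

# Toy frame with only two of the expected columns.
toy = pd.DataFrame({"title": ["example"], "views": [10]})
missing = missing_columns(toy)

if missing:
    print(f"Missing columns: {missing}")
```

In the notebook itself, the same function could be called on `df` right after loading, so that a schema mismatch fails early instead of surfacing as a `KeyError` partway through an analysis.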

In [None]:
# Initialize configuration and business insights analyzer
config = Config()
features_path = config.OUTPUT_DATA_PATH / "youtube_trending_videos_with_features.parquet"

print(f"Loading engineered features from: {features_path}")

# Check if file exists
if not features_path.exists():
    print("❌ Engineered features not found!")
    print("Please run the feature engineering script first:")
    print("  python scripts/feature_engineering_demo.py")
    raise FileNotFoundError(f"Engineered features data not found: {features_path}")

# Initialize business insights analyzer
insights_analyzer = YouTubeBusinessInsights(str(features_path))
df = insights_analyzer.df

print(f"✅ Successfully loaded {len(df):,} records")
print(f"📊 Dataset shape: {df.shape}")
print(f"📅 Date range: {df['trending_date'].min()} to {df['trending_date'].max()}")
print(f"🏷️ Categories: {df['category_name'].nunique()}")
print(f"📺 Channels: {df['channel_title'].nunique()}")

In [None]:
# Display basic dataset information
print("Dataset Overview:")
print("=" * 50)
print(f"Total Videos: {len(df):,}")
print(f"Columns: {len(df.columns)}")
print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

print("\nColumn Information:")
print("-" * 30)
for col in df.columns:
    non_null = df[col].count()
    null_pct = (len(df) - non_null) / len(df) * 100
    print(f"{col:25} | {non_null:>8,} non-null | {null_pct:>5.1f}% null")

print("\nSample Data:")
print("-" * 20)
display(df[['title', 'channel_title', 'category_name', 'views', 'engagement_score', 'days_to_trend', 'trending_rank']].head())

## 📊 Analysis 1: High Engagement Content

**Business Question**: What content shows exceptional engagement?

**Key Metrics**:
- Engagement score distribution
- Top performing videos and channels
- Category analysis
- Content patterns
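
To make these metrics concrete, here is a minimal sketch on toy data. It assumes an engagement score defined as `(likes + comment_count) / views`; the actual `engagement_score` column comes from the feature engineering pipeline and may be defined differently. The top-N selection and category count only approximate what `analyze_high_engagement_content` might return.

```python
import pandas as pd

# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "category_name": ["Music", "Music", "Gaming", "News"],
    "views": [1000, 2000, 500, 4000],
    "likes": [100, 100, 20, 40],
    "comment_count": [20, 20, 5, 0],
})

# Assumed definition: interactions per view. Real data would also need
# a guard for rows where views is zero.
toy["engagement_score"] = (toy["likes"] + toy["comment_count"]) / toy["views"]

# Top-N videos by engagement, then how often each category appears among them.
top_videos = toy.nlargest(2, "engagement_score")
category_counts = top_videos["category_name"].value_counts()

print(top_videos[["title", "category_name", "engagement_score"]])
print(category_counts)
```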

In [None]:
print("🎯 ANALYSIS 1: HIGH ENGAGEMENT CONTENT")
print("=" * 50)

# Run high engagement analysis
high_engagement = insights_analyzer.analyze_high_engagement_content(top_n=20)

print("\n📈 Engagement Score Statistics:")
print(f"Mean: {df['engagement_score'].mean():.6f}")
print(f"Median: {df['engagement_score'].median():.6f}")
print(f"Std Dev: {df['engagement_score'].std():.6f}")
print(f"Max: {df['engagement_score'].max():.6f}")
print(f"Min: {df['engagement_score'].min():.6f}")

print("\n🏆 Top 10 High Engagement Videos:")
print("-" * 80)
top_videos = high_engagement['top_videos'].head(10)
for _, row in top_videos.iterrows():
    title = row['title'][:60] + "..." if len(row['title']) > 60 else row['title']
    print(f"📺 {title}")
    print(f"   Channel: {row['channel_title']} | Category: {row['category_name']}")
    print(f"   Engagement: {row['engagement_score']:.4f} | Views: {row['views']:,}")
    print(f"   Likes: {row['likes']:,} | Comments: {row['comment_count']:,}")
    print()

In [None]:
# High engagement summary
summary = high_engagement['summary']
print("📊 High Engagement Summary:")
print(f"Average Engagement Score: {summary['avg_engagement']:.4f}")
print(f"Average Views: {summary['avg_views']:,.0f}")
print(f"Most Common Category: {summary['most_common_category']}")
print(f"Most Frequent Channel: {summary['most_frequent_channel']}")

print("\n🏷️ High Engagement by Category:")
print("-" * 50)
category_analysis = high_engagement['category_analysis']
display(category_analysis)

print("\n📺 High Engagement by Channel:")
print("-" * 50)
channel_analysis = high_engagement['channel_analysis'].head(10)
display(channel_analysis)

In [None]:
# Visualize engagement score distribution
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Engagement score histogram
axes[0,0].hist(df['engagement_score'], bins=50, alpha=0.7, color='skyblue', edgecolor='black')
axes[0,0].set_title('Engagement Score Distribution')
axes[0,0].set_xlabel('Engagement Score')
axes[0,0].set_ylabel('Frequency')
axes[0,0].axvline(df['engagement_score'].mean(), color='red', linestyle='--', label=f'Mean: {df["engagement_score"].mean():.4f}')
axes[0,0].legend()

# Top categories by engagement
top_categories = df.groupby('category_name')['engagement_score'].mean().sort_values(ascending=False).head(10)
axes[0,1].barh(range(len(top_categories)), top_categories.values, color='lightcoral')
axes[0,1].set_yticks(range(len(top_categories)))
axes[0,1].set_yticklabels(top_categories.index)
axes[0,1].set_title('Average Engagement Score by Category (Top 10)')
axes[0,1].set_xlabel('Average Engagement Score')

# Engagement vs Views scatter
sample_df = df.sample(n=min(5000, len(df)), random_state=42)  # Sample for performance; fixed seed keeps the plot reproducible
axes[1,0].scatter(sample_df['views'], sample_df['engagement_score'], alpha=0.5, color='green')
axes[1,0].set_title('Engagement Score vs Views (Sample)')
axes[1,0].set_xlabel('Views')
axes[1,0].set_ylabel('Engagement Score')
axes[1,0].set_xscale('log')

# Top channels by engagement
top_channels = df.groupby('channel_title')['engagement_score'].mean().sort_values(ascending=False).head(10)
axes[1,1].barh(range(len(top_channels)), top_channels.values, color='orange')
axes[1,1].set_yticks(range(len(top_channels)))
axes[1,1].set_yticklabels([ch[:20] + '...' if len(ch) > 20 else ch for ch in top_channels.index])
axes[1,1].set_title('Average Engagement Score by Channel (Top 10)')
axes[1,1].set_xlabel('Average Engagement Score')

plt.tight_layout()
plt.show()

# These insights are hard-coded observations; they are not recomputed from the data in this cell.
print("\n💡 Key Insights:")
print("• Music content dominates high engagement videos")
print("• K-pop content (BTS, j-hope) shows exceptional engagement rates")
print("• Top engagement scores range from 12% to 16%")
print("• High engagement doesn't always correlate with high view counts")

## ⚡ Analysis 2: Trending Speed by Category

**Business Question**: Which categories trend fastest?

**Key Metrics**:
- Average days to trend by category
- Quick vs slow trending categories
- Distribution analysis
- Category performance insights
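
The per-category metrics above can be sketched with a pandas `groupby` and a multi-function aggregation. The flattened column names (`days_to_trend_mean` and so on) mirror those displayed in the next cell, but whether `analyze_trending_speed_by_category` builds them this way is an assumption; the data below is a toy example.

```python
import pandas as pd

# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "category_name": ["Music", "Music", "Gaming", "Gaming", "News"],
    "days_to_trend": [1, 3, 2, 6, 5],
    "engagement_score": [0.10, 0.06, 0.04, 0.02, 0.01],
})

# Aggregate several statistics per category; this yields MultiIndex columns
# such as ("days_to_trend", "mean").
agg = toy.groupby("category_name").agg({
    "days_to_trend": ["count", "mean", "median"],
    "engagement_score": ["mean"],
})

# Flatten ("days_to_trend", "mean") into "days_to_trend_mean".
agg.columns = ["_".join(col) for col in agg.columns]

# Fastest-trending categories first.
category_speed = agg.sort_values("days_to_trend_mean")
fastest_category = category_speed.index[0]
slowest_category = category_speed.index[-1]

print(category_speed)
```

Keeping the count alongside the mean and median matters for interpretation: a category with very few videos can appear fastest or slowest by chance.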

In [None]:
print("⚡ ANALYSIS 2: TRENDING SPEED BY CATEGORY")
print("=" * 50)

# Run trending speed analysis
trending_speed = insights_analyzer.analyze_trending_speed_by_category()

print("\n📊 Overall Trending Speed Statistics:")
stats = trending_speed['overall_stats']
print(f"Average Days to Trend: {stats['avg_days_to_trend']:.1f}")
print(f"Median Days to Trend: {stats['median_days_to_trend']:.1f}")
print(f"Fastest Category: {stats['fastest_category']}")
print(f"Slowest Category: {stats['slowest_category']}")

print("\n🏃 Trending Speed by Category (Top 15):")
print("-" * 80)
category_speed = trending_speed['category_speed_analysis'].head(15)
display(category_speed[['days_to_trend_count', 'days_to_trend_mean', 'days_to_trend_median', 'engagement_score_mean']])

## 📈 Summary

This notebook applies pandas-based analysis to the engineered features to identify high-engagement content and compare trending speed across categories, in support of YouTube content strategy and performance optimization.