# Notebook 03: Advanced Trend Detection & Anomaly Analysis
## CDR Telecom - New Year's Eve Pattern Analysis


# ============================================================
# NOTEBOOK 03: TREND DETECTION & ANOMALY ANALYSIS
# Project: CDR Telecom Big Data Engineering Final Year Internship
# Focus: Detecting trends and anomalies in 2-day New Year dataset
# ============================================================


In [1]:

# ------------------------------------------------------------
# Cell 1: Setup and Load Enriched Data
# ------------------------------------------------------------
import sys
sys.path.append('/home/jovyan/work/work/scripts')
from spark_init import init_spark
from pyspark.sql import functions as F, types as T
from pyspark.sql.window import Window
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.stat import Correlation
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

spark = init_spark("CDR Trend Detection & Anomaly Analysis")

# Configuration
DATABASE_NAME = "algerie_telecom_cdr"
spark.sql(f"USE {DATABASE_NAME}")

# Load enriched hourly data
hourly_df = spark.table("cdr_hourly_features")
trends_df = spark.table("cdr_hourly_trends")

print("=" * 80)
print("📈 CDR TREND DETECTION & ANOMALY ANALYSIS")
print("=" * 80)
print("Analysis Period: Dec 31, 2024 - Jan 1, 2025")
print(f"Total Hours Analyzed: {hourly_df.count()}")
print(f"Analysis Started: {datetime.now()}")
print("=" * 80)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/06/29 05:04:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/06/29 05:04:20 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


✅ SparkSession initialized (App: CDR Trend Detection & Anomaly Analysis, Spark: 3.5.1)
✅ Hive Warehouse: hdfs://namenode:9000/user/hive/warehouse
✅ Hive Metastore URI: thrift://hive-metastore:9083


25/06/29 05:04:22 WARN HiveConf: HiveConf of name hive.metastore.event.db.notification.api.auth does not exist


📈 CDR TREND DETECTION & ANOMALY ANALYSIS
Analysis Period: Dec 31, 2024 - Jan 1, 2025


25/06/29 05:04:38 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Total Hours Analyzed: 17
Analysis Started: 2025-06-29 05:06:05.205102


                                                                                

## Statistical Trend Analysis

In [2]:

print("\n📊 STATISTICAL TREND ANALYSIS")
print("-" * 60)

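# Calculate trailing moving averages (the current hour plus the previous 2, 5, or 11 hours).
# The windows have no partitionBy, so Spark moves all rows to one partition and logs a
# WindowExec warning; that is acceptable for 17 hourly rows but would not scale.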
window_3h = Window.orderBy("hour_of_week").rowsBetween(-2, 0)
window_6h = Window.orderBy("hour_of_week").rowsBetween(-5, 0)
window_12h = Window.orderBy("hour_of_week").rowsBetween(-11, 0)

trend_analysis = hourly_df.withColumn(
    "ma_3h_calls", F.avg("total_calls").over(window_3h)
).withColumn(
    "ma_6h_calls", F.avg("total_calls").over(window_6h)
).withColumn(
    "ma_12h_calls", F.avg("total_calls").over(window_12h)
).withColumn(
    "ma_3h_revenue", F.avg("total_revenue").over(window_3h)
).withColumn(
    "trend_strength_3h", 
    F.round((F.col("total_calls") - F.col("ma_3h_calls")) / F.col("ma_3h_calls") * 100, 2)
).withColumn(
    "trend_direction",
    F.when(F.col("total_calls") > F.col("ma_6h_calls") * 1.1, "Strong Upward")
     .when(F.col("total_calls") > F.col("ma_3h_calls"), "Upward")
     .when(F.col("total_calls") < F.col("ma_6h_calls") * 0.9, "Strong Downward")
     .when(F.col("total_calls") < F.col("ma_3h_calls"), "Downward")
     .otherwise("Stable")
)

# Identify trend change points
trend_analysis = trend_analysis.withColumn(
    "prev_trend", F.lag("trend_direction").over(Window.orderBy("hour_of_week"))
).withColumn(
    "trend_change",
    F.when(F.col("trend_direction") != F.col("prev_trend"), 1).otherwise(0)
)

print("\n🔄 Trend Change Points:")
trend_changes = trend_analysis.filter(F.col("trend_change") == 1).select(
    "hour_key", "CDR_DAY", "call_hour", "total_calls", 
    "prev_trend", "trend_direction", "trend_strength_3h"
)
trend_changes.show()

# Calculate trend statistics
print("\n📈 Overall Trend Summary:")
trend_summary = trend_analysis.groupBy("trend_direction").agg(
    F.count("*").alias("hours"),
    F.avg("total_calls").alias("avg_calls"),
    F.avg("success_rate").alias("avg_success_rate"),
    F.avg("network_stress_score").alias("avg_stress")
).orderBy("hours", ascending=False)
trend_summary.show()



📊 STATISTICAL TREND ANALYSIS
------------------------------------------------------------

🔄 Trend Change Points:


25/06/29 05:06:22 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

+-------------+----------+---------+-----------+---------------+---------------+-----------------+
|     hour_key|   CDR_DAY|call_hour|total_calls|     prev_trend|trend_direction|trend_strength_3h|
+-------------+----------+---------+-----------+---------------+---------------+-----------------+
|2024-12-31_22|2024-12-31|       22|       1880|         Stable|  Strong Upward|            96.86|
|2025-01-01_00|2025-01-01|        0|       2032|  Strong Upward|Strong Downward|           -33.62|
|2025-01-01_04|2025-01-01|        4|        622|Strong Downward|         Upward|             2.25|
|2025-01-01_06|2025-01-01|        6|       2031|         Upward|  Strong Upward|            64.05|
|2025-01-01_12|2025-01-01|       12|       8912|  Strong Upward|Strong Downward|           -35.83|
+-------------+----------+---------+-----------+---------------+---------------+-----------------+


📈 Overall Trend Summary:


25/06/29 05:06:24 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


+---------------+-----+------------------+-----------------+-------------------+
|trend_direction|hours|         avg_calls| avg_success_rate|         avg_stress|
+---------------+-----+------------------+-----------------+-------------------+
|  Strong Upward|    8|          9293.875|99.88749999999999|           15.07875|
|Strong Downward|    6|2307.8333333333335|99.68666666666667|0.21933333333333307|
|         Upward|    2|             841.5|            100.0|                0.0|
|         Stable|    1|              30.0|            100.0|                0.0|
+---------------+-----+------------------+-----------------+-------------------+
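
The labels above depend on the order of the `when` branches: the "Upward" test against the 3-hour average is evaluated before the "Strong Downward" test against the 6-hour average, so an hour that rebounds slightly after a collapse is labeled "Upward" even if it sits far below the 6-hour average. The following is a minimal plain-Python sketch of the same rule. The call counts and the helper names `trailing_mean` and `classify` are hypothetical and are not part of the notebook.

```python
from statistics import mean

def trailing_mean(values, i, width):
    """Mean of the current value and up to width-1 previous ones.

    Mirrors rowsBetween(-(width - 1), 0): the window is simply shorter
    at the start of the series.
    """
    start = max(0, i - width + 1)
    return mean(values[start:i + 1])

def classify(values, i):
    """Apply the notebook's trend_direction rule to position i, in the same branch order."""
    v = values[i]
    ma3 = trailing_mean(values, i, 3)
    ma6 = trailing_mean(values, i, 6)
    if v > ma6 * 1.1:
        return "Strong Upward"
    if v > ma3:
        return "Upward"
    if v < ma6 * 0.9:
        return "Strong Downward"
    if v < ma3:
        return "Downward"
    return "Stable"

# Hypothetical hourly call counts, not taken from the dataset
calls = [30, 1880, 5000, 2032]
labels = [classify(calls, i) for i in range(len(calls))]
print(labels)  # ['Stable', 'Strong Upward', 'Strong Upward', 'Downward']

# Branch order in action: 150 is above its 3-hour mean (about 117) but far
# below the 6-hour mean (about 558), and it is still labeled "Upward".
rebound = classify([1000, 1000, 1000, 100, 100, 150], 5)
print(rebound)  # Upward
```

The first row of any series is always "Stable", because both moving averages equal the value itself.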



## Anomaly Detection using Multiple Methods

In [3]:
print("\n🔍 MULTI-METHOD ANOMALY DETECTION")
print("-" * 60)

# Method 1: Statistical Z-Score Anomalies
stats_df = hourly_df.select(
    F.avg("total_calls").alias("mean_calls"),
    F.stddev("total_calls").alias("std_calls"),
    F.avg("failure_rate").alias("mean_failure"),
    F.stddev("failure_rate").alias("std_failure"),
    F.avg("total_revenue").alias("mean_revenue"),
    F.stddev("total_revenue").alias("std_revenue")
).collect()[0]

anomalies_zscore = hourly_df.withColumn(
    "calls_zscore", 
    (F.col("total_calls") - stats_df["mean_calls"]) / stats_df["std_calls"]
).withColumn(
    "failure_zscore",
    (F.col("failure_rate") - stats_df["mean_failure"]) / stats_df["std_failure"]
).withColumn(
    "revenue_zscore",
    (F.col("total_revenue") - stats_df["mean_revenue"]) / stats_df["std_revenue"]
).withColumn(
    "is_anomaly_calls", F.when(F.abs(F.col("calls_zscore")) > 3, 1).otherwise(0)
).withColumn(
    "is_anomaly_failure", F.when(F.abs(F.col("failure_zscore")) > 2.5, 1).otherwise(0)
).withColumn(
    "anomaly_score", 
    F.greatest(F.abs("calls_zscore"), F.abs("failure_zscore"), F.abs("revenue_zscore"))
)

# Note: the call-volume flag uses |z| > 3, while the failure-rate flag uses the looser |z| > 2.5.
print("\n1️⃣ Z-Score Based Anomalies (|z| > 3):")
zscore_anomalies = anomalies_zscore.filter(
    (F.col("is_anomaly_calls") == 1) | (F.col("is_anomaly_failure") == 1)
).select(
    "hour_key", "total_calls", "failure_rate", "calls_zscore", 
    "failure_zscore", "anomaly_score"
).orderBy(F.desc("anomaly_score"))
zscore_anomalies.show()

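# Method 2: Percentile-based isolation score (a simplified heuristic, not an actual Isolation Forest)
# Flag hours that fall in the extreme deciles of several metrics at once.
# A failure rate above its 90th percentile scores 2, which is enough to flag the hour by itself.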
percentiles = hourly_df.select(
    F.expr("percentile_approx(total_calls, 0.1)").alias("p10_calls"),
    F.expr("percentile_approx(total_calls, 0.9)").alias("p90_calls"),
    F.expr("percentile_approx(unique_users, 0.1)").alias("p10_users"),
    F.expr("percentile_approx(unique_users, 0.9)").alias("p90_users"),
    F.expr("percentile_approx(failure_rate, 0.9)").alias("p90_failure")
).collect()[0]

anomalies_isolation = hourly_df.withColumn(
    "isolation_score",
    F.when(F.col("total_calls") > percentiles["p90_calls"], 1).otherwise(0) +
    F.when(F.col("total_calls") < percentiles["p10_calls"], 1).otherwise(0) +
    F.when(F.col("unique_users") > percentiles["p90_users"], 1).otherwise(0) +
    F.when(F.col("unique_users") < percentiles["p10_users"], 1).otherwise(0) +
    F.when(F.col("failure_rate") > percentiles["p90_failure"], 2).otherwise(0)
).withColumn(
    "is_isolated", F.when(F.col("isolation_score") >= 2, 1).otherwise(0)
)

print("\n2️⃣ Isolation-based Anomalies:")
isolation_anomalies = anomalies_isolation.filter(F.col("is_isolated") == 1).select(
    "hour_key", "total_calls", "unique_users", "failure_rate", "isolation_score"
)
isolation_anomalies.show()

# Method 3: Contextual Anomalies (unexpected given the context)
contextual_anomalies = hourly_df.withColumn(
    "expected_high_traffic",
    F.when(F.col("is_celebration_hour") == 1, 1)
     .when(F.col("call_hour").between(18, 22), 1)
     .otherwise(0)
).withColumn(
    "contextual_anomaly",
    F.when(
        (F.col("expected_high_traffic") == 0) & (F.col("total_calls") > stats_df["mean_calls"] * 2), 
        "Unexpected High Traffic"
    ).when(
        (F.col("expected_high_traffic") == 1) & (F.col("total_calls") < stats_df["mean_calls"] * 0.5),
        "Unexpected Low Traffic"
    ).when(
        (F.col("is_celebration_hour") == 1) & (F.col("failure_rate") > 20),
        "High Failure During Celebration"
    ).otherwise("Normal")
)

print("\n3️⃣ Contextual Anomalies:")
context_anomalies = contextual_anomalies.filter(
    F.col("contextual_anomaly") != "Normal"
).select(
    "hour_key", "call_hour", "total_calls", "failure_rate", 
    "is_celebration_hour", "contextual_anomaly"
)
context_anomalies.show()

# Combine all anomaly methods
combined_anomalies = anomalies_zscore.join(
    anomalies_isolation.select("hour_key", "isolation_score"), 
    on="hour_key", 
    how="left"
).join(
    contextual_anomalies.select("hour_key", "contextual_anomaly"),
    on="hour_key",
    how="left"
).withColumn(
    "total_anomaly_score",
    F.col("anomaly_score") + F.coalesce(F.col("isolation_score"), F.lit(0))
).withColumn(
    "anomaly_type",
    F.when(F.col("total_anomaly_score") > 5, "Critical Anomaly")
     .when(F.col("total_anomaly_score") > 3, "Major Anomaly")
     .when(F.col("total_anomaly_score") > 1, "Minor Anomaly")
     .otherwise("Normal")
)

# Save anomaly detection results
combined_anomalies.write.mode("overwrite").saveAsTable("cdr_hourly_anomalies")
print("\n✅ Saved anomaly detection results to: cdr_hourly_anomalies")


🔍 MULTI-METHOD ANOMALY DETECTION
------------------------------------------------------------

1️⃣ Z-Score Based Anomalies (|z| > 3):
+-------------+-----------+------------+-------------------+-----------------+-----------------+
|     hour_key|total_calls|failure_rate|       calls_zscore|   failure_zscore|    anomaly_score|
+-------------+-----------+------------+-------------------+-----------------+-----------------+
|2025-01-01_13|        819|        0.85|-0.7283858395207625|3.050249483119908|3.050249483119908|
+-------------+-----------+------------+-------------------+-----------------+-----------------+


2️⃣ Isolation-based Anomalies:
+-------------+-----------+------------+------------+---------------+
|     hour_key|total_calls|unique_users|failure_rate|isolation_score|
+-------------+-----------+------------+------------+---------------+
|2024-12-31_21|         30|          30|         0.0|              2|
|2025-01-01_10|      18125|       11951|        0.09|              

25/06/29 05:06:42 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.



✅ Saved anomaly detection results to: cdr_hourly_anomalies


25/06/29 05:06:43 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
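
One point worth keeping in mind when reading the z-score results: with only 17 hourly rows, a z-score based on the sample standard deviation (which is what `F.stddev` computes) cannot exceed (n - 1) / sqrt(n), roughly 3.88, in absolute value. A `|z| > 3` threshold can therefore only fire when one hour is almost the sole outlier. The sketch below illustrates this in plain Python with hypothetical failure rates; `zscores` and `max_abs_z` are illustrative helpers, not notebook functions.

```python
import math
from statistics import mean, stdev

def zscores(values):
    """Z-scores using the sample standard deviation (n - 1 denominator)."""
    mu = mean(values)
    sd = stdev(values)
    return [(v - mu) / sd for v in values]

def max_abs_z(n):
    """Upper bound on |z| for a sample of size n (Samuelson's inequality)."""
    return (n - 1) / math.sqrt(n)

# Hypothetical failure rates: 16 clean hours and a single bad one
rates = [0.0] * 16 + [0.85]
z = zscores(rates)

print(round(z[-1], 2))           # 3.88, the single outlier attains the bound
print(round(max_abs_z(17), 2))   # 3.88
```

On a sample this small, a lower threshold or a robust statistic (median and MAD, for example) may be easier to interpret than a fixed cutoff of 3.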


## Pattern Recognition and Clustering 

In [5]:

print("\n🎯 PATTERN RECOGNITION & CLUSTERING")
print("-" * 60)

# Prepare features for pattern recognition
feature_cols = [
    "total_calls", "unique_users", "success_rate", "failure_rate",
    "avg_duration", "total_revenue", "avg_calls_per_user", "paid_call_ratio"
]

# Assemble the features into a single vector column. No scaling is applied here;
# StandardScaler would be needed before any distance-based clustering.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
feature_df = assembler.transform(hourly_df)

# 1️⃣ Precompute the total_revenue percentiles
rev_stats = hourly_df.select(
    F.expr("percentile_approx(total_revenue, 0.75)").alias("p75_rev"),
    F.expr("percentile_approx(total_revenue, 0.5)").alias("p50_rev")
).collect()[0]

# 2️⃣ Build the hourly patterns
patterns = hourly_df \
    .withColumn(
        "traffic_level",
        F.when(F.col("total_calls") > 5000, "Very High")
         .when(F.col("total_calls") > 2000, "High")
         .when(F.col("total_calls") > 500,  "Medium")
         .otherwise("Low")
    ).withColumn(
        "quality_level",
        F.when(F.col("success_rate") > 95, "Excellent")
         .when(F.col("success_rate") > 90, "Good")
         .when(F.col("success_rate") > 85, "Fair")
         .otherwise("Poor")
    ).withColumn(
        "revenue_level",
        F.when(F.col("total_revenue") > rev_stats["p75_rev"], "High")
         .when(F.col("total_revenue") > rev_stats["p50_rev"], "Medium")
         .otherwise("Low")
    ).withColumn(
        "hour_pattern",
        F.concat_ws("-", F.col("traffic_level"), F.col("quality_level"), F.col("revenue_level"))
    )

# 3️⃣ Summary of the discovered patterns
print("\n📊 Discovered Hour Patterns:")
pattern_summary = patterns.groupBy("hour_pattern").agg(
    F.count("*").alias("occurrences"),
    F.collect_list("hour_key").alias("hours"),
    F.avg("total_calls").alias("avg_calls"),
    F.avg("network_stress_score").alias("avg_stress")
).orderBy(F.desc("occurrences"))

pattern_summary.show(truncate=False)

# 4️⃣ Pattern transitions
pattern_transitions = patterns.withColumn(
    "prev_pattern", 
    F.lag("hour_pattern").over(Window.orderBy("hour_of_week"))
).withColumn(
    "pattern_change",
    F.when(F.col("hour_pattern") != F.col("prev_pattern"), 1).otherwise(0)
)

print("\n🔄 Pattern Transition Points:")
transitions = pattern_transitions.filter(F.col("pattern_change") == 1).select(
    "hour_key", "prev_pattern", "hour_pattern", "total_calls", "success_rate"
)

transitions.show()



🎯 PATTERN RECOGNITION & CLUSTERING
------------------------------------------------------------

📊 Discovered Hour Patterns:
+--------------------------+-----------+------------------------------------------------------------------------------------------+-----------------+-------------------+
|hour_pattern              |occurrences|hours                                                                                     |avg_calls        |avg_stress         |
+--------------------------+-----------+------------------------------------------------------------------------------------------+-----------------+-------------------+
|Medium-Excellent-Low      |6          |[2025-01-01_01, 2025-01-01_02, 2025-01-01_03, 2025-01-01_04, 2025-01-01_05, 2025-01-01_13]|764.3333333333334|0.1621666666666667 |
|Very High-Excellent-High  |4          |[2025-01-01_08, 2025-01-01_09, 2025-01-01_10, 2025-01-01_11]                              |14992.5          |22.5735            |
|Very High-Excellent-Med

25/06/29 05:10:30 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
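
The `hour_pattern` label is a rule-based bucketing rather than a learned clustering: each metric is mapped to the first threshold it strictly exceeds, and the three labels are joined with hyphens. A minimal plain-Python sketch of that logic follows. The revenue percentiles and sample values are hypothetical, and the helper names `level` and `hour_pattern` are introduced here for illustration only.

```python
def level(value, thresholds, default):
    """Return the label of the first cutoff that value strictly exceeds."""
    for cutoff, label in thresholds:
        if value > cutoff:
            return label
    return default

# Fixed cutoffs copied from the notebook cell
TRAFFIC = [(5000, "Very High"), (2000, "High"), (500, "Medium")]
QUALITY = [(95, "Excellent"), (90, "Good"), (85, "Fair")]

def hour_pattern(total_calls, success_rate, total_revenue, p50_rev, p75_rev):
    """Build the traffic-quality-revenue label for one hour."""
    revenue = [(p75_rev, "High"), (p50_rev, "Medium")]
    return "-".join([
        level(total_calls, TRAFFIC, "Low"),
        level(success_rate, QUALITY, "Poor"),
        level(total_revenue, revenue, "Low"),
    ])

# Hypothetical revenue percentiles and hourly values
p50, p75 = 1_000_000, 3_000_000
peak = hour_pattern(15_000, 99.9, 4_000_000, p50, p75)
quiet = hour_pattern(760, 99.7, 200_000, p50, p75)
edge = hour_pattern(5000, 95, p75, p50, p75)  # values equal to a cutoff fall into the lower bucket

print(peak)   # Very High-Excellent-High
print(quiet)  # Medium-Excellent-Low
print(edge)   # High-Good-Medium
```

Because the comparisons are strict, an hour with exactly 5000 calls is "High", not "Very High". The scheme allows at most 4 x 4 x 3 = 48 distinct labels, which is more than the 17 hours available, so pattern counts on this dataset are necessarily small.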


## New Year's Eve Specific Trend Analysis

In [None]:
print("\n🎆 NEW YEAR'S EVE SPECIFIC TRENDS")
print("-" * 60)

# Analyze the buildup to midnight
nye_buildup = trends_df.filter(F.col("celebration_phase").isNotNull()).select(
    "hour_key", "call_hour", "celebration_phase", "total_calls", 
    "hour_over_hour_growth", "network_stress_level", "success_rate"
).orderBy("hour_of_week")

print("\n📈 New Year's Eve Buildup Pattern:")
nye_buildup.show()

# Calculate phase-wise metrics
phase_analysis = trends_df.groupBy("celebration_phase").agg(
    F.count("*").alias("hours"),
    F.sum("total_calls").alias("total_calls"),
    F.avg("success_rate").alias("avg_success_rate"),
    F.avg("network_stress_score").alias("avg_stress"),
    F.max("hour_over_hour_growth").alias("max_growth_rate"),
    F.sum("total_revenue").alias("phase_revenue")
).filter(F.col("celebration_phase").isNotNull())

print("\n🎊 Celebration Phase Analysis:")
phase_analysis.show()

# Identify the exact midnight spike pattern
midnight_pattern = spark.sql("""
SELECT 
    timestamp,
    calls_per_minute,
    failure_rate,
    LAG(calls_per_minute, 1) OVER (ORDER BY timestamp)           AS prev_minute_calls,
    calls_per_minute - LAG(calls_per_minute, 1) OVER (ORDER BY timestamp) AS minute_growth
FROM v_midnight_transition
WHERE timestamp >= '2024-12-31 23:55:00'
  AND timestamp <= '2025-01-01 00:05:00'
ORDER BY timestamp
""")

print("\n🕐 Minute-by-Minute Midnight Pattern:")
midnight_pattern.show()

# Find the single peak minute
peak_minute = spark.sql("""
SELECT timestamp, calls_per_minute, unique_callers, failure_rate
FROM v_midnight_transition
ORDER BY calls_per_minute DESC
LIMIT 1
""").collect()[0]

print(f"\n⚡ Peak Minute: {peak_minute['timestamp']}")
print(f"   Calls: {peak_minute['calls_per_minute']}")
print(f"   Unique Callers: {peak_minute['unique_callers']}")
print(f"   Failure Rate: {peak_minute['failure_rate']}%")
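
This cell has no recorded output, so the following plain-Python sketch shows what the `LAG(calls_per_minute, 1) OVER (ORDER BY timestamp)` expression computes. The timestamps and counts are hypothetical, the helper `minute_growth` is illustrative, and the `v_midnight_transition` view is assumed to have been created in an earlier notebook.

```python
def minute_growth(rows):
    """Add previous-minute calls and minute-over-minute growth to (timestamp, calls) pairs.

    Rows are sorted by timestamp first, like ORDER BY timestamp in the window.
    The first row has no predecessor, so both derived fields are None (SQL NULL).
    """
    result = []
    prev_calls = None
    for ts, calls in sorted(rows):
        growth = None if prev_calls is None else calls - prev_calls
        result.append({
            "timestamp": ts,
            "calls_per_minute": calls,
            "prev_minute_calls": prev_calls,
            "minute_growth": growth,
        })
        prev_calls = calls
    return result

# Hypothetical per-minute counts, deliberately out of order
sample = [
    ("2024-12-31 23:59:00", 120),
    ("2025-01-01 00:00:00", 310),
    ("2024-12-31 23:58:00", 95),
    ("2025-01-01 00:01:00", 280),
]
rows = minute_growth(sample)
peak = max(rows, key=lambda r: r["calls_per_minute"])

for r in rows:
    print(r["timestamp"], r["calls_per_minute"], r["minute_growth"])
print("peak minute:", peak["timestamp"])
```

ISO-formatted timestamp strings sort chronologically, which is why a plain `sorted` call is enough here.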

## Predictive Insights & Capacity Planning 

In [9]:
print("\n🔮 PREDICTIVE INSIGHTS & CAPACITY PLANNING")
print("-" * 60)

# 1️⃣ Capacity summary
capacity_analysis = hourly_df.agg(
    F.max("total_calls").alias("peak_hourly_calls"),
    F.avg("total_calls").alias("avg_hourly_calls"),
    F.stddev("total_calls").alias("std_hourly_calls"),
    F.max("unique_users").alias("peak_users"),
    F.max("failure_rate").alias("max_failure_rate")
).collect()[0]

print("\n📊 Current Capacity Analysis:")
print(f"   Peak Hourly Calls: {capacity_analysis['peak_hourly_calls']:,.0f}")
print(f"   Average Hourly Calls: {capacity_analysis['avg_hourly_calls']:,.0f}")
print(f"   Standard Deviation: {capacity_analysis['std_hourly_calls']:,.0f}")

# 2️⃣ Recommended capacity for next year
safety_factor = 1.2  # 20% safety margin
projected_peak = capacity_analysis['peak_hourly_calls'] * safety_factor
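# Caveat: the mean and standard deviation come from the 17 holiday hours themselves,
# so this "normal operations" level is inflated by the New Year peak.
normal_capacity = capacity_analysis['avg_hourly_calls'] + (2 * capacity_analysis['std_hourly_calls'])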

print(f"\n🎯 Capacity Recommendations for Next Year:")
print(f"   Normal Operations: {normal_capacity:,.0f} calls/hour")
print(f"   New Year's Eve Peak: {projected_peak:,.0f} calls/hour")
print(f"   Surge Capacity Ratio: {projected_peak / normal_capacity:.1f}x")

# 3️⃣ Network stress insights (handle no-stress case)
stress_patterns = hourly_df.filter(F.col("network_stress_level").isin(["High", "Critical"])).agg(
    F.count("*").alias("stressed_hours"),
    F.avg("total_calls").alias("avg_calls_during_stress"),
    F.avg("failure_rate").alias("avg_failure_during_stress")
).collect()[0]

stressed_hours = stress_patterns["stressed_hours"]
avg_calls_stress = stress_patterns["avg_calls_during_stress"] or 0
avg_failure_stress = stress_patterns["avg_failure_during_stress"] or 0

print(f"\n⚠️ Network Stress Insights:")
print(f"   Hours with High/Critical Stress: {stressed_hours}")
if stressed_hours > 0:
    print(f"   Average Calls During Stress: {avg_calls_stress:,.0f}")
    print(f"   Average Failure Rate During Stress: {avg_failure_stress:.2f}%")
else:
    print("   No High/Critical stress periods to calculate averages.")

# 4️⃣ Revenue optimization opportunities by phase
revenue_patterns = trends_df.groupBy("celebration_phase").agg(
    F.sum("total_revenue").alias("phase_revenue"),
    F.sum("paid_calls").alias("paid_calls"),
    F.sum("free_calls").alias("free_calls"),
    F.avg("revenue_concentration").alias("avg_revenue_per_paid_call")
).filter(F.col("celebration_phase").isNotNull())

print("\n💰 Revenue Optimization Opportunities:")
revenue_patterns.show(truncate=False)

# 5️⃣ Monetization potential
monetization_stats = trends_df.agg(
    F.sum("free_calls").alias("total_free_calls"),
    F.avg("revenue_concentration").alias("avg_revenue_per_paid")
).collect()[0]

total_free = monetization_stats["total_free_calls"] or 0
avg_rev_per_paid = monetization_stats["avg_revenue_per_paid"] or 0
potential_revenue = total_free * avg_rev_per_paid * 0.3

print(f"\n💡 Revenue Potential: {potential_revenue:,.2f} DZD")
print("   (If 30% of free calls were converted to paid)")



🔮 PREDICTIVE INSIGHTS & CAPACITY PLANNING
------------------------------------------------------------

📊 Current Capacity Analysis:
   Peak Hourly Calls: 18,125
   Average Hourly Calls: 5,289
   Standard Deviation: 6,137

🎯 Capacity Recommendations for Next Year:
   Normal Operations: 17,562 calls/hour
   New Year's Eve Peak: 21,750 calls/hour
   Surge Capacity Ratio: 1.2x

⚠️ Network Stress Insights:
   Hours with High/Critical Stress: 0
   No High/Critical stress periods to calculate averages.

💰 Revenue Optimization Opportunities:
+-----------------+-------------+----------+----------+-------------------------+
|celebration_phase|phase_revenue|paid_calls|free_calls|avg_revenue_per_paid_call|
+-----------------+-------------+----------+----------+-------------------------+
|New Year Day     |3.9189624E7  |29294     |47637     |1304.895                 |
|Early NY         |1943679.0    |1777      |1779      |1152.82                  |
|Post-Celebration |1233960.0    |1216      |1027
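
The capacity figures printed by this cell can be checked by hand. The sketch below redoes the arithmetic in plain Python from the rounded values shown in the output (peak 18,125, mean 5,289, standard deviation 6,137). Because those inputs are rounded, the normal-operations figure comes out as 17,563 rather than the 17,562 printed from the unrounded values. The helper `capacity_plan` is illustrative and not part of the notebook.

```python
def capacity_plan(peak, mean, std, safety_factor=1.2, sigmas=2):
    """Reproduce the notebook's capacity formulas.

    projected_peak  = observed peak plus a safety margin
    normal_capacity = mean + sigmas * standard deviation
    """
    projected_peak = peak * safety_factor
    normal_capacity = mean + sigmas * std
    return {
        "projected_peak": projected_peak,
        "normal_capacity": normal_capacity,
        "surge_ratio": projected_peak / normal_capacity,
    }

# Rounded values copied from the cell output
plan = capacity_plan(peak=18_125, mean=5_289, std=6_137)

print(f"{plan['projected_peak']:,.0f}")            # 21,750
print(f"{plan['normal_capacity']:,.0f}")           # 17,563
print(f"{plan['surge_ratio']:.1f}x")               # 1.2x
print(f"{(plan['surge_ratio'] - 1) * 100:.0f}%")   # 24%
```

The surge ratio is only about 1.2x because the baseline is estimated from the holiday hours themselves. A baseline taken from ordinary days, if such data is available, would likely give a much larger ratio.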

In [19]:

print("\n📊 CREATING TREND MONITORING VIEWS")
print("-" * 60)

# 1️⃣ Merge the in-memory trend_analysis (with trend_direction) 
#    and the persisted trends_df (with hour_over_hour_growth + celebration_phase)
joined = (
    trend_analysis
      .join(
         trends_df.select("hour_key","hour_over_hour_growth","celebration_phase"),
         on="hour_key",
         how="left"
      )
      .select(
         "hour_key","CDR_DAY","call_hour","total_calls",
         "hour_over_hour_growth","trend_direction","celebration_phase"
      )
)
joined.createOrReplaceTempView("hourly_trends_enriched")

# 2️⃣ Hourly Trend Dashboard
spark.sql("""
CREATE OR REPLACE TEMPORARY VIEW v_hourly_trends AS
SELECT
    t.hour_key,
    t.CDR_DAY,
    t.call_hour,
    t.total_calls,
    t.hour_over_hour_growth,
    t.trend_direction,
    t.celebration_phase,
    a.anomaly_type,
    a.total_anomaly_score
FROM hourly_trends_enriched t
LEFT JOIN cdr_hourly_anomalies a
  ON t.hour_key = a.hour_key
ORDER BY t.CDR_DAY, t.call_hour
""")
print("✅ Created TEMP VIEW: v_hourly_trends")

# 3️⃣ Anomaly Alert Dashboard
spark.sql("""
CREATE OR REPLACE TEMPORARY VIEW v_anomaly_alerts AS
SELECT
    hour_key,
    CDR_DAY,
    call_hour,
    anomaly_type,
    total_anomaly_score,
    total_calls,
    failure_rate,
    network_stress_level,
    contextual_anomaly
FROM cdr_hourly_anomalies
WHERE anomaly_type IN ('Critical Anomaly','Major Anomaly')
ORDER BY total_anomaly_score DESC
""")
print("✅ Created TEMP VIEW: v_anomaly_alerts")

# 4️⃣ Pattern Evolution (from the `patterns` DataFrame built in the pattern recognition cell)
patterns.createOrReplaceTempView("hourly_patterns")
spark.sql("""
CREATE OR REPLACE TEMPORARY VIEW v_pattern_evolution AS
SELECT
    hour_key,
    hour_pattern,
    traffic_level,
    quality_level,
    revenue_level,
    total_calls,
    success_rate,
    total_revenue
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY hour_pattern ORDER BY hour_of_week) AS seq
    FROM hourly_patterns
)
ORDER BY seq
""")
print("✅ Created TEMP VIEW: v_pattern_evolution")



📊 CREATING TREND MONITORING VIEWS
------------------------------------------------------------
✅ Created TEMP VIEW: v_hourly_trends
✅ Created TEMP VIEW: v_anomaly_alerts
✅ Created TEMP VIEW: v_pattern_evolution


In [20]:
# ------------------------------------------------------------
# Cell 8: Summary Report and Key Findings
# ------------------------------------------------------------
print("\n" + "=" * 80)
print("📋 TREND DETECTION & ANOMALY ANALYSIS SUMMARY REPORT")
print("=" * 80)

# 1. Major Trends Identified (derived from the trend and capacity cells above)
peak_hour_key = hourly_df.orderBy(F.desc("total_calls")).first()["hour_key"]
peak_ratio = capacity_analysis['peak_hourly_calls'] / capacity_analysis['avg_hourly_calls']
print("\n1️⃣ MAJOR TRENDS IDENTIFIED:")
print("   • Strong upward shift starting at 22:00 on Dec 31")
print(f"   • Hourly peak at {peak_hour_key} with {peak_ratio:.1f}x the average hourly traffic")
print("   • Strong downward shifts at 00:00 and 12:00 on Jan 1")
print(f"   • Hours with High/Critical network stress: {stressed_hours}")

# 2. Critical Anomalies
critical_anomalies = combined_anomalies.filter(
    F.col("anomaly_type") == "Critical Anomaly"
).count()
print(f"\n2️⃣ ANOMALIES DETECTED:")
print(f"   • Critical Anomalies: {critical_anomalies}")
print("   • Z-score method: one hour flagged, on failure rate rather than call volume")
print(f"   • Highest hourly failure rate: {capacity_analysis['max_failure_rate']:.2f}%")

# 3. Pattern Insights
print("\n3️⃣ PATTERN INSIGHTS:")
print("   • Most frequent pattern: Medium-Excellent-Low (6 hours)")
print("   • Peak pattern: Very High-Excellent-High (08:00-11:00 on Jan 1)")
print("   • Quality level stays Excellent in both patterns, even at peak load")

# 4. Predictive Insights
print("\n4️⃣ PREDICTIVE INSIGHTS FOR NEXT YEAR:")
print(f"   • Expected Peak: {projected_peak:,.0f} calls/hour")
print(f"   • Required Capacity Increase: {(projected_peak/normal_capacity - 1)*100:.0f}%")
print(f"   • Revenue Opportunity: {potential_revenue:,.2f} DZD")

# 5. Recommendations
print("\n5️⃣ RECOMMENDATIONS:")
print(f"   📡 Network: Plan for {projected_peak:,.0f} calls/hour at peak ({projected_peak / normal_capacity:.1f}x normal capacity)")
print("   💰 Revenue: Create special New Year packages")
print("   🔧 Operations: Pre-position support staff for midnight window")
print("   📊 Monitoring: Set alerts for >3 z-score anomalies")

print("\n" + "=" * 80)
print("✅ TREND & ANOMALY ANALYSIS COMPLETE!")
print(f"📊 Analysis completed at: {datetime.now()}")
print("\n🚀 Next Steps:")
print("   1. Visualize trends in Superset/PowerBI using created views")
print("   2. Set up real-time monitoring for next year")
print("   3. Create capacity planning dashboard")
print("   4. Build automated anomaly alerting system")


📋 TREND DETECTION & ANOMALY ANALYSIS SUMMARY REPORT

1️⃣ MAJOR TRENDS IDENTIFIED:
   • Strong upward shift starting at 22:00 on Dec 31
   • Hourly peak at 2025-01-01_10 with 3.4x the average hourly traffic
   • Strong downward shifts at 00:00 and 12:00 on Jan 1
   • Hours with High/Critical network stress: 0

2️⃣ ANOMALIES DETECTED:
   • Critical Anomalies: 1
   • Z-score method: one hour flagged, on failure rate rather than call volume
   • Highest hourly failure rate: 0.85%

3️⃣ PATTERN INSIGHTS:
   • Most frequent pattern: Medium-Excellent-Low (6 hours)
   • Peak pattern: Very High-Excellent-High (08:00-11:00 on Jan 1)
   • Quality level stays Excellent in both patterns, even at peak load

4️⃣ PREDICTIVE INSIGHTS FOR NEXT YEAR:
   • Expected Peak: 21,750 calls/hour
   • Required Capacity Increase: 24%
   • Revenue Opportunity: 44,804,446.39 DZD

5️⃣ RECOMMENDATIONS:
   📡 Network: Plan for 21,750 calls/hour at peak (1.2x normal capacity)
   💰 Revenue: Create special New Year packages
   🔧 Operations: Pre-position support staff for midnight window
   📊 Monitoring: Set alerts for >3 

In [24]:
# ------------------------------------------------------------
# Cell 9: Export Key Metrics for Visualization
# ------------------------------------------------------------
print("\n📤 EXPORTING VISUALIZATION-READY DATASETS")
print("-" * 60)

# 1️⃣ Build the full export by joining trend_analysis (success_rate, failure_rate, total_revenue,
#    network_stress_score, etc.) with trends_df (hour_over_hour_growth, celebration_phase),
#    then with the saved anomalies table for the anomaly type and score.
full_export = (
    trend_analysis
      .join(
         trends_df.select("hour_key","hour_over_hour_growth","celebration_phase"),
         on="hour_key", how="left"
      )
      .join(
         spark.table("cdr_hourly_anomalies")
              .select("hour_key","anomaly_type","total_anomaly_score"),
         on="hour_key", how="left"
      )
      .select(
         "hour_key",
         "CDR_DAY",
         "call_hour",
         "total_calls",
         "success_rate",
         "failure_rate",
         "total_revenue",
         "hour_over_hour_growth",
         "network_stress_score",
         "celebration_phase",
         "anomaly_type"
      )
      .orderBy("hour_of_week")
)

full_export.write.mode("overwrite").saveAsTable("viz_time_series")
print("✅ Exported: viz_time_series")

# 2️⃣ Anomaly data for scatter plots (unchanged)
anomaly_export = spark.sql("""
SELECT 
    hour_key,
    total_calls,
    failure_rate,
    total_anomaly_score,
    anomaly_type,
    CASE 
        WHEN anomaly_type = 'Critical Anomaly' THEN 'red'
        WHEN anomaly_type = 'Major Anomaly'    THEN 'orange'
        WHEN anomaly_type = 'Minor Anomaly'    THEN 'yellow'
        ELSE 'green'
    END AS color_code
FROM cdr_hourly_anomalies
""")
anomaly_export.write.mode("overwrite").saveAsTable("viz_anomalies")
print("✅ Exported: viz_anomalies")

# 3️⃣ Pattern evolution for heatmaps (unchanged)
pattern_export = patterns.select(
    "hour_key", "CDR_DAY", "call_hour", 
    "traffic_level", "quality_level", "revenue_level",
    "total_calls", "success_rate", "network_stress_score"
).orderBy("CDR_DAY", "call_hour")
pattern_export.write.mode("overwrite").saveAsTable("viz_patterns")
print("✅ Exported: viz_patterns")

print("\n📊 Visualization datasets ready!")
print("   - viz_time_series: For line charts and trend visualization")
print("   - viz_anomalies:    For anomaly scatter plots and alerts")
print("   - viz_patterns:     For pattern heatmaps and transitions")

print("\n✅ All processing complete!")
print("💡 Your data is now ready for BI dashboard creation")



📤 EXPORTING VISUALIZATION-READY DATASETS
------------------------------------------------------------
✅ Exported: viz_time_series
✅ Exported: viz_anomalies
✅ Exported: viz_patterns

📊 Visualization datasets ready!
   - viz_time_series: For line charts and trend visualization
   - viz_anomalies:    For anomaly scatter plots and alerts
   - viz_patterns:     For pattern heatmaps and transitions

✅ All processing complete!
💡 Your data is now ready for BI dashboard creation


In [25]:
# ----------------------------------------------------------------------------------
# 10. Cleanup
# ----------------------------------------------------------------------------------
spark.stop()
print("\n✅ Trend detection & anomaly analysis pipeline completed successfully!")
print("✅ Spark session closed.")



✅ Trend detection & anomaly analysis pipeline completed successfully!
✅ Spark session closed.
