- 03_Hive_Tables_Creation.ipynb
- Algerie Telecom - Hive Tables Creation and Optimization
- Author: Data Engineering Team
- Date: July 2025

# 🗄️ Hive Tables Creation and Optimization

Create optimized Hive tables for efficient querying and analytics.

## Objectives:
- Create external tables for raw data
- Implement partitioning strategy
- Create optimized views for analytics
- Set up aggregated tables for performance


In [1]:
import sys
sys.path.append('/home/jovyan/work/batch/jupyter/notebooks/work/scripts')
from spark_init import init_spark
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import hashlib
from pyspark.sql import functions as F, types as T
from pyspark.sql.functions import *
from pyspark.sql.window import Window
from pyspark.sql.types import *
from datetime import datetime
from datetime import datetime, timedelta

# Initialize Spark with proper configuration
spark = init_spark("Hive Tables Creation - Generated AT CDR")
print("✅ SparkSession initialized")
print(f"Spark Version: {spark.version}")
print(f"Warehouse Location: {spark.conf.get('spark.sql.warehouse.dir')}")


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/07/07 19:57:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✅ SparkSession initialized (App: Hive Tables Creation - Generated AT CDR, Spark: 3.5.1)
✅ Hive Warehouse: hdfs://namenode:9000/user/hive/warehouse
✅ Hive Metastore URI: thrift://hive-metastore:9083
✅ SparkSession initialized
Spark Version: 3.5.1
Warehouse Location: hdfs://namenode:9000/user/hive/warehouse


## 2. Create Database

In [2]:
spark.sql("CREATE DATABASE IF NOT EXISTS at_cdr_analysis")
spark.sql("USE at_cdr_analysis")
print("✅ Database created/selected")

25/07/07 19:57:32 WARN HiveConf: HiveConf of name hive.metastore.event.db.notification.api.auth does not exist


✅ Database created/selected


## 3. Create Customer Dimension Table

In [3]:
# Drop existing table if exists
spark.sql("DROP TABLE IF EXISTS dim_customers")

# Create external table for customers
create_customer_table_sql = """
CREATE EXTERNAL TABLE IF NOT EXISTS dim_customers (
    customer_id STRING,
    connection_id STRING,
    wilaya_code STRING,
    wilaya_name STRING,
    customer_type STRING,
    service_type STRING,
    offer_name STRING,
    offer_price DOUBLE,
    activation_date TIMESTAMP,
    is_active BOOLEAN,
    tech_adoption_score DOUBLE,
    business_score DOUBLE,
    network_quality_score DOUBLE,
    usage_profile STRING,
    bandwidth_mbps DOUBLE,
    data_cap_gb DOUBLE,
    upload_mbps DOUBLE,
    sla_percentage DOUBLE,
    boost_hours STRING,
    night_boost BOOLEAN
)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/Raw/customer_dim_enhanced/'
"""

spark.sql(create_customer_table_sql)
print("✅ Customer dimension table created successfully")

# Verify table
spark.sql("DESCRIBE FORMATTED dim_customers").show(truncate=False)


25/07/07 16:41:04 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.


✅ Customer dimension table created successfully
+---------------------+---------+-------+
|col_name             |data_type|comment|
+---------------------+---------+-------+
|customer_id          |string   |NULL   |
|connection_id        |string   |NULL   |
|wilaya_code          |string   |NULL   |
|wilaya_name          |string   |NULL   |
|customer_type        |string   |NULL   |
|service_type         |string   |NULL   |
|offer_name           |string   |NULL   |
|offer_price          |double   |NULL   |
|activation_date      |timestamp|NULL   |
|is_active            |boolean  |NULL   |
|tech_adoption_score  |double   |NULL   |
|business_score       |double   |NULL   |
|network_quality_score|double   |NULL   |
|usage_profile        |string   |NULL   |
|bandwidth_mbps       |double   |NULL   |
|data_cap_gb          |double   |NULL   |
|upload_mbps          |double   |NULL   |
|sla_percentage       |double   |NULL   |
|boost_hours          |string   |NULL   |
|night_boost          |boole

## 4. Create CDR Fact Table (Partitioned)

In [5]:
spark.conf.set("spark.sql.parquet.int64AsTimestamp", "TIMESTAMP_NANOS")

from pyspark.sql.functions import year, month, dayofmonth

# 1) Read your enhanced raw CDRs (this will infer the schema from the underlying parquet)
raw = spark.read.parquet("/user/hive/warehouse/Raw/raw_cdr_enhanced")

# 2) Add year/month/day partition columns
fact = (
    raw
    .withColumn("yr",  year("timestamp"))
    .withColumn("mo",  month("timestamp"))
    .withColumn("day", dayofmonth("timestamp"))
)

# 3) Drop any existing managed table
spark.sql("DROP TABLE IF EXISTS at_generated_cdr.fact_cdr")

# 4) Write out as a managed, partitioned parquet table
(
    fact.write
        .mode("overwrite")           # blow away old data & metadata
        .format("parquet")           # store as parquet
        .partitionBy("yr","mo","day")
        .saveAsTable("at_generated_cdr.fact_cdr")
)

# 5) Ensure Hive metastore sees all of your new daily partitions
spark.sql("MSCK REPAIR TABLE at_generated_cdr.fact_cdr")

# 6) Quick sanity-check
print("✅ Created at_generated_cdr.fact_cdr – first few partitions:")
spark.sql("SHOW PARTITIONS at_generated_cdr.fact_cdr LIMIT 10").show(truncate=False)

total = spark.table("at_generated_cdr.fact_cdr").count()
print(f"✅ Total rows in fact_cdr = {total:,}")

25/07/07 16:47:07 WARN HiveExternalCatalog: Hive incompatible types found: timestamp_ntz. Persisting data source table `spark_catalog`.`at_generated_cdr`.`fact_cdr` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
                                                                                

✅ Created at_generated_cdr.fact_cdr – first few partitions:


ParseException: 
[PARSE_SYNTAX_ERROR] Syntax error at or near 'LIMIT'.(line 1, pos 42)

== SQL ==
SHOW PARTITIONS at_generated_cdr.fact_cdr LIMIT 10
------------------------------------------^^^


25/07/07 16:49:14 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: Master removed our application: KILLED
25/07/07 16:49:14 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exiting due to error from cluster scheduler: Master removed our application: KILLED
	at org.apache.spark.errors.SparkCoreErrors$.clusterSchedulerError(SparkCoreErrors.scala:291)
	at org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:981)
	at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.dead(StandaloneSchedulerBackend.scala:165)
	at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint.markDead(StandaloneAppClient.scala:263)
	at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(StandaloneAppClient.scala:170)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.proce

In [9]:
# 1) Create the Hive database (if not already there)
spark.sql("""
  CREATE DATABASE IF NOT EXISTS Raw
  LOCATION '/user/hive/warehouse/Raw'
""")

# 2) Now create an EXTERNAL table over the parquet files in that directory
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS Raw.raw_cdr_enhanced (
    -- (you could list the full schema here, 
    -- but since it's parquet you can let Spark infer it)
  )
  STORED AS PARQUET
  LOCATION '/user/hive/warehouse/Raw/raw_cdr_enhanced'
""")


ParseException: 
[PARSE_SYNTAX_ERROR] Syntax error at or near ')'.(line 5, pos 2)

== SQL ==

  CREATE EXTERNAL TABLE IF NOT EXISTS Raw.raw_cdr_enhanced (
    -- (you could list the full schema here, 
    -- but since it's parquet you can let Spark infer it)
  )
--^^^
  STORED AS PARQUET
  LOCATION '/user/hive/warehouse/Raw/raw_cdr_enhanced'


In [8]:
spark.sql("SHOW TABLES IN raw").show(truncate=False)


AnalysisException: [SCHEMA_NOT_FOUND] The schema `raw` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a catalog, verify the current_schema() output, or qualify the name with the correct catalog.
To tolerate the error on drop use DROP SCHEMA IF EXISTS.

In [7]:
spark.sql("SHOW DATABASES").show(truncate=False)


+-------------------+
|namespace          |
+-------------------+
|algerie_telecom_cdr|
|algerie_telecom_gen|
|at_generated_cdr   |
|default            |
+-------------------+



In [6]:
from pyspark.sql.functions import year, month, dayofmonth

# 1) Read the raw data and add your partition columns
df = (spark.table("Raw.raw_cdr_enhanced")
           .withColumn("year",  year("timestamp"))
           .withColumn("month", month("timestamp"))
           .withColumn("day",   dayofmonth("timestamp")))

# 2) Drop any old version of the managed table
spark.sql("DROP TABLE IF EXISTS fact_cdr")

# 3) Write out as a new MANAGED partitioned table in one shot
(df.write
   .mode("overwrite")                # overwrite any old data & metadata
   .partitionBy("year","month","day")# partition columns
   .format("parquet")
   .saveAsTable("fact_cdr")          # <=== managed table in your Hive warehouse
)

# 4) Quick sanity-check: show a few partitions
print("👉 Partitions:")
spark.sql("SHOW PARTITIONS fact_cdr LIMIT 10").show(truncate=False)

# 5) Count total rows
cnt = spark.table("fact_cdr").count()
print(f"✅ Loaded fact_cdr – total rows = {cnt:,}")


AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `Raw`.`raw_cdr_enhanced` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.;
'UnresolvedRelation [Raw, raw_cdr_enhanced], [], false


In [4]:
# First, let's analyze the data to determine optimal partitioning
sample_df = spark.read.parquet("/user/hive/warehouse/Raw/raw_cdr_enhanced/")
sample_df.createOrReplaceTempView("sample_cdr")

# Check date range for partitioning
date_range = spark.sql("""
    SELECT 
        DATE(MIN(timestamp)) as min_date,
        DATE(MAX(timestamp)) as max_date,
        DATEDIFF(DATE(MAX(timestamp)), DATE(MIN(timestamp))) as total_days
    FROM sample_cdr
""").collect()[0]

print(f"Date Range: {date_range['min_date']} to {date_range['max_date']} ({date_range['total_days']} days)")

# Create partitioned CDR table
spark.sql("DROP TABLE IF EXISTS fact_cdr")

create_cdr_table_sql = """
CREATE EXTERNAL TABLE IF NOT EXISTS fact_cdr (
    cdr_id STRING,
    timestamp TIMESTAMP,
    customer_id STRING,
    connection_id STRING,
    wilaya_code STRING,
    wilaya_name STRING,
    cdr_type STRING,
    service_type STRING,
    data_volume_mb DOUBLE,
    duration_minutes DOUBLE,
    session_quality STRING,
    usage_type STRING,
    offer_name STRING,
    customer_type STRING,
    bandwidth_mbps DOUBLE,
    data_cap_gb DOUBLE,
    anomaly_type STRING,
    severity STRING,
    outage_duration_hours DOUBLE,
    outage_type STRING,
    impact_level STRING,
    estimated_loss_da DOUBLE,
    affected_services STRING,
    old_offer STRING,
    new_offer STRING,
    old_price DOUBLE,
    new_price DOUBLE,
    change_reason STRING,
    upgrade_type STRING,
    upgrade_price DOUBLE,
    speed_multiplier DOUBLE,
    duration_days DOUBLE,
    original_bandwidth DOUBLE,
    boosted_bandwidth DOUBLE
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/at_generated_cdr.db/fact_cdr/'
"""

spark.sql(create_cdr_table_sql)
print("✅ CDR fact table created with daily partitions")


25/07/06 15:10:10 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

Date Range: 2025-03-20 to 2025-07-19 (121 days)
✅ CDR fact table created with daily partitions


                                                                                

## 5. Load CDR Data into Partitioned Table

In [21]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Register fact_cdr External Table") \
    .enableHiveSupport() \
    .getOrCreate()

cdr_path   = "/user/hive/warehouse/Raw/raw_cdr_enhanced"
table_name = "fact_cdr"

# 1) Drop any old definition
spark.sql(f"DROP TABLE IF EXISTS {table_name}")

# 2) Read recursively so Spark can infer schema from any file in that folder:
sample_df = (
    spark.read
         .option("recursiveFileLookup", "true")
         .parquet(cdr_path)
         .limit(1)
)
schema = sample_df.schema

# 3) Build a DDL column list
cols_ddl = ",\n  ".join(f"{f.name} {f.dataType.simpleString()}" for f in schema.fields)

# 4) Create the external table
ddl = f"""
CREATE EXTERNAL TABLE {table_name} (
  {cols_ddl}
)
STORED AS PARQUET
LOCATION '{cdr_path}'
"""
print("Running DDL:\n", ddl)
spark.sql(ddl)

# 5) Repair partitions
spark.sql(f"MSCK REPAIR TABLE {table_name}")

# 6) Verify
print("Total rows:", spark.table(table_name).count())


25/07/07 12:08:34 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


AnalysisException: [UNABLE_TO_INFER_SCHEMA] Unable to infer schema for Parquet. It must be specified manually.

In [20]:
from pyspark.sql import SparkSession
import os, glob

spark = SparkSession.builder \
    .appName("Create fact_cdr External Table") \
    .enableHiveSupport() \
    .getOrCreate()

cdr_path   = "/user/hive/warehouse/Raw/raw_cdr_enhanced"
table_name = "fact_cdr"

print(f"🧹 Dropping any pre‐existing `{table_name}`…")
spark.sql(f"DROP TABLE IF EXISTS {table_name}")

# Glob your parquet files (assumes you're running on a node with local HDFS mount)
# If that doesn't work, point to one known file, e.g. ".../cdr_20250718_to_20250719.parquet"
local_files = glob.glob(os.path.join(cdr_path, "*.parquet"))
if not local_files:
    raise RuntimeError(f"No parquet files found in {cdr_path}")

sample_file = local_files[0]
print(f"📂 Reading sample Parquet to infer schema:\n   {sample_file}")
sample_df = spark.read.parquet(sample_file)
schema    = sample_df.schema
print(f"📐 Inferred schema with {len(schema.fields)} fields:")
sample_df.printSchema()

print(f"\n🚀 Creating empty DataFrame with that schema…")
empty_df = spark.createDataFrame([], schema)

print(f"🚀 Writing empty DataFrame as external Hive table `{table_name}`…")
(
    empty_df.write
        .mode("overwrite")
        .format("parquet")
        .option("path", cdr_path)    # point at existing parquet dir
        .saveAsTable(table_name)     # register in the Hive metastore
)

print("✅ Table created! Repairing partitions…")
spark.sql(f"MSCK REPAIR TABLE {table_name}")

# -- Verification --
print("\n📊 Total record count in `fact_cdr`:")
total = spark.table(table_name).count()
print(f"➡️  {total:,} rows")

print("\n📅 Date summary:")
spark.table(table_name) \
     .selectExpr(
         "MIN(timestamp) AS first_ts",
         "MAX(timestamp) AS last_ts",
         "COUNT(DISTINCT TO_DATE(timestamp)) AS unique_days"
     ) \
     .show(truncate=False)

print("\n📈 Monthly distribution (year/month → rows):")
spark.sql(f"""
    SELECT
      YEAR(timestamp) AS year,
      MONTH(timestamp) AS month,
      COUNT(*)          AS records
    FROM {table_name}
    GROUP BY YEAR(timestamp), MONTH(timestamp)
    ORDER BY year, month
""").show()


🧹 Dropping any pre‐existing `fact_cdr`…


RuntimeError: No parquet files found in /user/hive/warehouse/Raw/raw_cdr_enhanced

 ## 6. Create Optimized Views for Common Queries

In [None]:
# View for daily aggregated metrics
spark.sql("DROP VIEW IF EXISTS v_daily_metrics")

daily_metrics_view = """
CREATE VIEW IF NOT EXISTS v_daily_metrics AS
SELECT 
    DATE(timestamp) as date,
    wilaya_code,
    wilaya_name,
    service_type,
    customer_type,
    COUNT(DISTINCT customer_id) as unique_customers,
    COUNT(*) as total_cdrs,
    SUM(CASE WHEN cdr_type = 'DATA' THEN 1 ELSE 0 END) as data_sessions,
    SUM(CASE WHEN cdr_type = 'DATA' THEN data_volume_mb ELSE 0 END) as total_data_mb,
    SUM(CASE WHEN cdr_type = 'DATA' THEN duration_minutes ELSE 0 END) as total_duration_minutes,
    AVG(CASE WHEN cdr_type = 'DATA' THEN data_volume_mb ELSE NULL END) as avg_session_data_mb,
    COUNT(DISTINCT CASE WHEN cdr_type = 'ANOMALY' THEN customer_id END) as customers_with_anomalies,
    COUNT(CASE WHEN cdr_type = 'OUTAGE' THEN 1 END) as outage_events,
    SUM(CASE WHEN cdr_type = 'OUTAGE' THEN outage_duration_hours ELSE 0 END) as total_outage_hours
FROM fact_cdr
GROUP BY DATE(timestamp), wilaya_code, wilaya_name, service_type, customer_type
"""

spark.sql(daily_metrics_view)
print("✅ Daily metrics view created")

# View for customer usage summary
spark.sql("DROP VIEW IF EXISTS v_customer_usage_summary")

customer_usage_view = """
CREATE VIEW IF NOT EXISTS v_customer_usage_summary AS
SELECT 
    c.customer_id,
    c.customer_type,
    c.service_type,
    c.offer_name,
    c.bandwidth_mbps,
    c.wilaya_name,
    COUNT(f.cdr_id) as total_sessions,
    SUM(CASE WHEN f.cdr_type = 'DATA' THEN f.data_volume_mb ELSE 0 END) as total_data_mb,
    AVG(CASE WHEN f.cdr_type = 'DATA' THEN f.data_volume_mb ELSE NULL END) as avg_session_mb,
    SUM(CASE WHEN f.cdr_type = 'DATA' THEN f.duration_minutes ELSE 0 END) as total_minutes,
    COUNT(DISTINCT DATE(f.timestamp)) as active_days,
    COUNT(CASE WHEN f.cdr_type = 'ANOMALY' THEN 1 END) as anomaly_count,
    MAX(f.timestamp) as last_activity
FROM dim_customers c
LEFT JOIN fact_cdr f ON c.customer_id = f.customer_id
WHERE c.is_active = true
GROUP BY c.customer_id, c.customer_type, c.service_type, c.offer_name, 
         c.bandwidth_mbps, c.wilaya_name
"""

spark.sql(customer_usage_view)
print("✅ Customer usage summary view created")

## 7. Create Aggregated Tables for Performance

In [None]:
# 

# %%
# Hourly aggregated table for real-time analytics
spark.sql("DROP TABLE IF EXISTS agg_hourly_metrics")

hourly_agg_sql = """
CREATE TABLE IF NOT EXISTS agg_hourly_metrics AS
SELECT 
    DATE(timestamp) as date,
    HOUR(timestamp) as hour,
    wilaya_code,
    service_type,
    COUNT(DISTINCT customer_id) as unique_customers,
    COUNT(CASE WHEN cdr_type = 'DATA' THEN 1 END) as data_sessions,
    SUM(CASE WHEN cdr_type = 'DATA' THEN data_volume_mb ELSE 0 END) as total_data_gb,
    AVG(CASE WHEN cdr_type = 'DATA' THEN data_volume_mb ELSE NULL END) as avg_session_mb,
    PERCENTILE_APPROX(CASE WHEN cdr_type = 'DATA' THEN data_volume_mb ELSE NULL END, 0.5) as median_session_mb,
    MAX(CASE WHEN cdr_type = 'DATA' THEN data_volume_mb ELSE 0 END) as max_session_mb,
    COUNT(CASE WHEN cdr_type = 'ANOMALY' THEN 1 END) as anomaly_count,
    COUNT(CASE WHEN cdr_type = 'OUTAGE' THEN 1 END) as outage_count,
    year,
    month
FROM fact_cdr
GROUP BY DATE(timestamp), HOUR(timestamp), wilaya_code, service_type, year, month
"""

spark.sql(hourly_agg_sql)
print("✅ Hourly aggregated metrics table created")

# Network quality impact table
spark.sql("DROP TABLE IF EXISTS agg_network_quality_impact")

network_quality_sql = """
CREATE TABLE IF NOT EXISTS agg_network_quality_impact AS
SELECT 
    c.wilaya_code,
    c.wilaya_name,
    c.network_quality_score,
    CASE 
        WHEN c.network_quality_score >= 0.9 THEN 'Excellent'
        WHEN c.network_quality_score >= 0.8 THEN 'Good'
        WHEN c.network_quality_score >= 0.6 THEN 'Fair'
        ELSE 'Poor'
    END as quality_category,
    COUNT(DISTINCT c.customer_id) as customer_count,
    AVG(c.offer_price) as avg_revenue,
    COUNT(f.cdr_id) as total_cdrs,
    SUM(CASE WHEN f.cdr_type = 'DATA' THEN f.data_volume_mb ELSE 0 END) as total_data_mb,
    COUNT(CASE WHEN f.cdr_type = 'ANOMALY' THEN 1 END) as anomaly_count,
    COUNT(CASE WHEN f.cdr_type = 'OUTAGE' THEN 1 END) as outage_impact
FROM dim_customers c
LEFT JOIN fact_cdr f ON c.customer_id = f.customer_id
GROUP BY c.wilaya_code, c.wilaya_name, c.network_quality_score
"""

spark.sql(network_quality_sql)
print("✅ Network quality impact table created")


## 8. Create Indexes and Statistics

In [None]:
# 

# %%
# Compute statistics for query optimization
print("Computing table statistics...")

# Analyze tables
spark.sql("ANALYZE TABLE dim_customers COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE fact_cdr PARTITION(year, month, day) COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE agg_hourly_metrics COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE agg_network_quality_impact COMPUTE STATISTICS")

print("✅ Table statistics computed")

# Create bloom filter indexes for frequently filtered columns (if supported)
try:
    spark.sql("""
        CREATE BLOOMFILTER INDEX idx_customer_id 
        ON TABLE fact_cdr (customer_id) 
        OPTIONS (numBits=1000000, numHashFunctions=5)
    """)
    print("✅ Bloom filter index created")
except:
    print("ℹ️ Bloom filter indexes not supported in this Hive version")

## 9. Create Materialized Views for Complex Queries

In [None]:
# 


# Customer behavior patterns view
spark.sql("DROP VIEW IF EXISTS v_customer_behavior_patterns")

behavior_patterns_sql = """
CREATE VIEW IF NOT EXISTS v_customer_behavior_patterns AS
WITH hourly_usage AS (
    SELECT 
        customer_id,
        HOUR(timestamp) as hour,
        AVG(CASE WHEN cdr_type = 'DATA' THEN data_volume_mb ELSE 0 END) as avg_data_mb
    FROM fact_cdr
    WHERE cdr_type = 'DATA'
    GROUP BY customer_id, HOUR(timestamp)
),
usage_patterns AS (
    SELECT 
        customer_id,
        CASE 
            WHEN MAX(CASE WHEN hour BETWEEN 20 AND 23 THEN avg_data_mb ELSE 0 END) > 
                 AVG(avg_data_mb) * 2 THEN 'Evening Heavy'
            WHEN MAX(CASE WHEN hour BETWEEN 9 AND 17 THEN avg_data_mb ELSE 0 END) > 
                 AVG(avg_data_mb) * 2 THEN 'Business Hours'
            WHEN MAX(CASE WHEN hour BETWEEN 0 AND 5 THEN avg_data_mb ELSE 0 END) > 
                 AVG(avg_data_mb) * 2 THEN 'Night Owl'
            ELSE 'Regular'
        END as usage_pattern
    FROM hourly_usage
    GROUP BY customer_id
)
SELECT 
    c.*,
    up.usage_pattern,
    cs.total_sessions,
    cs.total_data_mb,
    cs.anomaly_count
FROM dim_customers c
LEFT JOIN usage_patterns up ON c.customer_id = up.customer_id
LEFT JOIN v_customer_usage_summary cs ON c.customer_id = cs.customer_id
"""

spark.sql(behavior_patterns_sql)
print("✅ Customer behavior patterns view created")

# ## 10. Create Special Analytics Tables

In [None]:
# Anomaly detection summary table
spark.sql("DROP TABLE IF EXISTS analytics_anomaly_summary")

anomaly_summary_sql = """
CREATE TABLE IF NOT EXISTS analytics_anomaly_summary AS
SELECT 
    DATE(timestamp) as date,
    wilaya_code,
    anomaly_type,
    severity,
    COUNT(*) as anomaly_count,
    COUNT(DISTINCT customer_id) as affected_customers,
    AVG(data_volume_mb) as avg_anomaly_volume,
    COLLECT_LIST(customer_id) as sample_customers
FROM fact_cdr
WHERE cdr_type = 'ANOMALY'
GROUP BY DATE(timestamp), wilaya_code, anomaly_type, severity
"""

spark.sql(anomaly_summary_sql)
print("✅ Anomaly summary table created")

# Revenue impact analysis table
spark.sql("DROP TABLE IF EXISTS analytics_revenue_impact")

revenue_impact_sql = """
CREATE TABLE IF NOT EXISTS analytics_revenue_impact AS
SELECT 
    c.customer_type,
    c.service_type,
    c.value_segment,
    COUNT(DISTINCT c.customer_id) as customer_count,
    SUM(c.offer_price) as total_monthly_revenue,
    AVG(usage.total_data_mb) as avg_monthly_data_mb,
    COUNT(DISTINCT CASE WHEN usage.anomaly_count > 0 THEN c.customer_id END) as customers_with_issues,
    AVG(CASE WHEN NOT c.is_active THEN c.offer_price ELSE 0 END) as churned_revenue
FROM (

In [4]:
# # %% [markdown]
# # # 🆘 SURVIVAL MODE: Hive Tables Creation (Fixed Version)
# # 
# # **Strategy**: Use DataFrames to register tables, avoid schema issues
# # **Time Budget**: 20 minutes max

# # %%
# import sys
# sys.path.append('/home/jovyan/work/batch/jupyter/notebooks/work/scripts')
# from spark_init import init_spark
# from pyspark.sql import functions as F
# import time

# # Initialize Spark
# spark = init_spark("Hive Tables SURVIVAL MODE - Fixed")
# print("✅ SparkSession initialized")
# print(f"⚠️ C: Drive Space Critical - Using minimal approach!")

# %% [markdown]
# ## 1. Create Database

# %%
# Create database
spark.sql("CREATE DATABASE IF NOT EXISTS at_cdr_analysis")
spark.sql("USE at_cdr_analysis")
print("✅ Database created/selected")

# %% [markdown]
# ## 2. Register Tables Using DataFrames (Avoids Schema Issues)

# %%
print("📋 Registering customer dimension table...")

# Read the parquet files directly
customer_df = spark.read.parquet("/user/hive/warehouse/Raw/customer_dim_enhanced/")

# Register as a table (this handles schema automatically)
customer_df.createOrReplaceTempView("dim_customers_temp")

# Create permanent table from the temp view
spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_customers
    USING PARQUET
    LOCATION '/user/hive/warehouse/Raw/customer_dim_enhanced/'
    AS SELECT * FROM dim_customers_temp WHERE 1=0
""")

# Drop temp view
spark.sql("DROP VIEW IF EXISTS dim_customers_temp")
print("✅ Customer dimension table registered")
print(f"   Records: {customer_df.count():,}")

# %% [markdown]
# ## 3. Register CDR Table

# %%
print("\n📋 Registering CDR table...")

# Read CDR parquet files
cdr_df = spark.read.parquet("/user/hive/warehouse/Raw/raw_cdr_enhanced/")

# Register as a view first (fastest approach)
cdr_df.createOrReplaceTempView("fact_cdr_raw")

# Create a view with computed columns
spark.sql("DROP VIEW IF EXISTS fact_cdr")
spark.sql("""
CREATE VIEW IF NOT EXISTS fact_cdr AS
SELECT 
    *,
    YEAR(timestamp) as year,
    MONTH(timestamp) as month,
    DAY(timestamp) as day,
    HOUR(timestamp) as hour
FROM fact_cdr_raw
""")

cdr_count = cdr_df.count()
print("✅ CDR tables registered")
print(f"   Records: {cdr_count:,}")

# %% [markdown]
# ## 4. Create Essential Views Only

# %%
print("\n📊 Creating essential views...")

# Daily summary view (lightweight)
spark.sql("DROP VIEW IF EXISTS v_daily_summary")
spark.sql("""
CREATE VIEW IF NOT EXISTS v_daily_summary AS
SELECT 
    DATE(timestamp) as date,
    wilaya_name,
    service_type,
    customer_type,
    COUNT(DISTINCT customer_id) as unique_customers,
    COUNT(*) as total_cdrs,
    SUM(CASE WHEN cdr_type = 'DATA' THEN data_volume_mb ELSE 0 END) as total_data_mb,
    COUNT(CASE WHEN cdr_type = 'ANOMALY' THEN 1 END) as anomaly_count
FROM fact_cdr
GROUP BY DATE(timestamp), wilaya_name, service_type, customer_type
""")
print("✅ Daily summary view created")

# Customer profile view
spark.sql("DROP VIEW IF EXISTS v_customer_profile")
spark.sql("""
CREATE VIEW IF NOT EXISTS v_customer_profile AS
SELECT 
    customer_id,
    wilaya_name,
    service_type,
    customer_type,
    MIN(timestamp) as first_activity,
    MAX(timestamp) as last_activity,
    COUNT(*) as total_sessions,
    SUM(CASE WHEN cdr_type = 'DATA' THEN data_volume_mb ELSE 0 END) as total_data_mb,
    COUNT(CASE WHEN cdr_type = 'ANOMALY' THEN 1 END) as anomaly_count
FROM fact_cdr
GROUP BY customer_id, wilaya_name, service_type, customer_type
""")
print("✅ Customer profile view created")

# %% [markdown]
# ## 5. Create ONE Small Summary Table

# %%
print("\n⚡ Creating essential summary table...")

# Create a small daily metrics table
daily_metrics = spark.sql("""
SELECT 
    DATE(timestamp) as date,
    COUNT(DISTINCT customer_id) as customers,
    COUNT(*) as total_cdrs,
    ROUND(SUM(CASE WHEN cdr_type = 'DATA' THEN data_volume_mb ELSE 0 END) / 1024, 2) as data_gb,
    COUNT(CASE WHEN cdr_type = 'ANOMALY' THEN 1 END) as anomalies,
    COUNT(CASE WHEN cdr_type = 'OUTAGE' THEN 1 END) as outages
FROM fact_cdr
GROUP BY DATE(timestamp)
ORDER BY date
""")

# Convert to pandas and save (small file)
daily_metrics_pd = daily_metrics.toPandas()
daily_metrics_pd.to_csv("/mnt/d/daily_metrics.csv", index=False)
print(f"✅ Daily metrics saved ({len(daily_metrics_pd)} rows)")

# %% [markdown]
# ## 6. Quick Validation

# %%
print("\n🔍 Quick validation...")

# Date range check
date_info = spark.sql("""
SELECT 
    MIN(timestamp) as min_date,
    MAX(timestamp) as max_date,
    COUNT(DISTINCT DATE(timestamp)) as unique_days,
    COUNT(DISTINCT customer_id) as unique_customers
FROM fact_cdr
""").collect()[0]

print(f"📅 Date range: {date_info['min_date']} to {date_info['max_date']}")
print(f"📊 Days: {date_info['unique_days']}, Customers: {date_info['unique_customers']:,}")

# Data distribution
print("\n📊 CDR Type Distribution:")
spark.sql("""
SELECT 
    cdr_type,
    COUNT(*) as count,
    ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM fact_cdr), 2) as percentage
FROM fact_cdr
GROUP BY cdr_type
ORDER BY count DESC
""").show()

# %% [markdown]
# ## 7. Save Key Metrics for Presentation

# %%
print("\n💾 Saving presentation metrics...")

# Collect key metrics
metrics = {
    "total_records": cdr_count,
    "unique_customers": date_info['unique_customers'],
    "date_range": f"{date_info['min_date']} to {date_info['max_date']}",
    "unique_days": date_info['unique_days'],
    "data_volume_tb": spark.sql("SELECT ROUND(SUM(data_volume_mb)/1024/1024, 2) FROM fact_cdr WHERE cdr_type='DATA'").collect()[0][0],
    "anomaly_percentage": spark.sql("SELECT ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM fact_cdr), 2) FROM fact_cdr WHERE cdr_type='ANOMALY'").collect()[0][0]
}

# Save as JSON
import json
with open("/mnt/d/key_metrics.json", "w") as f:
    json.dump(metrics, f, indent=2, default=str)

print("✅ Metrics saved to D: drive")
for k, v in metrics.items():
    print(f"   {k}: {v}")

# %% [markdown]
# ## Summary

# %%
print("\n" + "="*50)
print("🎉 SURVIVAL MODE COMPLETE - READY FOR ANALYSIS!")
print("="*50)

print("\n✅ What we created:")
print("  - fact_cdr view: Your main CDR data")
print("  - dim_customers: Customer dimension") 
print("  - v_daily_summary: Pre-aggregated daily view")
print("  - v_customer_profile: Customer-level summary")
print("  - daily_metrics.csv: Small file for PowerBI")
print("  - key_metrics.json: Presentation numbers")

print("\n📊 Data available:")
print(f"  - {cdr_count:,} CDR records")
print(f"  - {date_info['unique_customers']:,} unique customers")
print(f"  - {date_info['unique_days']} days of data")

print("\n⚠️ CRITICAL REMINDERS:")
print("  ❌ Do NOT create large tables")
print("  ❌ Do NOT use .write.partitionBy()")
print("  ✅ Use views for queries")
print("  ✅ Save only small aggregated results")
print("  ✅ Work directly with DataFrames when possible")

print("\n🚀 Next: Run anomaly detection notebook!")

# Clean cache
spark.catalog.clearCache()
print("\n🧹 Cache cleared!")

✅ Database created/selected
📋 Registering customer dimension table...
✅ Customer dimension table registered


                                                                                

   Records: 530,719

📋 Registering CDR table...


25/07/07 18:10:52 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


AnalysisException: [INVALID_TEMP_OBJ_REFERENCE] Cannot create the persistent object `spark_catalog`.`at_cdr_analysis`.`fact_cdr` of the type VIEW because it references to the temporary object `fact_cdr_raw` of the type VIEW. Please make the temporary object `fact_cdr_raw` persistent, or make the persistent object `spark_catalog`.`at_cdr_analysis`.`fact_cdr` temporary.

In [3]:
# ## 1. Register All DataFrames as Temporary Views

# %%
# Customers dimension: register as temp view
customer_df = spark.read.parquet("/user/hive/warehouse/Raw/customer_dim_enhanced/")
customer_df.createOrReplaceTempView("dim_customers")
print("✅ TempView registered: dim_customers")
print(f"   Records: {customer_df.count():,}")

# CDR facts: register as temp view
cdr_df = spark.read.parquet("/user/hive/warehouse/Raw/raw_cdr_enhanced/")
cdr_df.createOrReplaceTempView("fact_cdr_raw")
print("✅ TempView registered: fact_cdr_raw")
print(f"   Records: {cdr_df.count():,}")

                                                                                

✅ TempView registered: dim_customers


25/07/07 19:57:48 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


   Records: 530,719
✅ TempView registered: fact_cdr_raw
   Records: 768,359,379


                                                                                

In [4]:
# ## 2. Analytical Temp Views (with Columns for Fast Analysis)

# %%
# Main fact CDR view with extra columns
spark.sql("DROP VIEW IF EXISTS fact_cdr")
spark.sql("""
CREATE OR REPLACE TEMP VIEW fact_cdr AS
SELECT *,
    YEAR(timestamp)  as year,
    MONTH(timestamp) as month,
    DAY(timestamp)   as day,
    HOUR(timestamp)  as hour
FROM fact_cdr_raw
""")
print("✅ TempView: fact_cdr (core analytical view)")


✅ TempView: fact_cdr (core analytical view)


In [5]:
# ## 3. Aggregated Analytical Views (Daily, Profile)

# %%
# Daily summary (by wilaya/service/customer type)
spark.sql("DROP VIEW IF EXISTS v_daily_summary")
spark.sql("""
CREATE OR REPLACE TEMP VIEW v_daily_summary AS
SELECT 
    DATE(timestamp) as date,
    wilaya_name,
    service_type,
    customer_type,
    COUNT(DISTINCT customer_id) as unique_customers,
    COUNT(*) as total_cdrs,
    SUM(CASE WHEN cdr_type = 'DATA' THEN data_volume_mb ELSE 0 END) as total_data_mb,
    COUNT(CASE WHEN cdr_type = 'ANOMALY' THEN 1 END) as anomaly_count
FROM fact_cdr
GROUP BY DATE(timestamp), wilaya_name, service_type, customer_type
""")
print("✅ TempView: v_daily_summary")


# Customer profile summary
spark.sql("DROP VIEW IF EXISTS v_customer_profile")
spark.sql("""
CREATE OR REPLACE TEMP VIEW v_customer_profile AS
SELECT 
    customer_id,
    wilaya_name,
    service_type,
    customer_type,
    MIN(timestamp) as first_activity,
    MAX(timestamp) as last_activity,
    COUNT(*) as total_sessions,
    SUM(CASE WHEN cdr_type = 'DATA' THEN data_volume_mb ELSE 0 END) as total_data_mb,
    COUNT(CASE WHEN cdr_type = 'ANOMALY' THEN 1 END) as anomaly_count
FROM fact_cdr
GROUP BY customer_id, wilaya_name, service_type, customer_type
""")
print("✅ TempView: v_customer_profile")

✅ TempView: v_daily_summary
✅ TempView: v_customer_profile


In [6]:
# ## 4. Daily Metrics Table (Exported for BI Tools)

# %%
# Only export *small* result tables!
daily_metrics = spark.sql("""
SELECT 
    DATE(timestamp) as date,
    COUNT(DISTINCT customer_id) as customers,
    COUNT(*) as total_cdrs,
    ROUND(SUM(CASE WHEN cdr_type = 'DATA' THEN data_volume_mb ELSE 0 END) / 1024, 2) as data_gb,
    COUNT(CASE WHEN cdr_type = 'ANOMALY' THEN 1 END) as anomalies,
    COUNT(CASE WHEN cdr_type = 'OUTAGE' THEN 1 END) as outages
FROM fact_cdr
GROUP BY DATE(timestamp)
ORDER BY date
""")
daily_metrics_pd = daily_metrics.toPandas()
daily_metrics_pd.to_csv("/mnt/d/daily_metrics.csv", index=False)
print(f"✅ Exported: daily_metrics.csv ({len(daily_metrics_pd)} rows)")

                                                                                

✅ Exported: daily_metrics.csv (122 rows)


In [7]:
# ## 5. Validation & Quick Analysis

# %%
# Basic stats
cdr_count = cdr_df.count()
date_info = spark.sql("""
SELECT 
    MIN(timestamp) as min_date,
    MAX(timestamp) as max_date,
    COUNT(DISTINCT DATE(timestamp)) as unique_days,
    COUNT(DISTINCT customer_id) as unique_customers
FROM fact_cdr
""").collect()[0]

print(f"📅 Date range: {date_info['min_date']} to {date_info['max_date']}")
print(f"📊 Days: {date_info['unique_days']}, Customers: {date_info['unique_customers']:,}")

# CDR Type breakdown
print("\n📊 CDR Type Distribution:")
spark.sql("""
SELECT 
    cdr_type,
    COUNT(*) as count,
    ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM fact_cdr), 2) as percentage
FROM fact_cdr
GROUP BY cdr_type
ORDER BY count DESC
""").show(truncate=False)

                                                                                

📅 Date range: 2025-03-20 00:00:00 to 2025-07-19 23:59:59
📊 Days: 122, Customers: 519,912

📊 CDR Type Distribution:


                                                                                

+------------+---------+----------+
|cdr_type    |count    |percentage|
+------------+---------+----------+
|DATA        |757223524|98.55     |
|ANOMALY     |10646236 |1.39      |
|OUTAGE      |372880   |0.05      |
|PLAN_CHANGE |62832    |0.01      |
|TEMP_UPGRADE|53907    |0.01      |
+------------+---------+----------+



In [8]:
# ## 6. Export Key Metrics for Reporting
import json 
# %%
metrics = {
    "total_records": cdr_count,
    "unique_customers": date_info['unique_customers'],
    "date_range": f"{date_info['min_date']} to {date_info['max_date']}",
    "unique_days": date_info['unique_days'],
    "data_volume_tb": spark.sql("SELECT ROUND(SUM(data_volume_mb)/1024/1024, 2) FROM fact_cdr WHERE cdr_type='DATA'").collect()[0][0],
    "anomaly_percentage": spark.sql("SELECT ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM fact_cdr), 2) FROM fact_cdr WHERE cdr_type='ANOMALY'").collect()[0][0]
}
with open("/mnt/d/key_metrics.json", "w") as f:
    json.dump(metrics, f, indent=2, default=str)
print("✅ Exported: key_metrics.json")
for k, v in metrics.items():
    print(f"   {k}: {v}")




✅ Exported: key_metrics.json
   total_records: 768359379
   unique_customers: 519912
   date_range: 2025-03-20 00:00:00 to 2025-07-19 23:59:59
   unique_days: 122
   data_volume_tb: 119046.54
   anomaly_percentage: 1.39


                                                                                

In [11]:
# ## 7. Survival Mode Completion

# %%
print("\n" + "="*50)
print("🎉 SURVIVAL MODE COMPLETE - SESSION IS EPHEMERAL!")
print("="*50)
print("\n✅ What’s available now:")
print("  - fact_cdr: Main CDR analytic view (temporary)")
print("  - dim_customers: Customer dimension (temporary)")
print("  - v_daily_summary: Daily aggregate view")
print("  - v_customer_profile: Per-customer summary")
print("  - daily_metrics.csv: Ready for BI/PowerBI/Superset")
print("  - key_metrics.json: Quick reporting")

print("\n📊 Data:")
print(f"  - {cdr_count:,} CDR records")
print(f"  - {date_info['unique_customers']:,} unique customers")
print(f"  - {date_info['unique_days']} days covered")

print("\n⚠️ REMINDERS:")
print("  - No Hive tables/views created (safe to re-run, no warehouse bloat)")
print("  - Everything disappears after Spark stops")
print("  - Export only small Pandas files for presentations/BI")

# Clean Spark cache
spark.catalog.clearCache()
print("🧹 Cache cleared. Survival mode session complete!")


🎉 SURVIVAL MODE COMPLETE - SESSION IS EPHEMERAL!

✅ What’s available now:
  - fact_cdr: Main CDR analytic view (temporary)
  - dim_customers: Customer dimension (temporary)
  - v_daily_summary: Daily aggregate view
  - v_customer_profile: Per-customer summary
  - daily_metrics.csv: Ready for BI/PowerBI/Superset
  - key_metrics.json: Quick reporting

📊 Data:
  - 768,359,379 CDR records
  - 519,912 unique customers
  - 122 days covered

⚠️ REMINDERS:
  - No Hive tables/views created (safe to re-run, no warehouse bloat)
  - Everything disappears after Spark stops
  - Export only small Pandas files for presentations/BI
🧹 Cache cleared. Survival mode session complete!


25/07/07 19:56:14 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: Master removed our application: KILLED
25/07/07 19:56:14 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exiting due to error from cluster scheduler: Master removed our application: KILLED
	at org.apache.spark.errors.SparkCoreErrors$.clusterSchedulerError(SparkCoreErrors.scala:291)
	at org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:981)
	at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.dead(StandaloneSchedulerBackend.scala:165)
	at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint.markDead(StandaloneAppClient.scala:263)
	at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(StandaloneAppClient.scala:170)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.proce