# 02. Create Hive Tables – CDR Data Management

### Project: CDR Telecom Big Data Engineering Final Year Internship
### Objective: Create robust, query-optimized Hive tables and views for anonymized CDR data.

#### Table Creation Strategy

1- External Tables: Point to Parquet data in HDFS

2- Partitioning: By date for scalable analytics

3- Schema: Strong types, data cleanliness

4- Views: For BI tools and analytics

5- Data Quality: Monitoring for row count, nulls, uniqueness

In [1]:
# ------------------------------------------------------------
# Cell 1 – Setup, Imports, and Config
# ------------------------------------------------------------
import sys, os
from datetime import datetime
from pyspark.sql import functions as F, types as T

# --- Spark Project Init ---
sys.path.append('/home/jovyan/work/work/scripts')  # Path to your custom scripts
from spark_init import init_spark

spark = init_spark("CDR Data Engineering - Creating Hive Tables")

# HDFS and Hive Paths
HDFS_ANON_PATH = "/user/hive/warehouse/cdr_anonymized/"
DATABASE_NAME = "algerie_telecom_cdr"
MAIN_TABLE = "cdr_anonymized"
DAILY_TABLE = "cdr_daily_summary"
HOURLY_TABLE = "cdr_hourly_summary"
NETWORK_TABLE = "cdr_network_metrics"

print(f"🏗️ Table Creation Configuration:")
print(f"   Database: {DATABASE_NAME}")
print(f"   Main table: {MAIN_TABLE}")
print(f"   Data location: hdfs://namenode:9000{HDFS_ANON_PATH}")
print(f"   Spark Version: {spark.version}")


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/06/19 06:52:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✅ SparkSession initialized (App: CDR Data Engineering - Creating Hive Tables, Spark: 3.5.1)
✅ Hive Warehouse: hdfs://namenode:9000/user/hive/warehouse
✅ Hive Metastore URI: thrift://hive-metastore:9083
🏗️ Table Creation Configuration:
   Database: algerie_telecom_cdr
   Main table: cdr_anonymized
   Data location: hdfs://namenode:9000/user/hive/warehouse/cdr_anonymized/
   Spark Version: 3.5.1


#### Load and Inspect Anonymized Data

In [2]:
# ------------------------------------------------------------
# Cell 2 – Load and Inspect Anonymized Data
# ------------------------------------------------------------
anon_df = spark.read.parquet(f"hdfs://namenode:9000{HDFS_ANON_PATH}")
anon_df.cache()

total_records = anon_df.count()
total_columns = len(anon_df.columns)
hash_columns = [c for c in anon_df.columns if c.endswith("_HASH")]
regular_columns = [c for c in anon_df.columns if not c.endswith("_HASH") and c != "CDR_DAY"]

print(f"\n📊 Data Overview:")
print(f"   Total records: {total_records:,}")
print(f"   Total columns: {total_columns}")
print(f"\n📋 Hash columns (anonymized): {hash_columns}")
print(f"   Regular columns: {regular_columns}")
anon_df.printSchema()
anon_df.show(5, truncate=False)


25/06/19 06:52:20 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                


📊 Data Overview:
   Total records: 89,911
   Total columns: 39

📋 Hash columns (anonymized): ['PRI_IDENTITY_HASH', 'CallingPartyNumber_HASH', 'CalledPartyNumber_HASH', 'CallingPartyIMSI_HASH', 'CalledPartyIMSI_HASH', 'IMEI_HASH']
   Regular columns: ['CDR_ID', 'CDR_SUB_ID', 'CDR_TYPE', 'CDR_BATCH_ID', 'SRC_CDR_ID', 'START_DATE', 'END_DATE', 'CREATE_DATE', 'CUST_LOCAL_START_DATE', 'CUST_LOCAL_END_DATE', 'OBJ_ID', 'ACTUAL_USAGE', 'RATE_USAGE', 'SERVICE_UNIT_TYPE', 'SERVICE_CATEGORY', 'USAGE_SERVICE_TYPE', 'STD_EVT_TYPE_ID', 'SESSION_ID', 'DEBIT_AMOUNT', 'UN_DEBIT_AMOUNT', 'TOTAL_TAX', 'ServiceFlow', 'CallForwardIndicator', 'ChargingTime', 'CallType', 'RoamState', 'CallingRoamInfo', 'CalledRoamInfo', 'CallingCellID', 'CalledCellID', 'MSCAddress', 'BrandID']
root
 |-- CDR_ID: string (nullable = true)
 |-- CDR_SUB_ID: string (nullable = true)
 |-- CDR_TYPE: string (nullable = true)
 |-- CDR_BATCH_ID: string (nullable = true)
 |-- SRC_CDR_ID: string (nullable = true)
 |-- START_DATE: string

#### Create Database and Use It

In [3]:
# ------------------------------------------------------------
# Cell 3 – Create and Use Database
# ------------------------------------------------------------
spark.sql(f"""
    CREATE DATABASE IF NOT EXISTS {DATABASE_NAME}
    COMMENT 'Algerie Telecom CDR Data Warehouse - Final Year Project'
    LOCATION 'hdfs://namenode:9000/user/hive/warehouse/{DATABASE_NAME}.db'
""")
spark.sql(f"USE {DATABASE_NAME}")
print(f"✅ Database '{DATABASE_NAME}' created and selected")
spark.sql("SHOW DATABASES").show()


✅ Database 'algerie_telecom_cdr' created and selected
+-------------------+
|          namespace|
+-------------------+
|algerie_telecom_cdr|
|            default|
+-------------------+



#### Analyze Columns for Schema Design

In [4]:
# ------------------------------------------------------------
# Cell 4 – Analyze Columns for Schema Design
# ------------------------------------------------------------
date_columns = [c for c in anon_df.columns if "date" in c.lower() or "day" in c.lower()]
numeric_columns = [c for c, t in anon_df.dtypes if t in ("int", "bigint", "double", "float")]
string_columns = [c for c in anon_df.columns if c not in numeric_columns and c not in date_columns]

print(f"Date columns: {date_columns}")
print(f"Numeric columns: {numeric_columns}")
print(f"String columns: {string_columns}")


Date columns: ['START_DATE', 'END_DATE', 'CREATE_DATE', 'CUST_LOCAL_START_DATE', 'CUST_LOCAL_END_DATE', 'CDR_DAY']
Numeric columns: ['ACTUAL_USAGE', 'RATE_USAGE', 'DEBIT_AMOUNT', 'UN_DEBIT_AMOUNT', 'TOTAL_TAX', 'ChargingTime']
String columns: ['CDR_ID', 'CDR_SUB_ID', 'CDR_TYPE', 'CDR_BATCH_ID', 'SRC_CDR_ID', 'OBJ_ID', 'SERVICE_UNIT_TYPE', 'SERVICE_CATEGORY', 'USAGE_SERVICE_TYPE', 'STD_EVT_TYPE_ID', 'SESSION_ID', 'ServiceFlow', 'CallForwardIndicator', 'CallType', 'RoamState', 'CallingRoamInfo', 'CalledRoamInfo', 'CallingCellID', 'CalledCellID', 'MSCAddress', 'BrandID', 'PRI_IDENTITY_HASH', 'CallingPartyNumber_HASH', 'CalledPartyNumber_HASH', 'CallingPartyIMSI_HASH', 'CalledPartyIMSI_HASH', 'IMEI_HASH']


#### Create External Main Table (Parquet, Partitioned)

In [5]:
# ------------------------------------------------------------
# Cell 5 – Create Main External Table (Parquet, Partitioned)
# ------------------------------------------------------------
spark.sql(f"DROP TABLE IF EXISTS {MAIN_TABLE}")

# Partition if possible
partition_by = "CDR_DAY" if "CDR_DAY" in anon_df.columns else None

# Build CREATE TABLE columns with correct Hive types, EXCLUDING the partition column!
col_defs = []
for c, t in anon_df.dtypes:
    if c == partition_by:
        continue  # partition columns go ONLY in PARTITIONED BY
    if t.startswith("int") or t.startswith("bigint"):
        col_defs.append(f"{c} BIGINT")
    elif t.startswith("double") or t.startswith("float"):
        col_defs.append(f"{c} DOUBLE")
    elif t.startswith("date"):
        col_defs.append(f"{c} DATE")
    else:
        col_defs.append(f"{c} STRING")

# Compose the correct Hive SQL
if partition_by:
    table_sql = f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS {MAIN_TABLE} (
        {', '.join(col_defs)}
    )
    PARTITIONED BY ({partition_by} DATE)
    STORED AS PARQUET
    LOCATION 'hdfs://namenode:9000{HDFS_ANON_PATH}'
    TBLPROPERTIES (
        'parquet.compression'='SNAPPY',
        'created_by'='Algerie_Telecom_Data_Engineering',
        'created_date'='{datetime.now().strftime('%Y-%m-%d')}'
    )
    """
else:
    table_sql = f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS {MAIN_TABLE} (
        {', '.join(col_defs)}
    )
    STORED AS PARQUET
    LOCATION 'hdfs://namenode:9000{HDFS_ANON_PATH}'
    TBLPROPERTIES (
        'parquet.compression'='SNAPPY',
        'created_by'='Algerie_Telecom_Data_Engineering',
        'created_date'='{datetime.now().strftime('%Y-%m-%d')}'
    )
    """

# Run the SQL to create the table
spark.sql(table_sql)
print(f"✅ Main external table '{MAIN_TABLE}' created")

# Recover partitions if partitioned
if partition_by:
    spark.sql(f"MSCK REPAIR TABLE {MAIN_TABLE}")

spark.sql(f"DESCRIBE EXTENDED {MAIN_TABLE}").show(50, False)


25/06/19 06:52:37 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.


✅ Main external table 'cdr_anonymized' created
+----------------------------+----------------------------+-------+
|col_name                    |data_type                   |comment|
+----------------------------+----------------------------+-------+
|CDR_ID                      |string                      |NULL   |
|CDR_SUB_ID                  |string                      |NULL   |
|CDR_TYPE                    |string                      |NULL   |
|CDR_BATCH_ID                |string                      |NULL   |
|SRC_CDR_ID                  |string                      |NULL   |
|START_DATE                  |string                      |NULL   |
|END_DATE                    |string                      |NULL   |
|CREATE_DATE                 |string                      |NULL   |
|CUST_LOCAL_START_DATE       |string                      |NULL   |
|CUST_LOCAL_END_DATE         |string                      |NULL   |
|OBJ_ID                      |string                      |NULL   |
|

#### Create Aggregated Tables (Daily, Hourly, Network)

In [6]:
# ------------------------------------------------------------
# Cell 6 – Create Aggregated Tables (Daily, Hourly, Network)
# ------------------------------------------------------------

main_df = spark.table(MAIN_TABLE)

# --- CAST numerics for reliable aggregation ---
# Make a copy so we don't modify main_df in memory if not needed
agg_df = main_df

duration_col = "ACTUAL_USAGE" if "ACTUAL_USAGE" in main_df.columns else None
if duration_col and isinstance(agg_df.schema[duration_col].dataType, T.StringType):
    agg_df = agg_df.withColumn(duration_col, F.col(duration_col).cast("double"))

# Identify date/hash columns again for safety
date_col = "CDR_DAY" if "CDR_DAY" in agg_df.columns else (next((c for c in agg_df.columns if "date" in c.lower() or "day" in c.lower()), None))
hash_col = next((c for c in agg_df.columns if c.endswith("_HASH")), None)

# --- Daily Aggregation
daily_df = agg_df.groupBy(date_col).agg(
    F.count("*").alias("total_records"),
    F.countDistinct(hash_col).alias("unique_subscribers") if hash_col else F.lit(0).alias("unique_subscribers"),
    F.avg(duration_col).alias("avg_duration") if duration_col else F.lit(0).alias("avg_duration"),
    F.sum(duration_col).alias("total_duration") if duration_col else F.lit(0).alias("total_duration")
)
daily_df.write.mode("overwrite").saveAsTable(DAILY_TABLE)
print(f"✅ Daily summary table '{DAILY_TABLE}' created")
spark.sql(f"SELECT * FROM {DAILY_TABLE} LIMIT 5").show()

# --- Hourly Aggregation (if CALL_TIME exists)
if "CALL_TIME" in agg_df.columns and date_col:
    hourly_df = agg_df.withColumn(
        "call_hour",
        F.concat_ws(' ', F.col(date_col), F.col("CALL_TIME"))
    ).groupBy("call_hour").agg(
        F.count("*").alias("hourly_records"),
        F.countDistinct(hash_col).alias("hourly_unique_users") if hash_col else F.lit(0).alias("hourly_unique_users"),
        F.avg(duration_col).alias("avg_hourly_duration") if duration_col else F.lit(0).alias("avg_hourly_duration")
    )
    hourly_df.write.mode("overwrite").saveAsTable(HOURLY_TABLE)
    print(f"✅ Hourly summary table '{HOURLY_TABLE}' created")
    spark.sql(f"SELECT * FROM {HOURLY_TABLE} LIMIT 5").show()
else:
    print("⚠️ Skipping hourly aggregation: 'CALL_TIME' column not found.")

# --- Network Metrics (CallingCellID)
if "CallingCellID" in agg_df.columns:
    network_df = agg_df.groupBy("CallingCellID").agg(
        F.count("*").alias("total_calls"),
        F.countDistinct(hash_col).alias("unique_users") if hash_col else F.lit(0).alias("unique_users"),
        F.avg(duration_col).alias("avg_call_duration") if duration_col else F.lit(0).alias("avg_call_duration")
    )
    network_df.write.mode("overwrite").saveAsTable(NETWORK_TABLE)
    print(f"✅ Network metrics table '{NETWORK_TABLE}' created")
    spark.sql(f"SELECT * FROM {NETWORK_TABLE} LIMIT 5").show()
else:
    print("⚠️ Skipping network metrics: 'CallingCellID' column not found.")

                                                                                

✅ Daily summary table 'cdr_daily_summary' created
+----------+-------------+------------------+------------------+--------------+
|   CDR_DAY|total_records|unique_subscribers|      avg_duration|total_duration|
+----------+-------------+------------------+------------------+--------------+
|2024-12-31|         7181|              3924|105.00612728032307|      754049.0|
|2025-01-01|        82730|             38311| 166.0054152060921|   1.3733628E7|
+----------+-------------+------------------+------------------+--------------+

⚠️ Skipping hourly aggregation: 'CALL_TIME' column not found.
✅ Network metrics table 'cdr_network_metrics' created
+----------------+-----------+------------+------------------+
|   CallingCellID|total_calls|unique_users| avg_call_duration|
+----------------+-----------+------------+------------------+
|603093E819668914|      18441|        9664|186.92945068054877|
|            NULL|      71470|       31179|154.47755701693018|
+----------------+-----------+--------

#### Create Analytical Views

In [7]:
# ------------------------------------------------------------
# Cell 7 – Create Analytical Views
# ------------------------------------------------------------

# View 1: Peak Hours
if spark.catalog.tableExists(HOURLY_TABLE):
    spark.sql(f"""
    CREATE OR REPLACE VIEW v_peak_hours AS
    SELECT HOUR(call_hour) as hour_of_day,
           AVG(hourly_records) as avg_calls_per_hour,
           MAX(hourly_records) as max_calls_per_hour,
           COUNT(*) as total_hours_sampled
    FROM {HOURLY_TABLE}
    WHERE call_hour IS NOT NULL
    GROUP BY HOUR(call_hour)
    ORDER BY avg_calls_per_hour DESC
    """)
    print("✅ View v_peak_hours created")

# View 2: Daily Trends
if spark.catalog.tableExists(DAILY_TABLE):
    spark.sql(f"""
    CREATE OR REPLACE VIEW v_daily_trends AS
    SELECT {date_col} as call_date, total_records, unique_subscribers,
           ROUND(total_records * 1.0 / unique_subscribers, 2) as calls_per_subscriber,
           ROUND(avg_duration, 2) as avg_duration_formatted
    FROM {DAILY_TABLE}
    ORDER BY call_date DESC
    """)
    print("✅ View v_daily_trends created")

# View 3: Network Performance
if spark.catalog.tableExists(NETWORK_TABLE):
    spark.sql(f"""
    CREATE OR REPLACE VIEW v_network_performance AS
    SELECT CallingCellID, total_calls, unique_users, avg_call_duration,
           ROUND(total_calls * 1.0 / unique_users, 2) as calls_per_user,
           CASE
               WHEN total_calls > 1000 THEN 'High Traffic'
               WHEN total_calls > 100 THEN 'Medium Traffic'
               ELSE 'Low Traffic'
           END as traffic_category
    FROM {NETWORK_TABLE}
    ORDER BY total_calls DESC
    """)
    print("✅ View v_network_performance created")


✅ View v_daily_trends created
✅ View v_network_performance created


#### Data Quality Table 

In [8]:
# ------------------------------------------------------------
# Cell 8 – Data Quality Table (Optional/Recommended)
# ------------------------------------------------------------
quality_df = main_df.agg(
    F.count("*").alias("total_rows"),
    F.countDistinct(hash_col).alias("unique_hash_values") if hash_col else F.lit(0).alias("unique_hash_values"),
    F.sum(F.when(F.col(hash_col).isNull(), 1).otherwise(0)).alias("null_hash_count") if hash_col else F.lit(0).alias("null_hash_count"),
    F.sum(F.when(F.col(date_col).isNull(), 1).otherwise(0)).alias("null_date_count") if date_col else F.lit(0).alias("null_date_count")
).withColumn("check_timestamp", F.lit(datetime.now().strftime("%Y-%m-%d %H:%M:%S")))

quality_df.write.mode("overwrite").saveAsTable("data_quality_checks")
print("✅ Data quality checks table created")
spark.sql("SELECT * FROM data_quality_checks").show(truncate=False)


✅ Data quality checks table created
+----------+------------------+---------------+---------------+-------------------+
|total_rows|unique_hash_values|null_hash_count|null_date_count|check_timestamp    |
+----------+------------------+---------------+---------------+-------------------+
|89911     |40843             |0              |0              |2025-06-19 07:02:58|
+----------+------------------+---------------+---------------+-------------------+



#### Summary, Verification, and Next Steps

In [9]:
# ------------------------------------------------------------
# Cell 9 – Summary, Verification, and Next Steps
# ------------------------------------------------------------
print("="*80)
print("📋 HIVE TABLES CREATION SUMMARY REPORT")
print("="*80)
print(f"Database: {DATABASE_NAME}")
print("Tables in DB:")
spark.sql("SHOW TABLES").show(truncate=False)
print(f"Source Parquet: {HDFS_ANON_PATH}")

print("\nSample Queries:")
if spark.catalog.tableExists("v_daily_trends"):
    spark.sql("SELECT * FROM v_daily_trends LIMIT 5").show()
if spark.catalog.tableExists("v_peak_hours"):
    spark.sql("SELECT * FROM v_peak_hours LIMIT 5").show()
if spark.catalog.tableExists("v_network_performance"):
    spark.sql("SELECT * FROM v_network_performance LIMIT 5").show()

print("✅ All tables and views ready for BI and further Spark analysis.")
print("🚀 Next: Proceed to Data Engineering Processing & Trend Detection notebook.")

spark.stop()
print("✅ Spark session stopped.")



📋 HIVE TABLES CREATION SUMMARY REPORT
Database: algerie_telecom_cdr
Tables in DB:
+-------------------+---------------------+-----------+
|namespace          |tableName            |isTemporary|
+-------------------+---------------------+-----------+
|algerie_telecom_cdr|cdr_anonymized       |false      |
|algerie_telecom_cdr|cdr_daily_summary    |false      |
|algerie_telecom_cdr|cdr_network_metrics  |false      |
|algerie_telecom_cdr|v_daily_trends       |false      |
|algerie_telecom_cdr|v_network_performance|false      |
|algerie_telecom_cdr|data_quality_checks  |false      |
+-------------------+---------------------+-----------+

Source Parquet: /user/hive/warehouse/cdr_anonymized/

Sample Queries:
+----------+-------------+------------------+--------------------+----------------------+
| call_date|total_records|unique_subscribers|calls_per_subscriber|avg_duration_formatted|
+----------+-------------+------------------+--------------------+----------------------+
|2025-01-01|     

In [None]:
Genrated Mobile data 

In [6]:
# =====================================================
# QUICK FIX FOR HIVE TABLE TYPE MISMATCH
# =====================================================

import sys
sys.path.append('/home/jovyan/work/scripts')
from spark_init import init_spark

spark = init_spark("Fix Hive Table Types")

print("=" * 80)
print("🔧 FIXING HIVE TABLE TYPE MISMATCH")
print("=" * 80)

# Use database
spark.sql("USE algerie_telecom_gen")

# =====================================================
# OPTION 1: DROP AND RECREATE WITH CORRECT TYPES (FASTEST)
# =====================================================
print("\n🚀 Applying Quick Fix: Recreating table with correct types...")

# Drop the problematic partitioned table
spark.sql("DROP TABLE IF EXISTS cdr_partitioned")
print("✅ Dropped old table")

# Recreate with correct types matching your Parquet schema
spark.sql("""
    CREATE TABLE IF NOT EXISTS cdr_partitioned (
        -- String fields
        cdr_id STRING,
        subscriber_id STRING,
        msisdn STRING,
        imsi STRING,
        imei STRING,
        service_type STRING,
        service_subtype STRING,
        session_id STRING,
        calling_party STRING,
        called_party STRING,
        start_time TIMESTAMP,
        end_time TIMESTAMP,
        
        -- Fix: Use BIGINT instead of INT for these fields
        duration BIGINT,  -- Changed from INT
        signal_strength BIGINT,  -- Changed from INT
        
        -- Double fields
        data_volume_mb DOUBLE,
        upload_mb DOUBLE,
        download_mb DOUBLE,
        charging_amount DOUBLE,
        tax_amount DOUBLE,
        revenue_per_mb DOUBLE,
        quality_score DOUBLE,
        promotional_discount DOUBLE,
        
        -- String fields continued
        cell_id STRING,
        lac STRING,
        location_area STRING,
        serving_cell_tower STRING,
        network_type STRING,
        currency STRING,
        payment_type STRING,
        call_result STRING,
        customer_segment STRING,
        tariff_plan STRING,
        operator STRING,
        age_group STRING,
        gender STRING,
        roaming_country STRING,
        roaming_type STRING,
        special_offer_applied STRING,
        network_congestion_level STRING,
        time_of_day_category STRING,
        day_of_week STRING,
        application_used STRING,
        content_category STRING,
        customer_lifetime_value_category STRING,
        
        -- Boolean fields
        dropped_call_flag BOOLEAN,
        roaming_flag BOOLEAN,
        fraud_indicator BOOLEAN,
        unusual_pattern_flag BOOLEAN,
        is_weekend BOOLEAN,
        is_holiday BOOLEAN
    )
    PARTITIONED BY (year INT, month INT, day INT)
    STORED AS PARQUET
    TBLPROPERTIES (
        'compression' = 'snappy',
        'transactional' = 'false'
    )
""")

print("✅ Created new table with correct types")

# =====================================================
# RELOAD DATA WITH DYNAMIC PARTITIONING
# =====================================================
print("\n📥 Reloading data into corrected table...")

# Configure for dynamic partitioning
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("SET hive.exec.max.dynamic.partitions = 10000")
spark.sql("SET hive.exec.max.dynamic.partitions.pernode = 1000")

# Insert data - no casting needed since types now match
spark.sql("""
    INSERT OVERWRITE TABLE cdr_partitioned PARTITION(year, month, day)
    SELECT 
        cdr_id,
        subscriber_id,
        msisdn,
        imsi,
        imei,
        service_type,
        service_subtype,
        session_id,
        calling_party,
        called_party,
        CAST(start_time AS TIMESTAMP) as start_time,
        CAST(end_time AS TIMESTAMP) as end_time,
        duration,  -- No cast needed, already BIGINT
        data_volume_mb,
        upload_mb,
        download_mb,
        cell_id,
        lac,
        location_area,
        serving_cell_tower,
        network_type,
        charging_amount,
        currency,
        payment_type,
        tax_amount,
        call_result,
        quality_score,
        signal_strength,  -- No cast needed, already BIGINT
        dropped_call_flag,
        network_congestion_level,
        customer_segment,
        tariff_plan,
        operator,
        age_group,
        gender,
        fraud_indicator,
        unusual_pattern_flag,
        time_of_day_category,
        day_of_week,
        is_weekend,
        is_holiday,
        roaming_flag,
        roaming_country,
        roaming_type,
        special_offer_applied,
        promotional_discount,
        application_used,
        content_category,
        revenue_per_mb,
        customer_lifetime_value_category,
        YEAR(CAST(start_time AS TIMESTAMP)) as year,
        MONTH(CAST(start_time AS TIMESTAMP)) as month,
        DAY(CAST(start_time AS TIMESTAMP)) as day
    FROM cdr_raw
""")

print("✅ Data loaded successfully!")

# Verify the load
count = spark.sql("SELECT COUNT(*) FROM cdr_partitioned").collect()[0][0]
print(f"\n📊 Verification: {count:,} records loaded into partitioned table")

# Show partition statistics
print("\n📊 Partition Statistics:")
spark.sql("""
    SELECT 
        year, 
        month,
        COUNT(DISTINCT day) as days,
        COUNT(*) as records
    FROM cdr_partitioned
    GROUP BY year, month
    ORDER BY year, month
""").show()

print("\n✅ Table fixed and data reloaded successfully!")
print("You can now continue with the rest of Notebook 02")


25/06/22 15:04:23 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
25/06/22 15:04:23 WARN SetCommand: 'SET hive.exec.dynamic.partition=true' might not work, since Spark doesn't support changing the Hive config dynamically. Please pass the Hive-specific config by adding the prefix spark.hadoop (e.g. spark.hadoop.hive.exec.dynamic.partition) when starting a Spark application. For details, see the link: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
25/06/22 15:04:23 WARN SetCommand: 'SET hive.exec.dynamic.partition.mode=nonstrict' might not work, since Spark doesn't support changing the Hive config dynamically. Please pass the Hive-specific config by adding the prefix spark.hadoop (e.g. spark.hadoop.hive.exec.dynamic.partition.mode) when starting a Spark application. For details, see the link: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
25/

✅ SparkSession initialized (App: Fix Hive Table Types, Spark: 3.5.1)
✅ Hive Warehouse: hdfs://namenode:9000/user/hive/warehouse
✅ Hive Metastore URI: thrift://hive-metastore:9083
🔧 FIXING HIVE TABLE TYPE MISMATCH

🚀 Applying Quick Fix: Recreating table with correct types...
✅ Dropped old table
✅ Created new table with correct types

📥 Reloading data into corrected table...


AnalysisException: [INCOMPATIBLE_DATA_FOR_TABLE.CANNOT_SAFELY_CAST] Cannot write incompatible data for the table `spark_catalog`.`algerie_telecom_gen`.`cdr_partitioned`: Cannot safely cast `download_mb` "STRING" to "DOUBLE".

In [7]:
# =====================================================
# NOTEBOOK 02: CONTINUATION - CREATE VIEWS AND AGGREGATIONS
# Continue from Section 5 after fixing table types
# =====================================================

import sys
sys.path.append('/home/jovyan/work/scripts')
from spark_init import init_spark
from pyspark.sql import functions as F
import time

# Initialize Spark
spark = init_spark("Hive Tables - Continuation")

# Use database
spark.sql("USE algerie_telecom_gen")

print("=" * 80)
print("📊 CONTINUING HIVE TABLE SETUP")
print("=" * 80)

# =====================================================
# 5. CREATE OPTIMIZED ANALYTICAL VIEWS
# =====================================================
print("\n👁️  CREATING ADVANCED ANALYTICAL VIEWS...")

views_created = 0

# Service-based views
spark.sql("""
    CREATE OR REPLACE VIEW voice_calls AS
    SELECT * FROM cdr_partitioned
    WHERE service_type = 'VOICE'
""")
views_created += 1
print("✅ Created voice_calls view")

spark.sql("""
    CREATE OR REPLACE VIEW data_sessions AS
    SELECT * FROM cdr_partitioned
    WHERE service_type = 'DATA'
""")
views_created += 1
print("✅ Created data_sessions view")

spark.sql("""
    CREATE OR REPLACE VIEW sms_records AS
    SELECT * FROM cdr_partitioned
    WHERE service_type = 'SMS'
""")
views_created += 1
print("✅ Created sms_records view")

# Advanced analytical views
spark.sql("""
    CREATE OR REPLACE VIEW fraud_cases AS
    SELECT * FROM cdr_partitioned
    WHERE fraud_indicator = true OR unusual_pattern_flag = true
""")
views_created += 1
print("✅ Created fraud_cases view")

spark.sql("""
    CREATE OR REPLACE VIEW network_issues AS
    SELECT * FROM cdr_partitioned
    WHERE dropped_call_flag = true 
       OR call_result = 'FAILED'
       OR network_congestion_level IN ('MEDIUM', 'HIGH')
       OR quality_score < 0.5
""")
views_created += 1
print("✅ Created network_issues view")

spark.sql("""
    CREATE OR REPLACE VIEW high_value_customers AS
    SELECT DISTINCT
        subscriber_id,
        customer_segment,
        customer_lifetime_value_category,
        tariff_plan,
        payment_type,
        age_group,
        gender,
        operator
    FROM cdr_partitioned
    WHERE customer_segment IN ('Premium')
       OR customer_lifetime_value_category IN ('High', 'Very High')
       OR payment_type = 'POSTPAID'
""")
views_created += 1
print("✅ Created high_value_customers view")

spark.sql("""
    CREATE OR REPLACE VIEW special_offers_usage AS
    SELECT * FROM cdr_partitioned
    WHERE special_offer_applied != 'None' 
      AND promotional_discount > 0
""")
views_created += 1
print("✅ Created special_offers_usage view")

spark.sql("""
    CREATE OR REPLACE VIEW roaming_records AS
    SELECT * FROM cdr_partitioned
    WHERE roaming_flag = true
""")
views_created += 1
print("✅ Created roaming_records view")

spark.sql("""
    CREATE OR REPLACE VIEW app_usage_data AS
    SELECT * FROM cdr_partitioned
    WHERE service_type = 'DATA' 
      AND application_used IS NOT NULL 
      AND application_used != ''
""")
views_created += 1
print("✅ Created app_usage_data view")

print(f"\n✅ Total views created: {views_created}")

# =====================================================
# 6. CREATE PRE-AGGREGATED TABLES FOR PERFORMANCE
# =====================================================
print("\n📊 CREATING PRE-AGGREGATED ANALYTICAL TABLES...")

# Daily KPIs by Service and Operator
print("\n⏳ Creating daily_kpis table...")
spark.sql("""
    CREATE TABLE IF NOT EXISTS daily_kpis AS
    SELECT 
        year, month, day,
        service_type,
        operator,
        customer_segment,
        COUNT(DISTINCT subscriber_id) as unique_subscribers,
        COUNT(*) as total_transactions,
        SUM(duration) as total_duration_seconds,
        SUM(data_volume_mb) as total_data_mb,
        SUM(charging_amount) as total_revenue,
        SUM(tax_amount) as total_tax,
        AVG(quality_score) as avg_quality_score,
        SUM(CASE WHEN fraud_indicator THEN 1 ELSE 0 END) as fraud_cases,
        SUM(CASE WHEN dropped_call_flag THEN 1 ELSE 0 END) as dropped_calls,
        SUM(CASE WHEN special_offer_applied != 'None' THEN 1 ELSE 0 END) as special_offer_usage,
        AVG(promotional_discount) as avg_discount_applied
    FROM cdr_partitioned
    GROUP BY year, month, day, service_type, operator, customer_segment
""")
print("✅ Created daily_kpis table")

# Hourly Usage Patterns
print("\n⏳ Creating hourly_patterns table...")
spark.sql("""
    CREATE TABLE IF NOT EXISTS hourly_patterns AS
    SELECT 
        time_of_day_category,
        day_of_week,
        is_weekend,
        service_type,
        operator,
        COUNT(*) as transaction_count,
        COUNT(DISTINCT subscriber_id) as unique_users,
        AVG(duration) as avg_duration,
        AVG(data_volume_mb) as avg_data_mb,
        AVG(charging_amount) as avg_revenue,
        AVG(quality_score) as avg_quality,
        SUM(CASE WHEN network_congestion_level = 'HIGH' THEN 1 ELSE 0 END) as high_congestion_count
    FROM cdr_partitioned
    GROUP BY time_of_day_category, day_of_week, is_weekend, service_type, operator
""")
print("✅ Created hourly_patterns table")

# Location Performance Metrics
print("\n⏳ Creating location_network_metrics table...")
spark.sql("""
    CREATE TABLE IF NOT EXISTS location_network_metrics AS
    SELECT 
        location_area,
        network_type,
        operator,
        service_type,
        COUNT(*) as total_transactions,
        COUNT(DISTINCT subscriber_id) as unique_subscribers,
        AVG(quality_score) as avg_quality_score,
        AVG(signal_strength) as avg_signal_strength,
        SUM(CASE WHEN dropped_call_flag THEN 1 ELSE 0 END) as dropped_count,
        SUM(CASE WHEN call_result = 'FAILED' THEN 1 ELSE 0 END) as failed_count,
        SUM(CASE WHEN network_congestion_level = 'HIGH' THEN 1 ELSE 0 END) as high_congestion_count,
        AVG(data_volume_mb) as avg_data_usage,
        SUM(charging_amount) as total_revenue
    FROM cdr_partitioned
    GROUP BY location_area, network_type, operator, service_type
""")
print("✅ Created location_network_metrics table")

# Customer Demographics Analysis
print("\n⏳ Creating customer_demographics_summary table...")
spark.sql("""
    CREATE TABLE IF NOT EXISTS customer_demographics_summary AS
    SELECT 
        customer_segment,
        age_group,
        gender,
        payment_type,
        operator,
        COUNT(DISTINCT subscriber_id) as subscriber_count,
        COUNT(*) as total_activities,
        AVG(charging_amount) as avg_transaction_value,
        SUM(charging_amount) as total_revenue,
        AVG(data_volume_mb) as avg_data_usage,
        SUM(CASE WHEN fraud_indicator THEN 1 ELSE 0 END) as fraud_incidents,
        AVG(promotional_discount) as avg_discount_received
    FROM cdr_partitioned
    GROUP BY customer_segment, age_group, gender, payment_type, operator
""")
print("✅ Created customer_demographics_summary table")

# App Usage Analytics
print("\n⏳ Creating app_usage_analytics table...")
spark.sql("""
    CREATE TABLE IF NOT EXISTS app_usage_analytics AS
    SELECT 
        application_used,
        content_category,
        customer_segment,
        age_group,
        COUNT(DISTINCT subscriber_id) as unique_users,
        COUNT(*) as total_sessions,
        SUM(data_volume_mb) as total_data_mb,
        AVG(data_volume_mb) as avg_data_per_session,
        SUM(duration) as total_duration,
        AVG(duration) as avg_session_duration,
        SUM(charging_amount) as total_revenue,
        AVG(revenue_per_mb) as avg_revenue_per_mb
    FROM cdr_partitioned
    WHERE service_type = 'DATA' AND application_used IS NOT NULL
    GROUP BY application_used, content_category, customer_segment, age_group
""")
print("✅ Created app_usage_analytics table")

print("\n✅ All pre-aggregated tables created successfully!")

# =====================================================
# 7. CREATE MATERIALIZED VIEWS FOR DASHBOARDS
# =====================================================
print("\n📈 CREATING MATERIALIZED VIEWS FOR REAL-TIME ANALYTICS...")

# Real-time fraud monitoring
print("\n⏳ Creating fraud_monitoring table...")
spark.sql("""
    CREATE TABLE IF NOT EXISTS fraud_monitoring AS
    SELECT 
        DATE(start_time) as fraud_date,
        operator,
        location_area,
        COUNT(*) as total_incidents,
        COUNT(DISTINCT subscriber_id) as affected_subscribers,
        SUM(charging_amount) as potential_loss,
        COLLECT_SET(service_type) as affected_services
    FROM cdr_partitioned
    WHERE fraud_indicator = true OR unusual_pattern_flag = true
    GROUP BY DATE(start_time), operator, location_area
""")
print("✅ Created fraud_monitoring table")

# Revenue tracking
print("\n⏳ Creating revenue_tracking table...")
spark.sql("""
    CREATE TABLE IF NOT EXISTS revenue_tracking AS
    SELECT 
        year, month, day,
        operator,
        service_type,
        customer_segment,
        payment_type,
        SUM(charging_amount) as gross_revenue,
        SUM(tax_amount) as tax_collected,
        SUM(charging_amount - tax_amount) as net_revenue,
        COUNT(DISTINCT subscriber_id) as active_customers,
        COUNT(*) as total_transactions,
        AVG(charging_amount) as arpu_daily
    FROM cdr_partitioned
    GROUP BY year, month, day, operator, service_type, customer_segment, payment_type
""")
print("✅ Created revenue_tracking table")

# =====================================================
# 8. COMPUTE COMPREHENSIVE STATISTICS
# =====================================================
print("\n📈 COMPUTING TABLE STATISTICS FOR OPTIMIZATION...")

tables_to_analyze = [
    'cdr_partitioned', 'daily_kpis', 'hourly_patterns', 
    'location_network_metrics', 'customer_demographics_summary',
    'app_usage_analytics', 'fraud_monitoring', 'revenue_tracking'
]

for table in tables_to_analyze:
    print(f"\n⏳ Analyzing {table}...")
    try:
        spark.sql(f"ANALYZE TABLE {table} COMPUTE STATISTICS")
        # Note: COMPUTE STATISTICS FOR ALL COLUMNS might be too intensive
        # Only do it for smaller tables
        if table in ['hourly_patterns', 'customer_demographics_summary']:
            spark.sql(f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR ALL COLUMNS")
        print(f"   ✅ Analyzed {table}")
    except Exception as e:
        print(f"   ⚠️  Could not analyze {table}: {str(e)}")

# =====================================================
# 9. FINAL VERIFICATION AND SUMMARY
# =====================================================
print("\n" + "="*80)
print("📊 HIVE INFRASTRUCTURE SETUP COMPLETE!")
print("="*80)

# Summary of created objects
print("\n📋 Database Objects Created:")

# Tables
print("\n📦 Tables:")
tables = spark.sql(f"SHOW TABLES IN algerie_telecom_gen").filter("isTemporary = false").collect()
for table in tables:
    if table.tableName not in ['cdr_raw', 'cdr_partitioned']:  # Skip these as we know their counts
        try:
            count = spark.sql(f"SELECT COUNT(*) FROM {table.tableName}").collect()[0][0]
            print(f"   - {table.tableName}: {count:,} records")
        except:
            print(f"   - {table.tableName}: (aggregated table)")

# Views
print("\n👁️  Views created: {}".format(views_created))

# Performance test
print("\n⚡ Testing query performance...")
test_start = time.time()
result = spark.sql("""
    SELECT 
        operator,
        service_type,
        COUNT(*) as count,
        ROUND(SUM(charging_amount), 2) as revenue
    FROM cdr_partitioned
    WHERE year = 2025 AND month = 1
    GROUP BY operator, service_type
    ORDER BY operator, service_type
""").collect()

test_time = time.time() - test_start
print(f"\n✅ Query executed in {test_time:.2f} seconds")
print("\nSample results:")
for row in result[:6]:  # Show first 6 rows
    print(f"   {row['operator']} - {row['service_type']}: {row['count']:,} calls, {row['revenue']:,.2f} DZD")

# Final summary
print("\n" + "="*80)
print("🎯 SETUP COMPLETE! Next Steps:")
print("   → Run Notebook 03 for Advanced Data Engineering")
print("   → Run Notebook 04 for Anomaly Detection & Trend Analysis")
print("   → Run Notebook 05 for Business Intelligence Dashboards")
print("="*80)

spark.stop()
print("\n🔚 Spark session closed successfully.")

25/06/22 15:04:50 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


✅ SparkSession initialized (App: Hive Tables - Continuation, Spark: 3.5.1)
✅ Hive Warehouse: hdfs://namenode:9000/user/hive/warehouse
✅ Hive Metastore URI: thrift://hive-metastore:9083
📊 CONTINUING HIVE TABLE SETUP

👁️  CREATING ADVANCED ANALYTICAL VIEWS...
✅ Created voice_calls view
✅ Created data_sessions view
✅ Created sms_records view
✅ Created fraud_cases view
✅ Created network_issues view
✅ Created high_value_customers view
✅ Created special_offers_usage view
✅ Created roaming_records view
✅ Created app_usage_data view

✅ Total views created: 9

📊 CREATING PRE-AGGREGATED ANALYTICAL TABLES...

⏳ Creating daily_kpis table...


25/06/22 15:04:50 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.
25/06/22 15:04:51 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.


✅ Created daily_kpis table

⏳ Creating hourly_patterns table...
✅ Created hourly_patterns table

⏳ Creating location_network_metrics table...


25/06/22 15:04:52 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.


✅ Created location_network_metrics table

⏳ Creating customer_demographics_summary table...


25/06/22 15:04:52 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.


✅ Created customer_demographics_summary table

⏳ Creating app_usage_analytics table...


25/06/22 15:04:53 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.


✅ Created app_usage_analytics table

✅ All pre-aggregated tables created successfully!

📈 CREATING MATERIALIZED VIEWS FOR REAL-TIME ANALYTICS...

⏳ Creating fraud_monitoring table...


25/06/22 15:04:53 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.


✅ Created fraud_monitoring table

⏳ Creating revenue_tracking table...


25/06/22 15:04:53 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.


✅ Created revenue_tracking table

📈 COMPUTING TABLE STATISTICS FOR OPTIMIZATION...

⏳ Analyzing cdr_partitioned...
   ✅ Analyzed cdr_partitioned

⏳ Analyzing daily_kpis...
   ✅ Analyzed daily_kpis

⏳ Analyzing hourly_patterns...
   ✅ Analyzed hourly_patterns

⏳ Analyzing location_network_metrics...
   ✅ Analyzed location_network_metrics

⏳ Analyzing customer_demographics_summary...
   ✅ Analyzed customer_demographics_summary

⏳ Analyzing app_usage_analytics...
   ✅ Analyzed app_usage_analytics

⏳ Analyzing fraud_monitoring...
   ✅ Analyzed fraud_monitoring

⏳ Analyzing revenue_tracking...
   ✅ Analyzed revenue_tracking

📊 HIVE INFRASTRUCTURE SETUP COMPLETE!

📋 Database Objects Created:

📦 Tables:
   - voice_calls: 0 records
   - data_sessions: 0 records
   - sms_records: 0 records
   - fraud_cases: 0 records
   - network_issues: 0 records
   - high_value_customers: 0 records
   - special_offers_usage: 0 records
   - roaming_records: 0 records
   - app_usage_data: 0 records
   - daily_k

In [5]:
# =====================================================
# NOTEBOOK 02: CREATE HIVE TABLES FOR GENERATED CDR DATA
# Algerie Telecom Big Data Project - Advanced Analytics
# =====================================================

import sys
sys.path.append('/home/jovyan/work/work/scripts')
from spark_init import init_spark
from pyspark.sql import functions as F
from pyspark.sql.types import *
from datetime import datetime
import time

# Initialize Spark with Hive support
spark = init_spark("Hive Tables - Generated CDR Advanced")

print("=" * 80)
print("📊 ALGERIE TELECOM - ADVANCED CDR DATA PIPELINE")
print("📅 Creating Comprehensive Hive Infrastructure")
print("=" * 80)

# =====================================================
# 1. CREATE ADVANCED DATABASE
# =====================================================
print("\n🏗️  CREATING ADVANCED DATABASE STRUCTURE...")

db_name = "algerie_telecom_gen"
spark.sql(f"DROP DATABASE IF EXISTS {db_name} CASCADE")
spark.sql(f"""
    CREATE DATABASE IF NOT EXISTS {db_name}
    COMMENT 'Algerie Telecom Generated CDR - Advanced Analytics Platform'
    LOCATION '/user/hive/warehouse/{db_name}.db'
    WITH DBPROPERTIES (
        'creator' = 'CDR Analytics Pipeline',
        'version' = '2.0',
        'date' = '{datetime.now().strftime("%Y-%m-%d")}'
    )
""")
spark.sql(f"USE {db_name}")
print(f"✅ Database '{db_name}' created successfully")

# =====================================================
# 2. ANALYZE DATA STRUCTURE & QUALITY
# =====================================================
print("\n🔍 ANALYZING GENERATED DATA STRUCTURE...")

raw_path = "/user/hive/warehouse/generated_raw_cdr/*.parquet"
df_sample = spark.read.parquet(raw_path)

# Get schema information
print("\n📋 Data Schema Analysis:")
total_columns = len(df_sample.columns)
print(f"Total columns: {total_columns}")

# Categorize columns
id_cols = ['cdr_id', 'subscriber_id', 'session_id']
pii_cols = ['msisdn', 'imsi', 'imei', 'calling_party', 'called_party']
service_cols = ['service_type', 'service_subtype']
time_cols = ['start_time', 'end_time', 'duration']
data_cols = ['data_volume_mb', 'upload_mb', 'download_mb']
location_cols = ['cell_id', 'lac', 'location_area', 'serving_cell_tower']
network_cols = ['network_type', 'operator', 'network_congestion_level']
financial_cols = ['charging_amount', 'tax_amount', 'revenue_per_mb']
customer_cols = ['customer_segment', 'tariff_plan', 'payment_type', 'age_group', 'gender', 'customer_lifetime_value_category']
quality_cols = ['call_result', 'quality_score', 'signal_strength', 'dropped_call_flag']
pattern_cols = ['time_of_day_category', 'day_of_week', 'is_weekend', 'is_holiday']
anomaly_cols = ['fraud_indicator', 'unusual_pattern_flag']
special_cols = ['special_offer_applied', 'promotional_discount']
app_cols = ['application_used', 'content_category']
roaming_cols = ['roaming_flag', 'roaming_country', 'roaming_type']

# Check data quality
print("\n📊 Data Quality Metrics:")
total_records = df_sample.count()
print(f"Total records: {total_records:,}")

# Sample anonymization check
sample_msisdn = df_sample.select("msisdn").first()[0]
is_anonymized = len(str(sample_msisdn)) == 16 and all(c in '0123456789abcdef' for c in str(sample_msisdn))
print(f"Anonymization status: {'✅ Enabled' if is_anonymized else '❌ Disabled'}")

# =====================================================
# 3. CREATE MAIN EXTERNAL TABLE
# =====================================================
print("\n📦 CREATING MAIN EXTERNAL TABLE...")

spark.sql(f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS cdr_raw (
        -- Identifiers
        cdr_id STRING COMMENT 'Unique CDR identifier',
        subscriber_id STRING COMMENT 'Subscriber identifier',
        msisdn STRING COMMENT 'Mobile number (anonymized)',
        imsi STRING COMMENT 'IMSI (anonymized)',
        imei STRING COMMENT 'Device IMEI (anonymized)',
        
        -- Service Information
        service_type STRING COMMENT 'VOICE/DATA/SMS',
        service_subtype STRING COMMENT 'Detailed service type',
        session_id STRING COMMENT 'Session identifier',
        
        -- Call Details
        calling_party STRING COMMENT 'Caller number (anonymized)',
        called_party STRING COMMENT 'Called number (anonymized)',
        start_time STRING COMMENT 'Call/session start time',
        end_time STRING COMMENT 'Call/session end time',
        duration INT COMMENT 'Duration in seconds',
        
        -- Data Usage
        data_volume_mb DOUBLE COMMENT 'Total data volume in MB',
        upload_mb DOUBLE COMMENT 'Upload volume in MB',
        download_mb DOUBLE COMMENT 'Download volume in MB',
        
        -- Location & Network
        cell_id STRING COMMENT 'Cell tower ID',
        lac STRING COMMENT 'Location area code',
        location_area STRING COMMENT 'Geographic location',
        serving_cell_tower STRING COMMENT 'Serving tower ID',
        network_type STRING COMMENT '2G/3G/4G/5G',
        operator STRING COMMENT 'Mobilis/Djezzy/Ooredoo',
        
        -- Financial
        charging_amount DOUBLE COMMENT 'Total charge in DZD',
        currency STRING COMMENT 'Currency code',
        payment_type STRING COMMENT 'PREPAID/POSTPAID',
        tax_amount DOUBLE COMMENT 'Tax amount in DZD',
        revenue_per_mb DOUBLE COMMENT 'Revenue per MB for data',
        
        -- Quality & Status
        call_result STRING COMMENT 'SUCCESS/FAILED',
        quality_score DOUBLE COMMENT 'Quality score 0-1',
        signal_strength INT COMMENT 'Signal strength',
        dropped_call_flag BOOLEAN COMMENT 'Call dropped flag',
        network_congestion_level STRING COMMENT 'LOW/MEDIUM/HIGH',
        
        -- Customer Profile
        customer_segment STRING COMMENT 'Premium/Standard/Basic',
        tariff_plan STRING COMMENT 'Customer tariff plan',
        age_group STRING COMMENT 'Age group category',
        gender STRING COMMENT 'M/F',
        customer_lifetime_value_category STRING COMMENT 'CLV category',
        
        -- Patterns & Analytics
        time_of_day_category STRING COMMENT 'MORNING/AFTERNOON/EVENING/NIGHT',
        day_of_week STRING COMMENT 'Day name',
        is_weekend BOOLEAN COMMENT 'Weekend flag',
        is_holiday BOOLEAN COMMENT 'Holiday flag',
        
        -- Special Features
        special_offer_applied STRING COMMENT 'Applied offer name',
        promotional_discount DOUBLE COMMENT 'Discount percentage',
        application_used STRING COMMENT 'App name for data',
        content_category STRING COMMENT 'Content type',
        
        -- Roaming
        roaming_flag BOOLEAN COMMENT 'Roaming indicator',
        roaming_country STRING COMMENT 'Roaming country',
        roaming_type STRING COMMENT 'NONE/NATIONAL/INTERNATIONAL',
        
        -- Anomalies
        fraud_indicator BOOLEAN COMMENT 'Fraud detection flag',
        unusual_pattern_flag BOOLEAN COMMENT 'Unusual pattern detected'
    )
    STORED AS PARQUET
    LOCATION '/user/hive/warehouse/generated_raw_cdr'
    TBLPROPERTIES (
        'compression' = 'snappy',
        'creator' = 'Advanced CDR Pipeline',
        'external.table.purge' = 'false'
    )
""")

print("✅ External table 'cdr_raw' created")

# Verify and compute statistics
spark.sql("ANALYZE TABLE cdr_raw COMPUTE STATISTICS")
row_count = spark.sql("SELECT COUNT(*) FROM cdr_raw").collect()[0][0]
print(f"   Records loaded: {row_count:,}")

# =====================================================
# 4. CREATE OPTIMIZED PARTITIONED TABLE
# =====================================================
print("\n🗂️  CREATING PARTITIONED TABLE FOR PERFORMANCE...")

# Drop if exists and create new
spark.sql("DROP TABLE IF EXISTS cdr_partitioned")

spark.sql(f"""
    CREATE TABLE IF NOT EXISTS cdr_partitioned (
        -- All columns from raw table
        cdr_id STRING,
        subscriber_id STRING,
        msisdn STRING,
        imsi STRING,
        imei STRING,
        service_type STRING,
        service_subtype STRING,
        session_id STRING,
        calling_party STRING,
        called_party STRING,
        start_time TIMESTAMP,  -- Note: converted to TIMESTAMP
        end_time TIMESTAMP,
        duration INT,
        data_volume_mb DOUBLE,
        upload_mb DOUBLE,
        download_mb DOUBLE,
        cell_id STRING,
        lac STRING,
        location_area STRING,
        serving_cell_tower STRING,
        network_type STRING,
        operator STRING,
        charging_amount DOUBLE,
        currency STRING,
        payment_type STRING,
        tax_amount DOUBLE,
        revenue_per_mb DOUBLE,
        call_result STRING,
        quality_score DOUBLE,
        signal_strength INT,
        dropped_call_flag BOOLEAN,
        network_congestion_level STRING,
        customer_segment STRING,
        tariff_plan STRING,
        age_group STRING,
        gender STRING,
        customer_lifetime_value_category STRING,
        time_of_day_category STRING,
        day_of_week STRING,
        is_weekend BOOLEAN,
        is_holiday BOOLEAN,
        special_offer_applied STRING,
        promotional_discount DOUBLE,
        application_used STRING,
        content_category STRING,
        roaming_flag BOOLEAN,
        roaming_country STRING,
        roaming_type STRING,
        fraud_indicator BOOLEAN,
        unusual_pattern_flag BOOLEAN
    )
    PARTITIONED BY (year INT, month INT, day INT)
    STORED AS PARQUET
    TBLPROPERTIES (
        'compression' = 'snappy',
        'transactional' = 'false',
        'auto.purge' = 'true'
    )
""")

print("✅ Partitioned table structure created")

# Configure for dynamic partitioning
print("\n📥 Loading data with dynamic partitioning...")
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("SET hive.exec.max.dynamic.partitions = 10000")
spark.sql("SET hive.exec.max.dynamic.partitions.pernode = 1000")

# Insert with proper timestamp conversion and partitioning
start_time = time.time()
spark.sql("""
    INSERT OVERWRITE TABLE cdr_partitioned PARTITION(year, month, day)
    SELECT 
        cdr_id,
        subscriber_id,
        msisdn,
        imsi,
        imei,
        service_type,
        service_subtype,
        session_id,
        calling_party,
        called_party,
        CAST(start_time AS TIMESTAMP) as start_time,
        CAST(end_time AS TIMESTAMP) as end_time,
        CAST(duration AS INT) as duration,
        data_volume_mb,
        upload_mb,
        download_mb,
        cell_id,
        lac,
        location_area,
        serving_cell_tower,
        network_type,
        operator,
        charging_amount,
        currency,
        payment_type,
        tax_amount,
        revenue_per_mb,
        call_result,
        quality_score,
        signal_strength,
        dropped_call_flag,
        network_congestion_level,
        customer_segment,
        tariff_plan,
        age_group,
        gender,
        customer_lifetime_value_category,
        time_of_day_category,
        day_of_week,
        is_weekend,
        is_holiday,
        special_offer_applied,
        promotional_discount,
        application_used,
        content_category,
        roaming_flag,
        roaming_country,
        roaming_type,
        fraud_indicator,
        unusual_pattern_flag,
        YEAR(CAST(start_time AS TIMESTAMP)) as year,
        MONTH(CAST(start_time AS TIMESTAMP)) as month,
        DAY(CAST(start_time AS TIMESTAMP)) as day
    FROM cdr_raw
""")

load_time = time.time() - start_time
print(f"✅ Data loaded in {load_time:.1f} seconds")

# Show partition statistics
print("\n📊 Partition Statistics:")
spark.sql("""
    SELECT 
        year, 
        month,
        COUNT(DISTINCT day) as days,
        COUNT(*) as records,
        ROUND(SUM(charging_amount), 2) as revenue
    FROM cdr_partitioned
    GROUP BY year, month
    ORDER BY year, month
""").show(12)

# =====================================================
# 5. CREATE OPTIMIZED ANALYTICAL VIEWS
# =====================================================
print("\n👁️  CREATING ADVANCED ANALYTICAL VIEWS...")

# Service-based views
views_created = 0

spark.sql("""
    CREATE OR REPLACE VIEW voice_calls AS
    SELECT * FROM cdr_partitioned
    WHERE service_type = 'VOICE'
""")
views_created += 1

spark.sql("""
    CREATE OR REPLACE VIEW data_sessions AS
    SELECT * FROM cdr_partitioned
    WHERE service_type = 'DATA'
""")
views_created += 1

spark.sql("""
    CREATE OR REPLACE VIEW sms_records AS
    SELECT * FROM cdr_partitioned
    WHERE service_type = 'SMS'
""")
views_created += 1

# Advanced analytical views
spark.sql("""
    CREATE OR REPLACE VIEW fraud_cases AS
    SELECT * FROM cdr_partitioned
    WHERE fraud_indicator = true OR unusual_pattern_flag = true
""")
views_created += 1

spark.sql("""
    CREATE OR REPLACE VIEW network_issues AS
    SELECT * FROM cdr_partitioned
    WHERE dropped_call_flag = true 
       OR call_result = 'FAILED'
       OR network_congestion_level IN ('MEDIUM', 'HIGH')
       OR quality_score < 0.5
""")
views_created += 1

spark.sql("""
    CREATE OR REPLACE VIEW high_value_customers AS
    SELECT DISTINCT
        subscriber_id,
        customer_segment,
        customer_lifetime_value_category,
        tariff_plan,
        payment_type,
        age_group,
        gender,
        operator
    FROM cdr_partitioned
    WHERE customer_segment IN ('Premium')
       OR customer_lifetime_value_category IN ('High', 'Very High')
       OR payment_type = 'POSTPAID'
""")
views_created += 1

spark.sql("""
    CREATE OR REPLACE VIEW special_offers_usage AS
    SELECT * FROM cdr_partitioned
    WHERE special_offer_applied != 'None' 
      AND promotional_discount > 0
""")
views_created += 1

spark.sql("""
    CREATE OR REPLACE VIEW roaming_records AS
    SELECT * FROM cdr_partitioned
    WHERE roaming_flag = true
""")
views_created += 1

spark.sql("""
    CREATE OR REPLACE VIEW app_usage_data AS
    SELECT * FROM cdr_partitioned
    WHERE service_type = 'DATA' 
      AND application_used IS NOT NULL 
      AND application_used != ''
""")
views_created += 1

print(f"✅ Created {views_created} analytical views")

# =====================================================
# 6. CREATE PRE-AGGREGATED TABLES
# =====================================================
print("\n📊 CREATING PRE-AGGREGATED ANALYTICAL TABLES...")

# Daily KPIs by Service and Operator
spark.sql("""
    CREATE TABLE IF NOT EXISTS daily_kpis AS
    SELECT 
        year, month, day,
        service_type,
        operator,
        customer_segment,
        COUNT(DISTINCT subscriber_id) as unique_subscribers,
        COUNT(*) as total_transactions,
        SUM(duration) as total_duration_seconds,
        SUM(data_volume_mb) as total_data_mb,
        SUM(charging_amount) as total_revenue,
        SUM(tax_amount) as total_tax,
        AVG(quality_score) as avg_quality_score,
        SUM(CASE WHEN fraud_indicator THEN 1 ELSE 0 END) as fraud_cases,
        SUM(CASE WHEN dropped_call_flag THEN 1 ELSE 0 END) as dropped_calls,
        SUM(CASE WHEN special_offer_applied != 'None' THEN 1 ELSE 0 END) as special_offer_usage,
        AVG(promotional_discount) as avg_discount_applied
    FROM cdr_partitioned
    GROUP BY year, month, day, service_type, operator, customer_segment
""")

# Hourly Usage Patterns
spark.sql("""
    CREATE TABLE IF NOT EXISTS hourly_patterns AS
    SELECT 
        time_of_day_category,
        day_of_week,
        is_weekend,
        service_type,
        operator,
        COUNT(*) as transaction_count,
        COUNT(DISTINCT subscriber_id) as unique_users,
        AVG(duration) as avg_duration,
        AVG(data_volume_mb) as avg_data_mb,
        AVG(charging_amount) as avg_revenue,
        AVG(quality_score) as avg_quality,
        SUM(CASE WHEN network_congestion_level = 'HIGH' THEN 1 ELSE 0 END) as high_congestion_count
    FROM cdr_partitioned
    GROUP BY time_of_day_category, day_of_week, is_weekend, service_type, operator
""")

# Location Performance Metrics
spark.sql("""
    CREATE TABLE IF NOT EXISTS location_network_metrics AS
    SELECT 
        location_area,
        network_type,
        operator,
        service_type,
        COUNT(*) as total_transactions,
        COUNT(DISTINCT subscriber_id) as unique_subscribers,
        AVG(quality_score) as avg_quality_score,
        AVG(signal_strength) as avg_signal_strength,
        SUM(CASE WHEN dropped_call_flag THEN 1 ELSE 0 END) as dropped_count,
        SUM(CASE WHEN call_result = 'FAILED' THEN 1 ELSE 0 END) as failed_count,
        SUM(CASE WHEN network_congestion_level = 'HIGH' THEN 1 ELSE 0 END) as high_congestion_count,
        AVG(data_volume_mb) as avg_data_usage,
        SUM(charging_amount) as total_revenue
    FROM cdr_partitioned
    GROUP BY location_area, network_type, operator, service_type
""")

# Customer Demographics Analysis
spark.sql("""
    CREATE TABLE IF NOT EXISTS customer_demographics_summary AS
    SELECT 
        customer_segment,
        age_group,
        gender,
        payment_type,
        operator,
        COUNT(DISTINCT subscriber_id) as subscriber_count,
        COUNT(*) as total_activities,
        AVG(charging_amount) as avg_transaction_value,
        SUM(charging_amount) as total_revenue,
        AVG(data_volume_mb) as avg_data_usage,
        SUM(CASE WHEN fraud_indicator THEN 1 ELSE 0 END) as fraud_incidents,
        AVG(promotional_discount) as avg_discount_received
    FROM cdr_partitioned
    GROUP BY customer_segment, age_group, gender, payment_type, operator
""")

# App Usage Analytics
spark.sql("""
    CREATE TABLE IF NOT EXISTS app_usage_analytics AS
    SELECT 
        application_used,
        content_category,
        customer_segment,
        age_group,
        COUNT(DISTINCT subscriber_id) as unique_users,
        COUNT(*) as total_sessions,
        SUM(data_volume_mb) as total_data_mb,
        AVG(data_volume_mb) as avg_data_per_session,
        SUM(duration) as total_duration,
        AVG(duration) as avg_session_duration,
        SUM(charging_amount) as total_revenue,
        AVG(revenue_per_mb) as avg_revenue_per_mb
    FROM cdr_partitioned
    WHERE service_type = 'DATA' AND application_used IS NOT NULL
    GROUP BY application_used, content_category, customer_segment, age_group
""")

print("✅ Created 5 pre-aggregated analytical tables")

# =====================================================
# 7. CREATE MATERIALIZED VIEWS FOR DASHBOARDS
# =====================================================
print("\n📈 CREATING MATERIALIZED VIEWS FOR REAL-TIME ANALYTICS...")

# Real-time fraud monitoring
spark.sql("""
    CREATE TABLE IF NOT EXISTS fraud_monitoring AS
    SELECT 
        DATE(start_time) as fraud_date,
        operator,
        location_area,
        COUNT(*) as total_incidents,
        COUNT(DISTINCT subscriber_id) as affected_subscribers,
        SUM(charging_amount) as potential_loss,
        COLLECT_SET(service_type) as affected_services
    FROM cdr_partitioned
    WHERE fraud_indicator = true OR unusual_pattern_flag = true
    GROUP BY DATE(start_time), operator, location_area
""")

# Revenue tracking
spark.sql("""
    CREATE TABLE IF NOT EXISTS revenue_tracking AS
    SELECT 
        year, month, day,
        operator,
        service_type,
        customer_segment,
        payment_type,
        SUM(charging_amount) as gross_revenue,
        SUM(tax_amount) as tax_collected,
        SUM(charging_amount - tax_amount) as net_revenue,
        COUNT(DISTINCT subscriber_id) as active_customers,
        COUNT(*) as total_transactions,
        AVG(charging_amount) as arpu_daily
    FROM cdr_partitioned
    GROUP BY year, month, day, operator, service_type, customer_segment, payment_type
""")

print("✅ Created materialized views for dashboards")

# =====================================================
# 8. COMPUTE COMPREHENSIVE STATISTICS
# =====================================================
print("\n📈 COMPUTING TABLE STATISTICS FOR OPTIMIZATION...")

tables_to_analyze = [
    'cdr_partitioned', 'daily_kpis', 'hourly_patterns', 
    'location_network_metrics', 'customer_demographics_summary',
    'app_usage_analytics', 'fraud_monitoring', 'revenue_tracking'
]

for table in tables_to_analyze:
    spark.sql(f"ANALYZE TABLE {table} COMPUTE STATISTICS")
    spark.sql(f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR ALL COLUMNS")
    print(f"   ✅ Analyzed {table}")

# =====================================================
# 9. FINAL VERIFICATION AND SUMMARY
# =====================================================
print("\n" + "="*80)
print("📊 HIVE INFRASTRUCTURE SETUP COMPLETE!")
print("="*80)

# Summary of created objects
print("\n📋 Database Objects Created:")

# Tables
print("\n📦 Tables:")
tables = spark.sql(f"SHOW TABLES IN {db_name}").filter("isTemporary = false").collect()
for table in tables:
    count = spark.sql(f"SELECT COUNT(*) FROM {table.tableName}").collect()[0][0]
    print(f"   - {table.tableName}: {count:,} records")

# Views
print("\n👁️  Views:")
views = spark.sql("SHOW VIEWS").collect()
for view in views:
    print(f"   - {view.viewName}")

# Performance test
print("\n⚡ Performance Test Query:")
test_start = time.time()
spark.sql("""
    SELECT 
        operator,
        service_type,
        COUNT(*) as count,
        ROUND(SUM(charging_amount), 2) as revenue
    FROM cdr_partitioned
    WHERE year = 2025 AND month = 1
    GROUP BY operator, service_type
    ORDER BY operator, service_type
""").show()
test_time = time.time() - test_start
print(f"Query executed in {test_time:.2f} seconds")

print("\n🎯 Next Steps:")
print("   1. Run Notebook 03 for Advanced Data Engineering")
print("   2. Run Notebook 04 for Anomaly Detection & Trend Analysis")
print("   3. Run Notebook 05 for Business Intelligence Dashboards")

# Save metadata
metadata = {
    "setup_date": datetime.now().isoformat(),
    "database": db_name,
    "total_records": row_count,
    "tables_created": len(tables),
    "views_created": views_created,
    "load_time_seconds": load_time
}

print("\n📊 Setup Metadata:")
for key, value in metadata.items():
    print(f"   {key}: {value}")

spark.stop()
print("\n🔚 Spark session closed successfully.")

25/06/22 15:02:53 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


✅ SparkSession initialized (App: Hive Tables - Generated CDR Advanced, Spark: 3.5.1)
✅ Hive Warehouse: hdfs://namenode:9000/user/hive/warehouse
✅ Hive Metastore URI: thrift://hive-metastore:9083
📊 ALGERIE TELECOM - ADVANCED CDR DATA PIPELINE
📅 Creating Comprehensive Hive Infrastructure

🏗️  CREATING ADVANCED DATABASE STRUCTURE...
✅ Database 'algerie_telecom_gen' created successfully

🔍 ANALYZING GENERATED DATA STRUCTURE...

📋 Data Schema Analysis:
Total columns: 50

📊 Data Quality Metrics:
Total records: 146,876,149
Anonymization status: ✅ Enabled

📦 CREATING MAIN EXTERNAL TABLE...
✅ External table 'cdr_raw' created
   Records loaded: 146,876,149

🗂️  CREATING PARTITIONED TABLE FOR PERFORMANCE...
✅ Partitioned table structure created

📥 Loading data with dynamic partitioning...


25/06/22 15:02:55 WARN SetCommand: 'SET hive.exec.dynamic.partition=true' might not work, since Spark doesn't support changing the Hive config dynamically. Please pass the Hive-specific config by adding the prefix spark.hadoop (e.g. spark.hadoop.hive.exec.dynamic.partition) when starting a Spark application. For details, see the link: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
25/06/22 15:02:55 WARN SetCommand: 'SET hive.exec.dynamic.partition.mode=nonstrict' might not work, since Spark doesn't support changing the Hive config dynamically. Please pass the Hive-specific config by adding the prefix spark.hadoop (e.g. spark.hadoop.hive.exec.dynamic.partition.mode) when starting a Spark application. For details, see the link: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
25/06/22 15:02:55 WARN SetCommand: 'SET hive.exec.max.dynamic.partitions=10000' might not work, since Spark doesn't support

Py4JJavaError: An error occurred while calling o73.sql.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 39.0 failed 4 times, most recent failure: Lost task 2.3 in stage 39.0 (TID 1197) (172.30.0.35 executor 1): org.apache.spark.SparkException: Parquet column cannot be converted in file hdfs://namenode:9000/user/hive/warehouse/generated_raw_cdr/cdr_20250106_to_20250110.parquet. Column: [duration], Expected: int, Found: INT64.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedSchemaColumnConvertError(QueryExecutionErrors.scala:854)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:287)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:593)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:385)
	at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException: column: [duration], physicalType: INT64, logicalType: int
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.constructConvertNotSupportedException(ParquetVectorUpdaterFactory.java:1136)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.getUpdater(ParquetVectorUpdaterFactory.java:199)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:175)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:342)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:233)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:283)
	... 25 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:989)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2398)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeWrite$4(FileFormatWriter.scala:307)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.writeAndCommit(FileFormatWriter.scala:271)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeWrite(FileFormatWriter.scala:304)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:190)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:85)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:83)
	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:220)
	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:638)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:629)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:659)
	at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.spark.SparkException: Parquet column cannot be converted in file hdfs://namenode:9000/user/hive/warehouse/generated_raw_cdr/cdr_20250106_to_20250110.parquet. Column: [duration], Expected: int, Found: INT64.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedSchemaColumnConvertError(QueryExecutionErrors.scala:854)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:287)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:593)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:385)
	at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException: column: [duration], physicalType: INT64, logicalType: int
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.constructConvertNotSupportedException(ParquetVectorUpdaterFactory.java:1136)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.getUpdater(ParquetVectorUpdaterFactory.java:199)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:175)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:342)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:233)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:283)
	... 25 more


25/06/22 15:02:58 WARN TaskSetManager: Lost task 5.3 in stage 39.0 (TID 1205) (172.30.0.34 executor 0): TaskKilled (Stage cancelled: Job aborted due to stage failure: Task 2 in stage 39.0 failed 4 times, most recent failure: Lost task 2.3 in stage 39.0 (TID 1197) (172.30.0.35 executor 1): org.apache.spark.SparkException: Parquet column cannot be converted in file hdfs://namenode:9000/user/hive/warehouse/generated_raw_cdr/cdr_20250106_to_20250110.parquet. Column: [duration], Expected: int, Found: INT64.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedSchemaColumnConvertError(QueryExecutionErrors.scala:854)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:287)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:593)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$Generate

In [11]:

import sys
sys.path.append('/home/jovyan/work/work/scripts')
from spark_init import init_spark
from pyspark.sql import functions as F
from pyspark.sql.types import *
from datetime import datetime
import time

# Initialize Spark with Hive support
spark = init_spark("Check - Generated CDR Advanced")
spark.sql("USE algerie_telecom_gen")
spark.sql("SHOW TABLES IN algerie_telecom_gen").show(100, truncate=False)
spark.sql("SELECT COUNT(*) FROM cdr_partitioned").show()
spark.sql("SELECT * FROM cdr_partitioned LIMIT 10").show(truncate=False)


✅ SparkSession initialized (App: Check - Generated CDR Advanced, Spark: 3.5.1)
✅ Hive Warehouse: hdfs://namenode:9000/user/hive/warehouse
✅ Hive Metastore URI: thrift://hive-metastore:9083
+-------------------+-----------------------------+-----------+
|namespace          |tableName                    |isTemporary|
+-------------------+-----------------------------+-----------+
|algerie_telecom_gen|cdr_raw                      |false      |
|algerie_telecom_gen|cdr_partitioned              |false      |
|algerie_telecom_gen|voice_calls                  |false      |
|algerie_telecom_gen|data_sessions                |false      |
|algerie_telecom_gen|sms_records                  |false      |
|algerie_telecom_gen|fraud_cases                  |false      |
|algerie_telecom_gen|network_issues               |false      |
|algerie_telecom_gen|high_value_customers         |false      |
|algerie_telecom_gen|special_offers_usage         |false      |
|algerie_telecom_gen|roaming_records       

                                                                                

+--------+
|count(1)|
+--------+
|       0|
+--------+

+------+-------------+------+----+----+------------+---------------+----------+-------------+------------+----------+--------+--------+---------------+--------------+---------+-----------+---------------+----------+--------------+-------------+--------------------+-------+---+-------------+------------------+------------+--------+------------+-----------+----------------+-----------+--------+---------+------+---------------+------------+---------------------+------------------------+--------------------+-----------+----------------+----------------+--------------------------------+-----------------+------------+---------------+--------------------+----------+----------+----+-----+---+
|cdr_id|subscriber_id|msisdn|imsi|imei|service_type|service_subtype|session_id|calling_party|called_party|start_time|end_time|duration|signal_strength|data_volume_mb|upload_mb|download_mb|charging_amount|tax_amount|revenue_per_mb|quality_score|promot

In [3]:
df = spark.read.parquet("hdfs://namenode:9000/user/hive/warehouse/generated_raw_cdr/cdr_20250111_to_20250115.parquet")
df.printSchema()

root
 |-- cdr_id: string (nullable = true)
 |-- subscriber_id: string (nullable = true)
 |-- msisdn: string (nullable = true)
 |-- imsi: string (nullable = true)
 |-- imei: string (nullable = true)
 |-- service_type: string (nullable = true)
 |-- service_subtype: string (nullable = true)
 |-- session_id: string (nullable = true)
 |-- calling_party: string (nullable = true)
 |-- called_party: string (nullable = true)
 |-- start_time: string (nullable = true)
 |-- end_time: string (nullable = true)
 |-- duration: long (nullable = true)
 |-- data_volume_mb: double (nullable = true)
 |-- upload_mb: double (nullable = true)
 |-- download_mb: double (nullable = true)
 |-- cell_id: string (nullable = true)
 |-- lac: string (nullable = true)
 |-- location_area: string (nullable = true)
 |-- serving_cell_tower: string (nullable = true)
 |-- network_type: string (nullable = true)
 |-- charging_amount: double (nullable = true)
 |-- currency: string (nullable = true)
 |-- payment_type: string (nul

                                                                                

In [12]:
print("cdr_raw:")
spark.sql("SELECT COUNT(*) FROM cdr_raw").show()
spark.sql("SELECT * FROM cdr_raw LIMIT 5").show(truncate=False)

print("cdr_partitioned:")
spark.sql("SELECT COUNT(*) FROM cdr_partitioned").show()
spark.sql("SELECT * FROM cdr_partitioned LIMIT 5").show(truncate=False)


cdr_raw:


                                                                                

+---------+
| count(1)|
+---------+
|146876149|
+---------+



25/06/22 15:59:41 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 111) (172.30.0.34 executor 0): org.apache.spark.SparkException: Parquet column cannot be converted in file hdfs://namenode:9000/user/hive/warehouse/generated_raw_cdr/cdr_20250101_to_20250105.parquet. Column: [duration], Expected: int, Found: INT64.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedSchemaColumnConvertError(QueryExecutionErrors.scala:854)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:287)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:593)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Sour

Py4JJavaError: An error occurred while calling o573.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 114) (172.30.0.34 executor 0): org.apache.spark.SparkException: Parquet column cannot be converted in file hdfs://namenode:9000/user/hive/warehouse/generated_raw_cdr/cdr_20250101_to_20250105.parquet. Column: [duration], Expected: int, Found: INT64.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedSchemaColumnConvertError(QueryExecutionErrors.scala:854)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:287)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:593)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException: column: [duration], physicalType: INT64, logicalType: int
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.constructConvertNotSupportedException(ParquetVectorUpdaterFactory.java:1136)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.getUpdater(ParquetVectorUpdaterFactory.java:199)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:175)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:342)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:233)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:283)
	... 23 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:989)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2398)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2419)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2438)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:530)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:483)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:61)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4332)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:3314)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4322)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4320)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4320)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:3314)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:3537)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:280)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:315)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.spark.SparkException: Parquet column cannot be converted in file hdfs://namenode:9000/user/hive/warehouse/generated_raw_cdr/cdr_20250101_to_20250105.parquet. Column: [duration], Expected: int, Found: INT64.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedSchemaColumnConvertError(QueryExecutionErrors.scala:854)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:287)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:593)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException: column: [duration], physicalType: INT64, logicalType: int
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.constructConvertNotSupportedException(ParquetVectorUpdaterFactory.java:1136)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.getUpdater(ParquetVectorUpdaterFactory.java:199)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:175)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:342)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:233)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:283)
	... 23 more


In [13]:
df = spark.read.parquet("/user/hive/warehouse/generated_raw_cdr/*.parquet")
print(df.count())
df.show(5, truncate=False)


                                                                                

146876149
+------------------------------------+-------------+----------------+----------------+----------------+------------+----------------+--------------+----------------+----------------+-------------------+-------------------+--------+--------------+---------+-----------+-----------+-------+--------------+------------------+------------+---------------+--------+------------+----------+-----------+-------------+---------------+-----------------+----------------+-------------+--------+---------+------+------------+---------------+------------+---------------------+--------------------+------------------------+---------------+--------------------+--------------------+-----------+----------+----------+----------------+----------------+--------------+--------------------------------+
|cdr_id                              |subscriber_id|msisdn          |imsi            |imei            |service_type|service_subtype |session_id    |calling_party   |called_party    |start_time         |en

In [14]:
spark.sql("DROP TABLE IF EXISTS cdr_raw")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS cdr_raw (
        -- Identifiers
        cdr_id STRING,
        subscriber_id STRING,
        msisdn STRING,
        imsi STRING,
        imei STRING,
        
        -- Service Information
        service_type STRING,
        service_subtype STRING,
        session_id STRING,
        
        -- Call Details
        calling_party STRING,
        called_party STRING,
        start_time STRING,
        end_time STRING,
        duration BIGINT,  -- ← FIXED! Was INT, now BIGINT

        -- Data Usage
        data_volume_mb DOUBLE,
        upload_mb DOUBLE,
        download_mb DOUBLE,

        -- Location & Network
        cell_id STRING,
        lac STRING,
        location_area STRING,
        serving_cell_tower STRING,
        network_type STRING,
        operator STRING,

        -- Financial
        charging_amount DOUBLE,
        currency STRING,
        payment_type STRING,
        tax_amount DOUBLE,
        revenue_per_mb DOUBLE,

        -- Quality & Status
        call_result STRING,
        quality_score DOUBLE,
        signal_strength BIGINT, -- (if Parquet is INT64, else leave INT)
        dropped_call_flag BOOLEAN,
        network_congestion_level STRING,

        -- Customer Profile
        customer_segment STRING,
        tariff_plan STRING,
        age_group STRING,
        gender STRING,
        customer_lifetime_value_category STRING,

        -- Patterns & Analytics
        time_of_day_category STRING,
        day_of_week STRING,
        is_weekend BOOLEAN,
        is_holiday BOOLEAN,

        -- Special Features
        special_offer_applied STRING,
        promotional_discount DOUBLE,
        application_used STRING,
        content_category STRING,

        -- Roaming
        roaming_flag BOOLEAN,
        roaming_country STRING,
        roaming_type STRING,

        -- Anomalies
        fraud_indicator BOOLEAN,
        unusual_pattern_flag BOOLEAN
    )
    STORED AS PARQUET
    LOCATION '/user/hive/warehouse/generated_raw_cdr'
""")


25/06/22 16:07:51 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.


DataFrame[]

In [15]:
spark.sql("ANALYZE TABLE cdr_raw COMPUTE STATISTICS")


DataFrame[]

In [16]:
spark.sql("SELECT COUNT(*) FROM cdr_raw").show()
spark.sql("SELECT * FROM cdr_raw LIMIT 5").show(truncate=False)


+---------+
| count(1)|
+---------+
|146876149|
+---------+

+------------------------------------+-------------+----------------+----------------+----------------+------------+----------------+--------------+----------------+----------------+-------------------+-------------------+--------+--------------+---------+-----------+-----------+-------+--------------+------------------+------------+--------+---------------+--------+------------+----------+--------------+-----------+-------------+---------------+-----------------+------------------------+----------------+-------------+---------+------+--------------------------------+--------------------+-----------+----------+----------+---------------------+--------------------+----------------+----------------+------------+---------------+------------+---------------+--------------------+
|cdr_id                              |subscriber_id|msisdn          |imsi            |imei            |service_type|service_subtype |session_id    |calli

In [1]:
import sys
sys.path.append('/home/jovyan/work/work/scripts')
from spark_init import init_spark
from pyspark.sql import functions as F
from pyspark.sql.types import *
from datetime import datetime
import time

# Initialize Spark with Hive support
spark = init_spark("Notebook 02: Hive Advanced Setup")
print(f"✅ SparkSession initialized (App: {spark.sparkContext.appName}, Spark: {spark.version})")


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/06/22 16:55:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✅ SparkSession initialized (App: Notebook 02: Hive Advanced Setup, Spark: 3.5.1)
✅ Hive Warehouse: hdfs://namenode:9000/user/hive/warehouse
✅ Hive Metastore URI: thrift://hive-metastore:9083
✅ SparkSession initialized (App: Notebook 02: Hive Advanced Setup, Spark: 3.5.1)


In [2]:
# (Re)create database for a clean start (only if you want to overwrite old stuff)
db_name = "algerie_telecom_gen"
spark.sql(f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION '/user/hive/warehouse/{db_name}.db'")
spark.sql(f"USE {db_name}")
print(f"✅ Using database: {db_name}")


25/06/22 16:56:02 WARN HiveConf: HiveConf of name hive.metastore.event.db.notification.api.auth does not exist


✅ Using database: algerie_telecom_gen


In [3]:
raw_parquet_path = "/user/hive/warehouse/generated_raw_cdr/*.parquet"
df = spark.read.parquet(raw_parquet_path)
print("📋 Data Schema from Parquet:")
df.printSchema()
print(f"Total records in Parquet: {df.count():,}")
df.show(3, truncate=False)


                                                                                

📋 Data Schema from Parquet:
root
 |-- cdr_id: string (nullable = true)
 |-- subscriber_id: string (nullable = true)
 |-- msisdn: string (nullable = true)
 |-- imsi: string (nullable = true)
 |-- imei: string (nullable = true)
 |-- service_type: string (nullable = true)
 |-- service_subtype: string (nullable = true)
 |-- session_id: string (nullable = true)
 |-- calling_party: string (nullable = true)
 |-- called_party: string (nullable = true)
 |-- start_time: string (nullable = true)
 |-- end_time: string (nullable = true)
 |-- duration: long (nullable = true)
 |-- data_volume_mb: double (nullable = true)
 |-- upload_mb: double (nullable = true)
 |-- download_mb: double (nullable = true)
 |-- cell_id: string (nullable = true)
 |-- lac: string (nullable = true)
 |-- location_area: string (nullable = true)
 |-- serving_cell_tower: string (nullable = true)
 |-- network_type: string (nullable = true)
 |-- charging_amount: double (nullable = true)
 |-- currency: string (nullable = true)
 |

25/06/22 16:56:18 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


Total records in Parquet: 146,876,149
+------------------------------------+-------------+----------------+----------------+----------------+------------+----------------+--------------+----------------+----------------+-------------------+-------------------+--------+--------------+---------+-----------+-----------+-------+-------------+------------------+------------+---------------+--------+------------+----------+-----------+-------------+---------------+-----------------+----------------+-------------+--------+---------+------+------------+---------------+------------+---------------------+--------------------+------------------------+---------------+--------------------+--------------------+-----------+----------+----------+----------------+----------------+--------------+--------------------------------+
|cdr_id                              |subscriber_id|msisdn          |imsi            |imei            |service_type|service_subtype |session_id    |calling_party   |called_party

In [4]:
spark.sql("DROP TABLE IF EXISTS cdr_raw")

spark.sql(f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS cdr_raw (
        cdr_id STRING,
        subscriber_id STRING,
        msisdn STRING,
        imsi STRING,
        imei STRING,
        service_type STRING,
        service_subtype STRING,
        session_id STRING,
        calling_party STRING,
        called_party STRING,
        start_time STRING,
        end_time STRING,
        duration BIGINT,  -- Matches Parquet INT64
        data_volume_mb DOUBLE,
        upload_mb DOUBLE,
        download_mb DOUBLE,
        cell_id STRING,
        lac STRING,
        location_area STRING,
        serving_cell_tower STRING,
        network_type STRING,
        operator STRING,
        charging_amount DOUBLE,
        currency STRING,
        payment_type STRING,
        tax_amount DOUBLE,
        revenue_per_mb DOUBLE,
        call_result STRING,
        quality_score DOUBLE,
        signal_strength BIGINT,  -- Matches Parquet INT64
        dropped_call_flag BOOLEAN,
        network_congestion_level STRING,
        customer_segment STRING,
        tariff_plan STRING,
        age_group STRING,
        gender STRING,
        customer_lifetime_value_category STRING,
        time_of_day_category STRING,
        day_of_week STRING,
        is_weekend BOOLEAN,
        is_holiday BOOLEAN,
        special_offer_applied STRING,
        promotional_discount DOUBLE,
        application_used STRING,
        content_category STRING,
        roaming_flag BOOLEAN,
        roaming_country STRING,
        roaming_type STRING,
        fraud_indicator BOOLEAN,
        unusual_pattern_flag BOOLEAN
    )
    STORED AS PARQUET
    LOCATION '/user/hive/warehouse/generated_raw_cdr'
""")
print("✅ External table 'cdr_raw' created with correct schema.")


25/06/22 16:56:24 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.


✅ External table 'cdr_raw' created with correct schema.


In [5]:
print("cdr_raw Table Count:")
spark.sql("SELECT COUNT(*) FROM cdr_raw").show()

print("cdr_raw Table Preview:")
spark.sql("SELECT * FROM cdr_raw LIMIT 5").show(truncate=False)


cdr_raw Table Count:
+---------+
| count(1)|
+---------+
|146876149|
+---------+

cdr_raw Table Preview:
+------------------------------------+-------------+----------------+----------------+----------------+------------+----------------+--------------+----------------+----------------+-------------------+-------------------+--------+--------------+---------+-----------+-----------+-------+--------------+------------------+------------+--------+---------------+--------+------------+----------+--------------+-----------+-------------+---------------+-----------------+------------------------+----------------+-------------+---------+------+--------------------------------+--------------------+-----------+----------+----------+---------------------+--------------------+----------------+----------------+------------+---------------+------------+---------------+--------------------+
|cdr_id                              |subscriber_id|msisdn          |imsi            |imei            |servic

In [6]:
spark.sql("DROP TABLE IF EXISTS cdr_partitioned")

spark.sql(f"""
    CREATE TABLE IF NOT EXISTS cdr_partitioned (
        cdr_id STRING,
        subscriber_id STRING,
        msisdn STRING,
        imsi STRING,
        imei STRING,
        service_type STRING,
        service_subtype STRING,
        session_id STRING,
        calling_party STRING,
        called_party STRING,
        start_time TIMESTAMP,
        end_time TIMESTAMP,
        duration BIGINT,
        data_volume_mb DOUBLE,
        upload_mb DOUBLE,
        download_mb DOUBLE,
        cell_id STRING,
        lac STRING,
        location_area STRING,
        serving_cell_tower STRING,
        network_type STRING,
        operator STRING,
        charging_amount DOUBLE,
        currency STRING,
        payment_type STRING,
        tax_amount DOUBLE,
        revenue_per_mb DOUBLE,
        call_result STRING,
        quality_score DOUBLE,
        signal_strength BIGINT,
        dropped_call_flag BOOLEAN,
        network_congestion_level STRING,
        customer_segment STRING,
        tariff_plan STRING,
        age_group STRING,
        gender STRING,
        customer_lifetime_value_category STRING,
        time_of_day_category STRING,
        day_of_week STRING,
        is_weekend BOOLEAN,
        is_holiday BOOLEAN,
        special_offer_applied STRING,
        promotional_discount DOUBLE,
        application_used STRING,
        content_category STRING,
        roaming_flag BOOLEAN,
        roaming_country STRING,
        roaming_type STRING,
        fraud_indicator BOOLEAN,
        unusual_pattern_flag BOOLEAN
    )
    PARTITIONED BY (year INT, month INT, day INT)
    STORED AS PARQUET
    TBLPROPERTIES (
        'compression' = 'snappy',
        'transactional' = 'false'
    )
""")
print("✅ Partitioned table 'cdr_partitioned' created.")


✅ Partitioned table 'cdr_partitioned' created.


In [7]:
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("SET hive.exec.max.dynamic.partitions = 10000")
spark.sql("SET hive.exec.max.dynamic.partitions.pernode = 1000")

print("🗂️ Loading data into cdr_partitioned...")

spark.sql("""
    INSERT OVERWRITE TABLE cdr_partitioned PARTITION(year, month, day)
    SELECT
        cdr_id, subscriber_id, msisdn, imsi, imei,
        service_type, service_subtype, session_id,
        calling_party, called_party,
        CAST(start_time AS TIMESTAMP) as start_time,
        CAST(end_time AS TIMESTAMP) as end_time,
        duration, data_volume_mb, upload_mb, download_mb,
        cell_id, lac, location_area, serving_cell_tower,
        network_type, operator, charging_amount, currency, payment_type,
        tax_amount, revenue_per_mb, call_result, quality_score,
        signal_strength, dropped_call_flag, network_congestion_level,
        customer_segment, tariff_plan, age_group, gender,
        customer_lifetime_value_category, time_of_day_category,
        day_of_week, is_weekend, is_holiday,
        special_offer_applied, promotional_discount, application_used, content_category,
        roaming_flag, roaming_country, roaming_type,
        fraud_indicator, unusual_pattern_flag,
        YEAR(CAST(start_time AS TIMESTAMP)) as year,
        MONTH(CAST(start_time AS TIMESTAMP)) as month,
        DAY(CAST(start_time AS TIMESTAMP)) as day
    FROM cdr_raw
""")
print("✅ Data inserted into cdr_partitioned.")


25/06/22 16:57:30 WARN SetCommand: 'SET hive.exec.dynamic.partition=true' might not work, since Spark doesn't support changing the Hive config dynamically. Please pass the Hive-specific config by adding the prefix spark.hadoop (e.g. spark.hadoop.hive.exec.dynamic.partition) when starting a Spark application. For details, see the link: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
25/06/22 16:57:30 WARN SetCommand: 'SET hive.exec.dynamic.partition.mode=nonstrict' might not work, since Spark doesn't support changing the Hive config dynamically. Please pass the Hive-specific config by adding the prefix spark.hadoop (e.g. spark.hadoop.hive.exec.dynamic.partition.mode) when starting a Spark application. For details, see the link: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
25/06/22 16:57:30 WARN SetCommand: 'SET hive.exec.max.dynamic.partitions=10000' might not work, since Spark doesn't support

🗂️ Loading data into cdr_partitioned...


                                                                                

✅ Data inserted into cdr_partitioned.


In [8]:
spark.sql("SELECT COUNT(*) FROM cdr_partitioned").show()
spark.sql("""
    SELECT year, month, COUNT(DISTINCT day) AS days, COUNT(*) AS records
    FROM cdr_partitioned
    GROUP BY year, month
    ORDER BY year, month
""").show(24)


                                                                                

+---------+
| count(1)|
+---------+
|146876149|
+---------+





+----+-----+----+--------+
|year|month|days| records|
+----+-----+----+--------+
|2025|    1|  31|25157166|
|2025|    2|  28|22717051|
|2025|    3|  31|25160438|
|2025|    4|  30|24347011|
|2025|    5|  31|25156520|
|2025|    6|  30|24337963|
+----+-----+----+--------+



                                                                                

In [9]:
views_created = 0

# Service-based
spark.sql("""
    CREATE OR REPLACE VIEW voice_calls AS
    SELECT * FROM cdr_partitioned WHERE service_type = 'VOICE'
"""); views_created += 1

spark.sql("""
    CREATE OR REPLACE VIEW data_sessions AS
    SELECT * FROM cdr_partitioned WHERE service_type = 'DATA'
"""); views_created += 1

spark.sql("""
    CREATE OR REPLACE VIEW sms_records AS
    SELECT * FROM cdr_partitioned WHERE service_type = 'SMS'
"""); views_created += 1

# Fraud/Anomalies
spark.sql("""
    CREATE OR REPLACE VIEW fraud_cases AS
    SELECT * FROM cdr_partitioned
    WHERE fraud_indicator = true OR unusual_pattern_flag = true
"""); views_created += 1

# Network Issues
spark.sql("""
    CREATE OR REPLACE VIEW network_issues AS
    SELECT * FROM cdr_partitioned
    WHERE dropped_call_flag = true OR call_result = 'FAILED'
       OR network_congestion_level IN ('MEDIUM', 'HIGH') OR quality_score < 0.5
"""); views_created += 1

# High Value
spark.sql("""
    CREATE OR REPLACE VIEW high_value_customers AS
    SELECT DISTINCT
        subscriber_id, customer_segment, customer_lifetime_value_category,
        tariff_plan, payment_type, age_group, gender, operator
    FROM cdr_partitioned
    WHERE customer_segment IN ('Premium')
       OR customer_lifetime_value_category IN ('High', 'Very High')
       OR payment_type = 'POSTPAID'
"""); views_created += 1

# Special offers
spark.sql("""
    CREATE OR REPLACE VIEW special_offers_usage AS
    SELECT * FROM cdr_partitioned
    WHERE special_offer_applied != 'None' AND promotional_discount > 0
"""); views_created += 1

# Roaming
spark.sql("""
    CREATE OR REPLACE VIEW roaming_records AS
    SELECT * FROM cdr_partitioned WHERE roaming_flag = true
"""); views_created += 1

# App usage
spark.sql("""
    CREATE OR REPLACE VIEW app_usage_data AS
    SELECT * FROM cdr_partitioned
    WHERE service_type = 'DATA' AND application_used IS NOT NULL AND application_used != ''
"""); views_created += 1

print(f"✅ Created {views_created} analytical views.")


✅ Created 9 analytical views.


In [10]:
spark.sql("""
    CREATE TABLE IF NOT EXISTS daily_kpis AS
    SELECT
        year, month, day, service_type, operator, customer_segment,
        COUNT(DISTINCT subscriber_id) as unique_subscribers,
        COUNT(*) as total_transactions,
        SUM(duration) as total_duration_seconds,
        SUM(data_volume_mb) as total_data_mb,
        SUM(charging_amount) as total_revenue,
        SUM(tax_amount) as total_tax,
        AVG(quality_score) as avg_quality_score,
        SUM(CASE WHEN fraud_indicator THEN 1 ELSE 0 END) as fraud_cases,
        SUM(CASE WHEN dropped_call_flag THEN 1 ELSE 0 END) as dropped_calls,
        SUM(CASE WHEN special_offer_applied != 'None' THEN 1 ELSE 0 END) as special_offer_usage,
        AVG(promotional_discount) as avg_discount_applied
    FROM cdr_partitioned
    GROUP BY year, month, day, service_type, operator, customer_segment
""")
print("✅ Created daily_kpis table.")


✅ Created daily_kpis table.


25/06/22 17:02:47 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.


In [11]:
spark.sql("""
    CREATE TABLE IF NOT EXISTS hourly_patterns AS
    SELECT
        time_of_day_category, day_of_week, is_weekend, service_type, operator,
        COUNT(*) as transaction_count,
        COUNT(DISTINCT subscriber_id) as unique_users,
        AVG(duration) as avg_duration,
        AVG(data_volume_mb) as avg_data_mb,
        AVG(charging_amount) as avg_revenue,
        AVG(quality_score) as avg_quality,
        SUM(CASE WHEN network_congestion_level = 'HIGH' THEN 1 ELSE 0 END) as high_congestion_count
    FROM cdr_partitioned
    GROUP BY time_of_day_category, day_of_week, is_weekend, service_type, operator
""")
print("✅ Created hourly_patterns table.")


✅ Created hourly_patterns table.


25/06/22 17:02:50 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.


In [12]:
spark.sql("""
    CREATE TABLE IF NOT EXISTS location_network_metrics AS
    SELECT
        location_area, network_type, operator, service_type,
        COUNT(*) as total_transactions,
        COUNT(DISTINCT subscriber_id) as unique_subscribers,
        AVG(quality_score) as avg_quality_score,
        AVG(signal_strength) as avg_signal_strength,
        SUM(CASE WHEN dropped_call_flag THEN 1 ELSE 0 END) as dropped_count,
        SUM(CASE WHEN call_result = 'FAILED' THEN 1 ELSE 0 END) as failed_count,
        SUM(CASE WHEN network_congestion_level = 'HIGH' THEN 1 ELSE 0 END) as high_congestion_count,
        AVG(data_volume_mb) as avg_data_usage,
        SUM(charging_amount) as total_revenue
    FROM cdr_partitioned
    GROUP BY location_area, network_type, operator, service_type
""")
print("✅ Created location_network_metrics table.")


✅ Created location_network_metrics table.


25/06/22 17:02:52 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.


In [13]:
spark.sql("""
    CREATE TABLE IF NOT EXISTS customer_demographics_summary AS
    SELECT
        customer_segment, age_group, gender, payment_type, operator,
        COUNT(DISTINCT subscriber_id) as subscriber_count,
        COUNT(*) as total_activities,
        AVG(charging_amount) as avg_transaction_value,
        SUM(charging_amount) as total_revenue,
        AVG(data_volume_mb) as avg_data_usage,
        SUM(CASE WHEN fraud_indicator THEN 1 ELSE 0 END) as fraud_incidents,
        AVG(promotional_discount) as avg_discount_received
    FROM cdr_partitioned
    GROUP BY customer_segment, age_group, gender, payment_type, operator
""")
print("✅ Created customer_demographics_summary table.")


✅ Created customer_demographics_summary table.


25/06/22 17:02:55 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.


In [14]:
spark.sql("""
    CREATE TABLE IF NOT EXISTS app_usage_analytics AS
    SELECT
        application_used, content_category, customer_segment, age_group,
        COUNT(DISTINCT subscriber_id) as unique_users,
        COUNT(*) as total_sessions,
        SUM(data_volume_mb) as total_data_mb,
        AVG(data_volume_mb) as avg_data_per_session,
        SUM(duration) as total_duration,
        AVG(duration) as avg_session_duration,
        SUM(charging_amount) as total_revenue,
        AVG(revenue_per_mb) as avg_revenue_per_mb
    FROM cdr_partitioned
    WHERE service_type = 'DATA' AND application_used IS NOT NULL
    GROUP BY application_used, content_category, customer_segment, age_group
""")
print("✅ Created app_usage_analytics table.")


✅ Created app_usage_analytics table.


25/06/22 17:04:57 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.


In [15]:
spark.sql("""
    CREATE TABLE IF NOT EXISTS fraud_monitoring AS
    SELECT
        DATE(start_time) as fraud_date,
        operator, location_area,
        COUNT(*) as total_incidents,
        COUNT(DISTINCT subscriber_id) as affected_subscribers,
        SUM(charging_amount) as potential_loss,
        COLLECT_SET(service_type) as affected_services
    FROM cdr_partitioned
    WHERE fraud_indicator = true OR unusual_pattern_flag = true
    GROUP BY DATE(start_time), operator, location_area
""")
print("✅ Created fraud_monitoring table.")


✅ Created fraud_monitoring table.


25/06/22 17:05:00 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.


In [16]:
spark.sql("""
    CREATE TABLE IF NOT EXISTS revenue_tracking AS
    SELECT
        year, month, day, operator, service_type, customer_segment, payment_type,
        SUM(charging_amount) as gross_revenue,
        SUM(tax_amount) as tax_collected,
        SUM(charging_amount - tax_amount) as net_revenue,
        COUNT(DISTINCT subscriber_id) as active_customers,
        COUNT(*) as total_transactions,
        AVG(charging_amount) as arpu_daily
    FROM cdr_partitioned
    GROUP BY year, month, day, operator, service_type, customer_segment, payment_type
""")
print("✅ Created revenue_tracking table.")


✅ Created revenue_tracking table.


25/06/22 17:05:02 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.


In [17]:
tables_to_analyze = [
    'cdr_partitioned', 'daily_kpis', 'hourly_patterns',
    'location_network_metrics', 'customer_demographics_summary',
    'app_usage_analytics', 'fraud_monitoring', 'revenue_tracking'
]
for table in tables_to_analyze:
    spark.sql(f"ANALYZE TABLE {table} COMPUTE STATISTICS")
    if table in ['hourly_patterns', 'customer_demographics_summary']:
        spark.sql(f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR ALL COLUMNS")
    print(f"✅ Analyzed {table}")


                                                                                

✅ Analyzed cdr_partitioned
✅ Analyzed daily_kpis
✅ Analyzed hourly_patterns
✅ Analyzed location_network_metrics
✅ Analyzed customer_demographics_summary
✅ Analyzed app_usage_analytics
✅ Analyzed fraud_monitoring
✅ Analyzed revenue_tracking


In [18]:
print("\n📦 Tables in database:")
for row in spark.sql(f"SHOW TABLES IN {db_name}").collect():
    print(f"- {row.tableName}")

print("\n👁️ Views created:", views_created)

print("\n⚡ Performance Test:")
start = time.time()
spark.sql("""
    SELECT operator, service_type, COUNT(*) as count, ROUND(SUM(charging_amount),2) as revenue
    FROM cdr_partitioned
    WHERE year = 2025 AND month = 1
    GROUP BY operator, service_type
    ORDER BY operator, service_type
""").show()
print(f"Test query executed in {time.time()-start:.2f} seconds")

print("\n🎯 Next Steps:")
print("   → Run Notebook 03 for Advanced Data Engineering")
print("   → Run Notebook 04 for Anomaly Detection & Trend Analysis")
print("   → Run Notebook 05 for Business Intelligence Dashboards")



📦 Tables in database:
- daily_kpis
- hourly_patterns
- location_network_metrics
- customer_demographics_summary
- app_usage_analytics
- fraud_monitoring
- revenue_tracking
- cdr_raw
- cdr_partitioned
- voice_calls
- data_sessions
- sms_records
- fraud_cases
- network_issues
- high_value_customers
- special_offers_usage
- roaming_records
- app_usage_data

👁️ Views created: 9

⚡ Performance Test:




+--------+------------+-------+---------------+
|operator|service_type|  count|        revenue|
+--------+------------+-------+---------------+
|  Djezzy|        DATA|4673700|5.88676314612E9|
|  Djezzy|         SMS|2337228|  1.455933981E7|
|  Djezzy|       VOICE|1868374| 1.4483556425E8|
| Mobilis|        DATA|5922757|7.45828300811E9|
| Mobilis|         SMS|2961333|  1.843686113E7|
| Mobilis|       VOICE|2372992| 1.8381413374E8|
| Ooredoo|        DATA|2642823|3.32654639053E9|
| Ooredoo|         SMS|1320866|     8225269.28|
| Ooredoo|       VOICE|1057093|  8.202056435E7|
+--------+------------+-------+---------------+

Test query executed in 1.19 seconds

🎯 Next Steps:
   → Run Notebook 03 for Advanced Data Engineering
   → Run Notebook 04 for Anomaly Detection & Trend Analysis
   → Run Notebook 05 for Business Intelligence Dashboards


                                                                                