
1. Load the CSV file without schema inference.
2. Print the schema.
3. Count total records.
4. Show sample rows.


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType
from pyspark.sql.functions import (
    trim, col, when, to_date, sum as spark_sum, avg, desc, rank, lit,
    coalesce, isnull, try_to_timestamp, regexp_extract, initcap,
)
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("Project1").getOrCreate()

In [4]:
df_raw = spark.read \
    .option("header", "true") \
    .option("inferSchema", "false") \
    .csv("orders.csv")

df_raw.show()
df_raw.printSchema()
df_raw.count()
df_raw.describe()

+-----------+-----------+-----------+-----------+-----------+-------+----------+---------+
|   order_id|customer_id|       city|   category|    product| amount|order_date|   status|
+-----------+-----------+-----------+-----------+-----------+-------+----------+---------+
|ORD00000000|    C000000| hyderabad |   grocery |       Oil |invalid|01/01/2024|Cancelled|
|ORD00000001|    C000001|       Pune|    Grocery|      Sugar|  35430|2024-01-02|Completed|
|ORD00000002|    C000002|       Pune|Electronics|     Mobile|  65358|2024-01-03|Completed|
|ORD00000003|    C000003|  Bangalore|Electronics|     Laptop|   5558|2024-01-04|Completed|
|ORD00000004|    C000004|       Pune|       Home|AirPurifier|  33659|2024-01-05|Completed|
|ORD00000005|    C000005|      Delhi|    Fashion|      Jeans|   8521|2024-01-06|Completed|
|ORD00000006|    C000006|      Delhi|    Grocery|      Sugar|  42383|2024-01-07|Completed|
|ORD00000007|    C000007|       Pune|    Grocery|       Rice|  45362|2024-01-08|Completed|

DataFrame[summary: string, order_id: string, customer_id: string, city: string, category: string, product: string, amount: string, order_date: string, status: string]

In [8]:
df_raw.count()

300000

In [7]:
df_raw.isEmpty()

False

5. Explain why all columns must be treated as StringType initially.


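If we imposed specific column types such as int or date at load time, the corrupted values would not fit the schema. Depending on the CSV read mode, Spark would then null out the bad values, drop the rows, or fail the read, and in every case the original text is lost. Loading every column as StringType keeps each raw value, so the data can be inspected, cleaned, and only then cast to its proper type.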
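
As a rough plain-Python analogy (no Spark involved), the sketch below shows what strict typing does to dirty values. The sample values are modeled on the rows shown above, and `try_parse_int` is a hypothetical helper used only for illustration.

```python
# Raw values as they would arrive from the CSV: everything is text.
raw_amounts = ["invalid", "35430", "65358", ""]


def try_parse_int(value):
    """Return an int, or None when the text is not a valid integer."""
    try:
        return int(value)
    except ValueError:
        return None


# Strict parsing fails as soon as it meets a dirty value...
try:
    strict = [int(v) for v in raw_amounts]
except ValueError as exc:
    strict = None
    print(f"strict parse failed: {exc}")

# ...while string-first loading keeps every row and defers the decision.
parsed = [try_parse_int(v) for v in raw_amounts]
print(parsed)
```

Keeping the values as text first means the decision about what counts as invalid is made explicitly in the cleaning phase rather than implicitly by the reader.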

The dataset is not empty and contains about 300,000 (3 lakh) records.
The next step is to clean the data.

PHASE 2 – Data Cleaning
The dataset must be cleaned in the following way:
1. Remove leading and trailing spaces from:
city
category
product
2. Standardize text:
Convert city, category, and product to proper case.
3. Clean the amount column:
Remove commas.
Replace empty strings and invalid values with null.
Convert amount into IntegerType.
Rows with invalid amounts must not crash the pipeline.
4. Clean the order_date column:
Support the following formats:
yyyy-MM-dd
dd/MM/yyyy
yyyy/MM/dd

Create a new column: order_date_clean with DateType.
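
The rules above can be sketched in plain Python before they are expressed in Spark. This is only an illustration of the intended logic, not the notebook's implementation: `clean_amount` and `parse_order_date` are hypothetical helper names, and the date patterns are the `strptime` equivalents of the three formats listed.

```python
import re
from datetime import datetime

# The three supported date formats, written as strptime patterns.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d"]


def clean_amount(raw):
    """Strip spaces and commas; return an int, or None for invalid input."""
    if raw is None:
        return None
    text = raw.strip().replace(",", "")
    # Only plain ASCII digits count as a valid amount.
    return int(text) if re.fullmatch(r"[0-9]+", text) else None


def parse_order_date(raw):
    """Try each supported format in turn; return None if none match."""
    if raw is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


print(clean_amount("35,430"), clean_amount("invalid"), clean_amount(""))
print(parse_order_date("01/01/2024"), parse_order_date("2024/01/04"))
```

Returning `None` instead of raising is what keeps a single bad row from crashing the pipeline.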



In [11]:
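# Trim and proper-case the text columns.
# Note: "amount" and "status" are included too; initcap leaves digits unchanged
# but turns a non-numeric amount such as "invalid" into "Invalid".
cols_to_clean = ["city", "category", "product", "amount", "status"]
df_cleaned_str = df_raw
for col_name in cols_to_clean:
    df_cleaned_str = df_cleaned_str.withColumn(col_name, initcap(trim(col(col_name))))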

In [13]:
df_cleaned_str.show()
print(df_cleaned_str.count())

+-----------+-----------+---------+-----------+-----------+-------+----------+---------+
|   order_id|customer_id|     city|   category|    product| amount|order_date|   status|
+-----------+-----------+---------+-----------+-----------+-------+----------+---------+
|ORD00000000|    C000000|Hyderabad|    Grocery|        Oil|Invalid|01/01/2024|Cancelled|
|ORD00000001|    C000001|     Pune|    Grocery|      Sugar|  35430|2024-01-02|Completed|
|ORD00000002|    C000002|     Pune|Electronics|     Mobile|  65358|2024-01-03|Completed|
|ORD00000003|    C000003|Bangalore|Electronics|     Laptop|   5558|2024-01-04|Completed|
|ORD00000004|    C000004|     Pune|       Home|Airpurifier|  33659|2024-01-05|Completed|
|ORD00000005|    C000005|    Delhi|    Fashion|      Jeans|   8521|2024-01-06|Completed|
|ORD00000006|    C000006|    Delhi|    Grocery|      Sugar|  42383|2024-01-07|Completed|
|ORD00000007|    C000007|     Pune|    Grocery|       Rice|  45362|2024-01-08|Completed|
|ORD00000008|    C000

In [14]:
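# First run of digits in amount; "" when the value has none (e.g. "Invalid").
# Note: a value with commas such as "1,234" would yield "1", so strip commas
# first if the data contains them.
numeric_price_str = regexp_extract(col("amount"), r"(\d+)", 0)

# Note: invalid or empty amounts become 0 here rather than null, and the parsed
# date overwrites order_date instead of creating order_date_clean.
df_error_free = df_cleaned_str.withColumn(
    "amount",
    when((numeric_price_str == "") | numeric_price_str.isNull(), lit(0))
    .otherwise(numeric_price_str.cast("int")),
).withColumn(
    "order_date",
    coalesce(
        to_date(try_to_timestamp(col("order_date"), lit("yyyy-MM-dd"))),
        to_date(try_to_timestamp(col("order_date"), lit("dd/MM/yyyy"))),
        to_date(try_to_timestamp(col("order_date"), lit("yyyy/MM/dd"))),
    ),
)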

df_error_free.show()
df_error_free.printSchema()

+-----------+-----------+---------+-----------+-----------+------+----------+---------+
|   order_id|customer_id|     city|   category|    product|amount|order_date|   status|
+-----------+-----------+---------+-----------+-----------+------+----------+---------+
|ORD00000000|    C000000|Hyderabad|    Grocery|        Oil|     0|2024-01-01|Cancelled|
|ORD00000001|    C000001|     Pune|    Grocery|      Sugar| 35430|2024-01-02|Completed|
|ORD00000002|    C000002|     Pune|Electronics|     Mobile| 65358|2024-01-03|Completed|
|ORD00000003|    C000003|Bangalore|Electronics|     Laptop|  5558|2024-01-04|Completed|
|ORD00000004|    C000004|     Pune|       Home|Airpurifier| 33659|2024-01-05|Completed|
|ORD00000005|    C000005|    Delhi|    Fashion|      Jeans|  8521|2024-01-06|Completed|
|ORD00000006|    C000006|    Delhi|    Grocery|      Sugar| 42383|2024-01-07|Completed|
|ORD00000007|    C000007|     Pune|    Grocery|       Rice| 45362|2024-01-08|Completed|
|ORD00000008|    C000008|Bangalo

PHASE 3 – Data Validation
1. Count how many records had invalid amounts.
2. Count how many records had invalid dates.
3. Identify duplicate order_id values.
4. Remove duplicates using order_id.
5. Filter only records with:

status = "Completed"

6. Record row counts at every stage.
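
The counting steps can be illustrated with a small plain-Python sketch. The records below are hypothetical, `None` stands for a value that failed to parse, and keeping the first occurrence of a duplicate is a choice made for this sketch (Spark's `dropDuplicates` does not promise which duplicate survives).

```python
from collections import Counter

# Hypothetical cleaned records: (order_id, amount, order_date).
records = [
    ("ORD1", 35430, "2024-01-02"),
    ("ORD2", None, "2024-01-03"),
    ("ORD3", 5558, None),
    ("ORD1", 35430, "2024-01-02"),
]

# Steps 1 and 2: count values that failed to parse.
invalid_amounts = sum(1 for _, amount, _ in records if amount is None)
invalid_dates = sum(1 for _, _, order_date in records if order_date is None)

# Step 3: an order_id is a duplicate when it occurs more than once.
id_counts = Counter(order_id for order_id, _, _ in records)
duplicate_ids = sorted(oid for oid, n in id_counts.items() if n > 1)

# Step 4: keep the first occurrence of each order_id.
seen = set()
deduplicated = []
for rec in records:
    if rec[0] not in seen:
        seen.add(rec[0])
        deduplicated.append(rec)

print(invalid_amounts, invalid_dates, duplicate_ids, len(deduplicated))
```

On a DataFrame the same checks would typically be written as `df.filter(col("amount").isNull()).count()` and `df.groupBy("order_id").count().filter(col("count") > 1)`.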

In [15]:
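print(f"Count before cleaning: {df_error_free.count()}")

# amount is never null at this point (invalid values were mapped to 0 above),
# so this drops only the rows whose order_date failed to parse.
df_valid = df_error_free.dropna(subset=["amount", "order_date"])

print(f"Count after cleaning: {df_valid.count()}")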


Count before cleaning: 300000
Count after cleaning: 297405


In [16]:
df_clean = df_valid.dropDuplicates(["order_id"])

print(f"Count after dropping duplicates: {df_clean.count()}")

Count after dropping duplicates: 297405


In [17]:
df_completed = df_clean.filter(col("status") == "Completed")

print(f"Count after filtering: {df_completed.count()}")

Count after filtering: 282535


PHASE 4 – Performance Engineering
1. Check the number of partitions.

2. Run a groupBy on city and calculate total revenue.

3. Use:

explain(True)

to analyze execution.

4. Identify where shuffle happens.

5. Repartition the dataset by city.

6. Compare execution plans before and after repartition.
This phase exists to demonstrate understanding of Spark internals, not just outputs.


In [18]:
print(f"Number of partitions before repartition: {df_completed.rdd.getNumPartitions()}")

Number of partitions before repartition: 2


In [20]:
city_revenue = df_completed.groupBy("city").agg(spark_sum("amount").alias("total_revenue"))
city_revenue.show()

+---------+-------------+
|     city|total_revenue|
+---------+-------------+
|Bangalore|   1595093850|
|  Chennai|   1594968796|
|   Mumbai|   1592819957|
|  Kolkata|   1589960718|
|     Pune|   1611302685|
|    Delhi|   1602686184|
|Hyderabad|   1609260488|
+---------+-------------+



In [21]:
city_revenue.explain(True)

== Parsed Logical Plan ==
'Aggregate ['city], ['city, 'sum('amount) AS total_revenue#859]
+- Filter (status#492 = Completed)
   +- Deduplicate [order_id#155]
      +- Filter atleastnnonnulls(2, amount#571, order_date#572)
         +- Project [order_id#155, customer_id#156, city#488, category#489, product#490, amount#571, coalesce(to_date(try_to_timestamp(order_date#161, Some(yyyy-MM-dd), TimestampType, Some(Etc/UTC), false), None, Some(Etc/UTC), true), to_date(try_to_timestamp(order_date#161, Some(dd/MM/yyyy), TimestampType, Some(Etc/UTC), false), None, Some(Etc/UTC), true), to_date(try_to_timestamp(order_date#161, Some(yyyy/MM/dd), TimestampType, Some(Etc/UTC), false), None, Some(Etc/UTC), true)) AS order_date#572, status#492]
            +- Project [order_id#155, customer_id#156, city#488, category#489, product#490, CASE WHEN ((regexp_extract(amount#491, (\d+), 0) = ) OR isnull(regexp_extract(amount#491, (\d+), 0))) THEN 0 ELSE cast(regexp_extract(amount#491, (\d+), 0) as int) END AS

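The shuffle shows up in the Physical Plan section of this output (truncated above) as an Exchange node with hashpartitioning on city: rows are redistributed so that all records for a city land in the same partition before the final sum.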
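
What a shuffle does can be mimicked in a few lines of plain Python. This is a simplified sketch, not Spark's implementation: `partition_for` is a hypothetical stand-in that uses CRC32 rather than Spark's own hash, the input rows are taken from the sample output above, and Spark's partial aggregation before the exchange is left out.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Deterministic stand-in for a hash partitioner (CRC32, not Spark's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical layout: rows scattered over two input partitions.
input_partitions = [
    [("Pune", 35430), ("Delhi", 8521), ("Pune", 65358)],
    [("Delhi", 42383), ("Pune", 45362), ("Bangalore", 5558)],
]

# "Shuffle": every row is routed to the partition chosen by its key.
shuffled = defaultdict(list)
for part in input_partitions:
    for city, amount in part:
        shuffled[partition_for(city, 3)].append((city, amount))

# After the shuffle each city lives in exactly one partition,
# so every per-city sum can be computed locally.
totals = {}
for rows in shuffled.values():
    for city, amount in rows:
        totals[city] = totals.get(city, 0) + amount
print(totals)
```

The costly part in a real cluster is the routing step, because rows may have to travel between machines.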

In [22]:


df_repartitioned = df_completed.repartition('city')
print(f"Number of partitions after repartitioning by city: {df_repartitioned.rdd.getNumPartitions()}")

Number of partitions after repartitioning by city: 3


In [24]:
df_repartitioned.explain(True)

== Parsed Logical Plan ==
'RepartitionByExpression ['city]
+- Filter (status#492 = Completed)
   +- Deduplicate [order_id#155]
      +- Filter atleastnnonnulls(2, amount#571, order_date#572)
         +- Project [order_id#155, customer_id#156, city#488, category#489, product#490, amount#571, coalesce(to_date(try_to_timestamp(order_date#161, Some(yyyy-MM-dd), TimestampType, Some(Etc/UTC), false), None, Some(Etc/UTC), true), to_date(try_to_timestamp(order_date#161, Some(dd/MM/yyyy), TimestampType, Some(Etc/UTC), false), None, Some(Etc/UTC), true), to_date(try_to_timestamp(order_date#161, Some(yyyy/MM/dd), TimestampType, Some(Etc/UTC), false), None, Some(Etc/UTC), true)) AS order_date#572, status#492]
            +- Project [order_id#155, customer_id#156, city#488, category#489, product#490, CASE WHEN ((regexp_extract(amount#491, (\d+), 0) = ) OR isnull(regexp_extract(amount#491, (\d+), 0))) THEN 0 ELSE cast(regexp_extract(amount#491, (\d+), 0) as int) END AS amount#571, order_date#161, st


*   The `df_completed` DataFrame was repartitioned by the `city` column, resulting in the `df_repartitioned` DataFrame having 3 partitions.
*   The per-city revenue shown earlier (seven cities: Bangalore, Chennai, Mumbai, Kolkata, Pune, Delhi, and Hyderabad) was computed from `df_completed` before repartitioning. The aggregation was not rerun on `df_repartitioned`, so the visible difference between the two plans is the `RepartitionByExpression` node that repartitioning adds on top of the same filter and deduplication steps.




PHASE 5 – Analytics
Using the cleaned dataset:
1. Total revenue per city.
2. Total revenue per category.
3. Average order value per city.
4. Top 10 products by revenue.
5. Cities sorted by revenue descending.

In [25]:
city_revenue = df_completed.groupBy("city").agg(spark_sum("amount").alias("total_revenue"))
city_revenue.show()

catgory_revenue = df_completed.groupBy("category").agg(spark_sum("amount").alias("total_revenue"))
catgory_revenue.show()

city_average = df_completed.groupBy("city").agg(avg("amount").alias("average_order_value"))
city_average.show()

top_10_products = df_completed.groupBy("product").agg(spark_sum("amount").alias("total_revenue"))
top_10_products = top_10_products.orderBy(desc("total_revenue")).limit(10)
top_10_products.show()

city_revenue_sorted = city_revenue.orderBy(desc("total_revenue"))
city_revenue_sorted.show()

+---------+-------------+
|     city|total_revenue|
+---------+-------------+
|Bangalore|   1595093850|
|  Chennai|   1594968796|
|   Mumbai|   1592819957|
|  Kolkata|   1589960718|
|     Pune|   1611302685|
|    Delhi|   1602686184|
|Hyderabad|   1609260488|
+---------+-------------+

+-----------+-------------+
|   category|total_revenue|
+-----------+-------------+
|       Home|   2808429137|
|    Fashion|   2774452711|
|    Grocery|   2806779336|
|Electronics|   2806431494|
+-----------+-------------+

+---------+-------------------+
|     city|average_order_value|
+---------+-------------------+
|Bangalore| 39910.272224585286|
|  Chennai|  39488.22252481989|
|   Mumbai| 39540.748131966335|
|  Kolkata| 39558.149876844225|
|     Pune|  39758.74565104745|
|    Delhi| 39603.790254027874|
|Hyderabad|  39533.74165970619|
+---------+-------------------+

+-----------+-------------+
|    product|total_revenue|
+-----------+-------------+
|        Oil|    943593995|
|     Laptop|    943181

PHASE 6 – Window Functions
1. Rank cities by revenue.
2. Rank products inside each category by revenue.
3. Find the top product for every category.
4. Identify the top 3 performing cities.

In [27]:
city_window = Window.orderBy(col("total_revenue").desc())
df_city_rank = city_revenue.withColumn("rank", rank().over(city_window))
df_city_rank.show()

+---------+-------------+----+
|     city|total_revenue|rank|
+---------+-------------+----+
|     Pune|   1611302685|   1|
|Hyderabad|   1609260488|   2|
|    Delhi|   1602686184|   3|
|Bangalore|   1595093850|   4|
|  Chennai|   1594968796|   5|
|   Mumbai|   1592819957|   6|
|  Kolkata|   1589960718|   7|
+---------+-------------+----+



In [28]:
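# city_window only orders by total_revenue, so it is reused here to rank the
# overall top 10 products; the per-category ranking follows in the next cell.
df_product_rank = top_10_products.withColumn("rank", rank().over(city_window))
df_product_rank.show()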

+-----------+-------------+----+
|    product|total_revenue|rank|
+-----------+-------------+----+
|        Oil|    943593995|   1|
|     Laptop|    943181599|   2|
|     Tablet|    939989279|   3|
|     Vacuum|    939626254|   4|
|      Mixer|    937113183|   5|
|       Rice|    934345709|   6|
|Airpurifier|    931689700|   7|
|      Jeans|    930922473|   8|
|      Sugar|    928839632|   9|
|      Shoes|    926563627|  10|
+-----------+-------------+----+



In [32]:
product_revenue_per_category = df_completed.groupBy("category", "product").agg(spark_sum("amount").alias("total_revenue"))

category_product_window = Window.partitionBy("category").orderBy(col("total_revenue").desc())

df_top_product_per_category = product_revenue_per_category.withColumn("rank", rank().over(category_product_window))

df_top_product_per_category.filter(col("rank") == 1).show()

+-----------+-------+-------------+----+
|   category|product|total_revenue|rank|
+-----------+-------+-------------+----+
|Electronics| Laptop|    943181599|   1|
|    Fashion|  Jeans|    930922473|   1|
|    Grocery|    Oil|    943593995|   1|
|       Home| Vacuum|    939626254|   1|
+-----------+-------+-------------+----+



In [29]:
# Order explicitly so that limit(3) does not depend on incidental row order.
df_city_rank.orderBy("rank").limit(3).show()

+---------+-------------+----+
|     city|total_revenue|rank|
+---------+-------------+----+
|     Pune|   1611302685|   1|
|Hyderabad|   1609260488|   2|
|    Delhi|   1602686184|   3|
+---------+-------------+----+



PHASE 7 – Broadcast Join
A small lookup table is provided:

city,region
Delhi,North
Mumbai,West
Bangalore,South
Hyderabad,South
Pune,West
Chennai,South
Kolkata,East


Tasks:
1. Join the orders data with this city-region dataset.
2. Apply broadcast join explicitly.
3. Verify using the physical plan that:

BroadcastHashJoin

is used.

4. Explain why broadcast join is efficient in this case.

In [33]:
data = [
("Delhi","North"),
("Mumbai","West"),
("Bangalore","South"),
("Hyderabad","South"),
("Pune","West"),
("Chennai","South"),
("Kolkata","East"),
]

columns = ["city","region"]

df_city_lookup = spark.createDataFrame(data, columns)
df_city_lookup.show()

+---------+------+
|     city|region|
+---------+------+
|    Delhi| North|
|   Mumbai|  West|
|Bangalore| South|
|Hyderabad| South|
|     Pune|  West|
|  Chennai| South|
|  Kolkata|  East|
+---------+------+



In [34]:
joined_data = df_completed.join(df_city_lookup, on="city", how="left")
joined_data.show()

+---------+-----------+-----------+-----------+-----------+------+----------+---------+------+
|     city|   order_id|customer_id|   category|    product|amount|order_date|   status|region|
+---------+-----------+-----------+-----------+-----------+------+----------+---------+------+
|     Pune|ORD00000001|    C000001|    Grocery|      Sugar| 35430|2024-01-02|Completed|  West|
|     Pune|ORD00000007|    C000007|    Grocery|       Rice| 45362|2024-01-08|Completed|  West|
|Bangalore|ORD00000008|    C000008|    Fashion|      Jeans| 10563|2024-01-09|Completed| South|
|Bangalore|ORD00000010|    C000010|    Grocery|      Sugar| 66576|2024-01-11|Completed| South|
|  Kolkata|ORD00000011|    C000011|Electronics|     Tablet| 50318|2024-01-12|Completed|  East|
|Bangalore|ORD00000012|    C000012|    Grocery|      Sugar| 84768|2024-01-13|Completed| South|
|   Mumbai|ORD00000014|    C000014|Electronics|     Tablet| 79469|2024-01-15|Completed|  West|
|     Pune|ORD00000015|    C000015|Electronics|   

In [35]:
from pyspark.sql.functions import broadcast

broadcasted_joined_data = df_completed.join(broadcast(df_city_lookup), on="city", how="left")
broadcasted_joined_data.show()
broadcasted_joined_data.explain(True)

+---------+-----------+-----------+-----------+-----------+------+----------+---------+------+
|     city|   order_id|customer_id|   category|    product|amount|order_date|   status|region|
+---------+-----------+-----------+-----------+-----------+------+----------+---------+------+
|     Pune|ORD00000001|    C000001|    Grocery|      Sugar| 35430|2024-01-02|Completed|  West|
|     Pune|ORD00000007|    C000007|    Grocery|       Rice| 45362|2024-01-08|Completed|  West|
|Bangalore|ORD00000008|    C000008|    Fashion|      Jeans| 10563|2024-01-09|Completed| South|
|Bangalore|ORD00000010|    C000010|    Grocery|      Sugar| 66576|2024-01-11|Completed| South|
|  Kolkata|ORD00000011|    C000011|Electronics|     Tablet| 50318|2024-01-12|Completed|  East|
|Bangalore|ORD00000012|    C000012|    Grocery|      Sugar| 84768|2024-01-13|Completed| South|
|   Mumbai|ORD00000014|    C000014|Electronics|     Tablet| 79469|2024-01-15|Completed|  West|
|     Pune|ORD00000015|    C000015|Electronics|   

A broadcast join is efficient here because the lookup table has only seven rows and easily fits in memory. Spark sends a full copy of it to every executor, so each partition of the large orders dataset is joined locally and the orders themselves are never shuffled across the network. A normal shuffle join would instead repartition both sides by city. Shuffle joins remain the better choice when both datasets are large, since a big table cannot be broadcast cheaply.
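
The idea can be sketched in plain Python: the small table is copied to wherever the big data already sits, and the join becomes a local dictionary lookup. The partition layout is hypothetical, the order IDs and cities are taken from the output above, and `join_partition` is a name invented for this sketch.

```python
# Small lookup table: cheap to copy to every worker.
city_region = {
    "Delhi": "North", "Mumbai": "West", "Bangalore": "South",
    "Hyderabad": "South", "Pune": "West", "Chennai": "South",
    "Kolkata": "East",
}

# Hypothetical order partitions, as they might sit on two different workers.
order_partitions = [
    [("ORD00000001", "Pune"), ("ORD00000008", "Bangalore")],
    [("ORD00000011", "Kolkata"), ("ORD00000014", "Mumbai")],
]


def join_partition(rows, lookup):
    """Left join one partition against its local copy of the lookup."""
    return [(order_id, city, lookup.get(city)) for order_id, city in rows]


# Each partition gets its own copy of the lookup and is joined in place;
# no order row has to move to another partition.
joined = [join_partition(rows, dict(city_region)) for rows in order_partitions]
for part in joined:
    print(part)
```

Using `lookup.get(city)` mirrors the left join: an order whose city is missing from the lookup is kept with a region of `None`.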

PHASE 8 – UDF

Create a classification based on amount:

amount >= 80000 → High
amount >= 40000 → Medium
else → Low

Add a new column:

order_value_category

Analyze distribution.
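
Since this phase is about UDFs, the classification rule can first be written as an ordinary Python function. This is a minimal sketch: `classify_order_value` is a hypothetical name, and leaving missing amounts unclassified is an assumption of the sketch rather than part of the requirements.

```python
def classify_order_value(amount):
    """Map an order amount to High / Medium / Low using the thresholds above."""
    if amount is None:
        return None  # assumption: missing amounts stay unclassified
    if amount >= 80000:
        return "High"
    if amount >= 40000:
        return "Medium"
    return "Low"


sample = [35430, 45362, 84768, 80000, 40000]
print([classify_order_value(a) for a in sample])
```

Such a function could be registered with `pyspark.sql.functions.udf` and a `StringType()` return type, although built-in column expressions are generally faster because they avoid passing each row through Python.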

In [39]:
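# Built-in when/otherwise expressions are used here instead of a Python UDF.
# Conditions are checked in order, so the second branch only sees amounts below 80000.
df_value = df_completed.withColumn(
    "order_value_category",
    when(col("amount") >= 80000, lit("High"))
    .when(col("amount") >= 40000, lit("Medium"))
    .otherwise(lit("Low"))
)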

df_value.show()

+-----------+-----------+---------+-----------+-----------+------+----------+---------+--------------------+
|   order_id|customer_id|     city|   category|    product|amount|order_date|   status|order_value_category|
+-----------+-----------+---------+-----------+-----------+------+----------+---------+--------------------+
|ORD00000001|    C000001|     Pune|    Grocery|      Sugar| 35430|2024-01-02|Completed|                 Low|
|ORD00000007|    C000007|     Pune|    Grocery|       Rice| 45362|2024-01-08|Completed|              Medium|
|ORD00000008|    C000008|Bangalore|    Fashion|      Jeans| 10563|2024-01-09|Completed|                 Low|
|ORD00000010|    C000010|Bangalore|    Grocery|      Sugar| 66576|2024-01-11|Completed|              Medium|
|ORD00000011|    C000011|  Kolkata|Electronics|     Tablet| 50318|2024-01-12|Completed|              Medium|
|ORD00000012|    C000012|Bangalore|    Grocery|      Sugar| 84768|2024-01-13|Completed|                High|
|ORD00000014|    C0

PHASE 9 – RDD
1. Convert the cleaned DataFrame to RDD.
2. Compute:
Total revenue using reduce.
Orders per city using map and reduce.
3. Explain why DataFrames are preferred over RDDs for analytics.

In [40]:
cleaned_rdd = df_completed.rdd

total_revenue_rdd = cleaned_rdd.map(lambda row: row.amount).reduce(lambda a, b: a + b)
print(f"Total revenue (RDD reduce): {total_revenue_rdd}")


city_counts_rdd = cleaned_rdd.map(lambda row: {row.city: 1})

def merge_city_counts(dict1, dict2):
    merged_dict = dict1.copy()
    for city, count in dict2.items():
        merged_dict[city] = merged_dict.get(city, 0) + count
    return merged_dict

orders_per_city_rdd = city_counts_rdd.reduce(merge_city_counts)

print("Orders per city (RDD map and reduce):")
for city, count in orders_per_city_rdd.items():
    print(f"  {city}: {count}")

Total revenue (RDD reduce): 11196092678
Orders per city (RDD map and reduce):
  Pune: 40527
  Bangalore: 39967
  Kolkata: 40193
  Mumbai: 40283
  Hyderabad: 40706
  Chennai: 40391
  Delhi: 40468


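DataFrames are preferred for analytics because they provide a high-level, schema-aware abstraction. Since Spark knows the column names and types, the Catalyst optimizer can plan and optimize the query, whereas the Python lambdas passed to RDD operations are opaque to the optimizer and run row by row in Python workers. The result is usually faster execution, lower memory use, and shorter code than with raw RDDs (Resilient Distributed Datasets).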
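
The difference in style is visible even without Spark. The plain-Python sketch below spells out the two RDD-style computations step by step on hypothetical rows; every step is a hand-written function that an engine could only treat as a black box, whereas the DataFrame version is a single declarative `groupBy`.

```python
from functools import reduce

# Hypothetical rows: (city, amount).
rows = [("Pune", 35430), ("Pune", 45362), ("Bangalore", 10563), ("Kolkata", 50318)]

# Total revenue: map each row to its amount, then reduce with addition.
total_revenue = reduce(lambda a, b: a + b, map(lambda r: r[1], rows))


# Orders per city: map each row to a (city, 1) pair, then merge counts per key.
def add_pair(counts, pair):
    city, n = pair
    counts[city] = counts.get(city, 0) + n
    return counts


orders_per_city = reduce(add_pair, map(lambda r: (r[0], 1), rows), {})
print(total_revenue, orders_per_city)
```

On a real RDD, the per-key step is usually written with `reduceByKey` on the `(city, 1)` pairs, which combines counts within each partition before shuffling.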

PHASE 10 – Caching
1. Identify datasets reused in multiple queries.
2. Apply cache().
3. Execute multiple aggregations.
4. Compare performance.
5. Unpersist after use.

Explain why unnecessary caching is dangerous.

In [None]:
df_completed

In [41]:
df_completed.cache()

DataFrame[order_id: string, customer_id: string, city: string, category: string, product: string, amount: int, order_date: date, status: string]

In [42]:
df_completed.count()
df_completed.unpersist()

DataFrame[order_id: string, customer_id: string, city: string, category: string, product: string, amount: int, order_date: date, status: string]

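Unnecessary caching is dangerous in Spark because cached data occupies executor memory (and disk, once memory is full) that running jobs would otherwise use. Caching too much can evict blocks that are actually reused, increase garbage-collection pressure and spilling, and in the worst case cause out-of-memory failures. A cached DataFrame is also a snapshot, so it can serve outdated results if the underlying source changes. Cache only datasets that are reused several times, and call unpersist() once they are no longer needed.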

PHASE 11 – Storage Formats
1. Write cleaned dataset to:

Parquet

Partitioned by:

city

2. Write aggregated datasets to:

ORC

3. Read both formats back and validate:
Schema
Row counts
4. Compare size and performance against CSV.

In [44]:
print("Writing cleaned data to Parquet (partitioned by city)...")
df_value.write \
    .mode("overwrite") \
    .partitionBy("city") \
    .parquet("output/orders_cleaned_parquet")

Writing cleaned data to Parquet (partitioned by city)...


In [45]:
print("Writing city_revenue to ORC...")
city_revenue.write \
    .mode("overwrite") \
    .orc("output/city_revenue_orc")

Writing city_revenue to ORC...


In [46]:
print("Writing catgory_revenue to ORC...")
catgory_revenue.write \
    .mode("overwrite") \
    .orc("output/category_revenue_orc")

Writing catgory_revenue to ORC...


In [47]:
print("Reading Parquet file: output/orders_cleaned_parquet...")
df_parquet_read = spark.read.parquet("output/orders_cleaned_parquet")
df_parquet_read.show(5)

Reading Parquet file: output/orders_cleaned_parquet...
+-----------+-----------+-----------+-----------+------+----------+---------+--------------------+---------+
|   order_id|customer_id|   category|    product|amount|order_date|   status|order_value_category|     city|
+-----------+-----------+-----------+-----------+------+----------+---------+--------------------+---------+
|ORD00000053|    C000053|       Home|Airpurifier| 74634|2024-02-23|Completed|              Medium|Hyderabad|
|ORD00000056|    C000056|    Grocery|      Sugar| 39461|2024-02-26|Completed|                 Low|Hyderabad|
|ORD00000064|    C000064|    Grocery|       Rice| 82413|2024-01-05|Completed|                High|Hyderabad|
|ORD00000081|    C000081|       Home|Airpurifier| 43515|2024-01-22|Completed|              Medium|Hyderabad|
|ORD00000096|    C000096|Electronics|     Tablet|  8445|2024-02-06|Completed|                 Low|Hyderabad|
+-----------+-----------+-----------+-----------+------+----------+------

In [48]:
print("Reading ORC file: output/city_revenue_orc...")
city_revenue_orc_read = spark.read.orc("output/city_revenue_orc")
city_revenue_orc_read.show(5)

Reading ORC file: output/city_revenue_orc...
+---------+-------------+
|     city|total_revenue|
+---------+-------------+
|Bangalore|   1595093850|
|  Chennai|   1594968796|
|   Mumbai|   1592819957|
|  Kolkata|   1589960718|
|     Pune|   1611302685|
+---------+-------------+
only showing top 5 rows


In [49]:
print("Reading ORC file: output/category_revenue_orc...")
category_revenue_orc_read = spark.read.orc("output/category_revenue_orc")
category_revenue_orc_read.show(5)

Reading ORC file: output/category_revenue_orc...
+-----------+-------------+
|   category|total_revenue|
+-----------+-------------+
|       Home|   2808429137|
|    Fashion|   2774452711|
|    Grocery|   2806779336|
|Electronics|   2806431494|
+-----------+-------------+



In [50]:
print("Schema of df_parquet_read:")
df_parquet_read.printSchema()

Schema of df_parquet_read:
root
 |-- order_id: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- category: string (nullable = true)
 |-- product: string (nullable = true)
 |-- amount: integer (nullable = true)
 |-- order_date: date (nullable = true)
 |-- status: string (nullable = true)
 |-- order_value_category: string (nullable = true)
 |-- city: string (nullable = true)



In [51]:
print("Schema of city_revenue_orc_read:")
city_revenue_orc_read.printSchema()

Schema of city_revenue_orc_read:
root
 |-- city: string (nullable = true)
 |-- total_revenue: long (nullable = true)



In [52]:
print("Schema of category_revenue_orc_read:")
category_revenue_orc_read.printSchema()

Schema of category_revenue_orc_read:
root
 |-- category: string (nullable = true)
 |-- total_revenue: long (nullable = true)



In [53]:
print(f"Row count of df_parquet_read: {df_parquet_read.count()}")

Row count of df_parquet_read: 282535


In [54]:
print(f"Row count of city_revenue_orc_read: {city_revenue_orc_read.count()}")

Row count of city_revenue_orc_read: 7


In [55]:
print(f"Row count of category_revenue_orc_read: {category_revenue_orc_read.count()}")

Row count of category_revenue_orc_read: 4


Comparing Size and Performance:

No sizes or timings were measured in this notebook; the points below are the general expectations for column-oriented formats (Parquet, ORC) relative to CSV.

*   **Smaller File Sizes:** Columnar formats compress data more efficiently than row-oriented formats like CSV because the values within a column share a type and are often similar, which suits compression and encoding schemes. This reduces storage costs and improves I/O performance.
*   **Faster Query Performance:** For analytical queries that select a subset of columns, column-oriented formats read only the columns that are needed, significantly reducing disk I/O compared to CSV, where every full row must be read and parsed. Partitioning (as done with Parquet by 'city') further improves performance by letting Spark skip the directories of cities that a query filters out.
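
To put numbers on the size comparison, the on-disk footprint of each output can be measured directly. The helper names below are hypothetical; the paths are the ones used earlier in this notebook and are assumed to be on the local filesystem.

```python
import os


def directory_size_bytes(path):
    """Size of a single file, or the total size of all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def compare_sizes(paths):
    """Return {path: size in bytes} for the paths that exist."""
    return {p: directory_size_bytes(p) for p in paths if os.path.exists(p)}


# Paths written or read earlier in this notebook; adjust if yours differ.
print(compare_sizes([
    "orders.csv",
    "output/orders_cleaned_parquet",
    "output/city_revenue_orc",
]))
```

Note that the Parquet output holds only the cleaned, completed orders, so a fair comparison would write the same rows to CSV as well.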



PHASE 12 – Debugging
Explain why this breaks:

df = df.filter(df.amount > 50000).show()

And why after this line df is no longer a DataFrame.

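The code "breaks" the workflow because .show() is an action, not a transformation: it prints the rows and returns None instead of a new DataFrame.

After this line, the variable df is reassigned to the return value of the .show() method, which is None. Any later call such as df.count() then fails with an AttributeError because NoneType has no such method. The fix is to keep the two steps apart: assign the filtered DataFrame to df first, then call df.show() on its own line.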
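
The same mistake can be reproduced without Spark. `FakeFrame` below is a hypothetical stand-in written for this sketch, not a Spark class; it only mimics the relevant behavior, where `filter()` returns a new frame and `show()` returns nothing.

```python
class FakeFrame:
    """Hypothetical stand-in for a DataFrame."""

    def __init__(self, rows):
        self.rows = rows

    def filter(self, predicate):
        # Transformation: returns a new frame.
        return FakeFrame([r for r in self.rows if predicate(r)])

    def show(self):
        # Action: prints the rows and implicitly returns None.
        for r in self.rows:
            print(r)

    def count(self):
        return len(self.rows)


df = FakeFrame([{"amount": 35430}, {"amount": 84768}])

broken = df.filter(lambda r: r["amount"] > 50000).show()  # broken is None
fixed = df.filter(lambda r: r["amount"] > 50000)          # keep the frame...
fixed.show()                                              # ...then display it
print(broken is None, fixed.count())
```

Calling `broken.count()` here would raise an `AttributeError`, which is the same failure the notebook line leads to.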

## Document

### Cleaning Strategy
- Trim spaces from city, category, product.
- Standardize to proper case.
- Clean amount: extract the digits, map invalid or empty values to 0 (the requirement asks for null), convert to int.
- Parse order_date in multiple formats to DateType.

### Performance Strategy
- Use broadcast join for small lookup tables to optimize joins.
- Partition data by city in Parquet for faster queries.
- Store aggregates in ORC for efficient columnar access.
- Prefer column-oriented formats over CSV for compression and selective reading.

### Debugging Learnings
- Do not assign the result of .show() back to the DataFrame variable: .show() returns None, so the variable no longer refers to a DataFrame and the chain breaks.