In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, round, max, min, avg

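These statements import SparkSession, the entry point to Spark, together with col for referring to columns and the functions round, max, min, and avg for basic statistics. Importing round, max, and min this way shadows Python's built-in functions of the same names for the rest of the notebook.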

In [2]:
spark = SparkSession.builder.appName("HousingAdvancedExample").getOrCreate()

This statement creates a Spark session named "HousingAdvancedExample", or reuses one that is already running, and this session serves as the entry point for reading, processing, and analyzing large datasets.

In [3]:
df = spark.read.csv("Housing.csv", header=True, inferSchema=True)
df = df.withColumn("price_per_sqft", round(col("price") / col("area"), 2))

These statements read the dataset "Housing.csv" into a Spark DataFrame, treating the first row as the header and inferring column types automatically, and then add a new column "price_per_sqft": the price divided by the area, rounded to two decimal places.
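As a sanity check on the new column, the arithmetic can be reproduced in plain Python without Spark. This is a minimal sketch: the helper name price_per_sqft is hypothetical, the (price, area) pairs are copied from rows that appear in the notebook's outputs, and Decimal with half-up rounding is used because Spark's round rounds half up.

```python
from decimal import Decimal, ROUND_HALF_UP


def price_per_sqft(price: int, area: int) -> Decimal:
    """Divide price by area and round to two decimals, rounding half up."""
    ratio = Decimal(price) / Decimal(area)
    return ratio.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# (price, area) pairs copied from rows shown in the notebook's outputs
samples = [(13300000, 7420), (12250000, 8960), (1750000, 3620)]

for price, area in samples:
    print(price, area, price_per_sqft(price, area))
```

The three values printed should match the price_per_sqft entries of the corresponding rows in the tables.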

In [4]:
print("=== Top 10 Most Expensive Houses ===")
df.orderBy(col("price").desc()).show(10)

=== Top 10 Most Expensive Houses ===
+--------+-----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+--------------+
|   price| area|bedrooms|bathrooms|stories|mainroad|guestroom|basement|hotwaterheating|airconditioning|parking|prefarea|furnishingstatus|price_per_sqft|
+--------+-----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+--------------+
|13300000| 7420|       4|        2|      3|     yes|       no|      no|             no|            yes|      2|     yes|       furnished|       1792.45|
|12250000| 8960|       4|        4|      4|     yes|       no|      no|             no|            yes|      3|      no|       furnished|       1367.19|
|12250000| 9960|       3|        2|      2|     yes|       no|     yes|             no|             no|      2|     yes|  semi-furnished|       1229.92|
|12215000| 7500|       4|        2|      2|  

These statements print a header and then show the 10 most expensive houses by sorting the DataFrame in descending order of the "price" column.

In [5]:
print("=== Cheapest 10 Houses ===")
df.orderBy(col("price").asc()).show(10)

=== Cheapest 10 Houses ===
+-------+----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+--------------+
|  price|area|bedrooms|bathrooms|stories|mainroad|guestroom|basement|hotwaterheating|airconditioning|parking|prefarea|furnishingstatus|price_per_sqft|
+-------+----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+--------------+
|1750000|3620|       2|        1|      1|     yes|       no|      no|             no|             no|      0|      no|     unfurnished|        483.43|
|1750000|2910|       3|        1|      1|      no|       no|      no|             no|             no|      0|      no|       furnished|        601.37|
|1750000|3850|       3|        1|      2|     yes|       no|      no|             no|             no|      0|      no|     unfurnished|        454.55|
|1767150|2400|       3|        1|      1|      no|       no|      n

These statements print a header and then show the 10 cheapest houses by sorting the DataFrame in ascending order of the "price" column.

In [6]:
print("=== Average Price by Bedrooms ===")
df.groupBy("bedrooms").agg(round(avg("price"), 2).alias("avg_price")).orderBy("bedrooms").show()

=== Average Price by Bedrooms ===
+--------+----------+
|bedrooms| avg_price|
+--------+----------+
|       1| 2712500.0|
|       2|3632022.06|
|       3|4954598.13|
|       4|5729757.89|
|       5| 5819800.0|
|       6| 4791500.0|
+--------+----------+



These statements print a header, compute the average house price for each bedroom count rounded to two decimal places, and show the result ordered by number of bedrooms.
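The group-then-average step can be illustrated in plain Python. This is only a sketch of the idea, not how Spark executes it: the rows are six (price, bedrooms) pairs copied from the top-10 and cheapest-10 outputs above, so the averages describe this small subset and will not match the full table.

```python
from collections import defaultdict

# (price, bedrooms) for six rows visible in the earlier outputs;
# a small subset of the data, used for illustration only
rows = [
    (13300000, 4),
    (12250000, 4),
    (12250000, 3),
    (1750000, 2),
    (1750000, 3),
    (1750000, 3),
]

# Group prices by bedroom count (the role of groupBy)
prices_by_bedrooms = defaultdict(list)
for price, bedrooms in rows:
    prices_by_bedrooms[bedrooms].append(price)

# Average each group, round to 2 decimals, order by bedroom count
avg_price = {
    bedrooms: round(sum(prices) / len(prices), 2)
    for bedrooms, prices in sorted(prices_by_bedrooms.items())
}

print(avg_price)
```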

In [7]:
print("=== Average Price per Sq.Ft by Bathrooms ===")
df.groupBy("bathrooms").agg(round(avg("price_per_sqft"), 2).alias("avg_pps")).orderBy("bathrooms").show()

=== Average Price per Sq.Ft by Bathrooms ===
+---------+-------+
|bathrooms|avg_pps|
+---------+-------+
|        1| 933.64|
|        2|1153.87|
|        3|1213.92|
|        4|1367.19|
+---------+-------+



These statements display a header message and then calculate the average price per square foot for each number of bathrooms, rounding the results to two decimal places, and show the data ordered by the number of bathrooms.

In [8]:
print("=== Max and Min Price by Bedrooms ===")
df.groupBy("bedrooms").agg(
    max("price").alias("max_price"),
    min("price").alias("min_price")
).show(5)

=== Max and Min Price by Bedrooms ===
+--------+---------+---------+
|bedrooms|max_price|min_price|
+--------+---------+---------+
|       1|  3150000|  2275000|
|       6|  6083000|  3500000|
|       3| 12250000|  1750000|
|       5| 10150000|  1960000|
|       4| 13300000|  2100000|
+--------+---------+---------+
only showing top 5 rows


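These statements print a header and then compute the maximum and minimum house price for each bedroom count. Because show(5) is used, only five of the six bedroom groups appear (the group with 2 bedrooms is cut off), and because no orderBy is applied, the rows come back in no particular order.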
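To read the result in a stable order, the rows can be sorted by bedroom count. The sketch below does this in plain Python on the five rows copied from the output above, and also prints the max-minus-min spread, a value the notebook itself does not compute. In the notebook, chaining .orderBy("bedrooms") before .show(), as cell In [6] does, should give the same ordering.

```python
# (bedrooms, max_price, min_price) copied from the five rows shown above
shown = [
    (1, 3150000, 2275000),
    (6, 6083000, 3500000),
    (3, 12250000, 1750000),
    (5, 10150000, 1960000),
    (4, 13300000, 2100000),
]

# Sort by bedroom count so the table reads top to bottom
by_bedrooms = sorted(shown, key=lambda row: row[0])

for bedrooms, max_price, min_price in by_bedrooms:
    # The last value is the spread between the highest and lowest price
    print(bedrooms, max_price, min_price, max_price - min_price)
```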