In [1]:
from pyspark.sql import SparkSession

This statement imports the `SparkSession` class from PySpark, the entry point used to create a Spark environment for reading, processing, and analyzing data.

In [2]:
spark = SparkSession.builder.appName("HousingRDDExample").getOrCreate()
sc = spark.sparkContext

These statements create a Spark session named `"HousingRDDExample"` to set up the Spark environment and then access its SparkContext (`sc`) to perform low-level RDD operations.

In [3]:
data = sc.textFile("Housing.csv")

This statement reads the file `Housing.csv` as an RDD of text lines, one element per line, allowing low-level Spark operations on each line of the dataset.

In [4]:
header = data.first()
rows = data.filter(lambda line: line != header)

These statements extract the first line of the RDD as the header and then create a new RDD, `rows`, that excludes it, leaving only the data rows. The filter drops every line identical to the header, not just the first one.

In [5]:
split_rdd = rows.map(lambda line: line.split(","))

This statement splits each line in the `rows` RDD on commas, creating a new RDD, `split_rdd`, in which each element is a list of string column values for one row.
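Splitting on a bare comma works for this file because the fields contain no embedded commas or quotes. As a hedged alternative, the sketch below uses Python's standard `csv` module, which also handles quoted fields. `parse_line` is a hypothetical helper name; it could be passed to `rows.map(parse_line)` in place of the lambda.

```python
import csv

def parse_line(line):
    """Parse one CSV line into a list of string fields, honoring quotes."""
    # csv.reader accepts any iterable of lines, so wrap the single line in a list.
    return next(csv.reader([line]))

# First data row of the dataset.
print(parse_line("13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished"))

# A made-up line with a quoted comma, which str.split(",") would break apart.
print(parse_line('100,"a, b",yes'))
```

For a file without quoted fields the result is the same as `line.split(",")`, so this only matters if the data later gains free-text columns.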

In [6]:
print("=== Housing Dataset (first 10 rows) ===")
for row in split_rdd.take(10):
    print(row)

=== Housing Dataset (first 10 rows) ===
['13300000', '7420', '4', '2', '3', 'yes', 'no', 'no', 'no', 'yes', '2', 'yes', 'furnished']
['12250000', '8960', '4', '4', '4', 'yes', 'no', 'no', 'no', 'yes', '3', 'no', 'furnished']
['12250000', '9960', '3', '2', '2', 'yes', 'no', 'yes', 'no', 'no', '2', 'yes', 'semi-furnished']
['12215000', '7500', '4', '2', '2', 'yes', 'no', 'yes', 'no', 'yes', '3', 'yes', 'furnished']
['11410000', '7420', '4', '1', '2', 'yes', 'yes', 'yes', 'no', 'yes', '2', 'no', 'furnished']
['10850000', '7500', '3', '3', '1', 'yes', 'no', 'yes', 'no', 'yes', '2', 'yes', 'semi-furnished']
['10150000', '8580', '4', '3', '4', 'yes', 'no', 'no', 'no', 'yes', '2', 'yes', 'semi-furnished']
['10150000', '16200', '5', '3', '2', 'yes', 'no', 'no', 'no', 'no', '0', 'no', 'unfurnished']
['9870000', '8100', '4', '1', '2', 'yes', 'yes', 'yes', 'no', 'yes', '2', 'yes', 'furnished']
['9800000', '5750', '3', '2', '4', 'yes', 'yes', 'no', 'no', 'yes', '1', 'yes', 'unfurnished']


These statements display a header message and then print the first 10 rows of `split_rdd` to show a sample of the dataset. Every value is still a string at this point.

In [7]:
housing_rdd = split_rdd.map(lambda x: (int(x[0]), float(x[1]), int(x[2]), int(x[3]), float(x[4])))

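This statement converts the first five columns of each row from strings to numbers, creating a new RDD, `housing_rdd`, of tuples typed (int, float, int, int, float). The first element is the price and the second is the area; the remaining eight columns are dropped.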
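The conversion in cell 7 can also be written with named fields, which makes later index arithmetic easier to check. This is only a sketch: the field names (`price`, `area`, `bedrooms`, `bathrooms`, `stories`) are assumptions inferred from the values, since the notebook never prints the CSV header, and `House` and `to_house` are hypothetical names. The function should be usable as `split_rdd.map(to_house)`.

```python
from collections import namedtuple

# Field names are assumed; the notebook does not print the CSV header.
House = namedtuple("House", ["price", "area", "bedrooms", "bathrooms", "stories"])

def to_house(fields):
    """Convert the first five string columns of a split row to numbers."""
    return House(
        price=int(fields[0]),
        area=float(fields[1]),
        bedrooms=int(fields[2]),
        bathrooms=int(fields[3]),
        stories=float(fields[4]),
    )

# First sample row as printed in cell 6.
row = ['13300000', '7420', '4', '2', '3', 'yes', 'no', 'no', 'no',
       'yes', '2', 'yes', 'furnished']
house = to_house(row)
print(house)
print(house.price / house.area)  # same value as house[0] / house[1]
```

Because a `namedtuple` is still a tuple, positional access such as `x[0]` and `x[1]` keeps working alongside the named fields.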

In [8]:
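price_per_area_rdd = housing_rdd.map(lambda x: (x[0], x[0] / x[1]))

This statement creates a new RDD, `price_per_area_rdd`, by dividing each house's price (`x[0]`) by its area (`x[1]`) and mapping each row to a tuple of (price, price/area). The dataset has no ID column, so the price doubles as the row identifier.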
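To make the ratio concrete, here is a small local check on the first three sample rows printed in cell 6, assuming column 0 is the price and column 1 is the area. It uses plain Python rather than Spark, and `price_per_area` is a hypothetical helper.

```python
def price_per_area(row):
    """Return (price, price / area) for a tuple whose first two items are price and area."""
    price, area = row[0], row[1]
    return (price, price / area)

# (price, area) pairs from the first three sample rows.
sample = [(13300000, 7420.0), (12250000, 8960.0), (12250000, 9960.0)]

for price, ratio in map(price_per_area, sample):
    print(f"price={price}, price per area={ratio:.2f}")
# price=13300000, price per area=1792.45
# price=12250000, price per area=1367.19
# price=12250000, price per area=1229.92
```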

In [9]:
expensive_rdd = price_per_area_rdd.filter(lambda x: x[1] >= 5000)

This statement filters `price_per_area_rdd` to keep only houses whose price per area is at least 5000, creating a new RDD, `expensive_rdd`.

In [10]:
sorted_expensive_rdd = expensive_rdd.sortBy(lambda x: x[1], ascending=False)

This statement sorts `expensive_rdd` in descending order of price per area, creating a new RDD, `sorted_expensive_rdd`.
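The filter and sort steps can be mirrored locally to see how the threshold behaves. The sketch below uses plain Python on four of the sample rows from cell 6, assuming price divided by area is the intended ratio; `filter_and_sort` and its `threshold` parameter are hypothetical. None of these four rows reaches 5000, so a lower cutoff is also shown for illustration.

```python
# (price, area) pairs from sample rows 1, 4, 8 and 10.
sample = [(13300000, 7420.0), (12215000, 7500.0),
          (10150000, 16200.0), (9800000, 5750.0)]
ratios = [(price, price / area) for price, area in sample]

def filter_and_sort(pairs, threshold):
    """Keep pairs whose ratio is at least threshold, highest ratio first."""
    kept = [p for p in pairs if p[1] >= threshold]
    return sorted(kept, key=lambda p: p[1], reverse=True)

# With the notebook's cutoff, no sample row qualifies.
print(filter_and_sort(ratios, 5000))

# With a lower, illustrative cutoff, three rows remain in descending order.
for price, ratio in filter_and_sort(ratios, 1000):
    print(price, round(ratio, 2))
```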

In [11]:
results = sorted_expensive_rdd.collect()

This statement collects all elements of the `sorted_expensive_rdd` RDD into a Python list named `results`, bringing the data from the distributed RDD to the local driver.

In [15]:
price_per_area_rdd = housing_rdd.map(lambda x: (x[0], x[1] / x[2]))
sorted_expensive_rdd = price_per_area_rdd.sortBy(lambda x: x[1], ascending=False)
top10_expensive = sorted_expensive_rdd.take(10)
print("=== Top 10 Expensive Houses (Price per Area) ===")
for house in top10_expensive:
    print(f"ID: {house[0]}, Price per Area: {house[1]:.2f}")

=== Top 10 Expensive Houses (Price per Area) ===
ID: 6930000, Price per Area: 6600.00
ID: 5110000, Price per Area: 5705.00
ID: 5600000, Price per Area: 5250.00
ID: 5943000, Price per Area: 5200.00
ID: 4305000, Price per Area: 5180.00
ID: 4760000, Price per Area: 5120.00
ID: 4760000, Price per Area: 4583.00
ID: 7070000, Price per Area: 4440.00
ID: 9800000, Price per Area: 4400.00
ID: 3500000, Price per Area: 4314.67


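This code recomputes `price_per_area_rdd`, sorts it in descending order of the computed ratio, takes the first 10 elements, and prints each with the ratio rounded to two decimals. Note that the ratio here is `x[1] / x[2]`, that is, the area divided by the third column, so the values labeled "Price per Area" in the output are not price divided by area (which would be `x[0] / x[1]`). The value labeled "ID" is the house price, since the dataset has no ID column, and the `>= 5000` filter from cell 9 is not applied, which is why ratios below 5000 appear in the list.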
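When only the top few elements are needed, a full sort is not required. PySpark's RDD API offers `top(n, key=...)` and `takeOrdered(n, key=...)` for this. The sketch below shows the same idea locally with `heapq.nlargest` from the standard library, on (price, area) pairs taken from the sample rows in cell 6. The `top_n_by_ratio` helper is hypothetical, and the sketch assumes price divided by area is the intended ratio.

```python
import heapq

def top_n_by_ratio(rows, n):
    """Return the n (price, price/area) pairs with the largest ratio."""
    pairs = ((price, price / area) for price, area in rows)
    # nlargest keeps only n candidates at a time instead of sorting everything.
    return heapq.nlargest(n, pairs, key=lambda p: p[1])

# (price, area) pairs from sample rows 1-5 and 10.
sample = [
    (13300000, 7420.0), (12250000, 8960.0), (12250000, 9960.0),
    (12215000, 7500.0), (11410000, 7420.0), (9800000, 5750.0),
]

for price, ratio in top_n_by_ratio(sample, 3):
    print(f"Price: {price}, Price per Area: {ratio:.2f}")
```

On the RDD itself, the equivalent would be along the lines of `price_per_area_rdd.top(10, key=lambda x: x[1])`, which avoids the separate `sortBy` step.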