<div style="display: flex; align-items: center; gap: 18px; margin-bottom: 15px;">
  <img src="https://files.codebasics.io/v3/images/sticky-logo.svg" alt="Codebasics Logo" style="display: inline-block;" width="130">
  <h1 style="font-size: 34px; color: #1f4e79; margin: 0; display: inline-block;">Codebasics Practice Room - Data Engineering Bootcamp </h1>
</div>

## 🧑🏼‍🔧 Setup

In [0]:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, LongType, TimestampType
from delta.tables import DeltaTable

In [0]:
# --------------------------------------------
# ⚙️ Databricks Unity Catalog Setup (Auto)
# --------------------------------------------
from pyspark.sql import SparkSession

catalog_name = "practice_db_catalog"
schema_name = "airbnb"
volume_name = "data_volume"

# 1️⃣ Create Catalog if not exists
spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog_name}")
print(f"✅ Catalog `{catalog_name}` ready.")

# 2️⃣ Create Schema (Database) if not exists
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog_name}.{schema_name}")
print(f"✅ Schema `{schema_name}` created inside `{catalog_name}`.")

# 3️⃣ Create Volume if not exists
spark.sql(f"CREATE VOLUME IF NOT EXISTS {catalog_name}.{schema_name}.{volume_name}")
print(f"✅ Volume `{volume_name}` created inside `{catalog_name}.{schema_name}`")

# 4️⃣ Set current context
spark.sql(f"USE CATALOG {catalog_name}")
spark.sql(f"USE {schema_name}")

# 5️⃣ Define volume-backed paths
base_path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}/airbnb"
raw_path = f"{base_path}/raw"
clean_path = f"{base_path}/clean"
silver_path = f"{base_path}/silver"

# 6️⃣ Create directories inside the volume
dbutils.fs.mkdirs(raw_path)
dbutils.fs.mkdirs(clean_path)
dbutils.fs.mkdirs(silver_path)

print("✅ Paths initialized successfully:")
print(f"Raw: {raw_path}")
print(f"Clean: {clean_path}")
print(f"Silver: {silver_path}")


In [0]:
# 🧮 Generate Airbnb listings dataset (Spark-native version)
from pyspark.sql import Row
import random, datetime

# --------------------------------
# Configuration
# --------------------------------

random.seed(42)  # ✅ reproducibility

num_records = 600  # Adjust as needed

amenities_pool = [
    "Wifi", "Kitchen", "Washer", "Dryer", "TV", "Essentials", "Air conditioning",
    "Heating", "Pool", "Hot tub", "Balcony", "Garden", "Parking", "Fireplace",
    "Sea view", "Mountain view", "Pet friendly", "Gym", "Breakfast", "Workspace"
]

property_types = [
    "Studio Apartment", "Private Room", "Entire Home", "Cottage", "Villa",
    "Cabin", "Loft", "Guest Suite", "Bungalow", "Condo"
]

cities = ["Mumbai", "Bangalore", "Hyderabad", "Chennai", "Pune", "Delhi", "Goa"]
boolean_variants = [True, False, "true", "false", "Yes", "No", "yes", "no", "TRUE", "FALSE"]

# --------------------------------
# Generate data as list of Rows
# --------------------------------
data = []
for i in range(1, num_records + 1):
    created_date = datetime.date(2025, 1, 1) + datetime.timedelta(days=random.randint(0, 300))
    last_booked_date = created_date + datetime.timedelta(days=random.randint(1, 60))

    data.append(Row(
        id=100 + i,
        name=f"{random.choice(['Cozy', 'Modern', 'Luxury', 'Spacious', 'Budget'])} "
             f"{random.choice(property_types)} in {random.choice(cities)}",
        city=random.choice(cities),
        price_per_night=random.randint(1000, 10000),
        amenities=random.sample(amenities_pool, random.randint(3, 8)),
        has_parking=random.choice(boolean_variants),
        is_superhost=random.choice(boolean_variants),
        created_date=str(created_date),
        last_booked_date=str(last_booked_date)
    ))

# --------------------------------
# Convert to Spark DataFrame
# --------------------------------
df_raw = spark.createDataFrame(data)

# --------------------------------
# Write directly to UC Volume (JSON format)
# --------------------------------
raw_path = "/Volumes/practice_db_catalog/airbnb/data_volume/airbnb/raw/listings.json"

df_raw.write.mode("overwrite").json(raw_path)
print(f"✅ Successfully generated {num_records} Airbnb listings and saved to:")
print(f"📂 {raw_path}")

# --------------------------------
# Quick sanity check
# --------------------------------
print(f"Row count: {df_raw.count()}")



# ❓ Scenario Question: Airbnb - Clean Listing Amenities (PySpark) [Easy]



## 🗂️ Scenario

You are working with raw **Airbnb listing data** ingested from multiple sources.  
Each listing contains property details and a **nested list of amenities**.  
The goal is to **clean, normalize, and store** this data for downstream analysis.

The data is available as a JSON file (`listings.json`) in the **Bronze layer**, which now needs to be transformed into a clean **Silver Delta Table**.

---

## 🎯 Task

Perform the following transformations:

1. **Read** the input data from `listings.json` using Spark.  
2. **Explode** the `amenities` array so that each row contains a single amenity.  
3. **Normalize** boolean-like columns (e.g., `"true"`, `"false"`, `"yes"`, `"no"`) into proper boolean (`True` / `False`) Spark data types.  
4. **Rename** or select only the relevant columns for downstream use.  
5. **Save** the cleaned DataFrame in **Delta format** to the **Silver layer** path:  
   `/Volumes/practice_db_catalog/airbnb/data_volume/airbnb/silver/listings.json` 
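
To make step 2 concrete, here is a plain-Python model of what exploding does to a single record. It is only an illustration of the row-multiplication behaviour, not Spark code, and `explode_amenities` is a hypothetical helper name. Like Spark's `explode()`, the model emits no rows when the array is missing or empty.

```python
def explode_amenities(listing):
    """Return one output record per amenity, copying the other fields."""
    # Everything except the array is repeated on each output row.
    base = {key: value for key, value in listing.items() if key != "amenities"}
    # A missing, null, or empty array yields no rows at all.
    return [{**base, "amenity": amenity} for amenity in listing.get("amenities") or []]


# Made-up sample record, shaped like the generated listings.
listing = {"id": 101, "name": "Cozy Villa in Goa", "amenities": ["Wifi", "Pool", "Gym"]}
rows = explode_amenities(listing)
for row in rows:
    print(row)
```

One listing with three amenities becomes three rows that share the same `id` and `name`, which is why listing-level counts on the exploded data need de-duplication.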

---

## 🧩 Assumptions

- The input file `listings.json` exists in the **Bronze** path:  
  `/Volumes/practice_db_catalog/airbnb/data_volume/airbnb/raw/listings.json`
- The `amenities` field may contain an array or a stringified array.  
- Boolean columns may contain values like `"TRUE"`, `"Yes"`, `"0"`, `"1"`, etc.  
- The final cleaned DataFrame should contain only essential columns:  
  `id`, `name`, `amenity`, `has_parking`, and `is_superhost`.  
- Handle missing or malformed columns gracefully (e.g., cast to `null`).  
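
The assumptions above can be written down as explicit rules before translating them into Spark expressions. The sketch below is plain Python and uses hypothetical helper names (`normalize_bool`, `parse_amenities`). It also assumes that a stringified array is JSON-encoded, which the exercise does not state.

```python
import json

# Accepted spellings, compared after trimming and lower-casing.
TRUTHY = {"true", "yes", "1"}
FALSY = {"false", "no", "0"}


def normalize_bool(value):
    """Map boolean-like values to True/False; anything else becomes None."""
    if value is None:
        return None
    text = str(value).strip().lower()
    if text in TRUTHY:
        return True
    if text in FALSY:
        return False
    return None  # malformed input is treated as missing


def parse_amenities(value):
    """Return a list of amenities from a real list or a JSON-encoded string."""
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return None
        return parsed if isinstance(parsed, list) else None
    return None


print(normalize_bool("TRUE"), normalize_bool("No"), normalize_bool("maybe"))
print(parse_amenities('["Wifi", "Pool"]'))
```

The same rules carry over to a chain of `F.when()` conditions on a lower-cased column, with a `null` fallback for unrecognized values.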

---

## 📦 Deliverables

- **Output Format:** Delta table written to Silver  
- **Output Path:** `/Volumes/practice_db_catalog/airbnb/data_volume/airbnb/silver/listings.json`

- **Expected Columns:** `id`, `name`, `amenity`, `has_parking`, `is_superhost`

---

## 🧠 Notes

- Use `pyspark.sql.functions.explode()` to expand the amenities array.  
- Use `F.when()` or `F.col().cast("boolean")` for boolean normalization.  
- Use clear column aliases for readability.  
- Validate the write by reading from the Silver path and displaying the first few rows.

---

## 🧩 Example Output (simplified)

| id  | name               | amenity          | has_parking | is_superhost |
|-----|--------------------|------------------|--------------|---------------|
| 101 | Cozy Beach House   | Wifi             | true         | false         |
| 101 | Cozy Beach House   | Ocean View       | true         | false         |
| 102 | City Apartment     | Air Conditioning | false        | true          |


## 🛢️ Input data

In [0]:
display(df_raw.limit(5))

# 📝 Your Solution

In [0]:
# ✍️ Your Solution Here

from pyspark.sql import functions as F

# Steps:
# 1. Read the JSON file
# 2. Explode the amenities
# 3. Normalize boolean-like fields and return the DataFrame


## 🔍 Validation Questions

After creating the final DataFrame (`df_final`), answer these to check your understanding:

1. How many amenities are listed for the property with **ID = 101**?  
2. How many listings have **`is_superhost = true`**?  
3. What are the **unique amenities** available for listing **ID = 103**?  
4. Count how many listings have **`has_parking = true`**.  
5. For each listing, how many total amenities are available? (Hint: use `groupBy().count()`.)
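
Because the exploded data holds one row per (listing, amenity) pair, listing-level questions such as 2 and 4 need a distinct count over `id` and not a plain row count. The plain-Python model below uses made-up rows to show the difference; it is a sketch of the counting logic only, not of the Spark API.

```python
from collections import Counter

# Made-up exploded rows: one row per (listing, amenity) pair.
rows = [
    {"id": 101, "amenity": "Wifi", "is_superhost": True},
    {"id": 101, "amenity": "Pool", "is_superhost": True},
    {"id": 102, "amenity": "Wifi", "is_superhost": False},
    {"id": 103, "amenity": "Gym", "is_superhost": True},
]

# Counting rows overstates the answer: listing 101 appears twice.
superhost_rows = sum(1 for row in rows if row["is_superhost"])

# Counting distinct ids gives the number of superhost listings.
superhost_listings = len({row["id"] for row in rows if row["is_superhost"]})

# Amenities per listing: the equivalent of grouping by id and counting.
amenities_per_listing = Counter(row["id"] for row in rows)

print(superhost_rows, superhost_listings)  # 3 2
print(dict(amenities_per_listing))
```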

In [0]:
df_raw.limit(10).display()

In [0]:

# Explode the amenities array so that each row holds a single amenity
df_silver = df_raw.select("id", "name", F.explode("amenities").alias("amenities"), "has_parking", "is_superhost")


In [0]:
df_silver.display()

In [0]:
df_raw.printSchema()

In [0]:
# Standardize the has_parking column: lower-case first so "TRUE", "Yes", "1", etc. all match
df_silver = df_silver.withColumn("has_parking", F.when(F.lower(F.col("has_parking").cast("string")).isin("true", "yes", "1"), True)
                                                      .otherwise(False))

In [0]:
# Standardize the is_superhost column: lower-case first so "TRUE", "Yes", "1", etc. all match
df_final = df_silver.withColumn("is_superhost", F.when(F.lower(F.col("is_superhost").cast("string")).isin("true", "yes", "1"), True)
                                                      .otherwise(False))


In [0]:
# How many amenities are listed for the property with ID = 101?
# After the explode there is one row per amenity, so the row count is the answer.
df_final.filter(F.col("id") == 101).count()

In [0]:
#How many listings have is_superhost = true?
df_final.filter(F.col("is_superhost") == True).select("id").distinct().count()

In [0]:
# What are the unique amenities available for listing ID = 103?
df_final.filter(F.col("id") == 103).select("amenities").distinct().display()

In [0]:
#Count how many listings have has_parking = true
df_final.filter(F.col("has_parking") == True).select("id").distinct().count()

In [0]:
#For each listing, how many total amenities are available? (Hint: use groupBy().count().)
df_final.groupBy("id").agg(F.countDistinct("amenities").alias("total_amenities")).orderBy("id").display()

In [0]:
# Write the result to the Silver layer in Delta format, as the task requires
final_out_path = '/Volumes/practice_db_catalog/airbnb/data_volume/airbnb/silver/listings.json'

# Rename the exploded column to `amenity` so the output matches the expected columns
df_final.withColumnRenamed("amenities", "amenity").write.format("delta").mode("overwrite").save(final_out_path)