# 🧊 Hotels Practice Drills

Use this lab to run the SQL, PySpark, and Python exercises against the hotels practice dataset. The first three cells set environment defaults, start Spark with Iceberg support, and connect to Trino. All subsequent sections list tasks only; add your own cells beneath each bullet to implement solutions.


In [None]:
import os

MINIO_ENDPOINT = os.getenv("MINIO_ENDPOINT", "http://minio:9000")
MINIO_ACCESS_KEY = os.getenv("MINIO_ROOT_USER", "minio")
MINIO_SECRET_KEY = os.getenv("MINIO_ROOT_PASSWORD", "minio123")
HIVE_METASTORE_URI = os.getenv("HIVE_METASTORE_URI", "thrift://hive-metastore:9083")
TRINO_URL = os.getenv("TRINO_URL", "http://trino:8080")
SPARK_MASTER = os.getenv("SPARK_MASTER_URL", "spark://spark-master:7077")
S3_ENDPOINT = os.getenv("S3_ENDPOINT", "minio:9000")

os.environ.setdefault("AWS_REGION", "us-east-1")
os.environ.setdefault("AWS_DEFAULT_REGION", os.environ["AWS_REGION"])

print("Spark master:", SPARK_MASTER)
print("Trino URL:", TRINO_URL)


In [None]:
from pyspark.sql import SparkSession

packages = [
    "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.2",
    "org.apache.hadoop:hadoop-aws:3.3.4",
    "software.amazon.awssdk:bundle:2.20.158",
]

spark = (
    SparkSession.builder
    .appName("HotelsPracticeDrills")
    .master(SPARK_MASTER)
    .config("spark.jars.packages", ",".join(packages))
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg.type", "rest")
    .config("spark.sql.catalog.iceberg.uri", "http://hive-metastore:9001/iceberg")
    .config("spark.sql.catalog.iceberg.warehouse", "s3a://iceberg/warehouse")
    .config("spark.sql.catalog.iceberg.s3.endpoint", f"http://{S3_ENDPOINT}")
    .config("spark.sql.catalog.iceberg.s3.access-key-id", MINIO_ACCESS_KEY)
    .config("spark.sql.catalog.iceberg.s3.secret-access-key", MINIO_SECRET_KEY)
    .config("spark.sql.catalog.iceberg.s3.region", os.environ["AWS_REGION"])
    .config("spark.sql.catalog.iceberg.s3.path-style-access", "true")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.defaultCatalog", "iceberg")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sparkContext.setLogLevel("WARN")
spark.conf.set("spark.sql.session.timeZone", "UTC")

print("Spark session ready:", spark.version)


In [None]:
import trino

trino_conn = trino.dbapi.connect(
    host=os.getenv("TRINO_HOST", "trino"),
    port=int(os.getenv("TRINO_PORT", "8080")),
    user=os.getenv("TRINO_USER", "admin"),
    catalog=os.getenv("TRINO_CATALOG", "iceberg"),
    schema=os.getenv("TRINO_SCHEMA", "hotels_practice"),
)
print("Connected to Trino catalog/schema:", trino_conn.schema)


## 🧮 Section A — SQL (Trino) Drills

Catalog/schema default: `iceberg.hotels_practice`. Adjust queries if you wrote tables elsewhere.

### A1. Top Hotels by Recent Review Quality (last 180 days)
- Inputs: `reviews`, `hotels`
- Compute `review_cnt`, `avg_rating` grouped by `(country, city, hotel_id)` for reviews in the last 180 days.
- Return the top 20 ordered by `review_cnt` desc, then `avg_rating` desc. Break ties deterministically (e.g., add `hotel_id` as a final sort key).

### A2. Monthly Occupancy Proxy
- Inputs: `bookings`, `hotels`
- For `status = 'completed'`, calculate monthly `bookings` and `total_nights` per `(country, city, hotel_id, month)` using `date_trunc('month', checkin_date)`.
- Order by `month` desc, then `bookings` desc.

### A3. Top-3 Hotels per Country (Window)
- Inputs: `bookings`, `hotels`
- For completed bookings, compute total bookings and avg price per `(country, hotel_id)`.
- Return at most three rows per country using window functions (e.g., `row_number()` or `dense_rank()` when ties should retain peers).
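
The ranking step can be tried on toy data before pointing it at Trino. The sketch below is only an illustration: it uses Python's built-in `sqlite3` as a stand-in engine, and `hotel_stats` is a hypothetical pre-aggregated table (in the lab it would be a CTE over completed bookings joined to `hotels`). The window clause itself is standard SQL and should carry over to Trino largely unchanged.

```python
import sqlite3

# Hypothetical toy rows: (country, hotel_id, total_bookings, avg_price).
rows = [
    ("DE", "h1", 50, 120.0),
    ("DE", "h2", 40, 95.0),
    ("DE", "h3", 40, 110.0),
    ("DE", "h4", 10, 80.0),
    ("FR", "h5", 30, 150.0),
]

TOP3_SQL = """
WITH ranked AS (
    SELECT country, hotel_id, total_bookings, avg_price,
           row_number() OVER (
               PARTITION BY country
               -- hotel_id is the final tie-breaker so the ranking is deterministic
               ORDER BY total_bookings DESC, avg_price DESC, hotel_id
           ) AS rn
    FROM hotel_stats
)
SELECT country, hotel_id, total_bookings, rn
FROM ranked
WHERE rn <= 3
ORDER BY country, rn
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hotel_stats "
    "(country TEXT, hotel_id TEXT, total_bookings INTEGER, avg_price REAL)"
)
conn.executemany("INSERT INTO hotel_stats VALUES (?, ?, ?, ?)", rows)
top3 = conn.execute(TOP3_SQL).fetchall()
for row in top3:
    print(row)
```

With `row_number()` the cap of three rows per country is strict. Swapping in `dense_rank()` keeps tied peers, so a country may return more than three rows.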

### A4. Rolling Rating per Hotel (Window Frame)
- Inputs: `reviews`
- For each `hotel_id`, compute a rolling average of `rating` over the previous 10 rows ordered by `created_at`. Use an explicit `ROWS BETWEEN ...` frame and decide whether the current row belongs to it.
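
One possible reading of "previous 10 rows" excludes the current review from the frame. The sketch below assumes that reading and, purely so it runs without a cluster, uses `sqlite3` with a toy `reviews` table; the sample rows are made up. Use `ROWS BETWEEN 9 PRECEDING AND CURRENT ROW` instead if the current review should count.

```python
import sqlite3

ROLLING_SQL = """
SELECT hotel_id, created_at, rating,
       avg(rating) OVER (
           PARTITION BY hotel_id
           ORDER BY created_at
           -- up to 10 earlier reviews, current row excluded
           ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING
       ) AS rolling_avg_prev10
FROM reviews
ORDER BY hotel_id, created_at
"""

# Hypothetical sample reviews: (hotel_id, created_at, rating).
sample = [
    ("h1", "2024-01-01", 4.0),
    ("h1", "2024-01-02", 2.0),
    ("h1", "2024-01-03", 5.0),
    ("h2", "2024-01-01", 3.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (hotel_id TEXT, created_at TEXT, rating REAL)")
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?)", sample)
rolling = conn.execute(ROLLING_SQL).fetchall()
for row in rolling:
    print(row)
```

The first review of each hotel has an empty frame, so its rolling average is `NULL` (`None` in Python) rather than zero.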

### A5. Cancellation & No-Show Rates
- Inputs: `bookings`, `hotels`
- For each `(country, month)`, compute cancellation and no-show rates, dividing by `NULLIF(total, 0)` to avoid divide-by-zero.

### A6. Review Language Mix & Bias Check
- Inputs: `reviews`, `hotels`
- For the last 90 days, compute each language's share of reviews per country and return the countries where any single language's share exceeds 0.6.

### A7. Joining Images for Quality Screening
- Inputs: `images`, `hotels`
- For the last 60 days, compute `avg_quality` and `image_cnt` per `(country, hotel_id)`, then keep rows with `image_cnt >= 5` and `avg_quality >= 0.8`.

### A8. “Trusted” Hotel Surface (Multi-signal)
- Inputs: `hotels`, `reviews`, `bookings`, `images`
- Build a view/query where hotels satisfy all of:
  - ≥ 30 reviews in last 180 days with `avg_rating >= 4.2`
  - ≥ 20 completed bookings in last 180 days
  - ≥ 5 images in last 90 days with `avg_quality >= 0.75`
- Compose with CTEs then join.


## 🔥 Section B — PySpark Drills

Load tables via `spark.table("iceberg.hotels_practice.<table>")` or `spark.sql`.

### B1. Sessionize Reviews by User (30-minute gaps)
- Build sessions per `user_id` using window + `lag` to reset when gap > 30 minutes.
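
A plain-Python reference for the gap rule can help when checking the PySpark result on a small sample. This is a sketch, not the PySpark solution: `prev_ts` plays the role of `lag(created_at)`, and the session counter corresponds to a running sum of "new session" flags over the same per-user window. The `created_at` column name and the `user-index` session id format are assumptions.

```python
from datetime import datetime, timedelta
from itertools import groupby

SESSION_GAP = timedelta(minutes=30)


def sessionize(events):
    """events: iterable of (user_id, created_at) tuples.

    Returns (user_id, created_at, session_id) tuples ordered by user and time.
    """
    ordered = sorted(events, key=lambda e: (e[0], e[1]))
    out = []
    for user_id, user_events in groupby(ordered, key=lambda e: e[0]):
        session_idx = 0
        prev_ts = None  # equivalent of lag(created_at) within the user partition
        for _, ts in user_events:
            # A gap strictly greater than 30 minutes starts a new session.
            if prev_ts is not None and ts - prev_ts > SESSION_GAP:
                session_idx += 1
            out.append((user_id, ts, f"{user_id}-{session_idx}"))
            prev_ts = ts
    return out


events = [
    ("u1", datetime(2024, 1, 1, 10, 0)),
    ("u1", datetime(2024, 1, 1, 10, 30)),  # exactly 30 minutes: same session
    ("u1", datetime(2024, 1, 1, 11, 1)),   # 31-minute gap: new session
    ("u2", datetime(2024, 1, 1, 9, 0)),
]
sessions = sessionize(events)
for row in sessions:
    print(row)
```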

### B2. Late-arriving Event Simulation
- Use `spark.readStream` with a rate source or file source and the static reviews data to demonstrate a 5-minute tumbling aggregation with a 10-minute watermark.

### B3. Skew Handling in Joins (Hotels × Reviews)
- Join reviews to hotels and compute avg rating per hotel. Show one skew mitigation strategy (broadcast or salting).

### B4. Partition-Aware Writes (Iceberg)
- Rewrite reviews into a temp Iceberg table partitioned by `months(created_at)` and set a target file size property. Inspect metadata/EXPLAIN.

### B5. Deduplicate Near-Duplicates by Text Fingerprint
- Normalize `review_text` (lowercase, strip punctuation) and drop duplicates by `(hotel_id, fingerprint)`.
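
The normalization and key logic can be prototyped in plain Python before being moved into Spark column expressions or a UDF. The sketch below is one possible fingerprint (lowercase, drop punctuation, collapse whitespace, then SHA-1); the choice of hash and the keep-first policy are assumptions, not requirements of the drill.

```python
import hashlib
import re


def fingerprint(text):
    """Lowercase, strip punctuation, collapse whitespace, then hash."""
    normalized = re.sub(r"[^\w\s]", "", (text or "").lower())
    normalized = " ".join(normalized.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def drop_near_duplicates(rows):
    """Keep the first row seen for each (hotel_id, fingerprint) pair."""
    seen = set()
    kept = []
    for row in rows:
        key = (row["hotel_id"], fingerprint(row["review_text"]))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"hotel_id": "h1", "review_text": "Great stay!!!"},
    {"hotel_id": "h1", "review_text": "great   stay."},  # same fingerprint as above
    {"hotel_id": "h2", "review_text": "Great stay"},     # different hotel, kept
    {"hotel_id": "h1", "review_text": "Terrible"},
]
deduped = drop_near_duplicates(rows)
print(len(deduped))
```

This only catches duplicates that differ in case, punctuation, or spacing. Reworded reviews would need a fuzzier technique such as shingling or MinHash.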

### B6. Curate Balanced ML Training Slices
- Sample balanced subsets across rating buckets (1–5) and languages (`en,de,fr,es,it,he`) with max N per bucket, writing to `iceberg.hotels_practice.ml_reviews_balanced`.


## 🧠 Section C — Python / Data Prep Drills

You can use pandas or PySpark DataFrames. Parse `review_metadata` JSON as needed.

### C1. JSON Tag Explosion + Top Tags per Hotel
- Parse tags from `review_metadata`, handle invalid JSON, and compute top three tags per hotel.
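
A minimal sketch of the parsing and ranking logic follows. It assumes `review_metadata` is a JSON object with a `tags` key holding a list of strings; that layout is a guess, so adjust `parse_tags` to the real schema. Invalid or missing JSON yields an empty tag list instead of raising.

```python
import json
from collections import Counter, defaultdict


def parse_tags(raw):
    """Return a cleaned tag list, or [] for invalid/missing JSON."""
    try:
        meta = json.loads(raw)
    except (TypeError, ValueError):  # None input or malformed JSON
        return []
    tags = meta.get("tags") if isinstance(meta, dict) else None
    if not isinstance(tags, list):
        return []
    return [t.strip().lower() for t in tags if isinstance(t, str) and t.strip()]


def top_tags_per_hotel(rows, k=3):
    """rows: iterable of (hotel_id, review_metadata_json)."""
    counts = defaultdict(Counter)
    for hotel_id, raw in rows:
        tags = parse_tags(raw)
        if tags:
            counts[hotel_id].update(tags)
    # Sort by count desc, then tag name, so ties resolve deterministically.
    return {
        hotel_id: sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        for hotel_id, c in counts.items()
    }


rows = [
    ("h1", '{"tags": ["Clean", "pool"]}'),
    ("h1", '{"tags": ["clean", "wifi", "pool"]}'),
    ("h1", '{"tags": ["clean", "bar"]}'),
    ("h1", "not json"),
    ("h2", None),
    ("h2", '{"tags": "oops"}'),
]
top_tags = top_tags_per_hotel(rows)
print(top_tags)
```

Hotels with no valid tags (here `h2`) are left out of the result; whether to report them with an empty list instead is a design choice.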

### C2. Text Cleaning + Chunking for SFT
- Normalize `review_text`, strip emojis/punctuation, and chunk into ~60-word segments per review.
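
One way to sketch the cleaning and chunking step with the standard library only is shown below. The regex treats anything that is neither a word character nor whitespace as removable, which covers both punctuation and emoji. It also removes combining marks, so vowel points or decomposed accents would split words; treat that as a known limitation of this sketch.

```python
import re
import unicodedata


def clean_text(text):
    """Lowercase, drop punctuation and emoji, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text or "").lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation and emoji are not \w
    text = text.replace("_", " ")         # underscore counts as \w, drop it too
    return " ".join(text.split())


def chunk_words(text, size=60):
    """Split cleaned text into consecutive chunks of at most `size` words."""
    words = clean_text(text).split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


review = "Loved it 😍!! " + "word " * 130
chunks = chunk_words(review)
print([len(c.split()) for c in chunks])
```

The final chunk may be much shorter than 60 words. Depending on the SFT setup, it could be merged into the previous chunk or dropped below a minimum length.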

### C3. Language Filter + Coverage Report
- Keep languages in `{en,de,fr,es,it,he}`. Report coverage % per lang and top 10 hotels by distinct language count.

### C4. Toxicity/PII Placeholder Filter
- Implement simple regex filters for profanity, emails, phone numbers. Output counts of filtered rows.
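
The filter could start out like the sketch below. The patterns are deliberately loose placeholders (the phone pattern will also match date-like digit runs, for example), and `PROFANITY_WORDS` is a hypothetical stand-in list, so expect both false positives and misses.

```python
import re
from collections import Counter

PROFANITY_WORDS = ["darn", "heck"]  # hypothetical placeholder list

FILTERS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # Optional "+", then at least 9 digits/separators, starting and ending on a digit.
    "phone": re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)"),
    "profanity": re.compile(
        r"\b(?:" + "|".join(map(re.escape, PROFANITY_WORDS)) + r")\b",
        re.IGNORECASE,
    ),
}


def filter_reviews(texts):
    """Return (kept_texts, counts); counts has one entry per reason plus a row total."""
    kept, counts = [], Counter()
    for text in texts:
        reasons = [name for name, rx in FILTERS.items() if rx.search(text or "")]
        if reasons:
            counts["filtered_rows"] += 1
            counts.update(reasons)  # a row can be counted under several reasons
        else:
            kept.append(text)
    return kept, counts


texts = [
    "Lovely stay",
    "mail me at a.b@example.com",
    "call +1 (555) 123-4567 now",
    "what the heck",
    "heck, write to x@y.org",
]
kept, counts = filter_reviews(texts)
print(kept, dict(counts))
```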

### C5. Train/Val/Test Split by Hotel + Time
- For each hotel, assign latest month to test, previous month to val, rest to train. Ensure no leakage.
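
The assignment rule can be sketched as follows. It assumes "previous month" means the second-latest month in which the hotel has data (not necessarily the adjacent calendar month) and that `created_at` is the time column. A hotel with a single month of data ends up entirely in test.

```python
from collections import defaultdict
from datetime import date


def month_key(d):
    return (d.year, d.month)


def assign_splits(rows):
    """rows: list of dicts with 'hotel_id' and 'created_at' (date).

    Returns a list of 'train' / 'val' / 'test' labels aligned with rows.
    """
    months_by_hotel = defaultdict(set)
    for r in rows:
        months_by_hotel[r["hotel_id"]].add(month_key(r["created_at"]))

    label_for = {}
    for hotel_id, months in months_by_hotel.items():
        ordered = sorted(months)
        for i, m in enumerate(ordered):
            if i == len(ordered) - 1:
                label_for[(hotel_id, m)] = "test"
            elif i == len(ordered) - 2:
                label_for[(hotel_id, m)] = "val"
            else:
                label_for[(hotel_id, m)] = "train"

    # Each (hotel, month) maps to exactly one split, so no month leaks across splits.
    return [label_for[(r["hotel_id"], month_key(r["created_at"]))] for r in rows]


rows = [
    {"hotel_id": "h1", "created_at": date(2024, 1, 5)},
    {"hotel_id": "h1", "created_at": date(2024, 2, 1)},
    {"hotel_id": "h1", "created_at": date(2024, 3, 3)},
    {"hotel_id": "h1", "created_at": date(2024, 3, 20)},
    {"hotel_id": "h2", "created_at": date(2024, 6, 1)},
]
labels = assign_splits(rows)
print(labels)
```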

### C6. Simple Embedding Cache Index (Mock)
- Build TF-IDF vectors for chunks (from C2), store metadata mapping, and implement a cosine-similarity top-K search helper.
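
A dependency-free sketch of the index and search helper is given below. It uses whitespace tokenization and a smoothed IDF, `log((1 + N) / (1 + df)) + 1`, which is one common variant and not the only option. The chunk dictionaries and their `hotel_id` field are hypothetical; in practice a library vectorizer would likely replace the hand-rolled math.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def _unit_vector(term_counts, idf):
    """TF-IDF weights scaled to unit length; terms unknown to the index are ignored."""
    weights = {t: n * idf[t] for t, n in term_counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def build_index(chunks):
    """chunks: list of dicts with 'chunk_id', 'text', plus any metadata keys."""
    term_counts = {c["chunk_id"]: Counter(tokenize(c["text"])) for c in chunks}
    n_docs = len(term_counts)
    doc_freq = Counter(t for tf in term_counts.values() for t in tf)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {cid: _unit_vector(tf, idf) for cid, tf in term_counts.items()}
    metadata = {c["chunk_id"]: {k: v for k, v in c.items() if k != "text"} for c in chunks}
    return {"idf": idf, "vectors": vectors, "metadata": metadata}


def search(index, query, k=3):
    """Return up to k (chunk_id, cosine_score, metadata) tuples, best first."""
    q = _unit_vector(Counter(tokenize(query)), index["idf"])
    scored = []
    for cid, vec in index["vectors"].items():
        # Both vectors are unit length, so the dot product is the cosine similarity.
        score = sum(w * vec.get(t, 0.0) for t, w in q.items())
        if score > 0:
            scored.append((cid, score, index["metadata"][cid]))
    return sorted(scored, key=lambda s: (-s[1], s[0]))[:k]


chunks = [
    {"chunk_id": "c1", "hotel_id": "h1", "text": "clean room great pool"},
    {"chunk_id": "c2", "hotel_id": "h1", "text": "noisy room bad wifi"},
    {"chunk_id": "c3", "hotel_id": "h2", "text": "great pool and bar"},
]
index = build_index(chunks)
hits = search(index, "great pool")
print([(cid, round(score, 3)) for cid, score, _ in hits])
```

Storing unit-length vectors at build time keeps the search loop to a sparse dot product per chunk, which is adequate for a mock cache but scans every chunk on each query.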


## ☁️ Section D — Advanced Dataset Workloads (Iceberg Only)

All tasks rely exclusively on the tables produced by `hotels_iceberg_population.ipynb` (namespace default: `iceberg.hotels_practice`).

### D1. Historical Snapshot Audits
- Use Iceberg time travel to compare yesterday's snapshot with today's and report row count deltas per table.

### D2. Partition Health Checks
- For `bookings` and `reviews`, compute partition sizes (`months(checkin_date)` / `months(created_at)`) and flag skewed partitions (e.g., >3× median).
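
Once the per-partition sizes are collected (for example as a dictionary of partition label to row count), the skew flag itself is a short computation. The sketch below assumes row counts as the size measure and made-up partition labels; byte sizes from table metadata would work the same way.

```python
from statistics import median


def flag_skewed(partition_sizes, factor=3.0):
    """Return partitions whose size exceeds factor x the median size."""
    if not partition_sizes:
        return {}
    threshold = factor * median(partition_sizes.values())
    return {p: size for p, size in partition_sizes.items() if size > threshold}


# Hypothetical monthly row counts for one table.
sizes = {"2024-01": 100, "2024-02": 120, "2024-03": 90, "2024-04": 500}
skewed = flag_skewed(sizes)
print(skewed)
```

The median is used instead of the mean so that a single oversized partition does not raise its own threshold.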

### D3. Compaction Strategy Proposal
- Analyze file counts and average file sizes via the `<table>.files` metadata table and outline a compaction cadence (commands + triggers).


## 🔄 Section E — Streaming Simulation on Existing Data

### E1. Micro-batch Replay
- Use Structured Streaming with the static `reviews` table as a rate-limited source (e.g., `spark.readStream.format("iceberg").load(...)`).
- Demonstrate watermarking and exactly-once upserts back into an Iceberg staging table.

### E2. Late Data Handling
- Inject artificial delays by duplicating rows with older timestamps; verify that the logic handles the late duplicates without relying on sources outside the existing stack.

### E3. Quality Metrics
- Compute per-batch metrics (records processed, duplicates dropped) and write them to `iceberg.hotels_practice.stream_metrics`.


## ✅ Section F — Data Quality & Governance

### F1. Automated Profiling
- Write profiling helpers that operate on the Iceberg tables (min/max, distinct counts, null ratios) and persist to `data_quality_profile`.

### F2. Contract Enforcement
- Express a contract for `bookings` as code (dictionary). Validate nightly and log violations to `data_quality_violations`.
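
A contract-as-dictionary could look like the sketch below. Apart from `hotel_id` and `status = 'completed'`, which appear in the drills above, the column names, allowed status values, and rule keys (`nullable`, `allowed`, `min`) are hypothetical and should be replaced with the real `bookings` schema.

```python
# Hypothetical contract: adjust columns and rules to the actual bookings table.
BOOKINGS_CONTRACT = {
    "hotel_id": {"type": str, "nullable": False},
    "status": {
        "type": str,
        "nullable": False,
        "allowed": {"completed", "cancelled", "no_show"},
    },
    "nights": {"type": int, "nullable": False, "min": 1},
}


def validate(rows, contract):
    """Return one violation record per broken rule; missing columns count as null."""
    violations = []
    for i, row in enumerate(rows):
        for column, rule in contract.items():
            value = row.get(column)
            if value is None:
                if not rule.get("nullable", True):
                    violations.append({"row": i, "column": column, "rule": "not_null"})
                continue
            if not isinstance(value, rule["type"]):
                violations.append({"row": i, "column": column, "rule": "type"})
                continue
            if "allowed" in rule and value not in rule["allowed"]:
                violations.append({"row": i, "column": column, "rule": "allowed"})
            if "min" in rule and value < rule["min"]:
                violations.append({"row": i, "column": column, "rule": "min"})
    return violations


rows = [
    {"hotel_id": "h1", "status": "completed", "nights": 2},
    {"hotel_id": None, "status": "unknown", "nights": 0},
    {"hotel_id": "h2", "status": "completed"},
]
violations = validate(rows, BOOKINGS_CONTRACT)
for v in violations:
    print(v)
```

The violation records are flat dictionaries, so they can be turned into a DataFrame and appended to `data_quality_violations` with a run timestamp.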

### F3. Catalog Documentation
- Build a small metadata table with columns (`table_name`, `description`, `owner`, `quality_score`, `sample_query`) sourced from the current dataset.


## 🧪 Section G — Experimentation & Monitoring (Dataset Only)

### G1. Prompt/Response Logging Stub
- Create a framework that logs generated summaries derived from `reviews` into `genai_prompt_log` (Iceberg table).
- Include latency, prompt hash, response hash columns.
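
The logging stub might be shaped like this sketch. The generator is a placeholder that just truncates the prompt, and the record field names (`prompt_hash`, `response_hash`, `latency_ms`, `logged_at`) are assumptions about what `genai_prompt_log` could contain. Records accumulate in a list here; persisting them to Iceberg is left to the lab's Spark session.

```python
import hashlib
import time
from datetime import datetime, timezone


def sha256_hex(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def truncate_summary(prompt, max_words=12):
    """Placeholder generator: returns the first few words of the prompt."""
    return " ".join(prompt.split()[:max_words])


def log_generation(prompt, generate, log_rows):
    """Call generate(prompt), append a log record to log_rows, return the response."""
    started = time.perf_counter()
    response = generate(prompt)
    latency_ms = (time.perf_counter() - started) * 1000.0
    log_rows.append({
        "prompt_hash": sha256_hex(prompt),
        "response_hash": sha256_hex(response),
        "latency_ms": latency_ms,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    })
    return response


log_rows = []
summary = log_generation(
    "Summarize: the room was clean and the pool was great", truncate_summary, log_rows
)
print(summary, log_rows[0]["prompt_hash"][:12])
```

Hashing the prompt and response keeps the log compact and makes repeated prompts easy to group without storing raw review text.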

### G2. Offline Metrics
- Compute text similarity metrics between `review_text` and generated summaries using only local libraries (e.g., cosine over TF-IDF).
- Store metrics per run in `genai_eval_metrics`.

### G3. A/B Simulation
- Split hotels into pseudo A/B cohorts using existing data; compute uplift in review conversion or image engagement using window functions.
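
For the cohort split, a deterministic hash of `hotel_id` gives stable pseudo-random groups without storing an assignment table. The sketch below covers only the assignment and a relative difference in cohort means; the salt value and the per-hotel metric are hypothetical, and the window-function part of the drill is not shown.

```python
import hashlib
from statistics import mean


def assign_cohort(hotel_id, salt="hotels-ab-v1"):
    """Deterministically map a hotel to cohort 'A' or 'B' via a salted hash."""
    digest = hashlib.sha256(f"{salt}:{hotel_id}".encode("utf-8")).digest()
    return "A" if digest[0] % 2 == 0 else "B"


def relative_uplift(metric_by_hotel):
    """Return (mean_B - mean_A) / mean_A, or None if it cannot be computed."""
    groups = {"A": [], "B": []}
    for hotel_id, value in metric_by_hotel.items():
        groups[assign_cohort(hotel_id)].append(value)
    if not groups["A"] or not groups["B"]:
        return None
    mean_a, mean_b = mean(groups["A"]), mean(groups["B"])
    return None if mean_a == 0 else (mean_b - mean_a) / mean_a


# Hypothetical per-hotel metric, identical everywhere, so the uplift should be zero.
metric = {f"h{i}": 1.0 for i in range(1000)}
print(relative_uplift(metric))
```

Changing the salt reshuffles the cohorts, which is useful for checking that an observed uplift is not an artifact of one particular split.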


## ☕ Section H — JVM & API Exercises within Dataset Scope

### H1. Scala/Java Spark Translation
- Translate the PySpark booking aggregation into Scala/Java using the same Iceberg tables (include code snippet or sbt skeleton).

### H2. Data Access Service Sketch
- Design a REST/gRPC service that reads from Iceberg via Spark or Trino to serve hotel insights.
- Keep the architecture grounded in the existing dataset (no new storage systems).


## 🔐 Section I — Privacy & Compliance on Generated Tables

### I1. PII Masking
- Rewrite `reviews`, masking emails and phone numbers with regex UDFs; store the results in `reviews_redacted`.

### I2. Auditing Access
- Capture query history against `hotels_practice` tables by parsing Spark event logs or Trino query logs (simulated).

### I3. Data Retention Drill
- Implement a delete workflow for a user (`user_id`) across bookings/reviews/images inside Iceberg, preserving an audit trail table.
