<div style="display: flex; align-items: center; gap: 18px; margin-bottom: 15px;">
  <img src="https://files.codebasics.io/v3/images/sticky-logo.svg" alt="Codebasics Logo" style="display: inline-block;" width="130">
  <h1 style="font-size: 34px; color: #1f4e79; margin: 0; display: inline-block;">Codebasics Practice Room - Data Engineering Bootcamp </h1>
</div>


#### 🔗 Joining Small Lookup Tables Efficiently in Spark

This notebook demonstrates how Apache Spark can **efficiently join a very large fact table**
with a **small lookup (dimension) table** using a **broadcast join**.

This is a common real-world pattern in analytics and data engineering,
especially in **star-schema–like designs**.


## 📂 Dataset

### Fact Table (Large)
**Dataset Name:** `big_events_50k.csv`  
Represents a large transactional or event-level dataset.

### Dimension Table (Small)
**Dataset Name:** `country_dim.csv`  
Contains country metadata such as country name and region.

> ⚠️ In real-world systems, the fact table can contain **millions or billions of rows**,  
while the country dimension usually contains **only a few hundred rows**.

Both datasets are assumed to be available in **your catalog / database storage**.

### Example Columns

**Fact Table**
- `event_id`
- `event_time`
- `country`
- `amount`

**Country Dimension**
- `country_code`
- `country_name`
- `region_group`


## 🗂️ Scenario

You are working with a **large fact table** containing transaction or event data.
Each record includes a **country code**, but no descriptive country information.

To enrich the data for reporting, you need to join it with a **small country dimension table**
that contains:
- full country names
- regional groupings (e.g., Asia, Europe, Americas)

Because the fact table is large and the dimension table is very small,
a **standard shuffle-based join** (for example, a sort-merge join) would **repartition the large fact table across the network unnecessarily**.

Your goal is to perform this join in the **most efficient way possible** using Spark.
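
To make the idea concrete, here is a minimal plain-Python sketch of what a broadcast hash join does on a single executor. It is a conceptual illustration, not Spark's actual implementation. The function name `broadcast_hash_join` and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Optional

Row = Dict[str, object]


def broadcast_hash_join(fact_partition: Iterable[Row], dim_rows: Iterable[Row]) -> Iterator[Row]:
    """Join one partition of events against a full copy of the small table."""
    # Build phase: index the small table by its join key, once per executor.
    lookup = {row["country_code"]: row for row in dim_rows}
    # Probe phase: each event does a local dictionary lookup; no fact rows move.
    for event in fact_partition:
        match: Optional[Row] = lookup.get(event["country"])
        yield {
            **event,
            "country_name": match["country_name"] if match else None,
            "region_group": match["region_group"] if match else None,
        }


# Illustrative rows only; these values are not taken from the real datasets.
country_dim = [
    {"country_code": "IN", "country_name": "India", "region_group": "Asia"},
    {"country_code": "FR", "country_name": "France", "region_group": "Europe"},
]
partition = [
    {"event_id": 1, "country": "IN", "amount": 10.0},
    {"event_id": 2, "country": "ZZ", "amount": 4.5},  # no match in the lookup
]
enriched = list(broadcast_hash_join(partition, country_dim))
```

Only the small table is copied. The fact rows stay in the partition where they already live, which is why no shuffle of the large table is needed.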

---

## 🎯 Task

Perform the following steps using Spark:

1. **Read** the large fact dataset (events / transactions).
2. **Read** the small country dimension dataset.
3. Use a **broadcast join** to join the two datasets on country code.
4. Create an enriched DataFrame with country name and region information.
5. Use the enriched data for reporting or aggregation.

---

## 🧩 Assumptions

- The fact dataset is large and distributed across partitions.
- The country dimension dataset is very small (≈200 rows in real life).
- The join key is:
  - `fact.country` → `dim.country_code`
- The dimension table fits within Spark’s broadcast join threshold (`spark.sql.autoBroadcastJoinThreshold`, 10 MB by default in open-source Spark).
- Spark Serverless compute is being used.
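
As a rough sanity check on the threshold assumption, the arithmetic below compares a crude size estimate with the 10 MB default of `spark.sql.autoBroadcastJoinThreshold` in open-source Spark. The 100-byte average row width is a guess for illustration, and the helper `fits_broadcast_threshold` is hypothetical. Spark itself decides from its own size statistics, not from this formula.

```python
# Default value of spark.sql.autoBroadcastJoinThreshold in open-source Spark (10 MB).
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def fits_broadcast_threshold(
    row_count: int,
    avg_row_bytes: int,
    threshold_bytes: int = DEFAULT_THRESHOLD_BYTES,
) -> bool:
    """Return True if a crude size estimate is within the broadcast threshold."""
    return row_count * avg_row_bytes <= threshold_bytes


# Assumption: about 100 bytes per dimension row (code, name, region as short strings).
dim_ok = fits_broadcast_threshold(row_count=200, avg_row_bytes=100)
# A billion-row fact table at the same assumed row width is far too large to broadcast.
fact_ok = fits_broadcast_threshold(row_count=1_000_000_000, avg_row_bytes=100)
```

Under these assumptions the dimension table is around 20 KB, several orders of magnitude below the threshold.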

---

## 📦 Deliverables

- **Enriched DataFrame** with country metadata
- **Example Report:** Total transaction amount by region group

### **Expected Columns (after join)**

| country | country_name | region_group | amount |
|--------|--------------|--------------|--------|

---

## 🧠 Notes

- Spark automatically distributes large datasets across executors.
- Small lookup tables should be **broadcast** to avoid shuffling large data.
- Broadcast joins are ideal for **fact–dimension** relationships.
- This pattern is widely used in **data warehousing and analytics pipelines**.


## 🧠 Solution Strategy (High-Level)

1. **Read the large fact dataset** (events / transactions) from your catalog or database storage.
2. **Read the small country dimension dataset** containing country metadata.
3. Identify the join relationship:
   - `fact.country` → `dim.country_code`
4. **Broadcast the small dimension DataFrame** so it is available on all executors.
5. Perform a **broadcast hash join** between the fact and dimension tables.
6. Use the enriched DataFrame (with country name and region) for downstream reports and aggregations.

Spark handles:
- Distributing the large fact table across executors
- Sending the small dimension table to each executor once
- Executing fast local hash joins without shuffling large data
- Optimizing the join using its query planner
