<div style="display: flex; align-items: center; gap: 18px; margin-bottom: 15px;">
  <img src="https://files.codebasics.io/v3/images/sticky-logo.svg" alt="Codebasics Logo" style="display: inline-block;" width="130">
  <h1 style="font-size: 34px; color: #1f4e79; margin: 0; display: inline-block;">Codebasics Practice Room - Data Engineering Bootcamp </h1>
</div>


#### 📦 Big Data, Small Machines

This notebook demonstrates how Apache Spark on Databricks can process
datasets that are **too large to fit into the memory of a single machine**.

We simulate a **500 GB CSV file** scenario using a smaller dataset,
while following the **exact same processing pattern** used for large-scale data.


## 📂 Dataset

**Dataset Name:** `big_events_50k.csv`  
**Storage Location:** Databricks File System (DBFS) (refer to *Create a Databricks Catalog and Upload a CSV.pdf*)

> ⚠️ In real-world scenarios, this dataset would be a **500 GB CSV**
> stored in **ADLS / S3**.  
> Since we cannot upload such a large file here, we use **50k rows to simulate the same pattern**.

### Example Columns:
- `event_id`
- `event_time`
- `country`
- `device`
- `amount`


## 🗂️ Scenario

You are working with a **very large events dataset** (think **hundreds of GBs**) stored as a CSV file in **cloud storage**.

Each row represents a **user event** (such as a click, view, or purchase) along with:
- when the event happened
- which country it came from
- which device was used
- the transaction amount (if any)

Because the dataset is **too large to fit into the memory of a single machine**, it is processed with a distributed engine, here **Apache Spark on Databricks**.

For learning purposes, we use a **smaller sample file** (`big_events_50k.csv`) that follows the **same structure and processing pattern** as the real large dataset.

The input data already exists in **your Unity Catalog / database storage** and needs to be read, processed, and stored in an optimized format.

---

## 🎯 Task

Perform the following steps using Spark:

1. **Read** the large CSV dataset (`big_events_50k.csv`) from **your catalog / database storage**.  
2. Let Spark **automatically distribute the data** across multiple executors.  
3. **Aggregate** the data to calculate:
   - total number of events
   - total transaction amount  
   grouped by **country** and **device**.
4. **Write** the aggregated result in **Delta format** for efficient analytics.  
5. **Validate** the output by reading it back and displaying sample records.

---

## 🧩 Assumptions

- The input CSV file is already available in **your Databricks catalog or database storage**.  
- The dataset contains the following columns:  
  `event_id`, `event_time`, `country`, `device`, `amount`
- The file is large enough that **single-machine processing is not feasible**.  
- Spark automatically handles:
  - file partitioning
  - parallel execution
  - fault tolerance

---

## 📦 Deliverables

- **Output Format:** Delta table  
- **Output Location:**  
  Stored in **your catalog / database** (Silver / curated layer)

### **Expected Columns**

| country | device | event_count | total_amount |
|--------|--------|-------------|---------------|

---

## 🧠 Notes

- You do **not** need to manually split the file; Spark does this for you.  
- The **same Spark code works** for both small and very large datasets.  
- Scalability comes from **distributed execution**, not from changing logic.  
- Delta format helps with **reliability, performance, and analytics**.

---

## 🧩 Example Output (simplified)

| country | device  | event_count | total_amount |
|--------|---------|-------------|---------------|
| India  | Android | 12,450      | 8,945,230.50 |
| USA    | iOS     | 9,820       | 7,112,980.75 |
| UK     | Web     | 6,310       | 3,456,120.00 |


## 🧠 Solution Strategy (High-Level)

1. Read the dataset directly from cloud storage using Spark
2. Let Spark automatically split the file into partitions
3. Process partitions in parallel across executors
4. Apply transformations using Spark DataFrame APIs
5. Write results in an optimized format (Delta / Parquet)

Spark handles:
- Distributed processing
- Lazy evaluation
- Query optimization
- Fault tolerance


In [0]:
from pyspark.sql import functions as F

# Placeholder: replace with the location of big_events_50k.csv in your catalog / storage.
input_path = "your_data"


In [0]:
df = (
    spark.read
         .option("header", "true")       # first line holds the column names
         .option("inferSchema", "true")  # convenient, but costs an extra pass over the data
         .csv(input_path)
)
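On a file of several hundred GB, the extra scan that `inferSchema` performs can be expensive, so one common alternative is to pass the schema explicitly. The sketch below is only an illustration: the column types in `EVENTS_SCHEMA_DDL` are assumptions (the notebook does not state them), and `read_events` is a hypothetical helper name. It relies on `DataFrameReader.schema` accepting a DDL-formatted string.

```python
# Assumed column types; adjust them to match the real file.
EVENTS_SCHEMA_DDL = (
    "event_id STRING, "
    "event_time TIMESTAMP, "
    "country STRING, "
    "device STRING, "
    "amount DOUBLE"
)


def read_events(spark, path):
    """Read the events CSV with a fixed schema instead of inferSchema."""
    return (
        spark.read
             .option("header", "true")   # skip the header line
             .schema(EVENTS_SCHEMA_DDL)  # no inference pass needed
             .csv(path)
    )


# In a Databricks notebook, `spark` and `input_path` already exist:
# df = read_events(spark, input_path)
```

With a fixed schema, Spark can start the real job without first scanning the file to guess types, and a type mismatch shows up as nulls or errors rather than a silently different inferred type.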


In [0]:
df.printSchema()


## 🛢️ Input data

In [0]:
display(df.limit(10))


🔍 Spark does **not** load the entire dataset into one machine.

- The file is split into partitions
- Partitions are processed in parallel as tasks spread across the executors
- This same approach works for 50k rows or 500 GB
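To get a feel for how many partitions a large file produces, here is a rough back-of-the-envelope estimate in plain Python. It assumes an uncompressed, splittable CSV and the default 128 MB value of `spark.sql.files.maxPartitionBytes`; `estimate_partitions` is a hypothetical helper, and Spark's real split logic also takes file-open cost and the number of cores into account, so actual counts can differ.

```python
import math


def estimate_partitions(file_size_bytes, max_partition_bytes=128 * 1024**2):
    """Rough count of input partitions for a splittable file."""
    return math.ceil(file_size_bytes / max_partition_bytes)


GB = 1024**3
MB = 1024**2

print(estimate_partitions(500 * GB))  # → 4000
print(estimate_partitions(5 * MB))    # → 1
```

So the 500 GB scenario becomes a few thousand independent read tasks, while a sample file of a few MB fits in a single split under this simplified estimate. The code that reads the file is identical in both cases.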


## 🔄 Transformation

### Business Question:
**What is the total number of events and total transaction amount
per country and device?**


In [0]:
agg_df = (
    df.groupBy("country", "device")
      .agg(
          F.count("*").alias("event_count"),
          F.sum("amount").alias("total_amount")
      )
)


In [0]:
display(agg_df)
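The reason a `groupBy` with `count` and `sum` scales is that both can be computed in two phases: each partition is aggregated locally, and only the small per-group partial results are combined afterwards. The following plain-Python sketch illustrates that idea with made-up rows. It is a conceptual model, not Spark's actual implementation, and the function names are hypothetical.

```python
from collections import defaultdict


def partial_aggregate(partition):
    """Aggregate one partition locally: (country, device) -> [count, sum]."""
    acc = defaultdict(lambda: [0, 0.0])
    for row in partition:
        key = (row["country"], row["device"])
        acc[key][0] += 1                   # count every row, like F.count("*")
        if row["amount"] is not None:      # skip missing amounts, like F.sum
            acc[key][1] += row["amount"]
    return acc


def merge_partials(partials):
    """Combine per-partition results into final totals."""
    final = defaultdict(lambda: [0, 0.0])
    for partial in partials:
        for key, (count, total) in partial.items():
            final[key][0] += count
            final[key][1] += total
    return {key: tuple(value) for key, value in final.items()}


# Two toy "partitions" standing in for chunks of the CSV (made-up rows).
partitions = [
    [
        {"country": "India", "device": "Android", "amount": 10.0},
        {"country": "USA", "device": "iOS", "amount": 5.0},
    ],
    [
        {"country": "India", "device": "Android", "amount": 2.5},
    ],
]

result = merge_partials(partial_aggregate(p) for p in partitions)
print(result)
```

Because the partial results hold one entry per `(country, device)` pair rather than one per event, the amount of data that has to be moved between machines stays small even when the input grows to hundreds of GB.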


## 💾 Writing Output

We write the results using **Delta Lake**, which is optimized for analytics
and supported natively in Databricks.


In [0]:
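# Placeholder: replace "your_directory" with your output location.
(
    agg_df.write
          .mode("overwrite")   # replace the output of any previous run
          .format("delta")
          .save("your_directory")
)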


In [0]:
result_df = (
    spark.read
         .format("delta")
         .load("your_directory")
)

display(result_df)
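Beyond displaying sample records, the validation step can check two simple invariants: every `(country, device)` pair appears once, and the group counts add up to the number of input rows. The sketch below is one possible way to do that; `validate_aggregate` is a hypothetical helper and the demo rows are made up. Collecting the aggregate to the driver is assumed to be safe here because it has only one row per country and device combination.

```python
def validate_aggregate(rows, expected_total_events):
    """Check basic invariants on the collected aggregate rows."""
    keys = [(r["country"], r["device"]) for r in rows]
    # After groupBy, each (country, device) pair should appear exactly once.
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate (country, device) groups in output")
    # The per-group counts must add up to the number of input rows.
    total = sum(r["event_count"] for r in rows)
    if total != expected_total_events:
        raise ValueError(
            f"event_count sums to {total}, expected {expected_total_events}"
        )
    return total


# Made-up rows shaped like the expected output columns.
demo_rows = [
    {"country": "India", "device": "Android", "event_count": 2, "total_amount": 12.5},
    {"country": "USA", "device": "iOS", "event_count": 1, "total_amount": 5.0},
]
print(validate_aggregate(demo_rows, expected_total_events=3))  # → 3

# In the notebook, the same check could be run as:
# validate_aggregate(result_df.collect(), df.count())
```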


## ⚙️ Why This Solution Scales to 500 GB

### Key Spark Features Used:
- **Distributed File Reading** from ADLS / S3
- **Automatic Partitioning**
- **Parallel Execution across Executors**
- **Lazy Evaluation**
- **Catalyst Query Optimizer**
- **Tungsten Execution Engine**
- **Fault Tolerance**

💡 The same code runs regardless of dataset size.
Only the cluster size changes.


## ✅ Summary

- Spark enables processing of datasets **larger than a single machine**
- Databricks simplifies cluster management and optimization
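- **Delta** stores data as columnar Parquet files plus a transaction log, which improves analytical query performance and reliability
- The read, aggregate, and write-to-Delta pattern shown here is a common one in production Spark pipelines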

This notebook demonstrates a scalable Spark solution
for large-scale data processing.
