# Table of Contents
- [Understanding Database Tables and Spark DataFrames](#understanding-database-tables-and-spark-dataframes)
- [Spark DataFrame Methods](#spark-dataframe-methods)

# Understanding Database Tables and Spark DataFrames

---

## Database Tables

We learned that databases are one of the most popular data processing platforms.  
At a high level, they offer **Tables** and **SQL**.

- **Table = Schema + Data**
- **Schema** is stored in the **metadata store** (data dictionary).
- **Data** is stored in underlying files on disk.
- Users only see the **logical table**, not the physical files.

![Database Table](./images/sql_database.png)
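
As a rough illustration of "Table = Schema + Data", the sketch below uses Python's built-in `sqlite3` module. SQLite is only a stand-in for a database here; the `employees` table and its row are hypothetical. The schema is read from SQLite's data dictionary (`sqlite_master`), while the data is read through the logical table.

```python
import sqlite3

# In-memory database for the demo; a file-backed database would keep its data on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary REAL)")
conn.execute("INSERT INTO employees VALUES (1, 'Asha', 5200.0)")

# Schema: stored in the data dictionary (sqlite_master), separate from the rows.
schema_sql = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'employees'"
).fetchone()[0]

# Data: accessed through the logical table; the physical storage stays hidden.
rows = conn.execute("SELECT id, name, salary FROM employees").fetchall()

print(schema_sql)
print(rows)
```

The query never mentions a file path: the user works with the logical table only.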

---

### How SQL Executes

1. User submits SQL query to the SQL Engine.  
2. SQL Engine consults the **metadata store** (for schema validation).  
3. If you use a non-existent column, an **analysis error** is thrown.  
4. If query passes schema validation:
   - Engine reads data from physical files.
   - Processes data.
   - Returns results.

![SQL Engine Flow](./images/sql_database_1.png)
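
The flow above can be sketched with `sqlite3` again (a stand-in engine; the `orders` table and the `run_query` helper are hypothetical). A query that names a non-existent column is rejected during schema validation, before any rows are processed. SQLite reports this as an `OperationalError`, whereas Spark would call it an analysis error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 32.5)])


def run_query(sql):
    """Return ("ok", rows) on success or ("error", message) if the engine rejects the query."""
    try:
        return ("ok", conn.execute(sql).fetchall())
    except sqlite3.OperationalError as exc:
        return ("error", str(exc))


# Passes schema validation: the engine reads the data, processes it, returns results.
valid = run_query("SELECT SUM(amount) FROM orders")

# Fails schema validation: the column "discount" does not exist in the table.
invalid = run_query("SELECT discount FROM orders")

print(valid)
print(invalid)
```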

---

### Table Layers in a Database

- **Storage Layer** → physical files on disk.  
- **Metadata Layer** → schema and other table metadata.  
- **Logical Layer** → database table exposed to users.  

![Database Table Layers](./images/database_table.png)

---

## Spark Tables

Apache Spark is also a data processing platform.  
It provides two ways to work with data:
- **Spark Tables (SQL-like)**  
- **DataFrames (API-based)**  

Spark tables are similar to database tables:  
- Schema stored in a metadata store.  
- Data stored in files (but Spark supports many formats: CSV, JSON, Parquet, Avro, XML, etc.).  
- Storage can be distributed (HDFS, S3, ADLS).

![Spark Table](./images/spark_database.png)

---

## Spark DataFrames

A **DataFrame** is structurally the same as a table, but:  
- Schema is **not in persistent metastore**.  
- Instead, it’s stored in a **runtime catalog** (in-memory).  
- Exists only while the Spark application is running.  
- Discarded once the Spark application ends.  

This allows Spark to support **schema-on-read**:
- You define schema at read time.
- Spark applies it to the file and builds a temporary DataFrame object.

![Spark DataFrame](./images/spark_dataframes.png)

---

## Spark Table vs Spark DataFrame

| Spark Table | Spark DataFrame |
|-------------|-----------------|
| Schema in **metadata store** (persistent). | Schema in **runtime catalog** (temporary). |
| Persistent, visible across applications. | Runtime-only, visible only in current app. |
| Predefined schema required when creating table. | Schema-on-read (defined at load time). |
| Queried with SQL. | Used through the DataFrame API; register it as a temporary view to query it with SQL. |
| Can be loaded as a DataFrame (`spark.table`). | Can be saved as a table (`saveAsTable`). |

![Spark Table vs DataFrame](./images/spark_vs_data_frame.png)

# Spark DataFrame Methods

![DataFrame Methods](./images/dataframe_methods_.png)

Spark DataFrame methods are grouped into three categories:

1. **Actions**  
   - Kick off a Spark Job execution  
   - Return results to the Spark driver  

   ![Actions & Transformations](./images/actions_and_transformations.png)

2. **Transformations**  
   - Produce a new transformed DataFrame  
   - Do **not** trigger a Spark Job immediately  

3. **Functions / Utility Methods**  
   - Other DataFrame functions not classified as Actions or Transformations  

   ![Functions & Methods](./images/methods.png)