## Big Data & Data Engineering Q&A

### 1. Why can't traditional single-node databases (e.g., MySQL/PostgreSQL) handle Big Data workloads?

Traditional single-node databases face several fundamental limitations when dealing with Big Data:

**Vertical Scaling Limits**: Single-node databases can only scale up (adding more CPU, RAM, storage to one machine), which has physical and economic ceilings. You can't infinitely upgrade a single server.

**Storage Capacity**: A single machine has finite disk space. Big Data workloads often involve petabytes of data that simply cannot fit on one node.

**Processing Bottlenecks**: All queries must be processed by one CPU/memory system. Complex analytical queries on billions of rows would take hours or days, as there's no parallelism across machines.

**I/O Constraints**: Disk read/write speeds become bottlenecks. Even with SSDs, reading terabytes of data through a single I/O channel is slow.

**No Fault Tolerance**: If the single node fails, the entire database becomes unavailable. There's no built-in redundancy.

**Concurrency Limits**: High-volume concurrent read/write operations overwhelm a single node's connection and lock management capabilities.
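
To make the I/O and parallelism points concrete, here is a rough back-of-envelope sketch. The dataset size (100 TB), the per-node read throughput (2 GB/s), and the helper name `scan_hours` are all illustrative assumptions, not measurements of any particular system.

```python
# Back-of-envelope scan time; every throughput figure here is an assumption.

def scan_hours(data_bytes: float, bytes_per_second: float, nodes: int = 1) -> float:
    """Hours needed to read data_bytes once, split evenly across nodes."""
    return data_bytes / (bytes_per_second * nodes) / 3600

TB = 10**12
single_node = scan_hours(100 * TB, 2 * 10**9)          # one machine at an assumed 2 GB/s
cluster = scan_hours(100 * TB, 2 * 10**9, nodes=100)   # 100 such machines reading in parallel

print(f"single node: {single_node:.1f} h, 100 nodes: {cluster * 60:.1f} min")
```

Under these assumptions a full scan drops from roughly 14 hours to under 10 minutes, which is the core argument for spreading data across machines. Real clusters rarely scale this cleanly, since coordination and network costs eat into the gain.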

---

### 2. What problems do distributed systems solve, and what new challenges do they introduce?

**Problems Solved:**
- **Scalability**: Horizontal scaling by adding more nodes to handle growing data and traffic
- **Fault Tolerance**: Data replication across nodes ensures availability even when machines fail
- **Performance**: Parallel processing across multiple nodes dramatically speeds up computations
- **Storage Capacity**: Combined storage across many machines can handle petabyte-scale data
- **Geographic Distribution**: Data can be placed closer to users for lower latency

**New Challenges Introduced:**
- **Network Partitions**: Nodes may become unreachable, requiring decisions about consistency vs. availability (CAP theorem)
- **Data Consistency**: Keeping data synchronized across nodes is complex; must choose between strong and eventual consistency
- **Coordination Overhead**: Distributed transactions, leader election, and consensus protocols add complexity
- **Debugging Difficulty**: Tracing issues across multiple nodes is significantly harder
- **Data Skew**: Uneven data distribution can create hotspots and load imbalance
- **Increased Latency**: Network communication between nodes adds overhead compared to local operations
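
The data skew challenge can be illustrated with a small sketch of hash partitioning. The event list, the four-node cluster, and the use of CRC32 as the partitioning hash are assumptions chosen for illustration; real systems use their own partitioners.

```python
import zlib
from collections import Counter

def partition_for(key: str, num_nodes: int) -> int:
    """Assign a key to a node with a stable hash (CRC32 is used only for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

# Hypothetical event stream: one "hot" customer produces most of the traffic.
events = ["customer_42"] * 900 + [f"customer_{i}" for i in range(100)]

# Count how many events land on each of 4 nodes.
load = Counter(partition_for(key, 4) for key in events)
hot_node = partition_for("customer_42", 4)
print(dict(load), "hot node:", hot_node)
```

Because every event for a key must land on the same node, the node that owns `customer_42` receives at least 90% of the load no matter how well the hash spreads the remaining keys.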

---

### 3. Compare OLTP and OLAP systems. What are their design goals?

| Aspect | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
|--------|--------------------------------------|-------------------------------------|
| **Purpose** | Handle day-to-day transactions | Support complex analytical queries |
| **Operations** | INSERT, UPDATE, DELETE (many small ops) | SELECT with aggregations (few large reads) |
| **Query Pattern** | Simple queries on few rows | Complex queries scanning millions of rows |
| **Data Model** | Normalized (3NF) to minimize redundancy | Dimensional (star/snowflake schema), usually denormalized for read performance |
| **Optimization** | Write-optimized, low latency | Read-optimized, high throughput |
| **Concurrency** | High (thousands of concurrent users) | Lower (analysts running batch queries) |
| **Data Freshness** | Real-time, current state | Historical, periodic updates |
| **Examples** | Banking systems, e-commerce orders | Business intelligence, reporting dashboards |

**Design Goals:**
- **OLTP**: Maximize transaction throughput, ensure ACID compliance, minimize response time for individual operations
- **OLAP**: Maximize query performance for aggregations, support ad-hoc analysis, handle large data volumes efficiently
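
The contrast in access patterns can be sketched with an in-memory SQLite database. The `orders` table and its columns are hypothetical, and SQLite stands in for both kinds of system here purely to show the query shapes, not their performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount, status) VALUES (?, ?, ?)",
    [(i % 10, 10.0 + i, "placed") for i in range(1000)],
)

# OLTP-style: a short transaction that touches one row by primary key.
with conn:
    conn.execute("UPDATE orders SET status = 'shipped' WHERE order_id = ?", (17,))

# OLAP-style: a single read that scans every row and aggregates.
revenue_by_customer = conn.execute(
    "SELECT customer_id, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(revenue_by_customer[:3])
```

The first statement benefits from indexes and fast commits; the second benefits from columnar storage and parallel scans, which is why the two workloads usually end up on different engines.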

---

### 4. Describe the differences between ETL and ELT. Why is ELT more commonly used in cloud-based architectures such as AWS?

**ETL (Extract, Transform, Load):**
1. **Extract** data from source systems
2. **Transform** data in a staging/processing server (clean, aggregate, join)
3. **Load** transformed data into the destination warehouse

**ELT (Extract, Load, Transform):**
1. **Extract** data from source systems
2. **Load** raw data directly into the destination (data lake/warehouse)
3. **Transform** data within the destination system using its compute power

**Key Differences:**

| Aspect | ETL | ELT |
|--------|-----|-----|
| Transform Location | External processing server | Inside the target system |
| Data Movement | Only transformed data reaches warehouse | Raw data stored, transformed in place |
| Flexibility | Schema defined upfront | Schema-on-read, more flexible |
| Storage Cost | Lower (only processed data) | Higher (raw + processed data) |
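
A minimal sketch of the ELT ordering, assuming an in-memory SQLite database as a stand-in for a cloud warehouse. The table names `raw_orders` and `clean_orders` and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw extract: untyped strings exactly as the source produced them.
raw_rows = [("1", " alice ", "19.99"), ("2", "BOB", "5.00"), ("3", "carol", "not_a_number")]

warehouse = sqlite3.connect(":memory:")  # stand-in for the destination system

# Load: land the data untouched.
warehouse.execute("CREATE TABLE raw_orders (id TEXT, customer TEXT, amount TEXT)")
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: runs as SQL inside the destination, after loading.
warehouse.execute("""
    CREATE TABLE clean_orders AS
    SELECT CAST(id AS INTEGER)      AS id,
           LOWER(TRIM(customer))    AS customer,
           CAST(amount AS REAL)     AS amount
    FROM raw_orders
    WHERE amount GLOB '[0-9]*.[0-9]*'   -- keep only rows whose amount looks numeric
""")
clean = warehouse.execute("SELECT id, customer, amount FROM clean_orders ORDER BY id").fetchall()
print(clean)
```

The raw table still holds all three rows, so a later transformation with different cleaning rules can be rerun without going back to the source. In an ETL pipeline the malformed row would have been dropped before it ever reached the warehouse.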

**Why ELT is preferred in cloud architectures:**

1. **Elastic Compute**: Cloud warehouses (Redshift, Snowflake, BigQuery) use massively parallel processing, so transforming data inside them is often cheaper and simpler than maintaining separate ETL servers

2. **Separation of Storage and Compute**: Many cloud platforms decouple the two, making it economical to keep raw data in cheap object storage (S3) and spin up compute only when needed

3. **Scalability**: Warehouse compute can be resized or scaled on demand, so there is less dedicated ETL infrastructure to provision

4. **Data Lake Pattern**: Raw data preservation enables future use cases without re-extraction

5. **Cost Model**: Usage-based pricing (per query or per compute-second, depending on the service) makes it efficient to transform data on demand rather than maintain always-on ETL pipelines

---

### 5. Explain the MapReduce execution flow in your own words. What are the roles of the Map, Shuffle, and Reduce phases?

**MapReduce** is a programming model for processing large datasets in parallel across a distributed cluster.

**Execution Flow:**

```text
Input Data → Split → Map → Shuffle & Sort → Reduce → Output
```

**Phase Breakdown:**

**1. Map Phase**
- Input data is split into chunks distributed across worker nodes
- Each mapper processes its chunk independently and in parallel
- Mappers emit intermediate key-value pairs
- *Example*: For word count, mapper reads "hello world hello" and emits: `(hello, 1), (world, 1), (hello, 1)`

**2. Shuffle & Sort Phase**
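- The framework automatically partitions the mappers' intermediate pairs by key
- Data is transferred across the network so all values for the same key arrive at the same reducer
- Each reducer's input is sorted by key, and the values for each key are grouped into a list
- *Example*: `(hello, [1, 1])` goes to one reducer; `(world, [1])` may go to the same reducer or a different one, depending on the partitioner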

**3. Reduce Phase**
- Each reducer receives a key and all its associated values
- Reducer applies aggregation logic (sum, count, average, etc.)
- Outputs final results
- *Example*: Reducer receives `(hello, [1, 1])` and outputs `(hello, 2)`

**In essence**: Map handles parallel transformation, Shuffle handles data redistribution by key, and Reduce handles aggregation.
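
The three phases can be sketched as a single-process word count. This is a toy model of the programming pattern, not how Hadoop or any real framework is implemented; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` and the two input splits are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(chunk: str):
    """Emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in chunk.split()]

def shuffle_phase(mapped_outputs):
    """Group values by key across all mapper outputs, then sort by key."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key, values):
    """Aggregate all values for one key."""
    return key, sum(values)

splits = ["hello world hello", "world of data"]      # two hypothetical input splits
mapped = [map_phase(s) for s in splits]              # would run in parallel on a cluster
shuffled = shuffle_phase(mapped)                     # would involve network transfer
result = [reduce_phase(k, v) for k, v in shuffled]
print(result)  # [('data', 1), ('hello', 2), ('of', 1), ('world', 2)]
```

On a real cluster each call to `map_phase` and `reduce_phase` would run on a different worker, and the grouping done by `shuffle_phase` is where data actually moves between machines.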

---

### 6. Explain the difference between Data Lake and Data Warehouse.

| Aspect | Data Lake | Data Warehouse |
|--------|-----------|----------------|
| **Data Type** | Raw, unstructured, semi-structured, structured | Structured, processed, curated |
| **Schema** | Schema-on-read (define when querying) | Schema-on-write (define before loading) |
| **Storage Format** | Files (Parquet, JSON, CSV, images, logs) | Tables with strict schemas |
| **Users** | Data scientists, ML engineers | Business analysts, BI users |
| **Processing** | Flexible; batch and streaming | Primarily optimized for SQL queries |
| **Cost** | Low (cheap object storage like S3) | Higher (optimized compute + storage) |
| **Data Quality** | Variable; may contain duplicates/errors | High; cleaned and validated |
| **Use Cases** | ML training, exploration, raw data archival | Reporting, dashboards, business metrics |
| **Examples** | AWS S3 + Athena, Azure Data Lake | Snowflake, Redshift, BigQuery |

**Key Insight**: Modern architectures often use both: data lakes for raw storage and flexibility, data warehouses for curated analytics. The **Lakehouse** pattern, built on table formats such as Delta Lake and Apache Iceberg, attempts to combine the benefits of both.
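
The schema-on-read versus schema-on-write distinction from the table can be sketched as follows. The sample records and the helper names `read_with_schema` and `validate_for_warehouse` are hypothetical; real lakes and warehouses enforce schemas through their own engines.

```python
import json

# Hypothetical raw events as they might sit in a lake: shapes differ per record.
lake_files = [
    '{"user": "a", "amount": 12.5}',
    '{"user": "b", "amount": "7"}',
    '{"user": "c"}',
]

def read_with_schema(line: str) -> dict:
    """Schema-on-read: apply types and defaults only when the data is queried."""
    record = json.loads(line)
    return {"user": str(record["user"]), "amount": float(record.get("amount", 0.0))}

def validate_for_warehouse(record: dict) -> dict:
    """Schema-on-write: reject anything that does not match before it is stored."""
    if set(record) != {"user", "amount"} or not isinstance(record["amount"], (int, float)):
        raise ValueError(f"rejected: {record}")
    return record

queried = [read_with_schema(line) for line in lake_files]  # all three records are readable

accepted, rejected = [], []
for line in lake_files:
    record = json.loads(line)
    try:
        accepted.append(validate_for_warehouse(record))
    except ValueError:
        rejected.append(record)

print(len(queried), len(accepted), len(rejected))
```

The lake accepts everything and pushes interpretation to the reader, while the warehouse path stores only the one record that already matches its schema.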

---

### 7. What is the difference between Batch Processing and Streaming Processing? Give one real-world use case for each.

**Batch Processing:**
- Processes large volumes of data at scheduled intervals
- Data is collected over time, then processed together
- Higher latency (minutes to hours)
- Higher throughput for large datasets
- Simpler to implement and debug

**Streaming Processing:**
- Processes data continuously as it arrives
- Real-time or near-real-time results
- Low latency (milliseconds to seconds)
- More complex state management
- Handles unbounded data streams

| Aspect | Batch | Streaming |
|--------|-------|-----------|
| Latency | High (minutes-hours) | Low (ms-seconds) |
| Data Scope | Bounded, finite datasets | Unbounded, continuous |
| Complexity | Simpler | More complex |
| Use Case Fit | Historical analysis | Real-time reactions |

**Real-World Use Cases:**

**Batch Processing Example: Monthly Financial Reports**
- A bank collects all transactions throughout the month
- At month-end, a batch job processes millions of records to generate statements, calculate interest, and produce regulatory reports
- Results don't need to be instant; accuracy and completeness matter more

**Streaming Processing Example: Fraud Detection**
- A credit card company monitors transactions in real-time
- Each swipe is analyzed instantly against patterns (unusual location, amount, frequency)
- Suspicious transactions trigger immediate alerts or blocks
- Waiting hours (batch) would be too late—the fraud would already have occurred
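
The two use cases above can be contrasted in a few lines. The transaction amounts and the fixed threshold rule are toy assumptions standing in for a real fraud model, and the function names are hypothetical.

```python
from typing import Iterable, Iterator

def batch_total(transactions: list[float]) -> float:
    """Batch: the whole bounded dataset is available, so process it in one pass at the end."""
    return sum(transactions)

def streaming_alerts(
    transactions: Iterable[float], threshold: float
) -> Iterator[tuple[int, float]]:
    """Streaming: inspect each event as it arrives and react immediately."""
    for position, amount in enumerate(transactions):
        if amount > threshold:
            yield position, amount

amounts = [20.0, 35.5, 4999.0, 12.0]

monthly_total = batch_total(amounts)                                # computed once, after the fact
alerts = list(streaming_alerts(iter(amounts), threshold=1000.0))    # raised per event

print(monthly_total, alerts)
```

The generator never needs the full list, so it would work equally well on an unbounded source such as a message queue, whereas `batch_total` only makes sense once the input is complete.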

---

### 8. What is the purpose of Star Schema and Snowflake Schema in data warehousing? Which one is generally preferred for BI workloads, and why?

**Purpose of Dimensional Schemas:**
Both schemas organize data for analytical queries by separating **facts** (measurable events like sales, clicks) from **dimensions** (descriptive context like time, product, customer). This structure optimizes for:
- Fast aggregations and filtering
- Intuitive business modeling
- Efficient joins for common query patterns

**Star Schema:**
- Fact table at center, connected directly to denormalized dimension tables
- Dimensions are flat (no normalization)
- Fewer joins required for queries
- Some data redundancy in dimensions

```text
        [Date Dim]
             |
[Product Dim]—[FACT: Sales]—[Customer Dim]
             |
        [Store Dim]
```

**Snowflake Schema:**
- Dimensions are normalized into sub-dimensions
- Reduces data redundancy
- More tables, more joins required

```text
[Category]—[Product Dim]—[FACT: Sales]—[Customer Dim]—[City]—[Country]
```

**Comparison:**

| Aspect | Star Schema | Snowflake Schema |
|--------|-------------|------------------|
| Joins | Fewer (faster queries) | More (complex queries) |
| Storage | More redundant | More normalized |
| Query Performance | Typically faster | Typically slower |
| Maintenance | Easier | More complex |
| ETL Complexity | Simpler loads | More transformation |

**Preferred for BI Workloads: Star Schema**

**Reasons:**
1. **Query Performance**: BI tools generate SQL that benefits from fewer joins; star schema queries are faster
2. **Simplicity**: Business users and BI tools understand flat dimensions more easily
3. **Aggregation Speed**: Denormalized dimensions keep attributes such as category or region on a single table, so common GROUP BY operations need only one join per dimension
4. **Storage is Cheap**: Modern systems make redundancy less of a concern than query speed
5. **BI Tool Optimization**: Most BI tools (Tableau, Power BI, Looker) are optimized for star schema patterns
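
A minimal star schema query, sketched with in-memory SQLite. The table names (`fact_sales`, `dim_product`, `dim_date`), columns, and rows are hypothetical and chosen only to show the join shape.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO dim_date    VALUES (20240101, 2024, 1), (20240201, 2024, 2);
    INSERT INTO fact_sales  VALUES (1, 20240101, 1200.0), (2, 20240101, 300.0), (1, 20240201, 1100.0);
""")

# Star schema: category lives directly on dim_product, so one join per dimension suffices.
sales_by_category = db.execute("""
    SELECT p.category, d.month, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d    ON d.date_id = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
print(sales_by_category)
```

In a snowflake layout, `category` would move to its own table referenced from `dim_product`, and the same report would need a third join.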

## SQL

```sql
/*
## Tables

**fct_ride**
    - ride_id
    - user_id
    - vehicle_id
    - ride_type (carpool or regular)
    - start_time
    - end_time

**dim_vehicle**
    - vehicle_id
    - vehicle_type
    - model
    - capacity
*/


-- Question 1: Percentage of rides that are carpools
SELECT 
    ROUND(
        COUNT(CASE WHEN ride_type = 'carpool' THEN 1 END) * 100.0 / COUNT(*), 
        2
    ) AS carpool_percentage
FROM fct_ride;

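-- Question 2: What percentage of vehicles had more carpool rides than regular rides?
-- (The denominator is vehicles with at least one ride; vehicles absent from fct_ride are not counted)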
WITH vehicle_ride_counts AS (
    SELECT 
        vehicle_id,
        COUNT(CASE WHEN ride_type = 'carpool' THEN 1 END) AS carpool_count,
        COUNT(CASE WHEN ride_type = 'regular' THEN 1 END) AS regular_count
    FROM fct_ride
    GROUP BY vehicle_id
)
SELECT 
    ROUND(
        COUNT(CASE WHEN carpool_count > regular_count THEN 1 END) * 100.0 / COUNT(*),
        2
    ) AS percentage_vehicles_more_carpools
FROM vehicle_ride_counts;

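-- Question 3: Which vehicle had the highest total usage time?
-- (Usage time = duration between start_time and end_time)
-- PostgreSQL syntax: EXTRACT(EPOCH FROM interval) returns the duration in seconds.
-- LIMIT 1 returns a single vehicle; ties are broken arbitrarily.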
SELECT 
    fr.vehicle_id,
    dv.vehicle_type,
    dv.model,
    SUM(EXTRACT(EPOCH FROM (fr.end_time - fr.start_time))) / 3600 AS total_usage_hours
FROM fct_ride fr
JOIN dim_vehicle dv ON fr.vehicle_id = dv.vehicle_id
GROUP BY fr.vehicle_id, dv.vehicle_type, dv.model
ORDER BY total_usage_hours DESC
LIMIT 1;

-- Alternative for Question 3 in MySQL, which does not support EXTRACT(EPOCH FROM ...):
-- TIMESTAMPDIFF(SECOND, start, end) returns the difference in seconds for DATETIME/TIMESTAMP columns
SELECT 
    fr.vehicle_id,
    dv.vehicle_type,
    dv.model,
    SUM(TIMESTAMPDIFF(SECOND, fr.start_time, fr.end_time)) / 3600 AS total_usage_hours
FROM fct_ride fr
JOIN dim_vehicle dv ON fr.vehicle_id = dv.vehicle_id
GROUP BY fr.vehicle_id, dv.vehicle_type, dv.model
ORDER BY total_usage_hours DESC
LIMIT 1;
```
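
The queries can be sanity-checked against a small in-memory SQLite database. The six sample rides below are made up for the check, timestamps are stored as text, and Question 3 is adapted to SQLite's `strftime('%s', ...)` because neither `EXTRACT(EPOCH ...)` nor `TIMESTAMPDIFF` exists there. The join to `dim_vehicle` is omitted to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fct_ride (ride_id INTEGER, user_id INTEGER, vehicle_id INTEGER, "
    "ride_type TEXT, start_time TEXT, end_time TEXT)"
)
# Hypothetical rides: vehicle 100 has 2 carpools + 1 regular, 200 has 1 regular, 300 has 1 of each.
conn.executemany("INSERT INTO fct_ride VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 100, "carpool", "2024-01-01 08:00:00", "2024-01-01 09:00:00"),
    (2, 2, 100, "carpool", "2024-01-01 10:00:00", "2024-01-01 10:30:00"),
    (3, 3, 100, "regular", "2024-01-01 11:00:00", "2024-01-01 11:30:00"),
    (4, 4, 200, "regular", "2024-01-01 08:00:00", "2024-01-01 11:00:00"),
    (5, 5, 300, "carpool", "2024-01-01 08:00:00", "2024-01-01 08:15:00"),
    (6, 6, 300, "regular", "2024-01-01 09:00:00", "2024-01-01 09:15:00"),
])

# Question 1: share of rides that are carpools.
q1 = conn.execute("""
    SELECT ROUND(COUNT(CASE WHEN ride_type = 'carpool' THEN 1 END) * 100.0 / COUNT(*), 2)
    FROM fct_ride
""").fetchone()[0]

# Question 2: share of vehicles with more carpool rides than regular rides.
q2 = conn.execute("""
    WITH vehicle_ride_counts AS (
        SELECT vehicle_id,
               COUNT(CASE WHEN ride_type = 'carpool' THEN 1 END) AS carpool_count,
               COUNT(CASE WHEN ride_type = 'regular' THEN 1 END) AS regular_count
        FROM fct_ride GROUP BY vehicle_id
    )
    SELECT ROUND(COUNT(CASE WHEN carpool_count > regular_count THEN 1 END) * 100.0 / COUNT(*), 2)
    FROM vehicle_ride_counts
""").fetchone()[0]

# Question 3 (SQLite adaptation): vehicle with the highest total usage in hours.
q3 = conn.execute("""
    SELECT vehicle_id,
           SUM(strftime('%s', end_time) - strftime('%s', start_time)) / 3600.0 AS total_usage_hours
    FROM fct_ride GROUP BY vehicle_id ORDER BY total_usage_hours DESC LIMIT 1
""").fetchone()

print(q1, q2, q3)
```

With this data, half the rides are carpools, one of three vehicles has more carpools than regular rides, and vehicle 200 has the longest total usage at three hours.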