
# Mock 1 : https://www.youtube.com/watch?v=-iMhSpP77F8

##Q2. Please explain what Kinesis is and its role.
### **What is AWS Kinesis?**  
AWS Kinesis is a **fully managed, scalable, real-time data streaming service** provided by Amazon Web Services (AWS). It is designed to **collect, process, and analyze streaming data** (e.g., logs, metrics, transactions, social media feeds) in real time.  

Kinesis is often compared to **Apache Kafka** (an open-source distributed event streaming platform), but unlike Kafka, Kinesis is **fully managed by AWS**, eliminating the need to handle cluster setup, scaling, or maintenance.

---

### **Key Components of Kinesis**  
Kinesis consists of multiple services, each serving different streaming needs:  

1. **Kinesis Data Streams**  
   - **Role:** Ingests and stores real-time data records (e.g., clickstreams, IoT sensor data).  

2. **Kinesis Data Firehose**  
   - **Role:** **Automatically loads streaming data into AWS destinations** (S3, Redshift, Elasticsearch, etc.).  


##Q3. Can you elaborate on the differences between Kafka and Kinesis?


| **Feature**               | **Apache Kafka**                                   | **AWS Kinesis**                                  |
|---------------------------|---------------------------------------------------|-------------------------------------------------|
| **Managed Service**       | Self-hosted (requires setup/maintenance)          | Fully managed by AWS                            |
| **Scalability**           | Manual scaling (add brokers/partitions)           | On-demand mode scales automatically; provisioned mode requires resharding |
| **Latency**               | Ultra-low (~10ms)                                 | Higher (~200ms)                                 |
| **Durability**            | Configurable replication (user-managed)           | Built-in replication across AZs                 |
| **Pricing Model**         | Open-source (infrastructure costs apply)          | Pay per shard/hour + data volume               |
| **Data Retention**        | Configurable (days to years)                      | Configurable (24 hours by default, up to 365 days) |
| **Throughput**            | Higher (supports 100k+ msgs/sec per partition)    | Lower (writes: 1 MB/sec or 1,000 records/sec per shard; reads: 2 MB/sec per shard) |
| **Partitions/Shards**     | Partitions (user-managed)                         | Shards (AWS-managed)                            |
| **Integrations**          | Multi-cloud, on-prem, hybrid                      | AWS-native (Lambda, S3, Redshift, etc.)        |
| **Use Cases**             | High-throughput, low-latency, cross-platform apps | Real-time analytics, AWS-centric pipelines     |
| **Setup Complexity**      | High (requires tuning, monitoring)                | Low (serverless, minimal configuration)        |
| **Security**              | Custom (SSL, SASL, Kerberos)                      | AWS IAM, KMS encryption                        |
| **Consumer Model**        | Pull-based (consumers request data)               | Pull (`GetRecords` polling) or push (enhanced fan-out) |

---

### **Key Takeaways:**
1. **Choose Kafka** if you need:  
   - Cross-platform deployment (e.g., hybrid cloud).  
   - Ultra-low latency (e.g., financial trading).  
   - Full control over infrastructure.  

2. **Choose Kinesis** if you:  
   - Prefer a serverless, AWS-native solution.  
   - Need quick setup without managing clusters.  
   - Use AWS services (e.g., Lambda, Firehose).  

Which to Choose?
Need long retention (years) or flexibility? → Kafka.

Want simplicity and AWS integration? → Kinesis (but max 1 year).

##Q4. **Kafka vs. Kinesis: Producer-Consumer or Pub/Sub?**  


| **Model Aspect**       | **Apache Kafka**                                   | **AWS Kinesis**                                  |
|------------------------|---------------------------------------------------|-------------------------------------------------|
| **Primary Model**      | **Pub/Sub** (with consumer groups)                | **Producer-Consumer** (with shard-level parallelism) |
| **Data Flow**          | - Producers → **Topics** → Consumers (pull-based) | - Producers → **Streams/Shards** → Consumers (polling, or push with enhanced fan-out) |
| **Subscription Style** | - Consumers **subscribe to topics** (flexible)    | - Consumers **read from shards** (fixed mapping) |
| **Message Delivery**   | - **Pull-based** (consumers request data)         | - **Pull** (`GetRecords` polling; Lambda polls on your behalf) or **push** (enhanced fan-out) |
| **Parallelism**        | - **Partitions** allow parallel consumers         | - **Shards** enable parallel processing         |
| **Decoupling**         | - High (multiple consumer groups per topic)       | - Medium (workers in one application compete for shard leases; separate applications read independently) |

---

### **Key Differences**  
1. **Kafka = Pub/Sub with Flexibility**  
   - Producers publish to **topics**, and **consumer groups** subscribe.  
   - Multiple independent consumer groups can read the same data (e.g., one for analytics, another for alerts).  
   - Example:  
     ```python
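     # Kafka Pub/Sub Example (kafka-python client; the broker address is a placeholder)
     from kafka import KafkaConsumer, KafkaProducer
     producer = KafkaProducer(bootstrap_servers="localhost:9092")
     producer.send("orders-topic", b'{"order_id": 1}')  # Publisher (values are bytes)
     consumer = KafkaConsumer(bootstrap_servers="localhost:9092", group_id="analytics")
     consumer.subscribe(["orders-topic"])               # Subscriber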
     ```

2. **Kinesis = Producer-Consumer with Scalability**  
   - Producers write to **shards**, and consumers **lease shards** for processing.  
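   - Within one consumer application, each shard is leased to **one worker at a time**. Separate applications can still read the same shard, and enhanced fan-out gives each its own read throughput.  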
   - Example:  
     ```python
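     # Kinesis Producer-Consumer Example (boto3; the PartitionKey value is illustrative)
     import boto3
     kinesis = boto3.client("kinesis")
     kinesis.put_record(StreamName="logs-stream", Data=b"log line", PartitionKey="host-1")  # Producer (PartitionKey is required)
     shard_reader = kinesis.get_shard_iterator(...)               # Consumer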
     ```



##Q5. What is the difference between producer-consumer and publisher-subscriber?

### **Producer-Consumer vs. Publisher-Subscriber**  

These are two fundamental messaging patterns in distributed systems, differing in **coupling**, **scalability**, and **message delivery** mechanics.

---

## **1. Producer-Consumer Model**  
**Definition:**  
- A **queue-based, point-to-point** messaging pattern (asynchronous: the queue buffers messages until a consumer is ready).  
- **Producers** send messages to a **queue**, and **consumers** pull messages from it.  
- Each message is processed by **exactly one consumer**.  

**Key Traits:**  
✅ **Point-to-Point (1:1)** – Only one consumer gets each message.  
✅ **Tight Coupling** – Producers/consumers must know the queue.  
✅ **Pull-Based** – Consumers request messages when ready.  
✅ **Ordering** – Often FIFO, but broker-dependent (e.g., SQS FIFO queues guarantee order; standard queues are best-effort).  

**Example:**  
- A **task queue** (e.g., RabbitMQ, SQS).  
- Workers processing orders from an e-commerce system.  

**Diagram:**  
```
Producer → [ Queue ] → Consumer
```

---

## **2. Publisher-Subscriber (Pub/Sub) Model**  
**Definition:**  
- An **asynchronous, broadcast** messaging pattern.  
- **Publishers** send messages to a **topic**, and **subscribers** receive them.  
- Each message is processed by **all interested subscribers**.  

**Key Traits:**  
✅ **Broadcast (1:N)** – Multiple subscribers receive the same message.  
✅ **Loose Coupling** – Publishers/subscribers only know the topic, not each other.  
✅ **Push-Based** – Messages are pushed to subscribers (or pulled via polling).  
✅ **Dynamic Scaling** – Subscribers can join/leave anytime.  

**Example:**  
- **Stock market feeds** (multiple apps listen to price updates).  
- **Event-driven microservices** (e.g., Kafka, AWS SNS).  

**Diagram:**  
```
Publisher → [ Topic ] → Subscriber 1  
                     → Subscriber 2  
                     → Subscriber 3
```

---

## **Key Differences Summary**  
| **Aspect**          | **Producer-Consumer**                | **Publisher-Subscriber**             |
|----------------------|--------------------------------------|--------------------------------------|
| **Messaging Style**  | Point-to-point (1:1)                 | Broadcast (1:N)                      |
| **Coupling**         | Tight (queue-aware)                  | Loose (topic-based)                  |
| **Delivery**         | Pull-based (consumer-driven)         | Push-based (or polled)               |
| **Scalability**      | Limited (competing consumers)        | High (independent subscribers)       |
| **Use Cases**        | Task queues, ordered processing      | Real-time notifications, event buses |
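
The contrast in the table can be sketched with plain Python. This is a minimal in-memory illustration, not a real broker: the `Topic` class and the worker and subscriber names are hypothetical and exist only for this example.

```python
import queue


class Topic:
    """Toy pub/sub topic: every subscriber gets its own copy of each message."""

    def __init__(self):
        self.subscribers = {}  # subscriber name -> list acting as an inbox

    def subscribe(self, name):
        self.subscribers[name] = []
        return self.subscribers[name]

    def publish(self, message):
        for inbox in self.subscribers.values():
            inbox.append(message)


# Producer-consumer: two workers compete for messages on one queue.
task_queue = queue.Queue()
for order_id in range(4):
    task_queue.put(order_id)

worker_a, worker_b = [], []
while not task_queue.empty():
    worker_a.append(task_queue.get())
    if not task_queue.empty():
        worker_b.append(task_queue.get())

# Pub/sub: two subscribers each see every message.
topic = Topic()
analytics = topic.subscribe("analytics")
alerts = topic.subscribe("alerts")
for price in (101, 102, 103):
    topic.publish(price)

print(worker_a, worker_b)  # each order is handled by exactly one worker
print(analytics, alerts)   # each price is delivered to both subscribers
```

Each order ends up in only one worker's list, while both subscriber inboxes hold the full price stream.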

---

### **Which to Choose?**  
- **Use Producer-Consumer** when:  
  - Each message must be handled by **exactly one worker** (e.g., order fulfillment).  
  - Order matters (e.g., FIFO queues).  

- **Use Pub/Sub** when:  
  - Multiple systems need **the same data** (e.g., analytics + alerts).  
  - You want **decoupled, event-driven architectures**.  

**Hybrid Systems (Kafka/Kinesis):**  
- Both models can coexist (e.g., Kafka uses **consumer groups** for Pub/Sub-like behavior while maintaining ordering per partition).  
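
How the two models coexist can be sketched with a toy partitioned log. This is only an illustration under simplifying assumptions: `PartitionedTopic`, its byte-sum partitioner, and the round-robin assignment are hypothetical stand-ins, not how a real Kafka client assigns partitions.

```python
from collections import defaultdict


class PartitionedTopic:
    """Toy model of a partitioned log with consumer groups (not a real client)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Stand-in partitioner; summing the key's bytes keeps the demo deterministic.
        index = sum(key.encode()) % len(self.partitions)
        self.partitions[index].append((key, value))

    def read(self, group_members):
        """Assign partitions round-robin to members; return what each one reads."""
        received = defaultdict(list)
        for index, partition in enumerate(self.partitions):
            member = group_members[index % len(group_members)]
            received[member].extend(partition)
        return dict(received)


topic = PartitionedTopic(num_partitions=2)
for i, user in enumerate(["u1", "u2", "u1", "u2"]):
    topic.publish(user, f"event-{i}")

# Within a group, members split the partitions (producer-consumer behavior).
analytics = topic.read(["analytics-1", "analytics-2"])
# Across groups, every group sees every message (pub/sub behavior).
alerts = topic.read(["alerts-1"])

print(analytics)
print(alerts)
```

The `analytics` group divides the four events between its two members, while the single-member `alerts` group independently receives all four.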


##Q6. Why are Kafka or Kinesis used instead of simple messaging queues?

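Simple message queues typically remove a message once a consumer acknowledges it. Kafka and Kinesis instead retain records as an ordered log for a configurable period, so multiple independent consumers can read and replay the same data. Their ordering guarantees are summarized below.

### **Message Ordering in Kafka vs. Kinesis**  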

#### **✅ Kafka**  
- **Ordered per partition**: Messages in the same partition are strictly ordered (FIFO).  
- **No global order**: Across partitions, order is **not** guaranteed unless using a single partition (bottleneck).  
- **Key-based routing**: Same key → same partition → preserves order for related messages.  
- **Use case**: E.g., ensuring all events for a user ID are processed in sequence.  

#### **✅ Kinesis**  
- **Ordered per shard**: Messages in the same shard are strictly ordered.  
- **No cross-shard order**: Like Kafka, order is lost across shards.  
- **Partition key routing**: Same key → same shard → ordered sequence.  
- **Use case**: E.g., processing financial transactions in exact arrival order per account.  

#### **⚠️ Key Limitation**  
- Neither can guarantee **global order** (across partitions/shards) without sacrificing scalability.  


##Q7. Can messages be ordered in Kafka or Kinesis?

### **Message Ordering in Kafka vs. Kinesis**  

**Short Answer:**  
Yes, but **only within a partition (Kafka) or shard (Kinesis)**. Global ordering across all partitions/shards is **not guaranteed**.  

#### **Kafka**  
- **Order Guarantee**: Strictly ordered **per partition**.  
  - Example: If messages M1, M2, M3 are sent to **Partition 1**, they’ll be consumed in order.  
  - Cross-partition order is **not guaranteed** (e.g., M1 in Partition 1 and M4 in Partition 2 may arrive out of order).  
- **How to Preserve Order**:  
  - Use a **message key** (e.g., `user_id`) to ensure related messages go to the same partition.  

#### **Kinesis**  
- **Order Guarantee**: Strictly ordered **per shard**.  
  - Example: Messages in **Shard A** are ordered, but Shard A vs. Shard B may deliver data out of sequence.  
- **How to Preserve Order**:  
  - Use a **partition key** (e.g., `device_id`) to route related messages to the same shard.  
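
Partition-key routing can be sketched as follows. Kinesis maps a partition key to a 128-bit integer with MD5 and picks the shard whose hash-key range contains it. The even split of the hash space and the `shard_for` helper are assumptions made for this example; real streams can have uneven ranges after resharding.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer


def shard_for(partition_key, num_shards):
    """Return the index of the shard whose hash-key range contains the key.

    Assumes the hash space is split evenly across shards (a simplification).
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * num_shards // HASH_SPACE


shards = {i: [] for i in range(4)}
events = [("device-1", "on"), ("device-2", "on"), ("device-1", "off"), ("device-2", "off")]
for device_id, state in events:
    shards[shard_for(device_id, 4)].append((device_id, state))

# Every record for a given key lands on one shard, in arrival order.
for device_id in ("device-1", "device-2"):
    shard = shards[shard_for(device_id, 4)]
    print(device_id, [state for key, state in shard if key == device_id])
```

Because the hash is deterministic, a given `device_id` always maps to the same shard, which is what preserves its per-key order.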

---

### **When Order Matters**  
- **Use Cases**:  
  - **Kafka**: Financial transactions (e.g., debit before credit).  
  - **Kinesis**: Clickstream sequences (e.g., page A → page B).  
- **Trade-off**:  
  - Ordering reduces parallelism (since keys must map to the same partition/shard).  

---

### **Interview Tip**  
*"Kafka and Kinesis guarantee ordering only within a partition/shard. For global ordering, you’d need a single partition (which limits throughput) or external sequencing (like a ledger)."*  



##Q8. **Priority Queue: Definition & Key Concepts**  
A **priority queue** is a specialized data structure where elements are processed based on **priority** (not just insertion order).  

#### **Key Characteristics (Bullet Points)**  
- **Not FIFO**: Unlike regular queues, the **highest-priority** element is removed first.  
- **Priority Assignment**: Each element has an associated priority (e.g., numerical value, urgency level).  
- **Operations**:  
  - **Insert**: Add an element with a priority.  
  - **Extract**: Remove the highest/lowest priority element.  
- **Implementation**: Typically uses a **heap** (binary heap) for efficient O(log n) inserts/extracts.  

#### **Real-World Examples**  
- **Hospital ER**: Critical patients (high priority) are treated before mild cases.  
- **CPU Scheduling**: High-priority tasks (e.g., system processes) run before background apps.  
- **Ride-Sharing**: Premium users get faster driver allocation.  


#### **Example in Python**  
```python
import heapq

pq = []
heapq.heappush(pq, (2, "Task A"))  # (priority, data)
heapq.heappush(pq, (1, "Task B"))  # Lower number = higher priority
print(heapq.heappop(pq)[1])         # Output: "Task B" (higher priority)
```

#### **Interview Tip**  
- Mention trade-offs: **Heaps** (O(log n)) vs. **sorted lists** (O(n) insert).  
- Use cases where order matters beyond FIFO (e.g., real-time systems).  
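
One detail worth knowing is how to handle ties. A minimal sketch, assuming a hypothetical `StablePriorityQueue` wrapper: it adds an insertion counter to each heap entry so that items with equal priority come out in the order they arrived, using the hospital ER example above.

```python
import heapq
import itertools


class StablePriorityQueue:
    """Min-heap priority queue; equal priorities come out in insertion order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def push(self, priority, item):
        heapq.heappush(self._heap, (priority, next(self._counter), item))

    def pop(self):
        priority, _, item = heapq.heappop(self._heap)
        return item

    def __len__(self):
        return len(self._heap)


er_queue = StablePriorityQueue()
er_queue.push(3, "sprained ankle")
er_queue.push(1, "cardiac arrest")
er_queue.push(3, "mild fever")
er_queue.push(2, "broken arm")

treated = [er_queue.pop() for _ in range(len(er_queue))]
print(treated)
# ['cardiac arrest', 'broken arm', 'sprained ankle', 'mild fever']
```

The counter also keeps the heap from ever comparing the items themselves, which matters when the payloads are not orderable.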



##Q9. **Process vs. Thread (Short Interview Answer)**  

- **Process**:  
  - Independent program instance with **separate memory** (isolated).  
  - Heavyweight (slower to create/switch).  
  - Crash in one process **doesn’t affect others**.  

- **Thread**:  
  - Lightweight **unit of execution within a process**, shares memory/resources with its sibling threads.  
  - Faster to create/switch (less overhead).  
  - Crash in one thread **can crash the entire process**.  

**Key Difference**:  
- Processes = **Isolated execution** (e.g., separate browser tabs).  
- Threads = **Parallel tasks within a process** (e.g., rendering + downloads in one tab).  
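
The memory-sharing difference can be observed directly. The sketch below is one possible demonstration: four threads append to the same list inside one process, and a child process started with `subprocess` reports a different PID. The names `worker` and `shared_results` are illustrative.

```python
import os
import subprocess
import sys
import threading

shared_results = []  # lives in this process's memory
lock = threading.Lock()


def worker(task_id):
    # Threads run in the same process, so they see the same list and PID.
    with lock:
        shared_results.append((task_id, os.getpid()))


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# A child process gets its own interpreter, memory, and PID.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.getpid())"],
    capture_output=True, text=True, check=True,
)
child_pid = int(child.stdout.strip())

print(len(shared_results), {pid for _, pid in shared_results} == {os.getpid()})
print(child_pid != os.getpid())
```

The lock is not strictly needed for a single `append`, but it shows the usual cost of shared memory: access has to be coordinated.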


##Q10. **CAP Theorem: Short & Clear Explanation**  

**CAP Theorem** states that in a distributed system, you can only guarantee **two out of three** properties at the same time:  

1. **Consistency (C)**  
   - All nodes see the **same data at the same time** (e.g., a bank balance is accurate across all servers).  

2. **Availability (A)**  
   - Every request gets a **response (even if stale)**, with no system downtime.  

3. **Partition Tolerance (P)**  
   - The system keeps working **even if nodes disconnect** (e.g., network failure).  

#### **Trade-offs (Pick 2/3)**  
- **CA** (e.g., SQL databases): Sacrifice partition tolerance (single-server systems).  
- **CP** (e.g., MongoDB, Kafka): Sacrifice availability (e.g., reject requests if nodes can’t sync).  
- **AP** (e.g., Cassandra, DynamoDB): Sacrifice consistency (allow stale reads during partitions).  

### **Real-World Examples of CAP Theorem Trade-offs**  

#### **1. CP (Consistency + Partition Tolerance)**  
**Example: Banks & Financial Systems**  
- **Why?** Transactions **must** be consistent (e.g., $100 withdrawn can’t show $90 in one server and $100 in another).  
- **Sacrifice:** Availability (e.g., during a network partition, the system may reject transactions to avoid inconsistencies).  
- **Tech:** PostgreSQL with synchronous replication, Kafka (for transactions), ZooKeeper.  

#### **2. AP (Availability + Partition Tolerance)**  
**Example: Social Media (Facebook, Twitter)**  
- **Why?** Better to show **stale data** (e.g., delayed like counts) than fail to load the page.  
- **Sacrifice:** Consistency (temporary mismatches are acceptable).  
- **Tech:** Cassandra, DynamoDB, Redis (eventual consistency).  

#### **3. CA (Consistency + Availability)**  
**Example: Single-Node Databases (Rare in Distributed Systems)**  
- **Why?** No partitions (single server), so it’s always consistent and available.  
- **Sacrifice:** Partition tolerance (fails if the network splits).  
- **Tech:** SQLite, standalone MySQL (non-replicated).  

### **Key Insight**  
- Most distributed systems choose **CP or AP** (since partitions are inevitable).  
- **CA systems** are rare in modern cloud architectures (they’re not fault-tolerant).  
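
The CP versus AP choice can be made concrete with a toy two-replica store. This is a deliberately simplified sketch: `TwoNodeStore`, its `mode` flag, and the `partitioned` switch are hypothetical and do not model any real database's replication protocol.

```python
class Replica:
    def __init__(self):
        self.data = {}


class TwoNodeStore:
    """Toy two-replica store; `mode` is "CP" or "AP" (illustrative only)."""

    def __init__(self, mode):
        self.mode = mode
        self.nodes = [Replica(), Replica()]
        self.partitioned = False

    def write(self, node_index, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write rather than let the replicas diverge.
                raise RuntimeError("unavailable: cannot reach the other replica")
            # AP: accept locally; the other replica is now stale.
            self.nodes[node_index].data[key] = value
            return
        for node in self.nodes:
            node.data[key] = value

    def read(self, node_index, key):
        return self.nodes[node_index].data.get(key)


cp, ap = TwoNodeStore("CP"), TwoNodeStore("AP")
for store in (cp, ap):
    store.write(0, "balance", 100)
    store.partitioned = True  # simulate a network split

try:
    cp.write(0, "balance", 90)
    cp_accepted = True
except RuntimeError:
    cp_accepted = False

ap.write(0, "balance", 90)
print(cp_accepted, cp.read(0, "balance"), cp.read(1, "balance"))  # False 100 100
print(ap.read(0, "balance"), ap.read(1, "balance"))               # 90 100
```

During the partition the CP store stays consistent by rejecting the write, while the AP store stays available and lets its replicas disagree until they can sync again.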




##Q11. SQL Query to Find Percentage of Users Who Bought AirPods After iPhones

To solve this problem, we need to:
1. Identify users who bought an iPhone
2. Check if their next purchase was AirPods
3. Calculate the percentage of such users out of all iPhone buyers

Here's the SQL solution:

```sql
WITH UserPurchaseSequence AS (
    SELECT
        customer_id,
        product,
        transaction_timestamp,
        LEAD(product) OVER (PARTITION BY customer_id ORDER BY transaction_timestamp) AS next_product
    FROM transactions
),

iPhoneBuyers AS (
    SELECT DISTINCT customer_id
    FROM transactions
    WHERE product = 'iPhone'
),

iPhoneToAirPodsBuyers AS (
    SELECT DISTINCT customer_id
    FROM UserPurchaseSequence
    WHERE product = 'iPhone' AND next_product = 'AirPods'
)

SELECT
    COUNT(DISTINCT iab.customer_id) * 100.0 /
    NULLIF(COUNT(DISTINCT ib.customer_id), 0) AS percentage
FROM iPhoneBuyers ib
LEFT JOIN iPhoneToAirPodsBuyers iab ON ib.customer_id = iab.customer_id;
```

## Explanation:

1. **UserPurchaseSequence CTE**:
   - Uses the `LEAD()` window function to see what each customer bought next
   - Partitions by customer and orders by timestamp to get chronological sequence

2. **iPhoneBuyers CTE**:
   - Identifies all unique customers who bought iPhones (denominator)

3. **iPhoneToAirPodsBuyers CTE**:
   - Finds customers whose immediate next purchase after iPhone was AirPods (numerator)

4. **Final Calculation**:
   - Divides the count of customers who bought AirPods after iPhones by total iPhone buyers
   - Multiplies by 100 to get percentage
   - Uses NULLIF to avoid division by zero
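
One way to sanity-check this logic is to run it against a small in-memory SQLite table from Python. The rows below are made-up sample data, and the sketch assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (customer_id INTEGER, product TEXT, transaction_timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, "iPhone", "2024-01-01"), (1, "AirPods", "2024-01-05"),
        (2, "iPhone", "2024-01-02"), (2, "MacBook", "2024-01-03"), (2, "AirPods", "2024-01-09"),
        (3, "AirPods", "2024-01-01"), (3, "iPhone", "2024-01-04"),
        (4, "MacBook", "2024-01-06"),
        (5, "iPhone", "2024-01-07"), (5, "AirPods", "2024-01-08"),
    ],
)

query = """
WITH UserPurchaseSequence AS (
    SELECT customer_id, product,
           LEAD(product) OVER (
               PARTITION BY customer_id ORDER BY transaction_timestamp
           ) AS next_product
    FROM transactions
),
iPhoneBuyers AS (
    SELECT DISTINCT customer_id FROM transactions WHERE product = 'iPhone'
),
iPhoneToAirPodsBuyers AS (
    SELECT DISTINCT customer_id
    FROM UserPurchaseSequence
    WHERE product = 'iPhone' AND next_product = 'AirPods'
)
SELECT COUNT(DISTINCT iab.customer_id) * 100.0
       / NULLIF(COUNT(DISTINCT ib.customer_id), 0) AS percentage
FROM iPhoneBuyers ib
LEFT JOIN iPhoneToAirPodsBuyers iab ON ib.customer_id = iab.customer_id
"""
percentage = conn.execute(query).fetchone()[0]
print(percentage)  # customers 1 and 5 qualify out of iPhone buyers 1, 2, 3, 5 -> 50.0
```

Customer 2 bought a MacBook between the iPhone and the AirPods, and customer 3 bought AirPods first, so neither counts toward the numerator.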


##Q12. Will the code work if you try to delete dictionary items during iteration?

### **Deleting Dictionary Items During Iteration in Python**

**Short Answer:**  
**No**, the code will **not** work correctly if you try to delete dictionary items during iteration. It raises a **`RuntimeError: dictionary changed size during iteration`**.

---

### **Why It Fails**
Python's dictionary iterator records the dictionary's size when iteration starts. If a key is added or removed mid-loop, the iterator detects the size change on its next step and raises the error.

#### **Example of the Problem:**
```python
my_dict = {'a': 1, 'b': 2, 'c': 3}

for key in my_dict:
    if my_dict[key] % 2 == 0:
        del my_dict[key]  # 🚨 RuntimeError
```

---

### **How to Fix It**
#### **Option 1: Iterate Over a Copy of Keys**
```python
for key in list(my_dict.keys()):  # Explicit copy
    if my_dict[key] % 2 == 0:
        del my_dict[key]  # Safe
```

#### **Option 2: Store Keys to Delete First**
```python
keys_to_delete = [key for key, value in my_dict.items() if value % 2 == 0]
for key in keys_to_delete:
    del my_dict[key]
```

#### **Option 3: Use Dictionary Comprehension (Creates New Dict)**
```python
my_dict = {k: v for k, v in my_dict.items() if v % 2 != 0}
```

---

### **Key Takeaways**
- **Never modify a dict while iterating directly** over it.  
- **Safe methods**:  
  - Iterate over a **copy** of keys (`list(my_dict.keys())`).  
  - Record keys to delete first, then delete them.  
  - Use **dictionary comprehension** for filtering.  



Running Option 1 on `my_dict = {'a': 1, 'b': 2, 'c': 3}` and printing the result gives `{'a': 1, 'c': 3}`: the even-valued key `'b'` is removed without an error.


##Q13. What is the difference between Pandas and PySpark?
### **Pandas vs. PySpark: Key Differences**

| **Feature**               | **Pandas**                                  | **PySpark**                                  |
|---------------------------|---------------------------------------------|---------------------------------------------|
| **Execution Environment** | Single-machine (RAM-limited)                | Distributed (cluster of machines)           |
| **Scalability**           | Handles data that fits in memory (~GBs)     | Handles TBs+ via parallel processing       |
| **Lazy Evaluation**       | No (immediate execution)                    | Yes (optimizes execution plan)              |
| **Fault Tolerance**       | No (crash = data loss)                      | Yes (via RDD lineage recovery)             |
| **Syntax**                | Pythonic (similar to NumPy)                 | SQL-like (with DataFrame API)               |
| **Use Cases**             | EDA, small datasets, single-node workflows  | Big data pipelines, ETL, cloud processing  |
| **Performance**           | Faster for small data (low overhead)        | Slower for tiny data (cluster setup cost)  |
| **Integration**           | Works with Python libraries (Matplotlib, etc.) | Integrates with Hadoop/Hive/S3/etc.       |

---

### **When to Use Which?**
- **Choose Pandas when:**
  - Data fits in memory (e.g., <10GB).
  - You need quick prototyping or EDA.
  - Your workflow uses Python ML libraries (scikit-learn, TensorFlow).

- **Choose PySpark when:**
  - Data exceeds memory (e.g., TBs of logs).
  - You need distributed processing (e.g., ETL pipelines).
  - Working in cloud environments (AWS EMR, Databricks).

---

### **Code Comparison**
#### **Pandas (Single-Node)**
```python
import pandas as pd
df = pd.read_csv("data.csv")
df_filtered = df[df["value"] > 100]  # In-memory operation
```

#### **PySpark (Distributed)**
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("s3://bucket/data.csv", header=True, inferSchema=True)  # header row supplies the "value" column name
df_filtered = df.filter(df["value"] > 100)  # Lazy evaluation
```

---

### **Key Technical Differences**
1. **Memory vs. Disk**:  
   - Pandas operates in memory.  
   - PySpark spills to disk when needed.  

2. **Parallelism**:  
   - Pandas: Single-threaded (unless using `modin.pandas`).  
   - PySpark: Automatically partitions data across nodes.  
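
The lazy-evaluation row in the comparison table can be illustrated without a cluster. The following is a plain-Python analogy using generators, not PySpark itself: the `read_rows` function and the `log` list are hypothetical and only show when work actually happens.

```python
log = []


def read_rows():
    # Records each read so we can see when the work actually happens.
    for value in (50, 150, 250):
        log.append(f"read {value}")
        yield value


# Eager (pandas-like): each step runs immediately and materializes a result.
eager_rows = list(read_rows())
eager_filtered = [v for v in eager_rows if v > 100]
reads_after_eager = len(log)  # all 3 rows have already been read

# Lazy (Spark-like): building the pipeline does no work...
log.clear()
lazy_filtered = (v for v in read_rows() if v > 100)
reads_before_action = len(log)  # 0: nothing has run yet

# ...until an "action" asks for results.
result = list(lazy_filtered)
reads_after_action = len(log)  # 3

print(reads_after_eager, reads_before_action, reads_after_action)
print(eager_filtered, result)
```

In Spark the deferred plan is also optimized before it runs, which this analogy does not capture.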

---


##Q14. **Step-by-Step Approach to Migrate from Pandas to PySpark**  

#### **1. Assess the Current Pipeline**  
- **Identify bottlenecks**: Check RAM/CPU usage, slow transformations, and I/O operations.  
- **Document dependencies**: Note Pandas functions used (e.g., `.apply()`, `.groupby()`).  
- **Data volume**: Confirm if data size justifies PySpark (e.g., >10GB or growing).  

#### **2. Set Up PySpark Environment**  
- **Cluster configuration**: Choose # of nodes/cores based on data size (e.g., AWS EMR, Databricks).  
- **Dependencies**: Install `pyspark` and ensure compatibility with Python libraries (e.g., `numpy`, `scikit-learn`).  

#### **3. Rewrite Code Logic**  
- **Replace Pandas functions with PySpark equivalents**:  

  | **Pandas**               | **PySpark**                          |  
  |--------------------------|--------------------------------------|  
  | `df[df.col > 100]`       | `df.filter(df.col > 100)`            |  
  | `df.groupby().agg()`     | `df.groupBy().agg()` (`groupby` also works as an alias) |  
  | `df.apply(func)`         | `df.withColumn("new", udf(func)(df.col))` |  
  | `pd.merge()`             | `df1.join(df2, on="key")`            |  

- **Handle lazy evaluation**: Call `.cache()` for reused DataFrames and `.collect()` only when needed.  

#### **4. Optimize Performance**  
- **Partitioning**: Repartition data to avoid skew (e.g., `df.repartition(100)`).  
- **Broadcast small DataFrames**: Use `broadcast()` for joins with small tables.  
- **Avoid UDFs**: Prefer built-in PySpark functions (e.g., `F.expr()` over Python UDFs).  

#### **5. Test and Validate**  
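- **Unit tests**: Verify logic matches Pandas output (e.g., compare `df_spark.toPandas()` with the Pandas result using `pandas.testing.assert_frame_equal` after sorting rows, since Spark does not guarantee row order).  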
- **Scale testing**: Run on sample data → full dataset to catch memory/shuffle issues.  
- **Data integrity checks**: Compare row counts, NULL values, and key metrics.  

#### **6. Deploy and Monitor**  
- **Orchestration**: Schedule jobs via Airflow/Luigi (for batch) or use Spark Structured Streaming (for real-time).  
- **Monitoring**: Track Spark UI for slow tasks, skew, or spills to disk.  

---

### **Example Migration**  
**Pandas**:  
```python  
import pandas as pd  
df = pd.read_csv("data.csv")  
df["discount"] = df["price"].apply(lambda x: x * 0.9)  
```  

**PySpark**:  
```python  
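from pyspark.sql import SparkSession, functions as F  
spark = SparkSession.builder.getOrCreate()  
df = spark.read.csv("data.csv", header=True, inferSchema=True)  # header row supplies the "price" column  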
df = df.withColumn("discount", F.col("price") * 0.9)  # Avoid UDF for speed  
```  

---