# 🧠 PySpark Interview Preparation Guide

---

## ⚙️ Spark Architecture & Core Concepts
- Explain the architecture of a Spark application
- What is lazy evaluation? What are actions and transformations?
- What are lineage and the DAG? What is the difference between them?
- Explain the DAG in detail
- What is the job of the Catalyst Optimizer?
- What is checkpointing, and why is it needed? Show me how we can do it
- What are memory overflow errors? When do they occur? How do you handle them?
- What are dynamic and static resource allocation in Spark? How do we choose between them?

---

## 🧮 Data Structures: RDDs, DataFrames, Datasets
- What are the differences between RDDs, DataFrames, and Datasets? Have you used any of them? If so, for what purposes?
- What are some differences between Pandas and PySpark?

---

## 🔄 Transformations & Actions
- Explain the differences between narrow and wide transformations
- Which transformations have you worked with?
- How can we do custom transformations in PySpark?

---

## 🧊 Caching & Persistence
- Explain the differences between cache() and persist() in detail
- How does caching work by default in Spark? When should you use it?

---

## 🧹 Data Cleaning & Handling
- How to handle duplicate values and null values? At the time of loading and after loading? For rows and columns? How do you handle missing data?  
- What is schema inference? How to define a schema manually?  
- How do you handle schema evolution in Spark?

---

## 📂 File Formats & Storage
- Which file formats do you know (e.g., CSV, Parquet)? Which compression algorithm does Parquet typically use? (Snappy is Spark's default codec for Parquet.)
- What is the problem with having many small files in PySpark? How does it affect performance, and why?

---

## 🧱 Partitioning & Bucketing
- What is the difference between partitioning and bucketing? Which one is better and why? Is there any performance difference?
- What is the difference between coalesce() and repartition()?

---

## 🔍 Joins & Broadcasts
- What are different joins that can be performed in PySpark?  
- What is broadcast join? When to use it? How does it work?  
- What kind of joins are supported and not supported in Structured Streaming and why?

---

## 📊 Window Functions & Aggregations
- What are window functions?  
- What are different types of windows in Structured Streaming? When to use them? Give use cases

---

## 🧨 Data Skew & Spilling
- What is data spilling in PySpark and how to handle this?  
- What is data skewness? How can we handle data skewness? How can salting help here?

---

## 🚀 Performance Tuning
- How would you tune the performance of a Spark job?
- What are the challenges encountered with large data sets? How can we overcome them?

---

## 🧾 Reading & Writing Data
- How to read and write data in Spark? Show me different ways: list, CSV, Parquet  
- How do you read streaming data from Kafka and S3?

---

## 🔄 Structured Streaming
- What is Structured Streaming? How does it work internally?  
- What are different types of output modes and when to use them?  
- How to handle late events in PySpark?  
- How is Structured Streaming fault tolerant in PySpark?  
- Explain the steps involved in Structured Streaming. Consider an example like reading from Kafka topics and writing to Kafka topics. How can we achieve it? Can you write the steps involved?

---

## 📡 Stream Processing & Alternatives
- Why do you need stream processing? Do you know any competing frameworks? Which one do you think is better, and why?

---

## 🧠 Broadcast Variables & Shared State
- What are broadcast variables?

---

## 🧨 Error Handling
- How can we handle errors and exceptions in PySpark?

---

## 🖥️ Deployment & Cluster Setup
- How many ways can we deploy Spark clusters?

---

## 🛠️ Troubleshooting & Optimization Scenarios
- Your Spark job is running out of memory with OutOfMemoryError. What steps would you take?
- If we have 100GB of data (say, 2 files of 50GB each) and 50GB of memory in the cluster, can we process the data? If so, how? Explain the mechanisms.
- You have a 2TB dataset and the Spark job is failing with OOM. How do you fix it?
- You have a slow-running PySpark job. How can you optimize it?
- A streaming job is lagging behind real-time data. What could be the reasons?

---

## 🔄 Interoperability with Pandas
- How to convert Pandas to a DataFrame in PySpark and vice versa?

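- Lazy evaluation means that transformations in PySpark are not executed immediately; Spark only records them in a logical execution plan.
- An action (e.g., count(), collect(), show()) triggers execution. Because Spark sees the whole plan before running it, it can optimize, for example by pipelining narrow transformations and minimizing data shuffling.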
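
The record-then-execute behavior described above can be sketched in plain Python. This is a toy model, not the Spark API: `LazyDataset`, `double`, and `calls` are hypothetical names introduced only for illustration, and real Spark additionally optimizes the plan and distributes the work.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (NOT the Spark API)."""

    def __init__(self, source, plan=()):
        self._source = source  # the raw input rows
        self._plan = plan      # recorded transformations, in order

    def map(self, fn):
        # Transformation: only record the step and return a new dataset.
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, pred):
        # Transformation: again, nothing is computed here.
        return LazyDataset(self._source, self._plan + (("filter", pred),))

    def collect(self):
        # Action: run the recorded plan, pushing each row through all
        # steps in one pass (a simple form of pipelining).
        out = []
        for row in self._source:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                out.append(row)
        return out


calls = []  # records every row the map function actually touches


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset(range(5)).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # nothing has run yet
result = ds.collect()             # the action triggers execution
```

Building `ds` never calls `double`; only `collect()` does, which mirrors how a chain of PySpark transformations does no work until an action runs.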

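- Caching stores intermediate results in memory (or on disk) so that later actions reuse them instead of recomputing them.
- Use df.cache() or df.persist() to cache a DataFrame. Both are lazy: the data is materialized by the first action that runs afterwards.
- Caching improves performance when the same data is accessed multiple times.
- df.persist() accepts an explicit storage level:
    - df.persist(StorageLevel.MEMORY_AND_DISK), df.persist(StorageLevel.MEMORY_ONLY), df.persist(StorageLevel.DISK_ONLY)
- For DataFrames, df.cache() is shorthand for df.persist() with the default storage level, which is memory-and-disk: partitions that do not fit in memory spill to disk. df.persist() with no argument uses the same default.
- For RDDs, rdd.cache() defaults to MEMORY_ONLY.
- Call df.unpersist() once the cached data is no longer needed, to release the storage.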
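
Why caching helps when data is reused can be shown with a small plain-Python model. This is a sketch under stated assumptions, not the Spark API: `ToyDataset`, `expensive`, and `compute_calls` are hypothetical names, and storage levels are not modeled at all. It only counts how often the underlying computation is redone.

```python
class ToyDataset:
    """Toy model of recomputation vs. caching (NOT the Spark API)."""

    def __init__(self, compute):
        self._compute = compute       # rebuilds the rows from scratch
        self._cache_requested = False
        self._cached_rows = None
        self.compute_calls = 0        # how many times rows were rebuilt

    def cache(self):
        # Only mark the dataset; rows are stored when an action first runs.
        self._cache_requested = True
        return self

    def unpersist(self):
        # Drop the stored rows and stop caching.
        self._cache_requested = False
        self._cached_rows = None
        return self

    def _rows(self):
        if self._cached_rows is not None:
            return self._cached_rows  # reuse, no recomputation
        self.compute_calls += 1
        rows = list(self._compute())
        if self._cache_requested:
            self._cached_rows = rows
        return rows

    # Two "actions" that both need the full data.
    def count(self):
        return len(self._rows())

    def total(self):
        return sum(self._rows())


def expensive():
    # Stand-in for a costly chain of transformations.
    return (x * x for x in range(1000))


uncached = ToyDataset(expensive)
uncached.count()
uncached.total()
uncached_computes = uncached.compute_calls  # each action recomputes

cached = ToyDataset(expensive).cache()
cached.count()
cached.total()
cached_computes = cached.compute_calls      # second action reuses the cache
```

Without `cache()`, every action rebuilds the data; with it, only the first action pays the cost, which is the same trade-off that motivates `df.cache()` before running several actions on one DataFrame.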

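- Narrow transformations: each output partition depends on a single input partition, so no data moves between partitions (e.g., map, filter).
- Wide transformations: an output partition needs data from multiple input partitions, so Spark shuffles data across partitions (e.g., groupByKey, reduceByKey).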
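
The difference can be made concrete with partitions modeled as plain Python lists. This is an illustrative sketch, not Spark code: `partition_for`, `narrow_map_values`, and `wide_reduce_by_key` are hypothetical helpers, and the CRC32-based partitioner is an assumption chosen only to keep the example deterministic.

```python
import zlib
from functools import reduce

# Three input partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]


def partition_for(key, num_partitions):
    # Deterministic hash partitioner (Python's built-in hash() of a str
    # is salted per process, so it is avoided here).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def narrow_map_values(parts, fn):
    # Narrow: each output partition is built from exactly one input
    # partition, so no records move between partitions.
    return [[(k, fn(v)) for k, v in part] for part in parts]


def wide_reduce_by_key(parts, fn, num_partitions):
    # Wide: records with the same key may sit in different input
    # partitions, so they are first routed ("shuffled") to the target
    # partition chosen by their key, then reduced there.
    shuffled = [{} for _ in range(num_partitions)]
    for part in parts:
        for k, v in part:
            bucket = shuffled[partition_for(k, num_partitions)]
            bucket.setdefault(k, []).append(v)
    return [
        sorted((k, reduce(fn, vs)) for k, vs in bucket.items())
        for bucket in shuffled
    ]


doubled = narrow_map_values(partitions, lambda v: v * 2)
reduced = wide_reduce_by_key(partitions, lambda x, y: x + y, 2)
totals = dict(pair for part in reduced for pair in part)
```

`doubled` keeps the same three partitions with the same keys in place, while `reduced` has a new partition layout in which every key lives in exactly one partition. That regrouping step is what makes wide transformations more expensive.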