`Question 1:  Read about how Spark and Hadoop work. What does the term ‘lazy evaluation’ mean
for them? Explain with a simple example.`  
Spark and Hadoop are both frameworks used for big data processing, but they have different architectures and approaches.  
- Hadoop: Hadoop is a distributed computing framework designed to handle large-scale data processing across clusters of commodity hardware. It operates on a master-slave architecture and consists of several key components:  
  &nbsp;&nbsp;&nbsp;&nbsp;Hadoop Distributed File System (HDFS): It stores data in a distributed manner across multiple machines, with replication for fault tolerance. Files in HDFS are divided into blocks, typically 128 MB or 256 MB in size, and distributed across the cluster.  
  &nbsp;&nbsp;&nbsp;&nbsp;MapReduce: MapReduce is a programming model and processing engine used for processing and generating large datasets in parallel across a Hadoop cluster. It divides processing into two phases: Map and Reduce.  
  &nbsp;&nbsp;&nbsp;&nbsp;Yet Another Resource Negotiator (YARN): YARN is the resource management and job scheduling component of Hadoop. It manages resources (CPU, memory) across the cluster and allocates them to running applications.  
- Spark: Spark is a distributed computing framework that provides an in-memory data processing engine, allowing for faster data processing compared to Hadoop MapReduce. It offers APIs in Scala, Java, Python, and SQL for data processing tasks. Spark operates on Resilient Distributed Datasets (RDDs), which are distributed collections of data that can be cached in memory across a cluster of machines.
- Lazy Evaluation: Lazy evaluation is a strategy in which an expression is not evaluated until its value is actually needed. This avoids unnecessary computation, especially when a result is never used or only partly used. In Spark, lazy evaluation means that transformations on data (such as filtering, mapping, or aggregating) are not executed immediately. Instead, Spark records them as a lineage of transformations on the input data, and only an action, such as collecting the data or saving it to disk, triggers their execution. Because Spark sees the whole chain before running it, it can also optimize the plan, for example by pipelining consecutive transformations into a single stage. Hadoop MapReduce, by contrast, runs each job as soon as it is submitted; laziness in the Hadoop ecosystem comes from higher-level tools such as Pig, which build a plan first and execute it only when output is requested.
- Example: Consider the following code snippet in Spark using the Python API (PySpark):
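```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-evaluation").getOrCreate()

# Create an RDD from a list
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# Apply transformations (lazy): nothing is computed yet
transformed_rdd = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)

# Action: collect triggers the computation and returns [4, 8]
result = transformed_rdd.collect()
```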
In this example, `filter` and `map` are transformations applied to the RDD `rdd`. They are lazily evaluated: Spark only records them and does not execute them immediately. When `collect` is called, it triggers the execution of both transformations, and the filtered and mapped data, `[4, 8]`, is returned to the driver as the result.
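
The same idea can be illustrated without a cluster. The following is a minimal pure-Python sketch, not Spark's actual implementation: a hypothetical `LazyPipeline` class records transformations and only runs them when `collect()` is called. The `calls` list and the `is_even` helper are illustrative names, used to show when the predicate actually executes.

```python
class LazyPipeline:
    """Toy stand-in for an RDD: records transformations, runs them on collect()."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # recorded (kind, function) pairs

    def filter(self, predicate):
        # Transformation: return a new pipeline with one more recorded step.
        return LazyPipeline(self._data, self._steps + (("filter", predicate),))

    def map(self, func):
        # Transformation: also only recorded, never executed here.
        return LazyPipeline(self._data, self._steps + (("map", func),))

    def collect(self):
        # Action: only now is the data actually processed.
        items = iter(self._data)
        for kind, func in self._steps:
            if kind == "filter":
                items = filter(func, items)
            else:
                items = map(func, items)
        return list(items)


calls = []  # records every element the predicate actually sees


def is_even(x):
    calls.append(x)
    return x % 2 == 0


pipeline = LazyPipeline([1, 2, 3, 4, 5]).filter(is_even).map(lambda x: x * 2)
calls_before_action = list(calls)  # snapshot taken before any action
result = pipeline.collect()        # triggers filter and map
```

Before `collect()` the predicate has not been called at all, so `calls_before_action` is empty; after it, every input element has passed through the recorded steps.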

`Question 2: Your main task’s dataset has about 1,200,000 rows, which makes it quite hard, and even sometimes impossible, to work with. Explain how parquet files try to solve this problem, compared to normal file formats like csv.`  
Parquet is a columnar, compressed, self-describing file format, which makes large datasets cheaper to store and faster to query than row-based text formats such as CSV.
- Parquet files organize data by column rather than by row. Each column is stored separately, so a query that needs only a few columns reads only those columns from disk. A CSV reader, by contrast, has to parse every row in full even when a single column is needed.
- Because the values in a column share one type and are often similar, Parquet can apply encodings (such as dictionary and run-length encoding) and compression codecs (such as Snappy, Gzip, or Zstandard) independently to each column. This reduces disk space usage and I/O, making it easier to work with large datasets.
- Parquet files store metadata within the file itself: the schema (column names and data types) and statistics such as the minimum and maximum value of each column per row group. Readers do not have to infer types as they do with CSV, and they can use the statistics to skip row groups that cannot match a filter.
- Parquet files are splittable: a file is divided into row groups that can be read independently. This allows the data to be processed in parallel across multiple nodes in a distributed computing environment, improving overall performance.
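
The effect of the columnar layout can be sketched with the standard library alone. This is a toy illustration, not the Parquet format itself: the `rows` data, the column names, and the counters are all made up for the example. It computes the same total once from CSV text (row-based) and once from a dictionary of columns (column-based), counting how many values each approach has to touch.

```python
import csv
import io

# Hypothetical sample data, used only for illustration.
rows = [
    {"id": 1, "city": "A", "price": 10.0},
    {"id": 2, "city": "B", "price": 12.5},
    {"id": 3, "city": "A", "price": 7.5},
]

# Row-based layout: serialize the rows as CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "city", "price"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# To sum one column from CSV, every field of every row is parsed.
fields_parsed = 0
csv_total = 0.0
for record in csv.DictReader(io.StringIO(csv_text)):
    fields_parsed += len(record)
    csv_total += float(record["price"])

# Column-based layout: one list per column, as a stand-in for column chunks.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# To sum one column here, only that column's values are read.
values_read = len(columns["price"])
column_total = sum(columns["price"])
```

Both approaches give the same total, but the CSV path parses all nine fields while the columnar path reads only the three `price` values. With a real Parquet file, the analogous step is asking the reader for specific columns, for example through the `columns` argument of `pandas.read_parquet`.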

`Question 3: As you might have noticed, Spark doesn’t save checkpoints. How can we enforce it to do so? This can help us if we have multiple computation steps and we don’t want to wait a lot for the result.`  
In Spark, you can force intermediate results to be saved to reliable storage with the `checkpoint()` method. Before any RDD can be checkpointed, a checkpoint directory has to be set on the SparkContext:
```python
spark.sparkContext.setCheckpointDir("hdfs://path/to/checkpoint_directory")
```
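With the directory set, calling `checkpoint()` on an RDD marks it for checkpointing. Because `checkpoint()` is itself lazy, the data is written only when the first action runs on that RDD; later steps then read the saved data instead of recomputing the whole lineage.
```python
transformed_rdd.checkpoint()  # mark the RDD; nothing is written yet
transformed_rdd.count()       # the first action computes the RDD and saves the checkpoint
```
For DataFrames, `checkpoint()` returns a new, checkpointed DataFrame and is eager by default, so it runs immediately.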

`Question 4: Top companies stream their data on a regular routine, e.g. daily. How can we save data, so that we could filter it based on specific columns, e.g. date, faster than regular filtering?`  
To make filtering on specific columns faster, we can combine a columnar file format with partitioning. Storing the streamed data in a columnar format such as Parquet, rather than a row-based format, means a query reads only the columns it needs. Partitioning the data on the column(s) we want to filter on further reduces the amount of data that has to be scanned. For example, if we partition the data by date, each date is written to its own directory, so a filter on a single date reads only that directory instead of scanning the entire dataset. With a daily stream, each day's batch simply becomes a new partition.
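
The directory-per-value layout can be sketched with the standard library. This is a simplified illustration under stated assumptions, not Spark's writer: the `records` data and the helper names `write_partitioned` and `read_date` are hypothetical, and CSV part files stand in for Parquet. In PySpark, the corresponding write is `df.write.partitionBy("date").parquet(path)`, which produces the same `date=<value>` directory naming.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical daily records: (date, key, value).
records = [
    ("2024-01-01", "a", 1),
    ("2024-01-01", "b", 2),
    ("2024-01-02", "c", 3),
    ("2024-01-03", "d", 4),
]


def write_partitioned(root, rows):
    """Write rows into one sub-directory per date, e.g. root/date=2024-01-01/part.csv."""
    for date, key, value in rows:
        part_dir = Path(root) / f"date={date}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part.csv", "a", newline="") as handle:
            csv.writer(handle).writerow([key, value])


def read_date(root, date):
    """Read only the directory for the requested date; other dates are never opened."""
    part_file = Path(root) / f"date={date}" / "part.csv"
    if not part_file.exists():
        return []
    with open(part_file, newline="") as handle:
        return [(key, int(value)) for key, value in csv.reader(handle)]


with tempfile.TemporaryDirectory() as root:
    write_partitioned(root, records)
    partitions = sorted(p.name for p in Path(root).iterdir())
    jan_first = read_date(root, "2024-01-01")
```

The date value lives in the directory name rather than in the files, so answering a query for one date is a path lookup followed by reading a single small file.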

`Question 5: Let's face off Pandas and PySpark in the data analysis arena! When does each library truly shine, and why? Consider factors like data size, processing complexity, and user experience.`
- Data Size: Pandas is well-suited for working with small to medium-sized datasets that can fit into memory on a single machine. On the other hand, PySpark is designed for processing large-scale datasets that are distributed across a cluster of machines, allowing for parallel processing and scalability.
- Processing Complexity: Pandas is ideal for interactive data analysis and exploration, as it provides a rich set of data manipulation and analysis tools. It is well-suited for complex data transformations, feature engineering, and statistical analysis. PySpark, on the other hand, is optimized for large-scale data processing tasks like ETL (Extract, Transform, Load), machine learning, and graph processing, and provides a rich set of APIs for them.
- User Experience: Pandas offers a user-friendly and intuitive interface for data analysis, and its syntax is easy to understand and use. PySpark, on the other hand, has a steeper learning curve due to its distributed nature and the need to understand concepts like RDDs, transformations, and actions. Its syntax is also less intuitive than that of Pandas.  

PySpark shines when it comes to scalability, which makes it ideal for big data analytics in production environments. In the realm of data manipulation and exploration, the Pandas library shines as a versatile tool. The ability to handle complex data structures is crucial, especially when dealing with datasets that involve nested lists.
