# RDD

In Apache Spark, an RDD (Resilient Distributed Dataset) is a fundamental data structure that represents a collection of elements partitioned across the nodes of a cluster. RDDs can be created from various data sources, including the Hadoop Distributed File System (HDFS), the local file system, and data stored in a relational database.
An RDD can also be created from an in-memory collection by calling the parallelize method on the SparkContext object.
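
As a minimal sketch of the parallelize call described above, the following builds an RDD from a local collection and applies two transformations and one action. The object name, the sample numbers, and the local master setting are assumptions made for illustration only.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this illustration.
object ParallelizeSketch {

  // Builds an RDD from a local collection, keeps the even numbers, and squares them.
  def evenSquares(spark: SparkSession): Array[Int] = {
    // parallelize splits the local Seq into partitions (two here).
    val numbers = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)
    numbers
      .filter(_ % 2 == 0) // transformation: evaluated lazily
      .map(n => n * n)    // transformation: evaluated lazily
      .collect()          // action: runs the job and returns the results to the driver
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parallelize-sketch")
      .master("local[*]") // assumption: local mode, no cluster required
      .getOrCreate()
    try println(evenSquares(spark).mkString(", "))
    finally spark.stop()
  }
}
```

Nothing is computed until collect() is called; the filter and map steps only describe the work to be done.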

- Can be converted to DataFrames with toDF(), and a DataFrame can be converted back with its rdd method.
- Type-safe at compile time in Scala and Java, because an RDD[T] carries its element type. There is no schema, however, so Spark knows nothing about the fields inside each element.
- Low-level API that gives more control over the data, but without the automatic optimizations that DataFrames and Datasets receive.
- Give the user control over persistence: an RDD can be cached in memory, on disk, or both, as the user chooses.
- Whenever Spark needs to distribute the data within the cluster or write it to disk, it uses Java serialization by default. Serializing individual Java and Scala objects is expensive, because both the data and its structure must be sent between nodes.
- Require more code to express transformations and actions than the higher-level APIs.
- Do not have an explicit schema, and are often used for unstructured data.
- The RDD API is available in Scala, Java, and Python. SparkR does not expose a public RDD API.
- Have no built-in optimization engine.
- Can hold structured, semi-structured, or unstructured data, since the elements are arbitrary objects.
- Suitable for low-level data processing and batch jobs that require fine-grained control over data.
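
The persistence and conversion points above can be sketched as follows. This is an illustration under stated assumptions: the object name, the sample tuples, and the column names are hypothetical, and MEMORY_AND_DISK is only one of the storage levels Spark offers.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical object name, used only for this illustration.
object RddInteropSketch {

  // Persists an RDD, converts it to a DataFrame, and converts it back.
  // Returns the DataFrame's column names and the row count of the resulting RDD.
  def roundTrip(spark: SparkSession): (Seq[String], Long) = {
    import spark.implicits._ // brings toDF() into scope for RDDs of tuples and case classes

    val people = spark.sparkContext.parallelize(Seq(("Alice", 34), ("Bob", 29)))

    // The user chooses where cached partitions live: memory first, spilling to disk.
    people.persist(StorageLevel.MEMORY_AND_DISK)

    val df = people.toDF("name", "age") // RDD -> DataFrame, naming the columns
    val rows = df.rdd                   // DataFrame -> RDD[Row]

    (df.columns.toSeq, rows.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-interop-sketch")
      .master("local[*]") // assumption: local mode
      .getOrCreate()
    try println(roundTrip(spark))
    finally spark.stop()
  }
}
```

Note that the RDD obtained from a DataFrame holds generic Row objects, not the original tuples.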

# DataFrame

In Spark Scala, a DataFrame is a distributed collection of data organized into named columns.
It is similar to a table in a relational database or a spreadsheet: it has a schema, which defines the names and types of its columns, and each row represents a single record or observation.
DataFrames can be created from a variety of sources, such as RDDs, structured data files (e.g., CSV, JSON, Parquet), Hive tables, or external databases.
Once created, DataFrames support a wide range of operations and transformations, such as filtering, aggregating, joining, and grouping data.
One key benefit of DataFrames is that they use Spark’s distributed computing capabilities to process large amounts of data quickly and efficiently.
Overall, DataFrames provide a powerful and flexible way to work with structured data in a distributed computing environment.

- Can be converted to RDDs with the rdd method and to Datasets with as[T].
- Not type-safe: accessing a column that does not exist in the table does not produce a compile-time error. Such attribute errors are detected only at runtime.
- Optimized for performance through the high-level API, the Catalyst optimizer, and code generation.
- Have more efficient memory management than RDDs: Spark SQL stores rows in a compact binary format instead of as individual JVM objects, which reduces memory usage.
- Use a generic encoder for Row objects rather than an encoder for a specific class.
- Provide a high-level API that makes it easier to perform transformations and actions on data.
- Have an explicit schema that describes the data and its types. The schema is enforced at runtime.
- Available in four languages: Java, Python, Scala, and R.
- The Catalyst optimizer rewrites each query into an optimized plan before it runs.
- Support most of the available data types.
- Suitable for structured and semi-structured data processing with a higher level of abstraction.
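
The sketch below illustrates two of the points above: the high-level API for grouping and aggregating, and the fact that a misspelled column name compiles but fails at runtime. The object name, sample data, and column names are hypothetical, and a local session is assumed.

```scala
import org.apache.spark.sql.{AnalysisException, DataFrame, SparkSession}
import org.apache.spark.sql.functions.avg
import scala.util.Try

// Hypothetical object name, used only for this illustration.
object DataFrameSketch {

  // Builds a small DataFrame with named columns from a local collection.
  def sampleOrders(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("books", 12.0), ("books", 8.0), ("games", 30.0)).toDF("category", "amount")
  }

  // Groups by category and averages the amounts, using column names given as strings.
  def avgByCategory(df: DataFrame): Map[String, Double] =
    df.groupBy("category")
      .agg(avg("amount").as("avg_amount"))
      .collect()
      .map(row => row.getString(0) -> row.getDouble(1))
      .toMap

  // Referring to a column that is not in the schema compiles without complaint;
  // Spark reports the problem only when the query is analyzed at runtime.
  def missingColumnFailsAtRuntime(df: DataFrame): Boolean =
    Try(df.select("no_such_column").collect())
      .failed
      .toOption
      .exists(_.isInstanceOf[AnalysisException])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[*]") // assumption: local mode
      .getOrCreate()
    try {
      val orders = sampleOrders(spark)
      println(avgByCategory(orders))
      println(missingColumnFailsAtRuntime(orders))
    } finally spark.stop()
  }
}
```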

# Dataset

A Dataset is a distributed collection of data that provides the benefits of strong typing, compile-time type safety, and object-oriented programming. It is essentially a strongly typed version of a DataFrame, where each row of the Dataset is an object of a specific type, defined by a case class or a Java class. In Scala, DataFrame is a type alias for Dataset[Row].
Because the compiler checks these types, many errors are caught at compile time rather than at runtime, which improves code quality and reduces the likelihood of failures in production jobs.

- Can be converted to DataFrames using toDF(), and to RDDs using the rdd method.
- Type-safe: Datasets provide compile-time type checking, which helps catch errors early in the development process. DataFrames, by contrast, are schema-based: the structure of the data is defined at runtime and is not checked until runtime.
- Run on the same engine as DataFrames and benefit from the same JVM bytecode generation and just-in-time (JIT) compilation. Typed operations that take functions, such as map or filter with a lambda, are opaque to the optimizer and require rows to be deserialized into objects, so they can be slower than the equivalent column-based DataFrame operations.
- Serialized using specialized encoders that are optimized for performance.
- Provide a richer set of APIs: Datasets support both functional and object-oriented programming paradigms and offer a more expressive API for working with data.
- Enforce schema at compile time, so errors in data types or structures are caught earlier in the development cycle. The schema is explicit, describes the data and its types, and is strongly typed.
- Only available in Scala and Java.
- Use the same Catalyst optimizer as DataFrames to optimize query plans.
- Support all of the same data types as DataFrames, plus user-defined types, which makes them more flexible when working with complex data types.
- Suitable for high-performance batch and stream processing with strong typing and functional programming.
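
To make the compile-time checking concrete, here is a minimal sketch that converts a DataFrame to a Dataset with as[T] and then works with typed objects. The Person case class, the object name, and the sample data are hypothetical; a local session is assumed.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type. It is declared at the top level, outside any method,
// so that Spark can derive an encoder for it.
case class Person(name: String, age: Long)

// Hypothetical object name, used only for this illustration.
object DatasetSketch {

  // Converts an untyped DataFrame into a Dataset[Person] and filters it with a typed lambda.
  def adults(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._ // supplies the encoders used by toDF, as[T], and map

    val df = Seq(("Alice", 34L), ("Bob", 15L)).toDF("name", "age")
    val people: Dataset[Person] = df.as[Person] // DataFrame -> Dataset

    // The compiler knows p is a Person: a typo such as p.agee would not compile.
    people.filter(p => p.age >= 18)
  }

  // Maps each Person to its name; the result is a Dataset[String] collected to the driver.
  def adultNames(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    adults(spark).map(_.name).collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]") // assumption: local mode
      .getOrCreate()
    try {
      println(adultNames(spark))
      adults(spark).toDF().printSchema() // Dataset -> DataFrame
    } finally spark.stop()
  }
}
```

The same filter could be written as a column expression, which the optimizer can inspect; the lambda form trades that visibility for compile-time checking.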

| Feature | RDD | DataFrame | Dataset |
|:----------:|:----------:|:----------:|:----------:|
|API|Low level; requires more code|High level; optimized code|High level; more expressive API|
|Language support|Scala, Java, and Python|Java, Python, Scala, and R|Only Java and Scala|
|Type safety|At compile time for the element type (Scala/Java), but no schema|Checked at run time|Checked at compile time|
|Performance|No automatic optimization; gives more control over data|Optimized for performance; uses the Catalyst optimizer and code generation|Same engine as DataFrames; typed lambdas can be slower than column expressions|
|Schema enforcement|No explicit schema; often used for unstructured data|Schema enforced at run time|Schema enforced at compile time (strongly typed)|
|Optimization|No built-in optimization engine|Uses the Catalyst optimizer|Uses the same Catalyst optimizer as DataFrames|
|Data types|Arbitrary objects; structured, semi-structured, or unstructured data|Supports most of the available data types|Supports the same data types as DataFrames, plus user-defined types|
|Use cases|Low-level data processing and batch jobs that require fine-grained control over data|Structured and semi-structured data processing with a higher level of abstraction|High-performance batch and stream processing with strong typing and functional programming|