## Introduction and Overview
- **Objective**:
    - Review Spark APIs covered in previous weeks.
    - Introduce Spark DataFrames focusing on three key aspects:
        - Creation
        - Manipulation
        - User-Defined Functions (UDFs)

## Review of Spark APIs

**Resilient Distributed Datasets (RDDs)**

- **Definition**:
    - RDDs are the foundational data structure in Spark, representing a distributed collection of data.
- **Characteristics**:
    - **Low-level API**: Requires you to spell out each processing step explicitly.
    - **Supports**:
        - **Creation**:
            - `parallelize` existing data structures.
            - `textFile` to load data from files.
        - **Transformations**:
            - Return a new RDD and are evaluated lazily, e.g., `map`, `filter`, `reduceByKey`, `sortByKey`.
        - **Actions**:
            - Return non-RDD results to the driver, e.g., `count`, `collect`, `reduce`.
- **Usage Recap**:
    - Filter lines containing specific keywords.
    - Apply transformations like squaring numbers using `map`.
    - Perform actions to retrieve results, keeping in mind performance implications with large datasets.

## Spark Datasets

- **Definition**:
    - Distributed collections of data, similar to DataFrames but available only in Scala and Java.
- **Limitations**:
    - No Python interface.
- **Additional Notes**:
    - Spark is written in Scala, which runs on the JVM, leading to certain functionalities being exclusive to Scala and Java.
    - Error messages may reference Java due to Spark's underlying implementation.

## Spark DataFrames
- **Definition**:
    - Distributed collections of data organized into named columns, analogous to a pandas DataFrame or an R data frame.
- **Availability**:
    - Accessible from Python, Scala, and Java. Unlike Datasets, DataFrames have a Python interface.

## In-Depth Discussion on Spark DataFrames

## Creation

## From a List
- **Method**:
    - Use `createDataFrame` function with a list of tuples and a list of column names.
- **Example**:
    - Data: `[("Chris", 67), ("Frank", 70)]`
    - Columns: `["name", "score"]`

## Output

    +-----+-----+
    | name|score|
    +-----+-----+
    |Chris|   67|
    |Frank|   70|
    +-----+-----+

## From an RDD
- **Method**:
    - Convert an existing RDD to a DataFrame.
    - Default column names appear as `_1`, `_2`, etc., unless explicitly defined.
- **Enhancement**:
    - Use `Row` objects to assign meaningful column names during conversion.
- **Example**:
    - Original RDD: `[("Chris", 67), ("Frank", 70)]`
    - Transformed with `Row(name=x[0], score=int(x[1]))`

## Output with Named Columns

    +-----+-----+
    | name|score|
    +-----+-----+
    |Chris|   67|
    |Frank|   70|
    +-----+-----+

## From a JSON File
- **Method**:
    - Utilize `spark.read.json("filename.json")` to create a DataFrame.
- **Advantages**:
    - Automatic schema inference.
- **Complexity**:
    - JSON files can contain nested structures, which Spark represents as nested `struct` columns in the inferred schema.
- **Schema Example**:

        root
         |-- addresses: string (nullable = true)
         |-- attributes: struct (nullable = true)
         |    |-- ambiance: struct (nullable = true)
         |    |    |-- casual: string (nullable = true)
         |    |    |-- intimate: string (nullable = true)
         |-- is_open: boolean (nullable = true)
         |-- latitude: double (nullable = true)
         |-- longitude: double (nullable = true)

## Manipulation
- **Primary Functions**:
    - `show()`: Prints the first 20 rows by default; `show(n)` prints the first `n`.
    - `first()`: Returns the first row as a `Row` object.
    - `head(n)`: Returns the first `n` rows as a list of `Row` objects.
    - `take(n)`: Equivalent to `head(n)`; returns the first `n` rows as a list.
    - `collect()`: Returns all rows to the driver as a Python list (use with caution on large datasets).
- **Best Practices**:
    - Use `collect()` sparingly to avoid memory issues with large datasets.
    - Prefer actions like `show`, `head`, or `take` for exploratory data analysis.

## User-Defined Functions (UDFs)
- **Upcoming Content**:
    - Detailed exploration of creating and applying UDFs on DataFrames.

## Review of RDDs (from Last Week)
- **Pattern**:
    - Create a Spark session using the Builder pattern.
    - Utilize SparkContext for RDD operations.
- **Example Workflow**:
    1. **Create an RDD**:
        - `data_lines = sc.textFile("data.txt")`
    2. **Filter and Transform**:
        - Filter lines containing the keyword "data".
        - Apply map functions to transform data (e.g., squaring numbers).
    3. **Actions**:
        - Use `collect()` to retrieve the transformed data.

## Examples and Demonstrations
- **Jupyter Notebook Usage**:
    - Demonstrated creating Spark sessions and DataFrames.
    - Showcased various DataFrame creation methods and their outputs.
    - Highlighted the importance of specifying data types explicitly in Spark.
- **Code Highlights**:
    - Importing the necessary types: `from pyspark.sql.types import FloatType`
    - Creating DataFrames with specified schemas to ensure data type consistency.

## Action Items
- **Practice DataFrame Creation**:
    - Create DataFrames from different sources: lists, RDDs, JSON files.
    - Experiment with specifying column names and data types.
- **Explore Data Manipulation Functions**:
    - Use `show`, `first`, `head`, `take`, and `collect` on various DataFrames.
    - Observe the differences in outputs and performance implications.
- **Prepare for UDFs**:
    - Start thinking about potential use cases for user-defined functions in data manipulation.

## Follow-up
- **Next Session**:
    - Dive deeper into DataFrame manipulation and transformations.
    - Introduction to User-Defined Functions (UDFs) and their applications.
- **Resources**:
    - Review Spark documentation on DataFrames and RDDs.
    - Explore Scala basics for understanding Spark's underlying architecture (optional).

## Conclusion
This week covered Spark DataFrames: how to create them from lists, RDDs, and JSON files, how to inspect them with `show`, `first`, `head`, `take`, and `collect`, and how they differ from RDDs. The concepts were demonstrated with worked examples in Jupyter notebooks.