## 1. Data Frame Attributes vs. Functions
- **Attributes**: Accessed without parentheses. Example: `df.columns` returns column names.
- **Functions**: Require parentheses. Examples include `df.describe()`, `df.show()`, and `df.count()`.
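- **Distinction**: Attributes expose metadata about the data frame, while functions (methods) perform operations. Some functions return a new data frame (`df.describe()`), others return a plain value (`df.count()` returns an integer) or only print output (`df.show()`).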
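
The attribute/function split above can be sketched as a small helper. This is only an illustration: `summarize` is a hypothetical name, and it assumes `df` is an already loaded PySpark data frame.

```python
def summarize(df):
    """Collect a few basic facts about a data frame."""
    return {
        "columns": df.columns,     # attribute: no parentheses
        "dtypes": df.dtypes,       # attribute: list of (name, type) pairs
        "row_count": df.count(),   # function: needs parentheses, returns an int
    }
```

Writing `df.count` without parentheses would return the bound method itself rather than the number of rows, which is a common slip when switching between attributes and functions.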

## 2. Column Names and Data Types
- **Accessing Columns**: Use `df.columns` to retrieve a list of column names.
- **Data Types**: `df.dtypes` displays the data type for each column.
- **Schema Inspection**: `df.printSchema()` provides a detailed structure of the data frame, including nested columns.
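
Because `df.dtypes` is a plain Python list of `(column_name, type_string)` pairs, it can be filtered like any other list. The sketch below picks out numeric columns; `numeric_columns` and `NUMERIC_TYPES` are hypothetical names introduced for illustration, and the set lists the type strings Spark reports for its standard numeric types.

```python
# Type strings that df.dtypes reports for Spark's numeric types.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(df):
    """Return the names of the columns whose type is numeric."""
    return [
        name
        for name, dtype in df.dtypes
        # Decimal types carry precision and scale, e.g. "decimal(10,2)".
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal")
    ]
```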

## 3. Descriptive Statistics with `describe` and `show`
- `df.describe()`: Calculates summary statistics (`count`, `mean`, `stddev`, `min`, `max`) for numeric and string columns; `mean` and `stddev` are null for string columns.
- `df.describe().show()`: Displays the calculated statistics in a readable format.
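
A minimal sketch of the two-step pattern described above, assuming `df` is an existing data frame. The helper name `show_summary` is hypothetical.

```python
def show_summary(df, *columns):
    """Print summary statistics for the given columns.

    With no column names, describe covers every numeric and string column.
    """
    stats = df.describe(*columns)  # returns a data frame of statistics
    stats.show()                   # prints that data frame to the console
    return stats
```

For example, `show_summary(df, "stars")` would restrict the output to the `stars` column.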

## 4. Selecting and Manipulating Columns
- **Selecting Columns**: `df.select("column_name")` extracts one or more columns, returning a new data frame.
- **Comparison with SQL**: Upcoming discussions will differentiate PySpark's `select` from SQL's `SELECT` statement.
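
As a small illustration of `select`, the helper below keeps two columns. The function name `name_and_stars` is hypothetical, and it assumes the data frame has `name` and `stars` columns.

```python
def name_and_stars(df):
    """Return a new data frame holding only the `name` and `stars` columns."""
    # select accepts one or more column names; `df` itself is not modified.
    return df.select("name", "stars")
```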

## 5. Creating and Dropping Columns
- **Creating Columns**:
    - Use `df.withColumn("new_column", column_expression)` to add a new column; the second argument must be a `Column` expression, not a plain Python value.
    - Example: Creating a new column by applying a function from `pyspark.sql.functions`.
- **Dropping Columns**: `df.drop("column_name")` removes a specified column from the data frame.
- **Immutability**: Data frames, like the RDDs underneath them, are immutable. Every transformation returns a new data frame, and the original remains unchanged.
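
The following sketch shows column creation and removal under the assumption that the data frame has `city` and `address` columns (both appear in the Yelp business data). The helper names and the new column name `city_upper` are hypothetical, and `F.upper` is just one example of a function from `pyspark.sql.functions`.

```python
def with_city_upper(df):
    """Return a new data frame with an upper-cased copy of `city`."""
    # Imported inside the function so the helper carries its own dependency.
    from pyspark.sql import functions as F

    # withColumn returns a new data frame; `df` still lacks `city_upper`.
    return df.withColumn("city_upper", F.upper(df["city"]))


def without_address(df):
    """Return a new data frame without the `address` column."""
    # drop also returns a new data frame and leaves `df` untouched.
    return df.drop("address")
```

Because neither call changes `df` in place, the result has to be assigned to a variable (for example `df2 = with_city_upper(df)`) to be used later.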

## 6. Filtering Data Frames
- **Filtering Rows**: `df.filter(condition)` filters rows based on a boolean condition, similar to Pandas' boolean masking.
- **Example Condition**: Filtering rows where the `stars` column is greater than or equal to 4.
- **Displaying Results**: Use `df.filter(condition).show()` to view filtered data.
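
The `stars` example above might look like the sketch below. `highly_rated` is a hypothetical helper name, and the default threshold of 4 follows the example condition.

```python
def highly_rated(df, min_stars=4):
    """Return only the rows whose `stars` value is at least `min_stars`."""
    # df["stars"] >= min_stars builds a boolean Column expression,
    # which filter evaluates row by row.
    return df.filter(df["stars"] >= min_stars)
```

Calling `highly_rated(df).show()` would then display the filtered rows.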

## 7. Collecting and Iterating Over Rows
- **Collecting Data**: `df.collect()` retrieves all rows to the driver as a Python list of `Row` objects.
- **Accessing Rows**: Access individual rows using list indices, e.g., `result[0]`.
- **Converting to Dictionary**: `row.asDict()` transforms a `Row` into a Python dictionary for easier data manipulation.
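
A sketch of collecting rows and converting them to dictionaries. The helper name `rows_as_dicts` is hypothetical, and the `limit` step is an added precaution not taken from the notes: `collect()` copies every row to the driver, so capping the row count first keeps the example safe on large data.

```python
def rows_as_dicts(df, limit=5):
    """Bring up to `limit` rows to the driver as plain Python dictionaries."""
    rows = df.limit(limit).collect()       # list of Row objects
    return [row.asDict() for row in rows]  # one dict per row
```

Each dictionary maps column names to that row's values, so `rows_as_dicts(df)[0]["stars"]` would read one field from the first row, assuming a `stars` column exists.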

## 8. Grouping and Sorting Data
- **Group By**: `df.groupBy("column").count()` aggregates data based on unique values in a specified column.
- **Sorting**: `df.groupBy("column").count().sort("column", ascending=False)` sorts the aggregated results.
- **Example**: Counting how many businesses received each star rating, then ranking the ratings.
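
A sketch of the grouping example, assuming a `stars` column. `star_counts` is a hypothetical helper name. Sorting by `"count"`, the column that `count()` adds, is one possible choice here; sorting by `"stars"` instead would order the output by rating.

```python
def star_counts(df):
    """Count rows per star rating, most common rating first."""
    return (
        df.groupBy("stars")               # one group per distinct rating
        .count()                          # adds a `count` column
        .sort("count", ascending=False)   # largest groups first
    )
```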

## 9. Renaming Columns
- **Renaming Columns**: `df.withColumnRenamed("old_name", "new_name")` changes the name of a specified column.
- **Usage**: Facilitates clarity and consistency in data frame schemas.
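
Since `withColumnRenamed` handles one column per call, renaming several columns can be sketched as a loop. The helper `rename_columns` and its `mapping` argument are hypothetical.

```python
def rename_columns(df, mapping):
    """Apply several renames, given as {old_name: new_name}."""
    for old_name, new_name in mapping.items():
        # Each call returns a new data frame, so the result is reassigned.
        df = df.withColumnRenamed(old_name, new_name)
    return df
```

For instance, `rename_columns(df, {"stars": "rating"})` would return a data frame in which `stars` is called `rating`, leaving the original untouched.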

## 10. Explode Function
- **Purpose**: `explode` creates a separate row for each element in a list or array within a column.
- **Application**: Useful for normalizing nested or complex data structures.
- **Example**: Expanding a list of scores into individual rows for each score.
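
The scores example can be sketched as follows, assuming a data frame with an array column named `scores`. The helper name `explode_scores` and the output column name `score` are hypothetical.

```python
def explode_scores(df):
    """Add a `score` column holding one element of `scores` per row."""
    # Imported inside the function so the helper carries its own dependency.
    from pyspark.sql import functions as F

    # A row whose `scores` array has three elements becomes three rows;
    # the other columns are repeated on each of them.
    return df.withColumn("score", F.explode("scores"))
```

Note that `explode` drops rows whose array is null or empty; `explode_outer` from the same module keeps them with a null value instead.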

## 11. Conditional Statements with `when`
- **Usage**: Implements if-else logic on a per-row basis.
- **Syntax**: `f.when(condition, value).otherwise(other_value)` where `f` is an alias for `pyspark.sql.functions`.
- **Example**: Creating a new column `Good` that is `1` if `score > 50` and `0` otherwise.
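
The `Good` column from the example might be built as in this sketch. It assumes a numeric `score` column; `add_good_flag` is a hypothetical helper name, and `F` is used here as the alias for `pyspark.sql.functions`.

```python
def add_good_flag(df):
    """Add a `Good` column: 1 where score > 50, otherwise 0."""
    # Imported inside the function so the helper carries its own dependency.
    from pyspark.sql import functions as F

    # when(...) supplies the "if" branch, otherwise(...) the "else" branch.
    is_good = F.when(df["score"] > 50, 1).otherwise(0)
    return df.withColumn("Good", is_good)
```

Without the `otherwise` call, rows that fail the condition would receive null rather than `0`.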

## 12. Loading JSON Files and Understanding Schema
- **Loading Data**: `spark.read.json("file_path")` loads JSON data into a data frame.
- **Schema Examination**: `df.printSchema()` and `df.dtypes` help understand the structure and data types.
- **Example Dataset**: The Yelp Academic Dataset includes files such as Business, Check-in, Review, Tip, and User.
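
Loading and inspecting a file can be sketched as below. The helper `load_json` is hypothetical; it assumes `spark` is an existing `SparkSession` and `path` points to a JSON file such as one of the Yelp files (the actual file names and locations depend on your download).

```python
def load_json(spark, path):
    """Load a JSON file into a data frame and print its schema."""
    df = spark.read.json(path)  # column types are inferred from the data
    df.printSchema()            # tree view, including nested fields
    return df
```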

## 13. Working with the Yelp Dataset
- **Business Data Frame**:
    - Contains fields such as address, attributes, business ID, categories, city, hours.
    - Demonstrated methods to access and manipulate nested structures.
- **Review, Check-in, Tip, and User Data Frames**:
    - Loaded additional JSON files corresponding to different aspects of Yelp data.
    - Explored schemas and performed operations like filtering and counting based on specific criteria.
- **Example Operation**: Counting users with more than 5,000 cool compliments using `filter` and `count`.
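
The example operation could be written as the sketch below. It assumes the user data frame stores cool compliments in a column named `compliment_cool`; that column name is an assumption about the Yelp user file, so check it against `printSchema()` output. `count_cool_users` is a hypothetical helper name.

```python
def count_cool_users(user_df, threshold=5000):
    """Count users with more than `threshold` cool compliments."""
    # Assumed column name: `compliment_cool`.
    cool_users = user_df.filter(user_df["compliment_cool"] > threshold)
    return cool_users.count()  # returns a plain Python integer
```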

## Action Items
- **Practice Data Frame Operations**:
    - Practice the `describe`, `show`, `select`, `filter`, `groupBy`, `sort`, `withColumnRenamed`, and `explode` functions in PySpark.
    - Create and drop columns using `withColumn` and `drop`.
- **Load and Explore the Yelp Dataset**:
    - Load JSON files for Business, Check-in, Review, Tip, and User.
    - Examine schemas and perform basic data manipulations.
- **Implement Conditional Logic**:
    - Use the `when` function to create new columns based on specific conditions.

## Follow-up
- **Upcoming Topics**:
    - Detailed exploration of user-defined functions (UDFs) in PySpark.
    - Advanced data manipulation techniques and optimizations.
- **Next Week’s Focus**:
    - Introduction to SQL's `SELECT` statement and its differences from PySpark's `select` function.
- **Homework Assignment**:
    - Tasks related to the Yelp dataset to reinforce data frame operations and schema understanding.