<a href="https://colab.research.google.com/github/anjli01/PySpark-Notes/blob/main/03_DataFrame_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## DataFrame Basics

`DataFrames` are the primary abstraction for working with structured data in modern Spark applications. They represent a distributed collection of data organized into named columns, conceptually similar to a table in a relational database.

### 1. Creating DataFrames

DataFrames can be created from various sources.

#### A. From Lists (Python)

You can create a DataFrame from a Python list of tuples or `Row` objects.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Initialize SparkSession
spark = SparkSession.builder.appName("CreateDFFromList").getOrCreate()

print("--- DataFrame from List (Inferred Schema) ---")
# 1. Using a list of tuples with inferred schema (less explicit but quicker for small data)
data_inferred = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
columns_inferred = ["Name", "ID"]
df_inferred = spark.createDataFrame(data_inferred, columns_inferred)
df_inferred.show()
df_inferred.printSchema()

print("\n" + "-" * 30 + "\n")

print("--- DataFrame from List (Explicit Schema) ---")
# 2. Using a list of tuples with explicit schema (recommended for robustness, especially in production)
data_explicit = [
    ("David", 4, "New York"),
    ("Eve", 5, "London"),
    ("Frank", 6, "Paris")
]

schema_explicit = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True)
])

df_explicit = spark.createDataFrame(data_explicit, schema=schema_explicit)
df_explicit.show()
df_explicit.printSchema()

spark.stop()

--- DataFrame from List (Inferred Schema) ---
+-------+---+
|   Name| ID|
+-------+---+
|  Alice|  1|
|    Bob|  2|
|Charlie|  3|
+-------+---+

root
 |-- Name: string (nullable = true)
 |-- ID: long (nullable = true)


------------------------------

--- DataFrame from List (Explicit Schema) ---
+-----+---+--------+
| Name|Age|    City|
+-----+---+--------+
|David|  4|New York|
|  Eve|  5|  London|
|Frank|  6|   Paris|
+-----+---+--------+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- City: string (nullable = true)



**Data Engineer Tip:** Inferred schemas are convenient for quick analysis, but **production pipelines should define schemas explicitly.** An explicit schema prevents surprises from type inference (for example, Python integers are inferred as `long`, as shown above), avoids the extra pass over the data that inference needs, and documents the expected structure.

#### B. From RDDs (Python)

You can convert a Resilient Distributed Dataset (RDD) to a DataFrame. This is useful when migrating legacy RDD-based logic or integrating with existing RDD data.

In [2]:
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("CreateDFFromRDD").getOrCreate()
sc = spark.sparkContext # SparkContext for RDD operations

# Create an RDD of tuples
rdd_data = sc.parallelize([
    ("Alice", 30),
    ("Bob", 25),
    ("Charlie", 35)
])

print("--- DataFrame from RDD (Inferred Schema) ---")
# Method 1: Infer schema (less control, provides generic column names if not specified)
# When providing a list of strings as schema, it will infer types from the RDD data.
df_from_rdd_inferred = spark.createDataFrame(rdd_data, ["Name", "Age"])
df_from_rdd_inferred.show()
df_from_rdd_inferred.printSchema()

print("\n" + "-" * 30 + "\n")

print("--- DataFrame from RDD (Explicit Schema with Row objects) ---")
# Method 2: Map to Row objects and apply a defined schema (recommended for robustness)
# Row objects are optional here: plain tuples also work as long as their values line up with the schema's field order.
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])

rdd_rows = rdd_data.map(lambda p: Row(Name=p[0], Age=p[1]))
df_from_rdd_explicit = spark.createDataFrame(rdd_rows, schema)
df_from_rdd_explicit.show()
df_from_rdd_explicit.printSchema()

spark.stop()

--- DataFrame from RDD (Inferred Schema) ---
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
+-------+---+

root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)


------------------------------

--- DataFrame from RDD (Explicit Schema with Row objects) ---
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
+-------+---+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)



#### C. From External Files (Python)

Spark excels at reading data from various external file formats like CSV, JSON, Parquet, ORC, etc.

**The example below writes two sample files, `people.csv` and `people.json`, and then reads them back:**

**`people.csv`:**
```csv
name,age
Alice,30
Bob,25
Charlie,35
```

**`people.json`:**
```json
{"name":"Alice","age":30}
{"name":"Bob","age":25}
{"name":"Charlie","age":35}
```


In [3]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CreateDFFromFiles").getOrCreate()

# Create sample files in the working directory so the reads below have something to load
csv_data = """name,age
Alice,30
Bob,25
Charlie,35
"""
with open("people.csv", "w") as f:
    f.write(csv_data)

json_data = """{"name":"Alice","age":30}
{"name":"Bob","age":25}
{"name":"Charlie","age":35}
"""
with open("people.json", "w") as f:
    f.write(json_data)


# Read CSV file
print("--- DataFrame from CSV ---")
df_csv = spark.read.csv("people.csv", header=True, inferSchema=True)
df_csv.show()
df_csv.printSchema()

print("\n" + "-" * 30 + "\n")

# Read JSON file
print("--- DataFrame from JSON ---")
df_json = spark.read.json("people.json")
df_json.show()
df_json.printSchema()

print("\n" + "-" * 30 + "\n")

# For Parquet, you'd typically write a DataFrame to Parquet first, then read it back
# Parquet is a columnar storage format optimized for analytical queries.
print("--- DataFrame from Parquet (Write then Read) ---")
df_csv.write.mode("overwrite").parquet("people.parquet") # write df_csv to parquet
df_parquet = spark.read.parquet("people.parquet") # read from parquet
df_parquet.show()
df_parquet.printSchema()

spark.stop()

--- DataFrame from CSV ---
+-------+---+
|   name|age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
+-------+---+

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)


------------------------------

--- DataFrame from JSON ---
+---+-------+
|age|   name|
+---+-------+
| 30|  Alice|
| 25|    Bob|
| 35|Charlie|
+---+-------+

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)


------------------------------

--- DataFrame from Parquet (Write then Read) ---
+-------+---+
|   name|age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
+-------+---+

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)



**Data Engineer Tip on File Formats:**
*   **CSV/JSON:** Human-readable and convenient for ingestion, but less efficient for large-scale analytical queries. `inferSchema=True` can be slow on large CSV files because Spark makes an extra pass over the data to determine column types.
*   **Parquet:** Binary, columnar storage format. Highly recommended for intermediate and final data storage in data lakes/warehouses due to:
    *   **Columnar compression:** Values of the same type are stored together and compress well, reducing the storage footprint.
    *   **Column pruning and predicate pushdown:** Spark reads only the columns a query needs and can skip data that cannot match a filter, significantly speeding up queries.
    *   **Schema evolution:** Columns can be added over time, and files with different but compatible schemas can be read together using the `mergeSchema` option.

### 2. Basic DataFrame Operations (Actions & Transformations)

Once you have a DataFrame, you can perform various fundamental operations.

#### A. `select()`

`select()` is a **transformation** used to choose specific columns from a DataFrame. It returns a new DataFrame with only the selected columns.

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col # Using 'col' is recommended for robustness

spark = SparkSession.builder.appName("DFBasicOps").getOrCreate()

data = [("Alice", 30, "New York"), ("Bob", 25, "London"), ("Charlie", 35, "Paris")]
columns = ["Name", "Age", "City"]
df = spark.createDataFrame(data, columns)
df.show()

print("--- Select a single column ---")
df.select("Name").show()

print("--- Select multiple columns ---")
df.select("Name", "City").show()

print("--- Select columns using col() function (recommended for robustness) ---")
# col() refers to a column by name, so it still works when the name clashes with a DataFrame attribute (e.g. for a column named "count", df.count returns the method, not the column).
df.select(col("Name"), col("Age")).show()

print("--- Select a column and alias it ---")
df.select(df.Name.alias("Full_Name"), df.Age).show()

spark.stop()

+-------+---+--------+
|   Name|Age|    City|
+-------+---+--------+
|  Alice| 30|New York|
|    Bob| 25|  London|
|Charlie| 35|   Paris|
+-------+---+--------+

--- Select a single column ---
+-------+
|   Name|
+-------+
|  Alice|
|    Bob|
|Charlie|
+-------+

--- Select multiple columns ---
+-------+--------+
|   Name|    City|
+-------+--------+
|  Alice|New York|
|    Bob|  London|
|Charlie|   Paris|
+-------+--------+

--- Select columns using col() function (recommended for robustness) ---
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
+-------+---+

--- Select a column and alias it ---
+---------+---+
|Full_Name|Age|
+---------+---+
|    Alice| 30|
|      Bob| 25|
|  Charlie| 35|
+---------+---+



#### B. `filter()` / `where()`

`filter()` (or its alias `where()`) is a **transformation** used to filter rows based on a given condition. It returns a new DataFrame containing only the rows that satisfy the condition.

In [5]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("DFBasicOps").getOrCreate()

data = [("Alice", 30, "New York"), ("Bob", 25, "London"), ("Charlie", 35, "Paris")]
columns = ["Name", "Age", "City"]
df = spark.createDataFrame(data, columns)
df.show()

print("--- Filter using string expression ---")
df.filter("Age > 28").show()

print("--- Filter using column object (recommended for complex conditions) ---")
df.filter(col("Age") <= 30).show()

print("--- Filter with multiple conditions (AND) ---")
df.filter((col("Age") > 25) & (col("City") == "New York")).show()

print("--- Filter with multiple conditions (OR) ---")
df.filter((col("Age") < 25) | (col("City") == "Paris")).show()

print("--- Using where() - alias for filter() ---")
df.where(col("City") == "London").show()

spark.stop()

+-------+---+--------+
|   Name|Age|    City|
+-------+---+--------+
|  Alice| 30|New York|
|    Bob| 25|  London|
|Charlie| 35|   Paris|
+-------+---+--------+

--- Filter using string expression ---
+-------+---+--------+
|   Name|Age|    City|
+-------+---+--------+
|  Alice| 30|New York|
|Charlie| 35|   Paris|
+-------+---+--------+

--- Filter using column object (recommended for complex conditions) ---
+-----+---+--------+
| Name|Age|    City|
+-----+---+--------+
|Alice| 30|New York|
|  Bob| 25|  London|
+-----+---+--------+

--- Filter with multiple conditions (AND) ---
+-----+---+--------+
| Name|Age|    City|
+-----+---+--------+
|Alice| 30|New York|
+-----+---+--------+

--- Filter with multiple conditions (OR) ---
+-------+---+-----+
|   Name|Age| City|
+-------+---+-----+
|Charlie| 35|Paris|
+-------+---+-----+

--- Using where() - alias for filter() ---
+----+---+------+
|Name|Age|  City|
+----+---+------+
| Bob| 25|London|
+----+---+------+



#### C. `show()`

`show()` is an **action** that displays the contents of the DataFrame in a tabular format. It's very useful for inspecting your data during development and debugging.

In [6]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DFBasicOps").getOrCreate()

data = [("Alice", 30, "New York"), ("Bob", 25, "London"), ("Charlie", 35, "Paris")]
columns = ["Name", "Age", "City"]
df = spark.createDataFrame(data, columns)

print("--- Show entire DataFrame (default 20 rows, truncated) ---")
df.show()

print("--- Show first N rows ---")
df.show(2) # Shows first 2 rows

print("--- Show without truncating column values ---")
df.show(truncate=False) # Important for seeing full column values

spark.stop()

--- Show entire DataFrame (default 20 rows, truncated) ---
+-------+---+--------+
|   Name|Age|    City|
+-------+---+--------+
|  Alice| 30|New York|
|    Bob| 25|  London|
|Charlie| 35|   Paris|
+-------+---+--------+

--- Show first N rows ---
+-----+---+--------+
| Name|Age|    City|
+-----+---+--------+
|Alice| 30|New York|
|  Bob| 25|  London|
+-----+---+--------+
only showing top 2 rows

--- Show without truncating column values ---
+-------+---+--------+
|Name   |Age|City    |
+-------+---+--------+
|Alice  |30 |New York|
|Bob    |25 |London  |
|Charlie|35 |Paris   |
+-------+---+--------+



#### D. `printSchema()`

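`printSchema()` prints the schema (column names, data types, and nullability) of the DataFrame in a tree-like format. Unlike `show()`, it is not a Spark action: it reads only the DataFrame's metadata and does not process the data itself.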


In [7]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

spark = SparkSession.builder.appName("DFBasicOps").getOrCreate()

# Example with a more complex schema including an ArrayType
data = [
    ("Alice", 30, "New York", ["reading", "hiking"]),
    ("Bob", 25, "London", ["gaming"]),
    ("Charlie", 35, "Paris", [])
]
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True),
    StructField("Hobbies", ArrayType(StringType()), True)
])

df = spark.createDataFrame(data, schema=schema)

print("--- DataFrame Schema ---")
df.printSchema() # Crucial for understanding data structure and for debugging type issues.

spark.stop()

--- DataFrame Schema ---
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- City: string (nullable = true)
 |-- Hobbies: array (nullable = true)
 |    |-- element: string (containsNull = true)



**Tip:** `printSchema()` is one of your most valuable tools! Use it after reading data or performing complex transformations to verify data types and structure. Incorrect data types can lead to errors or performance bottlenecks.