
In [1]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("PySparkExample").getOrCreate()

# Create a sample DataFrame
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
columns = ["name", "id"]
df = spark.createDataFrame(data, columns)

# Show the original DataFrame (show() is itself an action, so it triggers a Spark job)
print("Original DataFrame:")
df.show()

# Transformation: Select a column
selected_df = df.select("name")
print("DataFrame after selecting 'name' column:")
selected_df.show()

# Transformation: Filter rows
filtered_df = df.filter(df["id"] > 1)
print("DataFrame after filtering for id > 1:")
filtered_df.show()

# Action: Count the number of rows
count = df.count()
print(f"Number of rows in the original DataFrame: {count}")

# Action: Collect all rows to the driver as a list of Row objects (safe only for small DataFrames)
collected_data = df.collect()
print("Collected data:")
for row in collected_data:
    print(row)

# Stop the SparkSession
spark.stop()

Original DataFrame:
+-------+---+
|   name| id|
+-------+---+
|  Alice|  1|
|    Bob|  2|
|Charlie|  3|
+-------+---+

DataFrame after selecting 'name' column:
+-------+
|   name|
+-------+
|  Alice|
|    Bob|
|Charlie|
+-------+

DataFrame after filtering for id > 1:
+-------+---+
|   name| id|
+-------+---+
|    Bob|  2|
|Charlie|  3|
+-------+---+

Number of rows in the original DataFrame: 3
Collected data:
Row(name='Alice', id=1)
Row(name='Bob', id=2)
Row(name='Charlie', id=3)
