**What is PySpark?**

PySpark is the Python API for Apache Spark. It lets Python developers use Spark for large-scale, distributed data processing, combining Spark's scalability and speed with Python's simplicity. That combination makes it a popular choice among data engineers, data scientists, and analysts.

**Advantages of PySpark**

1. Open-source and free to use.
2. Distributes processing across a cluster, so it scales to datasets too large for a single machine.
3. Combines the simplicity of Python with the performance of Spark.
4. Extensive community support and documentation.

In [1]:
!pip install pyspark



In [2]:
import pyspark
print(pyspark.__version__)

3.5.3


In [8]:
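# Create (or reuse) a SparkSession, the entry point for the DataFrame API.
# The app name is an arbitrary label chosen for this notebook.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySparkIntro").getOrCreate()

# Sample data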
data = [("Alice", 29), ("Bob", 35), ("Cathy", 45)]
columns = ["Name", "Age"]

# Create DataFrame
df = spark.createDataFrame(data, schema=columns)

# Show the DataFrame
df.show()

+-----+---+
| Name|Age|
+-----+---+
|Alice| 29|
|  Bob| 35|
|Cathy| 45|
+-----+---+



In [9]:
# Filter rows where age > 30
filtered_df = df.filter(df.Age > 30)
filtered_df.show()

+-----+---+
| Name|Age|
+-----+---+
|  Bob| 35|
|Cathy| 45|
+-----+---+



In [10]:
# Select the Name column
name_df = df.select("Name")
name_df.show()

+-----+
| Name|
+-----+
|Alice|
|  Bob|
|Cathy|
+-----+



In [11]:
# Add a new column 'Age_in_5_years'
df_with_new_column = df.withColumn("Age_in_5_years", df.Age + 5)
df_with_new_column.show()

+-----+---+--------------+
| Name|Age|Age_in_5_years|
+-----+---+--------------+
|Alice| 29|            34|
|  Bob| 35|            40|
|Cathy| 45|            50|
+-----+---+--------------+



In [16]:
# Find the maximum salary.
# agg_df is a DataFrame with a Salary column, defined in a cell not included here.
agg_df.agg({"Salary": "max"}).show()

+-----------+
|max(Salary)|
+-----------+
|       6000|
+-----------+



In [17]:
# Count the number of rows
count = df.count()
print(f"Row Count: {count}")

Row Count: 3
