# PySpark HDFS Example
This notebook demonstrates writing a small DataFrame to HDFS as CSV and reading it back using PySpark.

In [1]:

from pyspark.sql import SparkSession

# Create the Spark session (or reuse one that is already running)
spark = SparkSession.builder \
    .appName("HDFSReadWriteExample") \
    .getOrCreate()


## Generate Sample DataFrame

In [2]:

# Sample data
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()


+-----+---+
| name|age|
+-----+---+
|Alice| 34|
|  Bob| 45|
|Cathy| 29|
+-----+---+



## Write DataFrame to HDFS

In [5]:

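# Write to HDFS in CSV format.
# Spark creates a directory at this path containing one part file per partition.
# No header row is written, because the header option is off by default.
hdfs_path = "hdfs://namenode:8020/user/jovyan/data/people"
df.write.mode("overwrite").csv(hdfs_path)
print("Data written to HDFS at:", hdfs_path)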


Data written to HDFS at: hdfs://namenode:8020/user/jovyan/data/people
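
The path in the cell above is a full HDFS URI. As a small illustration, the sketch below splits it into its parts using only Python's standard library (not a Spark or Hadoop API). Reading `namenode` as the NameNode host and `8020` as its RPC port is an assumption about this cluster, based on the usual convention; the variable names are chosen for this example only.

```python
from urllib.parse import urlparse

# Same URI as in the write cell above.
hdfs_path = "hdfs://namenode:8020/user/jovyan/data/people"

parts = urlparse(hdfs_path)

scheme = parts.scheme          # filesystem scheme
namenode_host = parts.hostname # assumed to be the NameNode host
namenode_port = parts.port     # assumed to be the NameNode RPC port
directory = parts.path         # output directory inside HDFS

print(scheme, namenode_host, namenode_port, directory)
```

If the cluster's default filesystem is already HDFS, a bare path such as `/user/jovyan/data/people` would typically resolve to the same location, though that depends on the Hadoop configuration in use.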


## Read DataFrame from HDFS

In [6]:

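# Read back from HDFS.
# The files have no header row, so Spark assigns the default column names _c0 and _c1.
# Row order can differ from the original because the part files are read in parallel.
df_read = spark.read.csv(hdfs_path, inferSchema=True, header=False)
df_read.show()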


+-----+---+
|  _c0|_c1|
+-----+---+
|Alice| 34|
|Cathy| 29|
|  Bob| 45|
+-----+---+



## Done