
# 🧠 Pandas vs PySpark: CSV Reading & Operations (Side-by-Side Practice)

This notebook shows **Pandas** and **PySpark** equivalents for reading and manipulating CSV files.

👉 You can run this locally in Jupyter after installing the packages below (PySpark also needs a Java runtime on the machine):
```bash
pip install pandas pyspark
```


In [1]:

# =============================================================
# 📁 Create a sample CSV file (emp.csv)
# =============================================================

csv_data = '''eno,ename,deptno,salary,doj
1,Amit,10,50000,2021-01-10
2,Neha,20,60000,2021-02-15
3,Ravi,10,55000,2021-03-12
4,Kiran,30,62000,2021-04-01
5,Meena,20,58000,2021-05-18
'''
with open("emp.csv", "w") as f:
    f.write(csv_data)

print("✅ emp.csv created")


✅ emp.csv created


In [2]:

# =============================================================
# 1️⃣ SETUP
# =============================================================

# --- Pandas ---
import pandas as pd

# --- PySpark ---
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, year, month, dayofmonth, dayofweek

spark = SparkSession.builder.appName("Pandas_vs_PySpark").getOrCreate()


## 2️⃣ Read CSV

In [3]:

# --- Pandas ---
pd_df = pd.read_csv("emp.csv")
print("Pandas:")
print(pd_df.head())

# --- PySpark ---
spark_df = spark.read.csv("emp.csv", header=True, inferSchema=True)
print("PySpark:")
spark_df.show()


Pandas:
   eno  ename  deptno  salary         doj
0    1   Amit      10   50000  2021-01-10
1    2   Neha      20   60000  2021-02-15
2    3   Ravi      10   55000  2021-03-12
3    4  Kiran      30   62000  2021-04-01
4    5  Meena      20   58000  2021-05-18
PySpark:
+---+-----+------+------+----------+
|eno|ename|deptno|salary|       doj|
+---+-----+------+------+----------+
|  1| Amit|    10| 50000|2021-01-10|
|  2| Neha|    20| 60000|2021-02-15|
|  3| Ravi|    10| 55000|2021-03-12|
|  4|Kiran|    30| 62000|2021-04-01|
|  5|Meena|    20| 58000|2021-05-18|
+---+-----+------+------+----------+



## 3️⃣ Select specific columns

In [4]:

# --- Pandas ---
pd_df2 = pd.read_csv("emp.csv", usecols=['eno','deptno'])
print("Pandas:")
print(pd_df2.head())

# --- PySpark ---
spark_df2 = spark.read.csv("emp.csv", header=True, inferSchema=True)
spark_df2 = spark_df2.select("eno","deptno")
print("PySpark:")
spark_df2.show()


Pandas:
   eno  deptno
0    1      10
1    2      20
2    3      10
3    4      30
4    5      20
PySpark:
+---+------+
|eno|deptno|
+---+------+
|  1|    10|
|  2|    20|
|  3|    10|
|  4|    30|
|  5|    20|
+---+------+



## 4️⃣ Data type conversion

In [None]:

from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, StringType

# --- Pandas ---
pd_df3 = pd.read_csv("emp.csv", dtype={'eno': int, 'salary': float})
print("Pandas:")
print(pd_df3.dtypes)

# --- PySpark ---
schema = StructType([
    StructField("eno", IntegerType(), True),
    StructField("ename", StringType(), True),
    StructField("deptno", IntegerType(), True),
    StructField("salary", FloatType(), True),
    StructField("doj", StringType(), True)
])
spark_df3 = spark.read.csv("emp.csv", header=True, schema=schema)
print("PySpark:")
spark_df3.printSchema()
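
The cell above fixes column types at read time. When a frame is already loaded, the conversion can be done afterwards instead. The following is a minimal pandas sketch, assuming an inline copy of two rows of the sample data; the names `sample_csv`, `df`, and `converted` are illustrative and not part of the notebook.

```python
import io
import pandas as pd

# Inline copy of the first rows of emp.csv so the sketch runs on its own
sample_csv = """eno,ename,deptno,salary,doj
1,Amit,10,50000,2021-01-10
2,Neha,20,60000,2021-02-15
"""

df = pd.read_csv(io.StringIO(sample_csv))

# astype() returns a new frame; columns not listed keep their inferred dtype
converted = df.astype({"salary": "float64", "deptno": "int32"})
print(converted.dtypes)
```

On the PySpark side, the usual counterpart is a column cast, for example `df.withColumn("salary", col("salary").cast("float"))`.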


## 5️⃣ Skipping rows / partial data

In [5]:

# --- Pandas ---
# skiprows=range(1, 3) skips the first two data rows but keeps the header (line 0).
# A plain skiprows=2 would also drop the header and promote a data row to column names.
pd_df4 = pd.read_csv("emp.csv", skiprows=range(1, 3), nrows=2)
print("Pandas (rows 3–4):")
print(pd_df4)

# --- PySpark ---
spark_df4 = spark.read.csv("emp.csv", header=True, inferSchema=True)
spark_df4 = spark_df4.limit(2)
print("PySpark (first 2 rows):")
spark_df4.show()


Pandas (rows 3–4):
   eno  ename  deptno  salary         doj
0    3   Ravi      10   55000  2021-03-12
1    4  Kiran      30   62000  2021-04-01
PySpark (first 2 rows):
+---+-----+------+------+----------+
|eno|ename|deptno|salary|       doj|
+---+-----+------+------+----------+
|  1| Amit|    10| 50000|2021-01-10|
|  2| Neha|    20| 60000|2021-02-15|
+---+-----+------+------+----------+



## 6️⃣ Parsing Dates

In [6]:

# --- Pandas ---
pd_df5 = pd.read_csv("emp.csv", parse_dates=['doj'])
print("Pandas date parts:")
print(pd_df5['doj'].dt.year.head())
print(pd_df5['doj'].dt.month.head())

# --- PySpark ---
spark_df5 = spark.read.csv("emp.csv", header=True, inferSchema=True)
spark_df5 = (
    spark_df5.withColumn("doj", to_date("doj", "yyyy-MM-dd"))
             .withColumn("year", year("doj"))
             .withColumn("month", month("doj"))
             .withColumn("day", dayofmonth("doj"))
             .withColumn("weekday", dayofweek("doj"))
)
print("PySpark date parts:")
spark_df5.select("doj","year","month","day","weekday").show()


Pandas date parts:
0    2021
1    2021
2    2021
3    2021
4    2021
Name: doj, dtype: int32
0    1
1    2
2    3
3    4
4    5
Name: doj, dtype: int32
PySpark date parts:
+----------+----+-----+---+-------+
|       doj|year|month|day|weekday|
+----------+----+-----+---+-------+
|2021-01-10|2021|    1| 10|      1|
|2021-02-15|2021|    2| 15|      2|
|2021-03-12|2021|    3| 12|      6|
|2021-04-01|2021|    4|  1|      5|
|2021-05-18|2021|    5| 18|      3|
+----------+----+-----+---+-------+
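
The `weekday` column above uses Spark's `dayofweek` numbering (1 = Sunday through 7 = Saturday), whereas pandas' `dt.dayofweek` counts from 0 = Monday. A small pandas-only sketch of converting between the two, using the dates from the sample file; the names `pandas_style` and `spark_style` are illustrative.

```python
import pandas as pd

# The five joining dates from emp.csv
doj = pd.Series(pd.to_datetime(
    ["2021-01-10", "2021-02-15", "2021-03-12", "2021-04-01", "2021-05-18"]
))

pandas_style = doj.dt.dayofweek            # 0 = Monday ... 6 = Sunday
spark_style = (pandas_style + 1) % 7 + 1   # 1 = Sunday ... 7 = Saturday

print(pd.DataFrame({"doj": doj, "pandas": pandas_style, "spark_style": spark_style}))
```

The `spark_style` values match the `weekday` column in the PySpark output, which makes cross-checking results between the two libraries easier.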



## 7️⃣ Handling Header & Names

In [7]:

# --- Pandas ---
pd_df6 = pd.read_csv("emp.csv", header=0, names=['E_No','E_Name','Dept_No','Sal','DOJ'])
print("Pandas:")
print(pd_df6.head())

# --- PySpark ---
# header=True consumes the original header row; toDF() then renames the columns.
# With header=False the header line would be read back as a data row.
spark_df6 = spark.read.csv("emp.csv", header=True, inferSchema=True)
spark_df6 = spark_df6.toDF("E_No","E_Name","Dept_No","Sal","DOJ")
print("PySpark:")
spark_df6.show()


Pandas:
   E_No E_Name  Dept_No    Sal         DOJ
0     1   Amit       10  50000  2021-01-10
1     2   Neha       20  60000  2021-02-15
2     3   Ravi       10  55000  2021-03-12
3     4  Kiran       30  62000  2021-04-01
4     5  Meena       20  58000  2021-05-18
PySpark:
+----+------+-------+-----+----------+
|E_No|E_Name|Dept_No|  Sal|       DOJ|
+----+------+-------+-----+----------+
|   1|  Amit|     10|50000|2021-01-10|
|   2|  Neha|     20|60000|2021-02-15|
|   3|  Ravi|     10|55000|2021-03-12|
|   4| Kiran|     30|62000|2021-04-01|
|   5| Meena|     20|58000|2021-05-18|
+----+------+-------+-----+----------+



## 8️⃣ Encoding & Compression

In [None]:

# --- Pandas ---
pd_df7 = pd.read_csv("emp.csv", encoding='utf-8')
print("Pandas read with encoding:")
print(pd_df7.head())

# --- PySpark ---
spark_df7 = spark.read.option("encoding","UTF-8").csv("emp.csv", header=True, inferSchema=True)
print("PySpark read with encoding:")
spark_df7.show()
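
The cell above covers encoding only. For the compression half of this section, here is a pandas-only sketch that writes a gzip-compressed copy of a small frame and reads it back. The file name `emp_demo.csv.gz` and the two-row frame are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "eno": [1, 2],
    "ename": ["Amit", "Neha"],
    "salary": [50000, 60000],
})

# Explicit compression on write ...
df.to_csv("emp_demo.csv.gz", index=False, compression="gzip")

# ... and on read. The default compression="infer" would also work here
# because the file name ends in .gz
restored = pd.read_csv("emp_demo.csv.gz", compression="gzip")
print(restored)
```

Spark chooses the codec from the file extension when reading, so `spark.read.csv("emp_demo.csv.gz", header=True)` should not need an extra option.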


## 9️⃣ Writing CSV

In [5]:

# --- Pandas ---
pd_df.to_csv("output_pandas.csv", index=False)
print("✅ Pandas written file: output_pandas.csv")

# --- PySpark ---
spark_df.write.option("header", True).mode("overwrite").csv("output_spark")
print("✅ PySpark written folder: output_spark/")


✅ Pandas written file: output_pandas.csv
✅ PySpark written folder: output_spark/



# ✅ Summary Table

| Feature | Pandas | PySpark |
|----------|---------|---------|
| Read CSV | `pd.read_csv()` | `spark.read.csv()` |
| Select Columns | `usecols=['A','B']` | `.select("A","B")` |
| Dtype | `dtype={'A':int}` | `schema=StructType([...])` |
| Skip Rows | `skiprows=10` | No direct equivalent; filter manually |
| nrows | `nrows=10` | `.limit(10)` |
| Parse Dates | `parse_dates=['col']` | `to_date()` + date funcs |
| Header | `header=0` | `.option("header",True)` |
| Names | `names=[...]` | `.toDF("col1","col2")` |
| Encoding | `encoding='utf-8'` | `.option("encoding","utf-8")` |
| Compression | `compression='gzip'` | Inferred from the file extension (e.g. `.gz`) |
| Write CSV | `to_csv(index=False)` | `.write.option("header",True).csv()` |


In [9]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print("✅ PySpark version:", pyspark.__version__)
print("✅ Spark version:", spark.version)


✅ PySpark version: 4.0.0
✅ Spark version: 4.0.0


## Read