## Exploratory Data Analysis on PySpark DataFrame

- `pyspark.sql.SparkSession.read.csv` reads a file from storage and converts it to a DataFrame.

- The `com.databricks.spark.csv` package, developed by Databricks and merged into Spark as of the 2.x releases, performs the conversion to a DataFrame behind the scenes.


In [1]:
import findspark
findspark.init()
import pyspark

from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.appName("EDA-App").getOrCreate()

import warnings
warnings.filterwarnings("ignore")

spark

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/12/25 20:51:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
col_filamentType = StructField("FilamentType", StringType(), True)
col_bulbPower = StructField("BulbPower", StringType(), True)
col_lifeInHours = StructField("LifeInHours", DoubleType(), True)

filament_data_schema = StructType([col_filamentType, col_bulbPower, col_lifeInHours])

In [3]:
filament_df = spark.read.csv(
    "file:////home/ashru/filament_data.csv",
    header=True,
    schema=filament_data_schema,
    mode="DROPMALFORMED",  # drop rows that cannot be parsed against the schema
)


filament_df.show()

22/12/25 20:58:37 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: filamentType, BulbPower, Life
 Schema: FilamentType, BulbPower, LifeInHours
Expected: LifeInHours but found: Life
CSV file: file:///run/media/ashrulochan/Sector8/Books-and-Notes/To_be_continued/PySparkRecipes/filament_data.csv
+------------+---------+-----------+
|FilamentType|BulbPower|LifeInHours|
+------------+---------+-----------+
|   filamentA|     100W|      605.0|
|   filamentB|     100W|      683.0|
|   filamentB|     100W|      691.0|
|   filamentB|     200W|      561.0|
|   filamentA|     200W|      530.0|
|   filamentA|     100W|      619.0|
|   filamentB|     100W|      686.0|
|   filamentB|     200W|      600.0|
|   filamentB|     100W|      696.0|
|   filamentA|     200W|      579.0|
|   filamentA|     200W|      520.0|
|   filamentA|     100W|      622.0|
|   filamentA|     100W|      668.0|
|   filamentB|     200W|      569.0|
|   filamentB|     200W|      555.0|
|   filamentA|
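
The `CSVHeaderChecker` warning above reports that the file's header (`filamentType, BulbPower, Life`) does not match the schema names. The plain-Python sketch below shows one way to spot such a mismatch before loading. The helper `header_mismatches` is hypothetical, and the case-insensitive comparison is an assumption chosen to mirror the warning, which flags only `Life` and not `filamentType`.

```python
import csv
import io


def header_mismatches(header_line, schema_names, case_sensitive=False):
    """Return (position, expected, found) for each header name that differs.

    Hypothetical helper for illustration. Only positions present in both
    lists are compared; a differing column count is not reported.
    """
    header = [h.strip() for h in next(csv.reader(io.StringIO(header_line)))]
    normalize = (lambda s: s) if case_sensitive else str.lower
    return [
        (i, expected, found)
        for i, (expected, found) in enumerate(zip(schema_names, header))
        if normalize(expected) != normalize(found)
    ]


schema_names = ["FilamentType", "BulbPower", "LifeInHours"]
mismatches = header_mismatches("filamentType,BulbPower,Life", schema_names)
print(mismatches)  # [(2, 'LifeInHours', 'Life')]
```

Because the schema is supplied explicitly, the DataFrame still uses the schema's column names, which is why the table shows `LifeInHours` rather than `Life`.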

                                                                                

In [4]:
filament_df.printSchema()

root
 |-- FilamentType: string (nullable = true)
 |-- BulbPower: string (nullable = true)
 |-- LifeInHours: double (nullable = true)



In [5]:
filament_df.describe().show()

22/12/25 21:04:46 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: filamentType, BulbPower, Life
 Schema: FilamentType, BulbPower, LifeInHours
Expected: LifeInHours but found: Life
CSV file: file:///run/media/ashrulochan/Sector8/Books-and-Notes/To_be_continued/PySparkRecipes/filament_data.csv


                                                                                

+-------+------------+---------+-----------------+
|summary|FilamentType|BulbPower|      LifeInHours|
+-------+------------+---------+-----------------+
|  count|          16|       16|               16|
|   mean|        null|     null|         607.8125|
| stddev|        null|     null|61.11652122517009|
|    min|   filamentA|     100W|            520.0|
|    max|   filamentB|     200W|            696.0|
+-------+------------+---------+-----------------+



In [7]:
# number of data points from both filament types

print("filament type A:",filament_df.filter(filament_df.FilamentType=='filamentA').count())
print("filament type B:",filament_df.filter(filament_df.FilamentType=='filamentB').count())

filament type A: 8
filament type B: 8


In [8]:
# number of bulbs with power 100W
print(filament_df.filter(filament_df.BulbPower == '100W').count())

# number of bulbs with power 200W
print(filament_df.filter(filament_df.BulbPower == '200W').count())

8
8


In [9]:
# count of data points with filament type = B and bulb power = 100W
print("Filament Type B & Bulb Power 100W:",
      filament_df.filter((filament_df.FilamentType == 'filamentB') & (filament_df.BulbPower == '100W')).count())

# count of data points with filament type = B and bulb power = 200W
print("Filament Type B & Bulb Power 200W:",
      filament_df.filter((filament_df.FilamentType == 'filamentB') & (filament_df.BulbPower == '200W')).count())

# count of data points with filament type = A and bulb power = 100W
print("Filament Type A & Bulb Power 100W",
      filament_df.filter((filament_df.FilamentType == 'filamentA') & (filament_df.BulbPower == '100W')).count())

# count of data points with filament type = A and bulb power = 200W
print("Filament Type A & Bulb Power 200W",
      filament_df.filter((filament_df.FilamentType == 'filamentA') & (filament_df.BulbPower == '200W')).count())


Filament Type B & Bulb Power 100W: 4
Filament Type B & Bulb Power 200W: 4
Filament Type A & Bulb Power 100W 4
Filament Type A & Bulb Power 200W 4


### References

- [Getting CSV file into PySpark DataFrame](https://stackoverflow.com/questions/29936156/get-csv-to-spark-dataframe)