In [170]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

In [171]:
spark = SparkSession.builder.appName("Python Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()

#### 1. **Dataset Understanding & Schema Validation**

1.	What is the schema of the dataset?
2.	Are column data types correctly inferred (dates, numerics, strings)?
3.	How many rows and columns are present?
4.	Which columns are categorical, numerical, and date/time?
5.	Are there any columns that should be cast to different types (e.g., string → date, string → double)?
6.	Is the dataset wide or tall, and does that affect processing strategy?
7.	Are column names consistent (no spaces, special characters, casing issues)?

In [172]:
df = spark.read.csv("superstore.csv",header=True,inferSchema=True)
df.show(5)

+------+--------------+----------+----------+--------------+-----------+---------------+---------+-------------+---------------+----------+-----------+------+---------------+---------------+------------+--------------------+--------+--------+--------+--------+
|Row ID|      Order ID|Order Date| Ship Date|     Ship Mode|Customer ID|  Customer Name|  Segment|      Country|           City|     State|Postal Code|Region|     Product ID|       Category|Sub-Category|        Product Name|   Sales|Quantity|Discount|  Profit|
+------+--------------+----------+----------+--------------+-----------+---------------+---------+-------------+---------------+----------+-----------+------+---------------+---------------+------------+--------------------+--------+--------+--------+--------+
|     1|CA-2016-152156| 11/8/2016|11/11/2016|  Second Class|   CG-12520|    Claire Gute| Consumer|United States|      Henderson|  Kentucky|      42420| South|FUR-BO-10001798|      Furniture|   Bookcases|Bush Somerset 

1.	What is the schema of the dataset?

In [173]:
df.printSchema()

root
 |-- Row ID: integer (nullable = true)
 |-- Order ID: string (nullable = true)
 |-- Order Date: string (nullable = true)
 |-- Ship Date: string (nullable = true)
 |-- Ship Mode: string (nullable = true)
 |-- Customer ID: string (nullable = true)
 |-- Customer Name: string (nullable = true)
 |-- Segment: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Postal Code: integer (nullable = true)
 |-- Region: string (nullable = true)
 |-- Product ID: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Sub-Category: string (nullable = true)
 |-- Product Name: string (nullable = true)
 |-- Sales: string (nullable = true)
 |-- Quantity: string (nullable = true)
 |-- Discount: string (nullable = true)
 |-- Profit: double (nullable = true)



2.	Are column data types correctly inferred (dates, numerics, strings)?

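No. `Sales`, `Quantity`, and `Discount` were inferred as strings instead of numerics, and `Order Date` and `Ship Date` were inferred as strings instead of dates. I change these types before continuing.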
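
When a numeric-looking column is inferred as a string, some raw values in it most likely do not parse as numbers, and a cast can silently turn them into nulls. A minimal sketch of a pre-cast check in plain Python follows. The sample values and the helper name `find_unparseable` are hypothetical, and Python's `float()` only approximates Spark's cast rules.

```python
def find_unparseable(values):
    """Return the non-null values that cannot be parsed as a float."""
    bad = []
    for value in values:
        if value is None:
            continue  # nulls stay null after a cast, so they are not errors
        try:
            float(value.strip())
        except ValueError:
            bad.append(value)
    return bad


# Hypothetical sample values, not taken from superstore.csv.
sample_sales = ["261.96", "731.94", None, "Bookcases", " 14.62 "]
print(find_unparseable(sample_sales))
```

On the DataFrame itself, a comparable filter would be `df.filter(col("Sales").isNotNull() & col("Sales").cast("double").isNull())`, assuming ANSI mode is disabled so that a failed cast yields null rather than an error.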

In [174]:
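# Sales, Quantity, and Discount were inferred as strings; cast them to float.
df = df.withColumn("Sales", col("Sales").astype("float"))
df = df.withColumn("Quantity", col("Quantity").astype("float"))
df = df.withColumn("Discount", col("Discount").astype("float"))
# Profit is already double; casting to float narrows it to single precision.
df = df.withColumn("Profit", col("Profit").astype("float"))

# Split the date strings on "/" into [month, day, year] arrays of strings.
df = df.withColumn("Order Date", split("Order Date", "/"))
df = df.withColumn("Ship Date", split("Ship Date", "/"))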

df.show(5)

+------+--------------+--------------+--------------+--------------+-----------+---------------+---------+-------------+---------------+----------+-----------+------+---------------+---------------+------------+--------------------+--------+--------+--------+--------+
|Row ID|      Order ID|    Order Date|     Ship Date|     Ship Mode|Customer ID|  Customer Name|  Segment|      Country|           City|     State|Postal Code|Region|     Product ID|       Category|Sub-Category|        Product Name|   Sales|Quantity|Discount|  Profit|
+------+--------------+--------------+--------------+--------------+-----------+---------------+---------+-------------+---------------+----------+-----------+------+---------------+---------------+------------+--------------------+--------+--------+--------+--------+
|     1|CA-2016-152156| [11, 8, 2016]|[11, 11, 2016]|  Second Class|   CG-12520|    Claire Gute| Consumer|United States|      Henderson|  Kentucky|      42420| South|FUR-BO-10001798|      Furni

3.	How many rows and columns are present?

In [178]:
print("Number of Rows      :  ", df.count())
print("Number of Columns   :  ", len(df.columns))

Number of Rows      :   9994
Number of Columns   :   21


4.	Which columns are categorical, numerical, and date/time?

Categorical Columns:
- Order ID
- Ship Mode
- Customer ID
- Customer Name
- Segment
- Country
- City
- State
- Postal Code (inferred as integer, but it is a geographic code, not a quantity)
- Region
- Category
- Sub-Category

Numerical Columns:
- Sales
- Quantity
- Discount
- Profit

Date/Time Columns:
- Order Date
- Ship Date

Other Columns (identifiers and free text):
- Row ID
- Product ID
- Product Name
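
The numeric and string groups above can be cross-checked against the inferred types. A minimal sketch, assuming the `(name, type)` pairs that PySpark exposes as `df.dtypes`. The pairs below are a subset copied by hand from the `printSchema()` output, and `group_by_dtype` is a hypothetical helper.

```python
from collections import defaultdict

# Spark type names treated as numeric in this sketch.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def group_by_dtype(dtypes):
    """Group column names into 'numeric', 'string', and 'other' buckets."""
    groups = defaultdict(list)
    for name, dtype in dtypes:
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal"):
            groups["numeric"].append(name)
        elif dtype == "string":
            groups["string"].append(name)
        else:
            groups["other"].append(name)
    return dict(groups)


# Subset of the schema as inferred before any casting.
inferred = [
    ("Row ID", "int"),
    ("Order Date", "string"),
    ("Postal Code", "int"),
    ("Sales", "string"),
    ("Profit", "double"),
]
print(group_by_dtype(inferred))
```

A type-based grouping is only a starting point: `Row ID` and `Postal Code` land in the numeric bucket even though neither is a quantity worth aggregating.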

5.	Are there any columns that should be cast to different types (e.g., string → date, string → double)?

Yes, several columns needed a different type.

I cast the columns below from string to float:
- Sales
- Quantity
- Discount

Profit was already inferred as double; the same cell also casts it to float.

I split the columns below from string into arrays of [month, day, year] strings (they are actually dates):
- Order Date
- Ship Date
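
Splitting on `/` keeps the month, day, and year as strings, so the two columns still cannot be sorted or subtracted as dates. A minimal sketch of parsing the same month/day/year strings into real dates in plain Python follows. `parse_us_date` is a hypothetical helper, and the two sample values come from the first row shown earlier.

```python
from datetime import datetime


def parse_us_date(text):
    """Parse a month/day/year string such as '11/8/2016' into a date."""
    return datetime.strptime(text, "%m/%d/%Y").date()


order_date = parse_us_date("11/8/2016")
ship_date = parse_us_date("11/11/2016")

# With real dates, the shipping delay is a simple subtraction.
delay_days = (ship_date - order_date).days
print(order_date, ship_date, delay_days)
```

In PySpark the analogous step would be `to_date(col("Order Date"), "M/d/yyyy")` applied to the original string column, assuming Spark 3 datetime patterns. That yields a `date` column instead of an array.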

6.	Is the dataset wide or tall, and does that affect processing strategy?

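Tall: 9,994 rows against 21 columns. At this size the shape has little effect on processing strategy, because the whole dataset fits comfortably in memory on a single machine.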

7.	Are column names consistent (no spaces, special characters, casing issues)?
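
One way to approach this question is to flag names that contain spaces or characters other than letters, digits, and underscores, and to propose a normalized form for each. A minimal sketch in plain Python, using a few column names from the schema above; `needs_renaming` and `to_snake_case` are hypothetical helpers.

```python
import re


def needs_renaming(name):
    """True if the name contains anything besides letters, digits, or '_'."""
    return re.search(r"[^0-9A-Za-z_]", name) is not None


def to_snake_case(name):
    """Lowercase a name and replace runs of other characters with '_'."""
    return re.sub(r"[^0-9A-Za-z]+", "_", name.strip()).strip("_").lower()


columns = ["Row ID", "Order Date", "Sub-Category", "Sales", "Profit"]

# Map each problematic name to a suggested replacement.
flagged = {c: to_snake_case(c) for c in columns if needs_renaming(c)}
print(flagged)
```

The normalized names could then be applied in one step with `df.toDF(*[to_snake_case(c) for c in df.columns])`.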