# **Importing the Dataset from Kaggle**

In this section, we mount Google Drive to read the Kaggle API token stored in it.  
Using the Kaggle API, we download the **Superstore Sales Dataset**, a retail dataset of a global superstore collected over four years.  
After downloading, we unzip the archive so that we can start exploring and analyzing the data with PySpark.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/kaggle/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
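
The three shell commands above can also be written in plain Python, which is convenient when the setup needs to be re-run safely. The following is a minimal sketch, not part of the original notebook: the helper name `install_kaggle_token` is hypothetical, and the Drive path is simply the one used in the cell above.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(source: Path, kaggle_dir: Path) -> Path:
    """Copy a Kaggle API token into kaggle_dir and restrict it to the owner."""
    # Same effect as `mkdir -p`: no error if the directory already exists
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    target = kaggle_dir / "kaggle.json"
    shutil.copyfile(source, target)
    # Same effect as `chmod 600`: read/write for the owner only
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target


# Path taken from the cell above; the call is skipped if Drive is not mounted
source = Path("/content/drive/MyDrive/kaggle/kaggle.json")
if source.exists():
    install_kaggle_token(source, Path.home() / ".kaggle")
```

Restricting the file to the owner matters because the token grants access to the Kaggle account.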

In [3]:
!kaggle datasets download -d rohitsahoo/sales-forecasting

Dataset URL: https://www.kaggle.com/datasets/rohitsahoo/sales-forecasting
License(s): GPL-2.0
Downloading sales-forecasting.zip to /content
  0% 0.00/480k [00:00<?, ?B/s]
100% 480k/480k [00:00<00:00, 1.42GB/s]


In [4]:
!unzip -o sales-forecasting.zip

Archive:  sales-forecasting.zip
  inflating: train.csv               
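
If a shell is not available, the same extraction can be done with Python's standard `zipfile` module. This is a small sketch under the assumption that the archive sits in the current working directory, as in the cell above; the helper name `extract_archive` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archive(archive: Path, dest: Path) -> list[str]:
    """Extract every member of archive into dest and return the member names."""
    with zipfile.ZipFile(archive) as zf:
        # Existing files are overwritten, similar to `unzip -o`
        zf.extractall(dest)
        return zf.namelist()


# Archive name taken from the download step; skipped if the file is absent
archive = Path("sales-forecasting.zip")
if archive.exists():
    print(extract_archive(archive, Path(".")))
```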


# **Reading CSV File**

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SalesAnalysis").getOrCreate()

In [7]:
# Define the dataset path
path = "/content/train.csv"

# Read the CSV file into a Spark DataFrame
df = spark.read.csv(path, header=True, inferSchema=True)

# Display the first 5 rows of the DataFrame
df.show(5)

+------+--------------+----------+----------+--------------+-----------+---------------+---------+-------------+---------------+----------+-----------+------+---------------+---------------+------------+--------------------+--------+
|Row ID|      Order ID|Order Date| Ship Date|     Ship Mode|Customer ID|  Customer Name|  Segment|      Country|           City|     State|Postal Code|Region|     Product ID|       Category|Sub-Category|        Product Name|   Sales|
+------+--------------+----------+----------+--------------+-----------+---------------+---------+-------------+---------------+----------+-----------+------+---------------+---------------+------------+--------------------+--------+
|     1|CA-2017-152156|08/11/2017|11/11/2017|  Second Class|   CG-12520|    Claire Gute| Consumer|United States|      Henderson|  Kentucky|      42420| South|FUR-BO-10001798|      Furniture|   Bookcases|Bush Somerset Col...|  261.96|
|     2|CA-2017-152156|08/11/2017|11/11/2017|  Second Class|   C

In [14]:
# Number of rows
df.count()

9800

In [15]:
# Show summary statistics (count, mean, stddev, min, max) for numeric and string columns
df.describe().show()

+-------+------------------+--------------+--------------+-----------+------------------+-----------+-------------+--------+-------+------------------+-------+---------------+----------+------------+--------------------+-----------------+
|summary|            Row ID|      Order ID|     Ship Mode|Customer ID|     Customer Name|    Segment|      Country|    City|  State|       Postal Code| Region|     Product ID|  Category|Sub-Category|        Product Name|            Sales|
+-------+------------------+--------------+--------------+-----------+------------------+-----------+-------------+--------+-------+------------------+-------+---------------+----------+------------+--------------------+-----------------+
|  count|              9800|          9800|          9800|       9800|              9800|       9800|         9800|    9800|   9800|              9789|   9800|           9800|      9800|        9800|                9800|             9508|
|   mean|            4900.5|          NULL| 

In [8]:
# Print the schema (structure) of the DataFrame
df.printSchema()

root
 |-- Row ID: integer (nullable = true)
 |-- Order ID: string (nullable = true)
 |-- Order Date: string (nullable = true)
 |-- Ship Date: string (nullable = true)
 |-- Ship Mode: string (nullable = true)
 |-- Customer ID: string (nullable = true)
 |-- Customer Name: string (nullable = true)
 |-- Segment: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Postal Code: integer (nullable = true)
 |-- Region: string (nullable = true)
 |-- Product ID: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Sub-Category: string (nullable = true)
 |-- Product Name: string (nullable = true)
 |-- Sales: string (nullable = true)



In [9]:
# Check column names with their data types
df.dtypes

[('Row ID', 'int'),
 ('Order ID', 'string'),
 ('Order Date', 'string'),
 ('Ship Date', 'string'),
 ('Ship Mode', 'string'),
 ('Customer ID', 'string'),
 ('Customer Name', 'string'),
 ('Segment', 'string'),
 ('Country', 'string'),
 ('City', 'string'),
 ('State', 'string'),
 ('Postal Code', 'int'),
 ('Region', 'string'),
 ('Product ID', 'string'),
 ('Category', 'string'),
 ('Sub-Category', 'string'),
 ('Product Name', 'string'),
 ('Sales', 'string')]

The **Segment** column has 3 unique values: *Consumer*, *Corporate*, and *Home Office*.

- **Consumer**: individual buyers purchasing products for personal use.  
- **Corporate**: businesses or companies purchasing in bulk for their offices or employees.  
- **Home Office**: self-employed individuals or small businesses run from home, such as freelancers or small startups.

In [10]:
# Get unique values from the Segment column
df.select("Segment").distinct().show()

+-----------+
|    Segment|
+-----------+
|   Consumer|
|Home Office|
|  Corporate|
+-----------+



In this section, we correct the data types of some columns.  
As the schema above shows, **"Order Date"** and **"Ship Date"** were read as strings and should be converted to dates.  
**"Sales"** was also read as a string and should be a double for numerical calculations. **"Postal Code"** was already inferred as an integer, so casting it only makes the type explicit.

In [11]:
# Change the types of the columns to their correct types
from pyspark.sql.functions import col, to_date

# Convert numeric columns
df = df.withColumn("Sales", col("Sales").cast("double"))
df = df.withColumn("Postal Code", col("Postal Code").cast("int"))

# Convert date columns; the strings are day/month/year, hence the dd/MM/yyyy pattern
df = df.withColumn("Order Date", to_date(col("Order Date"), "dd/MM/yyyy"))
df = df.withColumn("Ship Date", to_date(col("Ship Date"), "dd/MM/yyyy"))
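
To see why the day-first pattern matters, the sketch below parses the first row's dates (`08/11/2017` and `11/11/2017`, taken from the `df.show(5)` output) both ways using Python's standard `datetime` module. It is an illustration in plain Python rather than Spark code, and the two helper names are hypothetical.

```python
from datetime import date, datetime


def parse_day_first(text: str) -> date:
    """Parse a day/month/year string (dd/MM/yyyy in Spark's notation)."""
    return datetime.strptime(text, "%d/%m/%Y").date()


def parse_month_first(text: str) -> date:
    """Parse the same string as month/day/year, for comparison only."""
    return datetime.strptime(text, "%m/%d/%Y").date()


order, ship = "08/11/2017", "11/11/2017"

# Day-first reading: ordered 8 November, shipped 11 November
day_first_gap = (parse_day_first(ship) - parse_day_first(order)).days
# Month-first reading: ordered 11 August, shipped 11 November
month_first_gap = (parse_month_first(ship) - parse_month_first(order)).days

print(day_first_gap, month_first_gap)  # prints: 3 92
```

A three-day gap looks more plausible for a "Second Class" shipment than 92 days, but a single row cannot prove the format. A date whose first field is greater than 12 would settle it.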

In [13]:
# Check column names with their data types
df.dtypes

[('Row ID', 'int'),
 ('Order ID', 'string'),
 ('Order Date', 'date'),
 ('Ship Date', 'date'),
 ('Ship Mode', 'string'),
 ('Customer ID', 'string'),
 ('Customer Name', 'string'),
 ('Segment', 'string'),
 ('Country', 'string'),
 ('City', 'string'),
 ('State', 'string'),
 ('Postal Code', 'int'),
 ('Region', 'string'),
 ('Product ID', 'string'),
 ('Category', 'string'),
 ('Sub-Category', 'string'),
 ('Product Name', 'string'),
 ('Sales', 'double')]

In [17]:
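# Count missing values in every column (empty strings also count for string columns)
def check_nulls(df):
    string_cols = {name for name, dtype in df.dtypes if dtype == "string"}
    for c in df.columns:
        is_missing = col(c).isNull()
        if c in string_cols:
            is_missing = is_missing | (col(c) == "")
        num_null = df.filter(is_missing).count()
        print(f"The column '{c}' has {num_null} null values.")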

check_nulls(df)

The column 'Row ID' has 0 null values.
The column 'Order ID' has 0 null values.
The column 'Order Date' has 0 null values.
The column 'Ship Date' has 0 null values.
The column 'Ship Mode' has 0 null values.
The column 'Customer ID' has 0 null values.
The column 'Customer Name' has 0 null values.
The column 'Segment' has 0 null values.
The column 'Country' has 0 null values.
The column 'City' has 0 null values.
The column 'State' has 0 null values.
The column 'Postal Code' has 11 null values.
The column 'Region' has 0 null values.
The column 'Product ID' has 0 null values.
The column 'Category' has 0 null values.
The column 'Sub-Category' has 0 null values.
The column 'Product Name' has 0 null values.
The column 'Sales' has 292 null values.


*   The **Sales** column is very important for our analysis, so we will drop any rows that contain null values in this column.
*   On the other hand, the **Postal Code** column is not as critical since we already have information about the city, state, and region. Therefore, we will replace its null values with a placeholder value of 0.


In [18]:
# Drop the rows where 'Sales' has null values
df = df.na.drop(subset=['Sales'])

# Replace null values in 'Postal Code' with a placeholder value of 0
df = df.na.fill({'Postal Code': 0})
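
To make the two rules concrete, the plain-Python sketch below applies them to three small rows. It mirrors the behavior of `na.drop(subset=['Sales'])` and `na.fill({'Postal Code': 0})` but is not Spark code. Only the first row's values come from the dataset preview; the other two rows and the helper name `clean_rows` are hypothetical.

```python
def clean_rows(rows):
    """Drop rows with a missing Sales value, then fill missing Postal Code with 0."""
    cleaned = []
    for row in rows:
        if row["Sales"] is None:
            continue  # mirrors na.drop(subset=['Sales'])
        fixed = dict(row)  # copy, so the input rows are left unchanged
        if fixed["Postal Code"] is None:
            fixed["Postal Code"] = 0  # mirrors na.fill({'Postal Code': 0})
        cleaned.append(fixed)
    return cleaned


rows = [
    {"Row ID": 1, "Postal Code": 42420, "Sales": 261.96},  # from the preview
    {"Row ID": 2, "Postal Code": None, "Sales": 10.0},     # hypothetical
    {"Row ID": 3, "Postal Code": 90036, "Sales": None},    # hypothetical
]

for row in clean_rows(rows):
    print(row)
```

Given the counts printed earlier (9800 rows, 292 of them with a null **Sales**), the real DataFrame should contain 9800 - 292 = 9508 rows after this step. Because 0 is only a placeholder, any later analysis grouped by postal code should exclude it.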

In [19]:
check_nulls(df)

The column 'Row ID' has 0 null values.
The column 'Order ID' has 0 null values.
The column 'Order Date' has 0 null values.
The column 'Ship Date' has 0 null values.
The column 'Ship Mode' has 0 null values.
The column 'Customer ID' has 0 null values.
The column 'Customer Name' has 0 null values.
The column 'Segment' has 0 null values.
The column 'Country' has 0 null values.
The column 'City' has 0 null values.
The column 'State' has 0 null values.
The column 'Postal Code' has 0 null values.
The column 'Region' has 0 null values.
The column 'Product ID' has 0 null values.
The column 'Category' has 0 null values.
The column 'Sub-Category' has 0 null values.
The column 'Product Name' has 0 null values.
The column 'Sales' has 0 null values.
