### Step 1: Setting Up the Spark Environment

In a real-world company setup, we wouldn’t use Google Colab directly. Instead, we would:

**1\. Deploy a Spark Cluster** (for example AWS EMR, GCP Dataproc, Azure HDInsight, or an on-prem Hadoop cluster).

**2\. Store Data in HDFS** instead of local storage.

*   Download the dataset from Kaggle (the data source) with `curl` into `~/olist/brazilian-ecommerce.zip`.
*   Unzip the archive into `~/olist/data/`, then upload the CSV files to HDFS (the full commands are listed in the Bash section below).

**3\. Use PySpark** to interact with data.

#### Commands to Run in Bash

```bash
# Create directories
mkdir -p ~/olist/data

# Download the dataset from Kaggle
curl -L -o ~/olist/brazilian-ecommerce.zip https://www.kaggle.com/api/v1/datasets/download/olistbr/brazilian-ecommerce

# Unzip the dataset into the data folder
unzip ~/olist/brazilian-ecommerce.zip -d ~/olist/data/

# Create a directory in HDFS
hadoop fs -mkdir -p /data/olist/

# Upload all CSV files to HDFS
hadoop fs -put ~/olist/data/*.csv /data/olist/
```

In [21]:
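from pyspark.sql import SparkSession
# Note: these wildcard imports shadow Python builtins such as sum, min, max, and round
# with their Spark column-function counterparts; the cells below rely on that.
from pyspark.sql.functions import *
from pyspark.sql.types import *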

In [1]:
# Initialize the Spark session

spark = SparkSession.builder \
    .appName("OlistData") \
    .getOrCreate()

25/05/13 03:06:58 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [2]:
# Checking Files

!hadoop fs -ls /data/olist/

Found 9 items
-rw-r--r--   2 priteshsingh101 hadoop    9033957 2025-05-12 13:14 /data/olist/olist_customers_dataset.csv
-rw-r--r--   2 priteshsingh101 hadoop   61273883 2025-05-12 13:14 /data/olist/olist_geolocation_dataset.csv
-rw-r--r--   2 priteshsingh101 hadoop   15438671 2025-05-12 13:14 /data/olist/olist_order_items_dataset.csv
-rw-r--r--   2 priteshsingh101 hadoop    5777138 2025-05-12 13:14 /data/olist/olist_order_payments_dataset.csv
-rw-r--r--   2 priteshsingh101 hadoop   14451670 2025-05-12 13:14 /data/olist/olist_order_reviews_dataset.csv
-rw-r--r--   2 priteshsingh101 hadoop   17654914 2025-05-12 13:14 /data/olist/olist_orders_dataset.csv
-rw-r--r--   2 priteshsingh101 hadoop    2379446 2025-05-12 13:14 /data/olist/olist_products_dataset.csv
-rw-r--r--   2 priteshsingh101 hadoop     174703 2025-05-12 13:14 /data/olist/olist_sellers_dataset.csv
-rw-r--r--   2 priteshsingh101 hadoop       2613 2025-05-12 13:14 /data/olist/product_category_name_translation.csv


In [3]:
# Define the HDFS base path

hdfs_path = "/data/olist/"

customers_df = spark.read.csv(hdfs_path + "olist_customers_dataset.csv", header=True, inferSchema=True)

                                                                                

In [4]:
customers_df.show(10)

+--------------------+--------------------+------------------------+--------------------+--------------+
|         customer_id|  customer_unique_id|customer_zip_code_prefix|       customer_city|customer_state|
+--------------------+--------------------+------------------------+--------------------+--------------+
|06b8999e2fba1a1fb...|861eff4711a542e4b...|                   14409|              franca|            SP|
|18955e83d337fd6b2...|290c77bc529b7ac93...|                    9790|sao bernardo do c...|            SP|
|4e7b3e00288586ebd...|060e732b5b29e8181...|                    1151|           sao paulo|            SP|
|b2b6027bc5c5109e5...|259dac757896d24d7...|                    8775|     mogi das cruzes|            SP|
|4f2d8ab171c80ec83...|345ecd01c38d18a90...|                   13056|            campinas|            SP|
|879864dab9bc30475...|4c93744516667ad3b...|                   89254|      jaragua do sul|            SC|
|fd826e7cf63160e53...|addec96d2e059c80c...|            

In [5]:
# Data loading 

customers_df = spark.read.csv(hdfs_path + "olist_customers_dataset.csv", header = True, inferSchema = True)
geolocation_df = spark.read.csv(hdfs_path + "olist_geolocation_dataset.csv", header = True, inferSchema = True)
order_items_df = spark.read.csv(hdfs_path + "olist_order_items_dataset.csv", header = True, inferSchema = True)
payments_df = spark.read.csv(hdfs_path + "olist_order_payments_dataset.csv", header = True, inferSchema = True)
reviews_df = spark.read.csv(hdfs_path + "olist_order_reviews_dataset.csv", header = True, inferSchema = True)
orders_df = spark.read.csv(hdfs_path + "olist_orders_dataset.csv", header = True, inferSchema = True)
products_df = spark.read.csv(hdfs_path + "olist_products_dataset.csv", header = True, inferSchema = True)
sellers_df = spark.read.csv(hdfs_path + "olist_sellers_dataset.csv", header = True, inferSchema = True)
category_translation_df = spark.read.csv(hdfs_path + "product_category_name_translation.csv", header = True, inferSchema = True)
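
The nine near-identical `spark.read.csv` calls above could also be driven from a single mapping. The sketch below is one possible way to do that, not the notebook's own method: `DATASETS`, `dataset_paths`, and `load_all` are hypothetical names introduced here, and the file names are copied from the HDFS listing shown earlier.

```python
# Hypothetical mapping of short table names to the CSV files listed in HDFS.
DATASETS = {
    "customers": "olist_customers_dataset.csv",
    "geolocation": "olist_geolocation_dataset.csv",
    "order_items": "olist_order_items_dataset.csv",
    "payments": "olist_order_payments_dataset.csv",
    "reviews": "olist_order_reviews_dataset.csv",
    "orders": "olist_orders_dataset.csv",
    "products": "olist_products_dataset.csv",
    "sellers": "olist_sellers_dataset.csv",
    "category_translation": "product_category_name_translation.csv",
}


def dataset_paths(base_path):
    """Map each short table name to its full path under base_path."""
    base = base_path.rstrip("/")
    return {name: f"{base}/{filename}" for name, filename in DATASETS.items()}


def load_all(spark, base_path):
    """Read every CSV with the same options used in the cell above.

    `spark` is an existing SparkSession; the result is a dict of DataFrames
    keyed by short table name.
    """
    return {
        name: spark.read.csv(path, header=True, inferSchema=True)
        for name, path in dataset_paths(base_path).items()
    }
```

With this in place, `dfs = load_all(spark, hdfs_path)` would give `dfs["customers"]`, `dfs["orders"]`, and so on, which should be equivalent to the individual `*_df` variables.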

                                                                                

In [23]:
customers_df.printSchema()
geolocation_df.printSchema()
order_items_df.printSchema()
payments_df.printSchema()
reviews_df.printSchema()
orders_df.printSchema()
products_df.printSchema()
sellers_df.printSchema()
category_translation_df.printSchema()

root
 |-- customer_id: string (nullable = true)
 |-- customer_unique_id: string (nullable = true)
 |-- customer_zip_code_prefix: integer (nullable = true)
 |-- customer_city: string (nullable = true)
 |-- customer_state: string (nullable = true)

root
 |-- geolocation_zip_code_prefix: integer (nullable = true)
 |-- geolocation_lat: double (nullable = true)
 |-- geolocation_lng: double (nullable = true)
 |-- geolocation_city: string (nullable = true)
 |-- geolocation_state: string (nullable = true)

root
 |-- order_id: string (nullable = true)
 |-- order_item_id: integer (nullable = true)
 |-- product_id: string (nullable = true)
 |-- seller_id: string (nullable = true)
 |-- shipping_limit_date: timestamp (nullable = true)
 |-- price: double (nullable = true)
 |-- freight_value: double (nullable = true)

root
 |-- order_id: string (nullable = true)
 |-- payment_sequential: integer (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- payment_installments: integer (nullable 
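
`inferSchema=True` reads `customer_zip_code_prefix` as an integer, which is why values such as `9790` and `1151` appear without a leading zero in the `show()` output earlier. If the prefix is meant to be a fixed-width five-character code (an assumption about the dataset, not something this notebook states), it may be safer to keep it as a string. A minimal sketch, with `customers_schema` and `pad_zip_prefix` as hypothetical helper names:

```python
CUSTOMER_COLUMNS = [
    "customer_id",
    "customer_unique_id",
    "customer_zip_code_prefix",
    "customer_city",
    "customer_state",
]


def customers_schema():
    """Explicit all-string schema for the customers file.

    pyspark is imported lazily so that the pure-Python helper below can be
    used without a Spark installation.
    """
    from pyspark.sql.types import StringType, StructField, StructType

    return StructType([StructField(name, StringType(), True) for name in CUSTOMER_COLUMNS])


def pad_zip_prefix(prefix, width=5):
    """Restore leading zeros lost when a prefix was parsed as an integer.

    The width of 5 is an assumption about the prefix format.
    """
    return str(prefix).zfill(width)


print(pad_zip_prefix(9790))   # 09790
print(pad_zip_prefix(14409))  # 14409
```

The schema could then be passed as `spark.read.csv(hdfs_path + "olist_customers_dataset.csv", header=True, schema=customers_schema())`. For a DataFrame that is already loaded, `lpad(col("customer_zip_code_prefix").cast("string"), 5, "0")` should achieve the same padding inside Spark.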

In [20]:
# Row counts per DataFrame (sanity check that no rows were dropped on load)

print(f'customers_df: {customers_df.count()} rows')
print(f'geolocation_df: {geolocation_df.count()} rows')
print(f'order_items_df: {order_items_df.count()} rows')
print(f'payments_df: {payments_df.count()} rows')
print(f'reviews_df: {reviews_df.count()} rows')
print(f'orders_df: {orders_df.count()} rows')
print(f'products_df: {products_df.count()} rows')
print(f'sellers_df: {sellers_df.count()} rows')
print(f'category_translation_df: {category_translation_df.count()} rows')

customers_df: 99441 rows
geolocation_df: 1000163 rows
order_items_df: 112650 rows
payments_df: 103886 rows
reviews_df: 104162 rows
orders_df: 99441 rows
products_df: 32951 rows
sellers_df: 3095 rows
category_translation_df: 71 rows


In [34]:
# Check for NULLs

# select() receives a list because the list comprehension builds one aggregate per column

print("customers_df:")
customers_df.select([sum(when(col(c).isNull(), 1).otherwise(0)).alias(c) for c in customers_df.columns]).show()

print("geolocation_df")
geolocation_df.select([sum(when(col(c).isNull(),1).otherwise(0)).alias(c) for c in geolocation_df.columns]).show()

print("order_items_df")
order_items_df.select([sum(when(col(c).isNull(), 1).otherwise(0)).alias(c) for c in order_items_df.columns]).show()

print("payments_df")
payments_df.select([sum(when(col(c).isNull(), 1).otherwise(0)).alias(c) for c in payments_df.columns]).show()

print("reviews_df")
reviews_df.select([sum(when(col(c).isNull(), 1).otherwise(0)).alias(c) for c in reviews_df.columns]).show()

print("orders_df")
orders_df.select([sum(when(col(c).isNull(), 1).otherwise(0)).alias(c) for c in orders_df.columns]).show()

print("products_df")
products_df.select([sum(when(col(c).isNull(), 1).otherwise(0)).alias(c) for c in products_df.columns]).show()

print("sellers_df")
sellers_df.select([sum(when(col(c).isNull(), 1).otherwise(0)).alias(c) for c in sellers_df.columns]).show()

print("category_translation_df")
category_translation_df.select([sum(when(col(c).isNull(), 1).otherwise(0)).alias(c) for c in category_translation_df.columns]).show()


customers_df:
+-----------+------------------+------------------------+-------------+--------------+
|customer_id|customer_unique_id|customer_zip_code_prefix|customer_city|customer_state|
+-----------+------------------+------------------------+-------------+--------------+
|          0|                 0|                       0|            0|             0|
+-----------+------------------+------------------------+-------------+--------------+

geolocation_df
+---------------------------+---------------+---------------+----------------+-----------------+
|geolocation_zip_code_prefix|geolocation_lat|geolocation_lng|geolocation_city|geolocation_state|
+---------------------------+---------------+---------------+----------------+-----------------+
|                          0|              0|              0|               0|                0|
+---------------------------+---------------+---------------+----------------+-----------------+

order_items_df
+--------+-------------+----------+---------+-------------------+-----+-------------+
|order_id|order_item_id|product_id|seller_id|shipping_limit_date|price|freight_value|
+--------+-------------+----------+---------+-------------------+-----+-------------+
|       0|            0|         0|        0|                  0|    0|            0|
+--------+-------------+----------+---------+-------------------+-----+-------------+

payments_df
+--------+------------------+------------+--------------

reviews_df
+---------+--------+------------+--------------------+----------------------+--------------------+-----------------------+
|review_id|order_id|review_score|review_comment_title|review_comment_message|review_creation_date|review_answer_timestamp|
+---------+--------+------------+--------------------+----------------------+--------------------+-----------------------+
|        1|    2236|        2380|               92157|                 63079|                8764|                   8785|
+---------+--------+------------+--------------------+----------------------+--------------------+-----------------------+

orders_df
+--------+-----------+------------+------------------------+-----------------+----------------------------+-----------------------------+-----------------------------+
|order_id|customer_id|order_status|order_purchase_timestamp|order_approved_at|order_delivered_carrier_date|order_delivered_customer_date|order_estimated_delivery_date|
+--------+-----------+------------+---
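
The review file contains free-text comment columns, and the non-zero NULL counts for `review_id`, `order_id`, and `review_score` above may indicate that some comments with embedded line breaks were split into several rows. This is an assumption, not something verified here. If that is the cause, enabling multi-line parsing when reading the file is one option to try. `REVIEW_CSV_OPTIONS` and `read_reviews` are hypothetical names:

```python
# Reader options for a CSV whose quoted text fields may contain newlines.
REVIEW_CSV_OPTIONS = {
    "header": True,
    "inferSchema": True,
    "multiLine": True,  # let a quoted field span several physical lines
    "quote": '"',
    "escape": '"',      # treat "" inside a quoted field as a literal quote
}


def read_reviews(spark, base_path):
    """Re-read the reviews file with multi-line parsing enabled.

    `spark` is an existing SparkSession and `base_path` the HDFS directory.
    """
    path = base_path.rstrip("/") + "/olist_order_reviews_dataset.csv"
    return spark.read.csv(path, **REVIEW_CSV_OPTIONS)
```

Comparing `read_reviews(spark, hdfs_path).count()` with the 104162 rows reported earlier, and re-running the NULL check on the result, would show whether the parsing options make a difference.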

In [35]:
# Check for duplicate customer_id values (an empty result means none)

customers_df.groupBy('customer_id').count().filter('count > 1').show()



+-----------+-----+
|customer_id|count|
+-----------+-----+
+-----------+-----+



                                                                                

In [39]:
# Customer Distribution by State

customers_df.groupBy('customer_state').count().orderBy('count', ascending = False).show()

+--------------+-----+
|customer_state|count|
+--------------+-----+
|            SP|41746|
|            RJ|12852|
|            MG|11635|
|            RS| 5466|
|            PR| 5045|
|            SC| 3637|
|            BA| 3380|
|            DF| 2140|
|            ES| 2033|
|            GO| 2020|
|            PE| 1652|
|            CE| 1336|
|            PA|  975|
|            MT|  907|
|            MA|  747|
|            MS|  715|
|            PB|  536|
|            PI|  495|
|            RN|  485|
|            AL|  413|
+--------------+-----+
only showing top 20 rows



In [44]:
# Orders 

orders_df.show()

+--------------------+--------------------+------------+------------------------+-------------------+----------------------------+-----------------------------+-----------------------------+
|            order_id|         customer_id|order_status|order_purchase_timestamp|  order_approved_at|order_delivered_carrier_date|order_delivered_customer_date|order_estimated_delivery_date|
+--------------------+--------------------+------------+------------------------+-------------------+----------------------------+-----------------------------+-----------------------------+
|e481f51cbdc54678b...|9ef432eb625129730...|   delivered|     2017-10-02 10:56:33|2017-10-02 11:07:15|         2017-10-04 19:55:00|          2017-10-10 21:25:13|          2017-10-18 00:00:00|
|53cdb2fc8bc7dce0b...|b0830fb4747a6c6d2...|   delivered|     2018-07-24 20:41:37|2018-07-26 03:24:27|         2018-07-26 14:31:00|          2018-08-07 15:27:45|          2018-08-13 00:00:00|
|47770eb9100c2d0c4...|41ce2a54c0b03bf34...|  

In [45]:
# Order Status Distribution

orders_df.groupBy('order_status').count().orderBy('count', ascending = False).show()

+------------+-----+
|order_status|count|
+------------+-----+
|   delivered|96478|
|     shipped| 1107|
|    canceled|  625|
| unavailable|  609|
|    invoiced|  314|
|  processing|  301|
|     created|    5|
|    approved|    2|
+------------+-----+
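
Raw counts are easier to compare as percentages. The sketch below adds a share column to the status distribution; `share_pct` and `status_shares` are hypothetical helper names, and the Spark part has not been run against this cluster.

```python
def share_pct(count, total):
    """Percentage of `total` represented by `count`, rounded to 2 decimals."""
    return round(100.0 * count / total, 2)


def status_shares(orders_df):
    """Order counts per status plus each status's share of all orders.

    pyspark is imported lazily so share_pct() stays usable without Spark.
    """
    from pyspark.sql import functions as F

    total = orders_df.count()
    return (
        orders_df.groupBy("order_status")
        .count()
        .withColumn("share_pct", F.round(F.col("count") / F.lit(total) * 100, 2))
        .orderBy("count", ascending=False)
    )


# Delivered share, using the counts from the table above.
print(share_pct(96478, 99441))  # 97.02
```

By this measure roughly 97% of the orders in the dataset reached the `delivered` status.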



In [46]:
# Payments

payments_df.show()

+--------------------+------------------+------------+--------------------+-------------+
|            order_id|payment_sequential|payment_type|payment_installments|payment_value|
+--------------------+------------------+------------+--------------------+-------------+
|b81ef226f3fe1789b...|                 1| credit_card|                   8|        99.33|
|a9810da82917af2d9...|                 1| credit_card|                   1|        24.39|
|25e8ea4e93396b6fa...|                 1| credit_card|                   1|        65.71|
|ba78997921bbcdc13...|                 1| credit_card|                   8|       107.78|
|42fdf880ba16b47b5...|                 1| credit_card|                   2|       128.45|
|298fcdf1f73eb413e...|                 1| credit_card|                   2|        96.12|
|771ee386b001f0620...|                 1| credit_card|                   1|        81.16|
|3d7239c394a212faa...|                 1| credit_card|                   3|        51.84|
|1f78449c8

In [48]:
# payment_type distribution

payments_df.groupBy('payment_type').count().orderBy('count', ascending = False).show()

+------------+-----+
|payment_type|count|
+------------+-----+
| credit_card|76795|
|      boleto|19784|
|     voucher| 5775|
|  debit_card| 1529|
| not_defined|    3|
+------------+-----+



In [51]:
# order items

order_items_df.show(5)

+--------------------+-------------+--------------------+--------------------+-------------------+-----+-------------+
|            order_id|order_item_id|          product_id|           seller_id|shipping_limit_date|price|freight_value|
+--------------------+-------------+--------------------+--------------------+-------------------+-----+-------------+
|00010242fe8c5a6d1...|            1|4244733e06e7ecb49...|48436dade18ac8b2b...|2017-09-19 09:45:35| 58.9|        13.29|
|00018f77f2f0320c5...|            1|e5f2d52b802189ee6...|dd7ddc04e1b6c2c61...|2017-05-03 11:05:13|239.9|        19.93|
|000229ec398224ef6...|            1|c777355d18b72b67a...|5b51032eddd242adc...|2018-01-18 14:48:30|199.0|        17.87|
|00024acbcdf0a6daa...|            1|7634da152a4610f15...|9d7a1d34a50524090...|2018-08-15 10:10:18|12.99|        12.79|
|00042b26cf59d7ce6...|            1|ac6c3623068f30de0...|df560393f3a51e745...|2017-02-13 13:57:51|199.9|        18.14|
+--------------------+-------------+------------

In [53]:
# Top-selling products by total revenue (sum of item prices)

top_products = order_items_df.groupBy('product_id').agg(sum('price').alias('Total_Sales'))
top_products.orderBy('Total_Sales', ascending = False).show()



+--------------------+------------------+
|          product_id|       Total_Sales|
+--------------------+------------------+
|bb50f2e236e5eea01...|           63885.0|
|6cdd53843498f9289...| 54730.20000000005|
|d6160fb7873f18409...|48899.340000000004|
|d1c427060a0f73f6b...| 47214.51000000006|
|99a4788cb24856965...|43025.560000000085|
|3dd2a17168ec895c7...| 41082.60000000005|
|25c38557cf793876c...| 38907.32000000001|
|5f504b3a1c75b73d6...|37733.899999999994|
|53b36df67ebb7c415...| 37683.42000000001|
|aca2eb7d00ea1a7b8...| 37608.90000000007|
|e0d64dcfaa3b6db5c...|          31786.82|
|d285360f29ac7fd97...|31623.809999999983|
|7a10781637204d8d1...|           30467.5|
|f1c7f353075ce59d8...|          29997.36|
|f819f0c84a64f02d3...|29024.479999999996|
|588531f8ec37e7d5f...|28291.989999999998|
|422879e10f4668299...|26577.219999999972|
|16c4e87b98a9370a9...|           25034.0|
|5a848e4ab52fd5445...|24229.029999999962|
|a62e25e09e05e6faf...|           24051.0|
+--------------------+------------
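
Totals such as `54730.20000000005` are binary floating-point artifacts from summing `double` prices. One possible way to avoid them is to sum a fixed-point decimal instead. The sketch below shows the idea in plain Python and then as a Spark aggregation; `total_as_decimal` and `top_products_by_revenue` are hypothetical names, and `decimal(12,2)` is an assumed precision.

```python
from decimal import Decimal


def total_as_decimal(prices):
    """Sum prices exactly by converting each one to a Decimal first."""
    return sum(Decimal(str(p)) for p in prices)


def top_products_by_revenue(order_items_df, n=10):
    """Top n products by revenue, summed as decimal(12,2) instead of double.

    Also returns the number of items sold, since revenue and unit sales
    can rank products differently.
    """
    from pyspark.sql import functions as F  # lazy import: Spark only needed here

    price = F.col("price").cast("decimal(12,2)")
    return (
        order_items_df.groupBy("product_id")
        .agg(
            F.sum(price).alias("total_sales"),
            F.count(F.lit(1)).alias("items_sold"),
        )
        .orderBy(F.desc("total_sales"))
        .limit(n)
    )


# Three prices taken from the order_items sample shown earlier.
print(total_as_decimal([58.9, 239.9, 199.0]))  # 497.8
```

Rounding the final `double` total with `round(sum('price'), 2)` would be a lighter alternative if exact decimal arithmetic is not required.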

                                                                                

In [57]:
# Delivery time analysis (days from purchase to delivery to the customer)

delivery_df = orders_df.select('order_id', 'order_purchase_timestamp', 'order_delivered_customer_date',
                               datediff(col('order_delivered_customer_date'), col('order_purchase_timestamp')).alias('delivery_time'))

In [59]:
delivery_df.orderBy('delivery_time', ascending = False).show()

+--------------------+------------------------+-----------------------------+-------------+
|            order_id|order_purchase_timestamp|order_delivered_customer_date|delivery_time|
+--------------------+------------------------+-----------------------------+-------------+
|ca07593549f1816d2...|     2017-02-21 23:31:27|          2017-09-19 14:36:39|          210|
|1b3190b2dfa9d789e...|     2018-02-23 14:57:35|          2018-09-19 23:24:07|          208|
|440d0d17af552815d...|     2017-03-07 23:59:51|          2017-09-19 15:12:50|          196|
|2fb597c2f772eca01...|     2017-03-08 18:09:02|          2017-09-19 14:33:17|          195|
|285ab9426d6982034...|     2017-03-08 22:47:40|          2017-09-19 14:00:04|          195|
|0f4519c5f1c541dde...|     2017-03-09 13:26:57|          2017-09-19 14:38:21|          194|
|47b40429ed8cce3ae...|     2018-01-03 09:44:01|          2018-07-13 20:51:31|          191|
|2fe324febf907e3ea...|     2017-03-13 20:17:10|          2017-09-19 17
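
`datediff` compares calendar dates and ignores the time of day, and the cell above lists the slowest deliveries rather than an average. The sketch below mirrors the date arithmetic in plain Python and outlines one way the average could be computed in Spark; `calendar_day_diff` and `avg_delivery_days` are hypothetical names.

```python
from datetime import datetime


def calendar_day_diff(end, start):
    """Days between two timestamps, counting calendar dates only."""
    return (end.date() - start.date()).days


def avg_delivery_days(orders_df):
    """Average days from purchase to customer delivery.

    Orders without a delivery date produce NULL differences, which avg()
    skips, so undelivered orders do not affect the result.
    """
    from pyspark.sql import functions as F  # lazy import: Spark only needed here

    days = F.datediff(
        F.col("order_delivered_customer_date"),
        F.col("order_purchase_timestamp"),
    )
    return orders_df.select(F.round(F.avg(days), 2).alias("avg_delivery_days"))


# First row of the output above: the calendar-date difference is 210 days,
# although the elapsed time is slightly less than 210 full days.
purchased = datetime(2017, 2, 21, 23, 31, 27)
delivered = datetime(2017, 9, 19, 14, 36, 39)
print(calendar_day_diff(delivered, purchased))  # 210
print((delivered - purchased).days)             # 209
```

Calling `avg_delivery_days(orders_df).show()` before `spark.stop()` would print a single-row result with the average.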

In [60]:
spark.stop()