## Data Ingestion & Data Exploration

- Store all the CSV datasets in GCS
- Use Spark DataFrames for scalable operations
- Explore practical business use cases
- Document everything

- import the dataset from Kaggle
- check the schema and head of the data
- define schema
- count data
- explore:
    1. customers
    2. sellers
    3. products
    4. geolocation
    5. orders
    6. order_payments
    7. order_items
    8. order_reviews
- save raw data into GCS

In [11]:
!ls -l -h /

total 80K
lrwxrwxrwx   1 root root      7 May 13 15:04 bin -> usr/bin
drwxr-xr-x   4 root root   4.0K May 13 15:08 boot
-rw-r--r--   1 root root    646 Jun  1  2023 copyright
drwxr-xr-x   2 root root   4.0K Jun 12 11:00 data
drwxr-xr-x  14 root root   3.0K Jun 12 10:33 dev
drwxr-xr-x 112 root root    12K Jun 12 10:33 etc
drwxrwxr-x   7 root hadoop 4.0K May 31 02:03 hadoop
drwxr-xr-x   4 root root   4.0K May 31 02:20 home
lrwxrwxrwx   1 root root      7 May 13 15:04 lib -> usr/lib
lrwxrwxrwx   1 root root      9 May 13 15:04 lib64 -> usr/lib64
drwx------   2 root root    16K May 13 15:04 lost+found
drwxr-xr-x   2 root root   4.0K May 13 15:04 media
drwxr-xr-x   2 root root   4.0K May 13 15:04 mnt
drwxr-xr-x   9 root root   4.0K May 16 00:40 opt
dr-xr-xr-x 171 root root      0 Jun 12 10:33 proc
drwx------   9 root root   4.0K Jun  5 04:36 root
drwxr-xr-x  27 root root    740 Jun 12 10:33 run
lrwxrwxrwx   1 root root      8 May 13 15:04 sbin -> usr/sbin
drwxr-xr-x   2 root root   4.0K May

In [12]:
# import the dataset from Kaggle
!curl -L -o /data/brazilian-ecommerce.zip \
  https://www.kaggle.com/api/v1/datasets/download/olistbr/brazilian-ecommerce

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 42.6M  100 42.6M    0     0  90.0M      0 --:--:-- --:--:-- --:--:-- 90.0M


In [13]:
!unzip /data/brazilian-ecommerce.zip -d  /data/ecommerce_real

Archive:  /data/brazilian-ecommerce.zip
  inflating: /data/ecommerce_real/olist_customers_dataset.csv  
  inflating: /data/ecommerce_real/olist_geolocation_dataset.csv  
  inflating: /data/ecommerce_real/olist_order_items_dataset.csv  
  inflating: /data/ecommerce_real/olist_order_payments_dataset.csv  
  inflating: /data/ecommerce_real/olist_order_reviews_dataset.csv  
  inflating: /data/ecommerce_real/olist_orders_dataset.csv  
  inflating: /data/ecommerce_real/olist_products_dataset.csv  
  inflating: /data/ecommerce_real/olist_sellers_dataset.csv  
  inflating: /data/ecommerce_real/product_category_name_translation.csv  


In [14]:
!ls -l -h /data/ecommerce_real/

total 121M
-rw-r--r-- 1 root root 8.7M Oct  1  2021 olist_customers_dataset.csv
-rw-r--r-- 1 root root  59M Oct  1  2021 olist_geolocation_dataset.csv
-rw-r--r-- 1 root root  15M Oct  1  2021 olist_order_items_dataset.csv
-rw-r--r-- 1 root root 5.6M Oct  1  2021 olist_order_payments_dataset.csv
-rw-r--r-- 1 root root  14M Oct  1  2021 olist_order_reviews_dataset.csv
-rw-r--r-- 1 root root  17M Oct  1  2021 olist_orders_dataset.csv
-rw-r--r-- 1 root root 2.3M Oct  1  2021 olist_products_dataset.csv
-rw-r--r-- 1 root root 171K Oct  1  2021 olist_sellers_dataset.csv
-rw-r--r-- 1 root root 2.6K Oct  1  2021 product_category_name_translation.csv
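
The next cell lists `/datahdfs`, but the notebook does not show how the extracted files were copied there. One plausible way, sketched below, is to create the HDFS directory and upload the CSVs with `hadoop fs -mkdir -p` and `hadoop fs -put -f`. The helper names `build_hdfs_upload_commands` and `upload_csvs_to_hdfs` are hypothetical; the paths come from the surrounding cells.

```python
import subprocess
from pathlib import Path


def build_hdfs_upload_commands(csv_files, hdfs_dir):
    """Return the `hadoop fs` commands that create hdfs_dir and upload csv_files."""
    if not csv_files:
        raise ValueError("no CSV files to upload")
    return [
        # create the target directory if it does not exist yet
        ["hadoop", "fs", "-mkdir", "-p", hdfs_dir],
        # -f overwrites files that are already present in hdfs_dir
        ["hadoop", "fs", "-put", "-f", *csv_files, hdfs_dir],
    ]


def upload_csvs_to_hdfs(local_dir, hdfs_dir):
    """Upload every *.csv in local_dir to hdfs_dir (assumes `hadoop` is on PATH)."""
    csv_files = sorted(str(p) for p in Path(local_dir).glob("*.csv"))
    for cmd in build_hdfs_upload_commands(csv_files, hdfs_dir):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


# Example call (not executed here):
# upload_csvs_to_hdfs("/data/ecommerce_real", "/datahdfs")
```

Building the command lists separately from running them keeps the upload step easy to inspect before anything touches the cluster.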


In [25]:
!hadoop fs -ls -h /datahdfs

Found 9 items
-rw-r--r--   2 root hadoop      8.6 M 2025-06-12 11:06 /datahdfs/olist_customers_dataset.csv
-rw-r--r--   2 root hadoop     58.4 M 2025-06-12 11:06 /datahdfs/olist_geolocation_dataset.csv
-rw-r--r--   2 root hadoop     14.7 M 2025-06-12 11:06 /datahdfs/olist_order_items_dataset.csv
-rw-r--r--   2 root hadoop      5.5 M 2025-06-12 11:06 /datahdfs/olist_order_payments_dataset.csv
-rw-r--r--   2 root hadoop     13.8 M 2025-06-12 11:06 /datahdfs/olist_order_reviews_dataset.csv
-rw-r--r--   2 root hadoop     16.8 M 2025-06-12 11:06 /datahdfs/olist_orders_dataset.csv
-rw-r--r--   2 root hadoop      2.3 M 2025-06-12 11:06 /datahdfs/olist_products_dataset.csv
-rw-r--r--   2 root hadoop    170.6 K 2025-06-12 11:06 /datahdfs/olist_sellers_dataset.csv
-rw-r--r--   2 root hadoop      2.6 K 2025-06-12 11:06 /datahdfs/product_category_name_translation.csv


In [15]:
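from pyspark.sql import SparkSession
# Note: the wildcard import shadows Python built-ins such as min, max, and sum
# with the Spark column functions of the same name; later cells rely on that.
from pyspark.sql.functions import *
from pyspark.sql.window import Window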

In [16]:
spark = SparkSession.builder.appName("brazilian-ecommerce").master("local[*]").getOrCreate()

25/06/12 11:03:32 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [17]:
# define the schemas as DDL strings (date and timestamp columns are kept as STRING at this stage)
schema_customers = 'customer_id STRING, customer_unique_id STRING, customer_zip_code_prefix INT, customer_city STRING, customer_state STRING'
schema_geolocation = 'geolocation_zip_code_prefix INT, geolocation_lat DOUBLE, geolocation_lng DOUBLE, geolocation_city STRING, geolocation_state STRING'
schema_order_items = 'order_id STRING, order_item_id INT, product_id STRING, seller_id STRING, shipping_limit_date STRING, price DOUBLE, freight_value DOUBLE'
schema_order_payments = 'order_id STRING, payment_sequential INT, payment_type STRING, payment_installments INT, payment_value DOUBLE'
schema_order_reviews = 'review_id STRING, order_id STRING, review_score INT, review_comment_title STRING, review_comment_message STRING, review_creation_date STRING, review_answer_timestamp STRING'
schema_orders = 'order_id STRING, customer_id STRING, order_status STRING, order_purchase_timestamp STRING, order_approved_at STRING, order_delivered_carrier_date STRING, order_delivered_customer_date STRING, order_estimated_delivery_date STRING'
schema_products = 'product_id STRING, product_category_name STRING, product_name_lenght INT, product_description_lenght INT, product_photos_qty INT, product_weight_g INT, product_length_cm INT, product_height_cm INT, product_width_cm INT'
schema_sellers = 'seller_id STRING, seller_zip_code_prefix INT, seller_city STRING, seller_state STRING'
category_translation = 'product_category_name STRING, product_category_name_english STRING'
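
One side effect of reading the zip code prefixes as `INT` is that leading zeros are dropped: the customers preview later in this notebook shows `9790` and `1151` for cities in SP. Assuming the prefixes are meant to be five digits wide, a minimal sketch for restoring a zero-padded string form could look like this. The helper names and the `_str` column suffix are hypothetical; `lpad` and `cast` are standard Spark SQL.

```python
def restore_zip_prefix(prefix, width=5):
    """Plain-Python illustration: turn 9790 back into '09790'."""
    return str(prefix).zfill(width)


def zip_prefix_sql(column, width=5):
    """Spark SQL expression that left-pads an integer zip prefix with zeros."""
    return f"lpad(cast({column} AS STRING), {width}, '0')"


# Possible use in the notebook (expr comes from pyspark.sql.functions):
# df_customers = df_customers.withColumn(
#     "customer_zip_code_prefix_str",
#     expr(zip_prefix_sql("customer_zip_code_prefix")),
# )
```

An alternative is to declare the prefix columns as `STRING` in the schemas above, which preserves the zeros at read time but changes the column types used in the rest of the exploration.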

In [26]:
path_head = "/datahdfs/olist_"
path_tail = "_dataset.csv"

df_customers = spark.read.csv(f"{path_head}customers{path_tail}", header=True, schema=schema_customers)
df_geolocation = spark.read.csv(f"{path_head}geolocation{path_tail}", header=True, schema=schema_geolocation)
df_order_items = spark.read.csv(f"{path_head}order_items{path_tail}", header=True, schema=schema_order_items)
df_order_payments = spark.read.csv(f"{path_head}order_payments{path_tail}", header=True, schema=schema_order_payments)
df_order_reviews = spark.read.csv(f"{path_head}order_reviews{path_tail}", header=True, schema=schema_order_reviews)
df_orders = spark.read.csv(f"{path_head}orders{path_tail}", header=True, schema=schema_orders)
df_products = spark.read.csv(f"{path_head}products{path_tail}", header=True, schema=schema_products)
df_sellers = spark.read.csv(f"{path_head}sellers{path_tail}", header=True, schema=schema_sellers)

df_cat_trans = spark.read.csv("/datahdfs/product_category_name_translation.csv", header=True, schema=category_translation)
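
The eight reads above follow one pattern, so they could also be expressed as a loop over a name-to-schema mapping. The sketch below assumes the same path convention; `dataset_path` and `load_tables` are hypothetical helpers, and `spark` is the session created earlier.

```python
PATH_HEAD = "/datahdfs/olist_"
PATH_TAIL = "_dataset.csv"


def dataset_path(name, head=PATH_HEAD, tail=PATH_TAIL):
    """Build the CSV path for one Olist table, e.g. 'customers'."""
    return f"{head}{name}{tail}"


def load_tables(spark, schemas, head=PATH_HEAD, tail=PATH_TAIL):
    """Read every table in `schemas` (name -> DDL string) into a dict of DataFrames."""
    return {
        name: spark.read.csv(dataset_path(name, head, tail), header=True, schema=schema)
        for name, schema in schemas.items()
    }


# Possible use in the notebook:
# tables = load_tables(spark, {"customers": schema_customers, "sellers": schema_sellers})
# df_customers = tables["customers"]
```

The category translation file does not follow the `olist_..._dataset.csv` naming, so it would still need its own read call.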

In [42]:
# replace bucket-name with the name of your own GCS bucket
output_head = "gs://bucket-name/ecommerce_real/olist_"
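
To avoid editing the notebook for each environment, the bucket name could be read from configuration instead. This is only a sketch: `GCS_BUCKET` is a hypothetical environment variable name and `make_output_head` is a hypothetical helper.

```python
import os


def make_output_head(bucket, prefix="ecommerce_real/olist_"):
    """Build the gs:// prefix used for the Parquet outputs."""
    if not bucket or bucket == "bucket-name":
        # guard against writing to the placeholder bucket by accident
        raise ValueError("set a real GCS bucket name before writing")
    return f"gs://{bucket}/{prefix}"


# GCS_BUCKET is an assumed variable name; output_head is only set when it exists.
bucket = os.environ.get("GCS_BUCKET")
if bucket:
    output_head = make_output_head(bucket)
```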

### Customers

In [27]:
df_customers.show(10)

                                                                                

+--------------------+--------------------+------------------------+--------------------+--------------+
|         customer_id|  customer_unique_id|customer_zip_code_prefix|       customer_city|customer_state|
+--------------------+--------------------+------------------------+--------------------+--------------+
|06b8999e2fba1a1fb...|861eff4711a542e4b...|                   14409|              franca|            SP|
|18955e83d337fd6b2...|290c77bc529b7ac93...|                    9790|sao bernardo do c...|            SP|
|4e7b3e00288586ebd...|060e732b5b29e8181...|                    1151|           sao paulo|            SP|
|b2b6027bc5c5109e5...|259dac757896d24d7...|                    8775|     mogi das cruzes|            SP|
|4f2d8ab171c80ec83...|345ecd01c38d18a90...|                   13056|            campinas|            SP|
|879864dab9bc30475...|4c93744516667ad3b...|                   89254|      jaragua do sul|            SC|
|fd826e7cf63160e53...|addec96d2e059c80c...|            

In [28]:
df_customers.printSchema()

root
 |-- customer_id: string (nullable = true)
 |-- customer_unique_id: string (nullable = true)
 |-- customer_zip_code_prefix: integer (nullable = true)
 |-- customer_city: string (nullable = true)
 |-- customer_state: string (nullable = true)



In [29]:
# check missing values
df_customers.select([count(when(col(c).isNull(), 1)).alias(c) for c in df_customers.columns]).show()



+-----------+------------------+------------------------+-------------+--------------+
|customer_id|customer_unique_id|customer_zip_code_prefix|customer_city|customer_state|
+-----------+------------------+------------------------+-------------+--------------+
|          0|                 0|                       0|            0|             0|
+-----------+------------------+------------------------+-------------+--------------+
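
The same missing-value check is repeated for sellers and products later on, so it could be wrapped in a small helper. The sketch below builds Spark SQL expressions and passes them to `DataFrame.selectExpr`; the function names are hypothetical.

```python
def null_count_exprs(columns):
    """One SQL expression per column that counts its NULL values."""
    return [f"count(CASE WHEN `{c}` IS NULL THEN 1 END) AS `{c}`" for c in columns]


def show_null_counts(df):
    """Print a one-row table with the number of NULLs in each column of df."""
    df.selectExpr(*null_count_exprs(df.columns)).show()


# Possible use in the notebook:
# show_null_counts(df_customers)
# show_null_counts(df_sellers)
```

Using SQL strings keeps the helper independent of the wildcard import from `pyspark.sql.functions`.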



                                                                                

In [30]:
# count customer_id values that appear more than once
df_customers.groupBy("customer_id") \
    .count() \
    .filter("count > 1") \
    .count()

                                                                                

0

In [31]:
# Check for duplicates by customer_unique_id and location
df_customers.groupBy("customer_unique_id", "customer_zip_code_prefix", "customer_city", "customer_state") \
    .count() \
    .filter("count > 1") \
    .orderBy(desc("count")) \
    .show(truncate=False)



+--------------------------------+------------------------+------------------+--------------+-----+
|customer_unique_id              |customer_zip_code_prefix|customer_city     |customer_state|count|
+--------------------------------+------------------------+------------------+--------------+-----+
|8d50f5eadf50201ccdcedfb9e2ac8455|4045                    |sao paulo         |SP            |17   |
|1b6c7548a2a1f9037c1fd3ddfed95f33|38301                   |ituiutaba         |MG            |7    |
|ca77025e7201e3b30c44b472ff346268|51021                   |recife            |PE            |7    |
|6469f99c1f9dfae7733b25662e7f1782|11065                   |santos            |SP            |7    |
|de34b16117594161a6a89c50b289d35a|9130                    |santo andre       |SP            |6    |
|f0e310a6839dce9de1638e0fe5ab282a|29050                   |vitoria           |ES            |6    |
|12f5d6e1cbf93dafd9dcc19095df0b3d|82200                   |curitiba          |PR            |6    |


                                                                                

In [32]:
# Check whether any customer_unique_id has more than one location combination
df_customers.groupBy("customer_unique_id") \
    .agg(countDistinct("customer_zip_code_prefix", "customer_city", "customer_state").alias("address_count")) \
    .filter("address_count > 1") \
    .orderBy(desc("address_count")) \
    .show()



+--------------------+-------------+
|  customer_unique_id|address_count|
+--------------------+-------------+
|b9badb100ff8ecc16...|            3|
|d44ccec15f5f86d14...|            3|
|9832ae2f7d3e5fa4c...|            3|
|3e43e6105506432c9...|            3|
|dce6dc77c19d239bf...|            2|
|c45ece361aab055ea...|            2|
|556ffcdd6185be1c4...|            2|
|a262442e3ab89611b...|            2|
|a53c5bbb7be8b7a0f...|            2|
|1291474366a550ebc...|            2|
|d0f6a442ee95ea867...|            2|
|0178b244a5c281fb2...|            2|
|dc80a79483121fee9...|            2|
|f92fd2c87375f957e...|            2|
|18d61a27f478cd620...|            2|
|16bd767d18bf82bb0...|            2|
|507dc9becd4fc6563...|            2|
|012452d40dafae4df...|            2|
|2e08911037fe7ec67...|            2|
|d052fe86655dae0e9...|            2|
+--------------------+-------------+
only showing top 20 rows



                                                                                

In [33]:
# Show the details of customers with more than one address
# Collect the IDs that have more than one address
multiple_addresses = df_customers.withColumn("full_address", concat_ws("|", "customer_zip_code_prefix", "customer_city", "customer_state")) \
    .groupBy("customer_unique_id") \
    .agg(countDistinct("full_address").alias("address_count")) \
    .filter("address_count > 1") \
    .select("customer_unique_id")

# Join back to see the original rows
df_customers.join(multiple_addresses, on="customer_unique_id") \
    .select("customer_unique_id", "customer_zip_code_prefix", "customer_city", "customer_state") \
    .orderBy("customer_unique_id") \
    .show(truncate=False)

                                                                                

+--------------------------------+------------------------+--------------------+--------------+
|customer_unique_id              |customer_zip_code_prefix|customer_city       |customer_state|
+--------------------------------+------------------------+--------------------+--------------+
|004b45ec5c64187465168251cd1c9c2f|57035                   |maceio              |AL            |
|004b45ec5c64187465168251cd1c9c2f|57055                   |maceio              |AL            |
|0058f300f57d7b93c477a131a59b36c3|41370                   |salvador            |BA            |
|0058f300f57d7b93c477a131a59b36c3|40731                   |salvador            |BA            |
|012452d40dafae4df401bced74cdb490|3220                    |sao paulo           |SP            |
|012452d40dafae4df401bced74cdb490|3984                    |sao paulo           |SP            |
|0178b244a5c281fb2ade54038dd4b161|12518                   |guaratingueta       |SP            |
|0178b244a5c281fb2ade54038dd4b161|14960 

In [34]:
# Top 10 cities with the most customers (select city, state, count)
df_customers.groupBy("customer_city", "customer_state") \
    .agg(countDistinct(col("customer_id")).alias("num_customers")) \
    .orderBy(desc("num_customers")) \
    .show(10, truncate=False)

# Top 10 cities with the most unique customers (select city, state, count)
df_customers.groupBy("customer_city", "customer_state") \
    .agg(countDistinct(col("customer_unique_id")).alias("unique_customers")) \
    .orderBy(desc("unique_customers")) \
    .show(10, truncate=False)

                                                                                

+---------------------+--------------+-------------+
|customer_city        |customer_state|num_customers|
+---------------------+--------------+-------------+
|sao paulo            |SP            |15540        |
|rio de janeiro       |RJ            |6882         |
|belo horizonte       |MG            |2773         |
|brasilia             |DF            |2131         |
|curitiba             |PR            |1521         |
|campinas             |SP            |1444         |
|porto alegre         |RS            |1379         |
|salvador             |BA            |1245         |
|guarulhos            |SP            |1189         |
|sao bernardo do campo|SP            |938          |
+---------------------+--------------+-------------+
only showing top 10 rows

+---------------------+--------------+----------------+
|customer_city        |customer_state|unique_customers|
+---------------------+--------------+----------------+
|sao paulo            |SP            |14984           |
|rio de 

In [35]:
# Customers per state in descending order (show(40) lists every state)
df_customers.groupBy("customer_state") \
    .agg(countDistinct(col("customer_id")).alias("num_customers")) \
    .orderBy(desc("num_customers")) \
    .show(40, truncate=False)

# Unique customers per state in descending order
df_customers.groupBy("customer_state") \
    .agg(countDistinct(col("customer_unique_id")).alias("unique_customers")) \
    .orderBy(desc("unique_customers")) \
    .show(40, truncate=False)

                                                                                

+--------------+-------------+
|customer_state|num_customers|
+--------------+-------------+
|SP            |41746        |
|RJ            |12852        |
|MG            |11635        |
|RS            |5466         |
|PR            |5045         |
|SC            |3637         |
|BA            |3380         |
|DF            |2140         |
|ES            |2033         |
|GO            |2020         |
|PE            |1652         |
|CE            |1336         |
|PA            |975          |
|MT            |907          |
|MA            |747          |
|MS            |715          |
|PB            |536          |
|PI            |495          |
|RN            |485          |
|AL            |413          |
|SE            |350          |
|TO            |280          |
|RO            |253          |
|AM            |148          |
|AC            |81           |
|AP            |68           |
|RR            |46           |
+--------------+-------------+





+--------------+----------------+
|customer_state|unique_customers|
+--------------+----------------+
|SP            |40302           |
|RJ            |12384           |
|MG            |11259           |
|RS            |5277            |
|PR            |4882            |
|SC            |3534            |
|BA            |3277            |
|DF            |2075            |
|ES            |1964            |
|GO            |1952            |
|PE            |1609            |
|CE            |1313            |
|PA            |949             |
|MT            |876             |
|MA            |726             |
|MS            |694             |
|PB            |519             |
|PI            |482             |
|RN            |474             |
|AL            |401             |
|SE            |342             |
|TO            |273             |
|RO            |240             |
|AM            |143             |
|AC            |77              |
|AP            |67              |
|RR           

                                                                                

In [36]:
# Top 10 zip code areas with the most customers, then the min and max per area

df_customers.groupBy("customer_zip_code_prefix") \
    .agg(countDistinct(col("customer_id")).alias("num_customers")) \
    .orderBy(desc("num_customers")) \
    .show(10, truncate=False)

df_customers.groupBy("customer_zip_code_prefix") \
    .agg(countDistinct(col("customer_id")).alias("num_customers")) \
    .agg(min(col("num_customers")), max(col("num_customers"))) \
    .show()

                                                                                

+------------------------+-------------+
|customer_zip_code_prefix|num_customers|
+------------------------+-------------+
|22790                   |142          |
|24220                   |124          |
|22793                   |121          |
|24230                   |117          |
|22775                   |110          |
|29101                   |101          |
|13212                   |95           |
|35162                   |93           |
|22631                   |89           |
|38400                   |87           |
+------------------------+-------------+
only showing top 10 rows





+------------------+------------------+
|min(num_customers)|max(num_customers)|
+------------------+------------------+
|                 1|               142|
+------------------+------------------+



                                                                                

In [37]:
# Top 10 zip code areas with the most unique customers, then the min and max per area

df_customers.groupBy("customer_zip_code_prefix") \
    .agg(countDistinct(col("customer_unique_id")).alias("num_customers")) \
    .orderBy(desc("num_customers")) \
    .show(10, truncate=False)

df_customers.groupBy("customer_zip_code_prefix") \
    .agg(countDistinct(col("customer_unique_id")).alias("num_customers")) \
    .agg(min(col("num_customers")), max(col("num_customers"))) \
    .show()

+------------------------+-------------+
|customer_zip_code_prefix|num_customers|
+------------------------+-------------+
|22790                   |136          |
|22793                   |119          |
|24220                   |114          |
|24230                   |113          |
|22775                   |107          |
|29101                   |100          |
|13212                   |92           |
|35162                   |91           |
|22631                   |87           |
|38400                   |86           |
+------------------------+-------------+
only showing top 10 rows

+------------------+------------------+
|min(num_customers)|max(num_customers)|
+------------------+------------------+
|                 1|               136|
+------------------+------------------+



In [38]:
# Top 5 cities per state (row_number breaks ties in an arbitrary order)
windowSpec = Window.partitionBy("customer_state").orderBy(col("total_customers").desc())

top_cities = df_customers.groupBy("customer_state", "customer_city") \
    .agg(count("*").alias("total_customers")) \
    .withColumn("rank", row_number().over(windowSpec)) \
    .filter(col("rank") <= 5)

top_cities.show()

+--------------+-------------------+---------------+----+
|customer_state|      customer_city|total_customers|rank|
+--------------+-------------------+---------------+----+
|            AC|         rio branco|             70|   1|
|            AC|    cruzeiro do sul|              3|   2|
|            AC|             xapuri|              2|   3|
|            AC|   senador guiomard|              2|   4|
|            AC|          brasileia|              1|   5|
|            AL|             maceio|            247|   1|
|            AL|          arapiraca|             29|   2|
|            AL|             penedo|              8|   3|
|            AL|palmeira dos indios|              8|   4|
|            AL|    teotonio vilela|              8|   5|
|            AM|             manaus|            140|   1|
|            AM|            humaita|              4|   2|
|            AM|        urucurituba|              2|   3|
|            AM|              coari|              1|   4|
|            A

In [39]:
# Show how many distinct cities each state has
df_customers.groupBy("customer_state") \
    .agg(countDistinct(col("customer_city")).alias("num_cities")) \
    .orderBy(desc("num_cities")) \
    .show()

+--------------+----------+
|customer_state|num_cities|
+--------------+----------+
|            MG|       745|
|            SP|       629|
|            RS|       379|
|            PR|       364|
|            BA|       353|
|            SC|       240|
|            GO|       178|
|            CE|       161|
|            PE|       152|
|            RJ|       149|
|            MA|       122|
|            MT|       101|
|            ES|        95|
|            PB|        92|
|            RN|        90|
|            PA|        89|
|            PI|        72|
|            AL|        68|
|            MS|        67|
|            TO|        56|
+--------------+----------+
only showing top 20 rows



In [40]:
df_customers.agg(countDistinct('customer_id'), countDistinct('customer_unique_id')).show()



+---------------------------+----------------------------------+
|count(DISTINCT customer_id)|count(DISTINCT customer_unique_id)|
+---------------------------+----------------------------------+
|                      99441|                             96096|
+---------------------------+----------------------------------+
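
The gap between the two distinct counts can be quantified directly from the figures above. This is consistent with `customer_id` being issued per order while `customer_unique_id` identifies the person, which is an assumption here based on the dataset's published description and on the repeated `customer_unique_id` rows found earlier.

```python
# Figures copied from the output above
total_customer_ids = 99441
unique_customers = 96096

# customer_id values beyond the first one for each customer_unique_id
extra_ids = total_customer_ids - unique_customers
share = extra_ids / total_customer_ids

print(f"{extra_ids} customer_id values belong to a repeat customer_unique_id ({share:.2%})")
```

Under that assumption, per-person metrics such as customers per city should be computed on `customer_unique_id` rather than `customer_id`.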



                                                                                

In [43]:
df_customers.write.mode('overwrite').parquet(f'{output_head}customers')

                                                                                

### Sellers

In [44]:
df_sellers.show(10)

+--------------------+----------------------+-----------------+------------+
|           seller_id|seller_zip_code_prefix|      seller_city|seller_state|
+--------------------+----------------------+-----------------+------------+
|3442f8959a84dea7e...|                 13023|         campinas|          SP|
|d1b65fc7debc3361e...|                 13844|       mogi guacu|          SP|
|ce3ad9de960102d06...|                 20031|   rio de janeiro|          RJ|
|c0f3eea2e14555b6f...|                  4195|        sao paulo|          SP|
|51a04a8a6bdcb23de...|                 12914|braganca paulista|          SP|
|c240c4061717ac180...|                 20920|   rio de janeiro|          RJ|
|e49c26c3edfa46d22...|                 55325|           brejao|          PE|
|1b938a7ec6ac5061a...|                 16304|        penapolis|          SP|
|768a86e36ad6aae3d...|                  1529|        sao paulo|          SP|
|ccc4bbb5f32a6ab2b...|                 80310|         curitiba|          PR|

                                                                                

In [45]:
df_sellers.printSchema()

root
 |-- seller_id: string (nullable = true)
 |-- seller_zip_code_prefix: integer (nullable = true)
 |-- seller_city: string (nullable = true)
 |-- seller_state: string (nullable = true)



In [46]:
# check missing values
df_sellers.select([count(when(col(c).isNull(), 1)).alias(c) for c in df_sellers.columns]).show()

+---------+----------------------+-----------+------------+
|seller_id|seller_zip_code_prefix|seller_city|seller_state|
+---------+----------------------+-----------+------------+
|        0|                     0|          0|           0|
+---------+----------------------+-----------+------------+



In [47]:
# count seller_id values that appear more than once
df_sellers.groupBy("seller_id") \
    .count() \
    .filter("count > 1") \
    .count()

                                                                                

0

In [48]:
# Top 10 cities with the most sellers
df_sellers.groupBy("seller_city") \
    .agg(countDistinct(col("seller_id")).alias("num_sellers")) \
    .orderBy(desc("num_sellers")) \
    .show(10, truncate=False)

+--------------+-----------+
|seller_city   |num_sellers|
+--------------+-----------+
|sao paulo     |694        |
|curitiba      |127        |
|rio de janeiro|96         |
|belo horizonte|68         |
|ribeirao preto|52         |
|guarulhos     |50         |
|ibitinga      |49         |
|santo andre   |45         |
|campinas      |41         |
|maringa       |40         |
+--------------+-----------+
only showing top 10 rows



In [49]:
# Top 10 states with the most sellers
df_sellers.groupBy("seller_state") \
    .agg(countDistinct(col("seller_id")).alias("num_sellers")) \
    .orderBy(desc("num_sellers")) \
    .show(10, truncate=False)

+------------+-----------+
|seller_state|num_sellers|
+------------+-----------+
|SP          |1849       |
|PR          |349        |
|MG          |244        |
|SC          |190        |
|RJ          |171        |
|RS          |129        |
|GO          |40         |
|DF          |30         |
|ES          |23         |
|BA          |19         |
+------------+-----------+
only showing top 10 rows



In [50]:
# Top 10 zip code areas with the most sellers, then the min and max per area

df_sellers.groupBy("seller_zip_code_prefix") \
    .agg(countDistinct(col("seller_id")).alias("num_sellers")) \
    .orderBy(desc("num_sellers")) \
    .show(10, truncate=False)

df_sellers.groupBy("seller_zip_code_prefix") \
    .agg(countDistinct(col("seller_id")).alias("num_sellers")) \
    .agg(min(col("num_sellers")), max(col("num_sellers"))) \
    .show()

+----------------------+-----------+
|seller_zip_code_prefix|num_sellers|
+----------------------+-----------+
|14940                 |49         |
|13660                 |10         |
|13920                 |9          |
|16200                 |9          |
|14020                 |8          |
|1026                  |8          |
|87050                 |8          |
|13481                 |7          |
|37540                 |7          |
|35530                 |6          |
+----------------------+-----------+
only showing top 10 rows

+----------------+----------------+
|min(num_sellers)|max(num_sellers)|
+----------------+----------------+
|               1|              49|
+----------------+----------------+



In [51]:
# Show how many distinct cities each state has
df_sellers.groupBy("seller_state") \
    .agg(countDistinct(col("seller_city")).alias("num_cities")) \
    .orderBy(desc("num_cities")) \
    .show()

+------------+----------+
|seller_state|num_cities|
+------------+----------+
|          SP|       261|
|          MG|        82|
|          PR|        67|
|          SC|        65|
|          RS|        51|
|          RJ|        38|
|          BA|        12|
|          GO|        12|
|          ES|        11|
|          CE|         7|
|          PB|         5|
|          RN|         4|
|          PE|         4|
|          DF|         3|
|          MT|         3|
|          MS|         2|
|          RO|         2|
|          SE|         2|
|          AM|         1|
|          AC|         1|
+------------+----------+
only showing top 20 rows



In [53]:
# Top 5 cities per state by total sellers (row_number breaks ties in an arbitrary order)
windowSpec = Window.partitionBy("seller_state").orderBy(col("total_sellers").desc())

top_cities_sellers = df_sellers.groupBy("seller_state", "seller_city") \
    .agg(count("*").alias("total_sellers")) \
    .withColumn("rank", row_number().over(windowSpec)) \
    .filter(col("rank") <= 5)

top_cities_sellers.show()

+------------+--------------------+-------------+----+
|seller_state|         seller_city|total_sellers|rank|
+------------+--------------------+-------------+----+
|          AC|          rio branco|            1|   1|
|          AM|              manaus|            1|   1|
|          BA|            salvador|            7|   1|
|          BA|    lauro de freitas|            2|   2|
|          BA|               bahia|            1|   3|
|          BA|        porto seguro|            1|   4|
|          BA|          barro alto|            1|   5|
|          CE|           fortaleza|            7|   1|
|          CE|            pacatuba|            1|   2|
|          CE|       varzea alegre|            1|   3|
|          CE|    juzeiro do norte|            1|   4|
|          CE|             eusebio|            1|   5|
|          DF|            brasilia|           28|   1|
|          DF|                gama|            1|   2|
|          DF|         brasilia df|            1|   3|
|         

In [54]:
df_sellers.agg(countDistinct('seller_id'), count('*')).show()

+-------------------------+--------+
|count(DISTINCT seller_id)|count(1)|
+-------------------------+--------+
|                     3095|    3095|
+-------------------------+--------+



In [55]:
df_sellers.write.mode('overwrite').parquet(f'{output_head}sellers')

                                                                                

### Products

In [56]:
df_products.show(10)

+--------------------+---------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+
|          product_id|product_category_name|product_name_lenght|product_description_lenght|product_photos_qty|product_weight_g|product_length_cm|product_height_cm|product_width_cm|
+--------------------+---------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+
|1e9e8ef04dbcff454...|           perfumaria|                 40|                       287|                 1|             225|               16|               10|              14|
|3aa071139cb16b67c...|                artes|                 44|                       276|                 1|            1000|               30|               18|              20|
|96bd76ec8810374ed...|        esporte_lazer|                 46|                       250|    

In [57]:
df_products.printSchema()

root
 |-- product_id: string (nullable = true)
 |-- product_category_name: string (nullable = true)
 |-- product_name_lenght: integer (nullable = true)
 |-- product_description_lenght: integer (nullable = true)
 |-- product_photos_qty: integer (nullable = true)
 |-- product_weight_g: integer (nullable = true)
 |-- product_length_cm: integer (nullable = true)
 |-- product_height_cm: integer (nullable = true)
 |-- product_width_cm: integer (nullable = true)



In [59]:
# fix the misspelled column names ("lenght" -> "length")
df_products_renamed = df_products \
    .withColumnRenamed("product_name_lenght", "product_name_length") \
    .withColumnRenamed("product_description_lenght", "product_description_length")

In [60]:
# count product_id values that appear more than once (0 means no duplicates)
df_products_renamed.groupBy("product_id") \
    .count() \
    .filter("count > 1") \
    .count()

0

In [61]:
# check missing values
df_products_renamed.select([count(when(col(c).isNull(), 1)).alias(c) for c in df_products_renamed.columns]).show()

+----------+---------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+
|product_id|product_category_name|product_name_length|product_description_length|product_photos_qty|product_weight_g|product_length_cm|product_height_cm|product_width_cm|
+----------+---------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+
|         0|                  610|                610|                       610|               610|               2|                2|                2|               2|
+----------+---------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+



In [62]:
from functools import reduce
from operator import or_

# rows with a NULL in at least one attribute column (product_id itself has no NULLs)
attribute_cols = [c for c in df_products_renamed.columns if c != "product_id"]
has_any_null = reduce(or_, [col(c).isNull() for c in attribute_cols])

df_products_renamed.filter(has_any_null).show()

print('Total rows with at least one NULL:')
df_products_renamed.filter(has_any_null).count()

+--------------------+---------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+
|          product_id|product_category_name|product_name_length|product_description_length|product_photos_qty|product_weight_g|product_length_cm|product_height_cm|product_width_cm|
+--------------------+---------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+
|a41e356c76fab6633...|                 NULL|               NULL|                      NULL|              NULL|             650|               17|               14|              12|
|d8dee61c2034d6d07...|                 NULL|               NULL|                      NULL|              NULL|             300|               16|                7|              20|
|56139431d72cd51f1...|                 NULL|               NULL|                      NULL|    

611

In [63]:
df_products_renamed.filter(
    col("product_category_name").isNull() &
    col("product_name_length").isNull() &
    col("product_description_length").isNull() &
    col("product_photos_qty").isNull() &
    col("product_weight_g").isNull() &
    col("product_length_cm").isNull() &
    col("product_height_cm").isNull() &
    col("product_width_cm").isNull()
).show()

+--------------------+---------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+
|          product_id|product_category_name|product_name_length|product_description_length|product_photos_qty|product_weight_g|product_length_cm|product_height_cm|product_width_cm|
+--------------------+---------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+
|5eb564652db742ff8...|                 NULL|               NULL|                      NULL|              NULL|            NULL|             NULL|             NULL|            NULL|
+--------------------+---------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+



In [64]:
df_products_renamed.filter(
    col("product_weight_g").isNull() &
    col("product_length_cm").isNull() &
    col("product_height_cm").isNull() &
    col("product_width_cm").isNull()
).show()

+--------------------+---------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+
|          product_id|product_category_name|product_name_length|product_description_length|product_photos_qty|product_weight_g|product_length_cm|product_height_cm|product_width_cm|
+--------------------+---------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+
|09ff539a621711667...|                bebes|                 60|                       865|                 3|            NULL|             NULL|             NULL|            NULL|
|5eb564652db742ff8...|                 NULL|               NULL|                      NULL|              NULL|            NULL|             NULL|             NULL|            NULL|
+--------------------+---------------------+-------------------+--------------------------+----

In [65]:
df_products_renamed.filter(
    col("product_category_name").isNull() &
    col("product_name_length").isNull() &
    col("product_description_length").isNull() &
    col("product_photos_qty").isNull()
).show()

+--------------------+---------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+
|          product_id|product_category_name|product_name_length|product_description_length|product_photos_qty|product_weight_g|product_length_cm|product_height_cm|product_width_cm|
+--------------------+---------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+
|a41e356c76fab6633...|                 NULL|               NULL|                      NULL|              NULL|             650|               17|               14|              12|
|d8dee61c2034d6d07...|                 NULL|               NULL|                      NULL|              NULL|             300|               16|                7|              20|
|56139431d72cd51f1...|                 NULL|               NULL|                      NULL|    
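
The null counts in this section can be reconciled with a little arithmetic. Each of the four descriptive columns has 610 NULLs, each of the four measurement columns has 2, and exactly one product (`5eb564652db742ff8...`) is NULL in all eight. The sketch below is plain Python rather than Spark, and the variable names are illustrative only.

```python
# Counts taken from the cell outputs in this section.
rows_missing_descriptive = 610   # NULL category / name length / description length / photos
rows_missing_measurements = 2    # NULL weight / length / height / width
rows_missing_both = 1            # the single row that is NULL in all eight columns

# Inclusion-exclusion: rows with at least one NULL.
rows_with_any_null = (
    rows_missing_descriptive + rows_missing_measurements - rows_missing_both
)
print(rows_with_any_null)  # 611
```

The result matches the 611 rows returned by the OR filter, which is consistent with the four descriptive columns being NULL in the same 610 rows.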

In [66]:
df_cat_trans.show(5)

+---------------------+-----------------------------+
|product_category_name|product_category_name_english|
+---------------------+-----------------------------+
|         beleza_saude|                health_beauty|
| informatica_acess...|         computers_accesso...|
|           automotivo|                         auto|
|      cama_mesa_banho|               bed_bath_table|
|     moveis_decoracao|              furniture_decor|
+---------------------+-----------------------------+
only showing top 5 rows



In [67]:
df_products_en = df_products_renamed.join(df_cat_trans, "product_category_name", "left")

In [68]:
df_products_en.filter(col('product_category_name_english').isNull()).select('product_category_name').distinct().show(truncate=False)

+---------------------------------------------+
|product_category_name                        |
+---------------------------------------------+
|portateis_cozinha_e_preparadores_de_alimentos|
|pc_gamer                                     |
|NULL                                         |
+---------------------------------------------+



In [69]:
# categories with the most products (show() prints the first 20 rows by default)
df_products_en.groupBy('product_category_name_english').count().orderBy(desc('count')).show(truncate=False)

+-------------------------------+-----+
|product_category_name_english  |count|
+-------------------------------+-----+
|bed_bath_table                 |3029 |
|sports_leisure                 |2867 |
|furniture_decor                |2657 |
|health_beauty                  |2444 |
|housewares                     |2335 |
|auto                           |1900 |
|computers_accessories          |1639 |
|toys                           |1411 |
|watches_gifts                  |1329 |
|telephony                      |1134 |
|baby                           |919  |
|perfumery                      |868  |
|fashion_bags_accessories       |849  |
|stationery                     |849  |
|cool_stuff                     |789  |
|garden_tools                   |753  |
|pet_shop                       |719  |
|NULL                           |623  |
|electronics                    |517  |
|construction_tools_construction|400  |
+-------------------------------+-----+
only showing top 20 rows
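
The NULL row in the English category column (623) is larger than the 610 products that have no category at all. That would leave 13 products in the two categories missing from the translation table (`pc_gamer` and `portateis_cozinha_e_preparadores_de_alimentos`). One possible fix is to fall back to the Portuguese name when no translation exists. The sketch below shows the idea in plain Python; the two-entry dictionary and `translate_category` are illustrative stand-ins, and in Spark the same fallback could be written with `coalesce` over the two columns.

```python
from typing import Optional

# Illustrative subset of the translation table shown earlier.
translations = {
    "beleza_saude": "health_beauty",
    "automotivo": "auto",
}

def translate_category(name: Optional[str]) -> Optional[str]:
    """Return the English name, or the original name if no translation exists."""
    if name is None:
        return None  # a missing category stays missing
    return translations.get(name, name)

print(translate_category("beleza_saude"))  # health_beauty
print(translate_category("pc_gamer"))      # pc_gamer
print(translate_category(None))            # None
```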



In [70]:
# categories with the fewest products (show() prints the first 20 rows by default)
df_products_en.groupBy('product_category_name_english').count().orderBy(asc('count')).show(truncate=False)

+-------------------------------------+-----+
|product_category_name_english        |count|
+-------------------------------------+-----+
|cds_dvds_musicals                    |1    |
|security_and_services                |2    |
|home_comfort_2                       |5    |
|fashion_childrens_clothes            |5    |
|tablets_printing_image               |9    |
|la_cuisine                           |10   |
|furniture_mattress_and_upholstery    |10   |
|diapers_and_hygiene                  |12   |
|flowers                              |14   |
|fashion_sport                        |19   |
|arts_and_craftmanship                |19   |
|party_supplies                       |26   |
|music                                |27   |
|fashio_female_clothing               |27   |
|cine_photo                           |28   |
|computers                            |30   |
|small_appliances_home_oven_and_coffee|31   |
|books_imported                       |31   |
|costruction_tools_tools          

In [71]:
df_products_en.withColumn('product_weight_kg', col('product_weight_g') / 1000) \
.groupBy('product_category_name_english')\
.agg(avg('product_weight_kg').alias('avg_weight_kg'))\
.orderBy(desc('avg_weight_kg'))\
.show(truncate=False)

+---------------------------------------+------------------+
|product_category_name_english          |avg_weight_kg     |
+---------------------------------------+------------------+
|furniture_mattress_and_upholstery      |13.189999999999998|
|office_furniture                       |12.740867313915857|
|kitchen_dining_laundry_garden_furniture|11.598563829787233|
|furniture_bedroom                      |9.997222222222222 |
|home_appliances_2                      |9.913333333333332 |
|furniture_living_room                  |8.93484615384615  |
|computers                              |7.995333333333333 |
|industry_commerce_and_business         |5.929191176470588 |
|agro_industry_and_commerce             |5.263405405405402 |
|air_conditioning                       |4.459959677419355 |
|la_cuisine                             |4.35              |
|small_appliances                       |4.012398268398271 |
|home_confort                           |3.8004504504504513|
|luggage_accessories    

In [72]:
# distribution of product_photos_qty
df_products_en.groupBy("product_photos_qty").count().orderBy("count").show()

+------------------+-----+
|product_photos_qty|count|
+------------------+-----+
|                19|    1|
|                20|    1|
|                18|    2|
|                14|    5|
|                17|    7|
|                15|    8|
|                13|    9|
|                12|   35|
|                11|   46|
|                10|   95|
|                 9|  105|
|                 8|  192|
|                 7|  343|
|              NULL|  610|
|                 6|  968|
|                 5| 1484|
|                 4| 2428|
|                 3| 3860|
|                 2| 6263|
|                 1|16489|
+------------------+-----+



In [73]:
# check for products with a zero weight or a zero dimension
df_products_en.filter(
    (col("product_weight_g") == 0) |
    (col("product_height_cm") == 0) |
    (col("product_width_cm") == 0) |
    (col("product_length_cm") == 0)
).show()

+---------------------+--------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+-----------------------------+
|product_category_name|          product_id|product_name_length|product_description_length|product_photos_qty|product_weight_g|product_length_cm|product_height_cm|product_width_cm|product_category_name_english|
+---------------------+--------------------+-------------------+--------------------------+------------------+----------------+-----------------+-----------------+----------------+-----------------------------+
|      cama_mesa_banho|81781c0fed9fe1ad6...|                 51|                       529|                 1|               0|               30|               25|              30|               bed_bath_table|
|      cama_mesa_banho|8038040ee2a71048d...|                 48|                       528|                 1|               0|               30|           

In [74]:
df_products_en.agg(countDistinct('product_id'), count("*")).show()

+--------------------------+--------+
|count(DISTINCT product_id)|count(1)|
+--------------------------+--------+
|                     32951|   32951|
+--------------------------+--------+



In [75]:
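# persist the raw table as read; the renamed columns in df_products_renamed and
# the English categories in df_products_en are not written here
df_products.write.mode('overwrite').parquet(f'{output_head}products')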

### Geolocation

In [76]:
df_geolocation.show(10)

+---------------------------+-------------------+------------------+----------------+-----------------+
|geolocation_zip_code_prefix|    geolocation_lat|   geolocation_lng|geolocation_city|geolocation_state|
+---------------------------+-------------------+------------------+----------------+-----------------+
|                       1037| -23.54562128115268|-46.63929204800168|       sao paulo|               SP|
|                       1046|-23.546081127035535|-46.64482029837157|       sao paulo|               SP|
|                       1046| -23.54612896641469|-46.64295148361138|       sao paulo|               SP|
|                       1041|  -23.5443921648681|-46.63949930627844|       sao paulo|               SP|
|                       1035|-23.541577961711493|-46.64160722329613|       sao paulo|               SP|
|                       1012|-23.547762303364266|-46.63536053788448|       são paulo|               SP|
|                       1047|-23.546273112412678|-46.64122516971

In [77]:
df_geolocation.printSchema()

root
 |-- geolocation_zip_code_prefix: integer (nullable = true)
 |-- geolocation_lat: double (nullable = true)
 |-- geolocation_lng: double (nullable = true)
 |-- geolocation_city: string (nullable = true)
 |-- geolocation_state: string (nullable = true)



In [78]:
# check missing values
df_geolocation.select([count(when(col(c).isNull(), 1)).alias(c) for c in df_geolocation.columns]).show()



+---------------------------+---------------+---------------+----------------+-----------------+
|geolocation_zip_code_prefix|geolocation_lat|geolocation_lng|geolocation_city|geolocation_state|
+---------------------------+---------------+---------------+----------------+-----------------+
|                          0|              0|              0|               0|                0|
+---------------------------+---------------+---------------+----------------+-----------------+



                                                                                

In [79]:
# count zip code prefixes that appear in more than one row
df_geolocation.groupBy("geolocation_zip_code_prefix") \
    .count() \
    .filter("count > 1") \
    .count()

                                                                                

17972

In [80]:
df_geolocation.groupBy("geolocation_zip_code_prefix") \
    .count() \
    .filter("count > 1") \
    .show()



+---------------------------+-----+
|geolocation_zip_code_prefix|count|
+---------------------------+-----+
|                       2122|   33|
|                       2366|   33|
|                       3918|   50|
|                       4101|   72|
|                       9852|  107|
|                      13289|   61|
|                      26087|  111|
|                      28024|   95|
|                       3226|   24|
|                       4190|   52|
|                      18201|   69|
|                      20396|   11|
|                       1303|  166|
|                       6825|   22|
|                       8257|   60|
|                      13483|   83|
|                      24855|   41|
|                      25638|    2|
|                       2443|   80|
|                       2721|  149|
+---------------------------+-----+
only showing top 20 rows



                                                                                

In [81]:
# distinct coordinates recorded for a single zip code prefix
df_geolocation.filter('geolocation_zip_code_prefix = 1238') \
    .groupBy('geolocation_lat', 'geolocation_lng') \
    .count() \
    .orderBy(desc('count')) \
    .show()



+-------------------+-------------------+-----+
|    geolocation_lat|    geolocation_lng|count|
+-------------------+-------------------+-----+
|-23.540840832361937|-46.650614537191984|   11|
| -23.54339898029315| -46.65151833472787|    8|
|  -23.5444758373536| -46.65439170330084|    8|
| -23.54143969441404|-46.660078121559266|    7|
|-23.544121128772638| -46.65531989785911|    7|
|-23.543897741329538| -46.65162344947782|    7|
| -23.54397971579432|-46.651665442264445|    7|
| -23.54052443880199| -46.65050381848346|    6|
| -23.54323868858565| -46.65731599111321|    6|
| -23.54126327699232| -46.65082846280802|    5|
|-23.545262137111173|-46.652460296699154|    4|
| -23.54029141326677| -46.65043534401869|    4|
| -23.54112827699232|-46.650778962808005|    4|
| -23.54069585789716| -46.65056346058632|    4|
|-23.541332668878816|-46.660236596024035|    4|
|-23.543396632263747| -46.65715435665641|    4|
| -23.54156737887341|  -46.6599863751225|    4|
|-23.543927088810687| -46.65582194892955

                                                                                

In [82]:
df_geolocation.select("geolocation_state").distinct().show(27)



+-----------------+
|geolocation_state|
+-----------------+
|               SP|
|               RJ|
|               AC|
|               ES|
|               RN|
|               MS|
|               CE|
|               MG|
|               DF|
|               RO|
|               AM|
|               MT|
|               PB|
|               BA|
|               SE|
|               PR|
|               AP|
|               RR|
|               TO|
|               AL|
|               SC|
|               PI|
|               GO|
|               RS|
|               PA|
|               PE|
|               MA|
+-----------------+



                                                                                

In [83]:
df_geolocation.groupBy("geolocation_state").agg(countDistinct("geolocation_zip_code_prefix").alias("num_zip_code")).orderBy(desc("num_zip_code")).show(27)



+-----------------+------------+
|geolocation_state|num_zip_code|
+-----------------+------------+
|               SP|        6349|
|               MG|        1868|
|               RJ|        1390|
|               RS|        1132|
|               PR|        1046|
|               BA|         992|
|               GO|         773|
|               SC|         620|
|               PE|         596|
|               CE|         548|
|               DF|         516|
|               PB|         324|
|               ES|         315|
|               MA|         313|
|               PA|         309|
|               PI|         307|
|               RN|         280|
|               MT|         254|
|               MS|         242|
|               TO|         184|
|               AL|         178|
|               AM|         144|
|               SE|         135|
|               RO|         108|
|               AC|          46|
|               RR|          28|
|               AP|          26|
+---------

                                                                                

In [84]:
df_geolocation.groupBy("geolocation_city") \
    .agg(countDistinct("geolocation_lat", "geolocation_lng").alias("unique_specific_locations")) \
    .orderBy(desc("unique_specific_locations")) \
    .show(10, truncate=False)



+----------------+-------------------------+
|geolocation_city|unique_specific_locations|
+----------------+-------------------------+
|sao paulo       |79774                    |
|rio de janeiro  |35075                    |
|são paulo       |19682                    |
|belo horizonte  |19444                    |
|curitiba        |11236                    |
|porto alegre    |8668                     |
|salvador        |8058                     |
|guarulhos       |7393                     |
|brasilia        |6868                     |
|osasco          |4990                     |
+----------------+-------------------------+
only showing top 10 rows



                                                                                

In [85]:
df_geolocation.groupBy("geolocation_city") \
    .agg(countDistinct("geolocation_zip_code_prefix").alias("unique_general_locations")) \
    .orderBy(desc("unique_general_locations")) \
    .show(10, truncate=False)



+----------------+------------------------+
|geolocation_city|unique_general_locations|
+----------------+------------------------+
|sao paulo       |3171                    |
|são paulo       |3013                    |
|brasilia        |496                     |
|brasília        |406                     |
|rio de janeiro  |404                     |
|salvador        |275                     |
|goiania         |233                     |
|goiânia         |210                     |
|belo horizonte  |205                     |
|fortaleza       |172                     |
+----------------+------------------------+
only showing top 10 rows
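
Both outputs list the same city under two spellings (`sao paulo` / `são paulo`, `brasilia` / `brasília`, `goiania` / `goiânia`), so the per-city counts are split across rows. A minimal sketch of one way to normalize the names, assuming accent stripping is acceptable for this analysis; `normalize_city` is a hypothetical helper written in plain Python, which could be wrapped in a Spark UDF.

```python
import unicodedata

def normalize_city(name: str) -> str:
    """Lowercase, trim, and strip accents from a city name."""
    decomposed = unicodedata.normalize("NFKD", name.strip().lower())
    # Drop the combining marks that NFKD splits off from accented letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

for city in ["são paulo", "brasília", "goiânia", "sao paulo"]:
    print(normalize_city(city))
```

Variants that differ by more than accents, such as `brasilia df` in the sellers table, would still need a separate rule.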



                                                                                

In [86]:
df_geolocation.agg(countDistinct("geolocation_zip_code_prefix"), count("*")).show()



+-------------------------------------------+--------+
|count(DISTINCT geolocation_zip_code_prefix)|count(1)|
+-------------------------------------------+--------+
|                                      19015| 1000163|
+-------------------------------------------+--------+



                                                                                

In [87]:
df_geolocation.write.mode('overwrite').parquet(f'{output_head}geolocation')

                                                                                

### Orders

In [88]:
df_orders.show(5)

+--------------------+--------------------+------------+------------------------+-------------------+----------------------------+-----------------------------+-----------------------------+
|            order_id|         customer_id|order_status|order_purchase_timestamp|  order_approved_at|order_delivered_carrier_date|order_delivered_customer_date|order_estimated_delivery_date|
+--------------------+--------------------+------------+------------------------+-------------------+----------------------------+-----------------------------+-----------------------------+
|e481f51cbdc54678b...|9ef432eb625129730...|   delivered|     2017-10-02 10:56:33|2017-10-02 11:07:15|         2017-10-04 19:55:00|          2017-10-10 21:25:13|          2017-10-18 00:00:00|
|53cdb2fc8bc7dce0b...|b0830fb4747a6c6d2...|   delivered|     2018-07-24 20:41:37|2018-07-26 03:24:27|         2018-07-26 14:31:00|          2018-08-07 15:27:45|          2018-08-13 00:00:00|
|47770eb9100c2d0c4...|41ce2a54c0b03bf34...|  

In [89]:
df_orders.printSchema()

root
 |-- order_id: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- order_status: string (nullable = true)
 |-- order_purchase_timestamp: string (nullable = true)
 |-- order_approved_at: string (nullable = true)
 |-- order_delivered_carrier_date: string (nullable = true)
 |-- order_delivered_customer_date: string (nullable = true)
 |-- order_estimated_delivery_date: string (nullable = true)



In [90]:
# distribution of order_status
df_orders.groupBy("order_status").count().orderBy("count").show()



+------------+-----+
|order_status|count|
+------------+-----+
|    approved|    2|
|     created|    5|
|  processing|  301|
|    invoiced|  314|
| unavailable|  609|
|    canceled|  625|
|     shipped| 1107|
|   delivered|96478|
+------------+-----+



                                                                                

In [91]:
# check if there is any duplicate order_id
df_orders.groupBy("order_id") \
    .count() \
    .filter("count > 1") \
    .count()

                                                                                

0

In [92]:
# check missing values
df_orders.select([count(when(col(c).isNull(), 1)).alias(c) for c in df_orders.columns]).show()



+--------+-----------+------------+------------------------+-----------------+----------------------------+-----------------------------+-----------------------------+
|order_id|customer_id|order_status|order_purchase_timestamp|order_approved_at|order_delivered_carrier_date|order_delivered_customer_date|order_estimated_delivery_date|
+--------+-----------+------------+------------------------+-----------------+----------------------------+-----------------------------+-----------------------------+
|       0|          0|           0|                       0|              160|                        1783|                         2965|                            0|
+--------+-----------+------------+------------------------+-----------------+----------------------------+-----------------------------+-----------------------------+



                                                                                

In [93]:
df_orders.filter(col("order_approved_at").isNull()).show()

+--------------------+--------------------+------------+------------------------+-----------------+----------------------------+-----------------------------+-----------------------------+
|            order_id|         customer_id|order_status|order_purchase_timestamp|order_approved_at|order_delivered_carrier_date|order_delivered_customer_date|order_estimated_delivery_date|
+--------------------+--------------------+------------+------------------------+-----------------+----------------------------+-----------------------------+-----------------------------+
|00b1cb0320190ca0d...|3532ba38a3fd24225...|    canceled|     2018-08-28 15:26:39|             NULL|                        NULL|                         NULL|          2018-09-12 00:00:00|
|ed3efbd3a87bea76c...|191984a8ba4cbb214...|    canceled|     2018-09-20 13:54:16|             NULL|                        NULL|                         NULL|          2018-10-17 00:00:00|
|df8282afe61008dc2...|aa797b187b5466bc6...|    canceled

In [94]:
# orders that are neither canceled nor unavailable and have no approval timestamp,
# yet were handed to the carrier and delivered
df_orders \
.filter(~col("order_status").isin("canceled", "unavailable"))\
.filter(
    col("order_approved_at").isNull() &
    col("order_delivered_carrier_date").isNotNull() &
    col("order_delivered_customer_date").isNotNull()
    )\
.show()

+--------------------+--------------------+------------+------------------------+-----------------+----------------------------+-----------------------------+-----------------------------+
|            order_id|         customer_id|order_status|order_purchase_timestamp|order_approved_at|order_delivered_carrier_date|order_delivered_customer_date|order_estimated_delivery_date|
+--------------------+--------------------+------------+------------------------+-----------------+----------------------------+-----------------------------+-----------------------------+
|e04abd8149ef81b95...|2127dc6603ac33544...|   delivered|     2017-02-18 14:40:00|             NULL|         2017-02-23 12:04:47|          2017-03-01 13:25:33|          2017-03-17 00:00:00|
|8a9adc69528e1001f...|4c1ccc74e00993733...|   delivered|     2017-02-18 12:45:31|             NULL|         2017-02-23 09:01:52|          2017-03-02 10:05:06|          2017-03-21 00:00:00|
|7013bcfc1c97fe719...|2941af76d38100e0f...|   delivered
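
When excluding more than one status, the inequality checks have to be combined with AND (or a negated `isin`). Joined with OR they keep every row, because no status equals both values at once. A plain-Python illustration using the statuses from the `order_status` distribution:

```python
statuses = ["approved", "created", "processing", "invoiced",
            "unavailable", "canceled", "shipped", "delivered"]

# OR of two inequalities: every status differs from at least one of the two values.
kept_with_or = [s for s in statuses if s != "canceled" or s != "unavailable"]

# AND of two inequalities: drops both unwanted statuses.
kept_with_and = [s for s in statuses if s != "canceled" and s != "unavailable"]

print(len(kept_with_or))   # 8
print(len(kept_with_and))  # 6
```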

In [95]:
# delivered orders with no carrier hand-over date
# (the status check already excludes canceled and unavailable orders)
df_orders \
.filter(
    (col("order_delivered_carrier_date").isNull()) &
    (col("order_status") ==  "delivered")
    )\
.show()

+--------------------+--------------------+------------+------------------------+-------------------+----------------------------+-----------------------------+-----------------------------+
|            order_id|         customer_id|order_status|order_purchase_timestamp|  order_approved_at|order_delivered_carrier_date|order_delivered_customer_date|order_estimated_delivery_date|
+--------------------+--------------------+------------+------------------------+-------------------+----------------------------+-----------------------------+-----------------------------+
|2aa91108853cecb43...|afeb16c7f46396c0e...|   delivered|     2017-09-29 08:52:58|2017-09-29 09:07:16|                        NULL|          2017-11-20 19:44:47|          2017-11-14 00:00:00|
|2d858f451373b04fb...|e08caf668d499a6d6...|   delivered|     2017-05-25 23:22:43|2017-05-25 23:30:16|                        NULL|                         NULL|          2017-06-23 00:00:00|
+--------------------+--------------------+--

In [96]:
# Delivery time per order in days (purchase to customer delivery), longest first
df_orders = df_orders.withColumn("order_purchase_timestamp", to_timestamp(col("order_purchase_timestamp")))
df_orders = df_orders.withColumn("order_delivered_customer_date", to_timestamp(col("order_delivered_customer_date")))
df_delivery = df_orders.select('order_id', 'order_purchase_timestamp', 'order_delivered_customer_date')
df_delivery \
    .withColumn('delivery_time', datediff(col('order_delivered_customer_date'), col('order_purchase_timestamp'))) \
    .orderBy(desc('delivery_time')) \
    .show()

+--------------------+------------------------+-----------------------------+-------------+
|            order_id|order_purchase_timestamp|order_delivered_customer_date|delivery_time|
+--------------------+------------------------+-----------------------------+-------------+
|ca07593549f1816d2...|     2017-02-21 23:31:27|          2017-09-19 14:36:39|          210|
|1b3190b2dfa9d789e...|     2018-02-23 14:57:35|          2018-09-19 23:24:07|          208|
|440d0d17af552815d...|     2017-03-07 23:59:51|          2017-09-19 15:12:50|          196|
|2fb597c2f772eca01...|     2017-03-08 18:09:02|          2017-09-19 14:33:17|          195|
|285ab9426d6982034...|     2017-03-08 22:47:40|          2017-09-19 14:00:04|          195|
|0f4519c5f1c541dde...|     2017-03-09 13:26:57|          2017-09-19 14:38:21|          194|
|47b40429ed8cce3ae...|     2018-01-03 09:44:01|          2018-07-13 20:51:31|          191|
|2fe324febf907e3ea...|     2017-03-13 20:17:10|          2017-09-19 17:00:07|   
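
The `delivery_time` values above can be cross-checked outside Spark. This is a minimal sketch, assuming `datediff` compares calendar dates only and ignores the time of day (the rows shown are consistent with that). The helper name `days_between` is hypothetical, and the sample holds only the first three rows of the output, not the full dataset.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def days_between(purchased: str, delivered: str) -> int:
    """Whole calendar days from purchase date to delivery date (time of day dropped)."""
    purchase_date = datetime.strptime(purchased, FMT).date()
    delivery_date = datetime.strptime(delivered, FMT).date()
    return (delivery_date - purchase_date).days


# (order_purchase_timestamp, order_delivered_customer_date) copied from the output above
sample = [
    ("2017-02-21 23:31:27", "2017-09-19 14:36:39"),
    ("2018-02-23 14:57:35", "2018-09-19 23:24:07"),
    ("2017-03-07 23:59:51", "2017-09-19 15:12:50"),
]

delivery_days = [days_between(p, d) for p, d in sample]
print(delivery_days)  # [210, 208, 196]
```

An actual average over all orders would still need an aggregation on the Spark side (for example `avg` over `delivery_time`); the cell above only ranks individual orders.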

                                                                                

In [97]:
# Count orders per customer_id. Every count is 1 because customer_id is unique per order;
# to count orders per actual customer, join with the customers table and use customer_unique_id.
df_orders.groupBy("customer_id").agg(count("order_id").alias("num_orders")).orderBy(desc("num_orders")).show()



+--------------------+----------+
|         customer_id|num_orders|
+--------------------+----------+
|f1e46939e6408b3e6...|         1|
|3391c4bc11a817e79...|         1|
|90d7075599361b694...|         1|
|5a58afc695ee03b9b...|         1|
|a340ce6c3570e68d4...|         1|
|9687241c8ed401845...|         1|
|b76530b7e66b27cd6...|         1|
|a3537e15f3e4f88e0...|         1|
|4d91a0aeb419c5f26...|         1|
|2774b6023c768c91e...|         1|
|2aec499f94f5e8278...|         1|
|5a6ae5b68fd27c687...|         1|
|791fffae1e2c66693...|         1|
|2b139a6842a26c357...|         1|
|0721e1c4b91bc6ded...|         1|
|9601347c41eb3f6dd...|         1|
|d480546bdc6b03fca...|         1|
|be170bc588de8c9e1...|         1|
|6f8b4eeaba59ef3fa...|         1|
|18dff7c2afb1ce5fb...|         1|
+--------------------+----------+
only showing top 20 rows



                                                                                

In [98]:
df_orders_customers = df_orders.join(df_customers, "customer_id", "left")
df_orders_customers.show()

                                                                                

+--------------------+--------------------+------------+------------------------+-------------------+----------------------------+-----------------------------+-----------------------------+--------------------+------------------------+--------------------+--------------+
|         customer_id|            order_id|order_status|order_purchase_timestamp|  order_approved_at|order_delivered_carrier_date|order_delivered_customer_date|order_estimated_delivery_date|  customer_unique_id|customer_zip_code_prefix|       customer_city|customer_state|
+--------------------+--------------------+------------+------------------------+-------------------+----------------------------+-----------------------------+-----------------------------+--------------------+------------------------+--------------------+--------------+
|9ef432eb625129730...|e481f51cbdc54678b...|   delivered|     2017-10-02 10:56:33|2017-10-02 11:07:15|         2017-10-04 19:55:00|          2017-10-10 21:25:13|          2017-10-18 

In [99]:
df_orders_customers.groupBy("customer_unique_id").agg(countDistinct("order_id").alias("num_orders")).orderBy(desc("num_orders")).show()



+--------------------+----------+
|  customer_unique_id|num_orders|
+--------------------+----------+
|8d50f5eadf50201cc...|        17|
|3e43e6105506432c9...|         9|
|1b6c7548a2a1f9037...|         7|
|ca77025e7201e3b30...|         7|
|6469f99c1f9dfae77...|         7|
|de34b16117594161a...|         6|
|47c1a3033b8b77b3a...|         6|
|f0e310a6839dce9de...|         6|
|12f5d6e1cbf93dafd...|         6|
|63cfc61cee11cbe30...|         6|
|dc813062e0fc23409...|         6|
|56c8638e7c058b98a...|         5|
|b4e4f24de1e8725b7...|         5|
|5e8f38a9a1c023f3d...|         5|
|4e65032f1f574189f...|         5|
|394ac4de8f3acb142...|         5|
|74cb1ad7e6d567432...|         5|
|35ecdf6858edc6427...|         5|
|fe81bb32c243a86b2...|         5|
|083ca1aa470c28023...|         4|
+--------------------+----------+
only showing top 20 rows
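
The difference between the two groupings can be reproduced on a tiny example. This is a sketch with made-up IDs (`c1`, `u1`, `o1` and so on are hypothetical, not values from the dataset), assuming, as the outputs above suggest, that each order carries its own `customer_id` while `customer_unique_id` identifies the person.

```python
from collections import defaultdict

# (customer_id, customer_unique_id, order_id): made-up rows for illustration
rows = [
    ("c1", "u1", "o1"),
    ("c2", "u1", "o2"),  # same person as above, new customer_id for the new order
    ("c3", "u2", "o3"),
]

orders_by_customer_id = defaultdict(set)
orders_by_unique_id = defaultdict(set)
for customer_id, unique_id, order_id in rows:
    orders_by_customer_id[customer_id].add(order_id)
    orders_by_unique_id[unique_id].add(order_id)

# Distinct orders per key, mirroring countDistinct("order_id")
num_orders_by_customer_id = {k: len(v) for k, v in orders_by_customer_id.items()}
num_orders_by_unique_id = {k: len(v) for k, v in orders_by_unique_id.items()}

print(num_orders_by_customer_id)  # {'c1': 1, 'c2': 1, 'c3': 1}
print(num_orders_by_unique_id)    # {'u1': 2, 'u2': 1}
```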



                                                                                

In [100]:
df_orders_customers.filter(col('customer_unique_id') == '8d50f5eadf50201ccdcedfb9e2ac8455').show()

+--------------------+--------------------+------------+------------------------+-------------------+----------------------------+-----------------------------+-----------------------------+--------------------+------------------------+-------------+--------------+
|         customer_id|            order_id|order_status|order_purchase_timestamp|  order_approved_at|order_delivered_carrier_date|order_delivered_customer_date|order_estimated_delivery_date|  customer_unique_id|customer_zip_code_prefix|customer_city|customer_state|
+--------------------+--------------------+------------+------------------------+-------------------+----------------------------+-----------------------------+-----------------------------+--------------------+------------------------+-------------+--------------+
|897b7f72042714efa...|c2213109a2cc0e75d...|   delivered|     2018-08-07 23:32:14|2018-08-07 23:45:21|         2018-08-09 13:35:00|          2018-08-10 20:26:44|          2018-08-13 00:00:00|8d50f5eadf50

In [101]:
# check that every timestamp column matches the format YYYY-MM-DD HH:MM:SS
timestamp_cols = [
    "order_purchase_timestamp",
    "order_approved_at",
    "order_delivered_carrier_date",
    "order_delivered_customer_date",
    "order_estimated_delivery_date"
]

valid_format_timestamp = r'^\d{4}-(0[1-9]|1[0-2])-' \
                         r'(0[1-9]|[12]\d|3[01]) ' \
                         r'(0\d|1\d|2[0-3]):' \
                         r'([0-5]\d):' \
                         r'([0-5]\d)$'

for col_name in timestamp_cols:
    df_orders = df_orders.withColumn(
        f"{col_name}_is_valid",
        col(col_name).rlike(valid_format_timestamp)
    )

# Show rows where any timestamp column fails the format check
df_orders.filter(
    (col("order_purchase_timestamp_is_valid") == False) |
    (col("order_approved_at_is_valid") == False) |
    (col("order_delivered_carrier_date_is_valid") == False) |
    (col("order_delivered_customer_date_is_valid") == False) |
    (col("order_estimated_delivery_date_is_valid") == False)
).show()

                                                                                

+--------+-----------+------------+------------------------+-----------------+----------------------------+-----------------------------+-----------------------------+---------------------------------+--------------------------+-------------------------------------+--------------------------------------+--------------------------------------+
|order_id|customer_id|order_status|order_purchase_timestamp|order_approved_at|order_delivered_carrier_date|order_delivered_customer_date|order_estimated_delivery_date|order_purchase_timestamp_is_valid|order_approved_at_is_valid|order_delivered_carrier_date_is_valid|order_delivered_customer_date_is_valid|order_estimated_delivery_date_is_valid|
+--------+-----------+------------+------------------------+-----------------+----------------------------+-----------------------------+-----------------------------+---------------------------------+--------------------------+-------------------------------------+--------------------------------------+-----
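
To see what the pattern accepts and rejects, it can be tried on a few strings with Python's `re` module. This is a sketch, not part of the pipeline: the helper `looks_like_timestamp` is hypothetical, and returning `None` for a missing value is an assumption meant to mirror how a NULL cell yields NULL (neither true nor false) under `rlike`.

```python
import re

# Same pattern as in the cell above, compiled once
VALID_TIMESTAMP = re.compile(
    r'^\d{4}-(0[1-9]|1[0-2])-'
    r'(0[1-9]|[12]\d|3[01]) '
    r'(0\d|1\d|2[0-3]):'
    r'([0-5]\d):'
    r'([0-5]\d)$'
)


def looks_like_timestamp(value):
    """True/False for strings; None for a missing value."""
    if value is None:
        return None
    return VALID_TIMESTAMP.search(value) is not None


results = {
    "2017-10-02 10:56:33": looks_like_timestamp("2017-10-02 10:56:33"),  # well formed
    "2017-13-02 10:56:33": looks_like_timestamp("2017-13-02 10:56:33"),  # month 13
    "2017-10-02": looks_like_timestamp("2017-10-02"),                    # no time part
    "2017-02-31 00:00:00": looks_like_timestamp("2017-02-31 00:00:00"),  # format only
}
print(results)
```

Two things are worth noting. The pattern checks the shape of the string, not calendar validity, so `2017-02-31 00:00:00` passes. And because a NULL input is neither valid nor invalid, a filter on `== False` does not return rows with missing timestamps; those are better counted separately with a null check.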

In [102]:
df_orders.agg(countDistinct("order_id"), count("*")).show()



+------------------------+--------+
|count(DISTINCT order_id)|count(1)|
+------------------------+--------+
|                   99441|   99441|
+------------------------+--------+



                                                                                

In [103]:
df_orders.write.mode('overwrite').parquet(f'{output_head}orders')

                                                                                

### Order Items

In [104]:
df_order_items.show(5)

+--------------------+-------------+--------------------+--------------------+-------------------+-----+-------------+
|            order_id|order_item_id|          product_id|           seller_id|shipping_limit_date|price|freight_value|
+--------------------+-------------+--------------------+--------------------+-------------------+-----+-------------+
|00010242fe8c5a6d1...|            1|4244733e06e7ecb49...|48436dade18ac8b2b...|2017-09-19 09:45:35| 58.9|        13.29|
|00018f77f2f0320c5...|            1|e5f2d52b802189ee6...|dd7ddc04e1b6c2c61...|2017-05-03 11:05:13|239.9|        19.93|
|000229ec398224ef6...|            1|c777355d18b72b67a...|5b51032eddd242adc...|2018-01-18 14:48:30|199.0|        17.87|
|00024acbcdf0a6daa...|            1|7634da152a4610f15...|9d7a1d34a50524090...|2018-08-15 10:10:18|12.99|        12.79|
|00042b26cf59d7ce6...|            1|ac6c3623068f30de0...|df560393f3a51e745...|2017-02-13 13:57:51|199.9|        18.14|
+--------------------+-------------+------------

                                                                                

In [105]:
df_order_items.printSchema()

root
 |-- order_id: string (nullable = true)
 |-- order_item_id: integer (nullable = true)
 |-- product_id: string (nullable = true)
 |-- seller_id: string (nullable = true)
 |-- shipping_limit_date: string (nullable = true)
 |-- price: double (nullable = true)
 |-- freight_value: double (nullable = true)



In [106]:
# check missing values
df_order_items.select([count(when(col(c).isNull(), 1)).alias(c) for c in df_order_items.columns]).show()



+--------+-------------+----------+---------+-------------------+-----+-------------+
|order_id|order_item_id|product_id|seller_id|shipping_limit_date|price|freight_value|
+--------+-------------+----------+---------+-------------------+-----+-------------+
|       0|            0|         0|        0|                  0|    0|            0|
+--------+-------------+----------+---------+-------------------+-----+-------------+



                                                                                

In [107]:
# check if there is any duplicate row -> order_id + order_item_id
df_order_items.groupBy("order_id", "order_item_id") \
    .count() \
    .filter("count > 1") \
    .count()

                                                                                

0

In [108]:
df_order_items.groupBy('order_item_id').count().orderBy(desc('count')).show(50)



+-------------+-----+
|order_item_id|count|
+-------------+-----+
|            1|98666|
|            2| 9803|
|            3| 2287|
|            4|  965|
|            5|  460|
|            6|  256|
|            7|   58|
|            8|   36|
|            9|   28|
|           10|   25|
|           11|   17|
|           12|   13|
|           13|    8|
|           14|    7|
|           15|    5|
|           19|    3|
|           16|    3|
|           20|    3|
|           18|    3|
|           17|    3|
|           21|    1|
+-------------+-----+



                                                                                

In [109]:
df_order_items.filter('order_id = "00143d0f86d6fbd9f9b38ab440ac16f5"').show()

                                                                                

+--------------------+-------------+--------------------+--------------------+-------------------+-----+-------------+
|            order_id|order_item_id|          product_id|           seller_id|shipping_limit_date|price|freight_value|
+--------------------+-------------+--------------------+--------------------+-------------------+-----+-------------+
|00143d0f86d6fbd9f...|            1|e95ee6822b66ac605...|a17f621c590ea0fab...|2017-10-20 16:07:52|21.33|         15.1|
|00143d0f86d6fbd9f...|            2|e95ee6822b66ac605...|a17f621c590ea0fab...|2017-10-20 16:07:52|21.33|         15.1|
|00143d0f86d6fbd9f...|            3|e95ee6822b66ac605...|a17f621c590ea0fab...|2017-10-20 16:07:52|21.33|         15.1|
+--------------------+-------------+--------------------+--------------------+-------------------+-----+-------------+



In [110]:
df_order_items.groupBy('order_id', 'product_id').count().orderBy(desc('count')).show(truncate=False)



+--------------------------------+--------------------------------+-----+
|order_id                        |product_id                      |count|
+--------------------------------+--------------------------------+-----+
|1b15974a0141d54e36626dca3fdc731a|ee3d532c8a438679776d222e997606b3|20   |
|ab14fdcfbe524636d65ee38360e22ce8|9571759451b1d780ee7c15012ea109d4|20   |
|9ef13efd6949e4573a18964dd1bbe7f5|37eb69aca8718e843d897aa7b82f462d|15   |
|428a2f660dc84138d969ccd69a0ab6d5|89b190a046022486c635022524a974a8|15   |
|73c8ab38f07dc94389065f7eba4f297a|422879e10f46682990de24d770e7f83d|14   |
|9bdc4d4c71aa1de4606060929dee888c|44a5d24dd383324a421569ca697b13c2|14   |
|37ee401157a3a0b28c9c6d0ed8c3b24b|d34c07a2d817ac73f4caf8c574215fed|13   |
|3a213fcdfe7d98be74ea0dc05a8b31ae|a62e25e09e05e6faf31d90c6ec1aa3d1|12   |
|2c2a19b5703863c908512d135aa6accc|03e1c946c0ddfc58724ff262aef08dff|12   |
|6c355e2913545fa6f72c40cbca57729e|32e18e89237933ebdaaebd78a27e7fa1|11   |
|8272b63d03f5f79c56e9e4120aec44ef|05b5

                                                                                

In [111]:
df_order_items.filter('order_id = "8272b63d03f5f79c56e9e4120aec44ef"').show(30, truncate=False)

+--------------------------------+-------------+--------------------------------+--------------------------------+-------------------+-----+-------------+
|order_id                        |order_item_id|product_id                      |seller_id                       |shipping_limit_date|price|freight_value|
+--------------------------------+-------------+--------------------------------+--------------------------------+-------------------+-----+-------------+
|8272b63d03f5f79c56e9e4120aec44ef|1            |270516a3f41dc035aa87d220228f844c|2709af9587499e95e803a6498a5a56e9|2017-07-21 18:25:23|1.2  |7.89         |
|8272b63d03f5f79c56e9e4120aec44ef|2            |05b515fdc76e888aada3c6d66c201dff|2709af9587499e95e803a6498a5a56e9|2017-07-21 18:25:23|1.2  |7.89         |
|8272b63d03f5f79c56e9e4120aec44ef|3            |05b515fdc76e888aada3c6d66c201dff|2709af9587499e95e803a6498a5a56e9|2017-07-21 18:25:23|1.2  |7.89         |
|8272b63d03f5f79c56e9e4120aec44ef|4            |05b515fdc76e888aada3c6

In [112]:
# check that shipping_limit_date matches the format YYYY-MM-DD HH:MM:SS
timestamp_cols = [
    "shipping_limit_date",
]

valid_format_timestamp = r'^\d{4}-(0[1-9]|1[0-2])-' \
                         r'(0[1-9]|[12]\d|3[01]) ' \
                         r'(0\d|1\d|2[0-3]):' \
                         r'([0-5]\d):' \
                         r'([0-5]\d)$'

for col_name in timestamp_cols:
    df_order_items = df_order_items.withColumn(
        f"{col_name}_is_valid",
        col(col_name).rlike(valid_format_timestamp)
    )

# Show rows whose shipping_limit_date fails the format check
df_order_items.filter(
    col("shipping_limit_date_is_valid") == False
).show()

+--------+-------------+----------+---------+-------------------+-----+-------------+----------------------------+
|order_id|order_item_id|product_id|seller_id|shipping_limit_date|price|freight_value|shipping_limit_date_is_valid|
+--------+-------------+----------+---------+-------------------+-----+-------------+----------------------------+
+--------+-------------+----------+---------+-------------------+-----+-------------+----------------------------+



In [113]:
df_order_items.agg(countDistinct("order_id"), count("*")).show()

+------------------------+--------+
|count(DISTINCT order_id)|count(1)|
+------------------------+--------+
|                   98666|  112650|
+------------------------+--------+



In [114]:
df_order_items.write.mode('overwrite').parquet(f'{output_head}order_items')

                                                                                

### Order Payments

In [115]:
df_order_payments.show(5)

+--------------------+------------------+------------+--------------------+-------------+
|            order_id|payment_sequential|payment_type|payment_installments|payment_value|
+--------------------+------------------+------------+--------------------+-------------+
|b81ef226f3fe1789b...|                 1| credit_card|                   8|        99.33|
|a9810da82917af2d9...|                 1| credit_card|                   1|        24.39|
|25e8ea4e93396b6fa...|                 1| credit_card|                   1|        65.71|
|ba78997921bbcdc13...|                 1| credit_card|                   8|       107.78|
|42fdf880ba16b47b5...|                 1| credit_card|                   2|       128.45|
+--------------------+------------------+------------+--------------------+-------------+
only showing top 5 rows



                                                                                

In [116]:
df_order_payments.printSchema()

root
 |-- order_id: string (nullable = true)
 |-- payment_sequential: integer (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- payment_installments: integer (nullable = true)
 |-- payment_value: double (nullable = true)



In [117]:
# check missing values
df_order_payments.select([count(when(col(c).isNull(), 1)).alias(c) for c in df_order_payments.columns]).show()



+--------+------------------+------------+--------------------+-------------+
|order_id|payment_sequential|payment_type|payment_installments|payment_value|
+--------+------------------+------------+--------------------+-------------+
|       0|                 0|           0|                   0|            0|
+--------+------------------+------------+--------------------+-------------+



                                                                                

In [118]:
# check if there is any duplicate row -> order_id + payment_sequential
df_order_payments.groupBy("order_id", "payment_sequential") \
    .count() \
    .filter("count > 1") \
    .count()

                                                                                

0

In [119]:
df_order_payments.select("payment_type").distinct().show()



+------------+
|payment_type|
+------------+
| not_defined|
| credit_card|
|      boleto|
|  debit_card|
|     voucher|
+------------+



                                                                                

In [120]:
df_order_payments.select("payment_sequential").distinct().orderBy("payment_sequential").show()



+------------------+
|payment_sequential|
+------------------+
|                 1|
|                 2|
|                 3|
|                 4|
|                 5|
|                 6|
|                 7|
|                 8|
|                 9|
|                10|
|                11|
|                12|
|                13|
|                14|
|                15|
|                16|
|                17|
|                18|
|                19|
|                20|
+------------------+
only showing top 20 rows



                                                                                

In [121]:
df_order_payments.select("payment_installments").distinct().orderBy("payment_installments").show()

+--------------------+
|payment_installments|
+--------------------+
|                   0|
|                   1|
|                   2|
|                   3|
|                   4|
|                   5|
|                   6|
|                   7|
|                   8|
|                   9|
|                  10|
|                  11|
|                  12|
|                  13|
|                  14|
|                  15|
|                  16|
|                  17|
|                  18|
|                  20|
+--------------------+
only showing top 20 rows



In [122]:
df_order_payments.groupBy("payment_type").count().orderBy(desc("count")).show()

+------------+-----+
|payment_type|count|
+------------+-----+
| credit_card|76795|
|      boleto|19784|
|     voucher| 5775|
|  debit_card| 1529|
| not_defined|    3|
+------------+-----+



In [123]:
df_order_payments.groupBy("order_id").count().filter("count > 1").count()

                                                                                

2961

In [124]:
df_order_payments.groupBy("order_id") \
    .count() \
    .filter("count > 1") \
    .orderBy("count", ascending=False) \
    .show(20, truncate=False)



+--------------------------------+-----+
|order_id                        |count|
+--------------------------------+-----+
|fa65dad1b0e818e3ccc5cb0e39231352|29   |
|ccf804e764ed5650cd8759557269dc13|26   |
|285c2e15bebd4ac83635ccc563dc71f4|22   |
|895ab968e7bb0d5659d16cd74cd1650c|21   |
|fedcd9f7ccdc8cba3a18defedd1a5547|19   |
|ee9ca989fc93ba09a6eddc250ce01742|19   |
|21577126c19bf11a0b91592e5844ba78|15   |
|4bfcba9e084f46c8e3cb49b0fa6e6159|15   |
|4689b1816de42507a7d63a4617383c59|14   |
|3c58bffb70dcf45f12bdf66a3c215905|14   |
|73df5d6adbeea12c8ae03df93f346e86|13   |
|cf101c3abd3c061ca9f78c1bbb1125af|13   |
|4fb76fa13b108a0d0478483421b0992c|13   |
|6d58638e32674bebee793a47ac4cbadc|12   |
|1d9a9731b9c10fc9cba74e6f74782e8b|12   |
|1a611328643ae11146ba09a4425d2e12|12   |
|d744783ed2ace06cac647a9e64dcbcfd|12   |
|67d83bd36ec2c7fb557742fb58837659|12   |
|465c2e1bee4561cb39e0db8c5993aafc|12   |
|68986e4324f6a21481df4e6e89abcf01|12   |
+--------------------------------+-----+
only showing top

                                                                                

In [125]:
df_order_payments.filter(col('order_id') == '285c2e15bebd4ac83635ccc563dc71f4').orderBy('payment_sequential').show()

+--------------------+------------------+------------+--------------------+-------------+
|            order_id|payment_sequential|payment_type|payment_installments|payment_value|
+--------------------+------------------+------------+--------------------+-------------+
|285c2e15bebd4ac83...|                 1| credit_card|                   1|         1.62|
|285c2e15bebd4ac83...|                 2|     voucher|                   1|         1.24|
|285c2e15bebd4ac83...|                 3|     voucher|                   1|          1.4|
|285c2e15bebd4ac83...|                 4|     voucher|                   1|         2.89|
|285c2e15bebd4ac83...|                 5|     voucher|                   1|         1.75|
|285c2e15bebd4ac83...|                 6|     voucher|                   1|         2.85|
|285c2e15bebd4ac83...|                 7|     voucher|                   1|         1.14|
|285c2e15bebd4ac83...|                 8|     voucher|                   1|         1.23|
|285c2e15b

In [126]:
df_order_payments.agg(countDistinct("order_id"), count("*")).show()



+------------------------+--------+
|count(DISTINCT order_id)|count(1)|
+------------------------+--------+
|                   99440|  103886|
+------------------------+--------+



                                                                                

In [127]:
df_orders_payments = df_orders.join(df_order_payments, "order_id", "left")

In [128]:
df_orders_payments.filter('payment_type is null').show()

                                                                                

+--------------------+--------------------+------------+------------------------+-------------------+----------------------------+-----------------------------+-----------------------------+---------------------------------+--------------------------+-------------------------------------+--------------------------------------+--------------------------------------+------------------+------------+--------------------+-------------+
|            order_id|         customer_id|order_status|order_purchase_timestamp|  order_approved_at|order_delivered_carrier_date|order_delivered_customer_date|order_estimated_delivery_date|order_purchase_timestamp_is_valid|order_approved_at_is_valid|order_delivered_carrier_date_is_valid|order_delivered_customer_date_is_valid|order_estimated_delivery_date_is_valid|payment_sequential|payment_type|payment_installments|payment_value|
+--------------------+--------------------+------------+------------------------+-------------------+----------------------------+

In [129]:
df_orders_payments.write.mode('overwrite').parquet(f'{output_head}order_payments')

                                                                                

### Order Reviews

In [130]:
df_order_reviews.show(5)

+--------------------+--------------------+------------+--------------------+----------------------+--------------------+-----------------------+
|           review_id|            order_id|review_score|review_comment_title|review_comment_message|review_creation_date|review_answer_timestamp|
+--------------------+--------------------+------------+--------------------+----------------------+--------------------+-----------------------+
|7bc2406110b926393...|73fc7af87114b3971...|           4|                NULL|                  NULL| 2018-01-18 00:00:00|    2018-01-18 21:46:59|
|80e641a11e56f04c1...|a548910a1c6147796...|           5|                NULL|                  NULL| 2018-03-10 00:00:00|    2018-03-11 03:05:13|
|228ce5500dc1d8e02...|f9e4b658b201a9f2e...|           5|                NULL|                  NULL| 2018-02-17 00:00:00|    2018-02-18 14:36:24|
|e64fb393e7b32834b...|658677c97b385a9be...|           5|                NULL|  Recebi bem antes ...| 2017-04-21 00:00:00|   

In [131]:
df_order_reviews.printSchema()

root
 |-- review_id: string (nullable = true)
 |-- order_id: string (nullable = true)
 |-- review_score: integer (nullable = true)
 |-- review_comment_title: string (nullable = true)
 |-- review_comment_message: string (nullable = true)
 |-- review_creation_date: string (nullable = true)
 |-- review_answer_timestamp: string (nullable = true)



In [132]:
df_order_reviews.groupBy("review_score").count().orderBy(desc("count")).show()



+------------+-----+
|review_score|count|
+------------+-----+
|           5|57328|
|           4|19142|
|           1|11424|
|           3| 8179|
|        NULL| 4937|
|           2| 3151|
|           0|    1|
+------------+-----+



                                                                                

In [133]:
# check missing values
df_order_reviews.select([count(when(col(c).isNull(), 1)).alias(c) for c in df_order_reviews.columns]).show()



+---------+--------+------------+--------------------+----------------------+--------------------+-----------------------+
|review_id|order_id|review_score|review_comment_title|review_comment_message|review_creation_date|review_answer_timestamp|
+---------+--------+------------+--------------------+----------------------+--------------------+-----------------------+
|        1|    2236|        4937|               92157|                 63079|                8764|                   8785|
+---------+--------+------------+--------------------+----------------------+--------------------+-----------------------+



                                                                                

In [134]:
# check for duplicates: number of review_id values that appear on more than one row
df_order_reviews.groupBy("review_id") \
    .count() \
    .filter("count > 1") \
    .count()

                                                                                

955

In [135]:
df_order_reviews.filter('review_id is null').show(truncate=False)

+---------+-----------------------------------------------+------------+--------------------------------+----------------------+--------------------+-----------------------+
|review_id|order_id                                       |review_score|review_comment_title            |review_comment_message|review_creation_date|review_answer_timestamp|
+---------+-----------------------------------------------+------------+--------------------------------+----------------------+--------------------+-----------------------+
|NULL     |material de boa qualidade.Agora é amarelo mesmo|NULL        |mas e muito bonito.eu recomendo"|2018-01-06 00:00:00   |2018-01-08 14:20:31 |NULL                   |
+---------+-----------------------------------------------+------------+--------------------------------+----------------------+--------------------+-----------------------+



In [136]:
df_order_reviews.filter('review_score = 0').show(truncate=False)

+---------------------+-------------------------------+------------+--------------------+----------------------+--------------------+-----------------------+
|review_id            |order_id                       |review_score|review_comment_title|review_comment_message|review_creation_date|review_answer_timestamp|
+---------------------+-------------------------------+------------+--------------------+----------------------+--------------------+-----------------------+
|Conforme outros sites| poderia ser livre acima de 100|0           |NULL                |NULL                  |NULL                |NULL                   |
+---------------------+-------------------------------+------------+--------------------+----------------------+--------------------+-----------------------+



In [137]:
df_order_reviews.filter(
    """
    order_id IS NULL AND
    review_score IS NULL
    """
).count()

2236

In [138]:
df_order_reviews.filter(
    """
    order_id IS NULL AND
    review_score IS NULL
    """
).show(10, truncate=False)

+-------------------------------------------------+--------+------------+--------------------+----------------------+--------------------+-----------------------+
|review_id                                        |order_id|review_score|review_comment_title|review_comment_message|review_creation_date|review_answer_timestamp|
+-------------------------------------------------+--------+------------+--------------------+----------------------+--------------------+-----------------------+
|,2018-02-16 00:00:00,2018-02-20 10:52:22         |NULL    |NULL        |NULL                |NULL                  |NULL                |NULL                   |
|A entrega foi efetuada muito antes do prazo dado.|NULL    |NULL        |NULL                |NULL                  |NULL                |NULL                   |
|O produto já começou a ser usado e até o presente|NULL    |NULL        |NULL                |NULL                  |NULL                |NULL                   |
| Estou satisfeita    

In [139]:
df_order_reviews.filter(
    """
    order_id IS NULL AND
    review_score IS NULL AND
    review_comment_title IS NULL
    """
).count()

                                                                                

2235

In [140]:
df_order_reviews.filter(
    """
    order_id IS NULL AND
    review_score IS NULL AND
    review_comment_title IS NOT NULL
    """
).show()

+--------------------+--------+------------+--------------------+----------------------+--------------------+-----------------------+
|           review_id|order_id|review_score|review_comment_title|review_comment_message|review_creation_date|review_answer_timestamp|
+--------------------+--------+------------+--------------------+----------------------+--------------------+-----------------------+
|Grato. Espero uma...|    NULL|        NULL|,2017-06-07 00:00...|                  NULL|                NULL|                   NULL|
+--------------------+--------+------------+--------------------+----------------------+--------------------+-----------------------+



In [141]:
df_order_reviews.select("review_id").distinct().filter(
    ~col("review_id").rlike("^[a-f0-9]{32}$")
).count()

                                                                                

4547

In [142]:
df_order_reviews.select("order_id").distinct().filter(
    ~col("order_id").rlike("^[a-f0-9]{32}$")
).count()

                                                                                

1069
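
The malformed `review_id` and `order_id` values above (free-text fragments, a row that starts with a comma followed by two timestamps) look like what happens when a quoted comment containing a line break is split into several records. That is an interpretation of the output, not something the notebook states, and how `df_order_reviews` was read is not shown in this section. The sketch below uses Python's `csv` module on a made-up record (`r1`, `o1` and the comment text are hypothetical) to contrast naive line splitting with quote-aware parsing.

```python
import csv
import io

# One header line plus one record whose quoted comment contains a line break
raw = (
    "review_id,order_id,review_score,review_comment_title,"
    "review_comment_message,review_creation_date,review_answer_timestamp\n"
    'r1,o1,5,,"Line one\nline two",2018-02-16 00:00:00,2018-02-20 10:52:22\n'
)

# Naive approach: every physical line is treated as a record
naive_rows = [line.split(",") for line in raw.splitlines()[1:]]

# Quote-aware approach: the line break inside the quotes stays in the field
parsed_rows = list(csv.reader(io.StringIO(raw)))[1:]

print(len(naive_rows), [len(r) for r in naive_rows])    # 2 [5, 3]
print(len(parsed_rows), [len(r) for r in parsed_rows])  # 1 [7]
```

With naive splitting, the second fragment begins with the tail of the comment and is followed by the two timestamps, which resembles the stray rows above. In Spark, the CSV reader's `multiLine` option (together with `quote` and `escape`) is the usual lever for this kind of file; whether it fully repairs this dataset would need to be checked by re-reading the file and re-running the counts.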

In [143]:
df_order_reviews.filter(
    """
    review_score IS NULL AND
    (review_comment_title IS NOT NULL OR
    review_comment_message IS NOT NULL)
    """
).show()

+--------------------+--------------------+------------+--------------------+----------------------+--------------------+-----------------------+
|           review_id|            order_id|review_score|review_comment_title|review_comment_message|review_creation_date|review_answer_timestamp|
+--------------------+--------------------+------------+--------------------+----------------------+--------------------+-----------------------+
|Tapete de Eva Nº ...|90 (ESTE FOI ENTREG"|        NULL| 2018-01-10 09:52:57|                  NULL|                NULL|                   NULL|
|            De resto|           tudo ok."|        NULL| 2018-04-22 16:36:04|                  NULL|                NULL|                   NULL|
|        é lamentável| esta loja que ta...|        NULL| 2018-08-23 00:00:00|   2018-08-30 03:02:39|                NULL|                   NULL|
|Pois já se passou...| gostaria de uma ...|        NULL| 2018-03-25 23:32:00|                  NULL|                NULL|   

In [144]:
df_order_reviews.filter(col("order_id").rlike("^[a-f0-9]{32}$")) \
    .groupBy("order_id") \
    .count() \
    .filter("count = 2") \
    .show(30, truncate=False)



+--------------------------------+-----+
|order_id                        |count|
+--------------------------------+-----+
|78cd965d0bc0388d390404eee6490c5b|2    |
|92e0a2b039d9ce627cbcc94ecf879f87|2    |
|0b127810f86631fa896fd74a421045f8|2    |
|b7293e3014a7261f0d26d28a1e927864|2    |
|70c77e51e0f179d75a64a614135afb6a|2    |
|26ba6dc5d33b55a3107a56f4e8d77395|2    |
|5040757d4e06a4be96d3827b860b4e7c|2    |
|ee15d1d23898663a46ad1c4b08a45759|2    |
|3fe4dbcdb046a475dbf25463c1ca78bd|2    |
|02e0b68852217f5715fb9cc885829454|2    |
|3cf387bb14e9db171ccbb9b87ea607bb|2    |
|b4b23a5f1e3af0a90cce47b18af0e612|2    |
|138db286991b7cc18370680e5f5154da|2    |
|95e7f49dc56e12097c265c45527a3941|2    |
|96f5be02bc9ffc589f3274500a64a7e2|2    |
|3ee6f1a94cd7b1aa622553fd94a3aaf3|2    |
|e706d614326b76f48f24d82f86c0b2ed|2    |
|0715dfcf2383aa72c181d8b47f6cb589|2    |
|256d17dcfae69f2c09f3ab3daf76b0ef|2    |
|056bfadd41b8600ad5ecfef2ac132188|2    |
|2f8f31eb2f7b6572836d662a6625c8e4|2    |
|3df55fc07ff4631

                                                                                

In [145]:
df_order_reviews.filter("order_id = '3fe4dbcdb046a475dbf25463c1ca78bd'").show()

+--------------------+--------------------+------------+--------------------+----------------------+--------------------+-----------------------+
|           review_id|            order_id|review_score|review_comment_title|review_comment_message|review_creation_date|review_answer_timestamp|
+--------------------+--------------------+------------+--------------------+----------------------+--------------------+-----------------------+
|308316408775d1600...|3fe4dbcdb046a475d...|           5|                NULL|  Ajudem a rastrear...| 2017-09-07 00:00:00|    2017-09-11 09:58:09|
|89d6214895235bb95...|3fe4dbcdb046a475d...|           5|                NULL|                  NULL| 2017-09-12 00:00:00|    2017-09-13 10:09:12|
+--------------------+--------------------+------------+--------------------+----------------------+--------------------+-----------------------+



In [146]:
df_orders.count(), df_order_reviews.count()

(99441, 104162)

In [147]:
df_orders_orderreviews = df_orders.join(df_order_reviews, "order_id", "left")

In [148]:
df_orders_orderreviews.count()

                                                                                

99992
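
The joined table has 99,992 rows against 99,441 orders, that is, 551 more rows than orders. A left join keeps every row of the left table and emits one output row per matching row on the right, so an order matched by two review rows appears twice. A toy plain-Python sketch of that behaviour, with hypothetical IDs (`o1`, `r1`, and so on) rather than real data:

```python
from collections import defaultdict


def left_join(orders, reviews):
    """Left-join two lists of dicts on 'order_id'."""
    by_order = defaultdict(list)
    for review in reviews:
        by_order[review["order_id"]].append(review)

    joined = []
    for order in orders:
        matches = by_order.get(order["order_id"])
        if not matches:
            # No review: keep the order, fill the review side with None.
            joined.append({**order, "review_id": None})
        else:
            # One output row per matching review.
            for review in matches:
                joined.append({**order, "review_id": review["review_id"]})
    return joined


orders = [{"order_id": "o1"}, {"order_id": "o2"}, {"order_id": "o3"}]
reviews = [
    {"order_id": "o1", "review_id": "r1"},
    {"order_id": "o1", "review_id": "r2"},        # second review for o1
    {"order_id": "bad text", "review_id": "r3"},  # malformed key, matches nothing
]

rows = left_join(orders, reviews)
print(len(rows))  # 4: o1 twice, o2 and o3 once each with review_id None
```

Review rows with a malformed `order_id` drop out of a left join from the orders side, which is why the joined count stays well below the 104,162 review rows.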

In [149]:
df_orders_orderreviews.filter(
    (col("order_status") == "delivered") &
    (col("review_score").isNotNull())
).groupBy("review_score").count().orderBy(desc("review_score")).show()



+------------+-----+
|review_score|count|
+------------+-----+
|           5|57066|
|           4|18987|
|           3| 7961|
|           2| 2941|
|           1| 9406|
+------------+-----+



                                                                                

In [150]:
# check format timestamp
timestamp_cols = [
    "review_creation_date",
    "review_answer_timestamp"
]

valid_format_timestamp = r'^\d{4}-(0[1-9]|1[0-2])-' \
                         r'(0[1-9]|[12]\d|3[01]) ' \
                         r'(0\d|1\d|2[0-3]):' \
                         r'([0-5]\d):' \
                         r'([0-5]\d)$'

for col_name in timestamp_cols:
    df_order_reviews = df_order_reviews.withColumn(
        f"{col_name}_is_valid",
        col(col_name).rlike(valid_format_timestamp)
    )

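# Filter on the flag columns (not the raw strings) to list malformed values;
# NULL timestamps produce NULL flags and are not matched here
df_order_reviews.filter(
    (col("review_creation_date_is_valid") == False) |
    (col("review_answer_timestamp_is_valid") == False)
).show()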

                                                                                

+---------+--------+------------+--------------------+----------------------+--------------------+-----------------------+-----------------------------+--------------------------------+
|review_id|order_id|review_score|review_comment_title|review_comment_message|review_creation_date|review_answer_timestamp|review_creation_date_is_valid|review_answer_timestamp_is_valid|
+---------+--------+------------+--------------------+----------------------+--------------------+-----------------------+-----------------------------+--------------------------------+
+---------+--------+------------+--------------------+----------------------+--------------------+-----------------------+-----------------------------+--------------------------------+
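
A regular expression like the one above checks only the shape of the string, not whether the date exists on the calendar. A small plain-Python sketch of the difference, reusing the same pattern and adding `datetime.strptime` as a second check (the helper names and the sample value `2018-02-31 00:00:00` are made up for illustration):

```python
import re
from datetime import datetime

# Same pattern as in the cell above: yyyy-MM-dd HH:mm:ss
PATTERN = re.compile(
    r"^\d{4}-(0[1-9]|1[0-2])-"
    r"(0[1-9]|[12]\d|3[01]) "
    r"(0\d|1\d|2[0-3]):"
    r"([0-5]\d):"
    r"([0-5]\d)$"
)


def matches_format(value):
    """True if the string has the yyyy-MM-dd HH:mm:ss shape."""
    return PATTERN.match(value) is not None


def parses_as_timestamp(value):
    """True if the string is a real calendar timestamp."""
    try:
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return True
    except ValueError:
        return False


for value in ["2017-09-11 09:58:09", "2018-02-31 00:00:00", 'tudo ok."']:
    print(value, matches_format(value), parses_as_timestamp(value))
```

The second sample passes the pattern but fails to parse, so casting the columns to timestamp types afterwards remains a useful extra check.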



In [151]:
df_order_reviews.write.mode('overwrite').parquet(f'{output_head}order_reviews')

                                                                                

In [152]:
spark.stop()

## Summary

1. Customers:
    - customer_city: convert from lowercase to title case
    - no missing values
    - no duplicate rows
    - customer_unique_id can repeat: one customer can have two or more customer_id values, possibly because of different locations
    - total rows 99,441 | distinct customer_id 99,441 | distinct customer_unique_id 96,096
2. Sellers:
    - seller_city: convert from lowercase to title case
    - no missing values
    - no duplicate rows
    - total rows 3,095 | distinct seller_id 3,095
3. Products:
    - rename some columns
    - no duplicates
    - missing values present (611 in total)
         - 1 row has every feature NULL except the ID
         - product_category_name, product_name_length, and product_description_length are each NULL in 609 rows
         - the dimension columns are NULL in 1 row
    - map the English category names onto the original category names (2 categories have no mapping and are mapped manually)
    - need to convert the weight from g to kg
    - need to calculate the product volume
    - the dimension features are weight and volume (plt)
    - total rows == distinct product_id (32,951)
4. Geolocation:
    - no missing values
    - duplicate zip code prefixes have latitudes and longitudes that lie close together => take the average latitude and longitude so that each zip code prefix is unique
    - distinct zip code prefixes 19,015 | total rows 1,000,163
    - geolocation_city: convert from lowercase to title case
5. Orders:
    - missing values present
         - order_approved_at is NULL for some delivered orders (14 rows); unexplained
         - order_delivered_carrier_date is NULL for some delivered orders
    - delivery time (delivered - purchased) reaches hundreds of days for some orders; unexplained
    - all timestamp columns share the format yyyy-MM-dd HH:mm:ss; cast them from string to timestamp, and to date for the estimated date
    - total rows == distinct order_id (99,441, the same as the customer_id count)
6. Order Items:
    - no missing values
    - no duplicate rows
    - order_id repeats | distinct order_id 98,666
    - the timestamp column uses the format yyyy-MM-dd HH:mm:ss; cast it from string to timestamp
    - The order_id = 00143d0f86d6fbd9f9b38ab440ac16f5 has 3 items (same product). Each item has its freight calculated according to its measures and weight. To get the total freight value for an order, sum the freight of its items.
         - total order_item value: 21.33 * 3 = 63.99
         - total freight value: 15.10 * 3 = 45.30
         - total order value (product + freight): 45.30 + 63.99 = 109.29
7. Order Payments:
    - no missing values
    - duplicate rows are not a problem
    - order_id repeats (because payments can be sequential)
    - payment_sequential is the order of the payments made for one order_id
    - payment_installments is the number of installments
    - total rows 103,886 | distinct order_id 99,440
    - this table needs to be collapsed to one payment record per order_id (payment_sequential -> max, payment_type -> list, installments -> total, payment amount)
8. Order Reviews:
    - missing values present
         - review_id is NULL in 1 row (probably best deleted)
         - order_id is NULL in 2,236 rows
         - review_score is NULL in 4,937 rows, and so on
    - a review_score of 0 makes no sense; in those rows review_id and order_id are also malformed
    - many review_id and order_id values are malformed (best deleted)
    - order_id repeats: every order_id that appears more than 2 times is malformed, and some that appear exactly 2 times are malformed too. The remaining 2-time duplicates are probably review updates, although some of them are unchanged
    - all timestamp columns share the format yyyy-MM-dd HH:mm:ss; cast them from string to timestamp (review_answer_timestamp) and date (review_creation_date)
    - total rows 104,162 | distinct 99,742
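
The payment simplification proposed in item 7 can be sketched in plain Python as a group-by over `order_id`. This is only one reading of that note: it assumes the amount column is named `payment_value`, that installments should be summed across payments, and that payment types are kept as a sorted list of distinct values. The sample rows and the function name `collapse_payments` are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal


def collapse_payments(rows):
    """Collapse payment rows to one record per order_id."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["order_id"]].append(row)

    result = {}
    for order_id, payments in grouped.items():
        result[order_id] = {
            # Highest sequence number = number of payment steps.
            "payment_sequential_max": max(p["payment_sequential"] for p in payments),
            # Distinct payment types used for this order.
            "payment_types": sorted({p["payment_type"] for p in payments}),
            # Assumption: installments are summed across payments.
            "payment_installments_total": sum(p["payment_installments"] for p in payments),
            # Total amount paid (payment_value is an assumed column name).
            "payment_value_total": sum((p["payment_value"] for p in payments), Decimal("0")),
        }
    return result


# Hypothetical rows: order "a" is paid with a voucher plus a credit card.
rows = [
    {"order_id": "a", "payment_sequential": 1, "payment_type": "voucher",
     "payment_installments": 1, "payment_value": Decimal("20.00")},
    {"order_id": "a", "payment_sequential": 2, "payment_type": "credit_card",
     "payment_installments": 3, "payment_value": Decimal("79.90")},
    {"order_id": "b", "payment_sequential": 1, "payment_type": "boleto",
     "payment_installments": 1, "payment_value": Decimal("45.30")},
]

summary = collapse_payments(rows)
print(summary["a"]["payment_types"])        # ['credit_card', 'voucher']
print(summary["a"]["payment_value_total"])  # 99.90
```

In Spark the same shape would likely come from `groupBy("order_id")` with `max`, `collect_set`, and `sum` aggregations.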
