# Data Analysis Project: E-commerce Public Dataset
- **Name:** Argo Wahyu Utomo (Arguto)
- **Email:** argo.wahyu.utomo@gmail.com
- **Dicoding ID:** B244051F

## Defining the Business Questions

### Sales
- What is the sales trend (revenue, orders, AOV) over the last several months?
- How do sales perform by product category?
- How do sales perform by product?

### Payment Methods
- Which payment method is used most often?
- What is the average transaction value for each payment method?
- How is the number of installments distributed for credit card payments?

### Logistics
- How accurate is the estimated delivery date compared with the actual delivery date?
- What are the minimum, average, and maximum freight costs per product category?

### Customer Satisfaction
- What is the overall level of customer satisfaction?
- Buyers from which cities have the highest satisfaction?
- Buyers from which cities have the lowest satisfaction?
- Which product category has the highest customer satisfaction?
- How does customer satisfaction vary with the seller's geographic location?

### Geographic Location
- How are customers distributed geographically?
- How are sellers distributed geographically?

## Importing All Packages/Libraries Used

In [182]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Data Wrangling

### Gathering Data

#### Loading the CSV Files

In [183]:
dir_ = "dataset/"
file_orders = dir_ + "orders_dataset.csv"
file_products = dir_ + "products_dataset.csv"
file_product_category = dir_ + "product_category_name_translation.csv"
file_items = dir_ + "order_items_dataset.csv"
file_payments = dir_ + "order_payments_dataset.csv"
file_reviews =  dir_ + "order_reviews_dataset.csv"
file_customers = dir_ + "customers_dataset.csv"
file_sellers = dir_ + "sellers_dataset.csv" 
file_geolocation = dir_ + "geolocation_dataset.csv" 

# Load each CSV file into a DataFrame
df_orders = pd.read_csv(file_orders)
df_products = pd.read_csv(file_products)
df_product_category = pd.read_csv(file_product_category)
df_items = pd.read_csv(file_items)
df_payments = pd.read_csv(file_payments)
df_reviews = pd.read_csv(file_reviews)
df_customers = pd.read_csv(file_customers)
df_sellers = pd.read_csv(file_sellers)
df_geolocation = pd.read_csv(file_geolocation)

#### Displaying the DataFrames

In [184]:
df_dict = {
    "orders": df_orders,
    "products": df_products,
    "product_category": df_product_category,
    "items": df_items,
    "payments": df_payments,
    "reviews": df_reviews,
    "customers": df_customers,
    "sellers": df_sellers,
    "geolocation": df_geolocation, 
}

top_row_count = 3
for table_name, df in df_dict.items():
    print(f"{table_name.title()} (df_{table_name}) top {top_row_count} rows")
    display(df.head(top_row_count))  # Renders in a rich tabular format

Orders (df_orders) top 3 rows


Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00


Products (df_products) top 3 rows


Unnamed: 0,product_id,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm
0,1e9e8ef04dbcff4541ed26657ea517e5,perfumaria,40.0,287.0,1.0,225.0,16.0,10.0,14.0
1,3aa071139cb16b67ca9e5dea641aaa2f,artes,44.0,276.0,1.0,1000.0,30.0,18.0,20.0
2,96bd76ec8810374ed1b65e291975717f,esporte_lazer,46.0,250.0,1.0,154.0,18.0,9.0,15.0


Product_Category (df_product_category) top 3 rows


Unnamed: 0,product_category_name,product_category_name_english
0,beleza_saude,health_beauty
1,informatica_acessorios,computers_accessories
2,automotivo,auto


Items (df_items) top 3 rows


Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.9,13.29
1,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.9,19.93
2,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,199.0,17.87


Payments (df_payments) top 3 rows


Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
0,b81ef226f3fe1789b1e8b2acac839d17,1,credit_card,8,99.33
1,a9810da82917af2d9aefd1278f1dcfa0,1,credit_card,1,24.39
2,25e8ea4e93396b6fa0d3dd708e76c1bd,1,credit_card,1,65.71


Reviews (df_reviews) top 3 rows


Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17 00:00:00,2018-02-18 14:36:24


Customers (df_customers) top 3 rows


Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
0,06b8999e2fba1a1fbc88172c00ba8bc7,861eff4711a542e4b93843c6dd7febb0,14409,franca,SP
1,18955e83d337fd6b2def6b18a428ac77,290c77bc529b7ac935b93aa66c333dc3,9790,sao bernardo do campo,SP
2,4e7b3e00288586ebd08712fdd0374a03,060e732b5b29e8181a18229c7b0b2b5e,1151,sao paulo,SP


Sellers (df_sellers) top 3 rows


Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city,seller_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro,RJ


Geolocation (df_geolocation) top 3 rows


Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state
0,1037,-23.545621,-46.639292,sao paulo,SP
1,1046,-23.546081,-46.64482,sao paulo,SP
2,1046,-23.546129,-46.642951,sao paulo,SP


#### Insights
- The dataset consists of 9 CSV files, loaded into 9 DataFrames: df_orders, df_products, df_product_category, df_items, df_payments, df_reviews, df_customers, df_sellers, and df_geolocation.
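
Because every table goes through the same `pd.read_csv` call, the loading step could also be driven by a single mapping from table name to file name, which yields the dictionary of DataFrames directly. The sketch below is one possible way to do that, not the notebook's own code: `load_tables` is a hypothetical helper, and the demo writes two tiny throwaway CSV files so that it runs without the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_tables(data_dir, file_names):
    """Read every CSV listed in file_names (table name -> file name) from data_dir."""
    data_dir = Path(data_dir)
    return {name: pd.read_csv(data_dir / fname) for name, fname in file_names.items()}


# Demo with throwaway files (hypothetical contents, not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "orders_dataset.csv").write_text("order_id,order_status\na,delivered\n")
    Path(tmp, "sellers_dataset.csv").write_text("seller_id,seller_state\ns1,SP\n")
    demo = load_tables(
        tmp,
        {"orders": "orders_dataset.csv", "sellers": "sellers_dataset.csv"},
    )

print({name: df.shape for name, df in demo.items()})
```

With the real files, the mapping would list all nine table names and `data_dir` would be `"dataset/"`.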

### Assessing Data

#### Number of Rows & Columns

In [185]:
# Create a list to store table information
table_dimension = []

# Iterate over the dictionary to collect data
for table_name, df in df_dict.items():
    table_dimension.append({
        "Table": table_name.title(),
        "DataFrame": f"df_{table_name}",
        "Rows": f"{df.shape[0]:,}",
        "Columns": f"{df.shape[1]:,}"
    })

# Create a DataFrame from the collected data
table_dimension_df = pd.DataFrame(table_dimension)

# Display the DataFrame
display(table_dimension_df)

Unnamed: 0,Table,DataFrame,Rows,Columns
0,Orders,df_orders,99441,8
1,Products,df_products,32951,9
2,Product_Category,df_product_category,71,2
3,Items,df_items,112650,7
4,Payments,df_payments,103886,5
5,Reviews,df_reviews,99224,7
6,Customers,df_customers,99441,5
7,Sellers,df_sellers,3095,4
8,Geolocation,df_geolocation,1000163,5


#### Missing Values, Data Type, Unique, Duplicate

In [186]:
# Create a function to show table info
def show_table_info(table_name):
    df = df_dict.get(table_name)
    # Initialize a list to collect column information
    table_column_info = []
    
    # Get memory usage for each column
    memory_usage = df.memory_usage(deep=True)
    total_memory_usage = memory_usage.sum()

    # Iterate over each column in the DataFrame
    for column_name in df.columns:
        # Collect information about the column
        table_column_info.append({
            "Table": table_name.title(),
            "Column": column_name,
            "Not-Null": f"{df[column_name].notnull().sum():,}",
            "Missing Values": f"{df[column_name].isnull().sum():,}",
            "DType": df[column_name].dtype,
            "Unique": f"{df[column_name].nunique():,}",
            # "Duplicate": df[column_name].duplicated().sum(),
            "Duplicate": "",  # per-column duplicate counts are not needed
            "Memory Usage": memory_usage[column_name],
        })
    
    # Add a row for total memory usage (aggregate row)
    table_column_info.append({
        "Table": table_name.title(),
        "Column": "[Total]",
        "Not-Null": f"{df.notnull().all(axis=1).sum():,}",  # Count rows where all columns are non-null
        "Missing Values": f"{df.isnull().any(axis=1).sum():,}",  # Count rows with any null values
        "DType": "",
        "Unique": f"{df.shape[0] - df.duplicated().sum():,}",  # Count unique rows without dropping duplicates
        "Duplicate": f"{df.duplicated().sum():,}",  # Count total duplicated rows
        "Memory Usage": total_memory_usage,
    })

    # Create a DataFrame from the collected information
    table_column_info_df = pd.DataFrame(table_column_info)

    # Convert memory usage to human-readable format (e.g., KB, MB) if desired
    table_column_info_df["Memory Usage"] = table_column_info_df["Memory Usage"].apply(
        lambda x: (
            f"{x / (1024**3):.2f} GB" if x >= 1024**3 else
            f"{x / (1024**2):.2f} MB" if x >= 1024**2 else
            f"{x / 1024:.2f} KB" if x >= 1024 else
            f"{x} Bytes"
        )
    )

    # Display the DataFrame
    display(table_column_info_df)


**Notes**
- In the `[Total]` row, Not-Null is the number of rows in which no column has a missing value
- In the `[Total]` row, Missing Values is the number of rows with at least one missing value
- In the `[Total]` row, Unique is the number of distinct rows, comparing the values of all columns
- In the `[Total]` row, Duplicate is the number of rows that repeat an earlier row, comparing the values of all columns
- Duplicate counts are not computed for individual columns
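
The row-level definitions above can be checked on a small example. The sketch below uses a hypothetical four-row frame (not taken from the dataset) and the same pandas expressions as the `[Total]` row in `show_table_info`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 has a missing value, row 3 repeats row 0.
toy = pd.DataFrame({
    "a": [1, 2, 3, 1],
    "b": ["x", np.nan, "z", "x"],
})

complete_rows = int(toy.notnull().all(axis=1).sum())   # rows with no missing value
incomplete_rows = int(toy.isnull().any(axis=1).sum())  # rows with at least one missing value
duplicate_rows = int(toy.duplicated().sum())           # rows that repeat an earlier row
unique_rows = len(toy) - duplicate_rows                # distinct rows

print(complete_rows, incomplete_rows, unique_rows, duplicate_rows)
```

Note that Not-Null and Missing Values always add up to the number of rows, and so do Unique and Duplicate.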

##### df_orders

In [187]:
show_table_info("orders")

Unnamed: 0,Table,Column,Not-Null,Missing Values,DType,Unique,Duplicate,Memory Usage
0,Orders,order_id,99441,0,object,99441,,7.68 MB
1,Orders,customer_id,99441,0,object,99441,,7.68 MB
2,Orders,order_status,99441,0,object,8,,5.50 MB
3,Orders,order_purchase_timestamp,99441,0,object,98875,,6.45 MB
4,Orders,order_approved_at,99281,160,object,90733,,6.44 MB
5,Orders,order_delivered_carrier_date,97658,1783,object,81018,,6.39 MB
6,Orders,order_delivered_customer_date,96476,2965,object,95664,,6.35 MB
7,Orders,order_estimated_delivery_date,99441,0,object,459,,6.45 MB
8,Orders,[Total],96461,2980,,99441,0.0,52.94 MB


**Insight:**
- The orders table (df_orders) has 8 columns
- `order_approved_at` has 160 missing values, most likely because those orders had not been approved
- `order_delivered_carrier_date` and `order_delivered_customer_date` have 1,783 and 2,965 missing values respectively, most likely because those orders had not been shipped or their delivery data was not yet due to be filled in
- The dtypes of `order_purchase_timestamp`, `order_approved_at`, `order_delivered_carrier_date`, `order_delivered_customer_date`, and `order_estimated_delivery_date` are still object and need to be converted in the next stage (data cleaning)
- There are no duplicate rows
- There are 99,441 distinct orders with 8 distinct statuses, from 99,441 distinct `customer_id` values
- Of all orders, 99,281 have an approval timestamp; the 90,733 in the Unique column is the number of distinct timestamps, not the number of approved orders
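
Two points from this list can be illustrated together: converting a timestamp column from object to datetime, and the difference between counting non-null values and counting distinct values. This is a minimal sketch on a hypothetical three-row stand-in for df_orders, not the cleaning step itself.

```python
import pandas as pd

# Hypothetical stand-in for df_orders: two orders share a timestamp, one has none.
orders = pd.DataFrame({
    "order_id": ["a", "b", "c"],
    "order_approved_at": ["2017-10-02 11:07:15", "2017-10-02 11:07:15", None],
})

# Parse the strings; unparseable or missing entries become NaT.
orders["order_approved_at"] = pd.to_datetime(orders["order_approved_at"], errors="coerce")

approved_orders = int(orders["order_approved_at"].count())        # non-null values
distinct_timestamps = int(orders["order_approved_at"].nunique())  # distinct values

print(orders["order_approved_at"].dtype, approved_orders, distinct_timestamps)
```

Here two orders carry an approval timestamp even though there is only one distinct timestamp, which is why the Not-Null column, rather than the Unique column, answers "how many orders were approved".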

##### df_products

In [188]:
show_table_info("products")

Unnamed: 0,Table,Column,Not-Null,Missing Values,DType,Unique,Duplicate,Memory Usage
0,Products,product_id,32951,0,object,32951,,2.55 MB
1,Products,product_category_name,32341,610,object,73,,1.99 MB
2,Products,product_name_lenght,32341,610,float64,66,,257.43 KB
3,Products,product_description_lenght,32341,610,float64,2960,,257.43 KB
4,Products,product_photos_qty,32341,610,float64,19,,257.43 KB
5,Products,product_weight_g,32949,2,float64,2204,,257.43 KB
6,Products,product_length_cm,32949,2,float64,99,,257.43 KB
7,Products,product_height_cm,32949,2,float64,102,,257.43 KB
8,Products,product_width_cm,32949,2,float64,95,,257.43 KB
9,Products,[Total],32340,611,,32951,0.0,6.30 MB


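**Insight:**
- The products table (df_products) has 9 columns
- 611 rows contain at least one missing value: `product_category_name`, `product_name_lenght`, `product_description_lenght`, and `product_photos_qty` each have 610 missing values, and the weight and dimension columns each have 2
- The dtypes of all columns are already OK
- There are no duplicate rows
- There are 32,951 products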
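
Several columns share the same missing-value count, which suggests (but does not prove) that the gaps fall on the same rows. One way to check is to compare "any of these columns is missing" with "all of these columns are missing". The sketch below uses a hypothetical toy frame with a subset of the column names.

```python
import pandas as pd

# Hypothetical toy frame: rows 1 and 3 lack both descriptive fields, row 2 lacks the weight.
products = pd.DataFrame({
    "product_category_name": ["artes", None, "perfumaria", None],
    "product_name_lenght": [44.0, None, 40.0, None],
    "product_weight_g": [1000.0, 225.0, None, 154.0],
})

descriptive_cols = ["product_category_name", "product_name_lenght"]
missing_any = products[descriptive_cols].isnull().any(axis=1)
missing_all = products[descriptive_cols].isnull().all(axis=1)

# True when every row that misses one descriptive column misses all of them.
same_rows = bool((missing_any == missing_all).all())

print(int(missing_any.sum()), int(missing_all.sum()), same_rows)
```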

##### df_product_category

In [189]:
show_table_info("product_category")

Unnamed: 0,Table,Column,Not-Null,Missing Values,DType,Unique,Duplicate,Memory Usage
0,Product_Category,product_category_name,71,0,object,71,,4.57 KB
1,Product_Category,product_category_name_english,71,0,object,71,,4.51 KB
2,Product_Category,[Total],71,0,,71,0.0,9.21 KB


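**Insight:**
- The product_category table (df_product_category) has 2 columns
- There are no missing values
- The dtypes of all columns are already OK
- There are no duplicate rows
- There are 71 product categories with a translation; df_products has 73 distinct category names, so at least 2 of them have no entry in this table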
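
Since df_products reports 73 distinct `product_category_name` values and this table has 71 rows, a set difference can list the category names that lack a translation. The sketch below runs on hypothetical toy data; `categoria_x` is an invented name used only for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the real tables.
products = pd.DataFrame({
    "product_category_name": ["perfumaria", "artes", "categoria_x", None],
})
translation = pd.DataFrame({
    "product_category_name": ["perfumaria", "artes"],
    "product_category_name_english": ["perfumery", "art"],
})

product_categories = set(products["product_category_name"].dropna())
translated = set(translation["product_category_name"])

# Category names used by products that have no row in the translation table.
untranslated = sorted(product_categories - translated)

print(untranslated)
```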

##### df_items

In [190]:
show_table_info("items")

Unnamed: 0,Table,Column,Not-Null,Missing Values,DType,Unique,Duplicate,Memory Usage
0,Items,order_id,112650,0,object,98666,,8.70 MB
1,Items,order_item_id,112650,0,int64,21,,880.08 KB
2,Items,product_id,112650,0,object,32951,,8.70 MB
3,Items,seller_id,112650,0,object,3095,,8.70 MB
4,Items,shipping_limit_date,112650,0,object,93318,,7.31 MB
5,Items,price,112650,0,float64,5968,,880.08 KB
6,Items,freight_value,112650,0,float64,6999,,880.08 KB
7,Items,[Total],112650,0,,112650,0.0,35.99 MB


**Insight:**
- The items table (df_items) has 7 columns
- There are no missing values
- The dtype of `shipping_limit_date` needs to be converted to datetime
- There are no duplicate rows

##### df_payments

In [191]:
show_table_info("payments")

Unnamed: 0,Table,Column,Not-Null,Missing Values,DType,Unique,Duplicate,Memory Usage
0,Payments,order_id,103886,0,object,99440,,8.02 MB
1,Payments,payment_sequential,103886,0,int64,29,,811.61 KB
2,Payments,payment_type,103886,0,object,5,,5.83 MB
3,Payments,payment_installments,103886,0,int64,24,,811.61 KB
4,Payments,payment_value,103886,0,float64,29077,,811.61 KB
5,Payments,[Total],103886,0,,103886,0.0,16.23 MB


**Insight:**
- The payments table (df_payments) has 5 columns
- There are no missing values
- The dtypes of all columns are already OK
- There are no duplicate rows
- There are 103,886 payment records for 99,440 distinct orders, with 5 payment types
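
df_orders holds 99,441 distinct `order_id` values while df_payments covers 99,440, so at least one order has no payment record. A left merge with `indicator=True` is one way to find such orders. The sketch below uses hypothetical toy frames.

```python
import pandas as pd

# Hypothetical toy frames: order o1 was paid in two parts, o3 has no payment row.
orders = pd.DataFrame({"order_id": ["o1", "o2", "o3"]})
payments = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "payment_sequential": [1, 2, 1],
    "payment_value": [50.0, 25.0, 99.33],
})

# One row per paid order, so the merge does not multiply order rows.
paid_ids = payments[["order_id"]].drop_duplicates()
check = orders.merge(paid_ids, on="order_id", how="left", indicator=True)
orders_without_payment = check.loc[check["_merge"] == "left_only", "order_id"].tolist()

# Number of payment records per order (values above 1 mean split payments).
payments_per_order = payments.groupby("order_id").size()

print(orders_without_payment, payments_per_order.to_dict())
```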

##### df_reviews

In [192]:
show_table_info("reviews")

Unnamed: 0,Table,Column,Not-Null,Missing Values,DType,Unique,Duplicate,Memory Usage
0,Reviews,review_id,99224,0,object,98410,,7.66 MB
1,Reviews,order_id,99224,0,object,98673,,7.66 MB
2,Reviews,review_score,99224,0,int64,5,,775.19 KB
3,Reviews,review_comment_title,11568,87656,object,4527,,3.39 MB
4,Reviews,review_comment_message,40977,58247,object,36159,,6.78 MB
5,Reviews,review_creation_date,99224,0,object,636,,6.43 MB
6,Reviews,review_answer_timestamp,99224,0,object,98248,,6.43 MB
7,Reviews,[Total],9839,89385,,99224,0.0,39.12 MB


**Insight:**
- The reviews table (df_reviews) has 7 columns
- 89,385 rows contain at least one missing value
- `review_comment_title` has 87,656 missing values
- `review_comment_message` has 58,247 missing values
- The dtypes of `review_creation_date` and `review_answer_timestamp` need to be converted to datetime
- There are no fully duplicated rows
- The 99,224 rows contain 98,410 distinct `review_id` values and 98,673 distinct orders, with 5 possible scores
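
A table can have no fully duplicated rows and still repeat a key: here 99,224 rows hold only 98,410 distinct `review_id` values, so 814 rows reuse an earlier `review_id`. The sketch below shows the two checks side by side on a hypothetical toy frame.

```python
import pandas as pd

# Hypothetical toy frame: review r1 is attached to two different orders.
reviews = pd.DataFrame({
    "review_id": ["r1", "r1", "r2"],
    "order_id": ["o1", "o2", "o3"],
    "review_score": [5, 5, 4],
})

full_row_duplicates = int(reviews.duplicated().sum())               # compares all columns
repeated_review_ids = int(reviews["review_id"].duplicated().sum())  # compares the key only

# keep=False marks every row of a repeated key, which makes the groups easy to inspect.
shared = reviews[reviews["review_id"].duplicated(keep=False)]

print(full_row_duplicates, repeated_review_ids, len(shared))
```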

##### df_customers

In [193]:
show_table_info("customers")

Unnamed: 0,Table,Column,Not-Null,Missing Values,DType,Unique,Duplicate,Memory Usage
0,Customers,customer_id,99441,0,object,99441,,7.68 MB
1,Customers,customer_unique_id,99441,0,object,96096,,7.68 MB
2,Customers,customer_zip_code_prefix,99441,0,int64,14994,,776.88 KB
3,Customers,customer_city,99441,0,object,4119,,5.63 MB
4,Customers,customer_state,99441,0,object,27,,4.84 MB
5,Customers,[Total],99441,0,,99441,0.0,26.59 MB


**Insight:**
- The customers table (df_customers) has 5 columns
- There are no missing values
- The dtype of `customer_zip_code_prefix` is better changed to string/object
- There are no duplicate rows
- There are 99,441 distinct `customer_id` values (96,096 distinct `customer_unique_id` values) with 14,994 distinct zip code prefixes across 4,119 cities and 27 states
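
Reading a zip code prefix as an integer drops leading zeros, which is why values such as 9790 and 1151 appear in the preview. The sketch below converts the column to strings and restores the padding. It assumes prefixes are five characters wide, which is an assumption about the dataset and should be confirmed before applying it.

```python
import pandas as pd

# Values taken from the preview of df_customers above.
customers = pd.DataFrame({"customer_zip_code_prefix": [14409, 9790, 1151]})

# Assumption: prefixes are 5 characters wide, so shorter values lost leading zeros.
customers["customer_zip_code_prefix"] = (
    customers["customer_zip_code_prefix"].astype(str).str.zfill(5)
)

print(customers["customer_zip_code_prefix"].tolist())
```

Alternatively, passing `dtype={"customer_zip_code_prefix": str}` to `pd.read_csv` keeps the text exactly as it is stored in the file.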

##### df_sellers

In [194]:
show_table_info("sellers")

Unnamed: 0,Table,Column,Not-Null,Missing Values,DType,Unique,Duplicate,Memory Usage
0,Sellers,seller_id,3095,0,object,3095,,244.82 KB
1,Sellers,seller_zip_code_prefix,3095,0,int64,2246,,24.18 KB
2,Sellers,seller_city,3095,0,object,611,,178.89 KB
3,Sellers,seller_state,3095,0,object,23,,154.15 KB
4,Sellers,[Total],3095,0,,3095,0.0,602.16 KB


**Insight:**
- The sellers table (df_sellers) has 4 columns
- There are no missing values
- The dtype of `seller_zip_code_prefix` is better changed to string/object
- There are no duplicate rows
- There are 3,095 sellers with 2,246 distinct zip code prefixes across 611 cities and 23 states

##### df_geolocation

In [195]:
show_table_info("geolocation")

Unnamed: 0,Table,Column,Not-Null,Missing Values,DType,Unique,Duplicate,Memory Usage
0,Geolocation,geolocation_zip_code_prefix,1000163,0,int64,19015,,7.63 MB
1,Geolocation,geolocation_lat,1000163,0,float64,717360,,7.63 MB
2,Geolocation,geolocation_lng,1000163,0,float64,717613,,7.63 MB
3,Geolocation,geolocation_city,1000163,0,object,8011,,57.84 MB
4,Geolocation,geolocation_state,1000163,0,object,27,,48.65 MB
5,Geolocation,[Total],1000163,0,,738332,261831.0,129.38 MB


**Insight:**
- The geolocation table (df_geolocation) has 5 columns
- There are no missing values
- The dtype of `geolocation_zip_code_prefix` is better changed to string/object
- There are 261,831 duplicate rows
- The number of unique geolocation rows is 738,332
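
Exact duplicate rows can be removed with `drop_duplicates`. The sketch below uses a hypothetical four-row sample. The second step, averaging coordinates to get one point per zip code prefix, is only one possible design choice and is not something the notebook prescribes.

```python
import pandas as pd

# Hypothetical sample: rows 1 and 2 are exact duplicates.
geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1037, 1046, 1046, 1046],
    "geolocation_lat": [-23.545621, -23.546081, -23.546081, -23.546129],
    "geolocation_lng": [-46.639292, -46.644820, -46.644820, -46.642951],
})

# Step 1: drop rows that repeat an earlier row in every column.
geo_unique = geo.drop_duplicates().reset_index(drop=True)

# Step 2 (optional, an assumption): one averaged coordinate per zip code prefix.
geo_by_zip = geo_unique.groupby("geolocation_zip_code_prefix", as_index=False)[
    ["geolocation_lat", "geolocation_lng"]
].mean()

print(len(geo), len(geo_unique), len(geo_by_zip))
```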

#### Min, Mean, Median, Mode, Max, Outliers

In [196]:
# Create a function to show table description
def show_table_desc(table_name):
    df = df_dict.get(table_name)
    
    # Initialize a list to collect column information
    table_column_desc = []

    # Iterate over each column in the DataFrame
    for column_name in df.columns:
        # Initialize a dictionary to hold the statistics for each column
        column_desc = {
            "Table": table_name.title(),
            "Column": column_name,
            "Dtype": str(df[column_name].dtype)
        }
        
        # If the column is numeric, calculate the statistics
        if df[column_name].dtype in ['int64', 'float64']:
            column_desc["Min"] = df[column_name].min()
            column_desc["Max"] = df[column_name].max()
            column_desc["Mean"] = df[column_name].mean()
            column_desc["Q1"] = df[column_name].quantile(0.25)
            column_desc["Q2"] = df[column_name].median()
            column_desc["Q3"] = df[column_name].quantile(0.75)
        else:
            # For non-numeric columns, only include relevant statistics
            column_desc["Mode"] = df[column_name].mode()[0]  # Most frequent value (mode) for non-numeric columns
            # column_desc["Unique_Count"] = df[column_name].nunique()
            column_desc["Sample Value"] = ', '.join(df[column_name].unique()[:1].astype(str))
        
        # Append the column statistics to the list
        table_column_desc.append(column_desc)

    # Create a DataFrame from the collected column descriptions
    table_column_desc_df = pd.DataFrame(table_column_desc)
    
    # Remove columns that contain all None values
    columns_to_keep = [col for col in table_column_desc_df.columns 
                      if not table_column_desc_df[col].isna().all()]
    
    table_column_desc_df = table_column_desc_df.fillna("")
    table_column_desc_df = table_column_desc_df[columns_to_keep]

    # Display the DataFrame
    display(table_column_desc_df)


##### df_orders

In [197]:
show_table_desc("orders")

Unnamed: 0,Table,Column,Dtype,Mode,Sample Value
0,Orders,order_id,object,00010242fe8c5a6d1ba2dd792cb16214,e481f51cbdc54678b7cc49136f2d6af7
1,Orders,customer_id,object,00012a2ce6f8dcda20d059ce98491703,9ef432eb6251297304e76186b10a928d
2,Orders,order_status,object,delivered,delivered
3,Orders,order_purchase_timestamp,object,2017-11-20 10:59:08,2017-10-02 10:56:33
4,Orders,order_approved_at,object,2018-02-27 04:31:10,2017-10-02 11:07:15
5,Orders,order_delivered_carrier_date,object,2018-05-09 15:48:00,2017-10-04 19:55:00
6,Orders,order_delivered_customer_date,object,2016-10-27 17:32:07,2017-10-10 21:25:13
7,Orders,order_estimated_delivery_date,object,2017-12-20 00:00:00,2017-10-18 00:00:00


**Insights:**
- The dtypes of several columns must be converted before their descriptive statistics can be computed correctly

##### df_products

In [198]:
show_table_desc("products")

Unnamed: 0,Table,Column,Dtype,Mode,Sample Value,Min,Max,Mean,Q1,Q2,Q3
0,Products,product_id,object,00066f42aeeb9f3007548bb9d3f33c38,1e9e8ef04dbcff4541ed26657ea517e5,,,,,,
1,Products,product_category_name,object,cama_mesa_banho,perfumaria,,,,,,
2,Products,product_name_lenght,float64,,,5.0,76.0,48.476949,42.0,51.0,57.0
3,Products,product_description_lenght,float64,,,4.0,3992.0,771.495285,339.0,595.0,972.0
4,Products,product_photos_qty,float64,,,1.0,20.0,2.188986,1.0,1.0,3.0
5,Products,product_weight_g,float64,,,0.0,40425.0,2276.472488,300.0,700.0,1900.0
6,Products,product_length_cm,float64,,,7.0,105.0,30.815078,18.0,25.0,38.0
7,Products,product_height_cm,float64,,,2.0,105.0,16.937661,8.0,13.0,21.0
8,Products,product_width_cm,float64,,,6.0,118.0,23.196728,15.0,20.0,30.0


**Insights:**
- Looks OK for now, apart from a minimum `product_weight_g` of 0, which is worth checking; if a column turns out to be needed for the analysis, we can inspect it further through visualization

##### df_product_category

In [199]:
show_table_desc("product_category")

Unnamed: 0,Table,Column,Dtype,Mode,Sample Value
0,Product_Category,product_category_name,object,agro_industria_e_comercio,beleza_saude
1,Product_Category,product_category_name_english,object,agro_industry_and_commerce,health_beauty


**Insights:**
- Looks OK

##### df_items

In [200]:
show_table_desc("items")

Unnamed: 0,Table,Column,Dtype,Mode,Sample Value,Min,Max,Mean,Q1,Q2,Q3
0,Items,order_id,object,8272b63d03f5f79c56e9e4120aec44ef,00010242fe8c5a6d1ba2dd792cb16214,,,,,,
1,Items,order_item_id,int64,,,1.0,21.0,1.197834,1.0,1.0,1.0
2,Items,product_id,object,aca2eb7d00ea1a7b8ebd4e68314663af,4244733e06e7ecb4970a6e2683c13e61,,,,,,
3,Items,seller_id,object,6560211a19b47992c3666cc44a7e94c0,48436dade18ac8b2bce089ec2a041202,,,,,,
4,Items,shipping_limit_date,object,2017-07-21 18:25:23,2017-09-19 09:45:35,,,,,,
5,Items,price,float64,,,0.85,6735.0,120.653739,39.9,74.99,134.9
6,Items,freight_value,float64,,,0.0,409.68,19.99032,13.08,16.26,21.15


**Insights:**
- The dtype of `shipping_limit_date` must be converted before checking its distribution and potential outliers

##### df_payments

In [206]:
show_table_desc("payments")

Unnamed: 0,Table,Column,Dtype,Mode,Sample Value,Min,Max,Mean,Q1,Q2,Q3
0,Payments,order_id,object,fa65dad1b0e818e3ccc5cb0e39231352,b81ef226f3fe1789b1e8b2acac839d17,,,,,,
1,Payments,payment_sequential,int64,,,1.0,29.0,1.092679,1.0,1.0,1.0
2,Payments,payment_type,object,credit_card,credit_card,,,,,,
3,Payments,payment_installments,int64,,,0.0,24.0,2.853349,1.0,1.0,4.0
4,Payments,payment_value,float64,,,0.0,13664.08,154.10038,56.79,100.0,171.8375


**Insights:**
- The distribution of `payment_sequential` needs a closer look: Q1 = Q2 = Q3 = 1, yet Max is 29 (a potential outlier, though not necessarily one)
- The distribution of `payment_installments` needs a closer look (potential outlier at the Max of 24; the Min of 0 is also worth checking)
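
One common way to follow up on these points is the 1.5 × IQR rule, which flags values far outside the interquartile range. The sketch below is an illustration on hypothetical toy series, and `iqr_outlier_mask` is a hypothetical helper. It also shows why the rule says little about a column such as `payment_sequential`: when Q1 equals Q3 the IQR is 0, so every value other than the quartile value is flagged, and a frequency table is more informative.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical toy values, not the real columns.
installments = pd.Series([1, 1, 1, 2, 3, 4, 4, 24], name="payment_installments")
sequential = pd.Series([1] * 8 + [2, 29], name="payment_sequential")

flagged_installments = installments[iqr_outlier_mask(installments)].tolist()
flagged_sequential = int(iqr_outlier_mask(sequential).sum())

# With IQR = 0, counting how often each value occurs is easier to interpret.
sequential_counts = sequential.value_counts().sort_index()

print(flagged_installments, flagged_sequential)
print(sequential_counts.to_dict())
```

A flagged value is only a candidate for review: paying in 24 installments or in many separate payments may be perfectly valid.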

##### df_reviews

In [202]:
show_table_desc("reviews")

Unnamed: 0,Table,Column,Dtype,Mode,Sample Value,Min,Max,Mean,Q1,Q2,Q3
0,Reviews,review_id,object,08528f70f579f0c830189efc523d2182,7bc2406110b926393aa56f80a40eba40,,,,,,
1,Reviews,order_id,object,03c939fd7fd3b38f8485a0f95798f1f6,73fc7af87114b39712e6da79b0a377eb,,,,,,
2,Reviews,review_score,int64,,,1.0,5.0,4.086421,4.0,5.0,5.0
3,Reviews,review_comment_title,object,Recomendo,,,,,,,
4,Reviews,review_comment_message,object,Muito bom,,,,,,,
5,Reviews,review_creation_date,object,2017-12-19 00:00:00,2018-01-18 00:00:00,,,,,,
6,Reviews,review_answer_timestamp,object,2017-06-15 23:21:05,2018-01-18 21:46:59,,,,,,


**Insights:**
- The dtype of `review_answer_timestamp` must be converted before checking for potential outliers again

##### df_customers

In [203]:
show_table_desc("customers")

Unnamed: 0,Table,Column,Dtype,Mode,Sample Value,Min,Max,Mean,Q1,Q2,Q3
0,Customers,customer_id,object,00012a2ce6f8dcda20d059ce98491703,06b8999e2fba1a1fbc88172c00ba8bc7,,,,,,
1,Customers,customer_unique_id,object,8d50f5eadf50201ccdcedfb9e2ac8455,861eff4711a542e4b93843c6dd7febb0,,,,,,
2,Customers,customer_zip_code_prefix,int64,,,1003.0,99990.0,35137.474583,11347.0,24416.0,58900.0
3,Customers,customer_city,object,sao paulo,franca,,,,,,
4,Customers,customer_state,object,SP,SP,,,,,,


**Insights:**
- Apart from the dtype issue with `customer_zip_code_prefix`, it looks OK

##### df_sellers

In [204]:
show_table_desc("sellers")

Unnamed: 0,Table,Column,Dtype,Mode,Sample Value,Min,Max,Mean,Q1,Q2,Q3
0,Sellers,seller_id,object,0015a82c2db000af6aaaf3ae2ecb0532,3442f8959a84dea7ee197c632cb2df15,,,,,,
1,Sellers,seller_zip_code_prefix,int64,,,1001.0,99730.0,32291.059451,7093.5,14940.0,64552.5
2,Sellers,seller_city,object,sao paulo,campinas,,,,,,
3,Sellers,seller_state,object,SP,SP,,,,,,


**Insights:**
- Apart from the dtype issue with `seller_zip_code_prefix`, it looks OK

##### df_geolocation

In [205]:
show_table_desc("geolocation")

Unnamed: 0,Table,Column,Dtype,Min,Max,Mean,Q1,Q2,Q3,Mode,Sample Value
0,Geolocation,geolocation_zip_code_prefix,int64,1001.0,99990.0,36574.166466,11075.0,26530.0,63504.0,,
1,Geolocation,geolocation_lat,float64,-36.605374,45.065933,-21.176153,-23.603546,-22.919377,-19.97962,,
2,Geolocation,geolocation_lng,float64,-101.466766,121.105394,-46.390541,-48.573172,-46.637879,-43.767709,,
3,Geolocation,geolocation_city,object,,,,,,,sao paulo,sao paulo
4,Geolocation,geolocation_state,object,,,,,,,SP,SP


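**Insights:**
- Apart from the dtype issue with `geolocation_zip_code_prefix`, it mostly looks OK, although the Max values of `geolocation_lat` (45.07) and `geolocation_lng` (121.11) fall outside Brazil and are worth checking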
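
The table above reports a maximum latitude of about 45.07 and a maximum longitude of about 121.11, which cannot be points in Brazil. A bounding-box filter is a simple way to count such rows. In the sketch below the box limits are rough, assumed values for mainland Brazil (they are not from the dataset and would exclude some offshore islands), and the sample rows are hypothetical.

```python
import pandas as pd

# Rough bounding box for mainland Brazil (assumed, approximate values).
LAT_MIN, LAT_MAX = -34.0, 5.5
LNG_MIN, LNG_MAX = -74.0, -34.0

# Hypothetical sample: the first row is in Sao Paulo, the other two are far outside.
geo = pd.DataFrame({
    "geolocation_lat": [-23.545621, 45.0, -23.5],
    "geolocation_lng": [-46.639292, 9.0, 121.1],
})

inside_box = (
    geo["geolocation_lat"].between(LAT_MIN, LAT_MAX)
    & geo["geolocation_lng"].between(LNG_MIN, LNG_MAX)
)
outside_count = int((~inside_box).sum())

print(outside_count)
```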

### Cleaning Data

**Insight:**
- xxx
- xxx

## Exploratory Data Analysis (EDA)

### Explore ...

**Insight:**
- xxx
- xxx

## Visualization & Explanatory Analysis

### Question 1:

### Question 2:

**Insight:**
- xxx
- xxx

## Advanced Analysis (Optional)

## Conclusion

- Conclusion for question 1
- Conclusion for question 2