# Data Analysis Project: Brazilian E-Commerce Public Dataset by Olist
- **Name:** Albar Pambagio Arioseto
- **Email:** albarpambagio@gmail.com
- **Dicoding ID:** albarpambagio

## Defining the Business Questions

- How did sales from São Paulo perform in 2018?
- How did product review scores from São Paulo perform in 2018?

## Import All Packages/Libraries Used

In [19]:
import pandas as pd
import altair as alt
import numpy as np
from datetime import datetime

## Data Wrangling

### Gathering Data

In [None]:
orders = pd.read_csv('data/olist_orders_dataset.csv')
customers = pd.read_csv('data/olist_customers_dataset.csv')
payments = pd.read_csv('data/olist_order_payments_dataset.csv')
reviews = pd.read_csv('data/olist_order_reviews_dataset.csv')

In [20]:
# Filter for São Paulo customers only
sao_paulo_customers = customers[customers['customer_state'] == 'SP']['customer_id'].unique()

In [21]:
# Filter orders by São Paulo customers
sao_paulo_orders = orders[orders['customer_id'].isin(sao_paulo_customers)]
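
The two filters above chain a lookup from customers to orders through `customer_id`. A minimal sketch of the same `isin` pattern on made-up toy rows (the IDs and states below are hypothetical, not taken from the Olist files):

```python
import pandas as pd

# Hypothetical toy data standing in for the customers and orders tables
customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_state": ["SP", "RJ", "SP"],
})
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "customer_id": ["c1", "c2", "c3", "c1"],
})

# Step 1: collect the IDs of customers located in state SP
sp_customer_ids = customers.loc[customers["customer_state"] == "SP", "customer_id"].unique()

# Step 2: keep only orders placed by those customers
sp_orders = orders[orders["customer_id"].isin(sp_customer_ids)]

print(sp_orders["order_id"].tolist())
```

A merge on `customer_id` would give the same rows while also carrying customer columns along; `isin` is enough when only the order rows are needed.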

In [None]:
# Filter orders for 2018 only
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
sao_paulo_orders = sao_paulo_orders.assign(
    order_purchase_timestamp=pd.to_datetime(sao_paulo_orders['order_purchase_timestamp'])
)
# Half-open interval so that orders placed during 31 December are included
sao_paulo_orders_2018 = sao_paulo_orders[
    (sao_paulo_orders['order_purchase_timestamp'] >= '2018-01-01') &
    (sao_paulo_orders['order_purchase_timestamp'] < '2019-01-01')
].copy()
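
Converting a column on a DataFrame produced by boolean filtering can trigger pandas' SettingWithCopyWarning, and when a year is selected by timestamp, a string bound such as `'2018-12-31'` is parsed as midnight at the start of that day. A minimal sketch, on hypothetical toy rows, of taking an explicit copy and using a half-open interval:

```python
import pandas as pd

# Hypothetical toy orders; timestamps are strings, as they are after read_csv
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "order_purchase_timestamp": [
        "2017-12-31 23:59:59",
        "2018-01-01 00:00:00",
        "2018-12-31 15:30:00",
        "2019-01-01 00:00:00",
    ],
})

# Take an explicit copy of the filtered rows so later column assignments
# write to an independent DataFrame rather than to a slice of `orders`
subset = orders[orders["order_id"] != "o0"].copy()
subset["order_purchase_timestamp"] = pd.to_datetime(subset["order_purchase_timestamp"])

# Half-open interval [2018-01-01, 2019-01-01) keeps every timestamp in 2018,
# including those later in the day on 31 December
in_2018 = subset[
    (subset["order_purchase_timestamp"] >= "2018-01-01")
    & (subset["order_purchase_timestamp"] < "2019-01-01")
]

print(in_2018["order_id"].tolist())
```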


In [33]:
# Get order_ids for 2018 São Paulo orders
sp_order_ids_2018 = sao_paulo_orders_2018['order_id'].unique()

In [34]:
# Get payment information for these orders
sp_payments_2018 = payments[payments['order_id'].isin(sp_order_ids_2018)]

# Get review information for these orders
sp_reviews_2018 = reviews[reviews['order_id'].isin(sp_order_ids_2018)]

In [36]:
# Convert review date to datetime; assign() returns a new DataFrame, avoiding SettingWithCopyWarning
sp_reviews_2018 = sp_reviews_2018.assign(review_creation_date=pd.to_datetime(sp_reviews_2018['review_creation_date']))


In [37]:
# Extract month from timestamps; assign() returns new DataFrames, avoiding SettingWithCopyWarning
sao_paulo_orders_2018 = sao_paulo_orders_2018.assign(month=sao_paulo_orders_2018['order_purchase_timestamp'].dt.month)
sp_reviews_2018 = sp_reviews_2018.assign(month=sp_reviews_2018['review_creation_date'].dt.month)


**Insight:**
- The dataset is sourced from Kaggle
- Dataset format: CSV

### Assessing Data

#### Checking missing values

In [69]:
sp_payments_2018.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24669 entries, 1 to 103881
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   order_id              24669 non-null  object 
 1   payment_sequential    24669 non-null  int64  
 2   payment_type          24669 non-null  object 
 3   payment_installments  24669 non-null  int64  
 4   payment_value         24669 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 1.1+ MB


In [67]:
sp_reviews_2018.info()

<class 'pandas.core.frame.DataFrame'>
Index: 23829 entries, 0 to 99218
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   review_id                23829 non-null  object        
 1   order_id                 23829 non-null  object        
 2   review_score             23829 non-null  int64         
 3   review_comment_title     5040 non-null   object        
 4   review_comment_message   9237 non-null   object        
 5   review_creation_date     23829 non-null  datetime64[ns]
 6   review_answer_timestamp  23829 non-null  object        
 7   month                    23829 non-null  int32         
dtypes: datetime64[ns](1), int32(1), int64(1), object(5)
memory usage: 1.5+ MB
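
`info()` reports non-null counts, so missing values have to be inferred by subtraction. A small sketch, on hypothetical toy rows, of reading them directly with `isna()`:

```python
import pandas as pd

# Hypothetical toy reviews; None marks a comment the customer left empty
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3", "r4"],
    "review_score": [5, 1, 4, 5],
    "review_comment_title": [None, "Late", None, None],
    "review_comment_message": ["Great", "Never arrived", None, None],
})

# Count and share of missing values per column
missing_count = reviews.isna().sum()
missing_share = reviews.isna().mean()

print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```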


#### Checking duplicates

In [70]:
sp_payments_2018.duplicated().sum()

np.int64(0)

In [71]:
sp_reviews_2018.duplicated().sum()

np.int64(0)
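
`duplicated()` with no arguments only flags rows that match in every column. As a sketch, assuming a key column such as `review_id` is expected to be unique (an assumption, not something this notebook checks), a key-based check on toy data could look like this:

```python
import pandas as pd

# Hypothetical toy reviews: r1 appears twice, attached to two different orders
reviews = pd.DataFrame({
    "review_id": ["r1", "r1", "r2"],
    "order_id": ["o1", "o2", "o3"],
    "review_score": [5, 5, 3],
})

# Full-row check: no two rows are identical in every column
full_row_duplicates = reviews.duplicated().sum()

# Key-based check: the same review_id occurs more than once
key_duplicates = reviews.duplicated(subset=["review_id"]).sum()

print(full_row_duplicates, key_duplicates)
```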

#### Checking outliers

In [60]:
def detect_outliers_iqr(df, column):
    """Return the rows of df whose value in column lies more than 1.5 * IQR outside the quartiles."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]

    return outliers
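
To make the rule concrete, here is a small worked sketch of the same 1.5 * IQR check on hypothetical toy values, including the share of rows flagged:

```python
import pandas as pd

def detect_outliers_iqr(df, column):
    # Same 1.5 * IQR rule as the function defined in the notebook
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return df[(df[column] < lower_bound) | (df[column] > upper_bound)]

# Hypothetical toy payments: five ordinary values and one very large one
toy = pd.DataFrame({"payment_value": [10.0, 11.0, 12.0, 12.0, 13.0, 100.0]})

flagged = detect_outliers_iqr(toy, "payment_value")
outlier_share = len(flagged) / len(toy)

# Q1 = 11.25, Q3 = 12.75, IQR = 1.5, so the bounds are 9.0 and 15.0
print(flagged["payment_value"].tolist(), round(outlier_share, 3))
```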

In [72]:
outliers_p = detect_outliers_iqr(sp_payments_2018, 'payment_value')
outliers_p

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
26,d0a945f85ba1074b60aac97ade7e240e,1,credit_card,2,541.00
52,2e2c60b99754ae1e4d8b18846cfec9f2,1,credit_card,4,542.66
54,95442deb81a5d91c97c0df96b431634a,1,boleto,1,368.98
172,1594012ccc1b0770373ce691d697e5ae,1,credit_card,8,362.17
177,b545ba7b0bd67a3128185c7214704319,1,credit_card,8,340.08
...,...,...,...,...,...
103465,656689d74b6f04ec228f13e98fc9b15b,1,boleto,1,548.90
103484,0cf1684e713ddfe42529ec72206b25b3,1,credit_card,10,551.49
103498,b5b0b49d9903b852bf6c638643a89dc3,1,boleto,1,334.86
103813,2ab10ab526351fd3b05219e9eb4f7d9f,1,credit_card,8,366.73


In [75]:
outliers = detect_outliers_iqr(sp_reviews_2018, 'review_score')
outliers

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,month
5,15197aa66ff4d0650b5434f1b46cda19,b18dcdf73be66366873cd26c5724d1dc,1,,,2018-04-13,2018-04-16 00:39:37,4
19,373cbeecea8286a2b66c97b1b157ec46,583174fbe37d3d5f0d6661be3aad1786,1,Não chegou meu produto,Péssimo,2018-08-15,2018-08-15 04:10:37,8
89,65dfeb60c40e3cbb0a1838285d86f885,a2714ecbf6eeb3bb9cd7dba6dc1c5e82,1,,Pedi reembolso e sem resposta até momento,2018-03-08,2018-03-08 12:27:34,3
91,1090909faae22e5ab76903e8493063f1,a1341cb83bbf1b47392f4a3685d56bad,1,,,2018-03-07,2018-03-07 15:32:23,3
149,ef8ae544c432bb1053ba5990bd0d6227,e18ebd6286697a3f0f6fe267d8286cb2,1,,EU NÃO RECEBI O PRODUTO E CONSTA NO SISTEMA QU...,2018-01-12,2018-01-13 00:28:49,1
...,...,...,...,...,...,...,...,...
99081,dfb01be03c8f304bb01e3d2d90719a5b,c5d41b216e4b42500c5da2be17a74065,1,,"Meus pedidos estão dando como entregues , mais...",2018-05-29,2018-05-31 13:35:06,5
99091,fade356e7332606aa22776d8d553cdce,837b75362f8a7c08c85182dfd16cb72d,1,,Ainda nao recevi o produto.,2018-08-12,2018-08-12 13:51:31,8
99139,cadeddfc924c941517913adcc05dcb26,5f0729d8bd88589e8501140a9eb86ed3,2,,"nao satifez minhas expectativas, não funciona ...",2018-02-03,2018-02-06 11:52:01,2
99200,2ee221b28e5b6fceffac59487ed39348,f2d12dd37eaef72ed7b1186b2edefbcd,2,Foto enganosa,Foto muito diferente principalmente a graninha...,2018-03-28,2018-05-25 01:23:26,3


**Insight:**
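- The payments data has no missing values; in the reviews data, only the comment columns are incomplete (`review_comment_title` has 5,040 and `review_comment_message` has 9,237 non-null values out of 23,829 rows)
- There are no duplicate rows
- About 7% of payment values are outliers
- About 12% of review scores are outliers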
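
On a 1-to-5 rating scale concentrated at the top, the IQR rule can flag low scores as outliers even though they are valid ratings. A sketch, on hypothetical toy scores, of describing the distribution directly instead:

```python
import pandas as pd

# Hypothetical toy review scores, skewed towards 5
scores = pd.Series([5, 5, 5, 4, 5, 4, 5, 1, 2, 5], name="review_score")

# Share of reviews at each score, ordered from lowest to highest
score_share = scores.value_counts(normalize=True).sort_index()

# Share of low scores (1 or 2), the values the IQR rule flags on this toy data
low_score_share = (scores <= 2).mean()

print(score_share)
print(low_score_share)
```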

### Cleaning Data

**Insight:**
- Outliers are not treated specifically, because further investigation is needed to determine whether they are data errors or reflect actual conditions.
- In addition, no dropping or imputation is performed, because neither is needed to answer the two analysis questions stated earlier.

## Exploratory Data Analysis (EDA)

### Explore Central Tendency by Month

In [76]:
# Prepare monthly payment data
monthly_payments = sp_payments_2018.merge(
    sao_paulo_orders_2018[['order_id', 'month']], 
    on='order_id', 
    how='left'
)

In [78]:
# Calculate monthly payment statistics
monthly_payment_stats = monthly_payments.groupby('month').agg(
    total_payment=('payment_value', 'sum'),
    avg_payment=('payment_value', 'mean'),
    order_count=('order_id', 'nunique')
).reset_index()

In [79]:
# Calculate monthly review statistics
monthly_review_stats = sp_reviews_2018.groupby('month').agg(
    avg_score=('review_score', 'mean'),
    review_count=('review_id', 'count')
).reset_index()
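
Note that `avg_payment` is the mean over payment rows, not over orders: an order with several payment rows contributes several values. A sketch on hypothetical toy data showing how the two averages can differ (`avg_per_order` is a name introduced here for illustration):

```python
import pandas as pd

# Hypothetical toy payments with the order month already attached;
# order o1 was paid in two parts, so it has two payment rows
payments = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3"],
    "month": [1, 1, 1, 2],
    "payment_value": [50.0, 30.0, 100.0, 40.0],
})

monthly = payments.groupby("month").agg(
    total_payment=("payment_value", "sum"),
    avg_payment=("payment_value", "mean"),   # mean per payment row
    order_count=("order_id", "nunique"),     # distinct orders
).reset_index()

# Average per order, which differs from avg_payment when orders have several payment rows
monthly["avg_per_order"] = monthly["total_payment"] / monthly["order_count"]

print(monthly)
```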

In [83]:
monthly_payment_stats['avg_payment'].describe()

count     10.000000
mean     165.020124
std       82.879280
min      128.363226
25%      135.912673
50%      139.275169
75%      143.338270
max      400.363750
Name: avg_payment, dtype: float64

In [82]:
monthly_review_stats['avg_score'].describe()

count    8.000000
mean     4.198373
std      0.115121
min      3.979211
25%      4.163892
50%      4.224971
75%      4.282395
max      4.317053
Name: avg_score, dtype: float64

In [80]:
# Add month name for better visualization
month_names = {
    1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
    7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'
}
monthly_payment_stats['month_name'] = monthly_payment_stats['month'].map(month_names)

In [84]:
monthly_review_stats['month_name'] = monthly_review_stats['month'].map(month_names)

# Ensure all months are included (even if no data)
all_months = pd.DataFrame({'month': range(1, 13)})
all_months['month_name'] = all_months['month'].map(month_names)

# Merge on both keys so month_name is not duplicated as month_name_x / month_name_y
monthly_payment_stats = all_months.merge(monthly_payment_stats, on=['month', 'month_name'], how='left').fillna(0)
monthly_review_stats = all_months.merge(monthly_review_stats, on=['month', 'month_name'], how='left').fillna(0)

In [87]:
monthly_payment_stats.head()

Unnamed: 0,month,month_name,total_payment,avg_payment,order_count
0,1,Jan,430306.48,135.273964,3052.0
1,2,Feb,358518.49,128.363226,2703.0
2,3,Mar,441724.26,140.229924,3037.0
3,4,Apr,452747.67,142.463081,3059.0
4,5,May,492978.52,148.936109,3207.0
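
Filling months without data with 0 makes them indistinguishable from months whose true value is zero. A sketch, on hypothetical toy values, of an alternative that keeps the gaps as `NaN`:

```python
import pandas as pd

# Hypothetical monthly statistics covering only three months
monthly = pd.DataFrame({
    "month": [1, 2, 4],
    "avg_score": [4.2, 4.1, 4.3],
})

# Reindex to all twelve months; months without data become NaN
full_year = (
    monthly.set_index("month")
    .reindex(range(1, 13))
    .rename_axis("month")
    .reset_index()
)

# Keeping NaN (instead of filling with 0) leaves the mean over observed months unchanged
mean_observed = full_year["avg_score"].mean()
mean_zero_filled = full_year["avg_score"].fillna(0).mean()

print(round(mean_observed, 2), round(mean_zero_filled, 2))
```

Whether gaps should appear as empty positions or as zero-height bars depends on the chart; the point here is only that the choice changes any statistic computed afterwards.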


#### Insights
- Across months, the monthly average payment value for São Paulo in 2018 has a mean of about 165 Brazilian reais (minimum 128, maximum 400); the median of about 139 is more representative, because a single month at 400 pulls the mean up
- The monthly average review score for São Paulo in 2018 has a mean of about 4.2 (minimum 3.98, maximum 4.32)
- Payment value data is only available for January through October
- Review data is only available for January through August
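
Calling `describe()` on `avg_payment` summarizes the monthly averages, so each month counts equally regardless of how many payments it contains. A sketch on hypothetical toy numbers of how that can differ from the average over all payments:

```python
import pandas as pd

# Hypothetical toy payments: a busy month with small payments and a
# quiet month with a single large payment
payments = pd.DataFrame({
    "month": [1, 1, 1, 1, 2],
    "payment_value": [100.0, 120.0, 110.0, 130.0, 400.0],
})

monthly_avg = payments.groupby("month")["payment_value"].mean()

mean_of_monthly_avgs = monthly_avg.mean()        # each month weighted equally
overall_mean = payments["payment_value"].mean()  # each payment weighted equally

print(mean_of_monthly_avgs, overall_mean)
```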

## Visualization

### Question 1:

In [96]:
# Create Altair visualization for average payment by month
avg_payment_chart = alt.Chart(monthly_payment_stats).mark_bar().encode(
    x=alt.X('month:O', title='Month'),  # month holds the numbers 1-12, so encode it as ordinal
    y=alt.Y('avg_payment:Q', title='Average Payment Value (R$)'),
    tooltip=['month', 'avg_payment']
).properties(
    title='Average Payment Value by Month - São Paulo (2018)',
    width=600,
    height=400
)

# Add text labels for the average payment values
payment_text = avg_payment_chart.mark_text(
    align='center',
    baseline='top',
    dy=10
).encode(
    text=alt.Text('avg_payment:Q', format=',.2f')
)

In [97]:
# Display the charts with text labels
payment_chart_with_text = avg_payment_chart + payment_text

# Show the charts
payment_chart_with_text.display()

### Question 2:

In [101]:
# Create Altair visualization for average review score by month
avg_review_chart = alt.Chart(monthly_review_stats).mark_bar(color='orange').encode(
    x=alt.X('month:O', title='Month'),  # month holds the numbers 1-12, so encode it as ordinal
    y=alt.Y('avg_score:Q', title='Average Review Score (1-5)', scale=alt.Scale(domain=[0, 5])),
    tooltip=['month', 'avg_score']
).properties(
    title='Average Review Score by Month - São Paulo (2018)',
    width=600,
    height=400
)

# Add text labels for the average review scores
review_text = avg_review_chart.mark_text(
    align='center',
    baseline='top',
    dy=10
).encode(
    text=alt.Text('avg_score:Q', format='.2f')
)

In [102]:
review_chart_with_text = avg_review_chart + review_text
review_chart_with_text.display()

## Conclusion

### Findings
- The average payment value spikes to about 400 Brazilian reais in one month, reported here as November 2018 (this label needs verification, since payment data is only available for January through October); the other months show no drastic movement and are fairly stable
- The average review score dips to 3.98 in March 2018; in the other months the review scores are fairly good and show no drastic movement

### Recommendations
- Investigate whether the spike is associated with a particular marketing strategy, so that the initiative can be replicated in other months
- Carry out a follow-up analysis of the review dip in March to identify its root cause, so that it can be resolved and prevented in the future
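
As a first step in the recommended investigation, it may help to confirm which month holds the spike and how many orders stand behind it, since an average over very few orders is volatile. A sketch on invented numbers (the `monthly` frame and the 10% threshold are assumptions for illustration, not results from this dataset):

```python
import pandas as pd

# Hypothetical monthly statistics; the numbers are invented for illustration
monthly = pd.DataFrame({
    "month": [1, 2, 3, 4],
    "avg_payment": [135.0, 128.0, 140.0, 400.0],
    "order_count": [3000, 2700, 3050, 4],
})

# Row with the highest average payment
peak = monthly.loc[monthly["avg_payment"].idxmax()]

# Compare the number of orders behind the peak with a typical month
median_orders = monthly["order_count"].median()
peak_is_thin = peak["order_count"] < 0.1 * median_orders  # 10% threshold is an arbitrary assumption

print(int(peak["month"]), int(peak["order_count"]), bool(peak_is_thin))
```

The same check could be run against `monthly_payment_stats`, which already carries `avg_payment` and `order_count` per month.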