<a href="https://colab.research.google.com/github/William9923/future-data-ecommerce/blob/master/notebooks/13_02_2021DatasetRelationshipExploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Exploration

Goals:
* Find relationships between the datasets and understand the data context
* Explore the data to inform database creation


## Importing Libraries

In [1]:
#ignore warnings
import warnings
warnings.filterwarnings('ignore')

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pandas as pd

In [2]:
def profilling(df, filename):
  # sweetviz is imported here, so it is only required when profiling is run
  import sweetviz as sv
  advert_report = sv.analyze(df)     # analyze the dataset
  advert_report.show_html(filename)  # write the report to an HTML file and display it

In [3]:
def show_missing_data(df):
    print(f"Shape : {df.shape}")
    print(f"Missing Data : {df.isnull().sum()}")
    return None

## Importing Data


In [4]:
print("Loading Dataset ...")

data_folder = "data/raw/"

user = pd.read_csv(data_folder + "user_dataset.csv")
order = pd.read_csv(data_folder + "order_dataset.csv")
order_item = pd.read_csv(data_folder + "order_item_dataset.csv")
payment = pd.read_csv(data_folder + "payment_dataset.csv")
products = pd.read_csv(data_folder + "products_dataset.csv")
seller = pd.read_csv(data_folder + "seller_dataset.csv")
feedback = pd.read_csv(data_folder + "feedback_dataset.csv")

print("Finish...")

Loading Dataset ...
Finish...


In [None]:
dataset = [user, order, order_item, payment, products, seller, feedback]
filenames = ["User", "Order", "OrderItem", "Payment", "Products", "Seller", "Feedback"]

# Folder that receives the HTML reports (must already exist)
report_folder = 'reports/docs/'

for data, filename in zip(dataset, filenames) :
  profilling(data, report_folder + filename + ".html")

## Single Entity Exploration

### User Dataset Exploration

In [5]:
print("Missing Data | User : ")
show_missing_data(user)

Missing Data | User : 
Shape : (99441, 4)
Missing Data : user_name            0
customer_zip_code    0
customer_city        0
customer_state       0
dtype: int64


In [6]:
user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   user_name          99441 non-null  object
 1   customer_zip_code  99441 non-null  int64 
 2   customer_city      99441 non-null  object
 3   customer_state     99441 non-null  object
dtypes: int64(1), object(3)
memory usage: 3.0+ MB


In [7]:
user.head()

Unnamed: 0,user_name,customer_zip_code,customer_city,customer_state
0,861eff4711a542e4b93843c6dd7febb0,14409,KABUPATEN PEKALONGAN,JAWA TENGAH
1,290c77bc529b7ac935b93aa66c333dc3,9790,KOTA BEKASI,JAWA BARAT
2,060e732b5b29e8181a18229c7b0b2b5e,1151,KOTA TANGERANG,BANTEN
3,259dac757896d24d7702b9acbbff3f3c,8775,KABUPATEN BANDUNG BARAT,JAWA BARAT
4,345ecd01c38d18a9036ed96c73b8d066,13056,KOTA JAKARTA TIMUR,DKI JAKARTA


In [8]:
print(f"Total rows : {user.shape[0]}")
print(f"Total unique rows (based on user_name) : {user.user_name.nunique()}")
print(f"Number of duplicates (based on user_name) : {user.duplicated(subset = ['user_name']).sum()}")

Total rows : 99441
Total unique rows (based on user_name) : 96096
Number of duplicates (based on user_name) : 3345


In [9]:
# Flag rows whose user_name already appeared earlier (first 5 flags shown)
user.duplicated(subset = ['user_name']).head()

0    False
1    False
2    False
3    False
4    False
dtype: bool

In [10]:
# Check single sample duplicate username
user.loc[user.user_name == "2b6ce149982204423f4efac29701255a"]

Unnamed: 0,user_name,customer_zip_code,customer_city,customer_state
54643,2b6ce149982204423f4efac29701255a,31840,KABUPATEN TANGERANG,BANTEN
85949,2b6ce149982204423f4efac29701255a,31840,KABUPATEN TANGERANG,BANTEN


In [11]:
user_no_duplicate = user.drop_duplicates(subset=['user_name'], keep="last")
user_no_duplicate.to_csv("data/processed/user_no_duplicate.csv", index=False)

In [12]:
print(f"Number of duplicates after processing : {user_no_duplicate.duplicated(subset = ['user_name']).sum()}")

Number of duplicates after processing : 0


**Few Important Notes :**

* user_name is an encoded (hash-like) string
* the same user_name appears on multiple rows (?)
---
**Data Types For Database:**

* user_name : VARCHAR (PK)
* customer_zip_code : VARCHAR
* customer_city : VARCHAR
* customer_state : VARCHAR
---
**Preprocess Procedure:**

* drop duplicates on user_name (keep the last entry as the saved user)
* ask about the context of multiple zip codes per user_name
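
Before asking about multiple zip codes, it may help to check whether the duplicated user_name rows actually disagree on the zip code. The sketch below uses a small made-up table (`users_toy` and its values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Toy stand-in for the user table; the values are made up for illustration.
users_toy = pd.DataFrame({
    "user_name": ["a", "a", "b", "b", "c"],
    "customer_zip_code": [111, 111, 222, 333, 444],
})

# Count the distinct zip codes recorded for each user_name.
zip_per_user = users_toy.groupby("user_name")["customer_zip_code"].nunique()

# user_names whose duplicate rows disagree on the zip code.
conflicting = zip_per_user[zip_per_user > 1].index.tolist()
print(conflicting)
```

If the list is empty on the real data, the duplicates are exact repeats and keeping the last entry loses nothing.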

### Product Dataset Exploration

In [13]:
print("Missing Data | Product : ")
show_missing_data(products)

Missing Data | Product : 
Shape : (32951, 9)
Missing Data : product_id                      0
product_category              623
product_name_lenght           610
product_description_lenght    610
product_photos_qty            610
product_weight_g                2
product_length_cm               2
product_height_cm               2
product_width_cm                2
dtype: int64


In [14]:
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32951 entries, 0 to 32950
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   product_id                  32951 non-null  object 
 1   product_category            32328 non-null  object 
 2   product_name_lenght         32341 non-null  float64
 3   product_description_lenght  32341 non-null  float64
 4   product_photos_qty          32341 non-null  float64
 5   product_weight_g            32949 non-null  float64
 6   product_length_cm           32949 non-null  float64
 7   product_height_cm           32949 non-null  float64
 8   product_width_cm            32949 non-null  float64
dtypes: float64(7), object(2)
memory usage: 2.3+ MB


In [15]:
print(f"Total rows : {products.shape[0]}")
print(f"Total unique rows (based on product_id) : {products.product_id.nunique()}")
print(f"Number of duplicates (based on product_id) : {products.duplicated(subset = ['product_id']).sum()}")

Total rows : 32951
Total unique rows (based on product_id) : 32951
Number of duplicates (based on product_id) : 0


**Few Important Notes :**

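* missing data on some rows: product_category (623 rows), name length / description length / photo quantity (610 rows each), weight and dimensions (2 rows each)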
---
**Data Types For Database:**

* product_id : VARCHAR (PK)
* product_category : VARCHAR
* product_name_length : INT
* product_description_length : INT
* product_photos_qty  : INT
* product_weight_g : FLOAT
* product_length_cm : FLOAT
* product_height_cm : FLOAT
* product_width_cm : FLOAT
---
**Preprocess Procedure:**

* keep missing data (allow NULL on database)
* rename columns (product_name_lenght -> product_name_length, product_description_lenght -> product_description_length)
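
A minimal sketch of the rename step, assuming a toy frame (`products_toy` is hypothetical). It also casts the count-like columns to pandas' nullable `Int64` dtype, which is one possible way to keep them as INT while still allowing NULL:

```python
import pandas as pd

# Toy stand-in for the products table; values are made up for illustration.
products_toy = pd.DataFrame({
    "product_id": ["p1", "p2"],
    "product_name_lenght": [40.0, None],
    "product_description_lenght": [287.0, None],
})

# Fix the misspelled column names.
renamed = products_toy.rename(columns={
    "product_name_lenght": "product_name_length",
    "product_description_lenght": "product_description_length",
})

# Nullable integer dtype: missing values stay missing (<NA>) instead of forcing float.
int_cols = ["product_name_length", "product_description_length"]
renamed = renamed.astype({col: "Int64" for col in int_cols})
print(renamed.dtypes)
```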

### Seller Dataset Exploration

In [16]:
print("Missing Data | Seller : ")
show_missing_data(seller)

Missing Data | Seller : 
Shape : (3095, 4)
Missing Data : seller_id          0
seller_zip_code    0
seller_city        0
seller_state       0
dtype: int64


In [17]:
seller.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3095 entries, 0 to 3094
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   seller_id        3095 non-null   object
 1   seller_zip_code  3095 non-null   int64 
 2   seller_city      3095 non-null   object
 3   seller_state     3095 non-null   object
dtypes: int64(1), object(3)
memory usage: 96.8+ KB


**Few Important Notes :**

---
**Data Types For Database:**

* seller_id : VARCHAR (PK)
* seller_zip_code : VARCHAR
* seller_city : VARCHAR
* seller_state : VARCHAR
---
**Preprocess Procedure:**

* none

### Order Dataset Exploration

In [18]:
print("Missing Data | Order : ")
show_missing_data(order)

Missing Data | Order : 
Shape : (99441, 8)
Missing Data : order_id                      0
user_name                     0
order_status                  0
order_date                    0
order_approved_date         160
pickup_date                1783
delivered_date             2965
estimated_time_delivery       0
dtype: int64


In [19]:
order.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   order_id                 99441 non-null  object
 1   user_name                99441 non-null  object
 2   order_status             99441 non-null  object
 3   order_date               99441 non-null  object
 4   order_approved_date      99281 non-null  object
 5   pickup_date              97658 non-null  object
 6   delivered_date           96476 non-null  object
 7   estimated_time_delivery  99441 non-null  object
dtypes: object(8)
memory usage: 6.1+ MB


In [20]:
print(f"Total rows : {order.shape[0]}")
print(f"Total unique rows (based on order_id) : {order.order_id.nunique()}")
print(f"Number of duplicates (based on order_id) : {order.duplicated(subset = ['order_id']).sum()}")

Total rows : 99441
Total unique rows (based on order_id) : 99441
Number of duplicates (based on order_id) : 0


**Few Important Notes :**
* Missing values in order_approved_date (160), pickup_date (1783) and delivered_date (2965); estimated_time_delivery has none
---
**Data Types For Database:**

* order_id : VARCHAR (PK)
* user_name : VARCHAR
* order_status : VARCHAR
* order_date : TIMESTAMP
* order_approved_date : TIMESTAMP
* pickup_date : TIMESTAMP
* delivered_date : TIMESTAMP
* estimated_time_delivery : TIMESTAMP
---
**Preprocess Procedure:**
* keep missing data (allow NULL on database)
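
All date columns are currently loaded as `object`, so they would need parsing before they can be stored as TIMESTAMP. A small sketch, assuming the `YYYY-MM-DD HH:MM:SS` layout (the layout visible in the feedback table; that the order dates use the same layout is an assumption) and a made-up `orders_toy` frame:

```python
import pandas as pd

# Toy stand-in for the order table; values are made up for illustration.
orders_toy = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "order_date": ["2018-01-01 10:00:00", "2018-01-02 11:30:00"],
    "delivered_date": ["2018-01-05 09:00:00", None],
})

# Parse the string columns; missing or unparseable values become NaT (NULL in the database).
date_cols = ["order_date", "delivered_date"]
for col in date_cols:
    orders_toy[col] = pd.to_datetime(orders_toy[col], errors="coerce")

# With real timestamps, durations can be computed directly.
delivery_time = orders_toy["delivered_date"] - orders_toy["order_date"]
print(delivery_time)
```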

### Order Item Dataset Exploration

In [21]:
print("Missing Data | Order Item: ")
show_missing_data(order_item)

Missing Data | Order Item: 
Shape : (112650, 7)
Missing Data : order_id             0
order_item_id        0
product_id           0
seller_id            0
pickup_limit_date    0
price                0
shipping_cost        0
dtype: int64


In [22]:
order_item.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112650 entries, 0 to 112649
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   order_id           112650 non-null  object 
 1   order_item_id      112650 non-null  int64  
 2   product_id         112650 non-null  object 
 3   seller_id          112650 non-null  object 
 4   pickup_limit_date  112650 non-null  object 
 5   price              112650 non-null  float64
 6   shipping_cost      112650 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 6.0+ MB


In [23]:
# Checking Primary Key
print("Testing PK : Order Id")
print(f"Total rows : {order_item.shape[0]}")
print(f"Total unique rows (based on order_id) : {order_item.order_id.nunique()}")
print(f"Number of duplicates (based on order_id) : {order_item.duplicated(subset = ['order_id']).sum()}")
print('-' * 20)

print("Testing PK : Order Item Id")
print(f"Total rows : {order_item.shape[0]}")
print(f"Total unique rows (based on order_item_id) : {order_item.order_item_id.nunique()}")
print(f"Number of duplicates (based on order_item_id) : {order_item.duplicated(subset = ['order_item_id']).sum()}")
print('-' * 20)

print("Testing PK : Composite(order_id, order_item_id)")
print(f"Total rows : {order_item.shape[0]}")
# nunique() on two columns reports each column separately, not unique combinations
print(f"Unique values per column : {order_item[['order_item_id', 'order_id']].nunique()}")
print(f"Number of duplicates (based on order_id, order_item_id) : {order_item.duplicated(subset = ['order_item_id', 'order_id']).sum()}")

Testing PK : Order Id
Total rows : 112650
Total unique rows (based on order_id) : 98666
Number of duplicates (based on order_id) : 13984
--------------------
Testing PK : Order Item Id
Total rows : 112650
Total unique rows (based on order_item_id) : 21
Number of duplicates (based on order_item_id) : 112629
--------------------
Testing PK : Composite(order_id, order_item_id)
Total rows : 112650
Unique values per column : order_item_id       21
order_id         98666
dtype: int64
Number of duplicates (based on order_id, order_item_id) : 0


**Few Important Notes :**
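* what is order_item_id? It has only 21 distinct values, so it looks like the sequence number of an item within its order (needs confirmation)
* the PK is the composite key of order_id & order_item_id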
---
**Data Types For Database:**

* order_id : VARCHAR (PK)
* order_item_id : INT (PK)
* product_id : VARCHAR
* seller_id : VARCHAR
* pickup_limit_date : TIMESTAMP
* price : FLOAT
* shipping_cost : FLOAT
---
**Preprocess Procedure:**

* none
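
The same primary-key test is repeated for several tables, so it could be wrapped in a small helper. The sketch below is one possible shape; `is_candidate_key` and `items_toy` are hypothetical names introduced here, not part of the notebook:

```python
import pandas as pd

def is_candidate_key(df, columns):
    """Return True if `columns` contain no nulls and no repeated combinations."""
    subset = df[list(columns)]
    has_nulls = subset.isnull().any().any()
    has_duplicates = subset.duplicated().any()
    return not (has_nulls or has_duplicates)

# Toy stand-in for the order item table; values are made up for illustration.
items_toy = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "order_item_id": [1, 2, 1],
})

print(is_candidate_key(items_toy, ["order_id"]))                   # order_id alone repeats
print(is_candidate_key(items_toy, ["order_item_id"]))              # order_item_id alone repeats
print(is_candidate_key(items_toy, ["order_id", "order_item_id"]))  # the pair is unique
```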

### Payment Dataset Exploration

In [24]:
print("Missing Data | Payment: ")
show_missing_data(payment)

Missing Data | Payment: 
Shape : (103886, 5)
Missing Data : order_id                0
payment_sequential      0
payment_type            0
payment_installments    0
payment_value           0
dtype: int64


In [25]:
payment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103886 entries, 0 to 103885
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   order_id              103886 non-null  object 
 1   payment_sequential    103886 non-null  int64  
 2   payment_type          103886 non-null  object 
 3   payment_installments  103886 non-null  int64  
 4   payment_value         103886 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 4.0+ MB


In [26]:
payment.head()

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
0,b81ef226f3fe1789b1e8b2acac839d17,1,credit_card,8,99330.0
1,a9810da82917af2d9aefd1278f1dcfa0,1,credit_card,1,24390.0
2,25e8ea4e93396b6fa0d3dd708e76c1bd,1,credit_card,1,65710.0
3,ba78997921bbcdc1373bb41e913ab953,1,credit_card,8,107780.0
4,42fdf880ba16b47b59251dd489d4441a,1,credit_card,2,128450.0


In [27]:
# Checking Primary Key
print("Testing PK : Order Id")
print(f"Total rows : {payment.shape[0]}")
print(f"Total unique rows (based on order_id) : {payment.order_id.nunique()}")
print(f"Number of duplicates (based on order_id) : {payment.duplicated(subset = ['order_id']).sum()}")
print('-' * 20)

print("Testing PK : Composite(Order Id, Payment Sequential)")
print(f"Total rows : {payment.shape[0]}")
print(f"Number of duplicates (based on order_id, payment_sequential) : {payment.duplicated(subset = ['order_id','payment_sequential']).sum()}")
print('-' * 20)

Testing PK : Order Id
Total rows : 103886
Total unique rows (based on order_id) : 99440
Number of duplicates (based on order_id) : 4446
--------------------
Testing PK : Composite(Order Id, Payment Sequential)
Total rows : 103886
Number of duplicates (based on order_id, payment_sequential) : 0
--------------------


In [33]:
merged = pd.merge(order_item, payment, left_on='order_id', right_on='order_id')

In [48]:
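# Compare, per order, the row counts of order_item_id and payment_sequential in the merged table.
# Both columns are non-null after the inner merge, so the two counts are always equal and the result is always empty.
check = merged[['order_id', 'order_item_id', 'payment_sequential']].copy().groupby(by="order_id").count()
check.loc[check.order_item_id != check.payment_sequential]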

Unnamed: 0_level_0,order_item_id,payment_sequential
order_id,Unnamed: 1_level_1,Unnamed: 2_level_1


In [50]:
payment.loc[payment.payment_sequential > 1].groupby(by="order_id").first()

Unnamed: 0_level_0,payment_sequential,payment_type,payment_installments,payment_value
order_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0016dfedd97fc2950e388d2971d718c7,2,voucher,1,17920.0
002f19a65a2ddd70a090297872e6d64e,2,voucher,1,33180.0
0071ee2429bc1efdc43aa3e073a5290e,2,voucher,1,92440.0
009ac365164f8e06f59d18a08045f6c4,5,voucher,1,8750.0
00ac05fe0fc047c54418098eb64e3aaa,2,debit_card,1,123470.0
...,...,...,...,...
ff7400d904161b62b6e830b3988f5cbd,2,voucher,1,100000.0
ff978de32e717acd3b5abe1fb069d2b6,4,voucher,1,7680.0
ffa1dd97810de91a03abd7bd76d2fed1,2,voucher,1,418730.0
ffa39020fe7c8a3e907320e1bec4b985,2,voucher,1,64010.0


**Few Important Notes :**
* there is no payment_id column (missing?)
* what is payment_sequential? It appears to number the payments within one order, since (order_id, payment_sequential) is unique (needs confirmation)
---
**Data Types For Database:**

* order_id : VARCHAR (PK)
* payment_sequential : INT (PK)
* payment_type : VARCHAR
* payment_installments : INT
* payment_value : FLOAT
---
**Preprocess Procedure:**
* unclear: the metadata differs from the specification, so clarification is needed

In [59]:
payment.loc[payment.order_id == "009ac365164f8e06f59d18a08045f6c4"].payment_value.sum()

32000.0

In [55]:
item = order_item.loc[order_item.order_id == "009ac365164f8e06f59d18a08045f6c4"]
# price + shipping cost of the first item in this order
item.price.values[0] + item.shipping_cost.values[0]

32000.0
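
The single-order check above suggests that the payments of an order add up to its price plus shipping cost. A sketch of how the same comparison might be run for every order at once, on made-up data (`items_toy`, `payments_toy` and the 0.01 tolerance are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for the order item and payment tables; values are made up.
items_toy = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "price": [10000.0, 5000.0, 20000.0],
    "shipping_cost": [1000.0, 1000.0, 2000.0],
})
payments_toy = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "payment_sequential": [1, 2, 1],
    "payment_value": [12000.0, 5000.0, 21000.0],
})

# Total owed per order: price + shipping cost summed over all items of the order.
item_total = (items_toy["price"] + items_toy["shipping_cost"]).groupby(items_toy["order_id"]).sum()

# Total paid per order: sum over all payment rows of the order.
paid_total = payments_toy.groupby("order_id")["payment_value"].sum()

comparison = pd.DataFrame({"item_total": item_total, "paid_total": paid_total})
comparison["difference"] = comparison["paid_total"] - comparison["item_total"]

# Orders where the two totals disagree by more than a small tolerance.
mismatched = comparison[comparison["difference"].abs() > 0.01]
print(mismatched)
```

Unlike the single-order check, this sums over all items of an order rather than only the first one.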

### Feedback Dataset Exploration

In [28]:
print("Missing Data | Feedback: ")
show_missing_data(feedback)

Missing Data | Feedback: 
Shape : (100000, 5)
Missing Data : feedback_id                0
order_id                   0
feedback_score             0
feedback_form_sent_date    0
feedback_answer_date       0
dtype: int64


In [29]:
feedback.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   feedback_id              100000 non-null  object
 1   order_id                 100000 non-null  object
 2   feedback_score           100000 non-null  int64 
 3   feedback_form_sent_date  100000 non-null  object
 4   feedback_answer_date     100000 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.8+ MB


In [30]:
feedback.head()

Unnamed: 0,feedback_id,order_id,feedback_score,feedback_form_sent_date,feedback_answer_date
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,2018-01-18 00:00:00,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,2018-03-10 00:00:00,2018-03-11 03:05:13
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,2018-02-17 00:00:00,2018-02-18 14:36:24
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,2017-04-21 00:00:00,2017-04-21 22:02:06
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,2018-03-01 00:00:00,2018-03-02 10:26:53


In [31]:
# Checking Primary Key
print("Testing PK : Feedback Id")
print(f"Total rows : {feedback.shape[0]}")
print(f"Total unique rows (based on feedback_id) : {feedback.feedback_id.nunique()}")
print(f"Number of duplicates (based on feedback_id) : {feedback.duplicated(subset = ['feedback_id']).sum()}")
print('-' * 20)

print("Testing PK : Order Id")
print(f"Total rows : {feedback.shape[0]}")
print(f"Total unique rows (based on order_id) : {feedback.order_id.nunique()}")
print(f"Number of duplicates (based on order_id) : {feedback.duplicated(subset = ['order_id']).sum()}")
print('-' * 20)

print("Testing PK : Composite(Order Id, Feedback Id)")
print(f"Total rows : {feedback.shape[0]}")
print(f"Number of duplicates (based on order_id, feedback_id) : {feedback.duplicated(subset = ['order_id','feedback_id']).sum()}")
print('-' * 20)


Testing PK : Feedback Id
Total rows : 100000
Total unique rows (based on feedback_id) : 99173
Number of duplicates (based on feedback_id) : 827
--------------------
Testing PK : Order Id
Total rows : 100000
Total unique rows (based on order_id) : 99441
Number of duplicates (based on order_id) : 559
--------------------
Testing PK : Composite(Order Id, Feedback Id)
Total rows : 100000
Number of duplicates (based on order_id, feedback_id) : 0
--------------------


**Few Important Notes :**
* does the delta time (how long the user takes to answer a feedback form) correlate with the feedback score?
* what about orders for which the user / buyer gave no feedback?
---
**Data Types For Database:**

* feedback_id : VARCHAR (PK)
* order_id : VARCHAR (PK)
* feedback_score : INT
* feedback_form_sent_date : TIMESTAMP
* feedback_answer_date : TIMESTAMP
---
**Preprocess Procedure:**

* none
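
The question about delta time and feedback score could be explored roughly as follows. This is only a sketch on made-up rows (`feedback_toy` is hypothetical and deliberately monotone); the rank correlation is computed as a Pearson correlation on ranks, which is equivalent to Spearman's coefficient and avoids an extra dependency:

```python
import pandas as pd

# Toy stand-in for the feedback table; values are made up for illustration.
feedback_toy = pd.DataFrame({
    "feedback_score": [5, 4, 2, 1],
    "feedback_form_sent_date": ["2018-01-01 00:00:00"] * 4,
    "feedback_answer_date": [
        "2018-01-01 12:00:00",
        "2018-01-02 00:00:00",
        "2018-01-04 00:00:00",
        "2018-01-06 00:00:00",
    ],
})

sent = pd.to_datetime(feedback_toy["feedback_form_sent_date"])
answered = pd.to_datetime(feedback_toy["feedback_answer_date"])

# Delta time in hours between sending the form and receiving the answer.
feedback_toy["response_hours"] = (answered - sent).dt.total_seconds() / 3600

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
corr = feedback_toy["response_hours"].rank().corr(feedback_toy["feedback_score"].rank())
print(corr)
```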

## Entity Relationship Exploration

Potential Relationship (based on dataset) : 
1. User 1 .. M Order on _user_name_
2. Order 1 .. M Order_item on _order_id_
3. Order 1 .. 1 Payment on _order_id_
4. Order 1 .. 1 Feedback on _order_id_
5. Product 1 .. M Order_item on _product_id_
6. Seller 1 .. M Order_item on _seller_id_
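
Each potential relationship in the list can be tested by checking whether the join key repeats on the child side. A possible helper is sketched below; `describe_cardinality` and the toy frames are hypothetical names introduced for illustration:

```python
import pandas as pd

def describe_cardinality(parent, child, key):
    """Classify parent -> child as '1 .. 1' or '1 .. M' based on repeats of `key` in child."""
    if parent[key].duplicated().any():
        raise ValueError(f"{key} is not unique in the parent table")
    return "1 .. M" if child[key].duplicated().any() else "1 .. 1"

# Toy stand-ins for the order and payment tables; values are made up.
orders_toy = pd.DataFrame({"order_id": ["o1", "o2"]})
payments_toy = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "payment_sequential": [1, 2, 1],
})

print(describe_cardinality(orders_toy, payments_toy, "order_id"))

# pandas can also enforce the expectation during the merge itself:
# validate="one_to_many" raises MergeError if order_id repeats on the left side.
merged_toy = pd.merge(orders_toy, payments_toy, on="order_id", validate="one_to_many")
```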

In [None]:
user = pd.read_csv("data/processed/user_no_duplicate.csv")

In [None]:
# Check user - order
print("User - Order")
set(user.columns).intersection(order.columns)

In [None]:
# Try to join user and order on 'user_name'
print ("Uniqueness of username in User table : ")
user.user_name.value_counts()

In [None]:
print("Uniqueness of username in Order table :")
order.user_name.value_counts()

In [None]:
merged = pd.merge(user, order, left_on='user_name', right_on='user_name')
merged.info()

**Conclusion**:
User 1 .. M Order on _user_name_

In [None]:
# Check Order - OrderItem
print("Order - OrderItem")
set(order.columns).intersection(order_item.columns)

In [None]:
print ("Uniqueness of order_id in Order table : ")
order.order_id.value_counts()

In [None]:
print ("Uniqueness of order_id in OrderItem table : ")
order_item.order_id.value_counts()

In [None]:
merged = pd.merge(order, order_item, left_on='order_id', right_on='order_id')
merged.info()

**Conclusion**:
Order 1 .. M Order_item on _order_id_

In [None]:
# Check Order - Payment
print("Order - Payment")
set(order.columns).intersection(payment.columns)

In [None]:
payment.order_id.value_counts()

In [None]:
merged = pd.merge(order, payment, left_on='order_id', right_on='order_id')
merged.info()

**Conclusion**:
Order 1 .. M Payment on _order_id_

In [None]:
# Check Order - Feedback
print("Order - Feedback")
set(order.columns).intersection(feedback.columns)

In [None]:
feedback.order_id.value_counts()

In [None]:
merged = pd.merge(order, feedback, left_on='order_id', right_on='order_id')
merged.info()

**Conclusion**:
Order 1 .. M Feedback on _order_id_

In [None]:
# Check OrderItem - Product
print("OrderItem - Product")
set(order_item.columns).intersection(products.columns)

In [None]:
print(f"Order Item rows : {order_item.shape[0]}")
print(f"Products rows : {products.shape[0]}")

In [None]:
merged = pd.merge(order_item, products, left_on='product_id', right_on='product_id')
merged.info()

**Conclusion**:
Product 1 .. M Order_item on _product_id_

In [None]:
# Check OrderItem - Seller
print("OrderItem - Seller")
set(order_item.columns).intersection(seller.columns)

In [None]:
print(f"Order Item rows : {order_item.shape[0]}")
print(f"Seller rows : {seller.shape[0]}")

In [None]:
merged = pd.merge(order_item, seller, left_on='seller_id', right_on='seller_id')
merged.info()

**Conclusion**:
Seller 1 .. M Order_item on _seller_id_
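
The inner merges used for these checks silently drop rows without a match. To see whether any child row points at a missing parent (which would break a foreign key in the database), a left merge with `indicator=True` is one option. A sketch on made-up data (`items_toy` and `sellers_toy` are hypothetical):

```python
import pandas as pd

# Toy stand-ins for the order item and seller tables; values are made up.
items_toy = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "seller_id": ["s1", "s2", "s9"],
})
sellers_toy = pd.DataFrame({
    "seller_id": ["s1", "s2"],
    "seller_city": ["X", "Y"],
})

# Keep every order item; the _merge column records whether a seller was found.
checked = pd.merge(items_toy, sellers_toy, on="seller_id", how="left", indicator=True)

# Rows marked left_only reference a seller_id that does not exist in the seller table.
orphans = checked[checked["_merge"] == "left_only"]
print(orphans[["order_id", "seller_id"]])
```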

[TODO (Database Integration)] : 
1. Check each potential relationship
2. Create an ER diagram
3. Port the data into a local PostgreSQL database
---
[TODO (Datamart)]:
1. Integrate Colab with the local database (see https://stackoverflow.com/questions/61030755/connect-to-postresql-database-from-google-colab)
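
As a rough rehearsal of the porting step, the composite primary key found for the order item table can be declared in SQL and tested before moving to PostgreSQL. The sketch below uses an in-memory SQLite database purely as a stand-in (an assumption; the actual target is PostgreSQL, whose types and connection setup differ), with a made-up `items_toy` frame and a reduced column list:

```python
import sqlite3
import pandas as pd

# In-memory SQLite database as a lightweight stand-in for the local database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_item (
        order_id      TEXT    NOT NULL,
        order_item_id INTEGER NOT NULL,
        price         REAL,
        PRIMARY KEY (order_id, order_item_id)
    )
""")

# Toy rows; values are made up for illustration.
items_toy = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "order_item_id": [1, 2, 1],
    "price": [10000.0, 5000.0, 20000.0],
})

# Append into the pre-created table so the declared key is kept.
items_toy.to_sql("order_item", conn, if_exists="append", index=False)
row_count = conn.execute("SELECT COUNT(*) FROM order_item").fetchone()[0]

# A row that repeats an existing (order_id, order_item_id) pair should be rejected.
duplicate_rejected = False
try:
    with conn:
        conn.execute("INSERT INTO order_item VALUES ('o1', 1, 999.0)")
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(row_count, duplicate_rejected)
```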