<a href="https://colab.research.google.com/github/William9923/future-data-ecommerce/blob/master/notebooks/13_02_2021DatasetRelationshipExploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Exploration

Goals:
* Find relationships between the datasets and understand the data context
* Explore the data to inform the database design


## Importing Libraries

In [None]:
#ignore warnings
import warnings
warnings.filterwarnings('ignore')

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pandas as pd

In [None]:
def profilling(df, filename):
  # sweetviz is only needed for profiling, so import it here
  import sweetviz as sv
  advert_report = sv.analyze(df)     # analyze the dataset
  advert_report.show_html(filename)  # write the HTML report and open it

In [None]:
def show_missing_data(df):
    # Print the frame's shape and the number of missing values per column
    print(f"Shape : {df.shape}")
    print(f"Missing Data :\n{df.isnull().sum()}")

## Importing Data


In [None]:
print("Loading Dataset ...")

data_folder = "data/raw/"

user = pd.read_csv(data_folder + "user_dataset.csv")
order = pd.read_csv(data_folder + "order_dataset.csv")
order_item = pd.read_csv(data_folder + "order_item_dataset.csv")
payment = pd.read_csv(data_folder + "payment_dataset.csv")
products = pd.read_csv(data_folder + "products_dataset.csv")
seller = pd.read_csv(data_folder + "seller_dataset.csv")
feedback = pd.read_csv(data_folder + "feedback_dataset.csv")

print("Finish...")

In [None]:
dataset = [user, order, order_item, payment, products, seller, feedback]
filenames = ["User", "Order", "OrderItem", "Payment", "Products", "Seller", "Feedback"]

report_folder = 'reports/docs/'  # assumes this folder already exists

for data, filename in zip(dataset, filenames) :
  profilling(data, report_folder + filename + ".html")

## Single Entity Exploration

### User Dataset Exploration

In [None]:
print("Missing Data | User : ")
show_missing_data(user)

In [None]:
user.info()

In [None]:
user.head()

In [None]:
print(f"Total rows : {user.shape[0]}")
print(f"Total unique rows (based on username) : {user.user_name.nunique()}")
print(f"Number of duplicates (based on user_name) : {user.duplicated(subset = ['user_name']).sum()}")

In [None]:
# Get all duplicated rows (every occurrence of a repeated user_name)
user[user.duplicated(subset = ['user_name'], keep = False)].head()

In [None]:
# Check single sample duplicate username
user.loc[user.user_name == "2b6ce149982204423f4efac29701255a"]

In [None]:
user_no_duplicate = user.drop_duplicates(subset=['user_name'], keep="last")
user_no_duplicate.to_csv("data/processed/user_no_duplicate.csv", index=False)

In [None]:
print(f"Number of duplicates after processing : {user_no_duplicate.duplicated(subset = ['user_name']).sum()}")

**Few Important Notes :**

* user_name is encoded
* the same user_name appears on multiple rows (?)
---
**Data Types For Database:**

* user_name : VARCHAR (PK)
* customer_zip_code : VARCHAR
* customer_city : VARCHAR
* customer_state : VARCHAR
---
**Preprocess Procedure:**

* drop_duplicates on user_name (keep the last entry for each user_name)
* ask for the context behind multiple zip codes per user_name
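
Before asking about the zip codes, it may help to list which user_name values actually map to more than one of them. The following is a minimal sketch, assuming the column names from the data types above; the helper name `keys_with_multiple_values` and the toy frame are made up for illustration and are not part of the dataset.

```python
import pandas as pd

def keys_with_multiple_values(df, key="user_name", column="customer_zip_code"):
    """Return the keys that map to more than one distinct value of `column`."""
    distinct_counts = df.groupby(key)[column].nunique()
    return distinct_counts[distinct_counts > 1].sort_values(ascending=False)

# Toy data for illustration only (not taken from the real dataset)
toy_user = pd.DataFrame({
    "user_name": ["a", "a", "b", "c", "c"],
    "customer_zip_code": ["111", "222", "333", "444", "444"],
})

multi_zip = keys_with_multiple_values(toy_user)
print(multi_zip)
```

Running the same helper on the raw `user` frame (before `drop_duplicates`) would show how many users are affected, which is useful context for the question above.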

### Product Dataset Exploration

In [None]:
print("Missing Data | Product : ")
show_missing_data(products)

In [None]:
products.info()

In [None]:
print(f"Total rows : {products.shape[0]}")
print(f"Total unique rows (based on product_id) : {products.product_id.nunique()}")
print(f"Number of duplicates (based on product_id) : {products.duplicated(subset = ['product_id']).sum()}")

**Few Important Notes :**

* missing data on some rows
---
**Data Types For Database:**

* product_id : VARCHAR (PK)
* product_category : INT
* product_name_length : INT
* product_description_length : INT
* product_photos_qty  : INT
* product_weight_g : FLOAT
* product_height_cm : FLOAT
* product_width_cm : FLOAT
---
**Preprocess Procedure:**

* keep missing data (allow NULL on database)
* rename columns (product_name_lenght -> product_name_length, product_description_lenght -> product_description_length)
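
The two preprocessing steps above could be sketched as follows. This assumes the misspelled column names quoted in the note; the helper name `clean_product_columns` and the toy frame are hypothetical. Casting to pandas' nullable `Int64` dtype is one possible way to keep the INT columns as integers while still allowing missing values, not something the original notebook does.

```python
import pandas as pd

# Misspelled names as quoted in the note above -> corrected names
COLUMN_RENAMES = {
    "product_name_lenght": "product_name_length",
    "product_description_lenght": "product_description_length",
}
# Columns planned as INT in the database
INT_COLUMNS = ["product_name_length", "product_description_length"]

def clean_product_columns(df):
    """Rename the misspelled columns and keep missing values as <NA>."""
    out = df.rename(columns=COLUMN_RENAMES)
    for col in INT_COLUMNS:
        if col in out.columns:
            # Nullable integer dtype: whole-number floats become ints, NaN becomes <NA>
            out[col] = out[col].astype("Int64")
    return out

# Toy data for illustration only
toy_products = pd.DataFrame({
    "product_id": ["p1", "p2"],
    "product_name_lenght": [40.0, float("nan")],
    "product_description_lenght": [287.0, 120.0],
})

cleaned = clean_product_columns(toy_products)
print(cleaned.dtypes)
```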

### Seller Dataset Exploration

In [None]:
print("Missing Data | Seller : ")
show_missing_data(seller)

In [None]:
seller.info()

**Few Important Notes :**

---
**Data Types For Database:**

* seller_id : VARCHAR (PK)
* seller_zip_code : VARCHAR
* seller_city : VARCHAR
* seller_state : VARCHAR
---
**Preprocess Procedure:**
-

### Order Dataset Exploration

In [None]:
print("Missing Data | Order : ")
show_missing_data(order)

In [None]:
order.info()

In [None]:
print(f"Total rows : {order.shape[0]}")
print(f"Total unique rows (based on order_id) : {order.order_id.nunique()}")
print(f"Number of duplicates (based on order_id) : {order.duplicated(subset = ['order_id']).sum()}")

**Few Important Notes :**
* Some missing values in order_approved_date, pickup_date, delivered_date, estimated_time_delivery
---
**Data Types For Database:**

* order_id : VARCHAR (PK)
* user_name : VARCHAR
* order_status : VARCHAR
* order_date : VARCHAR
* order_approved_date : TIMESTAMP
* pickup_date : TIMESTAMP
* delivered_date : TIMESTAMP
* estimated_time_delivery : TIMESTAMP
---
**Preprocess Procedure:**
* keep missing data (allow NULL on database)
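
To load the TIMESTAMP columns with their missing values intact, the strings can be parsed with `pd.to_datetime`, which turns missing or unparseable entries into `NaT`. This is a minimal sketch: the helper name `parse_order_dates` is hypothetical, the toy rows are invented, and the date format in the real files is an assumption.

```python
import pandas as pd

# Columns listed as TIMESTAMP in the data types above
TIMESTAMP_COLUMNS = [
    "order_approved_date",
    "pickup_date",
    "delivered_date",
    "estimated_time_delivery",
]

def parse_order_dates(df, columns=TIMESTAMP_COLUMNS):
    """Return a copy of df with the given columns parsed as datetimes."""
    out = df.copy()
    for col in columns:
        if col in out.columns:
            # errors="coerce" maps unparseable values to NaT instead of raising
            out[col] = pd.to_datetime(out[col], errors="coerce")
    return out

# Toy data for illustration only
toy_order = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "order_approved_date": ["2021-01-05 10:00:00", "2021-01-06 12:30:00"],
    "delivered_date": ["2021-01-09 08:15:00", None],
})

parsed = parse_order_dates(toy_order)
print(parsed.dtypes)
```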

### Order Item Dataset Exploration

In [None]:
print("Missing Data | Order Item: ")
show_missing_data(order_item)

In [None]:
order_item.info()

In [None]:
# Checking Primary Key
print("Testing PK : Order Id")
print(f"Total rows : {order_item.shape[0]}")
print(f"Total unique rows (based on order_id) : {order_item.order_id.nunique()}")
print(f"Number of duplicates (based on order_id) : {order_item.duplicated(subset = ['order_id']).sum()}")
print('-' * 20)

print("Testing PK : Order Item Id")
print(f"Total rows : {order_item.shape[0]}")
print(f"Total unique rows (based on order_item_id) : {order_item.order_item_id.nunique()}")
print(f"Number of duplicates (based on order_item_id) : {order_item.duplicated(subset = ['order_item_id']).sum()}")
print('-' * 20)

print("Testing PK : Composite(order_id, order_item_id)")
print(f"Total rows : {order_item.shape[0]}")
# Count distinct (order_id, order_item_id) pairs; nunique() would count each column separately
print(f"Total unique rows (based on order_id, order_item_id) : {order_item[['order_id', 'order_item_id']].drop_duplicates().shape[0]}")
print(f"Number of duplicates (based on order_id, order_item_id) : {order_item.duplicated(subset = ['order_id', 'order_item_id']).sum()}")

**Few Important Notes :**
* what is order_item_id? 
* PK is composite key of order_id & order_item_id
---
**Data Types For Database:**

* order_id : VARCHAR (PK)
* order_item_id : INT (PK)
* product_id : VARCHAR
* seller_id : VARCHAR
* pickup_limit_date : TIMESTAMP
* price : FLOAT
* shipping_cost : FLOAT
---
**Preprocess Procedure:**
-
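
The primary-key tests in this notebook repeat the same three checks for each candidate. A small helper could wrap them; the sketch below uses a hypothetical name `check_candidate_key` and an invented toy frame, and is only one way to organize the check.

```python
import pandas as pd

def check_candidate_key(df, columns):
    """Summarize whether `columns` uniquely identify every row of `df`."""
    columns = list(columns)
    distinct = len(df.drop_duplicates(subset=columns))
    return {
        "columns": columns,
        "rows": len(df),
        "distinct": distinct,
        "duplicates": int(df.duplicated(subset=columns).sum()),
        "is_key": distinct == len(df),
    }

# Toy data for illustration only: order o1 has two items, o2 has one
toy_order_item = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "order_item_id": [1, 2, 1],
})

for cols in (["order_id"], ["order_item_id"], ["order_id", "order_item_id"]):
    print(check_candidate_key(toy_order_item, cols))
```

In the toy frame neither column is unique on its own, while the pair is, which mirrors the composite-key conclusion noted above.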

### Payment Dataset Exploration

In [None]:
print("Missing Data | Payment: ")
show_missing_data(payment)

In [None]:
payment.info()

In [None]:
payment.head()

In [None]:
# Checking Primary Key
print("Testing PK : Order Id")
print(f"Total rows : {payment.shape[0]}")
print(f"Total unique rows (based on order_id) : {payment.order_id.nunique()}")
print(f"Number of duplicates (based on order_id) : {payment.duplicated(subset = ['order_id']).sum()}")
print('-' * 20)

print("Testing PK : Composite(Order Id, Payment Sequential)")
print(f"Total rows : {payment.shape[0]}")
print(f"Number of duplicates (based on order_id, payment_sequential) : {payment.duplicated(subset = ['order_id','payment_sequential']).sum()}")
print('-' * 20)

**Few Important Notes :**
* missing column payment_id?
* what is payment_sequential?
---
**Data Types For Database:**

* order_id : VARCHAR (PK)
* payment_sequential : INT (PK)
* payment_type : VARCHAR
* payment_installments : VARCHAR
* payment_value : FLOAT
---
**Preprocess Procedure:**
- the metadata differs from the specs; needs clarification

### Feedback Dataset Exploration

In [None]:
print("Missing Data | Feedback: ")
show_missing_data(feedback)

In [None]:
feedback.info()

In [None]:
feedback.head()

In [None]:
# Checking Primary Key
print("Testing PK : Feedback Id")
print(f"Total rows : {feedback.shape[0]}")
print(f"Total unique rows (based on feedback_id) : {feedback.feedback_id.nunique()}")
print(f"Number of duplicates (based on feedback_id) : {feedback.duplicated(subset = ['feedback_id']).sum()}")
print('-' * 20)

print("Testing PK : Order Id")
print(f"Total rows : {feedback.shape[0]}")
print(f"Total unique rows (based on order_id) : {feedback.order_id.nunique()}")
print(f"Number of duplicates (based on order_id) : {feedback.duplicated(subset = ['order_id']).sum()}")
print('-' * 20)

print("Testing PK : Composite(Order Id, Feedback Id)")
print(f"Total rows : {feedback.shape[0]}")
print(f"Number of duplicates (based on order_id, feedback_id) : {feedback.duplicated(subset = ['order_id','feedback_id']).sum()}")
print('-' * 20)


**Few Important Notes :**
* does the delta time (how long it takes to answer a feedback form) correlate with the feedback score?
* what about orders for which the user / buyer gave no feedback?
---
**Data Types For Database:**

* feedback_id : VARCHAR (PK)
* order_id : VARCHAR (PK)
* feedback_score : INT
* feedback_form_sent_date : TIMESTAMP
* feedback_answer_date : TIMESTAMP
---
**Preprocess Procedure:**
-
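
The first question in the notes above could be explored with a rank correlation between the answer delay and the score. This is a rough sketch under stated assumptions: the column names come from the data types above, the helper name `delay_score_rank_correlation` and the toy rows are invented, and rank correlation is just one reasonable choice for an ordinal score.

```python
import pandas as pd

def delay_score_rank_correlation(df):
    """Rank correlation between answer delay (in hours) and feedback_score."""
    sent = pd.to_datetime(df["feedback_form_sent_date"], errors="coerce")
    answered = pd.to_datetime(df["feedback_answer_date"], errors="coerce")
    pairs = pd.DataFrame({
        "delay_hours": (answered - sent).dt.total_seconds() / 3600,
        "score": df["feedback_score"],
    }).dropna()
    # Pearson correlation computed on ranks is Spearman's rank correlation
    return pairs["delay_hours"].rank().corr(pairs["score"].rank())

# Toy data for illustration only: slower answers paired with lower scores
toy_feedback = pd.DataFrame({
    "feedback_score": [5, 4, 3, 1],
    "feedback_form_sent_date": ["2021-01-01 00:00:00"] * 4,
    "feedback_answer_date": [
        "2021-01-01 01:00:00",
        "2021-01-01 02:00:00",
        "2021-01-01 03:00:00",
        "2021-01-01 04:00:00",
    ],
})

rho = delay_score_rank_correlation(toy_feedback)
print(f"Rank correlation : {rho:.2f}")
```

A value near zero on the real data would suggest no monotonic relationship; any conclusion would still need a look at the distribution of delays.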

## Entity Relationship Exploration

Potential Relationship (based on dataset) : 
1. User 1 .. M Order on _user_name_
2. Order 1 .. M Order_item on _order_id_
3. Order 1 .. 1 Payment on _order_id_
4. Order 1 .. 1 Feedback on _order_id_
5. Product 1 .. M Order_item on _product_id_
6. Seller 1 .. M Order_item on _seller_id_
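
Each potential relationship can be checked by looking at how often a key repeats on each side. The sketch below is one way to do that; the helper name `relationship_cardinality` and the toy frames are hypothetical.

```python
import pandas as pd

def relationship_cardinality(parent, child, key):
    """Classify the relationship on `key` as '1 .. 1', '1 .. M', 'M .. 1' or 'M .. M'."""
    parent_max = parent[key].value_counts().max()  # most repeats of any key in parent
    child_max = child[key].value_counts().max()    # most repeats of any key in child
    left = "1" if parent_max == 1 else "M"
    right = "1" if child_max == 1 else "M"
    return f"{left} .. {right}"

# Toy data for illustration only
toy_order = pd.DataFrame({"order_id": ["o1", "o2"]})
toy_payment = pd.DataFrame({"order_id": ["o1", "o1", "o2"], "payment_sequential": [1, 2, 1]})
toy_feedback = pd.DataFrame({"order_id": ["o1", "o2"]})

print(relationship_cardinality(toy_order, toy_payment, "order_id"))
print(relationship_cardinality(toy_order, toy_feedback, "order_id"))
```

`pd.merge` also accepts a `validate` argument (for example `validate="one_to_many"`) that raises an error when the expected cardinality is violated, which can serve as a guard once the relationships are settled.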

In [None]:
user = pd.read_csv("data/processed/user_no_duplicate.csv")

In [None]:
# Check user - order
print("User - Order")
set(user.columns).intersection(order.columns)

In [None]:
# Check how often each user_name appears before joining on 'user_name'
print("Uniqueness of user_name in User table : ")
user.user_name.value_counts()

In [None]:
print("Uniqueness of user_name in Order table :")
order.user_name.value_counts()

In [None]:
merged = pd.merge(user, order, left_on='user_name', right_on='user_name')
merged.info()

**Conclusion**:
User 1 .. M Order on _user_name_

In [None]:
# Check Order - OrderItem
print("Order - OrderItem")
set(order.columns).intersection(order_item.columns)

In [None]:
print ("Uniqueness of order_id in Order table : ")
order.order_id.value_counts()

In [None]:
print ("Uniqueness of order_id in OrderItem table : ")
order_item.order_id.value_counts()

In [None]:
merged = pd.merge(order, order_item, left_on='order_id', right_on='order_id')
merged.info()

**Conclusion**:
Order 1 .. M Order_item on _order_id_

In [None]:
# Check Order - Payment
print("Order - Payment")
set(order.columns).intersection(payment.columns)

In [None]:
payment.order_id.value_counts()

In [None]:
merged = pd.merge(order, payment, left_on='order_id', right_on='order_id')
merged.info()

**Conclusion**:
Order 1 .. M Payment on _order_id_

In [None]:
# Check Order - Feedback
print("Order - Feedback")
set(order.columns).intersection(feedback.columns)

In [None]:
feedback.order_id.value_counts()

In [None]:
merged = pd.merge(order, feedback, left_on='order_id', right_on='order_id')
merged.info()

**Conclusion**:
Order 1 .. M Feedback on _order_id_

In [None]:
# Check OrderItem - Product
print("OrderItem - Product")
set(order_item.columns).intersection(products.columns)

In [None]:
print(f"Order Item rows : {order_item.shape[0]}")
print(f"Products rows : {products.shape[0]}")

In [None]:
merged = pd.merge(order_item, products, left_on='product_id', right_on='product_id')
merged.info()

**Conclusion**:
Product 1 .. M Order_item on _product_id_

In [None]:
# Check OrderItem - Seller
print("OrderItem - Seller")
set(order_item.columns).intersection(seller.columns)

In [None]:
print(f"Order Item rows : {order_item.shape[0]}")
print(f"Seller rows : {seller.shape[0]}")

In [None]:
merged = pd.merge(order_item, seller, left_on='seller_id', right_on='seller_id')
merged.info()

**Conclusion**:
Seller 1 .. M Order_item on _seller_id_
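
The merges above use the default inner join, which silently drops rows whose key has no match on the other side, so `merged.info()` alone does not reveal orphaned keys. An anti-join with `indicator=True` can list them. This is a minimal sketch; the helper name `orphan_keys` and the toy frames are hypothetical.

```python
import pandas as pd

def orphan_keys(child, parent, key):
    """Return the key values in `child` that have no matching row in `parent`."""
    joined = child[[key]].drop_duplicates().merge(
        parent[[key]].drop_duplicates(),
        on=key,
        how="left",
        indicator=True,  # adds a "_merge" column: "both" or "left_only"
    )
    return joined.loc[joined["_merge"] == "left_only", key].tolist()

# Toy data for illustration only: seller s9 is referenced but not defined
toy_order_item = pd.DataFrame({"seller_id": ["s1", "s2", "s9", "s1"]})
toy_seller = pd.DataFrame({"seller_id": ["s1", "s2"]})

print(orphan_keys(toy_order_item, toy_seller, "seller_id"))
```

Swapping the arguments answers the opposite question, for example which orders have no feedback row.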

[TODO (Database Integration)] : 
1. Check each potential relationship
2. Create ER Diagram
3. Port the data into a local PostgreSQL database
---
[TODO (Datamart)]:
1. Integrate Colab with the local database: https://stackoverflow.com/questions/61030755/connect-to-postresql-database-from-google-colab
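
As a starting point for the database integration TODO, the data types recorded in this notebook can be turned into `CREATE TABLE` statements. The sketch below does this for the order item table only; the `SCHEMA` dictionary layout and the helper name `create_table_sql` are assumptions made for illustration, and the generated SQL has not been run against a real PostgreSQL instance here.

```python
# Column types copied from the "Data Types For Database" notes for order item
SCHEMA = {
    "order_item": {
        "columns": {
            "order_id": "VARCHAR",
            "order_item_id": "INT",
            "product_id": "VARCHAR",
            "seller_id": "VARCHAR",
            "pickup_limit_date": "TIMESTAMP",
            "price": "FLOAT",
            "shipping_cost": "FLOAT",
        },
        "primary_key": ["order_id", "order_item_id"],
    },
}

def create_table_sql(table, spec):
    """Build a CREATE TABLE statement from a column/type mapping."""
    lines = [f"    {name} {sql_type}" for name, sql_type in spec["columns"].items()]
    lines.append(f"    PRIMARY KEY ({', '.join(spec['primary_key'])})")
    body = ",\n".join(lines)
    return f"CREATE TABLE {table} (\n{body}\n);"

ddl = create_table_sql("order_item", SCHEMA["order_item"])
print(ddl)
```

Note that `order` and `user` are reserved words in PostgreSQL, so tables with those names would need to be quoted or renamed (for example to `orders` and `users`).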