<a href="https://colab.research.google.com/github/William9923/future-data-ecommerce/blob/master/notebooks/27_05_2021_CustomerSatisfactionAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Customer Satisfaction Analysis

## Purpose
This notebook looks for insight on the following questions:
- Is there a correlation between the delivery process and the rating (feedback) given by the user?
- Is there a correlation between purchase response time and the rating (feedback) given by the user?
- Do late deliveries affect the review / feedback score?
- What can we do to improve user satisfaction?

## Background
Ratings and reviews are among the most important parts of e-commerce. An average customer often judges whether a product is good or bad based on its rating alone. However, the product itself is not the only factor that determines whether it is rated as good or bad. Other factors, such as the delay between the estimated and the actual delivery time, might affect the rating too. This [reference](https://targetbay.com/blog/customer-reviews/) explains why ratings matter to other users who might buy a product. Because this data doesn't include review text, I want to analyze other factors that might affect customer satisfaction. The biggest factor outside of product quality might be the next thing we need to improve in order to get more users to use the application.

## Assumption
My assumption is that the product (especially its price / value) or how the order is fulfilled / delivered affects the customer review score.
The hypotheses we are going to check:
- working days estimated delivery time
- working days actual delivery time
- working days delivery time delta
- is late (the product arrived late)

Also, my hypothesis is that we might be able to increase user satisfaction by estimating the delivery time more precisely, so that customers are not angry when an order is late, but also do not have to wait too long.

## Reference 
- [Delivery Date Estimation](https://towardsdatascience.com/delivery-date-estimation-5aff1a0ff8dc)

## Sanity Check

In [1]:
import json

def load_config(file_path: str = "./config.json"):
    with open(file_path) as config_file:
        data = json.load(config_file)
    return data

config = load_config("../config.json")
DBNAME = config.get("DBNAME")
HOSTNAME = config.get("HOSTNAME")
USER = config.get("USER")
PASS = config.get("PASS")
SCHEMA = config.get("SCHEMA")

In [1]:
# Basic 
import sys
import numpy as np
import scipy as sp
import pandas as pd

# SQL Engine
import psycopg2
from sqlalchemy import create_engine

# Profiling process
from tqdm import tqdm

# Warning problems in notebook
import warnings
warnings.filterwarnings('ignore')

# Visualization
import plotly.express as px
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns

# Reporting result
import sweetviz as sv
from dataprep.eda import create_report

ModuleNotFoundError: No module named 'sqlalchemy'

In [3]:
# Load data

# Create an engine instance
alchemyEngine = create_engine(
    f'postgresql+psycopg2://{USER}:{PASS}@{HOSTNAME}/{DBNAME}', pool_recycle=3600)

# Connect to PostgreSQL server
conn = alchemyEngine.connect()

schema = SCHEMA
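
One thing worth guarding against in the connection string above: if `USER` or `PASS` contains characters such as `@` or `/`, the URL can be parsed incorrectly. A minimal sketch of percent-encoding the credentials with the standard library follows. The helper name `build_pg_url` and the sample credentials are hypothetical and only for illustration.

```python
from urllib.parse import quote_plus


def build_pg_url(user: str, password: str, host: str, dbname: str) -> str:
    """Build a postgresql+psycopg2 URL with percent-encoded credentials."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{dbname}"
    )


# Hypothetical credentials, not the ones stored in config.json
url = build_pg_url("analyst", "p@ss/word", "localhost:5432", "ecommerce")
print(url)
```

The resulting string can be passed to `create_engine` in place of the f-string used above.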

In [4]:
# Init needed data

QUERY = """
with order_grouped as (
	select 
		foi.order_id ,
		SUM(foi.price) as total_price,
		AVG(foi.price) as average_price,
		COUNT(foi.order_item_id) as total_item,
		SUM(foi.shipping_cost) as shipping_cost 
	from staging.fct_order_items foi 
	group by 1
)
select 
	distinct 
	foi.order_id ,
	
	-- order
	ddo.date as order_date,
	ddo.day_of_year as order_day_of_year,
	dto.hour as order_time,
	
	-- approved
	dda.date as order_approved_date,
	dda.day_of_year as order_approved_day_of_year,
	dta.hour as order_approved_time,
	
	-- pickup
	ddp.date as pickup_date,
	ddp.day_of_year as pickup_day_of_year,
	dtp.hour as pickup_time,
	
	-- delivered
	ddd.date as delivered_date,
	ddd.day_of_year as delivered_day_of_year,
	dtd.hour as delivered_time,
	
	-- estimated delivery
	dde.date as estimated_date_delivery,
	dde.day_of_year as estimated_day_of_year_delivery,
	dte.hour as estimated_time_delivery,
	
	-- pickup limit
	ddl.date as pickup_limit_date,
	ddl.day_of_year as pickup_limit_day_of_year,
	dtl.hour as pickup_limit_time,
	
	-- user 
	du.user_name ,
	du.customer_state ,
	
	-- seller
	ds.seller_id,
	ds.seller_state,
	
	-- order detail
	og.total_price,
	og.average_price,
	og.total_item,
	og.shipping_cost,
	
	-- feedback
	df.feedback_score 
	
from staging.fct_order_items foi 
left join order_grouped og on og.order_id = foi.order_id 
left join staging.dim_feedback df on foi.feedback_key = df.feedback_key 
left join staging.dim_user du on foi.user_key = du.user_key 
left join staging.dim_seller ds on foi.seller_key = ds.seller_key 
left join staging.dim_date ddp on foi.pickup_date = ddp.date_id 
left join staging.dim_time dtp on foi.pickup_time = dtp.time_id
left join staging.dim_date dde on foi.estimated_date_delivery = dde.date_id 
left join staging.dim_time dte on foi.estimated_time_delivery = dte.time_id
left join staging.dim_date ddd on foi.delivered_date = ddd.date_id 
left join staging.dim_time dtd on foi.delivered_time = dtd.time_id
left join staging.dim_date dda on foi.order_approved_date = dda.date_id 
left join staging.dim_time dta on foi.order_approved_time = dta.time_id
left join staging.dim_date ddo on foi.order_date = ddo.date_id 
left join staging.dim_time dto on foi.order_time = dto.time_id
left join staging.dim_date ddl on foi.pickup_limit_date = ddl.date_id 
left join staging.dim_time dtl on foi.pickup_limit_time = dtl.time_id

where foi.order_item_status in ('delivered')
"""

# Init dataframe
df = pd.read_sql_query(QUERY, conn)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97819 entries, 0 to 97818
Data columns (total 28 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   order_id                        97819 non-null  object 
 1   order_date                      97819 non-null  object 
 2   order_day_of_year               97819 non-null  float64
 3   order_time                      97819 non-null  int64  
 4   order_approved_date             97805 non-null  object 
 5   order_approved_day_of_year      97805 non-null  float64
 6   order_approved_time             97805 non-null  float64
 7   pickup_date                     97817 non-null  object 
 8   pickup_day_of_year              97817 non-null  float64
 9   pickup_time                     97817 non-null  float64
 10  delivered_date                  97811 non-null  object 
 11  delivered_day_of_year           97811 non-null  float64
 12  delivered_time                  

In [5]:
# Checking missing value
df[['order_date', 'order_approved_date', 'pickup_date', 'delivered_date', 'feedback_score']].loc[pd.isna(df['order_approved_date']) | pd.isna(df['pickup_date']) | pd.isna(df['delivered_date'])]

Unnamed: 0,order_date,order_approved_date,pickup_date,delivered_date,feedback_score
5094,2018-07-01,2018-07-01,2018-07-03,,5.0
7168,2017-02-17,,2017-02-22,2017-03-02,5.0
12554,2018-06-27,2018-06-27,2018-07-03,,5.0
16145,2017-09-29,2017-09-29,,2017-11-20,5.0
16527,2017-02-18,,2017-02-22,2017-03-03,4.0
17069,2017-11-28,2017-11-28,2017-11-30,,5.0
17242,2017-05-25,2017-05-25,,,5.0
17713,2018-07-01,2018-07-01,2018-07-03,,5.0
17786,2017-02-17,,2017-02-22,2017-03-03,5.0
22907,2017-02-17,,2017-02-22,2017-03-03,3.0


In [6]:
# Target: categorical data, imbalanced

df.feedback_score.value_counts()

5.0    57216
4.0    19073
1.0    10198
3.0     8184
2.0     3148
Name: feedback_score, dtype: int64
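
To make the imbalance easier to read, the counts can be turned into shares. The sketch below rebuilds the counts from the output above rather than querying the database; in the notebook itself, `df.feedback_score.value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series(
    {5.0: 57216, 4.0: 19073, 1.0: 10198, 3.0: 8184, 2.0: 3148},
    name="feedback_score",
)

# Share of each score among all delivered orders
shares = (counts / counts.sum()).round(3)
print(shares)
```

Roughly 58% of the orders received a score of 5, while scores of 1 and 2 together make up about 14%.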

In [7]:
# Initialize calendar object

from workalendar.asia import Singapore
cal = Singapore()

In [8]:
# Base Transformer
class Transformer(object):
  def __init__(self, df, transformation_map):
    self.df = df.copy()
    self.transformation_map = transformation_map 
  
  def parse(self):
    pass 

  def clean(self):
    pass

  def transform(self):
    if self.transformation_map is not None : 
      df = self.df.copy()
      for key, value in tqdm(self.transformation_map.items()):
        df[key] = df.apply(
            value,
            axis = 1
        )
      self.df = df
    else :
      raise Exception("Null Transformation Map")

  def get_data(self):
    return self.df

  def process(self):
    self.parse()
    self.clean()
    self.transform()
    return self.get_data()
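
`transform` applies each function with `df.apply(..., axis=1)`, which calls the Python function once per row. For simple date arithmetic, a column-wise expression gives the same values and is usually much faster on a frame of this size. The sketch below compares both on a toy frame; the dates are hypothetical and only the column names follow the notebook.

```python
import pandas as pd

# Toy frame with hypothetical dates; column names follow the notebook
toy = pd.DataFrame({
    "order_date": pd.to_datetime(["2018-01-01", "2018-01-10"]),
    "delivered_date": pd.to_datetime(["2018-01-08", "2018-01-25"]),
    "estimated_date_delivery": pd.to_datetime(["2018-01-12", "2018-01-20"]),
})

# Row-wise version, in the style used by Transformer.transform
row_wise = toy.apply(lambda row: (row.delivered_date - row.order_date).days, axis=1)

# Column-wise version: one subtraction over the whole columns
vectorized = (toy["delivered_date"] - toy["order_date"]).dt.days

# Boolean flags can be computed column-wise in the same way
toy["is_late_delivery"] = toy["delivered_date"] > toy["estimated_date_delivery"]

print(row_wise.tolist(), vectorized.tolist())  # [7, 15] [7, 15]
```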

In [9]:
# Datetime Transformer

class DateTransformer(Transformer):
  def __init__(self, df, cal):
    transformation = {
        # NOTE: despite the wd_ (working days) prefix, every interval below is a
        # plain calendar-day difference; self.cal is stored but not used here.

        # Days from order to approval
        'wd_approved_interval' : lambda row: (row.order_approved_date - row.order_date).days,

        # Days from order to pickup
        'wd_pickup_interval' : lambda row: (row.pickup_date - row.order_date).days,

        # Days from pickup to delivery (shipping time)
        'wd_shipping_interval' : lambda row: (row.delivered_date - row.pickup_date).days,

        # Days from order to estimated delivery
        'wd_estimated_delivery_interval' : lambda row: (row.estimated_date_delivery - row.order_date).days,

        # Days from order to actual delivery to the customer
        'wd_actual_delivery_interval' : lambda row: (row.delivered_date - row.order_date).days,

        # Actual minus estimated delivery date (positive = late, negative = early)
        'wd_delta_delivery_interval' : lambda row: (row.delivered_date -  row.estimated_date_delivery).days,

        # Actual shipping duration minus the planned one (pickup limit -> estimated delivery)
        'wd_delta_shipping' : lambda row : (row.delivered_date - row.pickup_date).days - (row.estimated_date_delivery - row.pickup_limit_date).days,
        'abs_wd_delta_delivery_interval' : lambda row: abs(row['wd_delta_delivery_interval']),
        'is_late_delivery' : lambda row: row.delivered_date > row.estimated_date_delivery,
        'is_late_shipping' : lambda row: (row.delivered_date - row.pickup_date).days > (row.estimated_date_delivery - row.pickup_limit_date).days,
        'is_late_pickup' : lambda row: row.pickup_date > row.pickup_limit_date,
    }
    super().__init__(df, transformation)
    self.cal = cal

  def parse(self):
    df = self.df.copy()
    df['order_date'] = pd.to_datetime(df['order_date'], format='%Y-%m-%d')
    df['order_approved_date'] = pd.to_datetime(df['order_approved_date'], format='%Y-%m-%d')
    df['pickup_date'] = pd.to_datetime(df['pickup_date'], format='%Y-%m-%d')
    df['delivered_date'] = pd.to_datetime(df['delivered_date'], format='%Y-%m-%d')
    df['estimated_date_delivery'] = pd.to_datetime(df['estimated_date_delivery'], format='%Y-%m-%d')
    df['pickup_limit_date'] = pd.to_datetime(df['pickup_limit_date'], format='%Y-%m-%d')
    self.df = df 
  
  def clean(self):
    # Work on a copy of the parsed data
    df = self.df.copy()

    # Remove all records that have missing date information
    df = df.dropna(subset=['order_approved_date', 'pickup_date', 'delivered_date'])

    # Remove all records with an impossible event order, e.g. delivered before
    # approved (likely misuse of the e-commerce system)
    df = df.loc[~(df.delivered_date < df.order_approved_date)]
    df = df.loc[~(df.delivered_date < df.pickup_date)]
    df = df.loc[~(df.order_approved_date > df.pickup_date)]
    df = df.loc[~(df.estimated_date_delivery < df.order_approved_date)]

    # Copy the cleaned data back into the transformer
    self.df = df 
    self.df = df 
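
The intervals in `DateTransformer` are calendar-day differences, even though the column names start with `wd_`. If actual working days are wanted, one option is NumPy's `busday_count`, sketched below. This is an assumption about how it could be done, not what the notebook runs: the helper `working_days_between` is hypothetical, the weekmask is the NumPy default (Monday to Friday), and the holiday list would have to be supplied separately.

```python
import numpy as np
import pandas as pd


def working_days_between(start: pd.Series, end: pd.Series, holidays=()) -> np.ndarray:
    """Count Monday-Friday days in [start, end), skipping the given holidays."""
    return np.busday_count(
        start.values.astype("datetime64[D]"),
        end.values.astype("datetime64[D]"),
        holidays=list(holidays),
    )


# Hypothetical dates: Friday -> Monday, and Monday -> next Monday
start = pd.Series(pd.to_datetime(["2018-01-05", "2018-01-08"]))
end = pd.Series(pd.to_datetime(["2018-01-08", "2018-01-15"]))

result = working_days_between(start, end)
print(result)  # [1 5], while the calendar-day differences are 3 and 7
```

Rows with missing dates should be dropped first, as `clean()` already does, before calling a helper like this.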

In [10]:
# Geolocation Transformer

class LocationTransformer(Transformer):
  def __init__(self, df):
    # TODO : We need to segment the location if we want to create machine learning out of it
    # Might be using the freight value?
    transformation = {
      'routes' : lambda row : 
        f"{row['seller_state'] if row['seller_state'] < row['customer_state'] else row['customer_state']} - {row['customer_state'] if row['seller_state'] < row['customer_state'] else row['seller_state']}",
    }
    super().__init__(df, transformation)
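
The `routes` lambda puts the alphabetically smaller state first, so that seller -> customer and customer -> seller map to the same key. A column-wise sketch of the same idea is shown below on hypothetical rows; only the column names are taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names follow the notebook
toy = pd.DataFrame({
    "seller_state": ["JAWA BARAT", "BANTEN", "BALI"],
    "customer_state": ["BANTEN", "JAWA BARAT", "BALI"],
})

# True where the seller state sorts before the customer state
seller_first = toy["seller_state"] < toy["customer_state"]

# Put the alphabetically smaller state first so both directions share one key
first = np.where(seller_first, toy["seller_state"], toy["customer_state"])
second = np.where(seller_first, toy["customer_state"], toy["seller_state"])
toy["routes"] = pd.Series(first, index=toy.index) + " - " + pd.Series(second, index=toy.index)

print(toy["routes"].tolist())
```

Note that this key ignores direction on purpose, so it cannot distinguish shipping from A to B from shipping from B to A.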

In [11]:
# Processing Date 
date_transformer = DateTransformer(df, cal)
df = date_transformer.process()

df.info()

100%|██████████████████████████████████████████████████████████████████████████████████| 11/11 [01:05<00:00,  5.95s/it]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 97067 entries, 0 to 97818
Data columns (total 39 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   order_id                        97067 non-null  object        
 1   order_date                      97067 non-null  datetime64[ns]
 2   order_day_of_year               97067 non-null  float64       
 3   order_time                      97067 non-null  int64         
 4   order_approved_date             97067 non-null  datetime64[ns]
 5   order_approved_day_of_year      97067 non-null  float64       
 6   order_approved_time             97067 non-null  float64       
 7   pickup_date                     97067 non-null  datetime64[ns]
 8   pickup_day_of_year              97067 non-null  float64       
 9   pickup_time                     97067 non-null  float64       
 10  delivered_date                  97067 non-null  datetime64[ns]
 11  de




In [12]:
# Processing Location
location_transformer = LocationTransformer(df)
df = location_transformer.process()

df.info()

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.72s/it]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 97067 entries, 0 to 97818
Data columns (total 40 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   order_id                        97067 non-null  object        
 1   order_date                      97067 non-null  datetime64[ns]
 2   order_day_of_year               97067 non-null  float64       
 3   order_time                      97067 non-null  int64         
 4   order_approved_date             97067 non-null  datetime64[ns]
 5   order_approved_day_of_year      97067 non-null  float64       
 6   order_approved_time             97067 non-null  float64       
 7   pickup_date                     97067 non-null  datetime64[ns]
 8   pickup_day_of_year              97067 non-null  float64       
 9   pickup_time                     97067 non-null  float64       
 10  delivered_date                  97067 non-null  datetime64[ns]
 11  de




In [13]:
fig = px.box(df.loc[df['routes'] == 'BANTEN - JAWA BARAT'], x="shipping_cost", template="ggplot2")
fig.show(renderer="colab")

In [14]:
grouped_pricing = df.groupby(['routes']).agg(value=('shipping_cost', 'mean'))
grouped_pricing = grouped_pricing.reset_index()

In [15]:
fig = px.bar(grouped_pricing.sort_values(by=['value']), y="routes", x="value", color="routes", template="ggplot2")
fig.show(renderer="colab")

In [None]:
fig = px.box(df, x="shipping_cost", template="ggplot2")
fig.show(renderer="colab")

### Working Days Approval Time

---
Users might be unsatisfied if the time needed to approve the order is too long

In [None]:
fig = px.box(df, x="feedback_score", y="wd_approved_interval", color="feedback_score", template="ggplot2", title="Working Day Approval Time")
fig.update_xaxes(title_text='Feedback Score')
fig.update_yaxes(title_text='Approval Time (days)')
fig.update_xaxes( type='category')
# To sort the categories, uncomment the line below
# fig.update_xaxes(categoryorder='total descending')
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=1.02,
    xanchor="right",
    x=1
))
fig.show(renderer="colab")

---

It looks like the interval between order and approval does not determine much of the feedback score. Based on the boxplot, we can **exclude** the possibility that the approval interval affects the feedback score

---

### Working Days Pickup Interval Time

---
What about pickup time? Users might be unsatisfied if it takes too long for the seller to hand the order over to the transport process

In [None]:
fig = px.box(df, x="feedback_score", y="wd_pickup_interval", color="feedback_score", template="ggplot2")
fig.show(renderer="colab")

---

From the boxplot alone, ratings 1 - 2 appear to be affected by this factor, but we still need more proof that this feature really determines the feedback score

---
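
One way to get more proof than a boxplot is a rank correlation between an interval and the feedback score, which suits an ordinal 1 - 5 target. The sketch below computes Spearman's rho as the Pearson correlation of average ranks. The helper name `spearman` and the toy values are hypothetical and not taken from the dataset.

```python
import pandas as pd


def spearman(x: pd.Series, y: pd.Series) -> float:
    """Spearman's rho, computed as the Pearson correlation of average ranks."""
    return x.rank().corr(y.rank())


# Hypothetical values, for illustration only
toy = pd.DataFrame({
    "wd_pickup_interval": [1, 2, 3, 5, 8, 13],
    "feedback_score":     [5, 5, 4, 3, 2, 1],
})

rho = spearman(toy["wd_pickup_interval"], toy["feedback_score"])
print(round(rho, 3))
```

A value near -1 would mean that longer intervals go with lower scores, and a value near 0 would support dropping the feature. With almost 100,000 rows, even a weak correlation can be statistically significant, so the size of rho matters more than a p-value.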

### Working Days Shipping Time

---
What about shipping time? Users might be mad if the transport process takes too long to ship their item

In [None]:
fig = px.box(df, x="feedback_score", y="wd_shipping_interval", color="feedback_score", template="ggplot2")
fig.show(renderer="colab")

---

Based on the graph above, the less time needed to ship, the higher the feedback score tends to be. It might be important to look into the connection between shipping time and the user feedback

---

### Working Days Estimated Delivery Time

---
Users might be unsatisfied if the application estimates too long a time for the item to arrive

In [None]:
fig = px.box(df, x="feedback_score", y="wd_estimated_delivery_interval", color="feedback_score", template="ggplot2")
fig.show(renderer="colab")

---

It seems that the estimated time is not the problem.

---

### Working Days Actual Delivery Time

---
Users might be unsatisfied if the item needs too much time to arrive

In [None]:
fig = px.box(df, x="feedback_score", y="wd_actual_delivery_interval", color="feedback_score", template="ggplot2")
fig.show(renderer="colab")

---

It seems that the more time the item needs to arrive, the worse the feedback score. But the previous finding shows that the estimated delivery time does not affect the feedback score. This suggests that users get mad when the difference between the actual and the estimated delivery time is large (delayed product arrival)

---

### Working Days Delivery Time Delta

---
Following the previous finding, let us check the difference between the estimated and the actual delivery date for each feedback score

In [None]:
fig = px.histogram(df, x="abs_wd_delta_delivery_interval", template="ggplot2")
fig.show(renderer="colab")

---

From the histogram, we can see that most of the time the system's estimate of when the order arrives is off by about 5 - 20 days. That is actually pretty bad!

---
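
The histogram uses the absolute delta, so it cannot show whether the estimate is usually too pessimistic (early arrival) or too optimistic (late arrival). A small summary of the signed delta separates the two cases. The values below are hypothetical; in the notebook the same expressions would be applied to `df['wd_delta_delivery_interval']`.

```python
import pandas as pd

# Hypothetical signed deltas in days (delivered date minus estimated date)
delta = pd.Series([-14, -12, -9, -7, -3, 0, 2, 6], name="wd_delta_delivery_interval")

summary = {
    "share_early": (delta < 0).mean(),     # arrived before the estimate
    "share_on_time": (delta == 0).mean(),  # arrived exactly on the estimate
    "share_late": (delta > 0).mean(),      # arrived after the estimate
    "median_delta": delta.median(),
}
print(summary)
```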

In [None]:
fig = px.box(df.loc[df['wd_delta_delivery_interval'] > 0], x="feedback_score", y="wd_delta_delivery_interval", color="feedback_score", template="ggplot2")
fig.show(renderer="colab")

In [None]:
fig = px.box(df.loc[df['wd_delta_delivery_interval'] < 0], x="feedback_score", y="wd_delta_delivery_interval", color="feedback_score", template="ggplot2")
fig.show(renderer="colab")

---

From the graphs above, we can conclude two things:
- If the arrival is delayed, then the longer the delay, the worse the feedback score tends to be
- But if the order arrives earlier than estimated, it does not affect the feedback score as much as a late arrival does

---
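
The asymmetry between early and late arrivals can also be checked numerically by bucketing the signed delta and comparing the mean score per bucket. The sketch below uses `pd.cut` on hypothetical orders. The bucket edges of one week are an arbitrary choice for illustration, not something derived from the data.

```python
import pandas as pd

# Hypothetical orders, for illustration only
toy = pd.DataFrame({
    "wd_delta_delivery_interval": [-15, -8, -2, 0, 1, 4, 9, 20],
    "feedback_score":             [5, 5, 4, 5, 3, 2, 1, 1],
})

# Right-closed buckets: (-inf, -7], (-7, 0], (0, 7], (7, inf)
bins = [-float("inf"), -7, 0, 7, float("inf")]
labels = [
    "early by a week or more",
    "early by less than a week or on time",
    "late by up to a week",
    "late by more than a week",
]
toy["delay_bucket"] = pd.cut(toy["wd_delta_delivery_interval"], bins=bins, labels=labels)

mean_score = toy.groupby("delay_bucket", observed=True)["feedback_score"].mean()
print(mean_score)
```

If the two early buckets show similar means while the late buckets drop sharply, that supports the two conclusions above.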

In [None]:
grouped_late_pickup = df.groupby(['feedback_score', 'is_late_pickup']).agg(count=('order_id', 'count'))
grouped_late_pickup = grouped_late_pickup.reset_index()

grouped_late_delivery = df.groupby(['feedback_score', 'is_late_delivery']).agg(count=('order_id', 'count'))
grouped_late_delivery = grouped_late_delivery.reset_index()

grouped_late_shipping = df.groupby(['feedback_score', 'is_late_shipping']).agg(count=('order_id', 'count'))
grouped_late_shipping = grouped_late_shipping.reset_index()
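
Because the feedback scores are imbalanced, raw counts per group are dominated by the 5-star bar. Comparing the share of late orders within each score removes that effect. A minimal sketch on hypothetical rows, using the notebook's column names:

```python
import pandas as pd

# Hypothetical orders: four with score 1, six with score 5
toy = pd.DataFrame({
    "feedback_score":   [1, 1, 1, 1, 5, 5, 5, 5, 5, 5],
    "is_late_delivery": [True, True, True, False, False, False, False, False, False, True],
})

# The mean of a boolean column is the share of True values in each group
late_rate = toy.groupby("feedback_score")["is_late_delivery"].mean()
print(late_rate)
```

The same expression works for `is_late_pickup` and `is_late_shipping`.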

In [None]:
fig = px.bar(grouped_late_pickup, x="feedback_score", y="count", color="is_late_pickup",barmode="group", template="ggplot2")
fig.show(renderer="colab")

In [None]:
fig = px.bar(grouped_late_delivery, x="feedback_score", y="count", color="is_late_delivery", barmode="group", template="ggplot2")
fig.show(renderer="colab")

In [None]:
fig = px.bar(grouped_late_shipping, x="feedback_score", y="count", color="is_late_shipping", barmode="group", template="ggplot2")
fig.show(renderer="colab")

In [None]:
fig = px.box(df, x="feedback_score", y="average_price", color="feedback_score", template="ggplot2")
fig.show(renderer="colab")

In [None]:
fig = px.box(df, x="feedback_score", y="total_price", color="feedback_score", template="ggplot2")
fig.show(renderer="colab")

## Recommendation
---

- We have to increase the accuracy of the estimated order arrival date -> CDT (Customer Delivery Time) prediction -> supervised learning
- Because the fare differs from order to order, it could be inferred that orders are probably shipped by many different transport vendors. It might be better if the e-commerce platform only offered users transport vendors of better quality, because we can see that pickup time is not the problem, but shipping time is (it affects the difference between estimated and actual delivery time)

-> Next : Check the source - destination pair and turn it into a variable?
          Quantify each group by order fee (for quality segmentation)
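
Before building a supervised model for the first recommendation, it helps to know the error of the current estimate and of a trivial baseline that any model should beat. The sketch below compares the mean absolute error (MAE) of the system estimate with a per-route median. The numbers are hypothetical, and the baseline is evaluated on the same rows it was fitted on, so a real evaluation would need a held-out split.

```python
import pandas as pd

# Hypothetical delivered orders; column names follow the notebook
toy = pd.DataFrame({
    "routes": ["A - B", "A - B", "A - B", "A - C", "A - C", "A - C"],
    "wd_estimated_delivery_interval": [20, 22, 21, 30, 28, 29],
    "wd_actual_delivery_interval":    [8, 10, 9, 15, 17, 16],
})

actual = toy["wd_actual_delivery_interval"]

# Error of the current system estimate
mae_system = (toy["wd_estimated_delivery_interval"] - actual).abs().mean()

# Naive baseline: predict the median actual delivery time of the same route
baseline = toy.groupby("routes")["wd_actual_delivery_interval"].transform("median")
mae_baseline = (baseline - actual).abs().mean()

print(mae_system, round(mae_baseline, 3))
```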