# Orders

#### Investigating the orders and their associated review scores

We will create a single data table containing **all our orders, with some engineered statistics as additional columns.**

The resulting DataFrame, which will be very handy for the modeling phase, has the following columns:

  - `order_id` (_str) the id of the order_
  - `wait_time` (_float) the number of days between order_date and delivered_date_
  - `expected_wait_time` (_float) the number of days between order_date and estimated_delivery_date_
  - `delay_vs_expected` (_float) if the actual delivery date is later than the estimated delivery date, the absolute number of days between the two dates; otherwise 0_
  - `order_status` (_str) the status of the order_
  - `dim_is_five_star` (_int) 1 if the order received a five-star review, 0 otherwise_
  - `dim_is_one_star` (_int) 1 if the order received a one-star review, 0 otherwise_
  - `review_score` (_int) from 1 to 5_
  - `number_of_product` (_int) number of products that the order contains_
  - `number_of_sellers` (_int) number of sellers involved in the order_
  - `price` (_float) total price of the order paid by customer_
  - `freight_value` (_float) value of the freight paid by customer_
  - `distance_customer_seller` (_float) the distance in km between customer and seller_


In [1]:
# Auto reload imported modules every time a jupyter cell is executed
%load_ext autoreload
%autoreload 2

In [2]:
# Import usual modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
# Import olist data
from olistdash.data import Olist
olist=Olist()
data=olist.get_data()
matching_table = olist.get_matching_table()

## Code `order.py`

In [4]:
orders = data['orders'].copy() # to be sure not to modify the `data` variable
orders.head(1)

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00


#### Filter delivered orders

In [5]:
orders = orders.query("order_status=='delivered'").copy()

## 1. def get_wait_time():
Return a dataframe with `[order_id, wait_time, expected_wait_time, delay_vs_expected, order_status]`

#### ⁉ ... Checking the type of the objects in the "time columns"

In [6]:
print(type(orders['order_delivered_customer_date'][1]))
print(type(orders['order_estimated_delivery_date'][1]))
print(type(orders['order_purchase_timestamp'][1]))

<class 'str'>
<class 'str'>
<class 'str'>


#### ❕❗ As they are `str`, they need to be converted to pandas datetime

#### Handling datetime

In [7]:
# plain column assignment (rather than .loc[:, col]) replaces the column, so it takes the datetime64 dtype
# converting column ['order_delivered_customer_date'] from "str" to pandas datetime
orders['order_delivered_customer_date'] = pd.to_datetime(orders['order_delivered_customer_date'])
# converting column ['order_estimated_delivery_date'] from "str" to pandas datetime
orders['order_estimated_delivery_date'] = pd.to_datetime(orders['order_estimated_delivery_date'])
# converting column ['order_purchase_timestamp'] from "str" to pandas datetime
orders['order_purchase_timestamp'] = pd.to_datetime(orders['order_purchase_timestamp'])

#### ⁉ ... Checking the type of the objects in the "time columns" again

In [8]:
print(type(orders['order_delivered_customer_date'][1]))
print(type(orders['order_estimated_delivery_date'][1]))
print(type(orders['order_purchase_timestamp'][1]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
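
The three conversions can also be written as a loop, and the result checked through the column dtypes rather than a single element. This is a minimal sketch on a small stand-in frame: the name `toy` is hypothetical, the first row reuses the dates of the first order shown earlier, and the second row is made up for illustration.

```python
import pandas as pd

# Small stand-in for the `orders` table (second row is made up)
toy = pd.DataFrame({
    "order_purchase_timestamp": ["2017-10-02 10:56:33", "2018-01-01 00:00:00"],
    "order_delivered_customer_date": ["2017-10-10 21:25:13", "2018-01-11 12:00:00"],
    "order_estimated_delivery_date": ["2017-10-18 00:00:00", "2018-01-10 00:00:00"],
})

time_columns = [
    "order_purchase_timestamp",
    "order_delivered_customer_date",
    "order_estimated_delivery_date",
]

# Plain column assignment replaces each column, so its dtype changes with it
for col in time_columns:
    toy[col] = pd.to_datetime(toy[col])

# Every time column should now report a datetime64 dtype
print(toy.dtypes)
```

Checking `dtypes` covers the whole column, whereas `type(column[1])` only inspects one value.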


#### 👌🆗 Now it is possible to proceed

### 1.1 Computing ["delay_vs_expected"]

In [9]:
# Creating new column ['delay_vs_expected'] as (estimated - delivered) in days:
# positive = delivered early, negative = delivered late
orders.loc[:, 'delay_vs_expected'] = (orders['order_estimated_delivery_date'] - orders['order_delivered_customer_date']) / np.timedelta64(24, 'h')

In [10]:
# TEST
orders['delay_vs_expected'][1]

5.355729166666666

##### Analyze the new column:

In [11]:
orders['delay_vs_expected'].describe()

count    96470.000000
mean        11.178126
std         10.184354
min       -188.975081
25%          6.389815
50%         11.948102
75%         16.244065
max        146.016123
Name: delay_vs_expected, dtype: float64

#### Handling negative values in the ['delay_vs_expected'] column

In [12]:
# Custom function: late deliveries (negative values) become positive delays,
# early or on-time deliveries become 0; then apply it to the column
def handle_delay(x):
    if x < 0:
        return abs(x)
    else:
        return 0
    
orders.loc[:,'delay_vs_expected'] = orders['delay_vs_expected'].apply(handle_delay)

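Testing the result of applying the `handle_delay` function. Note that the count rises from 96470 to 96478: for a missing value, `x < 0` is `False`, so `handle_delay` returns 0 instead of keeping the `NaN`.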

In [13]:
orders['delay_vs_expected'].describe()

count    96478.000000
mean         0.774811
std          4.752895
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max        188.975081
Name: delay_vs_expected, dtype: float64
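
The same rule (late deliveries become a positive delay, everything else becomes 0) can also be expressed without `apply`. The sketch below is an alternative, not the notebook's method: it flips the sign and uses `Series.clip`, on made-up values. Unlike `handle_delay`, `clip` leaves missing values as `NaN`, so an explicit `fillna(0)` is needed to reproduce the behaviour of the function above.

```python
import numpy as np
import pandas as pd

# Made-up values of (estimated - delivered) in days; negative means late
diff = pd.Series([5.36, -2.5, np.nan])

# Flip the sign so lateness is positive, then floor early deliveries at 0
delay = (-diff).clip(lower=0)

# handle_delay maps NaN to 0 because NaN < 0 is False; fillna reproduces that
delay_like_apply = delay.fillna(0)

print(delay.tolist())
print(delay_like_apply.tolist())
```

Whether missing delivery dates should count as "no delay" or stay missing is a modeling decision; keeping the `NaN` makes that choice visible.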

### 1.2 Computing `['wait_time']` column
wait_time = time between purchasing an item and receiving it (in days)

In [14]:
orders.head(1)

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,delay_vs_expected
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18,0.0


In [15]:
# compute wait_time (Create new column)
orders.loc[:, 'wait_time'] = (orders['order_delivered_customer_date'] - orders['order_purchase_timestamp']) / np.timedelta64(24, 'h')

In [16]:
# TEST
orders['wait_time'][1]

13.782037037037037
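
Dividing a timedelta column by `np.timedelta64(24, 'h')` turns it into a fractional number of days. As a small illustration, the sketch below shows two equivalent spellings using the purchase and delivery timestamps of the first order shown above; the variable names are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# Timestamps of the first order in the table above
purchased = pd.Series(pd.to_datetime(["2017-10-02 10:56:33"]))
delivered = pd.Series(pd.to_datetime(["2017-10-10 21:25:13"]))

elapsed = delivered - purchased  # timedelta64 Series

# Three equivalent ways to express the difference in fractional days
days_np = elapsed / np.timedelta64(24, "h")
days_pd = elapsed / pd.Timedelta(days=1)
days_sec = elapsed.dt.total_seconds() / 86400

print(days_np.iloc[0], days_pd.iloc[0], days_sec.iloc[0])
```

All three give about 8.44 days for this order; `pd.Timedelta(days=1)` is arguably the most readable.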

### 1.3 Computing `['expected_wait_time']` column
expected_wait_time = time between purchasing an item and the estimated delivery date (in days)

In [17]:
# compute expected wait time
orders.loc[:, 'expected_wait_time'] = (orders['order_estimated_delivery_date'] - orders['order_purchase_timestamp']) / np.timedelta64(24, 'h')

In [18]:
# TEST
orders['expected_wait_time'][1]

19.137766203703702

#### DataFrame with the new columns

In [19]:
orders[['order_id', 'wait_time', 'expected_wait_time', 'delay_vs_expected', 'order_status']].head(2)

Unnamed: 0,order_id,wait_time,expected_wait_time,delay_vs_expected,order_status
0,e481f51cbdc54678b7cc49136f2d6af7,8.436574,15.544063,0.0,delivered
1,53cdb2fc8bc7dce0b6741e2150273451,13.782037,19.137766,0.0,delivered
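
The steps of section 1 can be gathered into one function. The sketch below is a hypothetical standalone version (`get_wait_time_sketch` is a name invented here, not the actual `Order.get_wait_time` method of `olistdash`, which is not shown in this notebook). It assumes the raw `orders` columns are strings as above, and it uses `clip` for the delay, so orders with a missing delivery date keep `NaN` rather than 0. The sample rows after the first one are made up.

```python
import pandas as pd

def get_wait_time_sketch(orders: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: wait-time features for delivered orders."""
    df = orders[orders["order_status"] == "delivered"].copy()

    purchase = pd.to_datetime(df["order_purchase_timestamp"])
    delivered = pd.to_datetime(df["order_delivered_customer_date"])
    estimated = pd.to_datetime(df["order_estimated_delivery_date"])

    one_day = pd.Timedelta(days=1)
    df["wait_time"] = (delivered - purchase) / one_day
    df["expected_wait_time"] = (estimated - purchase) / one_day
    # Positive only when delivery happened after the estimate
    df["delay_vs_expected"] = ((delivered - estimated) / one_day).clip(lower=0)

    return df[["order_id", "wait_time", "expected_wait_time",
               "delay_vs_expected", "order_status"]]

# First row taken from the table above; the other two rows are made up
sample = pd.DataFrame({
    "order_id": ["e481f51cbdc54678b7cc49136f2d6af7", "late_order", "canceled_order"],
    "order_status": ["delivered", "delivered", "canceled"],
    "order_purchase_timestamp": ["2017-10-02 10:56:33", "2018-01-01 00:00:00", "2018-02-01 00:00:00"],
    "order_delivered_customer_date": ["2017-10-10 21:25:13", "2018-01-11 12:00:00", None],
    "order_estimated_delivery_date": ["2017-10-18 00:00:00", "2018-01-10 00:00:00", "2018-02-10 00:00:00"],
})

result = get_wait_time_sketch(sample)
print(result)
```

The made-up late order arrives 1.5 days after its estimate, so it is the only row with a non-zero `delay_vs_expected`.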


## 2. def get_review_score():
Returns a DataFrame with:
    `'order_id'`, `'dim_is_five_star'`, `'dim_is_one_star'`, `'review_score'`

In [21]:
reviews = data['order_reviews'].copy()
reviews.head(2)

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13


In [24]:
# Implementing functions for the new columns 'dim_is_five_star' and 'dim_is_one_star'
def dim_five_star(d):
    if d == 5:
        return 1
    return 0

def dim_one_star(d):
    if d == 1:
        return 1
    return 0

In [25]:
reviews["dim_is_five_star"] = reviews["review_score"].map(dim_five_star)
reviews["dim_is_one_star"] = reviews["review_score"].map(dim_one_star)

In [28]:
# TEST
reviews[["order_id", "dim_is_five_star", "dim_is_one_star", "review_score"]].head()

Unnamed: 0,order_id,dim_is_five_star,dim_is_one_star,review_score
0,73fc7af87114b39712e6da79b0a377eb,0,0,4
1,a548910a1c6147796b98fdf73dbeba33,1,0,5
2,f9e4b658b201a9f2ecdecbb34bed034b,1,0,5
3,658677c97b385a9be170737859d3511b,1,0,5
4,8e6bfb81e283fa7e4f11123a3fb894f1,1,0,5
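
The two indicator columns can also be built without helper functions, by comparing the whole column at once and casting the boolean result to integers. A minimal sketch on made-up scores (the `scores` frame is hypothetical):

```python
import pandas as pd

# Made-up review scores for illustration
scores = pd.DataFrame({"review_score": [4, 5, 5, 1]})

# A boolean comparison on the column, cast to int, gives the 0/1 flags
scores["dim_is_five_star"] = (scores["review_score"] == 5).astype(int)
scores["dim_is_one_star"] = (scores["review_score"] == 1).astype(int)

print(scores)
```

This produces the same 0/1 values as mapping `dim_five_star` and `dim_one_star` over the column, and is usually faster on large tables.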


# ⚠ Implement code in order.py

In [31]:
# Testing the code once implemented in olistdash/order.py
from olistdash.order import Order
Order().get_review_score()

Unnamed: 0,order_id,dim_is_five_star,dim_is_one_star,review_score
0,73fc7af87114b39712e6da79b0a377eb,0,0,4
1,a548910a1c6147796b98fdf73dbeba33,1,0,5
2,f9e4b658b201a9f2ecdecbb34bed034b,1,0,5
3,658677c97b385a9be170737859d3511b,1,0,5
4,8e6bfb81e283fa7e4f11123a3fb894f1,1,0,5
...,...,...,...,...
99995,22ec9f0669f784db00fa86d035cf8602,1,0,5
99996,55d4004744368f5571d1f590031933e4,1,0,5
99997,7725825d039fc1f0ceb7635e3f7d9206,0,0,4
99998,f8bd3f2000c28c5342fedeb5e50f2e75,0,1,1
