# Orders

✏️ **Exercise**

Today, we will investigate the **orders**, and their associated review score.

For that purpose, we will create one single data table containing **all unique orders as index and all properties of these orders as columns.**

Our goal is to create the following DataFrame, which will come in very handy later on for our modelling phase:

  - `order_id` (_str) the id of the order_
  - `wait_time` (_float) the number of days between order_date and delivered_date_
  - `delay_vs_expected` (_float) if the actual delivery date is later than the estimated delivery date, returns the absolute number of days between the two dates, otherwise return 0_
  - `dim_is_five_star` (_int) 1 if the order received a five_star, 0 otherwise_
  - `dim_is_one_star` (_int) 1 if the order received a one_star, 0 otherwise_
  - `review_score` (_int) from 1 to 5_
  - `number_of_products` (_int) number of products that the order contains_
  - `number_of_sellers` (_int) number of sellers involved in the order_
  - `price` (_float) total price of the order paid by customer_
  - `freight_value` (_float) value of the freight paid by customer_
  - (Optional) `distance_seller_customer` (_float) the distance in km between customer and seller_
  
We also want to filter out "non-delivered" orders, unless explicitly specified.
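
One way to drop non-delivered orders is a boolean mask on the order status. The sketch below runs on a toy table: the column name `order_status` and the value `'delivered'` are assumptions about the `orders` table, and `keep_delivered` is a hypothetical helper, not part of `olist/order.py`.

```python
import pandas as pd

def keep_delivered(orders: pd.DataFrame, is_delivered_only: bool = True) -> pd.DataFrame:
    """Return a copy of `orders`, keeping only delivered rows unless told otherwise."""
    if not is_delivered_only:
        return orders.copy()
    # Assumption: delivered orders are flagged by order_status == 'delivered'
    return orders[orders["order_status"] == "delivered"].copy()

# Toy data standing in for the real orders table
toy_orders = pd.DataFrame({
    "order_id": ["a", "b", "c"],
    "order_status": ["delivered", "canceled", "delivered"],
})
delivered = keep_delivered(toy_orders)
```

Returning a copy keeps later column assignments from touching the original table.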

❓ **Your challenge**: 

- Implement each feature as a separate method within the `Order` class available at `olist/order.py`
- Then, create a method `get_training_data()` that returns the complete DataFrame.

Suggested methodology:
- Use the notebook below to write and test your code step-by-step first
- Then copy the code into `order.py` once you are certain of your code logic
- Focus on the data manipulation logic now, we will analyse the dataset visually in the next challenges

<details>
    <summary>🔥 Notebook best practices (must read) </summary>

From now on, exploratory notebooks are going to get pretty long, and we strongly advise you to follow these notebook principles:
- Code your logic so that your Notebook can always be run from top to bottom without crashing (Cell --> Run All)
- Name your variables carefully 
- Use dummy names such as `tmp` or `_` for intermediary steps when you know you won't need them long
- Clean up your code and merge cells when relevant (`Shift-M`) to minimize Notebook size
- Hide your cell output if you don't need to see it anymore (double click on the red `Out[]:` section to the left of your cell).
- Make heavy use of the Jupyter nbextensions `Collapsible Headings` and `Table of Contents` (call a TA if you can't find them)
- Use the following shortcuts 
    - `a` to insert a cell above
    - `b` to insert a cell below
    - `dd` to delete a cell
    - `esc` and `arrows` to move between cells
    - `Shift-Enter` to execute a cell and move focus to the next one
    - use `Shift + Tab` when your cursor is between a method's brackets, e.g. `groupby()`, to get the docs! Repeat several times to keep the docs open permanently

</details>





In [1]:
# Automatically reload imported modules every time a Jupyter cell is executed (handy for olist/order.py updates)
%load_ext autoreload
%autoreload 2
# !pip install jupyter_contrib_nbextensions

In [4]:
# !jupyter contrib nbextension install --sys-prefix
# !jupyter nbextension enable scratchpad/main --sys-prefix
# !jupyter nbextension list
# !jupyter contrib nbextension install --user

In [5]:
# Import usual modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [6]:
from olist.data import Olist
data = Olist().get_data()
matching_table = Olist().get_matching_table()

# 1. Code `order.py`

In [132]:
orders = data['orders'].copy() # good practice to be sure not to modify your `data` variable

### get_wait_time
Returns a DataFrame with [order_id, wait_time, expected_wait_time, delay_vs_expected]

Hints:
- Don't forget to convert dates from string type to pandas datetime using [`pandas.to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)
- Take time to understand what python [`datetime`](https://docs.python.org/3/library/datetime.html) objects are
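
As a minimal sketch of these two hints, the toy example below converts string columns to datetimes and turns their difference into a float number of days. The column names mirror the `orders` table, but the values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "order_purchase_timestamp": ["2018-01-01 12:00:00", "2018-01-05 00:00:00"],
    "order_delivered_customer_date": ["2018-01-03 00:00:00", None],
})

# Strings become datetime values; missing dates become NaT
for col in toy.columns:
    toy[col] = pd.to_datetime(toy[col])

# Subtracting two datetime columns gives a timedelta column;
# dividing by a one-day Timedelta turns it into a float number of days
toy["wait_time"] = (
    toy["order_delivered_customer_date"] - toy["order_purchase_timestamp"]
) / pd.Timedelta(days=1)
```

The undelivered order ends up with a missing `wait_time`, which is why the `count` of delivery-based columns can be lower than the number of orders.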

In [133]:
# We give you the pseudo-code below for this first method

# Inspect orders dataframe
# handle datetime
# compute wait time
# compute delay vs expected - carefully handle "negative" delays
# check new dataframe and copy code carefully to `olist/order.py`

In [134]:
orders['order_purchase_timestamp'] = pd.to_datetime(orders['order_purchase_timestamp'])
orders['order_delivered_customer_date'] = pd.to_datetime(orders['order_delivered_customer_date'])
orders['wait_time'] = (orders['order_delivered_customer_date'] -\
                               orders['order_purchase_timestamp'])/np.timedelta64(24, 'h')

In [135]:
orders['order_estimated_delivery_date'] = pd.to_datetime(orders['order_estimated_delivery_date'])
orders['expected_wait_time'] = (orders['order_estimated_delivery_date'] -\
                               orders['order_purchase_timestamp'])/np.timedelta64(24, 'h')

In [141]:
orders['delay_vs_expected'] = (orders['order_delivered_customer_date'] -\
                               orders['order_estimated_delivery_date'])

In [137]:
# Early deliveries give a negative delay: set those to zero through .loc (avoids chained assignment)
orders.loc[orders['delay_vs_expected'] < pd.to_timedelta(0), 'delay_vs_expected'] = pd.to_timedelta(0)
orders['delay_vs_expected'] = orders['delay_vs_expected']/np.timedelta64(24, 'h')
# orders.describe()
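
Chained indexing of the form `df[col][mask] = value` can trigger pandas' `SettingWithCopyWarning`. A possible alternative, sketched here on toy data with made-up column names `delivered` and `estimated`, is to convert the delay to float days first and then clamp negative values with `Series.clip`.

```python
import pandas as pd

toy = pd.DataFrame({
    "delivered": pd.to_datetime(["2018-01-10", "2018-01-04"]),
    "estimated": pd.to_datetime(["2018-01-08", "2018-01-06"]),
})

# Timedelta column: +2 days for the late order, -2 days for the early one
delay = toy["delivered"] - toy["estimated"]

# Convert to float days, then clip negative delays (early deliveries) to 0
toy["delay_vs_expected"] = (delay / pd.Timedelta(days=1)).clip(lower=0)
```

Because the whole column is assigned in one step, no intermediate slice is written to.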


In [142]:
orders.head()
orders_time = orders[["order_id", "wait_time", "expected_wait_time", "delay_vs_expected"]]

In [152]:
# Check your new dataframe and commit your code to olist/order.py when it works. 
# Test it below
from olist.order import Order
Order().get_wait_time().describe()



| | wait_time | expected_wait_time | delay_vs_expected |
|---|---|---|---|
| count | 96476.0 | 99441.0 | 96476.0 |
| mean | 12.558702 | 23.76765 | 0.774961 |
| std | 9.54653 | 8.832371 | 4.753103 |
| min | 0.533414 | 1.648993 | 0.0 |
| 25% | 6.766403 | 18.33169 | 0.0 |
| 50% | 10.217755 | 23.24037 | 0.0 |
| 75% | 15.720327 | 28.424861 | 0.0 |
| max | 209.628611 | 155.135463 | 188.975081 |


### get_review_score
     Returns a DataFrame with:
        order_id, dim_is_five_star, dim_is_one_star, review_score

In [153]:
# Load reviews dataset
reviews = data['order_reviews'].copy()
reviews

| | review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp |
|---|---|---|---|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 | | | 2018-01-18 00:00:00 | 2018-01-18 21:46:59 |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 | | | 2018-03-10 00:00:00 | 2018-03-11 03:05:13 |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 | | | 2018-02-17 00:00:00 | 2018-02-18 14:36:24 |
| 3 | e64fb393e7b32834bb789ff8bb30750e | 658677c97b385a9be170737859d3511b | 5 | | Recebi bem antes do prazo estipulado. | 2017-04-21 00:00:00 | 2017-04-21 22:02:06 |
| 4 | f7c4243c7fe1938f181bec41a392bdeb | 8e6bfb81e283fa7e4f11123a3fb894f1 | 5 | | Parabéns lojas lannister adorei comprar pela I... | 2018-03-01 00:00:00 | 2018-03-02 10:26:53 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 99995 | f3897127253a9592a73be9bdfdf4ed7a | 22ec9f0669f784db00fa86d035cf8602 | 5 | | | 2017-12-09 00:00:00 | 2017-12-11 20:06:42 |
| 99996 | b3de70c89b1510c4cd3d0649fd302472 | 55d4004744368f5571d1f590031933e4 | 5 | | Excelente mochila, entrega super rápida. Super... | 2018-03-22 00:00:00 | 2018-03-23 09:10:43 |
| 99997 | 1adeb9d84d72fe4e337617733eb85149 | 7725825d039fc1f0ceb7635e3f7d9206 | 4 | | | 2018-07-01 00:00:00 | 2018-07-02 12:59:13 |
| 99998 | be360f18f5df1e0541061c87021e6d93 | f8bd3f2000c28c5342fedeb5e50f2e75 | 1 | | Solicitei a compra de uma capa de retrovisor c... | 2017-12-15 00:00:00 | 2017-12-16 01:29:43 |


In [154]:
# Fill in the functions below; you will apply them element-wise to the `review_score` Series in the next cell
# to create the 2 new columns requested
def dim_five_star(x):
    if x == 5:
        return 1
    return 0

def dim_one_star(x):
    if x == 1:
        return 1
    return 0

In [155]:
reviews["dim_is_five_star"] = reviews["review_score"].map(dim_five_star) # --> Series([0, 1, 1, 0, 0, 1 ...])


reviews["dim_is_one_star"] = reviews["review_score"].map(dim_one_star) # --> Series([0, 1, 1, 0, 0, 1 ...])
reviews.head()

| | review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp | dim_is_five_star | dim_is_one_star |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 | | | 2018-01-18 00:00:00 | 2018-01-18 21:46:59 | 0 | 0 |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 | | | 2018-03-10 00:00:00 | 2018-03-11 03:05:13 | 1 | 0 |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 | | | 2018-02-17 00:00:00 | 2018-02-18 14:36:24 | 1 | 0 |
| 3 | e64fb393e7b32834bb789ff8bb30750e | 658677c97b385a9be170737859d3511b | 5 | | Recebi bem antes do prazo estipulado. | 2017-04-21 00:00:00 | 2017-04-21 22:02:06 | 1 | 0 |
| 4 | f7c4243c7fe1938f181bec41a392bdeb | 8e6bfb81e283fa7e4f11123a3fb894f1 | 5 | | Parabéns lojas lannister adorei comprar pela I... | 2018-03-01 00:00:00 | 2018-03-02 10:26:53 | 1 | 0 |
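
The two flag columns can also be built without a Python-level function. As a hedged alternative to `.map`, the sketch below compares a toy `review_score` Series to a scalar and casts the boolean result to integers.

```python
import pandas as pd

scores = pd.Series([4, 5, 5, 1, 3], name="review_score")

# Comparing a Series to a scalar yields booleans; astype(int) maps True/False to 1/0
dim_is_five_star = (scores == 5).astype(int)
dim_is_one_star = (scores == 1).astype(int)
```

Both approaches give the same columns; the vectorized comparison simply avoids calling a Python function once per row.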


In [157]:
# Check your new dataframe and commit your code to olist/order.py when it works. 
# Test it below
from olist.order import Order
Order().get_review_score().head()

| | order_id | dim_is_five_star | dim_is_one_star | review_score |
|---|---|---|---|---|
| 0 | 73fc7af87114b39712e6da79b0a377eb | 0 | 0 | 4 |
| 1 | a548910a1c6147796b98fdf73dbeba33 | 1 | 0 | 5 |
| 2 | f9e4b658b201a9f2ecdecbb34bed034b | 1 | 0 | 5 |
| 3 | 658677c97b385a9be170737859d3511b | 1 | 0 | 5 |
| 4 | 8e6bfb81e283fa7e4f11123a3fb894f1 | 1 | 0 | 5 |


### get_number_products
     Returns a DataFrame with:
        order_id, number_of_products (total number of products per order)

In [174]:
display(data.keys())
items = data['order_items'].copy()
items[["order_id", "order_item_id"]].groupby(by = "order_id").count().rename(columns={"order_item_id" : "number_of_products"})

dict_keys(['sellers', 'order_reviews', 'order_items', 'customers', 'orders', 'order_payments', 'product_category_name_translation', 'products', 'geolocation'])

| order_id | number_of_products |
|---|---|
| 00010242fe8c5a6d1ba2dd792cb16214 | 1 |
| 00018f77f2f0320c557190d7a144bdd3 | 1 |
| 000229ec398224ef6ca0657da4fc703e | 1 |
| 00024acbcdf0a6daa1e931b038114c75 | 1 |
| 00042b26cf59d7ce69dfabb4e55b4fd9 | 1 |
| ... | ... |
| fffc94f6ce00a00581880bf54a75a037 | 1 |
| fffcd46ef2263f404302a634eb57f7eb | 1 |
| fffce4705a9662cd70adb13d4a31832d | 1 |
| fffe18544ffabc95dfada21779c9644f | 1 |


In [176]:
# Check your new dataframe and commit your code to olist/order.py when it works. 
Order().get_number_products()
# tmp[tmp["number_of_products"]>1]

| order_id | number_of_products |
|---|---|
| 0008288aa423d2a3f00fcb17cd7d8719 | 2 |
| 00143d0f86d6fbd9f9b38ab440ac16f5 | 3 |
| 001ab0a7578dd66cd4b0a71f5b6e1e41 | 3 |
| 001d8f0e34a38c37f7dba2a37d4eba8b | 2 |
| 002c9def9c9b951b1bec6d50753c9891 | 2 |
| ... | ... |
| ffd84ab39cd5e873d8dba24342e65c01 | 2 |
| ffe4b41e99d39f0b837a239110260530 | 2 |
| ffecd5a79a0084f6a592288c67e3c298 | 3 |
| fff8287bbae429a99bb7e8c21d151c41 | 2 |


### get_number_sellers
     Returns a DataFrame with:
        order_id, number_of_sellers (total number of unique sellers per order)

<details>
    <summary>Hint</summary>

`pd.Series.nunique()`
</details>
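
To illustrate the difference between counting rows and counting distinct values, here is a small sketch on a toy items table. The column names mirror `order_items`, but the values are made up.

```python
import pandas as pd

toy_items = pd.DataFrame({
    "order_id":      ["o1", "o1", "o1", "o2"],
    "order_item_id": [1, 2, 3, 1],
    "seller_id":     ["s1", "s1", "s2", "s3"],
})

grouped = toy_items.groupby("order_id")
number_of_products = grouped["order_item_id"].count()  # rows per order
number_of_sellers = grouped["seller_id"].nunique()     # distinct sellers per order
```

Order `o1` contains three items but only two distinct sellers, so `count` and `nunique` disagree for it.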

In [185]:
tmp = items[["order_id", "seller_id"]].groupby(by = "order_id")\
            .nunique().rename(columns={"seller_id" : "number_of_sellers"})
# tmp[tmp["number_of_sellers"]>3]

In [196]:
# Check your new dataframe and commit your code to olist/order.py when it works. 
Order().get_number_sellers()

| order_id | number_of_sellers |
|---|---|
| 00010242fe8c5a6d1ba2dd792cb16214 | 1 |
| 00018f77f2f0320c557190d7a144bdd3 | 1 |
| 000229ec398224ef6ca0657da4fc703e | 1 |
| 00024acbcdf0a6daa1e931b038114c75 | 1 |
| 00042b26cf59d7ce69dfabb4e55b4fd9 | 1 |
| ... | ... |
| fffc94f6ce00a00581880bf54a75a037 | 1 |
| fffcd46ef2263f404302a634eb57f7eb | 1 |
| fffce4705a9662cd70adb13d4a31832d | 1 |
| fffe18544ffabc95dfada21779c9644f | 1 |


### get_price_and_freight
     Returns a DataFrame with:
        order_id, price, freight_value

<details>
    <summary>Hint</summary>

`DataFrameGroupBy.agg()` allows you to apply one aggregation method per column of your groupby object
</details>
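
As a sketch of the per-column aggregation mentioned in the hint, the example below passes a dict to `.agg()` on a toy items table (values are made up). Each key is a column and each value is the aggregation applied to it.

```python
import pandas as pd

toy_items = pd.DataFrame({
    "order_id":      ["o1", "o1", "o2"],
    "price":         [10.0, 5.5, 20.0],
    "freight_value": [2.0, 1.0, 3.5],
})

# A dict maps each column to its own aggregation function;
# as_index=False keeps order_id as a regular column instead of the index
totals = toy_items.groupby("order_id", as_index=False).agg(
    {"price": "sum", "freight_value": "sum"}
)
```

Here both columns use `"sum"`, but each entry could name a different function, such as `"mean"` or `"max"`.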

In [187]:
items.head()

| | order_id | order_item_id | product_id | seller_id | shipping_limit_date | price | freight_value |
|---|---|---|---|---|---|---|---|
| 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 1 | 4244733e06e7ecb4970a6e2683c13e61 | 48436dade18ac8b2bce089ec2a041202 | 2017-09-19 09:45:35 | 58.9 | 13.29 |
| 1 | 00018f77f2f0320c557190d7a144bdd3 | 1 | e5f2d52b802189ee658865ca93d83a8f | dd7ddc04e1b6c2c614352b383efe2d36 | 2017-05-03 11:05:13 | 239.9 | 19.93 |
| 2 | 000229ec398224ef6ca0657da4fc703e | 1 | c777355d18b72b67abbeef9df44fd0fd | 5b51032eddd242adc84c38acab88f23d | 2018-01-18 14:48:30 | 199.0 | 17.87 |
| 3 | 00024acbcdf0a6daa1e931b038114c75 | 1 | 7634da152a4610f1595efa32f14722fc | 9d7a1d34a5052409006425275ba1c2b4 | 2018-08-15 10:10:18 | 12.99 | 12.79 |
| 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 1 | ac6c3623068f30de03045865e4e10089 | df560393f3a51e74553ab94004ba5c87 | 2017-02-13 13:57:51 | 199.9 | 18.14 |


In [208]:
tmp = items[["order_id", "price", "freight_value"]].groupby(by = "order_id")\
            .sum()
# tmp.describe()

In [194]:
# Check your new dataframe and commit your code to olist/order.py when it works. 
Order().get_price_and_freight()

| order_id | price | freight_value |
|---|---|---|
| 00010242fe8c5a6d1ba2dd792cb16214 | 58.90 | 13.29 |
| 00018f77f2f0320c557190d7a144bdd3 | 239.90 | 19.93 |
| 000229ec398224ef6ca0657da4fc703e | 199.00 | 17.87 |
| 00024acbcdf0a6daa1e931b038114c75 | 12.99 | 12.79 |
| 00042b26cf59d7ce69dfabb4e55b4fd9 | 199.90 | 18.14 |
| ... | ... | ... |
| fffc94f6ce00a00581880bf54a75a037 | 299.99 | 43.41 |
| fffcd46ef2263f404302a634eb57f7eb | 350.00 | 36.53 |
| fffce4705a9662cd70adb13d4a31832d | 99.90 | 16.95 |
| fffe18544ffabc95dfada21779c9644f | 55.99 | 8.72 |


### get_distance_seller_customer (OPTIONAL - Try only after finishing today's challenges)
[order_id, distance_seller_customer] (the distance in km between customer and seller)

💡Have a look at the `haversine_distance` formula we coded for you in the `olist.utils` module
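
For reference, the haversine formula itself can be sketched in a few lines. The function below, `haversine_km`, is a hypothetical stand-in written for illustration: the actual signature and argument order of `olist.utils.haversine_distance` may differ, so check the module before relying on it.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance in km between two (longitude, latitude) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon, dlat = lon2 - lon1, lat2 - lat1
    # Haversine of the central angle between the two points
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# One degree of longitude along the equator is roughly 111 km
d = haversine_km(0.0, 0.0, 1.0, 0.0)
```

The default radius of 6371 km is the commonly used mean Earth radius; the result is an approximation because the Earth is not a perfect sphere.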

In [None]:
# Check your new dataframe and commit your code to olist/order.py when it works. 
Order().get_distance_seller_customer()

# 2. Test your newly coded module

In [202]:
# Chain the merges on order_id instead of nesting pd.merge calls
tmp = Order().get_wait_time()\
             .merge(Order().get_review_score(), on='order_id')\
             .merge(Order().get_number_products(), on='order_id')\
             .merge(Order().get_number_sellers(), on='order_id')\
             .merge(Order().get_price_and_freight(), on='order_id')
tmp.head()



| | order_id | wait_time | expected_wait_time | delay_vs_expected | dim_is_five_star | dim_is_one_star | review_score | number_of_products | number_of_sellers | price | freight_value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 8.436574 | 15.544063 | 0.0 | 0 | 0 | 4 | 1 | 1 | 29.99 | 8.72 |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | 13.782037 | 19.137766 | 0.0 | 0 | 0 | 4 | 1 | 1 | 118.7 | 22.76 |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 9.394213 | 26.639711 | 0.0 | 1 | 0 | 5 | 1 | 1 | 159.9 | 19.22 |
| 3 | 949d5b44dbf5de918fe9c16f97b45f8a | 13.20875 | 26.188819 | 0.0 | 1 | 0 | 5 | 1 | 1 | 45.0 | 27.2 |
| 4 | ad21c59c0840e6cb83a9ceb5573f8159 | 2.873877 | 12.112049 | 0.0 | 1 | 0 | 5 | 1 | 1 | 19.9 | 8.72 |


❓ Time to code `get_training_data`, making use of the methods you coded previously.
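
One possible way to combine several per-feature tables is to fold a list of DataFrames with `functools.reduce`. The sketch below uses three toy frames as stand-ins for the outputs of the feature methods; it is an illustration of the merging pattern, not the expected content of `get_training_data`.

```python
from functools import reduce
import pandas as pd

# Toy stand-ins for the per-feature DataFrames, each keyed by order_id
wait = pd.DataFrame({"order_id": ["o1", "o2"], "wait_time": [8.4, 13.8]})
review = pd.DataFrame({"order_id": ["o1", "o2"], "review_score": [4, 5]})
price = pd.DataFrame({"order_id": ["o1", "o3"], "price": [29.99, 45.0]})

# Fold the list into one table; the default inner join keeps only
# the orders that appear in every frame
training = reduce(
    lambda left, right: left.merge(right, on="order_id"),
    [wait, review, price],
)
```

Only `o1` survives here because it is the only order present in all three frames, which is worth keeping in mind when comparing row counts before and after merging.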

In [204]:
%%time
from olist.order import Order
result = Order().get_training_data()
result



CPU times: user 8.13 s, sys: 1.11 s, total: 9.24 s
Wall time: 9.24 s


| | order_id | wait_time | expected_wait_time | delay_vs_expected | dim_is_five_star | dim_is_one_star | review_score | number_of_products | number_of_sellers | price | freight_value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 8.436574 | 15.544063 | 0.0 | 0 | 0 | 4 | 1 | 1 | 29.99 | 8.72 |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | 13.782037 | 19.137766 | 0.0 | 0 | 0 | 4 | 1 | 1 | 118.70 | 22.76 |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 9.394213 | 26.639711 | 0.0 | 1 | 0 | 5 | 1 | 1 | 159.90 | 19.22 |
| 3 | 949d5b44dbf5de918fe9c16f97b45f8a | 13.208750 | 26.188819 | 0.0 | 1 | 0 | 5 | 1 | 1 | 45.00 | 27.20 |
| 4 | ad21c59c0840e6cb83a9ceb5573f8159 | 2.873877 | 12.112049 | 0.0 | 1 | 0 | 5 | 1 | 1 | 19.90 | 8.72 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99217 | 9c5dedf39a927c1b2549525ed64a053c | 8.218009 | 18.587442 | 0.0 | 1 | 0 | 5 | 1 | 1 | 72.00 | 13.08 |
| 99218 | 63943bddc261676b46f01ca7ac2f7bd8 | 22.193727 | 23.459051 | 0.0 | 0 | 0 | 4 | 1 | 1 | 174.90 | 20.10 |
| 99219 | 83c1379a015df1e13d02aae0204711ab | 24.859421 | 30.384225 | 0.0 | 1 | 0 | 5 | 1 | 1 | 205.99 | 65.02 |
| 99220 | 11c177c8e97725db2631073c19f07b62 | 17.086424 | 37.105243 | 0.0 | 0 | 0 | 2 | 2 | 1 | 359.98 | 81.18 |


🏁 Congratulations! Commit and push your notebook before starting the next challenge.