# Orders

✏️ **Exercise**

Today, we will investigate the **orders**, and their associated **review score**.

👉 Our goal is to create a DataFrame with the following features:


| feature_name 	| type 	| description 	|
|:---	|:---:	|:---	|
| `order_id` 	| str 	| the id of the order 	|
| `wait_time` 	| float 	| the number of days between order_purchase_timestamp and order_delivered_customer_date 	|
| `expected_wait_time` 	| float 	| the number of days between order_purchase_timestamp and order_estimated_delivery_date 	|
| `delay_vs_expected` 	| float 	| the number of days between order_estimated_delivery_date and order_delivered_customer_date if the order was delivered late, otherwise 0 	|
| `order_status` 	| str 	| the status of the order 	|
| `dim_is_five_star` 	| int 	| 1 if the order received a five-star review, 0 otherwise 	|
| `dim_is_one_star` 	| int 	| 1 if the order received a one-star review, 0 otherwise 	|
| `review_score` 	| int 	| from 1 to 5 	|
| `number_of_products` 	| int 	| number of products that the order contains 	|
| `number_of_sellers` 	| int 	| number of sellers involved in the order 	|
| `price` 	| float 	| total price of the order paid by customer 	|
| `freight_value` 	| float 	| value of the freight paid by customer 	|
| `distance_seller_customer` 	| float 	| the distance in km between customer and seller (optional) 	|  
  
⚠️ Unless explicitly specified otherwise, we also want to filter out "non-delivered" orders, because we cannot compute potential delays for them.
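
A minimal sketch of this filter on a toy table. The helper name `filter_orders` and its `is_delivered` flag are hypothetical and introduced only for illustration; the methods in `olist/order.py` may expose this choice differently.

```python
import pandas as pd

# Toy stand-in for the `orders` table (made-up values)
toy_orders = pd.DataFrame({
    "order_id": ["a", "b", "c"],
    "order_status": ["delivered", "canceled", "delivered"],
})

def filter_orders(orders, is_delivered=True):
    """Hypothetical helper: keep only delivered orders unless told otherwise."""
    if is_delivered:
        return orders[orders["order_status"] == "delivered"].copy()
    return orders.copy()

delivered_only = filter_orders(toy_orders)
all_orders = filter_orders(toy_orders, is_delivered=False)
```

Defaulting the flag to `True` keeps the delivered-only behaviour unless a caller opts out.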

❓ **Your challenge**: 

- Implement each feature as a separate method within the `Order` class available at `olist/order.py`
- Then, create a method `get_training_data()` that returns the complete DataFrame **without `NaN`s**.

💡 Suggested methodology:
- Use the notebook below to write and test your code step-by-step first
- Then copy the code into `order.py` once you are certain of your code logic
- Focus on the data manipulation logic for now; we will analyse the dataset visually in the next challenges

🔥 Notebook best practices (must-read) 👇

<details>
    <summary>▸ <i>click here</i></summary>

From now on, exploratory notebooks are going to become pretty long, and we strongly advise you to follow these notebook principles:
- Code your logic so that your Notebook can always be run from top to bottom without crashing (Cell --> Run All)
- Name your variables carefully 
- Use dummy names such as `tmp` or `_` for intermediary steps when you know you won't need them for long
- Clear your code and merge cells when relevant to minimize Notebook size (`Shift-M`)
- Hide your cell output if you don't need to see it anymore (double-click on the red `Out[]:` section to the left of your cell).
- Make heavy use of the Jupyter nbextensions `Collapsible Headings` and `Table of Contents` (call a TA if you can't find them)
- Use the following shortcuts 
    - `a` to insert a cell above
    - `b` to insert a cell below
    - `dd` to delete a cell
    - `esc` and `arrows` to move between cells
    - `Shift-Enter` to execute cell and move focus to the next one
    - use `Shift + Tab` when you are between method brackets e.g. `groupby()` to get the docs! Repeat a few times to open it permanently

</details>





In [1]:
# Auto reload imported module every time a jupyter cell is executed (handy for olist.order.py updates)
%load_ext autoreload
%autoreload 2

In [2]:
# Import usual modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
# Import olist data
from olist.data import Olist
olist = Olist()
data = olist.get_data()

In [4]:
# What datasets do we have access to now ?
data.keys()

dict_keys(['sellers', 'order_reviews', 'order_items', 'customers', 'orders', 'order_payments', 'product_category_name_translation', 'products', 'geolocation'])

In [5]:
orders = data['orders'].copy() # good practice to be sure not to modify your `data` variable

assert(orders.shape == (99441, 8))

## 1. Code `order.py`

### a) `get_wait_time`
    ❓ Return a Dataframe with:
           order_id, wait_time, expected_wait_time, delay_vs_expected, order_status


🎁 We give you the pseudo-code below 👇 for this first method:

> 1. Inspect the `orders` dataframe
2. Filter the dataframe on `delivered orders`
3. Handle `datetime`
    - Take time to understand what Python [`datetime`](https://docs.python.org/3/library/datetime.html) objects are
    - and convert dates from "string" type to "pandas.datetime" using [`pandas.to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)
4. Compute `wait_time`
5. Compute `expected_wait_time`
6. Compute `delay_vs_expected`
7. Check the new dataframe 
8. Once you are satisfied with your code, you can carefully copy-paste it from the notebook to `olist/order.py`
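
Step 3 can be tried on a toy Series before touching the real table. This is a minimal sketch with made-up timestamps in the same string format as the orders data; note how a missing value becomes `NaT` rather than raising an error.

```python
import pandas as pd

# Made-up timestamps, one of them missing
raw = pd.Series(["2017-10-02 10:56:33", None, "2017-10-10 21:25:13"])

# Strings become datetime64 values; the missing entry becomes NaT
parsed = pd.to_datetime(raw)

# Subtracting two timestamps yields a Timedelta
elapsed = parsed[2] - parsed[0]
```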

<details>
    <summary>💡Hint</summary>

For both `wait_time` and `delay_vs_expected`, you need to subtract the relevant dates/timestamps to get the time difference between the `pandas.datetime` objects. Then, you can either use [`datetime.timedelta()`](https://docs.python.org/3/library/datetime.html#timedelta-objects) or [`np.timedelta64()`](https://numpy.org/doc/stable/reference/arrays.datetime.html#datetime-and-timedelta-arithmetic) to find out how many days that subtraction represents!

</details>
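
As a hedged illustration of the hint, here are two common ways to turn a timedelta column into a number of days, shown on made-up values. `.dt.days` is included for contrast because it drops the fractional part.

```python
import numpy as np
import pandas as pd

# Made-up durations
deltas = pd.Series(pd.to_timedelta(["8 days 10:28:40", "0 days 12:00:00"]))

# Option 1: divide by a one-day timedelta to get fractional days
days_np = deltas / np.timedelta64(1, "D")

# Option 2: go through seconds (86400 seconds per day)
days_sec = deltas.dt.total_seconds() / 86400

# `.dt.days` keeps only the whole-day component
whole_days = deltas.dt.days
```

Which one to use depends on whether the feature should be a float (fractional days) or a whole number of days.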

#### Inspect dataframe orders

In [6]:
orders.info()
orders.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       99441 non-null  object
 1   customer_id                    99441 non-null  object
 2   order_status                   99441 non-null  object
 3   order_purchase_timestamp       99441 non-null  object
 4   order_approved_at              99281 non-null  object
 5   order_delivered_carrier_date   97658 non-null  object
 6   order_delivered_customer_date  96476 non-null  object
 7   order_estimated_delivery_date  99441 non-null  object
dtypes: object(8)
memory usage: 6.1+ MB


Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00


#### Filter on delivered orders

In [7]:
# Add .copy() to make it clear that we are making a copy
# Will prevent warnings down the line
delivered_orders = orders[orders["order_status"] == "delivered"].copy()

#### Handle datetime columns

In [8]:
# Transform date columns to datetime type
dt_columns = [
    "order_purchase_timestamp",
    "order_approved_at",
    "order_delivered_carrier_date",
    "order_delivered_customer_date",
    "order_estimated_delivery_date"
    ]

for col_name in dt_columns:
    # Plain column assignment replaces the column, so its dtype becomes datetime64.
    # (An in-place `.loc[:, col_name] = ...` assignment can keep the object dtype in pandas >= 2.0.)
    delivered_orders[col_name] = pd.to_datetime(delivered_orders[col_name])

# Inspect dataframe
delivered_orders.info()
delivered_orders.head(1)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96478 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   order_id                       96478 non-null  object        
 1   customer_id                    96478 non-null  object        
 2   order_status                   96478 non-null  object        
 3   order_purchase_timestamp       96478 non-null  datetime64[ns]
 4   order_approved_at              96464 non-null  datetime64[ns]
 5   order_delivered_carrier_date   96476 non-null  datetime64[ns]
 6   order_delivered_customer_date  96470 non-null  datetime64[ns]
 7   order_estimated_delivery_date  96478 non-null  datetime64[ns]
dtypes: datetime64[ns](5), object(3)
memory usage: 6.6+ MB


Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18


#### Compute wait time

In [9]:
# Calculate the time between purchase and delivery to the customer
delivered_orders["wait_time"] = delivered_orders["order_delivered_customer_date"] - delivered_orders["order_purchase_timestamp"]

# Inspect result
delivered_orders.info()
delivered_orders.head(1)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96478 entries, 0 to 99440
Data columns (total 9 columns):
 #   Column                         Non-Null Count  Dtype          
---  ------                         --------------  -----          
 0   order_id                       96478 non-null  object         
 1   customer_id                    96478 non-null  object         
 2   order_status                   96478 non-null  object         
 3   order_purchase_timestamp       96478 non-null  datetime64[ns] 
 4   order_approved_at              96464 non-null  datetime64[ns] 
 5   order_delivered_carrier_date   96476 non-null  datetime64[ns] 
 6   order_delivered_customer_date  96470 non-null  datetime64[ns] 
 7   order_estimated_delivery_date  96478 non-null  datetime64[ns] 
 8   wait_time                      96470 non-null  timedelta64[ns]
dtypes: datetime64[ns](5), object(3), timedelta64[ns](1)
memory usage: 7.4+ MB


Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,wait_time
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18,8 days 10:28:40


#### Compute expected wait time

In [10]:
# Calculate time between purchase and expected delivery
delivered_orders["expected_wait_time"] = delivered_orders["order_estimated_delivery_date"] - delivered_orders["order_purchase_timestamp"]

# Inspect result
delivered_orders.info()
delivered_orders.head(1)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96478 entries, 0 to 99440
Data columns (total 10 columns):
 #   Column                         Non-Null Count  Dtype          
---  ------                         --------------  -----          
 0   order_id                       96478 non-null  object         
 1   customer_id                    96478 non-null  object         
 2   order_status                   96478 non-null  object         
 3   order_purchase_timestamp       96478 non-null  datetime64[ns] 
 4   order_approved_at              96464 non-null  datetime64[ns] 
 5   order_delivered_carrier_date   96476 non-null  datetime64[ns] 
 6   order_delivered_customer_date  96470 non-null  datetime64[ns] 
 7   order_estimated_delivery_date  96478 non-null  datetime64[ns] 
 8   wait_time                      96470 non-null  timedelta64[ns]
 9   expected_wait_time             96478 non-null  timedelta64[ns]
dtypes: datetime64[ns](5), object(3), timedelta64[ns](2)
memory usage: 8.1+

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,wait_time,expected_wait_time
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18,8 days 10:28:40,15 days 13:03:27


#### Compute delay vs expected

In [11]:
delivered_orders["delay_vs_expected"] = (
    delivered_orders["order_delivered_customer_date"] - delivered_orders["order_estimated_delivery_date"]
    ).dt.days

delivered_orders.loc[delivered_orders["delay_vs_expected"] < 0, "delay_vs_expected"] = 0

delivered_orders.info()
delivered_orders.head(1)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96478 entries, 0 to 99440
Data columns (total 11 columns):
 #   Column                         Non-Null Count  Dtype          
---  ------                         --------------  -----          
 0   order_id                       96478 non-null  object         
 1   customer_id                    96478 non-null  object         
 2   order_status                   96478 non-null  object         
 3   order_purchase_timestamp       96478 non-null  datetime64[ns] 
 4   order_approved_at              96464 non-null  datetime64[ns] 
 5   order_delivered_carrier_date   96476 non-null  datetime64[ns] 
 6   order_delivered_customer_date  96470 non-null  datetime64[ns] 
 7   order_estimated_delivery_date  96478 non-null  datetime64[ns] 
 8   wait_time                      96470 non-null  timedelta64[ns]
 9   expected_wait_time             96478 non-null  timedelta64[ns]
 10  delay_vs_expected              96470 non-null  float64        
dtypes:

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,wait_time,expected_wait_time,delay_vs_expected
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18,8 days 10:28:40,15 days 13:03:27,0.0
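
An alternative way to floor negative delays at zero is `Series.clip(lower=0)`. The sketch below uses made-up dates and computes fractional days, whereas `.dt.days` in the cell above returns whole days; treat it as one possible variant, not the reference solution.

```python
import pandas as pd

# Made-up rows: one early delivery, one late delivery
toy = pd.DataFrame({
    "order_delivered_customer_date": pd.to_datetime(
        ["2017-10-10 21:25:13", "2017-10-20 12:00:00"]),
    "order_estimated_delivery_date": pd.to_datetime(
        ["2017-10-18", "2017-10-18"]),
})

raw_delay = toy["order_delivered_customer_date"] - toy["order_estimated_delivery_date"]

# Convert to fractional days, then replace negative values (early deliveries) with 0
toy["delay_vs_expected"] = (raw_delay / pd.Timedelta(days=1)).clip(lower=0)
```

`clip` leaves missing values as `NaN`, so rows without a delivery date still need handling later.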


👀 Check the dataframe you've just created. <br/> 

💪 When your code works, commit it to `olist/order.py` <br/>

🧪 Now, test it by running the following cell 👇 

In [12]:
# Test your code here
from olist.order import Order
Order().get_wait_time()

Unnamed: 0,order_id,wait_time,expected_wait_time,delay_vs_expected,order_status
0,e481f51cbdc54678b7cc49136f2d6af7,8.0,15,0.0,delivered
1,53cdb2fc8bc7dce0b6741e2150273451,13.0,19,0.0,delivered
2,47770eb9100c2d0c44946d9cf07ec65d,9.0,26,0.0,delivered
3,949d5b44dbf5de918fe9c16f97b45f8a,13.0,26,0.0,delivered
4,ad21c59c0840e6cb83a9ceb5573f8159,2.0,12,0.0,delivered
...,...,...,...,...,...
99436,9c5dedf39a927c1b2549525ed64a053c,8.0,18,0.0,delivered
99437,63943bddc261676b46f01ca7ac2f7bd8,22.0,23,0.0,delivered
99438,83c1379a015df1e13d02aae0204711ab,24.0,30,0.0,delivered
99439,11c177c8e97725db2631073c19f07b62,17.0,37,0.0,delivered


In [13]:
from nbresult import ChallengeResult
test = Order().get_wait_time()
result = ChallengeResult('wait_time', dve_type=test["delay_vs_expected"].dtype, shape=test.shape, dve_min=test["delay_vs_expected"].min(), dve_max=test["delay_vs_expected"].max())
result.write(); print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/branchedelac/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/branchedelac/code/branchedelac/04-Decision-Science/02-Statistical-Inference/data-orders/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 1 item

test_wait_time.py::TestWaitTime::test_wait_time PASSED                   [100%]

💯 You can commit your code:

git add tests/wait_time.pickle

git commit -m 'Completed wait_time step'

git push origin master



### b) `get_review_score`
     ❓ Returns a DataFrame with:
        order_id, dim_is_five_star, dim_is_one_star, review_score

dim_is_$N$_star should contain `1` if review_score=$N$ and `0` otherwise 

<details>
    <summary markdown='span'>Hints</summary>

Think about `Series.map()` or `DataFrame.apply()`
    
</details>
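
A minimal sketch of the hint on made-up scores, showing `Series.map()` next to a vectorised comparison that gives the same kind of 0/1 column:

```python
import pandas as pd

# Made-up review scores
scores = pd.Series([5, 1, 4, 5], name="review_score")

# Series.map applies a function element by element
dim_is_five_star = scores.map(lambda s: 1 if s == 5 else 0)

# Vectorised alternative: compare, then cast the booleans to integers
dim_is_one_star = (scores == 1).astype(int)
```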

👉 We load the `reviews` for you

In [14]:
reviews = data['order_reviews'].copy()
assert(reviews.shape == (99224,7))
reviews.head(1)

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59


In [15]:
#reviews.loc[reviews["review_score"] == 1, "dim_is_one_star"] = 1
#reviews.loc[reviews["review_score"] == 5, "dim_is_five_star"] = 1

reviews["dim_is_one_star"] = np.where(reviews['review_score'] == 1, 1, 0)
reviews["dim_is_five_star"] = np.where(reviews['review_score'] == 5, 1, 0)

# Inspect
reviews.info()
display(reviews.loc[reviews["review_score"] == 1].head(1))
display(reviews.loc[reviews["review_score"] == 2].head(1))
display(reviews.loc[reviews["review_score"] == 5].head(1))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99224 entries, 0 to 99223
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   review_id                99224 non-null  object
 1   order_id                 99224 non-null  object
 2   review_score             99224 non-null  int64 
 3   review_comment_title     11568 non-null  object
 4   review_comment_message   40977 non-null  object
 5   review_creation_date     99224 non-null  object
 6   review_answer_timestamp  99224 non-null  object
 7   dim_is_one_star          99224 non-null  int64 
 8   dim_is_five_star         99224 non-null  int64 
dtypes: int64(3), object(6)
memory usage: 6.8+ MB


Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,dim_is_one_star,dim_is_five_star
5,15197aa66ff4d0650b5434f1b46cda19,b18dcdf73be66366873cd26c5724d1dc,1,,,2018-04-13 00:00:00,2018-04-16 00:39:37,1,0


Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,dim_is_one_star,dim_is_five_star
16,9314d6f9799f5bfba510cc7bcd468c01,0dacf04c5ad59fd5a0cc1faa07c34e39,2,,"GOSTARIA DE SABER O QUE HOUVE, SEMPRE RECEBI E...",2018-01-18 00:00:00,2018-01-20 21:25:45,0,0


Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,dim_is_one_star,dim_is_five_star
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13,0,1


Once again, 

👀 Check the dataframe you've just created. <br/> 

💪 When your code works, commit it to `olist/order.py` <br/>

🧪 Now, test it by running the following cell 👇 

In [16]:
# Test your code here
from olist.order import Order
Order().get_review_score()

Unnamed: 0,order_id,dim_is_five_star,dim_is_one_star,review_score
0,73fc7af87114b39712e6da79b0a377eb,0,0,4
1,a548910a1c6147796b98fdf73dbeba33,1,0,5
2,f9e4b658b201a9f2ecdecbb34bed034b,1,0,5
3,658677c97b385a9be170737859d3511b,1,0,5
4,8e6bfb81e283fa7e4f11123a3fb894f1,1,0,5
...,...,...,...,...
99219,2a8c23fee101d4d5662fa670396eb8da,1,0,5
99220,22ec9f0669f784db00fa86d035cf8602,1,0,5
99221,55d4004744368f5571d1f590031933e4,1,0,5
99222,7725825d039fc1f0ceb7635e3f7d9206,0,0,4


In [17]:
from nbresult import ChallengeResult
result = ChallengeResult('review_score', shape=Order().get_review_score().shape)
result.write(); print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/branchedelac/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/branchedelac/code/branchedelac/04-Decision-Science/02-Statistical-Inference/data-orders/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 1 item

test_review_score.py::TestReviewScore::test_review_score PASSED          [100%]

💯 You can commit your code:

git add tests/review_score.pickle

git commit -m 'Completed review_score step'

git push origin master



### c) `get_number_products`:
     ❓ Returns a DataFrame with:
        order_id, number_of_products (total number of products per order)

In [18]:
order_items = data["order_items"].copy()
order_items.info()
order_items.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112650 entries, 0 to 112649
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   order_id             112650 non-null  object 
 1   order_item_id        112650 non-null  int64  
 2   product_id           112650 non-null  object 
 3   seller_id            112650 non-null  object 
 4   shipping_limit_date  112650 non-null  object 
 5   price                112650 non-null  float64
 6   freight_value        112650 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 6.0+ MB


Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.9,13.29


In [33]:
print("unique sellers", order_items["seller_id"].nunique())
print("unique products", order_items["product_id"].nunique())

order_items[["order_id", "product_id", "seller_id"]].groupby(
    "order_id"
).count().sort_values("seller_id")

unique sellers 3095
unique products 32951


Unnamed: 0_level_0,product_id,seller_id
order_id,Unnamed: 1_level_1,Unnamed: 2_level_1
00010242fe8c5a6d1ba2dd792cb16214,1,1
a6e9d106235bcf1dda54253686d89e99,1,1
a6e9b80a7636eb8dd592dbb3e20d0a91,1,1
a6e963c11e80432334e984ead4797a8b,1,1
a6e8ad5db31e71f5f12671af561acb4a,1,1
...,...,...
428a2f660dc84138d969ccd69a0ab6d5,15,15
9ef13efd6949e4573a18964dd1bbe7f5,15,15
1b15974a0141d54e36626dca3fdc731a,20,20
ab14fdcfbe524636d65ee38360e22ce8,20,20


In [39]:
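# Count the item rows of each order.
# Note: the target feature is called `number_of_products`; use that name in `olist/order.py`.
number_products = order_items[["order_id", "product_id"]].groupby(
    "order_id"
).count().reset_index().rename(columns={"product_id": "number_products"})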

number_products.value_counts("number_products")

number_products
1     88863
2      7516
3      1322
4       505
5       204
6       198
7        22
8         8
10        8
12        5
11        4
9         3
14        2
15        2
20        2
13        1
21        1
dtype: int64

🧪 Same routine: 
* check your dataframe, 
* commit your code to `olist/order.py`
* and check that it truly works.

In [20]:
from nbresult import ChallengeResult
result = ChallengeResult('number_products', shape=Order().get_number_products().shape)
result.write(); print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/branchedelac/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/branchedelac/code/branchedelac/04-Decision-Science/02-Statistical-Inference/data-orders/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 1 item

test_number_products.py::TestNumberProducts::test_review_score PASSED    [100%]

💯 You can commit your code:

git add tests/number_products.pickle

git commit -m 'Completed number_products step'

git push origin master



### d) `get_number_sellers`:
     ❓ Returns a DataFrame with:
        order_id, number_of_sellers (total number of unique sellers per order)
        
<details>
    <summary>▸ <i>Hint</i></summary>

`pd.Series.nunique()`
</details>
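
To see why the hint points to `nunique()` rather than `count()`, here is a small sketch on made-up item rows where one order buys twice from the same seller:

```python
import pandas as pd

# Made-up item rows: order "a" has three items from two sellers
toy_items = pd.DataFrame({
    "order_id":  ["a", "a", "a", "b"],
    "seller_id": ["s1", "s1", "s2", "s3"],
})

per_order = toy_items.groupby("order_id")["seller_id"]

rows_per_order = per_order.count()       # counts every item row
sellers_per_order = per_order.nunique()  # counts distinct sellers only
```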

In [21]:
sellers = data["sellers"]
sellers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3095 entries, 0 to 3094
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   seller_id               3095 non-null   object
 1   seller_zip_code_prefix  3095 non-null   int64 
 2   seller_city             3095 non-null   object
 3   seller_state            3095 non-null   object
dtypes: int64(1), object(3)
memory usage: 96.8+ KB


In [36]:
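# Count the distinct sellers of each order.
# Note: the target feature is called `number_of_sellers`; use that name in `olist/order.py`.
number_sellers = order_items[["order_id", "seller_id"]].groupby(
    "order_id"
).nunique().reset_index().rename(columns={"seller_id": "number_sellers"})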

number_sellers.value_counts("number_sellers")

number_sellers
1    97388
2     1219
3       54
4        3
5        2
dtype: int64

In [27]:
from nbresult import ChallengeResult
result = ChallengeResult('number_sellers', shape=Order().get_number_sellers().shape)
result.write(); print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/branchedelac/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/branchedelac/code/branchedelac/04-Decision-Science/02-Statistical-Inference/data-orders/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 1 item

test_number_sellers.py::TestNumberSellers::test_number_seller PASSED     [100%]

💯 You can commit your code:

git add tests/number_sellers.pickle

git commit -m 'Completed number_sellers step'

git push origin master



### e) `get_price_and_freight`
     Returns a DataFrame with:
        order_id, price, freight_value

<details>
    <summary>▸ <i>Hint</i></summary>

`DataFrameGroupBy.agg()` lets you apply a different aggregation function to each column of your groupby object
</details>
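
A short sketch of per-column aggregation on made-up rows. The `{column: function}` mapping is one of several forms that `agg()` accepts:

```python
import pandas as pd

# Made-up item rows
toy_items = pd.DataFrame({
    "order_id": ["a", "a", "b"],
    "price": [10.0, 5.0, 20.0],
    "freight_value": [2.0, 1.0, 4.0],
})

# One aggregation per column, passed as a {column: function} mapping
totals = toy_items.groupby("order_id", as_index=False).agg(
    {"price": "sum", "freight_value": "sum"}
)
```

Only the columns named in the mapping appear in the result, so no unwanted column needs to be dropped afterwards.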

In [28]:
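# Select the two numeric columns before summing, so that the string columns
# (product_id, seller_id, shipping_limit_date) are never aggregated
price_and_freight = (data["order_items"]
                     .groupby("order_id")[["price", "freight_value"]]
                     .sum()
                     .reset_index())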
price_and_freight.info()
price_and_freight.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98666 entries, 0 to 98665
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   order_id       98666 non-null  object 
 1   price          98666 non-null  float64
 2   freight_value  98666 non-null  float64
dtypes: float64(2), object(1)
memory usage: 2.3+ MB


Unnamed: 0,order_id,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,58.9,13.29


In [29]:
from nbresult import ChallengeResult
result = ChallengeResult('price', shape=Order().get_price_and_freight().shape)
result.write(); print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/branchedelac/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/branchedelac/code/branchedelac/04-Decision-Science/02-Statistical-Inference/data-orders/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 1 item

test_price.py::TestPrice::test_price PASSED                              [100%]

💯 You can commit your code:

git add tests/price.pickle

git commit -m 'Completed price step'

git push origin master



### f) [OPTIONAL] `get_distance_seller_customer` 
**(Try to code this function only after finishing today's challenges. Otherwise, skip to the next section.)**

    ❓ Returns a DataFrame with:
        order_id, distance_seller_customer (the distance in km between customer and seller)

💡Have a look at the `haversine_distance` formula we coded for you in the `olist.utils` module

In [30]:
# Import olist data
from olist.data import Olist
olist = Olist()
data = olist.get_data()

In [31]:
# Merge orders and items
items_and_orders = data["order_items"].merge(data["orders"], how="left", on="order_id")[["order_id", "customer_id", "seller_id"]]
print(items_and_orders.shape)
items_and_orders.head(1)

(112650, 3)


Unnamed: 0,order_id,customer_id,seller_id
0,00010242fe8c5a6d1ba2dd792cb16214,3ce436f183e68e07877b285a838db11a,48436dade18ac8b2bce089ec2a041202


In [32]:
items_and_orders_with_zips = (
    items_and_orders.merge(data["customers"][["customer_id", "customer_zip_code_prefix"]],
    how="left", on="customer_id")
    .merge(data["sellers"][["seller_id", "seller_zip_code_prefix"]], how="left", on="seller_id")
          )

print(items_and_orders_with_zips.shape)
items_and_orders_with_zips.head(1)

(112650, 5)


Unnamed: 0,order_id,customer_id,seller_id,customer_zip_code_prefix,seller_zip_code_prefix
0,00010242fe8c5a6d1ba2dd792cb16214,3ce436f183e68e07877b285a838db11a,48436dade18ac8b2bce089ec2a041202,28013,27277


In [33]:
print(data["geolocation"].shape)
geolocation = data["geolocation"][["geolocation_zip_code_prefix", "geolocation_lat", "geolocation_lng"]].groupby("geolocation_zip_code_prefix").mean().reset_index()
print(geolocation.shape)
geolocation.head(1)

(1000163, 5)
(19015, 3)


Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng
0,1001,-23.55019,-46.634024


In [34]:
items_with_coords = items_and_orders_with_zips.merge(geolocation,
                how="left",
                left_on= "customer_zip_code_prefix",
                right_on ="geolocation_zip_code_prefix").merge(geolocation,
                how="left",
                left_on= "seller_zip_code_prefix",
                right_on ="geolocation_zip_code_prefix", suffixes=["_customer", "_seller"])



print(items_with_coords.shape)
items_with_coords.info()
items_with_coords.head(1)


(112650, 11)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 112650 entries, 0 to 112649
Data columns (total 11 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   order_id                              112650 non-null  object 
 1   customer_id                           112650 non-null  object 
 2   seller_id                             112650 non-null  object 
 3   customer_zip_code_prefix              112650 non-null  int64  
 4   seller_zip_code_prefix                112650 non-null  int64  
 5   geolocation_zip_code_prefix_customer  112348 non-null  float64
 6   geolocation_lat_customer              112348 non-null  float64
 7   geolocation_lng_customer              112348 non-null  float64
 8   geolocation_zip_code_prefix_seller    112397 non-null  float64
 9   geolocation_lat_seller                112397 non-null  float64
 10  geolocation_lng_seller                112397 non-null  

Unnamed: 0,order_id,customer_id,seller_id,customer_zip_code_prefix,seller_zip_code_prefix,geolocation_zip_code_prefix_customer,geolocation_lat_customer,geolocation_lng_customer,geolocation_zip_code_prefix_seller,geolocation_lat_seller,geolocation_lng_seller
0,00010242fe8c5a6d1ba2dd792cb16214,3ce436f183e68e07877b285a838db11a,48436dade18ac8b2bce089ec2a041202,28013,27277,28013.0,-21.762775,-41.309633,27277.0,-22.496953,-44.127492


In [35]:
from math import radians, sin, cos, asin, sqrt
import matplotlib.pyplot as plt
import seaborn as sns
def haversine_distance(lon1, lat1, lon2, lat2):
    """
    Compute distance between two pairs of coordinates (lon1, lat1, lon2, lat2)
    See - (https://en.wikipedia.org/wiki/Haversine_formula)
    """
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))


In [36]:
items_with_coords["distance"] = items_with_coords.apply(lambda x: haversine_distance(
    x["geolocation_lng_customer"],
    x["geolocation_lat_customer"],
    x["geolocation_lng_seller"],
    x["geolocation_lat_seller"]), axis=1)

print(items_with_coords.shape)
items_with_coords.head(1)

(112650, 12)


Unnamed: 0,order_id,customer_id,seller_id,customer_zip_code_prefix,seller_zip_code_prefix,geolocation_zip_code_prefix_customer,geolocation_lat_customer,geolocation_lng_customer,geolocation_zip_code_prefix_seller,geolocation_lat_seller,geolocation_lng_seller,distance
0,00010242fe8c5a6d1ba2dd792cb16214,3ce436f183e68e07877b285a838db11a,48436dade18ac8b2bce089ec2a041202,28013,27277,28013.0,-21.762775,-41.309633,27277.0,-22.496953,-44.127492,301.504681
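
Row-wise `apply` calls the Python function once per row, which can be slow on a table of this size. As a hedged alternative, the same formula can be written with NumPy so that it works on whole columns at once. The name `haversine_distance_vec` is hypothetical and not part of `olist.utils`.

```python
import numpy as np

def haversine_distance_vec(lon1, lat1, lon2, lat2):
    """Haversine distance in km; accepts scalars or array-likes given in degrees."""
    lon1, lat1, lon2, lat2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lon1, lat1, lon2, lat2)
    )
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# Rounded coordinates of the first row shown above (customer, then seller)
d = haversine_distance_vec(-41.309633, -21.762775, -44.127492, -22.496953)
```

Passing four DataFrame columns instead of scalars returns one distance per row, and missing coordinates propagate as `NaN`.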


👀 Check your new dataframe and commit your code to olist/order.py when it works. 

In [37]:
# get a distance per order
orders_with_distance = items_with_coords[["order_id", "distance"]].groupby("order_id").mean()
print(orders_with_distance.shape)


(98666, 1)


In [40]:
# Import olist data
from olist.data import Olist
olist = Olist()
data = olist.get_data()

🧪  Test your code

In [41]:
from nbresult import ChallengeResult

result = ChallengeResult('distance',
    mean = Order().get_distance_seller_customer()['distance_seller_customer'].mean())
result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/branchedelac/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/branchedelac/code/branchedelac/04-Decision-Science/02-Statistical-Inference/data-orders/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 1 item

test_distance.py::TestDistance::test_distance PASSED                     [100%]

💯 You can commit your code:

git add tests/distance.pickle

git commit -m 'Completed distance step'

git push origin master



## 2. All at once: `get_training_data`

❓ Time to code `get_training_data`, making use of your previously coded methods to gather all order features in one table

In [74]:
# YOUR CODE HERE
my_order = Order()
print(my_order)
my_order.data["orders"].head(1)

<olist.order.Order object at 0x7fce83d16200>


Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00


In [85]:
print("Expected shape:", "(96353, 12)")
print(my_order.get_wait_time().shape)
print(my_order.get_review_score().shape)
print(my_order.get_number_products().shape)
print(my_order.get_number_sellers().shape)
print(my_order.get_price_and_freight().shape)

Expected shape: (96353, 12)
(96478, 5)
(99224, 4)
(98666, 2)
(98666, 2)
(98666, 3)


In [95]:
aa = (my_order.get_wait_time()
                .merge(my_order.get_review_score(), how="inner", on="order_id")
                .merge(my_order.get_number_products(), how="inner", on="order_id")
                .merge(my_order.get_number_sellers(), how="inner", on="order_id")
                .merge(my_order.get_price_and_freight(), how="inner", on="order_id")
                ).dropna()

aa.head(1)

Unnamed: 0,order_id,wait_time,expected_wait_time,delay_vs_expected,order_status,dim_is_five_star,dim_is_one_star,review_score,number_of_products,number_of_sellers,price,freight_value
0,e481f51cbdc54678b7cc49136f2d6af7,8 days 10:28:40,15 days 13:03:27,0.0,delivered,0,0,4,1,1,29.99,8.72


In [96]:
print(aa.shape)

aa["order_id"].nunique()

(96353, 12)


95824
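
The merged frame has 96353 rows but only 95824 distinct `order_id` values, so some orders appear more than once. One plausible cause (an assumption, not verified here) is that an order can have several reviews, in which case an inner merge repeats the order's row once per matching review. A toy sketch of that effect with made-up values:

```python
import pandas as pd

# Made-up tables: order "a" has two reviews
waits = pd.DataFrame({"order_id": ["a", "b"], "wait_time": [8.4, 13.2]})
reviews_toy = pd.DataFrame({"order_id": ["a", "a", "b"], "review_score": [5, 1, 4]})

merged = waits.merge(reviews_toy, how="inner", on="order_id")

# Rows whose order_id already appeared earlier in the frame
n_repeated = merged["order_id"].duplicated().sum()
```

Running `duplicated()` on the real merged frame is a quick way to check whether this explanation holds.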

In [94]:
print(
    my_order.get_wait_time().merge(
        my_order.get_review_score(), how="inner", on="order_id").shape)

(96361, 8)


🧪  Test it below

In [97]:
from nbresult import ChallengeResult
from olist.order import Order
data = Order().get_training_data()

result = ChallengeResult('training',
    shape=data.shape,
    columns=sorted(list(data.columns))
)
result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/branchedelac/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/branchedelac/code/branchedelac/04-Decision-Science/02-Statistical-Inference/data-orders/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 2 items

test_training.py::TestTraining::test_training_data_columns PASSED        [ 50%]
test_training.py::TestTraining::test_training_data_shape PASSED          [100%]

💯 You can commit your code:

git add tests/training.pickle

git commit -m 'Completed training step'

git push origin master



🏁 Congratulations! 

💾 Commit and push your notebook before starting the next challenge.