# Expected Delivery

👇 Run the code below

In [1]:
import pandas as pd

data = pd.read_csv('data/order_deliveries.csv')

data.head()

Unnamed: 0,customer_state,seller_state,product_weight_g,product_length_cm,product_height_cm,product_width_cm,order_purchase_timestamp,order_delivered_customer_date
0,RJ,SP,1825,53,10,40,18/09/2017 20:11,28/09/2017 19:19
1,RJ,SP,700,65,18,28,16/10/2017 14:12,25/10/2017 16:43
2,RJ,SP,1825,53,10,40,14/04/2018 00:04,25/04/2018 23:10
3,RJ,SP,1825,53,10,40,10/10/2017 15:32,23/10/2017 12:09
4,RJ,SP,1825,53,10,40,26/09/2017 11:51,10/10/2017 18:05


Each observation of the dataset represents an item being delivered from a  `seller_state` to a `customer_state`. The columns describe the size and weight of each item. There are two columns that inform on the time the order was placed (`order_purchase_timestamp`) and it was delivered (`order_delivered_customer_date`).

The task is to to inform customers the **number of days until delivery** at the moment the order is placed. Because customers would rather a delivery arrive early than late, you should favor a model that **overshoots the predictions**.

## Target engineering

👇 Create the target by computing the time difference between `order_purchase_timestamp` and `order_delivered_customer_date`. Round it up to days.

<details>
<summary>💡 Hint</summary>
    
Convert each column to datetime and compute the difference.
    
[`to_datetime` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)
    
[`dt.days` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.days.html)
</details>

In [2]:
# To datetime
data['order_purchase_timestamp'] = pd.to_datetime(data['order_purchase_timestamp'])
data['order_delivered_customer_date'] = pd.to_datetime(data['order_delivered_customer_date'])

# Days until delivery
data['target'] = (data['order_delivered_customer_date'] - data['order_purchase_timestamp']).dt.days

## Feature preprocessing

👇 Perform the necessary preprocessing on the features

<details>
<summary>💡 Hints </summary>
    
- One-hot-encode customer and seller state information with [get_dummies()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)

    
- Scale product details 
    
</details>

In [3]:
from sklearn.preprocessing import MinMaxScaler

# Scale product features
scaler = MinMaxScaler()
data['product_weight_g'],data['product_length_cm'],data['product_height_cm'],data['product_width_cm']= scaler.fit_transform(data[['product_weight_g','product_length_cm','product_height_cm','product_width_cm']]).T

# One hot encode state features
data = pd.get_dummies(data,columns = ['customer_state','seller_state'] )

data.head()

Unnamed: 0,product_weight_g,product_length_cm,product_height_cm,product_width_cm,order_purchase_timestamp,order_delivered_customer_date,target,customer_state_AL,customer_state_AM,customer_state_AP,...,customer_state_RJ,customer_state_RN,customer_state_RO,customer_state_RS,customer_state_SC,customer_state_SE,customer_state_SP,customer_state_TO,seller_state_SC,seller_state_SP
0,0.057692,0.402439,0.056818,0.271028,2017-09-18 20:11:00,2017-09-28 19:19:00,9,0,0,0,...,1,0,0,0,0,0,0,0,0,1
1,0.020067,0.54878,0.147727,0.158879,2017-10-16 14:12:00,2017-10-25 16:43:00,9,0,0,0,...,1,0,0,0,0,0,0,0,0,1
2,0.057692,0.402439,0.056818,0.271028,2018-04-14 00:04:00,2018-04-25 23:10:00,11,0,0,0,...,1,0,0,0,0,0,0,0,0,1
3,0.057692,0.402439,0.056818,0.271028,2017-10-10 15:32:00,2017-10-23 12:09:00,12,0,0,0,...,1,0,0,0,0,0,0,0,0,1
4,0.057692,0.402439,0.056818,0.271028,2017-09-26 11:51:00,2017-10-10 18:05:00,14,0,0,0,...,1,0,0,0,0,0,0,0,0,1


## Linear Regression

👇 Train a `LinearRegression` model and make cross validated predictions.

[`cross_val_predict` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html)

In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
import numpy as np

# Prepare X and y
X = data.drop(columns = ['order_purchase_timestamp','order_delivered_customer_date','target'])
y = data.target

# Instanciate model
lin_model = LinearRegression()

# Make cross-validated predictions 
y_pred_lin = cross_val_predict(lin_model, X, y, cv=10)

👇 Engineer a scoring metric that preserves the magnitude of the target and the direction of the errors made. Encapsulate the scoring metric in a function.

<details>
<summary>💡 Hint</summary>
    
Computing the mean differences between predicted and true values is a simple way to preserve direction and magnitude.
    
</details>

In [5]:
def directed_error(y, y_pred):
    return np.mean(y_pred - y)

directed_error(y, y_pred_lin)

-0.014107806424457863

##  KNN Regressor

👇 Train a `KNNRegressor` model and use your scoring function to evaluate its performance.

In [6]:
from sklearn.neighbors import KNeighborsRegressor

# Instanciate model
knn_model = KNeighborsRegressor(n_neighbors=3)

# Make cross-validated predictions 
y_pred_knn = cross_val_predict(knn_model, X, y, cv=10)

directed_error(y, y_pred_knn)

0.37533333333333313

## Model Selection

❓ Which of the two models would you chose for the task and based on your metric's score?

<details>
<summary>Answer</summary>

The model best suited for the task is the one with the positive error (or least negative). If the average error is positive, the model tends to overshoot delivery.
</details>

# 🏁