# Target variable calculation
In some cases it helps to calculate the target variable before moving on to data exploration, since it may reveal extra insights: for example, whether the variables' distributions depend on the target variable.


## Target variable definition
Our target variable is ***is_returning_customer***, a boolean variable that tells whether a customer is going to order again in the following 6 months.

The main events here are orders. Between two orders of a specific customer, most features do not change: only the time features move, while other features, such as the payment methods the customer has historically used or the platform they ordered from, stay fixed until the following order.

So in light of this main event, we can further define the target variable as: for each customer, for each order, it tells whether there is going to be another order within 6 months of the order date.

^That was my initial thinking

**BUT** I thought such modelling may have a few issues application- and usage-wise. For example, it does not take into consideration the time that has passed since the last order.

In other words, the model will not be able to give us information relative to the time of running prediction, only to the time of the last order.

That's why I figured it could make more sense to predict the time-to-next-order.

In that case, application-wise, for a specific customer, we predict if they are going to purchase again or not, by predicting their time-to-next-order, and see if it is within 6 months from the time of running the prediction.
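As a rough illustration of that idea, the sketch below turns a predicted time-to-next-order into a yes/no answer relative to the time of running the prediction. The helper name `will_return_within_horizon` and its parameters are hypothetical, and the 180-day horizon is an assumption. A predicted date that already lies in the past is a case the text above does not settle; this sketch simply counts it as returning.

```python
import pandas as pd

HORIZON = pd.Timedelta(days=180)  # assumption: 6 months = 180 days


def will_return_within_horizon(last_order_time, predicted_hours_to_next_order, prediction_time):
    """Hypothetical helper: is the predicted next order within HORIZON of prediction_time?"""
    predicted_next_order = last_order_time + pd.Timedelta(hours=predicted_hours_to_next_order)
    return predicted_next_order <= prediction_time + HORIZON


last_order = pd.Timestamp("2016-03-08 21:00")
predicted_hours = 8418  # illustrative regression output, in hours

# Same customer, same prediction, two different times of running the prediction
early_answer = will_return_within_horizon(last_order, predicted_hours, pd.Timestamp("2016-06-01"))
late_answer = will_return_within_horizon(last_order, predicted_hours, pd.Timestamp("2016-10-01"))
```

The same predicted gap gives different answers depending on when the prediction is run, which is exactly the information the per-order classification target cannot provide.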

But I will implement both approaches (classification and regression) and examine the performance of both.

### Question?
Here comes an important point of definition: Are failed orders... orders? Meaning: should they be counted?

So, suppose a customer makes an order **x**, and less than 6 months later they make an order **y**.

If either **x** or **y** (or both) were failed orders, would the customer still count as returning?

As per my understanding, they should, for multiple reasons:
1. From a business point of view, there is value in keeping customers on our platform, even if their orders sometimes fail. In the long term, still having them in the customer base increases the likelihood of more purchases.
2. The specification does not say that we want to predict a "successful" order in the following 6 months, just any order.
3. If we don't count failed orders, we may lose the information in these orders.

However, it's not very complicated to implement both scenarios (counting and ignoring failed orders), so I'm going to implement the logic for both cases, with a boolean config variable **COUNT_FAILED_ORDERS** to determine which definition to use.
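To make the difference between the two definitions concrete, here is a minimal sketch on toy data. The helper `label_returning` and the toy frame are hypothetical, not the project's actual function. Unlike the real pipeline, this sketch drops failed orders from the result when they are ignored, purely to keep the example short.

```python
import pandas as pd


def label_returning(orders, count_failed_orders, horizon_hours=4320):
    """Hypothetical helper: label each kept order by whether another kept order follows in time."""
    kept = orders if count_failed_orders else orders[orders["is_failed"] == 0]
    kept = kept.sort_values(["customer_id", "order_time"]).copy()
    # Time of the same customer's following order (NaT for their last order)
    next_time = kept.groupby("customer_id")["order_time"].shift(-1)
    hours_to_next = (next_time - kept["order_time"]).dt.total_seconds() / 3600
    # Comparisons against NaN are False, so last orders get 0
    kept["is_returning_customer"] = (hours_to_next <= horizon_hours).astype(int)
    return kept


# Toy data: a successful order x followed one month later by a failed order y
orders = pd.DataFrame({
    "customer_id": ["a", "a"],
    "order_time": pd.to_datetime(["2016-01-01 10:00", "2016-02-01 10:00"]),
    "is_failed": [0, 1],
})

with_failed = label_returning(orders, count_failed_orders=True)
without_failed = label_returning(orders, count_failed_orders=False)
```

With failed orders counted, order x is labelled as returning; with failed orders ignored, the customer has a single order and is labelled as not returning.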

## Calculation method
1. We make sure the data is truly sorted by customer_id, order_date, order_hour.
2. We group by customer_id.
3. We calculate the order time difference (in hours) between each order and the previous order of the same customer.

At this point, each order carries the time difference to the previous order: how long in the past that order was. What we want is the difference assigned to the earlier order of each pair: how long in the future the following order is going to be. So...

4. We shift the newly-calculated time difference one row up.

5. (For the classification approach) The target variable is whether the date difference variable is <= 4320 hours (180 days).
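The steps above can be sketched on a toy frame as follows. This is a minimal illustration, not the code in calculation_functions: the toy data is made up, and the column names are borrowed from the output shown further down in this notebook.

```python
import pandas as pd

# Toy data, deliberately unsorted
orders = pd.DataFrame({
    "customer_id": ["a", "b", "a", "a"],
    "order_date": ["2016-01-03", "2016-05-05", "2016-01-01", "2016-12-01"],
    "order_hour": [12, 20, 10, 9],
})
orders["order_time"] = pd.to_datetime(orders["order_date"]) + pd.to_timedelta(orders["order_hour"], unit="h")

# Step 1: sort by customer, then chronologically
orders = orders.sort_values(["customer_id", "order_time"]).reset_index(drop=True)

# Steps 2-3: per customer, hours elapsed since the previous order
orders["time_since_last_order"] = (
    orders.groupby("customer_id")["order_time"].diff().dt.total_seconds() / 3600
)

# Step 4: shift one row up within each customer, giving hours until the following order
orders["time_to_next_order"] = orders.groupby("customer_id")["time_since_last_order"].shift(-1)

# Step 5: classification target, 4320 hours = 180 days (NaN compares as False, so it becomes 0)
orders["is_returning_customer"] = (orders["time_to_next_order"] <= 4320).astype(int)
```

Shifting inside the groupby keeps one customer's value from leaking into another customer's last order.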

**Important calculation note:** From the definition of the data (and also from my initial exploration), failed orders are not counted in the customer_order_rank. So if **COUNT_FAILED_ORDERS** is set to True (which is going to be its default value), the rank will need to be recalculated.
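One possible way to recalculate the rank so that every order counts is a per-customer running count. The sketch below uses `groupby(...).cumcount()` on made-up data; it is an alternative formulation for illustration, not necessarily what the project code does.

```python
import pandas as pd

# Toy data: the failed order has no rank in the raw data
orders = pd.DataFrame({
    "customer_id": ["a", "a", "a"],
    "order_time": pd.to_datetime(["2016-09-01 14:00", "2016-09-02 20:00", "2016-09-04 14:00"]),
    "is_failed": [1, 0, 0],
    "customer_order_rank": [None, 1.0, 2.0],
})

orders = orders.sort_values(["customer_id", "order_time"])
# cumcount() numbers rows 0, 1, 2, ... within each customer; +1 makes the rank start at 1
orders["customer_order_rank"] = orders.groupby("customer_id").cumcount() + 1
```

After this, the failed first order gets rank 1 and the two successful orders move to ranks 2 and 3.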

**Assumption:** 6 months = 180 days, which is not always the case, so I will add it to the list of assumptions.
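To see how much this assumption can matter, the small check below compares a fixed 180-day window with a calendar 6-month window for one example timestamp (chosen only for illustration).

```python
import pandas as pd

order_time = pd.Timestamp("2016-03-08 21:00")

fixed_window_end = order_time + pd.Timedelta(days=180)       # the assumption used here
calendar_window_end = order_time + pd.DateOffset(months=6)   # calendar-aware alternative

gap_in_days = (calendar_window_end - fixed_window_end).days
```

For this date the two windows end four days apart, so an order landing in that gap would be labelled differently under the two definitions.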

# Config

In [1]:
COUNT_FAILED_ORDERS = True

# Imports

In [2]:
import pandas as pd

import sys
sys.path.append("..") #There are better ways to do this
from dragon_fruit.calculation_functions.CalculateFeatures import calculate_time_between_orders

# Reading Data

In [3]:
Data = pd.read_csv("../data/machine_learning_challenge_order_data.csv.gz")
Data

Unnamed: 0,customer_id,order_date,order_hour,customer_order_rank,is_failed,voucher_amount,delivery_fee,amount_paid,restaurant_id,city_id,payment_id,platform_id,transmission_id
0,000097eabfd9,2015-06-20,19,1.0,0,0.0,0.000,11.46960,5803498,20326,1779,30231,4356
1,0000e2c6d9be,2016-01-29,20,1.0,0,0.0,0.000,9.55800,239303498,76547,1619,30359,4356
2,000133bb597f,2017-02-26,19,1.0,0,0.0,0.493,5.93658,206463498,33833,1619,30359,4324
3,00018269939b,2017-02-05,17,1.0,0,0.0,0.493,9.82350,36613498,99315,1619,30359,4356
4,0001a00468a6,2015-08-04,19,1.0,0,0.0,0.493,5.15070,225853498,16456,1619,29463,4356
...,...,...,...,...,...,...,...,...,...,...,...,...,...
786595,fffe9d5a8d41,2016-09-30,20,,1,0.0,0.000,10.72620,983498,10346,1779,29463,212
786596,ffff347c3cfa,2016-08-17,21,1.0,0,0.0,0.000,7.59330,52893498,41978,1619,30359,4356
786597,ffff347c3cfa,2016-09-15,21,2.0,0,0.0,0.000,5.94720,164653498,41978,1619,30359,4356
786598,ffff4519b52d,2016-04-02,19,1.0,0,0.0,0.000,21.77100,16363498,80562,1491,29751,4228


# Main Logic
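The code used to live here, but it was moved to its own file in calculation_functions. Here is a view of it. Note that this snapshot works in whole days (day_diff), whereas the function's output below reports time_to_next_order in hours: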

<code>
#I also had to sort by customer_order_rank, because some orders were placed on the same day, at the same hour,
#so the only information about their true order was in the rank.
Data = Data.sort_values(['customer_id', 'order_date', 'order_hour','customer_order_rank'])

#Converting order_date to datetime
Data['order_date'] = pd.to_datetime(Data.order_date)

if COUNT_FAILED_ORDERS:
    #For recalculating the rank
    Data['new_rank'] = 1 #Every order counts!
    Grouped = Data.groupby('customer_id').agg({"order_date":"diff", "new_rank":"cumsum"})
    Data = Data.drop(['customer_order_rank','new_rank'],axis=1)
    Grouped = Grouped.rename({"new_rank":"customer_order_rank"},axis=1)
else:
    Grouped = Data[Data.is_failed==0].groupby('customer_id').agg({"order_date":"diff"})

Grouped = Grouped.rename({"order_date":"day_diff"},axis=1)
Grouped.day_diff = Grouped.day_diff.shift(-1)
Grouped.day_diff = Grouped.day_diff.dt.days

Data = pd.merge(Data,Grouped, left_index=True, right_index=True,how='left')

Data['is_returning_customer'] = 0
Data.loc[Data.day_diff<=180,'is_returning_customer'] = 1

Data
</code>

In [4]:
Data = calculate_time_between_orders(Data,COUNT_FAILED_ORDERS)
Data

Unnamed: 0,customer_id,order_date,order_hour,is_failed,voucher_amount,delivery_fee,amount_paid,restaurant_id,city_id,payment_id,platform_id,transmission_id,order_time,time_since_last_order,customer_order_rank,time_to_next_order,is_returning_customer
0,000097eabfd9,2015-06-20,19,0,0.0,0.000,11.46960,5803498,20326,1779,30231,4356,2015-06-20 19:00:00,,1,,0
1,0000e2c6d9be,2016-01-29,20,0,0.0,0.000,9.55800,239303498,76547,1619,30359,4356,2016-01-29 20:00:00,,1,,0
2,000133bb597f,2017-02-26,19,0,0.0,0.493,5.93658,206463498,33833,1619,30359,4324,2017-02-26 19:00:00,,1,,0
3,00018269939b,2017-02-05,17,0,0.0,0.493,9.82350,36613498,99315,1619,30359,4356,2017-02-05 17:00:00,,1,,0
4,0001a00468a6,2015-08-04,19,0,0.0,0.493,5.15070,225853498,16456,1619,29463,4356,2015-08-04 19:00:00,,1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
786595,fffe9d5a8d41,2016-09-30,20,1,0.0,0.000,10.72620,983498,10346,1779,29463,212,2016-09-30 20:00:00,0.0,3,,0
786596,ffff347c3cfa,2016-08-17,21,0,0.0,0.000,7.59330,52893498,41978,1619,30359,4356,2016-08-17 21:00:00,,1,696.0,1
786597,ffff347c3cfa,2016-09-15,21,0,0.0,0.000,5.94720,164653498,41978,1619,30359,4356,2016-09-15 21:00:00,696.0,2,,0
786598,ffff4519b52d,2016-04-02,19,0,0.0,0.000,21.77100,16363498,80562,1491,29751,4228,2016-04-02 19:00:00,,1,,0


# Testing
Checking a few examples to make sure the logic works as expected:

In [5]:
Check_Logic_df = Data.groupby('customer_id').agg({"order_date":"count", "is_failed":"sum"})
Check_Logic_df[(Check_Logic_df.order_date>1) & (Check_Logic_df.is_failed>1)]

customer_id,order_date,is_failed
0011f9b2693b,6,2
004136071043,137,3
004182ce338f,27,2
005a37d50057,68,4
0063666607bb,273,7
...,...,...
ffbca9c1cc9c,8,6
ffc57fdc93fe,16,2
ffcdbbc627fe,41,20
ffe2a6942fd2,41,8


In [6]:
Data[Data.customer_id == '000d48ed2b1a']
#Ok

Unnamed: 0,customer_id,order_date,order_hour,is_failed,voucher_amount,delivery_fee,amount_paid,restaurant_id,city_id,payment_id,platform_id,transmission_id,order_time,time_since_last_order,customer_order_rank,time_to_next_order,is_returning_customer
192,000d48ed2b1a,2015-06-04,18,0,0.0,0.0,21.5055,87993498,51602,1619,29815,4324,2015-06-04 18:00:00,,1,2423.0,1
193,000d48ed2b1a,2015-09-13,17,0,0.0,0.0,17.0982,87993498,51602,1619,29815,4356,2015-09-13 17:00:00,2423.0,2,748.0,1
194,000d48ed2b1a,2015-10-14,21,0,0.0,0.0,16.9389,87993498,51602,1619,29815,4356,2015-10-14 21:00:00,748.0,3,1125.0,1
195,000d48ed2b1a,2015-11-30,18,0,0.0,0.0,23.4702,87993498,51602,1619,29815,4356,2015-11-30 18:00:00,1125.0,4,263.0,1
196,000d48ed2b1a,2015-12-11,17,0,0.0,0.0,15.4521,181163498,51602,1619,29815,4356,2015-12-11 17:00:00,263.0,5,2116.0,1
197,000d48ed2b1a,2016-03-08,21,0,0.0,0.0,9.3987,87993498,51602,1619,29815,4356,2016-03-08 21:00:00,2116.0,6,8418.0,0
198,000d48ed2b1a,2017-02-22,15,0,1.029,0.0,7.965,312033498,51602,1779,29815,4356,2017-02-22 15:00:00,8418.0,7,120.0,1
199,000d48ed2b1a,2017-02-27,15,0,0.0,0.986,10.4076,87993498,51602,1619,29815,4356,2017-02-27 15:00:00,120.0,8,,0


In [7]:
Data[Data.customer_id=='0011f9b2693b']
#Ok

Unnamed: 0,customer_id,order_date,order_hour,is_failed,voucher_amount,delivery_fee,amount_paid,restaurant_id,city_id,payment_id,platform_id,transmission_id,order_time,time_since_last_order,customer_order_rank,time_to_next_order,is_returning_customer
230,0011f9b2693b,2016-09-01,14,1,0.0,0.98107,14.00247,159553498,39335,1779,29463,212,2016-09-01 14:00:00,,1,30.0,1
231,0011f9b2693b,2016-09-02,20,0,0.0,0.98107,8.16678,159553498,39335,1779,29463,4196,2016-09-02 20:00:00,30.0,2,42.0,1
232,0011f9b2693b,2016-09-04,14,0,0.0,0.98107,12.67497,159553498,39335,1779,29463,4196,2016-09-04 14:00:00,42.0,3,21.0,1
233,0011f9b2693b,2016-09-05,11,1,0.0,0.986,8.7615,187213498,39335,1779,29463,212,2016-09-05 11:00:00,21.0,4,220.0,1
234,0011f9b2693b,2016-09-14,15,0,0.0,0.98107,27.73413,159553498,39335,1619,29463,4196,2016-09-14 15:00:00,220.0,5,1132.0,1
235,0011f9b2693b,2016-10-31,19,0,0.0,0.98107,17.18316,159553498,39335,1619,29463,4196,2016-10-31 19:00:00,1132.0,6,,0
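The manual spot checks above could also be expressed as assertions. The sketch below is one possible set of invariants, assuming the column names shown in the outputs; the helper `check_target_columns` is hypothetical, and the sample rows are copied from the last three orders of customer 000d48ed2b1a.

```python
import pandas as pd


def check_target_columns(df, horizon_hours=4320):
    """Hypothetical sanity checks on the target columns; raises AssertionError on failure."""
    df = df.sort_values(["customer_id", "order_time"])

    # The last order of every customer has no following order
    last_orders = df.groupby("customer_id").tail(1)
    assert last_orders["time_to_next_order"].isna().all()

    # The flag is 1 exactly when the next order is within the horizon
    expected_flag = (df["time_to_next_order"] <= horizon_hours).astype(int)
    assert (df["is_returning_customer"] == expected_flag).all()

    # time_to_next_order equals the next row's time_since_last_order (-1 stands in for NaN)
    shifted = df.groupby("customer_id")["time_since_last_order"].shift(-1)
    assert shifted.fillna(-1).eq(df["time_to_next_order"].fillna(-1)).all()


sample = pd.DataFrame({
    "customer_id": ["000d48ed2b1a"] * 3,
    "order_time": pd.to_datetime(["2016-03-08 21:00", "2017-02-22 15:00", "2017-02-27 15:00"]),
    "time_since_last_order": [2116.0, 8418.0, 120.0],
    "time_to_next_order": [8418.0, 120.0, float("nan")],
    "is_returning_customer": [0, 1, 0],
})
check_target_columns(sample)  # passes silently for these rows
```

Running the same function on the full Data frame would cover every customer rather than a couple of hand-picked ones.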
