# Logit Orders - A warm-up challenge (~1h)

Let's figure out the impact of `wait_time` and `delay_vs_expected` on very good and very bad reviews.

Using our `orders` training set, we will run two multivariate logistic regressions (`logit_one` and `logit_five`) to predict `dim_is_one_star` and `dim_is_five_star` respectively.

 

In [31]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import numpy as np

❓ Import your dataset

In [2]:
from olist.order import Order
orders = Order().get_training_data(with_distance_seller_customer=True)

In [3]:
orders.head(3)

Unnamed: 0,order_id,wait_time,expected_wait_time,delay_vs_expected,order_status,dim_is_five_star,dim_is_one_star,review_score,number_of_products,number_of_sellers,price,freight_value,distance_seller_customer
0,e481f51cbdc54678b7cc49136f2d6af7,8.0,15,0.0,delivered,0,0,4,1,1,29.99,8.72,18.063837
1,53cdb2fc8bc7dce0b6741e2150273451,13.0,19,0.0,delivered,0,0,4,1,1,118.7,22.76,856.29258
2,47770eb9100c2d0c44946d9cf07ec65d,9.0,26,0.0,delivered,1,0,5,1,1,159.9,19.22,514.130333


In [4]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96525 entries, 0 to 96532
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   order_id                  96525 non-null  object 
 1   wait_time                 96525 non-null  float64
 2   expected_wait_time        96525 non-null  int64  
 3   delay_vs_expected         96525 non-null  float64
 4   order_status              96525 non-null  object 
 5   dim_is_five_star          96525 non-null  int64  
 6   dim_is_one_star           96525 non-null  int64  
 7   review_score              96525 non-null  int64  
 8   number_of_products        96525 non-null  int64  
 9   number_of_sellers         96525 non-null  int64  
 10  price                     96525 non-null  float64
 11  freight_value             96525 non-null  float64
 12  distance_seller_customer  96525 non-null  float64
dtypes: float64(5), int64(6), object(2)
memory usage: 10.3+ MB


❓ Select which features you want to use (avoid data leaks)

In [9]:
features = orders[['dim_is_one_star',
                   'dim_is_five_star',
                   'wait_time',
                   'delay_vs_expected',
                   'number_of_products',
                   'distance_seller_customer']].copy()

In [10]:
features.head(3)

Unnamed: 0,dim_is_one_star,dim_is_five_star,wait_time,delay_vs_expected,number_of_products,distance_seller_customer
0,0,0,8.0,0.0,1,18.063837
1,0,0,13.0,0.0,1,856.29258
2,0,1,9.0,0.0,1,514.130333


In [11]:
features.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96525 entries, 0 to 96532
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   dim_is_one_star           96525 non-null  int64  
 1   dim_is_five_star          96525 non-null  int64  
 2   wait_time                 96525 non-null  float64
 3   delay_vs_expected         96525 non-null  float64
 4   number_of_products        96525 non-null  int64  
 5   distance_seller_customer  96525 non-null  float64
dtypes: float64(3), int64(3)
memory usage: 5.2 MB


❓ Check the multicollinearity of your features using the `VIF index`. It shouldn't be too high (preferably < 10), so that we can trust the partial regression coefficients and their associated `p-values`.

In [12]:
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

In [13]:
df = pd.DataFrame()
df["vif_index"] = [vif(features.values, i) for i in range(features.shape[1])]
df["features"] = features.columns
df

Unnamed: 0,vif_index,features
0,1.42191,dim_is_one_star
1,2.167241,dim_is_five_star
2,5.641868,wait_time
3,2.1131,delay_vs_expected
4,3.187753,number_of_products
5,2.67225,distance_seller_customer
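The table above includes the two target columns, and as far as we know `variance_inflation_factor` does not add an intercept on its own. As a hedged illustration of what the index measures, here is a minimal NumPy-only sketch that computes VIF_j = 1 / (1 - R²_j) for explanatory features only. The data is synthetic and the helper `vif_table` is hypothetical; only the column names are borrowed from the notebook.

```python
import numpy as np

def vif_table(X, names):
    """Return {name: VIF}. VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
    regressing column j on all the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    out = {}
    for j, name in enumerate(names):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # intercept + other features
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[name] = 1.0 / (1.0 - r2)
    return out

# Synthetic stand-in for the orders data (assumption: made-up distributions)
rng = np.random.default_rng(42)
n = 2000
wait_time = rng.gamma(2.0, 6.0, n)
# Delay is positive only when the order takes much longer than usual
delay_vs_expected = np.clip(wait_time - 20 + rng.normal(0, 3, n), 0, None)
number_of_products = rng.integers(1, 4, n).astype(float)

X = np.column_stack([wait_time, delay_vs_expected, number_of_products])
vifs = vif_table(X, ["wait_time", "delay_vs_expected", "number_of_products"])
for name, value in vifs.items():
    print(f"{name:20s} {value:.3f}")
```

In this toy data, `wait_time` and `delay_vs_expected` are correlated by construction, so their VIF is higher than that of the independent `number_of_products`.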


❓ Fit two LOGIT models (`logit_one` and `logit_five`) to predict `dim_is_one_star` and `dim_is_five_star`

In [18]:
logit_one = smf.logit(formula='dim_is_one_star ~ wait_time + delay_vs_expected + number_of_products + distance_seller_customer', data=features).fit()
print(logit_one.summary())

Optimization terminated successfully.
         Current function value: 0.282138
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:        dim_is_one_star   No. Observations:                96525
Model:                          Logit   Df Residuals:                    96520
Method:                           MLE   Df Model:                            4
Date:                Thu, 06 May 2021   Pseudo R-squ.:                  0.1356
Time:                        12:42:48   Log-Likelihood:                -27233.
converged:                       True   LL-Null:                       -31505.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                   -3.7953      0.032   -118.885      0.000      -3.858

In [19]:
logit_five = smf.logit(formula='dim_is_five_star ~ wait_time + delay_vs_expected + number_of_products + distance_seller_customer', data=features).fit()
print(logit_five.summary())

Optimization terminated successfully.
         Current function value: 0.639420
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:       dim_is_five_star   No. Observations:                96525
Model:                          Logit   Df Residuals:                    96520
Method:                           MLE   Df Model:                            4
Date:                Thu, 06 May 2021   Pseudo R-squ.:                 0.05583
Time:                        12:43:14   Log-Likelihood:                -61720.
converged:                       True   LL-Null:                       -65370.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                    1.3360      0.021     63.684      0.000       1.295

In [22]:
# summary2() reports AIC/BIC and the full coefficient table shown below
logit_five.summary2()

0,1,2,3
Model:,Logit,Pseudo R-squared:,0.056
Dependent Variable:,dim_is_five_star,AIC:,123450.1085
Date:,2021-05-06 13:01,BIC:,123497.4963
No. Observations:,96525,Log-Likelihood:,-61720.0
Df Model:,4,LL-Null:,-65370.0
Df Residuals:,96520,LLR p-value:,0.0
Converged:,1.0000,Scale:,1.0
No. Iterations:,7.0000,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,1.3360,0.0210,63.6839,0.0000,1.2949,1.3771
wait_time,-0.0543,0.0012,-44.5006,0.0000,-0.0566,-0.0519
delay_vs_expected,-0.0942,0.0050,-18.6588,0.0000,-0.1040,-0.0843
number_of_products,-0.3274,0.0137,-23.9833,0.0000,-0.3542,-0.3007
distance_seller_customer,0.0001,0.0000,11.2323,0.0000,0.0001,0.0002


In [33]:
# Exponentiate the log-odds coefficients to get odds ratios
np.exp(logit_one.params)

Intercept                   0.022475
wait_time                   1.075566
delay_vs_expected           1.056075
number_of_products          1.748255
distance_seller_customer    0.999705
dtype: float64

In [32]:
# Exponentiate the log-odds coefficients to get odds ratios
np.exp(logit_five.params)

Intercept                   3.803760
wait_time                   0.947187
delay_vs_expected           0.910144
number_of_products          0.720759
distance_seller_customer    1.000149
dtype: float64
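The exponentiated coefficients above are odds ratios: a value of 0.91 means the odds are multiplied by 0.91 for each one-unit increase in the feature, all else being equal. As a rough sketch of one way to read them, the snippet below converts the odds ratios printed above into percent changes in odds, then compares the two models on the log-odds scale. The helper `pct_change_in_odds` and the dictionary names are hypothetical; the numbers are copied from the outputs above.

```python
import math

# Odds ratios copied from the two np.exp(...) outputs above
odds_ratios = {
    "logit_one": {"wait_time": 1.075566, "delay_vs_expected": 1.056075},
    "logit_five": {"wait_time": 0.947187, "delay_vs_expected": 0.910144},
}

def pct_change_in_odds(odds_ratio):
    """Percent change in the odds for a one-unit increase in the feature."""
    return (odds_ratio - 1.0) * 100.0

for model, ratios in odds_ratios.items():
    for feature, ratio in ratios.items():
        print(f"{model:10s} {feature:18s} {pct_change_in_odds(ratio):+.1f}% odds per unit")

# Same feature, two models: compare absolute effects on the log-odds scale
stronger_on_five = {
    feature: abs(math.log(odds_ratios["logit_five"][feature]))
    > abs(math.log(odds_ratios["logit_one"][feature]))
    for feature in ("wait_time", "delay_vs_expected")
}
print(stronger_on_five)
```

With these numbers, `delay_vs_expected` has the larger absolute coefficient in `logit_five`, while `wait_time` has the larger one in `logit_one`. The comparison is made feature by feature across models; comparing different features with each other would first require putting them on a common scale.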

❓ Interpret your results:

- Interpret the partial coefficients in your own words.
- Check their statistical significance with `p-values`.
- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient importance?

In [34]:
# Among the following sentences, store the ones that are true in the list below

a = "delay_vs_expected influences five_star ratings even more than one_star ratings"
b = "wait_time influences five_star ratings even more more than one_star"

your_answer = [a]

🧪 __Test your code__

In [35]:
from nbresult import ChallengeResult

result = ChallengeResult('logit',
    answers = your_answer
)
result.write()
print(result.check())

platform darwin -- Python 3.8.6, pytest-6.2.3, py-1.10.0, pluggy-0.13.1 -- /Users/smrack/.pyenv/versions/3.8.6/envs/lewagon/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/smrack/code/olushO/data-challenges/04-Decision-Science/04-Logistic-Regression/01-Logit
plugins: anyio-2.2.0, dash-1.20.0
collecting ... collected 1 item

tests/test_logit.py::TestLogit::test_question PASSED                     [100%]



💯 You can commit your code:

[1;32mgit[39m add tests/logit.pickle

[32mgit[39m commit -m [33m'Completed logit step'[39m

[32mgit[39m push origin master


<details>
    <summary>Explanations</summary>


> _All other things being equal, a delay tends to strip an order of its 5-star review even more than it raises the chances of a 1-star review (odds ratio of about 0.91 per unit of `delay_vs_expected` for 5 stars, versus about 1.06 for 1 star). This is probably because 1-star reviews mostly target bad products themselves, not bad deliveries._
    
</details>


❓ How do these regression coefficients compare with an OLS on `review_score` using the same features? Double-check that both the OLS and the Logit analyses tell approximately "the same story".
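One possible way to set up this comparison is sketched below on synthetic data, so the numbers are not those of the real `orders` dataset. The data-generating process (a latent satisfaction score that decreases with `wait_time` and `delay_vs_expected`) is purely an assumption for illustration, and the names `toy` and `comparison` are hypothetical. OLS coefficients are in review-score points and Logit coefficients are in log-odds, so only signs and relative ordering are comparable.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the orders data (assumption: made-up distributions)
rng = np.random.default_rng(0)
n = 5000
wait_time = rng.gamma(2.0, 6.0, n)
delay_vs_expected = np.clip(wait_time - 20 + rng.normal(0, 3, n), 0, None)

# Assumed latent satisfaction, rounded and clipped into a 1-5 review score
latent = 4.6 - 0.04 * wait_time - 0.08 * delay_vs_expected + rng.normal(0, 1, n)
review_score = np.clip(np.round(latent), 1, 5).astype(int)

toy = pd.DataFrame({
    "wait_time": wait_time,
    "delay_vs_expected": delay_vs_expected,
    "review_score": review_score,
    "dim_is_five_star": (review_score == 5).astype(int),
})

# Same features, two models: OLS on the score, Logit on the five-star flag
ols = smf.ols("review_score ~ wait_time + delay_vs_expected", data=toy).fit()
logit_five_toy = smf.logit(
    "dim_is_five_star ~ wait_time + delay_vs_expected", data=toy
).fit(disp=False)

# Side-by-side coefficients, aligned on the parameter names
comparison = pd.DataFrame({"ols": ols.params, "logit_five": logit_five_toy.params})
print(comparison)
```

On the real data, the same pattern would apply with the notebook's `features` DataFrame and the full four-feature formula, after adding `review_score` to the selected columns.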

### 🏁 Congratulations! Don't forget to commit and push your notebook