# Logit Orders - A warm-up challenge (~1h)

Let's figure out the impact of `wait_time` and `delay_vs_expected` on very good and very bad reviews.

Using our `orders` training_set, we will run two multivariate logistic regressions (`logit_one` and `logit_five`) to predict `dim_is_one_star` and `dim_is_five_star` respectively.
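
For reference, a logistic regression models the log-odds of a binary target as a linear function of the features. Here it is written for `dim_is_one_star` with the two features named above. The symbols $p$, $\beta_0$, $\beta_1$ and $\beta_2$ are notation introduced for this sketch, not taken from the challenge:

```latex
\log\frac{p}{1-p}
  = \beta_0
  + \beta_1 \cdot \texttt{wait\_time}
  + \beta_2 \cdot \texttt{delay\_vs\_expected},
\qquad
p = P(\texttt{dim\_is\_one\_star} = 1)
```

Under this model, a one-unit increase in a feature multiplies the odds $p/(1-p)$ by $e^{\beta_k}$, so a positive coefficient raises the odds of the target and a negative one lowers them.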

 

In [1]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

❓ Import your dataset

In [57]:
from olist.data import Olist
from olist.order import Order

In [61]:
data = Order().get_training_data()
data.head()

Unnamed: 0,order_id,wait_time,expected_wait_time,delay_vs_expected,order_status,dim_is_five_star,dim_is_one_star,review_score,number_of_products,number_of_sellers,price,freight_value
0,e481f51cbdc54678b7cc49136f2d6af7,8.436574,15.544063,0.0,delivered,0,0,4,1,1,29.99,8.72
1,53cdb2fc8bc7dce0b6741e2150273451,13.782037,19.137766,0.0,delivered,0,0,4,1,1,118.7,22.76
2,47770eb9100c2d0c44946d9cf07ec65d,9.394213,26.639711,0.0,delivered,1,0,5,1,1,159.9,19.22
3,949d5b44dbf5de918fe9c16f97b45f8a,13.20875,26.188819,0.0,delivered,1,0,5,1,1,45.0,27.2
4,ad21c59c0840e6cb83a9ceb5573f8159,2.873877,12.112049,0.0,delivered,1,0,5,1,1,19.9,8.72


In [62]:
data.columns

Index(['order_id', 'wait_time', 'expected_wait_time', 'delay_vs_expected',
       'order_status', 'dim_is_five_star', 'dim_is_one_star', 'review_score',
       'number_of_products', 'number_of_sellers', 'price', 'freight_value'],
      dtype='object')

❓ Select which features you want to use (avoid data-leaks)

In [92]:
features = data[['wait_time', 'expected_wait_time', 'delay_vs_expected', 'dim_is_five_star', 'dim_is_one_star',
       'number_of_products', 'price']]

In [93]:
features

Unnamed: 0,wait_time,expected_wait_time,delay_vs_expected,dim_is_five_star,dim_is_one_star,number_of_products,price
0,8.436574,15.544063,0.0,0,0,1,29.99
1,13.782037,19.137766,0.0,0,0,1,118.70
2,9.394213,26.639711,0.0,1,0,1,159.90
3,13.208750,26.188819,0.0,1,0,1,45.00
4,2.873877,12.112049,0.0,1,0,1,19.90
...,...,...,...,...,...,...,...
97010,8.218009,18.587442,0.0,1,0,1,72.00
97011,22.193727,23.459051,0.0,0,0,1,174.90
97012,24.859421,30.384225,0.0,1,0,1,205.99
97013,17.086424,37.105243,0.0,0,0,2,359.98


❓ Check the multicollinearity of your features using the `VIF index`. It shouldn't be too high (preferably < 10) so that we can trust the partial regression coefficients and their associated `p-values`.

In [50]:
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

In [100]:
df = pd.DataFrame()
df["vif_index"] = [vif(features.values, i) for i in range(features.shape[1])]
df["features"] = features.columns
df

Unnamed: 0,vif_index,features
0,7.884202,wait_time
1,8.840316,expected_wait_time
2,2.472747,delay_vs_expected
3,2.423833,dim_is_five_star
4,1.422505,dim_is_one_star
5,4.112032,number_of_products
6,1.474911,price
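
To make the definition behind these numbers explicit, here is a minimal NumPy sketch of the VIF, assuming the textbook formula VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on all the other features. The helper `vif_with_intercept` is hypothetical, and fitting an intercept in each auxiliary regression is an assumption of this sketch, so its values may differ from the table above. The data is synthetic, not the Olist dataset.

```python
import numpy as np

def vif_with_intercept(X):
    """Return the VIF of each column of X (2-D array, one column per feature).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on all the other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary regression: column j explained by the remaining columns
        design = np.column_stack([np.ones(n_rows), others])
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coefs
        ss_res = residuals @ residuals
        ss_tot = ((y - y.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic illustration (not the Olist data)
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)   # nearly a copy of x1 -> high VIF
x3 = rng.normal(size=500)              # independent of the others -> VIF near 1
vifs = vif_with_intercept(np.column_stack([x1, x2, x3]))
print(vifs)
```

In the notebook, the same helper could be applied to the explanatory variables only, for instance `data[['wait_time', 'expected_wait_time', 'delay_vs_expected', 'number_of_products', 'price']].values`, so that the two target columns do not enter the collinearity check.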


❓ Fit two LOGIT models (`logit_one` and `logit_five`) to predict `dim_is_one_star` and `dim_is_five_star`

### Logit_one

In [102]:
logit_one = smf.logit(formula='dim_is_one_star ~ wait_time + expected_wait_time + delay_vs_expected + number_of_products + price', data=features).fit()
logit_one.params

Optimization terminated successfully.
         Current function value: 0.281664
         Iterations 7


Intercept            -3.492998
wait_time             0.078812
expected_wait_time   -0.026156
delay_vs_expected     0.038543
number_of_products    0.560977
price                 0.000219
dtype: float64

In [88]:
logit_one.summary()

0,1,2,3
Dep. Variable:,dim_is_one_star,No. Observations:,97007.0
Model:,Logit,Df Residuals:,97003.0
Method:,MLE,Df Model:,3.0
Date:,"Thu, 06 May 2021",Pseudo R-squ.:,0.1266
Time:,10:58:09,Log-Likelihood:,-27642.0
converged:,True,LL-Null:,-31650.0
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-4.1247,0.030,-135.677,0.000,-4.184,-4.065
wait_time,0.0824,0.001,78.649,0.000,0.080,0.084
price,0.0001,4.92e-05,3.012,0.003,5.17e-05,0.000
number_of_products,0.5532,0.016,34.539,0.000,0.522,0.585


### Logit_five

In [103]:
logit_five = smf.logit(formula='dim_is_five_star ~ wait_time + expected_wait_time + delay_vs_expected + number_of_products + price', data=features).fit()
logit_five.params

Optimization terminated successfully.
         Current function value: 0.639464
         Iterations 7


Intercept             1.218822
wait_time            -0.055287
expected_wait_time    0.010288
delay_vs_expected    -0.085096
number_of_products   -0.340829
price                 0.000100
dtype: float64

In [90]:
logit_five.summary()

0,1,2,3
Dep. Variable:,dim_is_five_star,No. Observations:,97007.0
Model:,Logit,Df Residuals:,97003.0
Method:,MLE,Df Model:,3.0
Date:,"Thu, 06 May 2021",Pseudo R-squ.:,0.05029
Time:,10:58:12,Log-Likelihood:,-62390.0
converged:,True,LL-Null:,-65693.0
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,1.5032,0.021,73.199,0.000,1.463,1.543
wait_time,-0.0618,0.001,-70.966,0.000,-0.063,-0.060
price,0.0001,3.35e-05,3.989,0.000,6.79e-05,0.000
number_of_products,-0.3375,0.014,-24.477,0.000,-0.365,-0.310


❓Interpret your results:

- Interpret the partial coefficients in your own words.
- Check their statistical significance with `p-values`
- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient importance?

### Answers
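1. All p-values in both `logit_one` and `logit_five` are below 0.05, so every coefficient is statistically significant.
2. `wait_time` weighs more on 1-star reviews (0.079 in `logit_one`) than on 5-star reviews (-0.055 in `logit_five`), whereas `delay_vs_expected` weighs more on 5-star reviews (-0.085) than on 1-star reviews (0.039).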
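
One way to read these coefficients is to convert them into odds ratios with `exp(coef)`. The sketch below does this for the `wait_time` and `delay_vs_expected` coefficients copied from the two `.params` outputs above. The helper `odds_ratio` and the two dictionaries are names introduced for illustration.

```python
import math

# Coefficients copied from the `.params` outputs of logit_one and logit_five
logit_one_params = {"wait_time": 0.078812, "delay_vs_expected": 0.038543}
logit_five_params = {"wait_time": -0.055287, "delay_vs_expected": -0.085096}

def odds_ratio(coef):
    """Multiplicative change in the odds for a one-unit increase in the feature."""
    return math.exp(coef)

for feature in logit_one_params:
    or_one = odds_ratio(logit_one_params[feature])
    or_five = odds_ratio(logit_five_params[feature])
    print(f"{feature}: odds of 1 star x{or_one:.3f}, odds of 5 stars x{or_five:.3f}")
```

An odds ratio above 1 means the feature raises the odds of the target, and below 1 means it lowers them. Comparing the absolute values of the raw coefficients across the two models shows which rating each feature moves the most.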

In [104]:
# Among the following sentences, store the ones that are true in the list below

a = "delay_vs_expected influences five_star ratings even more than one_star ratings"
b = "wait_time influences five_star ratings even more than one_star"

your_answer = [a]

🧪 __Test your code__

In [105]:
from nbresult import ChallengeResult

result = ChallengeResult('logit',
    answers = your_answer
)
result.write()
print(result.check())

platform darwin -- Python 3.8.6, pytest-6.2.3, py-1.10.0, pluggy-0.13.1 -- /Users/renatoboemer/.pyenv/versions/3.8.6/envs/lewagon/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/renatoboemer/code/boemer00/data-challenges/04-Decision-Science/04-Logistic-Regression/01-Logit
plugins: anyio-2.2.0, dash-1.20.0
collecting ... collected 1 item

tests/test_logit.py::TestLogit::test_question PASSED                     [100%]



💯 You can commit your code:

git add tests/logit.pickle

git commit -m 'Completed logit step'

git push origin master


In [106]:
!git add tests/logit.pickle

In [107]:
!git commit -m 'Completed logit step'

[master 159998c] Completed logit step
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 04-Decision-Science/04-Logistic-Regression/01-Logit/tests/logit.pickle


In [108]:
!git push origin master

Enumerating objects: 12, done.
Counting objects: 100% (12/12), done.
Delta compression using up to 8 threads
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 744 bytes | 744.00 KiB/s, done.
Total 7 (delta 3), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (3/3), completed with 3 local objects.
To github.com:boemer00/data-challenges.git
   74b67be..159998c  master -> master


<details>
    <summary>Explanations</summary>


> _All other things being equal, the delay factor reduces the chances of a 5-star review even more than it increases the chances of a 1-star review. This is probably because 1-star reviews mostly target bad products themselves, not bad deliveries._
    
</details>


❓ How do these regression coefficients compare with an OLS on `review_score` using the same features? Double check that both OLS and Logit analyses tell approximately "the same story".
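
A possible starting point for this comparison is sketched below, assuming the same five explanatory variables as in the two logit formulas. The helpers `build_formula` and `fit_ols` are hypothetical names, and `fit_ols` expects a DataFrame that still contains `review_score` (the full `data`, not the `features` subset selected earlier).

```python
# Same explanatory variables as in the logit_one / logit_five formulas
FEATURES = ["wait_time", "expected_wait_time", "delay_vs_expected",
            "number_of_products", "price"]

def build_formula(target, features=FEATURES):
    """Build a formula string such as 'y ~ x1 + x2' for statsmodels."""
    return f"{target} ~ " + " + ".join(features)

def fit_ols(data, target="review_score"):
    """Fit an OLS of `target` on FEATURES; `data` must hold all those columns."""
    # Imported lazily so that building the formula does not require statsmodels
    import statsmodels.formula.api as smf
    return smf.ols(formula=build_formula(target), data=data).fit()

print(build_formula("review_score"))
```

In the notebook, `fit_ols(data).params` could then be set side by side with `logit_one.params` and `logit_five.params`. If the analyses tell the same story, one would expect the OLS signs to agree with `logit_five` and to be the opposite of `logit_one`.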

### 🏁 Congratulations! Don't forget to commit and push your notebook