In [None]:
%load_ext autoreload
%autoreload 2

# `Logit` on Orders - A warm-up challenge (~1h)

## Select features

🎯 Let's measure the impact of `wait_time` and `delay_vs_expected` on very good and very bad reviews (5-star and 1-star).

👉 Using our `orders` training_set, we will run two `multivariate logistic regressions`:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

 

In [10]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from scipy.stats import zscore
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

👉 Import your dataset:

In [33]:
from olist.order import Order
orders = Order().get_training_data(with_distance_seller_customer=True)

👉 Select in a list which features you want to use:

⚠️ Make sure you are not creating data leakage (i.e. selecting features that are derived from the target)

💡 To isolate the impact of `wait_time` and `delay_vs_expected`, we need to control for the impact of other features. Include in your list all features that may be relevant.
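A minimal sketch of one way to screen for leakage, assuming the column names displayed by `orders.head()`. The `leaky` and `identifiers` sets are illustrative choices made for this example, not something provided by the `olist` package:

```python
# Columns of the training set, as displayed by orders.head()
all_columns = [
    "order_id", "wait_time", "expected_wait_time", "delay_vs_expected",
    "order_status", "dim_is_five_star", "dim_is_one_star", "review_score",
    "number_of_products", "number_of_sellers", "price", "freight_value",
    "distance_seller_customer",
]

# Assumption: these three columns all encode the review itself, so using
# any of them as a feature would leak the target into the model
leaky = {"review_score", "dim_is_five_star", "dim_is_one_star"}
# Assumption: identifiers and status labels are not usable numeric features
identifiers = {"order_id", "order_status"}

excluded = leaky | identifiers
candidate_features = [c for c in all_columns if c not in excluded]
print(candidate_features)
```

The resulting list is only a pool of candidates. Which ones to keep is still a modelling decision.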

In [5]:
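def standardize(df, features):
    """Return a copy of `df` where each column in `features` is z-scored."""
    df_standardized = df.copy()
    for f in features:
        mu = df[f].mean()
        sigma = df[f].std()  # pandas default: sample standard deviation (ddof=1)
        # Vectorized z-score, same result as mapping (x - mu) / sigma per element
        df_standardized[f] = (df[f] - mu) / sigma
    return df_standardized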
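A quick sanity check of what this standardization produces, sketched on a tiny made-up table (the values below are invented for illustration). It also shows that the result depends on which standard deviation is used: pandas' `.std()` defaults to the sample version (`ddof=1`), while numpy's `np.std` and `scipy.stats.zscore` default to the population version (`ddof=0`).

```python
import numpy as np
import pandas as pd

# Toy data, invented for this example
toy = pd.DataFrame({
    "wait_time": [2.0, 4.0, 6.0, 8.0],
    "price": [10.0, 20.0, 30.0, 100.0],
})

# Same formula as standardize(): subtract the mean, divide by the sample std
z_sample = (toy - toy.mean()) / toy.std()
# Variant using the population std (ddof=0)
z_population = (toy - toy.mean()) / toy.std(ddof=0)

print(z_sample.mean())           # close to 0 for every column
print(z_sample.std())            # 1 for every column (sample std)
print(z_population.std(ddof=0))  # 1 for every column (population std)
```

On a dataset as large as `orders` the two variants are nearly identical, so the choice matters little here.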

In [17]:
orders.head()

Unnamed: 0,order_id,wait_time,expected_wait_time,delay_vs_expected,order_status,dim_is_five_star,dim_is_one_star,review_score,number_of_products,number_of_sellers,price,freight_value,distance_seller_customer
0,e481f51cbdc54678b7cc49136f2d6af7,8.436574,15.544063,0.0,delivered,0,0,4,1,1,29.99,8.72,18.063837
1,53cdb2fc8bc7dce0b6741e2150273451,13.782037,19.137766,0.0,delivered,0,0,4,1,1,118.7,22.76,856.29258
2,47770eb9100c2d0c44946d9cf07ec65d,9.394213,26.639711,0.0,delivered,1,0,5,1,1,159.9,19.22,514.130333
3,949d5b44dbf5de918fe9c16f97b45f8a,13.20875,26.188819,0.0,delivered,1,0,5,1,1,45.0,27.2,1822.800366
4,ad21c59c0840e6cb83a9ceb5573f8159,2.873877,12.112049,0.0,delivered,1,0,5,1,1,19.9,8.72,30.174037


In [55]:
features = ['wait_time','delay_vs_expected','price','number_of_sellers',
            'distance_seller_customer', 'number_of_products']

🕵🏻 Check the `multicollinearity` of your features, using the `VIF index`.

* It shouldn't be too high (< 10 preferably) to ensure that we can trust the partial regression coefficients and their associated `p-values` 
* Do not forget to standardize your data! 
    * A `VIF Analysis` is made by regressing each feature against the other features...
    * So you want to `remove the effect of scale` so that your features have equal importance before running any linear regression!
    
    
📚 <a href="https://www.statisticshowto.com/variance-inflation-factor/">Statistics How To - Variance Inflation Factor</a>

📚  <a href="https://online.stat.psu.edu/stat462/node/180/">PennState - Detecting Multicollinearity Using Variance Inflation Factors</a>
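The idea behind the VIF can be sketched by hand: regress each feature on the others, take the R² of that regression, and compute VIF = 1 / (1 - R²). The data below is synthetic and `manual_vif` is an illustrative helper written for this example, not a library function. The regression is fitted without an intercept, which is only reasonable because the columns are centered first. To our understanding, `variance_inflation_factor` in statsmodels also does not add a constant on its own, which is one more reason to standardize beforehand.

```python
import numpy as np
import pandas as pd

# Synthetic features: x2 is built to be correlated with x1, x3 is independent
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
x3 = rng.normal(size=n)
toy = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Standardize (center and scale) every column
toy = (toy - toy.mean()) / toy.std()

def manual_vif(df, col):
    """VIF of `col`: 1 / (1 - R²) from regressing it on the other columns."""
    y = df[col].to_numpy()
    X = df.drop(columns=col).to_numpy()
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares, no intercept
    residuals = y - X @ beta
    r_squared = 1 - (residuals @ residuals) / (y @ y)
    return 1 / (1 - r_squared)

vifs = {col: manual_vif(toy, col) for col in toy.columns}
print(pd.Series(vifs).round(2))
```

The two correlated columns get a clearly higher VIF than the independent one, which stays close to 1.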

⚖️ Standardizing:

In [40]:
orders_standardized = standardize(orders, features)
orders_standardized

Unnamed: 0,order_id,wait_time,expected_wait_time,delay_vs_expected,order_status,dim_is_five_star,dim_is_one_star,review_score,number_of_products,number_of_sellers,price,freight_value,distance_seller_customer
0,e481f51cbdc54678b7cc49136f2d6af7,-0.431192,15.544063,-0.161781,delivered,0,0,4,-0.264595,-0.112544,-0.513802,-0.652038,-0.979475
1,53cdb2fc8bc7dce0b6741e2150273451,0.134174,19.137766,-0.161781,delivered,0,0,4,-0.264595,-0.112544,-0.086640,0.000467,0.429743
2,47770eb9100c2d0c44946d9cf07ec65d,-0.329907,26.639711,-0.161781,delivered,1,0,5,-0.264595,-0.112544,0.111748,-0.164053,-0.145495
3,949d5b44dbf5de918fe9c16f97b45f8a,0.073540,26.188819,-0.161781,delivered,1,0,5,-0.264595,-0.112544,-0.441525,0.206815,2.054621
4,ad21c59c0840e6cb83a9ceb5573f8159,-1.019535,12.112049,-0.161781,delivered,1,0,5,-0.264595,-0.112544,-0.562388,-0.652038,-0.959115
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95875,9c5dedf39a927c1b2549525ed64a053c,-0.454309,18.587442,-0.161781,delivered,1,0,5,-0.264595,-0.112544,-0.311513,-0.449408,-0.893033
95876,63943bddc261676b46f01ca7ac2f7bd8,1.023841,23.459051,-0.161781,delivered,0,0,4,-0.264595,-0.112544,0.183977,-0.123156,-0.212797
95877,83c1379a015df1e13d02aae0204711ab,1.305780,30.384225,-0.161781,delivered,1,0,5,-0.264595,-0.112544,0.333684,1.964490,0.617630
95878,11c177c8e97725db2631073c19f07b62,0.483664,37.105243,-0.161781,delivered,0,0,2,1.601605,-0.112544,1.075186,2.715522,-0.387558


👉 Run your VIF analysis to detect potential multicollinearity:

In [37]:
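# A correlation matrix is not sufficient to detect soft or even strict multicollinearity
# Keep numeric columns only: recent pandas versions raise an error on string columns
orders_standardized.select_dtypes(include="number").corr().style.background_gradient(cmap='coolwarm')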

Unnamed: 0,wait_time,expected_wait_time,delay_vs_expected,dim_is_five_star,dim_is_one_star,review_score,number_of_products,number_of_sellers,price,freight_value,distance_seller_customer
wait_time,1.0,0.385628,0.702597,-0.234101,0.305577,-0.334036,-0.019754,-0.040702,0.055638,0.167284,0.394984
expected_wait_time,0.385628,1.0,0.005519,-0.050333,0.034842,-0.052525,0.015735,0.024884,0.076606,0.238748,0.513583
delay_vs_expected,0.702597,0.005519,1.0,-0.156735,0.284706,-0.272361,-0.013653,-0.017162,0.016632,0.023887,0.066066
dim_is_five_star,-0.234101,-0.050333,-0.156735,1.0,-0.396354,0.791749,-0.07227,-0.070536,-0.012762,-0.058773,-0.056559
dim_is_one_star,0.305577,0.034842,0.284706,-0.396354,1.0,-0.807758,0.119848,0.102241,0.04466,0.082778,0.04318
review_score,-0.334036,-0.052525,-0.272361,0.791749,-0.807758,1.0,-0.12334,-0.117017,-0.034538,-0.090014,-0.059147
number_of_products,-0.019754,0.015735,-0.013653,-0.07227,0.119848,-0.12334,1.0,0.288734,0.153551,0.438056,-0.017306
number_of_sellers,-0.040702,0.024884,-0.017162,-0.070536,0.102241,-0.117017,0.288734,1.0,0.042986,0.13358,-0.007644
price,0.055638,0.076606,0.016632,-0.012762,0.04466,-0.034538,0.153551,0.042986,1.0,0.410129,0.079348
freight_value,0.167284,0.238748,0.023887,-0.058773,0.082778,-0.090014,0.438056,0.13358,0.410129,1.0,0.314188


In [45]:
orders_features = orders_standardized[features]

# One VIF per feature: each column is regressed on all the other columns
df = pd.DataFrame()
df["features"] = orders_features.columns
df["vif_index"] = [vif(orders_features.values, i) for i in range(orders_features.shape[1])]

# Highest VIF first
df.sort_values(by="vif_index", ascending=False).round(2)

Unnamed: 0,features,vif_index
0,wait_time,2.62
1,delay_vs_expected,2.21
4,freight_value,1.67
5,distance_seller_customer,1.44
6,number_of_products,1.37
2,price,1.21
3,number_of_sellers,1.09


## Logistic Regressions

👉 Fit two `Logistic Regression` models:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

`Logit 1️⃣`

In [56]:
logit_one = smf.logit(formula=f"dim_is_one_star ~ {' + '.join(features)}", data=orders_standardized).fit();
logit_one.summary()

Optimization terminated successfully.
         Current function value: 0.273598
         Iterations 7


0,1,2,3
Dep. Variable:,dim_is_one_star,No. Observations:,95872.0
Model:,Logit,Df Residuals:,95865.0
Method:,MLE,Df Model:,6.0
Date:,"Tue, 22 Feb 2022",Pseudo R-squ.:,0.1447
Time:,19:48:20,Log-Likelihood:,-26230.0
converged:,True,LL-Null:,-30669.0
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-2.4673,0.013,-190.880,0.000,-2.493,-2.442
wait_time,0.7101,0.017,42.178,0.000,0.677,0.743
delay_vs_expected,0.2570,0.018,13.910,0.000,0.221,0.293
price,0.0431,0.010,4.105,0.000,0.023,0.064
number_of_sellers,0.1776,0.008,22.621,0.000,0.162,0.193
distance_seller_customer,-0.1844,0.013,-13.787,0.000,-0.211,-0.158
number_of_products,0.2409,0.009,26.272,0.000,0.223,0.259


`Logit 5️⃣`

In [57]:
logit_five = smf.logit(formula=f"dim_is_five_star ~ {' + '.join(features)}", data=orders_standardized).fit();
logit_five.summary()

Optimization terminated successfully.
         Current function value: 0.636832
         Iterations 7


0,1,2,3
Dep. Variable:,dim_is_five_star,No. Observations:,95872.0
Model:,Logit,Df Residuals:,95865.0
Method:,MLE,Df Model:,6.0
Date:,"Tue, 22 Feb 2022",Pseudo R-squ.:,0.05805
Time:,19:48:41,Log-Likelihood:,-61054.0
converged:,True,LL-Null:,-64817.0
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.3388,0.007,47.337,0.000,0.325,0.353
wait_time,-0.5214,0.012,-44.772,0.000,-0.544,-0.499
delay_vs_expected,-0.4335,0.024,-18.442,0.000,-0.480,-0.387
price,0.0225,0.007,3.180,0.001,0.009,0.036
number_of_sellers,-0.1426,0.008,-18.212,0.000,-0.158,-0.127
distance_seller_customer,0.0886,0.008,11.126,0.000,0.073,0.104
number_of_products,-0.1344,0.008,-17.742,0.000,-0.149,-0.120
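Because the features are standardized, an order with average values on every feature has all features equal to 0, so its predicted log-odds is simply the intercept. A small sketch converting the figures reported in the two summaries into probabilities (the `sigmoid` helper is written for this example; the numbers are copied from the outputs above):

```python
import math

def sigmoid(log_odds):
    """Convert log-odds into a probability."""
    return 1 / (1 + math.exp(-log_odds))

# Intercepts copied from the two summaries
intercept_one = -2.4673
intercept_five = 0.3388
# wait_time coefficient of logit_one, copied from its summary
wait_time_coef_one = 0.7101

# Predicted probabilities for an "average" order (all standardized features at 0)
p_one_star_avg = sigmoid(intercept_one)
p_five_star_avg = sigmoid(intercept_five)
# Same order, but with wait_time one standard deviation above its mean
p_one_star_slow = sigmoid(intercept_one + wait_time_coef_one)

print(f"P(1 star | average order)        = {p_one_star_avg:.3f}")
print(f"P(5 stars | average order)       = {p_five_star_avg:.3f}")
print(f"P(1 star | wait_time +1 std dev) = {p_one_star_slow:.3f}")
```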


💡 It's time to analyse the results of these two logistic regressions:

- Interpret the partial coefficients in your own words.
- Check their statistical significance with `p-values`
- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient importance?
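One way to read the coefficients, sketched with the values copied from the two summaries: since the features are standardized, each coefficient is the change in log-odds for a one-standard-deviation increase in that feature, and its exponential is the corresponding odds ratio. Comparing absolute values across the two models is a rough heuristic only, and the dictionary layout below is just a convenience for this example.

```python
import math

# Coefficients copied from the logit_one and logit_five summaries
coefs = {
    "logit_one": {"wait_time": 0.7101, "delay_vs_expected": 0.2570},
    "logit_five": {"wait_time": -0.5214, "delay_vs_expected": -0.4335},
}

# Odds ratio for a one-standard-deviation increase in each feature
odds_ratios = {
    model: {feature: math.exp(coef) for feature, coef in model_coefs.items()}
    for model, model_coefs in coefs.items()
}

# For each feature, which model has the larger coefficient in absolute value?
stronger_model = {
    feature: max(coefs, key=lambda model: abs(coefs[model][feature]))
    for feature in ("wait_time", "delay_vs_expected")
}

for model, ratios in odds_ratios.items():
    for feature, ratio in ratios.items():
        print(f"{model:10s} {feature:18s} odds ratio = {ratio:.2f}")
print(stronger_model)
```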

In [53]:
# Among the following sentences, store the ones that are true in the list below

a = "delay_vs_expected influences five_star ratings even more than one_star ratings"
b = "wait_time influences five_star ratings even more than one_star"

your_answer = [a]

🧪 __Test your code__

In [54]:
from nbresult import ChallengeResult

result = ChallengeResult('logit',
    answers = your_answer
)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/christianklaus/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/christianklaus/code/christianklausML/data-challenges/04-Decision-Science/04-Logistic-Regression/01-Logit
plugins: anyio-3.4.0
collecting ... collected 1 item

tests/test_logit.py::TestLogit::test_question PASSED                     [100%]



💯 You can commit your code:

git add tests/logit.pickle

git commit -m 'Completed logit step'

git push origin master


<details>
    <summary>- <i>Explanations and advanced concepts </i> -</summary>


> _All other things being equal, the `delay factor` tends to increase the chances of losing the 5-star rating even more than it affects the chances of getting a 1-star review. Probably because 1-star reviews mostly target bad products themselves, not bad deliveries._
    
❗️ However, to be totally rigorous, we have to be **more careful when comparing coefficients from two different models**, because **they might not be based on similar populations**!
    We have 2 sub-populations here (people who gave 1 star and people who gave 5 stars), and they may exhibit intrinsically different behavior patterns. It may well be that "happy people" (who tend to give 5 stars easily) are less sensitive than "grumpy people" (who shoot 1-stars like Lucky Luke) when it comes to "delay" or "price"...

</details>


## Logistic vs. Linear?

👉 Compare:
- the regression coefficients obtained from the `Logistic Regression `
- with the regression coefficients obtained through a `Linear Regression` 
- on `review_score`, using the same features. 

⚠️ Check that both sets of coefficients tell "the same story".

> More or less the same story, except for `price`.

In [None]:
# YOUR CODE HERE
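One possible way to run this comparison, sketched on synthetic data so that it runs on its own. The `toy` table and its effect sizes are invented for illustration and are not the Olist results. With the real data, you would pass `orders_standardized` and the full `features` list instead.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the standardized orders table (NOT the Olist data)
rng = np.random.default_rng(42)
n = 5_000
toy = pd.DataFrame({
    "wait_time": rng.normal(size=n),
    "price": rng.normal(size=n),
})
# Assumption: latent satisfaction drops with wait_time and ignores price
latent = -0.8 * toy["wait_time"] + rng.normal(size=n)
toy["review_score"] = pd.cut(
    latent, bins=[-np.inf, -1.5, -0.75, 0, 0.75, np.inf], labels=[1, 2, 3, 4, 5]
).astype(int)
toy["dim_is_five_star"] = (toy["review_score"] == 5).astype(int)
toy["dim_is_one_star"] = (toy["review_score"] == 1).astype(int)

# Same right-hand side for the three models
rhs = " + ".join(["wait_time", "price"])
linear = smf.ols(f"review_score ~ {rhs}", data=toy).fit()
logit_five_toy = smf.logit(f"dim_is_five_star ~ {rhs}", data=toy).fit(disp=False)
logit_one_toy = smf.logit(f"dim_is_one_star ~ {rhs}", data=toy).fit(disp=False)

# Side-by-side coefficients: compare signs and relative sizes, not raw values
comparison = pd.DataFrame({
    "linear_review_score": linear.params,
    "logit_five_star": logit_five_toy.params,
    "logit_one_star": logit_one_toy.params,
}).drop(index="Intercept")
print(comparison.round(3))
```

The linear and logistic coefficients live on different scales (review points vs. log-odds), so "the same story" means matching signs and a similar ranking of features, with the 1-star model expected to show opposite signs.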

🏁 Congratulations! 

💾 Don't forget to commit and push your `logit.ipynb` notebook!