In [1]:
%load_ext autoreload
%autoreload 2

# `Logit` on Orders - A warm-up challenge (~1h)

## Select features

🎯 Let's figure out the impact of `wait_time` and `delay_vs_expected` on very good (5-star) and very bad (1-star) reviews.

👉 Using our `orders` training_set, we will run two `multivariate logistic regressions`:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

 

In [2]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import scipy as sp
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

👉 Import your dataset:

In [3]:
from olist.order import Order
orders = Order().get_training_data(with_distance_seller_customer=True)

👉 Select in a list which features you want to use:

⚠️ Make sure you are not creating data leakage (i.e. selecting features that are derived from the target)

💡 To figure out the impact of `wait_time` and `delay_vs_expected`, we need to control for the impact of other features. Include in your list all features that may be relevant.

In [4]:
orders.columns

Index(['order_id', 'wait_time', 'expected_wait_time', 'delay_vs_expected',
       'order_status', 'dim_is_five_star', 'dim_is_one_star', 'review_score',
       'number_of_products', 'number_of_sellers', 'price', 'freight_value',
       'distance_seller_customer'],
      dtype='object')

In [5]:
orders.head(2)

Unnamed: 0,order_id,wait_time,expected_wait_time,delay_vs_expected,order_status,dim_is_five_star,dim_is_one_star,review_score,number_of_products,number_of_sellers,price,freight_value,distance_seller_customer
0,e481f51cbdc54678b7cc49136f2d6af7,8.0,15,0.0,delivered,0,0,4,1,1,29.99,8.72,18.57611
1,53cdb2fc8bc7dce0b6741e2150273451,13.0,19,0.0,delivered,0,0,4,1,1,118.7,22.76,851.495069


In [6]:
features = [ 'wait_time', 'expected_wait_time', 'delay_vs_expected', "number_of_products",
       'number_of_sellers', 'price', 'freight_value',
       'distance_seller_customer']

🕵🏻 Check the `multicollinearity` of your features, using the `VIF index`.

* It shouldn't be too high (< 10 preferably) to ensure that we can trust the partial regression coefficients and their associated `p-values`
* Do not forget to standardize your data!
    * A `VIF Analysis` is made by regressing each feature against all the other features...
    * So you want to `remove the effect of scale` so that your features have an equal importance before running any linear regression!
    
    
📚 <a href="https://www.statisticshowto.com/variance-inflation-factor/">Statistics How To - Variance Inflation Factor</a>

📚  <a href="https://online.stat.psu.edu/stat462/node/180/">PennState - Detecting Multicollinearity Using Variance Inflation Factors</a>
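To make the VIF idea above concrete, here is a minimal sketch that computes it by hand: regress one standardized feature on the others and take `1 / (1 - R²)`. The data is synthetic and the helper name `manual_vif` is hypothetical; it is not the `statsmodels` implementation, only an illustration of the same principle.

```python
import numpy as np

# Synthetic, hypothetical data: x2 is built to correlate with x1, x3 is independent
rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Standardize each column (mean 0, standard deviation 1)
X = (X - X.mean(axis=0)) / X.std(axis=0)


def manual_vif(X, i):
    """VIF of column i: regress it on the other columns, then 1 / (1 - R^2)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    # Least-squares fit without intercept (columns are already centered)
    beta, *_ = np.linalg.lstsq(others, y, rcond=None)
    residuals = y - others @ beta
    r_squared = 1 - (residuals @ residuals) / (y @ y)
    return 1 / (1 - r_squared)


vifs = [manual_vif(X, i) for i in range(X.shape[1])]
print(vifs)
```

In this sketch the two correlated columns get a noticeably higher VIF than the independent one, which stays close to 1.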

⚖️ Standardizing:

In [7]:
orders_standardized_features = orders.copy()[features]
orders_standardized_features =  sp.stats.zscore(orders_standardized_features)
orders_standardized_features.head(5)

Unnamed: 0,wait_time,expected_wait_time,delay_vs_expected,number_of_products,number_of_sellers,price,freight_value,distance_seller_customer
0,-0.428004,-0.955667,-0.153336,-0.264596,-0.112545,-0.513805,-0.652042,-0.981493
1,0.100519,-0.499258,-0.153336,-0.264596,-0.112545,-0.086641,0.000467,0.42367
2,-0.322299,0.299458,-0.153336,-0.264596,-0.112545,0.111749,-0.164054,-0.145003
3,0.100519,0.299458,-0.153336,-0.264596,-0.112545,-0.441528,0.206816,2.061328
4,-1.062231,-1.297973,-0.153336,-0.264596,-0.112545,-0.562391,-0.652042,-0.962766
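For reference, the z-score used above can be sketched by hand as follows. The sample values are hypothetical; the only assumption about SciPy is that `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, which is what this sketch reproduces.

```python
import numpy as np

# Hypothetical wait times in days, for illustration only
wait_time = np.array([8.0, 13.0, 9.0, 13.0, 2.0])

# z-score: subtract the mean, divide by the population standard deviation (ddof=0)
z = (wait_time - wait_time.mean()) / wait_time.std(ddof=0)
print(z)
```

After this transformation the column has mean 0 and standard deviation 1, so features measured in days, currency or kilometres end up on a comparable scale.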


In [8]:
orders_standardized_features.corr()

Unnamed: 0,wait_time,expected_wait_time,delay_vs_expected,number_of_products,number_of_sellers,price,freight_value,distance_seller_customer
wait_time,1.0,0.38489,0.693622,-0.019835,-0.040803,0.055527,0.167221,0.39535
expected_wait_time,0.38489,1.0,0.008143,0.015497,0.024779,0.076746,0.238833,0.515099
delay_vs_expected,0.693622,0.008143,1.0,-0.013274,-0.016424,0.016497,0.023471,0.064346
number_of_products,-0.019835,0.015497,-0.013274,1.0,0.288734,0.153551,0.438056,-0.01714
number_of_sellers,-0.040803,0.024779,-0.016424,0.288734,1.0,0.042986,0.13358,-0.007592
price,0.055527,0.076746,0.016497,0.153551,0.042986,1.0,0.410129,0.07967
freight_value,0.167221,0.238833,0.023471,0.438056,0.13358,0.410129,1.0,0.315126
distance_seller_customer,0.39535,0.515099,0.064346,-0.01714,-0.007592,0.07967,0.315126,1.0


👉 Run your VIF Analysis to analyze potential multicollinearity:

In [9]:
vif(orders_standardized_features.values, 0)

2.9426158845083945

In [10]:
vif_df = pd.DataFrame()
vif_df["features"] = orders_standardized_features.columns
vif_df["vif_index"] = [vif(orders_standardized_features.values, i) for i in range(orders_standardized_features.shape[1])]
round(vif_df.sort_values(by="vif_index", ascending = False),2)

Unnamed: 0,features,vif_index
0,wait_time,2.94
2,delay_vs_expected,2.35
6,freight_value,1.68
7,distance_seller_customer,1.6
1,expected_wait_time,1.59
3,number_of_products,1.37
5,price,1.21
4,number_of_sellers,1.1


## Logistic Regressions

👉 Fit two `Logistic Regression` models:
- `logit_one` to predict `dim_is_one_star` 
- `logit_five` to predict `dim_is_five_star`.

`Logit 1️⃣`

In [11]:
orders.columns

Index(['order_id', 'wait_time', 'expected_wait_time', 'delay_vs_expected',
       'order_status', 'dim_is_five_star', 'dim_is_one_star', 'review_score',
       'number_of_products', 'number_of_sellers', 'price', 'freight_value',
       'distance_seller_customer'],
      dtype='object')

### Logistic model One

In [12]:
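log_target_one = "dim_is_one_star"
# Standardized features plus the target (built here, but not used in the fit below)
orders_standardized_with_target = pd.concat(
    [orders_standardized_features, orders[log_target_one]], axis=1
)

# The model is fit on the raw `orders` frame, so each coefficient is a
# change in log-odds per one original unit of the corresponding feature
model_one = smf.logit(
    formula=f'{log_target_one} ~ {" + ".join(features)}',
    data=orders,
).fit()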
model_one.params

Optimization terminated successfully.
         Current function value: 0.272892
         Iterations 7


Intercept                  -4.979249
wait_time                   0.092636
expected_wait_time         -0.025711
delay_vs_expected           0.019595
number_of_products          0.467443
number_of_sellers           1.520401
price                       0.000250
freight_value              -0.000627
distance_seller_customer   -0.000225
dtype: float64

### Logistic model Five

`Logit 5️⃣`

In [13]:
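log_target_five = "dim_is_five_star"
# Standardized features plus the target (built here, but not used in the fit below)
orders_standardized_with_target = pd.concat(
    [orders_standardized_features, orders[log_target_five]], axis=1
)

# Same feature set as model_one, again fit on the raw `orders` frame
model_five = smf.logit(
    formula=f'{log_target_five} ~ {" + ".join(features)}',
    data=orders,
).fit()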
model_five.params

Optimization terminated successfully.
         Current function value: 0.636749
         Iterations 7


Intercept                   2.328325
wait_time                  -0.061201
expected_wait_time          0.009023
delay_vs_expected          -0.077156
number_of_products         -0.253785
number_of_sellers          -1.180288
price                       0.000097
freight_value               0.000056
distance_seller_customer    0.000112
dtype: float64

#### Comparison

💡 It's time to analyse the results of these two logistic regressions:

- Interpret the partial coefficients in your own words.
- Check their statistical significances with `p-values`
- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient importances?
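As a rough guide for the significance check, the `z` and `P>|z|` columns of a logit summary can be sketched from a coefficient and its standard error. The helper name `wald_z_and_p` is hypothetical, and the inputs below are the rounded `wait_time` values printed in the `model_one` summary, so the resulting `z` only approximates the reported one.

```python
import math


def wald_z_and_p(coef, std_err):
    """Return the Wald z-statistic and its two-sided p-value under a standard normal."""
    z = coef / std_err
    # Two-sided tail probability: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Rounded coefficient and standard error of wait_time in the one-star model
z, p = wald_z_and_p(0.0926, 0.002)
print(z, p)
```

A p-value this small is displayed as `0.000` in the summary table; a coefficient whose p-value exceeds the chosen threshold (0.05 is a common choice) should not be interpreted as a reliable effect.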

In [14]:
print("**Score One**")
print(model_one.summary())

print("**Score Five**")
print(model_five.summary())


**Score One**


                           Logit Regression Results                           
Dep. Variable:        dim_is_one_star   No. Observations:                95872
Model:                          Logit   Df Residuals:                    95863
Method:                           MLE   Df Model:                            8
Date:                Thu, 01 Feb 2024   Pseudo R-squ.:                  0.1469
Time:                        13:21:54   Log-Likelihood:                -26163.
converged:                       True   LL-Null:                       -30669.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                   -4.9792      0.073    -68.094      0.000      -5.123      -4.836
wait_time                    0.0926      0.002     45.728      0.000       0.089       0.

In [39]:
# Translating the five-star model's intercept into the probability of success when all independent variables (features) are 0
import numpy as np

log_odds = 2.328325
print(log_odds)
odds = np.exp(log_odds)
print(odds)
probability = odds/(1+odds)
print(probability)


2.328325
10.260740390953762
0.9111958925184579
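The same conversion can be wrapped in small helpers and extended to the slope coefficients. This is a minimal sketch: the function names are hypothetical, and the two numbers are copied from the `model_five` parameters shown earlier.

```python
import math


def log_odds_to_probability(log_odds):
    """Logistic (sigmoid) function: maps log-odds to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-log_odds))


def odds_ratio(coef):
    """Multiplicative change in the odds for a one-unit increase of the feature."""
    return math.exp(coef)


intercept_five = 2.328325        # Intercept of the five-star model
wait_time_coef_five = -0.061201  # wait_time coefficient of the five-star model

print(log_odds_to_probability(intercept_five))
print(odds_ratio(wait_time_coef_five))
```

An odds ratio below 1 (roughly 0.94 here) means that each additional day of `wait_time` multiplies the odds of a five-star review by that factor, all other features held constant.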


In [15]:
# Among the following sentences, store the ones that are true in the list below

a = "delay_vs_expected influences five_star ratings even more than one_star ratings"
b = "wait_time influences five_star ratings even more more than one_star"

your_answer = [a]

🧪 __Test your code__

In [16]:
from nbresult import ChallengeResult

result = ChallengeResult('logit',
    answers = your_answer
)
result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/branchedelac/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/branchedelac/code/branchedelac/04-Decision-Science/04-Logistic-Regression/data-logit/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 1 item

test_logit.py::TestLogit::test_question PASSED                           [100%]



💯 You can commit your code:

git add tests/logit.pickle

git commit -m 'Completed logit step'

git push origin master



<details>
    <summary>- <i>Explanations and advanced concepts </i> -</summary>


> _All other things being equal, the `delay factor` tends to increase the chances of losing a 5-star review even more than it increases the chances of a 1-star review. This is probably because 1-star reviews really target bad products themselves, not bad deliveries._
    
❗️ However, to be totally rigorous, we have to be **more careful when comparing coefficients from two different models**, because **they might not be based on similar populations**!
    We have 2 sub-populations here (people who gave 1-stars and people who gave 5-stars), and they may exhibit intrinsically different behavior patterns. It may well be that "happy people" (who tend to give 5-stars easily) are less sensitive than "grumpy people" (who shoot 1-stars like Lucky Luke) when it comes to "delay" or "price"...

</details>


## Logistic vs. Linear ?

👉 Compare the coefficients obtained from:
- A `Logistic Regression` to explain `dim_is_five_star`
- A `Linear Regression` to explain `review_score` 

Make sure to use the same set of features for both regressions.  

⚠️ Check that both sets of coefficients  tell  "the same story".

> YOUR ANSWER HERE

In [27]:
log_target_five = "dim_is_five_star"
orders_standardized_with_target = pd.concat(
    [orders_standardized_features, orders[log_target_five]], axis=1
)

model_five = smf.logit(
    formula=f'{log_target_five} ~ {" + ".join(features)}',
    data=orders,
).fit()
model_five.params

Optimization terminated successfully.
         Current function value: 0.636749
         Iterations 7


Intercept                   2.328325
wait_time                  -0.061201
expected_wait_time          0.009023
delay_vs_expected          -0.077156
number_of_products         -0.253785
number_of_sellers          -1.180288
price                       0.000097
freight_value               0.000056
distance_seller_customer    0.000112
dtype: float64

🏁 Congratulations! 

💾 Don't forget to commit and push your `logit.ipynb` notebook !

In [28]:
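# Standardized features plus the target (built here, but not used in the fit below)
orders_standardized_with_target = pd.concat(
    [orders_standardized_features, orders["review_score"]], axis=1
)

# Linear regression on the same features, fit on the raw `orders` frame
model_score = smf.ols(
    formula=f'review_score ~ {" + ".join(features)}',
    data=orders,
).fit()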

model_score.params

Intercept                   5.855413
wait_time                  -0.053008
expected_wait_time          0.010874
delay_vs_expected           0.000064
number_of_products         -0.240361
number_of_sellers          -1.093721
price                      -0.000017
freight_value              -0.000036
distance_seller_customer    0.000118
dtype: float64

#### Analysis

In [29]:
pd.concat(
    [
        model_score.params.reset_index().rename(
            columns={"index": "Review Score Linear"}
        ),
        model_five.params.reset_index().rename(columns={"index": "Five Star Logistic"}),
    ],
    axis=1,
)

Unnamed: 0,Review Score Linear,0,Five Star Logistic,0.1
0,Intercept,5.855413,Intercept,2.328325
1,wait_time,-0.053008,wait_time,-0.061201
2,expected_wait_time,0.010874,expected_wait_time,0.009023
3,delay_vs_expected,6.4e-05,delay_vs_expected,-0.077156
4,number_of_products,-0.240361,number_of_products,-0.253785
5,number_of_sellers,-1.093721,number_of_sellers,-1.180288
6,price,-1.7e-05,price,9.7e-05
7,freight_value,-3.6e-05,freight_value,5.6e-05
8,distance_seller_customer,0.000118,distance_seller_customer,0.000112


##### Coefficients
1. wait_time has a negative effect on the review_score, and reduces the likelihood of getting a five star review.
2. expected_wait_time has a positive effect on the review_score, and increases the likelihood of getting a five star review.
3. delay_vs_expected has a positive, though near-zero (6.4e-05), effect on the review_score (!), but reduces the likelihood of getting a five star review. The former seems surprising.
4. number_of_products has a negative effect on the review_score, and reduces the likelihood of getting a five star review.
5. number_of_sellers has the greatest negative effect on the review_score, and reduces the likelihood of getting a five star review the most.
6. price has a very small negative effect on the review_score, but increases the likelihood of getting a five star review.
7. Likewise, freight_value has a very small negative effect on the review_score, but increases the likelihood of getting a five star review.
8. distance_seller_customer has a very small positive effect on the review_score, and increases the likelihood of getting a five star review.


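In both models, number_of_sellers followed by number_of_products have the most noticeable (negative) effect on review score. After those, it's wait_time in the case of the linear model and delay_vs_expected in the case of the logistic model. The greatest positive effect comes from expected_wait_time. Note that both models were fit on the raw (unstandardized) `orders` data, so each coefficient is expressed per original unit of its feature, and magnitudes are not directly comparable across features measured in different units.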
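One way to check whether the two models "tell the same story" is to compare the sign of each coefficient. The sketch below copies the values from the comparison table above into plain dictionaries (the variable names are hypothetical) and lists the features on which the signs disagree.

```python
# Coefficients copied from the comparison table (intercepts left out)
linear = {
    "wait_time": -0.053008,
    "expected_wait_time": 0.010874,
    "delay_vs_expected": 0.000064,
    "number_of_products": -0.240361,
    "number_of_sellers": -1.093721,
    "price": -0.000017,
    "freight_value": -0.000036,
    "distance_seller_customer": 0.000118,
}
logistic = {
    "wait_time": -0.061201,
    "expected_wait_time": 0.009023,
    "delay_vs_expected": -0.077156,
    "number_of_products": -0.253785,
    "number_of_sellers": -1.180288,
    "price": 0.000097,
    "freight_value": 0.000056,
    "distance_seller_customer": 0.000112,
}

# Features whose coefficient has a different sign in the two models
disagreeing = [
    name for name in linear
    if (linear[name] > 0) != (logistic[name] > 0)
]
print(disagreeing)
```

A sign disagreement on a coefficient that is very close to zero is weak evidence on its own; checking the corresponding p-values in each model's summary indicates whether the coefficient is distinguishable from zero at all.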