# Logit Orders - A warm-up challenge (~1h)

Using our `orders` training set, we will run two multivariate logistic regressions, `logit_one` and `logit_five`, to predict `dim_is_one_star` and `dim_is_five_star` respectively.

In [7]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

❓ Import your dataset

In [32]:
from olist.data import Olist
from olist.seller import Seller, Order
orders = Order().get_training_data()

In [34]:
orders.columns
# orders.shape

Index(['order_id', 'wait_time', 'expected_wait_time', 'delay_vs_expected',
       'order_status', 'dim_is_five_star', 'dim_is_one_star', 'review_score',
       'number_of_products', 'number_of_sellers', 'price', 'freight_value'],
      dtype='object')

❓ Create a list of the features you want to use (avoid data leaks)

In [35]:
features = ['wait_time', 'expected_wait_time', 'number_of_products', 'price', 'freight_value']
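
One way to make the leakage check explicit is to name the columns that encode the review outcome and confirm that none of them appears in `features`. The sketch below hard-codes the column names from the `orders.columns` output above; the `target_derived` grouping is an assumption made for illustration and is not defined anywhere in the `olist` package.

```python
# Column names copied from the `orders.columns` output above.
all_columns = [
    'order_id', 'wait_time', 'expected_wait_time', 'delay_vs_expected',
    'order_status', 'dim_is_five_star', 'dim_is_one_star', 'review_score',
    'number_of_products', 'number_of_sellers', 'price', 'freight_value',
]

# Assumption for illustration: these columns are the targets themselves
# or the score the targets are derived from, so using them would leak the answer.
target_derived = {'dim_is_five_star', 'dim_is_one_star', 'review_score'}

features = ['wait_time', 'expected_wait_time', 'number_of_products',
            'price', 'freight_value']

# Features that would leak the target, and features that do not exist at all.
leaks = sorted(set(features) & target_derived)
unknown = sorted(set(features) - set(all_columns))

print("leaky features:", leaks)
print("unknown columns:", unknown)
```

Both lists should come back empty for the feature list used in this notebook.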

❓ Check the multicollinearity of your features using the `VIF index`. It shouldn't be too high (preferably < 10) if you want to trust the partial regression coefficients and their associated `p-values`.

In [36]:
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
df = pd.DataFrame()
data_features = orders[features]
df["vif_index"] = [vif(data_features.values, i) for i in range(data_features.shape[1])]
df["features"] = data_features.columns
df

| | vif_index | features |
|---|---|---|
| 0 | 3.20085 | wait_time |
| 1 | 5.72246 | expected_wait_time |
| 2 | 4.517835 | number_of_products |
| 3 | 1.706016 | price |
| 4 | 3.205882 | freight_value |
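
For intuition, the VIF of feature j is 1 / (1 - R_j²), where R_j² comes from regressing feature j on the other features. With only two features (and an intercept), R_j² is simply their squared Pearson correlation, which makes a small dependency-free sketch possible. The data below is synthetic and the helper names are hypothetical; this is not how `statsmodels` computes the values in the table above.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def vif_two_features(x, y):
    """VIF shared by both features when they are the only two predictors."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)

random.seed(0)
a = [random.gauss(0, 1) for _ in range(500)]
b = [random.gauss(0, 1) for _ in range(500)]    # independent of a
c = [v + 0.1 * random.gauss(0, 1) for v in a]   # nearly a copy of a

vif_independent = vif_two_features(a, b)
vif_collinear = vif_two_features(a, c)
print(vif_independent, vif_collinear)
```

The independent pair should give a VIF close to 1, while the near-duplicate pair should land far above the threshold of 10. As a side note, `variance_inflation_factor` regresses each column on the other columns exactly as passed, so an intercept is only part of that regression if the input matrix contains a constant column.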


❓ Fit two LOGIT models, `logit_one` and `logit_five`, to predict `dim_is_one_star` and `dim_is_five_star` respectively

In [44]:
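def standardize(df, features):
    """Return a copy of `df` with each column in `features` converted to z-scores."""
    df_standardized = df.copy()
    for f in features:
        mu = df[f].mean()
        sigma = df[f].std()  # sample standard deviation (pandas default, ddof=1)
        # Vectorized equivalent of mapping (x - mu) / sigma over each value
        df_standardized[f] = (df[f] - mu) / sigma
    return df_standardized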

In [47]:
orders_std = standardize(orders, features)
# orders_std
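
Standardizing puts every feature on the same scale (mean 0, standard deviation 1), which is what makes the regression coefficients comparable across features. The following is a minimal standard-library sketch of the same z-score transformation; the `wait_times` values are made up for illustration and `z_scores` is a hypothetical helper, not part of the notebook.

```python
import statistics

def z_scores(values):
    """Convert a list of numbers to z-scores using the sample standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # divides by n - 1, like pandas' default .std()
    return [(v - mu) / sigma for v in values]

wait_times = [5.0, 8.0, 12.0, 20.0, 30.0]  # made-up values, in days
standardized = z_scores(wait_times)

# After the transformation the mean is ~0 and the sample std is ~1.
print(statistics.mean(standardized), statistics.stdev(standardized))
```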

In [48]:
logit_one = smf.logit(formula='dim_is_one_star ~ wait_time + expected_wait_time +\
    number_of_products + price + freight_value', data=orders_std).fit();
logit_one.summary()

Optimization terminated successfully.
         Current function value: 0.282039
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | dim_is_one_star | No. Observations: | 97007 |
| Model: | Logit | Df Residuals: | 97001 |
| Method: | MLE | Df Model: | 5 |
| Date: | Thu, 29 Oct 2020 | Pseudo R-squ.: | 0.1356 |
| Time: | 12:55:48 | Log-Likelihood: | -27360.0 |
| converged: | True | LL-Null: | -31650.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -2.4455 | 0.013 | -194.482 | 0.000 | -2.470 | -2.421 |
| wait_time | 0.8949 | 0.011 | 79.220 | 0.000 | 0.873 | 0.917 |
| expected_wait_time | -0.2958 | 0.014 | -21.682 | 0.000 | -0.323 | -0.269 |
| number_of_products | 0.3236 | 0.010 | 32.079 | 0.000 | 0.304 | 0.343 |
| price | 0.0602 | 0.011 | 5.467 | 0.000 | 0.039 | 0.082 |
| freight_value | -0.0441 | 0.013 | -3.429 | 0.001 | -0.069 | -0.019 |


In [49]:
logit_five = smf.logit(formula='dim_is_five_star ~ wait_time + expected_wait_time +\
    number_of_products + price + freight_value', data=orders_std).fit();
logit_five.summary()

Optimization terminated successfully.
         Current function value: 0.641176
         Iterations 5


| | | | |
|---|---|---|---|
| Dep. Variable: | dim_is_five_star | No. Observations: | 97007 |
| Model: | Logit | Df Residuals: | 97001 |
| Method: | MLE | Df Model: | 5 |
| Date: | Thu, 29 Oct 2020 | Pseudo R-squ.: | 0.05319 |
| Time: | 12:55:58 | Log-Likelihood: | -62199.0 |
| converged: | True | LL-Null: | -65693.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.3562 | 0.007 | 52.583 | 0.000 | 0.343 | 0.369 |
| wait_time | -0.6696 | 0.009 | -70.551 | 0.000 | -0.688 | -0.651 |
| expected_wait_time | 0.1436 | 0.008 | 18.348 | 0.000 | 0.128 | 0.159 |
| number_of_products | -0.1953 | 0.008 | -23.797 | 0.000 | -0.211 | -0.179 |
| price | 0.0128 | 0.008 | 1.687 | 0.092 | -0.002 | 0.028 |
| freight_value | 0.0242 | 0.009 | 2.802 | 0.005 | 0.007 | 0.041 |
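
The `Pseudo R-squ.` reported in both Logit summaries is McFadden's pseudo R², defined as 1 - LL / LL-Null. As a sanity check, it can be recomputed from the log-likelihoods printed above. The inputs below are copied from the summaries and are already rounded, so the results should only match the reported values approximately; `mcfadden_pseudo_r2` is a hypothetical helper name.

```python
def mcfadden_pseudo_r2(log_likelihood, log_likelihood_null):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(intercept-only model)."""
    return 1.0 - log_likelihood / log_likelihood_null

# Log-likelihoods copied from the two summaries above (rounded in the output).
pseudo_r2_one = mcfadden_pseudo_r2(-27360.0, -31650.0)    # dim_is_one_star model
pseudo_r2_five = mcfadden_pseudo_r2(-62199.0, -65693.0)   # dim_is_five_star model

print(round(pseudo_r2_one, 4), round(pseudo_r2_five, 4))
```

The one-star model improves on its intercept-only baseline noticeably more than the five-star model does (roughly 0.14 versus 0.05).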


❓ Interpret your results:

- Interpret the partial coefficients in your own words.

- Do you notice any differences between `logit_one` and `logit_five` in terms of coefficient importance? 

- Check their statistical significance with `p-values`

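☝️ It seems that `wait_time` influences `one_star` ratings even more than `five_star` ratings: in absolute value, its coefficient is 0.89 in the one-star model versus 0.67 in the five-star model.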
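
To put the `wait_time` coefficients in more tangible terms, a logit coefficient can be exponentiated into an odds ratio, and the intercept can be passed through the sigmoid to get a baseline probability. Because the features are standardized, "baseline" here means an order with every feature at its mean, and a one-unit change means one standard deviation. The numbers are copied from the two Logit summaries above; the dictionary layout and names are assumptions made for this sketch.

```python
import math

def sigmoid(z):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients copied from the Logit summaries above.
coefs = {
    "dim_is_one_star": {"Intercept": -2.4455, "wait_time": 0.8949},
    "dim_is_five_star": {"Intercept": 0.3562, "wait_time": -0.6696},
}

results = {}
for target, c in coefs.items():
    results[target] = {
        # Multiplicative change in the odds per +1 std of wait_time.
        "odds_ratio": math.exp(c["wait_time"]),
        # Predicted probability when all features sit at their mean.
        "baseline": sigmoid(c["Intercept"]),
        # Same order, but with wait_time one std above its mean.
        "one_std_later": sigmoid(c["Intercept"] + c["wait_time"]),
    }
    print(target, {k: round(v, 3) for k, v in results[target].items()})
```

Read this way, one extra standard deviation of `wait_time` multiplies the odds of a one-star review by about 2.4 and roughly halves the odds of a five-star review, holding the other features fixed.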

❓ How do these regression coefficients compare with an OLS on `review_score` with the same features? Double-check that both the OLS and Logit analyses tell approximately "the same story".

In [50]:
model_linear = smf.ols(formula = 'review_score ~ wait_time + expected_wait_time +\
    number_of_products + price + freight_value', data=orders_std).fit()
model_linear.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | review_score | R-squared: | 0.137 |
| Model: | OLS | Adj. R-squared: | 0.137 |
| Method: | Least Squares | F-statistic: | 3072.0 |
| Date: | Thu, 29 Oct 2020 | Prob (F-statistic): | 0.0 |
| Time: | 12:56:05 | Log-Likelihood: | -155710.0 |
| No. Observations: | 97007 | AIC: | 311400.0 |
| Df Residuals: | 97001 | BIC: | 311500.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 4.1422 | 0.004 | 1070.917 | 0.000 | 4.135 | 4.150 |
| wait_time | -0.4851 | 0.004 | -115.102 | 0.000 | -0.493 | -0.477 |
| expected_wait_time | 0.1178 | 0.004 | 27.533 | 0.000 | 0.109 | 0.126 |
| number_of_products | -0.1788 | 0.004 | -41.146 | 0.000 | -0.187 | -0.170 |
| price | -0.0082 | 0.004 | -1.922 | 0.055 | -0.016 | 0.000 |
| freight_value | 0.0168 | 0.005 | 3.461 | 0.001 | 0.007 | 0.026 |

| | | | |
|---|---|---|---|
| Omnibus: | 18340.248 | Durbin-Watson: | 2.008 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 33285.636 |
| Skew: | -1.2 | Prob(JB): | 0.0 |
| Kurtosis: | 4.574 | Cond. No. | 2.05 |
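
A quick way to check whether OLS and Logit tell "the same story" is to compare coefficient signs: a feature that raises `review_score` should raise the odds of five stars and lower the odds of one star. The sketch below does this with the coefficients copied from the three summaries above; the dictionaries and the `same_sign` helper are illustrative names, not part of the notebook.

```python
# Coefficients copied from the three summaries above (intercepts omitted).
ols = {"wait_time": -0.4851, "expected_wait_time": 0.1178,
       "number_of_products": -0.1788, "price": -0.0082, "freight_value": 0.0168}
five_star = {"wait_time": -0.6696, "expected_wait_time": 0.1436,
             "number_of_products": -0.1953, "price": 0.0128, "freight_value": 0.0242}
one_star = {"wait_time": 0.8949, "expected_wait_time": -0.2958,
            "number_of_products": 0.3236, "price": 0.0602, "freight_value": -0.0441}

def same_sign(a, b):
    """True when both coefficients point in the same direction."""
    return (a > 0) == (b > 0)

# Features where the five-star Logit disagrees with OLS on direction.
five_disagrees = [f for f in ols if not same_sign(ols[f], five_star[f])]
# Features where the one-star Logit agrees with OLS (it should be the opposite).
one_agrees = [f for f in ols if same_sign(ols[f], one_star[f])]

print("five-star vs OLS, sign mismatch:", five_disagrees)
print("one-star vs OLS, unexpected agreement:", one_agrees)
```

The only mismatch is `price` in the five-star model, and its p-value is above 0.05 in both that model (0.092) and the OLS (0.055), so neither gives reliable evidence of a price effect. On every other feature the three models point in consistent directions.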
