
Analyze the A/B testing data.

Fit a logistic regression with weekend and group as the independent variables and click as the dependent variable.

Split the data so that 80% is used for estimation and the remaining 20% for testing (set random_state=1234).

Fit Model 1 as click ~ weekend + group and Model 2 as click ~ weekend + group + weekend:group.
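
Written out, the two formulas correspond to the following log-odds equations. The notation is an assumption made here for illustration: $p = P(\text{click} = 1)$, $\text{group}_B$ is an indicator that equals 1 for group B and 0 for group A, and the $\beta$ symbols are not names used elsewhere in the notebook.

$$
\text{Model 1:}\quad \log\frac{p}{1-p} = \beta_0 + \beta_1\,\text{weekend} + \beta_2\,\text{group}_B
$$

$$
\text{Model 2:}\quad \log\frac{p}{1-p} = \beta_0 + \beta_1\,\text{weekend} + \beta_2\,\text{group}_B + \beta_3\,(\text{weekend}\times\text{group}_B)
$$

The interaction term $\beta_3$ lets the effect of group B differ between weekdays and weekends; Model 1 forces that effect to be the same on both.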

In [1]:
import pandas as pd

### Loading the data

A/B testing data: click-through rate by group on weekdays and weekends
- weekend: 1 = weekend, 0 = weekday
- Independent variables: weekend, group
- Dependent variable: click


In [62]:
df=pd.read_excel('data/abtest.xlsx')
df.head()

,weekend,group,click
0,1,A,0
1,1,A,1
2,0,B,0
3,0,B,1
4,0,A,1


In [64]:
df.shape

(10000, 3)

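### Splitting the data
Split with test_size=0.2 (20% held out for testing) and random_state=1234 (the seed for the random number generator, so the split is reproducible).  
The rows are split at random so that the model is fit on the train data and evaluated on unseen test data, which exposes overfitting.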


In [44]:
from sklearn.model_selection import train_test_split

In [45]:
# 20% test, 80% train
train_df, test_df = train_test_split(df, test_size=0.2, random_state=1234)

### Logistic regression

In [10]:
from statsmodels.formula.api import logit

In [47]:
# Model 1: click ~ weekend + group
res1 = logit('click ~ weekend + group',train_df).fit()
res1.summary()

Optimization terminated successfully.
         Current function value: 0.682483
         Iterations 4


0,1,2,3
Dep. Variable:,click,No. Observations:,8000.0
Model:,Logit,Df Residuals:,7997.0
Method:,MLE,Df Model:,2.0
Date:,"Fri, 30 Oct 2020",Pseudo R-squ.:,0.01536
Time:,22:16:02,Log-Likelihood:,-5459.9
converged:,True,LL-Null:,-5545.0
Covariance Type:,nonrobust,LLR p-value:,1.03e-37

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.1358,0.035,3.916,0.000,0.068,0.204
group[T.B],-0.4721,0.045,-10.443,0.000,-0.561,-0.384
weekend,0.3845,0.050,7.641,0.000,0.286,0.483


In [48]:
# Model 2: click ~ weekend + group + weekend:group
res2 = logit('click ~ weekend + group + weekend:group',train_df).fit()
res2.summary()

Optimization terminated successfully.
         Current function value: 0.654409
         Iterations 5


0,1,2,3
Dep. Variable:,click,No. Observations:,8000.0
Model:,Logit,Df Residuals:,7996.0
Method:,MLE,Df Model:,3.0
Date:,"Fri, 30 Oct 2020",Pseudo R-squ.:,0.05586
Time:,22:16:13,Log-Likelihood:,-5235.3
converged:,True,LL-Null:,-5545.0
Covariance Type:,nonrobust,LLR p-value:,5.894e-134

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.4345,0.038,11.389,0.000,0.360,0.509
group[T.B],-1.0853,0.055,-19.766,0.000,-1.193,-0.978
weekend,-0.6382,0.070,-9.141,0.000,-0.775,-0.501
weekend:group[T.B],2.1595,0.104,20.742,0.000,1.955,2.364


## Model comparison
Lower AIC and BIC indicate a better model.

In [19]:
print(res1.aic)
print(res2.aic)

10925.735420066721
10478.541202159566


In [20]:
print(res1.bic)
print(res2.bic)

10946.697010528707
10506.489989442214
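
As a rough cross-check, both criteria can be recomputed by hand from the log-likelihoods in the summary tables, using AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L. This is a minimal sketch, not the statsmodels implementation: the helper names `aic` and `bic` are made up for illustration, k counts the intercept, and the log-likelihoods are the rounded values from the summaries, so the results agree with `res1` and `res2` only to about one decimal place.

```python
import math

def aic(log_likelihood, n_params):
    # AIC = 2k - 2 ln L
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    # BIC = k ln(n) - 2 ln L
    return n_params * math.log(n_obs) - 2 * log_likelihood

n_obs = 8000  # No. Observations in both summaries

# (rounded log-likelihood, number of estimated parameters incl. intercept)
models = {"model 1": (-5459.9, 3), "model 2": (-5235.3, 4)}

for name, (llf, k) in models.items():
    print(name, "AIC:", round(aic(llf, k), 1), "BIC:", round(bic(llf, k, n_obs), 1))
```

Because BIC multiplies k by ln(8000), which is about 9, instead of by 2, it penalizes the extra interaction term more heavily than AIC does.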


## Model evaluation

In [49]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score

In [50]:
import numpy
predict1 = res1.predict(test_df)
predict2 = res2.predict(test_df)

prediction1 = numpy.where(predict1 > 0.5,1,0)
prediction2 = numpy.where(predict2 > 0.5,1,0)

### 1) Accuracy comparison

In [51]:
print(accuracy_score(test_df['click'],prediction1))
print(accuracy_score(test_df['click'],prediction2))

0.6085
0.613


### 2) Precision comparison

In [52]:
print(precision_score(test_df['click'],prediction1))
print(precision_score(test_df['click'],prediction2))

0.5794392523364486
0.6068204613841525


### 3) Recall comparison

In [54]:
print(recall_score(test_df['click'],prediction1))
print(recall_score(test_df['click'],prediction2))

0.7537993920972644
0.6129685916919959
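
All three scores come from the four cells of a confusion matrix. The sketch below shows the arithmetic in plain Python. The labels are hypothetical toy values, not the A/B test data, and `confusion_counts` is a helper written for this illustration rather than a scikit-learn function.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary 0/1 labels."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Hypothetical toy labels for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted clicks that are real clicks
recall = tp / (tp + fn)                     # share of real clicks that were predicted

print(accuracy, precision, recall)
```

A model can gain precision while losing recall, which is the pattern in the scores above: Model 2 has higher precision than Model 1 but lower recall.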


## Building the formula to interpret Model 2

In [55]:
res2.summary()

0,1,2,3
Dep. Variable:,click,No. Observations:,8000.0
Model:,Logit,Df Residuals:,7996.0
Method:,MLE,Df Model:,3.0
Date:,"Fri, 30 Oct 2020",Pseudo R-squ.:,0.05586
Time:,22:26:14,Log-Likelihood:,-5235.3
converged:,True,LL-Null:,-5545.0
Covariance Type:,nonrobust,LLR p-value:,5.894e-134

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.4345,0.038,11.389,0.000,0.360,0.509
group[T.B],-1.0853,0.055,-19.766,0.000,-1.193,-0.978
weekend,-0.6382,0.070,-9.141,0.000,-0.775,-0.501
weekend:group[T.B],2.1595,0.104,20.742,0.000,1.955,2.364


group[T.B]: is the group B?  
- 0 (no) when group='A'
- 1 (yes) when group='B'

In [56]:
intercept = 0.4345
groupB=-1.0853
weekend=-0.6382
w_groupB=2.1595

In [57]:
# Build the formula: log-odds from Model 2, with w = weekend (0/1) and g = group B indicator (0/1)
def logit_value(w,g):
    y = intercept + (groupB * g) + (weekend * w) + (w_groupB * w * g)
    return y

In [59]:
# Weekdays: group A has a higher click rate than group B
# 1) group A
print(logit_value(0,0))
# 2) group B
print(logit_value(0,1))

0.4345
-0.6507999999999999


In [61]:
# Weekends: group A has a lower click rate than group B
# 1) group A
print(logit_value(1,0))
# 2) group B
print(logit_value(1,1))

-0.2037
0.8705
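
The values printed above are log-odds, not click rates. A minimal sketch of converting them to probabilities with the logistic (sigmoid) function follows. The coefficients are copied from the Model 2 summary; the names `click_probability`, `group_b` and `weekend_group_b` are chosen here for illustration.

```python
import math

# Coefficients copied from the Model 2 summary table
intercept = 0.4345
group_b = -1.0853
weekend = -0.6382
weekend_group_b = 2.1595

def logit_value(w, g):
    # Log-odds of a click; w = weekend (0/1), g = group B indicator (0/1)
    return intercept + group_b * g + weekend * w + weekend_group_b * w * g

def click_probability(w, g):
    # Inverse of the logit: p = 1 / (1 + exp(-log_odds))
    return 1 / (1 + math.exp(-logit_value(w, g)))

for w, day in ((0, "weekday"), (1, "weekend")):
    for g, group in ((0, "A"), (1, "B")):
        print(day, group, round(click_probability(w, g), 3))
```

A positive log-odds maps to a probability above 0.5 and a negative one to a probability below 0.5. Here the predicted click probabilities are roughly 0.61 (A) against 0.34 (B) on weekdays, and 0.45 (A) against 0.70 (B) on weekends.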



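#### Q1. Comparing by AIC and BIC, which model is better?
- Model 2

#### Q2. Evaluate the accuracy of Model 1 and Model 2 on the test data (threshold 0.5). Which model is more accurate?
- Model 2

#### Q3. Interpret the slopes of Model 2. What conclusion can you draw? (weekend=1 means weekend)
- On weekdays variant A gets more clicks; on weekends variant B gets more clicks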
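
One way to quantify the Q3 conclusion is the odds ratio of group B relative to group A, computed separately for weekdays and weekends. The sketch below assumes the Model 2 coefficients from the summary table; the variable names are illustrative.

```python
import math

# Coefficients copied from the Model 2 summary table
group_b = -1.0853          # effect of group B on weekdays (weekend = 0)
weekend_group_b = 2.1595   # additional effect of group B on weekends

# Odds ratio B vs. A: exponentiate the difference in log-odds between the groups
odds_ratio_weekday = math.exp(group_b)
odds_ratio_weekend = math.exp(group_b + weekend_group_b)

print(round(odds_ratio_weekday, 2))
print(round(odds_ratio_weekend, 2))
```

An odds ratio below 1 means B has lower odds of a click than A, and above 1 means higher odds. On weekdays the odds of a click in group B are about a third of those in group A; on weekends they are nearly three times as high. This sign flip is what the interaction term captures.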