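## Introduction

Attribution analysis compares a metric across two time periods and explains the change between them.

To do this, we prepare two datasets, denoted df_a and df_b. Each dataset consists of dimensions and metrics. Suppose df_a looks like this: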

| 操作系统 (OS) | 付费状态 (payment status) | 用户占比 (user share) | 转化率 (conversion rate) |
| --- | --- | --- | --- |
| iOS | 已付费 (paid) | 20% | 12% |
| iOS | 未付费 (unpaid) | 30% | 15% |
| Android | 已付费 (paid) | 10% | 8% |
| Android | 未付费 (unpaid) | 40% | 17% |

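Here the dimensions are 操作系统 (OS) and 付费状态 (payment status), and the metrics are 用户占比 (user share) and 转化率 (conversion rate).

Applying some computation to a dataset's metrics yields the overall metric. This computation can be expressed as a simple Python function: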

```python
def get_metrics(df):
    return sum(df['用户占比']*df['转化率'])
```

The change in the overall metric between the two periods is then $\Delta = \text{get\_metrics}(df\_b) - \text{get\_metrics}(df\_a)$.
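
As a quick check of this definition, the sketch below evaluates `get_metrics` on the example table and on a second table. The values in `df_b` are made up purely for illustration; they are not taken from any source data.

```python
import pandas as pd

def get_metrics(df):
    # Overall metric: share-weighted sum of conversion rates
    return (df['用户占比'] * df['转化率']).sum()

# df_a: the example table, with percentages written as fractions
df_a = pd.DataFrame({
    '操作系统': ['iOS', 'iOS', 'Android', 'Android'],
    '付费状态': ['已付费', '未付费', '已付费', '未付费'],
    '用户占比': [0.20, 0.30, 0.10, 0.40],
    '转化率': [0.12, 0.15, 0.08, 0.17],
})

# df_b: hypothetical second period (assumed values, for illustration only)
df_b = df_a.copy()
df_b['用户占比'] = [0.25, 0.25, 0.10, 0.40]
df_b['转化率'] = [0.12, 0.16, 0.08, 0.17]

metric_a = get_metrics(df_a)
metric_b = get_metrics(df_b)
delta = metric_b - metric_a
print(f"{metric_a:.4f} -> {metric_b:.4f}, delta = {delta:.4f}")
```

In this toy case two things changed at once (the iOS user shares and one conversion rate), which is exactly the situation where the change needs to be split among several causes.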

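Our goal is to explain $\Delta$. Specifically:

1. From period a to period b, how much each metric [用户占比, 转化率] contributes to $\Delta$.
2. Over the same interval, how much each metric [用户占比, 转化率] contributes within each dimension value. For example: the share of $\Delta$ attributable to the change in 用户占比 among paid iOS users.

In short, attribution quantifies the difference in a metric between two periods and assigns it to each metric under each dimension value.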

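You may ask: why not use the control-variable method (change one factor at a time while holding the others fixed)? For the reasons, see the wiki entry on the Shapley value.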
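
One likely reason can be seen in a small sketch. It assumes a toy multiplicative metric `x * y`; the function `metric` and all numbers are illustrative and not part of the original notebook. When factors interact, one-at-a-time effects do not add up to the total change, while Shapley values (the average marginal effect over all orderings) do.

```python
from itertools import permutations

def metric(x, y):
    # Toy multiplicative metric (assumed for illustration)
    return x * y

base = {'x': 2.0, 'y': 3.0}  # period a
new = {'x': 4.0, 'y': 5.0}   # period b
delta = metric(**new) - metric(**base)

# Control-variable method: change one factor, hold the other at its baseline
one_at_a_time = {
    k: metric(**{**base, k: new[k]}) - metric(**base) for k in base
}

# Shapley value: average each factor's marginal effect over all orderings
orders = list(permutations(base))
shapley = {k: 0.0 for k in base}
for order in orders:
    state = dict(base)
    for k in order:
        before = metric(**state)
        state[k] = new[k]
        shapley[k] += (metric(**state) - before) / len(orders)

print("delta:", delta)
print("one-at-a-time sum:", sum(one_at_a_time.values()))
print("shapley:", shapley, "sum:", sum(shapley.values()))
```

The one-at-a-time effects leave part of the change unexplained (the interaction between `x` and `y`), whereas the Shapley values split that interaction between the two factors and sum to the full `delta`.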

In [None]:
# ! pip3 install bytedtqs 

In [1]:
from itertools import combinations
from scipy.special import factorial, comb
import pandas as pd
import numpy as np
import random
from IPython.display import display, HTML

import bytedtqs

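## Read the Hive source data

- SQL parameters are passed in via Python string substitution. (A productionized version could make this friendlier.)
- Running two SQL queries yields the metric datasets for the two periods.

Note: for convenience, df_a and df_b are named df_ctl and df_trt from here on. In the cell below, the two result sets are loaded from CSV files.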

In [29]:
df_ctl = pd.read_csv('./input_eo_ab_remove_level_test_control_group.csv')
df_trt = pd.read_csv('./input_eo_ab_remove_level_test_treatment_group.csv')

df_ctl = df_ctl[lambda x: x['milestone']!='入群损失']
df_trt = df_trt[lambda x: x['milestone']!='入群损失']

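## Data preprocessing

As mentioned at the start, attribution analysis requires a `get_metrics` function.

From the columns of the Hive dataset, select the metrics that enter `get_metrics` and store them in VAR_COLS. Also name the dataset's overall metric and store it in METRIC_NAME.

Because df_a and df_b may not contain the same dimension values, the missing ones must be filled in so that both datasets cover the same set.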

In [30]:
# meta: DIM_COLS, VAR_COLS, METRIC_NAME
# note: DIM_COLS + VAR_COLS = all columns of sql; follow the order in sql
DIM_COLS = ['milestone']
# _dim = ['os','channel','role','mile_stone_name']

VAR_COLS = ['group_users','d8_conversion']
METRIC_NAME = 'd8_conversion'

# operation function: VAR_COLS => METRIC
def get_metrics(df):
    return sum(df['group_users']*df['d8_conversion'])/sum(df['group_users'])

In [31]:
# fill NA (fill before casting: astype(str) would turn NaN into the string 'nan')
df_ctl[DIM_COLS] = df_ctl[DIM_COLS].fillna('_').astype(str)
df_trt[DIM_COLS] = df_trt[DIM_COLS].fillna('_').astype(str)
df_ctl[VAR_COLS] = df_ctl[VAR_COLS].fillna(0)
df_trt[VAR_COLS] = df_trt[VAR_COLS].fillna(0)

# select required columns
DIM_COLS, VAR_COLS = sorted(DIM_COLS), sorted(VAR_COLS)
df_ctl = df_ctl[DIM_COLS + VAR_COLS]
df_trt = df_trt[DIM_COLS + VAR_COLS]

# combine dim columns into one for easier analysis
NEW_DIM_COL = '_dim'

df_ctl[NEW_DIM_COL] = df_ctl[DIM_COLS].apply(tuple, axis=1)
df_trt[NEW_DIM_COL] = df_trt[DIM_COLS].apply(tuple, axis=1)

# drop old dim cols
df_ctl, df_trt = df_ctl.drop(DIM_COLS, axis=1), df_trt.drop(DIM_COLS, axis=1)

# find the set of all dim values
DIM_VALS = pd.concat([df_ctl[NEW_DIM_COL], df_trt[NEW_DIM_COL]]).unique()

# make sure both dataframes have records for all dim values
for d in DIM_VALS:
    # zero-filled row for a dim value that is missing on one side
    new_row = {NEW_DIM_COL: d, **{v: 0 for v in VAR_COLS}}
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    if d not in set(df_ctl[NEW_DIM_COL].values):
        df_ctl = pd.concat([df_ctl, pd.DataFrame([new_row])], ignore_index=True)
    if d not in set(df_trt[NEW_DIM_COL].values):
        df_trt = pd.concat([df_trt, pd.DataFrame([new_row])], ignore_index=True)
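
The dimension alignment step can also be sketched without a loop. This is an alternative formulation, not the notebook's own code: the helper `align` and the toy frames are hypothetical, plain string keys stand in for the tuple-valued `_dim` column, and each frame is assumed to hold at most one row per dimension value.

```python
import pandas as pd

# Hypothetical toy frames; each one is missing a dimension value the other has
df_x = pd.DataFrame({'_dim': ['A', 'B'],
                     'group_users': [10, 20],
                     'd8_conversion': [0.10, 0.20]})
df_y = pd.DataFrame({'_dim': ['B', 'C'],
                     'group_users': [25, 5],
                     'd8_conversion': [0.22, 0.30]})

# Union of dimension values, in order of first appearance
all_dims = list(pd.concat([df_x['_dim'], df_y['_dim']]).unique())

def align(df, dims, dim_col='_dim'):
    """Return df with one row per value in dims; absent rows are zero-filled."""
    return (
        df.set_index(dim_col)
          .reindex(dims, fill_value=0)
          .rename_axis(dim_col)
          .reset_index()
    )

df_x_aligned = align(df_x, all_dims)
df_y_aligned = align(df_y, all_dims)
print(df_x_aligned)
print(df_y_aligned)
```

After alignment both frames have the same rows in the same order, which is what lets the later steps swap values between them row by row.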

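## Attribution analysis

The original Shapley value sums over subsets, so its theoretical complexity is $O(2^n)$.

$\varphi _{i}(v)=\sum _{S\subseteq N\setminus \{i\}}{\frac {|S|!\;(n-|S|-1)!}{n!}}(v(S\cup \{i\})-v(S))$

To keep the cost manageable, we sample random orderings of the players instead of enumerating every subset, a trade-off between the exact Shapley value and the control-variable method. Sampling loses some precision; `random.seed()` keeps the result reproducible.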
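
For reference, the exact formula can be evaluated directly when $n$ is small. The sketch below is a minimal illustration, not the notebook's method: `shapley_exact`, the three-player game, and its values are all hypothetical. The utility `v(S)` is the metric obtained when the players in `S` take their new values and the rest keep their baseline values.

```python
from itertools import combinations
from math import factorial

def shapley_exact(players, v):
    """Exact Shapley values; v maps a frozenset of players to a utility."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            # weight |S|! (n - |S| - 1)! / n! from the formula
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (v(s | {i}) - v(s))
        phi[i] = total
    return phi

# Hypothetical game: the metric is the product of three factors
base = {'a': 1.0, 'b': 2.0, 'c': 3.0}
new = {'a': 2.0, 'b': 2.5, 'c': 3.0}  # 'c' does not change

def v(s):
    vals = [new[k] if k in s else base[k] for k in base]
    return vals[0] * vals[1] * vals[2]

phi = shapley_exact(list(base), v)
delta = v(frozenset(base)) - v(frozenset())
print(phi, "sum:", sum(phi.values()), "delta:", delta)
```

Each player requires a pass over all subsets of the other players, so the number of utility evaluations grows as $2^n$. With one player per (dimension value, variable) pair, this quickly becomes impractical, which motivates sampling.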

In [32]:
# set max sample size
SAMPLE_SIZE = 20

# players: dim x variable
players = [(i, j) for i in range(len(DIM_VALS)) for j in range(len(VAR_COLS))]
phi = dict()

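# sample
N = len(players)
# exact=True yields an integer; a float here would break range(SAMPLE_SIZE) for small N
SAMPLE_SIZE = min(SAMPLE_SIZE, int(factorial(N, exact=True)))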
seq_list = list()

random.seed(666)
for _ in range(SAMPLE_SIZE):
    seq = list(range(N))
    random.shuffle(seq)
    seq_list.append(seq)

In [33]:
# reuse the same set of sequences for all players
for seq in seq_list:
    # make a copy of ctl
    df_s = df_ctl.copy()
    # current utility
    v_current = get_metrics(df_s)

    for i in range(N):
        # select player p
        p = players[seq[i]]
        # select dim and variable
        d, v = DIM_VALS[p[0]], VAR_COLS[p[1]]
        # update df_s
        df_s.loc[lambda x: x[NEW_DIM_COL]==d, v] = \
            df_trt.loc[lambda x: x[NEW_DIM_COL]==d, v].values
        # calculate marginal utility
        v_si = get_metrics(df_s)
        phi_i = v_si - v_current
        # update current utility
        v_current = v_si
        
        # add utility for player p
        if p in phi:
            phi[p] += phi_i
        else:
            phi[p] = phi_i

# divided by sample size
phi_avg = {k:1.0*v/SAMPLE_SIZE for k, v in phi.items()}
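
The sampling loop above can be sketched in a self-contained form on plain dictionaries. This is an illustrative reduction, not the notebook's code: `shapley_sampled`, `metric`, and the toy numbers are assumptions. Players are (dimension, variable) pairs, and the metric is a user-weighted conversion rate in the spirit of `get_metrics`.

```python
import random

def metric(state):
    # User-weighted conversion rate over all dimension values
    dims = sorted({d for d, _ in state})
    users = sum(state[(d, 'group_users')] for d in dims)
    converted = sum(state[(d, 'group_users')] * state[(d, 'd8_conversion')]
                    for d in dims)
    return converted / users

def shapley_sampled(base, new, n_samples, seed=0):
    """Estimate Shapley values by averaging marginal effects over random orderings."""
    rng = random.Random(seed)
    players = list(base)
    phi = {p: 0.0 for p in players}
    for _ in range(n_samples):
        order = players[:]
        rng.shuffle(order)
        state = dict(base)
        current = metric(state)
        for p in order:
            state[p] = new[p]          # switch player p to its new value
            updated = metric(state)
            phi[p] += updated - current
            current = updated
    return {p: total / n_samples for p, total in phi.items()}

# Hypothetical baseline and current values
base = {('A', 'group_users'): 100, ('A', 'd8_conversion'): 0.05,
        ('B', 'group_users'): 50, ('B', 'd8_conversion'): 0.10}
new = {('A', 'group_users'): 80, ('A', 'd8_conversion'): 0.07,
       ('B', 'group_users'): 70, ('B', 'd8_conversion'): 0.10}

phi = shapley_sampled(base, new, n_samples=20, seed=666)
delta = metric(new) - metric(base)
print(phi)
print("sum:", sum(phi.values()), "delta:", delta)
```

Within each ordering the marginal effects telescope from the baseline metric to the new metric, so the averaged contributions sum to the overall change up to floating-point rounding, regardless of the sample size. Sampling affects how the change is split among players, not the total.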

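## Presenting the results

This part has room for improvement: it could be friendlier, and the changes would be simple. For now, the main drivers are identified by looking at the top factors.

The output keeps the original column names: 基期 (baseline period), 现期 (current period), 贡献 (contribution), 贡献占整体 (share of the overall change), 贡献占同向 (share of all contributions with the same sign), 基期比重 (share of `group_users` in the baseline period).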

In [34]:
# overall metrics
metrics_ctl, metrics_trt = \
get_metrics(df_ctl), get_metrics(df_trt)
delta_metrics = metrics_trt - metrics_ctl

# standardize (because of sampling)
phi_sum = sum(phi.values())
phi_std = {k:1.0*delta_metrics*v/phi_sum for k, v in phi.items()}

# sum of positive and negative contribution
phi_sum_pos, phi_sum_neg = 0, 0

for _, v in phi_std.items():
    if v > 0:
        phi_sum_pos += v
    else:
        phi_sum_neg += v

# save contribution of each player
player_contribution = \
[{'_dim':DIM_VALS[k[0]], 
  '_var':VAR_COLS[k[1]],
  '基期': df_ctl.loc[lambda x: x[NEW_DIM_COL]==DIM_VALS[k[0]], VAR_COLS[k[1]]].values[0],
  '现期': df_trt.loc[lambda x: x[NEW_DIM_COL]==DIM_VALS[k[0]], VAR_COLS[k[1]]].values[0],
  '贡献': v,
  '贡献占整体': v/delta_metrics,
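  # guard: a zero contribution with no negative contributions would divide by zero
  '贡献占同向': v/phi_sum_pos if v > 0 else (v/phi_sum_neg if phi_sum_neg else 0.0),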
 } 
 for k, v in phi_std.items()]

# sort by contribution according to overall change sign
if delta_metrics > 0:
    player_contribution.sort(key=lambda x: -x['贡献'])
else:
    player_contribution.sort(key=lambda x: x['贡献'])

In [35]:
df_contribution = pd.DataFrame(player_contribution)

df_contribution_split = pd.concat(
    [
        pd.DataFrame(df_contribution.loc[:,NEW_DIM_COL].tolist(), columns=DIM_COLS), 
        df_contribution
    ], 
    axis=1,
)

BASE_VAR_COL = 'group_users'

for c in DIM_COLS:
    for v in VAR_COLS:
        df_contribution_grouped = \
            df_contribution_split[lambda x: x['_var']==v].\
            groupby(c)[['贡献','贡献占整体','贡献占同向']].sum().\
            sort_values(by='贡献占整体', ascending=False)
        
        x = df_contribution_split[lambda x: x['_var']==BASE_VAR_COL]['基期'].sum()
        df_proportion_grouped = df_contribution_split[lambda x: x['_var']==BASE_VAR_COL].\
            groupby(c)['基期'].sum().rename('基期比重') / x
        
        merged = pd.merge(
            df_contribution_grouped, 
            df_proportion_grouped, 
            on=c,
        ).assign(importance=lambda x: np.abs(x['贡献占整体'])/x['基期比重'])
        
        print(f"维度: {c}, 变量: {v}, 累计贡献: {sum(merged['贡献']):.4f}")
        display(
            merged.style.\
                background_gradient(subset=pd.IndexSlice[:,['贡献占整体','基期比重','importance']]).\
                format("{:.4f}")
        )
        print("-"*50)

维度: milestone, 变量: d8_conversion, 累计贡献: 0.0151


| milestone | 贡献 | 贡献占整体 | 贡献占同向 | 基期比重 | importance |
| --- | --- | --- | --- | --- | --- |
| A1初 | 0.0071 | 0.3664 | 0.3339 | 0.4712 | 0.7777 |
| A1中 | 0.0025 | 0.1288 | 0.1174 | 0.0574 | 2.2442 |
| B1初 | 0.0025 | 0.1274 | 0.1161 | 0.142 | 0.8969 |
| A2中 | 0.0018 | 0.0926 | 0.0844 | 0.0506 | 1.8319 |
| A2初 | 0.0013 | 0.0676 | 0.0616 | 0.0917 | 0.7371 |
| B2 | 0.001 | 0.0491 | 0.0447 | 0.0244 | 2.0135 |
| A1高 | 0.0005 | 0.0265 | 0.0241 | 0.0414 | 0.6393 |
| A2高 | 0.0 | 0.0003 | 0.0003 | 0.0561 | 0.0056 |
| C2 | 0.0 | 0.0 | 0.0 | 0.0003 | 0.0 |
| C1 | -0.0002 | -0.0114 | 0.1169 | 0.0037 | 3.1031 |


--------------------------------------------------
维度: milestone, 变量: group_users, 累计贡献: 0.0043


| milestone | 贡献 | 贡献占整体 | 贡献占同向 | 基期比重 | importance |
| --- | --- | --- | --- | --- | --- |
| A1初 | 0.0036 | 0.187 | 0.1704 | 0.4712 | 0.3969 |
| A2初 | 0.0004 | 0.0214 | 0.0195 | 0.0917 | 0.2337 |
| A1中 | 0.0002 | 0.0103 | 0.0094 | 0.0574 | 0.1796 |
| A2中 | 0.0002 | 0.0082 | 0.0075 | 0.0506 | 0.1617 |
| B1初 | 0.0001 | 0.0059 | 0.0054 | 0.142 | 0.0417 |
| A2高 | 0.0001 | 0.0042 | 0.0038 | 0.0561 | 0.0743 |
| C1 | 0.0 | 0.0016 | 0.0015 | 0.0037 | 0.4489 |
| B2 | -0.0 | -0.0001 | 0.0008 | 0.0244 | 0.0031 |
| C2 | -0.0 | -0.0009 | 0.0094 | 0.0003 | 3.5038 |
| B1高 | -0.0001 | -0.0042 | 0.0427 | 0.0045 | 0.9331 |


--------------------------------------------------


In [36]:


# c = ('channel','role')
# for v in VAR_COLS:
#     df_contribution_grouped = \
#         df_contribution_split[lambda x: x['_var']==v].\
#         groupby(c)[['贡献','贡献占整体','贡献占同向']].sum().\
#         sort_values(by='贡献占整体', ascending=False)

#     x = df_contribution_split[lambda x: x['_var']==BASE_VAR_COL]['基期'].sum()
#     df_proportion_grouped = df_contribution_split[lambda x: x['_var']==BASE_VAR_COL].\
#         groupby(c)['基期'].sum().rename('基期比重') / x

#     merged = pd.merge(
#         df_contribution_grouped, 
#         df_proportion_grouped, 
#         on=c,
#     ).assign(comp=lambda x: x['贡献占整体']/x['基期比重'])

#     print(f"dim: {c}, variable: {v}")
#     display(
#         merged.style.\
#             background_gradient(subset=pd.IndexSlice[:,['贡献占整体','comp']]).\
#             format("{:.4f}")
#     )
#     print("-"*50)

In [37]:
N_REASONS = 50


def decimal_select(num):
    """
    format decimal places for common metrics
    """
    if num >= 100:
        return int(num)
    elif num >= 1:
        return round(num, 2)
    else:
        return round(num, 4)
    

print(f"整体指标 {METRIC_NAME} ({decimal_select(metrics_ctl)} => {decimal_select(metrics_trt)}), \
变化量 = {decimal_select(delta_metrics)}")
print("-"*30)
print("主要影响因素:\n")

for i in range(N_REASONS):
    if i <= len(player_contribution)-1:
        r = player_contribution[i]
        print(f"""{i+1}. {r['_dim']} 的 {r['_var']} ({decimal_select(r['基期'])} => {decimal_select(r['现期'])}).""",
        f"""贡献: {decimal_select(r['贡献'])}.""",
        f"""占{'正' if r['贡献'] > 0 else '负'}向的: {int(r['贡献占同向']*100)}%.""")

整体指标 d8_conversion (0.0522 => 0.0716), 变化量 = 0.0194
------------------------------
主要影响因素:

1. ('A1初',) 的 d8_conversion (0.0345 => 0.0524). 贡献: 0.0071. 占正向的: 33%.
2. ('A1初',) 的 group_users (1798 => 1069). 贡献: 0.0036. 占正向的: 17%.
3. ('A1中',) 的 d8_conversion (0.0548 => 0.0833). 贡献: 0.0025. 占正向的: 11%.
4. ('B1初',) 的 d8_conversion (0.0738 => 0.0906). 贡献: 0.0025. 占正向的: 11%.
5. ('A2中',) 的 d8_conversion (0.0622 => 0.0898). 贡献: 0.0018. 占正向的: 8%.
6. ('A2初',) 的 d8_conversion (0.0829 => 0.0951). 贡献: 0.0013. 占正向的: 6%.
7. ('B2',) 的 d8_conversion (0.043 => 0.0784). 贡献: 0.001. 占正向的: 4%.
8. ('A1高',) 的 d8_conversion (0.0506 => 0.0615). 贡献: 0.0005. 占正向的: 2%.
9. ('A2初',) 的 group_users (350 => 410). 贡献: 0.0004. 占正向的: 1%.
10. ('A1中',) 的 group_users (219 => 396). 贡献: 0.0002. 占正向的: 0%.
11. ('A2中',) 的 group_users (193 => 256). 贡献: 0.0002. 占正向的: 0%.
12. ('B1初',) 的 group_users (542 => 563). 贡献: 0.0001. 占正向的: 0%.
13. ('A2高',) 的 group_users (214 => 225). 贡献: 0.0001. 占正向的: 0%.
14. ('C1',) 的 group_users (14 => 9). 贡献

In [38]:
# contribution summary by var
df_res_var = pd.DataFrame(player_contribution).groupby('_var')['贡献'].sum().reset_index()
df_res_var

|  | _var | 贡献 |
| --- | --- | --- |
| 0 | d8_conversion | 0.015122 |
| 1 | group_users | 0.004307 |


In [42]:
# contribution summary by dim
# after the merges, the *_x columns come from df_ctl and the *_y columns from df_trt
df_res_dim = pd.merge(
    pd.merge(
        pd.DataFrame(player_contribution).groupby('_dim')['贡献'].sum().sort_values().reset_index(),
        df_ctl,
        on='_dim',
    ),
    df_trt,
    on='_dim',
)

df_res_dim.sort_values(by='贡献', ascending=False)

|  | _dim | 贡献 | d8_conversion_x | group_users_x | d8_conversion_y | group_users_y |
| --- | --- | --- | --- | --- | --- | --- |
| 11 | (A1初,) | 0.010753 | 0.0345 | 1798 | 0.0524 | 1069 |
| 10 | (A1中,) | 0.002703 | 0.0548 | 219 | 0.0833 | 396 |
| 9 | (B1初,) | 0.00259 | 0.0738 | 542 | 0.0906 | 563 |
| 8 | (A2中,) | 0.001959 | 0.0622 | 193 | 0.0898 | 256 |
| 7 | (A2初,) | 0.00173 | 0.0829 | 350 | 0.0951 | 410 |
| 6 | (B2,) | 0.000952 | 0.043 | 93 | 0.0784 | 102 |
| 5 | (A1高,) | 0.00041 | 0.0506 | 158 | 0.0615 | 244 |
| 4 | (A2高,) | 8.7e-05 | 0.0888 | 214 | 0.0889 | 225 |
| 3 | (C2,) | -1.8e-05 | 0.0 | 1 | 0.0 | 2 |
| 2 | (C1,) | -0.000189 | 0.0714 | 14 | 0.0 | 9 |
