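## Introduction

Attribution analysis compares how a metric changes between two time periods.

To do this, we prepare two datasets, df_a and df_b. Each dataset consists of dimensions and metrics. Suppose df_a looks like this: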

| 操作系统 (OS) | 付费状态 (paid status) | 用户占比 (user share) | 转化率 (conversion rate) |
| --- | --- | --- | --- |
| iOS | paid | 20% | 12% |
| iOS | unpaid | 30% | 15% |
| Android | paid | 10% | 8% |
| Android | unpaid | 40% | 17% |

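Here the dimensions are 操作系统 (OS) and 付费状态 (paid status), and the metrics are 用户占比 (user share) and 转化率 (conversion rate).

Applying some computation to the metrics of a dataset yields the overall metric. This computation can be expressed as a simple Python function: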

```python
def get_metrics(df):
    return sum(df['用户占比']*df['转化率'])
```

The change in the overall metric between the two periods is then $\Delta = \text{get\_metrics}(df\_b) - \text{get\_metrics}(df\_a)$.
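As a quick check of this definition, the sketch below evaluates `get_metrics` on the four segments of the example table (taking the last row as the Android unpaid segment) and on a second period. The `df_b` values are hypothetical and chosen only for illustration.

```python
import pandas as pd

def get_metrics(df):
    # overall conversion rate = sum over segments of (user share * conversion rate)
    return sum(df['用户占比'] * df['转化率'])

# period a: the example segments, with percentages written as fractions
df_a = pd.DataFrame({
    '操作系统': ['iOS', 'iOS', 'Android', 'Android'],
    '付费状态': ['paid', 'unpaid', 'paid', 'unpaid'],
    '用户占比': [0.20, 0.30, 0.10, 0.40],
    '转化率': [0.12, 0.15, 0.08, 0.17],
})

# period b: hypothetical values, not taken from any real data
df_b = df_a.copy()
df_b['用户占比'] = [0.25, 0.25, 0.15, 0.35]
df_b['转化率'] = [0.12, 0.16, 0.08, 0.15]

metric_a = get_metrics(df_a)
metric_b = get_metrics(df_b)
delta = metric_b - metric_a
print(f"{metric_a:.4f} {metric_b:.4f} {delta:.4f}")  # 0.1450 0.1345 -0.0105
```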

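Our goal is to explain $\Delta$. Specifically:

1. From period a to period b, how much does each metric [用户占比 (user share), 转化率 (conversion rate)] contribute to $\Delta$?
2. Over the same period, how much does each metric contribute under each dimension value? For example: what fraction of $\Delta$ comes from the change in user share of paid iOS users?

In short, attribution takes the difference in a metric between two periods and assigns a quantified share of it to each metric under each dimension value.

You may ask: why not simply use the control-variable method, changing one factor at a time? For the reasons, see the wiki entry: Shapley Value.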
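One way to see the issue is a two-factor toy case, sketched below with made-up numbers. When the metric is a product, changing one factor at a time ignores the interaction term, so the individual effects need not add up to $\Delta$. Averaging each factor's marginal effect over all switching orders (the Shapley value) does add up.

```python
from itertools import permutations

def metric(share, rate):
    # toy overall metric for a single segment: share * rate
    return share * rate

base = {'share': 0.5, 'rate': 0.1}   # hypothetical period-a values
new = {'share': 0.6, 'rate': 0.2}    # hypothetical period-b values
delta = metric(**new) - metric(**base)

# control-variable method: switch one factor, hold the other at its base value
one_at_a_time = {
    k: metric(**{**base, k: new[k]}) - metric(**base) for k in base
}

# Shapley: average each factor's marginal effect over all switching orders
shapley = {k: 0.0 for k in base}
orders = list(permutations(base))
for order in orders:
    state = dict(base)
    for k in order:
        before = metric(**state)
        state[k] = new[k]
        shapley[k] += (metric(**state) - before) / len(orders)

print(round(delta, 4))                                      # 0.07
print({k: round(v, 4) for k, v in one_at_a_time.items()})   # {'share': 0.01, 'rate': 0.05}
print({k: round(v, 4) for k, v in shapley.items()})         # {'share': 0.015, 'rate': 0.055}
```

The one-at-a-time effects sum to 0.06 and leave 0.01 of the change unexplained, while the Shapley values sum to the full 0.07.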

In [None]:
# ! pip3 install bytedtqs 

In [1]:
from itertools import combinations
from scipy.special import factorial, comb
import pandas as pd
import numpy as np
import random
from IPython.display import display, HTML

import bytedtqs

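## Reading the Hive source data

- SQL parameters are passed in through Python string substitution. (A production version could make this friendlier.)
- Running the SQL twice, once per period, yields the two metric datasets.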
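The substitution step can be sketched on a shortened template. The table name `some_table` and the two-column dimension list below are hypothetical; the real query is much longer but is filled in the same way with `str.format`.

```python
# A shortened, hypothetical template standing in for the full query.
sql_template = """
select {dim}, count(1) as dnu
from some_table
where install_date between '{start_date}' and '{end_date}'
group by {dim}
"""

dates = {'start_date': '2020-09-14', 'end_date': '2020-09-20'}
dims = ['os', 'channel']

# every {name} placeholder is replaced by the matching keyword argument
query = sql_template.format(**dates, dim=', '.join(dims))
print(query)
```

Because `str.format` treats every brace as a placeholder, any literal `{` or `}` in the SQL would have to be doubled (`{{`, `}}`). Plain string substitution also performs no escaping, so it is only suitable for trusted inputs such as the hard-coded dates used here.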

In [2]:
# set starting & ending dates for sql
_dates_1 = {'start_date': '2020-09-14', 'end_date': '2020-09-20'}
_dates_2 = {'start_date': '2020-09-21', 'end_date': '2020-09-27'}


# set dimensions for sql
_dim = ['os','channel','role','mile_stone_name','ez_registered','age','city_level']

In [6]:
sql = """
set spark.sql.adaptive.enabled = true;
set spark.sql.adaptive.join.enabled = true;
set spark.sql.adaptive.hashJoin.enabled = true;
set spark.sql.adaptive.skewedJoin.enabled = true;
set spark.dynamicAllocation.enabled = true;
set spark.dynamicAllocation.maxExecutors = 2000;
set spark.driver.memory = 20g;
set spark.executor.memory = 20g;
set spark.shuffle.hdfs.enabled = true;
set spark.shuffle.io.maxRetries = 1;
set spark.shuffle.io.retryWait = 0s;
set spark.sql.adaptive.maxNumPostShufflePartitions = 5000;
set spark.sql.sources.bucketing.enabled = true;

-- for Hive 0.11.0 through 2.1.x
set hive.groupby.orderby.position.alias=true;
-- allow cartesian
set hive.mapred.mode=nonstrict;
-- allow parallel
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;

--------------- SEP ---------------

-- first login record of each ez device_id
with ez_did as (
  select
    device_id
    ,min(create_time) as create_time
  from dm_ez.user_device
  where date = regexp_replace('{end_date}', '-', '')
  group by device_id
),

-- profile and behavior of each new eo user device_id
eo_did as (
  select
  nu.install_date
  ,nu.device_id
  ,nu.os
  ,case 
    when channel_user_name_modify in ('toutiao_promote','toutiaodsp_new') then '内广' 
    when channel_user_name_modify in ('googleadwords_int','store_google') and os='android' then 'Google-Android' 
    when channel_user_name_modify in ('googleadwords_int','store_google') and os='ios' then 'Google-iOS' 
    when channel_user_name_modify='huawei_id' then '华为' 
    when channel_user_name_modify='guangdiantong' then '广点通' 
    when channel_user_name_modify='AppStore' then 'Apple Store' 
    when channel_user_name_modify='oppo_id' then 'OPPO' 
    when channel_user_name_modify='vivo_id' then 'VIVO' 
    when channel_user_name_modify='xiaomi_id' then '小米' 
    when channel_user_name_modify='fensitong' then '粉丝通' 
    when channel_user_name_modify='baiduxinxiliu' then '百度信息流' 
    when channel_user_name_modify='FacebookAds' then 'Facebook' 
    when channel_user_name_modify='store_tengxun' then '应用宝' else '其他' 
  end as channel
  ,if(ez.create_time is not null and from_unixtime(ez.create_time) < nu.install_time, '已登录EZ', '未登录EZ') as ez_registered
  -- profile data may change over time, so take max
  ,coalesce(max(case nu.role
    when 1 then '上班族'
    when 2 then '自由职业' 
    when 3 then '大学生'
    when 4 then '中小学生'
  end),'unknown') as role
  ,coalesce(max(nu.mile_stone_name), 'unknown') as mile_stone_name
  ,coalesce(max(nu.edu), 'unknown') as edu
  ,coalesce(max(nu.age), 'unknown') as age
  ,coalesce(max(nu.city_level_in_profile), 'unknown') as city_level
  ,coalesce(max(nu.career), 'unknown') as career
  -- to avoid duplicate records, the following also use max
  ,max(if(gap_days = '0days' and enter_camp = 1, 1, 0)) as d0_enter_camp
  ,max(if(gap_days = '0days' and is_study = 1, 1, 0)) as d0_study
  ,max(if(gap_days = '0days' and enter_wechat_group = 1, 1, 0)) as d0_enter_group
  ,max(if(gap_days = '3days' and enter_wechat_group = 1, 1, 0)) as d3_enter_group
  ,max(if(gap_days = '3days' and enter_wechat_group = 1, order_cnt, 0)) as d3_group_cnt
  ,max(if(gap_days = '3days' and enter_wechat_group = 1, order_revenue, 0)) as d3_group_revenue
  ,max(if(gap_days = '6days' and enter_wechat_group = 1, 1, 0)) as d6_enter_group
  ,max(if(gap_days = '6days' and enter_wechat_group = 1, order_cnt, 0)) as d6_group_cnt
  ,max(if(gap_days = '6days' and enter_wechat_group = 1, order_revenue, 0)) as d6_group_revenue
from dm_eo.dwd_eo_newer_revenue_di as nu
left join ez_did as ez on nu.device_id = ez.device_id
where nu.date >= regexp_replace('{start_date}','-','')
  and nu.install_date between '{start_date}' and '{end_date}'
  and nu.gap_days in ('0days','3days','6days')
  and (nu.is_test != 1 or nu.is_test is null)
group by nu.install_date
  ,nu.device_id
  ,nu.os
  ,case 
    when channel_user_name_modify in ('toutiao_promote','toutiaodsp_new') then '内广' 
    when channel_user_name_modify in ('googleadwords_int','store_google') and os='android' then 'Google-Android' 
    when channel_user_name_modify in ('googleadwords_int','store_google') and os='ios' then 'Google-iOS' 
    when channel_user_name_modify='huawei_id' then '华为' 
    when channel_user_name_modify='guangdiantong' then '广点通' 
    when channel_user_name_modify='AppStore' then 'Apple Store' 
    when channel_user_name_modify='oppo_id' then 'OPPO' 
    when channel_user_name_modify='vivo_id' then 'VIVO' 
    when channel_user_name_modify='xiaomi_id' then '小米' 
    when channel_user_name_modify='fensitong' then '粉丝通' 
    when channel_user_name_modify='baiduxinxiliu' then '百度信息流' 
    when channel_user_name_modify='FacebookAds' then 'Facebook' 
    when channel_user_name_modify='store_tengxun' then '应用宝' else '其他' 
  end
  ,if(ez.create_time is not null and from_unixtime(ez.create_time) < nu.install_time, '已登录EZ', '未登录EZ')
),

eo_sub as (
  select
    *
    ,sum(1) over() as dnu_sum
    ,sum(d0_enter_group) over() as d0_enter_group_sum
    ,sum(d3_enter_group) over() as d3_enter_group_sum
    ,sum(d6_enter_group) over() as d6_enter_group_sum
  from eo_did
)

select
  {dim}
  ,dnu_sum
  ,sum(1) as dnu
  ,sum(d0_enter_camp) as d0_enter_camp
  ,sum(d0_enter_group) as d0_enter_group
  ,sum(d3_enter_group) as d3_enter_group
  ,sum(d6_enter_group) as d6_enter_group
  ,1.0*sum(1)/dnu_sum as dnu_prop
  ,1.0*sum(d0_enter_group)/d0_enter_group_sum as d0_enter_group_prop
  ,1.0*sum(d3_enter_group)/d3_enter_group_sum as d3_enter_group_prop
  ,1.0*sum(d6_enter_group)/d6_enter_group_sum as d6_enter_group_prop
  ,1.0*sum(d0_enter_camp)/sum(1) as d0_dnu_to_camp
  ,1.0*sum(d0_enter_group)/sum(1) as d0_dnu_to_group
  ,1.0*sum(d0_study)/sum(1) as d0_dnu_to_study
  ,if(sum(d0_enter_camp) = 0, 0, 1.0*sum(d0_enter_group)/sum(d0_enter_camp)) as d0_camp_to_group
  ,if(sum(d0_enter_camp) = 0, 0, 1.0*sum(d3_enter_group)/sum(d0_enter_camp)) as d3_camp_to_group
  ,if(sum(d0_enter_camp) = 0, 0, 1.0*sum(d6_enter_group)/sum(d0_enter_camp)) as d6_camp_to_group
  ,if(sum(d3_enter_group) = 0, 0, 1.0*sum(d3_group_cnt)/sum(d3_enter_group)) as d3_group_to_order
  ,if(sum(d3_enter_group) = 0, 0, 1.0*sum(d3_group_revenue)/sum(d3_enter_group)) as d3_rev_per_group_user
  ,if(sum(d3_group_cnt) = 0, 0, 1.0*sum(d3_group_revenue)/sum(d3_group_cnt)/100) as d3_arpu
  ,if(sum(d6_enter_group) = 0, 0, 1.0*sum(d6_group_cnt)/sum(d6_enter_group)) as d6_group_to_order
  ,if(sum(d6_group_cnt) = 0, 0, 1.0*sum(d6_group_revenue)/sum(d6_group_cnt)/100) as d6_arpu
from eo_sub
group by {dim}, dnu_sum, d0_enter_group_sum, d3_enter_group_sum, d6_enter_group_sum
"""

In [7]:
# generate sql
_dim_str = ', '.join(_dim)
sql_1 = sql.format(**{**_dates_1, 'dim': _dim_str})
sql_2 = sql.format(**{**_dates_2, 'dim': _dim_str})
# sql_3 = sql.format(**{**_dates_3, 'dim': _dim_str})
# sql_4 = sql.format(**{**_dates_4, 'dim': _dim_str})

# refresh client
# replace the placeholders below with your own credentials;
# real tokens should not be stored in a shared notebook
app_id = '<YOUR_APP_ID>'
app_key = '<YOUR_APP_KEY>'
user_name = 'wufei.97'

client = bytedtqs.TQSClient(app_id=app_id, app_key=app_key)

Note: for convenience in the analysis, df_a and df_b are renamed df_ctl and df_trt, respectively.

In [10]:
job_1 = client.execute_query(user_name=user_name, query=sql_1)
job_2 = client.execute_query(user_name=user_name, query=sql_2)
# job_3 = client.execute_query(user_name=user_name, query=sql_3)
# job_4 = client.execute_query(user_name=user_name, query=sql_4)

df_1 = pd.read_csv(job_1.get_result().result_url)
df_2 = pd.read_csv(job_2.get_result().result_url)
# df_3 = pd.read_csv(job_3.get_result().result_url)
# df_4 = pd.read_csv(job_4.get_result().result_url)

[2020-09-28 12:54:23,041] - [INFO] - job submitted, job_id: 132564950
[2020-09-28 12:54:23,068] - [INFO] - job_id: 132564950, engine_type: Hive, status: Created
[2020-09-28 12:54:25,092] - [INFO] - job_id: 132564950, engine_type: Spark, status: Processing
[2020-09-28 12:54:27,122] - [INFO] - job_id: 132564950, engine_type: Spark, status: Processing
[2020-09-28 12:54:29,146] - [INFO] - job_id: 132564950, engine_type: Spark, status: Processing
[2020-09-28 12:54:31,168] - [INFO] - job_id: 132564950, engine_type: Spark, status: Processing
[2020-09-28 12:54:33,186] - [INFO] - job_id: 132564950, engine_type: Spark, status: Processing
[2020-09-28 12:54:35,207] - [INFO] - job_id: 132564950, engine_type: Spark, status: Processing
[2020-09-28 12:54:37,234] - [INFO] - job_id: 132564950, engine_type: Spark, status: Processing, tracking_urls: http://n17-075-136.byted.org:8060/proxy/application_1596090352493_5537173/application_1596090352493_5537173/jobs/job?id=886
[2020-09-28 12:54:39,260] - [INFO]

ParserError: Error tokenizing data. C error: Expected 2 fields in line 7, saw 25


In [30]:
df_1 = pd.read_csv(job_1.get_result().result_url, skiprows=6)
df_2 = pd.read_csv(job_2.get_result().result_url, skiprows=6)

In [44]:
df_ctl = df_1 # control: df_a
df_trt = df_2 # treatment: df_b

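## Data preprocessing

As stated at the start, attribution analysis requires a `get_metrics` function.

From the columns of the Hive dataset, choose the metrics that enter `get_metrics` and store them in VAR_COLS. Also give the overall metric of the dataset a name and store it in METRIC_NAME.

Because df_a and df_b may not contain the same dimension values, the missing ones are filled in so that both datasets cover the same set.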
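The alignment step can be sketched on toy data. This is one possible approach using `reindex`, not necessarily the one used in the cells that follow. The frames and the column names `share` and `rate` are hypothetical.

```python
import pandas as pd

# hypothetical toy data: 'b' exists only in period a, 'c' only in period b
df_a = pd.DataFrame({'_dim': ['a', 'b'], 'share': [0.6, 0.4], 'rate': [0.1, 0.2]})
df_b = pd.DataFrame({'_dim': ['a', 'c'], 'share': [0.7, 0.3], 'rate': [0.1, 0.3]})

# union of the dimension values seen in either period
all_dims = pd.Index(df_a['_dim']).union(pd.Index(df_b['_dim']))

def align(df):
    # reindex on the union; segments missing from df get 0 for every metric
    return (df.set_index('_dim')
              .reindex(all_dims, fill_value=0)
              .rename_axis('_dim')
              .reset_index())

df_a_full, df_b_full = align(df_a), align(df_b)
print(df_a_full)
print(df_b_full)
```

Filling with 0 means a segment that is absent in one period is treated as having zero share and zero rate there, so its entire appearance or disappearance is attributed to the change between periods.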

In [45]:
# meta: DIM_COLS, VAR_COLS, METRIC_NAME
# note: DIM_COLS + VAR_COLS = all columns of sql; follow the order in sql
DIM_COLS = _dim
# _dim = ['os','channel','role','mile_stone_name']

VAR_COLS = ['dnu_prop','d0_dnu_to_study']
METRIC_NAME = 'd0_dnu_to_study'

# operation function: VAR_COLS => METRIC
def get_metrics(df):
    return sum(df['dnu_prop']*df['d0_dnu_to_study'])

# # --------------- SEP ---------------
# DIM_COLS = _dim
# # _dim = ['os','channel','role','mile_stone_name']

# VAR_COLS = ['dnu','d0_dnu_to_group']
# METRIC_NAME = 'd0_dnu_to_group'

# # operation function: VAR_COLS => METRIC
# def get_metrics(df):
#     return sum(df['dnu']*df['d0_dnu_to_group'])/sum(df['dnu'])


In [46]:
# fill NA (fill before casting: astype(str) would turn NaN into the string 'nan')
df_ctl[DIM_COLS] = df_ctl[DIM_COLS].fillna('_').astype(str)
df_trt[DIM_COLS] = df_trt[DIM_COLS].fillna('_').astype(str)
df_ctl[VAR_COLS] = df_ctl[VAR_COLS].fillna(0)
df_trt[VAR_COLS] = df_trt[VAR_COLS].fillna(0)

# select required columns (copy to avoid SettingWithCopyWarning on later assignment)
DIM_COLS, VAR_COLS = sorted(DIM_COLS), sorted(VAR_COLS)
df_ctl = df_ctl[DIM_COLS + VAR_COLS].copy()
df_trt = df_trt[DIM_COLS + VAR_COLS].copy()

# combine dim columns into one for easier analysis
NEW_DIM_COL = '_dim'

df_ctl[NEW_DIM_COL] = df_ctl[DIM_COLS].apply(tuple, axis=1)
df_trt[NEW_DIM_COL] = df_trt[DIM_COLS].apply(tuple, axis=1)

# drop old dim cols
df_ctl, df_trt = df_ctl.drop(DIM_COLS, axis=1), df_trt.drop(DIM_COLS, axis=1)

# find the set of all dim values
DIM_VALS = pd.concat([df_ctl[NEW_DIM_COL], df_trt[NEW_DIM_COL]]).unique()

# make sure both dataframes have records for all dim values
def add_missing_dims(df):
    present = set(df[NEW_DIM_COL])
    # one zero-filled row per dim value that df lacks
    new_rows = [{NEW_DIM_COL: d, **{v: 0 for v in VAR_COLS}}
                for d in DIM_VALS if d not in present]
    if not new_rows:
        return df
    # DataFrame.append was removed in pandas 2.0, so use pd.concat
    return pd.concat([df, pd.DataFrame(new_rows, columns=df.columns)], ignore_index=True)

df_ctl, df_trt = add_missing_dims(df_ctl), add_missing_dims(df_trt)


## Attribution analysis

The original Shapley value sums over subsets, so its theoretical complexity is $O(2^n)$.

$\varphi _{i}(v)=\sum _{S\subseteq N\setminus \{i\}}{\frac {|S|!\;(n-|S|-1)!}{n!}}(v(S\cup \{i\})-v(S))$
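To make the formula concrete, here is a minimal sketch that evaluates it by brute force for a small game. The three-player value function `v` is hypothetical and unrelated to the data in this notebook. The enumeration over subsets is what makes the exact computation exponential.

```python
from itertools import combinations
from math import factorial

def exact_shapley(players, v):
    """Exact Shapley values by enumerating every subset S that excludes player i."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            # weight |S|! (n - |S| - 1)! / n! depends only on the subset size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for S in combinations(others, size):
                total += weight * (v(set(S) | {i}) - v(set(S)))
        phi[i] = total
    return phi

# hypothetical game: the value of a coalition is the squared sum of its members' sizes
sizes = {'a': 1, 'b': 2, 'c': 3}

def v(S):
    return sum(sizes[p] for p in S) ** 2

phi = exact_shapley(list(sizes), v)
print({k: round(val, 4) for k, val in phi.items()})  # {'a': 6.0, 'b': 12.0, 'c': 18.0}
```

The values sum to 36, which equals $v(N) - v(\emptyset)$.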

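To avoid this exponential cost, we sample random orderings of the players instead of enumerating all subsets, a compromise between the exact Shapley value and the "control-variable method". This sacrifices some precision; `random.seed()` keeps the results reproducible.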
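The sampling idea can be sketched on a toy game before applying it to the real data. The value function `v` and the helper name `sampled_shapley` below are hypothetical. Each sampled ordering adds players one at a time and credits each with its marginal gain.

```python
import random

def sampled_shapley(players, v, n_samples, seed=666):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)          # seeded for reproducibility
    phi = {p: 0.0 for p in players}
    for _ in range(n_samples):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        v_current = v(coalition)
        for p in order:
            coalition.add(p)
            v_next = v(coalition)
            phi[p] += (v_next - v_current) / n_samples
            v_current = v_next
    return phi

# hypothetical game: the value of a coalition is the squared sum of its members' sizes
sizes = {'a': 1, 'b': 2, 'c': 3}

def v(S):
    return sum(sizes[p] for p in S) ** 2

phi = sampled_shapley(list(sizes), v, n_samples=20)
total = sum(phi.values())
print(round(total, 6))  # 36.0
```

Within each ordering the marginal gains telescope to $v(N) - v(\emptyset)$, so the estimates always sum to the full change (up to floating-point rounding). Sampling error affects only how that total is split among players.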

In [47]:
# set max sample size
SAMPLE_SIZE = 20

# players: dim x variable
players = [(i, j) for i in range(len(DIM_VALS)) for j in range(len(VAR_COLS))]
phi = dict()

# sample
N = len(players)
# scipy's factorial returns a float; cast so range(SAMPLE_SIZE) also works when N is small
SAMPLE_SIZE = int(min(SAMPLE_SIZE, factorial(N)))
seq_list = list()

random.seed(666)
for _ in range(SAMPLE_SIZE):
    seq = list(range(N))
    random.shuffle(seq)
    seq_list.append(seq)

In [48]:
# reuse the same set of sequences for all players
for seq in seq_list:
    # make a copy of ctl
    df_s = df_ctl.copy()
    # current utility
    v_current = get_metrics(df_s)

    for i in range(N):
        # select player p
        p = players[seq[i]]
        # select dim and variable
        d, v = DIM_VALS[p[0]], VAR_COLS[p[1]]
        # update df_s
        df_s.loc[lambda x: x[NEW_DIM_COL]==d, v] = \
            df_trt.loc[lambda x: x[NEW_DIM_COL]==d, v].values
        # calculate marginal utility
        v_si = get_metrics(df_s)
        phi_i = v_si - v_current
        # update current utility
        v_current = v_si
        
        # add utility for player p
        if p in phi:
            phi[p] += phi_i
        else:
            phi[p] = phi_i

# divided by sample size
phi_avg = {k:1.0*v/SAMPLE_SIZE for k, v in phi.items()}

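## Presenting the results

This part has room for improvement and could be made friendlier; the improvements are straightforward. For now, the main drivers are identified by looking at the top factors.

Labels used in the code and outputs below: 维度 = dimension, 变量 = variable, 累计贡献 = cumulative contribution, 基期 = base period, 现期 = current period, 贡献 = contribution, 贡献占整体 = share of the overall change, 贡献占同向 = share of the same-sign contributions, 基期比重 = base-period weight, 整体指标 = overall metric, 变化量 = change, 主要影响因素 = main drivers.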
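The per-dimension summaries can be illustrated on toy numbers. In the sketch below, the columns `contribution`, `base_weight`, and `share_of_delta` are hypothetical English stand-ins for 贡献, 基期比重, and 贡献占整体. The `importance` ratio compares a group's share of the overall change with its base-period weight.

```python
import pandas as pd

# hypothetical per-player contributions for a single variable
df = pd.DataFrame({
    'os': ['android', 'android', 'ios'],
    'contribution': [-0.03, -0.01, -0.02],
    'base_weight': [0.5, 0.3, 0.2],   # base-period share of users
})
delta = df['contribution'].sum()

# roll player-level contributions up to one dimension
by_os = df.groupby('os')[['contribution', 'base_weight']].sum()
by_os['share_of_delta'] = by_os['contribution'] / delta
# importance > 1: the group moves the metric more than its size alone would suggest
by_os['importance'] = by_os['share_of_delta'].abs() / by_os['base_weight']
print(by_os.round(4))
```

Here ios holds only 20% of base-period users but accounts for a third of the change, so its importance is above 1, while android's is below 1.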

In [49]:
# overall metrics
metrics_ctl, metrics_trt = \
get_metrics(df_ctl), get_metrics(df_trt)
delta_metrics = metrics_trt - metrics_ctl

# standardize (because of sampling)
phi_sum = sum(phi.values())
phi_std = {k:1.0*delta_metrics*v/phi_sum for k, v in phi.items()}

# sum of positive and negative contribution
phi_sum_pos, phi_sum_neg = 0, 0

for _, v in phi_std.items():
    if v > 0:
        phi_sum_pos += v
    else:
        phi_sum_neg += v

# save contribution of each player
player_contribution = \
[{'_dim':DIM_VALS[k[0]], 
  '_var':VAR_COLS[k[1]],
  '基期': df_ctl.loc[lambda x: x[NEW_DIM_COL]==DIM_VALS[k[0]], VAR_COLS[k[1]]].values[0],
  '现期': df_trt.loc[lambda x: x[NEW_DIM_COL]==DIM_VALS[k[0]], VAR_COLS[k[1]]].values[0],
  '贡献': v,
  '贡献占整体': v/delta_metrics,
  '贡献占同向': v/phi_sum_pos if v > 0 else v/phi_sum_neg,
 } 
 for k, v in phi_std.items()]

# sort by contribution according to overall change sign
if delta_metrics > 0:
    player_contribution.sort(key=lambda x: -x['贡献'])
else:
    player_contribution.sort(key=lambda x: x['贡献'])

In [55]:
df_contribution = pd.DataFrame(player_contribution)

df_contribution_split = pd.concat(
    [
        pd.DataFrame(df_contribution.loc[:,NEW_DIM_COL].tolist(), columns=DIM_COLS), 
        df_contribution
    ], 
    axis=1,
)

BASE_VAR_COL = 'dnu_prop'

for c in DIM_COLS:
    for v in VAR_COLS:
        df_contribution_grouped = \
            df_contribution_split[lambda x: x['_var']==v].\
            groupby(c)[['贡献','贡献占整体','贡献占同向']].sum().\
            sort_values(by='贡献占整体', ascending=False)
        
        x = df_contribution_split[lambda x: x['_var']==BASE_VAR_COL]['基期'].sum()
        df_proportion_grouped = df_contribution_split[lambda x: x['_var']==BASE_VAR_COL].\
            groupby(c)['基期'].sum().rename('基期比重') / x
        
        merged = pd.merge(
            df_contribution_grouped, 
            df_proportion_grouped, 
            on=c,
        ).assign(importance=lambda x: np.abs(x['贡献占整体'])/x['基期比重'])
        
        print(f"维度: {c}, 变量: {v}, 累计贡献: {sum(merged['贡献']):.4f}")
        display(
            merged.style.\
                background_gradient(subset=pd.IndexSlice[:,['贡献占整体','基期比重','importance']]).\
                format("{:.4f}")
        )
        print("-"*50)

维度: channel, 变量: d0_dnu_to_study, 累计贡献: -0.0373


channel,贡献,贡献占整体,贡献占同向,基期比重,importance
内广,-0.0125,0.7442,0.2724,0.4259,1.7474
广点通,-0.0078,0.4612,0.1481,0.1261,3.6582
VIVO,-0.0046,0.2746,0.0966,0.1511,1.8166
OPPO,-0.0033,0.1969,0.0532,0.0464,4.2432
华为,-0.0033,0.1932,0.0833,0.123,1.5709
其他,-0.0028,0.165,0.0381,0.0138,11.9958
小米,-0.0016,0.0923,0.0414,0.0623,1.4806
Apple Store,-0.0006,0.036,0.0333,0.0334,1.0783
应用宝,-0.0005,0.0277,0.0253,0.0108,2.5587
Google-Android,-0.0002,0.0128,0.0035,0.0048,2.6414


--------------------------------------------------
维度: channel, 变量: dnu_prop, 累计贡献: 0.0204


channel,贡献,贡献占整体,贡献占同向,基期比重,importance
其他,-0.0045,0.2687,0.0558,0.0138,19.5404
Apple Store,-0.0016,0.0947,0.0331,0.0334,2.8383
内广,-0.0009,0.051,0.5026,0.4259,0.1197
粉丝通,-0.0002,0.0103,0.002,0.0013,7.8708
Google-iOS,0.0,-0.0013,0.0003,0.0011,1.218
Google-Android,0.0001,-0.0086,0.003,0.0048,1.7812
应用宝,0.0004,-0.0243,0.0184,0.0108,2.2407
小米,0.0019,-0.111,0.046,0.0623,1.7817
OPPO,0.002,-0.1186,0.0496,0.0464,2.5552
华为,0.0026,-0.1557,0.1091,0.123,1.2661


--------------------------------------------------
维度: mile_stone_name, 变量: d0_dnu_to_study, 累计贡献: -0.0373


mile_stone_name,贡献,贡献占整体,贡献占同向,基期比重,importance
A1中,-0.0294,1.7448,0.3558,0.0355,49.1026
unknown,-0.0034,0.2,0.0679,0.4824,0.4146
B2,-0.0023,0.1368,0.0272,0.0087,15.683
A2初,-0.0022,0.1288,0.0458,0.0445,2.8937
B1高,-0.0007,0.0396,0.038,0.0161,2.4639
B1初,-0.0006,0.0332,0.0491,0.0591,0.5615
C1,-0.0003,0.0152,0.003,0.0011,14.3293
A2高,-0.0002,0.0132,0.0444,0.0236,0.561
A1高,-0.0,0.0026,0.0326,0.025,0.1046
A0,-0.0,0.0013,0.0003,0.0001,11.1285


--------------------------------------------------
维度: mile_stone_name, 变量: dnu_prop, 累计贡献: 0.0204


mile_stone_name,贡献,贡献占整体,贡献占同向,基期比重,importance
A1初,-0.0079,0.4717,0.1742,0.2783,1.6949
B1初,-0.0053,0.3156,0.1186,0.0591,5.3418
unknown,-0.0023,0.1351,0.0338,0.4824,0.2801
B2,-0.0022,0.1329,0.0264,0.0087,15.2296
A2初,-0.0011,0.0642,0.0802,0.0445,1.4429
A2高,-0.0005,0.0289,0.0449,0.0236,1.2255
A2中,-0.0003,0.0203,0.0576,0.0254,0.7979
C1,-0.0002,0.0136,0.0027,0.0011,12.8031
A0,-0.0,0.0022,0.0004,0.0001,18.5475
B1中,-0.0,0.0012,0.0002,0.0001,12.8596


--------------------------------------------------
维度: os, 变量: d0_dnu_to_study, 累计贡献: -0.0373


os,贡献,贡献占整体,贡献占同向,基期比重,importance
android,-0.0357,2.1179,0.7017,0.9127,2.3205
ios,-0.0016,0.0945,0.0963,0.0873,1.0828


--------------------------------------------------
维度: os, 变量: dnu_prop, 累计贡献: 0.0204


os,贡献,贡献占整体,贡献占同向,基期比重,importance
ios,-0.0045,0.2672,0.1076,0.0873,3.0605
android,0.0249,-1.4796,1.0943,0.9127,1.6212


--------------------------------------------------
维度: role, 变量: d0_dnu_to_study, 累计贡献: -0.0373


role,贡献,贡献占整体,贡献占同向,基期比重,importance
大学生,-0.0113,0.6724,0.2059,0.1686,3.9869
上班族,-0.0079,0.4705,0.2178,0.3146,1.4954
unknown,-0.0066,0.3942,0.1336,0.2353,1.6754
中小学生,-0.0066,0.3898,0.1287,0.1605,2.4291
自由职业,-0.0048,0.2856,0.1121,0.121,2.3605


--------------------------------------------------
维度: role, 变量: dnu_prop, 累计贡献: 0.0204


role,贡献,贡献占整体,贡献占同向,基期比重,importance
unknown,-0.0047,0.2777,0.1267,0.2353,1.1802
上班族,0.0016,-0.0944,0.3681,0.3146,0.3001
自由职业,0.0034,-0.2021,0.1563,0.121,1.6703
中小学生,0.0072,-0.4264,0.1788,0.1605,2.6573
大学生,0.0129,-0.7672,0.3722,0.1686,4.549


--------------------------------------------------


In [51]:


# c = ('channel','role')
# for v in VAR_COLS:
#     df_contribution_grouped = \
#         df_contribution_split[lambda x: x['_var']==v].\
#         groupby(c)[['贡献','贡献占整体','贡献占同向']].sum().\
#         sort_values(by='贡献占整体', ascending=False)

#     x = df_contribution_split[lambda x: x['_var']==BASE_VAR_COL]['基期'].sum()
#     df_proportion_grouped = df_contribution_split[lambda x: x['_var']==BASE_VAR_COL].\
#         groupby(c)['基期'].sum().rename('基期比重') / x

#     merged = pd.merge(
#         df_contribution_grouped, 
#         df_proportion_grouped, 
#         on=c,
#     ).assign(comp=lambda x: x['贡献占整体']/x['基期比重'])

#     print(f"dim: {c}, variable: {v}")
#     display(
#         merged.style.\
#             background_gradient(subset=pd.IndexSlice[:,['贡献占整体','comp']]).\
#             format("{:.4f}")
#     )
#     print("-"*50)

In [52]:
N_REASONS = 50


def decimal_select(num):
    """
    format decimal places for common metrics
    """
    if num >= 100:
        return int(num)
    elif num >= 1:
        return round(num, 2)
    else:
        return round(num, 4)
    

print(f"整体指标 {METRIC_NAME} ({decimal_select(metrics_ctl)} => {decimal_select(metrics_trt)}), \
变化量 = {decimal_select(delta_metrics)}")
print("-"*30)
print("主要影响因素:\n")

for i in range(N_REASONS):
    if i <= len(player_contribution)-1:
        r = player_contribution[i]
        print(f"""{i+1}. {r['_dim']} 的 {r['_var']} ({decimal_select(r['基期'])} => {decimal_select(r['现期'])}).""",
        f"""贡献: {decimal_select(r['贡献'])}.""",
        f"""占{'正' if r['贡献'] > 0 else '负'}向的: {int(r['贡献占同向']*100)}%.""")

整体指标 d0_dnu_to_study (0.1894 => 0.1725), 变化量 = -0.0168
------------------------------
主要影响因素:

1. ('广点通', 'A1中', 'android', '大学生') 的 d0_dnu_to_study (0.434 => 0.1446). 贡献: -0.0042. 占负向的: 4%.
2. ('内广', 'A1中', 'android', '上班族') 的 d0_dnu_to_study (0.3182 => 0.1096). 贡献: -0.004. 占负向的: 4%.
3. ('内广', 'A1中', 'android', '自由职业') 的 d0_dnu_to_study (0.4356 => 0.0804). 贡献: -0.0035. 占负向的: 4%.
4. ('内广', 'A1初', 'android', '上班族') 的 dnu_prop (0.0713 => 0.0574). 贡献: -0.0034. 占负向的: 4%.
5. ('内广', 'A1中', 'android', '中小学生') 的 d0_dnu_to_study (0.2857 => 0.0602). 贡献: -0.0027. 占负向的: 3%.
6. ('内广', 'A1初', 'android', '自由职业') 的 dnu_prop (0.0339 => 0.0255). 贡献: -0.0023. 占负向的: 2%.
7. ('VIVO', 'A1中', 'android', '上班族') 的 d0_dnu_to_study (0.625 => 0.0523). 贡献: -0.0019. 占负向的: 2%.
8. ('VIVO', 'A1中', 'android', '中小学生') 的 d0_dnu_to_study (0.2667 => 0.0357). 贡献: -0.0017. 占负向的: 2%.
9. ('内广', 'B1初', 'android', '大学生') 的 dnu_prop (0.005 => 0.002). 贡献: -0.0016. 占负向的: 1%.
10. ('内广', 'B1初', 'android', '上班族') 的 dnu_prop (0.0121 => 

In [53]:
# contribution summary by var
df_res_var = pd.DataFrame(player_contribution).groupby('_var')['贡献'].sum().reset_index()
df_res_var

,_var,贡献
0,d0_dnu_to_study,-0.037276
1,dnu_prop,0.020428


In [54]:
# contribution summary by dim
df_res_dim = pd.merge(
    pd.merge(
    pd.DataFrame(player_contribution).groupby('_dim')['贡献'].sum().sort_values().reset_index(),
        df_ctl,
        on='_dim'
    ),
    df_trt,
    on='_dim'
)

df_res_dim

,_dim,贡献,d0_dnu_to_study_x,dnu_prop_x,d0_dnu_to_study_y,dnu_prop_y
0,"(内广, A1初, android, 上班族)",-0.003255,0.246173,0.071251,0.248950,0.057380
1,"(内广, A1初, android, 自由职业)",-0.002227,0.278175,0.033869,0.282051,0.025506
2,"(内广, B1初, android, 大学生)",-0.001599,0.536585,0.004968,0.534483,0.001996
3,"(内广, B1初, android, 上班族)",-0.001283,0.496241,0.012087,0.496377,0.009500
4,"(内广, unknown, android, unknown)",-0.001275,0.040362,0.043532,0.014909,0.032321
...,...,...,...,...,...,...
574,"(广点通, B1高, android, 大学生)",0.001083,0.467742,0.001878,0.483051,0.004062
575,"(内广, A1初, android, 中小学生)",0.001131,0.237640,0.018994,0.245877,0.022959
576,"(广点通, A2中, android, 大学生)",0.001202,0.471429,0.004241,0.500000,0.006402
577,"(内广, A1中, android, 上班族)",0.001758,0.318182,0.007998,0.109553,0.039274
