In [2]:
import numpy as np
import pandas as pd

Load the line_profiler extension, then run the `%lprun` magic command to profile the code:
```jupyter
%load_ext line_profiler
%lprun
```

In [3]:
%load_ext line_profiler

In [5]:
# test data
y = np.random.randint(2, size=(5000, 1))
x = np.random.randint(10, size=(5000, 1))
data = pd.DataFrame(np.concatenate([y, x], axis=1), columns=['y', 'x'])

In [4]:
def target_mean_v1(data, y_name, x_name):
    result = np.zeros(data.shape[0])
    for i in range(data.shape[0]):
        groupby_result = data[data.index != i].groupby([x_name], as_index=False).agg(['mean', 'count'])
        result[i] = groupby_result.loc[groupby_result.index == data.loc[i, x_name], (y_name, 'mean')]
    return result

In [7]:
%lprun -f target_mean_v1 target_mean_v1(data, 'y', 'x')

Timer unit: 1e-06 s

Total time: 33.8473 s
File: <ipython-input-4-1e10119a1d07>
Function: target_mean_v1 at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def target_mean_v1(data, y_name, x_name):
     2         1        131.0    131.0      0.0      result = np.zeros(data.shape[0])
     3      5001       5849.0      1.2      0.0      for i in range(data.shape[0]):
     4      5000   28619463.0   5723.9     84.6          groupby_result = data[data.index != i].groupby([x_name], as_index=False).agg(['mean', 'count'])
     5      5000    5221895.0   1044.4     15.4          result[i] = groupby_result.loc[groupby_result.index == data.loc[i, x_name], (y_name, 'mean')]
     6         1          0.0      0.0      0.0      return result

## target_mean_v2
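target_mean_v1 recomputes the grouped mean and count for every record, so the rest of the data is summed and counted over and over. For example, handling record 1 groups records [2, 5000] and computes their sum and count; handling record 2 then processes records [3, 5000] + {1} all over again.

How to improve:
* First group all the data once, computing the count and sum of each group.
* Then process the records one by one: find the record's group, subtract the record's own value, and take the mean.

Time complexity before the change: O(n*n); after: O(n), so it should be considerably faster.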
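As a small sanity check of the idea above, the sketch below compares a brute-force leave-one-out group mean with the `(group_sum - y) / (group_count - 1)` shortcut on a tiny hand-made array. The helper names `loo_mean_bruteforce` and `loo_mean_shortcut` are illustrative and not part of the notebook, and the sketch assumes every group contains at least two records.

```python
import numpy as np

def loo_mean_bruteforce(x, y):
    """For each record, average y over the other records with the same x."""
    n = len(x)
    out = np.zeros(n)
    for i in range(n):
        mask = (x == x[i])
        mask[i] = False              # exclude the record itself
        out[i] = y[mask].mean()
    return out

def loo_mean_shortcut(x, y):
    """Same values, from per-group sum and count that are computed only once."""
    sums, counts = {}, {}
    for xv, yv in zip(x.tolist(), y.tolist()):
        sums[xv] = sums.get(xv, 0) + yv
        counts[xv] = counts.get(xv, 0) + 1
    # Assumption: counts[xv] >= 2 for every group, so the denominator is never 0.
    return np.array([(sums[xv] - yv) / (counts[xv] - 1)
                     for xv, yv in zip(x.tolist(), y.tolist())])

x = np.array([0, 0, 0, 1, 1])
y = np.array([1, 0, 1, 1, 0])
print(loo_mean_bruteforce(x, y))
print(loo_mean_shortcut(x, y))
```

Both calls yield 0.5, 1.0, 0.5, 0.0, 1.0. The brute-force version rescans all records for each record, while the shortcut makes two linear passes.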

In [26]:
def target_mean_v2(data, y_name, x_name):
    grps = dict()
    nrow = data.shape[0]
    result = np.zeros(nrow)
    total_sum = 0
    total_count = 0
    for rx in range(nrow):
        row = data.iloc[rx]
        x_v, y_v = row[x_name], row[y_name]
        if x_v not in grps:
            grps[x_v] = [y_v, 1]
        else:
            g = grps[x_v]
            g[0] += y_v
            g[1] += 1
            
        total_sum += y_v
        total_count += 1
            
    for rx in range(nrow):
        row = data.iloc[rx]
        x_v, y_v = row[x_name], row[y_name]
        g = grps[x_v]
        if g[1] == 1:
            result[rx] = total_sum / total_count
        else:
            result[rx] = (g[0] - y_v)/(g[1] - 1)
    return result

In [29]:
%lprun -f target_mean_v2 target_mean_v2(data, 'y', 'x')

Timer unit: 1e-06 s

Total time: 2.03087 s
File: <ipython-input-26-a7b5d990ef58>
Function: target_mean_v2 at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def target_mean_v2(data, y_name, x_name):
     2         1          1.0      1.0      0.0      grps = dict()
     3         1         12.0     12.0      0.0      nrow = data.shape[0]
     4         1         17.0     17.0      0.0      result = np.zeros(nrow)
     5         1          1.0      1.0      0.0      total_sum = 0
     6         1          0.0      0.0      0.0      total_count = 0
     7      5001       2295.0      0.5      0.1      for rx in range(nrow):
     8      5000     901154.0    180.2     44.4          row = data.iloc[rx]
     9      5000     122426.0     24.5      6.0          x_v, y_v = row[x_name], row[y_name]
    10      5000       4717.0      0.9      0.2          if x_v not in grps:
    11        10          6.0      0.6      0.0     

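## target_mean_v3
target_mean_v2 is about 17 times faster than v1 (33.85 s vs. 2.03 s). The v2 profile shows that the `data.iloc` statements still need optimizing: DataFrame's native element access is very slow, and the DataFrame lookups in the two loops account for about 98% of the total time.

The DataFrame `values` attribute returns the DataFrame's underlying data as a two-dimensional `numpy.ndarray`.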
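A minimal illustration of that attribute, using a small made-up frame (`df` and `raw` are names chosen here, not taken from the notebook). Both columns are integers, so the resulting array has an integer dtype; with mixed column types pandas would fall back to a more general dtype. Recent pandas documentation recommends `DataFrame.to_numpy()` for the same purpose.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'y': [0, 1, 1], 'x': [3, 3, 7]})
raw = df.values                        # 2-D NumPy array, columns in DataFrame order
print(type(raw).__name__, raw.shape)   # ndarray (3, 2)

# Positional indexing on the array replaces the label-based DataFrame lookup:
# column 0 is y and column 1 is x.
assert raw[1][1] == df.iloc[1]['x']
assert raw[2, 0] == df.loc[2, 'y']
```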

In [35]:
def target_mean_v3(data, y_idx, x_idx):
    grps = dict()
    nrow = data.shape[0]
    raw_data = data.values
    result = np.zeros(nrow)
    total_sum = 0
    total_count = 0
    for rx in range(nrow):
        x_v, y_v = raw_data[rx][x_idx], raw_data[rx][y_idx]
        if x_v not in grps:
            grps[x_v] = [y_v, 1]
        else:
            g = grps[x_v]
            g[0] += y_v
            g[1] += 1
            
        total_sum += y_v
        total_count += 1
            
    total_mean = total_sum / total_count
    
    for rx in range(nrow):
        x_v, y_v = raw_data[rx][x_idx], raw_data[rx][y_idx]
        g = grps[x_v]
        if g[1] == 1:
            result[rx] = total_mean
        else:
            result[rx] = (g[0] - y_v)/(g[1] - 1)
    return result

In [36]:
# the y_name column is at position 0, the x_name column at position 1
%lprun -f target_mean_v3 target_mean_v3(data, 0, 1)

Timer unit: 1e-06 s

Total time: 0.050626 s
File: <ipython-input-35-307da1fd567d>
Function: target_mean_v3 at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def target_mean_v3(data, y_idx, x_idx):
     2         1          1.0      1.0      0.0      grps = dict()
     3         1         14.0     14.0      0.0      nrow = data.shape[0]
     4         1         43.0     43.0      0.1      raw_data = data.values
     5         1         14.0     14.0      0.0      result = np.zeros(nrow)
     6         1          0.0      0.0      0.0      total_sum = 0
     7         1          0.0      0.0      0.0      total_count = 0
     8      5001       2841.0      0.6      5.6      for rx in range(nrow):
     9      5000       7555.0      1.5     14.9          x_v, y_v = raw_data[rx][x_idx], raw_data[rx][y_idx]
    10      5000       6153.0      1.2     12.2          if x_v not in grps:
    11        10          6.0      0.6

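## target_mean_v4

v3 is more than 600 times faster than v1 (33 / 0.05). Judging from the profile, and without resorting to C or other techniques, there is still at least one place left to optimize: the __group lookup__ (lines 10, 13, 24), which accounts for nearly 23.4% of the time.

In fact, if the range of values in column x is known in advance, the groups can be stored in an array, with the array index serving as the group key.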
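One way this could look is sketched below. It is not taken from the notebook: the function name `target_mean_v4_sketch` and the parameters `x_min` and `x_max` are hypothetical, and the sketch assumes column x holds integers in the known range `[x_min, x_max]`. Plain Python lists are used for the per-group sum and count; whether this is actually faster than the dict in v3 has not been measured here.

```python
import numpy as np
import pandas as pd

def target_mean_v4_sketch(data, y_idx, x_idx, x_min, x_max):
    """Leave-one-out group mean with groups stored in lists indexed by x - x_min."""
    raw_data = data.values
    nrow = raw_data.shape[0]
    n_groups = x_max - x_min + 1          # assumption: x is an integer in [x_min, x_max]
    sums = [0] * n_groups
    counts = [0] * n_groups
    for rx in range(nrow):
        k = int(raw_data[rx][x_idx]) - x_min   # list index acts as the group key
        sums[k] += raw_data[rx][y_idx]
        counts[k] += 1

    total_mean = sum(sums) / nrow

    result = np.zeros(nrow)
    for rx in range(nrow):
        k = int(raw_data[rx][x_idx]) - x_min
        if counts[k] == 1:
            result[rx] = total_mean       # same fallback as v3 for single-record groups
        else:
            result[rx] = (sums[k] - raw_data[rx][y_idx]) / (counts[k] - 1)
    return result

# Small demo on made-up data: y at column position 0, x at column position 1.
rng = np.random.default_rng(0)
demo = pd.DataFrame({'y': rng.integers(2, size=200), 'x': rng.integers(10, size=200)})
demo_result = target_mean_v4_sketch(demo, 0, 1, x_min=0, x_max=9)
```

The list lookup avoids hashing the key, but it needs the value range up front and wastes space if the range is sparse.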


In [38]:
4.6 + 6.6 + 12.2

23.4


In [60]:
#%%timeit
grps1 = data.groupby('x', as_index=True).agg(['mean','count'])

In [6]:
data.

|  | y | x |
|---|---|---|
| 0 | 0 | 12 |
| 1 | 0 | 10 |
| 2 | 0 | 14 |
| 3 | 1 | 11 |
| 4 | 0 | 14 |
| ... | ... | ... |
| 4995 | 1 | 16 |
| 4996 | 0 | 10 |
| 4997 | 1 | 16 |
| 4998 | 0 | 12 |


In [61]:
grps

| x | y (mean) | y (count) |
|---|---|---|
| 10 | 0.490364 | 467 |
| 11 | 0.528846 | 520 |
| 12 | 0.458244 | 467 |
| 13 | 0.546025 | 478 |
| 14 | 0.499025 | 513 |
| 15 | 0.4714 | 507 |
| 16 | 0.532164 | 513 |
| 17 | 0.501984 | 504 |
| 18 | 0.521401 | 514 |
| 19 | 0.49323 | 517 |


